Data Science with Quix: NY Bikes

Aim

Quix allows you to create complex and efficient real-time infrastructure simply and quickly. To show you that, you are going to build an application that uses real-time New York bikes and weather data to predict the future availability of bikes in New York.

In other words, you will complete all the typical phases of a data science project by yourself:

  • build pipelines to gather bikes and weather data in real time

  • store the data efficiently

  • train some ML models with historic data

  • deploy the ML models into production in real time

This would typically take several people (Data Engineers, Data Scientists) weeks of work; however, you will complete this tutorial in under 90 minutes using Quix.

NY bikes graph 1

This guide will take you through the steps to perform CityBikes and OpenWeather API requests using Quix, including getting an OpenWeather account.

Prerequisites

We assume that all you have is a Quix account that you haven’t started using yet.

If you don’t have a Quix account yet, go here and create one.

Overview

This walkthrough covers the following steps:

  1. Create OpenWeather account (third party)

  2. Create a real-time bikes data stream

  3. Create a real-time weather forecast data stream

  4. Store streamed data with Quix

  5. Query the stored data to train the ML models

  6. Deploy the ML models and produce predictions in real time

We try to provide clear instructions on each step, but don’t hesitate to contact us through our Discord Community if you have any doubts.

1. Create OpenWeather account

NY bikes graph 2

OpenWeather is a team of IT experts and data scientists that provides historical, current and forecasted weather data via light-speed APIs.

Let’s create a (free) OpenWeather account:

  1. Go to the OpenWeather Sign Up page.

  2. Click the "Sign Up" button and complete the dialog. Do the email and text message verifications.

  3. Then, go to the OpenWeather API keys page to copy your key. Keep it safe for later.

2. Bikes real time stream

We will start by getting the real-time bikes stream. We will use CityBikes (which doesn't require an API key) to get real-time bikes data.

Go to the Quix Portal and log in.

Create a Workspace

Table 1. Creating a workspace

NY bikes graph 3

A workspace is a collection of all the components for a single application. Find more info here.

new workspace

First, click the + sign to create a workspace.

screenshots 1

Use a name such as "NY Bikes".

Select the Standard storage class; it will be enough for our case.

Click create.

This may take a minute, since a dedicated, fully featured and highly scalable data platform is being created for you.

Once it is ready, open the workspace.

Create a Topic

Once the workspace has been created, we will generate a topic.

Table 2. Creating a topic

NY bikes graph 4

Topics are streams of data or parameters built on Kafka. Find more info here.

screenshots 2

Since this Workspace is new, you will see some help to get started. You can find tutorials on the right, and examples of topic creation on the left.

We will create the topic from scratch, so ensure you click “+ Create topic” on the left.

screenshots 3

Use a name such as "ny-real-time-bikes-topic".

Click create.

screenshots 4

screenshots 5

Go to the Topics page.

You can see all the topics in your workspace here.

To find the topic details, click the >> icon on the left of our ny-real-time-bikes-topic row.

The topic ID and broker settings are needed to write data to, and read data from, this specific topic with code. More on that later.

Create Bikes stream

Streams are the traffic lanes that run on your road infrastructure (the topic). If you expect a lot of traffic, you build several streams to distribute the data in parallel and avoid delays, like on a highway. In this case, a single stream will do the trick to carry the bikes data.

Table 3. Creating a code project

NY bikes graph 5

Streams are created with code, so let’s create our first code project.

A project is a set of code which can be deployed as one Docker image. Find more info here.

screenshots 34

We can start a code project from scratch or use some of the sample code templates in the Quix library.

  • Go to the Library page.

  • The purpose of our code is to write data into a stream, so ensure you select the "Write" tab at the top.

  • Select the "New York CitiBike API" code sample.

  • Select the topic that you want the code to write into, i.e., "ny-real-time-bikes-topic".

Check the code that has been generated for you. It already contains the proper client IDs and topic ID (which you saw earlier on the Topics page) filled in for you.

screenshots 35

To save this code as your code project click "<> SAVE AS PROJECT".

Name it "NY-real-time-bikes" and click create.

NY-real-time-bikes Code

Check the NY-real-time-bikes code project that has been created for you. Remember, the objective of this code is to create a stream of bikes data. We do that by periodically performing CityBikes API requests and writing the response data into our ny-real-time-bikes-topic Quix topic.

Check the following tabs if you want to learn more about how the code works:

  • OPTIONAL code project concepts

  • OPTIONAL OpenWeather API request function concepts

  1. Your main code is placed in your project’s main.py, which is the file that the deployment will invoke.

  2. The certificatepath, username, password and broker address in your main.py are those of the topic we are writing to.

  3. You can easily resolve dependencies in the requirements.txt file. In our example, Python needs the Quix streaming library (our efficient and fast SDK) and the "requests" library.

To keep the code simpler, we have created a function that contains all the code needed to perform the API request to Citybikes and to summarize all the NY bike points data into a single aggregated row.

  1. This function is in the nyc_bikes_api.py file, so open it.

  2. Copy the code into your local IDE (Jupyter Notebook, PyCharm, etc.) and execute it. The result is the current number of available bikes in New York.

  3. Your main.py file uses this function repeatedly. Check how it is imported with:

from nyc_bikes_api import get_agg_data
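For a feel of what such a function might look like, here is a minimal, hypothetical sketch. The endpoint is CityBikes' public feed for the citi-bike-nyc network, but the helper names and the summed fields are assumptions, not the exact contents of nyc_bikes_api.py:

```python
import requests  # listed in requirements.txt

CITYBIKES_URL = "http://api.citybik.es/v2/networks/citi-bike-nyc"

def aggregate_stations(stations):
    """Summarize the per-station records into a single city-wide row."""
    return {
        "free_bikes": sum(s.get("free_bikes") or 0 for s in stations),
        "empty_slots": sum(s.get("empty_slots") or 0 for s in stations),
    }

def get_agg_data():
    """Fetch the current NY station list and aggregate it into one row."""
    resp = requests.get(CITYBIKES_URL, timeout=10)
    resp.raise_for_status()
    return aggregate_stations(resp.json()["network"]["stations"])
```

Running `get_agg_data()` locally should return a single dict with the city-wide bike availability at that moment.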

Deploy NY-real-time-bikes code

With the NY-real-time-bikes code ready, it is time to deploy it! Find more info on deployments here.

Table 4. Deploy a code project

screenshots 36

Have a look at the right-hand section of your screen. This is your commit history.

Your code is all committed to a Git repository. Find more info here.

screenshots 8

The top row is the most recent commit.

Click on the three dots next to it and then on "Create tag".

Give the tag a version name such as '1.01'.

Click Create.

deploy button 2

With our work properly saved, we can now deploy the project. Click the Deploy button.

screenshots 9

Configure the deployment

In the General tab, select the Version Tag you created earlier. This is how simply you choose which code to deploy.

Also change the Deployment Type to Service.

You can name it Real-Time-Bikes-Deployment.

DEPLOY

Hit it!

Just click Deploy and the code will be built, deployed and the service will start running.

screenshots 10

It’s alive!

You should see the status change to Running.

screenshots 11

screenshots 12

Checking your deployment

Hover your mouse over the row to find some contextual options. You can view the logs and deployment logs, download the docker image and delete the deployment.

Open the logs to check the number of available bikes in New York!

3. Weather real time stream

OK, we now have a working real-time stream of bike data. Let's use the OpenWeather account to create a real-time weather stream. The procedure should feel familiar by now:

Create a Topic

Table 5. Creating weather topic

NY bikes graph 7

Remember, topics are streams of data or parameters built on Kafka.

screenshots 26

screenshots 25

Go to the Topics Page and click “+ CREATE TOPIC”.

Use a name such as "ny-real-time-weather-topic".

Click create.

Create Weather stream

Just as we created the real-time bikes stream, let's now create the weather forecast stream, for which we will need a new code project.

Table 6. Creating a code project

NY bikes graph 6

Remember that we need to use code to generate streams, so let’s create our second code project.

A project is a set of code which can be deployed as one Docker image. Find more info here.

screenshots 37

Again, we’ll use the Quix library to generate the code.

  • Go to the Library page.

  • The purpose of our code is to write data into a stream, so ensure you select the "Write" tab at the top.

  • Select the "New York OpenWeather API" code sample.

  • Select the topic that you want the code to write into, i.e., "ny-real-time-weather-topic".

Check the code that has been generated for you. It contains the proper client IDs and topic ID.

screenshots 35

To save this code as your code project click "<> SAVE AS PROJECT".

Name it "NY-real-time-weather" and click create.

NY-real-time-weather Code

Check the NY-real-time-weather code project that has been created for you. Again, the objective of this code is to create a stream of weather forecast data by periodically (once every 30 minutes) performing OpenWeather API requests and saving that data into our ny-real-time-weather-topic Quix topic.

Check the following tabs to complete the necessary code modifications and to learn more about how the code works:

  • REQUIRED code modifications

  • OPTIONAL code project concepts

  • OPTIONAL OpenWeather API request function concepts

  • Find the "{placeholder:openweatherkey}" placeholder on line 18.

  • Replace it with your OpenWeather key (you can find it here) and save the changes.

  1. Your main code is placed in your project’s main.py, which is the file that the deployment will invoke.

  2. The certificatepath, username, password and broker address in your main.py are those of the topic we are writing to.

  3. Remember that it is easy to resolve dependencies by editing the requirements.txt file. We are using the same libraries as in our bikes code project.

Just as for your bikes stream, we have created some functions that contain all the code needed to perform the API request to OpenWeather and return the response into a dataframe.

  1. These functions are in the nyc_weather_API.py file, so open it.

  2. Copy the code into your local IDE (Jupyter Notebook, PyCharm, etc.) and execute it. The result of get_current_weather is the current weather forecast; the result of get_tomorrow_weather is tomorrow's weather forecast.

  3. Your main.py file uses these functions repeatedly. Check how they are imported with:

from nyc_weather_API import perform_API_request, get_current_weather, get_tomorrow_weather
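As a rough, hypothetical sketch of what these helpers involve: the endpoint below follows OpenWeather's public 5-day/3-hour forecast API, but the function names, parameters and parsing here are illustrative assumptions, not the generated project's exact code:

```python
import requests  # listed in requirements.txt

def perform_api_request(api_key, lat=40.71, lon=-74.01):
    """Call OpenWeather's 5-day/3-hour forecast endpoint for New York."""
    resp = requests.get(
        "https://api.openweathermap.org/data/2.5/forecast",
        params={"lat": lat, "lon": lon, "units": "metric", "appid": api_key},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

def extract_forecast(payload, index):
    """Pull a few fields from one 3-hour slot of the forecast response."""
    slot = payload["list"][index]
    return {
        "timestamp": slot["dt"],
        "temp": slot["main"]["temp"],
        "feels_like": slot["main"]["feels_like"],
    }
```

With `index=0` you would get the nearest forecast slot; slots roughly 24 hours ahead would approximate tomorrow's weather.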

Deploy NY-real-time-weather code

Now the NY-real-time-weather code project can be deployed (more on deployments here).

Table 7. Deploy a code project

screenshots 38

Remember to create a tag name in the commit history menu (click on the three dots and then on "Create tag").

You will be able to point the deployment to these tags.

deploy button 2

With our work properly saved, we can now deploy the project. Click the Deploy button.

screenshots 27

In the General tab, select the Version Tag you have just created.

Set the Deployment Type to Service.

Name the deployment "Real-Time-Weather-Deployment".

Finally, click Deploy and the code will be built, deployed and the service will start running.

Once deployed, open the logs to check the current and tomorrow’s forecast in New York!

4. Store data efficiently

Persist data

Let’s see how easy it is to store the data we are streaming.

Table 8. Persisting data

NY bikes graph 8

Data persistence allows you to store the data you stream. Find more info here.

screenshots 18

  • To store the bikes data: Go to the bikes topic in the topics page and set the persistence to on.

  • To store the weather forecasts data: Go to the weather topic in the topics page and set the persistence to on.

That’s it! Quix allows you to store your streamed data this simply.

Visualize persisted data

You can also use Quix to visualize your persisted data in a powerful and flexible way:

Table 9. Visualizing persisted data

screenshots 30

Go to the Data explorer page.

Select the stream you want to visualize (e.g., the New York Total Bikes Real Time stream).

Then, select the parameters.

screenshots 31

Explore your bikes and weather data with the different visualization options.

Recap - What did you just do (so far)!

Let’s recap a bit and see what we have done so far!

Table 10. Summary up to now

summary 01

You created a Workspace to host all the code and infrastructure that we need for our New York bikes work.

A workspace is a collection of all the components for a single application. Find more info here.

summary 02

You created 2 topics: one to stream all the NY bikes real time data and one to stream the weather real time data.

Topics are streams of data or parameters built on Kafka. Find more info here.

summary 03

You created one stream per topic.

When would you need more streams per topic? When you need to parallelize the data traffic to decrease latency. We could, for instance, have created a stream per bike station, or even a stream per bike, instead of a single stream for all the bikes data.

Find more info about streams here.

summary 04

You chose to persist the streamed data in order to accumulate historic weather forecast and bikes data.

Data persistence allows you to store the data you stream. Find more info here.

Perfect! Now that we are accumulating historic bikes and weather data we are ready to train some machine learning models.

5. Train the ML models

Quix gives you the freedom to train ML models any way you want. If you have historic data stored in your current infrastructure outside Quix, you can train your models there, or locally, and then deploy them in Quix.

For this example, let's go ahead and train the models with the historic data we are accumulating in Quix.

Table 11. Query persisted (historic) data

NY bikes graph 10

Let’s see how to query your Quix persisted historic data from your local IDE of choice.

screenshots 19

Go to the Data explorer page.

Select the bikes real time stream and the total number of bikes parameter.

This is a good way to visualize the data you are persisting.

screenshots 20

Optionally, test that you can download the bikes and weather data by pasting the respective code into the following notebooks:

Select the "<> CODE" tab.

What you see is a Python query that you can copy (use the button at the top right-hand side).

Paste this code into your local IDE and execute it.

You can download both the bikes and weather persisted data by repeating these steps in notebooks 1 and 2. This is optional, since your historic data will probably consist of only a few recent rows, and we provide sample CSV data spanning several weeks for your next steps.

Table 12. Train ML models with the historic data
  • Download this folder locally.

  • The sample-data subfolder contains .csv files with weeks of historic bikes and weather data. Use these to create your training matrix with the Notebook 03.

  • 03 - Creating Training Matrix.ipynb assists you in creating the training matrix.

Once you have historic bikes and weather data, you can now create a matrix with the targets and variables that the ML models will need.

Don't worry if your historic data consists of only a few recent rows; we provide sample CSV data spanning several weeks.
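As a hedged sketch of what building the training matrix involves: join each bikes observation with the most recent weather row and add the future bike count as the target. The column names, the merge strategy and the horizon here are illustrative assumptions; notebook 03 is the reference:

```python
import pandas as pd

def build_training_matrix(bikes: pd.DataFrame, weather: pd.DataFrame,
                          horizon: int) -> pd.DataFrame:
    """Join bikes and weather rows on timestamp, then add the target:
    the bike count `horizon` rows ahead of each observation."""
    df = pd.merge_asof(
        bikes.sort_values("timestamp"),
        weather.sort_values("timestamp"),
        on="timestamp",  # each bikes row gets the latest weather at or before it
    )
    df["target"] = df["n_bikes"].shift(-horizon)
    return df.dropna()  # drop trailing rows that have no future target yet
```

A 1-hour-ahead model and a 1-day-ahead model would simply use different `horizon` values over the same joined data.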

Use the following notebook to train your models:

Now, let’s train the ML models using the training matrix.

Check our proposed approach in notebook 04.
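The notebook trains xgboost models; as a minimal stand-in, this sketch uses scikit-learn (also in requirements.txt) to show the shape of the workflow. The hyperparameters and column names are illustrative, not the notebook's actual choices:

```python
from sklearn.ensemble import GradientBoostingRegressor

def train_model(df, features, target="target"):
    """Fit a boosted-trees regressor on the training matrix and keep the
    feature order next to the model, as the deployment step expects."""
    model = GradientBoostingRegressor(n_estimators=50, random_state=0)
    model.fit(df[features], df[target])
    return {"features": features, "model": model}
```

Keeping the feature list bundled with the model avoids column-order mistakes when the deployed code rebuilds the prediction matrix later.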

6. Deploy the ML models

Now that you have trained some ML models in your local IDE, let’s put them into production!

Create Predictions topic

Let’s start by creating a new topic where we will write our predictions to.

Table 13. Creating a topic

NY bikes graph 9

Remember that topics are streams of data or parameters built on Kafka.

screenshots 22

So, go to the Topics page using the menu on the left.

Then, click the "+ CREATE TOPIC" button to create a new topic.

Call it NY-bikes-prediction.

screenshots 18

We will want to store our predictions, so set the persistence to on.

Create code project

We are now going to create a third code project. This code repository will have three functionalities:

Table 14. Creating a project

NY bikes graph 12

  1. Read data from the ny-real-time-bikes-topic and the ny-real-time-weather-topic.

  2. Store ML models with version control and use them to generate real time predictions.

  3. Write those real time predictions into your recently created NY-bikes-prediction topic through different streams (one per model).

screenshots 39

Again, we’ll use the Quix library to generate the code.

  • Go to the Library page.

  • The purpose of our code is to read the input data from the bikes and weather streams and perform predictions with our model, which will then be written into a stream. So, ensure you select the "Model" tab at the top.

  • Select the "New York Bike Predictions" code sample.

  • Select the topic that you want the code to read from (we need to read both from the bikes and weather stream, but select "ny-real-time-bikes-topic").

  • Select the "ny-bikes-prediction" as output topic (topic to write to).

Check the code that has been generated for you.

screenshots 35

To save this code as your code project click "<> SAVE AS PROJECT".

Name it "NY-real-time-predictions" and click create.

NY-real-time-predictions code project

The NY-real-time-predictions code project is now created for you.

Check the following tabs to complete the necessary code modifications and to learn more about how the code works:

  • REQUIRED code modifications

  • OPTIONAL code project concepts

  • OPTIONAL predicting function concepts

  • Find the "{placeholder:inputTopic_Weather}" placeholder on line 16.

  • Replace it with your weather topic ID (you can find it on the Topics page: click the >> icon on the left of your ny-real-time-weather-topic row and copy the topic ID).

  • Save the code changes.

  1. The file main.py will be invoked by the deployment.

  2. Remember that it is easy to resolve dependencies by editing the requirements.txt file. The models we trained use xgboost (as an example of a popular library), and we use the very popular sklearn too, so requirements.txt includes both.

This code project is doing a lot, so take a minute to check how each of the three purposes gets solved.

Table 15. Aim of the code project

NY bikes graph 14

  1. Listen to the two input topics: the ny-real-time-bikes-topic and the ny-real-time-weather-topic. This means the code will get new data from these topics whenever some arrives.

  2. Perform ML predictions with the uploaded models and the new incoming data.

  3. Write those predictions into new streams, which will be part of the predictions topic.

Just as for your bikes and weather streams, we have created some functions that contain all the code needed to perform predictions.

  1. These functions are in the model_functions.py file, so open it.

  2. It contains functions to import the models, build the predicting matrix, write the predictions into new streams, etc.

  3. Check that you are importing the needed functions from main.py:

from model_functions import get_saved_models, predict_bikes_availability_and_write_into_streams
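Stripped of the Quix SDK wiring, the listen-predict-write logic can be sketched as a small state machine. The class, method and column names here are invented for illustration; the real implementation lives in main.py and model_functions.py:

```python
class PredictionService:
    """Hold the latest weather row and emit a prediction for every new
    bikes row. In the real project, on_weather/on_bikes would be wired
    to the two input topics, and the returned prediction would be
    written into a stream on the predictions topic."""

    def __init__(self, model, features):
        self.model = model
        self.features = features
        self.latest_weather = None

    def on_weather(self, row):
        self.latest_weather = row

    def on_bikes(self, row):
        if self.latest_weather is None:
            return None  # no forecast received yet; nothing to predict on
        merged = {**row, **self.latest_weather}
        X = [[merged[f] for f in self.features]]  # respect feature order
        return self.model.predict(X)[0]
```

This also shows why the weather refresh matters: until the first weather row arrives, the service cannot produce a prediction.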

Save ML Models with version control

We are now going to upload our previously trained models into the recently created code repository. This is a good place for them to live: version controlled by Git and kept together with the code that will use them.

Table 16. Loading your models

NY bikes graph 13

You can load files (our ML models in this case) into one of your Quix repositories.

This is just one of our API functionalities. Find more here.

screenshots 23

You will need a token to perform this API request.

At the upper right-hand corner of the screen, click your user icon and then Tokens.

Then click "GENERATE TOKEN". Note how you can define the expiration date.

Use the following notebook to load your models:

Go to your notebook 05 (or the local IDE of choice where you trained your models) and load the models into the code project using the save_ML_model function.

Note how we use the function to upload a dictionary containing a list of the feature variable names together with the model object (not just the model object).
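A hypothetical illustration of why that dictionary matters: bundling the feature order with the model means the predicting code can rebuild its input matrix in exactly the right column order. The function names here are invented; save_ML_model is the actual helper:

```python
import pickle

def pack_model(model, feature_names):
    """Serialize the model together with the feature order it expects."""
    return pickle.dumps({"features": feature_names, "model": model})

def unpack_model(blob):
    """Recover both pieces on the deployment side."""
    saved = pickle.loads(blob)
    return saved["features"], saved["model"]
```

The deployed code then reads `features` back before every prediction, so a model retrained with different columns cannot silently receive mis-ordered input.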

screenshots 24

Go to your NY-real-time-predictions project and check that the ML models have been updated.

Great!

Git will now version-control the models, which will be handy in the future when we want to update them.

Deploy NY-real-time-predictions code

Now the NY-real-time-predictions code project can be deployed.

By now, you will probably know how to do it ;)

Table 17. Deploy a code project

deploy button 2

With our work properly saved, click the Deploy button.

screenshots 40

In the General tab:

Select the last Version Tag.

Set the Deployment Type to Service.

Name the deployment "NY-real-time-bikes-PREDICTIONS".

This time, increase the CPU millicores to 500 to make sure the model predictions run smoothly.

Finally, click Deploy and the bike predictions code will be built, deployed and the service will start running.

screenshots 28

Check the logs

The deployment will only have its bikes and weather variables populated once new bikes/weather data are streamed through each respective topic.

In the case of the bikes, this happens every few seconds.

However, the weather forecast is only refreshed every 30 minutes in this example. If you want to check that your models are predicting correctly, stop and then restart the weather deployment to force it to fetch a new forecast.

Reflect on what you have just achieved by yourself.

You have just built an application that is predicting the number of available bikes in New York, live, in real time!

Let's see how well it is doing!

Visualize your predictions

As introduced earlier in the tutorial, you can use the Data explorer page to check your persisted data. In this case, we will check our persisted prediction streams.

Table 18. Visualizing your predictions

screenshots 32

Go to the Visualize page.

Select the streams you want to visualize (e.g., Prediction ML Model 1-day ahead forecast, Prediction ML Model 1-hour ahead forecast, Number of available bikes in NY with local NY timestamp).

Then, select the parameters (forecast_1d, forecast_1h, real_n_bikes).

screenshots 33

Don't worry if you have just deployed your models and the line graphs only cover a few minutes. You will be able to check how the models perform as time passes.

Check how the 1-day-ahead model has already created predictions for the next 24 hours and the 1-hour-ahead model for the next 60 minutes.

Recap - What did you just do!

You have just completed all the steps in a typical streaming data science project. As a result, you have two models predicting in real time how many bikes there will be in New York in the near and medium future.

Table 19. Recap

summary 03

Build pipelines to gather bikes and weather data in real time.

You did this by:

  • Creating the topics using the Quix interface.

  • Creating the streams (in this case just one per topic) using code.

summary 04

Store the data efficiently.

You did this by:

  • Activating topic persistence (just one click).

summary 05

Train ML models with historic data.

You did this by:

  • Querying the Quix persisted data into your local IDE.

  • Using that historic data to train your models locally.

summary 06

Deploy the ML models into production in real time.

You did this by:

  • Pushing your ML models into a Quix repository.

  • Reading the bikes and weather data in real time.

  • Writing the predictions into a new topic through different streams.

What’s Next

What else can you use Quix for?

You can stream any kind of data into the Quix platform: from other apps, websites and services; from hardware, machinery or wearables; from anything you can think of that outputs data. You can then process that data in any imaginable way.

See what you can do with it and please share your experiences with us. In fact, share it with your colleagues, friends and family. Blog, tweet and post about it and tell anyone who will listen about the possibilities.

If you run into trouble please reach out through our Discord Community. We’ll be more than happy to help.