Implementing stream processing: My experience using Python libraries

by Aleš Saska
| 27 Sep, 2021

Apache Spark on Databricks vs DIY Apache Flink vs Quix.ai


When I set out to answer the question of which data stream processing system is best, it seemed like a simple experiment: take a randomly generated string of lorem ipsum text and calculate various metrics over a sliding window of time. But this simple experiment quickly revealed the complexities associated with working with streaming data.

TL;DR: the system you choose will have a big impact on ease of use, performance and cost.

This blog shares my experience with the pros and cons of each platform — Apache Spark on Databricks, a DIY Apache Flink install, and Quix — so you can make the best decision for your application. My colleague Tomas Neubauer (Quix CTO) details our results in full in his blog post.

I’m also eager to hear about your experiences. Jump on our Discord to chat with me and my colleagues.


Stream processing with Apache Spark

In Databricks it’s easy to configure Kafka as a streaming source, and you can simply create a table from plain text messages. But despite its large user base, Spark’s documentation doesn’t provide enough information for troubleshooting. As a result, developers must frequently rely on community support via Stack Overflow and other forums.
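For reference, here's a minimal sketch of what that source configuration looks like in PySpark Structured Streaming; the broker address and topic name are placeholders, not the values from our experiment:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("lorem-stream").getOrCreate()

# Read the Kafka topic as a streaming DataFrame (address and topic are placeholders).
stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "lorem-ipsum")
    .load()
)

# Kafka delivers values as bytes; cast to string to get the plain-text column.
messages = stream.select(col("value").cast("string").alias("text"))
```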

In particular, the Databricks platform file storage is poorly documented. The Kafka source needs a file in the checkpoint location to store its internal state, but the file storage directory (/FileStorage) is hidden from the user and undocumented. Since this directory is not accessible from the UI, the only way to reset a Kafka checkpoint is to create a new checkpoint file and leave the old one on the system. Erasing the Spark state is done, illogically, via the SQL RESET; command.
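To illustrate, the checkpoint location is set per streaming query when it starts; a sketch continuing from the source above, with placeholder paths:

```python
# A file sink requires a checkpoint directory; to "reset" the checkpoint we
# had to point the query at a fresh path and abandon the old one.
query = (
    messages.writeStream
    .format("parquet")
    .option("path", "/FileStorage/output/run-2")                  # placeholder
    .option("checkpointLocation", "/FileStorage/checkpoints/run-2")  # placeholder
    .start()
)
```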

Spark DataFrame functionality is limited when working on streaming data, which, combined with the insufficient documentation, makes development difficult. There is no straightforward way for a developer to debug the data being processed.
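One partial workaround is Spark's console sink, which prints each micro-batch to stdout, though it's only workable at tiny volumes. A sketch, again assuming the messages DataFrame from above:

```python
# Print each micro-batch to stdout: crude, but a quick way to inspect
# in-flight data during development.
debug_query = (
    messages.writeStream
    .format("console")
    .option("truncate", "false")
    .start()
)
```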

Spark’s CPU utilization averaged 14%, but the system was not able to process more than 133,000 messages per second — a fraction of what Flink and Quix were capable of handling. This is a problem, especially when working on Databricks where I had to provision a whole cluster for a month.

It cost nearly $1,000 to build, set up, run, and optimize this experiment over the month, so it’s disappointing to know that effectively only 14% of that cost was spent wisely.


Stream processing with Apache Flink

Recognized as the industry standard, Apache Flink is a scalable data stream processing library with a mature codebase and user base. Its Table API is easy to use and understand; it provides many types of rolling windows out of the box; and it scales nodes adequately to process a high volume of data.
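As an illustration, a sliding (HOP) window spanning one minute and advancing every five seconds looks roughly like this in the Table API's SQL dialect; the table and column names are placeholders:

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# Create a streaming TableEnvironment (recent PyFlink API).
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Count messages per sliding window: 1-minute windows sliding every 5 seconds.
result = t_env.sql_query("""
    SELECT
        HOP_END(ts, INTERVAL '5' SECOND, INTERVAL '1' MINUTE) AS window_end,
        COUNT(*) AS message_count
    FROM messages
    GROUP BY HOP(ts, INTERVAL '5' SECOND, INTERVAL '1' MINUTE)
""")
```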

However, the Table API is not very flexible, as it’s really a SQL wrapper that is compiled to the Java runtime. This means that as a developer I can’t implement my own classes and methods or use external libraries in my project.

There is a pure Python API, but it is incomplete, and the associated DataStream features are unsupported and undocumented. Additionally, there is no straightforward way for a developer to debug the data being processed. The Flink Kafka connector requires you to include the JAR manually, and you can’t load plain text into a single-column table: you must create a dedicated generator just for Flink (we used JSON).
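A sketch of both points, building on the t_env from the previous snippet: registering the connector JAR by hand, and declaring a JSON-formatted Kafka source. Paths, topic, and schema here are placeholders:

```python
# The Kafka connector JAR must be added to the pipeline manually.
t_env.get_config().get_configuration().set_string(
    "pipeline.jars",
    "file:///opt/flink/lib/flink-sql-connector-kafka.jar",
)

# Plain text can't be loaded into a single-column table directly, so the
# generator emits JSON that this source table can parse.
t_env.execute_sql("""
    CREATE TABLE messages (
        text STRING,
        ts TIMESTAMP(3),
        WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'lorem-ipsum',
        'properties.bootstrap.servers' = 'broker:9092',
        'properties.group.id' = 'flink-test',
        'scan.startup.mode' = 'earliest-offset',
        'format' = 'json'
    )
""")
```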

Setting up a Flink cluster isn’t easy. Flink’s self-managed nature means you need to know how to set up the servers yourself. In our case, we created Docker images that support executing Flink Python code and deployed them to Azure VMs. The default image from Docker Hub doesn’t support Python, which must be installed into the container manually.


Stream processing with Quix Streams

As the newest data stream processing platform, Quix sets itself apart by solving many of the problems that plague Spark and Flink, while introducing new techniques that provide greater performance and efficiency.

Quix tightly couples compute with the broker and database, providing a platform that’s easy to use and set up while also delivering very high performance. It took me less than a day to run the tests on Quix, compared to multiple weeks to run them properly on Flink and Spark.


Quix offers many out-of-the-box capabilities. For example, the visualize feature lets developers see incoming data from the message broker and quickly debug running code. This, together with a built-in IDE with Git versioning, ensures that you can iterate and develop very quickly.

The Quix Streams client library is simple to use, supports any Python code, offers very high performance, and lets you stream and process data in various formats (raw binary, numeric, and string) and types (events and parameters, i.e., time-series scalar values such as XYZ position).
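For a flavor of the API, here's a minimal sketch using the open-source quixstreams package (the library has evolved since this experiment; the topic names and word-count metric are hypothetical):

```python
from quixstreams import Application

# Connect to the broker and declare input/output topics (names are placeholders).
app = Application(broker_address="localhost:9092", consumer_group="lorem-demo")
input_topic = app.topic("lorem-ipsum", value_deserializer="json")
output_topic = app.topic("word-counts", value_serializer="json")

# Build a streaming dataframe: one row per incoming message.
sdf = app.dataframe(input_topic)
sdf = sdf.apply(lambda msg: {"words": len(msg["text"].split())})
sdf = sdf.to_topic(output_topic)

if __name__ == "__main__":
    app.run()
```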

Quix’s documentation is also the easiest to use and includes many ready-to-run examples. Support for pandas DataFrames makes development easy because the in-memory data handling is done for you.

The Python implementation is a dream for data scientists. It provides a simple API to access data, either in batched chunks to improve performance, or immediately as data arrives from the broker to achieve the lowest latency.

Quix’s limitations include some elements that Flink has already established: an out-of-the-box rolling window; auto-scaling for node parallelism (currently manual, via a slider in the UI); shared state (which must be created manually by adding topics); and internal state management (implemented manually using a feature in the client library, as sketched below).
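For example, keeping a running total across messages means opting a transformation into state explicitly; a sketch of the stateful-processing feature, continuing the dataframe from the earlier snippet (the running word count is a hypothetical metric):

```python
# Maintain a running word count in the library's key-value state store.
def count_words(msg, state):
    total = state.get("total_words", 0) + len(msg["text"].split())
    state.set("total_words", total)
    return {"total_words": total}

# stateful=True tells the library to inject a State object per message key.
sdf = sdf.apply(count_words, stateful=True)
```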

These limitations are all being addressed on Quix’s roadmap, and for now the performance, ease of use, Python API, and managed infrastructure more than make up for them.


Which stream processing platform is best?

I’m a big fan of trying things before you buy them. That’s why this seemingly simple experiment helped me dig deep into the pros and cons of each system. It’s clear from the performance results that Apache Spark is a library that just can’t handle the demands of real-time data stream processing, while Databricks is expensive and difficult to use for stream processing applications.

Both Quix and Flink have their advantages, depending on your use case, but Flink is really hard to use and is a 100% DIY solution. Taking efficiency (read: cost savings) and ease of setup and troubleshooting into account, I believe Quix is the more compelling choice.

Developers waste hours troubleshooting and debugging when they could be writing fresh new code. The cost of their time, and the opportunity cost of bringing new products to market, is often underestimated. Quix removes all of this pain, allowing developers to focus on their code and data from day one.

If you’d like to read more about my experiment and the performance results, see our full breakdown. Also on the blog, my CEO gives a very, very detailed comparison of each of these client libraries. You can also visit my GitHub repository to see the attached codebase. If you’d like to try the Quix platform free, here you can get immediate access.

by Aleš Saska

Aleš Saska is a Software Engineer at Quix, focusing on application performance for both frontend and backend. He previously oversaw the engineering side of the CellCollective project at the University of Nebraska-Lincoln. Aleš is the author of one research paper and the sole creator of the frontend for mySCADA, an innovator in industry automation.