Get ready to rethink your core, tables and processing
It’s been 130 years since Herman Hollerith tabulated the US Census electronically, pioneering a machine that processed data in batches on punched cards. Given the evolution of other key technologies since then — the assembly line, the internet, and cloud computing to name a few — it’s fitting that batch data processing also evolves.
Enter streaming data processing. The demand for instant data analytics, rather than waiting for data to be processed in batches, comes from dozens of industries and applications ranging from financial services to automotive to IoT.
In this article, I share how streaming data requires a significant paradigm shift from the habits we developed over five generations of handling data in batches. But first, let’s answer some basic questions…
What is batch data processing?
Batch data processing is the processing of large volumes of data collected over a period of time — minutes, hours, and/or days — in groups, called batches. It usually runs automatically without the need for human interaction, at a scheduled time or as the need arises.
Batch data processing usually undergoes a three-stage process. This includes data gathering and input, data processing, and data output as information. Put another way, batch data processing entails data which has been collected and grouped, then processed via a program, with results output in a sequential order.
To ensure seamless batch data processing, each aspect of the three methods — input, process, and output — requires different programs.
The problem with batch data processing
No doubt one of the biggest advantages of batch data processing is the processing of large volumes of data. For modern businesses, however, access to real time information is vital to competitive decision making.
“Access to real time information is vital to competitive decision making.”
Batch data processing is most suitable for data that doesn’t need to be processed immediately, such as payroll or sales records. However, there are some problems associated with using batch data processing for businesses. These include:
- Cost: Batch data processing systems are capital intensive because setting up the software program, hardware infrastructure, and deployment of the batched data processing system are all costly.
- Expertise: Setting up a batch data processing system is complex. Knowledgeable developers are both expensive and rare, but necessary to a well-functioning system.
Additionally, when there are errors in processing, debugging is time consuming.
- Speed: Effective decision making can suffer from the time lag between when data is created, and when it is processed in a batch and results are returned to the business.
Batch data processing alternatives
What happens when organizations require real time data analysis to support their growth and efficient decision making? We take a look at alternatives to batched data processing.
Real time data processing
Real time data processing means the input, processing and output of data continuously. Data is processed as quickly as it is inputted into the system, in the shortest possible time period (in “real time”), and the processor is active at all times.
Examples of real time processing programs include ATMs and point-of-sale validation, so fraudulent transactions can be stopped before they are completed. Data processed using real time processing can also significantly improve business function through real time analytics.
Stream data processing
Stream data processing is the continuous processing of data in an endless flow. In stream processing, data is analysed as it is inputted into the system. Access to information on the fly is crucial for stream processing.
One example of a continuous stream of data is your news feed on a platform like Twitter. Another is the continuous stream of data generated by sensors on Formula One race cars, with each car producing 1.1 million data points per second.
Quix is a platform for working with streaming data. With Quix, developers can use Python or C# to connect their applications to a message broker (which we talk about in the next section), create contextual streams of data, and process and store these data streams.
Paradigm shift #1: The message broker is your new core
In the old paradigm of batch data processing, a database was at the core of everything. Data had to be inputted, analyzed and results outputted from the database. As we discussed above, all of this reading, processing and writing on a database required significant time and resources.
“In the new paradigm, a message broker is the new beating heart of your information architecture.”
In the new paradigm, a message broker is the new beating heart of your information architecture. The message broker accepts streaming data, the same way a database accepts data, but there is no need to write information to a database before processing it, because the processing happens as the streaming data comes in.
The big advantage is that the broker holds the most recent data in-memory so the program running on the computer cluster can access it quickly, while older data is written to disk. By connecting your code to the broker, your deployments receive messages almost instantly. You can learn more about how this works in Quix documentation.
Paradigm shift #2: Think in Streams, not tables
At the core of the traditional relational databases are tables: a place where data is stored (and retrieved), consisting of rows and columns. Tables hold data and can be queried to retrieve data, just as we learned in batch data processing.
In our paradigm shift to stream data processing, streams are at the core. Instead of data stored in tables in a database, data is delivered in a continuous flow of records on a message broker. Each record is called a log.
This makes things pretty hard for developers because the logs are completely unstructured. Each log has no idea what information is in the next log, or the nth log, so it’s hard to build an application that efficiently processes the right information at the right time.
We solved this at Quix by creating the Streams Class. It lets you define an object to collect all the information for a given context, such as one customer ID. The Streams Class then arranges your data in a table-like format with the timestamp as the primary key for each row, and a column used for each parameter and event value at that time stamp in the stream of records.
“Streams Class maintain structure and context when storing data, so it’s easy to explore historical data or use common ML libraries.”
The Streams Class can also maintain this structure when storing data. This makes it easy to explore historical data because it’s all recorded in your application context. It also makes it easy to use streaming data with common ML libraries and tools such as Scikit-learn and Jupyter Notebooks.
You can use one format to develop models on historical data and deploy them to production, all by using the Quix SDK. And because streams are in memory, while tables are in the disk, your applications will be fast and efficient.
Paradigm shift #3: In-memory processing
With batch data processing, the focus has always been on data that isn’t needed in real time. It requires digging into the database every time you need to process and output data. That works fine — as long as you’re not in a hurry. But high latency can undermine businesses that rely on timely analytics.
For access to real time data, using the traditional approach isn’t fast or sustainable. With in-memory processing, there is a shift in the architectural approach of inputting, processing and outputting data. In-memory processing sends data streams through the message broker instead of the database (which is on the disk), leading to significantly lower latency and higher efficiency.
How can organizations improve their data processing?
Large organizations typically have many systems integrated with a wide variety of technologies. Their purpose is to receive, store and transmit data. The data is most often stored, at rest, in a database. Once the data has been collected and then funnelled into the database for storage, it can be read for batch processing.
“With extremely high volumes of data, or where speed is vital, expensive hardware is often the solution. However, much of this data is not needed or irrelevant.”
In situations where extremely high volumes of data are required, or where speed is vital, expensive hardware is often the solution. However, much of the data is either not needed or is only relevant in the instant it’s generated. The deferred nature of batch processing means that insights, decisions or opportunities that could be gained from working with live data in real time are lost.
The lost opportunity from batch processing stale data doesn’t need to be the reality. Processing live data in real time is possible — and much easier than you’d think.
Instead of a database, Quix is built with a message broker at its core, meaning everything Quix does facilitates working with live data the instant it’s created. What you do with the data at that moment can be as simple as discarding portions of it that aren’t useful, or analyzing it and reacting in real time.
How data stream processing is changing business
The essential trend in data is the demand for companies to act on data faster and more efficiently. Organizations already invest heavily in data — including data warehouses and data scientists — but it’s not enough to simply collect and store data. Producing insights from that data and being able to act on this analysis quickly are the key factors in transforming this data investment into true business value.
Embracing the paradigm shifts associated with streaming data will enable organizations to achieve lower latency, higher bandwidth and greater efficiency, compared to traditional batch processing. Harnessing the power to process an ever-growing volume and velocity of streaming data — and automate actions in response to it — creates a significant competitive advantage over businesses limited by last-generation technology.
“Harnessing the power to process an ever-growing volume and velocity of streaming data — and automate actions in response to it — creates a significant competitive advantage over businesses limited by last-generation technology.”
Until now, working with real time data streams has only been available to massive organizations with the resources to apply hundreds of developers to this problem. But with Quix’s platform, any developer can stream, process and store data at scale without the hassle of managing infrastructure.
By creating a layer of abstraction on the complexities of streaming data, Quix’s SDK enables developers to write code in Python or C# that connect directly to a message broker, creating a seamless live data stream. This setup improves developer productivity without requiring expensive teams or infrastructure.
The transition from batch data processing to stream data processing will no doubt be difficult for some — it requires several paradigm shifts in the approach to storing, processing and acting on data. But the exponential growth in digital products and services, as well as heightened business competition, demands not just a faster way to handle data, but a more efficient approach as well.
If you’d like to try Quix’s data processing platform for free, sign up for immediate access.