Frequently asked questions about data streaming

by Steve Rosam
| 7 Sep, 2021

Data streaming is a complex topic, partly because there are so many ways to describe it. You’ve probably seen references to “streaming data,” “data analytics,” “real time data,” “data pipelines,” “message brokers,” “pub/sub messaging,” and similar terminology.

Many of the terms we’ll discuss are used interchangeably or flexibly because the technology is constantly evolving, so don’t worry about memorizing definitions. Instead, focus on understanding the core process and technology concepts that enable organizations to harness the power of streaming data in real world applications.

This article covers the most frequently asked questions about streaming data.

 

What is data streaming?

Streaming data refers to data that continuously flows from a source to a destination, to be processed and analyzed in near real time. Think of the continuous arrival of stock price updates, or a ride-hailing service finding the closest available car.

While “streaming data” (a noun phrase) describes the character of the data itself, “data streaming” refers to the act of producing, or working with, such data. Again, you might see these used interchangeably, even when that doesn’t make strict linguistic sense. Regardless, we should all know what we’re talking about when we refer to “data streaming” or “streaming data.”
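
As a minimal illustration, a stream can be modelled in Python as a generator that yields events one at a time. The symbol, field names, and random-walk prices below are purely illustrative, not tied to any real data source:

```python
import random
import time
from typing import Iterator

def stock_price_stream(symbol: str, n_events: int) -> Iterator[dict]:
    """Simulate a continuous stream of stock-price events.

    Real streams are unbounded; we cap at n_events so the example ends.
    """
    price = 100.0
    for _ in range(n_events):
        price += random.uniform(-1.0, 1.0)  # random walk around the last price
        yield {"symbol": symbol, "price": round(price, 2), "ts": time.time()}

# Consume the stream event by event, as each event "arrives".
for event in stock_price_stream("ACME", 5):
    print(event)
```

The key point is that the consumer handles each event as it is produced, rather than waiting for a complete data set.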

 

How does data streaming compare to batch processing?

Batch processing involves working with a fixed set of data, collected over time, which is analyzed together at a certain point. It does not seek to take action — or even analyze data — in real time. Batch processing might be suitable when working in real time doesn’t matter; when dealing with very large volumes of data; or when analysis requires a large amount of data simultaneously.

Batch processing typically occurs at a convenient interval, such as daily or weekly. This may be to align with other external dependencies, such as the close of business in retail, or the close of the stock market in financial services and banking. It might also be scheduled at a convenient time when resources are more available, such as during overnight downtime.

Meanwhile, data streaming is a process that never sleeps. Companies that require real time data to make decisions and those that want to process data as soon as it is created will benefit from streaming data analytics.
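
The contrast can be sketched in a few lines of plain Python (the readings are made-up sensor values):

```python
# Batch: collect everything first, then analyze once, at a convenient time.
readings = [21.0, 22.5, 19.8, 20.4]           # e.g. a day's worth of data
batch_average = sum(readings) / len(readings)  # computed after the fact

# Streaming: update the result incrementally as each value arrives,
# so an up-to-date answer is available at all times.
count, total = 0, 0.0
for value in readings:                         # in reality, an unbounded source
    count += 1
    total += value
    running_average = total / count            # always current
```

Both approaches reach the same final average; the difference is that the streaming version has an answer available after every single event.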


What are data streaming applications?

Put simply, streaming data applications are those designed to process a continuous incoming flow of data. Such applications may process real time data, act on streaming data, or produce new data from the incoming data. Data streaming applications often work at a large scale, but that’s not a requirement. The most significant aspect of streaming applications is that they act in real time.

Keep in mind that it’s easy for an application to be decoupled from its data source, so streaming data may be processed by a separate entity from the one producing it. A data streaming application — or components of it — may also be able to work with batched data. This can be of great use when developing or testing an application.

 

What is a streaming data pipeline?

A pipeline enables data to flow from one point to another. It’s software that manages the process of transporting that data. A pipeline should help prevent common problems such as:

  • data corruption
  • bottlenecks
  • loss of data
  • duplicated data

A streaming data pipeline does this job in real time, often with large amounts of data. Compared to a data streaming application, which can have various processing tasks, a pipeline’s task is straightforward: move data from point A to point B.
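
Here is a toy Python sketch of that job, with finite lists standing in for the unbounded source and the destination; the record shapes and guard conditions are illustrative:

```python
def pipeline(source, sink, seen=None):
    """Move records from source to sink, dropping duplicates and bad data.

    A real streaming pipeline runs continuously; here the source is a
    finite list so the example terminates.
    """
    seen = set() if seen is None else seen
    for record in source:
        if record is None:       # guard against corrupt or missing data
            continue
        key = record["id"]
        if key in seen:          # guard against duplicated data
            continue
        seen.add(key)
        sink.append(record)      # deliver to point B

source = [{"id": 1, "v": 10}, None, {"id": 1, "v": 10}, {"id": 2, "v": 20}]
sink = []
pipeline(source, sink)
# sink now holds each valid record exactly once
```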

 

What is streaming data analytics?

Also known simply as streaming analytics or event stream processing, this is the analysis of real time data, via event streams, using continuous queries.

This processing can involve:

  • aggregation, such as the summing or averaging of values
  • transformation of values from one format to another
  • analysis, in which future events are planned or predicted
  • enrichment of data by combining it with data from other sources to add context
  • ingestion, by sending the data to a permanent store such as a database
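
Several of these operations can be sketched in plain Python over a stream of made-up sensor events (the field names and reference data are illustrative):

```python
events = [
    {"sensor": "s1", "temp_c": 20.0},
    {"sensor": "s2", "temp_c": 25.0},
]

# Reference data used to enrich each event with context.
locations = {"s1": "warehouse", "s2": "office"}

processed = []
running_sum = 0.0
for e in events:                                # one event at a time
    running_sum += e["temp_c"]                  # aggregation (running sum)
    enriched = {
        **e,
        "temp_f": e["temp_c"] * 9 / 5 + 32,     # transformation of format
        "location": locations[e["sensor"]],     # enrichment with context
    }
    processed.append(enriched)                  # "ingestion" into a store
```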

 

Which industries use data streaming?

Many industries use data streaming, with more adopting streaming data technology all the time. However, certain industries are a natural fit for data streaming because of their requirements. Any organization that needs to use continuously available data is a good candidate.

Some of the industries making the best use of data streaming technologies include:

  • Finance — from stock markets to consumer banking, the finance industry deals with huge amounts of data constantly. Streaming analysis of that data helps with fraud detection and prevention, market analysis and predictive analytics.
  • Ecommerce — platforms such as Amazon deal with massive amounts of data and are continuously analyzing it to adjust prices, optimize logistics and make product recommendations. Amazon uses its own product, Amazon Kinesis, to do this.
  • Sports — in particular, Formula 1 cars are very heavily instrumented. Teams collect enormous amounts of data during races and testing, and must analyze the data in real time to make in-the-moment decisions that can affect the outcome of a race.
  • Gaming — multiplayer online platforms gather metrics on player performance, as well as data on technical details such as latency. Findings can be used to fine-tune the experience and provide an optimal configuration for users.

 

How does machine learning work with streaming data?

Machine learning (ML) is a type of artificial intelligence that uses data for self-improvement. ML models are trained on existing data sets so they can process further data according to what they learn.

Traditionally, ML models are trained as part of an offline process, a type of batch processing. But data streaming technologies provide an opportunity to improve on this. With data streaming, ML models can be trained continuously, learning from each piece of data as it is created. The models then improve in real time, potentially producing more accurate and relevant results.
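
The core idea, often called online learning, can be sketched with a one-feature linear model updated by stochastic gradient descent on each sample as it arrives. The data, learning rate, and model are purely illustrative:

```python
def sgd_step(w, b, x, y, lr=0.05):
    """One stochastic-gradient-descent step on a single streamed sample."""
    pred = w * x + b
    err = pred - y
    return w - lr * err * x, b - lr * err  # gradients of squared error

w, b = 0.0, 0.0  # model starts knowing nothing

# Stream of (x, y) samples drawn from the (unknown) relation y = 2x.
# A real stream would be unbounded; we repeat a small set to terminate.
for x, y in [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)] * 200:
    w, b = sgd_step(w, b, x, y)

# w ends up close to 2 and b close to 0: the model has learned y = 2x
# incrementally, without ever seeing the data set as a whole.
```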

 

How does data streaming work?

A full answer to this question involves a detailed description of several complex technologies. (Our Chief Technology Officer explains why streaming data technologies are so hard to handle in this post.) But at a high level, data streaming usually involves the following:

  • One or more data sources produce continuous streams of data.
  • A message broker, such as Apache Kafka, groups these streams into topics, ensuring consistent ordering within each topic partition. Kafka also uses replication and can distribute data across multiple servers to provide fault tolerance.
  • Applications consume data from these topics, process it, and act on it accordingly.
  • The results of processed data can be streamed back to the message broker for further distribution.


Can I work with streaming data using Python? Is it better than C#?

Yes, you can, and the second question is yours to answer: both languages are entirely workable. Python is popular with data scientists and has excellent support for data handling via libraries such as pandas, NumPy, and Bokeh. C# is popular within the Microsoft ecosystem.

The Quix SDK has bindings for both Python and C#. The SDK abstracts away a lot of complexity, so it’s easy to use either language. We also offer APIs, which can be used from any HTTP-capable language.

 

If I stream my data, will I lose it once I’ve analyzed it?

No. Just because you’re streaming data in real time doesn’t mean you have to throw it away immediately. Some message brokers, such as RabbitMQ, discard a message once it has been delivered, but Kafka supports permanent data storage. You can also build your architecture to store data at a later point in the pipeline.

The Quix platform allows you to persist data on a topic-by-topic basis (our head of Platform explains how to persist data efficiently in this post). This provides a compromise, allowing you to store your most important data permanently without wasting resources.

 

Do I need to use Kafka for data streaming?

Apache Kafka dominates the data streaming landscape, and you’ll often see it mentioned as the industry leader. While nothing requires you to use Kafka, you should at least understand its purpose and capabilities, and evaluate it as a candidate for your data streaming architecture.

The Quix platform uses Kafka, although it doesn’t depend on it. We simplify the process of working with Kafka, a highly complex product that can require a lot of engineering resources to set up and manage. As part of our platform, Quix manages Kafka for you, so you don’t have to invest heavily in infrastructure.

 

Does Hadoop handle streaming data?

Hadoop is an Apache project that supports the processing of massive amounts of data. However, it is a batch system: it is not real time and does not handle streaming data itself. It can, however, be used in conjunction with streaming data systems such as Apache Spark or Apache Flink.

 

Should my company use streaming data?

Ultimately, only you can answer that strategic question, not least because migrating from batch processing to stream processing can require a paradigm shift for your whole organization.

You’ll be able to best compare batch data processing vs. streaming data processing by considering the following:

  1. How soon does the data you work with become stale? Is there value in analyzing this data as quickly as possible? Real time data streaming is as fast as it gets. You can also consider that your data doesn’t have to arrive strictly in real time to benefit from the streaming model. Some elements of data streaming can be used to improve your architecture and reduce processing and storage costs — even if you only process data in micro-batches.
  2. Can you meaningfully analyze individual data points in isolation? If not, data streaming might prove to be difficult.
  3. How much engineering time do you have to spare to set up a data streaming architecture? (Hint: Quix can help!)
  4. Are you working with truly massive sets of data that cannot realistically be processed in real time? It’s unlikely, but it’s possible.

 

What’s the easiest practical way to understand data streaming?

If you’ve read this FAQ from beginning to end, you might feel a bit overwhelmed. There are many concepts to absorb, and the theory of data streaming is only one part of the puzzle. Hopefully, you’re interested in the practicalities of data streaming too.

We’ve engineered the Quix platform to make data streaming accessible — even without a huge team of infrastructure engineers — so whether you’re an experienced data scientist, a digital product developer, or a student of machine learning, you can get started quickly.

Our series of Quick Start guides introduces the platform, and the topic of data streaming, from first principles. These guides include sample code so you can start working with streaming data right away.

The ‘cardata’ sample project is the easiest way to understand Quix and get started on your data streaming journey. And if you still have questions, please reach out to our friendly engineers on Discord. We love hearing from you.


Steve Rosam is a Senior Software Engineer at Quix, where he creates and maintains solutions both in-house and for customers. Steve has worked as a software developer for two decades, previously in a variety of industries including automotive, finance, media and security.
