You got stream processing to work. Now how do you get it to scale?

by Mike Rosam
| Dec 7, 2021

Scalable data infrastructure is critical to keeping pace with increasing data demands across business functions. This post includes why leadership often underestimates infrastructure challenges, how to scale efficiency in data stream processing and scale stream processing solutions faster with Quix.

“Good, better, best. Never let it rest. Until your good is better and your better is best.”  — Tim Duncan

This simple quote neatly sums up the importance of scalable data infrastructure. Data never rests. As mountains of data stream in from IoT and digital apps, companies face the challenge of finding better ways to capture, extract value and operationalize it.

If you’re currently just doing batch processing, you might also want to add machine learning and automation. If you already have a machine learning model, you want it to work faster. As your business grows, you want your data infrastructure to handle greater volume without losing speed and efficiency. And when you start combining all that with live data and stream processing, everyone else in the company will want to know when they can start using that system.

The question is, who has the time and resources to continuously redesign data infrastructure to deliver the next best thing? Moving from batch processing to stream processing or expanding real-time stream processing across business functions is a big lift.

The data engineers I’ve spoken with estimate that they spend only 30% of their time on what they see as their primary job: scaling infrastructure. They feel pressure from multiple directions: keeping up with business needs, leadership demands, and their expectations to scale sometimes archaic data infrastructure. This can lead to frustration and failure as they struggle to meet all these demands.

They’re not alone. Keeping up with data demands is challenging. Many companies lack the infrastructure needed to scale up efficiently. As Dan Ariely put it, “Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it.”

 

Leadership often underestimates infrastructure challenges

Data science is hard to explain and hard to understand. Leadership doesn’t always know the difference between a data scientist, a data analyst and a data engineer — or which one they need. Eyes are likely to glaze over when you try to explain why you can’t just plug real-time data streams into the existing infrastructure and make it work.

In an attempt to describe data quality issues, former data scientist and engineer Dan Friedman writes in Data Science: Reality Doesn’t Meet Expectations, “I’d often compare it to a garbage bag that ripped, had its content spewed all over the ground, and your partner has asked you to find a beautiful earring that was accidentally inside.”

 

“Companies struggle to hire and retain someone with the skills to solve the infrastructure issues and complete the desired project.”

 

This is the mess many data engineers and scientists face at the beginning of a data project. There are common complaints about the lack of quality data, messy storage methods, and poor data infrastructure. These challenges can leave a data project stranded as companies struggle to hire and retain someone with the skills to solve the infrastructure issues and complete the desired task.

 

 

Scale and efficiency in data stream processing

“Through 2020, 80% of AI projects will remain alchemy, run by wizards whose talents will not scale in the organization.”
Gartner Top Strategic Predictions for 2019 and Beyond

Getting data stream processing to work is good, but it’s not good enough. Too often, successful projects hinge on individual talent and, therefore, won’t scale. As volume and ambition increase, the underlying architecture must scale up to keep pace.

 

“Scalability is a vital part of data stream processing infrastructure and something to consider from day one to avoid time-consuming rebuilds”

 

Scalability is a vital part of data stream processing infrastructure and something to consider from day one to avoid time-consuming rebuilds, such as those needed by companies like Alibaba and Twitter.

 

Real-time data pipeline during 11.11 Global Shopping Festival. Source: Alibaba

 

Retail giant Alibaba uses stream processing to handle peak traffic and provide real-time data for media displays, business intelligence and live studio products for executives and customer service representatives. In 2017, the Alibaba streaming computing team fully upgraded their stream computing architecture with an average of more than twice the peak stream processing capability and more than five times processing capability.

The upgrades enabled Alibaba to successfully process 256,000 payments per second at peak volume during the 2017 Singles Day shopping festival — more than double the volume processed in 2016. Since then, Alibaba has continued investing in stream processing infrastructure, acquiring Apache Flink backer data Artisans in 2019.

Twitter also made significant upgrades to its stream processing infrastructure. To avoid a lengthy migration process, it opted to build a new system dubbed “Heron” to replace their existing system, “Storm,” explained Baz Bhäte in “Twitter Storm vs Heron. A real-time streaming system demands.” Heron latency is 5–15x lower than Storm’s latency and increases more gradually; throughput is 10–14x higher than that of Storm, according to Bhäte.

Massive companies like Alibaba and Twitter have the resources to build and rebuild data stream infrastructure from scratch. But there are solutions for smaller companies that lack the resources to build, scale and maintain their own stream processing infrastructure.

 

Scale stream processing solutions faster with Quix

I talk to engineers who have solutions but don’t have the time to scale them. For them, Quix is the rare silver bullet. They can migrate those solutions into our platform and scale from there.

We’ve taken care of all the complicated infrastructure and made it Python-friendly, so developers and data scientists can get straight to coding or building their models. This is the type of big win that can erase months of frustration and open up new opportunities for data-driven products. And teams can do what they do best, feeling proud of what they accomplish.

Once you start using stream processing, it’s easy to see more use cases and — with Quix — scaling up to meet those needs is as easy as sliding a switch. Want to learn more? Book a demo with one of Quix’s friendly experts to talk through your technical challenges in scaling your data infrastructure. We’re here to help, join us in our Slack channel if you have any questions.

Join The Stream community, where you’ll find developers, engineers and scientists supporting each other while working on streaming projects.

by Mike Rosam

Mike Rosam is cofounder and CEO at Quix, where he works at the intersection of business and technology to pioneer the world's first streaming data development platform. He was previously Head of Innovation at McLaren Applied, where he led the data analytics product line. Mike has a degree in Mechanical Engineering and an MBA from Imperial College London.

Related content