There has been a lot of discussion recently around Big Data and real-time. As it emerges that Hadoop is not as real-time as hoped, the discussion has moved on to stream processing. The discussion is not helped by the definitions (or lack of them) for real-time system behavior. That is, until you enter the world of safety-critical systems, and perhaps it is this that skews the discussion of real-time systems. Safety-critical systems, such as flight control software and ABS braking, tend to have specific operating criteria and deadlines that must be met. However, these systems ensure response and availability by being specifically designed for the task, eliminating as many external factors as possible, and using components that are highly predictable and measurable in their response latency, i.e. hardware and firmware.
If we move beyond safety-critical systems, then real-time might simply mean responding quickly enough to have the correct impact on the environment within which the system is operating, with a repeatable response rate, where that repeatability can be guaranteed as the volume of data increases or as data processing nodes are added. By this definition, software systems can certainly be real-time, and therefore stream processing can be real-time. But are all stream processing platforms real-time?
At a high level, an architectural diagram of a stream processing application looks very similar to a hardware circuit schematic. Both are based on record-by-record processing of data, pipelines of transformation and processing nodes, the concept of time windows and a system clock, and a parallel processing architecture. And unlike a BI platform, output is as important as input: a stream processing application is more like a pub-sub architecture with intelligent processing in the middle.
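To make the pipeline idea concrete, here is a minimal sketch in plain Python of a transformation node feeding a time-window node, one record at a time. It is not the API of Storm, SQLstream, or any other platform; the record layout, the function names `transform` and `tumbling_average`, and the Celsius-to-Fahrenheit step are all assumptions made for illustration, and the sketch assumes records arrive in event-time order.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical record layout: (event_time_seconds, sensor_id, value)
Record = Tuple[float, str, float]


def transform(records: Iterable[Record]) -> Iterator[Record]:
    """Transformation node: convert each value as it arrives (Celsius to Fahrenheit)."""
    for ts, sensor, value in records:
        yield ts, sensor, value * 9 / 5 + 32


def tumbling_average(
    records: Iterable[Record], window_seconds: float
) -> Iterator[Tuple[float, Dict[str, float]]]:
    """Window node: emit (window_start, per-sensor average) each time a window closes.

    Assumes records arrive in event-time order. Windows with no records
    are skipped rather than emitted as empty results.
    """
    current_window = None
    sums: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)

    for ts, sensor, value in records:
        window = int(ts // window_seconds)
        if current_window is not None and window != current_window:
            # The previous window has closed: emit its averages, then reset state.
            yield current_window * window_seconds, {s: sums[s] / counts[s] for s in sums}
            sums.clear()
            counts.clear()
        current_window = window
        sums[sensor] += value
        counts[sensor] += 1

    if current_window is not None:
        # Flush the final, still-open window when the input ends.
        yield current_window * window_seconds, {s: sums[s] / counts[s] for s in sums}


if __name__ == "__main__":
    source = [(0.0, "a", 0.0), (1.0, "a", 100.0), (2.5, "b", 10.0), (5.0, "a", 20.0)]
    for window_start, averages in tumbling_average(transform(source), 2.0):
        print(window_start, averages)
```

Because each node is a generator, nothing is buffered beyond the state of the current window, which is the property that distinguishes this style from collecting the data first and querying it afterwards.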
First, by that definition, some stream processing platforms are true stream processing platforms and some are not. For example, I would say that Storm is a true stream processing platform as it processes arriving data streams record by record. Similarly, SQLstream processes data record by record. I would also argue that Spark is not a stream processing platform as it is batch-based, more of a faster Hadoop.
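The practical difference between the two approaches shows up in latency. The toy model below is a sketch only, not a measurement of any real platform: it assumes a fixed per-record processing cost, ignores queueing and network effects, and assumes a batch-based engine holds each record until the end of its batch interval. The function names and all the numbers are hypothetical.

```python
from typing import List


def record_by_record_latency_ms(arrivals_ms: List[int], processing_ms: int) -> List[int]:
    """Each record is processed the moment it arrives, so latency is just the processing cost."""
    return [processing_ms for _ in arrivals_ms]


def micro_batch_latency_ms(
    arrivals_ms: List[int], batch_interval_ms: int, processing_ms: int
) -> List[int]:
    """Each record waits for the end of its batch interval before it is processed."""
    latencies = []
    for t in arrivals_ms:
        # End of the batch interval that contains arrival time t.
        batch_end = (t // batch_interval_ms + 1) * batch_interval_ms
        latencies.append(batch_end + processing_ms - t)
    return latencies


if __name__ == "__main__":
    arrivals = [0, 250, 900, 1000, 1999]  # illustrative arrival times in milliseconds
    print(record_by_record_latency_ms(arrivals, processing_ms=5))
    print(micro_batch_latency_ms(arrivals, batch_interval_ms=1000, processing_ms=5))
```

In this model the batch-based latency varies with where in the interval a record happens to arrive, so the batch interval sets a floor on worst-case response time regardless of how fast the processing itself is.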
So back to the question of whether stream processing is real-time. It’s relatively easy to offer real-time response (low latency, say less than a second, with repeatability of results) when the arriving data rate is low, say up to a few hundred records per second. We benchmarked Apache Storm at SQLstream when we were releasing our distributed streaming SQL on Storm topology, and found the latency to be reasonable (sub-second) at low data rates. The main issue was throughput: Storm requires a lot of hardware to scale because the performance per node is low, which impacts response time and repeatability of response. That is OK for low-volume log processing applications, for example, but not for the Internet of Things, with more sophisticated geospatial and movement processing of sensor and GPS data at rates of millions of records per second. (Further information on the Storm – SQLstream customer benchmark here.)
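One way to turn "low latency with repeatability" into something testable is to check a high percentile of the observed latencies against a budget, rather than the average. The sketch below assumes a one-second budget at the 99th percentile and uses the nearest-rank percentile method; the function names, the budget, and the idea of judging repeatability by a percentile are illustrative assumptions, not the method used in the benchmark mentioned above.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_budget(latencies_ms: Sequence[float], budget_ms: float = 1000, pct: int = 99) -> bool:
    """True if the chosen percentile of observed latencies is within the budget."""
    return percentile(latencies_ms, pct) <= budget_ms


if __name__ == "__main__":
    # Made-up latency samples in milliseconds, for illustration only.
    observed = [120, 80, 95, 1500, 110, 90, 100, 105, 85, 115]
    print("median:", percentile(observed, 50))
    print("within budget:", meets_budget(observed))
```

A single slow outlier barely moves the median but fails the percentile check, which is the behavior you want when the claim being tested is repeatability rather than typical speed.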
We’re currently updating our benchmarks for a number of stream processing platforms, covering response latency, throughput data rates, distributed system capability, and data ingestion into Hadoop. We should have the report available for publication in a few weeks.
Finally, stream processing platforms have evolved considerably to offer complete streaming data management platforms as a hub for all real-time data and applications in an organization. Much as every database-oriented application in the 90s was built on an RDBMS, and every web site and web application in the 2000s was built on MySQL, so I would see every real-time software application going forward being built on a stream processing platform. Driving real-time dashboards, and updating databases, Hadoop and data warehouses in real-time is important, but stream processing also offers the capability to bridge the gap with operational process automation and real-time BPM. Real-time is therefore only real-time if the action can be automated; otherwise, it’s human-time.