5 reasons why Spark Streaming’s batch processing of data streams is not stream processing

There are several approaches to the way systems deal with real-time data before it is persisted in a database. Two of the most common open source platforms in this space are Apache Storm and Apache Spark (with its Spark Streaming framework), and they take very different approaches to processing data streams. Storm, like Guavus SQLstream, IBM InfoSphere Streams and many others, is a true record-by-record stream processing engine. Others, such as Apache Spark, instead collect events together and process them in batches. I’ve summarized here the main considerations when deciding which paradigm is most appropriate.

#1 Stream processing versus batch-based processing of data streams

There are two fundamental attributes of data stream processing. First, each and every record in the system must have a timestamp, which in 99% of cases is the time at which the data were created. Second, each and every record is processed as it arrives. These two attributes ensure a system that can react to the contents of every record, and can correlate across multiple records over time, even down to millisecond latency. In contrast, approaches such as Spark Streaming process data streams in batches, where each batch contains a collection of events that arrived over the batch period, regardless of when the data were actually created. This is fine for some applications, such as simple counts and ETL into Hadoop, but the lack of true record-by-record processing makes stream processing and time-series analytics impossible.
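To make the contrast concrete, here is a minimal sketch of the canonical Spark Streaming word count (the socket source on localhost port 9999 is a placeholder used purely for illustration). Note that the batch interval groups records purely by their arrival time; any creation timestamp embedded in a record plays no part in how it is grouped.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MicroBatchWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MicroBatchWordCount").setMaster("local[2]")

    // Everything that arrives within each 2-second interval is bundled into one
    // micro-batch, regardless of when the individual records were created.
    val ssc = new StreamingContext(conf, Seconds(2))

    val lines  = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)

    counts.print()   // results are emitted once per batch, not once per record

    ssc.start()
    ssc.awaitTermination()
  }
}
```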

#2 Data arriving out of time order is a problem for batch-based processing

Processing data in the real world is a messy business. Data is often of poor quality, records can be missing, and streams arrive with data out of (creation) time order. Data from multiple remote sources may be generated at the same time, but due to network or other issues, some streams may be delayed. A corollary of batch-based processing of data streams is that these real-world factors cannot be addressed easily, making it impossible, or at best expensive in computing resources and therefore performance, to detect missing data and data gaps, or to correct out-of-order data. This is a simple problem to overcome for a record-by-record stream processing platform, where each record carries its own timestamp and is processed individually.
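As an illustration of why this is straightforward record by record, here is a minimal sketch (not taken from any particular product) of how a stream processor can use each record's creation timestamp to re-order late arrivals, using a simple buffer and an allowed-lateness threshold; anything still missing once the threshold has passed shows up immediately as a gap.

```scala
import scala.collection.mutable

// Each record carries its own creation timestamp.
case class Record(eventTimeMs: Long, payload: String)

// Record-by-record reordering sketch: buffer arrivals in a priority queue keyed
// on creation time, and release a record only once it is older than the allowed
// lateness relative to the newest creation time seen so far.
class Reorderer(allowedLatenessMs: Long) {
  private val buffer = mutable.PriorityQueue.empty[Record](
    Ordering.by[Record, Long](_.eventTimeMs).reverse) // earliest record at the head
  private var maxEventTimeMs = Long.MinValue

  // Called once per arriving record; returns any records now safe to emit,
  // in creation-time order.
  def onArrival(record: Record): List[Record] = {
    buffer.enqueue(record)
    maxEventTimeMs = math.max(maxEventTimeMs, record.eventTimeMs)

    val threshold = maxEventTimeMs - allowedLatenessMs
    val ready = mutable.ListBuffer.empty[Record]
    while (buffer.nonEmpty && buffer.head.eventTimeMs <= threshold) {
      ready += buffer.dequeue()
    }
    ready.toList
  }
}
```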

#3 Batch length restricts window-based analytics

Any system that uses batch-based processing of data streams limits the granularity of its response to the batch length. Window operations can be simulated by iterating repeatedly over a series of micro-batches, in much the same way as static queries operate over stored data. However, this is expensive in processing resources and adds further to the computational overhead, and processing is still limited to the arrival time of the data rather than the time at which the data were created.
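For example, in Spark Streaming's own windowing API (a sketch continuing the word-count example above), both the window length and the slide interval must be multiples of the batch interval, so no analytic can be expressed at a finer granularity than the batch length.

```scala
// Continuing the earlier example with a 2-second batch interval: both the
// window length and the slide interval must be multiples of that interval,
// so the finest-grained window this pipeline can express is 2 seconds.
val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1))

val windowedCounts = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,   // combine counts across the batches in the window
  Seconds(30),                 // window length  (multiple of the batch interval)
  Seconds(10))                 // slide interval (multiple of the batch interval)

windowedCounts.print()
```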

#4 Spark claims to be faster than Storm but is still performance-limited

Spark Streaming’s Java and Scala-based execution architecture is claimed to be 4X to 8X faster than Apache Storm on the WordCount benchmark. However, Apache Storm offers limited performance per server by today’s stream processing standards, although it does scale out over large numbers of servers to achieve overall system performance. (This can make larger systems expensive, not only in server, power and cooling costs, but also in the additional complexity of a distributed system.) The point here is that Spark Streaming’s performance can be improved by using larger batches, which may explain the performance increase, but larger batches move the system further away from real-time processing towards stored batch mode, and exacerbate the stream processing and real-time, time-based analytics issues described above.
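A rough back-of-the-envelope sketch, with entirely assumed figures, shows the trade-off: a larger batch amortizes the fixed per-batch overhead over more records, which lifts throughput, but every record then waits on average half a batch interval before processing even begins.

```scala
// Back-of-the-envelope sketch with assumed figures, for illustration only.
val batchIntervalMs    = 2000L     // chosen micro-batch length
val perBatchOverheadMs = 200L      // assumed fixed scheduling/setup cost per batch
val recordsPerSecond   = 100000L   // assumed input rate

// Larger batches amortize the fixed overhead over more records (better throughput)...
val recordsPerBatch     = recordsPerSecond * batchIntervalMs / 1000
val overheadPerRecordUs = perBatchOverheadMs * 1000.0 / recordsPerBatch

// ...but the average record waits half a batch interval before processing even
// starts, so the latency floor rises in step with the batch length.
val avgQueueingDelayMs  = batchIntervalMs / 2.0

println(f"overhead per record: $overheadPerRecordUs%.2f µs, " +
        f"minimum average queueing delay: $avgQueueingDelayMs%.0f ms")
```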

#5 Writing stream processing operations from scratch is not easy

Batch-based platforms such as Spark Streaming typically offer limited libraries of stream functions that are called programmatically to perform aggregations and counts on the arriving data. Developing a streaming analytics application on Spark Streaming, for example, requires writing code in Java or Scala. Processing data streams is a different paradigm, and moreover, Java is typically around 50X less compact than, say, SQL, so significantly more code is required (as the sketch below illustrates). Java and Scala also rely heavily on garbage collection, which is particularly inefficient and troublesome for in-memory processing.
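To give a feel for the difference in code volume, a simple running count per key on Spark Streaming's DStream API looks roughly like the sketch below (the socket source and checkpoint directory are placeholder values); the streaming-SQL comment at the end is purely illustrative of how compact a declarative equivalent can be, not any specific product's syntax.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch of a running count per key on the DStream API; the socket source and
// checkpoint directory are placeholder values for illustration.
object RunningCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("RunningCount").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(2))
    ssc.checkpoint("/tmp/running-count-checkpoint")  // required for stateful operations

    val pairs = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map(word => (word, 1))

    // Hand-written state update: merge this batch's counts into the running total.
    val updateCount = (newValues: Seq[Int], state: Option[Int]) =>
      Some(newValues.sum + state.getOrElse(0))

    val runningCounts = pairs.updateStateByKey[Int](updateCount)
    runningCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

// A declarative streaming-SQL equivalent (illustrative syntax only) collapses
// to something like a single statement:
//   SELECT STREAM word, COUNT(*) FROM words GROUP BY word;
```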