Data Stream Processing and the Importance of Time

The concept of time is at the core of all Big Data processing technologies but is particularly important in the world of data stream processing. Indeed, it is reasonable to say that the way in which different systems handle time-based processing is what differentiates the wheat from the chaff as it were, at least in the world of real-time stream processing. There are many aspects of time-based processing. We’re going to look at three. First, the ability to generate results in real-time, second, the ability to process a time-series data stream in real-time, and third, although it may seem obvious, an in-built understanding of time – in particular the time at which data records were created. More on this later.

In essence these factors boil down to the key requirements to consider when rolling out a data stream processing application:

  • The ability to maintain low latency results (< 10ms) independent of arriving data volume, and particular as data volumes grow hundreds of thousands of records per second.
  • The ability to process arriving data record by record, essential for real-time analytics, alerting and true time-based processing.
  • A native understanding of time and the ability to process by data creation time, rather than simply wall clock time.

These attributes are discussed further in the following paragraphs.

Real-time Processing

There are many archtectures capable of processing an unbounded, time-series data stream. The issue is one of generating answers quickly enough as data volume increase. I don’t believe the term streaming implies a particular execution engine. An RDBMS or in-memory column store are perfectly capable of processing a data stream and generating low latency responses if the data volume is low and latency is measured in minutes. As data volume increases, it is architectural limitations that define the boundary between using a data management platform or a data stream management platform. The main difference I see is the trade off between result latency and data volume – data stream management delivers on both.

Time-series Data Processing

For me, stream processing of time-series data means record by record processing, that is,processing each new record as they arrive. Systems such as Spark Streaming use a different approach, using a batch-based architecture. I wouldn’t consider batch-based platforms data stream processing platforms, rather in the wider category of systems that can process a data stream. See 5 reasons why Spark Streaming’s batch processing of data streams is not stream processing. for more details. Batch-based is a new phenomenon which at first glance is difficult to understand as to the reasons why as it appears to be a backwards step. For example, batch-based stream processing is severely restrictive when it comes to real-time analytics from data streams. However, there is one use case which is the data loading Hadoop – where dataload is the only requirement, batch-based can be a useful approach.

Understanding Time

A native capability to understand time is essential for the correct, repeatable and reliable processing of a time-series data stream. Most data stream processing utilizes time windows to process the arriving data. The simplest approach (used by Spark Streaming for example) is to process data using wall clock time – the time that the data arrived as understood by the underlying processing platform. However, wall clock time is unsuitable for most real-world stream processing applications, where alerts and patterns are only valid when based on the time of record creation. Furthermore, data in the real world is often delayed, sometimes significantly, where the only way to achieve useful answers is to wait and to process all data streams against data creation time.

It is a common misnomer that wall clock time is the more popular. It is certainly true and Hadoop-based frameworks are based on wall clock time, however this is a limitation of the architecture. We see very few use cases for processing by wall clock time except in two scenarios: (1) where there is no option as the data for whatever reason does not contain a timestamp or could not be punctuated at source by the change data capture collectors, and (2) dataload for Hadoop where the timestamp of data creation is not required. Therefore a native ability to support both is required, with data creation time being the default requirement in most cases.

Summary

In summary, not all data stream processing platforms are alike. The key consideration is the use case. The simplest case is Hadoop dataload, for which there are numerous solutions of varying throughout capacity and ease of configuration, and where processing by wall clock time and batch-based architectures may be sufficient. However, if real-time alerting and analytics, and the ability to integrate most sophisticated predictive analytics on data streams is required, drive automated actions etc, then processing by data creation time and the ability to process each record as it arrives are essential.