5 Ways to Measure if Fast Analytics on Hadoop Are Fast Enough

With Strata + Hadoop World around the corner and continuous coverage of data lakes, SQL as a preferred language for streaming, and streaming at large, the question remains: how fast is fast enough when it comes to insight?

Businesses are rethinking how to make best use of their data, and how to harness their data assets to improve business drivers such as efficiency, market share and profitability. The emergence of Fast Data and Big Data technologies for processing machine data has enabled companies to consider real-time data management strategies, moving from 8- to 24-hour batch-based processing models for structured data to a more dynamic model based on generating faster answers directly from unstructured machine data. As such, Hadoop and NoSQL storage platforms have become an essential component of today’s enterprise architecture maps. But even as these platforms are being deployed, there is a growing awareness that Hadoop-based architectures are faster than what came before, but not as fast as they could be, or as fast as businesses are now demanding.

Fast Data frameworks have emerged hand in hand with Hadoop and other big data technologies, to the extent that they are now an integral component of what is being referred to as a unified data management architecture: horses for courses within an integrated Fast + Big Data IT architecture. A unified data architecture contains platforms for real-time processing of unstructured data (Fast Data) and for structured data storage and analytics (Big Data), all connected via stream-oriented middleware such as Kafka.
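
To make the middleware layer concrete, here is a minimal sketch in Java using the standard Apache Kafka producer client. The topic name "machine-events", the broker address and the event payload are illustrative assumptions, not taken from any particular product. The point is that machine data is published once onto the stream-oriented middleware, and the Fast Data and Big Data platforms each consume it independently from there:

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class MachineEventPublisher {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092"); // hypothetical broker
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());

            // The event is published once onto the middleware layer; the stream
            // processor (Fast Data) and the Hadoop loader (Big Data) each read
            // it with their own consumer group, so neither path blocks the other.
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                String event = "{\"sensorId\":\"s-42\",\"reading\":98.6,\"createdAt\":"
                        + System.currentTimeMillis() + "}";
                producer.send(new ProducerRecord<>("machine-events", "s-42", event));
            }
        }
    }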

Within a unified data architecture, then, one could say that analysts and IT architects are one step ahead of the data scientists: fast analytics falls within the domain of the data stream processing function, and is actually straightforward to achieve. Of course, there is a caveat. If ‘fast analytics’ means getting faster answers by processing large volumes of unstructured data in Hadoop, then the time between query launch and response is likely to be significantly less than with traditional approaches, even if it is still several hours. However, ‘fast analytics’ increasingly refers to the generation of actionable answers and insights measured from the time of data creation, rather than from the time of query launch.

Here are five requirements to consider when determining the best ‘fast analytics’ architecture for you.

1. Fast answers and response latency

Result latency is the time between data being created or made available for processing and the time when actionable answers are generated. Hadoop-based architectures can exhibit latency of an hour or more from the time of data creation; for data stream processing platforms, latency is usually measured in milliseconds. For many operational business processes, anything more than a few seconds or a minute or two is not acceptable, and increasingly sub-second responses are required.
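
As a rough illustration of how this definition can be observed in practice, the following sketch (Java, standard Apache Kafka consumer; broker address, topic and consumer group are hypothetical, carried over from the earlier example) measures the gap between each record's creation timestamp and the moment an answer could be produced, rather than timing from query launch:

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class LatencyProbe {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092"); // hypothetical broker
            props.put("group.id", "latency-probe");
            props.put("key.deserializer", StringDeserializer.class.getName());
            props.put("value.deserializer", StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("machine-events"));
                while (true) {
                    ConsumerRecords<String, String> records =
                            consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> r : records) {
                        // record.timestamp() carries the producer-assigned creation
                        // time, so this measures creation-to-answer latency,
                        // not query-launch-to-answer latency.
                        long latencyMs = System.currentTimeMillis() - r.timestamp();
                        System.out.printf("result latency: %d ms%n", latencyMs);
                    }
                }
            }
        }
    }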

2. Dataload performance with processing capability and flexibility

Some (although not all) of the latency overhead discussed above for Hadoop is down to the time required to capture, process and load the data. A variety of tools are available for dataload into Hadoop and NoSQL platforms; however, understanding dataload performance and the related requirements for streaming aggregation and filtering is key. Data stream processing platforms improve data load times through fast load connectors, but also offer the ability to filter, cleanse and aggregate live data on the fly, enabling streaming analytics and aggregated data to be loaded into multiple Hadoop and RDBMS platforms simultaneously, in real time, from the same live data streams.
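
Below is a minimal Kafka Streams sketch of this pattern; the topic names, the cleansing rule and the one-minute window are all illustrative assumptions. It cleanses a live stream on the fly and feeds two sinks at once: detail records destined for a Hadoop load connector, and per-minute aggregates destined for an RDBMS. The design point is that one live stream feeds both destinations simultaneously, rather than being captured and loaded twice:

    import java.time.Duration;
    import java.util.Properties;

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.KeyValue;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.TimeWindows;

    public class StreamingLoadPipeline {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streaming-load");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // hypothetical
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, String> raw = builder.stream("machine-events");

            // Cleanse/filter on the fly: drop malformed records before load.
            KStream<String, String> clean =
                    raw.filter((k, v) -> v != null && v.startsWith("{"));

            // Sink 1: cleansed detail records, picked up by a Hadoop load connector.
            clean.to("hadoop-load");

            // Sink 2: one-minute aggregates for the RDBMS/analytics side,
            // fed from the same live stream at the same time.
            clean.groupByKey()
                 .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(1)))
                 .count()
                 .toStream()
                 .map((windowedKey, count) ->
                         KeyValue.pair(windowedKey.key(), String.valueOf(count)))
                 .to("rdbms-aggregates");

            new KafkaStreams(builder.build(), props).start();
        }
    }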

3. Batch processing versus real-time stream processing

Data stream processing executes continuous queries (in Java or SQL) over live data as they arrive, processing each and every new record and updating, with near-zero latency, every output or answer to which that record contributes. Contrast that with storage-based data processing, where each query is launched and returns a result; as new data arrive, each query must be re-executed, potentially over the entire dataset. This not only introduces delays but is also unnecessarily resource-intensive: the same query is repeatedly launched and executed over a dataset that may be growing rapidly over time.
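
The difference between the two execution models can be shown in a few lines of plain Java (the dataset and the matching predicate are illustrative):

    import java.util.List;
    import java.util.function.Predicate;

    public class CountModels {

        // Batch model: every time a fresh answer is needed, the same query is
        // re-executed over the whole (potentially growing) dataset.
        static long batchCount(List<String> allRecords, Predicate<String> matches) {
            return allRecords.stream().filter(matches).count();
        }

        // Continuous model: each arriving record updates the running answer
        // once, in O(1), so the answer is always current at near-zero latency.
        static class ContinuousCount {
            private long count = 0;
            void onRecord(String record, Predicate<String> matches) {
                if (matches.test(record)) count++;
            }
            long answer() { return count; }
        }
    }

The batch version's cost grows with the dataset on every refresh; the continuous version's cost is fixed per record, which is why storage-based re-querying struggles as data volumes and refresh rates climb.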

4. Faster business intelligence or real-time operational intelligence

Hadoop may offer adequate response times if your requirement is faster answers from stored data, regardless of when the data were created. However, if your requirement extends into real-time or near-real-time analytics, then integrating a data stream processing platform with your existing storage-based analytics platforms will be a better way to go. The advantage of a data stream processing platform is that it can deliver real-time analytics (to multiple destinations) as well as fulfill the real-time dataload and streaming ETL role.

5. Automated actions and operational process automation

Fast analytics and real-time business and operational intelligence are essential first steps when considering a unified data architecture for driving a real-time business. The next step is the automated update of operational platforms and systems in real time, based on the analytics and predictive analytics generated. Real-time response and low latency are the important considerations here, so driving these data-driven actions directly from your real-time processing platform is likely to be the only option, and offers a significant improvement in operational efficiency and response rates.
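
As a final sketch, the following Kafka Streams snippet shows how a data-driven action can be triggered directly from the real-time processing platform the moment an analytic output crosses a threshold. The topic name reuses the hypothetical "rdbms-aggregates" output from the earlier example, and the threshold and action are entirely illustrative:

    import java.util.Properties;

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;

    public class AutomatedActions {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "automated-actions");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // hypothetical
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            // Consume the live analytic output and act on it immediately;
            // the 1000-events-per-window threshold is an illustrative rule.
            builder.<String, String>stream("rdbms-aggregates")
                   .filter((sensorId, count) -> Long.parseLong(count) > 1000)
                   .foreach((sensorId, count) -> {
                       // In practice this would call an operational system's API;
                       // here we simply log the triggered action.
                       System.out.printf("ALERT: throttling sensor %s (rate=%s)%n",
                               sensorId, count);
                   });

            new KafkaStreams(builder.build(), props).start();
        }
    }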

In summary, fast analytics means different things to different people, but taken within the context of a unified architecture for big data processing and analytics, it is possible to have the best of both worlds: storage-based processing for structured and unstructured data will generate faster analytics, and integrating a data stream processing platform will deliver true real-time analytics where required.