How much of a consideration is data quality in the world of real-time Big Data analytics, Hadoop and stream processing?

The Big Data movement has focussed on technology until recently, but as companies such as Cloudera push their Hadoop distributions further into the Enterprise, it’s clear that traditional issues such as data quality still exist. However, it also seems that extending the Enterprise IT footprint to process unstructured data closer to source and in real-time (or near real-time) has shifted the data quality paradigm.

Enterprises have become aligned to the batch-based capabilities of their Enterprise architecture and to the relational model for data storage, typically with a heartbeat of 24 hours to the next report update, and with reports based on the percentage of data considered to be of sufficient quality. The data quality spend is invested in getting that percentage as high as possible. This is, and will remain, an important function. An RDBMS may be strict about data integrity and completeness, but in return SQL platforms offer a level of reuse, stability, access and security that will not be matched for some time.

Big Data technologies (Hadoop, and stream processing platforms such as SQLstream Blaze) exist in part because of the inflexibility of an RDBMS when analyzing unstructured and semi-structured data, and in part because of data volumes and the need for multi-server scale-out (an RDBMS scales up well over cores, but scaling out over servers is trickier, although not impossible). Big Data also happened to coincide with the availability of cheap, commodity hardware (chicken or egg?), and hence became a reality.

However, Big Data is also about faster results: real-time analytics and actions (down to millisecond latency) in the case of stream processing, and faster batch operations (an hour or so) in the case of Hadoop and micro-batch platforms such as Spark. The questions being asked of data are changing, and so too is the nature of data quality in this real-time paradigm. For example, missing data is a problem for an EDW, but OLAP engines, predictive analytics and more sophisticated queries all have a large dataset on which to base interpolation and extrapolation of values, and so can achieve reasonable accuracy. Also, part of the ETL process is often to aggregate data to specific time boundaries, so missing data can be simpler to identify (although not always). With a batch-based process, it’s not a major issue if a chunk of data turns up a few hours late, so long as it’s within the window.
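To make the time-boundary point concrete, here is a minimal Python sketch (the timestamps, values and one-minute window are invented for illustration, not taken from any particular ETL tool): readings are aggregated to fixed one-minute boundaries, any empty window shows up as a detectable gap, and a single-window gap is filled by naive linear interpolation from its neighbours.

```python
from datetime import datetime, timedelta
from collections import defaultdict

# Hypothetical raw readings: (timestamp, value) pairs, possibly with gaps.
readings = [
    (datetime(2015, 6, 1, 10, 0, 12), 42.0),
    (datetime(2015, 6, 1, 10, 0, 47), 43.5),
    # the 10:01 minute is missing entirely
    (datetime(2015, 6, 1, 10, 2, 5), 45.0),
]

def to_minute(ts):
    """Truncate a timestamp to its one-minute boundary."""
    return ts.replace(second=0, microsecond=0)

# Aggregate to one-minute windows (here: simple average per window).
windows = defaultdict(list)
for ts, value in readings:
    windows[to_minute(ts)].append(value)

# Walk every expected window between first and last; an absent key is a gap.
aggregated = {}
t = min(windows)
while t <= max(windows):
    values = windows.get(t)
    aggregated[t] = sum(values) / len(values) if values else None
    t += timedelta(minutes=1)

# Naive linear interpolation across single-window gaps.
times = sorted(aggregated)
for i, t in enumerate(times):
    if aggregated[t] is None and 0 < i < len(times) - 1:
        prev_v, next_v = aggregated[times[i - 1]], aggregated[times[i + 1]]
        if prev_v is not None and next_v is not None:
            aggregated[t] = (prev_v + next_v) / 2

for t in times:
    print(t, aggregated[t])
```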

Real-time Big Data analytics may not get off quite so easily. Data quality has therefore shifted to focus on questions such as: how long should I wait for a missing data stream, how do I know if data in a stream are missing, and how can I decide whether it matters? For some use cases this is not an issue; visitor profiling for real-time ad placement, for example, may be relatively insensitive to the odd missing server log. For other scenarios, such as alerting on drilling issues in the Oil & Gas industry or production line failures in Manufacturing, it is a significant consideration.
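One way to put a number on “how long should I wait” is an allowed-lateness watermark. The sketch below is a self-contained Python illustration only (the window size, lateness allowance and minimum-count threshold are all made up for the example, and this is not code from SQLstream Blaze or any other product): events are buffered per window, a window is emitted only once the watermark has passed its end, and a window with suspiciously few events is flagged.

```python
from datetime import datetime, timedelta
from collections import defaultdict

ALLOWED_LATENESS = timedelta(seconds=30)   # how long we wait for stragglers
WINDOW = timedelta(minutes=1)
MIN_EXPECTED = 5                           # below this, treat the window as suspect

windows = defaultdict(list)
watermark = datetime.min

def window_start(ts):
    """Map an event timestamp to the start of its one-minute window."""
    return ts.replace(second=0, microsecond=0)

def on_event(ts, value):
    """Buffer an event and emit any window the watermark has now passed."""
    global watermark
    windows[window_start(ts)].append(value)
    watermark = max(watermark, ts - ALLOWED_LATENESS)
    for start in sorted(windows):
        if start + WINDOW <= watermark:    # stop waiting for this window
            values = windows.pop(start)
            status = "OK" if len(values) >= MIN_EXPECTED else "SUSPECT: possible missing data"
            print(start, len(values), "events", status)

# Hypothetical usage: the 10:00 window holds only 2 events and is flagged
# once a later event pushes the watermark past 10:01.
on_event(datetime(2015, 6, 1, 10, 0, 10), 1.0)
on_event(datetime(2015, 6, 1, 10, 0, 50), 2.0)
on_event(datetime(2015, 6, 1, 10, 1, 40), 3.0)
```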

With all that said, the primary use case today for Hadoop storage platforms seems to be ETL off-loading of unstructured data. It’s not so much that data quality is being pushed further downstream, more that Hadoop and stream processing offer the perfect platforms for unstructured data quality enhancement – real-time filtering, aggregation and enrichment prior to the EDW load, with the added advantage of real-time, operational analytics as the data streams past.
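As a loose illustration of that filter–aggregate–enrich pattern (a generic Python generator pipeline with invented field names and reference data, not Hadoop or SQLstream Blaze code), the sketch below drops malformed records, enriches the remainder from a small reference table, and aggregates per site before anything would reach the warehouse.

```python
import json

# Hypothetical reference data used for enrichment (device -> site).
DEVICE_SITE = {"dev-1": "London", "dev-2": "Austin"}

def parse(lines):
    """Filter: drop records that are not valid JSON or lack required fields."""
    for line in lines:
        try:
            record = json.loads(line)
        except ValueError:
            continue                      # malformed record, filtered out
        if "device" in record and "reading" in record:
            yield record

def enrich(records):
    """Enrich: attach reference data to each record."""
    for record in records:
        record["site"] = DEVICE_SITE.get(record["device"], "unknown")
        yield record

def aggregate(records):
    """Aggregate: average reading per site, ready for the EDW load."""
    totals = {}
    for record in records:
        total, count = totals.get(record["site"], (0.0, 0))
        totals[record["site"]] = (total + record["reading"], count + 1)
    return {site: total / count for site, (total, count) in totals.items()}

raw = [
    '{"device": "dev-1", "reading": 20.5}',
    'not even json',
    '{"device": "dev-2", "reading": 18.0}',
    '{"device": "dev-9", "reading": 22.0}',
]

print(aggregate(enrich(parse(raw))))
# {'London': 20.5, 'Austin': 18.0, 'unknown': 22.0}
```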