Examples of SQL as the Streaming Analytics Language of Choice
A key requirement for minimal latency, accuracy, and completeness of streaming analytics results is the use of a single language—like SQL.
SQL is, of course, the primary language for the DBMS and some big data processing frameworks, available through many platforms for Hadoop and NoSQL storage.
SQL is increasingly becoming the analytics language of choice for both structured and unstructured data (as used by Guavus SQLstream, it becomes a powerful analytics language for fast data streams). The difference between streaming SQL queries and SQL used for batch processing is that the streaming queries execute continuously over moving data streams, rather than repeatedly over stored data at rest.
Given that most are familiar with SQL analytics over stored data, here’s a few examples of typical streaming SQL queries that would be executing in a unified architecture.
The types of analytics for complex event processing, as per any SQL platform, fall into four broad areas – alerts, analytics, predictive analytics and aggregation. The main difference being that for Guavus SQLstream, data ingestion is real-time and results come with millisecond latency as measured from the time of data arrival. There are also other operations that contribute to more powerful analytics such as partitions (PARTITION BY) and joins (JOIN, UNION). Some simple examples (auto-generated in StreamLab) include:
Alerts
The requirement is to generate an output alert through a connector to a notification system or other external platform. For example:
SELECT STREAM ROWTIME, City, Temperature
FROM WEATHERSTREAM
WHERE “City” = ‘San Francisco’ AND “Temperature” > ‘100’;
This query processes an input stream of temperature readings and outputs a record containing the time, location and temperature reading for every input reading matching the selection criteria. The latency from the time the data record was emitted by the sensor to the alert being issued can be as little as a few milliseconds, even where data arrive at millions of records per second.
Aggregations (Tumbling Windows)
Streaming aggregations are used extensively in stream processing, and form the basis of continuous aggregation / ETL operations into Hadoop and DBMSs. Queries utilize functions such as AVG, COUNT, MAX, MIN, SUM, STDDEV_POP, STDDEV_SAMP, VAR_POP, VAR_SAMP, and output a record at specified intervals for each GROUP of input records. For example:
SELECT STREAM
FLOOR(WEATHERSTREAM.ROWTIME to MINUTE) AS FLOOR_MINUTE,
MIN(TEMP) AS MIN_TEMP,
MAX(TEMP) AS MAX_TEMP,
AVG(TEMP) AS AVG_TEMP
FROM WEATHERSTREAM
GROUP BY FLOOR(WEATHERSTREAM.ROWTIME TO MINUTE);
The result is a stream of new records, one per minute, specifying the maximum, minimum and average readings recorded in that minute. This type of query is particularly useful for generating periodic reports with zero latency from the input data streams.
Analytics (Windowed Analytics)
Windowed analytics are the basic building block for streaming analytics. Queries generate an incrementally updated output record (or row) for each new input record. Each field (or column) in the output record may be calculated using a different window or partition. Windows can be time or row-based. For example:
SELECT STREAM
ROWTIME,
MIN(TEMP) OVER W1 AS WMIN_TEMP,
MAX(TEMP) OVER W1 AS WMAX_TEMP,
AVG(TEMP) OVER W1 AS WAVG_TEMP
FROM WEATHERSTREAM
WINDOW W1 AS (RANGE INTERVAL ‘1’ MINUTE PRECEDING);
In this query an output record (or row) is generated for each new input record, specifying the updates to the minimum and maximum temperatures over the preceding 60 seconds, plus an incrementally updated average for the temperature over that period.