Time and Streaming Data
Streaming SQL is inherently time-based, and several factors determine when a running query in s-Server emits results. This section describes time-related issues, ways to make results arrive earlier, and the corresponding system and query changes.
In streaming SQL, rowtimes (timestamps for each row) are critical to processing data. Every row is assigned a rowtime. s-Server uses rowtimes to put rows in order and to apply time-based windows to them as it processes them. Usually, when you analyze and query stream data, you do so over a window of time, such as the last second or the last hour, or as rows are emitted.
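To make the idea of a time-based window concrete, here is a minimal Python sketch of a trailing-window count over rowtimes. It is a toy model, not s-Server code: the function name `sliding_counts`, the use of plain numbers of seconds as rowtimes, and the inclusive treatment of the window boundary are all assumptions made for illustration.

```python
from collections import deque


def sliding_counts(rowtimes, window_seconds):
    """For each row, count the rows whose rowtime falls in the trailing window.

    rowtimes must already be in non-decreasing order (hypothetical input:
    plain numbers of seconds). A row exactly window_seconds old is still
    counted; that boundary choice is an assumption of this sketch.
    """
    window = deque()
    counts = []
    for t in rowtimes:
        window.append(t)
        # Drop rows that have fallen out of the trailing window.
        while window[0] < t - window_seconds:
            window.popleft()
        counts.append(len(window))
    return counts


# Six rows, counted over a trailing 60-second window.
print(sliding_counts([0, 1, 2, 10, 11, 70], window_seconds=60))
```

Because the rows arrive in rowtime order, the window only ever needs to drop rows from its oldest end, which is why ordered rowtimes make windowed processing cheap.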
Stream rowtimes can be set implicitly or explicitly. Implicit rowtimes use the timestamp for when the data entered s-Server. Explicit rowtimes use a timestamp drawn from the data itself, such as when a series of logon attempts occurred. For explicit rowtimes, you need to promote a column to rowtime.
Use implicit rowtimes for calculations where the clock time of a transaction doesn't matter. If you were tracking logon attempts, for example, the reasoning might be: "I have seen 312 unsuccessful attempts, and I may need to act on this as soon as possible to flag a hacking attempt." In other cases, particularly when the logged data serves as a "system of record," you will likely want to promote the actual data time to rowtime.
Streaming systems often use what's known as a lambda architecture, in which a source of data is processed through two separate pipelines: typically one for long-term storage of large amounts of data, and one for short-term stream processing. In this case, you would most likely use implicit rowtimes on the stream-processing fork of the architecture (since the actual time of the data may not matter) and explicit rowtimes on the "batch processing" or long-term storage branch, since long-term storage requires data to be stamped with the actual data time.
There are two ways to promote a column to rowtime:
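Using an INSERT statement whose target column list includes ROWTIME. The following sketch assumes, for illustration, a source stream S1 with columns dataTime, y, and z:
INSERT INTO s (ROWTIME, y, z) SELECT STREAM dataTime, y, z FROM S1;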
Using a SELECT statement that designates a column as rowtime. The following code promotes the column "dataTime" to ROWTIME in a stream called S1:
SELECT STREAM dataTime AS ROWTIME, * FROM S1;
See the topic ROWTIME for more information.
Once you set a column as rowtime, its values must be monotonic: they must not go backward. s-Server discards any rows that are out of order, that is, rows whose rowtime is earlier than the current "stream clock."
From the perspective of stream processing, it does not matter whether rowtimes are implicit or explicit. Once rowtime is set either way, s-Server processes the data in exactly the same way.
s-Server does not compare rowtime with wallclock time. It relies on rowtime to determine the "clock of the stream" (no inferred relationships are drawn between wallclock and stream time).
The clock of the stream is always the value of the last rowtime passed to it. The only time wallclock time comes into play is with implicit rowtimes.
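The stream clock and the discarding of out-of-order rows can be modeled in a few lines of Python. This is an illustrative sketch only, not how s-Server is implemented: the function name `admit_rows` and the `(rowtime, payload)` tuples are hypothetical, and the choice to accept a row whose rowtime equals the current clock is an assumption.

```python
def admit_rows(rows):
    """Split rows into (accepted, discarded) using a simple stream clock.

    rows: iterable of (rowtime, payload) tuples in arrival order.
    The clock is the rowtime of the last accepted row; wallclock time is
    never consulted. Rows with an equal rowtime are accepted (assumption).
    """
    clock = None
    accepted, discarded = [], []
    for rowtime, payload in rows:
        if clock is not None and rowtime < clock:
            # The row is behind the stream clock, so it is out of order.
            discarded.append((rowtime, payload))
        else:
            accepted.append((rowtime, payload))
            clock = rowtime
    return accepted, discarded


arrivals = [(1, "a"), (3, "b"), (2, "c"), (3, "d"), (5, "e")]
kept, dropped = admit_rows(arrivals)
print(kept)
print(dropped)
```

In this model the row stamped 2 is dropped because the clock had already advanced to 3 when it arrived, regardless of what any wall clock said at the time.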
Sparse streams may exert backpressure on a pipeline. Combined streams need to be orderly, so they move only as fast as the slowest stream. To keep streams moving along, you can program rowtime bounds, or punctuation, into stream calculations.
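The effect of a rowtime bound on a combined stream can be illustrated with a small Python model. It is a sketch under stated assumptions, not s-Server's merge logic: the function `merge_streams`, the event tuples, and the convention that a payload of `None` stands for a rowtime bound are all hypothetical. A row is released only once every input has reported a time at least as late as that row's rowtime.

```python
import heapq


def merge_streams(events, inputs):
    """Merge rows from several inputs in rowtime order.

    events: list of (input_name, rowtime, payload) in arrival order.
            payload None marks a rowtime bound: it advances that input's
            clock without carrying data (convention of this sketch).
    Returns a list of (event_index, rowtime, payload), where event_index is
    the arrival that made the release possible.
    """
    clocks = {name: None for name in inputs}
    pending = []   # min-heap of (rowtime, arrival_index, payload)
    released = []
    for index, (name, rowtime, payload) in enumerate(events):
        clocks[name] = rowtime
        if payload is not None:
            heapq.heappush(pending, (rowtime, index, payload))
        if any(clock is None for clock in clocks.values()):
            continue  # some input has not reported any time yet
        low_water = min(clocks.values())  # the slowest input sets the pace
        while pending and pending[0][0] <= low_water:
            t, _, p = heapq.heappop(pending)
            released.append((index, t, p))
    return released


without_bound = [("busy", 1, "b1"), ("busy", 2, "b2"), ("busy", 3, "b3"),
                 ("sparse", 10, "s1")]
with_bound = [("busy", 1, "b1"), ("busy", 2, "b2"), ("sparse", 2, None),
              ("busy", 3, "b3"), ("sparse", 10, "s1")]
print(merge_streams(without_bound, ["busy", "sparse"]))
print(merge_streams(with_bound, ["busy", "sparse"]))
```

Without the bound, none of the busy input's rows can be released until the sparse input finally produces a row. With the bound, the first two rows are released as soon as the sparse input states that it has nothing earlier than time 2.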
s-Server applies implicit punctuation to all streams that use implicit rowtimes. These rowtime bounds are sent every 14ms.
The 14ms value is defined in aspen.properties, with the property aspen.stp.punctinterval.
Implicit punctuation is never used for explicit rowtimes.
For this reason, it often makes sense to promote rowtime before you get to "of record" type analytics.
You may need to t-sort after you promote.
Use implicit rowtimes if actionable data is not dependent on event times.
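The t-sort mentioned above reorders rows that arrive slightly out of time order. The following Python sketch models the general idea with a bounded time buffer. It does not show s-Server's own t-sort syntax or exact behavior: the function name `t_sort`, the `window` parameter, and the policy of dropping rows that arrive after later rows have already been released are assumptions for illustration.

```python
import heapq


def t_sort(rows, window):
    """Reorder rows by time, buffering them for up to `window` time units.

    rows: iterable of (event_time, payload) in arrival order.
    A buffered row is released once the latest time seen so far is at least
    `window` past it. A row older than the last released row is dropped
    (assumption of this sketch).
    """
    heap = []            # min-heap of (event_time, arrival_index, payload)
    output = []
    high = None          # latest event time seen so far
    last_released = None
    for index, (t, payload) in enumerate(rows):
        if high is None or t > high:
            high = t
        if last_released is not None and t < last_released:
            continue     # too late to be placed in order
        heapq.heappush(heap, (t, index, payload))
        while heap and heap[0][0] <= high - window:
            et, _, p = heapq.heappop(heap)
            output.append((et, p))
            last_released = et
    # End of input: flush whatever is still buffered, in time order.
    while heap:
        et, _, p = heapq.heappop(heap)
        output.append((et, p))
    return output


arrivals = [(1, "a"), (3, "b"), (2, "c"), (6, "d"), (5, "e"), (9, "f")]
print(t_sort(arrivals, window=2))
```

The trade-off in this model is latency against tolerance: a larger window absorbs more disorder but delays every row by up to that amount before it can be released.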