Last year has been an interesting experience as I participated in a number of stream processing and streaming analytics customer projects for SQLstream. Developing these real-time, stream computing projects greatly increased my appreciation for the advantages of an open, extensible and standards-compliant middleware infrastructure.
For example, I needed to implement an edge detection mechanism. My colleagues at SQLstream recommended using Bollinger Bands for determining outliers. So, I browsed through the wikipedia entry for Bollinger Bands to learn more. Bollinger Bands are very similar to standard deviations or quartile deviations. A Standard deviation measures variability or dispersion in data distribution. Bollinger Bands, on the other hand, provide thresholds to filter outliers in the data. In fact, Bollinger bands are based on the moving average and moving standard deviation of the data set. For typical data sets, Bollinger bands can be defined as:
lowerBB(lower Bollinger Band) = avg - (k * stddev), upperBB(upper Bollinger Band) = avg + (k* stddev)
stddev are the average and standard deviation over a sufficiently large time window and k is the constant that needs to be determined for the activity being monitored. For typical data sets, k = 2 will create the upper bollinger band at 95th percentile of the data set.
Bollinger Bands are widely used in the financial services industry. However, Bollinger Bands can be applied to solve problems in other industries. (As I am not claiming to be a statistics expert, I would certainly appreciate feedback on our application of Bollinger bands in streaming queries.)
Bollinger bands certainly are a good tool to identify sudden spikes in the activity being monitored in real-time. A number of examples come to my mind, for example, detecting sudden spikes in the price for a ticker symbol in a stock exchange:
SELECT STREAM ROWTIME, ticker, price, FROM (SELECT STREAM ROWTIME, ticker, price, AVG(price) OVER (PARTITION BY ticker RANGE INTERVAL '1' HOUR PRECEDING) AS “avgLastHour”, STDDEV(price) OVER (PARTITION BY ticker RANGE INTERVAL '1' HOUR PRECEDING) AS “stdDevLastHour”, AVG(price) OVER (PARTITION BY ticker ROWS 5 PRECEDING) AS “avgLast5Trades” FROM BIDS) AS S WHERE S.”avgLast5Trades” > S.”avgLastHour” + 2 * S.”stdDevLastHour”;
Bollinger Bands can equally be used to detect spikes in the error rate on a web server. For example:
SELECT STREAM ROWTIME, url, “numErrorsLastMinute”, FROM (SELECT STREAM ROWTIME, url, “numErrorsLastMinute”, AVG(“numErrorsLastMinute”) OVER (PARTITION BY url RANGE INTERVAL '1' MINUTE PRECEDING) AS “avgErrorsPerMinute”, STDDEV(“numErrorsLastMinute”) OVER (PARTITION BY url RANGE INTERVAL '1' MINUTE PRECEDING) AS “stdDevErrorsPerMinute” FROM “HttpRequestsPerMinute”) AS S WHERE S.”numErrorsLastMinute” > S.”avgErrorsPerMinute” + 2 * S.”stdDevErrorsPerMinute”;
Other industry examples where Bollinger Bands offer a useful technique include monitoring call volumes in a call center and Analytics on social/online gaming services.
In the stream computing context, Bollinger Bands provide the high/low-water marks for monitoring activity. Whenever the level of recent activity crosses these Bollinger Band thresholds, the activity can be flagged. The streaming analytics engine can then perform additional analytics to detect patterns in the activity and to provide actionable information to regulate the system that is being monitored. At the very least, Bollinger bands can be used to filter out “uninteresting” rows from the stream, thereby reducing the load on the streaming pipeline.
At SQLstream, we use windowed aggregation functions such as
AVG() OVER (…) and
STDDEV() OVER (…) to establish Bollinger Bands. It is necessary to compute
STDDEV on sufficiently large windows of time. In a streaming context, we used sufficiently large windows of time to calculate Bollinger Bands. So, as the window slides forward in time, the Bollinger bands reflect more recent activity levels. The current activity levels can then be computed on a much smaller window, potentially including only the current row in the stream. Should the current activity level cross either of the Bollinger bands, we then mark that as a spike in the activity level. The formula for Bollinger bands needs to be changed based on the data distribution, that is, to determine exactly what multiple of standard deviation is appropriate.
Coming back to my point about openness and extensibility, as you can see in the example queries above, you could execute a very similar query in Oracle or SQLServer. Key features such as windowed aggregation functions, often called SQL OLAP functions, have been in SQLstream for a long time. Interestingly, SQLstream did not support
STDDEV() windowed aggregation function at the time of this project. A lot of the SQL experts will know
STDDEV can be easily rewritten using a formula involving
I am sure a lot of you readers have interesting ideas and questions. Please feel free to post them here and I will be happy to engage in conversation.