Aggregate

<< Click to Display Table of Contents >>

Navigation:  Using StreamLab > StreamLab Guides Overview > Guide Commands > Analytics >

Aggregate

Previous pageReturn to chapter overviewNext page

The Aggregate analytic lets you make calculations on a column of numerical data, such as the sum, the average, the maximum value, the minimum value, the first value, or the last value. You can also choose statistical formulas such as population variance, population standard deviation, or sample variance. See Aggregation Operations below.

To create one or more aggregate columns, select one column with numeric data and choose an aggregation analytic to apply to the column.

Because StreamLab works with streaming data, you first need to designate a window across which you will perform the aggregation. Windows can be sliding or tumbling (explained below) and designated as a numeric time frame (from seconds to years).

For tumbling windows, you can also group aggregations into categories, such as departments, industries, regions, states, and so on, using the Group By option, including grouping results by multiple categories, such as region, then department (use the + to add additional grouping columns).

You can apply multiple aggregations by using the + button in the right corner of the aggregation section of the Aggregate tab.

sl_aggregate

Types of Windows

Sliding Window

All streaming rows have a timestamp called ROWTIME. This means that you can tell StreamLab "aggregate all the rows in the past hour" or "aggregate all the rows in the past minute" and so on. Again, high numbers may affect performance, so experiment accordingly.

Sliding windows display data continuously, maintaining a running aggregation for the specified time period.

This is the most common way to aggregate data.

Tumbling Window

This data window also uses ROWTIME, and lets you tell StreamLab  "aggregate data for an hour, then clear data, then give me data for the next hour," at, for example, 2:00 PM, 3:00 PM, 4:00 PM, and so on.

Tumbling windows display data only at specified intervals. That is, if you choose 5 seconds, a new row appears every five seconds with aggregated data. For non-aggregated rows (the rows that are, for example, averaged for the aggregation), you can choose to either drop non-aggregated columns, display the first value for non-aggregated columns, or display the last value.

With tumbling windows, you can also choose to use a group by, drawing on a column with limited values, such as "dept A, dept B, dept C." When you apply a group by, StreamLab emits one row for each group that appears window, that is, every 10 seconds a row for dept A, B, and C, provided that each group has appeared in the window. See Using Group By below.

Implementing an Aggregate Analytic

To implement a Aggregate analytic.

1.Select a mode for the aggregation, sliding or tumbling. (See definitions above.)
2.Select a length and unit for the aggregation window, such as "1 hour" or "15 minutes."
3.Select a column to aggregate. This column must contain numerical data.

Note: You should select a column before selecting an operation, because aggregation options change depending on the column's type.

4.Select an operation to aggregate values for the new column. This might be an average temperature, or a running total. See Aggregation Operations below. You can add multiple analytics by using the + button.
5.Enter a column name for the resulting column or columns.

For sliding window, you can also choose to group the column's results. To do so, choose a column with a limited set of values, such as region, department, product line, and so on.

For tumbling windows (which display data at specified intervals), you need to choose what to do with other columns: either drop these (so that only aggregated columns display), use the first value in the window, or use the last value in the window. With tumbling windows, you can also choose to apply a GROUP BY using a column of limited values.

6.Click the + icon to add the command to the Guide script.
7.The results of the script appear in the Output View window.

Aggregation Operations

Average. Returns the average of all the value expressions evaluated for each row in the aggregation.

Count. Returns the number of qualifying rows in the aggregation

Count Distinct. Returns the number of qualifying non-duplicate rows in the aggregation

First Value. Returns the evaluation of the <value expression> from the first row that qualifies for the aggregate.

First Value Ignore Nulls. Returns first non null value of x in <window-specification>

First Value Respect Nulls. Returns first value, including null of x in <window-specification>

Last Value. LAST_VALUE returns the evaluation of the <value expression> from the last row that qualifies for the aggregate. Returns last value, including null of x in <window-specification>

Last Value Ignore Nulls. Returns last non null value of x in <window-specification>

Last Value Respect Nulls. Returns last value, including null of x in <window-specification>

Maximum. Returns the maximum value of all the value expressions evaluated for each row in the aggregation. For string values, MAX is determined by which string is last in the collating sequence.

Minimum. Returns the minimum value of all the value expressions evaluated for each row in the aggregation.

Sum. SUM returns the sum of all the value expressions evaluated for each row in the aggregation.

Population Variance. This is a standard statistical analytic method that returns the population variance of a non-null set of numbers (nulls being ignored). Population variance uses the following calculation:

(SUM(expr*expr) - SUM(expr)*SUM(expr) / COUNT(expr)) / COUNT(expr)

In other words, for non-null values in the aggregation window, and using S1 as the sum of the values and S2 as the sum of the squares of the values, Population Variance returns the result (S2-S1*S1/N)/N.

Population Standard Deviation. This is a standard statistical analytic method that returns the square root of the population variance (defined below) of the row's value, evaluated for each row in the aggregated window.

Sample Variance. This is a standard statistical analytic method that returnseturns the sample variance of a non-null set of numbers (nulls being ignored). Sample Variance uses the following calculation:

(SUM(expr*expr) - SUM(expr)*SUM(expr) / COUNT(expr)) / (COUNT(expr)-1)

In other words, for a given set of non-null values, using S1 as the sum of the values and S2 as the sum of the squares of the values, VAR_POP returns the result (S2-S1*S1/N)/(N-1).

Sample Standard Deviation. This is a standard statistical analytic method that returns the statistical standard deviation of all values in number-expression, evaluated for each row remaining in the group and defined as the square root of the sample variance (defined above).

Using Group By with Tumbling Windows

With tumbling windows, you can apply a group by, drawing on a column with limited values. When you apply a group by, StreamLab emits one row for each group that appears in the windowed time period. This allows you to display aggregated data for each department, or region, or building, and so on.

In the screen shot below, for example, StreamLab emits one aggregated row every ten seconds for trip_no 500s and 600s. During the first window, no rows with 600s appeared, so StreamLab only emits an aggregated row for 500s.

sl_aggregate_group_by_result