Format data in StreamLab
With most real world data, you will need to do some formatting before you can work with it. Usually, this comes down to two things: changing the names of columns so that they are meaningful, and changing the data types for the columns.
Like databases, s-Server stores columns with a type. The data type lets s-Server understand how SQL will interact with the column's data. For example, you cannot perform numerical calculations on columns with a text data type.
You can use StreamLab guides to format the data.
Guides are collections of commands, suggestions, and scripts that let you create views on data sources. These views are composed of SQL. You use the Guide interface to generate this SQL.
Begin by adding a new guide to StreamLab.
|2.||With your mouse, click and drag Guide, then drop it in the dotted box.|
|3.||Click the new Guide|
|4.||Click StreamLab_Output.line_1. (This is the source that you created in the previous section.)|
Guides are stored as collections of views in s-Server. VIEWs are reusable queries on a stream or table--in this case, on the stream called StreamLab_Output.line1.
If you want, you can type in a more meaningful prefix to be used in naming these views.
You have now added the log file as a streaming source for the guide.
As you make changes, changes appear in the Script. As you add items to the Script, the streaming database in s-Server is modified. You can always undo script changes.
StreamLab suggests changes that you might make to the data. As data comes into StreamLab, a piece of software called the Scrutinizer examines the data, looking for possible commands to apply to the data. For example, it might notice that the file contains regular commas, and suggest splitting the file into columns at every comma.
The table in the middle, the output view, shows the stream resulting from the steps in the script, the output of this Guide.
In this case, StreamLab has noticed several ways it made be split into columns. The first suggestion is to split on commas. Click Add to add it to the script.
Notice that the new columns are all named line_1, line_2, line_3, and so on. You want to make these columns more readable by assigning them meaningful names. Remember the information window from the buses demo? Let's use it to rename all the columns at once. Before beginning, go to that window and copy the list of columns.
|1.||Click Rename List|
|2.||Click list of columns. This parameter is now controlled by which columns are selected in the output view.|
|3.||Click line_1, scroll to the right end and click line_16.|
|4.||Shift-Click line_16 to select the range of columns|
|5.||Right-click on the to list parameter and paste in the list of columns we grabbed.|
|6.||Click here to finish editing this field.|
Once you have finished here, the data has been added to StreamLab, and you're ready to start making calculations on the data.
In the next steps, you will format the data in order to perform analytics on it.
StreamLab, and s-Server, depend on a special column called rowtime to deliver accurately timed data. Often, as a developer, you will want to take a column from the existing data with a timestamp and "promote" it to rowtime.
Here, you want to get the timestamp from the reported_at field. But as is often the case with real-world data, you will need to modify this column before it is recognizable as a timestamp: the extra zeros at the end are preventing it from being recognized as a valid time.
To do so, you can extract the timestamp from the column.
1. Click on any cell to select that row. This also stops the output view from scrolling. You can use the button in the upper left corner of the table to clear the selection and resume scrolling.
2. Doubleclick one of the cells in the reported_at column.
3. Select all of the text up to right before the period. Note that the column, start, and length parameters to the extract operation have been filled in the Suggestions window.
4. Add the suggestion Extract text from reported... to the script.
The substring column appears in the table.
Note that the new substring column is outlined in yellow, indicating that it contains valid time values. By default, though, all columns are set to a text type. You need to tell StreamLab that this column is, in fact, a timestamp.
Two suggestions to convert it appear. One also will rename it to ROWTIME. Click Add on that one.
You are now working with the times at which the bus data was originally recorded, instead of the time it was written to our /tmp/buses.log file.
In the next step, you will change the data type of the Speed column, so that StreamLab can analyze it as a number. (Remember, you cannot perform numerical calculations on columns with a text data type.)
Note that the speed column is outlined in yellow. This means that StreamLab thinks it is a number.
In this case we only want to cast one column, so we'll do it by hand.
1. Click Cast.
2. Scroll to the speed column and select it.
3. Select the type DOUBLE
4. Click Add.
As you have been applying commands, StreamLab has been generating SQL.
SQL stands for Structured Query Language and has been a standard since 1986. SQLstream uses a slight variant of SQL called streaming SQL.
For all projects, you can view the SQL generated by the project, as well as the SQL for individual pages. To do so, click the Goals and Scripts button in the upper right corner of StreamLab .
Note that StreamLab has created a series of views that make up the Guide.
At this point, you have formatted your imported data, and you are ready to perform a calculation on it. Proceed to the next section.