Stream Computing Architecture (SCA) Approach to Organizing SQL


This topic describes how to organize SQL files using directory structures and naming conventions suited to stream computing. Stream computing has particular requirements for segmenting and organizing SQL: data arrives from sources in various locations and in various formats, and the same set of data may serve multiple business use cases. We recommend building a componentized stream computing system as described below, in order to give yourself the most flexibility in parsing, enriching, and analyzing data.

Some of the common tasks involved with stream computing include:

Importing data from a range of sources. Data can arrive from log files, a network feed, an XML file output by another application, external databases, and so on. Usually, each data source involves one or more SQL files and an s-Server plugin.

Enriching data. This refers to creating a logical join with other data, often legacy data in a database. This could be accomplished through a stream-table join or a UDX. For example, you might have a phone number tracked from a log file that you might join to existing customer accounts, or a list of known problematic numbers.

Performing business logic on data. In many environments, you will apply different sets of business logic to the same source data. For this reason, it's advantageous to separate data acquisition from business logic.

Archiving data. Often, you will want to copy data out of the pipeline into an RDBMS or other long-term storage system.

Stream Computing Processors

We use the term "processor" to refer to groups of SQL and plugins devoted to a particular task or group of tasks. Processors correspond to groups of SQL files that reference both each other and the plugins required to interface with other systems. Processors and links are the basic building blocks of a stream computing application. For example, you might have a source processor with the following files:

setup.sql. This file would create the schema or schemas for the processor.

acquire.sql. This file would load and reference adapters and UDXs, such as the Extendable Common Data Adapter.

condition.sql. This file might perform a number of functions to condition data, such as applying a UDX to apply second-level filtering to acquired data.

deliver.sql. This file would create views and pumps to make the acquired data available for other processors.
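As a sketch of how these files might fit together, the following shows one possible wiring for such a source processor. All names here are hypothetical, and the syntax assumes standard SQLstream DDL:

```sql
-- setup.sql: create the processor's schema (hypothetical name).
CREATE OR REPLACE SCHEMA "sp_gps_MinorPlanet_1";
SET SCHEMA '"sp_gps_MinorPlanet_1"';

-- acquire.sql: define the raw input stream for the processor.
CREATE OR REPLACE STREAM "Gps_InputStream" (
    "vehicleId" VARCHAR(32),
    "rawRecord" VARCHAR(1024)
);

-- condition.sql: a private view applying second-level filtering.
CREATE OR REPLACE VIEW "Cleansed_private" AS
    SELECT STREAM * FROM "Gps_InputStream"
    WHERE "rawRecord" IS NOT NULL;

-- deliver.sql: an output stream plus a pump that feeds it, making
-- the conditioned data available to other processors.
CREATE OR REPLACE STREAM "Gps_OutputStream" (
    "vehicleId" VARCHAR(32),
    "rawRecord" VARCHAR(1024)
);
CREATE OR REPLACE PUMP "800-GpsDeliverPump" STOPPED AS
    INSERT INTO "Gps_OutputStream"
    SELECT STREAM * FROM "Cleansed_private";
```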

By building your stream computing system in this modular way, you can mix and match your source, feature, and mode processors to acquire, condition, enrich, and archive data where you need it. This modular approach also lets you swap sources and features in and out, and add data enrichment or archiving.

Design

Each processor runs in its own schema, created by its setup.sql file. You might run multiple processors in a single instance of SQLstream s-Server. Between linked processors, pumps move data from one processor to the next.

A key facet of stream computing design is to keep row types of input/output streams consistent. These are equivalent to interface definitions in object-oriented methodology.
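For example, a shared stream's declaration acts as the contract between processors. In this hypothetical sketch, the row type is the part that must stay stable:

```sql
-- This row type is the "interface": every processor that reads or
-- writes this stream depends on it, so changing it means changing
-- all publishers and subscribers together.
CREATE OR REPLACE STREAM "Vehicle_OutputStream" (
    "vehicleId" VARCHAR(32),
    "speedMph"  DOUBLE,
    "obsTime"   TIMESTAMP
);
```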

Types of Processors

Processors generally fall into three categories: source, feature, and mode.

Source processors often write into streams, using an INSERT statement. This type of processor receives data from external sources such as log files, databases, TCP sockets, AMQP feeds, and so on; performs operations such as parsing and conditioning (validation, cleansing); and produces a relational streaming output.

Feature processors primarily use SELECT statements, in that they usually read from streams rather than write into them. Each feature processor corresponds to an analytics pipeline that solves a business problem. Feature processors often aggregate data over time windows (sliding or otherwise) to produce results that can be delivered to target systems such as databases or real-time dashboards. You might also use these results to create a feedback loop into the real-time application. For example, in a transport system, you might have the following feature processors:

Travel Time, which calculates the travel time between two points in a transportation network.

Rapid Deceleration of Traffic, which detects rapid deceleration events in a transportation network by monitoring average speeds over different time windows.
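A feature processor of this kind typically centers on a windowed aggregation. The following sketch (hypothetical schema and column names) computes a sliding one-minute average of speed per vehicle:

```sql
-- Sliding one-minute average speed per vehicle, reading from a
-- source processor's output stream (all names hypothetical).
CREATE OR REPLACE VIEW "AvgSpeed_private" AS
SELECT STREAM
    "vehicleId",
    AVG("speedMph") OVER (
        PARTITION BY "vehicleId"
        RANGE INTERVAL '1' MINUTE PRECEDING
    ) AS "avgSpeedMph"
FROM "sp_gps_MinorPlanet_1"."Vehicle_OutputStream";
```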

Mode processors might have enrichment, persistence, and action elements. They combine source processors and feature processors, with links between pairs of processors, to address a specific need in the system. For example, in a transportation system, you might have separate mode processors for roads, the rail network, maritime transportation, and so on, whereas feature processors tend to work over the entire system. As the name suggests, multi-mode processors combine data from multiple mode processors to address other needs.

Schema, File, and Processor Naming Conventions

As in object-oriented computing, naming conventions are important in SCA: proper naming conventions allow scripts to find the right SQL files, coders to find the right schemas, and pumps to start in the proper order.

Schema Names

Each processor (source or feature) creates a schema for all of its SQLstream objects. We recommend naming the schema using the following format, shown here with a source processor as an example:

"<ProcessorType>_<SourceType>_<SourceFormat>_<version>"

e.g., "sp_gps_MinorPlanet_1"

For other processors:

"<ProcessorType>_<ProcessorName>_<version>"

e.g., "fp_TravelTime_1", "mp_Roads_2"
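In SQL, creating these schemas is a one-liner each (the names and version numbers below are illustrative):

```sql
CREATE OR REPLACE SCHEMA "sp_gps_MinorPlanet_1";  -- source processor
CREATE OR REPLACE SCHEMA "fp_TravelTime_1";       -- feature processor
CREATE OR REPLACE SCHEMA "mp_Roads_2";            -- mode processor
```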

 

SQLstream Object Naming Conventions

All SQLstream object names should use quoted CamelCase, as in "TravelTime". Each processor can have one or more input/output streams. Input streams should use an "_InputStream" suffix; likewise, output streams should use an "_OutputStream" suffix. All intermediate objects, including streams and views not intended for use by other processors, should use a "_private" suffix.
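Applied together, these conventions might yield declarations like the following (columns are hypothetical):

```sql
CREATE OR REPLACE STREAM "Gps_InputStream" (   -- consumed by this processor
    "rawRecord" VARCHAR(1024)
);
CREATE OR REPLACE STREAM "Gps_OutputStream" (  -- published to other processors
    "vehicleId" VARCHAR(32)
);
CREATE OR REPLACE VIEW "Cleansed_private" AS   -- internal use only
    SELECT STREAM * FROM "Gps_InputStream";
```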

Pump Names

Pumps will use numeric prefixes based on the topological order of pumps in the entire pipeline.

The following table describes the numeric prefixes to be used for pumps in various processors.

Processor Type    Prefix

Source            800-

Feature           600-

Mode              400-

Multi-mode        200-

Creating an Order for Pumps

Pumps are started in ascending numerical order. Because downstream pumps (multi-mode, mode) carry lower prefixes than the upstream pumps that feed them (feature, source), downstream pumps are already running and ready to consume before upstream pumps begin producing data.
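For example, a startPumps.sql script can honor these prefixes by starting pumps explicitly in ascending order (schema and pump names below are hypothetical):

```sql
-- Downstream pumps first (lower prefixes), upstream pumps last,
-- so no data is produced before its consumers are running.
ALTER PUMP "mp_Roads_2"."400-RoadsArchivePump" START;
ALTER PUMP "fp_TravelTime_1"."600-TravelTimePump" START;
ALTER PUMP "sp_gps_MinorPlanet_1"."800-GpsAcquirePump" START;
```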

Organizing Processors in the File System

It's important to organize processors in the file system so that they can be easily found and easily connected to each other. We recommend the following best practices in file system organization.

Sample File Tree

A sample file tree for a Stream Computing Architecture (SCA) might look like the following:

├── server
│   ├── features
│   │   ├── featureOne
│   │       ├── Enrich
│   │       │   ├── analyze.sql
│   │       │   ├── deliver.sql
│   │       │   ├── package.markdown
│   │       │   └── setup.sql
│   │       └── Rate
│   │           ├── analyze.sql
│   │           ├── deliver.sql
│   │           ├── package.markdown
│   │           └── setup.sql
│   ├── modes
│   │   ├── package.markdown
│   │   └── modeOne
│   │       ├── catalog.sql
│   │       ├── interface.sql
│   │       ├── jndi
│   │       │   └── streamingsolution.mode_1.archive.properties
│   │       ├── package.markdown
│   │       ├── startPumps.sql
│   │       ├── stopPumps.sql
│   │       └── mode_1.conf
│   ├── sources
│   │   └── sourceOne
│   │       ├── acquire.sql
│   │       ├── condition.sql
│   │       ├── deliver.sql
│   │       ├── package.markdown
│   │       └── setup.sql
│   └── utils
│       ├── deploy.sh
│       ├── functions.sql
│       ├── generateCatalog.sh
│       ├── listPumps.sql
│       ├── package.markdown
│       ├── sqldoc.sql
│       └── wrappers.sql
└── streaming_solution_src.tgz

 

Directory Tree Explanation

utils/wrappers.sql. This file contains definitions for all data wrappers for external sources such as log files and databases.

utils/functions.sql. This file contains utility functions and UDXes used by various processor components.

utils/src/. This directory tree holds Java source code for UDXes developed by SQLstream.

sources/<SourceType>/<SourceFormat>/... This directory contains all the scripts and other files for a source processor. For example, the processor for a GPS data feed might sit in a directory called server/sources/gps.

features/<FeatureName>. This directory contains all the scripts and other files for a feature processor, such as server/features/TravelTime/ or server/features/FlowFactors.

modes/<modeName>. This directory contains definition files and scripts to generate a consolidated catalog of the SQL scripts that must be sourced to deploy the mode processor.

Contents of a Source/Feature Processor Directory

The directory for a processor will have a subset of the following files.

setup.sql. This file creates the schema for the processor.

acquire.sql. This file creates the input stream for the processor. For source processors, this file may create a foreign stream to source data from external sources. It may also parse comma-separated records into individual column values using built-in functions such as VARIABLE_COLUMN_LOG_PARSE.

condition.sql. This file contains the validations, filters, and transformations necessary to condition data from the input stream.

analyze.sql. This file contains the analytic functions and aggregations that apply business logic. This file may not be present for source processors.

deliver.sql. Using views and pumps, this file delivers the results of its processing to other processors through nexuses. These nexus points, implemented as s-Server native streams, act as defined interfaces for each mode/multi-mode processor.

Contents of a Mode/Multi-mode Processor directory

A mode processor or multi-mode processor is primarily a collection of source processors and feature processors, with optional mode- or multimode-specific features. An SQLstream s-Server instance could host a mode processor or a multi-mode processor; with distributed capability, a processor may span multiple instances of SQLstream s-Server. The following files will be present in a mode processor directory:

interface.sql  This file defines all public streams (nexuses) for passing streaming results among individual processors. Each processor then may define itself either as a "publisher" or a "subscriber" for a subset of these streams.

<processor>.conf  This file will contain all the source/feature processors as well as links in this mode processor.
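A minimal interface.sql, the first of the files above, might look like the following sketch (schema, stream, and column names are hypothetical):

```sql
-- Public nexus streams for this mode processor. Publishers pump
-- rows in; subscribers read them with SELECT STREAM.
CREATE OR REPLACE SCHEMA "mp_Roads_2";
SET SCHEMA '"mp_Roads_2"';

CREATE OR REPLACE STREAM "TravelTime_OutputStream" (
    "linkId"         VARCHAR(32),
    "travelTimeSecs" DOUBLE
);
```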

Documentation & Comments

To generate documentation automatically, we recommend using Markdown for inline comments. You can then generate HTML documentation ("SQLDoc," in the same sense as "JavaDoc") for each object in the tree. The generated documentation sits alongside the file tree, as in the following example, which corresponds to the featureOne segment of the file tree above.

├── docs
    ├── css
    │   ├── reset.css
    │   └── sqldoc.css
    ├── index.html
    └── server
        ├── features
            └── featureOne
                ├── Enrich
                    ├── analyze.html
                    ├── deliver.html
                    ├── index.html
                    └── setup.html

See http://en.wikipedia.org/wiki/Markdown for details.