
Input Format

Other options are specific to format type:

Reading CSV

To read CSV formatted data, you need to specify a separator character (such as a comma or a pipe) for values. You can also indicate whether or not the file has a header. When the Extensible Common Data Framework parses CSV, each row becomes a row in the stream. Columns are identified by the separator character, which can be a comma, a pipe, or any other character you designate.

s-Server matches data types according to its Ingestion Rules for CSV.

Note: You can also input data in larger chunks and parse it later using the Parser UDX. This UDX calls the parsers listed above in a function. For more information on using functions, see the topic Transforming Data in s-Server in this guide.

Sample Foreign Stream Implementing ECD Adapter to Parse CSV Files

The following example sets up a server called "CSVReaderServer", then creates and sets a schema called "SCHEMA_CSV". It then sets up a foreign stream that will parse columns called recNo, ts, accountNumber, loginSuccessful, sourceIP, destIP, and customerId from a file in /tmp/ called transactions.log.

Note: The examples below use the file system as an input system. To parse CSV over other systems, such as IBM MQ or AMQP, you would need to specify options for these formats. See Reading from Other Sources  for more details.

 

create or replace server "CSVReaderServer" type 'FILE'
foreign data wrapper ECDA;

create or replace schema "SCHEMA_CSV";
set schema 'SCHEMA_CSV';

CREATE OR REPLACE FOREIGN STREAM "CSVReaderStream"
("recNo" INTEGER,
"ts" TIMESTAMP NOT NULL,
"accountNumber" INTEGER,
"loginSuccessful" BOOLEAN,
"sourceIP" VARCHAR(32),
"destIP" VARCHAR(32),
"customerId" INTEGER)
--Columns for the new stream
SERVER "CSVReaderServer"
OPTIONS
(directory '/tmp',
filename_pattern 'transactions\.log',
parser 'CSV',
character_encoding 'UTF-8',
skip_header 'true');
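If your data uses a separator other than a comma, such as a pipe or a tab, you would also declare the separator character in the OPTIONS clause. The following fragment is a minimal sketch; the option name separator and the file name piped_transactions.log are assumptions here, so check the CSV parser option list for your release:

OPTIONS
(directory '/tmp',
filename_pattern 'piped_transactions\.log',   -- hypothetical pipe-delimited file
parser 'CSV',
separator '|',          -- assumed option name for the value separator
character_encoding 'UTF-8',
skip_header 'true');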

 

To actually begin reading from the file, use a statement along the lines of:

select stream * from "CSVReaderStream";

 

Sample Properties Implementing ECD Agent to Parse CSV Files

To parse CSV files with the ECD Agent, configure the options above using the ECD Agent property file with properties similar to the following:

ROWTYPE=RECORDTYPE(VARCHAR(2040) id, VARCHAR(2040) reported_at, VARCHAR(2040) shift_no, VARCHAR(2040) trip_no, VARCHAR(2040) route_variant_id)

DIRECTORY=/tmp

FILENAME_PATTERN=transactions\.log

PARSER=CSV

CHARACTER_ENCODING=UTF-8

SKIP_HEADER=TRUE

Reading XML

To read XML data with the ECD Adapter and Agent (ECDA Framework), you need to specify a location for the data to be read and information on the data structure.

Overview

The XML parser converts data from the XML file format into regular SQLstream rows. To read data from XML files using the ECD Adapter or Agent, you define a foreign stream based on the input sources that uses the XML parser, or, with the ECDA agent, define a properties file with similar options. As with other input data formats, you specify the location of the files and other attributes that control how the files are discovered and read. Like any stream definition, the foreign stream has a definite row-type: a set of columns with names and data-types.

XML is a complex and flexible format, so it is also necessary to specify how a portion of the XML input is recognized as the source of one row, and how parts of that XML input are recognized as the column values of that row. This is done by pattern matching, with patterns defined as simple xpath expressions. There is an xpath that matches a row, and an xpath for each column. More information on XPath can be found at http://en.wikipedia.org/wiki/XPath.

See Ingestion Rules for XML for details on how s-Server parses XML data.

Note: You can also input data in larger chunks and parse it later using the Parser UDX. This UDX calls the parsers listed above in a function. For more information on using functions, see the topic Transforming Data in s-Server in this guide.

How it Works

As input files are read, their contents are appended to a long stream of XML data. As data flows in, s-Server parses it, watching for xpath matches. Whenever the row xpath matches a chunk of input, s-Server emits an output row. Inside that chunk, each column xpath that matches supplies the text value of its column.

Only simple xpaths are allowed as patterns. A pattern may be absolute (starts with a slash) or relative (doesn't start with a slash).

s-Server supports patterns built from tag-names (like /a/b/c), attributes (like /a/b/c@d), and simple conditions (like /a/b/c[@d="foo"]).

A pattern like /a/b/c matches tag <c> when nested inside a <b> inside an <a>.  The value of the match is the text inside the tag: all the text between this <c> and the matching </c>. Leading and trailing whitespace are stripped.

A pattern like /a/b/c@d is almost the same: it matches tag <c> inside a <b> inside an <a>, but the value of the match is the d attribute of the tag. That is to say, <c d="foo">...</c> is a match with the text value foo (no quotes, and not stripped of whitespace). If there is no attribute d, there is no match; that is, the value is null.

s-Server also supports simple conditions in xpaths, namely [@a] (which means that attribute a is present) and [@a=x] (which means that attribute a is present and has the value "x"). Here x must be a string literal. So the path /a/b[@type="cheese"]/c is like /a/b/c, but with the added requirement that the <b> element found has the attribute type="cheese".
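As a worked example (a hypothetical fragment, not taken from the samples below), given the input

<a><b type="cheese"><c d="foo">bar</c></b></a>

the pattern /a/b/c matches with the value bar, the pattern /a/b/c@d matches with the value foo, and the pattern /a/b[@type="cheese"]/c also matches with the value bar. If the <b> element did not carry type="cheese", the last pattern would not match and the corresponding column would be null.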

An absolute xpath for a column means start matching from the xml document root. Note that the input stream is a series of well balanced xml documents, so the root occurs over and over again.

A relative xpath for a column means start matching from the start of the row. The row xpath itself must be absolute.

Note: More complicated xpaths are not supported. In particular, xpath conditions that test calculated expressions are not supported. You can get the same effect by filtering the foreign stream with a SQL WHERE clause. See the topic WHERE Clause of the SELECT Statement in the Streaming SQL Reference Guide for more details.
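For example, using the PLAYERS stream defined later in this topic, a condition that an xpath cannot express can simply be applied in the query (a sketch; substitute your own stream and columns):

select stream * from PLAYERS where occupation = 'hunter';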

Note: You cannot have two columns in the same stream defined by exactly the same pattern. But two xpath patterns can be similar or overlap.

Nested Rows

It is common for an XML item to contain one or more nested subitems. This is naturally represented in SQL as two foreign streams parsed from the same input files: a stream of the main items, and a related stream of the subitems.

On the SQL side, the streams are tied together by

a primary key column (or columns) in the main stream
a foreign key in the subitem stream.

To produce this, just use the same xpath for the primary key and the foreign key. For each subitem, the value of the foreign key will be found outside the subitem xml, inside the main item xml. Hence the xpath must be absolute, and not based on the subitem row.

However, the foreign key must occur in the main item XML before any subitems.

Sample SQL

Here is a very simple example that illustrates the above points.

Note: The examples below use the file system as an input system. To parse XML over other systems, such as IBM MQ or AMQP, you would need to specify options for these formats. See Reading from Other Sources  for more details.

The XML input looks like:

 <row><name>Elmer Fudd</name><occupation>hunter</occupation></row>

 <row><name>Bugs Bunny</name><occupation>wabbit</occupation></row>

the foreign server is defined as:

CREATE SERVER FileReaderServer TYPE 'FILE'

FOREIGN DATA WRAPPER ECDA;

 

and the stream definition is

 CREATE FOREIGN STREAM PLAYERS (

   name        varchar(16),

   occupation  varchar(16)

 ) SERVER FileReaderServer OPTIONS (

   DIRECTORY '/data/cartoons', FILENAME_PATTERN '.*xml',

   PARSER 'XML',

   PARSER_XML_ROW_TAGS   '/row',

   name_xpath 'name',

   occupation_xpath 'occupation'

 );
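To start reading from this stream, you could then run a query along the lines of:

select stream * from PLAYERS;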

 

A Bigger Example

Here is a more realistic example, taken from an actual application. The input data is a record of retail sales transactions. Each transaction contains several line items, some of which are things purchased, but others are only cash register actions. The xml data is very detailed, and we will pick out a few fields (defined as xpaths). We will produce two related streams, a stream of sales transactions and a stream of line items. To tie them together, we recognize some identifier fields in the sales transactions and use the same values as foreign keys in the stream of line items. First, a sample input data file, much abridged. Skip past this and refer back to it from the explanation of the DDL:

<POSLog xmlns:tri="http://www.triversity.com/TE/integration/">

  <Header>

  <MessageId />

     <Timestamp>2008-05-15T11:47:05</Timestamp>

     <Originator>TE TransactionPostWorkflow</Originator>

     <SequenceNumber>1</SequenceNumber>

  </Header>

  <Body>

     <tri:RetailTransaction>

        <RetailStoreID>Triversity.1</RetailStoreID>

        <WorkstationID>POS.1.1</WorkstationID>

        <SequenceNumber>198</SequenceNumber>

        <BusinessDayDate>2008-04-26</BusinessDayDate>

        <BeginDateTime>2008-05-15T11:46:12</BeginDateTime>

        <EndDateTime>2008-05-15T11:47:05</EndDateTime>

        <OperatorID>222</OperatorID>

        <CurrencyCode>USD</CurrencyCode>

        <TillID>3</TillID>

        <TillSupervisor>222</TillSupervisor>

        <LineItem>

           <SequenceNumber>2</SequenceNumber>

           <EndDateTime>2008-05-15T11:46:12</EndDateTime>

           <tri:NativeLineDetail>

              <tri:LineNumber>2</tri:LineNumber>

              <tri:LineType>Item</tri:LineType>

              <tri:ActionCode>SELL</tri:ActionCode>

           </tri:NativeLineDetail>

           <SupplementalData>

              <ITEM_KEY type="String">12345000001</ITEM_KEY>

              <QUANTITY type="BigDecimal">3</QUANTITY>

              <_datasource_ type="String">Keyboard</_datasource_>

              <msrpPrice type="Money">0.00</msrpPrice>

              <CUSTOMER_PROFILE_ID type="String" />

           </SupplementalData>

           <EntryMethod>Keyed</EntryMethod>

           <Sale ItemType="Stock">

              <POSIdentity>

                 <POSItemID>12345000001</POSItemID>

              </POSIdentity>

              <MerchandiseHierarchy Level="Department">

                 2221

              </MerchandiseHierarchy>

              <Description>SILK TIE</Description>

              <RegularSalesUnitPrice>45.00</RegularSalesUnitPrice>

              <ActualSalesUnitPrice>45.00</ActualSalesUnitPrice>

              <ExtendedAmount>135.00</ExtendedAmount>

              <Quantity Units="1" UnitOfMeasureCode="Each">

                 3

              </Quantity>

              <Tax>

                 <TaxGroupID>TG1</TaxGroupID>

                 <TaxableAmount>-1</TaxableAmount>

              </Tax>

           </Sale>

        </LineItem>

        <LineItem>

           <SequenceNumber>3</SequenceNumber>

           <EndDateTime>2008-05-15T11:46:17</EndDateTime>

           <tri:NativeLineDetail>

              <tri:LineNumber>3</tri:LineNumber>

              <tri:LineType>Item</tri:LineType>

              <tri:ActionCode>SELL</tri:ActionCode>

           </tri:NativeLineDetail>

           <SupplementalData>

              <ITEM_KEY type="String">12345000001</ITEM_KEY>

              <_datasource_ type="String">Keyboard</_datasource_>

              <msrpPrice type="Money">0.00</msrpPrice>

              <CUSTOMER_PROFILE_ID type="String" />

           </SupplementalData>

           <EntryMethod>Keyed</EntryMethod>

           <Sale ItemType="Stock">

              <POSIdentity>

                 <POSItemID>12345000001</POSItemID>

              </POSIdentity>

              <MerchandiseHierarchy Level="Department">

                 2221

              </MerchandiseHierarchy>

              <Description>SILK TIE</Description>

              <RegularSalesUnitPrice>45.00</RegularSalesUnitPrice>

              <ActualSalesUnitPrice>35.00</ActualSalesUnitPrice>

              <ExtendedAmount>35.00</ExtendedAmount>

              <Quantity Units="1" UnitOfMeasureCode="Each">

                 1

              </Quantity>

              <RetailPriceModifier MethodCode="PriceOverride">

                 <SequenceNumber>1</SequenceNumber>

                 <Amount Action="Subtract">10</Amount>

                 <PreviousPrice>45.00</PreviousPrice>

                 <SupplementalData>

                    <DISCOUNT_DESCRIPTION />

                    <SOURCE_LINE_NUMBER>0</SOURCE_LINE_NUMBER>

                    <LINE_NUMBER>3</LINE_NUMBER>

                    <PRICE>35.00</PRICE>

                    <REASON_CODE>Markdown</REASON_CODE>

                    <PARAMETER>LINE_NUMBER</PARAMETER>

                    <_datasource_>Keyboard</_datasource_>

                    <loyaltyPromotionMessage />

                    <loyaltyPromotionAcceptedQualifier />

                 </SupplementalData>

              </RetailPriceModifier>

              <Tax>

                 <TaxGroupID>TG1</TaxGroupID>

                 <TaxableAmount>-1</TaxableAmount>

              </Tax>

           </Sale>

        </LineItem>

         <Total TotalType="TransactionNetAmount"> <Amount>219.98</Amount> </Total>

         <Total TotalType="TransactionGrandAmount"> <Amount>237.58</Amount> </Total>

         <Total TotalType="TransactionDueAmount"> <Amount>237.58</Amount> </Total>

         <Associate> <AssociateID>222</AssociateID> </Associate>

  </tri:RetailTransaction>

</Body>

</POSLog>

 

Here is some DDL that matches the above data, and produces two related streams.

First, we define a stream of transactions. Each <tri:RetailTransaction> element produces one row, so the row tag xpath (which must be a full, absolute xpath that starts at the document root) is /POSLog/Body/tri:RetailTransaction.

All the columns come from inside this XML element, so they are defined as xpaths relative to it (no leading slash). We take the first three subelements to be a 3-part primary key that identifies the transaction. Later we will use this to tie the LineItem rows to the RetailTransaction row.

Some columns are direct subelements of the row element, such as RetailStoreID and TillID. This is the simplest case. The xpath for a column defaults to the name of the column as a relative path, so these options could be omitted.

Some columns are deeper subelements of the row element, like TransactionType.

Some columns are defined by an xpath with a condition. The transaction has three distinct total amounts, which become the three columns TransactionNetTotal, TransactionGrandTotal, and TransactionAmountDue. These come from three XML elements that have the same path, Total/Amount, but are distinguished by different values of the attribute <Total TotalType="xxx">.

Note that the column "BeginDateTime" is declared as VARCHAR, though it is actually a timestamp, because the xml parser cannot produce a SQL timestamp value.

CREATE FOREIGN STREAM "RetailTransactions"(

   -- unique key is (RetailStoreID, WorkstationID, SequenceNumber)

   "RetailStoreID"                     VARCHAR(128),

   "WorkstationID"                     VARCHAR(40),

   "SequenceNumber"                    BIGINT,

   "BeginDateTime"                     VARCHAR(25),    -- TIMESTAMP,

   "DeviceID"                          BIGINT,

   "SubdeviceID"                       BIGINT,

   "TransactionType"                   VARCHAR(40),

   "TransactionItemCount"              INT,

   "KeyToken"                          VARCHAR(128),

   "OperatorID"                        INT,

   "TillID"                            INT,

   "TransactionNetTotal"               DECIMAL(14,2),

   "TransactionGrandTotal"             DECIMAL(14,2),

   "TransactionAmountDue"              DECIMAL(14,2)

 ) SERVER "FileReaderServer" OPTIONS (

       DIRECTORY '/test/data',

       FILENAME_PATTERN '.*xml',

       CHARACTER_ENCODING 'ISO-8859-1',

       parser 'XML',

       PARSER_XML_ROW_TAGS                 '/POSLog/Body/tri:RetailTransaction',

       "RetailStoreID_XPATH"               'RetailStoreID',

       "WorkstationID_XPATH"               'WorkstationID',

       "SequenceNumber_XPATH"              'SequenceNumber',

       "BeginDateTime_XPATH"               'BeginDateTime',

       "DeviceID_XPATH"                    'tri:NativeTrxDetail/tri:DeviceID',

       "SubdeviceID_XPATH"                 'tri:NativeTrxDetail/tri:SubdeviceID',

       "TransactionType_XPATH"             'tri:NativeTrxDetail/tri:TransactionType',

       "TransactionItemCount_XPATH"           'tri:NativeTrxDetail/tri:TransactionItemCount',

       "KeyToken_XPATH"                    'tri:NativeTrxDetail/tri:KeyToken',

       "OperatorID_XPATH"                  'OperatorID',

       "TillID_XPATH"                      'TillID',

       "TransactionNetTotal_XPATH"         'Total[@TotalType="TransactionNetAmount"]/Amount',

       "TransactionGrandTotal_XPATH"       'Total[@TotalType="TransactionGrandAmount"]/Amount',

       "TransactionAmountDue_XPATH"        'Total[@TotalType="TransactionDueAmount"]/Amount'

);

 

Finally, we look at the definition of the stream of LineItems, which illustrates a few more features of the parser. Each row comes from an element <LineItem> which is inside a <tri:RetailTransaction>, as specified by the row tag. The first three columns are foreign keys that refer to a row in the parent stream RetailTransactions. In fact the parser finds the values of the foreign keys by looking inside the enclosing <tri:RetailTransaction> element, since the xpaths for these three fields are absolute paths that lead into the enclosing element.

Again we see columns that are direct subelements (like ItemSequenceNumber) and deeper subelements (like LineNumber). Note that the subelement /Sale/Quantity is the source of two columns: the body of the element produces the column "Quantity" and the attribute UnitOfMeasureCode produces the column Units.

CREATE FOREIGN STREAM "LineItems" (

   -- foreign keys: (RetailStoreID, WorkstationID, TransactionSequenceNumber)

   "RetailStoreID"                     VARCHAR(128),

   "WorkstationID"                     VARCHAR(40),

   "TransactionSequenceNumber"         BIGINT,

   "ItemSequenceNumber"                BIGINT,      -- unique

   "LineNumber"                        INT,

   "ActionCode"                        VARCHAR(40),

   "ITEM_KEY"                          VARCHAR(40),

   "EntryMethod"                       VARCHAR(40),

   "ItemType"                          VARCHAR(40),

   "POSItemID"                         BIGINT,

   "Department"                        INT,

   "Quantity"                          INT,

   "Units"                             VARCHAR(40),

   "RegularSalesUnitPrice"             DECIMAL(14,2),

   "ActualSalesUnitPrice"              DECIMAL(14,2),

   "ExtendedAmount"                    DECIMAL(14,2),

)

 SERVER "FileReaderServer"

 OPTIONS (

   DIRECTORY '/work/dt/src/cust/bigbox/test/data',

   FILENAME_PATTERN '.*xml',

   CHARACTER_ENCODING 'ISO-8859-1',

   parser 'XML',

   PARSER_XML_ROW_TAGS                   '/POSLog/Body/tri:RetailTransaction/LineItem',

   "RetailStoreID_XPATH"                 '/POSLog/Body/tri:RetailTransaction/RetailStoreID',

   "WorkstationID_XPATH"                 '/POSLog/Body/tri:RetailTransaction/WorkstationID',

   "TransactionSequenceNumber_XPATH"     '/POSLog/Body/tri:RetailTransaction/SequenceNumber',

   "ItemSequenceNumber_XPATH"            'SequenceNumber',

   "LineNumber_XPATH"                    'tri:NativeLineDetail/tri:LineNumber',

   "ActionCode_XPATH"                    'tri:NativeLineDetail/tri:ActionCode',

   "ITEM_KEY_XPATH"                      'SupplementalData/ITEM_KEY',

   "EntryMethod_XPATH"                   'EntryMethod',

   "ItemType_XPATH"                      'Sale@ItemType',

   "POSItemID_XPATH"                     'Sale[@ItemType="Stock"]/POSIdentity/POSItemID',

   "Department_XPATH"                    'Sale[@ItemType="Stock"]/MerchandiseHierarchy[@Level="Department"]',

   "RegularSalesUnitPrice_XPATH"         'Sale[@ItemType="Stock"]/RegularSalesUnitPrice',

   "ActualSalesUnitPrice_XPATH"          'Sale[@ItemType="Stock"]/ActualSalesUnitPrice',

   "ExtendedAmount_XPATH"                'Sale[@ItemType="Stock"]/ExtendedAmount',

   "Quantity_XPATH"                      'Sale[@ItemType="Stock"]/Quantity',

   "Units_XPATH"                         'Sale[@ItemType="Stock"]/Quantity@UnitOfMeasureCode'

);

 

Sample Properties Implementing ECD Agent to Parse XML Files

To parse XML files with the ECD Agent, configure the options above using the ECD Agent property file with properties similar to the following:

ROWTYPE=RECORDTYPE(VARCHAR(128) COL1, VARCHAR(40) COL2, BIGINT COL3, BIGINT COL4, INT COL5, VARCHAR(40) COL6, VARCHAR(40) COL7, VARCHAR(40) COL8, VARCHAR(40) COL9, BIGINT COL10, INT COL11, INT COL12, VARCHAR(40) COL13, DECIMAL(14,2) COL14, DECIMAL(14,2) COL15, DECIMAL(14,2) COL16)
DIRECTORY=/work/dt/src/cust/bigbox/test/data
FILENAME_PATTERN=.*xml
PARSER=XML
CHARACTER_ENCODING=UTF-8
SKIP_HEADER=TRUE
PARSER_XML_ROW_TAGS=/POSLog/Body/tri:RetailTransaction/LineItem
RetailStoreID_XPATH=/POSLog/Body/tri:RetailTransaction/RetailStoreID
WorkstationID_XPATH=/POSLog/Body/tri:RetailTransaction/WorkstationID
TransactionSequenceNumber_XPATH=/POSLog/Body/tri:RetailTransaction/SequenceNumber
ItemSequenceNumber_XPATH=SequenceNumber
LineNumber_XPATH=tri:NativeLineDetail/tri:LineNumber
ActionCode_XPATH=tri:NativeLineDetail/tri:ActionCode
ITEM_KEY_XPATH=SupplementalData/ITEM_KEY
EntryMethod_XPATH=EntryMethod
ItemType_XPATH=Sale@ItemType
POSItemID_XPATH=Sale[@ItemType="Stock"]/POSIdentity/POSItemID
Department_XPATH=Sale[@ItemType="Stock"]/MerchandiseHierarchy[@Level="Department"]
RegularSalesUnitPrice_XPATH=Sale[@ItemType="Stock"]/RegularSalesUnitPrice
ActualSalesUnitPrice_XPATH=Sale[@ItemType="Stock"]/ActualSalesUnitPrice
ExtendedAmount_XPATH=Sale[@ItemType="Stock"]/ExtendedAmount
Quantity_XPATH=Sale[@ItemType="Stock"]/Quantity
Units_XPATH=Sale[@ItemType="Stock"]/Quantity@UnitOfMeasureCode

 

 

Reading JSON

You can read from JSON data sources that are continually updated (streaming) using s-Server’s Extensible Common Data framework. s-Server’s JSON parsing is based on the Jackson JSON processor. More information on this processor is found at https://github.com/FasterXML/jackson-docs.

s-Server matches data types according to its Ingestion Rules for JSON.

Note: You can also input data in larger chunks and parse it later using the Parser UDX. This UDX calls the parsers listed above in a function. For more information on using functions, see the topic Transforming Data in s-Server in this guide.

JSON processing works like this:

1.s-Server receives a message from a data source such as the file system, Kafka, or AMQP. Each message may contain one or more JSON records.
2.As with all data sources, s-Server reads the JSON records into a foreign stream. Each JSON record may get parsed into multiple rows of the stream.
3.s-Server parses these JSON records using a JSON path that you define for each column of the foreign stream. A JSON path describes the location of a field in a JSON record. JSON paths are analogous to XPath expressions for XML files.

To implement the JSON parser, you create a foreign stream for a source, as described in the topics under Reading from Other Sources. Under stream or column options, you need to set the following stream options:

PARSER - This needs to be JSON.
ROW_PATH - This is the JSON path for the row to be found. The JsonPath parser uses a row path to find JSON objects containing a row, then the path for each column is used to find the value to be extracted.
<COLUMN_NAME>_PATH - Optional for each column in the ECDA column set. This defaults to $..<COLUMN_NAME>. Here, a column named 'FOO' would have an option named FOO_PATH that defaults to $..FOO, which would return the first property named FOO under the JSON object found by the ROW_PATH.

Both kinds of paths begin with '$' to indicate the root JSON object or array followed by either .. (scan), a property key, or [] syntax indicating either a list of property keys or an array selection.  A row pattern might be $[0:] which indicates all the objects in an array. Then a column pattern such as $..address would extract the text in the address property of the JSON object in the top level array.

For example, a JSON record like:

{

 "book": "A brief History Of Time",

 "author": "Stephan Hawking"

}

 

can be parsed into a stream with two columns, "book" and "author". To accomplish this, you define two JSON paths: $.book for the column "book" and $.author for the column "author". The character "$" indicates the start of the data you want to stream, and the character "." indicates the start of a specific JSON record. There are more complicated ways to express JSON paths for columns, but this is the basic way.
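A minimal foreign stream for this simple case might look like the following sketch. The stream name, server name, and file options here are placeholders (see Reading from Other Sources for the options your input system actually needs); only the parser and path options illustrate the point:

CREATE OR REPLACE FOREIGN STREAM "BookStream"
("book" VARCHAR(64),
"author" VARCHAR(64))
SERVER "FileReaderServer"
OPTIONS
(directory '/tmp',
filename_pattern 'books\.json',
parser 'JSON',
ROW_PATH '$',
"book_PATH" '$.book',
"author_PATH" '$.author');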

Here’s a more complicated example:

{

 "books": [

           { "book": "A brief History Of Time",

             "author": "Stephan Hawking"

           },

           { "book": "The world is flat",

             "author": "Tom Friedman"

           },

           { "book": "A theory of justice",

             "author": "John Rawls"

           }

          ]

}

 

Here each JSON record is an array of embedded JSON records. In this example, you first need to define a ROW_PATH as '$.books[0:]'. s-Server uses ROW_PATH to determine the end of a row when parsing JSON records with embedded arrays. You then define the following column paths. Each path becomes a column in the stream:

Column name | Column Path Name | Path Definition
book        | book_PATH        | $.books[0:].book
author      | author_PATH      | $.books[0:].author

JSON paths can use a wild card '..' to indicate any arbitrary path prefix or suffix. The paths described above can also be defined simply as

ROW_PATH => '$.books[0:]'

book_PATH => '$..book',        -- default path when not specified

author_PATH => '$..author',    -- default path when not specified

 

Note: s-Server does not support referencing an array element with a specific subscript. It supports only generic references to "every" element of the array through the wildcard [0:].

In the example above, a single JSON record with an array of embedded JSON records gets parsed into three rows of the stream "books". Whenever there is an embedded array, it can be parsed into as many rows as the number of elements of the array.

The JSON parser creates a "tree" representation of a JSON record. Each field is a "leaf" node of the tree, while embedded JSON records or arrays are intermediate (non-leaf) nodes. As the JSON parser "walks" over this tree representation, "end of a row" event is generated when parser "walks back" to the node represented by "ROW_PATH" defined above.

For a row path, the parser may find a category like "books" or "alerts" multiple times. For a column path, the parser finds a value once, such as "Great Expectations" or "Pride and Prejudice".

Unnesting an Array

The process of turning an embedded array into a stream of rows that correspond to elements of the array is called "unnesting" the array into the containing JSON record. s-Server can only unnest one array in a JSON record. Other arrays can be extracted as a single VARCHAR column. These restrictions are described at the end of this topic.

For example:

{

 "store": "Amazon",

 "books": [

           { "book": "A brief History Of Time",

             "author": "Stephan Hawking"

           },

           { "book": "The world is flat",

             "author": "Tom Friedman"

           },

           { "book": "A theory of justice",

             "author": "John Rawls"

           }

          ],

 "report_date": "2016-04-01 00:00:00"

 

can be turned into a stream of four columns as follows:

book                      | author            | store    | report_date
"A Brief History of Time" | "Stephan Hawking" | "Amazon" | "2016-04-01 00:00:00"
"The World is Flat"       | "Tom Friedman"    | "Amazon" | "2016-04-01 00:00:00"
"A Theory of Justice"     | "John Rawls"      | "Amazon" | "2016-04-01 00:00:00"

 

The JSON path for each column would be,

book_PATH => '$.books[0:].book',

author_PATH => '$.books[0:].author',

store_PATH => '$.store',

report_date_PATH => '$.report_date',

ROW_PATH => '$'

 

Note that  "ROW_PATH" is  simply '$' as each row is a result of cartesian product of

the fields parsed from each element of the books' array
the fields parsed from the array containing JSON record.

Note: You can extract additional nested arrays as a single VARCHAR column, and then parse these using the Parser UDX. See the topic Parser UDX for more information.

Customizing JSON Paths

You can also use a custom parser for COLUMN_PATH, by adding the following parameter to the stream or server definition:

CUSTOM_TYPE_PARSER_<column_name>

This option allows you to override the parsing of an individual column. Specify a fully qualified Java class name that implements com.sqlstream.aspen.namespace.common.TypeParser.
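For example, to override how a column named "ts" is parsed, you might add an option along these lines to the stream or server definition. The class name com.example.MyTimestampParser is a hypothetical placeholder; the class you supply must implement the TypeParser interface named above and be available on the s-Server class path:

CUSTOM_TYPE_PARSER_ts 'com.example.MyTimestampParser'

If the column name is a quoted, case-sensitive identifier, you may need to quote the option name so that it matches the column name exactly.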

Restrictions to Parsing JSON

There are some additional restrictions on parsing JSON records with arrays embedded at multiple levels.

Only one embedded array can be "unnested." The JSON parser will not unnest more than one embedded array.

Arrays that are not being unnested can only be parsed as VARCHAR columns whose value is the entire array object as unparsed JSON text.

That is, JSON paths like '$..array1[0:]..array2[0:]..field' are not supported. In other words, the JSON parser will not unnest more than one embedded array. If you specify JSON paths that require unnesting multiple arrays, this is reported as a SQL validation error. (In some cases, "multiple unnesting" can be determined only during execution and is reported as a runtime error.)

For example, given the following JSON, the path definitions "col1_PATH" = '$..col1' and "col2_PATH" = '$..col2' would require unnesting multiple arrays, and would therefore be invalid.

{

 "array1": [ { ..., "col1": 1234.56, ... }, ... ],

 "array2": [ { ..., "col2": "val2", ... }, ... ],

 "col3": "val3"

}

 

For cases where there are multiple embedded arrays, you can parse JSON records by unnesting one of those embedded arrays while passing other embedded arrays as "unparsed JSON text" and then parsing these using the Parser UDX.

For the JSON record above, the streaming abstraction can be:

CREATE OR REPLACE FOREIGN STREAM json_stream (
    "col1" DOUBLE,
    "array2Col" VARCHAR(1024),
    "col3" VARCHAR(128)
)
SERVER FileReaderServer
OPTIONS (
  ...,
  "col1_PATH" '$.array1[0:].col1',
  "array2Col_PATH" '$.array2',    -- alternatively, '$.array2[0:]'
  "col3_PATH" '$.col3',
  "ROW_PATH" '$',
  ...
);

 

However, when ROW_PATH itself represents a path for an embedded array, s-Server can unnest one more embedded array within its elements.

That is, the path "col_PATH" = '$.array1[0:].array2[0:].col' is allowed, even though it means unnesting both array1 and array2, as long as ROW_PATH is '$.array1[0:]'.

JSON and Ambiguous Paths

The JSON parser cannot handle ambiguous paths. Ambiguous paths that cannot be detected at stream definition time will return undefined results.

Given the following JSON record,

{

 "operation": "someOperation",

    "slots":  [{"slotName": "1"}, {"slotName": "2"}],

    "slots2": [{"slotName": "3"}, {"slotName": "4"}]

}

 

the following column path for a SLOTNAME column will be ambiguous, because '$..slotName' fits more than one of the nested arrays.

SLOTNAME_PATH '$..slotName'   -- any field with a name 'slotName'.

 

For an ambiguous path, a field that matches the path last in a JSON record will be read as the value.

If you define a stream with two such paths, each unnesting a different array, it will fail with an exception:

CREATE FOREIGN STREAM fs (

  slotname   varchar(32),

  slotname2  varchar(32)

)

SERVER ...

OPTIONS (

  PARSER 'JSON',

  ...,

  ROW_PATH '$',

  SLOTNAME_PATH '$.slots[0:].slotName',

  SLOTNAME2_PATH '$.slots2[0:].slotName'

);

java.sql.SQLException: Cannot parse fields of embedded arrays "$.slots2[0:]" and "$.slots[0:]" simultaneously. One of these embedded arrays must be extracted as an 'unparsed' column of VARCHAR type by specifying its path as column path.

 

However, the following stream definition works: it unnests slots[0:] and extracts the slots2 array as a single unparsed VARCHAR value:

CREATE FOREIGN STREAM fs (

  slotname   varchar(32),

  slots2  varchar(64)

)

SERVER ...

OPTIONS (

  PARSER 'JSON',

  ...,

  ROW_PATH '$',

  SLOTNAME_PATH '$.slots[0:].slotName',

  SLOTS2_PATH '$.slots2[0:]'

);

 

You can parse the nested array 'slots2' with a cascaded view definition that invokes another instance of the JSONPath parser as a UDX, if necessary. See the topic Using the Parser UDX in this guide for more details.

Reading Key-Value Pairs

The KeyValue parser reads files with data in the format Key=Value, where "Key" is the exact name of a column in the foreign stream. (The parser will ignore any key-value pairs whose keys are not defined as columns.) To parse key-value data with the Extensible Common Data Adapter (ECDA), you need to specify a location for the file to be read and indicate KV for parser.

Column types also need to match. If, for example, the file contains a key value CUSTOMER_ID="Alan" and CUSTOMER_ID is defined as INTEGER, parsing will fail and the error will be logged to the tracer. If the column is quoted, the key must match exactly. If the column is unquoted, the key must be in upper case (e.g. CUSTOMER_ID).

Note: You can also input data in larger chunks and parse it later using the Parser UDX. This UDX calls the parsers listed above in a function. For more information on using functions, see the topic Transforming Data in s-Server in this guide.

Example of a Workload File

CUSTOMER_ID=66,CUSTOMER_NAME="Jane Liu",CONTACT_NAME="David",ADDRESS="Linda-a-velha",CITY="Lisbon",POSTAL_CODE="2795",COUNTRY="Portugal"

CUSTOMER_ID=72,CUSTOMER_NAME="Ricardo Gomes Pereira",CONTACT_NAME="Ricardo Pereira",ADDRESS="Oeiras",CITY="Lisbon",POSTAL_CODE="2500",COUNTRY="Portugal"

CUSTOMER_ID=99,CUSTOMER_NAME="Juan Villa",CONTACT_NAME="Juan",ADDRESS="Trujillo",CITY="Cáceres",POSTAL_CODE="11125",COUNTRY="Spain"

CUSTOMER_ID=25,CUSTOMER_NAME="John Moore",CONTACT_NAME="John",ADDRESS="158 - 7th Ave.",CITY="Boston",POSTAL_CODE="90110",COUNTRY="USA"

CUSTOMER_ID=51,CUSTOMER_NAME="Rachel Morgan",CONTACT_NAME="Rachel",ADDRESS="352 9th Ave.",CITY="New York",POSTAL_CODE="65123",COUNTRY="USA"

 

Sample Foreign Stream to Parse Key Value Pairs

You indicate column names when you set up the stream, as in the following example, which creates a stream with the column names "customer_id," "customer_name," "contact_name," "address," "city," "postal_code," and "country." These columns will be assigned data from the matching keys found in the key-value file.

The following example sets up a server called "KVReaderServer", then creates and sets a schema called "SCHEMA_KEY_VALUE". It then creates a foreign stream to parse columns called ts, accountNumber, sourceIP, destIP, and customerID from a file in /opt/sqlstream/5.1.0.14245/s-Server/ParserFiles/. To parse JSON over other input/output systems, such as IBM MQ or AMQP, you would need to specify options for these formats. See Reading from Other Sources for more details.

CREATE OR REPLACE SCHEMA "SCHEMA_KEY_VALUE";
SET SCHEMA 'SCHEMA_KEY_VALUE';
CREATE OR REPLACE SERVER "KVReaderServer" TYPE 'FILE'
FOREIGN DATA WRAPPER ECDA;
CREATE OR REPLACE FOREIGN STREAM KeyValueStreamExample
(CUSTOMER_ID INTEGER,
CUSTOMER_NAME VARCHAR(32),
CONTACT_NAME VARCHAR(32),
ADDRESS VARCHAR(32),
CITY VARCHAR(32),
POSTAL_CODE VARCHAR(32),
COUNTRY VARCHAR(32))
--Columns for the new stream

--These map to the key values in the file 
SERVER "KVReaderServer"
OPTIONS
(directory '/tmp/',
filename_pattern 'workload_kv',
parser 'KV',
character_encoding 'UTF-8',
STATIC_FILES 'true');

 

To actually begin reading from the file, use a statement along the lines of

select stream * from KeyValueStreamExample as t;

 

 

Reading ProtoBuf Files

You can use s-Server's Extensible Common Data framework to parse files developed with the Google protocol buffer compiler. These files are compiled from .proto files. Using the ECD framework, you create a foreign stream, and the framework converts ProtoBuf messages into the stream's columns, which can be specified explicitly. You only need to declare columns for the message fields that you want to parse.

Before using the ECD parser for ProtoBuf, you need to first write .proto files and then compile them to Java classes using the compiler that Google provides. See https://developers.google.com/protocol-buffers/ for more details.

The Google protocol buffer compiler lets you define optional, repeated, and required fields. In parsing these field types, the ECD framework allows a single repeated inner message field. If you specify a message as repeated, then one row will be output per repetition. All non-repeated fields will be duplicated between those rows.

You also need to set up a server object for the data source. This can be a file, a Kafka topic, an AMQP message bus, a network socket, or an MQSeries topic/queue.

Note: You can also input data in larger chunks and parse it later using the Parser UDX. This UDX calls the parsers listed above in a function. For more information on using functions, see the topic Transforming Data in s-Server in this guide.

Proto File Used in Example Below

Our example for s-Server uses a simple .proto file, which appears below.

package protobuf_data_definition;

 

option optimize_for = SPEED;

option java_package = "com.protobuf.types";

option java_outer_classname = "protobuf.PackageProto";

 

message protobufPackage {
   // ID 1 is reserved in case we ever want to add a self-describing attribute
   // timestamp in millis
   optional uint64 timestamp = 2;
   // accountNumber
   optional string accountNumber = 3;
   // sourceIP
   optional string sourceIP = 4;
   // loginSuccessful
   optional bool loginSuccessful = 5;
   // user customerId
   optional string customerId = 6;
}

Sample Foreign Stream for ProtoBuf

In order to access data from an external source using the Extensible Common Data Adapter, you need to create a special kind of stream called a foreign stream. Once you create this stream, you can query it in s-Server, as you would any other table or stream. Options in the stream specify options specific to the format type.

Note: Use the ECD adapter when you are accessing a data source on the same machine. For data sources in a remote location, use the ECD agent. See the topic Extensible Common Data Agent Overview for more details.

You indicate column names when you set up the stream, as in the following example, which creates a stream with the column names ts, accountNumber, sourceIP, loginSuccessful, and customerId. These five columns will be populated with messages from the ProtoBuf file. In the foreign stream below, these columns are explicitly assigned data from a path within the schema using the <column name>_PATH option.

Note: All specified stream columns must map onto protocol buffer message fields. (Some columns are specified and populated by the data source, such as OFFSET for a Kafka source or FILE_NAME for a file source.)

Like all streams, foreign streams must be created within a schema. The example below creates and sets a schema called ProtoBufSchema, creates a server object using a Kafka topic as a data source, and creates a foreign stream called ProtoBufTest, which specifies a server called KafkaServer, as well as options specific to both the ProtoBuf parser and the Kafka server. Note that the options topic and Starting_Time are both specific to Kafka. To parse ProtoBuf over other input/output systems, such as IBM MQ or AMQP, you would need to specify options for these formats. See Reading from Other Sources for more details.

CREATE OR REPLACE SCHEMA "ProtoBufSchema";

SET SCHEMA 'ProtoBufSchema';

 

CREATE OR REPLACE SERVER "KafkaServer" TYPE 'KAFKA'

FOREIGN DATA WRAPPER ECDA;

 

CREATE OR REPLACE FOREIGN STREAM ProtoBufTest
("ts" TIMESTAMP,
"accountNumber" INTEGER,
"sourceIP" VARCHAR(32),
"loginSuccessful" BOOLEAN,
"customerId" INTEGER)   --Columns for the new stream
SERVER "KafkaServer"

OPTIONS

(topic 'test', --Kafka topic

"STARTING_TIME" 'latest', --Time to start reading Kafka topic.

                         --Options are LATEST, EARLIEST,

                         --or a long int representing a timestamp.

                         --Defaults to LATEST.

"SEED_BROKERS" 'localhost',

"PORT" '9092', --Host and port information for Kafka server

parser 'PROTOBUF',      --Tells ECD adapter to parse files as ProtoBuf

ts_PATH 'mytable.timestamp', --Schema path that maps to the

                            --foreign stream column "ts"

                            --Note how this corresponds to .proto file above.

                            --The same pattern holds for the next four options.

accountNumber_PATH 'mytable.accountNumber',

sourceIP_PATH 'mytable.sourceIP',

loginSuccessful_PATH 'mytable.loginSuccessful',

customerId_PATH 'mytable.customerId',

SCHEMA_JAR 'unitsql/concurrent/plugins/common/protobufData/protobufpackage.jar',

                            --JAR generated with Google Protocol Buffer compiler.

SCHEMA_CLASS 'protobuf.PackageProto.protobuf.PackageProto$protobufPackage'

                            --Outer package generated with Google Protocol Buffer

                            --compiler. $ separates inner from outer package.

                            --Note how these correspond to .proto file above.

);

 

 

 

Parsing Variable Column Files

To parse Variable Column files with the Extensible Common Data Adapter (ECDA), you need to specify a location for the file to be read, a separator character (such as a comma or a pipe), and VCLP for parser.

Note: Unless you are using the VCLP parser for legacy purposes, the CSV option will most likely suit your needs better than the VCLP option. Because you can use any separator with the CSV parser (not only commas), the CSV option generally offers more flexibility than the VCLP.

The Variable Column Log Parser splits an input string into fields separated by a delimiter character or delimiter string. It can handle comma-separated values, tab-separated values, or values separated by any other character.

Column names cannot be dynamically assigned with Variable Column files. You need to declare these as part of the foreign stream or table, as shown in the sketch below.

Note: You can also input data in larger chunks and parse it later using the Parser UDX. This UDX calls the parsers listed above in a function. For more information on using functions, see the topic Transforming Data in s-Server in this guide.
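A minimal sketch of a VCLP foreign stream appears below. The stream, server, and file names are placeholders, and the separator option name is an assumption; check the VCLP option list for your release:

CREATE OR REPLACE FOREIGN STREAM "VariableColumnStream"
("ts" VARCHAR(32),
"accountNumber" VARCHAR(32),
"sourceIP" VARCHAR(32))
SERVER "FileReaderServer"
OPTIONS
(directory '/tmp',
filename_pattern 'transactions\.log',
parser 'VCLP',
separator ',',
character_encoding 'UTF-8');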

 

 

Parsing Fixed Column Files

The Fixed Column option filters fixed-width fields and converts them to rows with columns of the declared types.

The starting position of a column is 0. Column specifications for the types DATE, TIME, and TIMESTAMP support a format parameter allowing you to specify the exact time component layout. The parser uses the Java class java.text.SimpleDateFormat to parse the strings for types DATE, TIME, and TIMESTAMP. The Date and Time Patterns topic in the s-Server Streaming SQL Reference Guide gives a full description and examples of timestamp format strings. The following is an example of a column definition with a format string:

"name" TIMESTAMP 'dd/MMM/yyyy:HH:mm:ss'

 

You implement the Fixed Column parser of the Extensible Common Data Adapter by setting the parser option to FCLP under stream options. Column names cannot be dynamically assigned with Fixed Column files. You need to declare these as part of the foreign stream or table.

Note: You can also input data in larger chunks and parse it later using the Parser UDX. This UDX calls the parsers listed above in a function. For more information on using functions, see the topic Transforming Data in s-Server in this guide.

Sample Foreign Stream Implementing ECD Adapter to Parse FCLP Files

The following example will parse the first seven columns from a file in /tmp with the filename_pattern transactions\.log and assign these to stream columns called recNo, ts, accountNumber, loginSuccessful, sourceIP, destIP, and customerId. To parse Fixed Column Data over other input/output systems, such as IBM MQ or AMQP, you would need to specify options for these formats. See Reading from Other Sources for more details.

Note: Information on file location, file name pattern and character encoding can also be set as server options.

CREATE OR REPLACE FOREIGN STREAM "Fixed_Column_Log_Parse"

("recNo" INTEGER,

"ts" TIMESTAMP NOT NULL,

"accountNumber" INTEGER,

"loginSuccessful" BOOLEAN,

"sourceIP" VARCHAR(32),

"destIP" VARCHAR(32),

"customerId" INTEGER,)

SERVER "FileReaderServer"

OPTIONS

(directory '/tmp',

filename_pattern 'transactions\.log',

character_encoding 'UTF-8',

parser 'FCLP');

 

Sample Properties Implementing ECD Agent to Parse FCLP Files

To parse FCLP files with the ECD Agent, configure the options above using the ECD Agent property file with properties similar to the following. To parse Fixed Column Data over other input/output systems, such as IBM MQ or AMQP, you would need to specify options for these formats. See Reading from Other Sources for more details.

ROWTYPE=RECORDTYPE(VARCHAR(2040) id, VARCHAR(2040) reported_at, VARCHAR(2040) shift_no, VARCHAR(2040) trip_no, VARCHAR(2040) route_variant_id)

FILENAME_PATTERN=TRANSACTIONS\.LOG

PARSER=FCLP

CHARACTER_ENCODING=UTF-8

SKIP_HEADER=TRUE

 

 

 

Parsing Files with the FastRegex Row Filter

The Fast Regex Row Filter option of the Extensible Common Data Adapter works by applying a Java Regular Expression to a log file, identifying columns by groups in the expression. The FastRegex parser uses the FastRegex parser function, described in the topic FAST_REGEX_LOG_PARSER in the Streaming SQL Reference Guide. That function can be used anywhere in your code. The FastRegex parser for the Extensible Common Data Framework lets you parse data as it comes into s-Server. That may be desirable for performance or other reasons.

The Fast Regex Row Filter works by first decomposing the regular expression into a series of regular expressions, one for each expression inside a group and one for each expression outside a group. Any fixed length portions at the start of any expressions are moved to the end of the previous expression. If any expression is entirely fixed length, it is merged with the previous expression. The series of expressions is then evaluated using lazy semantics with no backtracking. (In regular expression parsing parlance, "lazy" means don't parse more than you need to at each step. "Greedy" means parse as much as you can at each step. "Backtracking" means if something doesn't match the expression, you go back and start at the previous level of expression).

The columns returned will be COLUMN1 through COLUMNn, where n is the number of groups in the regular expression. The columns will be of type varchar(1024).

FastRegex Row Filter uses a lazy search - it stops at the first match. By contrast, the default Java regex parser is greedy unless possessive quantifiers are used.

FastRegex Row Filter scans the supplied input string for all the characters specified in the FILTER_PATTERN option. All characters in that input string must be accounted for by the characters and scan groups defined in the fast_regex_pattern. Scan groups define the fields (columns) that result when a scan is successful.

If all characters in the FILTER_PATTERN are accounted for when the fast_regex_pattern is applied, then FastRegex Row Filter creates an output field (column) from each parenthetical expression in that fast_regex_pattern, in left-to-right order: The first (leftmost) parenthetical expression creates the first output field, the second parenthetical expression creates the second output field, up through the last parenthetical expression creating the last output field.
If the supplied input_string contains any characters not accounted for (matched) by applying fast_regex_pattern, then the FastRegex Row Filter returns no fields at all.
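For instance, a hypothetical pattern such as 'ERROR: (\d+) (.*)' applied to the input 'ERROR: 404 not found' accounts for every character and produces two fields from its two groups, 404 and not found, in that order. Applied to 'WARN: 404 not found', the literal prefix 'ERROR: ' is not matched, so no fields are returned at all.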

Implementing the FastRegex Row Filter with the Extensible Common Data Adapter

To use the Extensible Common Data Adapter with FastRegex, you set the parser to FastRegex, then pass in groups of filters that will map to columns. The following example uses FastRegex to parse columns for accountNumber, sourceIP, destIP, and customerId.

Column names cannot be dynamically assigned with the FastRegex Row Filter. You need to declare these as part of the foreign stream or table.

Sample Foreign Stream Implementing ECD Adapter to Parse Files with the FastRegex Row Filter

The following example will parse columns called recNo, ts, accountNumber, sourceIP, destIP, and customerId from a file in /path/to/logfile. To parse with FastRegex over other input/output systems, such as IBM MQ or AMQP, you would need to specify options for these formats. See Reading from Other Sources for more details.

Note: Information on file location, file name pattern and character encoding can also be set as server options.

CREATE OR REPLACE FOREIGN STREAM "FAST_REGEX_LOG_PARSE"

("recNo" INTEGER,

"ts" TIMESTAMP NOT NULL,

"accountNumber" INTEGER,

"sourceIP" VARCHAR(32),

"destIP" VARCHAR(32),

"customerId" INTEGER,)

SERVER "FileReaderServer"

OPTIONS

(DIRECTORY '/path/to/logfile',

ENCODING 'UTF-8',

PARSER 'FastRegex',

FILTER_PATTERN '\t '889'\d\d\d\d\d' , '(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])$' , '(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])$'  , '\d\d\d\d\d',

);

Sample Properties Implementing ECD Agent to Parse Files with the FastRegex Row Filter

To parse files with the FastRegex Row Filter using the ECD Agent, configure the options above using the ECD Agent property file with properties similar to the following:

ROWTYPE=RECORDTYPE(VARCHAR(2040) id, VARCHAR(2040) reported_at, VARCHAR(2040) shift_no, VARCHAR(2040) trip_no, VARCHAR(2040) route_variant_id)

DIRECTORY=/tmp
FILENAME_PATTERN=transactions\.log

PARSER=FastRegex

CHARACTER_ENCODING=UTF-8

SKIP_HEADER=TRUE

SEPARATOR=u\000A

FILTER_PATTERN=\t '889'\d\d\d\d\d' , '(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])$' , '(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])$'  , '\d\d\d\d\d)

 

 

 

 

Parsing W3C Files

The W3C option lets you parse logs generated by W3C-compliant applications. Log file entries are described using data specifiers defined in the Apache mod_log_config documentation. The W3C parser uses the W3C parser function, described in the topic W3C_LOG_PARSE in the Streaming SQL Reference Guide. That function can be used anywhere in your code. The W3C parser for the Extensible Common Data Framework lets you parse W3C log data as it comes into s-Server. That may be desirable for performance or other reasons.

To use the Extensible Common Data Adapter with W3C files, you set parser to W3C, then pass in groups of filters that will map to columns. The W3C parser takes one additional property, PARSER_FORMAT, such as COMMON in the example below.

Column names cannot be dynamically assigned with W3C files. You need to declare these as part of the foreign stream or table.

Note: SQLstream handles Apache log format specifiers without alteration so a script could copy them from an Apache httpd.conf file directly into log.sqlstream.xml.

Sample Foreign Stream to Parse W3C Files

The following example will parse columns called recNo, ts, accountNumber, loginSuccessful, sourceIP, destIP, and customerId from files in /usr/local/share/sqlstream/datalinks/accesslog/log_data.

Note: Information on file location, file name pattern and character encoding can also be set as server options.

 

CREATE OR REPLACE FOREIGN STREAM "W3C_LOG_PARSE"

("recNo" INTEGER,

"ts" TIMESTAMP NOT NULL,

"accountNumber" INTEGER,

"loginSuccessful" BOOLEAN,

"sourceIP" VARCHAR(32),

"destIP" VARCHAR(32),

"customerId" INTEGER,)

SERVER "FileReaderServer"

OPTIONS

(DIRECTORY '/usr/local/share/sqlstream/datalinks/accesslog/log_data',

            --assume a link to the real location of the data

            filename_pattern 'log.\d{4}(-\d\d){2}',  -- e.g. log.2011-09-04

            encoding 'UTF-8',

            parser 'W3C',

            parser_format 'COMMON'

            )

   DESCRIPTION 'CommonAdapter Foreign Stream';

 

Sample Properties Implementing ECD Agent to Parse W3C Files

To parse W3C files with the ECD Agent, configure the options above using the ECD Agent property file with properties similar to the following:

ROWTYPE=RECORDTYPE(VARCHAR(2040) id, VARCHAR(2040) reported_at, VARCHAR(2040) shift_no, VARCHAR(2040) trip_no, VARCHAR(2040) route_variant_id)

DIRECTORY=/usr/local/share/sqlstream/datalinks/accesslog/log_data
FILENAME_PATTERN=log.\d{4}(-\d\d){2}
PARSER=W3C
CHARACTER_ENCODING=UTF-8
SKIP_HEADER=TRUE

PARSER_FORMAT=COMMON