Reading XML

<< Click to Display Table of Contents >>

Navigation:  Integrating SQLstream Blaze with Other Systems > Reading Data into s-Server > Reading from Other Sources  > Parser Types for Reading >

Reading XML

Previous pageReturn to chapter overviewNext page

To read XML data with the ECD Adapter and Agent (ECDA Framework), you need to specify a location for the data to be read and information on the data structure.

Overview

The XML parser converts data from the XML file format into regular SQLstream rows. To read data from XML files using the ECD Adapter or Agent, you define a foreign stream based on the input sources that uses the XML parser, or, with the ECDA agent, define a properties file with similar options. As with other input data formats, you specify the location of the files and other attributes that control how the files are discovered and read. Like any stream definition, the foreign stream has a definite row-type: a set of columns with names and data-types.

XML is a complex and flexible format, so it is also necessary to specify how a portion of the XML input is recognized as the source of one row, and how parts of that XML input are recognized as the column values of that row. This is done by pattern matching, with patterns defined as simple xpath expressions. There is an xpath that matches a row, and an xpath for each column. More information on XPath can be found at http://en.wikipedia.org/wiki/XPath.

See Ingestion Rules for XML for details on how s-Server parses XML data.

Note: You can also input data in larger chunks and parse it later using the Parser UDX. This UDX calls the parsers listed above in a function. For more information on using functions, see the topic Transforming Data in s-Server in this guide.

How it Works

As input files are read, their contents are appended to a long stream comprised of XML data. As data flows in, s-Server parses the data, watching for xpath matches. Whenever the row xpath matches a chunk of input, s-Server emits an output row has been found. Inside this chunk, as each column xpath matches, the value of that column has been found as text.

Only simple xpaths are allowed as patterns. A pattern may be absolute (starts with a slash) or relative (doesn't start with a slash).

s-Server supports patterns built from tag-names (like /a/b/c), attributes (like /a/b/c@d), and simple conditions (like /a/b/c[@d="foo"]).

A pattern like /a/b/c matches tag <c> when nested inside a <b> inside an <a>.  The value of the match is the text inside the tag: all the text between this <c> and the matching </c>. Leading and trailing whitespace are stripped.

A pattern like /a/b/c@d is almost the same: it matches tag <c> inside a <b> inside an <a>, but the value of match is the d attribute of the tag. That is to say, <c d="foo">...</c> is a match with the test value foo (no quotes, and not stripped of whitespace). If there is no attribute d, there is no match - ie the value is null.

s-Server also supports simple conditions in xpaths,  namely [@a] (which means that attribute a is present)  and [@a=x]  (which means that attribute a is present and has the value "x'). Here x must be a string literal. So the path /a/b[@type="cheese"]/c is like /a/b/c, but with the added requirement that the <b> element found has the attribute type="cheese".

An absolute xpath for a column means start matching from the xml document root. Note that the input stream is a series of well balanced xml documents, so the root occurs over and over again.

A relative xpath for a column means start matching from the start of the row. The row xpath itself must be absolute.

Note: More complicated xpaths are not supported. In particular, xpath conditions that test calculated expressions are not supported. You can get the same effect by filtering the foreign stream with a SQL WHERE clause. See the topic WHERE Clause of the SELECT Statement in the Streaming SQL Reference Guide for more details.

Note: You cannot have two columns in the same stream defined by exactly the same pattern. But two xpath patterns can be similar or overlap.

Nested Rows

It is common for an XML item to contain one or more nested subitems. This is naturally represented in SQL as two foreign streams parsed from the same input files: a stream of the main items, and a related stream of the subitems.

On the SQL side, the streams are tied together by

a primary key column (or columns) in the main stream

a foreign key in the subitem stream.

To produce this, just use the same xpath for the primary key and the foreign key. For each subitem, the value of the foreign key will be found outside the subitem xml, inside the main item xml. Hence the xpath must absolute, and not based on the subitem row.

However, the foreign key must occur in the main item XML before any subitems.

Sample SQL

Here is a very simple example that illustrates the above points.

Note: The examples below use the file system as an input system. To parse XML over other systems, such as IBM MQ or AMQP, you would need to specify options for these formats. See Reading from Other Sources  for more details.

The XML input looks like:

 <row><name>Elmer Fudd</name><occupation>hunter</occupation></row>

 <row><name>Bugs Bunny</name><occupation>wabbit</occupation></row>

the foreign server was defined as:

CREATE SERVER FileReaderServer TYPE 'FILE'

FOREIGN DATA WRAPPER ECDA;

 

and the stream definition is

 CREATE FOREIGN STREAM PLAYERS (

   name        varchar(16),

   occupation  varchar(16)

 ) SERVER FileReaderServer OPTIONS (

   DIRECTORY '/data/cartoons', FILENAME_PATTERN '.*xml',

   PARSER 'XML',

   PARSER_XML_ROW_TAGS   '/row',

   name_xpath 'name',

   occupation_xpath 'occupation'

 );

 

A Bigger Example

Here is a more realistic example, taken from an actual application. The input data is a record of retail sales transactions. Each transaction contains several line items, some of which are things purchased, but others are only cash register actions. The xml data is very detailed, and we will pick out a few (defined as xpaths). We will produce two related streams, a stream of sales transactions and a stream of line items. To tie them together, we recognize some identifier fields in the sales transactions and use the same values as foreign keys in the stream of line items. First, a sample input data file, much abridged. Skip past this and refer back from the explanation of the DDL:

<POSLog xmlns:tri="http://www.triversity.com/TE/integration/">

  <Header>

  <MessageId />

     <Timestamp>2008-05-15T11:47:05</Timestamp>

     <Originator>TE TransactionPostWorkflow</Originator>

     <SequenceNumber>1</SequenceNumber>

  </Header>

  <Body>

     <tri:RetailTransaction>

        <RetailStoreID>Triversity.1</RetailStoreID>

        <WorkstationID>POS.1.1</WorkstationID>

        <SequenceNumber>198</SequenceNumber>

        <BusinessDayDate>2008-04-26</BusinessDayDate>

        <BeginDateTime>2008-05-15T11:46:12</BeginDateTime>

        <EndDateTime>2008-05-15T11:47:05</EndDateTime>

        <OperatorID>222</OperatorID>

        <CurrencyCode>USD</CurrencyCode>

        <TillID>3</TillID>

        <TillSupervisor>222</TillSupervisor>

        <LineItem>

           <SequenceNumber>2</SequenceNumber>

           <EndDateTime>2008-05-15T11:46:12</EndDateTime>

           <tri:NativeLineDetail>

              <tri:LineNumber>2</tri:LineNumber>

              <tri:LineType>Item</tri:LineType>

              <tri:ActionCode>SELL</tri:ActionCode>

           </tri:NativeLineDetail>

           <SupplementalData>

              <ITEM_KEY type="String">12345000001</ITEM_KEY>

              <QUANTITY type="BigDecimal">3</QUANTITY>

              <_datasource_ type="String">Keyboard</_datasource_>

              <msrpPrice type="Money">0.00</msrpPrice>

              <CUSTOMER_PROFILE_ID type="String" />

           </SupplementalData>

           <EntryMethod>Keyed</EntryMethod>

           <Sale ItemType="Stock">

              <POSIdentity>

                 <POSItemID>12345000001</POSItemID>

              </POSIdentity>

              <MerchandiseHierarchy Level="Department">

                 2221

              </MerchandiseHierarchy>

              <Description>SILK TIE</Description>

              <RegularSalesUnitPrice>45.00</RegularSalesUnitPrice>

              <ActualSalesUnitPrice>45.00</ActualSalesUnitPrice>

              <ExtendedAmount>135.00</ExtendedAmount>

              <Quantity Units="1" UnitOfMeasureCode="Each">

                 3

              </Quantity>

              <Tax>

                 <TaxGroupID>TG1</TaxGroupID>

                 <TaxableAmount>-1</TaxableAmount>

              </Tax>

           </Sale>

        </LineItem>

        <LineItem>

           <SequenceNumber>3</SequenceNumber>

           <EndDateTime>2008-05-15T11:46:17</EndDateTime>

           <tri:NativeLineDetail>

              <tri:LineNumber>3</tri:LineNumber>

              <tri:LineType>Item</tri:LineType>

              <tri:ActionCode>SELL</tri:ActionCode>

           </tri:NativeLineDetail>

           <SupplementalData>

              <ITEM_KEY type="String">12345000001</ITEM_KEY>

              <_datasource_ type="String">Keyboard</_datasource_>

              <msrpPrice type="Money">0.00</msrpPrice>

              <CUSTOMER_PROFILE_ID type="String" />

           </SupplementalData>

           <EntryMethod>Keyed</EntryMethod>

           <Sale ItemType="Stock">

              <POSIdentity>

                 <POSItemID>12345000001</POSItemID>

              </POSIdentity>

              <MerchandiseHierarchy Level="Department">

                 2221

              </MerchandiseHierarchy>

              <Description>SILK TIE</Description>

              <RegularSalesUnitPrice>45.00</RegularSalesUnitPrice>

              <ActualSalesUnitPrice>35.00</ActualSalesUnitPrice>

              <ExtendedAmount>35.00</ExtendedAmount>

              <Quantity Units="1" UnitOfMeasureCode="Each">

                 1

              </Quantity>

              <RetailPriceModifier MethodCode="PriceOverride">

                 <SequenceNumber>1</SequenceNumber>

                 <Amount Action="Subtract">10</Amount>

                 <PreviousPrice>45.00</PreviousPrice>

                 <SupplementalData>

                    <DISCOUNT_DESCRIPTION />

                    <SOURCE_LINE_NUMBER>0</SOURCE_LINE_NUMBER>

                    <LINE_NUMBER>3</LINE_NUMBER>

                    <PRICE>35.00</PRICE>

                    <REASON_CODE>Markdown</REASON_CODE>

                    <PARAMETER>LINE_NUMBER</PARAMETER>

                    <_datasource_>Keyboard</_datasource_>

                    <loyaltyPromotionMessage />

                    <loyaltyPromotionAcceptedQualifier />

                 </SupplementalData>

              </RetailPriceModifier>

              <Tax>

                 <TaxGroupID>TG1</TaxGroupID>

                 <TaxableAmount>-1</TaxableAmount>

              </Tax>

           </Sale>

        </LineItem>

         <Total TotalType="TransactionNetAmount"> <Amount>219.98</Amount> </Total>

         <Total TotalType="TransactionGrandAmount"> <Amount>237.58</Amount> </Total>

         <Total TotalType="TransactionDueAmount"> <Amount>237.58</Amount> </Total>

         <Associate> <AssociateID>222</AssociateID> </Associate>

  </tri:RetailTransaction>

</Body>

</POSLog>

 

Here is some DDL that matches the above data, and produces two related streams.

First, we define a stream of transactions. Each XML element, such as <tri:RetailTransaction> produces one row, so the row tag xpath (which must be a full, absolute xpath that starts at the document root) is /PosLog/Body/tri:RetailTransaction.  

All the columns come from inside this XML element, so they are defined as xpaths relative to it (no leading slash). We take the first three subelements to be a 3-part primary key that identifies the transaction. Later we will use this to tie the LineItem rows to the RetailTransaction rw.

- Some columns are direct subelements of the row element, such as RetaitStoreID and TillID,

This is the simplest case. The xpath for a column defaults to the name of the column as a relative path, so these options could be omitted.

Some columns are deeper sublements of the row element, like TransactionType

Some columns are defined by an xpath with a condition.

The transaction has three distinct total amounts, which become the three columns TransactionNetTotal, TransactionGrandTotal, and TransactionAmountDue. These come from three XML elements that have the same path, Total/Amount, but that are distinguished by different values of the attribute <Total TotalType=xxx>.

Note that the column "BeginDateTime" is declared as VARCHAR, though it is actually a timestamp, because the xml parser cannot produce a SQL timestamp value.

CREATE FOREIGN STREAM "RetailTransactions"(

   -- unique key is (RetailStoreID, WorkstationID, SequenceNumber)

   "RetailStoreID"                     VARCHAR(128),

   "WorkstationID"                     VARCHAR(40),

   "SequenceNumber"                    BIGINT,

   "BeginDateTime"                     VARCHAR(25),    -- TIMESTAMP,

   "DeviceID"                          BIGINT,

   "SubdeviceID"                       BIGINT,

   "TransactionType"                   VARCHAR(40),

   "TransactionItemCount"              INT,

   "KeyToken"                          VARCHAR(128),

   "OperatorID"                        INT,

   "TillID"                            INT,

   "TransactionNetTotal"               DECIMAL(14,2),

   "TransactionGrandTotal"             DECIMAL(14,2),

   "TransactionAmountDue"              DECIMAL(14,2)

 ) SERVER "FileReaderServer" OPTIONS (

       DIRECTORY '/test/data',

       FILENAME_PATTERN '.*xml',

       CHARACTER_ENCODING 'ISO-8859-1',

       parser 'XML',

       PARSER_XML_ROW_TAGS                 '/POSLog/Body/tri:RetailTransaction',

       "RetailStoreID_XPATH"               'RetailStoreID',

       "WorkstationID_XPATH"               'WorkstationID',

       "SequenceNumber_XPATH"              'SequenceNumber',

       "BeginDateTime_XPATH"               'BeginDateTime',

       "DeviceID_XPATH"                    'tri:NativeTrxDetail/tri:DeviceID',

       "SubdeviceID_XPATH"                 'tri:NativeTrxDetail/tri:SubdeviceID',

       "TransactionType_XPATH"             'tri:NativeTrxDetail/tri:TransactionType',

       "TransactionItemCount_XPATH"           'tri:NativeTrxDetail/tri:TransactionItemCount',

       "KeyToken_XPATH"                    'tri:NativeTrxDetail/tri:KeyToken',

       "OperatorID_XPATH"                  'OperatorID',

       "TillID_XPATH"                      'TillID',

       "TransactionNetTotal_XPATH"         'Total[@TotalType="TransactionNetAmount"]/Amount',

       "TransactionGrandTotal_XPATH"       'Total[@TotalType="TransactionGrandAmount"]/Amount',

       "TransactionAmountDue_XPATH"        'Total[@TotalType="TransactionDueAmount"]/Amount'

);

 

Finally, we look at the definition of the stream of LineItems, which illustrates a few more features of the parser. Each row comes from an element <LineItem> which is inside a <tri:RetailTransaction>, as specified by the row tag. The first three columns are foreign keys that refer to a row in the parent stream RetailTransactions. In fact the parser finds the values of the foreign keys by looking inside the enclosing <tri:RetailTransaction> element, since the xpaths for these three fields are absolute paths that lead into the enclosing element.

Again we see columns that are direct subelements (like ItemSequenceNumber) and deeper sublements (like LineNumber). Note that the subelement /Sale/Quantity is the source of two columns: the body of the element produces the column "Quantity" and the attribute UnitOfMeasureCode produces the column Units.

CREATE FOREIGN STREAM "LineItems" (

   -- foreign keys: (RetailStoreID, WorkstationID, TransactionSequenceNumber)

   "RetailStoreID"                     VARCHAR(128),

   "WorkstationID"                     VARCHAR(40),

   "TransactionSequenceNumber"         BIGINT,

   "ItemSequenceNumber"                BIGINT,      -- unique

   "LineNumber"                        INT,

   "ActionCode"                        VARCHAR(40),

   "ITEM_KEY"                          VARCHAR(40),

   "EntryMethod"                       VARCHAR(40),

   "ItemType"                          VARCHAR(40),

   "POSItemID"                         BIGINT,

   "Department"                        INT,

   "Quantity"                          INT,

   "Units"                             VARCHAR(40),

   "RegularSalesUnitPrice"             DECIMAL(14,2),

   "ActualSalesUnitPrice"              DECIMAL(14,2),

   "ExtendedAmount"                    DECIMAL(14,2),

)

 SERVER "FileReaderServer"

 OPTIONS (

   DIRECTORY '/work/dt/src/cust/bigbox/test/data',

   FILENAME_PATTERN '.*xml',

   CHARACTER_ENCODING 'ISO-8859-1',

   parser 'XML',

   PARSER_XML_ROW_TAGS                   '/POSLog/Body/tri:RetailTransaction/LineItem',

   "RetailStoreID_XPATH"                 '/POSLog/Body/tri:RetailTransaction/RetailStoreID',

   "WorkstationID_XPATH"                 '/POSLog/Body/tri:RetailTransaction/WorkstationID',

   "TransactionSequenceNumber_XPATH"     '/POSLog/Body/tri:RetailTransaction/SequenceNumber',

   "ItemSequenceNumber_XPATH"            'SequenceNumber',

   "LineNumber_XPATH"                    'tri:NativeLineDetail/tri:LineNumber',

   "ActionCode_XPATH"                    'tri:NativeLineDetail/tri:ActionCode',

   "ITEM_KEY_XPATH"                      'SupplementalData/ITEM_KEY',

   "EntryMethod_XPATH"                   'EntryMethod',

   "ItemType_XPATH"                      'Sale@ItemType',

   "POSItemID_XPATH"                     'Sale[@ItemType="Stock"]/POSIdentity/POSItemID',

   "Department_XPATH"                    'Sale[@ItemType="Stock"]/MerchandiseHierarchy[@Level="Department"]',

   "RegularSalesUnitPrice_XPATH"         'Sale[@ItemType="Stock"]/RegularSalesUnitPrice',

   "ActualSalesUnitPrice_XPATH"          'Sale[@ItemType="Stock"]/ActualSalesUnitPrice',

   "ExtendedAmount_XPATH"                'Sale[@ItemType="Stock"]/ExtendedAmount',

   "Quantity_XPATH"                      'Sale[@ItemType="Stock"]/Quantity',

   "Units_XPATH"                         'Sale[@ItemType="Stock"]/Quantity@UnitOfMeasureCode'

);

 

Sample Properties Implementing ECD Agent to Parse XML Files

To parse XML files with the ECD Agent, configure the options above using the ECD Agent property file with properties similar to the following:

ROWTYPE=RecordType (VARCHAR(128) COL1, VARCHAR(40) COL2, BIGINT COL3, BIGINT COL4, INT COL5, VARCHAR(40), VARCHAR(40) COL6, VARCHAR(40) COL7, VARCHAR(40) COL9, INT COL 10, INT COL12, VARCHAR(40) COL13, DECIMAL(14,2) COL14, DECIMAL(14,2) COL15, DECIMAL(14,2) COL16)

DIRECTORY=/work/dt/src/cust/bigbox/test/data

PARSER=XML

CHARACTER_ENCODING=UTF-8

SKIP_HEADER=TRUE

PARSER_XML_ROW_TAGS=/POSLog/Body/tri:RetailTransaction/LineItem

RetailStoreID_XPATH=/POSLog/Body/tri:RetailTransaction/RetailStoreID

WorkstationID_XPATH=/POSLog/Body/tri:RetailTransaction/WorkstationID

TransactionSequenceNumber_XPATH=/POSLog/Body/tri:RetailTransaction/SequenceNumber

ItemSequenceNumber_XPATH'SequenceNumber

LineNumber_XPATH'tri:NativeLineDetail/tri:LineNumber

ActionCode_XPATH'tri:NativeLineDetail/tri:ActionCode

ITEM_KEY_XPATH =SupplementalData/ITEM_KEY

EntryMethod_XPATH  =EntryMethod

ItemType_XPATH =Sale@ItemType

POSItemID_XPATH=Sale[@ItemType=Stock]/POSIdentity/POSItemID

Department_XPATH'Sale[@ItemType=Stock]/MerchandiseHierarchy[@Level=Department]

RegularSalesUnitPrice_XPATH=Sale[@ItemType=Stock]/RegularSalesUnitPrice

ActualSalesUnitPrice_XPATH =Sale[@ItemType=Stock]/ActualSalesUnitPrice

ExtendedAmount_XPATH'Sale[@ItemType=Stock]/ExtendedAmount

Quantity_XPATH =Sale[@ItemType=Stock]/Quantity

Units_XPATH=Sale[@ItemType=Stock]/Quantity@UnitOfMeasureCode