Reading JSON

<< Click to Display Table of Contents >>

Navigation:  Integrating Blaze with Other Systems > Reading Data into s-Server > Reading from Other Sources  > File Formats for Reading >

Reading JSON

Previous pageReturn to chapter overviewNext page

You can read from JSON data sources that are continually updated (streaming) using s-Server’s Extensible Common Data framework. s-Server’s JSON parsing is based on the Jackson JSON processor. More information on this processor is found at https://github.com/FasterXML/jackson-docs.

s-Server matches data types according to its Ingestion Rules for JSON.

Note: You can also input data in larger chunks and parse it later using the Parser UDX. This UDX calls the parsers listed above in a function. For more information on using functions, see the topic Transforming Data in s-Server in this guide.

JSON processing works like this:

1.s-Server receives a message from a data sources such as the file system, kafka, or AMQP. Each message may contain one or more JSON records.
2.As with all data sources, s-Server reads the JSON records into a foreign stream. Each JSON record may get parsed into multiple rows of the stream.
3.s-Server parses these JSON records using a JSON path that you define for each column of the foreign stream. A JSON path describes the location of a field in a JSON record. JSON Path are analogous to XPath expressions for XML files.

To implement the JSON parser, you create a foreign stream for a source, as described in the topics under Reading from Other Sources. Under stream or column options, you need to set the following stream options:

PARSER  This needs to be JSON.
ROW_PATH This is the JSON path for the row to be found. The JsonPath parser uses a row path to find JSON objects containing a row, then the path for each column is used to find the value to be extracted.
<COLUMN_NAME>_PATH Optional for each column in the ECDA column set. This defaults to $..<COLUMN_NAME>. Here, a column named 'FOO' would have an option named FOO_PATH that defaulted to $..FOO which would return the first property named FOO under the JSON object found by the ROW_PATH.

Both kinds of paths begin with '$' to indicate the root JSON object or array followed by either .. (scan), a property key, or [] syntax indicating either a list of property keys or an array selection.  A row pattern might be $[0:] which indicates all the objects in an array. Then a column pattern such as $..address would extract the text in the address property of the JSON object in the top level array.

For example, a JSON record like:

{

 "book": "A brief History Of Time",

 "author": "Stephan Hawking"

}

 

can be parsed into a stream with two columns, "book" and "author". To accomplish this, you define two JSON paths: $.book for the column "book" and $.author for the column "author". The character "$" indicates the start of the data you want to stream, and the character "." indicates the start of a specific JSON record. There are more complicated ways to express JSON paths for columns, but this is the basic way.

Here’s a more complicated example:

{

 "books": [

           { "book": "A brief History Of Time",

             "author": "Stephan Hawking"

           },

           { "book": "The world is flat",

             "author": "Tom Friedman"

           },

           { "book": "A theory of justice",

             "author": "John Rawls"

           }

          ]

}

 

Here each JSON record is an array of embedded JSON records. In this example, you first need to define a ROW_PATH, as '$.books[0:] s-Server uses "ROW_PATH" to determine the end of a row when parsing JSON records with embedded arrays. You then define the following column paths. Each path becomes a column in the stream:

Column name

Column Path Name

Path Definition

book

book_PATH

$.books[0:].book

author

author_PATH

$.books[0:].author

JSON paths can use a wild card '..' to indicate any arbitrary path prefix or suffix. The paths described above can also be defined simply as

ROW_PATH => '$.books[0:]'

book_PATH => '$..book',        -- default path when not specified

author_PATH => '$..author',    -- default path when not specified

 

Note: s-Server does not support referencing an array element with a specific subscript. It supports only generic references to "every" element of the array through the wildcard [0:].

In the example above, a single JSON record with an array of embedded JSON records gets parsed into three rows of the stream "books". Whenever there is an embedded array, it can be parsed into as many rows as the number of elements of the array.

The JSON parser creates a "tree" representation of a JSON record. Each field is a "leaf" node of the tree, while embedded JSON records or arrays are intermediate (non-leaf) nodes. As the JSON parser "walks" over this tree representation, "end of a row" event is generated when parser "walks back" to the node represented by "ROW_PATH" defined above.

For a row path, the parser may find a category like "books" or "alerts" multiple times. For a column path, the parse finds a value one time, such as "Great Expectations" or "Pride and Prejudice".

Unnesting an Array

The process of turning an embedded array into a stream of rows that correspond to elements of the array is called "unnesting" the array into the containing JSON record. s-Server can only unnest one array in a JSON record. Other arrays can be extracted as a single VARCHAR column. These restrictions are described at the end of this topic.

For example:

{

 "store": "Amazon",

 "books": [

           { "book": "A brief History Of Time",

             "author": "Stephan Hawking"

           },

           { "book": "The world is flat",

             "author": "Tom Friedman"

           },

           { "book": "A theory of justice",

             "author": "John Rawls"

           }

          ],

 "report_date": "2016-04-01 00:00:00"

 

can be turned into a stream of four columns as follows

book

author

store

report_date

"A Brief History of Time"

"Stephan Hawking"

"Amazon"

"2020-04-01 00:00:00"

"The World is Flat"

"Tom Friedman"

"Amazon"

"2020-04-01 00:00:00"

"A Theory of Justice"

"John Rawls"

"Amazon"

"2020-04-01 00:00:00"

 

The JSON path for each column would be,

book_PATH => '$.books[0:].book',

author_PATH => '$.books[0:].author',

store_PATH => '$.store',

report_date_PATH => '$.report_date',

ROW_PATH => '$'

 

Note that  "ROW_PATH" is  simply '$' as each row is a result of cartesian product of

the fields parsed from each element of the books' array
the fields parsed from the array containing JSON record.

Note: You can extract additional nested arrays as a single VARCHAR column, and then parse these using the Parser UDX. See the topic Parser UDX for more information.

Customizing JSON Paths

You can also use a custom parser for COLUMN_PATH, by adding the following parameter to the stream or server definition:

CUSTOM_TYPE_PARSER_<column_name>

Allows overriding of individual column's parsing. Specifies a fully qualified Java classname that

implements com.sqlstream.aspen.namespace.common.TypeParser

Restrictions to Parsing JSON

There are some additional restrictions on parsing JSON records with arrays embedded at multiple levels.

Only one embedded array can be "unnested." The JSON parser will not unnest more than one embedded array.

Arrays that are not being unnested can only be parsed as VARCHAR columns whose value is the entire array object as unparsed JSON textt.

That is, JSON paths like '$..array1[0:]..array2[0:]..field'  are not supported. In other words, JSON parser will not unnest more than one embedded arrays. If you specify JSON paths that requires unnesting multiple arrays, this will be reported as a SQL validation error. (In some cases, "multiple unnesting" can be determined only during execution and will be reported as a runtime error.)

For example, given the following JSON, the path definitions "col1_PATH" = '$..col1' & "col2_PATH" = '$..col2'  would require unnesting multiple arrays, and would therefor be invalid.

{

 "array1": [ { ..., "col1": 1234.56, ... }, ... ],

 "array2": [ { ..., "col2": "val2", ... }, ... ],

 "col3": "val3"

}

 

For cases where there are multiple embedded arrays, you can parse JSON records by unnesting one of those embedded arrays while passing other embedded arrays as "unparsed JSON text" and then parsing these using the Parser UDX.

For the JSON record above, streaming abstraction can be:

CREATE OR REPLACE json_stream (

    "col1" DOUBLE,

    "array2Col" VARCHAR(1024),

    "col3" VARCHAR(128)

)

SERVER FileReaderServer

OPTIONS (

  ...,

  "col1_PATH" '$.array1[0:].col1',

  "array2Col" '$.array2'    -- alternatively, '$.array2[0:]'

  "col3" '$.col3',

  "ROW_PATH" '$'

  ...

);

 

However, when ROW_PATH itself represents a path for an embedded array, s-Server will unnest another embedded array in its elements.

That is, the path "col_PATH" = '$.array1[0:].array2[0:].col' is allowed even if it means unnesting of array1 & array2 as long as ROW_PATH is '$.array1[0:]'.

JSON and Ambiguous Paths

The JSON parser cannot handle ambiguous paths. Ambiguous paths that cannot be detected at stream definition time will return undefined results.

Given the following array,

{

 "operation": "someOperation",

    "slots":  [{"slotName": "1"}, {"slotName": "2"}],

    "slots2": [{"slotName": "3"}, {"slotName": "4"}]

}

 

the following column path for a SLOTNAME column will be ambiguous, because '$..slotName' fits more than one of the nested arrays.

SLOTNAME_PATH '$..slotName'   -- any field with a name 'slotName'.

 

For an ambiguous path, a field that matches the path last in a JSON record will be read as the value.

If you define a stream with an ambiguous path, it will fail with an exception:

CREATE FOREIGN STREAM fs (

  slotname   varchar(32),

  slotname2  varchar(32)

)

SERVER ...

OPTIONS (

  PARSER 'JSON',

  ...,

  ROW_PATH '$',

  SLOTNAME_PATH '$.slots[0:].slotName',

  SLOTNAME2_PATH '$.slots2[0:].slotName'

);

java.sql.SQLException: Cannot parse fields of embedded arrays "$.slots2[0:]" and "$.slots[0:]" simultaneously. One of these embedded arrays must be extracted as an 'unparsed' column of VARCHAR type by specifying its path as column path.

 

However, the following stream definition will parse slots[0:] as an unparsed value of the slots2 array:

CREATE FOREIGN STREAM fs (

  slotname   varchar(32),

  slots2  varchar(64)

)

SERVER ...

OPTIONS (

  PARSER 'JSON',

  ...,

  ROW_PATH '$',

  SLOTNAME_PATH '$.slots[0:].slotName',

  SLOTS2_PATH '$.slots2[0:]'

);

 

You can parse nested the array 'slots2, with a cascaded view definition that invokes another instanace of JSONPath parser as a UDX, if necessary. See the topic Using the Parser UDX in this guide for more details.