Using the Discovery Parser

<< Click to Display Table of Contents >>

Navigation:  Integrating SQLstream Blaze with Other Systems > Reading Data into s-Server > Reading from Other Sources  > Parser Types for Reading >

Using the Discovery Parser

Previous pageReturn to chapter overviewNext page

If you do not have information about the file format to be read, or need help filling in foreign stream or agent properties for a file format, you can use the Discovery parser to determine information about a file's format. When you point it at a file location, this parser reads a sample of the file and returns information on file format options. Currently, the Discovery parser can identify CSV, XML, JSON, and Avro files.

5_2_indicator Version 5.2 Feature

As of version 5.2, the Discover Parser will identify Apache Avro files. See Using the Discovery Parser with Apache Avro below.

Note: Avro files may sometimes return a result of 'BINARY' or 'UNKNOWN' if the Discovery parser cannot find an Avro schema. In these cases, you can specify an Avro schema for the Discovery parser. See Using the Discovery Parser with Apache Avro below.

The Discovery parser inspects sample data from an indicated data source to recommend values for a parser. The plugin also recommends all necessary relevant options for the "discovered" parser. Some recommended options can be column specific for columns that you define in the foreign stream definition.

The Discovery parser returns stream and column options as a table in s-Server, which you query to get information on options for the file type. You can then use these options to write a foreign stream or agent properties file to parse data from the source. See Input Formats for Reading for an overview of how to parse data from these sources.

discovery_parser

To use the Discovery parser, you first set up a foreign table with the options below in s-Server, with options that set the parser type as "Discovery", indicate the file's location, and specify a regex pattern for the file name. The following code first creates a server for the table, then creates and sets a schema, then creates the foreign table with required options.

CREATE OR REPLACE SERVER "DiscoveryServer"

FOREIGN DATA WRAPPER ECDA;

 

CREATE OR REPLACE SCHEMA "Discovery";

SET SCHEMA '"Discovery"';

 

CREATE OR REPLACE FOREIGN TABLE discovery_table (

 "d_name" VARCHAR(1024), -- recommended name for a column

 "d_path" VARCHAR(1024), -- recommended "PATH" to extract the column

 "d_type" VARCHAR(1024), -- recommended data type for the column

 "d_precision" INTEGER,  -- recommended precision for the column

 "d_scale" INTEGER,      -- recommended scale for the column

 "d_nullable" BOOLEAN,   -- is the column nullable ?

 "d_sample" VARCHAR(4096), -- a sample value for the recommended column

 "d_properties" VARCHAR(4096) -- any column-specific options for the column

                              -- options are space-separated

)

SERVER "DiscoveryServer"

OPTIONS

(

 DIRECTORY '/home/sqlstream/',

 FILENAME_PATTERN 'y\.txt',

 DISCOVERY_TIMEOUT '1000',

 MAX_EXAMPLE_BYTES '1024',

 PARSER 'DISCOVERY'

);

 

Once you have created this foreign table, you run a simple SELECT query against the table to invoke the Discovery parser.

SELECT * FROM discovery_table;

 

The query inspects data from /data/samples/discovery1.log to discover data formats among CSV, JSON and XML. If the discovery process fails to discover one of these data formats, then it returns 'UNKNOWN' as the recommended "PARSER" option.

Results of the query return one row for each "recommended" column for the actual foreign stream to be created. A brief description of each column is in in-line comments.

Last row includes options that are not specific to any recommended columns in "d_properties" column.

An example result of a simple SELECT query:

SELECT * FROM discovery_table;

'd_name','d_path','d_type','d_precision','d_scale','d_nullable','d_sample','d_properties'

'Field1','$.Field1','SMALLINT','4','0','false','1231','Field1_PATH=$.Field1'

'Field2','$.Field2','TIMESTAMP','0','0','false','2016-01-01 23:11:23.653','Field2_DATE_FORMAT=yyyy-MM-dd HH:mm:ss

Field2_PATH=$.Field2'

'Field4','$.Field3.Field4','SMALLINT','4','0','false','1233','Field4_PATH=$.Field3.Field4'

'Field5','$.Field3.Field5','SMALLINT','4','0','false','1234','Field5_PATH=$.Field3.Field5'

'Array1','$.Array1[0:]','VARCHAR(1024)','0','0','false','[]','Array1_PATH=$.Array1[0:]'

'Field8','$.Array2[0:].Field8','SMALLINT','5','0','false','12313','Field8_PATH=$.Array2[0:].Field8'

'Field9','$.Array2[0:].Field9','SMALLINT','5','0','false','12314','Field9_PATH=$.Array2[0:].Field9'

'Array3','$.Array2[0:].Array3[0:]','VARCHAR(1024)','0','0','false','[]','Array3_PATH=$.Array2[0:].Array3[0:]'

'Field11','$.Field11','SMALLINT','5','0','false','12321','Field11_PATH=$.Field11'

'Array4','$.Field12.Array4[0:]','VARCHAR(32)','0','0','true','[12322,12323,12324]','Array4_PATH=$.Field12.Array4[0:]'

'Field13','$.Field12.Field13','SMALLINT','5','0','false','12325','Field13_PATH=$.Field12.Field13'

'Field14','$.Field14','DECIMAL','8','4','false','1238.9824','Field14_PATH=$.Field14'

'','$','','','','','','PARSER=JSON

Field8_PATH=$.Array2[0:].Field8

Field5_PATH=$.Field3.Field5

Field2_PATH=$.Field2

Field9_PATH=$.Array2[0:].Field9

ROW_PATH=$

CHARACTER_ENCODING=UTF-8

Array3_PATH=$.Array2[0:].Array3[0:]

Field13_PATH=$.Field12.Field13

Array4_PATH=$.Field12.Array4[0:]

Field4_PATH=$.Field3.Field4

Field14_PATH=$.Field14

Field2_DATE_FORMAT=yyyy-MM-dd HH:mm:ss

Array1_PATH=$.Array1[0:]

Field1_PATH=$.Field1

Field11_PATH=$.Field11'

13 rows selected (0.172 seconds)

 

The second to last row contains various options for the foreign stream for the recommended PARSER, which in this case is 'JSON'. Options like CHARACTER_ENCODING, and ROW_PATH are required for this parser.

Similarly, if the Discovery parser had identified 'CSV' as the parser, it would have recommended options such as SEPARATOR, ROW_SEPARATOR, and QUOTE_CHARACTER.

The last row of the discovery result returns a "NUM_BYTES_READ", which is the number of bytes read and inspected by the Discover parser.

Sample Properties Implementing ECD Agent to Use the Discovery Parser

To discover files with the ECD Agent, configure the options above using the ECD Agent property file with properties similar to the following:

# Location, name, and type of file

DIRECTORY=/tmp

filename_pattern=buses\.log

SEPARATOR=,

PARSER=DISCOVERY

DISCOVERY_TIMEOUT=1000

MAX_EXAMPLE_BYTES=1024

CHARACTER_ENCODING=UTF-8

# Schema, name, and parameter signature of destination table

SCHEMA_NAME=Discovery

TABLE_NAME=discoveryTable

ROWTYPE=RECORDTYPE(VARCHAR(1024) d_name, VARCHAR(1024) d_path, VARCHAR(1024) d_type, INTEGER d_precision , INTEGER d_scale, BOOLEAN d_nullable, VARCHAR(4096) d_sample, VARCHAR(4096) d_properties)

 

Using the Discovery Parser with Avro

5_2_indicator Version 5.2 Feature

Apache Avro is a popular, compact data format predominantly used in Hadoop and other big data platforms. See the topic Reading Avro for detailed documentation of how s-Server parses messages in Avro format into stream rows.

The discovery process for Avro data format is quite different from for other text based data format like CSV or XML. For these text based data formats, the Discovery parser uses built-in heuristics to create object-relational mapping.

Since Avro is a binary data format, Discovery handles Avro differently.

Avro files typically embed a schema for all the data stored in those files. Using the Avro API, the Discovery parser first checks to see if the Avro schema is embedded in the sample data being collected from the data source. If the Discovery parser finds the Avro schema in the sample data, it uses that schema to recommend object relational mapping for the Avro data.

If the schema is not embedded in the sample data, the Discovery parser will return 'UNKNOWN' or 'BINARY' as a recommended parser. If you get this result, you can specify an AVRO_SCHEMA_FILE option in Foreign Stream options and try using the Discovery parser again. See the CREATE FOREIGN STREAM topic in the Streaming SQL Reference Guide.

Once you specify the AVRO_SCHEMA_FILE, the Discovery parser uses the schema in the AVRO_SCHEMA_FILE to recommend object relational mapping for Avro data.

In either case, the Discovery parser recommends data type mapping between AVRO & SQL types. See Avro Ingestion Rules in the Reading Avro topic for more details.

Discovery may recommend that one of the embedded arrays be flattened. The simple heuristic is to recommend the array with most number of fields in each array element.