Using the Discovery Parser

<< Click to Display Table of Contents >>

Navigation:  Integrating SQLstream Blaze with Other Systems > Reading Data into s-Server > Reading from Other Sources  > Input Formats for Reading >

Using the Discovery Parser

Previous pageReturn to chapter overviewNext page

If you do not have information about the file format to be read, or need help filling in foreign stream or agent properties for a file format, you can use the Discovery parser to determine information about a file's format. When you point it at a file location, this parser reads a sample of the file and returns information on file format options. Currently, the Discovery parser can identify CSV, XML, and JSON files.

The Discovery Parser inspects sample data from an indicated data source to recommend values for a parser. The plugin also recommends all necessary relevant options for the "discovered" parser. Some recommended options can be column specific for columns that you define in the foreign stream definition.

The Discovery parser returns stream and column options as a table in s-Server, which you query to get information on options for the file type. You can then use these options to write a foreign stream or agent properties file to parse data from the source. See File Formats for Reading for an overview of how to parse data from these sources.

discovery_parser

To use the Discovery parser, you first set up a foreign table with the options below in s-Server, with options that set the parser type as "Discovery", indicate the file's location, and specify a regex pattern for the file name. The following code first creates a server for the table, then creates and sets a schema, then creates the foreign table with required options.

CREATE OR REPLACE SERVER "FileReaderServer"

FOREIGN DATA WRAPPER ECDA;

 

CREATE OR REPLACE SCHEMA "Discovery";

SET SCHEMA '"Discovery"';

 

CREATE OR REPLACE FOREIGN TABLE discovery_table (

 "d_name" VARCHAR(1024), -- recommended name for a column

 "d_path" VARCHAR(1024), -- recommended "PATH" to extract the column

 "d_type" VARCHAR(1024), -- recommended data type for the column

 "d_precision" INTEGER,  -- recommended precision for the column

 "d_scale" INTEGER,      -- recommended scale for the column

 "d_nullable" BOOLEAN,   -- is the column nullable ?

 "d_sample" VARCHAR(4096), -- a sample value for the recommended column

 "d_properties" VARCHAR(4096) -- any column-specific options for the column

                              -- options are space-separated

)

SERVER "FileReaderServer"

OPTIONS

(

 DIRECTORY '/home/sqlstream/',

 FILENAME_PATTERN 'y\.txt',

 DISCOVERY_TIMEOUT '1000',

 MAX_EXAMPLE_BYTES '1024',

 PARSER 'DISCOVERY'

);

 

Once you have created this foreign table, you run a simple SELECT query against the table to invoke the Discovery parser.

SELECT * FROM discovery_table;

 

The query inspects data from /data/samples/discovery1.log to discover data formats among CSV, JSON and XML. If the discovery process fails to discover one of these data formats, then it returns 'UNKNOWN' as the recommended "PARSER" option.

Results of the query return one row for each "recommended" column for the actual foreign stream to be created. A brief description of each column is in in-line comments.

Last row includes options that are not specific to any recommended columns in "d_properties" column.

An example result of a simple SELECT query:

SELECT * FROM discovery_table;

'd_name','d_path','d_type','d_precision','d_scale','d_nullable','d_sample','d_properties'

'Field1','$.Field1','SMALLINT','4','0','false','1231','Field1_PATH=$.Field1'

'Field2','$.Field2','TIMESTAMP','0','0','false','2016-01-01 23:11:23.653','Field2_DATE_FORMAT=yyyy-MM-dd HH:mm:ss

Field2_PATH=$.Field2'

'Field4','$.Field3.Field4','SMALLINT','4','0','false','1233','Field4_PATH=$.Field3.Field4'

'Field5','$.Field3.Field5','SMALLINT','4','0','false','1234','Field5_PATH=$.Field3.Field5'

'Array1','$.Array1[0:]','VARCHAR(1024)','0','0','false','[]','Array1_PATH=$.Array1[0:]'

'Field8','$.Array2[0:].Field8','SMALLINT','5','0','false','12313','Field8_PATH=$.Array2[0:].Field8'

'Field9','$.Array2[0:].Field9','SMALLINT','5','0','false','12314','Field9_PATH=$.Array2[0:].Field9'

'Array3','$.Array2[0:].Array3[0:]','VARCHAR(1024)','0','0','false','[]','Array3_PATH=$.Array2[0:].Array3[0:]'

'Field11','$.Field11','SMALLINT','5','0','false','12321','Field11_PATH=$.Field11'

'Array4','$.Field12.Array4[0:]','VARCHAR(32)','0','0','true','[12322,12323,12324]','Array4_PATH=$.Field12.Array4[0:]'

'Field13','$.Field12.Field13','SMALLINT','5','0','false','12325','Field13_PATH=$.Field12.Field13'

'Field14','$.Field14','DECIMAL','8','4','false','1238.9824','Field14_PATH=$.Field14'

'','$','','','','','','PARSER=JSON

Field8_PATH=$.Array2[0:].Field8

Field5_PATH=$.Field3.Field5

Field2_PATH=$.Field2

Field9_PATH=$.Array2[0:].Field9

ROW_PATH=$

CHARACTER_ENCODING=UTF-8

Array3_PATH=$.Array2[0:].Array3[0:]

Field13_PATH=$.Field12.Field13

Array4_PATH=$.Field12.Array4[0:]

Field4_PATH=$.Field3.Field4

Field14_PATH=$.Field14

Field2_DATE_FORMAT=yyyy-MM-dd HH:mm:ss

Array1_PATH=$.Array1[0:]

Field1_PATH=$.Field1

Field11_PATH=$.Field11'

13 rows selected (0.172 seconds)

 

The second to last row contains various options for the foreign stream for the recommended PARSER, which in this case is 'JSON'. Options like CHARACTER_ENCODING, and ROW_PATH are required for this parser.

Similarly, if the Discovery parser had identified 'CSV' as the parser, it would have recommended options such as SEPARATOR, ROW_SEPARATOR, and QUOTE_CHARACTER.

The last row of the discovery result returns a "NUM_BYTES_READ", which is the number of bytes read and inspected by the Discover parser.

Sample Properties Implementing ECD Agent to Use the Discovery Parser

To discover files with the ECD Agent, configure the options above using the ECD Agent property file with properties similar to the following:

# Location, name, and type of file

DIRECTORY=/tmp

filename_pattern=buses\.log

SEPARATOR=,

PARSER=DISCOVERY

DISCOVERY_TIMEOUT=1000

MAX_EXAMPLE_BYTES=1024

CHARACTER_ENCODING=UTF-8

# Schema, name, and parameter signature of destination table

SCHEMA_NAME=Discovery

TABLE_NAME=discoveryTable

ROWTYPE=RECORDTYPE(VARCHAR(1024) d_name, VARCHAR(1024) d_path, VARCHAR(1024) d_type, INTEGER d_precision , INTEGER d_scale, BOOLEAN d_nullable, VARCHAR(4096) d_sample, VARCHAR(4096) d_properties)