Parsing Files with FastRegex Row Filter

<< Click to Display Table of Contents >>

Navigation:  Integrating Blaze with Other Systems > Reading Data into s-Server > Reading from Other Sources  > File Formats for Reading >

Parsing Files with FastRegex Row Filter

Previous pageReturn to chapter overviewNext page

The Fast Regex Row Filter option of the Extensible Common Data Adapter works by applying a Java Regular Expression to a log file, identifying columns by groups in the expression. The FastRegex parser uses the FastRegex parser function, described in the topic FAST_REGEX_LOG_PARSER in the Streaming SQL Reference Guide. That function can be used anywhere in your code. The FastRegex parser for the Extensible Common Data Framework lets you parse data as it comes into s-Server. That may be desirable for performance or other reasons.

The Fast Regex Row Filter works by first decomposing the regular expression into a series of regular expressions, one for each expression inside a group and one for each expression outside a group. Any fixed length portions at the start of any expressions are moved to the end of the previous expression. If any expression is entirely fixed length, it is merged with the previous expression. The series of expressions is then evaluated using lazy semantics with no backtracking. (In regular expression parsing parlance, "lazy" means don't parse more than you need to at each step. "Greedy" means parse as much as you can at each step. "Backtracking" means if something doesn't match the expression, you go back and start at the previous level of expression).

The columns returned will be COLUMN1 through COLUMNn, where n is the number of groups in the regular expression. The columns will be of type varchar(1024).

FastRegex Row Filter uses a lazy search - it stops at the first match. By contrast, the default java regex_parser is greedy unless possessive quantifiers are used.

FastRegex Row Filter scans the supplied input string for all the characters specified by in the FILTER_PATTERN option. All characters in that input string must be accounted for by the characters and scan groups defined in the fast_regex_pattern. Scan groups define the fields-or-columns resulting when a scan is successful.

If all characters in the FILTER_PATTERN are accounted for when the fast_regex_pattern is applied, then FastRegex Row Filter creates an output field (column) from each parenthetical expression in that fast_regex_pattern, in left-to-right order: The first (leftmost) parenthetical expression creates the first output field, the second parenthetical expression creates the second output field, up through the last parenthetical expression creating the last output field.
If the supplied input_string contains any characters not accounted for (matched) by applying fast_regex_pattern, then the FastRegex Row Filter returns no fields at all.

Implementing the FastRegex Row Filter with the Extensible Common Data Adapter

To use the Extensible Common Data Adapter with FastRegex, you set the parser to FastRegex, then pass in groups of filters that will map to columns. The following example uses FastRegex to parse columns for accountNumber, sourceID, destIP, and customerID.

Column names cannot be dynamically assigned with the FastRegex Row Filter. You need to declare these as part of a the foreign stream or table.

Sample Foreign Stream Implementing ECD Adapter to Parse Files with the FastRegex Row Filter

The following example will parse columns called recNo, ts, accountNumber, sourceID, destIP, and customerID from a file in /path/to/logfile. To parse with FastRegex over other input/output systems, such as IBM MQ or AMQP, you would need to specify options for these formats. See Reading from Other Sources for more details.

Note: Information on file location, file name pattern and character encoding can also be set as server options.

CREATE OR REPLACE FOREIGN STREAM "FAST_REGEX_LOG_PARSE"

("recNo" INTEGER,

"ts" TIMESTAMP NOT NULL,

"accountNumber" INTEGER,

"sourceIP" VARCHAR(32),

"destIP" VARCHAR(32),

"customerId" INTEGER,)

SERVER "FileReaderServer"

OPTIONS

(DIRECTORY '/path/to/logfile',

ENCODING 'UTF-8',

PARSER 'FastRegex',

FILTER_PATTERN '\t '889'\d\d\d\d\d' , '(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])$' , '(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])$'  , '\d\d\d\d\d',

);

Sample Properties Implementing ECD Agent to Parse FCLP Files

To parse FCLP files with the ECD Agent, configure the options above using the ECD Agent property file with properties similar to the following:

ROWTYPE=RECORDTYPE(VARCHAR(2040) id, VARCHAR(2040) reported_at, VARCHAR(2040) shift_no, VARCHAR(2040) trip_no, VARCHAR(2040) route_variant_id)

(DIRECTORY '/tmp',

FILENAME_PATTERN 'transactions\.log',

PARSER=FastRegex

CHARACTER_ENCODING=UTF-8

SKIP_HEADER=TRUE

SEPARATOR=u\000A

FILTER_PATTERN=\t '889'\d\d\d\d\d' , '(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])$' , '(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])$'  , '\d\d\d\d\d)