FAST_REGEX_LOG_PARSE


Note: This topic describes a legacy adapter. Unless you have used these in the past, we recommend using the Extensible Common Data Adapter and Agent instead of these legacy adapters and agents.

General Syntax:  FAST_REGEX_LOG_PARSE('input_string', 'fast_regex_pattern')

FAST_REGEX_LOG_PARSE works by first decomposing the regular expression into a series of regular expressions, one for each expression inside a group and one for each expression outside a group. Any fixed-length portions at the start of an expression are moved to the end of the previous expression. If an expression is entirely fixed length, it is merged with the previous expression. The series of expressions is then evaluated using lazy semantics with no backtracking. (In regular expression parsing parlance, "lazy" means don't parse more than you need to at each step. "Greedy" means parse as much as you can at each step. "Backtracking" means that if something doesn't match the expression, you go back and start again at the previous level of expression.)
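The difference between lazy and greedy matching described above can be seen in any backtracking regex engine. The following is a minimal Python sketch, not FRLP itself: it uses Python's `re` module, where `.*` is greedy and `.*?` is lazy, and the variable names are illustrative only.

```python
import re

text = "Mary_had_a_little_lamb"

# Greedy: ".*" takes as much as it can, so the match ends at the LAST underscore.
greedy = re.match(r"(.*)_", text).group(1)

# Lazy: ".*?" takes as little as it can, so the match ends at the FIRST underscore.
lazy = re.match(r"(.*?)_", text).group(1)

print(greedy)  # Mary_had_a_little
print(lazy)    # Mary
```

Note that Python's engine still backtracks in both cases; FRLP, as described above, combines the lazy behavior with no backtracking at all.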

The columns returned will be COLUMN1 through COLUMNn, where n is the number of groups in the regular expression. The columns will be of type varchar(1024).  See sample usage below at First FRLP Example and at Further FRLP Examples.
A list of the parsing functions and examples of using them can be found in the topic Log File Adapter in the s-Server Integration Guide.

See also the topic REGEX_LOG_PARSE in this guide, as well as the topic LogFileAdapter.

Description of FAST_REGEX_LOG_PARSE (FRLP)

FAST_REGEX_LOG_PARSE uses a lazy search: it stops at the first match. By contrast, the default Java regular expression parser is greedy unless reluctant (lazy) quantifiers are used.

FAST_REGEX_LOG_PARSE scans the supplied input string for all the characters specified by the fast_regex_pattern.

All characters in that input string must be accounted for by the characters and scan groups defined in the fast_regex_pattern. Scan groups define the fields-or-columns resulting when a scan is successful.
If all characters in the input_string are accounted for when the fast_regex_pattern is applied, then FRLP creates an output field (column) from each parenthetical expression in that fast_regex_pattern, in left-to-right order: The first (leftmost) parenthetical expression creates the first output field, the second parenthetical expression creates the second output field, up through the last parenthetical expression creating the last output field.
If the supplied input_string contains any characters not accounted for (matched) by applying fast_regex_pattern, then FRLP returns no fields at all.
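The all-or-nothing rule above, together with the COLUMN1 through COLUMNn naming, can be approximated in Python as follows. This is a sketch under assumptions: `parse_like_frlp` is a hypothetical helper, not part of s-Server, and it relies on Python's `re.fullmatch` with lazy quantifiers written explicitly as `.*?`.

```python
import re

def parse_like_frlp(input_string, pattern):
    """Return {'COLUMN1': ..., 'COLUMN2': ...} if the pattern accounts for
    every character of input_string; otherwise return None (no fields)."""
    m = re.fullmatch(pattern, input_string)
    if m is None:
        return None
    # One output column per parenthetical group, in left-to-right order.
    return {f"COLUMN{i}": g for i, g in enumerate(m.groups(), start=1)}

# Every character is accounted for: two groups, two columns.
both = parse_like_frlp("Mary_had", r"(.*?)_(.*)")

# "had" is left unaccounted for, so nothing is returned.
none = parse_like_frlp("Mary_had", r"(.*?)_")

print(both)  # {'COLUMN1': 'Mary', 'COLUMN2': 'had'}
print(none)  # None
```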

 
First FRLP Example

0: jdbc:sqlstream:engine:> select t.r."COLUMN1", t.r."COLUMN2" from
. . . . . . . . . . . . .> (values (FAST_REGEX_LOG_PARSE('Mary_had_a_little_lamb', '(.*)_(._.*)_.*'))) t(r);
+-------------------------+-----------------------+
|         COLUMN1         |         COLUMN2       |
+-------------------------+-----------------------+
| Mary_had                | a_little              |
+-------------------------+-----------------------+
1 row selected

1. The scan of input_string ('Mary_had_a_little_lamb') begins with the first group defined in fast_regex_pattern:  (.*), which means "find any character 0 or more times."
This group specification, defining the first field or column desired, causes FRLP to gather input_string characters, starting from the first character, until it finds a match for the next group in fast_regex_pattern or for the next literal character or string that is not inside a group.
In this example's fast_regex_pattern, the next such literal after the first group specification is an underscore.
2. Each character in the input_string is scanned until the next specification in this example's fast_regex_pattern is found:  an underscore.
The first underscore shown in the fast_regex_pattern is outside any group specification (groups are defined by enclosing parentheses).
Such literals, specified in the fast_regex_pattern but not inside a group, must be found in the input_string but are not included in any output field.
3. The next thing sought is group 2, (._.*), which means any single character, then an underscore, then any character 0 or more times.
The first place in the input_string where an underscore is followed by a character and then another underscore is at "_a_", so group 2 starts at "a_".
Finding this group-2 match defines the contents of the first (group-1) output field:  "Mary_had".
(The underscore that follows "Mary_had" is not part of any output group/field.)
4. Group 2 thus begins with "a_".  Where does it end?
The remaining specification in the fast_regex_pattern is _.*, outside any group:  an underscore followed by any number of other characters.
Because the scan is lazy, the .* inside group 2 stops at the first underscore it reaches, the one after "little". The remaining specification is then matched by "_lamb", after which the input_string ends, meeting the requirement that all input_string characters be accounted for.
So group 2 begins with "a_" and ends with "little":  the second field/column contains "a_little", as shown in the output above. The trailing "_lamb" is matched but, being outside any group, is not returned.
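The walk-through above can be reproduced approximately in Python by writing the lazy quantifiers explicitly. This is only an analogue: Python's `re` engine backtracks, whereas FRLP does not, but for this input and pattern the lazy result matches the columns in the example output.

```python
import re

text = "Mary_had_a_little_lamb"

# Same shape as '(.*)_(._.*)_.*', with the grouped ".*" made lazy (".*?").
m = re.fullmatch(r"(.*?)_(._.*?)_.*", text)
column1, column2 = m.groups()

print(column1)  # Mary_had
print(column2)  # a_little
```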

Note that if the fast_regex_pattern had omitted the final ".*", no results would be obtained:

 

0: jdbc:sqlstream:engine:> select t.r."COLUMN1", t.r."COLUMN2" from
. . . . . . . . . . . . .> (values (FAST_REGEX_LOG_PARSE('Mary_had_a_little_lamb', '(.*)_(._.*)_'))) t(r);
+----------+----------+
| COLUMN1  | COLUMN2  |
+----------+----------+
+----------+----------+
No rows selected

 

Why?  Because this fast_regex_pattern says the input string ends with an underscore with no characters after it.  So upon encountering the underscore after "little", the parser expects no more characters but instead finds there are more, which violates the first rule: All characters in that input string must be accounted for by the characters and scan groups defined in the fast_regex_pattern.

Further FRLP Examples

The next example uses a "+", which means repeat the last expression 1 or more times ("*" means 0 or more times).

A.  In this case, the longest prefix is the first underscore. The first field/column group will match on "Mary" and the second will not match.

0: jdbc:sqlstream:engine:> select t.r."COLUMN1", t.r."COLUMN2" from
. . . . . . . . . . . . .> (values (FAST_REGEX_LOG_PARSE('Mary_had_a_little_lamb', '(.*)_+(._.*)'))) t(r);
+----------+----------+
| COLUMN1  | COLUMN2  |
+----------+----------+
+----------+----------+
No rows selected
 
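The above example returns no fields: the "_+" is satisfied by the first underscore (the one after "Mary"), but the text that follows, "had...", does not fit the group-2 specification (._.*), whose second character must be an underscore. Because FRLP does not backtrack, it never retries from a later underscore.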

 
B.  In the following case, the '+' is superfluous because of the lazy semantics:

0: jdbc:sqlstream:engine:> select t.r."COLUMN1", t.r."COLUMN2" from
. . . . . . . . . . . . .> (values (FAST_REGEX_LOG_PARSE('Mary____had_a_little_lamb', '(.*)_+(.*)'))) t(r);
+-------------------------+-------------------------+
|         COLUMN1         |         COLUMN2         |
+-------------------------+-------------------------+
| Mary                    |    had_a_little_lamb    |
+-------------------------+-------------------------+
1 row selected

 
The above example succeeds in returning two fields because after finding the multiple underscores required by the "_+" specification, the group-2 specification (.*) accepts all remaining characters in the input_string. Underscores do not appear trailing "Mary" nor leading "had" because the "_+" specification is not enclosed in parentheses.

As mentioned in the introduction, "lazy" in regular expression parsing parlance means don't parse more than you need to at each step. "Greedy" means parse as much as you can at each step.

The first case in this topic, A, fails because when it gets to the first underscore, the regex processor has no way of knowing without backtracking that it can't use the underscore to match "_+", and FRLP doesn't backtrack, whereas REGEX_LOG_PARSE does.

The search in case A, '(.*)_+(._.*)', gets turned into three searches:

(.*)_

_*(._

.*)

 

Notice that the second field group gets split between the second and third searches. Also notice that "_+" is considered the same as "__*"; that is, "underscore repeated 1 or more times" is treated the same as "underscore, followed by underscore repeated 0 or more times".

Case A demonstrates the main difference between REGEX_LOG_PARSE and FAST_REGEX_LOG_PARSE: the search in A would work under REGEX_LOG_PARSE, because that function uses backtracking.
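To illustrate why case A fails without backtracking, here is a simplified Python emulation of matching a series of sub-expressions in turn. It is a sketch under assumptions, not the actual FRLP implementation: `match_segments` is a hypothetical helper, the segments are written by hand from the three searches listed above (with lazy quantifiers and a final `$`), and each segment commits to its first lazy match.

```python
import re

def match_segments(text, segments):
    """Match each segment in order from the current position, committing to
    the first lazy match. If a later segment fails, earlier segments are NOT
    retried (no backtracking across segments)."""
    pos = 0
    pieces = []
    for seg in segments:
        m = re.compile(seg).match(text, pos)
        if m is None:
            return None
        pieces.append(m.group(0))
        pos = m.end()
    # Every character of the input must be accounted for.
    return pieces if pos == len(text) else None

text = "Mary_had_a_little_lamb"

# Case A as three searches: '.*_', '_*._', '.*' (lazy, last one anchored).
segments_a = [r".*?_", r"_*?._", r".*?$"]
no_backtracking = match_segments(text, segments_a)

# A backtracking engine retries later underscores and finds a match.
backtracking = re.fullmatch(r"(.*)_+(._.*)", text).groups()

print(no_backtracking)  # None
print(backtracking)     # ('Mary_had', 'a_little_lamb')
```

The first segment commits to "Mary_"; the second then needs a character followed by an underscore at "ha", fails, and the whole parse fails. The backtracking engine instead gives up its first choice and settles on the underscore after "had".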

C.  In the following example, the plus is not superfluous, because <Alpha> (any alphabetic character) is fixed length and thus will be used as a delimiter for the "_+" search.

0: jdbc:sqlstream:engine:> select t.r."COLUMN1", t.r."COLUMN2" from
. . . . . . . . . . . . .> (values (FAST_REGEX_LOG_PARSE('Mary____had_a_little_lamb', '(.*)_+(<Alpha>.*)'))) t(r);
+----------------------------+----------------------------+
|          COLUMN1           |          COLUMN2           |
+----------------------------+----------------------------+
| Mary                       | had_a_little_lamb          |
+----------------------------+----------------------------+
1 row selected

 
 
'(.*)_+(<Alpha>.*)' gets converted into three regular expressions:
'.*_'
'_*<Alpha>'
'.*$'
Each is matched in turn using lazy semantics.
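A rough Python analogue of case C follows. It assumes `<Alpha>` can be stood in for by `[A-Za-z]` and writes the lazy quantifiers explicitly; it illustrates how the fixed-length alphabetic character acts as the delimiter that forces the underscore run to be consumed in full, and it is not the FRLP engine itself.

```python
import re

text = "Mary____had_a_little_lamb"

# "_+?" is lazy, but it must keep growing until an alphabetic character follows.
m = re.fullmatch(r"(.*?)_+?([A-Za-z].*)", text)
column1, column2 = m.groups()

print(column1)  # Mary
print(column2)  # had_a_little_lamb
```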
 
The following are defined:

<Digit> = "[0-9]",
<Upper> = "[A-Z]",
<Lower> = "[a-z]",
<ASCII> = "[\u0000-\u007F]",
<Alpha> = "<Lower>|<Upper>",
<Alnum> = "<Alpha>|<Digit>",
<Punct> = "[!\"#$%&'()*+,-./:;<=>?@[\\\\\\]^_`{|}~]",
<Blank> = "[ \t]",
<Space> = "[ \t\n\f\r\u000B]",
<Cntrl> = "[\u0000-\u001F\u007F]",
<XDigit> = "[0-9a-fA-F]",
<Print> = "<Alnum>|<Punct>",
<Graph> = "<Print>"
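As an illustration of how such named classes can be built from one another, the following Python sketch expands a subset of the definitions above into ordinary character classes. The `expand` function and the use of non-capturing groups `(?:...)` are assumptions made for this example; the source does not say how FRLP performs the substitution internally.

```python
import re

# Subset of the definitions listed above.
CLASSES = {
    "Digit": "[0-9]",
    "Upper": "[A-Z]",
    "Lower": "[a-z]",
    "Alpha": "<Lower>|<Upper>",
    "Alnum": "<Alpha>|<Digit>",
}

TOKEN = re.compile(r"<(\w+)>")

def expand(pattern):
    """Recursively replace <Name> tokens with their definitions."""
    def repl(m):
        name = m.group(1)
        if name not in CLASSES:
            return m.group(0)  # leave unknown tokens untouched
        # Wrap in a non-capturing group so alternation stays contained.
        return "(?:" + expand(CLASSES[name]) + ")"
    return TOKEN.sub(repl, pattern)

print(expand("<Alpha>"))  # (?:(?:[a-z])|(?:[A-Z]))
```

With this helper, `re.fullmatch(expand("<Alpha>.*"), "had_a")` matches, while `re.fullmatch(expand("<Alpha>"), "_")` does not.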

 


Further References

A list of the parsing functions and examples of using them can be found in the topic Log File Adapter in the s-Server Integration Guide. See also the REGEX_LOG_PARSE write-up in this SQL Reference Guide.