Regex Functions¶
BodoSQL currently uses Python's regular expression library via the re
module. Although this may be subject to change, it means that there are
several deviations from the behavior of Snowflake's regular expression
functions (see here for snowflake documentation).
The key points and major deviations are noted below:
-
Snowflake uses a superset of the POSIX ERE regular expression syntax. This means that BodoSQL can utilize several syntactic forms of regular expressions that Snowflake cannot (see here for Python re documentation). However, there are several features that POSIX ERE has that Python's
re
does not: -
POSIX character classes (see here for a full list). BodoSQL does support these as macros for character sets. In other words,
[[:lower:]]
is transformed into[a-z]
. However, this form of replacement cannot be escaped. Additionally, any character classes that are supposed to include the null terminator\x00
instead start at\x01
-
Equivalence classes (not supported by BodoSQL).
-
Returning the longest match when using alternation patterns (BodoSQL returns the leftmost match).
-
The regex functions can optionally take in a flag argument. The flag is a string whose characters control how matches to patterns occur. The following characters have meaning when contained in the flag string:
-
'c'
: case-sensitive matching (the default behavior) 'i'
: case-insensitive matching (if both 'c' and 'i' are provided, whichever one occurs last is used)'m'
: allows anchor patterns to interact with the start/end of each line, not just the start/end of the entire string.'s'
: allows the.
metacharacter to capture newline characters-
'e'
: seeREGEXP_SUBSTR
/REGEXP_INSTR
-
Currently, BodoSQL supports the lazy
?
operator whereas Snowflake does not. So for example, in Snowflake, the pattern`(.*?),'
would match with as many characters as possible so long as the last character was a comma. However, in BodoSQL, the match would end as soon as the first comma. -
Currently, BodoSQL supports the following regexp features which should crash when done in Snowflake:
(?...)
,\A
,\Z
,\1
,\2
,\3
, etc. -
Currently, BodoSQL requires the pattern argument and the flag argument (if provided) to be string literals as opposed to columns or expressions.
-
Currently, extra backslashes may be required to escape certain characters if they have meaning in Python. The amount of backslashes required to properly escape a character depends on the usage.
-
All matches are non-overlapping.
-
If any of the numeric arguments are zero or negative, or the
group_num
argument is out of bounds, an error is raised. The only exception isREGEXP_REPLACE
, which allows its occurrence argument to be zero.
BodoSQL currently supports the following regex functions:
REGEXP_LIKE¶
-
REGEXP_LIKE(str, pattern[, flag])
Returns
true
if the entire string matches with the pattern. Ifflag
is not provided,''
is used.If the pattern is empty, then
true
is returned if the string is also empty.For example:
-
2 arguments: Returns
true
ifA
is a 5-character string where the first character is an a, the last character is a z, and the middle 3 characters are also lowercase characters (case-sensitive). -
3 arguments: Returns
true
ifA
starts with the letters'THE'
(case-insensitive).
-
REGEXP_COUNT¶
-
REGEXP_COUNT(str, pattern[, position[, flag]])
Returns the number of times the string contains matches to the pattern, starting at the location specified by the
position
argument (with 1-indexing). Ifposition
is not provided,1
is used. Ifflag
is not provided,''
is used.If the pattern is empty, 0 is returned.
For example:
-
2 arguments: Returns the number of times that any letters occur in
A
. -
3 arguments: Returns the number of times that any digit characters occur in
A
, not including the first 5 characters. -
4 arguments: Returns the number of times that a substring occurs in
A
that contains two ones with any character (including newlines) in between.
-
REGEXP_REPLACE¶
-
REGEXP_REPLACE(str, pattern[, replacement[, position[, occurrence[, flag]]]])
Returns the version of the inputted string where each match to the pattern is replaced by the replacement string, starting at the location specified by the
position
argument (with 1-indexing). The occurrence argument specifies which match to replace, where 0 means replace all occurrences. Ifreplacement
is not provided,''
is used. Ifposition
is not provided,1
is used. Ifoccurrence
is not provided,0
is used. Ifflag
is not provided,''
is used.If there are an insufficient number of matches, or the pattern is empty, the original string is returned.
Note
back-references in the replacement pattern are supported, but may require additional backslashes to work correctly.
For example:
-
2 arguments: Deletes all whitespace in
A
. -
3 arguments: Replaces all occurrences of
'hate'
inA
with'love'
(case-sensitive). -
4 arguments: Replaces all occurrences of two consecutive digits in
A
with the same two digits reversed, excluding the first 2 characters. -
5 arguments: Replaces the first character in
A
with an underscore. -
6 arguments: Removes the first and last word from each line of
A
that contains at least 3 words.
-
REGEXP_SUBSTR¶
-
REGEXP_SUBSTR(str, pattern[, position[, occurrence[, flag[, group_num]]]])
Returns the substring of the original string that caused a match with the pattern, starting at the location specified by the
position
argument (with 1-indexing). The occurrence argument specifies which match to extract (with 1-indexing). Ifposition
is not provided,1
is used. Ifoccurrence
is not provided,1
is used. Ifflag
is not provided,''
is used. Ifgroup_num
is not provided, andflag
contains'e
',1
is used. Ifgroup_num
is provided but the flag does not containe
, then it behaves as if it did. If the flag does containe
, then one of the subgroups of the match is returned instead of the entire match. The subgroup returned corresponds to thegroup_num
argument (with 1-indexing).If there are an insufficient number of matches, or the pattern is empty,
NULL
is returned.For example:
-
2 arguments: Returns the first number that occurs inside of
A
. -
3 arguments: Returns the first punctuation symbol that occurs inside of
A
, excluding the first 10 characters. -
4 arguments: Returns the fourth occurrence of two consecutive lowercase vowels in
A
. -
5 arguments: Returns the first 3+ character substring of
A
that starts with and ends with a vowel (case-insensitive, and it can contain newline characters). -
6 arguments: Looks for third occurrence in
A
of a number followed by a colon, a space, and a word that starts with'a'
(case-sensitive) and returns the word that starts with'a'
.
-
REGEXP_INSTR¶
-
REGEXP_INSTR(str, pattern[, position[, occurrence[, option[, flag[, group_num]]]]])
Returns the location within the original string that caused a match with the pattern, starting at the location specified by the
position
argument (with 1-indexing). The occurrence argument specifies which match to extract (with 1-indexing). The option argument specifies whether to return the start of the match (if0
) or the first location after the end of the match (if1
). Ifposition
is not provided,1
is used. Ifoccurrence
is not provided,1
is used. Ifoption
is not provided,0
is used. Ifflag
is not provided,''
is used. Ifgroup_num
is not provided, andflag
contains'e
',1
is used. Ifgroup_num
is provided but the flag does not containe
, then it behaves as if it did. If the flag does containe
, then the location of one of the subgroups of the match is returned instead of the location of the entire match. The subgroup returned corresponds to thegroup_num
argument (with 1-indexing).If there are an insufficient number of matches, or the pattern is empty,
0
is returned.-
2 arguments: Returns the index of the first
'#'
inA
. -
3 arguments: Returns the starting index of the first occurrence of 3 consecutive digits in
A
, excluding the first 3 characters. ```sql SELECT REGEXP_INSTR(A, '\d{3}', 4) -
5 arguments: Returns the ending index of the first substring of
A
that starts and ends with non-ascii characters. -
6 arguments: Returns the starting index of the second line of
A
that begins with an uppercase vowel. -
7 arguments: Looks for the first substring of
A
that has the format of a name in a phonebook (i.e.Lastname, Firstname
) and returns the starting index of the first name.
-