Regex Functions¶
BodoSQL currently uses Python's regular expression library via the re
module. Although this may be subject to change, it means that there are
several deviations from the behavior of Snowflake's regular expression
functions (see here for snowflake documentation).
The key points and major deviations are noted below:
-
Snowflake uses a superset of the POSIX ERE regular expression syntax. This means that BodoSQL can utilize several syntactic forms of regular expressions that Snowflake cannot (see here for Python re documentation). However, there are several features that POSIX ERE has that Python's
redoes not: -
POSIX character classes (see here for a full list). BodoSQL does support these as macros for character sets. In other words,
[[:lower:]]is transformed into[a-z]. However, this form of replacement cannot be escaped. Additionally, any character classes that are supposed to include the null terminator\x00instead start at\x01 -
Equivalence classes (not supported by BodoSQL).
-
Returning the longest match when using alternation patterns (BodoSQL returns the leftmost match).
-
The regex functions can optionally take in a flag argument. The flag is a string whose characters control how matches to patterns occur. The following characters have meaning when contained in the flag string:
-
'c': case-sensitive matching (the default behavior) 'i': case-insensitive matching (if both 'c' and 'i' are provided, whichever one occurs last is used)'m': allows anchor patterns to interact with the start/end of each line, not just the start/end of the entire string.'s': allows the.metacharacter to capture newline characters-
'e': seeREGEXP_SUBSTR/REGEXP_INSTR -
Currently, BodoSQL supports the lazy
?operator whereas Snowflake does not. So for example, in Snowflake, the pattern`(.*?),'would match with as many characters as possible so long as the last character was a comma. However, in BodoSQL, the match would end as soon as the first comma. -
Currently, BodoSQL supports the following regexp features which should crash when done in Snowflake:
(?...),\A,\Z,\1,\2,\3, etc. -
Currently, BodoSQL requires the pattern argument and the flag argument (if provided) to be string literals as opposed to columns or expressions.
-
Currently, extra backslashes may be required to escape certain characters if they have meaning in Python. The amount of backslashes required to properly escape a character depends on the usage.
-
All matches are non-overlapping.
-
If any of the numeric arguments are zero or negative, or the
group_numargument is out of bounds, an error is raised. The only exception isREGEXP_REPLACE, which allows its occurrence argument to be zero.
BodoSQL currently supports the following regex functions:
REGEXP_LIKE¶
-
REGEXP_LIKE(str, pattern[, flag])Returns
trueif the entire string matches with the pattern. Ifflagis not provided,''is used.If the pattern is empty, then
trueis returned if the string is also empty.For example:
-
2 arguments: Returns
trueifAis a 5-character string where the first character is an a, the last character is a z, and the middle 3 characters are also lowercase characters (case-sensitive). -
3 arguments: Returns
trueifAstarts with the letters'THE'(case-insensitive).
-
REGEXP_COUNT¶
-
REGEXP_COUNT(str, pattern[, position[, flag]])Returns the number of times the string contains matches to the pattern, starting at the location specified by the
positionargument (with 1-indexing). Ifpositionis not provided,1is used. Ifflagis not provided,''is used.If the pattern is empty, 0 is returned.
For example:
-
2 arguments: Returns the number of times that any letters occur in
A. -
3 arguments: Returns the number of times that any digit characters occur in
A, not including the first 5 characters. -
4 arguments: Returns the number of times that a substring occurs in
Athat contains two ones with any character (including newlines) in between.
-
REGEXP_REPLACE¶
-
REGEXP_REPLACE(str, pattern[, replacement[, position[, occurrence[, flag]]]])Returns the version of the inputted string where each match to the pattern is replaced by the replacement string, starting at the location specified by the
positionargument (with 1-indexing). The occurrence argument specifies which match to replace, where 0 means replace all occurrences. Ifreplacementis not provided,''is used. Ifpositionis not provided,1is used. Ifoccurrenceis not provided,0is used. Ifflagis not provided,''is used.If there are an insufficient number of matches, or the pattern is empty, the original string is returned.
Note
back-references in the replacement pattern are supported, but may require additional backslashes to work correctly.
For example:
-
2 arguments: Deletes all whitespace in
A. -
3 arguments: Replaces all occurrences of
'hate'inAwith'love'(case-sensitive). -
4 arguments: Replaces all occurrences of two consecutive digits in
Awith the same two digits reversed, excluding the first 2 characters. -
5 arguments: Replaces the first character in
Awith an underscore. -
6 arguments: Removes the first and last word from each line of
Athat contains at least 3 words.
-
REGEXP_SUBSTR¶
-
REGEXP_SUBSTR(str, pattern[, position[, occurrence[, flag[, group_num]]]])Returns the substring of the original string that caused a match with the pattern, starting at the location specified by the
positionargument (with 1-indexing). The occurrence argument specifies which match to extract (with 1-indexing). Ifpositionis not provided,1is used. Ifoccurrenceis not provided,1is used. Ifflagis not provided,''is used. Ifgroup_numis not provided, andflagcontains'e',1is used. Ifgroup_numis provided but the flag does not containe, then it behaves as if it did. If the flag does containe, then one of the subgroups of the match is returned instead of the entire match. The subgroup returned corresponds to thegroup_numargument (with 1-indexing).If there are an insufficient number of matches, or the pattern is empty,
NULLis returned.For example:
-
2 arguments: Returns the first number that occurs inside of
A. -
3 arguments: Returns the first punctuation symbol that occurs inside of
A, excluding the first 10 characters. -
4 arguments: Returns the fourth occurrence of two consecutive lowercase vowels in
A. -
5 arguments: Returns the first 3+ character substring of
Athat starts with and ends with a vowel (case-insensitive, and it can contain newline characters). -
6 arguments: Looks for third occurrence in
Aof a number followed by a colon, a space, and a word that starts with'a'(case-sensitive) and returns the word that starts with'a'.
-
REGEXP_INSTR¶
-
REGEXP_INSTR(str, pattern[, position[, occurrence[, option[, flag[, group_num]]]]])Returns the location within the original string that caused a match with the pattern, starting at the location specified by the
positionargument (with 1-indexing). The occurrence argument specifies which match to extract (with 1-indexing). The option argument specifies whether to return the start of the match (if0) or the first location after the end of the match (if1). Ifpositionis not provided,1is used. Ifoccurrenceis not provided,1is used. Ifoptionis not provided,0is used. Ifflagis not provided,''is used. Ifgroup_numis not provided, andflagcontains'e',1is used. Ifgroup_numis provided but the flag does not containe, then it behaves as if it did. If the flag does containe, then the location of one of the subgroups of the match is returned instead of the location of the entire match. The subgroup returned corresponds to thegroup_numargument (with 1-indexing).If there are an insufficient number of matches, or the pattern is empty,
0is returned.-
2 arguments: Returns the index of the first
'#'inA. -
3 arguments: Returns the starting index of the first occurrence of 3 consecutive digits in
A, excluding the first 3 characters. ```sql SELECT REGEXP_INSTR(A, '\d{3}', 4) -
5 arguments: Returns the ending index of the first substring of
Athat starts and ends with non-ascii characters. -
6 arguments: Returns the starting index of the second line of
Athat begins with an uppercase vowel. -
7 arguments: Looks for the first substring of
Athat has the format of a name in a phonebook (i.e.Lastname, Firstname) and returns the starting index of the first name.
-