BodoSQL¶
BodoSQL provides high performance and scalable SQL query execution using Bodo's HPC capabilities and optimizations. It also provides native Python/SQL integration as well as SQL to Pandas conversion for the first time.
Getting Started¶
Installation¶
Install BodoSQL using:
Using BodoSQL¶
The example below demonstrates using BodoSQL in Python programs. It loads data into a dataframe, runs a SQL query on the data, and runs Python/Pandas code on query results:
import pandas as pd
import bodo
import bodosql
@bodo.jit
def f(filename):
df1 = pd.read_parquet(filename)
bc = bodosql.BodoSQLContext({"TABLE1": df1})
df2 = bc.sql("SELECT A FROM table1 WHERE B > 4")
print(df2.A.sum())
f("my_data.pq")
This program is fully type checked, optimized and parallelized by Bodo
end-to-end. BodoSQLContext
creates a SQL environment with tables
created from dataframes. BodoSQLContext.sql()
runs a SQL query and
returns the results as a dataframe. BodoSQLContext
can be used outside
Bodo JIT functions if necessary as well.
You can run this example by creating my_data.pq
:
import pandas as pd
import numpy as np
NUM_GROUPS = 30
NUM_ROWS = 20_000_000
df = pd.DataFrame({
"A": np.arange(NUM_ROWS) % NUM_GROUPS,
"B": np.arange(NUM_ROWS)
})
df.to_parquet("my_data.pq")
To run the example, save it in a file called example.py
and run it using mpiexec
, e.g.:
Aliasing¶
In all but the most trivial cases, BodoSQL generates internal names to avoid conflicts in the intermediate dataframes. By default, BodoSQL does not rename the columns for the final output of a query using a consistent approach. For example the query:
Results in an output column named$EXPR0
. To reliably reference this
column later in your code, we highly recommend using aliases for all
columns that are the final outputs of a query, such as:
Note
BodoSQL supports using aliases generated in SELECT
inside
GROUP BY
and HAVING
in the same query, but you cannot do so with
WHERE
.
Supported Operations¶
We currently support the following SQL query statements and clauses with
BodoSQL, and are continuously adding support towards completeness. Note
that BodoSQL ignores casing of keywords, and column and table names,
except for the final output column name. Therefore,
select a from table1
is treated the same as SELECT A FROM Table1
,
except for the names of the final output columns (a
vs A
).
SELECT¶
The SELECT
statement is used to select data in the form of
columns. The data returned from BodoSQL is stored in a dataframe.
For Instance:
Example Usage:
>>>@bodo.jit
... def g(df):
... bc = bodosql.BodoSQLContext({"CUSTOMERS":df})
... query = "SELECT name FROM customers"
... res = bc.sql(query)
... return res
>>>customers_df = pd.DataFrame({
... "CUSTOMERID": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
... "NAME": ["Deangelo Todd","Nikolai Kent","Eden Heath", "Taliyah Martinez",
... "Demetrius Chavez","Weston Jefferson","Jonathon Middleton",
... "Shawn Winters","Keely Hutchinson", "Darryl Rosales",],
... "BALANCE": [1123.34, 2133.43, 23.58, 8345.15, 943.43, 68.34, 12764.50, 3489.25, 654.24, 25645.39]
... })
>>>g(customers_df)
NAME
0 Deangelo Todd
1 Nikolai Kent
2 Eden Heath
3 Taliyah Martinez
4 Demetrius Chavez
5 Weston Jefferson
6 Jonathon Middleton
7 Shawn Winters
8 Keely Hutchinson
9 Darryl Rosales
The SELECT
also has some special syntactic forms. The *
term is
used as a shortcut for specifying all columns. The clause * EXCLUDE col
or * EXCLUDE (col1, col2, col3...)
is a shortcut for specifying every
column except the ones after the EXCLUDE keyword.
For example, suppose we have a table The T
with columns named The A
, B
,
C
, D
, E
. Consider the following queries
These two are syntactic sugar for the following:
SELECT DISTINCT¶
The SELECT DISTINCT
statement is used to return only distinct
(different) values:
DISTINCT
can be used in a SELECT statement or inside an
aggregate function. For example:
Example Usage
>>>@bodo.jit
... def g(df):
... bc = bodosql.BodoSQLContext({"PAYMENTS":df})
... query = "SELECT DISTINCT \"paymentType\" FROM payments"
... res = bc.sql(query)
... return res
>>>payment_df = pd.DataFrame({
... "CUSTOMERID": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
... "paymentType": ["VISA", "VISA", "AMEX", "VISA", "WIRE", "VISA", "VISA", "WIRE", "VISA", "AMEX"],
... })
>>>g(payment_df) # inside SELECT
paymentType
0 VISA
2 AMEX
4 WIRE
>>>def g(df):
... bc = bodosql.BodoSQLContext({"PAYMENTS":df})
... query = "SELECT COUNT(DISTINCT \"paymentType\") as num_payment_types FROM payments"
... res = bc.sql(query)
... return res
>>>g(payment_df) # inside aggregate
NUM_PAYMENT_TYPES
0 3
WHERE¶
The WHERE
clause on columns can be used to filter records that
satisfy specific conditions:
For Example:
Example Usage
>>>@bodo.jit
... def g(df):
... bc = bodosql.BodoSQLContext({"CUSTOMERS":df})
... query = "SELECT name FROM customers WHERE balance 3000"
... res = bc.sql(query)
... return res
>>>customers_df = pd.DataFrame({
... "CUSTOMERID": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
... "NAME": ["Deangelo Todd","Nikolai Kent","Eden Heath", "Taliyah Martinez",
... "Demetrius Chavez","Weston Jefferson","Jonathon Middleton",
... "Shawn Winters","Keely Hutchinson", "Darryl Rosales",],
... "BALANCE": [1123.34, 2133.43, 23.58, 8345.15, 943.43, 68.34, 12764.50, 3489.25, 654.24, 25645.39]
... })
>>>g(customers_df)
NAME
3 Taliyah Martinez
6 Jonathon Middleton
7 Shawn Winters
9 Darryl Rosales
ORDER BY¶
The ORDER BY
keyword sorts the resulting DataFrame in ascending
or descending order. By default, it sorts the records in ascending order.
NULLs are sorted in accordance with the optional NULLS FIRST
or
NULLS LAST
keywords.
If the null ordering is not provided, then the default ordering depends
on if the column is ascending or descending. For ascending order, by
default NULL values are returned at the end, while for descending order
nulls are returned at the front. If the order of nulls matter we strongly
recommend explicitly providing either NULLS FIRST
or
NULLS LAST
.
SELECT <COLUMN_NAMES>
FROM <TABLE_NAME>
ORDER BY <ORDERED_COLUMN_NAMES> [ASC|DESC] [NULLS FIRST|LAST]
For Example:
Example Usage
>>>@bodo.jit
... def g(df):
... bc = bodosql.BodoSQLContext({"CUSTOMERS":df})
... query = "SELECT name, balance FROM customers ORDER BY balance"
... res = bc.sql(query)
... return res
>>>customers_df = pd.DataFrame({
... "CUSTOMERID": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
... "NAME": ["Deangelo Todd","Nikolai Kent","Eden Heath", "Taliyah Martinez",
... "Demetrius Chavez","Weston Jefferson","Jonathon Middleton",
... "Shawn Winters","Keely Hutchinson", "Darryl Rosales",],
... "BALANCE": [1123.34, 2133.43, 23.58, 8345.15, 943.43, 68.34, 12764.50, 3489.25, 654.24, 25645.39]
... })
>>>g(customers_df)
NAME BALANCE
2 Eden Heath 23.58
5 Weston Jefferson 68.34
8 Keely Hutchinson 654.24
4 Demetrius Chavez 943.43
0 Deangelo Todd 1123.34
1 Nikolai Kent 2133.43
7 Shawn Winters 3489.25
3 Taliyah Martinez 8345.15
6 Jonathon Middleton 12764.50
9 Darryl Rosales 25645.39
LIMIT¶
BodoSQL supports the LIMIT
keyword to select a limited number
of rows. This keyword can optionally include an offset:
SELECT <COLUMN_NAMES>
FROM <TABLE_NAME>
WHERE <CONDITION>
LIMIT <LIMIT_NUMBER> OFFSET <OFFSET_NUMBER>
>>>@bodo.jit
... def g1(df):
... bc = bodosql.BodoSQLContext({"CUSTOMERS":df})
... query = "SELECT name FROM customers LIMIT 4"
... res = bc.sql(query)
... return res
>>>@bodo.jit
... def g2(df):
... bc = bodosql.BodoSQLContext({"CUSTOMERS":df})
... query = "SELECT name FROM customers LIMIT 4 OFFSET 2"
... res = bc.sql(query)
... return res
>>>customers_df = pd.DataFrame({
... "CUSTOMERID": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
... "NAME": ["Deangelo Todd","Nikolai Kent","Eden Heath", "Taliyah Martinez",
... "Demetrius Chavez","Weston Jefferson","Jonathon Middleton",
... "Shawn Winters","Keely Hutchinson", "Darryl Rosales",],
... "BALANCE": [1123.34, 2133.43, 23.58, 8345.15, 943.43, 68.34, 12764.50, 3489.25, 654.24, 25645.39]
... })
>>>g1(customers_df) # LIMIT 4
NAME
0 Deangelo Todd
1 Nikolai Kent
2 Eden Heath
3 Taliyah Martinez
>>>g2(customers_df) # LIMIT 4 OFFSET 2
NAME
2 Eden Heath
3 Taliyah Martinez
4 Demetrius Chavez
5 Weston Jefferson
NOT IN¶
The IN
determines if a value can be chosen a list of options.
Currently, we support lists of literals or columns with matching
types:
>>>@bodo.jit
... def g1(df):
... bc = bodosql.BodoSQLContext({"PAYMENTS":df})
... query = "SELECT customerID FROM payments WHERE \"paymentType\" IN ('AMEX', 'WIRE')"
... res = bc.sql(query)
... return res
>>>@bodo.jit
... def g2(df):
... bc = bodosql.BodoSQLContext({"PAYMENTS":df})
... query = "SELECT customerID FROM payments WHERE \"paymentType\" NOT IN ('AMEX', 'VISA')"
... res = bc.sql(query)
... return res
>>>payment_df = pd.DataFrame({
... "CUSTOMERID": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
... "paymentType": ["VISA", "VISA", "AMEX", "VISA", "WIRE", "VISA", "VISA", "WIRE", "VISA", "AMEX"],
... })
>>>g1(payment_df) # IN
CUSTOMERID
2 2
4 4
7 7
9 9
>>>g2(payment_df) # NOT IN
CUSTOMERID
4 4
7 7
NOT BETWEEN¶
The BETWEEN
operator selects values within a given range. The
values can be numbers, text, or datetimes. The BETWEEN
operator
is inclusive: begin and end values are included:
>>>@bodo.jit
... def g(df):
... bc = bodosql.BodoSQLContext({"CUSTOMERS":df})
... query = "SELECT name, balance FROM customers WHERE balance BETWEEN 1000 and 5000"
... res = bc.sql(query)
... return res
>>>@bodo.jit
... def g2(df):
... bc = bodosql.BodoSQLContext({"CUSTOMERS":df})
... query = "SELECT name, balance FROM customers WHERE balance NOT BETWEEN 100 and 10000"
... res = bc.sql(query)
... return res
>>>customers_df = pd.DataFrame({
... "CUSTOMERID": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
... "NAME": ["Deangelo Todd","Nikolai Kent","Eden Heath", "Taliyah Martinez",
... "Demetrius Chavez","Weston Jefferson","Jonathon Middleton",
... "Shawn Winters","Keely Hutchinson", "Darryl Rosales",],
... "BALANCE": [1123.34, 2133.43, 23.58, 8345.15, 943.43, 68.34, 12764.50, 3489.25, 654.24, 25645.39]
... })
>>>g1(payment_df) # BETWEEN
NAME BALANCE
0 Deangelo Todd 1123.34
1 Nikolai Kent 2133.43
7 Shawn Winters 3489.25
>>>g2(payment_df) # NOT BETWEEN
NAME BALANCE
2 Eden Heath 23.58
5 Weston Jefferson 68.34
6 Jonathon Middleton 12764.50
9 Darryl Rosales 25645.39
CAST¶
THE CAST
operator converts an input from one type to another. In
many cases casts are created implicitly, but this operator can be
used to force a type conversion.
The following casts are currently supported. Please refer to
supported_dataframe_data_types
for
the Python types for each type keyword:
From | To | Notes |
---|---|---|
VARCHAR |
VARCHAR |
|
VARCHAR |
TINYINT/SMALLINT/INTEGER/BIGINT |
|
VARCHAR |
FLOAT/DOUBLE |
|
VARCHAR |
DECIMAL |
Equivalent to DOUBLE . This may change in the future. |
VARCHAR |
TIMESTAMP |
|
VARCHAR |
DATE |
Truncates to date but is still Timestamp type. This may change in the future. |
TINYINT/SMALLINT/INTEGER/BIGINT |
VARCHAR |
|
TINYINT/SMALLINT/INTEGER/BIGINT |
TINYINT/SMALLINT/INTEGER/BIGINT |
|
TINYINT/SMALLINT/INTEGER/BIGINT |
FLOAT/DOUBLE |
|
TINYINT/SMALLINT/INTEGER/BIGINT |
DECIMAL |
Equivalent to DOUBLE . This may change in the future. |
TINYINT/SMALLINT/INTEGER/BIGINT |
TIMESTAMP |
|
FLOAT/DOUBLE |
VARCHAR |
|
FLOAT/DOUBLE |
TINYINT/SMALLINT/INTEGER/BIGINT |
|
FLOAT/DOUBLE |
FLOAT/DOUBLE |
|
FLOAT/DOUBLE |
DECIMAL |
Equivalent to DOUBLE . This may change in the future |
TIMESTAMP |
VARCHAR |
|
TIMESTAMP |
TINYINT/SMALLINT/INTEGER/BIGINT |
|
TIMESTAMP |
TIMESTAMP |
|
TIMESTAMP |
DATE |
Truncates to date but is still Timestamp type. This may change in the future. |
Note
CAST
correctness can often not be determined at compile time.
Users are responsible for ensuring that conversion is possible
(e.g. CAST(str_col as INTEGER)
).
::¶
Infix cast operator. Equivalent to cast, but the format is value::Typename
JOIN¶
A JOIN
clause is used to combine rows from two or more tables,
based on a related column between them:
SELECT <COLUMN_NAMES>
FROM <LEFT_TABLE_NAME>
<JOIN_TYPE> <RIGHT_TABLE_NAME>
ON <LEFT_TABLE_COLUMN_NAME> OP <RIGHT_TABLE_COLUMN_NAME>
(INNER) JOIN
: returns records that have matching values in both tablesLEFT (OUTER) JOIN
: returns all records from the left table, and the matched records from the right tableRIGHT (OUTER) JOIN
: returns all records from the right table, and the matched records from the left tableFULL (OUTER) JOIN
: returns all records when there is a match in either left or right table
Example Usage
>>>@bodo.jit
... def g1(df1, df2):
... bc = bodosql.BodoSQLContext({"CUSTOMERS":df1, "PAYMENTS":df2})
... query = "SELECT name, \"paymentType\" FROM customers JOIN payments ON customers.customerID = payments.customerID"
... res = bc.sql(query)
... return res
>>>@bodo.jit
... def g2(df1, df2):
... bc = bodosql.BodoSQLContext({"CUSTOMERS":df1, "PAYMENTS":df2})
... query = "SELECT name, paymentType FROM customers FULL JOIN payments ON customers.customerID = payments.customerID"
... res = bc.sql(query)
... return res
>>>customer_df = pd.DataFrame({
... "CUSTOMERID": [0, 2, 4, 5, 7,],
... "NAME": ["Deangelo Todd","Nikolai Kent","Eden Heath", "Taliyah Martinez","Demetrius Chavez",],
... "ADDRESS": ["223 Iroquois LanenWest New York, NJ 07093","37 Depot StreetnTaunton, MA 02780",
... "639 Maple St.nNorth Kingstown, RI 02852","93 Bowman Rd.nChester, PA 19013",
... "513 Manchester Ave.nWindsor, CT 06095",],
... "BALANCE": [1123.34, 2133.43, 23.58, 8345.15, 943.43,]
... })
>>>payment_df = pd.DataFrame({
... "CUSTOMERID": [0, 1, 4, 6, 7],
... "paymentType": ["VISA", "VISA", "AMEX", "VISA", "WIRE",],
... })
>>>g1(customer_df, payment_df) # INNER JOIN
NAME paymentType
0 Deangelo Todd VISA
1 Eden Heath AMEX
2 Demetrius Chavez WIRE
>>>g2(customer_df, payment_df) # OUTER JOIN
NAME paymentType
0 Deangelo Todd VISA
1 Nikolai Kent NaN
2 Eden Heath AMEX
3 Taliyah Martinez NaN
4 Demetrius Chavez WIRE
5 NaN VISA
6 NaN VISA
NATURAL JOIN¶
A natural join is a type of join that provides an equality condition on all columns with the same name and only returns 1 column for the keys. On cannot be provided because it is implied but all join types can be provided.
For example: Here are the different types of the joins in SQL:(INNER) JOIN
: returns records that have matching values in both tablesLEFT (OUTER) JOIN
: returns all records from the left table, and the matched records from the right tableRIGHT (OUTER) JOIN
: returns all records from the right table, and the matched records from the left tableFULL (OUTER) JOIN
: returns all records when there is a match in either left or right table
Example Usage
>>>@bodo.jit
... def g1(df1, df2):
... bc = bodosql.BodoSQLContext({"CUSTOMERS":df1, "PAYMENTS":df2})
... query = "SELECT payments.* FROM customers NATURAL JOIN payments"
... res = bc.sql(query)
... return res
>>>@bodo.jit
... def g2(df1, df2):
... bc = bodosql.BodoSQLContext({"CUSTOMERS":df1, "PAYMENTS":df2})
... query = "SELECT payments.* FROM customers NATURAL FULL JOIN payments"
... res = bc.sql(query)
... return res
>>>customer_df = pd.DataFrame({
... "CUSTOMERID": [0, 2, 4, 5, 7,],
... "NAME": ["Deangelo Todd","Nikolai Kent","Eden Heath", "Taliyah Martinez","Demetrius Chavez",],
... "ADDRESS": ["223 Iroquois LanenWest New York, NJ 07093","37 Depot StreetnTaunton, MA 02780",
... "639 Maple St.nNorth Kingstown, RI 02852","93 Bowman Rd.nChester, PA 19013",
... "513 Manchester Ave.nWindsor, CT 06095",],
... "BALANCE": [1123.34, 2133.43, 23.58, 8345.15, 943.43,]
... })
>>>payment_df = pd.DataFrame({
... "CUSTOMERID": [0, 1, 4, 6, 7],
... "paymentType": ["VISA", "VISA", "AMEX", "VISA", "WIRE",],
... })
>>>g1(customer_df, payment_df) # INNER JOIN
CUSTOMERID paymentType
0 0 VISA
1 4 AMEX
2 7 WIRE
>>>g2(customer_df, payment_df) # OUTER JOIN
CUSTOMERID paymentType
0 0 VISA
1 <NA> <NA>
2 4 AMEX
3 <NA> <NA>
4 7 WIRE
5 1 VISA
6 6 VISA
UNION¶
The UNION
operator is used to combine the result-set of two SELECT
statements:
SELECT
statement within the UNION
clause must have the same
number of columns. The columns must also have similar data types.
The output of the UNION
is the set of rows which are present in
either of the input SELECT
statements.
The UNION
operator selects only the distinct values from the
inputs by default. To allow duplicate values, use UNION ALL
:
Example Usage
>>>@bodo.jit
... def g1(df):
... bc = bodosql.BodoSQLContext({"CUSTOMERS":df1, "PAYMENTS":df2})
... query = "SELECT name, \"paymentType\" FROM customers JOIN payments ON customers.customerID = payments.customerID WHERE \"paymentType\" in ('WIRE')
... UNION SELECT name, paymentType FROM customers JOIN payments ON customers.customerID = payments.customerID WHERE balance < 1000"
... res = bc.sql(query)
... return res
>>>@bodo.jit
... def g2(df):
... bc = bodosql.BodoSQLContext({"customers":df1, "payments":df2})
... query = "SELECT name, \"paymentType\" FROM customers JOIN payments ON customers.customerID = payments.customerID WHERE \"paymentType\" in ('WIRE')
... UNION ALL SELECT name, paymentType FROM customers JOIN payments ON customers.customerID = payments.customerID WHERE balance < 1000"
... res = bc.sql(query)
... return res
>>>customer_df = pd.DataFrame({
... "CUSTOMERID": [0, 2, 4, 5, 7,],
... "NAME": ["Deangelo Todd","Nikolai Kent","Eden Heath", "Taliyah Martinez","Demetrius Chavez",],
... "ADDRESS": ["223 Iroquois LanenWest New York, NJ 07093","37 Depot StreetnTaunton, MA 02780",
... "639 Maple St.nNorth Kingstown, RI 02852","93 Bowman Rd.nChester, PA 19013",
... "513 Manchester Ave.nWindsor, CT 06095",],
... "BALANCE": [1123.34, 2133.43, 23.58, 8345.15, 943.43,]
... })
>>>payment_df = pd.DataFrame({
... "CUSTOMERID": [0, 1, 4, 6, 7],
... "paymentType": ["VISA", "VISA", "AMEX", "VISA", "WIRE",],
... })
>>>g1(customer_df, payment_df) # UNION
NAME paymentType BALANCE
0 Demetrius Chavez WIRE 943.43
0 Eden Heath AMEX 23.58
>>>g2(customer_df, payment_df) # UNION ALL
NAME paymentType BALANCE
0 Demetrius Chavez WIRE 943.43
0 Eden Heath AMEX 23.58
1 Demetrius Chavez WIRE 943.43
INTERSECT¶
The INTERSECT
operator is used to calculate the intersection of
two SELECT
statements:
Each SELECT
statement within the INTERSECT
clause must have the
same number of columns. The columns must also have similar data
types. The output of the INTERSECT
is the set of rows which are
present in both of the input SELECT statements. The INTERSECT
operator selects only the distinct values from the inputs.
GROUP BY¶
The GROUP BY
statement groups rows that have the same values
into summary rows, like "find the number of customers in each
country". The GROUP BY
statement is often used with aggregate
functions to group the result-set by one or more columns:
SELECT <COLUMN_NAMES>
FROM <TABLE_NAME>
WHERE <CONDITION>
GROUP BY <GROUP_EXPRESSION>
ORDER BY <COLUMN_NAMES>
For example:
GROUP BY
statements also referring to columns by alias or
column number:
BodoSQL supports several subclauses that enable grouping by multiple different
sets of columns in the same SELECT
statement. GROUPING SETS
is the first. It is
equivalent to performing a group by for each specified set (setting each column not
present in the grouping set to null), and unioning the results. For example:
This is equivalent to:
SELECT * FROM
(SELECT MAX(A), B, null FROM table1 GROUP BY B)
UNION
(SELECT MAX(A), B, null FROM table1 GROUP BY B)
UNION
(SELECT MAX(A), B, C FROM table1 GROUP BY B, C)
UNION
(SELECT MAX(A), null, null FROM table1)
Note
The above example is not valid BodoSQL code, as we do not support null literals. It is used only to show the null filling behavior.
CUBE
is equivalent to grouping by all possible permutations of the specified set.
For example:
Is equivalent to
ROLLUP
is equivalent to grouping by n + 1 grouping sets, where each set is constructed by dropping the rightmost element from the previous set, until no elements remain in the grouping set. For example:
Is equivalent to
CUBE
and ROLLUP
can be nested into a GROUPING SETS
clause. For example:
Which is equivalent to
HAVING¶
The HAVING
clause is used for filtering with GROUP BY
.
HAVING
applies the filter after generating the groups, whereas
WHERE
applies the filter before generating any groups:
For example:
HAVING
statements also referring to columns by aliases used in
the GROUP BY
:
QUALIFY¶
QUALIFY
is similar to HAVING
, except it applies filters after computing the results of at least one window function. QUALIFY
is used after using WHERE
and HAVING
.
For example:
SELECT column_name(s),
FROM table_name
WHERE condition
GROUP BY column_name(s)
HAVING condition
QUALIFY MAX(A) OVER (PARTITION BY B ORDER BY C ROWS BETWEEN 1 FOLLOWING AND 1 PRECEDING) > 1
Is equivalent to
SELECT column_name(s) FROM
(SELECT column_name(s), MAX(A) OVER (PARTITION BY B ORDER BY C ROWS BETWEEN 1 FOLLOWING AND 1 PRECEDING) as window_output
FROM table_name
WHERE condition
GROUP BY column_name(s)
HAVING condition)
WHERE window_output > 1
CASE¶
The CASE
statement goes through conditions and returns a value
when the first condition is met:
CASE
statement will return null if
none of the conditions are met. For example:
is equivalent to:
LIKE¶
The LIKE
clause is used to filter the strings in a column to
those that match a pattern:
%
and _
. For example:
GREATEST¶
The GREATEST
clause is used to return the largest value from a
list of columns:
LEAST¶
The LEAST
clause is used to return the smallest value from a
list of columns:
PIVOT¶
The PIVOT
clause is used to transpose specific data rows in one
or more columns into a set of columns in a new DataFrame:
SELECT col1, ..., colN FROM table_name PIVOT (
AGG_FUNC_1(colName or pivotVar) AS alias1, ..., AGG_FUNC_N(colName or pivotVar) as aliasN
FOR pivotVar IN (ROW_VALUE_1 as row_alias_1, ..., ROW_VALUE_N as row_alias_N)
)
PIVOT
produces a new column for each pair of pivotVar and
aggregation functions.
For example:
SELECT single_sum_a, single_avg_c, triple_sum_a, triple_avg_c FROM table1 PIVOT (
SUM(A) AS sum_a, AVG(C) AS avg_c
FOR A IN (1 as single, 3 as triple)
)
single_sum_a
will contain sum(A) where A = 1
,
single_avg_c will contain AVG(C) where A = 1
etc.
If you explicitly specify other columns as the output, those columns will be used to group the pivot columns. For example:
SELECT B, single_sum_a, single_avg_c, triple_sum_a, triple_avg_c FROM table1 PIVOT (
SUM(A) AS sum_a, AVG(C) AS avg_c
FOR A IN (1 as single, 3 as triple)
)
SELECT * FROM table1 PIVOT (
SUM(A) AS sum_a, AVG(C) AS avg_c
FOR (A, B) IN ((1, 4) as col1, (2, 5) as col2)
)
WITH¶
The WITH
clause can be used to name sub-queries:
Aliasing¶
SQL aliases are used to give a table, or a column in a table, a temporary name:
For example:
We strongly recommend using aliases for the final outputs of any queries to ensure all column names are predictable.
Operators¶
Arithmetic¶
-
BodoSQL currently supports the following arithmetic operators:
+
(addition)-
(subtraction)*
(multiplication)/
(true division)%
(modulo)
Comparison¶
-
BodoSQL currently supports the following comparison operators:
=
(equal to)>
(greater than)<
(less than)>=
(greater than or equal to)<=
(less than or equal to)<>
(not equal to)!=
(not equal to)<=>
(equal to or both inputs are null)
Logical¶
-
BodoSQL currently supports the following logical operators:
AND
OR
NOT
String¶
-
BodoSQL currently supports the following string operators:
||
(string concatenation)
Numeric Functions¶
Except where otherwise specified, the inputs to each of these functions can be any numeric type, column or scalar. Here is an example using MOD:
BodoSQL Currently supports the following Numeric Functions:
ABS¶
-
ABS(n)
Returns the absolute value of n
COS¶
-
COS(n)
Calculates the Cosine of n
SIN¶
-
SIN(n)
Calculates the Sine of n
TAN¶
-
TAN(n)
Calculates the Tangent of n
COTAN¶
-
COTAN(X)
Calculates the Cotangent of
X
ACOS¶
-
ACOS(n)
Calculates the Arccosine of n
ASIN¶
-
AIN(n)
Calculates the Arcsine of n
ATAN¶
-
ATAN(n)
Calculates the Arctangent of n
ATAN2¶
-
ATAN2(A, B)
Calculates the Arctangent of
A
divided byB
CEIL¶
-
CEIL(X[, scale])
Converts X to the specified scale, rounding towards positive infinity. For example,
scale=0
rounds up to the nearest integer,scale=2
rounds up to the nearest0.01
, andscale=-1
rounds up to the nearest multiple of 10.
CEILING¶
-
CEILING(X)
Equivalent to
CEIL
FLOOR¶
-
FLOOR(X[, scale])
Converts X to the specified scale, rounding towards negative infinity. For example,
scale=0
down up to the nearest integer,scale=2
rounds down to the nearest0.01
, andscale=-1
rounds down to the nearest multiple of 10.
DEGREES¶
-
DEGREES(X)
Converts a value in radians to the corresponding value in degrees
RADIANS¶
-
RADIANS(X)
Converts a value in radians to the corresponding value in degrees
LOG10¶
-
LOG10(X)
Computes Log base 10 of x. Returns NaN for negative inputs, and -inf for 0 inputs.
LOG¶
-
LOG(X)
Equivalent to
LOG10(x)
LOG10¶
-
LOG10(X)
Computes Log base 2 of x. Returns
NaN
for negative inputs, and-inf
for 0 inputs.
LN¶
-
LN(X)
Computes the natural log of x. Returns
NaN
for negative inputs, and-inf
for 0 inputs.
MOD¶
-
MOD(A,B)
Computes A modulo B (behavior analogous to the C library function
fmod
). ReturnsNaN
if B is 0 or if A is inf.
CONV¶
-
CONV(X, current_base, new_base)
CONV
takes a string representation of an integer value, it's current_base, and the base to convert that argument to.CONV
returns a new string, that represents the value in the new base.CONV
is only supported for converting to/from base 2, 8, 10, and 16.For example:
SQRT¶
SQRT(X)
Computes the square root of x. Returns
NaN
for negative inputs, and-inf
for 0 inputs.
PI¶
-
PI()
Returns the value of
PI
POW, POWER¶
-
POW(A, B), POWER(A, B)
Returns A to the power of B. Returns
NaN
if A is negative, and B is a float.POW(0,0)
is 1
EXP¶
-
EXP(X)
Returns e to the power of X
SIGN¶
-
SIGN(X)
Returns 1 if X > 0, -1 if X < 0, and 0 if X = 0
ROUND¶
-
ROUND(X[, num_decimal_places])
Rounds X to the specified number of decimal places
TRUNCATE¶
-
TRUNCATE(X[, num_decimal_places])
Equivalent to
ROUND(X, num_decimal_places)
. Ifnum_decimal_places
is not supplied, it defaults to 0.
TRUNC¶
-
TRUNC(X[, num_decimal_places])
Equivalent to
TRUNC(X[, num_decimal_places])
ifX
is numeric. Note thatTRUNC
is overloaded and may invoke the timestamp functionTRUNC
ifX
is a date or time expression.
BITAND¶
-
BITAND(A, B)
Returns the bitwise-and of its inputs.
BITOR¶
-
BITOR(A, B)
Returns the bitwise-or of its inputs.
BITXOR¶
-
BITOR(A, B)
Returns the bitwise-xor of its inputs.
BITNOT¶
-
BITNOT(A)
Returns the bitwise-negation of its input.
BITSHIFTLEFT¶
-
BITSHIFTLEFT(A, B)
Returns the bitwise-leftshift of its inputs.
Note
- The output is always of type int64.
- Undefined behavior when B is negative or too large.
BITSHIFTRIGHT¶
-
BITSHIFTRIGHT(A, B)
Returns the bitwise-rightshift of its inputs. Undefined behavior when B is negative or too large.
GETBIT¶
-
GETBIT(A, B)
Returns the bit of A corresponding to location B, where 0 is the rightmost bit. Undefined behavior when B is negative or too large.
BOOLAND¶
-
BOOLAND(A, B)
Returns true when
A
andB
are both non-null non-zero. Returns false when one of the arguments is zero and the other is either zero orNULL
. ReturnsNULL
otherwise.
BOOLOR¶
-
BOOLOR(A, B)
Returns true if either
A
orB
is non-null and non-zero. Returns false if bothA
andB
are zero. ReturnsNULL
otherwise.
BOOLXOR¶
-
BOOLXOR(A, B)
Returns true if one of
A
andB
is zero and the other is non-zero. Returns false ifA
andB
are both zero or both non-zero. ReturnsNULL
if eitherA
orB
isNULL
.
BOOLNOT¶
-
BOOLNOT(A)
Returns true if
A
is zero. Returns false ifA
is non-zero. ReturnsNULL
ifA
isNULL
.
REGR_VALX¶
-
REGR_VALX(Y, X)
Returns
NULL
if either input isNULL
, otherwiseX
REGR_VALY¶
-
REGR_VALY(Y, X)
Returns
NULL
if either input isNULL
, otherwiseY
HASH¶
-
HASH(A, B, C, ...)
Takes in a variable number of arguments of any type and returns a hash value that considers the values in each column. The hash function is deterministic across multiple ranks or multiple sessions.
Also supports the syntactic sugar forms
HASH(*)
andHASH(T.*)
as shortcuts for referencing all of the columns in a table, or multiple tables. For example, ifT1
has columnsA
andB
, andT2
has columnsA
,E
andI
, then the following query:SELECT HASH(*), HASH(T1.*) FROM T1 INNER JOIN T2 ON T1.A=T2.I
Would be syntactic sugar for the following:
SELECT HASH(T1.A, T1.B, T2.A, T2.E, T2.I), HASH(T1.A, T1.B) FROM T1 INNER JOIN T2 ON T1.A=T2.I
Data Generation Functions¶
BodoSQL Currently supports the following data generaiton functions:
RANDOM¶
-
RANDOM()
Outputs a random 64-bit integer. If used inside of a select statement with a table, the number of random values will match the number of rows in the input table (and each value should be randomly and independently generated). Note that running with multiple processors may affect the randomization results.
Note
Currently, BodoSQL does not support the format of
RANDOM()
that takes in a seed value.Note
At present, aliases to
RANDOM
calls occasionally produce unexpected behavior. For certain SQL operations, callingRANDOM
and storing the result with an alias, then later re-using that alias may result in another call toRANDOM
. This behavior is somewhat rare.
UNIFORM¶
-
UNIFORM(lo, hi, gen)
Outputs a random number uniformly distributed in the interval
[lo, hi]
. Iflo
andhi
are both integers, then the output is an integer betweenlo
andhi
(including both endpoints). If eitherlo
orhi
is a float, the output is a random float between them. The values ofgen
are used to seed the randomness, so ifgen
is all distinct values (or is randomly generated) then the output ofUNIFORM
should be random. However, if 2 rows have the samegen
value they will produce the same output value.
UUID_STRING¶
-
UUID_STRING([uuid, name])
Outputs a UUID (as defined by RFC 4122) formatted as a string. If no arguments are provdied then a version 4 (randomly generated) UUID is returned, otherwise a version 5 (generated by name) UUID is returned.
Aggregation & Window Functions¶
An aggregation function can be used to combine data across many
rows to form a single answer. Aggregations can be done with a
GROUP BY
clause, in which case the combined value is
calculated once per unique combination of groupbing keys. Aggregations
can also be done without the GROUP BY
clause, in which case
a single value is outputted by calculating the aggregation across
all rows.
For example:
Window functions can be used to compute an aggregation across a row and its surrounding rows. Most window functions have the following syntax:
SELECT WINDOW_FN(ARG1, ..., ARGN) OVER (PARTITION BY PARTITION_COLUMN_1, ..., PARTITION_COLUMN_N ORDER BY SORT_COLUMN_1, ..., SORT_COLUMN_N ROWS BETWEEN <LOWER_BOUND> AND <UPPER_BOUND>) FROM table_name
ROWS BETWEEN ROWS BETWEEN <LOWER_BOUND> AND <UPPER_BOUND>
section is used to specify the window over which to compute the
function. A bound can can come before the current row, using PRECEDING
or after the current row, using
FOLLOWING
. The bounds can be relative (i.e.
N PRECEDING
or N FOLLOWING
), where N
is a positive integer,
or they can be absolute (i.e. UNBOUNDED PRECEDING
or
UNBOUNDED FOLLOWING
).
For example:
This query computes the sum of every 3 rows, i.e. the sum of a row of interest, its preceding row, and its following row.In contrast:
SELECT SUM(A) OVER (PARTITION BY B ORDER BY C ROWS BETWEEN UNBOUNDED PRECEDING AND 0 FOLLOWING) FROM table1
Note
For most window functions, BodoSQL returns NULL
if the specified window frame
is empty or all NULL
. Exceptions to this behavior are noted.
Window functions perform a series of steps as followed:
- Partition the data by
PARTITION_COLUMN
. This is effectively a groupby operation onPARTITION_COLUMN
. - Sort each group as specified by the
ORDER BY
clause. - Perform the calculation over the specified window, i.e. the newly ordered subset of data.
- Shuffle the data back to the original ordering.
For BodoSQL, PARTITION BY
is required, but
ORDER BY
is optional for most functions and
ROWS BETWEEN
is optional for all of them. If
ROWS BETWEEN
is not specified then it defaults to either
computing the result over the entire window (if no ORDER BY
clause is specified) or to using the window UNBOUNDED PRECEDING TO CURRENT ROW
(if there is an ORDER BY
clause).
Note
RANGE BETWEEN
is not currently supported.
Note
If a window frame contains NaN
values, the output may diverge from Snowflake's
behavior. When a NaN
value enters a window, any window function that combines
the results with arithmetic (e.g. SUM
, AVG
, VARIANCE
, etc.) will output
NaN
until the NaN
value has exited the window.
BodoSQL Currently supports the following Aggregation & Window functions:
Function | Supported with GROUP BY? | Supported without GROUP BY? | Supported as window function? | (WINDOW) Allows ORDER BY? | (WINDOW) Requires ORDER BY? | (WINDOW) Allows frame? |
---|---|---|---|---|---|---|
ANY_VALUE |
Y | Y | Y | Y | N | Y |
APPROX_PERCENTILE |
N | Y | Y | N | N | N |
ARRAY_AGG |
Y | N | N | N/A | N/A | N/A |
ARRAY_UNIQUE_AGG |
Y | N | N | N/A | N/A | N/A |
AVG |
Y | Y | Y | Y | N | Y |
BITAND_AGG |
Y | Y | Y | N | N | N |
BITOR_AGG |
Y | Y | Y | N | N | N |
BITXOR_AGG |
Y | Y | Y | N | N | N |
BOOLAND_AGG |
Y | Y | Y | N | N | N |
BOOLOR_AGG |
Y | Y | Y | N | N | N |
BOOLXOR_AGG |
Y | Y | Y | N | N | N |
CONDITIONAL_CHANGE_EVENT |
N | N | Y | Y | Y | N |
CONDITIONAL_TRUE_EVENT |
N | N | Y | Y | Y | N |
CORR |
N | N | Y | N | N | N |
COUNT |
Y | Y | Y | Y | N | Y |
COUNT(*) |
Y | Y | Y | Y | N | Y |
COUNT_IF |
Y | Y | Y | Y | N | Y |
COVAR_POP |
N | N | Y | N | N | N |
COVAR_SAMP |
N | N | Y | N | N | N |
CUME_DIST |
N | N | Y | Y | Y | N |
DENSE_RANK |
N | N | Y | Y | Y | N |
FIRST_VALUE |
N | N | Y | Y | N | Y |
KURTOSIS |
Y | Y | Y | N | N | N |
LEAD |
N | N | Y | Y | Y | N |
LAST_VALUE |
N | N | Y | Y | N | Y |
LAG |
N | N | Y | Y | Y | N |
LISTAGG |
Y | Y | N | N/A | N/A | N/A |
MAX |
Y | Y | Y | Y | N | Y |
MEDIAN |
Y | Y | Y | N | N | N |
MIN |
Y | Y | Y | Y | N | Y |
MODE |
Y | N | Y | Y | N | N |
NTH_VALUE |
N | N | Y | Y | N | Y |
NTILE |
N | N | Y | Y | Y | N |
OBJECT_AGG |
Y | N | Y | N | N | N |
PERCENTILE_CONT |
Y | Y | N | N/A | N/A | N/A |
PERCENTILE_DISC |
Y | Y | N | N/A | N/A | N/A |
PERCENT_RANK |
N | N | Y | Y | Y | N |
RANK |
N | N | Y | Y | Y | N |
RATIO_TO_REPORT |
N | N | Y | Y | N | N |
ROW_NUMBER |
N | N | Y | Y | Y | N |
SKEW |
Y | Y | Y | Y | N | N |
STDDEV |
Y | Y | Y | Y | N | Y |
STDDEV_POP |
Y | Y | Y | Y | N | Y |
STDDEV_SAMP |
Y | Y | Y | Y | N | Y |
SUM |
Y | Y | Y | Y | N | Y |
VARIANCE |
Y | Y | Y | Y | N | Y |
VARIANCE_POP |
Y | Y | Y | Y | N | Y |
VARIANCE_SAMP |
Y | Y | Y | Y | N | Y |
VAR_POP |
Y | Y | Y | Y | N | Y |
VAR_SAMP |
Y | Y | Y | Y | N | Y |
ANY_VALUE¶
-
ANY_VALUE
Select an arbitrary value from the column/group/window. Supported on all types.
Note
Currently, BodoSQL always selects the first value, but this is subject to change at any time.
APPROX_PERCENTILE¶
-
APPROX_PERCENTILE(A, q)
Returns the approximate value of the
q
-th percentile of columnA
(e.g. 0.5 = median, or 0.9 = the 90th percentile).A
can be any numeric column, andq
can be any scalar float between zero and one.The approximation is calculated using the t-digest algorithm.
ARRAY_AGG¶
-
ARRAY_AGG([DISTINCT] A) [WITHIN GROUP(ORDER BY orderby_terms)]
Combines all the values in column
A
within each group into a single array.Optionally allows using a
WITHIN GROUP
clause to specify how the values should be ordered before being combined into an array. If no clause is specified, then the ordering is unpredictable. Nulls will not be included in the arrays.If the
DISTINCT
keyword is provided, then duplicate elements are removed from each of the arrays. However, if this keyword is provied and aWITHIN GROUP
clause is also provided, then theWITHIN GROUP
clause can only refer to the same column as the aggregation input.
ARRAY_UNIQUE_AGG¶
-
ARRAY_UNIQUE_AGG(A)
Equivalent to
ARRAY_AGG(DISTINCT A)
AVG¶
-
AVG
Compute the mean of the the column/group/window. Supported on all numeric types.
BITAND_AGG¶
-
BITAND_AGG
Compute the bitwise AND of every input in a column/group/window, returning
NULL
if there are no non-NULL
entries. Accepts floating point values, integer values, and strings. Strings are interpreted directly as numbers, converting to 64-bit floating point numbers.
BITOR_AGG¶
-
BITOR_AGG
Compute the bitwise OR of every input in a column/group/window, returning
NULL
if there are no non-NULL
entries. Accepts floating point values, integer values, and strings. Strings are interpreted directly as numbers, converting to 64-bit floating point numbers.
BITXOR_AGG¶
-
BITXOR_AGG
Compute the bitwise XOR of every input in a column/group/window, returning
NULL
if there are no non-NULL
entries. Accepts floating point values, integer values, and strings. Strings are interpreted directly as numbers, converting to 64-bit floating point numbers.
BOOLAND_AGG¶
-
BOOLAND_AGG
Compute the logical AND of the boolean value of every input in a column/group/window, returning
NULL
if there are no non-NULL
entries, otherwise returning True if all non-NULL
entries are also non-zero. This is supported for numeric and boolean types.
BOOLOR_AGG¶
-
BOOLOR_AGG
Compute the logical OR of the boolean value of every input in a column/group/window, returning
NULL
if there are no non-NULL
entries, otherwise returning True if there is at least 1 non-zero entry. This is supported for numeric and boolean types.
BOOLXOR_AGG¶
-
BOOLXOR_AGG
Returns
NULL
if there are no non-NULL
entries, otherwise returning True if exactly one non-NULL
entry is also non-zero (this is counterintuitive to how the logical XOR is normally thought of). This is supported for numeric and boolean types.
CONDITIONAL_CHANGE_EVENT¶
-
CONDITIONAL_CHANGE_EVENT(COLUMN_EXPRESSION)
Computes a counter within each partition that starts at zero and increases by 1 each time the value inside the window changes.
NULL
does not count as a new/changed value.ORDER BY
is required for this function.
CONDITIONAL_TRUE_EVENT¶
-
CONDITIONAL_TRUE_EVENT(BOOLEAN_COLUMN_EXPRESSION)
Computes a counter within each partition that starts at zero and increases by 1 each time the boolean column's value is
true
.ORDER BY
is required for this function.
CORR¶
-
CORR(Y, X)
Compute the correlation over the window of both inputs, or
NULL
if the window is empty. Equivalent toCOVAR(Y, X) / (STDDEV_POP(Y) * STDDEV_POP(X))
COUNT¶
-
COUNT
Count the number of non-null elements in the column/group/window. Supported on all types. If used with the syntax
COUNT(*)
returns the total number of rows instead of non-null rows.
COUNT_IF¶
-
COUNT_IF
Compute the total number of occurrences of
true
in a column/group/window of booleans. For example:Is equivalent to
COVAR_POP¶
-
COVAR_POP(Y, X)
Compute the population covariance over the window of both inputs, or
NULL
if the window is empty. Supported on all numeric types.
COVAR_SAMP¶
-
COVAR_SAMP(Y, X)
Compute the sample covariance over the window of both inputs, or
NULL
if the window is empty. Supported on all numeric types.
CUME_DIST¶
-
CUME_DIST()
Compute the cumulative distribution of the value(s) in each row based on the value(s) relative to all value(s) within the window partition.
ORDER BY
is required for this function.
DENSE_RANK¶
-
DENSE_RANK()
Compute the rank of each row based on the value(s) in the row relative to all value(s) within the partition without producing gaps in the rank (compare with
RANK
). The rank begins with 1 and increments by one for each succeeding value. Rows with the same value(s) produce the same rank.ORDER BY
is required for this function.Note
To compare
RANK
andDENSE_RANK
, on input array['a', 'b', 'b', 'c']
,RANK
will output[1, 2, 2, 4]
whileDENSE_RANK
outputs[1, 2, 2, 3]
.
FIRST_VALUE¶
-
FIRST_VALUE(COLUMN_EXPRESSION)
Select the first value in the window or
NULL
if the window is empty. Supported on all non-semi-structured types.
KURTOSIS¶
-
KURTOSIS
Compute the kurtosis of a column or
NULL
if the window contains fewer than 4 non-NULL
entries. Supported on numeric types.Returns
NULL
if the input is allNULL
or empty.
LAG¶
-
LAG(COLUMN_EXPRESSION, [N], [FILL_VALUE])
Returns the row that precedes the current row by N. If N is not specified, defaults to 1. If FILL_VALUE is not specified, defaults to
NULL
. If there are fewer than N rows the follow the current row in the window, it returns FILL_VALUE. N must be a literal integer if specified. FILL_VALUE must be a scalar if specified. Supported on all non-semi-structured types.
LAST_VALUE¶
-
LAST_VALUE(COLUMN_EXPRESSION)
Select the last value in the window or
NULL
if the window is empty. Supported on all non-semi-structured types.
LEAD¶
-
LEAD(COLUMN_EXPRESSION, [, N[, FILL_VALUE]])
Equivalent to
LEAD(COLUMN_EXPRESSION, -N, FILL_VALUE)
, in other words, returns the row following the current row by N.
LISTAGG¶
-
LISTAGG(str_col[, delimeter]) [WITHIN GROUP (ORDER BY order_col)]
Concatenates all of the strings in
str_col
within each group into a single string seperated by the characters in the stringdelimiter
. If no delimiter is provided, an empty string is used by default.Optionally allows using a
WITHIN GROUP
clause to specify how the strings should be ordered before being concatenated. If no clause is specified, then the ordering is unpredictable.Returns
''
if the input is allNULL
or empty.
MAX¶
-
MAX
Compute the maximum value in the column/group/window. Supported on all non-semi-structured types.
Returns
NULL
if the input is allNULL
or empty.
MEDIAN¶
-
MEDIAN(COLUMN_EXPRESSION)
Compute the median over the column/group/window. Supported on all numeric types.
Returns
NULL
if the input is allNULL
or empty.
MIN¶
-
MIN
Compute the minimum value in the column/group/window. Supported on all types. Supported on all non-semi-structured types.
Returns
NULL
if the input is allNULL
or empty.
MODE¶
-
MODE
Returns the most frequent element in a column/group/window, including
#sql NULL
if that is the element that appears the most. Supported on all non-semi-structured types.Returns
NULL
if the input is allNULL
or empty.Note
In case of a tie, BodoSQL will choose a value arbitrarily based on performance considerations.
NTH_VALUE¶
-
NTH_VALUE(COLUMN_EXPRESSION, N)
Select the last value in the window or
NULL
if the window does not haveN
elements. Uses 1-indexing. RequiresN
to be a positive integer literal. Supported on all non-semi-structured types.
NTILE¶
-
NTILE(N)
Divides the partitioned groups into N buckets based on ordering. For example if N=3 and there are 30 rows in a partition, the first 10 are assigned 1, the next 10 are assigned 2, and the final 10 are assigned 3. In cases where the number of rows cannot be evenly divided by the number of buckets, the first buckets will have one more value than the last bucket. For example, if N=4 and there are 22 rows in a partition, the first 6 are assigned 1, the next 6 are assigned 2, the next 5 are assigned 3, and the last 5 are assigned 4.
OBJECT_AGG¶
-
OBJECT_AGG(K, V)
Combines the data from columns
K
andV
into a JSON object where the rows of columnK
are the field names and the rows of columnV
are the values. Any row whereK
orV
isNULL
is not included in the final object. If the group is empty or all the rows are not included, an empty object is returned.
PERCENTILE_CONT¶
-
APPROX_PEPERCENTILE_CONTRCENTILE(q) WITHIN GROUP (ORDER BY A)
Computes the exact value of the
q
-th percentile of columnA
(e.g. 0.5 = median, or 0.9 = the 90th percentile).A
can be any numeric column, andq
can be any scalar float between zero and one.If no value lies exactly at the desired percentile, the two nearest values are linearly interpolated. For example, consider the dataset
[2, 8, 25, 40]
. If we sought the percentileq=0.25
we would be looking for the value at index 0.75. There is no value at index 0.75, so we linearly interpolate between 2 and 8 to get 6.5.
PERCENTILE_DISC¶
-
PERCENTILE_DISC(q) WITHIN GROUP (ORDER BY A)
Computes the exact value of the
q
-th percentile of columnA
(e.g. 0.5 = median, or 0.9 = the 90th percentile).A
can be any numeric column, andq
can be any scalar float between zero and one.This function differs from
PERCENTILE_CONT
in that it always outputs a value from the original array. The value it chooses is the smallest value inA
such that theCUME_DIST
of all values in the columnA
is greater than or equal toq
. For example, consider the dataset[2, 8, 8, 40]
. TheCUME_DIST
of each of these values is[0.25, 0.75, 0.75, 1.0]
. If we sought the percentileq=0.6
we would output 8 since it has the smallestCUME_DIST
that is>=0.6
.
PERCENT_RANK¶
-
PERCENT_RANK()
Compute the percentage ranking of the value(s) in each row based on the value(s) relative to all value(s) within the window partition. Ranking calculated using
RANK()
divided by the number of rows in the window partition minus one. Partitions with one row havePERCENT_RANK()
of 0.ORDER BY
is required for this function.
RANK¶
-
RANK()
Compute the rank of each row based on the value(s) in the row relative to all value(s) within the partition. The rank begins with 1 and increments by one for each succeeding value. Duplicate value(s) will produce the same rank, producing gaps in the rank (compare with
DENSE_RANK
).ORDER BY
is required for this function.
RATIO_TO_REPORT¶
-
RATIO_TO_REPORT(COLUMN_EXPRESSION)
Returns an element in the window frame divided by the sum of all elements in the same window frame, or
NULL
if the window frame has a sum of zero. For example, if calculatingRATIO_TO_REPORT
on[2, 5, NULL, 10, 3]
where the window is the entire partition, the answer is[0.1, 0.25, NULL, 0.5, 0.15]
.
ROW_NUMBER¶
-
ROW_NUMBER()
Compute an increasing row number (starting at 1) for each row. This function cannot be used with
ROWS BETWEEN
.Note
This window function is supported without a partition.
SKEW¶
-
SKEW
Compute the skew of a column/group/window or
NULL
if the window contains fewer than 3 non-NULL
entries. Supported on numeric types.Returns
NULL
if the input is allNULL
or empty.
STDDEV¶
-
STDDEV
Compute the standard deviation of the column/group/window with N - 1 degrees of freedom. Supported on numeric types.
Returns
NULL
if the input is allNULL
or empty.
STDDEV_SAMP¶
-
STDDEV_SAMP
Compute the standard deviation of the column/group/window with N - 1 degrees of freedom. Supported on numeric types.
Returns
NULL
if the input is allNULL
or empty.
STDDEV_POP¶
-
STDDEV_POP
Compute the standard deviation of the column/group/window with N degrees of freedom. Supported on numeric types.
Returns
NULL
if the input is allNULL
or empty.
SUM¶
-
SUM
Compute the sum of the column/group/window. Supported on numeric types.
Returns
NULL
if the input is allNULL
or empty.
VARIANCE¶
-
VARIANCE
Compute the variance for a column/group/window with N - 1 degrees of freedom. Supported on numeric types.
Returns
NULL
if the input is allNULL
or empty.
VARIANCE_SAMP¶
-
VARIANCE_SAMP
Compute the variance for a column/group/window with N - 1 degrees of freedom. Supported on numeric types.
Returns
NULL
if the input is allNULL
or empty.
VARIANCE_POP¶
-
VARIANCE_POP
Compute the variance for a column/group/window with N degrees of freedom. Supported on numeric types.
Returns
NULL
if the input is allNULL
or empty.
VAR_SAMP¶
-
VAR_SAMP
Compute the variance for a column/group/window with N - 1 degrees of freedom. Supported on numeric types.
Returns
NULL
if the input is allNULL
or empty.
VAR_POP¶
-
VAR_POP
Compute the variance for a column/group/window with N degrees of freedom. Supported on numeric types.
Returns
NULL
if the input is allNULL
or empty.
Timestamp Functions¶
BodoSQL currently supports the following Timestamp functions:
DATEDIFF¶
-
DATEDIFF(timestamp_val1, timestamp_val2)
Computes the difference in days between two Timestamp values (timestamp_val1 - timestamp_val2)
-
DATEDIFF(unit, timestamp_val1, timestamp_val2)
Computes the difference between two Timestamp values (timestamp_val2 - timestamp_val1) in terms of unit
Allows the following units, with the specified abbreviations as string literals:
- YEAR:
year
,years
,yr
,yrs
,y
,yy
,yyy
,yyyy
- QUARTER:
quarter
,quarters
,q
,qtr
,qtrs
- MONTH:
month
,months
,mm
,mon
,mons
- WEEK:
week
,weeks
,weekofyear
,w
,wk
,woy
,wy
- DAY:
day
,days
,dayofmonth
,d
,dd
- HOUR:
hour
,hours
,hrs
,h
,hr
,hrs
- MINUTE:
minute
,minutes
,m
,mi
,min
,mins
- SECOND:
second
,seconds
,s
,sec
,secs
- MILLISECOND:
millisecond
,milliseconds
,ms
,msecs
- MICROSECOND:
microsecond
,microseconds
,us
,usec
- NANOSECOND:
nanosecond
,nanoseconds
,nanosec
,nsec
,nsecs
,nsecond
,ns
,nanonsecs
- YEAR:
STR_TO_DATE¶
-
STR_TO_DATE(str_val, literal_format_string)
Converts a string value to a Timestamp value given a literal format string. If a year, month, and day value is not specified, they default to 1900, 01, and 01 respectively. Will throw a runtime error if the string cannot be parsed into the expected values. See
DATE_FORMAT
for recognized formatting characters.For example:
DATE_FORMAT¶
-
DATE_FORMAT(timestamp_val, literal_format_string)
Converts a timestamp value to a String value given a scalar format string.
Recognized formatting characters:
%i
Minutes, zero padded (00 to 59)%M
Full month name (January to December)%r
Time in format in the format (hh:mm:ss AM/PM)%s
Seconds, zero padded (00 to 59)%T
Time in format in the format (hh:mm:ss)%T
Time in format in the format (hh:mm:ss)%u
week of year, where monday is the first day of the week(00 to 53)%a
Abbreviated weekday name (sun-sat)%b
Abbreviated month name (jan-dec)%f
Microseconds, left padded with 0's, (000000 to 999999)%H
Hour, zero padded (00 to 23)%j
Day Of Year, left padded with 0's (001 to 366)%m
Month number (00 to 12)%p
AM or PM, depending on the time of day%d
Day of month, zero padded (01 to 31)%Y
Year as a 4 digit value%y
Year as a 2 digit value, zero padded (00 to 99)%U
Week of year, where Sunday is the first day of the week (00 to 53)%S
Seconds, zero padded (00 to 59)
For example:
DATE_FROM_PARTS¶
-
DATE_FROM_PARTS(year, month, day)
Constructs a date from the integer inputs specified, e.g.
(2020, 7, 4)
will output July 4th, 2020.Note
Month does not have to be in the 1-12 range, and day does not have to be in the 1-31 range. Values out of bounds are overflowed logically, e.g.
(2020, 14, -1)
will output January 31st, 2021.
DATEFROMPARTS¶
-
DATEFROMPARTS(year, month, day)
Equivalent to
DATE_FROM_PARTS
TIME_FROM_PARTS¶
-
TIME_FROM_PARTS(integer_hour_val, integer_minute_val, integer_second_val [, integer_nanoseconds_val])
Creates a time from individual numeric components. Usually,
integer_hour_val
is in the 0-23 range,integer_minute_val
is in the 0-59 range,integer_second_val
is in the 0-59 range, andinteger_nanoseconds_val
(if provided) is a 9-digit integer.
TIMEFROMPARTS¶
-
TIMEFROMPARTS(integer_hour_val, integer_minute_val, integer_second_val [, integer_nanoseconds_val])
See TIME_FROM_PARTS.
TIMESTAMP_FROM_PARTS¶
TIMESTAMP_FROM_PARTS(year, month, day, hour, minute, second[, nanosecond[, timezone]])
-
TIMESTAMP_FROM_PARTS(date_expr, time_expr)
The first overload is equivalent toDATE_FROM_PARTS
but also takes in an hour, minute and second (which can be out of bounds just like the month/day). Optionally takes in a nanosecond value, and a timezone value for the output. If the timezone is not specified, the output is timezone-naive. Note that if any numeric argument cannot be converted to an int64, then it will become NULL.Note
Timezone argument is not supported at this time.
The second overload constructs the timestamp by combining the date and time arguments. The output of this function is always timestamp-naive.
TIMESTAMPFROMPARTS¶
TIMESTAMPFROMPARTS(year, month, day, hour, minute, second[, nanosecond[, timezone]])
-
TIMESTAMPFROMPARTS(date_expr, time_expr)
Equivalent to
TIMESTAMP_FROM_PARTS
TIMESTAMP_NTZ_FROM_PARTS¶
TIMESTAMP_NTZ_FROM_PARTS(year, month, day, hour, minute, second[, nanosecond])
-
TIMESTAMP_NTZ_FROM_PARTS(date_expr, time_expr)
Equivalent to
TIMESTAMP_FROM_PARTS
but without the optional timezone argument in the first overload. The output is always timezone-naive.
TIMESTAMPNTZFROMPARTS¶
TIMESTAMP_NTZ_FROM_PARTS(year, month, day, hour, minute, second[, nanosecond])
-
TIMESTAMP_NTZ_FROM_PARTS(date_expr, time_expr)
Equivalent to
TIMESTAMP_NTZ_FROM_PARTS
TIMESTAMP_LTZ_FROM_PARTS¶
-
TIMESTAMP_LTZ_FROM_PARTS(year, month, day, hour, minute, second[, nanosecond])
Equivalent to
TIMESTAMP_FROM_PARTS(year, month, day, hour, minute, second[, nanosecond])
but without the optional timezone argument in the first overload. The output is always timezone-aware using the local timezone.
TIMESTAMPLTZFROMPARTS¶
-
TIMESTAMP_LTZ_FROM_PARTS(year, month, day, hour, minute, second[, nanosecond])
Equivalent to
TIMESTAMP_LTZ_FROM_PARTS
TIMESTAMP_TZ_FROM_PARTS¶
-
TIMESTAMP_TZ_FROM_PARTS(year, month, day, hour, minute, second[, nanosecond[, timezone]])
Equivalent to
TIMESTAMP_FROM_PARTS(year, month, day, hour, minute, second[, nanosecond[, timezone]])
except the default behavior if no timezone is provided is to use the local timezone instead of timezone-naive.Note
Timezone argument is not supported at this time.
TIMESTAMPTZFROMPARTS¶
-
TIMESTAMPTZFROMPARTS(year, month, day, hour, minute, second[, nanosecond[, timezone]])
Equivalent to
TIMESTAMP_TZ_FROM_PARTS
DATEADD¶
-
DATEADD(unit, amount, timestamp_val)
Computes a timestamp column by adding the amount of the specified unit to the timestamp val. For example,
DATEADD('day', 3, T)
adds 3 days to columnT
. Allows the following units, with the specified abbreviations as string literals:- YEAR:
year
,years
,yr
,yrs
,y
,yy
,yyy
,yyyy
- QUARTER:
quarter
,quarters
,q
,qtr
,qtrs
- MONTH:
month
,months
,mm
,mon
,mons
- WEEK:
week
,weeks
,weekofyear
,w
,wk
,woy
,wy
- DAY:
day
,days
,dayofmonth
,d
,dd
- HOUR:
hour
,hours
,hrs
,h
,hr
,hrs
- MINUTE:
minute
,minutes
,m
,mi
,min
,mins
- SECOND:
second
,seconds
,s
,sec
,secs
- MILLISECOND:
millisecond
,milliseconds
,ms
,msecs
- MICROSECOND:
microsecond
,microseconds
,us
,usec
- NANOSECOND:
nanosecond
,nanoseconds
,nanosec
,nsec
,nsecs
,nsecond
,ns
,nanonsecs
Supported with timezone-aware data.
- YEAR:
-
DATEADD(timestamp_val, amount)
Equivalent to
DATEADD('day', amount, timestamp_val)
TIMEADD¶
-
TIMEADD(unit, amount, timestamp_val)
Equivalent to
DATEADD
.
TIMESTAMPADD¶
-
TIMESTAMPADD(unit, amount, timestamp_val)
Equivalent to
DATEADD
.
DATE_ADD¶
-
DATE_ADD(timestamp_val, interval)
Computes a timestamp column by adding an interval column/scalar to a timestamp value. If the first argument is a string representation of a timestamp, Bodo will cast the value to a timestamp.
-
DATE_ADD(timestamp_val, amount)
Equivalent to
DATE_ADD('day', amount, timestamp_val)
DATE_SUB¶
-
DATE_SUB(timestamp_val, interval)
Computes a timestamp column by subtracting an interval column/scalar to a timestamp value. If the first argument is a string representation of a timestamp, Bodo will cast the value to a timestamp.
DATE_TRUNC¶
-
DATE_TRUNC(str_literal, timestamp_val)
Truncates a timestamp to the provided str_literal field. str_literal must be a compile time constant and one of:
- "MONTH"
- "WEEK"
- "DAY"
- "HOUR"
- "MINUTE"
- "SECOND"
- "MILLISECOND"
- "MICROSECOND"
- "NANOSECOND"
TRUNC¶
-
TRUNC(timestamp_val, str_literal)
Equivalent to
DATE_TRUNC(str_literal, timestamp_val)
. The argument order is reversed when compared toDATE_TRUNC
. Note thatTRUNC
is overloaded, and may invoke the numeric functionTRUNCATE
if the arguments are numeric.
TIME_SLICE¶
-
TIME_SLICE(date_or_time_expr, slice_length, unit[, start_or_end])
Calculates one of the endpoints of a "slice" of time containing the date specified by
date_or_time_expr
where each slice has length of time corresponding toslice_length
times the date/time unit specified byunit
. The slice start/ends are always aligned to the unix epoch1970-01-1
(at midnight). The fourth argument specifies whether to return the begining or the end of the slice ('START'
for begining,'END'
for end), where the default is'START'
.For example,
TIME_SLICE(T, 3, 'YEAR')
would return the timestamp corresponding to the begining of the first 3-year window (aligned with 1970) that contains timestampT
. SoT = 1995-7-4 12:30:00
would output1994-1-1
for'START'
or1997-1-1
for'END'
.
NOW¶
-
NOW()
Computes a timestamp equal to the current time in the session's timezone. By default, the current timezone is UTC, and it can be updated as a parameter when using the Snowflake Catalog.
LOCALTIMESTAMP¶
-
LOCALTIMESTAMP()
Equivalent to
NOW
CURRENT_TIMESTAMP¶
-
CURRENT_TIMESTAMP()
Equivalent to
NOW
GETDATE¶
-
GETDATE()
Equivalent to
NOW
SYSTIMESTAMP¶
-
SYSTIMESTAMP()
Equivalent to
NOW
LOCALTIME¶
-
LOCALTIME()
Computes a time equal to the current time in the session's timezone. By default the current time is in local time, and it can be updated as a parameter when using the Snowflake Catalog.
CURRENT_TIME¶
-
CURRENT_TIME()
Equivalent to
LOCALTIME
CURDATE¶
-
CURDATE()
Computes a timestamp equal to the current system time, excluding the time information
CURRENT_DATE¶
-
CURRENT_DATE()
Equivalent to
CURDATE
EXTRACT¶
-
EXTRACT(TimeUnit from timestamp_val)
Extracts the specified TimeUnit from the supplied date.
Allowed TimeUnits are:
MICROSECOND
MINUTE
HOUR
DAY
(Day of Month)DOY
(Day of Year)DOW
(Day of week)WEEK
MONTH
QUARTER
YEAR
TimeUnits are not case-sensitive.
DATE_PART¶
-
DATE_PART(unit, timestamp_val)
Equivalent to
EXTRACT(unit FROM timestamp_val)
with the following unit string literals:- YEAR:
year
,years
,yr
,yrs
,y
,yy
,yyy
,yyyy
- QUARTER:
quarter
,quarters
,q
,qtr
,qtrs
- MONTH:
month
,months
,mm
,mon
,mons
- WEEK:
week
,weeks
,weekofyear
,w
,wk
,woy
,wy
- DAY:
day
,days
,dayofmonth
,d
,dd
- HOUR:
hour
,hours
,hrs
,h
,hr
,hrs
- MINUTE:
minute
,minutes
,m
,mi
,min
,mins
- SECOND:
second
,seconds
,s
,sec
,secs
- MILLISECOND:
millisecond
,milliseconds
,ms
,msecs
- MICROSECOND:
microsecond
,microseconds
,us
,usec
- NANOSECOND:
nanosecond
,nanoseconds
,nanosec
,nsec
,nsecs
,nsecond
,ns
,nanonsecs
Supported with timezone-aware data.
- YEAR:
MICROSECOND¶
-
MICROSECOND(timestamp_val)
Equivalent to
EXTRACT(MICROSECOND from timestamp_val)
SECOND¶
-
SECOND(timestamp_val)
Equivalent to
EXTRACT(SECOND from timestamp_val)
MINUTE¶
-
MINUTE(timestamp_val)
Equivalent to
EXTRACT(MINUTE from timestamp_val)
HOUR¶
-
HOUR(timestamp_val)
Equivalent to
EXTRACT(HOUR from timestamp_val)
WEEK¶
-
WEEK(timestamp_val)
Equivalent to
EXTRACT(WEEK from timestamp_val)
WEEKOFYEAR¶
-
WEEKOFYEAR(timestamp_val)
Equivalent to
EXTRACT(WEEK from timestamp_val)
MONTH¶
-
MONTH(timestamp_val)
Equivalent to
EXTRACT(MONTH from timestamp_val)
QUARTER¶
-
QUARTER(timestamp_val)
Equivalent to
EXTRACT(QUARTER from timestamp_val)
YEAR¶
-
YEAR(timestamp_val)
Equivalent to
EXTRACT(YEAR from timestamp_val)
WEEKISO¶
-
WEEKISO(timestamp_val)
Computes the ISO week for the provided timestamp value.
YEAROFWEEKISO¶
-
YEAROFWEEKISO(timestamp_val)
Computes the ISO year for the provided timestamp value.
MAKEDATE¶
-
MAKEDATE(integer_years_val, integer_days_val)
Computes a timestamp value that is the specified number of days after the specified year.
DAYNAME¶
-
DAYNAME(timestamp_val)
Computes the 3 letter abreviation for the day of the timestamp value.
MONTHNAME¶
-
MONTHNAME(timestamp_val)
Computes the 3 letter abreviation for the month of the timestamp value.
MONTH_NAME¶
-
MONTH_NAME(timestamp_val)
Computes the 3 letter abreviation for the month of the timestamp value.
TO_DAYS¶
-
TO_DAYS(timestamp_val)
Computes the difference in days between the input timestamp, and year 0 of the Gregorian calendar
TO_SECONDS¶
-
TO_SECONDS(timestamp_val)
Computes the number of seconds since year 0 of the Gregorian calendar
FROM_DAYS¶
-
FROM_DAYS(n)
Returns a timestamp values that is n days after year 0 of the Gregorian calendar
UNIX_TIMESTAMP¶
-
UNIX_TIMESTAMP()
Computes the number of seconds since the unix epoch
FROM_UNIXTIME¶
-
FROM_UNIXTIME(n)
Returns a Timestamp value that is n seconds after the unix epoch
ADDDATE¶
-
ADDDATE(timestamp_val, interval)
Same as
DATE_ADD
SUBDATE¶
-
SUBDATE(timestamp_val, interval)
Same as
DATE_SUB
TIMESTAMPDIFF¶
-
TIMESTAMPDIFF(unit, timestamp_val1, timestamp_val2)
Returns the amount of time that has passed since
timestamp_val1
untiltimestamp_val2
in terms of the unit specified, ignoring all smaller units. E.g., December 31 of 2020 and January 1 of 2021 count as 1 year apart.Note
For all units larger than
NANOSECOND
, the output type isINTEGER
instead ofBIGINT
, so any difference values that cannot be stored as signed 32-bit integers might not be returned correct.
WEEKDAY¶
-
WEEKDAY(timestamp_val)
Returns the weekday number for timestamp_val.
Note
Monday = 0
,Sunday=6
YEARWEEK¶
-
YEARWEEK(timestamp_val)
Returns the year and week number for the provided timestamp_val concatenated as a single number. For example:
LAST_DAY¶
-
LAST_DAY(timestamp_val)
Given a timestamp value, returns a timestamp value that is the last day in the same month as timestamp_val.
UTC_TIMESTAMP¶
-
UTC_TIMESTAMP()
Returns the current UTC date and time as a timestamp value.
SYSDATE¶
-
SYSDATE()
Equivalent to
UTC_TIMESTAMP
UTC_DATE¶
-
UTC_DATE()
Returns the current UTC date as a Timestamp value.
String Functions¶
BodoSQL currently supports the following string functions:
LOWER¶
-
LOWER(str)
Converts the string scalar/column to lower case.
LCASE¶
-
LCASE(str)
Same as
LOWER
.
UPPER¶
-
UPPER(str)
Converts the string scalar/column to upper case.
UCASE¶
-
UCASE(str)
Same as
UPPER
.
CONCAT¶
-
CONCAT(str_0, str_1, ...)
Concatenates the strings together. Requires at least one argument.
CONCAT_WS¶
-
CONCAT_WS(str_separator, str_0, str_1, ...)
Concatenates the strings together, with the specified separator. Requires at least two arguments.
SUBSTRING¶
-
SUBSTRING(str, start_index, len)
Takes a substring of the specified string, starting at the specified index, of the specified length.
start_index = 1
specifies the first character of the string,start_index = -1
specifies the last character of the string.start_index = 0
causes the function to return empty string. Ifstart_index
is positive and greater than the length of the string, returns an empty string. Ifstart_index
is negative, and has an absolute value greater than the length of the string, the behavior is equivalent tostart_index = 1
.For example:
SUBSTRING('hello world', 1, 5) =='hello' SUBSTRING('hello world', -5, 7) =='world' SUBSTRING('hello world', -20, 8) =='hello wo' SUBSTRING('hello world', 0, 10) ==''
MID¶
MID(str, start_index, len)
Equivalent to
SUBSTRING
SUBSTR¶
-
SUBSTR(str, start_index, len)
Equivalent to
SUBSTRING
LEFT¶
-
LEFT(str, n)
Takes a substring of the specified string consisting of the leftmost n characters
RIGHT¶
-
RIGHT(str, n)
Takes a substring of the specified string consisting of the rightmost n characters
REPEAT¶
-
REPEAT(str, len)
Extends the specified string to the specified length by repeating the string. Will truncate the string If the string's length is less than the len argument
For example:
STRCMP¶
-
STRCMP(str1, str2)
Compares the two strings lexicographically. If
str1 > str2
, return 1. Ifstr1 < str2
, returns -1. Ifstr1 == str2
, returns 0.
REVERSE¶
-
REVERSE(str)
Returns the reversed string.
ORD¶
-
ORD(str)
Returns the integer value of the unicode representation of the first character of the input string. returns 0 when passed the empty string
CHAR¶
-
CHAR(int)
Returns the character of the corresponding unicode value. Currently only supported for ASCII characters (0 to 127, inclusive)
SPACE¶
-
SPACE(int)
Returns a string containing the specified number of spaces.
LTRIM¶
-
LTRIM(str[, chars])
Removes leading characters from a string column/literal str. These characters are specified by chars or are whitespace.
RTRIM¶
-
RTRIM(str[, chars])
Removes trailing characters from a string column/literal str. These characters are specified by chars or are whitespace.
TRIM¶
-
TRIM(str[, chars])
Returns the input string, will remove all spaces from the left and right of the string
SUBSTRING_INDEX¶
-
SUBSTRING_INDEX(str, delimiter_str, n)
Returns a substring of the input string, which contains all characters that occur before n occurrences of the delimiter string. if n is negative, it will return all characters that occur after the last n occurrences of the delimiter string. If
num_occurrences
is 0, it will return the empty stringFor example:
LPAD¶
-
LPAD(string, len, padstring)
Extends the input string to the specified length, by appending copies of the padstring to the left of the string. If the input string's length is less than the len argument, it will truncate the input string.
For example:
RPAD¶
-
RPAD(string, len, padstring)
Extends the input string to the specified length, by appending copies of the padstring to the right of the string. If the input string's length is less than the len argument, it will truncate the input string.
For example:
REPLACE¶
-
REPLACE(base_string, substring_to_remove, string_to_substitute)
Replaces all occurrences of the specified substring with the substitute string.
For example:
LENGTH¶
-
LENGTH(string)
Returns the number of characters in the given string.
EDITDISTANCE¶
-
EDITDISTANCE(string0, string1[, max_distance])
Returns the minimum edit distance between
string0
andstring1
according to Levenshtein distance. Optionally accepts a third argument specifying a maximum distance value. If the minimum edit distance between the two strings exceeds this value, then this value is returned instead. If it is negative, zero is returned.
JAROWINKLER_SIMILARITY¶
-
JAROWINKLER_SIMILARITY(string0, string1)
Computes the Jaro-Winkler similarity between
string0
andstring1
as an integer between 0 and 100 (with 0 being no similarity and 100 being an exact match). The computation is not case-sensitive, but is sensitive to spaces or formatting characters. A scaling factor of 0.1 is used for the computation. For the definition of Jaro-Winkler similarity, see here.
SPLIT_PART¶
-
SPLIT_PART(source, delimiter, part)
Returns the substring of the source between certain occurrence of the delimiter string, the occurrence being specified by the part. I.e. if part=1, returns the substring before the first occurrence, and if part=2, returns the substring between the first and second occurrence. Zero is treated like 1. Negative indices are allowed. If the delimiter is empty, the source is treated like a single token. If the part is out of bounds, '' is returned.
STRTOK¶
-
STRTOK(source[, delimiter[, part]])
Tokenizes the source string by occurrences of any character in the delimiter string and returns the occurrence specified by the part. I.e. if part=1, returns the substring before the first occurrence, and if part=2, returns the substring between the first and second occurrence. Zero and negative indices are not allowed. Empty tokens are always skipped in favor of the next non-empty token. In any case where the only possible output is '', the output is
NULL
. The delimiter is optional and defaults to ' '. The part is optional and defaults to 1.
POSITION¶
-
POSITION(str1, str2)
Returns the 1-indexed location where
str1
first occurs instr2
, or 0 if there is no occurrences ofstr1
instr2
.Note
BodoSQL oes not currently support alternate syntax
POSITION(str1, str2)
, or binary data.
CHARINDEX¶
-
CHARINDEX(str1, str2[, start_position])
Equivalent to
POSITION(str1, str2)
when 2 arguments are provided. When the optional third argument is provided, it only starts searching at that index.Note
Not currently supported on binary data.
STARTSWITH¶
-
STARTSWITH(str1, str2)
Returns whether
str2
is a prefix ofstr1
.
ENDSWITH¶
-
ENDSWITH(str1, str2)
Returns whether
str2
is a suffix ofstr1
.
INSERT¶
-
INSERT(str1, pos, len, str2)
Inserts
str2
intostr1
at positionpos
(1-indexed), replacing the firstlen
characters afterpos
in the process. Iflen
is zero, insertsstr2
intostr1
without deleting any characters. Ifpos
is one, prependsstr2
tostr1
. Ifpos
is larger than the length ofstr1
, appendsstr2
tostr1
.Note
Behavior when
pos
orlen
are negative is not well-defined at this time.
SHA2¶
-
SHA2(msg[, digest_size])
Encodes the
msg
string using theSHA-2
algorithm with the specified digest size (only values supported are, 224, 256, 384 and 512). Outputs the result as a hex-encoded string.
SHA2_HEX¶
-
SHA2_HEX(msg[, digest_size])
Equivalent to
SHA2(msg[, digest_size])
MD5¶
-
MD5(msg)
Encodes the
msg
string using theMD5
algorithm. Outputs the result as a hex-encoded string.
MD5_HEX¶
-
MD5_HEX(msg)
Equivalent to
MD5_HEX(msg)
HEX_ENCODE¶
-
HEX_ENCODE(msg[, case])
Encodes the
msg
string into a string using the hex encoding scheme as if it were binary data (or directly encodes binary data). Ifcase
(default one) is zero then the alphabetical hex characters are lowercase, if it is one then they are uppercase. See here for Snowflake documentation.
HEX_DECODE_STRING¶
HEX_DECODE_STRING(msg)
Reverses the process of calling HEX_ENCODE
on a string with either capitalization.
Raises an exception if the string is malformed in any way.
See here for Snowflake documentation.
TRY_HEX_DECODE_STRING¶
-
TRY_HEX_DECODE_STRING(msg)
Equivalent to
HEX_DECODE_STRING
except that it will return null instead of raising an exception if the string is malformed in any way.
HEX_DECODE_BINARY¶
-
HEX_DECODE_BINARY(msg)
The same as
HEX_DECODE_STRING
except that the output is binary instead of a string.
TRY_HEX_DECODE_BINARY¶
-
TRY_HEX_DECODE_BINARY(msg)
Equivalent to
HEX_DECODE_BINARY
except that it will return null instead of raising an exception if the string is malformed in any way.
BASE64_ENCODE¶
-
BASE64_ENCODE(msg[, max_line_length[, alphabet]])
Encodes the
msg
string into a string using the base64 encoding scheme as if it were binary data (or directly encodes binary data). Ifmax_line_length
(default zero) is greater than zero, then newline characters will be inserted after that many characters to effectively add "text wrapping". Ifalphabet
is provided, it specifies substitutes for the usual encoding characters for index 62, index 63, and the padding character. See here for Snowflake documentation.
BASE64_DECODE_STRING¶
-
BASE64_DECODE_STRING(msg[, alphabet])
Reverses the process of calling
BASE64_ENCODE
on a string with the given alphabet, ignoring any newline characters produced by themax_line_length
argument. Raises an exception if the string is malformed in any way. See here for Snowflake documentation.
TRY_BASE64_DECODE_STRING¶
-
TRY_BASE64_DECODE_STRING(msg[, alphabet])
Equivalent to
BASE64_DECODE_STRING
except that it will return null instead of raising an exception if the string is malformed in any way.
BASE64_DECODE_BINARY¶
-
BASE64_DECODE_BINARY(msg[, alphabet])
The same as
BASE64_DECODE_STRING
except that the output is binary instead of a string.
TRY_BASE64_DECODE_BINARY¶
-
TRY_BASE64_DECODE_BINARY(msg[, alphabet])
Equivalent to
BASE64_DECODE_BINARY
except that it will return null instead of raising an exception if the string is malformed in any way.
Regex Functions¶
BodoSQL currently uses Python's regular expression library via the re
module. Although this may be subject to change, it means that there are
several deviations from the behavior of Snowflake's regular expression
functions (see here for snowflake documentation).
The key points and major deviations are noted below:
-
Snowflake uses a superset of the POSIX ERE regular expression syntax. This means that BodoSQL can utilize several syntactic forms of regular expressions that Snowflake cannot (see here for Python re documentation). However, there are several features that POSIX ERE has that Python's
re
does not: -
POSIX character classes (see here for a full list). BodoSQL does support these as macros for character sets. In other words,
[[:lower:]]
is transformed into[a-z]
. However, this form of replacement cannot be escaped. Additionally, any character classes that are supposed to include the null terminator\x00
instead start at\x01
-
Equivalence classes (not supported by BodoSQL).
-
Returning the longest match when using alternation patterns (BodoSQL returns the leftmost match).
-
The regex functions can optionally take in a flag argument. The flag is a string whose characters control how matches to patterns occur. The following characters have meaning when contained in the flag string:
-
'c'
: case-sensitive matching (the default behavior) 'i'
: case-insensitive matching (if both 'c' and 'i' are provided, whichever one occurs last is used)'m'
: allows anchor patterns to interact with the start/end of each line, not just the start/end of the entire string.'s'
: allows the.
metacharacter to capture newline characters-
'e'
: seeREGEXP_SUBSTR
/REGEXP_INSTR
-
Currently, BodoSQL supports the lazy
?
operator whereas Snowflake does not. So for example, in Snowflake, the pattern`(.*?),'
would match with as many characters as possible so long as the last character was a comma. However, in BodoSQL, the match would end as soon as the first comma. -
Currently, BodoSQL supports the following regexp features which should crash when done in Snowflake:
(?...)
,\A
,\Z
,\1
,\2
,\3
, etc. -
Currently, BodoSQL requires the pattern argument and the flag argument (if provided) to be string literals as opposed to columns or expressions.
-
Currently, extra backslashes may be required to escape certain characters if they have meaning in Python. The amount of backslashes required to properly escape a character depends on the usage.
-
All matches are non-overlapping.
-
If any of the numeric arguments are zero or negative, or the
group_num
argument is out of bounds, an error is raised. The only exception isREGEXP_REPLACE
, which allows its occurrence argument to be zero.
BodoSQL currently supports the following regex functions:
REGEXP_LIKE¶
-
REGEXP_LIKE(str, pattern[, flag])
Returns
true
if the entire string matches with the pattern. Ifflag
is not provided,''
is used.If the pattern is empty, then
true
is returned if the string is also empty.For example:
-
2 arguments: Returns
true
ifA
is a 5-character string where the first character is an a, the last character is a z, and the middle 3 characters are also lowercase characters (case-sensitive). -
3 arguments: Returns
true
ifA
starts with the letters'THE'
(case-insensitive).
-
REGEXP_COUNT¶
-
REGEXP_COUNT(str, pattern[, position[, flag]])
Returns the number of times the string contains matches to the pattern, starting at the location specified by the
position
argument (with 1-indexing). Ifposition
is not provided,1
is used. Ifflag
is not provided,''
is used.If the pattern is empty, 0 is returned.
For example:
-
2 arguments: Returns the number of times that any letters occur in
A
. -
3 arguments: Returns the number of times that any digit characters occur in
A
, not including the first 5 characters. -
4 arguments: Returns the number of times that a substring occurs in
A
that contains two ones with any character (including newlines) in between.
-
REGEXP_REPLACE¶
-
REGEXP_REPLACE(str, pattern[, replacement[, position[, occurrence[, flag]]]])
Returns the version of the inputted string where each match to the pattern is replaced by the replacement string, starting at the location specified by the
position
argument (with 1-indexing). The occurrence argument specifies which match to replace, where 0 means replace all occurrences. Ifreplacement
is not provided,''
is used. Ifposition
is not provided,1
is used. Ifoccurrence
is not provided,0
is used. Ifflag
is not provided,''
is used.If there are an insufficient number of matches, or the pattern is empty, the original string is returned.
Note
back-references in the replacement pattern are supported, but may require additional backslashes to work correctly.
For example:
-
2 arguments: Deletes all whitespace in
A
. -
3 arguments: Replaces all occurrences of
'hate'
inA
with'love'
(case-sensitive). -
4 arguments: Replaces all occurrences of two consecutive digits in
A
with the same two digits reversed, excluding the first 2 characters. -
5 arguments: Replaces the first character in
A
with an underscore. -
6 arguments: Removes the first and last word from each line of
A
that contains at least 3 words.
-
REGEXP_SUBSTR¶
-
REGEXP_SUBSTR(str, pattern[, position[, occurrence[, flag[, group_num]]]])
Returns the substring of the original string that caused a match with the pattern, starting at the location specified by the
position
argument (with 1-indexing). The occurrence argument specifies which match to extract (with 1-indexing). Ifposition
is not provided,1
is used. Ifoccurrence
is not provided,1
is used. Ifflag
is not provided,''
is used. Ifgroup_num
is not provided, andflag
contains'e
',1
is used. Ifgroup_num
is provided but the flag does not containe
, then it behaves as if it did. If the flag does containe
, then one of the subgroups of the match is returned instead of the entire match. The subgroup returned corresponds to thegroup_num
argument (with 1-indexing).If there are an insufficient number of matches, or the pattern is empty,
NULL
is returned.For example:
-
2 arguments: Returns the first number that occurs inside of
A
. -
3 arguments: Returns the first punctuation symbol that occurs inside of
A
, excluding the first 10 characters. -
4 arguments: Returns the fourth occurrence of two consecutive lowercase vowels in
A
. -
5 arguments: Returns the first 3+ character substring of
A
that starts with and ends with a vowel (case-insensitive, and it can contain newline characters). -
6 arguments: Looks for third occurrence in
A
of a number followed by a colon, a space, and a word that starts with'a'
(case-sensitive) and returns the word that starts with'a'
.
-
REGEXP_INSTR¶
-
REGEXP_INSTR(str, pattern[, position[, occurrence[, option[, flag[, group_num]]]]])
Returns the location within the original string that caused a match with the pattern, starting at the location specified by the
position
argument (with 1-indexing). The occurrence argument specifies which match to extract (with 1-indexing). The option argument specifies whether to return the start of the match (if0
) or the first location after the end of the match (if1
). Ifposition
is not provided,1
is used. Ifoccurrence
is not provided,1
is used. Ifoption
is not provided,0
is used. Ifflag
is not provided,''
is used. Ifgroup_num
is not provided, andflag
contains'e
',1
is used. Ifgroup_num
is provided but the flag does not containe
, then it behaves as if it did. If the flag does containe
, then the location of one of the subgroups of the match is returned instead of the location of the entire match. The subgroup returned corresponds to thegroup_num
argument (with 1-indexing).If there are an insufficient number of matches, or the pattern is empty,
0
is returned.-
2 arguments: Returns the index of the first
'#'
inA
. -
3 arguments: Returns the starting index of the first occurrence of 3 consecutive digits in
A
, excluding the first 3 characters. ```sql SELECT REGEXP_INSTR(A, '\d{3}', 4) -
5 arguments: Returns the ending index of the first substring of
A
that starts and ends with non-ascii characters. -
6 arguments: Returns the starting index of the second line of
A
that begins with an uppercase vowel. -
7 arguments: Looks for the first substring of
A
that has the format of a name in a phonebook (i.e.Lastname, Firstname
) and returns the starting index of the first name.
-
Control Flow Functions¶
DECODE¶
-
DECODE(Arg0, Arg1, Arg2, ...)
When
Arg0
isArg1
, outputsArg2
. WhenArg0
isArg3
, outputsArg4
. Repeats until it runs out of pairs of arguments. At this point, if there is one remaining argument, this is used as a default value. If not, then the output isNULL
.Note
Treats
NULL
as a literal value that can be matched on.Therefore, the following:
Is logically equivalent to:
EQUAL_NULL¶
-
EQUAL_NULL(A, B)
Returns true if A and B are both
NULL
, or both non-null and equal to each other.
IF¶
-
IF(Cond, TrueValue, FalseValue)
Returns
TrueValue
if cond is true, andFalseValue
if cond is false. Logically equivalent to:
IFF¶
-
IFF(Cond, TrueValue, FalseValue)
Equivalent to
IF
COALESCE¶
-
COALESCE(A, B, C, ...)
Returns the first non-
NULL
argument, orNULL
if no non-NULL
argument is found. Requires at least two arguments. If Arguments do not have the same type, BodoSQL will attempt to cast them to a common data type, which is currently undefined behavior.
IFNULL¶
-
IFNULL(Arg0, Arg1)
Equivalent to
COALESCE(Arg0, Arg1)
NVL¶
-
NVL(Arg0, Arg1)
Equivalent to
COALESCE(Arg0, Arg1)
ZEROIFNULL¶
-
ZEROIFNULL(Arg0, Arg1)
Equivalent to
COALESCE(Arg0, 0)
NVL2¶
-
NVL2(Arg0, Arg1, Arg2)
Equivalent to
IF(Arg0 IS NOT NULL, Arg1, Arg2)
NULLIF¶
-
NULLIF(Arg0, Arg1)
Returns
NULL
ifArg0
is equal toArg1
, and otherwise returnsArg0
.
NULLIFZERO¶
-
NULLIFZERO(Arg0)
Equivalent to
NULLIF(Arg0, 0)
Array Functions¶
Bodo currently supports the following functions that operate on columns of arrays:
GET¶
GET(arr, idx)
GET(object, field)
arr[idx]
-
object[field]
Returns the element found at the specified index in the array, or the specified field of an object.
When indexing into an array: indexing is 0 based, not 1 based. Returns
NULL
if the index is outside of the boundaries of the array. The index must be an integer.When retrieving a field from an object: the field name must be a string. If the object is a struct, the field name must be a string literal. If the object is a map, it can be a non-constant string. Returns
NULL
if the field name is not found. Field name matching is case-sensitive.
ARRAY_TO_STRING¶
-
ARRAY_TO_STRING(arr, sep)
Converted the input array
arr
to a string by casting all values to strings (usingTO_VARCHAR
) and concatenating them (usingsep
to separate the elements).
ARRAY_COMPACT¶
-
ARRAY_COMPACT(arr)
Returns a compacted array with missing and null values removed from
arr
, effectively converting sparse arrays into dense arrays. ReturnNULL
whenarr
isNULL
.
ARRAY_CONTAINS¶
-
ARRAY_CONTAINS(elem, arr)
Returns true if
elem
is an element ofarr
, orNULL
ifarr
isNULL
. The inputelem
can beNULL
, in which case the funciton will check ifarr
containsNULL
.
ARRAY_CONSTRUCT¶
-
ARRAY_CONSTRUCT(A, B, C, ...)
Takes in a variable number of arguments and produces an array containing all of those values (including any null values).
Note
Snowflake allows any number of arguments (even zero arguments) of any type. BodoSQL currently requires 1+ arguments, and requires all arguments to be easily reconciled into a common type.
ARRAY_CONSTRUCT_COMPACT¶
-
ARRAY_CONSTRUCT_COMPACT(A, B, C, ...)
Equivalent to
ARRAY_COMPACT(ARRAY_CONSTRUCT(A, B, C, ...))
ARRAY_EXCEPT¶
-
ARRAY_EXCEPT(A, B)
Takes in two arrays and returns a copy of the first array but with all of the elements from the second array dropped. If an element appears in the first array more than once, that element is only dropped as many times as it appears in the second array. For instance, if the first array contains three 1s and four 6s, and the second array contains two 1s and one 6, then the output will have one 1 and three 6s.
ARRAY_INTERSECTION¶
-
ARRAY_INTERSECTION(A, B)
Takes in two arrays and returns an arary of all the elements from the first array that also appear in the second. If an element appears in either array more than once, that element is kept the minimum of the number of times it appears in either array. For instance, if the first array contains three 1s and four 6s, and the second array contains two 1s and five 6s, then the output will have two 1s and three 6s.
ARRAY_CAT¶
-
ARRAY_CAT(A, B)
Takes in two arrays and returns an arary of all the elements from the first array followed by all of the elements in the second array.
ARRAYS_OVERLAP¶
-
ARRAYS_OVERLAP(arr0, arr1)
Returns true if the two array arguments
arr0
andarr1
have at least 1 element in common (includingNULL
).
ARRAY_POSITION¶
-
ARRAY_POSITION(elem, arr)
Returns the index of the first occurrence of
elem
inarr
(using zero indexing), orNULL
if there are no occurrences. The inputelem
can beNULL
, in which case the funciton will look for the firstNULL
in the array input.
ARRAY_REMOVE¶
-
ARRAY_REMOVE(array, to_remove)
Given a source
array
, returns an array with all elements equal to the specified valueto_remove
removed. ReturnsNULL
ifarray
orto_remove
isNULL
.
ARRAY_REMOVE_AT¶
-
ARRAY_REMOVE_AT(array, pos)
Given a source
array
, returns an array with the element at the specified positionpos
removed. ReturnsNULL
ifarray
orpos
isNULL
. Negative indexing is supported. No element is removed of the indexpos
is out of bound.
ARRAY_SIZE¶
-
ARRAY_SIZE(array)
Returns the size of array, size includes inner elements that are
NULL``. Returns
NULLif array is
NULL`. See here for Snowflake documentation.
ARRAY_SLICE¶
-
ARRAY_SLICE(arr, from, to)
Returns an array constructed from a specified subset of elements of the input array
arr[from:to]
. ReturnsNULL
if one ofarr
,from
andto
isNULL
.
Object Functions¶
BodoSQL currently supports the following Object functions:
OBJECT_CONSTRUCT_KEEP_NULL¶
-
OBJECT_CONSTRUCT_KEEP_NULL(key1, value1[, key2, value2, ...])
Takes in a variable number of key-value pairs and combines them into JSON data. BodoSQL currently requires all
key
arguments to be string literals.The full Snowflake specification.
BodoSQL supports the syntactic sugar
OBJECT_CONSTRUCT_KEEP_NULL(*)
which indicates that all columns should be used as key-value pairs, where the column is the value and its column name is the key. For example, if we have the tableT
as defined below:First Middle Last "George" NULL "WASHINGTON" "John" "Quincy" "Adams" "Lyndon" "Baines" "Johnson" "James" NULL "Madison" Then
SELECT OBJECT_CONSTRUCT_KEEP_NULL(*) as name FROM T
returns the following table:name {"First": "George", "Middle": NULL, "Last": "Washington"} {"First": "John", "Middle": "Quincy", "Last": "Adams"} {"First": "Lyndon", "Middle":"Baines", "Last": "Johnson"} {"First": "Thomas", "Middle": NULL, "Last": "Jefferson"}
OBJECT_CONSTRUCT¶
-
OBJECT_CONSTRUCT(key1, value1[, key2, value2, ...])
The same as
OBJECT_CONSTRUCT_KEEP_NULL
except that for any rows where any input value (e.g.value1
,value2
, ...) is null have that key-value pair dropped from the row's final JSON output.Note
BodoSQL only supports this function under narrow conditions where all of the values are either of the same type or of easily reconciled types.
The full Snowflake specification.
BodoSQL supports the syntactic sugar
OBJECT_CONSTRUCT(*)
which indicates that all columns should be used as key-value pairs, where the column is the value and its column name is the key. For example, if we have the tableT
as defined below:First Middle Last "George" NULL "WASHINGTON" "John" "Quincy" "Adams" "Lyndon" "Baines" "Johnson" "James" NULL "Madison" Then
SELECT OBJECT_CONSTRUCT(*) as name FROM T
returns the following table:name {"First": "George", "Last": "Washington"} {"First": "John", "Middle": "Quincy", "Last": "Adams"} {"First": "Lyndon", "Middle":"Baines", "Last": "Johnson"} {"First": "Thomas", "Last": "Jefferson"}
OBJECT_KEYS¶
-
OBJECT_KEYS(data)
Extracts all of the field names from the object
data
and returns them as an array of strings.
OBJECT_PICK¶
-
OBJECT_PICK(data, key1[, key2, ...])
Takes in a column of object data and 1+ keys and returns the object data only containing the keys specified. If a specified key is not present in
data
, it is ignored.!!! note: BodoSQL supports when the keys are passed in as string literals, but only supports when they are passed in as columns of strings if the object is a map instead of struct.
OBJECT_INSERT¶
-
OBJECT_INSERT(data, key, value[, update])
Takes a columns of JSON data, a column of string keys, and a columns of values and inserts the keys and values into the data. If the key is already present in the data, an error will be thrown, unless an additional argument (
update
) of type boolean is supplied, which will update existing keys to hold the new value only if the value is true.
OBJECT_DELETE¶
-
OBJECT_DELETE(data, key1[, key2, ...])
Takes in a column of JSON data and 1+ keys and returns the same JSON data but with all of those keys removed. If a specified key is not present in
data
, it is ignored.!!! note: BodoSQL supports when the keys are passed in as string literals, but only supports when they are passed in as columns of strings if the object is a map instead of struct.
JSON_EXTRACT_PATH_TEXT¶
-
JSON_EXTRACT_PATH_TEXT(data, path)
Parses the string
data
as if it were JSON data, then extracts values from within (possibly multiple times if the data is nested) using the stringpath
.Obeys the specification described here.
GET_PATH¶
-
GET_PATH(data, path_string)
Extracts an entry from a semi-structured data expression based on the path string. Obeys the specification described here.
PARSE_JSON¶
-
PARSE_JSON(str)
Takes in a string representing a json document and parses it to the corresponding value as a variant. For example:
-
PARSE_JSON('42')
is equivalent toTO_VARIANT(42)
-
PARSE_JSON('{"A": 0, "B": 3.1}')
is equivalent toTO_VARIANT({"A": 0, "B": 3.1})
Note
Currently only supported under limited conditions where it is possible to rewrite the call to
PARSE_JSON
as a sequence of Parse-Extract-Cast operations, where the output ofPARSE_JSON
immediately has an extraction operation like GET/GET_PATH called on it, and the result is casted to a non-semi-structured type. For example,PARSE_JSON(S):fizz::integer
can be rewritten, as canGET_PATH(TO_OBJECT(TO_ARRAY(PARSE_JSON(S))[0]), 'foo.bar')::varchar
. -
Casting / Conversion Functions¶
BodoSQL currently supports the following casting/conversion functions:
TO_BOOLEAN¶
-
TO_BOOLEAN(COLUMN_EXPRESSION)
Casts the input to a boolean value. If the input is a string, it will be cast to
true
if it is'true'
,'t'
,'yes'
,'y'
,'1'
, cast tofalse
if it is'false'
,'f'
,'no'
,'n'
,'0'
, and throw an error otherwise. If the input is an integer, it will be cast totrue
if it is non-zero andfalse
if it is zero. If the input is a float, it will be cast totrue
if it is non-zero,false
if it is zero, and throw an error on other inputs (e.g.inf
) input. If the input isNULL
, the output will beNULL
.Example:
We are given
upon query we will get the following output: Upon query we see the following error:table1
with columnsa
andb
andc
Note
BodoSQL will read NaN
s as NULL
and thus will not produce errors on NaN
input.
TRY_TO_BOOLEAN¶
-
TRY_TO_BOOLEAN(COLUMN_EXPRESSION)
This is similar to
TO_BOOLEAN
except that it will returnNULL
instead of throwing an error invalid inputs.Example:
We are given
upon query we will get the following output:table1
with columnsa
andb
andc
TO_BINARY¶
-
TO_BINARY(COLUMN_EXPRESSION)
Casts the input string to binary data. Currently only supports the
HEX
format. Raises an exception if the input is not a valid hex string: - Must have an even number of characters - All characters must be hexedecimal digits (0-9, a-f case insensitive)Example:
We are given
upon query we will get the following output:table1
with columnsa
andb
:Upon query we will see a value error because all of the values in column b are not valid hex strings: -TO_BINARY(a) 0 b'\xab' -- Binary encoding of the character '¼' 1 b'\x62\x6f\x64\x6f' -- Binary encoding of the string 'bodo' 2 b'\x4a\x2f\x31\x32' -- Binary encoding of the string 'J/12'
'ABC'
is 3 characters, which is not an even number -'ZETA'
contains non-hex charactersZ
andT
-'#fizz'
is 5 characters, which is not an even number and contains non-hex characters#
,i
andz
TRY_TO_BINARY¶
-
TRY_TO_BINARY(COLUMN_EXPRESSION)
See
TO_BINARY
. The only difference is thatTRY_TO_BINARY
will returnNULL
upon encountering an invalid expression instead of raising an exception.
TO_CHAR¶
-
TO_CHAR(COLUMN_EXPRESSION)
Casts the input to a string value. If the input is a boolean, it will be cast to
'true'
if it istrue
and'false'
if it isfalse
. If the input isNULL
, the output will beNULL
.Example:
We are given
upon query we will get the following output:table1
with columnsa
andb
andc
TO_VARCHAR¶
-
TO_VARCHAR(COLUMN_EXPRESSION)
Alias for
TO_CHAR
.
TO_DOUBLE¶
-
TO_DOUBLE(COLUMN_EXPRESSION)
Converts a numeric or string expression to a double-precision floating-point number. For
NULL
input, the result isNULL
. Fixed-point numbers are converted to floating point; the conversion cannot fail, but might result in loss of precision. Strings are converted as decimal or integer numbers. Scientific notation and special values (nan, inf, infinity) are accepted, case insensitive.Example:
We are given
upon query we will get the following output:table1
with columnsa
andb
TRY_TO_DOUBLE¶
-
TRY_TO_DOUBLE(COLUMN_EXPRESSION)
This is similar to
TO_DOUBLE
except that it will returnNULL
instead of throwing an error invalid inputs.
TO_NUMBER¶
-
TO_NUMBER(EXPR [, PRECICION [, SCALE]])
Converts an input expression to a fixed-point number with the specified precicion and scale. Precicon and scale must be constant integer literals if provided. Precicion must be between 1 and 38. Scale must be between 0 and prec - 1. Precicion and scale default to 38 and 0 if not provided. For
NULL
input, the output isNULL
.
TO_NUMERIC¶
-
TO_NUMERIC(EXPR [, PRECICION [, SCALE]])
Equivalent to
TO_NUMBER(EXPR, PRECICION, SCALE)
.
TO_DECIMAL¶
-
TO_DECIMAL(EXPR [, PRECICION [, SCALE]])
Equivalent to
TO_NUMBER(EXPR, PRECICION, SCALE)
.
TRY_TO_NUMBER¶
-
TRY_TO_NUMBER(EXPR [, PRECICION [, SCALE]])
A special version of
TO_NUMBER
that performs the same operation (i.e. converts an input expression to a fixed-point number), but with error-handling support (i.e. if the conversion cannot be performed, it returns aNULL
value instead of raising an error).
TRY_TO_NUMERIC¶
-
TRY_TO_NUMERIC(EXPR [, PRECICION [, SCALE]])
Equivalent to
TRY_TO_NUMBER(EXPR, PRECICION, SCALE)
.
TRY_TO_DECIMAL¶
-
TRY_TO_DECIMAL(EXPR [, PRECICION [, SCALE]])
Equivalent to
TRY_TO_NUMBER(EXPR, PRECICION, SCALE)
.
TO_DATE¶
-
TO_DATE(EXPR)
Converts an input expression to a
DATE
type. The input can be one of the following:TO_DATE(timestamp_expr)
truncates the timestamp to its date value.TO_DATE(string_expr)
if the string is in date format (e.g."1999-01-01"
) then it is convrted to a corresponding date. If the string represents an integer (e.g."123456"
) then it is interpreted as the number of seconds/milliseconds/microseconds/nanoseconds since1970-01-1
. Which unit it is interpreted as depends on the magnitude of the number, in accordance with the semantics used by Snowflake.TO_DATE(string_expr, format_expr)
uses the format string to specify how to parse the string expression as a date. Uses the format string rules as specified by Snowflake.- If the input is
NULL
, outputsNULL
.
Raises an error if the input expression does not match one of these formats.
TRY_TO_DATE¶
-
TRY_TO_DATE(EXPR)
A special version of
TO_DATE
that performs the same operation but returnsNULL
instead of raising an error if something goes wrong during the conversion.
TO_TIME¶
-
TO_TIME(EXPR)
Converts an input expression to a
TIME
type. The input can be one of the following:TO_TIME(timestamp_expr)
extracts the time component from a timestamp.TO_TIME(string_expr)
if the string is in date format (e.g."12:30:15"
) then it is convrted to a corresponding time.TO_TIME(string_expr, format_expr)
uses the format string to specify how to parse the string expression as a time. Uses the format string rules as specified by Snowflake.- If the input is
NULL
, outputsNULL
Raises an error if the input expression does not match one of these formats.
TRY_TO_TIME¶
-
TRY_TO_TIME(EXPR)
A special version of
TO_TIME
that performs the same operation but returnsNULL
instead of raising an error if something goes wrong during the conversion.
TO_TIMESTAMP¶
-
TO_TIMESTAMP(EXPR)
Converts an input expression to a
TIMESTAMP
type without a timezone. The input can be one of the following:TO_TIMESTAMP(date_expr)
upcasts aDATE
to aTIMESTAMP
.TO_TIMESTAMP(integer)
creates a timestamp using the integer as the number of seconds/milliseconds/microseconds/nanoseconds since1970-01-1
. Which unit it is interpreted as depends on the magnitude of the number, in accordance with the semantics used by Snowflake.TO_TIMESTAMP(integer, scale)
the same as the integer case except that the scale provided specifes which unit is used. THe scale can be an integer constant between 0 and 9, where 0 means seconds and 9 means nanoseconds.TO_TIMESTAMP(string_expr)
if the string is in timestamp format (e.g."1999-12-31 23:59:30"
) then it is convrted to a corresponding timestamp. If the string represents an integer (e.g."123456"
) then it uses the same rule as the corresponding input integer.TO_TIMESTAMP(string_expr, format_expr)
uses the format string to specify how to parse the string expression as a timestamp. Uses the format string rules as specified by Snowflake.TO_TIMESTAMP(timestamp_exr)
returns a timestamp expression representing the same moment in time, but changing the timezone if necessary to be timezone-naive.- If the input is
NULL
, outputsNULL
Raises an error if the input expression does not match one of these formats.
TRY_TO_TIMESTAMP¶
-
TRY_TO_TIMESTAMP(EXPR)
A special version of
TO_TIMESTAMP
that performs the same operation but returnsNULL
instead of raising an error if something goes wrong during the conversion.
TO_TIMESTAMP_NTZ¶
-
TO_TIMESTAMP_NTZ(EXPR)
Equivalent to
TO_TIMESTAMP
.
TRY_TO_TIMESTAMP_NTZ¶
-
TRY_TO_TIMESTAMP_NTZ(EXPR)
Equivalent to
TRY_TO_TIMESTAMP
.
TO_TIMESTAMP_LTZ¶
-
TO_TIMESTAMP_LTZ(EXPR)
Equivalent to
TO_TIMESTAMP
except that it uses the local time zone.
TRY_TO_TIMESTAMP_LTZ¶
-
TRY_TO_TIMESTAMP_NTZ(EXPR)
Equivalent to
TRY_TO_TIMESTAMP
except that it uses the local time zone.
TO_TIMESTAMP_TZ¶
-
TO_TIMESTAMP_LTZ(EXPR)
Equivalent to
TO_TIMESTAMP
except that it uses the local time zone, or keeps the original timezone if the input is a timezone-aware timestamp.
TRY_TO_TIMESTAMP_TZ¶
-
TRY_TO_TIMESTAMP_NTZ(EXPR)
Equivalent to
TRY_TO_TIMESTAMP
except that it uses the local time zone, or keeps the original timezone if the input is a timezone-aware timestamp.
TO_ARRAY¶
-
TO_ARRAY(EXPR)
Converts the input expression to a single-element array containing this value. If the input is an ARRAY, or VARIANT containing an array value, the result is unchanged. Returns
NULL
forNULL
or a JSON null input.
TO_OBJECT¶
-
TO_OBJECT(EXPR)
If the input is an object type or a variant containing an object type, returns the input unmodified (except that its type is now
OBJECT
if it wasVARIANT
). For all other types, raises an error. ReturnsNULL
forNULL
or a JSON null input.
Table Functions¶
Bodo currently supports the following functions that produce tables:
FLATTEN¶
-
FLATTEN([INPUT=>]expr[, PATH=>path_epxr][, OUTER=>outer_expr][, RECURSIVE=>recursive_expr][, MODE=>mode_epxr])
Takes in a column of semi-structured data and produces a table by "exploding" the data into multiple rows, producing the following columns:
SEQ
: not currently supported by BodoSQL.KEY
: the individual values from the json data.PATH
: not currently supported by BodoSQL.INDEX
: the index within the array that the value came from.VALUE
: the individual values from the array or json data.THIS
: a copy of the input data.
The function has the following named arguments:
INPUT
(required): the expression of semi-structured data to flatten. Also allowed to be passed in as a positional argument without theINPUT
keyword.PATH
(optional): a constant expression referencing how to access the semi-structured data to flatten from the input expression. BodoSQL currently only supports when this argument is omitted or is an empty string (indicating that the expression itself is the array to flatten).OUTER
(optional): a boolean indicating if a row should be generated even if the input data is an empty/null array/struct/map. The default is false. If provided, theKEY
,PATH
,INDEX
andVALUE
outputs will be null in the generated row.RECURSIVE
(optional): a boolean indicating if flattening should occur recursively, as opposed to just on the data referenced byPATH
. BodoSQL currently only supports when this argument is omitted or is false (which is the default).MODE
(optional): a string literal that can be either'OBJECT'
,'ARRAY'
or'BOTH'
, indicating what type of flattening rule should be done. BodoSQL currently only supports when this argument is omitted or is'BOTH'
(which is the default).
Note
BodoSQL supports the input argument being an array, json or variant so long as the values are of the same type (with limited support for JSON when the values are also JSON).
Below is an example of a query using the
FLATTEN
function with theLATERAL
keyword to explode an array column while also replicating another column.If the input data was as follows:
id tags 10 ["A", "B"] 16 [] 72 ["C", "A", "B", "D", "C"] Then the query would produce the following data:
id idx val 10 0 "A" 10 1 "B" 72 0 "C" 72 1 "A" 72 2 "B" 72 3 "D" 72 4 "C" Below is an example of a query using the
FLATTEN
function with theLATERAL
keyword to explode an JSON column while also replicating another column.If the input data was as follows:
id attributes 42 {"A": 0} 50 {} 64 {"B": 1, "C": 2} Then the query would produce the following data:
id key value 42 "A" 0 64 "B" 1 64 "C" 2
SPLIT_TO_TABLE¶
-
SPLIT_TO_TABLE(str, delim)
Takes in a string column and a delimeter and produces a table by "exploding" the string into multiple rows based on the delimeter, producing the following columns:
SEQ
: not currently supported by BodoSQL.INDEX
: which index in the splitted string did the current seciton come from.VALUE
: the current section of the splitted string.
Note
Currently, BodoSQL supports this function as an alias for
FLATTEN(SPLIT(str, delim))
.Below is an example of a query using the
SPLIT_TO_TABLE
function with theLATERAL
keyword to explode an string column while also replicating another column.If the input data was as follows:
id colors 50 "red orange yellow" 75 "green blue" Then the query would produce the following data:
id idx val 50 0 "red" 50 1 "orange" 50 2 "yellow" 75 0 "green" 75 1 "blue"
GENERATOR¶
-
GENERATOR([ROWCOUNT=>count][, TIMELIMIT=>sec])
Generates a table with a certain number of rows, specified by the
ROWCOUNT
argument. Currently only supports when theROW_COUNT
argument is provided and when it is a non-negative integer. Does not support when theTIMELIMIT
argument is provided, neither argument is provided, or both are provided.
Type Predicates¶
IS_ARRAY¶
-
IS_ARRAY(variant_expr)
Returns
TRUE
for all rows wherevariant_expr
is an array,NULL
if the row isNULL
, andFALSE
otherwise.
IS_OBJECT¶
-
IS_OBJECT(variant_expr)
Returns
TRUE
for all rows wherevariant_expr
is an object,NULL
if the row isNULL
, andFALSE
otherwise.
Context Functions (Session Object)¶
CURRENT_DATABASE¶
-
CURRENT_DATABASE()
Returns the name of the database in use for the current session.
Supported DataFrame Data Types¶
BodoSQL uses Pandas DataFrames to represent SQL tables in memory and converts SQL types to corresponding Python types which are used by Bodo. Below is a table mapping SQL types used in BodoSQL to their respective Python types and Bodo data types.
SQL Type(s) | Equivalent Python Type | Bodo Data Type |
---|---|---|
TINYINT |
np.int8 |
bodo.int8 |
SMALLINT |
np.int16 |
bodo.int16 |
INT |
np.int32 |
bodo.int32 |
BIGINT |
np.int64 |
bodo.int64 |
FLOAT |
np.float32 |
bodo.float32 |
DECIMAL , DOUBLE |
np.float64 |
bodo.float64 |
VARCHAR , CHAR |
str |
bodo.string_type |
TIMESTAMP , DATE |
np.datetime64[ns] |
bodo.datetime64ns |
INTERVAL(day-time) |
np.timedelta64[ns] |
bodo.timedelta64ns |
BOOLEAN |
np.bool_ |
bodo.bool_ |
BodoSQL can also process DataFrames that contain Categorical or Date columns. However, Bodo will convert these columns to one of the supported types, which incurs a performance cost. We recommend restricting your DataFrames to the directly supported types when possible.
Nullable and Unsigned Types¶
Although SQL does not explicitly support unsigned types, by default, BodoSQL maintains the exact types of the existing DataFrames registered in a [BodoSQLContext], including unsigned and non-nullable type behavior. If an operation has the possibility of creating null values or requires casting data, BodoSQL will convert the input of that operation to a nullable, signed version of the type.
Supported Literals¶
BodoSQL supports the following literal types:
array_literal
boolean_literal
datetime_literal
float_literal
integer_literal
interval_literal
object_literal
string_literal
Array Literal¶
Syntax:
where <[>
and <]>
indicate literal [
and ]
s, and expr
is any expression.
Array literals are lists of comma seperated expressions wrapped in square brackets.
Note that BodoSQL currently only supports homogenous lists, and all expr
s
must coerce to a single type.
Boolean Literal¶
Syntax:
Boolean literals are case-insensitive.
Datetime Literal¶
Syntax:
Float Literal¶
Syntax:
where digit is any numeral from 0 to 9
Integer Literal¶
Syntax:
where digit is any numeral from 0 to 9
Interval Literal¶
Syntax:
Where integer_literal is a valid integer literal and interval type is one of:
In addition, we also have limited support for YEAR[S]
and MONTH[S]
.
These literals cannot be stored in columns and currently are only
supported for operations involving add and sub.
Object Literal¶
Syntax:
Where each ki
is a unique string literal, and each vi
is an expression.
Obeys the same semantics as the function OBJECT_CONSTRUCT
, so any pair
where the key or value is null is omitted, and for now BodoSQL only supports
when all values are the same type.
String Literal¶
Syntax:
Where char is a character literal in a Python string.
BodoSQL Caching & Parameterized Queries¶
BodoSQL can reuse Bodo caching to avoid recompilation when used inside a JIT function. BodoSQL caching works the same as Bodo, so for example:
@bodo.jit(cache=True)
def f(filename):
df1 = pd.read_parquet(filename)
bc = bodosql.BodoSQLContext({"TABLE1": df1})
df2 = bc.sql("SELECT A FROM table1 WHERE B > 4")
print(df2.A.sum())
This will avoid recompilation so long as the DataFrame scheme stored in
filename
has the same schema and the code does not change.
To enable caching for queries with scalar parameters that you may want to adjust between runs, we introduce a feature called parameterized queries. In a parameterized query, the SQL query replaces a constant/scalar value with a variable, which we call a named parameter. In addition, the query is passed a dictionary of parameters which maps each name to a corresponding Python variable.
For example, if in the above SQL query we wanted to replace 4 with other integers, we could rewrite our query as:
Now anywhere that @var
is used, the value of python_var at runtime
will be used instead. This can be used in caching, because python_var
can be provided as an argument to the JIT function itself, thus enabling
changing the filter without recompiling. The full example looks like
this:
@bodo.jit(cache=True)
def f(filename, python_var):
df1 = pd.read_parquet(filename)
bc = bodosql.BodoSQLContext({"TABLE1": df1})
df2 = bc.sql("SELECT A FROM table1 WHERE B @var", {"var": python_var})
print(df2.A.sum())
Named parameters cannot be used in places that require a constant value to generate the correct implementation (e.g. TimeUnit in EXTRACT).
Note
Named parameters are case sensitive, so @var
and @VAR
are
different identifiers.
IO Handling¶
BodoSQL is great for compute based SQL queries, but you cannot yet access external storage directly from SQL. Instead, you can load and store data using Bodo and various Python APIs. Here we explain a couple common methods for loading data.
Pandas IO in JIT function with SQL Query¶
The most common way to load data is to first use Pandas APIs to load a DataFrame inside a JIT function and then to use that DataFrame inside a BodoSQLContext.
def f(f1, f2):
df1 = pd.read_parquet(f1)
df2 = pd.read_parquet(f2)
bc = bodosql.BodoSQLContext(
{
"T1": df1,
"T2": df2,
}
)
return bc.sql("select t1.A, t2.B from t1, t2 where t1.C > 5 and t1.D = t2.D")
Pandas IO in a JIT Function Separate from Query¶
The previous approach works well for most individual queries. However, when running several queries on the same dataset, it should ideally be loaded once for all queries. To do this, you can structure your JIT code to contain a single load function at the beginning. For example:
@bodo.jit
def load_data(f1, f2):
df1 = pd.read_parquet(f1)
df2 = pd.read_parquet(f2)
return df1, df2
def q1(df1, df2):
bc = bodosql.BodoSQLContext(
{
"T1": df1,
"T2": df2,
}
)
return bc.sql("select t1.A, t2.B from t1, t2 where t1.C > 5 and t1.D = t2.D")
...
@bodo.jit
def run_queries(f1, f2):
df1, df2 = load_data(f1, f2)
print(q1(df1, df2))
print(q2(df2))
print(q3(df1))
...
run_queries(f1, f2)
This approach prevents certain optimizations, such as filter pushdown. However, the assumption here is that you will use the entire DataFrame across the various benchmarks, so no optimization is useful by itself. In addition, any optimizations that can apply to all queries can be done explicitly inside load_data
. For example, if all queries are operate on a single day's data with df1
, you can write that filter in load_data
to limit IO and filter pushdown will be performed.
@bodo.jit
def load_data(f1, f2, target_date):
df1 = pd.read_parquet(f1)
# Applying this filter limits how much data is loaded.
df1 = df1[df1.date_val == target_date]
df2 = pd.read_parquet(f2)
return df1, df2
@bodo.jit
def run_queries(f1, f2, target_date):
df1, df2 = load_data(f1, f2, target_date)
...
run_queries(f1, f2, target_date)
BodoSQLContext API¶
The BodoSQLContext
API is the primary interface for executing SQL queries. It performs two roles:
- Registering data and connection information to load tables of interest.
- Forwarding SQL queries to the BodoSQL engine for compilation and execution. This is done via the
bc.sql(query)
method, wherebc
is aBodoSQLContext
object.
A BodoSQLContext
can be defined in regular Python and passed as an argument to JIT functions or can be
defined directly inside JIT functions. We recommend defining and modifying a BodoSQLContext
in regular
Python whenever possible.
For example:
bc = bodosql.BodoSQLContext(
{
"T1": bodosql.TablePath("my_file_path.pq", "parquet"),
},
catalog=bodosql.SnowflakeCatalog(
username,
password,
account_name,
warehouse_name,
database name,
)
)
@bodo.jit
def f(bc):
return bc.sql("select t1.A, t2.B from t1, catalogSchema.t2 where t1.C > 5 and t1.D = catalogSchema.t2.D")
API Reference¶
-
bodosql. BodoSQLContext (tables: Optional[Dict[str, Union[pandas.DataFrame|TablePath]]] = None, catalog: Optional[DatabaseCatalog] = None)Defines a
BodoSQLContext
with the given local tables and catalog.Arguments
-
tables
: A dictionary that maps a name used in a SQL query to aDataFrame
orTablePath
object. -
catalog
: ADatabaseCatalog
used to load tables from a remote database (e.g. Snowflake).
-
-
bodosql.BodoSQLContext. sql (self, query: str, params_dict: Optional[Dict[str, Any] = None)Executes a SQL query using the tables registered in this
BodoSQLContext
. This function should be used inside a@bodo.jit
function.Arguments
query
: The SQL query to execute. This function generates code that is compiled so thequery
argument is required to be a compile time constant.
-
params_dict
: A dictionary that maps a SQL usable name to Python variables. For more information please refer to the BodoSQL named parameters section.Returns
A
DataFrame
that results from executing the query. -
bodosql.BodoSQLContext. add_or_replace_view (self, name: str, table: Union[pandas.DataFrame, TablePath])Create a new
BodoSQLContext
from an existingBodoSQLContext
by adding or replacing a table.Arguments
-
name
: The name of the table to add. If the name already exists references to that table are removed from the new context. -
table
: The table object to add.table
must be aDataFrame
orTablePath
object.
Returns
A new
BodoSQLContext
that retains the tables and catalogs from the oldBodoSQLContext
and inserts the new table specified.Note
This DOES NOT update the given context. Users should always use the
BodoSQLContext
object returned from the function call. e.g.bc = bc.add_or_replace_view("t1", table)
-
-
bodosql.BodoSQLContext. remove_view (self, name: str)Creates a new
BodoSQLContext
from an existing context by removing the table with the given name. If the name does not exist, aBodoError
is thrown.Arguments
name
: The name of the table to remove.
Returns
A new
BodoSQLContext
that retains the tables and catalogs from the oldBodoSQLContext
minus the table specified.Note
This DOES NOT update the given context. Users should always use the
BodoSQLContext
object returned from the function call. e.g.bc = bc.remove_view("t1")
-
bodosql.BodoSQLContext. add_or_replace_catalog (self, catalog: DatabaseCatalog)Create a new
BodoSQLContext
from an existing context by replacing theBodoSQLContext
object'sDatabaseCatalog
with a new catalog.Arguments
catalog
: The catalog to insert.
Returns
A new
BodoSQLContext
that retains tables from the oldBodoSQLContext
but replaces the old catalog with the new catalog specified.Note
This DOES NOT update the given context. Users should always use the
BodoSQLContext
object returned from the function call. e.g.bc = bc.add_or_replace_catalog(catalog)
-
bodosql.BodoSQLContext. remove_catalog (self)Create a new
BodoSQLContext
from an existing context by removing itsDatabaseCatalog
.Returns
A new
BodoSQLContext
that retains tables from the oldBodoSQLContext
but removes the old catalog.Note
This DOES NOT update the given context. Users should always use the
BodoSQLContext
object returned from the function call. e.g.bc = bc.remove_catalog()
TablePath API¶
The TablePath
API is a general purpose IO interface to specify IO sources. This API is meant
as an alternative to natively loading tables in Python inside JIT functions.
The TablePath
API stores the user-defined data location and the storage type to load a table of interest.
For example, here is some sample code that loads two DataFrames from parquet using the TablePath
API.
bc = bodosql.BodoSQLContext(
{
"T1": bodosql.TablePath("my_file_path1.pq", "parquet"),
"T2": bodosql.TablePath("my_file_path2.pq", "parquet"),
}
)
@bodo.jit
def f(bc):
return bc.sql("select t1.A, t2.B from t1, t2 where t1.C > 5 and t1.D = t2.D")
Here, the TablePath
constructor doesn't load any data. Instead, a BodoSQLContext
internally generates code to load the tables of interest after parsing the SQL query. Note that a BodoSQLContext
loads all used tables from I/O on every query, which means that if users would like to perform multiple queries on the same data, they should consider loading the DataFrames once in a separate JIT function.
API Reference¶
-
bodosql. TablePath (file_path: str, file_type: str, *, conn_str: Optional[str] = None, reorder_io: Optional[bool] = None)Specifies how a DataFrame should be loaded from IO by a BodoSQL query. This can only load data when used with a
BodoSQLContext
constructor.Arguments
-
file_path
: Path to IO file or name of the table for SQL. This must constant at compile time if used inside JIT. -
file_type
: Type of file to load as a string. Supported values are"parquet"
and"sql"
. This must constant at compile time if used inside JIT. -
conn_str
: Connection string used to connect to a SQL DataBase, equivalent to the conn argument topandas.read_sql
. This must be constant at compile time if used inside JIT and must be None if not loading from a SQL DataBase.
-
-
reorder_io
: Boolean flag determining when to load IO. IfFalse
, all used tables are loaded before executing any of the query. IfTrue
, tables are loaded just before first use inside the query, which often results in decreased peak memory usage as each table is partially processed before loading the next table. The default value,None
, behaves likeTrue
, but this may change in the future. This must be constant at compile time if used inside JIT.
Database Catalogs¶
Database Catalogs are configuration objects that grant BodoSQL access to load tables from a remote database.
For example, when a user wants to load data from Snowflake, a user will create a SnowflakeCatalog
to grant
BodoSQL access to their Snowflake account and load the tables of interest.
A database catalog can be registered during the construction of the BodoSQLContext
by passing it in as a parameter, or can be manually set using the
BodoSQLContext.add_or_replace_catalog
API. Currently, a BodoSQLContext
can support at most one database catalog.
When using a catalog in a BodoSQLContext
we strongly recommend creating the BodoSQLContext
once in regular Python and then
passing the BodoSQLContext
as an argument to JIT functions. There is no benefit to creating the
BodoSQLContext
in JIT and this could increase compilation time.
catalog = bodosql.SnowflakeCatalog(
username,
password,
account_name,
"DEMO_WH", # warehouse name
"SNOWFLAKE_SAMPLE_DATA", # database name
)
bc = bodosql.BodoSQLContext({"LOCAL_TABLE1": df1}, catalog=catalog)
@bodo.jit
def run_query(bc):
return bc.sql("SELECT r_name, local_id FROM TPCH_SF1.REGION, local_table1 WHERE R_REGIONKEY = local_table1.region_key ORDER BY r_name")
run_query(bc)
Database catalogs can be used alongside local, in-memory DataFrame
or TablePath
tables. If a table is
specified without a schema then BodoSQL resolves the table in the following order:
- Default Catalog Schema
- Local (in-memory) DataFrames / TablePath names
An error is raised if the table cannot be resolved after searching through both of these data sources.
This ordering indicates that in the event of a name conflict between a table in the database catalog and a local table, the table in the database catalog is used.
If a user wants to use the local table instead, the user can explicitly specify the table with the local schema __BODOLOCAL__
.
For example:
Currently, BodoSQL supports catalogs Snowflake, but support for other data storage systems will be added in future releases.
SnowflakeCatalog¶
The Snowflake Catalog offers an interface for users to connect their Snowflake accounts to use with BodoSQL.
With a Snowflake Catalog, users only have to specify their Snowflake connection once, and can access any tables of interest in their Snowflake account. Currently, the Snowflake Catalog is defined to
use a single DATABASE
(e.g. USE DATABASE
) at a time, as shown below.
catalog = bodosql.SnowflakeCatalog(
username,
password,
account_name,
"DEMO_WH", # warehouse name
"SNOWFLAKE_SAMPLE_DATA", # database name
)
bc = bodosql.BodoSQLContext(catalog=catalog)
@bodo.jit
def run_query(bc):
return bc.sql("SELECT r_name FROM TPCH_SF1.REGION ORDER BY r_name")
run_query(bc)
BodoSQL does not currently support Snowflake syntax for specifying defaults
and session parameters (e.g. USING SCHEMA <NAME>
). Instead users can pass
any session parameters through the optional connection_params
argument, which
accepts a Dict[str, str]
for each session parameter. For example, users can provide
a default schema to simplify the previous example.
catalog = bodosql.SnowflakeCatalog(
username,
password,
account,
"DEMO_WH", # warehouse name
"SNOWFLAKE_SAMPLE_DATA", # database name
connection_params={"schema": "TPCH_SF1"}
)
bc = bodosql.BodoSQLContext(catalog=catalog)
@bodo.jit
def run_query(bc):
return bc.sql("SELECT r_name FROM REGION ORDER BY r_name")
run_query(bc)
Internally, Bodo uses the following connections to Snowflake:
- A JDBC connection to lazily fetch metadata.
- The Snowflake-Python-Connector's distributed fetch API to load batches of arrow data.
API Reference¶
-
bodosql. SnowflakeCatalog (username: str, password: str, account: str, warehouse: str, database: str, connection_params: Optional[Dict[str, str]] = None)Constructor for
SnowflakeCatalog
. Allows users to execute queries on tables stored in Snowflake when theSnowflakeCatalog
object is registered with aBodoSQLContext
.Arguments
-
username
: Snowflake account username. -
password
: Snowflake account password. -
account
: Snowflake account name. -
warehouse
: Snowflake warehouse to use when loading data. -
database
: Name of Snowflake database to load data from. The Snowflake Catalog is currently restricted to using a single Snowflakedatabase
. -
connection_params
: A dictionary of Snowflake session parameters.
-
Supported Query Types¶
The SnowflakeCatalog
currently supports the following types of SQL queries:
SELECT
INSERT INTO
DELETE
Identifier Case Sensitivity¶
In BodoSQL all identifiers not wrapped in quotes are automatically converted to upper case. If you are a Snowflake user who is using either the Snowflake Catalog or Table Path API, then this should not impact you and the rules will be the same as Snowflake (i.e. identifiers are case-insensitive unless wrapped in quotes during table creation). See here for the Snowflake documentation..
This means that the following queries are equivalent:
When providing column or table names, identifiers will only match if the original name is in uppercase For example, the following code will fail to compile because there is no match for TABLE1:
@bodo.jit
def f(filename):
df1 = pd.read_parquet(filename)
bc = bodosql.BodoSQLContext({"table1": df1})
return bc.sql("SELECT A FROM table1")
To match non-uppercase names you can use quotes to specify the name exactly as it appears in the BodoSQLContext definition or the columns of a DataFrame. For example:
@bodo.jit
def f(filename):
df1 = pd.read_parquet(filename)
bc = bodosql.BodoSQLContext({"table1": df1})
return bc.sql("SELECT A FROM \"table1\"")
Similarly if you want an alias to be case sensitive then you will also need it to be wrapped in quotes:
If you provide DataFrames directly from Python or are using the TablePath API to load Parquet files, then please be advised that the column names will be required to match exactly and for ease of use we highly recommend using uppercase column names.
Performance Considerations¶
Snowflake Views¶
Users may define views within their Snowflake account to enable greater query reuse. Views may constitute performance bottlenecks because if a view is evaluated in Snowflake Bodo will need to wait for the result before it can fetch data and may have less access to optimizations.
To improve performance in these circumstances Bodo will attempt to expand any views into the body of the query to allow Bodo to operate on the underlying tables. When this occurs users should face no performance penalty for using views in their queries. However there are a few situations in which this is not possible, namely
- The Snowflake User passed to Bodo does not have permissions to determine the view definition.
- The Snowflake User passed to Bodo does not have permissions to read all of the underlying tables.
- The view is a materalized or secure view.
If for any reason Bodo is unable to expand the view, then the query will execute treating the view as a table and delegate it to Snowflake.