Bodo SQL¶
Bodo SQL provides high performance and scalable SQL query execution using Bodo's HPC capabilities and optimizations. It also provides native Python/SQL integration as well as SQL to Pandas conversion for the first time. BodoSQL is in early stages and its capabilities are expanding rapidly.
Getting Started¶
Installation¶
Bodo SQL is currently in Beta. Install it using:
Using Bodo SQL¶
The example below demonstrates using Bodo SQL in Python programs. It loads data into a dataframe, runs a SQL query on the data, and runs Python/Pandas code on query results:
import pandas as pd
import bodo
import bodosql
@bodo.jit
def f(filename):
df1 = pd.read_parquet(filename)
bc = bodosql.BodoSQLContext({"table1": df1})
df2 = bc.sql("SELECT A FROM table1 WHERE B > 4")
print(df2.A.sum())
f("my_data.pq")
This program is fully type checked, optimized and parallelized by Bodo
end-to-end. BodoSQLContext
creates a SQL environment with tables
created from dataframes. BodoSQLContext.sql()
runs a SQL query and
returns the results as a dataframe. BodoSQLContext
can be used outside
Bodo JIT functions if necessary as well.
You can run this example by creating my_data.pq
:
import pandas as pd
import numpy as np
NUM_GROUPS = 30
NUM_ROWS = 20_000_000
df = pd.DataFrame({
"A": np.arange(NUM_ROWS) % NUM_GROUPS,
"B": np.arange(NUM_ROWS)
})
df.to_parquet("my_data.pq")
To run the example, save it in a file called example.py
and run it using mpiexec, e.g.:
Aliasing¶
In all but the most trivial cases, Bodo SQL generates internal names to avoid conflicts in the intermediate dataframes. By default, Bodo SQL does not rename the columns for the final output of a query using a consistent approach. For example the query:
Results in an output column named$EXPR0
. To reliably reference this
column later in your code, we highly recommend using aliases for all
columns that are the final outputs of a query, such as:
Note
BodoSQL supports using aliases generated in SELECT
inside
GROUP BY
and HAVING
in the same query, but you cannot do so with
WHERE
.
Supported Operations¶
We currently support the following SQL query statements and clauses with
Bodo SQL, and are continuously adding support towards completeness. Note
that Bodo SQL ignores casing of keywords, and column and table names,
except for the final output column name. Therefore,
select a from table1
is treated the same as SELECT A FROM Table1
,
except for the names of the final output columns (a
vs A
).
SELECT¶
The SELECT
statement is used to select data in the form of
columns. The data returned from Bodo SQL is stored in a dataframe.
For Instance:
Example Usage:
>>>@bodo.jit
... def g(df):
... bc = bodosql.BodoSQLContext({"customers":df})
... query = "SELECT name FROM customers"
... res = bc.sql(query)
... return res
>>>customers_df = pd.DataFrame({
... "customerID": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
... "name": ["Deangelo Todd","Nikolai Kent","Eden Heath", "Taliyah Martinez",
... "Demetrius Chavez","Weston Jefferson","Jonathon Middleton",
... "Shawn Winters","Keely Hutchinson", "Darryl Rosales",],
... "balance": [1123.34, 2133.43, 23.58, 8345.15, 943.43, 68.34, 12764.50, 3489.25, 654.24, 25645.39]
... })
>>>g(customers_df)
name
0 Deangelo Todd
1 Nikolai Kent
2 Eden Heath
3 Taliyah Martinez
4 Demetrius Chavez
5 Weston Jefferson
6 Jonathon Middleton
7 Shawn Winters
8 Keely Hutchinson
9 Darryl Rosales
SELECT DISTINCT¶
The SELECT DISTINCT
statement is used to return only distinct
(different) values:
DISTINCT
can be used in a SELECT statement or inside an
aggregate function. For example:
Example Usage
>>>@bodo.jit
... def g(df):
... bc = bodosql.BodoSQLContext({"payments":df})
... query = "SELECT DISTINCT paymentType FROM payments"
... res = bc.sql(query)
... return res
>>>payment_df = pd.DataFrame({
... "customerID": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
... "paymentType": ["VISA", "VISA", "AMEX", "VISA", "WIRE", "VISA", "VISA", "WIRE", "VISA", "AMEX"],
... })
>>>g(payment_df) # inside SELECT
paymentType
0 VISA
2 AMEX
4 WIRE
>>>def g(df):
... bc = bodosql.BodoSQLContext({"payments":df})
... query = "SELECT COUNT(DISTINCT paymentType) as num_payment_types FROM payments"
... res = bc.sql(query)
... return res
>>>g(payment_df) # inside aggregate
num_payment_types
0 3
WHERE¶
The WHERE
clause on columns can be used to filter records that
satisfy specific conditions:
For Example:
Example Usage
>>>@bodo.jit
... def g(df):
... bc = bodosql.BodoSQLContext({"customers":df})
... query = "SELECT name FROM customers WHERE balance 3000"
... res = bc.sql(query)
... return res
>>>customers_df = pd.DataFrame({
... "customerID": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
... "name": ["Deangelo Todd","Nikolai Kent","Eden Heath", "Taliyah Martinez",
... "Demetrius Chavez","Weston Jefferson","Jonathon Middleton",
... "Shawn Winters","Keely Hutchinson", "Darryl Rosales",],
... "balance": [1123.34, 2133.43, 23.58, 8345.15, 943.43, 68.34, 12764.50, 3489.25, 654.24, 25645.39]
... })
>>>g(customers_df)
name
3 Taliyah Martinez
6 Jonathon Middleton
7 Shawn Winters
9 Darryl Rosales
ORDER BY¶
The ORDER BY
keyword sorts the resulting dataframe in ascending
or descending order, with NULL values either at the start or end
of the column. By default, it sorts the records in ascending order
with null values at the end. For descending order and nulls at the
front, the DESC
and NULLS FIRST
keywords can be used:
For Example:
Example Usage
>>>@bodo.jit
... def g(df):
... bc = bodosql.BodoSQLContext({"customers":df})
... query = "SELECT name, balance FROM customers ORDER BY balance"
... res = bc.sql(query)
... return res
>>>customers_df = pd.DataFrame({
... "customerID": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
... "name": ["Deangelo Todd","Nikolai Kent","Eden Heath", "Taliyah Martinez",
... "Demetrius Chavez","Weston Jefferson","Jonathon Middleton",
... "Shawn Winters","Keely Hutchinson", "Darryl Rosales",],
... "balance": [1123.34, 2133.43, 23.58, 8345.15, 943.43, 68.34, 12764.50, 3489.25, 654.24, 25645.39]
... })
>>>g(customers_df)
name balance
2 Eden Heath 23.58
5 Weston Jefferson 68.34
8 Keely Hutchinson 654.24
4 Demetrius Chavez 943.43
0 Deangelo Todd 1123.34
1 Nikolai Kent 2133.43
7 Shawn Winters 3489.25
3 Taliyah Martinez 8345.15
6 Jonathon Middleton 12764.50
9 Darryl Rosales 25645.39
LIMIT¶
Bodo SQL supports the LIMIT
keyword to select a limited number
of rows. This keyword can optionally include an offset:
>>>@bodo.jit
... def g1(df):
... bc = bodosql.BodoSQLContext({"customers":df})
... query = "SELECT name FROM customers LIMIT 4"
... res = bc.sql(query)
... return res
>>>@bodo.jit
... def g2(df):
... bc = bodosql.BodoSQLContext({"customers":df})
... query = "SELECT name FROM customers LIMIT 4 OFFSET 2"
... res = bc.sql(query)
... return res
>>>customers_df = pd.DataFrame({
... "customerID": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
... "name": ["Deangelo Todd","Nikolai Kent","Eden Heath", "Taliyah Martinez",
... "Demetrius Chavez","Weston Jefferson","Jonathon Middleton",
... "Shawn Winters","Keely Hutchinson", "Darryl Rosales",],
... "balance": [1123.34, 2133.43, 23.58, 8345.15, 943.43, 68.34, 12764.50, 3489.25, 654.24, 25645.39]
... })
>>>g1(customers_df) # LIMIT 4
name
0 Deangelo Todd
1 Nikolai Kent
2 Eden Heath
3 Taliyah Martinez
>>>g2(customers_df) # LIMIT 4 OFFSET 2
name
2 Eden Heath
3 Taliyah Martinez
4 Demetrius Chavez
5 Weston Jefferson
NOT IN¶
The IN
determines if a value can be chosen a list of options.
Currently we support lists of literals or columns with matching
types:
>>>@bodo.jit
... def g1(df):
... bc = bodosql.BodoSQLContext({"payments":df})
... query = "SELECT customerID FROM payments WHERE paymentType IN ('AMEX', 'WIRE')"
... res = bc.sql(query)
... return res
>>>@bodo.jit
... def g2(df):
... bc = bodosql.BodoSQLContext({"payments":df})
... query = "SELECT customerID FROM payments WHERE paymentType NOT IN ('AMEX', 'VISA')"
... res = bc.sql(query)
... return res
>>>payment_df = pd.DataFrame({
... "customerID": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
... "paymentType": ["VISA", "VISA", "AMEX", "VISA", "WIRE", "VISA", "VISA", "WIRE", "VISA", "AMEX"],
... })
>>>g1(payment_df) # IN
customerID
2 2
4 4
7 7
9 9
>>>g2(payment_df) # NOT IN
customerID
4 4
7 7
NOT BETWEEN¶
The BETWEEN
operator selects values within a given range. The
values can be numbers, text, or datetimes. The BETWEEN
operator
is inclusive: begin and end values are included:
>>>@bodo.jit
... def g(df):
... bc = bodosql.BodoSQLContext({"customers":df})
... query = "SELECT name, balance FROM customers WHERE balance BETWEEN 1000 and 5000"
... res = bc.sql(query)
... return res
>>>@bodo.jit
... def g2(df):
... bc = bodosql.BodoSQLContext({"customers":df})
... query = "SELECT name, balance FROM customers WHERE balance NOT BETWEEN 100 and 10000"
... res = bc.sql(query)
... return res
>>>customers_df = pd.DataFrame({
... "customerID": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
... "name": ["Deangelo Todd","Nikolai Kent","Eden Heath", "Taliyah Martinez",
... "Demetrius Chavez","Weston Jefferson","Jonathon Middleton",
... "Shawn Winters","Keely Hutchinson", "Darryl Rosales",],
... "balance": [1123.34, 2133.43, 23.58, 8345.15, 943.43, 68.34, 12764.50, 3489.25, 654.24, 25645.39]
... })
>>>g1(payment_df) # BETWEEN
name balance
0 Deangelo Todd 1123.34
1 Nikolai Kent 2133.43
7 Shawn Winters 3489.25
>>>g2(payment_df) # NOT BETWEEN
name balance
2 Eden Heath 23.58
5 Weston Jefferson 68.34
6 Jonathon Middleton 12764.50
9 Darryl Rosales 25645.39
CAST¶
THE CAST
operator converts an input from one type to another. In
many cases casts are created implicitly, but this operator can be
used to force a type conversion.
The following casts are currently supported. Please refer to
supported_dataframe_data_types
for
the Python types for each type keyword:
From | To | Notes |
---|---|---|
VARCHAR |
VARCHAR |
|
VARCHAR |
TINYINT/SMALLINT/INTEGER/BIGINT |
|
VARCHAR |
FLOAT/DOUBLE |
|
VARCHAR |
DECIMAL |
Equivalent to DOUBLE . This may change in the future. |
VARCHAR |
TIMESTAMP |
|
VARCHAR |
DATE |
Truncates to date but is still Timestamp type. This may change in the future. |
TINYINT/SMALLINT/INTEGER/BIGINT |
VARCHAR |
|
TINYINT/SMALLINT/INTEGER/BIGINT |
TINYINT/SMALLINT/INTEGER/BIGINT |
|
TINYINT/SMALLINT/INTEGER/BIGINT |
FLOAT/DOUBLE |
|
TINYINT/SMALLINT/INTEGER/BIGINT |
DECIMAL |
Equivalent to DOUBLE . This may change in the future. |
TINYINT/SMALLINT/INTEGER/BIGINT |
TIMESTAMP |
|
FLOAT/DOUBLE |
VARCHAR |
|
FLOAT/DOUBLE |
TINYINT/SMALLINT/INTEGER/BIGINT |
|
FLOAT/DOUBLE |
FLOAT/DOUBLE |
|
FLOAT/DOUBLE |
DECIMAL |
Equivalent to DOUBLE . This may change in the future |
TIMESTAMP |
VARCHAR |
|
TIMESTAMP |
TINYINT/SMALLINT/INTEGER/BIGINT |
|
TIMESTAMP |
TIMESTAMP |
|
TIMESTAMP |
DATE |
Truncates to date but is still Timestamp type. This may change in the future. |
Note
CAST correctness can often not be determined at compile time.
Users are responsible for ensuring that conversion is possible
(e.g. CAST(str_col as INTEGER)
).
JOIN¶
A JOIN
clause is used to combine rows from two or more tables,
based on a related column between them:
SELECT <COLUMN_NAMES>
FROM <LEFT_TABLE_NAME>
<JOIN_TYPE<RIGHT_TABLE_NAME>
ON <LEFT_TABLE_COLUMN_NAME= <RIGHT_TABLE_COLUMN_NAME>
(INNER) JOIN
: returns records that have matching values in both tablesLEFT (OUTER) JOIN
: returns all records from the left table, and the matched records from the right tableRIGHT (OUTER) JOIN
: returns all records from the right table, and the matched records from the left tableFULL (OUTER) JOIN
: returns all records when there is a match in either left or right table
Bodo SQL currently supports inner join on all conditions, but all outer joins are only supported on an equality between columns.
Example Usage
>>>@bodo.jit
... def g1(df1, df2):
... bc = bodosql.BodoSQLContext({"customers":df1, "payments":df2})
... query = "SELECT name, paymentType FROM customers JOIN payments ON customers.customerID = payments.customerID"
... res = bc.sql(query)
... return res
>>>@bodo.jit
... def g2(df1, df2):
... bc = bodosql.BodoSQLContext({"customers":df1, "payments":df2})
... query = "SELECT name, paymentType FROM customers FULL JOIN payments ON customers.customerID = payments.customerID"
... res = bc.sql(query)
... return res
>>>customer_df = pd.DataFrame({
... "customerID": [0, 2, 4, 5, 7,],
... "name": ["Deangelo Todd","Nikolai Kent","Eden Heath", "Taliyah Martinez","Demetrius Chavez",],
... "address": ["223 Iroquois LanenWest New York, NJ 07093","37 Depot StreetnTaunton, MA 02780",
... "639 Maple St.nNorth Kingstown, RI 02852","93 Bowman Rd.nChester, PA 19013",
... "513 Manchester Ave.nWindsor, CT 06095",],
... "balance": [1123.34, 2133.43, 23.58, 8345.15, 943.43,]
... })
>>>payment_df = pd.DataFrame({
... "customerID": [0, 1, 4, 6, 7],
... "paymentType": ["VISA", "VISA", "AMEX", "VISA", "WIRE",],
... })
>>>g1(customer_df, payment_df) # INNER JOIN
name paymentType
0 Deangelo Todd VISA
1 Eden Heath AMEX
2 Demetrius Chavez WIRE
>>>g2(customer_df, payment_df) # OUTER JOIN
name paymentType
0 Deangelo Todd VISA
1 Nikolai Kent NaN
2 Eden Heath AMEX
3 Taliyah Martinez NaN
4 Demetrius Chavez WIRE
5 NaN VISA
6 NaN VISA
UNION¶
The UNION operator is used to combine the result-set of two SELECT statements:
Each SELECT statement within the UNION clause must have the same number of columns. The columns must also have similar data types. The output of the UNION is the set of rows which are present in either of the input SELECT statements.The UNION operator selects only the distinct values from the inputs by default. To allow duplicate values, use UNION ALL:
Example Usage
>>>@bodo.jit
... def g1(df):
... bc = bodosql.BodoSQLContext({"customers":df1, "payments":df2})
... query = "SELECT name, paymentType FROM customers JOIN payments ON customers.customerID = payments.customerID WHERE paymentType in ('WIRE')
... UNION SELECT name, paymentType FROM customers JOIN payments ON customers.customerID = payments.customerID WHERE balance < 1000"
... res = bc.sql(query)
... return res
>>>@bodo.jit
... def g2(df):
... bc = bodosql.BodoSQLContext({"customers":df1, "payments":df2})
... query = "SELECT name, paymentType FROM customers JOIN payments ON customers.customerID = payments.customerID WHERE paymentType in ('WIRE')
... UNION ALL SELECT name, paymentType FROM customers JOIN payments ON customers.customerID = payments.customerID WHERE balance < 1000"
... res = bc.sql(query)
... return res
>>>customer_df = pd.DataFrame({
... "customerID": [0, 2, 4, 5, 7,],
... "name": ["Deangelo Todd","Nikolai Kent","Eden Heath", "Taliyah Martinez","Demetrius Chavez",],
... "address": ["223 Iroquois LanenWest New York, NJ 07093","37 Depot StreetnTaunton, MA 02780",
... "639 Maple St.nNorth Kingstown, RI 02852","93 Bowman Rd.nChester, PA 19013",
... "513 Manchester Ave.nWindsor, CT 06095",],
... "balance": [1123.34, 2133.43, 23.58, 8345.15, 943.43,]
... })
>>>payment_df = pd.DataFrame({
... "customerID": [0, 1, 4, 6, 7],
... "paymentType": ["VISA", "VISA", "AMEX", "VISA", "WIRE",],
... })
>>>g1(customer_df, payment_df) # UNION
name paymentType balance
0 Demetrius Chavez WIRE 943.43
0 Eden Heath AMEX 23.58
>>>g2(customer_df, payment_df) # UNION ALL
name paymentType balance
0 Demetrius Chavez WIRE 943.43
0 Eden Heath AMEX 23.58
1 Demetrius Chavez WIRE 943.43
INTERSECT¶
The INTERSECT operator is used to calculate the intersection of two SELECT statements:
Each SELECT statement within the INTERSECT clause must have the same number of columns. The columns must also have similar data types. The output of the INTERSECT is the set of rows which are present in both of the input SELECT statements. The INTERSECT operator selects only the distinct values from the inputs.
GROUP BY¶
The GROUP BY
statement groups rows that have the same values
into summary rows, like "find the number of customers in each
country". The GROUP BY
statement is often used with aggregate
functions to group the result-set by one or more columns:
SELECT <COLUMN_NAMES>
FROM <TABLE_NAME>
WHERE <CONDITION>
GROUP BY <COLUMN_NAMES>
ORDER BY <COLUMN_NAMES>
For example:
GROUP BY
statements also referring to columns by alias or
column number:
HAVING¶
The HAVING
clause is used for filtering with GROUP BY
.
HAVING
applies the filter after generating the groups, whereas
WHERE
applies the filter before generating any groups:
HAVING
statements also referring to columns by aliases used in
the GROUP BY
:
CASE¶
The CASE
statement goes through conditions and returns a value
when the first condition is met:
LIKE¶
The LIKE
clause is used to filter the strings in a column to
those that match a pattern:
%
and _
. For example:
GREATEST¶
The GREATEST
clause is used to return the largest value from a
list of columns:
LEAST¶
The LEAST
clause is used to return the smallest value from a
list of columns:
PIVOT¶
The PIVOT
clause is used to transpose specific data rows in one
or more columns into a set of columns in a new DataFrame:
SELECT col1, ..., colN FROM table_name PIVOT (
AGG_FUNC_1(colName or pivotVar) AS alias1, ..., AGG_FUNC_N(colName or pivotVar) as aliasN
FOR pivotVar IN (ROW_VALUE_1 as row_alias_1, ..., ROW_VALUE_N as row_alias_N)
)
PIVOT
produces a new column for each pair of pivotVar and
aggregation functions.
For example:
SELECT single_sum_a, single_avg_c, triple_sum_a, triple_avg_c FROM table1 PIVOT (
SUM(A) AS sum_a, AVG(C) AS avg_c
FOR A IN (1 as single, 3 as triple)
)
single_sum_a
will contain sum(A) where A = 1
,
single_avg_c will contain AVG(C) where A = 1
etc.
If you explicitly specify other columns as the output, those columns will be used to group the pivot columns. For example:
SELECT B, single_sum_a, single_avg_c, triple_sum_a, triple_avg_c FROM table1 PIVOT (
SUM(A) AS sum_a, AVG(C) AS avg_c
FOR A IN (1 as single, 3 as triple)
)
SELECT * FROM table1 PIVOT (
SUM(A) AS sum_a, AVG(C) AS avg_c
FOR (A, B) IN ((1, 4) as col1, (2, 5) as col2)
)
WITH¶
The WITH
clause can be used to name subqueries:
Aliasing¶
SQL aliases are used to give a table, or a column in a table, a temporary name:
For example:
We strongly recommend using aliases for the final outputs of any queries to ensure all column names are predictable.
Operators¶
Arithmetic¶
-
Bodo SQL currently supports the following arithmetic operators:
+
(addition)-
(subtraction)*
(multiplication)/
(true division)%
(modulo)
Comparison¶
-
Bodo SQL currently supports the following comparison operators:
=
(equal to)>
(greater than)<
(less than)>=
(greater than or equal to)<=
(less than or equal to)<>
(not equal to)!=
(not equal to)<=>
(equal to or both inputs are null)
Logical¶
-
Bodo SQL currently supports the following logical operators:
AND
OR
NOT
String¶
-
Bodo SQL currently supports the following string operators:
||
(string concatenation)
Numeric Functions¶
Except where otherwise specified, the inputs to each of these functions can be any numeric type, column or scalar. Here is an example using MOD:
Bodo SQL Currently supports the following Numeric Functions:
ABS¶
-
ABS(n)
Returns the absolute value of n
COS¶
-
COS(n)
Calculates the Cosine of n
SIN¶
-
SIN(n)
Calculates the Sine of n
TAN¶
-
TAN(n)
Calculates the Tangent of n
ACOS¶
-
ACOS(n)
Calculates the Arccosine of n
ASIN¶
-
ASIN(n)
Calculates the Arcsine of n
ATAN¶
-
ATAN(n)
Calculates the Arctangent of n
ATAN2¶
-
ATAN2(A, B)
Calculates the Arctangent of
A
divided byB
COTAN¶
-
COTAN(X)
Calculates the Cotangent of
X
CEIL¶
-
CEIL(X)
Converts X to an integer, rounding towards positive infinity
CEILING¶
-
CEILING(X)
Equivalent to CEIL
FLOOR¶
-
FLOOR(X)
Converts X to an integer, rounding towards negative infinity
DEGREES¶
-
DEGREES(X)
Converts a value in radians to the corresponding value in degrees
RADIANS¶
-
RADIANS(X)
Converts a value in radians to the corresponding value in degrees
LOG10¶
-
LOG10(X)
Computes Log base 10 of x. Returns NaN for negative inputs, and -inf for 0 inputs.
LOG¶
-
LOG(X)
Equivalent to LOG10(x)
LOG10¶
-
LOG10(X)
Computes Log base 2 of x. Returns NaN for negative inputs, and -inf for 0 inputs.
LN¶
-
LN(X)
Computes the natural log of x. Returns NaN for negative inputs, and -inf for 0 inputs.
MOD¶
-
MOD(A,B)
Computes A modulo B.
CONV¶
-
CONV(X, current_base, new_base)
CONV
takes a string representation of an integer value, it's current_base, and the base to convert that argument to.CONV
returns a new string, that represents the value in the new base.CONV
is only supported for converting to/from base 2, 8, 10, and 16.For example:
SQRT¶
SQRT(X)
Computes the square root of x. Returns NaN for negative inputs, and -inf for 0 inputs.
PI¶
-
PI()
Returns the value of PI
POW, POWER¶
-
POW(A, B), POWER(A, B)
Returns A to the power of B. Returns NaN if A is negative, and B is a float. POW(0,0) is 1
EXP¶
-
EXP(X)
Returns e to the power of X
SIGN¶
-
SIGN(X)
Returns 1 if X 0, -1 if X < 0, and 0 if X = 0
ROUND¶
-
ROUND(X, num_decimal_places)
Rounds X to the specified number of decimal places
TRUNCATE¶
-
TRUNCATE(X, num_decimal_places)
Equivalent to
ROUND(X, num_decimal_places)
Aggregation Functions¶
Bodo SQL Currently supports the following Aggregation Functions on all types:
COUNT¶
-
COUNT
Count the number of elements in a column or group.
In addition, Bodo SQL also supports the following functions on numeric types
AVG¶
-
AVG
Compute the mean for a column.
MAX¶
-
MAX
Compute the max value for a column.
MIN¶
-
MIN
Compute the min value for a column.
STDDEV¶
-
STDDEV
Compute the standard deviation for a column with N - 1 degrees of freedom.
STDDEV_SAMP¶
-
STDDEV_SAMP
Compute the standard deviation for a column with N - 1 degrees of freedom.
STDDEV_POP¶
-
STDDEV_POP
Compute the standard deviation for a column with N degrees of freedom.
SUM¶
-
SUM
Compute the sum for a column.
VARIANCE¶
-
VARIANCE
Compute the variance for a column with N - 1 degrees of freedom.
VAR_SAMP¶
-
VAR_SAMP
Compute the variance for a column with N - 1 degrees of freedom.
VAR_POP¶
-
VAR_POP
Compute the variance for a column with N degrees of freedom.
All aggregate functions have the syntax:
These functions can be used either in a groupby clause, where they will be computed for each group, or by itself on an entire column expression. For example:
Timestamp Functions¶
Bodo SQL currently supports the following Timestamp functions:
DATEDIFF¶
-
DATEDIFF(timestamp_val1, timestamp_val2)
Computes the difference in days between two Timestamp values
STR_TO_DATE¶
-
STR_TO_DATE(str_val, literal_format_string)
Converts a string value to a Timestamp value given a literal format string. If a year, month, and day value is not specified, they default to 1900, 01, and 01 respectively. Will throw a runtime error if the string cannot be parsed into the expected values. See
DATE_FORMAT
for recognized formatting characters.For example:
DATE_FORMAT¶
-
DATE_FORMAT(timestamp_val, literal_format_string)
Converts a timestamp value to a String value given a scalar format string.
Recognized formatting characters:
%i
Minutes, zero padded (00 to 59)%M
Full month name (January to December)%r
Time in format in the format (hh:mm:ss AM/PM)%s
Seconds, zero padded (00 to 59)%T
Time in format in the format (hh:mm:ss)%T
Time in format in the format (hh:mm:ss)%u
week of year, where monday is the first day of the week(00 to 53)%a
Abbreviated weekday name (sun-sat)%b
Abbreviated month name (jan-dec)%f
Microseconds, left padded with 0's, (000000 to 999999)%H
Hour, zero padded (00 to 23)%j
Day Of Year, left padded with 0's (001 to 366)%m
Month number (00 to 12)%p
AM or PM, depending on the time of day%d
Day of month, zero padded (01 to 31)%Y
Year as a 4 digit value%y
Year as a 2 digit value, zero padded (00 to 99)%U
Week of year where sunday is the first day of the week (00 to 53)%S
Seconds, zero padded (00 to 59)
For example:
DATE_ADD¶
-
DATE_ADD(timestamp_val, interval)
Computes a timestamp column by adding an interval column/scalar to a timestamp value. If the first argument is a string representation of a timestamp, Bodo will cast the value to a timestamp.
DATE_SUB¶
-
DATE_SUB(timestamp_val, interval)
Computes a timestamp column by subtracting an interval column/scalar to a timestamp value. If the first argument is a string representation of a timestamp, Bodo will cast the value to a timestamp.
DATE_TRUNC¶
-
DATE_TRUNC(str_literal, timestamp_val)
Truncates a timestamp to the provided str_literal field. str_literal must be a compile time constant and one of:
- "MONTH"
- "WEEK"
- "DAY"
- "HOUR"
- "MINUTE"
- "SECOND"
- "MILLISECOND"
- "MICROSECOND"
- "NANOSECOND"
NOW¶
-
NOW()
Computes a timestamp equal to the current system time
LOCALTIMESTAMP¶
-
LOCALTIMESTAMP()
Equivalent to
NOW
CURDATE¶
-
CURDATE()
Computes a timestamp equal to the current system time, excluding the time information
CURRENT_DATE¶
-
CURRENT_DATE()
Equivalent to
CURDATE
EXTRACT¶
-
EXTRACT(TimeUnit from timestamp_val)
Extracts the specified TimeUnit from the supplied date.
Allowed TimeUnits are:
MICROSECOND
MINUTE
HOUR
DAY
(Day of Month)DOY
(Day of Year)DOW
(Day of week)WEEK
MONTH
QUARTER
YEAR
TimeUnits are not case sensitive.
MICROSECOND¶
-
MICROSECOND(timestamp_val)
Equivalent to
EXTRACT(MICROSECOND from timestamp_val)
SECOND¶
-
SECOND(timestamp_val)
Equivalent to
EXTRACT(SECOND from timestamp_val)
MINUTE¶
-
MINUTE(timestamp_val)
Equivalent to
EXTRACT(MINUTE from timestamp_val)
HOUR¶
-
HOUR(timestamp_val)
Equivalent to
EXTRACT(HOUR from timestamp_val)
WEEK¶
-
WEEK(timestamp_val)
Equivalent to
EXTRACT(WEEK from timestamp_val)
WEEKOFYEAR¶
-
WEEKOFYEAR(timestamp_val)
Equivalent to
EXTRACT(WEEK from timestamp_val)
MONTH¶
-
MONTH(timestamp_val)
Equivalent to
EXTRACT(MONTH from timestamp_val)
QUARTER¶
-
QUARTER(timestamp_val)
Equivalent to
EXTRACT(QUARTER from timestamp_val)
YEAR¶
-
YEAR(timestamp_val)
Equivalent to
EXTRACT(YEAR from timestamp_val)
WEEKISO¶
-
WEEKISO(timestamp_val)
Computes the ISO week for the provided timestamp value.
YEAROFWEEKISO¶
-
YEAROFWEEKISO(timestamp_val)
Computes the ISO year for the provided timestamp value.
MAKEDATE¶
-
MAKEDATE(integer_years_val, integer_days_val)
Computes a timestamp value that is the specified number of days after the specified year.
DAYNAME¶
-
DAYNAME(timestamp_val)
Computes the string name of the day of the timestamp value.
MONTHNAME¶
-
MONTHNAME(timestamp_val)
Computes the string name of the month of the timestamp value.
TO_DAYS¶
-
TO_DAYS(timestamp_val)
Computes the difference in days between the input timestamp, and year 0 of the Gregorian calendar
TO_SECONDS¶
-
TO_SECONDS(timestamp_val)
Computes the number of seconds since year 0 of the Gregorian calendar
FROM_DAYS¶
-
FROM_DAYS(n)
Returns a timestamp values that is n days after year 0 of the Gregorian calendar
UNIX_TIMESTAMP¶
-
UNIX_TIMESTAMP()
Computes the number of seconds since the unix epoch
FROM_UNIXTIME¶
-
FROM_UNIXTIME(n)
Returns a Timestamp value that is n seconds after the unix epoch
ADDDATE¶
-
ADDDATE(timestamp_val, interval)
Same as
DATE_ADD
SUBDATE¶
-
SUBDATE(timestamp_val, interval)
Same as
DATE_SUB
TIMESTAMPDIFF¶
-
TIMESTAMPDIFF(unit, timestamp_val1, timestamp_val2)
Returns timestamp_val1 - timestamp_val2 rounded down to the provided unit.
WEEKDAY¶
-
WEEKDAY(timestamp_val)
Returns the weekday number for timestamp_val.
Note
Monday = 0, Sunday=6
YEARWEEK¶
-
YEARWEEK(timestamp_val)
Returns the year and week number for the provided timestamp_val concatenated as a single number. For example:
LAST_DAY¶
-
LAST_DAY(timestamp_val)
Given a timestamp value, returns a timestamp value that is the last day in the same month as timestamp_val.
UTC_TIMESTAMP¶
-
UTC_TIMESTAMP()
Returns the current UTC date and time as a timestamp value.
UTC_DATE¶
-
UTC_DATE()
Returns the current UTC date as a Timestamp value.
TO_DATE¶
-
TO_DATE(col_expr)
Casts the col_expr to a timestamp column truncated to the date portion.
String Functions¶
Bodo SQL currently supports the following string functions:
LOWER¶
-
LOWER(str)
Converts the string scalar/column to lower case.
LCASE¶
-
LCASE(str)
Same as
LOWER
.
UPPER¶
-
UPPER(str)
Converts the string scalar/column to upper case.
UCASE¶
-
UCASE(str)
Same as
UPPER
.
CONCAT¶
-
CONCAT(str_0, str_1, ...)
Concatenates the strings together. Requires at least two arguments.
CONCAT_WS¶
-
CONCAT_WS(str_separator, str_0, str_1, ...)
Concatenates the strings together, with the specified separator. Requires at least three arguments
SUBSTRING¶
-
SUBSTRING(str, start_index, len)
Takes a substring of the specified string, starting at the specified index, of the specified length.
start_index = 1
specifies the first character of the string,start_index = -1
specifies the last character of the string.start_index = 0
causes the function to return empty string. Ifstart_index
is positive and greater then the length of the string, returns an empty string. Ifstart_index
is negative, and has an absolute value greater then the length of the string, the behavior is equivalent tostart_index = 1
.For example:
SUBSTRING('hello world', 1, 5) =='hello' SUBSTRING('hello world', -5, 7) =='world' SUBSTRING('hello world', -20, 8) =='hello wo' SUBSTRING('hello world', 0, 10) ==''
MID¶
MID(str, start_index, len)
Equivalent to
SUBSTRING
SUBSTR¶
-
SUBSTR(str, start_index, len)
Equivalent to
SUBSTRING
LEFT¶
-
LEFT(str, n)
Takes a substring of the specified string consisting of the leftmost n characters
RIGHT¶
-
RIGHT(str, n)
Takes a substring of the specified string consisting of the rightmost n characters
REPEAT¶
-
REPEAT(str, len)
Extends the specified string to the specified length by repeating the string. Will truncate the string If the string's length is less then the len argument
For example:
STRCMP¶
-
STRCMP(str1, str2)
Compares the two strings lexicographically. If
str1 > str2
, return 1. Ifstr1 < str2
, returns -1. Ifstr1 == str2
, returns 0.
REVERSE¶
-
REVERSE(str)
Returns the reversed string.
ORD¶
-
ORD(str)
Returns the integer value of the unicode representation of the first character of the input string. returns 0 when passed the empty string
CHAR¶
-
CHAR(int)
Returns the character of the corresponding unicode value. Currently only supported for ASCII characters (0 to 127, inclusive)
SPACE¶
-
SPACE(int)
Returns a string containing the specified number of spaces.
LTRIM¶
-
LTRIM(str)
returns the input string, will remove all spaces from the left of the string
RTRIM¶
-
RTRIM(str)
returns the input string, will remove all spaces from the right of the string
TRIM¶
-
TRIM(str)
returns the input string, will remove all spaces from the left and right of the string
SUBSTRING_INDEX¶
-
SUBSTRING_INDEX(str, delimiter_str, n)
Returns a substring of the input string, which contains all characters that occur before n occurrences of the delimiter string. if n is negative, it will return all characters that occur after the last n occurrences of the delimiter string. If
num_occurrences
is 0, it will return the empty stringFor example:
LPAD¶
-
LPAD(string, len, padstring)
Extends the input string to the specified length, by appending copies of the padstring to the left of the string. If the input string's length is less then the len argument, it will truncate the input string.
For example:
RPAD¶
-
RPAD(string, len, padstring)
Extends the input string to the specified length, by appending copies of the padstring to the right of the string. If the input string's length is less then the len argument, it will truncate the input string.
For example:
REPLACE¶
-
REPLACE(base_string, substring_to_remove, string_to_substitute)
Replaces all occurrences of the specified substring with the substitute string.
For example:
LENGTH¶
-
LENGTH(string)
Returns the number of characters in the given string.
Control flow Functions¶
IF¶
-
IF(Cond, TrueValue, FalseValue)
Returns
TrueValue
if cond is true, andFalseValue
if cond is false. Logically equivalent to:
IFNULL¶
-
IFNULL(Arg0, Arg1)
Returns
Arg1
ifArg0
isnull
, and otherwise returnsArg1
. If arguments do not have the same type, Bodo SQL will attempt to cast them all to a common type, which is currently undefined behavior.
NVL¶
-
NVL(Arg0, Arg1)
Equivalent to
IFNULL
NULLIF¶
-
NULLIF(Arg0, Arg1)
Returns
null
if theArg0
evaluates to true, and otherwise returnsArg1
COALESCE¶
-
COALESCE(A, B, C, ...)
Returns the first non NULL argument, or NULL if no non NULL argument is found. Requires at least two arguments. If Arguments do not have the same type, Bodo SQL will attempt
COUNT¶
-
COUNT(*)
Compute the number of entries in a window.
SUM¶
-
SUM(COLUMN_EXPRESSION)
Compute the sum over the window or
NULL
if the window is empty.
AVG¶
-
AVG(COLUMN_EXPRESSION)
Compute the average over the window or
NULL
if the window is empty.
STDDEV¶
-
STDDEV(COLUMN_EXPRESSION)
Compute the standard deviation for a sample over the window or
NULL
if the window is empty.
STDDEV_POP¶
-
STDDEV_POP(COLUMN_EXPRESSION)
Compute the standard deviation for a population over the window or
NULL
if the window is empty.
VARIANCE¶
-
VARIANCE(COLUMN_EXPRESSION)
Compute the variance for a sample over the window or
NULL
if the window is empty.
VAR_POP¶
-
VAR_POP(COLUMN_EXPRESSION)
Compute the variance for a population over the window or
NULL
if the window is empty.
Window Functions¶
Window functions can be used to compute an aggregation across a row and its surrounding rows. Most window functions have the following syntax:
SELECT WINDOW_FN(ARG1, ..., ARGN) OVER (PARTITION BY PARTITION_COLUMN_1, ..., PARTITION_COLUMN_N ORDER BY SORT_COLUMN_1, ..., SORT_COLUMN_N ROWS BETWEEN <LOWER_BOUND AND <UPPER_BOUND>) FROM table_name
ROWS BETWEEN ROWS BETWEEN <LOWER_BOUNDAND <UPPER_BOUND>
section is used to specify the window over which to compute the
function. A bound can can come before the current row, using
PRECEDING
or after the current row, using
FOLLOWING
. The bounds can be relative (i.e.
N PRECEDING
) or they can be absolute using the UNBOUNDED
keyword. These bounds are inclusive.
For example:
Computes a sum for each row of the current row, the row preceding, and the row following. In contrast:SELECT SUM(A) OVER (PARTITION BY B ORDER BY C ROWS BETWEEN UNBOUNDED PRECEDING AND 0 FOLLOWING) FROM table1
Window functions execute by performing a series of steps which influences the final output.
- Partition by the
PARTITION_COLUMN
. This effectively performs a groupby on the providedPARTITION_COLUMN
. - Sort each group according to the Order By clause.
- Apply the function over the "window" given by the window.
- Shuffle the data back to the original ordering.
For BodoSQL, PARTITION BY
and ORDER BY
are required, but
ROWS BETWEEN
is optional. If ROWS BETWEEN
is not specified
then it defaults to computing the result over the entire window.
Currently BodoSQL supports the following Window functions:
MAX¶
-
MAX(COLUMN_EXPRESSION)
Compute the maximum value over the window or NULL if the window is empty.
MIN¶
-
MIN(COLUMN_EXPRESSION)
Compute the minimum value over the window or NULL if the window is empty.
COUNT¶
-
COUNT(COLUMN_EXPRESSION)
Compute the number of non-NULL entries in a window.
LEAD¶
-
LEAD(COLUMN_EXPRESSION, N)
Returns the row that follows the current row by N. If there are fewer than N rows the follow the current row in the window, it returns
NULL
. N must be a literal non-negative integer.This function cannot be used with
ROWS BETWEEN
.
LAG¶
-
LAG(COLUMN_EXPRESSION, N)
Returns the row that precedes the current row by N. If there are fewer than N rows the precede the current row in the window, it returns
NULL
. N must be a literal non-negative integer.This function cannot be used with
ROWS BETWEEN
.
FIRST_VALUE¶
-
FIRST_VALUE(COLUMN_EXPRESSION)
Select the first value in the window or
NULL
if the window is empty.
LAST_VALUE¶
-
LAST_VALUE(COLUMN_EXPRESSION)
Select the last value in the window or
NULL
if the window is empty.
NTH_VALUE¶
-
NTH_VALUE(COLUMN_EXPRESSION, N)
Select the Nth value in the window (1-indexed) or
NULL
if the window is empty. If N is greater or than the window size, this returnsNULL
.
NTILE¶
-
NTILE(N)
Divides the partitioned groups into N buckets based on ordering. For example if N=3 and there are 30 rows in a partition, the first 10 are assigned 1, the next 10 are assigned 2, and the final 10 are assigned 3.
ROW_NUMBER¶
-
ROW_NUMBER()
Compute an increasing row number (starting at 1) for each row. This function cannot be used with
ROWS BETWEEN
.to cast them to a common datatype, which is currently undefined behavior.
Supported DataFrame Data Types¶
BodoSQL uses Pandas DataFrames to represent SQL tables in memory and converts SQL types to corresponding Python types which are used by Bodo. Below is a table mapping SQL types used in BodoSQL to their respective Python types and Bodo data types.
SQL Type(s) | Equivalent Python Type | Bodo Data Type |
---|---|---|
TINYINT |
np.int8 |
bodo.int8 |
SMALLINT |
np.int16 |
bodo.int16 |
INT |
np.int32 |
bodo.int32 |
BIGINT |
np.int64 |
bodo.int64 |
FLOAT |
np.float32 |
bodo.float32 |
DECIMAL , DOUBLE |
np.float64 |
bodo.float64 |
VARCHAR , CHAR |
str |
bodo.string_type |
TIMESTAMP , DATE |
np.datetime64[ns] |
bodo.datetime64ns |
INTERVAL(day-time) |
np.timedelta64[ns] |
bodo.timedelta64ns |
BOOLEAN |
np.bool_ |
bodo.bool_ |
BodoSQL can also process DataFrames that contain Categorical or Date columns. However, Bodo will convert these columns to one of the supported types, which incurs a performance cost. We recommend restricting your DataFrames to the directly supported types when possible.
Nullable and Unsigned Types¶
Although SQL does not explicitly support unsigned types, by default, BodoSQL maintains the exact types of the existing DataFrames registered in a [BodoSQLContext], including unsigned and non-nullable type behavior. If an operation has the possibility of creating null values or requires casting data, BodoSQL will convert the input of that operation to a nullable, signed version of the type.
Supported Literals¶
BodoSQL supports the following literal types:
boolean_literal
datetime_literal
float_literal
integer_literal
interval_literal
string_literal
Boolean Literal¶
Syntax:
Boolean literals are case insensitive.
Datetime Literal¶
Syntax:
Float Literal¶
Syntax:
where digit is any numeral from 0 to 9
Integer Literal¶
Syntax:
where digit is any numeral from 0 to 9
Interval Literal¶
Syntax:
Where integer_literal is a valid integer literal and interval type is one of:
In addition we also have limited support for YEAR[S]
and MONTH[S]
.
These literals cannot be stored in columns and currently are only
supported for operations involving add and sub.
String Literal¶
Syntax:
Where char is a character literal in a Python string.
NULL Semantics¶
Bodo SQL converts SQL queries to Pandas code that executes inside Bodo. As a result, NULL behavior aligns with Pandas and may be slightly different than other SQL systems. This is currently an area of active development to ensure compatibility with other SQL systems.
Most operators with a NULL
input return NULL
. However, there a couple
notable places where Bodo SQL may not match other SQL systems:
- Bodo SQL treats
NaN
the same asNULL
- Is (NOT) False and Is (NOT) True return
NULL
when used on a null expression - AND will return
NULL
if any of the inputs isNULL
BodoSQL Caching & Parameterized Queries¶
BodoSQL can reuse Bodo caching to avoid recompilation when used inside a JIT function. BodoSQL caching works the same as Bodo, so for example:
@bodo.jit(cache=True)
def f(filename):
df1 = pd.read_parquet(filename)
bc = bodosql.BodoSQLContext({"table1": df1})
df2 = bc.sql("SELECT A FROM table1 WHERE B > 4")
print(df2.A.sum())
This will avoid recompilation so long as the DataFrame scheme stored in
filename
has the same schema and the code does not change.
To enable caching for queries with scalar parameters that you may want to adjust between runs, we introduce a feature called parameterized queries. In a parameterized query, the SQL query replaces a constant/scalar value with a variable, which we call a named parameter. In addition, the query is passed a dictionary of parameters which maps each name to a corresponding Python variable.
For example, if in the above SQL query we wanted to replace 4 with other integers, we could rewrite our query as:
Now anywhere that @var
is used, the value of python_var at runtime
will be used instead. This can be used in caching, because python_var
can be provided as an argument to the JIT function itself, thus enabling
changing the filter without recompiling. The full example looks like
this:
@bodo.jit(cache=True)
def f(filename, python_var):
df1 = pd.read_parquet(filename)
bc = bodosql.BodoSQLContext({"table1": df1})
df2 = bc.sql("SELECT A FROM table1 WHERE B @var", {"var": python_var})
print(df2.A.sum())
Named parameters cannot be used in places that require a constant value to generate the correct implementation (e.g. TimeUnit in EXTRACT).
IO Handling¶
BodoSQL is great for compute based SQL queries, but you cannot yet access external storage directly from SQL. Instead, you can load and store data using Bodo and various Python APIs. Here we explain a couple common methods for loading data.
Pandas IO in JIT function with SQL Query¶
The most common way to load data is to first use Pandas APIs to load a DataFrame inside a JIT function and then to use that DataFrame inside a BodoSQLContext.
def f(f1, f2):
df1 = pd.read_parquet(f1)
df2 = pd.read_parquet(f2)
bc = bodosql.BodoSQLContext(
{
"t1": df1,
"t2": df2,
}
)
return bc.sql("select t1.A, t2.B from t1, t2 where t1.C > 5 and t1.D = t2.D")
Pandas IO in a JIT Function Separate from Query¶
The previous approach works well for most individual queries. However, when running several queries on the same dataset, it should ideally be loaded once for all queries. To do this, you can structure your JIT code to contain a single load function at the beginning. For example:
@bodo.jit
def load_data(f1, f2):
df1 = pd.read_parquet(f1)
df2 = pd.read_parquet(f2)
return df1, df2
def q1(df1, df2):
bc = bodosql.BodoSQLContext(
{
"t1": df1,
"t2": df2,
}
)
return bc.sql("select t1.A, t2.B from t1, t2 where t1.C > 5 and t1.D = t2.D")
...
@bodo.jit
def run_queries(f1, f2):
df1, df2 = load_data(f1, f2)
print(q1(df1, df2))
print(q2(df2))
print(q3(df1))
...
run_queries(f1, f2)
This approach prevents certain optimizations, such as filter pushdown. However, the assumption here is that you will use the entire DataFrame across the various benchmarks, so no optimization is useful by itself. In addition, any optimizations that can apply to all queries can be done explicitly inside load_data
. For example, if all queries are operate on a single day's data with df1
, you can write that filter in load_data
to limit IO and filter pushdown will be performed.
@bodo.jit
def load_data(f1, f2, target_date):
df1 = pd.read_parquet(f1)
# Applying this filter limits how much data is loaded.
df1 = df1[df1.date_val == target_date]
df2 = pd.read_parquet(f2)
return df1, df2
@bodo.jit
def run_queries(f1, f2, target_date):
df1, df2 = load_data(f1, f2, target_date)
...
run_queries(f1, f2, target_date)
TablePath API¶
The TablePath
API a general purpose IO interface to specify IO sources. It is meant as an alternative to loading the exact tables that you are using inside your JIT function. The TablePath
API indicates how to find a table if needed. BodoSQL will then load that table when needed within a single query. For example, here is some sample code that loads two DataFrames from parquet using the TablePath
API.
@bodo.jit
def f(f1, f2):
bc = bodosql.BodoSQLContext(
{
"t1": bodosql.TablePath(f1, "parquet"),
"t2": bodosql.TablePath(f2, "parquet"),
}
)
return bc.sql("select t1.A, t2.B from t1, t2 where t1.C > 5 and t1.D = t2.D")
Here, the TablePath
constructor doesn't load any data (even when used in regular Python). Instead, a BodoSQLContext
will generate code to load that data once it sees the contents are used inside the query. This allows you to declare your tables once without worrying about memory usage resulting from loading an entire dataset. For example:
t1 = bodosql.TablePath(f1, "parquet")
t2 = bodosql.TablePath(f2, "parquet")
@bodo.jit
def f1(t1):
bc = bodosql.BodoSQLContext(
{
"t1": t1
}
)
...
@bodo.jit
def f2(t2):
bc = bodosql.BodoSQLContext(
{
"t2": t2,
}
)
...
Additionally, TablePath
provides the best filter pushdown support on individual BodoSQL queries. Note that it loads all used tables from IO on every query. If you reuse the same table and columns across multiple queries, you should consider loading the DataFrame once in a separate JIT function.
API Reference¶
-
bodosql. TablePath (file_path, file_type, *, conn_str=None, reorder_io=None)
Specifies how a DataFrame should be loaded from IO by a BodoSQL query. This can only load data when used with aBodoSQLContext
constructor.Arguments
-
file_path
: Path to IO file or name of the table for SQL. This must constant at compile time if used inside JIT. -
file_type
: Type of file to load as a string. Supported values are"parquet"
and"sql"
. This must constant at compile time if used inside JIT. -
conn_str
: Connection string used to connect to a SQL DataBase, equivalent to the conn argument topandas.read_sql
. This must be constant at compile time if used inside JIT and must be None if not loading from a SQL DataBase.
-
-
reorder_io
: Boolean flag determining when to load IO. IfFalse
, all used tables are loaded before executing any of the query. IfTrue
, tables are loaded just before first use inside the query, which often results in decreased peak memory usage as each table is partially processed before loading the next table. The default value,None
, behaves likeTrue
, but this may change in the future. This must be constant at compile time if used inside JIT.