Supported DataFrame Data Types¶
BodoSQL uses its internal Python tables to represent SQL tables in memory and converts SQL types to corresponding Python types which are used by Bodo. Below is a table mapping SQL types used in BodoSQL to their respective Python types and Bodo data types.
SQL Type(s) | Equivalent Python Type | Bodo Data Type |
---|---|---|
BOOLEAN |
np.bool_ |
bodo.bool_ |
TINYINT |
np.int8 |
bodo.int8 |
SMALLINT |
np.int16 |
bodo.int16 |
INT |
np.int32 |
bodo.int32 |
BIGINT |
np.int64 |
bodo.int64 |
FLOAT |
np.float32 |
bodo.float32 |
DOUBLE |
np.float64 |
bodo.float64 |
VARCHAR , CHAR |
str |
bodo.string_type |
VARBINARY , BINARY |
bytes |
bodo.bytes_type |
DATE |
datetime.date |
bodo.datetime_date_type |
TIME |
bodo.Time |
bodo.TimeType |
TIMESTAMP_NTZ |
pd.Timestamp |
bodo.PandasTimestampType(None) |
TIMESTAMP_LTZ |
pd.Timestamp |
bodo.PandasTimestampType(local_tz) |
TIMESTAMP_TZ |
bodo.TimestampTZ |
bodo.timestamptz_type |
INTERVAL(day-time) |
np.timedelta64[ns] |
bodo.timedelta64ns |
ARRAY |
pyarrow.large_list |
bodo.ArrayItemArray |
MAP |
pyarrow.map |
bodo.MapScalarType |
NULL |
pyarrow.NA |
bodo.null_dtype |
BodoSQL may be able to handle additional column types if the data is unused. When loading data from Snowflake or other sources, BodoSQL will treat Decimal columns as either BigInt or Float64 depending on the column's scale and precision.
Unsigned Types¶
Although SQL does not explicitly support unsigned types, BodoSQL typically maintains the types of the existing DataFrames registered in a [BodoSQLContext]. If these types are unsigned, then this may result in different behavior than expected. We always recommend working with signed types to avoid any potential issues.
TIMESTAMP_TZ¶
Note that bodo.TimestampTZ
in python is a custom type provided by the Bodo
library. In sql
this datatype is compatible with Snowflake's
TIMESTAMP_TZ.
TIMESTAMP_TZ
stores a timestamp along with a UTC
offset with a resolution of
minutes. This offset can be arbitrary, but it is not dependant on the timestamp
value. In other words, it is not aware of timezones and changes in offset such
as DST
. While most operations will use the timestamp value (not UTC
), any
comparison between two TIMESTAMP_TZ
values will treat them as equal if
their UTC
time is equal. For example:
TRUE
- the timestamps are the same
with respect to UTC
even though their values without the offset are different.
The above query will output a row with False
- the timestamps are not the same
with respect to UTC
even though their values without the offset are equal.
This means that grouping by a TIMESTAMP_TZ
value will follow the same equality
rules above, and we make no guarantees about what the offset of the key for a
group will be - only guarantee is that the key's UTC
timestamp is equal to all
values for that group. For example, consider the following table:
A | B |
---|---|
2023-01-01 00:00:00 +00:00 | 1 |
2023-01-01 01:00:00 +01:00 | 1 |
2023-01-01 00:00:00 +01:00 | 1 |
2023-01-01 01:00:00 +00:00 | 1 |
2023-01-02 00:00:00 +00:00 | 1 |
2023-01-02 01:00:00 +01:00 | 1 |
Where A
is a TIMESTAMP_TZ
and B
is a NUMBER
. Note that rows 0
and 1
have equal values for A
. Similarly rows 2
and 3
are equal in terms of A
,
and same for rows 4
and 5
. Then, both of the following are valid results for
SELECT A, sum(B) FROM table GROUP BY A
:
A | B |
---|---|
2023-01-01 00:00:00 +00:00 | 2 |
2023-01-01 00:00:00 +01:00 | 2 |
2023-01-02 00:00:00 +00:00 | 2 |
A | B |
---|---|
2023-01-01 01:00:00 +01:00 | 2 |
2023-01-01 00:00:00 +01:00 | 2 |
2023-01-02 01:00:00 +01:00 | 2 |
Note that these aren't the only two possibilities - for the query above there
are 8
possible results.
If you need to compare values by their local timestamp instead of their UTC
timestamp, consider casting to timestampntz
. For the same input table above,
here's what the result of SELECT A::timestampntz FROM table
would look like:
A::timestampntz |
---|
2023-01-01 00:00:00 |
2023-01-01 01:00:00 |
2023-01-01 00:00:00 |
2023-01-01 01:00:00 |
2023-01-02 00:00:00 |
2023-01-02 01:00:00 |
Note that this model of equality also holds during JOIN
s:
Table 1: | A | B | |----------------------------|---| | 2023-01-01 00:00:00 +00:00 | 1 | | 2024-02-02 00:00:00 +00:00 | 2 |
Table 2: | A | |----------------------------| | 2023-01-01 00:00:00 +01:00 | | 2023-01-01 00:00:00 +02:00 | | 2023-01-01 00:00:00 +03:00 | | 2024-02-02 00:00:00 +01:00 | | 2024-02-02 00:00:00 +02:00 | | 2024-02-02 00:00:00 +03:00 |
The result of SELECT TABLE1.A, TABLE2.A, B FROM TABLE1 JOIN TABLE2 ON TABLE1.A=TABLE2.A
would be:
TABLE1.A | TABLE2.A | B |
---|---|---|
2023-01-01 00:00:00 +00:00 | 2023-01-01 00:00:00 +01:00 | 1 |
2023-01-01 00:00:00 +00:00 | 2023-01-01 00:00:00 +02:00 | 1 |
2023-01-01 00:00:00 +00:00 | 2023-01-01 00:00:00 +03:00 | 1 |
2024-02-02 00:00:00 +00:00 | 2024-02-02 00:00:00 +01:00 | 2 |
2024-02-02 00:00:00 +00:00 | 2024-02-02 00:00:00 +02:00 | 2 |
2024-02-02 00:00:00 +00:00 | 2024-02-02 00:00:00 +03:00 | 2 |
Aside from comparison most other operations will treat TIMESTAMP_TZ
as it's
local timestamp, for example SELECT EXTRACT(HOUR from '2024-01-02 03:04:05 +06:07'::timestamptz)
should return 3
(even though the UTC
timestamp would have an hour of 21
).
TIMESTAMP_TZ interaction with Snowflake¶
Note that reading TIMESTAMP_TZ
values to or from Snowflake may change the
session parameter TIMESTAMP_TZ_OUTPUT_FORMAT
. If your query relies on custom
values for TIMESTAMP_TZ_OUTPUT_FORMAT
you may experience unexpected behavior.
TIMESTAMP_TZ limitations¶
Currently only the following aggregation functions are supported with
TIMESTAMP_TZ
. Future releases will expand this list.
- min/max
- first/last/any_value
- count
- mode
Additionally, TIMESTAMP_TZ
is not supported in semi-structured data (arrays,
and objects).
Supported Literals¶
BodoSQL supports the following literal types:
array_literal
boolean_literal
datetime_literal
float_literal
integer_literal
interval_literal
object_literal
string_literal
binary_literal
Array Literal¶
Syntax:
where <[>
and <]>
indicate literal [
and ]
s, and expr
is any expression.
Array literals are lists of comma separated expressions wrapped in square brackets.
Note that BodoSQL currently only supports homogenous lists, and all expr
s
must coerce to a single type.
Boolean Literal¶
Syntax:
Boolean literals are case-insensitive.
Datetime Literal¶
Syntax:
Float Literal¶
Syntax:
where digit is any numeral from 0 to 9
Integer Literal¶
Syntax:
where digit is any numeral from 0 to 9
Interval Literal¶
Syntax:
Where integer_literal is a valid integer literal and interval type is one of:
In addition, we also have limited support for YEAR[S]
and MONTH[S]
.
These literals cannot be stored in columns and currently are only
supported for operations involving add and sub.
Object Literal¶
Syntax:
Where each ki
is a unique string literal, and each vi
is an expression.
Obeys the same semantics as the function OBJECT_CONSTRUCT
, so any pair
where the key or value is null is omitted, and for now BodoSQL only supports
when all values are the same type.
String Literal¶
Syntax:
Where char is a character literal in a Python string.
Binary Literal¶
Syntax:
Where hex is a hexadecimal character between 0-F.