Typing Considerations
This section discusses the supported Pandas data types, potential typing-related issues, and ways to resolve them.
Supported Pandas Data Types
Bodo supports the following data types as values in Pandas DataFrame and Series data structures. This covers all Pandas data types except TZ-aware datetime, Period, Interval, and Sparse (which will be supported in the future). Compared to Spark, equivalents of all Spark data types are supported.
- Numpy booleans: np.bool_.
- Numpy integer data types: np.int8, np.int16, np.int32, np.int64, np.uint8, np.uint16, np.uint32, np.uint64.
- Numpy floating point data types: np.float32, np.float64.
- Numpy datetime data types: np.dtype("datetime64[ns]") and np.dtype("timedelta64[ns]"). The resolution currently has to be ns, which covers most practical use cases.
- Numpy complex data types: np.complex64 and np.complex128.
- Strings (including nulls).
- datetime.date values (including nulls).
- datetime.timedelta values (including nulls).
- Pandas nullable integers.
- Pandas nullable booleans.
- Pandas Categoricals.
- Lists of other data types.
- Tuples of other data types.
- Structs of other data types.
- Maps of other data types (each map is a set of key-value pairs). All keys should have the same type to ensure type stability. All values should have the same type as well.
- decimal.Decimal values (including nulls). Decimal values are stored in the fixed-precision Apache Arrow Decimal128 format, which is also similar to PySpark decimals. The decimal type has a precision (the maximum total number of digits) and a scale (the number of digits to the right of the decimal point) attribute, specifying how the stored data is interpreted. For example, the (5, 2) case can store values from -999.99 to 999.99. The precision can be up to 38, and the scale must be less than or equal to the precision. Arbitrary-precision Python decimal.Decimal values are converted with a precision of 38 and a scale of 18.
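The relationship between precision, scale, and the representable range can be checked with a small standard-library sketch. The helper name decimal_bounds is hypothetical and is not part of Bodo or Arrow; it only assumes the definitions given above (precision counts all digits, scale counts the digits after the decimal point).

```python
from decimal import Decimal
from typing import Tuple


def decimal_bounds(precision: int, scale: int) -> Tuple[Decimal, Decimal]:
    """Return the (smallest, largest) values for a given precision and scale."""
    if not 1 <= precision <= 38:
        raise ValueError("precision must be between 1 and 38")
    if not 0 <= scale <= precision:
        raise ValueError("scale must be between 0 and precision")
    # The largest magnitude is `precision` nines, with the decimal point
    # shifted `scale` places to the left. Building the values from
    # (sign, digits, exponent) tuples keeps them exact for up to 38 digits.
    digits = (9,) * precision
    return Decimal((1, digits, -scale)), Decimal((0, digits, -scale))


low, high = decimal_bounds(5, 2)
print(low, high)  # -999.99 999.99
```

With these definitions, a value such as 1234.5 does not fit in (5, 2) because it would need six digits once padded to two decimal places.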
In addition, it may be desirable to specify type annotations in some cases (e.g., file I/O array input types). Typically these types are array types, and they can all be accessed directly from the bodo module. The following table can be used to select the necessary Bodo type based on the desired Python, Numpy, or Pandas type.
Bodo Type Name | Equivalent Python, Numpy, or Pandas type |
---|---|
bodo.bool_[:], bodo.int8[:], ..., bodo.int64[:], bodo.uint8[:], ..., bodo.uint64[:], bodo.float32[:], bodo.float64[:] | One-dimensional Numpy array of the given type. A full list of supported Numpy types can be found here. A multidimensional array can be specified by adding additional colons (e.g., bodo.int32[:, :, :] for a three-dimensional array). |
bodo.string_array_type | Array of nullable strings |
bodo.IntegerArrayType(integer_type) | Array of Pandas nullable integers of the given integer type. e.g., bodo.IntegerArrayType(bodo.int64) |
bodo.boolean_array_type | Array of Pandas nullable booleans |
bodo.datetime64ns[:] | Array of Numpy datetime64 values |
bodo.timedelta64ns[:] | Array of Numpy timedelta64 values |
bodo.datetime_date_array_type | Array of datetime.date types |
bodo.datetime_timedelta_array_type | Array of datetime.timedelta types |
bodo.DecimalArrayType(precision, scale) | Array of Apache Arrow Decimal128 values with the given precision and scale. e.g., bodo.DecimalArrayType(38, 18) |
bodo.binary_array_type | Array of nullable bytes values |
bodo.StructArrayType(data_types, field_names) | Array of a user-defined struct with the given tuple of data types and field names. e.g., bodo.StructArrayType((bodo.int32[:], bodo.datetime64ns[:]), ("a", "b")) |
bodo.TupleArrayType(data_types) | Array of a user-defined tuple with the given tuple of data types. e.g., bodo.TupleArrayType((bodo.int32[:], bodo.datetime64ns[:])) |
bodo.MapArrayType(key_arr_type, value_arr_type) | Array of Python dictionaries with the given key and value array types. e.g., bodo.MapArrayType(bodo.uint16[:], bodo.string_array_type) |
bodo.PDCategoricalDtype(cat_tuple, cat_elem_type, is_ordered_cat) | Pandas categorical type with the possible categories, each category's type, and whether the categories are ordered. e.g., bodo.PDCategoricalDtype(("A", "B", "AA"), bodo.string_type, True) |
bodo.CategoricalArrayType(categorical_type) | Array of Pandas categorical values. e.g., bodo.CategoricalArrayType(bodo.PDCategoricalDtype(("A", "B", "AA"), bodo.string_type, True)) |
bodo.DatetimeIndexType(name_type) | Index of datetime64 values with a given name type. e.g., bodo.DatetimeIndexType(bodo.string_type) |
bodo.NumericIndexType(data_type, name_type) | Index of pd.Int64, pd.Uint64, or Float64 objects, based on the given data_type and name type. e.g., bodo.NumericIndexType(bodo.float64, bodo.string_type) |
bodo.PeriodIndexType(freq, name_type) | pd.PeriodIndex with a given frequency and name type. e.g., bodo.PeriodIndexType('A', bodo.string_type) |
bodo.RangeIndexType(name_type) | RangeIndex with a given name type. e.g., bodo.RangeIndexType(bodo.string_type) |
bodo.StringIndexType(name_type) | Index of strings with a given name type. e.g., bodo.StringIndexType(bodo.string_type) |
bodo.BinaryIndexType(name_type) | Index of binary values with a given name type. e.g., bodo.BinaryIndexType(bodo.string_type) |
bodo.TimedeltaIndexType(name_type) | Index of timedelta64 values with a given name type. e.g., bodo.TimedeltaIndexType(bodo.string_type) |
bodo.SeriesType(dtype=data_type, index=index_type, name_typ=name_type) | Series with a given data type, index type, and name type. e.g., bodo.SeriesType(bodo.float32, bodo.DatetimeIndexType(bodo.string_type), bodo.string_type) |
bodo.DataFrameType(data_types_tuple, index_type, column_names) | DataFrame with a tuple of data types, an index type, and the names of the columns. e.g., bodo.DataFrameType((bodo.int64[::1], bodo.float64[::1]), bodo.RangeIndexType(bodo.none), ("A", "B")) |
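As a rough illustration of how the table can be read, the sketch below maps a few Pandas/Numpy dtype names to the corresponding Bodo type expression as plain text. The function bodo_type_name is a hypothetical helper written for this example, not a Bodo API, and it deliberately covers only the simple array rows of the table.

```python
import re

# Dtype names whose Bodo expression does not follow a regular pattern.
FIXED_NAMES = {
    "bool": "bodo.bool_[:]",
    "boolean": "bodo.boolean_array_type",  # Pandas nullable boolean
    "datetime64[ns]": "bodo.datetime64ns[:]",
    "timedelta64[ns]": "bodo.timedelta64ns[:]",
}


def bodo_type_name(dtype_name: str) -> str:
    """Return the Bodo type expression (as text) for a dtype name."""
    if dtype_name in FIXED_NAMES:
        return FIXED_NAMES[dtype_name]
    # Numpy integers and floats map to one-dimensional arrays.
    if re.fullmatch(r"u?int(8|16|32|64)|float(32|64)", dtype_name):
        return f"bodo.{dtype_name}[:]"
    # Pandas nullable integers are spelled with a capital "I" (e.g. "Int64").
    match = re.fullmatch(r"(U?)Int(8|16|32|64)", dtype_name)
    if match:
        prefix = "uint" if match.group(1) else "int"
        return f"bodo.IntegerArrayType(bodo.{prefix}{match.group(2)})"
    raise KeyError(f"no simple mapping for dtype {dtype_name!r}")


print(bodo_type_name("float64"))  # bodo.float64[:]
print(bodo_type_name("Int64"))    # bodo.IntegerArrayType(bodo.int64)
```

Nested types such as structs, maps, and categoricals take other Bodo types as arguments, so they are better written out by hand from the table.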
Compile Time Constants
Unlike regular Python, which is dynamically typed, Bodo needs to be able to type all functions at compile time. While in most cases, the output types depend solely on the input types, some APIs require knowing exact values in order to produce accurate types.
As an example, consider the iloc DataFrame API. This API can be used to select a subset of rows and columns by passing integers or slices of integers. A Bodo JIT version of a function calling this API might look like:
import numpy as np
import pandas as pd
import bodo

@bodo.jit
def df_iloc(df, rows, columns):
    return df.iloc[rows, columns]

df = pd.DataFrame({'A': np.arange(100), 'B': ["A", "B", "C", "D"] * 25})
print(df_iloc(df, slice(1, 4), 0))
If we try to run this file, we will get an error message:
$ python iloc_example.py
Traceback (most recent call last):
  File "iloc_example.py", line 10, in <module>
    df_iloc(df, slice(1, 4), 0)
  File "/my_path/bodo/numba_compat.py", line 1195, in _compile_for_args
    raise error
bodo.utils.typing.BodoError: idx2 in df.iloc[idx1, idx2] should be a constant integer or constant list of integers

File "iloc_example.py", line 7:
def df_iloc(df, rows, columns):
    return df.iloc[rows, columns]
The relevant part of the error message is "idx2 in df.iloc[idx1, idx2] should be a constant integer or constant list of integers".
This error is thrown because, depending on the value of columns, Bodo selects different columns with different types. When columns=0, Bodo needs to compile code for numeric values, but when columns=1, Bodo needs to compile code for strings, so it cannot properly type this function.
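The dependence of the result type on the column position can be seen in plain Pandas, without Bodo. The following is a minimal sketch using the same example DataFrame (with the integer dtype fixed to int64 so the output is the same on every platform):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": np.arange(100, dtype=np.int64),
    "B": ["A", "B", "C", "D"] * 25,
})

# Plain Pandas resolves the column at run time, so the dtype of the result
# is only known once the value of the column index is known.
col0 = df.iloc[1:4, 0]
col1 = df.iloc[1:4, 1]

print(col0.dtype)  # int64
print(col1.dtype)  # a string/object dtype, depending on the Pandas version
print(col0.dtype == col1.dtype)  # False
```

A compiler that must fix the output type before running the function cannot choose between these two results unless the column index is a constant.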
To resolve this issue, replace columns with a literal integer. If the Bodo function is instead written as:
import numpy as np
import pandas as pd
import bodo

@bodo.jit
def df_iloc(df, rows):
    return df.iloc[rows, 0]

df = pd.DataFrame({'A': np.arange(100), 'B': ["A", "B", "C", "D"] * 25})
print(df_iloc(df, slice(1, 4)))
Bodo can now see that the output is a single int64 column (a Series, since one column is selected by position), and it is able to compile the code.
Whenever a value needs to be known for typing purposes, Bodo will throw an error indicating that some argument requires a constant value. All of these errors can be resolved by making the value a literal. Alternatively, some APIs support other ways of specifying the output types, which will be indicated in the error message.
Integer NA issue in Pandas
DataFrame and Series objects with integer data need special care due to integer NA issues in Pandas. By default, Pandas dynamically converts integer columns to floating point when missing values (NAs) are needed (which can result in loss of precision). This is because Pandas uses the NaN floating point value as NA, and Numpy does not support NaN values for integers. Bodo does not perform this conversion unless enough information is available at compilation time.
Pandas introduced a new nullable integer data type that can solve this issue, which is also supported by Bodo. For example, this code reads column A into a nullable integer array (the capital "I" denotes the nullable integer type):
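import pandas as pd
import bodo

@bodo.jit
def example(fname):
    # 'Int64' (capital "I") requests the Pandas nullable integer type for column A
    dtype = {'A': 'Int64', 'B': 'float64'}
    df = pd.read_csv(fname,
        names=dtype.keys(),
        dtype=dtype,
    )
    ...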
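The precision loss described above, and how the nullable integer type avoids it, can be reproduced in plain Pandas. This is a small sketch that does not involve Bodo; it introduces a missing value by reindexing a one-element Series:

```python
import pandas as pd

big = 2**53 + 1  # an integer that float64 cannot represent exactly

# Numpy-backed int64 data: introducing a missing value forces float64.
as_float = pd.Series([big], dtype="int64").reindex([0, 1])

# Nullable Int64 data: the missing slot becomes <NA> and the integer survives.
as_nullable = pd.Series([big], dtype="Int64").reindex([0, 1])

print(as_float.dtype, int(as_float.iloc[0]) == big)        # float64 False
print(as_nullable.dtype, int(as_nullable.iloc[0]) == big)  # Int64 True
```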
Type Inference for Object Data
Pandas stores some data types (e.g. strings) as object arrays which are untyped. Therefore, Bodo needs to infer the actual data type of object arrays when dataframes or series values are passed to JIT functions from regular Python. Bodo uses the first non-null value of the array to determine the type, and throws a warning if the array is empty or all nulls:
BodoWarning: Empty object array passed to Bodo, which causes ambiguity in typing. This can cause errors in parallel execution.
In this case, Bodo assumes the array is a string array, which is the most common case. However, this can cause errors if a distributed dataset is passed to Bodo and some other processor has non-string data. This corner case can usually be avoided by load balancing the data across processors so that no processor receives an empty array.
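The first-non-null rule can be sketched in a few lines of plain Python. This is only an illustration of the idea, not Bodo's actual implementation; the names is_null and infer_object_type are hypothetical, and treating None and NaN as the only null values is an assumption made for this example.

```python
import math
import warnings


def is_null(value) -> bool:
    """Treat None and floating point NaN as missing values."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def infer_object_type(values) -> type:
    """Return the type of the first non-null value, defaulting to str."""
    for value in values:
        if not is_null(value):
            return type(value)
    # Nothing to inspect: fall back to strings and warn about the ambiguity.
    warnings.warn("Empty or all-null object array; assuming string data.")
    return str


print(infer_object_type([None, "a", "b"]))   # <class 'str'>
print(infer_object_type([None, 3, "b"]))     # <class 'int'>
```

The second call shows why the fallback is risky in a distributed setting: a processor holding only nulls would assume strings, while another processor holding the value 3 would infer integers.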