7. Supported Pandas Operations¶
Below is a reference list of the Pandas data types and operations that Bodo supports. This list will expand regularly as we add support for more APIs. Optional arguments are not supported unless specified otherwise.
7.1. Data Types¶
Bodo supports the following data types as values in Pandas DataFrame and Series data structures. This covers all Pandas data types except TZ-aware datetime, Period, Interval, and Sparse (which will be supported in the future). Compared to Spark, equivalents of all Spark data types are supported.
Numpy booleans: np.bool_.
Numpy integer data types: np.int8, np.int16, np.int32, np.int64, np.uint8, np.uint16, np.uint32, np.uint64.
Numpy floating point data types: np.float32, np.float64.
Numpy datetime data types: np.dtype("datetime64[ns]") and np.dtype("timedelta64[ns]"). The resolution currently has to be ns, which covers most practical use cases.
Numpy complex data types: np.complex64 and np.complex128.
Strings (including nulls).
datetime.date values (including nulls).
datetime.timedelta values (including nulls).
Pandas nullable integers.
Pandas nullable booleans.
Pandas Categoricals.
Lists of integer, float, and string values.
decimal.Decimal values (including nulls). The decimal values are stored in the fixed-precision Apache Arrow Decimal128 format, which is also similar to PySpark decimals. The decimal type has a precision (the maximum total number of digits) and a scale (the number of digits to the right of the decimal point) attribute, specifying how the stored data is interpreted. For example, the (4, 2) case can store values from -99.99 to 99.99. The precision can be up to 38, and the scale must be less than or equal to the precision. Arbitrary-precision Python decimal.Decimal values are converted with a precision of 38 and a scale of 18.
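The representable range for a given (precision, scale) pair can be checked with a few lines of standard-library Python. This is only a sanity check of the fixed-precision arithmetic described above; the helper name decimal_bounds is made up for this sketch and is not part of Bodo:

```python
from decimal import Decimal

# For a fixed-precision decimal with `precision` total digits and `scale`
# digits after the decimal point, the representable range is
# +/- (10**(precision - scale) - 10**(-scale)).
def decimal_bounds(precision, scale):
    bound = Decimal(10) ** (precision - scale) - Decimal(1).scaleb(-scale)
    return -bound, bound

lo, hi = decimal_bounds(4, 2)
print(lo, hi)  # -99.99 99.99
```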
In addition, it may be desirable to specify type annotations in some cases (e.g. file I/O array input types). Typically these are array types, and they can all be accessed directly from the bodo module. The following table can be used to select the necessary Bodo type based upon the desired Python, Numpy, or Pandas type.
| Bodo Type Name | Equivalent Python, Numpy, or Pandas type |
|---|---|
| bodo.bool_[:], bodo.int8[:], ..., bodo.uint64[:], bodo.float32[:], bodo.float64[:] | One-dimensional Numpy array of the given type. A full list of supported Numpy types can be found here. A multidimensional array can be specified by adding additional colons (e.g. bodo.int32[:, :] for a 2D array). |
| bodo.string_array_type | Array of nullable strings |
| bodo.IntegerArrayType(integer_type) | Array of Pandas nullable integers of the given integer type, e.g. bodo.IntegerArrayType(bodo.int64) |
| bodo.boolean_array | Array of Pandas nullable booleans |
| bodo.datetime64ns[:] | Array of Numpy datetime64 values |
| bodo.timedelta64ns[:] | Array of Numpy timedelta64 values |
| bodo.datetime_date_array_type | Array of datetime.date values |
| bodo.datetime_timedelta_array_type | Array of datetime.timedelta values |
| bodo.DecimalArrayType(precision, scale) | Array of Apache Arrow Decimal128 values with the given precision and scale, e.g. bodo.DecimalArrayType(38, 18) |
| bodo.StructArrayType(data_types, field_names) | Array of a user-defined struct with the given tuple of data types and field names, e.g. bodo.StructArrayType((bodo.int32[:], bodo.datetime64ns[:]), ("a", "b")) |
| bodo.TupleArrayType(data_types) | Array of a user-defined tuple with the given tuple of data types, e.g. bodo.TupleArrayType((bodo.int32[:], bodo.datetime64ns[:])) |
| bodo.MapArrayType(key_arr_type, value_arr_type) | Array of Python dictionaries with the given key and value array types, e.g. bodo.MapArrayType(bodo.uint16[:], bodo.string_array_type) |
| bodo.DatetimeIndexType(name_type) | Index of datetime64 values with a given name type, e.g. bodo.DatetimeIndexType(bodo.string_type) |
| bodo.NumericIndexType(data_type, name_type) | Numeric index (Int64Index, UInt64Index, or Float64Index) based upon the given data type and name type, e.g. bodo.NumericIndexType(bodo.float64, bodo.string_type) |
| bodo.PeriodIndexType(freq, name_type) | pd.PeriodIndex with a given frequency and name type, e.g. bodo.PeriodIndexType('A', bodo.string_type) |
| bodo.RangeIndexType(name_type) | RangeIndex with a given name type, e.g. bodo.RangeIndexType(bodo.string_type) |
| bodo.StringIndexType(name_type) | Index of strings with a given name type, e.g. bodo.StringIndexType(bodo.string_type) |
| bodo.TimedeltaIndexType(name_type) | Index of timedelta64 values with a given name type, e.g. bodo.TimedeltaIndexType(bodo.string_type) |
| bodo.SeriesType(dtype, index_type, name_type) | Series with a given data type, index type, and name type, e.g. bodo.SeriesType(bodo.float32, bodo.DatetimeIndexType(bodo.string_type), bodo.string_type) |
| bodo.DataFrameType(data_types, index_type, column_names) | DataFrame with a tuple of data types, an index type, and the names of the columns, e.g. bodo.DataFrameType((bodo.int64[::1], bodo.float64[::1]), bodo.RangeIndexType(bodo.none), ("A", "B")) |
7.2. Input/Output¶
Also see Amazon S3 and Hadoop Distributed File System (HDFS) and Azure Data Lake Storage (ADLS) Gen2 configuration requirements and more on File I/O.
- pandas.read_csv()

  - filepath_or_buffer should be a string and is required. It can point to a single CSV file, or a directory containing multiple partitioned CSV files (the files inside the directory must have the csv file extension).
  - Arguments sep, delimiter, header, names, index_col, usecols, dtype, skiprows, and parse_dates are supported.
  - Either the names and dtype arguments should be provided to enable type inference, or filepath_or_buffer should be a constant string for Bodo to infer types by looking at the file at compile time.
  - names, usecols, and parse_dates should be constant lists.
  - dtype should be a constant dictionary of strings and types.
  - When a CSV file is read in parallel (distributed mode) and each process reads only a portion of the file, reading columns that contain line breaks is not supported.
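As a concrete illustration of the constant names and dtype pattern, the snippet below writes a small CSV file and reads it back with plain pandas. The same pd.read_csv call works inside a @bodo.jit function; Bodo itself is not required to run this sketch, and the file path is a throwaway temporary file:

```python
import os
import tempfile

import pandas as pd

# Write a small CSV file, then read it back with the constant `names` and
# `dtype` arguments that Bodo's type inference relies on.
path = os.path.join(tempfile.mkdtemp(), "data.csv")
with open(path, "w") as f:
    f.write("1,2.5\n2,3.5\n")

dtype = {"A": "Int64", "B": "float64"}
df = pd.read_csv(path, names=list(dtype.keys()), dtype=dtype)
print(df.dtypes)  # A is nullable Int64, B is float64
```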
- pandas.read_excel()

  - The output dataframe cannot be parallelized automatically yet.
  - Only arguments io, sheet_name, header, names, comment, dtype, skiprows, and parse_dates are supported.
  - io should be a string and is required.
  - Either the names and dtype arguments should be provided to enable type inference, or io should be a constant string for Bodo to infer types by looking at the file at compile time.
  - sheet_name, header, comment, and skiprows should be constant if provided.
  - names and parse_dates should be constant lists if provided.
  - dtype should be a constant dictionary of strings and types if provided.
- pandas.read_sql()

  - Argument sql is supported, but only in string form; a SQLAlchemy Selectable is not supported. There is no restriction on the form of the SQL request.
  - Argument con is supported, but only in string form; a SQLAlchemy connectable is not supported.
  - Argument index_col is supported.
  - Arguments chunksize, columns, coerce_float, and params are not supported.
- pandas.read_parquet()

  - Arguments path and columns are supported. columns should be a constant list of strings.
  - If path is constant, Bodo finds the schema from the file at compilation time. Otherwise, the schema should be provided. For example:

        @bodo.jit(locals={'df': {'A': bodo.float64[:], 'B': bodo.string_array_type}})
        def impl(f):
            df = pd.read_parquet(f)
            return df
- pandas.read_json()

  - Only supports reading the JSON Lines text file format (pd.read_json(filepath_or_buffer, orient='records', lines=True)) and regular multi-line JSON files (pd.read_json(filepath_or_buffer, orient='records', lines=False)).
  - Argument filepath_or_buffer is supported: it can point to a single JSON file, or a directory containing multiple partitioned JSON files. When reading a directory, the JSON files inside it must be in the JSON Lines text file format and have the json file extension.
  - Argument orient='records' is used as the default, instead of Pandas' default 'columns' for dataframes. 'records' is the only supported value for orient.
  - Argument typ is supported. 'frame' is the only supported value for typ.
  - The dtype argument should be provided to enable type inference, or filepath_or_buffer should be a constant string for Bodo to infer types by looking at the file at compile time (not supported for multi-line JSON files).
  - Arguments convert_dates, precise_float, and lines are supported.
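A minimal JSON Lines round trip, shown with plain pandas (the same read call compiles under @bodo.jit; the temporary file path is a throwaway used only for this sketch):

```python
import os
import tempfile

import pandas as pd

# Write a two-record JSON Lines file, then read it with
# orient='records', lines=True -- the format Bodo supports.
path = os.path.join(tempfile.mkdtemp(), "data.json")
with open(path, "w") as f:
    f.write('{"A": 1, "B": "x"}\n{"A": 2, "B": "y"}\n')

df = pd.read_json(path, orient="records", lines=True)
print(df)
```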
pandas.DataFrame.to_sql()

  - Argument con is supported, but only in string form; a SQLAlchemy connectable is not supported.
  - Arguments name, schema, if_exists, index, index_label, dtype, and method are supported.
  - Argument chunksize is not supported.
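The round trip below shows the to_sql semantics with plain pandas and an in-memory SQLite database. Note the hedge: plain pandas accepts a DBAPI connection object as shown here, whereas inside a Bodo function con must instead be a constant connection string (e.g. 'sqlite:///file.db'):

```python
import sqlite3

import pandas as pd

# Write a dataframe to a database table, then read it back.
conn = sqlite3.connect(":memory:")
df = pd.DataFrame({"A": [1, 2, 3], "B": ["x", "y", "z"]})
df.to_sql("my_table", conn, if_exists="replace", index=False)

out = pd.read_sql("SELECT A, B FROM my_table", conn)
print(out)
```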
7.3. General functions¶
Data manipulations:
- pandas.pivot_table()

  Annotation of pivot values is required. For example, @bodo.jit(pivots={'pt': ['small', 'large']}) declares that the output table pt will have columns called small and large.
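The reason the annotation is needed is that the distinct values in the pivot column become the output column names, which Bodo must know at compile time. Plain pandas, shown below without Bodo, determines them at runtime (the column and value names here are made up for illustration):

```python
import pandas as pd

# The distinct values in the `size` column become the columns
# 'small' and 'large' of the pivoted output.
df = pd.DataFrame({
    "item": ["foo", "foo", "bar", "bar"],
    "size": ["small", "large", "small", "large"],
    "qty": [1, 2, 3, 4],
})
pt = df.pivot_table(index="item", columns="size", values="qty", aggfunc="sum")
print(sorted(pt.columns))  # ['large', 'small']
```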
- pandas.merge()

  - Arguments left and right should be dataframes.
  - how, on, left_on, right_on, left_index, and right_index are supported but should be constant values.
  - The output dataframe is not sorted by default for better parallel performance (Pandas may preserve key order depending on how). One can use explicit sort if needed.
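Since the merged output is not sorted by default, an explicit sort_values on the key restores a deterministic order. The sketch below uses plain pandas (the same code works under @bodo.jit; the frames are made up for illustration):

```python
import pandas as pd

# Merge two frames on a key, then sort explicitly for a stable order.
left = pd.DataFrame({"key": [3, 1, 2], "A": ["a3", "a1", "a2"]})
right = pd.DataFrame({"key": [2, 3, 1], "B": ["b2", "b3", "b1"]})
out = left.merge(right, on="key", how="inner").sort_values(by="key")
print(out["key"].tolist())  # [1, 2, 3]
```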
- pandas.merge_asof() (similar arguments to merge)
- pandas.concat() input list or tuple of dataframes or series is supported.
- pandas.get_dummies() input must be a categorical array with categories that are known at compile time (for type stability).
Top-level missing data:
Top-level conversions:
pandas.to_numeric()
Input can be a Series or array. The output type is float64 by default. Unlike Pandas, Bodo does not dynamically determine the output type and does not downcast to the smallest numerical type; the downcast parameter should be used for type annotation of the output. The errors argument is not supported currently (errors are coerced by default).
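For example, passing downcast='float' yields a float32 result in plain pandas, which is the same role the parameter plays as a type annotation under Bodo:

```python
import pandas as pd

# With downcast='float', the output dtype is float32 (the values here
# are exactly representable, so no precision is lost).
s = pd.Series(["1.5", "2.5", "3.5"])
out = pd.to_numeric(s, downcast="float")
print(out.dtype)  # float32
```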
Top-level dealing with datetime and timedelta like:
- pandas.date_range()

  start, end, periods, freq, name and closed arguments are supported. This function is not parallelized yet.
-
  All arguments are supported.
-
  arg_a and unit arguments are supported.
7.4. Series¶
Bodo provides extensive Series support. However, operations between Series (+, -, /, *) do not implicitly align values based on their associated index values yet.
- pandas.Series()

  Arguments data, index, and name are supported. data is required and can be a list, array, Series or Index. If data is a Series and index is provided, implicit alignment is not performed yet.
Attributes:
pandas.Series.dtype (object data types, such as the dtype of a string Series, are not supported yet)
Methods:
Conversion:
- pandas.Series.astype() (only the dtype argument, which can be a Numpy numeric dtype or str)
- pandas.Series.copy() (including the deep argument)
Indexing, iteration:
Location based indexing using [], iat, and iloc is supported. Changing values of existing string Series using these operators is not supported yet.
pandas.Series.loc()
Read support for all indexers except using a callable object. Label-based indexing is not supported yet.
Binary operator functions:
The fill_value optional argument for binary functions below is supported.
pandas.Series.round()
(decimals argument supported)
Function application, GroupBy & Window:
- pandas.Series.apply() (only the func argument)
- pandas.Series.map() (only the arg argument, which should be a function)
- pandas.Series.groupby() (pass an array to the by argument, or level=0 with a regular Index)
- pandas.Series.rolling() (window, min_periods and center arguments supported)
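A short Series.rolling example with the supported arguments, shown with plain pandas (the same call compiles under @bodo.jit):

```python
import pandas as pd

# min_periods=1 lets the first (partial) window produce a value instead
# of NaN; each subsequent value is the mean of a 2-element window.
s = pd.Series([1.0, 2.0, 3.0, 4.0])
out = s.rolling(window=2, min_periods=1).mean()
print(out.tolist())  # [1.0, 1.5, 2.5, 3.5]
```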
Computations / Descriptive Stats:
Statistical functions below are supported without optional arguments unless support is explicitly mentioned.
- pandas.Series.all() only default arguments supported
- pandas.Series.any() only default arguments supported
- pandas.Series.describe() currently returns a string instead of a Series object
- pandas.Series.autocorr() (supports the lag argument)
- pandas.Series.median() (supports the skipna argument)
- pandas.Series.nlargest() (non-numerics not supported yet)
- pandas.Series.nsmallest() (non-numerics not supported yet)
- pandas.Series.pct_change() (supports numeric types; only the periods argument supported)
- pandas.Series.std() (supports the skipna and ddof arguments)
- pandas.Series.var() (supports the skipna and ddof arguments)
- pandas.Series.sem() (supports the skipna and ddof arguments)
- pandas.Series.mad() (supports the skipna argument)
- pandas.Series.kurt() (supports the skipna argument)
- pandas.Series.kurtosis() (supports the skipna argument)
- pandas.Series.skew() (supports the skipna argument)
Reindexing / Selection / Label manipulation:
- pandas.Series.head() (n argument is supported)
- pandas.Series.isin() (the values argument supports both distributed array/Series and replicated list/array/Series)
- pandas.Series.rename() (only setting a new name using a string value)
- pandas.Series.reset_index() only default arguments supported; also requires the Index name to be known at compilation time
- pandas.Series.tail() (n argument is supported)
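For instance, Series.isin with a replicated list of values, shown with plain pandas (under Bodo, values could equally be a distributed array or Series):

```python
import pandas as pd

# Build a boolean mask of membership, then filter the Series with it.
s = pd.Series([1, 2, 3, 4])
mask = s.isin([2, 4])
print(s[mask].tolist())  # [2, 4]
```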
Missing data handling:
Reshaping, sorting:
pandas.Series.append() (the ignore_index argument is supported; setting a name for the output Series is not supported yet)
Time series-related:
pandas.Series.shift()
(supports numeric types and only the periods argument supported)
Datetime properties:
String handling:
- pandas.Series.str.contains() (regex argument supported)
- pandas.Series.str.extract() (input pattern should be a constant string)
- pandas.Series.str.extractall() (input pattern should be a constant string)
- pandas.Series.str.replace() (regex argument supported)
7.5. DataFrame¶
Bodo provides extensive DataFrame support documented below.
- pandas.DataFrame()

  The data argument can be a constant dictionary or a 2D Numpy array. Other arguments are also supported.
Attributes and underlying data:
- pandas.DataFrame.columns (can access but not set new columns yet)
- pandas.DataFrame.index (can access but not set a new index yet)
- pandas.DataFrame.select_dtypes() (only supports constant strings or types as arguments)
- pandas.DataFrame.to_numpy() (only for numeric dataframes)
- pandas.DataFrame.values (only for numeric dataframes)
Conversion:
- pandas.DataFrame.astype() (only accepts a single data type of Numpy dtypes or str)
- pandas.DataFrame.copy() (including the deep flag)
Indexing, iteration:
- pandas.DataFrame.head() (including the n argument)
- pandas.DataFrame.isin() (values can be a dataframe with matching index, a list, or a set)
- pandas.DataFrame.itertuples()
- pandas.DataFrame.iloc read support for all indexers except reading a single row using an integer, slicing across columns, or using a callable object. Label-based indexing is not supported yet.
- pandas.DataFrame.query() (expr can be a constant string or an argument to the jit function)
- pandas.DataFrame.tail() (including the n argument)
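A DataFrame.query example with a constant string expression, as Bodo requires, shown with plain pandas (the expression could also be passed in as a jit function argument):

```python
import pandas as pd

# Filter rows with a constant query expression string.
df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})
out = df.query("A >= 2")
print(out["B"].tolist())  # [20, 30]
```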
Function application, GroupBy & Window:
- pandas.DataFrame.groupby() by should be a constant column label or column labels. sort=False is set by default. The as_index argument is supported, but MultiIndex is not supported yet (the output MultiIndex will just be dropped).
- pandas.DataFrame.rolling() The window argument should be an integer or a time offset as a constant string. min_periods, center and on arguments are also supported.
Computations / Descriptive Stats:
- pandas.DataFrame.corr() (min_periods argument supported)
- pandas.DataFrame.cov() (min_periods argument supported)
- pandas.DataFrame.nunique() (dropna argument not supported yet; the behavior is slightly different from the .nunique implementation in Pandas)
Reindexing / Selection / Label manipulation:
- pandas.DataFrame.drop() (only dropping columns supported, either using the columns argument or setting axis=1)
- pandas.DataFrame.head() (including the n argument)
- pandas.DataFrame.rename() (only the columns argument with a constant dictionary)
- pandas.DataFrame.reset_index() (only drop=True supported)
- pandas.DataFrame.set_index() (keys can only be a column label, i.e. a constant string)
- pandas.DataFrame.tail() (including the n argument)
Missing data handling:
Reshaping, sorting, transposing:
- pandas.DataFrame.pivot_table() Arguments values, index, columns and aggfunc are supported. Annotation of pivot values is required. For example, @bodo.jit(pivots={'pt': ['small', 'large']}) declares that the output pivot table pt will have columns called small and large.
- pandas.DataFrame.sample() is supported except for the arguments random_state, weights and axis.
- pandas.DataFrame.sort_index() (ascending argument is supported)
- pandas.DataFrame.sort_values() by argument should be a constant string or constant list of strings. ascending and na_position arguments are supported.
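A sort_values example exercising the supported ascending and na_position arguments, shown with plain pandas:

```python
import numpy as np
import pandas as pd

# na_position='first' moves missing values to the front of the result.
df = pd.DataFrame({"A": [2.0, np.nan, 1.0]})
out = df.sort_values(by="A", ascending=True, na_position="first")
print(out["A"].tolist())  # [nan, 1.0, 2.0]
```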
Combining / joining / merging:
- pandas.DataFrame.append() appending a dataframe or list of dataframes is supported. ignore_index=True is necessary and set by default.
- pandas.DataFrame.assign() function arguments not supported yet.
- pandas.DataFrame.join() only dataframes. The output dataframe is not sorted by default for better parallel performance (Pandas may preserve key order depending on how). One can use explicit sort if needed.
- pandas.DataFrame.merge() only dataframes. how, on, left_on, right_on, left_index, and right_index are supported but should be constant values.
Time series-related:
pandas.DataFrame.shift()
(supports numeric types and only the periods argument supported)
Serialization / IO / conversion:
Also see Amazon S3 and Hadoop Distributed File System (HDFS) and Azure Data Lake Storage (ADLS) Gen2 configuration requirements and more on File I/O.
7.6. Index objects¶
7.6.1. Index¶
Properties
pandas.Index.values
Returns the underlying data array
Modifying and computations (NaT output for empty or all NaT input not supported yet):
Missing values:
Conversion:
7.6.2. Numeric Index¶
Numeric index objects RangeIndex, Int64Index, UInt64Index and Float64Index are supported as the index of dataframes and series. Constructing them in Bodo functions, passing them to Bodo functions (unboxing), and returning them from Bodo functions (boxing) are also supported.
- pandas.RangeIndex() start, stop and step arguments are supported.
- pandas.Int64Index(), pandas.UInt64Index() and pandas.Float64Index() data, copy and name arguments are supported. data can be a list or array.
7.6.3. DatetimeIndex¶
DatetimeIndex
objects are supported. They can be constructed,
boxed/unboxed, and set as index to dataframes and series.
- pandas.DatetimeIndex() Only the data argument is supported, and it can be array-like of datetime64['ns'], int64 or strings.
Date fields of DatetimeIndex are supported:
Subtraction of Timestamp from DatetimeIndex and vice versa is supported.
Comparison operators ==, !=, >=, >, <=, < between a DatetimeIndex and a string of datetime are supported.
7.6.4. TimedeltaIndex¶
TimedeltaIndex
objects are supported. They can be constructed,
boxed/unboxed, and set as index to dataframes and series.
- pandas.TimedeltaIndex() Only the data argument is supported, and it can be array-like of timedelta64['ns'] or int64.
Time fields of TimedeltaIndex are supported:
7.6.5. PeriodIndex¶
PeriodIndex
objects can be
boxed/unboxed and set as index to dataframes and series.
Operations on them will be supported in upcoming releases.
7.7. Timestamp¶
Timestamp functionality is documented in pandas.Timestamp.
pandas.Timestamp.components
7.8. Timedelta¶
Timedelta functionality is documented in pandas.Timedelta.
-
The unit argument is not supported and all Timedeltas are represented in nanosecond precision.
Datetime related fields are supported:
7.10. GroupBy¶
The operations are documented on pandas.DataFrame.groupby.
- pandas.core.groupby.GroupBy.apply() (func should return a DataFrame or Series)
- pandas.core.groupby.GroupBy.agg() func should be a function or a constant dictionary of input/function mappings. Passing a list of functions is also supported if only one output column is selected. Alternatively, outputs can be specified using keyword arguments and pd.NamedAgg().
- pandas.core.groupby.DataFrameGroupBy.aggregate() same as agg
- pandas.core.groupby.GroupBy.rolling()
- pandas.core.groupby.GroupBy.shift()
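The keyword-argument form with pd.NamedAgg looks like the following, shown with plain pandas (note that Bodo sets sort=False by default for groupby, while plain pandas sorts the group keys):

```python
import pandas as pd

# Name the output column `total` explicitly via pd.NamedAgg.
df = pd.DataFrame({"g": ["a", "a", "b"], "v": [1, 2, 5]})
out = df.groupby("g").agg(total=pd.NamedAgg(column="v", aggfunc="sum"))
print(out)
```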
7.11. Offsets¶
Bodo supports a subset of the offset types from pandas.tseries.offsets. For the supported offsets, the currently supported operations are the constructor, and addition and subtraction with a scalar datetime.date, datetime.datetime or pandas.Timestamp. These can also be mapped across a Series or DataFrame of dates using UDFs. The offsets currently supported are:

- pandas.tseries.offsets.DateOffset()
- pandas.tseries.offsets.MonthEnd()
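Adding a supported offset to a scalar date, shown with plain pandas (the same arithmetic works inside a Bodo function or mapped across a Series via a UDF):

```python
import datetime

import pandas as pd

# MonthEnd rolls a timestamp forward to the last day of its month.
ts = pd.Timestamp("2021-02-10") + pd.tseries.offsets.MonthEnd()
print(ts)  # 2021-02-28 00:00:00

# A datetime.date works too; the result is a pandas Timestamp.
d = datetime.date(2021, 4, 2) + pd.tseries.offsets.MonthEnd()
print(d)  # 2021-04-30 00:00:00
```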
7.12. Integer NA issue in Pandas¶
DataFrame and Series objects with integer data need special care due to integer NA issues in Pandas. By default, Pandas dynamically converts integer columns to floating point when missing values (NAs) are needed (which can result in loss of precision). This is because Pandas uses the NaN floating point value as NA, and Numpy does not support NaN values for integers. Bodo does not perform this conversion unless enough information is available at compilation time.
Pandas introduced a new nullable integer data type that can solve this issue, which is also supported by Bodo. For example, this code reads column A into a nullable integer array (the capital “I” denotes nullable integer type):
@bodo.jit
def example(fname):
dtype = {'A': 'Int64', 'B': 'float64'}
df = pd.read_csv(fname,
names=dtype.keys(),
dtype=dtype,
)
...
7.13. User-Defined Functions (UDFs)¶
User-defined functions (UDFs) can be applied to dataframes with DataFrame.apply()
and to
series with Series.apply()
or Series.map()
. Bodo offers support for UDFs without the
significant runtime penalty generally incurred in Pandas.
It is recommended to pass additional variables to UDFs explicitly, instead of directly using
values in the main function. The latter results in the “captured” variables case, which is
often error-prone and may result in compilation errors. Therefore, arguments should be passed
directly to either Series.apply()
or DataFrame.apply()
.
For example, consider a UDF that appends a variable suffix to each string
in a Series of strings. The proper way to write this function through Series.apply()
is:
@bodo.jit
def add_suffix(S, suffix):
return S.apply(lambda x, suf: x + suf, args=(suffix,))
Alternatively, arguments can be passed as named arguments like:
@bodo.jit
def add_suffix(S, suffix):
return S.apply(lambda x, suf: x + suf, suf=suffix)
The same process can be applied in the DataFrame case using DataFrame.apply().
7.14. Type Inference for Object Data¶
Pandas stores some data types (e.g. strings) as object arrays which are untyped. Therefore, Bodo needs to infer the actual data type of object arrays when dataframes or series values are passed to JIT functions from regular Python. Bodo uses the first non-null value of the array to determine the type, and throws a warning if the array is empty or all nulls:
BodoWarning: Empty object array passed to Bodo, which causes ambiguity in typing. This can cause errors in parallel execution.
In this case, Bodo assumes the array is a string array which is the most common. However, this can cause errors if a distributed dataset is passed to Bodo, and some other processor has non-string data. This corner case can usually be avoided by load balancing the data across processors to avoid empty arrays.