Bodo 2021.7 Release (Date: 7/23/2021)
This release includes many new features, optimizations, bug fixes and usability improvements. Overall, 109 code patches were merged since the last release.
New Features and Improvements
- Documentation has been reorganized and updated, with improved navigation and a detailed walkthrough of Pandas equivalents of PySpark functions.
- Improvements to enable BodoSQL features
- Connectors:
  - Improved performance of `pd.read_parquet` when reading from remote storage systems like S3
  - Support reading categorical columns of Parquet
- Performance improvements:
  - Improved performance and scalability of `sort_values`
  - Optimized `pd.Series.isin(values)` performance for a long list of `values`
- UDFs in `Series.apply` and `DataFrame.apply`: when possible, the Bodo compiler automatically transforms the code to pass main function values referenced in the UDF ("free variables") as arguments to `apply()`, which simplifies UDF usage.
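As a rough illustration of the free-variable handling described above, the sketch below uses plain pandas (no `bodo.jit`) to show two equivalent forms side by side: a UDF that reads `offset` from the enclosing function, and the explicit form that passes it through the `args` parameter of `Series.apply`. The function names and values are made up for this example, and the actual rewrite performed by the Bodo compiler is internal and may differ.

```python
import pandas as pd


def add_offset_closure(S, offset):
    # `offset` is a free variable of the lambda: it is read from the
    # enclosing function rather than passed in.
    return S.apply(lambda x: x + offset)


def add_offset_explicit(S, offset):
    # Equivalent form: the value is passed as an extra positional
    # argument through apply()'s `args` parameter.
    return S.apply(lambda x, off: x + off, args=(offset,))


S = pd.Series([1, 2, 3])
closure_result = add_offset_closure(S, 10)
explicit_result = add_offset_explicit(S, 10)
```

Both calls produce the same Series. Per the note above, user code can be written in the first, simpler form.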
- Support passing Bodo data types to objmode directly (in addition to string representations of the data types). For example, the following code sets the return type to `bodo.int64`:

  ```python
  import random

  import bodo

  @bodo.jit
  def f(a, b):
      # `res` is computed in regular Python (object mode) and typed as int64
      with bodo.objmode(res=bodo.int64):
          res = random.randint(a, b)
      return res
  ```
- Compilation time improvements for some dataframe operations
- Distributed support for `pd.RangeIndex` calls
- Pandas coverage:
  - Initial support for binary arrays, including within series/dataframes
  - Support for `groupby.transform`
  - Groupby: support repeated input columns. For example:

    ```python
    df.groupby("A").agg(
        D=pd.NamedAgg(column="B", aggfunc=lambda A: A.sum()),
        F=pd.NamedAgg(column="C", aggfunc="max"),
        E=pd.NamedAgg(column="B", aggfunc="min"),
    )
    ```

  - Support Groupby with `dropna=False`
  - Support for `dropna` in `Series.nunique`, `DataFrame.nunique`, and `groupby.nunique`
  - Support for `DataFrame.insert()`
  - Support `tolist()` for string and numpy arrays
  - Expanded `astype` support:
    - str to timedelta64/datetime64
    - timedelta64/datetime64 to int64
    - date arrays
    - Numeric-like inputs to datetime/timedelta
    - Support for `pd.StringDtype()` in `astype`
    - numeric-like to nullable integer types
  - Support for `pd.Timestamp.now()`
  - Support Timestamp in `pd.to_datetime`
  - Support for Timestamp/Timedelta as the scalar value for a Series
  - Support for `Series.dt.month_name`, `Timestamp.month_name`
  - Support for min/max on timedelta64 series/arrays
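For readers less familiar with some of the Pandas calls listed above, here is a small tour written in plain pandas (no `bodo.jit`), covering `dropna=False` in groupby, `dropna` in `nunique`, `DataFrame.insert()`, `astype` to a nullable integer type, and `month_name`. The data and variable names are invented for illustration. That the same code behaves identically inside a Bodo-jitted function is an assumption based on these notes and is not verified here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "A": ["x", "x", None, "y"],
        "B": [1, 2, 3, 4],
        "C": [10.0, np.nan, 30.0, 40.0],
    }
)

# dropna=False keeps the group whose key is missing.
sums = df.groupby("A", dropna=False)["B"].sum()

# nunique counts NaN as a distinct value only when dropna=False.
n_with_nan = df["C"].nunique(dropna=False)
n_without_nan = df["C"].nunique()

# DataFrame.insert adds a column in place at the given position.
df.insert(1, "B2", df["B"] * 2)

# Numeric-like values converted to a nullable integer type.
ints = pd.Series([1.0, 2.0, 3.0]).astype("Int64")

# month_name on a Timestamp and through the Series.dt accessor.
ts = pd.Timestamp("2021-07-23")
months = pd.Series(pd.to_datetime(["2021-01-15", "2021-07-23"])).dt.month_name()
```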
- Python coverage:
  - Support for `bytes.fromhex()`
  - Support for `bytes.__hash__`
  - Support for `min` and `max` for string values
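The Python features listed above can be sketched in a few lines of standard Python. The function name `python_coverage_demo` and its inputs are hypothetical. The example runs without Bodo, and decorating the function with `@bodo.jit` is assumed, not verified, to give the same results.

```python
def python_coverage_demo():
    # bytes.fromhex() decodes a hex string into a bytes object.
    raw = bytes.fromhex("626f646f")
    # hash() on a bytes value calls bytes.__hash__.
    digest = hash(raw)
    # min/max on strings compare lexicographically.
    smallest = min("pandas", "bodo", "numba")
    largest = max("pandas", "bodo", "numba")
    return raw, digest, smallest, largest


raw, digest, smallest, largest = python_coverage_demo()
```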