Bodo 2021.3 Release (Date: 3/25/2021)
This release includes many new features, bug fixes and usability improvements. Overall, 148 code patches were merged since the last release.
New Features and Improvements
- Bodo is updated to use Numba 0.53 (latest) and support Python 3.9
- Many improvements to error checking and reporting
- Compilation time is reduced, especially for user-defined functions (UDFs)
- Reduced initialization time when importing Bodo
- Distributed diagnostics improvements:
  - Show distributed diagnostics when raising errors for the distributed flag
  - Only show user-defined variables in diagnostics level one
- Performance optimizations:
  - Faster groupby `nunique` with improved scaling
  - Faster `setitem` for categorical arrays
- Connectors:
  - Google Cloud Storage (GCS) support with Parquet
  - Support reading Delta Lake tables
  - Improved Snowflake support
  - Removed s3fs dependency (Bodo now fully relies on Apache Arrow for S3 connectivity)
- Change default parallelism semantics of `unique()` to replicated output to match user expectations better
- Support `objmode` in groupby apply UDFs
- Pandas coverage:
  - Support `pd.DataFrame.duplicated()` with categorical data
  - Groupby support for min/max on categorical data
  - Support for categorical in `pd.Series.dropna`
  - Support nullable int array in `pd.Categorical` constructor
  - Support for `pd.Series.where` and `pd.Series.mask` with categorical data and a scalar value
  - Support for `pd.Series.diff()`
  - Support for `pd.DataFrame.diff()`
  - Support for `pd.Series.repeat()`
  - Support list of functions in `groupby.agg()`
  - Support tuple of UDFs inside `groupby.agg()` dictionary case
  - Support single row and scalar UDF output in `groupby.apply()`
  - Support Categorical values in `Groupby.shift`
  - Support `case=False` in `Series.str.contains`
  - Support `mapper` with `axis=1` for `pd.DataFrame.rename`
  - Support `Timedelta64` data in `pd.Groupby`
  - Support for `datetime.date` arrays in `Series.max` and `Series.min`
  - Support for `pd.timedelta_range`
  - Support equality between `datetime64`/`pd.Timestamp` and `timedelta64`/`pd.Timedelta`
  - Support for iterating across most index types
  - Support getting the `name` attribute of data inside `df.apply`
  - Support `Series.reset_index(drop=False)` for common cases
  - Support `==` and `!=` on DataFrame and a scalar with a different type
  - Sequential support for `pd.Series.idxmax`, `pd.Series.idxmin`, `pd.DataFrame.idxmax`, and `pd.DataFrame.idxmin` with Nullable and Categorical arrays
- Python coverage:
  - Support `datetime.date.replace()`
  - Improved support for `datetime.date.strftime()`
  - Support for `calendar.month_abbr`
- SciPy:
  - Initial support for `scipy.sparse.csr_matrix`
- Scikit-learn:
  - Support for `sklearn.feature_extraction.text.HashingVectorizer`
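As a rough illustration of some of the calls named in the Pandas coverage and Python coverage lists above, the sketch below exercises a handful of them in plain pandas and standard-library Python. It is not taken from the Bodo documentation: the variable names and sample data are made up for the example, and the `@bodo.jit` decorator that Bodo code would normally sit under is deliberately left out so the snippet runs with pandas alone.

```python
import calendar
import datetime

import pandas as pd

# --- Python coverage items (standard library only) ---
release_day = datetime.date(2021, 3, 25)
month_start = release_day.replace(day=1)             # same year and month, day set to 1
iso_label = release_day.strftime("%Y-%m-%d")         # year-month-day string
month_name = calendar.month_abbr[release_day.month]  # abbreviation depends on the locale

# --- Pandas coverage items (hypothetical sample data) ---
values = pd.Series([1, 4, 9, 16])
steps = values.diff()                       # first entry is NaN, then successive differences
repeated = pd.Series(["a", "b"]).repeat(2)  # every element repeated twice

sales = pd.DataFrame({"store": ["x", "x", "y"], "amount": [1, 2, 3]})
# A list of functions passed to groupby.agg() yields one column per function.
summary = sales.groupby("store")["amount"].agg(["sum", "max"])

names = pd.Series(["Bodo", "numba", "PANDAS"])
has_bodo = names.str.contains("bodo", case=False)  # case-insensitive match

# mapper with axis=1 renames the columns rather than the index.
upper_cols = sales.rename(mapper=str.upper, axis=1)

# Three timedeltas starting at one day, using the default daily frequency.
deltas = pd.timedelta_range(start="1 day", periods=3)

print(month_start, iso_label, month_name)
print(summary)
```

Whether each call compiles under a given Bodo version, and with which argument combinations, should be checked against the Bodo documentation for that release.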