Bodo 2020.10 Release (Date: 10/20/2020)
This release includes many new features, bug fixes and performance improvements. Overall, 117 code patches were merged since the last release.
New Features and Improvements
- Initial support for Python classes using the `bodo.jitclass` decorator.
- Scikit-learn:
    - Initial support for these scikit-learn classes:
        - `sklearn.linear_model.SGDClassifier`
        - `sklearn.linear_model.SGDRegressor`
        - `sklearn.cluster.KMeans`

      For more information, please refer to the documentation [here](https://docs.bodo.ai/latest/source/sklearn.html).
    - Improved scaling of `RandomForestClassifier` training
- Memory management and memory consumption improvements
- Improvements for user-defined functions (UDFs):
    - Compilation errors are now clearly shown for UDFs
    - Support more complex UDFs (by running a full compiler pipeline)
    - Support passing keyword arguments to UDFs in `DataFrame.apply()` and `Series.apply()`
    - Support a much wider range of UDF types in `groupby.agg`
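As an illustration of the UDF items above, the following minimal sketch shows keyword arguments forwarded through `Series.apply()` and `DataFrame.apply()`, and a custom aggregation passed to `groupby.agg`. It is written in plain pandas so that it runs anywhere; under Bodo, `run_udfs` would presumably be decorated with `bodo.jit`. The names `scale_and_shift`, `weighted_sum`, `value_range`, and `run_udfs` are hypothetical, and the release notes do not confirm that this exact code compiles.

```python
import pandas as pd


def scale_and_shift(x, factor=1.0, offset=0.0):
    # Hypothetical element-wise UDF; apply() forwards the keyword arguments.
    return x * factor + offset


def weighted_sum(row, weight=1.0):
    # Hypothetical row-wise UDF used with DataFrame.apply(axis=1).
    return (row["A"] + row["B"]) * weight


def value_range(group):
    # Hypothetical aggregation UDF; receives one group's values as a Series.
    return group.max() - group.min()


def run_udfs(df):
    # Assumption: under Bodo this function would carry the @bodo.jit decorator.
    scaled = df["B"].apply(scale_and_shift, factor=2.0, offset=1.0)
    row_sums = df.apply(weighted_sum, axis=1, weight=0.5)
    ranges = df.groupby("A")["B"].agg(value_range)
    return scaled, row_sums, ranges


df = pd.DataFrame({"A": [1, 1, 2, 2], "B": [1.0, 4.0, 2.0, 8.0]})
scaled, row_sums, ranges = run_udfs(df)
```
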
- Connectors:
    - Improved connector error handling
    - Improved performance of `pd.read_csv` (further improvements in next release)
    - `pd.read_parquet` supports columns containing all NA (null) values
- Caching: for Bodo functions that receive parquet file names as string arguments, the cache will now be reused when file name arguments differ but have the same parquet dataset type (schema).
- Significantly improved the performance of merge/join operations in some cases
- Support for loops over dataframe columns by automatic loop unrolling
- Support using global dataframe/array values inside jit functions
- Performance optimization for the `series.str.split().explode()` pattern
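For readers unfamiliar with the `series.str.split().explode()` pattern mentioned above, here is a minimal sketch in plain pandas. It splits each string on whitespace and then flattens the resulting lists into one row per token. The function name `split_words` is hypothetical; under Bodo it would presumably be decorated with `bodo.jit` to benefit from the optimization.

```python
import pandas as pd


def split_words(lines):
    # Assumption: under Bodo this function would carry the @bodo.jit decorator.
    # split() yields one list of tokens per row; explode() emits one row per token,
    # repeating the original index label for every token from the same row.
    return lines.str.split().explode()


lines = pd.Series(["fast data", "parallel python code"])
words = split_words(lines)
```
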
- Pandas coverage:
    - Support setting `df.columns` and `df.index`
    - Support setting values in Categorical arrays
    - `series.str.split`: added support for regular expression and `n` parameter
    - `Series.replace` support for more array types
    - Support `pd.Series.dt.quarter`
    - Support `series.str.slice_replace`
    - Support `series.str.repeat`
    - Improved support for `df.pivot_table` and `pd.crosstab`
    - Support for `Series.notnull`
    - Support integer label indexing for DataFrames and Series with RangeIndex
    - Support setting `None` and `Optional` values for most arrays
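A few of the Pandas features listed above can be sketched as follows. The example uses plain pandas so it runs without Bodo; the function name `tidy` and the column names are hypothetical, and whether this exact combination compiles under `bodo.jit` is an assumption rather than something the release notes state.

```python
import pandas as pd


def tidy(df):
    # Assumption: under Bodo this function would carry the @bodo.jit decorator.
    df.columns = ["code", "when"]  # setting df.columns
    # Replace the characters at positions [1, 3) with "**".
    masked = df["code"].str.slice_replace(start=1, stop=3, repl="**")
    doubled = df["code"].str.repeat(2)  # repeat each string twice
    quarter = df["when"].dt.quarter  # calendar quarter, 1 through 4
    present = df["code"].notnull()  # True where a value is not missing
    return masked, doubled, quarter, present


df = pd.DataFrame(
    {
        "c0": ["abcd", "wxyz"],
        "c1": pd.to_datetime(["2020-02-15", "2020-10-20"]),
    }
)
masked, doubled, quarter, present = tidy(df)
```
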
- NumPy coverage:
    - Support for `np.union1d`
    - `np.where`, `np.unique`, `np.sort`, `np.repeat`: support for Series and most array types
    - Support `np.argmax` with `axis=1`
    - Support for `np.min`, `np.max`, `min`, `max`, `np.sum`, `sum`, `np.prod` on nullable arrays
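The NumPy items above correspond to calls like the ones in this minimal sketch, shown with plain NumPy and pandas. The function name `numpy_ops` and the sample data are hypothetical; under Bodo the function would presumably be decorated with `bodo.jit`, which is an assumption here.

```python
import numpy as np
import pandas as pd


def numpy_ops(a, b, matrix, nullable):
    # Assumption: under Bodo this function would carry the @bodo.jit decorator.
    merged = np.union1d(a, b)  # sorted, de-duplicated union of both arrays
    best = np.argmax(matrix, axis=1)  # column index of the maximum in each row
    total = np.sum(nullable)  # missing values are skipped for nullable data
    return merged, best, total


a = np.array([3, 1, 2])
b = np.array([2, 4])
matrix = np.array([[1, 9, 3], [7, 2, 5]])
nullable = pd.Series([1, None, 3], dtype="Int64")  # nullable integer data
merged, best, total = numpy_ops(a, b, matrix, nullable)
```
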