Bodo 2022.6 Release (Date: 06/30/2022)
New Features and Improvements
- Bodo is upgraded to use Numba 0.55.2 (the latest release)
Dataframe compilation improvements:
- pandas.merge is now much faster to compile and supports super wide dataframes (e.g. 100,000 columns).
- DataFrame.sort_values is now much faster to compile and supports super wide dataframes.
- DataFrame.astype is now much faster to compile and supports super wide dataframes.
- DataFrame.loc, DataFrame.iloc and DataFrame[col_list] are now faster to compile and support super wide dataframes when returning a DataFrame.
- Bodo can now automatically optimize out unused output keys of join and sort operations (e.g. pd.merge, df.sort_values). This should result in significant runtime and memory usage improvements.
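As an illustration of where the unused-key optimization could apply, here is a minimal sketch in plain pandas. The frames, column names, and the function name `total_by_region` are made up for this example. Under Bodo the function would typically be decorated with `bodo.jit`; the decorator is left out so the sketch runs with pandas alone.

```python
import pandas as pd


def total_by_region(orders, customers):
    # Join on the key column, then use only non-key columns afterwards.
    merged = pd.merge(orders, customers, on="customer_id")
    # "customer_id" is never read after the join, so a compiler that tracks
    # column usage could avoid materializing it in the join output.
    return merged.groupby("region")["amount"].sum()


# Hypothetical input data for illustration only.
orders = pd.DataFrame(
    {"customer_id": [1, 2, 1, 3], "amount": [10.0, 20.0, 5.0, 7.5]}
)
customers = pd.DataFrame(
    {"customer_id": [1, 2, 3], "region": ["east", "west", "east"]}
)

result = total_by_region(orders, customers)
```

Here the join key is needed to match rows but is dead afterwards, which is the pattern the release note describes.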
Iceberg connector (alpha):
- Now supports reading from Nessie, Arctic, and Glue catalogs.
- Iceberg connector now uses py4j. This should remove any conflicts with other packages that use jpype.
Parquet I/O:
- Improved performance and robustness when reading Parquet files.
- Several improvements to Dead Column Elimination and Filter Pushdown that enable faster Parquet read in many scenarios.
Pandas coverage:
- Several Series operations are optimized to support dictionary-encoded string arrays, which reduces memory usage and execution time:
    - pd.Series.str.get
    - pd.Series.str.repeat
    - pd.Series.str.slice
    - pd.Series.str.pad
    - pd.Series.str.rjust
    - pd.Series.str.ljust
    - pd.Series.str.zfill
    - pd.Series.str.center
    - pd.Series.str.count
    - pd.Series.str.len
    - pd.Series.str.find
    - pd.Series.str.rfind
    - pd.Series.str.strip
    - pd.Series.str.lstrip
    - pd.Series.str.rstrip
    - pd.Series.str.extract
    - pd.Series.str.extractall
    - pd.Series.str.isalnum
    - pd.Series.str.isalpha
    - pd.Series.str.isdigit
    - pd.Series.str.isspace
    - pd.Series.str.islower
    - pd.Series.str.isupper
    - pd.Series.str.istitle
    - pd.Series.str.isnumeric
    - pd.Series.str.isdecimal
- Support for dictionary-encoded string arrays as the key values to DataFrame.groupby, which reduces memory usage and execution time.
- Bodo now supports Index.is_integer(), Index.is_floating(), Index.is_boolean(), Index.is_numeric(), Index.is_interval(), Index.is_categorical(), Index.is_object(), Index.T, Index.size, Index.ndim, Index.nlevels, Index.is_all_dates, Index.inferred_type, Index.empty, Index.names, Index.shape for all Index types.
- Bodo now supports Index.argmax(), Index.argmin(), Index.argsort(), and Index.nunique() for the following Index types:
    - NumericIndex
    - RangeIndex
    - StringIndex
    - BinaryIndex
    - DatetimeIndex
    - TimedeltaIndex
    - CategoricalIndex
    - PeriodIndex
- Bodo now supports Index.all() and Index.any() for the following index types:
    - NumericIndex
    - RangeIndex
    - StringIndex
    - BinaryIndex
- Bodo now supports Index.isin(), Index.union(), Index.intersection(), Index.difference(), Index.symmetric_difference(), Index.to_list(), and Index.tolist() for the following index types:
    - NumericIndex
    - RangeIndex
    - StringIndex
    - BinaryIndex
    - DatetimeIndex
    - TimedeltaIndex
- Bodo now supports Index.dtype and Index.to_frame() for the following index types:
    - NumericIndex
    - RangeIndex
    - StringIndex
    - BinaryIndex
    - DatetimeIndex
    - TimedeltaIndex
    - CategoricalIndex
    - MultiIndex
- Bodo now supports Index.to_series(), Index.where(), Index.putmask(), and Index.sort_values() for the following index types:
    - NumericIndex
    - RangeIndex
    - StringIndex
    - BinaryIndex
    - DatetimeIndex
    - TimedeltaIndex
    - CategoricalIndex
- Bodo now supports Index.unique() and Index.to_numpy() for the following index types:
    - NumericIndex
    - RangeIndex
    - StringIndex
    - BinaryIndex
    - DatetimeIndex
    - TimedeltaIndex
    - CategoricalIndex
    - IntervalIndex
- Added support for Categorical Index iterator
- Added support for Series.rank() with replicated data
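Several items in this list mention dictionary-encoded string arrays. As a rough illustration of why that layout can cut memory use and run time, the following is a conceptual sketch in plain Python, not Bodo's actual implementation; `dict_encode` and `encoded_str_len` are hypothetical helper names. Each distinct string is stored once and rows hold small integer indices, so an operation such as `str.len` only needs to run once per distinct value.

```python
def dict_encode(values):
    """Encode a list of strings as (dictionary, indices)."""
    dictionary = []   # each distinct string, stored once
    positions = {}    # string -> its slot in the dictionary
    indices = []      # one small integer per row
    for v in values:
        if v not in positions:
            positions[v] = len(dictionary)
            dictionary.append(v)
        indices.append(positions[v])
    return dictionary, indices


def encoded_str_len(dictionary, indices):
    # Compute the length once per distinct string,
    # then look the result up for every row by index.
    lengths = [len(s) for s in dictionary]
    return [lengths[i] for i in indices]


values = ["apple", "kiwi", "apple", "apple", "kiwi"]
dictionary, indices = dict_encode(values)
lens = encoded_str_len(dictionary, indices)
```

With many repeated values, the per-string work scales with the number of distinct strings rather than the number of rows, which is the general idea behind the savings described above.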
Scikit-Learn Coverage:
- Added support for the following functions:
    - sklearn.metrics.log_loss
    - sklearn.metrics.pairwise.cosine_similarity
    - sklearn.model_selection.KFold
    - sklearn.model_selection.LeavePOut
    - sklearn.preprocessing.OneHotEncoder
    - sklearn.preprocessing.MaxAbsScaler
    - sklearn.utils.shuffle
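For readers unfamiliar with some of these functions, here is a small sketch using plain scikit-learn and NumPy. The data is made up for illustration, and the snippet is not run under Bodo; inside a Bodo application the same calls would normally sit in a `bodo.jit` function, which is an assumption about usage rather than something shown here.

```python
import numpy as np
from sklearn.metrics import log_loss
from sklearn.model_selection import KFold
from sklearn.preprocessing import MaxAbsScaler

# Hypothetical feature matrix with six samples and two features.
X = np.array(
    [[1.0, -2.0], [2.0, 4.0], [-4.0, 1.0], [0.5, 0.5], [3.0, -1.0], [-1.0, 2.0]]
)

# Scale each feature by its maximum absolute value, so values fall in [-1, 1].
X_scaled = MaxAbsScaler().fit_transform(X)

# Split the six samples into three consecutive folds (no shuffling by default).
folds = list(KFold(n_splits=3).split(X_scaled))

# Log loss for two samples with predicted class probabilities.
y_true = np.array([0, 1])
y_prob = np.array([[0.9, 0.1], [0.2, 0.8]])
loss = log_loss(y_true, y_prob)
```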
BodoSQL:
- BodoSQL is available on PyPI
- BodoSQL now uses py4j. This should remove any conflicts with other packages that use jpype.
- Significantly reduced compilation time when compiling queries with large numbers of columns for common operations (join, where, order by, limit)
- Optimized first_value and last_value window functions when a single value is repeated for the entire column.
- Reduced compilation time with LPAD and RPAD
- Increased filter pushdown coverage when loading data from Parquet.