Bodo 2022.6 Release (Date: 06/30/2022)
New Features and Improvements
- Bodo is upgraded to use Numba 0.55.2 (the latest release)
 
Dataframe compilation improvements:
- pandas.merge is now much faster to compile and supports super wide dataframes (e.g. 100,000 columns).
- DataFrame.sort_values is now much faster to compile and supports super wide dataframes.
- DataFrame.astype is now much faster to compile and supports super wide dataframes.
- DataFrame.loc, DataFrame.iloc and DataFrame[col_list] are now faster to compile and support super wide dataframes when returning a DataFrame.
- Bodo can now automatically optimize out unused output keys of join and sort operations (e.g. pd.merge, df.sort_values). This should result in significant runtime and memory usage improvements.
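
To make the last item concrete, here is a minimal sketch of a function whose join key and sort key are needed to compute the result but do not appear in it. The function name, column names, and data are hypothetical and chosen only for illustration. The `try`/`except` fallback is an assumption added so the snippet also runs as plain pandas when Bodo is not installed; it is not part of Bodo.

```python
import pandas as pd

try:
    import bodo  # optional: only needed to actually compile the function
    jit = bodo.jit
except ImportError:
    # Fallback (assumption for this sketch): run as plain pandas without Bodo.
    def jit(func):
        return func


@jit
def top_orders(orders, customers):
    # "cust_id" is needed to perform the join ...
    merged = orders.merge(customers, on="cust_id")
    # ... and "amount" is needed as the sort key ...
    ordered = merged.sort_values("amount", ascending=False)
    # ... but neither key is returned, so a compiler that tracks column
    # usage is free to drop them from the join and sort outputs.
    return ordered[["order_id", "region"]]


orders = pd.DataFrame(
    {"order_id": [1, 2, 3], "cust_id": [10, 20, 10], "amount": [5.0, 9.0, 7.0]}
)
customers = pd.DataFrame({"cust_id": [10, 20], "region": ["east", "west"]})
result = top_orders(orders, customers)
print(result)
```

In plain pandas the merged and sorted intermediates still carry `cust_id` and `amount`; the release note describes Bodo removing such unused key columns from the outputs automatically.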
 
Iceberg connector (alpha):
- Now supports reading from Nessie, Arctic, and Glue catalogs.
- Iceberg connector now uses py4j. This should remove any conflicts with other packages that use jpype.
 
Parquet I/O:
- Improved performance and robustness when reading Parquet files.
- Several improvements to Dead Column Elimination and Filter Pushdown that enable faster Parquet read in many scenarios.
 
Pandas coverage:
- Several Series operations are optimized to support dictionary-encoded string arrays, which reduces memory usage and execution time:
  - pd.Series.str.get
  - pd.Series.str.repeat
  - pd.Series.str.slice
  - pd.Series.str.pad
  - pd.Series.str.rjust
  - pd.Series.str.ljust
  - pd.Series.str.zfill
  - pd.Series.str.center
  - pd.Series.str.count
  - pd.Series.str.len
  - pd.Series.str.find
  - pd.Series.str.rfind
  - pd.Series.str.strip
  - pd.Series.str.lstrip
  - pd.Series.str.rstrip
  - pd.Series.str.extract
  - pd.Series.str.extractall
  - pd.Series.str.isalnum
  - pd.Series.str.isalpha
  - pd.Series.str.isdigit
  - pd.Series.str.isspace
  - pd.Series.str.islower
  - pd.Series.str.isupper
  - pd.Series.str.istitle
  - pd.Series.str.isnumeric
  - pd.Series.str.isdecimal
- Support for dictionary-encoded string arrays as the key values to DataFrame.groupby, which reduces memory usage and execution time.
- Bodo now supports Index.is_integer(), Index.is_floating(), Index.is_boolean(), Index.is_numeric(), Index.is_interval(), Index.is_categorical(), Index.is_object(), Index.T, Index.size, Index.ndim, Index.nlevels, Index.is_all_dates, Index.inferred_type, Index.empty, Index.names, and Index.shape for all Index types.
- Bodo now supports Index.argmax(), Index.argmin(), Index.argsort(), and Index.nunique() for the following Index types:
  - NumericIndex
  - RangeIndex
  - StringIndex
  - BinaryIndex
  - DatetimeIndex
  - TimedeltaIndex
  - CategoricalIndex
  - PeriodIndex
- Bodo now supports Index.all() and Index.any() for the following Index types:
  - NumericIndex
  - RangeIndex
  - StringIndex
  - BinaryIndex
- Bodo now supports Index.isin(), Index.union(), Index.intersection(), Index.difference(), Index.symmetric_difference(), Index.to_list(), and Index.tolist() for the following Index types:
  - NumericIndex
  - RangeIndex
  - StringIndex
  - BinaryIndex
  - DatetimeIndex
  - TimedeltaIndex
- Bodo now supports Index.dtype and Index.to_frame() for the following Index types:
  - NumericIndex
  - RangeIndex
  - StringIndex
  - BinaryIndex
  - DatetimeIndex
  - TimedeltaIndex
  - CategoricalIndex
  - MultiIndex
- Bodo now supports Index.to_series(), Index.where(), Index.putmask(), and Index.sort_values() for the following Index types:
  - NumericIndex
  - RangeIndex
  - StringIndex
  - BinaryIndex
  - DatetimeIndex
  - TimedeltaIndex
  - CategoricalIndex
- Bodo now supports Index.unique() and Index.to_numpy() for the following Index types:
  - NumericIndex
  - RangeIndex
  - StringIndex
  - BinaryIndex
  - DatetimeIndex
  - TimedeltaIndex
  - CategoricalIndex
  - IntervalIndex
- Added support for the CategoricalIndex iterator.
- Added support for Series.rank() with replicated data.
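
The memory and time savings from dictionary-encoded string arrays can be illustrated with a small pure-Python sketch. This is a conceptual model only, not Bodo's implementation: the helper names (`dict_encode`, `zfill_encoded`, `decode`) are hypothetical. The idea is that each distinct string is stored once in a dictionary and rows hold integer indices, so a string operation such as zfill only has to run once per distinct value.

```python
def dict_encode(values):
    """Encode a list of strings as (dictionary of unique values, row indices)."""
    dictionary = []
    positions = {}
    indices = []
    for v in values:
        if v not in positions:
            positions[v] = len(dictionary)
            dictionary.append(v)
        indices.append(positions[v])
    return dictionary, indices


def zfill_encoded(dictionary, indices, width):
    # Apply the string operation once per unique value, not once per row.
    new_dictionary = [s.zfill(width) for s in dictionary]
    # The per-row indices are reused unchanged.
    return new_dictionary, indices


def decode(dictionary, indices):
    """Expand the encoded form back into one string per row."""
    return [dictionary[i] for i in indices]


values = ["7", "42", "7", "7", "42", "7"]
dictionary, indices = dict_encode(values)
padded_dict, padded_idx = zfill_encoded(dictionary, indices, 4)
decoded = decode(padded_dict, padded_idx)

print(len(values), "rows,", len(dictionary), "zfill calls")
print(decoded)
```

With six rows but only two distinct values, the operation runs twice instead of six times, and the repeated strings are stored once. The benefit grows with the amount of repetition in the column.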
Scikit-Learn Coverage:
- Added support for the following functions and classes:
  - sklearn.metrics.log_loss
  - sklearn.metrics.pairwise.cosine_similarity
  - sklearn.model_selection.KFold
  - sklearn.model_selection.LeavePOut
  - sklearn.preprocessing.OneHotEncoder
  - sklearn.preprocessing.MaxAbsScaler
  - sklearn.utils.shuffle
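
For readers unfamiliar with some of these APIs, the sketch below shows three of them as plain scikit-learn calls on a small made-up array. It does not use Bodo: which argument combinations are supported inside a Bodo-compiled function is not stated in these notes and should be checked against the Bodo documentation.

```python
import numpy as np
from sklearn.metrics import log_loss
from sklearn.model_selection import KFold
from sklearn.preprocessing import MaxAbsScaler

# Hypothetical data for illustration only.
X = np.array([[1.0, -2.0], [2.0, 4.0], [-4.0, 1.0], [0.5, 0.0]])

# MaxAbsScaler divides each column by its largest absolute value.
X_scaled = MaxAbsScaler().fit_transform(X)

# KFold without shuffling yields contiguous test folds.
folds = [
    (train.tolist(), test.tolist())
    for train, test in KFold(n_splits=2).split(X)
]

# log_loss compares true labels with predicted class probabilities.
loss = log_loss([0, 1], [[0.9, 0.1], [0.2, 0.8]])

print(X_scaled)
print(folds)
print(loss)
```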
 
BodoSQL:
- BodoSQL is available on PyPI.
- BodoSQL now uses py4j. This should remove any conflicts with other packages that use jpype.
- Significantly reduced compilation time when compiling queries with large numbers of columns for common operations (join, where, order by, limit).
- Optimized first_value and last_value window functions when a single value is repeated for the entire column.
- Reduced compilation time with LPAD and RPAD.
- Increased filter pushdown coverage when loading data from Parquet.