Bodo 2022.6 Release (Date: 06/30/2022)
New Features and Improvements
- Bodo is upgraded to use Numba 0.55.2 (the latest release)
Dataframe compilation improvements:
- pandas.merge is now much faster to compile and supports super wide dataframes (e.g. 100,000 columns).
- DataFrame.sort_values is now much faster to compile and supports super wide dataframes.
- DataFrame.astype is now much faster to compile and supports super wide dataframes.
- DataFrame.loc, DataFrame.iloc and DataFrame[col_list] are now faster to compile and support super wide dataframes when returning a DataFrame.
- Bodo can now automatically optimize out unused output keys of join and sort operations (e.g. pd.merge, df.sort_values). This should result in significant runtime and memory usage improvements.
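As an illustration of the last point, here is a minimal sketch of a function in which the join key and the sort key never reach the output. The column names, the sample data, and the jit fallback are assumptions made for this example. The fallback only lets the snippet run as plain pandas where Bodo is not installed, and the snippet does not by itself show that the optimization is applied.

```python
import pandas as pd

try:
    import bodo
    jit = bodo.jit
except ImportError:
    # Assumption for this sketch: without Bodo, run the function as plain pandas.
    def jit(func):
        return func


@jit
def total_amount(orders, customers):
    # Join on customer_id, then order the rows by order_day.
    merged = orders.merge(customers, on="customer_id")
    merged = merged.sort_values("order_day")
    # Only "amount" is read below. customer_id and order_day are never used
    # again, so they are the kind of output keys the note above refers to.
    return merged["amount"].sum()


orders = pd.DataFrame(
    {
        "customer_id": [1, 2, 1, 3],
        "order_day": [3, 1, 2, 4],
        "amount": [10.0, 20.0, 30.0, 40.0],
    }
)
customers = pd.DataFrame({"customer_id": [1, 2], "region": ["east", "west"]})

result = total_amount(orders, customers)
print(result)
```

The order for customer 3 has no matching customer row, so the default inner join drops it before the sum is taken.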
Iceberg connector (alpha):
- Now supports reading from Nessie, Arctic, and Glue catalogs.
- Iceberg connector now uses py4j. This should remove any conflicts with other packages that use jpype.
Parquet I/O:
- Improved performance and robustness when reading Parquet files.
- Several improvements to Dead Column Elimination and Filter Pushdown that enable faster Parquet read in many scenarios.
Pandas coverage:
- Several Series operations are optimized to support dictionary-encoded string arrays, which reduces memory usage and execution time:
    - pd.Series.str.get
    - pd.Series.str.repeat
    - pd.Series.str.slice
    - pd.Series.str.pad
    - pd.Series.str.rjust
    - pd.Series.str.ljust
    - pd.Series.str.zfill
    - pd.Series.str.center
    - pd.Series.str.count
    - pd.Series.str.len
    - pd.Series.str.find
    - pd.Series.str.rfind
    - pd.Series.str.strip
    - pd.Series.str.lstrip
    - pd.Series.str.rstrip
    - pd.Series.str.extract
    - pd.Series.str.extractall
    - pd.Series.str.isalnum
    - pd.Series.str.isalpha
    - pd.Series.str.isdigit
    - pd.Series.str.isspace
    - pd.Series.str.islower
    - pd.Series.str.isupper
    - pd.Series.str.istitle
    - pd.Series.str.isnumeric
    - pd.Series.str.isdecimal
 
- Support for dictionary-encoded string arrays as the key values to DataFrame.groupby, which reduces memory usage and execution time.
- Bodo now supports Index.is_integer(), Index.is_floating(), Index.is_boolean(), Index.is_numeric(), Index.is_interval(), Index.is_categorical(), Index.is_object(), Index.T, Index.size, Index.ndim, Index.nlevels, Index.is_all_dates, Index.inferred_type, Index.empty, Index.names, and Index.shape for all Index types.
- Bodo now supports Index.argmax(), Index.argmin(), Index.argsort(), and Index.nunique() for the following Index types:
    - NumericIndex
    - RangeIndex
    - StringIndex
    - BinaryIndex
    - DatetimeIndex
    - TimedeltaIndex
    - CategoricalIndex
    - PeriodIndex
 
- Bodo now supports Index.all() and Index.any() for the following index types:
    - NumericIndex
    - RangeIndex
    - StringIndex
    - BinaryIndex
 
- Bodo now supports Index.isin(), Index.union(), Index.intersection(), Index.difference(), Index.symmetric_difference(), Index.to_list(), and Index.tolist() for the following index types:
    - NumericIndex
    - RangeIndex
    - StringIndex
    - BinaryIndex
    - DatetimeIndex
    - TimedeltaIndex
 
- Bodo now supports Index.dtype and Index.to_frame() for the following index types:
    - NumericIndex
    - RangeIndex
    - StringIndex
    - BinaryIndex
    - DatetimeIndex
    - TimedeltaIndex
    - CategoricalIndex
    - MultiIndex
 
- Bodo now supports Index.to_series(), Index.where(), Index.putmask(), and Index.sort_values() for the following index types:
    - NumericIndex
    - RangeIndex
    - StringIndex
    - BinaryIndex
    - DatetimeIndex
    - TimedeltaIndex
    - CategoricalIndex
 
- Bodo now supports Index.unique() and Index.to_numpy() for the following index types:
    - NumericIndex
    - RangeIndex
    - StringIndex
    - BinaryIndex
    - DatetimeIndex
    - TimedeltaIndex
    - CategoricalIndex
    - IntervalIndex
 
- Added support for Categorical Index iterator
- Added support for Series.rank() with replicated data
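For orientation, the sketch below exercises a few of the Index and Series.str methods named in this section using plain pandas. The sample values are made up for illustration. In Bodo these calls would normally sit inside a function decorated with bodo.jit, which is left out here so the snippet runs without Bodo, and dictionary encoding of string arrays is internal to Bodo, so it is not visible in this example.

```python
import pandas as pd

left = pd.Index([3, 1, 2])
right = pd.Index([2, 3, 4])

# Set-style operations between two indexes, converted to Python lists.
union = left.union(right).to_list()
common = left.intersection(right).to_list()
only_left = left.difference(right).to_list()

# Element-wise membership test and simple reductions.
mask = left.isin(right)
largest_pos = left.argmax()
distinct = left.nunique()

# A few of the string methods listed above, applied to repeated values.
names = pd.Series(["ab", "c", "ab"])
lengths = names.str.len()
padded = names.str.zfill(3)

print(union, common, only_left)
print(mask, largest_pos, distinct)
print(lengths.tolist(), padded.tolist())
```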
Scikit-Learn Coverage:
- Added support for the following functions:
    - sklearn.metrics.log_loss
    - sklearn.metrics.pairwise.cosine_similarity
    - sklearn.model_selection.KFold
    - sklearn.model_selection.LeavePOut
    - sklearn.preprocessing.OneHotEncoder
    - sklearn.preprocessing.MaxAbsScaler
    - sklearn.utils.shuffle
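The following sketch shows how a few of these functions are called in ordinary scikit-learn code. The arrays and probabilities are invented for illustration, and the snippet runs with scikit-learn alone. Using the same calls inside a bodo.jit function is assumed to follow the usual Bodo pattern and is not shown.

```python
import numpy as np
from sklearn.metrics import log_loss
from sklearn.model_selection import KFold
from sklearn.preprocessing import MaxAbsScaler
from sklearn.utils import shuffle

X = np.array([[1.0, -2.0], [2.0, 4.0], [-4.0, 1.0], [3.0, 3.0]])
y = np.array([0, 1, 0, 1])

# Scale each column by its maximum absolute value, so values land in [-1, 1].
X_scaled = MaxAbsScaler().fit_transform(X)

# Split the four rows into two folds of (train indices, test indices).
folds = list(KFold(n_splits=2).split(X_scaled))

# Shuffle features and labels together, keeping rows aligned.
X_shuf, y_shuf = shuffle(X_scaled, y, random_state=0)

# Log loss for hypothetical predicted class probabilities (one row per sample).
y_prob = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.4, 0.6]])
loss = log_loss(y, y_prob)

print(X_scaled)
print(folds)
print(loss)
```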
 
BodoSQL:
- BodoSQL is now available on PyPI.
- BodoSQL now uses py4j. This should remove any conflicts with other packages that use jpype.
- Significantly reduced compilation time when compiling queries with large numbers of columns for common operations (join, where, order by, limit).
- Optimized first_value and last_value window functions when a single value is repeated for the entire column.
- Reduced compilation time with LPAD and RPAD.
- Increased filter pushdown coverage when loading data from Parquet.