2022)

New Features and Improvements

Bodo is upgraded to use Numba 0.55.2 (the latest release)

Dataframe compilation improvements:

pandas.merge is now much faster to compile and supports super wide dataframes (e.g. 100,000 columns).
DataFrame.sort_values is now much faster to compile and supports super wide dataframes.
DataFrame.astype is now much faster to compile and supports super wide dataframes.
DataFrame.loc, DataFrame.iloc and DataFrame[col_list] are now faster to compile and support super wide dataframes when returning a DataFrame.
Bodo can now automatically optimize out unused output keys of join and sort operations (e.g. pd.merge, df.sort_values). This should result in significant runtime and memory usage improvements.
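
As an illustration of the unused-key case described above, here is a minimal sketch in plain pandas. The table and column names are hypothetical. Under Bodo, a function like this would typically be compiled with the bodo.jit decorator; it is left out here so the sketch runs with pandas alone, and the optimization itself is not observable from this code.

```python
import pandas as pd


def amounts_by_date(orders: pd.DataFrame, customers: pd.DataFrame) -> pd.Series:
    # Join on "customer_id", then sort by "order_date".
    merged = pd.merge(orders, customers, on="customer_id")
    ordered = merged.sort_values("order_date")
    # Only "amount" is read afterwards. The join key "customer_id" and the
    # sort key "order_date" are never used from the output, which is the
    # pattern where unused output keys could be dropped by the compiler.
    return ordered["amount"]


# Hypothetical sample data.
orders = pd.DataFrame(
    {
        "customer_id": [1, 2, 1, 3],
        "order_date": pd.to_datetime(
            ["2022-03-01", "2022-01-15", "2022-02-10", "2022-04-05"]
        ),
        "amount": [10.0, 20.0, 30.0, 40.0],
    }
)
customers = pd.DataFrame(
    {"customer_id": [1, 2, 3], "region": ["east", "west", "east"]}
)

result = amounts_by_date(orders, customers)
print(result.tolist())
```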

Iceberg connector (alpha):

Now supports reading from Nessie, Arctic, and Glue catalogs.
Iceberg connector now uses py4j. This should remove any conflicts with other packages that use jpype.

Parquet I/O:

Improved performance and robustness when reading Parquet files.
Several improvements to Dead Column Elimination and Filter Pushdown that enable faster Parquet reads in many scenarios.

Pandas coverage:

Several Series operations are now optimized to support dictionary-encoded string arrays, which reduces memory usage and execution time:
- pd.Series.str.get
- pd.Series.str.repeat
- pd.Series.str.slice
- pd.Series.str.pad
- pd.Series.str.rjust
- pd.Series.str.ljust
- pd.Series.str.zfill
- pd.Series.str.center
- pd.Series.str.count
- pd.Series.str.len
- pd.Series.str.find
- pd.Series.str.rfind
- pd.Series.str.strip
- pd.Series.str.lstrip
- pd.Series.str.rstrip
- pd.Series.str.extract
- pd.Series.str.extractall
- pd.Series.str.isalnum
- pd.Series.str.isalpha
- pd.Series.str.isdigit
- pd.Series.str.isspace
- pd.Series.str.islower
- pd.Series.str.isupper
- pd.Series.str.istitle
- pd.Series.str.isnumeric
- pd.Series.str.isdecimal
Support for dictionary-encoded string arrays as the key values to DataFrame.groupby, which reduces memory usage and execution time.
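
The string operations and the groupby key support listed above apply to columns with many repeated string values, which is where dictionary encoding helps. The sketch below uses plain pandas and hypothetical data to show the kind of code that is affected. Whether Bodo chooses the dictionary-encoded representation for a given input is not something this example controls.

```python
import pandas as pd

# A string column with heavy repetition (hypothetical product codes).
codes = pd.Series(["ab-1", "cd-22", "ab-1", "ef-333", "cd-22", "ab-1"])

lengths = codes.str.len()                            # length of each string
padded = codes.str.zfill(6)                          # left-pad with "0" to width 6
prefix = codes.str.slice(0, 2)                       # first two characters
digits = codes.str.extract(r"-(\d+)", expand=False)  # digits after the dash
dash_at = codes.str.find("-")                        # position of the first dash

# The same repeated strings used as a groupby key.
df = pd.DataFrame({"code": codes, "qty": [1, 2, 3, 4, 5, 6]})
totals = df.groupby("code")["qty"].sum()

print(totals)
```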
Bodo now supports Index.is_integer(), Index.is_floating(), Index.is_boolean(), Index.is_numeric(), Index.is_interval(), Index.is_categorical(), Index.is_object(), Index.T, Index.size, Index.ndim, Index.nlevels, Index.is_all_dates, Index.inferred_type, Index.empty, Index.names, Index.shape for all Index types.
Bodo now supports Index.argmax(), Index.argmin(), Index.argsort(), and Index.nunique() for the following index types:
- NumericIndex
- RangeIndex
- StringIndex
- BinaryIndex
- DatetimeIndex
- TimedeltaIndex
- CategoricalIndex
- PeriodIndex
Bodo now supports Index.all() and Index.any() for the following index types:
- NumericIndex
- RangeIndex
- StringIndex
- BinaryIndex
Bodo now supports Index.isin(), Index.union(), Index.intersection(), Index.difference(), Index.symmetric_difference(), Index.to_list(), and Index.tolist() for the following index types:
- NumericIndex
- RangeIndex
- StringIndex
- BinaryIndex
- DatetimeIndex
- TimedeltaIndex
Bodo now supports Index.dtype and Index.to_frame() for the following index types:
- NumericIndex
- RangeIndex
- StringIndex
- BinaryIndex
- DatetimeIndex
- TimedeltaIndex
- CategoricalIndex
- MultiIndex
Bodo now supports Index.to_series(), Index.where(), Index.putmask(), and Index.sort_values() for the following index types:
- NumericIndex
- RangeIndex
- StringIndex
- BinaryIndex
- DatetimeIndex
- TimedeltaIndex
- CategoricalIndex
Bodo now supports Index.unique() and Index.to_numpy() for the following index types:
- NumericIndex
- RangeIndex
- StringIndex
- BinaryIndex
- DatetimeIndex
- TimedeltaIndex
- CategoricalIndex
- IntervalIndex
Added support for iterating over a CategoricalIndex
Added support for Series.rank() with replicated data
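
For readers less familiar with the Index methods named above, the following sketch shows a handful of them in plain pandas on a small integer index. The values are hypothetical, and the example only demonstrates pandas semantics. It makes no claim about how Bodo distributes or compiles these calls.

```python
import pandas as pd

idx = pd.Index([30, 10, 20])
other = pd.Index([10, 40])

max_pos = idx.argmax()                       # position of the largest value
min_pos = idx.argmin()                       # position of the smallest value
order = idx.argsort()                        # positions that would sort the index
mask = idx.isin([10, 20])                    # boolean membership mask

both = idx.union(other)                      # values in either index
common = idx.intersection(other)             # values in both indexes
only_left = idx.difference(other)            # values in idx but not in other
either = idx.symmetric_difference(other)     # values in exactly one index

ordered = idx.sort_values()
replaced = idx.where(idx > 15, 0)            # keep values > 15, else 0
masked = idx.putmask(idx > 15, -1)           # set values > 15 to -1
frame = idx.to_frame(index=False, name="value")

dup = pd.Index(["a", "b", "a"])
n_distinct = dup.nunique()
distinct = dup.unique()
```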

Scikit-Learn Coverage:

Added support for the following functions:
- sklearn.metrics.log_loss
- sklearn.metrics.pairwise.cosine_similarity
- sklearn.model_selection.KFold
- sklearn.model_selection.LeavePOut
- sklearn.preprocessing.OneHotEncoder
- sklearn.preprocessing.MaxAbsScaler
- sklearn.utils.shuffle
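
A short sketch of several of the newly supported scikit-learn functions, written against plain scikit-learn with hypothetical data. It only shows how the calls fit together. Running them inside a Bodo-compiled function is assumed to follow the same call pattern but is not shown here.

```python
import numpy as np
from sklearn.metrics import log_loss
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import KFold
from sklearn.preprocessing import MaxAbsScaler
from sklearn.utils import shuffle

# Hypothetical feature matrix and binary labels.
X = np.array([[1.0, -2.0], [2.0, 4.0], [-4.0, 1.0], [0.5, 0.5]])
y = np.array([0, 1, 1, 0])

# Scale each column by its maximum absolute value (4.0 for both columns here).
X_scaled = MaxAbsScaler().fit_transform(X)

# Shuffle features and labels together, keeping rows aligned.
X_shuf, y_shuf = shuffle(X, y, random_state=0)

# Split the four samples into two folds and record the test fold sizes.
fold_sizes = [len(test_idx) for _, test_idx in KFold(n_splits=2).split(X)]

# Pairwise cosine similarity between rows of X.
sim = cosine_similarity(X)

# Log loss for predicted probabilities of class 1.
probs = np.array([0.1, 0.9, 0.8, 0.2])
loss = log_loss(y, probs)
```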

BodoSQL:

BodoSQL is now available on PyPI.
BodoSQL now uses py4j. This should remove any conflicts with other packages that use jpype.
Significantly reduced compilation time when compiling queries with large numbers of columns for common operations (join, where, order by, limit)
Optimized first_value and last_value window functions when a single value is repeated for the entire column.
Reduced compilation time for queries that use LPAD and RPAD.
Increased filter pushdown coverage when loading data from Parquet.