Skip to content

Bodo 2022.6 Release (Date: 06/30/2022)

New Features and Improvements

  • Bodo is upgraded to use Numba 0.55.2 (the latest release)

Dataframe compilation improvements:

  • pandas.merge is now much faster to compile and supports super wide dataframes (e.g. 100,000 columns).

  • DataFrame.sort_values is now much faster to compile and supports super wide dataframes.

  • DataFrame.astype is now much faster to compile and supports super wide dataframes.

  • DataFrame.loc, DataFrame.iloc and DataFrame[col_list] are now faster to compile and support super wide dataframes when returning a DataFrame.

  • Bodo can now automatically optimize out unused output keys of join and sort operations (e.g. pd.merge, df.sort_values). This should result in significant runtime and memory usage improvements.

Iceberg connector (alpha):

  • Now supports reading from Nessie, Arctic, and Glue catalogs.

  • Iceberg connector now uses py4j. This should remove any conflicts with other packages that use jpype.

Parquet I/O:

  • Improved performance and robustness when reading Parquet files.

  • Several improvements to Dead Column Elimination and Filter Pushdown that enable faster Parquet read in many scenarios.

Pandas coverage:

  • Several Series operation are optimized to support dictionary-encoded string arrays, which reduces memory usage and execution time:

    • pd.Series.str.get
    • pd.Series.str.repeat
    • pd.Series.str.slice
    • pd.Series.str.pad
    • pd.Series.str.rjust
    • pd.Series.str.ljust
    • pd.Series.str.zfill
    • pd.Series.str.center
    • pd.Series.str.count
    • pd.Series.str.len
    • pd.Series.str.find
    • pd.Series.str.rfind
    • pd.Series.str.strip
    • pd.Series.str.lstrip
    • pd.Series.str.rstrip
    • pd.Series.str.extract
    • pd.Series.str.extractall
    • pd.Series.str.isalnum
    • pd.Series.str.isalpha
    • pd.Series.str.isdigit
    • pd.Series.str.isspace
    • pd.Series.str.islower
    • pd.Series.str.isupper
    • pd.Series.str.istitle
    • pd.Series.str.isnumeric
    • pd.Series.str.isdecimal
  • Support for dictionary-encoded string arrays as the key values to DataFrame.groupby, which reduces memory usage and execution time.

  • Bodo now supports Index.is_integer(), Index.is_floating(), Index.is_boolean(), Index.is_numeric(), Index.is_interval(), Index.is_categorical(), Index.is_object(), Index.T, Index.size, Index.ndim, Index.nlevels, Index.is_all_dates, Index.inferred_type, Index.empty, Index.names, Index.shape for all Index types.

  • Bodo now supports Index.argmax(), Index.argmin(), Index.argsort(), and Index.nunique() for the follwing Index types:

    • NumericIndex
    • RangeIndex
    • StringIndex
    • BinaryIndex
    • DatetimeIndex
    • TimedeltaIndex
    • CategoricalIndex
    • PeriodIndex
  • Bodo now supports Index.all() and Index.any() for the following index types:

    • NumericIndex
    • RangeIndex
    • StringIndex
    • BinaryIndex
  • Bodo now supports Index.isin(), Index.union(), Index.intersection(), Index.difference(), Index.symmetric_difference(), Index.to_list(), and Index.tolist() for the following index types:

    • NumericIndex
    • RangeIndex
    • StringIndex
    • BinaryIndex
    • DatetimeIndex
    • TimedeltaIndex
  • Bodo now supports Index.dtype and Index.to_frame() for the following index types

    • NumericIndex
    • RangeIndex
    • StringIndex
    • BinaryIndex
    • DatetimeIndex
    • TimedeltaIndex
    • CategoricalIndex
    • MultiIndex
  • Bodo now supports Index.to_series(), Index.where(), Index.putmask(), and Index.sort_values() for the following index types:

    • NumericIndex
    • RangeIndex
    • StringIndex
    • BinaryIndex
    • DatetimeIndex
    • TimedeltaIndex
    • CategoricalIndex
  • Bodo now supports Index.unique(), and Index.to_numpy() for the following index types:

    • NumericIndex
    • RangeIndex
    • StringIndex
    • BinaryIndex
    • DatetimeIndex
    • TimedeltaIndex
    • CategoricalIndex
    • IntervalIndex
  • Added support for Categorical Index iterator

  • Added support for Series.rank() with replicated data

Scikit-Learn Coverage:

  • Added support for the following functions:
    • sklearn.metrics.log_loss
    • sklearn.metrics.pairwise.cosine_similarity
    • sklearn.model_selection.KFold
    • sklearn.model_selection.LeavePOut
    • sklearn.preprocessing.OneHotEncoder
    • sklearn.preprocessing.MaxAbsScaler
    • sklearn.utils.shuffle

BodoSQL:

  • BodoSQL is available on pypi

  • BodoSQL now uses py4j. This should remove any conflicts with other packages that use jpype.

  • Significantly reduced compilation time when compiling queries with large numbers of columns for common operations (join, where, order by, limit)

  • Optimized first_value and last_value window functions when a single value is repeated for the entire column.

  • Reduced compilation with LPAD and RPAD

  • Increased filter pushdown coverage when loading data from Parquet.