Bodo 2022.6 Release (Date: 06/30/2022)
New Features and Improvements
- Bodo is upgraded to use Numba 0.55.2 (the latest release)
Dataframe compilation improvements:
- `pandas.merge` is now much faster to compile and supports super wide dataframes (e.g. 100,000 columns).
- `DataFrame.sort_values` is now much faster to compile and supports super wide dataframes.
- `DataFrame.astype` is now much faster to compile and supports super wide dataframes.
- `DataFrame.loc`, `DataFrame.iloc` and `DataFrame[col_list]` are now faster to compile and support super wide dataframes when returning a DataFrame.
- Bodo can now automatically optimize out unused output keys of join and sort operations (e.g. pd.merge, df.sort_values). This should result in significant runtime and memory usage improvements.
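To illustrate the unused-key optimization in the last item, here is a minimal pandas sketch of a pipeline whose join key and sort key never reach the output. The function name, column names and data are invented for this example. The code runs as plain pandas; under Bodo the function would be compiled by decorating it with `@bodo.jit`, and which columns actually get pruned depends on the compiled program.

```python
import pandas as pd


def amounts_by_time(orders, customers):
    # Join on customer_id; the key is needed to match rows but is never
    # read again after the merge.
    merged = pd.merge(orders, customers, on="customer_id")
    # Sort on ts; the key orders the rows but is not part of the result.
    ordered = merged.sort_values("ts")
    # Only these two columns are live in the output, so customer_id and ts
    # are candidates for the unused-key optimization described above.
    return ordered[["region", "amount"]]


# Illustrative data (hypothetical).
orders = pd.DataFrame(
    {"customer_id": [2, 1, 2], "ts": [30, 10, 20], "amount": [5.0, 7.0, 1.0]}
)
customers = pd.DataFrame({"customer_id": [1, 2], "region": ["east", "west"]})

result = amounts_by_time(orders, customers)
print(result)
```

Because neither key column appears in the returned DataFrame, a compiler that tracks column liveness does not need to materialize them in the join or sort output.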
Iceberg connector (alpha):
- Now supports reading from Nessie, Arctic, and Glue catalogs.
- Iceberg connector now uses py4j. This should remove any conflicts with other packages that use jpype.
Parquet I/O:
- Improved performance and robustness when reading Parquet files.
- Several improvements to Dead Column Elimination and Filter Pushdown that enable faster Parquet read in many scenarios.
Pandas coverage:
- Several Series operations are now optimized to support dictionary-encoded string arrays, which reduces memory usage and execution time:
pd.Series.str.get
pd.Series.str.repeat
pd.Series.str.slice
pd.Series.str.pad
pd.Series.str.rjust
pd.Series.str.ljust
pd.Series.str.zfill
pd.Series.str.center
pd.Series.str.count
pd.Series.str.len
pd.Series.str.find
pd.Series.str.rfind
pd.Series.str.strip
pd.Series.str.lstrip
pd.Series.str.rstrip
pd.Series.str.extract
pd.Series.str.extractall
pd.Series.str.isalnum
pd.Series.str.isalpha
pd.Series.str.isdigit
pd.Series.str.isspace
pd.Series.str.islower
pd.Series.str.isupper
pd.Series.str.istitle
pd.Series.str.isnumeric
pd.Series.str.isdecimal
- Support for dictionary-encoded string arrays as the key values to `DataFrame.groupby`, which reduces memory usage and execution time.
- Bodo now supports `Index.is_integer()`, `Index.is_floating()`, `Index.is_boolean()`, `Index.is_numeric()`, `Index.is_interval()`, `Index.is_categorical()`, `Index.is_object()`, `Index.T`, `Index.size`, `Index.ndim`, `Index.nlevels`, `Index.is_all_dates`, `Index.inferred_type`, `Index.empty`, `Index.names`, `Index.shape` for all Index types.
- Bodo now supports `Index.argmax()`, `Index.argmin()`, `Index.argsort()`, and `Index.nunique()` for the following Index types:
    - NumericIndex
    - RangeIndex
    - StringIndex
    - BinaryIndex
    - DatetimeIndex
    - TimedeltaIndex
    - CategoricalIndex
    - PeriodIndex
- Bodo now supports `Index.all()` and `Index.any()` for the following Index types:
    - NumericIndex
    - RangeIndex
    - StringIndex
    - BinaryIndex
- Bodo now supports `Index.isin()`, `Index.union()`, `Index.intersection()`, `Index.difference()`, `Index.symmetric_difference()`, `Index.to_list()`, and `Index.tolist()` for the following Index types:
    - NumericIndex
    - RangeIndex
    - StringIndex
    - BinaryIndex
    - DatetimeIndex
    - TimedeltaIndex
- Bodo now supports `Index.dtype` and `Index.to_frame()` for the following Index types:
    - NumericIndex
    - RangeIndex
    - StringIndex
    - BinaryIndex
    - DatetimeIndex
    - TimedeltaIndex
    - CategoricalIndex
    - MultiIndex
- Bodo now supports `Index.to_series()`, `Index.where()`, `Index.putmask()`, and `Index.sort_values()` for the following Index types:
    - NumericIndex
    - RangeIndex
    - StringIndex
    - BinaryIndex
    - DatetimeIndex
    - TimedeltaIndex
    - CategoricalIndex
- Bodo now supports `Index.unique()` and `Index.to_numpy()` for the following Index types:
    - NumericIndex
    - RangeIndex
    - StringIndex
    - BinaryIndex
    - DatetimeIndex
    - TimedeltaIndex
    - CategoricalIndex
    - IntervalIndex
- Added support for `CategoricalIndex` iterator
- Added support for `Series.rank()` with replicated data
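The dictionary-encoding items above mention lower memory usage and execution time. The following pure-Python sketch shows why the layout helps: each distinct string is stored once, and a string operation runs once per distinct value rather than once per row. The `DictEncodedStrings` class and its methods are hypothetical names made up for this illustration; this is not Bodo's internal implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class DictEncodedStrings:
    """Hypothetical container: unique strings plus one integer code per row."""

    dictionary: List[str]
    codes: List[int]  # -1 marks a missing value

    @classmethod
    def from_values(cls, values: List[Optional[str]]) -> "DictEncodedStrings":
        positions = {}
        codes = []
        for value in values:
            if value is None:
                codes.append(-1)
            else:
                # setdefault assigns the next free code on first sight.
                codes.append(positions.setdefault(value, len(positions)))
        return cls(list(positions), codes)

    def map_strings(self, fn: Callable[[str], str]) -> "DictEncodedStrings":
        # String-to-string operations (zfill, strip, ...) touch only the
        # dictionary; the per-row codes are reused unchanged.
        return DictEncodedStrings([fn(s) for s in self.dictionary], self.codes)

    def map_values(self, fn):
        # Operations with non-string results (len, isdigit, ...) are computed
        # once per unique string and then gathered per row.
        per_unique = [fn(s) for s in self.dictionary]
        return [None if c < 0 else per_unique[c] for c in self.codes]

    def to_list(self):
        return [None if c < 0 else self.dictionary[c] for c in self.codes]


column = DictEncodedStrings.from_values(["7", "42", "7", None, "7"])
padded = column.map_strings(lambda s: s.zfill(3))
lengths = column.map_values(len)
print(padded.to_list(), lengths)
```

Here five rows are backed by a two-entry dictionary, so `zfill` and `len` each run twice instead of four times. The saving grows with the amount of repetition in the column.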
Scikit-Learn Coverage:
- Added support for the following functions:
sklearn.metrics.log_loss
sklearn.metrics.pairwise.cosine_similarity
sklearn.model_selection.KFold
sklearn.model_selection.LeavePOut
sklearn.preprocessing.OneHotEncoder
sklearn.preprocessing.MaxAbsScaler
sklearn.utils.shuffle
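As a quick reminder of what some of these functions do, the sketch below calls `MaxAbsScaler`, `KFold` and `LeavePOut` through the regular scikit-learn API on a small made-up array. It runs without Bodo; inside a Bodo program the same calls would sit in a function decorated with `@bodo.jit`, and distributed behavior is not shown here.

```python
import numpy as np
from sklearn.model_selection import KFold, LeavePOut
from sklearn.preprocessing import MaxAbsScaler

# Six samples with two features each (illustrative values).
X = np.array(
    [[1.0, -2.0], [2.0, 4.0], [-4.0, 1.0], [0.5, 0.0], [3.0, -8.0], [1.5, 2.0]]
)

# MaxAbsScaler divides each column by its largest absolute value,
# so every scaled column lies in [-1, 1].
scaled = MaxAbsScaler().fit_transform(X)

# KFold without shuffling yields consecutive blocks of test indices.
test_folds = [test.tolist() for _, test in KFold(n_splits=3).split(X)]

# LeavePOut enumerates every way to hold out p samples.
n_leave_two_out = LeavePOut(p=2).get_n_splits(X)

print(scaled)
print(test_folds, n_leave_two_out)
```

`LeavePOut` grows combinatorially with the number of samples, so it is usually practical only for small datasets.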
BodoSQL:
- BodoSQL is available on PyPI.
- BodoSQL now uses py4j. This should remove any conflicts with other packages that use jpype.
- Significantly reduced compilation time when compiling queries with large numbers of columns for common operations (join, where, order by, limit).
- Optimized `first_value` and `last_value` window functions when a single value is repeated for the entire column.
- Reduced compilation time with `LPAD` and `RPAD`.
- Increased filter pushdown coverage when loading data from Parquet.