Bodo 2022.4 Release (Date: 4/29/2022)¶
This release includes many new features, usability and performance improvements, and bug fixes. Overall, 60 code patches were merged since the last release.
New Features and Improvements¶
-
Support for Python 3.10 (Conda/pip packages will be available soon)
-
Support for Pandas 1.4 (along with continued support for v1.3)
-
Connectors:
- When passing a list of paths to
pd.read_parquet
, the paths can be a combination of paths to files and glob strings. - Improved performance of
pd.read_parquet
on remote filesystems when passing long lists of files. DataFrame.to_parquet
now supportsrow_group_size
parameter, which can be used to specify the maximum number of rows in generated row groups. Bodo now has a default row group size of 1M rows, to improve performance when reading the generated parquet datasets in parallel.pd.read_parquet
: string columns can be forced to be read with dictionary encoding by passing a list of column names with_bodo_read_as_dict
parameter.- Support for S3 anonymous access with
storage_options={"anon": True}
inpd.read_csv
andpd.read_json
- Improved performance and memory utilization of
pd.read_csv
at compilation and run time (especially when reading first n rows from remote filesystems such as S3)
- When passing a list of paths to
-
Parallel support for
pd.date_range
: Bodo automatically creates a date range that is distributed across processes -
Improved performance of
Series.str.startswith/endswith/contains
for dictionary-encoded string arrays -
Reduced compilation time for
DataFrame.memory_usage()
-
Reduced compilation time when using
pandas.read_sql()
with wide tables. -
Pandas:
- Support for
DataFrame.melt()
andpd.melt()
- Support for