Bodo 2022.3 Release (Date: 3/31/2022)
This release includes many new features, usability and performance improvements, and bug fixes. Overall, 74 code patches were merged since the last release.
New Features and Improvements
- Bodo is updated to use Arrow 7.0 (latest)
- Initial support for dictionary-encoded string arrays. Dictionary encoding can significantly improve performance and reduce memory usage when data has many repeated values, which is common in practice. Bodo now uses dictionary encoding automatically in pd.read_parquet when a string column can benefit from it. Join, sort, and Parquet write operations support dictionary-encoded string arrays as well, and support will expand to other operations in the future. Bodo falls back to regular string arrays automatically if an operation does not support dictionary encoding.
- Connectors:
  - pd.read_parquet performance improvements when multiple processes read from the same file.
  - Support for filter pushdown in Parquet and Snowflake when using Series.isin
  - Support for SparkSQL's input_file_name functionality for read_parquet using a new _bodo_input_file_name_col argument.
  - Support for chunksize in df.to_sql
  - Optimized df.to_parquet memory usage when writing string columns
  - Support for passing a list of columns as the columns parameter of df.to_csv
  - Support in pd.read_sql for returning an empty DataFrame from Snowflake, either due to an empty query or as the result of filter pushdown.
  - Changed the default values of orient and lines in DataFrame.to_json to records and True, respectively, to enable parallel write (Pandas uses columns and False as defaults).
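As a rough illustration of the DataFrame.to_json default change, the sketch below uses plain pandas (not Bodo) and passes orient and lines explicitly to compare the two layouts. The sample DataFrame is made up for illustration. With one record per line, each row serializes independently of the others, which is the property a parallel write relies on.

```python
import json

import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})

# pandas defaults (orient="columns", lines=False): a single JSON object
# keyed by column name, then by row label.
columns_json = df.to_json()

# The layout Bodo now uses by default, requested explicitly here:
# one self-contained JSON record per line.
records_json = df.to_json(orient="records", lines=True)

# Each line parses on its own, independent of the others.
records = [json.loads(line) for line in records_json.splitlines()]

print(columns_json)
print(records)
```

Code that depends on the pandas layout can still request it by passing orient="columns" and lines=False explicitly.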
- Bodo now provides compiler optimization logging through bodo.set_verbose_level(). This can be used to display certain optimizations performed at compile time, such as filter pushdown, column pruning, and which columns are read with dictionary encoding when reading from Parquet. See Verbose Mode for more details.
- Improvements in error checking and quality of error messages.
- Avoid hang when encountering unhandled exceptions on a single process.
- Introduced the replicated JIT decorator flag (opposite of distributed).
- If the user provides the distributed JIT flag for some input and return values but not all, Bodo can now infer the distribution of the rest.
- Performance optimizations:
  - Improved memory usage during parallel groupby.apply
  - Improved df.sample performance when frac=1 and replace=False
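For context on the df.sample case: with frac=1 and replace=False, the call returns every row exactly once in random order, which amounts to a full shuffle. A minimal plain-pandas sketch of that case follows; the data and the random_state value are arbitrary choices for illustration.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({"x": range(10)})

# frac=1 with replace=False selects all rows without repetition.
shuffled = df.sample(frac=1, replace=False, random_state=0)

# Same rows, each exactly once; only the order may differ.
same_rows = sorted(shuffled["x"].tolist()) == df["x"].tolist()

print(len(shuffled), same_rows)
```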
- Pandas:
  - Initial support for Timezone-Aware arrays and timestamps
    - Added support for array.tz_convert, Series.dt.tz_convert, Timestamp.tz_convert, DatetimeIndex.tz_convert, Timestamp.tz_localize
  - Support for Series.str.cat
  - Support for pd.unique on Series and 1-D arrays
  - Support for comparison operators between DatetimeIndex and pd.Timestamp, and between TimedeltaIndex and pd.Timedelta
  - Support for DataFrame.set_index on single-column DataFrames
  - Support for Series.first_valid_index and Series.last_valid_index
  - Support for conversion between pd.Timestamp and np.datetime64
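The timezone methods listed above behave as follows in plain pandas. This is only a sketch of the pandas semantics, with arbitrary example dates and zones; under Bodo, the same calls would sit inside a JIT-compiled function, which is not shown here.

```python
import pandas as pd

# Attach a time zone to a naive timestamp, then convert it to another zone.
naive = pd.Timestamp("2022-03-31 12:00:00")
utc = naive.tz_localize("UTC")
eastern = utc.tz_convert("America/New_York")

# The same conversion on a Series goes through the .dt accessor.
s = pd.Series(pd.date_range("2022-03-31 12:00", periods=2, freq="D", tz="UTC"))
s_eastern = s.dt.tz_convert("America/New_York")

print(eastern)
print(s_eastern.dt.hour.tolist())
```

Note that tz_convert changes only the displayed wall-clock time; the converted value still refers to the same instant, so eastern == utc holds.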