Bodo 2020.06 Release (Date: 06/12/2020)
New Features and Improvements
Bodo is updated to use the latest minor releases of Numba and Apache Arrow packages:
Apache Arrow 0.17.1
Significant optimizations in reading CSV/JSON/Parquet to reduce the number of requests and files opened, and the overall load on the filesystem (for local filesystems, S3 and HDFS).
Support reading compressed JSON and CSV files (gzip and bz2)
Can read directories containing files with any extension
Correctly handle CSV files with headers when reading a directory of CSV files
Support automatic data type inference of JSON files when
Bodo can now automatically infer the required constant values (e.g. list of key names for groupby) from the program in many cases. In addition, Bodo raises informative errors for cases that are not possible to infer automatically.
Various improvements to support caching of Bodo functions, including adding support for caching inside Jupyter Notebook (see here for more information)
Support NA value check with
Support creating empty dataframes and setting columns on empty dataframes
More balanced workload distribution across processor cores
Support for user-defined functions calling other JIT functions, and improved error messages for invalid cases
pandas.read_parquet(): support reading columns of list of integers/floats
bodo.scatterv() for arrays of list of strings/integers/floats.
Improved support for
pd.to_datetime() to handle optional arguments and cases such as string and integer array/Series inputs
Improved pd.concat support to handle arrays of list,
Improved array indexing (getitem/setitem) support for various data types such as date/time cases and Decimals
Support sorting of Decimal series
groupby with Decimal columns
Groupby: ignore non-numeric columns for numeric-only operations like sum (same behavior as pandas).
Support for comparison of Timedelta data types (
glob.glob(...) inside Bodo functions
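As a rough illustration of two of the pandas behaviors listed above (pd.to_datetime() on string and integer Series, and groupby sums that skip non-numeric columns), here is a minimal sketch in plain pandas. It does not import Bodo; the function names and sample data are hypothetical. Under Bodo such functions would normally be decorated with @bodo.jit, and whether these exact calls compile in this release is an assumption not verified here.

```python
import pandas as pd


def parse_dates(strings: pd.Series, epoch_seconds: pd.Series):
    # String input is parsed directly; integer input is interpreted
    # as seconds since the Unix epoch via the optional `unit` argument.
    return pd.to_datetime(strings), pd.to_datetime(epoch_seconds, unit="s")


def numeric_sums(df: pd.DataFrame) -> pd.DataFrame:
    # numeric_only=True is passed explicitly so that the non-numeric
    # "label" column is dropped regardless of the pandas version's default.
    return df.groupby("key").sum(numeric_only=True)


# Hypothetical sample data, for illustration only.
df = pd.DataFrame(
    {"key": ["a", "b", "a"], "value": [1, 2, 3], "label": ["x", "y", "z"]}
)
sums = numeric_sums(df)

from_strings, from_ints = parse_dates(
    pd.Series(["2020-06-12"]), pd.Series([1591920000])
)
```

Here `sums` contains only the `value` column, and both parsed Series hold the same timestamp for the release date.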
Error messages and warnings:
Improvements to clarity and conciseness of error messages
Can use Numba syntax highlighting for Bodo errors (enable with the NUMBA_COLOR_SCHEME environment variable)
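A minimal sketch of enabling the highlighting from Python. It assumes the variable is read when Numba is first imported, so it has to be set before any Numba or Bodo import. The value "dark_bg" is one of the scheme names documented by Numba (others include "light_bg", "blue_bg", "jupyter_nb" and "no_color"); Numba's documentation also notes that the colorama package is needed for colored output.

```python
import os

# Set the scheme before importing numba or bodo, since the setting
# is assumed to be picked up at import time.
os.environ["NUMBA_COLOR_SCHEME"] = "dark_bg"

# Subsequent imports (e.g. `import bodo`) would go below this line.
scheme = os.environ["NUMBA_COLOR_SCHEME"]
```

The same effect can be had by exporting the variable in the shell before launching Python.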
New theme and style
Revamped introductory material and guide
Improved documentation for
Documented Bodo’s coverage of data types
Overall, 82 code patches have been merged since the last release.