Bodo 2020.06 Release (Date: 06/12/2020)¶
New Features and Improvements¶
-
Bodo is updated to use the latest minor releases of Numba and Apache Arrow packages:
- numba 0.49.1
- Apache Arrow 0.17.1
-
Significant optimizations in read CSV/JSON/Parquet to reduce number of requests, files opened and overall load on the filesystem (for local filesystems, S3 and HDFS).
-
Improvements in
pandas.read_csv()
andpandas.read_json()
:- Support reading compressed JSON and CSV files (gzip and bz2)
- Can read directories containing files with any extension
- Correctly handle CSV files with headers when reading a directory of CSV files
- Support automatic data type inference of JSON files when
orient='records'
andlines=True
-
Bodo can now automatically infer the required constant values (e.g. list of key names for groupby) from the program in many cases. In addition, Bodo raises informative errors for cases that are not possible to infer automatically.
-
Various improvements to support caching of Bodo functions, including adding support for caching inside Jupyter Notebook (see here for more information)
-
Support NA value check with
pandas.isna(A[i])
-
Support creating empty dataframes and setting columns on empty dataframes
-
More balanced workload distribution across processor cores
-
Support for user-defined functions calling other JIT functions, and improved error messages for invalid cases
-
pandas.read_parquet()
: support reading columns of list of integers/floats -
Support
bodo.scatterv()
for arrays of list of strings/integers/floats. -
Improved support for
pd.to_datetime()
to handle optional arguments and cases such as string and integer array/Series inputs -
Improved
pd.concat
support to handle arrays of list,Decimal
anddatetime.date
values -
Improved array indexing (getitem/setitem) support for various data types such as date/time cases and Decimals
-
Support sorting of Decimal series
-
Support Dataframe
merge
andgroupby
with Decimal columns -
Groupby: ignore non-numeric columns for numeric-only operations like sum (same behavior as pandas).
-
Support for comparison of Timedelta data types (
datetime.timedelta
andtimedelta64
) -
Parallelization of
numpy.full()
-
Support
glob.glob(...)
inside Bodo functions -
Error messages and warnings:
- Improvements to clarity and conciseness of error messages
- Can use numba syntax highlighting for Bodo errors (enable with NUMBA_COLOR_SCHEME environment variable)
-
Documentation:
- New theme and style
- Revamped introductory material and guide
- Improved documentation for
pd.read_csv()
andpd.read_json()
- Documented Bodo's coverage of data types
Overall, 82 code patches are merged since the last release.