2021)¶

This release includes many new features, optimizations, bug fixes and usability improvements. Overall, 109 code patches were merged since the last release.

New Features and Improvements¶

Documentation has been reorganized and updated, with improved navigation and a detailed walkthrough of Pandas equivalents of PySpark functions.
Improvements to enable BodoSQL features
Connectors:
- Improved performance of pd.read_parquet when reading from remote storage systems like S3
- Support reading categorical columns of Parquet
Performance improvements:
- Improved performance and scalability of sort_values
- Optimized pd.Series.isin(values) performance for long list of values.
UDFs in Series.apply and Dataframe.apply: the Bodo compiler transforms the code to pass main function values referenced in the UDF ("free variables") as arguments to apply() automatically if possible (to simplify UDF usage).
Support passing Bodo data types to objmode directly (in addition to string representation of the data types). For example, the following code sets the return type an int64 type:
```
@bodo.jit
def f(a, b):
    with bodo.objmode(res=bodo.int64):
        res = random.randint(a, b)
    return res
```
Compilation time improvements for some dataframe operations
Distributed support for pd.RangeIndex calls

Pandas coverage:

Initial support for binary arrays, including within series/dataframes

-   Support for `groupby.transform`

-   Groupby: support repeated input columns. For example:

        df.groupby("A").agg(
                D=pd.NamedAgg(column="B", aggfunc=lambda A: A.sum()),
                F=pd.NamedAgg(column="C", aggfunc="max"),
                E=pd.NamedAgg(column="B", aggfunc="min"),
        )

-   Support Groupby with `dropna=False`

-   Support for `dropna` in `Series.nunique`,
    `DataFrame.nunique`, and `groupby.nunique`

-   Support for `DataFrame.insert()`

-   Support `tolist()` for string and numpy arrays

-

    Expanded `astype` support:

    :   -   str to timedelta64/datetime64
        -   timedelta64/datetime64 to int64
        -   date arrays
        -   Numeric-like inputs to datetime/timedelta
        -   Support for `pd.StringDtype()` in `astype`
        -   numeric-like to nullable integer types

-   Support for `pd.Timestamp.now()`

-   Support Timestamp in `pd.to_datetime`

-   Support for Timestamp/Timedelta as the scalar value for a
    Series

-   Support for `Series.dt.month_name`, `Timestamp.month_name`

-   Support for min/max on timedelta64 series/arrays

Python coverage:
- Support for bytes.fromhex()
- Support for bytes.__hash__
- Support for min and max for string values