
Bodo 2021.7 Release (Date: 7/23/2021)

This release includes many new features, optimizations, bug fixes and usability improvements. Overall, 109 code patches were merged since the last release.

New Features and Improvements

  • Documentation has been reorganized and updated, with improved navigation and a detailed walkthrough of Pandas equivalents of PySpark functions.

  • Improvements to enable BodoSQL features

  • Connectors:
    • Improved performance of pd.read_parquet when reading from remote storage systems like S3
    • Support for reading categorical columns from Parquet files
  • Performance improvements:
    • Improved performance and scalability of sort_values
    • Optimized pd.Series.isin(values) performance for long lists of values
  • UDFs in Series.apply and DataFrame.apply: to simplify UDF usage, the Bodo compiler automatically passes values from the main function that the UDF references ("free variables") as arguments to apply() when possible.
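
To illustrate what "free variables" means here, the following is a minimal sketch in plain pandas (no Bodo involved). The function names are hypothetical, and the second form only approximates the shape of code the compiler is described as producing; the actual generated code is not shown in these notes.

```python
import pandas as pd


def scale_with_free_variable(s, factor):
    # The lambda refers to `factor`, a free variable of the enclosing function.
    return s.apply(lambda x: x * factor)


def scale_with_explicit_argument(s, factor):
    # Equivalent form: `factor` reaches the UDF through apply()'s `args`.
    return s.apply(lambda x, f: x * f, args=(factor,))


s = pd.Series([1, 2, 3])
implicit = scale_with_free_variable(s, 10)
explicit = scale_with_explicit_argument(s, 10)
```

Both calls return the same Series. With the automatic transformation, users can write the first form and do not need to thread values through `args` by hand.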

  • Support passing Bodo data types to objmode directly (in addition to string representations of the data types). For example, the following code sets the return type to int64:

    import random
    import bodo

    @bodo.jit
    def f(a, b):
        # random.randint runs in regular Python; res is typed as int64
        with bodo.objmode(res=bodo.int64):
            res = random.randint(a, b)
        return res
    
  • Compilation time improvements for some dataframe operations

  • Distributed support for pd.RangeIndex calls

  • Pandas coverage:
    • Initial support for binary arrays, including within series/dataframes
    • Support for groupby.transform
    • Groupby: support repeated input columns. For example:

        df.groupby("A").agg(
            D=pd.NamedAgg(column="B", aggfunc=lambda A: A.sum()),
            F=pd.NamedAgg(column="C", aggfunc="max"),
            E=pd.NamedAgg(column="B", aggfunc="min"),
        )

    • Support Groupby with dropna=False
    • Support for dropna in Series.nunique, DataFrame.nunique, and groupby.nunique
    • Support for DataFrame.insert()
    • Support tolist() for string and numpy arrays
    • Expanded astype support:
      • str to timedelta64/datetime64
      • timedelta64/datetime64 to int64
      • date arrays
      • Numeric-like inputs to datetime/timedelta
      • Support for pd.StringDtype() in astype
      • numeric-like to nullable integer types
    • Support for pd.Timestamp.now()
    • Support Timestamp in pd.to_datetime
    • Support for Timestamp/Timedelta as the scalar value for a Series
    • Support for Series.dt.month_name, Timestamp.month_name
    • Support for min/max on timedelta64 series/arrays
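
As a quick reminder of what the dropna options in the list above control, here is a small sketch in plain pandas. The data and variable names are made up for illustration. Inside a Bodo JIT function the same calls are expected to behave the same way, but that is an assumption and is not demonstrated here.

```python
import pandas as pd

df = pd.DataFrame({"A": ["x", None, "x"], "B": [1, 2, 3]})

# Default behavior: rows whose group key is missing are left out.
groups_default = df.groupby("A")["B"].sum()

# dropna=False keeps the missing key as its own group.
groups_with_na = df.groupby("A", dropna=False)["B"].sum()

# nunique counts distinct values; dropna=False also counts the missing one.
distinct_default = df["A"].nunique()
distinct_with_na = df["A"].nunique(dropna=False)
```

Here groups_default contains only the "x" group, while groups_with_na has one extra group for the missing key.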
    
  • Python coverage:
    • Support for bytes.fromhex()
    • Support for bytes.__hash__
    • Support for min and max for string values
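
The Python features listed above are standard built-ins. The sketch below shows their behavior in regular Python with made-up values. Using them inside a function decorated with @bodo.jit is assumed to give the same results, but that is not verified here.

```python
# bytes.fromhex: build a bytes object from a hexadecimal string.
payload = bytes.fromhex("deadbeef")

# bytes.__hash__: hashing is what lets bytes serve as dictionary keys.
lookup = {payload: "header"}

# min and max on strings compare lexicographically.
smallest = min("pear", "apple", "fig")
largest = max("pear", "apple", "fig")
```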