Bodo 2021.7 Release (Date: 7/23/2021)

This release includes many new features, optimizations, bug fixes and usability improvements. Overall, 109 code patches were merged since the last release.

New Features and Improvements

  • Documentation has been reorganized and updated, with improved navigation and a detailed walkthrough of Pandas equivalents of PySpark functions.

  • Improvements to enable BodoSQL features

  • Connectors:
    • Improved performance of pd.read_parquet when reading from remote storage systems like S3

    • Support reading categorical columns of Parquet

  • Performance improvements:
    • Improved performance and scalability of sort_values

    • Optimized pd.Series.isin(values) performance for long list of values.

  • UDFs in Series.apply and Dataframe.apply: the Bodo compiler transforms the code to pass main function values referenced in the UDF (“free variables”) as arguments to apply() automatically if possible (to simplify UDF usage).

  • Support passing Bodo data types to objmode directly (in addition to string representation of the data types). For example, the following code sets the return type an int64 type:

    @bodo.jit
    def f(a, b):
        with bodo.objmode(res=bodo.int64):
            res = random.randint(a, b)
        return res
    
  • Compilation time improvements for some dataframe operations

  • Distributed support for pd.RangeIndex calls

  • Pandas coverage:
    • Initial support for binary arrays, including within series/dataframes

    • Support for groupby.transform

    • Groupby: support repeated input columns. For example:

      df.groupby("A").agg(
              D=pd.NamedAgg(column="B", aggfunc=lambda A: A.sum()),
              F=pd.NamedAgg(column="C", aggfunc="max"),
              E=pd.NamedAgg(column="B", aggfunc="min"),
      )
      
    • Support Groupby with dropna=False

    • Support for dropna in Series.nunique, DataFrame.nunique, and groupby.nunique

    • Support for DataFrame.insert()

    • Support tolist() for string and numpy arrays

    • Expanded astype support:
      • str to timedelta64/datetime64

      • timedelta64/datetime64 to int64

      • date arrays

      • Numeric-like inputs to datetime/timedelta

      • Support for pd.StringDtype() in astype

      • numeric-like to nullable integer types

    • Support for pd.Timestamp.now()

    • Support Timestamp in pd.to_datetime

    • Support for Timestamp/Timedelta as the scalar value for a Series

    • Support for Series.dt.month_name, Timestamp.month_name

    • Support for min/max on timedelta64 series/arrays

  • Python coverage:
    • Support for bytes.fromhex()

    • Support for bytes.__hash__

    • Support for min and max for string values