Pandas Operations

Below is a reference list of the Pandas data types and operations that Bodo supports. This list will expand regularly as we add support for more APIs. Optional arguments are not supported unless explicitly specified.

Data Types

Bodo supports the following data types as values in Pandas DataFrame and Series data structures. This covers all Pandas data types except TZ-aware datetime, Period, Interval, and Sparse (which will be supported in the future). Compared to Spark, equivalents of all Spark data types are supported.

  • Numpy booleans: np.bool_.

  • Numpy integer data types: np.int8, np.int16, np.int32, np.int64, np.uint8, np.uint16, np.uint32, np.uint64.

  • Numpy floating point data types: np.float32, np.float64.

  • Numpy datetime data types: np.dtype("datetime64[ns]") and np.dtype("timedelta64[ns]"). The resolution currently has to be ns, which covers most practical use cases.

  • Numpy complex data types: np.complex64 and np.complex128.

  • Strings (including nulls).

  • datetime.date values (including nulls).

  • datetime.timedelta values (including nulls).

  • Pandas nullable integers.

  • Pandas nullable booleans.

  • Pandas Categoricals.

  • Lists of other data types.

  • Tuples of other data types.

  • Structs of other data types.

  • Maps of other data types (each map is a set of key-value pairs). All keys should have the same type to ensure type stability. All values should have the same type as well.

  • decimal.Decimal values (including nulls). Decimal values are stored in the fixed-precision Apache Arrow Decimal128 format, which is also similar to PySpark decimals. The decimal type has a precision (the maximum total number of digits) and a scale (the number of digits to the right of the decimal point) attribute, specifying how the stored data is interpreted. For example, the (4, 2) case can store values from -99.99 to 99.99. The precision can be up to 38, and the scale must be less than or equal to the precision. Arbitrary-precision Python decimal.Decimal values are converted with a precision of 38 and a scale of 18.
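For instance, here is a minimal sketch of passing a decimal Series through a JIT function (the values are illustrative):

import decimal
import pandas as pd
import bodo

@bodo.jit
def decimal_pass_through(S):
    # values are stored internally in Arrow Decimal128 format
    # with precision 38 and scale 18
    return S

S = pd.Series([decimal.Decimal("999.99"), decimal.Decimal("-999.99"), None])
res = decimal_pass_through(S)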

In addition, it may be desirable to specify type annotations in some cases (e.g. file I/O array input types). Typically these are array types, and they can all be accessed directly from the bodo module. The following list can be used to select the necessary Bodo type based upon the desired Python, Numpy, or Pandas type.

Each entry below gives a Bodo type name, followed by the equivalent Python, Numpy, or Pandas type:

  • bodo.bool_[:], bodo.int8[:], …, bodo.int64[:], bodo.uint8[:], …, bodo.uint64[:], bodo.float32[:], bodo.float64[:]
    One-dimensional Numpy array of the given type. A full list of supported Numpy types can be found here. A multidimensional array can be specified by adding additional colons (e.g. bodo.int32[:, :, :] for a three-dimensional array).

  • bodo.string_array_type
    Array of nullable strings.

  • bodo.IntegerArrayType(integer_type)
    Array of Pandas nullable integers of the given integer type, e.g. bodo.IntegerArrayType(bodo.int64).

  • bodo.boolean_array
    Array of Pandas nullable booleans.

  • bodo.datetime64ns[:]
    Array of Numpy datetime64 values.

  • bodo.timedelta64ns[:]
    Array of Numpy timedelta64 values.

  • bodo.datetime_date_array_type
    Array of datetime.date values.

  • bodo.datetime_timedelta_array_type
    Array of datetime.timedelta values.

  • bodo.DecimalArrayType(precision, scale)
    Array of Apache Arrow Decimal128 values with the given precision and scale, e.g. bodo.DecimalArrayType(38, 18).

  • bodo.binary_array_type
    Array of nullable bytes values.

  • bodo.StructArrayType(data_types, field_names)
    Array of user-defined structs with the given tuple of data types and field names, e.g. bodo.StructArrayType((bodo.int32[:], bodo.datetime64ns[:]), ("a", "b")).

  • bodo.TupleArrayType(data_types)
    Array of user-defined tuples with the given tuple of data types, e.g. bodo.TupleArrayType((bodo.int32[:], bodo.datetime64ns[:])).

  • bodo.MapArrayType(key_arr_type, value_arr_type)
    Array of Python dictionaries with the given key and value array types, e.g. bodo.MapArrayType(bodo.uint16[:], bodo.string_array_type).

  • bodo.PDCategoricalDtype(cat_tuple, cat_elem_type, is_ordered_cat)
    Pandas categorical type with the possible categories, the type of each category, and whether the categories are ordered, e.g. bodo.PDCategoricalDtype(("A", "B", "AA"), bodo.string_type, True).

  • bodo.CategoricalArrayType(categorical_type)
    Array of Pandas categorical values, e.g. bodo.CategoricalArrayType(bodo.PDCategoricalDtype(("A", "B", "AA"), bodo.string_type, True)).

  • bodo.DatetimeIndexType(name_type)
    Index of datetime64 values with the given name type, e.g. bodo.DatetimeIndexType(bodo.string_type).

  • bodo.NumericIndexType(data_type, name_type)
    pd.Int64Index, pd.UInt64Index, or pd.Float64Index, depending on the given data type, with the given name type, e.g. bodo.NumericIndexType(bodo.float64, bodo.string_type).

  • bodo.PeriodIndexType(freq, name_type)
    pd.PeriodIndex with the given frequency and name type, e.g. bodo.PeriodIndexType('A', bodo.string_type).

  • bodo.RangeIndexType(name_type)
    RangeIndex with the given name type, e.g. bodo.RangeIndexType(bodo.string_type).

  • bodo.StringIndexType(name_type)
    Index of strings with the given name type, e.g. bodo.StringIndexType(bodo.string_type).

  • bodo.BinaryIndexType(name_type)
    Index of binary values with the given name type, e.g. bodo.BinaryIndexType(bodo.string_type).

  • bodo.TimedeltaIndexType(name_type)
    Index of timedelta64 values with the given name type, e.g. bodo.TimedeltaIndexType(bodo.string_type).

  • bodo.SeriesType(dtype=data_type, index=index_type, name_typ=name_type)
    Series with the given data type, index type, and name type, e.g. bodo.SeriesType(bodo.float32, bodo.DatetimeIndexType(bodo.string_type), bodo.string_type).

  • bodo.DataFrameType(data_types_tuple, index_type, column_names)
    DataFrame with the given tuple of data types, index type, and column names, e.g. bodo.DataFrameType((bodo.int64[::1], bodo.float64[::1]), bodo.RangeIndexType(bodo.none), ("A", "B")).
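For example, here is a minimal sketch of annotating a local dataframe variable through the locals option of bodo.jit (the column names and types are illustrative; this mirrors the pandas.read_parquet example later in this section):

import pandas as pd
import bodo

@bodo.jit(locals={'df': {'A': bodo.IntegerArrayType(bodo.int64),
                         'B': bodo.string_array_type}})
def impl(f):
    # f is not a compile-time constant, so the type of df is annotated
    df = pd.read_parquet(f)
    return df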

Input/Output

See File I/O for more details, including Amazon S3, Hadoop Distributed File System (HDFS), and Azure Data Lake Storage (ADLS) Gen2 configuration requirements.

  • pandas.read_csv()

    • See File I/O for example usage and more system-specific instructions.

    • filepath_or_buffer should be a string and is required. It can point to a single CSV file, or a directory containing multiple partitioned CSV files (the files inside the directory must have the csv file extension).

    • Arguments sep, delimiter, header, names, index_col, usecols, dtype, nrows, skiprows, and parse_dates are supported.

    • Either the names and dtype arguments should be provided to enable type inference, or filepath_or_buffer should be inferrable as a constant string, so that Bodo can infer types by inspecting the file at compile time.

    • names, usecols, and parse_dates should be constant lists.

    • dtype should be a constant dictionary of strings and types.

    • If skiprows is not a constant, names must be provided to enable type inference.

    • When a CSV file is read in parallel (distributed mode) and each process reads only a portion of the file, reading columns that contain line breaks is not supported.
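    • For example, a minimal sketch where constant names and dtype enable type inference (the column names and types are illustrative):

      @bodo.jit
      def read_data(fname):
          df = pd.read_csv(fname,
              names=['A', 'B'],
              dtype={'A': 'int64', 'B': 'float64'},
          )
          return df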

  • pandas.read_excel()

    • The output dataframe cannot be parallelized automatically yet.

    • Only the arguments io, sheet_name, header, names, comment, dtype, skiprows, and parse_dates are supported.

    • io should be a string and is required.

    • Either the names and dtype arguments should be provided to enable type inference, or io should be inferrable as a constant string, so that Bodo can infer types by inspecting the file at compile time.

    • sheet_name, header, comment, and skiprows should be constant if provided.

    • names and parse_dates should be constant lists if provided.

    • dtype should be a constant dictionary of strings and types if provided.

  • pandas.read_sql()

    • See File I/O for example usage and more system-specific instructions.

    • Argument sql is supported, but only in string form; an SQLAlchemy Selectable is not supported. There is no restriction on the form of the SQL query.

    • Argument con is supported, but only in string form; an SQLAlchemy connectable is not supported.

    • Argument index_col is supported.

    • Arguments chunksize, columns, coerce_float, and params are not supported.
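    • For example, a minimal sketch (the query, table name, and connection string are illustrative):

      @bodo.jit
      def read_table(conn):
          # conn is a connection string such as
          # 'mysql+pymysql://user:password@host/db_name'
          df = pd.read_sql('SELECT * FROM my_table', conn)
          return df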

  • pandas.read_parquet()

    • See File I/O for example usage and more system-specific instructions.

    • Arguments path and columns are supported. columns should be a constant list of strings if provided.

    • Argument anon of storage_options is supported for S3 filepaths.

    • If path can be inferred as a constant (e.g. it is a function argument), Bodo finds the schema from the file at compilation time. Otherwise, the schema should be provided. For example:

      @bodo.jit(locals={'df':{'A': bodo.float64[:],
                              'B': bodo.string_array_type}})
      def impl(f):
        df = pd.read_parquet(f)
        return df
      
  • pandas.read_json()

    • See File I/O for example usage and more system-specific instructions.

    • Only the JSON Lines text file format (pd.read_json(filepath_or_buffer, orient='records', lines=True)) and regular multi-line JSON files (pd.read_json(filepath_or_buffer, orient='records', lines=False)) are supported.

    • Argument filepath_or_buffer is supported: it can point to a single JSON file, or a directory containing multiple partitioned JSON files. When reading a directory, the JSON files inside it must be in the JSON Lines text file format and have the json file extension.

    • Argument orient = 'records' is used as the default, instead of Pandas' default of 'columns' for dataframes. 'records' is the only supported value for orient.

    • Argument typ is supported. 'frame' is the only supported value for typ.

    • The dtype argument should be provided to enable type inference, or filepath_or_buffer should be inferrable as a constant string, so that Bodo can infer types by inspecting the file at compile time (not supported for multi-line JSON files).

    • Arguments convert_dates, precise_float, lines are supported.
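    • For example, a minimal sketch reading a JSON Lines file (the column names and types are illustrative):

      @bodo.jit
      def read_records(fname):
          df = pd.read_json(fname, orient='records', lines=True,
              dtype={'A': 'int64', 'B': 'float64'},
          )
          return df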

  • pandas.DataFrame.to_sql()

    • See File I/O for example usage and more system-specific instructions.

    • Argument con is supported, but only in string form; an SQLAlchemy connectable is not supported.

    • Arguments name, schema, if_exists, index, index_label, dtype, and method are supported.

    • Argument chunksize is not supported.
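    • For example, a minimal sketch (the table name and connection string are illustrative):

      @bodo.jit
      def write_table(df, conn):
          # conn is a connection string such as
          # 'mysql+pymysql://user:password@host/db_name'
          df.to_sql('my_table', conn, if_exists='replace', index=False)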

General functions

Data manipulations:

  • pandas.crosstab()

    • Annotation of pivot values is required. For example, @bodo.jit(pivots={'pt': ['small', 'large']}) declares the output table pt will have columns called small and large.
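    • For example, a minimal sketch (the column names and pivot values are illustrative):

      @bodo.jit(pivots={'pt': ['small', 'large']})
      def count_sizes(df):
          pt = pd.crosstab(df.A, df.B)
          return pt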

  • pandas.cut() (the 'include_lowest' optional argument is supported)

  • pandas.qcut()

  • pandas.merge()

    • Arguments left, right should be dataframes.

    • how, on, left_on, right_on, left_index, right_index, and indicator are supported but should be constant values.

    • The output dataframe is not sorted by default for better parallel performance (Pandas may preserve key order depending on how). One can use explicit sort if needed.

    • Bodo additionally supports more complex merge conditions not available in standard Pandas. Please refer to General Merge Conditions for more information.
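    • For example, a minimal sketch of a simple equi-join (the key column name is illustrative):

      @bodo.jit
      def merge_dfs(df1, df2):
          return df1.merge(df2, on='key', how='inner')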

  • pandas.merge_asof() (similar arguments to merge)

  • pandas.concat() An input list or tuple of dataframes or series is supported. axis and ignore_index are also supported. Bodo currently concatenates local data chunks for distributed datasets, which does not preserve the global order of the concatenated objects in the output.

  • pandas.get_dummies() Input must be a categorical array with categories that are known at compile time (for type stability).

Top-level missing data:

Top-level conversions:

  • pandas.to_numeric() Input can be a Series or array. The output type is float64 by default. Unlike Pandas, Bodo does not dynamically determine the output type and does not downcast to the smallest numerical type. The downcast parameter should be used as a type annotation for the output. The errors argument is not supported currently (errors are coerced by default).
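    • For example, a minimal sketch (the downcast value is illustrative):

      @bodo.jit
      def parse_floats(S):
          # downcast='float' annotates the output type as float32
          return pd.to_numeric(S, downcast='float')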

Top-level dealing with datetime and timedelta like:

Series

Bodo provides extensive Series support. However, operations between Series (e.g. +, -, *, /) do not implicitly align values based on their associated index values yet.

  • pandas.Series

    • Arguments data, index, and name are supported. data can be a list, array, Series, Index, or None. If data is a Series and index is provided, implicit alignment is not performed yet.

Attributes:

Methods:

Conversion:

Indexing, iteration:

Location-based indexing using [], iat, and iloc is supported. Changing values of an existing string Series using these operators is not supported yet.

  • pandas.Series.iat()

  • pandas.Series.iloc()

  • pandas.Series.loc() Read support for all indexers except callable objects. Label-based indexing is not supported yet.

Binary operator functions:

The fill_value optional argument for binary functions below is supported.

Function application, GroupBy & Window:

  • pandas.Series.apply() (convert_dtype not supported yet)

    • func argument can be a function (e.g. lambda), a jit function, or a constant string. Constant strings must refer to a supported Series method or Numpy ufunc.
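    • For example, a minimal sketch using a constant string that refers to a supported Series method:

      @bodo.jit
      def mean_via_apply(S):
          return S.apply('mean')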

  • pandas.Series.map() (only the arg argument, which should be a function or dictionary)

  • pandas.Series.groupby() (pass an array to the by argument, or use level=0 with a regular Index; sort=False and observed=True are set by default)

  • pandas.Series.rolling() (window, min_periods, and center arguments are supported)

Computations / Descriptive Stats:

Statistical functions below are supported without optional arguments unless support is explicitly mentioned.

Reindexing / Selection / Label manipulation:

Missing data handling:

Reshaping, sorting:

Time series-related:

  • pandas.Series.shift() (supports numeric, boolean, datetime.date, datetime64, timedelta64, and string types; only the periods argument is supported)

Datetime properties:

String handling:

Serialization / Conversion:

DataFrame

Bodo provides extensive DataFrame support documented below.

  • pandas.DataFrame

    • data argument can be a constant dictionary or 2D Numpy array. Other arguments are also supported.

Attributes and underlying data:

Conversion:

Indexing, iteration:

Function application, GroupBy & Window:

  • pandas.DataFrame.apply()

    • func argument can be a function (e.g. a lambda), a jit function, or a constant string. Constant strings must refer to a supported DataFrame method; if axis is provided with a constant string, the method must also take axis as an argument. The axis argument must be 1 if a function or jit function is provided.

    • Supports an extra _bodo_inline boolean argument to manually control Bodo's inlining behavior. Inlining user-defined functions (UDFs) can potentially improve performance at the expense of extra compilation time. Bodo uses heuristics to make a decision automatically if _bodo_inline is not provided.
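    • For example, a minimal sketch of applying a lambda across rows (the column names are illustrative):

      @bodo.jit
      def row_sums(df):
          # axis=1 is required since a function is provided
          return df.apply(lambda r: r.A + r.B, axis=1)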

  • pandas.DataFrame.groupby() by should be a constant column label or list of column labels. sort=False and observed=True are set by default. The as_index and dropna arguments are supported.
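    • For example, a minimal sketch (the column name is illustrative):

      @bodo.jit
      def group_sum(df):
          return df.groupby('A', as_index=False).sum()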

  • pandas.DataFrame.rolling() window argument should be an integer or a time offset (as a constant string, pd.Timedelta, or datetime.timedelta). min_periods, center, and on arguments are also supported. on should be a constant column name.
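    • For example, a minimal sketch with a constant integer window (the aggregation is illustrative):

      @bodo.jit
      def rolling_mean(df):
          return df.rolling(3, min_periods=1).mean()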

Computations / Descriptive Stats:

Reindexing / Selection / Label manipulation:

Missing data handling:

Reshaping, sorting, transposing:

Combining / joining / merging:

  • pandas.DataFrame.append() Appending a dataframe or list of dataframes is supported. ignore_index is supported.

  • pandas.DataFrame.assign() Function arguments are not supported yet.

  • pandas.DataFrame.join() Only dataframes are supported. The output dataframe is not sorted by default for better parallel performance (Pandas may preserve key order depending on how). One can use explicit sort if needed.

  • pandas.DataFrame.merge() Only dataframes are supported. how, on, left_on, right_on, left_index, right_index, and indicator are supported but should be constant values.

Time series-related:

  • pandas.DataFrame.shift() (supports numeric, boolean, datetime.date, and string types; only the periods argument is supported)

Serialization / IO / conversion:

Also see the Amazon S3, Hadoop Distributed File System (HDFS), and Azure Data Lake Storage (ADLS) Gen2 configuration requirements, and more in File I/O.

Index objects

Index

Properties

Modifying and computations:

The min/max methods are supported for DatetimeIndex. They are supported without optional arguments (NaT output for empty or all-NaT input is not supported yet):

Missing values:

Conversion:

Numeric Index

Numeric index objects RangeIndex, Int64Index, UInt64Index, and Float64Index are supported as the index of dataframes and series. Constructing them in Bodo functions, passing them to Bodo functions (unboxing), and returning them from Bodo functions (boxing) are also supported.

DatetimeIndex

DatetimeIndex objects are supported. They can be constructed, boxed/unboxed, and set as index to dataframes and series.

  • pandas.DatetimeIndex

    • Only the data argument is supported, and it can be array-like of datetime64[ns], int64, or strings.

Date fields of DatetimeIndex are supported:

Subtraction of Timestamp from DatetimeIndex and vice versa is supported.

Comparison operators ==, !=, >=, >, <=, < between DatetimeIndex and a string of datetime are supported.

TimedeltaIndex

TimedeltaIndex objects are supported. They can be constructed, boxed/unboxed, and set as index to dataframes and series.

  • pandas.TimedeltaIndex

    • Only the data argument is supported, and it can be array-like of timedelta64[ns] or int64.

Time fields of TimedeltaIndex are supported:

  • pandas.TimedeltaIndex.days()

  • pandas.TimedeltaIndex.seconds()

  • pandas.TimedeltaIndex.microseconds()

  • pandas.TimedeltaIndex.nanoseconds()

Min and Max operators are supported:

  • pandas.TimedeltaIndex.min()

  • pandas.TimedeltaIndex.max()

PeriodIndex

PeriodIndex objects can be boxed/unboxed and set as index to dataframes and series. Operations on them will be supported in upcoming releases.

BinaryIndex

BinaryIndex objects can be boxed/unboxed and set as index to dataframes and series. Operations on them will be supported in upcoming releases.

MultiIndex

Timestamp

Timestamp functionality is documented in pandas.Timestamp.

Timedelta

Timedelta functionality is documented in pandas.Timedelta.

  • pandas.Timedelta

    • The unit argument is not supported and all Timedeltas are represented in nanosecond precision.

Datetime related fields are supported:

Window

Rolling functionality is documented in pandas.DataFrame.rolling.

GroupBy

The supported operations are documented in pandas.DataFrame.groupby.

Offsets

Bodo supports a subset of the offset types in pandas.tseries.offsets:

  • pandas.tseries.offsets.DateOffset()

  • pandas.tseries.offsets.MonthBegin()

  • pandas.tseries.offsets.MonthEnd()

  • pandas.tseries.offsets.Week()

The currently supported operations are the constructor, addition and subtraction with a scalar datetime.date, datetime.datetime, or pandas.Timestamp, and multiplication with a scalar integer value. Addition and subtraction can also be mapped across a Series or DataFrame of dates using UDFs.
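For example, here is a minimal sketch of offset arithmetic with a scalar Timestamp (the dates are illustrative):

import pandas as pd
import bodo

@bodo.jit
def next_month_start(ts):
    # addition of a supported offset with a Timestamp; multiplication
    # with a scalar integer (e.g. 3 * pd.tseries.offsets.Week()) also works
    return ts + pd.tseries.offsets.MonthBegin(1)

res = next_month_start(pd.Timestamp("2021-03-15"))  # 2021-04-01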

Integer NA issue in Pandas

DataFrame and Series objects with integer data need special care due to integer NA issues in Pandas. By default, Pandas dynamically converts integer columns to floating point when missing values (NAs) are needed (which can result in loss of precision). This is because Pandas uses the NaN floating point value as NA, and Numpy does not support NaN values for integers. Bodo does not perform this conversion unless enough information is available at compilation time.

Pandas introduced a new nullable integer data type that can solve this issue, which is also supported by Bodo. For example, this code reads column A into a nullable integer array (the capital “I” denotes nullable integer type):

@bodo.jit
def example(fname):
  dtype = {'A': 'Int64', 'B': 'float64'}
  df = pd.read_csv(fname,
      names=dtype.keys(),
      dtype=dtype,
  )
  ...

User-Defined Functions (UDFs)

User-defined functions (UDFs) can be applied to dataframes with DataFrame.apply() and to series with Series.apply() or Series.map(). Bodo offers support for UDFs without the significant runtime penalty generally incurred in Pandas.

It is recommended to pass additional variables to UDFs explicitly, instead of directly using values from the main function. The latter results in the "captured" variables case, which is often error-prone and may result in compilation errors. Therefore, arguments should be passed directly to either Series.apply() or DataFrame.apply(). The Bodo compiler transforms the code to pass main function values as arguments to apply() automatically when possible.

For example, consider a UDF that appends a variable suffix to each string in a Series of strings. The proper way to write this function through Series.apply() is:

@bodo.jit
def add_suffix(S, suffix):
    return S.apply(lambda x, suf: x + suf, args=(suffix,))

Alternatively, arguments can be passed as named arguments like:

@bodo.jit
def add_suffix(S, suffix):
    return S.apply(lambda x, suf: x + suf, suf=suffix)

The same process can be applied in the DataFrame case using DataFrame.apply().

Type Inference for Object Data

Pandas stores some data types (e.g. strings) as object arrays which are untyped. Therefore, Bodo needs to infer the actual data type of object arrays when dataframes or series values are passed to JIT functions from regular Python. Bodo uses the first non-null value of the array to determine the type, and throws a warning if the array is empty or all nulls:

BodoWarning: Empty object array passed to Bodo, which causes ambiguity in typing. This can cause errors in parallel execution.

In this case, Bodo assumes the array is a string array, which is the most common case. However, this can cause errors if a distributed dataset is passed to Bodo and some other process has non-string data. This corner case can usually be avoided by load balancing the data across processes to avoid empty arrays.

General Merge Conditions

Within Pandas, the merge criteria supported by pd.merge are limited to equality between one or more pairs of keys. For some use cases this is not sufficient, and more generalized support is necessary. For example, with these limitations, a left outer join on the condition df1.A == df2.B & df2.C < df1.A cannot be computed efficiently.

Bodo supports these use cases by allowing users to pass general merge conditions to pd.merge. We plan to contribute this feature to Pandas to ensure full compatibility of Bodo and Pandas code.

General merge conditions are performed by providing the condition as a string via the on argument. Columns in the left table are referred to by left.`{column name}` and columns in the right table are referred to by right.`{column name}`.

To execute the example above, a user would use this function call:

@bodo.jit
def general_merge(df1, df2):
    return df1.merge(df2, on="left.`A` == right.`B` & right.`C` < left.`A`", how="left")

These calls have a few additional requirements:

  • The condition must be a constant string.

  • The condition must be of the form cond_1 & ... & cond_N where at least one cond_i is a simple equality. This restriction will be removed in a future release.

  • The columns specified in these conditions are limited to certain column types. We currently support boolean, integer, float, datetime64, timedelta64, datetime.date, and string columns.