7. Supported Pandas Operations

Below is a reference list of the Pandas data types and operations that Bodo supports. This list will expand regularly as we add support for more APIs. Optional arguments are not supported unless explicitly specified.

7.1. Data Types

Bodo supports the following data types as values in Pandas DataFrame and Series data structures. This covers all Pandas data types except TZ-aware datetime, Period, Interval, and Sparse (which will be supported in the future). Compared to Spark, equivalents of all Spark data types are supported.

  • Numpy booleans: np.bool_.

  • Numpy integer data types: np.int8, np.int16, np.int32, np.int64, np.uint8, np.uint16, np.uint32, np.uint64.

  • Numpy floating point data types: np.float32, np.float64.

  • Numpy datetime data types: np.dtype("datetime64[ns]") and np.dtype("timedelta64[ns]"). The resolution currently has to be ns, which covers most practical use cases.

  • Numpy complex data types: np.complex64 and np.complex128.

  • Strings (including nulls).

  • datetime.date values (including nulls).

  • datetime.timedelta values (including nulls).

  • Pandas nullable integers.

  • Pandas nullable booleans.

  • Pandas Categoricals.

  • Lists of integer, float, and string values.

  • decimal.Decimal values (including nulls). Decimal values are stored in the fixed-precision Apache Arrow Decimal128 format, similar to PySpark decimals. The decimal type has a precision (the maximum total number of digits) and a scale (the number of digits to the right of the decimal point) attribute, specifying how the stored data is interpreted. For example, the (4, 2) case can store values from -99.99 to 99.99. The precision can be up to 38, and the scale must be less than or equal to the precision. Arbitrary-precision Python decimal.Decimal values are converted with a precision of 38 and a scale of 18.
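For intuition, the effect of the scale can be mimicked with Python's stdlib decimal module by quantizing to two fractional digits (a plain-Python sketch, not Bodo-specific):

```python
from decimal import Decimal

# Quantizing to two fractional places mirrors a scale of 2; the
# precision bounds the total number of digits the value can hold.
val = Decimal("12.344").quantize(Decimal("0.01"))
print(val)  # 12.34
```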

In addition, it may be desirable to specify type annotations in some cases (e.g. file I/O array input types). Typically these types are array types and they all can be accessed directly from the bodo module. The following table can be used to select the necessary Bodo Type based upon the desired Python, Numpy, or Pandas type.

Bodo Type Name

Equivalent Python, Numpy, or Pandas type

np.bool_[:], np.int8[:], …, np.int64[:], np.uint8[:], …, np.uint64[:], np.float32[:], np.float64[:]

One-dimensional Numpy array of the given type. A full list of supported Numpy types can be found here. A multidimensional array can be specified by adding additional colons (e.g. bodo.int32[:, :, :] for a three-dimensional array).

bodo.string_array_type

Array of nullable strings

bodo.IntegerArrayType(integer_type)

Array of Pandas nullable integers of the given integer type
e.g. bodo.IntegerArrayType(bodo.int64)

bodo.boolean_array

Array of Pandas nullable booleans

bodo.datetime64ns[:]

Array of Numpy datetime64 values

bodo.timedelta64ns[:]

Array of Numpy timedelta64 values

bodo.datetime_date_array_type

Array of datetime.date types

bodo.datetime_timedelta_array_type

Array of datetime.timedelta types

bodo.DecimalArrayType(precision, scale)

Array of Apache Arrow Decimal128 values with the given precision and scale
e.g. bodo.DecimalArrayType(38, 18)

bodo.StructArrayType(data_types, field_names)

Array of a user defined struct with the given tuple of data types and field names
e.g. bodo.StructArrayType((bodo.int32[:], bodo.datetime64ns[:]), ("a", "b"))

bodo.MapArrayType(key_arr_type, value_arr_type)

Array of Python dictionaries with the given key and value array types.
e.g. bodo.MapArrayType(bodo.uint16[:], bodo.string_array_type)

bodo.DatetimeIndexType(name_type)

Index of datetime64 values with a given name type.
e.g. bodo.DatetimeIndexType(bodo.string_type)

bodo.NumericIndexType(data_type, name_type)

Numeric index (pd.Int64Index, pd.UInt64Index, or pd.Float64Index), based upon the given data type and name type.
e.g. bodo.NumericIndexType(bodo.float64, bodo.string_type)

bodo.PeriodIndexType(freq, name_type)

pd.PeriodIndex with a given frequency and name type.
e.g. bodo.PeriodIndexType('A', bodo.string_type)

bodo.RangeIndexType(name_type)

RangeIndex with a given name type.
e.g. bodo.RangeIndexType(bodo.string_type)

bodo.StringIndexType(name_type)

Index of strings with a given name type.
e.g. bodo.StringIndexType(bodo.string_type)

bodo.TimedeltaIndexType(name_type)

Index of timedelta64 values with a given name type.
e.g. bodo.TimedeltaIndexType(bodo.string_type)

bodo.SeriesType(dtype=data_type, index=index_type, name_typ=name_type)

Series with a given data type, index type, and name type.
e.g. bodo.SeriesType(bodo.float32, bodo.DatetimeIndexType(bodo.string_type), bodo.string_type)

bodo.DataFrameType(data_types_tuple, index_type, column_names)

DataFrame with a tuple of data types, an index type, and the names of the columns.
e.g. bodo.DataFrameType((bodo.int64[::1], bodo.float64[::1]), bodo.RangeIndexType(bodo.none), ("A", "B"))

7.2. Input/Output

Also see Amazon S3 and Hadoop Distributed File System (HDFS) configuration requirements and more on File I/O.

  • pandas.read_csv()

    • example usage and more system specific instructions

    • filepath_or_buffer should be a string and is required. It can point to a single CSV file, or a directory containing multiple partitioned CSV files (the files inside the directory must have a csv file extension).

    • Arguments sep, delimiter, header, names, index_col, usecols, dtype, skiprows, and parse_dates are supported.

    • Either names and dtype arguments should be provided to enable type inference, or filepath_or_buffer should be a constant string for Bodo to infer types by looking at the file at compile time.

    • names, usecols, parse_dates should be constant lists.

    • dtype should be a constant dictionary of strings and types.

    • When a CSV file is read in parallel (distributed mode) and each process reads only a portion of the file, reading columns that contain line breaks is not supported.
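As a plain-pandas illustration of the supported calling style (constant names, dtype, and parse_dates lists; the file path and column names below are made up for the example):

```python
import os
import tempfile

import pandas as pd

# Write a small CSV so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "example.csv")
with open(path, "w") as f:
    f.write("1,2.5,2020-01-01\n2,3.5,2020-01-02\n")

# Constant names/dtype/parse_dates, as Bodo requires.
df = pd.read_csv(
    path,
    names=["A", "B", "C"],
    dtype={"A": "int64", "B": "float64"},
    parse_dates=["C"],
)
print(df.A.sum())  # 3
```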

  • pandas.read_excel()

    • output dataframe cannot be parallelized automatically yet.

    • only arguments io, sheet_name, header, names, comment, dtype, skiprows, parse_dates are supported.

    • io should be a string and is required.

    • Either names and dtype arguments should be provided to enable type inference, or io should be a constant string for Bodo to infer types by looking at the file at compile time.

    • sheet_name, header, comment, and skiprows should be constant if provided.

    • names and parse_dates should be constant lists if provided.

    • dtype should be a constant dictionary of strings and types if provided.

  • pandas.read_sql()

    • example usage and more system specific instructions

    • Argument sql is supported, but only as a string; SQLAlchemy Selectable is not supported. There is no restriction on the form of the SQL query.

    • Argument con is supported, but only as a string; SQLAlchemy connectable is not supported.

    • Argument index_col is supported.

    • Arguments chunksize, columns, coerce_float, and params are not supported.
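For a quick local illustration of the call shape with plain pandas (Bodo itself requires con as a connection string; the table and column names below are made up), an in-memory SQLite database works:

```python
import sqlite3

import pandas as pd

# Build a tiny table so the query has something to read.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, "x"), (2, "y")])

# Any SQL string works; index_col selects the index column.
df = pd.read_sql("SELECT a, b FROM t", conn, index_col="a")
print(list(df.b))  # ['x', 'y']
```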

  • pandas.read_parquet()

    • example usage and more system specific instructions

    • Arguments path and columns are supported. columns should be a constant list of strings.

    • If path is constant, Bodo finds the schema from file at compilation time. Otherwise, schema should be provided. For example:

      @bodo.jit(locals={'df':{'A': bodo.float64[:],
                              'B': bodo.string_array_type}})
      def impl(f):
        df = pd.read_parquet(f)
        return df
      
  • pandas.read_json()

    • Example usage and more system specific instructions

    • Only supports reading the JSON Lines text file format (pd.read_json(filepath_or_buffer, orient='records', lines=True)) and regular multi-line JSON files (pd.read_json(filepath_or_buffer, orient='records', lines=False)).

    • Argument filepath_or_buffer is supported: it can point to a single JSON file, or a directory containing multiple partitioned JSON files. When reading a directory, the JSON files inside it must be in the JSON Lines format and have a json file extension.

    • Argument orient = 'records' is used as default, instead of Pandas’ default 'columns' for dataframes. 'records' is the only supported value for orient.

    • Argument typ is supported. 'frame' is the only supported value for typ.

    • The dtype argument should be provided to enable type inference, or filepath_or_buffer should be a constant string for Bodo to infer types by looking at the file at compile time (not supported for multi-line JSON files).

    • Arguments convert_dates, precise_float, lines are supported.
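A minimal plain-pandas sketch of reading the JSON Lines format (the file contents and column names below are made up):

```python
import os
import tempfile

import pandas as pd

# Write a small JSON Lines file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "example.json")
with open(path, "w") as f:
    f.write('{"A": 1, "B": "x"}\n{"A": 2, "B": "y"}\n')

# JSON Lines text file format: orient='records', lines=True.
df = pd.read_json(path, orient="records", lines=True)
print(df.A.tolist())  # [1, 2]
```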

  • pandas.DataFrame.to_sql()

    • example usage and more system specific instructions

    • Argument con is supported, but only as a string; SQLAlchemy connectable is not supported.

    • Arguments name, schema, if_exists, index, index_label, dtype, and method are supported.

    • Argument chunksize is not supported.

7.3. General functions

Data manipulations:

  • pandas.crosstab()

    • Annotation of pivot values is required. For example, @bodo.jit(pivots={'pt': ['small', 'large']}) declares the output table pt will have columns called small and large.

  • pandas.merge()

    • Arguments left, right should be dataframes.

    • how, on, left_on, right_on, left_index, and right_index are supported but should be constant values.

    • The output dataframe is not sorted by default for better parallel performance (Pandas may preserve key order depending on how). One can use explicit sort if needed.

  • pandas.merge_asof() (similar arguments to merge)

  • pandas.concat() Input list or tuple of dataframes or series is supported.
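The merge pattern above (constant how/on arguments plus an explicit sort for deterministic order) can be sketched in plain pandas; the column names are made up:

```python
import pandas as pd

left = pd.DataFrame({"key": [2, 1], "x": [20, 10]})
right = pd.DataFrame({"key": [1, 2], "y": ["a", "b"]})

# how and on must be constant values under Bodo.
out = pd.merge(left, right, how="inner", on="key")

# Explicit sort for a deterministic row order after a parallel join.
out = out.sort_values(by="key").reset_index(drop=True)
print(out.key.tolist())  # [1, 2]
```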

Top-level missing data:

Top-level conversions:

  • pandas.to_numeric() Input can be a Series or array. Output type is float64 by default. Unlike Pandas, Bodo does not dynamically determine output type, and does not downcast to the smallest numerical type. downcast parameter should be used for type annotation of output. The errors argument is not supported currently (errors will be coerced by default).
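A plain-pandas sketch of the call (note that plain pandas infers the output type dynamically, while Bodo defaults to float64 unless downcast annotates it):

```python
import pandas as pd

s = pd.Series(["1.5", "2.5", "3.0"])

# Convert string values to a numeric Series.
out = pd.to_numeric(s)
print(out.dtype)  # float64
```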

Top-level dealing with datetime like:

  • pandas.date_range()

    • start, end, periods, freq, name and closed arguments are supported. This function is not parallelized yet.
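A minimal sketch using the supported arguments:

```python
import pandas as pd

# start, periods, and freq are among the supported arguments.
idx = pd.date_range(start="2020-01-01", periods=3, freq="D")
print(len(idx))  # 3
```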

7.4. Series

Bodo provides extensive Series support. However, operations between Series (+, -, *, /) do not implicitly align values based on their associated index values yet.

  • pandas.Series

    • Arguments data, index, and name are supported. data is required and can be a list, array, Series or Index. If data is Series and index is provided, implicit alignment is not performed yet.

Attributes:

Methods:

Conversion:

Indexing, iteration:

Location based indexing using [], iat, and iloc is supported. Changing values of existing string Series using these operators is not supported yet.

Binary operator functions:

The fill_value optional argument for binary functions below is supported.

Function application, GroupBy & Window:

Computations / Descriptive Stats:

Statistical functions below are supported without optional arguments unless support is explicitly mentioned.

Reindexing / Selection / Label manipulation:

Missing data handling:

Reshaping, sorting:

Time series-related:

Datetime properties:

String handling:

7.5. DataFrame

Bodo provides extensive DataFrame support documented below.

  • pandas.DataFrame

    data argument can be a constant dictionary or 2d Numpy array. Other arguments are also supported.

Attributes and underlying data:

Conversion:

Indexing, iteration:

Function application, GroupBy & Window:

  • pandas.DataFrame.apply()

  • pandas.DataFrame.groupby() by should be a constant column label or column labels. sort=False is set by default. as_index argument is supported but MultiIndex is not supported yet (will just drop output MultiIndex).

  • pandas.DataFrame.rolling() window argument should be integer or a time offset as a constant string. center and on arguments are also supported.
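A plain-pandas sketch of the groupby pattern above, passing sort=False explicitly to match Bodo's default (the column names are made up):

```python
import pandas as pd

df = pd.DataFrame({"A": ["b", "a", "b"], "B": [1, 2, 3]})

# Under Bodo, `by` must be a constant label and sort=False is the
# default; passing it explicitly makes plain pandas behave the same.
out = df.groupby("A", as_index=False, sort=False).sum()
print(out.B.tolist())  # [4, 2]
```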

Computations / Descriptive Stats:

Reindexing / Selection / Label manipulation:

Missing data handling:

Reshaping, sorting, transposing:

  • pandas.DataFrame.pivot_table()

    • Arguments values, index, columns and aggfunc are supported.

    • Annotation of pivot values is required. For example, @bodo.jit(pivots={'pt': ['small', 'large']}) declares the output pivot table pt will have columns called small and large.

  • pandas.DataFrame.sample() is supported except for the arguments random_state, weights and axis.

  • pandas.DataFrame.sort_index() ascending argument is supported.

  • pandas.DataFrame.sort_values() by argument should be constant string or constant list of strings. ascending and na_position arguments are supported.
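A plain-pandas sketch of sort_values with the supported arguments (the column names are made up):

```python
import pandas as pd

df = pd.DataFrame({"A": [3.0, None, 1.0], "B": ["x", "y", "z"]})

# `by` as a constant string; missing values placed last.
out = df.sort_values(by="A", ascending=True, na_position="last")
print(out.B.tolist())  # ['z', 'x', 'y']
```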

Combining / joining / merging:

  • pandas.DataFrame.append() Appending a dataframe or a list of dataframes is supported. ignore_index=True is necessary and set by default.

  • pandas.DataFrame.assign() Function arguments are not supported yet.

  • pandas.DataFrame.join() Only dataframes are supported. The output dataframe is not sorted by default for better parallel performance (Pandas may preserve key order depending on how). One can use explicit sort if needed.

  • pandas.DataFrame.merge() Only dataframes are supported. how, on, left_on, right_on, left_index, and right_index are supported but should be constant values.

Time series-related:

Serialization / IO / conversion:

Also see Amazon S3 and Hadoop Distributed File System (HDFS) configuration requirements and more on File I/O.

7.6. Index objects

7.6.1. Index

Properties

Modifying and computations:

The min/max methods are supported for DatetimeIndex. They are supported without optional arguments
(NaT output for empty or all NaT input not supported yet):

Missing values:

Conversion:

7.6.2. Numeric Index

Numeric index objects RangeIndex, Int64Index, UInt64Index and Float64Index are supported as index to dataframes and series. Constructing them in Bodo functions, passing them to Bodo functions (unboxing), and returning them from Bodo functions (boxing) are also supported.

7.6.3. DatetimeIndex

DatetimeIndex objects are supported. They can be constructed, boxed/unboxed, and set as index to dataframes and series.

  • pandas.DatetimeIndex

    • Only data argument is supported, and can be array-like of datetime64['ns'], int64 or strings.

Date fields of DatetimeIndex are supported:

Subtraction of Timestamp from DatetimeIndex and vice versa is supported.

Comparison operators ==, !=, >=, >, <=, < between DatetimeIndex and a string of datetime are supported.
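A plain-pandas sketch of comparing a DatetimeIndex against a datetime string:

```python
import pandas as pd

idx = pd.DatetimeIndex(["2019-12-31", "2020-01-02"])

# Comparison against a datetime string is elementwise.
mask = idx > "2020-01-01"
print(mask.tolist())  # [False, True]
```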

7.6.4. TimedeltaIndex

TimedeltaIndex objects are supported. They can be constructed, boxed/unboxed, and set as index to dataframes and series.

  • pandas.TimedeltaIndex

    • Only data argument is supported, and can be array-like of timedelta64['ns'] or int64.

Time fields of TimedeltaIndex are supported:

7.6.5. PeriodIndex

PeriodIndex objects can be boxed/unboxed and set as index to dataframes and series. Operations on them will be supported in upcoming releases.

7.9. GroupBy

The operations are documented on pandas.DataFrame.groupby.

7.10. Integer NA issue in Pandas

DataFrame and Series objects with integer data need special care due to integer NA issues in Pandas. By default, Pandas dynamically converts integer columns to floating point when missing values (NAs) are needed (which can result in loss of precision). This is because Pandas uses the NaN floating point value as NA, and Numpy does not support NaN values for integers. Bodo does not perform this conversion unless enough information is available at compilation time.

Pandas introduced a new nullable integer data type that can solve this issue, which is also supported by Bodo. For example, this code reads column A into a nullable integer array (the capital "I" denotes the nullable integer type):

import bodo
import pandas as pd

@bodo.jit
def example(fname):
  dtype = {'A': 'Int64', 'B': 'float64'}
  df = pd.read_csv(fname,
      names=dtype.keys(),
      dtype=dtype,
  )
  ...
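A plain-pandas sketch of what the nullable Int64 dtype buys: integers stay integers in the presence of missing values, instead of being upcast to float64:

```python
import pandas as pd

# The capital-"I" Int64 dtype keeps integer data intact when values
# are missing, avoiding the silent conversion to floating point.
s = pd.Series([1, None, 3], dtype="Int64")
print(s.dtype)  # Int64
print(s.sum())  # 4
```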