General functions¶

Data manipulations¶

`pd.pivot`¶

pandas.pivot(data, values=None, index=None, columns=None)

Supported Arguments

argument	datatypes
`data`	DataFrame
`values`	Constant Column Label or list of labels
`index`	Constant Column Label or list of labels
`columns`	Constant Column Label

Note

The number of columns and the column names of the output DataFrame won't be known at compile time. To update the typing information of the DataFrame, pass it back to Python.

Example Usage

>>> @bodo.jit
... def f():
...   df = pd.DataFrame({"A": ["X","X","X","X","Y","Y"], "B": [1,2,3,4,5,6], "C": [10,11,12,20,21,22]})
...   pivoted_tbl = pd.pivot(df, columns="A", index="B", values="C")
...   return pivoted_tbl
>>> f()
A     X     Y
B
1  10.0   NaN
2  11.0   NaN
3  12.0   NaN
4  20.0   NaN
5   NaN  21.0
6   NaN  22.0

`pd.pivot_table`¶

pandas.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All', observed=False, sort=True)

Supported Arguments

argument	datatypes
`data`	DataFrame
`values`	Constant Column Label or list of labels
`index`	Constant Column Label or list of labels
`columns`	Constant Column Label
`aggfunc`	String Constant

Note

This code takes one of two paths depending on whether pivot values are annotated. When pivot values are annotated, the output columns are set to the annotated values. For example, @bodo.jit(pivots={'pt': ['small', 'large']}) declares that the output pivot table pt will have columns called small and large.

If pivot values are not annotated, the number of columns and the column names of the output DataFrame won't be known at compile time. To update the typing information of the DataFrame, pass it back to Python.

Example Usage

>>> @bodo.jit(pivots={'pivoted_tbl': ['X', 'Y']})
... def f():
...   df = pd.DataFrame({"A": ["X","X","X","X","Y","Y"], "B": [1,2,3,4,5,6], "C": [10,11,12,20,21,22]})
...   pivoted_tbl = pd.pivot_table(df, columns="A", index="B", values="C", aggfunc="mean")
...   return pivoted_tbl
>>> f()
      X     Y
B
1  10.0   NaN
2  11.0   NaN
3  12.0   NaN
4  20.0   NaN
5   NaN  21.0
6   NaN  22.0
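
To illustrate why the output columns cannot be known without annotation, here is a minimal sketch in plain pandas (no `bodo.jit` involved; the second DataFrame `df_z` is invented for illustration). The pivoted column labels come from the distinct values in column `A`, so they change whenever the data changes:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": ["X", "X", "X", "X", "Y", "Y"],
        "B": [1, 2, 3, 4, 5, 6],
        "C": [10, 11, 12, 20, 21, 22],
    }
)
pivoted = pd.pivot_table(df, columns="A", index="B", values="C", aggfunc="mean")
print(list(pivoted.columns))  # labels are taken from the values in column A

# Hypothetical variation: same shape, but column A now holds a third value.
df_z = df.assign(A=["X", "X", "Z", "Z", "Y", "Y"])
pivoted_z = pd.pivot_table(df_z, columns="A", index="B", values="C", aggfunc="mean")
print(list(pivoted_z.columns))  # a different set of output columns
```

Annotating the pivot values, as in the example above, fixes the set of output columns up front.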

`pd.crosstab`¶

pandas.crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, margins_name='All', dropna=True, normalize=False)

Supported Arguments

argument	datatypes
`index`	SeriesType
`columns`	SeriesType

Note

Annotation of pivot values is required. For example, @bodo.jit(pivots={'pt': ['small', 'large']}) declares the output table pt will have columns called small and large.

Example Usage

>>> @bodo.jit(pivots={"pt": ["small", "large"]})
... def f(df):
...   pt = pd.crosstab(df.A, df.C)
...   return pt

>>> list_A = ["foo", "foo", "bar", "bar", "bar", "bar"]
>>> list_C = ["small", "small", "large", "small", "small", "middle"]
>>> df = pd.DataFrame({"A": list_A, "C": list_C})
>>> f(df)

       small  large
index
foo        2      0
bar        2      1
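
The input above contains a third value, "middle", that is not listed in the annotation, and the output shown has only the annotated columns. As a rough point of comparison, the following plain pandas sketch (no `bodo.jit`; the names `full` and `subset` are made up for this example) builds the full cross-tabulation and then keeps only the annotated labels. Plain pandas sorts the row labels, so the row order may differ from the output above.

```python
import pandas as pd

list_A = ["foo", "foo", "bar", "bar", "bar", "bar"]
list_C = ["small", "small", "large", "small", "small", "middle"]
df = pd.DataFrame({"A": list_A, "C": list_C})

# Plain pandas: one output column per distinct value of C, including "middle".
full = pd.crosstab(df.A, df.C)
print(list(full.columns))

# Keep only the labels that the pivots annotation names.
subset = full[["small", "large"]]
print(subset)
```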

`pd.cut`¶

pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates="raise", ordered=True)

Supported Arguments

argument	datatypes
`x`	Series or Array like
`bins`	Integer or Array like
`include_lowest`	Boolean

Example Usage

 >>> @bodo.jit
 ... def f(S):
 ...   bins = 4
 ...   include_lowest = True
 ...   return pd.cut(S, bins, include_lowest=include_lowest)

 >>> S = pd.Series(
 ...    [-2, 1, 3, 4, 5, 11, 15, 20, 22],
 ...    ["a1", "a2", "a3", "a4", "a5", "a6", "a7", "a8", "a9"],
 ...    name="ABC",
 ... )
 >>> f(S)

a1    (-2.025, 4.0]
a2    (-2.025, 4.0]
a3    (-2.025, 4.0]
a4    (-2.025, 4.0]
a5      (4.0, 10.0]
a6     (10.0, 16.0]
a7     (10.0, 16.0]
a8     (16.0, 22.0]
a9     (16.0, 22.0]
Name: ABC, dtype: category
Categories (4, interval[float64, right]): [(-2.025, 4.0] < (4.0, 10.0] < (10.0, 16.0] < (16.0, 22.0]]

`pd.qcut`¶

pandas.qcut(x, q, labels=None, retbins=False, precision=3, duplicates="raise")

Supported Arguments

argument	datatypes
`x`	Series or Array like
`q`	Integer or Array like of floats

Example Usage

 >>> @bodo.jit
 ... def f(S):
 ...   q = 4
 ...   return pd.qcut(S, q)

 >>> S = pd.Series(
 ...      [-2, 1, 3, 4, 5, 11, 15, 20, 22],
 ...      ["a1", "a2", "a3", "a4", "a5", "a6", "a7", "a8", "a9"],
 ...      name="ABC",
 ... )
 >>> f(S)

 a1    (-2.001, 3.0]
 a2    (-2.001, 3.0]
 a3    (-2.001, 3.0]
 a4       (3.0, 5.0]
 a5       (3.0, 5.0]
 a6      (5.0, 15.0]
 a7      (5.0, 15.0]
 a8     (15.0, 22.0]
 a9     (15.0, 22.0]
 Name: ABC, dtype: category
 Categories (4, interval[float64, right]): [(-2.001, 3.0] < (3.0, 5.0] < (5.0, 15.0] < (15.0, 22.0]]

`pd.merge`¶

pandas.merge(left, right, how="inner", on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=("_x", "_y"), copy=True, indicator=False, validate=None, _bodo_na_equal=True)

Supported Arguments

argument	datatypes	other requirements
`left`	DataFrame
`right`	DataFrame
`how`	String	Must be one of `"inner"`, `"outer"`, `"left"`, `"right"`. Must be constant at Compile Time
`on`	Column Name, List of Column Names, or General Merge Condition String (see merge-notes)	Must be constant at Compile Time
`left_on`	Column Name or List of Column Names	Must be constant at Compile Time
`right_on`	Column Name or List of Column Names	Must be constant at Compile Time
`left_index`	Boolean	Must be constant at Compile Time
`right_index`	Boolean	Must be constant at Compile Time
`suffixes`	Tuple of Strings	Must be constant at Compile Time
`indicator`	Boolean	Must be constant at Compile Time
`_bodo_na_equal`	Boolean	Must be constant at Compile Time. This argument is unique to Bodo and not available in Pandas. If False, Bodo won't consider NA/nan keys as equal, which differs from Pandas.

Important

The argument _bodo_na_equal is unique to Bodo and not available in Pandas. If it is False, Bodo won't consider NA/nan keys as equal, which differs from Pandas.
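
For reference, the Pandas behavior that this flag departs from can be seen in a small plain pandas sketch (no `bodo.jit`; the DataFrames and variable names here are invented for illustration). Pandas pairs up rows whose keys are both NaN. Dropping NA keys before an inner join is one way to approximate, in plain pandas, what the note describes for `_bodo_na_equal=False`; this is an assumption about equivalence for inner joins only, not a statement about how Bodo implements it.

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({"key": [1.0, np.nan], "A": [10, 20]})
right = pd.DataFrame({"key": [np.nan, 2.0], "B": [30, 40]})

# Plain pandas treats NaN keys as equal, so the two NaN rows are matched.
matched = pd.merge(left, right, how="inner", on="key")
print(matched)

# Approximation of "NA keys never match" for an inner join:
# remove rows with NA keys on both sides before merging.
no_na = pd.merge(
    left.dropna(subset=["key"]),
    right.dropna(subset=["key"]),
    how="inner",
    on="key",
)
print(len(no_na))
```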

Merge Notes¶

Output Ordering:

The output DataFrame is not sorted by default, for better parallel performance (Pandas may preserve key order depending on `how`). Use an explicit sort if a specific order is needed.

General Merge Conditions:

Within Pandas, the merge criteria supported by pd.merge are limited to equality between 1 or more pairs of keys. For some use cases, this is not sufficient and more generalized support is necessary. For example, with these limitations, a left outer join where df1.A == df2.B & df2.C < df1.A cannot be efficiently computed.

Bodo supports these use cases by allowing users to pass general merge conditions to pd.merge. We plan to contribute this feature to Pandas to ensure full compatibility of Bodo and Pandas code.

General merge conditions are performed by providing the condition as a string via the on argument. Columns in the left table are referred to by left.{column name} and columns in the right table are referred to by right.{column name}.

Here's an example demonstrating the above:
```
>>> @bodo.jit
... def general_merge(df1, df2):
...   return df1.merge(df2, on="left.`A` == right.`B` & right.`C` < left.`A`", how="left")

>>> df1 = pd.DataFrame({"col": [2, 3, 5, 1, 2, 8], "A": [4, 6, 3, 9, 9, -1]})
>>> df2 = pd.DataFrame({"B": [1, 2, 9, 3, 2], "C": [1, 7, 2, 6, 5]})
>>> general_merge(df1, df2)

   col  A     B     C
0    2  4  <NA>  <NA>
1    3  6  <NA>  <NA>
2    5  3  <NA>  <NA>
3    1  9     9     2
4    2  9     9     2
5    8 -1  <NA>  <NA>
```
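
For comparison, here is one way the same left join could be emulated in plain pandas, which only supports equality keys. This is a sketch under stated assumptions, not how Bodo evaluates the condition: the helper column `left_row` and the intermediate name `pairs` are invented for this example. It joins on the equality part, filters on the inequality, and then joins back so that unmatched left rows are kept.

```python
import pandas as pd

df1 = pd.DataFrame({"col": [2, 3, 5, 1, 2, 8], "A": [4, 6, 3, 9, 9, -1]})
df2 = pd.DataFrame({"B": [1, 2, 9, 3, 2], "C": [1, 7, 2, 6, 5]})

# Tag each left row so it can be found again after filtering.
left = df1.reset_index().rename(columns={"index": "left_row"})

# Step 1: equality part of the condition (left.A == right.B).
pairs = left.merge(df2, left_on="A", right_on="B", how="inner")

# Step 2: inequality part of the condition (right.C < left.A).
pairs = pairs[pairs["C"] < pairs["A"]]

# Step 3: left join back so rows without a surviving match get NaN.
result = left.merge(
    pairs[["left_row", "B", "C"]], on="left_row", how="left"
).drop(columns="left_row")
print(result)
```

The extra join and the intermediate table are what a single general merge condition avoids.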

Calls that pass a general merge condition have a few additional requirements:

- The condition must be a constant string.
- The condition must be of the form cond_1 & ... & cond_N where at least one cond_i is a simple equality. This restriction will be removed in a future release.
- The columns specified in these conditions are limited to certain column types. We currently support boolean, integer, float, datetime64, timedelta64, datetime.date, and string columns.

Example Usage
```
>>> @bodo.jit
... def f(df1, df2):
...   return pd.merge(df1, df2, how="inner", on="key")

>>> df1 = pd.DataFrame({"key": [2, 3, 5, 1, 2, 8], "A": np.array([4, 6, 3, 9, 9, -1], float)})
>>> df2 = pd.DataFrame({"key": [1, 2, 9, 3, 2], "B": np.array([1, 7, 2, 6, 5], float)})
>>> f(df1, df2)

   key    A    B
0    2  4.0  7.0
1    2  4.0  5.0
2    3  6.0  6.0
3    1  9.0  1.0
4    2  9.0  7.0
5    2  9.0  5.0
```
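
As the Output Ordering note says, the row order of a merge result should not be relied on. A minimal sketch of adding an explicit sort, shown here in plain pandas without `bodo.jit` (the choice of sort columns and the name `ordered` are just for illustration):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"key": [2, 3, 5, 1, 2, 8], "A": np.array([4, 6, 3, 9, 9, -1], float)})
df2 = pd.DataFrame({"key": [1, 2, 9, 3, 2], "B": np.array([1, 7, 2, 6, 5], float)})

merged = pd.merge(df1, df2, how="inner", on="key")

# Sort on enough columns to make the order fully deterministic,
# then rebuild a clean 0..n-1 index.
ordered = merged.sort_values(["key", "A", "B"]).reset_index(drop=True)
print(ordered)
```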

`pd.merge_asof`¶

pandas.merge_asof(left, right, on=None, left_on=None, right_on=None, left_index=False, right_index=False, by=None, left_by=None, right_by=None, suffixes=("_x", "_y"), tolerance=None, allow_exact_matches=True, direction="backward")

Supported Arguments

argument	datatypes	other requirements
`left`	DataFrame
`right`	DataFrame
`on`	Column Name or List of Column Names	Must be constant at Compile Time
`left_on`	Column Name or List of Column Names	Must be constant at Compile Time
`right_on`	Column Name or List of Column Names	Must be constant at Compile Time
`left_index`	Boolean	Must be constant at Compile Time
`right_index`	Boolean	Must be constant at Compile Time
`suffixes`	Tuple of Strings	Must be constant at Compile Time

Example Usage

>>> @bodo.jit
... def f(df1, df2):
...   return pd.merge_asof(df1, df2, on="time")

>>> df1 = pd.DataFrame(
...   {
...       "time": pd.DatetimeIndex(["2017-01-03", "2017-01-06", "2017-02-21"]),
...       "B": [4, 5, 6],
...   }
... )
>>> df2 = pd.DataFrame(
...   {
...       "time": pd.DatetimeIndex(
...           ["2017-01-01", "2017-01-02", "2017-01-04", "2017-02-23", "2017-02-25"]
...       ),
...       "A": [2, 3, 7, 8, 9],
...   }
... )
>>> f(df1, df2)

        time  B  A
0 2017-01-03  4  3
1 2017-01-06  5  7
2 2017-02-21  6  7

`pd.concat`¶

pandas.concat(objs, axis=0, join="outer", join_axes=None, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=None, copy=True)

Supported Arguments

argument	datatypes	other requirements
`objs`	List or Tuple of DataFrames/Series
`axis`	Integer with either 0 or 1	Must be constant at Compile Time
`ignore_index`	Boolean	Must be constant at Compile Time

Important

Bodo currently concatenates local data chunks for distributed datasets, which does not preserve global order of concatenated objects in output.

Example Usage

>>> @bodo.jit
... def f(df1, df2):
...     return pd.concat([df1, df2], axis=1)

>>> df1 = pd.DataFrame({"A": [3, 2, 1, -4, 7]})
>>> df2 = pd.DataFrame({"B": [3, 25, 1, -4, -24]})
>>> f(df1, df2)

   A   B
0  3   3
1  2  25
2  1   1
3 -4  -4
4  7 -24

`pd.get_dummies`¶

pandas.get_dummies(data, prefix=None, prefix_sep="_", dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)

Supported Arguments

argument	datatypes	other requirements
`data`	Array or Series with Categorical dtypes	Categories must be known at compile time.

Example Usage

>>> @bodo.jit
... def f(S):
...     return pd.get_dummies(S)

>>> S = pd.Series(["CC", "AA", "B", "D", "AA", None, "B", "CC"]).astype("category")
>>> f(S)

   AA  B  CC  D
0   0  0   1  0
1   1  0   0  0
2   0  1   0  0
3   0  0   0  1
4   1  0   0  0
5   0  0   0  0
6   0  1   0  0
7   0  0   1  0

Top-level missing data¶

`pd.isna`¶

pandas.isna(obj)

Supported Arguments

argument	datatypes
`obj`	DataFrame, Series, Index, Array, or Scalar

Example Usage

>>> @bodo.jit
... def f(df):
...     return pd.isna(df)

>>> df = pd.DataFrame(
...    {"A": ["AA", np.nan, "", "D", "GG"], "B": [1, 8, 4, -1, 2]},
...    [1.1, -2.1, 7.1, 0.1, 3.1],
... )
>>> f(df)

       A      B
1.1  False  False
-2.1   True  False
7.1  False  False
0.1  False  False
3.1  False  False

`pd.isnull`¶

pandas.isnull(obj)

Supported Arguments

argument	datatypes
`obj`	DataFrame, Series, Index, Array, or Scalar

Example Usage

>>> @bodo.jit
... def f(df):
...     return pd.isnull(df)

>>> df = pd.DataFrame(
...    {"A": ["AA", np.nan, "", "D", "GG"], "B": [1, 8, 4, -1, 2]},
...    [1.1, -2.1, 7.1, 0.1, 3.1],
... )
>>> f(df)

       A      B
1.1  False  False
-2.1   True  False
7.1  False  False
0.1  False  False
3.1  False  False

`pd.notna`¶

pandas.notna(obj)

Supported Arguments

argument	datatypes
`obj`	DataFrame, Series, Index, Array, or Scalar

Example Usage

 >>> @bodo.jit
 ... def f(df):
 ...     return pd.notna(df)

 >>> df = pd.DataFrame(
 ...    {"A": ["AA", np.nan, "", "D", "GG"], "B": [1, 8, 4, -1, 2]},
 ...    [1.1, -2.1, 7.1, 0.1, 3.1],
 ... )
 >>> f(df)

           A     B
  1.1   True  True
 -2.1  False  True
  7.1   True  True
  0.1   True  True
  3.1   True  True

`pd.notnull`¶

pandas.notnull(obj)

Supported Arguments

argument	datatypes
`obj`	DataFrame, Series, Index, Array, or Scalar

Example Usage

>>> @bodo.jit
... def f(df):
...     return pd.notnull(df)

>>> df = pd.DataFrame(
...    {"A": ["AA", np.nan, "", "D", "GG"], "B": [1, 8, 4, -1, 2]},
...    [1.1, -2.1, 7.1, 0.1, 3.1],
... )
>>> f(df)

       A     B
1.1   True  True
-2.1  False  True
7.1   True  True
0.1   True  True
3.1   True  True

Top-level conversions¶

`pd.to_numeric`¶

pandas.to_numeric(arg, errors="raise", downcast=None)

Supported Arguments

argument	datatypes	other requirements
`arg`	Series or Array
`downcast`	String and one of (`'integer'`, `'signed'`, `'unsigned'`, `'float'`)	Must be constant at Compile Time

Note

- Output type is float64 by default.
- Unlike Pandas, Bodo does not dynamically determine the output type, and does not downcast to the smallest numerical type.
- The downcast parameter should be used for type annotation of the output.

Example Usage
```
>>> @bodo.jit
... def f(S):
...     return pd.to_numeric(S, errors="coerce", downcast="integer")

>>> S = pd.Series(["1", "3", "12", "4", None, "-555"])
>>> f(S)

0       1
1       3
2      12
3       4
4    <NA>
5    -555
dtype: Int64
```
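
The notes above contrast Bodo with the dynamic typing that Pandas performs. For reference, this plain pandas sketch (no `bodo.jit`; the input Series are invented for illustration) shows that dynamic behavior: with `downcast="integer"`, Pandas picks the smallest integer type that fits the data, so the result dtype depends on the values. The inputs are created with `dtype=object` so the example does not depend on the default string dtype of the installed pandas version.

```python
import numpy as np
import pandas as pd

small_values = pd.Series(["1", "3", "12"], dtype=object)
wide_values = pd.Series(["1", "3", "70000"], dtype=object)

# Plain pandas inspects the parsed values and downcasts as far as it can.
small = pd.to_numeric(small_values, downcast="integer")
wide = pd.to_numeric(wide_values, downcast="integer")

print(small.dtype)  # smallest integer type that holds 1, 3, 12
print(wide.dtype)   # a wider type, because 70000 does not fit in 16 bits
```

A compiler that needs to fix the output type ahead of time cannot make this choice from the data, which is consistent with the note that downcast acts as a type annotation in Bodo.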

Top-level dealing with datetime and timedelta like¶

`pd.to_datetime`¶

pandas.to_datetime(arg, errors='raise', dayfirst=False, yearfirst=False, utc=None, format=None, exact=True, unit=None, infer_datetime_format=False, origin='unix', cache=True)

Supported Arguments

argument	datatypes	other requirements
`arg`	Series, Array or scalar of integers or strings
`errors`	String and one of ('ignore', 'raise', 'coerce')
`dayfirst`	Boolean
`yearfirst`	Boolean
`utc`	Boolean
`format`	String matching Pandas strftime/strptime
`exact`	Boolean
`unit`	String	Must be a valid Pandas timedelta unit
`infer_datetime_format`	Boolean
`origin`	Scalar string or timestamp value
`cache`	Boolean

Note

- The function is not optimized.
- Bodo doesn't support timezone-aware datetime values.

Example Usage

>>> @bodo.jit
... def f(val):
...     return pd.to_datetime(val, format="%Y-%d-%m")

>>> val = "2016-01-06"
>>> f(val)

Timestamp('2016-06-01 00:00:00')

`pd.to_timedelta`¶

pandas.to_timedelta(arg, unit=None, errors='raise')

Supported Arguments

argument	datatypes	other requirements
`arg`	Series, Array or scalar of integers or strings
`unit`	String	Must be a valid Pandas timedelta unit

Note

Passing string data as arg is not optimized.

Example Usage

>>> @bodo.jit
... def f(S):
...     return pd.to_timedelta(S, unit="D")

>>> S = pd.Series([1.0, 2.2, np.nan, 4.2], [3, 1, 0, -2], name="AA")
>>> f(S)

3   1 days 00:00:00
1   2 days 04:48:00
0               NaT
-2   4 days 04:48:00
Name: AA, dtype: timedelta64[ns]

`pd.date_range`¶

pandas.date_range(start=None, end=None, periods=None, freq=None, tz=None, normalize=False, name=None, closed=None, **kwargs)

Supported Arguments

argument	datatypes	other requirements
`start`	String or Timestamp
`end`	String or Timestamp
`periods`	Integer
`freq`	String	Must be a valid Pandas frequency
`name`	String
`closed`	String and one of (`'left'`, `'right'`)

Note

- Exactly three of start, end, periods, and freq must be provided.
- Bodo does not support kwargs, even for compatibility.
- This function is not parallelized yet.

Example Usage

>>> @bodo.jit
... def f():
...     return pd.date_range(start="2018-04-24", end="2018-04-27", periods=3)

>>> f()

DatetimeIndex(['2018-04-24 00:00:00', '2018-04-25 12:00:00',
              '2018-04-27 00:00:00'],
             dtype='datetime64[ns]', freq=None)
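
The "exactly three" rule can be seen in a short plain pandas sketch (no `bodo.jit`; the variable names are made up for this example). Any three of the four arguments determine the range, so the example above can also be written with start, periods, and freq. Passing all four is rejected by Pandas with a ValueError.

```python
import pandas as pd

# start + end + periods: the spacing is computed from the endpoints.
by_periods = pd.date_range(start="2018-04-24", end="2018-04-27", periods=3)

# start + periods + freq: the end point is computed from the spacing.
by_freq = pd.date_range(start="2018-04-24", periods=3, freq="36h")

print(list(by_periods) == list(by_freq))  # both describe the same three timestamps

# Supplying all four arguments over-specifies the range.
try:
    pd.date_range(start="2018-04-24", end="2018-04-27", periods=3, freq="36h")
    all_four_accepted = True
except ValueError:
    all_four_accepted = False
print(all_four_accepted)
```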

`pd.timedelta_range`¶

pandas.timedelta_range(start=None, end=None, periods=None, freq=None, name=None, closed=None)

Supported Arguments

argument	datatypes	other requirements
`start`	String or Timedelta
`end`	String or Timedelta
`periods`	Integer
`freq`	String	Must be a valid Pandas frequency
`name`	String
`closed`	String and one of ('left', 'right')

Note

- Exactly three of start, end, periods, and freq must be provided.
- This function is not parallelized yet.

Example Usage

>>> @bodo.jit
... def f():
...     return pd.timedelta_range(start="1 day", end="11 days 1 hour", periods=3)

>>> f()

TimedeltaIndex(['1 days 00:00:00', '6 days 00:30:00', '11 days 01:00:00'], dtype='timedelta64[ns]', freq=None)
