General functions
Data manipulations
pd.pivot
pandas.pivot(data, values=None, index=None, columns=None)
Supported Arguments (argument: datatypes)
- data: DataFrame
- values: Constant Column Label or list of labels
- index: Constant Column Label or list of labels
- columns: Constant Column Label
Note
The number of columns and the column names of the output DataFrame are not known at compile time. To update the typing information of the DataFrame, pass it back to Python.
Example Usage
>>> import bodo
>>> import pandas as pd
>>> @bodo.jit
... def f():
...     df = pd.DataFrame({"A": ["X","X","X","X","Y","Y"], "B": [1,2,3,4,5,6], "C": [10,11,12,20,21,22]})
...     pivoted_tbl = pd.pivot(df, columns="A", index="B", values="C")
...     return pivoted_tbl
>>> f()
A     X     Y
B
1  10.0   NaN
2  11.0   NaN
3  12.0   NaN
4  20.0   NaN
5   NaN  21.0
6   NaN  22.0
pd.pivot_table
pandas.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All', observed=False, sort=True)
Supported Arguments (argument: datatypes)
- data: DataFrame
- values: Constant Column Label or list of labels
- index: Constant Column Label or list of labels
- columns: Constant Column Label
- aggfunc: String Constant
Note
This code takes one of two paths depending on whether pivot values are annotated. When pivot values are annotated, the output columns are set to the annotated values. For example, @bodo.jit(pivots={'pt': ['small', 'large']}) declares that the output pivot table pt will have columns called small and large.
If pivot values are not annotated, the number of columns and the column names of the output DataFrame are not known at compile time. To update the typing information of the DataFrame, pass it back to Python.
Example Usage
>>> @bodo.jit(pivots={'pivoted_tbl': ['X', 'Y']})
... def f():
...     df = pd.DataFrame({"A": ["X","X","X","X","Y","Y"], "B": [1,2,3,4,5,6], "C": [10,11,12,20,21,22]})
...     pivoted_tbl = pd.pivot_table(df, columns="A", index="B", values="C", aggfunc="mean")
...     return pivoted_tbl
>>> f()
      X     Y
B
1  10.0   NaN
2  11.0   NaN
3  12.0   NaN
4  20.0   NaN
5   NaN  21.0
6   NaN  22.0
pd.crosstab
pandas.crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, margins_name='All', dropna=True, normalize=False)
Supported Arguments (argument: datatypes)
- index: SeriesType
- columns: SeriesType
Note
Annotation of pivot values is required. For example, @bodo.jit(pivots={'pt': ['small', 'large']}) declares that the output table pt will have columns called small and large.
Example Usage
>>> @bodo.jit(pivots={"pt": ["small", "large"]})
... def f(df):
...     pt = pd.crosstab(df.A, df.C)
...     return pt
>>> list_A = ["foo", "foo", "bar", "bar", "bar", "bar"]
>>> list_C = ["small", "small", "large", "small", "small", "middle"]
>>> df = pd.DataFrame({"A": list_A, "C": list_C})
>>> f(df)
       small  large
index
foo        2      0
bar        2      1
pd.cut
pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates="raise", ordered=True)
Supported Arguments (argument: datatypes)
- x: Series or Array like
- bins: Integer or Array like
- include_lowest: Boolean
Example Usage
>>> @bodo.jit
... def f(S):
...     bins = 4
...     include_lowest = True
...     return pd.cut(S, bins, include_lowest=include_lowest)
>>> S = pd.Series(
...     [-2, 1, 3, 4, 5, 11, 15, 20, 22],
...     ["a1", "a2", "a3", "a4", "a5", "a6", "a7", "a8", "a9"],
...     name="ABC",
... )
>>> f(S)
a1    (-2.025, 4.0]
a2    (-2.025, 4.0]
a3    (-2.025, 4.0]
a4    (-2.025, 4.0]
a5      (4.0, 10.0]
a6     (10.0, 16.0]
a7     (10.0, 16.0]
a8     (16.0, 22.0]
a9     (16.0, 22.0]
Name: ABC, dtype: category
Categories (4, interval[float64, right]): [(-2.025, 4.0] < (4.0, 10.0] < (10.0, 16.0] < (16.0, 22.0]]
pd.qcut
pandas.qcut(x, q, labels=None, retbins=False, precision=3, duplicates="raise")
Supported Arguments (argument: datatypes)
- x: Series or Array like
- q: Integer or Array like of floats
Example Usage
>>> @bodo.jit
... def f(S):
...     q = 4
...     return pd.qcut(S, q)
>>> S = pd.Series(
...     [-2, 1, 3, 4, 5, 11, 15, 20, 22],
...     ["a1", "a2", "a3", "a4", "a5", "a6", "a7", "a8", "a9"],
...     name="ABC",
... )
>>> f(S)
a1    (-2.001, 3.0]
a2    (-2.001, 3.0]
a3    (-2.001, 3.0]
a4       (3.0, 5.0]
a5       (3.0, 5.0]
a6      (5.0, 15.0]
a7      (5.0, 15.0]
a8     (15.0, 22.0]
a9     (15.0, 22.0]
Name: ABC, dtype: category
Categories (4, interval[float64, right]): [(-2.001, 3.0] < (3.0, 5.0] < (5.0, 15.0] < (15.0, 22.0]]
pd.merge
pandas.merge(left, right, how="inner", on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=("_x", "_y"), copy=True, indicator=False, validate=None, _bodo_na_equal=True)
Supported Arguments (argument: datatypes, followed by other requirements)
- left: DataFrame
- right: DataFrame
- how: String
  - Must be one of "inner", "outer", "left", "right"
  - Must be constant at Compile Time
- on: Column Name, List of Column Names, or General Merge Condition String (see Merge Notes)
  - Must be constant at Compile Time
- left_on: Column Name or List of Column Names
  - Must be constant at Compile Time
- right_on: Column Name or List of Column Names
  - Must be constant at Compile Time
- left_index: Boolean
  - Must be constant at Compile Time
- right_index: Boolean
  - Must be constant at Compile Time
- suffixes: Tuple of Strings
  - Must be constant at Compile Time
- indicator: Boolean
  - Must be constant at Compile Time
- _bodo_na_equal: Boolean
  - Must be constant at Compile Time
Important
The argument _bodo_na_equal is unique to Bodo and not available in Pandas. If it is False, Bodo won't consider NA/nan keys as equal, which differs from Pandas.
Merge Notes
- Output Ordering:
The output dataframe is not sorted by default for better parallel performance (Pandas may preserve key order depending on how). One can use an explicit sort if needed.
- General Merge Conditions:
Within Pandas, the merge criteria supported by pd.merge are limited to equality between 1 or more pairs of keys. For some use cases, this is not sufficient and more generalized support is necessary. For example, with these limitations, a left outer join where df1.A == df2.B & df2.C < df1.A cannot be efficiently computed.
Bodo supports these use cases by allowing users to pass general merge conditions to pd.merge. We plan to contribute this feature to Pandas to ensure full compatibility of Bodo and Pandas code.
General merge conditions are performed by providing the condition as a string via the on argument. Columns in the left table are referred to by left.{column name} and columns in the right table are referred to by right.{column name}.
Here's an example demonstrating the above:
>>> @bodo.jit
... def general_merge(df1, df2):
...     return df1.merge(df2, on="left.`A` == right.`B` & right.`C` < left.`A`", how="left")
>>> df1 = pd.DataFrame({"col": [2, 3, 5, 1, 2, 8], "A": [4, 6, 3, 9, 9, -1]})
>>> df2 = pd.DataFrame({"B": [1, 2, 9, 3, 2], "C": [1, 7, 2, 6, 5]})
>>> general_merge(df1, df2)
   col  A     B     C
0    2  4  <NA>  <NA>
1    3  6  <NA>  <NA>
2    5  3  <NA>  <NA>
3    1  9     9     2
4    2  9     9     2
5    8 -1  <NA>  <NA>
These calls have a few additional requirements:
- The condition must be a constant string.
- The condition must be of the form cond_1 & ... & cond_N where at least one cond_i is a simple equality. This restriction will be removed in a future release.
- The columns specified in these conditions are limited to certain column types. We currently support boolean, integer, float, datetime64, timedelta64, datetime.date, and string columns.
Example Usage
>>> import numpy as np
>>> @bodo.jit
... def f(df1, df2):
...     return pd.merge(df1, df2, how="inner", on="key")
>>> df1 = pd.DataFrame({"key": [2, 3, 5, 1, 2, 8], "A": np.array([4, 6, 3, 9, 9, -1], float)})
>>> df2 = pd.DataFrame({"key": [1, 2, 9, 3, 2], "B": np.array([1, 7, 2, 6, 5], float)})
>>> f(df1, df2)
   key    A    B
0    2  4.0  7.0
1    2  4.0  5.0
2    3  6.0  6.0
3    1  9.0  1.0
4    2  9.0  7.0
5    2  9.0  5.0
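Because the merge output is not sorted by default (see Output Ordering in the Merge Notes), a deterministic row order needs an explicit sort. The following is a minimal sketch in plain pandas so that it runs without Bodo installed. It assumes that the same function body can be decorated with @bodo.jit and that sort_values and reset_index are supported with these arguments.

```python
import pandas as pd

# Plain pandas sketch; under Bodo, f would carry the @bodo.jit decorator (assumption).
def f(df1, df2):
    merged = pd.merge(df1, df2, how="inner", on="key")
    # Sort on every column so the row order does not depend on how the join ran.
    return merged.sort_values(["key", "A", "B"]).reset_index(drop=True)

df1 = pd.DataFrame({"key": [2, 3, 5, 1, 2, 8], "A": [4.0, 6.0, 3.0, 9.0, 9.0, -1.0]})
df2 = pd.DataFrame({"key": [1, 2, 9, 3, 2], "B": [1.0, 7.0, 2.0, 6.0, 5.0]})
out = f(df1, df2)
```

Sorting on all columns, not only the key, breaks ties between rows that share a key value.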
pd.merge_asof
pandas.merge_asof(left, right, on=None, left_on=None, right_on=None, left_index=False, right_index=False, by=None, left_by=None, right_by=None, suffixes=("_x", "_y"), tolerance=None, allow_exact_matches=True, direction="backward")
Supported Arguments (argument: datatypes, followed by other requirements)
- left: DataFrame
- right: DataFrame
- on: Column Name, List of Column Names
  - Must be constant at Compile Time
- left_on: Column Name or List of Column Names
  - Must be constant at Compile Time
- right_on: Column Name or List of Column Names
  - Must be constant at Compile Time
- left_index: Boolean
  - Must be constant at Compile Time
- right_index: Boolean
  - Must be constant at Compile Time
- suffixes: Tuple of Strings
  - Must be constant at Compile Time
Example Usage
>>> @bodo.jit
... def f(df1, df2):
...     return pd.merge_asof(df1, df2, on="time")
>>> df1 = pd.DataFrame(
...     {
...         "time": pd.DatetimeIndex(["2017-01-03", "2017-01-06", "2017-02-21"]),
...         "B": [4, 5, 6],
...     }
... )
>>> df2 = pd.DataFrame(
...     {
...         "time": pd.DatetimeIndex(
...             ["2017-01-01", "2017-01-02", "2017-01-04", "2017-02-23", "2017-02-25"]
...         ),
...         "A": [2, 3, 7, 8, 9],
...     }
... )
>>> f(df1, df2)
        time  B  A
0 2017-01-03  4  3
1 2017-01-06  5  7
2 2017-02-21  6  7
pd.concat
pandas.concat(objs, axis=0, join="outer", join_axes=None, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=None, copy=True)
Supported Arguments (argument: datatypes, followed by other requirements)
- objs: List or Tuple of DataFrames/Series
- axis: Integer with either 0 or 1
  - Must be constant at Compile Time
- ignore_index: Boolean
  - Must be constant at Compile Time
Important
Bodo currently concatenates local data chunks for distributed datasets, which does not preserve global order of concatenated objects in output.
Example Usage
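The original example is not present here. The following is a minimal sketch of the call shape in plain pandas so that it runs without Bodo installed; under Bodo the function would carry @bodo.jit (an assumption, not output taken from Bodo). As the note above says, on distributed data the global row order may differ from what sequential pandas produces.

```python
import pandas as pd

# Plain pandas sketch; under Bodo, f would carry the @bodo.jit decorator (assumption).
def f(df1, df2):
    # axis and ignore_index are literals, matching the compile-time constant requirement.
    return pd.concat([df1, df2], axis=0, ignore_index=True)

df1 = pd.DataFrame({"A": [1, 2], "B": [3.0, 4.0]})
df2 = pd.DataFrame({"A": [5, 6], "B": [7.0, 8.0]})
out = f(df1, df2)
```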
pd.get_dummies
pandas.get_dummies(data, prefix=None, prefix_sep="_", dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)
Supported Arguments (argument: datatypes, followed by other requirements)
- data: Array or Series with Categorical dtypes
  - Categories must be known at compile time.
Example Usage
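The original example is not present here. As a hedged sketch in plain pandas, the input below is a Series with an explicit categorical dtype, so the category list is fixed before the call. Whether this way of declaring the categories satisfies Bodo's compile-time requirement is an assumption.

```python
import pandas as pd

# Plain pandas sketch; under Bodo, f would carry the @bodo.jit decorator (assumption).
def f(S):
    return pd.get_dummies(S)

# The categories are declared up front instead of being inferred from the data.
cat_type = pd.CategoricalDtype(["a", "b", "c"])
S = pd.Series(["a", "b", "a", "c"], dtype=cat_type)
out = f(S)
```

The result has one indicator column per declared category, in category order.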
Top-level missing data
pd.isna
pandas.isna(obj)
Supported Arguments (argument: datatypes)
- obj: DataFrame, Series, Index, Array, or Scalar
Example Usage
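The original example is not present here. A minimal sketch with a Series input, written in plain pandas so that it runs without Bodo; the @bodo.jit decorator is assumed to apply unchanged.

```python
import numpy as np
import pandas as pd

# Plain pandas sketch; under Bodo, f would carry the @bodo.jit decorator (assumption).
def f(S):
    return pd.isna(S)

S = pd.Series([1.0, np.nan, 3.0])
out = f(S)  # boolean Series, True where the value is missing
```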
pd.isnull
pandas.isnull(obj)
Supported Arguments (argument: datatypes)
- obj: DataFrame, Series, Index, Array, or Scalar
Example Usage
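The original example is not present here. This sketch uses a DataFrame input, one of the listed types, in plain pandas; running it under @bodo.jit is an assumption.

```python
import numpy as np
import pandas as pd

# Plain pandas sketch; under Bodo, f would carry the @bodo.jit decorator (assumption).
def f(df):
    return pd.isnull(df)

df = pd.DataFrame({"A": [1.0, np.nan], "B": ["x", None]})
out = f(df)  # boolean DataFrame with the same shape as df
```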
pd.notna
pandas.notna(obj)
Supported Arguments (argument: datatypes)
- obj: DataFrame, Series, Index, Array, or Scalar
Example Usage
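The original example is not present here. This sketch passes a NumPy array, one of the listed input types, in plain pandas; running it under @bodo.jit is an assumption.

```python
import numpy as np
import pandas as pd

# Plain pandas sketch; under Bodo, f would carry the @bodo.jit decorator (assumption).
def f(arr):
    return pd.notna(arr)

arr = np.array([1.0, np.nan, 2.0])
out = f(arr)  # boolean array, True where a value is present
```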
pd.notnull
pandas.notnull(obj)
Supported Arguments (argument: datatypes)
- obj: DataFrame, Series, Index, Array, or Scalar
Example Usage
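The original example is not present here. This sketch uses scalar inputs in plain pandas; running it under @bodo.jit is an assumption.

```python
import numpy as np
import pandas as pd

# Plain pandas sketch; under Bodo, f would carry the @bodo.jit decorator (assumption).
def f(x):
    return pd.notnull(x)

present = f(3.5)     # a regular float is not null
missing = f(np.nan)  # NaN counts as null
```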
Top-level conversions
pd.to_numeric
pandas.to_numeric(arg, errors="raise", downcast=None)
Supported Arguments (argument: datatypes, followed by other requirements)
- arg: Series or Array
- downcast: String and one of ('integer', 'signed', 'unsigned', 'float')
  - Must be constant at Compile Time
Note
- Output type is float64 by default.
- Unlike Pandas, Bodo does not dynamically determine the output type, and does not downcast to the smallest numerical type. The downcast parameter should be used for type annotation of the output.
Example Usage
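The original example is not present here. A minimal sketch in plain pandas, where downcast is passed as a literal string to match the constant requirement. Plain pandas picks the smallest fitting integer type here. Per the note above, Bodo instead treats downcast as a type annotation, so the exact output dtype under @bodo.jit may differ and is not asserted.

```python
import pandas as pd

# Plain pandas sketch; under Bodo, f would carry the @bodo.jit decorator (assumption).
def f(S):
    # downcast is a literal, so it is a compile-time constant.
    return pd.to_numeric(S, downcast="integer")

S = pd.Series(["1", "2", "3"])
out = f(S)  # integer-typed Series holding 1, 2, 3
```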
Top-level dealing with datetime and timedelta like
pd.to_datetime
pandas.to_datetime(arg, errors='raise', dayfirst=False, yearfirst=False, utc=None, format=None, exact=True, unit=None, infer_datetime_format=False, origin='unix', cache=True)
Supported Arguments (argument: datatypes, followed by other requirements)
- arg: Series, Array or scalar of integers or strings
- errors: String and one of ('ignore', 'raise', 'coerce')
- dayfirst: Boolean
- yearfirst: Boolean
- utc: Boolean
- format: String matching Pandas strftime/strptime
- exact: Boolean
- unit: String
  - Must be a valid Pandas timedelta unit
- infer_datetime_format: Boolean
- origin: Scalar string or timestamp value
- cache: Boolean
Note
- The function is not optimized.
- Bodo doesn't support Timezone-Aware datetime values.
Example Usage
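The original example is not present here. A minimal sketch in plain pandas that parses a Series of date strings with an explicit format; running it under @bodo.jit is an assumption. The input is timezone-naive, in line with the note above.

```python
import pandas as pd

# Plain pandas sketch; under Bodo, f would carry the @bodo.jit decorator (assumption).
def f(S):
    return pd.to_datetime(S, format="%Y-%m-%d")

S = pd.Series(["2017-01-03", "2017-02-21"])
out = f(S)  # datetime64 Series without timezone information
```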
pd.to_timedelta
pandas.to_timedelta(arg, unit=None, errors='raise')
Supported Arguments (argument: datatypes, followed by other requirements)
- arg: Series, Array or scalar of integers or strings
- unit: String
  - Must be a valid Pandas timedelta unit
Note
Passing string data as arg is not optimized.
Example Usage
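The original example is not present here. A minimal sketch in plain pandas that converts integer data with an explicit unit, which avoids the string path that the note above flags as not optimized. Running it under @bodo.jit is an assumption.

```python
import pandas as pd

# Plain pandas sketch; under Bodo, f would carry the @bodo.jit decorator (assumption).
def f(S):
    return pd.to_timedelta(S, unit="s")

S = pd.Series([1, 2, 3])
out = f(S)  # timedelta64 Series of 1, 2 and 3 seconds
```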
pd.date_range
pandas.date_range(start=None, end=None, periods=None, freq=None, tz=None, normalize=False, name=None, closed=None, **kwargs)
Supported Arguments (argument: datatypes, followed by other requirements)
- start: String or Timestamp
- end: String or Timestamp
- periods: Integer
- freq: String
  - Must be a valid Pandas frequency
- name: String
- closed: String and one of ('left', 'right')
Note
- Exactly three of start, end, periods, and freq must be provided.
- Bodo does not support kwargs, even for compatibility.
- This function is not parallelized yet.
Example Usage
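The original example is not present here. A minimal sketch in plain pandas that supplies exactly three of the four range arguments (start, periods, freq), as the note above requires. Running it under @bodo.jit is an assumption.

```python
import pandas as pd

# Plain pandas sketch; under Bodo, f would carry the @bodo.jit decorator (assumption).
def f():
    # start, periods and freq are given; end is derived from them.
    return pd.date_range(start="2018-01-01", periods=3, freq="D")

out = f()  # DatetimeIndex covering three consecutive days
```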
pd.timedelta_range
pandas.timedelta_range(start=None, end=None, periods=None, freq=None, name=None, closed=None)
Supported Arguments (argument: datatypes, followed by other requirements)
- start: String or Timedelta
- end: String or Timedelta
- periods: Integer
- freq: String
  - Must be a valid Pandas frequency
- name: String
- closed: String and one of ('left', 'right')
Note
- Exactly three of start, end, periods, and freq must be provided.
- This function is not parallelized yet.
Example Usage
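The original example is not present here. A minimal sketch in plain pandas that supplies start, end, and periods and leaves freq unset, so the values are evenly spaced between the endpoints. Running it under @bodo.jit is an assumption.

```python
import pandas as pd

# Plain pandas sketch; under Bodo, f would carry the @bodo.jit decorator (assumption).
def f():
    # start, end and periods are given; the spacing is derived from them.
    return pd.timedelta_range(start="1 day", end="4 days", periods=4)

out = f()  # TimedeltaIndex of 1, 2, 3 and 4 days
```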