General functions
Data manipulations
pd.pivot
pandas.pivot(data, values=None, index=None, columns=None)
Supported Arguments
| argument | datatypes |
|---|---|
| data | DataFrame |
| values | Constant Column Label or list of labels |
| index | Constant Column Label or list of labels |
| columns | Constant Column Label |
Note
The number and names of the output DataFrame's columns are not known at compile time. To update the DataFrame's typing information, pass it back to Python.
Example Usage
>>> @bodo.jit
... def f():
...     df = pd.DataFrame({"A": ["X","X","X","X","Y","Y"], "B": [1,2,3,4,5,6], "C": [10,11,12,20,21,22]})
...     # Pass the DataFrame defined above (the name `data` was never defined)
...     pivoted_tbl = pd.pivot(df, columns="A", index="B", values="C")
...     return pivoted_tbl
>>> f()
A     X     Y
B
1  10.0   NaN
2  11.0   NaN
3  12.0   NaN
4  20.0   NaN
5   NaN  21.0
6   NaN  22.0
pd.pivot_table
pandas.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All', observed=False, sort=True)
Supported Arguments
| argument | datatypes |
|---|---|
| data | DataFrame |
| values | Constant Column Label or list of labels |
| index | Constant Column Label or list of labels |
| columns | Constant Column Label |
| aggfunc | String Constant |
Note
This code takes two different paths depending on whether pivot values are annotated. When pivot values are annotated, the output columns are set to the annotated values. For example,
@bodo.jit(pivots={'pt': ['small', 'large']})
declares that the output pivot table pt will have columns called small and large.
If pivot values are not annotated, the number and names of the output DataFrame's columns are not known at compile time. To update the DataFrame's typing information, pass it back to Python.
Example Usage
>>> @bodo.jit(pivots={'pivoted_tbl': ['X', 'Y']})
... def f():
...     df = pd.DataFrame({"A": ["X","X","X","X","Y","Y"], "B": [1,2,3,4,5,6], "C": [10,11,12,20,21,22]})
...     pivoted_tbl = pd.pivot_table(df, columns="A", index="B", values="C", aggfunc="mean")
...     return pivoted_tbl
>>> f()
      X     Y
B
1  10.0   NaN
2  11.0   NaN
3  12.0   NaN
4  20.0   NaN
5   NaN  21.0
6   NaN  22.0
pd.crosstab
pandas.crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, margins_name='All', dropna=True, normalize=False)
Supported Arguments
| argument | datatypes |
|---|---|
| index | SeriesType |
| columns | SeriesType |
Note
Annotation of pivot values is required. For example,
@bodo.jit(pivots={'pt': ['small', 'large']})
declares that the output table pt will have columns called small and large.
Example Usage
>>> @bodo.jit(pivots={"pt": ["small", "large"]})
... def f(df):
...     pt = pd.crosstab(df.A, df.C)
...     return pt
>>> list_A = ["foo", "foo", "bar", "bar", "bar", "bar"]
>>> list_C = ["small", "small", "large", "small", "small", "middle"]
>>> df = pd.DataFrame({"A": list_A, "C": list_C})
>>> f(df)
       small  large
index
foo        2      0
bar        2      1
pd.cut
pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates="raise", ordered=True)
Supported Arguments
| argument | datatypes |
|---|---|
| x | Series or Array like |
| bins | Integer or Array like |
| include_lowest | Boolean |
Example Usage
>>> @bodo.jit
... def f(S):
...     bins = 4
...     include_lowest = True
...     return pd.cut(S, bins, include_lowest=include_lowest)
>>> S = pd.Series(
...     [-2, 1, 3, 4, 5, 11, 15, 20, 22],
...     ["a1", "a2", "a3", "a4", "a5", "a6", "a7", "a8", "a9"],
...     name="ABC",
... )
>>> f(S)
a1    (-2.025, 4.0]
a2    (-2.025, 4.0]
a3    (-2.025, 4.0]
a4    (-2.025, 4.0]
a5      (4.0, 10.0]
a6     (10.0, 16.0]
a7     (10.0, 16.0]
a8     (16.0, 22.0]
a9     (16.0, 22.0]
Name: ABC, dtype: category
Categories (4, interval[float64, right]): [(-2.025, 4.0] < (4.0, 10.0] < (10.0, 16.0] < (16.0, 22.0]]
pd.qcut
pandas.qcut(x, q, labels=None, retbins=False, precision=3, duplicates="raise")
Supported Arguments
| argument | datatypes |
|---|---|
| x | Series or Array like |
| q | Integer or Array like of floats |
Example Usage
>>> @bodo.jit
... def f(S):
...     q = 4
...     return pd.qcut(S, q)
>>> S = pd.Series(
...     [-2, 1, 3, 4, 5, 11, 15, 20, 22],
...     ["a1", "a2", "a3", "a4", "a5", "a6", "a7", "a8", "a9"],
...     name="ABC",
... )
>>> f(S)
a1    (-2.001, 3.0]
a2    (-2.001, 3.0]
a3    (-2.001, 3.0]
a4       (3.0, 5.0]
a5       (3.0, 5.0]
a6      (5.0, 15.0]
a7      (5.0, 15.0]
a8     (15.0, 22.0]
a9     (15.0, 22.0]
Name: ABC, dtype: category
Categories (4, interval[float64, right]): [(-2.001, 3.0] < (3.0, 5.0] < (5.0, 15.0] < (15.0, 22.0]]
pd.merge
pandas.merge(left, right, how="inner", on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=("_x", "_y"), copy=True, indicator=False, validate=None, _bodo_na_equal=True)
Supported Arguments
| argument | datatypes | other requirements |
|---|---|---|
| left | DataFrame | |
| right | DataFrame | |
| how | String | Must be one of "inner", "outer", "left", "right". Must be constant at Compile Time. |
| on | Column Name, List of Column Names, or General Merge Condition String (see Merge Notes) | Must be constant at Compile Time. |
| left_on | Column Name or List of Column Names | Must be constant at Compile Time. |
| right_on | Column Name or List of Column Names | Must be constant at Compile Time. |
| left_index | Boolean | Must be constant at Compile Time. |
| right_index | Boolean | Must be constant at Compile Time. |
| suffixes | Tuple of Strings | Must be constant at Compile Time. |
| indicator | Boolean | Must be constant at Compile Time. |
| _bodo_na_equal | Boolean | Must be constant at Compile Time. Unique to Bodo and not available in Pandas. |
Important
The argument _bodo_na_equal is unique to Bodo and not available in Pandas. If it is False, Bodo won't consider NA/nan keys as equal, which differs from Pandas.
Merge Notes
- Output Ordering: The output DataFrame is not sorted by default, for better parallel performance (Pandas may preserve key order depending on how). Sort explicitly if a particular order is needed.
- General Merge Conditions: Within Pandas, the merge criteria supported by pd.merge are limited to equality between one or more pairs of keys. For some use cases this is not sufficient and more general support is necessary. For example, with these limitations, a left outer join where df1.A == df2.B & df2.C < df1.A cannot be computed efficiently.
  Bodo supports these use cases by allowing users to pass general merge conditions to pd.merge. We plan to contribute this feature to Pandas to ensure full compatibility of Bodo and Pandas code.
  General merge conditions are specified by passing the condition as a string via the on argument. Columns in the left table are referred to by left.{column name} and columns in the right table by right.{column name}.
  Here's an example demonstrating the above:
>>> @bodo.jit
... def general_merge(df1, df2):
...     return df1.merge(df2, on="left.`A` == right.`B` & right.`C` < left.`A`", how="left")
>>> df1 = pd.DataFrame({"col": [2, 3, 5, 1, 2, 8], "A": [4, 6, 3, 9, 9, -1]})
>>> df2 = pd.DataFrame({"B": [1, 2, 9, 3, 2], "C": [1, 7, 2, 6, 5]})
>>> general_merge(df1, df2)
   col   A     B     C
0    2   4  <NA>  <NA>
1    3   6  <NA>  <NA>
2    5   3  <NA>  <NA>
3    1   9     9     2
4    2   9     9     2
5    8  -1  <NA>  <NA>
These calls have a few additional requirements:
- The condition must be a constant string.
- The condition must be of the form cond_1 & ... & cond_N where at least one cond_i is a simple equality. This restriction will be removed in a future release.
- The columns specified in these conditions are limited to certain column types. We currently support boolean, integer, float, datetime64, timedelta64, datetime.date, and string columns.
Example Usage
>>> @bodo.jit
... def f(df1, df2):
...     return pd.merge(df1, df2, how="inner", on="key")
>>> df1 = pd.DataFrame({"key": [2, 3, 5, 1, 2, 8], "A": np.array([4, 6, 3, 9, 9, -1], float)})
>>> df2 = pd.DataFrame({"key": [1, 2, 9, 3, 2], "B": np.array([1, 7, 2, 6, 5], float)})
>>> f(df1, df2)
   key    A    B
0    2  4.0  7.0
1    2  4.0  5.0
2    3  6.0  6.0
3    1  9.0  1.0
4    2  9.0  7.0
5    2  9.0  5.0
pd.concat
pandas.concat(objs, axis=0, join="outer", join_axes=None, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=None, copy=True)
Supported Arguments
| argument | datatypes | other requirements |
|---|---|---|
| objs | List or Tuple of DataFrames/Series | |
| axis | Integer with either 0 or 1 | Must be constant at Compile Time. |
| ignore_index | Boolean | Must be constant at Compile Time. |
Important
Bodo currently concatenates local data chunks for distributed datasets, which does not preserve global order of concatenated objects in output.
Example Usage
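A minimal sketch of a call that stays within the supported arguments listed above. It is written in plain pandas so that it runs without Bodo; decorating f with @bodo.jit, as the other examples in this section do, is the assumed Bodo usage. The names f, df1, and df2 are illustrative.

```python
import pandas as pd

# Under Bodo this function would carry @bodo.jit (assumption; omitted here
# so the sketch runs with pandas alone).
def f(df1, df2):
    # axis and ignore_index are literals, so they are constant at compile time
    return pd.concat([df1, df2], axis=0, ignore_index=True)

df1 = pd.DataFrame({"A": [1, 2], "B": ["a", "b"]})
df2 = pd.DataFrame({"A": [3, 4], "B": ["c", "d"]})
out = f(df1, df2)
print(out)
```

With distributed data under Bodo, the row order of the result may differ from this sequential run, as the Important note above describes.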
pd.get_dummies
pandas.get_dummies(data, prefix=None, prefix_sep="_", dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)
Supported Arguments
| argument | datatypes | other requirements |
|---|---|---|
| data | Array or Series with Categorical dtypes | Categories must be known at compile time. |
Example Usage
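A minimal sketch in plain pandas, assuming the categories are fixed with an explicit CategoricalDtype. How Bodo decides that categories are "known at compile time" is not described here, so treat this only as an illustration of the input type; f, dtype, and S are illustrative names, and under Bodo f would be assumed to carry @bodo.jit.

```python
import pandas as pd

# Under Bodo this function would carry @bodo.jit (assumption).
def f(S):
    return pd.get_dummies(S)

# Categories are declared explicitly rather than inferred from the values
dtype = pd.CategoricalDtype(["small", "medium", "large"])
S = pd.Series(["small", "large", "small", "medium"], dtype=dtype)
out = f(S)
print(out)  # one indicator column per category, in category order
```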
pd.unique
pandas.unique(values)
Supported Arguments
| argument | datatypes |
|---|---|
| values | Series or 1-d array with Categorical dtypes |
Example Usage
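A minimal sketch in plain pandas using a categorical Series, the input type listed above. The names f and S are illustrative, and the @bodo.jit decorator is assumed for Bodo use.

```python
import pandas as pd

# Under Bodo this function would carry @bodo.jit (assumption).
def f(S):
    return pd.unique(S)

S = pd.Series(pd.Categorical(["b", "a", "c", "a", "b"]))
out = f(S)
print(list(out))  # unique values in order of first appearance
```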
Top-level missing data
pd.isna
pandas.isna(obj)
Supported Arguments
| argument | datatypes |
|---|---|
| obj | DataFrame, Series, Index, Array, or Scalar |
Example Usage
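A minimal sketch in plain pandas with a DataFrame argument. The names f and df are illustrative, and the @bodo.jit decorator is assumed for Bodo use.

```python
import numpy as np
import pandas as pd

# Under Bodo this function would carry @bodo.jit (assumption).
def f(df):
    # Returns a boolean DataFrame of the same shape, True where a value is missing
    return pd.isna(df)

df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": ["x", None, "z"]})
mask = f(df)
print(mask)
```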
pd.isnull
pandas.isnull(obj)
Supported Arguments
| argument | datatypes |
|---|---|
| obj | DataFrame, Series, Index, Array, or Scalar |
Example Usage
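In pandas, isnull is an alias of isna, so the two give the same result. A minimal sketch in plain pandas with a Series argument; f and S are illustrative names, and the @bodo.jit decorator is assumed for Bodo use.

```python
import numpy as np
import pandas as pd

# Under Bodo this function would carry @bodo.jit (assumption).
def f(S):
    return pd.isnull(S)

S = pd.Series([1.5, np.nan, 2.5])
mask = f(S)
print(mask.tolist())  # [False, True, False]
```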
pd.notna
pandas.notna(obj)
Supported Arguments
| argument | datatypes |
|---|---|
| obj | DataFrame, Series, Index, Array, or Scalar |
Example Usage
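A common use of notna is as a filter mask. A minimal sketch in plain pandas; f and S are illustrative names, and the @bodo.jit decorator is assumed for Bodo use.

```python
import numpy as np
import pandas as pd

# Under Bodo this function would carry @bodo.jit (assumption).
def f(S):
    # Keep only the entries that are not missing
    return S[pd.notna(S)]

S = pd.Series([1.0, np.nan, 3.0])
out = f(S)
print(out.tolist())  # [1.0, 3.0]
```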
pd.notnull
pandas.notnull(obj)
Supported Arguments
| argument | datatypes |
|---|---|
| obj | DataFrame, Series, Index, Array, or Scalar |
Example Usage
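A minimal sketch in plain pandas with scalar arguments, one of the supported input kinds listed above. The name f is illustrative, and the @bodo.jit decorator is assumed for Bodo use.

```python
import numpy as np
import pandas as pd

# Under Bodo this function would carry @bodo.jit (assumption).
def f(x):
    return pd.notnull(x)

present = f(3)        # an ordinary scalar is not missing
missing = f(np.nan)   # NaN counts as missing
print(present, missing)
```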
Top-level conversions
pd.to_numeric
pandas.to_numeric(arg, errors="raise", downcast=None)
Supported Arguments
| argument | datatypes | other requirements |
|---|---|---|
| arg | Series or Array | |
| downcast | String and one of ('integer', 'signed', 'unsigned', 'float') | Must be constant at Compile Time. |
Note
- Output type is float64 by default.
- Unlike Pandas, Bodo does not dynamically determine the output type and does not downcast to the smallest numerical type.
- The downcast parameter should be used for type annotation of the output.
Example Usage
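A minimal sketch in plain pandas that passes downcast as a literal. Note that plain pandas picks the smallest integer dtype that fits, whereas the note above says Bodo does not, so only the values (not the exact dtype) should be expected to match. The names f and S are illustrative, and the @bodo.jit decorator is assumed for Bodo use.

```python
import pandas as pd

# Under Bodo this function would carry @bodo.jit (assumption).
def f(S):
    # downcast is a string literal, so it is constant at compile time
    return pd.to_numeric(S, downcast="integer")

S = pd.Series(["1", "2", "3"])
out = f(S)
print(out.tolist())  # [1, 2, 3]
```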
Top-level dealing with datetime and timedelta like
pd.to_datetime
pandas.to_datetime(arg, errors='raise', dayfirst=False, yearfirst=False, utc=None, format=None, exact=True, unit=None, infer_datetime_format=False, origin='unix', cache=True)
Supported Arguments
| argument | datatypes | other requirements |
|---|---|---|
| arg | Series, Array or scalar of integers or strings | |
| errors | String and one of ('ignore', 'raise', 'coerce') | |
| dayfirst | Boolean | |
| yearfirst | Boolean | |
| utc | Boolean | |
| format | String matching Pandas strftime/strptime | |
| exact | Boolean | |
| unit | String | Must be a valid Pandas timedelta unit. |
| infer_datetime_format | Boolean | |
| origin | Scalar string or timestamp value | |
| cache | Boolean | |
Note
- The function is not optimized.
- Bodo doesn't support Timezone-Aware datetime values.
Example Usage
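A minimal sketch in plain pandas that parses a Series of date strings with an explicit format and no time zone. The names f and S are illustrative, and the @bodo.jit decorator is assumed for Bodo use.

```python
import pandas as pd

# Under Bodo this function would carry @bodo.jit (assumption).
def f(S):
    return pd.to_datetime(S, format="%Y-%m-%d")

S = pd.Series(["2021-01-01", "2021-02-15"])
out = f(S)
print(out)
```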
pd.to_timedelta
pandas.to_timedelta(arg, unit=None, errors='raise')
Supported Arguments
| argument | datatypes | other requirements |
|---|---|---|
| arg | Series, Array or scalar of integers or strings | |
| unit | String | Must be a valid Pandas timedelta unit. |
Note
Passing string data as arg is not optimized.
Example Usage
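A minimal sketch in plain pandas that converts integer seconds, avoiding the string input that the note above flags as not optimized. The names f and S are illustrative, and the @bodo.jit decorator is assumed for Bodo use.

```python
import pandas as pd

# Under Bodo this function would carry @bodo.jit (assumption).
def f(S):
    # Interpret each integer as a number of seconds
    return pd.to_timedelta(S, unit="s")

S = pd.Series([30, 90, 3600])
out = f(S)
print(out)
```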
pd.date_range
pandas.date_range(start=None, end=None, periods=None, freq=None, tz=None, normalize=False, name=None, closed=None, **kwargs)
Supported Arguments
| argument | datatypes | other requirements |
|---|---|---|
| start | String or Timestamp | |
| end | String or Timestamp | |
| periods | Integer | |
| freq | String | Must be a valid Pandas frequency. |
| name | String | |
Note
- Exactly three of start, end, periods, and freq must be provided.
- Bodo does not support kwargs, even for compatibility.
Example Usage
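A minimal sketch in plain pandas that provides exactly three of the four range arguments (start, periods, and freq), as the note above requires. The name f is illustrative, and the @bodo.jit decorator is assumed for Bodo use.

```python
import pandas as pd

# Under Bodo this function would carry @bodo.jit (assumption).
def f():
    # start, periods, and freq are given; end is left out
    return pd.date_range(start="2021-01-01", periods=4, freq="D")

out = f()
print(out)
```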
pd.timedelta_range
pandas.timedelta_range(start=None, end=None, periods=None, freq=None, name=None, closed=None)
Supported Arguments
| argument | datatypes | other requirements |
|---|---|---|
| start | String or Timedelta | |
| end | String or Timedelta | |
| periods | Integer | |
| freq | String | Must be a valid Pandas frequency. |
| name | String | |
| closed | String and one of ('left', 'right') | |
Note
- Exactly three of start, end, periods, and freq must be provided.
- This function is not parallelized yet.
Example Usage
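A minimal sketch in plain pandas that provides exactly three of the four range arguments (start, end, and periods), as the note above requires. The name f is illustrative, and the @bodo.jit decorator is assumed for Bodo use.

```python
import pandas as pd

# Under Bodo this function would carry @bodo.jit (assumption).
def f():
    # start, end, and periods are given; freq is left out, so the points
    # are spaced evenly between start and end
    return pd.timedelta_range(start="1 day", end="3 days", periods=3)

out = f()
print(out)
```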