DataFrame API¶
bodo.pandas.BodoDataFrame.apply¶
BodoDataFrame.apply(
func,
axis=0,
raw=False,
result_type=None,
args=(),
by_row="compat",
engine="python",
engine_kwargs=None,
**kwargs,
) -> BodoSeries
Apply a function along an axis of the BodoDataFrame.
Currently, only functions that return a scalar value for each row (i.e. axis=1) are supported.
All other uses fall back to pandas.
See pandas.DataFrame.apply for more details.
Note
Calling BodoDataFrame.apply immediately executes a plan to generate a small sample of the BodoDataFrame, then calls pandas.DataFrame.apply on the sample to infer output types before proceeding with lazy evaluation.
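The note describes a two-phase pattern: run the function eagerly on a small sample to learn the output type, then defer the full computation. The following is a toy pure-Python model of that pattern, not Bodo's implementation. `LazyRowApply`, `sample_size`, and `execute` are hypothetical names, and rows are modeled as plain dicts.

```python
from typing import Any, Callable


class LazyRowApply:
    """Toy model (hypothetical): infer the output type from a sample, defer the rest."""

    def __init__(
        self,
        rows: list[dict[str, Any]],
        func: Callable[[dict[str, Any]], Any],
        sample_size: int = 2,
    ):
        self.rows = rows
        self.func = func
        # Eager step: call func on a few rows only, to learn the result type.
        sample = [func(row) for row in rows[:sample_size]]
        self.output_type = type(sample[0]) if sample else None
        self.executed = False

    def execute(self) -> list[Any]:
        # Deferred step: the full computation runs only when results are needed.
        self.executed = True
        return [self.func(row) for row in self.rows]


rows = [{"a": 1, "b": 4}, {"a": 2, "b": 5}, {"a": 3, "b": 6}]
plan = LazyRowApply(rows, lambda row: row["a"] + 1)
print(plan.output_type)  # known before the full run
print(plan.executed)     # still False at this point
print(plan.execute())
```

In this model the function runs on the sampled rows twice, once during inference and once during execution, so it may be safest to keep functions passed to apply free of side effects.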
Parameters
- func : function: Function to apply to each row.
- axis : {0 or 1}, default 0: The axis to apply the function over. axis=0 will fall back to pandas.DataFrame.apply.
- args : tuple: Additional positional arguments to pass to func.
- **kwargs: Additional keyword arguments to pass to func.
- All other parameters will trigger a fallback to pandas.DataFrame.apply if a non-default value is provided.
Returns
- BodoSeries: The result of applying func to each row in the BodoDataFrame.
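The fallback rules listed above can be summarized as a small predicate. This is a minimal sketch of the documented rules only, not Bodo's dispatch code. `stays_in_bodo` and `APPLY_DEFAULTS` are hypothetical names, and the sketch does not model the additional requirement that func return a scalar per row.

```python
# Defaults copied from the BodoDataFrame.apply signature in this section.
APPLY_DEFAULTS = {
    "raw": False,
    "result_type": None,
    "by_row": "compat",
    "engine": "python",
    "engine_kwargs": None,
}


def stays_in_bodo(axis=0, **options):
    """Hypothetical helper: True if the documented rules avoid the pandas fallback."""
    if axis != 1:
        return False  # axis=0 falls back to pandas.DataFrame.apply
    # Any listed option set to a non-default value triggers the fallback.
    # Names not listed here are treated as keyword arguments meant for func.
    return all(
        options.get(name, default) == default
        for name, default in APPLY_DEFAULTS.items()
    )


print(stays_in_bodo(axis=1))
print(stays_in_bodo(axis=1, raw=True))
print(stays_in_bodo())
```

Note that the default axis=0 already falls back, so a call that should avoid the fallback has to pass axis=1 explicitly.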
Example
import pandas as pd
import bodo.pandas as bodo_pd
df = pd.DataFrame(
{
"a": pd.array([1, 2, 3] * 4, "Int64"),
"b": pd.array([4, 5, 6] * 4, "Int64"),
"c": ["a", "b", "c"] * 4,
},
)
bdf = bodo_pd.from_pandas(df)
out_bodo = bdf.apply(lambda x: x["a"] + 1, axis=1)
print(type(out_bodo))
print(out_bodo)
Output:
<class 'bodo.pandas.series.BodoSeries'>
0 2
1 3
2 4
3 2
4 3
5 4
6 2
7 3
8 4
9 2
10 3
11 4
dtype: int64[pyarrow]
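The args and **kwargs parameters forward extra arguments to func after the row itself. The sketch below models that calling convention in plain Python so it runs without Bodo installed. `apply_rows` is a hypothetical stand-in for a row-wise apply, `scaled` is an illustrative function, and rows are modeled as dicts.

```python
def apply_rows(rows, func, args=(), **kwargs):
    """Hypothetical stand-in for a row-wise apply over dict-shaped rows."""
    # The row is passed first, followed by the extra positional and keyword arguments.
    return [func(row, *args, **kwargs) for row in rows]


def scaled(row, offset, scale=1):
    # Returns one scalar per row, which is the supported shape for axis=1.
    return (row["a"] + offset) * scale


rows = [{"a": 1, "b": 4}, {"a": 2, "b": 5}, {"a": 3, "b": 6}]
result = apply_rows(rows, scaled, args=(1,), scale=10)
print(result)
```

Against a BodoDataFrame, the corresponding call would presumably be bdf.apply(scaled, axis=1, args=(1,), scale=10), given the parameters documented above.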
bodo.pandas.BodoDataFrame.head¶
Returns the first n rows of the BodoDataFrame.
Parameters
- n : int, default 5: Number of rows to select.
Returns
- BodoDataFrame
Example
import bodo
import bodo.pandas as bodo_pd
import pandas as pd
original_df = pd.DataFrame(
{"foo": range(15), "bar": range(15, 30)}
)
@bodo.jit
def write_parquet(df):
df.to_parquet("example.pq")
write_parquet(original_df)
restored_df = bodo_pd.read_parquet("example.pq")
restored_df_head = restored_df.head(2)
print(type(restored_df_head))
print(restored_df_head)
Output:
<class 'bodo.pandas.frame.BodoDataFrame'>
foo bar
0 0 15
1 1 16
Setting DataFrame Columns¶
Bodo DataFrames support setting columns lazily when the value is a Series created from the same DataFrame or a constant value. All other cases fall back to pandas.
Examples
import bodo.pandas as bodo_pd
import pandas as pd
df = pd.DataFrame(
{
"A": pd.array([1, 2, 3, 7] * 3, "Int64"),
"B": ["A1", "B1 ", "C1", "Abc"] * 3,
"C": pd.array([4, 5, 6, -1] * 3, "Int64"),
}
)
bdf = bodo_pd.from_pandas(df)
bdf["D"] = bdf["B"].str.lower()
print(type(bdf))
print(bdf.D)
Output:
<class 'bodo.pandas.frame.BodoDataFrame'>
0 a1
1 b1
2 c1
3 abc
4 a1
5 b1
6 c1
7 abc
8 a1
9 b1
10 c1
11 abc
Name: D, dtype: string
import bodo.pandas as bodo_pd
import pandas as pd
df = pd.DataFrame(
{
"A": pd.array([1, 2, 3, 7] * 3, "Int64"),
"B": ["A1", "B1 ", "C1", "Abc"] * 3,
"C": pd.array([4, 5, 6, -1] * 3, "Int64"),
}
)
bdf = bodo_pd.from_pandas(df)
bdf["D"] = 11
print(type(bdf))
print(bdf.D)
Output:
<class 'bodo.pandas.frame.BodoDataFrame'>
0 11
1 11
2 11
3 11
4 11
5 11
6 11
7 11
8 11
9 11
10 11
11 11
Name: D, dtype: int64[pyarrow]
bodo.pandas.BodoDataFrame.sort_values¶
BodoDataFrame.sort_values(
by: IndexLabel,
*,
axis: Axis = 0,
ascending: bool | list[bool] | tuple[bool, ...] = True,
inplace: bool = False,
kind: SortKind = "quicksort",
na_position: str | list[str] | tuple[str, ...] = "last",
ignore_index: bool = False,
key: ValueKeyFunc | None = None,
) -> BodoDataFrame
Parameters
- by : str or list of str: Name or list of column names to sort by.
- ascending : bool or list of bool, default True: Sort ascending vs. descending. Specify a list for multiple sort orders. If this is a list of bools, its length must match the length of by.
- na_position : str {'first', 'last'} or list of str, default 'last': 'first' puts NaNs at the beginning; 'last' puts NaNs at the end. Specify a list for a different NaN order per key. If this is a list of strings, its length must match the length of by.
- All other parameters will trigger a fallback to pandas.DataFrame.sort_values if a non-default value is provided.
Returns
- BodoDataFrame
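The interaction between ascending and na_position is easy to misread. In pandas the null position is independent of the sort direction, and the sketch below assumes Bodo follows the same rule. It is a plain-Python model for a single key, with None standing in for NaN. `sort_with_nulls` is a hypothetical helper, not part of either library.

```python
def sort_with_nulls(values, ascending=True, na_position="last"):
    """Sort one key, placing missing values independently of direction."""
    if na_position not in ("first", "last"):
        raise ValueError("na_position must be 'first' or 'last'")
    # Only non-missing values take part in the ordering itself.
    present = sorted((v for v in values if v is not None), reverse=not ascending)
    nulls = [v for v in values if v is None]
    return nulls + present if na_position == "first" else present + nulls


data = [3, None, 1, 2]
print(sort_with_nulls(data))
print(sort_with_nulls(data, ascending=False))
print(sort_with_nulls(data, ascending=False, na_position="first"))
```

With the default na_position, the missing value stays at the end whether the sort is ascending or descending.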
Example
import bodo.pandas as bodo_pd
import pandas as pd
df = pd.DataFrame(
{
"A": pd.array([1, 2, 3, 7] * 3, "Int64"),
"B": ["A1", "B1", "C1", "Abc"] * 3,
"C": pd.array([6, 5, 4] * 4, "Int64"),
}
)
bdf = bodo_pd.from_pandas(df)
bdf_sorted = bdf.sort_values(by=["A", "C"], ascending=[False, True])
print(bdf_sorted)
Output:
A B C
0 7 Abc 4
1 7 Abc 5
2 7 Abc 6
3 3 C1 4
4 3 C1 5
5 3 C1 6
6 2 B1 4
7 2 B1 5
8 2 B1 6
9 1 A1 4
10 1 A1 5
11 1 A1 6
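To make the row order in this output easier to verify, the following plain-Python sketch reproduces the same ordering on the example data without Bodo. It relies on the stability of Python's sort and is only a model of the semantics, not of how Bodo sorts. `sort_rows` is a hypothetical helper, and columns are addressed by tuple position rather than by name.

```python
from operator import itemgetter

# Same data as the example above, as (A, B, C) tuples.
A = [1, 2, 3, 7] * 3
B = ["A1", "B1", "C1", "Abc"] * 3
C = [6, 5, 4] * 4
rows = list(zip(A, B, C))


def sort_rows(rows, by, ascending):
    """Sort tuples by several column positions, each with its own direction."""
    out = list(rows)
    # Python's sort is stable, so sorting by the last key first and the
    # first key last yields the combined multi-key order.
    for col, asc in reversed(list(zip(by, ascending))):
        out.sort(key=itemgetter(col), reverse=not asc)
    return out


# Position 0 is column A (descending), position 2 is column C (ascending).
result = sort_rows(rows, by=[0, 2], ascending=[False, True])
for row in result[:4]:
    print(row)
```

The first rows come out as the A=7 group ordered by C, followed by the A=3 group, which matches the output shown above.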