Skip to content

Bodo Pandas API (Bodo DataFrame Library)

The Bodo DataFrame Library is designed to accelerate and scale Pandas workflows with just a one-line change — simply replace:

import pandas as pd

with

import bodo.pandas as pd

and your existing code can immediately take advantage of high-performance, scalable execution.

Key features include:

  • Full Pandas compatibility with a transparent fallback mechanism to native Pandas, ensuring that your workflows continue uninterrupted even if a feature is not yet supported.

  • Advanced query optimization such as filter pushdown, column pruning and join reordering behind the scenes.

  • Scalable MPI-based execution, leveraging High-Performance Computing (HPC) techniques for efficient parallelism; whether you're working on a laptop or running jobs across a large cloud cluster.

  • Vectorized execution with streaming and spill-to-disk capabilities, making it possible to process datasets larger than memory reliably.

Warning

Bodo DataFrame Library is under active development and is currently considered experimental. Some features and APIs may not yet be fully supported. We welcome your feedback — please join our community Slack or open an issue on our GitHub if you encounter any problems!

Lazy Evaluation and Fallback to Pandas

Bodo DataFrame Library operates with lazy evaluation to allow query optimization, meaning operations are recorded into a query plan rather than executed immediately. Execution is automatically triggered only when results are actually needed, such as when displaying a DataFrame df with print(df).

If the user code encounters an unsupported Pandas API or an unsupported parameter, Bodo DataFrame library gracefully falls back to native Pandas. When this happens, the current query plan of the DataFrame is immediately executed, the resulting data is collected onto a single core and converted to a Pandas DataFrame, and further operations proceed using Pandas.

Warning

Fallback to Pandas may lead to degraded performance and increase the risk of out-of-memory (OOM) errors, especially for large datasets.

General Functions

bodo.pandas.from_pandas

bodo.pandas.from_pandas(df: pandas.DataFrame) -> BodoDataFrame

Converts a Pandas DataFrame into an equivalent BodoDataFrame.

Parameters

df : pandas.DataFrame: The Pandas DataFrame to use as data source.

Returns

BodoDataFrame

Example

import pandas as pd
import bodo.pandas as bodo_pd

df = pd.DataFrame(
        {
            "a": [1, 2, 3, 7] * 3,
            "b": [4, 5, 6, 8] * 3,
            "c": ["a", "b", None, "abc"] * 3,
        },
    )

bdf = bodo_pd.from_pandas(df)
print(type(bdf))
print(bdf)

Output:

<class 'bodo.pandas.frame.BodoDataFrame'>
    a  b     c
0   1  4     a
1   2  5     b
2   3  6  <NA>
3   7  8   abc
4   1  4     a
5   2  5     b
6   3  6  <NA>
7   7  8   abc
8   1  4     a
9   2  5     b
10  3  6  <NA>
11  7  8   abc


Input/Output

bodo.pandas.read_parquet

bodo.pandas.read_parquet(
    path,
    engine="auto",
    columns=None,
    storage_options=None,
    use_nullable_dtypes=lib.no_default,
    dtype_backend=lib.no_default,
    filesystem=None,
    filters=None,
    **kwargs,
) -> BodoDataFrame

Creates a BodoDataFrame object for reading from parquet file(s) lazily.

Parameters

path : str, list[str]: Location of the parquet file(s) to read. Refer to pandas.read_parquet for more details. The type of this argument differs from Pandas.

All other parameters will trigger a fallback to pandas.read_parquet if a non-default value is provided.

Returns

BodoDataFrame

Example

import bodo
import bodo.pandas as bodo_pd
import pandas as pd

original_df = pd.DataFrame(
    {"foo": range(15), "bar": range(15, 30)}
   )

@bodo.jit
def write_parquet(df):
    df.to_parquet("example.pq")

write_parquet(original_df)

restored_df = bodo_pd.read_parquet("example.pq")
print(type(restored_df))
print(restored_df.head())

Output:

<class 'bodo.pandas.frame.BodoDataFrame'>
   foo  bar
0    0   15
1    1   16
2    2   17
3    3   18
4    4   19

DataFrame API

bodo.pandas.BodoDataFrame.apply

BodoDataFrame.apply(
        func,
        axis=0,
        raw=False,
        result_type=None,
        args=(),
        by_row="compat",
        engine="python",
        engine_kwargs=None,
        **kwargs,
    ) -> BodoSeries

Apply a function along an axis of the BodoDataFrame.

Currently only supports applying a function that returns a scalar value for each row (i.e. axis=1). All other uses will fall back to Pandas. See pandas.DataFrame.apply for more details.

Note

Calling BodoDataFrame.apply will immediately execute a plan to generate a small sample of the BodoDataFrame and then call pandas.DataFrame.apply on the sample to infer output types before proceeding with lazy evaluation.

Parameters

func : function: Function to apply to each row.

axis : {0 or 1}, default 0: The axis to apply the function over. axis=0 will fall back to pandas.DataFrame.apply.

All other parameters will trigger a fallback to pandas.DataFrame.apply if a non-default value is provided.

Returns

BodoSeries: The result of applying func to each row in the BodoDataFrame.

Example

import pandas as pd
import bodo.pandas as bodo_pd

df = pd.DataFrame(
        {
            "a": pd.array([1, 2, 3] * 4, "Int64"),
            "b": pd.array([4, 5, 6] * 4, "Int64"),
            "c": ["a", "b", "c"] * 4,
        },
    )
bdf = bodo_pd.from_pandas(df)
out_bodo = bdf.apply(lambda x: x["a"] + 1, axis=1)

print(type(out_bodo))
print(out_bodo)

Output:

<class 'bodo.pandas.series.BodoSeries'>
0     2
1     3
2     4
3     2
4     3
5     4
6     2
7     3
8     4
9     2
10    3
11    4
dtype: int64[pyarrow]


bodo.pandas.BodoDataFrame.head

BodoDataFrame.head(n=5) -> BodoDataFrame

Returns the first n rows of the BodoDataFrame.

Parameters

n : int, default 5: Number of rows to select.

Returns

BodoDataFrame

Example

import bodo
import bodo.pandas as bodo_pd
import pandas as pd

original_df = pd.DataFrame(
    {"foo": range(15), "bar": range(15, 30)}
   )

@bodo.jit
def write_parquet(df):
    df.to_parquet("example.pq")

write_parquet(original_df)

restored_df = bodo_pd.read_parquet("example.pq")
restored_df_head = restored_df.head(2)
print(type(restored_df_head))
print(restored_df_head)

Output:

<class 'bodo.pandas.frame.BodoDataFrame'>
   foo  bar
0    0   15
1    1   16


Setting BodoDataFrame Columns

BodoDataFrames support setting columns lazily when the value is a projection from the same DataFrame. Other cases will fallback to Pandas.

Example

import bodo.pandas as bodo_pd
import pandas as pd

df = pd.DataFrame(
        {
            "A": pd.array([1, 2, 3, 7] * 3, "Int64"),
            "B": ["A1", "B1 ", "C1", "Abc"] * 3,
            "C": pd.array([4, 5, 6, -1] * 3, "Int64"),
        }
    )

bdf = bodo_pd.from_pandas(df)

bdf["D"] = bdf["B"].str.lower()
print(type(bdf))
print(bdf.D)

Output:

<class 'bodo.pandas.frame.BodoDataFrame'>
0      a1
1     b1
2      c1
3     abc
4      a1
5     b1
6      c1
7     abc
8      a1
9     b1
10     c1
11    abc
Name: D, dtype: string


Series API

bodo.pandas.BodoSeries.head

BodoSeries.head(n=5) -> BodoSeries

Returns the first n rows of the BodoSeries.

Parameters

n : int, default 5: Number of elements to select.

Returns

BodoSeries

Example

import bodo.pandas as bodo_pd
import pandas as pd

df = pd.DataFrame(
        {
            "A": pd.array([1, 2, 3, 7] * 3, "Int64"),
        }
    )

bdf = bodo_pd.from_pandas(df)
bodo_ser_head = bdf.A.head(3)
print(type(bodo_ser_head))
print(bodo_ser_head)

Output:

<class 'pandas.core.series.Series'>
0    1
1    2
2    3
Name: A, dtype: Int64

bodo.pandas.BodoSeries.map

BodoSeries.map(arg, na_action=None) -> BodoSeries
Map values of a BodoSeries according to a mapping.

Note

Calling BodoSeries.map will immediately execute a plan to generate a small sample of the BodoSeries and then call pandas.Series.map on the sample to infer output types before proceeding with lazy evaluation.

Parameters

arg : function, collections.abc.Mapping subclass or Series: Mapping correspondence.

na_actions: will fall back to pandas.Series.map if 'ignore' is provided.

Returns

BodoSeries

Example

import bodo.pandas as bodo_pd
import pandas as pd

df = pd.DataFrame(
    {
        "A": pd.array([1, 2, 3, 7] * 3, "Int64"),
        "B": ["A1", "B1", "C1", "Abc"] * 3,
        "C": pd.array([4, 5, 6, -1] * 3, "Int64"),
    }
)

bdf = bodo_pd.from_pandas(df)
bodo_ser = bdf.A.map(lambda x: x ** 2)
print(type(bodo_ser))
print(bodo_ser)

Output:

<class 'bodo.pandas.series.BodoSeries'>
0      1
1      4
2      9
3     49
4      1
5      4
6      9
7     49
8      1
9      4
10     9
11    49
Name: A, dtype: int64[pyarrow]


bodo.pandas.BodoSeries.str.lower

BodoSeries.str.lower() -> BodoSeries
Converts strings in a BodoSeries to lowercase. Equivalent to str.lower().

Returns

BodoSeries

Example

import bodo.pandas as bodo_pd
import pandas as pd

df = pd.DataFrame(
        {
            "A": ["A1", "B1", "C1", "Abc"] * 3,
        }
    )

bdf = bodo_pd.from_pandas(df)
bodo_ser = bdf.A.str.lower()
print(type(bodo_ser))
print(bodo_ser)

Output:

<class 'bodo.pandas.series.BodoSeries'>
0      a1
1      b1
2      c1
3     abc
4      a1
5      b1
6      c1
7     abc
8      a1
9      b1
10     c1
11    abc
Name: A, dtype: string

bodo.pandas.BodoSeries.str.strip

BodoSeries.str.strip(to_strip=None) -> BodoSeries
Remove leading and trailing characters. Equivalent to str.strip().

Parameters

to_strip: Will fall back to pandas.Series.str.strip if a value other than None is provided.

Returns

BodoSeries

Example

import bodo.pandas as bodo_pd
import pandas as pd

df = pd.DataFrame(
        {
            "A": [" \t A1\n", "\n\nB1 \t", "C1\n", "\t\nAbc"] * 3,
        }
    )

bdf = bodo_pd.from_pandas(df)
bodo_ser = bdf.A.str.strip()
print(type(bodo_ser))
print(bodo_ser)

Output:

<class 'bodo.pandas.series.BodoSeries'>
0      A1
1      B1
2      C1
3     Abc
4      A1
5      B1
6      C1
7     Abc
8      A1
9      B1
10     C1
11    Abc
Name: A, dtype: string