Bodo Pandas API (Bodo DataFrame Library)¶

The Bodo DataFrame Library is designed to accelerate and scale Pandas workflows with just a one-line change — simply replace:

import pandas as pd

with

import bodo.pandas as pd

and your existing code can immediately take advantage of high-performance, scalable execution.

Key features include:

Full Pandas compatibility with a transparent fallback mechanism to native Pandas, ensuring that your workflows continue uninterrupted even if a feature is not yet supported.
Advanced query optimization such as filter pushdown, column pruning and join reordering behind the scenes.
Scalable MPI-based execution, leveraging High-Performance Computing (HPC) techniques for efficient parallelism; whether you're working on a laptop or running jobs across a large cloud cluster.
Vectorized execution with streaming and spill-to-disk capabilities, making it possible to process datasets larger than memory reliably.

Warning

Bodo DataFrame Library is under active development and is currently considered experimental. Some features and APIs may not yet be fully supported. We welcome your feedback — please join our community Slack or open an issue on our GitHub if you encounter any problems!

Lazy Evaluation and Fallback to Pandas¶

Bodo DataFrame Library operates with lazy evaluation to allow query optimization, meaning operations are recorded into a query plan rather than executed immediately. Execution is automatically triggered only when results are actually needed, such as when displaying a DataFrame df with print(df).

If the user code encounters an unsupported Pandas API or an unsupported parameter, Bodo DataFrame library gracefully falls back to native Pandas. When this happens, the current query plan of the DataFrame is immediately executed, the resulting data is collected onto a single core and converted to a Pandas DataFrame, and further operations proceed using Pandas.

Warning

Fallback to Pandas may lead to degraded performance and increase the risk of out-of-memory (OOM) errors, especially for large datasets.