Installing Bodo Community Edition

Bodo can be installed as a Python package using the conda command (See How to Install Conda). We recommend creating a conda environment and installing Bodo and its dependencies in it as shown below:

conda create -n Bodo python
conda activate Bodo
conda install bodo -c -c conda-forge

Bodo uses MPI for parallelization, which is automatically installed as part of the conda install command above. This command installs Bodo Community Edition by default, which is free and works on up to 4 cores. For information on Bodo Enterprise Edition and pricing, please contact us .

How to Install Conda

Install Conda using the instructions below.

On Linux:

wget -O
chmod +x
./ -b
export PATH=$HOME/miniconda3/bin:$PATH

On macOS:

curl -L -o
chmod +x
./ -b
export PATH=$HOME/miniconda3/bin:$PATH

On Windows:

start /wait "" Miniconda3-latest-Windows-x86_64.exe /InstallationType=JustMe /RegisterPython=0 /S /D=%UserProfile%\Miniconda3

Open the Anaconda Prompt (click Start, select Anaconda Prompt). You may use other Terminals if you have already added Anaconda to your PATH.

Optional Dependencies

Some Bodo functionality may require other dependencies, as summarized in the table below. All optional dependencies except Hadoop can be installed using the commands conda install gcsfs sqlalchemy hdf5='*=*mpich*' openjdk -c conda-forge and pip install deltalake.



pd.read_sql / df.to_sql



hdf5 (MPI version)



Delta Lake



hadoop (only the Hadoop client is needed)

Testing your Installation

Once you have activated your conda environment and installed Bodo in it, you can test it using the example program below. This program has two functions:

  • The function gen_data creates a sample dataset with 20,000 rows and writes to a parquet file called example1.pq.

  • The function test reads example1.pq and performs multiple computations on it.

import bodo
import pandas as pd
import numpy as np
import time

def gen_data():
    NUM_GROUPS = 30
    NUM_ROWS = 20_000_000
    df = pd.DataFrame({
        "A": np.arange(NUM_ROWS) % NUM_GROUPS,
        "B": np.arange(NUM_ROWS)

def test():
    df = pd.read_parquet("example1.pq")
    t0 = time.time()
    df2 = df.groupby("A")["B"].agg(
        (lambda a: (a==1).sum(), lambda a: (a==2).sum(), lambda a: (a==3).sum())
    m = df2.mean()
    print("Result:", m, "\nCompute time:", time.time() - t0, "secs")


Save this code in a file called, and run it on a single core as follows:


Alternatively, to run the code on four cores, you can use mpiexec:

$ mpiexec -n 4 python

You may need to delete example1.pq between consecutive runs.

Jupyter Notebook Setup

To use Bodo with Jupyter Notebook, install jupyter and ipyparallel in your Bodo conda environment:

conda install jupyter ipyparallel mpi4py -c conda-forge

Create an MPI profile for IPython:

ipython profile create --parallel --profile=mpi

Edit the ~/.ipython/profile_mpi/ file and add the following line:

c.IPClusterEngines.engine_launcher_class = 'MPIEngineSetLauncher'

Start the Jupyter notebook in your Bodo conda environment:

jupyter notebook

Go to the IPython Clusters tab, select the number of engines (i.e., cores) you’d like to use, and click Start next to the mpi profile. Alternatively, you can use ipcluster start -n 4 --profile=mpi in a terminal to start the engines. Initialization of the engines can take several seconds.

Now you can start a new notebook and run the following code in a cell to set up the environment:

import ipyparallel as ipp

c = ipp.Client(profile='mpi')
view = c[:]
view.block = True  # equivalent to running with %%px --block

# Set the working directory:
import os
view["cwd"] = os.getcwd()
%px cd $cwd

This should complete without any errors. An error may appear if the cluster is not initialized yet (usually NoEnginesRegistered). In this case, wait a few seconds and try again.

To run Bodo functions on the execution engines, you use ipyparallel hooks such as %%px magic. For example, run this code in a cell:

import bodo
import numpy as np
import time

def calc_pi(n):
    t1 = time.time()
    x = 2 * np.random.ranf(n) - 1
    y = 2 * np.random.ranf(n) - 1
    pi = 4 * np.sum(x**2 + y**2 < 1) / n
    print("Execution time:", time.time()-t1, "\nresult:", pi)

calc_pi(2 * 10**8)

If you wish to run across multiple nodes, you can add the following to

c.MPILauncher.mpi_args = ["-machinefile", "path_to_file/machinefile"]

Where machinefile (or hostfile) is a file with the hostnames of available nodes that MPI can use. You can find more information about machinefiles here.

It is important to note that other MPI systems and launchers (such as QSUB/PBS) may use a different user interface for the allocation of computational nodes.