Installing Bodo Community Edition

Bodo can be installed using the conda command (see how to install conda below). If you install Bodo through conda, we recommend creating a conda environment and installing Bodo and its dependencies in it, as shown below:

conda create -n Bodo python=3.9 mamba -c conda-forge
conda activate Bodo
mamba install bodo -c -c conda-forge

mamba is a drop-in replacement for conda that uses the same commands and configuration but is much faster.

Bodo uses MPI for parallelization, which is automatically installed as part of the conda install command above. This command installs Bodo Community Edition by default, which is free and works on up to 8 cores. For information on Bodo Enterprise Edition and pricing, please contact us.

How to Install Conda

Install Conda using the instructions below.

On Linux

wget -O
chmod +x
./ -b
export PATH=$HOME/miniconda3/bin:$PATH

On MacOS

curl -L -o
chmod +x
./ -b
export PATH=$HOME/miniconda3/bin:$PATH

On Windows

start /wait "" Miniconda3-latest-Windows-x86_64.exe /InstallationType=JustMe /RegisterPython=0 /S /D=%UserProfile%\Miniconda3

Open the Anaconda Prompt to use Bodo (click Start, select Anaconda Prompt). You may use other terminals if you have already added Anaconda to your PATH.
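
To confirm that a given terminal can see the tools used in this guide, the PATH lookup can be checked from Python. The snippet below is a minimal sketch using only the standard library; the helper name find_tools is hypothetical and is not part of Bodo or conda.

```python
import shutil


def find_tools(names=("conda", "mamba", "mpiexec")):
    """Map each command name to its full path on PATH, or None if not found."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools().items():
        print(f"{name}: {path or 'not found on PATH'}")
```

If conda is reported as not found, the PATH export shown above (Linux and macOS) or the Anaconda Prompt (Windows) is the likely fix.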

Optional Dependencies

Some Bodo functionality may require other dependencies, as summarized in the table below. All optional dependencies except Hadoop can be installed using the following commands:

conda install gcsfs sqlalchemy snowflake-connector-python hdf5='1.10.*=*mpich*' openjdk -c conda-forge


pip install deltalake

mamba is also useful if conda install commands are taking a long time to execute:

conda install mamba -c conda-forge

Functionality              Dependency
-------------------------  ---------------------------------------------
pd.read_sql / df.to_sql    sqlalchemy
Snowflake I/O              snowflake-connector-python
GCS I/O                    gcsfs
Delta Lake                 deltalake
HDFS or ADLS Gen2          hadoop (only the Hadoop client is needed)
HDF5                       hdf5 (MPI version)
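
To see which of the Python-level optional dependencies are present in the active environment, a quick check along the following lines may help. This is a sketch, not a Bodo utility: the function names are hypothetical, and the import names (for example snowflake.connector for snowflake-connector-python) are the usual ones for these packages. Hadoop and HDF5 are left out because they are not checked through a Python import here.

```python
import importlib.util

# Import names assumed for the packages listed in the table.
OPTIONAL_MODULES = {
    "pd.read_sql / df.to_sql": "sqlalchemy",
    "Snowflake I/O": "snowflake.connector",
    "GCS I/O": "gcsfs",
    "Delta Lake": "deltalake",
}


def is_importable(module_name):
    """Return True if Python can locate the module in this environment."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package (e.g. "snowflake") is missing.
        return False


def check_optional_dependencies(modules=OPTIONAL_MODULES):
    """Map each functionality to whether its dependency is installed."""
    return {feature: is_importable(mod) for feature, mod in modules.items()}


if __name__ == "__main__":
    for feature, found in check_optional_dependencies().items():
        print(f"{feature}: {'installed' if found else 'missing'}")
```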

Testing your Installation

Once you have activated your conda environment and installed Bodo in it, you can test it using the example program below. This program has two functions:

  • The function gen_data creates a sample dataset with 20 million rows and writes it to a parquet file called example1.pq.
  • The function test reads example1.pq and performs multiple computations on it.

import bodo
import pandas as pd
import numpy as np
import time

# @bodo.jit compiles the function so Bodo can run it in parallel.
@bodo.jit
def gen_data():
    NUM_GROUPS = 30
    NUM_ROWS = 20_000_000
    df = pd.DataFrame({
        "A": np.arange(NUM_ROWS) % NUM_GROUPS,
        "B": np.arange(NUM_ROWS)
    })
    # Write the sample dataset to disk.
    df.to_parquet("example1.pq")

@bodo.jit
def test():
    df = pd.read_parquet("example1.pq")
    t0 = time.time()
    # For each group in A, count how often B equals 1, 2, and 3.
    df2 = df.groupby("A")["B"].agg(
        (lambda a: (a==1).sum(), lambda a: (a==2).sum(), lambda a: (a==3).sum())
    )
    m = df2.mean()
    print("Result:", m, "\nCompute time:", time.time() - t0, "secs")

gen_data()
test()


Save this code in a file called, and run it on a single core as follows:


Alternatively, to run the code on eight cores, you can use mpiexec:

mpiexec -n 8 python
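
Since Community Edition works on up to 8 cores, the value passed to mpiexec -n should stay within that limit and within what the machine offers. The sketch below shows one way to pick it; pick_process_count is a hypothetical helper, and os.cpu_count() reports logical CPUs, which may be more than the number of physical cores.

```python
import os

COMMUNITY_EDITION_MAX_CORES = 8  # limit stated in this guide


def pick_process_count(available=None, limit=COMMUNITY_EDITION_MAX_CORES):
    """Return a process count of at least 1, capped by `available` and `limit`."""
    if available is None:
        available = os.cpu_count() or 1  # cpu_count() may return None
    return max(1, min(available, limit))


if __name__ == "__main__":
    print("Suggested value for mpiexec -n:", pick_process_count())
```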


You may need to delete example1.pq between consecutive runs.
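
The cleanup can be scripted. The sketch below removes a previous example1.pq whether it is a single file or a directory; handling the directory case is an assumption that a parallel run may write the dataset as a folder of part files. The helper name remove_output is hypothetical.

```python
import shutil
from pathlib import Path


def remove_output(path="example1.pq"):
    """Delete a previous output, whether it is a file or a directory.

    Returns True if something was removed, False if nothing existed.
    """
    target = Path(path)
    if target.is_dir() and not target.is_symlink():
        shutil.rmtree(target)
        return True
    if target.exists() or target.is_symlink():
        target.unlink()
        return True
    return False


if __name__ == "__main__":
    print("Removed previous output:", remove_output())
```

Run this as a separate step before launching the example, rather than inside the program itself, so that parallel processes started by mpiexec do not all try to delete the same path.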