Installing Bodo Community Edition
Bodo can be installed using the conda command (see how to install conda below). We recommend creating a conda environment and installing Bodo and its dependencies in it as shown below:
conda create -n Bodo python=3.12 -c conda-forge
conda activate Bodo
conda install bodo -c bodo.ai -c conda-forge
Bodo uses MPI for parallelization, which is automatically installed as part of the conda install command above. This command installs Bodo Community Edition by default, which is free and works on up to 8 cores. For information on Bodo Enterprise Edition and pricing, please contact us.
How to Install Conda
Install Conda using the instructions below.
On Linux
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
chmod +x miniconda.sh
./miniconda.sh -b
export PATH=$HOME/miniconda3/bin:$PATH
On macOS
curl https://repo.continuum.io/miniconda/Miniconda3-latest-MacOSX-x86_64.sh -L -o miniconda.sh
chmod +x miniconda.sh
./miniconda.sh -b
export PATH=$HOME/miniconda3/bin:$PATH
On Windows
start /wait "" Miniconda3-latest-Windows-x86_64.exe /InstallationType=JustMe /RegisterPython=0 /S /D=%UserProfile%\Miniconda3
Open the Anaconda Prompt to use Bodo (click Start, select Anaconda Prompt). You may use other terminals if you have already added Anaconda to your PATH.
Optional Dependencies
Some Bodo functionality may require other dependencies, as summarized in the table below. All optional dependencies except Hadoop can be installed using the commands
conda install gcsfs sqlalchemy snowflake-connector-python hdf5='1.14.*=*mpich*' openjdk -c conda-forge
and
Functionality | Dependency |
---|---|
pd.read_sql / df.to_sql | sqlalchemy |
Snowflake I/O | snowflake-connector-python |
GCS I/O | gcsfs |
Delta Lake | deltalake |
HDFS or ADLS Gen2 | hadoop (only the Hadoop client is needed) |
HDF5 | hdf5 (MPI version) |
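To see which of the Python-level optional dependencies are already present in the active environment, a small check like the one below may help. It is a minimal sketch, not part of Bodo: the helper names `is_importable` and `check_optional_dependencies` are hypothetical, and the import names (for example `snowflake.connector` for `snowflake-connector-python`) are assumptions based on the usual module names for these packages. Hadoop and HDF5 are not Python modules, so they are left out.

```python
import importlib.util

# Assumption: these are the usual import names for the packages in the table.
# Hadoop and HDF5 are native dependencies and cannot be checked this way.
OPTIONAL_MODULES = {
    "pd.read_sql / df.to_sql": "sqlalchemy",
    "Snowflake I/O": "snowflake.connector",
    "GCS I/O": "gcsfs",
    "Delta Lake": "deltalake",
}


def is_importable(module_name):
    """Return True if module_name can be located without fully importing it."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package (e.g. `snowflake`) is missing
        # or a module has no usable spec.
        return False


def check_optional_dependencies(modules=OPTIONAL_MODULES):
    """Map each functionality to whether its module appears to be installed."""
    return {feature: is_importable(name) for feature, name in modules.items()}


for feature, available in check_optional_dependencies().items():
    print(f"{feature}: {'installed' if available else 'missing'}")
```

A "missing" entry only matters if you plan to use the corresponding functionality.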
Testing your Installation
Once you have activated your conda environment and installed Bodo in it, you can test it using the example program below. This program has two functions:
- The function gen_data creates a sample dataset with 20 million rows and writes it to a parquet file called example1.pq.
- The function test reads example1.pq and performs multiple computations on it.
import bodo
import pandas as pd
import numpy as np
import time


@bodo.jit
def gen_data():
    # Build a two-column sample DataFrame and write it to a parquet file.
    NUM_GROUPS = 30
    NUM_ROWS = 20_000_000
    df = pd.DataFrame({
        "A": np.arange(NUM_ROWS) % NUM_GROUPS,
        "B": np.arange(NUM_ROWS)
    })
    df.to_parquet("example1.pq")


@bodo.jit
def test():
    df = pd.read_parquet("example1.pq")
    t0 = time.time()
    # Group by column A and apply three aggregations to column B.
    df2 = df.groupby("A")["B"].agg(
        (lambda a: (a==1).sum(), lambda a: (a==2).sum(), lambda a: (a==3).sum())
    )
    m = df2.mean()
    print("Result:", m, "\nCompute time:", time.time() - t0, "secs")


gen_data()
test()
Save this code in a file called example.py, and run it on a single core as follows:
Alternatively, to run the code on four cores, you can use mpiexec:
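The two launches might be assembled as in the sketch below. It assumes a plain interpreter call for a single core and the common `mpiexec -n <count>` form for several cores. The helper name `build_launch_command` is hypothetical, and the exact flags accepted can vary between MPI implementations.

```python
import shlex
import sys


def build_launch_command(script, cores=1, python=sys.executable):
    """Return the argument list for running `script` on `cores` processes."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    if cores == 1:
        # Single core: call the interpreter directly.
        return [python, script]
    # Multiple cores: assumes an `mpiexec` launcher that accepts `-n <count>`.
    return ["mpiexec", "-n", str(cores), python, script]


# Print the commands as they would be typed in a terminal.
print(shlex.join(build_launch_command("example.py", python="python")))
print(shlex.join(build_launch_command("example.py", cores=4, python="python")))
```

The returned list can be passed to `subprocess.run` to start the program from another script.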
Note
You may need to delete example1.pq between consecutive runs.
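Clearing the previous output can be scripted. The sketch below uses only the standard library and handles both cases, on the assumption that example1.pq may exist either as a single file or as a directory of files depending on how it was written. The helper name `remove_output` is hypothetical.

```python
import shutil
from pathlib import Path


def remove_output(path="example1.pq"):
    """Delete a previous output; return True if something was removed."""
    target = Path(path)
    if target.is_dir():
        # Directory of part files: remove it recursively.
        shutil.rmtree(target)
        return True
    if target.exists():
        # Single file.
        target.unlink()
        return True
    # Nothing to clean up.
    return False
```

Calling `remove_output()` before `gen_data()` would start each run from a clean state.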