Bodo works seamlessly with Horovod to support large-scale distributed deep learning with PyTorch and TensorFlow.
You will need to install Horovod and a deep learning framework in your Bodo conda environment. Horovod needs to be compiled and linked with the same MPI library that Bodo uses.
Installing Horovod and PyTorch
Here are simple instructions for installing Horovod and PyTorch in your Bodo environment without CUDA support:
For information on setting up Horovod in a conda environment with CUDA see here.
How it works
The main thing to consider when using GPUs for deep learning is that a Bodo application typically uses all CPU cores on a cluster (one worker process per core), whereas for deep learning only a subset of Bodo workers is pinned to a GPU (one worker per GPU). This means that data processed and generated by Bodo needs to be redistributed to the GPU workers before training.
Bodo can automatically assign a subset of workers in your cluster to GPU devices, initialize Horovod with these workers, and distribute data for deep learning to these workers.
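To make the "one worker per core, one DL worker per GPU" relationship concrete, the following is a minimal pure-Python model of choosing which ranks get pinned to a GPU. It is only an illustration under stated assumptions: the function name `pick_gpu_workers`, the node-by-node rank numbering, and the rule of pinning the first ranks on each node are all hypothetical and do not describe how Bodo actually makes the assignment.

```python
from typing import Dict


def pick_gpu_workers(cores_per_node: int, gpus_per_node: int, num_nodes: int) -> Dict[int, int]:
    """Map a subset of worker ranks to local GPU ids (one worker per GPU).

    Hypothetical model: ranks are numbered node by node, and the first
    `gpus_per_node` ranks on each node are the ones pinned to a GPU.
    """
    if gpus_per_node > cores_per_node:
        raise ValueError("cannot pin more GPU workers than there are cores")
    pinned = {}
    for node in range(num_nodes):
        first_rank = node * cores_per_node  # first worker rank on this node
        for gpu in range(gpus_per_node):
            pinned[first_rank + gpu] = gpu  # rank -> local GPU id
    return pinned


# Two nodes, 8 cores and 2 GPUs each: 16 workers in total, 4 of them DL workers.
assignment = pick_gpu_workers(cores_per_node=8, gpus_per_node=2, num_nodes=2)
print(assignment)  # {0: 0, 1: 1, 8: 0, 9: 1}
```

In this model the remaining 12 ranks still take part in data processing but hold no GPU, which is why data has to be moved to the pinned ranks before training.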
To have Bodo handle worker assignment, Horovod initialization, and data distribution automatically, call bodo.dl.start() before starting training and bodo.dl.prepare_data(X) to distribute the data.
bodo.dl.start(framework)
framework is a string specifying the DL framework to use ("torch" or "tensorflow"). This function must be called before starting deep learning. It initializes Horovod and pins workers to GPUs.
bodo.dl.prepare_data(X)
Redistributes the given data to DL workers.
bodo.dl.end()
Non-DL workers that call this function wait for the DL workers to finish, staying idle in the meantime to free up computational resources for them. Every process must call this function.
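The effect of redistributing data to DL workers can be sketched with a small pure-Python model. This is an assumption-laden illustration only: the helper `redistribute` is hypothetical, it works on in-memory lists rather than distributed arrays, and it says nothing about how Bodo moves data between processes. It shows the end state: the rows keep their global order, DL ranks hold roughly equal shares, and all other ranks hold nothing.

```python
from typing import Dict, Sequence


def redistribute(chunks: Sequence[list], dl_ranks: Sequence[int]) -> Dict[int, list]:
    """Re-split per-worker chunks evenly across the DL ranks only.

    `chunks[i]` is the data held by worker rank i before redistribution.
    Returns a dict of rank -> data held after redistribution.
    """
    if not dl_ranks:
        raise ValueError("at least one DL rank is required")
    rows = [row for chunk in chunks for row in chunk]  # global order preserved
    base, extra = divmod(len(rows), len(dl_ranks))
    out = {rank: [] for rank in range(len(chunks))}  # non-DL ranks end up empty
    start = 0
    for i, rank in enumerate(dl_ranks):
        size = base + (1 if i < extra else 0)  # first `extra` DL ranks get one more row
        out[rank] = rows[start:start + size]
        start += size
    return out


# Four workers hold the data; only ranks 0 and 2 are DL workers.
chunks = [[0, 1, 2], [3, 4, 5], [6, 7], [8, 9]]
result = redistribute(chunks, dl_ranks=[0, 2])
print(result)  # {0: [0, 1, 2, 3, 4], 1: [], 2: [5, 6, 7, 8, 9], 3: []}
```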
The code snippet below shows how deep learning can be integrated in a Bodo application:
    import bodo
    import horovod.torch as hvd  # Horovod bindings for PyTorch


    # Deep learning code in regular Python using Horovod
    def deep_learning(X, y):
        # Note: X and y have already been distributed by Bodo and there is no
        # need to partition data with Horovod
        if hvd.initialized():
            # do deep learning with Horovod and your DL framework of choice
            ...
            use_cuda = bodo.dl.is_cuda_available()
            ...
        else:
            # this rank does not participate in DL (not pinned to GPU)
            pass


    @bodo.jit
    def main():
        ...
        X = ...  # distributed NumPy array generated with Bodo
        y = ...  # distributed NumPy array generated with Bodo
        bodo.dl.start("torch")  # Initialize Horovod with PyTorch
        X = bodo.dl.prepare_data(X)
        y = bodo.dl.prepare_data(y)
        with bodo.objmode:
            deep_learning(X, y)  # DL user code
        bodo.dl.end()
As we can see, the deep learning code is not compiled by Bodo. It runs in Python (in objmode or outside Bodo jitted functions) and must use Horovod. Note that data coming from Bodo has already been partitioned and distributed by Bodo (in bodo.dl.prepare_data), and that you don't have to initialize Horovod.
A full distributed training example with the MNIST dataset can be found here.