How to get started with distributed training of DNNs using Horovod.
Apr 27 · 6 min read
What is Horovod?
Horovod is an open-source distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet. Originally developed by Uber for in-house use, Horovod was open-sourced a couple of years ago and is now an official Linux Foundation AI (LFAI) project.
In this post I describe how I build Conda environments for my deep learning projects when I am using Horovod to enable distributed training across multiple GPUs (either on the same node or spread across multiple nodes). If you like my approach then you can make use of the template repository on GitHub to get started with your next Horovod data science project!
Installing the NVIDIA CUDA Toolkit
The first thing you need to do is install the appropriate version of the NVIDIA CUDA Toolkit on your workstation. I am using NVIDIA CUDA Toolkit 10.1 (documentation), which works with all three deep learning frameworks that are currently supported by Horovod.
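A quick sanity check before going any further is to confirm that the toolkit's compiler, NVCC, is actually on your `PATH`. This is a minimal sketch; the expected release number reflects my setup and may differ on yours.

```shell
# Check that the system CUDA Toolkit is installed and visible.
if command -v nvcc >/dev/null 2>&1; then
    nvcc --version    # should report "release 10.1" for this setup
else
    echo "nvcc not found on PATH; install the NVIDIA CUDA Toolkit first"
fi
```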
Why not just use the cudatoolkit package?
Typically when installing PyTorch, TensorFlow, or Apache MXNet with GPU support using Conda, you simply add the appropriate version of the cudatoolkit package to your environment.yml file.
Unfortunately, for the moment at least, the cudatoolkit package available from conda-forge does not include NVCC, which is required in order to use Horovod with PyTorch, TensorFlow, or MXNet, as you need to compile extensions.
What about the cudatoolkit-dev package?
While there are cudatoolkit-dev packages available from conda-forge that do include NVCC, I have had difficulty getting these packages to install consistently. Some of the available builds require manual intervention to accept license agreements, making those builds unsuitable for installing on remote systems (which is critical functionality). Other builds seem to work on Ubuntu but not on other flavors of Linux.
I would encourage you to try adding cudatoolkit-dev to your environment.yml file and see what happens! The package is well maintained, so perhaps it will become more stable in the future.
Use the nvcc_linux-64 meta-package!
The most robust approach to obtain NVCC and still use Conda to manage all the other dependencies is to install the NVIDIA CUDA Toolkit on your system and then install the nvcc_linux-64 meta-package from conda-forge, which configures your Conda environment to use the NVCC installed on the system together with the other CUDA Toolkit components installed inside the Conda environment. For more details on this package I recommend reading through the issue threads on GitHub.
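The meta-package works by wrapping the system NVCC, so it needs to know where the toolkit lives. A minimal sketch, assuming the default Linux install prefix for CUDA 10.1 (adjust the path for your system):

```shell
# The nvcc_linux-64 meta-package locates the system toolkit via CUDA_HOME,
# so export it before creating or activating the Conda environment.
# /usr/local/cuda-10.1 is the default install prefix; yours may differ.
export CUDA_HOME=/usr/local/cuda-10.1
```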
The environment.yml file
I prefer to specify as many dependencies as possible in the Conda environment.yml file and to list in requirements.txt only those dependencies that are not available via Conda channels. Check the official Horovod installation guide for details of the required dependencies.
Channel Priority
I use the recommended channel priorities. Note that conda-forge has priority over defaults.
name: null

channels:
  - pytorch
  - conda-forge
  - defaults
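If you find packages still being pulled from the wrong channel, conda-forge also recommends enabling strict channel priority globally. This is a one-time, user-level setting; a hedged sketch:

```shell
# Make conda always prefer the highest-priority channel when resolving
# packages, rather than mixing channels (writes to ~/.condarc).
conda config --set channel_priority strict
```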
Dependencies
There are a few things worth noting about the dependencies.
- Even though I have installed the NVIDIA CUDA Toolkit manually, I still use Conda to manage the other required CUDA components such as cudnn and nccl (and the optional cupti).
- I use two meta-packages, cxx-compiler and nvcc_linux-64, to make sure that suitable C and C++ compilers are installed and that the resulting Conda environment is aware of the manually installed CUDA Toolkit.
- Horovod requires some controller library to coordinate work between the various Horovod processes. Typically this will be some MPI implementation such as OpenMPI. However, rather than specifying the openmpi package directly, I opt for the mpi4py Conda package, which provides a CUDA-aware build of OpenMPI (assuming it is supported by your hardware).
- Horovod also supports the Gloo collective communications library, which can be used in place of MPI. I include cmake to ensure that the Horovod extensions for Gloo are built.
Below are the core required dependencies. The complete environment.yml file is available on GitHub.
dependencies:
  - bokeh=1.4
  - cmake=3.16        # ensures that the Gloo library extensions will be built
  - cudnn=7.6
  - cupti=10.1
  - cxx-compiler=1.0  # ensures that C and C++ compilers are available
  - jupyterlab=1.2
  - mpi4py=3.0        # installs CUDA-aware OpenMPI
  - nccl=2.5
  - nodejs=13
  - nvcc_linux-64=10.1  # configures the environment to be "cuda-aware"
  - pip=20.0
  - pip:
    - mxnet-cu101mkl==1.6.*  # MXNet is installed prior to Horovod
    - -r file:requirements.txt
  - python=3.7
  - pytorch=1.4
  - tensorboard=2.1
  - tensorflow-gpu=2.1
  - torchvision=0.5
The requirements.txt File
The requirements.txt file is where all of the pip dependencies, including Horovod itself, are listed for installation. In addition to Horovod, I typically also use pip to install JupyterLab extensions to enable GPU and CPU resource monitoring via jupyterlab-nvdashboard and TensorBoard support via jupyter-tensorboard.
horovod==0.19.*
jupyterlab-nvdashboard==0.2.*
jupyter-tensorboard==0.2.*

# make sure horovod is re-compiled if environment is re-built
--no-binary=horovod
Note the use of the --no-binary option at the end of the file. Including this option ensures that Horovod will be re-built from source whenever the Conda environment is re-built.
The complete requirements.txt file is available on GitHub.
Building Conda Environment
After adding any necessary dependencies that should be downloaded via conda to the environment.yml file, and any dependencies that should be downloaded via pip to the requirements.txt file, you create the Conda environment in a sub-directory ./env of your project directory by running the following commands.
export ENV_PREFIX=$PWD/env
export HOROVOD_CUDA_HOME=$CUDA_HOME
export HOROVOD_NCCL_HOME=$ENV_PREFIX
export HOROVOD_GPU_ALLREDUCE=NCCL
export HOROVOD_GPU_BROADCAST=NCCL
conda env create --prefix $ENV_PREFIX --file environment.yml --force
By default Horovod will try and build extensions for all detected frameworks. See the Horovod documentation on environment variables for the details on additional environment variables that can be set prior to building Horovod.
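For example, if you want the build to fail loudly when a particular framework extension cannot be built, rather than silently skipping it, you can set the corresponding HOROVOD_WITH_* variables before creating the environment. A minimal sketch:

```shell
# Require the framework extensions explicitly; with these set, the Horovod
# build errors out if an extension cannot be compiled instead of omitting it.
export HOROVOD_WITH_TENSORFLOW=1
export HOROVOD_WITH_PYTORCH=1
export HOROVOD_WITH_MXNET=1
```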
Once the new environment has been created you can activate the environment with the following command.
conda activate $ENV_PREFIX
The postBuild File
If you wish to use any JupyterLab extensions included in the environment.yml and requirements.txt files, then you may need to rebuild the JupyterLab application.
For simplicity, I typically include the instructions for re-building JupyterLab in a postBuild script. Here is what this script looks like for my Horovod environments.
jupyter labextension install --no-build @pyviz/jupyterlab_pyviz
jupyter labextension install --no-build jupyterlab-nvdashboard
jupyter labextension install --no-build jupyterlab_tensorboard
jupyter serverextension enable jupyterlab_sql --py --sys-prefix
jupyter lab build
Use the following commands to source the postBuild script.
conda activate $ENV_PREFIX # optional if environment already active
. postBuild
Wrapping it all up in a Bash script
I typically wrap these commands into a shell script, create-conda-env.sh. Running the shell script will set the Horovod build variables, create the Conda environment, activate the Conda environment, and build JupyterLab with any additional extensions.
#!/bin/bash --login
set -e

export ENV_PREFIX=$PWD/env
export HOROVOD_CUDA_HOME=$CUDA_HOME
export HOROVOD_NCCL_HOME=$ENV_PREFIX
export HOROVOD_GPU_ALLREDUCE=NCCL
export HOROVOD_GPU_BROADCAST=NCCL

conda env create --prefix $ENV_PREFIX --file environment.yml --force
conda activate $ENV_PREFIX
. postBuild
I typically put scripts inside a ./bin directory in my project root directory. The script should be run from the project root directory as follows.
./bin/create-conda-env.sh # assumes that $CUDA_HOME is set properly
Verifying the Conda environment
After building the Conda environment you can check that Horovod has been built with support for the deep learning frameworks (TensorFlow, PyTorch, Apache MXNet) and the controllers (MPI and Gloo) with the following command.
conda activate $ENV_PREFIX # optional if environment already active
horovodrun --check-build
You should see output similar to the following.
Horovod v0.19.1:

Available Frameworks:
    [X] TensorFlow
    [X] PyTorch
    [X] MXNet

Available Controllers:
    [X] MPI
    [X] Gloo

Available Tensor Operations:
    [X] NCCL
    [ ] DDL
    [ ] CCL
    [X] MPI
    [X] Gloo
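Beyond checking the build flags, you can smoke-test an actual multi-process run. A hedged sketch, assuming the PyTorch extension built and the machine has at least two GPUs; each process should print its rank and the total world size.

```shell
# Launch two Horovod processes locally; hvd.init() wires them together,
# and each process reports its rank (0 or 1) and the world size (2).
horovodrun -np 2 python -c "import horovod.torch as hvd; hvd.init(); print(hvd.rank(), hvd.size())"
```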
Listing the contents of the Conda environment
To see the full list of packages installed into the environment run the following command.
conda activate $ENV_PREFIX # optional if environment already active
conda list
Updating the Conda environment
If you add (remove) dependencies to (from) the environment.yml file or the requirements.txt file after the environment has already been created, then you can re-create the environment with the following command.
conda env create --prefix $ENV_PREFIX --file environment.yml --force
However, whenever I add new dependencies I prefer to re-run the Bash script which will re-build both the Conda environment and JupyterLab.
./bin/create-conda-env.sh
Summary
Finding a reproducible process for building Horovod extensions for my deep learning projects was tricky. Key to my solution is the use of meta-packages from conda-forge to ensure that the appropriate compilers are installed and that the resulting Conda environment is aware of the system-installed NVIDIA CUDA Toolkit. The second key is to use the --no-binary flag in the requirements.txt file to ensure that Horovod is re-built whenever the Conda environment is re-built.
If you like my approach then you can make use of the template repository on GitHub to get started with your next Horovod data science project!