How to automate and scale your deep learning experiments with Ansible, AWS cloud infrastructure and Pytorch Lightning library.
Jun 28 · 14 min read
Let’s say you are a deep learning practitioner, but you don’t have an in-house GPU cluster or a machine learning platform at your disposal. Nobody has trained their models on a CPU for almost a decade. Even worse, with models and datasets getting bigger, you have to deal with distributed deep learning and scale your training in model-parallel and/or data-parallel regimes. What can we do about it?
We can follow the modern cloud paradigm and use GPUs as a service. It allows you to allocate the necessary infrastructure dynamically on demand and release it once you have finished. It works well, but this is also where the main complexity lies. Modern deep learning frameworks like PyTorch Lightning or Horovod make data-parallel distributed learning easy nowadays. The most annoying and time-consuming part is creating a proper environment, because we often have to do it manually. Even for services that hide a lot of infrastructure complexity from you, like Google Colab or Paperspace, some manual work still needs to be done.
I’m a strong believer that manual work is your enemy. Why? Here is my list of personal concerns:
- Reproducibility of results. Have you ever heard of the so-called human factor? We are very error-prone creatures, and we are not good at memorizing things in great detail. The more human work a process involves, the harder it will be to reproduce in the future.
- Mental distractions. Deep learning is an empirical endeavor, and your progress in it relies deeply on your ability to iterate quickly and test as many hypotheses as you can. Because of that, anything that distracts you from your main tasks, training and evaluating your models or analyzing the data, negatively affects the success of the overall process.
- Effectiveness. Computers do many things a lot faster than we humans do. When you have to repeat the same slow procedure over and over, it all adds up.
Manual work is your enemy
In this article, I’ll describe how you can automate the way you conduct your deep learning experiments.
Automate your Deep Learning experiments
The following are three main ideas of this article:
- Utilize cloud-based infrastructure to dynamically allocate resources for your training purposes;
- Use DevOps automation toolset to manage all manual work on the experiment environment setup;
- Write your training procedure in a modern deep learning framework that provides data-parallel distributed learning out of the box.
To actually implement these ideas, we will use AWS cloud infrastructure, the Ansible automation tool, and the PyTorch Lightning deep learning library.
Our work will be divided into two parts. In this article, we will build a minimal working example which:
- Automatically creates and destroys EC2 instances for our deep learning cluster;
- Establishes the connectivity between them that is necessary for PyTorch and PyTorch Lightning distributed training;
- Creates a local ssh config file to enable connection to the cluster;
- Creates a Python virtual environment and installs all library dependencies for the experiment;
- Provides a submit script to run distributed data-parallel workloads on the created cluster.
In the next article, we will add additional features and build a fully automated environment for distributed learning experiments.
Now, let’s take a brief overview of the chosen technology stack.
What is AWS EC2?
AWS Elastic Compute Cloud (EC2) is a core AWS service that allows you to manage virtual machines in Amazon data centers. With this service you can dynamically create and destroy your machines, either manually via the AWS Console or programmatically via the API provided by the AWS SDK.
As of today, AWS provides a range of GPU-enabled instances for our purposes with one or multiple GPUs per instance and different choices of NVIDIA GPUs: Tesla GRID K520, M60, K80, T4, V100. See the official site for a full list.
What is Ansible?
Ansible is a tool for software and infrastructure provisioning and configuration management. With Ansible you can remotely provision a whole cluster of servers, deploy software on them, and monitor them.
It is an open-source project written in Python. It uses a declarative approach: you define a desired system state in ordinary YAML files, and Ansible executes the actions necessary to reach it. The declarative nature of Ansible also means that most of the instructions you define are idempotent: running them more than once will not cause any undesirable side effects.
One of the distinctive features of Ansible is that it is agent-less, i.e. it doesn’t require any agent software to be installed on the managed nodes. It operates solely via the SSH protocol. So the only thing you need to ensure is SSH connectivity between the control host on which you run Ansible commands and the inventory hosts you want to manage.
Ansible core concepts
Let’s dive a bit into the core concepts of Ansible. There are not many of those, so you can quickly get your head around them and start playing with this brilliant tool.
- Inventory
Inventory is simply the list of hosts you want to manage with Ansible; they are organized into named groups. If you have a static, predefined infrastructure, you can define the inventory in an INI-formatted file. Alternatively, you can use inventory plugins that tell Ansible which hosts to operate on if your infrastructure is not known in advance or may change dynamically (as in our case here).
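For illustration, a static INI inventory for a hypothetical two-node GPU cluster might look like this (the group name, host names, and IP addresses are placeholders):

```ini
; hosts.ini
[gpu_cluster]
gpu-node-1 ansible_host=203.0.113.10
gpu-node-2 ansible_host=203.0.113.11

; variables shared by all hosts in the group
[gpu_cluster:vars]
ansible_user=ubuntu
```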
- Modules
A module is a unit of work you can perform in Ansible. There is a massive library of modules you can use, and the architecture is extremely extensible. See the module index in the official documentation.
- Variables
Nothing fancy here. You can define variables like in any programming language, either to separate your logic from the data or to pass information between parts of your system. Ansible also collects a lot of system information and stores it in predefined variables called facts. You can read more about variables in the official documentation.
- Tasks
A task is a module invocation with some parameters. You can also define a name, a variable to store the result, and conditional and loop expressions for the task. Here is an example of a task that copies a local file into a remote computer’s file system when some_variable is defined:
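A minimal sketch of such a task follows; the file paths are placeholders, while the copy module, the when conditional, and register are standard Ansible keywords:

```yaml
- name: Copy a local file to the remote host
  copy:
    src: /local/path/file.txt
    dest: /remote/path/file.txt
  when: some_variable is defined
  register: copy_result
```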
- Plays
A play in Ansible is a way to apply a list of tasks to a group of hosts from the inventory. You define a play as a dictionary in YAML: the hosts parameter specifies an inventory group, and the tasks parameter contains a list of tasks.
- Playbooks
A playbook is just a YAML file that contains a list of plays to run. The way to run a playbook is to pass it to the ansible-playbook CLI that comes with the Ansible installation.
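Putting these concepts together, a playbook with a single play might look like this (the gpu_cluster group name and the package task are illustrative, not part of this article's actual setup):

```yaml
# playbook.yml
- hosts: gpu_cluster
  become: yes
  tasks:
    - name: Ensure the Python venv package is installed
      apt:
        name: python3-venv
        state: present
```

You would then run it with: ansible-playbook -i hosts.ini playbook.yml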
Here’s a diagram to illustrate how these concepts interplay with each other:
There are also more advanced concepts in Ansible that allow you to write more modular code for complex scenarios. We’ll use some of them in Part 2 of the article.
What is Pytorch Lightning?
PyTorch Lightning is a high-level library built on top of PyTorch. You can think of it as Keras for PyTorch. A few features make it stand out from the crowd of other PyTorch-based deep learning libraries:
- It is transparent. As the authors write in the documentation, it is more a convention for writing PyTorch code than a separate framework. You don’t need to learn another library, and you don’t need to make a huge effort to convert your ordinary PyTorch code to use it with PyTorch Lightning. Your PyTorch Lightning code is actually your PyTorch code.
- It hides a lot of boilerplate engineering code. PyTorch is a brilliant framework, but when it comes to conducting full-featured experiments with it, you quickly end up with a lot of code that is not particularly related to the actual research you are doing, and you have to repeat this work every time. PyTorch Lightning provides this functionality for you. Specifically, it adds distributed data-parallel learning capability to your model with no code modifications required from you at all!
- It is simple. The whole PyTorch Lightning code base revolves around a small number of abstractions:
- LightningModule is a class that organizes your PyTorch code. The way you use PyTorch Lightning is by creating a custom class that inherits from LightningModule and implementing its virtual methods. LightningModule itself inherits from the PyTorch Module class.
- Trainer automates your training procedure. Once you’ve organized your PyTorch code into a LightningModule, you pass its instance to a Trainer, and it does the actual heavy lifting of training.
- Callbacks, Loggers and Hooks are the means to customize the Trainer’s behavior.
For more information, read the official documentation.
Okay, enough talking, let’s start building.