Distributed Deep Learning with Ansible, AWS and Pytorch Lightning. Part 1



How to automate and scale your deep learning experiments with Ansible, AWS cloud infrastructure, and the PyTorch Lightning library.

Let’s say you are a deep learning practitioner, but you don’t have an in-house GPU cluster or a machine learning platform at your disposal. Nobody has trained their models on a CPU for almost a decade now. Even worse, as models and datasets keep getting bigger, you have to deal with distributed deep learning and scale your training in a model-parallel and/or data-parallel regime. What can we do about it?

We can follow the modern cloud paradigm and use GPU-as-a-service: allocate the necessary infrastructure dynamically, on demand, and release it once you have finished. It works well, but this is also where the main complexity lies. Modern deep learning frameworks like PyTorch Lightning or Horovod make data-parallel distributed training easy nowadays. The most annoying and time-consuming part is creating a proper environment, because we often have to do it manually. Even for services that hide a lot of infrastructure complexity from you, like Google Colab or Paperspace, some manual work still needs to be done.

I’m a strong believer that manual work is your enemy. Why? Here is my list of personal concerns:

  1. Reproducibility of results. Have you ever heard of the so-called human factor? We are very error-prone creatures, and we are not good at memorizing things in great detail. The more human work a process involves, the harder it will be to reproduce in the future.
  2. Mental distractions. Deep learning is an empirical endeavor, and your progress in it relies deeply on your ability to iterate quickly and test as many hypotheses as you can. Because of that, anything that distracts you from your main task (training and evaluating your models, or analyzing the data) negatively affects the success of the overall process.
  3. Effectiveness. Computers do many things a lot faster than we humans do. When you have to repeat the same slow procedure over and over, it all adds up.
Manual work is your enemy

In this article, I’ll describe how you can automate the way you conduct your deep learning experiments.

Automate your Deep Learning experiments

Here are the three main ideas of this article:

  1. Utilize cloud-based infrastructure to dynamically allocate resources for your training purposes;
  2. Use a DevOps automation toolset to handle all the manual work of setting up the experiment environment;
  3. Write your training procedure in a modern deep learning framework that gives you data-parallel distributed training effortlessly.

[Image: AWS EC2, Ansible and PyTorch Lightning. Image by author]

To actually implement these ideas, we will use AWS cloud infrastructure, the Ansible automation tool, and the PyTorch Lightning deep learning library.

Our work will be divided into two parts. In this article we will provide a minimal working example which:

  • Automatically creates and destroys EC2 instances for our deep learning cluster;
  • Establishes the connectivity between them that is necessary for PyTorch and PyTorch Lightning distributed training;
  • Creates a local ssh config file to enable connecting to the cluster;
  • Creates a Python virtual environment and installs all library dependencies for the experiment;
  • Provides a submit script to run distributed data-parallel workloads on the created cluster.

In the next article, we will add additional features and build a fully automated environment for distributed learning experiments.

Now let’s take a brief look at the chosen technology stack.

What is AWS EC2?

[Image: AWS logo. Source: https://github.com/gilbarbara/logos/tree/master/logos]

AWS Elastic Compute Cloud (EC2) is a core AWS service that allows you to manage virtual machines in Amazon data centers. With this service you can dynamically create and destroy your machines either manually via the AWS Console or programmatically via the API provided by the AWS SDK.

As of today, AWS provides a range of GPU-enabled instances for our purposes, with one or multiple GPUs per instance and different choices of NVIDIA GPUs: GRID K520, Tesla M60, K80, T4, and V100. See the official site for the full list.
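To make this concrete, here is a minimal sketch of allocating and releasing a GPU instance with the boto3 Python SDK; the AMI ID and key pair name below are hypothetical placeholders, and your AWS credentials are assumed to be configured:

```python
import boto3

ec2 = boto3.resource("ec2", region_name="us-east-1")

# Launch a single GPU instance (p3.2xlarge has one NVIDIA V100).
instances = ec2.create_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder: e.g. a Deep Learning AMI
    InstanceType="p3.2xlarge",
    KeyName="my-key-pair",            # placeholder: an existing EC2 key pair
    MinCount=1,
    MaxCount=1,
)
instance = instances[0]
instance.wait_until_running()
instance.reload()  # refresh attributes such as the public IP
print(instance.public_ip_address)

# ...run your experiment, then release the resources:
instance.terminate()
```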

What is Ansible?

[Image: Ansible logo. Source: https://github.com/gilbarbara/logos/tree/master/logos]

Ansible is a tool for software and infrastructure provisioning and configuration management. With Ansible you can provision a whole cluster of remote servers, deploy software on them, and monitor them.

It is an open-source project written in Python. It uses a declarative approach: you define the desired system state in ordinary YAML files, and Ansible executes the actions necessary to reach it. The declarative nature of Ansible also means that most of the instructions you define are idempotent: running them more than once will not cause any undesirable side effects.

One of the distinctive features of Ansible is that it is agentless, i.e. it doesn’t require any agent software to be installed on the managed nodes. It operates solely via the SSH protocol. So the only thing you need to ensure is SSH connectivity between the control host, on which you run Ansible commands, and the inventory hosts you want to manage.

Ansible core concepts

Let’s dive a bit into the core concepts of Ansible. There are not many of those, so you can quickly get your head around them and start playing with this brilliant tool.

  • Inventory

Inventory is simply the list of hosts you want to manage with Ansible, organized into named groups. You can define the inventory in an INI-formatted file if you have a static, predefined infrastructure. The other way is to use inventory plugins that tell Ansible which hosts to operate on when your infrastructure is not known in advance or may change dynamically (as in our case here).
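For example, a static inventory for a small cluster might look like this (the host names and addresses are made up):

```ini
[control]
control-node ansible_host=3.81.10.1 ansible_user=ubuntu

[workers]
worker-1 ansible_host=3.81.10.2 ansible_user=ubuntu
worker-2 ansible_host=3.81.10.3 ansible_user=ubuntu
```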

  • Modules

A module is the unit of work that you can perform in Ansible. There is a massive library of modules to choose from, and the architecture is extremely extensible. See the module index.

  • Variables

Nothing fancy here. You can define variables, like in any programming language, either to separate your logic from the data or to pass information between parts of your system. Ansible also collects a lot of system information and stores it in predefined variables called facts. You can read more about variables in the official documentation.
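As a quick illustration, a task can reference both a user-defined variable and a gathered fact via Jinja2 templating; venv_path here is a made-up variable:

```yaml
- name: Print a user-defined variable and a gathered fact
  debug:
    msg: "Virtualenv at {{ venv_path }} on {{ ansible_distribution }}"
```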

  • Tasks

A task is a module invocation with some parameters. You can also define a name, a variable to store the result, and conditional and loop expressions for the task. Here is an example of a task that copies a local file into the remote computer’s file system when the variable some_variable is defined.

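A minimal sketch of such a task (the file paths are hypothetical):

```yaml
- name: Copy the experiment config to the remote host
  copy:
    src: ./experiment_config.yml
    dest: /home/ubuntu/experiment_config.yml
  register: copy_result
  when: some_variable is defined
```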
  • Plays

A play in Ansible is a way to apply a list of tasks to a group of hosts from the inventory. You define a play as a dictionary in YAML: the hosts parameter specifies an inventory group, and the tasks parameter contains a list of tasks.
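For instance, here is a minimal sketch of a play, assuming a workers inventory group like the one above:

```yaml
- hosts: workers
  tasks:
    - name: Ensure rsync is present on all workers
      apt:
        name: rsync
        state: present
      become: yes
```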

  • Playbooks

A playbook is just a YAML file that contains a list of plays to run. To run a playbook, you pass it to the ansible-playbook CLI that comes with the Ansible installation.
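Assuming the play above is saved to a file called playbook.yml next to our inventory file, running it is a single command:

```bash
ansible-playbook -i inventory.ini playbook.yml
```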

Here’s a diagram to illustrate how these concepts interplay with each other:

[Image: Ansible core concepts]

There are also more advanced concepts in Ansible that allow you to write more modular code for complex scenarios. We’ll use some of them in Part 2 of the article.

What is Pytorch Lightning?

[Image. Source: Wikipedia]

PyTorch Lightning is a high-level library on top of PyTorch. You can think of it as Keras for PyTorch. A couple of features make it stand out from the crowd of other PyTorch-based deep learning libraries:

  • It is transparent. As the authors put it in the documentation, it is more a convention for writing PyTorch code than a separate framework. You don’t need to learn another library, and you don’t need to make a huge effort to convert your ordinary PyTorch code to use it with PyTorch Lightning. Your PyTorch Lightning code is actually your PyTorch code.
  • It hides a lot of boilerplate engineering code. PyTorch is a brilliant framework, but when it comes to conducting full-featured experiments with it, you quickly end up with a lot of code that is not particularly related to the actual research you are doing, and you have to repeat this work every time. PyTorch Lightning provides this functionality for you. Specifically, it adds distributed data-parallel training capability to your model with no modifications to the code required from you at all!
  • It is simple. The whole PyTorch Lightning code base revolves around a small number of abstractions (see the sketch after this list):
  1. LightningModule is a class that organizes your PyTorch code. You use PyTorch Lightning by creating a custom class that inherits from LightningModule and implementing its virtual methods. LightningModule itself inherits from the PyTorch Module.
  2. Trainer automates your training procedure. Once you’ve organized your PyTorch code into a LightningModule, you pass its instance to a Trainer, which does the actual heavy lifting of training.
  3. Callbacks, Loggers and Hooks are the means to customize the Trainer’s behavior.
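Here is a minimal sketch of these abstractions in action; the model, data, and hyperparameters are illustrative, and the exact Trainer flags for distributed training vary between Lightning versions:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

# A toy LightningModule: a one-layer classifier on random data.
class LitClassifier(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(32, 10)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = nn.functional.cross_entropy(self(x), y)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

# Random placeholder data, just to make the example runnable.
dataset = TensorDataset(torch.randn(256, 32), torch.randint(0, 10, (256,)))
loader = DataLoader(dataset, batch_size=32)

# The Trainer does the heavy lifting. Data-parallel distributed training
# is enabled via constructor arguments (e.g. gpus/num_nodes with a "ddp"
# backend in older versions), with no changes to the module itself.
trainer = pl.Trainer(max_epochs=2)
trainer.fit(LitClassifier(), loader)
```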

[Image: PyTorch Lightning Architecture]

For more information, read the official documentation.

Okay, enough talking, let’s start building.
