Quick Start to Multi-GPU Deep Learning on AWS Sagemaker using TF.Distribute

Category: IT Technology · Published: 4 years ago


Quick Start to Distributed Multi-GPU Deep Learning on AWS Sagemaker using TF.Distribute

Introduction

This article is a quick start guide to running distributed multi-GPU deep learning using AWS Sagemaker and TensorFlow 2.2.0 tf.distribute.

Code

All of my code related to this article can be found in my GitHub repository. The code in my repository is an example of running a version of BERT on data from Kaggle, specifically the Jigsaw Multilingual Toxic Comment Classification competition. Some of my code is adapted from a top public kernel.

The Need to Know Information

Getting Started

First, we need to understand our options for running deep learning on AWS Sagemaker.

  1. Run your code in a notebook instance
  2. Run your code in a tailored Sagemaker TensorFlow container

In this article, we focus on option #2 because it’s cheaper and it’s the intended design of Sagemaker.

(Option #1 is a nice way to get started, but it’s more expensive because you’re paying for every second the notebook instance is running.)

Running a Sagemaker TensorFlow Container

There is a lot of flexibility to Sagemaker TensorFlow containers, but we’re going to focus on the bare essentials.

To start, we need to launch a Sagemaker notebook instance and store our data on S3. If you don’t know how to do this, I review some simple options on my blog. Once we have our data in S3, we can launch a Jupyter notebook (from our notebook instance) and start coding. This notebook will be responsible for launching the training job, that is, the Sagemaker TensorFlow container.

Again, we’re going to focus on the bare essentials. We need a variable to indicate where our data is located, and then we need to add that location to a dictionary.

data_s3 = 's3://<your-bucket>/'
inputs = {'data':data_s3}

Pretty simple. Now we need to create a Sagemaker TensorFlow container object.

Our entry_point is a Python script (which we’ll make later) that contains all of our modeling code. Our train_instance_type is a multi-GPU Sagemaker instance type. A full list of Sagemaker instance types is available on the AWS website. Notice that an ml.p3.8xlarge runs 4 NVIDIA V100 GPUs. And since we’re going to be using MirroredStrategy (more on this later), we need train_instance_count=1. So that’s 1 machine with 4 V100s. The other settings you can leave alone for now, or research further as needed.

The main settings we need to get right are entry_point and train_instance_type. (And then for MirroredStrategy we need train_instance_count=1.)

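import sagemaker
from sagemaker.tensorflow import TensorFlow

# create estimator
# (train_instance_type / train_instance_count are the SageMaker Python SDK v1
# parameter names; SDK v2 renamed them to instance_type / instance_count)
estimator = TensorFlow(entry_point='jigsaw_DistilBert_SingleRun_v1_sm_tfdist0.py',
                       train_instance_type='ml.p3.8xlarge',
                       output_path="s3://<your-bucket>",
                       train_instance_count=1,
                       role=sagemaker.get_execution_role(),
                       framework_version='2.1.0',
                       py_version='py3',
                       script_mode=True)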

We can kick off our training job by running the following line.

estimator.fit(inputs)

Notice that we passed our dictionary (which contains our S3 location) as an input to ‘fit()’. Before we run this code, we need to create the Python script that we tied to entry_point (otherwise our container won’t have any code to run).

Create Training Script

I have a lot going on in my training script because I’m running a version of BERT on some data from Kaggle, but I’m going to highlight the main code required for Sagemaker.

First we need to grab our data location, which was passed when we ran ‘estimator.fit(inputs)’. We can do this using argparse.

import argparse
import os

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('--data',
                        type=str,
                        default=os.environ.get('SM_CHANNEL_DATA'))
    return parser.parse_known_args()

args, _ = parse_args()

You could probably simplify this even further by just hard coding your S3 location in your training script.
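To make the link between the ‘data’ key in our inputs dictionary and SM_CHANNEL_DATA a bit more concrete, here is a minimal sketch that simulates the container environment locally. It assumes Sagemaker's default behavior of copying each input channel to /opt/ml/input/data/<channel name> and exposing that path as SM_CHANNEL_<CHANNEL NAME>. The '--epochs' argument is purely hypothetical and only stands in for any extra argument the container might pass to the script.

```python
import argparse
import os


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # The default is read from the environment when the parser is built.
    parser.add_argument('--data',
                        type=str,
                        default=os.environ.get('SM_CHANNEL_DATA'))
    # parse_known_args() ignores arguments the parser does not define
    # instead of exiting with an error.
    return parser.parse_known_args(argv)


# Simulate what the container is assumed to set for a channel named 'data'.
os.environ['SM_CHANNEL_DATA'] = '/opt/ml/input/data/data'

# '--epochs 3' is a hypothetical extra argument, not something from the article.
args, unknown = parse_args(['--epochs', '3'])

print(args.data)   # local path of the 'data' channel inside the container
print(unknown)     # arguments that were passed but not defined above
```

Using parse_known_args() rather than parse_args() is what keeps the script from failing when it receives arguments it does not define.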

If all we wanted to do was run our training job in a Sagemaker container, that’s basically all we need. Now, if we want to run multi-GPU training using tf.distribute, we need a few more things.

Say Goodbye to Horovod, Say Hello to TF.Distribute

First we need to indicate that we want to run multi-GPU training. We can do that very easily with the following line.

strategy = tf.distribute.MirroredStrategy()

We’re going to use our strategy object throughout our training code. Next we need to adjust our batch size for multi-GPU training by including the following line.

BATCH_SIZE = 16 * strategy.num_replicas_in_sync
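The arithmetic behind that line can be sketched in plain Python. This is only an illustration: it assumes MirroredStrategy creates one replica per GPU, so num_replicas_in_sync equals the GPU count of the instance, and the helper names (global_batch_size, steps_per_epoch) are made up for this example.

```python
# V100 GPUs per instance for the P3 family.
GPUS_PER_INSTANCE = {
    "ml.p3.2xlarge": 1,
    "ml.p3.8xlarge": 4,
    "ml.p3.16xlarge": 8,
}

PER_REPLICA_BATCH_SIZE = 16  # the per-GPU batch size used in this article


def global_batch_size(instance_type, per_replica=PER_REPLICA_BATCH_SIZE):
    """Global batch size, assuming one replica per GPU."""
    return per_replica * GPUS_PER_INSTANCE[instance_type]


def steps_per_epoch(num_examples, batch_size):
    """Batches per epoch, counting a final partial batch (ceiling division)."""
    return -(-num_examples // batch_size)


for instance_type in GPUS_PER_INSTANCE:
    print(instance_type, global_batch_size(instance_type))
```

On an ml.p3.8xlarge this gives a global batch of 16 * 4 = 64, so each GPU still processes 16 examples per step while an epoch takes roughly a quarter as many steps.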

To distribute our model, we define (and compile) it inside the strategy’s scope.

with strategy.scope():
 # define model here

And that’s it! We can then continue on to run ‘model.fit()’ as we usually do.
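Putting the three pieces together, a minimal end-to-end sketch might look like the following. It is not the code from my repository: a tiny dense network and random data stand in for BERT and the Kaggle data, and every name other than the tf.distribute and Keras calls is made up for illustration. On a machine without GPUs, MirroredStrategy should fall back to a single replica, so the same script can be tried locally first.

```python
import numpy as np
import tensorflow as tf

# One replica per visible GPU (a single replica if no GPU is available).
strategy = tf.distribute.MirroredStrategy()

# Scale the global batch size with the number of replicas.
PER_REPLICA_BATCH_SIZE = 16
BATCH_SIZE = PER_REPLICA_BATCH_SIZE * strategy.num_replicas_in_sync

# Toy stand-in data: 256 examples with 8 features and a binary label.
rng = np.random.default_rng(0)
features = rng.random((256, 8)).astype("float32")
labels = (features.sum(axis=1) > 4.0).astype("float32")
dataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(BATCH_SIZE)

# Variables created inside the scope are mirrored across the replicas.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=["accuracy"])

# Training is launched the usual way; each global batch is split across replicas.
history = model.fit(dataset, epochs=1, verbose=0)
print("replicas:", strategy.num_replicas_in_sync, "global batch:", BATCH_SIZE)
```

Only model creation and compilation need to sit inside the scope; building the dataset and calling fit() can stay outside it.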

Again, the full code related to this article can be found in my GitHub repository.

Thanks for reading and hope you find this helpful!

