Quick Start to Multi-GPU Deep Learning on AWS Sagemaker using TF.Distribute

栏目: IT技术 · 发布时间: 4年前

内容简介:This article is a quick start guide to running distributed multi-GPU deep learning using AWS Sagemaker and TensorFlow 2.2.0 tf.distribute.All of my code related to this article can be found in my GitHub repository,First, we need to understand our options f

Quick Start to Distributed Multi-GPU Deep Learning on AWS Sagemaker using TF.Distribute

Introduction

This article is a quick start guide to running distributed multi-GPU deep learning using AWS Sagemaker and TensorFlow 2.2.0 tf.distribute.

Photo by Markus Spiske on Unsplash

Code

All of my code related to this article can be found in my GitHub repository, here . The code in my repository is an example of running a version of BERT on data from Kaggle, specifically the Jigsaw Multilingual Toxic Comment Classification competition. So of my code is adopted from a top public kernel .

The Need to Know Information

Getting Started

First, we need to understand our options for running deep learning on AWS Sagemaker.

  1. Run your code in a notebook instance
  2. Run your code in a tailored Sagemaker TensorFlow container

In this article, we focus on option #2 because it’s cheaper and it’s the intended design of Sagemaker.

(option #1 is a nice way to get started, but it’s more expensive because you’re paying for every second the notebook instance is running).

Running a Sagemaker TensorFlow Container

There is a lot of flexibility to Sagemaker TensorFlow containers, but we’re going to focus on the bare essentials.

Photo by Upadek Matmy on Unsplash

To start, we need to launch a Sagemaker notebook instance and store our data on S3. If you don’t know how to do this, I review some simple options on my blog . Once we have our data in S3, we can launch a Jupyter notebook (from our notebook instance) and start coding. This notebook will be responsible for launching your training job, or i.e. your Sagemaker TensorFlow container.

Again, we’re going to focus on the bare essentials. We need a variable to indicate where our data is located, and then we need to add that location to a dictionary.

data_s3 = 's3://<your-bucket>/'
inputs = {'data':data_s3}

Pretty simple. Now we need to create a Sagemaker TensorFlow container object.

Our entry_point is a Python script (which we’ll make later) that contains all of our modeling code. Our train_instance_type is a multi-GPU Sagemaker instance type. You can find a full list of Sagemaker instance types here . Notice that a ml.p3.8xlarge runs 4 V100 NVIDIA GPUs . And since we’re going to be using MirroredStrategy (more on this later) we need train_instance_count=1. So that’s 1 machine with 4 V100s. The other settings you can leave alone for now, or research further as needed.

The main settings we need to get right are entry_point and train_instance_type . (And then for Mirrored Strategy we need train_instance_count=1).

# create estimator
estimator = TensorFlow(entry_point='jigsaw_DistilBert_SingleRun_v1_sm_tfdist0.py',
 train_instance_type='ml.p3.8xlarge',
 output_path="s3://<your-bucket>",
 train_instance_count=1,
 role=sagemaker.get_execution_role(),
 framework_version='2.1.0',
 py_version='py3',
 script_mode=True)

We can kick off our training job by running the following line.

estimator.fit(inputs)

Notice that we included our dictionary (which contained our S3 location) as an input to ‘fit()’. Before we run this code, we need to create the Python script which we tied to entry_point (otherwise our container won’t have any code to run) .

Create Training Script

I have a lot going on in my training script because I’m running a version of BERT on some data from Kaggle, but I’m going to highlight the main code required for Sagemaker.

Photo by Brooks Leibee on Unsplash

First we need to grab our data location, which was passed when we ran ‘estimator.fit(inputs)’. We can do this using argparse.

def parse_args(): 
 parser = argparse.ArgumentParser()
 parser.add_argument(‘ — data’, 
 type=str, 
 default=os.environ.get(‘SM_CHANNEL_DATA’)) 
 return parser.parse_known_args()args, _ = parse_args()

You could probably simplify this even further by just hard coding your S3 location in your training script.

If all we wanted to do was run our training job in a Sagemaker container, that’s basically all we need. Now if we want to run multi-GPU train using tf.distribute we need a few more things.

Say Goodbye to Horovod, Say Hello to TF.Distribute

Photo by Taylor Vick on Unsplash

First we need to indicate that we want to run multi-GPU training. We can do that very easily with the following line.

strategy = tf.distribute.MirroredStrategy()

We’re going to use our strategy object throughout our training code. Next we need to adjust our batch size for multi-GPU training by including the following line.

BATCH_SIZE = 16 * strategy.num_replicas_in_sync

To distribute our model we can define our model using strategy as well.

with strategy.scope():
 # define model here

And that’s it! We can then continue on to run ‘model.fit()’ we usually do.

Again, full code related to this article can be found in my GitHub repository, here .

Thanks for reading and hope you find this helpful!


以上所述就是小编给大家介绍的《Quick Start to Multi-GPU Deep Learning on AWS Sagemaker using TF.Distribute》,希望对大家有所帮助,如果大家有任何疑问请给我留言,小编会及时回复大家的。在此也非常感谢大家对 码农网 的支持!

查看所有标签

猜你喜欢:

本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们

精益创业

精益创业

[美] 埃里克·莱斯 / 吴彤 / 中信出版社 / 2012-8 / 49.00元

《精益创业:新创企业的成长思维》内容简介:我们正处在一个空前的全球创业兴盛时代,但无数创业公司都黯然收场,以失败告终。精益创业代表了一种不断形成创新的新方法,它源于“精益生产”的理念,提倡企业进行“验证性学习”,先向市场推出极简的原型产品,然后在不断地试验和学习中,以最小的成本和有效的方式验证产品是否符合用户需求,灵活调整方向。如果产品不符合市场需求,最好能“快速地失败、廉价地失败”,而不要“昂贵......一起来看看 《精益创业》 这本书的介绍吧!

HTML 压缩/解压工具
HTML 压缩/解压工具

在线压缩/解压 HTML 代码

JSON 在线解析
JSON 在线解析

在线 JSON 格式化工具

HEX HSV 转换工具
HEX HSV 转换工具

HEX HSV 互换工具