Quick Start to Distributed Multi-GPU Deep Learning on AWS Sagemaker using TF.Distribute
Introduction
This article is a quick-start guide to running distributed multi-GPU deep learning on AWS Sagemaker using TensorFlow 2.2.0’s tf.distribute.
Code
All of my code related to this article can be found in my GitHub repository, here. The code in my repository is an example of running a version of BERT on data from Kaggle, specifically the Jigsaw Multilingual Toxic Comment Classification competition. Some of my code is adapted from a top public kernel.
The Need-to-Know Information
Getting Started
First, we need to understand our options for running deep learning on AWS Sagemaker.
- Run your code in a notebook instance
- Run your code in a tailored Sagemaker TensorFlow container
In this article, we focus on option #2 because it’s cheaper and it’s the intended design of Sagemaker.
(Option #1 is a nice way to get started, but it’s more expensive because you’re paying for every second the notebook instance is running.)
Running a Sagemaker TensorFlow Container
Sagemaker TensorFlow containers offer a lot of flexibility, but we’re going to focus on the bare essentials.
To start, we need to launch a Sagemaker notebook instance and store our data on S3. If you don’t know how to do this, I review some simple options on my blog. Once we have our data in S3, we can launch a Jupyter notebook (from our notebook instance) and start coding. This notebook will be responsible for launching your training job, i.e. your Sagemaker TensorFlow container.
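For example, one way to get data into S3 (this is just one option, and it assumes the Sagemaker Python SDK is available in your notebook and that a local folder named data holds your files) is the SDK’s upload_data helper:

import sagemaker

session = sagemaker.Session()
# Uploads the contents of a local 'data' folder to s3://<your-bucket>/data
data_s3 = session.upload_data(path='data', bucket='<your-bucket>', key_prefix='data')
print(data_s3)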
Again, we’re going to focus on the bare essentials. We need a variable to indicate where our data is located, and then we need to add that location to a dictionary.
data_s3 = 's3://<your-bucket>/'
inputs = {'data': data_s3}
Pretty simple. Now we need to create a Sagemaker TensorFlow container object.
Our entry_point is a Python script (which we’ll make later) that contains all of our modeling code. Our train_instance_type is a multi-GPU Sagemaker instance type. You can find a full list of Sagemaker instance types here. Notice that an ml.p3.8xlarge runs 4 NVIDIA V100 GPUs. And since we’re going to be using MirroredStrategy (more on this later), we need train_instance_count=1. So that’s 1 machine with 4 V100s. The other settings you can leave alone for now, or research further as needed.
The main settings we need to get right are entry_point and train_instance_type. (And for MirroredStrategy we need train_instance_count=1.)
import sagemaker
from sagemaker.tensorflow import TensorFlow

# create estimator
estimator = TensorFlow(entry_point='jigsaw_DistilBert_SingleRun_v1_sm_tfdist0.py',
                       train_instance_type='ml.p3.8xlarge',
                       output_path="s3://<your-bucket>",
                       train_instance_count=1,
                       role=sagemaker.get_execution_role(),
                       framework_version='2.1.0',
                       py_version='py3',
                       script_mode=True)
We can kick off our training job by running the following line.
estimator.fit(inputs)
Notice that we included our dictionary (which contains our S3 location) as an input to ‘fit()’. Before we run this code, we need to create the Python script that we tied to entry_point (otherwise our container won’t have any code to run).
Create Training Script
I have a lot going on in my training script because I’m running a version of BERT on some data from Kaggle, but I’m going to highlight the main code required for Sagemaker.
First we need to grab our data location, which was passed when we ran ‘estimator.fit(inputs)’. We can do this using argparse.
import argparse
import os

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('--data', type=str, default=os.environ.get('SM_CHANNEL_DATA'))
    return parser.parse_known_args()

args, _ = parse_args()
You could probably simplify this even further by just hard coding your S3 location in your training script.
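One detail worth knowing: inside the container, SM_CHANNEL_DATA points to a local directory where Sagemaker has already copied your S3 data, so you can read from args.data like any local path. Here’s a minimal sketch, assuming the channel holds a CSV file (the filename below is just a placeholder):

import os
import pandas as pd

# Read a file that Sagemaker copied from S3 into the data channel directory
train_path = os.path.join(args.data, 'train.csv')  # placeholder filename
train_df = pd.read_csv(train_path)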
If all we wanted to do was run our training job in a Sagemaker container, that’s basically all we need. But if we want to run multi-GPU training using tf.distribute, we need a few more things.
Say Goodbye to Horovod, Say Hello to TF.Distribute
First we need to indicate that we want to run multi-GPU training. We can do that very easily with the following line.
strategy = tf.distribute.MirroredStrategy()
We’re going to use our strategy object throughout our training code. Next we need to adjust our batch size for multi-GPU training by including the following line.
BATCH_SIZE = 16 * strategy.num_replicas_in_sync
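To make that concrete: with 4 GPUs, strategy.num_replicas_in_sync is 4, so the global batch size is 16 * 4 = 64. If you feed ‘model.fit()’ a tf.data.Dataset batched at this global size, MirroredStrategy splits each batch across the replicas for you. Here’s a minimal sketch; the arrays are dummy stand-ins for your real features and labels:

import numpy as np
import tensorflow as tf

# Dummy data standing in for your real features and labels
x_train = np.random.rand(1024, 128).astype('float32')
y_train = np.random.randint(0, 2, size=(1024,)).astype('float32')

train_dataset = (
    tf.data.Dataset.from_tensor_slices((x_train, y_train))
    .shuffle(1024)
    .batch(BATCH_SIZE)  # global batch size (16 per replica)
    .prefetch(tf.data.experimental.AUTOTUNE)
)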
To distribute our model, we define it inside the strategy’s scope.
with strategy.scope():
    # define model here
And that’s it! We can then continue on to run ‘model.fit()’ as we usually do.
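Putting the pieces together, here’s a minimal sketch using a toy Keras model (a stand-in for the BERT model in my repository) and the train_dataset sketched above. Everything that creates variables (the model and its compiled optimizer) goes inside strategy.scope(); the fit call itself stays outside:

import tensorflow as tf

with strategy.scope():
    # Toy model; swap in your real model definition here
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation='relu', input_shape=(128,)),
        tf.keras.layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])

model.fit(train_dataset, epochs=2)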
Again, full code related to this article can be found in my GitHub repository, here.
Thanks for reading and hope you find this helpful!