A quick guide to using Spot instances with Amazon SageMaker

Category: IT Technology · Published: 5 years ago

One of the simplest ways to lower your machine learning training costs is to use Amazon EC2 Spot instances. Spot instances allow you to access spare Amazon EC2 compute capacity at a steep discount of up to 90% compared to on-demand rates. So why not always use Spot instances? Well, you can, as long as your workload is tolerant of sudden interruptions. Since Spot instances come from spare capacity, they may be reclaimed with just 2 minutes’ notice!

Deep learning training is a good example of a workload that can be made tolerant to interruptions and I’ve written about using Amazon EC2 Spot instances for deep learning training before. However, as a machine learning developer or data scientist, you may not want to manage Spot fleet requests, poll for capacity, poll for termination status, manually back up your checkpoints, manually sync checkpoints when resuming training, and set everything up every time you want to run training jobs.

Amazon SageMaker offers Managed Spot Training, which is a convenient way to lower training costs using Amazon EC2 Spot instances for Amazon SageMaker training jobs. This means you can now save up to 90% on training workloads without having to set up and manage Spot instances. Amazon SageMaker will automatically provision Spot instances for you, and if a Spot instance is reclaimed, Amazon SageMaker will automatically resume training once capacity is available!

In this blog post, I’ll provide a step-by-step guide to using Spot instances with Amazon SageMaker for deep learning training. I’ll cover the code changes you need to make to take advantage of Amazon SageMaker’s feature for automatically backing up and syncing checkpoints to Amazon S3. I’ll be using Keras with the TensorFlow backend to illustrate how you can take advantage of Amazon SageMaker Managed Spot Training. You can also implement the same steps in another framework such as PyTorch or MXNet.

A complete example with Jupyter notebook is available on GitHub: https://github.com/shashankprasanna/sagemaker-spot-training

What workloads can take advantage of Spot Instances?

To take advantage of Spot instance savings, your workload must be tolerant of interruptions. In machine learning, two types of workloads broadly fall into this category:

Stateless microservices, such as model servers (TF Serving, TorchServe; see my blog post), that serve inference requests
Stateful jobs, such as deep learning training, that can save their full state through frequent checkpointing

In the first case, if a Spot instance is reclaimed, traffic can be routed to another instance, assuming you’ve set up your service with redundancy for high-availability. In the second case, if a Spot instance is interrupted, your application must immediately save its current state, and resume training when capacity has been restored.

In this guide, I’ll cover the second use-case, i.e. Spot instances for deep learning training jobs using open-source deep learning frameworks such as TensorFlow, PyTorch, MXNet and others.

Quick recap on how Amazon SageMaker runs deep learning training

Let me start with how Amazon SageMaker runs deep learning training. This background is important to understand how SageMaker manages Spot training and backs up your checkpoint data and resumes training. If you’re an Amazon SageMaker user, this should serve as a quick reminder.

What you are responsible for:

Figure: Develop your training scripts and provide them to the SageMaker SDK Estimator function, and Amazon SageMaker will take care of the rest.

Writing your training scripts in TensorFlow, PyTorch, MXNet or another supported framework.
Writing a SageMaker Python SDK Estimator function specifying where to find your training scripts, what type of CPU or GPU instance to train on, how many instances (for distributed) to train on, where to find your training dataset and where to save the trained models in Amazon S3.

What Amazon SageMaker is responsible for:

Amazon SageMaker will manage infrastructure details, so you don’t have to. Amazon SageMaker will:

Upload your training script and dependencies to Amazon S3
Provision the specified number of instances in a fully managed cluster
Pull the specified TensorFlow container image and instantiate containers on every instance.
Download the training code from Amazon S3 into the instance and make it available in the container
Download training dataset from Amazon S3 and make it available in the container
Run training
Copy trained models to a specified location in Amazon S3
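
To make this hand-off concrete, here is a minimal sketch of how a training script might locate the paths that Amazon SageMaker prepares inside the container. It assumes the SM_MODEL_DIR and SM_CHANNEL_TRAINING environment variables that SageMaker sets in script mode, and a data channel named training. The parse_args helper and the argument names are my own choices for illustration, not part of any SageMaker API.

```python
import argparse
import os


def parse_args(argv=None):
    """Read training settings, falling back to SageMaker's container paths."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=10)
    # Where the final model should be written; SageMaker uploads this to S3.
    parser.add_argument(
        "--sm-model-dir",
        default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"),
    )
    # Where the dataset was downloaded (assumes a channel named "training").
    parser.add_argument(
        "--train",
        default=os.environ.get("SM_CHANNEL_TRAINING", "/opt/ml/input/data/training"),
    )
    # Local directory whose contents are synced to S3 as checkpoints.
    parser.add_argument("--checkpoint-dir", default="/opt/ml/checkpoints")
    # Ignore any extra arguments the platform passes that this sketch
    # does not define.
    args, _unknown = parser.parse_known_args(argv)
    return args
```

Calling parse_args() with no arguments inside the container picks up the SageMaker defaults, while parse_args(["--epochs", "3"]) lets you test the same script locally.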

Running Amazon SageMaker managed spot training

Spot instances can be preempted and terminated with just 2 minutes’ notice, so it’s critical that you checkpoint training progress frequently. Thankfully, Amazon SageMaker will manage everything else. It’ll automatically back up your training checkpoints to Amazon S3, and if the training instance is terminated due to lack of capacity, it’ll keep polling for capacity and automatically restart training once capacity becomes available.

Amazon SageMaker will automatically copy your dataset and the checkpoint files into the new instance and make them available to your training script in a Docker container, so that you can resume training from the latest checkpoint.

Let’s take a look at an example to see how you can prepare your training scripts to make them Spot training ready.

Amazon SageMaker Managed Spot Training with TensorFlow and Keras

To make sure that your training scripts can take advantage of SageMaker Managed Spot instances you’ll need to implement:

frequent saving of checkpoints and
ability to resume training from checkpoints.

I’ll show how to make these changes in Keras, but you can follow the same steps on another framework.

Step 1: Saving checkpoints

Amazon SageMaker will automatically back up and sync checkpoint files generated by your training script to Amazon S3. Therefore, you’ll need to make sure that your training script saves checkpoints to a local checkpoint directory in the Docker container that’s running the training. The default location for checkpoint files is /opt/ml/checkpoints, and Amazon SageMaker will sync these files to the Amazon S3 location you specify. Both the local and Amazon S3 checkpoint locations are customizable.

If you’re using Keras, this is very easy. Create an instance of the ModelCheckpoint callback class and register it with the model by passing it to the fit() function.

The full implementation is available in this file: https://github.com/shashankprasanna/sagemaker-spot-training/blob/master/code/cifar10-training-sagemaker.py

Here is the relevant excerpt:

Notice that I’m passing initial_epoch which you normally wouldn’t have bothered with. This lets us resume training from a certain epoch number and will come in handy when you already have checkpoint files.
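
The original excerpt is not reproduced here, so what follows is only a minimal sketch of the idea, not the code from the repository. It assumes a compiled Keras model and training arrays that you supply, and it assumes checkpoints are named checkpoint-{epoch}.h5, which is a naming choice of this sketch. The helper names checkpoint_pattern and fit_with_checkpoints are hypothetical.

```python
import os

# Default local path that Amazon SageMaker syncs to Amazon S3.
CHECKPOINT_DIR = "/opt/ml/checkpoints"


def checkpoint_pattern(checkpoint_dir=CHECKPOINT_DIR):
    """Return the file name pattern; Keras substitutes {epoch} on each save."""
    return os.path.join(checkpoint_dir, "checkpoint-{epoch}.h5")


def fit_with_checkpoints(model, x_train, y_train, epochs,
                         initial_epoch=0, checkpoint_dir=CHECKPOINT_DIR):
    """Train a compiled Keras model, saving a full-model checkpoint every epoch."""
    # Imported lazily so checkpoint_pattern() works without TensorFlow installed.
    from tensorflow.keras.callbacks import ModelCheckpoint

    os.makedirs(checkpoint_dir, exist_ok=True)
    callbacks = [ModelCheckpoint(checkpoint_pattern(checkpoint_dir))]
    return model.fit(
        x_train,
        y_train,
        epochs=epochs,                # index of the final epoch, not a count of extra epochs
        initial_epoch=initial_epoch,  # 0 for a fresh run, the last saved epoch when resuming
        callbacks=callbacks,
    )
```

Because Keras numbers the saved files starting from 1, passing the number found in the newest file name as initial_epoch continues with the next epoch rather than repeating the last one.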

Step 2: Resuming from checkpoint files

When spot capacity becomes available again after an interruption, Amazon SageMaker will:

Launch a new spot instance
Instantiate a Docker container with your training script
Copy your dataset and checkpoint files from Amazon S3 to the container
Run your training scripts

Your script needs to implement resuming training from checkpoint files; otherwise, it will restart training from scratch. You can implement a load_checkpoint_mode function as shown below. It takes the local checkpoint directory path (/opt/ml/checkpoints being the default) and returns a model loaded from the latest checkpoint, along with the associated epoch number.

There are many ways to query a list of files in a directory, extract the epoch number from the file names and load the file name with the latest epoch number. I use os.listdir() and regular expressions. I’m sure you can come up with more clever and elegant ways to do the same thing.
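
As one possible version of the os.listdir() and regular expression approach, here is a minimal sketch. It is not the code from the repository: the function names are hypothetical, and it assumes checkpoints are saved as checkpoint-<epoch>.h5, which is a naming choice of this sketch rather than something Amazon SageMaker requires.

```python
import os
import re

CHECKPOINT_NAME = re.compile(r"^checkpoint-(\d+)\.h5$")


def find_latest_checkpoint(checkpoint_dir):
    """Return (path, epoch) for the newest checkpoint, or (None, 0) if none exist."""
    if not os.path.isdir(checkpoint_dir):
        return None, 0
    found = []
    for name in os.listdir(checkpoint_dir):
        match = CHECKPOINT_NAME.match(name)
        if match:
            found.append((int(match.group(1)), name))
    if not found:
        return None, 0
    # Compare epochs numerically so that epoch 10 ranks above epoch 2.
    epoch, name = max(found)
    return os.path.join(checkpoint_dir, name), epoch


def load_latest_checkpoint(checkpoint_dir="/opt/ml/checkpoints"):
    """Load the newest checkpoint as a Keras model; returns (model, epoch)."""
    path, epoch = find_latest_checkpoint(checkpoint_dir)
    if path is None:
        return None, 0  # nothing to resume from, so train from scratch
    # Imported lazily so the file search works without TensorFlow installed.
    from tensorflow.keras.models import load_model

    return load_model(path), epoch
```

The returned epoch can be passed to fit() as initial_epoch. When no checkpoint exists, the caller gets (None, 0) and can build a fresh model instead.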

The full implementation is available in this file: https://github.com/shashankprasanna/sagemaker-spot-training/blob/master/code/cifar10-training-sagemaker.py

Here is the relevant excerpt:

Step 3: Instructing Amazon SageMaker to run Managed Spot training

You can launch Amazon SageMaker training jobs from your laptop, desktop, Amazon EC2 instance, or Amazon SageMaker Notebook instance, as long as you have the Amazon SageMaker Python SDK installed and the right user permissions to run SageMaker training jobs.

To run a managed spot training job, you’ll need to specify a few additional options in your standard Amazon SageMaker Estimator function call.

The full implementation is available in this file: https://github.com/shashankprasanna/sagemaker-spot-training/blob/master/tf-keras-cifar10-spot-training.ipynb

Here is the relevant excerpt:

train_use_spot_instances: Instructs Amazon SageMaker to run Managed Spot training
checkpoint_s3_uri: Instructs Amazon SageMaker to sync your checkpoint files to this Amazon S3 location
train_max_wait: Instructs Amazon SageMaker to terminate the job if this much time passes and Spot capacity has not become available
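
To show how these options fit together, here is a minimal sketch rather than the notebook’s actual code. It uses the argument names from this post, which belong to version 1 of the SageMaker Python SDK; version 2 renamed them to use_spot_instances, max_wait, max_run, instance_count and instance_type. The helper names, the time limits and the framework version are assumptions made for illustration.

```python
def spot_training_options(checkpoint_s3_uri, max_run=3600, max_wait=7200):
    """Build the extra Estimator keyword arguments for Managed Spot training."""
    if not checkpoint_s3_uri.startswith("s3://"):
        raise ValueError("checkpoint_s3_uri must be an s3:// URI")
    if max_wait < max_run:
        # SageMaker requires the wait limit to be at least the run limit.
        raise ValueError("max_wait must be >= max_run")
    return {
        "train_use_spot_instances": True,
        "train_max_run": max_run,    # seconds of training allowed
        "train_max_wait": max_wait,  # seconds allowed, including waiting for capacity
        "checkpoint_s3_uri": checkpoint_s3_uri,
    }


def build_estimator(role, checkpoint_s3_uri):
    """Create a TensorFlow Estimator using SDK v1 argument names."""
    # Imported lazily; requires the sagemaker package and AWS credentials.
    from sagemaker.tensorflow import TensorFlow

    return TensorFlow(
        entry_point="cifar10-training-sagemaker.py",
        source_dir="code",
        role=role,
        train_instance_count=1,
        train_instance_type="ml.p3.2xlarge",
        framework_version="1.15",  # assumption: use the version you train with
        py_version="py3",
        script_mode=True,
        **spot_training_options(checkpoint_s3_uri),
    )
```

Keeping the Spot options in their own function makes it easy to switch a job between on-demand and Spot training without touching the rest of the Estimator call.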

That’s it. Those are all the changes you need to make to dramatically lower your cost to train.

To monitor your training job and view savings, you can look at the logs in your Jupyter notebook, or navigate to Amazon SageMaker Console > Training Job and click on your training job name. Once the training is completed, you should see how much you saved. As an example, for a 30-epoch training run on a p3.2xlarge GPU instance, I was able to save 70% on training cost!

Simulating Spot interruptions on Amazon SageMaker

How do you know if your training will resume properly if a spot interruption occurs?

If you’re familiar with running Amazon EC2 Spot instances, you know that you can simulate your application behavior during a Spot interruption by terminating the Amazon EC2 Spot instance. If there is capacity, Spot fleet will launch a new instance to replace the one you terminated. You can monitor your application to check if it handles interruptions and resumes gracefully. Unfortunately, you can’t terminate an Amazon SageMaker training instance manually. Your only option is to stop the entire training job.

Fortunately, you can still test how your code behaves when resuming training. To do that, first run an Amazon SageMaker Managed Spot training job for a specified number of epochs, as described in the previous section. Let’s say you run training for 10 epochs. Amazon SageMaker will have backed up your checkpoint files for those 10 epochs to the specified Amazon S3 location. Head over to Amazon S3 to verify that the checkpoints are available:
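
If you prefer to check from code instead of the console, the following minimal sketch lists the objects under the checkpoint location with boto3. The helper names are hypothetical, and it assumes your AWS credentials can read the bucket.

```python
def split_s3_uri(uri):
    """Split 's3://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    if not uri.startswith("s3://"):
        raise ValueError("expected an s3:// URI")
    bucket, _, prefix = uri[len("s3://"):].partition("/")
    return bucket, prefix


def list_checkpoints(checkpoint_s3_uri):
    """Return the object keys stored under the checkpoint location."""
    # Imported lazily; requires boto3 and AWS credentials.
    import boto3

    bucket, prefix = split_s3_uri(checkpoint_s3_uri)
    response = boto3.client("s3").list_objects_v2(Bucket=bucket, Prefix=prefix)
    # "Contents" is absent when nothing matches; one call returns at most 1,000 keys.
    return [obj["Key"] for obj in response.get("Contents", [])]
```

After a 10-epoch run that saves one file per epoch, you would expect list_checkpoints(tf_estimator.checkpoint_s3_uri) to return one key per saved checkpoint.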

Now start a second training run, but this time pass the first job’s checkpoint location to checkpoint_s3_uri:

checkpoint_s3_uri = tf_estimator.checkpoint_s3_uri

Here is the relevant excerpt from the Jupyter notebook:

https://github.com/shashankprasanna/sagemaker-spot-training/blob/master/tf-keras-cifar10-spot-training.ipynb

By providing checkpoint_s3_uri with your previous job’s checkpoints, you’re telling Amazon SageMaker to copy those checkpoints to your new job’s container. Your training script will then load the latest checkpoint and resume training. In the figure below you can see that the training will resume from the 10th epoch.
