Gradient Accumulation: Overcoming Memory Constraints in Deep Learning


A brief overview of the problem and the solution

Photo by Nic Low on Unsplash

Let’s be honest: deep learning without GPUs sucks, big time! Yes, people will claim you can do without one, but life isn’t just about training a neat and cool MNIST classifier.

So for training state-of-the-art (SOTA) models, a GPU is pretty much a necessity. And even if we manage to procure one, we run into memory constraints. We are more or less accustomed to seeing the OOM (Out of Memory) error whenever we train with a large batch. The problem is far more apparent with modern computer vision models. We have come a long way since VGG or even ResNet-18: deeper architectures like U-Net, ResNet-152, R-CNN, and Mask R-CNN are extremely memory intensive, so there is a fairly high chance of running out of memory while training them.

Here is an OOM error thrown while running a model in PyTorch:

RuntimeError: CUDA out of memory. Tried to allocate 44.00 MiB (GPU 0; 10.76 GiB total capacity; 9.46 GiB already allocated; 30.94 MiB free; 9.87 GiB reserved in total by PyTorch)

There are two fixes practitioners usually reach for the instant they hit an OOM error:

  1. Reduce batch size
  2. Reduce image dimensions

In over 90% of cases, these two solutions are more than enough. So the question you want to ask is: why do the remaining 10% need something else? To answer that, let’s look at the image below.

Image from the Kaggle notebook of Dimitre Oliveira

It’s from the Kaggle competition Understanding Clouds from Satellite Images, where the task was to correctly segment the different types of clouds. These images are very high resolution, 1400 x 2100. As you can understand, shrinking them too much would hurt badly in this scenario, since the minute patterns and textures are exactly the features the model needs to learn. That leaves only one other option: reduce the batch size.

As a refresher, recall gradient descent, or specifically mini-batch gradient descent in our case: instead of calculating the loss and the eventual gradients on the whole dataset, we do it on smaller batches. Besides helping the data fit into memory, this also helps us converge faster, since the parameters are updated after every mini-batch. But what happens when the batch size becomes too small, as in the case above? As a rough estimate, maybe 4 such images fit into a single batch on an 11 GB GPU, so the loss and gradients computed from that batch will not accurately represent the whole dataset. As a result, the model will converge a lot slower, or worse, not converge at all.

Enter gradient accumulation.

The idea behind gradient accumulation is stupidly simple. We still calculate the loss and gradients for each mini-batch, but instead of updating the model parameters right away, we wait and accumulate the gradients over consecutive batches, and only then update the parameters, based on the cumulative gradient, after a specified number of batches.

Coding the gradient accumulation part is also ridiculously easy in PyTorch. All you need to do is run the backward pass for each batch, so the gradients keep accumulating, and update the model parameters only after a set number of batches that you choose.

We hold off on optimizer.step(), which updates the parameters, for accumulation_steps batches. model.zero_grad() is called at the same time as the step, to reset the accumulated gradients.
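Here is a minimal sketch of such a training loop. The toy model, random data, and hyperparameters below are placeholders rather than the gist from the original article, but the accumulation logic is the part that matters:

```python
import torch
from torch import nn

# Toy stand-ins for a real pipeline: a tiny model, random data, plain SGD.
model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
dataloader = [(torch.randn(4, 10), torch.randint(0, 2, (4,))) for _ in range(8)]

accumulation_steps = 4  # effective batch size = 4 samples x 4 steps = 16

model.zero_grad()                                   # start with clean gradients
for i, (inputs, labels) in enumerate(dataloader):
    outputs = model(inputs)                         # forward pass
    loss = criterion(outputs, labels)
    loss = loss / accumulation_steps                # scale so the summed gradients average out
    loss.backward()                                 # gradients are added into .grad, not overwritten

    if (i + 1) % accumulation_steps == 0:           # every accumulation_steps batches...
        optimizer.step()                            # ...update with the accumulated gradient
        model.zero_grad()                           # ...and reset for the next cycle
```

Dividing the loss by accumulation_steps keeps the accumulated gradient at roughly the same scale as a single large batch of the equivalent size.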

Doing the same thing is a little trickier in Keras/TensorFlow. There are different versions floating around the internet; one of them was written by @alexeydevederkin.

There are also shorter TensorFlow versions available, which you’ll find easily.
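For what it’s worth, here is a minimal sketch of the same idea using tf.GradientTape. This is an illustration of the general technique, not @alexeydevederkin’s Keras optimizer, and the model and data are again toy placeholders:

```python
import tensorflow as tf

# Toy stand-ins: a tiny model and random data.
model = tf.keras.Sequential([tf.keras.layers.Dense(2)])
model.build(input_shape=(None, 10))
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
dataset = [(tf.random.normal((4, 10)),
            tf.random.uniform((4,), maxval=2, dtype=tf.int32)) for _ in range(8)]

accumulation_steps = 4
accumulated = [tf.zeros_like(v) for v in model.trainable_variables]

for i, (inputs, labels) in enumerate(dataset):
    with tf.GradientTape() as tape:
        logits = model(inputs, training=True)
        loss = loss_fn(labels, logits) / accumulation_steps    # same scaling as before
    grads = tape.gradient(loss, model.trainable_variables)
    accumulated = [a + g for a, g in zip(accumulated, grads)]  # add this batch's gradients

    if (i + 1) % accumulation_steps == 0:
        optimizer.apply_gradients(zip(accumulated, model.trainable_variables))
        accumulated = [tf.zeros_like(v) for v in model.trainable_variables]  # reset
```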

Gradient accumulation is a great tool for hobbyists with limited compute, and equally for practitioners who want to train on images without scaling them down. Whichever one you are, it is always a handy trick to have in your armory.

