Gradient Accumulation: Overcoming Memory Constraints in Deep Learning



A brief overview of the problem and the solution


Photo by Nic Low on Unsplash

Let’s be honest. Deep Learning without GPUs sucks big time! Yes, people will claim you can do without one, but life isn’t just about training a neat and cool MNIST classifier.

So for training state-of-the-art (SOTA) models, a GPU is a big necessity. And even if we manage to procure one, there comes the problem of memory constraints. We are more or less accustomed to seeing the OOM (Out of Memory) error whenever we throw a large batch at training. The problem is far more apparent with modern computer vision algorithms. We have come a long way since the days of VGG or even ResNet-18. Deeper architectures like UNet, ResNet-152, RCNN and Mask-RCNN are extremely memory intensive, so there is quite a high probability that we will run out of memory while training them.

Here is an OOM error from running a model in PyTorch.

RuntimeError: CUDA out of memory. Tried to allocate 44.00 MiB (GPU 0; 10.76 GiB total capacity; 9.46 GiB already allocated; 30.94 MiB free; 9.87 GiB reserved in total by PyTorch)

There are usually two fixes that practitioners reach for instantly whenever they encounter the OOM error (both are sketched in code right after this list).

  1. Reduce batch size
  2. Reduce image dimensions
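For concreteness, both knobs live in the data pipeline in a typical PyTorch setup. This is a minimal sketch, with torchvision's FakeData standing in for a real high-resolution dataset; the sizes and names here are my own illustration, not from the original post.

```python
import torchvision.transforms as T
from torchvision.datasets import FakeData
from torch.utils.data import DataLoader

# FakeData stands in for a real dataset of 1400 x 2100 images.
transform = T.Compose([
    T.Resize((350, 525)),  # solution 2: reduce image dimensions (4x smaller per side)
    T.ToTensor(),
])
dataset = FakeData(size=16, image_size=(3, 1400, 2100), transform=transform)

# solution 1: reduce the batch size
loader = DataLoader(dataset, batch_size=4, shuffle=True)

images, labels = next(iter(loader))
print(images.shape)  # torch.Size([4, 3, 350, 525])
```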

In over 90% of cases, these two solutions are more than enough. So the question you want to ask is: why does the remaining 10% need something else? To answer that, let’s look at the image below.


From the Kaggle notebook of Dimitre Oliveira

It’s from the Kaggle competition Understanding Clouds from Satellite Images. The task was to correctly segment the different types of clouds. These images have a very high resolution of 1400 x 2100. As you can understand, reducing the image dimensions too much would be very harmful in this scenario, since the fine patterns and textures are important features to learn here. Hence the only other option is to reduce the batch size.

As a refresher, if you remember gradient descent, or specifically mini-batch gradient descent in our case, you’ll recall that instead of calculating the loss and the eventual gradients on the whole dataset, we do the operation on smaller batches. Besides helping us fit the data into memory, this also helps us converge faster, since the parameters are updated after each mini-batch. But what happens when the batch size becomes too small, as in the case above? Taking a rough estimate that maybe 4 such images can fit into a single batch on an 11 GB GPU, the loss and the gradients calculated will not accurately represent the whole dataset. As a result, the model will converge a lot slower, or worse, not converge at all.
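To put that refresher in symbols (the notation here is mine, not from the original post): with a mini-batch B_t of size b, learning rate η, and per-example loss ℓ_i, one mini-batch gradient descent step is

$$
\theta_{t+1} = \theta_t - \eta \cdot \frac{1}{b} \sum_{i \in B_t} \nabla_\theta \, \ell_i(\theta_t)
$$

The smaller b is, the noisier this estimate of the full-dataset gradient becomes, which is exactly the problem with a batch of only 4 huge images.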

Enter gradient accumulation.

The idea behind gradient accumulation is stupidly simple. It calculates the loss and gradients after each mini-batch, but instead of updating the model parameters, it waits and accumulates the gradients over consecutive batches. And then ultimately updates the parameters based on the cumulative gradient after a specified number of batches.
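In the same (assumed) notation, accumulating gradients over k consecutive mini-batches of size b and averaging them before the update gives

$$
\theta_{t+1} = \theta_t - \eta \cdot \frac{1}{k} \sum_{j=1}^{k} \frac{1}{b} \sum_{i \in B_j} \nabla_\theta \, \ell_i(\theta_t)
            = \theta_t - \eta \cdot \frac{1}{kb} \sum_{i \in B_1 \cup \dots \cup B_k} \nabla_\theta \, \ell_i(\theta_t)
$$

which is the same update a single mini-batch of size kb would produce (ignoring batch-size-dependent layers such as BatchNorm). The averaging is usually implemented by dividing each batch’s loss by k.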

Coding the gradient accumulation part is also ridiculously easy in PyTorch. All you need to do is backpropagate the loss for each batch, so that the gradients keep accumulating, and update the model parameters only after a set number of batches that you choose.
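Here is a minimal sketch of that loop. The tiny linear model and random tensors below are my own placeholders standing in for a real model and dataloader; the accumulation pattern itself (scale the loss, call backward() every batch, step and zero the gradients every accumulation_steps batches) is the standard PyTorch approach.

```python
import torch
import torch.nn as nn

# Tiny linear model and random data as placeholders for the real
# segmentation model and dataloader.
model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

accumulation_steps = 4  # effective batch size = batch_size * accumulation_steps

model.zero_grad()                            # start from clean gradients
for step in range(100):
    inputs = torch.randn(4, 10)              # one small mini-batch
    targets = torch.randint(0, 2, (4,))

    outputs = model(inputs)
    loss = criterion(outputs, targets)
    loss = loss / accumulation_steps         # scale so the accumulated gradients average out
    loss.backward()                          # gradients add up in each parameter's .grad

    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                     # update with the accumulated gradients
        model.zero_grad()                    # reset the accumulated gradients
```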

We hold off on optimizer.step(), which updates the parameters, for accumulation_steps batches. model.zero_grad() is called right after each update to reset the accumulated gradients.

Doing the same thing is a little trickier in Keras/TensorFlow. There are different versions floating around on the internet; one widely shared implementation was written by @alexeydevederkin.
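As a rough, non-authoritative sketch of the same idea (this is not @alexeydevederkin’s code; the tiny model, random data, and names are my own placeholders), manual accumulation in TensorFlow 2 with tf.GradientTape can look like this:

```python
import tensorflow as tf

# Tiny stand-in model and random data, purely for illustration.
model = tf.keras.Sequential([tf.keras.layers.Dense(2)])
model.build(input_shape=(None, 10))
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

accumulation_steps = 4
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal((64, 10)),
     tf.random.uniform((64,), maxval=2, dtype=tf.int32))
).batch(4)

# One accumulator per trainable variable, initialised to zero.
accumulated = [tf.zeros_like(v) for v in model.trainable_variables]

for step, (x, y) in enumerate(dataset):
    with tf.GradientTape() as tape:
        logits = model(x, training=True)
        # Scale the loss so the accumulated gradients average over the
        # effective (larger) batch instead of summing it.
        loss = loss_fn(y, logits) / accumulation_steps
    grads = tape.gradient(loss, model.trainable_variables)
    accumulated = [acc + g for acc, g in zip(accumulated, grads)]

    if (step + 1) % accumulation_steps == 0:
        # Apply the accumulated gradients, then reset the accumulators.
        optimizer.apply_gradients(zip(accumulated, model.trainable_variables))
        accumulated = [tf.zeros_like(v) for v in model.trainable_variables]
```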

There are also shorter TensorFlow versions available; you’ll find those easily.

Gradient accumulation is a great tool for hobbyists with limited compute, and even for practitioners who want to use full-resolution images without scaling them down. Whichever one you are, it is always a handy trick to have in your armory.

