Gradient Accumulation: Overcoming Memory Constraints in Deep Learning



A brief overview of the problem and the solution


Photo by Nic Low on Unsplash

Let’s be honest. Deep Learning without GPUs sucks big time! Yes, people will claim you can do without one, but life isn’t just about training a neat and cool MNIST classifier.

So for training state-of-the-art (SOTA) models, a GPU is a big necessity. And even if we manage to procure one, there comes the problem of memory constraints. We are more or less accustomed to seeing the OOM (Out of Memory) error whenever we throw a large batch at training. The problem is far more apparent with modern computer vision algorithms. We have come a long way since the days of VGG or even ResNet-18. Deeper architectures like UNet, ResNet-152, RCNN and Mask-RCNN are extremely memory intensive, so there is quite a high probability that we will run out of memory while training them.

Here is an OOM error from running a model in PyTorch.

RuntimeError: CUDA out of memory. Tried to allocate 44.00 MiB (GPU 0; 10.76 GiB total capacity; 9.46 GiB already allocated; 30.94 MiB free; 9.87 GiB reserved in total by PyTorch)

There are usually two fixes that practitioners reach for instantly whenever they encounter the OOM error (both are sketched in code right after this list).

  1. Reduce batch size
  2. Reduce image dimensions
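For concreteness, both knobs live in the data pipeline in a typical PyTorch setup. This is a minimal sketch, with torchvision's FakeData standing in for a real high-resolution dataset; the sizes and names here are my own illustration, not from the original post.

```python
import torchvision.transforms as T
from torchvision.datasets import FakeData
from torch.utils.data import DataLoader

# FakeData stands in for a real dataset of 1400 x 2100 images.
transform = T.Compose([
    T.Resize((350, 525)),  # solution 2: reduce image dimensions (4x smaller per side)
    T.ToTensor(),
])
dataset = FakeData(size=16, image_size=(3, 1400, 2100), transform=transform)

# solution 1: reduce the batch size
loader = DataLoader(dataset, batch_size=4, shuffle=True)

images, labels = next(iter(loader))
print(images.shape)  # torch.Size([4, 3, 350, 525])
```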

In over 90% of cases, these two solutions are more than enough. So the question you want to ask is: why does the remaining 10% need something else? To answer that, let’s look at the image below.


From the Kaggle notebook of Dimitre Oliveira

It’s from the Kaggle competition Understanding Clouds from Satellite Images. The task was to correctly segment the different types of clouds. These images have a very high resolution of 1400 x 2100. As you can understand, reducing the image dimensions too much would be very harmful in this scenario, since the fine patterns and textures are important features to learn here. Hence the only other option is to reduce the batch size.

As a refresher, if you remember gradient descent, or specifically mini-batch gradient descent in our case, you’ll recall that instead of calculating the loss and the eventual gradients on the whole dataset, we do the operation on smaller batches. Besides helping us fit the data into memory, this also helps us converge faster, since the parameters are updated after each mini-batch. But what happens when the batch size becomes too small, as in the case above? Taking a rough estimate that maybe 4 such images can fit into a single batch on an 11 GB GPU, the loss and the gradients calculated will not accurately represent the whole dataset. As a result, the model will converge a lot slower, or worse, not converge at all.
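To put that refresher in symbols (the notation here is mine, not from the original post): with a mini-batch B_t of size b, learning rate η, and per-example loss ℓ_i, one mini-batch gradient descent step is

$$
\theta_{t+1} = \theta_t - \eta \cdot \frac{1}{b} \sum_{i \in B_t} \nabla_\theta \, \ell_i(\theta_t)
$$

The smaller b is, the noisier this estimate of the full-dataset gradient becomes, which is exactly the problem with a batch of only 4 huge images.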

Enter gradient accumulation.

The idea behind gradient accumulation is stupidly simple. It calculates the loss and gradients after each mini-batch, but instead of updating the model parameters, it waits and accumulates the gradients over consecutive batches. And then ultimately updates the parameters based on the cumulative gradient after a specified number of batches.
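In the same (assumed) notation, accumulating gradients over k consecutive mini-batches of size b and averaging them before the update gives

$$
\theta_{t+1} = \theta_t - \eta \cdot \frac{1}{k} \sum_{j=1}^{k} \frac{1}{b} \sum_{i \in B_j} \nabla_\theta \, \ell_i(\theta_t)
            = \theta_t - \eta \cdot \frac{1}{kb} \sum_{i \in B_1 \cup \dots \cup B_k} \nabla_\theta \, \ell_i(\theta_t)
$$

which is the same update a single mini-batch of size kb would produce (ignoring batch-size-dependent layers such as BatchNorm). The averaging is usually implemented by dividing each batch’s loss by k.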

Coding the gradient accumulation part is also ridiculously easy in PyTorch. All you need to do is backpropagate the loss for each batch, so that the gradients keep accumulating, and update the model parameters only after a set number of batches that you choose.
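Here is a minimal sketch of that loop. The tiny linear model and random tensors below are my own placeholders standing in for a real model and dataloader; the accumulation pattern itself (scale the loss, call backward() every batch, step and zero the gradients every accumulation_steps batches) is the standard PyTorch approach.

```python
import torch
import torch.nn as nn

# Tiny linear model and random data as placeholders for the real
# segmentation model and dataloader.
model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

accumulation_steps = 4  # effective batch size = batch_size * accumulation_steps

model.zero_grad()                            # start from clean gradients
for step in range(100):
    inputs = torch.randn(4, 10)              # one small mini-batch
    targets = torch.randint(0, 2, (4,))

    outputs = model(inputs)
    loss = criterion(outputs, targets)
    loss = loss / accumulation_steps         # scale so the accumulated gradients average out
    loss.backward()                          # gradients add up in each parameter's .grad

    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                     # update with the accumulated gradients
        model.zero_grad()                    # reset the accumulated gradients
```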

We hold off on optimizer.step(), which updates the parameters, for accumulation_steps batches. model.zero_grad() is called right after each update to reset the accumulated gradients.

Doing the same thing is a little trickier in Keras/TensorFlow. There are different versions floating around on the internet; one widely shared implementation was written by @alexeydevederkin.
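As a rough, non-authoritative sketch of the same idea (this is not @alexeydevederkin’s code; the tiny model, random data, and names are my own placeholders), manual accumulation in TensorFlow 2 with tf.GradientTape can look like this:

```python
import tensorflow as tf

# Tiny stand-in model and random data, purely for illustration.
model = tf.keras.Sequential([tf.keras.layers.Dense(2)])
model.build(input_shape=(None, 10))
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

accumulation_steps = 4
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal((64, 10)),
     tf.random.uniform((64,), maxval=2, dtype=tf.int32))
).batch(4)

# One accumulator per trainable variable, initialised to zero.
accumulated = [tf.zeros_like(v) for v in model.trainable_variables]

for step, (x, y) in enumerate(dataset):
    with tf.GradientTape() as tape:
        logits = model(x, training=True)
        # Scale the loss so the accumulated gradients average over the
        # effective (larger) batch instead of summing it.
        loss = loss_fn(y, logits) / accumulation_steps
    grads = tape.gradient(loss, model.trainable_variables)
    accumulated = [acc + g for acc, g in zip(accumulated, grads)]

    if (step + 1) % accumulation_steps == 0:
        # Apply the accumulated gradients, then reset the accumulators.
        optimizer.apply_gradients(zip(accumulated, model.trainable_variables))
        accumulated = [tf.zeros_like(v) for v in model.trainable_variables]
```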

There are also shorter TensorFlow versions available; you’ll find those easily.

Gradient accumulation is a great tool for hobbyists with limited compute, and even for practitioners who want to use full-resolution images without scaling them down. Whichever one you are, it is always a handy trick to have in your armory.

