What is Gradient Accumulation in Deep Learning?

Category: IT Technology · Published: 4 years ago

Backpropagation process of neural networks explained

Photo by Austris Augusts on Unsplash

In another article, we addressed the problem of batch size being limited by GPU memory, and how gradient accumulation helps to overcome it.

In this post, we will first examine the backpropagation process of a neural network and then go through the technical and algorithmic details of gradient accumulation. We will discuss how it works, and iterate through an example.

What is Gradient Accumulation?

Gradient accumulation is a mechanism for splitting the batch of samples used to train a neural network into several mini-batches that are run sequentially.

Gradient accumulation

Before going further into gradient accumulation, it helps to examine the backpropagation process of a neural network.

Backpropagation of a neural network

A deep-learning model consists of many connected layers, and in every step the samples propagate through all of them in the forward pass. After the samples have propagated through all the layers, the network generates predictions for them and then calculates a loss value for every sample, which answers the question “how wrong was the network for this sample?”. The neural network then computes the gradients of those loss values with respect to the model parameters. These gradients are then used to calculate the updates for the respective variables.
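The forward pass, per-sample loss, and gradient computation described above can be sketched on a toy model. The single linear unit, the squared-error loss, and all names below are assumptions made for illustration; a real framework derives the gradients automatically.

```python
# Toy model: a single linear unit, prediction = w * x + b.
# The model, the data, and the squared-error loss are illustrative assumptions.

def forward(w, b, x):
    """Forward pass for one sample."""
    return w * x + b

def sample_loss(pred, target):
    """Squared error: how wrong the network was for this sample."""
    return (pred - target) ** 2

def batch_gradients(w, b, xs, ys):
    """Sum over the batch of d(loss)/dw and d(loss)/db."""
    grad_w, grad_b = 0.0, 0.0
    for x, y in zip(xs, ys):
        err = forward(w, b, x) - y   # d(loss)/d(pred) = 2 * err
        grad_w += 2.0 * err * x      # chain rule through w * x
        grad_b += 2.0 * err          # chain rule through + b
    return grad_w, grad_b

w, b = 0.5, 0.0
xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]

losses = [sample_loss(forward(w, b, x), y) for x, y in zip(xs, ys)]
grad_w, grad_b = batch_gradients(w, b, xs, ys)
print(losses, grad_w, grad_b)
```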

When building the model, we choose an optimizer, which is responsible for the algorithm used to minimize the loss. The optimizer can be one of the common optimizers already implemented in the framework (SGD, Adam, etc.), or a custom optimizer implementing the desired algorithm. Along with the gradients, the optimizer may manage and use other parameters when calculating the updates, such as the learning rate, the current step index (for an adaptive learning rate), momentum terms, and so on.

The optimizer represents a mathematical formula that computes the parameter updates. A simple example is the stochastic gradient descent (SGD) algorithm: V = V - (lr * grad), where V is any trainable model parameter (weight or bias), lr is the learning rate, and grad is the gradient of the loss with respect to that parameter.

The algorithm of SGD optimizer
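As a minimal sketch, the SGD rule V = V - (lr * grad) could be written as a plain function. The name sgd_step and the list-of-floats representation of the parameters are assumptions made for illustration, not a framework API.

```python
def sgd_step(params, grads, lr):
    """Return updated parameters: V = V - (lr * grad) for every V."""
    return [v - lr * g for v, g in zip(params, grads)]

params = [0.5, 0.0]       # e.g. one weight and one bias
grads = [-42.0, -18.0]    # gradient of the loss w.r.t. each parameter
new_params = sgd_step(params, grads, lr=0.01)
print(new_params)
```

A negative gradient means that increasing the parameter lowers the loss, so both values move upward here.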

So what is gradient accumulation, technically?

Gradient accumulation means running a configured number of steps without updating the model variables, accumulating the gradients of those steps, and then using the accumulated gradients to compute the variable updates.

Yes, it’s really that simple.

Running some steps without updating any of the model variables is how we logically split the batch of samples into a few mini-batches. The batch of samples used in each step is effectively a mini-batch, and the samples of all those steps combined are effectively the global batch.

By not updating the variables during those steps, we make all the mini-batches use the same model variables when calculating the gradients. This is required for the gradients and updates to match those we would get with the global batch size.

Accumulating the gradients in all of these steps results in the same sum of gradients as if we were using the global batch size.
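This equivalence can be checked numerically. The sketch below assumes a one-weight toy model with a squared-error loss whose gradient is summed (not averaged) over samples. If a framework averages the loss over each mini-batch, the accumulated sum would also need dividing by the number of equal-sized mini-batches to match the global mean.

```python
def grad_sum(w, xs, ys):
    """Sum over samples of d/dw (w * x - y)^2 for a one-weight toy model."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys))

w = 0.5
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]

# One pass over the global batch of 6 samples.
global_grad = grad_sum(w, xs, ys)

# Three mini-batches of 2 samples each, all evaluated with the same w.
mini_batch_size = 2
accumulated = 0.0
for start in range(0, len(xs), mini_batch_size):
    end = start + mini_batch_size
    accumulated += grad_sum(w, xs[start:end], ys[start:end])

print(global_grad, accumulated)
```

If w changed between mini-batches, the two values would no longer agree, which is why the variables must stay fixed during accumulation.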

Iterating through an example

So, let’s say we are accumulating gradients over 5 steps. We want to accumulate the gradients of the first 4 steps, without updating any variable. At the fifth step, we want to use the accumulated gradients of the previous 4 steps combined with the gradients of the fifth step to compute and assign the variable updates. Let’s see it in action:

Starting at the first step, all the samples of the first mini-batch propagate through the forward and backward passes, resulting in computed gradients for each trainable model variable. We don’t want to actually update the variables, so there is no need to compute the updates at this point. What we do need is a place to store the gradients of the first step so that they are accessible in the following steps. For each trainable model variable, we will use another variable to hold its accumulated gradients. So, after computing the gradients of the first step, we store them in the variables we created for the accumulated gradients.

The value of the accumulated gradients at the end of N steps
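Written out, with grad_i denoting the gradients computed at step i (a notation assumed here, since the original figure is not reproduced), the accumulated value after N steps is the plain sum:

```latex
\text{accumulated} = \sum_{i=1}^{N} \text{grad}_i
```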

Now the second step starts, and again, all the samples of the second mini-batch propagate through all the layers of the model, computing the gradients of the second step. Just as in the previous step, we don’t want to update the variables yet, so there is no need to compute the variable updates. What differs from the first step is that instead of just storing the gradients of the second step in our variables, we add them to the values already stored there, which currently hold the gradients of the first step.

Steps 3 and 4 are pretty much the same as the second step, as we are not yet updating the variables, and we are accumulating the gradients by adding them to our variables.

Then, in step 5, we do want to update the variables, as we intended to accumulate the gradients over 5 steps. After computing the gradients of the fifth step, we will add them to the accumulated gradients, resulting in the sum of all the gradients of those 5 steps.

We’ll then take this sum and pass it to the optimizer, which computes the updates from all the gradients of those 5 steps, that is, from all the samples in the global batch.
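The five steps above can be sketched as a loop. The one-weight toy model, the one-sample mini-batches, and names such as accum_steps and accum_grad are assumptions made for illustration. The accumulator starts at zero, so storing the first gradients and adding the later ones are the same operation.

```python
def mini_batch_grad(w, xs, ys):
    """Sum over samples of d/dw (w * x - y)^2 for a one-weight toy model."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys))

accum_steps = 5    # accumulate gradients over 5 steps
lr = 0.001
w = 0.5
accum_grad = 0.0   # extra variable holding the accumulated gradients

# Five mini-batches of one sample each (toy data).
mini_batches = [
    ([1.0], [2.0]), ([2.0], [4.0]), ([3.0], [6.0]),
    ([4.0], [8.0]), ([5.0], [10.0]),
]

history = []
for step, (xs, ys) in enumerate(mini_batches, start=1):
    grad = mini_batch_grad(w, xs, ys)   # forward and backward pass with current w
    accum_grad += grad                  # steps 1-5: accumulate only
    if step % accum_steps == 0:
        w = w - lr * accum_grad         # step 5: SGD update with the sum
        accum_grad = 0.0                # reset for the next global batch
    history.append(w)

print(history)
```

The weight stays at its initial value for the first four steps and changes only once, at the fifth.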

Taking the SGD optimizer as an example, let’s look at a variable after the update at the end of the fifth step, computed using the gradients of those 5 steps (N=5 in the following example):

The value of a trainable variable after N steps (using SGD)
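Using the same symbols as the SGD rule V = V - (lr * grad), and writing grad_i for the gradients of step i (a notation assumed here, since the original figure is not reproduced), the update after N steps reads:

```latex
V = V - lr \cdot \sum_{i=1}^{N} \text{grad}_i
```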

Great! So let’s implement it!

It is possible to implement a gradient-accumulated version of any optimizer. Each optimizer has a different formula and therefore will require a different implementation. This is not optimal, as gradient accumulation is a general approach and should be optimizer-independent.
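One way to keep the accumulation logic optimizer-independent is to wrap the update rule instead of rewriting it. The sketch below is hypothetical: the GradientAccumulator class and the update_fn signature are illustrative names, not the authors' implementation and not any framework's API.

```python
class GradientAccumulator:
    """Hypothetical wrapper: sums gradients, calls any update rule every N steps."""

    def __init__(self, update_fn, accum_steps):
        self.update_fn = update_fn    # callable(params, grads) -> new params
        self.accum_steps = accum_steps
        self.step_count = 0
        self.accum_grads = None

    def step(self, params, grads):
        if self.accum_grads is None:
            self.accum_grads = [0.0] * len(grads)
        # Add this step's gradients to the running sums.
        self.accum_grads = [a + g for a, g in zip(self.accum_grads, grads)]
        self.step_count += 1
        if self.step_count % self.accum_steps != 0:
            return params             # not the last step yet: no update
        new_params = self.update_fn(params, self.accum_grads)
        self.accum_grads = None       # start a fresh accumulation window
        return new_params

def sgd(params, grads, lr=0.1):
    """Plain SGD, used here only as an example update rule."""
    return [v - lr * g for v, g in zip(params, grads)]

acc = GradientAccumulator(sgd, accum_steps=2)
params = [1.0, 2.0]
after_first = acc.step(params, [1.0, 1.0])        # accumulates, no update
after_second = acc.step(after_first, [3.0, 1.0])  # updates with the sum
print(after_first, after_second)
```

Because the wrapper only decides when to call update_fn and with which gradients, any update rule with the same signature could be substituted for sgd.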

In another article, we cover the way in which we implemented a generic gradient accumulation mechanism and show you how you could use it in your own models using any optimizer of your choice.

