What is Gradient Accumulation in Deep Learning?


Backpropagation process of neural networks explained


In another article, we addressed the problem of batch size being limited by GPU memory, and how gradient accumulation helps to overcome it.

In this post, we will first examine the backpropagation process of a neural network and then go through the technical and algorithmic details of gradient accumulation. We will discuss how it works, and iterate through an example.

What is Gradient Accumulation?

Gradient accumulation is a mechanism to split the batch of samples — used for training a neural network — into several mini-batches of samples that will be run sequentially.

[Figure: Gradient accumulation]

Before going further into gradient accumulation, it is worth examining the backpropagation process of a neural network.

Backpropagation of a neural network

A deep-learning model consists of many connected layers, through which the samples propagate in the forward pass at every step. After propagating through all the layers, the network generates predictions for the samples and then calculates a loss value for each one, which specifies “how wrong was the network for this sample?”. The neural network then computes the gradients of those loss values with respect to the model parameters, and these gradients are used to calculate the updates for the respective variables.
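The article is framework-agnostic, but as a concrete illustration, here is a minimal sketch of one forward and backward pass, assuming PyTorch and a toy model (the layer sizes, loss, and batch shape are assumptions chosen for illustration):

```python
import torch
import torch.nn as nn

# A toy model; the layer sizes and the loss function are illustrative assumptions.
model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()

inputs = torch.randn(32, 10)   # one batch of 32 samples
targets = torch.randn(32, 1)

predictions = model(inputs)           # forward pass through the layers
loss = loss_fn(predictions, targets)  # "how wrong was the network?"
loss.backward()                       # gradients of the loss w.r.t. each parameter

for param in model.parameters():
    print(param.grad.shape)           # one gradient tensor per trainable variable
```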

When building the model, we choose an optimizer, which is responsible for the algorithm used to minimize the loss. The optimizer can be one of the common optimizers already implemented in the framework (SGD, Adam, etc.), or a custom optimizer implementing the desired algorithm. Along with the gradients, the optimizer may manage and use additional parameters when calculating the updates, such as the learning rate, the current step index (for adaptive learning rates), momentum terms, and so on.

The optimizer represents a mathematical formula that computes the parameter updates. A simple example is the stochastic gradient descent (SGD) algorithm: V = V - (lr * grad), where V is any trainable model parameter (a weight or a bias), lr is the learning rate, and grad is the gradient of the loss with respect to that parameter:

[Figure: The algorithm of the SGD optimizer: V = V - (lr * grad)]
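In code, that update rule is a single in-place step per parameter. A hedged sketch, continuing the PyTorch example above (the learning-rate value is an arbitrary assumption):

```python
lr = 0.01  # learning rate; the value is an arbitrary assumption

with torch.no_grad():              # parameter updates must bypass autograd
    for param in model.parameters():
        param -= lr * param.grad   # V = V - (lr * grad)
```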

So what is gradient accumulation, technically?

Gradient accumulation means running a configured number of steps without updating the model variables, accumulating the gradients of those steps, and then using the accumulated gradients to compute the variable updates.

Yes, it’s really that simple.

Running several steps without updating any of the model variables is how we logically split the batch of samples into a few mini-batches. The batch of samples used in every step is effectively a mini-batch, and all the samples of those steps combined are effectively the global batch.

By not updating the variables during any of those steps, we make all the mini-batches use the same model variables for calculating their gradients. This is mandatory, as it ensures that the same gradients and updates are calculated as if we were using the global batch size.

Accumulating the gradients in all of these steps results in the same sum of gradients as if we were using the global batch size.
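To make the equivalence concrete, here is a sketch of the arithmetic, assuming each step's gradient is the sum of the per-sample gradients of its mini-batch (with a mean-reduced loss, a constant scaling factor appears instead, which can be folded into the learning rate):

accumulated = grad_1 + grad_2 + … + grad_N
            = (sum of per-sample gradients of mini-batch 1) + … + (sum of per-sample gradients of mini-batch N)
            = sum of per-sample gradients of the global batch

The second equality holds only because none of the variables changed between the steps, so every per-sample gradient is computed at the same point.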

Iterating through an example

So, let’s say we are accumulating gradients over 5 steps. We want to accumulate the gradients of the first 4 steps, without updating any variable. At the fifth step, we want to use the accumulated gradients of the previous 4 steps combined with the gradients of the fifth step to compute and assign the variable updates. Let’s see it in action:

Starting at the first step, all the samples of the first mini-batch propagate through the forward and backward passes, resulting in computed gradients for each trainable model variable. Since we don't want to actually update the variables yet, there is no need to compute the updates at this point. What we do need, though, is a place to store the gradients of the first step so that they are accessible in the following steps, and we will use an additional variable for each trainable model variable to hold the accumulated gradients. So, after computing the gradients of the first step, we will store them in the variables we created for the accumulated gradients.
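A minimal sketch of that storage step, assuming PyTorch and the toy model from the earlier snippets:

```python
# One accumulator tensor per trainable variable, initialized to zeros.
accumulated_grads = [torch.zeros_like(p) for p in model.parameters()]

# After the first mini-batch's backward pass, store its gradients:
for accum, param in zip(accumulated_grads, model.parameters()):
    accum += param.grad  # adding to zeros simply stores the first step's gradients
```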

[Figure: The value of the accumulated gradients at the end of N steps: accumulated = grad_1 + grad_2 + … + grad_N]

Now the second step starts, and again, all the samples of the second mini-batch propagate through all the layers of the model, computing the gradients of the second step. Just like in the first step, we don't want to update the variables yet, so there is no need to compute the variable updates. What's different from the first step, though, is that instead of just storing the gradients of the second step in our variables, we add them to the values already stored there, which currently hold the gradients of the first step.

Steps 3 and 4 are pretty much the same as the second step, as we are not yet updating the variables, and we are accumulating the gradients by adding them to our variables.

Then, in step 5, we do want to update the variables, as we intended to accumulate the gradients over 5 steps. After computing the gradients of the fifth step, we will add them to the accumulated gradients, resulting in the sum of all the gradients of those 5 steps.

We'll then take this sum and pass it as a parameter to the optimizer, resulting in updates computed using all the gradients of those 5 steps, calculated over all the samples in the global batch.
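Putting the whole walkthrough together, here is a naive end-to-end sketch that hard-wires the SGD formula (data_loader, lr, model, and loss_fn are assumptions carried over from the earlier snippets; N = 5 as in the example):

```python
N = 5       # accumulate gradients over 5 steps
lr = 0.01   # arbitrary learning rate for illustration

accumulated_grads = [torch.zeros_like(p) for p in model.parameters()]

for step, (inputs, targets) in enumerate(data_loader, start=1):
    model.zero_grad()                       # clear this step's gradient buffers
    loss = loss_fn(model(inputs), targets)  # forward pass
    loss.backward()                         # backward pass: this step's gradients

    # Accumulate this step's gradients; no variable update yet.
    for accum, param in zip(accumulated_grads, model.parameters()):
        accum += param.grad

    if step % N == 0:
        # Every N-th step: update the variables with the accumulated sum,
        # then reset the accumulators for the next cycle.
        with torch.no_grad():
            for accum, param in zip(accumulated_grads, model.parameters()):
                param -= lr * accum  # V = V - lr * (grad_1 + ... + grad_N)
                accum.zero_()
```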

Taking the SGD optimizer as an example, let's see the value of a trainable variable after the update at the end of the fifth step, computed using the gradients of those 5 steps (N=5 in the following example):

[Figure: The value of a trainable variable after N steps (using SGD): V = V - lr * (grad_1 + grad_2 + … + grad_N)]

Great! So let’s implement it!

It is possible to implement a gradient-accumulated version of any optimizer. Each optimizer has a different formula and therefore will require a different implementation. This is not optimal, as gradient accumulation is a general approach and should be optimizer-independent.

In another article, we cover the way in which we implemented a generic gradient accumulation mechanism and show you how you could use it in your own models using any optimizer of your choice.

