Gradient Descent Extensions to Your Deep Learning Models



Learn about the different available methods, and to select the one most appropriate to solve your problem.


Source: Pixabay

Introduction

The objective of this article is to explore different Gradient Descent extensions such as Momentum, AdaGrad, RMSprop, and Adam.

In previous articles, we have studied three methods to implement back-propagation in Deep Learning models:

  • Gradient Descent
  • Stochastic Gradient Descent
  • Mini-Batch Stochastic Gradient Descent

Of these, we keep Mini-Batch Stochastic Gradient Descent, because it allows for greater speed (it does not have to calculate gradients and errors for the entire dataset) and it reduces the high variance of Stochastic Gradient Descent.
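As a concrete baseline for what follows, here is a minimal mini-batch SGD loop on a toy, noiseless least-squares problem. The data, learning rate, and batch size are all illustrative assumptions, not taken from any particular model:

```python
import numpy as np

# Toy, noiseless least-squares problem; all values here are illustrative.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

w = np.zeros(3)
lr = 0.1
batch_size = 10

for epoch in range(200):
    order = rng.permutation(len(X))               # reshuffle each epoch
    for start in range(0, len(X), batch_size):
        batch = order[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        grad = Xb.T @ (Xb @ w - yb) / len(batch)  # gradient on the mini-batch only
        w -= lr * grad                            # plain SGD step
```

Each update touches only `batch_size` samples, which is where the speed-up over full-batch Gradient Descent comes from.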

There are improvements over these methods, such as Momentum, as well as more sophisticated algorithms such as Adam, RMSprop, and AdaGrad.

Let’s see them!

Momentum

Imagine being a kid again and having the great idea of putting on your skates, climbing up the steepest street, and starting to go down it. You are a total beginner and this is only the second time you have worn skates.

I don’t know if any of you have ever really done this, but well, I have, so let me explain what happens:

  • You just start: the speed is low, you even seem to be in control, and you feel you could stop at any time.
  • But the lower you go, the faster you move: the farther down the road you travel, the more inertia you carry and the faster you go. This is momentum.
  • For those of you who are curious, the end of the story is that at the bottom of the steep street there was a fence. The rest you can imagine…

Well, the Momentum technique is precisely this. As we descend our loss curve, calculating gradients and making updates, we give more weight to updates that keep pointing in the direction that minimizes the loss, and less weight to those that point in other directions.

Figure by the Author

So, the result is to speed up the training of the network.

Also, thanks to the accumulated momentum, we may fly right over small potholes and holes in the road; in optimization terms, momentum can help us escape shallow local minima.

You can learn more about the mathematic foundation behind this technique in this great post: http://cs231n.github.io/neural-networks-3/#sgd
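The momentum update described above can be sketched in a few lines. Everything here (the `grad` callback, the toy quadratic problem, and the hyperparameter values) is an illustrative assumption:

```python
import numpy as np

def sgd_momentum(grad, w0, lr=0.01, mu=0.9, steps=500):
    """Gradient descent with momentum; `grad(w)` returns the gradient at w."""
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)           # velocity: accumulated past gradients
    for _ in range(steps):
        v = mu * v - lr * grad(w)  # consistent directions build up speed
        w = w + v                  # step along the velocity, not the raw gradient
    return w

# Toy quadratic bowl with its minimum at (3, -1)
w = sgd_momentum(lambda w: 2 * (w - np.array([3.0, -1.0])), [0.0, 0.0])
```

The key design choice is that the parameters move along the accumulated velocity `v` rather than the raw gradient, so directions the gradients agree on gain speed while oscillating components cancel out.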

Nesterov Momentum

Going back to the earlier example: we are going down the road at full speed (because we have built up a lot of momentum) and suddenly we see the end of it. We would like to be able to brake, to slow down and avoid crashing. Well, this is precisely what Nesterov does.

Nesterov calculates the gradient, but instead of doing it at the current point, it does it at the point where we know our momentum is going to take us, and then applies a correction.

Figure by Author

Notice that using standard momentum, we calculate the gradient (small orange vector) and then take a big step in the direction of the gradient (large orange vector).

Using Nesterov, we would first make a big jump in the direction of our previous gradient (green vector), measure the gradient and make the appropriate correction (red vector).

In practice, it works a little better than momentum alone. It is like calculating the gradient of the weights in the future (because we have already added the momentum step).
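The only change from the standard momentum sketch is where the gradient is evaluated. As before, the toy problem and hyperparameter values are illustrative assumptions:

```python
import numpy as np

def nesterov_momentum(grad, w0, lr=0.01, mu=0.9, steps=500):
    """Nesterov momentum: evaluate the gradient at the look-ahead point."""
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(steps):
        lookahead = w + mu * v                 # where momentum is about to take us
        v = mu * v - lr * grad(lookahead)      # correct using the future gradient
        w = w + v
    return w

# Same toy quadratic bowl with its minimum at (3, -1)
w = nesterov_momentum(lambda w: 2 * (w - np.array([3.0, -1.0])), [0.0, 0.0])
```

Evaluating the gradient at `w + mu * v` is what lets the optimizer "brake" before the accumulated velocity carries it past the minimum.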


Both Nesterov momentum and standard momentum are extensions of SGD.

The methods we will see now are based on adaptive learning rates, which allow us to speed up or slow down how quickly the weights are updated. For example, we could use a high learning rate at the beginning and lower it as we approach the minimum.

Adaptive gradient (AdaGrad)

AdaGrad keeps a history of the calculated gradients (specifically, the sum of the squared gradients) and uses it to normalize the size of each update step.

The intuition behind it is that parameters with very large gradients would receive very abrupt updates, so AdaGrad assigns them a lower effective learning rate to soften those updates.

At the same time, parameters with very small gradients are assigned a higher effective learning rate.

In this way, we manage to accelerate the convergence of the algorithm.

You can learn more about the theory behind this technique in its original paper here: http://jmlr.org/papers/v12/duchi11a.html
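A minimal sketch of AdaGrad's per-parameter scaling, on the same kind of illustrative toy problem (the hyperparameter values are assumptions, not recommendations):

```python
import numpy as np

def adagrad(grad, w0, lr=0.5, eps=1e-8, steps=2000):
    """AdaGrad: each parameter's step is scaled by its gradient history."""
    w = np.asarray(w0, dtype=float)
    cache = np.zeros_like(w)                   # running sum of squared gradients
    for _ in range(steps):
        g = grad(w)
        cache += g ** 2                        # monotonically increasing history
        w -= lr * g / (np.sqrt(cache) + eps)   # per-parameter scaled step
    return w

# Toy quadratic bowl with its minimum at (3, -1)
w = adagrad(lambda w: 2 * (w - np.array([3.0, -1.0])), [0.0, 0.0])
```

Because `cache` is kept separately per parameter, a parameter with a history of large gradients automatically takes smaller steps than one with a history of small gradients.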

RMSprop

The problem with AdaGrad is that the sum of the squared gradients is monotonically increasing: the effective learning rate keeps shrinking to compensate for a quantity that never stops growing, until it reaches zero and learning stops.

What RMSprop proposes is to replace that sum of the squared gradients with an exponentially decaying moving average, controlled by a decay_rate.

RMSprop was never formally published as a paper; it was introduced in Geoffrey Hinton's Coursera lecture notes, which you can read here: http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
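The change from AdaGrad is a single line: the gradient history becomes a leaky average instead of a growing sum. As before, this is a sketch on an assumed toy problem with illustrative hyperparameters:

```python
import numpy as np

def rmsprop(grad, w0, lr=0.01, decay_rate=0.9, eps=1e-8, steps=2000):
    """RMSprop: AdaGrad with an exponentially decaying gradient history."""
    w = np.asarray(w0, dtype=float)
    cache = np.zeros_like(w)
    for _ in range(steps):
        g = grad(w)
        cache = decay_rate * cache + (1 - decay_rate) * g ** 2  # leaky average
        w -= lr * g / (np.sqrt(cache) + eps)
    return w

# Toy quadratic bowl with its minimum at (3, -1)
w = rmsprop(lambda w: 2 * (w - np.array([3.0, -1.0])), [0.0, 0.0])
```

Note that with a fixed learning rate, RMSprop tends to hover around the minimum at a scale set by `lr`, which is one reason it is often combined with a learning-rate schedule.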

Adam

Finally, Adam is one of the most modern algorithms; it improves on RMSprop by adding momentum to the update rule. It introduces two new hyperparameters, beta1 and beta2, with recommended values of 0.9 and 0.999.

You can check out its paper here: https://arxiv.org/abs/1412.6980 .
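Putting the pieces together, a minimal Adam sketch keeps a moving average of the gradients (the momentum term) and of their squares (the RMSprop term), with a bias correction for the zero-initialized averages. The toy problem and learning rate are illustrative assumptions; beta1 and beta2 use the paper's recommended values:

```python
import numpy as np

def adam(grad, w0, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8, steps=2000):
    """Adam: RMSprop-style adaptive scaling plus momentum, with bias correction."""
    w = np.asarray(w0, dtype=float)
    m = np.zeros_like(w)                  # first moment (momentum term)
    v = np.zeros_like(w)                  # second moment (RMSprop term)
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)      # bias correction for zero init
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

# Toy quadratic bowl with its minimum at (3, -1)
w = adam(lambda w: 2 * (w - np.array([3.0, -1.0])), [0.0, 0.0])
```

The bias correction matters early in training: since `m` and `v` start at zero, their raw values underestimate the true moments for the first few steps.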

But then, which one should we use?


Source: original ADAM paper

As a rule of thumb, the recommendation is to start with Adam. If it does not work well, you can then try tuning the other techniques. But most of the time, Adam works great.

You can revisit the resources linked throughout this post to gain a better understanding of these techniques, and of how and when to apply them.

Final Words

As always, I hope you enjoyed the post!

If you liked this post then you can take a look at my other posts on Data Science and Machine Learning here.

If you want to learn more about Machine Learning, Data Science and Artificial Intelligence follow me on Medium, and stay tuned for my next posts!

