Don’t look backwards, LookAhead!



How to make your optimizer less sensitive to the choice of hyperparameters

May 30 · 5 min read


Image by rihaij from Pixabay

The task of an optimizer is to find a set of weights for which a neural network model yields the lowest possible loss. If you had only one weight and a loss function like the one depicted below, you wouldn’t have to be a genius to find the solution.


Unfortunately, you normally have a multitude of weights and a loss landscape that is far from simple and can no longer be captured in a 2D drawing.


The loss surface of ResNet-56 without skip connections, visualized using a method proposed in https://arxiv.org/pdf/1712.09913.pdf.

Finding a minimum of such a function is no longer a trivial task. The most common optimizers, like Adam or SGD, require very time-consuming hyperparameter tuning and can get stuck in local minima. The importance of choosing a hyperparameter such as the learning rate can be summarized by the following picture:


A learning rate that is too large causes oscillations around the minimum, while a learning rate that is too small makes the learning process very slow.
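To make this concrete, here is a minimal Python sketch (a toy example, not taken from the original figure) that runs plain gradient descent on the one-weight loss f(w) = w². The function name gradient_descent and the three learning rates are assumptions chosen only for illustration.

```python
def gradient_descent(lr, w0=1.0, steps=20):
    """Minimise the toy loss f(w) = w**2 and return every iterate."""
    w = w0
    path = [w]
    for _ in range(steps):
        grad = 2.0 * w        # derivative of w**2
        w = w - lr * grad     # plain gradient descent update
        path.append(w)
    return path


too_small = gradient_descent(lr=0.01)   # creeps towards the minimum at w = 0
reasonable = gradient_descent(lr=0.3)   # converges quickly
too_big = gradient_descent(lr=0.95)     # jumps across the minimum on every step

for name, path in [("too small", too_small),
                   ("reasonable", reasonable),
                   ("too big", too_big)]:
    print(f"{name:>10}: first iterates {path[:4]}, final |w| = {abs(path[-1]):.2e}")
```

With the large learning rate the sign of w flips on every step, which is the oscillation around the minimum; with the small one the weight has barely moved after 20 steps.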

The recently proposed LookAhead optimizer makes the optimization process less sensitive to suboptimal hyperparameters and therefore lessens the need for extensive hyperparameter tuning.

It sounds like something worth exploring!

The algorithm

Intuitively, the algorithm chooses a search direction by looking ahead at the sequence of “fast weights” generated by another optimizer.

The optimizer keeps two sets of weights: fast weights θ and slow weights ϕ. Both are initialized with the same values. A standard optimizer (e.g. Adam, SGD, …) with a learning rate η is used to update the fast weights θ for a fixed number of steps k, resulting in new values θ’.

Then a crucial thing happens: the slow weights ϕ are moved along the direction defined by the difference of the weight vectors, θ’ − ϕ. The length of this step is controlled by the parameter α, the slow weights learning rate.

ϕ’ = ϕ + α (θ’ − ϕ)

Crucial update in the LookAhead algorithm

The process is then repeated, starting by resetting the fast weights to the newly computed slow weights ϕ’. You can see the pseudocode below:


source: https://arxiv.org/pdf/1907.08610.pdf
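The steps described above can also be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the paper's reference code: plain SGD plays the role of the inner optimizer, and the names lookahead_sgd, quadratic_grad and the toy quadratic loss are hypothetical.

```python
import numpy as np


def lookahead_sgd(grad_fn, init, lr=0.1, k=5, alpha=0.5, outer_steps=20):
    """LookAhead wrapped around plain SGD (illustrative sketch).

    grad_fn: callable returning the gradient of the loss at given weights
    lr:      learning rate eta of the inner (fast) optimizer
    k:       number of fast steps between two slow-weight updates
    alpha:   slow weights learning rate
    """
    slow = np.asarray(init, dtype=float).copy()
    for _ in range(outer_steps):
        fast = slow.copy()                 # reset fast weights to the slow weights
        for _ in range(k):
            fast -= lr * grad_fn(fast)     # k steps of the inner optimizer
        slow += alpha * (fast - slow)      # move slow weights along (theta' - phi)
    return slow


def quadratic_grad(w):
    """Gradient of the toy loss 0.5 * (w[0]**2 + 10 * w[1]**2)."""
    return np.array([1.0, 10.0]) * w


result = lookahead_sgd(quadratic_grad, [1.0, 1.0], lr=0.15, k=5, alpha=0.5)
print(result)  # both weights end up close to the minimum at the origin
```

The toy loss has one high-curvature and one low-curvature direction, so the fast weights overshoot along the second coordinate while the interpolation step damps that overshoot.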

What’s the point of this?

To answer this question we will study the (slightly modified) picture from the LookAhead publication, but as an introduction let’s first look at another picture. If our model had only three weights, the loss function could easily be visualized as in the picture below.


Visualization of the loss function in weight space for a model that depends on only three weights. The loss is projected onto three planes (“hyperplanes”) in weight space, each with one weight held at a constant value.

Obviously, real-life models have many more than three weights, so the weight space has a much higher dimensionality. Nevertheless, we can still visualize the loss by projecting it onto a hyperplane in that space.
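One simple way to obtain such a 2D view is to evaluate the loss on a grid spanned by two directions around a chosen point in weight space. The sketch below does this for a toy quadratic loss with 50 weights. Everything in it (loss_slice, toy_loss, the random directions) is a hypothetical illustration and not the procedure used for the figures in this article.

```python
import numpy as np


def loss_slice(loss_fn, center, dir_a, dir_b, span=1.0, resolution=25):
    """Evaluate loss_fn on the plane center + a * dir_a + b * dir_b."""
    coords = np.linspace(-span, span, resolution)
    grid = np.empty((resolution, resolution))
    for i, a in enumerate(coords):
        for j, b in enumerate(coords):
            grid[i, j] = loss_fn(center + a * dir_a + b * dir_b)
    return coords, grid


rng = np.random.default_rng(0)
n_weights = 50
curvatures = rng.uniform(0.1, 5.0, size=n_weights)


def toy_loss(w):
    """Toy quadratic loss with its minimum at the origin."""
    return 0.5 * np.sum(curvatures * w ** 2)


center = np.zeros(n_weights)              # slice through the minimum
dir_a = rng.normal(size=n_weights)
dir_a /= np.linalg.norm(dir_a)            # unit-length random direction
dir_b = rng.normal(size=n_weights)
dir_b /= np.linalg.norm(dir_b)

coords, grid = loss_slice(toy_loss, center, dir_a, dir_b)
print(grid.shape)  # one loss value per grid point
```

The resulting grid can be passed to any contour or heatmap plotting routine to draw a picture of the kind discussed here.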

That is what is presented in the LookAhead paper:


source: https://arxiv.org/pdf/1907.08610.pdf

We see a projection of the objective function (in this case it’s accuracy, but it could just as well be loss) onto a hyperplane in the weight space. Different colors correspond to different objective function values: the brighter the color, the better the value. The behavior of the LookAhead optimizer is shown in the following way: the blue dashed line represents the trajectory of the fast weights θ (with blue squares indicating ten successive states), while the violet line shows the direction of the slow weights update, θ’ − ϕ. The violet triangles indicate two successive slow-weights values, ϕ and ϕ’. The distance between the triangles is determined by the slow-weights learning rate α.

We can see that the standard optimizer (in this case SGD) traverses a sub-optimal green region, whereas the second slow-weight state is already much closer to the optimum. The paper describes it more elegantly:

When oscillating in the high curvature directions, the fast weights updates make rapid progress along the low curvature directions. The slow weights help smooth out the oscillations through the parameter interpolation. The combination of fast weights and slow weights improves learning in high curvature directions, reduces variance, and enables Lookahead to converge rapidly in practice.

How to use it in Keras?

Now to the practical side of the method: so far there is only an unofficial Keras implementation, which can easily be used with your current optimizer:

As you can see, apart from the optimizer itself Lookahead expects two arguments:

  • sync_period, which corresponds to the previously introduced k, the number of steps after which the two sets of weights are synchronized,
  • slow_step, which corresponds to α, the learning rate of the slow weights.

To check that it works as expected you can set slow_step to 1 and compare the behavior of Lookahead with that of a regular optimizer.

For α = 1 the LookAhead update step reduces to:

ϕ’ = ϕ + 1 · (θ’ − ϕ) = θ’

which means that LookAhead reduces to its underlying standard optimizer. We can also see it on the modified weights trajectory picture:


Adapted from: https://arxiv.org/pdf/1907.08610.pdf

Now the end state of the slow weights is the same as the end state of the fast weights.

You can test it using the following code:

Test to prove that the LookAhead with Adam and slow learning rate of 1 is equivalent to pure Adam.
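The same check can be sketched without Keras. The NumPy example below is an assumption-laden stand-in, not the original Keras test: it uses a hand-written minimal Adam, a toy quadratic gradient, and the hypothetical helpers run_plain and run_lookahead. The inner optimizer keeps its moment estimates across synchronizations, which is what makes the equivalence hold.

```python
import numpy as np


class Adam:
    """Minimal Adam optimizer that keeps its own moment estimates."""

    def __init__(self, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m, self.v, self.t = 0.0, 0.0, 0

    def step(self, weights, grad):
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad ** 2
        m_hat = self.m / (1 - self.beta1 ** self.t)   # bias-corrected first moment
        v_hat = self.v / (1 - self.beta2 ** self.t)   # bias-corrected second moment
        return weights - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)


def grad_fn(w):
    """Gradient of the toy loss 0.5 * (w[0]**2 + 10 * w[1]**2)."""
    return np.array([1.0, 10.0]) * w


def run_plain(steps):
    opt, w = Adam(), np.array([1.0, 1.0])
    for _ in range(steps):
        w = opt.step(w, grad_fn(w))
    return w


def run_lookahead(steps, sync_period=5, slow_step=1.0):
    opt = Adam()                       # optimizer state persists across syncs
    slow = np.array([1.0, 1.0])
    fast = slow.copy()
    for i in range(1, steps + 1):
        fast = opt.step(fast, grad_fn(fast))
        if i % sync_period == 0:
            slow = slow + slow_step * (fast - slow)   # slow weights update
            fast = slow.copy()                        # re-synchronize fast weights
    return fast


plain = run_plain(50)
with_lookahead = run_lookahead(50, slow_step=1.0)
print(np.allclose(plain, with_lookahead))
```

The comparison uses np.allclose rather than exact equality because ϕ + (θ’ − ϕ) can differ from θ’ by floating-point rounding.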

Final word

LookAhead is an effective optimization algorithm that, at a negligible computational cost, makes the process of finding the minimum of a loss function more stable. What is more, it requires less hyperparameter tuning.

It is said to be particularly effective when combined with the Rectified Adam optimizer. I will cover this topic in my next article.

