Don’t look backwards, LookAhead!



How to make your optimizer less sensitive to the choice of hyperparameters


[Header image by rihaij from Pixabay]

The task of an optimizer is to find a set of weights for which a NN model yields the lowest possible loss. If you only had one weight and a loss function like the one depicted below, you wouldn’t have to be a genius to find the solution.

[Figure: loss as a function of a single weight]

Unfortunately, you normally have a multitude of weights and a loss landscape that is far from simple, not to mention no longer suited to a 2D drawing.

[Figure: the loss surface of ResNet-56 without skip connections, visualized using the method proposed in https://arxiv.org/pdf/1712.09913.pdf]

Finding a minimum of such a function is no longer a trivial task. The most common optimizers like Adam or SGD require very time-consuming hyperparameter tuning and can get caught in local minima. The importance of choosing a hyperparameter like the learning rate can be summarized by the following picture:

[Figure: a learning rate that is too large causes oscillations around the minimum, while one that is too small makes learning extremely slow]

The recently proposed LookAhead optimizer makes the optimization process less sensitive to suboptimal hyperparameters and therefore lessens the need for extensive hyperparameter tuning.

It sounds like something worth exploring!

The algorithm

Intuitively, the algorithm chooses a search direction by looking ahead at the sequence of “fast weights” generated by another optimizer.

The optimizer keeps two sets of weights: fast weights θ and slow weights ϕ. They are both initialized with the same values. A standard optimizer (e.g. Adam, SGD, …) with a certain learning rate η is used to update the fast weights θ for a defined number of steps k, resulting in some new values θ’.

Then a crucial thing happens: the slow weights ϕ are moved along the direction defined by the difference of the weight vectors θ’ − ϕ. The length of this step is controlled by the parameter α, the slow-weight learning rate.

ϕ’ = ϕ + α (θ’ − ϕ)

Crucial update in the LookAhead algorithm
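As a quick sanity check with arbitrary numbers: for ϕ = 0.0, θ’ = 2.0 and α = 0.5, the update gives ϕ’ = 0.0 + 0.5 · (2.0 − 0.0) = 1.0, so the slow weights move halfway toward the fast weights.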

The process then repeats, starting by resetting the fast weights to the newly computed slow-weight values ϕ’. You can see the pseudocode below:

[Pseudocode of the LookAhead algorithm; source: https://arxiv.org/pdf/1907.08610.pdf]
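Since the pseudocode image is not reproduced here, below is a minimal, framework-agnostic sketch of the same loop in plain Python/NumPy. The names (lookahead, inner_step, loss_grad) and the toy quadratic example are illustrative assumptions, not taken from the paper or from any library.

```python
import numpy as np

def lookahead(init_weights, inner_step, loss_grad, k=5, alpha=0.5, outer_steps=100):
    """Minimal sketch of the LookAhead loop: k fast steps, then one slow interpolation."""
    slow = np.array(init_weights, dtype=float)       # slow weights phi
    for _ in range(outer_steps):
        fast = slow.copy()                           # reset fast weights theta to phi
        for _ in range(k):                           # k updates of the inner optimizer
            fast = inner_step(fast, loss_grad(fast))
        slow = slow + alpha * (fast - slow)          # phi' = phi + alpha * (theta' - phi)
    return slow

# Toy usage: quadratic loss L(w) = 0.5 * ||w||^2, inner optimizer = plain SGD
sgd_step = lambda w, g, lr=0.1: w - lr * g
grad = lambda w: w                                   # gradient of 0.5 * ||w||^2
print(lookahead(np.array([5.0, -3.0]), sgd_step, grad))  # ends up near the minimum at the origin
```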

What’s the point of this?

To answer this question we will study the (slightly modified) picture from the LookAhead publication, but as an introduction let’s first look at another picture. If our model had only three weights, the loss function could easily be visualized as in the picture below.

[Figure: visualization of the loss in weight space for a model with only three weights; the loss is projected onto three planes (“hyperplanes”), each obtained by holding one weight constant]

Obviously, in real-life examples we have far more than three weights, resulting in a weight space of much higher dimensionality. Nevertheless, we can still visualize the loss by projecting it onto a hyperplane in that space.

That is what is presented in the LookAhead paper:

[Figure from the LookAhead paper; source: https://arxiv.org/pdf/1907.08610.pdf]

We see a projection of the objective function (in this case accuracy, but it could just as well be loss) onto a hyperplane in the weight space. Different colors correspond to different objective-function values: the brighter the color, the better the value. The behavior of the LookAhead optimizer is shown as follows: the blue dashed line represents the trajectory of the fast weights θ (with blue squares indicating ten subsequent states), while the violet line shows the direction of the fast-weight update θ’ − ϕ. The violet triangles indicate two subsequent slow-weight values ϕ and ϕ’. The distance between the triangles is determined by the slow-weight learning rate α.

We can see that the standard optimizer (in this case SGD) traverses a sub-optimal green region, whereas the second slow-weight state is already much closer to the optimum. The paper describes it more elegantly:

When oscillating in the high curvature directions, the fast weights updates make rapid progress along the low curvature directions. The slow weights help smooth out the oscillations through the parameter interpolation. The combination of fast weights and slow weights improves learning in high curvature directions, reduces variance, and enables Lookahead to converge rapidly in practice.

How to use it in Keras?

Now to the practical side of the method: so far there is only an unofficial Keras implementation, which can easily be used with your current optimizer:
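The embedded gist is missing from this copy of the article, so here is a minimal sketch of how the wrapper is typically applied, assuming the unofficial keras_lookahead package and a placeholder model:

```python
from tensorflow import keras
from keras_lookahead import Lookahead   # unofficial implementation (assumed package name)

# Placeholder model, just to show where the wrapped optimizer goes.
model = keras.Sequential([
    keras.layers.Dense(32, activation="relu", input_shape=(10,)),
    keras.layers.Dense(1),
])

# Wrap any standard optimizer; sync_period corresponds to k, slow_step to alpha.
optimizer = Lookahead(keras.optimizers.Adam(learning_rate=1e-3),
                      sync_period=5, slow_step=0.5)

model.compile(optimizer=optimizer, loss="mse")
```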

As you can see, apart from the optimizer itself, Lookahead expects two arguments:

  • sync_period, which corresponds to the previously introduced k, i.e. the number of steps after which the two sets of weights are synchronized,
  • slow_step, which corresponds to α, the learning rate of the slow weights.

To check that it works as expected, you can set slow_step to 1 and compare the behavior of Lookahead with that of a regular optimizer.

For an α of 1 the LookAhead update step reduces to:

ϕ’ = ϕ + 1 · (θ’ − ϕ) = θ’

which means that LookAhead reduces to its underlying standard optimizer. We can also see it in the modified weight-trajectory picture:

[Figure adapted from https://arxiv.org/pdf/1907.08610.pdf]

Now the end state of the slow weights is the same as the end state of the fast weights.

You can test it using the following code:
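The original gist is also missing here; the following sketch shows one way such a test could look, again assuming the unofficial keras_lookahead package (the toy model, data, and tolerance are illustrative):

```python
# Sketch of the equivalence test; keras_lookahead and the toy model/data are assumptions.
import numpy as np
from tensorflow import keras
from keras_lookahead import Lookahead

def build_model():
    keras.utils.set_random_seed(0)           # identical initial weights for both runs
    return keras.Sequential([
        keras.layers.Dense(16, activation="relu", input_shape=(8,)),
        keras.layers.Dense(1),
    ])

rng = np.random.default_rng(0)
x, y = rng.normal(size=(256, 8)), rng.normal(size=(256, 1))

# Run 1: plain Adam.
model_a = build_model()
model_a.compile(optimizer=keras.optimizers.Adam(1e-3), loss="mse")
model_a.fit(x, y, epochs=3, batch_size=32, shuffle=False, verbose=0)

# Run 2: LookAhead wrapping Adam with slow_step=1 (alpha = 1).
model_b = build_model()
model_b.compile(optimizer=Lookahead(keras.optimizers.Adam(1e-3),
                                    sync_period=5, slow_step=1.0),
                loss="mse")
model_b.fit(x, y, epochs=3, batch_size=32, shuffle=False, verbose=0)

# With alpha = 1 the two runs should end up with (numerically) identical weights.
for wa, wb in zip(model_a.get_weights(), model_b.get_weights()):
    assert np.allclose(wa, wb, atol=1e-5)
print("LookAhead with slow_step=1 matches plain Adam.")
```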

A test showing that LookAhead with Adam and a slow-weight learning rate of 1 is equivalent to pure Adam.

Final word

LookAhead is an effective optimization algorithm that, at negligible computational cost, makes finding the minimum of a loss function more stable. What is more, it requires less hyperparameter tuning.

It is said to be particularly effective when combined with the Rectified Adam optimizer. I will cover this topic in my next article.

