Improving Deep Neural Networks

Andrew Ng’s advice for Hyperparameter Tuning and Regularisation from his Deep Learning Specialisation course.

Ahilan Srivishnumohan

Jul 19 · 6 min read

Improving Deep Neural Networks — Pawel Kadysz

I have recently been going through Coursera’s Deep Learning Specialisation course, designed and taught by Andrew Ng. The second sub-course is Improving Deep Neural Networks: Hyperparameter Tuning, Regularisation, and Optimisation. Before starting this sub-course I had already carried out all of those steps for traditional machine learning algorithms in previous projects. I have tuned hyperparameters such as max_depth and min_samples_leaf for decision trees, and C, kernel, and gamma for SVMs. For regularisation I have applied Ridge (L2 penalty), Lasso (L1 penalty), and ElasticNet (L1 and L2) to regression models. So I thought it would not be much more than translating those concepts over to neural networks. I was somewhat right, but thanks to the way Andrew Ng explains the mathematics and visually represents the inner workings of these optimisation methods, I now have a much deeper understanding of the fundamentals.

In this article I want to go over some of Andrew’s explanations for these techniques, accompanied by some mathematics and diagrams.

Hyperparameter Tuning

Here are a few popular hyperparameters that are tuned for deep networks:

α (alpha): learning rate
β (beta): momentum
number of layers
number of hidden units
learning rate decay
mini-batch size

Others are specific to particular optimisation techniques; for instance, β1, β2, and ε for Adam optimisation.

Grid Search vs Random Search

Let’s say we are tuning more than one hyperparameter for a model. One of them will probably have more influence on train/validation accuracy than another. In that case we may want to try a wider variety of values for the more impactful hyperparameter, but at the same time we don’t want to run too many models, as that is time-consuming.

For this example, let us say we are optimising two hyperparameters, α and ε. We know α is more important and should be tuned by trying as many different values as possible, but we still want to try 5 different ε values. If I choose to try 5 different α values as well, a grid search comes to 25 models, one for each combination of the 5 α values and 5 ε values.

But we want to try more α values without increasing the number of models. Here is Andrew’s solution:

For this we use a random search: we draw 25 random values each for α and ε, and each model uses one (α, ε) pair. We still only have to run 25 models, but we get to try 25 different values of α instead of the 5 we tried in the grid search.
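The difference between the two strategies can be sketched in a few lines of Python. This is only an illustration: the function names, the concrete value lists, and the ranges are assumptions rather than values from the course, and the random values are drawn uniformly on a linear scale purely to keep the sketch short.

```python
import itertools
import random

def grid_candidates(alphas, epsilons):
    """Every (alpha, epsilon) pair from two fixed lists of values."""
    return list(itertools.product(alphas, epsilons))

def random_candidates(n_models, alpha_range, epsilon_range, seed=0):
    """n_models pairs, each with its own freshly drawn alpha and epsilon."""
    rng = random.Random(seed)
    return [
        (rng.uniform(*alpha_range), rng.uniform(*epsilon_range))
        for _ in range(n_models)
    ]

# Illustrative values only; the article does not give concrete ranges.
grid = grid_candidates(
    alphas=[0.001, 0.003, 0.01, 0.03, 0.1],
    epsilons=[1e-8, 1e-7, 1e-6, 1e-5, 1e-4],
)
rand = random_candidates(25, alpha_range=(0.001, 0.1), epsilon_range=(1e-8, 1e-4))

# Both strategies train 25 models, but they differ in how many
# distinct alpha values those 25 models cover.
print("grid:  ", len(grid), "models,", len({a for a, _ in grid}), "distinct alphas")
print("random:", len(rand), "models,", len({a for a, _ in rand}), "distinct alphas")
```

Each pair would then be used to train and evaluate one model, so the training budget is the same in both cases.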

Bonus: a coarse-to-fine search can improve tuning further. This involves zooming in on the smaller region of hyperparameter values that performed best and then training more models within that region to tune those hyperparameters more precisely.
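A rough sketch of coarse-to-fine search for a single hyperparameter might look like the following. Everything here is an assumption made for illustration: the helper names, the shrink factor that sets the width of the fine window, and toy_score, which stands in for "train a model and return its validation accuracy".

```python
import random

def sample_uniform(n, low, high, rng):
    """Draw n values uniformly from [low, high]."""
    return [rng.uniform(low, high) for _ in range(n)]

def coarse_to_fine(score, low, high, n_coarse=25, n_fine=25, shrink=0.2, seed=0):
    """Search [low, high] coarsely, then resample a narrower window around the best point."""
    rng = random.Random(seed)

    # Coarse pass over the whole range.
    coarse = sample_uniform(n_coarse, low, high, rng)
    best = max(coarse, key=score)

    # Fine pass: a window whose width is `shrink` times the original range,
    # centred on the best coarse value and clipped to the original bounds.
    half_width = shrink * (high - low) / 2
    fine_low = max(low, best - half_width)
    fine_high = min(high, best + half_width)
    fine = sample_uniform(n_fine, fine_low, fine_high, rng)

    return max(coarse + fine, key=score), (fine_low, fine_high)

def toy_score(x):
    """Hypothetical stand-in for validation accuracy, peaking at x = 0.3."""
    return -(x - 0.3) ** 2

best_x, fine_window = coarse_to_fine(toy_score, 0.0, 1.0)
print("best value:", best_x)
print("fine search window:", fine_window)
```

In practice the score function is the expensive part, so the number of coarse and fine samples is a budget decision rather than something this sketch can settle.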

Choosing a Scale

When trying out different hyperparameter values, choosing the right scale can be difficult, especially when you need to search thoroughly across both very small and much larger values.

Learning rate is a hyperparameter that can vary greatly between models: the best value might lie between 0.000001 and 0.000002, or between 0.8 and 0.9. It is very hard to search both of these ranges fairly at once on a linear scale, but we can solve this by using a log scale.

Let’s say we are looking at values between 0.0001 and 1 for α. Using a linear scale means about 10% of the attempted α values are between 0.0001 and 0.1 and 90% are between 0.1 and 1. This is bad, as we are not searching such a wide range of values thoroughly. By using a log base 10 scale, 25% of α values are between 0.0001 and 0.001, 25% between 0.001 and 0.01, 25% between 0.01 and 0.1, and a final 25% between 0.1 and 1. This way we have a thorough search of α. The range of 0.0001 to 0.1 received 10% of the values with a linear scale but 75% with a log scale.

Here is a little bit of mathematics with a numpy function to demonstrate how this works for a random value for α.
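A minimal sketch of that sampling, assuming α is searched between 0.0001 and 1 as in the example above: draw an exponent r uniformly from [-4, 0] and set α = 10^r. One common way to write this is `r = -4 * np.random.rand()` followed by `alpha = 10 ** r`; the version below uses NumPy's Generator API instead, and the function name, seed, and sample count are illustrative choices.

```python
import numpy as np

def sample_log_uniform(rng, low_exp=-4.0, high_exp=0.0, size=None):
    """Sample 10**r with r drawn uniformly from [low_exp, high_exp]."""
    r = rng.uniform(low_exp, high_exp, size=size)
    return 10.0 ** r

rng = np.random.default_rng(0)

# One random learning rate between 0.0001 and 1.
alpha = sample_log_uniform(rng)
print("alpha:", alpha)

# Compare how the two scales spread 10,000 samples over the same range.
log_samples = sample_log_uniform(rng, size=10_000)
lin_samples = rng.uniform(0.0001, 1.0, size=10_000)

frac_log = np.mean(log_samples < 0.1)
frac_lin = np.mean(lin_samples < 0.1)
print(f"log scale:    {frac_log:.2f} of samples fall below 0.1")
print(f"linear scale: {frac_lin:.2f} of samples fall below 0.1")
```

The two fractions should land close to the 75% and 10% figures worked out above, up to sampling noise.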

Regularisation

Overfitting due to high variance can be a huge problem for models. It can be addressed by getting more training data, but that’s not always possible, so a great alternative is regularisation.

L2 Regularisation (‘Weight Decay’)

Regularisation typically uses one of two penalty techniques, L1 or L2; with neural networks, L2 is predominantly used.

We must first look at the cost function for a neural network:

Cost Function

And then add the L2 penalty term, which includes the Frobenius Norm:
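Written out in the notation the course commonly uses, the unregularised cost, the L2-regularised cost, and the squared Frobenius norm look as follows. The symbols are assumptions to the extent that the article does not define them: m is the number of training examples, L the number of layers, n^{[l]} the number of units in layer l, and \mathcal{L} the per-example loss.

```latex
% Unregularised cost: average loss over the m training examples
J(W, b) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}\left(\hat{y}^{(i)}, y^{(i)}\right)

% Cost with the L2 penalty added
J_{\text{reg}}(W, b) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}\left(\hat{y}^{(i)}, y^{(i)}\right)
    + \frac{\lambda}{2m} \sum_{l=1}^{L} \left\lVert W^{[l]} \right\rVert_F^2

% Squared Frobenius norm: the sum of the squares of all entries of W^{[l]}
\left\lVert W^{[l]} \right\rVert_F^2
    = \sum_{j=1}^{n^{[l]}} \sum_{k=1}^{n^{[l-1]}} \left( w_{jk}^{[l]} \right)^2
```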

With L2 regularisation, each gradient descent update reduces the weights not only through the learning rate and the backpropagated gradient but also through an additional term that includes the regularisation hyperparameter λ (lambda). The larger λ is, the smaller w becomes.
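The update can be written out as a short derivation, assuming the penalty term is (λ/2m) times the sum of squared Frobenius norms and that J denotes the unregularised cost, so that ∂J/∂W^{[l]} is the gradient that comes from backpropagation alone.

```latex
% Gradient of the regularised cost with respect to W^{[l]}
dW^{[l]} = \frac{\partial J}{\partial W^{[l]}} + \frac{\lambda}{m} W^{[l]}

% Gradient descent update with learning rate \alpha
W^{[l]} := W^{[l]} - \alpha \, dW^{[l]}
        = W^{[l]} - \frac{\alpha \lambda}{m} W^{[l]} - \alpha \frac{\partial J}{\partial W^{[l]}}
        = \left( 1 - \frac{\alpha \lambda}{m} \right) W^{[l]} - \alpha \frac{\partial J}{\partial W^{[l]}}
```

For typical values the factor (1 - αλ/m) is slightly below 1, so every update shrinks the weights a little before the usual gradient step is applied, which is where the name "weight decay" comes from.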

How does regularisation prevent overfitting?

We see that L2 regularisation uses the λ penalty to reduce the weights w, but how does this reduce variance and prevent overfitting of the model?

λ rises, w falls, changing the magnitude of z

If w is small, the magnitude of z drops too: a large positive z becomes smaller and a large negative z becomes larger, both moving towards 0. When z is then passed through the activation function, the effect is more linear (the tanh curve is close to linear near 0). A network whose units all behave almost linearly is closer to a linear model, which cannot fit very complex decision boundaries, so it is less prone to overfitting.
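A small numerical check makes this concrete. The sketch below assumes a single input x = 2.0 and no bias term, so z = w * x; both choices, along with the helper name, are purely illustrative. It measures how far tanh(z) is from the straight line y = z as w shrinks.

```python
import math

def tanh_linearity_gap(z):
    """Absolute difference between tanh(z) and the straight line y = z."""
    return abs(math.tanh(z) - z)

x = 2.0  # illustrative input value
for w in (1.5, 0.5, 0.1, 0.01):
    z = w * x  # simplified pre-activation with no bias
    print(
        f"w={w:<5} z={z:<5} tanh(z)={math.tanh(z):.4f} "
        f"gap from linear={tanh_linearity_gap(z):.6f}"
    )
```

As w gets smaller the gap shrinks towards zero, meaning the unit is operating in the nearly linear part of the tanh curve.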
