Lecture Notes in Deep Learning: Loss and Optimization — Part 1


Classification and Regression Losses

Jun 24 · 11 min read


Deep Learning at FAU. Image under CC BY 4.0 from the Deep Learning Lecture

These are the lecture notes for FAU’s YouTube Lecture “Deep Learning”. This is a full transcript of the lecture video & matching slides. We hope you enjoy this as much as the videos. Of course, this transcript was created with deep learning techniques largely automatically and only minor manual modifications were performed. If you spot mistakes, please let us know!

Navigation

Previous Lecture / Watch this Video / Top Level / Next Lecture

Welcome everybody to deep learning! So, today we want to continue talking about the different losses and optimization. We want to go ahead and talk a bit more about the details of these interesting problems. Let’s first talk about the loss functions. Loss functions are generally used for different tasks, and for different tasks you have different loss functions.


Classification and regression are two common tasks for deep learning. Image under CC BY 4.0 from the Deep Learning Lecture .

The two most important tasks that we are facing are regression and classification. So in classification, you want to estimate a discrete variable for every input. This means that you want to essentially decide, in this two-class problem here on the left, whether a point belongs to the blue or the red dots. So, you need to model a decision boundary.

In regression, the idea is that you want to model a function that explains your data. So, you have some input variable, let’s say x₂, and you want to predict x₁ from it. To do so, you compute a function that will produce the appropriate value of x₁ for any given x₂. Here in this example, you can see this is a line fit.
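
To make such a line fit concrete, here is a minimal NumPy sketch using ordinary least squares; the data and the variable names x2 and x1 are made up for illustration and are not from the lecture.

```python
import numpy as np

# Illustrative data: x2 is the input, x1 is the value we want to predict.
rng = np.random.default_rng(0)
x2 = np.linspace(0.0, 1.0, 50)
x1 = 2.0 * x2 + 0.5 + rng.normal(scale=0.1, size=x2.shape)

# Fit the line x1 = a * x2 + b by ordinary least squares.
A = np.stack([x2, np.ones_like(x2)], axis=1)
(a, b), *_ = np.linalg.lstsq(A, x1, rcond=None)
print(f"fitted line: x1 = {a:.2f} * x2 + {b:.2f}")
```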


The difference between the loss function and the last activation function is important. Image under CC BY 4.0 from the Deep Learning Lecture .

We talked about activation functions, the last activation being the softmax, and the cross-entropy loss. Somehow, we combined them, and obviously there’s a difference between the last activation function in our network and the loss function. The last activation function is applied to each individual sample x of the batch. It will also be present at training and testing time. So, the last activation function will become part of the network and will remain there to produce the output/the prediction. It generally produces a vector.

Now, the loss function combines all M samples and their labels. In their combination, they produce a loss that describes how good the fit is. The loss is generally a scalar value, and it is only needed during training time.
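
As a small sketch of this split (illustrative code, not the lecture’s implementation): the softmax below acts on every individual sample and stays part of the network at training and test time, while the loss collapses the whole batch of M samples into a single scalar that is only needed for training.

```python
import numpy as np

def softmax(z):
    # Last activation: applied to each individual sample, at training and test time.
    z = z - z.max(axis=1, keepdims=True)          # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(y_hat, y):
    # Loss: combines all M samples and labels into one scalar (training only).
    return -np.sum(y * np.log(y_hat + 1e-12)) / y.shape[0]

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 1.2,  0.3]])             # M = 2 samples, 3 classes
labels = np.array([[1, 0, 0],
                   [0, 1, 0]])                    # one-hot ground truth

y_hat = softmax(logits)                           # one probability vector per sample
print(cross_entropy(y_hat, labels))               # a single scalar
```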


Most training losses are related to a probabilistic interpretation using maximum likelihood estimation. Image under CC BY 4.0 from the Deep Learning Lecture .

Interestingly, many of those loss functions can be put in a probabilistic framework. This leads us to maximum likelihood estimation. In maximum likelihood estimation — just as a reminder — we consider everything to be probabilistic. So, we have a set of observations X that consists of individual observations. Then, we have associated labels that also stem from some distribution; they are denoted as Y. Of course, we need a conditional probability density function that describes how y and x are related. In particular, we can compute the probability for y given some observation x. This will be very useful, for example, if we want to decide on a specific class. Now, we have to somehow model this data set. The samples are drawn from some distribution, and the joint probability for the given data set can then be computed as a product over the individual conditional probabilities. Of course, if they’re independent and identically distributed, you can simply write this up as a large product over the entire training data set. So, you end up with this product over all M samples, where it’s just a product of the conditionals. This is useful because we can determine the best parameters by maximizing the joint probability over the entire training data set. We have to do it by evaluating this large product.
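
Written out compactly (with w denoting the model parameters, which is an assumption on the lecture’s notation), the argument above reads:

```latex
% Joint likelihood of the i.i.d. training set and the maximum likelihood estimate
L(\mathbf{w}) = p(Y \mid X, \mathbf{w})
             = \prod_{m=1}^{M} p(y_m \mid x_m, \mathbf{w}),
\qquad
\hat{\mathbf{w}} = \arg\max_{\mathbf{w}} L(\mathbf{w})
```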


The negative log likelihood converts the product into a sum and the maximisation into a minimisation. Image under CC BY 4.0 from the Deep Learning Lecture .

Now, this large product has a couple of problems. In particular, if we multiply many high and low values, we quickly run into numerical problems such as underflow. So, it may be interesting to transform the entire problem into the logarithmic domain. Because the logarithm is a monotonic transformation, it doesn’t change the position of the maximum. Hence, we can use the log function and a negative sign to flip the maximization into a minimization. Instead of looking at the likelihood function, we can look at the negative log-likelihood function. Then, our large product is suddenly a sum over all the observations of the negative logarithm of the conditional probabilities.
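
A quick numerical illustration of why the logarithmic domain helps (the per-sample probabilities below are made up, just to show the underflow):

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.uniform(0.05, 0.4, size=2000)    # made-up conditionals p(y_m | x_m)

likelihood = np.prod(p)                  # underflows to 0.0 in double precision
neg_log_likelihood = -np.sum(np.log(p))  # a finite, perfectly usable number

print(likelihood)                        # 0.0
print(neg_log_likelihood)                # roughly 3000, numerically stable
```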


Let’s assume a univariate Gaussian model as statistical basis. Image under CC BY 4.0 from the Deep Learning Lecture .

Now, we can look at a univariate Gaussian model. So, now we are one-dimensional again, and we can model this with a normal distribution where we choose the output of our network as the expected value and 1/β as the variance (so β is the precision). If we do so, we can find the following formulation: the square root of β over the square root of 2π times the exponential of minus β times (the label minus the prediction) squared divided by 2.
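
In formula form, the density just described reads (writing ŷ(x, w) for the network output):

```latex
% Univariate Gaussian model: mean \hat{y}(x, \mathbf{w}), precision \beta (variance 1/\beta)
p(y \mid x, \mathbf{w}) = \frac{\sqrt{\beta}}{\sqrt{2\pi}}
    \exp\!\left( -\frac{\beta \, \bigl( y - \hat{y}(x, \mathbf{w}) \bigr)^2}{2} \right)
```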


In only a few steps, we can convert the log-likelihood of the Gaussian distribution into a loss. Image under CC BY 4.0 from the Deep Learning Lecture.

Okay so let’s go ahead and put this into our log-likelihood function. Remember this is really something, you should know in the oral exam. Everybody needs to know the normal distribution and everybody needs to be able to convert this kind of universe Gaussian distribution into a loss function. If we do so, you will see that we can use the logarithm. It comes in very handy because it allows us to split the product here. Then, we also see that the logarithm cancels out with the exponential function. We simply get this beta over 2 times y subscript m minus y hat subscript m to the power of 2. We can simplify the first term further by applying the logarithm and pulling out the square root 2 pi. Then, we see that the sum over the first two terms is not depending on m, so we can simply multiply by M in order to get rid of the sum and move the sum only to the last term.


Finally, we arrive at the L2-loss. Image under CC BY 4.0 from the Deep Learning Lecture .

Now, you can see that only the last part here actually depends on w. Everything else doesn’t even contain w. So, if we seek to optimize towards w, we can simply neglect the first two parts. Then, we end up only with the part here on the right-hand side. You see that if we now assume β to be 1, we end up exactly with 1/2 times the sum over the squared differences. This is nothing else than the squared L2 norm, i.e., the L2 loss. If you write it in vector notation, you end up with ½‖y − ŷ‖₂². Of course, this is equivalent to a multi-dimensional Gaussian distribution with uniform variance.


Other Lp norms are also possible as loss functions. Image under CC BY 4.0 from the Deep Learning Lecture.

Okay, so well there’s not just L2-losses. There’s also L1 losses. So, we can also replace those, and we will look at some properties of different L norms in a couple of videos as well. It’s generally a very nice approach and it corresponds to minimizing the expected misclassification probability. It may cause slow convergence, because they don’t penalize heavy misclassified probabilities, but they may be advantageous in extreme label noise.


For classification, we assume the categorical distribution. Image under CC BY 4.0 from the Deep Learning Lecture .

Now, let’s assume now let’s assume that we want to classify. Then, our network would provide us with some probabilistic output p . Let’s say, we classify only into two classes. Then, we can model this as a Bernoulli distribution where we have classes zero and one. Of course, the probability of the other class is simply one minus p . This then gives us the probability distribution p ʸ times (1 — p)¹⁻ʸ. Typically, we don’t have only two classes. This means we need to generalize to the multinulli or categorical distribution. Then y is typically modeled again one-hot encoded vector. We can then write down the categorical distribution as the product over all the classes of the probability for each class to the power of the ground truth label which is either zero or one.


An example of the categorical distribution. Image under CC BY 4.0 from the Deep Learning Lecture .

Let’s look at an example of a categorical distribution. The example that we want to take here is a Bernoulli trial a coin flip. We encode head as (1 0)ᵀ and tail as (0 1)ᵀ. Then, we have an unfair coin and this unfair coin prefers tails with a probability of 0.7. Its likelihood for heads is 0.3. Then, we observe the true label y as tails. Now, we can use the above equation and plug those observations in. This means we get 0.3 to the power of 0 and 0.7 to the power of 1. Something to the power of 0 always equals to 1. Then 0.7 to the power of 1 is of course 0.7. This gives us 0.7 and this then means that the probability to observe tails for our unfair coin is 70%.


With the softmax function, we can always convert arbitrarily scaled values into probabilities between 0 and 1. This enables the use of the categorical distribution as a loss function. Image under CC BY 4.0 from the Deep Learning Lecture.

We can always use the softmax function within the network to convert everything into probabilities. Now, we can look at how this behaves with our categorically distributed system. Here, we simply replace our conditional with the categorical distribution. This then gives us a negative log-likelihood function. Again, what we’re doing here is of high relevance for the oral exam. So everybody should be able to explain how to come from a probabilistic assumption to the respective loss function using the categorical distribution. So here, we again apply the negative log-likelihood. We plug in the definition of the categorical distribution, which is simply the product over all our y hat subscript k to the power of the ground truth label. This can be further simplified because the product can be converted into a sum by moving the logarithm inside. If we do so, you can see that the power of the ground truth label can actually be pulled in front of the logarithm. We see that we exactly end up with cross-entropy. Now, if you use the trick with the one-hot encoding again, you can see that we exactly end up with the cross-entropy loss, where we have the sum over the entire set of observations of the logarithm of the output at exactly the position where our ground truth label was 1. Hence, we neglect all the other terms in the sum over the classes.
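
Compactly, the derivation just described is (writing ŷ_{m,k} for the softmax output of sample m for class k):

```latex
% From the categorical negative log-likelihood to the cross-entropy loss
-\log L = -\sum_{m=1}^{M} \log \prod_{k=1}^{K} \hat{y}_{m,k}^{\,y_{m,k}}
        = -\sum_{m=1}^{M} \sum_{k=1}^{K} y_{m,k} \log \hat{y}_{m,k}
        = -\sum_{m=1}^{M} \log \hat{y}_{m,k_m^\ast},
\qquad \text{where } k_m^\ast \text{ is the class with } y_{m,k_m^\ast} = 1
```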


Cross-entropy and its relation to the KL Divergence. Image under CC BY 4.0 from the Deep Learning Lecture .

Interestingly, this can also be put in relation to the Kullback-Leibler (KL) divergence. The KL divergence is a very common construct that you find in many machine learning papers. Here, you can see the definition. We essentially have an integral over the entire domain of x. It integrates p(x) times the logarithm of p(x) divided by q(x). q(x) is the reference distribution that you want to compare to. Now, you can see that you can split the integral into two parts using the property of the logarithm. The part with the minus sign on the right-hand side is the cross-entropy, and the left-hand part is simply the negative entropy of p(x), which does not depend on our model q(x). So, we can see that minimizing the KL divergence is essentially identical to minimizing the cross-entropy: in order to minimize the KL divergence, we can minimize the cross-entropy. You should keep this in mind; this kind of relationship appears very often in machine learning papers. So you will find them easier to understand if you have these things in the back of your mind.
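
A small numerical check of this relation (toy distributions, not from the lecture): the KL divergence equals the cross-entropy minus the entropy of p, and the entropy of p does not depend on the model q.

```python
import numpy as np

def entropy(p):
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    return -np.sum(p * np.log(q))

def kl_divergence(p, q):
    return np.sum(p * np.log(p / q))

p = np.array([0.7, 0.2, 0.1])   # "true" distribution (e.g. the labels)
q = np.array([0.5, 0.3, 0.2])   # model distribution (e.g. a softmax output)

# KL(p || q) = cross-entropy(p, q) - entropy(p); since entropy(p) is constant
# with respect to q, minimizing the cross-entropy also minimizes the KL divergence.
print(kl_divergence(p, q))
print(cross_entropy(p, q) - entropy(p))
```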


Guidelines for using cross-entropy. Image under CC BY 4.0 from the Deep Learning Lecture .

Now, can we use cross-entropy for regression? Well, yes, we can do that of course. But you have to make sure that your predictions are going to be in the domain of [0, 1] for all of your outputs. You can, for example, do this with a sigmoid activation function. Then you have to be careful because in regression your targets are typically no longer one-hot encoded. So, this is something that you have to deal with appropriately. As seen before, this loss is equivalent to minimizing the KL divergence.
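
A minimal sketch of what this could look like for targets squashed into [0, 1] via a sigmoid (the numbers are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y, y_hat, eps=1e-12):
    # Works for targets y anywhere in [0, 1], not only for hard 0/1 labels.
    return -np.mean(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))

z = np.array([0.2, -1.5, 2.0])    # raw network outputs
y = np.array([0.3,  0.1, 0.9])    # regression targets scaled into [0, 1]

y_hat = sigmoid(z)                # predictions constrained to (0, 1)
print(binary_cross_entropy(y, y_hat))
```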


Summary of this unit. Image under CC BY 4.0 from the Deep Learning Lecture .

Let’s summarize what we’ve seen so far. So L2 loss is typically used for regression. Cross-entropy loss is typically used for classification typically in combination with one-hot encoding. Of course, you can derive them from ML estimators from strict probabilistic assumptions. So what we’re doing here is completely in line with probability theory. In the absence of more domain knowledge, these are our first choices. If you have additional domain knowledge then, of course, it’s a good idea to use it to build a better estimator. The cross-entropy loss is intrinsically multivariate. So, we are not just stuck with two-class problems. We can go to multi-dimensional regression and classification problems as well.


More exciting things coming up in this deep learning lecture. Image under CC BY 4.0 from the Deep Learning Lecture .

Next time in deep learning, we want to go into some more details about loss functions and, in particular, we want to highlight the hinge loss. It is a very important loss function because it allows you to embed constraints. We will see that there are also some relations to classical machine learning and pattern recognition, in particular, the support vector machine. So I hope you enjoyed this video and I am looking forward to seeing you in the next one!

If you liked this post, you can find more essays here , more educational material on Machine Learning here , or have a look at our Deep Learning Lecture . I would also appreciate a clap or a follow on YouTube , Twitter , Facebook , or LinkedIn in case you want to be informed about more essays, videos, and research in the future. This article is released under the Creative Commons 4.0 Attribution License and can be reprinted and modified if referenced.


