A Quick Guide to Activation Functions In Deep Learning



A quick discussion of all the major activation functions in deep learning.

Jul 26 · 10 min read


In this article, I am going to give a quick overview of some of the most common activation functions used in neural networks. Before discussing them, let's cover a few basic concepts related to activation functions.

Content-

  1. Why do we need activation functions?
  2. Why do we always choose non-linear activation functions in a neural network?
  3. Universal Approximation Theorem
  4. Vanishing and Exploding Gradient Problems
  5. Activation Functions-
  • Sigmoid Function
  • Softmax Function
  • Hyperbolic Tangent (tanh)
  • Relu
  • Leaky Relu
  • Parameterized Relu
  • Swish
  • ELU
  • Softplus and Softsign
  • Selu
  • Linear Activation Function
  6. How to decide which activation function should be used?
  7. Conclusion
  8. Credits

Why do we need activation functions?

As we know, an artificial neuron takes inputs and weights, computes the weighted sum of the inputs, and then passes it to an activation function that converts it into the output. So an activation function is basically used to map the input to the output. This activation function helps a neural network learn complex relationships and patterns in the data. Now the question is: what if we don't use any activation function and let a neuron output the weighted sum of its inputs as it is? In that case computation becomes very difficult, because the weighted sum of the inputs has no fixed range and, depending on the input, it can take any value. Hence one important use of an activation function is to keep the output restricted to a particular range. Another use of an activation function is to add non-linearity. We always choose non-linear functions as activation functions. Let's see why this is important.
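For illustration, here is a minimal NumPy sketch of a single neuron with made-up inputs, weights, and bias, using sigmoid as the activation:

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# A single artificial neuron: weighted sum of inputs plus bias,
# followed by a non-linear activation function.
x = np.array([0.5, -1.2, 3.0])   # inputs (arbitrary values)
w = np.array([0.4, 0.7, -0.2])   # weights (arbitrary values)
b = 0.1                          # bias

z = np.dot(w, x) + b             # weighted sum, can be any real value
a = sigmoid(z)                   # activation maps it into (0, 1)
print(z, a)                      # z ≈ -1.14, a ≈ 0.24
```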

Why do we always choose non-linear activation functions in a neural network?

Non-linearity means the output cannot be generated from a linear combination of the inputs. Non-linearity is important in neural networks because linear activation functions are not enough to form a universal function approximator. If we use linear activation functions in a deep neural network, then no matter how deep the network is, it is equivalent to a linear network with no hidden layers, because a composition of linear functions is just another linear function. So the whole network reduces to a single neuron with that combined linear function as its activation, and such a neuron cannot learn complex relationships in the data. Since most real-world problems are very complex, we need non-linear activation functions in a neural network. A neural network without non-linear activation functions is just a simple linear regression model.

However, in the final layer of the neural network, we can choose linear activation functions.
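This collapse is easy to verify numerically. Below is a small NumPy sketch (with arbitrary layer sizes and random values) showing that two stacked layers without any non-linearity in between are equivalent to one linear layer:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "linear" layers with no activation function in between.
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)

# Forward pass through both layers.
two_layers = W2 @ (W1 @ x + b1) + b2

# The same mapping collapsed into a single linear layer.
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))  # True
```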

Universal Approximation Theorem

Directly quoting from Wikipedia-

The Universal Approximation Theorem states that a feed-forward network constructed of artificial neurons can approximate arbitrarily well real-valued continuous functions on compact subsets of Rⁿ.

In short, this theorem says that a neural network can approximate any continuous function. Now the question is: what enables it to do so? The answer is the non-linearity of the activation functions.

Vanishing and Exploding Gradient Problems-

In neural networks, during backpropagation each weight receives an update proportional to the partial derivative of the error function with respect to that weight. In some cases this derivative term is so small that it makes the updates very small. Especially in the early layers of a deep network, the update is obtained by multiplying many partial derivatives together via the chain rule. If these partial derivatives are very small, the overall update becomes vanishingly small and approaches zero. In such a case the weights cannot update properly, and there is slow or no convergence. This problem is known as the vanishing gradient problem.

Similarly, if the derivative terms are very large, then the updates will also be very large. In such a case, the algorithm will overshoot the minimum and won't be able to converge. This problem is known as the exploding gradient problem.

There are various methods to avoid these problems, and choosing an appropriate activation function is one of them.
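For intuition, here is a toy NumPy sketch (ignoring weights entirely) showing how a chain of sigmoid derivatives, each at most 0.25, shrinks the gradient as the depth grows:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)   # never larger than 0.25

# Chain-rule style product of one derivative per layer, taken at z = 0
# where the sigmoid derivative is at its maximum value of 0.25.
for depth in (2, 5, 10, 20):
    grad = np.prod([sigmoid_derivative(0.0) for _ in range(depth)])
    print(depth, grad)
# 20 layers already give 0.25**20 ≈ 9e-13: the gradient has vanished.
```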

Activation Functions-

1. Sigmoid Function-

Sigmoid is an ‘S’ shaped mathematical function whose formula is-

σ(x) = 1 / (1 + e^(-x))

Source: Wikipedia

Here is a graph of the sigmoid function. You must have come across this function while learning logistic regression. Although the sigmoid function is very popular, it is not used much nowadays, for the reasons listed below.

[Graph of the sigmoid function (Source: Wikipedia)]

Pros-

  1. The sigmoid function is continuous and differentiable.
  2. It limits the output to the range 0 to 1.
  3. It gives clear predictions for binary classification.

Cons-

  1. It can cause the vanishing gradient problem.
  2. It is not centered around zero.
  3. It is computationally expensive.

Example-

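A minimal sketch (assuming TensorFlow 2.x; the input values are arbitrary):

```python
import tensorflow as tf

x = tf.constant([-20.0, -1.0, 0.0, 1.0, 20.0])
print(tf.keras.activations.sigmoid(x).numpy())
# [~0.0, 0.269, 0.5, 0.731, ~1.0]: every input is squashed into (0, 1)
```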

2. Softmax Function-

The softmax function is a generalization of the sigmoid function to the multi-class setting. It is popularly used in the final layer of multi-class classification. It takes a vector of k real numbers and normalizes it into a probability distribution of k probabilities proportional to the exponentials of the input numbers. Before applying softmax, some vector components could be negative or greater than one and might not sum to 1, but after applying softmax each component will be in the range 0 to 1 and all components will sum to 1, so they can be interpreted as probabilities.

softmax(x)_i = e^(x_i) / Σ_j e^(x_j)

Source: Wikipedia

Pros-

It can be used for multi-class classification and hence is used in the output layer of neural networks.

Cons-

It is computationally expensive, as we have to calculate many exponential terms.

Example-

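A minimal sketch (assuming TF 2.x; the logits are arbitrary):

```python
import tensorflow as tf

logits = tf.constant([[2.0, 1.0, 0.1]])
probs = tf.keras.activations.softmax(logits)
print(probs.numpy())        # [[0.659, 0.242, 0.099]]
print(probs.numpy().sum())  # 1.0, so the outputs behave like probabilities
```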

3. Hyperbolic Tangent (tanh) -

Hyperbolic Tangent or in short ‘tanh’ is represented by-

tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x)) = 2·σ(2x) - 1

It is very similar to the sigmoid function. It is centered at zero and has a range between -1 and +1.

[Graph of the tanh function (Source: Wikipedia)]

Pros-

  1. It is continuous and differentiable everywhere.
  2. It is centered around zero.
  3. It limits the output to the range -1 to +1.

Cons-

  1. It can cause the vanishing gradient problem.
  2. Computationally expensive.

Example-

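A minimal sketch (assuming TF 2.x; arbitrary inputs):

```python
import tensorflow as tf

x = tf.constant([-3.0, -1.0, 0.0, 1.0, 3.0])
print(tf.keras.activations.tanh(x).numpy())
# [-0.995, -0.762, 0., 0.762, 0.995]: zero-centered and bounded in (-1, 1)
```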

4. Relu-

The Rectified Linear Unit, often called simply the rectifier or relu, is defined as-

relu(x) = max(0, x)

[Graph of the relu function]

Pros-

  1. Easy to compute.
  2. Does not cause the vanishing gradient problem.
  3. Since not all neurons are activated at the same time, it creates sparsity in the network, which makes it fast and efficient.

Cons-

  1. Can cause the exploding gradient problem.
  2. Not zero-centered.
  3. Can kill some neurons forever, since it always outputs 0 for negative inputs.

Example-

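A minimal sketch (assuming TF 2.x):

```python
import tensorflow as tf

x = tf.constant([-5.0, -1.0, 0.0, 1.0, 5.0])
print(tf.keras.activations.relu(x).numpy())
# [0., 0., 0., 1., 5.]: negative inputs are clipped to zero
```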

To overcome the exploding gradient problem with the relu activation, we can set a saturation threshold, i.e. the maximum value the function will return.

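A minimal sketch of a capped relu (assuming TF 2.x, where the relu activation exposes a max_value argument):

```python
import tensorflow as tf

x = tf.constant([-5.0, 2.0, 8.0, 20.0])
# max_value caps the output, giving a saturated relu (relu6-style when max_value=6)
print(tf.keras.activations.relu(x, max_value=6.0).numpy())
# [0., 2., 6., 6.]
```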

5. Leaky Relu-

Leaky relu is an improvement over the relu function. Relu can kill some neurons in every iteration; this is known as the dying relu problem. Leaky relu overcomes this: instead of outputting 0 for negative values, it uses a relatively small slope on the negative side to compute the output, so it never kills any neuron.

leaky_relu(x) = x if x > 0, else α·x, where α is a small constant such as 0.01

Source: Wikipedia

[Graph of the leaky relu function]

Pros-

  1. Easy to compute.
  2. Does not cause the vanishing gradient problem.
  3. Does not cause the dying relu problem.

Cons-

  1. Can cause the exploding gradient problem.
  2. Not zero-centered.

Example-

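A minimal sketch (assuming TF 2.x; the alpha argument of the relu activation provides the leaky slope for negative inputs, and 0.1 is an arbitrary choice):

```python
import tensorflow as tf

x = tf.constant([-10.0, -1.0, 0.0, 5.0])
# alpha is the slope used for negative inputs
print(tf.keras.activations.relu(x, alpha=0.1).numpy())
# [-1., -0.1, 0., 5.]
```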

6. Parameterized Relu-

In parameterized relu, instead of fixing the slope for the negative axis, the slope is passed as a new trainable parameter that the network learns on its own to achieve faster convergence.

prelu(x) = x if x > 0, else α·x, where α is a trainable parameter

Source: Wikipedia

Pros-

  1. The network learns the most appropriate value of alpha on its own.
  2. Does not cause the vanishing gradient problem.

Cons-

  1. Difficult to compute.
  2. Performance depends upon the problem.

In TensorFlow, parameterized relu is implemented as a layer rather than a plain activation function. Example-

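A minimal sketch using Keras's built-in PReLU layer (assuming TF 2.x; the layer sizes are arbitrary):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, input_shape=(10,)),
    tf.keras.layers.PReLU(),   # alpha is a trainable parameter learned during training
    tf.keras.layers.Dense(1),
])
model.summary()
```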

7. Swish-

The swish function is obtained by multiplying x by the sigmoid of x.

swish(x) = x · σ(x) = x / (1 + e^(-x))

Source: Wikipedia

[Graph of the swish function]

The swish function was proposed by the Google Brain team. Their experiments show that swish tends to work better than relu on deeper models across several challenging datasets.

Pros-

  1. Does not cause the vanishing gradient problem.
  2. Has been shown to perform slightly better than relu.

Cons-

Computationally expensive.
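Example-

A minimal sketch computing swish directly from its definition (assuming TF 2.x, which also ships a built-in swish activation in recent versions):

```python
import tensorflow as tf

x = tf.constant([-5.0, -1.0, 0.0, 1.0, 5.0])
# swish(x) = x * sigmoid(x)
print((x * tf.keras.activations.sigmoid(x)).numpy())
# [-0.033, -0.269, 0., 0.731, 4.967]
```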

8. ELU-

The Exponential Linear Unit (ELU) is another variant of relu that tries to make the activations closer to zero, which speeds up learning. It has shown better classification accuracy than relu. ELUs produce negative values, which push the mean of the activations closer to zero.

elu(x) = x if x > 0, else α·(e^x - 1)

Source: Wikipedia

Pros-

  1. Does not cause the dying relu problem.

Cons-

  1. Computationally expensive.
  2. Does not avoid the exploding gradient problem.
  3. The alpha value has to be chosen manually.
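Example-

A minimal sketch (assuming TF 2.x and the default alpha of 1.0):

```python
import tensorflow as tf

x = tf.constant([-5.0, -1.0, 0.0, 1.0, 5.0])
print(tf.keras.activations.elu(x).numpy())
# [-0.993, -0.632, 0., 1., 5.]: negative values saturate smoothly towards -alpha
```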

9. Softplus and Softsign-

The softplus function is-

softplus(x) = ln(1 + e^x)

Source: Wikipedia

Its derivative is the sigmoid function.

The softsign function is-

softsign(x) = x / (1 + |x|)

Softplus and softsign are not used much, and generally relu and its variants are preferred over them.

Pros-

  1. Does not cause the vanishing gradient problem.

Cons-

  1. Computationally expensive.
  2. Slower than Relu.

Example-

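A minimal sketch of both functions (assuming TF 2.x; arbitrary inputs):

```python
import tensorflow as tf

x = tf.constant([-5.0, -1.0, 0.0, 1.0, 5.0])
print(tf.keras.activations.softplus(x).numpy())  # [0.007, 0.313, 0.693, 1.313, 5.007]
print(tf.keras.activations.softsign(x).numpy())  # [-0.833, -0.5, 0., 0.5, 0.833]
```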

10. Selu-

Selu stands for Scaled Exponential Linear Unit. It is defined as-

selu(x) = scale · x if x > 0, else scale · α · (e^x - 1)

where alpha and scale are constants whose values are approximately 1.6733 and 1.0507 respectively. The values of alpha and scale are chosen so that the mean and variance of the inputs are preserved between two consecutive layers, as long as the weights are initialized correctly (for example with LeCun normal initialization). Selu has been shown to perform better than relu and has the following advantages.

Pros-

  1. Does not cause the vanishing gradient problem.
  2. Does not cause the dying relu problem.
  3. Often converges faster and performs better than other activation functions.

Example-

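A minimal sketch (assuming TF 2.x; the Dense layer illustrates the usual pairing with lecun_normal initialization, and the sizes are arbitrary):

```python
import tensorflow as tf

# SELU is self-normalizing when paired with LeCun normal initialization.
layer = tf.keras.layers.Dense(
    64, activation="selu", kernel_initializer="lecun_normal", input_shape=(10,)
)

x = tf.constant([-5.0, -1.0, 0.0, 1.0, 5.0])
print(tf.keras.activations.selu(x).numpy())
# [-1.746, -1.111, 0., 1.051, 5.254]
```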

11. Linear Activation-

As we discussed earlier, we should use non-linear activation functions in the hidden layers of a neural network. However, in the final layer for a regression problem, we can use a linear activation function.

Example-

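A minimal sketch of a regression model with a linear output layer (assuming TF 2.x; the layer sizes are arbitrary):

```python
import tensorflow as tf

# Regression model: non-linear hidden layer, linear (identity) output layer.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(1, activation="linear"),  # same as using no activation
])
model.compile(optimizer="adam", loss="mse")
```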

How to decide which activation function should be used?

  • Sigmoid and tanh should be avoided in hidden layers because of the vanishing gradient problem.
  • Softplus and softsign should also be avoided, as relu is a better choice.
  • Relu should be preferred for hidden layers. If it causes the dying relu problem, then its modifications like leaky relu, ELU, SELU, etc. should be used.
  • For deep networks, swish performs better than relu.
  • For the final layer, a linear function is the right choice for regression, sigmoid for binary classification, and softmax for multi-class classification. The same concept applies to autoencoders.

Conclusion-

We have discussed all the popular activation functions along with their pros and cons. There are many more activation functions, but we don't use them very often, and we can also define our own. Some of the activation functions discussed here are rarely used for solving real-world problems and are covered mainly for completeness. Most of the time we use relu or one of its variants for the hidden layers, and softmax or a linear function for the final layer, depending on the type of problem.


That's all from my side. Thanks for reading this article. Sources for a few of the images used are mentioned; the rest are my own creation. Feel free to post comments and suggest corrections and improvements. Connect with me on LinkedIn or mail me at sahdevkansal02@gmail.com. I look forward to hearing your feedback. Check out my Medium profile for more such articles.

