A Quick Guide to Activation Functions In Deep Learning
A quick discussion of all the major activation functions in deep learning.
Jul 26 · 10 min read
In this article, I am going to give a quick overview of some of the most common activation functions used in neural networks. But before discussing them, let’s cover some basic concepts related to activation functions.
Contents-
1. Why do we need activation functions?
2. Why do we always choose non-linear activation functions in a neural network?
3. Universal Approximation Theorem
4. Vanishing and Exploding Gradient Problems
5. Activation Functions-
- Sigmoid Function
- Softmax Function
- Hyperbolic Tangent (tanh)
- Relu
- Leaky Relu
- Parameterized Relu
- Swish
- ELU
- Softplus and Softsign
- Selu
- Gelu
- Linear Activation Function
6. How to decide which activation function should be used?
7. Conclusion
8. Credits
Why do we need activation functions?
In an artificial neuron, the inputs are multiplied by weights and a weighted sum is computed; this sum is then passed through an activation function that converts it into the output. So, basically, an activation function maps the input to the output. This activation function helps a neural network learn complex relationships and patterns in data. Now the question is: what if we don’t use any activation function and let a neuron output the weighted sum of its inputs as it is? In that case, computation becomes difficult because the weighted sum has no bounded range and, depending on the input, can take any value. Hence, one important use of an activation function is to keep the output restricted to a particular range. Another use of an activation function is to add non-linearity. We always choose non-linear functions as activation functions. Let’s see why this is important.
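As a rough sketch of this idea, here is a minimal NumPy example of a single artificial neuron. The inputs, weights, and bias are made up, and sigmoid is an arbitrary illustrative choice of activation:

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative inputs and weights (arbitrary values chosen for the example).
x = np.array([0.5, -1.2, 3.0])   # inputs
w = np.array([0.4, 0.7, -0.2])   # weights
b = 0.1                          # bias

z = np.dot(w, x) + b             # weighted sum of inputs (unbounded)
a = sigmoid(z)                   # activation maps it to a bounded output
print(z, a)                      # z can be any value, a is always in (0, 1)
```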
Why do we always choose non-linear activation functions in a neural network?
Non-linearity means the output cannot be produced by a linear combination of the inputs. Non-linearity is important in neural networks because linear activation functions are not enough to form a universal function approximator. If we use linear activation functions in a deep neural network, then no matter how deep the network is, it will be equivalent to a linear network with no hidden layers, because a composition of linear functions is itself a single linear function. So our whole network reduces to a single neuron with that combined linear function as its activation, and a single neuron cannot learn complex relationships in data. Since most real-world problems are very complex, we need non-linear activation functions in a neural network. A neural network without non-linear activation functions is just a simple linear regression model.
However, in the final layer of the neural network, we can choose linear activation functions.
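Here is a minimal NumPy sketch (with arbitrary random weights and layer sizes) of why this collapse happens: two stacked layers with linear activations are exactly equivalent to one linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4,))        # an arbitrary input vector

W1 = rng.normal(size=(3, 4))     # "hidden" layer weights
W2 = rng.normal(size=(2, 3))     # "output" layer weights

# Two stacked layers with identity (linear) activations...
h = W1 @ x
y = W2 @ h

# ...are exactly equivalent to a single layer with weights W2 @ W1.
W_combined = W2 @ W1
y_single = W_combined @ x

print(np.allclose(y, y_single))  # True: the extra depth added no expressive power
```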
Universal Approximation Theorem
Directly quoting from Wikipedia-
The Universal Approximation Theorem states that a feed-forward network constructed of artificial neurons can approximate real-valued continuous functions on compact subsets of Rⁿ arbitrarily well.
In short, this theorem says a neural network can learn any continuous function. Now the question is what enables it to do so. The answer is the non-linearity of the activation functions.
Vanishing and Exploding Gradient Problems-
In neural networks, during backpropagation each weight receives an update proportional to the partial derivative of the error function with respect to that weight. In some cases, this derivative term is so small that the updates become very small. Especially in the early layers of a deep network, the update is obtained by multiplying many partial derivatives together (the chain rule). If these partial derivatives are very small, the overall update becomes vanishingly small and approaches zero. In such a case, the weights cannot update effectively, and convergence is slow or never happens. This problem is known as the vanishing gradient problem.
Similarly, if the derivative terms are very large, the updates will also be very large. In such a case, the algorithm overshoots the minimum and fails to converge. This problem is known as the exploding gradient problem.
There are various methods to avoid these problems, and choosing an appropriate activation function is one of them.
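As a toy illustration of the vanishing case (using NumPy, with an arbitrary chain of 20 "layers"), here is how multiplying many small sigmoid derivatives shrinks a gradient toward zero:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)              # its maximum value is only 0.25 (at z = 0)

# Pretend each of 20 stacked layers contributes one sigmoid derivative
# to the chain rule; even in the best case the product collapses.
grad = 1.0
for layer in range(20):
    grad *= sigmoid_derivative(0.0)   # 0.25 per layer, the largest possible

print(grad)   # ~9e-13: updates to the earliest layers effectively vanish
```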
Activation Functions-
1. Sigmoid Function-
Sigmoid is an ‘S’-shaped mathematical function whose formula is sigmoid(x) = 1 / (1 + e⁻ˣ).
The graph of the sigmoid function has a characteristic ‘S’ shape. You must have come across this function while learning logistic regression. Although the sigmoid function is very popular, it is not used much anymore; its pros and cons are listed below.
Pros-
- The sigmoid function is continuous and differentiable everywhere.
- It limits the output to the range 0 to 1.
- Its output can be interpreted as a probability, which gives very clear predictions for binary classification.
Cons-
- It can cause the vanishing gradient problem.
- It’s not centered around zero.
- Computationally expensive, because of the exponential term.
Example-
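A minimal sketch in NumPy (the inputs are arbitrary); in Keras the same function is available via the activation string "sigmoid":

```python
import numpy as np

def sigmoid(x):
    # 1 / (1 + e^-x): squashes any real value into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x))   # [0.119..., 0.5, 0.880...]

# In Keras, this is typically the final layer of a binary classifier:
# Dense(1, activation="sigmoid")
```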
2. Softmax Function-
The softmax function is a generalization of the sigmoid function to the multi-class setting. It is popularly used in the final layer of multi-class classification. It takes a vector of ‘k’ real numbers and normalizes it into a probability distribution of ‘k’ probabilities proportional to the exponentials of the input numbers. Before applying softmax, some vector components could be negative or greater than one and might not sum to 1, but after applying softmax each component will be in the range 0–1 and all components will sum to 1, so they can be interpreted as probabilities.
Pros-
- It can be used for multi-class classification and hence is used in the output layer of neural networks.
Cons-
- It is computationally expensive, as we have to calculate a lot of exponent terms.
Example-
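A small NumPy sketch of softmax (the inputs are arbitrary raw scores; the maximum is subtracted before exponentiation purely for numerical stability and does not change the result):

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the output is unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([2.0, 1.0, -1.0, 3.0])   # arbitrary raw scores ("logits")
probs = softmax(logits)

print(probs)          # every component is in (0, 1)
print(probs.sum())    # 1.0: the outputs form a probability distribution
# In Keras: Dense(num_classes, activation="softmax") as the final layer.
```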
3. Hyperbolic Tangent (tanh) -
Hyperbolic Tangent, or in short ‘tanh’, is represented by tanh(x) = (eˣ − e⁻ˣ) / (eˣ + e⁻ˣ).
It is very similar to the sigmoid function. It is centered at zero and has a range between -1 and +1.
Pros-
- It is continuous and differentiable everywhere.
- It is centered around zero.
- It will limit output in a range of -1 to +1.
Cons-
- It can cause the vanishing gradient problem.
- Computationally expensive.
Example-
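A quick NumPy sketch (with arbitrary inputs) showing that tanh is zero-centered and bounded between -1 and +1:

```python
import numpy as np

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
y = np.tanh(x)                      # (e^x - e^-x) / (e^x + e^-x)

print(y)   # values lie in (-1, 1), and tanh(0) = 0 (zero-centered)
# In Keras: Dense(units, activation="tanh").
```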
4. Relu-
The Rectified Linear Unit, often called just a rectifier or relu, is defined as relu(x) = max(0, x).
Pros-
- Easy to compute.
- Does not cause the vanishing gradient problem.
- As all neurons are not activated, this creates sparsity in the network and hence it will be fast and efficient.
Cons-
- Can cause the exploding gradient problem.
- Not zero-centered.
- Can kill some neurons forever as it always gives 0 for negative values.
Example-
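A minimal NumPy sketch of relu (the inputs are arbitrary):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)       # max(0, x), applied element-wise

x = np.array([-3.0, -0.5, 0.0, 2.0, 7.0])
print(relu(x))   # negative values are zeroed out, positive values pass through
# In Keras: Dense(units, activation="relu").
```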
To overcome the exploding gradient problem with the relu activation, we can set a saturation threshold, i.e., the maximum value the function will return.
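A sketch of such a capped relu using the Keras max_value argument (the cap of 6.0 here is an arbitrary illustrative choice):

```python
import tensorflow as tf

x = tf.constant([-2.0, 3.0, 10.0])

# Functional form: relu with an upper saturation threshold.
print(tf.keras.activations.relu(x, max_value=6.0).numpy())   # [0. 3. 6.]

# Equivalent layer form, usable inside a model.
capped = tf.keras.layers.ReLU(max_value=6.0)
print(capped(x).numpy())                                      # [0. 3. 6.]
```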
5. Leaky Relu-
Leaky relu is an improvement over the relu function. Relu can kill some neurons in every iteration; this is known as the dying relu problem. Leaky relu overcomes this problem: instead of outputting 0 for negative values, it uses a small fraction of the input (for example 0.01·x) to compute the output, so it never kills any neuron.
Pros-
- Easy to compute.
- Does not cause the vanishing gradient problem.
- Does not cause the dying relu problem.
Cons-
- Can cause the exploding gradient problem.
- Not zero-centered.
Example-
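A minimal NumPy sketch of leaky relu with a slope of 0.01 for the negative part (a common choice; the Keras LeakyReLU layer has its own default slope):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # Negative inputs are scaled by a small slope instead of being zeroed.
    return np.where(x > 0, x, alpha * x)

x = np.array([-4.0, -1.0, 0.0, 2.0])
print(leaky_relu(x))   # [-0.04 -0.01  0.  2.] -- no neuron is completely killed
# In Keras, tf.keras.layers.LeakyReLU can be added after a Dense layer.
```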
6. Parameterized Relu-
In parameterized relu, instead of fixing the slope for the negative axis, the slope is a trainable parameter that the network learns on its own, which can lead to faster convergence.
Pros-
- The network learns the most appropriate value of alpha (the negative-axis slope) on its own.
- Does not cause the vanishing gradient problem.
Cons-
- Slightly more expensive to train, since it adds extra trainable parameters.
- Performance depends upon the problem.
In TensorFlow, parameterized relu is available as a separate layer (tf.keras.layers.PReLU) rather than as a built-in activation string. Example-
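A sketch of using the Keras PReLU layer inside a small model (the input dimension and layer sizes here are arbitrary); the negative-axis slope is a weight of the PReLU layer and is learned during training:

```python
import tensorflow as tf

inputs = tf.keras.Input(shape=(10,))           # 10 input features (arbitrary)
x = tf.keras.layers.Dense(16)(inputs)          # no fixed activation here...
x = tf.keras.layers.PReLU()(x)                 # ...PReLU supplies a learnable slope
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()   # the PReLU layer adds its own trainable parameters (the slopes)
```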
7. Swish-
The swish function is obtained by multiplying x by the sigmoid function: swish(x) = x · sigmoid(x).
The swish function was proposed by Google’s Brain team. Their experiments show that swish tends to work better than relu on deep models across several challenging datasets. A small sketch is given after the pros and cons below.
Pros-
- Does not cause the vanishing gradient problem.
- Shown in the authors’ experiments to work slightly better than relu.
Cons-
- Computationally expensive.
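A minimal NumPy sketch of swish = x · sigmoid(x) (the inputs are arbitrary); recent TensorFlow versions also ship it as tf.keras.activations.swish:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x):
    return x * sigmoid(x)     # smooth, slightly non-monotonic near zero, unbounded above

x = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
print(swish(x))
```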
8. ELU-
Exponential Linear Unit (ELU) is another variation of relu that tries to bring the mean activation closer to zero, which speeds up learning. Unlike relu, it produces negative values for negative inputs, and these negative values push the mean of the activations closer to zero. It has shown better classification accuracy than relu. A small sketch is given after the pros and cons below.
Pros-
- Does not cause the dying relu problem.
Cons-
- Computationally expensive.
- Does not avoid the exploding gradient problem.
- The alpha value needs to be chosen by hand.
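A small NumPy sketch of ELU with alpha = 1.0 (the usual default); the Keras equivalents are activation="elu" or the tf.keras.layers.ELU layer:

```python
import numpy as np

def elu(x, alpha=1.0):
    # x for positive inputs, alpha * (e^x - 1) for negative inputs.
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-3.0, -1.0, 0.0, 2.0])
print(elu(x))   # negative outputs approach -alpha, pulling the mean toward zero
```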
9. Softplus and Softsign-
The softplus function is softplus(x) = ln(1 + eˣ).
Its derivative is the sigmoid function.
The softsign function is softsign(x) = x / (1 + |x|).
Softplus and softsign are not used much; generally, relu and its variants are preferred over them.
Pros-
- Does not cause the vanishing gradient problem.
Cons-
- Computationally expensive.
- Slower than Relu.
Example-
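A small NumPy sketch of both functions (the inputs are arbitrary); Keras also provides them as activation="softplus" and activation="softsign":

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))      # ln(1 + e^x), a smooth approximation of relu

def softsign(x):
    return x / (1.0 + np.abs(x))    # x / (1 + |x|), bounded between -1 and 1

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(softplus(x))
print(softsign(x))
```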
10. Selu-
Selu stands for Scaled Exponential Linear Unit. Selu is defined as selu(x) = scale · x for x > 0 and selu(x) = scale · alpha · (eˣ − 1) for x ≤ 0,
where alpha and scale are constants whose values are approximately 1.6733 and 1.0507 respectively. The values of alpha and scale are chosen so that the mean and variance of the inputs are preserved between two consecutive layers, as long as the weights are initialized correctly. Selu has been shown to work better than relu and has the following advantages.
Pros-
- Does not cause the vanishing gradient problem.
- Does not cause the dying relu problem.
- Reported to converge faster and perform better than other activation functions in the experiments that introduced it.
Example-
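A minimal NumPy sketch of selu using the approximate constants mentioned above; in Keras, activation="selu" is typically paired with the lecun_normal weight initializer:

```python
import numpy as np

ALPHA = 1.6733   # approximate constant from the SELU formulation
SCALE = 1.0507   # approximate constant from the SELU formulation

def selu(x):
    # scale * x for positive inputs, scale * alpha * (e^x - 1) for negative inputs.
    return SCALE * np.where(x > 0, x, ALPHA * (np.exp(x) - 1.0))

x = np.array([-2.0, -0.5, 0.0, 1.0])
print(selu(x))
```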
11. Linear Activation-
As we discussed earlier, we should use non-linear activation functions in a neural network. However, in the final layer of a neural network for a regression problem, we can use a linear activation function.
Example-
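A sketch of a small regression network whose final layer uses a linear activation (the hidden layer uses relu; the input size and layer widths are arbitrary):

```python
import tensorflow as tf

inputs = tf.keras.Input(shape=(8,))                         # 8 features (arbitrary)
x = tf.keras.layers.Dense(32, activation="relu")(inputs)    # non-linear hidden layer
outputs = tf.keras.layers.Dense(1, activation="linear")(x)  # unbounded output for regression

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")
```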
How to decide which activation function should be used?
- Sigmoid and tanh should generally be avoided in hidden layers due to the vanishing gradient problem.
- Softplus and Softsign should also be avoided as Relu is a better choice.
- Relu should be preferred for hidden layers. If it is causing the dying relu problem then its modifications like leaky relu, ELU, SELU, etc should be used.
- For deep networks, swish may perform better than relu.
- For the final layer, a linear function is the right choice for regression, sigmoid is the right choice for binary classification, and softmax is the right choice for multi-class classification. The same reasoning applies to the output layer of autoencoders.
Conclusion-
We have discussed all the popular activation functions along with their pros and cons. There are many more activation functions, but they are not used very often. We can also define our own activation functions. Some of the activation functions discussed here are rarely used for solving real-world problems and are included mainly for completeness. Most of the time we use relu and its variants for the hidden layers, and for the final layer we use softmax or a linear function depending on the type of problem.
Credits-
- https://keras.io/api/layers/activations/#layer-activation-functions
- https://www.analyticsvidhya.com/blog/2020/01/fundamentals-deep-learning-activation-functions-when-to-use-them/
- https://towardsdatascience.com/everything-you-need-to-know-about-activation-functions-in-deep-learning-models-84ba9f82c253
- https://towardsdatascience.com/comparison-of-activation-functions-for-deep-neural-networks-706ac4284c8a
- https://missinglink.ai/guides/neural-network-concepts/7-types-neural-network-activation-functions-right/
- https://en.wikipedia.org/wiki/Rectifier_(neural_networks)
- https://mlfromscratch.com/activation-functions-explained/#/
- https://en.wikipedia.org/wiki/Universal_approximation_theorem
That’s all from my side. Thanks for reading this article. Sources for the few images used are mentioned; the rest are my own creation. Feel free to post comments and suggest corrections and improvements. Connect with me on LinkedIn, or you can mail me at sahdevkansal02@gmail.com. I look forward to hearing your feedback. Check out my Medium profile for more such articles.