Common Activation Functions and Why You Must Know Them

A summary of implementation, perks, caveats and utilisation of commonly used activation functions

Activation functions are a basic building block of neural networks. However, they must be studied carefully before they can be used effectively. This is because activation functions have active regions and dead regions: depending on where their inputs fall, they either let the model learn or leave parts of it effectively dead. Let's go through the common ones one by one, along with their proper use and their downsides.

Sigmoid Function

The sigmoid function, also known as the logistic function or soft-step function, is the common starting point for studying the behaviour of neural networks. Mathematically, the function and its derivative can be represented as shown in Figure 1.

Fig. 1 Sigmoid function and derivative
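
To make the figure concrete, here is a minimal NumPy sketch of the sigmoid and its derivative (the function names are mine, not part of the original figure):

```python
import numpy as np

def sigmoid(x):
    # sigmoid(x) = 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # The derivative simplifies to sigmoid(x) * (1 - sigmoid(x))
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.linspace(-8, 8, 5)
print(sigmoid(x))             # values squashed into (0, 1)
print(sigmoid_derivative(x))  # peaks at 0.25 when x = 0
```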

The function can be plotted as shown in Figure 2.

Fig. 2 Sigmoid function

We can see that the function can only help training within the range of roughly -4 to 4. Beyond these limits on either side, the gradients are not significant. Let's have a look at the gradient curve shown in Figure 3.

Fig. 3 Derivative of sigmoid function

We can see that beyond the [-4, 4] range marked by the green and red lines, the gradients are not significant. This is the problem of vanishing (diminishing) gradients. For this reason, using the sigmoid function in a deep network is not desirable. Furthermore, the sigmoid derivative is at most 0.25, and backpropagation multiplies such small values (<1) layer after layer, so you end up with vanishingly small gradients. This is an unfavourable condition in deep networks, so sigmoid is simply not a desired activation for deep learning. ReLU is the first workaround to overcome these undesired outcomes of the sigmoid function.
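
To see how quickly this compounds, here is a small back-of-the-envelope sketch (my own illustration, not from the original article) that multiplies the largest possible sigmoid gradient across several layers:

```python
# The sigmoid derivative never exceeds 0.25 (its value at x = 0).
max_grad = 0.25

# Multiplying this bound across layers, as backpropagation does,
# shrinks the gradient signal exponentially with depth.
for depth in (5, 10, 20):
    print(depth, max_grad ** depth)
# 5  -> ~9.8e-04
# 10 -> ~9.5e-07
# 20 -> ~9.1e-13
```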

Although the sigmoid function is not used in hidden layers, it is an ideal candidate for the output layer. This is because sigmoid gives us values in the range (0, 1), which lets us train a network for a binary encoded output. This is typically done together with a binary cross-entropy loss function.
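
As a rough illustration of that pairing, here is a minimal NumPy sketch of a sigmoid output combined with binary cross-entropy (the data and names are made up for the example):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    # Clip to avoid log(0) for saturated predictions.
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))

logits = np.array([2.1, -0.5, 3.0, -1.7])   # raw outputs of the last layer
y_true = np.array([1.0, 0.0, 1.0, 0.0])     # binary labels

y_pred = sigmoid(logits)
print(binary_cross_entropy(y_true, y_pred))
```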

ReLU (Rectified Linear Unit) Function

ReLU simply passes through the non-negative output of a neurone and outputs zero otherwise. Mathematically, this can be represented as shown in Figure 4.

Fig. 4 ReLU function and derivative
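
For reference, here is a minimal NumPy sketch of ReLU and its derivative (naming mine):

```python
import numpy as np

def relu(x):
    # max(0, x): pass positive values through, zero out the rest
    return np.maximum(0.0, x)

def relu_derivative(x):
    # 1 for positive inputs, 0 elsewhere (undefined at exactly 0; using 0 is a common convention)
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))             # [0.  0.  0.  0.5 2. ]
print(relu_derivative(x))  # [0. 0. 0. 1. 1.]
```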

We can plot the function and its derivative as illustrated in Figure 5.

Fig. 5 ReLU and derivative

As shown in Figure 5, the derivative is never dead in the positive region. Furthermore, since positive values are passed through without any dampening, activations will not vanish as we saw with the sigmoid function. Thus, ReLU becomes an ideal candidate for deep learning. However, as you might have noted, ReLU outputs nothing in the negative space. This is often okay, since the input and output values of neural networks are usually positive. However, if you have scaled or normalised values in the range [-1, 1], this might kill some neurones beyond recovery: once a neurone's pre-activation stays negative, its gradient is zero and its weights stop updating. This phenomenon is called the dying ReLU. Though this is usually not something to worry about (it can often be avoided by scaling inputs with a min-max scaler), it is worth knowing the workarounds. Note that because values can grow unbounded through a ReLU network, it is often desirable not to use ReLU in an output layer. Unbounded values can also lead to the phenomenon of exploding gradients. However, proper gradient clipping and regularisation can help.
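
The following sketch (my own illustration, not from the article) shows the mechanics of a dying ReLU: once the pre-activation is negative, the gradient flowing back through the unit is zero, so its weights receive no update:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_derivative(x):
    return (x > 0).astype(float)

# A single neurone whose bias has been pushed far negative.
w, b = np.array([0.5, -0.3]), -10.0
x = np.array([1.0, 2.0])            # a typical input

z = w @ x + b                       # pre-activation (negative here)
a = relu(z)                         # output is 0
upstream_grad = 1.0                 # gradient arriving from the loss
grad_w = upstream_grad * relu_derivative(z) * x

print(z, a, grad_w)                 # negative z, zero output, zero weight gradient
```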

ReLU Variants and Other Linear Unit Functions

There are a few variants of ReLU which help to overcome the problem of the dying ReLU.

PReLU (Parametric ReLU) and Leaky ReLU

Parametric ReLU parameterises the slope of the negative region, thus allowing a dying ReLU to recover. In PReLU this slope ⍺ is a learnable parameter, whereas Leaky ReLU uses a fixed value for the negative component. Usually this parameter is chosen to be a small value, since the typical outputs of a network are expected to be positive.

Fig. 6 PReLU and derivative
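
Here is a minimal NumPy sketch of the forward pass and derivative, treating ⍺ as a plain argument (in PReLU it would be a learnable parameter; the naming is mine):

```python
import numpy as np

def prelu(x, alpha):
    # x for positive inputs, alpha * x for negative inputs
    return np.where(x > 0, x, alpha * x)

def prelu_derivative(x, alpha):
    # 1 in the positive region, alpha in the negative region
    return np.where(x > 0, 1.0, alpha)

x = np.array([-3.0, -1.0, 0.5, 2.0])
print(prelu(x, alpha=0.1))             # [-0.3 -0.1  0.5  2. ]
print(prelu_derivative(x, alpha=0.1))  # [0.1 0.1 1.  1. ]
```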

We have plotted the diagram shown in Figure 7 for ⍺ = 0.1. This is the scenario of Leaky ReLU with 0.1 as the negative multiplier.

Fig. 7 Leaky ReLU with ⍺=0.1

Note that the gradient in the negative region is small but non-zero. Therefore the nodes will not die even when they receive many negative values.

ELU (Exponential Linear Unit) Function

In this function, the negative component is modelled using an exponential curve. However, we still have a parameter ⍺ scaling the negative component (fixed in standard ELU, though parametric variants learn it).

Fig. 8 ELU function
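
A minimal NumPy sketch of ELU and its derivative, using ⍺ = 0.5 to match the plot below (naming mine):

```python
import numpy as np

def elu(x, alpha=0.5):
    # x for positive inputs, alpha * (e^x - 1) for negative inputs
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def elu_derivative(x, alpha=0.5):
    # 1 in the positive region, alpha * e^x (= elu(x) + alpha) in the negative region
    return np.where(x > 0, 1.0, alpha * np.exp(x))

x = np.array([-3.0, -1.0, 0.5, 2.0])
print(elu(x))             # smooth saturation towards -alpha for large negative x
print(elu_derivative(x))  # small but non-zero gradient in the negative region
```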

We can plot this as shown in Figure 9. Here, we assume ⍺ = 0.5 for better visibility.

Fig. 9 ELU function and derivative

Now that we have talked about a few important functions used in deep networks, let's look at some common but less involved functions.

Tanh Function

Tanh is the driving force behind LSTM networks. It helps avoid some of the shortcomings of plain RNNs.

Fig. 10 Tanh function and derivative
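
For reference, a minimal NumPy sketch of tanh and its derivative (naming mine):

```python
import numpy as np

def tanh(x):
    # Squashes inputs into the range (-1, 1); zero-centred, unlike sigmoid
    return np.tanh(x)

def tanh_derivative(x):
    # d/dx tanh(x) = 1 - tanh(x)^2
    return 1.0 - np.tanh(x) ** 2

x = np.linspace(-4, 4, 5)
print(tanh(x))
print(tanh_derivative(x))  # peaks at 1 when x = 0
```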

We can plot it as shown in Figure 11.

Fig. 11 Tanh function and derivative

A Few More Activation Functions

  • Identity function: This is the naive function f(x) = x with derivative f'(x) = 1.
  • Softmax function: This function guarantees that the outputs of the layer add up to 1, so they can be read as class probabilities. It is mostly used in the output layer for categorical classification, together with categorical cross-entropy as the loss function (see the sketch below).
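
A minimal NumPy sketch of a numerically stable softmax (naming mine):

```python
import numpy as np

def softmax(x):
    # Subtracting the max keeps the exponentials from overflowing
    # without changing the result.
    shifted = x - np.max(x)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs)            # e.g. [0.659 0.242 0.099]
print(probs.sum())      # 1.0
```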

Notes

Activation functions are designed mostly with the derivative in mind. That is why their derivatives usually simplify to neat expressions, often in terms of the function's own output (e.g. sigmoid'(x) = sigmoid(x)(1 - sigmoid(x))).

Activation functions must be chosen by looking at the range of the input values. That said, ReLU or PReLU is a good starting point for hidden layers, with sigmoid or softmax at the output layer.

Activation functions with learnable parameters are often implemented as separate layers. This is more intuitive, since their parameters are learned through the same back-propagation algorithm as the rest of the network.
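
As a rough illustration of that idea (my own sketch, not a reference implementation), a PReLU "layer" can carry its own ⍺ and produce a gradient for it during the backward pass:

```python
import numpy as np

class PReLULayer:
    """A toy PReLU layer with a single learnable slope alpha."""

    def __init__(self, alpha=0.25):
        self.alpha = alpha

    def forward(self, x):
        self.x = x                                   # cache input for backprop
        return np.where(x > 0, x, self.alpha * x)

    def backward(self, grad_out, lr=0.01):
        # Gradient w.r.t. the input, passed on to the previous layer.
        grad_x = grad_out * np.where(self.x > 0, 1.0, self.alpha)
        # Gradient w.r.t. alpha: only the negative inputs contribute.
        grad_alpha = np.sum(grad_out * np.where(self.x > 0, 0.0, self.x))
        self.alpha -= lr * grad_alpha                # same gradient-descent update
        return grad_x

layer = PReLULayer()
out = layer.forward(np.array([-2.0, 3.0]))
grad_in = layer.backward(np.array([1.0, 1.0]))
print(out, grad_in, layer.alpha)                     # alpha has moved slightly
```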

Hope you enjoyed reading this article.

Cheers! :)

