The 11 Most Important Activation Functions: How to Choose Them
What is an Activation Function?
An activation function is a mathematical function that decides whether, and how strongly, a neuron should be activated. It is applied to the weighted sum of the neuron's inputs after the bias has been added. In the simplest case (a step function) the decision is binary: the neuron either fires or it does not. Most of the functions below produce a graded output instead. Activation functions let a network approximate nonlinear input-to-output relationships.
What happens if we do not use any activation function after each convolution layer?
A neural network without activation functions would simply be a linear regression model, because stacking linear layers only ever produces another linear function of the input.
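To make the point above concrete, here is a minimal sketch with one input and two stacked "layers" that have no activation between them. The weights and biases are arbitrary illustrative values, not taken from any real model. The two layers collapse into a single linear function.

```python
def two_linear_layers(x, w1=2.0, b1=1.0, w2=-3.0, b2=0.5):
    """Two layers without an activation: y = w2 * (w1 * x + b1) + b2."""
    hidden = w1 * x + b1
    return w2 * hidden + b2


def single_linear_layer(x, w1=2.0, b1=1.0, w2=-3.0, b2=0.5):
    """The equivalent single layer: y = (w2 * w1) * x + (w2 * b1 + b2)."""
    w = w2 * w1
    b = w2 * b1 + b2
    return w * x + b


for x in (-2.0, 0.0, 1.5):
    # Both versions give the same output for every input.
    print(x, two_linear_layers(x), single_linear_layer(x))
```

Inserting any nonlinear function between the two layers breaks this equivalence, which is what gives deeper networks their extra expressive power.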
When choosing an activation function, consider the following points:
- smoothness of the function, i.e. whether it is continuous and differentiable, which gradient-based training relies on
- computational cost (and therefore power consumption) of evaluating the function for every neuron in the network
- type of the desired output (regression/continuous variables or classification/categorical data)
Some Important Activation Functions
- Sigmoid
- Tanh
- ELU
- ReLU
- Leaky ReLU
- GELU
- Scaled exponential linear unit (SELU)
- Swish
- Softmax
- Mish
- SoftSign
Sigmoid
The name “Sigmoid” comes from the Greek letter sigma; the function has a characteristic “S”-shaped curve when plotted.
From Equ:1, y = 1 / (1 + e^(-z)), we can say that:
- If z goes to minus infinity, y goes to 0 (the neuron will not fire).
- If z goes to plus infinity, y goes to 1 (the neuron will fire).
- At z=0, y=0.5 (a common threshold value).
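The three limiting cases above can be checked with a short sketch of the standard logistic sigmoid, y = 1 / (1 + e^(-z)). The two-branch form is only an implementation choice to avoid overflow for large negative z.

```python
import math


def sigmoid(z):
    """Logistic sigmoid, written to avoid overflow in math.exp."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # safe: z is negative here
    return e / (1.0 + e)


print(sigmoid(-50.0))  # very close to 0
print(sigmoid(0.0))    # 0.5
print(sigmoid(50.0))   # very close to 1
```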
Pros:
- It is a simple function, so it is easy to calculate.
- It is differentiable, so it can be used in gradient-based backpropagation.
- It is a monotonic function and has a fixed output range (0 to 1).
- It can be used where a probability is to be predicted.
Cons:
- Sigmoid saturates for large positive or negative z and kills gradients.
- Sigmoid outputs are not zero-centered.
- Towards either end of the sigmoid function, the output responds very little to changes in z.
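The saturation described above shows up directly in the derivative, which for the sigmoid is y * (1 - y). The following sketch (helper names are illustrative) evaluates it at a few points.

```python
import math


def sigmoid(z):
    """Logistic sigmoid, written to avoid overflow in math.exp."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sigmoid_grad(z):
    """Derivative of the sigmoid: y * (1 - y)."""
    y = sigmoid(z)
    return y * (1.0 - y)


for z in (0.0, 2.0, 5.0, 10.0):
    # The gradient peaks at 0.25 for z = 0 and shrinks quickly as |z| grows.
    print(z, sigmoid_grad(z))
```

Because the gradient never exceeds 0.25, multiplying many such factors across layers during backpropagation shrinks the signal, which is one way to picture how gradients get "killed".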
Tanh
Pros
- The Tanh function is symmetric around the origin.
- Its derivative is steeper than that of Sigmoid (a maximum of 1 versus 0.25).
Cons
- The vanishing gradient problem still exists for this function.
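As a quick check of the points above, this sketch uses the standard definition tanh(z) with derivative 1 - tanh(z)^2. It shows the symmetry around the origin, the steeper slope at z = 0, and the vanishing gradient for large |z|.

```python
import math


def tanh_grad(z):
    """Derivative of tanh: 1 - tanh(z)^2."""
    t = math.tanh(z)
    return 1.0 - t * t


print(math.tanh(-2.0), math.tanh(2.0))  # equal magnitude, opposite sign
print(tanh_grad(0.0))                   # 1.0, versus 0.25 for the sigmoid
print(tanh_grad(10.0))                  # nearly 0: the gradient still vanishes
```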
ELU
Pros
- Avoids the dead ReLU problem.
- Produces negative outputs, which helps the network nudge weights and biases in the right directions.
- Produces nonzero activations for negative inputs instead of setting them to zero, so the gradient there is nonzero as well.
Cons
- Takes longer to compute because of the exponential operation.
- Does not avoid the exploding gradient problem.
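A minimal sketch of ELU, assuming the usual definition: z for z > 0 and alpha * (e^z - 1) otherwise. The default alpha = 1.0 is a common choice, not a value given in this article.

```python
import math


def elu(z, alpha=1.0):
    """ELU: identity for positive z, alpha * (exp(z) - 1) otherwise."""
    return z if z > 0 else alpha * math.expm1(z)


def elu_grad(z, alpha=1.0):
    """Derivative of ELU: 1 for positive z, alpha * exp(z) otherwise."""
    return 1.0 if z > 0 else alpha * math.exp(z)


for z in (-5.0, -1.0, 0.5, 2.0):
    # Negative inputs give negative outputs and a small but nonzero gradient.
    print(z, elu(z), elu_grad(z))
```

The call to `math.expm1` is the exponential operation responsible for the extra computation time listed under the cons.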
ReLU
Pros:
- Lower time and space cost: activations are sparse, and unlike the sigmoid it does not involve an exponential operation, which is more costly.
- Mitigates the vanishing gradient problem, since the gradient is 1 for every positive input.
Cons:
- Introduces the dead ReLU problem: a neuron whose input stays negative outputs zero and receives zero gradient, so its weights are most likely never updated to a new value. The resulting sparsity can sometimes also be a pro.
- ReLU does not avoid the exploding gradient problem.
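The dead ReLU problem can be seen in a few lines. This sketch assumes the usual definition max(0, z) and takes the gradient at z = 0 to be 0, which is a common convention.

```python
def relu(z):
    """ReLU: max(0, z)."""
    return max(0.0, z)


def relu_grad(z):
    """Derivative of ReLU: 1 for positive z, 0 otherwise (0 at z = 0 by convention)."""
    return 1.0 if z > 0 else 0.0


for z in (-3.0, -0.1, 0.0, 2.5):
    # For every non-positive input both the output and the gradient are zero,
    # so no learning signal reaches the weights that produced z.
    print(z, relu(z), relu_grad(z))
```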
Leaky ReLU
Pros
- Like ELU, it avoids the dead ReLU problem, since it allows a small gradient for negative inputs.
- Faster to compute than ELU, because no exponential operation is involved.
Cons
- Does not avoid the exploding gradient problem.
- The neural network does not learn the alpha value.
- It is piecewise linear, so its derivative is constant on each side of zero, whereas ELU is linear for positive inputs and nonlinear for negative ones.
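A short sketch of Leaky ReLU under the usual definition: z for z > 0 and alpha * z otherwise. The value alpha = 0.01 is a commonly used default and is an assumption here. It is a fixed hyperparameter, not something the network learns.

```python
def leaky_relu(z, alpha=0.01):
    """Leaky ReLU: z for positive z, alpha * z otherwise."""
    return z if z > 0 else alpha * z


def leaky_relu_grad(z, alpha=0.01):
    """Derivative: 1 for positive z, alpha otherwise (piecewise constant)."""
    return 1.0 if z > 0 else alpha


for z in (-100.0, -2.0, 3.0):
    # The gradient is never exactly zero, so the neuron cannot "die".
    print(z, leaky_relu(z), leaky_relu_grad(z))
```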
GELU
Pros
- Seems to be state-of-the-art in NLP, specifically in Transformer models, where it is reported to perform best.
- Avoids the vanishing gradient problem.
Cons
- Fairly new in practical use, although introduced in 2016.
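The article gives no formula for GELU, so the sketch below assumes the standard definition z * Phi(z), where Phi is the standard normal CDF, together with the widely used tanh approximation. The function names are illustrative.

```python
import math


def gelu(z):
    """Exact GELU: z * Phi(z), with Phi written in terms of erf."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))


def gelu_tanh(z):
    """Common tanh-based approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (z + 0.044715 * z ** 3)
    return 0.5 * z * (1.0 + math.tanh(inner))


for z in (-3.0, -1.0, 0.0, 1.0, 2.0):
    # The two versions agree closely over this range.
    print(z, gelu(z), gelu_tanh(z))
```

For large positive z the output approaches z, and for large negative z it approaches 0, so GELU behaves like a smooth version of ReLU.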
Scaled exponential linear unit (SELU)
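The article gives no details for SELU. As a hedged sketch, the commonly cited definition is an ELU multiplied by a fixed scale. The two constants below are the values usually quoted for self-normalizing networks and are an assumption brought in here, not taken from this article.

```python
import math

# Commonly quoted SELU constants (assumed, not from this article).
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805


def selu(z):
    """SELU: scale * z for positive z, scale * alpha * (exp(z) - 1) otherwise."""
    if z > 0:
        return SELU_SCALE * z
    return SELU_SCALE * SELU_ALPHA * math.expm1(z)


for z in (-5.0, -1.0, 0.0, 1.0):
    print(z, selu(z))
```

Unlike Leaky ReLU's alpha, these constants are not tuned per model. They are fixed so that activations tend to keep zero mean and unit variance from layer to layer.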
Swish
Pros
- It is continuous and differentiable at all points.
- It is simple and easy to use.
- Unlike ReLU, it does not suffer from the problem of dying neurons.
- It has been reported to perform better than various activation functions such as ReLU, Leaky ReLU, Parametric ReLU, ELU, SELU and GELU when compared on standard datasets such as CIFAR and ImageNet.
- Being non-saturating for positive inputs, it is less prone to the vanishing gradient problem.
Cons
- It is slower to compute than ReLU and its variants such as Leaky ReLU and Parametric ReLU, because computing the output involves the sigmoid function.
- The Swish activation function can be unstable, and its behavior cannot be predicted a priori.
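The article gives no formula for Swish. The sketch below assumes the usual definition z * sigmoid(beta * z), with beta = 1.0 as an illustrative default.

```python
import math


def sigmoid(z):
    """Logistic sigmoid, written to avoid overflow in math.exp."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def swish(z, beta=1.0):
    """Swish: z * sigmoid(beta * z)."""
    return z * sigmoid(beta * z)


for z in (-5.0, -1.0, 0.0, 1.0, 5.0):
    # Small negative inputs give small negative outputs instead of exactly
    # zero, so the gradient does not vanish there as it does for ReLU.
    print(z, swish(z))
```

The sigmoid call inside `swish` is the extra cost mentioned in the cons when compared with ReLU.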
Softmax
Normally, Softmax is used only for the output layer, whenever we have to classify inputs into multiple categories. The softmax function is defined by the following formula:
softmax(z)_i = e^(z_i) / sum_j e^(z_j)
The input values can be positive, negative, zero, or greater than one, but the softmax transforms them into values between 0 and 1 that sum to 1, so they can be read as class probabilities.
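A minimal softmax sketch follows. Subtracting the maximum input before exponentiating is a standard numerical-stability trick that leaves the result unchanged. The input scores are made-up illustrative values.

```python
import math


def softmax(logits):
    """Softmax over a list of scores, shifted by the max for stability."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1, -3.0])
print(probs)       # every value lies between 0 and 1
print(sum(probs))  # the values sum to 1 (up to rounding)
```

The largest input always receives the largest probability, which is why the predicted class is usually taken as the index of the maximum output.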
Mish Activation Function
Important properties of Mish:
- Unbounded above: Mish avoids saturation, which slows training down because of near-zero gradients.
- Bounded below: Mish shows strong regularization effects.
- Non-monotonic: by preserving small negative values, Mish lets the gradient flow in the negative region, which allows the network to learn better.
- Continuity: Mish’s first derivative is continuous over the entire domain, which helps in effective optimization and generalization.
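The properties listed above can be explored with a short sketch. It assumes the usual definition of Mish, z * tanh(softplus(z)) with softplus(z) = ln(1 + e^z), since the article does not state the formula.

```python
import math


def softplus(z):
    """Softplus ln(1 + exp(z)), written to avoid overflow for large z."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def mish(z):
    """Mish: z * tanh(softplus(z))."""
    return z * math.tanh(softplus(z))


for z in (-5.0, -1.0, 0.0, 1.0, 10.0):
    # Negative inputs give small negative outputs (bounded below), while
    # large positive inputs pass through almost unchanged (unbounded above).
    print(z, mish(z))
```

Comparing the outputs at z = -1 and z = -5 shows the non-monotonic dip: the function first decreases below zero and then rises back towards zero as z becomes more negative.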
SoftSign Activation Function
The output of the softsign function is zero-centered, which helps the next layer during propagation. It rescales values to the range between -1 and 1, squashing them in much the same way as a sigmoid function.
In fig:13, sgn is the signum function, which returns ± 1 depending on the sign of z.
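Since the formula itself is not reproduced here, the following sketch assumes the usual definition of softsign, z / (1 + |z|), whose derivative is 1 / (1 + |z|)^2. Note that |z| equals sgn(z) * z, which is where the signum function comes in.

```python
def softsign(z):
    """Softsign: z / (1 + |z|), with outputs strictly between -1 and 1."""
    return z / (1.0 + abs(z))


def softsign_grad(z):
    """Derivative of softsign: 1 / (1 + |z|)^2."""
    return 1.0 / (1.0 + abs(z)) ** 2


for z in (-1e6, -3.0, 0.0, 1.0, 1e6):
    # Outputs are zero-centered and approach -1 or 1 only slowly.
    print(z, softsign(z), softsign_grad(z))
```

Compared with tanh, softsign approaches its limits polynomially rather than exponentially, so it saturates more gently.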