The 11 Most Important Activation Functions: How to Choose Them?

Toshiba Kamruzzaman
5 min read · Aug 27, 2020


What is Activation Function?

An activation function is a mathematical function that decides whether a neuron should be activated: it computes the weighted sum of the inputs, adds a bias, and maps the result to an output. In the simplest case this decision is binary: the neuron either fires or it does not. More generally, activation functions provide the nonlinearity needed to approximate an input-to-output relation.

What happens if we do not use any activation function after each convolution layer?

A neural network without activation functions would simply be a linear regression model, no matter how many layers it has.
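To see this concretely, here is a small sketch (using NumPy, with arbitrary illustrative layer sizes) showing that two stacked linear layers with no activation in between collapse into a single affine map:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)

# Two "layers" with no activation function in between.
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
two_layer = W2 @ (W1 @ x + b1) + b2

# They collapse into one equivalent affine map: W = W2 W1, b = W2 b1 + b2.
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

assert np.allclose(two_layer, one_layer)
```

No matter how many such layers we stack, the composition stays linear, which is why a nonlinearity is inserted after each layer.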

We should choose an activation function by considering the following points:

  • continuity of the function, i.e., whether it is differentiable everywhere
  • power consumption when processing all neurons of the network
  • the type of desired output (logistic/continuous variables, or classification/categorical data)

Some Important Activation Functions

  1. Sigmoid
  2. Tanh
  3. ELU
  4. ReLU
  5. Leaky ReLU
  6. GELU
  7. Scaled exponential linear unit (SELU)
  8. Swish
  9. Softmax
  10. Mish
  11. SoftSign


Sigmoid

The name “sigmoid” comes from the Greek letter sigma; the function has a characteristic “S”-shaped curve when plotted.

Equ:1 - Sigmoid function: y = σ(z) = 1 / (1 + exp(−z))

From the Equ:1 , we can say that-

  1. As z goes to minus infinity, y goes to 0 (the neuron will not fire).
  2. As z goes to plus infinity, y goes to 1 (the neuron will fire).
  3. At z = 0, y = 0.5 (the threshold value in many cases).


Pros:

  • It is a simple function, so it is easy to compute.
  • It is differentiable, so it can be used in gradient-based backpropagation.
  • It is a monotonic function and has a fixed output range.
  • It can be used where a probability is to be predicted.


Cons:

  1. Sigmoid saturates after a certain point and kills gradients.
  2. Sigmoid outputs are not zero-centered.
  3. Toward either end of the sigmoid curve, the output responds very little to changes in z.
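Despite these cons, the sigmoid is trivial to implement. A minimal plain-Python sketch of Equ:1, using scalar inputs for clarity:

```python
import math

def sigmoid(z):
    """Sigmoid: squashes any real z into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# z = 0 sits exactly at the 0.5 threshold.
print(sigmoid(0))                 # 0.5
# Large |z| saturates toward 0 or 1, where gradients vanish.
print(sigmoid(10), sigmoid(-10))
```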



Tanh

Pros:

  1. The Tanh function is symmetric around the origin.
  2. Its derivative is steeper than the sigmoid’s.


Cons:

  1. The vanishing gradient problem still exists for this function.
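The claim that Tanh’s derivative is steeper than the sigmoid’s is easy to check numerically. A sketch in plain Python, using the closed-form derivatives of both functions:

```python
import math

def tanh_derivative(z):
    # d/dz tanh(z) = 1 - tanh(z)^2; peaks at 1.0 when z = 0.
    return 1.0 - math.tanh(z) ** 2

def sigmoid_derivative(z):
    s = 1.0 / (1.0 + math.exp(-z))
    # d/dz sigmoid(z) = s * (1 - s); peaks at only 0.25 when z = 0.
    return s * (1.0 - s)

# Tanh's maximum slope is 4x the sigmoid's, so gradients are stronger
# near the origin, but both still saturate for large |z|.
print(tanh_derivative(0), sigmoid_derivative(0))
```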



ELU

Pros:

  • Avoids the dead ReLU problem.
  • Produces negative outputs, which helps the network nudge weights and biases in the right directions.
  • Produces nonzero activations for negative inputs instead of letting them be zero, so gradients can still flow there.


Cons:

  • Introduces longer computation time because of the exponential operation involved.
  • Does not avoid the exploding gradient problem.
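A minimal sketch of ELU in plain Python, with a scalar input (alpha = 1.0 is the common default):

```python
import math

def elu(z, alpha=1.0):
    # ELU: identity for z > 0, and a smooth exponential curve
    # alpha * (exp(z) - 1) for z <= 0, saturating at -alpha.
    return z if z > 0 else alpha * (math.exp(z) - 1.0)

# Positive inputs pass through; very negative inputs approach -alpha,
# which is the bounded negative output mentioned above.
print(elu(2.0), elu(-50.0))
```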



ReLU

Pros:

  • Lower time and space complexity: its sparsity helps, and unlike the sigmoid it does not involve the exponential operation, which is more costly.
  • Avoids the vanishing gradient problem.


Cons:

  • Introduces the dead ReLU problem, where some components of the network are most likely never updated again. This can sometimes also be a pro.
  • ReLU does not avoid the exploding gradient problem.
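ReLU itself is a one-liner, which is exactly why it is so cheap to compute:

```python
def relu(z):
    # ReLU: max(0, z). No exponential, and negative inputs are zeroed out,
    # which produces sparsity but also a zero gradient for all z < 0.
    return max(0.0, z)

print(relu(3.0), relu(-5.0))  # 3.0 0.0
```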

Leaky ReLU


Pros:

  • Like the ELU, it avoids the dead ReLU problem, since it allows a small gradient when computing the derivative.
  • Faster to compute than ELU, because no exponential operation is included.


Cons:

  • Does not avoid the exploding gradient problem.
  • The neural network does not learn the alpha value.
  • It is a piecewise-linear function, whereas ELU is partly linear and partly nonlinear.
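A sketch of Leaky ReLU with a fixed, hand-chosen alpha (as noted above, alpha is not learned; 0.01 is a common default):

```python
def leaky_relu(z, alpha=0.01):
    # A small fixed slope alpha for z < 0 keeps the gradient nonzero
    # there, avoiding dead neurons without any exponential.
    return z if z > 0 else alpha * z

print(leaky_relu(3.0), leaky_relu(-10.0))
```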



GELU

Pros:

  • Seems to be state-of-the-art in NLP, specifically in Transformer models, where it performs best.
  • Avoids the vanishing gradient problem.


Cons:

  • Fairly new in practical use, although it was introduced in 2016.
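The exact GELU is defined via the Gaussian CDF; the tanh-based approximation below is the form commonly used in Transformer implementations. A sketch in plain Python, with a scalar input:

```python
import math

def gelu(z):
    # Tanh approximation of GELU:
    # 0.5 * z * (1 + tanh(sqrt(2/pi) * (z + 0.044715 * z^3)))
    return 0.5 * z * (1.0 + math.tanh(
        math.sqrt(2.0 / math.pi) * (z + 0.044715 * z ** 3)))

# Behaves like identity for large positive z, like zero for large
# negative z, with a smooth, slightly non-monotonic transition between.
print(gelu(10.0), gelu(0.0), gelu(-10.0))
```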

Scaled exponential linear unit (SELU)
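SELU multiplies an ELU by fixed constants λ ≈ 1.0507 and α ≈ 1.6733, chosen in the original paper so that activations self-normalize (keep roughly zero mean and unit variance) across layers. A minimal sketch in plain Python, with a scalar input:

```python
import math

# Fixed constants from the SELU paper (Klambauer et al., 2017).
SELU_ALPHA = 1.6732632423543772
SELU_LAMBDA = 1.0507009873554805

def selu(z):
    # SELU = lambda * ELU(z, alpha); the negative branch saturates
    # at -lambda * alpha instead of -alpha.
    return SELU_LAMBDA * (z if z > 0 else SELU_ALPHA * (math.exp(z) - 1.0))

print(selu(1.0), selu(0.0), selu(-50.0))
```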



Swish

Pros:

  • It is continuous and differentiable at all points.
  • It is simple and easy to use.
  • Unlike ReLU, it does not suffer from the problem of dying neurons.
  • It performs better than various activation functions such as ReLU, Leaky ReLU, Parameterized ReLU, ELU, SELU, GELU when compared on standard datasets such as CIFAR and ImageNet.
  • Being a non-saturating activation function, it does not suffer from the problems of exploding or vanishing gradients.


Cons:

  • It is slower to compute than ReLU and its variants such as Leaky ReLU and Parameterized ReLU, because of the sigmoid function involved in computing the output.
  • The Swish activation function is unstable, and its behavior cannot be predicted a priori.
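Swish is simply the input multiplied by its own sigmoid, which is where the extra sigmoid cost mentioned above comes from. A sketch in plain Python, with a scalar input (β = 1 is the common default):

```python
import math

def swish(z, beta=1.0):
    # Swish: z * sigmoid(beta * z) = z / (1 + exp(-beta * z)).
    # Smooth and non-monotonic: slightly negative for small negative z.
    return z / (1.0 + math.exp(-beta * z))

print(swish(10.0), swish(0.0), swish(-1.0))
```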


Softmax

Normally, Softmax is used only for the output layer, whenever we have to classify inputs into multiple categories. The softmax function is defined by the following formula:

softmax(z)ᵢ = exp(zᵢ) / Σⱼ exp(zⱼ)

The input values can be positive, negative, zero, or greater than one, but the softmax transforms them all into values between 0 and 1 that sum to 1, so they can be read as class probabilities.
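A minimal sketch of the formula in plain Python. Subtracting the maximum before exponentiating is a standard numerical-stability trick and does not change the result:

```python
import math

def softmax(zs):
    # Shift by max(zs) so exp() never overflows; the shift cancels
    # out in the ratio. The outputs are positive and sum to 1.
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
print(probs)  # largest input gets the largest probability
```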

Mish Activation Function

fig-12: Mish activation function: Mish(z) = z · tanh(softplus(z)) = z · tanh(ln(1 + exp(z)))

Important properties of Mish:

  1. Unbounded above: Mish avoids the saturation that slows training down through near-zero gradients.
  2. Bounded below: Mish shows strong regularization effects.
  3. Non-monotonic: by preserving small negative gradients, Mish lets gradients flow in the negative region, which allows the network to learn better.
  4. Continuity: Mish’s first derivative is continuous over the entire domain, which helps effective optimization and generalization.
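A minimal sketch of Mish in plain Python, with a scalar input, where softplus(z) = ln(1 + exp(z)):

```python
import math

def mish(z):
    # Mish: z * tanh(softplus(z)); log1p computes ln(1 + x) accurately.
    # Unbounded above, bounded below, and non-monotonic for negative z.
    return z * math.tanh(math.log1p(math.exp(z)))

print(mish(10.0), mish(0.0), mish(-10.0))
```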

SoftSign Activation Function

The output of the SoftSign function is zero-centered, which helps the next neuron during propagation. It rescales values into the range −1 to 1, squashing its input smoothly much like a sigmoid does.

fig:13 - SoftSign activation function and its derivative: f(z) = z / (1 + |z|), f′(z) = 1 / (1 + |z|)², where |z| = sgn(z)·z

In fig:13, sgn is the signum function, which returns ±1 depending on the sign of z.
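A minimal sketch of SoftSign in plain Python, with a scalar input:

```python
def softsign(z):
    # SoftSign: z / (1 + |z|). Zero-centered with range (-1, 1), but it
    # approaches its asymptotes polynomially, more gently than tanh.
    return z / (1.0 + abs(z))

print(softsign(1.0), softsign(-1.0), softsign(100.0))
```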