# The 11 Most Important Activation Functions and How to Choose Them


# What is an Activation Function?

An activation function is a mathematical function that decides whether, and how strongly, a neuron should be activated. It is applied to the weighted sum of the neuron's inputs after the bias has been added. In the simplest case (a step function) the decision is binary: the neuron either fires or it does not. Most functions used in practice produce a graded output instead. Activation functions are meant to let the network approximate the relation between input and output.
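A minimal sketch of a single neuron in plain Python may make this concrete. The helper names (`step`, `sigmoid`, `neuron_output`) and the numbers are made up for illustration and do not come from any library.

```python
import math

def step(z):
    """Binary step activation: fire (1.0) if z is non-negative, else stay off (0.0)."""
    return 1.0 if z >= 0 else 0.0

def sigmoid(z):
    """Graded activation: squashes z into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus the bias, passed through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

x = [0.5, -1.0, 2.0]  # example inputs (illustrative values)
w = [0.4, 0.3, 0.1]   # example weights (illustrative values)
b = -0.2              # example bias (illustrative value)

print(neuron_output(x, w, b, step))     # binary decision: fires or not
print(neuron_output(x, w, b, sigmoid))  # graded activation between 0 and 1
```

The same weighted sum gives a hard on/off decision with the step function and a soft value with the sigmoid.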

# What happens if we do not use any activation function after each convolution layer?

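A neural network without activation functions would simply be a linear regression model: however many layers are stacked, a composition of linear layers collapses into a single linear transformation.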
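The collapse can be checked with a small sketch. It uses scalar "layers" of the form `w * x + b` for brevity; the weights and function names are illustrative assumptions.

```python
def linear(x, w, b):
    """A one-dimensional linear layer."""
    return w * x + b

def relu(z):
    return max(0.0, z)

# Two stacked linear layers (illustrative weights).
w1, b1 = 2.0, 1.0
w2, b2 = -3.0, 0.5

# The equivalent single layer: w2*(w1*x + b1) + b2 = (w2*w1)*x + (w2*b1 + b2)
w_eq = w2 * w1
b_eq = w2 * b1 + b2

def two_linear_layers(x):
    return linear(linear(x, w1, b1), w2, b2)

def collapsed(x):
    return linear(x, w_eq, b_eq)

def with_relu(x):
    # Same two layers, but with a nonlinearity in between.
    return linear(relu(linear(x, w1, b1)), w2, b2)

for x in (-2.0, 0.0, 1.5):
    print(x, two_linear_layers(x), collapsed(x), with_relu(x))
```

The first two columns always agree, while the version with ReLU in between differs wherever the first layer's output is negative, so it can no longer be replaced by a single linear layer.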

When choosing an activation function, consider the following points:

- smoothness of the function, that is, whether it is continuous and differentiable, since gradient-based training needs its derivative
- computational cost of evaluating the function for every neuron in the network
- type of the desired output (continuous values for regression, or categorical data for classification)

# Some Important Activation Functions

- Sigmoid
- Tanh
- ELU
- ReLU
- Leaky ReLU
- GELU
- Scaled exponential linear unit (SELU)
- Swish
- Softmax
- Mish
- SoftSign

# Sigmoid

The name “sigmoid” comes from the Greek letter sigma: the function has a characteristic “S”-shaped curve when it is plotted.

The sigmoid is defined as $y = \sigma(z) = \frac{1}{1 + e^{-z}}$ (Eq. 1). From Eq. 1, we can say that:

- If z goes to minus infinity, y goes to 0 (the neuron will not fire).
- If z goes to plus infinity, y goes to 1 (the neuron will fire).
- At z = 0, y = 0.5 (a common threshold value).

Pros:

- It is a simple function, so it is easy to calculate
- It is differentiable, so it can be used in gradient-based backpropagation.
- It is a monotonic function, and has a fixed output range.
- It can be used where probability is to be predicted.

Cons:

- Sigmoid saturates beyond a certain point and kills gradients.
- Sigmoid outputs are not zero-centered.
- Towards either end of the sigmoid function, the output responds very little to changes in z.
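The saturation described above can be seen by printing the sigmoid next to its derivative. This is a minimal sketch in plain Python; the function names are illustrative, and the derivative uses the standard identity y' = y(1 - y).

```python
import math

def sigmoid(z):
    """Sigmoid written so that math.exp never receives a large positive argument."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def sigmoid_grad(z):
    """Derivative of the sigmoid: y * (1 - y)."""
    s = sigmoid(z)
    return s * (1.0 - s)

for z in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"z={z:6.1f}  y={sigmoid(z):.5f}  dy/dz={sigmoid_grad(z):.5f}")
```

The derivative peaks at 0.25 for z = 0 and shrinks towards zero at both ends, which is why deep stacks of sigmoids tend to pass back very small gradients.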

# Tanh

Pros

- The Tanh function is symmetric around the origin.
- Its gradient is steeper than that of the sigmoid (a maximum of 1 at the origin, versus 0.25).

Cons

- The vanishing gradient problem still exists for this function.
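A short sketch of these properties, using `math.tanh` from the Python standard library. The helper names are illustrative; the derivative is the standard identity 1 - tanh²(z), and the last line uses the known relation tanh(z) = 2·sigmoid(2z) - 1.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def tanh_grad(z):
    """Derivative of tanh: 1 - tanh(z)^2."""
    t = math.tanh(z)
    return 1.0 - t * t

# Symmetric around the origin: tanh(-z) == -tanh(z)
print(math.tanh(1.5), math.tanh(-1.5))

# Steeper than the sigmoid at the origin, but still saturating far from it
print(tanh_grad(0.0), tanh_grad(5.0))

# tanh is a rescaled, shifted sigmoid
print(math.tanh(0.7), 2.0 * sigmoid(1.4) - 1.0)
```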

# ELU

Pros

- Avoids the *dead ReLU* problem.
- Produces negative outputs, which helps the network nudge weights and biases in the right directions.
- Produces nonzero activations for negative inputs instead of setting them to zero, so the gradient there does not vanish entirely.

Cons

- Introduces longer computation time, because of the exponential operation included
- Does not avoid the exploding gradient problem
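As a sketch, assuming the commonly used definition of ELU (z for z > 0, otherwise alpha·(eᶻ - 1), with alpha = 1.0 as a typical default), the function and its derivative look like this. The function names are illustrative.

```python
import math

def elu(z, alpha=1.0):
    """ELU: identity for positive z, alpha * (exp(z) - 1) otherwise."""
    return z if z > 0 else alpha * math.expm1(z)

def elu_grad(z, alpha=1.0):
    """Derivative of ELU: 1 for positive z, alpha * exp(z) otherwise."""
    return 1.0 if z > 0 else alpha * math.exp(z)

for z in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"z={z:5.1f}  elu={elu(z):8.5f}  grad={elu_grad(z):.5f}")
```

Negative inputs give small negative outputs bounded below by -alpha, and the gradient stays positive there, at the price of one exponential per negative input.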

# ReLU

Pros:

- Lower time and space complexity, because of sparsity; compared to the sigmoid, it does not involve the exponential operation, which is more costly.
- Avoids the vanishing gradient problem for positive inputs, where the gradient is always 1.

Cons:

- Introduces the *dead ReLU* problem, where some neurons of the network output zero for every input and are most likely never updated to a new value. This can sometimes also be a pro.
- ReLU does not avoid the exploding gradient problem.
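The dead ReLU problem can be illustrated with a toy neuron. This is a sketch only: the weight, bias and inputs are hypothetical values chosen so that the pre-activation is negative for every input.

```python
def relu(z):
    """ReLU: pass positive values through, clamp the rest to zero."""
    return max(0.0, z)

def relu_grad(z):
    # Convention used here: the gradient at z == 0 is taken to be 0.
    return 1.0 if z > 0 else 0.0

# A hypothetical neuron whose large negative bias keeps z below zero
# for every input in this toy dataset.
weight, bias = 0.5, -10.0
inputs = [-2.0, 0.0, 3.0, 8.0]

pre_activations = [weight * x + bias for x in inputs]
outputs = [relu(z) for z in pre_activations]
grads = [relu_grad(z) for z in pre_activations]

print(outputs)  # all zeros: the neuron never fires
print(grads)    # all zeros: no gradient flows back through it
```

Because every gradient is zero, gradient descent has no signal with which to move the weight or bias, so the neuron stays inactive.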

# Leaky ReLU

Pros

- Like the ELU, it avoids the *dead ReLU* problem, since it allows a small gradient for negative inputs when computing the derivative.
- Faster to compute than ELU, because no exponential operation is included.

Cons

- Does not avoid the exploding gradient problem
- The neural network does not learn the alpha value; it is a fixed hyperparameter.
- It is linear on each side of zero, so its derivative is piecewise constant, whereas ELU is linear for positive inputs and nonlinear for negative ones.
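A minimal sketch of Leaky ReLU, assuming the usual definition (z for z > 0, otherwise alpha·z). The value alpha = 0.01 is a common default and is used here as an assumption, not as a recommendation from the text.

```python
def leaky_relu(z, alpha=0.01):
    """Leaky ReLU: identity for positive z, a small slope alpha otherwise."""
    return z if z > 0 else alpha * z

def leaky_relu_grad(z, alpha=0.01):
    """Piecewise-constant derivative: 1 for positive z, alpha otherwise."""
    return 1.0 if z > 0 else alpha

for z in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(f"z={z:6.1f}  out={leaky_relu(z):8.3f}  grad={leaky_relu_grad(z):.2f}")
```

The gradient never drops to zero, so a neuron stuck in the negative region can still recover, and no exponential is evaluated.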

# GELU

Pros

- Seems to be state-of-the-art in NLP, specifically in Transformer models, where it tends to perform best.
- Avoids the vanishing gradient problem.

Cons

- Fairly new in practical use, although introduced in 2016.
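For reference, a sketch of GELU assuming its standard definition, z·Φ(z), where Φ is the standard normal cumulative distribution function. The second function is the widely used tanh approximation; both function names are illustrative.

```python
import math

def gelu(z):
    """Exact GELU: z * Phi(z), with Phi computed via the error function."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

def gelu_tanh(z):
    """Common tanh-based approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (z + 0.044715 * z ** 3)
    return 0.5 * z * (1.0 + math.tanh(inner))

for z in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"z={z:5.1f}  exact={gelu(z):9.6f}  approx={gelu_tanh(z):9.6f}")
```

The two versions agree closely; the approximation exists mainly because `tanh` was historically cheaper or more readily available than `erf` on some hardware and frameworks.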

# Scaled exponential linear unit (SELU)
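The text gives no details for SELU, so the following is only a sketch based on its usual definition: an ELU multiplied by a fixed scale, with the two constants taken from the original SELU paper (Klambauer et al., 2017). The function name is illustrative.

```python
import math

# Fixed constants from the SELU paper; they are not learned.
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu(z):
    """SELU: scale * z for positive z, scale * alpha * (exp(z) - 1) otherwise."""
    if z > 0:
        return SELU_SCALE * z
    return SELU_SCALE * SELU_ALPHA * math.expm1(z)

for z in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"z={z:5.1f}  selu={selu(z):9.5f}")
```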

# Swish

Pros

- It is continuous and differentiable at all points.
- It is simple and easy to use.
- Unlike ReLU, it does not suffer from the problem of dying neurons.
- It has been reported to perform better than various activation functions such as ReLU, Leaky ReLU, Parameterized ReLU, ELU, SELU and GELU when compared on standard datasets such as CIFAR and ImageNet.
- Because it does not saturate for positive inputs, it is less prone to vanishing gradients than sigmoid or tanh.

Cons

- It is slower to compute than ReLU and its variants, such as Leaky ReLU and Parameterized ReLU, because computing the output involves the sigmoid function.
- Its behavior can be unstable, and whether it will help on a given task cannot be predicted a priori.
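A sketch of Swish, assuming the usual definition z·sigmoid(beta·z), with beta = 1.0 as the default. The names `swish` and `beta` are illustrative.

```python
import math

def sigmoid(z):
    """Sigmoid written so that math.exp never receives a large positive argument."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def swish(z, beta=1.0):
    """Swish: the input multiplied by a sigmoid gate of itself."""
    return z * sigmoid(beta * z)

for z in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"z={z:5.1f}  swish={swish(z):9.5f}")
```

Small negative inputs give small negative outputs rather than a hard zero, and for large positive inputs the function approaches the identity.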

# Softmax

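Normally, softmax is used only for the output layer, whenever we have to classify inputs into multiple categories. For an input vector $z$ with $K$ components, the softmax function is defined by the following formula:

$$\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$

The input values can be positive, negative, zero, or greater than one, but the softmax transforms them into values between 0 and 1 that sum to 1, so they can be interpreted as probabilities.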
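A minimal softmax in plain Python could look like the sketch below. Subtracting the maximum before exponentiating is a standard trick to avoid overflow and does not change the result; the example scores are made up.

```python
import math

def softmax(logits):
    """Turn a list of real-valued scores into values in (0, 1) that sum to 1."""
    m = max(logits)                           # shift for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, -1.0, 0.0, 3.5]  # illustrative raw outputs of a final layer
probs = softmax(scores)

print(probs)
print(sum(probs))  # approximately 1.0
```

The largest score always receives the largest probability, so taking the index of the maximum gives the predicted class.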

# Mish Activation Function

Important properties of Mish:

- **Unbounded above:** Mish avoids saturation, which slows training down through near-zero gradients.
- **Bounded below:** Mish shows strong regularization effects.
- **Non-monotonic:** By preserving small negative gradients, Mish lets the gradient flow in the negative region, which allows the network to learn better.
- **Continuity:** Mish’s first derivative is continuous over the entire domain, which helps effective optimization and generalization.
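These properties can be checked numerically with a sketch that assumes the usual definition of Mish, z·tanh(softplus(z)) with softplus(z) = ln(1 + eᶻ). The function names are illustrative.

```python
import math

def softplus(z):
    """ln(1 + exp(z)), rearranged so that exp never overflows."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def mish(z):
    """Mish: z * tanh(softplus(z))."""
    return z * math.tanh(softplus(z))

for z in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(f"z={z:6.1f}  mish={mish(z):9.5f}")
```

The output at z = -1 is lower than at z = -10 (non-monotonic), negative outputs never go far below zero (bounded below), and large positive inputs pass through almost unchanged (unbounded above).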

# SoftSign Activation Function

The output of the softsign function is zero-centered, which helps the next layer during propagation. It rescales values into the range between -1 and 1, squashing large inputs much as the sigmoid function does.

In Fig. 13, sgn is the signum function, which returns ±1 depending on the sign of z.
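A sketch of softsign, assuming the usual definition z / (1 + |z|), which can also be written with the signum function because |z| = sgn(z)·z. The function names are illustrative.

```python
def softsign(z):
    """Softsign: z / (1 + |z|), with outputs strictly between -1 and 1."""
    return z / (1.0 + abs(z))

def softsign_grad(z):
    """Derivative of softsign: 1 / (1 + |z|)^2."""
    return 1.0 / (1.0 + abs(z)) ** 2

for z in (-100.0, -3.0, 0.0, 1.0, 100.0):
    print(f"z={z:7.1f}  softsign={softsign(z):8.5f}  grad={softsign_grad(z):.6f}")
```

Unlike tanh, softsign approaches its limits polynomially rather than exponentially, so its gradient decays more slowly for large |z|.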