10 Activation Functions Simplified In Neural Network

May 9, 2021

— — — — — — — — — — — — — — — — — — — — — — — — — —

“An activation function determines the output of a neural network. It also helps to normalize the output of each neuron to a certain scale.”

The inputs are what gets passed into the activation function (f).

Whenever we pass data through a neural network, we encounter an activation function in the hidden layers. What exactly is an activation function? I have got you covered.

Why do we need an activation function?

  • To normalize the output of each neuron: If the data is widely dispersed, it is difficult for any neural network to learn from it, so we bring the values down to a certain scale while keeping as much information as possible.
  • To introduce non-linearity: Each layer passes a linear combination, Wi*xi + b, to the next layer. Without a non-linear activation function, a stack of such layers is still just one linear function, so we need a non-linear activation to let the network model non-linear relationships.
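To make the non-linearity point concrete, here is a minimal sketch with one input and two tiny "layers". The weights are arbitrary illustrative values, not taken from any real model, and ReLU (covered later in this post) is used as the example non-linearity. Without an activation, the two layers collapse into a single linear layer; with one, they do not.

```python
# Two stacked "layers" with no activation: y = w2 * (w1 * x + b1) + b2.
# The weights below are arbitrary illustrative values.
w1, b1 = 2.0, 1.0
w2, b2 = -3.0, 0.5

def two_linear_layers(x):
    hidden = w1 * x + b1
    return w2 * hidden + b2

# The same mapping rewritten as ONE linear layer.
w_combined = w2 * w1
b_combined = w2 * b1 + b2

def single_linear_layer(x):
    return w_combined * x + b_combined

def relu(x):
    return max(0.0, x)

def two_layers_with_relu(x):
    # A non-linear activation between the layers breaks the collapse.
    return w2 * relu(w1 * x + b1) + b2

for x in (-2.0, 0.0, 3.0):
    print(x, two_linear_layers(x), single_linear_layer(x), two_layers_with_relu(x))
```

The first two columns of output always agree, which is why depth alone adds nothing when every layer is linear.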

There are two types of activation functions:

  1. Linear activation function
  2. Non-linear activation function

Linear activation function

This function is a straight line: the output is proportional to the input, so it can range from -∞ to +∞.

Non-linear activation function

Linear vs Non linear

Non-linear activation functions are the most widely used. They allow the network to learn complex relationships in the data and make more accurate predictions.

Here are some of the non-linear activation functions:

1. Sigmoid Function:

Left: Sigmoid, Right: Derivative of Sigmoid
Equation of the sigmoid function: sigmoid(x) = 1 / (1 + e^(-x))
  • We use sigmoid because its output lies in the range 0 to 1.
  • It is used in models that solve a binary classification problem (0 or 1, True or False, Cancer or No Cancer).
  • Sigmoid has a smooth curve and is easily differentiable.
  • Its derivative ranges from 0 to 0.25. This matters during back-propagation, because the derivative of the activation function is one of the factors in every weight update.
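As a rough sketch, sigmoid(x) = 1 / (1 + e^(-x)) and its derivative can be written in plain Python as follows. The function names are my own, and the two-branch form is just one common way to avoid overflow for large negative inputs.

```python
import math

def sigmoid(x):
    # Numerically stable form: avoids overflow in exp() for large |x|.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def sigmoid_derivative(x):
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x)); its peak is 0.25 at x = 0.
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"x={x:+.1f}  sigmoid={sigmoid(x):.4f}  derivative={sigmoid_derivative(x):.4f}")
```

The derivative is largest at x = 0 and shrinks quickly on both sides, which is where the 0 to 0.25 range comes from.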

Disadvantages :

Prone to the vanishing gradient problem: the derivatives get smaller and smaller as we move backwards through the layers, until the gradient effectively vanishes.

The output is not zero-centric.

Computationally heavy because of the exponential calculation.
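A back-of-the-envelope sketch of why the gradient vanishes: each sigmoid layer contributes a derivative factor of at most 0.25 to the chain rule. The snippet below looks only at that factor and ignores the weights, which is a simplification; the function name is hypothetical.

```python
# Largest possible value of the sigmoid derivative.
MAX_SIGMOID_DERIVATIVE = 0.25

def activation_gradient_bound(n_layers):
    # Upper bound on the product of sigmoid derivatives across n_layers layers.
    # Weights are ignored here, so this is only the activation's contribution.
    return MAX_SIGMOID_DERIVATIVE ** n_layers

for n_layers in (1, 5, 10, 20):
    print(n_layers, activation_gradient_bound(n_layers))
```

After only 10 layers the bound is already below one in a million.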

2. Tanh or Hyperbolic Tangent Function

Tanh and its derivative
Equation for tanh: tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
  • Tanh is often a better choice than sigmoid for hidden layers.
  • The range of tanh is from -1 to +1.
  • It is also a zero-centric function (its output takes both +ve and -ve values around zero).
  • The advantage over sigmoid is that a negative input is mapped to a strongly negative output instead of a value close to zero.
  • Its derivative ranges from 0 to 1.
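Here is a small sketch of tanh and its derivative, 1 - tanh(x)^2. The hand-written version follows the exponential definition directly and is only meant for moderate inputs; in practice `math.tanh` from the standard library is the safer choice.

```python
import math

def tanh_from_definition(x):
    # Direct translation of (e^x - e^-x) / (e^x + e^-x).
    # Fine for moderate |x|; exp() overflows for very large |x|.
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

def tanh_derivative(x):
    # d/dx tanh(x) = 1 - tanh(x)^2; its peak is 1 at x = 0.
    t = math.tanh(x)
    return 1.0 - t * t

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  tanh={math.tanh(x):+.4f}  derivative={tanh_derivative(x):.4f}")
```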

Disadvantages :

Prone to the vanishing gradient problem.

Computationally heavy because of the exponential calculations.

3. ReLU (Rectified Linear Unit Function)

Here comes the most used activation function in deep learning.

Left: ReLU, Right: its derivative
  • Equation: f(X) = max(0, X), where X is the input you pass.
  • Rectified means you are removing some portion of the input, to be specific, the negative values.
  • When you pass a -ve value, it returns 0.
  • For a +ve value, it returns the value unchanged, which also means that we are retaining the most significant features.
  • Its derivative acts like a step function: 0 for -ve inputs and 1 for +ve inputs.
  • It is much faster to compute than sigmoid and tanh, because no exponential is involved.
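A minimal sketch of ReLU and its step-like derivative. The derivative is not defined at exactly 0; returning 0 there is a common convention and an assumption of this example.

```python
def relu(x):
    # max(0, x): negative inputs are clipped to 0, positive inputs pass through.
    return max(0.0, x)

def relu_derivative(x):
    # Step function: 0 for negative inputs, 1 for positive inputs.
    # Undefined at x == 0; using 0 there is a convention, not a rule.
    return 1.0 if x > 0 else 0.0

for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    print(f"x={x:+.1f}  relu={relu(x):.1f}  derivative={relu_derivative(x):.1f}")
```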

Yeah, it also comes with disadvantages:

When the input is negative, ReLU returns nothing except 0. In forward propagation this is not a problem, but in back-propagation the gradient for a negative input is exactly zero, so the affected weights stop updating. This is known as the dead ReLU (or dying ReLU) problem.

ReLU is not a zero-centric function.

4. Leaky ReLU Function

Leaky Relu and its Derivative

Equation: max(0.01x, x)

  • It attempts to solve the dying ReLU problem. For negative inputs it returns 0.01x instead of 0, which extends the range of ReLU to negative values.
  • Its derivative is still a kind of step function, like ReLU's, but it is 0.01 instead of 0 for negative inputs.
  • When you pass a negative value, it returns a small negative value and not exactly zero.
  • Leaky ReLU is not fully zero-centric, but it still gives some output on the negative axis.
  • It has not been proven that Leaky ReLU is always better than ReLU.
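The following sketch shows Leaky ReLU with the 0.01 slope used in this post. The parameter name `negative_slope` is my own choice for illustration.

```python
def leaky_relu(x, negative_slope=0.01):
    # Same as max(negative_slope * x, x) when 0 < negative_slope < 1.
    return x if x > 0 else negative_slope * x

def leaky_relu_derivative(x, negative_slope=0.01):
    # Still a step, but the negative side is small and non-zero,
    # so some gradient always flows back.
    return 1.0 if x > 0 else negative_slope

for x in (-10.0, -1.0, 1.0, 10.0):
    print(f"x={x:+.1f}  leaky_relu={leaky_relu(x):+.3f}  derivative={leaky_relu_derivative(x):.2f}")
```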

5. ELU ( Exponential Linear Units )

Elu and its derivative
Equation of ELU, where α (alpha) is a hyper-parameter: f(x) = x if x > 0, and α(e^x - 1) if x ≤ 0
  • ELU attempts to solve the problems of both ReLU and Leaky ReLU.
  • There is no dead ReLU issue.
  • When the input is positive, it is returned as it is; when it is negative, ELU returns a negative value that depends on alpha.
  • We have seen that the derivatives of ReLU and Leaky ReLU are a kind of step function. The problem with that is the jump in the gradient at zero, which can make gradient descent converge less smoothly. ELU's derivative changes gradually on the negative side instead.
  • One problem with ELU is that it is computationally intensive, because of the exponential.
  • Also, it has not been proven to always be better than ReLU or Leaky ReLU.
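As a sketch, ELU is commonly defined as x for positive inputs and α(e^x - 1) otherwise. The default alpha of 1.0 below is an assumption for illustration, not a recommendation from this post.

```python
import math

def elu(x, alpha=1.0):
    # Positive inputs pass through; negative inputs saturate smoothly towards -alpha.
    # math.expm1(x) computes e^x - 1 accurately for small x.
    return x if x > 0 else alpha * math.expm1(x)

def elu_derivative(x, alpha=1.0):
    # 1 on the positive side, alpha * e^x on the negative side (never exactly 0).
    return 1.0 if x > 0 else alpha * math.exp(x)

for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"x={x:+.1f}  elu={elu(x):+.4f}  derivative={elu_derivative(x):.4f}")
```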

6. PRELU ( Parametric Relu )

Leaky Relu vs P-Relu
Equation of PReLU: f(x) = x if x > 0, and αx if x ≤ 0, where α is a learnable parameter
  • Parametric ReLU is a variant of ReLU in which the slope for negative inputs is learned instead of fixed.
  • In the negative region it has a small slope α, which avoids the dead ReLU problem.
  • In the positive region it returns the input unchanged, which also means that we are retaining the most significant features.


Disadvantages :

An extra parameter to train.

Risk of overfitting on smaller datasets.
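To illustrate what "learnable" means here, the sketch below treats PReLU as x for positive inputs and α·x otherwise, and takes one gradient-descent step on α for a single example. The squared-error loss, the learning rate and all function names are assumptions made up for this illustration.

```python
def prelu(x, alpha):
    # Positive inputs pass through; negative inputs are scaled by the learned alpha.
    return x if x > 0 else alpha * x

def prelu_grad_alpha(x):
    # d prelu / d alpha: x for non-positive inputs, 0 otherwise.
    return 0.0 if x > 0 else x

def sgd_step_on_alpha(x, target, alpha, learning_rate=0.1):
    # One gradient-descent step on alpha for the loss 0.5 * (output - target)**2.
    error = prelu(x, alpha) - target
    grad = error * prelu_grad_alpha(x)
    return alpha - learning_rate * grad

alpha = 0.25
new_alpha = sgd_step_on_alpha(x=-2.0, target=-1.0, alpha=alpha)
print(alpha, new_alpha)
```

Only negative inputs ever change α, which is why the extra parameter is cheap but can still overfit when there is little data.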

7. SoftMax Function

Equation of softmax: softmax(z_i) = e^(z_i) / Σ_j e^(z_j)
  • It is used in multi-class classification problems to compute a probability for each target class.
  • It is applied in the output (final) layer, where it turns the raw scores into probabilities that sum to 1.
  • Softmax is also used in multinomial logistic regression.
  • One disadvantage is that it is computationally heavy, but it is the most widely used function in the output layer for multi-class problems.
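A minimal softmax sketch in plain Python: exponentiate each score and divide by the sum of the exponentials. Subtracting the maximum score first is a standard numerical-stability trick that is not part of the description above; the example scores are made up.

```python
import math

def softmax(scores):
    # Subtracting the maximum does not change the result but avoids overflow in exp().
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])  # made-up scores for three classes
print([round(p, 3) for p in probs], sum(probs))
```

The largest score gets the largest probability, and the outputs always add up to 1.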

8. Swish (Self- Gated ) Function

Swish Vs ReLU
  • Mathematical formula: Y = X * sigmoid(X)
  • Derivative of Swish, dy/dX= Y + sigmoid(X) * (1-Y)
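The two formulas above can be checked numerically. This sketch implements Y = X * sigmoid(X) and the derivative Y + sigmoid(X) * (1 - Y), then compares the derivative with a central finite difference. The helper names and the step size are my own choices.

```python
import math

def sigmoid(x):
    # Numerically stable sigmoid.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def swish(x):
    return x * sigmoid(x)

def swish_derivative(x):
    # dy/dx = y + sigmoid(x) * (1 - y), with y = swish(x).
    y = swish(x)
    return y + sigmoid(x) * (1.0 - y)

def numerical_derivative(f, x, h=1e-5):
    # Central finite difference, used only as a sanity check.
    return (f(x + h) - f(x - h)) / (2.0 * h)

for x in (-3.0, -1.0, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  analytic={swish_derivative(x):+.6f}  numeric={numerical_derivative(swish, x):+.6f}")
```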

The design of Swish was inspired by the use of the sigmoid function for gating in LSTMs.

It addresses the following points:

  1. The dead ReLU problem: negative values are not all mapped to zero.
  2. The vanishing gradient problem.
  3. Its derivative is not a step function.
  4. Swish also produces negative outputs, so it is closer to zero-centric than ReLU.

Swish is usually recommended for very deep neural networks, for example networks with more than 40 layers.

Disadvantage: computationally expensive.

9. MaxOut Function

A learnable function that learns its own shape
Equation of the Maxout function: f(x) = max(w1*x + b1, w2*x + b2, ..., wk*x + bk)
  • Maxout activation is a generalization of the ReLU and Leaky ReLU functions.
  • It is a learnable activation function: during back-propagation its own weights and biases are updated along with the rest of the network.
  • PReLU has a learnable parameter α only for negative inputs, whereas Maxout's learnable parameters apply to all inputs.
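To see why Maxout generalizes ReLU and Leaky ReLU, here is a simplified sketch for a single scalar input: Maxout takes the maximum over several linear pieces w*x + b. The (weight, bias) pairs below are hand-picked for illustration; in a real network they would be learned.

```python
def maxout(x, pieces):
    # pieces: list of (weight, bias) pairs, one per linear piece.
    return max(w * x + b for w, b in pieces)

# Hand-picked pieces that reproduce familiar activations.
relu_like = [(1.0, 0.0), (0.0, 0.0)]         # max(x, 0)
leaky_relu_like = [(1.0, 0.0), (0.01, 0.0)]  # max(x, 0.01x)

for x in (-3.0, 2.0):
    print(x, maxout(x, relu_like), maxout(x, leaky_relu_like))
```

With other learned pieces, the same unit can take shapes that neither ReLU nor Leaky ReLU can express.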

10. SoftPlus

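  • It is similar to the ReLU function but relatively smooth: f(x) = ln(1 + e^x).
  • Its range is from 0 to +∞.
  • The derivative of softplus is not a step function; it is the sigmoid function.
  • It does not have the dead ReLU issue, because its gradient never becomes exactly zero, and it is less affected by vanishing gradients for positive inputs, where the gradient approaches 1.
  • But softplus is more computationally expensive than ReLU, because of the exponential.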
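A short sketch of softplus, ln(1 + e^x), and its derivative, which is the sigmoid function. The rearranged form used below is one common way to avoid overflow for large inputs; the function names are my own.

```python
import math

def softplus(x):
    # Stable form of ln(1 + e^x): avoids overflow for large positive x.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def softplus_derivative(x):
    # The derivative of softplus is the sigmoid function (stable two-branch form).
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"x={x:+.1f}  softplus={softplus(x):.4f}  relu={max(0.0, x):.1f}  derivative={softplus_derivative(x):.4f}")
```

For large positive or negative inputs softplus is almost identical to ReLU; the two differ mainly near zero, where softplus is smooth.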


— — — — — — — — — — — — — — — -

Generally speaking, each of these activation functions has its own advantages and disadvantages. There is no rule that says which activation functions work well and which do not. What is good or bad for a given problem must be found out by experiment.