Important Loss Functions Used in Deep Learning

Amal
5 min read · Feb 20, 2022


What is a Loss Function?

The loss in a neural network tells us how far the predicted value differs from the actual value.

In general terms, a loss function measures the error for a single training example, whereas the cost function is the average (or sum) of the loss over the entire training data. The cost function is what we minimize in the optimization problem: we aim to reduce the loss over the whole dataset, as the expression below shows.
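For example, with N training examples, true values yᵢ, and predictions ŷᵢ, the cost is simply the average of the per-example losses:

J = (1/N) · Σᵢ L(yᵢ, ŷᵢ)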

Today we will take a look at 8 major loss functions used in deep learning.

1. L1 Loss (Least Absolute Deviation)

  • It minimizes the error, defined as the sum of all the absolute differences between the true values and the predicted values.
  • L1 loss is a good choice when the data contains outliers, because it penalizes large errors only linearly and is therefore more robust to them than L2 loss; the formula is shown below.
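With true values yᵢ and predictions ŷᵢ, the L1 loss is:

L1 = Σᵢ |yᵢ − ŷᵢ|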

2. L2 Loss (Least Square Error)

  • It also minimizes the error, defined here as the sum of all the squared differences between the true values and the predicted values.
  • The disadvantage of L2 loss is that when there are outliers, those points dominate the loss. For example, suppose the true value is 1 and we make 10 predictions: nine of them are close to 1 and one of them is 1000. The squared error of that single outlier (≈ 998,001) completely dominates the total loss, as the short sketch below illustrates.
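A minimal NumPy sketch of that example (the predictions are made-up values, nine of them near 1 and one outlier at 1000):

import numpy as np

y_true = np.ones(10)                                # the true value is 1 for every sample
y_pred = np.array([1.1, 0.9, 1.0, 1.05, 0.95,
                   1.0, 1.02, 0.98, 1.0, 1000.0])   # one outlier prediction

l1 = np.abs(y_true - y_pred).sum()     # L1: the outlier contributes ~999, linearly
l2 = np.square(y_true - y_pred).sum()  # L2: the outlier contributes ~998001

print(l1)  # ~999.3
print(l2)  # ~998001.0, almost entirely the outlier's squared error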

Let’s now plot the L1 and L2 loss functions:

import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

# Candidate predictions from -1 to 1, with the actual value fixed at 0
# (TensorFlow 1.x API: in TF 2.x, use tf.linspace and eager execution instead of a Session)
x_guess = tf.lin_space(-1., 1., 100)
x_actual = tf.constant(0., dtype=tf.float32)

# Based on the equations: absolute error for L1, squared error for L2
l1_loss = tf.abs(x_guess - x_actual)
l2_loss = tf.square(x_guess - x_actual)

with tf.Session() as sess:
    x_, l1_, l2_ = sess.run([x_guess, l1_loss, l2_loss])
    plt.plot(x_, l1_, label='l1_loss')
    plt.plot(x_, l2_, label='l2_loss')
    plt.legend()
    plt.show()

3. Huber Loss

  • Huber loss is often used in regression problems. Compared with L2 loss, it is less sensitive to outliers, because beyond a threshold the loss becomes a linear (rather than quadratic) function of the residual.
  • Here, δ is a user-set threshold, y is the true value, and f(x) is the predicted value; the piecewise definition is shown below.
  • The advantage of this is that when the residual is small, the loss behaves like the L2 norm (quadratic), and when the residual is large, it behaves like the L1 norm (linear in the residual).
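For reference, the standard piecewise definition of the Huber loss is:

L_δ(y, f(x)) = ½ · (y − f(x))²             if |y − f(x)| ≤ δ
L_δ(y, f(x)) = δ · |y − f(x)| − ½ · δ²     otherwise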

4. Pseudo-Huber loss function

  • A smooth approximation of the Huber loss, chosen so that derivatives of every order exist.
Here, δ is the set parameter; the larger its value, the steeper the linear parts on both sides. The formula is given below.
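For reference, writing the residual as a = y − f(x), the pseudo-Huber loss is:

L_δ(a) = δ² · ( √(1 + (a/δ)²) − 1 )

For small residuals this behaves like a²/2 (L2-like), and for large residuals it grows linearly with slope δ (L1-like).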

5. Hinge Loss

  • Hinge loss is often used for binary classification problems, where the ground truth is t = 1 or −1 and the predicted score is y = Wx + B (W: weights, B: bias, x: data point).
  • It is used for “maximum-margin” classification and is most notably associated with the SVM (Support Vector Machine) classifier.
  • The loss can be written as l(y) = max(0, 1 − y·f(x)), where y is the ground-truth label (+1 or −1) and f(x) is the predicted score. When y·f(x) ≥ 1 the loss is zero, and when y·f(x) < 1 the loss grows linearly as the prediction moves further from the label. So the better f(x) agrees with y, the smaller the loss; a small sketch follows below.
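A minimal NumPy sketch of the hinge loss (the labels and scores below are made-up values for illustration):

import numpy as np

# Hypothetical ground-truth labels in {-1, +1} and raw model scores f(x) = Wx + B
y_true = np.array([1, -1, 1, -1])
scores = np.array([2.3, -0.8, 0.4, 1.5])

# Hinge loss per sample: max(0, 1 - y * f(x))
hinge = np.maximum(0.0, 1.0 - y_true * scores)

print(hinge)         # [0.  0.2 0.6 2.5]
print(hinge.mean())  # 0.825, the average hinge loss over the batch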

6. Cross Entropy Loss

  • Cross-entropy loss is mainly applied to binary classification problems. The predicted value is a probability, and the loss is defined as the cross entropy between the prediction and the label (the equation is given below). Note the valid range: the predicted value should be a probability in [0, 1].
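For a label y ∈ {0, 1} and a predicted probability p, the binary cross-entropy loss is:

L(y, p) = −[ y · log(p) + (1 − y) · log(1 − p) ]

When y = 1 this reduces to −log(p), which approaches 0 as p → 1 and grows without bound as p → 0; this is the characteristic shape of the cross-entropy loss curve.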

7. Sigmoid-Cross-Entropy Loss

  • The cross-entropy loss above requires the predicted value to be a probability. Generally, however, we compute raw scores = w·x + b. Passing this score through the sigmoid function compresses its range to (0, 1), after which the cross-entropy can be applied (see the expression below). To know what the sigmoid is, check out my article about it here.
  • The sigmoid also smooths the predicted value: raw scores such as 0.1 and 0.01 end up much closer to each other after passing through the sigmoid, so the sigmoid cross-entropy loss grows less steeply when the prediction is far from the label than it would on the raw scores.
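Putting the two pieces together, for a label y ∈ {0, 1} and a raw score s = w·x + b, the sigmoid cross-entropy loss is:

σ(s) = 1 / (1 + e^(−s))
L(s, y) = −[ y · log(σ(s)) + (1 − y) · log(1 − σ(s)) ]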

8. Softmax Cross-Entropy Loss

  • First, the softmax function converts a vector of raw scores into a corresponding vector of probabilities.

Here is the definition of the softmax function:
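In standard notation, for a score vector f = (f₁, …, f_C), the probability assigned to class j is:

softmax(f)ⱼ = e^(fⱼ) / Σₖ e^(fₖ)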

  • As above, the softmax ‘squashes’ a k-dimensional vector of real values into a k-dimensional vector whose entries lie in the range [0, 1], while ensuring that they sum to 1.
  • According to the definition of cross entropy, probabilities are required as input. Sigmoid cross-entropy loss uses the sigmoid to convert the score vector into a probability vector, while softmax cross-entropy loss uses the softmax function to do the same.
  • Recall the general definition of cross-entropy loss: H(p, q) = −Σₓ p(x) · log q(x).
  • Here p(x) is the probability that class x is the correct class; its value can only be 0 or 1. This is the prior (ground-truth) distribution.
  • q(x) is the predicted probability that class x is the correct class; its value lies in (0, 1).
  • So for a classification problem with C classes in total, exactly one of the p(xⱼ) (1 ≤ j ≤ C) is 1 and the remaining C − 1 are 0, because only one class can be the correct one.
  • The definition of softmax cross-entropy loss then follows naturally, as shown below, where fⱼ is the score of class j and f_yᵢ is the score of the ground-truth class for example i.
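Combining the softmax with the cross-entropy above, the loss for a single example i is:

Lᵢ = −log( e^(f_yᵢ) / Σⱼ e^(fⱼ) ) = −f_yᵢ + log Σⱼ e^(fⱼ)

A minimal NumPy sketch for one example (the scores are made-up values for a 3-class problem):

import numpy as np

# Hypothetical raw scores for 3 classes and the index of the true class
scores = np.array([2.0, 1.0, 0.1])
true_class = 0

# Softmax: convert the scores into probabilities that sum to 1
probs = np.exp(scores) / np.exp(scores).sum()

# Softmax cross-entropy: negative log-probability of the true class
loss = -np.log(probs[true_class])

print(probs)  # [0.659 0.242 0.099] (approximately)
print(loss)   # ~0.417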

Hope you learned something new today, Happy Learning!
