Key points to note about weight initialization:
- Weights should be small.
- Weights should not all be the same.
- Weights should have enough variance.
Why should weights not be very small? Very small weights shrink the signal at every layer, which causes the vanishing gradient problem.
The vanishing gradient and exploding gradient problems occur mainly because of poor weight initialization and a poor choice of activation function.
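To see the vanishing effect concretely, here is a minimal NumPy sketch (the 100-unit layers and the 0.01 standard deviation are arbitrary choices for illustration): pushing a signal through tanh layers with very small weights collapses the activations toward zero, and the gradients shrink along with them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Push a random signal through 10 tanh layers whose weights are
# deliberately tiny (hypothetical 100-unit layers, for illustration).
x = rng.normal(size=(1, 100))
for layer in range(10):
    W = rng.normal(0.0, 0.01, size=(100, 100))  # very small weights
    x = np.tanh(x @ W)
    print(f"layer {layer + 1}: activation std = {x.std():.2e}")
# The activation std drops by roughly 10x per layer, so gradients
# flowing back through these layers vanish in the same way.
```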
Let’s now discuss the three main weight initialization techniques in use:
1) Xavier/Glorot Initializer
Glorot Normal:
- Also called the Xavier Normal initializer.
- Weights are drawn from a normal distribution with mean 0 and standard deviation sqrt(2 / (fan_in + fan_out)), where fan_in and fan_out are the number of input and output units of the layer.
- Wij ~ Normal Distribution (0, std), where std = sqrt(2 / (fan_in + fan_out))
Glorot Uniform:
- Also called the Xavier Uniform initializer.
- TensorFlow uses glorot_uniform by default.
- Weights are drawn from a uniform distribution with limits [-limit, +limit].
- Wij ~ Uniform Distribution [-limit, limit], where limit = sqrt(6 / (fan_in + fan_out))
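As a minimal sketch of both Glorot variants (NumPy assumed; the 784 -> 256 layer size is just an example):

```python
import numpy as np

rng = np.random.default_rng(42)

def glorot_normal(fan_in, fan_out):
    # Normal distribution with mean 0, std = sqrt(2 / (fan_in + fan_out))
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

def glorot_uniform(fan_in, fan_out):
    # Uniform distribution in [-limit, +limit],
    # limit = sqrt(6 / (fan_in + fan_out))
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

# e.g. a 784 -> 256 layer (sizes are arbitrary)
W = glorot_uniform(784, 256)
print(W.shape, W.min(), W.max())
```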
2) He Initializer
He Normal:
- Weights are drawn from a normal distribution with mean 0 and standard deviation sqrt(2 / fan_in).
- Wij ~ Normal Distribution (0, std), where std = sqrt(2 / fan_in)
He Uniform:
- Weights are drawn from a uniform distribution with limits [-limit, +limit].
- Wij ~ Uniform Distribution [-limit, limit], where limit = sqrt(6 / fan_in)
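A matching sketch for the two He variants (NumPy assumed, same conventions as above):

```python
import numpy as np

rng = np.random.default_rng(42)

def he_normal(fan_in, fan_out):
    # Normal distribution with mean 0, std = sqrt(2 / fan_in)
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

def he_uniform(fan_in, fan_out):
    # Uniform distribution in [-limit, +limit], limit = sqrt(6 / fan_in)
    limit = np.sqrt(6.0 / fan_in)
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))
```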
3) LeCun Initializer
LeCun Normal:
- Weights are drawn from a normal distribution with mean 0 and standard deviation sqrt(1 / fan_in).
- Wij ~ Normal Distribution (0, std), where std = sqrt(1 / fan_in)
LeCun Uniform:
- Weights are drawn from a uniform distribution with limits [-limit, +limit].
- Wij ~ Uniform Distribution [-limit, limit], where limit = sqrt(3 / fan_in)
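And the corresponding sketch for the two LeCun variants (NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(42)

def lecun_normal(fan_in, fan_out):
    # Normal distribution with mean 0, std = sqrt(1 / fan_in)
    std = np.sqrt(1.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

def lecun_uniform(fan_in, fan_out):
    # Uniform distribution in [-limit, +limit], limit = sqrt(3 / fan_in)
    limit = np.sqrt(3.0 / fan_in)
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))
```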
Standard Approach:
- Xavier/Glorot initialization works well with hyperbolic tangent (tanh) and logistic (sigmoid) activations.
- He initialization works well with the Rectified Linear Unit (ReLU) and its variants.
- LeCun initialization works well with the Scaled Exponential Linear Unit (SELU) and hyperbolic tangent (tanh).
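In Keras, these pairings come down to choosing kernel_initializer per layer. A small sketch (the layer sizes and input shape are arbitrary):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(784,)),  # hypothetical input size
    # tanh pairs with Glorot (glorot_uniform is the Keras default)
    tf.keras.layers.Dense(128, activation="tanh",
                          kernel_initializer="glorot_uniform"),
    # ReLU pairs with He initialization
    tf.keras.layers.Dense(64, activation="relu",
                          kernel_initializer="he_normal"),
    # SELU pairs with LeCun initialization
    tf.keras.layers.Dense(10, activation="selu",
                          kernel_initializer="lecun_normal"),
])
model.summary()
```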
Hope you learned something new today. Happy learning!