03 Jan 2024

Vanishing gradient problem

The problem

As more layers using certain activation functions are added to a network, the gradients of the loss function approach zero, making the network harder to train.

This is because some activation functions (for example the sigmoid function) squish a large input space into a small output space between 0 and 1. A large change in the input of the sigmoid therefore causes only a small change in the output, so the derivative is small.

When the inputs of the sigmoid function become very large or very small, its derivative is close to zero.
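
A minimal sketch of this in Python with NumPy (the library choice is an assumption, the note names none): the sigmoid derivative peaks at 0.25 at zero and collapses toward zero for large inputs of either sign.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sigmoid_derivative(x):
        s = sigmoid(x)
        return s * (1.0 - s)  # peaks at 0.25 when x = 0

    for x in [-10.0, -2.0, 0.0, 2.0, 10.0]:
        print(f"x = {x:6.1f}  sigmoid = {sigmoid(x):.5f}  derivative = {sigmoid_derivative(x):.5f}")
    # derivative is 0.25 at x = 0 but roughly 0.00005 at |x| = 10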

For shallow networks this isn't much of a problem, but in deep networks the chain rule multiplies these small derivatives together layer by layer, so the gradient reaching the early layers shrinks roughly exponentially with depth and those layers barely learn.
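
A rough illustration of the depth effect (it ignores the weight matrices and just multiplies per-layer sigmoid derivatives at randomly sampled pre-activations, which is a simplifying assumption):

    import numpy as np

    rng = np.random.default_rng(0)

    def gradient_scale(num_layers):
        # by the chain rule, the gradient reaching an early layer is (roughly)
        # a product of per-layer derivatives; each sigmoid derivative is at most 0.25
        x = rng.normal(size=num_layers)
        s = 1.0 / (1.0 + np.exp(-x))
        return np.prod(s * (1.0 - s))

    for depth in [2, 5, 10, 20, 50]:
        print(f"{depth:2d} layers -> gradient scale ~ {gradient_scale(depth):.2e}")
    # the product shrinks roughly like 0.2**depth, so early layers get almost no signal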

Solutions

→ ReLU: its derivative is 1 for positive inputs, so it doesn’t shrink the gradient (see the sketch after this list)

→ Residual networks: skip connections give the gradient a direct path back to earlier layers (also shown below)

→ Batch normalization: normalizes layer inputs so they stay in the range where the activation’s derivative isn’t tiny
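
A minimal sketch of the first two fixes (NumPy assumed, and the residual part is just the scalar calculus, not a full layer): the ReLU derivative is exactly 1 for positive inputs, and a skip connection y = x + f(x) adds an identity term to the gradient, so it stays close to 1 even when f'(x) is tiny.

    import numpy as np

    def relu_derivative(x):
        # 1 for positive inputs, 0 otherwise: gradients pass through unchanged
        return (x > 0).astype(float)

    def residual_block_gradient(local_grad):
        # gradient through y = x + f(x): the "+ 1" from the identity path keeps
        # the signal alive even when f'(x) (local_grad) is tiny
        return 1.0 + local_grad

    print(relu_derivative(np.array([-3.0, 0.5, 10.0])))  # [0. 1. 1.]
    print(residual_block_gradient(1e-6))                 # ~1.0, the gradient survives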