What is the vanishing gradient problem?
The vanishing gradient problem is an issue that sometimes arises when training machine learning models with gradient descent. It most often occurs in neural networks with many layers, such as deep learning systems, and it also occurs in recurrent neural networks. The key point is that the partial derivatives used to compute the gradient become progressively smaller as one goes deeper into the network, because backpropagation applies the chain rule and multiplies one derivative factor per layer. Since the gradients control how much the network learns during training, gradients that are very small or zero allow little to no training to take place, leading to poor predictive performance.
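The shrinking effect can be illustrated with a deliberately simplified sketch. It assumes a chain of one-unit layers with sigmoid activations and a shared weight; the function name `chain_gradient` and its default values are hypothetical and chosen only for illustration. Each layer contributes a factor of at most 0.25 times the weight, so the product decays quickly with depth.

```python
import math


def sigmoid(x):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-x))


def chain_gradient(depth, weight=1.0, x=0.5):
    """Derivative of the final output with respect to the input x
    for a chain of `depth` one-unit sigmoid layers (illustrative toy model)."""
    activation = x
    grad = 1.0
    for _ in range(depth):
        activation = sigmoid(weight * activation)
        # Chain rule: multiply by this layer's local derivative,
        # weight * sigmoid'(z), where sigmoid'(z) = s * (1 - s) <= 0.25.
        grad *= weight * activation * (1.0 - activation)
    return grad


if __name__ == "__main__":
    for depth in (1, 5, 10, 20):
        # The printed gradient shrinks rapidly as depth grows.
        print(depth, chain_gradient(depth))
```

With the default weight of 1.0, the gradient after n layers is bounded above by 0.25 to the power n, which is why the earliest layers in a deep sigmoid network can receive almost no training signal.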
One remedy is layer-wise pretraining: the network is pretrained one layer at a time, and backpropagation is then used for fine-tuning.
Another technique, used in residual networks, introduces bypass (skip) connections that link a layer to layers further back than the one immediately preceding it. This lets gradients reach distant layers more directly, before they can be attenuated to small or zero values.
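The effect of a bypass connection can be sketched with the same kind of toy model. This is an assumption-laden illustration, not a real residual network: each layer is a single sigmoid unit, and the bypass simply adds the layer's input to its output. The function name `input_gradient` and its parameters are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-x))


def input_gradient(depth, residual, weight=1.0, x=0.5):
    """Derivative of the final output with respect to the input x
    for a chain of one-unit sigmoid layers, with or without bypass connections."""
    activation = x
    grad = 1.0
    for _ in range(depth):
        out = sigmoid(weight * activation)
        local = weight * out * (1.0 - out)  # derivative of the sigmoid branch
        if residual:
            # Bypass: output = input + f(input), so the local derivative is
            # 1 + f'(input) and can never fall below 1 in this toy model.
            activation = activation + out
            grad *= 1.0 + local
        else:
            # Plain layer: the local derivative is f'(input) <= 0.25 * weight.
            activation = out
            grad *= local
    return grad


if __name__ == "__main__":
    print("plain   :", input_gradient(20, residual=False))
    print("residual:", input_gradient(20, residual=True))
```

The identity term contributed by the bypass is what keeps the product of per-layer derivatives from collapsing toward zero, even when the sigmoid branch itself saturates.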
Rectified linear units (ReLUs)
When using rectified linear units, the typical sigmoidal activation function used for node output is replaced with a new function: f(x) = max(0, x). This activation saturates in only one direction and is thus more resilient to vanishing gradients.
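The one-sided saturation can be seen by comparing derivatives directly. The sketch below is illustrative only; the helper names are hypothetical, and the ReLU derivative at exactly zero is taken to be 0, which is a common convention rather than a mathematical necessity.

```python
import math


def sigmoid_grad(x):
    """Derivative of the logistic sigmoid; approaches 0 for large |x|."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


def relu(x):
    """Rectified linear unit: f(x) = max(0, x)."""
    return max(0.0, x)


def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise
    (the value at x == 0 is a convention assumed here)."""
    return 1.0 if x > 0 else 0.0


if __name__ == "__main__":
    for x in (-10.0, -1.0, 1.0, 10.0):
        # The sigmoid derivative is small at both extremes;
        # the ReLU derivative stays at 1 for every positive input.
        print(x, sigmoid_grad(x), relu_grad(x))
```

For positive inputs the ReLU derivative is exactly 1, so it does not shrink the gradient as it passes through the layer. For negative inputs the derivative is 0, which is the one direction in which the activation still saturates.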