The goal of many initialization methods is to preserve the gradient signal as it propagates backwards through each layer of a neural network (Glorot & Bengio, 2010; He et al., 2015; Saxe et al., 2014). In RNNs, failing to preserve this signal may lead to the well-known vanishing and exploding gradient problems (Hochreiter, 1998). In general, learning in neural nets is much faster when the composite scales of all layers remain near a constant of the problem (Saxe et al., 2014).
Results from this line of work suggest that initially preserving scale is conducive to learning. However, any update rule which does not enforce scale-preservation will violate this condition (isometry) after the first iteration. Does preserving scale continue to speed up learning after the first iteration?
There is evidence for and against. On one hand, if scale were preserved throughout all epochs, the network would fail to learn non-isometric projections. However, there is circumstantial evidence for the benefit of preserving scale during training:
Co-training with both unsupervised and supervised objectives leads to faster learning and more generalizable networks (Rasmus et al., 2015). At least some of this result is due to the scale-preserving effect of the unsupervised objective, which effectively regularizes the singular values of each weight matrix.
The contribution of this work is to investigate the utility of scale-preserving constraints. We separate this from the other effects of the unsupervised objective by normalizing scale without optimizing reconstruction. Preliminary results with two different methods indicate that at least for the first few iterations of training, scale normalization leads to faster learning.
2 Preserving Scale
The forward scale of a layer with weight matrix $W$ is its effect on the length of its input $x$: $\|Wx\| / \|x\|$. This is a function of the singular values of $W$ (if $x$ were a singular vector, it would be scaled by the corresponding singular value).
The backward scale of a layer is $\|W^\top \delta\| / \|\delta\|$, where $\delta$ is the incoming gradient. Just as $W$ is used to calculate the forward pass, $W^\top$ is used to calculate the outgoing gradient in the backward pass. The forward and backward scales are both functions of the singular values of $W$, which are invariant under transposition.
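These definitions are easy to check numerically. A minimal NumPy sketch (matrix sizes are illustrative, not taken from the experiments):

```python
import numpy as np

# Illustrative check of the forward and backward scales of a single layer.
rng = np.random.default_rng(0)
W = rng.standard_normal((32, 32))      # square weight matrix (size illustrative)
x = rng.standard_normal(32)            # input vector
delta = rng.standard_normal(32)        # incoming gradient

forward_scale = np.linalg.norm(W @ x) / np.linalg.norm(x)
backward_scale = np.linalg.norm(W.T @ delta) / np.linalg.norm(delta)

# Both scales lie between the smallest and largest singular value of W,
# and W and W.T share the same singular values.
s = np.linalg.svd(W, compute_uv=False)
assert s.min() - 1e-9 <= forward_scale <= s.max() + 1e-9
assert s.min() - 1e-9 <= backward_scale <= s.max() + 1e-9
assert np.allclose(s, np.linalg.svd(W.T, compute_uv=False))
```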
The magnitude of the gradient at a given layer is the product of the original gradient’s magnitude and the scales of all subsequent layers. If all these scales (singular values) are greater than one, gradients explode; if they are all less than one, gradients vanish.
Saxe et al. (2014) suggest using orthogonal matrices (or rectangular matrices with unit singular values) to avoid the complications of scale. This effectively makes all matrices non-scaling (though length can still be reduced when components of $x$ lie in the nullspace of $W$). This orthonormal initialization scheme contrasts with the popular initializations of Glorot & Bengio (2010) and He et al. (2015), which are based on preserving the variance of the forward and backward passes. The benefits of the singular-value interpretation of scale over the variance interpretation are thus:
It unites the notions of forward and backward scale (both are functions of the singular values).
It yields a constructive method for creating non-scaling weight matrices (orthonormal initialization).
Scaling as a function of singular values relies on far fewer assumptions than scaling as variance preservation.
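The constructive method in the second point can be sketched with a QR decomposition, a standard way to draw a random matrix with unit singular values (the function name and sizes here are ours, for illustration):

```python
import numpy as np

def orthonormal_init(fan_out, fan_in, seed=None):
    """Return a (fan_out, fan_in) matrix whose singular values are all 1."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((max(fan_out, fan_in), min(fan_out, fan_in)))
    q, r = np.linalg.qr(a)            # q has orthonormal columns
    q = q * np.sign(np.diag(r))       # fix the sign ambiguity of QR
    return q if fan_out >= fan_in else q.T

W = orthonormal_init(64, 32, seed=0)
s = np.linalg.svd(W, compute_uv=False)
assert np.allclose(s, 1.0)            # non-scaling: all singular values are 1
```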
3 Normalizing Scale
Here we propose multiple ways to normalize scale during learning. The challenge is to preserve scale, while retaining what is learned after each update.
It should be noted that simply making $W$ unitary after each update (setting all singular values to 1) would decorrelate the weights and destroy information contained in $W$. It may not make sense for every neuron’s output to be in the same range. So while orthonormalizing $W$ works for initialization, it does not make sense in the course of training.
3.1 Determinant Normalization
One method of preserving the scale of each layer is to set its pseudo-determinant to one. Just as the determinant of a square matrix is the product of its eigenvalues, the pseudo-determinant is the product of a (not necessarily square) matrix’s nonzero singular values ($\widetilde{\det}(W) = \prod_i \sigma_i$). It is an aggregate measure of the scales of $W$.
If our update rule on $W$ gives us a new weight matrix $W'$, we can determinant-normalize $W'$:

$$W \leftarrow \frac{W'}{\widetilde{\det}(W')^{1/n}},$$

where $\widetilde{\det}(W')$ is the product of the $n$ nonzero singular values of $W'$. This sets the pseudo-determinant of $W$ to one. Alternatively, determinant normalization sets the geometric mean of the scales to one, in a sense "centering" the singular values.
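A minimal sketch of this update (the SVD-based implementation and names are ours; computing the geometric mean in the log domain is an implementation choice, not part of the definition):

```python
import numpy as np

def det_normalize(W_new, tol=1e-12):
    """Rescale W_new so its pseudo-determinant (product of nonzero
    singular values) is one, i.e. the geometric mean of the scales is 1."""
    s = np.linalg.svd(W_new, compute_uv=False)
    s = s[s > tol]                        # keep only nonzero singular values
    gmean = np.exp(np.mean(np.log(s)))    # geometric mean, in log domain
    return W_new / gmean

rng = np.random.default_rng(0)
W = det_normalize(3.0 * rng.standard_normal((64, 32)))

# After normalization, the geometric mean of the singular values is 1.
s = np.linalg.svd(W, compute_uv=False)
assert np.isclose(np.exp(np.mean(np.log(s))), 1.0)
```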
Determinant normalization speeds up learning over Glorot and orthonormal initialization. As expected, the benefits are most pronounced during the beginning of training (1 epoch is 500 batches).
3.2 Scale Normalization
Determinant normalization has the attractive property of being a function of $W$ alone, not of the data $x$. However, calculating the pseudo-determinant runs into multiple problems:
Calculating the pseudo-determinant requires an SVD to get the singular values, and is thus prohibitively expensive.
When singular values are less than 1 and matrices are large, it is very easy to get numerical underflow ($\prod_i \sigma_i \to 0$).
To address this, we propose an alternative notion of scale. Empirically, the scaling of each vector $x$ by $W$ can be measured by the ratio $\|Wx\| / \|x\|$.

Thus, we can scale-normalize $W'$ by dividing by the average scale observed over a mini-batch:

$$W \leftarrow \frac{W'}{\mathbb{E}_x\left[\|W'x\| / \|x\|\right]}.$$

This sets the expected scaling to one, such that on average, $W$ will preserve scale. As there are many singular values, $\|Wx\| / \|x\|$ differs depending on $x$, and the estimate may be noisy depending on the batch, especially for small batches.
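The batch-based update can be sketched as follows (names and sizes are illustrative; we assume the mini-batch stores one input per row):

```python
import numpy as np

def scale_normalize(W_new, X_batch, eps=1e-12):
    """Divide W_new by the average ratio ||W x|| / ||x|| over a mini-batch,
    so that on average the layer preserves input length.  No SVD needed."""
    in_norms = np.linalg.norm(X_batch, axis=1) + eps       # ||x|| per example
    out_norms = np.linalg.norm(X_batch @ W_new.T, axis=1)  # ||W x|| per example
    return W_new / np.mean(out_norms / in_norms)

rng = np.random.default_rng(0)
X = rng.standard_normal((128, 32))    # mini-batch, one input per row
W = scale_normalize(5.0 * rng.standard_normal((64, 32)), X)

# On the same batch, the measured average scale is now 1.
ratios = np.linalg.norm(X @ W.T, axis=1) / np.linalg.norm(X, axis=1)
assert np.isclose(ratios.mean(), 1.0)
```

Unlike determinant normalization, the result depends on the batch: a different mini-batch would give a slightly different estimate of the average scale.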
We noted that scale-normalization does not help after the first epoch. When running scale-normalization for only the first epoch (100 iterations), it achieves similar results to determinant normalization.
4 Future Work
Preliminary results indicate that maintaining isometry is useful for learning, at least in the beginning. Future work will relate scale-normalization to batch-normalization, and more advanced optimization algorithms. Experiments on larger datasets such as CIFAR10 and convolutional architectures are in progress. We believe that investigating how isometry interplays with learning speed will bring insight into how to speed up learning in the future.
- Bourlard & Kamp (1988) H. Bourlard and Y. Kamp. Auto-association by multilayer perceptrons and singular value decomposition. Biological Cybernetics, 59(4):291–294, 1988.
- Glorot & Bengio (2010) Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.
- He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, 2015.
- Hochreiter (1998) Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. Int. J. Uncertain. Fuzziness Knowl.-Based Syst., 6(2), April 1998.
- Rasmus et al. (2015) Antti Rasmus, Mathias Berglund, Mikko Honkala, Harri Valpola, and Tapani Raiko. Semi-supervised learning with ladder networks. In NIPS. 2015.
- Saxe et al. (2014) Andrew M. Saxe, James L. McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. ICLR, 2014.