1 Introduction
Stochastic gradient descent (SGD) has been the workhorse for optimization of deep networks. The most well-known form uses the Euclidean gradients with a varying learning rate to optimize the weights. In this regard, the recent work [2] has brought to light scale invariance properties in the weight space which commonly used deep networks possess. These symmetries, or invariances to reparameterizations of the weights, imply that even though the loss function remains unchanged, the Euclidean gradient varies based on the chosen parameterization. Consequently, optimization trajectories can vary significantly for different reparameterizations.
Although these issues have been raised recently, the precursor to these methods is the early work of Amari [3], who proposed the use of natural gradients to tackle weight space symmetries in neural networks. The idea is to compute the steepest descent direction for the weight update on the manifold defined by these symmetries and use this direction to update the weights [4, 5, 6, 7]. Most of these proposals are either computationally expensive to implement or need modifications to the architecture. On the other hand, optimization over a manifold with symmetries or invariances has been a topic of much research and provides guidance to other, simpler metric constructions [8, 9, 10, 11, 12, 13, 14].
In Section 2, our analysis of a commonly used network shows that there exist more complex forms of symmetries which can affect optimization, and hence there is a need to define simpler weight updates which take these invariances into account. Accordingly, in Section 3, we look at one particular way of resolving the symmetries by constraining the filters to lie on the unit-norm manifold. This results from a geometric viewpoint on the manifold of the search space. The proposed updates, shown in Table 1, are symmetry-invariant and are numerically efficient to implement.
2 Architecture and symmetry analysis
A two-layer deep architecture, ArchBN, is shown in Figure 1. Each layer in ArchBN has typical components commonly found in convolutional neural networks: multiplication with a trainable weight matrix ($W_1$ in the first layer and $W_2$ in the second), a batch normalization layer, and element-wise rectification (ReLU). The network is trained with a cross-entropy loss. The rows of the weight matrices $W_1$ and $W_2$ correspond to filters in layers 1 and 2, respectively. The dimension of each row corresponds to the input dimension of the layer. For the MNIST digits dataset, the input is a 784-dimensional vector. With 64 filters in each of the layers, the dimensionality of $W_1$ is $64 \times 784$ and of $W_2$ is $64 \times 64$. The dimension of the final-layer matrix $\theta$ is $10 \times 64$, where each row corresponds to a trainable class vector.
The batch normalization [17] layer normalizes each feature (element) in the $h_1$ and $h_2$ layers (the outputs of the multiplications with $W_1$ and $W_2$) to have zero mean and unit variance over each mini-batch. Then a separate and trainable scale and shift is applied to the resulting features to obtain $b_1$ and $b_2$, respectively. This effectively models the distribution of the features in $h_1$ and $h_2$ as Gaussian whose mean and variance are learnt during training. Empirical results in [17] show that this significantly improves convergence, and our experiments also support this claim. A key observation is that the normalization of $h_1$ and $h_2$ allows for complex symmetries to exist in the network. To this end, consider the reparameterizations
$$\tilde{W}_1 = \alpha W_1, \qquad \tilde{W}_2 = \beta W_2, \tag{1}$$
where $\alpha = \mathrm{Diag}(\alpha_1, \ldots, \alpha_{64})$ and $\beta = \mathrm{Diag}(\beta_1, \ldots, \beta_{64})$ and the elements of $\alpha$ and $\beta$ can be any positive real number. $\mathrm{Diag}(\cdot)$ is an operator which creates a diagonal matrix with its argument placed along the diagonal. Due to batch normalization, which makes $h_1$ and $h_2$ unit-variance, the network output is unchanged, and hence the loss is invariant to reparameterizations (1) of the weights. Equivalently, there exist continuous symmetries or reparameterizations of $W_1$ and $W_2$ which leave the loss function unchanged. It should be stressed that our analysis differs from [2], where the authors deal with a simpler case of $\alpha = \alpha_0 I$, $\beta = \alpha_0^{-1} I$, and $\alpha_0$ is a non-zero scalar.
Unfortunately, the Euclidean gradient of the weights (used in standard SGD) is not invariant to reparameterizations of the weights [2]. Consequently, optimization trajectories can vary significantly based on the chosen parameterization. This issue can be resolved either by defining a suitable non-Euclidean gradient which is invariant to reparameterizations (1) or by placing appropriate constraints on the filter weights, as we show in the following section.
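The two claims of this section can be checked numerically on a toy example. The sketch below is only an illustration under stated assumptions: a single linear layer followed by batch normalization without the trainable scale and shift, a squared loss standing in for the network loss, positive row scales, and a finite-difference gradient. All function and variable names are hypothetical and do not come from the paper.

```python
import math
import random

def linear(X, W):
    """Return H = X W^T for a mini-batch X (list of samples) and weight matrix W."""
    return [[sum(w * x for w, x in zip(row, sample)) for row in W] for sample in X]

def batch_norm(H):
    """Normalize each feature (column of H) to zero mean and unit variance."""
    n, d = len(H), len(H[0])
    out = [[0.0] * d for _ in range(n)]
    for j in range(d):
        col = [H[i][j] for i in range(n)]
        mean = sum(col) / n
        std = math.sqrt(sum((c - mean) ** 2 for c in col) / n)
        for i in range(n):
            out[i][j] = (col[i] - mean) / std
    return out

def loss(W, X, T):
    """Toy squared loss on the normalized features (a stand-in for the network loss)."""
    B = batch_norm(linear(X, W))
    return sum((b - t) ** 2 for rb, rt in zip(B, T) for b, t in zip(rb, rt))

def grad_entry(W, X, T, i, j, h=1e-6):
    """Central finite-difference estimate of d loss / d W[i][j]."""
    Wp = [row[:] for row in W]
    Wm = [row[:] for row in W]
    Wp[i][j] += h
    Wm[i][j] -= h
    return (loss(Wp, X, T) - loss(Wm, X, T)) / (2 * h)

rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(8)]  # 8 samples, 5 inputs
W = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(3)]  # 3 filters (rows)
T = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(8)]  # arbitrary targets
alpha = [0.5, 2.0, 10.0]                                     # positive row scales
W_scaled = [[a * w for w in row] for a, row in zip(alpha, W)]

# The loss is (numerically) the same for W and Diag(alpha) W ...
loss_gap = abs(loss(W, X, T) - loss(W_scaled, X, T))
# ... but the Euclidean gradient of row 0 is rescaled by 1 / alpha[0].
g = grad_entry(W, X, T, 0, 0)
g_scaled = grad_entry(W_scaled, X, T, 0, 0)
print(loss_gap, g, g_scaled)
```

Because the loss does not change while the gradient entry shrinks or grows with the inverse of the row scale, the size of a plain SGD step depends on which of the equivalent parameterizations happens to be in use.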
3 Resolving symmetry issues using manifold optimization
An efficient way to resolve the symmetries that exist in ArchBN is to constrain the weight vectors (filters) in $W_1$ and $W_2$ to lie on the oblique manifold [8, 15], i.e., each filter in the fully connected layers is constrained to have unit norm (abbreviated UN). Equivalently, we impose the constraints $\mathrm{diag}(W_1 W_1^\top) = 1$ and $\mathrm{diag}(W_2 W_2^\top) = 1$, where $\mathrm{diag}(\cdot)$ is an operator which extracts the diagonal elements of the argument matrix. To this end, consider a weight vector $w \in \mathbb{R}^n$ with the constraint $w^\top w = 1$. (For example, $w^\top$ is a row of $W_1$.) The steepest descent direction for a loss $\ell(w)$ with $w$ on the unit-norm manifold is computed as $\tilde{\nabla}\ell(w) = \nabla\ell(w) - (w^\top \nabla\ell(w))\, w$, where $\nabla\ell(w)$ is the Euclidean gradient and $\tilde{\nabla}\ell(w)$ is the Riemannian gradient on the unit-norm manifold [8, Chapter 3]. Effectively, the normal component of the Euclidean gradient, i.e., $(w^\top \nabla\ell(w))\, w$, is subtracted to result in the tangential (to the unit-norm manifold) component. Following the tangential direction takes the update out of the manifold, which is then pulled back to the manifold with a retraction operation [8, Example 4.1.1]. Finally, an update of $w$ on the unit-norm manifold is of the form
$$\tilde{w}_+ = w - \lambda \big(\nabla\ell(w) - (w^\top \nabla\ell(w))\, w\big), \qquad w_+ = \tilde{w}_+ / \|\tilde{w}_+\|, \tag{2}$$
where $w$ is the current weight, $w_+$ is the updated weight, $\nabla\ell(w)$ is the Euclidean gradient, and $\lambda$ is the learning rate. It should be noted that when $W_1$ and $W_2$ are constrained, the variable $\theta$ (the matrix of class vectors) is reparameterization free.
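The two geometric facts used in this construction can be verified in a few lines. The notation here is an assumption made for this sketch: $w$ is a unit-norm filter, $\nabla\ell(w)$ the Euclidean gradient, $\tilde{\nabla}\ell(w)$ its tangential part, and $\lambda$ the learning rate.

```latex
\begin{align*}
\tilde{\nabla}\ell(w) &= \nabla\ell(w) - \big(w^\top \nabla\ell(w)\big)\, w, \\
w^\top \tilde{\nabla}\ell(w) &= w^\top \nabla\ell(w) - \big(w^\top \nabla\ell(w)\big)\, w^\top w = 0
  \quad \text{since } w^\top w = 1, \\
\big\| w - \lambda \tilde{\nabla}\ell(w) \big\|^2 &= w^\top w - 2\lambda\, w^\top \tilde{\nabla}\ell(w) + \lambda^2 \big\|\tilde{\nabla}\ell(w)\big\|^2
  = 1 + \lambda^2 \big\|\tilde{\nabla}\ell(w)\big\|^2 \;\ge\; 1.
\end{align*}
```

The second line shows that the projected gradient is tangent to the unit sphere at $w$. The third shows that a step along it never shrinks the norm below one, so the normalization in the retraction is always well defined.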
The proposed weight update (2) can be used in a stochastic gradient descent (SGD) setting which we use in our experiments described in the following section. It should be emphasized that the proposed update is numerically efficient to implement. The formulas are shown in Table 1. The convergence analysis of SGD on manifolds follows the developments in [1, 18].
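A minimal sketch of one such update for a single filter is given below, assuming the filter already has unit norm and that the Euclidean gradient is supplied by back-propagation. The function name `un_update` and the numbers in the usage lines are illustrative only and are not taken from Table 1.

```python
import math

def un_update(w, euclid_grad, lr):
    """One unit-norm (UN) step: project the gradient, take the step, renormalize."""
    inner = sum(wi * gi for wi, gi in zip(w, euclid_grad))            # w^T grad
    riem_grad = [gi - inner * wi for wi, gi in zip(w, euclid_grad)]   # tangential part
    stepped = [wi - lr * ri for wi, ri in zip(w, riem_grad)]          # leaves the sphere
    norm = math.sqrt(sum(s * s for s in stepped))
    return [s / norm for s in stepped]                                # retraction

# Usage with made-up values: w has unit norm before and after the step.
w = [0.6, 0.8, 0.0]
grad = [1.0, 0.0, 2.0]
w_new = un_update(w, grad, lr=0.1)
print(w_new, sum(x * x for x in w_new))
```

The extra cost over a Euclidean step is one inner product and one normalization per filter, which is why the update stays cheap even for wide layers.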
4 Experiments and results
We train both two- and four-layer deep ArchBN networks to perform digit classification on the MNIST dataset (60K training and 10K testing images). We use 64 features per layer. The digit images are rasterized into a 784-dimensional vector as input to the network(s). No input pre-processing is performed. The weights in each layer are drawn from a standard Gaussian and each filter is unit-normalized. The class vectors are also drawn from a standard Gaussian and unit-normalized.
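The initialization described above can be sketched as follows. The helper name is hypothetical, and the shapes assume 64 filters per layer, 28 x 28 = 784 input pixels, and 10 digit classes.

```python
import math
import random

def init_unit_norm_rows(n_rows, n_cols, rng):
    """Draw each row from a standard Gaussian, then scale it to unit Euclidean norm."""
    rows = []
    for _ in range(n_rows):
        row = [rng.gauss(0.0, 1.0) for _ in range(n_cols)]
        norm = math.sqrt(sum(v * v for v in row))
        rows.append([v / norm for v in row])
    return rows

rng = random.Random(0)
W1 = init_unit_norm_rows(64, 784, rng)    # first-layer filters
W2 = init_unit_norm_rows(64, 64, rng)     # second-layer filters
theta = init_unit_norm_rows(10, 64, rng)  # class vectors
```

Starting every filter on the unit sphere means the unit-norm update can be applied from the first iteration without a separate projection step.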
We use SGD-based optimization and choose the base learning rate from a set of candidate values for each training run. To find the base learning rate, we create a validation set of images from the training set. We then train the network with a fixed learning rate using a randomly chosen subset of the training images for a fixed number of epochs. At the start of each epoch, the training set is randomly permuted and mini-batches are sampled in sequence, ensuring that each training sample is used only once within an epoch. We record the validation error, measured as the error per training sample, for each candidate base learning rate. We then choose the candidate rate which corresponds to the lowest validation error and use it for training the network on the full training set. We repeat this whole process over multiple training runs for each network to measure the mean and variance of the test error. We ignore the runs where the validation error diverged. For each full dataset training run, we use the bold-driver protocol [19] to anneal the learning rate. We use a randomly chosen subset of the samples as the training set and the remaining samples for validation. Training runs for at least a minimum and at most a maximum number of epochs. Training is terminated if the training error falls below a threshold, or the validation error increases with respect to the one measured a fixed number of epochs earlier, or successive validation error measurements differ by less than a tolerance.
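The bookkeeping in this protocol can be sketched as below. This is not the experimental code: the function names are hypothetical, the bold-driver rule is written in one common form, and the growth and shrink factors (1.05 and 0.5) as well as the candidate rates in the usage lines are illustrative assumptions rather than values from the experiments.

```python
import math
import random

def epoch_minibatches(n_samples, batch_size, rng):
    """Permute the sample indices once, then cut them into consecutive mini-batches."""
    order = list(range(n_samples))
    rng.shuffle(order)
    return [order[i:i + batch_size] for i in range(0, n_samples, batch_size)]

def select_base_lr(validation_errors):
    """Pick the candidate rate with the lowest validation error, skipping diverged runs."""
    finite = {lr: err for lr, err in validation_errors.items() if math.isfinite(err)}
    return min(finite, key=finite.get)

def bold_driver(lr, previous_error, current_error, grow=1.05, shrink=0.5):
    """Grow the rate slightly after an improvement, cut it sharply otherwise."""
    return lr * grow if current_error < previous_error else lr * shrink

# Usage with made-up numbers.
batches = epoch_minibatches(10, 4, random.Random(0))
best = select_base_lr({0.1: float("inf"), 0.01: 0.05, 0.001: 0.08})
```

Slicing a single permutation guarantees that each sample appears exactly once per epoch, and filtering out non-finite errors before taking the minimum mirrors the rule of ignoring diverged runs.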
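Table 2: Test error on MNIST for two- and four-layer deep ArchBN trained with B-SGD and UN weight updates.

| Layers | B-SGD | UN |
|---|---|---|
| 2 | 0.0206 ± 0.0024 | 0.0199 ± 0.0046 |
| 4 | 0.0204 ± 0.0027 | 0.0179 ± 0.0025 |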
From Table 2 we see that for both two- and four-layer deep networks, the mean test error is lower for UN than for balanced SGD (B-SGD), which is simply the Euclidean update with the starting values of filters and class vectors unit-normalized. The lowest mean test error is obtained when the UN weight update is used for training a four-layer deep network, with a spread comparable to that of B-SGD. The difference between the B-SGD and UN updates is more significant for the four-layer deep network, thereby highlighting the performance improvement over what is achieved by standard batch normalization in deeper networks. The use of UN can also be seen as a way to regularize the weights of the network during training without introducing any hyper-parameters, e.g., a weight decay term. It should also be noted that the performance difference between the two- and four-layer networks is not very large. This raises the question for future research as to whether some deep networks necessarily have to be that deep or whether they can be made shallower (and more efficient) by better optimization [20].
5 Application to image segmentation
We apply SGD with the proposed UN weight updates in Table 1 for training SegNet, a deep convolutional network proposed for road scene image segmentation into multiple classes [21]. This network, although convolutional, possesses the same symmetries as those analyzed for ArchBN in (1). The network is trained for 100 epochs on the CamVid [22] training set of 367 images. The predictions on some sample test images from CamVid are shown in Figure 2. These qualitative results indicate the usefulness of symmetry-invariant weight updates for larger networks that arise in practice.
6 Conclusion
We have highlighted the symmetries that exist in the weight space of currently popular deep neural network architectures. These symmetries can be absorbed into gradient descent by applying a unit-norm constraint on the filter weights. This takes into account the manifold structure on which the weights of the network reside. The empirical results show that the test performance can be improved using our proposed weight update technique on a modern architecture. As a future research direction, we would like to explore other efficient symmetry-invariant weight update techniques and exploit them for deep convolutional neural networks used in practical applications.
Bamdev Mishra was supported as an FNRS research fellow (Belgian Fund for Scientific Research). The scientific responsibility rests with its authors.
References
[1] L. Bottou. Large-scale machine learning with stochastic gradient descent. In International Conference on Computational Statistics (COMPSTAT), pages 177–186, 2010.
[2] B. Neyshabur, R. Salakhutdinov, and N. Srebro. Path-SGD: Path-normalized optimization in deep neural networks. In Advances in Neural Information Processing Systems 29 (NIPS), 2015. Accepted for publication.
[3] S.-I. Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
[4] R. Pascanu and Y. Bengio. Revisiting natural gradient for deep networks. Technical report, arXiv:1301.3584, 2013.
[5] G. Desjardins, K. Simonyan, R. Pascanu, and K. Kavukcuoglu. Natural neural networks. Technical report, arXiv:1507.00210, 2015.
[6] Y. Ollivier. Riemannian metrics for neural networks I: Feedforward networks. Information and Inference, 4(2):108–153, 2015.
[7] Y. Ollivier. Riemannian metrics for neural networks II: Recurrent networks and learning symbolic data sequences. Information and Inference, 4(2):154–193, 2015.
[8] P.-A. Absil, R. Mahony, and R. Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, Princeton, NJ, 2008.
[9] B. Mishra and R. Sepulchre. Riemannian preconditioning. Technical report, arXiv:1405.6055, 2014.
[10] N. Boumal and P.-A. Absil. Low-rank matrix completion via preconditioned optimization on the Grassmann manifold. Linear Algebra and its Applications, 475:200–239, 2015.
[11] M. Journée, F. Bach, P.-A. Absil, and R. Sepulchre. Low-rank optimization on the cone of positive semidefinite matrices. SIAM Journal on Optimization, 20(5):2327–2351, 2010.
[12] P.-A. Absil, R. Mahony, and R. Sepulchre. Riemannian geometry of Grassmann manifolds with a view on algorithmic computation. Acta Applicandae Mathematicae, 80(2):199–220, 2004.
[13] A. Edelman, T. A. Arias, and S. T. Smith. The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications, 20(2):303–353, 1998.
[14] J. H. Manton. Optimization algorithms exploiting unitary constraints. IEEE Transactions on Signal Processing, 50(3):635–650, 2002.
[15] N. Boumal, B. Mishra, P.-A. Absil, and R. Sepulchre. Manopt, a MATLAB toolbox for optimization on manifolds. The Journal of Machine Learning Research, 15(1):1455–1459, 2014.
[16] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
[17] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML), 2015.
[18] S. Bonnabel. Stochastic gradient descent on Riemannian manifolds. IEEE Transactions on Automatic Control, 58(9):2217–2229, 2013.
[19] G. Hinton. Lecture notes. Technical report, University of Toronto, 2008.
[20] J. Ba and R. Caruana. Do deep nets really need to be deep? In Advances in Neural Information Processing Systems 28 (NIPS), pages 2654–2662, 2014.
[21] V. Badrinarayanan, A. Handa, and R. Cipolla. SegNet: A deep convolutional encoder-decoder architecture for robust semantic pixel-wise labelling. Technical report, arXiv:1505.07293, 2015.
[22] G. Brostow, J. Fauqueur, and R. Cipolla. Semantic object classes in video: A high-definition ground truth database. Pattern Recognition Letters, 30(2):88–97, 2009.