Deep Neural Networks (DNNs) have achieved outstanding performance across a wide range of empirical tasks such as image classification (Krizhevsky et al., 2012), image segmentation (He et al., 2017), speech recognition (Hinton et al., 2012a), natural language processing (Collobert et al., 2011) or playing the game of Go (Silver et al., 2017). These successes have been driven by the availability of large labeled datasets such as ImageNet (Russakovsky et al., 2015), increasing computational power and the use of deeper models (He et al., 2015a).
Although the expressivity of the function computed by a neural network grows exponentially with depth (Pascanu et al., 2013; Raghu et al., 2017; Telgarsky, 2016), in practice deep networks are vulnerable to both over- and underfitting (Glorot and Bengio, 2010; Krizhevsky et al., 2012; He et al., 2015a). Widely used techniques to prevent DNNs from overfitting include regularization methods such as weight decay (Krogh and Hertz, 1992), Dropout (Hinton et al., 2012b) and various data augmentation schemes (Krizhevsky et al., 2012; Simonyan and Zisserman, 2014; Szegedy et al., 2014; He et al., 2015a). Underfitting can occur if the network gets stuck in a local minimum, which can be avoided by using stochastic gradient descent algorithms (Bottou, 2010; Duchi et al., 2011; Sutskever et al., 2013; Kingma and Ba, 2014), sometimes along with carefully tuned learning rate schedules (He et al., 2015a; Goyal et al., 2017).
After a few iterations, the gradients computed during backpropagation become either too small or too large, preventing the optimization scheme from converging. This is alleviated by using non-saturating activation functions such as rectified linear units (ReLUs) (Krizhevsky et al., 2012) or better initialization schemes preserving the variance of the input across layers (Glorot and Bengio, 2010; Mishkin and Matas, 2015; He et al., 2015b). Failure modes that prevent the training from starting have been theoretically studied by Hanin and Rolnick (2018).
Two techniques in particular have allowed vision models to achieve “super-human” accuracy. Batch Normalization (BN) was developed to train Inception networks (Ioffe and Szegedy, 2015). It introduces intermediate layers that normalize the features by the mean and variance computed within the current batch. BN is effective in reducing training time, provides better generalization capabilities after training and diminishes the need for a careful initialization. Network architectures such as ResNet (He et al., 2015a) and DenseNet (Huang et al., 2016) use skip connections along with BN to improve the information flow during both the forward and backward passes.
However, BN has some limitations. In particular, BN only works well with sufficiently large batch sizes (Ioffe and Szegedy, 2015; Wu and He, 2018). For sizes below 16 or 32, the batch statistics have a high variance and the test error increases significantly. This prevents the investigation of higher-capacity models because large, memory-consuming batches are needed in order for BN to work in its optimal range. In many use cases, including video recognition (Carreira and Zisserman, 2017) and image segmentation (He et al., 2017), the batch size restriction is even more challenging because the size of the models allows for only a few samples per batch. Another restriction of BN is that it is computationally intensive, typically consuming 20% to 30% of the training time. Variants such as Group Normalization (GN) (Wu and He, 2018) cover some of the failure modes of BN.
In this paper, we introduce a novel algorithm to improve both the training speed and generalization accuracy of networks by using their over-parameterization to regularize them. In particular, we focus on neural networks that are positive-rescaling equivalent (Neyshabur et al., 2015), i.e. whose weights are identical up to positive scalings and matching inverse scalings. The main principle of our method, referred to as Equi-normalization (ENorm), is illustrated in Figure 1 for the fully-connected case. We scale two consecutive matrices with rescaling coefficients that minimize the joint norm of those two matrices. This amounts to re-parameterizing the network under the constraint of implementing the same function. We conjecture that this particular choice of rescaling coefficients ensures a smooth propagation of the gradients during training.
A limitation is that our proposal, in its current form, can only handle learned skip-connections like those proposed in type-C ResNet. For this reason, we focus on architectures, in particular ResNet-18, for which learning converges with learned skip-connections, as opposed to architectures like ResNet-50, for which identity skip-connections are required for convergence.
We introduce an iterative, batch-independent algorithm that re-parametrizes the network within the space of rescaling equivalent networks, thus preserving the function implemented by the network;
We prove that the proposed Equi-normalization algorithm converges to a unique canonical parameterization of the network that minimizes the global $\ell_p$-norm of the weights, or equivalently, when $p = 2$, the weight decay regularizer;
We extend ENorm to modern convolutional architectures, including the widely used ResNets, and show that the theoretical computational overhead is lower than that of BN ($\times 50$) and even that of GN ($\times 3$);
We show that applying one ENorm step after each SGD step outperforms both BN and GN on the CIFAR-10 (fully connected) and ImageNet (ResNet-18) datasets.
Our code is available at https://github.com/facebookresearch/enorm.
Section 4 shows how to adapt ENorm to convolutional neural networks (CNNs). Section 5 details how to employ ENorm for training neural networks and Section 6 presents our experimental results.
2 Related work
This section reviews methods improving neural network training and compares them with ENorm. Since there is a large body of literature in this research area, we focus on the works closest to the proposed approach. Since early work, researchers have noticed the importance of normalizing the input of a learning system, and by extension the input of any layer in a DNN (LeCun et al., 1998). Such normalization is applied either to the weights or to the activations. On the other hand, several strategies aim at better controlling the geometry of the weight space with respect to the loss function. Note that these research directions are not orthogonal. For example, explicitly normalizing the activations using BN has smoothing effects on the optimization landscape (Santurkar et al., 2018).
Batch Normalization (Ioffe and Szegedy, 2015) normalizes the activations by using statistics computed along the batch dimension. As stated in the introduction, the dependency on the batch size leads BN to underperform when small batches are used. Batch Renormalization (BR) (Ioffe, 2017) is a follow-up that reduces the sensitivity to the batch size, yet does not completely alleviate the negative effect of small batches. Several batch-independent methods operate on other dimensions, such as Layer Normalization (channel dimension) (Ba et al., 2016) and Instance-Normalization (sample dimension) (Ulyanov et al., 2016). Parametric data-independent estimation of the mean and variance in every layer is investigated by Arpit et al. (2016). However, these methods are inferior to BN in standard classification tasks. More recently, Group Normalization (GN) (Wu and He, 2018), which divides the channels into groups and normalizes each group independently, was shown to effectively replace BN for small batch sizes on computer vision tasks.
Early weight normalization techniques only served to initialize the weights before training (Glorot and Bengio, 2010; He et al., 2015b). These methods aim at keeping the variance of the output activations close to one along the whole network, but the assumptions made to derive these initialization schemes may not hold as training evolves. More recently, Salimans and Kingma (2016) propose a polar-like re-parametrization of the weights to disentangle the direction from the norm of the weight vectors. Note that Weight Norm (WN) does require mean-only BN to get competitive results, as well as a greedy layer-wise initialization as mentioned in the paper.
Generally, in the parameter space, the loss function moves quickly along some directions and slowly along others. To account for this anisotropic relation between the parameters of the model and the loss function, natural gradient methods have been introduced (Amari, 1998). They require storing and inverting the $N \times N$ curvature matrix, where $N$ is the number of network parameters. Several works approximate the inverse of the curvature matrix to circumvent this problem (Pascanu and Bengio, 2013; Marceau-Caron and Ollivier, 2016; Martens and Grosse, 2015). Another method called Diagonal Rescaling (Lafond et al., 2017) proposes to tune a particular re-parametrization of the weights with a block-diagonal approximation of the inverse curvature matrix. Finally, Neyshabur et al. (2015) propose a rescaling invariant path-wise regularizer and use it to derive Path-SGD, an approximate steepest descent with respect to the path-wise regularizer.
Unlike BN, Equi-normalization focuses on the weights and is independent of the concept of batch. Like Path-SGD, our goal is to obtain a balanced network ensuring a good back-propagation of the gradients, but our method explicitly re-balances the network using an iterative algorithm instead of using an implicit regularizer. Moreover, ENorm can be readily adapted to the convolutional case whereas Neyshabur et al. (2015) restrict themselves to the fully-connected case. In addition, the theoretical computational complexity of our method is much lower than the overhead introduced by BN or GN (see Section 5). Besides, WN parametrizes the weights in a polar-like manner, $w = g \, v / \|v\|$, where $g$ is a scalar and $v$ are the weights, thus it does not balance the network but individually scales each layer. Finally, Sinkhorn’s algorithm aims at making a single matrix doubly stochastic, while we balance a product of matrices to minimize their global norm.
We first define Equi-normalization in the context of simple feed forward networks that consist of two operators: linear layers and ReLUs. The algorithm is inspired by Sinkhorn-Knopp and is designed to balance the energy of a network, i.e., the $\ell_p$-norm of its weights, while preserving its function. When not ambiguous, we may denote by network a weight parametrization of a given network architecture.
3.1 Notation and definitions
We consider a network with $q$ linear layers, whose input is a row vector $x \in \mathbb{R}^{n_0}$. We denote by $\sigma$ the point-wise ReLU activation. For the sake of exposition, we omit a bias term at this stage. We recursively define a simple fully connected feedforward neural network with $q$ layers by $y_0 = x$, $y_k = \sigma(y_{k-1} W_k)$ for $k \in \{1, \dots, q-1\}$, and $y_q = y_{q-1} W_q$. Each linear layer $k$ is parametrized by a matrix $W_k \in \mathbb{R}^{n_{k-1} \times n_k}$. We denote by $f_\theta(x) = y_q$ the function implemented by the network, where $\theta$ is the concatenation of all the network parameters. We denote by $\mathcal{D}(n)$ the set of diagonal matrices of size $n \times n$ for which all diagonal elements are strictly positive and by $I_n$ the identity matrix of size $n \times n$.
$\theta$ and $\tilde\theta$ are functionally equivalent if, for all $x \in \mathbb{R}^{n_0}$, $f_\theta(x) = f_{\tilde\theta}(x)$.
$\theta$ and $\tilde\theta$ are rescaling equivalent if, for all $k \in \{1, \dots, q-1\}$, there exists a rescaling matrix $D_k \in \mathcal{D}(n_k)$ such that, for all $k \in \{1, \dots, q\}$,
$$\tilde W_k = D_{k-1}^{-1} W_k D_k \qquad (2)$$
with the conventions that $D_0 = I_{n_0}$ and $D_q = I_{n_q}$. This amounts to positively scaling all the incoming weights and inversely scaling all the outgoing weights for every hidden neuron.
Two weight vectors $\theta$ and $\tilde\theta$ that are rescaling equivalent are also functionally equivalent (see Section 3.5 for a detailed derivation). Note that a functional equivalence class is not entirely described by rescaling operations. For example, permutations of neurons inside a layer also preserve functional equivalence, but do not change the gradient. In what follows our objective is to seek a canonical parameter vector that is rescaling equivalent to a given parameter vector. The same objective under a functional equivalence constraint is beyond the scope of this paper, as there exist degenerate cases where functional equivalence does not imply rescaling equivalence, even up to permutations.
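The claim that rescaling equivalence implies functional equivalence can be checked numerically. The sketch below builds a small bias-free ReLU network, applies arbitrary positive rescalings to the hidden neurons, and compares the two outputs. The layer widths, the random seed and the helper names (`forward`, `rescale`) are illustrative assumptions and do not come from the paper.

```python
import random

def relu(v):
    return [max(0.0, t) for t in v]

def row_times_matrix(x, W):
    # Row vector x (length n_in) times matrix W (n_in x n_out).
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def forward(x, weights):
    # ReLU after every layer except the last one, which stays linear.
    y = x
    for k, W in enumerate(weights):
        y = row_times_matrix(y, W)
        if k < len(weights) - 1:
            y = relu(y)
    return y

def rescale(weights, scalings):
    # scalings[k] is the positive diagonal attached to hidden layer k + 1.
    # Each matrix is divided row-wise by the scaling of its input layer and
    # multiplied column-wise by the scaling of its output layer; the network
    # input and output keep the identity scaling.
    q = len(weights)
    out = []
    for k, W in enumerate(weights):
        left = scalings[k - 1] if k > 0 else [1.0] * len(W)
        right = scalings[k] if k < q - 1 else [1.0] * len(W[0])
        out.append([[W[i][j] / left[i] * right[j] for j in range(len(W[0]))]
                    for i in range(len(W))])
    return out

rng = random.Random(0)
sizes = [4, 5, 3, 2]  # hypothetical layer widths
weights = [[[rng.gauss(0, 1) for _ in range(sizes[k + 1])] for _ in range(sizes[k])]
           for k in range(len(sizes) - 1)]
scalings = [[rng.uniform(0.1, 10.0) for _ in range(n)] for n in sizes[1:-1]]
x = [rng.gauss(0, 1) for _ in range(sizes[0])]

y = forward(x, weights)
y_tilde = forward(x, rescale(weights, scalings))
max_gap = max(abs(a - b) for a, b in zip(y, y_tilde))
print(max_gap)  # expected to be at the level of floating-point round-off
```

The check relies on the fact that a ReLU commutes with multiplication by a positive scalar, which is why the scalings must be strictly positive.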
3.2 Objective function: Canonical representation
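Given a network $\theta$ and $p > 0$, we define the $\ell_p$ norm of its weights as $\ell_p(\theta) = \sum_{k=1}^{q} \|W_k\|_p^p$. We are interested in minimizing $\ell_p$ inside an equivalence class of neural networks in order to exhibit a unique canonical element per equivalence class. We denote the rescaling coefficients within the network as $d_k \in (0, +\infty)^{n_k}$ for $k \in \{1, \dots, q-1\}$ or as diagonal matrices $D_k = \mathrm{Diag}(d_k) \in \mathcal{D}(n_k)$. We denote $\delta = (d_1, \dots, d_{q-1}) \in \mathbb{R}^n$, where $n$ is the number of hidden neurons. Fixing the weights $\{W_k\}$, we refer to $\{D_{k-1}^{-1} W_k D_k\}$ as the rescaled weights, and seek to minimize their $\ell_p$ norm as a function of the rescaling coefficients:
$$\varphi(\delta) = \sum_{k=1}^{q} \left\| D_{k-1}^{-1} W_k D_k \right\|_p^p. \qquad (3)$$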
3.3 Coordinate descent: ENorm Algorithm
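We formalize the ENorm algorithm using the framework of block coordinate descent. We denote by $W_k[:, i]$ (resp. $W_k[i, :]$) the $i$th column (resp. $i$th row) of a matrix $W_k$. In what follows we assume that each hidden neuron is connected to at least one input and one output neuron. ENorm generates a sequence of rescaling coefficients $\delta^{(r)}$ obtained by the following steps.
Initialization. Define $\delta^{(0)} = (1, \dots, 1)$.
Iteration. At iteration $r$, consider layer $\ell \in \{1, \dots, q-1\}$ such that $\ell - 1 \equiv r \mod (q-1)$ and define
$$d_k^{(r+1)} = d_k^{(r)} \ \text{if } k \neq \ell, \qquad d_\ell^{(r+1)} = \arg\min_{t \in (0, +\infty)^{n_\ell}} \varphi\left(d_1^{(r)}, \dots, d_{\ell-1}^{(r)}, t, d_{\ell+1}^{(r)}, \dots, d_{q-1}^{(r)}\right).$$
Denoting $uv$ the coordinate-wise product of two vectors and $u/v$ for the division, we have
$$d_\ell^{(r+1)}[i] = \sqrt{\frac{\left\| W_{\ell+1}[i, :] \, d_{\ell+1}^{(r)} \right\|_p}{\left\| W_\ell[:, i] / d_{\ell-1}^{(r)} \right\|_p}}. \qquad (4)$$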
Algorithm and pseudo-code.
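The iteration above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions rather than the reference implementation: weights are nested lists, `d[k]` stores the rescaling coefficients of layer `k` (with the input and output layers fixed to ones), and one call to the hypothetical helper `enorm_cycle` sweeps all hidden layers once. Each coefficient is set to the square root of the ratio between the norm of the neuron's rescaled outgoing row and the norm of its rescaled incoming column.

```python
import random

def column_norm(W, i, left, p):
    # p-norm of column i of W after dividing row j by left[j].
    return sum(abs(W[j][i] / left[j]) ** p for j in range(len(W))) ** (1.0 / p)

def row_norm(W, i, right, p):
    # p-norm of row i of W after multiplying column j by right[j].
    return sum(abs(W[i][j] * right[j]) ** p for j in range(len(W[i]))) ** (1.0 / p)

def enorm_cycle(weights, d, p=2):
    # One sweep over the hidden layers l = 1..q-1, updating d[l] in place.
    # weights[l - 1] feeds layer l and weights[l] leaves it.
    q = len(weights)
    for l in range(1, q):
        for i in range(len(d[l])):
            incoming = column_norm(weights[l - 1], i, d[l - 1], p)
            outgoing = row_norm(weights[l], i, d[l + 1], p)
            d[l][i] = (outgoing / incoming) ** 0.5
    return d

def rescaled_norm(weights, d, p=2):
    # Sum of |entry|^p over all rescaled matrices, i.e. the objective being minimized.
    total = 0.0
    for k, W in enumerate(weights):
        for i, row in enumerate(W):
            for j, w in enumerate(row):
                total += abs(w / d[k][i] * d[k + 1][j]) ** p
    return total

rng = random.Random(0)
sizes = [6, 8, 8, 4]  # hypothetical layer widths
weights = [[[rng.gauss(0, 1) for _ in range(sizes[k + 1])] for _ in range(sizes[k])]
           for k in range(len(sizes) - 1)]
d = [[1.0] * n for n in sizes]  # initialization: all coefficients equal to one

history = [rescaled_norm(weights, d)]
for _ in range(50):
    enorm_cycle(weights, d)
    history.append(rescaled_norm(weights, d))
print(history[0], history[-1])
```

Because each coefficient update exactly minimizes the objective along one coordinate while the others are held fixed, the recorded values in `history` should never increase from one cycle to the next.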
We now state our main convergence result for Equi-normalization. The proof relies on a coordinate descent theorem by Tseng (2001) and can be found in Appendix B.1. The main difficulty is to prove the uniqueness of the minimum of $\varphi$.
Theorem 1. Let $p > 0$ and $(\delta^{(r)})_{r \in \mathbb{N}}$ be the sequence of rescaling coefficients generated by ENorm from the starting point $\delta^{(0)}$ as described in Section 3.3. We assume that each hidden neuron is connected to at least one input and one output neuron. Then,
Convergence. The sequence of rescaling coefficients $\delta^{(r)}$ converges to $\delta^*$ as $r \to +\infty$. As a consequence, the sequence of rescaled weights also converges;
Minimum global $\ell_p$ norm. The rescaled weights after convergence minimize the global $\ell_p$ norm among all rescaling equivalent weights;
Uniqueness. The minimum is unique, i.e. $\delta^*$ does not depend on the starting point $\delta^{(0)}$.
3.5 Handling biases – Functional Equivalence
In the presence of biases, the network is defined as $y_k = \sigma(y_{k-1} W_k + b_k)$ and $y_q = y_{q-1} W_q + b_q$ where $b_k \in \mathbb{R}^{n_k}$. For rescaling-equivalent weights satisfying (2), in order to preserve the input-output function, we define matched rescaling equivalent biases $\tilde b_k = b_k D_k$. In Appendix B.2, we show by recurrence that for every layer $k$,
$$\tilde y_k = y_k D_k, \qquad (5)$$
where $\tilde y_k$ (resp. $y_k$) is the intermediary network function associated with the weights $\tilde W$ (resp. $W$). In particular, $\tilde y_q = y_q$, i.e. rescaling equivalent networks are functionally equivalent. We also compute the effect of applying ENorm on the gradients in the same Appendix.
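For intuition, the induction step behind this recurrence can be written out in one line. The notation below (row-vector activations $y_k$, biases $b_k$, positive diagonal matrices $D_k$, tildes for rescaled quantities) is assumed for this sketch; the full argument is the one given in Appendix B.2. Assuming $\tilde y_{k-1} = y_{k-1} D_{k-1}$ for a hidden layer $k$:

```latex
\begin{aligned}
\tilde y_k
  &= \sigma\left(\tilde y_{k-1} \tilde W_k + \tilde b_k\right) \\
  &= \sigma\left(y_{k-1} D_{k-1} \, D_{k-1}^{-1} W_k D_k + b_k D_k\right) \\
  &= \sigma\left((y_{k-1} W_k + b_k) \, D_k\right) \\
  &= \sigma\left(y_{k-1} W_k + b_k\right) D_k = y_k D_k .
\end{aligned}
```

The last line uses that the ReLU commutes with a diagonal matrix whose entries are strictly positive. For the output layer there is no activation and the output scaling is the identity, so the two networks return the same value.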
3.6 Asymmetric scaling
Equi-normalization is easily adapted to introduce a depth-wise penalty on each layer. We consider the weighted loss $\ell_{p, (c_1, \dots, c_q)}(\theta) = \sum_{k=1}^{q} c_k \|W_k\|_p^p$. This amounts to modifying the rescaling coefficients as
$$\tilde d_\ell^{(r+1)}[i] = d_\ell^{(r+1)}[i] \left( c_{\ell+1} / c_\ell \right)^{1/(2p)}. \qquad (6)$$
In Section 6, we explore two natural ways of defining $c_k$: $c_k = c^{q-k}$ (uniform) and $c_k = 1/(n_{k-1} n_k)$ (adaptive). In the uniform setup, we penalize layers exponentially according to their depth: for instance, values of $c$ larger than $1$ increase the magnitude of the weights at the end of the network. In the adaptive setup, the loss is weighted by the size of the matrices.
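The two weighting schemes can be made concrete with a short sketch. It assumes the forms $c_k = c^{q-k}$ (uniform) and $c_k = 1/(n_{k-1} n_k)$ (adaptive), and that the unweighted coefficient of hidden layer $\ell$ is multiplied by $(c_{\ell+1}/c_\ell)^{1/(2p)}$, which is what minimizing the weighted loss along one coordinate gives. The function names are hypothetical; the layer sizes in the usage lines are those of the MNIST encoder of Section 6.1.

```python
def uniform_coefficients(q, c):
    # Assumed uniform setup: c_k = c^(q - k) for k = 1..q.
    return [c ** (q - k) for k in range(1, q + 1)]

def adaptive_coefficients(sizes):
    # Assumed adaptive setup: c_k = 1 / (n_{k-1} * n_k), one value per weight matrix.
    return [1.0 / (sizes[k - 1] * sizes[k]) for k in range(1, len(sizes))]

def correction_factors(coeffs, p=2):
    # Multiplier applied to the unweighted ENorm coefficient of hidden layer l,
    # (c_{l+1} / c_l)^(1 / (2p)), for l = 1..q-1 (coeffs[0] holds c_1).
    return [(coeffs[l] / coeffs[l - 1]) ** (1.0 / (2 * p)) for l in range(1, len(coeffs))]

encoder_sizes = [784, 1000, 500, 250, 30]
uniform = correction_factors(uniform_coefficients(q=4, c=1.2))
adaptive = correction_factors(adaptive_coefficients(encoder_sizes))
print(uniform)
print(adaptive)
```

With $c > 1$ every uniform factor is below one, which shrinks the incoming weights of each hidden layer and enlarges its outgoing weights, consistent with larger weights towards the end of the network.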
4 Extension to CNNs
We now extend ENorm to CNNs, by focusing on the typical ResNet architecture. We briefly detail how we adapt ENorm to convolutional or max-pooling layers, and then how to update an elementary block with a skip-connection. We refer the reader to Appendix C for a more extensive discussion. Sanity checks of our implementation are provided in Appendix E.1.
4.1 Convolutional layers
Figure 3 explains how to rescale two consecutive convolutional layers. As detailed in Appendix C, this is done by first properly reshaping the filters to 2D matrices, then performing the previously described rescaling on those matrices, and then reshaping the matrices back to convolutional filters.
This matched rescaling does preserve the function implemented by the composition of the two layers, whether they are interleaved with a ReLU or not. It can be applied to any two consecutive convolutional layers with various stride and padding parameters. Note that when the kernel size is $1 \times 1$ in both layers, we recover the fully-connected case of Figure 1.
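A minimal sketch of this matched rescaling for one pair of convolutional layers is given below. It assumes filters stored as nested lists in (output channel, input channel, height, width) order, which is a layout choice made for this example, and it performs a single balancing step with the surrounding scalings fixed to one. Flattening everything that produces hidden channel `i` in the first layer corresponds to one column of the reshaped 2D matrix, and flattening everything that reads channel `i` in the second layer corresponds to one row.

```python
import copy
import random

def p_norm(values, p=2):
    return sum(abs(v) ** p for v in values) ** (1.0 / p)

def producing(W, i):
    # All weights of filter bank W that produce output channel i.
    return [w for in_ch in W[i] for row in in_ch for w in row]

def consuming(W, i):
    # All weights of filter bank W that read input channel i.
    return [w for out_ch in W for row in out_ch[i] for w in row]

def rescale_conv_pair(W1, W2, p=2):
    # Scale hidden channel i of W1 by d and the matching input slices of W2 by 1/d,
    # with d = sqrt(norm of consuming weights / norm of producing weights).
    W1, W2 = copy.deepcopy(W1), copy.deepcopy(W2)
    for i in range(len(W1)):
        d = (p_norm(consuming(W2, i), p) / p_norm(producing(W1, i), p)) ** 0.5
        W1[i] = [[[w * d for w in row] for row in ch] for ch in W1[i]]
        for out_ch in W2:
            out_ch[i] = [[w / d for w in row] for row in out_ch[i]]
    return W1, W2

def random_filters(rng, c_out, c_in, k):
    return [[[[rng.gauss(0, 1) for _ in range(k)] for _ in range(k)]
             for _ in range(c_in)] for _ in range(c_out)]

rng = random.Random(0)
W1 = random_filters(rng, c_out=4, c_in=3, k=3)  # hypothetical shapes
W2 = random_filters(rng, c_out=5, c_in=4, k=3)
W1_bal, W2_bal = rescale_conv_pair(W1, W2)
gaps = [abs(p_norm(producing(W1_bal, i)) - p_norm(consuming(W2_bal, i)))
        for i in range(len(W1))]
print(max(gaps))  # each hidden channel ends up with matching norms on both sides
```

The sketch only manipulates the filters; it does not run a convolution, so it illustrates the balancing itself rather than re-verifying that the composed function is unchanged.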
The MaxPool layer operates per channel by computing the maximum within a fixed-size kernel. We adapt Equation (5) to the convolutional case where the rescaling matrix $D_k$ is applied to the channel dimension of the activations $y_k$. Then,
$$\max(\tilde y_k) = \max(y_k D_k) = \max(y_k) D_k.$$
Thus, the activations before and after the MaxPool layer have the same scaling and the functional equivalence is preserved when interleaving convolutional layers with MaxPool layers.
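The commutation between max-pooling and a positive per-channel rescaling can be checked on a toy example. The sketch uses a one-dimensional, non-overlapping pooling and made-up activation values purely for illustration; the helper `max_pool_1d` is not part of any library.

```python
def max_pool_1d(channel, kernel):
    # Non-overlapping max pooling over one channel (stride equal to the kernel size).
    return [max(channel[s:s + kernel]) for s in range(0, len(channel) - kernel + 1, kernel)]

activations = [[0.5, -1.0, 2.0, 0.1],   # channel 0
               [3.0, 0.2, -0.7, 0.4]]   # channel 1
scales = [0.25, 4.0]  # strictly positive per-channel rescaling (hypothetical values)

pooled_then_scaled = [[d * v for v in max_pool_1d(ch, 2)]
                      for ch, d in zip(activations, scales)]
scaled_then_pooled = [max_pool_1d([d * v for v in ch], 2)
                      for ch, d in zip(activations, scales)]
print(pooled_then_scaled == scaled_then_pooled)
```

The equality would fail for a negative scale, since multiplying by a negative number turns a maximum into a minimum.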
We now consider an elementary block of a ResNet-18 architecture as depicted in Figure 3. In order to maintain functional equivalence, we only consider ResNet architectures of type C as defined by He et al. (2015a), where all shortcuts are learned convolutions. As detailed in Appendix C, rescaling two consecutive blocks requires (a) defining the structure of the rescaling process, i.e. where to insert the rescaling coefficients, and (b) a formula for computing those rescaling coefficients.
5 Training Procedure: Equi-normalization & SGD
ENorm & SGD.
As detailed in Algorithm 2, we balance the network periodically after updating the gradients. By design, this does not change the function implemented by the network but will yield different gradients in the next SGD iteration. Because this re-parameterization performs a jump in the parameter space, we update the momentum using Equation (17) and the same matrices $D_k$ as those used for the weights. The number of ENorm cycles after each SGD step is a hyperparameter and by default we perform one ENorm cycle after each SGD step. In Appendix D, we also explore a method to jointly learn the rescaling coefficients and the weights with SGD, and report corresponding results.
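The interleaving of SGD steps and ENorm cycles can be illustrated on a toy problem. The sketch below is an assumption-laden miniature, not Algorithm 2: the network has a scalar input, one hidden ReLU layer and a scalar output, the target function $|x|$ and the hyperparameters are made up, and plain SGD without momentum is used, so the momentum rescaling mentioned above is not shown. With a single hidden layer, the ENorm update for neuron $i$ reduces to $d_i = \sqrt{|w_2[i]| / |w_1[i]|}$.

```python
import random

def relu(t):
    return t if t > 0.0 else 0.0

def predict(w1, w2, x):
    # f(x) = sum_i w2[i] * relu(w1[i] * x)
    return sum(b * relu(a * x) for a, b in zip(w1, w2))

def sgd_step(w1, w2, x, target, lr):
    # One SGD step on the loss 0.5 * (f(x) - target)^2.
    err = predict(w1, w2, x) - target
    for i in range(len(w1)):
        pre = w1[i] * x
        g1 = err * w2[i] * x if pre > 0.0 else 0.0
        g2 = err * relu(pre)
        w1[i] -= lr * g1
        w2[i] -= lr * g2

def enorm_step(w1, w2):
    # One ENorm cycle: balance incoming and outgoing weights of each hidden neuron.
    for i in range(len(w1)):
        if w1[i] != 0.0 and w2[i] != 0.0:
            d = (abs(w2[i]) / abs(w1[i])) ** 0.5
            w1[i] *= d
            w2[i] /= d

rng = random.Random(0)
w1 = [rng.gauss(0, 1) for _ in range(8)]
w2 = [rng.gauss(0, 1) for _ in range(8)]
max_gap = 0.0
for step in range(1000):
    x = rng.uniform(-1.0, 1.0)
    sgd_step(w1, w2, x, abs(x), lr=0.01)
    before = predict(w1, w2, 0.7)
    enorm_step(w1, w2)  # one ENorm cycle after each SGD step
    max_gap = max(max_gap, abs(predict(w1, w2, 0.7) - before))
print(max_gap)  # the ENorm step should leave the network output unchanged
```

The probe input 0.7 is arbitrary; it only serves to confirm that the balancing step changes the parameters and not the function.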
Computational advantage over BN and GN.
Table 1 provides the number of elements (weights or activations) accessed when normalizing using various techniques. Assuming that the complexity (number of operations) of normalizing is proportional to the number of elements and assuming all techniques are equally parallelizable, we deduce that our method (ENorm) is theoretically 50 times faster than BN and 3 times faster than GN for a ResNet-18. In terms of memory, ENorm requires no extra learnt parameters, but the number of parameters learnt by BN and GN is negligible (4800 for a ResNet-18 and 26,650 for a ResNet-50). We implemented ENorm using a tensor library; taking full advantage of the theoretical reduction in compute would require an optimized CUDA kernel.
We analyze our approach by carrying out experiments on the MNIST and CIFAR-10 datasets and on the more challenging ImageNet dataset. ENorm will refer to Equi-normalization with $p = 2$.
6.1 MNIST auto-encoder
We follow the setup of Desjardins et al. (2015). Input data is normalized by subtracting the mean and dividing by standard deviation. The encoder has the structure FC(784, 1000)-ReLU-FC(1000, 500)-ReLU-FC(500, 250)-ReLU-FC(250, 30) and the decoder has the symmetric structure. We use He’s initialization for the weights. We select the learning rate in and decay it linearly to zero. We use a batch size of 256 and SGD with no momentum and a weight decay of . For path-SGD, our implementation closely follows the original paper (Neyshabur et al., 2015) and we set the weight decay to zero. For GN, we cross-validate the number of groups among . For WN, we use BN as well as a greedy layer-wise initialization as described in the original paper.
While ENorm alone obtains competitive results compared to BN and GN, ENorm + BN outperforms all other methods, including WN + BN. Note that here ENorm refers to ENorm using the adaptive parameter as described in Subsection 3.6, whereas for ENorm + BN we use the uniform setup with . We perform a parameter study for different values and setups of the asymmetric scaling (uniform and adaptive) in Appendix E.2. Without BN, the adaptive setup outperforms all other setups, which may be due to the strong bottleneck structure of the network. With BN, the dynamics are different and the results are much less sensitive to the values of $c$. Results without any normalization and with Path-SGD are not displayed because of their poor performance.
6.2 CIFAR-10 Fully Connected
We first experiment with a basic fully-connected architecture that takes as input the flattened image of size . Input data is normalized by subtracting mean and dividing by standard deviation independently for each channel. The first linear layer is of size . We then consider layers , being an architecture parameter for the sake of the analysis. The last classification is of size . The weights are initialized with He’s scheme. We train for epochs using SGD with no momentum, a batch size of and weight decay of . Cross validation is used to pick an initial learning rate in . Path-SGD, GN and WN are learned as detailed in Section 6.1. All results are the average test accuracies over 5 training runs.
ENorm alone outperforms both BN and GN for any depth of the network. ENorm + BN outperforms all other methods, including WN + BN, by a good margin for more than intermediate layers. Note that here ENorm as well as ENorm + BN refers to ENorm in the uniform setup with . The results of the parameter study for different values and setups of the asymmetric scaling are similar to those of the MNIST setup, see Appendix E.2.
6.3 CIFAR-10 Fully Convolutional
We use the CIFAR-NV architecture as described by Gitman and Ginsburg (2017). Images are normalized by subtracting mean and dividing by standard deviation independently for each channel. During training, we use random crops and randomly flip the image horizontally. At test time, we use center crops. We split the train set into one training set (40,000 images) and one validation set (10,000 images). We train for 128 epochs using SGD and an initial learning rate cross-validated on a held-out set among , along with a weight decay of . The learning rate is then decayed linearly to . We use a momentum of 0.9. The weights are initialized with He’s scheme. In order to stabilize the training, we employ a BatchNorm layer at the end of the network after the FC layer for the Baseline and ENorm cases. For GN we cross-validate the number of groups among . All results are the average test accuracies over 5 training runs.
See Table 3. ENorm + BN outperforms all other methods, including WN + BN, by a good margin. Note that here ENorm refers to ENorm in the uniform setup with the parameter whereas ENorm + BN refers to the uniform setup with . A parameter study for different values and setups of the asymmetric scaling can be found in Appendix E.2.
The ImageNet dataset (Russakovsky et al., 2015) contains 1.3M training images and 50,000 validation images split into 1000 classes. We use the ResNet-18 model with type-C learnt skip connections as described in Section 4.
Our experimental setup closely follows that of GN (Wu and He, 2018). We train on 8 GPUs and compute the batch mean and standard deviation per GPU when evaluating BN. We use the Kaiming initialization for the weights (He et al., 2015b) and the standard data augmentation scheme of Szegedy et al. (2014). We train our models for 90 epochs using SGD with a momentum of 0.9. We adopt the linear scaling rule for the learning rate (Goyal et al., 2017) and set the initial learning rate to where the batch size is set to , or . As smaller batches lead to more iterations per epoch, we adopt a similar rule and adopt a weight decay of for and , for and for . We decay the learning rate quadratically (Gitman and Ginsburg, 2017) to and report the median error rate on the final 5 epochs. When using GN, we set the number of groups to and did not cross-validate this value as prior work (Wu and He, 2018) reports little impact when varying from to . In order for the training to be stable and faster, we added a BatchNorm at the end of the network after the FC layer for the Baseline and ENorm cases. The batch mean and variance for this additional BN are shared across GPUs. Note that the activation size at this stage of the network is , which is a negligible overhead (see Table 1).
We compare the Top 1 accuracy on a ResNet-18 when using no normalization scheme (Baseline), when using BN, GN and ENorm (our method). In both the Baseline and ENorm settings, we add a BN at the end of the network as described in Section 6.3. The results are reported in Table 4. The performance of BN decreases with small batches, which concurs with prior observations (Wu and He, 2018). Our method outperforms GN and BN for batch sizes ranging from to . GN presents stable results across batch sizes. Note that values of $c$ different from 1 did not yield better results. The training curves for a batch size of are presented in Figure 6. While BN and GN are faster to converge than ENorm, our method achieves better results after convergence in this case. Note also that ENorm overfits the training set less than BN and GN, but more than the Baseline case.
We applied ENorm to a deeper network (ResNet-50), but obtained unsatisfactory results. We observed that learnt skip-connections, even initialized to identity, make it harder to train without BN, even with careful layer-wise initialization or learning rate warmup. This would require further investigation.
7 Concluding remarks
We presented Equi-normalization, an iterative method that balances the energy of the weights of a network while preserving the function it implements. ENorm provably converges towards a unique equivalent network that minimizes the norm of its weights and it can be applied to modern CNN architectures. Using ENorm during training adds a much smaller computational overhead than BN or GN and leads to competitive performances in the FC case as well as in the convolutional case.
While optimizing an unbalanced network is hard (Neyshabur et al., 2015), the criterion we optimize to derive ENorm is likely not optimal regarding convergence or training properties. These limitations suggest that further research is required in this direction.
We thank the anonymous reviewers for their detailed feedback, which helped us to significantly improve the paper’s clarity and the experimental validation. We also thank Timothée Lacroix, Yann Ollivier and Léon Bottou for useful feedback on various aspects of this paper.
- Natural gradient works efficiently in learning. Neural Comput.. Cited by: §2.
- Normalization propagation: a parametric technique for removing internal covariate shift in deep networks. In Proceedings of the 33rd International Conference on Machine Learning, ICML’16. Cited by: §2.
- Layer normalization. CoRR. Cited by: §2.
- Large-scale machine learning with stochastic gradient descent. In COMPSTAT, Cited by: §1.
- Quo vadis, action recognition? A new model and the kinetics dataset. CoRR. Cited by: §1.
- Natural language processing (almost) from scratch. J. Mach. Learn. Res.. Cited by: §1.
- Natural neural networks. In Advances in Neural Information Processing Systems 28, pp. 2071–2079. Cited by: §6.1.
- Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res.. Cited by: §1.
- Comparison of batch normalization and weight normalization algorithms for the large-scale image classification. CoRR. Cited by: §6.3, §6.4.
- Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. Cited by: §1, §1, §2.
- Accurate, large minibatch SGD: training imagenet in 1 hour. CoRR abs/1706.02677. Cited by: §1, §6.4.
- How to start training: the effect of initialization and architecture. arXiv preprint. Cited by: §1.
- Mask R-CNN. CoRR. Cited by: §1, §1.
- Deep residual learning for image recognition. CoRR. Cited by: §C.2, §1, §1, §1, §4.3.
- Delving deep into rectifiers: surpassing human-level performance on imagenet classification. CoRR. Cited by: §1, §2, §6.4.
- Deep neural networks for acoustic modeling in speech recognition. Signal Processing Magazine. Cited by: §1.
- Improving neural networks by preventing co-adaptation of feature detectors. CoRR. Cited by: §1.
- Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. Cited by: §1.
- Densely connected convolutional networks. CoRR. Cited by: §1.
- Batch normalization: accelerating deep network training by reducing internal covariate shift. CoRR. Cited by: §1, §1, §2.
- Batch renormalization: towards reducing minibatch dependence in batch-normalized models. CoRR. Cited by: §2.
- Adam: A method for stochastic optimization. CoRR. Cited by: §1.
- ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, Cited by: §1, §1, §1.
- A simple weight decay can improve generalization. In Advances in Neural Information Processing Systems 4, Cited by: §1.
- Diagonal rescaling for neural networks. CoRR. Cited by: §2.
- Efficient backprop. In Neural Networks: Tricks of the Trade, This Book is an Outgrowth of a 1996 NIPS Workshop, Cited by: §2.
- Practical riemannian neural networks. CoRR. Cited by: §2.
- Optimizing neural networks with kronecker-factored approximate curvature. In Proceedings of the 32nd International Conference on Machine Learning, Cited by: §2.
- All you need is a good init. CoRR. Cited by: §1.
- Path-sgd: path-normalized optimization in deep neural networks. CoRR. Cited by: §1, §2, §2, §6.1, §7.
- Natural gradient revisited. CoRR. Cited by: §2.
- On the number of inference regions of deep feed forward networks with piece-wise linear activations. CoRR. Cited by: §1.
- Automatic differentiation in PyTorch. Cited by: Appendix D.
- On the expressive power of deep neural networks. In Proceedings of the 34th International Conference on Machine Learning, Cited by: §1.
- ImageNet large scale visual recognition challenge. Int. J. Comput. Vision. Cited by: §1.
- Weight normalization: A simple reparameterization to accelerate training of deep neural networks. CoRR. Cited by: §2.
- How does batch normalization help optimization? (no, it is not about internal covariate shift). CoRR. Cited by: §2.
- Mastering the game of go without human knowledge. Nature. Cited by: §1.
- Very deep convolutional networks for large-scale image recognition. CoRR. Cited by: §1.
- On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning. Cited by: §1.
- Going deeper with convolutions. CoRR. Cited by: §1, §6.4.
- Benefits of depth in neural networks. CoRR. Cited by: §1.
- Convergence of a block coordinate descent method for nondifferentiable minimization. J. Optim. Theory Appl.. Cited by: §B.1, §3.4.
- Instance normalization: the missing ingredient for fast stylization. CoRR. Cited by: §2.
- Group normalization. CoRR abs/1803.08494. Cited by: §1, §2, §6.4, §6.4.
Appendix A Illustration of the effect of Equi-normalization
We first apply ENorm to one randomly initialized fully connected network comprising 20 intermediary layers. All the layers have a size and are initialized following the Xavier scheme. The network has been artificially unbalanced as follows: all the weights of layer 6 are multiplied by a factor 1.2 and all the weights of layer 12 are multiplied by 0.8, see Figure 7. We then iterate our ENorm algorithm on the network, without training it, to see that it naturally re-balances the network, see Figure 8.
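The experiment described above can be imitated at small scale in plain Python. The sketch below is a stand-in under explicit assumptions, not the code behind Figures 7 and 8: it uses 20 square weight matrices of a made-up width of 6, the Glorot uniform variant of the Xavier scheme, $p = 2$, and 500 ENorm cycles. It records the Frobenius norm of each rescaled layer before and after balancing.

```python
import random

def xavier_matrix(rng, n_in, n_out):
    # Glorot/Xavier uniform initialization: U(-a, a) with a = sqrt(6 / (n_in + n_out)).
    a = (6.0 / (n_in + n_out)) ** 0.5
    return [[rng.uniform(-a, a) for _ in range(n_out)] for _ in range(n_in)]

def layer_norms(weights, d):
    # Frobenius norm of each rescaled matrix (row i divided by d[k][i],
    # column j multiplied by d[k + 1][j]).
    return [sum((w / d[k][i] * d[k + 1][j]) ** 2
                for i, row in enumerate(W) for j, w in enumerate(row)) ** 0.5
            for k, W in enumerate(weights)]

def enorm_cycle(weights, d):
    # One ENorm sweep with p = 2 over all hidden layers.
    for l in range(1, len(weights)):
        for i in range(len(d[l])):
            incoming = sum((weights[l - 1][j][i] / d[l - 1][j]) ** 2
                           for j in range(len(d[l - 1]))) ** 0.5
            outgoing = sum((weights[l][i][j] * d[l + 1][j]) ** 2
                           for j in range(len(d[l + 1]))) ** 0.5
            d[l][i] = (outgoing / incoming) ** 0.5

rng = random.Random(0)
width, depth = 6, 20  # the width is a hypothetical choice for this sketch
weights = [xavier_matrix(rng, width, width) for _ in range(depth)]
weights[5] = [[1.2 * w for w in row] for row in weights[5]]    # layer 6 scaled up
weights[11] = [[0.8 * w for w in row] for row in weights[11]]  # layer 12 scaled down

d = [[1.0] * width for _ in range(depth + 1)]
before = layer_norms(weights, d)
for _ in range(500):
    enorm_cycle(weights, d)
after = layer_norms(weights, d)

spread_before = max(before) / min(before)
spread_after = max(after) / min(after)
print(spread_before, spread_after)
```

For $p = 2$, a balanced network has equal incoming and outgoing norms at every hidden neuron, and summing the squares over a layer shows that consecutive layers then share the same Frobenius norm. The ratio between the largest and smallest layer norm should therefore approach one as the cycles proceed.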
Appendix B Proofs
B.1 Convergence of Equi-normalization
We now prove Theorem 1. We use the framework of block coordinate descent and first state a consequence of Theorem 4.1 of Tseng (2001). (Note that what Tseng denotes as a stationary point in his paper is actually defined as a point where the directional derivative is positive along every possible direction, i.e. a local minimum.)
Let be an open set and a real function of block variables with . Let be the starting point of the coordinate descent algorithm and the level set . We make the following assumptions:
is differentiable on ;
is compact ;
for each , each block coordinate function , where , has at most one minimum.
Under these assumptions, the sequence generated by the coordinate descent algorithm is defined and bounded. Moreover, every cluster point of is a local minimizer of .
We apply Theorem 2 to the function . This is possible because all the assumptions are verified as shown below. Recall that
is differentiable on the open set .
when . Let such that , . Let us show by induction that for all , , where and
For the first hidden layer, index . By assumption, every hidden neuron is connected at least to one neuron in the input layer. Thus, for every , there exists such that . Because and for all ,
For some hidden layer, index . By assumption, every hidden neuron is connected at least to one neuron in the previous layer. Thus, for every , there exists such that . Because ,
Using the induction hypothesis, we get .
Thus, because . By contraposition, .
Thus, there exists a ball such that implies . Thus, implies that and is bounded. Moreover, is closed because is continuous thus is compact and Assumption (2) is satisfied.
We next note that
has a unique minimum as shown in Section 3.3, see Equation (4). The existence and uniqueness of the minimum come from the fact that each hidden neuron is connected to at least one input and one output neuron; thus all the row and column norms of the hidden weight matrices are non-zero, as well as the column (resp. row) norms of (resp. ).
We prove that has at most one stationary point on under the assumption that each hidden neuron is connected either to an input neuron or to an output neuron, which is weaker than the general assumption of Theorem 1.
We first introduce some definitions. We denote the set of all neurons in the network by . Each neuron belongs to a layer and has an index in this layer. Any edge connects some neuron at layer to some neuron at layer , . We further denote by the set of hidden neurons belonging to layers . We define E as the set of edges whose weights are non-zero, i.e.
For each neuron , we define as the neurons connected to that belong to the previous layer.
We now show that has at most one stationary point on . Directly computing the gradient of and solving for zeros happens to be painful or even intractable. Thus, we define a change of variables as follows. We define as
We next define the shift operator such that, for every ,
and the padding operator as
We define the extended shift operator . We are now ready to define our change of variables. We define where
and observe that
so that its differential satisfies
Since is a diffeomorphism, its differential is invertible for any . It follows that if, and only if, . As is the composition of a strictly convex function, , and a linear injective function, (proof after Step 3), it is strictly convex. Thus has at most one stationary point, which concludes this step.
We prove that the sequence converges. Step 1 implies that the sequence is bounded and has at least one cluster point, as is continuous on the compact . Step 2 implies that the sequence has at most one cluster point. We then use the fact that any bounded sequence with exactly one cluster point converges to conclude the proof.
Let . Let us show by induction on the hidden layer index that for every neuron at layer , .
. Let be a neuron at layer . Then, there exists a path coming from an input neuron to through edge . By definition, and , hence . Since it follows that .
. Same reasoning using the fact that by the induction hypothesis.
The case where the path goes from neuron to some output neuron is similar.
B.2 Functional Equivalence
We show (5) by induction that for every layer , i.e.,
where (resp. ) is the intermediary network function associated with weights (resp. ). This holds for since by convention. If the property holds for some , then by (2) we have hence, since ,
The same equations hold for without the non-linearity .
Similarly, we obtain
Appendix C Extension of ENorm to CNNs
C.1 Convolutional layers
Let us consider two consecutive convolutional layers and , without bias. Layer has filters of size , where is the number of input features and is the kernel size. This results in a weight tensor of size . Similarly, layer has a weight matrix of size . We then perform axis-permutation and reshaping operations to obtain the following 2D matrices:
For example, we first reshape as a 2D matrix by collapsing its last 3 dimensions, then transpose it to obtain . We then jointly rescale those 2D matrices using rescaling matrices as detailed in Section 3 and perform the inverse axis permutation and reshaping operations to obtain a right-rescaled weight tensor and a left-rescaled weight tensor . See Figure 3 for an illustration of the procedure. This matched rescaling preserves the function implemented by the composition of the two layers, whether or not they are interleaved with a ReLU. It can be applied to any two consecutive convolutional layers, whatever their stride and padding parameters. Note that when the kernel size is in both layers, we recover the fully-connected case of Figure 1.
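The reshape, rescale and inverse-reshape procedure can be sketched as follows. This is a minimal sketch under stated assumptions rather than the authors' code: weight tensors are laid out as (output channels, input channels, kernel height, kernel width), the norm is taken with p = 2, only the stride-1, unpadded case is exercised, and the helpers `conv2d`, `balance_conv_pair` and `two_layers` are hypothetical names introduced for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def conv2d(x, w):
    # Valid, stride-1 cross-correlation. x: (C_in, H, W); w: (C_out, C_in, k, k).
    k = w.shape[-1]
    windows = sliding_window_view(x, (k, k), axis=(1, 2))  # (C_in, H', W', k, k)
    return np.einsum("cijuv,ocuv->oij", windows, w)


def balance_conv_pair(w1, w2):
    # w1: (C_mid, C_in, k1, k1), w2: (C_out, C_mid, k2, k2).
    c_mid = w1.shape[0]
    c_out, _, k2, _ = w2.shape
    # Collapse the last 3 dimensions of w1, then transpose: (C_in*k1*k1, C_mid).
    m1 = w1.reshape(c_mid, -1).T
    # Bring the shared channel axis first, then collapse: (C_mid, C_out*k2*k2).
    m2 = w2.transpose(1, 0, 2, 3).reshape(c_mid, -1)
    # One coefficient per intermediate channel (assumed closed form, p = 2).
    d = np.sqrt(np.linalg.norm(m2, axis=1) / np.linalg.norm(m1, axis=0))
    m1 = m1 * d            # right-rescale the first layer
    m2 = m2 / d[:, None]   # left-rescale the second layer
    # Inverse reshaping and axis permutation.
    w1_new = m1.T.reshape(w1.shape)
    w2_new = m2.reshape(c_mid, c_out, k2, k2).transpose(1, 0, 2, 3)
    return w1_new, w2_new


def two_layers(x, w1, w2):
    # Two consecutive bias-free convolutions interleaved with a ReLU.
    return conv2d(np.maximum(conv2d(x, w1), 0.0), w2)


rng = np.random.default_rng(0)
w1 = rng.normal(size=(6, 3, 3, 3))
w2 = rng.normal(size=(4, 6, 3, 3))
x = rng.normal(size=(3, 12, 12))

w1_b, w2_b = balance_conv_pair(w1, w2)
y_ref = two_layers(x, w1, w2)
y_bal = two_layers(x, w1_b, w2_b)
print("max output deviation:", np.abs(y_ref - y_bal).max())
```

After balancing, the norm of the weights entering each intermediate channel should equal the norm of the weights leaving it, and the output of the two-layer composition should match the original up to floating-point error.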
We now consider an elementary block of a ResNet-18 architecture as depicted in Figure 3. In order to maintain functional equivalence, we only consider ResNet architectures of type C as defined in (He et al., 2015a), where all shortcuts are learned convolutions.
Structure of the rescaling process.
Let us consider a ResNet block . We first left-rescale the Conv1 and ConvSkip weights using the rescaling coefficients calculated between blocks and . We then rescale the two consecutive layers Conv1 and Conv2 with their own specific rescaling coefficients, and finally right-rescale the Conv2 and ConvSkip weights using the rescaling coefficients calculated between blocks and .
Computation of the rescaling coefficients.
Two types of rescaling coefficients are involved, namely those between two convolution layers inside the same block and those between two blocks. The rescaling coefficients between the Conv1 and Conv2 layers are calculated as explained in Section 4.1. Then, in order to calculate the rescaling coefficients between two blocks, we compute equivalent block weights to deduce rescaling coefficients.
We empirically explored some methods to compute the equivalent weight of a block using electrical network analogies. The most accurate method we found is to compute the equivalent weight of the Conv1 and Conv2 layers, i.e., to express the succession of two convolution layers as a single convolution layer denoted as ConvEquiv (series equivalent weight), and in turn to express the two remaining parallel layers ConvEquiv and ConvSkip again as a single convolution layer (parallel equivalent weight). However, it is not possible to obtain a series equivalent weight, in particular when the convolution layers are interleaved with ReLUs. Therefore, we approximate the equivalent weight as the parallel equivalent weight of the Conv1 and ConvSkip layers.
Appendix D Implicit Equi-normalization
In Section 3, we defined an iterative algorithm that minimizes the global norm of the network
As detailed in Algorithm 2, we alternate SGD and ENorm steps during training. We now derive an implicit formulation of this algorithm that we call Implicit Equi-normalization. Let us fix . We denote by the cross-entropy loss for the training sample and by the weight decay regularizer (20). The loss function of the network is written as
where is a regularization parameter. We now consider both the weights and the rescaling coefficients as learnable parameters and rely on automatic differentiation packages to compute the derivatives of with respect to the weights and to the rescaling coefficients. We then simply train the network by performing iterative SGD steps and updating all the learned parameters. Note that, by design, the derivative of with respect to any rescaling coefficient is zero. Although the additional overhead of implicit ENorm is theoretically negligible, we observed an increase of the training time of a ResNet-18 by roughly using PyTorch 0.4 (Paszke et al., 2017). We refer to Implicit Equi-normalization as ENorm-Impl and to Explicit Equi-normalization as ENorm.
We performed early experiments for the CIFAR10 fully-connected case. ENorm-Impl generally performs better than the baseline but does not outperform explicit ENorm, in particular when the network is deep. We follow the same experimental setup as previously, except that we additionally cross-validated . We also initialize all the rescaling coefficients to one. Recall that ENorm denotes explicit Equi-normalization while ENorm-Impl denotes Implicit Equi-normalization. We did not investigate learning the weights and the rescaling coefficients at different speeds (e.g. with different learning rates or momentum). This may explain in part why ENorm-Impl generally underperforms ENorm in those early experiments.
Appendix E Experiments
We perform sanity checks to verify our implementation and give additional results.
E.1 Sanity checks
We apply our Equi-normalization algorithm to a ResNet architecture by integrating all the methods presented in Section 4. We perform three sanity checks before proceeding to experiments. First, we randomly initialize a ResNet-18 and verify that it outputs the same probabilities before and after balancing. Second, we randomly initialize a ResNet-18, perform successive ENorm cycles (without any training) and observe that the norm of the weights in the network decreases and then converges, as theoretically proven in Section 3 (see Figure 9).
E.2 Asymmetric scaling: uniform vs. adaptive
For the uniform setup, we test for three different values of , without BN: (uniform setup), (uniform setup), (uniform setup). We also test the adaptive setup. The adaptive setup outperforms all other choices, which may be due to the strong bottleneck structure of the network. With BN, the dynamics are different and the results are much less sensitive to the values of (see Figures 12 and 12).
CIFAR10 Fully Convolutional.
For the uniform setup, we test for three different values of , without BN: (uniform setup), (uniform setup), (uniform setup). We also test the adaptive setup (see Table 5). Once again, the dynamics with or without BN are quite different. With or without BN, performs the best, which may be linked to the fact that the ReLUs are cutting energy during each forward pass. With BN, the results are less sensitive to the values of .
|Method|Test top-1 accuracy|
|ENorm + BN uniform|91.85|
|ENorm + BN uniform|90.95|
|ENorm + BN uniform|90.89|
|ENorm + BN adaptive|90.79|