Deep Neural Networks (DNNs) have begun to thrive in the field of automation systems, owing to recent advances in standardizing aspects such as architecture, optimization techniques, and regularization. In this paper, we take a step towards a better understanding of Spectral Normalization (SN) and its potential for standardizing the regularization of a wider range of deep learning models, following an empirical approach. We conduct several experiments to study the training dynamics of spectral normalized networks in comparison with the ubiquitous Batch Normalization (BN), and show that SN increases the gradient sparsity and controls the gradient variance. Furthermore, we show that SN suffers from a phenomenon we call the mean-drift effect, which degrades its performance. We then propose a weight reparameterization called Mean Spectral Normalization (MSN) to resolve the mean drift, thereby significantly improving the network's performance. Our model performs about 16% faster than BN in practice, and has fewer trainable parameters. We also show the performance of our MSN for small, medium, and large CNNs (a 3-layer CNN, VGG7 and DenseNet-BC, respectively) and for unsupervised image generation tasks using Generative Adversarial Networks (GANs), to evaluate its applicability to a broad range of embedded automation tasks.
The growing application of Deep Neural Networks (DNNs) to robot automation raises research questions that typically differ from those of traditional computer vision. While research on DNNs mainly focuses on building architectures, developing loss functions and studying their internal mechanisms, standardizing DNN models has been the main motivation and an essential criterion for their successful application to robot automation. The reason is that traditional robot automation relies on well-understood white-box models, whereas DNNs are black-box models for which a consensus about their internal behaviour and dynamics is still being reached. By standardizing, we mean establishing a basic, well-behaved framework of architectures, loss functions, regularization techniques and activation functions for DNN models.
To reinforce our motivation, we provide some recent examples of applications of DNNs in automation tasks. In industrial automation, DNNs have found applications that include fault diagnosis [1], combustion optimization [2], welding fault detection [3], traffic control [4], power line inspection [5] and spectrum sensing [6]. Even Generative Adversarial Networks (GANs) have attracted some attention and have been applied in practice beyond the standard unsupervised image generation or translation tasks, for example in fault detection [7]. The surge in interest in and applications of DNNs can be explained as follows: unlike traditional/classical automation tasks, where the features or control elements of the task are usually preset and their dynamics/state changes are completely characterized analytically, DNNs try to capture the relevant features and their dynamics automatically, given nominally pre-processed data. This automatic feature selection and extraction, which arises from their hierarchical structure, together with their ability to train on modern hardware over thousands of data points, are the foremost reasons for their current popularity and success.
The automation systems discussed above often use small to medium-sized neural networks owing to speed and hardware/cost constraints. DNNs are over-parameterized models, in the sense that they have a large number of trainable parameters (typically hundreds of thousands) compared to the size of the dataset (tens of thousands). Though this over-parameterization helps greatly in optimizing the network weights [8], regularization methods are required to improve the generalization and stability of the network during training. Many regularization methods have been developed to address this problem, of which the class called weight reparameterization has proven quite successful.
Weight reparameterization techniques like Batch Normalization (BN) [9], Weight Normalization (WN) [10], Layer Normalization (LN) [11] are implicit regularization methods that restrict the capacity of the over-parameterized network by normalizing/reparameterizing the network weights. Among them, BN has established itself as a very effective component of almost all modern DNNs. This is backed by its stability over a wide range of learning rates, ability to train over large minibatch sizes and faster convergence.
Although BN works well for almost all neural network architectures, it is overkill for small to medium-sized networks, as it introduces additional trainable parameters. Against this background, we investigate the application of spectral normalization (SN) to such networks, since, unlike batch normalization, it requires no additional parameters. SN [12] is a recently introduced technique for normalizing the Lipschitz constant of the intermediate layers of deep neural networks, originally proposed for Wasserstein GANs. We have found, however, that SN performs poorly compared to BN for small to medium-sized networks, and we identify the cause as an effect we call mean-drift. We rectify this effect with our proposed Mean Spectral Normalization (MSN), bringing the performance on par with that of BN.
We demonstrate empirically that our MSN method works across a wide range of model depths with fewer parameters and performs on par with BN. Most recent applications of neural networks to automation tasks use small to medium-sized networks with 3-20 layers, usually convolutional or fully connected layers [13]. Almost every neural network model employs BN (or a slight variant of it) for regularization and training stabilization. We therefore extend our ideas to a wide range of models, even to very deep networks, with greatly improved performance. Through our method, we propose to standardize the regularization aspect of DNNs in a way that is applicable to robot automation tasks. Our contributions can be summarized as follows:
We provide empirical results for the sparsity of gradients in spectral normalized networks (refer to Fig. 1). By bounding the gradient magnitude of the activations (Lipschitz normalization), spectral normalized networks yield a much sparser network compared to the sparsity induced by the rectified linear units (ReLU).
We show that by controlling the mean layer singular values, spectral norm offers better utilization of feature dimensions, unlike other methods such as weight normalization.
We identify the mean-drift effect to be a major cause for the diminishing performance of SN as a regularization technique for small and medium sized networks.
We propose a modified SN technique called Mean Spectral Normalization (MSN) to correct the mean-drift and improve the performance of spectral normalized networks for small, medium and large neural networks.
The structure of this paper is as follows: Section II formally introduces the BN and SN methods; Section III introduces our mean spectral normalization; and Section IV presents our experimental results (focusing on image-related tasks) and various empirical observations. Note that in this paper we follow the current trend of seeking empirical insights into deep learning, to provide a solid experimental footing for understanding the dynamics of SN.
In this section we discuss batch normalization and spectral normalization techniques for convolutional neural networks, along with some background on other weight reparameterization methods. The normalization methods discussed here fall under a subclass of regularization methods in which the network parameters are normalized based on some norm of those parameters, which limits the network's capacity. For example, normalizing by a vector norm tends to limit the parameter values to lie on a unit ball centered at the origin, i.e., to be closer to zero. The key insight here is that a normalization method makes the network invariant to the scaling of the weights. This makes the network more robust to new data points and to parameter initialization strategies. This is true for all currently used normalization techniques as well as our proposed one.
Batch Normalization (BN), or simply batch norm, was initially introduced to reduce the internal covariate shift (ICS) in DNNs. The internal covariate shift is the phenomenon in which the distribution of the activations of a layer shifts due to the weight updates in the previous layers during training. Batch norm rectifies this problem by standardizing (z-score normalization) the activations $x$ of the intermediate layers to zero mean and unit variance, and rescaling them using the affine parameters $\gamma$ and $\beta$ along each channel with respect to all the pixels/points in the input tensor:
$$\hat{x} = \gamma \, \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} + \beta \tag{1}$$
where $\mu_B$ and $\sigma_B^2$ are the minibatch mean and variance of the channel, and $\epsilon$ is a small value added for numerical stability. By standardizing the layer activations, we essentially remove the dependencies on the previous layer updates. Rescaling with the learnable parameters ($\gamma$, $\beta$), called the scaling factor and bias respectively, provides flexibility in choosing appropriate weights during training. Note that the batch normalized activations have a norm that is independent of the data and depends only on the effective layer dimension and the affine scaling $\gamma$. The effective layer dimension is determined by the dimensions $b \times h \times w \times d$ of the input tensor, where $b$ is the minibatch size and $h$, $w$, $d$ are the height, width and depth (number of channels) of the input data.
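As a rough illustration of the standardize-then-rescale step described above, the following plain-Python sketch normalizes the activations of a single channel. The function name `batch_norm_channel`, the parameter names `gamma`, `beta` and `eps`, and the sample values are assumptions made for this example; a practical implementation would operate on tensors and keep running statistics for inference.

```python
import math


def batch_norm_channel(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Standardize one channel's activations, then apply the affine map."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n  # biased minibatch variance
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]


# All activations of one channel in a minibatch, flattened (hypothetical values).
channel = [0.5, 2.0, -1.0, 3.5, 0.0, 1.0]
out = batch_norm_channel(channel, gamma=2.0, beta=0.1)

out_mean = sum(out) / len(out)                           # equals beta
out_norm = math.sqrt(sum((o - 0.1) ** 2 for o in out))   # norm before the shift
```

Up to the effect of `eps`, `out_norm` equals `gamma` times the square root of the number of points in the channel, which is one way to see why the norm of batch normalized activations does not depend on the data.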
The concrete reasoning behind the exceptional performance of BN is still being investigated. Recent findings include preventing gradient explosion, improving optimization by smoothing the loss landscape [14] and, most importantly, improving the Lipschitzness of the layer [15], i.e., the gradients become more concentrated around the mean. This reduction of gradient variance has been accepted [16] as one of the core reasons for the success of BN. The idea of controlling the Lipschitzness of the network motivated us to probe a related weight reparameterization technique, SN. Furthermore, in practice, BN is difficult to accelerate because it is memory-bandwidth bound. Specifically, BN requires two passes through the input data, first to compute the minibatch statistics and then to normalize the output, and this may consume up to a quarter of the total training time for large networks [17]. Other similar normalization methods such as Layer Normalization and Instance Normalization [18] are slight variations of BN that normalize across different dimensions of the output, such as channels, layers or spatial dimensions. As such, they all suffer from the same drawbacks as BN.
Spectral Normalization (SN) [12] essentially restricts the Lipschitz constant of the network to unity by restricting the spectral norm of each layer. Recall that a function $f$ is Lipschitz if $\|f(x_1) - f(x_2)\| \le K \|x_1 - x_2\|$ for all $x_1, x_2$, where $K$ is the Lipschitz norm of $f$. In other words, small changes in the input of the function cause correspondingly small changes in its output, so the magnitude of its gradients is bounded. From its definition, the Lipschitz constant of a given intermediate layer of a neural network, whose activations are given by $a = \phi(Wx)$, is equal to the spectral norm of the weight matrix $W$. Here, $\phi$ is the activation function. The SN method is then defined as follows:
$$\hat{W}_{SN} = \frac{W}{\sigma(W)} \tag{2}$$
where $\sigma(W)$ is the spectral norm ($L_2$ matrix norm) of the weight matrix $W$, given by
$$\sigma(W) = \max_{h \neq 0} \frac{\|Wh\|_2}{\|h\|_2} = \max_{\|h\|_2 \le 1} \|Wh\|_2 \tag{3}$$
Again, from its definition, the spectral norm is essentially the largest singular value of the matrix $W$. Furthermore, the spectral normalization of each layer applies to the weights of the layer and not to the activations, similar to weight normalization techniques. This is a crucial distinction from BN, which applies to the activations. Since the weights are far fewer than the activations of the intermediate layers, SN is often computationally faster than BN. In practice, this means that SN is not memory-bandwidth bound, unlike BN. An important caveat is that the Lipschitz norm of the activation function used must be equal to $1$. Therefore, we are limited to activations such as ReLU and leaky ReLU (for a proof, refer to Lemma A.1 in the Appendix of [19]).
In contrast to spectral norm regularization [20], which penalizes the spectral norm by adding an explicit regularization term to the loss function, the layer weights are simply divided by their corresponding spectral norm in SN. Furthermore, convolutional neural networks usually have fewer weights compared to pre-activations. Therefore, SN is computationally much cheaper, and does not introduce any additional parameters to be trained as evident from Eq. (2).
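The division of the weights by their spectral norm can be sketched in a few lines. The example below is a dependency-free illustration, assuming the spectral norm is estimated by power iteration; the helper names (`matvec`, `normalize`, `spectral_norm`, `spectral_normalize`), the fixed start vector and the toy matrix are all assumptions of this sketch rather than the authors' code.

```python
import math


def matvec(M, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def normalize(x, eps=1e-12):
    norm = math.sqrt(sum(v * v for v in x))
    return [v / (norm + eps) for v in x]


def spectral_norm(W, n_iters=50):
    """Estimate the largest singular value of W by power iteration."""
    Wt = [list(col) for col in zip(*W)]
    # Fixed start vector for reproducibility; a random vector is the usual
    # choice. It must not be orthogonal to the leading singular vector.
    u = normalize([1.0] * len(W))
    for _ in range(n_iters):
        v = normalize(matvec(Wt, u))   # right singular vector estimate
        u = normalize(matvec(W, v))    # left singular vector estimate
    return sum(a * b for a, b in zip(u, matvec(W, v)))  # sigma ~ u^T W v


def spectral_normalize(W, n_iters=50):
    """Return W divided by its estimated spectral norm."""
    sigma = spectral_norm(W, n_iters)
    return [[w / sigma for w in row] for row in W]


W = [[3.0, 0.0], [0.0, 1.0]]        # toy weights with singular values 3 and 1
W_sn = spectral_normalize(W)        # largest singular value is now about 1
```

No trainable parameter is introduced by this step: the only extra state a layer might keep is the vector `u`, which is a buffer rather than a learned weight.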
In this section, we provide some stronger theoretical motivation for the use of the spectral norm for regularizing DNNs. Recent theoretical insights [19], [21] into the learning capacity and generalizability of neural networks have shown that these characteristics can be bounded by the network's spectral complexity. The spectral complexity of a given neural network is given by
(4) |
which is essentially the product of the spectral norm of the weight matrices of all the layers in the network times a correction factor dependent on those weight matrices.
For a given neural network function , computed as where is the activation function at layer and a dataset , drawn i.i.d from some data distribution, we have
(5) |
where is the risk associated with the network, defined as the expectation of the loss.
The above theorem states that the generalization error can be reduced by reducing the upper bound, given by the spectral complexity of the network (for a complete proof of Theorem 1, refer to [19]). Additionally, recent studies [20], [22] have enforced such Lipschitz continuity in neural networks, albeit through explicit regularization methods, in contrast to our proposed implicit technique.
The empirical motivation for our proposed Mean Spectral Normalization is the reduced performance of SN for small and medium-sized networks. Through our experiments, we observed the reason to be the gradual, uncontrolled drift of the layer mean during training (refer to Fig. 4). We hypothesize that the mean drift is directly related to the internal covariate shift, where the distribution of layer activations changes during training. This can be clearly observed in Fig. 3, where the shift in some selected layers is shown. It is evident that the spectral norm sufficiently restricts the variance of the distribution of activations but causes their mean to drift during training. The mean-drift is also observable in batch normalized networks, but there the drift is controlled by the bias during training. Therefore, the rapid and uncontrolled drift of the mean of the activation distribution is the foremost cause of the diminished performance of spectral normalized networks. We resolve this mean-drift by proposing a modification to the original SN, called mean spectral normalization.
We explore the idea of combining SN with a part of BN, which we call Mean Spectral Normalization (MSN). In this method, we perform spectral normalization on the weights and then subtract the minibatch mean from the pre-activations, as in BN:
$$y = \frac{W}{\sigma(W)} \, x \tag{6}$$
$$\hat{y} = y - \mathbb{E}_B[y] + \beta \tag{7}$$
where $\sigma(W)$ is the spectral norm of the weight matrix $W$, $y$ is the pre-activation for the given layer, $\mathbb{E}_B[y]$ is its minibatch mean, and $\beta$ is the external bias learned during training. The activation is then given by passing $\hat{y}$ through the activation function $\phi$ as $a = \phi(\hat{y})$. By subtracting the mean, we create a normalization method that restricts the variance as well as the mean of the activation distribution, thereby resolving the problem of mean-drift. Moreover, the mean correction introduces only a small computational overhead compared to the full BN.
During training, a running average of the minibatch mean is stored for use on validation data. In practice, the spectral norm of the weight matrix can be computed efficiently, with negligible overhead, using power iteration (as pointed out in [12]). During stochastic gradient descent, because the weights change slowly with each update, a single power iteration starting from the vectors retained from the previous training iteration is sufficient, making MSN computationally more efficient than BN. By recentering the pre-activations, their dependency on the inputs of the neurons is completely detached. This decoupling of the norm of the pre-activations from the input vectors has been shown to improve the rate of convergence [10].
We distinguish our MSN from weight normalization with mean-only batch norm [10] by the fact that, unlike weight normalization, the spectral norm does not reduce the rank of the weight matrix and can therefore leverage a wider range of features to improve performance. Weight normalization, on the other hand, regularizes the network by forcing it to produce weight matrices that lie (approximately) in low-dimensional vector spaces, compromising the feature dimensions. Besides, by dividing by the Frobenius norm of the weights, weight normalization enforces a stronger restriction on the layer weights, often causing over-fitting. This was empirically shown in [17], where even other methods such as dropout and weight decay failed to improve the generalization of the weight normalized network. Our proposed MSN, however, has a stronger regularizing effect than weight normalization, as it restricts the layer weights in their gradient space, effectively regulating their learnability.
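The forward pass of an MSN layer, as described above, might look roughly as follows. This is a minimal plain-Python sketch under several assumptions: a fully connected layer, one bias and one running mean per output unit, an exponential running average with a hypothetical `momentum`, and one warm-started power iteration per training step. The class name `MSNLinear` and all attribute names are invented for illustration, and the activation function is left to the caller.

```python
import math


def matvec(M, x):
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def normalize(x, eps=1e-12):
    norm = math.sqrt(sum(v * v for v in x))
    return [v / (norm + eps) for v in x]


class MSNLinear:
    """Hypothetical fully connected layer with mean spectral normalization."""

    def __init__(self, W, momentum=0.1):
        self.W = W                              # out_dim x in_dim weights
        self.beta = [0.0] * len(W)              # learnable bias per output unit
        self.u = normalize([1.0] * len(W))      # persistent power-iteration vector
        self.running_mean = [0.0] * len(W)
        self.momentum = momentum

    def _sigma(self, update_u):
        # One power-iteration step, warm-started from the stored vector u.
        Wt = [list(col) for col in zip(*self.W)]
        v = normalize(matvec(Wt, self.u))
        Wv = matvec(self.W, v)
        u_new = normalize(Wv)
        if update_u:
            self.u = u_new
        return sum(a * b for a, b in zip(u_new, Wv))

    def forward(self, batch, training=True):
        sigma = self._sigma(update_u=training)
        # Pre-activations y = (W / sigma) x for every sample in the minibatch.
        Y = [[y / sigma for y in matvec(self.W, x)] for x in batch]
        if training:
            mean = [sum(col) / len(Y) for col in zip(*Y)]
            self.running_mean = [(1 - self.momentum) * r + self.momentum * m
                                 for r, m in zip(self.running_mean, mean)]
        else:
            mean = self.running_mean            # stored statistics at inference
        # Recentre and add the learned bias; the activation is applied afterwards.
        return [[y - m + b for y, m, b in zip(row, mean, self.beta)] for row in Y]


layer = MSNLinear([[3.0, 0.0], [0.0, 1.0]])
batch = [[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]]
out = layer.forward(batch)                      # each output unit has zero batch mean
```

Because `u` is carried over between calls, the spectral norm estimate sharpens over successive training steps even though each step performs only one iteration.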
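The gradient of MSN can be computed as follows. Consider the gradient of the layer weight after SN, $\hat{W} = W / \sigma(W)$, w.r.t. the weight element $W_{ij}$:
$$\frac{\partial \hat{W}}{\partial W_{ij}} = \frac{1}{\sigma(W)} E_{ij} - \frac{1}{\sigma(W)^2} \frac{\partial \sigma(W)}{\partial W_{ij}} W \tag{8}$$
$$= \frac{1}{\sigma(W)} \left( E_{ij} - [u_1 v_1^\top]_{ij} \, \hat{W} \right) \tag{9}$$
where $E_{ij}$ is the matrix that has $1$ in its $(i,j)$-th entry and zero elsewhere, and $u_1$ and $v_1$ are the first left and right singular vectors of $W$ respectively. Note that the first columns of the left and right singular matrices of $W$ correspond to the largest singular value of $W$. Therefore, the gradient of the largest singular value with respect to a given element $W_{ij}$ is the product of the $i$-th entry of $u_1$ and the $j$-th entry of $v_1$, i.e., $\partial \sigma(W) / \partial W_{ij} = [u_1 v_1^\top]_{ij}$.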
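As a sanity check on this derivative, the sketch below compares the closed form, in which the derivative of $W/\sigma(W)$ with respect to $W_{ij}$ is $(E_{ij} - u_{1,i} v_{1,j} \hat{W}) / \sigma(W)$, against a central finite difference. It is a plain-Python illustration; the helper names, the toy matrix and the step size `h` are assumptions, and the check presumes the two largest singular values are distinct.

```python
import math


def matvec(M, x):
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def normalize(x):
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]


def top_singular_triplet(W, n_iters=200):
    """Leading singular value and singular vectors of W via power iteration."""
    Wt = [list(col) for col in zip(*W)]
    u = normalize([1.0] * len(W))
    for _ in range(n_iters):
        v = normalize(matvec(Wt, u))
        u = normalize(matvec(W, v))
    sigma = sum(a * b for a, b in zip(u, matvec(W, v)))
    return sigma, u, v


def spectral_normalize(W):
    sigma, _, _ = top_singular_triplet(W)
    return [[w / sigma for w in row] for row in W]


def analytic_grad(W, i, j):
    """Closed form: (E_ij - u1[i] * v1[j] * W_hat) / sigma."""
    sigma, u1, v1 = top_singular_triplet(W)
    W_hat = [[w / sigma for w in row] for row in W]
    return [[((1.0 if (r, c) == (i, j) else 0.0) - u1[i] * v1[j] * W_hat[r][c]) / sigma
             for c in range(len(W[0]))] for r in range(len(W))]


def numeric_grad(W, i, j, h=1e-6):
    """Central finite difference of the normalized weights w.r.t. W_ij."""
    W_plus = [row[:] for row in W]
    W_minus = [row[:] for row in W]
    W_plus[i][j] += h
    W_minus[i][j] -= h
    P, M = spectral_normalize(W_plus), spectral_normalize(W_minus)
    return [[(p - m) / (2 * h) for p, m in zip(rp, rm)] for rp, rm in zip(P, M)]


W = [[2.0, 1.0], [0.5, 1.5]]    # hypothetical weights with distinct singular values
ga, gn = analytic_grad(W, 0, 1), numeric_grad(W, 0, 1)
max_err = max(abs(a - n) for ra, rn in zip(ga, gn) for a, n in zip(ra, rn))
```

A small `max_err` indicates that the analytic expression and the numerical derivative agree for this matrix.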
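Now, the gradient of the loss $\mathcal{L}$ with respect to the MSN pre-activation, propagated through the mean subtraction, can be found in a straightforward manner:
$$\frac{\partial \mathcal{L}}{\partial y} = \frac{\partial \mathcal{L}}{\partial \hat{y}} - \mathbb{E}_B\left[ \frac{\partial \mathcal{L}}{\partial \hat{y}} \right] \tag{10}$$
where $y$ and $\hat{y}$ denote the pre-activation before and after the mean subtraction, and $\mathbb{E}_B$ denotes the minibatch mean. The recentering of the pre-activations has a much lower computational overhead than classical BN, where second-order batch statistics are required.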
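The backward pass through a minibatch mean subtraction is simple enough to verify numerically. The sketch below assumes the mean is taken over the minibatch for a single unit, in which case the incoming gradient is itself recentred; the function names and the quadratic test loss are assumptions chosen only to make the check self-contained.

```python
def mean_center_forward(y, beta=0.0):
    """Subtract the minibatch mean of one unit and add a bias."""
    mean = sum(y) / len(y)
    return [v - mean + beta for v in y]


def mean_center_backward(grad_out):
    """Gradient w.r.t. the inputs: the upstream gradient minus its batch mean."""
    g_mean = sum(grad_out) / len(grad_out)
    return [g - g_mean for g in grad_out]


def loss(y_hat):
    # Arbitrary smooth test loss (an assumption for this check only).
    return sum(0.5 * (k + 1) * v * v for k, v in enumerate(y_hat))


y = [0.3, -1.2, 2.0, 0.7]                               # one unit over a minibatch of 4
y_hat = mean_center_forward(y, beta=0.5)
grad_out = [(k + 1) * v for k, v in enumerate(y_hat)]   # dL/dy_hat for the test loss
grad_in = mean_center_backward(grad_out)                # dL/dy

# Central finite differences as an independent reference.
h = 1e-6
numeric = []
for k in range(len(y)):
    y_plus, y_minus = y[:], y[:]
    y_plus[k] += h
    y_minus[k] -= h
    diff = loss(mean_center_forward(y_plus, 0.5)) - loss(mean_center_forward(y_minus, 0.5))
    numeric.append(diff / (2 * h))

max_err = max(abs(a - n) for a, n in zip(grad_in, numeric))
```

Only first-order statistics (a mean) appear in both passes, which is consistent with the lower overhead compared to a full BN backward pass.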
To investigate the training dynamics of the normalization techniques discussed thus far, we use a set of three different convolutional neural networks: a 3-layer CNN (without pooling and dropout), VGG-7, and the 100-layer DenseNet-BC [23]. These networks were trained on the standard MNIST (http://yann.lecun.com/exdb/mnist/), SVHN (http://ufldl.stanford.edu/housenumbers/) and CIFAR10 (https://www.cs.toronto.edu/~kriz/cifar.html) datasets, respectively. In this work, our core focus is on improving networks for image-based tasks, as the applications of DNNs in automation are predominantly image-based. We train these networks with the Adam optimizer over a range of initial learning rates, for sufficiently many epochs that learning plateaus. We always report the best results among those learning rates. The code for all the experiments, plots and trained models is available in the GitHub repository https://github.com/AntixK/mean-spectral-norm. Moreover, the choice of networks was motivated by their widespread application in real-world object recognition and segmentation.
In this section, we discuss our empirical observations of SN during training and the performance of our proposed normalization technique. First, we present the performance comparison of BN, SN and MSN networks for all three models in Fig. 2, to illustrate the effectiveness of our proposed MSN weight reparameterization. We observe that MSN greatly improves upon SN for small and medium-sized networks (3-layer CNN and VGG-7, respectively) and provides performance comparable to that of BN. Table I provides a comparison of the test accuracy of all the models.
Inducing sparsity - Sparsity in DNNs has usually been connected to robustness, with the reasoning that the network automatically determines the right subset of parameters required to capture the high-level information in the data. From Fig. 1, it is evident that the SN and MSN methods steadily improve the gradient sparsity of the network during training, while the gradient sparsity of the batch normalized network saturates. One of the advantages of such sparse gradients is that they are well suited for distributed training of large neural networks [24], as little gradient information has to be shared between the sub-networks. Such distributed training of networks provides exciting opportunities for the distributed training of autonomous systems.
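One simple way to quantify gradient sparsity is the fraction of gradient entries that are numerically zero. The text does not state how sparsity was measured, so the definition, the tolerance `tol` and the function name below are assumptions of this sketch, and the gradient values are made up.

```python
def gradient_sparsity(grads, tol=1e-8):
    """Fraction of gradient entries whose magnitude is at most tol."""
    flat = [g for layer in grads for g in layer]
    return sum(1 for g in flat if abs(g) <= tol) / len(flat)


# Hypothetical per-layer gradients, flattened to lists.
grads = [[0.0, 0.2, 0.0, -0.1], [0.0, 0.0, 1e-12, 0.3]]
sparsity = gradient_sparsity(grads)   # 5 of the 8 entries fall below the tolerance
```

Tracking this value once per epoch would give a curve that can be compared across normalization methods.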
Mean Drift Correction - As discussed before, the mean-drift is a consequence of the internal covariate shift, observed in all neural networks in general. From Fig. 3 and Fig. 4, it is clear that the BN and MSN methods control the drift of the mean compared to SN. We also observe that the mean-drift is always in the negative region. A large negative layer mean causes the gradients to be extremely small after the LeakyReLU activation used in all our models. This, in addition to the already sparsified gradients, effectively reduces the learning capacity of the networks by producing many dead neurons. BN avoids this problem by recentering with a learnable bias (refer to Eq. 1). In MSN, we follow a similar approach, where this recentering (refer to Eqs. 6-7) avoids the mean-drift, striking a balance between creating a robust sparse network and preventing too many dead neurons. Fig. 4 confirms our hypothesis: MSN rectifies the mean-drift and matches the performance of batch normalized networks.
Lipschitzness of the network - From its definition, SN controls the Lipschitz constant of the hidden layers of the neural network to be $1$. Empirically, we observe this in our spectral normalized models, as shown in Fig. 5. Neural networks with the SN and MSN methods concentrate the gradients around the mean with much smaller variance than batch normalized neural networks with the same learning rate. As noted in [16], neural networks with a larger gradient variance or with heavy-tailed gradient histograms (e.g., unnormalized networks) tend to diverge rather than converge as training progresses.
Another interesting consequence of such concentrated gradients is that the loss landscape of the network becomes smooth as it does not allow erratic or sharp gradient changes. This smoothing of the loss landscape is one of the prime reasons for BN and now our proposed MSN to work over a wider range of learning rates.
Singular value regularization - Fig. 6 shows the variation of the mean layer singular values of different layers during training. Specifically, during training, BN causes the singular values to increase monotonically. Furthermore, the average layer singular values of all layers are closely spaced, implying that all the weight matrices lie in the same vector subspace. SN, on the other hand, causes the mean layer singular values to taper more quickly, especially for higher layers, where the activations are affected by previous weight matrices. We reason that in spectral normalized networks, the weights of each layer lie in different vector subspaces and therefore have less freedom in choosing the number of singular components. In MSN, the bias correction term $\beta$ improves the average singular value by an appropriate factor learned during training. As a result, the divergence of the mean singular values during training is reduced, forcing the weights to lie in the same vector subspace.
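Table I: Test accuracy of the models with BN, SN and MSN.
Dataset | Model | BN | SN | MSN |
---|---|---|---|---|
MNIST | 3-layer CNN | | | |
SVHN | VGG-7 | 88.56 | 78.43 | 90.86 |
CIFAR10 | DenseNet-BC | | | |

Table II: Inception score and FID for unsupervised image generation on CIFAR10.
Model | Inception Score (IS) | FID Score |
---|---|---|
Real Data (CIFAR10) | | |
WGAN-GP (with BN) | | |
SNGAN | | |
MSNGAN (ours) | | |

Table III: Number of trainable parameters of the models with BN, SN and MSN.
Model | Normalization | Number of Parameters |
---|---|---|
WGAN (Discriminator) | BN | |
WGAN (Discriminator) | SN | |
WGAN (Discriminator) | MSN | |
3-layer CNN | BN | |
3-layer CNN | SN | |
3-layer CNN | MSN | |
VGG-7 | BN | |
VGG-7 | SN | |
VGG-7 | MSN | |
DenseNet-BC | BN | |
DenseNet-BC | SN | |
DenseNet-BC | MSN | |
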
Fewer trainable parameters - Table III compares the number of trainable parameters of various models with the BN, SN and MSN normalization methods. The amount of reduction in the number of parameters is given within parentheses. Note that SN does not introduce any additional parameters to the network. Therefore, spectral normalized models have fewer trainable parameters than batch normalized models, and are thus usually faster to train and more memory efficient. Despite introducing bias correction parameters, MSN still has fewer parameters than BN. Additionally, this reduction in the number of parameters results in faster training: in our experiments, both SN and MSN models trained faster than BN models. This reduction in the number of parameters, coupled with high sparsity, makes MSN a highly desirable choice for embedded applications of DNNs.
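The parameter overhead of each method can be illustrated with a back-of-the-envelope count. The sketch below assumes that BN adds a scale and a bias per channel, that MSN adds only a bias per channel, and that SN adds nothing; the channel widths are hypothetical and are not taken from the models in Table III.

```python
# Extra trainable parameters per output channel for each normalization method
# (assumption: BN learns a scale and a bias, MSN a bias only, SN nothing).
EXTRA_PER_CHANNEL = {"BN": 2, "MSN": 1, "SN": 0}


def norm_param_count(channels_per_layer, method):
    """Total normalization parameters for a stack of normalized layers."""
    return EXTRA_PER_CHANNEL[method] * sum(channels_per_layer)


channels = [32, 64, 128]    # hypothetical widths of a small 3-layer CNN
counts = {m: norm_param_count(channels, m) for m in ("BN", "MSN", "SN")}
```

Under these assumptions MSN carries half of BN's normalization parameters, and the saving grows linearly with the total number of channels.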
We evaluate our proposed MSN method against the original SN method on the Wasserstein Generative Adversarial Network (WGAN) model (called SNGAN [12]) for the task of unsupervised image generation on the CIFAR-10 dataset. Furthermore, we compare these spectral norm-based Lipschitz regularizers against the gradient penalty (WGAN-GP) regularization method [25]. We use BN for WGAN-GP, following the original paper. In this section, we very briefly discuss the GAN objective function and Lipschitz regularization.
The SN scheme was initially proposed for improving the training of Wasserstein GANs. Generative adversarial networks [26] are a class of generative models with two dueling neural networks, namely the generator $G$ and the discriminator $D$. The discriminator $D$ is trained to differentiate between real and fake data, while $G$ is trained to generate fake data that is identified as real. However, in the original GAN, the gradient of the optimal discriminator with respect to its input can be unbounded, which can lead to instability in training or mode collapse. Addressing this problem, various methods [27], [25] have been proposed for penalizing the Lipschitz constant of the discriminator (essentially regularizing its gradients) in the form of Wasserstein distance-based GAN losses. Note that the Wasserstein distance, in its dual form, requires that the discriminator function have a Lipschitz constant of $1$. Thus, employing the Wasserstein distance rather than the original Jensen-Shannon divergence for the GAN loss implicitly requires that the discriminator gradients be bounded. The WGAN objective function used in our experiments (except for WGAN-GP) is given as follows:
$$\min_G \max_{\|D\|_L \le 1} \; \mathbb{E}_{x \sim p_{data}}[D(x)] - \mathbb{E}_{z \sim p_z}[D(G(z))] \tag{11}$$
where $p_{data}$ is the data distribution, $p_z$ is the prior over the latent input $z$ of the generator, and $\|D\|_L$ is the Lipschitz constant of $D$.
For WGAN-GP, we have an additional gradient penalty term, following the original paper. Furthermore, we observe that the recentering of the pre-activations in MSN does not alter the Lipschitz norm of the activations. Therefore, MSN still regularizes the Lipschitz norm of the activations as effected by SN.
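In code, the WGAN objective reduces to two small loss functions: the discriminator maximizes the gap between its mean score on real and on generated samples, and the generator maximizes the mean score of its samples. The sketch below works on plain lists of discriminator outputs; the function names and the sample scores are assumptions, and the Lipschitz constraint is presumed to be enforced elsewhere (for example by SN or MSN in the discriminator).

```python
def mean(values):
    return sum(values) / len(values)


def discriminator_loss(d_real, d_fake):
    """Negated WGAN objective: minimizing it maximizes E[D(x)] - E[D(G(z))]."""
    return -(mean(d_real) - mean(d_fake))


def generator_loss(d_fake):
    """The generator tries to raise the discriminator's score on its samples."""
    return -mean(d_fake)


d_real = [1.0, 0.5]     # hypothetical discriminator scores on real samples
d_fake = [-0.5, 0.0]    # hypothetical discriminator scores on generated samples
loss_d = discriminator_loss(d_real, d_fake)
loss_g = generator_loss(d_fake)
```

For WGAN-GP, a gradient penalty term would be added to `discriminator_loss`; it is omitted here because it needs automatic differentiation.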
We employ the same DCGAN [28] architecture for both the generator and the discriminator as described in [12]. To evaluate the quality of the generated image samples, we use the standard inception score (IS) [29] and the Fréchet inception distance (FID) [30] (code obtained from https://github.com/mseitzer/pytorch-fid). In Table II, we show the inception scores (higher is better) and FID (lower is better) for unsupervised image generation with various models, under their optimal settings, on the CIFAR10 dataset. We report the scores averaged over multiple runs, each computed on a set of sampled images. The scores for the real CIFAR10 data are given as a baseline. We observe that MSN clearly improves upon WGAN-GP and performs on par with the original SNGAN.
Although SN was originally proposed to control the Lipschitz constant of WGANs, we believe it is a generic method for reparameterizing the weights, with the goal of building a standardized framework for employing DNNs in robot automation. In this paper, we investigated a consequence of the internal covariate shift, called mean drift, in spectral normalized networks, which degrades their performance compared to BN. Furthermore, we presented many experimental results to demonstrate the gradient sparsity and Lipschitzness induced by SN in small, medium and large DNNs. We then proposed a solution to the mean drift, called mean spectral normalization (MSN), drawing ideas from both BN and SN. Through our experiments, we confirm that MSN clearly outperforms SN in supervised classification models for all depths of neural networks. In parallel, Farnia et al. [31] observe a result similar to ours with spectral normalized DNNs. In contrast to our analysis, they conclude that the naive algorithm used to compute the spectral norm (the power iteration used in [12]) is ineffective at regularizing the actual spectral norm of the convolution layers. To correct this, they slightly loosen the spectral norm constraint to a fixed, tunable bound. Besides requiring no such tunable parameter, our work identifies a deeper mean-drift effect that restricts the network's performance, and rectifies it with our MSN method. Furthermore, we also compare the qualitative results of our MSNGAN with those of SNGAN for unsupervised image generation. In future, we wish to focus on evaluating the performance of our MSN on sequence modelling tasks and on real-time data from robots.
[2] Deep reinforcement learning combustion optimization system using synchronous neural episodic control. In 37th Chinese Control Conference, pages 8770–8775. IEEE, 2018.