1 Introduction
Despite their prevalent use, the effects of Batch Normalization (BN) [7] in Generative Adversarial Networks (GAN) [5] have not been examined carefully. Popularized by the influential DCGAN architecture [14], the use of BN in GANs is typically justified by its perceived training speedup and stability, but the generated samples often suffer from visual artifacts and limited variation (mode collapse). The lack of evidence that BN always improves GAN training is partly due to the unavailability of quality measures for GAN models. Motivated by this gap, we propose a methodical evaluation of GAN models that assesses their ability to generate a wide variety of samples (mode coverage). The idea is to hold out a portion of the dataset as a test set and try to find the latent code that generates the closest approximation to each test image. For each test image, we optimize the latent code by gradient descent for a fixed number of iterations. The average squared Euclidean distance between the test samples and the reconstructed ones is used as a measure of the quality of GANs.
Our experiments show that the reconstruction error correlates with the visual quality of the generated samples, and while still time-consuming, this approach is more efficient than existing log-likelihood-based evaluation methods. Our evaluation technique is therefore convenient for monitoring the progress during training. We show that BN generally accelerates training in early stages, and can increase the success rate of GAN training for certain datasets and network structures where a model without any normalization could often fail. In many cases though, BN can cause the stability and generalization power of the model to decrease drastically. Following the work of Salimans and Kingma [16] and Arpit et al. [2], we introduce a modified Weight Normalization (WN) technique for GAN training. Using the same sets of experiments, we found that our WN approach can achieve faster and more stable training than BN, as well as generate equal or higher quality samples than GAN models without normalization. We believe that our proposed WN technique is superior to BN in the context of GANs.
2 Related Work
Batch Normalization.
Batch Normalization (BN) [7] is a technique to accelerate the training of deep neural networks and has been shown to be effective in various applications. In the context of GANs, it first appeared in LAPGAN by Denton et al. [4] (for the generator only), and was made popular by the influential DCGAN architecture by Radford et al. [14] (for both generator and discriminator). It has since become common practice, as listed in an overview of GAN techniques [3] and used in many GAN architectures (e.g. WGAN [1] and EBGAN [19]). To summarize, BN takes a batch of samples $\{x_1, \ldots, x_m\}$ and computes the following:
$$y_i = \frac{x_i - \mu_B}{\sigma_B} \cdot \gamma + \beta \tag{1}$$
where $\mu_B$ and $\sigma_B$ are the means and standard deviations of the input batch and $\gamma$ and $\beta$ are the learned parameters. As a result, the output will always have a mean $\beta$ and a standard deviation $\gamma$, regardless of the input distribution. Most importantly, the gradients must be backpropagated through the computation of $\mu_B$ and $\sigma_B$.
Weight Normalization.
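Weight Normalization (WN) is a more recent normalization technique proposed by Salimans and Kingma [16]. For a linear layer
$$y = W^\top x + b \tag{2}$$
where $x \in \mathbb{R}^n$, $W \in \mathbb{R}^{n \times m}$, and $b \in \mathbb{R}^m$, weight normalization performs a reparameterization with $V \in \mathbb{R}^{n \times m}$ and $g \in \mathbb{R}^m$:
$$w_i = \frac{g_i}{\|v_i\|} \cdot v_i \tag{3}$$
where $w_i$ and $v_i$ are the $i$th column of $W$ and $V$, respectively. As with BN, the computation of $\|v_i\|$ is taken into account when computing the gradient with respect to $V$.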
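The reparameterization can be illustrated with a minimal pure-Python sketch. It assumes the column convention $y = W^\top x + b$ with $w_i = g_i v_i / \|v_i\|$; the function names `weight_normalize` and `linear` and the optional `eps` stability term are hypothetical choices for illustration, not the authors' implementation.

```python
import math

def weight_normalize(V, g, eps=0.0):
    """Return W whose i-th column is g[i] * v_i / ||v_i||.

    V is an n x m matrix given as a list of n rows; g holds m scales.
    eps is an optional stability term added under the square root
    (an assumption of this sketch).
    """
    n, m = len(V), len(V[0])
    # Euclidean norm of each column v_i of V
    norms = [math.sqrt(sum(V[r][i] ** 2 for r in range(n)) + eps) for i in range(m)]
    return [[g[i] * V[r][i] / norms[i] for i in range(m)] for r in range(n)]

def linear(W, b, x):
    """Compute y = W^T x + b for an n x m matrix W."""
    n, m = len(W), len(W[0])
    return [sum(W[r][i] * x[r] for r in range(n)) + b[i] for i in range(m)]

V = [[3.0, 0.0], [4.0, 2.0]]   # columns are (3, 4) and (0, 2)
g = [2.0, 1.0]
W = weight_normalize(V, g)
y = linear(W, [0.0, 0.0], [1.0, 1.0])
```

After normalization, the norm of each column of `W` equals the corresponding entry of `g`, independent of the scale of `V`.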
Although presented as a reparameterization that modifies the curvature of the loss function, the main idea is simply to divide the weight vectors by their norms. A very similar idea was proposed around the same time, under the name “normalization propagation” (NormProp), by Arpit et al. [2]. While the effectiveness of this technique has been illustrated on various experiments in [16] and [2], neither work investigated this acceleration approach for GANs. As detailed in Section 3, we propose a modified version of Weight Normalization to improve the training of GAN models.
GAN Evaluation.
In earlier GAN-related works, with a lack of quantitative measures, visual inspection has been a commonly used method. In addition to inspecting visual quality, it has also been used to show that the model did not overfit, by interpolating in latent space (e.g. [4]) and by finding the closest training samples to generated samples and pointing out their differences (e.g. [5]).
Various quantitative measures have since been proposed. A commonly used one is estimating the log-likelihood of the training set under the generator’s distribution, by generating a large number of samples and fitting a Gaussian Parzen window (e.g. [4, 11]). As discussed by Theis et al. [17], this is not particularly effective, as the number of samples that need to be generated for accurate log-likelihood estimation is intractable.
Another measure is the Inception score, proposed by Salimans et al. [15], based on the assumption that a good generative model should be able to generate meaningful objects. A limitation of this approach is that the Inception model is pretrained on another image classification task, usually for natural objects. Thus, it is only useful as a quality measure for GANs trained on images of similar objects. The quality of GANs has also been evaluated indirectly, e.g. by measuring the classification accuracy using features extracted by a GAN discriminator [14]. Our proposed measure of reconstruction loss is most similar to that used by Metz et al. [13]. We discuss the differences in Section 4.
3 Weight Normalization for GAN Training
We propose a modified formulation of the weight normalization approach introduced by Salimans and Kingma [16]. A notable deficiency of the original WN technique is that, in its simplest form, it does not normalize the mean value of the input. In [16] this is solved by augmenting WN with a version of BN that only normalizes the mean of the input but not the variance. While their experiments showed improved performance on the CIFAR-10 classification task compared to plain WN, it gave worse results in several of our experiments (see Appendix D.2). Hence, we chose not to include this augmentation in our approach.
In [2], the authors attempt to solve this problem by enforcing a zero-mean, unit-variance distribution throughout the network. In their method, the scale and bias are first fixed as $\gamma = 1$ and $\beta = 0$, that is,
$$y = \frac{w^\top x}{\|w\|} \tag{4}$$
For simplicity, we consider here a single output neuron. The training data is normalized so that the input to the network has zero mean and unit variance. The mean and variance of the output of each nonlinear layer (ReLU in this case) are evaluated in closed form under the assumption that the input to the preceding linear layer is from a multivariate standard normal distribution. The mean and variance are then used to correct the distribution of the output:
$$y = \frac{1}{\sigma} \left[ \mathrm{ReLU}\!\left( \frac{w^\top x}{\|w\|} \right) - \mu \right] \tag{5}$$
where
$$\mu = \sqrt{\frac{1}{2\pi}}, \qquad \sigma = \sqrt{\frac{1}{2}\left(1 - \frac{1}{\pi}\right)} \tag{6}$$
are the mean and standard deviation of the distribution after ReLU when $x$ is from a multivariate standard normal distribution. The output in equation 5 would also have zero mean and unit variance.
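The closed-form moments can be checked numerically. The following sketch assumes the standard values for the ReLU of a standard normal variable, mean $\sqrt{1/(2\pi)}$ and standard deviation $\sqrt{\tfrac{1}{2}(1 - 1/\pi)}$, and compares them against a Monte Carlo estimate; the sample count and seed are arbitrary choices.

```python
import math
import random

# Closed-form moments of ReLU(z) for z ~ N(0, 1)
mu = math.sqrt(1.0 / (2.0 * math.pi))
sigma = math.sqrt(0.5 * (1.0 - 1.0 / math.pi))

def relu(z):
    return max(z, 0.0)

rng = random.Random(0)  # fixed seed, arbitrary choice
samples = [relu(rng.gauss(0.0, 1.0)) for _ in range(200_000)]

# Empirical mean and standard deviation of the ReLU output
emp_mean = sum(samples) / len(samples)
emp_std = math.sqrt(sum((s - emp_mean) ** 2 for s in samples) / len(samples))

# Subtracting mu and dividing by sigma should recenter the output near zero
corrected = [(s - mu) / sigma for s in samples]
corr_mean = sum(corrected) / len(corrected)
```

The agreement only holds when the pre-activation really is standard normal, which is the assumption the surrounding text calls into question for deeper layers.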
Notice that this ad-hoc fix does not really achieve its goal. Firstly, as mentioned in [2], the closed-form mean and variance are only an approximation, since the correctness of this derivation requires the input to be normally distributed, which does not strictly hold beyond the first layer. More critically, after deriving equation 5 by fixing $\gamma = 1$ and $\beta = 0$, they argue, akin to Batch Normalization, that an affine transformation needs to be learned after the weight-normalized linear layer and before the succeeding nonlinear layer, in order to avoid decreasing the set of functions that can be represented by the network. The formulation then becomes:
$$y = \frac{1}{\sigma} \left[ \mathrm{ReLU}\!\left( \gamma \cdot \frac{w^\top x}{\|w\|} + \beta \right) - \mu \right] \tag{7}$$
Their derivation for $\mu$ and $\sigma$ is for the restricted case, when there is no learned affine transformation, i.e. when $\gamma = 1$ and $\beta = 0$. When this restriction is relaxed, the result would be invalid, even if the i.i.d. normal condition on $x$ does hold. We could make $\mu$ and $\sigma$ functions of $\gamma$ and $\beta$ to fix this error, but the backpropagation computation would be overly complex, since these functions also need to be taken into account.
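As we cannot hope to strictly enforce a zero-mean, unit-variance distribution, we propose to use a simpler approximation instead. Note that with a ReLU-like nonlinearity $f$ (i.e. ReLU, leaky ReLU and parametric ReLU) we have $f(ax) = a f(x)$ when $a \geq 0$. In equation 7, when $\gamma < 0$, we can always invert the direction of $w$ and take the negative of $\gamma$. Hence, without loss of generality, we can assume $\gamma > 0$. Then equivalently, equation 7 can be written as
$$y = \frac{\gamma}{\sigma} \left[ \mathrm{ReLU}\!\left( \frac{w^\top x}{\|w\|} + \frac{\beta}{\gamma} \right) - \frac{\mu}{\gamma} \right] \tag{8}$$
The purpose of $\mu$ and $\sigma$ is to cancel out the mean and variance introduced by ReLU and the affine transformation (i.e. $\gamma$ and $\beta$). Instead of deriving a complex formula, we simply set $\sigma = \gamma$ and $\mu = \beta$, and reformulate the equation using $\alpha = -\beta / \gamma$. Equation 8 becomes:
$$y = \mathrm{ReLU}\!\left( \frac{w^\top x}{\|w\|} - \alpha \right) + \alpha \tag{9}$$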
Note that we can now separate out the restricted weight-normalized layer from equation 9. We call the remaining part “Translated ReLU (TReLU)”:
$$y = \mathrm{TReLU}_\alpha\!\left( \frac{w^\top x}{\|w\|} \right) \tag{10}$$
$$\mathrm{TReLU}_\alpha(x) = \mathrm{ReLU}(x - \alpha) + \alpha \tag{11}$$
where $\alpha$ is a learned parameter. It is more commonly referred to as a “threshold layer”, defined by $\max(x, \alpha)$, but here the threshold is learned. We chose this name to reflect the fact that other ReLU-like nonlinear functions can be used to give translated leaky and parametric ReLU layers. Here, we “translate” the data by $-\alpha$, apply the nonlinear function, then “translate” the data “back” (by $\alpha$). By using TReLU instead of adding bias to the previous layer, we prevent (to a certain degree) the introduction of a large mean into the distribution.
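A minimal sketch of the translated activation follows, assuming the definition $\mathrm{TReLU}_\alpha(x) = \mathrm{ReLU}(x - \alpha) + \alpha$, which coincides with the threshold $\max(x, \alpha)$. The helper `translated`, the leaky slope of 0.2, and the sample values are illustrative assumptions; in the model, $\alpha$ would be a learned parameter rather than a constant.

```python
def relu(x):
    return max(x, 0.0)

def trelu(x, alpha):
    """Translated ReLU: shift by -alpha, apply ReLU, shift back by +alpha."""
    return relu(x - alpha) + alpha

def leaky_relu(x, slope=0.2):
    return x if x >= 0.0 else slope * x

def translated(f, x, alpha):
    """Apply the same translation to any ReLU-like function f."""
    return f(x - alpha) + alpha

xs = [-2.0, -0.5, 0.0, 0.5, 2.0]
alpha = 0.5
out = [trelu(x, alpha) for x in xs]
leaky_out = [translated(leaky_relu, x, alpha) for x in xs]
```

Inputs below the threshold are mapped to `alpha` itself, so the activation does not shift the output toward a large positive mean the way a bias followed by a plain ReLU can.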
This simplification effectively negates the learned affine transformation, which seemingly would reduce the set of functions that can be represented by the network. We argue, however, that allowing the learning of an affine transformation at the last weight-normalized layer recovers the expressiveness of the entire stack of layers (see Appendix A for proof). From now on, “strict weight-normalized layers” will refer to layers without affine transformations (equation 4), while layers with a learned affine transformation
$$y = \gamma \cdot \frac{w^\top x}{\|w\|} + \beta \tag{12}$$
are referred to as “affine weight-normalized layers”. These are collectively called “weight-normalized layers”.
4 Evaluation Method
For many generative models, the reconstruction error on the training set is often explicitly optimized in some form (e.g., Variational Autoencoders [8]). Even when this is not the case, as in GANs, it is natural to evaluate the model with a reconstruction loss (squared Euclidean distance) measured on a test set. In the case of GANs, given a generator $G$ and a set of test samples $X = \{x_1, \ldots, x_m\}$, the reconstruction loss of $G$ on $X$ is defined as
$$L_{rec}(G, X) = \frac{1}{m} \sum_{i=1}^{m} \min_{z} \|G(z) - x_i\|_2^2 \tag{13}$$
In the case of images, we normalize for different image sizes by considering the per-pixel, per-color-channel reconstruction loss; thus we divide the loss by $w \times h \times c$, where $w$ and $h$ are the width and height of the training images and $c$ is the number of color channels. Since there is no way to directly infer the optimal $z$ from $x$, we use an alternative method: starting from an all-zero vector, we perform gradient descent on the latent code to find one that minimizes the squared Euclidean distance between the sample generated from the code and the target one.
Because the code is optimized instead of computed from a feed-forward network, the evaluation process is time-consuming. Thus, we avoid performing this evaluation at every training iteration when monitoring the training process, and only use a reduced number of samples and gradient descent steps. Only for the final trained model do we perform an extensive evaluation on a larger test set, with a larger number of steps.
This method is very similar to that proposed by Metz et al. [13]. There are two important differences: in [13] the samples used for reconstruction come from the training set, while we take the samples from a separate test set. Intuitively, in order to generate the test samples that are not in the training set, the generator must learn the distribution of the training samples, but not memorize and overfit on them. Such an effect would not be achieved if the test samples come from the training set.
Furthermore, [13] uses L-BFGS for optimization on the latent code. L-BFGS is known to give good and fast optimization for problems that are not too high-dimensional, which suits the setting of this problem well. However, its effectiveness is sensitive to many of its parameters. We were not able to find a combination of parameters that consistently works well under the various experiment settings. This also made it harder to justify the choice of parameters, since the best parameters may be very different for the different models we would like to compare.
Instead we use RMSProp. It may not be the fastest optimization method for this problem, but we found it to work well under our settings, and altering the parameters (learning rate and number of steps) generally affects the reconstruction results of different models in the same way, which makes comparison easier.
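The evaluation procedure of this section can be sketched in a few lines. This is only an illustration under strong assumptions: a toy linear map `A z` stands in for a trained generator so that the gradient is available in closed form, and the step count, decay rate, and `eps` are hypothetical defaults rather than the settings used in the experiments.

```python
import math

def generate(A, z):
    """Toy linear 'generator' G(z) = A z, a stand-in for a trained GAN generator."""
    return [sum(a * zj for a, zj in zip(row, z)) for row in A]

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def reconstruct(A, x, steps=500, lr=0.01, decay=0.9, eps=1e-8):
    """Optimize the latent code with RMSProp, starting from the zero vector."""
    k = len(A[0])
    z = [0.0] * k
    cache = [0.0] * k  # running average of squared gradients
    for _ in range(steps):
        residual = [g - t for g, t in zip(generate(A, z), x)]
        # Gradient of ||A z - x||^2 with respect to z is 2 A^T (A z - x)
        grad = [2.0 * sum(A[i][j] * residual[i] for i in range(len(A))) for j in range(k)]
        for j in range(k):
            cache[j] = decay * cache[j] + (1.0 - decay) * grad[j] ** 2
            z[j] -= lr * grad[j] / (math.sqrt(cache[j]) + eps)
    return z

def reconstruction_loss(A, test_set, **kwargs):
    """Mean over test samples of the per-dimension squared reconstruction error."""
    total = 0.0
    for x in test_set:
        z = reconstruct(A, x, **kwargs)
        total += sq_dist(generate(A, z), x) / len(x)
    return total / len(test_set)

A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
in_range = [[0.5, -0.25, 0.25]]   # equals A z for z = (0.5, -0.25)
off_range = [[0.0, 0.0, 1.0]]     # cannot be reproduced exactly by this generator
loss_in = reconstruction_loss(A, in_range)
loss_off = reconstruction_loss(A, off_range)
```

A sample the generator can produce yields a loss close to zero, while a sample outside its range retains a residual error, which is the behaviour the measure relies on to detect missing modes.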
5 Experiments
We conducted experiments on image generation tasks, with quantitative analysis of DCGAN-based architectures on the CelebA, LSUN bedroom and CIFAR-10 datasets, and qualitative results with a 21-layer ResNet on CelebA. The CelebA experiments are detailed here. Due to limited space, we only show some generated and reconstructed samples on LSUN and CIFAR-10 here, and discuss the settings, qualitative and quantitative results along with more samples in Appendices D.3 and D.4.
5.1 DCGAN Setup
For CelebA [10], we use central 160 × 160 patches. We compared three DCGAN-based models: (1) trained without any normalization as a reference (the non-normalized or “vanilla” model), (2) with Batch Normalization (“BN model”), and (3) with our formulation of Weight Normalization (“WN model”). The network is structured in the following way: for the discriminator, we use successive convolution layers with kernel size 4, stride 2, padding 1 and output features doubling that of the previous layer, starting from 64 features in the first layer. We add convolution layers until the spatial size of the feature map is sufficiently small (5 × 5). We then add one final convolution layer with stride 1, no padding and kernel size 5 (equaling the size of the last feature map). For the generator, we reverse this structure and use transposed convolution layers.
As per common practice, Batch Normalization is not applied to the first layer of the discriminator, nor to the last layers of both the discriminator and generator. Weight normalization is used for every layer. For the last layer of both discriminator and generator, we use affine weight-normalized layers (AWNConv) while for every other layer we use strict weight-normalized layers (SWNConv). Parametric ReLU (PReLU) is used for vanilla and batch-normalized models and Translated Parametric ReLU (TPReLU) for weight-normalized models. Slope and bias parameters are learned per-channel. The length of the code is 256 for all models. The architectures are summarized in Table 1. Additional details regarding the implementation of weight-normalized layers are discussed in Appendix B.
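Table 1: Network architectures of the three models. Each row lists one layer with its normalization and activation; parameters are kernel size, stride, padding, output features. Convolutions in the generator are transposed.

Discriminator
vanilla | BN | WN | parameters
Conv, PReLU | Conv, PReLU | SWNConv, TPReLU | 4, 2, 1, 64
Conv, PReLU | Conv, BN, PReLU | SWNConv, TPReLU | 4, 2, 1, 128
Conv, PReLU | Conv, BN, PReLU | SWNConv, TPReLU | 4, 2, 1, 256
Conv, PReLU | Conv, BN, PReLU | SWNConv, TPReLU | 4, 2, 1, 512
Conv, PReLU | Conv, BN, PReLU | SWNConv, TPReLU | 4, 2, 1, 1024
Conv, Sigmoid | Conv, Sigmoid | AWNConv, Sigmoid | 5, 1, 0, 1

Generator
vanilla | BN | WN | parameters
Conv, PReLU | Conv, BN, PReLU | SWNConv, TPReLU | 5, 1, 0, 1024
Conv, PReLU | Conv, BN, PReLU | SWNConv, TPReLU | 4, 2, 1, 512
Conv, PReLU | Conv, BN, PReLU | SWNConv, TPReLU | 4, 2, 1, 256
Conv, PReLU | Conv, BN, PReLU | SWNConv, TPReLU | 4, 2, 1, 128
Conv, PReLU | Conv, BN, PReLU | SWNConv, TPReLU | 4, 2, 1, 64
Conv, Sigmoid | Conv, Sigmoid | AWNConv, Sigmoid | 4, 2, 1, 3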
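The layer count of the discriminator follows from the input size and the convolution parameters. The sketch below uses the standard output-size formula for a convolution without dilation, $\lfloor (n + 2p - k)/s \rfloor + 1$; the helper name `conv_out_size` is a hypothetical choice.

```python
def conv_out_size(size, kernel, stride, padding):
    """Output spatial size of a convolution along one dimension (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1

size = 160            # CelebA crops are 160 x 160
sizes = [size]
features = []
width = 64            # number of features in the first layer
while size > 5:       # add stride-2 layers until the feature map is 5 x 5
    size = conv_out_size(size, kernel=4, stride=2, padding=1)
    sizes.append(size)
    features.append(width)
    width *= 2

# Final layer: kernel 5, stride 1, no padding, collapsing the 5 x 5 map
final = conv_out_size(size, kernel=5, stride=1, padding=0)
```

Each stride-2 layer halves the spatial size, so five such layers take 160 down to 5, and the final kernel-5 layer produces a single output per image.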
All models are optimized with RMSProp [18], with a learning rate of , , and a batch size of 32. Specifically for the BN model, we use separate batches for true samples and generated samples when training the discriminator, as suggested by [3]. After each parameter update, we clip the learned slope of parametric ReLU layers to . There are a total of 202,599 images in CelebA dataset. We randomly selected 2,000 images for evaluation and used the rest for training. During the training, we perform a “running evaluation” for every 500 training iterations, on a randomly selected and fixed subset of 200 test samples. The optimal code is found by performing gradient descent for 50 steps, starting from a zero vector. Again we use RMSProp, with a learning rate of 0.01.
For each model, the best performing network during the training is saved and used for final evaluation. In the final evaluation, we use all 2,000 test samples and perform gradient descent for 2,000 steps. For BN model, we use its inference mode. In addition, we also use the “converged” model for evaluation, in case the model does converge but gives notably worse running reconstruction than the optimal recorded model. However, this did not occur in the main experiment. We consider that training has converged if both the running reconstruction loss and the generated samples stay stable for a sufficient amount of time.
We observed mode collapse issues with both the vanilla and BN models. To reduce the possibility that these observations are caused by random factors, we repeat the training procedure for these models three times. We present the results from the best training instances; additional instances of the vanilla and BN models can be found in Appendix D.1.
5.2 Reconstruction
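Table 2: Lowest running reconstruction loss, the iteration at which it was achieved, and final reconstruction loss for each model.

Model | Optimal iteration | Running loss | Final loss
vanilla | 30,500 | 0.014509 | 0.006171
BN | 30,500 | 0.017199 | 0.006355
WN | 463,000 | 0.013010 | 0.005524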
The running reconstruction loss of the three models is shown in Figure 1 for the first 150,000 iterations. The generated samples from both the vanilla and BN models have collapsed. The WN model was trained to 700,000 iterations and is considered to have converged (see Appendix C for the prolonged training).
The lowest running reconstruction loss recorded during training, the iteration at which this minimum loss is achieved, and the final reconstruction loss for each model are listed in Table 2. WN achieves about 10.5% lower final reconstruction loss than the vanilla model, while for BN the loss is 3% higher. We can also see from the loss curve that, until the vanilla model collapses, BN never achieves a better reconstruction loss.
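The quoted percentages follow directly from the final losses reported for the three models. A short check, with the dictionary and function names chosen only for illustration:

```python
# Final reconstruction losses as reported for the three models
final_loss = {"vanilla": 0.006171, "BN": 0.006355, "WN": 0.005524}

def relative_change(model, reference="vanilla"):
    """Change of a model's final loss relative to the reference, in percent."""
    return 100.0 * (final_loss[model] - final_loss[reference]) / final_loss[reference]

wn_change = relative_change("WN")   # negative: lower loss than vanilla
bn_change = relative_change("BN")   # positive: higher loss than vanilla
```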
We also provide qualitative results of the reconstructions. Selected reconstructed samples are compared to the original test samples in figure 2. These samples are selected such that all three models give reasonable results. Random samples can be found in Appendix C. The WN model captures details (e.g. facial expression, texture of hair, subtle color variation) much more faithfully. Samples reconstructed by the BN model are significantly blurrier and affected by artifacts.
5.3 Stability
As shown in Figure 1, the reconstruction of vanilla and BN models started to get worse relatively early on during their training, after achieving their optimal reconstruction loss. For the vanilla model, the loss went up slowly, then in a relatively short time around iteration 135,000, the generator collapses and produces the same output, which caused the reconstruction loss to increase suddenly. For the BN model, at around 40,000 iterations, the loss started to show excessive fluctuation. Our WN model however, kept improving steadily until 300,000 iterations and then remained largely stable.
We can also visualize this (in)stability by checking samples generated from the same code at different iterations, as shown in Figure 3. The WN model is noticeably more stable as samples generated from the same code remain mostly constant across a time scale of 100,000 iterations, and the generated samples are slowly improving, while the other two models produce more random variations. Additional visual analysis and samples can be found in Appendix C.
5.4 Training Speed
We compare the training speed of the three models by assessing their generated samples during early stages of the training, as illustrated in Figure 4. It is evident that Batch Normalization does accelerate training and the effect of Weight Normalization is comparable. Notice that our WN model can already produce a human face in only 100 iterations. This accelerated training is mostly useful as a fast sanity check when monitoring the training progress of deep neural networks. As shown in Figure, the visual quality of the samples generated by the three models is comparable at 10,000 iterations, and none of the models achieves a noticeably faster progression than the others. In addition, the ability to generate visually plausible samples earlier on does not necessarily translate into an overall faster improvement of the reconstruction. Notice that BN allows a higher learning rate. The training of the vanilla and WN models often fails with a learning rate of , while the BN model can still be trained with a learning rate of . However, we found that an increased learning rate did not accelerate the training of the BN model. Instead, it further harms the stability of the model.
5.5 Results on LSUN and CIFAR10
5.6 ResNet Setup
Residual Networks [6] are becoming increasingly popular for image classification. While they have been used in GANs, the setting is usually an image-to-image translation task, e.g. image super-resolution [9]. Direct image generation from noise with ResNet has not been particularly successful. Here we test our method on a 21-layer residual network.
Our block structure is as follows: we base our design on the basic blocks from [6]. On the shortcut branch, we use an optional average pooling, present when the stride is 2, followed by an optional convolution with kernel size 1, present when the number of input features does not equal the number of output features. On the residue branch, we use a Conv-BN-PReLU-Conv-BN structure for the BN model and remove the batch normalization layers for the vanilla model. The two branches are then summed, and a final PReLU layer is applied to the result.
In the WN model, all convolutions are replaced with the strict weight-normalized version and PReLU layers are replaced with the translated version. There is some complication when summing the two branches in the WN model; see Appendix B for more details. The first convolution on the residue branch has kernel size 3 or 4 when the stride is 1 or 2 respectively. The second convolution always has kernel size 3. The two convolutions always have padding 1. In the generator, all convolutions are replaced with transposed convolutions, and the average pooling on the shortcut branch is replaced with nearest-neighbour upscaling.
The discriminator network consists of 5 levels, with each level consisting of a stride 2 block followed by a stride 1 block with the same number of output features, for a total of 10 residue blocks and thus 20 layers. Then a final convolution with kernel size 5 and no padding is added, as in the DCGAN models above, for a total of 21 layers. Since the network is much deeper, to save computation time, we reduced the number of features to (64, 128, 256, 384, 512) and dimension of the latent space to 128. Again, the generator is a mirror image of the discriminator. We also reduced the batch size to 16 and learning rate to .
5.7 ResNet Results
During 70,000 iterations of training, the vanilla model and the BN model were never able to generate more than a handful of different samples (random samples from iteration 70,000 shown in figure 9) and were extremely unstable (evolution of samples for every 10,000 iterations shown in figure 10). The WN model was trained to iteration 300,000 without major issues, and was able to generate samples with high quality and diversity. The best running reconstruction loss was 0.016906, achieved at iteration 195,000. Random samples from that iteration are shown in figure 11.
We do point out, however, that with continued training after around iteration 200,000 we observe some degradation of sample quality in the weight-normalized ResNet model, similar to that in the vanilla DCGAN model examined in Appendix C. This indicates that Weight Normalization in itself may not be sufficient to guarantee the stability of the network. But it is not our goal to compete with other techniques and find a complete solution to the instability of GAN training. Rather, since our method does not propose a different training loss (e.g. least squares in LSGAN [12]) or protocol (e.g. batch discrimination), or favour a particular architecture (e.g. autoencoder-based, as in EBGAN [19]), it is complementary to these existing GAN training improvement techniques, and can be combined with any of them to further improve the quality of GANs.
6 Conclusion
We introduced weight normalization for the training of GANs using a formulation that differs from the original work of [16] and [2], and which achieves superior training performance. We also presented an evaluation method for GANs based on the mean squared Euclidean distance between the test samples and the closest generated ones, which are synthesized via gradient descent on a latent code. We trained and analyzed variants of DCGAN [14] with different normalization methods for image generation on datasets of multiple scales. We found that batch-normalized models perform worse in reconstructing test samples and are less stable during training. In particular, both the reconstruction error and the visual quality can deteriorate with BN. In contrast, our formulation of weight normalization improves both reconstruction quality and training stability considerably. We further demonstrate the stabilizing power of weight normalization by successfully training a residual GAN that is considerably deeper. Based on our extensive evaluations, we believe that weight normalization should be used instead of batch normalization when training generative adversarial networks.
References
 [1] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.
 [2] D. Arpit, Y. Zhou, B. U. Kota, and V. Govindaraju. Normalization propagation: A parametric technique for removing internal covariate shift in deep networks. arXiv preprint arXiv:1603.01431, 2016.
 [3] S. Chintala, E. Denton, M. Arjovsky, and M. Mathieu. How to train a gan? tips and tricks to make gans work. https://github.com/soumith/ganhacks, 2016. Accessed: 2017-02-26.
 [4] E. L. Denton, S. Chintala, R. Fergus, et al. Deep generative image models using a laplacian pyramid of adversarial networks. In Advances in neural information processing systems, pages 1486–1494, 2015.
 [5] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
 [6] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 [7] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
 [8] D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
 [9] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. arXiv preprint arXiv:1609.04802, 2016.
 [10] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015.
 [11] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.
 [12] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. P. Smolley. Least squares generative adversarial networks. arXiv preprint arXiv:1611.04076, 2016.
 [13] L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein. Unrolled generative adversarial networks. arXiv preprint arXiv:1611.02163, 2016.
 [14] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
 [15] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training gans. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.
 [16] T. Salimans and D. P. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, pages 901–901, 2016.
 [17] L. Theis, A. v. d. Oord, and M. Bethge. A note on the evaluation of generative models. arXiv preprint arXiv:1511.01844, 2015.
 [18] T. Tieleman and G. Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2), 2012.
 [19] J. Zhao, M. Mathieu, and Y. LeCun. Energy-based generative adversarial network. arXiv preprint arXiv:1609.03126, 2016.
Appendix A Proof for the Equivalence Between a Non-Normalized and a Strict Weight-Normalized Network with One Affine Weight-Normalized Layer at the End
Consider two networks with layers each. The first network is a non-normalized network, where layers are linear layers and layers are ReLU layers. The second network is a weight-normalized network, where layers are strict weight-normalized layers, layer is an affine weight-normalized layer, and layers are translated ReLU layers.
We make the following claim:
Claim 1.
The aforementioned two networks are capable of representing the same set of functions.
First, we prove that a linear-and-ReLU combination is equivalent to a strict weight-normalized-and-TReLU combination, if both are augmented by a learned affine transformation at the end.
Lemma 2.
A linear layer, followed by a ReLU layer, followed by an affine transformation, is equivalent to a strict weight-normalized layer, followed by a TReLU layer, then by an affine transformation.
Proof.
For simplicity we consider the case where the first layer has only one output neuron. Then, a linear layer, followed by a ReLU layer, followed by an affine transformation, becomes
(14) 
while a strict weight-normalized layer, followed by a TReLU layer, followed by an affine transformation would be
(15) 
where are learned parameters. The transformation
(16) 
establishes a one-to-one correspondence between these two forms. ∎
We make the following observations:
Lemma 3.
A linear layer preceded by an affine transformation is equivalent to a single linear layer.
Lemma 4.
A linear layer is equivalent to an affine weight-normalized layer.
These proofs are trivial. We now prove Claim 1 by transforming network 1 into network 2:
We perform the following procedure for each from to : first, we add an affine transformation between the ReLU layer and the linear layer . We then exchange the linear layer and the ReLU layer
with a strict weight-normalized layer and a TReLU layer. The additional affine transformations are then removed. Adding and removing an affine transformation does not change the expressiveness of the network, since a linear layer succeeds it. With an affine layer in place, the exchange does not change the expressiveness of the network either.
Finally, we change the last linear layer to an affine weight-normalized layer.
Appendix B Implementation Details
Here we provide some implementation details regarding the weight normalized layers.
Note that for strided and transposed convolutional layers, each element in the output tensor receives an input from only a subset of the $c \times k_w \times k_h$ elements in the input tensor that correspond to the kernel, where $c$ is the number of input features and $k_w$ and $k_h$ are the kernel width and height. Ideally, we should perform weight normalization for each of these different subsets of weights separately. In our experiments, we use a simple trick: we compute the norm of the weight for the full kernel as a whole, and divide the norm by a factor that depends on $s_w$ and $s_h$, the horizontal and vertical strides. This norm is used to normalize the weight in all different subsets.
The first layer in the generator deserves some special treatment. While it can be seen as a transposed convolutional layer (since the spatial size of the input is $1 \times 1$), it can also be viewed as a fully connected layer (with shared bias between output elements from the same feature map). These two views do make a difference when Weight Normalization is in place: similar to the case above, each output element actually receives an input from a subset of $c$ elements instead of $c \times k_w \times k_h$ elements, corresponding to the kernel. Hence, it is more appropriate to implement this layer as a weight-normalized fully connected layer, which is what we did in our experiments.
We also found that weight initialization of the first layer of the generator had an impact on the effect of WN. In our experiments, initial weights are drawn uniformly from where is the size of the input which is just the length of the latent code. For the convolutional layers, initial weights are drawn uniformly from , as usual.
When computing the norm, a numerical stability term is added to the sum of squared weights before taking the square root.
Recall that the purpose of weight normalization is to normalize the mean and variance of the output of a linear layer. In the strict weight normalized case (equation 4), if each dimension of the input vector is independently drawn from a distribution with expected value 0 and variance 1, then the output will also have expected value 0 and variance 1.
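This property can be checked numerically. The NumPy sketch below assumes that the strict weight-normalized layer of equation 4 computes each output as the inner product of the input with a weight vector divided by that vector's norm, with the stability term added under the square root as described above. The function name `strict_weight_norm_linear`, the value `eps=1e-6`, and the layer sizes are illustrative assumptions, not details of the original implementation.

```python
import numpy as np

def strict_weight_norm_linear(x, weight, eps=1e-6):
    """Apply y_i = <w_i, x> / ||w_i|| for each row w_i of `weight`.

    x:      array of shape (batch, n_in)
    weight: array of shape (n_out, n_in)
    eps:    stability term added to the sum of squares (assumed value)
    """
    norm = np.sqrt(np.sum(weight ** 2, axis=1) + eps)  # shape (n_out,)
    return x @ (weight / norm[:, None]).T

rng = np.random.default_rng(0)
x = rng.standard_normal((100_000, 64))      # inputs with mean 0 and variance 1
weight = rng.uniform(-3.0, 3.0, (8, 64))    # arbitrary scale; it is normalized away
y = strict_weight_norm_linear(x, weight)

out_mean = y.mean(axis=0)   # each entry should be close to 0
out_var = y.var(axis=0)     # each entry should be close to 1
```

The output statistics do not depend on the scale of `weight`, because only the direction of each weight vector survives the normalization.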
In a residue block, if the shortcut branch and the residue branch are simply summed, the output distribution will have variance 2, and the normalizing effect would be lost. One possible fix to this is to simply divide the sum by , but this will cause the shortcut branch to vanish as the network grows deeper, which defeats the purpose of residual networks. Our solution is what we call “weight normalized addition”. Consider first the simplest case of adding two variables. We take
(17) 
where and are learned weights. This will preserve the normalizing effect. At the same time, if we set the initial weight to 0 on the residue branch and 1 on the shortcut branch, the shortcut branch will be able to “go through” the network at the beginning, thus the benefits of residual networks are also preserved.
When adding two convolutional feature maps, we learn pairs of weights for each feature channel and share the weights across spatial locations.
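A minimal sketch of weight-normalized addition follows. It assumes that equation 17 has the form of a strict weight-normalized layer with two inputs, that is, the weighted sum of the two branches divided by the Euclidean norm of the two weights; this form, the function name, and the `eps` value are assumptions made for illustration. The variance argument also assumes that the two branches are uncorrelated with unit variance.

```python
import numpy as np

def weight_normalized_add(shortcut, residue, w_shortcut, w_residue, eps=1e-6):
    """Assumed form: (w_s * shortcut + w_r * residue) / sqrt(w_s^2 + w_r^2)."""
    norm = np.sqrt(w_shortcut ** 2 + w_residue ** 2 + eps)
    return (w_shortcut * shortcut + w_residue * residue) / norm

rng = np.random.default_rng(0)
shortcut = rng.standard_normal(200_000)   # unit-variance shortcut branch
residue = rng.standard_normal(200_000)    # unit-variance residue branch

plain_sum_var = (shortcut + residue).var()                        # close to 2
mixed_var = weight_normalized_add(shortcut, residue, 0.7, 1.9).var()  # close to 1

# Initial weights 1 (shortcut) and 0 (residue): the shortcut passes through.
at_init = weight_normalized_add(shortcut, residue, 1.0, 0.0)
```

For convolutional feature maps, the same function applies with one weight pair per channel, for example weights of shape `(C, 1, 1)` broadcast over feature maps of shape `(C, H, W)`.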
Appendix C Additional Samples and Analysis
Here we show additional generated samples from the three models. In addition, we discuss certain issues with GAN training that can only be detected by visually inspecting large numbers of samples.
c.1 Vanilla Model
Figure 13 shows samples generated by the vanilla model from the same set of random codes, at iteration 30,500 (optimal iteration) and 120,000. Samples generated at 120,000 iterations are visually superior on average if each sample is inspected individually, despite a higher reconstruction loss. But notice how the diversity of the samples has decreased: the lower half of the figure is dominated by yellow and brown colors and is darker than the upper half. More subtly, the lower half has less variation in facial expressions. The gender diversity also decreases: some clearly male faces in the upper half become more feminine in the lower half.
When training beyond a certain number of iterations, the samples start to evolve in the same direction. While the samples are still different, similar changes can be observed in each iteration. In this process, the difference between samples is gradually lost. This corresponds to the slow and steady increase of the reconstruction loss. At around 130,000 iterations, this process suddenly accelerates and the model then collapses at a certain point, as shown in Figure 12.
Interestingly, this appears to be the reverse of the behavior in early-stage training. In particular, training usually starts with all codes generating similar outputs that change in similar ways, until the synthesized samples start to gain diversity.
c.2 Batch-Normalized Model
Figure 14 shows samples generated by the BN model from the same set of random codes, at iteration 30,500 (optimal iteration) and 110,000. At first sight, it does not show a decrease in diversity as with the vanilla model. But after comparing the samples carefully, we can discover certain repeatedly occurring features. To see this more clearly, in Figure 15 we picked and rearranged several samples from Figure 14.
We identified two groups among the samples generated at iteration 110,000. Within each group, while the appearance of the face varies considerably, almost the exact same expression is produced. When comparing these samples with the ones generated from the same codes at iteration 30,500, we found that the second group is new, while the first group has already existed for a long period of time. This indicates that the BN model had limited diversity even at its optimal iteration.
This is an indication of a different cause for mode collapse: as the training progresses, certain features become dominant. While most samples stay different, more and more start to acquire these dominating features. In the extreme case, only a handful of different possible outputs remain in the end.
c.3 Weight-Normalized Model
Figure 17 illustrates random samples generated by the optimal WN model. In terms of diversity, it is not ideal, as the samples still show a lack of color variation and an unbalanced gender ratio compared to the ground truth distribution. However, they show more variation than the vanilla model and fewer subtle recurring features than the BN model. The individual samples are of higher quality on average as well. Also note the relatively low rate of “failed” or highly implausible samples.
Figure 18 shows the running reconstruction loss recorded during the whole training process of the WN model. The loss remains nearly constant after 300,000 iterations, which demonstrates the stability of the WN model.
c.4 Random Reconstructed Samples
Figure 19 shows a random selection of reconstructed samples.
Appendix D Additional Experiments
d.1 Additional Training Instances of Vanilla and BN Models
Figures 20 and 21, and Tables 3 and 4 show the reconstruction loss recorded during all training instances of the vanilla and batch-normalized models.
Table 3 (vanilla model)
Instance  Optimal iteration  Running loss
vanilla1  30,500  0.014509
vanilla2  34,500  0.014703
vanilla3  31,000  0.014734

Table 4 (batch-normalized model)
Instance  Optimal iteration  Running loss
BN1  37,000  0.017349
BN2  30,500  0.017199
BN3  16,000  0.018810
We can see that the three instances of the vanilla model gave almost identical loss curves. They achieved similar optimal losses at similar times, and the mode collapse also happened around the same time.
In addition to the instability observed within each training instance, the batch-normalized models also showed “meta-instability”, as the behavior differed considerably between training instances. Notably, in the third instance, mode collapse happened very early on. The training did recover to some extent, but the model was never able to regain the same sampling diversity as before the mode collapse.
d.2 Additional Models
For completeness, we also compare different formulations of Weight Normalization. The first one is a full-affine WN model, constructed from the WN model by replacing all strict weight-normalized layers by affine weight-normalized layers and all TPReLU layers by PReLU layers. The second one is a model with Weight Normalization plus mean-only Batch Normalization, as used in [16], constructed by taking the affine WN model and adding mean-only Batch Normalization at those places where regular Batch Normalization layers would be used in the BN model.
The reconstruction loss for the first 150,000 steps is shown in Figure 22 and Table 5. The results of the WN model are included for comparison.
Model  Optimal iteration  Running loss 

affineWN  51,000  0.014034 
WN+meanBN  19,500  0.016639 
WN  463,000  0.013010 
The WN model with mean-only BN achieved worse reconstruction than the vanilla model. In addition, it exhibits fluctuations similar to those of the BN model, although less severe. We believe that the major advantage of WN over BN is its independence from batch statistics. By adding mean-only BN, this dependency is reintroduced, which harms the stability of the model.
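The dependence on batch statistics can be made concrete with a small NumPy sketch. It assumes a strict weight-normalized linear layer (inner product divided by the weight norm) and a training-mode mean-only Batch Normalization that subtracts the per-feature batch mean; the learned bias used in [16] is omitted, and all names and sizes are illustrative assumptions.

```python
import numpy as np

def strict_weight_norm_linear(x, weight, eps=1e-6):
    """y_i = <w_i, x> / ||w_i||, computed independently for every sample."""
    norm = np.sqrt(np.sum(weight ** 2, axis=1) + eps)
    return x @ (weight / norm[:, None]).T

def mean_only_batch_norm(h):
    """Subtract the per-feature mean computed over the batch axis."""
    return h - h.mean(axis=0, keepdims=True)

rng = np.random.default_rng(0)
weight = rng.standard_normal((4, 16))
sample = rng.standard_normal((1, 16))

# The same sample placed in two batches with different companions.
batch_a = np.concatenate([sample, rng.standard_normal((7, 16))])
batch_b = np.concatenate([sample, rng.standard_normal((7, 16)) + 2.0])

wn_a = strict_weight_norm_linear(batch_a, weight)[0]
wn_b = strict_weight_norm_linear(batch_b, weight)[0]
bn_a = mean_only_batch_norm(strict_weight_norm_linear(batch_a, weight))[0]
bn_b = mean_only_batch_norm(strict_weight_norm_linear(batch_b, weight))[0]
```

Here `wn_a` and `wn_b` agree, while `bn_a` and `bn_b` differ: with mean-only BN, the output for a given sample changes with the other samples in its batch.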
We are also interested in the comparison between the WN model and the affineWN model, as it compares our formulation of Weight Normalization against the one originally proposed in [16]. For this purpose, we also trained the affineWN model for 700,000 iterations. Figure 23 shows the comparison of their reconstruction loss during training.
After making a quick descent in the beginning, the running reconstruction loss of the affineWN model starts to increase steadily. Unlike with the vanilla model, the generated samples remained stable and of high quality. Hence, we evaluate both the optimal model and the model at iteration 700,000. The results are compared with the WN model in Table 6.
Model  Iteration  Running loss  Final loss 

affineWN  51,000  0.014034  0.005941 
affineWN  700,000  0.020478  0.005939 
WN  463,000  0.013010  0.005525 
Surprisingly, the affineWN model at iteration 700,000 yields 2,000-step reconstructions that are as good as those at iteration 51,000, when the model achieved its optimal 50-step reconstruction. Both of them, however, are about 7.5% worse than the WN model.
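The 7.5% figure follows from the final losses listed in the table above. The short check below recomputes the relative gap; the variable names are illustrative.

```python
# Final (2,000-step) reconstruction losses taken from the table above.
affine_wn_final = {51_000: 0.005941, 700_000: 0.005939}
wn_final = 0.005525

# Relative increase of the affineWN final loss over the WN final loss.
relative_gap = {
    iteration: (loss - wn_final) / wn_final
    for iteration, loss in affine_wn_final.items()
}
```

Both entries of `relative_gap` come out at roughly 0.075.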
d.3 Experiments on CIFAR-10 Dataset
There are 60,000 images (training plus validation) of size 32×32 in the CIFAR-10 dataset. We construct models in a similar way as for CelebA, but begin with 96 output channels for the first convolutional layer and stop further convolutions when the spatial size of the feature map reaches 4×4. The length of the code (256) and the training batch size (32) remain the same.
We use 58,000 images for training and 2,000 images for evaluation. During training, evaluation is performed every 1,000 training iterations on 400 images, with 50 gradient descent steps. Final evaluation is performed on the whole test set with 2,000 gradient descent steps.
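The evaluation loop described here can be sketched as follows. This is a toy illustration, not the original implementation: the generator is a fixed random `tanh` layer, the latent codes start at zero, plain gradient descent with a fixed learning rate is used, and the loss is the squared error averaged over pixels and images. All of these choices, and all names, are assumptions made so that the example is self-contained.

```python
import numpy as np

def reconstruction_loss(generator_matrix, targets, steps=50, lr=0.5):
    """Mean squared reconstruction error after `steps` gradient steps on z."""
    n_pixels, code_len = generator_matrix.shape
    z = np.zeros((targets.shape[0], code_len))        # one latent code per test image
    for _ in range(steps):
        out = np.tanh(z @ generator_matrix.T)         # toy generator G(z)
        # Gradient of mean((G(z) - x)^2) over pixels with respect to z.
        grad_out = 2.0 * (out - targets) / n_pixels
        z -= lr * (grad_out * (1.0 - out ** 2)) @ generator_matrix
    out = np.tanh(z @ generator_matrix.T)
    return float(np.mean((out - targets) ** 2))

rng = np.random.default_rng(0)
generator_matrix = rng.standard_normal((48, 8)) / np.sqrt(8)
held_out_codes = rng.standard_normal((20, 8))
test_images = np.tanh(held_out_codes @ generator_matrix.T)  # stand-in for held-out images

initial_loss = reconstruction_loss(generator_matrix, test_images, steps=0)
running_loss = reconstruction_loss(generator_matrix, test_images, steps=50)    # monitoring
final_loss = reconstruction_loss(generator_matrix, test_images, steps=2000)    # final evaluation
```

The two step counts mirror the split between the cheap running evaluation during training and the more expensive final evaluation on the whole test set.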
BN is still the worst model compared to the vanilla and WN models. Now the WN model achieves its optimal loss early on, but then becomes worse. On the other hand, the vanilla model keeps improving. However, both models converge, as shown by the flat section in the loss curve between iterations 400,000 and 500,000. So we take these two models at iteration 500,000 for evaluation, in addition to the optimal 50-step models. The BN model does not actually converge, as the rapidly changing “recurring feature” discussed in Appendix C occurs. For completeness, however, we also take the BN model from iteration 500,000 for evaluation.
The seemingly worse 500,000-iteration WN model turned out to give the best final reconstruction result. A more careful examination of the reconstruction process revealed that the 500,000-iteration WN model overtook the optimal vanilla model at around the 400th reconstruction step. We acknowledge that this exposes a weakness of our evaluation method: performing the reconstruction for too few steps may give inaccurate results, while too many steps would be time-consuming, which makes it unsuitable for monitoring the training process.
Model  Iteration  Running loss  Final loss 

vanilla  387,000  0.010382  0.003413 
BN  1,000  0.017987  0.004904 
WN  50,000  0.010906  0.003509 
vanilla  500,000  0.010948  0.003414 
BN  500,000  0.019287  0.005421 
WN  500,000  0.014195  0.003269 
Figure 25 shows random samples generated by the three models, at their optimal iteration and at iteration 500,000. While the visual quality of the samples from all models is good, the results are consistent in terms of diversity with the analysis in Appendix C. The vanilla samples look dull and are dominated by one color (green); the BN samples show a recurring feature (marked with a red border).
Figure 26 shows random test samples and reconstructed ones.
d.4 Experiments on LSUN Bedroom Dataset
There are 3,033,042 images in the bedroom class of the LSUN dataset, with images having 256 pixels on the shorter side. Unlike many published results on this dataset, we use the full-sized images. We crop centered 256×256 patches but do not downsample the images. We construct models in a similar way as for CelebA, but stop further convolutions when the spatial size of the feature map reaches 4×4 and use a code length of 512. Due to the large size of the images and the network, we reduce the batch size to 12 to save computation.
We use 2,000 images for evaluation and the rest for training. During training, evaluation is performed every 1,000 iterations on 200 images, with 50 gradient descent steps. Final evaluation is performed on the whole test set with 2,000 gradient descent steps.
For the vanilla model, training fails consistently, even when we reduce the learning rate by a factor of 10 (to ), so only the BN and WN models are compared here. The BN model collapsed at iteration 330,800. The WN model was trained for 600,000 iterations. The reconstruction loss is shown in Figure 27 and Table 8.
Model  Optimal iteration  Running loss  Final loss 

BN  125,000  0.020943  0.011051 
WN  478,000  0.016266  0.008546 
Random samples generated by the two models are shown in Figures 28 and 29. Reconstructions of random samples are shown in Figure 30.
In the BN samples, recurring tile-like artifacts are observed. The best samples generated by the BN model are arguably sharper and cleaner, while the WN model reproduces details more accurately.
Appendix E Connection to Wasserstein GAN
For Wasserstein GANs [1], the discriminator is replaced with a critic that is Lipschitz-continuous for some constant and only depends on the structure of the network. To achieve this, the authors clipped the parameters of the critic network to a small window after each parameter update during training.
We claim that our weight-normalized discriminator is Lipschitz-continuous with a small modification:
Claim 1.
The weight-normalized discriminator proposed in this paper is Lipschitz-continuous for some constant if the sigmoid layer is removed and the only affine weight-normalized layer is replaced by a strict weight-normalized layer.
To see this, we first prove the following lemma:
Lemma 2.
For a strict weight-normalized layer
(18) 
where , , is the weight matrix and the th column of . If the loss function of the network is , then
(19) 
Proof.
∎
For a strict weight-normalized convolution layer with input channels and kernel size , change in inequality 19 to .
Note that in our implementation, the learned slopes of the parametric ReLU layers are clipped to , so the following becomes obvious:
Lemma 3.
For a TPReLU layer with input and output in ,
(20) 
Now it is easy to see that Claim 1 is true, since for each layer the sum of the absolute values of the gradients grows by at most a constant factor. So with such a modification, our discriminator turns into a WGAN critic.
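A related bound can be checked numerically in the Euclidean norm, which is a different formulation from the gradient-sum argument used in the lemmas. The sketch below assumes fully connected strict weight-normalized layers (inner product divided by the weight norm) and leaky ReLUs whose slopes are clipped to [0, 1]; the clipping range, the omission of the TPReLU translation (which does not affect the Lipschitz constant), and all names and sizes are assumptions. Since every normalized weight matrix has rows of norm at most 1, its spectral norm is at most the square root of its number of outputs, so the product of these factors bounds the Lipschitz constant and depends only on the layer sizes.

```python
import numpy as np

def strict_weight_norm_linear(x, weight, eps=1e-6):
    """y_i = <w_i, x> / ||w_i|| for each row w_i of `weight`."""
    norm = np.sqrt(np.sum(weight ** 2, axis=1) + eps)
    return x @ (weight / norm[:, None]).T

def clipped_leaky_relu(x, slope):
    slope = np.clip(slope, 0.0, 1.0)     # assumed clipping range
    return np.where(x > 0, x, slope * x)

def critic(x, weights, slopes):
    """Strict weight-normalized layers with clipped leaky ReLUs, no sigmoid."""
    h = x
    for weight, slope in zip(weights[:-1], slopes):
        h = clipped_leaky_relu(strict_weight_norm_linear(h, weight), slope)
    return strict_weight_norm_linear(h, weights[-1])

rng = np.random.default_rng(0)
sizes = [32, 16, 8, 1]
weights = [5.0 * rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
slopes = [0.3, 1.7]                      # 1.7 is clipped to 1

# Bound from layer sizes only: product of sqrt(number of outputs) per layer.
lipschitz_bound = float(np.prod([np.sqrt(w.shape[0]) for w in weights]))

x1 = rng.standard_normal((1000, 32))
x2 = x1 + 0.1 * rng.standard_normal((1000, 32))
output_gap = np.abs(critic(x1, weights, slopes) - critic(x2, weights, slopes))[:, 0]
max_ratio = float((output_gap / np.linalg.norm(x1 - x2, axis=1)).max())
```

The observed `max_ratio` stays below `lipschitz_bound` regardless of the scale of the raw weights, which is the property that weight clipping enforces in the original WGAN.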