Batch Normalization (BN) [Ioffe and Szegedy, 2015] is considered one of the breakthrough enabling techniques in training deep neural networks. Without it, the gradient in each layer is tightly coupled to all other layers. Should the gradient on any layer be close to zero, this chokes off the gradient to all subsequent layers during back-propagation. This problem is known as vanishing gradients. BN works as follows. The activations from a layer, for a full mini-batch, are passed to the BN function. It calculates the sample mean and standard deviation for this mini-batch. It subtracts this mean and divides by this standard deviation to leave the activations for the mini-batch with a mean of zero and a standard deviation of one. Next, it does the reverse of this step by multiplying by a new standard deviation called $\gamma$ and adding a new mean called $\beta$. Importantly, $\gamma$ and $\beta$ are trainable parameters that exist for each channel of activations in a layer. The net effect is that the distribution of activations of one layer is shifted and expanded/contracted to match the input to the next layer.
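The two BN steps described above can be sketched for a single channel as follows. This is a minimal illustration rather than a reference implementation: the function name `batch_norm`, the small constant `eps` added for numerical stability, and the toy activation values are assumptions made for the example.

```python
import math

def batch_norm(channel_values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise one channel of a mini-batch, then rescale and shift it."""
    n = len(channel_values)
    mean = sum(channel_values) / n
    # Biased (population) variance over the mini-batch.
    var = sum((v - mean) ** 2 for v in channel_values) / n
    std = math.sqrt(var + eps)  # eps is an assumed stabiliser
    # Step 1: zero mean and unit standard deviation.
    normalised = [(v - mean) / std for v in channel_values]
    # Step 2: gamma acts as the new standard deviation, beta as the new mean.
    return [gamma * x + beta for x in normalised]

# Toy activations for one channel (hypothetical values).
activations = [2.0, 4.0, 6.0, 8.0]
out = batch_norm(activations, gamma=0.5, beta=3.0)
out_mean = sum(out) / len(out)
out_std = math.sqrt(sum((v - out_mean) ** 2 for v in out) / len(out))
```

After the call, `out_mean` is close to `beta` and `out_std` is close to `gamma`, which is the sense in which the trainable parameters set the new mean and standard deviation.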
Generative networks take various forms. [Johnson et al., 2016] show an example of image-in image-out networks for super-resolution and style transfer. Generative Adversarial Networks (GANs), introduced by [Goodfellow et al., 2014], take a latent vector as input with the output being an image. All of these forms of generative networks must at some point constrain the activations to pixel values. For the case of colour images, the network must reduce down to three channels and the values would normally be constrained to be integers in the range $[0, 255]$. To constrain the pixel values to an appropriate range, the activation of choice is the $\tanh$, although a sigmoid could also be used. The $\tanh$ function takes an unbounded real number and constrains it to the real number range $[-1, 1]$. However, $\tanh$ is a non-linear function and we see from Figure 1 that inputs outside the range $[-2, 2]$ will be saturated. When converted to an 8-bit image, these saturated values will convert to colour values of $0$ and $255$. For most real images the pixel values will be well spread between $0$ and $255$. If the generator network is to produce realistic looking images, it should naturally produce images that have pixel values well spread between $-1$ and $1$ on the output of the $\tanh$. The network previous to the $\tanh$ should be aiming to produce activations with a mean close to zero and standard deviation close to one, to ensure a good spread of values entering the $\tanh$. It is still reasonable for some activations of the $\tanh$ to saturate. Many real-world colour values will be $0$ or $255$. Placing a BN layer between the final activations and the $\tanh$ allows the activations earlier in the network to be less constrained. The BN will shift and spread/condense the values to a range that suits the $\tanh$ function. Indeed we show that the $\tanh$ may not be optimal and that BN with appropriate values may suffice with simple clipping to $[-1, 1]$, keeping in mind that clipping is a non-linear operation.
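The saturation effect can be illustrated with a short sketch, assuming the bounded output activation is $\tanh$ and that its $[-1, 1]$ output is mapped linearly onto 8-bit values. The helper `to_uint8` and the linear mapping are assumptions made for the example; the networks discussed here may scale their outputs differently.

```python
import math

def to_uint8(y):
    """Map a value in [-1, 1] to an 8-bit colour value (assumed linear mapping)."""
    return int(round((y + 1.0) * 127.5))

# Pre-activation values, from strongly negative to strongly positive.
inputs = [-6.0, -2.0, -0.5, 0.0, 0.5, 2.0, 6.0]
pixels = [to_uint8(math.tanh(x)) for x in inputs]

# Two very different large inputs become indistinguishable after tanh.
same_pixel = to_uint8(math.tanh(6.0)) == to_uint8(math.tanh(10.0))
```

Inputs near zero map to distinct mid-range colour values, while inputs far from zero collapse onto the extremes of the 8-bit range, so the network cannot tell one saturated activation from another.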
2 State of the Art
2.1 Generative Adversarial Networks (GANs)
GANs were originally introduced by [Goodfellow et al., 2014]. In their experiments on images, they are not explicit about the architecture design, other than to say that the generator used a mixture of rectified linear and sigmoid activations while the discriminator used maxout activations. There is no suggestion that BN was used. The ideas outlined in [Goodfellow et al., 2014] have sparked a large body of research, but the architectures used in most GANs for image generation follow the design of [Radford et al., 2015].
[Radford et al., 2015] introduced what is commonly referred to as the DCGAN (Deep Convolutional GAN). They introduce the $\tanh$ activation at the output of the final layer, observing that using a bounded activation like $\tanh$ allows the model to saturate quicker and thus cover the colour space of the distribution. This is legitimate if the default output distribution is tightly compacted within the output activation, but if it is widely spread or far from centred then the $\tanh$ may saturate the majority of the outputs, making it very difficult for the generator to learn. [Radford et al., 2015] advise using BN in most layers except for the final layer of the generator and the first layer of the discriminator. They noted that including BN in those layers led to sample oscillation and model instability, which was avoided when it was removed. It should be noted here that we have also experienced this with the DCGAN using the original loss regime from [Goodfellow et al., 2014]. However, with some other designs and loss regimes, we find that BN is a benefit and that this heuristic may not be appropriate everywhere.
[Goodfellow, 2016] refers to the key insights of the DCGAN, stating that BN is left out of these layers so that the model can learn the correct mean and scale of the distribution. BN has learnable parameters ($\gamma$, $\beta$) that can represent these and are condensed into (, ), though there may be reasons the [Goodfellow et al., 2014] loss regime prefers to distribute this over the rest of the weights in the network. A clear explanation of why BN in these specific layers causes oscillation and instability has, to our knowledge, not been given. One of the key problems with GANs that follow [Goodfellow et al., 2014] is that the loss functions do not inform us how training is progressing.
[Arjovsky et al., 2017] introduced a new loss function to GANs, called the Wasserstein distance. The Wasserstein distance conveys how training is progressing, though in this first implementation it was crudely approximated in a computerised setting by constraining the weights of the discriminator network. In their experiments, they remove BN from DCGANs entirely. They also used a constant number of filters in each layer instead of the doubling at each layer used in [Radford et al., 2015]. There does not seem to be any reasoning given for this design choice, except perhaps to show that the Wasserstein distance can overcome all these handicaps or perhaps that they do not matter. [Gulrajani et al., 2017] improve upon the work of [Arjovsky et al., 2017] with the improved WGAN. The main improvement was in the approximation in a computerised setting, though this also led to necessary changes in architecture. [Gulrajani et al., 2017] do use BN in the generator but use Layer Normalization [Lei Ba et al., 2016] in the critic (the WGAN name for the discriminator). It is not stated whether they use BN in the final layer of the generator, but given that they say they follow the DCGAN architecture of [Radford et al., 2015] we can assume that they do not. The GitHub repository they supply certainly does not appear to use it in the final layer. The [Gulrajani et al., 2017] architecture will be referred to as the iWGAN in this paper.
2.2 Image-in Image-out networks
[Johnson et al., 2016] created a generative network for super-resolution and style-transfer. They used a pre-trained VGG network [Simonyan and Zisserman, 2014] as their loss function, and the generative network consisted of multiple residual convolutional blocks with BN. However, they also intentionally leave BN out of the final layer, stating that they use a scaled $\tanh$ function on the output layer to ensure that the pixels are in the range $[0, 255]$. The suggestion here is that the $\tanh$ replaces the need for BN, though there is no explanation as to why one would be a direct replacement for the other. In section 3 we will show the different effects of $\tanh$ and BN and make the case that BN makes more sense in the output of a generator network.
2.3 How to evaluate generative models
It is worth considering how to evaluate the performance of any type of generative model. [Theis et al., 2015] compare average log-likelihood, Parzen window estimates [Breuleux et al., 2009] and visual fidelity of samples, which at the time were the most commonly used evaluation criteria for generative models. They show robust theoretical reasons why these are largely independent of each other. Their results show that Parzen window estimates should in general be ruled out. Of average log-likelihood and visual fidelity, they show that both can be misleading. Plausible samples can be generated while still achieving poor log-likelihood. Conversely, good log-likelihood can be achieved with no guarantee of visually plausible samples.
We can see from much of the state of the art in generative networks above that the heuristic of leaving BN out of the final layer of the generator network is widely used. Despite this, the justifications given for it appear inconsistent. In the following sections we will show, for some of the solutions above, that BN in the final layer can lead to faster training.
3 How BN relates to $\tanh$
Consider what happens to a distribution of activations that vary beyond the bounds in Figure 1. Any activations beyond these bounds will saturate to $\pm 1$. While the output will be bound to the desired $[-1, 1]$, the network has no way to differentiate between any one saturated activation and another, which creates difficulties with learning. We have seen in practice that this can be overcome, and the values do eventually begin to fall in the correct range, but this can take time.
Now consider what happens if we put BN before this $\tanh$. In the first part of the BN algorithm, the mean and standard deviation of the batch of activations will be calculated, the mean will be subtracted from each activation value, and then each will be divided by the batch’s standard deviation. The expected value of any activation is now zero, which is the central value of the input to the $\tanh$ function, and the standard deviation of an activation is one, which means that most activation values will be between $-2$ and $2$. The majority of activations (up to two standard deviations) will result in an unsaturated output from the $\tanh$ function. From two standard deviations and above we expect saturation. This assumes that BN’s learnable parameters do nothing, but as they are trainable, they can vary to suit the target distribution. If any channel of the target distribution has a non-zero mean, then the $\beta$ parameter for that channel can vary to suit. Likewise for the standard deviation of colour values of the target distribution. They must change in such a way that they move the activations to a value that will give the desired target distribution after output from the $\tanh$. This does present a problem, as $\gamma$ and $\beta$ are parameters of an affine operation and varying them cannot undo the effect of a non-linear function like $\tanh$.
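As a rough numerical check of this argument, the sketch below compares the fraction of activations that lie beyond two units from zero before and after the normalisation step of BN. The Gaussian activations with mean 5 and standard deviation 4, the saturation limit of 2, and the helper names are all assumptions chosen for illustration.

```python
import math
import random

def normalise(values, eps=1e-5):
    """First part of BN: subtract the batch mean, divide by the batch std."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def saturated_fraction(values, limit=2.0):
    """Fraction of values whose magnitude exceeds the assumed saturation limit."""
    return sum(1 for v in values if abs(v) > limit) / len(values)

rng = random.Random(0)
# Hypothetical badly centred, widely spread pre-activations.
raw = [rng.gauss(5.0, 4.0) for _ in range(10000)]

frac_raw = saturated_fraction(raw)            # most values would saturate
frac_bn = saturated_fraction(normalise(raw))  # only the tails would saturate
```

With these assumed inputs, most of the raw activations would land in the saturated region of the $\tanh$, whereas after normalisation only the tails beyond two standard deviations do.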
This brings up the question, why the $\tanh$ at all? If the BN can learn or be set to the target distribution for each channel, is there any benefit to the $\tanh$? We show later that the $\tanh$ is unnecessary and that replacing it with a clipping operation can, in fact, bring a small improvement.
It should be noted that BN is not a replacement for a non-linearity in general. BN is an affine operation so it cannot serve as a non-linearity. Instead, we are saying that the affine operation it offers, along with a clipping activation (itself a non-linearity), is of more utility here than the $\tanh$. Aside from the effects of adding complexity and generalisation to the network, a $\tanh$ or clipping non-linearity is necessary here; without it, a number outside the range could be inserted into the image and cause artefacts.
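The difference between the two output non-linearities can be seen in a small sketch, assuming a $\tanh$ output and a clipping range of $[-1, 1]$. The helper `clip` and the sample inputs are illustrative assumptions.

```python
import math

def clip(x, low=-1.0, high=1.0):
    """Pass values inside [low, high] unchanged and clamp the rest."""
    return max(low, min(high, x))

xs = [-3.0, -1.0, -0.5, 0.0, 0.5, 1.0, 3.0]
clipped = [clip(x) for x in xs]        # identity inside the range
squashed = [math.tanh(x) for x in xs]  # compresses values everywhere except at zero
```

Clipping leaves in-range values exactly as the preceding affine operation produced them, so per-channel means and standard deviations set by BN survive for unsaturated values. The $\tanh$ bends every non-zero value, which an affine operation placed before it cannot undo.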
As outlined in section 2.3, evaluation of generative models is not straightforward. In this paper, we will only be looking at the first phase of training. To some extent, we will use visual fidelity, but usually only in conjunction with a histogram of the image showing the distribution of values. As the work of [Theis et al., 2015] predates that of [Arjovsky et al., 2017, Gulrajani et al., 2017, Johnson et al., 2016], they did not consider using the loss functions of these methods as a way to measure the performance of generative models. We do use these loss functions, but it should be noted that they suffer the same drawbacks as average log-likelihood, namely that plausible samples can be generated while still achieving high loss, and a low loss can be achieved with no guarantee of visually plausible samples.
To determine the efficacy of using BN in the final layer of the generative network, we will look at the Image-in Image-out architecture of [Johnson et al., 2016] and the iWGAN architecture of [Gulrajani et al., 2017].
For the Image-in Image-out architecture we consider super-resolution. We will compare the histogram/images created at the initial phase of training and also look at the trend of the perceptual loss function, which is a meaningful value that should tend to zero as training progresses. We use a subset of ImageNet [Deng et al., 2009] (19439 images) of size as our high-res target distribution. From this set, we make a low-res set of images of size . The generator network must learn to output the hi-res data set in response to the low-res set input. To see how training is progressing we will take an image from outside the target data set and resize to . This will be passed to the generator at periodic intervals of training. The output hi-res image will be considered in terms of visual fidelity and histogram distribution. Using this underlying architecture, we compare the combinations of $\tanh$ alone, BN with $\tanh$ and BN with clipping. For BN in conjunction with the $\tanh$ we use the default initialisation values of $\gamma = 1$ and $\beta = 0$, as these are reasonable numbers for spread across the $\tanh$ function. For BN-with-clipping we initialise $\gamma$ and $\beta$ with the per-channel standard deviation and mean of the hi-res training set.
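The initialisation for BN-with-clipping can be sketched as below. The toy pixel list stands in for the hi-res training set and is assumed to be already scaled to $[-1, 1]$; the function name `channel_stats` and the data are hypothetical and not taken from the experiments.

```python
import math

def channel_stats(pixels):
    """Per-channel mean and standard deviation of a list of (r, g, b) pixels."""
    n = len(pixels)
    means = [sum(p[c] for p in pixels) / n for c in range(3)]
    stds = [
        math.sqrt(sum((p[c] - means[c]) ** 2 for p in pixels) / n)
        for c in range(3)
    ]
    return means, stds

# Toy stand-in for the training set (hypothetical values in [-1, 1]).
training_pixels = [
    (-0.8, 0.1, 0.5),
    (0.2, 0.3, 0.7),
    (0.6, -0.1, 0.9),
    (0.0, 0.5, 0.3),
]

# The mean initialises the shift, the standard deviation the scale.
beta_init, gamma_init = channel_stats(training_pixels)
```

In a real pipeline these statistics would be accumulated over every pixel of every training image, one pair of values per colour channel, and then copied into the final BN layer before training starts.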
For the iWGAN architecture, we use the critic loss, which [Arjovsky et al., 2017] showed was proportional to the Wasserstein distance, as a measure of how quickly our different output layers converge. CIFAR-10 [Krizhevsky et al., 2009] $32 \times 32$ pixel images of frogs (5000 images) are used for training. Once again we will consider visual fidelity and in particular the histogram spread of values. As above, we compare the combinations of $\tanh$-alone, BN-with-$\tanh$ and BN-with-clipping. Details of both architectures and hyper-parameter values are in the Appendix.
Figure (2) shows three plots of perceptual loss at each batch update iteration for the three final layer types, $\tanh$-alone, BN-with-$\tanh$ and BN-with-clipping, used with the super-resolution network.
Figure (3) shows the critic loss (Wasserstein distance) change over generator iterations. The three plots show the distance for $\tanh$-alone, BN-with-$\tanh$ and BN-with-clipping used with the iWGAN design. All plots use a 100 point running mean; the actual loss has a much greater variance.
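A running mean of the kind used to smooth these plots can be sketched as follows. The trailing window and the shorter averages used for the first few points are assumptions, since the exact smoothing variant is not specified.

```python
def running_mean(values, window=100):
    """Trailing mean over at most `window` points (shorter at the start)."""
    smoothed = []
    total = 0.0
    for i, v in enumerate(values):
        total += v
        if i >= window:
            # Drop the value that has just left the window.
            total -= values[i - window]
        smoothed.append(total / min(i + 1, window))
    return smoothed

# Small window so the behaviour is easy to follow by hand.
example = running_mean([1, 2, 3, 4], window=2)
```

With `window=100` each plotted point is the average of the most recent 100 loss values, which hides the much larger iteration-to-iteration variance of the raw loss.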
Figure (2) shows the perceptual loss function for the three scenarios, $\tanh$ alone, BN with $\tanh$ and BN with clipping. We can see that BN-with-clipping shows a very slight improvement on BN-with-$\tanh$. It is difficult to determine whether choosing a better initialisation for the BN parameters (rather than the default) for the BN with $\tanh$ would close the gap, or whether this is a more fundamental difference between using BN with clipping and BN with $\tanh$. We can certainly see that in the early part of training the architectures with BN outperform $\tanh$ alone. As training progresses, $\tanh$ alone eventually converges with the others. This is something we see with the iWGAN as well. However, BN has predominantly been used as a method of faster training. Here as well we achieve faster and smoother convergence in the early part of training, while the $\tanh$ alone version must learn the correct mean and standard deviation of the distribution. We can see the effect of this on the visual fidelity of the output images. The first two rows show the $\tanh$ alone, the middle two the BN with $\tanh$ and the final two rows show BN with clipping. A large part of the loss relates to the pixel values not being in the correct dynamic range. We also see that the network is still able to cope with an image that has a large section of saturated pixels.
Looking at Figure (3) we see all curves have a steep change in critic loss but at different times. For the $\tanh$-alone the steep reduction in loss occurs between 900 and 1500 generator iterations. For the network with the extra BN layer before the $\tanh$ the reduction begins much earlier, at approximately 500 iterations, and the Wasserstein distance never reaches the magnitudes shown in the example without the BN. A look at the image output and, more importantly, the histograms for the $\tanh$ alone (top two rows) shows that the steep reduction coincides with the histogram falling in line with the expected histograms of the data set. Note the number of generator iterations at the top of each histogram, which shows where it was recorded. This suggests the vast majority of the Wasserstein distance relates to the generated image having most pixels in saturation. For the BN with $\tanh$ (middle two rows) this also happens earlier, suggesting that the large change in Wasserstein distance is caused by moving the image pixel values to match the dynamic range that allows them to produce a well-behaved histogram. For the BN with clipping, where the values are already set to the standard deviation and mean of the target distribution, we see only a minimal change and we note the histograms are well behaved from very early in training. The images should be of frogs. This is not at all obvious, and yet we see that most of the Wasserstein distance does not relate to perceptual content but instead to the spread of colour values.
We have shown that the use of BN in the final layer of generator networks deserves reconsideration. While the heuristic to remove BN from the final layer may be justified when using the architectures and techniques of [Goodfellow et al., 2014, Radford et al., 2015], its use in image-in image-out generation and Wasserstein GANs may be beneficial, at least regarding the early part of training. Consideration should also be given to replacing $\tanh$ on the output with BN with clipping, particularly if the mean and standard deviation of the target data set can be reliably approximated.
The question of whether BN in the final layer can lead to a better long-term result is still unclear, and further long-run testing would be required for this. This may require better methods for assessing the quality of generative networks. The issues surrounding the problems encountered by [Radford et al., 2015] deserve further investigation. If BN in the final layer is the reason for the oscillation and instability, it would be interesting to learn why. There may be other ways to solve the problem that give better results than simple removal of the BN from the final layer.
The principal author would like to acknowledge Jeremy Howard and Rachel Thomas for the fast.ai MOOC, and Shaofan Lai, on whose material some of the base programs for the experiments in this paper were based.
Appendix A: Network and Hyper-parameter details
For Super-Resolution: VGG networks are used as the loss networks with a weighted average of the VGG layers [Conv1_1, Conv2_1, Conv3_1] respectively. We use ADAM [Kingma and Ba, 2014] with Keras default learning rates. The generator is the one used by [Johnson et al., 2016] in their supplementary material, table 2, 4x super-resolution.
For iWGAN: The training regime is that the critic runs 5 iterations for every generator iteration apart from in the first 25 generator iterations where the critic runs 50 iterations for every generator iteration. In order to further ensure that the critic remains at optimality, the critic runs 50 iterations on the 500th iteration of the generator. The latent vector has dimension 64. We use ADAM with learning rate 0.0001, and . The network architecture can be seen at https://github.com/seanmullery/iWGAN
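The critic schedule described for the iWGAN can be written as a small helper. This is a sketch under two assumptions: generator iterations are counted from 1, and the extra critic updates happen only at generator iteration 500 rather than at every multiple of 500. The function name `critic_iterations` is hypothetical; the linked repository is the authority on the actual behaviour.

```python
def critic_iterations(generator_iteration):
    """Critic updates to run for the given (1-based) generator iteration."""
    # Warm-up: 50 critic updates for each of the first 25 generator iterations,
    # and again at generator iteration 500 (assumed to be a single occurrence).
    if generator_iteration <= 25 or generator_iteration == 500:
        return 50
    # Otherwise the critic runs 5 times per generator iteration.
    return 5

# Total critic updates over the first 1000 generator iterations.
total_updates = sum(critic_iterations(i) for i in range(1, 1001))
```

A training loop would call `critic_iterations(i)` at the start of generator iteration `i` and run that many critic updates before the single generator update.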
- [Arjovsky et al., 2017] Arjovsky, M., Chintala, S., and Bottou, L. (2017). Wasserstein GAN.
- [Breuleux et al., 2009] Breuleux, O., Bengio, Y., Vincent, P., and Montreal, U. (2009). Unlearning for Better Mixing. pages 1–14.
- [Deng et al., 2009] Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09.
- [Goodfellow, 2016] Goodfellow, I. (2016). NIPS 2016 Tutorial: Generative Adversarial Networks.
- [Goodfellow et al., 2014] Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative Adversarial Networks. pages 1–9.
- [Gulrajani et al., 2017] Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. C. (2017). Improved Training of Wasserstein GANs. CoRR, abs/1704.00028.
- [Ioffe and Szegedy, 2015] Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR, abs/1502.03167.
- [Johnson et al., 2016] Johnson, J., Alahi, A., and Fei-Fei, L. (2016). Perceptual Losses Supplementary. Arxiv, pages 1–5.
- [Kingma and Ba, 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. CoRR, abs/1412.6980.
- [Krizhevsky et al., 2009] Krizhevsky, A., Nair, V., and Hinton, G. (2009). CIFAR-10 (Canadian Institute for Advanced Research).
- [Lei Ba et al., 2016] Lei Ba, J., Kiros, J. R., and Hinton, G. E. (2016). Layer Normalization. ArXiv e-prints.
- [Radford et al., 2015] Radford, A., Metz, L., and Chintala, S. (2015). Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. pages 1–16.
- [Simonyan and Zisserman, 2014] Simonyan, K. and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
- [Theis et al., 2015] Theis, L., van den Oord, A., and Bethge, M. (2015). A note on the evaluation of generative models. pages 1–10.