1 Introduction and Related Work
Since their introduction by Goodfellow et al. [8], Generative Adversarial Networks (GANs) have become the de facto standard for high-quality image synthesis. The success of GANs comes from the fact that they do not require manually designed loss functions for optimization, and can therefore learn to generate complex data distributions without needing to define them explicitly. While flow-based models such as [4, 5, 23, 15] and autoregressive models such as [27, 26, 25] allow training generative models directly using Maximum Likelihood Estimation (explicitly and implicitly, respectively), the fidelity of the images they generate has not yet been able to match that of the state-of-the-art GAN models [13, 14]. However, GAN training suffers from two prominent problems: (1) mode collapse and (2) training instability. Mode collapse occurs when the generator network is only able to capture a subset of the variance present in the data distribution. Numerous works [24, 34, 13, 18] have been proposed to address this problem.
In this work, we address the problem of training instability. This is a fundamental issue with GANs, and has been widely reported by previous works [24, 19, 2, 9, 16, 28, 12, 13, 31, 22]. One theory for the instability is that the generator is not actually able to minimize the Jensen-Shannon divergence between the generated data distribution and the real data distribution. The main reason is that an optimal discriminator cannot exist in practice: the discriminator can only become optimal if it is trained against all possible generators (an infinite set ranging from the worst-performing to the best-performing ones), while such a set of all possible generators cannot exist unless there is an optimal discriminator. Thus, the Nash equilibrium for a GAN is not the same as the one for divergence minimization.
We propose a method to address training instability for the task of image generation, by investigating how gradients at multiple scales can be used to generate high resolution images (typically more challenging due to the dimensionality of the space) without relying on previous greedy approaches, such as the progressive growing technique [13]. Our MSG-GAN technique allows the discriminator to look not only at the final output (highest resolution) of the generator, but also at the outputs of the intermediate layers (Fig. 2). As a result, the discriminator becomes a function of multiple scale outputs of the generator and, importantly, passes gradients to all the scales simultaneously (more details in section 2 and section 3). In summary, we present the following contributions:
We propose the Multi-Scale Gradient GAN (MSG-GAN) technique for image synthesis, which allows gradients to flow from the discriminator to the generator at multiple scales. MSG-GAN improves the stability of training as defined in prior work [13], where the variance of images generated from fixed latent points during training is quantified.
We empirically validate the approach and show that we are able to robustly generate high quality samples on a number of commonly used datasets, including CIFAR10, Oxford102 flowers, and even high resolution results on CelebA-HQ, without any hyperparameter tuning.
Arjovsky and Bottou [1] pointed out that one of the reasons for the training instability of GANs is the passage of random (uninformative) gradients from the discriminator to the generator when there is insubstantial overlap between the supports of the real and fake distributions. Two recent works focus on alleviating this problem, viz. the 'Variational Discriminator Bottleneck' (VDB) [22] and the 'Progressive Growing of GANs' (ProGANs) [13]. The VDB's solution is to apply a mutual information bottleneck between the input images to the discriminator and its deepest representation of them. This forces the discriminator to focus only on the most discerning features of the images for classification into real and fake. Our work is orthogonal to the VDB technique, and we leave an investigation into a combination of MSG-GAN and VDB to future work.
The ProGANs technique tackles the instability problem by training the GAN layer-by-layer, progressively doubling the generated image resolution. At every point in training, all the previously trained layers remain trainable, and whenever a new layer is added it is slowly faded in such that the learning of the previous layers is retained. Intuitively, this technique helps with the instability problem because, for every new layer added, the network learns finer details while retaining the low frequency details previously learnt, thus progressively generating sensible and higher resolution images. Learning and retaining the low frequency details ensures that the distribution of generated images tends to be closer to that of real images, which reduces the chances of instability during training.
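For intuition, the fade-in of a newly added layer amounts to a linear blend between the old pathway and the new one. The following is a minimal numpy sketch (the helper names are illustrative, not the ProGANs code):

```python
import numpy as np

def fade_in(upsampled_prev, new_layer_out, alpha):
    # Blend the (upsampled) output of the previously trained stage with the
    # output of the newly added layer; alpha ramps from 0 to 1 during training.
    return (1.0 - alpha) * upsampled_prev + alpha * new_layer_out

prev = np.zeros((8, 8, 3))   # output of the old, stable pathway
new = np.ones((8, 8, 3))     # output of the freshly added layer
blended = fade_in(prev, new, alpha=0.25)   # early in the fade-in schedule
```

At alpha = 0 the new layer is ignored entirely; at alpha = 1 it fully takes over, which is why an interruption mid-fade requires saving both sets of weights and the current alpha.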
While this approach is able to generate state-of-the-art results, we found it to be somewhat difficult to use, due to its sensitivity to hyperparameter changes at several points during training. We describe some observations that motivated us to develop the MSG-GAN technique:
The model trained at a lower resolution does not produce the same images after subsequent higher resolution layers are trained (see Fig. 3). This means that it can be hard to judge how well the method is working until it has converged at the highest resolution. Another effect is that the parameters of the 1 x 1 conv layers used for generating images at intermediate resolutions during training are essentially "wasted" in the end.
The progressive growing training scheme introduces complexity in the form of extra parameters and training steps that have to be carefully managed. One example is that if training is interrupted during the fade-in iterations, the exact state of the blending, including both sets of weights and the blending value, needs to be considered when resuming training. Another is that, according to ProGANs [13], an equal number of iterations should be spent at every resolution. However, in our experiments with ProGANs, we found that with this schedule there are often unnecessary iterations at the lowest resolutions, while the same number of iterations is not enough for the higher resolutions. We found that the optimal training schedule for ProGANs is task specific, and deriving it requires empirical analysis as well as constant supervision, which can be hard given the previous point (higher variance in result quality during training).
These challenges motivated us to find an alternate solution for generating high resolution images that does not rely on the progressive blending step. The idea of multi-scale image generation is a well established technique, with methods existing well before deep networks became popular for this task [17, 29]. More recently, a number of GAN-based methods break the process of high resolution image synthesis into smaller subtasks [30, 33, 32, 6, 7, 13]. For example, LR-GAN [30] uses separate generators for synthesizing the background, foreground and compositor masks for the final image. We use a similar multi-resolution image synthesis paradigm; however, our technique is different from StackGAN and GMAN [33, 32, 6], which use multiple separate discriminators for a single generator, and also different from MAD-GAN [7], which uses multiple generators and a single discriminator. We instead make multiple connections from a single generator to a single discriminator at intermediate layers.
This approach has several advantages, largely driven by the simplicity of the proposed design. If multiple discriminators are used, one at each resolution [33, 6], the total number of parameters grows exponentially across scales, as repeated downsampling layers are needed, whereas in MSG-GAN the relationship is linear. Besides requiring fewer parameters and design choices, another important advantage of our approach is that it avoids the need for an explicit color consistency regularization term across images generated at multiple scales, which was necessary in prior approaches [32]. In MSG-GAN, the discriminator's multi-scale layers are composed with each other, enforcing that images are coherent across scale space.
3 Multi-Scale Gradient GAN
Figure 2 shows an overview of our architecture, and we provide a mathematical definition of the MSG-GAN framework in this section. Our base model starts with a ProGANs [13] architecture. Let the initial block of the generator be defined as $g_{\text{init}} : \mathcal{Z} \rightarrow A_0$, such that $a_0 = g_{\text{init}}(z)$ for $z \in \mathcal{Z}$, where $A_0$ contains activation volumes at the 4 x 4 resolution. Let the function $g_i : A_{i-1} \rightarrow A_i$ be a generic function which acts as the basic generator unit of the GAN. In our implementation, $g_i$ is modelled as a block containing upsampling followed by conv layers. The generator of our GAN follows the standard format, and can be defined as a composition of $g_{\text{init}}$ with $k$ such functions:

$G(z) = (g_k \circ g_{k-1} \circ \dots \circ g_1 \circ g_{\text{init}})(z)$

We now define the functions $r_i : A_i \rightarrow O_i$ which generate the outputs at different stages of the generator (red blocks in Fig. 2), where the outputs correspond to different downsampled versions of the final output resolution. The function $r_i$ can simply be modelled using a (1 x 1) convolution which converts the intermediate convolutional activation volume $a_i$ into images.

In other words, $o_i = r_i(a_i)$ is an image synthesized from the output $a_i$ of the $i$-th intermediate layer in the generator computations. Similar to ProGANs [13], $r_i$ acts as a regularizer, requiring that the learned feature maps can be projected directly into RGB space.
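The generator structure above can be sketched in numpy, treating each 1 x 1 convolution as a per-pixel matrix product and using nearest-neighbour upsampling; the helper names, channel counts, and weight shapes below are illustrative stand-ins, not the actual learned architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def upsample2x(a):
    # Nearest-neighbour upsampling of an (H, W, C) activation volume.
    return a.repeat(2, axis=0).repeat(2, axis=1)

def conv1x1(a, w):
    # A (1 x 1) convolution is a per-pixel linear map over channels:
    # (H, W, C_in) @ (C_in, C_out) -> (H, W, C_out).
    return a @ w

def g_init(z, channels=16):
    # Stand-in for the learned initial block: latent -> 4x4 activation volume.
    return np.tanh(z[: 4 * 4 * channels].reshape(4, 4, channels))

def g_block(a, w):
    # Basic generator unit g_i: upsample, then a conv (here a 1x1 stand-in).
    return np.tanh(conv1x1(upsample2x(a), w))

z = rng.standard_normal(512)
w_rgb = 0.1 * rng.standard_normal((16, 3))   # shared shape for the r_i projections

a = g_init(z)
outputs = [conv1x1(a, w_rgb)]                # o_0: a 4x4 RGB image
for _ in range(2):                           # two blocks -> 8x8, then 16x16
    a = g_block(a, 0.1 * rng.standard_normal((16, 16)))
    outputs.append(conv1x1(a, w_rgb))        # o_i = r_i(a_i)
```

The key point the sketch captures is that every stage exposes an RGB image o_i alongside the activation volume passed to the next block.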
Because the discriminator's final critic loss is a function not only of the final output of the generator ($o_k$, or $G(z)$), but also of the intermediate outputs $o_0, \dots, o_{k-1}$ of the generator, gradients flow from the intermediate layers of the discriminator to the intermediate layers of the generator. We denote all the components of the discriminator function with the letter $d$. Let $d_{\text{final}}$ be the final layer of the discriminator, which provides the critic score. Let $d_k$ be the function which defines the first layer of the discriminator, which takes the real image $x$ (true sample) or the highest resolution synthesized image $o_k$ (fake sample) as the input. Let $d_j$, for $0 \leq j < k$, be a similar function which represents an intermediate layer of the discriminator. Thus, the output activation volume $a'_j$ of any intermediate layer of the discriminator can be written as a function that is defined as:

$a'_j = d_j(\phi(r'_j(o_j),\, a'_{j+1})), \quad \text{with } a'_k = d_k(o_k)$

where $r'_j$ emulates the inverse of $r_j$ (not a true inverse), i.e. it converts the intermediate input image $o_j$ to an activation volume, and $\phi$ combines the output of $r'_j$ with the corresponding intermediate activation volume $a'_{j+1}$ in the discriminator computations. The complete discriminator function can then be defined as:

$D(o_0, \dots, o_k) = d_{\text{final}}(a'_0)$

While $\phi$ can be modelled using either parametric or non-parametric families of functions, we found that defining $\phi$ as a concatenation operation worked well in practice:

$\phi(x, y) = [x\,;\,y]$

where $[\cdot\,;\,\cdot]$ denotes channelwise concatenation. In section 4 we go into detail about the performance of the architecture as evaluated on various image-synthesis datasets.
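The concatenation-based combine step can be sketched as follows; the block shapes, channel counts, and pooling choice are illustrative assumptions in this numpy toy, not the actual discriminator:

```python
import numpy as np

rng = np.random.default_rng(1)

def phi(x, y):
    # phi: channelwise concatenation of the projected image and the activations.
    return np.concatenate([x, y], axis=-1)

def r_prime(o, w):
    # r'_j: 1x1 conv projecting an RGB image (H, W, 3) back to an activation volume.
    return o @ w

def d_block(a, w):
    # Intermediate discriminator unit: 1x1 conv + ReLU, then 2x2 average pooling.
    h = np.maximum(a @ w, 0.0)
    H, W, C = h.shape
    return h.reshape(H // 2, 2, W // 2, 2, C).mean(axis=(1, 3))

o_hi = rng.standard_normal((8, 8, 3))   # highest-resolution generated image
o_lo = rng.standard_normal((4, 4, 3))   # intermediate-resolution generated image

# First discriminator layer: project o_hi to 16 channels, then one block.
a_next = d_block(o_hi @ rng.standard_normal((3, 16)), rng.standard_normal((16, 16)))
# Combine step at the next scale: concatenate the projected image with the
# incoming activations before applying the next discriminator block.
combined = phi(r_prime(o_lo, rng.standard_normal((3, 16))), a_next)
```

Because the concatenation doubles the channel count at the injection point, each subsequent discriminator layer sees both the downsampled feature stream and the fresh image at that scale.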
Figure 4: Image stability during training. These plots show the MSE between images generated from the same latent sample at the beginning of sequential epochs (averaged over 36 latent samples). This distance quantifies the changes made by the generator to the generated images during training. MSG-GAN converges to stable images over time, while ProGANs continues to vary significantly across epochs. We show a subset of all training resolutions for clarity, but note that the remaining resolutions follow the same trend.
4 Experiments
While evaluating the quality of GAN-generated images is a non-trivial task, the most commonly used metrics today are the Inception Score (IS) [24] and Fréchet Inception Distance (FID) [10]. In line with previous works, we use the IS for the CIFAR10 and Oxford102 flowers experiments.
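For reference, the FID compares Gaussian fits (mean and covariance of Inception features) of the real and generated image distributions; its standard definition from Heusel et al. [10] is:

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

where $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the feature means and covariances of the real and generated sets; lower values indicate closer distributions.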
4.1 Implementation Details
We evaluate on CIFAR10 (at 32 x 32 resolution), Oxford flowers (at 256 x 256 resolution), and CelebA-HQ (at 1024 x 1024 resolution). These datasets were selected specifically to show that the MSG-GAN technique can be used in a variety of image settings without any changes. For all three datasets, we use the same initial latent dimensionality of 512, drawn from a standard normal distribution. The architecture of the model used for the CelebA-HQ experiment has exactly the same structure as the model described in ProGANs [13]. The models used on the other datasets are identical, except with fewer upsampling layers. Full details about all the model architectures are provided in the supplementary material.
All the models were trained with RMSprop and a learning rate of 0.001 for both generator and discriminator. Note that we do not use any specialized initialization technique, and initialize all the parameters of both generator and discriminator according to the standard normal distribution. Parameters are scaled according to the equalized learning rate setting of ProGANs [13] at runtime during training and inference. For the EMA technique [31], we use the standard decay value of 0.999.
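The EMA of the generator weights mentioned above can be sketched as follows (a minimal numpy sketch; the parameter dictionary and names are illustrative):

```python
import numpy as np

def ema_update(shadow, params, decay=0.999):
    # Exponential moving average of generator weights: the shadow copy used
    # for generating samples trails the live training parameters.
    return {k: decay * shadow[k] + (1.0 - decay) * params[k] for k in params}

shadow = {"w": np.zeros(3)}   # averaged copy, used at inference time
params = {"w": np.ones(3)}    # live weights after an optimizer step
shadow = ema_update(shadow, params)   # each entry moves 0.1% toward the live value
```

With a decay of 0.999 the shadow weights change very slowly, which smooths out epoch-to-epoch oscillations in the generated samples.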
All the code for reproducing our work is made available for research purposes at https://github.com/akanimax/BMSG-GAN.
It has been observed [24, 12, 21, 20] in the community, and has also been our experience, that convergence of GANs during training is very heavily dependent on the choice of hyperparameters, in particular the choice of learning rate. To validate the robustness of MSG-GAN against hyperparameter changes, we trained our network with the same architecture and hyperparameters, varying only the learning rate (0.001, 0.005 and 0.01), on the CIFAR-10 dataset (Fig. 6). We can see that all three models converge, producing sensible images and similar inception scores, even with large changes in learning rate. This result indicates that the structural changes made by MSG-GAN may be an important tool and a step forward in obtaining consistently stable and trainable GANs.
Table 1 shows quantitative results for our method on different datasets. Importantly, we use the same loss, architecture, and hyperparameter settings for all three datasets. We can see that we obtain more realistic results on CIFAR10 than those of LR-GAN [30] and GMAN [6] in terms of Inception Score (IS). Also, while we were not able to match the very impressive reported inception score of ProGANs [13], we show through a constrained experiment that the model with multi-scale gradients achieves a better inception score (with stable increase) than a model with all the same settings except progressive growing (implemented ourselves) instead of MSG, as shown in Figure 6. In order to ensure that all factors are identical, we used our own PyTorch implementation of ProGANs, which we have made available as open source (https://github.com/akanimax/pro_gan_pytorch). After each progressive growing step, we observe a large change in inception score as new parameters are introduced. We note that the StackGAN works [33, 32] are conditioned on textual descriptions, while our method is conditioned only on the random latent vector. However, what is interesting is that our results are quite close to the inception score of the original flower images, which is a theoretical upper bound on this dataset (Tab. 1).
For CelebA-HQ, we again cannot match the quality of ProGANs; however, we note that our method is one of the few that is able to generate results at this resolution at all. One possible reason for the discrepancy in the final score is the amount of training resources used in the original ProGANs work, which we were unable to match in our experiments. While other recent works like BigGAN [3] have shown that outstanding results can be achieved when sufficient computational power is available, we have shown that our method makes GANs easier and more stable to train under limited resources, and it might also be able to improve the upper bound obtainable with unlimited resources. We discuss this observation further in the limitations section.
For comparing the stability of MSG-GAN with ProGANs, we measure the changes made by the generator to the generated samples for fixed latent points during training. This was introduced in ProGANs [13] by displaying the generated samples during training. We quantify this by calculating the MSE (mean squared error) between two consecutive samples during training. Figure 4 shows that while ProGANs tends towards convergence (making fewer changes) only at the lower resolutions, MSG-GAN shows the same convergence trait at all resolutions. The training epochs for ProGANs take place in sequence over each resolution, whereas for MSG-GAN they are simultaneous.
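The stability metric just described can be sketched directly (a numpy sketch; the array shapes here, 36 samples at 64 x 64, are illustrative):

```python
import numpy as np

def stability_mse(samples_prev, samples_curr):
    # MSE between images generated from the same latent points at the
    # beginning of two consecutive epochs, averaged over the latent samples.
    per_sample = ((samples_curr - samples_prev) ** 2).mean(axis=(1, 2, 3))
    return per_sample.mean()

prev = np.zeros((36, 64, 64, 3))       # 36 fixed latent samples, epoch t
curr = np.full((36, 64, 64, 3), 0.1)   # the same latents, epoch t + 1
score = stability_mse(prev, curr)
```

A score trending toward zero over training indicates the generator has stopped revising its outputs for fixed latents, which is the convergence trait plotted in Figure 4.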
Figure 7 describes how the synchronization across the layers takes place, ultimately resulting in a coherent set of images synthesized at multiple resolutions. This matches human understanding of image formation, and allows users to gain an intuition as to whether their task is working or not, which is useful for debugging and visualizing GAN training. As ProGANs does not use all layers throughout training, we observe that its output can exhibit some color variability (Fig. 3).
Finally, one of the benefits of the MSG-GAN technique is that the images generated at higher resolutions maintain symmetry of certain features, such as the same color for both eyes, or earrings in both ears. These high level structures are challenging to generate fully symmetrically with ProGANs [13] because the supervision is applied at a single scale only. However, with MSG-GAN, consistency is enforced between the various resolutions of the synthesized images, and large scale structures fall out naturally.
Our method has some limitations. By dropping progressive growing, we have simplified the training process, and show that we can converge stably across scales (Fig. 4) towards more realistic images (Fig. 6) in the same number of epochs (with all other settings held constant). However, our implementation of progressive growing has not been able to replicate the quality of results in the original paper. On CelebA-HQ, MSG-GAN (which outperforms our own progressive growing method) has reached an FID score of 14.86, while the original authors report an FID score of 7.30 (lower is better).
We hypothesize that this difference is likely due to hardware availability. All our experiments were conducted on 2 Tesla V100 GPUs, while the original implementation was run on 8 Tesla V100 GPUs. Recent work such as BigGAN [3] has shown that significant quality improvements can come from increasing the batch size by parallelizing across many GPUs. At the highest resolution, we are limited to a batch size of 8, while ProGANs uses a batch size of 32. However, we feel that our approach, while not improving the state-of-the-art FID scores on CelebA-HQ, introduces a new idea that has the potential to help improve result quality when used in conjunction with large hardware resources.
We also note that our method primarily addresses stable multi-scale image generation. However, in order to increase sample variance, MSG-GAN still needs to leverage techniques such as Minibatch Discrimination [24] or MinibatchStdDev [13]. As such, we observe challenges with modelling diversity in pose, expression, etc. in generated results. However, as improvements are made in techniques for increasing sample variance, our method could adopt them as well.
Although huge strides have been made towards solving the problem of photo-realistic high resolution image synthesis, true photo-realism that is indistinguishable from real images is yet to be achieved. We presented our MSG-GAN technique, which contributes to these efforts with a simple approach that enables true multi-scale image generation, and showed convergence on high resolution images.
We would like to thank Alexia Jolicoeur-Martineau (Ph.D. student at MILA) for her guidance and expertise on relativistic GANs. We also acknowledge Mobiliya for sponsoring our work and providing 2 GPUs of a DGX-1 machine for experimentation.
-  M. Arjovsky and L. Bottou. Towards principled methods for training generative adversarial networks. CoRR, abs/1701.04862, 2017.
-  M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 214–223, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
-  A. Brock, J. Donahue, and K. Simonyan. Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
-  L. Dinh, D. Krueger, and Y. Bengio. NICE: non-linear independent components estimation. CoRR, abs/1410.8516, 2014.
-  L. Dinh, J. Sohl-Dickstein, and S. Bengio. Density estimation using real NVP. CoRR, abs/1605.08803, 2016.
-  I. P. Durugkar, I. Gemp, and S. Mahadevan. Generative multi-adversarial networks. CoRR, abs/1611.01673, 2016.
-  A. Ghosh, V. Kulharia, V. P. Namboodiri, P. H. Torr, and P. K. Dokania. Multi-agent diverse generative adversarial networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
-  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
-  I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of wasserstein gans. In Advances in Neural Information Processing Systems, pages 5767–5777, 2017.
-  M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626–6637, 2017.
-  A. Jolicoeur-Martineau. Gans beyond divergence minimization. CoRR, abs/1809.02145, 2018.
-  A. Jolicoeur-Martineau. The relativistic discriminator: a key element missing from standard GAN. CoRR, abs/1807.00734, 2018.
-  T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In International Conference on Learning Representations, 2018.
-  T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. CoRR, abs/1812.04948, 2018.
-  D. P. Kingma and P. Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pages 10236–10245, 2018.
-  N. Kodali, J. Abernethy, J. Hays, and Z. Kira. On convergence and stability of gans. arXiv preprint arXiv:1705.07215, 2017.
-  S. Lefebvre and H. Hoppe. Parallel controllable texture synthesis. In ACM Transactions on Graphics (ToG), volume 24, pages 777–786. ACM, 2005.
-  Z. Lin, A. Khetan, G. Fanti, and S. Oh. Pacgan: The power of two samples in generative adversarial networks. In Advances in Neural Information Processing Systems, pages 1505–1514, 2018.
-  X. Mao, Q. Li, H. Xie, R. Y. K. Lau, and Z. Wang. Multi-class generative adversarial networks with the L2 loss function. CoRR, abs/1611.04076, 2016.
-  L. Mescheder, A. Geiger, and S. Nowozin. Which training methods for gans do actually converge?, 2018.
-  L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein. Unrolled generative adversarial networks, 2016.
-  X. B. Peng, A. Kanazawa, S. Toyer, P. Abbeel, and S. Levine. Variational discriminator bottleneck: Improving imitation learning, inverse RL, and GANs by constraining information flow. In International Conference on Learning Representations, 2019.
-  D. J. Rezende and S. Mohamed. Variational inference with normalizing flows. In Proceedings of the 32Nd International Conference on International Conference on Machine Learning - Volume 37, ICML’15, pages 1530–1538. JMLR.org, 2015.
-  T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training gans. In Advances in neural information processing systems, pages 2234–2242, 2016.
-  T. Salimans, A. Karpathy, X. Chen, and D. P. Kingma. PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. In ICLR, 2017.
-  A. Van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves, et al. Conditional image generation with pixelcnn decoders. In Advances in Neural Information Processing Systems, pages 4790–4798, 2016.
-  A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural networks. CoRR, abs/1601.06759, 2016.
-  R. Wang, A. Cully, H. J. Chang, and Y. Demiris. MAGAN: margin adaptation for generative adversarial networks. CoRR, abs/1704.03817, 2017.
-  Y. Wexler, E. Shechtman, and M. Irani. Space-time completion of video. IEEE Transactions on Pattern Analysis & Machine Intelligence, 29(3):463–476, 2007.
-  J. Yang, A. Kannan, D. Batra, and D. Parikh. LR-GAN: layered recursive generative adversarial networks for image generation. CoRR, abs/1703.01560, 2017.
-  Y. Yazıcı, C.-S. Foo, S. Winkler, K.-H. Yap, G. Piliouras, and V. Chandrasekhar. The unusual effectiveness of averaging in GAN training. In International Conference on Learning Representations, 2019.
-  H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas. Stackgan++: Realistic image synthesis with stacked generative adversarial networks. CoRR, abs/1710.10916, 2017.
-  H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
-  J.-Y. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shechtman. Toward multimodal image-to-image translation. In Advances in Neural Information Processing Systems, pages 465–476, 2017.