MSG-GAN: Multi-Scale Gradients GAN for more stable and synchronized multi-scale image synthesis

03/14/2019 ∙ by Animesh Karnewar, et al. ∙ Mobiliya

The Generative Adversarial Network (GAN), which is widely used for image synthesis via generative modelling, suffers from training instability. One of the known reasons for this instability is the passage of uninformative gradients from the Discriminator to the Generator due to a learning imbalance between them during training. In this work, we propose the Multi-Scale Gradients Generative Adversarial Network (MSG-GAN), a simple but effective technique for addressing this problem by allowing the flow of gradients from the Discriminator to the Generator at multiple scales. This results in the Generator acquiring the ability to synthesize synchronized images at multiple resolutions simultaneously. We also highlight a suite of techniques that together buttress the stability of training without excessive hyperparameter tuning. Our MSG-GAN technique is a generic mathematical framework which has multiple instantiations. We present an intuitive form of this technique which uses the concatenation operation in the Discriminator computations, and empirically validate it through experiments on the CelebA-HQ, CIFAR10 and Oxford102 flowers datasets and by comparing it with some of the current state-of-the-art techniques.




1 Introduction and Related Work

Figure 1: Images synthesized at various resolutions using our proposed MSG-GAN technique. With this method, image generation is stable across all resolutions. This allows for an understanding of image properties (e.g., quality and diversity) during training. The resolution doubles from each row to the next, from the bottom row towards the top.
Figure 2: Architecture of MSG-GAN for generating synchronized multi-scale images. Our method is based on the architecture proposed in [13], but instead of a progressively growing training scheme, includes connections from the intermediate layers of the generator to the intermediate layers of the discriminator. The multi-scale images input to the discriminator are converted into spatial volumes which are concatenated with the corresponding activation volumes obtained from the main path of convolutional layers.

Since their introduction by Goodfellow et al. [8], Generative Adversarial Networks (GANs) have become the de facto standard for high quality image synthesis. The success of GANs comes from the fact that they do not require manually designed loss functions for optimization, and can therefore learn to generate complex data distributions without the need to be able to explicitly define them. While flow based models such as [4, 5, 23, 15] and autoregressive models such as [27, 26, 25] allow training generative models directly using Maximum Likelihood Estimation (explicitly and implicitly respectively), the fidelity of the images generated has not yet been able to match that of the state-of-the-art GAN models [13, 14]. However, GAN training suffers from two prominent problems: (1) mode collapse and (2) training instability. The problem of mode collapse occurs when the generator network is only able to capture a subset of the variance present in the data distribution. Numerous works [24, 34, 13, 18] have been proposed to address this problem.

In this work, we address the problem of training instability. This is a fundamental issue with GANs, and has been widely reported by previous works [24, 19, 2, 9, 16, 28, 12, 13, 31, 22]. One theory for the instability is that the generator is not actually able to minimize the Jensen-Shannon divergence between the generated data distribution and the original data distribution [11]. The main reason is that an optimal discriminator cannot exist in practice: the discriminator can only become optimal if it is trained against all possible generators (this infinite set includes the generators ranging from worst performing to the best performing ones), while such a set of all possible generators cannot exist unless there is an optimal discriminator. Thus, the Nash equilibrium for a GAN is not the same as the one for divergence minimization.

We propose a method to address training instability for the task of image generation, by investigating how gradients at multiple scales can be used to generate high resolution images (typically more challenging due to the dimensionality of the space) without relying on previous greedy approaches, such as the progressive growing technique [13]. Our MSG-GAN technique allows the discriminator to look at not only the final output (highest resolution) of the generator, but also at the outputs of the intermediate layers (Fig.  2). As a result, the discriminator becomes a function of multiple scale outputs of the generator and importantly, passes gradients to all the scales simultaneously (more details in section 2 and section 3). In summary, we present the following contributions:

  1. We propose the Multi-Scale Gradient GAN (MSG-GAN) technique for image synthesis which allows gradients to flow from the discriminator to the generator at multiple scales. MSG-GAN improves the stability of training as defined in prior work [31], where the variance of images generated from fixed latent points during training is quantified.

  2. We empirically validate the approach and show that we are able to robustly generate high quality samples on a number of commonly used datasets, including CIFAR10, Oxford102 flowers, and even high resolution results on CelebA-HQ, without any hyperparameter tuning.

2 Motivation

Arjovsky and Bottou [1] pointed out that one of the reasons for the training instability of GANs is the passage of random (uninformative) gradients from the discriminator to the generator when there is insubstantial overlap between the supports of the real and fake distributions. Two recent works focus on alleviating this problem, viz. ‘Variational Discriminator Bottleneck’ (VDB) [22] and ‘Progressive Growing of GANs’ (ProGANs) [13]. The VDB’s solution is to apply a mutual information bottleneck between the input images to the discriminator and its deepest representation of them. This forces the discriminator to focus only on the most discerning features of the images for classification into real and fake images. Our work is orthogonal to the VDB technique, and we leave an investigation into a combination of MSG-GAN and VDB to future work.

The ProGANs technique tackles the instability problem by training the GAN layer-by-layer, progressively doubling the generated image resolution. At every point in training, all the previously trained layers are trainable, and whenever a new layer is added to the training it is slowly faded in such that the learning of the previous layers is retained. Intuitively, this technique helps with the instability problem because, for every new layer added, the network learns finer details while retaining the low frequency details previously learnt, thus progressively generating sensible and higher resolution images. Learning and retaining the low frequency details ensures that the distribution of generated images tends to be closer to that of real images, which reduces the chances of instability during training.
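The fade-in step described above can be illustrated with a minimal sketch. It assumes the common linear-blend formulation, in which the output of the newly added higher-resolution layer is mixed with the upsampled output of the previous layers using a weight `alpha` that ramps from 0 to 1. The function names, the nearest-neighbour upsampling and the single-channel nested-list images are illustrative assumptions, not the ProGANs implementation.

```python
def upsample_nearest(img):
    """Double the resolution of a 2D image (list of rows) by pixel repetition."""
    out = []
    for row in img:
        wide = [p for p in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def fade_in(old_lowres, new_highres, alpha):
    """Blend the upsampled old output with the new layer's output.

    alpha = 0 reproduces the old (upsampled) image; alpha = 1 uses only
    the output of the new layer.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    up = upsample_nearest(old_lowres)
    return [
        [(1.0 - alpha) * u + alpha * n for u, n in zip(up_row, new_row)]
        for up_row, new_row in zip(up, new_highres)
    ]


old = [[0.0, 1.0], [1.0, 0.0]]        # 2 x 2 image from the already trained layers
new = [[0.5] * 4 for _ in range(4)]   # 4 x 4 image from the newly added layer
blended = fade_in(old, new, alpha=0.25)
```

In a scheme of this kind, resuming an interrupted run requires restoring `alpha` together with the weights of both branches.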

While this approach is able to generate state-of-the-art results, we found it to be somewhat difficult to use, due to its sensitivity to hyperparameter changes during training in several places. We describe some points that we observed which motivated us to develop the MSG-GAN technique:

Figure 3: Images synthesized at various resolutions by a fully trained Progressively Grown GAN model. These images are counter-intuitively not colour and brightness synchronized across the resolutions, although they preserve certain similarities in structure.
  • The model trained on lower resolution doesn’t produce the same images after subsequent higher resolution layers are trained (see Fig. 3). This means that it can be hard to judge how well the method is working until it has converged at the highest resolution. Another effect is that the parameters for the 1 x 1 conv layers used for generating images at intermediate resolutions during their training are essentially “wasted” in the end.

  • The progressive growing training scheme introduces complexity in the form of extra parameters and training steps that have to be carefully managed. One example of this is that if there is an interruption in training during fade-in iterations, the exact state of the blending, including both sets of weights and the blending value, needs to be considered for resuming training. Another is that according to ProGANs, an equal number of iterations should be spent for all resolutions. However, in our experiments with ProGANs, we found that with this schedule there are often unnecessary iterations at the lowest resolutions, while the same number of iterations is not enough for the higher resolutions. We found that there exists an optimal training schedule for ProGANs which is task specific, and which requires empirical analysis as well as constant supervision to derive. This can be hard when considering the previous point (higher variance in result quality during training).

These challenges motivated us to find an alternate solution for generating high resolution images that does not rely on the progressive blending step. The idea of multi-scale image generation is a well established technique, with methods existing well before deep networks became popular for this task [17, 29]. More recently, a number of GAN-based methods break the process of high resolution image synthesis into smaller subtasks [30, 33, 32, 6, 7, 13]. For example, LR-GAN [30] uses separate generators for synthesizing the background, foreground and compositor masks for the final image. We use a similar multi-resolution image synthesis paradigm; however, our technique is different from StackGAN and GMAN [33, 32, 6], which use multiple separate discriminators for a single generator, and also from MAD-GAN [7], which uses multiple generators and a single discriminator. We instead make multiple connections from a single generator to a single discriminator at intermediate layers.

This approach has several advantages, largely driven by the simplicity of the proposed approach. If multiple discriminators are used at each resolution [33, 6], the number of total parameters grows exponentially across scales, as repeated downsampling layers are needed, whereas in MSG-GAN the relationship is linear. Besides having fewer parameters and design choices required, another important advantage of our approach is that it avoids the need for an explicit color consistency regularization term across images generated at multiple scales, which was necessary in prior approaches  [32]. In MSG-GAN, the discriminator’s multi-scale layers are composed with each other, enforcing that images are coherent across scale space.

3 Multi-Scale Gradient GAN

Figure 2 shows an overview of our architecture, and we provide a mathematical definition of the MSG-GAN framework in this section. Our base model starts with a ProGANs [13] architecture. Let the initial block of the generator function be defined as $g_{gen} : Z \mapsto A_0$, such that $Z = \mathbb{R}^{512}$, and $A_0$ contains 4 x 4 RGB images. Let the function $g^i$ be a generic function which acts as the basic generator unit of the GAN. In our implementation, $g^i$ is modelled as a block containing upsampling followed by conv layers. The generator of our GAN follows the standard format, and can be defined as a composition of $g_{gen}$ with $k$ of such $g$ functions:

$$a_k = g^k \circ g^{k-1} \circ \dots \circ g^1 \circ g_{gen}(z)$$

We now define the function $r$ which generates the output at different stages of the generator (red blocks in Fig. 2), where the output corresponds to different downsampled versions of the final output resolution. The function $r$ can be simply modelled using a (1 x 1) convolution which converts the intermediate convolutional activation volume into images.

$$o_i = r^i(a_i), \quad a_i = g^i(a_{i-1}), \quad a_0 = g_{gen}(z), \quad i = 1, \dots, k$$

In other words, $o_i$ is an image synthesized from the output $a_i$ of the $i$-th intermediate layer in the generator computations, and the final output of the generator is $y' = o_k$. Similar to ProGANs [13], $r$ acts as a regularizer, requiring that the learned feature maps are able to be projected directly into RGB space.

Because the discriminator’s final critic loss is a function of not only the final output of the generator ($y'$ or $o_k$), but also the intermediate outputs of the generator $o_i$, it causes gradients to flow from the intermediate layers of the discriminator to the intermediate layers of the generator. We denote all the components of the discriminator function with the letter $d$. Let $d_{critic}$ be the final layer of the discriminator which provides the critic score. Let $d^0(y)$ or $d^0(y')$ be the function which defines the first layer of the discriminator, which takes the real image $y$ (true sample) or the highest resolution synthesized image $y'$ (fake sample) as the input. Let $d^j$ be a similar function which represents the $j$-th intermediate layer of the discriminator. Thus, the output activation volume $a'_j$ of any intermediate layer of the discriminator can be written as a function that is defined as:

$$a'_j = d^j\big(\phi(r'^{\,j}(o_{k-j}),\, a'_{j-1})\big), \quad a'_0 = d^0(y'), \quad j = 1, \dots, k$$

where $r'$ emulates the inverse of $r$ (not a true inverse), i.e. it converts the intermediate input image to an activation volume, and $\phi$ combines the output of $r'$ with the corresponding intermediate activation volume in the discriminator computations. The complete discriminator function can then be defined as:

$$DIS(y', o_{k-1}, \dots, o_0) = d_{critic}(a'_k)$$

While $\phi$ can be modelled using either parametric or non-parametric families of functions, we found that defining $\phi$ as a concatenation operation worked well in practice,

$$\phi(x_1, x_2) = [x_1 ; x_2]$$

where $[\,\cdot\,;\,\cdot\,]$ denotes channelwise concatenation. In section 4 we go into details about the performance of the architecture as evaluated on various image-synthesis datasets.
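To make the data flow concrete, the following sketch traces tensor shapes through a multi-scale generator and discriminator when the combine function is channelwise concatenation. It is a bookkeeping exercise under stated assumptions, not the authors' implementation: the function names, the channel width of 512 for every block, and the choice to map each intermediate image to a volume with the same number of channels as the main path are all hypothetical.

```python
def generator_outputs(k, base=4, channels=512):
    """Shapes (H, W, C) of the activations and RGB images at each of the k + 1 scales."""
    scales = []
    for i in range(k + 1):
        res = base * 2 ** i                      # resolution doubles with every block
        scales.append({
            "activation": (res, res, channels),  # output of the i-th generator block
            "image": (res, res, 3),              # after the 1 x 1 convolution to RGB
        })
    return scales


def discriminator_inputs(scales, channels=512):
    """Shapes entering each discriminator layer, from highest to lowest resolution."""
    shapes = []
    for j, scale in enumerate(reversed(scales)):
        res = scale["image"][0]
        if j == 0:
            # The first layer sees only the highest-resolution image.
            shapes.append((res, res, 3))
        else:
            # Intermediate image -> activation volume (assumed `channels` wide),
            # concatenated channelwise with the main-path activations.
            from_image = channels
            from_main_path = channels
            shapes.append((res, res, from_image + from_main_path))
    return shapes


scales = generator_outputs(k=3)           # 4, 8, 16 and 32 pixels per side
inputs = discriminator_inputs(scales)
for shape in inputs:
    print(shape)
```

With `k=3` the highest resolution is 32 x 32, matching the CIFAR10 setting. Concatenation only changes the channel count of the volume entering each discriminator layer, so a single discriminator path serves all scales.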

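We experimented on CIFAR-10 with different loss functions, and found that the relativistic [12] version of the HingeGAN [28] critic loss worked well in practice. Writing $D$ for the critic score, $x_r$ for real samples and $x_f$ for generated samples, the Relativistic HingeGAN loss functions for generator and discriminator are defined as:

$$L_D = \mathbb{E}_{x_r}\big[\max(0,\, 1 - \tilde{D}(x_r))\big] + \mathbb{E}_{x_f}\big[\max(0,\, 1 + \tilde{D}(x_f))\big]$$

$$L_G = \mathbb{E}_{x_r}\big[\max(0,\, 1 + \tilde{D}(x_r))\big] + \mathbb{E}_{x_f}\big[\max(0,\, 1 - \tilde{D}(x_f))\big]$$

where $\tilde{D}(x_r) = D(x_r) - \mathbb{E}_{x_f}[D(x_f)]$ and $\tilde{D}(x_f) = D(x_f) - \mathbb{E}_{x_r}[D(x_r)]$.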
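As an illustration, the sketch below computes a relativistic hinge loss on plain lists of critic scores. It assumes the relativistic average formulation from [12], in which each score is compared against the mean score of the opposite sample type; the function names are hypothetical, and a real training loop would operate on framework tensors rather than Python lists.

```python
def mean(values):
    return sum(values) / len(values)


def relativistic_hinge_losses(real_scores, fake_scores):
    """Return (discriminator_loss, generator_loss) for two batches of critic scores."""
    mean_real = mean(real_scores)
    mean_fake = mean(fake_scores)
    # Relativistic scores: how much more realistic a sample looks than the
    # average sample of the opposite type.
    real_rel = [r - mean_fake for r in real_scores]
    fake_rel = [f - mean_real for f in fake_scores]

    d_loss = (mean([max(0.0, 1.0 - r) for r in real_rel])
              + mean([max(0.0, 1.0 + f) for f in fake_rel]))
    g_loss = (mean([max(0.0, 1.0 + r) for r in real_rel])
              + mean([max(0.0, 1.0 - f) for f in fake_rel]))
    return d_loss, g_loss


# A critic that separates real from fake by a wide margin:
d_loss, g_loss = relativistic_hinge_losses([2.0, 2.0], [-2.0, -2.0])
```

The two losses mirror each other: the discriminator is penalized when real samples do not beat the average fake by a margin of 1, and the generator is penalized for the reverse.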


4 Experiments

Figure 4: Image stability during training. These plots show the MSE between images generated from the same latent sample at the beginning of sequential epochs (averaged over 36 latent samples). This distance quantifies the amount of changes made by the generator in the generated images during training. MSG-GAN converges to stable images over time while ProGANs [13] continues to vary significantly across epochs. We show a subset of all training resolutions for clarity, but note that the rest of the resolutions also follow the same trend.

While evaluating the quality of GAN generated images is a non-trivial task, the most commonly used metrics today are the Inception Score (IS) [24] and Fréchet Inception Distance (FID) [10]. In line with previous works, we use the IS for the CIFAR10 and Oxford102 flowers experiments.
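For reference, the Inception Score is the exponential of the average KL divergence between the per-image class distribution p(y|x) and the marginal class distribution p(y). The sketch below computes it from class probabilities that are taken as given; in practice these come from a pretrained Inception classifier applied to the generated images, which is not part of this example. The function name is an assumption.

```python
import math


def inception_score(class_probs):
    """class_probs: one probability vector p(y|x) per generated image."""
    n = len(class_probs)
    num_classes = len(class_probs[0])
    # Marginal class distribution p(y), averaged over all images.
    marginal = [sum(p[c] for p in class_probs) / n for c in range(num_classes)]
    kl_sum = 0.0
    for p in class_probs:
        # KL(p(y|x) || p(y)); terms with zero probability contribute nothing.
        kl_sum += sum(pc * math.log(pc / mc)
                      for pc, mc in zip(p, marginal) if pc > 0)
    return math.exp(kl_sum / n)


confident_and_diverse = inception_score([[1.0, 0.0], [0.0, 1.0]])
uninformative = inception_score([[0.5, 0.5], [0.5, 0.5]])
```

The score is highest when each image is classified confidently and the classes are used evenly, and it drops to 1 when the classifier's predictions are identical for every image.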

4.1 Implementation Details

We evaluate on CIFAR10 (at 32 x 32 resolution), Oxford flowers (at 256 x 256 resolution), and CelebA-HQ (at 1024 x 1024 resolution). These datasets were selected specifically to show that the MSG-GAN technique can be used for a variety of image settings without any changes. For all three datasets, we use the same initial latent dimensionality of 512, with latent vectors drawn from a standard normal distribution. The architecture of the model used for the CelebA-HQ experiment has the exact same structure as the model described in ProGANs [13]. The models used on the other datasets are identical, except with fewer upsampling layers. Full details about all the model architectures are provided in the supplementary material.

All the models were trained with RMSprop and a learning rate of 0.001 for both generator and discriminator. Note that we do not use any specialized initialization technique and initialize all the parameters of both generator and discriminator according to the standard normal distribution. Parameters are scaled according to the equalized learning rate setting of ProGANs [13] at runtime during training and inference. For the EMA technique [31], we use the standard decay value equal to 0.999.
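Two of the ingredients mentioned above are simple enough to sketch. The example assumes the usual forms of each: equalized learning rate as a per-layer runtime scale of sqrt(2 / fan_in) applied to weights stored with unit variance, and EMA as an exponentially weighted running copy of the generator weights. Function names and the dictionary-of-floats parameter representation are illustrative assumptions.

```python
import math


def equalized_lr_scale(fan_in):
    """Runtime scale applied to weights that were initialized from N(0, 1)."""
    return math.sqrt(2.0 / fan_in)


def ema_update(shadow, params, decay=0.999):
    """Move the averaged (shadow) copy of the weights towards the current weights."""
    for name, value in params.items():
        shadow[name] = decay * shadow[name] + (1.0 - decay) * value
    return shadow


# A 3 x 3 convolution with 512 input channels has fan_in = 3 * 3 * 512.
scale = equalized_lr_scale(3 * 3 * 512)

shadow = {"w": 0.0}
for _ in range(3):
    ema_update(shadow, {"w": 1.0})   # current generator weight stays at 1.0
```

With a decay of 0.999 the shadow weights change slowly, so samples drawn from the averaged generator vary less between iterations than samples from the raw generator.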

All the code for reproducing our work is made available for research purposes at

Figure 5: Random samples generated by MSG-GAN on different datasets. (a), (b) and (c) are uncurated highest resolution samples produced for the CIFAR10, Oxford102 flowers and CelebA-HQ datasets respectively. Our approach generates high quality results across all datasets with the same training settings.

4.2 Results

Figure 6: Inception Score across training epochs for CIFAR-10. We can see that MSG-GAN a) is robust to changes in learning rate, and b) shows a more consistent increase in image quality when compared to progressive growing. For this plot we use our own implementation of progressive growing, to keep the rest of the variables fixed and isolate the contribution of each component. The progressive growing model was trained with lr=0.001.


It has been observed in the community [24, 12, 21, 20], and has also been our experience, that convergence of GANs during training is very heavily dependent on the choice of hyperparameters, in particular the choice of learning rate. To validate the robustness of MSG-GAN against hyperparameter changes, we trained our network on the CIFAR-10 dataset with the same architecture and hyperparameters, varying only the learning rate across three values (0.001, 0.005 and 0.01) (Fig. 6). We can see that all three models converge, producing sensible images and similar inception scores, even with large changes in learning rate. This result indicates that the structural changes to the network made by MSG-GAN may be an important tool and a step forward in obtaining consistently stable and trainable GANs.

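Dataset              Method              Inception Score
-------------------  ------------------  ---------------
CIFAR10              Real Images         11.34
CIFAR10              ProGANs [13]        8.80
CIFAR10              MSG-GAN (ours)      7.96
CIFAR10              LR-GAN [30]         7.17
CIFAR10              ImprovedGAN [24]    6.86
CIFAR10              GMAN [6]            6.00
Oxford102 flowers    Real Images         3.79
Oxford102 flowers    MSG-GAN (ours)      3.56
Oxford102 flowers    StackGAN++ [32]     3.26
Oxford102 flowers    StackGAN [33]       3.20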

Table 1: Results of our experiments. See paper text for discussion.
Figure 7: During training all the layers in the MSG-GAN first synchronize colorwise and subsequently improve the generated images at various scales. The brightness of the images across all layers (scales) synchronizes eventually.


Table 1 shows quantitative results for our method on different datasets. Importantly, we use the same loss, architecture, and hyper-parameter settings for all three datasets. We can see that we obtain more realistic results on CIFAR10 than those of LR-GAN [30] and GMAN [6] in terms of Inception Score (IS). Also, while we were not able to match the very impressive reported inception score of ProGANs [13], we show through a constrained experiment that the model with multi-scale gradients achieves a better inception score (with stable increase) than a model with all the same settings except for progressive growing (implemented ourselves) instead of MSG, as shown in Figure 6. In order to ensure that all factors are identical, we have used our own PyTorch implementation of ProGANs, which we have made available as open source. After each progressive growing step, we observe a large change in inception score as new parameters are introduced.

When we compare the IS we obtain on the Oxford102 flowers dataset, we see that our method performs better than the previous multi-scale GAN generation approaches StackGAN and StackGAN++ [33, 32]. We note that the StackGAN works are conditioned on textual descriptions, while our method is conditioned only on the random latent vector. However, what is interesting is that our results are quite close to the inception score for the original flower images, which is a theoretical upper bound on this dataset (Tab. 1).

For CelebA-HQ, we again cannot match the quality of ProGANs; however, we note that our method is one of the few that are able to generate results on this dataset size at all. One possible reason for the discrepancy in final inception score is the amount of training resources that were used in the original ProGANs work, which we were unable to match in our experiments. While other recent works like BigGAN [3] have shown that outstanding results can be achieved when sufficient computational power is available, we have shown that our method makes GANs easier and more stable to train under limited resources, and might be able to improve the upper bound obtainable with unlimited resources. We discuss this observation further in the limitations section.


To compare the stability of MSG-GAN with ProGANs, we measure the changes made by the generator to the generated samples for fixed latent points during training. This was introduced in [31] by displaying the generated samples during the training. We quantify this by calculating the MSE (mean squared error) between two consecutive samples during training. Figure 4 shows that while ProGANs tends towards convergence (making fewer changes) for lower resolutions only, MSG-GAN shows the same convergence trait for all the resolutions. The training epochs for ProGANs take place in sequence over each resolution, whereas for MSG-GAN they are simultaneous.
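The stability measurement can be sketched as follows. The example assumes that images for a fixed set of latent samples have been saved at the start of every epoch and flattened into lists of pixel values; the function names and data layout are hypothetical.

```python
def mse(img_a, img_b):
    """Mean squared error between two images given as flat lists of pixel values."""
    if len(img_a) != len(img_b):
        raise ValueError("images must have the same number of pixels")
    return sum((a - b) ** 2 for a, b in zip(img_a, img_b)) / len(img_a)


def stability_curve(snapshots):
    """snapshots[e][s]: image for fixed latent sample s at the start of epoch e.

    Returns one value per pair of consecutive epochs: the MSE between the
    images of the same latent sample, averaged over all latent samples.
    """
    curve = []
    for prev, curr in zip(snapshots, snapshots[1:]):
        per_sample = [mse(a, b) for a, b in zip(prev, curr)]
        curve.append(sum(per_sample) / len(per_sample))
    return curve


# Two latent samples, two pixels each, three epochs.
snapshots = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[1.0, 1.0], [1.0, 1.0]],
    [[1.0, 1.0], [1.0, 1.0]],
]
curve = stability_curve(snapshots)
```

A curve that decays towards zero indicates that the generator has stopped changing its output for the fixed latent points, which is the convergence behaviour plotted per resolution in Figure 4.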

5 Discussion

Figure 7 describes how the synchronization across the layers takes place, ultimately resulting in a coherent set of images synthesized at multiple resolutions. This matches human understanding of image formation, and allows users to gain an intuition as to whether their task is working or not, which is useful for debugging and visualizing GAN training. As ProGANs does not use all networks throughout training, we observe that its output can exhibit some color variability across resolutions (Fig. 3).

Finally, one of the benefits of the MSG-GAN technique is that the images generated at higher resolution maintain symmetry of certain features, such as same color for both eyes, or earrings in both ears. These high level structures are challenging to generate fully symmetrically with ProGANs [13] because the supervision is applied at a single scale only. However, with MSG-GAN, consistency is enforced between various resolutions of the synthesized images, and large scale structures fall out naturally.


Our method has some limitations. By dropping progressive growing, we have simplified the training process, and shown that we can converge stably across scales (Fig. 4) towards more realistic images (Fig. 6) in the same number of epochs (with all other settings held constant). However, our implementation of progressive growing has not been able to replicate the quality of results in the original paper. On CelebA-HQ, MSG-GAN (which outperforms our own progressive growing method) has reached an FID score of 14.86, while the original authors report an FID score of 7.30 (lower is better).

We hypothesize that this difference is likely due to hardware availability. All our experiments were conducted on 2 Tesla V100 GPUs, while the original implementation was run on 8 Tesla V100 GPUs. Recent work such as BigGAN [3] has shown that significant quality improvements can come from increasing the batch size by parallelizing across many GPUs. At the highest resolution, we are limited to a batch size of 8, while ProGANs uses a batch size of 32. However, we feel that our approach, while not improving the state-of-the-art FID scores on CelebA-HQ, introduces a new idea that has the potential to help improve result quality when used in conjunction with large hardware resources.

We also note that our method primarily addresses stable multi-scale image generation. However, in order to increase sample variance, MSG-GAN still needs to leverage techniques such as BatchDiscrimination [24] or MinBatchStdDev [13]. As such, we observe challenges with modeling diversity in pose, expression, etc. in generated results. However, as techniques for increasing sample variance improve, our method could adopt them as well.

6 Conclusion

Although huge strides have been made for solving the problem of photo-realistic high resolution image synthesis, true photo-realism which is indistinguishable from real images is yet to be achieved. We present our MSG-GAN technique which contributes to these efforts with a simple approach to enable true multi-scale image generation, and has shown convergence on high resolution images.

7 Acknowledgements

We would like to thank Alexia Jolicoeur-Martineau (Ph.D. student at MILA) for her guidance and expertise on Relativism in GANs. We also acknowledge Mobiliya for sponsoring our work and providing 2 GPUs of the DGX-1 machine for experimentation.