Creating High Resolution Images with a Latent Adversarial Generator

03/04/2020 ∙ by David Berthelot, et al. ∙ Google

Generating realistic images is difficult, and many formulations for this task have been proposed recently. If we restrict the task to that of generating a particular class of images, however, the task becomes more tractable. That is to say, instead of generating an arbitrary image as a sample from the manifold of natural images, we propose to sample images from a particular "subspace" of natural images, directed by a low-resolution image from the same subspace. The problem we address, while close to the formulation of the single-image super-resolution problem, is in fact rather different. Single image super-resolution is the task of predicting the image closest to the ground truth from a relatively low resolution image. We propose to produce samples of high resolution images given extremely small inputs with a new method called Latent Adversarial Generator (LAG). In our generative sampling framework, we only use the input (possibly of very low resolution) to direct what class of samples the network should produce. As such, the output of our algorithm is not a unique image that relates to the input, but rather a possible set of related images sampled from the manifold of natural images. Our method learns exclusively in the latent space of the adversary using a perceptual loss; it does not have a pixel loss.




1 Introduction

We are concerned with the task of generating high resolution (HR) images. In the context of super-resolution from a low resolution (LR) input, this task has been researched extensively, and the current best methods are based on deep learning (DL). In the deep learning setting, single image super-resolution is modeled as a regression problem: the neural network weights are optimized to minimize a loss representing the distance from the predicted image to the ground truth. This is not the aim of this paper. We are not concerned with generating an image that is close to the input (when down-sampled); rather, we wish to use the input as guidance toward generating a set of plausible high resolution images.

To this end, we need a different notion of closeness. Indeed, the concept of distance between images has recently evolved beyond the classical regression framework. Consider two very different applications. In [4], the authors introduced a new pairwise distance computed in a high level of abstraction space inferred from an inception classifier layer. This distance is associated with a content loss, which, when minimized, results in better modeling of the semantic content but also in visual artifacts. More recently, SRGAN [10] added a new distributional loss term by using a Generative Adversarial Network (GAN) [5]. This new loss term is used to capture the distributional properties of image patches which, in practice, results in sharper images.

For image reconstruction purposes (such as denoising and super-resolution), the various distances (pixel, content, distributional) are typically combined to form an overall loss that includes perceptual measures. Still more losses can be added, for example a texture loss [15] or a contextual loss [13], leading to various perception-distortion trade-offs [2]. The perceptual component of the loss is often manually adjusted against the other losses, which requires extensive tuning to find the right balance. This problem has been addressed in [18] by interpolating between the parameters of neural networks to manually find the right amount of sharpness; yet it is still not an automated process.

1.1 Contributions

In this paper, we propose a novel approach to generating high resolution images, guided by small inputs, that results in perceptually convincing details. To accomplish this, we break with the previous approaches and seek a single perceptual latent space that encompasses all the desirable properties of the resulting sampled images without manually setting weights.

Our proposed method, called Latent Adversarial Generator (LAG), aims to address the aforementioned fundamental limitations of existing techniques. We present the following contributions:

  • We model the input images as a set of possibilities rather than a single choice. This in effect models the manifold of (low-resolution) input images.

  • We learn a single perceptual latent space in which to describe distances between prediction and ground truth.

  • We analyze the relationship between conditional GANs and LAG.

2 Latent Adversarial Generator

Given a low resolution input image $y$, we want to predict the perceptual center $x$ of possible high resolution images. We propose to achieve this by modeling the possible choices as a random vector $z \sim \mathcal{N}(0, 1)$. In this model, a pair $(y, z)$ uniquely maps to a high resolution image $x_z$. We make the assumption that the high resolution image $x$ is obtained at the center of the normal distribution for $z$:

$$x = x_0, \quad \text{i.e. at } z = 0.$$

The function we train takes a pair $(y, z)$ and predicts a high resolution image $x_z$. We adopt the GANs terminology and call this function a generator; it has the following signature, where the superscripts denote the dimensions of the respective spaces:

$$G(y, z; \theta): \mathbb{R}^{y} \times \mathbb{R}^{z} \to \mathbb{R}^{x}$$


We design the critic function $C$ to judge whether a high resolution image $x$ corresponds to a low resolution image $y$. We propose to decompose the critic into two functions: the projection $P$ from an image to a latent space $\mathbb{R}^{p}$ and the mapping $F$ from this latent space to $\mathbb{R}$. We will refer to $\mathbb{R}^{p}$ as the perceptual latent space. Formally, we define the projection function $P$, parameterized by $\psi$, as:

$$P(x, y; \psi): \mathbb{R}^{x} \times \mathbb{R}^{y} \to \mathbb{R}^{p}$$

and the critic $C$, parameterized by $\psi$ and $\phi$, is the composition of $F$ and $P$:

$$C(x, y; \psi, \phi) = F(P(x, y; \psi); \phi)$$

The functions $G$, $F$ and $P$ are implemented using neural networks. For the remainder of this paper, for simplicity of notation, we omit the parameters $\theta$, $\phi$ and $\psi$ of these functions.
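To make the decomposition of the critic concrete, the following is a minimal sketch, assuming that both the projection and the scoring map are plain linear functions acting on flattened images (in the paper they are neural networks). The helper names `make_projection`, `make_scorer` and `make_critic` are hypothetical and are not taken from the released code.

```python
from typing import Callable, List

Vector = List[float]


def make_projection(weights: List[Vector]) -> Callable[[Vector, Vector], Vector]:
    """Toy projection: maps (high_res, low_res) to a latent vector with a linear map."""
    def project(high_res: Vector, low_res: Vector) -> Vector:
        features = high_res + low_res  # concatenate the two flattened images
        return [sum(w * f for w, f in zip(row, features)) for row in weights]
    return project


def make_scorer(weights: Vector) -> Callable[[Vector], float]:
    """Toy scoring map: sends a latent vector to a single real number."""
    def score(latent: Vector) -> float:
        return sum(w * v for w, v in zip(weights, latent))
    return score


def make_critic(project, score) -> Callable[[Vector, Vector], float]:
    """The critic is the composition of the scoring map with the projection."""
    def critic(high_res: Vector, low_res: Vector) -> float:
        return score(project(high_res, low_res))
    return critic


# Hypothetical toy sizes: 2 high-res pixels, 1 low-res pixel, 2 latent dimensions.
project = make_projection([[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]])
critic = make_critic(project, make_scorer([2.0, 1.0]))
latent = project([0.5, -0.5], [0.25])
score_value = critic([0.5, -0.5], [0.25])
print(latent, score_value)
```

Keeping the projection as a separate callable is the point of the sketch: the latent vector it returns can be reused for distance computations without going through the scoring map.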

2.1 Training

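Both the generator and the critic are learned adversarially by means of a minimax game. Which particular GAN loss is used is not essential for our mathematical formulation; it does, however, matter for the quality of the results. We decided to use the Wasserstein GAN objective [1] since it led to good visual results:

$$\min_{G} \max_{C \in \mathcal{L}_1} \; \mathbb{E}_{x, y}\left[C(x, y)\right] - \mathbb{E}_{y, z}\left[C(G(y, z), y)\right]$$

where $\mathcal{L}_1$ is the space of 1-Lipschitz functions. The constraint is implemented using a gradient penalty [6] loss for the critic:

$$\mathcal{L}_{gp} = \mathbb{E}_{\hat{x}}\left[\left(\lVert \nabla_{\hat{x}} C(\hat{x}, y) \rVert_2 - 1\right)^2\right]$$

where $\hat{x}$ is a uniformly sampled point between $x$ and $G(y, z)$. Formally, $\hat{x} = \epsilon x + (1 - \epsilon) G(y, z)$ for a random variable $\epsilon \sim U[0, 1]$.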
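As an illustration of the gradient penalty, the sketch below interpolates between a real and a generated image and penalizes the deviation of the critic's gradient norm from 1. It is a toy under stated assumptions: images are flat lists of floats, and the gradient is approximated with central finite differences, whereas a real implementation would rely on automatic differentiation. All function names are hypothetical.

```python
import math
import random
from typing import Callable, List

Vector = List[float]


def numerical_gradient(f: Callable[[Vector], float], point: Vector,
                       step: float = 1e-5) -> Vector:
    """Central finite-difference approximation of the gradient of f at point."""
    grad = []
    for i in range(len(point)):
        up, down = list(point), list(point)
        up[i] += step
        down[i] -= step
        grad.append((f(up) - f(down)) / (2.0 * step))
    return grad


def gradient_penalty(critic: Callable[[Vector, Vector], float], real: Vector,
                     fake: Vector, low_res: Vector, rng: random.Random) -> float:
    """Single-sample penalty (||grad|| - 1)^2 at a random point between real and fake."""
    epsilon = rng.random()  # uniform in [0, 1)
    interpolated = [epsilon * r + (1.0 - epsilon) * f for r, f in zip(real, fake)]
    grad = numerical_gradient(lambda x: critic(x, low_res), interpolated)
    norm = math.sqrt(sum(g * g for g in grad))
    return (norm - 1.0) ** 2


def linear_critic(high_res: Vector, low_res: Vector) -> float:
    """Toy critic whose gradient with respect to high_res is (3, 4) everywhere."""
    return 3.0 * high_res[0] + 4.0 * high_res[1]


penalty = gradient_penalty(linear_critic, [1.0, 0.0], [0.0, 1.0], [0.5],
                           random.Random(0))
print(penalty)
```

For this linear toy critic the gradient norm is 5 at every point, so the penalty is close to 16 regardless of where the interpolated point falls; in training, the penalty would be averaged over a mini-batch.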


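While distributional alignment is not available for the training of general purpose GANs, it is available for our task of high resolution image generation as follows. The perceptual representation of a high resolution image $x$ is $P(x, y)$. For the corresponding input image $y$, the perceptual center is $P(G(y, 0), y)$. We define the generator loss for aligning the centers as:

$$\mathcal{L}_{center} = \mathbb{E}_{x, y}\left[\lVert P(x, y) - P(G(y, 0), y) \rVert_2^2\right]$$

To summarize, the losses for the generator and critic are:

$$\mathcal{L}_{gen} = \mathcal{L}_{center} - \mathbb{E}_{y, z}\left[C(G(y, z), y)\right]$$

$$\mathcal{L}_{critic} = \mathbb{E}_{y, z}\left[C(G(y, z), y)\right] - \mathbb{E}_{x, y}\left[C(x, y)\right] + \lambda \mathcal{L}_{gp}$$

where $\lambda$ weights the gradient penalty.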
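A minimal sketch of how the two losses might be evaluated on a single training example is given below. It assumes toy stand-ins for the generator, projection and critic, flat lists of floats as images, and a gradient penalty weight of 10 (the value commonly used with the penalty of [6]; the text here does not state the weight used). All names are hypothetical.

```python
from typing import List

Vector = List[float]


def toy_generator(low_res: Vector, z: Vector) -> Vector:
    """Toy generator: repeats the single low-res pixel and adds the latent offset."""
    return [low_res[0] + zi for zi in z]


def toy_project(high_res: Vector, low_res: Vector) -> Vector:
    """Toy projection into the latent space: residual against the low-res pixel."""
    return [h - low_res[0] for h in high_res]


def toy_critic(high_res: Vector, low_res: Vector) -> float:
    """Toy critic: a scoring map (here a sum) applied to the projection."""
    return sum(toy_project(high_res, low_res))


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((u - v) ** 2 for u, v in zip(a, b))


def generator_loss(high_res: Vector, low_res: Vector, z: Vector) -> float:
    """Center alignment in latent space minus the critic score of a sample."""
    center = toy_generator(low_res, [0.0] * len(z))  # generator output at z = 0
    sample = toy_generator(low_res, z)
    align = squared_distance(toy_project(high_res, low_res),
                             toy_project(center, low_res))
    return align - toy_critic(sample, low_res)


def critic_loss(high_res: Vector, low_res: Vector, z: Vector,
                penalty: float = 0.0, penalty_weight: float = 10.0) -> float:
    """Wasserstein critic loss plus a weighted gradient penalty term."""
    sample = toy_generator(low_res, z)
    return (toy_critic(sample, low_res) - toy_critic(high_res, low_res)
            + penalty_weight * penalty)


gen_value = generator_loss([0.75, 0.25], [0.5], [0.25, -0.5])
critic_value = critic_loss([0.75, 0.25], [0.5], [0.25, -0.5])
print(gen_value, critic_value)
```

Note that the alignment term compares latent projections only; no pixel-space distance between the generated and real images appears anywhere in the generator loss.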


The simplifying assumption that $x$ lies in the center of the perceptual space, namely at $z = 0$, may be considered a limitation of our method.

This limitation can be ameliorated to some extent by learning an embedding vector with each training sample with the constraint that . This would make the convergence much slower since each embedding vector is updated only when its corresponding training sample appears in the mini-batch.

Another way to overcome this limitation would be to use an embedding function parameterized by to obtain , such that . While not as flexible as the discrete embedding approach, it is faster to learn since its weights can be updated with each mini-batch.

These extensions are left for future research.

2.2 Relation to conditional GANs

The approach we propose may be conflated with conditional GANs, so let us clarify the difference. Conditional GANs [14] are GANs in which the generated sample is conditioned on an input $c$. Specializing to our case, the generated sample is a high resolution image $x$ and $c$ is its corresponding low resolution image $y$. The resulting conditional GAN formulation is:

$$\min_{G} \max_{D} \; \mathbb{E}_{x, y}\left[\log D(x \mid y)\right] + \mathbb{E}_{y, z}\left[\log\left(1 - D(G(y, z) \mid y)\right)\right]$$

Generating a high resolution image conditioned on a low resolution image is a continuous conditional GAN problem. There is, however, a subtle nuance that distinguishes this approach from ours. In our case, the goal for the generated image is to be both plausible and close to the ground truth (in the latent space). By comparison, in the continuous conditional GAN framework, the generated image must be down-scalable to precisely the same $y$ and be indistinguishable from a real image in the discriminator's judgment, without regard for closeness to the ground truth.

As such, our proposed approach can be interpreted as a novel form of a conditional GAN, augmented to consider closeness to ground truth. This is achieved with the loss $\mathcal{L}_{center}$, which minimizes the distance between $P(x, y)$ and $P(G(y, 0), y)$ in the adversarial latent space.

3 Implementation Choices

3.1 Losses, Conditioning and Architecture

We used the Wasserstein GAN loss with gradient penalty. It should be noted that we also obtained good results with a relativistic GAN [8, 18] loss paired with spectrally normalized convolutions. We did not fully explore all possible GAN losses, as this is beyond the scope of this paper.

We simplified the task of the critic by providing it with the absolute difference from the low resolution ground truth rather than the ground truth itself. In other words, we compute:

$$P\left(\tilde{x}, \; \lvert y - \mathrm{round}_r(S(\tilde{x})) \rvert\right)$$

where $\tilde{x}$ are generated samples, $S$ is the down-scaling operator and $r$ is the color resolution. The down-scaling operator produces the low resolution image of a corresponding high resolution image. We round the output of the down-scaling operator to its nearest color resolution $r$, which is determined by the numeric range on which we represent images. This is done to prevent the network from becoming unstable by converging to large weights to measure infinitesimal differences. To allow gradient propagation through the rounding operation, we use Hinton's straight through estimator [7]. Assuming a stop gradient operation $\mathrm{sg}$, the straight through estimator for rounding a value $v$ is:

$$v + \mathrm{sg}(\mathrm{round}(v) - v)$$
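The rounding step and the straight through estimator can be sketched without an autodiff framework by returning the forward value together with the derivative that the backward pass would use. This is only an illustration: the resolution of 0.25 is a made-up value chosen for readability, the function names are hypothetical, and in a real framework the stop gradient would be a library primitive.

```python
from typing import List, Tuple


def round_to_resolution(value: float, resolution: float) -> float:
    """Snap a value to the nearest multiple of the color resolution.

    Python's round() resolves exact ties to the nearest even integer.
    """
    return resolution * round(value / resolution)


def straight_through_round(value: float, resolution: float) -> Tuple[float, float]:
    """Return (forward value, derivative with respect to value).

    Forward: value + stop_gradient(rounded - value), which evaluates to rounded.
    Backward: the stop-gradient term is treated as a constant, so the slope is 1.
    """
    rounded = round_to_resolution(value, resolution)
    forward = value + (rounded - value)
    backward = 1.0
    return forward, backward


def critic_side_input(low_res: List[float], downscaled: List[float],
                      resolution: float) -> List[float]:
    """Absolute difference between the low-res input and the rounded down-scaled image."""
    return [abs(target - round_to_resolution(pixel, resolution))
            for target, pixel in zip(low_res, downscaled)]


forward, slope = straight_through_round(0.6, 0.25)
difference = critic_side_input([0.5, 0.0], [0.6, 0.4], 0.25)
print(forward, slope, difference)
```

Because the rounded difference is exactly zero whenever the down-scaled candidate matches the input up to the color resolution, the critic has no incentive to grow large weights to detect smaller discrepancies.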


We do not advocate a specific neural network architecture, as there is a wide variety of potential implementation candidates. Indeed, newer and better architectures are constantly appearing in the literature [17, 19], and LAG should be adaptable to these other architectures. In practice, for our experiments, we decided to use a residual network similar to EDSR [11] for its simplicity. For the critic, we used almost the same architecture but in reverse order.

The architecture is trained by progressively growing the network as introduced in [9, 19]. This training procedure is not required but seems to yield slightly better visual results.

All the implementation details of the architecture and training, as well as the TensorFlow code can be found here

3.2 Creating Small Inputs

Ideally, for a particular task like super-resolution, a specific down-scaling operator is used to generate low-resolution images. This operator is determined by the physical process that generates such images (e.g. a camera) or by a choice of prior distribution on the set of such images.

In our case, however, the choice of a specific operator that generates low resolution images is not critical, since our generator operates in the latent space and should create plausible images from any low resolution input. While the choice of a down-scaling operator is an interesting topic of its own [3], our main goal is to generate images and to evaluate how robust the quality of our output is to the lack of such information. To be specific, in our results we simply used the bicubic down-scaling function and the average pooling function to generate very low resolution images.
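Of the two down-scaling functions mentioned, average pooling is the simpler one to write down. The following is a minimal sketch for a single-channel image stored as a list of rows; the function name `average_pool` and the requirement that the image size be divisible by the factor are assumptions made for this illustration.

```python
from typing import List

Image = List[List[float]]


def average_pool(image: Image, factor: int) -> Image:
    """Down-scale a single-channel image by averaging non-overlapping factor x factor blocks."""
    height, width = len(image), len(image[0])
    if height % factor or width % factor:
        raise ValueError("image size must be divisible by the pooling factor")
    pooled = []
    for top in range(0, height, factor):
        row = []
        for left in range(0, width, factor):
            block = [image[top + dy][left + dx]
                     for dy in range(factor) for dx in range(factor)]
            row.append(sum(block) / len(block))
        pooled.append(row)
    return pooled


# A 4x4 ramp image pooled by a factor of 2 gives a 2x2 image.
ramp = [[float(4 * r + c) for c in range(4)] for r in range(4)]
small = average_pool(ramp, 2)
print(small)
```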

As will be seen in the experiments, regardless of the down-scaling choice, our method achieves roughly the same scores on the metrics under consideration, and produces high quality, realistic and plausible outputs. This shows experimentally that our technique is robust to the process for generating input images.

4 Experimental Results

Measuring the quality of a generative model is the subject of ongoing research. We employ two different approaches. The main approach we employ to demonstrate the proposed method is empirical, based on a broad set of experiments that illustrate the flexibility of our framework.

The other is quantitative, based on a comparison of perceptual quality against other approaches to generating high resolution images from low resolution inputs (what is sometimes called super-resolution). Since solving super-resolution is not the stated aim of this paper, this quantitative comparison is presented only in the Supplementary Material. In the GAN literature, several metrics have been developed to measure the quality of the generated sample distributions. Unfortunately, these metrics are primarily focused on capturing sample diversity, and no standalone perceptual quality estimate is currently available. Other metrics [16, 12] have been proposed to measure the perceptual quality of a single image.

5 Generating Sets of Plausible Images

By design, the LAG method's main strength is the ability to generate not just one, but a family of plausible images given a low-resolution input. Namely, while we can model the set of possible images and predict the one at their center, we can also generate samples by drawing from the distribution of $z$. We illustrate this capacity with examples in three categories: Faces, Churches, and Bedrooms. We also illustrate examples where the generator is trained on one class, but the input image is taken from a different class.
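The sampling procedure can be sketched as follows: one image is produced at the center of the latent distribution (the zero vector), and a family of alternatives is produced from Gaussian draws. The toy generator, the helper names and the fixed seed are assumptions made for this illustration; a trained generator network would take the place of `toy_generator`.

```python
import random
from typing import Callable, List, Tuple

Vector = List[float]


def sample_latents(count: int, dim: int, seed: int = 0) -> List[Vector]:
    """Draw `count` latent vectors with independent standard normal entries."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


def generate_family(generator: Callable[[Vector, Vector], Vector], low_res: Vector,
                    count: int, dim: int, seed: int = 0) -> Tuple[Vector, List[Vector]]:
    """Return the center prediction (zero latent) and `count` sampled alternatives."""
    center = generator(low_res, [0.0] * dim)
    samples = [generator(low_res, z) for z in sample_latents(count, dim, seed)]
    return center, samples


def toy_generator(low_res: Vector, z: Vector) -> Vector:
    """Stand-in generator: repeats the low-res pixel and adds the latent offset."""
    return [low_res[0] + zi for zi in z]


center, samples = generate_family(toy_generator, [0.5], count=3, dim=2)
print(center)
print(samples)
```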

Fig. 1 shows examples for faces. Specifically, with a modest down-sampling factor, the set of images generated by sampling $z$ is rather similar and consistent. That is, the variance across the samples is modest. However, Fig. 2 shows that when the network is provided with a very low resolution image (i.e. a high down-sampling factor), the set of images generated by sampling $z$ is much more varied. This means that the size of the input in effect controls the diversity of the output, a desirable property that should, in retrospect, not be entirely unexpected for a well-designed generator.

Figure 1 (columns: high res, low res, samples): Samples of generated images from a modestly down-sampled input for various $z$. Note that the various results are similar, relatively consistent, but not identical. Spurious objects such as glasses are not generated.
Figure 2 (columns: high res, low res, samples): Samples of generated images from a heavily down-sampled input for various $z$. Note that given the paucity of information in the input, the variance across the resulting generated images is quite large.
Figure 3 (columns: high res, low res, samples): Samples of generated bedroom images from down-sampled input images for various $z$.
Figure 4 (columns: high res, low res, samples): Samples of generated church images from down-sampled input images for various $z$.

5.1 Mirroring

In this experiment, we consider how well the network can generate images across a limited, well-defined class. Namely, we consider a given image and its mirror image. We then down-sample these images and generate high resolution versions using our network. The expectation is that the high resolution images should be consistent, in the sense that they should illustrate "turning" the given image into its mirror image. The experimental results in Fig. 5 bear this out.

Figure 5: Samples of generated images from a down-sampled input. The sequence of input images is generated by interpolating between the original high resolution image and its mirror image. Note that the model generates quite consistent results that illustrate the face going smoothly from a given pose to its mirror image.
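One way to build such an input sequence is sketched below, assuming that the interpolation is a per-pixel linear blend between the image and its horizontal flip (the text does not specify the interpolation scheme). The image is a single-channel list of rows and the function names are hypothetical.

```python
from typing import List

Image = List[List[float]]


def mirror(image: Image) -> Image:
    """Flip a single-channel image horizontally."""
    return [list(reversed(row)) for row in image]


def blend_with_mirror(image: Image, t: float) -> Image:
    """Per-pixel linear blend: t = 0 gives the image, t = 1 gives its mirror."""
    flipped = mirror(image)
    return [[(1.0 - t) * a + t * b for a, b in zip(row, flipped_row)]
            for row, flipped_row in zip(image, flipped)]


def mirror_sequence(image: Image, steps: int) -> List[Image]:
    """Evenly spaced blends from the image to its mirror, endpoints included."""
    if steps < 2:
        raise ValueError("need at least two steps to include both endpoints")
    return [blend_with_mirror(image, i / (steps - 1)) for i in range(steps)]


sequence = mirror_sequence([[0.0, 1.0]], steps=3)
print(sequence)
```

Each image in the sequence would then be down-sampled and passed to the generator with the same latent vector, so that any change in the output is due to the input alone.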

Next, we conducted an additional experiment to see how the network behaves when the input image belongs to data-set A and the generator is trained on a different data-set B. The results are shown in Fig. 6.

Figure 6: Samples of generated images from a down-sampled input. The sequence of input images is generated by interpolating between the original high resolution image and its mirror image. Note that the model generates quite consistent results that illustrate the face going smoothly from a given pose to its mirror image.

5.2 Noisy and Random Inputs

In this experiment, for the sake of completeness, we consider how the network reacts to being given either noisy images, or simply images that consist of only noise.

It is of interest to see how the generated set varies for a fixed $z$ when progressively more uniform pixel noise is added to the input image. The results in Fig. 7 illustrate a rather varied set of generated outputs.

Figure 7: Two examples of generated images from a down-sampled input with progressively more noise added to the input images. Note that in all cases, the latent parameter $z$ was fixed. Top left: the high resolution image, followed by the low resolution image and the images generated from noisy low resolution inputs.
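The noisy inputs for this kind of experiment could be produced as in the sketch below, which adds independent uniform noise of increasing amplitude to every pixel of a low resolution image. The amplitudes, the seed and the function names are illustrative assumptions; the sketch does not clip the result, since the pixel range used for the experiment is not given here.

```python
import random
from typing import List

Image = List[List[float]]


def add_uniform_noise(image: Image, amplitude: float, rng: random.Random) -> Image:
    """Add independent noise drawn uniformly from [-amplitude, amplitude] to each pixel."""
    return [[pixel + rng.uniform(-amplitude, amplitude) for pixel in row]
            for row in image]


def noisy_sequence(image: Image, amplitudes: List[float], seed: int = 0) -> List[Image]:
    """One noisy copy of the image per amplitude, in the order given."""
    rng = random.Random(seed)
    return [add_uniform_noise(image, amplitude, rng) for amplitude in amplitudes]


clean = [[0.0, 0.5], [0.5, 1.0]]
noisy = noisy_sequence(clean, [0.0, 0.1, 0.2])
print(noisy)
```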

6 Conclusion

We introduced LAG, a new method for generating high resolution images using latent space. It is simple to train and appears robust: we do not need to tune hyper-parameters to avoid mode collapse.

LAG automatically obtains, through an emergent behavior of the latent space, a good balance between pixel and perceptual accuracy. Our approach offers flexibility in the variations of the generated images, as well as higher quality results when a specific image is to be generated against a reference. This is achieved by training in a perceptual latent space, which does not require additional losses to measure pixel or content distances.

Importantly, LAG makes it possible to explore a space of high resolution images given an input, providing a flexible framework that seems to allow higher resolution generated images for simpler manifolds. In addition to the latent variable $z$, we identified two additional mechanisms for controlling the variety of images the network can generate. First, the size of the input image directly affects the observed variations across images generated by the network. Very small inputs generate a much broader variety of images, as might be expected. Second, even for a fixed $z$, the addition of modest amounts of noise to the input images also produces a variety of outputs that appear plausible.

Several avenues for future research seem open. First, alternative architectures can of course be explored. Second, and more fundamentally, our assumption that the reference image lies at the center of a Gaussian distribution is probably a terrible simplification. We expect that even more compelling results are possible.

Last, taking advantage of semantic information that describes the content of an image seems a natural path for producing better high resolution images that may be modeled as coming from simpler manifolds.


  • [1] M. Arjovsky, S. Chintala, and L. Bottou (2017) Wasserstein gan. arXiv preprint arXiv:1701.07875. Cited by: §2.1.
  • [2] Y. Blau and T. Michaeli (2018) The perception-distortion tradeoff. In Proc. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, Utah, USA, pp. 6228–6237. Cited by: §1.
  • [3] A. Bulat, J. Yang, and G. Tzimiropoulos (2018) To learn image super-resolution, use a gan to learn how to do image degradation first. arXiv preprint arXiv:1807.11458. Cited by: §3.2.
  • [4] L. Gatys, A. S. Ecker, and M. Bethge (2015) Texture synthesis using convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 262–270. Cited by: §1.
  • [5] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §1.
  • [6] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville (2017) Improved training of wasserstein gans. In Advances in Neural Information Processing Systems, pp. 5767–5777. Cited by: §2.1.
  • [7] G. Hinton (2012) Neural networks for machine learning. Note: Coursera, video lectures. Cited by: §3.1.
  • [8] A. Jolicoeur-Martineau (2018) The relativistic discriminator: a key element missing from standard gan. arXiv preprint arXiv:1807.00734. Cited by: §3.1.
  • [9] T. Karras, T. Aila, S. Laine, and J. Lehtinen (2017) Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196. Cited by: §3.1.
  • [10] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. P. Aitken, A. Tejani, J. Totz, Z. Wang, et al. (2017) Photo-realistic single image super-resolution using a generative adversarial network.. In CVPR, Vol. 2, pp. 4. Cited by: §1.
  • [11] B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee (2017) Enhanced deep residual networks for single image super-resolution. In The IEEE conference on computer vision and pattern recognition (CVPR) workshops, Vol. 1, pp. 4. Cited by: §3.1.
  • [12] C. Ma, C. Yang, X. Yang, and M. Yang (2017) Learning a no-reference quality metric for single-image super-resolution. Computer Vision and Image Understanding 158, pp. 1–16. Cited by: §4.
  • [13] R. Mechrez, I. Talmi, F. Shama, and L. Zelnik-Manor (2018) Learning to maintain natural image statistics. arXiv preprint arXiv:1803.04626. Cited by: §1.
  • [14] M. Mirza and S. Osindero (2014) Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784. Cited by: §2.2.
  • [15] M. S. Sajjadi, B. Schölkopf, and M. Hirsch (2017) Enhancenet: single image super-resolution through automated texture synthesis. In Computer Vision (ICCV), 2017 IEEE International Conference on, pp. 4501–4510. Cited by: §1.
  • [16] H. Talebi and P. Milanfar (2018) Nima: neural image assessment. IEEE Transactions on Image Processing 27 (8), pp. 3998–4011. Cited by: §4.
  • [17] T. Tong, G. Li, X. Liu, and Q. Gao (2017) Image super-resolution using dense skip connections. In Computer Vision (ICCV), 2017 IEEE International Conference on, pp. 4809–4817. Cited by: §3.1.
  • [18] X. Wang, K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, C. C. Loy, Y. Qiao, and X. Tang (2018) ESRGAN: enhanced super-resolution generative adversarial networks. arXiv preprint arXiv:1809.00219. Cited by: §1, §3.1.
  • [19] Y. Wang, F. Perazzi, B. McWilliams, A. Sorkine-Hornung, O. Sorkine-Hornung, and C. Schroers (2018) A fully progressive approach to single-image super-resolution. arXiv preprint arXiv:1804.02900. Cited by: §3.1, §3.1.