1 Introduction
Image super-resolution (SR) is the underdetermined inverse problem of estimating a high-resolution (HR) image given the corresponding low-resolution (LR) input. This problem has recently attracted significant research interest due to the potential of enhancing the visual experience in many applications while limiting the amount of raw pixel data that needs to be stored or transmitted. While SR has many applications, for example in medical diagnostics or forensics (Nasrollahi & Moeslund, 2014, and references therein), here we are primarily motivated to improve the perceptual quality when applied to natural images. Most current single-image SR methods use empirical risk minimisation, often with a pixel-wise mean squared error (MSE) loss (Dong et al., 2016; Shi et al., 2016). However, MSE, and convex loss functions in general, are known to have limitations when presented with uncertainty in multimodal and nontrivial distributions such as distributions over natural images. In SR, a large number of plausible images can explain the LR input, and the Bayes-optimal behaviour for any MSE-trained model is to output the mean of the plausible solutions weighted according to their posterior probability. For natural images this averaging behaviour leads to blurry and over-smoothed outputs that generally appear implausible, i.e. the produced estimates have low probability under the natural image prior.
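The averaging behaviour of MSE can be illustrated with a minimal sketch. The two candidate "images" and their equal posterior weights below are hypothetical values chosen for illustration only; they are not taken from our experiments.

```python
# Hypothetical discrete posterior: two equally plausible 2-pixel HR images
# that both downsample (by averaging) to the same LR value 0.5.
candidates = [(0.0, 1.0), (1.0, 0.0)]
weights = [0.5, 0.5]  # assumed posterior probabilities


def expected_mse(estimate):
    """Expected squared error of `estimate` under the discrete posterior."""
    return sum(
        w * sum((e - c) ** 2 for e, c in zip(estimate, cand))
        for w, cand in zip(weights, candidates)
    )


# The posterior mean is the minimiser of the expected squared error.
posterior_mean = tuple(
    sum(w * cand[i] for w, cand in zip(weights, candidates)) for i in range(2)
)

mse_of_mean = expected_mse(posterior_mean)
mse_of_plausible = expected_mse(candidates[0])
```

In this sketch the posterior mean attains a lower expected MSE than either plausible solution, although it is not itself one of the plausible solutions; this is the discrete analogue of a blurry output.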
An idealised method for our applications would use a full-reference perceptual loss function that describes the sensitivity of the human visual perception system to different distortions. However, the most widely used loss function, MSE, and the related peak signal-to-noise ratio (PSNR) metric have been shown to correlate poorly with human perception of image quality (Laparra et al., 2016; Wang et al., 2004). Improved perceptual quality metrics have been proposed, the most popular being structural similarity (SSIM) (Wang et al., 2004) and its multi-scale variants (Wang et al., 2003). Although the correlation of these metrics with human perception has improved, they still do not provide a fully satisfactory alternative to MSE for training neural networks (NN) for SR.
In lieu of a satisfactory perceptual loss function, we leave the empirical risk minimisation framework and present methods based only on natural image statistics. In this paper we argue that a desirable approach is to employ amortised Maximum a Posteriori (MAP) inference, preferring solutions that have a high posterior probability and thus high probability under the image prior, while keeping the computational benefits of amortised inference. To motivate why MAP inference is desirable, consider the toy problem in Figure 1a, where the HR data $y = [y_1, y_2]$ is two-dimensional and distributed according to the Swiss-roll density. The LR observation is defined as the average of the two pixels, $x = \frac{y_1 + y_2}{2}$. Consider observing a LR data point $x$: the set of possible HR solutions is the line $y_1 = 2x - y_2$, more generally an affine subspace, which is shown by the dashed line in Figure 1a. The posterior distribution $p_{Y|X}(y|x)$ is thus degenerate, and corresponds to a slice of the prior along this line, as shown by the red shading. If one minimises MSE or mean absolute error (MAE), the Bayes-optimal solution will lie at the mean or the median along the line, respectively. This example illustrates that MSE and MAE can produce output with very low probability under the data prior, whereas MAP inference would always find the mode, which by definition is in a high-probability region. See Section 5.6 for a discussion of possible limitations of the MAP inference approach.
Our first contribution is a convolutional neural network (CNN) architecture designed to exploit the structure of the SR problem. Image downsampling is a linear transformation, and can be modelled as a strided convolution. As Figure 1a illustrates, the set of HR images that are compatible with any LR image spans an affine subspace. We show that by using specifically chosen linear convolution and deconvolution layers we can implement a projection to this affine subspace. This ensures that our CNNs always output estimates that are consistent with the inputs. The affine projection layer can be added to any CNN, or indeed, any other trainable SR algorithm. Using this architecture we show that training the model for MAP inference reduces to minimising the cross-entropy between the HR data distribution and the implied distribution of the model's output when evaluated at random LR images. As a result, we no longer need corresponding HR and LR image pairs, and training becomes more akin to training generative models. However, direct minimisation of the cross-entropy is not possible, and instead we develop three approaches, all depending on projecting the model output to the affine subspace of valid solutions, to approximate it directly from data:
We present a variant of Generative Adversarial Networks (GAN) (Goodfellow et al., 2014) which approximately minimises the Kullback–Leibler divergence (KL) and cross-entropy between the distribution of the model's output, $q_\theta$, and the HR data distribution, $p_Y$. Our analysis provides theoretical grounding for using GANs in image SR (Ledig et al., 2016). We also introduce a trick that we call instance noise that can be generally applied to address the instability of training GANs.

We employ denoising as a way to capture natural image statistics. Bayes-optimal denoisers approximately learn to take a gradient step along the log-probability of the data distribution (Alain & Bengio, 2014). These gradient estimates from denoising can be directly backpropagated through the network to minimise the cross-entropy between $q_\theta$ and $p_Y$ via gradient descent.

We present an approach where the probability density of data is directly modelled via a generative model trained by maximum likelihood. We use a differentiable generative model based on PixelCNNs (Oord et al., 2016) and Mixtures of Conditional Gaussian Scale Mixtures (MCGSM, Theis et al., 2012) whose performance we believe is very close to the state of the art in this category.
In Section 5 we empirically demonstrate the behaviour of the proposed methods on both the two-dimensional toy dataset and on real image datasets. Lastly, in Appendix F we show that a stochastic version of AffGAN performs amortised variational inference, which for the first time establishes a connection between GANs and variational inference as in e.g. variational autoencoders (Kingma & Welling, 2014).
2 Related work
The GAN framework was introduced by Goodfellow et al. (2014), who also showed that these models minimise the Jensen-Shannon divergence between the data distribution $p_Y$ and the model distribution $q_\theta$ under certain conditions. In Section 3.2, we present an update rule that corresponds to minimising $\mathrm{KL}[q_\theta \| p_Y]$. Recently, Nowozin et al. (2016) presented a more general treatment that connects GANs to $f$-divergence minimisation. In parallel to our contributions, theoretical work by Mohamed & Lakshminarayanan (2016) presented a unifying view on learning in GAN-style algorithms, of which our variant can be regarded as a special case. The focus of several recent papers on GANs was algorithmic tricks to improve their stability (Radford et al., 2015; Salimans et al., 2016). In Section 3.2.1 we introduce another such trick we call instance noise. We discuss theoretical motivations for this and compare it to one-sided label smoothing proposed by Salimans et al. (2016). We also refer to parallel work by Arjovsky & Bottou (2017) proposing a similar method.
Recently, several attempts have been made to improve the perceptual quality of SR using deep representations of natural images. Bruna et al. (2016) and Li & Wand (2016) measure the Euclidean distance in the nonlinear feature space of a deep NN pre-trained to perform object classification. Dosovitskiy & Brox (2016) and Ledig et al. (2016) use a similar approach and also add an adversarial loss term. Unpublished work by Garcia (2016) explored combining GANs with an $\ell_1$ penalty between the LR input and the downsampled output. We note that the soft $\ell_2$ or $\ell_1$ penalties used in these methods can be interpreted as assuming Gaussian and Laplace observation noise, respectively. In contrast, our approach assumes no observation noise and satisfies the consistency of inputs and outputs exactly by using an affine projection as explained in Section 3.1. In other work, Larsen et al. (2015) proposed to replace the pixel-wise MSE used for training variational autoencoders with a learned metric from the GAN discriminator.
Our denoiser-based method exploits a fundamental connection between probabilistic modelling and learning to denoise (see e.g. Vincent et al., 2008; Alain & Bengio, 2014; Särelä & Valpola, 2005; Rasmus et al., 2015; Greff et al., 2016): a Bayes-optimal denoiser can be used to estimate the gradient of the log-probability of data. To our knowledge this work is the first time that the output of a denoiser is explicitly backpropagated to train another network. Lastly, we note that denoising has been used to solve inverse problems in compressed sensing, as in approximate message passing (Metzler et al., 2015).
3 Theory
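Consider a function $f_\theta(x)$ parametrised by $\theta$ which maps a LR observation $x$ to a HR estimate $\hat{y}$. Most current SR methods optimise model parameters via empirical risk minimisation:
$$\operatorname*{argmin}_\theta \; \mathbb{E}_{y,x}\left[\ell\left(y, f_\theta(x)\right)\right] \tag{1}$$
where $y$ is the true target and $\ell$ is some loss function. The loss function is typically a simple convex function, most often MSE, as in (Dong et al., 2016; Shi et al., 2016). Here, we seek to perform MAP inference instead. For a single LR observation $x$ the MAP estimate is
$$\hat{y}(x) = \operatorname*{argmax}_y \; \log p_{Y|X}(y|x) \tag{2}$$
Instead of calculating $\hat{y}$ for each $x$ separately we perform amortised inference, i.e. we would like to train the SR function $f_\theta(x)$ to calculate the MAP estimate. A natural loss function for learning the parameters $\theta$ is the average log-posterior:
$$\operatorname*{argmax}_\theta \; \mathbb{E}_x \log p_{Y|X}\left(f_\theta(x)|x\right) \tag{3}$$
where the expectation is taken over the distribution $p_X$ of LR observations $x$. This loss depends on the unknown posterior distribution $p_{Y|X}$. We proceed by decomposing the log-posterior using Bayes' rule as follows.
$$\operatorname*{argmax}_\theta \left\{ \underbrace{\mathbb{E}_x\left[\log p_{X|Y}(x|f_\theta(x))\right]}_{\text{likelihood}} + \underbrace{\mathbb{E}_x\left[\log p_Y(f_\theta(x))\right]}_{\text{prior}} - \underbrace{\mathbb{E}_x\left[\log p_X(x)\right]}_{\text{marginal likelihood}} \right\} \tag{4}$$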
3.1 Handling the Likelihood term
Notice that the last term of Eqn. (4), the marginal likelihood, does not depend on $\theta$, so we only have to deal with the likelihood and the image prior. The observation model in SR can be described as follows.
$$x = Ay \tag{5}$$
where $A$ is a linear transformation used for image downsampling. In general, $A$ can be modelled as a strided two-dimensional convolution. Therefore, the likelihood term in Eqn. (4) is degenerate, $p_{X|Y}(x|f_\theta(x)) = \delta\left(x - A f_\theta(x)\right)$, and Eqn. (4) can be rewritten as a constrained optimisation:
$$\operatorname*{argmax}_{\theta \,:\, \forall x:\, A f_\theta(x) = x} \; \mathbb{E}_x\left[\log p_Y(f_\theta(x))\right] \tag{6}$$
To satisfy the constraints, we introduce a parametric function class $g_\theta$ that always guarantees $A g_\theta(x) = x$. Specifically, we propose to use functions of the form
$$g_\theta(x) = \Pi^A_x f_\theta(x) = (I - A^+A) f_\theta(x) + A^+ x \tag{7}$$
where $f_\theta$ is an arbitrary mapping from LR to HR space, $\Pi^A_x$ a projection to the affine subspace $\{y : Ay = x\}$, and $A^+$ is the Moore-Penrose pseudoinverse of $A$, which satisfies $AA^+A = A$ and $A^+AA^+ = A^+$. Conveniently, if $A$ is a strided two-dimensional convolution, then $A^+$ becomes a deconvolution or up-convolution, which is a standard operation used in deep learning (e.g. Shi et al., 2016). It is important to stress that the optimal deconvolution is not simply the transpose of $A$; Figure 2 illustrates the upsampling kernel ($A^+$) that corresponds to a Gaussian downsampling kernel ($A$). For any $A$ the deconvolution $A^+$ can be easily found; here we used numerical methods as detailed in Appendix B. Intuitively, $A^+x$ can be thought of as a baseline SR solution, while $(I - A^+A) f_\theta(x)$ is the residual. The operation $(I - A^+A)$ is a projection to the null-space of $A$, therefore when we downsample the residual $(I - A^+A) f_\theta(x)$ we are guaranteed to get $0$ no matter what $f_\theta$ is. By using functions of this form we can turn Eqn. (6) into an unconstrained optimisation problem.
$$\operatorname*{argmax}_\theta \; \mathbb{E}_x \log p_Y(g_\theta(x)) \tag{8}$$
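The affine projection can be sketched for the two-pixel toy problem, where the downsampling operator averages the two HR pixels. This is an illustrative sketch rather than our implementation: the function names are hypothetical, and the closed form $A^+ = [1, 1]^\top$ follows from $A^+ = A^\top (AA^\top)^{-1}$ for the assumed $A = [0.5, 0.5]$.

```python
def downsample(y):
    """A y: average of the two HR pixels (toy observation model, A = [0.5, 0.5])."""
    return 0.5 * (y[0] + y[1])


def pinv_upsample(x):
    """A^+ x, where A^+ = A^T (A A^T)^{-1} = [1, 1]^T for the toy A."""
    return (x, x)


def affine_project(f_out, x):
    """g(x) = (I - A^+ A) f(x) + A^+ x, written out for the toy operator."""
    baseline = pinv_upsample(x)                   # A^+ x, the baseline solution
    low_freq = pinv_upsample(downsample(f_out))   # A^+ A f(x)
    return tuple(f - l + b for f, l, b in zip(f_out, low_freq, baseline))


x = 0.3                # LR observation
f_out = (2.0, -5.0)    # arbitrary network output, inconsistent with x
g = affine_project(f_out, x)

consistency_error = abs(downsample(g) - x)
```

Whatever value `f_out` takes, the projected output `g` downsamples back to `x` up to floating-point precision. Note that even in this small case $A^+$ differs from the plain transpose $A^\top = [0.5, 0.5]^\top$.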
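Interestingly, the objective above can be expressed in terms of the probability distribution of the model output $y \sim q_\theta$ as follows.
$$\operatorname*{argmax}_\theta \; \mathbb{E}_x \log p_Y(g_\theta(x)) = \operatorname*{argmax}_\theta \; \mathbb{E}_{y \sim q_\theta} \log p_Y(y) = \operatorname*{argmin}_\theta \; \mathbb{H}[q_\theta, p_Y] \tag{9}$$
where $\mathbb{H}[q, p] = \mathbb{E}_{y \sim q}\left[-\log p(y)\right]$ denotes the cross-entropy between $q$ and $p$, and we used $q_\theta(y) := \int p_X(x)\, \delta\left(y - g_\theta(x)\right) dx$. To minimise this objective, we do not need matched input-output pairs as in empirical risk minimisation. Instead we need to match the marginal distribution of reconstructed images $y = g_\theta(x)$ to that of the distribution of HR images. In this respect, the problem becomes more akin to unsupervised learning or generative modelling. In the following sections we present three approaches to finding the optimal $g_\theta$ utilising the properties of the affine projection.
3.2 Affine projected Generative Adversarial Networks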
Generative Adversarial Networks (Goodfellow et al., 2014) consist of a generator $G_\theta$ that turns noise $z$ sampled from some distribution $p_Z$ into images via a parametric mapping, and a discriminator $D_\psi$ that learns to distinguish between real and synthetic images. The generator and discriminator are updated in tandem, resulting in the generative distribution $q_\theta$ moving closer to the distribution of real data $p_Y$. The behaviour of GANs depends on the specifics of how the generator and the discriminator are trained. We use the following objective functions for $D_\psi$ and $G_\theta$:
$$\mathcal{L}(\psi; \theta) = -\mathbb{E}_{y \sim p_Y} \log D_\psi(y) - \mathbb{E}_{z \sim p_Z} \log\left(1 - D_\psi(G_\theta(z))\right), \qquad \mathcal{L}(\theta; \psi) = -\mathbb{E}_{z \sim p_Z} \log \frac{D_\psi(G_\theta(z))}{1 - D_\psi(G_\theta(z))} \tag{10}$$
The algorithm iterates two steps: first, it updates $\psi$ by lowering $\mathcal{L}(\psi; \theta)$ keeping $\theta$ fixed, then it updates $\theta$ by lowering $\mathcal{L}(\theta; \psi)$ keeping $\psi$ fixed. It can be shown that this amounts to minimising $\mathrm{KL}[q_\theta \| p_Y]$, where $q_\theta$ is the distribution of samples generated by $G_\theta$. See Appendix A for a proof (first shown in Huszár, 2016). In the context of SR, the affine projected SR function $g_\theta$ takes the role of the generator. Instead of noise $z$, the generator is now fed low-resolution images $x \sim p_X$. Leaving everything else unchanged, we can deploy the GAN algorithm to minimise $\mathrm{KL}[q_\theta \| p_Y]$. We call this algorithm affine projected GAN, or AffGAN for short. Similarly, we introduce the notation SoftGAN to denote the GAN algorithm without the affine projection, which instead uses an additional soft constraint as in (Garcia, 2016). Note that the difference between the cross-entropy and the KL divergence is the entropy of $q_\theta$: $\mathrm{KL}[q_\theta \| p_Y] = \mathbb{H}[q_\theta, p_Y] - \mathbb{H}[q_\theta]$. Hence, we can expect AffGAN to favour approximate MAP solutions that lead to higher entropy and thus more diverse solutions overall.
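The relation between the KL divergence, the cross-entropy and the entropy can be checked numerically on a small discrete example. The three-state distributions below are hypothetical stand-ins for $p_Y$ and $q_\theta$, chosen only to illustrate the point.

```python
import math


def cross_entropy(q, p):
    """H[q, p] = -sum_i q_i log p_i, skipping states where q_i = 0."""
    return -sum(qi * math.log(pi) for qi, pi in zip(q, p) if qi > 0)


def entropy(q):
    """H[q] = H[q, q]."""
    return cross_entropy(q, q)


def kl(q, p):
    """KL[q || p] = sum_i q_i log(q_i / p_i), skipping states where q_i = 0."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


p = [0.7, 0.2, 0.1]        # stand-in for the data distribution p_Y
q_mode = [1.0, 0.0, 0.0]   # all mass on the mode of p (a pure MAP-like output)
q_match = list(p)          # a diverse output distribution that matches p

identity_gap = abs(kl(q_mode, p) - (cross_entropy(q_mode, p) - entropy(q_mode)))
```

In this sketch the point mass on the mode has the lower cross-entropy, while the distribution matching `p` has the lower KL divergence, which is consistent with the KL objective favouring higher-entropy, more diverse outputs.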
3.2.1 Instance Noise
The theory suggests that GANs should be a convergent algorithm. If a unique optimal discriminator exists and it is reached by optimising $\psi$ to perfection at each step, technically the whole algorithm corresponds to gradient descent on an estimate of $\mathrm{KL}[q_\theta \| p_Y]$ with respect to $\theta$. In practice, however, GANs tend to be highly unstable. So where does the theory go wrong? We think the main reason for the instability of GANs stems from $q_\theta$ and $p_Y$ being concentrated distributions whose supports do not overlap. The distribution of natural images $p_Y$ is often assumed to concentrate on or around a low-dimensional manifold. In most cases, $q_\theta$ is degenerate and manifold-like by construction, such as in AffGAN. Therefore, odds are that, especially before convergence is reached, $q_\theta$ and $p_Y$ can be perfectly separated by several discriminators $D_\psi$, violating a condition for the convergence proof. We try to remedy this problem by adding instance noise to both SR and true image samples. This amounts to minimising the divergence $\mathrm{KL}[p_\sigma * q_\theta \| p_\sigma * p_Y]$, where $p_\sigma * q_\theta$ denotes convolution of $q_\theta$ with the noise distribution $p_\sigma$. The noise level $\sigma$ can be annealed during training, and the noise allows us to safely optimise $D_\psi$ until convergence in each iteration. The trick is related to the one-sided label smoothing introduced by Salimans et al. (2016), however without introducing a bias in the optimal discriminator, and we believe it is a promising technique for stabilising GAN training in general. For more details please see Appendix C.
3.3 Denoiser Guided Super-Resolution
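To optimise the criterion in Eqn. (6) via gradient descent we need its gradient with respect to $\theta$:
$$\frac{\partial}{\partial\theta} \mathbb{H}[q_\theta, p_Y] = -\mathbb{E}_x\left[\frac{\partial}{\partial y} \log p_Y\left(g_\theta(x)\right) \cdot \frac{\partial}{\partial\theta} g_\theta(x)\right] \tag{11}$$
Here $\frac{\partial}{\partial\theta} g_\theta$ are the gradients of the SR function, which can be calculated via backpropagation, whereas $\frac{\partial}{\partial y} \log p_Y$ requires estimation since $p_Y$ is unknown. We use results from (Alain & Bengio, 2014; Särelä & Valpola, 2005) showing that in the limit of infinitesimal Gaussian noise, optimal denoising functions can be used to estimate this gradient:
$$\frac{\partial}{\partial y} \log p_Y(y) = \lim_{\sigma \to 0} \frac{f^*_\sigma(y) - y}{\sigma^2}, \qquad f^*_\sigma = \operatorname*{argmin}_f \; \mathbb{E}_{y \sim p_Y,\, \epsilon} \left\| f(y + \sigma\epsilon) - y \right\|_2^2 \tag{12}$$
where $\epsilon \sim \mathcal{N}(0, I)$ is Gaussian white noise and $f^*_\sigma$ is the Bayes-optimal denoising function for noise level $\sigma$. Using these results we can optimise Eqn. (9) by first training a neural network $f_\sigma$ to denoise samples from $p_Y$ and then backpropagating the gradient estimates from Eqn. (12) via the chain rule in Eqn. (11) to update $\theta$. We call this method AffDG, as it uses the affine subspace projection and is guided by the gradient from the denoising autoencoder (DAE). Similarly, we call the analogous algorithm that soft-enforces Eqn. (5) SoftDG.
3.4 Density Guided Super-Resolution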
As a more direct baseline model for amortised MAP inference we fit a tractable, yet powerful density model to $p_Y$ using maximum likelihood, and then use the cross-entropy with respect to the generative model to approximate Eqn. (9). We use a deep generative model similar to the PixelCNN (Oord et al., 2016) but with a continuous (and differentiable) MCGSM (Theis et al., 2012) likelihood. These types of models are state of the art in density estimation, are relatively fast to evaluate and produce visually interesting samples (Oord et al., 2016). We call this method AffLL, as it uses the affine projection and is guided by the log-likelihood of a density model.
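Returning to the denoiser-based gradient estimate of Section 3.3, the relation between optimal denoising and the gradient of the log-density can be sanity-checked in a case where everything is available in closed form. The sketch below assumes scalar data $y \sim \mathcal{N}(0, s^2)$, an assumption made purely for illustration: the Bayes-optimal denoiser is then the posterior mean $\tilde{y}\, s^2 / (s^2 + \sigma^2)$ and the true gradient of the log-density is $-y / s^2$. The function names are hypothetical.

```python
def optimal_denoiser(y_noisy, data_var, noise_var):
    """Posterior mean E[y | y_noisy] for y ~ N(0, data_var) and
    y_noisy = y + noise, with noise ~ N(0, noise_var)."""
    return y_noisy * data_var / (data_var + noise_var)


def score_from_denoiser(y, data_var, noise_var):
    """Gradient estimate (f*(y) - y) / sigma^2 built from the optimal denoiser."""
    return (optimal_denoiser(y, data_var, noise_var) - y) / noise_var


def true_score(y, data_var):
    """d/dy log N(y; 0, data_var) = -y / data_var."""
    return -y / data_var


y, data_var = 1.5, 4.0
# Absolute error of the estimate for decreasing noise variance sigma^2.
errors = [
    abs(score_from_denoiser(y, data_var, noise_var) - true_score(y, data_var))
    for noise_var in (1.0, 1e-2, 1e-4)
]
```

The error shrinks as the noise variance decreases, in line with the estimate being exact only in the limit of infinitesimal noise. At finite noise level the estimate equals the gradient of the log-density of the noise-smoothed data.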
4 Experiments
We designed our experiments to address the following questions:

Does the affine projection layer hurt the performance of CNNs for image SR? (Section 5.2)
We initially illustrate the behaviour of the proposed algorithms on data where exact MAP inference is computationally tractable. Here the HR data is drawn from a two-dimensional noisy Swiss-roll distribution and the one-dimensional LR data is simply the average of the two HR pixels. Next we tested the proposed algorithms in a series of experiments on natural images using downsampling. For the first dataset, we took random crops from HR images containing grass texture. SR of random textures is known to be very hard using MSE or MAE loss functions. Finally, we tested the proposed models on real image data of faces (CelebA) and natural images (ImageNet). All models were convolutional neural networks implemented using Theano (Team et al., 2016) and Lasagne (Dieleman et al., 2015). We refer to Appendix D for full experimental details.
5 Results and Discussion
5.1 2D MAP inference: Swiss-roll
In this experiment we wanted to demonstrate that AffGAN and AffDG are indeed minimising the MAP objective in Eqn. (9). For this we used the two-dimensional toy problem where $p_Y$ can be evaluated using brute-force Monte Carlo. Figure 1b shows the outputs $g_\theta(x)$ for models trained with different criteria. The AffGAN and AffDG solutions largely fit the dominant mode, similar to MAP inference. For the MSE and MAE models the output generally falls in regions with low prior density. Table 1 shows the cross-entropy $\mathbb{H}[q_\theta, p_Y]$ achieved by different methods, averaged over 10 independent trials with random initialisation. The cross-entropy values for the GAN- and DAE-based models are relatively close to the optimal MAP solution, which in this case we can find in a brute-force way. As expected, the MSE and MAE models perform worse, as these models do not minimise $\mathbb{H}[q_\theta, p_Y]$. We also calculated the average MSE between the network input and the downsampled network output. For the affine projected models, this error is exactly $0$. The soft-constrained models only approximately satisfy this constraint, even after extensive training (Table 1, second column). Further, we observe that the affine projected models generally found a lower cross-entropy when compared to the soft-constrained versions.
5.2 Affine Projected Networks: Proof of Concept using MSE criterion
Adding the affine projection restricts the class of functions that the SR network can model, so it is important to verify that the network is still capable of achieving the same performance in SR as unconstrained CNN architectures. To test this, we trained CNNs with and without affine projections to perform SR on the CelebA dataset using MSE as the objective function. Results are shown in Figure 2. First note that when using affine projections, a randomly initialised network starts learning from a lower initial loss, as the low-frequency components of the network output already match those of the target image. We observed that the affine projected networks generally train faster than unconstrained ones. Furthermore, the affine projected networks tend to find a better solution as measured by MSE and SSIM (Figure 2a-b). To investigate which aspects of the network architecture are responsible for the improved performance, we evaluated two further models. In one variant, we initialise the affine projected CNN to implement the correct projection, but then treat $A^+$ as a trainable parameter. In the final variant, we keep the architecture the same, but initialise the final deconvolution layer randomly and allow it to be trained. We found that initialising $A^+$ to the correct Moore-Penrose inverse is important, and we get similar results irrespective of whether or not it is fixed during training. Figure 2c shows the error between the network input and the downsampled network output. We can see that the exact affine projected network keeps this error at virtually $0$ (up to numerical precision), whereas any other network will violate this consistency. In Figure 2d we show the downsampling kernel $A$ and the corresponding optimal kernel for $A^+$.
5.3 Grass Textures
Random textures are known to be hard to model using an MSE loss function. Figure 3 shows SR of grass texture patches using identical affine projected CNNs trained with different loss functions. When randomly initialised, affine projected CNNs always produce an output with the correct low-frequency components, as illustrated by the third panel in Figure 3. The AffGAN model clearly produces the sharpest images, and we found the images to be plausible given the LR inputs. Notice that the reconstruction is not perfect pixel-by-pixel, but it has the correct statistical properties for the human visual system to recognise it as grass texture. The AffDG and AffLL models both produced blurry results which we were unable to improve upon using various optimisation methods. Due to these findings we chose not to perform any further experiments with these models and concentrate on AffGAN instead. We refer to Appendix E for a discussion of the results of these models.
5.4 CelebA Faces
Figure 5 shows the SR results for several models trained using different loss functions. The MSE-trained models output somewhat generic and over-smoothed images, as expected. For the GAN models the global content is correct for both the affine projected and soft-constrained models. Comparing the AffGAN and SoftGAN outputs, the AffGAN model produces slightly sharper images, which however also seem to contain slightly more high-frequency noise. We observed some colour drifting for the soft-constrained models. Table 5 shows quantitative results for the same four models where, in terms of PSNR and SSIM, the MSE model achieves the best scores, as expected. The consistency between input and output clearly shows that the models using the affine projections satisfy Eqn. (5) better than the soft-constrained versions for both MSE and GAN losses.
5.5 Natural Images
In Figure 6 we show SR results for AffGAN trained on natural images from ImageNet. For most of the images the results are sharp and correspond well with the LR input. However, in some of the images we still see the high-frequency noise present in most GAN results. Interestingly, the snake depicted in the third column is super-resolved into water, which is obviously wrong but still a very plausible image considering the LR input image. Further, water will likely have a higher density under the image prior than snakes, which suggests that the GAN model dreams up reasonable data.
5.6 Criticism and future directions
One argument against MAP inference is that the mode of a distribution is dependent on the representation: transforming a variable through an invertible transformation and performing MAP inference in the transformed space may lead to different answers depending on the transformation. As an extreme example, consider transforming a continuous random scalar $Y$ with its cumulative distribution function, $\tilde{Y} = F_Y(Y)$. The resulting variable $\tilde{Y}$ is uniformly distributed, so any value in the interval $[0, 1]$ can be the mode. Thus, the MAP estimate is not unique if one allows for alternative representations, and there is no guarantee that the MAP estimate in the 24-bit RGB pixel representation which we seek in this paper is in any way special. One may arrive at a different solution when performing MAP estimation in the feature space of a convolutional neural network, or even if merely an alternative colour space is used. Interestingly, AffGAN is more resilient to coordinate transformations: the KL objective implied by Eqn. (10) includes the extra term $\mathbb{H}[q_\theta]$, which is affected by transformations the same way as $\mathbb{H}[q_\theta, p_Y]$.
The second argument relates to the assumption that MAP estimates appear plausible. Although by definition the mode lies in a high-probability region, this does not guarantee that its appearance is anything like that of a random sample. Consider for example data drawn from a $d$-dimensional standard Normal distribution. Due to concentration of measure, as $d$ increases the norm of a typical sample will be approximately $\sqrt{d}$ with very high probability. The mode, however, has a norm of $0$. In this sense, the mode of the distribution is highly atypical. Indeed, human observers can easily tell apart a typical sample from the noise distribution and the mode, but would have a hard time noticing the difference between two random samples. This argument suggests that sampling from the posterior may be a good or even preferable way to obtain plausible reconstructions. In Appendix F we establish a connection between variational inference, such as in variational autoencoders (Kingma & Welling, 2014), and a stochastic version of AffGAN, leaving empirical studies as further work.
6 Conclusion
In this work we developed methods for approximate MAP inference in SR. We first introduced an architectural restriction to neural networks, projecting the model output to the affine subspace of valid solutions. We then proposed three methods, based on GANs, denoising or density models, for amortised MAP inference in SR using this affine projection. In high dimensions we empirically found that the GAN-based approach, AffGAN, produced the most visually appealing results. Our work follows successful demonstrations of GAN-based algorithms for image SR (Ledig et al., 2016), and we provide additional theoretical motivation for why this approach makes sense. In future work we plan to focus on a stochastic extension of AffGAN which can be seen as performing amortised variational inference.
References

Alain & Bengio (2014) Guillaume Alain and Yoshua Bengio. What regularized autoencoders learn from the data-generating distribution. Journal of Machine Learning Research, 15(1):3563–3593, 2014.
Arjovsky & Bottou (2017) Martin Arjovsky and Léon Bottou. Towards principled methods for training generative adversarial networks. In International Conference on Learning Representations, 2017.
Bruna et al. (2016) Joan Bruna, Pablo Sprechmann, and Yann LeCun. Super-resolution with deep convolutional sufficient statistics. In International Conference on Learning Representations, 2016.
Dieleman et al. (2015) Sander Dieleman, Jan Schlüter, Colin Raffel, Eben Olson, Søren Kaae Sønderby, Daniel Nouri, Eric Battenberg, et al. Lasagne: First release, 2015. URL http://dx.doi.org/10.5281/zenodo.27878.
Dong et al. (2016) Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis & Machine Intelligence, pp. 295–307, 2016.
Dosovitskiy & Brox (2016) Alexey Dosovitskiy and Thomas Brox. Generating images with perceptual similarity metrics based on deep networks. arXiv preprint arXiv:1602.02644, 2016.
Garcia (2016) David Garcia. Open source code, retrieved on 22 Sept 2016. URL https://github.com/davidgpu/srez.
Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.
Greff et al. (2016) Klaus Greff, Antti Rasmus, Mathias Berglund, Tele Hotloo Hao, Jürgen Schmidhuber, and Harri Valpola. Tagger: Deep unsupervised perceptual grouping. In Advances in Neural Information Processing Systems, 2016.
Huang et al. (2016) Gao Huang, Zhuang Liu, and Kilian Q Weinberger. Densely connected convolutional networks. arXiv preprint arXiv:1608.06993, 2016.
Huszár (2016) Ferenc Huszár. An alternative update rule for generative adversarial networks. Unpublished note (retrieved on 7 Oct 2016), 2016. URL http://www.inference.vc/analternativeupdateruleforgenerativeadversarialnetworks/.
Kingma & Welling (2014) Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.
Laparra et al. (2016) Valero Laparra, Johannes Ballé, Alexander Berardino, and Eero P Simoncelli. Perceptual image quality assessment using a normalized Laplacian pyramid. In Proc. IS&T Int'l Symposium on Electronic Imaging, Conf. on Human Vision and Electronic Imaging, 2016.
Larsen et al. (2015) Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric. In Proceedings of The 33rd International Conference on Machine Learning, pp. 1558–1566, 2015.
Ledig et al. (2016) Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, and Wenzhe Shi. Photo-realistic single image super-resolution using a generative adversarial network. arXiv preprint arXiv:1609.04802, 2016.
Li & Wand (2016) Chuan Li and Michael Wand. Combining Markov random fields and convolutional neural networks for image synthesis. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
Metzler et al. (2015) Christopher A Metzler, Arian Maleki, and Richard G Baraniuk. Optimal recovery from compressive measurements via denoising-based approximate message passing. In International Conference on Sampling Theory and Applications (SampTA), pp. 508–512, 2015.
Mohamed & Lakshminarayanan (2016) Shakir Mohamed and Balaji Lakshminarayanan. Learning in implicit generative models. arXiv preprint arXiv:1610.03483, 2016.
Nasrollahi & Moeslund (2014) Kamal Nasrollahi and Thomas B. Moeslund. Super-resolution: a comprehensive survey. Machine Vision and Applications, pp. 1423–1468, 2014.
Nowozin et al. (2016) Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. arXiv preprint arXiv:1606.00709, 2016.
Oord et al. (2016) Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. In Proceedings of The 33rd International Conference on Machine Learning, pp. 1747–1756, 2016.
Radford et al. (2015) Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In International Conference on Learning Representations, 2015.
Rasmus et al. (2015) Antti Rasmus, Mathias Berglund, Mikko Honkala, Harri Valpola, and Tapani Raiko. Semi-supervised learning with ladder networks. In Advances in Neural Information Processing Systems, pp. 3546–3554, 2015.
Salimans et al. (2016) Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, 2016.
Särelä & Valpola (2005) Jaakko Särelä and Harri Valpola. Denoising source separation. Journal of Machine Learning Research, pp. 233–272, 2005.
Shi et al. (2016) Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1874–1883, 2016.
Team et al. (2016) The Theano Development Team, Rami Al-Rfou, Guillaume Alain, Amjad Almahairi, Christof Angermueller, Dzmitry Bahdanau, Nicolas Ballas, Frédéric Bastien, Justin Bayer, Anatoly Belikov, et al. Theano: A Python framework for fast computation of mathematical expressions. arXiv preprint arXiv:1605.02688, 2016.
Theis & Bethge (2015) Lucas Theis and Matthias Bethge. Generative image modeling using spatial LSTMs. In Advances in Neural Information Processing Systems, pp. 1927–1935, 2015.
Theis et al. (2012) Lucas Theis, Reshad Hosseini, and Matthias Bethge. Mixtures of conditional Gaussian scale mixtures applied to multiscale image representations. PLoS ONE, 2012.
 Vincent et al. (2008) Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pp. 1096–1103, 2008.
 Wang et al. (2003) Zhou Wang, Eero P Simoncelli, and Alan C Bovik. Multiscale structural similarity for image quality assessment. In Conference Record of the 37th Asilomar Conference on Signals, Systems and Computers, volume 2, pp. 1398–1402, 2003.
 Wang et al. (2004) Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, pp. 600–612, 2004.
Appendix A Generative Adversarial Networks for minimising KL-divergence
First note that for a fixed generator the discriminator maximises:
(13)  
(14)  
(15) 
where is the generative distribution. A function of the form always has maximum at and we find the Bayesoptimal discriminator to be (assuming equal prior class probabilities)
(16) 
Let’s assume that this Bayes-optimal discriminator is unique and can be approximated closely by our neural network (see Appendix C for more discussion of this assumption).
Using the modified update rule proposed here, the combined optimisation problem for the discriminator and generator is
(17) 
Starting from the definition of
(18)  
(19)  
(20) 
which is equal to the terms affecting the generator in Eqn. (17).
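The identity behind this appendix can be checked numerically on a toy example. The sketch below is only an illustration under stated assumptions: the two discrete distributions `p` and `q` are hypothetical stand-ins for the data and generative distributions, and the generator term is assumed to take the form of the expectation, under the generative distribution, of log((1 − D)/D). With the Bayes-optimal discriminator for equal class priors substituted in, that expectation coincides with KL[q‖p].

```python
import numpy as np

# Two hypothetical discrete distributions over the same five outcomes.
p = np.array([0.10, 0.20, 0.40, 0.20, 0.10])  # stand-in for the data distribution
q = np.array([0.30, 0.30, 0.20, 0.10, 0.10])  # stand-in for the generative distribution

# Bayes-optimal discriminator for equal class priors:
# the probability that an outcome was drawn from p rather than q.
d_star = p / (p + q)

# Assumed generator-side term, evaluated with the optimal discriminator:
# the expectation under q of log((1 - D*) / D*).
generator_objective = np.sum(q * np.log((1.0 - d_star) / d_star))

# Direct definition of KL[q || p] for comparison.
kl_q_p = np.sum(q * np.log(q / p))

print(generator_objective, kl_q_p)
```

Since (1 − D*)/D* equals q/p pointwise, the two printed values agree up to floating-point error.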
Appendix B Affine projection
B.1 Numerical estimation of the pseudo-inverse
In practice we implement the downsampling projection as a strided convolution with a fixed Gaussian smoothing kernel where the stride corresponds to the downsampling factor.
is implemented as a transposed convolution operation with parameters optimised numerically via stochastic gradient descent on the following objective function:
(21)  
(22)  
(23) 
where is the dimensional standard normal distribution, and is the dimensionality of LR data . and can be thought of as a Monte Carlo estimate of the spectral norm of the transformations and , respectively. The Monte Carlo formulation above has the advantage that it can be optimised via stochastic gradient descent. The operation can be thought of as a three-layer fully linear convolutional neural network, where corresponds to a strided convolution with fixed kernels, while is a trainable deconvolution. We note that for certain downsampling kernels the exact would have an infinitely large kernel, although it can always be approximated with a local kernel. At convergence we found to be between and depending on the downsampling factor, the width of the Gaussian kernel used for and the filter sizes of and .
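The procedure can be illustrated with dense matrices in place of convolutions. This is a minimal sketch under several assumptions that are not taken from the text: the downsampling operator `A` is a simple factor-2 average pooling matrix rather than a strided Gaussian convolution, the trainable operator is a dense matrix `B`, and the two Monte Carlo terms are assumed to penalise violations of the pseudo-inverse conditions A B A = A and B A B = B on random Gaussian probes. The dimensions, learning rate and batch size are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

hr_dim, lr_dim = 8, 4  # hypothetical HR and LR dimensionalities

# Stand-in downsampling operator: average pooling with factor 2 (an assumption;
# the text uses a strided Gaussian convolution instead).
A = np.zeros((lr_dim, hr_dim))
for i in range(lr_dim):
    A[i, 2 * i:2 * i + 2] = 0.5

B = np.zeros((hr_dim, lr_dim))  # trainable approximation of the pseudo-inverse
learning_rate, batch_size = 0.1, 64

for step in range(1000):
    Y = rng.standard_normal((hr_dim, batch_size))  # random probes in HR space
    X = rng.standard_normal((lr_dim, batch_size))  # random probes in LR space

    # Term 1: mean of ||A B A y - A y||^2, which vanishes when A B A = A.
    R1 = A @ B @ A @ Y - A @ Y
    grad1 = 2.0 * A.T @ R1 @ (A @ Y).T / batch_size

    # Term 2: mean of ||B x - B A B x||^2, which vanishes when B A B = B.
    R2 = B @ X - B @ A @ B @ X
    grad2 = 2.0 * (R2 @ X.T - R2 @ (A @ B @ X).T - (B @ A).T @ R2 @ X.T) / batch_size

    # Plain stochastic gradient step on the sum of both terms.
    B -= learning_rate * (grad1 + grad2)

print(np.linalg.norm(A @ B @ A - A))
```

In this toy setting the learned `B` can be compared against `np.linalg.pinv(A)`; with convolutional operators no such closed-form reference is available, which is why a stochastic objective is convenient.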
B.2 Gradients
The gradients of the affine-projected SR models are derived by applying the chain rule:
(25) 
which is essentially a high-pass filtered version of the gradient of .
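The high-pass interpretation can be made concrete with a small linear example. The sketch below assumes the projected output has the form (I − A⁺A) f + A⁺x, where `A` is a hypothetical random downsampling matrix and `A_pinv` its exact pseudo-inverse; the names and sizes are illustrative only. Under that assumption the Jacobian with respect to the unconstrained output is the null-space projector of `A`, so any gradient passed back through the projection loses every component that would be visible at low resolution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical downsampling matrix (LR dim 4, HR dim 8) and its exact pseudo-inverse.
A = rng.standard_normal((4, 8))
A_pinv = np.linalg.pinv(A)
P_null = np.eye(8) - A_pinv @ A  # projector onto the null space of A


def affine_projection(f_out, x):
    """Assumed form of the projected output: (I - A^+ A) f + A^+ x."""
    return P_null @ f_out + A_pinv @ x


x = rng.standard_normal(4)      # LR input
f_out = rng.standard_normal(8)  # unconstrained network output
y_hat = affine_projection(f_out, x)

# Chain rule: the Jacobian of y_hat with respect to f_out is P_null,
# so an upstream gradient g is mapped to P_null.T @ g.
upstream_grad = rng.standard_normal(8)
grad_wrt_f = P_null.T @ upstream_grad

print(np.allclose(A @ y_hat, x), np.allclose(A @ grad_wrt_f, 0.0))
```

The first check confirms that the projected output is consistent with the LR input, the second that the back-propagated gradient has no low-resolution component.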
Appendix C Instance Noise
GANs are notoriously unstable to train, and several papers exist that try to improve their convergence properties (Salimans et al., 2016; Radford et al., 2015) via various tricks. Consider the following idealised GAN algorithm, each iteration consisting of the following steps:

we extract from an estimate of the logarithmic likelihood ratio

we update by taking a stochastic gradient step with objective function
If and are well-conditioned distributions in a low-dimensional space, this algorithm performs gradient descent on an approximation to the KL divergence, so it should converge. So why is it highly unstable in practical situations?
Crucially, the convergence of this algorithm relies on a few assumptions that don’t always hold: (1) that the log-likelihood-ratio is finite, (2) that the Jensen-Shannon divergence is a well-behaved function of and (3) that the Bayes-optimal solution to the logistic regression problem is unique. We stipulate that in real-world situations none of these holds, mainly because and are concentrated distributions whose support may not overlap. In image modelling, the distribution of natural images is often assumed to be concentrated on or around a lower-dimensional manifold. Similarly, is often degenerate by construction. The odds that the two distributions share support in high-dimensional space, especially early in training, are very small. If and have non-overlapping support, (1) the log-likelihood-ratio and therefore the KL divergence is infinite, (2) the Jensen-Shannon divergence is saturated at its maximum value and is locally constant in , and (3) there may be a large set of near-optimal discriminators whose logistic regression loss is very close to the Bayes optimum, but each of these possibly provides very different gradients to the generator. Thus, training the discriminator might find a different near-optimal solution each time depending on initialisation, even for a fixed and .
The main ways to avoid these pathologies involve making the discriminator’s job harder. For example, in most GAN implementations the discriminator is only partially updated in each iteration, rather than trained until convergence. Another way to cripple the discriminator is adding label noise, or equivalently, one-sided label smoothing as introduced by Salimans et al. (2016). In this technique the labels in the discriminator’s training data are randomly flipped. However, we do not believe these techniques adequately address all of the concerns described above.
In Figure 7a we illustrate two almost perfectly separable distributions. Notice how the large gap between the distributions means that there is a large number of possible classifiers that tell the two distributions apart and achieve similar logistic loss. The Bayes-optimal classifier may not be unique, and the set of near-optimal classifiers is very large and diverse. In Figure 7b we show the effect of one-sided label smoothing or, equivalently, adding label noise. In this technique, the labels of some real data samples are flipped so the discriminator is trained thinking they were samples from . The discriminator indeed has a harder task now, but all classifiers are penalised almost equally. As a result, there is still a large set of discriminators which achieve near-optimal loss; it’s just that the near-optimal loss is now larger. Label smoothing does not help if the Bayes-optimal classifier is not unique.
Instead we propose to add noise to the samples, rather than labels, which we denote instance noise. Using instance noise the support of the two distributions is broadened and they are no longer perfectly separable, as illustrated in Figure 7c. Adding noise, the Bayes-optimal discriminator becomes unique, the discriminator is less prone to overfitting because it has a wider training distribution, and the log-likelihood-ratio becomes better behaved. The Jensen-Shannon divergence between the noisy distributions is now a non-constant function of .
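A toy calculation, not part of the original experiments, shows how noise repairs the divergence. Two point masses at different locations have disjoint support, so the KL divergence between them is infinite. Once both are convolved with the same Gaussian of standard deviation σ they become two Gaussians with equal variance, for which the KL divergence has the closed form (Δμ)² / (2σ²). The sketch below checks this numerically; the function name, locations and noise levels are hypothetical.

```python
import numpy as np


def kl_between_smoothed_points(mu_p, mu_q, sigma, grid_size=20001):
    """Numerically integrate KL[q_sigma || p_sigma] for two point masses at
    mu_q and mu_p, each convolved with Gaussian noise N(0, sigma^2)."""
    lo = min(mu_p, mu_q) - 10.0 * sigma
    hi = max(mu_p, mu_q) + 10.0 * sigma
    grid = np.linspace(lo, hi, grid_size)
    dx = grid[1] - grid[0]
    # Work with log-densities to avoid 0 * log(0) in the tails.
    log_norm = -0.5 * np.log(2.0 * np.pi * sigma ** 2)
    log_q = log_norm - 0.5 * ((grid - mu_q) / sigma) ** 2
    log_p = log_norm - 0.5 * ((grid - mu_p) / sigma) ** 2
    return np.sum(np.exp(log_q) * (log_q - log_p)) * dx


sigmas = [0.1, 0.5, 1.0]
kls = [kl_between_smoothed_points(0.0, 1.0, s) for s in sigmas]
closed_form = [1.0 / (2.0 * s ** 2) for s in sigmas]
print(kls, closed_form)
```

The divergence is finite for every positive noise level and shrinks as the noise grows, which is the behaviour that makes annealing the noise level a natural choice.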
Using instance noise, it is easy to construct an algorithm that minimises the following divergence:
(26) 
where is the parameter of the noise distribution. Logistic regression on the noisy samples provides an estimate of . When updating the generator we have to minimise the mean of on noisy samples from . We know that, if is Gaussian, is a Bregman divergence, and that it is zero if and only if the two distributions are equal. Because of the added noise, is less sensitive to local features of the distribution. We found that in our experiments instance noise helped the convergence of AffGAN. We have not tested the instance noise in the generative modelling application. Because we don’t have to worry about overtraining the discriminator, we can train it until convergence, or take more gradient steps between subsequent updates to the generator. One critical hyperparameter of this method is the noise distribution. We used additive Gaussian noise, whose variance we annealed during training. We propose a heuristic annealing schedule where the noise is adapted so as to keep the optimal discriminator’s loss constant during training. It is possible that other noise distributions, such as heavy-tailed or spike-and-slab, would work better, but we have not investigated these options.
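In code, instance noise amounts to perturbing both the real and the generated batch before they reach the discriminator. The following is a minimal sketch that assumes additive Gaussian noise with a linearly annealed standard deviation; the function names, batch shapes and the start and end values are hypothetical, and the loss-adaptive heuristic schedule mentioned above is not implemented here.

```python
import numpy as np


def annealed_std(step, total_steps, std_start, std_end):
    """Linearly interpolate the noise standard deviation over training."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return std_start + frac * (std_end - std_start)


def add_instance_noise(batch, std, rng):
    """Add i.i.d. Gaussian noise to every element of a batch of samples."""
    return batch + std * rng.standard_normal(batch.shape)


rng = np.random.default_rng(0)
real_batch = rng.uniform(size=(16, 3, 32, 32))  # placeholder for real HR images
fake_batch = rng.uniform(size=(16, 3, 32, 32))  # placeholder for generator outputs

std = annealed_std(step=500, total_steps=1000, std_start=1.0, std_end=0.1)
noisy_real = add_instance_noise(real_batch, std, rng)
noisy_fake = add_instance_noise(fake_batch, std, rng)
# Both noisy batches would then be passed to the discriminator in place of the clean ones.
```

The important detail is that the same noise distribution is applied to both sets of samples, so the discriminator compares two smoothed distributions rather than the original ones.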
Appendix D Experimental details
Loss functions
For the GAN models the generative and discriminative parameters were updated using Eqn. (10). For the models enforcing Eqn. (5) using a soft constraint, we added an extra MAE loss term for the generative parameters , where i runs over the data samples .
The denoiser-guided models were trained in a two-step procedure. Initially we pretrained a DAE to denoise samples from the data distribution by minimising
(27)  
During training we anneal the noise level and continuously save the model parameters of the DAE trained at increasingly smaller noise levels. We then learn the parameters of the generator by following the gradient in Eqn. (11) using the DAE to estimate
(28)  
(29)  
(30) 
where is the learning rate. During training we continuously load the parameters of DAEs trained at increasingly low noise levels. At the beginning of training this gives gradients that point in the approximately correct direction while covering a large region of data space, and at the end of training it gives precise gradients close to the data manifold.
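The reason a denoiser can supply gradients is the standard relationship between an MSE-optimal Gaussian denoiser and the score of the noise-smoothed density: the denoising residual divided by the noise variance equals the gradient of the log-density. Eqn. (11) is not reproduced in this section, so the sketch below assumes that standard form and checks it on a one-dimensional Gaussian toy problem, where the optimal denoiser is known in closed form. The values of `data_std` and `noise_std` are hypothetical.

```python
import numpy as np

data_std, noise_std = 2.0, 0.5  # hypothetical data and corruption scales


def optimal_denoiser(y):
    """MSE-optimal denoiser (posterior mean) for data ~ N(0, data_std^2)
    corrupted with additive noise ~ N(0, noise_std^2)."""
    return y * data_std ** 2 / (data_std ** 2 + noise_std ** 2)


def score_from_denoiser(y):
    """Score estimate from the denoising residual: (denoised - noisy) / noise variance."""
    return (optimal_denoiser(y) - y) / noise_std ** 2


def true_score(y):
    """Gradient of the log-density of the smoothed data, N(0, data_std^2 + noise_std^2)."""
    return -y / (data_std ** 2 + noise_std ** 2)


y = np.linspace(-3.0, 3.0, 7)
print(score_from_denoiser(y))
print(true_score(y))
```

A trained DAE only approximates the optimal denoiser, so in practice the resulting gradient is an estimate whose accuracy depends on the noise level, consistent with the annealing described above.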
For the density-guided models we first pretrain a density model by maximising the tractable log-likelihood
(31) 
where the joint density has been decomposed using the chain rule and runs over the pixels. Similar to the DAE, we continuously save the parameters of the density model during training. We then learn the parameters of the generator by directly minimising the negative log-likelihood of the generated samples under the learned density model:
(32) 
2D Swiss roll
The 2D target data was sampled from the 2D Swiss roll defined as:
(33)  
(34)  
(35) 
where , , and . The LR input was defined as . The cross-entropies were calculated by estimating the probability density function using a Gaussian kernel density estimator fitted to samples from a noiseless Swiss roll density, i.e. , and setting the bandwidth of each kernel to . All generators and discriminators were two-layer fully connected NNs with 64 units in each layer. For the AffDG model the DAE was a two-layer NN with 256 units in each layer, trained while annealing the standard deviation of the Gaussian noise from to .
Image data
For all image experiments we set to a convolution using a Gaussian smoothing kernel of size using a stride of corresponding to downsampling. was set to a convolution operation with kernels of size followed by a reordering of the pixels, with the output corresponding to upsampling convolution as described in Shi et al. (2016). The parameters of were optimised numerically as described in Appendix B. All downsampling was done using the projection. For all image models we used convolutional models using ReLU nonlinearities and batch normalization in all layers except the output. All generators used skip connections similar to Huang et al. (2016), and a final sigmoid nonlinearity was applied to the output of the model, which was either used directly or fed through the affine transformation layers parameterised by and . The discriminators were standard convolutional networks followed by a final sigmoid layer.
For the grass texture experiments we used randomly extracted patches of data from high resolution grass texture images. The generators used 6 layers of convolutions with 32, 32, 64, 64, 128 and filter maps and skip connections after every second layer. The discriminators had four layers of strided convolutions with 32, 64, 128 and 256 filter maps. For the AffDG model the DAE was a four-layer convolutional network with 128 filter maps in each layer, trained while annealing the standard deviation of the Gaussian noise from to . The density model was implemented as a PixelCNN similar to Oord et al. (2016) with four layers of convolution with 64 filter maps with kernel sizes of 5, except for the first layers which used 7. The original PixelCNN uses a non-differentiable categorical distribution as the likelihood model, which is why it cannot be used for gradient-based optimisation. Instead we used an MCGSM as the likelihood model (Theis & Bethge, 2015), which has been shown to be a good density model for images (Theis et al., 2012), using 32 mixture components and 32 quadratic features to approximate the covariance matrices.
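The pixel reordering used for the upsampling operator in the image experiments above is the sub-pixel rearrangement of Shi et al. (2016): a convolution produces r² feature maps per output channel at low resolution, and these are interleaved into a single map that is r times larger in each spatial dimension. The NumPy sketch below shows only that rearrangement step for one image; the channel-major layout and the function name `pixel_shuffle` are assumptions for illustration, not details taken from the original implementation.

```python
import numpy as np


def pixel_shuffle(x, r):
    """Rearrange an array of shape (C * r * r, H, W) into (C, H * r, W * r).

    Output pixel (c, h * r + i, w * r + j) is taken from input channel
    c * r * r + i * r + j at spatial position (h, w).
    """
    c_r2, h, w = x.shape
    assert c_r2 % (r * r) == 0, "channel count must be divisible by r^2"
    c = c_r2 // (r * r)
    x = x.reshape(c, r, r, h, w)    # axes: (c, i, j, h, w)
    x = x.transpose(0, 3, 1, 4, 2)  # axes: (c, h, i, w, j)
    return x.reshape(c, h * r, w * r)


# Example: 8 low-resolution feature maps become 2 maps at twice the resolution.
features = np.arange(8 * 3 * 3, dtype=float).reshape(8, 3, 3)
upsampled = pixel_shuffle(features, 2)
print(upsampled.shape)
```

Because the rearrangement is a fixed permutation, all trainable parameters of the upsampling operator sit in the convolution that precedes it.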
For the CelebA experiments the dataset was split into train, validation and test sets using the standard split. All images were center cropped and resized to before downsampling to using . All generators were 12-layer convolutional networks with four layers each of 128, 256 and 512 filter maps and skip connections between every fourth layer. The discriminators were 8-layer convolutional networks with two layers each of 128, 256, 512 and 1024 filter maps, using a stride of 2 for every second layer.
For the ImageNet experiments the 2012 dataset was randomly split into train, validation and test sets with samples in the test and validation sets. All images below 20kB were then discarded to remove images with too low resolution. The images were center cropped and resized to before downsampling to using . The generator was an 8-layer convolutional network with 4 layers each of 128 and 256 filter maps and skip connections between every second layer. The discriminators were 8-layer convolutional networks with two layers each of 128, 256, 512 and 1024 filter maps, using a stride of 2 for every second layer. To stabilise training we used Gaussian instance noise linearly annealed from an initial standard deviation of to . We were unable to train models stably without this extra regularisation.
Appendix E Additional results for Denoiser and Density guided Super-resolution
Figure 8 shows the PSNR and SSIM scores during training for the AffDG and AffLL models trained on the grass textures. Note that the models are converging, but as seen in Figure 3 the images are very blurry. For both models we had problems with diverging training. For the DAE models with high noise levels the gradients are only approximately correct but cover a large space around the data manifold, whereas for small noise levels the gradients are more accurate in a small space around the data manifold. For the density model we believe a similar phenomenon is making the training diverge, since for accurate density models the estimated density is likely very peaked around the data manifold, making learning in the beginning of training difficult. To resolve these issues we started training using models with high noise levels or low log-likelihood values and then loaded model parameters during training with continuously smaller noise levels or better log-likelihood values. The effect of this can be clearly seen during training as the step-like behaviour of the AffDG in Figure 8. We note that the density model used for training the AffLL achieved a log-likelihood of bits per dimension, which is comparable to values obtained in Theis & Bethge (2015) on a texture dataset. Further, the AffLL model achieved high log-likelihood values under this model, suggesting that the density model is simply not providing an accurate enough representation of to provide precise scores for training the AffLL model.
Appendix F Amortised variational inference using AffGAN
Here we’ll show that a stochastic extension of the AffGAN model approximately minimises an amortised variational inference criterion, as in e.g. variational autoencoders, which for the first time establishes a connection between adversarial methods of inference and variational inference. We introduce a variant of AffGAN where, in addition to the LR data , the generator function also takes as input some independent noise variables :
(36)  
(37) 
Similarly to how we defined in Section 3.1 we introduce the following notation:
(38)  
(39)  
(40) 
Here the affine projection ensures that under , and are always consistent. Therefore, under , the conditional of given is the same as the likelihood by construction and the following equality holds:
(41) 
Applying Bayes’ rule to and substituting into the above equality we get:
(42) 
The divergence that the AffGAN objective minimises can now be rewritten as:
(43)  
(44)  
(45)  
(46) 
Therefore we can conclude that the AffGAN algorithm described in Section 3.2 approximately minimises the following amortised variational inference criterion:
(47) 
and in doing so it only requires samples from and .