TensorFlow implementation of pixel-recursive-super-resolution (Google Brain paper: https://arxiv.org/abs/1702.00783)
We present a pixel recursive super resolution model that synthesizes realistic details into images while enhancing their resolution. A low resolution image may correspond to multiple plausible high resolution images, thus modeling the super resolution process with a pixel independent conditional model often results in averaging different details--hence blurry edges. By contrast, our model is able to represent a multimodal conditional distribution by properly modeling the statistical dependencies among the high resolution image pixels, conditioned on a low resolution input. We employ a PixelCNN architecture to define a strong prior over natural images and jointly optimize this prior with a deep conditioning convolutional network. Human evaluations indicate that samples from our proposed model look more photo realistic than a strong L2 regression baseline.
The problem of super resolution entails artificially enlarging a low resolution photograph to recover a corresponding plausible image with higher resolution. When only a small magnification is desired, super resolution techniques achieve satisfactory results [41, 8, 16, 39, 22] by building statistical prior models of images [35, 2, 51] that capture low-level characteristics of natural images.
This paper studies super resolution with particularly small inputs and large magnification ratios, where the amount of information available to accurately construct a high resolution image is very limited (Figure 1, left column). Thus, the problem is underspecified and many plausible high resolution images may match a given low resolution input image. Building improved models for the high magnification regime is significant for improving the state of the art in super resolution, and more generally for building better conditional generative models of images [44, 33, 30, 43].
As the magnification ratio increases, a super resolution model must not only account for textures, edges, and other low-level statistics [16, 39, 22], but must increasingly account for complex variations of objects, viewpoints, illumination, and occlusions. At increasing levels of magnification, the details no longer exist in the source image, and the predictive challenge shifts from recovering details (e.g., deconvolution) to synthesizing plausible novel details de novo [33, 44].
Consider the low resolution image of a face in Figure 1, left column. In such low resolution images the fine spatial details of the hair and the skin are missing and cannot be faithfully restored with interpolation techniques. However, by incorporating prior knowledge of faces and their typical variations, a sketch artist might be able to imagine and draw believable details using specialized software packages.
In this paper, we show how a fully probabilistic model that is trained end-to-end using a log-likelihood objective can play the role of such an artist by synthesizing the face images depicted in Figure 1, middle column. We find that drawing multiple samples from this model produces high resolution images that exhibit multi-modality, resembling the diversity of images that plausibly correspond to a low resolution image. In human evaluation studies we demonstrate that naive human observers can easily distinguish real images from the outputs of sophisticated super resolution models using deep networks and mean squared error (MSE) objectives. However, samples drawn from our probabilistic model are able to fool a human observer up to 27.9% of the time, compared to a chance rate of 50%.
In summary, the main contributions of the paper include:
Characterization of the underspecified super resolution problem in terms of multi-modal prediction.
Proposal of a new probabilistic model tailored to the super resolution problem, which produces diverse, plausible non-blurry high resolution samples.
Proposal of a new loss term for conditional probabilistic models with powerful autoregressive decoders to prevent the conditioning signal from being ignored.
Human evaluation demonstrating that traditional metrics in super resolution (e.g., pSNR and SSIM) fail to capture sample quality in the regime of underspecified super resolution.
We proceed by describing related work, followed by explaining how the multi-modal problem is not addressed using traditional objectives. Then, we propose a new probabilistic model building on top of ResNet and PixelCNN. The paper highlights the diversity of high resolution samples generated by the model and demonstrates the quality of the samples through human evaluation studies.
Super resolution has a long history in computer vision. Methods relying on interpolation are easy to implement and widely used. However, these methods suffer from a lack of expressivity since linear models cannot express complex dependencies between the inputs and outputs. In practice, such methods often fail to adequately predict high frequency details, leading to blurry high resolution outputs.
Enhancing linear methods with rich image priors such as sparsity or Gaussian mixtures has substantially improved the quality of the methods; likewise, leveraging low-level image statistics such as edge gradients improves predictions [47, 41, 8, 16, 39, 22]. Much work has been done on algorithms that search a database of patches and combine them to create plausible high frequency details in zoomed images [9, 17]. Recent patch-based work has focused on improving basic interpolation methods by building a dictionary of pre-learned filters on images and selecting the appropriate patches by an efficient hashing mechanism. Such dictionary methods have improved the inference speed while being comparable to the state of the art.
Another approach for super resolution is to abandon inference speed requirements and focus on constructing the high resolution images at increasingly higher magnification factors. Convolutional neural networks (CNNs) represent an approach to the problem that avoids explicit dictionary construction, and instead implicitly extracts multiple layers of abstractions by learning layers of filter kernels. Dong et al. employed a three layer CNN with MSE loss. Kim et al. improved accuracy by increasing the network depth and learning only the residuals between the high resolution image and an interpolated low resolution image. Most recently, SRResNet uses many ResNet blocks to achieve state of the art pSNR and SSIM on standard super resolution benchmarks. We employ a similar design for our conditional network and catchall regression baseline.
Instead of using a per-pixel loss, Johnson et al. use the Euclidean distance between activations of a pre-trained CNN for the model's predictions vs. ground truth images. Using this so-called perceptual loss, they train feed-forward networks for super resolution and style transfer. Bruna et al. also use a perceptual loss to train a super resolution network, but inference is done via gradient propagation to the low-res input.
Another promising direction has been to employ an adversarial loss for training a network. A super-resolution network is trained in opposition to a secondary network that attempts to discriminate whether a synthesized high resolution image is real or fake. Networks trained with traditional losses (e.g., [21, 7]) suffer from blurry images, whereas networks employing an adversarial loss predict compelling, high frequency detail [26, 49]. Sønderby et al. employed networks trained with adversarial losses but constrained the network to learn affine transformations that ensure the model only generates images that downscale back to the low resolution inputs. Sønderby et al. also explore a masked autoregressive model but without the gated layers and using a mixture of Gaussians instead of a multinomial distribution. Denton et al. use a multi-scale adversarial network for image synthesis that is amenable to super-resolution tasks.
Although generative adversarial networks (GANs) provide a promising direction, such networks suffer from several drawbacks. First, training an adversarial network is unstable, and many methods are being developed to increase the robustness of training. Second, GANs suffer from a common failure case of mode collapse, whereby the resulting model produces samples that do not capture the diversity of samples available in the training data. Finally, tracking the performance of adversarial networks is challenging because it is difficult to associate a probabilistic interpretation to their results. These points motivate tackling the problem with a distinct approach that can cover the full diversity of the training dataset.
PixelRNN and PixelCNN [43, 44] are probabilistic generative models that impose an order on image pixels in order to represent them as a long sequence. The probability of subsequent pixels is conditioned on previously observed pixels. One variant of PixelCNN obtained state-of-the-art predictive ability in terms of log-likelihood on academic benchmarks such as CIFAR-10 and MNIST. Since PixelCNN uses log-likelihood for training, the model is penalized if negligible probability is assigned to any of the training examples. By contrast, adversarial networks only learn enough to fool a non-stationary discriminator. This latter point suggests that a PixelCNN might be able to predict a large diversity of high resolution images that might be associated with a given low resolution image. Further, using log-likelihood as the training objective allows for hyperparameter search to find models within a model family by simply comparing their log probabilities on a validation set.
We aim to learn a probabilistic super resolution model that discerns the statistical dependencies between a high resolution image and a corresponding low resolution image. Let $\mathbf{x}$ and $\mathbf{y}$ denote a low resolution and a high resolution image, and let $\mathbf{y}^*$ represent a ground-truth high resolution image. In order to learn a parametric model of $p_\theta(\mathbf{y} \mid \mathbf{x})$, we exploit a large dataset of pairs of low resolution inputs and ground-truth high resolution outputs, denoted $\mathcal{D} \equiv \{(\mathbf{x}^{(i)}, \mathbf{y}^{*(i)})\}_{i=1}^{N}$. One can easily collect such a large dataset by starting from some high resolution images and lowering the resolution as much as needed. To optimize the parameters $\theta$ of the conditional distribution $p$, we maximize a conditional log-likelihood objective defined as,
$$O(\theta \mid \mathcal{D}) = \sum_{(\mathbf{x}, \mathbf{y}^*) \in \mathcal{D}} \log p(\mathbf{y}^* \mid \mathbf{x}). \tag{1}$$
The key problem discussed in this paper is the exact form of $p(\mathbf{y} \mid \mathbf{x})$ that enables efficient learning and inference, while generating realistic non-blurry outputs. We first discuss pixel-independent models that assume that each output pixel is generated with an independent stochastic process given the input. We elaborate why these techniques result in sub-optimal blurry super resolution results. Then, we describe our pixel recursive super resolution model that generates output pixels one at a time to enable modeling the statistical dependencies between the output pixels, resulting in sharp synthesized images given very low resolution inputs.
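The simplest form of a probabilistic super resolution model assumes that the output pixels are conditionally independent given the inputs. As such, the conditional distribution $p(\mathbf{y} \mid \mathbf{x})$ factors into a product of independent pixel predictions. Suppose an RGB output $\mathbf{y}$ has $M$ pixels each with three color channels, i.e., $\mathbf{y} \in \mathbb{R}^{3M}$. Then,
$$\log p(\mathbf{y} \mid \mathbf{x}) = \sum_{i=1}^{3M} \log p(y_i \mid \mathbf{x}). \tag{2}$$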
Two general forms of pixel prediction models have been explored in the literature: Gaussian and multinomial distributions, which model continuous and discrete pixel values respectively. In the Gaussian case,
$$\log p(y_i \mid \mathbf{x}) = -\frac{1}{2\sigma^2} \lVert y_i - C_i(\mathbf{x}) \rVert_2^2 - \log \sqrt{2\pi\sigma^2}, \tag{3}$$
where $C_i(\mathbf{x})$ denotes the $i$-th element of a non-linear transformation of $\mathbf{x}$ via a convolutional neural network. Accordingly, $C_i(\mathbf{x})$ is the estimated mean for the $i$-th output pixel $y_i$, and $\sigma^2$ denotes the variance. Often the variance is not learned, in which case maximizing the conditional log-likelihood of (1) reduces to minimizing the MSE between $y_i$ and $C_i(\mathbf{x})$ across the pixels and channels throughout the dataset. Super resolution models based on MSE regression fall within this family of pixel independent models [7, 21, 26]. Implicitly, the outputs of a neural network parameterize a set of Gaussians with fixed variance. It is easy to verify that the joint distribution $p(\mathbf{y} \mid \mathbf{x})$ is unimodal as it forms an isotropic multi-variate Gaussian.
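The link between a fixed-variance Gaussian likelihood and MSE can be checked numerically. The following is a minimal sketch, not the paper's code: the function names and the toy pixel values are assumptions made for illustration. It shows that the summed Gaussian log-likelihood is an affine, decreasing function of the MSE, so ranking predictions by likelihood is the same as ranking them by MSE.

```python
import math


def gaussian_log_likelihood(targets, means, sigma=1.0):
    """Sum of independent Gaussian log-densities with one shared, fixed sigma."""
    var = sigma ** 2
    return sum(
        -((y - m) ** 2) / (2.0 * var) - 0.5 * math.log(2.0 * math.pi * var)
        for y, m in zip(targets, means)
    )


def mse(targets, means):
    """Mean squared error over all output dimensions."""
    return sum((y - m) ** 2 for y, m in zip(targets, means)) / len(targets)


def log_likelihood_from_mse(mse_value, num_dims, sigma=1.0):
    """Same quantity as gaussian_log_likelihood, written in terms of the MSE."""
    var = sigma ** 2
    return -num_dims * mse_value / (2.0 * var) - num_dims * 0.5 * math.log(2.0 * math.pi * var)


# Hypothetical pixel intensities in [0, 1] and two candidate predictions.
targets = [0.2, 0.8, 0.5]
close_prediction = [0.1, 0.9, 0.4]
far_prediction = [0.6, 0.3, 0.0]
```

Because the variance only contributes a scale and an additive constant, it has no effect on which prediction maximizes the likelihood.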
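Alternatively, one could discretize the output dimensions into $K$ possible values, and use a multinomial distribution as the predictive model for each pixel $y_i$, where $y_i \in \{1, \ldots, K\}$. The pixel prediction model based on a multinomial softmax operator is represented as,
$$\log p(y_i = k \mid \mathbf{x}) = \mathbf{w}_{jk}^{\top} C_i(\mathbf{x}) - \log \sum_{v=1}^{K} \exp\{\mathbf{w}_{jv}^{\top} C_i(\mathbf{x})\}, \tag{4}$$
where a network with a set of softmax weights, $\{\mathbf{w}_{jk}\}$, for each value $k$ per color channel $j$ is used to induce $\log p(y_i \mid \mathbf{x})$, and $C_i(\mathbf{x})$ here denotes the network's feature vector for output dimension $i$. Even though $p(y_i \mid \mathbf{x})$ in (4) can express multimodal distributions, the conditional dependency between the pixels cannot be captured, i.e., the model cannot choose between drawing an edge at one position vs. another since that requires coordination between the samples.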
[Figure 2 panels: how the dataset was created (top); samples from trained models.]
To demonstrate how pixel independent models fail at conditional image modeling, we create a synthetic dataset that explicitly requires multimodal prediction. For many dense image prediction tasks, e.g., super resolution, colorization [50, 6], and depth estimation, models that are able to predict a single mode are heavily preferred over models that blend modes together. For example, in the task of colorization, selecting a strong red or green for an apple is better than selecting a brown-toned color that reflects the smeared average of all of the apple colors observed in the training set.
We construct a simple multimodal MNIST corners dataset to demonstrate the challenge of this problem. MNIST corners is constructed by randomly placing an MNIST digit in either the top-left or bottom-right corner (Figure 2, top). Several networks are trained to predict individual samples from this dataset to demonstrate the unique challenge of this simple example.
The challenge behind this toy example is for a network to exclusively predict an individual digit in a corner of an image. Training a moderate-sized 10-layer convolutional neural network with an $L_2$ objective (i.e., MSE regression) results in blurry image samples in which the two modes are blended together (Figure 2, regression). That is, never in the dataset does an example image contain a digit in both corners, yet this model incorrectly predicts a blend of such samples. Replacing the $L_2$ loss with a discrete, per-pixel cross-entropy produces sharper images but likewise fails to stochastically predict a digit in a single corner of the image (Figure 2, cross-entropy).
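The failure described above can be reproduced on a far smaller problem. The sketch below is an illustrative stand-in for MNIST corners, not the paper's experiment: it assumes a two-pixel binary "image" whose single lit pixel plays the role of the digit, and the function names are hypothetical. It shows that the MSE-optimal prediction is the blended mean, and that sampling each pixel independently from its marginal puts half of the probability mass on images that never occur in the data.

```python
from itertools import product

# Toy stand-in for MNIST corners: the "digit" is either in the first
# position or in the second position, never in both and never in neither.
dataset = [(1, 0), (0, 1)]


def mse_optimal_prediction(data):
    """The constant prediction minimizing MSE is the per-pixel mean."""
    n = len(data)
    return tuple(sum(img[d] for img in data) / n for d in range(len(data[0])))


def independent_joint(data):
    """Joint distribution implied by independent per-pixel marginals."""
    n, dims = len(data), len(data[0])
    marginals = [sum(img[d] for img in data) / n for d in range(dims)]
    joint = {}
    for img in product((0, 1), repeat=dims):
        prob = 1.0
        for d, value in enumerate(img):
            prob *= marginals[d] if value == 1 else 1.0 - marginals[d]
        joint[img] = prob
    return joint


blend = mse_optimal_prediction(dataset)
joint = independent_joint(dataset)
invalid_mass = sum(p for img, p in joint.items() if img not in dataset)
```

Here `blend` is the blurry average of both modes, and `invalid_mass` is the probability that an independent per-pixel sampler draws an image with a digit in both corners or in neither.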
The fact that the predicted pixels are not conditionally independent is a significant failure mode for the previous probabilistic objectives in the synthetic example (Equations 3 and 4), since both objectives assume such independence. One approach to this problem is to define the conditional distribution of the output pixels jointly as a multivariate Gaussian mixture or an undirected graphical model. Both of these conditional distributions require constructing a statistical dependency between output pixels for which inference may be computationally expensive.
A second approach is to factorize the joint distribution using the chain rule by imposing an order on image pixels,
$$\log p(\mathbf{y} \mid \mathbf{x}) = \sum_{i=1}^{3M} \log p(y_i \mid \mathbf{x}, \mathbf{y}_{<i}), \tag{5}$$
where the generation of each output dimension is conditioned on the input and previous output pixels [24, 42]. We denote the conditioning up to pixel $i$ by $\mathbf{y}_{<i}$ where $\mathbf{y}_{<i} = \{y_1, \ldots, y_{i-1}\}$. The benefits of this approach are that the exact form of the conditional dependencies is flexible and the inference is straightforward.
(Footnote 1: Note that in color images one must impose an order on both spatial locations as well as color channels. In a color image the conditioning is based on the input and previously outputted pixels at previous spatial locations as well as pixels at the same spatial location.)
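To see why the chain-rule factorization helps, the sketch below fits count-based conditional tables to a two-pixel toy dataset and evaluates the resulting joint probability. This is an assumption-laden illustration, not a PixelCNN: the tables are plain counts, there is no low resolution input, and the names `fit_autoregressive` and `joint_probability` are hypothetical.

```python
from collections import Counter, defaultdict

# Toy bimodal data: exactly one of the two positions is lit.
dataset = [(1, 0), (0, 1)]


def fit_autoregressive(data):
    """Count how often each value follows each prefix y_<i."""
    tables = defaultdict(Counter)
    for img in data:
        for i, value in enumerate(img):
            tables[img[:i]][value] += 1
    return tables


def joint_probability(tables, img):
    """Chain rule: multiply p(y_i | y_<i) over the imposed pixel order."""
    prob = 1.0
    for i, value in enumerate(img):
        counts = tables.get(img[:i])
        if not counts:
            return 0.0  # prefix never observed during fitting
        prob *= counts[value] / sum(counts.values())
    return prob


tables = fit_autoregressive(dataset)
```

Because the second pixel is predicted given the first, all probability mass lands on the two observed modes, which an independent per-pixel model cannot achieve.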
PixelCNN is a stochastic model that provides an explicit model for $\log p(y_i \mid \mathbf{x}, \mathbf{y}_{<i})$ as a gated, hierarchical chain of cleverly masked convolutions [43, 44, 36]. The goal of PixelCNN is to capture multi-modality and pixel correlations in an image. Indeed, training a PixelCNN on the MNIST corners dataset successfully captures the bimodality of the problem and produces samples in which digits reside exclusively in a single corner (Figure 2, PixelCNN). Importantly, the model never predicts both digits simultaneously.
Applying the PixelCNN to a super-resolution problem is a straightforward application that requires modifying the architecture to supply a conditioning on a low resolution version of the image. In early experiments we found that the auto-regressive distribution of the model largely ignored the conditioning on the low resolution image. This phenomenon, referred to as “optimization challenges”, has been readily documented in the context of sequential autoencoder models (see also [38, 40] for more discussion).
To address this issue we modify the architecture of PixelCNN to more explicitly depend on the conditioning of a low resolution image. In particular, we propose a late fusion model that factors the problem into auto-regressive and conditioning components (Figure 3). The auto-regressive portion of the model, termed a prior network, captures the serial dependencies of the pixels, while the conditioning component, termed a conditioning network, captures the global structure of the low resolution image. Specifically, we formulate the prior network to be a PixelCNN and the conditioning network to be a deep convolutional network employed previously for super resolution.
Given an input $\mathbf{x}$, let $A_i(\mathbf{x})$ denote a conditioning network predicting a vector of logit values corresponding to the $K$ possible values that the $i$-th output pixel can take. Similarly, let $B_i(\mathbf{y}_{<i})$ denote a prior network predicting a vector of logit values for the $i$-th output pixel. Our probabilistic model predicts a distribution over the $i$-th output pixel by simply adding the two sets of logits and applying a softmax operator on them,
$$p(y_i \mid \mathbf{x}, \mathbf{y}_{<i}) = \mathrm{softmax}\big(A_i(\mathbf{x}) + B_i(\mathbf{y}_{<i})\big). \tag{6}$$
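The late fusion step, adding the conditioning logits to the prior logits and applying a softmax, can be sketched in a few lines. This is a minimal illustration with made-up logit values; `fused_distribution` is a hypothetical name and no network is involved.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fused_distribution(cond_logits, prior_logits):
    """Add conditioning and prior logits, then normalize with a softmax."""
    return softmax([a + b for a, b in zip(cond_logits, prior_logits)])


# Hypothetical logits for one output dimension with K = 3 possible values.
cond_logits = [1.0, 0.0, -1.0]
prior_logits = [0.5, 2.0, 0.0]
fused = fused_distribution(cond_logits, prior_logits)
```

One way to read this design: adding logits is equivalent to multiplying the two networks' individual softmax distributions and renormalizing, so a value is likely only if both networks support it.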
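To optimize the parameters of the conditioning network $A$ and the prior network $B$ jointly, we perform stochastic gradient ascent to maximize the conditional log likelihood in (1). That is, we optimize a cross-entropy loss between the model’s predictions in (6) and discrete ground truth labels $y_i^* \in \{1, \ldots, K\}$,
$$O_1 = \sum_{(\mathbf{x}, \mathbf{y}^*) \in \mathcal{D}} \sum_{i=1}^{3M} \Big( \mathbb{1}[y_i^*]^{\top} \big(A_i(\mathbf{x}) + B_i(\mathbf{y}^*_{<i})\big) - \mathrm{lse}\big(A_i(\mathbf{x}) + B_i(\mathbf{y}^*_{<i})\big) \Big), \tag{7}$$
where $\mathrm{lse}(\cdot)$ is the log-sum-exp operator corresponding to the log of the denominator of a softmax, and $\mathbb{1}[k]$ denotes a $K$-dimensional one-hot indicator vector with its $k$-th dimension set to $1$.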
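Our preliminary experiments indicate that models trained with (7) tend to ignore the conditioning network, as the statistical correlation between a pixel and previous high resolution pixels is stronger than its correlation with low resolution inputs. To mitigate this issue, we include an additional loss in our objective to force the conditioning network to be optimized. This additional loss measures the cross-entropy between the conditioning network’s predictions via $\mathrm{softmax}(A_i(\mathbf{x}))$ and ground truth labels. The total loss that is optimized in our experiments is a sum of two cross-entropy losses formulated as,
$$O_2 = \sum_{(\mathbf{x}, \mathbf{y}^*) \in \mathcal{D}} \sum_{i=1}^{3M} \Big( \mathbb{1}[y_i^*]^{\top} \big(A_i(\mathbf{x}) + B_i(\mathbf{y}^*_{<i})\big) - \mathrm{lse}\big(A_i(\mathbf{x}) + B_i(\mathbf{y}^*_{<i})\big) + \mathbb{1}[y_i^*]^{\top} A_i(\mathbf{x}) - \mathrm{lse}\big(A_i(\mathbf{x})\big) \Big). \tag{8}$$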
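The two-term objective can be written out directly from the description above: one log-likelihood term for the fused logits and one for the conditioning logits alone. The sketch below is an illustration under stated assumptions: labels are 0-indexed, logits are plain Python lists, and `total_objective` is a hypothetical name. It returns a log-likelihood to be maximized; its negation is the sum of the two cross-entropy losses.

```python
import math


def log_sum_exp(logits):
    """Log of the softmax denominator, computed stably."""
    peak = max(logits)
    return peak + math.log(sum(math.exp(v - peak) for v in logits))


def log_prob(logits, label):
    """Log-probability of `label` under softmax(logits)."""
    return logits[label] - log_sum_exp(logits)


def total_objective(cond_logits, prior_logits, labels):
    """Sum over output dimensions of the fused term plus the conditioning-only term."""
    total = 0.0
    for a, b, k in zip(cond_logits, prior_logits, labels):
        fused = [ai + bi for ai, bi in zip(a, b)]
        total += log_prob(fused, k)  # fused prediction term
        total += log_prob(a, k)      # extra term on the conditioning logits alone
    return total


# Two output dimensions, K = 4 values each, uninformative (all-zero) logits.
cond = [[0.0] * 4, [0.0] * 4]
prior = [[0.0] * 4, [0.0] * 4]
labels = [2, 0]
objective = total_objective(cond, prior, labels)
```

The second term gives the conditioning logits a gradient signal that does not pass through the prior network, which is the stated motivation for adding it.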
Once the network is trained, sampling from the model is straightforward. Using (6), starting at $i = 1$, first we sample a high resolution pixel. Then, we proceed pixel by pixel, feeding the previously sampled pixel values back into the network, and draw new high resolution pixels. The three channels of each pixel are generated sequentially in turn.
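The sampling loop can be sketched as follows. The two callables stand in for the trained conditioning and prior networks; their signatures, and the toy functions used to exercise the loop, are assumptions made for illustration rather than the paper's interfaces.

```python
import math
import random


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_pixels(cond_logits_fn, prior_logits_fn, num_dims, rng):
    """Draw output dimensions one at a time, feeding earlier samples back in.

    cond_logits_fn(i) -> logits for dimension i (depends only on the input).
    prior_logits_fn(i, previous) -> logits given the values sampled so far.
    """
    sampled = []
    for i in range(num_dims):
        cond = cond_logits_fn(i)
        prior = prior_logits_fn(i, sampled)
        probs = softmax([a + b for a, b in zip(cond, prior)])
        value = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        sampled.append(value)
    return sampled


# Toy stand-ins: the input is uninformative, and the prior forces the second
# value to be the opposite of the first (a two-mode distribution).
def toy_cond(i):
    return [0.0, 0.0]


def toy_prior(i, previous):
    if not previous:
        return [0.0, 0.0]
    return [0.0, -1e9] if previous[0] == 1 else [-1e9, 0.0]


samples = [sample_pixels(toy_cond, toy_prior, 2, random.Random(seed)) for seed in range(20)]
```

Each sample contains exactly one lit position, and different seeds land in different modes, which mirrors the behaviour the text describes for a trained model.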
We additionally consider greedy decoding, where one always selects the pixel value with the largest probability, and sampling from a tempered softmax, where the concentration of a distribution $p$ is adjusted by using a temperature parameter $\tau > 0$,
$$p_\tau \propto p^{(1/\tau)}. \tag{9}$$
To control the concentration of our sampling distribution $p(y_i \mid \mathbf{x}, \mathbf{y}_{<i})$, it suffices to divide the logits from the conditioning network $A$ and the prior network $B$ by a parameter $\tau$. Note that as $\tau$ goes towards $0$, the distribution converges to the mode.
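Dividing the logits by a temperature before the softmax can be sketched as below. The logit values and function names are illustrative assumptions; the only technique shown is the temperature scaling itself and greedy (argmax) decoding.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def tempered_softmax(logits, temperature):
    """Softmax of logits divided by a positive temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use greedy_choice for the limit")
    return softmax([v / temperature for v in logits])


def greedy_choice(logits):
    """Index of the largest logit, i.e. the mode of the distribution."""
    return max(range(len(logits)), key=lambda k: logits[k])


logits = [2.0, 1.0, 0.0]  # hypothetical summed logits for one output dimension
default = tempered_softmax(logits, 1.0)
sharper = tempered_softmax(logits, 0.5)
nearly_greedy = tempered_softmax(logits, 0.1)
```

A temperature of 1 leaves the distribution unchanged, while smaller temperatures move mass toward the most likely value, approaching greedy decoding.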
We summarize the network architecture for the pixel recursive super resolution model. The conditioning architecture is similar in design to SRResNet. The conditioning network is a feed-forward convolutional neural network that takes a low resolution image through a series of ResNet blocks and transposed convolution layers. The last layer uses a convolution to increase the number of channels to predict a multinomial distribution over the possible color channel values for each sub-pixel. The prior network architecture consists of 20 gated PixelCNN blocks with 32 channels at each layer. The final layer of the super-resolution network is a softmax operation over the sum of the activations from the conditioning and prior networks. The model is built using TensorFlow and trained across GPUs with synchronous SGD updates. For training details and a complete list of architecture parameters, please see Appendix A.
We assess the effectiveness of the proposed pixel recursive super resolution method on two datasets containing centrally cropped faces (CelebA) and bedroom images (LSUN Bedrooms). In both datasets we resize the images with bicubic interpolation to the high resolution output size and to the low resolution input size, providing the input and output for training and evaluation.
We compare our technique against three baselines: (1) Nearest N., a nearest neighbor search baseline inspired by previous work on example-based super resolution; (2) ResNet, a deep neural network using ResNet blocks trained with an MSE objective; and (3) GAN, a GAN-based super resolution model taken from an existing implementation. We exclude the results of the GAN baseline on the bedrooms dataset as they are not competitive, and the model was developed specifically for faces.
The Nearest N. baseline computes $\mathbf{y}$ for a sample $\mathbf{x}$ by searching the training set for the nearest low resolution example, indexed by $i^\star$, and returns the high resolution counterpart $\mathbf{y}^{*(i^\star)}$. The Nearest N. baseline is a representative result of exemplar based super resolution approaches, and helps us test whether the model performs a naive lookup from the training dataset.
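A nearest neighbor baseline of this kind can be sketched as a lookup over (low resolution, high resolution) training pairs. The squared Euclidean distance in pixel space and the flattened-list image representation are assumptions for this sketch, and `nearest_neighbor_upscale` is a hypothetical name.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two flattened images (assumed metric)."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def nearest_neighbor_upscale(low_res_query, training_pairs):
    """Return the high resolution image paired with the closest low resolution example."""
    _, best_high = min(
        training_pairs,
        key=lambda pair: squared_distance(pair[0], low_res_query),
    )
    return best_high


# Hypothetical training pairs: 2-value "low res" images and 4-value "high res" images.
training_pairs = [
    ([0.0, 0.0], [0.0, 0.0, 0.0, 0.0]),
    ([1.0, 0.0], [1.0, 1.0, 0.0, 0.0]),
    ([0.0, 1.0], [0.0, 0.0, 1.0, 1.0]),
]
prediction = nearest_neighbor_upscale([0.9, 0.2], training_pairs)
```

The output is always a verbatim training image, which is what makes this baseline a useful check for memorization.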
The ResNet baseline employs a design similar to SRResNet, which reports state-of-the-art results in terms of image similarity metrics. Most significantly, we alter the network to compute the residuals with respect to a bicubic interpolation of the input. The regression provides a comparison to a state-of-the-art convolutional network that performs a unimodal pixel independent prediction.
(Footnote 2: Note that the regression architecture is nearly identical to the conditioning network in Section 4.1. The slight change is to force the network to predict bounded values in RGB space. To enforce this behavior, the top layer outputs three channels instead of one and employs a bounded nonlinearity instead of a ReLU nonlinearity.)
The GAN super resolution baseline exploits a conditional GAN architecture, and combines an adversarial loss with a consistency loss, which encourages the low-resolution version of the predicted $\mathbf{y}$ to be close to $\mathbf{x}$. The two losses are weighted as specified by the authors of the baseline, and we keep the weights the same in our face experiments.
High resolution samples generated by the pixel recursive super resolution capture the rich structure of the dataset and appear perceptually plausible (Figure 1 and 4; Appendix B and C). Sampling from the super resolution model multiple times results in different high resolution images for a given low resolution image (Figure 5; Appendix B and C). Qualitatively, the samples from the model identify many plausible high resolution images with distinct qualitative features that correspond to a given lower resolution image. Note that the differences between samples for the faces dataset are far less drastic than seen in our synthetic dataset, where failure to cleanly predict modes indicated complete failure.
The quality of samples is sensitive to the temperature $\tau$ (Figure 6, right columns). Greedy decoding results in poor quality samples that are overly smooth and contain horizontal and vertical line artifacts. Samples from the default temperature ($\tau = 1$) are perceptually more plausible, although they tend to contain undesired high frequency content. Tuning the temperature between these two extremes proves beneficial for improving the quality of the samples.
[Figure column labels: Input, G. Truth, Nearest N., GAN, Bicubic, ResNet, Greedy.]
Many methods exist for quantifying image similarity that attempt to measure human perceptual judgements of similarity [45, 46, 28]. We quantified the prediction accuracy of our model compared to ground truth using pSNR and MS-SSIM (Table 1). We found that our own visual assessment of the predicted image quality did not correspond to these image similarity metrics. For instance, bicubic interpolation achieved relatively high metrics even though the samples appeared quite poor. This result matches recent observations that suggest that pSNR and SSIM provide poor judgements of super resolution quality when new details are synthesized [26, 18]. In addition, Figure 6 highlights how the perceptual quality of model samples does not necessarily correspond to negative log likelihood (NLL). Smaller NLL means the model has assigned that image a larger probability mass. The greedy, bicubic, and regression faces are preferred by the model despite exhibiting worse perceptual quality.
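For reference, pSNR is a direct function of the mean squared error between an estimate and the ground truth, which is one reason it rewards blurry, averaged predictions. The sketch below uses the standard definition on flattened images; the 8-bit peak value of 255 and the toy pixel values are assumptions for illustration.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in decibels: 10 * log10(max_value^2 / MSE)."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


reference = [10.0, 20.0, 30.0]
off_by_one = [11.0, 21.0, 31.0]  # every value differs by 1, so MSE = 1
off_by_ten = [20.0, 30.0, 40.0]  # every value differs by 10, so MSE = 100
```

Since pSNR depends only on the squared error, a prediction that hedges between plausible details can score higher than a sharp sample that commits to one of them.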
We next measured how well the high resolution samples corresponded to the low resolution input by measuring the consistency. The consistency is quantified as the distance between the low-resolution input image and a bicubic downsampled version of the high resolution estimate. Lower consistencies indicate superior correspondence with the low-resolution image. Note that this is an explicit objective of the GAN. The pixel recursive model achieved consistencies on par with the regression model and bicubic interpolation, indicating that even though the model was producing diverse samples, the samples were largely constrained by the low-resolution image. Most importantly, the pixel recursive model achieved superior consistencies than the GAN even though the model does not explicitly optimize for this criterion.
(Footnote 3: Note that one may improve the consistency of the GAN by increasing its weight in the objective. Increasing the weight for the consistency term will likely lead to decreased perceptual quality in the images but improved consistency. Regardless, the images generated by the pixel recursive model are superior in both consistency and perceptual quality as judged by humans for a range of temperatures.)
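A consistency score of this kind can be sketched as: downsample the high resolution estimate, then measure its distance to the low resolution input. The text specifies bicubic downsampling; the sketch below substitutes simple block averaging and uses mean squared error as the distance, both of which are assumptions made to keep the example dependency-free.

```python
def downsample_average(image, factor):
    """Block-average a 2D image (list of rows); dimensions must be divisible by factor."""
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(0, rows, factor):
        out_row = []
        for c in range(0, cols, factor):
            block = [image[rr][cc]
                     for rr in range(r, r + factor)
                     for cc in range(c, c + factor)]
            out_row.append(sum(block) / len(block))
        out.append(out_row)
    return out


def consistency(low_res, high_res_estimate, factor):
    """Mean squared error between the input and the downsampled estimate (lower is better)."""
    down = downsample_average(high_res_estimate, factor)
    diffs = [(a - b) ** 2
             for row_a, row_b in zip(low_res, down)
             for a, b in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)


low_res = [[1.0, 2.0], [3.0, 4.0]]
# Estimate that reproduces the input exactly when averaged over 2x2 blocks.
faithful = [[1.0, 1.0, 2.0, 2.0],
            [1.0, 1.0, 2.0, 2.0],
            [3.0, 3.0, 4.0, 4.0],
            [3.0, 3.0, 4.0, 4.0]]
# Estimate that is uniformly one intensity level too bright.
shifted = [[v + 1.0 for v in row] for row in faithful]
```

A score of zero means the estimate downsamples back to the input exactly; diverse samples can all score well as long as they agree with the input at low resolution.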
The consistency measure additionally provided an important control experiment to determine if the pixel recursive model were just naively copying the nearest training sample. If the pixel recursive model were just copying the nearest training sample, then the consistency of the Nearest N. model would be equivalent to the pixel recursive model. We instead find that the pixel recursive model has superior consistency values indicating that the model is not just naively copying the closest training examples.
Given that automated quantitative measures did not match our perceptual judgements, we conducted a human study to assess the effectiveness of the super resolution algorithm. In particular, we performed a forced choice experiment on crowd-sourced workers in order to determine how plausible a given high resolution sample is from each model. Following prior work, each worker was presented a true image and a corresponding prediction from a model, and asked “Which image, would you guess, is from a camera?”. We performed this study across 283 workers on Amazon Mechanical Turk and statistics were accrued across 40 unique workers for each super resolution algorithm.
(Footnote 4: Specifically, each worker was given one second to make a forced choice decision. Workers began a session with 10 practice questions during which they received feedback. The practice pairs were not counted in the results. After the practice pairs, each worker was shown 45 additional pairs. A subset of the pairs were simple, golden questions designed to constantly check if the worker was paying attention. Data from workers that answered golden questions incorrectly were thrown out.)
Table 1 reports the percentage of samples for a given algorithm that a human incorrectly believed to be a real image. Note that a perfect algorithm would fool a human at a rate of 50%. The regression model fooled humans 2-4% of the time and the GAN fooled humans 8.5% of the time. The pixel recursive model fooled humans 11.0% and 27.9% of the time for faces and bedrooms, respectively, which is significantly above the state-of-the-art regression model. Importantly, we found that the selection of the sampling temperature greatly influenced the quality of the samples and in turn the fraction of time that humans were fooled. Nevertheless, the pixel recursive model outperformed the strongest baseline model, the GAN, across all temperatures. A ranked list of the best and worst fooling examples is reproduced in Appendix D along with the fool rates.
We advocate research on super resolution with high magnification ratios, where the problem is dramatically underspecified as high frequency details are missing. Any model that produces non-blurry super resolution outputs must make sensible predictions of the missing content to operate in such a heavily multimodal regime. We present a fully probabilistic method that tackles super resolution with small inputs, demonstrating that even very small images can be enlarged to sharp higher resolution images. Our technique outperforms several strong baselines, including ones optimizing a regression objective or an adversarial loss. We perform human evaluation studies showing that samples from the pixel recursive model look more plausible to humans, and more generally, that common metrics like pSNR and SSIM do not correlate with human judgment when the magnification ratio is large.
We thank Aäron van den Oord, Sander Dieleman, and the Google Brain team for insightful comments and discussions.
TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
Computer Vision and Pattern Recognition, 1999. IEEE Computer Society Conference on., volume 1. IEEE, 1999.
The Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, volume 15 of JMLR: W&CP, pages 29–37, 2011.
Pixel recurrent neural networks. ICML, 2016.
|Conditional network – input|
|PixelCNN network – input|
|20 Gated Convolution Blocks||1|
| Optimizer | RMSProp (decay=0.95, momentum=0.9, epsilon=1e-8) |
| Iterations | 2,000,000 for bedrooms, 200,000 for faces |
| Learning rate | 0.0004, divided by 2 every 500,000 steps |
| Weight, bias initialization | truncated normal (stddev=0.1), Constant() |
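The learning rate row above can be read as a staircase schedule. The sketch below is one interpretation, assuming the halving happens in discrete jumps at exact multiples of 500,000 steps; the function name and argument names are hypothetical.

```python
def learning_rate(step, base_rate=0.0004, halve_every=500_000):
    """Staircase decay: start at base_rate and halve it every `halve_every` steps."""
    return base_rate * 0.5 ** (step // halve_every)


# Rates at a few points of a 2,000,000-step run (the bedrooms setting).
schedule = {step: learning_rate(step) for step in (0, 499_999, 500_000, 1_999_999)}
```

Under this reading, a 200,000-step run (the faces setting) never reaches the first halving and trains at the base rate throughout.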
[Figure column labels: Input, Bicubic, ResNet, Truth, Nearest N., GAN.]
The best and worst rated images in the human study. The fractions below the images denote how many times a person chose that image over the ground truth.