pixelrecursivesuperresolution
TensorFlow implementation of pixelrecursivesuperresolution (Google Brain paper: https://arxiv.org/abs/1702.00783)
We present a pixel recursive super resolution model that synthesizes realistic details into images while enhancing their resolution. A low resolution image may correspond to multiple plausible high resolution images, thus modeling the super resolution process with a pixel independent conditional model often results in averaging different details, and hence in blurry edges. By contrast, our model is able to represent a multimodal conditional distribution by properly modeling the statistical dependencies among the high resolution image pixels, conditioned on a low resolution input. We employ a PixelCNN architecture to define a strong prior over natural images and jointly optimize this prior with a deep conditioning convolutional network. Human evaluations indicate that samples from our proposed model look more photo-realistic than a strong L2 regression baseline.
The problem of super resolution entails artificially enlarging a low resolution photograph to recover a corresponding plausible image with higher resolution [31]. When only a small magnification is desired, super resolution techniques achieve satisfactory results [41, 8, 16, 39, 22] by building statistical prior models of images [35, 2, 51] that capture low-level characteristics of natural images.
This paper studies super resolution with particularly small inputs and large magnification ratios, where the amount of information available to accurately construct a high resolution image is very limited (Figure 1, left column). Thus, the problem is underspecified and many plausible high resolution images may match a given low resolution input image. Building improved models in the high magnification regime is significant for improving the state of the art in super resolution, and more generally for building better conditional generative models of images [44, 33, 30, 43].
[Figure 1: low resolution input (left column), samples from the model (middle column), and ground truth (right column).]

As the magnification ratio increases, a super resolution model must not only account for textures, edges, and other low-level statistics [16, 39, 22], but must increasingly account for complex variations of objects, viewpoints, illumination, and occlusions. At increasing levels of magnification, the details no longer exist in the source image, and the predictive challenge shifts from recovering details (e.g., deconvolution [23]) to synthesizing plausible novel details de novo [33, 44].
Consider a low resolution image of a face in Figure 1, left column. In such low resolution images the fine spatial details of the hair and the skin are missing and cannot be faithfully restored with interpolation techniques [15]. However, by incorporating prior knowledge of faces and their typical variations, a sketch artist might be able to imagine and draw believable details using specialized software packages [25].
In this paper, we show how a fully probabilistic model that is trained end-to-end using a log-likelihood objective can play the role of such an artist by synthesizing the face images depicted in Figure 1, middle column. We find that drawing multiple samples from this model produces high resolution images that exhibit multimodality, resembling the diversity of images that plausibly correspond to a low resolution image. In human evaluation studies we demonstrate that naive human observers can easily distinguish real images from the outputs of sophisticated super resolution models using deep networks and mean squared error (MSE) objectives [21]. However, samples drawn from our probabilistic model fool human observers considerably more often (Table 1), compared to a chance rate of 50%.
In summary, the main contributions of the paper include:
Characterization of the underspecified super resolution problem in terms of multimodal prediction.
Proposal of a new probabilistic model tailored to the super resolution problem, which produces diverse, plausible, non-blurry high resolution samples.
Proposal of a new loss term for conditional probabilistic models with powerful autoregressive decoders, which prevents the conditioning signal from being ignored.
Human evaluation demonstrating that traditional metrics in super resolution (e.g., pSNR and SSIM) fail to capture sample quality in the regime of underspecified super resolution.
We proceed by describing related work, followed by explaining how the multimodal problem is not addressed using traditional objectives. Then, we propose a new probabilistic model building on top of ResNet [14] and PixelCNN [43]. The paper highlights the diversity of high resolution samples generated by the model and demonstrates the quality of the samples through human evaluation studies.
Super resolution has a long history in computer vision [31]. Methods relying on interpolation [15] are easy to implement and widely used; however, these methods suffer from a lack of expressivity since linear models cannot express complex dependencies between the inputs and outputs. In practice, such methods often fail to adequately predict high frequency details, leading to blurry high resolution outputs.
Enhancing linear methods with rich image priors such as sparsity [2] or Gaussian mixtures [51] has substantially improved the quality of the methods; likewise, leveraging low-level image statistics such as edge gradients improves predictions [47, 41, 8, 16, 39, 22]. Much work has been done on algorithms that search a database of patches and combine them to create plausible high frequency details in zoomed images [9, 17]. Recent patch-based work has focused on improving basic interpolation methods by building a dictionary of pre-learned filters on images and selecting the appropriate patches by an efficient hashing mechanism [34]. Such dictionary methods have improved the inference speed while remaining comparable to the state of the art.
Another approach for super resolution is to abandon inference speed requirements and focus on constructing the high resolution images at increasingly higher magnification factors. Convolutional neural networks (CNNs) represent an approach to the problem that avoids explicit dictionary construction, and instead implicitly extracts multiple layers of abstractions by learning layers of filter kernels. Dong et al. [7] employed a three layer CNN with MSE loss. Kim et al. [21] improved accuracy by increasing the depth to 20 layers and learning only the residuals between the high resolution image and an interpolated low resolution image. Most recently, SRResNet [26] uses many ResNet blocks to achieve state-of-the-art pSNR and SSIM on standard super resolution benchmarks; we employ a similar design for our conditional network and catch-all regression baseline.
Instead of using a per-pixel loss, Johnson et al. [18] use the Euclidean distance between activations of a pre-trained CNN for the model’s predictions vs. ground truth images. Using this so-called perceptual loss, they train feed-forward networks for super resolution and style transfer. Bruna et al. [4] also use a perceptual loss to train a super resolution network, but inference is done via gradient propagation to the low-res input (e.g., [12]).
Another promising direction has been to employ an adversarial loss for training a network. A super resolution network is trained in opposition to a secondary network that attempts to discriminate whether a synthesized high resolution image is real or fake. Networks trained with traditional losses (e.g., [21, 7]) suffer from blurry images, whereas networks employing an adversarial loss predict compelling, high frequency detail [26, 49]. Sønderby et al. [19] employed networks trained with adversarial losses but constrained the network to learn affine transformations that ensure the model only generates images that downscale back to the low resolution inputs. Sønderby et al. [19] also explore a masked autoregressive model, but without the gated layers and using a mixture of Gaussians instead of a multinomial distribution. Denton et al. [5] use a multi-scale adversarial network for image synthesis that is amenable to super resolution tasks.
Although generative adversarial networks (GANs) [13] provide a promising direction, such networks suffer from several drawbacks. First, training an adversarial network is unstable [33], and many methods are being developed to increase the robustness of training [29]. Second, GANs suffer from a common failure case of mode collapse [29], whereby the resulting model produces samples that do not capture the diversity of samples available in the training data. Finally, tracking the performance of adversarial networks is challenging because it is difficult to associate a probabilistic interpretation with their results. These points motivate a distinct approach that permits covering the full diversity of the training dataset.
PixelRNN and PixelCNN [43, 44] are probabilistic generative models that impose an order on image pixels in order to represent them as a long sequence. The probability of each pixel is conditioned on the previously observed pixels. One variant of PixelCNN [44] obtained state-of-the-art predictive ability in terms of log-likelihood on academic benchmarks such as CIFAR-10 and MNIST. Since PixelCNN uses log-likelihood for training, the model is penalized if negligible probability is assigned to any of the training examples. By contrast, adversarial networks only learn enough to fool a non-stationary discriminator. This latter point suggests that a PixelCNN might be able to predict a large diversity of high resolution images that might be associated with a given low resolution image. Further, using log-likelihood as the training objective allows for hyperparameter search to find models within a model family by simply comparing their log probabilities on a validation set.
We aim to learn a probabilistic super resolution model that discerns the statistical dependencies between a high resolution image and a corresponding low resolution image. Let $x$ and $y$ denote a low resolution and a high resolution image, and let $y^*$ represent a ground-truth high resolution image. In order to learn a parametric model of $p_\theta(y \mid x)$, we exploit a large dataset of pairs of low resolution inputs and ground-truth high resolution outputs, denoted $\mathcal{D} \equiv \{(x^{(i)}, y^{*(i)})\}_{i=1}^{N}$. One can easily collect such a large dataset by starting from some high resolution images and lowering the resolution as much as needed. To optimize the parameters $\theta$ of the conditional distribution $p$, we maximize a conditional log-likelihood objective defined as,
$$O(\theta \mid \mathcal{D}) = \sum_{(x, y^*) \in \mathcal{D}} \log p(y^* \mid x). \tag{1}$$
The key problem discussed in this paper is the exact form of $p(y \mid x)$ that enables efficient learning and inference, while generating realistic non-blurry outputs. We first discuss pixel-independent models that assume that each output pixel is generated with an independent stochastic process given the input. We elaborate on why these techniques result in suboptimal blurry super resolution results. Then, we describe our pixel recursive super resolution model that generates output pixels one at a time to enable modeling the statistical dependencies between the output pixels, resulting in sharp synthesized images given very low resolution inputs.
The simplest form of a probabilistic super resolution model assumes that the output pixels are conditionally independent given the inputs. As such, the conditional distribution of $y$ given $x$ factors into a product of independent pixel predictions. Suppose an RGB output $y$ has $M$ pixels each with three color channels, i.e., $y \in \mathbb{R}^{3M}$. Then,
$$\log p(y \mid x) = \sum_{i=1}^{3M} \log p(y_i \mid x). \tag{2}$$
Two general forms of pixel prediction models have been explored in the literature: Gaussian and multinomial distributions to model continuous and discrete pixel values respectively. In the Gaussian case,
$$\log p(y_i \mid x) = -\frac{1}{2\sigma^2} \lVert y_i - C_i(x) \rVert_2^2 - \log \sqrt{2 \sigma^2 \pi}, \tag{3}$$
where $C_i(x)$ denotes the $i$-th element of a nonlinear transformation of $x$ via a convolutional neural network. Accordingly, $C_i(x)$ is the estimated mean for the $i$-th output pixel $y_i$, and $\sigma^2$ denotes the variance. Often the variance is not learned, in which case maximizing the conditional log-likelihood of (1) reduces to minimizing the MSE between $y_i$ and $C_i(x)$ across the pixels and channels throughout the dataset. Super resolution models based on MSE regression fall within this family of pixel independent models [7, 21, 26]. Implicitly, the outputs of a neural network parameterize a set of Gaussians with fixed variance. It is easy to verify that the joint distribution $p(y \mid x)$ is unimodal as it forms an isotropic multivariate Gaussian.
Alternatively, one could discretize the output dimensions into $K$ possible values (e.g., $K = 256$), and use a multinomial distribution as the predictive model for each pixel [50], where $y_i \in \{1, \ldots, K\}$. The pixel prediction model based on a multinomial softmax operator is represented as,
$$\log p(y_i = k \mid x) = w_{jk}^{\top} C_i(x) - \log \sum_{v=1}^{K} \exp\{ w_{jv}^{\top} C_i(x) \}, \tag{4}$$
where a network with a set of softmax weights, $\{w_{jk}\}$, for each value $k$ per color channel $j$ is used to induce $\log p(y_i \mid x)$. Even though $p(y_i \mid x)$ in (4) can express multimodal distributions, the conditional dependency between the pixels cannot be captured, i.e., the model cannot choose between drawing an edge at one position vs. another since that requires coordination between the samples.
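The coordination failure can be illustrated with a small sketch. The two-pixel dataset and the helper `independent_prob` below are hypothetical and chosen only for illustration: every training image has an edge at exactly one of two positions, yet a model that keeps only per-pixel marginals places half of its probability mass on images that never occur.

```python
from itertools import product

# Hypothetical training data: a 2-pixel "image" with an edge at exactly one position.
data = [(1, 0), (0, 1)]

# Per-pixel marginals, which is all a pixel-independent model can represent.
marginals = [sum(img[i] for img in data) / len(data) for i in range(2)]

def independent_prob(img):
    """Probability of img under the factorized model prod_i p(y_i)."""
    p = 1.0
    for pixel, m in zip(img, marginals):
        p *= m if pixel == 1 else 1.0 - m
    return p

# Enumerate all four possible images and their probability under the model.
probs = {img: independent_prob(img) for img in product((0, 1), repeat=2)}

# Mass assigned to images with no edge or two edges, which never occur in data.
invalid_mass = probs[(0, 0)] + probs[(1, 1)]
print(marginals, invalid_mass)
```

Each marginal is multimodal on its own, but because the pixels are sampled without coordination the joint distribution cannot exclude the invalid configurations.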
[Figure 2: Top: how the dataset was created. Bottom: samples from models trained with L2 regression, cross-entropy, and PixelCNN.]
To demonstrate how pixel independent models fail at conditional image modeling, we create a synthetic dataset that explicitly requires multimodal prediction. For many dense image prediction tasks, e.g., super resolution [31], colorization [50, 6], and depth estimation [37], models that are able to predict a single mode are heavily preferred over models that blend modes together. For example, in the task of colorization, selecting a strong red or green for an apple is better than selecting a brown-toned color that reflects the smeared average of all of the apple colors observed in the training set.
We construct a simple multimodal MNIST corners dataset to demonstrate the challenge of this problem. MNIST corners is constructed by randomly placing an MNIST digit in either the top-left or bottom-right corner (Figure 2, top). Several networks are trained to predict individual samples from this dataset to demonstrate the unique challenge of this simple example.
The challenge behind this toy example is for a network to exclusively predict an individual digit in a corner of an image. Training a moderate-sized 10-layer convolutional neural network with an L2 objective (i.e., MSE regression) results in blurry image samples in which the two modes are blended together (Figure 2, regression). That is, no example image in the dataset contains a digit in both corners, yet this model incorrectly predicts a blend of such samples. Replacing the L2 loss with a discrete, per-pixel cross-entropy produces sharper images but likewise fails to stochastically predict a digit in a single corner of the image (Figure 2, cross-entropy).
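The blending effect follows directly from the objective: the constant prediction that minimizes MSE is the per-pixel mean of the targets. The sketch below uses a hypothetical four-pixel stand-in for MNIST corners (ink at either the first or the last pixel), not the actual dataset or network.

```python
def mse(pred, samples):
    """Mean squared error of one prediction against every sample."""
    total = sum((p - s) ** 2 for sample in samples for p, s in zip(pred, sample))
    return total / (len(samples) * len(pred))

def mse_optimal_prediction(samples):
    """Per-pixel mean: the constant prediction that minimizes MSE."""
    n = len(samples)
    return [sum(sample[i] for sample in samples) / n for i in range(len(samples[0]))]

# Hypothetical bimodal data: ink in the "top-left" or the "bottom-right" pixel.
top_left = [1.0, 0.0, 0.0, 0.0]
bottom_right = [0.0, 0.0, 0.0, 1.0]
dataset = [top_left, bottom_right] * 50

blend = mse_optimal_prediction(dataset)
print(blend)                   # half-intensity ink in both corners
print(mse(blend, dataset))     # lower error than either real mode ...
print(mse(top_left, dataset))  # ... even though top_left is a valid image
```

The blended prediction scores better under MSE than committing to either valid mode, which is why a regression network has no incentive to produce a sharp digit in a single corner.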
The assumption of conditional independence between predicted pixels is a significant failure mode for the previous probabilistic objectives in the synthetic example (Equations 3 and 4). One approach to this problem is to define the conditional distribution of the output pixels jointly as a multivariate Gaussian mixture [52] or an undirected graphical model [10]. Both of these conditional distributions require constructing a statistical dependency between output pixels, for which inference may be computationally expensive.
A second approach is to factorize the joint distribution using the chain rule by imposing an order on image pixels,
$$\log p(y \mid x) = \sum_{i=1}^{M} \log p(y_i \mid x, y_{<i}), \tag{5}$$
where the generation of each output dimension is conditioned on the input and previous output pixels [24, 42]. We denote the conditioning up to pixel $i$ by $y_{<i}$, where $y_{<i} = \{y_1, \ldots, y_{i-1}\}$. [Footnote 1: Note that in color images one must impose an order on both spatial locations as well as color channels. In a color image the conditioning is based on the input and previously outputted pixels at previous spatial locations as well as pixels at the same spatial location.] The benefits of this approach are that the exact form of the conditional dependencies is flexible and the inference is straightforward.
PixelCNN is a stochastic model that provides an explicit model for $\log p(y_i \mid x, y_{<i})$ as a gated, hierarchical chain of cleverly masked convolutions [43, 44, 36]. The goal of PixelCNN is to capture multimodality and pixel correlations in an image. Indeed, training a PixelCNN on the MNIST corners dataset successfully captures the bimodality of the problem and produces samples in which digits reside exclusively in a single corner (Figure 2, PixelCNN). Importantly, the model never predicts both digits simultaneously.
Applying the PixelCNN to a super resolution problem is straightforward and requires modifying the architecture to supply a conditioning on a low resolution version of the image. In early experiments we found that the autoregressive distribution of the model largely ignored the conditioning on the low resolution image. This phenomenon, referred to as “optimization challenges”, has been documented in the context of sequential autoencoder models [3] (see also [38, 40] for more discussion).
To address this issue we modify the architecture of PixelCNN to more explicitly depend on the conditioning of a low resolution image. In particular, we propose a late fusion model [20] that factors the problem into autoregressive and conditioning components (Figure 3). The autoregressive portion of the model, termed a prior network, captures the serial dependencies of the pixels, while the conditioning component, termed a conditioning network, captures the global structure of the low resolution image. Specifically, we formulate the prior network to be a PixelCNN and the conditioning network to be a deep convolutional network employed previously for super resolution [26].
Given an input $x$, let $A_i(x)$ denote a conditioning network predicting a vector of logit values corresponding to the $K$ possible values that the $i$-th output pixel can take. Similarly, let $B_i(y_{<i})$ denote a prior network predicting a vector of logit values for the $i$-th output pixel. Our probabilistic model predicts a distribution over the $i$-th output pixel by simply adding the two sets of logits and applying a softmax operator on them,
$$p(y_i \mid x, y_{<i}) = \mathrm{softmax}\big(A_i(x) + B_i(y_{<i})\big). \tag{6}$$
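As a minimal sketch of the late fusion in (6), the snippet below adds two logit vectors and applies a softmax. The logit values and the function name `fuse` are illustrative assumptions; in the model the logits would come from the conditioning and prior networks.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(cond_logits, prior_logits):
    """Distribution over one output pixel: softmax of the summed logits."""
    return softmax([a + b for a, b in zip(cond_logits, prior_logits)])

# Hypothetical logits for a pixel with K = 3 possible values.
cond_logits = [2.0, 0.0, -1.0]   # stands in for the conditioning network output
prior_logits = [0.0, 1.0, 0.0]   # stands in for the prior network output
probs = fuse(cond_logits, prior_logits)
print(probs)
```

Adding logits is equivalent to multiplying the two networks' individual softmax distributions and renormalizing, so either network can veto a pixel value by assigning it a very low logit.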
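To optimize the parameters of the conditioning network $A$ and the prior network $B$ jointly, we perform stochastic gradient ascent to maximize the conditional log likelihood in (1). That is, we optimize a cross-entropy loss between the model’s predictions in (6) and discrete ground truth labels $y_i^* \in \{1, \ldots, K\}$,
$$O_1 = \sum_{(x, y^*) \in \mathcal{D}} \sum_{i=1}^{M} \Big( \mathbb{1}[y_i^*]^{\top} \big(A_i(x) + B_i(y^*_{<i})\big) - \mathrm{lse}\big(A_i(x) + B_i(y^*_{<i})\big) \Big), \tag{7}$$
where $\mathrm{lse}(\cdot)$ is the log-sum-exp operator corresponding to the log of the denominator of a softmax, and $\mathbb{1}[k]$ denotes a $K$-dimensional one-hot indicator vector with its $k$-th dimension set to $1$.
Our preliminary experiments indicate that models trained with (7) tend to ignore the conditioning network, as the statistical correlation between a pixel and previous high resolution pixels is stronger than its correlation with low resolution inputs. To mitigate this issue, we include an additional loss in our objective to force the conditioning network to be optimized. This additional loss measures the cross-entropy between the conditioning network’s predictions via $\mathrm{softmax}(A_i(x))$ and ground truth labels. The total loss that is optimized in our experiments is a sum of two cross-entropy losses formulated as,
$$O_2 = \sum_{(x, y^*) \in \mathcal{D}} \sum_{i=1}^{M} \Big( \mathbb{1}[y_i^*]^{\top} \big(2 A_i(x) + B_i(y^*_{<i})\big) - \mathrm{lse}\big(A_i(x) + B_i(y^*_{<i})\big) - \mathrm{lse}\big(A_i(x)\big) \Big). \tag{8}$$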
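The per-pixel term of the combined objective can be sketched as the sum of two log-likelihoods, one from the fused logits and one from the conditioning logits alone. The function names and example values below are assumptions for illustration, and labels are 0-indexed as is usual in code.

```python
import math

def lse(logits):
    """Log-sum-exp: the log of the denominator of a softmax."""
    m = max(logits)
    return m + math.log(sum(math.exp(v - m) for v in logits))

def log_likelihood(logits, label):
    """Log-probability of `label` under softmax(logits)."""
    return logits[label] - lse(logits)

def pixel_objective(cond_logits, prior_logits, label):
    """One pixel's contribution: fused term plus conditioning-only term."""
    fused = [a + b for a, b in zip(cond_logits, prior_logits)]
    return log_likelihood(fused, label) + log_likelihood(cond_logits, label)

# Hypothetical logits and ground-truth label for one pixel.
a = [1.0, 0.0, -1.0]
b = [0.5, 0.5, 0.0]
label = 0
value = pixel_objective(a, b, label)
print(value)
```

Expanding the two terms gives the label entry of twice the conditioning logits plus the prior logits, minus the two log-sum-exp terms. Training would maximize this quantity, or equivalently minimize its negative as a loss.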
Once the network is trained, sampling from the model is straightforward. Using (6), starting at $i = 1$, first we sample a high resolution pixel. Then, we proceed pixel by pixel, feeding the previously sampled pixel values back into the network, and draw new high resolution pixels. The three channels of each pixel are generated sequentially in turn.
We additionally consider greedy decoding, where one always selects the pixel value with the largest probability, and sampling from a tempered softmax, where the concentration of a distribution $p$ is adjusted by using a temperature parameter $\tau > 0$,
$$p_\tau \propto p^{(1/\tau)}.$$
To control the concentration of our sampling distribution $p(y_i \mid x, y_{<i})$, it suffices to divide the logits from $A$ and $B$ by a parameter $\tau$. Note that as $\tau$ goes towards $0$, the distribution converges to the mode.
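The sampling procedure can be sketched as a loop that draws one pixel at a time from the tempered distribution. The functions `toy_cond` and `toy_prior` are hypothetical stand-ins for the conditioning and prior networks, and the toy input has as many pixels as the output purely for brevity.

```python
import math
import random

def tempered_softmax(logits, tau):
    """Softmax of logits / tau; a smaller tau concentrates mass on the mode."""
    scaled = [v / tau for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_pixel(logits, tau, rng):
    """Greedy decoding when tau == 0, otherwise a tempered sample."""
    if tau == 0:
        return max(range(len(logits)), key=logits.__getitem__)
    probs = tempered_softmax(logits, tau)
    return rng.choices(range(len(probs)), weights=probs)[0]

def sample_image(cond_fn, prior_fn, x, num_pixels, tau, rng):
    """Generate pixels one at a time, feeding earlier samples back in."""
    y = []
    for i in range(num_pixels):
        logits = [a + b for a, b in zip(cond_fn(x, i), prior_fn(y))]
        y.append(sample_pixel(logits, tau, rng))
    return y

K = 3  # number of discrete pixel values in this toy example

def toy_cond(x, i):
    # Hypothetical conditioning logits: favor the input's value at position i.
    return [3.0 if k == x[i] else 0.0 for k in range(K)]

def toy_prior(y):
    # Hypothetical prior logits: mildly favor repeating the previous pixel.
    return [1.0 if y and k == y[-1] else 0.0 for k in range(K)]

x = [0, 2, 2, 1]
greedy = sample_image(toy_cond, toy_prior, x, len(x), tau=0, rng=random.Random(0))
sampled = sample_image(toy_cond, toy_prior, x, len(x), tau=1.0, rng=random.Random(0))
print(greedy, sampled)
```

With `tau=0` the loop is deterministic, while larger temperatures trade fidelity to the mode for diversity across repeated calls.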
We summarize the network architecture for the pixel recursive super resolution model. The conditioning architecture is similar in design to SRResNet [26]. The conditioning network is a feed-forward convolutional neural network that takes a low resolution image through a series of ResNet blocks [14] and transposed convolution layers [32]. The last layer uses a convolution to increase the number of channels to predict a multinomial distribution over the possible color channel values for each sub-pixel. The prior network architecture consists of 20 gated PixelCNN blocks with 32 channels at each layer [44]. The final layer of the super resolution network is a softmax operation over the sum of the activations from the conditioning and prior networks. The model is built using TensorFlow [1] and trained across multiple GPUs with synchronous SGD updates. For training details and a complete list of architecture parameters, please see Appendix A.
We assess the effectiveness of the proposed pixel recursive super resolution method on two datasets containing centrally cropped faces (CelebA [27]) and bedroom images (LSUN Bedrooms [48]). In both datasets we resize the images with bicubic interpolation to the low and high resolutions that provide the input and output for training and evaluation.
We compare our technique against three baselines: (1) Nearest N., a nearest neighbor search baseline inspired by previous work on example-based super resolution [9]; (2) ResNet, a deep neural network using ResNet blocks trained with an MSE objective; and (3) GAN, a GAN-based super resolution model implemented by [11], similar to [49]. We exclude the results of the GAN baseline on the bedrooms dataset as they are not competitive, and the model was developed specifically for faces.
The Nearest N. baseline computes $y$ for a sample $x$ by searching the training set $\mathcal{D} = \{(x^{(i)}, y^{*(i)})\}_{i=1}^{N}$ for the nearest example, indexed by $i^* = \arg\min_i \lVert x^{(i)} - x \rVert_2^2$, and returns the high resolution counterpart $y^{*(i^*)}$. The Nearest N. baseline is representative of exemplar-based super resolution approaches, and helps us test whether the model performs a naive lookup from the training dataset.
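A minimal sketch of such a lookup is given below, assuming images are flattened into lists of pixel values and that nearness is measured by squared Euclidean distance in low resolution pixel space. The helper names and the tiny training set are illustrative only.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two flattened images."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def nearest_neighbor_sr(x, training_pairs):
    """Return the high resolution partner of the closest low resolution example."""
    low, high = min(training_pairs, key=lambda pair: squared_distance(pair[0], x))
    return high

# Hypothetical (low resolution, high resolution) training pairs.
training_pairs = [
    ([0.0, 0.0], [0.0, 0.0, 0.0, 0.0]),
    ([1.0, 0.0], [1.0, 1.0, 0.0, 0.0]),
    ([0.0, 1.0], [0.0, 0.0, 1.0, 1.0]),
]

prediction = nearest_neighbor_sr([0.9, 0.2], training_pairs)
print(prediction)
```

Because the output is always a stored training image, it is sharp but need not downscale back to the query, which is what the consistency measure later probes.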
The ResNet baseline employs a design similar to SRResNet [26], which reports state-of-the-art results in terms of image similarity metrics. [Footnote 2: Note that the regression architecture is nearly identical to the conditioning network in Section 4.1. The slight change is to force the network to predict bounded values in RGB space. To enforce this behavior, the top layer outputs three channels instead of one and employs a bounded nonlinearity instead of a ReLU.] Most significantly, we alter the network to compute the residuals with respect to a bicubic interpolation of the input [21]. The regression provides a comparison to a state-of-the-art convolutional network that performs a unimodal pixel independent prediction.
The GAN super resolution baseline [11] exploits a conditional GAN architecture, and combines an adversarial loss with a consistency loss, which encourages the low-resolution version of the predicted image to be close to the input. The weighting between the two losses is specified by [11], and we keep it the same in our face experiments.
[Figure: input, samples, ground truth.]

High resolution samples generated by the pixel recursive super resolution capture the rich structure of the dataset and appear perceptually plausible (Figure 1 and 4; Appendix B and C). Sampling from the super resolution model multiple times results in different high resolution images for a given low resolution image (Figure 5; Appendix B and C). Qualitatively, the samples from the model identify many plausible high resolution images with distinct qualitative features that correspond to a given lower resolution image. Note that the differences between samples for the faces dataset are far less drastic than seen in our synthetic dataset, where failure to cleanly predict modes indicated complete failure.
The quality of samples is sensitive to the temperature (Figure 6, right columns). Greedy decoding ($\tau \to 0$) results in poor quality samples that are overly smooth and contain horizontal and vertical line artifacts. Samples from the default temperature ($\tau = 1$) are perceptually more plausible, although they tend to contain undesired high frequency content. Tuning the temperature $\tau$ between these two extremes proves beneficial for improving the quality of the samples.
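[Figure 6: comparison of methods on example faces, with the negative log likelihood (NLL) of each image under the model. The three rightmost columns are model samples at different temperatures.]
Input | G. Truth | Nearest N. | GAN [11] | Bicubic | ResNet | Greedy | Sample | Sample | Sample
– | 2.85 | 2.74 | – | 1.76 | 2.34 | 1.82 | 2.94 | 2.79 | 2.69
– | 2.96 | 2.71 | – | 1.82 | 2.17 | 1.77 | 3.18 | 3.09 | 2.95
– | 2.76 | 2.63 | – | 1.80 | 2.35 | 1.64 | 2.99 | 2.90 | 2.64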
Many methods exist for quantifying image similarity that attempt to measure human perceptual judgements of similarity [45, 46, 28]. We quantified the prediction accuracy of our model compared to ground truth using pSNR and MS-SSIM (Table 1). We found that our own visual assessment of the predicted image quality did not correspond to these image similarity metrics. For instance, bicubic interpolation achieved relatively high metrics even though the samples appeared quite poor. This result matches recent observations that suggest that pSNR and SSIM provide poor judgements of super resolution quality when new details are synthesized [26, 18]. In addition, Figure 6 highlights how the perceptual quality of model samples does not necessarily correspond to negative log likelihood (NLL). Smaller NLL means the model has assigned that image a larger probability mass. The greedy, bicubic, and regression faces are preferred by the model despite exhibiting worse perceptual quality.
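To make the mismatch concrete, the sketch below computes pSNR with the standard definition, $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$, on a hypothetical four-pixel signal. A blurry average of two plausible edge positions scores higher than a sharp prediction whose edge is shifted by one pixel.

```python
import math

def psnr(pred, target, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two flattened images."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)

# Hypothetical 1-D "images": the ground truth has a sharp edge at index 1.
target = [0.0, 255.0, 0.0, 0.0]
shifted = [0.0, 0.0, 255.0, 0.0]   # sharp and plausible, but one pixel off
blurry = [0.0, 127.5, 127.5, 0.0]  # average of the two edge positions

print(psnr(blurry, target), psnr(shifted, target))
```

The metric rewards hedging between modes, which is consistent with bicubic interpolation and regression outputs scoring well despite looking poor.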
We next measured how well the high resolution samples corresponded to the low resolution input by measuring the consistency. The consistency is quantified as the distance between the low-resolution input image and a bicubic downsampled version of the high resolution estimate. Lower consistency values indicate superior correspondence with the low-resolution image. Note that this is an explicit objective of the GAN [11]. The pixel recursive model achieved consistencies on par with the regression model and bicubic interpolation, indicating that even though the model was producing diverse samples, the samples were largely constrained by the low-resolution image. Most importantly, the pixel recursive model achieved better consistency than the GAN [11] even though the model does not explicitly optimize for this criterion. [Footnote 3: Note that one may improve the consistency of the GAN by increasing its weight in the objective. Increasing the weight for the consistency term will likely lead to decreased perceptual quality in the images but improved consistency. Regardless, the images generated by the pixel recursive model are superior in both consistency and perceptual quality, as judged by humans, for a range of temperatures.]
The consistency measure additionally provided an important control experiment to determine if the pixel recursive model were just naively copying the nearest training sample. If the pixel recursive model were just copying the nearest training sample, then the consistency of the Nearest N. model would be equivalent to the pixel recursive model. We instead find that the pixel recursive model has superior consistency values indicating that the model is not just naively copying the closest training examples.
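A rough sketch of a consistency measure is shown below. It swaps in box (average) downsampling for the bicubic downsampling described above to stay dependency-free, and uses mean squared error as the distance; both choices, and the function names, are assumptions for illustration.

```python
def box_downsample(img, factor):
    """Average non-overlapping factor x factor blocks (stand-in for bicubic)."""
    rows, cols = len(img) // factor, len(img[0]) // factor
    return [
        [
            sum(
                img[r * factor + dr][c * factor + dc]
                for dr in range(factor)
                for dc in range(factor)
            ) / factor ** 2
            for c in range(cols)
        ]
        for r in range(rows)
    ]

def consistency(low_res, high_res_estimate, factor):
    """Mean squared error between the input and the downsampled estimate."""
    down = box_downsample(high_res_estimate, factor)
    diffs = [(a - b) ** 2 for ra, rb in zip(low_res, down) for a, b in zip(ra, rb)]
    return sum(diffs) / len(diffs)

low_res = [[1.0, 0.0]]
faithful = [[1.0, 1.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]    # downsamples to low_res
unfaithful = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]  # too little mass on the left

print(consistency(low_res, faithful, 2), consistency(low_res, unfaithful, 2))
```

Lower values mean the estimate agrees more closely with the low resolution input once it is shrunk back down.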
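[Table 1: image similarity, consistency, and human evaluation results. The three Pixel recursive rows in each block correspond to different sampling temperatures.]
CelebA | pSNR | SSIM | MS-SSIM | Consistency | % Fooled
Bicubic | 28.92 | 0.84 | 0.76 | 0.006 | –
Nearest N. | 28.18 | 0.73 | 0.66 | 0.024 | –
ResNet | 29.16 | 0.90 | 0.90 | 0.004 |
GAN [11] | 28.19 | 0.72 | 0.67 | 0.029 |
Pixel recursive | 29.09 | 0.84 | 0.86 | 0.008 |
Pixel recursive | 29.08 | 0.84 | 0.85 | 0.008 |
Pixel recursive | 29.08 | 0.84 | 0.86 | 0.008 |

LSUN | pSNR | SSIM | MS-SSIM | Consistency | % Fooled
Bicubic | 28.94 | 0.70 | 0.70 | 0.002 | –
Nearest N. | 28.15 | 0.49 | 0.45 | 0.040 | –
ResNet | 28.87 | 0.74 | 0.75 | 0.003 |
Pixel recursive | 28.92 | 0.58 | 0.60 | 0.016 |
Pixel recursive | 28.92 | 0.59 | 0.59 | 0.017 |
Pixel recursive | 28.93 | 0.59 | 0.58 | 0.018 |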
Given that automated quantitative measures did not match our perceptual judgements, we conducted a human study to assess the effectiveness of the super resolution algorithm. In particular, we performed a forced choice experiment on crowd-sourced workers in order to determine how plausible a given high resolution sample is from each model. Following [50], each worker was presented with a true image and a corresponding prediction from a model, and asked “Which image, would you guess, is from a camera?”. We performed this study across 283 workers on Amazon Mechanical Turk, and statistics were accrued across 40 unique workers for each super resolution algorithm. [Footnote 4: Specifically, each worker was given one second to make a forced choice decision. Workers began a session with 10 practice questions during which they received feedback. The practice pairs were not counted in the results. After the practice pairs, each worker was shown 45 additional pairs. A subset of the pairs were simple, golden questions designed to constantly check if the worker was paying attention. Data from workers that answered golden questions incorrectly were thrown out.]
Table 1 reports the percentage of samples for a given algorithm that a human incorrectly believed to be a real image. Note that a perfect algorithm would fool a human at a rate of 50%. The regression model fooled humans 2–4% of the time and the GAN [11] fooled humans 8.5% of the time. The pixel recursive model fooled humans 11.0% and 27.9% of the time for faces and bedrooms, respectively, significantly above the state-of-the-art regression model. Importantly, we found that the selection of the sampling temperature greatly influenced the quality of the samples and in turn the fraction of time that humans were fooled. Nevertheless, the pixel recursive model outperformed the strongest baseline model, the GAN, across all temperatures. A ranked list of the best and worst fooling examples is reproduced in Appendix D along with the fool rates.
We advocate research on super resolution with high magnification ratios, where the problem is dramatically underspecified as high frequency details are missing. Any model that produces non-blurry super resolution outputs must make sensible predictions of the missing content to operate in such a heavily multimodal regime. We present a fully probabilistic method that tackles super resolution with small inputs, demonstrating that even very small images can be enlarged to sharp high resolution images. Our technique outperforms several strong baselines, including ones optimizing a regression objective or an adversarial loss. We perform human evaluation studies showing that samples from the pixel recursive model look more plausible to humans, and more generally, that common metrics like pSNR and SSIM do not correlate with human judgment when the magnification ratio is large.
We thank Aäron van den Oord, Sander Dieleman, and the Google Brain team for insightful comments and discussions.
TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
Computer Vision and Pattern Recognition, 1999. IEEE Computer Society Conference on, volume 1. IEEE, 1999.
The Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, volume 15 of JMLR: W&CP, pages 29–37, 2011.
Pixel recurrent neural networks. ICML, 2016.

Operation | Kernel | Strides | Feature maps
Conditional network – input
ResNet block | | 1 |
Transposed Convolution | | 2 |
ResNet block | | 1 |
Transposed Convolution | | 2 |
ResNet block | | 1 |
Convolution | | 1 |
PixelCNN network – input
Masked Convolution | | 1 |
20 Gated Convolution Blocks | | 1 |
Masked Convolution | | 1 |
Masked Convolution | | 1 |

Optimizer | RMSProp (decay=0.95, momentum=0.9, epsilon=1e-8)
Batch size | 32
Iterations | 2,000,000 for bedrooms, 200,000 for faces
Learning rate | 0.0004, divided by 2 every 500,000 steps
Weight, bias initialization | truncated normal (stddev=0.1), Constant()
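The learning rate schedule listed above can be written as a step decay. The function below is a sketch under the assumption that the halving is applied at exact multiples of 500,000 steps; its name and signature are illustrative.

```python
def learning_rate(step, base=0.0004, decay_every=500_000, factor=0.5):
    """Step decay: multiply the base rate by `factor` every `decay_every` steps."""
    return base * factor ** (step // decay_every)

for step in (0, 499_999, 500_000, 1_000_000, 1_999_999):
    print(step, learning_rate(step))
```

Under this reading, the bedrooms run of 2,000,000 iterations would end at one eighth of the initial rate, while the faces run of 200,000 iterations would never reach the first decay.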
[Figure: Input, Bicubic, ResNet, Truth, Nearest N.]
[Figure: Input, Bicubic, ResNet, Truth, Nearest N., GAN [11].]
The best and worst rated images in the human study. The fractions below the images denote how many times a person chose that image over the ground truth.