We consider the task of generating high-resolution photo-realistic images from incomplete input such as a low-resolution image, a sketch, a surface normal map, or a label mask. Such a task has a number of practical applications such as upsampling/colorizing legacy footage, texture synthesis for graphics applications, and semantic image understanding for vision through analysis-by-synthesis. These problems share a common underlying structure: a human/machine is given a signal that is missing considerable details, and the task is to reconstruct plausible details. Consider the edge map of a cat in Figure 1-c. When we humans look at this edge map, we can easily imagine multiple variations of whiskers, eyes, and stripes that could be viable and pleasing to the eye. Indeed, the task of image synthesis has been well explored, not just for its practical applications but also for its aesthetic appeal.
GANs: Current state-of-the-art approaches rely on generative adversarial networks (GANs), and most relevant to us, conditional GANs that generate an image conditioned on an input signal [10, 35, 22]. We argue that there are two prominent limitations to such popular formalisms: (1) First and foremost, humans can imagine multiple plausible output images given an incomplete input. We see this rich space of potential outputs as a vital part of the human capacity to imagine and generate. Conditional GANs are in principle able to generate multiple outputs through the injection of noise, but in practice suffer from limited diversity (i.e., mode collapse) (Fig. 2). Recent approaches even remove the noise altogether, treating conditional image synthesis as a regression problem. (2) Deep networks are still difficult to explain or interpret, making the synthesized output difficult to modify. One implication is that users are not able to control the synthesized output. Moreover, the right mechanism for even specifying user constraints (e.g., “generate a cat image that looks like my cat”) is unclear. This restricts applicability, particularly for graphics tasks.
To address these limitations, we appeal to a classic learning architecture that can naturally allow for multiple outputs and user control: non-parametric models, or nearest neighbors (NN). Though quite a classic approach [11, 15, 20, 24], it has largely been abandoned in recent history with the advent of deep architectures. Intuitively, NN requires a large training set of pairs of (incomplete inputs, high-quality outputs), and works by simply matching an incomplete query to the training set and returning the corresponding output. This trivially generalizes to multiple outputs through K-NN and allows for intuitive user control through on-the-fly modification of the training set - e.g., by restricting the training exemplars to those that “look like my cat”. In practice, there are several limitations in applying NN for conditional image synthesis. The first is a practical lack of training data. The second is the lack of an obvious distance metric. And the last is the computational challenge of scaling search to large training sets.
Approach: To reduce the dependency on training data, we take a compositional approach by matching local pixels instead of global images. This allows us to synthesize a face by “copy-pasting” the eye of one training image, the nose of another, etc. Composition dramatically increases the representational power of our approach: given that we want to synthesize an image of $K$ pixels using $N$ training images (with $K$ pixels each), we can synthesize an exponential number $(NK)^K$ of compositions, versus a linear number $N$ of global matches. A significant challenge, however, is defining an appropriate feature descriptor for matching pixels in the incomplete input signal. We would like to capture context (such that whisker pixels are matched only to other whiskers) while allowing for compositionality (left-facing whiskers may match to right-facing whiskers). To do so, we make use of deep features, as described below.
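To make the counting argument concrete, here is a small worked example. It assumes that each of the $K$ output pixels may be copied independently from any pixel of any of the $N$ training images; the values $N = 2$ and $K = 4$ are illustrative only and do not come from the experiments.

```latex
\begin{align*}
\text{global matches} &= N = 2, \\
\text{candidate source pixels per output pixel} &= NK = 2 \cdot 4 = 8, \\
\text{compositions} &= (NK)^K = 8^4 = 4096.
\end{align*}
```

Even for this toy setting, compositional matching offers thousands of candidate outputs where global matching offers two.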
Pipeline: Our precise pipeline (Figure 3) works in two stages. (1) We first train an initial regressor (CNN) that maps the incomplete input into a single output image. This output image suffers from the aforementioned limitations - it is a single output that will tend to look like a “smoothed” average of all the potential images that could be generated. (2) We then perform nearest-neighbor queries on pixels from this regressed output. Importantly, pixels are matched (to regressed outputs from training data) using a multiscale deep descriptor that captures the appropriate level of context. This enjoys the aforementioned benefits - we can efficiently match to an exponential number of training examples in an interpretable and controllable manner. Finally, an interesting byproduct of our approach is the generation of dense, pixel-level correspondences from the training set to the final synthesized outputs.
2 Related Work
Our work is inspired by a large body of work on discriminative and generative models, nearest neighbors architectures, pixel-level tasks, and dense pixel-level correspondences. We provide a broad overview, focusing on those most relevant to our approach.
, depth and surface normal estimation [2, 1, 13, 12], semantic boundary detection [1, 45], etc. Such networks are usually trained using standard losses (such as softmax or regression) on image-label data pairs. However, such networks do not typically perform well for the inverse problem of image synthesis from an (incomplete) label, though exceptions do exist. A major innovation was the introduction of adversarially-trained generative networks (GANs). This formulation was hugely influential in computer vision, having been applied to various image generation tasks that condition on a low-resolution image [10, 27], segmentation mask, surface normal map, and other inputs [7, 21, 35, 44, 47, 51]. Most related to us is Isola et al., who propose a general loss function for adversarial learning, applying it to a diverse set of image synthesis tasks.
Interpretability and user-control: Interpreting and explaining the outputs of generative deep networks is an open problem. As a community, we do not have a clear understanding of what, where, and how outputs are generated. Our work is fundamentally based on copy-pasting information via nearest neighbors, which explicitly reveals how each pixel-level output is generated (by in turn revealing where it was copied from). This makes our synthesized outputs quite interpretable. One important consequence is the ability to intuitively edit and control the process of synthesis. Zhu et al. provide a user with controls for editing properties of an image such as color and outline. But instead of using a predefined set of editing operations, we allow a user to have an arbitrarily fine level of control through on-the-fly editing of the exemplar set (e.g., “resynthesize an image using the eye from this image and the nose from that one”).
Correspondence: An important byproduct of pixelwise NN is the generation of pixelwise correspondences between the synthesized output and training examples. Establishing such pixel-level correspondence has been one of the core challenges in computer vision [9, 25, 29, 32, 43, 49, 50]. Tappen et al. use SIFT flow to hallucinate details for image super-resolution. Zhou et al. propose a CNN to predict appearance flow that can be used to transfer information from input views to synthesize a new view. Kanazawa et al. generate 3D reconstructions by training a CNN to learn correspondence between object instances. Our work follows from the crucial observation of Long et al., who suggest that features from pre-trained convnets can also be used for pixel-level correspondences. In this work, we make an additional empirical observation: hypercolumn features trained for semantic segmentation learn nuances and details better than those trained for image classification. This finding helped us to establish semantic correspondences between the pixels in query and training images, and enabled us to extract high-frequency information from the training examples to synthesize a new image from a given input.
Nonparametrics: Our work closely follows data-driven approaches that make use of nearest neighbors [11, 19, 15, 24, 37, 39]. Hays and Efros match a query image to 2 million training images for various tasks such as image completion. We make use of dramatically smaller training sets by allowing for compositional matches. Liu et al. propose a two-step pipeline for face hallucination where global constraints capture overall structure, and local constraints produce photorealistic local features. While they focus on the task of facial super-resolution, we address a variety of synthesis applications. Finally, our compositional approach is inspired by Boiman and Irani [4, 5], who reconstruct a query image via compositions of training examples.
3 PixelNN: One-to-Many Mappings
We define the problem of conditional image synthesis as follows: given an input $x$ to be conditioned on (such as an edge map, normal depth map, or low-resolution image), synthesize a high-quality output image(s). To describe our approach, we focus on the illustrative task of image super-resolution, where the input is a low-resolution image. We assume we are given training pairs of inputs/outputs, written as $(x_n, y_n)$. The simplest approach would be formulating this task as a (nonlinear) regression problem:

$$\min_{w} \; \|w\|^2 + \sum_n \| y_n - f(x_n; w) \|^2 \tag{1}$$

where $f(x_n; w)$ refers to the output of an arbitrary (possibly nonlinear) regressor parameterized with $w$. In our formulation, we use a fully-convolutional neural net, specifically PixelNet, as our nonlinear regressor. For our purposes, this regressor could be any trainable black-box mapping function. But crucially, such functions generate one-to-one mappings, while our underlying thesis is that conditional image synthesis should generate many mappings from an input. By treating synthesis as a regression problem, it is well-known that outputs tend to be over-smoothed. In the context of the image colorization task (where the input is a grayscale image), such outputs tend to be desaturated [26, 48].
Frequency analysis: Let us analyze this smoothing a bit further. Predicted outputs $f(x)$ (we drop the dependence on $w$ to simplify notation) are particularly straightforward to analyze in the context of super-resolution (where the conditional input is a low-resolution image). Given a low-resolution image of a face, there may exist multiple textures (e.g., wrinkles) or subtle shape cues (e.g., of local features such as noses) that could be reasonably generated as output. In practice, this set of outputs tends to be “blurred” into a single output returned by a regressor. This can be readily seen in a frequency analysis of the input, output, and original target image (Fig. 4). In general, we see that the regressor generates mid-frequencies fairly well, but fails to return much high-frequency content. We make the operational assumption that a single output suffices for mid-frequency output, but multiple outputs are required to capture the space of possible high-frequency textures.
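The kind of frequency analysis described here can be sketched as a radially averaged power spectrum. The snippet below is a minimal illustration, not the procedure used to produce Fig. 4: the function name `radial_power_spectrum`, the number of bands, and the use of a 3x3 box blur on random noise as a stand-in for an over-smoothed regressor output are all assumptions made for the example.

```python
import numpy as np

def radial_power_spectrum(image, num_bins=16):
    """Average FFT power in concentric frequency bands, ordered low to high."""
    image = np.asarray(image, dtype=np.float64)
    h, w = image.shape
    # Centered 2D power spectrum.
    power = np.abs(np.fft.fftshift(np.fft.fft2(image))) ** 2
    # Radial frequency of every FFT bin, in cycles per pixel.
    fy = np.fft.fftshift(np.fft.fftfreq(h))[:, None]
    fx = np.fft.fftshift(np.fft.fftfreq(w))[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    # Bands span 0 to 0.5 cycles per pixel; corner frequencies above 0.5
    # are clipped into the last band.
    edges = np.linspace(0.0, 0.5, num_bins + 1)
    band = np.clip(np.digitize(radius, edges) - 1, 0, num_bins - 1)
    total = np.bincount(band.ravel(), weights=power.ravel(), minlength=num_bins)
    count = np.bincount(band.ravel(), minlength=num_bins)
    return total / np.maximum(count, 1)

rng = np.random.default_rng(0)
target = rng.standard_normal((64, 64))  # hypothetical "original" image
# Hypothetical over-smoothed output: a 3x3 circular box blur of the target.
smoothed = sum(np.roll(np.roll(target, dy, 0), dx, 1)
               for dy in (-1, 0, 1) for dx in (-1, 0, 1)) / 9.0

spec_target = radial_power_spectrum(target)
spec_smoothed = radial_power_spectrum(smoothed)
# Fraction of the target's energy retained in the lowest and highest bands.
low_ratio = spec_smoothed[:2].sum() / spec_target[:2].sum()
high_ratio = spec_smoothed[-4:].sum() / spec_target[-4:].sum()
print(f"low-band energy kept: {low_ratio:.2f}, high-band energy kept: {high_ratio:.2f}")
```

In this toy setting the smoothed image keeps most of its low-frequency energy and loses most of its high-frequency energy, which is the qualitative pattern the paragraph attributes to regression outputs.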
To capture multiple possible outputs, we appeal to classic non-parametric approaches in computer vision. We note that a simple K-nearest-neighbor (KNN) algorithm has the trivial ability to report back $K$ outputs. However, rather than using a KNN model to return an entire image, we can use it to predict the (multiple possible) high frequencies missing from $f(x)$:

$$\mathrm{Global}(x) = f(x) + \big( y_k - f(x_k) \big), \quad \text{where } k = \arg\min_n \mathrm{Dist}\big( f(x), f(x_n) \big) \tag{2}$$

where $\mathrm{Dist}$ is some distance function measuring similarity between two (mid-frequency) reconstructions. To generate multiple outputs, one can report back the $K$ best matches from the training set instead of the overall best match.
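A minimal NumPy sketch of this global matching step follows. It assumes the regressor outputs for the query and for all training inputs have already been computed, and it uses Euclidean distance between flattened outputs as a stand-in for the distance function; the names `global_nn_outputs`, `f_query`, `f_train`, and `y_train` are illustrative and not from the paper.

```python
import numpy as np

def global_nn_outputs(f_query, f_train, y_train, k=3):
    """Return k candidate outputs, each the query's regressed image plus the
    residual (target minus regressed image) of one of its k nearest
    training examples, together with the indices of those examples."""
    n = f_train.shape[0]
    # Assumed distance: Euclidean distance between flattened regressor outputs.
    dists = np.linalg.norm(f_train.reshape(n, -1) - f_query.reshape(1, -1), axis=1)
    nearest = np.argsort(dists)[:k]
    # Each residual holds the high frequencies the regressor missed.
    residuals = y_train[nearest] - f_train[nearest]
    return f_query[None] + residuals, nearest

rng = np.random.default_rng(0)
f_train = rng.standard_normal((5, 8, 8))                 # regressed training images
y_train = f_train + 0.1 * rng.standard_normal((5, 8, 8)) # their ground-truth targets
f_query = f_train[2].copy()                              # query identical to example 2

outputs, nearest = global_nn_outputs(f_query, f_train, y_train, k=3)
```

Because the query here coincides with training example 2, the first candidate reproduces that example's target exactly; the remaining candidates add other examples' residuals to the same regressed image.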
Compositional Matching: However, the above is limited to reporting back high-frequency images in the training set. As we previously argued, we can synthesize a much larger set of outputs by copying and pasting (high-frequency) patches from the training set. To allow for such compositional matchings, we simply match individual pixels rather than global images. Writing $f_i(x)$ for the $i^{th}$ pixel in the reconstructed image, the final composed output can be written as:

$$\mathrm{Comp}_i(x) = f_i(x) + \big( y_{jk} - f_j(x_k) \big), \quad \text{where } (j, k) = \arg\min_{m,n} \mathrm{Dist}\big( f_i(x), f_m(x_n) \big) \tag{3}$$

where $y_{jk}$ refers to output pixel $j$ in training example $k$.
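The pixel-wise composition can be sketched in NumPy as below. This is an illustration under several assumptions: each pixel already has a descriptor vector, squared Euclidean distance stands in for the distance function, and the search is exhaustive over all pixels of all training images (no speed-ups). The names `compositional_output`, `desc_query`, and `desc_train` are hypothetical.

```python
import numpy as np

def compositional_output(f_query, desc_query, f_train, y_train, desc_train):
    """For every query pixel, copy the residual of its best-matching training
    pixel, searched over all pixels of all training images.
    Shapes: f_query (h, w, c); f_train, y_train (n, h, w, c);
            desc_query (h, w, d); desc_train (n, h, w, d)."""
    h, w = f_query.shape[:2]
    n = f_train.shape[0]
    d = desc_query.shape[-1]
    q = desc_query.reshape(h * w, d)
    t = desc_train.reshape(n * h * w, d)
    # Squared Euclidean distance between every query and training pixel.
    dist = (q ** 2).sum(1)[:, None] - 2.0 * q @ t.T + (t ** 2).sum(1)[None, :]
    match = dist.argmin(axis=1)  # flat index encoding (image, row, col)
    residual = (y_train - f_train).reshape(n * h * w, -1)[match]
    out = f_query.reshape(h * w, -1) + residual
    # The unravelled indices are the dense pixel-level correspondences.
    return out.reshape(f_query.shape), np.unravel_index(match, (n, h, w))

rng = np.random.default_rng(0)
desc_train = rng.standard_normal((2, 4, 4, 8))
f_train = np.zeros((2, 4, 4, 1))
y_train = np.stack([np.full((4, 4, 1), 1.0), np.full((4, 4, 1), 2.0)])
# Query whose left half resembles training image 0 and right half image 1.
desc_query = desc_train[0].copy()
desc_query[:, 2:] = desc_train[1][:, 2:]
f_query = np.zeros((4, 4, 1))

out, (img_idx, row_idx, col_idx) = compositional_output(
    f_query, desc_query, f_train, y_train, desc_train)
img_idx = img_idx.reshape(4, 4)
```

The toy output is stitched from two different training images, something a global match cannot produce, and `img_idx` records which image each pixel was copied from.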
Distance functions: A crucial question in non-parametric matching is the choice of distance function. To compare global images, contemporary approaches tend to learn a deep embedding where similarity is preserved [3, 8, 31]. Distance functions for pixels are much more subtle (Eq. 3). In theory, one could also learn a metric for pixel matching, but this requires large-scale training data with dense pixel-level correspondences.
Pixel representations: Suppose we are trying to generate the left corner of an eye. If our distance function takes into account only local information around the corner, we might mistakenly match to the other eye or mouth. If our distance function takes into account only global information, then compositional matching reduces to global (exemplar) matching. Instead, we exploit the insight from previous works that different layers of a deep network tend to capture different amounts of spatial context (due to varying receptive fields) [1, 2, 18, 36, 38]. Hypercolumn descriptors aggregate such information across multiple layers into a highly accurate, multi-scale pixel representation (visualized in Fig. 3). We construct a pixel descriptor using features from conv- for a PixelNet model trained for semantic segmentation (on PASCAL Context). To measure pixel similarity, we compute cosine distances between two descriptors. We visualize the compositional matches (and associated correspondences) in Figure 5. Finally, Figure 6, Figure 7, and Figure 8 show the output of our approach for various input modalities.
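A rough sketch of how a multi-scale pixel descriptor and its cosine distance could be computed is given below. It is a simplification under stated assumptions: the feature maps are random stand-ins rather than PixelNet activations, upsampling uses nearest-neighbor lookup rather than interpolation, and the function names `hypercolumns` and `cosine_distance` are illustrative.

```python
import numpy as np

def hypercolumns(feature_maps, out_hw):
    """Upsample each (h_l, w_l, c_l) feature map to out_hw with
    nearest-neighbor lookup and concatenate along the channel axis,
    giving one multi-scale descriptor per output pixel."""
    H, W = out_hw
    upsampled = []
    for fmap in feature_maps:
        h, w = fmap.shape[:2]
        row_idx = (np.arange(H) * h) // H  # source row for each output row
        col_idx = (np.arange(W) * w) // W  # source column for each output column
        upsampled.append(fmap[row_idx][:, col_idx])
    return np.concatenate(upsampled, axis=-1)

def cosine_distance(a, b, eps=1e-8):
    """1 - cosine similarity along the last axis."""
    a = a / (np.linalg.norm(a, axis=-1, keepdims=True) + eps)
    b = b / (np.linalg.norm(b, axis=-1, keepdims=True) + eps)
    return 1.0 - (a * b).sum(-1)

rng = np.random.default_rng(0)
# Hypothetical activations from three layers with shrinking resolution.
maps = [rng.standard_normal((8, 8, 4)),
        rng.standard_normal((4, 4, 6)),
        rng.standard_normal((2, 2, 10))]
desc = hypercolumns(maps, (8, 8))
d_same = cosine_distance(desc[0, 0], desc[0, 0])
d_opposite = cosine_distance(desc[0, 0], -desc[0, 0])
```

The fine-resolution channels differ between neighboring pixels while the coarse channels are shared across a larger region, which is how a single descriptor can carry both local detail and wider context.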
Efficient search: We have so far avoided the question of run-time while doing nearest neighbor search in pixel-space. Run-time performance is another reason why generative models are more popular than nearest neighbors. To speed up search, we made some non-linear approximations: given a reconstructed image $f(x)$, we first (1) find the global $K$ nearest neighbors using conv-5 features and then (2) search for pixel-wise matches only in a $T \times T$ pixel window around pixel $i$ in this set of $K$ images. In practice, we vary $K$ from and $T$ from and generate 72 candidate outputs for a given input. Because the size of the synthesized image is , our search parameters include both a fully-compositional output and a fully global exemplar match as candidate outputs. Figure 9, Figure 10, and Figure 11 show examples of multiple outputs generated using our approach by simply varying these parameters.
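The two-stage search can be sketched as follows. This is an illustrative implementation under assumptions: image-level features are arbitrary vectors standing in for conv-5 features, the window is centered on the query pixel and clipped at image borders, and the names `two_stage_match`, `K`, and `T` (shortlist size and window width) are chosen for the example.

```python
import numpy as np

def unit(a, eps=1e-8):
    """Normalize descriptors to (approximately) unit length."""
    return a / (np.linalg.norm(a, axis=-1, keepdims=True) + eps)

def two_stage_match(desc_query, desc_train, glob_query, glob_train, K=2, T=3):
    """Stage 1: shortlist the K globally closest training images.
    Stage 2: for each query pixel, find the pixel with the smallest cosine
    distance inside a T x T window (clipped at borders) of the shortlist.
    Returns an (h, w, 3) array of (image, row, col) matches."""
    h, w, _ = desc_query.shape
    gdist = np.linalg.norm(glob_train - glob_query[None], axis=1)
    shortlist = np.argsort(gdist)[:K]
    q, t = unit(desc_query), unit(desc_train[shortlist])  # t: (K, h, w, d)
    r = T // 2
    match = np.zeros((h, w, 3), dtype=np.int64)
    for y in range(h):
        for x in range(w):
            y0, y1 = max(0, y - r), min(h, y + r + 1)
            x0, x1 = max(0, x - r), min(w, x + r + 1)
            dist = 1.0 - t[:, y0:y1, x0:x1] @ q[y, x]     # (K, win_h, win_w)
            k, dy, dx = np.unravel_index(dist.argmin(), dist.shape)
            match[y, x] = (shortlist[k], y0 + dy, x0 + dx)
    return match

rng = np.random.default_rng(0)
desc_train = rng.standard_normal((3, 6, 6, 8))
glob_train = rng.standard_normal((3, 4))
# Query identical to training image 1, so the expected matches are known.
desc_query, glob_query = desc_train[1].copy(), glob_train[1].copy()

exemplar = two_stage_match(desc_query, desc_train, glob_query, glob_train, K=1, T=1)
windowed = two_stage_match(desc_query, desc_train, glob_query, glob_train, K=2, T=3)
```

With `K=1` and `T=1` every pixel is copied from the same location of the single closest image, which amounts to a global exemplar match; larger `K` and `T` allow progressively more compositional outputs at a cost that grows with `K * T * T` per pixel.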
4 Experiments
We now present our findings for multiple modalities such as a low-resolution image ( image), a surface normal map, and edges/boundaries for domains such as human faces, cats, dogs, handbags, and shoes. We compare our approach both quantitatively and qualitatively with the recent work of Isola et al., who use generative adversarial networks for pixel-to-pixel translation.
Dataset: We conduct experiments for human faces, cats and dogs, shoes, and handbags using various modalities.
Human Faces: We use images from the training set of the CUHK CelebA dataset to train a regression model and do nearest neighbors. We used the subset of test images to evaluate our approach. The images were resized to following Gucluturk et al.
Cats and Dogs: We use images of cats and dogs from the Oxford-IIIT Pet dataset. Of these images were used for training, and remaining for evaluation. We used the bounding box annotations made available by Parkhi et al. to extract the heads of the cats and dogs.
For human faces, and cats and dogs, we used the pre-trained PixelNet  to extract surface normal and edge maps. We did not do any post-processing (NMS) to the outputs of edge detection.
Shoes & Handbags: We followed Isola et al. for this setting. training images of shoes were used from , and images of Amazon handbags from . The edge maps for this data were computed using HED by Isola et al.
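Table 1
| Domain | Method | Mean | Median | RMSE | 11.25° | 22.5° | 30° | AP |
|---|---|---|---|---|---|---|---|---|
| Human Faces | Pix-to-Pix (Oracle) | 15.8 | 13.1 | 19.4 | 41.9 | 78.5 | 89.3 | 0.34 |
| Cats and Dogs | Pix-to-Pix (Oracle) | 13.2 | 11.4 | 15.7 | 49.2 | 87.1 | 95.3 | 0.85 |

Table 2
| Domain | Method | AP | Mean | Median | RMSE | 11.25° | 22.5° | 30° |
|---|---|---|---|---|---|---|---|---|
| Human Faces | Pix-to-Pix (Oracle) | 0.35 | 11.5 | 9.1 | 14.6 | 61.1 | 89.7 | 95.6 |
| Cats and Dogs | Pix-to-Pix (Oracle) | 0.81 | 16.5 | 14.2 | 19.8 | 37.2 | 76.4 | 89.0 |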
Quantitative Evaluation: We quantitatively evaluate our approach by measuring whether our generated outputs for human faces, cats, and dogs can be used to determine surface normals and edges with an off-the-shelf trained PixelNet model for surface normal estimation and edge detection. The outputs from the real images are considered as ground truth for evaluation, as this gives an indication of how far we are from them. A somewhat similar approach is used by Isola et al. to measure their synthesized cityscape outputs and compare against the output using real-world images, and by Wang and Gupta for object detection evaluation.
We compute six statistics, previously used by [2, 12, 14, 41], over the angular error between the normals from a synthesized image and the normals from a real image to evaluate the performance: Mean, Median, RMSE, 11.25°, 22.5°, and 30°. The first three criteria capture the mean, median, and RMSE of angular error, where lower is better. The last three criteria capture the percentage of pixels within a given angular error, where higher is better. We evaluate the edge detection performance using average precision (AP).
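The six normal-estimation statistics can be computed as in the sketch below. It assumes both normal maps are arrays of 3-vectors of the same shape and that errors are measured in degrees; the function name `normal_error_stats` and the dictionary keys are illustrative choices, and the toy input is constructed so the expected values are easy to check.

```python
import numpy as np

def normal_error_stats(pred, ref):
    """Angular-error statistics in degrees between two (..., 3) normal maps:
    mean, median, RMSE (lower is better) and the percentage of pixels with
    error below 11.25, 22.5, and 30 degrees (higher is better)."""
    pred = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
    ref = ref / np.linalg.norm(ref, axis=-1, keepdims=True)
    cos = np.clip((pred * ref).sum(-1), -1.0, 1.0)
    err = np.degrees(np.arccos(cos)).ravel()
    stats = {"mean": err.mean(),
             "median": np.median(err),
             "rmse": np.sqrt((err ** 2).mean())}
    for t in (11.25, 22.5, 30.0):
        stats[f"within_{t}"] = 100.0 * (err < t).mean()
    return stats

# Toy example: half the pixels agree exactly, half are off by 20 degrees.
ref = np.tile(np.array([0.0, 0.0, 1.0]), (4, 4, 1))
pred = ref.copy()
a = np.radians(20.0)
pred[:2] = np.array([0.0, np.sin(a), np.cos(a)])
stats = normal_error_stats(pred, ref)
```

For this input the mean and median error are both 10 degrees, half of the pixels fall within 11.25 degrees, and all of them fall within 22.5 and 30 degrees.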
Table 1 and Table 2 quantitatively show the performance of our approach compared with Pix-to-Pix. Our approach generates multiple outputs and we do not have any direct way of ranking the outputs; therefore, we show the performance using a random selection of one of the 72 outputs, and an oracle selecting the best output. To do a fair comparison, we ran trained models for Pix-to-Pix 72 times and used an oracle for selecting the best output as well. We observe that our approach generates more diverse outputs, as performance improves significantly from a random selection to the oracle when compared with Isola et al. Our approach, though based on simple nearest neighbors, achieves results that are quantitatively and qualitatively competitive with (and often better than) state-of-the-art models based on GANs, and produces outputs close to natural images.
Controllable synthesis: Finally, NN provides a user with intuitive control over the synthesis process. We explore a simple approach based on on-the-fly pruning of the training set. Instead of matching to the entire training library, a user can specify a subset of relevant training examples. Figure 13 shows an example of controllable synthesis. A user “instructs” the system to generate an image that looks like a particular cat breed by either denoting the subset of training exemplars (e.g., through a subcategory label), or providing an image that can be used to construct an on-the-fly neighbor set.
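On-the-fly pruning of the training set amounts to restricting the search to a user-chosen subset of exemplars. The sketch below illustrates this with subcategory labels; the function name `nearest_in_subset`, the toy features, and the breed labels are all hypothetical, and Euclidean distance is an assumed stand-in for whatever matching distance is in use.

```python
import numpy as np

def nearest_in_subset(query, train_feats, labels, allowed):
    """Index of the nearest training exemplar, searching only exemplars
    whose label is in the user-specified set `allowed`."""
    keep = np.flatnonzero(np.isin(labels, list(allowed)))
    if keep.size == 0:
        raise ValueError("no training exemplars match the requested subset")
    dists = np.linalg.norm(train_feats[keep] - query, axis=1)
    return int(keep[dists.argmin()])

train_feats = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [6.0, 6.0]])
labels = np.array(["tabby", "tabby", "siamese", "siamese"])
query = np.array([0.9, 0.9])

unrestricted = nearest_in_subset(query, train_feats, labels, {"tabby", "siamese"})
restricted = nearest_in_subset(query, train_feats, labels, {"siamese"})
```

Without a restriction the query matches exemplar 1; once the user limits the library to one breed, the match switches to the closest exemplar of that breed (index 2), and no retraining is involved.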
Failure cases: Our approach mostly fails when there are no suitable nearest neighbors to extract the information from. Figure 14 shows some example failure cases of our approach. One way to deal with this problem is to do an exhaustive pixel-wise nearest neighbor search, but that would increase the run-time to generate the output. We believe that system-level optimizations such as Scanner (https://github.com/scanner-research/scanner) may potentially be useful in improving the run-time performance for pixel-wise nearest neighbors.
5 Discussion
We present a simple approach to image synthesis based on compositional nearest neighbors. Our approach somewhat suggests that GANs themselves may operate in a compositional “copy-and-paste” fashion. Indeed, examining the impressive outputs of recent synthesis methods suggests that some amount of local memorization is happening. However, by making this process explicit, our system is able to naturally generate multiple outputs, while being interpretable and amenable to user constraints. An interesting byproduct of our approach is dense pixel-level correspondences. If training images are augmented with semantic label masks, these labels can be transferred using our correspondences, implying that our approach may also be useful for image analysis through label transfer.
-  A. Bansal, X. Chen, B. Russell, A. Gupta, and D. Ramanan. Pixelnet: Representation of the pixels, by the pixels, and for the pixels. arXiv:1702.06506, 2017.
-  A. Bansal, B. Russell, and A. Gupta. Marr Revisited: 2D-3D model alignment via surface normal prediction. In CVPR, 2016.
-  S. Bell and K. Bala. Learning visual similarity for product design with convolutional neural networks. ACM Transactions on Graphics (Proceedings of SIGGRAPH), 2015.
-  O. Boiman and M. Irani. Similarity by composition. In NIPS, 2006.
-  O. Boiman and M. Irani. Detecting irregularities in images and in video. IJCV, 2007.
-  Q. Chen and V. Koltun. Photographic image synthesis with cascaded refinement networks. arXiv preprint arXiv:1707.09405, 2017.
-  X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. CoRR, abs/1606.03657, 2016.
-  S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to face verification. In CVPR, 2005.
-  C. B. Choy, J. Gwak, S. Savarese, and M. Chandraker. Universal correspondence network. In NIPS, 2016.
-  E. L. Denton, S. Chintala, A. Szlam, and R. Fergus. Deep generative image models using a laplacian pyramid of adversarial networks. CoRR, abs/1506.05751, 2015.
-  A. A. Efros and T. K. Leung. Texture synthesis by non-parametric sampling. In IEEE ICCV, 1999.
-  D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In ICCV, 2015.
-  D. Eigen, D. Krishnan, and R. Fergus. Restoring an image taken through a window covered with dirt or rain. In ICCV, pages 633–640, 2013.
-  D. F. Fouhey, A. Gupta, and M. Hebert. Data-driven 3D primitives for single image understanding. In ICCV, 2013.
-  W. T. Freeman, T. R. Jones, and E. C. Pasztor. Example-based super-resolution. IEEE Comput. Graph. Appl., 22(2):56–65, Mar. 2002.
-  I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio. Generative adversarial networks. CoRR, abs/1406.2661, 2014.
-  Y. Gucluturk, U. Guclu, R. van Lier, and M. A. J. van Gerven. Convolutional sketch inversion. In ECCV, 2016.
-  B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In CVPR, 2015.
-  J. Hays and A. A. Efros. Scene completion using millions of photographs. ACM Transactions on Graphics (SIGGRAPH 2007), 26(3), 2007.
-  A. Hertzmann, C. E. Jacobs, N. Oliver, B. Curless, and D. H. Salesin. Image analogies. In Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH ’01. ACM, 2001.
-  X. Huang, Y. Li, O. Poursaeed, J. E. Hopcroft, and S. J. Belongie. Stacked generative adversarial networks. CoRR, abs/1612.04357, 2016.
-  P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. arxiv, 2016.
-  J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pages 694–711. Springer, 2016.
-  M. K. Johnson, K. Dale, S. Avidan, H. Pfister, W. T. Freeman, and W. Matusik. Cg2real: Improving the realism of computer generated images using a large collection of photographs. IEEE Transactions on Visualization and Computer Graphics, 17(9):1273–1285, 2011.
-  A. Kanazawa, D. W. Jacobs, and M. Chandraker. Warpnet: Weakly supervised matching for single-view reconstruction. CoRR, abs/1604.05592, 2016.
-  G. Larsson, M. Maire, and G. Shakhnarovich. Learning representations for automatic colorization. In European Conference on Computer Vision (ECCV), 2016.
-  C. Ledig, L. Theis, F. Huszar, J. Caballero, A. P. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi. Photo-realistic single image super-resolution using a generative adversarial network. CoRR, abs/1609.04802, 2016.
-  C. Liu, H. Shum, and W. T. Freeman. Face hallucination: Theory and practice. IJCV, 75(1):115–134, 2007.
-  C. Liu, J. Yuen, and A. Torralba. Sift flow: Dense correspondence across scenes and its applications. IEEE Trans. Pattern Anal. Mach. Intell., 2011.
-  Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015.
-  J. Long, E. Shelhamer, and T. Darrell. Fully convolutional models for semantic segmentation. In CVPR, 2015.
-  J. Long, N. Zhang, and T. Darrell. Do convnets learn correspondence? In NIPS, 2014.
-  R. Mottaghi, X. Chen, X. Liu, N.-G. Cho, S.-W. Lee, S. Fidler, R. Urtasun, and A. Yuille. The role of context for object detection and semantic segmentation in the wild. In CVPR, 2014.
-  O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. V. Jawahar. Cats and dogs. In IEEE Conference on Computer Vision and Pattern Recognition, 2012.
-  A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR, abs/1511.06434, 2015.
-  T. Raiko, H. Valpola, and Y. LeCun. In AISTATS, volume 22, pages 924–932, 2012.
-  L. Ren, A. Patrick, A. A. Efros, J. K. Hodgins, and J. M. Rehg. A data-driven approach to quantifying natural human motion. ACM Trans. Graph., 24(3):1090–1097, July 2005.
-  P. Sermanet, K. Kavukcuoglu, S. Chintala, and Y. LeCun. Pedestrian detection with unsupervised multi-stage feature learning. In CVPR. IEEE, 2013.
-  A. Shrivastava, T. Malisiewicz, A. Gupta, and A. A. Efros. Data-driven visual similarity for cross-domain image matching. ACM Transaction of Graphics (TOG) (Proceedings of ACM SIGGRAPH ASIA), 30(6), 2011.
-  M. F. Tappen and C. Liu. A bayesian approach to alignment-based image hallucination. In ECCV, 2012.
-  X. Wang, D. Fouhey, and A. Gupta. Designing deep networks for surface normal estimation. In CVPR, 2015.
-  X. Wang and A. Gupta. Generative image modeling using style and structure adversarial networks. In ECCV, 2016.
-  L. Wei, Q. Huang, D. Ceylan, E. Vouga, and H. Li. Dense human body correspondences using convolutional networks. CoRR, abs/1511.05904, 2015.
-  J. Wu, C. Zhang, T. Xue, W. T. Freeman, and J. B. Tenenbaum. Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. In Advances in Neural Information Processing Systems, pages 82–90, 2016.
-  S. Xie and Z. Tu. Holistically-nested edge detection. In ICCV, 2015.
-  A. Yu and K. Grauman. Fine-Grained Visual Comparisons with Local Learning. In Computer Vision and Pattern Recognition (CVPR), June 2014.
-  H. Zhang, T. Xu, H. Li, S. Zhang, X. Huang, X. Wang, and D. N. Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. CoRR, abs/1612.03242, 2016.
-  R. Zhang, P. Isola, and A. A. Efros. Colorful image colorization. ECCV, 2016.
-  T. Zhou, P. Krähenbühl, M. Aubry, Q. Huang, and A. A. Efros. Learning dense correspondence via 3d-guided cycle consistency. In Computer Vision and Pattern Recognition (CVPR), 2016.
-  T. Zhou, S. Tulsiani, W. Sun, J. Malik, and A. A. Efros. View synthesis by appearance flow. In European Conference on Computer Vision, 2016.
-  J. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. CoRR, abs/1703.10593, 2017.
-  J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros. Generative visual manipulation on the natural image manifold. In Proceedings of European Conference on Computer Vision (ECCV), 2016.