PixelNN: Example-based Image Synthesis

08/17/2017 ∙ by Aayush Bansal, et al. ∙ Carnegie Mellon University

We present a simple nearest-neighbor (NN) approach that synthesizes high-frequency photorealistic images from an "incomplete" signal such as a low-resolution image, a surface normal map, or edges. Current state-of-the-art deep generative models designed for such conditional image synthesis lack two important things: (1) they are unable to generate a large set of diverse outputs, due to the mode collapse problem; and (2) they are not interpretable, making it difficult to control the synthesized output. We demonstrate that NN approaches potentially address such limitations, but suffer in accuracy on small datasets. We design a simple pipeline that combines the best of both worlds: the first stage uses a convolutional neural network (CNN) to map the input to an (overly-smoothed) image, and the second stage uses a pixel-wise nearest neighbor method to map the smoothed output to multiple high-quality, high-frequency outputs in a controllable manner. We demonstrate our approach for various input modalities, and for various domains ranging from human faces to cats-and-dogs to shoes and handbags.




1 Introduction

We consider the task of generating high-resolution photo-realistic images from incomplete input such as a low-resolution image, sketches, a surface normal map, or a label mask. Such a task has a number of practical applications such as upsampling/colorizing legacy footage, texture synthesis for graphics applications, and semantic image understanding for vision through analysis-by-synthesis. These problems share a common underlying structure: a human/machine is given a signal that is missing considerable details, and the task is to reconstruct plausible details. Consider the edge map of a cat in Figure 1-c. When we humans look at this edge map, we can easily imagine multiple variations of whiskers, eyes, and stripes that could be viable and pleasing to the eye. Indeed, the task of image synthesis has been well explored, not just for its practical applications but also for its aesthetic appeal.

GANs: Current state-of-the-art approaches rely on generative adversarial networks (GANs) [16], and most relevant to us, conditional GANs that generate an image conditioned on an input signal [10, 35, 22]. We argue that there are two prominent limitations to such popular formalisms: (1) First and foremost, humans can imagine multiple plausible output images given an incomplete input. We see this rich space of potential outputs as a vital part of the human capacity to imagine and generate. Conditional GANs are in principle able to generate multiple outputs through the injection of noise, but in practice suffer from limited diversity (i.e., mode collapse) (Fig. 2). Recent approaches even remove the noise altogether, treating conditional image synthesis as a regression problem [6]. (2) Deep networks are still difficult to explain or interpret, making the synthesized output difficult to modify. One implication is that users are not able to control the synthesized output. Moreover, the right mechanism for even specifying user constraints (e.g., “generate a cat image that looks like my cat”) is unclear. This restricts applicability, particularly for graphics tasks.

Figure 2: Mode collapse problem for GANs: We ran the pix-to-pix pipeline of Isola et al. [22] 72 times. Despite the random noise set using dropout at test time, we observe similar output generated each time. Here we try to show 6 possible diverse examples of generation for a hand-picked best-looking output from Isola et al. [22].


To address these limitations, we appeal to a classic learning architecture that can naturally allow for multiple outputs and user-control: non-parametric models, or nearest-neighbors (NN). Though quite a classic approach [11, 15, 20, 24], it has largely been abandoned in recent history with the advent of deep architectures. Intuitively, NN requires a large training set of pairs of (incomplete inputs, high-quality outputs), and works by simply matching an incomplete query to the training set and returning the corresponding output. This trivially generalizes to multiple outputs through K-NN and allows for intuitive user control through on-the-fly modification of the training set, e.g., by restricting the training exemplars to those that “look like my cat”. In practice, there are several limitations in applying NN for conditional image synthesis. The first is a practical lack of training data. The second is a lack of an obvious distance metric. And the last is a computational challenge of scaling search to large training sets.

Approach: To reduce the dependency on training data, we take a compositional approach by matching local pixels instead of global images. This allows us to synthesize a face by “copy-pasting” the eye of one training image, the nose of another, etc. Composition dramatically increases the representational power of our approach: given that we want to synthesize an image of $K$ pixels using $N$ training images (with $K$ pixels each), we can synthesize an exponential number $(NK)^K$ of compositions, versus a linear number of global matches $(N)$. A significant challenge, however, is defining an appropriate feature descriptor for matching pixels in the incomplete input signal. We would like to capture context (such that whisker pixels are matched only to other whiskers) while allowing for compositionality (left-facing whiskers may match to right-facing whiskers). To do so, we make use of deep features, as described below.
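To get a feel for the counting argument above, the following minimal sketch compares the two quantities on a log scale. It assumes, purely for illustration, that every output pixel may be copied from any pixel of any training image; the function names and the example sizes are hypothetical and are not taken from the paper.

```python
import math

def log10_num_compositions(n_images: int, n_pixels: int) -> float:
    """log10 of (n_images * n_pixels) ** n_pixels: the number of ways to fill
    n_pixels output pixels when each may come from any training pixel."""
    return n_pixels * math.log10(n_images * n_pixels)

def log10_num_global_matches(n_images: int) -> float:
    """log10 of the number of whole-image matches, which is just n_images."""
    return math.log10(n_images)

# Hypothetical sizes, chosen only for illustration.
n_images, n_pixels = 1000, 64 * 64
print("global matches  : 10^%.1f" % log10_num_global_matches(n_images))
print("compositions    : 10^%.1f" % log10_num_compositions(n_images, n_pixels))
```

The gap between the two exponents is what lets a compositional matcher work with far fewer training images than a global one.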

Pipeline: Our precise pipeline (Figure 3) works in two stages. (1) We first train an initial regressor (CNN) that maps the incomplete input into a single output image. This output image suffers from the aforementioned limitations - it is a single output that will tend to look like a “smoothed” average of all the potential images that could be generated. (2) We then perform nearest-neighbor queries on pixels from this regressed output. Importantly, pixels are matched (to regressed outputs from training data) using a multiscale deep descriptor that captures the appropriate level of context. This enjoys the aforementioned benefits - we can efficiently match to an exponential number of training examples in an interpretable and controllable manner. Finally, an interesting byproduct of our approach is the generation of dense, pixel-level correspondences from the training set to the final synthesized outputs.

Figure 3: Overview of pipeline: Our approach is a two-stage pipeline. The first stage directly regresses an image from an incomplete input (using a CNN trained with an $\ell_2$ loss). This image will tend to look like a “smoothed” average of all the potential images that could be generated. In the second stage, we look for matching pixels in similarly-smoothed training images. Importantly, we match pixels using multiscale descriptors that capture the appropriate levels of context (such that eye pixels tend to match only to eyes). To do so, we make use of off-the-shelf hypercolumn features extracted from a CNN trained for semantic pixel segmentation. By varying the size of the matched set of pixels, we can generate multiple outputs (on the right).

2 Related Work

Our work is inspired by a large body of work on discriminative and generative models, nearest neighbors architectures, pixel-level tasks, and dense pixel-level correspondences. We provide a broad overview, focusing on those most relevant to our approach.

Synthesis with CNNs: Convolutional Neural Networks (CNNs) have enjoyed great success for various discriminative pixel-level tasks such as segmentation [1, 31], depth and surface normal estimation [2, 1, 13, 12], semantic boundary detection [1, 45], etc. Such networks are usually trained using standard losses (such as softmax or regression) on image-label data pairs. However, such networks do not typically perform well for the inverse problem of image synthesis from an (incomplete) label, though exceptions do exist [6]. A major innovation was the introduction of adversarially-trained generative networks (GANs) [16]. This formulation was hugely influential in computer vision, having been applied to various image generation tasks that condition on a low-resolution image [10, 27], segmentation mask [22], surface normal map [42] and other inputs [7, 21, 35, 44, 47, 51]. Most related to us is Isola et al. [22], who propose a general loss function for adversarial learning, applying it to a diverse set of image synthesis tasks.

Interpretability and user-control: Interpreting and explaining the outputs of generative deep networks is an open problem. As a community, we do not have a clear understanding of what, where, and how outputs are generated. Our work is fundamentally based on copy-pasting information via nearest neighbors, which explicitly reveals how each pixel-level output is generated (by in turn revealing where it was copied from). This makes our synthesized outputs quite interpretable. One important consequence is the ability to intuitively edit and control the process of synthesis. Zhu et al. [52] provide a user with controls for editing image properties such as color and outline. But instead of using a predefined set of editing operations, we allow a user to have an arbitrarily-fine level of control through on-the-fly editing of the exemplar set (e.g., “resynthesize an image using the eye from this image and the nose from that one”).

Correspondence: An important byproduct of pixelwise NN is the generation of pixelwise correspondences between the synthesized output and training examples. Establishing such pixel-level correspondence has been one of the core challenges in computer vision [9, 25, 29, 32, 43, 49, 50]. Tappen et al. [40] use SIFT flow [29] to hallucinate details for image super-resolution. Zhou et al. [50] propose a CNN to predict appearance flow that can be used to transfer information from input views to synthesize a new view. Kanazawa et al. [25] generate 3D reconstructions by training a CNN to learn correspondence between object instances. Our work follows from the crucial observation of Long et al. [32], who suggest that features from pre-trained convnets can also be used for pixel-level correspondences. In this work, we make an additional empirical observation: hypercolumn features trained for semantic segmentation learn nuances and details better than those trained for image classification. This finding helped us to establish semantic correspondences between the pixels in query and training images, and enabled us to extract high-frequency information from the training examples to synthesize a new image from a given input.

Nonparametrics: Our work closely follows data-driven approaches that make use of nearest neighbors [11, 19, 15, 24, 37, 39]. Hays and Efros [19] match a query image to 2 million training images for various tasks such as image completion. We make use of dramatically smaller training sets by allowing for compositional matches. Liu et al. [28] propose a two-step pipeline for face hallucination where global constraints capture overall structure, and local constraints produce photorealistic local features. While they focus on the task of facial super-resolution, we address a variety of synthesis applications. Finally, our compositional approach is inspired by Boiman and Irani [4, 5], who reconstruct a query image via compositions of training examples.

3 PixelNN: One-to-Many Mappings

Figure 4: Frequency Analysis: We show the image and its corresponding Fourier spectrum. Note how the frequency spectrum improves as we move from left to right. The Fourier spectrum of our final output closely matches that of the original high-resolution image.
Figure 5: Exemplar vs. Compositional: Given the output of Stage-1 for low-resolution to high-resolution, we compare an exemplar output with a compositional output. On the right hand side, we show correspondences and how 8 global nearest neighbors have been used to extract different parts to generate output. As an example, the right eye of first image is copied from the bottom-left nearest neighbor; and the nose and mouth of second image is copied from the top-middle nearest neighbor.
Figure 6: Edges to RGB: Our approach used for faces, cats, and dogs to generate RGB maps for a given edge map as input. One output was picked from the multiple generations.

We define the problem of conditional image synthesis as follows: given an input $x$ to be conditioned on (such as an edge map, normal depth map, or low-resolution image), synthesize a high-quality output image(s). To describe our approach, we focus on the illustrative task of image super-resolution, where the input is a low-resolution image. We assume we are given training pairs of input/outputs, written as $(x_n, y_n)$. The simplest approach would be formulating this task as a (nonlinear) regression problem:

$$\min_w \; \|w\|^2 + \sum_n \|y_n - f(x_n; w)\|^2 \tag{1}$$

where $f(x_n; w)$ refers to the output of an arbitrary (possibly nonlinear) regressor parameterized with $w$. In our formulation, we use a fully-convolutional neural net (specifically, PixelNet [1]) as our nonlinear regressor. For our purposes, this regressor could be any trainable black-box mapping function. But crucially, such functions generate one-to-one mappings, while our underlying thesis is that conditional image synthesis should generate many mappings from an input. By treating synthesis as a regression problem, it is well-known that outputs tend to be over-smoothed [23]. In the context of the image colorization task (where the input is a grayscale image), such outputs tend to be desaturated [26, 48].
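The over-smoothing effect can be illustrated with a toy example. The sketch below assumes a squared-error objective and a single input that has two equally plausible targets; the 1-D "textures" are hypothetical stand-ins for image rows and are not data from the paper. Under these assumptions, the best single prediction is the per-pixel mean, which contains none of the texture present in either target.

```python
import numpy as np

# Two equally plausible high-frequency targets for the SAME input
# (hypothetical 1-D "textures" standing in for image rows).
target_a = np.array([0.0, 1.0, 0.0, 1.0])
target_b = np.array([1.0, 0.0, 1.0, 0.0])
targets = np.stack([target_a, target_b])

def squared_error(prediction: np.ndarray) -> float:
    """Sum of squared errors of one prediction against all plausible targets."""
    return float(((targets - prediction) ** 2).sum())

# For a squared-error loss, the optimal single prediction is the per-pixel mean.
mean_prediction = targets.mean(axis=0)

print("mean prediction:", mean_prediction)          # flat: texture averaged out
print("error of mean  :", squared_error(mean_prediction))
print("error of a     :", squared_error(target_a))  # a sharp guess costs more
```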

Frequency analysis: Let us analyze this smoothing a bit further. Predicted outputs $f(x)$ (we drop the dependence on $w$ to simplify notation) are particularly straightforward to analyze in the context of super-resolution (where the conditional input is a low-resolution image). Given a low-resolution image of a face, there may exist multiple textures (e.g., wrinkles) or subtle shape cues (e.g., of local features such as noses) that could be reasonably generated as output. In practice, this set of outputs tends to be “blurred” into a single output returned by a regressor. This can be readily seen in a frequency analysis of the input, output, and original target image (Fig. 4). In general, we see that the regressor generates mid-frequencies fairly well, but fails to return much high-frequency content. We make the operational assumption that a single output suffices for mid-frequency output, but multiple outputs are required to capture the space of possible high-frequency textures.
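One way to make such a frequency analysis concrete is to measure how much spectral energy lies above a cutoff frequency. The sketch below is only an illustration: the cutoff value, the box blur (used as a crude stand-in for a regressor's smoothing), and the random test image are all assumptions, not the procedure used to produce Figure 4.

```python
import numpy as np

def high_frequency_fraction(image: np.ndarray, cutoff: float = 0.25) -> float:
    """Fraction of spectral energy above `cutoff` (in cycles per pixel)."""
    spectrum = np.abs(np.fft.fft2(image)) ** 2
    fy = np.fft.fftfreq(image.shape[0])[:, None]
    fx = np.fft.fftfreq(image.shape[1])[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    return float(spectrum[radius > cutoff].sum() / spectrum.sum())

def box_blur(image: np.ndarray, size: int = 5) -> np.ndarray:
    """Circular box filter, a crude stand-in for regression smoothing."""
    out = np.zeros_like(image)
    offsets = range(-(size // 2), size // 2 + 1)
    for dy in offsets:
        for dx in offsets:
            out += np.roll(image, (dy, dx), axis=(0, 1))
    return out / size ** 2

rng = np.random.default_rng(0)
sharp = rng.random((64, 64))   # hypothetical image with plenty of detail
smooth = box_blur(sharp)

print("sharp :", high_frequency_fraction(sharp))
print("smooth:", high_frequency_fraction(smooth))
```

The blurred image keeps its low-frequency content but loses most of the energy above the cutoff, which is the pattern described for the regressed output.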

Figure 7: Normals to RGB: Our approach used for faces, cats, and dogs to generate RGB maps for a given surface normal map as input. One output was picked from multiple generations.

Exemplar Matching: To capture multiple possible outputs, we appeal to classic non-parametric approaches in computer vision. We note that a simple K-nearest-neighbor (KNN) algorithm has the trivial ability to report back $K$ outputs. However, rather than using a KNN model to return an entire image, we can use it to predict the (multiple possible) high frequencies missing from the regressed output $f(x)$:

$$\text{Output}(x) = f(x) + \big(y_k - f(x_k)\big), \quad \text{where } k = \arg\min_n \text{Dist}\big(f(x), f(x_n)\big) \tag{2}$$

where $(x_n, y_n)$ are the training pairs and Dist is some distance function measuring similarity between two (mid-frequency) reconstructions. To generate multiple outputs, one can report back the $K$ best matches from the training set instead of the overall best match.
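A minimal sketch of exemplar matching follows. It adds the high-frequency residual of each of the nearest training examples (the original training image minus its regressed version) to the regressed query. The Euclidean distance between whole smoothed images is an assumption made for brevity, and the function and argument names are hypothetical.

```python
import numpy as np

def exemplar_outputs(f_x, f_train, y_train, k=3):
    """Return k candidate outputs: regressed query plus a neighbor's residual.

    f_x:     (H, W) regressed (smoothed) query image
    f_train: (N, H, W) regressed training images
    y_train: (N, H, W) original high-quality training images
    """
    # Assumption: Euclidean distance between whole smoothed images.
    diffs = (f_train - f_x).reshape(len(f_train), -1)
    dists = np.linalg.norm(diffs, axis=1)
    nearest = np.argsort(dists)[:k]
    # Residuals hold the high frequencies that the regressor smoothed away.
    residuals = y_train[nearest] - f_train[nearest]
    return f_x + residuals, nearest
```

Each returned candidate shares the same mid-frequency content and differs only in the borrowed high-frequency detail.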

Figure 8: Low-Resolution to High-Resolution: We used our approach for hallucinating images from an input low-resolution image. One output was picked from multiple generations.
Figure 9: Edges-to-Shoes: Our approach used to generate multiple outputs of shoes from the edges. We picked seven distinct examples from multiple generations.
Figure 10: Edges-to-Bags: Our approach used to generate multiple outputs of bags from the edges. We picked seven distinct examples from multiple generations.

Compositional Matching: However, the above is limited to reporting back high-frequency images in the training set. As we previously argued, we can synthesize a much larger set of outputs by copying and pasting (high-frequency) patches from the training set. To allow for such compositional matchings, we simply match individual pixels rather than global images. Writing $f_i(x)$ for the $i^{th}$ pixel in the reconstructed image, the final composed output can be written as:

$$\text{Output}_i(x) = f_i(x) + \big(y_{jk} - f_j(x_k)\big), \quad \text{where } (j, k) = \arg\min_{m,n} \text{Dist}\big(f_i(x), f_m(x_n)\big) \tag{3}$$

where $y_{jk}$ refers to the output pixel $j$ in training example $k$.
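The pixel-wise version can be sketched in the same spirit. Here every query pixel independently picks the training pixel with the closest descriptor and borrows its high-frequency residual. The brute-force search, the squared Euclidean distance, and all names below are assumptions for illustration; images are flattened to vectors of pixels to keep the code short.

```python
import numpy as np

def compositional_output(f_x, desc_x, f_train, desc_train, y_train):
    """Compose an output pixel by pixel.

    f_x:        (P,)      regressed query pixels (flattened image)
    desc_x:     (P, D)    one descriptor per query pixel
    f_train:    (N, P)    regressed training pixels
    desc_train: (N, P, D) descriptors of the regressed training pixels
    y_train:    (N, P)    high-quality training pixels
    """
    n, p, d = desc_train.shape
    flat_desc = desc_train.reshape(n * p, d)
    flat_residual = (y_train - f_train).reshape(n * p)
    # Assumption: squared Euclidean distance stands in for the distance function.
    d2 = ((desc_x[:, None, :] - flat_desc[None, :, :]) ** 2).sum(axis=2)
    best = d2.argmin(axis=1)                    # flat index over (image, pixel)
    source_image, source_pixel = np.divmod(best, p)
    return f_x + flat_residual[best], source_image, source_pixel
```

The returned `source_image` and `source_pixel` arrays record where each output pixel was copied from, which is the dense correspondence mentioned in the text.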

Distance functions: A crucial question in non-parametric matching is the choice of distance function. To compare global images, contemporary approaches tend to learn a deep embedding where similarity is preserved [3, 8, 31]. Distance functions for pixels are much more subtle (Eq. 3). In theory, one could also learn a metric for pixel matching, but this requires large-scale training data with dense pixel-level correspondences.

Pixel representations: Suppose we are trying to generate the left corner of an eye. If our distance function takes into account only local information around the corner, we might mistakenly match to the other eye or mouth. If our distance function takes into account only global information, then compositional matching reduces to global (exemplar) matching. Instead, we exploit the insight from previous works that different layers of a deep network tend to capture different amounts of spatial context (due to varying receptive fields) [1, 2, 18, 36, 38]. Hypercolumn descriptors [18] aggregate such information across multiple layers into a highly accurate, multi-scale pixel representation (visualized in Fig. 3). We construct a pixel descriptor using features from multiple convolutional layers of a PixelNet model trained for semantic segmentation (on PASCAL Context [33]). To measure pixel similarity, we compute cosine distances between two descriptors. We visualize the compositional matches (and associated correspondences) in Figure 5. Finally, Figure 6, Figure 7, and Figure 8 show the output of our approach for various input modalities.
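The idea of a hypercolumn descriptor can be sketched as follows: feature maps from several layers are brought to a common resolution and stacked, so that each pixel gets one vector mixing fine and coarse context. The nearest-neighbor upsampling, the layer shapes, and the function names are assumptions for illustration and do not reproduce the PixelNet implementation.

```python
import numpy as np

def hypercolumns(feature_maps, out_hw):
    """Stack feature maps from several layers into per-pixel descriptors.

    feature_maps: list of (C_l, H_l, W_l) arrays from different layers
    out_hw:       (H, W) target resolution
    Returns an (H, W, sum of C_l) array. Nearest-neighbor upsampling is an
    assumption made for brevity.
    """
    h, w = out_hw
    columns = []
    for fmap in feature_maps:
        _, fh, fw = fmap.shape
        rows = np.arange(h) * fh // h      # source row for each output row
        cols = np.arange(w) * fw // w      # source column for each output column
        columns.append(fmap[:, rows[:, None], cols[None, :]])
    return np.concatenate(columns, axis=0).transpose(1, 2, 0)

def cosine_distance(a, b, eps=1e-8):
    """One minus the cosine similarity between two descriptor vectors."""
    return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)
```

Shallow layers contribute local detail and deep layers contribute context, so two pixels are close only when they agree at several scales.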

Efficient search: We have so far avoided the question of run-time while doing nearest neighbor search in pixel-space. Run-time performance is another reason why generative models are more popular than nearest neighbors. To speed up search, we made some non-linear approximations: given a reconstructed image $f(x)$, we first (1) find the global $K$ nearest neighbors using conv-5 features and then (2) search for pixel-wise matches only in a $T \times T$ pixel window around pixel $i$ in this set of $K$ images. In practice, we vary both $K$ and $T$ and generate 72 candidate outputs for a given input. Because the largest window spans the full synthesized image, our search parameters include both a fully-compositional output and a fully global exemplar match as candidate outputs. Figure 9, Figure 10, and Figure 11 show examples of multiple outputs generated using our approach by simply varying these parameters.
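The second step of the restricted search can be sketched as below: for one query pixel, only pixels inside a square window centered at the same location in the pre-selected neighbor images are considered. The function name, the squared Euclidean distance, and the array layout are assumptions for illustration; the global neighbor selection is assumed to have been done already.

```python
import numpy as np

def windowed_match(desc_x, desc_nbrs, i_row, i_col, window):
    """Best match for query pixel (i_row, i_col) inside a square window.

    desc_x:    (H, W, D) query descriptors
    desc_nbrs: (K, H, W, D) descriptors of the pre-selected neighbor images
    window:    side length of the search window, in pixels
    Returns (k, row, col) of the closest neighbor pixel.
    """
    _, h, w, _ = desc_nbrs.shape
    half = window // 2
    r0, r1 = max(0, i_row - half), min(h, i_row + half + 1)
    c0, c1 = max(0, i_col - half), min(w, i_col + half + 1)
    patch = desc_nbrs[:, r0:r1, c0:c1, :]
    d2 = ((patch - desc_x[i_row, i_col]) ** 2).sum(axis=-1)
    kk, rr, cc = np.unravel_index(d2.argmin(), d2.shape)
    return int(kk), int(rr + r0), int(cc + c0)
```

A window of one pixel with a single neighbor image reduces to copying the co-located pixel of the global match, while a window covering the whole image allows unrestricted composition.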

4 Experiments

We now present our findings for multiple modalities such as a low-resolution image, a surface normal map, and edges/boundaries for domains such as human faces, cats, dogs, handbags, and shoes. We compare our approach both quantitatively and qualitatively with the recent work of Isola et al. [22], which uses generative adversarial networks for pixel-to-pixel translation.

Dataset: We conduct experiments for human faces, cats and dogs, shoes, and handbags using various modalities.

Human Faces: We use images from the training set of the CUHK CelebA dataset [30] to train a regression model and do nearest neighbors. We used a subset of the test images to evaluate our approach. The images were resized following Gucluturk et al. [17].

Cats and Dogs: We use images of cats and dogs from the Oxford-IIIT Pet dataset [34]. A subset of these images was used for training, and the remainder for evaluation. We used the bounding box annotations made available by Parkhi et al. [34] to extract the heads of the cats and dogs.

For human faces, and cats and dogs, we used the pre-trained PixelNet [1] to extract surface normal and edge maps. We did not do any post-processing (NMS) to the outputs of edge detection.

Shoes & Handbags: We followed Isola et al. [22] for this setting. Training images of shoes were taken from [46], and images of Amazon handbags from [52]. The edge maps for this data were computed using HED [45] by Isola et al. [22].

Figure 11: Multiple Outputs for Edges/Normals to RGB: Our approach used to generate multiple outputs of faces, cats, and dogs from the edges/normals. As an example, note how the subtle details such as eyes, stripes, and whiskers of cat (left) that could not be inferred from the edge map are different in multiple generations.
Figure 12: Comparison of our approach with Pix-to-Pix [22].
Dataset | Method | Mean | Median | RMSE | 11.25° | 22.5° | 30° | AP
Human Faces | Pix-to-Pix [22] | 17.2 | 14.3 | 21.0 | 37.2 | 74.7 | 86.8 | 0.34
Human Faces | Pix-to-Pix [22] (Oracle) | 15.8 | 13.1 | 19.4 | 41.9 | 78.5 | 89.3 | 0.34
Human Faces | PixelNN (Rand-1) | 12.8 | 10.4 | 16.0 | 54.2 | 86.6 | 94.1 | 0.38
Human Faces | PixelNN (Oracle) | 10.8 | 8.7 | 13.5 | 63.7 | 91.6 | 96.7 | 0.42
Cats and Dogs | Pix-to-Pix [22] | 14.7 | 12.8 | 17.5 | 42.6 | 82.5 | 92.9 | 0.82
Cats and Dogs | Pix-to-Pix [22] (Oracle) | 13.2 | 11.4 | 15.7 | 49.2 | 87.1 | 95.3 | 0.85
Cats and Dogs | PixelNN (Rand-1) | 16.6 | 14.3 | 19.8 | 36.8 | 76.2 | 88.8 | 0.80
Cats and Dogs | PixelNN (Oracle) | 13.8 | 11.9 | 16.6 | 46.9 | 84.9 | 94.1 | 0.92
Table 1: Normals-to-RGB: We compared our approach, PixelNN, with the GAN-based formulation of Isola et al. [22] for human faces, and cats and dogs. We used an off-the-shelf PixelNet model trained for surface normal estimation and edge detection. We use the output from real images as ground truth surface normal and edge map, respectively.
Dataset | Method | AP | Mean | Median | RMSE | 11.25° | 22.5° | 30°
Human Faces | Pix-to-Pix [22] | 0.35 | 12.1 | 9.6 | 15.5 | 58.1 | 88.1 | 94.7
Human Faces | Pix-to-Pix [22] (Oracle) | 0.35 | 11.5 | 9.1 | 14.6 | 61.1 | 89.7 | 95.6
Human Faces | PixelNN (Rand-1) | 0.38 | 13.3 | 10.6 | 16.8 | 52.9 | 85.0 | 92.9
Human Faces | PixelNN (Oracle) | 0.41 | 11.3 | 9.0 | 14.4 | 61.6 | 90.0 | 95.7
Cats and Dogs | Pix-to-Pix [22] | 0.78 | 18.2 | 16.0 | 21.8 | 32.4 | 71.0 | 85.1
Cats and Dogs | Pix-to-Pix [22] (Oracle) | 0.81 | 16.5 | 14.2 | 19.8 | 37.2 | 76.4 | 89.0
Cats and Dogs | PixelNN (Rand-1) | 0.77 | 18.9 | 16.4 | 22.5 | 30.3 | 68.9 | 83.5
Cats and Dogs | PixelNN (Oracle) | 0.89 | 16.3 | 14.1 | 19.6 | 37.6 | 77.0 | 89.4
Table 2: Edges-to-RGB: We compared our approach, PixelNN, with the GAN-based formulation of Isola et al. [22] for human faces, and cats and dogs. We used an off-the-shelf PixelNet model trained for edge detection and surface normal estimation. We use the output from real images as ground truth edges and surface normal map.

Qualitative Evaluation: Figure 12 shows the comparison of our nearest-neighbor based approach (PixelNN) with Isola et al. [22] (Pix-to-Pix).

Quantitative Evaluation: We quantitatively evaluate our approach by measuring whether our generated outputs for human faces, cats and dogs can be used to determine surface normals and edges using an off-the-shelf PixelNet [1] model trained for surface normal estimation and edge detection. The outputs from the real images are considered as ground truth for evaluation, as this gives an indication of how far we are from them. A somewhat similar approach is used by Isola et al. [22] to measure their synthesized cityscape outputs and compare against the output using real world images, and by Wang and Gupta [42] for object detection evaluation.

Figure 13: Controllable synthesis: We generate the output of cats given a user input from an edge map. From the edge map, we do not know what type of cat it is. A user can suggest what kind of output they would like, and our approach can copy-paste the information.

We compute six statistics, previously used by [2, 12, 14, 41], over the angular error between the normals from a synthesized image and the normals from a real image to evaluate the performance: Mean, Median, RMSE, 11.25°, 22.5°, and 30°. The first three criteria capture the mean, median, and RMSE of angular error, where lower is better. The last three criteria capture the percentage of pixels within a given angular error, where higher is better. We evaluate the edge detection performance using average precision (AP).
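The six angular-error statistics can be computed as in the sketch below. It assumes the normals are given as arrays of 3-D vectors, one per pixel, and that "within a given angular error" means strictly below the threshold; both are assumptions, as are the function and key names.

```python
import numpy as np

def normal_error_stats(pred, gt):
    """Angular-error statistics between two (M, 3) arrays of surface normals."""
    pred = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    gt = gt / np.linalg.norm(gt, axis=1, keepdims=True)
    cos = np.clip((pred * gt).sum(axis=1), -1.0, 1.0)
    err = np.degrees(np.arccos(cos))        # per-pixel angular error in degrees
    stats = {
        "mean": float(err.mean()),
        "median": float(np.median(err)),
        "rmse": float(np.sqrt((err ** 2).mean())),
    }
    # Percentage of pixels whose error is below each threshold.
    for thresh in (11.25, 22.5, 30.0):
        stats[f"within_{thresh}"] = float(100.0 * (err < thresh).mean())
    return stats
```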

Table 1 and Table 2 quantitatively compare the performance of our approach with that of Isola et al. [22]. Our approach generates multiple outputs and we do not have any direct way of ranking them; therefore, we show the performance using a random selection from one of the 72 outputs (Rand-1), and an oracle selecting the best output. For a fair comparison, we ran trained models for Pix-to-Pix [22] 72 times and used an oracle for selecting the best output as well. We observe that our approach generates a better set of multiple outputs, as performance improves significantly more from a random selection to the oracle than it does for Isola et al. [22]. Our approach, though based on simple nearest neighbors, achieves results that are quantitatively and qualitatively competitive with (and many times better than) state-of-the-art models based on GANs, and produces outputs close to natural images.
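The two selection rules can be sketched as follows, assuming each candidate output has already been scored with an error against the reference (lower is better). The function name and the fixed seed are hypothetical choices for illustration.

```python
import random

def rand1_and_oracle(candidate_errors, seed=0):
    """Pick one candidate at random (Rand-1) and the lowest-error one (Oracle).

    candidate_errors: list of per-candidate errors against the reference,
    where lower is better (for example, mean angular error).
    Returns the two selected indices.
    """
    rng = random.Random(seed)
    rand_index = rng.randrange(len(candidate_errors))
    oracle_index = min(range(len(candidate_errors)),
                       key=candidate_errors.__getitem__)
    return rand_index, oracle_index
```

A large gap between the two selections indicates that the candidates differ substantially from one another, whereas near-identical candidates leave the oracle little to choose from.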

Controllable synthesis: Finally, NN provides a user with intuitive control over the synthesis process. We explore a simple approach based on on-the-fly pruning of the training set. Instead of matching to the entire training library, a user can specify a subset of relevant training examples. Figure 13 shows an example of controllable synthesis. A user “instructs” the system to generate an image that looks like a particular cat breed by either denoting the subset of training exemplars (e.g., through a subcategory label), or providing an image that can be used to construct an on-the-fly neighbor set.
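Pruning by subcategory label can be as simple as a filter applied before matching. The record layout below (a dictionary with a "label" key) and the breed names are hypothetical and only illustrate the idea.

```python
def prune_training_set(examples, allowed_labels):
    """Keep only training examples whose label the user selected.

    examples:       list of dicts with a hypothetical "label" key
    allowed_labels: iterable of labels the user wants to match against
    """
    allowed = set(allowed_labels)
    return [ex for ex in examples if ex["label"] in allowed]

# Hypothetical training records; real entries would also carry image data.
library = [
    {"id": 0, "label": "siamese"},
    {"id": 1, "label": "persian"},
    {"id": 2, "label": "siamese"},
]
subset = prune_training_set(library, ["siamese"])
print([ex["id"] for ex in subset])
```

Nearest-neighbor queries would then run against `subset` instead of the full library, so every copied pixel comes from an example the user approved.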

Failure cases: Our approach mostly fails when there are no suitable nearest neighbors to extract the information from. Figure 14 shows some example failure cases of our approach. One way to deal with this problem is to do an exhaustive pixel-wise nearest neighbor search, but that would increase the run-time to generate the output. We believe that system-level optimization such as Scanner (https://github.com/scanner-research/scanner) may potentially be useful in improving the run-time performance for pixel-wise nearest neighbors.

5 Discussion

Figure 14: Failure Cases: We show some failure cases for different input types. Our approach mostly fails when it is not able to find suitable nearest neighbors.

We present a simple approach to image synthesis based on compositional nearest-neighbors. Our approach somewhat suggests that GANs themselves may operate in a compositional “copy-and-paste” fashion. Indeed, examining the impressive outputs of recent synthesis methods suggests that some amount of local memorization is happening. However, by making this process explicit, our system is able to naturally generate multiple outputs, while being interpretable and amenable to user constraints. An interesting byproduct of our approach is dense pixel-level correspondences. If training images are augmented with semantic label masks, these labels can be transferred using our correspondences, implying that our approach may also be useful for image analysis through label transfer [29].