Image inpainting algorithms find applications in many domains such as the restoration of damaged paintings and photographs (bertalmio2000image), the removal or replacement of objects in images (liu2018image) or the generation of maps from sparse measurements (dupont2018generating). In these applications, a partially occluded image is passed as input to an algorithm which generates a complete image constrained by the visible pixels of the original image. As the missing or hidden regions of the image are unknown, there is an inherent uncertainty related to the inpainting of these images. For each occluded image, there are typically a large number of plausible inpaintings which both satisfy the constraints of the visible pixels and are realistic (see Fig. 1). As such, it is desirable to sample image inpaintings as opposed to generating them deterministically. Even though recent algorithms have shown great progress in generating realistic inpaintings, most of these algorithms are deterministic (liu2018image; yeh2016semantic; yu2018free).
In this paper, we propose a method to sample inpaintings from a distribution of images conditioned on the visible pixels. Specifically, we propose a model that simultaneously (a) generates realistic images, (b) matches pixel constraints and (c) exhibits high sample diversity. Our method, which we term Pixel Constrained CNN, is based on a modification of PixelCNNs (van2016conditional; oord2016pixel) to allow for conditioning on arbitrary sets of pixels as opposed to only the ones above and to the left as in the original PixelCNN framework.
Further, our model can estimate the likelihood of generated inpaintings. This allows us, for example, to rank the various generated inpaintings by their likelihood. To the best of our knowledge, this is the first method for efficiently estimating the likelihood of inpaintings of arbitrary missing pixel regions.
To validate our method, we perform experiments on the MNIST and CelebA datasets. Our results show that the model learns to generate realistic inpaintings while exhibiting high sample diversity. Further, we also show that the likelihood estimates of the model correlate well with the realism of the generated inpaintings. Finally, we show how our model can be used to visualize individual pixel probabilities as the inpaintings are generated.
2 Related Work
Early approaches for image inpainting were mostly based on propagating the information available in the occluded image. Methods based on minimizing the total variation, for example, are able to fill small holes in an image (shen2002mathematical; afonso2011augmented). Other methods directly propagate information from visible pixels to fill in hidden pixel regions (bertalmio2000image; ballester2001filling; telea2004image). As these methods use only the information available in the image, they are unable to fill in large holes or holes where the color or texture has high variance. More importantly, these algorithms are also deterministic and so generate a single inpainting given an occluded image.
Other methods are based on finding patches in the occluded image or in other image datasets to infill the hidden regions (efros2001image; kwatra2005texture). This family of methods also includes PatchMatch (barnes2009patchmatch) which has a random component. This randomness is however limited by the possible matches that can be found in the available datasets.
Learning based approaches have also been popular for inpainting tasks (iizuka2017globally; yang2017high; song2017image; li2017context; yu2018generative). Importantly, these methods often learn a prior over the image distribution and can take advantage of both this information and the pixel information available in the occluded image. liu2018image, for example, achieve impressive results using partial convolutions, but these approaches are deterministic and the inpainting operation often corresponds to a forward pass of a neural network. Our method, in contrast, is able to generate several samples given a single inpainting task.
Several methods for image inpainting are also based on optimizing the latent variables of generative models. pathak2016context and yeh2016semantic, for example, train a Generative Adversarial Network (GAN) on unobstructed images (goodfellow2014generative). Using the trained GAN, these algorithms optimize the latent variables to match the visible pixels in the occluded image. These methods are pseudo-random in the sense that different initializations of the latent variable can lead to different minima of the optimization problem that matches the generated image with the visible pixels. However, the resulting completions are typically not diverse (bellemare2017cramer). Further, since the final images are generated by the GAN, the lack of diversity of samples sometimes observed in GANs can also be limiting (arjovsky2017wasserstein). Our approach, in contrast, is based on PixelCNNs which typically exhibit high sample diversity (dahl2017pixel).
3 Review of PixelCNNs
PixelCNNs (oord2016pixel; van2016conditional) are probabilistic generative models which aim to learn a distribution of images $p(\mathbf{x})$. These models are based on sampling each pixel in an image conditioned on all the previously sampled pixels. Specifically, letting $\mathbf{x}$ denote the set of pixels of an $n$ by $n$ image and numbering the pixels from 1 to $n^2$ row by row (in raster scan order), we can write $p(\mathbf{x})$ as:
$$p(\mathbf{x}) = \prod_{i=1}^{n^2} p(x_i \mid x_1, \dots, x_{i-1})$$
We can then build a model for each pixel $i$, which takes as input the previous $i-1$ pixels and outputs the probability distribution $p(x_i \mid x_1, \dots, x_{i-1})$. We could, for example, build a CNN which takes as input the first $i-1$ pixels of an image and outputs the probability distribution over pixel intensities for the $i$th pixel. PixelCNNs use a hierarchy of masked convolutions to enforce this conditioning order, by masking pixels to the bottom and the right of each pixel, so that each pixel $i$ can only access information from pixels $x_1, \dots, x_{i-1}$. The model is then trained by maximizing the log likelihood on real image data.
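To make the masking concrete, the following is a minimal sketch of the binary mask applied to a single convolution kernel so that an output pixel cannot see pixels below it or to its right. It follows the simple masked-kernel scheme of the original PixelCNN rather than the gated variant, and the function name `causal_kernel_mask` and its `include_center` flag are illustrative assumptions, not names from the paper.

```python
import numpy as np

def causal_kernel_mask(kernel_size: int, include_center: bool) -> np.ndarray:
    """Binary mask for a square convolution kernel in raster-scan order.

    Entries above the center row, and to the left of the center within the
    center row, are kept (1). Entries to the right and below are zeroed.
    include_center=False is commonly called a type "A" mask (first layer),
    include_center=True a type "B" mask (later layers).
    """
    if kernel_size % 2 == 0:
        raise ValueError("kernel_size must be odd")
    mask = np.zeros((kernel_size, kernel_size), dtype=np.float32)
    center = kernel_size // 2
    mask[:center, :] = 1.0        # all rows above the center row
    mask[center, :center] = 1.0   # pixels to the left in the center row
    if include_center:
        mask[center, center] = 1.0
    return mask

mask_a = causal_kernel_mask(3, include_center=False)
mask_b = causal_kernel_mask(3, include_center=True)
```

In a masked convolution the kernel weights would be multiplied elementwise by such a mask before every forward pass, which is what restricts each output to previously generated pixels.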
PixelCNNs are not only used to estimate $p(\mathbf{x})$ but also to generate samples from $p(\mathbf{x})$. To generate a sample, we first initialize all pixels of an image to zero (or any other number). After a forward pass of the image through the network, the output at pixel 1 is the distribution $p(x_1)$. The value of the first pixel of the image can then be sampled from $p(x_1)$. After setting pixel 1 to the sampled value, we pass the image through the network again to sample from $p(x_2 \mid x_1)$. We then set pixel 2 to the sampled value and repeat this procedure until all pixels have been sampled.
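The sampling loop described above can be sketched as follows for binary images. The function `toy_model` is a hypothetical stand-in for a trained PixelCNN (it only looks at the previous pixel so that the raster-scan ordering is respected); only the structure of the loop is meant to reflect the procedure in the text.

```python
import numpy as np

def toy_model(image: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for a trained PixelCNN on binary images.

    Returns P(pixel = 1) for every position. The value at position i
    depends only on pixel i - 1, so the raster-scan order is respected.
    """
    flat = image.ravel()
    prev = np.concatenate(([0.0], flat[:-1]))
    return (0.2 + 0.6 * prev).reshape(image.shape)

def sample_raster_scan(model, n: int, rng: np.random.Generator) -> np.ndarray:
    image = np.zeros((n, n), dtype=np.float64)  # initialize all pixels to zero
    for i in range(n * n):                      # visit pixels in raster-scan order
        row, col = divmod(i, n)
        probs = model(image)                    # one forward pass per pixel
        image[row, col] = float(rng.random() < probs[row, col])
    return image

sample = sample_raster_scan(toy_model, n=4, rng=np.random.default_rng(0))
```

The loop makes the cost visible: an $n$ by $n$ image needs $n^2$ forward passes.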
However, PixelCNNs can only generate images in raster scan order. For example, if pixel 3 is known, then we cannot sample pixel 2 from $p(x_2 \mid x_1, x_3)$ since this does not match the sampling order imposed by the masking. In image inpainting, an arbitrary set of pixels is fixed and known, so we would like to be able to sample from distributions conditioned on any subset of pixels. A trivial way to enforce this conditioning is to modify the PixelCNN architecture to take in the visible pixels as a conditioning vector (see Conditional PixelCNNs (van2016conditional) for more details). However, our initial experiments showed that the conditioning is largely ignored and the model tends to generate images which do not match the conditioning pixels. Similar problems have been observed when using PixelCNNs for super resolution (dahl2017pixel).
4 Pixel Constrained CNN
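In this section we introduce Pixel Constrained CNN, a probabilistic generative model that can generate image samples conditioned on arbitrary subsets of pixels. Specifically, given a set of known constrained pixel values $\mathbf{c}$ (a subset of the pixels of the image) we would like to model and sample from $p(\mathbf{x} \mid \mathbf{c})$, i.e. we would like to sample all the pixels in an image, given the visible pixels $\mathbf{c}$. We factorize $p(\mathbf{x} \mid \mathbf{c})$ as
$$p(\mathbf{x} \mid \mathbf{c}) = \prod_{i \,:\, x_i \notin \mathbf{c}} p(x_i \mid x_1, \dots, x_{i-1}, \mathbf{c})$$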
where the product is over all the missing pixels in the image. As noted in section 3, PixelCNNs enforce this factorization by hiding pixels with a hierarchy of masked convolutions. In the constrained pixel case, we would like to hide pixels in the same order except for the known pixels which should be visible to all output pixel distributions. Therefore, building the constrained model amounts to using the same factorization as the original PixelCNN, but modifying the masking to make the constrained pixels visible to all pixels. This can be achieved by building a model composed of two subnetworks, a prior network and a conditioning network.
The prior network is a PixelCNN, which takes as input the full image and outputs logits which encode information from pixels $x_1, \dots, x_{i-1}$ for each pixel $i$. The conditioning network is a CNN with regular (non masked) convolutions which takes as input the masked image, containing only the visible pixels $\mathbf{c}$, and outputs logits which encode the information in the visible pixels. Since the conditioning network does not use masked convolutions, each pixel in the logit output will have access to every visible pixel in the input (assuming the network is deep enough for the receptive field to cover the entire image).
Finally, the prior logits and conditional logits are added to output the final logits. The softmax of these logits models the probability distribution $p(x_i \mid x_1, \dots, x_{i-1}, \mathbf{c})$ for each pixel $i$. This approach is illustrated in Fig. 2. Note that a similar approach has been used in the context of super resolution, where the conditioning network takes in a low resolution image instead of a masked image (dahl2017pixel).
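As a small illustration of how the two outputs are combined, the sketch below adds per-pixel logits and applies a softmax over the class axis. The random arrays stand in for the outputs of the two networks, and the names `combine_logits`, `prior_logits` and `cond_logits` are assumptions made for this example.

```python
import numpy as np

def softmax(logits: np.ndarray, axis: int = -1) -> np.ndarray:
    # Subtract the maximum for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def combine_logits(prior_logits: np.ndarray, cond_logits: np.ndarray) -> np.ndarray:
    """Per-pixel class probabilities from the sum of both sets of logits."""
    return softmax(prior_logits + cond_logits, axis=-1)

rng = np.random.default_rng(0)
prior_logits = rng.normal(size=(4, 4, 2))  # placeholder for the prior network output
cond_logits = rng.normal(size=(4, 4, 2))   # placeholder for the conditioning network output
pixel_probs = combine_logits(prior_logits, cond_logits)
```

Adding logits is equivalent to multiplying the two per-pixel distributions and renormalizing, so either network can sharpen or veto what the other proposes.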
4.1 Model Inputs
During training, the prior network takes as input a fully visible image while the conditioning network takes as input a masked version of the same image, representing the constrained pixels $\mathbf{c}$. More precisely, the constrained pixels can be thought of as a function of the image $\mathbf{x}$ and a mask $\mathbf{m}$. The mask $\mathbf{m} \in \{0, 1\}^{n \times n}$ is a binary matrix, where 1 represents a visible pixel and 0 a hidden pixel. The input image $\mathbf{x} \in \mathbb{R}^{n \times n \times k}$ (where $k$ is the number of color channels) is masked by elementwise multiplication with $\mathbf{m}$. To differentiate between masked pixels and pixels which are visible but have a value of 0, we append $\mathbf{m}$ to the masked image, so the final input to the conditioning network is in $\mathbb{R}^{n \times n \times (k+1)}$. This is illustrated in Fig. 3. The approach is similar to the one used by zhang2016colorful for deep colorization.
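The construction of the conditioning input can be sketched in a few lines. The channels-last layout and the function name `conditioning_input` are assumptions of this example.

```python
import numpy as np

def conditioning_input(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Build the conditioning network input.

    image: (n, n, k) array, mask: (n, n) binary array with 1 = visible.
    Returns an (n, n, k + 1) array: the masked image with the mask appended
    as an extra channel, so hidden pixels can be told apart from visible
    pixels whose value happens to be 0.
    """
    mask_channel = mask[..., None].astype(image.dtype)
    masked_image = image * mask_channel
    return np.concatenate([masked_image, mask_channel], axis=-1)

image = np.arange(12, dtype=np.float64).reshape(2, 2, 3)
mask = np.array([[1, 0], [0, 1]])
cond_in = conditioning_input(image, mask)
```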
4.2 Likelihood Maximization
We train the model by maximizing $p(\mathbf{x} \mid \mathbf{c})$ on a dataset of images. Ideally, the trained model should work for any set of constrained pixels or, equivalently, for any mask. To achieve this, we define a distribution of masks $p(\mathbf{m})$ and maximize the log likelihood of our model over both the masks and the data
$$\mathbb{E}_{\mathbf{x} \sim p(\mathbf{x}),\, \mathbf{m} \sim p(\mathbf{m})} \left[ \log p(\mathbf{x} \mid \mathbf{c}) \right]$$
When optimizing this loss in practice, we found that the conditional logits (which model the information in $\mathbf{c}$) were often partially ignored by the model. We hypothesize that this is because there is a stronger correlation between a pixel and its neighbors (which is what the prior network models) than between a pixel and the visible pixels in the image (which is what the conditioning network models). To encourage the model to use the conditional logits, we add a weighted term to the loss. Denoting by $p_{\text{cond}}(\mathbf{x} \mid \mathbf{c})$ the softmax of the conditional logits, the total loss is
$$\mathcal{L} = -\mathbb{E}_{\mathbf{x} \sim p(\mathbf{x}),\, \mathbf{m} \sim p(\mathbf{m})} \left[ \log p(\mathbf{x} \mid \mathbf{c}) + \alpha \log p_{\text{cond}}(\mathbf{x} \mid \mathbf{c}) \right]$$
where $\alpha$ is a positive constant and the dependence of $\mathbf{c}$ on $\mathbf{x}$ and $\mathbf{m}$ has been omitted for clarity. This loss encourages the model to encode more information into the conditional logits and we observe that this improves performance in practice.
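A minimal sketch of the two-term loss for a single image and mask is given below, assuming the log probabilities are averaged over all pixels (the exact reduction is not stated in the text). The names `total_loss`, `targets` and `alpha` are illustrative.

```python
import numpy as np

def log_softmax(logits: np.ndarray) -> np.ndarray:
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def total_loss(prior_logits, cond_logits, targets, alpha):
    """Negative log likelihood plus a weighted conditional-only term.

    prior_logits, cond_logits: (n, n, classes) arrays.
    targets: (n, n) integer array holding the true class of each pixel.
    Assumption of this sketch: the loss is a mean over all pixels.
    """
    idx = targets[..., None]
    # Log probability of the true pixel value under the full model.
    log_p = np.take_along_axis(log_softmax(prior_logits + cond_logits), idx, axis=-1)
    # Log probability of the true pixel value under the conditional logits alone.
    log_p_cond = np.take_along_axis(log_softmax(cond_logits), idx, axis=-1)
    return float(-(log_p + alpha * log_p_cond).mean())

zeros = np.zeros((3, 3, 2))
targets = np.zeros((3, 3), dtype=np.int64)
loss_with_term = total_loss(zeros, zeros, targets, alpha=1.0)
loss_without_term = total_loss(zeros, zeros, targets, alpha=0.0)
```

With uniform logits over two classes each log probability equals $\log 0.5$, so the two values above differ by exactly $\log 2$, which is a quick sanity check on the weighting.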
4.3 Random Masks
In order to evaluate the loss and train the model, we need to define the distribution of masks $p(\mathbf{m})$. There are several ways this can be done. For example, if it is known a priori that we are only interested in completing images which have part of their right side occluded, we can train on masks of varying width covering the right side of the image. While this is application dependent, we would like to build models that are as general as possible and can work on a wide variety of masks. Specifically, we would like our model to perform well even when missing areas are irregular and disconnected. To this end, we build an algorithm that generates irregular masks of random blobs. The algorithm randomly samples blob centers and then iteratively and randomly expands each blob. The algorithm is described in detail in Algorithm 1 and examples of the generated masks are shown in Fig. 4. While the generated masks are all irregular, we find that they generalize well to completing any occlusion in unseen images, including regular occlusions.
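One possible way to implement the idea of sampling blob centers and randomly expanding them is sketched below. This is not Algorithm 1 from the paper: the growth rule, the `num_iters` parameter and the choice that blobs mark visible pixels are all assumptions made for illustration; only `max_num_blobs` is a name used in the paper.

```python
import numpy as np

def random_blob_mask(n: int, max_num_blobs: int, num_iters: int,
                     rng: np.random.Generator) -> np.ndarray:
    """Binary (n, n) mask of irregular blobs, with 1 marking a visible pixel."""
    mask = np.zeros((n, n), dtype=np.uint8)
    num_blobs = int(rng.integers(1, max_num_blobs + 1))
    steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    for _ in range(num_blobs):
        # Sample a blob center uniformly at random.
        row, col = (int(v) for v in rng.integers(0, n, size=2))
        mask[row, col] = 1
        blob = [(row, col)]
        for _ in range(num_iters):
            # Pick a random pixel of the blob and try to grow into a random neighbor.
            r, c = blob[int(rng.integers(len(blob)))]
            dr, dc = steps[int(rng.integers(4))]
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < n and 0 <= c2 < n and mask[r2, c2] == 0:
                mask[r2, c2] = 1
                blob.append((r2, c2))
    return mask

blob_mask = random_blob_mask(n=32, max_num_blobs=4, num_iters=20,
                             rng=np.random.default_rng(0))
```

Because every growth step picks a random blob pixel and a random direction, the resulting regions are irregular and, with several blobs, usually disconnected.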
4.4 Sampling
Given a trained model and an image with a subset of visible pixels $\mathbf{c}$, we would like to generate samples from the distribution $p(\mathbf{x} \mid \mathbf{c})$. To generate these, we first pass the occluded image and the mask through the conditioning network to calculate the conditional logits. We then pass a blank image through the prior network to generate the prior logits for the first pixel. If the first pixel is part of the visible pixels $\mathbf{c}$, we simply set $x_1$ to the value given in $\mathbf{c}$, otherwise we sample $\hat{x}_1 \sim p(x_1 \mid \mathbf{c})$ and set the value of the first pixel to $\hat{x}_1$. We then pass the updated image through the prior network again to generate the conditional probability distribution for the second pixel and continue sampling in this way until the image is complete. Since we know the probability distribution at every pixel, we can also calculate the likelihood of the generated sample by taking the product of the probabilities of each sampled pixel.
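The constrained sampling procedure, including the likelihood computation, can be sketched for binary images as follows. Both `prior_logits` and `conditional_logits` are hypothetical toy stand-ins for the trained networks, and the sketch accumulates log probabilities instead of multiplying probabilities, which is the usual way to avoid numerical underflow.

```python
import numpy as np

def conditional_logits(masked_image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Hypothetical conditioning network: log-odds of a pixel being 1."""
    visible_mean = masked_image.sum() / max(mask.sum(), 1)
    return np.full(mask.shape, 2.0 * visible_mean - 1.0)

def prior_logits(image: np.ndarray) -> np.ndarray:
    """Hypothetical prior network: the logit at pixel i uses only pixel i - 1."""
    flat = image.ravel()
    prev = np.concatenate(([0.0], flat[:-1]))
    return (2.0 * prev - 1.0).reshape(image.shape)

def sample_inpainting(image, mask, rng):
    rows, cols = mask.shape
    cond = conditional_logits(image * mask, mask)  # computed once per task
    sample = np.zeros((rows, cols), dtype=np.float64)
    log_likelihood = 0.0
    for i in range(rows * cols):
        row, col = divmod(i, cols)
        if mask[row, col] == 1:
            sample[row, col] = image[row, col]     # keep the visible pixel
            continue
        # One pass of the prior network per missing pixel.
        logit = prior_logits(sample)[row, col] + cond[row, col]
        p_one = 1.0 / (1.0 + np.exp(-logit))
        value = float(rng.random() < p_one)
        sample[row, col] = value
        log_likelihood += float(np.log(p_one if value == 1.0 else 1.0 - p_one))
    return sample, log_likelihood

rng = np.random.default_rng(0)
image = (rng.random((4, 4)) > 0.5).astype(np.float64)
mask = np.zeros((4, 4))
mask[2:, :] = 1                                    # only the bottom half is visible
inpainting, log_lik = sample_inpainting(image, mask, rng)
```

Calling `sample_inpainting` repeatedly with the same `image` and `mask` yields different completions, each with its own log likelihood.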
5 Experiments
We test our model on the binarized MNIST (the images are binarized by setting any pixel intensity greater than 0.5 to 1 and all others to 0) and CelebA datasets (liu2015faceattributes). As training the model is computationally intensive, we crop and resize the CelebA images to 32 by 32 pixels and quantize the colors to 5 bits (i.e. 32 colors). For both MNIST and CelebA, we use a Gated PixelCNN (van2016conditional) for the prior network and a residual network (he2016deep) for the conditioning network. Full descriptions of the network architectures are given in the supplementary material.
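The two preprocessing steps can be sketched as below, assuming pixel intensities have already been scaled to the range [0, 1] and that quantization is applied to each color channel independently. The function names are illustrative.

```python
import numpy as np

def binarize(image: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Set intensities strictly greater than the threshold to 1, others to 0."""
    return (image > threshold).astype(np.uint8)

def quantize(image: np.ndarray, bits: int = 5) -> np.ndarray:
    """Map intensities in [0, 1] to integer levels 0 .. 2**bits - 1."""
    levels = 2 ** bits
    # An intensity of exactly 1.0 would land on `levels`, so clip to the top level.
    return np.minimum((image * levels).astype(np.int64), levels - 1)

binary_pixels = binarize(np.array([0.0, 0.5, 0.51, 1.0]))
quantized_pixels = quantize(np.array([0.0, 0.5, 1.0]))
```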
The parameters used for generating masks are max_num_blobs=4, , . Since generating masks at every iteration is expensive, we generate a dataset of 50k masks prior to training and randomly sample these during training. The full list of training details can be found in the supplementary material.
5.1 Inpaintings
We test our models on images and masks that were not seen during training. Examples of inpaintings are shown in Fig. 5. As can be seen, the generated samples are realistic and, importantly, diverse. For example, even when the source image for the masked pixels is of a male face, the model plausibly generates a variety of both male and female face completions, each with varying hair, eye color, skin tone and so on. For MNIST, the model generates a variety of digits, all of which naturally match the conditioning.
We further test our model on rectangular masks, by occluding the top 75% of the image. Indeed, as PixelCNNs sample pixels from top to bottom, showing only the bottom pixels ensures that we are testing the conditioning network, since only it will be able to bias the top pixels based on the visible ones. Examples of inpaintings based on bottom pixels are shown in Fig. 6. As can be seen, the sampled inpaintings are plausible, match the visible pixels and are diverse. Interestingly, even though the digit which was used to generate the visible pixels in the figure is a seven, the model is able to generate many other digit completions which plausibly match the constrained pixels. Similarly, while the faces generated in Fig. 6 share the same bottom pixels, the model generates several distinct completions.
5.2 Inpainting Likelihood
As noted in section 4.4, our method allows us to calculate the likelihood of inpaintings. Fig. 7 shows a set of sampled inpaintings ranked by their likelihood. As can be seen, samples with high likelihood tend to look more plausible while low likelihood samples tend to look less realistic. To the best of our knowledge, this is the first method for semantic inpainting of arbitrary occlusions which also estimates the likelihood of the inpaintings. The ability to estimate inpainting likelihoods could be useful for applications where the inpainted image is used for downstream tasks which require some uncertainty quantification (dupont2018generating).
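Ranking a batch of sampled inpaintings by likelihood amounts to a sort on their log likelihoods, as in the sketch below. The sample labels and log likelihood values are made up for the example, and `rank_by_log_likelihood` is an illustrative name.

```python
import numpy as np

def rank_by_log_likelihood(samples, log_likelihoods):
    """Return samples and their log likelihoods, most likely first."""
    order = np.argsort(log_likelihoods)[::-1]
    ranked_samples = [samples[i] for i in order]
    ranked_scores = [log_likelihoods[i] for i in order]
    return ranked_samples, ranked_scores

# Placeholder labels and scores standing in for sampled inpaintings.
names = ["sample_a", "sample_b", "sample_c"]
scores = [-41.5, -12.0, -27.3]
ranked_names, ranked_scores = rank_by_log_likelihood(names, scores)
```

The same ordering could be used to discard the least likely samples by keeping only those above a chosen log likelihood threshold.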
5.3 Pixel Probabilities
As our model estimates the probability $p(x_i \mid x_1, \dots, x_{i-1}, \mathbf{c})$ for each pixel $i$, we can also visualize how the pixel probabilities are affected by various occlusions. Since the MNIST images are binary, we can plot the probability of a pixel intensity being 1 for all pixels in the image, given the visible pixels. Similarly, we can observe how these probabilities change as more pixels are sampled. This is shown in Fig. 8. As can be seen, the conditional pixels bias the model towards generating digits which are plausible given the occlusion. As more pixels are generated, the probabilities become sharper as the model becomes more certain of which digit it will generate. For example, in the first row, the pixel probabilities suggest that a 3, a 5 or an 8 are all plausible completions. As more pixels are sampled it becomes clear that a 5 is the only plausible completion and the pixel probabilities get updated accordingly.
6 Scope and Limitations
While our approach can generate a diverse set of plausible image completions and estimate their likelihood, it also comes with some drawbacks and limitations.
First, our approach is very computationally intensive both during training and sampling. As is well known, PixelCNN models tend to be very slow to train (oord2016pixel) which can limit the applicability of our method to large scale images. Further, most deterministic inpainting algorithms require a single forward pass of a neural net, while our model (since it is based on PixelCNNs) requires as many forward passes as there are pixels in the image.
Second, our model also has failure modes where it generates implausible inpaintings or inpaintings that do not match the surrounding pixels. A few failure examples are shown in Fig. 9. These failure examples tend to have a low likelihood so this problem could be mitigated by only keeping samples which have a likelihood above a certain threshold, but they still represent a limitation of our model.
7 Conclusion
In order to address the uncertainty of image inpainting, we have introduced Pixel Constrained CNN, a model that performs probabilistic semantic inpainting by sampling images from a distribution conditioned on the visible pixels. Experiments show that our model generates plausible and diverse completions for a wide variety of regular and irregular masks on the MNIST and CelebA datasets. Further, our model also calculates the likelihood of the inpaintings which correlates well with the realism of the image completion.
In future work, it would be interesting to scale our approach to larger images by combining it with methods that aim to accelerate the training and generation of PixelCNN models (ramachandran2017fast; kolesnikov2016pixelcnn; reed2017parallel). Further, it would be interesting to explore more sophisticated ways of incorporating the conditional information, such as using a weighted sum of the prior and conditional logits depending on the pixel location.
Appendix A Model Architecture
|Restricted Gated Conv Block, 32 filters,|
|Gated Conv Block, 32 filters,|
|Conv, 2 filters,|
|Residual Blocks, 32 filters,|
|Conv, 2 filters,|
|Restricted Gated Conv Block, 66 filters,|
|Gated Conv Block, 66 filters,|
|Conv, 1023 filters,|
|Conv, 66 filters,|
|Residual Blocks, 66 filters,|
|Conv, 66 filters,|
Appendix B Model Training
Learning rate: 4e-4
Learning rate: 4e-4