TensorFlow implementation of “Semantic Image Inpainting with Perceptual and Contextual Losses” using a Deep Convolutional Generative Adversarial Network (DCGAN): https://arxiv.org/abs/1607.07539
Semantic image inpainting is a challenging task where large missing regions have to be filled based on the available visual data. Existing methods which extract information from only a single image generally produce unsatisfactory results due to the lack of high level context. In this paper, we propose a novel method for semantic image inpainting, which generates the missing content by conditioning on the available data. Given a trained generative model, we search for the closest encoding of the corrupted image in the latent image manifold using our context and prior losses. This encoding is then passed through the generative model to infer the missing content. In our method, inference is possible irrespective of how the missing content is structured, while the state-of-the-art learning based method requires specific information about the holes in the training phase. Experiments on three datasets show that our method successfully predicts information in large missing regions and achieves pixel-level photorealism, significantly outperforming the state-of-the-art methods.
Code for image completion using DCGANs. Code forked from https://github.com/bamos/dcgan-completion.tensorflow. arXiv paper: http://arxiv.org/abs/1607.07539
Semantic inpainting refers to the task of inferring arbitrarily large missing regions in images based on image semantics. Since prediction of high-level context is required, this task is significantly more difficult than classical inpainting or image completion, which is often more concerned with correcting spurious data corruption or removing entire objects. Numerous applications, such as restoration of damaged paintings or image editing, benefit from accurate semantic inpainting methods when large regions are missing. However, inpainting becomes increasingly difficult as the missing regions grow or the scenes become more complex.
Classical inpainting methods are often based on either local or non-local information to recover the image. Most existing methods are designed for single image inpainting. Hence they rely on the information available in the input image, and exploit image priors to address the ill-posedness. For example, total variation (TV) based approaches [34, 1] take into account the smoothness property of natural images, which is useful to fill small missing regions or remove spurious noise. Holes in textured images can be filled by finding a similar texture from the same image. Prior knowledge, such as statistics of patch offsets, planarity or low rank (LR), can greatly improve the result as well. PatchMatch (PM) searches for similar patches in the available part of the image and quickly became one of the most successful inpainting methods due to its high quality and efficiency. However, all single image inpainting methods require appropriate information to be contained in the input image, e.g., similar pixels, structures, or patches. This assumption is hard to satisfy if the missing region is large and possibly of arbitrary shape. Consequently, in this case, these methods are unable to recover the missing information. Fig. 1 shows some challenging examples with large missing regions, where local methods fail to recover the nose and eyes.
In order to address inpainting in the case of large missing regions, non-local methods try to predict the missing pixels using external data. Hays and Efros proposed to cut and paste a semantically similar patch from a huge database. Internet based retrieval can be used to replace a target region of a scene. Both methods require exact matching from the database or Internet, and fail easily when the test scene is significantly different from any database image. Unlike previous hand-crafted matching and editing, learning based methods have shown promising results [27, 38, 33, 22]. After an image dictionary or a neural network is learned, the training set is no longer required for inference. Oftentimes, these learning-based methods are designed for small holes or for removing small text in the image.
Instead of filling small holes in the image, we are interested in the more difficult task of semantic inpainting. It aims to predict the detailed content of a large region based on the context of surrounding pixels. A seminal approach for semantic inpainting, and the one closest to our work, is the Context Encoder (CE) by Pathak et al. Given a mask indicating missing regions, a neural network is trained to encode the context information and predict the unavailable content. However, the CE only takes advantage of the structure of holes during training but not during inference. Hence it results in blurry or unrealistic images, especially when missing regions have arbitrary shapes.
In this paper, we propose a novel method for semantic image inpainting. We consider semantic inpainting as a constrained image generation problem and take advantage of the recent advances in generative modeling. After a deep generative model, i.e., in our case an adversarial network [9, 32], is trained, we search for an encoding of the corrupted image that is “closest” to the image in the latent space. The encoding is then used to reconstruct the image using the generator. We define “closest” by a weighted context loss to condition on the corrupted image, and a prior loss to penalize unrealistic images. Compared to the CE, one of the major advantages of our method is that it does not require the masks for training and can be applied for arbitrarily structured missing regions during inference. We evaluate our method on three datasets: CelebA, SVHN and Stanford Cars, with different forms of missing regions. Results demonstrate that on challenging semantic inpainting tasks our method can obtain much more realistic images than the state of the art techniques.
A large body of literature exists for image inpainting, and due to space limitations we are unable to discuss all of it in detail. Seminal work in that direction includes the aforementioned works and references therein. Since our method is based on generative models and deep neural nets, we will review the technically related learning based work in the following.
Generative Adversarial Networks (GANs) are a framework for training generative parametric models, and have been shown to produce high quality images [9, 4, 32]. This framework trains two networks, a generator, $G$, and a discriminator, $D$. $G$ maps a random vector $z$, sampled from a prior distribution $p_z$, to the image space, while $D$ maps an input image to a likelihood. The purpose of $G$ is to generate realistic images, while $D$ plays an adversarial role, discriminating between the image generated from $G$ and the real image sampled from the data distribution $p_{data}$.
The $G$ and $D$ networks are trained by optimizing the loss function:
$$\min_G \max_D V(G, D) = \mathbb{E}_{h \sim p_{data}(h)}[\log D(h)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))], \tag{1}$$
where $h$ is the sample from the $p_{data}$ distribution; $z$ is a random encoding on the latent space.
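As a rough illustration of the GAN objective, the sketch below estimates the standard value function, E[log D(h)] + E[log(1 - D(G(z)))], from a handful of discriminator outputs. The function name `gan_value` and the sample probabilities are hypothetical and chosen only for illustration; no networks are trained here.

```python
import math

def gan_value(d_real, d_fake):
    """Monte Carlo estimate of V(G, D) = E[log D(h)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs D(h) on real samples, each in (0, 1).
    d_fake: discriminator outputs D(G(z)) on generated samples, each in (0, 1).
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# A discriminator that cannot tell real from generated outputs 0.5 everywhere.
confused = gan_value([0.5, 0.5], [0.5, 0.5])

# A discriminator that separates the two well pushes the value towards 0,
# which is the upper bound of this objective.
confident = gan_value([0.9, 0.95], [0.1, 0.05])
```

The discriminator tries to increase this value, while the generator tries to decrease it by making `d_fake` larger.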
With some user interaction, GANs have been applied in interactive image editing. However, GANs cannot be directly applied to the inpainting task, because they produce an entirely unrelated image with high probability, unless constrained by the provided corrupted image.
Autoencoders and Variational Autoencoders (VAEs) have become a popular approach to learning complex distributions in an unsupervised setting. A variety of VAE flavors exist, e.g., extensions to attribute-based image editing tasks. Compared to GANs, VAEs tend to generate overly smooth images, which is not preferred for inpainting tasks. Fig. 2 shows some examples generated by a VAE and a Deep Convolutional GAN (DCGAN). Note that the DCGAN generates much sharper images. Jointly training VAEs with an adversarial loss prevents the smoothness, but may lead to artifacts.
The Context Encoder (CE) can also be viewed as an autoencoder conditioned on the corrupted images. It produces impressive reconstruction results when the structure of holes is fixed during both training and inference, e.g., fixed in the center, but is less effective for arbitrarily structured regions.
Back-propagation to the input data is employed in our approach to find the encoding which is close to the provided but corrupted image. In earlier work, back-propagation to augment data has been used for texture synthesis and style transfer [8, 7, 20]. Google’s DeepDream uses back-propagation to create dreamlike images. Additionally, back-propagation has also been used to visualize and understand the learned features in a trained network, by “inverting” the network through updating the gradient at the input layer [26, 5, 35, 21]. Similar to our method, all these back-propagation based methods require specifically designed loss functions for the particular tasks.
To fill large missing regions in images, our method for image inpainting utilizes the generator $G$ and the discriminator $D$, both of which are trained with uncorrupted data. After training, the generator $G$ is able to take a point $z$ drawn from $p_z$ and generate an image mimicking samples from $p_{data}$. We hypothesize that if $G$ is efficient in its representation then an image that is not from $p_{data}$ (e.g., corrupted data) should not lie on the learned encoding manifold, $z$. Therefore, we aim to recover the encoding $\hat{z}$ “closest” to the corrupted image while being constrained to the manifold, as illustrated in Fig. 3; we visualize the latent manifold, using t-SNE on the 2-dimensional space, and the intermediate results in the optimization steps of finding $\hat{z}$. After $\hat{z}$ is obtained, we can generate the missing content by using the trained generative model $G$.
More specifically, we formulate the process of finding $\hat{z}$ as an optimization problem. Let $y$ be the corrupted image and $M$ be the binary mask with size equal to the image, to indicate the missing parts. An example of $y$ and $M$ is shown in Fig. 3 (a).
Using this notation we define the “closest” encoding $\hat{z}$ via:
$$\hat{z} = \arg\min_{z} \left\{ \mathcal{L}_c(z \mid y, M) + \mathcal{L}_p(z) \right\}, \tag{2}$$
where $\mathcal{L}_c$ denotes the context loss, which constrains the generated image given the input corrupted image $y$ and the hole mask $M$; $\mathcal{L}_p$ denotes the prior loss, which penalizes unrealistic images. The details of the proposed loss function will be discussed in the following sections.
Besides the proposed method, one may also consider using $D$ to update $y$ by maximizing $D(y)$, similar to back-propagation in DeepDream or neural style transfer. However, the corrupted data $y$ is neither drawn from a real image distribution nor the generated image distribution. Therefore, maximizing $D(y)$ may lead to a solution that is far away from the latent image manifold, which may hence lead to results with poor quality.
To fill large missing regions, our method takes advantage of the remaining available data. We designed the context loss to capture such information. A convenient choice for the context loss is simply the $\ell_2$ norm between the generated sample $G(z)$ and the uncorrupted portion of the input image $y$. However, such a loss treats each pixel equally, which is not desired. Consider the case where the center block is missing: a large portion of the loss will be from pixel locations that are far away from the hole, such as the background behind the face. Therefore, in order to find the correct encoding, we should pay significantly more attention to the uncorrupted pixels that are close to the hole.
To achieve this goal, we propose a context loss with the hypothesis that the importance of an uncorrupted pixel is positively correlated with the number of corrupted pixels surrounding it. A pixel that is very far away from any holes plays very little role in the inpainting process. We capture this intuition with the importance weighting term, $W$,
$$W_i = \begin{cases} \sum_{j \in N(i)} \frac{(1 - M_j)}{|N(i)|} & \text{if } M_i \neq 0 \\ 0 & \text{if } M_i = 0 \end{cases}, \tag{3}$$
where $i$ is the pixel index, $W_i$ denotes the importance weight at pixel location $i$, $N(i)$ refers to the set of neighbors of pixel $i$ in a local window, and $|N(i)|$ denotes the cardinality of $N(i)$. We use a window size of 7 in all experiments.
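The importance weighting can be sketched in a few lines of dependency-free Python. The sketch assumes that the mask uses 1 for a known pixel and 0 for a missing one, that the weight of a known pixel is the fraction of missing pixels in its local window, and that the window includes the centre pixel and is clipped at the image border; the border handling and the function name `importance_weights` are assumptions made for this illustration.

```python
def importance_weights(mask, window=7):
    """Weight each known pixel by the fraction of missing pixels around it.

    mask[r][c] is 1 for a known pixel and 0 for a missing one.
    Missing pixels always receive weight 0.
    """
    rows, cols = len(mask), len(mask[0])
    half = window // 2
    weights = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] == 0:
                continue  # missing pixel: weight stays 0
            missing = 0
            count = 0
            # Visit the window around (r, c), clipped at the image border.
            for rr in range(max(0, r - half), min(rows, r + half + 1)):
                for cc in range(max(0, c - half), min(cols, c + half + 1)):
                    count += 1
                    missing += 1 - mask[rr][cc]
            weights[r][c] = missing / count
    return weights

# 9x9 image with a 3x3 hole in the centre (rows 3..5, columns 3..5).
mask = [[0 if 3 <= r <= 5 and 3 <= c <= 5 else 1 for c in range(9)]
        for r in range(9)]
W = importance_weights(mask)
```

In this example a known pixel next to the hole, such as `W[4][2]`, receives a larger weight than a corner pixel such as `W[0][0]`, while pixels inside the hole receive weight 0.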
Empirically, we also found the $\ell_1$-norm to perform slightly better than the $\ell_2$-norm in our framework. Taking it all together, we define the contextual loss to be a weighted $\ell_1$-norm difference between the recovered image and the uncorrupted portion, defined as follows,
$$\mathcal{L}_c(z \mid y, M) = \| W \odot (G(z) - y) \|_1. \tag{4}$$
Here, $\odot$ denotes the element-wise multiplication.
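A minimal sketch of the weighted context loss follows, assuming it is the sum over pixels of the importance weight times the absolute difference between the generated image and the corrupted input. The images are small grayscale nested lists and all values, including the weights, are made up for illustration.

```python
def context_loss(generated, corrupted, weights):
    """Weighted l1 distance: sum_i W_i * |G(z)_i - y_i| over a 2-D image."""
    total = 0.0
    for g_row, y_row, w_row in zip(generated, corrupted, weights):
        for g, y, w in zip(g_row, y_row, w_row):
            total += w * abs(g - y)
    return total

generated = [[0.2, 0.8], [0.5, 0.1]]   # hypothetical generator output G(z)
corrupted = [[0.0, 0.8], [0.9, 0.0]]   # input y; bottom-right pixel is missing
weights   = [[0.0, 0.5], [0.5, 0.0]]   # hypothetical importance weights W
loss = context_loss(generated, corrupted, weights)
```

Because missing pixels carry zero weight, whatever placeholder value the corrupted image holds inside the hole does not influence the loss.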
The prior loss refers to a class of penalties based on high-level image feature representations instead of pixel-wise differences. In this work, the prior loss encourages the recovered image to be similar to the samples drawn from the training set. Our prior loss is different from the one defined in  which uses features from pre-trained neural networks.
Our prior loss penalizes unrealistic images. Recall that in GANs, the discriminator, $D$, is trained to differentiate generated images from real images. Therefore, we choose the prior loss to be identical to the GAN loss for training the discriminator $D$, i.e.,
$$\mathcal{L}_p(z) = \lambda \log(1 - D(G(z))). \tag{5}$$
Here, $\lambda$ is a parameter to balance between the two losses. $z$ is updated to fool $D$ and make the corresponding generated image more realistic. Without $\mathcal{L}_p$, the mapping from $y$ to $z$ may converge to a perceptually implausible result. We illustrate this by showing the unstable examples where we optimized with and without $\mathcal{L}_p$ in Fig. 4.
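The effect of the prior loss can be illustrated with a small sketch, assuming the loss has the form λ·log(1 − D(G(z))), where D(G(z)) is the probability the discriminator assigns to the generated image being real. The value of `lam` and the discriminator outputs below are hypothetical.

```python
import math

def prior_loss(d_of_g, lam):
    """lam * log(1 - D(G(z))), with d_of_g = D(G(z)) in (0, 1)."""
    return lam * math.log(1.0 - d_of_g)

lam = 0.1  # hypothetical balancing weight, not a value taken from the text

# An image the discriminator finds convincing yields a lower (more negative) loss.
realistic = prior_loss(0.9, lam)
# An image the discriminator rejects yields a loss close to zero.
unrealistic = prior_loss(0.1, lam)
```

Minimizing this term therefore moves the encoding towards images that the discriminator rates as real.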
Fig. 4 panels (left to right): Real, Input, Ours without $\mathcal{L}_p$, Ours with $\mathcal{L}_p$.
With the defined prior and context losses at hand, the corrupted image can be mapped to the closest $z$ in the latent representation space, which we denote $\hat{z}$. $z$ is randomly initialized and updated using back-propagation on the total loss given in Eq. (2). Fig. 3 (b) shows for one example that $z$ is approaching the desired solution on the latent image manifold.
After generating $G(\hat{z})$, the inpainting result can be easily obtained by overlaying the uncorrupted pixels from the input. However, we found that the predicted pixels may not exactly preserve the same intensities of the surrounding pixels, although the content is correct and well aligned. Poisson blending is used to reconstruct our final results. The key idea is to keep the gradients of $G(\hat{z})$ to preserve image details while shifting the color to match the color in the input image $y$. Our final solution, $\hat{x}$, can be obtained by:
$$\hat{x} = \arg\min_{x} \| \nabla x - \nabla G(\hat{z}) \|_2^2, \quad \text{s.t. } x_i = y_i \text{ for } M_i = 1. \tag{6}$$
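The blending idea can be shown on a one-dimensional signal. The sketch below is only a 1-D analogue, not the 2-D procedure used for images: it keeps known pixels fixed to the input and solves for the missing ones so that neighbouring differences match those of the generated signal, using Gauss-Seidel sweeps. It assumes the first and last samples are known; the function name and the data are hypothetical.

```python
def poisson_blend_1d(generated, corrupted, mask, iterations=500):
    """Blend a generated 1-D signal into a corrupted one.

    mask[i] is 1 where corrupted[i] is known and 0 where it is missing.
    Known samples are copied from the input; missing samples are solved so
    that x[i+1] - x[i] matches generated[i+1] - generated[i] in the least
    squares sense. Assumes mask[0] == mask[-1] == 1.
    """
    x = [y if m == 1 else g for g, y, m in zip(generated, corrupted, mask)]
    for _ in range(iterations):
        for i in range(1, len(x) - 1):
            if mask[i] == 1:
                continue  # known sample: keep the input value
            # Stationarity condition of the quadratic objective at sample i.
            x[i] = 0.5 * (x[i - 1] + x[i + 1]
                          + 2.0 * generated[i]
                          - generated[i - 1] - generated[i + 1])
    return x

# The generated signal has the right shape but is 0.2 darker than the input.
generated = [0.5, 0.5, 0.6, 0.7, 0.6, 0.5, 0.5]
corrupted = [0.7, 0.7, 0.0, 0.0, 0.0, 0.7, 0.7]
mask      = [1, 1, 0, 0, 0, 1, 1]
blended = poisson_blend_1d(generated, corrupted, mask)
```

Here the filled samples keep the bump of the generated signal but are shifted by 0.2, so they meet the surrounding known samples without a visible intensity step.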
In general, our contribution is orthogonal to specific GAN architectures and our method can take advantage of any generative model $G$. We used the DCGAN model architecture from Radford et al. in the experiments. The generative model, $G$, takes a random 100 dimensional vector drawn from a uniform distribution between $[-1, 1]$ and generates a $64 \times 64 \times 3$ image. The discriminator model, $D$, is structured essentially in reverse order. The input layer is an image of dimension $64 \times 64 \times 3$, followed by a series of convolution layers where the image dimension is halved and the number of channels is doubled relative to the previous layer, and the output layer is a two class softmax.
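To make the halving and doubling pattern of the discriminator concrete, the sketch below traces feature-map shapes through a stack of stride-2 convolutions. The input size of 64×64×3, the 64 channels after the first convolution, and the count of four convolutions are assumptions in the style of DCGAN, used here only for illustration.

```python
def discriminator_shapes(size=64, in_channels=3, base_channels=64, num_convs=4):
    """Trace (height, width, channels) through stride-2 convolution layers.

    Each layer halves the spatial size; the channel count starts at
    base_channels and doubles from one layer to the next.
    """
    shapes = [(size, size, in_channels)]
    channels = base_channels
    for _ in range(num_convs):
        size //= 2
        shapes.append((size, size, channels))
        channels *= 2
    return shapes

shapes = discriminator_shapes()
for shape in shapes:
    print(shape)
```

Under these assumptions the final feature map is 4×4×512, which would then feed the output layer. The generator runs through the same shapes in the opposite direction.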
For training the DCGAN model, we follow the training procedure in the DCGAN work and use Adam for optimization. We use the same value of $\lambda$ in all our experiments. We also perform data augmentation of random horizontal flipping on the training images. In the inpainting stage, we need to find $\hat{z}$ in the latent space using back-propagation. We use Adam for optimization and restrict $z$ to $[-1, 1]$ in each iteration, which we observe to produce more stable results. We terminate the back-propagation after 1500 iterations. We use the identical setting for all testing datasets and masks.
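The latent search can be sketched end to end on a toy problem. Several simplifications are made and should be read as assumptions: the "generator" is a fixed linear map from a 2-D code to four pixels, plain projected subgradient descent replaces Adam, the prior loss is omitted because the toy has no discriminator, and the code is clipped to [-1, 1] after every step. All names and values are hypothetical.

```python
import random

# Hypothetical toy "generator": a fixed linear map from a 2-D code to 4 pixels.
B = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]

def generate(z):
    return [row[0] * z[0] + row[1] * z[1] for row in B]

def sign(v):
    return (v > 0) - (v < 0)

def context_loss(z, y, w):
    """Weighted l1 distance between G(z) and the corrupted image y."""
    return sum(wi * abs(gi - yi) for gi, yi, wi in zip(generate(z), y, w))

def context_subgradient(z, y, w):
    """Subgradient of the weighted l1 loss with respect to z."""
    signs = [wi * sign(gi - yi) for gi, yi, wi in zip(generate(z), y, w)]
    return [sum(s * row[k] for s, row in zip(signs, B)) for k in range(2)]

def search_latent(y, w, iterations=1500, step=0.01, seed=0):
    rng = random.Random(seed)
    z = [rng.uniform(-1.0, 1.0) for _ in range(2)]  # random initialization
    best_z, best_loss = list(z), context_loss(z, y, w)
    for _ in range(iterations):
        loss = context_loss(z, y, w)
        if loss < best_loss:
            best_z, best_loss = list(z), loss
        grad = context_subgradient(z, y, w)
        # Gradient step followed by clipping the code back into [-1, 1].
        z = [min(1.0, max(-1.0, zi - step * gi)) for zi, gi in zip(z, grad)]
    return best_z, best_loss

true_z = [0.5, -0.3]
y = generate(true_z)
y[3] = 0.0                      # the last pixel is missing
w = [1.0, 1.0, 1.0, 0.0]        # zero weight on the missing pixel
best_z, best_loss = search_latent(y, w)
filled = generate(best_z)[3]    # inpainted value for the missing pixel
```

The search only ever sees the three known pixels, yet the recovered code also determines the missing one, which is the mechanism the method relies on. In this toy the true value of the missing pixel is 0.8.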
In the following section we evaluate results qualitatively and quantitatively; more comparisons are provided in the supplementary material.
The CelebA dataset contains face images with coarse alignment. We remove approximately 2000 images from the dataset for testing. The images are cropped at the center to $64 \times 64$, which contain faces with various viewpoints and expressions.
The SVHN dataset contains a total of 99,289 RGB images of cropped house numbers. The images are resized to $64 \times 64$ to fit the DCGAN model architecture. We used the provided training and testing split. The numbers in the images are not aligned and have different backgrounds.
The Stanford Cars dataset contains 16,185 images of 196 classes of cars. As with the CelebA dataset, we do not use any attributes or labels for either training or testing. The cars are cropped based on the provided bounding boxes and resized to $64 \times 64$. As before, we use the provided training and test set partition.
Comparisons with TV and LR inpainting. We compare our method with local inpainting methods. As we already showed in Fig. 1, local methods generally fail for large missing regions. We compare our method with TV inpainting  and LR inpainting [24, 12] on images with small random holes. The test images and results are shown in Fig. 6. Due to a large number of missing points, TV and LR based methods cannot recover enough image details, resulting in very blurry and noisy images. PM  cannot be applied to this case due to insufficient available patches.
Comparisons with NN inpainting. Next we compare our method with nearest neighbor (NN) filling from the training dataset, which is a key component in retrieval based methods [10, 37]. Examples are shown in Fig. 7, where the misalignment of skin texture, eyebrows, eyes and hair can be clearly observed when using the nearest patches in Euclidean distance. Although different features can be used for retrieval, the inherent misalignment problem cannot be easily solved. In contrast, our results are obtained automatically without any registration.
Comparisons with CE. In the remainder, we compare our results with those obtained from the CE, the state-of-the-art method for semantic inpainting. It is important to note that the masks are required to train the CE. For a fair comparison, we use all the test masks in the training phase for the CE. However, there are infinitely many shapes and missing ratios for the inpainting task. To achieve satisfactory results one may need to re-train the CE. In contrast, our method can be applied to arbitrary masks without re-training the network, which is, in our opinion, a huge advantage when considering inpainting applications.
Figs. 8 and 9 show the results on the CelebA dataset with four types of masks. Despite some small artifacts, the CE performs best with central masks. This is due to the fact that the hole is always fixed during both training and testing in this case, and the CE can easily learn to fill the hole from the context. However, randomly missing data is much more difficult for the CE to learn. In addition, the CE does not use the mask for inference but pre-fills the hole with the mean color. It may mistakenly treat some uncorrupted pixels with similar color as unknown. We observe that the CE produces more artifacts and blurrier results when the hole is at random positions. In many cases, our results are as realistic as the real images. Results on SVHN and car datasets are shown in Figs. 10 and 11, and our method generally produces visually more appealing results than the CE since the images are sharper and contain fewer artifacts.
It is important to note that semantic inpainting is not trying to reconstruct the ground-truth image. The goal is to fill the hole with realistic content. Even the ground-truth image is only one of many possibilities. However, readers may be interested in quantitative results, often reported by classical inpainting approaches. Following previous work, we compare the PSNR values of our results and those by the CE. The real images from the dataset are used as ground-truth reference. Table 1 provides the results on the three datasets. The CE has higher PSNR values in most cases except for the random masks, as it is trained to minimize the mean square error. Similar results are obtained using SSIM instead of PSNR. These results conflict with the aforementioned visual comparisons, where our results generally have better perceptual quality.
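For reference, PSNR is computed from the mean squared error between a reference image and an estimate. The sketch below uses the standard definition, 10·log10(peak² / MSE), on small grayscale images stored as nested lists with intensities in [0, 1]; the pixel values are made up for illustration.

```python
import math

def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB for two equally sized 2-D images."""
    flat_ref = [v for row in reference for v in row]
    flat_est = [v for row in estimate for v in row]
    mse = sum((a - b) ** 2 for a, b in zip(flat_ref, flat_est)) / len(flat_ref)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)

reference = [[0.0, 0.0], [1.0, 1.0]]
estimate  = [[0.1, 0.0], [1.0, 0.9]]
value = psnr(reference, estimate)
```

Since the measure depends only on the pixel-wise squared error, a plausible completion that differs from the ground truth (for example, a different hairstyle) is penalized as heavily as a visibly wrong one.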
We investigate this claim by carefully examining the errors of the results. Fig. 12 shows the results of one example and the corresponding error images. Judging from the figure, our result looks artifact-free and very realistic, while the result obtained from the CE has visible artifacts in the reconstructed region. However, the PSNR value of the CE is 1.73dB higher than ours. The error image shows that our result has large errors in the hair area, because we generate a hairstyle which is different from the real image. This indicates that quantitative results do not represent well the real performance of different methods when the ground-truth is not unique. Similar observations can be found in recent super-resolution works [14, 19], where better visual results correspond to lower PSNR values.
Fig. 12 panels: Real, CE Error ×2, Ours Error ×2.
For random holes, both methods achieve much higher PSNR, even though a large fraction of the pixels is missing. In this case, our method outperforms the CE. This is because uncorrupted pixels are spread across the entire image, and the flexibility of the reconstruction is strongly restricted; therefore PSNR is more meaningful in this setting, which is more similar to the one considered in classical inpainting works.
While the results are promising, the limitation of our method is also obvious. Indeed, its prediction performance strongly relies on the generative model and the training procedure. Some failure examples are shown in Fig. 13, where our method cannot find the correct $\hat{z}$ in the latent image manifold. The current GAN model in this paper works well for relatively simple structures like faces, but is too small to represent complex scenes in the world. Conveniently, stronger generative models improve our method in a straightforward way.
In this paper, we proposed a novel method for semantic inpainting. Compared to existing methods based on local image priors or patches, the proposed method learns the representation of training data, and can therefore predict meaningful content for corrupted images. Compared to CE, our method often obtains images with sharper edges which look much more realistic. Experimental results demonstrated its superior performance on challenging image inpainting examples.
Acknowledgments: This work is supported in part by IBM-ILLINOIS Center for Cognitive Computing Systems Research (C3SR) - a research collaboration as part of the IBM Cognitive Horizons Network. This work is supported by NVIDIA Corporation with the donation of a GPU.
Texture synthesis using convolutional neural networks. In NIPS, 2015.
Journal of Machine Learning Research, 2008.