Generative Image Inpainting with Contextual Attention https://arxiv.org/abs/1801.07892
Recent deep learning based approaches have shown promising results on image inpainting for the challenging task of filling in large missing regions in an image. These methods can generate visually plausible image structures and textures, but often create distorted structures or blurry textures inconsistent with surrounding areas. This is mainly due to ineffectiveness of convolutional neural networks in explicitly borrowing or copying information from distant spatial locations. On the other hand, traditional texture and patch synthesis approaches are particularly suitable when it needs to borrow textures from the surrounding regions. Motivated by these observations, we propose a new deep generative model-based approach which can not only synthesize novel image structures but also explicitly utilize surrounding image features as references during network training to make better predictions. The model is a feed-forward, fully convolutional neural network which can process images with multiple holes at arbitrary locations and with variable sizes during the test time. Experiments on multiple datasets including faces, textures and natural images demonstrate that the proposed approach generates higher-quality inpainting results than existing ones. Code and trained models will be released.
Filling missing pixels of an image, often referred to as image inpainting or completion, is an important task in computer vision. It has many applications in photo editing, image-based rendering and computational photography [3, 25, 30, 31, 36, 41]. The core challenge of image inpainting lies in synthesizing visually realistic and semantically plausible pixels for the missing regions that are coherent with existing ones.
Early works [3, 14] attempted to solve the problem using ideas similar to texture synthesis [10, 11], i.e. by matching and copying background patches into holes starting from low-resolution to high-resolution or propagating from hole boundaries. These approaches work well especially in background inpainting tasks, and are widely deployed in practical applications. However, as they assume missing patches can be found somewhere in background regions, they cannot hallucinate novel image contents for challenging cases where inpainting regions involve complex, non-repetitive structures (e.g. faces, objects). Moreover, these methods are not able to capture high-level semantics.
Rapid progress in deep convolutional neural networks (CNN) and generative adversarial networks (GAN) inspired recent works [17, 27, 32, 41] to formulate inpainting as a conditional image generation problem where high-level recognition and low-level pixel synthesis are formulated into a convolutional encoder-decoder network, jointly trained with adversarial networks to encourage the coherency between generated and existing pixels. These works are shown to generate plausible new contents in highly structured images, such as faces, objects and scenes.
Unfortunately, these CNN-based methods often create boundary artifacts, distorted structures and blurry textures inconsistent with surrounding areas. We found that this is likely due to the ineffectiveness of convolutional neural networks in modeling long-term correlations between distant contextual information and the hole regions. For example, to allow a pixel to be influenced by content 64 pixels away, at least 6 layers of convolutions with dilation factor 2 or equivalent are required [17, 42]. Nevertheless, a dilated convolution samples features from a regular and symmetric grid and thus may not be able to weigh the features of interest over the others. Note that a recent work attempts to address the appearance discrepancy by optimizing texture similarities between generated patches and the matched patches in known regions. Although it improves the visual quality, this method is slowed by hundreds of gradient descent iterations and takes minutes to process an image with resolution $512 \times 512$ on GPUs.
We present a unified feed-forward generative network with a novel contextual attention layer for image inpainting. Our proposed network consists of two stages. The first stage is a simple dilated convolutional network trained with reconstruction loss to rough out the missing contents. The contextual attention is integrated in the second stage. The core idea of contextual attention is to use the features of known patches as convolutional filters to process the generated patches. It is designed and implemented with convolution for matching generated patches with known contextual patches, channel-wise softmax to weigh relevant patches and deconvolution to reconstruct the generated patches with contextual patches. The contextual attention module also has a spatial propagation layer to encourage spatial coherency of attention. In order to allow the network to hallucinate novel contents, we have another convolutional pathway in parallel with the contextual attention pathway. The two pathways are aggregated and fed into a single decoder to obtain the final output. The whole network is trained end to end with reconstruction losses and two Wasserstein GAN losses [1, 13], where one critic looks at the global image while the other looks at the local patch of the missing region.
Experiments on multiple datasets including faces, textures and natural images demonstrate that the proposed approach generates higher-quality inpainting results than existing ones. Example results are shown in Figure 1.
Our contributions are summarized as follows:
We propose a novel contextual attention layer to explicitly attend on related feature patches at distant spatial locations.
We introduce several techniques including inpainting network enhancements, global and local WGANs and spatially discounted reconstruction loss to improve the training stability and speed based on the current state-of-the-art generative image inpainting network. As a result, we are able to train the network in a week instead of two months.
Existing works for image inpainting can be mainly divided into two groups. The first group represents traditional diffusion-based or patch-based methods with low-level features. The second group attempts to solve the inpainting problem by a learning-based approach, e.g. training deep convolutional neural networks to predict pixels for the missing regions.
Traditional diffusion or patch-based approaches such as [2, 4, 10, 11] typically use variational algorithms or patch similarity to propagate information from the background regions to the holes. These methods work well for stationary textures but are limited for non-stationary data such as natural images. Simakov et al. propose a bidirectional patch similarity-based scheme to better model non-stationary visual data for re-targeting and inpainting applications. However, dense computation of patch similarity is a very expensive operation, which prohibits practical applications of such methods. To address this challenge, a fast nearest neighbor field algorithm called PatchMatch has been proposed, which has shown significant practical value for image editing applications including inpainting.
Recently, deep learning and GAN-based approaches have emerged as a promising paradigm for image inpainting. Initial efforts [23, 39] train convolutional neural networks for denoising and inpainting of small regions. Context Encoders were the first to train deep neural networks for inpainting large holes. The model is trained to complete the $64 \times 64$ center region of a $128 \times 128$ image, with both pixel-wise reconstruction loss and generative adversarial loss as the objective function. More recently, Iizuka et al. improve it by introducing both global and local discriminators as adversarial losses. The global discriminator assesses if the completed image is coherent as a whole, while the local discriminator focuses on a small area centered at the generated region to enforce local consistency. In addition, Iizuka et al. use dilated convolutions in the inpainting network to replace the channel-wise fully connected layer adopted in Context Encoders; both techniques are proposed for increasing receptive fields of output neurons. Meanwhile, there have been several studies focusing on generative face inpainting. Yeh et al. search for the closest encoding in latent space of the corrupted image and decode to get the completed image. Li et al. introduce an additional face parsing loss for face completion. However, these methods typically require post-processing steps such as image blending to enforce color coherency near the hole boundaries.
Several works [37, 40] follow ideas from image stylization [5, 26] to formulate the inpainting as an optimization problem. For example, Yang et al. propose a multi-scale neural patch synthesis approach based on joint optimization of image content and texture constraints, which not only preserves contextual structures but also produces high-frequency details by matching and adapting patches with the most similar mid-layer feature correlations of a deep classification network. This approach shows promising visual results but is very slow due to the optimization process.
There have been many studies on learning spatial attention in deep convolutional neural networks. Here, we review a few representative ones related to the proposed contextual attention model. Jaderberg et al. first propose a parametric spatial attention module called spatial transformer network (STN) for object classification tasks. The model has a localization module to predict parameters of a global affine transformation to warp features. However, this model assumes a global transformation and so is not suitable for modeling patch-wise attention. Zhou et al. introduce an appearance flow to predict offset vectors specifying which pixels in the input view should be moved to reconstruct the target view for novel view synthesis. This method is shown to be effective for matching related views of the same objects but is not effective in predicting a flow field from the background region to the hole, according to our experiments. Recently, Dai et al. and Jeon et al. propose to learn spatially attentive or active convolutional kernels. These methods can potentially better leverage information to deform the convolutional kernel shape during training but may still be limited when we need to borrow exact features from the background.
We first construct our baseline generative image inpainting network by reproducing and making several improvements to the recent state-of-the-art inpainting model, which has shown promising visual results for inpainting images of faces, building facades and natural images.
Coarse-to-fine network architecture The network architecture of our improved model is shown in Figure 2. We follow the same input and output configurations as that model for training and inference, i.e. the generator network takes an image with white pixels filled in the holes and a binary mask indicating the hole regions as input pairs, and outputs the final completed image. We pair the input with a corresponding binary mask to handle holes with variable sizes, shapes and locations. The input to the network is a $256 \times 256$ image with a rectangular missing region sampled randomly during training, and the trained model can take an image of different sizes with multiple holes in it.
In image inpainting tasks, the size of the receptive fields should be sufficiently large, and Iizuka et al. adopt dilated convolution for that purpose. To further enlarge the receptive fields and stabilize training, we introduce a two-stage coarse-to-fine network architecture where the first network makes an initial coarse prediction, and the second network takes the coarse prediction as input and predicts refined results. The coarse network is trained with the reconstruction loss explicitly, while the refinement network is trained with the reconstruction as well as GAN losses. Intuitively, the refinement network sees a more complete scene than the original image with missing regions, so its encoder can learn a better feature representation than the coarse network. This two-stage network architecture is similar in spirit to residual learning or deep supervision.
Also, our inpainting network is designed in a thin and deep scheme for efficiency and has fewer parameters than the one in [17]. In terms of layer implementations, we use mirror padding for all convolution layers and remove batch normalization layers [18] (which we found deteriorates color coherence). Also, we use ELUs as activation functions instead of the ReLU used in [17], and clip the output filter values instead of using tanh or sigmoid functions. In addition, we found that separating global and local feature representations for GAN training works better than feature concatenation. More details can be found in the supplementary materials.
Global and local Wasserstein GANs Different from previous generative inpainting networks [17, 27, 32] which rely on DCGAN for adversarial supervision, we propose to use a modified version of WGAN-GP [1, 13]. We attach the WGAN-GP loss to both global and local outputs of the second-stage refinement network to enforce global and local consistency, inspired by Iizuka et al. WGAN-GP loss is well-known to outperform existing GAN losses for image generation tasks, and it works well when combined with $\ell_1$ reconstruction loss as they both use the $\ell_1$ distance metric.
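Specifically, WGAN uses the Earth-Mover distance (a.k.a. Wasserstein-1 distance) $W(\mathbb{P}_r, \mathbb{P}_g)$ for comparing the generated and real data distributions. Its objective function is constructed by applying the Kantorovich-Rubinstein duality:
$$\min_G \max_{D \in \mathcal{D}} \; \mathbb{E}_{x \sim \mathbb{P}_r}[D(x)] - \mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}[D(\tilde{x})],$$
where $\mathcal{D}$ is the set of 1-Lipschitz functions and $\mathbb{P}_g$ is the model distribution implicitly defined by $\tilde{x} = G(z)$. $z$ is the input to the generator.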
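Gulrajani et al. proposed an improved version of WGAN with a gradient penalty term
$$\lambda \, \mathbb{E}_{\hat{x} \sim \mathbb{P}_{\hat{x}}} \left( \|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1 \right)^2,$$
where $\hat{x}$ is sampled from the straight line between points sampled from distribution $\mathbb{P}_g$ and $\mathbb{P}_r$. The reason is that the gradient of the optimal critic $D^*$ at all points $\hat{x} = (1 - t)x + t\tilde{x}$ on the straight line should point directly towards current sample $\tilde{x}$, meaning $\nabla_{\hat{x}} D^*(\hat{x}) = \frac{\tilde{x} - \hat{x}}{\|\tilde{x} - \hat{x}\|}$.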
For image inpainting, we only try to predict hole regions, thus the gradient penalty should be applied only to pixels inside the holes. This can be implemented with multiplication of gradients and input mask $m$ as follows:
$$\lambda \, \mathbb{E}_{\hat{x} \sim \mathbb{P}_{\hat{x}}} \left( \|\nabla_{\hat{x}} D(\hat{x}) \odot (1 - m)\|_2 - 1 \right)^2,$$
where the mask value is 0 for missing pixels and 1 for elsewhere. $\lambda$ is set to 10 in all experiments.
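As a rough illustration of the masked gradient penalty described above, the following NumPy sketch computes the penalty from critic gradients that are assumed to have been obtained elsewhere (for example from an autodiff framework). The function names `interpolate` and `masked_gradient_penalty` and the `(N, H, W, C)` array layout are assumptions made for this example, not part of the released implementation.

```python
import numpy as np

def interpolate(real, fake, rng):
    """Sample x_hat on the straight line between paired real and generated images."""
    t = rng.uniform(size=(real.shape[0], 1, 1, 1))  # one mixing weight per sample
    return (1.0 - t) * real + t * fake

def masked_gradient_penalty(grad, mask, lam=10.0):
    """Gradient penalty restricted to hole pixels.

    grad: critic gradients w.r.t. x_hat, shape (N, H, W, C) (assumed given).
    mask: shape (N, H, W, 1), 0 for missing pixels and 1 elsewhere.
    """
    masked = grad * (1.0 - mask)  # zero out gradients on known pixels
    norms = np.sqrt((masked ** 2).reshape(masked.shape[0], -1).sum(axis=1))
    return lam * np.mean((norms - 1.0) ** 2)
```

With a single hole pixel and unit gradients, the masked gradient norm is 1 and the penalty vanishes; any gradient on known pixels is ignored entirely.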
We use a weighted sum of pixel-wise $\ell_1$ loss (instead of mean-square-error) and WGAN adversarial losses. Note that in primal space, Wasserstein-1 distance in WGAN is based on $\ell_1$ ground distance:
$$W(\mathbb{P}_r, \mathbb{P}_g) = \inf_{\gamma \in \Pi(\mathbb{P}_r, \mathbb{P}_g)} \mathbb{E}_{(x, y) \sim \gamma}[\|x - y\|],$$
where $\Pi(\mathbb{P}_r, \mathbb{P}_g)$ denotes the set of all joint distributions $\gamma(x, y)$ whose marginals are respectively $\mathbb{P}_r$ and $\mathbb{P}_g$. Intuitively, the pixel-wise reconstruction loss directly regresses holes to the current ground truth image, while WGANs implicitly learn to match potentially correct images and train the generator with adversarial gradients. As both losses measure pixel-wise $\ell_1$ distances, the combined loss is easier to train and makes the optimization process more stable.
Spatially discounted reconstruction loss Inpainting problems involve hallucination of pixels, so it could have many plausible solutions for any given context. In challenging cases, a plausible completed image can have patches or pixels that are very different from those in the original image. As we use the original image as the only ground truth to compute a reconstruction loss, strong enforcement of reconstruction loss in those pixels may mislead the training process of convolutional network.
Intuitively, missing pixels near the hole boundaries have much less ambiguity than those pixels closer to the center of the hole. This is similar to the issue observed in reinforcement learning. When long-term rewards have large variations during sampling, people use temporal discounted rewards over sampled trajectories. Inspired by this, we introduce spatially discounted reconstruction loss using a weight mask $M$. The weight of each pixel in the mask is computed as $\gamma^l$, where $l$ is the distance of the pixel to the nearest known pixel. $\gamma$ is set to 0.99 in all experiments.
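A minimal sketch of how such a discount mask could be built is given below. The text does not say which distance metric is used, so this example assumes city-block (4-neighbour) distance, computed by peeling the hole one layer at a time; `discount_mask` is a hypothetical helper name, and known pixels are given weight 1 as a further assumption.

```python
import numpy as np

def discount_mask(hole, gamma=0.99):
    """Weight gamma**l per hole pixel, l = city-block distance to the nearest known pixel.

    hole: 2-D boolean array, True where pixels are missing.
    Known pixels get weight 1.0 (an assumption of this sketch).
    """
    hole = np.asarray(hole, dtype=bool)
    if hole.all():
        raise ValueError("mask has no known pixels")
    dist = np.zeros(hole.shape, dtype=np.int64)
    remaining = hole.copy()
    step = 0
    while remaining.any():
        step += 1
        p = np.pad(remaining, 1, mode="edge")
        # A pixel is interior if it and its four neighbours are all still unassigned.
        interior = p[1:-1, 1:-1] & p[:-2, 1:-1] & p[2:, 1:-1] & p[1:-1, :-2] & p[1:-1, 2:]
        dist[remaining & ~interior] = step  # current boundary layer of the hole
        remaining = interior
    return np.where(hole, gamma ** dist, 1.0)
```

The resulting array can be multiplied element-wise with the per-pixel reconstruction error before averaging, so that pixels deep inside the hole contribute less to the loss.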
Similar weighting ideas are also explored in [32, 41]. The importance weighted context loss is spatially weighted by the ratio of uncorrupted pixels within a fixed window (e.g. $7 \times 7$). Pathak et al. predict a slightly larger patch with higher loss weighting ($\times 10$) in the border area. For inpainting large holes, the proposed discounted loss is more effective for improving the visual quality. We use discounted $\ell_1$ reconstruction loss in our implementation.
With all the above improvements, our baseline generative inpainting model converges much faster than the original model and produces more accurate inpainting results. For Places2, we reduce the training time from the 11,520 GPU-hours (K80) reported for the original model to 120 GPU-hours (GTX 1080), which is almost a $100\times$ speedup. Moreover, the post-processing step (image blending) is no longer necessary.
Convolutional neural networks process image features with local convolutional kernels layer by layer and thus are not effective at borrowing features from distant spatial locations. To overcome this limitation, we consider an attention mechanism and introduce a novel contextual attention layer in the deep generative network. In this section, we first discuss details of the contextual attention layer, and then address how we integrate it into our unified inpainting network.
The contextual attention layer learns where to borrow or copy feature information from known background patches to generate missing patches. It is differentiable, thus can be trained in deep models, and fully-convolutional, which allows testing on arbitrary resolutions.
Match and attend We consider the problem where we want to match features of missing pixels (foreground) to surroundings (background). As shown in Figure 3, we first extract patches ($3 \times 3$) in background and reshape them as convolutional filters. To match foreground patches $\{f_{x,y}\}$ with background ones $\{b_{x',y'}\}$, we measure with normalized inner product (cosine similarity)
$$s_{x,y,x',y'} = \left\langle \frac{f_{x,y}}{\|f_{x,y}\|}, \frac{b_{x',y'}}{\|b_{x',y'}\|} \right\rangle,$$
where $s_{x,y,x',y'}$ represents similarity of the patch centered in background $(x', y')$ and foreground $(x, y)$. Then we weigh the similarity with scaled softmax along the $x'y'$-dimension to get the attention score for each pixel $s^*_{x,y,x',y'} = \mathrm{softmax}_{x',y'}(\lambda s_{x,y,x',y'})$, where $\lambda$ is a constant value. This is efficiently implemented as convolution and channel-wise softmax. Finally, we reuse extracted patches as deconvolutional filters to reconstruct foregrounds. Values of overlapped pixels are averaged.
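The match-and-attend step can be sketched in NumPy as below. This is a simplified illustration rather than the convolution/deconvolution implementation described in the text: patches are compared with an explicit matrix product, each foreground location is reconstructed from the centre features of background patches instead of averaging overlapping deconvolved patches, and the softmax scale of 10 is an assumed value. The names `extract_patches` and `contextual_attention` are hypothetical.

```python
import numpy as np

def extract_patches(feat, k=3):
    """Return an (H*W, k*k*C) array of k x k patches centred on every location (zero padded)."""
    h, w, c = feat.shape
    r = k // 2
    padded = np.pad(feat, ((r, r), (r, r), (0, 0)))
    patches = np.empty((h, w, k, k, c), dtype=feat.dtype)
    for dy in range(k):
        for dx in range(k):
            patches[:, :, dy, dx, :] = padded[dy:dy + h, dx:dx + w, :]
    return patches.reshape(h * w, k * k * c)

def contextual_attention(feat, hole, scale=10.0, k=3, eps=1e-8):
    """feat: (H, W, C) float features; hole: (H, W) bool, True for foreground (missing)."""
    h, w, c = feat.shape
    patches = extract_patches(feat, k)
    unit = patches / (np.linalg.norm(patches, axis=1, keepdims=True) + eps)
    fg = np.flatnonzero(hole.ravel())    # foreground (hole) locations
    bg = np.flatnonzero(~hole.ravel())   # background (known) locations
    sim = unit[fg] @ unit[bg].T          # cosine similarity, one row per foreground patch
    logits = scale * sim
    logits -= logits.max(axis=1, keepdims=True)   # for numerical stability
    attn = np.exp(logits)
    attn /= attn.sum(axis=1, keepdims=True)       # softmax over background locations
    flat = feat.reshape(h * w, c)
    out = flat.copy()
    out[fg] = attn @ flat[bg]            # attention-weighted copy of background features
    return out.reshape(h, w, c), attn
```

Each row of `attn` is a distribution over background locations for one foreground pixel; features outside the hole are returned unchanged.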
Attention propagation We further encourage coherency of attention by propagation (fusion). The idea of coherency is that a shift in the foreground patch is likely to correspond to an equal shift in the background patch for attention. For example, $s^*_{x,y,x',y'}$ usually has a value close to $s^*_{x+1,y,x'+1,y'}$. To model and encourage coherency of attention maps, we do a left-right propagation followed by a top-down propagation with kernel size of $k$. Taking left-right propagation as an example, we get the new attention score with:
$$\hat{s}_{x,y,x',y'} = \sum_{i \in \{-k, \ldots, k\}} s^*_{x+i,y,x'+i,y'}.$$
The propagation is efficiently implemented as convolution with identity matrix as kernels. Attention propagation significantly improves inpainting results in testing and enriches gradients in training.
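The propagation step can be illustrated with explicit array shifts instead of the identity-kernel convolution mentioned above. The sketch assumes the attention scores are stored as a dense 4-D array indexed `[y, x, y', x']`, treats scores outside the map as zero, and sums over shifts from `-k` to `k`; the helper names are hypothetical.

```python
import numpy as np

def shift_pair(attn, i, axes):
    """Return out where out[.., a, .., b, ..] = attn[.., a+i, .., b+i, ..] on `axes`, zero filled."""
    out = np.zeros_like(attn)
    src = [slice(None)] * attn.ndim
    dst = [slice(None)] * attn.ndim
    for ax in axes:
        n = attn.shape[ax]
        if i >= 0:
            src[ax] = slice(min(i, n), n)
            dst[ax] = slice(0, max(n - i, 0))
        else:
            src[ax] = slice(0, max(n + i, 0))
            dst[ax] = slice(min(-i, n), n)
    out[tuple(dst)] = attn[tuple(src)]
    return out

def propagate(attn, k=1):
    """Left-right propagation (x and x' shifted together), then top-down (y and y')."""
    left_right = sum(shift_pair(attn, i, axes=(1, 3)) for i in range(-k, k + 1))
    return sum(shift_pair(left_right, i, axes=(0, 2)) for i in range(-k, k + 1))
```

A single attention peak linking foreground pixel `(y, x)` to background pixel `(y', x')` is spread to the neighbouring pairs shifted by the same offset, which is the coherency assumption stated in the text.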
Memory efficiency Assuming that a $64 \times 64$ region is missing in a $128 \times 128$ feature map, then the number of convolutional filters extracted from backgrounds is 12,288. This may cause memory overhead for GPUs. To overcome this issue, we introduce two options: 1) extracting background patches with strides to reduce the number of filters and 2) downscaling resolution of foreground inputs before convolution and upscaling attention map after propagation.
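The filter count quoted above can be checked with a few lines of NumPy. The sketch assumes a centred square hole and uses sizes 128 and 64, chosen because $128^2 - 64^2 = 12{,}288$ matches the stated count; `count_background_filters` is a hypothetical helper, and the strided variant is only one simple way to model the first option.

```python
import numpy as np

def count_background_filters(feat_size, hole_size, stride=1):
    """Number of background patch centres when sampling the feature map with a given stride."""
    hole = np.zeros((feat_size, feat_size), dtype=bool)
    start = (feat_size - hole_size) // 2          # centred hole (assumption)
    hole[start:start + hole_size, start:start + hole_size] = True
    return int((~hole[::stride, ::stride]).sum())

print(count_background_filters(128, 64))            # stride 1
print(count_background_filters(128, 64, stride=2))  # stride 2 keeps a quarter of the filters
```

Under these assumptions, sampling with stride 2 reduces the number of filters by a factor of four.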
To integrate the attention module, we introduce two parallel encoders as shown in Figure 4 based on Figure 2. The bottom encoder specifically focuses on hallucinating contents with layer-by-layer (dilated) convolution, while the top one tries to attend on background features of interest. Output features from the two encoders are aggregated and fed into a single decoder to obtain the final output. To interpret contextual attention, we visualize it in the way shown in Figure 4. We use color to indicate the relative location of the most attended background patch for each foreground pixel. For example, white (center of color coding map) means the pixel attends on itself, pink on bottom-left, green on top-right. The offset value is scaled differently for different images to best visualize the most interesting range.
For training, given a raw image $x$, we sample a binary image mask $m$ at a random location. Input image $z$ is corrupted from the raw image as $z = x \odot m$. Inpainting network $G$ takes concatenation of $z$ and $m$ as input, and outputs predicted image $x' = G(z, m)$ with the same size as input. Pasting the masked region of $x'$ to the input image, we get the inpainting output $\tilde{x} = z + x' \odot (1 - m)$. Image values of input and output are linearly scaled to $[-1, 1]$ in all experiments. Training procedure is shown in Algorithm 1.
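The data preparation and compositing steps can be sketched as follows, assuming a mask that is 0 inside the hole and 1 elsewhere, consistent with the mask convention used for the gradient penalty. All function names here are hypothetical, and the generator itself is not modelled: `x_pred` stands in for its output.

```python
import numpy as np

def to_unit_range(img_uint8):
    """Linearly scale 8-bit pixel values to [-1, 1]."""
    return img_uint8.astype(np.float64) / 127.5 - 1.0

def random_rect_mask(height, width, hole_h, hole_w, rng):
    """Binary mask of shape (H, W, 1): 0 inside a randomly placed rectangle, 1 elsewhere."""
    m = np.ones((height, width, 1))
    top = rng.integers(0, height - hole_h + 1)
    left = rng.integers(0, width - hole_w + 1)
    m[top:top + hole_h, left:left + hole_w] = 0.0
    return m

def corrupt(x, m):
    """Corrupted input: known pixels kept, hole pixels zeroed."""
    return x * m

def composite(z, x_pred, m):
    """Paste the predicted hole content into the corrupted input."""
    return z + x_pred * (1.0 - m)
```

The composite keeps every known pixel of the raw image untouched, so only the hole region depends on the network prediction.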
Qualitative comparisons First, we show in Figure 5 that our baseline model generates inpainting results comparable with the previous state-of-the-art by comparing our output with the result copied from their main paper. Note that no post-processing step is performed for our baseline model, while image blending is applied in their result.
Next we use the most challenging Places2 dataset to evaluate our full model with contextual attention by comparing to our baseline two-stage model, which is extended from the previous state-of-the-art. For training, we use images of resolution $256 \times 256$ with largest hole size $128 \times 128$ described in Section 4.2. Both methods are based on fully-convolutional neural networks and thus can fill in multiple holes on images of different resolutions. Visual comparisons on a variety of complex scenes from the validation set are shown in Figure 6. Those test images are all of size $512 \times 680$ for consistency of testing. All the results reported are direct outputs from the trained models without using any post-processing. For each example, we also visualize the latent attention map for our model in the last column (color coding is explained in Section 4.2).
As shown in the figure, our full model with contextual attention can leverage the surrounding textures and structures and consequently generates more realistic results with far fewer artifacts than the baseline model. Visualizations of attention maps reveal that our method is aware of contextual image structures and can adaptively borrow information from surrounding areas to help the synthesis and generation.
In Figure 7, we also show some example results and attention maps of our full model trained on CelebA, DTD and ImageNet. Due to space limitation, we include more results for these datasets in the supplementary material.
Like other image generation tasks, image inpainting lacks good quantitative evaluation metrics. The Inception score introduced for evaluating GAN models is not a good metric for evaluating image inpainting methods, as inpainting mostly focuses on background filling (e.g. the object removal case), not on the ability to generate a variety of object classes.
Evaluation metrics in terms of reconstruction errors are also not perfect as there are many possible solutions different from the original image content. Nevertheless, we report our evaluation in terms of mean $\ell_1$ error, mean $\ell_2$ error, peak signal-to-noise ratio (PSNR) and total variation (TV) loss on the validation set of Places2 just for reference in Table 1. As shown in the table, learning-based methods perform better in terms of $\ell_1$, $\ell_2$ errors and PSNR, while methods directly copying raw image patches have lower total variation loss.
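For reference, the four metrics named above can be computed along the following lines. The exact definitions and normalizations used for Table 1 are not given in the text, so this sketch makes its own assumptions: images are float arrays of shape `(H, W, C)` with values in $[0, 1]$, the $\ell_2$ error is the mean squared error, and TV is the mean absolute difference between neighbouring pixels.

```python
import numpy as np

def mean_l1(a, b):
    """Mean absolute error per pixel value."""
    return float(np.mean(np.abs(a - b)))

def mean_l2(a, b):
    """Mean squared error per pixel value."""
    return float(np.mean((a - b) ** 2))

def psnr(a, b, peak=1.0):
    """Peak signal-to-noise ratio in dB for images with maximum value `peak`."""
    mse = np.mean((a - b) ** 2)
    return float("inf") if mse == 0 else float(10.0 * np.log10(peak ** 2 / mse))

def tv_loss(img):
    """Total variation: mean absolute vertical plus horizontal neighbour difference."""
    dy = np.abs(img[1:, :, :] - img[:-1, :, :]).mean()
    dx = np.abs(img[:, 1:, :] - img[:, :-1, :]).mean()
    return float(dx + dy)
```

TV is computed on the output alone and measures smoothness, which is consistent with the observation that patch-copying methods can score well on it without matching the ground truth.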
Our full model has a total of 2.9M parameters, which is roughly half of the previous state-of-the-art model. Models are implemented on TensorFlow v1.3, CUDNN v6.0, CUDA v8.0, and run on hardware with CPU Intel(R) Xeon(R) CPU E5-2697 v3 (2.60GHz) and GPU GTX 1080 Ti. Our full model runs at 0.2 seconds per frame on GPU and 1.5 seconds per frame on CPU for images of resolution $512 \times 512$ on average.
Contextual attention vs. spatial transformer network and appearance flow We investigate the effectiveness of contextual attention compared to other spatial attention modules including appearance flow and spatial transformer network for image inpainting. For appearance flow, we train on the same framework except that the contextual attention layer is replaced with a convolution layer to directly predict 2-D pixel offsets as attention. As shown in Figure 8, for a very different test image pair, appearance flow returns very similar attention maps, meaning that the network may be stuck in a bad local minimum. To improve results of appearance flow, we also investigated ideas of multiple attention aggregation and patch-based attention. None of these ideas work well enough to improve the inpainting results. Also, we show the results with the spatial transformer network as attention in our framework in Figure 8. As shown in the figure, STN-based attention does not work well for inpainting as its global affine transformation is too coarse.
Choice of the GAN loss for image inpainting Our inpainting framework benefits greatly from the WGAN-GP loss, as validated by its learning curves and faster, more stable convergence. The same model trained with DCGAN sometimes collapses to limited modes for the inpainting task, as shown in Figure 9. We also experimented with LSGAN, and the results were worse.
Essential reconstruction loss We also tested whether we could drop the reconstruction loss and rely purely on the adversarial loss (i.e. improved WGANs) to generate good results. To draw a conclusion, we train our inpainting model without reconstruction loss in the refinement network. Our conclusion is that the pixel-wise reconstruction loss, although it tends to make the result blurry, is an essential ingredient for image inpainting. The reconstruction loss is helpful in capturing content structures and serves as a powerful regularization term for training GANs.
Perceptual loss, style loss and total variation loss We have not found perceptual loss (reconstruction loss on VGG features), style loss (squared Frobenius norm of Gram matrix computed on the VGG features) and total variation (TV) loss to bring noticeable improvements for image inpainting in our framework, thus they are not used.
We proposed a coarse-to-fine generative image inpainting framework and introduced our baseline model as well as full model with a novel contextual attention module. We showed that the contextual attention module significantly improves image inpainting results by learning feature representations for explicitly matching and attending to relevant background patches. As future work, we plan to extend the method to very high-resolution inpainting applications using ideas similar to progressive growing of GANs. The proposed inpainting framework and contextual attention module can also be applied to conditional image generation, image editing and computational photography tasks including image-based rendering, image super-resolution, guided editing and many others.
Filling-in by joint interpolation of vector fields and gray levels. IEEE Transactions on Image Processing, 10(8):1200–1211, 2001.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3606–3613, 2014.
International Conference on Machine Learning, pages 448–456, 2015.
Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
CelebA-HQ We show results from our full model trained on CelebA-HQ dataset in Figure 10. Note that the original image resolution of CelebA-HQ dataset is $1024 \times 1024$. We resize images to $256 \times 256$ for both training and evaluation.
CelebA We show more results from our full model trained on CelebA dataset in Figure 11. Note that the original image resolution of CelebA dataset is $218 \times 178$. We resize images to $315 \times 256$ and do a random crop of size $256 \times 256$ to make face landmarks roughly unaligned for both training and evaluation.
In addition to attention map visualization, we visualize which parts of the input image are being attended to for pixels in holes. To do so, we highlight the regions that have the maximum attention score and overlay them on the input image. As shown in Figure 16, the visualization results given holes in different locations demonstrate the effectiveness of our proposed contextual attention in borrowing information from distant spatial locations.
In addition to Section 3, we report more details of our network architectures. For simplicity, we denote them with K (kernel size), D (dilation), S (stride size) and C (channel number).
Inpainting network The inpainting network consists of two encoder-decoder networks stacked together, each with the following architecture:
K5S1C32 - K3S2C64 - K3S1C64 - K3S2C128 - K3S1C128 - K3S1C128 - K3D2S1C128 - K3D4S1C128 - K3D8S1C128 - K3D16S1C128 - K3S1C128 - K3S1C128 - resize ($2\times$) - K3S1C64 - K3S1C64 - resize ($2\times$) - K3S1C32 - K3S1C16 - K3S1C3 - clip.
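The compact layer notation (K for kernel size, D for dilation, S for stride, C for channels) is easy to parse mechanically. The sketch below is an illustrative helper, not part of the released code: `parse_layer` and `total_stride` are hypothetical names, dilation is assumed to default to 1 when the D field is absent, and non-convolution tokens such as resize and clip are skipped.

```python
import re

# One convolution token, e.g. "K3D2S1C128" or "K5S1C32" (dilation optional).
LAYER = re.compile(r"K(\d+)(?:D(\d+))?S(\d+)C(\d+)")

def parse_layer(token):
    """Parse one token into its kernel size, dilation, stride and channel count."""
    match = LAYER.fullmatch(token.strip())
    if match is None:
        raise ValueError(f"not a convolution layer token: {token!r}")
    k, d, s, c = match.groups()
    return {"kernel": int(k), "dilation": int(d) if d else 1,
            "stride": int(s), "channels": int(c)}

def total_stride(spec):
    """Product of convolution strides in a ' - ' separated spec; other tokens are ignored."""
    stride = 1
    for token in spec.split(" - "):
        if LAYER.fullmatch(token.strip()):
            stride *= parse_layer(token)["stride"]
    return stride
```

Applied to the encoder part of the specification above, the two stride-2 convolutions give an overall downsampling factor of 4, which the two resize steps in the decoder then undo.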
Local WGAN-GP critic We use Leaky ReLU with $\alpha = 0.2$ as activation function for WGAN-GP critics.
K5S2C64 - K5S2C128 - K5S2C256 - K5S2C512 - fully-connected to 1.
Global WGAN-GP critic K5S2C64 - K5S2C128 - K5S2C256 - K5S2C256 - fully-connected to 1.
Contextual attention branch K5S1C32 - K3S2C64 - K3S1C64 - K3S2C128 - K3S1C128 - K3S1C128 - contextual attention layer - K3S1C128 - K3S1C128 - concat.