Image inpainting aims to fill corrupted regions or replace unwanted regions of an image with plausible, finely detailed content. It is widely applied in tasks such as restoring damaged photographs and retouching pictures.
Existing inpainting approaches can be roughly divided into two groups: conventional and deep learning-based approaches. Conventional inpainting approaches usually make use of low-level features (e.g. color and texture descriptors) hand-crafted from the incomplete input image and resort to priors (e.g. smoothness and image statistics) or auxiliary data (e.g. external image databases). They either propagate low-level features from the surroundings to the missing regions following a diffusive process [2, 11, 16] or fill holes by searching and fusing similar patches from the same image or external image databases [6, 1, 15, 7]. Without a high-level understanding of the image contents and structures, conventional approaches usually struggle to generate semantically meaningful content, especially when a large portion of an image is missing or corrupted.
Deep learning-based approaches can understand the image content by automatically capturing intrinsic hierarchical representations, and they generate high-level semantic features to synthesize the missing contents. They generally outperform the conventional methods in the inpainting task. Context Encoder [CE] is the first attempt to exploit a deep convolutional encoder-decoder trained with an adversarial strategy for image inpainting. The method produces semantically reasonable contents, but the results often lack fine-detailed textures and contain visible artifacts. To achieve more pleasing results, [GLI, NPS, CA, Patch-Swap, Shift-Net] and [NIPS] extend Context Encoder in different ways, such as in their architectures and learning strategies.
Recently, [EG] propose to utilize explicit image structure knowledge for inpainting. They develop a two-stage model which comprises an edge generator followed by an image generator. The edge generator is trained to hallucinate the possible edge sketches of the missing regions. The image generator then takes the generated sketches as a structure prior or precondition to produce the final results. [Contour] propose a similar model but use a contour generator instead of an edge generator, which is more applicable when the corrupted image contains salient objects. By introducing the structure information, both methods generate more visually plausible inpainting results.
The success of the above two-stage models suggests that structure knowledge such as edges and contours plays an important role in generating reasonable and detailed contents for image inpainting. It also indicates that, without suitable guidance from structure knowledge in the learning process, previous deep learning-based approaches may struggle to understand the plausible semantic structures of the corrupted images. However, the two-stage strategy may suffer from several limitations: 1) it requires many more parameters because it uses two generators; 2) because of its series-coupled architecture, it is easily subject to the adverse effects of unreasonable structure preconditions at inference time; 3) without explicit structure guidance in the form of a loss function during the learning process, it may not sufficiently incorporate the structure information, which may be weakened or forgotten due to the sparsity of the structures and the depth of the network.
Based on these insights, we propose to use a multi-task framework to better incorporate structure knowledge for image inpainting. Instead of explicitly modeling the structure preconditions, we utilize a shared generator to simultaneously generate the completed image and the corresponding structures, thus supervising the generator to incorporate relevant structure knowledge for inpainting. This is reasonable because both tasks require a high-level understanding of the image content and share the same semantics. Besides, [EG] and [Contour] have demonstrated that structure priors are beneficial to image completion; conversely, it is easier to figure out the complete structures from a relatively intact image than from a corrupted one.
In addition, to further incorporate the structure information, we introduce a structure embedding scheme which explicitly feeds the learned structure features into the inpainting process, where they serve as preconditions for image completion. Moreover, an attention mechanism is developed to exploit the recurrent structures or patterns in the image to refine the generated structures and contents. We also propose a novel pyramid structure loss to supervise the learning of the structure knowledge. We summarize the main contributions as follows:
We propose a multi-task learning framework to incorporate the image structure knowledge to assist image inpainting.
We introduce a structure embedding scheme which can explicitly provide structure preconditions for image completion, and an attention mechanism to exploit the similar patterns in the image to refine the generated structures and contents.
We propose a novel pyramid structure loss specifically for structure learning and embedding. Extensive experiments have been conducted to evaluate the performance of our approach.
2 Related Work
Numerous image inpainting approaches have been proposed; here, we focus on reviewing the representative deep learning-based methods.
Context Encoder [CE] is one of the first deep learning-based methods for image inpainting. It adopts an encoder-decoder architecture and is trained with an adversarial learning strategy. By leveraging a convolutional encoder-decoder and a Generative Adversarial Network [5], it is able to develop semantic features and synthesize visually pleasing contents even when the missing regions are quite large. However, the inpainting results often lack fine-detailed textures because the information bottleneck layer of the encoder-decoder may discard some features needed for image details. Besides, the approach tends to create artifacts around the border of the missing region because local consistency is not taken into consideration.
[GLI] address the information bottleneck defect by replacing the bottleneck layer with a series of dilated convolution layers and reducing the number of downsampling steps. For local continuity, a local discriminator is designed to enforce that the locally filled content is both visually plausible and consistent with the surroundings. Although the method can plausibly fill missing regions, it still relies on Poisson blending [14] to tackle the color inconsistency between the completed region and its surroundings. [NPS], in a different way, enhance Context Encoder by proposing a multi-scale neural patch synthesis approach. The approach first takes the output of the network as initialization and then leverages style transfer techniques [4] to propagate the high-frequency textures from the surroundings to the missing region by iteratively solving a multi-scale optimization problem. The approach works well for high-resolution semantic inpainting.
[CA] propose a two-stage coarse-to-fine architecture to generate and refine the inpainting results, where the coarse network makes an initial estimation and the refinement network takes this initialization to produce finer results. Besides, at the refinement stage, a novel module termed Contextual Attention is designed to explicitly borrow information from the surroundings of the missing regions. [Patch-Swap] develop a similar coarse-to-fine method and introduce a Patch-Swap module which can heuristically propagate the textures from the surroundings to the holes. The coarse-to-fine architecture does help to generate finer results; however, it builds upon the assumption that the coarse estimate at the first stage is reasonably accurate. Similar to the ideas of Contextual Attention and Patch-Swap, [Shift-Net] develop a shift-connect module by which the features of the known background at the encoding phase are directly shifted to fill the missing areas at the decoding phase. Instead of using an explicit module to propagate information from the surroundings to the missing regions, [NIPS] introduce an implicit diversified Markov random fields (ID-MRF) loss which implicitly constrains the network to propagate relevant information to the target inpainting areas. To leverage features at both the image level and the feature level, [PEN-Net] propose a pyramid-context encoder network and an attention transfer mechanism, which progressively fill the missing regions from high-level to low-level feature maps while ensuring semantic consistency.
To generalize well to inpainting tasks with irregular missing regions, [Partial] propose partial convolutions. Unlike vanilla convolution, partial convolution only utilizes valid information to infer the missing contents through an automatic mask updating mechanism, which is effective in cases of arbitrary missing regions. [CA2] further generalize the partial convolution and propose a gated convolution with a learnable mask updating mechanism, which achieves competitive or better inpainting quality. Besides, users are able to interact with the inpainting network through hand-drawn sketches to produce user-guided inpainting results.
Recently, several approaches explicitly introduce image structure priors (e.g. edges and contours) for inpainting and produce more impressive results. [EG] propose a model termed EdgeConnect which consists of an edge generator followed by an image generator. The edge generator is utilized to estimate the possible edges of the missing region, which are then fed as precondition information into the subsequent image completion process. [Contour] develop a similar model which uses a contour generator instead of the edge generator. Since the approach predicts contours for salient objects, it is more applicable when the corrupted image contains salient objects.
Our multi-task framework is shown in Figure 2. It uses a shared generator to simultaneously generate the complete image and the corresponding structures at different scales, where the structure generation works as an auxiliary task providing possible structure cues for the image completion task.
Here, we mainly use the edge structures, which describe the profiles of the contents of the image, to represent the image structure. Instead of directly figuring out the possible edges, we first predict the whole gradient map, which inherently contains the edge information, and then introduce an implicit regularization scheme in the proposed pyramid structure loss to learn the edge structures. Generating the gradient map is preferable in our multi-task setting. On the one hand, since the edge structure of an image is usually sparse and only conveys binary sketch information, generating such an edge structure shares few features with the task of image generation during the last several phases of the generation process, so task-specific network layers for edge generation would have to be designed. On the other hand, the gradient map itself not only conveys the possible edge information but also represents the texture information or high-frequency details, which is important for detailed texture synthesis [grad1, grad2].
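Formally, let $I_{gt}$ be the ground truth image, and let $G_{gt}$ and $E_{gt}$ denote its gradient map and edge map respectively. Here, we use the Sobel filters shown in Figure 3 to extract the gradient map, and the Canny detector to acquire the edge map.
The generator takes the masked image $I_{in} = I_{gt} \odot (1 - M)$ as the input, and the corresponding gradient map $G_{in} = G_{gt} \odot (1 - M)$ and edge map $E_{in} = E_{gt} \odot (1 - M)$, together with the image mask $M$ (with value 0 for the known region and 1 otherwise), as preconditions. Here, $\odot$ denotes the Hadamard product. The generator jointly generates the image content and estimates its gradient map at different scales:
$$I_{out},\ \{G^{s}_{out}\}_{s=1}^{S} = \mathcal{G}(I_{in}, G_{in}, E_{in}, M),$$
where $\mathcal{G}$ represents our generator, $I_{out}$ the generated image, and $G^{s}_{out}$ denotes the predicted gradient map at scale $s$. The final completed image and gradient map are $I_{in} + I_{out} \odot M$ and $G^{s}_{in} + G^{s}_{out} \odot M^{s}$, where $G^{s}_{in}$ is the incomplete gradient map at scale $s$ and $M^{s}$ is the mask resized to that scale. The number of scales depends on the specific architecture of the generator.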
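The input preparation described above can be illustrated with a minimal NumPy sketch. It assumes the standard 3x3 Sobel kernels (the exact filters of Figure 3 are not reproduced here), a single-channel image for the gradient computation, and a mask that is 1 inside the hole. The function names `sobel_gradient`, `mask_inputs` and `compose` are hypothetical, and the Canny edge map is taken as given rather than computed.

```python
import numpy as np

# Standard 3x3 Sobel kernels (an assumption; Figure 3 is not reproduced here).
SOBEL_X = np.array([[-1.0, 0.0, 1.0],
                    [-2.0, 0.0, 2.0],
                    [-1.0, 0.0, 1.0]])
SOBEL_Y = SOBEL_X.T


def filter3x3(image, kernel):
    """Cross-correlate a 2-D array with a 3x3 kernel, using edge padding."""
    h, w = image.shape
    padded = np.pad(image.astype(np.float64), 1, mode="edge")
    out = np.zeros((h, w), dtype=np.float64)
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out


def sobel_gradient(gray):
    """Return an (H, W, 2) gradient map: horizontal and vertical responses."""
    return np.stack([filter3x3(gray, SOBEL_X), filter3x3(gray, SOBEL_Y)], axis=-1)


def mask_inputs(image, gradient, edge, mask):
    """Zero out the missing region (mask == 1) of each generator input."""
    keep = 1.0 - mask
    return image * keep[..., None], gradient * keep[..., None], edge * keep


def compose(known, generated, mask):
    """Keep known pixels and take generated values only inside the hole."""
    return known + generated * mask[..., None]
```

With this composition, pixels outside the hole are copied from the input unchanged, so only the masked region depends on the generator output.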
We take the architecture proposed by [EG] as the backbone of our generator, which has achieved impressive results for image inpainting. As Figure 2 shows, for image generation, the generator consists of a spatial context encoder which down-samples twice, followed by eight residual blocks and a decoder which up-samples twice to generate images of the original size. For structure generation, the encoder is shared and the decoder is adapted to a multi-scale style to embed and output the structures at different scales. In addition, two modules are developed to make use of the structure information:
Structure Embedding Layer
We use the structure embedding layers to embed the structure features into the decoding phase at different scales, where they serve as priors for image generation. Each layer first branches off from the image generation branch to learn the specific structural features and predict the possible structures, then merges the learned features back through a concatenation operation. This parallel, sibling-style scheme not only provides the structure priors for image generation but also avoids the adverse effects of improper preconditions, since the decoder can learn whether or not to exploit the structure priors. Specifically, we implement the layer with a standard residual block [8].
Attention Layer
Our attention operation is inspired by the non-local mean mechanism, which has been used for denoising [3] and super-resolution [10]. It calculates the response at a position of the output feature map as a weighted sum of the features in the whole input feature map, where the weight, or attention score, is measured by feature similarity. Through attention, similar features from the surroundings can be transferred to the missing regions to refine the generated contents and structures (e.g. smoothing the artifacts and enhancing the details).
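Given an input feature map, we first extract the feature patches and calculate the cosine similarity of each pair of patches:
$$\tilde{s}_{i,j} = \left\langle \frac{p_i}{\|p_i\|_2}, \frac{p_j}{\|p_j\|_2} \right\rangle,$$
where $p_i$ and $p_j$ are the $i$-th and $j$-th patch of the input feature map respectively. Then softmax operations are applied to compute the attention scores:
$$s_{i,j} = \frac{\exp(\tilde{s}_{i,j})}{\sum_{k=1}^{N} \exp(\tilde{s}_{i,k})}.$$
Supposing a total of $N$ patches are extracted, the response of a position in the output feature map is calculated as the weighted sum of the patch features:
$$\tilde{p}_i = \sum_{j=1}^{N} s_{i,j}\, p_j.$$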
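In particular, as shown in Figure 4, we formulate all the operations in convolutional form and wrap them in a residual block, which can thus be seamlessly embedded into our architecture:
$$Y = X + \gamma \cdot A(X),$$
where $X$ is the input feature map, $A(X)$ is the attention response described above, $Y$ is the residual output, and $\gamma$ is a learnable scale parameter.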
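As a rough illustration of the attention operation, the following NumPy sketch computes cosine similarities between flattened patches, normalizes them with a softmax, and adds the scaled response back to the input. It is a plain matrix version, not the convolutional formulation of Figure 4. The function names, the `(N, D)` patch layout, and the choice to normalize each output position over all source patches are assumptions made for this example.

```python
import numpy as np


def patch_attention(patches):
    """Attend over flattened patches.

    patches: (N, D) array, one row per feature patch.
    Returns the (N, D) attended patches and the (N, N) attention scores.
    """
    norms = np.linalg.norm(patches, axis=1, keepdims=True)
    unit = patches / np.maximum(norms, 1e-8)        # guard against zero patches
    sim = unit @ unit.T                             # pairwise cosine similarity
    sim = sim - sim.max(axis=1, keepdims=True)      # stabilize the exponentials
    weights = np.exp(sim)
    weights /= weights.sum(axis=1, keepdims=True)   # softmax over source patches
    return weights @ patches, weights


def residual_attention(patches, gamma):
    """Residual form: input plus the attention response scaled by gamma."""
    attended, _ = patch_attention(patches)
    return patches + gamma * attended
```

Setting `gamma` to zero makes the block an identity mapping, which is one common reason for wrapping attention in a residual form with a learnable scale.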
3.2 Loss Functions
Our generator is expected to achieve two goals: figuring out the structure cues and completing the corrupted image. We introduce a pyramid structure loss to capture the structure knowledge and a hybrid image loss to supervise image inpainting.
Pyramid Structure Loss
We propose a pyramid structure loss to guide the structure generation and embedding, thus incorporating the structure information into the generation process. Specifically, it consists of two terms at a specific scale $s$. One is the $\ell_1$ distance between the predicted gradient map and the corresponding ground truth; the other is a regularization term for learning the edge structure:
$$\mathcal{L}_{pyramid} = \sum_{s=1}^{S} \left( \left\| G^{s}_{out} - G^{s}_{gt} \right\|_1 + \alpha\, \mathcal{L}^{s}_{edge} \right),$$
where $G^{s}_{out}$ and $G^{s}_{gt}$ are the predicted and ground truth gradient maps at scale $s$, $\mathcal{L}^{s}_{edge}$ denotes the regularization term, $\alpha$ the corresponding coefficient and $S$ the number of total scales. To implement the regularization on the edge structure, we first use a Gaussian filter $f_g$ to convolve the binary ground truth edge map $E^{s}_{gt}$ to create a weighted edge mask:
$$M^{s}_{E} = f_g(E^{s}_{gt}).$$
Then, we compute the edge regularization loss as:
$$\mathcal{L}^{s}_{edge} = \left\| (G^{s}_{out} - G^{s}_{gt}) \odot M^{s}_{E} \right\|_1,$$
where $\odot$ is the element-wise product and the weighted edge mask is used to extract the edge information from the gradient map. Using such an edge mask not only considers the positions of the binary edges but also exerts constraints on their nearby locations, thereby highlighting and intensifying the edge structure. In our implementation, a Gaussian filter with size is used.
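The structure loss can be sketched in NumPy as follows. This is a minimal illustration under several assumptions: an absolute-difference distance averaged over all entries, a 3x3 Gaussian kernel with unit standard deviation (the filter size used in the paper is not recoverable here), zero padding at the borders, and hypothetical function names. Maps must be at least as large as the kernel along each axis.

```python
import numpy as np


def gaussian_kernel(size, sigma):
    """1-D Gaussian kernel normalized to sum to one."""
    ax = np.arange(size) - (size - 1) / 2.0
    kernel = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()


def blur(edge_map, size=3, sigma=1.0):
    """Separable Gaussian blur (zero padding) of a 2-D edge map."""
    k = gaussian_kernel(size, sigma)
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, edge_map)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, rows)


def structure_loss_at_scale(pred_grad, gt_grad, gt_edge, alpha):
    """Gradient distance plus edge-weighted regularization at one scale."""
    diff = np.abs(pred_grad - gt_grad)                 # (H, W, 2)
    edge_mask = blur(gt_edge.astype(np.float64))       # soft weights near edges
    return diff.mean() + alpha * (diff * edge_mask[..., None]).mean()


def pyramid_structure_loss(preds, gts, edges, alpha):
    """Sum the per-scale losses over all scales of the pyramid."""
    return sum(structure_loss_at_scale(p, g, e, alpha)
               for p, g, e in zip(preds, gts, edges))
```

Because the blurred mask is non-zero around each edge pixel, gradient errors next to an edge are penalized as well as errors exactly on it.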
Hybrid Image Loss
We adopt a hybrid loss similar to that in [13] for image completion, which consists of a pixel-wise reconstruction loss, a perceptual loss, a style loss and an adversarial loss, detailed as follows.
The reconstruction loss is measured by the distance between the generated image and corresponding ground truth at pixel level:
The perceptual loss computes the distance between the generated image and its ground truth in feature space, after feeding both to the VGG-19 network [19] pre-trained on the ImageNet dataset [17].
where is the feature map of the ’th selected layer from VGG-19. Here, layers , , , and are used.
The style loss also compares images in feature space, but first computes the Gram matrix [4] of each selected feature map:
where is a Gram matrix constructed from feature maps of size .
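To make the Gram-matrix comparison concrete, here is a small NumPy sketch. It assumes channel-last `(H, W, C)` feature maps, a normalization by `H * W * C`, and a mean absolute difference between Gram matrices. These choices and the function names are assumptions for illustration; in practice the feature maps would come from the selected VGG-19 layers rather than from arbitrary arrays.

```python
import numpy as np


def gram_matrix(features):
    """Normalized Gram matrix of an (H, W, C) feature map, shape (C, C)."""
    h, w, c = features.shape
    flat = features.reshape(h * w, c)      # one row per spatial position
    return flat.T @ flat / (h * w * c)     # channel-by-channel correlations


def style_loss(feats_pred, feats_gt):
    """Sum of mean absolute Gram-matrix differences over the selected layers."""
    return sum(np.abs(gram_matrix(p) - gram_matrix(g)).mean()
               for p, g in zip(feats_pred, feats_gt))
```

The Gram matrix discards spatial layout and keeps only channel co-activation statistics, which is why this term is sensitive to texture rather than to exact pixel positions.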
In our framework, we also use an adversarial training strategy, which has become almost standard practice in image generation tasks. We take PatchGAN [25] as our discriminator and denote its adversarial loss as:
and the adversarial loss for our generator as:
Then, the hybrid image loss is defined as:
where the coefficients are hyperparameters which balance the contributions of the different loss terms.
Finally, the generator is optimized by minimizing the pyramid structure loss and the hybrid image loss:
where is a predefined weight to balance the two learning tasks. For our experiments, we choose hyperparameters of the hybrid image loss as in , and , .
In this section, we present our experimental comparisons with several state-of-the-art image inpainting approaches and ablation studies of the effectiveness of our multi-task framework. More results can be found in our supplementary material.
4.1 Experimental Settings
Datasets and Baselines
GL: proposed by [GLI], which uses two discriminators to ensure global and local consistency of the generated image.
CA: proposed by [CA], which leverages a coarse-to-fine architecture with a contextual attention layer to produce and refine the inpainting results.
PEN-Net: proposed by [PEN-Net], which adopts a pyramid context encoder to fill missing regions with features at both the image level and the feature level.
EG: proposed by [EG], which leverages the edge structure preconditions for inpainting with a series-coupled architecture.
We utilize the available pre-trained models of the baseline approaches and reimplement PEN-Net [23], as there is no publicly available code yet.
For experiments, we resize images to and use both regular and irregular image masks for training and testing. For fair comparisons, we use regular masks (with a size of ) following the common experimental settings of the baselines, and irregular masks as in baseline [13]. We generate gradient maps with Sobel filters and edge maps with Canny detectors as in [13]. To compute the pyramid structure loss, we scale these maps to the corresponding resolutions with nearest-neighbor interpolation. We implement our model in TensorFlow using a single NVIDIA GeForce GTX 1080 Ti, and the code will be made publicly available.
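The nearest-neighbor scaling of the ground truth maps can be sketched with NumPy indexing. The function name `resize_nearest` and the floor-based index mapping are assumptions for this example; image libraries differ in their pixel-center conventions, so their outputs may not match this sketch exactly.

```python
import numpy as np


def resize_nearest(array, out_h, out_w):
    """Nearest-neighbor resize of an (H, W) or (H, W, C) array."""
    h, w = array.shape[:2]
    # Map each output index to a source index by flooring the scaled position.
    rows = np.minimum((np.arange(out_h) * h) // out_h, h - 1)
    cols = np.minimum((np.arange(out_w) * w) // out_w, w - 1)
    return array[rows[:, None], cols[None, :]]
```

Nearest-neighbor interpolation only copies existing values, so a binary edge map stays binary at every scale of the pyramid.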
4.2 Qualitative Evaluation
As shown in Figure 5, our approach is able to generate visually realistic images with sharp edges and fine-detailed textures in both regular and irregular mask settings. Besides, when tested on regularly masked images as shown in Figure 6 and Figure 7, our approach shows an obvious visual improvement over the baselines in image structures, such as sharp facial contours, crisp eyes and ears, and reasonable object boundaries. Compared with CA, GL and PEN-Net, where little image structure information is explicitly considered, our approach and EG, which incorporate the edge structure knowledge, are more likely to generate plausible image contents. Moreover, as shown in Figure 1, Figure 6 and Figure 7, compared with EG, which uses a series-coupled architecture to exploit structure knowledge, our multi-task architecture exhibits superior performance with more visually plausible structures and detailed contents.
4.3 Quantitative Evaluation
We use the $\ell_1$ loss, peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), universal quality index (UQI) [22], visual information fidelity (VIF) [18] and Fréchet Inception Distance (FID) [9] as our evaluation metrics. Specifically, we utilize the $\ell_1$ loss and PSNR to measure the similarity between two images at the pixel level, SSIM and UQI to assess the distortions of the generated image content relative to the ground truth, and VIF and FID to evaluate the overall visual quality. Among these, VIF correlates well with human perception and FID is a commonly used metric for image generation. The metrics are calculated over ten thousand random images in the test sets.
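The two pixel-level metrics are simple enough to sketch directly in NumPy. The function names and the assumed 8-bit value range (`max_value=255`) are choices made for this example; the paper's exact normalization of the pixel-level error is not stated here.

```python
import numpy as np


def mean_l1(a, b):
    """Mean absolute pixel difference between two images."""
    return float(np.mean(np.abs(a.astype(np.float64) - b.astype(np.float64))))


def psnr(a, b, max_value=255.0):
    """Peak signal-to-noise ratio in decibels; infinite for identical images."""
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / mse))
```

A dataset-level score would then be the average of these per-image values over the sampled test images.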
As shown in Table 1, our approach achieves superior performance against all the baselines on the CelebA and Places2 datasets. The results can be explained by the fact that the baseline approaches either ignore the structure knowledge of the image or do not make good use of it. Besides, under the scenario with irregular masks, although models such as CA, GL, and PEN-Net can deal with irregular holes (e.g. by filling irregular holes with multiple regular patches), they usually show inferior performance since they are not specifically trained on irregular masks.
4.4 Ablation Study
We analyze how the proposed components of our framework contribute to the final performance of image inpainting. We take the image generator in [13] as the baseline, then gradually add our multi-task learning strategy (MT), structure embedding (SE) and attention mechanism (AT) until the whole proposed model is established. Correspondingly, we evaluate the model with the gradually added components quantitatively and qualitatively over one thousand random images in the test sets with regular masks.
As Table 2 shows, the performance of our model on the metrics is gradually improved or retained relative to the baseline as each component is progressively integrated. Specifically, VIF and FID are improved by a large margin, which indicates that the visual quality of the completed images is improved substantially. In the qualitative comparisons shown in Figure 8, when a shared generator simultaneously completes the image and the corresponding structures, instead of performing only the image completion task as in the baseline, our model generates more pleasing image structures (e.g. sharp facial and mouth contours). This suggests that the proposed multi-task strategy shows great potential for incorporating the structure knowledge into the inpainting process. Besides, with the explicit embedding of the structure features, the inpainting results are further enhanced (e.g. sharper contours and textures). Moreover, with the attention mechanism embedded, the results are finally polished by the similar structures and patterns in the images.
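| Components | $\ell_1$ loss | PSNR | SSIM | UQI | VIF | FID |
|---|---|---|---|---|---|---|
| MT, SE, AT | 3.78 | 27.01 | 0.911 | 0.979 | 0.848 | 11.98 |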
We have presented a framework for incorporating image structure knowledge into image inpainting. We propose to utilize a multi-task learning strategy and explicit structure embedding, together with an attention mechanism, to make use of the image structure knowledge for inpainting. The experimental results demonstrate that the proposed approach shows superior performance compared with several state-of-the-art inpainting methods which either ignore the structure knowledge or do not exploit it well. Besides, we have verified each proposed component for incorporating structure knowledge through ablation studies. In future work, we plan to investigate adapting the proposed multi-task framework to other inpainting architectures to leverage the structure knowledge.
This work is supported by grants from: National Natural Science Foundation of China (No.71932008, 91546201, and 71331005).
[1] (2009) PatchMatch: a randomized correspondence algorithm for structural image editing. In ACM transactions on graphics, Vol. 28, pp. 24–32.
[2] (2000) Image inpainting. In Proceedings of the 27th annual conference on computer graphics and interactive techniques, pp. 417–424.
[3] (2005) A non-local algorithm for image denoising. In , Vol. 2, pp. 60–65.
[4] Image style transfer using convolutional neural networks. In The IEEE conference on computer vision and pattern recognition, pp. 272–280.
[5] (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680.
[6] (2008) Scene completion using millions of photographs. Communications of the ACM 51 (10), pp. 87–94.
[7] (2012) Statistics of patch offsets for image completion. In European conference on computer vision, pp. 16–29.
[8] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778.
[9] (2017) GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in neural information processing systems, pp. 6626–6637.
[10] (2009) Super-resolution from a single image. In Proceedings of the IEEE international conference on computer vision, pp. 349–356.
[11] (2003) Learning how to inpaint from global image statistics. In Proceedings of the ninth IEEE international conference on computer vision, pp. 305–312.
[12] (2015) Deep learning face attributes in the wild. In Proceedings of the IEEE international conference on computer vision, pp. 3730–3738.
[13] (2019) EdgeConnect: generative image inpainting with adversarial edge learning. arXiv preprint arXiv:1901.00212.
[14] (2003) Poisson image editing. ACM transactions on graphics 22 (3), pp. 313–318.
[15] (2009) Shift-map image editing. In 2009 IEEE 12th international conference on computer vision, pp. 151–158.
[16] (2005) Fields of experts: a framework for learning image priors. In 2005 IEEE computer society conference on computer vision and pattern recognition, Vol. 2, pp. 860–867.
[17] (2015) ImageNet large scale visual recognition challenge. International journal of computer vision 115 (3), pp. 211–252.
[18] (2006) Image information and visual quality. IEEE transactions on image processing 15 (2), pp. 430–444.
[19] (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
[20] (2013) Spatial pattern templates for recognition of objects with regular structure. In German conference on pattern recognition, pp. 364–374.
[21] (2004) Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13 (4), pp. 600–612.
[22] (2002) A universal image quality index. IEEE signal processing letters 9 (3), pp. 81–84.
[23] (2019) Learning pyramid-context encoder network for high-quality image inpainting. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1486–1494.
[24] Places: a 10 million image database for scene recognition. IEEE transactions on pattern analysis and machine intelligence 40 (6), pp. 1452–1464.
[25] Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 2223–2232.