
Guidance and Evaluation: Semantic-Aware Image Inpainting for Mixed Scenes

Completing a corrupted image with correct structures and reasonable textures for a mixed scene remains an elusive challenge. Since the missing hole in a mixed scene often contains various semantic information, conventional two-stage approaches that utilize structural information often suffer from unreliable structural prediction and ambiguous texture generation. In this paper, we propose a Semantic Guidance and Evaluation Network (SGE-Net) to iteratively update the structural priors and the inpainted image in an interplay framework of semantics extraction and image inpainting. It utilizes a semantic segmentation map as guidance at each scale of inpainting, under which location-dependent inferences are re-evaluated and poorly-inferred regions are accordingly refined at subsequent scales. Extensive experiments on real-world images of mixed scenes demonstrate the superiority of the proposed method over state-of-the-art approaches in terms of clear boundaries and photo-realistic textures.





1 Introduction

Image inpainting refers to the task of filling the missing area in a scene with synthesized content, and has drawn great attention in the fields of computer vision and graphics [2, 1, 5, 10, 27]. Recent learning-based methods have achieved great success in filling large missing regions with plausible contents for simple scenes, such as human faces (CelebA-HQ faces dataset) [39, 36, 38], buildings (Paris StreetView dataset) [22, 33, 40], and landscape images (Places2 dataset) [31, 25, 37]. However, these existing methods still encounter difficulties when completing images of a mixed scene, which is composed of multiple objects with different semantics.

Existing learning-based image inpainting methods typically fill missing regions by inferring the context of corrupted images [22, 36, 12, 37]. However, in a mixed scene, the prior distributions of various semantics are different, and different semantic regions also contribute differently to pixels in the missing region. Thus, uniformly mapping different semantics onto a single manifold, as context-based methods do, often leads to unrealistic semantic content, as illustrated in Fig. 1(b).

Figure 1: Comparison of the inpainting results for a mixed scene: (b) GatedConv [37] without structural information, (c) EdgeConnect [20] with predicted edges, (d) SPG [26] with a less reliable predicted semantic segmentation, (e) segmentation-guided inpainting with an uncorrupted segmentation map, and (f) the proposed SGE-Net with iteratively optimized semantic segmentation. [Best viewed in color]

To address this issue, low- to mid-level structural information [15, 32, 26] was introduced to assist image inpainting. These methods extract and reconstruct the edges or contours in the first stage and complete an image with the predicted structural information in the second stage. The spatial separation imposed by the structures helps alleviate the blurry-boundary problem. These methods, however, ignore the modeling of semantic content, which may result in ambiguous textures at the semantic boundaries. Moreover, the performance of the two-stage inpainting process relies heavily on the structures reconstructed in the first stage, and the unreliability of the edge or contour connections increases greatly in a mixed scene (Fig. 1(c)). As revealed in [31], human beings perceive and reconstruct structures based on their semantic understanding of a corrupted image, so it is natural to involve semantic information in the process of image inpainting.

In this paper, we show how semantic segmentation can effectively assist image inpainting of a mixed scene based on two main discoveries: semantic guidance and segmentation confidence evaluation. Specifically, a semantic segmentation map carries pixel-wise semantic information, providing the layout of a scene as well as the category, location and shape of each object. It can assist the learning of different texture distributions of various semantic regions. Moreover, the intermediate confidence score derived from the segmentation process can offer a self-evaluation for an inpainted region, under the assumption that ambiguous semantic contents usually cannot lead to solid semantic segmentation results.

To the best of our knowledge, the closest work making use of semantic segmentation information for image inpainting is SPG-Net [26], which is also a two-stage process. It extracts and reconstructs a segmentation map, and then utilizes the map to guide image inpainting. Thanks to the helpful semantic information carried in the segmentation map, SPG-Net effectively improves inpainting performance compared to methods without a semantic segmentation map. Nevertheless, it is hard to predict reliable semantics for a region when its context information is largely missing, especially in a mixed scene. As a result, its performance can be significantly degraded by unreliable semantic region boundaries and labels predicted by the semantic segmentation. Such performance degradation is evidenced in Fig. 1(d), where SPG-Net generates blurry and incorrect inpainted textures. By contrast, segmentation-guided inpainting can achieve high-quality image completion provided that a reliable segmentation map (i.e., the segmentation map of the uncorrupted image) is given, as illustrated in Fig. 1(e). Therefore, to make the best use of the semantic information carried in the segmentation map, the key is to predict a reliable semantic segmentation map even when part of the image is corrupted.

To address the above problems, we advocate that the interplay between the two tasks, semantic segmentation and image inpainting, can effectively improve the reliability of the semantic segmentation map from a corrupted image, which will in turn improve the performance of inpainting as illustrated in Fig. 1(f). To this end, we propose a novel Semantic Guidance and Evaluation Network (SGE-Net) that makes use of the interplay between semantic segmentation and image inpainting in a coarse-to-fine manner. Experiments conducted on the datasets containing mixtures of multiple semantic regions demonstrated the effectiveness of our method in completing a corrupted mixed scene with significantly improved semantic contents.

Our contributions are summarized as follows:

1) We show that the interplay between semantic segmentation and image inpainting in a coarse-to-fine manner can effectively improve the performance of image inpainting by simultaneously generating an accurate semantic guidance from merely an input corrupted image.

2) We are the first to propose a self-evaluation mechanism for image inpainting through segmentation confidence scoring to effectively localize the predicted pixels with ambiguous semantic meanings, which enables the inpainting process to update both contexts and textures progressively.

3) Our model outperforms the state-of-the-art methods, especially on mixed scenes with multiple semantics, in the sense of generating semantically realistic contexts and visually pleasing textures.

2 Related Work

2.1 Deep Learning-Based Inpainting

Deep learning-based image inpainting approaches [22, 35, 14] are generally based on generative adversarial networks (GANs) [9, 24] to generate the pixels of a missing region. For instance, Pathak et al. introduced Context Encoders [22], which was among the first approaches of this kind. The model was trained to predict the context of a missing region but usually generates blurry results. Building on the Context Encoders model, several methods were proposed to better recover texture details through the use of well-designed loss functions [12, 7, 14], neural patch synthesis [34], residual learning [6], feature patch matching [33, 25, 38, 36], content and style disentanglement [31, 8], and others [30, 28, 19]. Semantic attention was further proposed to refine the textures in [17]. However, most of the above methods were designed for rectangular holes and cannot effectively handle large irregular holes. To fill irregular holes, Liu et al. [16] proposed a partial convolutional layer, which calculates a new feature map and updates the mask at each layer. Later, Yu et al. [37] proposed a gated convolutional layer based on the models in [36] for irregular image inpainting. While these methods work reasonably well for a single category of objects or background, they can easily fail if the missing region contains multiple categories of scenes.

2.2 Structural Information-Guided Inpainting

Recently, structural information was introduced into learning-based frameworks to assist the image inpainting process. These methods are mostly based on two-stage networks, where missing structures are reconstructed in the first stage and then used to guide the texture generation in the second stage. Edge maps were first introduced by Liao et al. [15] as a structural guide to the inpainting network. This idea was further improved by Nazeri et al. [20] and Li et al. [13] in terms of better edge prediction. Similar to edge information, object contours were used by Xiong et al. [32] to separately reconstruct the foreground and background areas. Ren et al. [23] proposed using smoothed images to carry additional image information beyond edges as prior information. Considering semantic information for the modeling of texture distributions, SPG-Net [26] predicts the semantic segmentation map of a missing region as a structural guide. The above-mentioned methods show that structure priors effectively help improve the quality of the final completed image. However, how to reconstruct correct structures remains challenging, especially when the missing region becomes complex.

3 Approach

As illustrated in Figs. 1(d)-(f), the success of semantic segmentation-guided inpainting depends on a reliable segmentation map, which is hard to obtain from an image of a corrupted mixed scene. To address this issue, we propose a novel method to progressively predict a reliable segmentation map from a corrupted image through the interplay between semantic segmentation and image inpainting in a coarse-to-fine manner. To verify how semantic information boosts image inpainting, we propose two networks: the first, serving as a baseline, uses only semantic guidance for image inpainting; the second adds semantic evaluation as an advanced strategy.

We first introduce some notations used throughout this paper. Given a corrupted image $I_{in}$ with a binary mask $M$ indicating the missing region, and the corresponding ground-truth image $I_{gt}$, the inpainting task is to generate an inpainted image $I_{out}$ from $I_{in}$ and $M$. Given a basic encoder-decoder architecture of $L$ layers, we denote the feature maps from deep to shallow in the encoder as $F_e^{L}, F_e^{L-1}, \ldots, F_e^{l}, \ldots, F_e^{1}$, and in the decoder as $F_d^{L}, \ldots, F_d^{l}, \ldots, F_d^{1}$.

Figure 2: Proposed Baseline: Semantic Guidance Network (SG-Net). It iteratively updates the contextual features in a coarse-to-fine manner. SGIM updates the predicted context features based on the segmentation map at the next scale.

3.1 Semantic Guidance Network (SG-Net)

The SG-Net architecture is shown in Fig. 2(a). The encoder extracts the contextual features of a corrupted image. The decoder then updates the contextual features to predict the semantic segmentation maps and inpainted images simultaneously in a multi-scale manner. Based on this structure, semantic guidance takes effect in two aspects. First, semantic supervision is added to guide the learning of contextual features at different scales of the decoder. Second, the predicted segmentation maps are fed into the inference modules to guide the update of the contextual features at the next scale. Unlike the two-stage processes [15, 20, 26], the supervision of semantic segmentation on the contextual features enables them to carry semantic information, which helps the decoder learn better texture models for different semantics.

The corrupted image is initially completed at the feature level through a Context Inference Module (CIM) based on Context Encoders [22]. After that, image inpainting and semantic segmentation interplay with each other and are progressively updated across scales. Two branches are extended from the contextual features at each scale of the decoder to generate the multi-scale completed images $I_{out}^{L}, \ldots, I_{out}^{l}, \ldots, I_{out}^{1}$ and their semantic segmentation maps $S^{L}, \ldots, S^{l}, \ldots, S^{1}$:


$I_{out}^{l} = G_{inp}(F_d^{l}), \qquad S^{l} = G_{seg}(F_d^{l})$

where $G_{inp}$ and $G_{seg}$ denote the inpainting branch and the segmentation branch, respectively.
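As a toy illustration of the two output branches, the sketch below maps shared contextual features to an RGB image and a per-pixel class distribution with 1×1-convolution-style linear heads. The shapes, random weights, and head names are illustrative assumptions, not the paper's actual layers:

```python
import numpy as np

def inpaint_head(feat, w_img):
    # 1x1-convolution-style linear map: C feature channels -> 3 RGB channels
    return np.tensordot(feat, w_img, axes=([0], [0]))  # (H, W, 3)

def seg_head(feat, w_seg):
    # linear map to class logits, then a per-pixel softmax over classes
    logits = np.tensordot(feat, w_seg, axes=([0], [0]))  # (H, W, K)
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
feat = rng.standard_normal((16, 8, 8))             # contextual features (C, H, W)
img = inpaint_head(feat, rng.standard_normal((16, 3)))
seg = seg_head(feat, rng.standard_normal((16, 5)))
```

The point is only that both outputs are read off the same contextual features, which is what lets the segmentation supervision shape the inpainting features.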

Semantic-Guided Inference Module (SGIM) SGIM is designed to make an inference and update the contextual features $F_d^{l-1}$ at the next scale $l-1$. As shown in Fig. 2(b), SGIM takes three types of inputs: two of them are the current contextual features $F_d^{l}$ and the skip features $F_e^{l-1}$ of the next scale from the encoder. The third input is the segmentation map $S^{l}$, which is used to formalize the textures under the assumption that regions of the same semantic class should have similar textures. The inference process can be formulated as follows:


$F_d^{l-1} = \mathcal{T}\left(F_d^{l}, F_e^{l-1}, S^{l}\right)$

where $\mathcal{T}$ is the process of updating the contextual features in SGIM.

Figure 3: Proposed Semantic Guidance and Evaluation Network (SGE-Net). It iteratively evaluates and updates the contextual features through the SCEM and SGIM+ modules in a coarse-to-fine manner, where SCEM identifies the pixels where the context needs to be corrected, while SGIM+ updates the predicted context features representing the incorrect pixels located by SCEM.

To update the contextual features based on the segmentation map $S^{l}$, we follow the image generation approach in [21], which adopts spatially-adaptive normalization to propagate semantic information into the predicted images for achieving effective semantic guidance. The contextual features are updated as follows:

$F_d^{l-1} = \gamma^{l} \odot \frac{F_c^{l} - \mu^{l}}{\sigma^{l}} + \beta^{l}$

where $(\gamma^{l}, \beta^{l})$ is a pair of affine transformation parameters modeled from the segmentation map $S^{l}$, $\mu^{l}$ and $\sigma^{l}$ are the mean and standard deviation of each channel in the concatenated feature vector $F_c^{l}$ generated from $F_d^{l}$ and $F_e^{l-1}$, and $\odot$ denotes element-wise multiplication.
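The spatially-adaptive normalization step can be sketched in NumPy as follows. Deriving the per-pixel affine parameters by a linear map of a one-hot segmentation encoding is a simplifying assumption made here for brevity; the actual approach in [21] predicts them with small convolutional networks:

```python
import numpy as np

def spade_update(f_c, seg_onehot, w_gamma, w_beta, eps=1e-5):
    # normalize each channel of the concatenated features
    mu = f_c.mean(axis=(1, 2), keepdims=True)
    sigma = f_c.std(axis=(1, 2), keepdims=True)
    normed = (f_c - mu) / (sigma + eps)
    # spatially-varying affine parameters derived from the segmentation map
    gamma = np.tensordot(w_gamma, seg_onehot, axes=([1], [0]))  # (C, H, W)
    beta = np.tensordot(w_beta, seg_onehot, axes=([1], [0]))
    return gamma * normed + beta  # element-wise modulation

rng = np.random.default_rng(1)
C, K, H, W = 8, 4, 6, 6
f_c = rng.standard_normal((C, H, W))                # concatenated features F_c
labels = rng.integers(0, K, (H, W))                 # toy segmentation labels
seg_onehot = np.eye(K)[labels].transpose(2, 0, 1)   # one-hot map (K, H, W)
out = spade_update(f_c, seg_onehot,
                   rng.standard_normal((C, K)), rng.standard_normal((C, K)))
```

Because gamma and beta vary per pixel with the semantic class, pixels of different classes are modulated differently even after the class-agnostic normalization.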

3.2 Semantic Guidance and Evaluation Network (SGE-Net)

To deeply exploit how segmentation confidence evaluation can help correct wrongly predicted pixels, we add a Segmentation Confidence Evaluation Module (SCEM) on each decoder layer of SG-Net. The evaluation is performed under the assumption that ambiguous predicted semantic content would result in low confidence scores during the semantic segmentation process. Therefore, we introduce segmentation confidence scoring after each decoding layer to self-evaluate the predicted region. The resulting reliability mask is then fed to the next scale, where it is used to identify the to-be-updated contextual features that contribute to the unreliable area. This module enables the proposed method to correct mistakes in regions completed at the previous coarser scale. Fig. 3(a) illustrates the detailed architecture of SGE-Net.

Segmentation Confidence Evaluation Module (SCEM) The output of the semantic segmentation branch is a $C$-channel probability map, where $C$ is the number of semantic classes. The confidence score at each channel of a pixel signifies how likely the pixel belongs to the corresponding class. Based on these scores, we assume that an inpainted pixel is unreliable if it has low scores for all semantic classes.

The framework of SCEM is depicted in Fig. 3(b). Taking the segmentation probability map $S_p^{l}$ at a certain scale $l$, we generate a reliability mask $M_r^{l}$ to locate those pixels which might have unreal semantic meaning. We first generate a max-possibility map $P_{max}^{l}$ by assigning to each pixel the highest confidence score over the $C$ channels of $S_p^{l}$. Then, the mask value of each pixel in the reliability mask is decided by judging whether the max-confidence score at that pixel location exceeds a threshold $\theta$:


$M_r^{l}(i,j) = \begin{cases} 1, & \text{if } P_{max}^{l}(i,j) \ge \theta \\ 0, & \text{otherwise} \end{cases}$

where $\theta$ is decided by the proportion of the maximum one from the sorted probability values.
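A minimal sketch of the reliability-mask computation, assuming the probability map is already softmax-normalized and using a fixed threshold for illustration (the paper derives the threshold from the sorted probability values):

```python
import numpy as np

def reliability_mask(prob_map, theta):
    # prob_map: (C, H, W) per-class segmentation probabilities at one scale
    p_max = prob_map.max(axis=0)               # max-possibility map
    return (p_max >= theta).astype(np.uint8)   # 1 = reliable, 0 = unreliable

# toy map: the first row is confidently classified, the second is ambiguous
prob = np.zeros((3, 2, 2))
prob[:, 0, :] = np.array([[0.9], [0.05], [0.05]])  # confident pixels
prob[:, 1, :] = np.array([[0.4], [0.35], [0.25]])  # ambiguous pixels
mask = reliability_mask(prob, theta=0.5)
```

The ambiguous pixels, whose best class barely beats the alternatives, fall below the threshold and are flagged for correction at the next scale.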

Enhanced SGIM (SGIM+) In order to correct the pixels marked as unreliable by the SCEM, SGIM+ takes the reliability mask $M_r^{l}$ as a fourth input to update the current contextual features (as shown in Fig. 3(c)). The formulation of the inference process can be updated as follows:


$F_d^{l-1} = \mathcal{T}^{+}\left(F_d^{l}, F_e^{l-1}, S^{l}, M_r^{l}\right)$

To enable the dynamic correction of semantics, we propose two sub-networks: a base-net $\mathcal{T}_{base}$ and a bias-net $\mathcal{T}_{bias}$. The result from the base-net is the same as that from the previous version of SGIM. The reliability mask is fed into the bias-net to learn residuals that rectify the basic contextual features from the base-net. The new contextual features at the next scale can be formulated as

$F_d^{l-1} = \mathcal{T}_{base}\left(F_d^{l}, F_e^{l-1}, S^{l}\right) + \mathcal{T}_{bias}\left(\left[F_d^{l} \,\|\, \varphi(M_r^{l})\right]\right)$

where $\|$ denotes the concatenation operation and $\varphi$ represents the convolutions that translate the reliability mask into a feature map.
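The base/bias combination can be illustrated as below. Gating the residual so that it acts only on pixels flagged unreliable is a simplification of the learned bias-net, chosen to make the correction behavior explicit; the function and variable names are illustrative:

```python
import numpy as np

def sgim_plus(f_base, residual, mask_r):
    # f_base: base-net output (C, H, W); residual: bias-net output (C, H, W)
    # apply the correction only where SCEM marked pixels unreliable (mask_r == 0)
    return f_base + residual * (1 - mask_r)[None]

rng = np.random.default_rng(2)
f_base = rng.standard_normal((4, 3, 3))
residual = rng.standard_normal((4, 3, 3))
mask_r = np.ones((3, 3), dtype=np.uint8)
mask_r[1, 1] = 0                  # a single unreliable pixel
out = sgim_plus(f_base, residual, mask_r)
```

Reliable pixels keep their base-net features untouched, while the flagged pixel receives the corrective residual.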

3.3 Training Loss Function

The loss functions comprise loss terms for both image inpainting and semantic segmentation. For image inpainting, we adopt the multi-scale reconstruction loss to refine a completed image and the adversarial loss to generate visually realistic textures. For semantic segmentation, we adopt the multi-scale cross-entropy loss to restrain the distance between the predicted and target class distributions of pixels at all scales.

Multi-scale Reconstruction Loss We use the $\ell_1$ loss to encourage per-pixel reconstruction accuracy, and the perceptual loss to encourage higher-level feature similarity:

$\mathcal{L}_{rec} = \sum_{l=1}^{L} \left\| U_l(I_{out}^{l}) - I_{gt} \right\|_1 + \lambda_{p} \sum_{l=1}^{L} \sum_{k} \left\| \Phi_k\!\left(U_l(I_{out}^{l})\right) - \Phi_k(I_{gt}) \right\|_1$

where $\Phi_k$ is the activation map of the $k$-th selected layer, $U_l$ is the operation to upsample $I_{out}^{l}$ to the same size as $I_{gt}$, and $\lambda_p$ is a trade-off coefficient. We use three layered features in VGG-16 to calculate these loss functions.
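A sketch of the multi-scale reconstruction loss, with a toy average-pooling feature extractor standing in for the VGG-16 activations and the scale outputs assumed to be already upsampled to the ground-truth size:

```python
import numpy as np

def l1(a, b):
    return np.abs(a - b).mean()

def multiscale_rec_loss(outputs, gt, feat_fn, lam=0.1):
    # outputs: completed images at each scale, already upsampled to gt's size
    loss = 0.0
    for out in outputs:
        loss += l1(out, gt)                          # per-pixel l1 term
        loss += lam * l1(feat_fn(out), feat_fn(gt))  # perceptual term
    return loss

rng = np.random.default_rng(3)
gt = rng.random((3, 16, 16))
outs = [gt + 0.01 * rng.standard_normal(gt.shape) for _ in range(3)]
# toy 2x average-pooling stands in for a VGG-16 feature extractor
feat = lambda x: x.reshape(3, 8, 2, 8, 2).mean(axis=(2, 4))
loss = multiscale_rec_loss(outs, gt, feat)
```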

Adversarial Loss We use a multi-scale PatchGAN [31] to classify the global and local patches of an image at multiple scales. The multi-scale patch adversarial loss is defined as:

$\mathcal{L}_{adv} = \sum_{l} \mathbb{E}\left[\log D_{l}(P_{gt}^{l})\right] + \mathbb{E}\left[\log\left(1 - D_{l}(P_{out}^{l})\right)\right]$

where $D_{l}$ is the discriminator at scale $l$, and $P_{gt}^{l}$ and $P_{out}^{l}$ are the patches in the $l$-th scaled versions of $I_{gt}$ and $I_{out}^{l}$.
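A sketch of the multi-scale patch adversarial term, assuming the discriminator outputs per-patch probabilities in (0, 1); the two-scale inputs below are toy values rather than real discriminator outputs:

```python
import numpy as np

def patchgan_loss(d_real, d_fake, eps=1e-8):
    # d_real / d_fake: per-patch discriminator outputs in (0, 1), one array per scale
    loss = 0.0
    for r, f in zip(d_real, d_fake):
        loss += np.log(r + eps).mean() + np.log(1.0 - f + eps).mean()
    return loss

# toy two-scale example: the discriminator is fairly confident at both scales
d_real = [np.full((4, 4), 0.9), np.full((2, 2), 0.9)]
d_fake = [np.full((4, 4), 0.1), np.full((2, 2), 0.1)]
loss = patchgan_loss(d_real, d_fake)
```

Averaging over patches at each scale penalizes unrealistic local textures as well as global structure.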

Multi-scale Cross-Entropy Loss This loss is used to penalize the deviation of the predicted segmentation map $S^{l}$ from the ground truth at each position at every scale:

$\mathcal{L}_{seg} = -\sum_{l}\sum_{i,j}\sum_{c=1}^{C} S_{gt}^{l}(i,j,c)\,\log S^{l}(i,j,c)$

where $S_{gt}^{l}$ is the ground-truth segmentation map downsampled to scale $l$.


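The multi-scale cross-entropy can be sketched as follows, with per-scale probability maps and one-hot labels as inputs:

```python
import numpy as np

def multiscale_ce_loss(preds, gts, eps=1e-8):
    # preds / gts: per-scale lists of (C, H, W) probabilities and one-hot labels
    loss = 0.0
    for s, s_gt in zip(preds, gts):
        loss += -(s_gt * np.log(s + eps)).sum(axis=0).mean()
    return loss

# toy single-scale, two-class example where the true class gets probability 0.8
pred = np.stack([np.full((2, 2), 0.8), np.full((2, 2), 0.2)])
gt = np.stack([np.ones((2, 2)), np.zeros((2, 2))])
loss = multiscale_ce_loss([pred], [gt])
```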
Final Training Loss The overall training loss of our network is defined as the weighted sum of the multi-scale reconstruction loss, adversarial loss, and multi-scale cross-entropy loss.


$\mathcal{L} = \mathcal{L}_{rec} + \lambda_{adv}\,\mathcal{L}_{adv} + \lambda_{seg}\,\mathcal{L}_{seg}$

where $\lambda_{adv}$ and $\lambda_{seg}$ are the weights for the adversarial loss and the multi-scale cross-entropy loss, respectively.

4 Experiments

4.1 Setting

We evaluate our method on Outdoor Scenes [29] and Cityscapes [4], both with segmentation annotations. Outdoor Scenes contains 9,900 training images and 300 test images belonging to 8 categories. Cityscapes contains 5,000 street-view images belonging to 20 categories. To enlarge the number of training images for this dataset, we use the 2,975 images from the training set and the 1,525 images from the test set for training, and test on the 500 images from the validation set. We resize each training image to ensure its minimal height/width is 256 for Outdoor Scenes and 512 for Cityscapes, and then randomly crop fixed-size sub-images as inputs to our model.

We compare our method with the following three representative baselines:

GatedConv [37]: gated convolution for free-form image inpainting, without any auxiliary structural information.

EdgeConnect [20]: two-stage inpainting framework with edges as low-level structural information.

SPG-Net [26]: two-stage inpainting framework with a semantic segmentation map as high-level structural information.

In our experiments, we use GatedConv and EdgeConnect fine-tuned on each dataset. We also re-implement and train the SPG-Net model ourselves since there is no released code or model. We conduct experiments with both centering and irregular holes: the centering holes are rectangles at the image center (with sizes set separately for Outdoor Scenes and Cityscapes), and the irregular masks are obtained from [16].

Input GatedConv EdgeConnect SPG-Net SG-Net SGE-Net GT
Figure 4: Subjective quality comparison of inpainting results on image samples from Outdoor Scenes and Cityscapes. GT stands for Ground-Truth.
                     Outdoor Scenes                            Cityscapes
                     centering holes      irregular holes      centering holes      irregular holes
Method               PSNR↑ SSIM↑ FID↓     PSNR↑ SSIM↑ FID↓     PSNR↑ SSIM↑ FID↓     PSNR↑ SSIM↑ FID↓
GatedConv [37]       19.06 0.73  42.34    19.27 0.81  40.31    21.13 0.74  20.03    17.42 0.72  40.57
EdgeConnect [20]     19.32 0.76  41.25    19.63 0.83  44.31    21.71 0.76  19.87    17.83 0.73  38.07
SPG-Net [26]         18.04 0.70  45.31    17.85 0.74  50.03    20.14 0.71  23.21    16.01 0.64  44.13
SG-Net (ours)        19.58 0.77  41.49    19.87 0.81  41.74    23.04 0.83  18.98    17.94 0.64  41.24
SGE-Net (ours)       20.53 0.81  40.67    20.02 0.83  42.47    23.41 0.85  18.67    18.03 0.75  39.93
Table 1: Objective quality comparison of five methods in terms of PSNR, SSIM, and FID on Outdoor Scenes and Cityscapes (↑: higher is better; ↓: lower is better). The two best scores in each column are colored in red and blue, respectively.

4.2 Image Inpainting Results

In this section, we present the results of our model trained on human-annotated segmentation labels. We also verify our model trained on the segmentation labels predicted by a state-of-the-art segmentation model. The results can be found in Section 4.4, and more details are given in the supplementary material.

Qualitative Comparisons The subjective visual comparisons of the proposed SG-Net and SGE-Net with the three baselines (GatedConv, EdgeConnect, SPG-Net) on Outdoor Scenes and Cityscapes are presented in Fig. 4. The corrupted area is simulated by sampling a central hole (sized separately for Outdoor Scenes and Cityscapes) or by randomly placing multiple missing rectangles. As shown in the figure, the baselines usually generate unrealistic shapes and textures. The proposed SG-Net generates more realistic textures than the baselines but still has some flaws at the boundaries, since its final result highly depends on the initial inpainting result. The proposed SGE-Net generates better boundaries between semantic regions and more consistent textures than SG-Net and all the baselines, thanks to its evaluation mechanism that can correct wrongly predicted labels.

Quantitative Comparisons Table 1 shows the numerical results based on three quality metrics: Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM), and Fréchet Inception Distance (FID) [11]. In general, the proposed SGE-Net achieves significantly better objective scores than the baselines, especially in PSNR and SSIM.

User Study We conduct a user study on 80 images randomly selected from both datasets. In total, 24 subjects are asked to rank the subjective visual quality of images completed by four inpainting methods (GatedConv, EdgeConnect, SPG-Net, and our SGE-Net). As shown in Table 2, 67.4% (1295 out of 1920 pairwise comparisons), 70.7%, and 73.2% of comparisons preferred our results over GatedConv, EdgeConnect, and SPG-Net, respectively. Hence, our method outperforms the other methods.

Since our method mainly focuses on completing mixed scenes with multiple semantics, we also verify its performance on images with different scene complexities. We conduct this analysis by dividing all 80 images into three levels of semantic complexity: 1) low-complexity scenes, 27 images with 1–2 semantics; 2) moderate-complexity scenes, 32 images with 3–4 semantics; and 3) high-complexity scenes, 21 images with more than 4 semantics. As shown in Table 2, our method achieves generally better preference rates than the baselines for the low- to moderate-complexity scenes (about 62.2% to 71.2%), and the rates increase significantly for the high-complexity scenes (77.6% to 83.7%). This verifies that our method is particularly powerful for completing mixed-scene images with multiple semantics, thanks to its mechanism for understanding and updating the semantics.

GatedConv [37] EdgeConnect [20] SPG-Net [26] SGE-Net (ours)
GatedConv [37] (46.7)/41.5/47.7/52.0 (58.1)/57.3/59.8/56.7 (32.6)/37.8/34.8/22.4
EdgeConnect [20] (53.3)/58.5/52.3/48.0 (70.1)/68.7/69.3/73.0 (29.3)/35.0/31.8/18.1
SPG-Net [26] (41.9)/42.7/40.2/43.3 (29.9)/31.3/30.7/27.0 (26.8)/32.7/28.8/16.3
SGE-Net (ours) (67.4)/62.2/65.2/77.6 (70.7)/65.0/68.2/81.9 (73.2)/67.3/71.2/83.7
Table 2: Preference percentage matrix (%) for different scene complexities on the Outdoor Scenes and Cityscapes datasets, where each cell gives the percentage of comparisons in which the row method was preferred over the column method. The four values per cell correspond to overall (in parentheses), low-complexity, moderate-complexity, and high-complexity scenes, respectively.

4.3 Ablation Study

Effectiveness of SGIM and SCEM In the proposed networks, the two core components of our method, semantic-guided inference and segmentation confidence evaluation, are implemented by SGIM and SCEM, respectively. In order to investigate their effectiveness, we conduct an ablation study on three variants: a) Base-Net (without SGIM and SCEM); b) SG-Net (with SGIM but without SCEM); and c) SGE-Net (with both SGIM and SCEM).

The visual and numeric comparisons on Outdoor Scenes are shown in Fig. 5 and Table 3. In general, the inpainting performance increases with the added modules. Specifically, the multi-scale semantic-guided interplay framework does a good job for generating detailed contents, and the semantic segmentation map helps learn a more accurate layout of a scene. With SGIM, the spatial adaptive normalization helps generate more realistic textures based on the semantic priors. Moreover, SCEM makes further improvements on completing structures and textures (fourth column in Fig. 5) by coarse-to-fine optimizing the semantic contents across scales.

Input Base-Net SG-Net SGE-Net GT Figure 5: Subjective visual quality comparisons on the effects of SGIM and SCEM.
Table 3: Objective quality comparison on the performance of SGIM and SCEM in terms of three metrics on Outdoor Scenes.
Method     PSNR↑  SSIM↑  FID↓
Base-Net   19.14  0.71   45.31
SG-Net     19.58  0.77   41.49
SGE-Net    20.53  0.81   40.67
scale 4 scale 3 scale 2 scale 1 final
Figure 6: Illustration of multi-scale progressive refinement with SGE-Net. From left to right of the first 5 columns: the inpainted images (top row) and the segmentation maps (bottom row) from scale 4 to scale 1, followed by the final result. The last 3 columns show the reliability maps (top row) and the confidence score maps (bottom row) of the inpainted area across scales (e.g., the first of these columns shows how the confidence score increases from scale 4 to scale 3).

To further verify the effectiveness of SCEM, we visualize a corrupted image and its segmentation maps derived from all decoding scales. As shown in the first five columns of Fig. 6, the multi-scale progressive-updating mechanism gradually refines the detailed textures as illustrated in the images and the segmentation maps at different scales. The last three columns of the top row show that the region of the unreliable mask gradually shrinks as well. Correspondingly, the bottom row shows the increase of the confidence scores of segmentation maps from left to right (e.g., showing the increase of the confidence score from scale 4 to scale 3). The proportion of the white region, which roughly indicates unreliable labels, also decreases significantly from left to right. The result evidently demonstrates the benefits of SCEM in strengthening the semantic correctness of contextual features.

scale 4 scale 3 scale 2 scale 1 final scale 4 scale 3 scale 2 scale 1 final
(a) (b)
Figure 7: Correspondence between the confidence score and the reliability of inpainted image content. (a) Outdoor Scenes and (b) Cityscapes. Row 1: inpainted image. Row 2: predicted segmentation map. Row 3: confidence score map (darker color means higher confidence score, and vice versa). Row 4: unreliable pixel map (white pixels indicate unreliable pixels). Since the map at scale 4 is the same as the input mask, we show the input image instead for better comparison.

Justification of Segmentation Confidence Scoring During the progressive refinement of image inpainting and semantic segmentation, the semantic evaluation mechanism of SCEM is based on the assumption that the pixel-wise confidence scores from the segmentation possibility map can well reflect the correctness of the inpainted pixel values. Here we attempt to justify this assumption. Some examples from both datasets are shown in Fig. 7. It can be seen that (except for the confidence scores at the region boundaries):

  • The low confidence scores (the white areas in row 3) usually appear in the mask area, indicating that the scores reasonably well reflect the reliability of inpainted image content;

  • the confidence scores become higher as the scale goes finer, and correspondingly the area of unreliable pixels shrinks, meaning that our method can progressively refine the contextual features towards correct inpainting.

Figure 8: Correlation between the inpainting quality and confidence score.

We then further verify the effectiveness of the pixel-wise confidence scores by validating the correlation between the confidence scores and the $\ell_1$ loss of the completed images with respect to the ground truth, which can be used to measure the fidelity of inpainted pixels. We randomly select 9,000 images out of all the training and testing images from the two datasets with centering and irregular-hole settings, and calculate the average $\ell_1$ loss and the confidence scores of all pixels in the missing region. As demonstrated in Fig. 8, the number of good-fidelity images with low $\ell_1$ loss increases with the segmentation confidence score (a lower $\ell_1$ loss means higher quality of the predicted image), implying that the segmentation confidence score serves well as a metric for evaluating the accuracy of inpainted image content.
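The per-image statistics behind this correlation check can be sketched as follows, using toy completions and a hand-set confidence map (the real analysis uses the model's own outputs):

```python
import numpy as np

def masked_l1_and_confidence(out, gt, prob, hole):
    # average l1 error and mean max-class confidence inside the missing region
    err = np.abs(out - gt).mean(axis=0)[hole].mean()
    conf = prob.max(axis=0)[hole].mean()
    return err, conf

rng = np.random.default_rng(4)
gt = rng.random((3, 8, 8))
hole = np.zeros((8, 8), dtype=bool)
hole[2:6, 2:6] = True                               # a 4x4 centering hole
good = gt + 0.01 * rng.standard_normal(gt.shape)    # accurate completion
bad = rng.random((3, 8, 8))                         # poor completion
prob = np.full((5, 8, 8), 0.02)
prob[0] = 0.92                                      # highly confident map
err_good, conf = masked_l1_and_confidence(good, gt, prob, hole)
err_bad, _ = masked_l1_and_confidence(bad, gt, prob, hole)
```

Plotting these two per-image scalars against each other over many samples yields the correlation shown in Fig. 8.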

Input EC SPG-Net SGE-Net GT
Figure 9: Visual comparison on semantic segmentation between SGE-Net and the segmentation-after-inpainting solutions. ‘EC’ and ‘SPG-Net’ stand for EdgeConnect+DPN/Deeplab and SPG-Net+DPN/Deeplab, respectively.
Table 4: Statistical comparison on semantic segmentation (mIoU) between SGE-Net and the segmentation-after-inpainting solutions on corrupted images.
Outdoor Scenes                      Cityscapes
Methods                 mIoU        Methods                   mIoU
EdgeConnect + DPN       0.73        EdgeConnect + Deeplab     0.61
SPG-Net + DPN           0.63        SPG-Net + Deeplab         0.57
SGE-Net (ours)          0.79        SGE-Net (ours)            0.71

4.4 Impact of Semantic Segmentation

The success of semantics-guided inpainting largely relies on the quality of semantic segmentation map. Here we investigate the impact of segmentation accuracy on image inpainting in two aspects: 1) the comparison between SGE-Net (i.e., multi-scale interplay between segmentation and inpainting) and the traditional approach (i.e., non-iterative segmentation after an initial inpainting); and 2) the comparison between SGE-Net with segmentation maps generated by state-of-the-art segmentation tools and SGE-Net with human-labeled maps. We utilize the DPN model [18] pre-trained on [29] and the Deeplab v3+ model [3] as the segmentation tools for Outdoor Scenes and Cityscapes, respectively.

SGE-Net versus Segmentation-after-Inpainting SGE-Net makes use of multi-scale iterative interplay between inpainting and semantic segmentation to improve the quality of semantic segmentation map for a corrupted image. We also conduct experiments to validate whether the iterative interplay between inpainting and semantic segmentation outperforms the traditional non-iterative segmentation-after-inpainting strategy in semantic segmentation. We compare the semantic segmentation maps generated by SGE-Net itself with initial segmentation maps extracted from images completed by baseline methods.

As compared in Fig. 9 and Table 4, SGE-Net evidently outperforms the segmentation-after-inpainting methods, yielding more accurate semantic assignments and object boundaries thanks to its multi-scale joint optimization of semantics and image content.

Table 5: Objective quality comparison (PSNR) between models trained on automatic segmentation (Auto-segs) and human-labeled semantics (Label-segs).

Outdoor Scenes        Cityscapes
Methods      PSNR     Methods      PSNR
Auto-segs    20.19    Auto-segs    22.94
Label-segs   20.53    Label-segs   23.41

Automatic Segmentation vs. Human-Labeled Semantics As a sanity check, we further study the impact on SGE-Net's inpainting performance of replacing the ground-truth segmentation maps used for training with maps generated by state-of-the-art CNN-based segmentation models (i.e., DPN and Deeplab v3+). This tests the sensitivity of our method to datasets with imperfect semantic annotations.

As shown in Table 5, the performance degradation of SGE-Net trained on imperfect semantic annotations is insignificant, meaning that our model still performs reasonably well even when trained on model-generated annotations. More subjective quality comparisons are provided in the supplementary material. Note that the segmentation maps, whether human-annotated or model-generated, are only used during the training stage. At inference time, SGE-Net generates the inpainted image and segmentation map simultaneously, without requiring any semantic annotations.

Figure 10: Subjective quality comparison on image samples from Places2. ‘GC’, ‘EC’ and ‘GT’ stand for GatedConv, EdgeConnect and Ground-Truth, respectively.
Figure 11: Examples of failure cases from Outdoor Scenes and CityScapes dataset. ‘GC’ and ‘EC’ stand for GatedConv and EdgeConnect, respectively.

4.5 Additional Results on Places2

For a fair comparison, we also test our method on Places2 [41], which was used for evaluation by both GatedConv and EdgeConnect, to verify that SGE-Net can be applied to images without segmentation annotations. Since Places2 contains scenes semantically similar to those in Outdoor Scenes, we use our model trained on Outdoor Scenes to complete such images from Places2. The subjective results in Fig. 10 show that SGE-Net is still able to generate proper semantic structures, owing to the semantic segmentation, which provides prior knowledge about the scenes.

4.6 Failure Cases

Fig. 11 shows some typical failure cases of our model. The most common failures occur on non-rigid objects, e.g., a person or an animal, whose semantic shapes are hard for the model to learn. In the first row, the model reconstructs the wall over the head of the lion, leading to a poor completion. In the second row, one hand of the second person is missing, although the main part of the ground is completed properly.

5 Conclusion

In this paper, we proposed SGE-Net, a novel semantic-segmentation-guided scheme for completing corrupted images of mixed semantic regions. To address the problem of unreliable semantic segmentation caused by missing regions, we proposed a progressive multi-scale refinement mechanism that conducts an interplay between semantic segmentation and image inpainting. Experimental results demonstrate that the mechanism effectively refines poorly-inferred regions through segmentation confidence evaluation, generating promising semantic structures and texture details in a coarse-to-fine manner.


  • [1] Barnes, C., Shechtman, E., Finkelstein, A., Goldman, D.B.: Patchmatch: A randomized correspondence algorithm for structural image editing. In: ACM Transactions on Graphics (ToG). vol. 28, p. 24. ACM (2009)
  • [2] Bertalmio, M., Sapiro, G., Caselles, V., Ballester, C.: Image inpainting. In: Proceedings of the 27th annual conference on Computer graphics and interactive techniques. pp. 417–424 (2000)
  • [3] Chen, L.C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H.: Encoder-decoder with atrous separable convolution for semantic image segmentation. In: ECCV (2018)
  • [4] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR (2016)
  • [5] Criminisi, A., Pérez, P., Toyama, K.: Region filling and object removal by exemplar-based image inpainting. IEEE Transactions on image processing 13(9), 1200–1212 (2004)
  • [6] Demir, U., Unal, G.: Deep stacked networks with residual polishing for image inpainting. arXiv preprint arXiv:1801.00289 (2017)
  • [7] Dosovitskiy, A., Brox, T.: Generating images with perceptual similarity metrics based on deep networks. In: NeurIPS (2016)
  • [8] Gilbert, A., Collomosse, J., Jin, H., Price, B.: Disentangling structure and aesthetics for style-aware image completion. In: CVPR (2018)
  • [9] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: NeurIPS (2014)
  • [10] Hays, J., Efros, A.A.: Scene completion using millions of photographs. ACM Transactions on Graphics (TOG) 26(3), 4–es (2007)
  • [11] Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: Gans trained by a two time-scale update rule converge to a local nash equilibrium. In: NeurIPS (2017)
  • [12] Iizuka, S., Simo-Serra, E., Ishikawa, H.: Globally and locally consistent image completion. ACM Transactions on Graphics (ToG) 36(4),  107 (2017)
  • [13] Li, J., He, F., Zhang, L., Du, B., Tao, D.: Progressive reconstruction of visual structure for image inpainting. In: ICCV (2019)
  • [14] Li, Y., Liu, S., Yang, J., Yang, M.H.: Generative face completion. In: CVPR (2017)
  • [15] Liao, L., Hu, R., Xiao, J., Wang, Z.: Edge-aware context encoder for image inpainting. In: ICASSP (2018)
  • [16] Liu, G., Reda, F.A., Shih, K.J., Wang, T.C., Tao, A., Catanzaro, B.: Image inpainting for irregular holes using partial convolutions. In: ECCV (2018)
  • [17] Liu, H., Jiang, B., Xiao, Y., Yang, C.: Coherent semantic attention for image inpainting. arXiv preprint arXiv:1905.12384 (2019)
  • [18] Liu, Z., Li, X., Luo, P., Loy, C.C., Tang, X.: Deep learning markov random field for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(8), 1814–1828 (2017)
  • [19] Ma, Y., Liu, X., Bai, S., Wang, L., He, D., Liu, A.: Coarse-to-fine image inpainting via region-wise convolutions and non-local correlation. In: IJCAI (2019)
  • [20] Nazeri, K., Ng, E., Joseph, T., Qureshi, F., Ebrahimi, M.: Edgeconnect: Generative image inpainting with adversarial edge learning. arXiv preprint arXiv:1901.00212 (2019)
  • [21] Park, T., Liu, M.Y., Wang, T.C., Zhu, J.Y.: Semantic image synthesis with spatially-adaptive normalization. In: CVPR (2019)
  • [22] Pathak, D., Krahenbuhl, P., Donahue, J., Darrell, T., Efros, A.A.: Context encoders: Feature learning by inpainting. In: CVPR (2016)
  • [23] Ren, Y., Yu, X., Zhang, R., Li, T.H., Liu, S., Li, G.: Structureflow: Image inpainting via structure-aware appearance flow. In: ICCV (2019)
  • [24] Radford A., Metz L., Chintala S.: Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 (2015)
  • [25] Song, Y., Yang, C., Lin, Z., Liu, X., Huang, Q., Li, H., Jay Kuo, C.C.: Contextual-based image inpainting: Infer, match, and translate. In: ECCV (2018)
  • [26] Song, Y., Yang, C., Shen, Y., Wang, P., Huang, Q., Kuo, C.C.J.: Spg-net: Segmentation prediction and guidance network for image inpainting. arXiv preprint arXiv:1805.03356 (2018)
  • [27] Sun, J., Yuan, L., Jia, J., Shum, H.Y.: Image completion with structure propagation. In: ACM SIGGRAPH 2005 Papers, pp. 861–868 (2005)
  • [28] Wang, N., Li, J., Zhang, L., Du, B.: Musical: multi-scale image contextual attention learning for inpainting. In: IJCAI (2019)
  • [29] Wang, X., Yu, K., Dong, C., Change Loy, C.: Recovering realistic texture in image super-resolution by deep spatial feature transform. In: CVPR (2018)
  • [30] Wang, Y., Tao, X., Qi, X., Shen, X., Jia, J.: Image inpainting via generative multi-column convolutional neural networks. In: NeurIPS (2018)
  • [31] Xiao, J., Liao, L., Liu, Q., Hu, R.: Cisi-net: Explicit latent content inference and imitated style rendering for image inpainting. In: AAAI (2019)
  • [32] Xiong, W., Yu, J., Lin, Z., Yang, J., Lu, X., Barnes, C., Luo, J.: Foreground-aware image inpainting. In: CVPR (2019)
  • [33] Yan, Z., Li, X., Li, M., Zuo, W., Shan, S.: Shift-net: Image inpainting via deep feature rearrangement. In: ECCV (2018)
  • [34] Yang, C., Lu, X., Lin, Z., Shechtman, E., Wang, O., Li, H.: High-resolution image inpainting using multi-scale neural patch synthesis. In: CVPR (2017)
  • [35] Yeh, R.A., Chen, C., Yian Lim, T., Schwing, A.G., Hasegawa-Johnson, M., Do, M.N.: Semantic image inpainting with deep generative models. In: CVPR (2017)
  • [36] Yu, J., Lin, Z., Yang, J., Shen, X., Lu, X., Huang, T.S.: Generative image inpainting with contextual attention. In: CVPR (2018)
  • [37] Yu, J., Lin, Z., Yang, J., Shen, X., Lu, X., Huang, T.S.: Free-form image inpainting with gated convolution. In: ICCV (2019)
  • [38] Zeng, Y., Fu, J., Chao, H., Guo, B.: Learning pyramid-context encoder network for high-quality image inpainting. In: CVPR (2019)
  • [39] Zhang, S., He, R., Sun, Z., Tan, T.: Demeshnet: Blind face inpainting for deep meshface verification. IEEE Transactions on Information Forensics and Security 13(3), 637–647 (2018)
  • [40] Zheng, C., Cham, T.J., Cai, J.: Pluralistic image completion. In: CVPR (2019)
  • [41] Zhou, B., Lapedriza, A., Khosla, A., Oliva, A., Torralba, A.: Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(6), 1452–1464 (2017)