Rethinking Image Inpainting via a Mutual Encoder-Decoder with Feature Equalizations

07/14/2020, by Hongyu Liu et al.

Deep encoder-decoder based CNNs have advanced image inpainting methods for hole filling. While existing methods recover structures and textures step by step in the hole regions, they typically use two encoder-decoders for separate recovery. The CNN features of each encoder are learned to capture either missing structures or textures without considering them as a whole, and this insufficient utilization of the encoder features limits the performance of recovering both structures and textures. In this paper, we propose a mutual encoder-decoder CNN for joint recovery of both. We use CNN features from the deep and shallow layers of the encoder to represent the structures and textures of an input image, respectively. The deep layer features are sent to a structure branch and the shallow layer features are sent to a texture branch. In each branch, we fill holes at multiple scales of the CNN features. The filled CNN features from both branches are concatenated and then equalized. During feature equalization, we first reweight channel attentions and then propose a bilateral propagation activation function to enable spatial equalization. To this end, the filled CNN features of structure and texture mutually benefit each other to represent image content at all feature levels. We use the equalized features to supplement decoder features for output image generation through skip connections. Experiments on benchmark datasets show that the proposed method is effective in recovering structures and textures and performs favorably against state-of-the-art approaches.


1 Introduction

Recovering missing content in corrupted images is needed to improve visual aesthetics. Deep neural networks have advanced image inpainting by introducing semantic guidance to fill hole regions. Different from traditional methods [7, 2, 8, 3] that propagate uncorrupted image content to the hole regions via patch-based image matching, deep inpainting methods [25, 13] utilize CNN features at different levels (i.e., from low-level features to high-level semantics) to produce more meaningful and globally consistent results.

(a) Input (b) GC [40] (c) CSA [21] (d) Ours (e) GT
Figure 1: Visual comparison on the Paris StreetView dataset [6]. GT is the ground truth image. The proposed inpainting method effectively reduces the blur and artifacts within and around the hole regions that are caused by inconsistent structure and texture features.

The encoder-decoder architecture is prevalent in existing deep inpainting methods [13, 19, 38, 25]. However, a direct utilization of the end-to-end training and prediction processes generates limited results. This is due to the challenging fact that the hole region is completely empty. Without sufficient image guidance, a single encoder-decoder is not able to reconstruct the whole missing content. An alternative is to use two encoder-decoders to separately learn missing structures and textures in a step-by-step manner. These two-stage methods [28, 24, 26, 41, 37, 40, 21] typically generate an intermediate image with recovered structures in the first stage (i.e., encoder-decoder), and send this image to the second stage for texture generation. Although structures and textures are produced in the output image, their appearances are not consistent. Fig. 1 shows an example: the inconsistent structures and textures within the hole regions produce blur and artifacts, as shown in (b) and (c). Meanwhile, the recovered content is not coherent with the uncorrupted content around the hole boundaries (e.g., the leaves). This limitation arises from the independent learning of the CNN features representing structures and textures. In practice, structures and textures correlate with each other to form the image content. Without considering their coherence, existing methods are not able to produce visually pleasing results.

In this work, we propose a mutual encoder-decoder to jointly learn CNN features representing structures and textures. The features from the deep layers of the encoder contain structure semantics, while the features from the shallow layers contain texture details. The hole regions of these two features are filled via two separate branches. In the CNN feature space, we use a multi-scale filling block within each branch for hole filling. Each block consists of 3 partial convolution streams with progressively increased kernel sizes. After hole filling in these two features, we propose a feature equalization method to make the structure and texture features consistent with each other. Meanwhile, the equalized features are made coherent with the features of the uncorrupted image content around the hole boundaries. The proposed feature equalization consists of channel reweighting and bilateral propagation. We first concatenate the two features and perform channel reweighting via attention exploration [12]. The attentions across the two features are set to be consistent after channel equalization. Then, we propose a bilateral propagation activation function to equalize feature consistency over the whole feature maps. This activation function uses elements on the global feature maps to propagate channel consistency (i.e., feature coherence across the hole boundaries), while using elements within local neighboring regions to maintain channel similarities (i.e., feature consistency within the hole). To this end, we fuse the texture and structure features together to reduce inconsistency in the CNN feature maps. The equalized features then supplement the decoder features at all feature levels via encoder-decoder skip connections. The feature consistency is then reflected in the reconstructed output image, where the blur and artifacts are effectively removed around the hole regions, as shown in Fig. 1(d). Experiments on benchmark datasets show that the proposed method performs favorably against state-of-the-art approaches.

We summarize the contributions of this work as follows:

  • We propose a mutual encoder-decoder network for image inpainting. The CNN features from the shallow layers are learned to represent textures, while the features from the deep layers are learned to represent structures.

  • We propose a feature equalization method to make the structure and texture features consistent with each other. We first reweight channels after feature concatenation and then propose a bilateral propagation activation function to make the whole feature map consistent.

  • Extensive experiments on the benchmark datasets have shown the effectiveness of the proposed inpainting method in removing blur and artifacts caused by inconsistent structure and texture features. The proposed method performs favorably against state-of-the-art inpainting approaches.

2 Related Work

Empirical Image Inpainting.

The empirical image inpainting methods [3, 18, 1] based on diffusion techniques propagate neighborhood appearances to the missing regions. However, they only consider the pixels surrounding the missing regions, so they can only handle small holes in background inpainting tasks and may fail to generate meaningful structures. In contrast, patch-match based methods [4, 5, 27, 2, 35] fill missing regions by transferring similar and relevant patches from the remaining image regions to the hole region. Although these empirical methods handle small holes well in background inpainting tasks, they are not able to generate semantically meaningful content; when the hole region is large, they suffer from a lack of semantic guidance.

Deep Image Inpainting.

Image inpainting based on deep learning typically involves a generative adversarial network [9] to supplement visual perceptual guidance for hole filling. Pathak et al. [25] first bring adversarial training [9] to inpainting and demonstrate semantic hole filling. Iizuka et al. [13] propose local and global discriminators, assisted by dilated convolutions [39], to improve the inpainting quality. Nazeri et al. [24] propose EdgeConnect, which predicts salient edges as inpainting guidance. Song et al. [28] utilize a segmentation prediction network to generate segmentation guidance for detail refinement around the hole region. Xiong et al. [33] present foreground-aware inpainting, which involves three stages, i.e., contour detection, contour completion, and image completion, to disentangle structure inference from content hallucination. Ren et al. [26] introduce a structure-aware network that splits the inpainting task into two parts, structure reconstruction and texture generation, and uses appearance flow to sample features from contextual regions. Yan et al. [36] model the relationship between the contextual regions in the encoder layers and the associated hole region in the decoder layers for better predictions. Yu et al. [41] and Song et al. [37] search for a collection of background patches with the highest similarity to the contents generated in the first-stage prediction. Liu et al. [20] address the inpainting task by exploiting a partial convolutional layer and a mask-update operation. Following [20], Yu et al. [40] present gated convolution, which learns a dynamic mask-update mechanism and is combined with an SN-PatchGAN discriminator to achieve better predictions. Liu et al. [21] propose coherent semantic attention, which considers the feature coherency of hole regions to guarantee pixel continuity at the image level. Wang et al. [31] propose a generative multi-column convolutional neural network (GMCNN) that uses varied receptive fields in different branches. Different from existing deep inpainting methods, our method produces CNN features that consistently represent structures and textures to reduce blur and artifacts around the hole region.

3 Proposed Algorithm

Figure 2: Overview of the proposed pipeline. We use a mutual encoder-decoder to jointly recover structures and textures during hole filling. The deep layer features of the encoder are reorganized as structure features, while the shallow layer features are reorganized as texture features. We fill holes at multiple scales within the CNN feature space and equalize the output features in both the channel and spatial domains. The equalized features contain consistent structure and texture information at different CNN feature levels, and supplement the decoder via skip connections for output image generation.

Fig. 2 shows the pipeline of the proposed method. We use one mutual encoder-decoder to jointly learn structure and texture features and equalize them for consistent representation. The details are presented as follows:

3.1 Mutual Encoder-Decoder

We use an encoder-decoder for end-to-end image generation to fill holes. The structure of this encoder-decoder is a simplified generative network [14] with 6 convolutional layers in the encoder and 5 convolutional layers in the decoder. Meanwhile, 4 residual blocks [10] with dilated convolutions are placed between the encoder and decoder. The dilated convolutions [13, 24] enlarge the receptive field for perceiving the encoder features.

In the encoder, we reorganize the CNN features from the deep layers as structure features, where the semantics reside. Meanwhile, we reorganize the CNN features from the shallow layers as texture features to represent image details. We denote the structure features as $F_{st}$ and the texture features as $F_{te}$, as shown in Fig. 2. The reorganization process resizes and transforms the CNN feature maps from different convolutional layers to the same size and concatenates them accordingly.

After CNN feature reorganization, we design two branches (i.e., the structure branch and the texture branch) to separately perform hole filling on $F_{st}$ and $F_{te}$. The architectures of these two branches are the same. In each branch, there are 3 parallel streams to fill holes at multiple scales. Each stream consists of 5 partial convolutions [20] with the same kernel size, while the kernel size differs across streams. By using different kernel sizes, we perform multi-scale filling of the input CNN features in each branch. The filled features from the 3 streams (i.e., 3 scales) are concatenated and mapped back to the size of the input feature map via a convolution. We denote the output of the structure branch as $F_{fst}$ and the output of the texture branch as $F_{fte}$. To make the hole filling focus on structures and textures, we incorporate supervisions on $F_{fst}$ and $F_{fte}$. We use a convolution to separately map $F_{fte}$ and $F_{fst}$ to a color image $I_{te}$ and a color image $I_{st}$, respectively. The pixel-wise L1 loss can be written as follows:

$\mathcal{L}_{rst} = \|I_{te} - I_{gt}\|_1 + \|I_{st} - S_{gt}\|_1$    (1)

where $I_{gt}$ is the ground truth image and $S_{gt}$ is the structure image of $I_{gt}$. We use an edge-preserving smoothing method, RTV [34], to generate $S_{gt}$ following [26].
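
The following PyTorch sketch illustrates the layout of one such filling branch: three parallel streams with different kernel sizes whose outputs are concatenated and mapped back to the input size by a 1x1 convolution. It is only a rough sketch; plain convolutions stand in for the partial convolutions [20] used in the paper, and the kernel sizes (3, 5, 7), the activation choice, and the omitted mask handling are our assumptions.

```python
import torch
import torch.nn as nn

class MultiScaleFillingBranch(nn.Module):
    """One filling branch: 3 parallel streams of 5 convolutions each.
    Plain Conv2d is a stand-in for the partial convolutions used in the paper;
    the kernel sizes (3, 5, 7) and ReLU activations are assumptions."""
    def __init__(self, channels, kernel_sizes=(3, 5, 7), depth=5):
        super().__init__()
        self.streams = nn.ModuleList()
        for k in kernel_sizes:
            layers = []
            for _ in range(depth):
                layers += [nn.Conv2d(channels, channels, k, padding=k // 2),
                           nn.ReLU(inplace=True)]
            self.streams.append(nn.Sequential(*layers))
        # Map the concatenated multi-scale outputs back to the input size.
        self.fuse = nn.Conv2d(channels * len(kernel_sizes), channels, 1)

    def forward(self, x):
        multi_scale = torch.cat([stream(x) for stream in self.streams], dim=1)
        return self.fuse(multi_scale)
```

Two branches of this form, with identical architecture, would separately process $F_{st}$ and $F_{te}$ to produce $F_{fst}$ and $F_{fte}$.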

The hole regions in $F_{st}$ and $F_{te}$ are filled via the structure and texture branches individually. However, the feature representations in $F_{fst}$ and $F_{fte}$ are not consistent in reflecting the recovered structures and textures. This inconsistency leads to blur and artifacts within and around the hole regions, as shown in Fig. 1. To mitigate these effects, we first concatenate $F_{fst}$ and $F_{fte}$ and perform a simple fusion via a convolutional layer to generate $F_{sf}$. The texture and structure representations in $F_{sf}$ are then corrected via feature equalization at different CNN feature levels (i.e., across shallow to deep CNN layers).

3.2 Feature Equalizations

We equalize the fused CNN feature $F_{sf}$ in both the channel and spatial domains. The channel equalization follows the squeeze-and-excitation operation [12] to reweight the channels of $F_{sf}$ so that their attentions are consistent. As the reweighted channels are influenced by both the structure and texture representations in $F_{sf}$, consistent channel attentions indicate that these representations are made consistent as well. We then propagate the channel equalization to the spatial domain via the proposed bilateral propagation activation function (BPA).
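
A minimal sketch of this channel equalization step, assuming a standard squeeze-and-excitation block [12] applied to the fused feature $F_{sf}$; the reduction ratio of 16 is an assumption.

```python
import torch.nn as nn

class ChannelEqualization(nn.Module):
    """Squeeze-and-excitation style channel reweighting [12] for the fused
    feature F_sf; the reduction ratio of 16 is an assumption."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)      # squeeze: global average pooling
        self.fc = nn.Sequential(                 # excitation: two fully connected layers
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid())

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                             # reweight each channel of F_sf
```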

3.2.1 Formulation.

BPA is inspired by edge-preserving image smoothing [29], which generates response values based on spatial and range distances. It can be written as follows:

$F^{sp}_i = \frac{1}{C(F)} \sum_{\forall j \in \Omega} G(i, j)\, F_j$    (2)
$F^{rg}_i = \frac{1}{C(F)} \sum_{\forall j \in \Omega_i} f(F_i, F_j)\, F_j$    (3)
$F^{eq} = \psi(F^{sp}, F^{rg})$    (4)

where $F_i$ is the feature channel at position $i$ of the input feature $F_{sf}$, $F_j$ is a neighboring feature channel around $F_i$ at position $j$, and $F^{sp}_i$ and $F^{rg}_i$ are the feature channels after the spatial and range similarity measurements, respectively. We set the normalization factor $C(F)$ to $N$, where $N$ is the number of positions in $F_{sf}$. We use $\psi(\cdot,\cdot)$ to denote the concatenation and channel reduction of $F^{sp}$ and $F^{rg}$ via a convolutional layer.

The bilateral propagation utilizes the distances between feature channels in both the spatial and range domains. We explore $F_j$ within a neighboring region $\Omega$, which is set to the same spatial size as the input feature for global propagation. The spatial contributions of neighboring feature channels are adjusted via a Gaussian function $G(i, j)$. When computing $F^{rg}_i$, we measure the similarities between the feature channels $F_i$ and $F_j$ via $f(\cdot,\cdot)$ within a local neighboring region $\Omega_i$ of fixed size around position $i$. To this end, the bilateral propagation considers both global continuity via $F^{sp}$ and local consistency via $F^{rg}$.

During the range similarity computation step, we define the pairwise function $f(\cdot,\cdot)$ as a dot product operation, which can be written as follows:

$f(F_i, F_j) = F_i^{\top} F_j$    (5)

The proposed bilateral propagation is similar to the non-local block [30], in which, for each position $i$, the normalized pairwise term $\frac{1}{C(F)} f(F_i, F_j)$ becomes a softmax computation along dimension $j$. The difference lies in the design of the propagation regions. The non-local block uses feature channels from all positions to generate the output, and the similarity is only measured between $F_i$ and $F_j$. In contrast, BPA considers both the feature channel similarity and the spatial distance between $F_i$ and $F_j$ during bilateral weight computation. In addition, we use a global region to compute the spatial distance while using a local region to compute the range distance. The advantage of this global and local region selection is that it ensures both long-term continuity over the whole spatial region and local consistency around the current feature channel. The boundaries of the hole regions are thus unified with the neighboring image content, and the content within the hole regions is made consistent.

Figure 3: The pipeline of the bilateral propagation activation function. The broadcast dot product, the element-wise addition over the selected channels, and the concatenation are denoted by the corresponding operators in the figure. For two matrices with different dimensions, the broadcast operations first replicate features along each dimension so that the dimensions of the two matrices match.

3.2.2 Implementations.

Fig. 3 shows how the bilateral propagation operates in the network. The range step corresponds to the computation of $F^{rg}$ in eq. 3, and the spatial step corresponds to $F^{sp}$ in eq. 2. During range computation, the operations up to the first element-wise multiplication in Fig. 3 represent eq. 5 at all spatial locations. We use the unfold function in PyTorch to reshape the feature maps into vectors so that all the neighboring $F_j$ of each $F_i$ are gathered, which allows the computation to be carried out as efficient element-wise matrix multiplications. Similarly, the subsequent operations represent the term $f(F_i, F_j)\, F_j$ in eq. 3. During spatial computation, the corresponding operations represent the term $G(i, j)\, F_j$ in eq. 2. As a result, the bilateral propagation can be efficiently executed via the element-wise matrix multiplications and additions shown in Fig. 3.
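
To make the unfold-based computation concrete, the sketch below shows one possible implementation of BPA following eqs. 2-5: the range step gathers a local window around every position and weights the neighbors by their dot product with the center feature, while the spatial step aggregates all positions with Gaussian weights on their spatial distances. The local window size, the Gaussian bandwidth, and the exact normalization are our assumptions, and the dense Gaussian matrix is kept only for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BilateralPropagation(nn.Module):
    """Sketch of the bilateral propagation activation (BPA).
    The window size k and Gaussian bandwidth sigma are assumptions."""
    def __init__(self, channels, k=3, sigma=3.0):
        super().__init__()
        self.k, self.sigma = k, sigma
        # psi: concatenation + channel reduction via a 1x1 convolution (eq. 4).
        self.psi = nn.Conv2d(channels * 2, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        n = h * w                                          # number of positions N

        # Range step (eq. 3): dot-product similarity within a local k x k window.
        neigh = F.unfold(x, self.k, padding=self.k // 2)   # (b, c*k*k, n)
        neigh = neigh.view(b, c, self.k * self.k, n)       # neighbors F_j of each F_i
        center = x.reshape(b, c, 1, n)                     # F_i
        sim = (center * neigh).sum(dim=1, keepdim=True)    # f(F_i, F_j) = F_i . F_j
        f_rg = (sim * neigh).sum(dim=2) / n                # normalize by N
        f_rg = f_rg.view(b, c, h, w)

        # Spatial step (eq. 2): Gaussian weights on pairwise spatial distances.
        ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
        coords = torch.stack([ys, xs], dim=-1).reshape(n, 2).float().to(x.device)
        g = torch.exp(-torch.cdist(coords, coords) ** 2 / (2 * self.sigma ** 2))
        f_sp = (x.reshape(b, c, n) @ g.t() / n).view(b, c, h, w)

        # Fuse the two responses (eq. 4).
        return self.psi(torch.cat([f_sp, f_rg], dim=1))
```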

3.3 Loss functions

We introduce several loss functions to measure structure and texture differences during training, including a pixel reconstruction loss, a perceptual loss, a style loss, and a relativistic average LS adversarial loss [16]. We also employ a discriminator with local and global operations to ensure local-global content consistency, and spectral normalization [23] is applied to both the local and global discriminators to stabilize training.

Pixel Reconstruction Loss.

We measure the pixel-wise difference from two aspects. The first one is the loss term in eq. 1, where we add supervisions on the texture and structure branches. The second one measures the similarity between the network output and the ground truth, which can be written as follows:

$\mathcal{L}_{re} = \|I_{out} - I_{gt}\|_1$    (6)

where $I_{out}$ is the final image predicted by the network.

Perceptual Loss.

To capture high-level semantics and simulate human perception of image quality, we utilize the perceptual loss [15] defined on the ImageNet-pretrained VGG-16 feature backbone:

$\mathcal{L}_{perc} = \sum_{i} \|\Phi_i(I_{out}) - \Phi_i(I_{gt})\|_1$    (7)

where $\Phi_i$ is the activation map of the $i$-th selected layer of the VGG-16 backbone. In our work, $\Phi_i$ corresponds to the activation maps of layers ReLU1_1, ReLU2_1, ReLU3_1, ReLU4_1, and ReLU5_1.
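
A sketch of this perceptual loss, assuming torchvision's VGG-16 and assuming that the layer indices below correspond to the listed ReLU activations:

```python
import torch
import torch.nn as nn
import torchvision.models as models

class PerceptualLoss(nn.Module):
    """L1 distance between VGG-16 activations of the output and ground truth.
    The indices (1, 6, 11, 18, 25) are assumed to match relu1_1 ... relu5_1
    in torchvision's VGG-16 feature extractor."""
    def __init__(self, layer_ids=(1, 6, 11, 18, 25)):
        super().__init__()
        vgg = models.vgg16(pretrained=True).features.eval()
        for p in vgg.parameters():
            p.requires_grad = False
        self.vgg, self.layer_ids = vgg, set(layer_ids)

    def extract(self, x):
        feats = []
        for i, layer in enumerate(self.vgg):
            x = layer(x)
            if i in self.layer_ids:
                feats.append(x)
        return feats

    def forward(self, output, target):
        loss = 0.0
        for fo, ft in zip(self.extract(output), self.extract(target)):
            loss = loss + torch.mean(torch.abs(fo - ft))   # L1 on each activation map
        return loss
```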

Style Loss.

The transposed convolutional layers in the decoder tend to introduce checkerboard-like artifacts. To mitigate this effect, we introduce the style loss. Given feature maps $\Phi_i$ of size $C_i \times H_i \times W_i$, we compute the style loss as follows:

$\mathcal{L}_{style} = \sum_{i} \|G_i^{\Phi}(I_{out}) - G_i^{\Phi}(I_{gt})\|_1$    (8)

where $G_i^{\Phi}$ is a $C_i \times C_i$ Gram matrix constructed from the selected activation maps. These activation maps are the same as those used in the perceptual loss.
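
A sketch of the style loss built on Gram matrices of the same selected VGG-16 activations; normalizing each Gram matrix by $C_i H_i W_i$ is an assumption.

```python
import torch

def gram_matrix(feat):
    """Gram matrix of a (B, C, H, W) activation map, normalized by C*H*W (assumed)."""
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def style_loss(feats_out, feats_gt):
    """L1 distance between Gram matrices of the selected activation maps,
    e.g. the lists returned by PerceptualLoss.extract above."""
    loss = 0.0
    for fo, fg in zip(feats_out, feats_gt):
        loss = loss + torch.mean(torch.abs(gram_matrix(fo) - gram_matrix(fg)))
    return loss
```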

Relativistic Average LS Adversarial Loss.

We follow [41] to utilize global and local discriminators for perception enhancement. The relativistic average LS adversarial loss is adopted for our discriminators. For the generator, the adversarial loss is defined as:

$\mathcal{L}_{adv} = \mathbb{E}_{x_r}\!\left[(D(x_r) - \mathbb{E}_{x_f}[D(x_f)] + 1)^2\right] + \mathbb{E}_{x_f}\!\left[(D(x_f) - \mathbb{E}_{x_r}[D(x_r)] - 1)^2\right]$    (9)

where $D$ indicates the local or global discriminator without the last sigmoid function, and the real and fake data pairs $(x_r, x_f)$ are sampled from the ground-truth and output images.
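
A sketch of the generator-side loss, assuming that the standard RaLSGAN formulation [16] matches eq. 9; d_real and d_fake are the raw discriminator scores (no final sigmoid) on ground-truth and generated images.

```python
import torch

def ra_ls_generator_loss(d_real, d_fake):
    """Relativistic average least-squares adversarial loss for the generator [16].
    Assumes the standard RaLSGAN form; applied to both the local and global
    discriminators in this framework."""
    real_rel = d_real - d_fake.mean()   # how real the real samples look relative to fakes
    fake_rel = d_fake - d_real.mean()   # how real the fake samples look relative to reals
    return torch.mean((real_rel + 1.0) ** 2) + torch.mean((fake_rel - 1.0) ** 2)
```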

Total Losses.

The whole objective function of the proposed network can be written as:

$\mathcal{L}_{total} = \lambda_{rst}\mathcal{L}_{rst} + \lambda_{re}\mathcal{L}_{re} + \lambda_{perc}\mathcal{L}_{perc} + \lambda_{style}\mathcal{L}_{style} + \lambda_{adv}\mathcal{L}_{adv}$    (10)

where $\lambda_{rst}$, $\lambda_{re}$, $\lambda_{perc}$, $\lambda_{style}$, and $\lambda_{adv}$ are tradeoff parameters, which we set empirically in our implementation.

Figure 4: Visualization of the feature map responses. The input and output images are shown in (a) and (e), respectively. We use a convolutional layer to map the high-dimensional feature maps to color images, as shown in (b)-(d) and (f)-(h).

3.4 Visualizations

We use a structure branch and a texture branch to separately fill holes in the CNN feature space. Then, we perform feature equalization to enable consistent feature representations at different feature levels for output image reconstruction. In this section, we visualize the feature maps at different steps to show whether they correspond to our objectives. We use a convolutional layer to map the CNN feature maps to color images for a clear display.

Fig. 4 shows the visualization results. The input image is shown in (a) with a mask in the center. The visualized $F_{te}$ and $F_{st}$ are shown in (b) and (f), respectively. We observe that textures are preserved in (b), while structures are preserved in (f). Through multi-scale hole filling, the hole regions in $F_{fte}$ and $F_{fst}$ are effectively reduced, as shown in (c) and (g). After equalization, the hole regions in (h) are effectively filled, and the equalized features contribute to the decoder to generate the output image shown in (e).

(a) Input (b) CE [25] (c) CA [41] (d) SH [36] (e) Ours (f) GT
Figure 5: Visual evaluations for filling center holes. Our method performs favorably against existing approaches to retain both structures and textures.

4 Experiments

We evaluate our method on three datasets: Paris StreetView [6], Place2 [43], and CelebA [22]. We follow the training, testing, and validation splits of these three datasets. Data augmentation such as flipping is also adopted during training. Our model is optimized with the Adam optimizer [17] on a single NVIDIA 2080Ti GPU. Training of the CelebA, Paris StreetView, and Place2 models is stopped after 6, 30, and 60 epochs, respectively. All masks and images for training and testing are of size 256×256.

We compare our method with six state-of-the-art methods: CE [25], CA [41], SH [36], CSA [21], SF [26], and GC [40]. For a fair evaluation of model generalization abilities, we conduct experiments on filling both center holes and irregular holes in the input images. The center hole is produced by a mask that covers the image center. We obtain irregular masks from PConv [20]. These masks fall into different categories according to the ratio of the hole region to the entire image size (i.e., below 10%, from 10% to 20%, etc.). For holes in the image center, we compare with CA [41], SH [36], and CE [25] on the CelebA [22] validation set. We choose these three methods because they are more effective at filling holes in the image center than at filling irregular holes. When handling irregular holes in the input images, we compare with CSA [21], SF [26], and GC [40] on the Paris StreetView [6] and Place2 [43] validation sets.

(a) Input (b) GC [40] (c) SF [26] (d) CSA [21] (e) Ours (f) GT
Figure 6: Visual evaluations for filling irregular holes. Our method performs favorably against existing approaches to retain both structures and textures.

4.1 Visual Evaluations

The visual comparisons for filling center holes are shown in Fig. 5, and those for filling irregular holes are shown in Fig. 6. We also display the ground truth images in (f) to show the actual image content. In Fig. 5, the input images are shown in (a). The results produced by CE and CA contain distorted structures and blurry textures, as shown in (b) and (c). Although more visually pleasing content is generated in (d), the semantics are still unreasonable. By utilizing consistent structure and texture features, our method is effective in generating results with realistic textures.

Fig. 6 shows the comparison for filling irregular holes, which is more challenging than filling center holes. The results from GC contain noisy patterns, as shown in (b). The details are missing and the structures are distorted in (c) and (d). These methods are not able to recover the image content without introducing obvious artifacts (e.g., around the door regions in the second row). In contrast, our method learns to represent structures and textures in a consistent manner. The results shown in (e) indicate the effectiveness of our method in producing visually pleasing content. The evaluations on filling both center holes and irregular holes indicate that our method performs favorably against existing hole filling approaches.

4.2 Numerical Evaluations

          CE      CA      SH      Ours
FID↓      52.17   37.61   29.72   25.51
PSNR↑     8.53    23.65   26.10   26.32
SSIM↑     0.137   0.870   0.902   0.910
Table 1: Numerical evaluations on the CelebA dataset where the inputs contain center hole regions. ↓ indicates lower is better and ↑ indicates higher is better.
Metric   Mask     GC      SF      CSA     Ours
FID↓     10-20%   19.04   8.78    7.85    6.91
         20-30%   28.45   16.38   13.95   8.06
         30-40%   40.71   27.54   25.74   19.36
         40-50%   60.72   40.93   38.74   28.79
PSNR↑    10-20%   27.10   29.50   31.31   31.13
         20-30%   25.18   27.22   28.66   28.87
         30-40%   22.51   24.37   25.01   25.34
         40-50%   20.35   21.90   22.54   22.81
SSIM↑    10-20%   0.929   0.926   0.954   0.957
         20-30%   0.878   0.885   0.918   0.923
         30-40%   0.823   0.802   0.843   0.854
         40-50%   0.670   0.678   0.702   0.719
Table 2: Numerical comparisons on the Place2 dataset. ↓ indicates lower is better and ↑ indicates higher is better.

We conduct numerical evaluations on the Place2 dataset with different mask ratios. In addition, we evaluate numerically on the CelebA dataset with center holes in the input images. For Place2, 100 validation images from the "valley" scene category are chosen for evaluation. For CelebA, we randomly choose 500 images for evaluation. For the evaluation metrics, we follow [26] in using SSIM [32] and PSNR. Moreover, we introduce the FID (Fréchet Inception Distance) metric [11], as it indicates the perceptual quality of the results. The evaluation results are shown in Table 1 and Table 2. Our method outperforms existing methods in filling center holes. Meanwhile, our method achieves favorable performance in filling irregular holes under various hole-versus-image ratios.

CE CA SH GC SF CSA Ours
Paris StreetView N/A N/A N/A 5.3% 21.0% 29.8% 43.7%
Place2 N/A N/A N/A 3.0% 25.0% 29.6% 42.4%
CelebA 1.2% 2.0% 40.4% N/A N/A N/A 56.4%
Table 3: Human Subject Evaluation results. Each subject selects the most realistic result without knowing hole regions in advance.

Human Subject Evaluation.

We follow [42] and involve over 35 volunteers to evaluate the results on the CelebA, Place2, and Paris StreetView datasets. The volunteers are all image experts with an image processing background. There are 20 questions for each subject. In each question, the subject needs to select the most realistic result from 4 results generated by different methods, without knowing the hole region in advance. We tally the votes and show the statistics in Table 3. Our method performs favorably against existing methods.

         Ours without textures   Ours without structures   Ours
FID      30.37                   27.46                     25.10
PSNR     22.80                   22.96                     23.38
SSIM     0.818                   0.823                     0.833
Table 4: Ablation study on the Paris StreetView dataset. The structure and texture branches both improve our performance.
         Ours without equalization   Non-local aggregation   Ours
FID      29.11                       24.07                   21.26
PSNR     23.14                       23.64                   24.57
SSIM     0.837                       0.848                   0.852
Table 5: Ablation study on the Place2 dataset. Non-local aggregation improves our baseline while feature equalization makes further improvement.

5 Ablation Study

(a) Input image (b) Ours w/o textures (c) Ours w/o structures (d) Ours (e) Ground truth
Figure 7: Ablation studies on the structure and texture branches. A joint utilization of these two branches improves the content quality.
(a) Input image (b) Ours w/o equalization (c) Non-local aggregation (d) Ours (e) Ground truth
Figure 8: Ablation studies on feature equalizations. More realistic and visually pleasing contents are generated via feature equalization.

Structure and Texture Branches.

To evaluate the effects of the structure and texture branches, we use each of these branches separately for network training. For fair comparisons, we expand the channel number of the texture and structure branch outputs via additional convolutions, so that the single-branch output has the same size as $F_{sf}$. As shown in Fig. 7, the output of our method without the texture branch contains rich structure information (i.e., the window in the red and green boxes) while the textures are missing. In comparison, the output of our method without the structure branch does not contain meaningful structures (i.e., the window in the red and green boxes). By utilizing both branches, our method achieves favorable results on both structures and textures. Table 4 shows similar numerical results on the Paris StreetView dataset, where these two branches improve our method significantly.

Feature Equalizations.

We show the contributions of feature equalization by removing it from the pipeline and measuring the performance degradation. Moreover, we show that the bilateral propagation activation function (BPA) is more effective at filling hole regions than the non-local attention [30]. As shown in Fig. 8, without equalization our method generates visually unpleasant content and visible artifacts. In comparison, the content generated by [30] is more natural. However, the recovered content is still blurry and inconsistent because the non-local block ignores the local coherency and global distance of features. This limitation is effectively addressed by our method with feature equalizations. Similar performance is shown numerically in Table 5, where our method achieves favorable results.

6 Concluding Remarks

We propose a mutual encoder-decoder with feature equalizations to correlate filled structures with textures during image inpainting. The shallow and deep layer features are reorganized as texture and structure features, respectively. In the CNN feature space, we introduce a texture branch and a structure branch to fill holes at multiple scales and fuse the outputs together via feature equalizations. During equalization, we first ensure consistent attentions across channels and then propagate them to the whole spatial feature map via the proposed bilateral propagation activation function. Experiments on benchmark datasets have shown the effectiveness of our method compared to state-of-the-art approaches on filling both regular and irregular hole regions.

Acknowledgements.

This work is partially supported by the National Natural Science Foundation of China under Grant No. 61702176.

References

  • [1] C. Ballester, M. Bertalmio, V. Caselles, G. Sapiro, and J. Verdera (2001) Filling-in by joint interpolation of vector fields and gray levels. IEEE Transactions on Image Processing. Cited by: §2.
  • [2] C. Barnes, E. Shechtman, A. Finkelstein, and D. Goldman (2009) PatchMatch: a randomized correspondence algorithm for structural image editing. In ACM Transactions on Graphics (SIGGRAPH), Cited by: §1, §2.
  • [3] M. Bertalmio, G. Sapiro, V. Caselles, and C. Ballester (2000) Image inpainting. In ACM Transactions on Graphics (SIGGRAPH), Cited by: §1, §2.
  • [4] A. Criminisi, P. Pérez, and K. Toyama (2004) Region filling and object removal by exemplar-based image inpainting. IEEE Transactions on Image Processing. Cited by: §2.
  • [5] S. Darabi, E. Shechtman, C. Barnes, D. B. Goldman, and P. Sen (2012) Image melding: combining inconsistent images using patch-based synthesis. ACM Transactions on Graphics. Cited by: §2.
  • [6] C. Doersch, S. Singh, A. Gupta, J. Sivic, and A. A. Efros (2015) What makes Paris look like Paris?. Communications of the ACM. Cited by: Figure 1, §4, §4.
  • [7] A. Efros and W. Freeman (2001) Image quilting for texture synthesis and transfer. In ACM Transactions on Graphics (SIGGRAPH), Cited by: §1.
  • [8] A. Efros and W. Freeman (2001) Texture synthesis by nonparametric sampling. In IEEE International Conference on Computer Vision, Cited by: §1.
  • [9] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Neural Information Processing Systems, Cited by: §2.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §3.1.
  • [11] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Neural Information Processing Systems, Cited by: §4.2.
  • [12] J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §3.2.
  • [13] S. Iizuka, E. Simo-Serra, and H. Ishikawa (2017) Globally and locally consistent image completion. In ACM Transactions on Graphics (SIGGRAPH), Cited by: §1, §1, §2, §3.1.
  • [14] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §3.1.
  • [15] J. Johnson, A. Alahi, and L. Fei-Fei (2016) Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, Cited by: §3.3.
  • [16] A. Jolicoeur-Martineau (2018) The relativistic discriminator: a key element missing from standard gan. In International Conference on Learning Representations, Cited by: §3.3.
  • [17] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. In arXiv preprint arXiv:1412.6980, Cited by: §4.
  • [18] A. Levin, A. Zomet, and Y. Weiss (2003) Learning how to inpaint from global image statistics. In IEEE International Conference on Computer Vision, Cited by: §2.
  • [19] Y. Li, S. Liu, J. Yang, and M. Yang (2017) Generative face completion. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1.
  • [20] G. Liu, F. A. Reda, K. J. Shih, T. Wang, A. Tao, and B. Catanzaro (2018) Image inpainting for irregular holes using partial convolutions. In European Conference on Computer Vision, Cited by: §2, §3.1, §4.
  • [21] H. Liu, B. Jiang, Y. Xiao, and C. Yang (2019) Coherent semantic attention for image inpainting. In IEEE International Conference on Computer Vision, Cited by: Figure 1, §1, §2, Figure 6, §4.
  • [22] Z. Liu, P. Luo, X. Wang, and X. Tang (2015) Deep learning face attributes in the wild. In IEEE International Conference on Computer Vision, Cited by: §4, §4.
  • [23] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida (2018) Spectral normalization for generative adversarial networks. In arXiv preprint arXiv:1802.05957, Cited by: §3.3.
  • [24] K. Nazeri, E. Ng, T. Joseph, F. Z. Qureshi, and M. Ebrahimi (2019) Edgeconnect: generative image inpainting with adversarial edge learning. In ICCV Workshops, Cited by: §1, §2, §3.1.
  • [25] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. Efros (2016) Context encoders: feature learning by inpainting. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §1, §2, Figure 5, §4.
  • [26] Y. Ren, X. Yu, R. Zhang, T. H. Li, S. Liu, and G. Li (2019) StructureFlow: image inpainting via structure-aware appearance flow. In IEEE International Conference on Computer Vision, Cited by: §1, §2, §3.1, Figure 6, §4.2, §4.
  • [27] Y. Song, L. Bao, S. He, Q. Yang, and M. Yang (2017) Stylizing face images via multiple exemplars. CVIU. Cited by: §2.
  • [28] Y. Song, C. Yang, Y. Shen, P. Wang, Q. Huang, and J. Kuo (2018) Spg-net: segmentation prediction and guidance network for image inpainting. In arXiv preprint arXiv:1805.03356, Cited by: §1, §2.
  • [29] C. Tomasi and R. Manduchi (1998) Bilateral filtering for gray and color images. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §3.2.1.
  • [30] X. Wang, R. Girshick, A. Gupta, and K. He (2018) Non-local neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §3.2.1, §5.
  • [31] Y. Wang, X. Tao, X. Qi, X. Shen, and J. Jia (2018) Image inpainting via generative multi-column convolutional neural networks. In Neural Information Processing Systems, Cited by: §2.
  • [32] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli (2004) Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing. Cited by: §4.2.
  • [33] W. Xiong, J. Yu, Z. Lin, J. Yang, X. Lu, C. Barnes, and J. Luo (2019) Foreground-aware image inpainting. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • [34] L. Xu, Q. Yan, Y. Xia, and J. Jia (2012) Structure extraction from texture via relative total variation. SIGGRAPH. Cited by: §3.1.
  • [35] Z. Xu and J. Sun (2010) Image inpainting by patch propagation using patch sparsity. IEEE Transactions on Image Processing. Cited by: §2.
  • [36] Z. Yan, X. Li, M. Li, W. Zuo, and S. Shan (2018) Shift-Net: image inpainting via deep feature rearrangement. In European Conference on Computer Vision, Cited by: §2, Figure 5, §4.
  • [37] Y. Song, C. Yang, Z. Lin, X. Liu, Q. Huang, H. Li, and C.-C. J. Kuo (2018) Contextual-based image inpainting: infer, match, and translate. In European Conference on Computer Vision, Cited by: §1, §2.
  • [38] R. Yeh, C. Chen, T. Y. Lim, M. Hasegawa-Johnson, and M. N. Do (2016) Semantic image inpainting with perceptual and contextual losses. In arXiv preprint arXiv:1607.07539, Cited by: §1.
  • [39] F. Yu and V. Koltun (2015) Multi-scale context aggregation by dilated convolutions. In arXiv preprint arXiv:1511.07122, Cited by: §2.
  • [40] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang (2019) Free-form image inpainting with gated convolution. In IEEE International Conference on Computer Vision, Cited by: Figure 1, §1, §2, Figure 6, §4.
  • [41] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang (2018) Generative image inpainting with contextual attention. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §2, Figure 5, §3.3, §4.
  • [42] Y. Zeng, J. Fu, H. Chao, and B. Guo (2019) Learning pyramid-context encoder network for high-quality image inpainting. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §4.2.
  • [43] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba (2017) Places: a 10 million image database for scene recognition. PAMI. Cited by: §4, §4.