Region Normalization for Image Inpainting

11/23/2019 · by Tao Yu, et al. · USTC

Feature Normalization (FN) is an important technique to help neural network training, which typically normalizes features across spatial dimensions. Most previous image inpainting methods apply FN in their networks without considering the impact of the corrupted regions of the input image on normalization, e.g. mean and variance shifts. In this work, we show that the mean and variance shifts caused by full-spatial FN limit the image inpainting network training and we propose a spatial region-wise normalization named Region Normalization (RN) to overcome the limitation. RN divides spatial pixels into different regions according to the input mask, and computes the mean and variance in each region for normalization. We develop two kinds of RN for our image inpainting network: (1) Basic RN (RN-B), which normalizes pixels from the corrupted and uncorrupted regions separately based on the original inpainting mask to solve the mean and variance shift problem; (2) Learnable RN (RN-L), which automatically detects potentially corrupted and uncorrupted regions for separate normalization, and performs global affine transformation to enhance their fusion. We apply RN-B in the early layers and RN-L in the latter layers of the network respectively. Experiments show that our method outperforms current state-of-the-art methods quantitatively and qualitatively. We further generalize RN to other inpainting networks and achieve consistent performance improvements.






Code Repositories


Region Normalization for Image Inpainting, accepted by AAAI-2020


1 Introduction

Image inpainting aims to reconstruct the corrupted (or missing) regions of the input image. It has many applications in image editing such as object removal, face editing and image disocclusion. A key issue in image inpainting is to generate visually plausible content in the corrupted regions.

Existing image inpainting methods can be divided into two groups: traditional and learning-based methods. The traditional methods fill the corrupted regions by diffusion-based methods [4, 2, 10, 5] that propagate neighboring information into them, or patch-based methods [8, 3, 27, 6] that copy similar patches into them. The learning-based methods commonly train neural networks to synthesize content in the corrupted regions, which yield promising results and have significantly surpassed the traditional methods in recent years.

Figure 1: Illustration of our Region Normalization (RN) with region number 2. Pixels in the same color (green or pink) are normalized by the same mean and variance. The corrupted and uncorrupted regions of the input image are normalized by different means and variances.

Recent image inpainting works, such as [29, 17, 30, 19], focus on learning-based methods. Most of them design an advanced network to improve performance, but ignore an inherent property of the image inpainting problem: unlike the input of a general vision task, the input image of inpainting has corrupted regions that are typically independent of the uncorrupted regions. Feeding a corrupted image into a neural network as if it were a spatially consistent image causes problems such as convolution over invalid (corrupted) pixels and mean and variance shifts in normalization. Partial convolution [17] solves the invalid-convolution problem by operating on only valid pixels and achieves a performance boost. However, no existing method solves the mean and variance shift problem of normalization in inpainting networks. In particular, most existing methods apply feature normalization (FN) in their networks to help training, and existing FN methods typically normalize features across the full spatial dimensions, ignoring the corrupted regions and causing mean and variance shifts.

In this work, we show in theory and experiment that the mean and variance shifts caused by existing full-spatial normalization limit the image inpainting network training. To overcome the limitation, we propose Region Normalization (RN), a spatially region-wise normalization method that divides spatial pixels into different regions according to the input mask and computes the mean and variance in each region for normalization. RN can effectively solve the mean and variance shift problem and improve the inpainting network training.

We further design two kinds of RN for our image inpainting network: Basic RN (RN-B) and Learnable RN (RN-L). In the early layers of the network, the input image has large corrupted regions, which results in severe mean and variance shifts. Thus we apply RN-B to solve the problem by normalizing corrupted and uncorrupted regions separately. The input mask of RN-B is obtained from the original inpainting mask. After passing through several convolutional layers, the corrupted regions are fused gradually, making it difficult to obtain a region mask from the original mask. Therefore, we apply RN-L in the latter layers of the network, which learns to detect potentially corrupted regions by utilizing the spatial relationship of the input feature and generates a region mask for RN. Additionally, RN-L can also enhance the fusion of corrupted and uncorrupted regions by global affine transformation. RN-L not only solves the mean and variance shift problem, but also boosts the reconstruction of corrupted regions.

We conduct experiments on Places2 [32] and CelebA [18] datasets. The experimental results show that, with the help of RN, a simple backbone can surpass current state-of-the-art image inpainting methods. In addition, we generalize our RN to other inpainting networks and yield consistent performance improvements.

Our contributions in this work include:

  • Both theoretically and experimentally, we show that existing full-spatial normalization methods are sub-optimal for image inpainting.

  • To the best of our knowledge, we are the first to propose a spatially region-wise normalization, Region Normalization (RN).

  • We propose two kinds of RN for image inpainting and show how to use them to achieve state-of-the-art inpainting results.

2 Related Work

2.1 Image Inpainting

Previous works in image inpainting can be divided into two categories: traditional and learning-based methods.

Traditional methods use diffusion-based [4, 2, 10, 5] or patch-based [8, 3, 27, 6] methods to fill the holes. The former propagate neighboring information into holes. The latter typically copy similar patches into the holes. The performance of these traditional methods is limited since they cannot use semantic information.

Learning-based methods can learn to extract semantic information from massive training data and thus significantly improve inpainting results. These methods map a corrupted image directly to the completed image. ContextEncoder [21], one of the pioneering learning-based methods, trains a convolutional neural network to complete the image. With the introduction of generative adversarial networks (GANs) [11], GAN-based methods [28, 13, 29, 26, 19] have been widely used in image inpainting. ContextualAttention [29] is a popular model with a coarse-to-fine architecture. Considering that a corrupted image contains valid/uncorrupted and invalid/corrupted regions, partial convolution (PConv) [17] operates on only valid pixels and achieves promising results. Gated convolution [30] generalizes PConv with a soft distinction between valid and invalid regions. EdgeConnect [19] first predicts the edges of the corrupted regions, then generates the completed image with the help of the predicted edges.

However, most existing inpainting methods ignore the impact of the corrupted regions of the input image on normalization, which is a crucial technique for network training.

2.2 Normalization

Feature normalization layers have been widely applied in deep neural networks to help training.

Batch Normalization (BN) [14], which normalizes activations across the batch and spatial dimensions, has been widely used in discriminative networks to speed up convergence and improve model robustness, and has also been found effective in generative networks. Instance Normalization (IN) [22], distinguished from BN by normalizing activations across only the spatial dimensions, achieves a significant improvement in many generative tasks such as style transfer. Layer Normalization (LN) [1] normalizes activations across the channel and spatial dimensions (i.e., all features of an instance), which helps recurrent neural network training. Group Normalization (GN) [25] normalizes features of grouped channels of an instance and improves the performance of some vision tasks such as object detection.

Different from a single set of affine parameters in the above normalization methods, conditional normalization methods typically use external data to reason multiple sets of affine parameters. Conditional instance normalization (CIN) [9], adaptive instance normalization (AdaIN) [12], conditional batch normalization (CBN) [7] and spatially adaptive denormalization (SPADE) [20] have been proposed in some image synthesis tasks.

None of the existing normalization methods considers the impact of the input's spatial distribution on normalization.

3 Approach

In this section, we show that existing full-spatial normalization methods are sub-optimal for the image inpainting problem, as motivation for Region Normalization (RN). We then introduce two kinds of RN for image inpainting, Basic RN (RN-B) and Learnable RN (RN-L). Finally, we introduce our image inpainting network using RN.

3.1 Motivation for Region Normalization

Problem in Normalization.

Figure 2: (a) $F_a$ is the original feature map. $F_b$ with mask performs full-spatial normalization over all regions. $F_c$ performs separate normalization in the masked and unmasked regions. (b) The distribution of $F_b$'s unmasked area has a shift into the nonlinear (saturated) region, which easily causes the vanishing gradient problem; $F_c$ does not have this problem.

$F_a$, $F_b$ and $F_c$ are three feature maps of the same size, each with $N$ pixels, as shown in Figure 2. $F_a$ is the original uncorrupted feature map. $F_b$ and $F_c$ are two different normalization results of the feature map with masked and unmasked areas. $N_m$ and $N_u$ are the pixel numbers of the masked and unmasked areas, respectively, so $N = N_m + N_u$. Specifically, $F_b$ is normalized over all the areas jointly, while $F_c$ is normalized separately in the masked and unmasked areas. Assuming the masked-region pixels all take the maximum value $M$, we denote the mean and standard deviation of the three feature maps by $(\mu_a, \sigma_a)$, $(\mu_b, \sigma_b)$, $(\mu_c^m, \sigma_c^m)$ and $(\mu_c^u, \sigma_c^u)$, where the subscripts $a$ and $b$ refer to the entire areas of $F_a$ and $F_b$, and the superscripts $m$ and $u$ refer to the masked and unmasked areas of $F_c$. The statistics used by the joint normalization then satisfy

$\mu_b = \frac{N_m M + N_u \mu_c^u}{N}, \qquad \sigma_b^2 = \frac{N_u (\sigma_c^u)^2 + N_m (M - \mu_b)^2 + N_u (\mu_c^u - \mu_b)^2}{N}.$

After normalizing the masked and unmasked areas together, the unmasked area's mean is pulled toward $M$ and the variance used for its normalization increases compared with $F_a$ and $F_c$. According to [14], normalization shifts and scales the distribution of features into a small region where the mean is zero and the variance is one. We take batch normalization (BN) as an example here. For each point $x$,

$\hat{x} = \frac{x - \mu_b}{\sigma_b}.$

Compared with $F_a$'s unmasked area, the distribution of $F_b$'s unmasked area narrows down and shifts from zero toward the negative value $(\mu_c^u - \mu_b)/\sigma_b$. Then, for both fully-connected and convolutional layers, the affine transformation is followed by an element-wise nonlinearity [14]:

$z = g(Wu + b),$

where $g$ is a nonlinear activation function such as ReLU or sigmoid. The BN transform is added immediately before the nonlinearity, by normalizing $x = Wu + b$; $W$ and $b$ are learned parameters of the model.

As shown in Figure 2, in the ReLU and sigmoid activations, the distribution region of $F_b$ is narrowed down and shifted by the masked area, which increases the internal covariate shift and easily gets stuck in the saturated regimes of nonlinearities (causing the vanishing gradient problem), wasting a lot of training time for $\gamma$, $\beta$ and $W$ to fix the problem. In contrast, $F_c$, which normalizes the masked and unmasked regions separately, reduces the internal covariate shift, preserving network capacity and improving training efficiency.
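The shift derived above is easy to reproduce numerically. The following minimal numpy sketch (the feature values and mask layout are synthetic, chosen for illustration) fills the masked half of a feature map with the maximum value 1.0 and compares full-spatial normalization against region-wise normalization:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic one-channel feature map: unmasked pixels ~ N(0.5, 0.1),
# masked (corrupted) pixels filled with the max value M = 1.0.
h = w = 32
feat = rng.normal(0.5, 0.1, size=(h, w))
mask = np.zeros((h, w), dtype=bool)
mask[:, : w // 2] = True          # left half is "corrupted"
feat[mask] = 1.0                  # corrupted pixels share one value

eps = 1e-5

# Full-spatial normalization: one mean/std over all pixels.
mu_b, sd_b = feat.mean(), feat.std()
fb = (feat - mu_b) / (sd_b + eps)

# Region-wise normalization: separate statistics per region.
fc = np.empty_like(feat)
for region in (mask, ~mask):
    mu, sd = feat[region].mean(), feat[region].std()
    fc[region] = (feat[region] - mu) / (sd + eps)

print(fb[~mask].mean())   # clearly negative: the unmasked pixels are shifted
print(fc[~mask].mean())   # approximately zero: no shift
```

Under full-spatial normalization the unmasked pixels end up with a clearly negative mean, i.e. their distribution is pushed toward the saturated regime of activations such as sigmoid; normalizing the two regions separately keeps them centered at zero.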

Motivated by this, we design a spatial region-wise normalization named Region Normalization (RN).

Formulation of Region Normalization.

Let $F \in \mathbb{R}^{N\times C\times H\times W}$ be the input feature, where $N$, $C$, $H$ and $W$ are the batch size, number of channels, height and width, respectively. Let $F_{n,c,h,w}$ be a pixel of $F$ and $F_{n,c}$ be a channel of $F$, where $(n, c, h, w)$ is an index along the $(N, C, H, W)$ axes. Given a region label map (mask), each channel $F_{n,c}$ is divided into $K$ regions:

$F_{n,c} = \bigcup_{k=1}^{K} R_{n,c}^{k}.$

The mean and standard deviation of each region of a channel are computed by

$\mu_{n,c}^{k} = \frac{1}{|R_{n,c}^{k}|}\sum_{(h,w)\in R_{n,c}^{k}} F_{n,c,h,w}, \qquad \sigma_{n,c}^{k} = \sqrt{\frac{1}{|R_{n,c}^{k}|}\sum_{(h,w)\in R_{n,c}^{k}} \left(F_{n,c,h,w} - \mu_{n,c}^{k}\right)^{2} + \epsilon},$

where $k$ is a region index, $|R_{n,c}^{k}|$ is the number of pixels in region $R_{n,c}^{k}$ and $\epsilon$ is a small constant. The normalization of each region performs the following computation:

$\hat{R}_{n,c}^{k} = \frac{R_{n,c}^{k} - \mu_{n,c}^{k}}{\sigma_{n,c}^{k}}.$

RN merges all normalized regions and obtains the region-normalized feature as follows:

$\mathrm{RN}(F_{n,c}) = \bigcup_{k=1}^{K} \hat{R}_{n,c}^{k}.$

After normalization, each region is transformed separately with its own set of learnable affine parameters $(\gamma_k, \beta_k)$.

Analysis of Region Normalization.

RN is an alternative to Instance Normalization (IN): RN degenerates into IN when the region number equals one. RN normalizes spatial regions on each channel separately, as the spatial regions are not entirely dependent. We set the region number to 2 for image inpainting in this work, as there are two obviously independent spatial regions in the input image: corrupted and uncorrupted regions. RN with region number 2 is illustrated in Figure 1.
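The RN formulation above can be sketched in a few lines of numpy. This is a simplified single-instance version; the per-region learnable affine transform is omitted for clarity:

```python
import numpy as np

def region_normalize(feat, labels, eps=1e-5):
    """Sketch of RN for a single instance. feat: (C, H, W) feature,
    labels: (H, W) integer region label map. Each region of each
    channel is normalized with its own mean and variance; the
    per-region learnable affine transform is omitted."""
    out = np.empty_like(feat, dtype=np.float64)
    for c in range(feat.shape[0]):
        for k in np.unique(labels):
            r = labels == k
            mu = feat[c][r].mean()
            var = feat[c][r].var()
            out[c][r] = (feat[c][r] - mu) / np.sqrt(var + eps)
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8, 8))

# Two regions (K = 2), e.g. a corrupted left half vs an uncorrupted right half.
labels = np.zeros((8, 8), dtype=int)
labels[:, :4] = 1
rn_out = region_normalize(x, labels)

# With a single region (K = 1), RN degenerates into Instance Normalization.
in_out = region_normalize(x, np.zeros((8, 8), dtype=int))
```

With one region the loop reduces to a single whole-channel normalization, which is exactly IN, matching the degeneration property noted above.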

3.2 Basic Region Normalization

Basic RN (RN-B) normalizes and transforms the corrupted and uncorrupted regions separately. This solves the mean and variance shift problem of normalization and also avoids information mixing in the affine transformation. RN-B is designed for use in the early layers of the inpainting network, where the input feature has large corrupted regions that cause severe mean and variance shifts.

Given an input feature $F$ and a binary region mask $M$ indicating the corrupted region, the RN-B layer first separates each channel $F_{n,c}$ of the input feature into two regions, $R_{n,c}^{u}$ (uncorrupted region) and $R_{n,c}^{c}$ (corrupted region), according to the region mask $M$. Let $F_{n,c,h,w}$ represent a pixel of $F_{n,c}$, where $(h, w)$ is an index along the $(H, W)$ axes. The separation rule is: $F_{n,c,h,w}$ is assigned to $R_{n,c}^{u}$ if $M_{h,w} = 1$ (valid pixel) and to $R_{n,c}^{c}$ otherwise.

RN-B then normalizes each region following the RN formulation above with region number $K = 2$, and merges the two normalized regions $\hat{R}_{n,c}^{u}$ and $\hat{R}_{n,c}^{c}$ to obtain the normalized channel. RN-B is a basic implementation of RN in which the region mask is obtained from the original inpainting mask.

For each channel, there are two sets of learnable affine parameters, $(\gamma_u, \beta_u)$ for the uncorrupted region and $(\gamma_c, \beta_c)$ for the corrupted region. The RN-B layer is shown in Figure 3(a).
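A minimal single-channel RN-B sketch follows. The mask convention (1 = uncorrupted) and the fixed affine values are illustrative assumptions; in a real network the per-region affine parameters are learned:

```python
import numpy as np

def rn_b_channel(feat, mask, gamma, beta, eps=1e-5):
    """Sketch of an RN-B step on one channel. feat: (H, W) feature,
    mask: (H, W) binary map (1 = uncorrupted, 0 = corrupted; this
    convention is illustrative). gamma/beta: dicts mapping region id
    to that region's affine parameters (learnable in a real network,
    fixed constants here)."""
    out = np.empty_like(feat, dtype=np.float64)
    for k in (0, 1):
        r = mask == k
        mu, var = feat[r].mean(), feat[r].var()
        out[r] = gamma[k] * (feat[r] - mu) / np.sqrt(var + eps) + beta[k]
    return out

rng = np.random.default_rng(0)
f = rng.normal(size=(16, 16))
m = np.zeros((16, 16), dtype=int)
m[:, 8:] = 1   # right half uncorrupted
y = rn_b_channel(f, m, gamma={0: 1.0, 1: 1.0}, beta={0: 0.0, 1: 0.0})
```

With identity affine parameters, each region of the output is independently zero-mean and unit-variance, which is the property RN-B relies on to remove the shift.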

3.3 Learnable Region Normalization

Figure 3: Two kinds of RN: RN-B (a) and RN-L (b)

After passing through several convolutional layers, the corrupted regions are gradually fused, and obtaining an accurate region mask from the original mask becomes hard. RN-L addresses this issue by automatically detecting corrupted regions and generating a region mask. To further improve the reconstruction, RN-L enhances the fusion of corrupted and uncorrupted regions by a global affine transformation. RN-L boosts the corrupted-region reconstruction in a soft way, which solves the mean and variance shift problem and also enhances the fusion. Therefore, RN-L is suitable for the latter layers of the network. Note that RN-L does not need a region mask as input, and the affine parameters of RN-L are pixel-wise. RN-L is illustrated in Figure 3(b).

RN-L generates a spatial response map by taking advantage of the spatial relationship of the features themselves. Specifically, RN-L first performs max-pooling and average-pooling along the channel axis; the two pooling operations obtain an efficient feature descriptor [31, 24]. RN-L then concatenates the two pooling results and convolves them, with a sigmoid activation, to get a spatial response map:

$s = \sigma\left(W_s * [F_{max}; F_{avg}]\right),$

where $F_{max}$ and $F_{avg}$ are the max-pooling and average-pooling results of the input feature $F$, $*$ is the convolution operation, $\sigma$ is the sigmoid function, and $s$ is the spatial response map. To get a region mask for RN, we threshold the spatial response map:

$M_{h,w} = \begin{cases} 1, & s_{h,w} > t \\ 0, & \text{otherwise} \end{cases}$

We set the threshold $t = 0.8$ in this work. Note that the thresholding operation is only performed in the inference stage, and the gradients do not pass through it during backpropagation.

Based on the mask $M$, RN normalizes the input feature and then performs a pixel-wise affine transformation. The affine parameters $\gamma$ and $\beta$ are obtained by convolution on the spatial response map $s$:

$\gamma = W_{\gamma} * s, \qquad \beta = W_{\beta} * s.$

Note that the values of $\gamma$ and $\beta$ are expanded along the channel dimension in the affine transformation.

Figure 4: Illustration of our inpainting model.

The spatial response map has global spatial information. Convolution on it can learn a global representation, which boosts the fusion of corrupted and uncorrupted regions.
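The RN-L steps (channel-wise pooling, sigmoid response map, thresholding at 0.8, and response-driven pixel-wise affine parameters) can be sketched as follows. The learned convolutions are collapsed into illustrative 1x1 weights, so this shows the dataflow and shapes, not the trained layer:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rn_l(feat, w_s=(1.0, 1.0), w_gamma=1.0, w_beta=0.0, t=0.8, eps=1e-5):
    """Sketch of an RN-L layer for one instance (C, H, W). The learned
    convolutions of the paper are collapsed to 1x1 weights: w_s mixes
    the two pooled maps into the response map, and w_gamma / w_beta
    map the response map to pixel-wise affine parameters."""
    # Channel-wise max- and average-pooling give a 2-channel descriptor.
    f_max, f_avg = feat.max(axis=0), feat.mean(axis=0)
    s = sigmoid(w_s[0] * f_max + w_s[1] * f_avg)   # spatial response map
    mask = (s > t).astype(int)                     # thresholded region mask
    # Region normalization under the generated mask.
    out = np.zeros_like(feat, dtype=np.float64)
    for c in range(feat.shape[0]):
        for k in (0, 1):
            r = mask == k
            if r.any():
                mu, var = feat[c][r].mean(), feat[c][r].var()
                out[c][r] = (feat[c][r] - mu) / np.sqrt(var + eps)
    # Pixel-wise affine parameters from the response map, expanded
    # (broadcast) along the channel dimension.
    gamma, beta = w_gamma * s, w_beta * s
    return gamma[None] * out + beta[None], mask

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8, 8))
y, m = rn_l(x)
```

Because the affine parameters are functions of the response map, every pixel gets its own scale and shift, which is the global transformation RN-L uses to fuse the two regions.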

3.4 Network Architecture

EdgeConnect (EC) [19] consists of an edge generator and an image generator. The image generator is a simple yet effective network originally proposed by Johnson et al. [16]. We use only the image generator as our backbone generator and replace its original instance normalization (IN) with our two kinds of RN, RN-B and RN-L. Our generator architecture is shown in Figure 4. Following Sections 3.2 and 3.3, we apply RN-B in the early layers (encoder) of our generator and RN-L in the intermediate and later layers (the residual blocks and decoder). Note that the input mask of RN-B is sampled from the original inpainting mask, while RN-L does not need an external input as it generates region masks internally. We apply the same discriminators (PatchGAN [15, 33]) and loss functions (reconstruction loss, adversarial loss, perceptual loss and style loss) of the original backbone model to our model.¹

¹The codes are available at

4 Experiments

We first compare our method with current state-of-the-art methods. We then conduct an ablation study to explore the properties of RN and visualize our method. Finally, we generalize RN to other state-of-the-art methods.

4.1 Experiment Setup

We evaluate our methods on Places2 [32] and CelebA [18] datasets. We use two kinds of image masks: regular masks which are fixed square masks (occupying a quarter of the image) and irregular masks from [17]. The irregular mask dataset contains 12000 irregular masks and the masked area in each mask occupies 0-60% of the total image size. Besides, the irregular dataset is grouped into six intervals according to the mask area, 0-10%, 10-20%, 20-30%, 30-40%, 40-50% and 50-60%. Each interval has 2000 masks.

4.2 Comparison

We compare our method to four current state-of-the-art methods and the baseline.

- CA: Contextual Attention [29].

- PC: Partial Convolution [17].

- GC: Gated Convolution [30].

- EC: EdgeConnect [19].

- Baseline: the backbone network we use. The baseline model uses instance normalization instead of RN.

Quantitative Comparisons

We test all models on the full validation set (36,500 images) of Places2, comparing our model with CA, PC, GC, EC and the baseline. Three commonly used metrics are reported: PSNR, SSIM [23] with window size 11, and $\ell_1$ loss. The results are given in Table 1, where the second column is the area of the irregular masks at testing time and "All" means using all irregular masks (0-60%). Our model surpasses all the compared models on all three metrics. Compared to the baseline, our model improves PSNR by 0.73 dB and SSIM by 0.017, and reduces $\ell_1$ loss (%) by 0.25 in the "All" case.
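For reference, PSNR and the $\ell_1$ (%) metric can be computed as below (a straightforward sketch assuming images scaled to [0, 1]; SSIM is window-based, for which an existing implementation such as scikit-image's is the practical choice):

```python
import numpy as np

def psnr(a, b, peak=1.0):
    """Peak signal-to-noise ratio in dB for images in [0, peak]."""
    mse = np.mean((a - b) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

def l1_percent(a, b):
    """Mean absolute error, reported as a percentage of the peak value."""
    return 100.0 * np.mean(np.abs(a - b))

a = np.zeros((8, 8))
b = np.full((8, 8), 0.1)
print(psnr(a, b))        # about 20 dB
print(l1_percent(a, b))  # about 10
```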

Mask CA PC* GC EC baseline Ours
PSNR 10-20% 24.45 28.02 26.65 27.46 27.28 28.16
20-30% 21.14 24.90 24.79 24.53 24.35 25.06
30-40% 19.16 22.45 23.09 22.52 22.33 22.94
40-50% 17.81 20.86 21.72 20.90 20.96 21.21
All 21.60 24.82 24.53 24.39 24.37 25.10
SSIM 10-20% 0.891 0.869 0.882 0.920 0.914 0.926
20-30% 0.811 0.777 0.836 0.859 0.851 0.868
30-40% 0.729 0.685 0.782 0.794 0.784 0.804
40-50% 0.651 0.589 0.721 0.723 0.711 0.734
All 0.767 0.724 0.807 0.814 0.806 0.823
$\ell_1$ (%) 10-20% 1.81 1.14 3.01 1.58 1.24 1.10
20-30% 3.24 1.98 3.54 2.71 2.17 1.96
30-40% 4.81 3.02 4.25 3.93 3.19 2.90
40-50% 6.30 4.11 4.99 5.32 4.36 4.00
All 4.21 2.80 3.79 2.83 2.95 2.70
Table 1: Quantitative results on Places2 with models: CA [29], PC [17], GC [30], EC [19], the baseline, and ours (RN). "All" means using all masks with 0-60% area. For PSNR and SSIM higher is better; for $\ell_1$ lower is better. *The statistics are obtained from their paper.

Qualitative Comparisons

Figure 5: Qualitative results with CA [29], PC [17], GC [30], EC [19], the baseline, and our RN. The first two rows are the testing results on Places2 and the last two are on CelebA.

Figure 5 compares images generated by CA, PC, GC, EC, the baseline and ours. The first two rows of input images are taken from Places2 validation dataset and the last two rows are taken from CelebA validation dataset. In addition, the first three rows show the results in irregular mask case and the last row shows regular mask (fixed square mask in center) case. Our method achieves better subjective results, which benefits from RN-B’s eliminating the impact of the mean and variance shifts on training, and RN-L’s further boosting the reconstruction of corrupted regions.

4.3 Ablation Study

Arch. Encoder Res-blocks Decoder PSNR SSIM (%)
baseline IN IN IN 24.37 0.806 2.95
1 RN-B IN IN 24.88 0.814 2.77
2 RN-B RN-B IN 24.41 0.810 2.90
3 RN-B RN-B RN-B 24.59 0.812 2.85
4 RN-B RN-L IN 25.02 0.823 2.71
5 RN-B RN-L RN-L 25.10 0.823 2.70
6 RN-L RN-L RN-L 24.53 0.812 2.86
Table 2: The influence of the plugging location of RN-B and RN-L. The baseline uses instance normalization (IN) in all three stages. The results are based on Places2.
Norm. None IN BN RN
PSNR 24.47 24.37 24.24 25.10
SSIM 0.811 0.806 0.806 0.823
$\ell_1$ (%) 2.91 2.95 2.98 2.70
Table 3: The final convergence results of different normalization methods on Places2. None means no normalization.

RN and Architecture

Figure 6: The PSNR results of different normalization methods in the first 10000 iterations on Places2. None means no normalization.

We first explore the source of gain for our method and the best strategy for applying the two kinds of RN, RN-B and RN-L. We conduct ablation experiments on the backbone generator, which has three stages: an encoder, followed by eight residual blocks and a decoder. We plug RN-B and RN-L into different stages and obtain six architectures (Arch.1-6), as shown in Table 2. The results in Table 2 confirm our use of RN: apply RN-B in the early layers (encoder) to solve the mean and variance shifts caused by large-area corrupted regions, and apply RN-L in the later layers to solve the mean and variance shifts and boost the fusion of the two kinds of regions. Arch.1 applies RN-B only in the encoder and achieves a significant performance boost, which directly shows RN-B's effectiveness. Arch.2 and 3 reduce the performance, as RN-B can hardly obtain an accurate region mask in the latter layers of the network after the feature has passed through several convolutional layers. Arch.4 improves on Arch.1 by adding RN-L in the middle residual blocks. Arch.5 (our method) further improves on Arch.4 by applying RN-L in both the residual blocks and the decoder. Note that Arch.6 applies RN-L in the encoder and its performance drops compared to Arch.5, since RN-L, a soft-fusion module, unavoidably mixes up information from corrupted and uncorrupted regions and washes away information from the uncorrupted regions. These results verify the use of RN-B and RN-L explained in Sections 3.2 and 3.3.

Comparisons with Other Normalization Methods

To verify that RN is more effective for training the inpainting model, we compare RN with a no-normalization setting and two full-spatial normalization methods, batch normalization (BN) and instance normalization (IN), on the same backbone. We show the PSNR curves over the first 10,000 iterations in Figure 6 and the final convergence results (about 225,000 iterations) in Table 3. The experiments are on Places2. Note that no normalization (None) is better than full-spatial normalization (IN and BN), while RN is better than no normalization, as it eliminates the mean and variance shifts while still taking advantage of normalization.

Threshold of Learnable RN

A threshold is set in Learnable RN to generate a region mask from the spatial response map. The threshold affects the accuracy of the region mask and thereby the power of RN. We conduct a set of experiments to explore the best threshold. The PSNR results on Places2 and CelebA show that RN-L achieves the best results when the threshold equals 0.8, as shown in Table 4. We show the generated masks of the first RN-L layer in the sixth residual block as an example in Figure 7; the mask generated with $t = 0.8$ is likely the most accurate in this layer.

Figure 7: The masks generated with different thresholds $t$ by the first RN-L layer in the sixth residual block.
$t$ 0.5 0.6 0.7 0.8 0.9
Places2 23.85 24.90 24.96 25.10 24.93
CelebA 27.36 27.92 28.45 28.51 23.73
Table 4: The PSNR results with different thresholds $t$ on the Places2 and CelebA datasets.

RN and Masked Area

We explore the influence of the mask area on RN. Based on the theoretical analysis in Section 3.1, the mean and variance shifts become more severe as the mask area increases. Our experiments on CelebA, evaluated with $\ell_1$ loss, show that the advantage of RN becomes more significant as the mask area increases, as shown in Table 5.

Mask 0-10% 10-20% 20-30% 30-40% 40-50% 50-60%
baseline 0.26 0.69 1.28 2.02 2.92 4.83
RN 0.23 0.62 1.18 1.85 2.68 4.52
Change -0.03 -0.07 -0.10 -0.17 -0.24 -0.31
Table 5: The testing (%) loss with different mask area on CelebA. RN’s advantage becomes more significant as the mask area increases.
Model CA CA+RN PC PC+RN GC GC+RN
PSNR 21.60 24.12 24.82 25.32 24.53 24.55
SSIM 0.767 0.842 0.724 0.829 0.807 0.807
$\ell_1$ (%) 4.21 3.17 2.80 2.61 3.79 3.75
Table 6: The results of applying RN to different backbone networks: CA [29], PC [17] and GC [30]. The results are based on Places2.
Figure 8: Visualization of our method. The top two rows illustrate the changes of the spatial response and generated mask at different locations in the network: the first RN-L in the sixth residual block, the second RN-L in the seventh residual block and the second RN-L in the eighth residual block. In the last two rows, from left to right: input, encoder result, spatial response map, generated mask, gamma map and beta map of the first RN-L in the seventh residual block.

4.4 Visualization

We visualize some features of the inpainting network to verify our method. The top two rows of Figure 8 show how the spatial response and generated mask of RN-L change as the network deepens. The mask changes across layers due to the fusion effect of passing through convolutional layers, yet RN-L detects potentially corrupted regions consistently. From the last two rows of Figure 8 we can see: (1) the uncorrupted regions in the encoded feature are well preserved by using RN-B; (2) RN-L can distinguish between potentially different regions and generate a region mask; (3) the gamma and beta maps in RN-L perform a pixel-level transform that treats potentially corrupted and uncorrupted regions distinctively to help fuse them.

4.5 Generalization Experiments

RN-B and RN-L are plug-and-play modules for image inpainting networks. We generalize RN (RN-B and RN-L) to some other backbone networks: CA, PC and GC. We apply RN-B to their early layers (encoder) and RN-L to the later layers. CA and GC are two-stage (coarse-to-fine) inpainting networks in which the coarse result is the input of the refinement network. Since the corrupted and uncorrupted regions of the coarse result are typically not clearly distinguishable, we apply RN only to the coarse inpainting networks of CA and GC. The results on Places2 are shown in Table 6. The RN-applied CA and PC achieve significant performance boosts of 2.52 dB and 0.5 dB PSNR, respectively. The gain on GC is less impressive; a possible reason is that the gated convolution of GC greatly smooths features, making it hard for RN-L to track potentially corrupted regions. Besides, GC's results are typically blurry, as shown in Figure 5.

5 Conclusion

In this work, we investigate the impact of normalization on inpainting networks and show that Region Normalization (RN) is more effective for image inpainting than existing full-spatial normalization. The proposed two kinds of RN are plug-and-play modules that can be conveniently applied to other image inpainting networks. Additionally, our inpainting model works well in real use cases such as object removal, face editing and image restoration, as shown in Figure 9.

In the future, we will explore RN for other supervised vision tasks such as classification and detection.

Figure 9: Our results in real use cases.


  • [1] J. L. Ba, J. R. Kiros, and G. E. Hinton (2016) Layer normalization. arXiv preprint arXiv:1607.06450. Cited by: §2.2.
  • [2] C. Ballester, M. Bertalmio, V. Caselles, G. Sapiro, and J. Verdera (2001) Filling-in by joint interpolation of vector fields and gray levels. IEEE TIP 10 (8), pp. 1200–1211. Cited by: §1, §2.1.
  • [3] C. Barnes, E. Shechtman, A. Finkelstein, and D. B. Goldman (2009) PatchMatch: a randomized correspondence algorithm for structural image editing. In ACM TOG, Vol. 28, pp. 24. Cited by: §1, §2.1.
  • [4] M. Bertalmio, G. Sapiro, V. Caselles, and C. Ballester (2000) Image inpainting. In SIGGRAPH, pp. 417–424. Cited by: §1, §2.1.
  • [5] M. Bertalmio, L. Vese, G. Sapiro, and S. Osher (2003) Simultaneous structure and texture image inpainting. IEEE TIP 12 (8), pp. 882–889. Cited by: §1, §2.1.
  • [6] S. Darabi, E. Shechtman, C. Barnes, D. B. Goldman, and P. Sen (2012) Image melding: combining inconsistent images using patch-based synthesis.. ACM TOG 31 (4), pp. 82–1. Cited by: §1, §2.1.
  • [7] H. De Vries, F. Strub, J. Mary, H. Larochelle, O. Pietquin, and A. C. Courville (2017) Modulating early visual processing by language. In NIPS, pp. 6594–6604. Cited by: §2.2.
  • [8] I. Drori, D. Cohen-Or, and H. Yeshurun (2003) Fragment-based image completion. In ACM TOG, Vol. 22, pp. 303–312. Cited by: §1, §2.1.
  • [9] V. Dumoulin, J. Shlens, and M. Kudlur (2016) A learned representation for artistic style. arXiv preprint arXiv:1610.07629. Cited by: §2.2.
  • [10] S. Esedoglu and J. Shen (2002) Digital inpainting based on the mumford–shah–euler image model. European Journal of Applied Mathematics 13 (4), pp. 353–370. Cited by: §1, §2.1.
  • [11] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In NIPS, pp. 2672–2680. Cited by: §2.1.
  • [12] X. Huang and S. Belongie (2017) Arbitrary style transfer in real-time with adaptive instance normalization. In ICCV, pp. 1501–1510. Cited by: §2.2.
  • [13] S. Iizuka, E. Simo-Serra, and H. Ishikawa (2017) Globally and locally consistent image completion. ACM TOG 36 (4), pp. 107. Cited by: §2.1.
  • [14] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §2.2, §3.1.
  • [15] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In CVPR, pp. 1125–1134. Cited by: §3.4.
  • [16] J. Johnson, A. Alahi, and L. Fei-Fei (2016) Perceptual losses for real-time style transfer and super-resolution. In ECCV, pp. 694–711. Cited by: §3.4.
  • [17] G. Liu, F. A. Reda, K. J. Shih, T. Wang, A. Tao, and B. Catanzaro (2018) Image inpainting for irregular holes using partial convolutions. In ECCV, pp. 85–100. Cited by: §1, §2.1, Figure 5, §4.1, §4.2, Table 1, Table 6.
  • [18] Z. Liu, P. Luo, X. Wang, and X. Tang (2015) Deep learning face attributes in the wild. In ICCV, pp. 3730–3738. Cited by: §1, §4.1.
  • [19] K. Nazeri, E. Ng, T. Joseph, F. Qureshi, and M. Ebrahimi (2019) Edgeconnect: generative image inpainting with adversarial edge learning. arXiv preprint arXiv:1901.00212. Cited by: §1, §2.1, §3.4, Figure 5, §4.2, Table 1.
  • [20] T. Park, M. Liu, T. Wang, and J. Zhu (2019) Semantic image synthesis with spatially-adaptive normalization. In CVPR, pp. 2337–2346. Cited by: §2.2.
  • [21] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros (2016) Context encoders: feature learning by inpainting. In CVPR, pp. 2536–2544. Cited by: §2.1.
  • [22] D. Ulyanov, A. Vedaldi, and V. Lempitsky (2016) Instance normalization: the missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022. Cited by: §2.2.
  • [23] Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli, et al. (2004) Image quality assessment: from error visibility to structural similarity. IEEE TIP 13 (4), pp. 600–612. Cited by: §4.2.
  • [24] S. Woo, J. Park, J. Lee, and I. So Kweon (2018-09) CBAM: convolutional block attention module. In ECCV, Cited by: §3.3.
  • [25] Y. Wu and K. He (2018) Group normalization. In ECCV, pp. 3–19. Cited by: §2.2.
  • [26] W. Xiong, J. Yu, Z. Lin, J. Yang, X. Lu, C. Barnes, and J. Luo (2019) Foreground-aware image inpainting. In CVPR, pp. 5840–5848. Cited by: §2.1.
  • [27] Z. Xu and J. Sun (2010) Image inpainting by patch propagation using patch sparsity. IEEE TIP 19 (5), pp. 1153–1165. Cited by: §1, §2.1.
  • [28] R. A. Yeh, C. Chen, T. Yian Lim, A. G. Schwing, M. Hasegawa-Johnson, and M. N. Do (2017) Semantic image inpainting with deep generative models. In CVPR, pp. 5485–5493. Cited by: §2.1.
  • [29] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang (2018) Generative image inpainting with contextual attention. In CVPR, pp. 5505–5514. Cited by: §1, §2.1, Figure 5, §4.2, Table 1, Table 6.
  • [30] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang (2019) Free-form image inpainting with gated convolution. In ICCV, pp. 4471–4480. Cited by: §1, §2.1, Figure 5, §4.2, Table 1, Table 6.
  • [31] S. Zagoruyko and N. Komodakis (2016) Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. arXiv preprint arXiv:1612.03928. Cited by: §3.3.
  • [32] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba (2017) Places: a 10 million image database for scene recognition. IEEE TPAMI 40 (6), pp. 1452–1464. Cited by: §1, §4.1.
  • [33] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, pp. 2223–2232. Cited by: §3.4.