Region Normalization for Image Inpainting, accepted by AAAI-2020
Feature Normalization (FN) is an important technique to help neural network training, which typically normalizes features across spatial dimensions. Most previous image inpainting methods apply FN in their networks without considering the impact of the corrupted regions of the input image on normalization, e.g. mean and variance shifts. In this work, we show that the mean and variance shifts caused by full-spatial FN limit the image inpainting network training and we propose a spatial region-wise normalization named Region Normalization (RN) to overcome the limitation. RN divides spatial pixels into different regions according to the input mask, and computes the mean and variance in each region for normalization. We develop two kinds of RN for our image inpainting network: (1) Basic RN (RN-B), which normalizes pixels from the corrupted and uncorrupted regions separately based on the original inpainting mask to solve the mean and variance shift problem; (2) Learnable RN (RN-L), which automatically detects potentially corrupted and uncorrupted regions for separate normalization, and performs global affine transformation to enhance their fusion. We apply RN-B in the early layers and RN-L in the latter layers of the network respectively. Experiments show that our method outperforms current state-of-the-art methods quantitatively and qualitatively. We further generalize RN to other inpainting networks and achieve consistent performance improvements.
Image inpainting aims to reconstruct the corrupted (or missing) regions of the input image. It has many applications in image editing such as object removal, face editing and image disocclusion. A key issue in image inpainting is to generate visually plausible content in the corrupted regions.
Existing image inpainting methods can be divided into two groups: traditional and learning-based methods. The traditional methods fill the corrupted regions by diffusion-based methods [4, 2, 10, 5] that propagate neighboring information into them, or patch-based methods [8, 3, 27, 6] that copy similar patches into them. The learning-based methods commonly train neural networks to synthesize content in the corrupted regions, which yield promising results and have significantly surpassed the traditional methods in recent years.
Recent image inpainting works, such as [29, 17, 30, 19], focus on learning-based methods. Most of them design an advanced network to improve performance but ignore the inherent nature of the image inpainting problem: unlike the input image of a general vision task, the input image for inpainting has corrupted regions that are typically independent of the uncorrupted regions. Feeding a corrupted image into a neural network as if it were a general, spatially consistent image has potential problems, such as convolution over invalid (corrupted) pixels and mean and variance shifts in normalization. Partial convolution was proposed to solve the invalid convolution problem by operating only on valid pixels, and achieves a performance boost. However, none of the existing methods solves the mean and variance shift problem of normalization in inpainting networks. In particular, most existing methods apply feature normalization (FN) in their networks to help training, and existing FN methods typically normalize features across spatial dimensions, ignoring the corrupted regions and resulting in mean and variance shifts of normalization.
In this work, we show in theory and experiment that the mean and variance shifts caused by existing full-spatial normalization limit the image inpainting network training. To overcome the limitation, we propose Region Normalization (RN), a spatially region-wise normalization method that divides spatial pixels into different regions according to the input mask and computes the mean and variance in each region for normalization. RN can effectively solve the mean and variance shift problem and improve the inpainting network training.
We further design two kinds of RN for our image inpainting network: Basic RN (RN-B) and Learnable RN (RN-L). In the early layers of the network, the input image has large corrupted regions, which results in severe mean and variance shifts. Thus we apply RN-B to solve the problem by normalizing corrupted and uncorrupted regions separately. The input mask of RN-B is obtained from the original inpainting mask. After passing through several convolutional layers, the corrupted regions are fused gradually, making it difficult to obtain a region mask from the original mask. Therefore, we apply RN-L in the later layers of the network, which learns to detect potentially corrupted regions by utilizing the spatial relationship of the input feature and generates a region mask for RN. Additionally, RN-L can enhance the fusion of corrupted and uncorrupted regions by global affine transformation. RN-L not only solves the mean and variance shift problem, but also boosts the reconstruction of corrupted regions.
We conduct experiments on the Places2 and CelebA datasets. The experimental results show that, with the help of RN, a simple backbone can surpass current state-of-the-art image inpainting methods. In addition, we generalize our RN to other inpainting networks and obtain consistent performance improvements.
Our contributions in this work include:
- Both theoretically and experimentally, we show that existing full-spatial normalization methods are sub-optimal for image inpainting.
- To the best of our knowledge, we are the first to propose a spatially region-wise normalization method, Region Normalization (RN).
- We propose two kinds of RN for image inpainting and a way of using them that achieves state-of-the-art results on image inpainting.
Previous works in image inpainting can be divided into two categories: traditional and learning-based methods.
Traditional methods use diffusion-based [4, 2, 10, 5] or patch-based [8, 3, 27, 6] methods to fill the holes. The former propagate neighboring information into holes. The latter typically copy similar patches into the holes. The performance of these traditional methods is limited since they cannot use semantic information.
Learning-based methods can learn to extract semantic information from training on massive data, and thus significantly improve the inpainting results. These methods map a corrupted image directly to the completed image. ContextEncoder, one of the pioneering learning-based methods, trains a convolutional neural network to complete images. With the introduction of generative adversarial networks (GANs), GAN-based methods [28, 13, 29, 26, 19] are widely used in image inpainting. ContextualAttention is a popular model with a coarse-to-fine architecture. Considering that there are valid/uncorrupted and invalid/corrupted regions in a corrupted image, partial convolution operates only on valid pixels and achieves promising results. Gated convolution generalizes partial convolution by a soft distinction of valid and invalid regions. EdgeConnect first predicts the edges of the corrupted regions, then generates the completed image with the help of the predicted edges.
However, most existing inpainting methods ignore the impact of the corrupted regions of the input image on normalization, which is a crucial technique for network training.
Feature normalization layers have been widely applied in deep neural networks to help network training.
Batch Normalization (BN), which normalizes activations across batch and spatial dimensions, has been widely used in discriminative networks to speed up convergence and improve model robustness, and has also been found effective in generative networks. Instance Normalization (IN), distinguished from BN by normalizing activations across only spatial dimensions, achieves a significant improvement in many generative tasks such as style transformation. Layer Normalization (LN) normalizes activations across channel and spatial dimensions (i.e. normalizes all features of an instance), which helps recurrent neural network training. Group Normalization (GN) normalizes features of grouped channels of an instance and improves the performance of some vision tasks such as object detection.
Different from the single set of affine parameters in the above normalization methods, conditional normalization methods typically use external data to infer multiple sets of affine parameters. Conditional instance normalization (CIN), adaptive instance normalization (AdaIN), conditional batch normalization (CBN) and spatially adaptive denormalization (SPADE) have been proposed for some image synthesis tasks.
None of the existing normalization methods considers the impact of the spatial distribution on normalization.
In this section, we show that existing full-spatial normalization methods are sub-optimal for the image inpainting problem, as motivation for Region Normalization (RN). We then introduce two kinds of RN for image inpainting, Basic RN (RN-B) and Learnable RN (RN-L). We finally introduce our image inpainting network using RN.
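$F_1$, $F_2$ and $F_3$ are three feature maps of the same size, each with $n$ pixels, as shown in Figure 2. $F_1$ is the original uncorrupted feature map. $F_2$ and $F_3$ are the different normalization results of a feature map with masked and unmasked areas. $n_m$ and $n_u$ are the pixel numbers of the masked and unmasked areas, respectively, so $n = n_m + n_u$. Specifically, $F_2$ is normalized over all the areas together, while $F_3$ is normalized separately in the masked and unmasked areas. Assuming the masked region pixels have the max value 255, the means and standard deviations of the three feature maps are denoted $\mu_1$, $\sigma_1$, $\mu_2$, $\sigma_2$, $\mu_{3m}$, $\sigma_{3m}$, $\mu_{3u}$ and $\sigma_{3u}$. The subscripts $1$ and $2$ refer to the entire areas of $F_1$ and $F_2$, and $3m$ and $3u$ refer to the masked and unmasked areas of $F_3$, respectively. The relationships are listed below:
$$\mu_{3m} = 255, \quad \sigma_{3m} = 0$$
$$\mu_2 = \frac{n_u}{n}\,\mu_{3u} + \frac{n_m}{n}\cdot 255$$
$$\sigma_2^2 = \frac{n_u}{n}\,\sigma_{3u}^2 + \frac{n_m n_u}{n^2}\,(255 - \mu_{3u})^2$$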
After normalizing the masked and unmasked areas together, the mean of the unmasked area of $F_2$ has a shift toward $-255$ and the variance of $F_2$ increases compared with $F_1$ and $F_3$. Normalization shifts and scales the distribution of features into a small region where the mean is zero and the variance is one. We take batch normalization (BN) as an example here. For each point $x_i$,
$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$$
where $\mu$ and $\sigma^2$ are the mean and variance over the normalized area and $\epsilon$ is a small constant.
Compared with the unmasked area of $F_3$, the distribution of the unmasked area of $F_2$ narrows down and shifts from $0$ toward $-255$. Then, for both fully-connected and convolutional layers, the affine transformation is followed by an element-wise nonlinearity $g(\cdot)$:
$$z = g(W\hat{x} + b)$$
Here $\hat{x}$ is the normalized input. $W$ and $b$ are learned parameters of the model.
As shown in Figure 2, under the ReLU and sigmoid activations, the distribution region of $F_2$ is narrowed down and shifted by the masked area. This adds to the internal covariate shift and makes the features prone to getting stuck in the saturated regimes of the nonlinearities (causing the vanishing gradient problem), so that much training time is wasted while the learned parameters compensate for the problem. In contrast, $F_3$, which normalizes the masked and unmasked regions separately, reduces the internal covariate shift, which preserves the network capacity and improves training efficiency.
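The shift described above can be checked numerically. The following is a minimal sketch, not taken from the paper: it assumes a toy one-dimensional feature map whose corrupted pixels all hold the value 255, and the names `standardize`, `unmasked` and `masked` are hypothetical. It compares the statistics of the unmasked pixels after full-spatial normalization with those after normalizing the unmasked region on its own.

```python
import math
import statistics

MASKED_VALUE = 255.0  # assumption: every corrupted pixel holds the max value


def standardize(values, mean, std, eps=1e-5):
    """Shift and scale values using the given mean and standard deviation."""
    return [(v - mean) / math.sqrt(std * std + eps) for v in values]


# Toy feature map flattened to 1-D: six unmasked pixels, two masked pixels.
unmasked = [40.0, 60.0, 80.0, 100.0, 120.0, 140.0]
masked = [MASKED_VALUE, MASKED_VALUE]
all_pixels = unmasked + masked

# Full-spatial normalization: one mean/std shared by every pixel.
mu_full = statistics.fmean(all_pixels)
sigma_full = statistics.pstdev(all_pixels)
unmasked_after_full = standardize(unmasked, mu_full, sigma_full)

# Region-wise normalization: statistics of the unmasked region only.
mu_u = statistics.fmean(unmasked)
sigma_u = statistics.pstdev(unmasked)
unmasked_after_region = standardize(unmasked, mu_u, sigma_u)

full_mean = statistics.fmean(unmasked_after_full)
full_std = statistics.pstdev(unmasked_after_full)
region_mean = statistics.fmean(unmasked_after_region)
region_std = statistics.pstdev(unmasked_after_region)

print(f"full-spatial: mean={full_mean:.3f}, std={full_std:.3f}")
print(f"region-wise:  mean={region_mean:.3f}, std={region_std:.3f}")
```

With these toy numbers, the unmasked pixels end up with a mean of about -0.53 and a standard deviation of about 0.44 under full-spatial normalization, while region-wise normalization keeps them at zero mean and unit standard deviation. The exact figures depend entirely on the chosen toy values.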
Motivated by this, we design a spatial region-wise normalization named Region Normalization (RN).
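Let $X \in \mathbb{R}^{N \times C \times H \times W}$ be the input feature, where $N$, $C$, $H$ and $W$ are the batch size, number of channels, height and width, respectively. Let $x_{n,c,h,w}$ be a pixel of $X$ and $X_{n,c} \in \mathbb{R}^{H \times W}$ be a channel of $X$, where $(n, c, h, w)$ is an index along the $(N, C, H, W)$ axes. Given a region label map (mask) $M$, $X_{n,c}$ is divided into $K$ regions as follows:
$$X_{n,c} = R^1_{n,c} \cup R^2_{n,c} \cup \dots \cup R^K_{n,c}$$
The mean and standard deviation of each region of a channel are computed by:
$$\mu^k_{n,c} = \frac{1}{|R^k_{n,c}|} \sum_{x_{n,c,h,w} \in R^k_{n,c}} x_{n,c,h,w}$$
$$\sigma^k_{n,c} = \sqrt{\frac{1}{|R^k_{n,c}|} \sum_{x_{n,c,h,w} \in R^k_{n,c}} \left(x_{n,c,h,w} - \mu^k_{n,c}\right)^2 + \epsilon}$$
Here $k$ is a region index, $|R^k_{n,c}|$ is the number of pixels in region $R^k_{n,c}$ and $\epsilon$ is a small constant. The normalization of each region performs the following computation:
$$\hat{R}^k_{n,c} = \frac{1}{\sigma^k_{n,c}} \left(R^k_{n,c} - \mu^k_{n,c}\right)$$
RN merges all normalized regions and obtains the region normalized feature as follows:
$$\hat{X}_{n,c} = \hat{R}^1_{n,c} \cup \hat{R}^2_{n,c} \cup \dots \cup \hat{R}^K_{n,c}$$
After normalization, each region is transformed separately with a set of learnable affine parameters $[\gamma^k_c, \beta^k_c]$.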
RN is an alternative to Instance Normalization (IN). RN degenerates into IN when the region number $K$ equals one. RN normalizes spatial regions on each channel separately, as the spatial regions are not entirely dependent. We set $K = 2$ for image inpainting in this work, as there are two obviously independent spatial regions in the input image: the corrupted and uncorrupted regions. RN with $K = 2$ is illustrated in Figure 1.
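The region-wise computation described above can be sketched for a single channel as follows. This is an illustrative pure-Python version rather than the authors' implementation: the function name `region_normalize` and the use of an integer label map over nested lists are assumptions, and the learnable affine parameters are left out.

```python
import math


def region_normalize(channel, labels, eps=1e-5):
    """Normalize one H x W channel separately inside each labelled region.

    channel: nested list of floats (H rows, W columns).
    labels:  nested list of region ids with the same shape (hypothetical
             stand-in for the region label map).
    """
    height, width = len(channel), len(channel[0])

    # Collect the pixel values that belong to each region.
    regions = {}
    for h in range(height):
        for w in range(width):
            regions.setdefault(labels[h][w], []).append(channel[h][w])

    # Mean and standard deviation of each region.
    stats = {}
    for k, values in regions.items():
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        stats[k] = (mean, math.sqrt(var + eps))

    # Normalize every pixel with the statistics of its own region; writing
    # the results back in place merges the regions into one channel again.
    return [
        [
            (channel[h][w] - stats[labels[h][w]][0]) / stats[labels[h][w]][1]
            for w in range(width)
        ]
        for h in range(height)
    ]


channel = [[1.0, 2.0, 200.0, 220.0],
           [3.0, 4.0, 240.0, 260.0]]
labels = [[0, 0, 1, 1],
          [0, 0, 1, 1]]
normalized = region_normalize(channel, labels)
single_region = region_normalize(channel, [[0] * 4, [0] * 4])
```

In the example, the two regions have very different scales but each ends up with zero mean after normalization. Passing a label map with a single region reduces the function to ordinary per-channel (instance-style) normalization.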
Basic RN (RN-B) normalizes and transforms corrupted and uncorrupted regions separately. This can solve the mean and variance shift problem of normalization and also avoid information mixing in the affine transformation. RN-B is designed for use in the early layers of the inpainting network, where the input feature has large corrupted regions that cause severe mean and variance shifts.
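Given an input feature $F \in \mathbb{R}^{C \times H \times W}$ and a binary region mask $M \in \mathbb{R}^{1 \times H \times W}$ indicating the corrupted region, the RN-B layer first separates each channel $F_c \in \mathbb{R}^{1 \times H \times W}$ of the input feature $F$ into two regions $R^1_c$ (e.g. the uncorrupted region) and $R^2_c$ (e.g. the corrupted region) according to the region mask $M$. Let $x_{c,h,w}$ represent a pixel of $F_c$, where $(c, h, w)$ is an index along the $(C, H, W)$ axes. The separation rule is as follows:
$$x_{c,h,w} \in \begin{cases} R^1_c & \text{if } M \text{ marks } (h, w) \text{ as uncorrupted} \\ R^2_c & \text{otherwise} \end{cases}$$
RN-B then normalizes each region following Formula (9), (10) and (11) with region number $K = 2$. Then we merge the two normalized regions $\hat{R}^1_c$ and $\hat{R}^2_c$ to obtain the normalized channel $\hat{F}_c$. RN-B is a basic implementation of RN, and the region mask is obtained from the original inpainting mask.
For each channel, there are two sets of learnable parameters $[\gamma^1_c, \beta^1_c]$ and $[\gamma^2_c, \beta^2_c]$ for the affine transformation of each region. For ease of notation, we denote $[\gamma^1_c, \gamma^2_c]$ as $\gamma$ and $[\beta^1_c, \beta^2_c]$ as $\beta$. The RN-B layer is shown in Figure 3(a).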
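A minimal sketch of an RN-B style layer, under stated assumptions, is given below. It is not the released implementation: the function name `rn_b`, the mask convention (1 marks a corrupted pixel) and the layout of `gamma` and `beta` as per-channel pairs indexed by region are all hypothetical choices made for illustration.

```python
import math


def _mean_std(values, eps):
    """Mean and standard deviation (with eps inside the square root)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var + eps)


def rn_b(feature, mask, gamma, beta, eps=1e-5):
    """Normalize and affinely transform two regions of each channel separately.

    feature: C x H x W nested lists.
    mask:    H x W nested lists, 0 = uncorrupted, 1 = corrupted (assumed).
    gamma, beta: per channel, a pair [uncorrupted, corrupted] of parameters.
    """
    height, width = len(mask), len(mask[0])
    output = []
    for c, channel in enumerate(feature):
        result = [row[:] for row in channel]
        for region in (0, 1):
            coords = [(h, w) for h in range(height) for w in range(width)
                      if mask[h][w] == region]
            if not coords:
                continue  # this region does not occur in the mask
            mean, std = _mean_std([channel[h][w] for h, w in coords], eps)
            for h, w in coords:
                normalized = (channel[h][w] - mean) / std
                result[h][w] = gamma[c][region] * normalized + beta[c][region]
        output.append(result)
    return output


feature = [[[1.0, 3.0],
            [255.0, 255.0]]]   # one channel, bottom row corrupted
mask = [[0, 0],
        [1, 1]]
out = rn_b(feature, mask, gamma=[[1.0, 2.0]], beta=[[0.0, 5.0]])
```

Because each region has its own statistics and its own affine parameters, the constant corrupted row does not affect how the uncorrupted row is normalized, and the affine transformation does not mix the two regions.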
After passing through several convolutional layers, the corrupted regions are fused gradually, and obtaining an accurate region mask from the original mask is hard. RN-L addresses the issue by automatically detecting corrupted regions and obtaining a region mask. To further improve the reconstruction, RN-L enhances the fusion of corrupted and uncorrupted regions by global affine transformation. RN-L boosts the corrupted region reconstruction in a soft way, which solves the mean and variance shift problem and also enhances the fusion. Therefore, RN-L is suitable for the later layers of the network. Note that RN-L does not need a region mask and that the affine parameters of RN-L are pixel-wise. RN-L is illustrated in Figure 3(b).
RN-L generates a spatial response map by taking advantage of the spatial relationship of the features themselves. Specifically, RN-L first performs max-pooling and average-pooling along the channel axis. The two pooling operations are able to obtain an efficient feature descriptor [31, 24]. RN-L then concatenates the two pooling results and applies a convolution with sigmoid activation to the two maps to get a spatial response map. The spatial response map is computed as:
$$M_{sr} = \sigma\left(\mathrm{Conv}([F_{max}, F_{avg}])\right)$$
Here $F_{max}$ and $F_{avg}$ are the max-pooling and average-pooling results of the input feature $F$. $\mathrm{Conv}$ is the convolution operation and $\sigma$ is the sigmoid function. $M_{sr}$ is the spatial response map. To get a region mask $M$ for RN, we apply a threshold $t$ to the spatial response map:
$$M_{h,w} = \begin{cases} 1 & \text{if } M_{sr}(h, w) > t \\ 0 & \text{otherwise} \end{cases}$$
We set the threshold $t = 0.8$ in this work. Note that the thresholding operation is only performed in the inference stage and the gradients do not pass through it during backpropagation.
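The response-map steps above can be sketched as follows. This is a simplified illustration, not the paper's layer: the convolution is replaced by a hypothetical 1x1 convolution, so that each output pixel is just a weighted sum of the two pooled values at that location, and the weights `w_max`, `w_avg` and `bias` are made-up stand-ins for learned parameters. Which side of the threshold corresponds to the corrupted region is also left open here.

```python
import math


def spatial_response(feature, w_max, w_avg, bias=0.0):
    """Channel-wise max/avg pooling, a 1x1 'convolution' and a sigmoid.

    feature: C x H x W nested lists. The 1x1 kernel is a simplifying
    assumption; a real layer would use a larger learned kernel.
    """
    channels = len(feature)
    height, width = len(feature[0]), len(feature[0][0])
    response = []
    for h in range(height):
        row = []
        for w in range(width):
            column = [feature[c][h][w] for c in range(channels)]
            f_max = max(column)               # max-pooling along channels
            f_avg = sum(column) / channels    # average-pooling along channels
            logit = w_max * f_max + w_avg * f_avg + bias
            row.append(1.0 / (1.0 + math.exp(-logit)))
        response.append(row)
    return response


def region_mask(response, threshold=0.8):
    """Binarize the response map: 1 where the response exceeds the threshold."""
    return [[1 if r > threshold else 0 for r in row] for row in response]


feature = [[[2.0, -2.0]],
           [[0.0, -4.0]]]      # 2 channels, 1 x 2 spatial size
response = spatial_response(feature, w_max=1.0, w_avg=1.0)
mask = region_mask(response)
```

For the toy input, the first location gets a response above the 0.8 threshold and the second a response close to zero, so the two pixels are assigned to different regions.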
Based on the mask $M$, RN normalizes the input feature $F$ and then performs a pixel-wise affine transformation. The affine parameters $\gamma$ and $\beta$ are obtained by convolution on the spatial response map $M_{sr}$:
$$\gamma = \mathrm{Conv}(M_{sr}), \quad \beta = \mathrm{Conv}(M_{sr})$$
Note that the values of $\gamma$ and $\beta$ are expanded along the channel dimension in the affine transformation.
The spatial response map has global spatial information. Convolution on it can learn a global representation, which boosts the fusion of corrupted and uncorrupted regions.
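The pixel-wise affine step can be illustrated with the short sketch below. It again assumes a 1x1 convolution, so `gamma` and `beta` are simple linear functions of the response at each pixel, and it takes an already normalized feature as input. The function name `pixelwise_affine` and the four scalar weights are hypothetical.

```python
def pixelwise_affine(normalized, response,
                     gamma_w, gamma_b, beta_w, beta_b):
    """Apply gamma * x + beta with per-pixel gamma and beta.

    normalized: C x H x W nested lists (already region-normalized).
    response:   H x W spatial response map.
    gamma and beta are produced from the response map by an assumed
    1x1 convolution and shared by all channels.
    """
    gamma = [[gamma_w * r + gamma_b for r in row] for row in response]
    beta = [[beta_w * r + beta_b for r in row] for row in response]
    height, width = len(response), len(response[0])
    # The same H x W gamma/beta maps are reused for every channel, which
    # is the expansion along the channel dimension.
    return [
        [
            [gamma[h][w] * channel[h][w] + beta[h][w] for w in range(width)]
            for h in range(height)
        ]
        for channel in normalized
    ]


normalized = [[[1.0, -1.0]],
              [[2.0, 0.0]]]
response = [[0.5, 1.0]]
out = pixelwise_affine(normalized, response,
                       gamma_w=2.0, gamma_b=0.0, beta_w=0.0, beta_b=0.5)
```

Each spatial location gets its own scale and offset, derived from the response map, while all channels at that location share them.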
EdgeConnect (EC) consists of an edge generator and an image generator. The image generator is a simple yet effective network originally proposed by Johnson et al. We use only the image generator as our backbone generator. We replace the original instance normalization (IN) of the backbone generator with our two kinds of RN, RN-B and RN-L. Our generator architecture is shown in Figure 4. Following Sections 3.2 and 3.3, we apply RN-B in the early layers (encoder) of our generator and RN-L in the intermediate and later layers (the residual blocks and decoder). Note that the input mask of RN-B is sampled from the original inpainting mask, while RN-L does not need an external input as it generates region masks internally. We apply the same discriminators (PatchGAN [15, 33]) and loss functions (reconstruction loss, adversarial loss, perceptual loss and style loss) of the original backbone model to our model. The code is available at https://github.com/geekyutao/RN.
We first compare our method with current state-of-the-art methods. We then conduct an ablation study to explore the properties of RN and visualize our methods. Finally, we generalize RN to some other state-of-the-art methods.
We evaluate our methods on Places2  and CelebA  datasets. We use two kinds of image masks: regular masks which are fixed square masks (occupying a quarter of the image) and irregular masks from . The irregular mask dataset contains 12000 irregular masks and the masked area in each mask occupies 0-60% of the total image size. Besides, the irregular dataset is grouped into six intervals according to the mask area, 0-10%, 10-20%, 20-30%, 30-40%, 40-50% and 50-60%. Each interval has 2000 masks.
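Grouping masks by masked area, as described above, can be sketched like this. The function name `mask_area_interval` is hypothetical, the mask convention (1 marks a masked pixel) is an assumption, and how masks that fall exactly on a boundary are assigned is a choice made here, not something stated in the text.

```python
def mask_area_interval(mask, intervals=6, width_percent=10):
    """Return the index of the area interval a binary mask falls into.

    mask: H x W nested lists with 1 for masked pixels (assumed convention).
    Index 0 is 0-10%, index 1 is 10-20%, and so on up to 50-60%.
    """
    total = sum(len(row) for row in mask)
    masked = sum(sum(row) for row in mask)
    if masked * 100 > total * intervals * width_percent:
        raise ValueError("masked area exceeds the largest interval")
    # Integer arithmetic avoids floating-point surprises at interval edges.
    index = (masked * 100) // (total * width_percent)
    return min(index, intervals - 1)


mask = [[1, 1, 0, 0],
        [1, 1, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0]]          # 4 of 16 pixels masked, i.e. 25%
interval = mask_area_interval(mask)
```

The example mask covers a quarter of the image and therefore lands in the 20-30% interval (index 2).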
We compare our method to four current state-of-the-art methods and the baseline.
- CA: Contextual Attention .
- PC: Partial Convolution .
- GC: Gated Convolution .
- EC: EdgeConnect .
- Baseline: the backbone network we used. The baseline model uses instance normalization instead of RN.
We test all models on the full validation set (36500 images) of Places2. We compare our model with CA, PC, GC, EC and the baseline. Three commonly used metrics are reported: PSNR, SSIM with window size 11, and $\ell_1$ loss. We give the results of the quantitative comparisons in Table 1. The second column is the area of the irregular masks at testing time. Note that one setting in Table 1 uses all irregular masks (0-60%) when testing. Our model surpasses all the compared models on all three metrics. Compared to the baseline, our model improves PSNR by 0.73 dB and SSIM by 0.017, and reduces the $\ell_1$ loss (%) by 0.25 in that setting.
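For reference, two of the metrics can be computed as in the sketch below. PSNR follows its standard definition for 8-bit images. Expressing the mean absolute error as a percentage of the maximum pixel value is an assumption about how the loss percentage is reported, and SSIM is omitted because it needs a windowed implementation. The function names are hypothetical.

```python
import math


def psnr(reference, output, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two flat pixel lists."""
    mse = sum((r - o) ** 2 for r, o in zip(reference, output)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def l1_percent(reference, output, max_value=255.0):
    """Mean absolute error as a percentage of the pixel range (assumed form)."""
    mae = sum(abs(r - o) for r, o in zip(reference, output)) / len(reference)
    return 100.0 * mae / max_value


reference = [0.0, 100.0, 200.0, 255.0]
output = [0.0, 110.0, 190.0, 255.0]
score_psnr = psnr(reference, output)
score_l1 = l1_percent(reference, output)
```

Higher PSNR and lower mean absolute error both indicate that the completed image is closer to the ground truth.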
Figure 5 compares images generated by CA, PC, GC, EC, the baseline and ours. The first two rows of input images are taken from Places2 validation dataset and the last two rows are taken from CelebA validation dataset. In addition, the first three rows show the results in irregular mask case and the last row shows regular mask (fixed square mask in center) case. Our method achieves better subjective results, which benefits from RN-B’s eliminating the impact of the mean and variance shifts on training, and RN-L’s further boosting the reconstruction of corrupted regions.
We first explore the source of gain for our methods and the best strategy to apply the two kinds of RN: RN-B and RN-L. We conduct ablation experiments on the backbone generator, which has three stages: an encoder, followed by eight residual blocks and a decoder. We plug RN-B and RN-L into different stages and obtain six architectures (Arch.1-6), as shown in Table 2. The results in Table 2 show the effectiveness of our use of RN: apply RN-B in the early layers (encoder) to solve the mean and variance shifts caused by large corrupted regions; apply RN-L in the later layers to solve the mean and variance shifts and boost the fusion of the two kinds of regions. Arch.1 only applies RN-B in the encoder and achieves a significant performance boost, which directly shows RN-B's effectiveness. Arch.2 and 3 reduce the performance, as RN-B can hardly obtain an accurate region mask in the later layers of the network after the features pass through several convolutional layers. Arch.4 improves on Arch.1 by adding RN-L in the middle residual blocks. Arch.5 (our method) further improves on Arch.4 by applying RN-L in both the residual blocks and the decoder. Note that Arch.6 applies RN-L to the encoder and its performance is reduced compared to Arch.5, since RN-L, a module of soft fusion, unavoidably mixes up information from corrupted and uncorrupted regions and washes away information from the uncorrupted regions. The above results verify the effectiveness of our use of RN-B and RN-L as explained in Sections 3.2 and 3.3.
To verify that our RN is more effective for training the inpainting model, we compare RN with a no-normalization baseline and two full-spatial normalization methods, batch normalization (BN) and instance normalization (IN), based on the same backbone. We show the PSNR curves for the first 10000 iterations in Figure 6 and the final convergence results (about 225,000 iterations) in Table 3. The experiments are on Places2. Note that no normalization (None) is better than full-spatial normalization (IN and BN), and RN is better than no normalization because it eliminates the mean and variance shifts while still taking advantage of the normalization technique.
A threshold $t$ is set in Learnable RN to generate a region mask from the spatial response map. The threshold affects the accuracy of the region mask and thus the effectiveness of RN. We conduct a set of experiments to explore the best threshold. The PSNR results on Places2 and CelebA show that RN-L achieves the best results when the threshold equals 0.8, as shown in Table 4. We show the generated masks of the first RN-L layer in the sixth residual block as an example in Figure 7. The mask generated with $t = 0.8$ is likely to be the most accurate mask in this layer.
We explore the influence of the mask area on RN. Based on the theoretical analysis in Section 3.1, the mean and variance shifts become more severe as the mask area increases. Our experiments on CelebA show that the advantage of our RN becomes more significant as the mask area increases, as shown in Table 5. We use the $\ell_1$ loss to evaluate the results.
We visualize some features of the inpainting network to verify our method. We show the changes of the spatial response and the generated mask of RN-L as the network deepens in the top two rows of Figure 8. The mask changes across layers because of the fusion effect of passing through convolutional layers. RN-L can detect potentially corrupted regions consistently. From the last two rows of Figure 8 we can see: (1) the uncorrupted regions in the encoded feature are well preserved by using RN-B; (2) RN-L can distinguish between potentially different regions and generate a region mask; (3) the gamma and beta maps in RN-L perform a distinct pixel-level transform on potentially corrupted and uncorrupted regions to help their fusion.
RN-B and RN-L are plug-and-play modules in image inpainting networks. We generalize our RN (RN-B and RN-L) to some other backbone networks: CA, PC and GC. We apply RN-B to their early layers (encoder) and RN-L to the later layers. CA and GC are two-stage (coarse-to-fine) inpainting networks, and the coarse result is the input of the refinement network. The corrupted and uncorrupted regions of the coarse result are typically not particularly distinct, thus we only apply RN to the coarse inpainting networks of CA and GC. The results on Places2 are shown in Table 6. The RN-applied CA and PC achieve a significant performance boost of 2.52 and 0.5 dB PSNR respectively. The gain on GC is less pronounced. A possible reason is that the gated convolution of GC greatly smooths features, which makes it hard for RN-L to track potentially corrupted regions. Besides, GC's results are typically blurry, as shown in Figure 5.
In this work, we investigate the impact of normalization on inpainting network and show that Region Normalization (RN) is more effective for image inpainting network, compared with existing full-spatial normalization. The proposed two kinds of RN are plug-and-play modules, which can be applied to other image inpainting networks conveniently. Additionally, our inpainting model works well in real use cases such as object removal, face editing and image restoration, as shown in Figure 9.
In the future, we will explore RN for other supervised vision tasks such as classification, detection and so on.
Perceptual losses for real-time style transfer and super-resolution. In ECCV, pp. 694–711.
Places: a 10 million image database for scene recognition. IEEE TPAMI 40 (6), pp. 1452–1464.