Image inpainting recovers missing contents in corrupted images to improve their visual aesthetics. Deep neural networks have advanced image inpainting by introducing semantic guidance to fill hole regions. Different from traditional methods [7, 2, 8, 3] that propagate uncorrupted image contents to the hole regions via patch-based image matching, deep inpainting methods [25, 13] utilize CNN features at different levels (i.e., from low-level features to high-level semantics) to produce more meaningful and globally consistent results.
Figure 1: (a) Input, (b) GC [40], (c) CSA [21], (d) Ours, (e) GT.
The encoder-decoder architecture is prevalent in existing deep inpainting methods [13, 19, 38, 25]. However, directly training and predicting end to end generates limited results, because the hole region is completely empty. Without sufficient image guidance, an encoder-decoder is not able to reconstruct the whole missing content. An alternative is to use two encoder-decoders to separately learn missing structures and textures in a step-by-step manner. These two-stage methods [28, 24, 26, 41, 37, 40, 21] typically generate an intermediate image with recovered structures in the first stage (i.e., the first encoder-decoder), and send this image to the second stage for texture generation. Although both structures and textures are produced in the output image, their appearances are not consistent. Fig. 1 shows an example: the inconsistent structures and textures within hole regions produce blur and artifacts, as shown in (b) and (c). Meanwhile, the recovered contents are not coherent with the uncorrupted contents around the hole boundaries (e.g., the leaves). This limitation arises because the CNN features representing structures and textures are learned independently. In practice, structures and textures correlate with each other to form the image contents. Without considering their coherence, existing methods are not able to produce visually pleasing results.
In this work, we propose a mutual encoder-decoder to jointly learn CNN features representing structures and textures. The features from the deep layers of the encoder contain structure semantics, while the features from the shallow layers contain texture details. The hole regions of these two features are filled via two separate branches. In the CNN feature space, we use a multi-scale filling block within each branch for hole filling. Each block consists of 3 partial convolution streams with progressively increased kernel sizes. After hole filling in these two features, we propose a feature equalization method to make the structure and texture features consistent with each other. Meanwhile, the equalized features are coherent with the features of uncorrupted image content around the hole boundaries. The proposed feature equalization consists of channel reweighing and bilateral propagation. We first concatenate the two features and perform channel reweighing via attention exploration [12]. The attentions across the two features are set to be consistent after channel equalization. Then, we propose a bilateral propagation activation function to equalize feature consistency across the whole feature maps. This activation function uses elements on the global feature maps to propagate channel consistency (i.e., feature coherence across the hole boundaries), while using elements within local neighboring regions to maintain channel similarities (i.e., feature consistency within the hole). In this way, we fuse the texture and structure features together to reduce inconsistency in the CNN feature maps. The equalized features then supplement the decoder features at all feature levels via encoder-decoder skip connections. The feature consistency is then reflected in the reconstructed output image, where the blur and artifacts around the hole regions are effectively removed, as shown in Fig. 1(d).
Experiments on the benchmark datasets show that the proposed method performs favorably against state-of-the-art approaches.
We summarize the contributions of this work as follows:
- We propose a mutual encoder-decoder network for image inpainting. The CNN features from shallow layers are learned to represent textures and the features from deep layers represent structures.
- We propose a feature equalization method to make structure and texture features consistent with each other. We first reweigh channels after feature concatenation and then apply a bilateral propagation activation function to make the whole feature map consistent.
- Extensive experiments on the benchmark datasets show the effectiveness of the proposed inpainting method in removing blur and artifacts caused by inconsistent structure and texture features. The proposed method performs favorably against state-of-the-art inpainting approaches.
2 Related Works
Empirical Image Inpainting. The empirical image inpainting methods [3, 18, 1] based on diffusion techniques propagate neighborhood appearances to the missing regions. However, they only consider the pixels surrounding the missing regions, so they can only deal with small holes in background inpainting tasks and may fail to generate meaningful structures. In contrast, methods [4, 5, 27, 2, 35] based on patch matching fill missing regions by transferring similar and relevant patches from the remaining image region to the hole region. Although empirical methods handle small holes well in background inpainting, they are not able to generate semantically meaningful content. When the hole region is large, these methods suffer from a lack of semantic guidance.
Deep Image Inpainting. Image inpainting based on deep learning typically involves a generative adversarial network to supplement visual perceptual guidance for hole filling. Pathak et al. [25] first bring adversarial training [9] to inpainting and demonstrate semantic hole filling. Iizuka et al. [13] propose local and global discriminators, assisted by dilated convolution [39], to improve the inpainting quality. Nazeri et al. [24] propose EdgeConnect, which predicts salient edges for inpainting guidance. Song et al. [28] utilize a segmentation prediction network to generate segmentation guidance for detail refinement around the hole region. Xiong et al. [33] present foreground-aware inpainting, which involves three stages (i.e., contour detection, contour completion and image completion) to disentangle structure inference and content hallucination. Ren et al. [26] introduce a structure-aware network, which splits the inpainting task into two parts: structure reconstruction and texture generation. It uses appearance flow to sample features from contextual regions. Yan et al. [36] exploit the relationship between the contextual regions in the encoder layer and the associated hole region in the decoder layer for better predictions. Yu et al. [41] and Song et al. [37] search for a collection of background patches with the highest similarity to the contents generated in the first-stage prediction. Liu et al. [20] address the inpainting task by exploiting a partial convolutional layer and a mask-update operation. Following [20], Yu et al. [40] present gated convolution, which learns a dynamic mask-update mechanism and is combined with an SN-PatchGAN discriminator to achieve better predictions. Liu et al. [21] propose coherent semantic attention, which considers the feature coherency of hole regions to guarantee pixel continuity at the image level. Wang et al. [31] propose a generative multi-column convolutional neural network (GMCNN) that uses varied receptive fields in its branches. Different from existing deep inpainting methods, our method produces CNN features that consistently represent structures and textures, reducing blur and artifacts around the hole region.
3 Proposed Algorithm
Fig. 2 shows the pipeline of the proposed method. We use one mutual encoder-decoder to jointly learn structure and texture features and equalize them for consistent representation. The details are presented in the following:
3.1 Mutual Encoder-Decoder
We use an encoder-decoder for end-to-end image generation to fill holes. The encoder-decoder is a simplified generative network with 6 convolutional layers in the encoder and 5 convolutional layers in the decoder. Meanwhile, 4 residual blocks [10] with dilated convolutions are set between the encoder and the decoder. The dilated convolutions [13, 24] increase the size of the receptive field to perceive encoder features.
In the encoder, we reorganize the CNN features from deep layers as structure features, where the semantics reside. Meanwhile, we reorganize the CNN features from shallow layers as texture features to represent image details. We denote the structure features as $F_{st}$ and the texture features as $F_{te}$, as shown in Fig. 2. The reorganization process resizes and transforms the CNN feature maps from different convolutional layers to the same size, and concatenates them accordingly.
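The reorganization step described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the layer shapes, and the use of nearest-neighbour resizing are all assumptions, and the learned transform mentioned in the text is omitted.

```python
import numpy as np

def resize_nearest(feat, out_h, out_w):
    """Nearest-neighbour resize of a (C, H, W) feature map."""
    _, h, w = feat.shape
    rows = np.arange(out_h) * h // out_h   # source row for each output row
    cols = np.arange(out_w) * w // out_w   # source column for each output column
    return feat[:, rows[:, None], cols[None, :]]

def reorganize(feature_maps, out_h, out_w):
    """Resize every map to (out_h, out_w) and concatenate along channels."""
    resized = [resize_nearest(f, out_h, out_w) for f in feature_maps]
    return np.concatenate(resized, axis=0)

# Hypothetical encoder outputs: three shallow layers and three deep layers.
shallow = [np.random.rand(8, 64, 64), np.random.rand(16, 32, 32), np.random.rand(32, 16, 16)]
deep = [np.random.rand(64, 8, 8), np.random.rand(64, 4, 4), np.random.rand(64, 2, 2)]

texture_features = reorganize(shallow, 32, 32)    # shallow layers -> texture features
structure_features = reorganize(deep, 8, 8)       # deep layers -> structure features
```

Grouping by depth keeps fine detail and semantics in separate tensors, so each can be filled by its own branch.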
After CNN feature reorganization, we design two branches (i.e., the structure branch and the texture branch) to separately perform hole filling on the structure features $F_{st}$ and the texture features $F_{te}$. The architectures of these two branches are the same. In each branch, there are 3 parallel streams to fill holes at multiple scales. Each stream consists of 5 partial convolutions [20] with the same kernel size, while the kernel size differs among streams. By using different kernel sizes, we perform multi-scale filling in each branch for the input CNN features. The filled features from the 3 streams (i.e., 3 scales) are concatenated and mapped to the same size as the input feature map via a convolution. We denote the output of the structure branch as $F_{fst}$ and the output of the texture branch as $F_{fte}$. To ensure that hole filling focuses on structures and textures, we incorporate supervision on $F_{fst}$ and $F_{fte}$. We use a convolution to separately map $F_{fst}$ and $F_{fte}$ to a color image $I_{ost}$ and a color image $I_{ote}$, respectively. The pixel-wise L1 loss can be written as follows:

$$\mathcal{L}_{rst} = \| I_{ost} - I_{st} \|_1, \qquad \mathcal{L}_{rte} = \| I_{ote} - I_{gt} \|_1, \tag{1}$$

where $I_{gt}$ is the ground-truth image and $I_{st}$ is the structure image of $I_{gt}$ [34].
The hole regions in the structure features $F_{st}$ and the texture features $F_{te}$ are filled via the structure and texture branches individually. The feature representations in the branch outputs $F_{fst}$ and $F_{fte}$ are not consistent in reflecting the recovered structures and textures. This inconsistency leads to blur and artifacts within and around the hole regions, as shown in Fig. 1. To mitigate these effects, we first concatenate $F_{fst}$ and $F_{fte}$, and make a simple fusion to generate the fused feature $F_{sf}$ via a convolutional layer. The texture and structure representations in $F_{sf}$ are corrected via feature equalization at different CNN feature levels (i.e., from shallow to deep CNN layers).
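The multi-scale filling used in each branch builds on partial convolution, which convolves only valid pixels, renormalises by the fraction of valid pixels in the window, and then updates the mask. The sketch below illustrates that mechanism on a single-channel map. It is only an approximation under stated assumptions: the kernel sizes (3, 5, 7) are hypothetical, an averaging kernel stands in for learned weights, and the final convolution that maps the stacked streams back to the input size is omitted.

```python
import numpy as np

def partial_conv(feat, mask, kernel):
    """One single-channel partial convolution with mask update.

    feat, mask: (H, W) arrays; mask is 1 for valid pixels and 0 for holes.
    kernel: (k, k) array with odd k. Zero padding keeps the spatial size.
    """
    k = kernel.shape[0]
    pad = k // 2
    f = np.pad(feat * mask, pad)
    m = np.pad(mask, pad)
    out = np.zeros_like(feat, dtype=float)
    new_mask = np.zeros_like(mask, dtype=float)
    for i in range(feat.shape[0]):
        for j in range(feat.shape[1]):
            valid = m[i:i + k, j:j + k].sum()
            if valid > 0:
                # Renormalise by the fraction of valid pixels in the window.
                out[i, j] = (kernel * f[i:i + k, j:j + k]).sum() * (k * k / valid)
                new_mask[i, j] = 1.0   # the position becomes valid for the next layer
    return out, new_mask

def multi_scale_fill(feat, mask, kernel_sizes=(3, 5, 7), layers=5):
    """Run one stream per kernel size and stack the filled maps."""
    streams = []
    for k in kernel_sizes:
        kernel = np.full((k, k), 1.0 / (k * k))   # stand-in for learned weights
        f, m = feat.astype(float), mask.astype(float)
        for _ in range(layers):
            f, m = partial_conv(f, m, kernel)
        streams.append(f)
    return np.stack(streams, axis=0)

feat = np.ones((16, 16))
mask = np.ones((16, 16))
mask[5:11, 5:11] = 0.0                            # a 6x6 hole in the centre
filled = multi_scale_fill(feat, mask)             # shape (3, 16, 16)
```

Larger kernels close the hole in fewer layers, while smaller kernels fill it gradually from the boundary, which is why combining several kernel sizes gives a multi-scale result.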
3.2 Feature Equalizations
We equalize the fused CNN features $F_{sf}$ in both the channel and spatial domains. The channel equalization follows the squeeze-and-excitation operation [12] to ensure that the attentions within each channel of $F_{sf}$ are the same. As the reweighed channels are influenced by both structure and texture representations in $F_{sf}$, the consistent attentions indicate that these representations are set to be consistent as well. We propagate channel equalization to the spatial domain via the proposed bilateral propagation activation function (BPA).
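The channel reweighing step can be illustrated with a minimal squeeze-and-excitation sketch. The reduction ratio `r`, the random weights, and the function name are assumptions made for illustration; in a trained network the two weight matrices would be learned.

```python
import numpy as np

def squeeze_excite(feat, w1, w2):
    """Reweigh the channels of a (C, H, W) feature map.

    w1: (C // r, C) and w2: (C, C // r) play the role of two fully connected layers.
    """
    squeezed = feat.mean(axis=(1, 2))                 # global average pooling -> (C,)
    hidden = np.maximum(w1 @ squeezed, 0.0)           # ReLU
    weights = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))    # sigmoid gate per channel
    return feat * weights[:, None, None], weights

rng = np.random.default_rng(0)
C, r = 8, 2                                           # hypothetical channel count and reduction
feat = rng.standard_normal((C, 4, 4))
w1 = rng.standard_normal((C // r, C))
w2 = rng.standard_normal((C, C // r))
reweighed, weights = squeeze_excite(feat, w1, w2)
```

Because every spatial position in a channel is scaled by the same weight, the attention is uniform within each channel, which matches the channel equalization described above.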
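BPA is inspired by edge-preserving image smoothing [29] and generates response values based on spatial and range distances. It can be written as follows:

$$x_i^{s} = \frac{1}{C(x)} \sum_{j \in s} g_{\alpha_s}(\| j - i \|)\, x_j, \tag{2}$$

$$x_i^{r} = \frac{1}{C(x)} \sum_{j \in v} f(x_i, x_j)\, x_j, \tag{3}$$

$$y_i = q(x_i^{s}, x_i^{r}), \tag{4}$$

where $x_i$ is the feature channel at position $i$ of the input feature $x$, $x_j$ is a neighboring feature channel around $x_i$ at position $j$, and $x_i^{s}$ and $x_i^{r}$ are the feature channels after spatial and range similarity measurements. We set the normalization factor as $C(x) = N$, where $N$ is the number of positions in $x$. We use $q(\cdot)$ to denote the concatenation and channel reduction of $x_i^{s}$ and $x_i^{r}$ via a convolutional layer.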
The bilateral propagation utilizes the distances of feature channels in both the spatial and range domains. We compute the spatial term $x_i^{s}$ within a neighboring region $s$, which is set to the same spatial size as the input feature for global propagation. The spatial contributions from neighboring feature channels are adjusted via a Gaussian function $g_{\alpha_s}$. When computing the range term $x_i^{r}$, we measure the similarities between feature channels $x_i$ and $x_j$ via $f(\cdot)$ within a local neighboring region $v$ around position $i$. In this way, the bilateral propagation considers both global continuity via $x_i^{s}$ and local consistency via $x_i^{r}$.
During the range similarity computation step, we define the pairwise function $f(\cdot)$ as a dot product operation, which can be written as follows:

$$f(x_i, x_j) = (x_i)^{\top} (x_j). \tag{5}$$
The proposed bilateral propagation is similar to the non-local block [30] in that, for each $i$, $\frac{1}{C(x)} f(x_i, x_j)$ becomes the softmax computation along dimension $j$. The difference lies in the design of the propagation region. The non-local block uses feature channels from all positions to generate $y_i$, and the similarity is only measured between $x_i$ and $x_j$. In contrast, BPA considers both the feature channel similarity and the spatial distance between $x_i$ and $x_j$ during bilateral weight computation. In addition, we use a global region to compute the spatial distance while using a local region to compute the range distance. The advantage of combining global and local regions is that we ensure both long-range continuity over the whole spatial region and local consistency around the current feature channel. The boundaries of hole regions are unified with the neighboring image content, and the contents within the hole regions are set to be consistent.
Fig. 3 shows how bilateral propagation operates in the network. The range step corresponds to the computation of the range term $x_i^{r}$ in eq. 3 and the spatial step corresponds to the computation of the spatial term $x_i^{s}$ in eq. 2. During range computation, the operations until the element-wise multiplication P represent the pairwise function $f(x_i, x_j)$ (eq. 5), obtaining all the neighboring $x_j$ for each $x_i$ so that we can make efficient element-wise matrix multiplications. Similarly, the operations until P represent the range term in eq. 3. During spatial computation, the operations until P represent the spatial weighting term in eq. 2. As a result, the bilateral propagation operation can be efficiently executed via the element-wise matrix multiplications and additions shown in Fig. 3.
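The two steps of bilateral propagation can be sketched with plain matrix operations. This is a toy version under stated assumptions, not the network implementation: the Gaussian width `sigma`, the window radius `radius`, and the 1x1 fusion weights are hypothetical, and both terms are normalised by the number of positions as described above.

```python
import numpy as np

def bilateral_propagation(x, sigma=1.0, radius=1):
    """x: (C, H, W). Returns the spatial term and the range term, both (C, H, W)."""
    c, h, w = x.shape
    n = h * w
    flat = x.reshape(c, n)                                    # column j holds x_j
    ys, xs = np.divmod(np.arange(n), w)
    pos = np.stack([ys, xs], axis=1).astype(float)            # (N, 2) pixel coordinates
    diff = pos[:, None, :] - pos[None, :, :]                  # (N, N, 2)

    # Spatial step: Gaussian weights over the whole map (global region).
    g = np.exp(-(diff ** 2).sum(-1) / (2.0 * sigma ** 2))
    x_s = (flat @ g.T) / n

    # Range step: dot-product similarity restricted to a local window.
    f = flat.T @ flat                                         # f[i, j] = x_i . x_j
    local = np.abs(diff).max(-1) <= radius                    # True inside the window
    x_r = (flat @ (f * local).T) / n

    return x_s.reshape(c, h, w), x_r.reshape(c, h, w)

def fuse(x_s, x_r, weight):
    """Concatenate along channels and reduce with a 1x1 convolution; weight is (C, 2C)."""
    stacked = np.concatenate([x_s, x_r], axis=0)              # (2C, H, W)
    return np.einsum('oc,chw->ohw', weight, stacked)

x = np.random.default_rng(0).standard_normal((4, 6, 6))
x_s, x_r = bilateral_propagation(x)
y = fuse(x_s, x_r, np.hstack([np.eye(4), np.eye(4)]))         # identity-like fusion weights
```

The dense (N, N) matrices make the global spatial term explicit but scale quadratically with the number of positions, so this form is only practical for small feature maps.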
3.3 Loss Functions
We introduce several loss functions to measure structure and texture differences during training, including the pixel reconstruction loss, perceptual loss, style loss and relativistic average LS adversarial loss. We also employ a discriminator with local and global operations to ensure local-global content consistency. Spectral normalization [23] is applied in both the local and global discriminators to stabilize training.
Pixel Reconstruction Loss.
We measure the pixel-wise difference from two aspects. The first one is the loss terms in eq. 1, where we add supervision on the texture and structure branches. The second one measures the similarity between the network output and the ground truth, which can be written as follows:

$$\mathcal{L}_{re} = \| I_{out} - I_{gt} \|_1,$$

where $I_{out}$ is the final image predicted by the network and $I_{gt}$ is the ground-truth image.
To capture high-level semantics and simulate human perception of image quality, we utilize the perceptual loss [15] defined on the ImageNet-pretrained VGG-16 feature backbone:

$$\mathcal{L}_{perc} = \mathbb{E}\Big[ \sum_i \| \Phi_i(I_{out}) - \Phi_i(I_{gt}) \|_1 \Big],$$

where $\Phi_i$ is the activation map of the $i$-th layer of the VGG-16 backbone. In our work, $\Phi_i$ corresponds to the activation maps from layers ReLU1_1, ReLU2_1, ReLU3_1, ReLU4_1, and ReLU5_1.
The transposed convolutional layers in the decoder introduce artifacts that resemble a checkerboard. To mitigate this effect, we introduce the style loss. Given feature maps of size $C_j \times H_j \times W_j$, we compute the style loss as follows:

$$\mathcal{L}_{style} = \mathbb{E}_j \Big[ \| G_j^{\Phi}(I_{out}) - G_j^{\Phi}(I_{gt}) \|_1 \Big],$$

where $G_j^{\Phi}$ is a $C_j \times C_j$ Gram matrix constructed from the selected activation maps. These activation maps are the same as those used in the perceptual loss.
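A Gram-matrix style loss of this kind can be sketched as below. The normalisation by the number of elements and the mean absolute difference are assumptions for illustration, and random arrays stand in for the VGG-16 activation maps.

```python
import numpy as np

def gram_matrix(feat):
    """Gram matrix of a (C, H, W) activation map, normalised by C * H * W."""
    c, h, w = feat.shape
    flat = feat.reshape(c, h * w)
    return flat @ flat.T / (c * h * w)            # (C, C) channel correlations

def style_loss(feats_out, feats_gt):
    """Mean absolute Gram difference, averaged over the selected layers."""
    terms = [np.abs(gram_matrix(a) - gram_matrix(b)).mean()
             for a, b in zip(feats_out, feats_gt)]
    return float(np.mean(terms))

rng = np.random.default_rng(0)
# Hypothetical activations from two layers for the output and the ground truth.
feats_out = [rng.standard_normal((8, 16, 16)), rng.standard_normal((16, 8, 8))]
feats_gt = [rng.standard_normal((8, 16, 16)), rng.standard_normal((16, 8, 8))]
loss = style_loss(feats_out, feats_gt)
```

The Gram matrix discards spatial layout and keeps only channel co-occurrence statistics, so the loss compares texture statistics rather than exact pixel positions.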
Relativistic Average LS Adversarial Loss.
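We follow [41] to utilize global and local discriminators for perception enhancement. The relativistic average LS adversarial loss [16] is adopted for our discriminators. For the generator, the adversarial loss is defined as:

$$\mathcal{L}_{G} = \mathbb{E}_{x_r}\Big[ \big( D(x_r) - \mathbb{E}_{x_f}[D(x_f)] + 1 \big)^2 \Big] + \mathbb{E}_{x_f}\Big[ \big( D(x_f) - \mathbb{E}_{x_r}[D(x_r)] - 1 \big)^2 \Big],$$

where $D$ indicates the local or global discriminator without the last sigmoid function, and the real and fake data pairs $(x_r, x_f)$ are sampled from the ground-truth and output images.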
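For reference, the standard relativistic average least-squares losses can be computed from raw discriminator scores as in the sketch below. The function names and the toy score arrays are illustrative assumptions; the scores are taken before any sigmoid.

```python
import numpy as np

def rals_generator_loss(d_real, d_fake):
    """Relativistic average LS loss for the generator."""
    real_rel = d_real - d_fake.mean()     # real scores relative to the average fake score
    fake_rel = d_fake - d_real.mean()     # fake scores relative to the average real score
    return float(np.mean((real_rel + 1.0) ** 2) + np.mean((fake_rel - 1.0) ** 2))

def rals_discriminator_loss(d_real, d_fake):
    """Relativistic average LS loss for the discriminator (targets are swapped)."""
    real_rel = d_real - d_fake.mean()
    fake_rel = d_fake - d_real.mean()
    return float(np.mean((real_rel - 1.0) ** 2) + np.mean((fake_rel + 1.0) ** 2))

# Toy scores from a hypothetical discriminator.
d_real = np.array([1.0, 1.0])
d_fake = np.array([-1.0, -1.0])
g_loss = rals_generator_loss(d_real, d_fake)
d_loss = rals_discriminator_loss(d_real, d_fake)
```

The generator loss is smallest when fake samples score higher than the average real sample by a margin of one, so the generator is trained against the real data rather than against a fixed label.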
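The whole objective function of the proposed network can be written as:

$$\mathcal{L}_{total} = \lambda_{re} \mathcal{L}_{re} + \lambda_{perc} \mathcal{L}_{perc} + \lambda_{style} \mathcal{L}_{style} + \lambda_{adv} \mathcal{L}_{G} + \lambda_{rst} \mathcal{L}_{rst} + \lambda_{rte} \mathcal{L}_{rte},$$

where $\lambda_{re}$, $\lambda_{perc}$, $\lambda_{style}$, $\lambda_{adv}$, $\lambda_{rst}$ and $\lambda_{rte}$ are the tradeoff parameters, which we set empirically in our implementation.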
We use a structure branch and a texture branch to separately fill holes in CNN feature space. Then, we perform feature equalization to enable consistent feature representations in different feature levels for output image reconstruction. In this section, we visualize the feature maps during different steps to show whether they correspond to our objectives. We use a convolutional layer to map CNN feature maps to color images for a clear display.
Fig. 4 shows the visualization results. The input image is shown in (a) with a mask in the center. The visualized texture features $F_{te}$ and structure features $F_{st}$ are shown in (b) and (f), respectively. We observe that textures are preserved in (b) while structures are preserved in (f). With multi-scale hole filling, the hole regions in the filled features $F_{fte}$ and $F_{fst}$ are effectively reduced, as shown in (c) and (g). After equalization, the hole regions in (h) are effectively filled, and the equalized features contribute to the decoder to generate the output image shown in (e).
We evaluate our method on three datasets: Paris StreetView [6], Place2 [43] and CelebA [22]. We follow the training, testing, and validation splits of these three datasets. Data augmentation such as flipping is also adopted during training. Our model is optimized by the Adam optimizer [17] on a single NVIDIA 2080 Ti GPU. The training of the CelebA model, the Paris StreetView model and the Place2 model is stopped after 6 epochs, 30 epochs and 60 epochs, respectively. All the masks and images for training and testing have a size of 256×256.
We compare our method with six state-of-the-art methods: CE [25], CA [41], SH [36], CSA [21], SF [26] and GC [40]. For a fair evaluation of model generalization abilities, we conduct experiments on filling center holes and irregular holes in the input images. The center hole is produced by a mask that covers the image center. We obtain irregular masks from PConv [20]. These masks are divided into categories according to the ratio of the hole region to the entire image size (i.e., below 10%, from 10% to 20%, etc.). For holes in the image center, we compare with CA [41], SH [36] and CE [25] on the CelebA [22] validation set. We choose these three methods because they are more effective at filling holes in the image center than at filling irregular holes. When handling irregular holes in the input images, we compare with CSA [21], SF [26] and GC [40] using the Paris StreetView [6] and Place2 [43] validation sets.
Figure 6: (a) Input, (b) GC [40], (c) SF [26], (d) CSA [21], (e) Ours, (f) GT.
4.1 Visual Evaluations
The visual comparisons for filling center holes are shown in Fig. 5, and the results for filling irregular holes are shown in Fig. 6. We also display ground truth images in (f) to show the actual image content. In Fig. 5, the input images are shown in (a). The results produced by CE and CA contain distorted structures and blurry textures, as shown in (b) and (c). Although more visually pleasing contents are generated in (d), the semantics remain unreasonable. By utilizing consistent structure and texture features, our method is effective in generating results with realistic textures.
Fig. 6 shows the comparison for filling irregular holes, which is more challenging than filling center holes. The results from GC contain noisy patterns, as shown in (b). The details are missing and the structures are distorted in (c) and (d). These methods are not effective in recovering image contents without introducing obvious artifacts (e.g., the second row around the door regions). In contrast, our method learns to represent structures and textures in a consistent form. The results shown in (e) indicate the effectiveness of our method in producing visually pleasing contents. The evaluations on filling both center holes and irregular holes indicate that our method performs favorably against existing hole filling approaches.
4.2 Numerical Evaluations
We conduct numerical evaluations on the Place2 dataset with different mask ratios, using 100 validation images chosen from the “valley” scene category. In addition, we evaluate numerically on the CelebA dataset with center holes in the input images, using 500 randomly chosen images. For the evaluation metrics, we follow prior work and use SSIM [32] and PSNR. Moreover, we introduce the FID (Fréchet Inception Distance) metric [11], as it indicates the perceptual quality of the results. The evaluation results are shown in Table 1 and Table 2. Our method outperforms existing methods in filling center holes. Meanwhile, it achieves favorable performance in filling irregular holes under various hole-to-image ratios.
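As a reminder of how one of these metrics is computed, PSNR follows directly from the mean squared error between the prediction and the ground truth. The sketch below assumes 8-bit images (peak value 255); the function name and toy arrays are illustrative.

```python
import numpy as np

def psnr(pred, target, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two images of the same shape."""
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    if mse == 0:
        return float('inf')                 # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))

target = np.zeros((256, 256, 3), dtype=np.uint8)
pred = np.ones((256, 256, 3), dtype=np.uint8)   # off by one intensity level everywhere
score = psnr(pred, target)
```

Higher PSNR means a smaller pixel-wise error, which is why it is usually reported together with SSIM and a perceptual metric such as FID.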
Human Subject Evaluation.
Following prior work, we involve over 35 volunteers to evaluate the results on the CelebA, Place2 and Paris StreetView datasets. The volunteers are all image experts with an image processing background. There are 20 questions for each subject. In each question, the subject needs to select the most realistic result from 4 results generated by different methods, without knowing the hole region in advance. We tally the votes and show the statistics in Table 3. Our method performs favorably against existing methods.
5 Ablation Study
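Figure (structure and texture branch ablation): (a) Input, (b) and (c) Ours without one of the two branches, (d) Ours, (e) Ground truth.
Figure (feature equalization ablation): (a) Input, (b) Ours without equalization, (c) Non-Local, (d) Ours, (e) Ground truth.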
Structure and Texture branches.
To evaluate the effects of the structure and texture branches, we use each of these branches separately for network training. For fair comparisons, we expand the channel number of the texture and structure branch outputs via additional convolutions, so that the single-branch output has the same size as the fused feature $F_{sf}$. As shown in Fig. 8, the output of our method without the texture branch contains rich structure information (e.g., the window in the red and green boxes) while the textures are missing. In comparison, the output of our method without the structure branch does not contain meaningful structure (e.g., the window in the red and green boxes). By utilizing both branches, our method achieves favorable results on both structures and textures. Table 4 shows similar numerical performance on the Paris StreetView dataset, where using both branches improves our method significantly.
We show the contributions of feature equalizations by removing them from the pipeline and measuring the performance degradation. Moreover, we show that the bilateral propagation activation function (BPA) is more effective at filling hole regions than Non-local attention [30]. As shown in Fig. 8, without equalization our method generates visually unpleasant contents and visible artifacts. In comparison, the contents generated with the Non-local block [30] are more natural. However, the recovered contents are still blurry and inconsistent because the Non-local block ignores the local coherency and global distance of features. This limitation is effectively addressed by our method with feature equalizations. Similar performance is shown numerically in Table 5, where our method achieves favorable results.
6 Concluding Remarks
We propose a mutual encoder-decoder with feature equalizations to correlate filled structures with textures during image inpainting. The shallow and deep layer features are reorganized as texture and structure features, respectively. In the CNN feature space, we introduce a texture branch and a structure branch to fill holes at multiple scales and fuse the outputs together via feature equalizations. During equalization, we first ensure consistent attentions within each channel and then propagate them to the whole spatial feature map via the proposed bilateral propagation activation function. Experiments on the benchmark datasets show the effectiveness of our method when compared to state-of-the-art approaches on filling both regular and irregular hole regions.
This work is partially supported by the National Natural Science Foundation of China under Grant No. 61702176.
[1] Filling-in by joint interpolation of vector fields and gray levels. IEEE Transactions on Image Processing.
[2] (2009) PatchMatch: a randomized correspondence algorithm for structural image editing. In ACM Transactions on Graphics (SIGGRAPH).
[3] (2000) Image inpainting. In ACM Transactions on Graphics (SIGGRAPH).
[4] (2004) Region filling and object removal by exemplar-based image inpainting. IEEE Transactions on Image Processing.
[5] (2012) Image melding: combining inconsistent images using patch-based synthesis. ACM Transactions on Graphics.
[6] (2015) What makes Paris look like Paris? Communications of the ACM.
[7] (2001) Image quilting for texture synthesis and transfer. In ACM Transactions on Graphics (SIGGRAPH).
[8] Texture synthesis by nonparametric sampling. In IEEE International Conference on Computer Vision.
[9] (2014) Generative adversarial nets. In Neural Information Processing Systems.
[10] Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition.
[11] (2017) GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Neural Information Processing Systems.
[12] (2018) Squeeze-and-excitation networks. In IEEE Conference on Computer Vision and Pattern Recognition.
[13] (2017) Globally and locally consistent image completion. In ACM Transactions on Graphics (SIGGRAPH).
[14] (2017) Image-to-image translation with conditional adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition.
[15] Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision.
[16] (2018) The relativistic discriminator: a key element missing from standard GAN. In International Conference on Learning Representations.
[17] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[18] (2003) Learning how to inpaint from global image statistics. In IEEE International Conference on Computer Vision.
[19] (2017) Generative face completion. In IEEE Conference on Computer Vision and Pattern Recognition.
[20] (2018) Image inpainting for irregular holes using partial convolutions. In European Conference on Computer Vision.
[21] (2019) Coherent semantic attention for image inpainting. In IEEE International Conference on Computer Vision.
[22] (2015) Deep learning face attributes in the wild. In IEEE International Conference on Computer Vision.
[23] (2018) Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957.
[24] (2019) EdgeConnect: generative image inpainting with adversarial edge learning. In ICCV Workshops.
[25] (2016) Context encoders: feature learning by inpainting. In IEEE Conference on Computer Vision and Pattern Recognition.
[26] (2019) StructureFlow: image inpainting via structure-aware appearance flow. In IEEE International Conference on Computer Vision.
[27] (2017) Stylizing face images via multiple exemplars. CVIU.
[28] (2018) SPG-Net: segmentation prediction and guidance network for image inpainting. arXiv preprint arXiv:1805.03356.
[29] (1998) Bilateral filtering for gray and color images. In IEEE Conference on Computer Vision and Pattern Recognition.
[30] (2018) Non-local neural networks. In IEEE Conference on Computer Vision and Pattern Recognition.
[31] (2018) Image inpainting via generative multi-column convolutional neural networks. In Neural Information Processing Systems.
[32] (2004) Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing.
[33] (2019) Foreground-aware image inpainting. In IEEE Conference on Computer Vision and Pattern Recognition.
[34] (2012) Structure extraction from texture via relative total variation. SIGGRAPH.
[35] (2010) Image inpainting by patch propagation using patch sparsity. IEEE Transactions on Image Processing.
[36] Shift-Net: image inpainting via deep feature rearrangement. In European Conference on Computer Vision.
[37] (2018) Contextual-based image inpainting: infer, match, and translate. In European Conference on Computer Vision.
[38] (2016) Semantic image inpainting with perceptual and contextual losses. arXiv preprint arXiv:1607.07539.
[39] (2015) Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122.
[40] (2019) Free-form image inpainting with gated convolution. In IEEE International Conference on Computer Vision.
[41] (2018) Generative image inpainting with contextual attention. In IEEE Conference on Computer Vision and Pattern Recognition.
[42] (2019) Learning pyramid-context encoder network for high-quality image inpainting. In IEEE Conference on Computer Vision and Pattern Recognition.
[43] Places: a 10 million image database for scene recognition. PAMI.