Image Inpainting with Learnable Bidirectional Attention Maps

09/03/2019
by   Chaohao Xie, et al.
Nankai University

Most convolutional network (CNN)-based inpainting methods adopt standard convolution to indistinguishably treat valid pixels and holes, making them limited in handling irregular holes and more likely to generate inpainting results with color discrepancy and blurriness. Partial convolution has been suggested to address this issue, but it adopts handcrafted feature re-normalization, and only considers forward mask-updating. In this paper, we present a learnable attention map module for learning feature renormalization and mask-updating in an end-to-end manner, which is effective in adapting to irregular holes and propagation of convolution layers. Furthermore, learnable reverse attention maps are introduced to allow the decoder of U-Net to concentrate on filling in irregular holes instead of reconstructing both holes and known regions, resulting in our learnable bidirectional attention maps. Qualitative and quantitative experiments show that our method performs favorably against state-of-the-arts in generating sharper, more coherent and visually plausible inpainting results. The source code and pre-trained models will be available at: https://github.com/Vious/LBAM_inpainting/.



1 Introduction

Image inpainting [BertalmioInpainting], aiming at filling in holes of an image, is a representative low-level vision task with many real-world applications such as distracting object removal and occluded region completion. However, there may exist multiple potential solutions for the given holes in an image, i.e., the holes can be filled with any plausible hypothesis coherent with the surrounding known regions. Moreover, the holes can have complex and irregular patterns, further increasing the difficulty of image inpainting. Traditional exemplar-based methods [Barnes:2009:PAR, LeMeur2011examplar, XuPatchSparsity], e.g., PatchMatch [Barnes:2009:PAR], gradually fill in holes by searching and copying similar patches from the known regions. Although exemplar-based methods are effective in hallucinating detailed textures, they are still limited in capturing high-level semantics, and may fail to generate complex and non-repetitive structures (see Fig. 1(c)).

(a) Original (b) Input (c) PM [Barnes:2009:PAR] (d) GL [IizukaGL] (e) CA [yu2018generative] (f) PConv [partialconv2017] (g) Ours
Figure 1: Qualitative comparison of inpainting results by PatchMatch (PM) [Barnes:2009:PAR], Global&Local (GL) [IizukaGL], Context Attention (CA) [yu2018generative], and Partial Convolution (PConv) [partialconv2017], and Ours.

Recently, considerable progress has been made in applying deep convolutional networks (CNNs) to image inpainting [IizukaGL, pathakCVPR16context]. Benefiting from their powerful representation ability and large-scale training, CNN-based methods are effective in hallucinating semantically plausible results, and adversarial loss [Goodfellow_GAN] has also been deployed to improve the perceptual quality and naturalness of the results. Nonetheless, most existing CNN-based methods adopt standard convolution, which indistinguishably treats valid pixels and holes. Thus, they are limited in handling irregular holes and more likely to generate inpainting results with color discrepancy and blurriness. As a remedy, several post-processing techniques [IizukaGL, Yang_2017_CVPR] have been introduced but are still inadequate in resolving the artifacts (see Fig. 1(d)).

CNN-based methods have also been combined with exemplar-based ones to explicitly incorporate the mask of holes for better structure recovery and detail enhancement [song_contextual_2018, Yan_2018_Shift, yu2018generative]. In these methods, the mask is utilized to guide the propagation of encoder features from known regions to the holes. However, the copying and enhancing operation heavily increases the computational cost and is only deployed at a single pair of encoding and decoding layers. As a result, these methods are better at filling in rectangular holes, and perform poorly in handling irregular holes (see Fig. 1(e)).

For better handling irregular holes and suppressing color discrepancy and blurriness, partial convolution (PConv) [partialconv2017] has been suggested. In each PConv layer, mask convolution is used to make the output conditioned only on the unmasked input, and feature re-normalization is introduced for scaling the convolution output. A mask-updating rule is further presented to update the mask for the next layer, making PConv very effective in handling irregular holes. Nonetheless, PConv adopts a hard 0-1 mask and handcrafted feature re-normalization, absolutely trusting all filled-in intermediate features. Moreover, PConv considers only forward mask-updating and simply employs an all-one mask for decoder features.

In this paper, we take a step forward and present learnable bidirectional attention map modules for the re-normalization of features in both the encoder and decoder of the U-Net [UNetRFB15a] architecture. To begin with, we revisit PConv without bias, and show that the mask convolution can be safely avoided and the feature re-normalization can be interpreted as a re-normalization guided by a hard 0-1 mask. To overcome the limitations of the hard 0-1 mask and handcrafted mask-updating, we present a learnable attention map module for learning feature re-normalization and mask-updating. Benefiting from end-to-end training, the learnable attention map is effective in adapting to irregular holes and the propagation of convolution layers.

Furthermore, PConv simply uses an all-one mask on the decoder features, forcing the decoder to hallucinate both holes and known regions. Since the encoder features of the known regions are concatenated with the decoder features, it is natural that the decoder is only required to focus on the inpainting of holes. Therefore, we further introduce learnable reverse attention maps to allow the decoder of the U-Net to concentrate only on filling in holes, resulting in our learnable bidirectional attention maps. In contrast to PConv, the deployment of learnable bidirectional attention maps is empirically beneficial to network training, making it feasible to include an adversarial loss for improving the visual quality of the results.

Qualitative and quantitative experiments are conducted on the Paris StreetView [doersch2015makes] and Places [zhou2017places] datasets to evaluate our proposed method. The results show that our method performs favorably against state-of-the-arts in generating sharper, more coherent and visually plausible inpainting results. From Fig. 1(f)(g), our method is more effective in hallucinating clean semantic structures and realistic textures in comparison to PConv. To sum up, the main contribution of this work is three-fold:

  • A learnable attention map module is presented for image inpainting. In contrast to PConv, the learnable attention maps are more effective in adapting to arbitrary irregular holes and propagation of convolution layers.

  • Forward and reverse attention maps are incorporated to constitute our learnable bidirectional attention maps, further benefiting the visual quality of the result.

  • Experiments on two datasets and real-world object removal show that our method performs favorably against state-of-the-arts in hallucinating sharper, more coherent and visually plausible results.

2 Related Work

In this section, we present a brief survey on the relevant work, especially the propagation process adopted in exemplar-based methods as well as the network architectures of CNN-based inpainting methods.

    (a) PConv    (b) Learnable forward attention map  (c) Learnable reverse attention map
Figure 2: Interplay models between the mask and intermediate features for PConv and our learnable bidirectional attention maps. Here, the white holes in the mask denote the missing region with value 0, and the black area denotes the known region with value 1.

2.1 Exemplar-based Inpainting

Most exemplar-based inpainting methods search and paste patches from the known regions to gradually fill in the holes from the exterior to the interior [Barnes:2009:PAR, Criminisi2004region, LeMeur2011examplar, XuPatchSparsity], and their results highly depend on the propagation process. In general, a better inpainting result can be attained by first filling in structures and then the other missing regions. To guide the patch processing order, a patch priority measure [KomodakisPriority, WilczkowiakBMVC05] has been introduced as the product of a confidence term and a data term. While the confidence term is generally defined as the ratio of known pixels in the input patch, several forms of data terms have been proposed. In particular, Criminisi et al. [Criminisi2004region] suggested a gradient-based data term for filling in linear structures with higher priority. Xu and Sun [XuPatchSparsity] assumed that structural patches are sparsely distributed in an image, and presented a sparsity-based data term. Le Meur et al. [LeMeur2011examplar] adopted the eigenvalue discrepancy of the structure tensor [DiZenzo1986gradient] as an indicator of structural patches.

2.2 Deep CNN-based Inpainting

Early CNN-based methods [KohlerSSHJR2014, RenShepardConv, XieDenoiseCNN] were suggested for handling images with small and thin holes. In the past few years, deep CNNs have received upsurging interest and exhibited promising performance for filling in large holes. Pathak et al. [pathakCVPR16context] adopted an encoder-decoder network (i.e., context-encoder), and incorporated reconstruction and adversarial losses for better recovering semantic structures. Iizuka et al. [IizukaGL] combined both global and local discriminators for reproducing both semantically plausible structures and locally realistic details. Wang et al. [WangMulticolumn] suggested a generative multi-column CNN incorporating a confidence-driven reconstruction loss and an implicit diversified MRF (ID-MRF) term.

Multi-stage methods have also been investigated to ease the difficulty of training deep inpainting networks. Zhang et al. [ZhangPGN] presented progressive generative networks (PGN) for filling in holes in multiple phases, with an LSTM deployed to exploit the dependencies across phases. Nazeri et al. [nazeri2019edgeconnect] proposed a two-stage model, EdgeConnect, which first predicts salient edges and then generates the inpainting result guided by the edges. Instead, Xiong et al. [Xiong_2019_CVPR] presented foreground-aware inpainting, which involves three stages, i.e., contour detection, contour completion, and image completion, for the disentanglement of structure inference and content hallucination.

In order to combine exemplar-based and CNN-based methods, Yang et al. [Yang_2017_CVPR] suggested multi-scale neural patch synthesis (MNPS) to refine the result of the context-encoder via joint optimization with holistic content and local texture constraints. Other two-stage feed-forward models, e.g., contextual attention [yu2018generative] and patch-swap [song_contextual_2018], were further developed to overcome the high computational cost of MNPS while explicitly exploiting image features of known regions. Concurrently, Yan et al. [Yan_2018_Shift] modified the U-Net to form a one-stage network, i.e., Shift-Net, utilizing the shift of encoder features from known regions for better reproducing plausible semantics and detailed contents. Most recently, Zheng et al. [Zheng2019Pluralistic] introduced an enhanced short+long term attention layer, and presented a probabilistic framework with two parallel paths for pluralistic inpainting.

Most existing CNN-based inpainting methods are not well suited for handling irregular holes. To address this issue, Liu et al. [partialconv2017] proposed the partial convolution (PConv) layer involving three steps, i.e., mask convolution, feature re-normalization, and mask-updating. Yu et al. [yu2018free] proposed gated convolution, which learns a channel-wise soft mask by considering the corrupted image, the mask, and optional user sketches. However, PConv adopts handcrafted feature re-normalization and only considers forward mask-updating, making it still limited in handling color discrepancy and blurriness (see Fig. 1(f)).

3 Proposed Method

In this section, we first revisit PConv, and then present our learnable bidirectional attention maps. Subsequently, the network architecture and learning objective of our method are also provided.

Figure 3: The network architecture of our model. The circle with a triangle inside denotes the operation of Eqn. (12), and $g_{A}$ and $g_{M}$ represent the activation function of Eqn. (9) and the mask-updating function of Eqn. (8), respectively.

3.1 Revisiting Partial Convolution

A PConv [partialconv2017] layer generally involves three steps, i.e., (i) mask convolution, (ii) feature re-normalization, and (iii) mask-updating. Denote by $\mathbf{F}^{\mathrm{in}}$ the input feature map and $\mathbf{M}$ the corresponding hard 0-1 mask. We further let $\mathbf{W}$ be the convolution filter and $b$ be its bias. To begin with, we introduce the convolved mask $\mathbf{M}^{c} = \mathbf{k}_{M} \circledast \mathbf{M}$, where $\circledast$ denotes the convolution operator and $\mathbf{k}_{M}$ denotes a handcrafted convolution filter with all elements equal (i.e., an averaging filter). The process of PConv can be formulated as,

$\mathbf{F}^{\mathrm{conv}} = \mathbf{W} \circledast (\mathbf{F}^{\mathrm{in}} \odot \mathbf{M})$   (1)
$\mathbf{F}^{\mathrm{out}} = \mathcal{A} \odot \mathbf{F}^{\mathrm{conv}} + b$   (2)
$\mathbf{M}' = g_{M}(\mathbf{M}^{c})$   (3)

where $\mathcal{A} = g_{A}(\mathbf{M}^{c})$ denotes the attention map, and $\mathbf{M}'$ denotes the updated mask. We further define the activation functions for the attention map and the updated mask as,

$g_{A}(\mathbf{M}^{c}_{i,j}) = \begin{cases} 1/\mathbf{M}^{c}_{i,j}, & \mathbf{M}^{c}_{i,j} > 0 \\ 0, & \text{otherwise} \end{cases}$   (4)
$g_{M}(\mathbf{M}^{c}_{i,j}) = \begin{cases} 1, & \mathbf{M}^{c}_{i,j} > 0 \\ 0, & \text{otherwise} \end{cases}$   (5)

From Eqns. (1)-(5) and Fig. 2(a), PConv can also be explained as a special interplay model between the mask and the convolution feature map. However, PConv adopts the handcrafted mask convolution filter $\mathbf{k}_{M}$ as well as the handcrafted activation functions $g_{A}$ and $g_{M}$, thereby leaving room for further improvement. Moreover, the non-differentiable property of $g_{M}$ also increases the difficulty of end-to-end learning. To our best knowledge, it remains difficult to incorporate an adversarial loss when training a U-Net with PConv. Furthermore, PConv only considers the mask and its updating for encoder features. As for decoder features, it simply adopts an all-one mask, making PConv limited in filling in holes.
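To make the three steps concrete, the following is a minimal PyTorch sketch of a PConv-style layer (not the authors' released code); the bias term is omitted, as in the revisit above, and the kernel size, stride, and padding are illustrative defaults.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PConvSketch(nn.Module):
    """Minimal sketch of a PConv-style layer: (i) mask convolution,
    (ii) handcrafted feature re-normalization, (iii) hard 0-1 mask update."""

    def __init__(self, in_ch, out_ch, kernel_size=4, stride=2, padding=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding, bias=False)
        # Fixed all-ones filter used only to convolve the single-channel mask.
        self.register_buffer("mask_kernel",
                             torch.ones(1, 1, kernel_size, kernel_size))
        self.stride, self.padding = stride, padding

    def forward(self, feat, mask):
        # (i) Mask convolution: zero out the holes before a standard convolution.
        conv_out = self.conv(feat * mask)
        # Convolved mask counts the valid pixels under each filter window.
        with torch.no_grad():
            mask_sum = F.conv2d(mask, self.mask_kernel,
                                stride=self.stride, padding=self.padding)
        # (ii) Handcrafted re-normalization: scale by (window size / #valid pixels),
        # and zero the output wherever no valid pixel fell inside the window.
        window = self.mask_kernel.numel()
        scale = torch.where(mask_sum > 0, window / mask_sum,
                            torch.zeros_like(mask_sum))
        out = conv_out * scale
        # (iii) Hard 0-1 mask update: a location becomes "known" once any
        # valid pixel contributed to it.
        new_mask = (mask_sum > 0).float()
        return out, new_mask
```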

3.2 Learnable Attention Maps

The convolution layer without bias has been widely adopted in U-Net for image-to-image translation [isola2017cvpr] and image inpainting [Yan_2018_Shift]. When the bias is removed, it can be readily seen from Eqn. (2) that the convolution features in the updated holes are zeros. Thus, the mask convolution in Eqn. (1) can be equivalently rewritten as a standard convolution,

$\mathbf{F}^{\mathrm{conv}} = \mathbf{W} \circledast \mathbf{F}^{\mathrm{in}}$   (6)

Then, the feature re-normalization in Eqn. (2) can be interpreted as the element-wise product of the convolution feature and the attention map,

$\mathbf{F}^{\mathrm{out}} = \mathcal{A} \odot \mathbf{F}^{\mathrm{conv}}$   (7)

Even so, the handcrafted convolution filter $\mathbf{k}_{M}$ is fixed and not adapted to the mask. The activation function $g_{M}$ for the updated mask absolutely trusts the inpainting result in the region with $\mathbf{M}^{c}_{i,j} > 0$, but it is more sensible to assign higher confidence to regions with higher $\mathbf{M}^{c}_{i,j}$.

To overcome the above limitations, we suggest the learnable attention map, which generalizes PConv without bias from three aspects. First, to make the mask adaptive to irregular holes and to propagation along the layers, we substitute $\mathbf{k}_{M}$ with layer-wise, learnable convolution filters $\mathbf{k}^{l}_{M}$. Second, instead of hard 0-1 mask-updating, we modify the activation function for the updated mask as,

$g_{M}(\mathbf{M}^{c}_{i,j}) = \left(\mathrm{ReLU}(\mathbf{M}^{c}_{i,j})\right)^{\alpha}$   (8)

where $\alpha$ is a hyperparameter. One can see that Eqn. (8) degenerates into the hard 0-1 updating of Eqn. (5) as $\alpha$ approaches 0. Third, we introduce an asymmetric Gaussian-shaped form as the activation function for the attention map,

$g_{A}(\mathbf{M}^{c}_{i,j}) = \begin{cases} a \exp\!\left(-\gamma_{l}\left(\mathbf{M}^{c}_{i,j} - \mu\right)^{2}\right), & \mathbf{M}^{c}_{i,j} < \mu \\ 1 + (a - 1) \exp\!\left(-\gamma_{r}\left(\mathbf{M}^{c}_{i,j} - \mu\right)^{2}\right), & \text{otherwise} \end{cases}$   (9)

where $a$, $\mu$, $\gamma_{l}$, and $\gamma_{r}$ are learnable parameters, which are initialized and then learned in an end-to-end manner.

To sum up, the learnable attention map adopts Eqn. (6) in Step (i), and the next two steps are formulated as,

$\mathbf{F}^{\mathrm{out}} = g_{A}(\mathbf{M}^{c}) \odot \mathbf{F}^{\mathrm{conv}}, \quad \mathbf{M}^{c} = \mathbf{k}^{l}_{M} \circledast \mathbf{M}$   (10)
$\mathbf{M}' = g_{M}(\mathbf{M}^{c})$   (11)

Fig. 2(b) illustrates the interplay model of learnable attention map. In contrast to PConv, our learnable attention map is more flexible and can be end-to-end trained, making it effective in adapting to irregular holes and propagation of convolution layers.
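As a rough illustration of Eqns. (8)-(11), the sketch below implements a forward learnable attention module with a learnable mask convolution, the asymmetric Gaussian activation $g_A$, and the soft mask update $g_M$; the parameter initializations (a, mu, gamma_l, gamma_r, alpha) and the mask channel handling are illustrative assumptions, not the paper's exact settings.

```python
import torch
import torch.nn as nn

class ForwardAttentionSketch(nn.Module):
    """Sketch of a learnable (forward) attention map module."""

    def __init__(self, mask_in_ch, mask_out_ch,
                 kernel_size=4, stride=2, padding=1, alpha=0.8):
        super().__init__()
        # Learnable mask convolution replaces the fixed averaging filter k_M.
        self.mask_conv = nn.Conv2d(mask_in_ch, mask_out_ch, kernel_size,
                                   stride, padding, bias=False)
        # Learnable parameters of the asymmetric Gaussian g_A (illustrative init).
        self.a = nn.Parameter(torch.tensor(1.1))
        self.mu = nn.Parameter(torch.tensor(2.0))
        self.gamma_l = nn.Parameter(torch.tensor(1.0))
        self.gamma_r = nn.Parameter(torch.tensor(1.0))
        self.alpha = alpha  # exponent of the soft mask update (assumed value)

    def g_a(self, mc):
        # Asymmetric Gaussian-shaped activation for the attention map (Eqn. 9).
        left = self.a * torch.exp(-self.gamma_l * (mc - self.mu) ** 2)
        right = 1.0 + (self.a - 1.0) * torch.exp(-self.gamma_r * (mc - self.mu) ** 2)
        return torch.where(mc < self.mu, left, right)

    def g_m(self, mc):
        # Soft mask update (Eqn. 8): higher confidence where M^c is larger.
        return torch.relu(mc) ** self.alpha

    def forward(self, conv_feat, mask):
        # conv_feat: output of the (bias-free) backbone convolution; its channel
        # count is assumed to match mask_out_ch (or to broadcast against it).
        mc = self.mask_conv(mask)        # learnable convolved mask M^c
        out = conv_feat * self.g_a(mc)   # feature re-normalization (Eqn. 10)
        return out, self.g_m(mc)         # updated soft mask (Eqn. 11)
```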

Original Input PM [Barnes:2009:PAR] GL [IizukaGL] CA [yu2018generative] PConv [partialconv2017] Ours
Figure 4: Qualitative comparison on Paris StreetView dataset. Comparison with PatchMatch (PM) [Barnes:2009:PAR], Global&Local(GL) [IizukaGL], Context Attention(CA) [yu2018generative], PConv [partialconv2017] and Ours.

3.3 Learnable Bidirectional Attention Maps

When incorporating PConv with U-Net for inpainting, the method in [partialconv2017] only updates the masks along the convolution layers for encoder features, while an all-one mask is generally adopted for decoder features. As a result, the decoder features in both the known regions and the holes have to be hallucinated from the preceding decoder feature and the skip-connected encoder feature. Actually, the encoder feature of the corresponding layer will be concatenated with the decoder feature, so we can focus only on generating the decoder feature in the holes.

We further introduce learnable reverse attention maps for the decoder features. Denote by $\mathbf{M}^{c}_{e}$ the convolved mask for the encoder feature $\mathbf{F}_{e}$, and by $\mathbf{M}^{c}_{d}$ the convolved mask for the decoder feature $\mathbf{F}_{d}$. The first two steps of the learnable reverse attention map can be formulated as,

$\mathbf{F}^{\mathrm{out}}_{d} = \left[\, g_{A}(\mathbf{M}^{c}_{e}) \odot \mathbf{F}_{e},\; g_{A}(\mathbf{M}^{c}_{d}) \odot \mathbf{F}_{d} \,\right]$   (12)

where $\mathbf{k}^{e}_{M}$ and $\mathbf{k}^{d}_{M}$ are the convolution filters for computing $\mathbf{M}^{c}_{e}$ and $\mathbf{M}^{c}_{d}$, and $[\cdot,\cdot]$ denotes channel-wise concatenation (cf. Table 4). We define $\mathcal{A}_{d} = g_{A}(\mathbf{M}^{c}_{d})$ as the reverse attention map. Then, the mask is updated and deployed to the former decoder layer,

$\mathbf{M}'_{d} = g_{M}(\mathbf{M}^{c}_{d})$   (13)

Fig. 2(c) illustrates the interplay model of reverse attention map. In contrast to forward attention maps, both encoder feature (mask) and decoder feature (mask) are considered. Moreover, the updated mask in reverse attention map is applied to the former decoder layer, while that in forward attention map is applied to the next encoder layer.

By incorporating the forward and reverse attention maps with U-Net, Fig. 3 shows the full learnable bidirectional attention maps. Given an input image with irregular holes, we use $\mathbf{M}$ to denote the binary mask, where ones indicate the valid pixels and zeros indicate the pixels in the holes. From Fig. 3, the forward attention maps take $\mathbf{M}$ as the input mask for the re-normalization of the first layer of encoder features, and gradually update and apply the mask to the next encoder layer. In contrast, the reverse attention maps take the reverse mask $1 - \mathbf{M}$ as the input for the re-normalization of the last layer of decoder features, and gradually update and apply the mask to the former decoder layer. Benefiting from end-to-end learning, our learnable bidirectional attention maps (LBAM) are more effective in handling irregular holes. The introduction of the reverse attention maps allows the decoder to concentrate only on filling in irregular holes, which is also helpful to inpainting performance. Our LBAM is also beneficial to network training, making it feasible to exploit an adversarial loss for improving visual quality.
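One possible reading of how a decoder layer combines the two attention maps, consistent with the Ewp(Cat(...), Cat(...)) rows of Table 4 in the supplementary (though the exact wiring shown here is an assumption), is sketched below.

```python
import torch

def bidirectional_renorm(dec_feat, enc_feat, rev_attn, fwd_attn):
    """Combine a deconvolved decoder feature with its skip-connected encoder
    feature, re-normalizing each by its own attention map before concatenation.

    dec_feat: decoder feature of the current layer, re-normalized by rev_attn.
    enc_feat: mirrored encoder feature, re-normalized by fwd_attn.
    All tensors are assumed to share the same spatial size, and each attention
    map is assumed to have the same channel count as its feature.
    """
    feat = torch.cat([rev_attn * dec_feat, fwd_attn * enc_feat], dim=1)
    return feat  # followed by BN and LeakyReLU in the U-Net backbone
```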

Original Input PM [Barnes:2009:PAR] GL [IizukaGL] CA [yu2018generative] PConv [partialconv2017] Ours
Figure 5: Qualitative comparison on Places dataset. Comparison with PatchMatch (PM) [Barnes:2009:PAR], Global&Local(GL) [IizukaGL], Context Attention(CA) [yu2018generative], PConv [partialconv2017] and Ours.

3.4 Model Architecture

We modify the U-Net architecture [isola2017cvpr] of 14 layers by removing the bottleneck layer and incorporating the bidirectional attention maps (see Fig. 3). In particular, forward attention layers are applied to the first six layers of the encoder, while reverse attention layers are adopted in the last six layers of the decoder. For all the U-Net layers and the forward and reverse attention layers, we use convolution filters with fixed kernel size, stride, and padding, and no bias parameters are used. In the U-Net backbone, batch normalization and leaky ReLU nonlinearity are applied to the features after re-normalization, and a tanh nonlinearity is deployed right after the convolution of the last layer. Fig. 3 also provides the size of the feature map for each layer, and more details of the network architecture are given in the suppl.

3.5 Loss Functions

For better recovery of texture details and semantics, we incorporate pixel reconstruction loss, perceptual loss [Johnson2016Perceptual], style loss [Gatys2016ImageST] and adversarial loss [Goodfellow_GAN] to train our LBAM.

Pixel Reconstruction Loss. Denote by $\mathbf{I}^{\mathrm{in}}$ the input image with holes, $\mathbf{M}$ the binary mask, and $\mathbf{I}^{gt}$ the ground-truth image. The output of our LBAM can be defined as $\hat{\mathbf{I}} = \Phi(\mathbf{I}^{\mathrm{in}}, \mathbf{M}; \Theta)$, where $\Theta$ denotes the model parameters to be learned. We adopt the $\ell_{1}$-norm error of the output image as the pixel reconstruction loss,

$\mathcal{L}_{rec} = \left\| \hat{\mathbf{I}} - \mathbf{I}^{gt} \right\|_{1}$   (14)
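A minimal sketch of Eqn. (14), assuming a plain unweighted $\ell_1$ error over the whole output (any separate weighting of hole and valid pixels is omitted here):

```python
import torch

def pixel_reconstruction_loss(output, target):
    # Mean absolute error between the inpainted output and the ground truth.
    return torch.mean(torch.abs(output - target))
```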

Perceptual Loss. The $\ell_{1}$-norm loss is limited in capturing high-level semantics and is not consistent with the human perception of image quality. To alleviate this issue, we introduce the perceptual loss defined on the VGG-16 network [SimonyanZ14a] pre-trained on ImageNet [ILSVRC15],

$\mathcal{L}_{perc} = \sum_{p} \left\| \phi_{p}(\hat{\mathbf{I}}) - \phi_{p}(\mathbf{I}^{gt}) \right\|_{1}$   (15)

where $\phi_{p}(\cdot)$ denotes the feature maps of the $p$-th pooling layer. In our implementation, we use the pool-1, pool-2, and pool-3 layers of the pre-trained VGG-16.
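The following sketch extracts the pool-1 to pool-3 feature maps from a torchvision VGG-16 and computes an $\ell_1$ perceptual loss over them; the layer indices and the use of torchvision's pretrained weights are assumptions of this sketch rather than details taken from the paper.

```python
import torch
import torch.nn as nn
from torchvision import models

class VGG16Pools(nn.Module):
    """Return the pool-1, pool-2, and pool-3 feature maps of VGG-16."""

    def __init__(self):
        super().__init__()
        vgg = models.vgg16(weights="IMAGENET1K_V1").features.eval()
        for p in vgg.parameters():
            p.requires_grad_(False)
        layers = list(vgg.children())
        # Indices 4, 9, and 16 are the first three max-pooling layers of VGG-16.
        self.slice1 = nn.Sequential(*layers[:5])
        self.slice2 = nn.Sequential(*layers[5:10])
        self.slice3 = nn.Sequential(*layers[10:17])

    def forward(self, x):
        f1 = self.slice1(x)
        f2 = self.slice2(f1)
        f3 = self.slice3(f2)
        return [f1, f2, f3]

def perceptual_loss(vgg_pools, output, target):
    loss = 0.0
    for fo, ft in zip(vgg_pools(output), vgg_pools(target)):
        loss = loss + torch.mean(torch.abs(fo - ft))
    return loss
```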

Style Loss.  For better recovery of detailed textures, we further adopt the style loss defined on the feature maps from the pooling layers of VGG-16. Analogous to [partialconv2017], we construct a Gram matrix from each layer of the feature maps. Suppose that the size of the feature map $\phi_{p}$ is $C_{p} \times H_{p} \times W_{p}$. The style loss can then be defined as,

$\mathcal{L}_{style} = \sum_{p} \frac{1}{C_{p} \times C_{p}} \left\| \frac{ \phi_{p}(\hat{\mathbf{I}})\, \phi_{p}(\hat{\mathbf{I}})^{\top} - \phi_{p}(\mathbf{I}^{gt})\, \phi_{p}(\mathbf{I}^{gt})^{\top} }{ C_{p} H_{p} W_{p} } \right\|_{1}$   (16)

where $\phi_{p}(\cdot)$ is reshaped into a $C_{p} \times (H_{p} W_{p})$ matrix.
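Reusing the VGG16Pools helper from the previous sketch, the Gram-matrix style loss could look as follows; the 1/(C*H*W) normalization mirrors the common PConv-style formulation and is an assumption of this sketch.

```python
import torch

def gram_matrix(feat):
    # Gram matrix of a (B, C, H, W) feature map, normalized by C*H*W.
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)

def style_loss(vgg_pools, output, target):
    loss = 0.0
    for fo, ft in zip(vgg_pools(output), vgg_pools(target)):
        loss = loss + torch.mean(torch.abs(gram_matrix(fo) - gram_matrix(ft)))
    return loss
```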
Original Input CA [yu2018generative] PConv [partialconv2017] Ours
Figure 6: Results on real-world images. From left to right are: original image, input with objects masked (white area), Context Attention (CA) [yu2018generative], PConv [partialconv2017], and Ours.

Adversarial Loss. Adversarial loss [Goodfellow_GAN] has been widely adopted in image generation [salimans2016improved, pixelCNN, han2017stackgan] and low-level vision [Ledig2016a] for improving the visual quality of generated images. In order to improve the training stability of GANs, Arjovsky et al. [MartinWGAN] exploited the Wasserstein distance for measuring the distribution discrepancy between generated and real images, and Gulrajani et al. [ishaan2017improved] further introduced a gradient penalty for enforcing the Lipschitz constraint on the discriminator. Following [ishaan2017improved], we formulate the adversarial loss as,

$\mathcal{L}_{adv} = \mathbb{E}_{\mathbf{I}^{gt}}\left[ D(\mathbf{I}^{gt}) \right] - \mathbb{E}_{\hat{\mathbf{I}}}\left[ D(\hat{\mathbf{I}}) \right] - \lambda\, \mathbb{E}_{\tilde{\mathbf{I}}}\left[ \left( \left\| \nabla_{\tilde{\mathbf{I}}} D(\tilde{\mathbf{I}}) \right\|_{2} - 1 \right)^{2} \right]$   (17)

where $D(\cdot)$ represents the discriminator, $\tilde{\mathbf{I}}$ is sampled from $\hat{\mathbf{I}}$ and $\mathbf{I}^{gt}$ by linear interpolation with a randomly selected factor, and $\lambda$ is set to 10 in our experiments. We empirically find that it is difficult to train the PConv model when including the adversarial loss. Fortunately, the incorporation of learnable attention maps helps ease the training, making it feasible to learn LBAM with the adversarial loss. Please refer to the suppl. for the network architecture of the 7-layer discriminator used in our implementation.
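A sketch of the WGAN-GP terms described above; for brevity the discriminator is treated here as a single-input callable, whereas the paper's discriminator is two-column (see the supplementary).

```python
import torch

def gradient_penalty(disc, real, fake, lam=10.0):
    """Gradient penalty on random interpolations between real and fake images."""
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    interp = (eps * real + (1.0 - eps) * fake).requires_grad_(True)
    score = disc(interp)
    grads = torch.autograd.grad(outputs=score, inputs=interp,
                                grad_outputs=torch.ones_like(score),
                                create_graph=True)[0]
    grad_norm = grads.view(grads.size(0), -1).norm(2, dim=1)
    return lam * ((grad_norm - 1.0) ** 2).mean()

def discriminator_loss(disc, real, fake):
    # Critic maximizes D(real) - D(fake); we minimize the negation plus the penalty.
    return disc(fake.detach()).mean() - disc(real).mean() \
        + gradient_penalty(disc, real, fake.detach())

def generator_adv_loss(disc, fake):
    # The inpainting network is trained to maximize the critic's score on its output.
    return -disc(fake).mean()
```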

Model Objective

  Taking the above loss functions into account, the model objective of our LBAM can be formed as,

$\mathcal{L} = \lambda_{rec}\mathcal{L}_{rec} + \lambda_{perc}\mathcal{L}_{perc} + \lambda_{style}\mathcal{L}_{style} + \lambda_{adv}\mathcal{L}_{adv}$   (18)

where $\lambda_{rec}$, $\lambda_{perc}$, $\lambda_{style}$, and $\lambda_{adv}$ are the tradeoff parameters, which are set empirically in our implementation.
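Putting the four terms together, a loss-assembly sketch of Eqn. (18) might look like the following, reusing the loss sketches above; the tradeoff weights default to 1.0 as placeholders and should be replaced by the paper's values, which are not reproduced in this copy.

```python
def lbam_loss(output, target, vgg_pools, disc,
              lam_rec=1.0, lam_perc=1.0, lam_style=1.0, lam_adv=1.0):
    """Weighted sum of reconstruction, perceptual, style, and adversarial terms."""
    return (lam_rec * pixel_reconstruction_loss(output, target)
            + lam_perc * perceptual_loss(vgg_pools, output, target)
            + lam_style * style_loss(vgg_pools, output, target)
            + lam_adv * generator_adv_loss(disc, output))
```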

4 Experiments

Experiments are conducted to evaluate our LBAM on two datasets, i.e., Paris StreetView [doersch2015makes] and Places (Places365-standard) [zhou2017places], which have been extensively adopted in the image inpainting literature [pathakCVPR16context, Yan_2018_Shift, Yang_2017_CVPR, yu2018generative]. For Paris StreetView, we use its original training and testing splits; a subset of images is randomly selected and removed from the training set to form our validation set. As for Places, we randomly select ten categories from the 365 categories and use all the images per category from the original training set to form our training set. Moreover, we divide the original validation set of each category into two equal non-overlapping sets for validation and testing, respectively. At test time, our LBAM also processes an image faster than Context Attention [yu2018generative] and Global&Local (GL) [IizukaGL].

In our experiments, all the images are resized so that the minimal height or width reaches a fixed length, and then randomly cropped to a fixed size. Data augmentation such as flipping is adopted during training. We generate masks with random shapes, and also use the masks from [partialconv2017], for training and testing. Our model is optimized using the ADAM algorithm [AdamOptim] with a fixed initial learning rate. The mini-batch size is fixed, and the training procedure ends after a predefined number of epochs. All the experiments are conducted on a PC equipped with parallel NVIDIA GTX 1080Ti GPUs.

(a) (b) (c) (d) (e) (f) (g)
Figure 7: Visualization of features from the first encoder layer and the corresponding decoder layer. (a) Input, (b)(c) Ours(unlearned), (d)(e) Ours(forward), (f)(g) Ours(full).
(a) (b) (c) (d) (e) (f) (g)
Figure 8: Visualization of updated masks after activation function for forward and reverse attention maps. (a) Input, (b)(c)(d) forward masks from the first three (1,2,3) layers, (e)(f)(g) reverse masks from the last three (11, 12, 13) layers.

4.1 Comparison with State-of-the-arts

Our LBAM is compared with four state-of-the-art methods, i.e., Global&Local [IizukaGL], PatchMatch [Barnes:2009:PAR], Context Attention [yu2018generative], and PConv [partialconv2017].

Evaluation on Paris StreetView and Places. Fig. 4 and Fig. 5 show the results of our LBAM and the competing methods. Global&Local [IizukaGL] is limited in handling irregular holes, producing many matchless and meaningless textures. PatchMatch [Barnes:2009:PAR] performs poorly in recovering complex structures, and its results are not consistent with the surrounding context. For some complex and irregular holes, Context Attention [yu2018generative] still generates blurry results and may produce unwanted artifacts. PConv [partialconv2017] is effective in handling irregular holes, but over-smoothed results are still inevitable in some regions. In contrast, our LBAM performs well in generating visually more plausible results with fine-detailed and realistic textures.

Quantitative Evaluation. We also compare our LBAM quantitatively with the competing methods on Places [zhou2017places] with mask ratios ranging from (0.1, 0.2] to (0.4, 0.5]. From Table 1, our LBAM performs favorably in terms of PSNR, SSIM, and mean $\ell_{1}$ loss, especially for larger mask ratios.

Mask GL [IizukaGL] PM [Barnes:2009:PAR] CA [yu2018generative] PConv* [partialconv2017] Ours
PSNR (0.1, 0.2] 23.36 26.67 26.27 28.32 28.51
(0.2, 0.3] 20.53 24.21 23.56 25.25 25.59
(0.3, 0.4] 19.37 21.95 21.20 22.89 23.31
(0.4, 0.5] 17.86 20.02 19.95 21.38 21.66
SSIM (0.1, 0.2] 0.828 0.876 0.881 0.870 0.872
(0.2, 0.3] 0.744 0.763 0.769 0.779 0.785
(0.3, 0.4] 0.643 0.657 0.667 0.689 0.708
(0.4, 0.5] 0.545 0.572 0.563 0.595 0.602
Mean (%) (0.1, 0.2] 2.45 1.43 2.05 1.09 1.12
(0.2, 0.3] 4.01 2.38 3.74 1.88 1.93
(0.3, 0.4] 5.86 3.59 5.65 2.84 2.55
(0.4, 0.5] 7.92 5.22 7.43 3.85 3.67
Table 1: Quantitative comparison on Places. Results of PConv* are taken from [partialconv2017].
(a) Input (b) Ours(unlearned) (c) Ours(forward) (d) Ours(full)
Figure 9: Visual quality comparison of the effect on the learnable bidirectional attention maps.

Object Removal from Real-world Images. Using the model trained on Places, we further evaluate LBAM on the real-world object removal task. Fig. 6 shows the results of our LBAM, Context Attention [yu2018generative], and PConv [partialconv2017]. We mask the object area either with a contour shape or with a rectangular bounding box. In contrast to the competing methods, our LBAM can produce realistic and coherent contents in terms of both global semantics and local textures.

User Study. Besides, a user study is conducted on Paris StreetView and Places for subjective visual quality evaluation. We randomly select images from the test set covering different irregular holes, and the inpainting results are generated by PatchMatch [Barnes:2009:PAR], Global&Local [IizukaGL], Context Attention [yu2018generative], PConv [partialconv2017], and ours. We invited volunteers to vote for the most visually plausible inpainting result, assessed by criteria including coherency with the surrounding context, semantic structure, and fine details. For each test image, the inpainting results are randomly arranged and presented to the user along with the input image. Our LBAM is most frequently chosen as the most favorable result, largely surpassing PConv [partialconv2017], PatchMatch [Barnes:2009:PAR], Context Attention [yu2018generative], and Global&Local [IizukaGL].

4.2 Ablation Studies

Ablation studies are conducted to compare the performance of several LBAM variants on Paris StreetView, i.e., (i) Ours(full): the full LBAM model; (ii) Ours(unlearned): the LBAM model where all the elements of the mask convolution filters are fixed to a constant determined by the filter size, and the activation functions defined in Eqn. (4) and Eqn. (5) are adopted; (iii) Ours(forward): the LBAM model without reverse attention maps; (iv) Ours(w/o $\mathcal{L}_{adv}$): the LBAM model without (w/o) adversarial loss; (v) Ours(Sigmoid/LReLU/ReLU/3×3): the LBAM model using Sigmoid/LeakyReLU/ReLU as the activation function, or a 3×3 filter, for mask updating.

Fig. 7 shows the visualization of features from the first encoder layer and the corresponding decoder layer for Ours(unlearned), Ours(forward), and Ours(full). For Ours(unlearned), blurriness and artifacts can be observed in Fig. 9(b). Ours(forward) helps reduce the artifacts and noise, but its decoder hallucinates both holes and known regions and produces some blurry effects (see Fig. 9(c)). In contrast, Ours(full) is effective in generating semantic structures and detailed textures (see Fig. 9(d)), and its decoder focuses mainly on hallucinating the holes (see Fig. 7(g)). Table 2 gives the quantitative results of the LBAM variants on Paris StreetView, and the performance gain of Ours(full) can be explained by (1) the learnable attention maps, (2) the reverse attention maps, and (3) proper activation functions.

Method (0.1, 0.2] (0.2, 0.3] (0.3, 0.4] (0.4, 0.5]
Ours(unlearned) 26.95/0.853 24.39/0.763 22.54/0.677 21.20/0.583
Ours(forward) 27.80/0.869 25.13/0.775 23.04/0.688 21.76/0.598
Ours(Sigmoid) 26.93/0.857 24.15/0.768 22.24/0.683 20.32/0.582
Ours(LReLU) 26.61/0.852 23.59/0.762 20.63/0.667 18.38/0.562
Ours(ReLU) 27.62/0.864 25.16/0.776 22.96/0.685 21.48/0.596
Ours(3x3) 28.74/0.886 26.10/0.793 24.03/0.703 22.43/0.617
Ours(w/o $\mathcal{L}_{adv}$) 29.19/0.903 26.55/0.817 24.46/0.729 22.70/0.626
Ours(full) 28.73/0.889 26.16/0.795 24.26/0.716 22.62/0.621
Table 2: Ablation studies (PSNR/SSIM) on Paris StreetView.

Mask Updating. Fig. 8 shows the visualization of updated masks from different layers. From the first to third layers, the masks of encoder are gradually updated to reduce the size of holes. Analogously, from the 13-th to 11-th layers, the masks of decoder are gradually updated to reduce the size of known region.

Effect of Adversarial Loss. Table 2 also gives the quantitative results without $\mathcal{L}_{adv}$. Although Ours(w/o $\mathcal{L}_{adv}$) improves PSNR and SSIM, the use of $\mathcal{L}_{adv}$ generally benefits the visual quality of the inpainting results. The qualitative results are given in the suppl.

5 Conclusion

This paper proposed learnable bidirectional attention maps (LBAM) for image inpainting. With the introduction of learnable attention maps, our LBAM is effective in adapting to irregular holes and the propagation of convolution layers. Furthermore, reverse attention maps are presented to allow the decoder of the U-Net to concentrate only on filling in holes. Experiments show that our LBAM performs favorably against state-of-the-arts in generating sharper, more coherent, and fine-detailed results.

Acknowledgement

This work was supported in part by the NSFC grant under No. 61671182 and 61872116, and National Key Research and Development Project 2018YFC0832105.

References

Supplementary Material

Visual comparison of several LBAM variants on Paris StreetView dataset

We implement our bidirectional attention maps by employing an asymmetric Gaussian-shaped form (Eqn. 9) for activating the attention map and the modified activation function (Eqn. 8) for updating the mask. In this material, we give a visual comparison of several variants of our LBAM model, i.e., (i) Ours(full): the full LBAM model; (ii) Ours(unlearned): the LBAM model where all the elements of the mask convolution filters are fixed to a constant determined by the filter size, and the activation functions defined in Eqn. 4 and Eqn. 5 are adopted; (iii) Ours(forward): the LBAM model without reverse attention maps; (iv) Ours(w/o $\mathcal{L}_{adv}$): the LBAM model without (w/o) adversarial loss; (v) Ours(Sigmoid/LReLU/ReLU/3×3): the LBAM model using Sigmoid/LeakyReLU/ReLU as the activation function, or a 3×3 filter, for mask updating.

Fig. 10 shows a qualitative comparison over variants (i) to (iv). The Ours(forward) model benefits from the learnable attention maps and reduces the artifacts and noise of the unlearned one; see Fig. 10(a) and (b). However, its decoder hallucinates both holes and known regions and produces some blurry effects compared to our full model with learnable reverse attention maps, Fig. 10(d).

The qualitative effect of the GAN loss in the ablation studies is shown in Fig. 10(c) and (d). The inpainted results of our LBAM model without adversarial loss (Fig. 10(c)) are much better than those of the unlearned model (Fig. 10(a)), and are somewhat clearer in detail than those of the variant without reverse attention maps trained with the GAN loss. Our full LBAM model (Fig. 10(d)) benefits from the GAN loss and is superior in producing fine-detailed structures and capturing global semantics.

The visual comparisons of different activation functions or filters for mask updating are shown in Fig. 11.

Failure cases. Fig. 12 shows some failure cases of our LBAM model. Our model struggles to recover high-frequency details when the damaged areas are too large or the background objects are too complex. In some cases where the mask covers a large portion of a specific object, such as a car, it is still difficult for our LBAM model to recover the original shape.

Input (a) Ours(unlearned) (b) Ours(forward) (c) Ours(w/o $\mathcal{L}_{adv}$) (d) Ours(full)
Figure 10: Visual comparison of variants (i) to (iv) of our LBAM model. From left to right: Input, (a) Ours with the unlearned model, (b) Ours without reverse attention maps, (c) Ours without (w/o) adversarial loss, (d) our full LBAM model. All images are scaled to the same resolution.
Input (a) (b) (c) (d) (e) Ours
Figure 11: Visual comparison of different activation functions or filters on the bidirectional attention maps. From left to right: Input, (a) Sigmoid as the activation function, (b) Leaky ReLU as the activation function, (c) ReLU, (d) a 3×3 filter for mask updating, and (e) our full LBAM model. All images are scaled to the same resolution.

Model Architectures

Architecture of Our Learnable Bidirectional Attention Map

The learnable bidirectional attention model takes the damaged image, the mask, and the reverse mask as input. We adopt the basic U-Net structure with 14 layers, where the encoder and decoder each consist of 7 layers. The features are re-normalized by the learnable bidirectional attention maps through element-wise product. We use convolution filters of the same size and stride for all layers, including the bidirectional attention maps.

The forward attention maps take the mask as input, and the reverse attention maps take the reverse mask as input. We adopt an asymmetric Gaussian-shaped form as the activation function ($g_{A}$ of Eqn. 9) for activating the attention maps and a modified ReLU-based activation function ($g_{M}$ of Eqn. 8) for updating the mask maps. In consideration of the skip connections of the U-Net structure, the symmetric forward and reverse attention maps are concatenated for re-normalizing the concatenated features of the corresponding decoder layer, following Eqn. 12. Besides, batch normalization and leaky ReLU nonlinearity are applied to the features after attention re-normalization. The last layer of our LBAM model is a plain deconvolution followed by a tanh nonlinear activation. More details about our model are given in Table 4. Note that the activation function and mask-updating function are unique to each layer, and they do not share parameters among layers.

Architecture of the Discriminator

The discriminator is trained to produce the adversarial loss for minimizing the distance between the generated and real data distributions. In our work, we use a two-column discriminator: one column takes the known area of the inpainted result or a ground-truth image as input, and the other column takes the missing holes of the inpainted result or a ground-truth image. The two parallel feature streams are merged at an intermediate layer and then reduced to a single output. We use strided convolution layers throughout, a sigmoid nonlinear activation at the last layer, and leaky ReLU with a slope of 0.2 for the other layers. Table 3 provides more details of the discriminator.
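A hedged sketch of such a two-column discriminator is given below; the channel widths, the number of layers per column, and the depth at which the two streams are merged are illustrative choices, since the exact values are not preserved in this copy.

```python
import torch
import torch.nn as nn

class TwoColumnDiscriminatorSketch(nn.Module):
    """Two parallel convolutional columns (known region / holes) merged into one score."""

    def __init__(self, in_ch=3, base=64):
        super().__init__()

        def block(cin, cout, bn=True):
            layers = [nn.Conv2d(cin, cout, 4, 2, 1)]
            if bn:
                layers.append(nn.BatchNorm2d(cout))
            layers.append(nn.LeakyReLU(0.2, inplace=True))
            return nn.Sequential(*layers)

        def column():
            return nn.Sequential(block(in_ch, base, bn=False),
                                 block(base, base * 2),
                                 block(base * 2, base * 4))

        self.col_known = column()   # sees the image with the holes zeroed out
        self.col_holes = column()   # sees only the hole region
        # Sigmoid output as described in the text; for a WGAN-style critic this
        # last activation would typically be dropped.
        self.joint = nn.Sequential(block(base * 8, base * 8),
                                   block(base * 8, base * 8),
                                   nn.Conv2d(base * 8, 1, 4, 1, 0),
                                   nn.Sigmoid())

    def forward(self, image, mask):
        # mask: ones for known pixels, zeros for holes.
        known = self.col_known(image * mask)
        holes = self.col_holes(image * (1.0 - mask))
        return self.joint(torch.cat([known, holes], dim=1))
```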

Input Ours Ground Truth Input Ours Ground Truth
Figure 12: Failure cases of our LBAM model. Each group is ordered as input image, our result and ground truth. All images are scaled to .

More Comparisons on Paris StreetView and Places

More comparisons with PatchMatch (PM) [Barnes:2009:PAR], Global&Local (GL) [IizukaGL], Context Attention (CA) [yu2018generative], and Partial Convolution (PConv) [partialconv2017] are also conducted. Fig. 13, Fig. 14, and Fig. 15 show the qualitative comparisons on the Paris StreetView and Places datasets. For the Paris StreetView [doersch2015makes] dataset, we use its original training and testing splits.

For the Places [zhou2017places] dataset, ten categories are chosen from the 365 categories for training our LBAM model: apartment_building_outdoor, beach, house, ocean, sky, throne_room, tower, tundra, valley, and wheat_field. We gather all the images of each category to form our training set. The original validation set of each category is divided into two equal non-overlapping sets for validation and testing, respectively. It can be seen that our model performs better in producing both global consistency and fine-detailed structures.

Object removal on real world images.

Finally, we apply our model trained on the Places dataset for object removal on real-world images. As shown in Fig. 16, although these images contain different objects, backgrounds, contexts, and shapes, and some of them have large masked regions, our model handles them well, demonstrating the practicability and generalization ability of our LBAM model.

Input: Image () Input: Image ()
[Layer 1-1] Conv.(), stride = 2; LReLU; [Layer 1-2] Conv.(), stride = 2; LReLU;
[Layer 2-1] Conv.(), stride = 2; BN; LReLU; [Layer 2-2] Conv.(), stride = 2; BN; LReLU;
[Layer 3-1] Conv.(), stride = 2; BN; LReLU; [Layer 3-2] Conv.(), stride = 2; BN; LReLU;
[Layer 4-1] Conv.(), stride = 2; BN; LReLU; [Layer 4-2] Conv.(), stride = 2; BN; LReLU;
[Layer 5-1] Conv.(), stride = 2; BN; LReLU; [Layer 5-2] Conv.(), stride = 2; BN; LReLU;
[Layer 6-1] Conv.(), stride = 2; BN; LReLU; [Layer 6-2] Conv.(), stride = 2; BN; LReLU;
Concatenate(Layer 6-1, Layer 6-2);
[Layer 7] Conv.(), stride = 0; Sigmoid;
Output: Real or Fake ()
Table 3: The architecture of the discriminator. BN represents BatchNorm, LReLU denotes leaky ReLU with a slope of 0.2, and $\mathbf{M}$ represents the mask, with zeros denoting the missing pixels and ones denoting the known pixels.
Our Modified U-Net Learnable Bidirectional Attention Maps
Input: Image () Input: ()
[Layer 1-1] Conv.(), stride = 2; [Layer 1-2] Conv.(), stride = 2;
Ewp(Layer 1-1, ); LReLU;
[Layer 2-1] Conv.(), stride = 2; [Layer 2-2] ; Conv.(), stride = 2;
Ewp(Layer 2-1, ); BN; LReLU;
[Layer 3-1] Conv.(), stride = 2; [Layer 3-2] ; Conv.(), stride = 2;
Ewp(Layer 3-1, ); BN; LReLU;
[Layer 4-1] Conv.(), stride = 2; [Layer 4-2] ; Conv.(), stride = 2;
Ewp(Layer 4-1, ); BN; LReLU;
[Layer 5-1] Conv.(), stride = 2; [Layer 5-2] ; Conv.(), stride = 2;
Ewp(Layer 5-1, ); BN; LReLU;
[Layer 6-1] Conv.(), stride = 2; [Layer 6-2] ; Conv.(), stride = 2;
Ewp(Layer 6-1, ); BN; LReLU;
[Layer 7-1] Conv.(), stride = 2; [Layer 7-2] ; Conv.(), stride = 2;
Ewp(Layer 7-1, ); BN; LReLU;
[Layer 8-1] DeConv.(), stride = 2; [Layer 6-3] ; Conv.(), stride = 2;
Ewp(Cat(Layer 8-1, Layer 6-1), Cat(, ));BN; LReLU;
[Layer 9-1] DeConv.(), stride = 2; [Layer 5-3] ; Conv.(), stride = 2;
Ewp(Cat(Layer 9-1, Layer 5-1), Cat(, ));BN; LReLU;
[Layer 10-1] DeConv.(), stride = 2; [Layer 4-3] ; Conv.(), stride = 2;
Ewp(Cat(Layer 10-1, Layer 4-1), Cat(, ));BN; LReLU;
[Layer 11-1] DeConv.(), stride = 2; [Layer 3-3] ; Conv.(), stride = 2;
Ewp(Cat(Layer 11-1, Layer 3-1), Cat(, ));BN; LReLU;
[Layer 12-1] DeConv.(), stride = 2; [Layer 2-3] ; Conv.(), stride = 2;
Ewp(Cat(Layer 12-1, Layer 2-1), Cat(, ));BN; LReLU;
[Layer 13-1] DeConv.(), stride = 2; [Layer 1-3] Conv.(), stride = 2;
Ewp(Cat(Layer 13-1, Layer 1-1), Cat(, ));BN; LReLU;
[Layer 14-1] DeConv.(), stride = 2; tanh; Input: ()
Output: Final result () Reverse Attention Maps
Table 4: The architecture of our LBAM model. Ewp() means element-wise product, Cat() represents the feature concatenation operation, $g_{A}$ denotes the asymmetric Gaussian-shaped activation function of Eqn. (9), $g_{M}$ denotes the mask-updating function of Eqn. (8), BN represents BatchNorm, LReLU denotes leaky ReLU with a slope of 0.2, and $\mathbf{M}$ represents the mask, with zeros indicating the missing pixels and ones indicating the known pixels. Note that $g_{A}$ and $g_{M}$ are unique to each layer and do not share parameters.
Input PM [Barnes:2009:PAR] GL [IizukaGL] CA [yu2018generative] PConv [partialconv2017] Ours
Figure 13: Qualitative comparison on Paris StreetView dataset. Comparison with PatchMatch (PM) [Barnes:2009:PAR], Global&Local (GL) [IizukaGL], Context Attention (CA) [yu2018generative], and Partial Convolution (PConv) [partialconv2017]. All images are scaled to .
Input PM [Barnes:2009:PAR] GL [IizukaGL] CA [yu2018generative] PConv [partialconv2017] Ours
Figure 14: Qualitative comparison on the Paris StreetView and Places datasets. Comparison with PatchMatch (PM) [Barnes:2009:PAR], Global&Local (GL) [IizukaGL], Context Attention (CA) [yu2018generative], and Partial Convolution (PConv) [partialconv2017]. The first three rows are from the Paris StreetView dataset and the last four rows are from the Places dataset. All images are scaled to the same resolution.
Input PM [Barnes:2009:PAR] GL [IizukaGL] CA [yu2018generative] PConv [partialconv2017] Ours
Figure 15: Qualitative comparison on Places dataset. Comparison with PatchMatch (PM) [Barnes:2009:PAR], Global&Local (GL) [IizukaGL], Context Attention (CA) [yu2018generative], and Partial Convolution (PConv) [partialconv2017]. All images are scaled to .
Original Image Input Ours Original Image Input Ours
Figure 16: Results of our LBAM on object removal task of real world images. All images are scaled to .