Learning Oracle Attention for High-fidelity Face Completion

03/31/2020 · by Tong Zhou et al. · The University of Sydney; South China University of Technology

High-fidelity face completion is a challenging task due to the rich and subtle facial textures involved. What makes it more complicated is the correlations between different facial components, for example, the symmetry in texture and structure between both eyes. While recent works adopted the attention mechanism to learn the contextual relations among elements of the face, they have largely overlooked the disastrous impacts of inaccurate attention scores; in addition, they fail to pay sufficient attention to key facial components, the completion results of which largely determine the authenticity of a face image. Accordingly, in this paper, we design a comprehensive framework for face completion based on the U-Net structure. Specifically, we propose a dual spatial attention module to efficiently learn the correlations between facial textures at multiple scales; moreover, we provide an oracle supervision signal to the attention module to ensure that the obtained attention scores are reasonable. Furthermore, we take the location of the facial components as prior knowledge and impose a multi-discriminator on these regions, with which the fidelity of facial components is significantly promoted. Extensive experiments on two high-resolution face datasets including CelebA-HQ and Flickr-Faces-HQ demonstrate that the proposed approach outperforms state-of-the-art methods by large margins.


1 Introduction

Figure 1: Face completion results for images () with center mask (). In each row from left to right: (a) the result taken from the PEN-Net paper [34]. By zooming in, we can observe the color discrepancy between the two eyes, along with distortions in the nose and mouth regions. (b) The results of our method. It can be seen that our method indeed produces face images with high fidelity.

Image inpainting refers to filling in the missing pixels in an image with the expectation that the recovered image will be visually realistic. This process not only requires that the filled textures themselves be meaningful, but also seeks semantic consistency between the filled area and the context. Image inpainting is widely applied to photo restoration, image editing, and object removal, among many other tasks.

Face completion, as a branch of image inpainting, focuses on filling the missing regions of a human face, and turns out to be a challenging task. The reasons lie in two aspects. First, the human face contains rich and subtle textures that also differ dramatically across persons, meaning that it is difficult to perfectly restore these diverse facial textures. Second, there are close correlations between facial components, making the fidelity of the image highly dependent on the semantic consistency between the facial components. Take one recent work [34] as an example: its generated images are satisfactory in facial structure, but still suffer from small artifacts in facial components and semantic inconsistency. As illustrated in Figure 1, the two eyes in the same face are different in color; moreover, small distortions in the nose and mouth areas can also be observed. These flaws have a substantial impact on the authenticity of the overall visual effect.

Recently, convolutional neural network (CNN) based methods have become the mainstream approach to image inpainting [30, 29, 12, 26, 36]. In order to generate visually realistic images, existing approaches typically adopt non-local schemes that make use of contextual relations to fill in the missing pixels [32, 21, 39, 24]. However, due to the lack of direct supervision on the obtained attention scores, the learned relations are insufficiently reliable, meaning that these methods may generate distorted textures.

Moreover, different types of structural information have been extracted by networks in order to act as prior knowledge for assisting image inpainting; for example, segmentation maps [25], object contours [28], edges [17], facial landmarks [23], and face parsing [13]. While these methods focus on the correctness of the structural information, they ignore the quality of the textures in key areas of the image (e.g., facial components in face images).

In this paper, we propose a comprehensive framework to handle the above issues. Inspired by recent progress in attention models [35, 3], we propose a Dual Spatial Attention (DSA) model that comprises foreground self-attention and foreground-background cross-attention modules. To capture contextual information across different scales, we apply it to multiple layers in the network. Compared with the attention models introduced in recent works [32, 15], DSA has two key advantages: first, it is more efficient and can capture more comprehensive contextual information; second, we impose an oracle supervision signal to ensure that the attention scores produced by DSA are reasonable. With the help of DSA, our approach obtains semantically consistent face completion results, as illustrated in Figure 1.

Furthermore, we also extract facial landmarks from the ground truth image to act as prior knowledge. Rather than imposing constraints to ensure that the facial landmarks of the recovered and ground-truth images are consistent, we use the facial landmarks to locate four key facial components: both eyes, the nose, and the mouth. Subsequently, we train one discriminator for each of the four recovered facial components. Benefiting from adversarial learning at the specified locations, our generator pays more attention to the textures of each key facial component. As a result, our proposed approach can generate more visually realistic textures, as shown in Figure 1. Since all discriminators are removed during testing, our approach does not yield any drop in efficiency.

We conduct a number of experiments on high-resolution face datasets, including CelebA-HQ [8] and Flickr-Faces-HQ [9]. Quantitative and qualitative comparisons demonstrate that our proposed approach outperforms state-of-the-art methods by large margins.

2 Related Work

Figure 2: The overall architecture of our model. We use the U-Net structure [14] as the backbone. The proposed attention module is embedded into layers 12, 13, and 14, where the resolution is , and respectively. We feed the ground truth images to the network so that we can impose an oracle supervision signal on the attention scores produced by DSA. We deploy discriminators on the masked area and on each of the four facial components. Best viewed in color.

2.1 Image Inpainting

Previous image inpainting methods can be divided into two categories: hand-crafted methods and learning-based methods.

Methods in the first category attempt to copy similar patches from the unmasked area to fill in the missing area [37, 20, 11, 5]. Criminisi et al. [2] established the priority of each missing patch with reference to the surrounding structure: when a missing patch is surrounded by more valid pixels or is closer to the boundary region, it is assigned a higher priority. These methods can restore continuous structures because they give priority to well-defined structural areas; however, iterating over every patch to search for the most similar one incurs high time and memory costs. Accordingly, Barnes et al. [1] proposed a faster method referred to as PatchMatch, which employs a randomized algorithm to quickly find approximate nearest-neighbor matches. However, hand-crafted methods are unable to handle images with complex structures or large occlusions, as only low-level features are considered.

The second category of methods [31, 19, 27] typically involves training a deep CNN with an encoder-decoder structure to predict each pixel of the missing area. Pathak et al. [18] proposed Context Encoders, which apply adversarial learning to the entire image. To generate more realistic details, Iizuka et al. [6] appended an extra local discriminator to improve the generation quality in the masked area; however, this approach relies on post-processing to alleviate artifacts. To address this problem, Liu et al. [14] used only valid pixels in each convolution and updated a binary mask to track the generated pixels, thereby alleviating artifacts.

Face Completion Due to the complexity and diversity of facial structures, face completion is one of the more challenging image inpainting tasks. Generally speaking, researchers in this area use a wealth of facial prior knowledge to aid restoration. For example, Li et al. [13] used face parsing to propose a semantic parsing loss. Song et al. [23] trained an extra network to restore the facial landmarks and face parsing, then input them along with the masked image to train the face completion network. However, the results of these methods are greatly affected by the performance of the prior-knowledge extraction network, which may consume a large amount of computational resources; moreover, these methods cannot directly guide the network to focus on the texture of key facial components.

2.2 Attention Mechanism

In order to maintain contextual consistency, Yu et al. [32] proposed a coarse-to-fine network containing a contextual attention module that learns the correlations between the missing and unmasked patches. Subsequently, some methods directly use the coarse-to-fine network with contextual attention; for example, GConv [33] turns the binary mask proposed by PConv [14] into learnable soft values as a gating mechanism, while Xiong et al. [28] used object contours as prior knowledge to assist restoration. Other methods opt to use contextual attention in different ways: Sagong et al. [21] designed a parallel decoding network to replace the coarse-to-fine structure, thereby reducing the number of parameters, while Zeng et al. [34] proposed an attention transfer network that uses the attention learned from high-level features to reconstruct low-level features. Moreover, the CSA layer [15] learns correlations between the patches inside the masked area.

In summary, previous attention-based methods learn long-range correlations in order to search for similar feature-level patches as references for filling. However, the learned attention is not reliable enough, as the parameters of the attention module lack direct supervision.

3 Proposed Algorithm

The overall architecture of the proposed approach is illustrated in Figure 2. We adopt the same U-Net structure as used in [14] to construct the basic generator. In what follows, we provide the details of our proposed method. Specifically, we first introduce the proposed DSA module with an oracle supervision signal, and then describe the deployment of the multi-discriminator. Finally, we describe the loss functions used to guide the training process.

3.1 Dual Spatial Attention with Supervision

We treat the masked area as the foreground and the unmasked area as background. When learning relations between the different parts of the face, we consider two key scenarios. First, when restoring the foreground features, we obtain reference information from the background. For example, when the left eye is masked and the right one is not, we obtain features from the right eye to help restore the left eye. Second, when the masked area is large, we consider the relations between different parts in the foreground. For instance, when both eyes are masked, we ensure that the restored eyes have similar features.

Inspired by the principle of self-attention [35, 3], we propose the DSA module, which comprises foreground self-attention and foreground-background cross-attention modules, so as to tackle the above two scenarios.

Figure 3: Overview of the Dual Spatial Attention (DSA) module, which includes two parallel branches. They focus on learning foreground-background cross-attention and foreground self-attention, respectively. Best viewed in color.

As shown in Figure 3, we first use the mask to segment the input feature map F into the foreground feature F_f and the background feature F_b. We then reshape F_f into a C × N_f matrix and F_b into a C × N_b matrix, where N_f and N_b denote the number of foreground and background pixels in F, respectively, and C denotes the number of channels. For the foreground-background cross-attention module, we feed F_f and F_b into 1-dimensional convolutional layers and generate two new feature maps Q and K, respectively. Subsequently, we conduct a matrix multiplication between the transpose of Q and K, and apply a softmax layer to obtain the attention map A^c of size N_f × N_b. We write,

    A^c_{i,j} = exp(Q_i · K_j) / Σ_{j'} exp(Q_i · K_{j'}),    (1)

where A^c_{i,j} denotes the degree of correlation between the i-th feature vector of Q (learned from F_f) and the j-th feature vector of K (learned from F_b).

Meanwhile, we also feed F_b into another 1-dimensional convolutional layer, so as to generate a new feature V. Next, we perform a matrix multiplication between V and the transpose of the attention matrix, and reshape the result back to the original size of F_f. In this way, the original foreground features are eventually rebuilt with consideration given to their correlation with the background features. Finally, we extend the rebuilt feature to the original feature map size by means of zero-padding, and then merge it into the original feature map F via an element-wise sum, which can be formulated as follows:

    F^c = γ_1 · P(V (A^c)^T) + F,    (2)

where P denotes the zero-padding operation and γ_1 is a trainable parameter initialized to zero.

For the foreground self-attention module, we follow the same steps as for the foreground-background cross-attention module, except that we use F_f only. Specifically, we use three 1-dimensional convolutional layers to generate feature maps Q', K', and V' from F_f. The foreground attention matrix A^s of size N_f × N_f can be formulated as

    A^s_{i,j} = exp(Q'_i · K'_j) / Σ_{j'} exp(Q'_i · K'_{j'}),    (3)

where A^s_{i,j} denotes the degree of correlation between the i-th feature vector of Q' and the j-th feature vector of K', where Q' and K' are both learned from F_f. The output can thus be formulated as follows,

    F^s = γ_2 · P(V' (A^s)^T) + F,    (4)

where γ_2 is a trainable parameter initialized to zero, as is γ_1. Finally, we fuse F^c and F^s by means of element-wise addition, and then adopt a convolutional layer to adjust the features and obtain the final refined feature map.
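The two branches can be summarized in a minimal NumPy sketch. This is not the authors' implementation: the 1-dimensional convolutional projections are stood in for by plain weight matrices (the names Wq, Wk, Wv and their self-attention counterparts are hypothetical), and the final fusion convolution is omitted, with both rebuilt features merged directly into the input feature map.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dual_spatial_attention(feat, mask, Wq, Wk, Wv, Wq2, Wk2, Wv2,
                           gamma1=0.0, gamma2=0.0):
    """Sketch of the DSA module on a (C, H, W) feature map.

    `mask` is (H, W) with 1 for foreground (hole) pixels. The W* matrices
    stand in for the learned 1-D convolutional projections; gamma1/gamma2
    are the merge weights, initialized to zero as in the paper.
    """
    C, H, W = feat.shape
    flat = feat.reshape(C, H * W)                # C x N
    fg = flat[:, mask.ravel() == 1]              # C x N_f (foreground)
    bg = flat[:, mask.ravel() == 0]              # C x N_b (background)

    # Foreground-background cross-attention: N_f x N_b scores.
    q, k, v = Wq @ fg, Wk @ bg, Wv @ bg
    cross = softmax(q.T @ k, axis=1)             # each row is a distribution
    rebuilt_cross = v @ cross.T                  # C x N_f

    # Foreground self-attention: N_f x N_f scores.
    q2, k2, v2 = Wq2 @ fg, Wk2 @ fg, Wv2 @ fg
    self_att = softmax(q2.T @ k2, axis=1)
    rebuilt_self = v2 @ self_att.T               # C x N_f

    # Zero-pad back to full size and merge via element-wise sum.
    out = flat.copy()
    out[:, mask.ravel() == 1] += gamma1 * rebuilt_cross + gamma2 * rebuilt_self
    return out.reshape(C, H, W), cross, self_att
```

Because the merge weights start at zero, the module initially behaves as an identity mapping and gradually learns how much attention-rebuilt content to blend in.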

Supervision Signal The attention module helps the network select reference features to improve the filling quality. Nevertheless, if the learned attention is insufficiently accurate, the network may refer to unsuitable features, resulting in poorer filling quality. Ensuring the accuracy of the learned attention scores is therefore key to improving the filling quality. However, the parameters of the attention module are optimized jointly with the whole face completion network using supervision signals for face completion only, meaning that the parameters of the attention module lack a direct constraint.

In order to impose a direct supervision signal on the attention, we extract the attention learned from the ground truth images as the objective. More specifically, in addition to taking the masked image as input, we also feed the ground truth image into the network during training. Through the same network layers, including the DSA, we obtain the oracle attention maps learned from the ground truth, denoted as Â^c and Â^s for the cross-attention and self-attention branches, respectively. Moreover, all attention scores are produced by a softmax layer, which means that each attention map comprises a number of probability distributions. Therefore, we use the KL-divergence distance to set up the objective function for attention. The KL-divergence loss is formulated as follows:

    L_kl = KL(Â^c || A^c) + KL(Â^s || A^s),    (5)

where A^c and A^s are the attention maps learned from the masked input, and each term is the average KL-divergence distance over the rows of the corresponding maps.

In addition, the attention maps learned at different scales complement each other. On a high-resolution feature map, the attention reflects the relations between small-scale features, such as hair; conversely, the attention on a low-resolution feature map reflects the relations between large-scale features, such as long-range aspects of the facial structure. After accounting for the trade-off between computational efficiency and attention learning, we embed the DSA module into three layers of the decoder (i.e., layers 12, 13, and 14) to assist the filling process. It is important to note that the attention at the coarsest scale captures the most long-range structural information and affects all later features; thus, we impose the KL-divergence loss only on layer 12 in order to strengthen the authenticity of the face structure, which also benefits the subsequent attention learning.
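The oracle supervision signal reduces to a row-wise KL divergence between two attention maps. A minimal NumPy sketch, under the assumption that each row of both maps is already a softmax distribution (the oracle map coming from the forward pass on the ground-truth image):

```python
import numpy as np

def attention_kl_loss(oracle, predicted, eps=1e-8):
    """Average row-wise KL divergence from the oracle attention map to
    the predicted one.

    Both maps have shape (N, M) and every row is a probability
    distribution produced by a softmax layer. `eps` guards the log
    against zero entries.
    """
    o = np.clip(oracle, eps, 1.0)
    p = np.clip(predicted, eps, 1.0)
    return float(np.mean(np.sum(o * np.log(o / p), axis=1)))
```

The loss is zero exactly when the predicted map matches the oracle, and strictly positive otherwise, which is what makes it a usable direct constraint on the attention parameters.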

Discussion Compared with the contextual attention layer [32], our DSA uses matrix multiplication rather than dividing the background patches into kernels for convolution, which significantly improves efficiency. In addition, the CSA layer [15] also learns the relations between foreground patches; however, it computes the similarity between two foreground patches in an iterative way, which slows down the filling, especially when the masked area is large. Moreover, neither contextual attention nor CSA ensures the accuracy of the learned attention scores, which leads to unsatisfactory results.

3.2 Multi-discriminator Design

Adversarial learning helps generate photo-realistic images by training the generator and discriminator until a Nash equilibrium is reached. In addition to the global discriminator [18], which is applied to the whole image, the local discriminator is designed to focus on the generated details of the masked area [6, 13]. For face completion tasks, the quality of the facial components largely determines the authenticity of the entire face image; however, relying solely on the global and local discriminators is insufficient to guide the network to focus on small regions. To this end, we propose the multi-discriminator to enhance facial details, especially on the facial components, as shown in Figure 4.

We first use the facial landmarks extracted by the method of [10] to mark the locations of the left eye, right eye, nose, and mouth. We then generate four masks of fixed size and use these masks to crop the four facial components. During training, we input each facial component of the generated image and of the ground truth into the corresponding discriminator, which judges whether it is real or fake. Moreover, inspired by the collocation of the global and local discriminators, we further divide the masked area into four equal, non-overlapping parts. These four parts share one discriminator, named the local subdivision discriminator, which focuses on more detailed characteristics and can also be regarded as a weighted optimization of the local discriminator loss.

By using the multi-discriminator, the generator can learn more specific features for each facial component and further enhance the details within the masked area. Unlike previous approaches that utilize prior knowledge, we only use facial landmarks to mark the positions of the facial components, after which we guide the network to improve the details in those specific areas. Furthermore, since the discriminators operate only during training, adding multiple discriminators does not affect inference efficiency.
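The landmark-based cropping step can be sketched as follows. This is a sketch under assumptions: the paper only states that landmarks from [10] locate the components, so the 68-point index ranges and the helper name `component_crops` are ours, not the authors'.

```python
import numpy as np

# Landmark index ranges for the common 68-point layout (an assumption;
# the paper does not specify the landmark indexing scheme).
COMPONENTS = {
    "left_eye": range(36, 42),
    "right_eye": range(42, 48),
    "nose": range(27, 36),
    "mouth": range(48, 68),
}

def component_crops(image, landmarks, crop_hw):
    """Crop a fixed-size patch centered on each facial component.

    image: (H, W, 3) array; landmarks: (68, 2) array of (x, y) points;
    crop_hw: (h, w) fixed crop size, mirroring the paper's fixed-size
    component masks. Crops are clipped to stay inside the image.
    """
    H, W = image.shape[:2]
    h, w = crop_hw
    crops = {}
    for name, idx in COMPONENTS.items():
        cx, cy = landmarks[list(idx)].mean(axis=0)  # component center
        x0 = int(np.clip(cx - w // 2, 0, W - w))
        y0 = int(np.clip(cy - h // 2, 0, H - h))
        crops[name] = image[y0:y0 + h, x0:x0 + w]
    return crops
```

Each crop (from both the generated image and the ground truth) would then be fed to the corresponding component discriminator during training.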

Figure 4: Overview of the multi-discriminator. denotes the generator. The four facial components are located by facial landmarks of the ground truth image. Best viewed in color.

3.3 Loss Functions

In order to effectively guide the training process, it is critical to design loss functions capable of measuring the distance between the generated image and the corresponding ground truth. We thus adopt multiple loss functions covering different aspects, described as follows.

Given an original ground truth image I_gt and a randomly generated binary mask M (zero for holes), we produce the training image I_in = I_gt ⊙ M by means of element-wise multiplication, and denote the output of the network as I_out. Moreover, in order to crop the facial components, we also produce a left-eye mask, right-eye mask, nose mask, and mouth mask, denoted by M_k (k = 1, …, 4), respectively.

Firstly, we use the L1-distance between the output and the ground truth as the reconstruction loss to constrain the pixel values. Furthermore, we propose to increase the penalty on the facial components and the masked area. The reconstruction loss L_re can be represented as

    L_re = || W ⊙ (I_out − I_gt) ||_1,  with  W = 1 + λ_h (1 − M) + λ_c Σ_k M_k,    (6)

where ⊙ denotes element-wise multiplication, 1 indicates a matrix with all values set to 1, and λ_h and λ_c are positive weights. Here, the facial components and the missing area have more weight than other parts, while the missing portion of the facial components is assigned the largest weight, 1 + λ_h + λ_c.
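The weighted reconstruction term can be sketched as below. The specific weight values are not given in the text, so `w_hole` and `w_comp` here are hypothetical placeholders that merely satisfy the stated ordering: holes and components weigh more than the rest, and their overlap weighs the most.

```python
import numpy as np

def weighted_l1_loss(output, gt, hole_mask, comp_mask,
                     w_hole=2.0, w_comp=3.0):
    """Weighted L1 reconstruction loss (sketch).

    hole_mask / comp_mask are 1 inside the hole / the facial components.
    The weights w_hole and w_comp are hypothetical; a hole pixel inside a
    facial component gets the largest weight, 1 + w_hole + w_comp.
    """
    weight = 1.0 + w_hole * hole_mask + w_comp * comp_mask
    return float(np.mean(weight * np.abs(output - gt)))
```

An error inside the masked portion of a facial component is thus penalized more heavily than the same error anywhere else in the image.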

Method L1 PSNR SSIM LPIPS [38]
PM [1] 5.82% 17.60 0.7786 0.2221
CA [32] 1.82% 24.58 0.8980 0.0977
PIC [39] 1.81% 25.31 0.9023 0.0897
GConv [33] 1.89% 26.29 0.8996 0.0809
Ours 1.46% 26.36 0.9107 0.0706
Table 1: Quantitative results on the same test set using randomly positioned masks. Higher SSIM and PSNR values are better; lower L1 error and LPIPS are better.

Secondly, we introduce the perceptual loss using a VGG-16 [22], which is pre-trained by [38], to impose a feature-level constraint:

    L_perc = Σ_l ( ||Φ_l(I_out) − Φ_l(I_gt)||_1 + ||Φ_l(I_comp) − Φ_l(I_gt)||_1 ),    (7)

where Φ_l denotes the output of the l-th layer of the pre-trained VGG-16, and I_comp = I_gt ⊙ M + I_out ⊙ (1 − M) is the composited output with the valid pixels copied from the ground truth. The perceptual loss computes the L1-distance in feature space between each of I_out and I_comp and the ground truth.
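The structure of this loss can be sketched independently of the backbone. Here `extract_features` is a stand-in for the pre-trained VGG-16 feature extractor (an assumption; any callable returning a list of feature maps works), and `hole_mask` is broadcastable against the images (e.g. shape (H, W, 1)).

```python
import numpy as np

def perceptual_loss(output, gt, hole_mask, extract_features):
    """Perceptual-loss sketch.

    `extract_features` stands in for the pre-trained VGG-16 and returns
    a list of feature maps. Both the raw output and the composited image
    (valid pixels copied from the ground truth) are compared against the
    ground truth in feature space with an L1 distance.
    """
    comp = gt * (1 - hole_mask) + output * hole_mask  # composited image
    loss = 0.0
    for img in (output, comp):
        for f_img, f_gt in zip(extract_features(img), extract_features(gt)):
            loss += np.mean(np.abs(f_img - f_gt))
    return float(loss)
```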

Thirdly, we adopt PatchGAN [7] as our discriminator structure, which maps the input image to a matrix in which each element represents the authenticity of a portion of the input image. In this way, the network pays more attention to local image details. In addition, we adopt an improved version of WGAN with a gradient penalty term [4]. The final adversarial loss function for each discriminator D_k is as follows:

    L_{D_k} = E[D_k(C_k(I_out))] − E[D_k(C_k(I_gt))] + λ_gp E[ ( ||∇_x̂ D_k(x̂)||_2 − 1 )^2 ],    (8)

where D_k indicates one of the seven discriminators shown in Figure 4, C_k denotes the crop operation used to obtain the corresponding area from the image, x̂ is interpolated from pairs of points sampled from the real data distribution and the generated distribution, ∇ denotes the gradient operation, and λ_gp is set to 10. Thus, the adversarial loss for the generator is as follows,

    L_adv = − Σ_k E[D_k(C_k(I_out))].    (9)
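A real implementation computes the gradient penalty with automatic differentiation; to keep the sketch self-contained, the toy below uses a linear critic D(x) = x·w, whose input gradient is w everywhere, so the penalty has a closed form. The function name and shapes are ours, not the authors'.

```python
import numpy as np

def wgan_gp_losses(real, fake, w, lam=10.0, rng=None):
    """WGAN-GP losses for a toy linear critic D(x) = x @ w (sketch).

    real, fake: (B, d) batches; lam is the gradient-penalty weight (10
    in the paper). For this critic the input gradient is w itself, so
    the penalty is computed in closed form; a real model uses autograd.
    """
    rng = rng or np.random.default_rng()
    eps = rng.uniform(size=(real.shape[0], 1))
    interp = eps * real + (1 - eps) * fake  # sampling step (gradient is
    # constant for a linear critic, so interp itself is not needed here)
    grad_norm = np.linalg.norm(w)           # ||∇_x D(x)||_2 = ||w||
    gp = lam * (grad_norm - 1.0) ** 2
    d_loss = float(np.mean(fake @ w - real @ w) + gp)  # critic objective
    g_loss = float(-np.mean(fake @ w))                 # generator objective
    return d_loss, g_loss
```

The penalty vanishes exactly when the critic is 1-Lipschitz along its input (here, when ||w|| = 1), which is the constraint the gradient-penalty term enforces softly.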

The KL-divergence loss L_kl is given in Equation 5, with its details described in Section 3.1. We define the overall loss function as follows,

    L_total = λ_re L_re + λ_perc L_perc + λ_adv L_adv + λ_kl L_kl,    (10)

where we empirically set the four trade-off parameters λ_re, λ_perc, λ_adv, and λ_kl to 10, 2, 1, and 1, respectively.
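The final combination is a plain weighted sum; the assignment of the weights 10, 2, 1, 1 to the reconstruction, perceptual, adversarial, and KL terms below follows the order in which the losses are introduced (an assumption, as the garbled source does not name the parameters).

```python
def total_loss(l_rec, l_perc, l_adv, l_kl,
               lam_rec=10.0, lam_perc=2.0, lam_adv=1.0, lam_kl=1.0):
    """Overall objective: weighted sum of the four loss terms with the
    empirically chosen trade-off parameters 10, 2, 1, and 1."""
    return (lam_rec * l_rec + lam_perc * l_perc
            + lam_adv * l_adv + lam_kl * l_kl)
```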

4 Experiment

Figure 5: Visual comparisons between different methods with random rectangle masks. Four state-of-the-art methods are compared: PM [1], CA [32], PIC [39], and GConv [33]. Best viewed zoomed in, paying attention to the details of the facial components. More qualitative results are presented in the supplementary material.

4.1 Experimental settings

Dataset We conduct a number of experiments on two high-quality human face datasets, CelebA-HQ [8] and Flickr-Faces-HQ [9], which contain 30,000 and 70,000 high-quality face images, respectively. We randomly selected 2,000 images from CelebA-HQ and 10,000 from Flickr-Faces-HQ to form the test sets.

Implementation Details We resize all input images to . The randomly positioned rectangle masks account for of the original image; the largest size is and the smallest is . During training, we use RMSprop as the optimizer with a learning rate of 0.0001. On a single NVIDIA TITAN Xp (12 GB), we train our model for four days on CelebA-HQ and eight days on Flickr-Faces-HQ, with a batch size of 16.

4.2 Comparisons with State-of-the-Art Methods

Method L1 PSNR SSIM LPIPS [38]
PEN-Net [34] 2.87% 24.53 0.8369 0.1701
Ours 2.28% 26.11 0.8718 0.1355
Table 2: Quantitative comparison on the same test set with a center mask. Higher SSIM and PSNR values are better; lower L1 error and LPIPS are better.

We conduct qualitative and quantitative comparisons with multiple methods, including PatchMatch (PM) [1], Contextual Attention (CA) [32], PIC [39], GConv [33], and PEN-Net [34]. We use the officially released models of CA, PIC, and GConv trained on CelebA-HQ to facilitate fair comparisons. We compare with PEN-Net on the same CelebA-HQ test set as [34]; since PEN-Net only handles a center mask on CelebA-HQ whose size is , we compare with it separately using a center mask. For PIC, we follow the official instructions to select the best of its multiple results. In addition, we train CA, PIC, and GConv models on Flickr-Faces-HQ; the results on Flickr-Faces-HQ are presented in the supplementary material. In order to focus on assessing the generative capability of the different models, we copy the valid pixels to the output image for all models in the comparison.

Quantitative Comparisons As outlined in Table 1 and Table 2, we conduct quantitative comparisons on CelebA-HQ using different masks. We select the commonly used L1 loss, peak signal-to-noise ratio (PSNR), and structural similarity (SSIM) as evaluation metrics in pixel space. As mentioned in [32, 16], however, these classical metrics are not optimal for image inpainting tasks; thus, we further use Learned Perceptual Image Patch Similarity (LPIPS) [38] as a perceptual metric. According to these metrics, our method not only outperforms the previous state-of-the-art image inpainting methods, but does so by a notable margin. This is because our method focuses on learning the exact relations of the facial structure and on improving the completion of the facial components; by contrast, other methods tend to ignore the complexity and diversity of the facial structure.

Method L1 PSNR SSIM LPIPS Time Cost
CA [32] 1.64% 25.56 0.9057 0.0899 12.9ms
DSA 1.46% 26.36 0.9107 0.0706 6.4ms
Table 3: Quantitative comparison results on CelebA-HQ. The first row is the result of our model that replaces DSA with the CA module [32]. Higher SSIM and PSNR values are better; lower L1 error and LPIPS are better. We compare the time cost per image by CA and DSA on layer 14 of the U-Net.
Figure 6: Qualitative comparison results on CelebA-HQ. In each row from left to right: (a) input image, (b) the result of using CA module [32] to replace DSA layer, (c) the result of using DSA module and (d) the ground truth image. Best viewed with zoom-in.

Qualitative Comparisons As shown in Figure 1 and Figure 5, we compare the visual quality using a center mask and a random mask on the CelebA-HQ dataset, respectively. Firstly, the hand-crafted method PM [1] fails to recover the basic structure of the face because it is difficult to find similar patches in the background. Secondly, CA [32], PIC [39], GConv [33], and PEN-Net [34] employ attention modules to learn contextual information for inpainting, but they still generate semantically inconsistent structures or textures, as their learned attention scores may not be reasonable. In addition, these methods may generate artifacts or blurry effects on facial components, such as the noses in the first row of Figure 5; this is because they do not pay enough attention to facial components. Finally, our approach produces sharper and more abundant details, especially for facial components, which can be explained by the effect of the multi-discriminator. More qualitative results are presented in the supplementary material.

Figure 7: Qualitative comparison results on CelebA-HQ. In each row from left to right: (a) input image, (b) the result of using global and local discriminator, (c) the result of using the multi-discriminator, and (d) the ground truth image. Best viewed with zoom-in.

4.3 Ablation Study

Effect of DSA Module We replace the DSA module in our model with the contextual attention (CA) module [32] for comparison. As shown in Table 3 and Figure 6, the model with DSA is more effective and produces images with fine-grained textures. This is because DSA explores pixel-to-pixel correlations rather than the patch-to-patch correlations used in CA; therefore, DSA can employ small-scale features as references for inpainting. Furthermore, as DSA learns foreground self-attention, it maintains contextual consistency in foreground areas. For example, the eyes generated by DSA are symmetric in both structure and texture, as shown in Figure 6, whereas the eyes generated by CA may be inconsistent with each other. Besides, DSA is also more efficient than CA, as the latter adopts more convolution operations [32]. The CSA module [15] also learns correlations between foreground patches, but its computational cost is significantly higher than DSA's, because it adopts an iterative processing strategy in which foreground pixels are refined one by one. Under the same GPU and experimental setting, DSA and CSA (our implementation) take 6.4 ms and 31.3 ms per image on layer 14, respectively.

Effect of Multi-discriminator As shown in Figure 7, visual comparisons show that using only the global and local discriminators [13, 6] is insufficient to generate high-fidelity facial textures. With the help of the local subdivision discriminator and the four facial-component discriminators, the generated results are sharper and contain richer textures, especially on the facial components.

Figure 8: Qualitative comparison results on CelebA-HQ. In each row from left to right: (a) input, (b) the result of using DSA module but without supervision signal, (c) the result of using DSA module with supervision signal and (d) the ground truth image. The yellow circles indicate artifacts. Best viewed with zoom-in.
Method PSNR SSIM LPIPS
DSA without supervision 26.23 0.9062 0.0726
DSA with supervision 26.36 0.9107 0.0706
Table 4: Quantitative results on CelebA-HQ demonstrating the effect of supervision on DSA. Higher SSIM and PSNR values are better; lower LPIPS is better.

Effect of Supervision on Attention As shown in Figure 8 (b), DSA without the extra supervision signal may produce artifacts in some challenging cases. For example, redundant hair is generated on the cheek of the girl in the first row. In the second row, the generated eye is close to blue due to the influence of the blue background; in addition, there is clutter on the boundary between the hair and the background. These artifacts are caused by inaccurate attention scores, which lead the network to refer to features in unsuitable areas for inpainting. As illustrated in Figure 8 (c), after imposing the oracle supervision signal as guidance, the accuracy of the attention scores is improved, and these artifacts and color discrepancies are accordingly resolved. As noted in Table 4, the quantitative results also confirm that the network benefits from the supervision signal.

5 Discussion and Extension

In this paper, we propose a comprehensive model for face completion consisting of a DSA module with an oracle supervision signal and a multi-discriminator, and conduct multiple experiments to demonstrate that it outperforms previous state-of-the-art methods. The supervised DSA module helps the network identify the correlations between different facial parts, while the multi-discriminator forces the network to learn the specific features of the facial components. In the supplementary material, we further show results of the proposed method on irregular masks and on higher-resolution face images ().

6 Acknowledgement

Changxing Ding is supported by NSF of China under Grant 61702193 and U1801262, the Science and Technology Program of Guangzhou under Grant 201804010272, the Program for Guangdong Introducing Innovative and Entrepreneurial Teams under Grant 2017ZT07X183, and the Fundamental Research Funds for the Central Universities of China under Grant 2019JQ01.

References

  • [1] C. Barnes, E. Shechtman, A. Finkelstein, and D. B. Goldman (2009) PatchMatch: a randomized correspondence algorithm for structural image editing. In ToG, Vol. 28, pp. 24.
  • [2] A. Criminisi, P. Pérez, and K. Toyama (2004) Region filling and object removal by exemplar-based image inpainting. TIP 13 (9), pp. 1200–1212.
  • [3] J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, and H. Lu (2019) Dual attention network for scene segmentation. In CVPR, pp. 3146–3154.
  • [4] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville (2017) Improved training of Wasserstein GANs. In NeurIPS, pp. 5767–5777.
  • [5] K. He and J. Sun (2012) Statistics of patch offsets for image completion. In ECCV, pp. 16–29.
  • [6] S. Iizuka, E. Simo-Serra, and H. Ishikawa (2017) Globally and locally consistent image completion. ACM Transactions on Graphics (ToG) 36 (4), pp. 107.
  • [7] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In CVPR, pp. 1125–1134.
  • [8] T. Karras, T. Aila, S. Laine, and J. Lehtinen (2017) Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196.
  • [9] T. Karras, S. Laine, and T. Aila (2019) A style-based generator architecture for generative adversarial networks. In CVPR, pp. 4401–4410.
  • [10] V. Kazemi and J. Sullivan (2014) One millisecond face alignment with an ensemble of regression trees. In CVPR, pp. 1867–1874.
  • [11] J. Lee, D. Lee, and R. Park (2012) Robust exemplar-based inpainting algorithm using region segmentation. TCE 58 (2), pp. 553–561.
  • [12] A. Li, J. Qi, R. Zhang, and R. Kotagiri (2019) Boosted GAN with semantically interpretable information for image inpainting. In IJCNN, pp. 1–8.
  • [13] Y. Li, S. Liu, J. Yang, and M. Yang (2017) Generative face completion. In CVPR, pp. 3911–3919.
  • [14] G. Liu, F. A. Reda, K. J. Shih, T. Wang, A. Tao, and B. Catanzaro (2018) Image inpainting for irregular holes using partial convolutions. In ECCV, pp. 85–100.
  • [15] H. Liu, B. Jiang, Y. Xiao, and C. Yang (2019) Coherent semantic attention for image inpainting. In ICCV, Cited by: §1, §2.2, §3.1, §4.3.
  • [16] I. Molodetskikh, M. Erofeev, and D. Vatolin (2019) Perceptually motivated method for image inpainting comparison. arXiv preprint arXiv:1907.06296. Cited by: §4.2.
  • [17] K. Nazeri, E. Ng, T. Joseph, F. Qureshi, and M. Ebrahimi (2019) Edgeconnect: generative image inpainting with adversarial edge learning. arXiv preprint arXiv:1901.00212. Cited by: §1.
  • [18] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros (2016) Context encoders: feature learning by inpainting. In CVPR, pp. 2536–2544. Cited by: §2.1, §3.2.
  • [19] Y. Ren, X. Yu, R. Zhang, T. H. Li, S. Liu, and G. Li (2019) StructureFlow: image inpainting via structure-aware appearance flow. In ICCV, pp. 181–190. Cited by: §2.1.
  • [20] T. Ružić and A. Pižurica (2014) Context-aware patch-based image inpainting using markov random field modeling. TIP 24 (1), pp. 444–456. Cited by: §2.1.
  • [21] M. Sagong, Y. Shin, S. Kim, S. Park, and S. Ko (2019) Pepsi: fast image inpainting with parallel decoding network. In CVPR, pp. 11360–11368. Cited by: §1, §2.2.
  • [22] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §3.3.
  • [23] L. Song, J. Cao, L. Song, Y. Hu, and R. He (2019) Geometry-aware face completion and editing. In AAAI, pp. 2506–2513. Cited by: §1, §2.1.
  • [24] Y. Song, C. Yang, Z. Lin, X. Liu, Q. Huang, H. Li, and C.-C. Jay Kuo (2018-09) Contextual-based image inpainting: infer, match, and translate. In ECCV, Cited by: §1.
  • [25] Y. Song, C. Yang, Y. Shen, P. Wang, Q. Huang, and C. J. Kuo (2018) SPG-net: segmentation prediction and guidance network for image inpainting. In BMVC, pp. 97. Cited by: §1.
  • [26] Y. Wang, X. Tao, X. Qi, X. Shen, and J. Jia (2018) Image inpainting via generative multi-column convolutional neural networks. In NeurIPS, pp. 331–340. Cited by: §1.
  • [27] C. Xie, S. Liu, C. Li, M. Cheng, W. Zuo, X. Liu, S. Wen, and E. Ding (2019) Image inpainting with learnable bidirectional attention maps. In ICCV, pp. 8858–8867. Cited by: §2.1.
  • [28] W. Xiong, J. Yu, Z. Lin, J. Yang, X. Lu, C. Barnes, and J. Luo (2019) Foreground-aware image inpainting. In CVPR, pp. 5840–5848. Cited by: §1, §2.2.
  • [29] Z. Yan, X. Li, M. Li, W. Zuo, and S. Shan (2018) Shift-net: image inpainting via deep feature rearrangement. In ECCV, pp. 1–17. Cited by: §1.
  • [30] C. Yang, X. Lu, Z. Lin, E. Shechtman, O. Wang, and H. Li (2017) High-resolution image inpainting using multi-scale neural patch synthesis. In CVPR, pp. 6721–6729. Cited by: §1.
  • [31] R. A. Yeh, C. Chen, T. Yian Lim, A. G. Schwing, M. Hasegawa-Johnson, and M. N. Do (2017) Semantic image inpainting with deep generative models. In CVPR, pp. 5485–5493. Cited by: §2.1.
  • [32] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang (2018) Generative image inpainting with contextual attention. In CVPR, pp. 5505–5514. Cited by: Table 5, Appendix A, Figure 10, Figure 11, Figure 9, §1, §1, §2.2, §3.1, Table 1, Figure 5, Figure 6, §4.2, §4.2, §4.2, §4.3, Table 3.
  • [33] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang (2019) Free-form image inpainting with gated convolution. In CVPR, Cited by: Table 5, Appendix A, Figure 10, Figure 11, Figure 9, §2.2, Table 1, Figure 5, §4.2, §4.2.
  • [34] Y. Zeng, J. Fu, H. Chao, and B. Guo (2019) Learning pyramid-context encoder network for high-quality image inpainting. In CVPR, pp. 1486–1494. Cited by: Figure 11, Figure 1, §1, §2.2, §4.2, §4.2, Table 2.
  • [35] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena (2019) Self-attention generative adversarial networks. In ICML, pp. 7354–7363. Cited by: §1, §3.1.
  • [36] H. Zhang, Z. Hu, C. Luo, W. Zuo, and M. Wang (2018) Semantic image inpainting with progressive generative networks. In ACM-MM, pp. 1939–1947. Cited by: §1.
  • [37] H. Zhang, Y. Jin, and Y. Wu (2010) Image completion by a fast and adaptive exemplar-based image inpainting. In ICCASM, Vol. 3, pp. V3–115. Cited by: §2.1.
  • [38] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang (2018) The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, pp. 586–595. Cited by: Table 5, §3.3, Table 1, §4.2, Table 2.
  • [39] C. Zheng, T. Cham, and J. Cai (2019) Pluralistic image completion. In CVPR, pp. 1438–1447. Cited by: Table 5, Appendix A, Figure 10, Figure 11, Figure 9, §1, Table 1, Figure 5, §4.2, §4.2.

Appendix

This supplementary material includes six sections. Section A shows comparison results between our method and state-of-the-art methods on the Flickr-Faces-HQ [9] database. Section B provides more qualitative comparisons on the CelebA-HQ [8] database. Section C shows two types of challenging cases: faces in profile and faces under complex illumination. Section D shows face completion results of our method with irregular masks. Section E provides the face completion results of our approach on high-resolution images (). Finally, Section F introduces the details of the network architecture.

Appendix A Comparisons on Flickr-Faces-HQ

Method L1 PSNR SSIM LPIPS [38]
CA [32] 1.96% 24.30 0.8896 0.0869
PIC [39] 1.88% 24.53 0.9007 0.0982
GConv [33] 1.85% 24.93 0.8879 0.0879
Ours 1.50% 26.06 0.9045 0.0693
Table 5: Quantitative comparisons on the same test set using rectangular masks of random position. Higher SSIM and PSNR values are better; lower L1 error and LPIPS values are better.

We conduct both quantitative and qualitative comparisons between our approach and state-of-the-art methods on the Flickr-Faces-HQ [9] database. Rectangular masks of random position are adopted. All images are resized to . As the original papers of CA [32], PIC [39], and GConv [33] do not report the performance of their models on Flickr-Faces-HQ, we use their released code to train the three models on Flickr-Faces-HQ. Quantitative comparison results are summarized in Table 5; our approach outperforms the other methods by large margins. Qualitative comparisons are provided in Figure 9; our method also achieves the best visual quality.
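Two of the metrics reported in Table 5 can be computed straightforwardly. The following sketch shows the L1 error (mean absolute error, reported as a percentage) and PSNR for images scaled to [0, 1]; SSIM and LPIPS [38] require their respective reference implementations and are omitted here.

```python
import numpy as np

def l1_error(pred, gt):
    """Mean absolute error between images in [0, 1], as a percentage."""
    return np.abs(pred - gt).mean() * 100.0

def psnr(pred, gt, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images in [0, max_val]."""
    mse = np.mean((pred - gt) ** 2)
    if mse == 0:
        return float("inf")            # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```

For example, a prediction that is uniformly off by 0.1 yields an L1 error of 10% and a PSNR of 20 dB.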

Appendix B More Comparison Results on CelebA-HQ

We show more qualitative comparisons on the CelebA-HQ [8] database in Figure 10 and Figure 11, in which rectangular masks of random position and a center mask are utilized, respectively. Intuitively, the center mask is more challenging, as most facial components are occluded and it is difficult to find references from the background. Our method achieves the best visual quality in both settings.

Appendix C Special Cases

Figure 12 shows the results of processing faces in profile or under complex illumination, which are indeed more challenging for the inpainting task due to the data-imbalance problem.

Appendix D Results with Irregular Masks

All the above experiments adopt rectangular masks. In this experiment, we show face completion results of our approach with irregular masks. As the masks are now irregular while discriminators usually require rectangular patches as input, we consistently apply the local discriminator and the local subdivision discriminator to the central region () of each training image. The other settings of our approach remain unchanged. Face completion results are illustrated in Figure 13, which shows that our approach produces face images of high fidelity even with irregular masks.
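The workaround above, feeding a fixed central rectangle to the rectangle-based discriminators regardless of the mask shape, amounts to a simple central crop. A minimal sketch (the function name and the (H, W, C) layout are illustrative assumptions; the crop size is elided in the paper):

```python
import numpy as np

def central_crop(image, size):
    """Crop the central (size x size) patch from an (H, W, C) image, so that
    rectangle-based discriminators can be applied even when the completed
    region itself is irregular."""
    h, w = image.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return image[top:top + size, left:left + size]
```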

Appendix E Results on High Resolution Images

In this experiment, we demonstrate the effectiveness of our approach on high-resolution face images (). The experimental settings are consistent with those used for low-resolution images (). The results are illustrated in Figures 14, 15, and 16. The recovered images contain rich facial textures that are highly consistent with the ground-truth images, which justifies the ability of our approach to generate high-fidelity face images.

Appendix F Network Architecture

We show the architectural details of the generator and the discriminators of our model in Table 6 and Table 7, respectively. For high-resolution images (), the architecture of our model is adjusted slightly, as shown in Table 8 and Table 9.

Figure 9: Comparisons on Flickr-Faces-HQ [9] between different methods with random rectangular masks. Three state-of-the-art methods are compared: CA [32], PIC [39], and GConv [33]. Best viewed with zoom-in; pay attention to the details of the facial components.
Figure 10: Comparisons on CelebA-HQ [8] between different methods with random rectangular masks. Three state-of-the-art methods are compared: CA [32], PIC [39], and GConv [33]. Best viewed with zoom-in; pay attention to the details of the facial components.
Figure 11: Comparisons on CelebA-HQ [8] between different methods with a center mask (). Four state-of-the-art methods are compared: CA [32], PIC [39], GConv [33], and PEN-Net [34]. Best viewed with zoom-in; pay attention to the details of the facial components.
Figure 12: Results on faces in profile or under complex illumination. All these images are included in the CelebA-HQ test set. Best viewed with zoom-in.
Figure 13: Results on CelebA-HQ with irregular masks.
Figure 14: Results on high-resolution images of CelebA-HQ ().
Figure 15: Results on high-resolution images of CelebA-HQ ().
Figure 16: Results on high-resolution images of CelebA-HQ ().
Layer 1 Conv(7, 7, 64), stride=2; ReLU
Layer 2 Conv(5, 5, 128), stride=2; BN; ReLU
Layer 3 Conv(3, 3, 256), stride=2; BN; ReLU
Layer 4 Conv(3, 3, 512), stride=2; BN; ReLU
Layer 5 Conv(3, 3, 512), stride=2; BN; ReLU
Layer 6 Conv(3, 3, 512), stride=2; BN; ReLU
Layer 7 Conv(3, 3, 512), stride=2; BN; ReLU
Layer 8 Conv(3, 3, 512), stride=2; BN; ReLU
Layer 9
Upsample(factor = 2); Concat(w/ Layer 7);
Conv(3, 3, 512), stride=1; BN;
LReLU(slope = 0.2);
Layer 10
Upsample(factor = 2); Concat(w/ Layer 6);
Conv(3, 3, 512), stride=1; BN;
LReLU(slope = 0.2);
Layer 11
Upsample(factor = 2); Concat(w/ Layer 5);
Conv(3, 3, 512), stride=1; BN;
LReLU(slope = 0.2);
Layer 12
Upsample(factor = 2); Concat(w/ Layer 4);
Conv(3, 3, 512), stride=1; BN;
LReLU(slope = 0.2);
Dual Spatial Attention Module(DSA);
Layer 13
Upsample(factor = 2); Concat(w/ Layer 3);
Conv(3, 3, 256), stride=1; BN;
LReLU(slope = 0.2);
Dual Spatial Attention Module(DSA);
Layer 14
Upsample(factor = 2); Concat(w/ Layer 2);
Conv(3, 3, 128), stride=1; BN;
LReLU(slope = 0.2);
Dual Spatial Attention Module(DSA);
Layer 15
Upsample(factor = 2); Concat(w/ Layer 1);
Conv(3, 3, 64), stride=1; BN;
LReLU(slope = 0.2);
Layer 16
Upsample(factor = 2); Concat(w/ Input);
Conv(3, 3, 3), stride=1; Sigmoid
Table 6: The architecture of the generator. BN denotes batch normalization and LReLU denotes leaky ReLU. We adopt a U-Net structure very similar to that used in [14] for the generator. The difference lies in two aspects: (1) we adopt conventional convolution rather than partial convolution; (2) we equip the U-Net with the Dual Spatial Attention (DSA) module.
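The spatial dimensions flowing through the U-Net in Table 6 can be traced with a short sketch. Each encoder layer halves the feature-map size (assuming "same" padding for the stride-2 convolutions) and each decoder layer doubles it via ×2 upsampling; the 256×256 input resolution used below is an assumption for illustration, matching the eight stride-2 layers of the encoder.

```python
def encoder_sizes(input_size, num_layers=8):
    """Spatial size after each stride-2 conv of the encoder in Table 6,
    assuming 'same' padding so the size halves at every layer."""
    sizes, s = [], input_size
    for _ in range(num_layers):
        s //= 2
        sizes.append(s)
    return sizes

def decoder_sizes(bottleneck_size, num_layers=8):
    """Spatial size after each x2 upsampling layer of the decoder,
    which mirrors the encoder back to the input resolution."""
    sizes, s = [], bottleneck_size
    for _ in range(num_layers):
        s *= 2
        sizes.append(s)
    return sizes
```

With a 256×256 input, the encoder bottoms out at a 1×1 bottleneck after eight layers, and the eight decoder layers restore the original resolution, with skip connections concatenating the matching encoder features at every scale.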
Layer 1 Conv(4, 4, ), stride=2; LReLU(slope = 0.2);
Layer 2 Conv(4, 4, 2 ), stride=2; LReLU(slope = 0.2);
Layer 3 Conv(4, 4, 4 ), stride=2; LReLU(slope = 0.2);
Layer 4 Conv(4, 4, 8 ), stride=1; LReLU(slope = 0.2);
Layer 5 Conv(4, 4, 1), stride=1
Table 7: The architecture of discriminators. denotes the number of channels of the convolutional layers. For the local subdivision discriminator and the four organ discriminators imposed on facial components, it equals 32. For the global and local discriminators, it equals 64 and 48, respectively.
Layer 1 Conv(7, 7, 64), stride=2; ReLU
Layer 2 Conv(5, 5, 128), stride=2; BN; ReLU
Layer 3 Conv(3, 3, 256), stride=2; BN; ReLU
Layer 4 Conv(3, 3, 512), stride=2; BN; ReLU
Layer 5 Conv(3, 3, 512), stride=2; BN; ReLU
Layer 6 Conv(3, 3, 512), stride=2; BN; ReLU
Layer 7 Conv(3, 3, 512), stride=2; BN; ReLU
Layer 8 Conv(3, 3, 512), stride=2; BN; ReLU
Layer 9 Conv(3, 3, 512), stride=2; BN; ReLU
Layer 10 Conv(3, 3, 512), stride=2; BN; ReLU
Layer 11
Upsample(factor = 2); Concat(w/ Layer 9);
Conv(3, 3, 512), stride=1; BN; LReLU(slope = 0.2);
Layer 12
Upsample(factor = 2); Concat(w/ Layer 8);
Conv(3, 3, 512), stride=1; BN; LReLU(slope = 0.2);
Layer 13
Upsample(factor = 2); Concat(w/ Layer 7);
Conv(3, 3, 512), stride=1; BN; LReLU(slope = 0.2);
Layer 14
Upsample(factor = 2); Concat(w/ Layer 6);
Conv(3, 3, 512), stride=1; BN; LReLU(slope = 0.2);
Dual Spatial Attention Module(DSA);
Layer 15
Upsample(factor = 2); Concat(w/ Layer 5);
Conv(3, 3, 512), stride=1; BN; LReLU(slope = 0.2);
Dual Spatial Attention Module(DSA);
Layer 16
Upsample(factor = 2); Concat(w/ Layer 4);
Conv(3, 3, 512), stride=1; BN; LReLU(slope = 0.2);
Dual Spatial Attention Module(DSA);
Layer 17
Upsample(factor = 2); Concat(w/ Layer 3);
Conv(3, 3, 256), stride=1; BN; LReLU(slope = 0.2);
Layer 18
Upsample(factor = 2); Concat(w/ Layer 2);
Conv(3, 3, 128), stride=1; BN; LReLU(slope = 0.2);
Layer 19
Upsample(factor = 2); Concat(w/ Layer 1);
Conv(3, 3, 64), stride=1; BN; LReLU(slope = 0.2);
Layer 20
Upsample(factor = 2); Concat(w/ Input);
Conv(3, 3, 3), stride=1; Sigmoid
Table 8: The architecture of the generator for input images of . To accommodate the high resolution, we add two convolutional layers to both the encoder and the decoder of the generator in Table 6.
Layer 1 Conv(4, 4, ), stride=2; LReLU(slope = 0.2);
Layer 2 Conv(4, 4, 2 ), stride=2; LReLU(slope = 0.2);
Layer 3 Conv(4, 4, 4 ), stride=2; LReLU(slope = 0.2);
Layer 4 Conv(4, 4, 8 ), stride=2; LReLU(slope = 0.2);
Layer 5 Conv(4, 4, 8 ), stride=2; LReLU(slope = 0.2);
Layer 6 Conv(4, 4, 8 ), stride=1; LReLU(slope = 0.2);
Layer 7 Conv(4, 4, 1), stride=1
Table 9: The architecture of discriminators for input images of . To accommodate the high resolution, we add two convolutional layers to the discriminators in Table 7.