Identity Preserving Face Completion for Large Ocular Region Occlusion

07/23/2018 ∙ by Yajie Zhao, et al.

We present a novel deep learning approach to synthesize complete face images in the presence of large ocular region occlusions. This is motivated by the recent surge of VR/AR displays that hinder face-to-face communication. Unlike state-of-the-art face inpainting methods, which have no control over the synthesized content and can only handle frontal face poses, our approach can faithfully recover the missing content under various head poses while preserving the identity. At the core of our method is a novel generative network with dedicated constraints to regularize the synthesis process. To preserve the identity, our network takes an arbitrary occlusion-free image of the target identity to infer the missing content, and uses its high-level CNN features as an identity prior to regularize the search space of the generator. Since the input reference image may have a different pose, a pose map and a novel pose discriminator are further adopted to supervise the learning of implicit pose transformations. Our method is capable of generating coherent facial inpainting with consistent identity over videos with large variations of head motion. Experiments on both synthesized and real data demonstrate that our method greatly outperforms the state-of-the-art methods in terms of both synthesis quality and robustness.




1 Introduction

Wearable VR/AR devices give users the ability to travel freely through physical environments mixed with immersive virtual content, enabling new applications in entertainment, education and telepresence. However, the large occlusion introduced by head-mounted displays (HMDs) is a huge hindrance to face-to-face communication. Such a limitation could prevent the adoption of VR/AR technologies in areas such as teleconferencing, in which eye contact and facial expressions are crucial elements of effective team communication and negotiation tactics.

Figure 1: Given a significantly occluded facial image (input), we synthesize the un-occluded face image using our framework with identity preserved (output). Also shown are the ground-truth images (target) and the reference images of the same person that we use to provide identity information (refs).

To enable better face-to-face-like communication when wearing an HMD, researchers have developed techniques to capture a wearer's expressions to drive a digital avatar [Li et al. 2015; Olszewski et al. 2016; Thies et al. 2016]. Although some impressive results have been demonstrated, the visual representation is only used for a "talking head" in the VR setting and is limited in quality and details. Instead of driving virtual avatars, another research direction is to inpaint the occluded regions with plausible faces. Nevertheless, inferring the content occluded by VR goggles is particularly challenging, as over half of the face is obstructed in most cases. [Burgos-Artizzu et al. 2015; Zhao et al. 2016] tried to synthesize the missing texture using a personalized database, but they require a dedicated capturing setup that makes their methods hard to generalize.

Although some state-of-the-art algorithms [Li et al. 2017a; Iizuka et al. 2017] are able to produce plausible face images under large occlusions, their inputs are required to be frontal and aligned, and the identity is usually not preserved during the completion. Such limitations make them infeasible in applications like headset removal, where the identity must be preserved while the face pose is likely to be changing.

We thus present a novel deep learning approach that can not only fill in large occluded regions with plausible content but also provide control over the restored face identity and face pose, as shown in Figure 1. The user can specify the desired face identity by providing an arbitrary occlusion-free face image of the target subject as reference. In addition, by inputting a pose map, our approach can generate facial structures consistent with the intended face orientation. These advances are enabled by a generative network that is optimized with dedicated constraints to regularize the synthesis process. To inpaint the occluded region with facial content that is visually similar to the input reference image, we introduce a novel reference network that imposes an identity prior onto the search space of the generator. The identity prior is extracted from the referenced identity and penalizes stylistic deviation between the generated result and the input reference image. In most cases, the reference image is likely to differ from the input in pose, illumination and background. To obtain a spatially coherent result, we regularize the generator using two discriminators: a global discriminator that enforces context consistency between the filled pixels and the surrounding background, and a pose discriminator that penalizes high-level postural errors. The pose map serves as both an input of the generator and the condition of the pose discriminator. By observing the ground-truth face pose, the pose discriminator penalizes unrealistic pose transformations produced by the generator.

Compared with the previous state-of-the-art methods, our approach is advantageous in the following aspects. 1) Our method provides significantly better results in the presence of large occluded regions, e.g., obstruction from large HMDs. 2) We propose the first face inpainting framework that can explicitly control the recovered face identity, which makes identity preservation possible in headset removal. 3) Our approach also offers editing of the face pose in the restored content. To the best of our knowledge, this is the first work that achieves realistic pose-varying face completion in videos.

2 Related Work

Synthesizing the missing portion of a face image can be formulated as an inpainting problem, which was first introduced in [Bertalmio et al. 2000]. To obtain a faithful reconstruction, a content prior is usually required, which comes from either other parts of the same image or an external dataset. The former methods generate reasonable inpaintings under specific assumptions, such as repetitiveness of texture [Efros and Leung 1999], spatial smoothness in the missing region [Shen and Chan 2002] or planar objects [Huang et al. 2014]. However, these methods are prone to fail when completing images with structured content. The data-driven methods leverage features learnt from a database to infer the missing content [Hays and Efros 2007; Pathak et al. 2016; Whyte et al. 2009; Mairal et al. 2008; Xie et al. 2012; Yeh et al. 2016; Fawzi et al. 2016; Gupta et al.]. In particular, the authors of [Hays and Efros 2007; Whyte et al. 2009; Mairal et al. 2008] generate the complete image automatically by using a feature dictionary.

Deep neural network based methods [Xie et al. 2012; Gupta et al.; Fawzi et al. 2016] hallucinate the missing portion of the images by learning from the background texture. However, these early attempts tend to generate blurry results and have no control over the semantic meaning of the generated result. More recently, several GAN frameworks have been proposed to address this issue [Pathak et al. 2016; Yeh et al. 2016; Isola et al. 2016; Yang et al. 2016b; Li et al. 2017b; Yeh et al. 2017; Iizuka et al. 2017]. GANs have been shown to perform well in generating realistic-looking images. [Li et al. 2017b; Iizuka et al. 2017] solve the general face completion problem by training a model with global and local discriminators. These discriminators ensure that the generated face appears realistic. In their face inpainting work, Yeh et al. [Yeh et al. 2017] search for the closest encoding in the latent image manifold to infer how the missing content should be structured, which predicts information in large missing regions and achieves appealing results. None of the existing face inpainting approaches is capable of preserving the identity, which makes them infeasible for headset removal applications.

The identity-preserving problem has been explored in related tasks [Li et al. 2016; Huang et al. 2017; Yin et al. 2017; Tran et al. 2017], e.g., attribute transfer, frontalization and face recognition. In particular, a pose code has also been introduced in [Tran et al. 2017; Yin et al. 2017] to resolve the identity ambiguity. However, trivially applying the above-mentioned approaches would fail in our case, as none of these works has considered large occlusion, e.g., an entirely blocked upper face, in their formulation. We show that jointly learning features from an arbitrary image of the target identity and a control pose map significantly improves the inpainting performance while achieving additional control of face identity and head pose, which, for the first time, enables inpainting over a dynamic sequence with large head pose variations.

Figure 2: Our network architecture

3 Identity Preserving Face Completion

To faithfully inpaint the missing region with realistic content while resembling the input identity under a pose constraint, we propose an architecture that consists of a generator and two discriminators as illustrated in Figure 2.

3.1 Generator

To inpaint large occluded regions with controllable face identity and head pose, the generator of our network takes three inputs: the occluded face image, an occlusion-free image of a reference identity, and a pose map that controls the head pose of the generated result. The pose map is a constant-value color image whose three channels encode the normalized pitch, yaw and roll angles that define the intended face orientation. Starting from a random variable, we progressively optimize the generator so that it learns the mapping from a normal distribution to an image manifold that is close to both the ground truth and the reference image under the pose constraint. We formulate the process of finding a recovered encoding as a conditioned optimization problem. In particular, the encoding is optimized by minimizing the sum of a reconstruction loss and an identity loss that penalizes the deviation from the referenced identity. As the $L_2$ loss empirically leads to blurry output and the $L_1$ loss performs better at preserving high-frequency details, we use the $L_1$ loss to measure the reconstruction error between the generated result and the ground-truth image.
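The pose map and the reconstruction error described above can be illustrated with a small NumPy sketch. The function names, the image size, and the normalization of angles from [-90, 90] degrees to [0, 1] are assumptions made for illustration; the text only states that the three channels hold normalized pitch, yaw and roll.

```python
import numpy as np

def make_pose_map(pitch, yaw, roll, size=128, max_angle=90.0):
    """Build a constant-value 3-channel image from head-pose angles in degrees.

    The [-max_angle, max_angle] -> [0, 1] normalization and the default size
    are assumptions of this sketch, not values taken from the paper.
    """
    angles = np.array([pitch, yaw, roll], dtype=np.float32)
    normalized = (np.clip(angles, -max_angle, max_angle) + max_angle) / (2.0 * max_angle)
    # Every pixel carries the same (pitch, yaw, roll) triple.
    return np.broadcast_to(normalized, (size, size, 3)).copy()

def l1_reconstruction_loss(generated, target):
    """Mean absolute error between a generated image and the ground truth."""
    diff = np.asarray(generated, dtype=np.float32) - np.asarray(target, dtype=np.float32)
    return float(np.mean(np.abs(diff)))

pose_map = make_pose_map(pitch=0.0, yaw=45.0, roll=-90.0, size=4)
loss = l1_reconstruction_loss(np.zeros((4, 4, 3)), np.ones((4, 4, 3)))
```

Because the map is constant over all pixels, it can be concatenated channel-wise with an image of the same spatial size before being fed to a convolutional network.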


Though the generator is conditioned on the reference image, the reconstruction loss alone is not sufficient to ensure visual similarity with the referenced identity. To achieve identity preservation, we add an identity loss by introducing a reference network that extracts high-level features from the generated result and the reference image. We utilize the pre-trained VGG Face network [Parkhi et al. 2015] as our feature extractor and extract the same feature for both input images. We define the identity loss as the distance between the two extracted feature vectors, where the feature-extracting function is the non-linear function learnt by the reference network.
Note that we do not require the referenced identity image to be pose-aligned with the ground truth, yet our model can still accurately capture the semantic features from the reference image. As demonstrated in Figure 4, the identity-dependent features are successfully transferred to the generated image. We thus interpret the reference network as a regularizer that imposes an identity/style prior on the manifold of generated images. The proposed network not only improves the synthesis quality but also stabilizes the output, which enhances temporal coherence when dealing with dynamic sequences, e.g., videos.
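As a minimal sketch of an identity loss of this kind, the snippet below compares two feature vectors that are assumed to have been extracted already by a pre-trained face network. The choice of the Euclidean distance and the function name are assumptions for illustration; no particular feature layer or library API is implied.

```python
import numpy as np

def identity_loss(feat_generated, feat_reference):
    """Euclidean distance between two identity feature vectors.

    Both inputs are assumed to be 1-D features produced by the same
    pre-trained face network (hypothetical inputs for this sketch).
    """
    a = np.asarray(feat_generated, dtype=np.float64)
    b = np.asarray(feat_reference, dtype=np.float64)
    if a.shape != b.shape:
        raise ValueError("feature vectors must have the same shape")
    return float(np.sqrt(np.sum((a - b) ** 2)))

example = identity_loss([3.0, 0.0], [0.0, 4.0])
```

Since the features are high-level, the two images do not need to be pixel-aligned for the distance to be meaningful, which matches the observation that the reference image need not share the pose of the ground truth.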

3.2 Discriminator

Though the generator can synthesize the missing content with low reconstruction and identity errors, there is no guarantee that the generated image is realistic and consistent with the surrounding background. A discriminator serves as a binary classifier that distinguishes real from fake images and thereby helps improve the synthesis quality. To encourage photorealism and effective control of the face pose, we introduce two discriminators to supervise the generator.

We first introduce a global discriminator to judge the fidelity and coherence of the entire image. The rationale for introducing a global discriminator is that the inpainted content should not only be realistic but also spatially coherent with the surrounding context. In addition, the global discriminator should impose constraints on forming semantically valid facial structures. In particular, we formulate the global discriminator loss as an adversarial loss over the distributions of real data and noise variables. The global discriminator is sufficient for synthesizing occluded faces with a fixed face pose. However, in our application scenario, where the HMD wearer is likely to rotate his/her head while talking, our network should be robust to variable face poses. That means the generated content should have facial structures oriented consistently with the input head pose. We therefore propose an additional pose discriminator that judges the faithfulness of the synthesized result given the pose constraint, and define a corresponding pose loss.


We condition the loss of the pose discriminator on the pose map so that the input pose map has more accurate control over the inpainted result. However, unlike the global discriminator, which back-propagates the gradient over the entire image, the pose discriminator only supervises the loss gradients for the missing region.
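For orientation, losses of this kind are commonly written in the standard GAN form, with the pose discriminator following the conditional variant. The sketch below assumes that form; the symbols $G$, $D_g$, $D_p$, $x$, $z$ and the pose map $p$ are chosen here for illustration and are not taken from the paper.

```latex
\begin{aligned}
L_g &= \min_G \max_{D_g} \;
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\log D_g(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D_g(G(z))\big)\big], \\
L_p &= \min_G \max_{D_p} \;
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\log D_p(x \mid p)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D_p(G(z) \mid p)\big)\big].
\end{aligned}
```

Here $p_{\mathrm{data}}$ and $p_z$ denote the distributions of real data and noise variables, and conditioning $D_p$ on $p$ is what lets the pose map penalize results whose orientation disagrees with the intended pose.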

Therefore, the overall loss function of our network is defined as a weighted sum of the reconstruction loss, the identity loss, the global discriminator loss and the pose discriminator loss, each with its own weight.
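Written out as a formula, such a weighted combination takes the following shape. The symbol names are assumptions for illustration: $L_r$, $L_{id}$, $L_g$ and $L_p$ stand for the reconstruction, identity, global discriminator and pose discriminator losses, and the $\lambda$ terms for their weights.

```latex
L = \lambda_r L_r + \lambda_{id} L_{id} + \lambda_g L_g + \lambda_p L_p
```

Setting $\lambda_{id}$ or $\lambda_p$ to zero recovers the reduced variants of the network that omit the identity or pose term.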

3.3 Architecture

In our experiments, we adopt a U-Net architecture with skip connections as our generator. Specifically, we concatenate the $i$-th layer onto the $(n-i)$-th layer, where $n$ is the total number of layers, to avoid the information loss caused by the bottleneck layer. On the discriminating side, we use PatchGAN for both the global and the pose discriminator.
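The skip pattern can be made concrete with a few lines of Python. This sketch assumes the common U-Net convention in which encoder layer i is paired with decoder layer n - i for a network of n layers (1-indexed); the function name and the example depth are hypothetical.

```python
def skip_connection_pairs(n):
    """List (encoder, decoder) layer pairs (i, n - i) for a U-Net with n layers.

    Layers are 1-indexed; only pairs with i < n - i are returned so that each
    connection is listed once and the bottleneck is not paired with itself.
    """
    return [(i, n - i) for i in range(1, n) if i < n - i]

pairs = skip_connection_pairs(8)
# pairs == [(1, 7), (2, 6), (3, 5)]
```

Each pair marks where encoder activations are concatenated onto decoder activations, so fine spatial detail can bypass the bottleneck.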

Figure 3: Comparison with state-of-the-art inpainting frameworks. Panels: (a) Input, (b) Pathak [Pathak et al. 2016], (c) Yeh [Yeh et al. 2017], (d) Li [Li et al. 2017b], (e) Iizuka [Iizuka et al. 2017], (f) Ours, (g) GT.

4 Implementation Details

We use images from MS-celeb-1M [Guo et al. 2016] to construct our training data. Our model is trained using 476 identities and 8,000 pairs of images (the occluded face image and its reference identity image). To prepare the data, MTCNN [Zhang et al. 2016] is applied to detect the landmarks and bounding boxes. We then scale all the images of the dataset to a common size and align them by registering the nose tip. The loss functions are optimized using the Adam optimizer with a learning rate of 0.0002, and the momentum is set to 0.9 in our training process. We train the network for 100 epochs. In all experiments, we use the same set of loss weights.

We implement our model with Torch [Collobert et al. 2011] on a platform with an Intel E3 CPU at 3.30GHz and an Nvidia GTX-1080 GPU. We can reconstruct face images at a frame rate of 20 fps.

5 Experimental Results

5.1 Face Identity Control

Figure 4: Cross identity experiments. From left to right: inputs, generated image, ground truth, reference images of another identity.

In this section, we evaluate the effectiveness of the proposed face identity control. Figure 4 shows the cross-identity experiments, where an image of another identity is fed into the network as the reference image. As seen in Figure 4, the referenced identity differs significantly from the original identity in terms of appearance and even gender. However, our model can still generate high-fidelity results with spatial coherence while capturing the high-level identity-dependent features of the referenced identity, e.g., thick eyebrows and eye color.

Quantitative Analysis

To further quantify the performance of identity preservation, for each synthesized result we apply OpenFace [Amos et al. 2016] to verify the identity similarity of our result against the ground truth and the reference image used in our network. In particular, OpenFace generates a binary output (0 or 1) indicating whether its two input images capture the same identity. We compare the performance of Li et al. [Li et al. 2017b], Iizuka et al. [Iizuka et al. 2017] and our algorithm using 588 image pairs (the input and reference image are of the same person). As shown in Table 1, our method significantly outperforms Li et al. [Li et al. 2017b] and Iizuka et al. [Iizuka et al. 2017], achieving a 93.2% pass rate when compared with the ground truth, which indicates the efficacy of our method in preserving the target identity.
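The pass rate reported for this kind of verification is simply the percentage of image pairs that the verifier labels as the same identity. A minimal sketch follows, assuming the binary outputs have already been collected into a list; the function name is hypothetical and no OpenFace call is shown.

```python
def pass_rate(verification_outputs):
    """Percentage of pairs judged to share an identity.

    verification_outputs is an iterable of binary results (1 = same identity,
    0 = different), one per image pair.
    """
    outputs = list(verification_outputs)
    if not outputs:
        raise ValueError("at least one verification result is required")
    return 100.0 * sum(outputs) / len(outputs)

rate = pass_rate([1, 1, 0, 1])
# rate == 75.0
```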

                           Ours (%)   Li (%)   Iizuka (%)
Compare with groundtruth   93.2       57.3     45.9
Compare with reference     87         42.8     34.7

Table 1: Comparison of Face Verification

Figure 5: Impacts of pose variation on the face reconstruction. The first column shows the input image with its ground-truth pose, and the first/second rows show the effects of different pitch/yaw values on the reconstruction results. As we can see, the close poses (in red box) tend to generate visually better results.

Figure 6: Video example. From top to bottom: input video frames, generated outputs from our network, and the ground truth. The reference image is shown left-most.

5.2 Head Pose Editing

In this section, we validate the effectiveness of the proposed pose editing component. Figure 5 shows how the pose variation influences the reconstruction of the face images. When different pose inputs are provided, the reconstructed facial structures are aligned accordingly. The first row of Figure 5 shows the impact of pitch variation on the reconstructed results. As the pitch value gradually increases, the synthesized eyes move up accordingly. A similar control effect is observed when tuning the yaw values, as shown in the second row of Figure 5. As the face regions outside the mask remain fixed, only the correct pose input is expected to lead to the most coherent and clear results. As seen from our results, only the closer poses (highlighted in red) generate visually better results, suggesting the accuracy of our pose control.

The pose editing component enables our model to perform face inpainting in dynamic sequences with spatio-temporal coherence. In Figure 6, we show the reconstructed frames with different head poses from a video sequence using only one frontal reference. Despite the large variations in pose, our network stably reconstructs appealing results. To better evaluate our method, we provide video results at

5.3 Ablative Analysis

To assess the efficacy of each introduced loss, we experiment on three combinations of losses: +GAN, +GAN+ID and +GAN+ID+Pose. Figure 7 shows the comparison of the above networks. In general, +GAN tends to generate blurry results and fails to capture spatial coherency. +GAN+ID demonstrates better performance in preserving the identity, although the result is still blurry and misaligned in pose. +GAN+ID+Pose, which is our proposed method, is capable of generating images with sharper details while faithfully capturing the referenced identity. The results indicate that the pose control component acts as an implicit alignment prior that registers different features and reduces the blurriness of each semantic part. Table 2 shows the quantitative evaluation on the test set. Our proposed network outperforms the other methods in both PSNR and SSIM.

       Yeh [Yeh et al. 2017]   Li [Li et al. 2017b]   Iizuka [Iizuka et al. 2017]   +GAN+ID   Ours
PSNR   18.87                   21.69                  23.14                         23.81     24.9
SSIM   0.79                    0.78                   0.82                          0.83      0.87

Table 2: Quantitative Evaluation on different Networks

Figure 7: The ablative analysis. From left to right: input, +GAN, +GAN+ID, +GAN+ID+pose, ground truth, and reference image.
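For readers unfamiliar with the PSNR metric used in the quantitative evaluation, the following NumPy sketch computes it from the mean squared error between two images. It assumes 8-bit images with a peak value of 255; the function name is illustrative and this is not the evaluation code used for the reported numbers.

```python
import numpy as np

def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    ref = np.asarray(reference, dtype=np.float64)
    est = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((ref - est) ** 2)
    if mse == 0:
        # Identical images: PSNR is unbounded.
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / mse))

# A constant error of 25.5 on a 255 scale gives an MSE of 650.25, i.e. 20 dB.
value = psnr(np.zeros((4, 4)), np.full((4, 4), 25.5))
```

Higher PSNR means a smaller pixel-wise error, whereas SSIM compares local structure, so the two metrics complement each other.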

6 Comparisons

Comparisons with other methods

We compare our results with other state-of-the-art inpainting frameworks. As shown in Figure 3, Pathak et al. [Pathak et al. 2016] smoothly fill the missing part without any semantic meaning. Though Yeh et al. [Yeh et al. 2017] can produce content with semantic meaning, their results tend to be blurry and fail to be spatially coherent with the surrounding context. Li et al. [Li et al. 2017b] and Iizuka et al. [Iizuka et al. 2017] demonstrate sharper results, but these are still somewhat blurry, and the generated facial features tend to be distorted and appear unnatural next to the unmasked regions. Compared to the other methods, our approach is capable of generating high-fidelity facial details that blend coherently with the background. In addition, we successfully preserve the identity, providing results close to the ground truth.

Test on images with real occlusion.

We test our network on video sequences in which the subjects are wearing HMDs. As we assume known head poses for our network, we first need to extract the head poses from the occluded images. Many HMDs, like the HTC Vive, provide real-time tracking of the head pose, which can be converted to our pose input via a simple calibration step. We also train a pose prediction network using the synthetic data with known pose information. The network consists of 5 convolutional layers that extract high-level features from the input image, and two fully connected layers that regress the features into a pose. In particular, our convolutional part is the same as the content prediction network of [Yang et al. 2016a], which is trained to inpaint the missing content. In Figure 8, we show a video result with a nearly frontal view, where we assume that the pose is fixed. As seen from the results, our network produces stable results when the pose changes slightly. In Figure 9, we test our network on a video in which the subject is wearing an HTC Vive VR headset and talking with large variations of head pose. Despite the large head movement, our network can still generate very promising results. Although our results on real data contain artifacts, the improvement is significant compared to Li et al. [Li et al. 2017b] and Iizuka et al. [Iizuka et al. 2017].

Figure 8: Reconstruction of real video data with frontal head pose. The first and second rows are the selected frames reconstructed by our network and the ground truth. The leftmost image is used as reference for all the frames.

Figure 9: Reconstruction of real video data with large head pose variation when wearing VR/AR headsets. From the left to the right columns: one single reference image, inputs, Li et al. [Li et al. 2017b], Iizuka et al. [Iizuka et al. 2017] and ours. The single reference image is used for all frames.

7 Conclusion, Limitation and Future work

We present a novel learning-based approach for face inpainting with the favorable property of preserving the identity of a given reference image. Furthermore, our approach offers flexible pose control over the reconstruction results, making it possible to faithfully restore facial details in occluded video sequences with large face pose variations. These two properties provide insight into solving the headset removal problem, which attracts increasing attention due to the surge of VR/AR techniques. Our network, in its current form, cannot handle extreme viewing angles and expressions well (failure cases can be found in the supplementary materials). However, we believe that by including such cases in the training dataset, the robustness of our network can be further improved. In the video inpainting results, jittering can be observed in the transitions between frames, as temporal coherency is not explicitly constrained in our formulation. It is worth investigating in future work how to add such a constraint to our current framework. The results on real data generated by our network still have some artifacts around the mask boundary. This is because the lower face usually has shadows cast by the HMD. One possible solution is to synthesize shadows for the training data. It would also be interesting future work to incorporate a pose estimation network to enable an end-to-end face inpainting network for videos.

8 Acknowledgements

We would like to express our gratitude to Nathan Jacobs and Xinyu Huang for sharing their pearls of wisdom with us during the course of this research, and we thank Qingguo Xu for assistance with experiments and for comments that greatly improved the manuscript.