LaFIn: Generative Landmark Guided Face Inpainting
It is challenging to inpaint face images in the wild, due to the large variation in appearance, such as different poses, expressions and occlusions. A good inpainting algorithm should guarantee the realism of its output, including the topological structure among eyes, nose and mouth, as well as attribute consistency in pose, gender, ethnicity, expression, etc. This paper studies an effective deep learning based strategy to deal with these issues, which comprises a facial landmark predicting subnet and an image inpainting subnet. Concretely, given a partial observation, the landmark predictor aims to provide the structural information (e.g. topological relationship and expression) of incomplete faces, while the inpaintor generates plausible appearance (e.g. gender and ethnicity) conditioned on the predicted landmarks. Experiments on the CelebA-HQ and CelebA datasets are conducted to reveal the efficacy of our design and to demonstrate its superiority over state-of-the-art alternatives both qualitatively and quantitatively. In addition, we assume that high-quality completed faces together with their landmarks can be utilized as augmented data to further improve the performance of (any) landmark predictor, which is corroborated by experimental results on the 300W and WFLW datasets.
Image inpainting (a.k.a. image completion) refers to the process of reconstructing lost or deteriorated regions of images, which can be applied, as a fundamental component, to various tasks such as image restoration and editing [1, 30]. Undoubtedly, one expects the completed result to be realistic, so that the reconstructed regions can hardly be perceived. Compared with natural scenes like oceans and lawns, manipulating faces, the focus of this work, is more challenging, because faces have much stronger topological structure and attribute consistency to preserve. Figure 1 shows three such examples. Very often, given the observed clues, human beings can easily infer what the lost parts possibly, although inexactly, look like. As a consequence, a slight violation of the topological structure and/or the attribute consistency in the reconstructed face is highly likely to lead to a significant perceptual flaw. The following gives the definition of the problem:
Face Inpainting. Given a face image $I$ with corrupted regions masked by $M$, let $\bar{M}$ designate the complement of $M$, and $\circ$ the Hadamard product. The goal is to fill the target part with information that is semantically meaningful and visually continuous with the observed part. In other words, the completed result should preserve the topological structure among face components such as eyes, nose and mouth, and the consistency of attributes like pose, gender, ethnicity and expression.
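The role of the mask, its complement and the Hadamard product can be illustrated with a minimal sketch. It assumes that a mask value of 1 marks a corrupted pixel and 0 an observed one, and that the final result keeps the observed pixels and takes generated pixels only inside the hole; both conventions, as well as the function name `composite`, are assumptions made for illustration and are not taken from the paper.

```python
def composite(image, generated, mask):
    """Keep observed pixels and take generated pixels inside the hole.

    All arguments are H x W nested lists of floats. mask[y][x] is 1 for a
    corrupted (to-be-filled) pixel and 0 for an observed one (assumed convention).
    """
    return [
        # Element-wise (Hadamard) products: mask * generated + (1 - mask) * image
        [m * g + (1 - m) * p for p, g, m in zip(img_row, gen_row, mask_row)]
        for img_row, gen_row, mask_row in zip(image, generated, mask)
    ]


image = [[0.2, 0.4], [0.6, 0.8]]
generated = [[0.9, 0.9], [0.9, 0.9]]
mask = [[0, 1], [0, 0]]  # only the top-right pixel is missing
completed = composite(image, generated, mask)
print(completed)
```

Only the masked pixel changes; every observed pixel is passed through untouched.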
Various image inpainting methods have been developed over the last decades. In what follows, we briefly review classic and contemporary works closely related to ours.
Traditional Inpainting Methods. In this category, diffusion-based and patch-based approaches are two representative branches. Diffusion-based approaches [2, 5, 32] iteratively propagate low-level features around the occluded areas. However, these methods are limited to reconstructing structureless and small-size regions. Patch-based methods [1, 4, 8], in contrast, attempt to copy similar blocks from either the same image or a set of images to the target regions. On the one hand, calculating the similarity between blocks is computationally expensive, even though some works have been proposed to accelerate the procedure. On the other hand, as a common limitation, they all hypothesize that the missing part can be found elsewhere, which does not always hold in practice.
Deep Learning-based Methods. Recently, deep learning based methods have become the mainstream for image inpainting. The context encoder, a pioneering deep-learning method for image completion, introduces an encoder-decoder network trained with an adversarial loss. After that, plenty of follow-ups have been proposed to improve the performance from various aspects. For instance, one scheme employs both global and local discriminators to accomplish the task. Another attempt designs a coarse-to-fine network structure and applies a self-attention layer to connect related features at distant spatial locations. Besides, Yu et al. and Liu et al. [33, 16] upgrade the convolutional layers to make networks adaptive to the masked input. However, most of the above-mentioned methods can barely keep the structure of the original image, and the inpainted result frequently tends to be blurry, especially on large occluded areas. For the sake of maintaining the structure of corrupted images, a number of methods, such as [20, 31], first predict the edge information for corrupted images and then apply it as a condition to guide the inpainting. These methods work well on small corrupted regions. When the corruption becomes larger, however, the performance significantly degrades, as it is not easy to predict reasonable edges inside the masked regions, leading to unsatisfactory results.
Deep Face Inpainting Methods. Specific to face completion, one approach constructs a loss that accounts for the gap in semantic segmentation (face parsing) between generated face images and the ground truth, expecting to better preserve the structure. But this work often suffers from color inconsistency, and lacks the ability to handle faces with large poses. Besides, [11, 33] directly ask users to manually label edges for generating corresponding results. Although this provides a flexible way to edit faces, it is sometimes difficult or inconvenient for users to input precise edge information. To relieve this requirement on users, Nazeri et al. applied a network to predict the edges, which however suffers from inaccurate or unreasonable predictions on large holes. Moreover, we argue that, for face completion, both face parsing and edge information are relatively redundant, and may even degrade the performance when slightly inaccurate information is fed into the inpainting module. Facial landmarks are better suited to act as the indicator, since they are neat, sufficient, and robust in reflecting the structure of a face; please see Fig. 2 for an example. Many works have successfully applied landmarks to the task of face generation. It is worth noting that, different from these generation tasks, in our problem the landmarks need to be obtained from the corrupted images.
As stated previously, completing face images in the wild is challenging. A qualified face inpainting algorithm should carefully take into account the following two concerns to guarantee the realism of output:
Faces are of strong structure. The topological relationship among facial features including eyebrows, eyes, nose and mouth is always well-organized. The completed faces must satisfy this topology structure primarily;
The attributes of face, such as pose, gender, ethnicity, and expression, should be consistent across the inpainted regions and the observed part.
Otherwise, a slight violation on these two factors will result in a significant perceptual flaw.
Why adopt landmarks? This work employs facial landmarks as structural guidance, because of their compactness, sufficiency, and robustness. One may ask whether edge or parsing information provides more powerful guidance than landmarks. If the information is precise, the answer is yes. But, taking the edge-based strategy as an example, it is not easy to generate reasonable edges in challenging situations like faces with large corrupted areas and large poses. Under such circumstances, the redundant and inaccurate information would instead hurt the performance. In contrast, a set of landmarks (pre-defined fiducial points) always exists, no matter what situation the face is in. Further, the landmarks can be viewed as discrete, ordered points sampled on the key edges and regions of the face, which are sufficient to recover the key edges and facial regions (face parsing) with redundant information removed. Compared with the edge and parsing information, for one thing, the landmarks are much neater and more robust; please see Fig. 2 for illustration. For another, once the landmarks of a face are obtained, the topology structure, pose and expression can be subsequently determined. Moreover, the landmarks are more convenient to control from the editing perspective. These properties support that using landmarks is a better choice for face completion.
How to guarantee the attribute consistency? Apart from the pose and expression attributes determined by the landmarks, there are several other attributes, such as gender, ethnicity, and makeup style, that need to be considered. Notice that the consistency is to bridge the observed and the inpainted regions. That is to say, for these finer-grained attributes, the inpainting algorithm should take the observed (real) information as reference for the reconstruction. Harnessing distant spatial context (large receptive field) and connecting temporal feature maps (long-short term) can effectively fulfill the requirement.
This paper presents a deep network, namely Generative Landmark Guided Face Inpaintor (LaFIn for short), which comprises a facial landmark predicting subnet and an image inpainting subnet, for solving the face inpainting problem. The main contributions can be summarized in the following aspects.
As analyzed, facial landmarks are neat, sufficient, and robust to act as the indicator for face inpainting. We construct a module for predicting landmarks on incomplete faces, which reflect the topological structure, pose and expression of the target face.
To complete faces, we design an inpainting subnet that employs the predicted landmarks as guidance. For the attribute consistency, the subnet harnesses distant spatial context and connects temporal feature maps.
Extensive experiments are conducted to reveal the efficacy of our design and to show its advances over state-of-the-art alternatives both qualitatively and quantitatively.
In addition, we can further use the completion results to help boost the performance of data-driven landmark detectors. Since, in real situations, the training data are often insufficient and manually labeling landmarks is time-consuming, a simple yet reliable data augmentation method is definitely desired. Another contribution of ours is as follows:
The completion can generate various plausible new faces conditioned on the landmarks. Thus, the generated face and the corresponding (ground-truth) landmarks can be employed as the augmented data to relieve the pressure from manual annotation. The effectiveness of this manner is confirmed by experimental results on the WFLW and 300W datasets.
A desired face inpaintor should generate natural-looking results with logical structures and attributes. We build a deep network, denoted as LaFIn, to achieve the goal. As schematically illustrated in Fig. 3, the network is composed of a subnet for predicting landmarks, and another one for generating new pixels conditioned on the predicted landmarks. In the next subsections, we shall detail the network.
The landmark prediction module aims to retrieve a set of ( in this work) landmarks from a corrupted face image, using its own trainable parameters. Technically, this can be accomplished by any landmark detector like [29, 14, 28]. Please notice that, for the target inpainting task, what we expect from the landmarks is more about the underlying topology structure and some attributes (pose and expression) than the precise location of each individual landmark. The following may explain the reason: taking the landmarks on the face contour as an example, with the corresponding region fixed, shifting them along the contour will not affect the final result much. Consequently, we build a simple yet sufficiently effective landmark predictor. It is built upon the MobileNet-V2 model, which focuses on feature extraction. The final landmark prediction is achieved by fully connecting the fused feature maps at different rear stages, as illustrated in Fig. 3. The training loss for the landmark predictor is simply a norm of the difference between the predicted landmarks and the ground-truth landmarks.
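As a rough illustration of such a landmark loss, the sketch below sums the squared coordinate differences between predicted and ground-truth landmarks. The choice of the squared l2 norm is an assumption, since the norm used in the paper is not recoverable from the text here, and `landmark_loss` is a hypothetical name.

```python
def landmark_loss(predicted, ground_truth):
    """Sum of squared coordinate differences over all landmarks.

    Both arguments are lists of (x, y) tuples in the same order.
    The squared l2 norm is an assumed choice for this sketch.
    """
    return sum(
        (px - gx) ** 2 + (py - gy) ** 2
        for (px, py), (gx, gy) in zip(predicted, ground_truth)
    )


predicted = [(10.0, 12.0), (30.0, 11.0), (20.0, 25.0)]
ground_truth = [(11.0, 12.0), (30.0, 13.0), (20.0, 25.0)]
loss = landmark_loss(predicted, ground_truth)  # 1 + 4 + 0
print(loss)
```

The loss only sees a handful of coordinates per face, which is why it behaves very differently from the dense image losses used for the inpaintor.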
The inpainting network aims to complete faces by taking corrupted images and their (predicted or ground-truth) landmarks as input. This subnet comprises a generator and a discriminator.
Generator. Overall, the generator is based on the U-Net structure. More specifically, the network consists of three gradually down-sampled encoding blocks, followed by seven residual blocks with dilated convolutions and a long-short term attention block. Then, the decoder gradually up-samples the feature maps to the same size as the input. The long-short term attention layer is harnessed to connect temporal feature maps, and the stacked dilated blocks enlarge the receptive field so that features in a wider range can be taken into account. Besides, shortcuts are added between corresponding encoder and decoder layers. Moreover, a 1x1 convolution is executed before each decoding layer as channel attention to adjust the weights of the features from the shortcut and the last layer. In this way, the network can make better use of distant features both spatially and temporally. The structure of the generator can be found in Fig. 3, with more details in the Appendix.
Discriminator. Based on the concept of a two-player game, the generator tries to produce completed faces conditioned on the landmarks to fool the discriminator, while the discriminator aims to determine whether the generated result satisfies the data distribution. Convergence is reached when the generated results are not distinguishable from the real ones. In this work, our discriminator is built upon the Patch-GAN architecture. To stabilize the training process, we introduce spectral normalization (SN) into the blocks of the discriminator. Besides, an attention layer is inserted to adaptively treat the features. It is worth noticing that some works employ two discriminators, i.e. a global discriminator that focuses on the entire image to assess whether it is coherent as a whole, and a local one that looks only at the completed region to ensure local consistency. In contrast, our design adopts only one discriminator, which takes an image and its landmarks as input. The reasons are: 1) the generated results are conditioned on the landmarks, which already ensures the global structure; and 2) the attention layer concentrates more on the attribute consistency. The configuration of our discriminator can be found in Fig. 3, with more details in the Appendix.
Loss. We use a combination of a per-pixel loss, a perceptual loss, a style loss, a total variation loss and an adversarial loss for training the inpaintor.
(I) The per-pixel loss is a norm of the difference between the completed result and the ground truth, divided by the mask size. Using the mask size as the denominator adjusts the penalty: if a face is corrupted by a small occlusion, the inpainted result should be very close to the ground truth, while if the corruption is large, the restriction can be relaxed as long as the structure and consistency are rational.
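The effect of using the mask size as the denominator can be sketched as follows. The l1 norm and the function name `per_pixel_loss` are assumptions made for illustration; only the normalization by mask size comes from the text.

```python
def per_pixel_loss(completed, target, mask):
    """Absolute error summed over the image, divided by the number of masked pixels.

    completed, target: H x W nested lists of floats.
    mask: H x W nested list with 1 for corrupted pixels (assumed convention).
    """
    abs_error = sum(
        abs(c - t)
        for c_row, t_row in zip(completed, target)
        for c, t in zip(c_row, t_row)
    )
    mask_size = sum(sum(row) for row in mask)
    return abs_error / max(mask_size, 1)  # guard against an empty mask


completed = [[0.5, 1.0], [0.25, 0.75]]
target = [[0.5, 0.5], [0.25, 0.25]]
small_mask = [[0, 1], [0, 1]]
large_mask = [[1, 1], [1, 1]]

print(per_pixel_loss(completed, target, small_mask))
print(per_pixel_loss(completed, target, large_mask))
```

For the same absolute error, the larger mask yields the smaller penalty, which matches the relaxation described for large corruptions.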
(II) The perceptual loss measures the difference between feature maps extracted from a pre-trained network for the completed result and for the ground truth. Feature maps from several layers of the VGG-19 pre-trained on ImageNet are utilized to calculate the perceptual loss, as well as the style loss described below.
(III) The style loss compares the Gram matrices corresponding to those feature maps.
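To make the role of the Gram matrix concrete, the following sketch computes it for a tiny set of feature maps and compares two inputs. The normalization by the number of spatial positions, the l1 comparison, and the names `gram_matrix` and `style_distance` are assumptions for this illustration; a real implementation would take the features from VGG-19 rather than from hand-written lists.

```python
def gram_matrix(features):
    """Channel-by-channel correlation of feature maps.

    features: list of C channels, each a flat list of H*W activations.
    """
    n = len(features[0])
    return [
        [sum(a * b for a, b in zip(fi, fj)) / n for fj in features]
        for fi in features
    ]


def style_distance(feat_a, feat_b):
    """Mean absolute difference between the two Gram matrices."""
    ga, gb = gram_matrix(feat_a), gram_matrix(feat_b)
    c = len(ga)
    return sum(abs(x - y) for ra, rb in zip(ga, gb) for x, y in zip(ra, rb)) / (c * c)


feat = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
# Same activations with the spatial positions shifted by one.
shifted = [[0.0, 1.0, 0.0, 1.0], [1.0, 0.0, 1.0, 0.0]]

print(gram_matrix(feat))
print(style_distance(feat, shifted))
```

Because the Gram matrix sums over spatial positions, moving the activations around leaves it unchanged; the style term therefore captures texture statistics rather than layout.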
(IV) The total variation loss is utilized to suppress the checkerboard artifact. It is a norm of the first-order derivative of the completed image, containing horizontal and vertical components, normalized by the pixel number of the image.
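A minimal sketch of a total variation term is given below. It assumes forward differences and the l1 norm, divided by the number of pixels; these specifics and the name `total_variation` are illustrative assumptions.

```python
def total_variation(image):
    """Mean absolute first-order difference (horizontal plus vertical).

    image: H x W nested list of floats.
    """
    h, w = len(image), len(image[0])
    horizontal = sum(
        abs(image[y][x + 1] - image[y][x]) for y in range(h) for x in range(w - 1)
    )
    vertical = sum(
        abs(image[y + 1][x] - image[y][x]) for y in range(h - 1) for x in range(w)
    )
    return (horizontal + vertical) / (h * w)


flat = [[0.5] * 4 for _ in range(4)]
checkerboard = [[float((x + y) % 2) for x in range(4)] for y in range(4)]

print(total_variation(flat))
print(total_variation(checkerboard))
```

A smooth image scores zero, while a checkerboard pattern is penalized heavily, which is the behavior wanted for suppressing that artifact.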
(V) The adversarial loss adopts the LSGAN, due to its stability during the training process and its advantage in visual quality.
The total loss with respect to the generator is a weighted sum of the above terms.
We use , , and in our experiments. The whole training procedure alternately minimizes the total loss for the generator and the adversarial loss for the discriminator until convergence.
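The least-squares adversarial objectives can be sketched as below, using the common LSGAN target values of 1 for real and 0 for fake. The target values, the absence of a 1/2 factor, and the function names are assumptions; the exact form used in the paper is not recoverable from the text here.

```python
def lsgan_discriminator_loss(real_scores, fake_scores):
    """Push discriminator outputs toward 1 on real inputs and 0 on generated ones."""
    real_term = sum((s - 1.0) ** 2 for s in real_scores) / len(real_scores)
    fake_term = sum(s ** 2 for s in fake_scores) / len(fake_scores)
    return real_term + fake_term


def lsgan_generator_loss(fake_scores):
    """Push discriminator outputs on generated inputs toward 1."""
    return sum((s - 1.0) ** 2 for s in fake_scores) / len(fake_scores)


real_scores = [1.0, 0.5]  # discriminator outputs on real faces
fake_scores = [0.0, 0.5]  # discriminator outputs on completed faces

print(lsgan_discriminator_loss(real_scores, fake_scores))
print(lsgan_generator_loss(fake_scores))
```

In alternating training, one step lowers the first quantity with the generator fixed, and the next lowers the generator's total loss (which includes the second quantity) with the discriminator fixed.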
The generator is desired to complete the image conditioned on the landmarks. For face images, their strong regularity, like the landmarks considered by our design, could benefit model reduction and the training procedure, as the solution space is considerably restricted by the regularity. Intuitively, the landmark predictor and the inpaintor can be trained jointly. Technically, this is feasible. But, in practice, it is not a good choice. The reasons are as follows: 1) the loss for the landmark predictor is computed over only a small number of landmark locations, which is incompatible with the inpainting loss. In other words, the parameter tuning is extremely hard; and 2) even with well-tuned parameters, both modules may be too inaccurate, especially at the beginning of training, which consequently leads to low-quality landmark prediction and inpainting results. These two coupled factors very likely drag the training into dilemmas, like bad points of convergence and/or high training cost. Thus, we decouple the joint model into the landmark prediction and inpainting modules, and train them separately. It is worth noting that we have actually trained the model in a joint way with different carefully tuned settings, and the best result is still inferior to our separate training. In experiments shown in this work, the landmark prediction model and the inpainting model are trained using images and optimized by the Adam optimizer with and , and the learning rate . The learning rate of the discriminator is We use batch size for the landmark prediction module and batch size for the inpainting model.
In this part, we evaluate the face inpainting performance of our LaFIn on the CelebA-HQ face dataset [17, 12]. The masks used for training come from the random mask dataset and additional randomly generated block masks. The competitors involved in the comparison include Context Encoder (CE), Generative Face Completion (GFC), Contextual Attention (CA), Geometry Aware Face Completion (GAFC), Pluralistic Image Completion (PIC), and EdgeConnect (EC). For quantitatively measuring the performance difference among the competitors, we employ PSNR, SSIM and FID as metrics. For PSNR and SSIM, higher values indicate better performance, while for FID, the lower the better. As the ground-truth landmarks are unavailable for the CelebA-HQ dataset, we use the landmarks produced by FAN as the reference information for training our landmark predictor.
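For readers unfamiliar with the first metric, PSNR follows the standard definition 10 log10(MAX^2 / MSE). The sketch below computes it on tiny grayscale arrays; the 8-bit peak value of 255 is an assumed default, and `psnr` is a name chosen for this example rather than the evaluation code used in the paper.

```python
import math


def psnr(reference, result, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images.

    reference, result: H x W nested lists of pixel intensities.
    """
    squared_errors = [
        (a - b) ** 2
        for ref_row, res_row in zip(reference, result)
        for a, b in zip(ref_row, res_row)
    ]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


reference = [[100.0, 100.0]]
result = [[125.5, 74.5]]  # every pixel is off by 25.5, i.e. 10% of the peak value

print(psnr(reference, result))
```

An error of 10% of the peak value at every pixel corresponds to 20 dB; smaller errors give higher values, consistent with "higher is better".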
Result comparison. Table 1 reports the performance of CA, EC, PIC and our LaFIn with different types and sizes of mask. Notice that for CA and PIC, pre-trained models on CelebA-HQ are provided (https://github.com/JiahuiYu/generative_inpainting and https://github.com/lyndonzheng/Pluralistic-Inpainting, respectively). Since the authors of EC do not offer a pre-trained model on the CelebA-HQ dataset, we try our best to retrain it using the training code (https://github.com/knazeri/edge-connect). As can be seen from the numbers in Table 1, EC is superior to PIC and CA in most cases, as it employs the edge information to help inpainting. Overall, our LaFIn outperforms the others by large margins in terms of PSNR, SSIM and FID, except for the center-mask case, where it falls behind PIC in terms of FID; the explanation is given in the Appendix. This comparison verifies that landmarks are stronger and more robust guidance than edges for the task of face inpainting. Further quantitative comparisons with CE, EC and GFC under center masks on CelebA are shown in Table 2. Figure 4 depicts two visual comparisons among CA, EC, PIC, and our LaFIn, from which we can see that LaFIn can generate more natural-looking and visually striking results even in cases with large poses and extreme occlusions. Figures 5 and 6 further provide visual comparisons of CE, GFC, GAFC, EC and LaFIn on four samples from the CelebA dataset. Notice that GFC utilizes the face parsing information and GAFC uses both the landmarks and parsing to guide the inpainting. As observed from the results, those by GFC suffer from the face component shifting problem. GAFC seems to somewhat mitigate the problem due to the introduction of landmarks, but is still inferior to our LaFIn. (Since neither the code nor the implementation details of GAFC were available when this paper was prepared, we only compare on the cases cropped from the GAFC paper.) This comparison indicates that the redundancy of the face parsing prior may instead hurt the performance. It is worth emphasizing that GAFC considers the symmetry property of faces and the low rankness of the mask in its loss, which are not so reasonable because large poses of faces and random corruptions can easily violate these properties. Also, editing parsed regions (plus landmarks) is much more difficult than editing sparse landmarks. Figure 7 gives several more results by LaFIn. Due to the page limit, please find more comparisons in the Appendix.
Ablation study. Table 3 reports the difference among LaFIn with the long-short term attention (LSTA) disabled (denoted as w/o LSTA), LaFIn with the landmark guidance removed (w/o LMK), and the complete LaFIn. From the numbers reported in Table 3, both the LSTA and the LMK help the task of face inpainting. Specifically, the LSTA has more influence than the landmark guidance in the cases with relatively small corruptions. This phenomenon is reasonable because the completed part should pay more attention to the attribute consistency to make the results visually consistent with the observed (large) area. For the cases with relatively large masks, in contrast, the attribute consistency is barely violated in the generated result, as there is little information given to match. Instead, the landmark information is more important to ensure that the structure is well preserved. The above corroborates the principle of our design, namely that the LSTA is for the attribute consistency and the landmarks are for the main structure. To view the effect of landmarks, Figure 8 shows the inpainting results based on different landmark templates. By varying the templates (mouth), the completed faces change accordingly, with much better visual quality than the one obtained without any landmark information. This experiment also shows that editing faces is viable by manipulating the landmark template. Certainly, operating on sparse landmarks is more convenient than modifying parsed regions (together with landmarks).
Most data-driven approaches, if not all, require well-labeled data, which is time consuming and labor intensive to obtain. In line with the original motivation of GANs, one can attempt to produce more samples for training networks. Specifically for facial landmark detectors/predictors, one wants to generate diverse plausible faces given the ground-truth landmarks. Intrinsically, this is what our work offers. For an image with ground-truth landmarks, we are able to obtain augmented data by corrupting the image with any mask and completing it with the inpaintor conditioned on those landmarks. By doing so, for the same image, the augmented faces vary with different masks. The discriminator makes sure that the inpainted results match the landmarks. An example is shown in Fig. 9, from which we can see that the features of the augmented face are significantly different from those of the original face with the same landmarks. Consequently, the pair of (augmented face, ground-truth landmarks) can be used for training.
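The augmentation idea can be sketched as a small loop that draws a new mask each time and reuses the same landmarks as the label. Everything named here is hypothetical: `random_block_mask`, `augment` and especially `dummy_inpaint`, which merely stands in for a trained inpaintor, and the block-mask sampling is one simple assumed scheme.

```python
import random


def random_block_mask(height, width, rng):
    """Binary mask with one random rectangle set to 1 (the region to regenerate)."""
    block_h = rng.randint(1, height)
    block_w = rng.randint(1, width)
    top = rng.randint(0, height - block_h)
    left = rng.randint(0, width - block_w)
    return [
        [1 if top <= y < top + block_h and left <= x < left + block_w else 0
         for x in range(width)]
        for y in range(height)
    ]


def augment(image, landmarks, inpaint, rng, n_variants=3):
    """Return (new_face, landmarks) pairs; `inpaint` is a hypothetical callable."""
    pairs = []
    for _ in range(n_variants):
        mask = random_block_mask(len(image), len(image[0]), rng)
        corrupted = [
            [0.0 if m else p for p, m in zip(img_row, mask_row)]
            for img_row, mask_row in zip(image, mask)
        ]
        # The label is unchanged: the new face is conditioned on the same landmarks.
        pairs.append((inpaint(corrupted, landmarks, mask), landmarks))
    return pairs


def dummy_inpaint(corrupted, landmarks, mask):
    """Stand-in for a trained inpaintor: fills the hole with a constant value."""
    return [
        [0.5 if m else p for p, m in zip(img_row, mask_row)]
        for img_row, mask_row in zip(corrupted, mask)
    ]


rng = random.Random(0)
image = [[1.0] * 4 for _ in range(4)]
landmarks = [(1.0, 1.0), (2.0, 1.0), (1.5, 3.0)]
pairs = augment(image, landmarks, dummy_inpaint, rng)
print(len(pairs))
```

With a real inpaintor in place of the stub, each pair would contain a visually different face that still agrees with the original annotation.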
To validate the effectiveness of such a data-augmentation method, we feed the augmented data into both our landmark predictor and LAB on the WFLW dataset. We notice that LAB (we use the PyTorch version downloaded from https://github.com/FunkyKoki/Look_At_Boundary_PyTorch) is carefully built for the task of facial landmark detection, while the landmark module in LaFIn is much simpler and smaller because, as previously explained, in our task the landmarks need not be that accurate as long as they can provide the main structure of faces. Therefore, our performance in NME (normalized mean error by inter-ocular factor) is inferior to that of LAB. Nevertheless, as can be seen from Table 5, the augmentation improves the performance of both LAB and our LaFIn. In addition, we also test LaFIn on the 300W dataset, and the numerical results consistently reveal the effectiveness of the augmentation. Notice that no obvious difference in inpainted results is observed between the landmark predictors trained without and with augmentation, which again verifies that our inpainting module is robust against variation in landmarks, and can produce striking results as long as a reasonable structure is provided.
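The NME metric mentioned above can be sketched as the mean point-to-point error divided by an inter-ocular distance. The indices of the two eye landmarks depend on the annotation scheme, so they are passed in as parameters here; those indices and the function name `nme` are assumptions for illustration.

```python
import math


def nme(predicted, ground_truth, left_eye_idx, right_eye_idx):
    """Normalized mean error: mean landmark distance over inter-ocular distance.

    predicted, ground_truth: lists of (x, y) tuples in the same order.
    left_eye_idx, right_eye_idx: indices of the two reference eye landmarks
    in the ground truth (dataset dependent, assumed to be known).
    """
    inter_ocular = math.dist(ground_truth[left_eye_idx], ground_truth[right_eye_idx])
    errors = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(errors) / (len(errors) * inter_ocular)


ground_truth = [(0.0, 0.0), (10.0, 0.0), (5.0, 5.0)]  # two eyes and one mouth point
predicted = [(0.0, 1.0), (10.0, 0.0), (5.0, 3.0)]     # errors of 1, 0 and 2 pixels

print(nme(predicted, ground_truth, left_eye_idx=0, right_eye_idx=1))
```

Dividing by the inter-ocular distance makes the error comparable across faces of different sizes.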
In this study, we have developed a generative network, namely LaFIn, for completing face images. The proposed LaFIn first predicts the landmarks and then accomplishes the inpainting conditioned on the predicted landmarks. Our principle is that the landmarks are neat, sufficient, and robust enough to serve as guidance for providing structural information to the face inpainting module. To ensure the attribute consistency, we designed the inpaintor to harness distant spatial context and connect temporal feature maps. Extensive experiments have been conducted to verify our claims, reveal the efficacy of our design and demonstrate its advances over state-of-the-art alternatives both qualitatively and quantitatively. Furthermore, we proposed to use our LaFIn to augment face-landmark data to relieve manual annotation in the task of landmark detection. The effectiveness of this approach has been experimentally confirmed.
Our landmark predictor is based on the MobileNet-V2. A series of bottlenecks are employed to extract the features and speed up the network. Feature maps at different stages of fusion layers are fully connected to achieve the final landmark prediction. The detailed architecture is shown in Table 6.
The discriminator is built upon the Patch-GAN architecture. Spectral normalization is applied on the convolution layers to stabilize the training process. The attention block is placed in the discriminator to adaptively treat the features. The detailed architecture is given in Table 7.
represent kernel size, output channels, stride and padding of convolution or deconvolution layers, respectively. SN refers to spectral normalization and LReLU means leaky relu with the slope set to.
The generator is based on the U-Net structure. Three encoding blocks are applied for down-sampling, followed by 7 residual blocks with dilated convolutions to enlarge receptive fields. The long-short term attention block connects the features from the last residual block and the last down-sampling block so that the features in a wider range can be better used. The shortcuts are added between corresponding encoder and decoders. The 1x1 convolutions are employed as channel attention to adjust the weights of features from shortcut and last layer. The detailed architecture is shown in Table 8.
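| Input | Layer | Kernel size | Output channels | Stride | Padding | Name |
|---|---|---|---|---|---|---|
| | Dilated Residual Block | - | 256 | - | - | - |
| | Dilated Residual Block | - | 256 | - | - | - |
| | Dilated Residual Block | - | 256 | - | - | - |
| | Dilated Residual Block | - | 256 | - | - | - |
| | Dilated Residual Block | - | 256 | - | - | - |
| | Dilated Residual Block | - | 256 | - | - | - |
| | Dilated Residual Block | - | 256 | - | - | R7 |
| E3,R7 | Short+Long Term Attention | - | 256 | - | - | |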
In the last but one section of the main paper, we validated the effectiveness of the proposed data-augmentation method. The implementation details are as follows. In our experiment, for each training pair of an image and its ground-truth landmarks in a single epoch, a pair of augmented data is generated by the inpaintor and used as additional training data. Moreover, in different epochs, the masked region changes so that various augmented images can be produced. The training settings of LaFIn are the same as mentioned above, except that the batch size shrinks to 4. The settings of LAB follow its original implementation.
In Table 1 of the main paper, our LaFIn falls behind PIC in terms of FID in the case of the center mask. First we give the definition of FID as follows:
$$\mathrm{FID} = \|\mu_x - \mu_y\|_2^2 + \mathrm{Tr}\left(\Sigma_x + \Sigma_y - 2(\Sigma_x \Sigma_y)^{1/2}\right) \qquad (8)$$
Assuming that the extracted features $x$ and $y$ follow multidimensional Gaussian distributions $\mathcal{N}(\mu_x, \Sigma_x)$ and $\mathcal{N}(\mu_y, \Sigma_y)$ respectively, the FID calculates the Fréchet distance between the two distributions, and $\mathrm{Tr}$ stands for the trace of a matrix. From Eq. (8) we can see that the FID takes both the mean and the variance of features into consideration. As Figures 11 and 12 show, in the situation of the center mask, the available information in images for inpainting is limited and our LaFIn tends to generate common but reasonable results, which decreases the performance in terms of FID, especially in the variance term. PIC, in contrast, is designed to generate pluralistic features, but some of its results are not visually satisfactory. More results comparing with CA, EC and PIC on CelebA-HQ, and with CE, GFC and GAFC on CelebA, are shown in Figure 10 to Figure 14.
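The influence of the variance term can be seen in a one-dimensional sketch, where the Fréchet distance between two Gaussians reduces to (mu_x - mu_y)^2 + var_x + var_y - 2 sqrt(var_x var_y). This is a simplified scalar case using population variances, not the multidimensional computation on Inception features behind the reported numbers; the function name is chosen for this example.

```python
import math
import statistics


def frechet_distance_1d(xs, ys):
    """Squared Frechet distance between 1-D Gaussians fitted to two samples."""
    mu_x, mu_y = statistics.fmean(xs), statistics.fmean(ys)
    var_x, var_y = statistics.pvariance(xs), statistics.pvariance(ys)
    return (mu_x - mu_y) ** 2 + var_x + var_y - 2.0 * math.sqrt(var_x * var_y)


real = [0.0, 2.0, 4.0, 6.0]    # diverse "real" features, mean 3
narrow = [2.5, 3.0, 3.0, 3.5]  # same mean, much less spread

print(frechet_distance_1d(real, real))
print(frechet_distance_1d(real, narrow))
```

Even though both samples share the same mean, the less diverse sample is penalized, mirroring the explanation that common-looking completions raise the FID through the variance term.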
Image denoising and inpainting with deep neural networks. In NeurIPS, pp. 341–349.