Image inpainting refers to the task of filling in missing or masked regions with synthesized content. Among today's vision algorithms, deep-learning-based methods have attracted a lot of attention in image inpainting. The earliest deep learning image inpainting method was the context encoder (CE) of Pathak et al. It forces the network to infer latent feature information for the missing area from the surrounding context. However, context encoders attend only to the missing area rather than the whole image, so the generated image shows obvious patching marks at the boundary (see Fig. 1(c)). To solve this problem, Yu et al. proposed generative image inpainting with contextual attention (CA), which first generates a low-resolution image in the missing area and then refines it by searching the known area for patches similar to the unknown area with contextual attention. Zheng et al. proposed a pluralistic image completion network (PICNet) with a reconstructive path and a generative path to create multiple plausible results. However, all these methods produce discordant facial parts that are not structurally reasonable: for example, asymmetric eyebrows (see Fig. 1(d)), one eye larger than the other (see Fig. 1(e)), or two eyes of one person with different colors (see Fig. 1(e)).
One possible reason is that these general image inpainting methods focus mainly on the resolution of the generated content in the missing region and on its conformance with the external context, without considering the particular properties of the human face (e.g., symmetry and harmony between facial parts). Only a few works [11, 13] are dedicated to the task of face inpainting. These face inpainting algorithms incorporate simple face features into a generator for human face completion. However, the benefit of face region domain information has not been fully explored, which also leads to unnatural images. Face inpainting remains a challenging problem, as it requires generating semantically new pixels for the missing key components while keeping structure and appearance consistent.
In this paper, a Domain Embedded Multi-model Generative Adversarial Network (DEGNet) is proposed for face inpainting. We first embed the face region domain information (i.e., face mask, face part and landmark image) into a latent variable space with a variational auto-encoder, so that only face features lie in the latent variable space and can serve as guidance. Before generating a face image with plausible content, we combine the domain-embedded latent variable with the generator for face inpainting. Finally, our adversarial discriminators judge whether the generated face image is close to the real distribution. Experiments on two benchmark face datasets [15, 10] demonstrate that our proposed approach generates higher-quality inpainting results than the state-of-the-art methods. The main contributions of this paper are summarized as follows:
Our proposed model embeds the face region information into latent variables as guidance for face inpainting. The proposed method enables learning from complex real scenes and produces sharper and more natural faces, and thus leads to improved face structure prediction, especially for larger missing regions.
We design a new learning scheme with patch and global adversarial losses, in which the global discriminator controls the overall spatial consistency and the patch discriminator provides a more detailed face feature distribution, which helps generate photorealistic, high-quality face images.
To the best of our knowledge, our work is the first to evaluate the side face inpainting problem and, more importantly, our inpainting results on side faces show excellent visual quality and facial structure compared to the state-of-the-art methods.
2 Related Work
General Image Inpainting. Traditional diffusion-based or patch-based methods [3, 4, 31, 9] with low-level features generally assume that image holes share similar content with visible regions; thus they directly match, copy and realign background patches to complete the holes. These methods perform well for background completion, e.g., for object removal, but cannot hallucinate unique content that is not present in the input images. Barnes et al. proposed a fast nearest-neighbor field algorithm called PatchMatch (PM) for image editing applications including inpainting. It greatly reduces the search scope for patch similarity by exploiting the continuity of images. Based on the nearest-neighbor (NN) inpainting method, Whyte et al. updated the predicted region by finding the nearest image to the original image in the training data set. However, the above methods become less effective when the missing region becomes large or irregular.
Many recent image inpainting methods are based on deep learning models [6, 33, 14, 32, 20, 7, 29]. Li et al. propose a deep generative network for face completion; it consists of an encoding-decoding generator and two adversarial discriminators to synthesize the missing contents from random noise. The models in [30, 23, 32, 26, 36] can synthesize plausible contents for the missing facial key parts from random noise. Alternatively, given a trained generative model, Raymond et al. search for the closest encoding of the corrupted image in the latent image manifold, using context and prior losses, and infer the missing content with the generative model. To recover large missing areas of an image, Patricia et al. tackle the problem not only by using the available visual data but also by taking advantage of generative adversarial networks that incorporate image semantics. Although these methods can generate visually plausible image structures and textures, they usually create distorted structures or blurry textures that are inconsistent with the surrounding areas.
To reduce the blurriness common in CNN-based inpainting, two-stage methods have been proposed that refine the texture of an initially completed image [38, 35, 23]. Generally, they first fill the missing regions with a content generation network and then update the neural patches in the predicted region with fine textures from the known region. Recently, Yu et al. proposed a new deep generative model-based approach for inpainting. It not only synthesizes novel image structures but also explicitly uses surrounding image features as references to make better predictions. However, it is likely to fail when the source image does not contain enough data to fill in the unknown regions, and it does not perform well when the training images are not high-quality (non-HQ) images. Furthermore, such processing might introduce undesired content changes in the predicted region, especially when the desired content does not exist in the known region. To avoid generating such incorrect content, Xiao et al. propose a content inference and style imitation network for image inpainting. It explicitly separates the image data into a content code and a style code to generate the complete image. It performs well on structural and natural images in terms of content accuracy and texture detail, but its performance on face image inpainting has not been demonstrated. Zheng et al. present a pluralistic image completion approach that generates multiple, diverse plausible solutions for image completion. However, its performance is not stable, and it needs a sufficiently varied high-quality dataset.
Face Inpainting. Li et al. proposed a deep generative face inpainting model (GFCNet) that consists of an encoding-decoding generator, two adversarial discriminators, and a parsing network, and that synthesizes the missing contents from random noise. To ensure pixel faithfulness and local-global content consistency, they use the additional semantic parsing network to regularize the generative networks. In 2018, Liao et al. proposed a collaborative adversarial learning approach within a novel generative framework called collaborative GAN (CollaGAN). It facilitates the direct learning of multiple tasks, including face completion, landmark detection and semantic segmentation, for better semantic understanding and ultimately better face inpainting.
3 Domain Embedded Multi-model GAN
An overview of our proposed framework is shown in Fig. 2. Our goal is to use the domain information of the face part to generate high-quality face inpainting images. We first introduce our Face Domain Embedded Generator and Multi-model Discriminator in Section 3.1 and Section 3.2, respectively. Finally, the loss functions of these components are described in Section 3.3. Note that the details of the network structure (number of layers, size of feature maps, etc.) can be found in the supplementary document (see Table A1).
3.1 Face Domain Embedded Generator
Domain Embedding Network. Given a face image randomly drawn from the training set and its cropped image, our goal is to learn the most important face domain information to guide the inpainting procedure. We use face region images, which include the face mask image and the face part image, together with the face landmark image to represent the face domain information; see Fig. 3 (c, d, e).
Then, we use a VAE network [5, 19, 16] with an encoder-decoder architecture to embed the face domain information into latent variables. More specifically, in the encoding phase, the corrupted face image is first passed to two encoders, the face region encoder and the face landmark encoder, which yield normal distributions for the face region image and the face landmark image, respectively. Subsequently, one latent variable is sampled from each of these two normal distributions: one for the face region and one for the face landmarks, each drawn using the mean and variance predicted by the corresponding encoder.
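The sampling step can be illustrated with a minimal NumPy sketch. It assumes the usual reparameterization used in VAEs (the text only states that the latent variables are sampled), and the function name `sample_latent`, the log-variance parameterization, and the latent size of 256 are hypothetical choices made for illustration.

```python
import numpy as np

def sample_latent(mu, log_var, rng):
    """Draw z ~ N(mu, sigma^2) via the reparameterization z = mu + sigma * eps."""
    eps = rng.standard_normal(mu.shape)   # eps ~ N(0, I)
    sigma = np.exp(0.5 * log_var)         # log-variance -> standard deviation
    return mu + sigma * eps

rng = np.random.default_rng(0)
latent_dim = 256  # hypothetical size; not stated in the text

# Stand-ins for the outputs of the face region and face landmark encoders.
mu_region, log_var_region = np.zeros(latent_dim), np.zeros(latent_dim)
mu_landmark, log_var_landmark = np.ones(latent_dim), np.full(latent_dim, -2.0)

z_region = sample_latent(mu_region, log_var_region, rng)
z_landmark = sample_latent(mu_landmark, log_var_landmark, rng)
print(z_region.shape, z_landmark.shape)
```

Writing the sample as a deterministic function of the mean, the variance and external noise is what lets gradients reach the encoders during training.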
In the decoding phase, the two sampled latent variables are concatenated into a single latent variable. From their latent inputs, the face mask decoder and the face part decoder generate the reconstructed face mask and face part image, and the face landmark decoder generates the reconstructed face landmark image (see Fig. 2). During training, this crucial face information is embedded into the latent variable. The structures of the two encoders are symmetrical to those of the three decoders, and all of them have different weights.
Domain Embedded Generator. To reconstruct a complete and harmonious face image, we need to integrate the embedded latent variable into the face generator. We use U-Net as our generator. In the encoding phase, the cropped image is fed into the U-Net, and a latent feature of size (16, 16, 512) is taken from its middle layer. To combine the embedded latent variable with this latent feature, we resize the latent variable to (16, 16, 2) and concatenate the two along the last (channel) dimension. In the decoding phase, we generate realistic face images with deconvolution blocks that take the concatenated feature as input; see Fig. 2.
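The shape bookkeeping of this fusion step can be checked with a short NumPy sketch. It assumes that "resizing" the latent variable means reshaping a flat vector of 16 × 16 × 2 = 512 entries; random arrays stand in for the real U-Net feature and latent variable, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the U-Net bottleneck feature of one image: (height, width, channels).
bottleneck = rng.standard_normal((16, 16, 512))

# Hypothetical flat latent variable with 16 * 16 * 2 = 512 entries.
z = rng.standard_normal(512)

# Reshape the latent variable into a 2-channel map and append it to the feature channels.
z_map = z.reshape(16, 16, 2)
fused = np.concatenate([bottleneck, z_map], axis=-1)
print(fused.shape)  # (16, 16, 514)
```

Under this assumption the decoder receives a feature map with 514 channels, of which the last two carry the embedded face domain information.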
3.2 Multi-model Discriminator
In our network, the generator would produce blurry face images without a discriminator. We therefore use a global discriminator to obtain clear face images, and a patch discriminator from PatchGAN to enhance image quality. In particular, the generated face image and the real face image are fed to both discriminators, which distinguish real from fake. In DEGNet, the global discriminator guarantees, early in training, that the global structure of the generated image is spatially consistent with that of the real image. Once the overall spatial structure is consistent, the patch discriminator splits the image into patches and refines the spatial structure so that it is consistent with the real image on every patch. Besides producing realistic face images, the two discriminators enhance the robustness and generalization performance of the generator, and under their influence the domain embedding network captures richer and more accurate face information for the cropped region.
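The idea of judging an image patch by patch can be sketched as follows. This is only an illustration: it splits an image into explicit non-overlapping patches, whereas a PatchGAN discriminator is typically fully convolutional and scores overlapping receptive fields in a single pass. The function name `split_into_patches` and the 128 × 128 image with 32 × 32 patches are hypothetical.

```python
import numpy as np

def split_into_patches(image, patch):
    """Split an (H, W, C) image into non-overlapping (patch, patch, C) tiles, row by row."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must be a multiple of the patch size"
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    return grid.transpose(0, 2, 1, 3, 4).reshape(-1, patch, patch, c)

image = np.random.default_rng(0).random((128, 128, 3))
patches = split_into_patches(image, 32)
print(patches.shape)  # (16, 32, 32, 3)
```

A patch discriminator would then produce one real/fake score per tile, while the global discriminator produces a single score for the whole image.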
DEGNet Algorithm. All experiments in this paper set ,
1: Notation: learning rate; batch size; VAE parameters; U-Net parameters; global-discriminator parameters; patch-discriminator parameters; latent vector sampled from the VAE encoder; domain information about the face; patch discriminator; global discriminator
2: while and have not converged do
3: sample a batch from the cropped data
4: generate facial information from the VAE decoder
5: reconstruct images with the U-Net
6: discriminate the distribution of real or fake
7: discriminate the distribution of real or fake
8: Update by , , , , and
9: Update by , ,
10: Update and
11: end while
3.3 Loss Function
Domain Embedding Network Loss. For the corrupted face image, the VAE network is trained to reconstruct the face mask, the face part image, and the landmark image. In this work, we define three reconstruction losses (see Eq. (2)) for these three outputs, respectively:
Each of the three losses is a cross-entropy loss between a reconstructed output and its corresponding ground-truth image. The encoder can extract face domain information more accurately under the constraint of Eq. (2). To impose a domain distribution (in our case, the standard normal distribution) on the latent space, we employ a latent classifier rather than the Kullback-Leibler divergence used in standard VAEs. This technique has been shown to help the VAE capture the data manifold better and thereby learn better latent representations. It has also been widely used in VAE-based generative networks such as α-GAN. The latent classification loss is defined as follows:
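Here, the reference sample is a random variable drawn from the standard normal distribution. Equivalent losses are defined for the other latent variables, where the combined latent variable is the concatenation of the two sampled latent variables along the last channel.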
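One common way to realize such a latent classifier is sketched below. It assumes a binary cross-entropy formulation in the style of adversarial autoencoders, in which the classifier outputs the probability that a latent vector came from the standard normal prior; the exact form used in this work may differ, and both function names are hypothetical.

```python
import numpy as np

def latent_classifier_loss(p_prior, p_encoded, eps=1e-7):
    """Classifier loss: output 1 on prior samples and 0 on encoder samples."""
    p_prior = np.clip(p_prior, eps, 1 - eps)
    p_encoded = np.clip(p_encoded, eps, 1 - eps)
    return float(-np.mean(np.log(p_prior)) - np.mean(np.log(1 - p_encoded)))

def encoder_latent_loss(p_encoded, eps=1e-7):
    """Encoder loss: make encoded latents look like prior samples to the classifier."""
    p_encoded = np.clip(p_encoded, eps, 1 - eps)
    return float(-np.mean(np.log(p_encoded)))

# Hypothetical classifier outputs for two prior samples and two encoded latents.
p_prior = np.array([0.9, 0.8])
p_encoded = np.array([0.2, 0.1])
print(latent_classifier_loss(p_prior, p_encoded))
print(encoder_latent_loss(p_encoded))
```

The two losses pull in opposite directions: the classifier learns to separate encoded latents from prior samples, and the encoder is rewarded when the classifier can no longer tell them apart.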
Domain Embedded Generator Loss. To reconstruct the background information more quickly and, indirectly, to make the missing region blend better with the rest of the image, we impose the following reconstruction loss on the foreground region, which includes the face and hair regions.
Here the norm penalizes the difference between the reconstructed image and the ground-truth image. The model's ability to memorize can be further evaluated by the reconstruction accuracy of a given sample under its latent representation.
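A reconstruction loss restricted to the foreground can be sketched as below. The choice of the L1 norm, the normalization by the number of foreground values, and the name `foreground_l1_loss` are assumptions made for illustration.

```python
import numpy as np

def foreground_l1_loss(generated, target, foreground_mask):
    """Mean absolute error over pixels where foreground_mask is 1 (assumed L1 norm)."""
    mask = foreground_mask[..., None].astype(generated.dtype)  # broadcast over channels
    abs_diff = np.abs(generated - target) * mask
    count = mask.sum() * generated.shape[-1]                   # number of foreground values
    return float(abs_diff.sum() / max(count, 1.0))

target = np.zeros((4, 4, 3))
generated = np.ones((4, 4, 3))
foreground_mask = np.zeros((4, 4))
foreground_mask[:2, :2] = 1          # hypothetical 2 x 2 foreground region

generated[3, 3] = 100.0              # an error outside the foreground is ignored
print(foreground_l1_loss(generated, target, foreground_mask))  # 1.0
```

Because the mask zeroes out background pixels, errors there contribute nothing to this term.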
Multi-model Discriminator Loss. Apart from a normal discriminator (the global discriminator), a patch discriminator is introduced to discriminate the reconstructed face image against the uncorrupted face image. The global discriminator and the patch discriminator are trained adversarially against the face domain embedded generator. We adopt a global adversarial loss and a patch adversarial loss to learn a better latent representation and to encourage generated images that look realistic and natural. The loss for the face generator is therefore as follows:
The patch discriminator distinguishes whether a local patch of the image comes from a real sample or a synthesized image. It captures local statistics and drives the generator to produce locally coherent face images. The global discriminator distinguishes whether a whole image is a real sample or a synthesized image. This can significantly improve adversarial training robustness and alleviate the transferability among the members of the ensembles in both untargeted and targeted modes.
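How a global score and a grid of patch scores can enter the adversarial objective is sketched below. The sketch assumes sigmoid outputs with binary cross-entropy, the non-saturating generator loss, and equal weighting of the two terms; none of these details is confirmed by the text, and all function names are hypothetical.

```python
import numpy as np

def bce(prob, label, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and a constant label."""
    prob = np.clip(prob, eps, 1 - eps)
    return float(np.mean(-(label * np.log(prob) + (1 - label) * np.log(1 - prob))))

def discriminator_loss(prob_real, prob_fake):
    """A discriminator should output 1 on real images and 0 on generated ones."""
    return bce(prob_real, 1.0) + bce(prob_fake, 0.0)

def generator_adversarial_loss(global_prob_fake, patch_prob_fake):
    """The generator wants both discriminators to output 1 on generated images."""
    return bce(global_prob_fake, 1.0) + bce(patch_prob_fake, 1.0)

# Hypothetical outputs for one generated image: one global score and a 4 x 4 grid of patch scores.
global_prob_fake = np.array([0.5])
patch_prob_fake = np.full((1, 4, 4), 0.5)
print(generator_adversarial_loss(global_prob_fake, patch_prob_fake))
```

The same `bce` helper serves both discriminators; only the shape of the score array differs between the global and the patch case.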
Total Loss. The overall loss function of our model is defined by a weighted sum of the above loss functions:
4.1 Experimental Settings
Datasets. Our experiments are conducted on two human face datasets: 1) CelebA, the Large-scale CelebFaces Attributes Dataset; 2) CelebA-HQ, a high-quality version of the CelebA dataset. We follow the official split for training, validation and testing (details in Table 1).
Three types of criteria are used to measure the performance of different methods: 1) Peak Signal-to-Noise Ratio (PSNR), which directly measures the visibility of errors, averaged over the image; 2) Structural SIMilarity (SSIM), which measures the structural similarity of an image against a reference image; 3) Normalized Cross-Correlation (NCC), which is commonly used to evaluate the degree of similarity between two images.
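For reference, two of these metrics can be computed in a few lines of NumPy. The sketch assumes 8-bit images (peak value 255) and the zero-mean form of NCC; other NCC variants exist, and the exact variant used in the experiments is not stated. SSIM is omitted because it needs windowed statistics; a library implementation such as `skimage.metrics.structural_similarity` is usually preferable.

```python
import numpy as np

def psnr(reference, test, max_value=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    diff = reference.astype(np.float64) - test.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / mse))

def ncc(reference, test):
    """Zero-mean normalized cross-correlation in [-1, 1]."""
    a = reference.astype(np.float64).ravel()
    b = test.astype(np.float64).ravel()
    a = a - a.mean()
    b = b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

reference = np.arange(16, dtype=np.uint8).reshape(4, 4)
shifted = reference + 10            # constant brightness offset, no overflow here
print(psnr(reference, shifted))     # about 28.13 dB
print(ncc(reference, shifted))
```

The example shows how the two metrics differ: a constant brightness offset lowers PSNR but leaves the zero-mean NCC essentially at 1.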
Pre-processing. Our training dataset includes six types of images (see Figure 3): 1) the original full-face image (Figure 3(a)); 2) the cropped face image (Figure 3(b)); 3) the face mask (Figure 3(c)); 4) the face part (Figure 3(d)); 5) the landmark image (Figure 3(e)), for which we use the face_alignment detection interface to extract 68 facial landmarks from the original full-face image; and 6) the foreground mask (Figure 3(f)).
The cropped face image (Figure 3(b)) and the face mask (Figure 3(c)) are obtained by stretching the convex hull computed from the 41 landmarks of the eyes, nose and mouth. To obtain the face part, we dilate the face mask by 3% of the image width to ensure that the mask borders are slightly inside the face contours and that the eyebrows are included in the mask. The face-only part image (Figure 3(d)) is then obtained by applying the face mask to the input image. Finally, the foreground mask (Figure 3(f)) is detected using the Baidu segmentation API.
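The mask construction can be sketched with plain NumPy. The sketch assumes a square structuring element whose radius is 3% of the image width, uses five hypothetical points in place of the real 41 landmarks, and does not model the additional "stretching" of the hull; the helper names `convex_hull`, `hull_mask` and `dilate` are illustrative only.

```python
import numpy as np

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise (x, y) order."""
    pts = sorted(set(map(tuple, points)))
    if len(pts) <= 2:
        return pts
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def hull_mask(points, height, width):
    """Boolean mask of the pixels inside or on the convex hull of the (x, y) points."""
    hull = convex_hull(points)
    ys, xs = np.mgrid[0:height, 0:width]
    inside = np.ones((height, width), dtype=bool)
    for (x0, y0), (x1, y1) in zip(hull, hull[1:] + hull[:1]):
        # A pixel is inside if it lies on the left of (or on) every hull edge.
        inside &= (x1 - x0) * (ys - y0) - (y1 - y0) * (xs - x0) >= 0
    return inside

def dilate(mask, radius):
    """Binary dilation with a square structuring element of side 2 * radius + 1."""
    padded = np.pad(mask, radius, mode="constant", constant_values=False)
    out = np.zeros_like(mask)
    h, w = mask.shape
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

height = width = 100
landmarks = [(30, 40), (70, 40), (50, 55), (40, 75), (60, 75)]  # hypothetical (x, y) points
face_mask = hull_mask(landmarks, height, width)
radius = max(1, round(0.03 * width))           # 3% of the image width, in pixels
face_part_mask = dilate(face_mask, radius)
image = np.random.default_rng(0).random((height, width, 3))
face_part = image * face_part_mask[..., None]  # keep only the face part of the image
print(face_mask.sum(), face_part_mask.sum())
```

In practice, a library routine for polygon filling and morphological dilation would replace these helpers; the pure NumPy version is used here only to keep the example self-contained.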
All of these face images are coarsely cropped and resized from to . Finally, the cropped image will be resized to in our experiment.
and leave the other parameters at their Keras defaults. Our model is trained for 80 epochs with a batch size of 60 and sets , , and .
4.2 Comparison with Existing Work
We compare our algorithm against existing works in two groups: general image inpainting methods and face inpainting methods.
1) General image inpainting methods: PM, a traditional method that uses only low-level features to complete the image; Context_encoder (CE); Generative_inpainting (CA), a texture refinement method that refines an initially completed image; and Pluralistic Network (PICNet). Since PICNet can generate multiple outputs, we choose its single best result according to its discriminator scores for comparison. Both PICNet and CA require high-resolution training images in their original papers, so we report their results on the high-resolution CelebA-HQ dataset. As there is no public code available for PENNet, we compare against the best performance reported in its paper.
2) Face inpainting methods: GFCNet and CollaGAN. As there is no public code available for either method, we report the best performance given in their papers. For a fair comparison, our experiments follow their experimental settings and use the same datasets with the same training and testing splits.
4.2.1 Quantitative Comparisons.
Comparison with general image inpainting methods. As presented in Fig. 4, the first two rows of inpainting results are from CelebA, and the others are from CelebA-HQ. When the missing area differs considerably from its surroundings, the PM method does not inpaint the whole face. The CE method performs better on some frontal faces, but when the missing area contains background information in addition to the face, CE is likely to produce blurry or even distorted faces. PICNet can inpaint clear faces, but the faces are not harmonious. This is because PICNet produces clearer images by strengthening the constraint of the discriminator, which destroys the structural consistency of the image and results in distortion. In DEGNet, we produce clear and harmonious face inpainting results by keeping a balance between the reconstruction loss and the adversarial loss.
As presented in Table 2 and Table 3, our method in general achieves better performance than all the other methods in terms of SSIM, PSNR, and NCC. Our method outperforms the state of the art in both PSNR and SSIM and achieves the best generalization performance for large and varied crops. CA achieves good results only when trained on a high-resolution dataset and gives poor results when trained on a low-resolution dataset (see Figure 1(d)).
Comparison with face inpainting methods. In addition to the general image inpainting methods, we also compare with face inpainting methods. As shown in Figure 6, we inpaint faces with all kinds of cropped parts and produce clear and natural results. Because neither code nor visual results are available for these methods, we only show our own results.
As shown in Table 4, our method in general achieves better performance than all other methods in terms of SSIM and PSNR. The PSNR and SSIM values of our method are significantly higher than those of GFCNet and CollaGAN, except for O4.
4.3 Evaluation of Side Face Inpainting
Besides frontal face inpainting, we further evaluate face inpainting performance on side faces. Unlike a frontal face, a cropped side face is missing more information, and its facial features are more difficult to learn. It is therefore hard to complete face inpainting on side faces, and most existing methods fail on this task. To illustrate the problem, Table 5 and Fig. 5 show quantitative and qualitative comparisons of different methods on side face inpainting. Table 5 shows that our method outperforms the state of the art in both PSNR and SSIM on side face inpainting. From Figure 5, we find that DEGNet produces symmetric faces, such as eyes with the same size and color, while other methods produce blurry textures and asymmetric faces.
4.4 Ablation Study.
We further perform experiments to study the effect of the components of our model. We analyze how different combinations of our components, (S1) reconstruction, (S2) reconstruction + global discriminator, and (S3) reconstruction + global discriminator + patch discriminator, affect inpainting performance; the results are shown in Table 6 and Fig. 7. Based on our backbone, we further impose the latent classifier on the random vector, which results in better PSNR and SSIM. As an intermediate test, adding the global discriminator somewhat decreases the SSIM score. Moreover, on top of this, letting the global and patch discriminators act only on the missing region yields better results.
The three constraint factors do have a positive effect on performance. According to our previous analysis, the reconstruction constraints and the discriminators directly improve the performance of our backbone. We also explored how different training modes affect performance. By combining the domain embedded generator with alternating optimization of the discriminators, DEGNet generates high-quality face inpainting images.
We proposed a Domain Embedded Multi-model Generative Adversarial Network for face image inpainting. Our proposed model improves face inpainting performance by using the face region information as guidance within a multi-model GAN framework. Experimental results demonstrate that our method achieves better performance than the state-of-the-art face inpainting methods. Furthermore, our method could be easily applied to other image editing tasks.
[1] (2019) Note: https://ai.baidu.com/tech/body/seg
[2] (2009) PatchMatch: a randomized correspondence algorithm for structural image editing. In ACM Transactions on Graphics (ToG), Vol. 28, pp. 24.
[3] (2000) Image inpainting. In Proceedings of the 27th annual conference on Computer graphics and interactive techniques, pp. 417–424.
[4] (2004) Region filling and object removal by exemplar-based image inpainting. IEEE Transactions on Image Processing 13 (9), pp. 1200–1212.
[5] Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.
[6] (2016) Deep residual learning for image recognition. In , pp. 770–778.
[7] (2019) Deep fusion network for image completion. arXiv preprint arXiv:1904.08060.
[8] (2017) Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1125–1134.
[9] (2015) Annihilating filter-based low-rank hankel matrix approach for image inpainting. IEEE Transactions on Image Processing 24 (11), pp. 3498–3511.
[10] (2018) Progressive growing of GANs for improved quality, stability, and variation. In International Conference on Learning Representations (ICLR).
[11] (2017) Generative face completion. In IEEE Conference on Computer Vision and Pattern Recognition.
[12] (2017) Generative face completion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3911–3919.
[13] (2018) Face completion with semantic knowledge and collaborative adversarial learning. In Asian Conference on Computer Vision, pp. 382–397.
[14] (2018) Image inpainting for irregular holes using partial convolutions. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 85–100.
[15] (2015-12) Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV).
[16] (2018) AlphaGAN: generative adversarial networks for natural image matting. arXiv preprint arXiv:1807.10088.
[17] (2015) Adversarial autoencoders. arXiv preprint arXiv:1511.05644.
[18] (2016) Context encoders: feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2536–2544.
[19] (2016) Variational autoencoder for deep learning of images, labels and captions. In Advances in neural information processing systems, pp. 2352–2360.
[20] (2019) StructureFlow: image inpainting via structure-aware appearance flow. In Proceedings of the IEEE International Conference on Computer Vision, pp. 181–190.
[21] (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241.
[22] (2017) Variational approaches for auto-encoding generative adversarial networks. arXiv preprint arXiv:1706.04987.
[23] (2018) Contextual-based image inpainting: infer, match, and translate. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19.
[24] (2018) Semantic image inpainting through improved wasserstein generative adversarial networks. arXiv preprint arXiv:1812.01071.
[25] (2018) Facial feature point detection: a comprehensive survey. Neurocomputing 275, pp. 50–65.
[26] Image inpainting via generative multi-column convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 331–340.
[27] (2009) Get out of my picture! internet-based inpainting. In British Machine Vision Conference (BMVC), Vol. 2, pp. 5.
[28] CISI-net: explicit latent content inference and imitated style rendering for image inpainting. Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Vol. 33, pp. 354–362.
[29] (2019) Image inpainting with learnable bidirectional attention maps. In Proceedings of the IEEE International Conference on Computer Vision, pp. 8858–8867.
[30] (2019) Foreground-aware image inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5840–5848.
[31] (2010) Image inpainting by patch propagation using patch sparsity. IEEE Transactions on Image Processing 19 (5), pp. 1153–1165.
[32] Shift-net: image inpainting via deep feature rearrangement. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 1–17.
[33] (2017) High-resolution image inpainting using multi-scale neural patch synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6721–6729.
[34] (2017) Semantic image inpainting with deep generative models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5485–5493.
[35] (2018) Generative image inpainting with contextual attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5505–5514.
[36] (2019) Free-form image inpainting with gated convolution. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4471–4480.
[37] (2019-06) Learning pyramid-context encoder network for high-quality image inpainting. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[38] (2018) Semantic image inpainting with progressive generative networks. In ACM Multimedia Conference on Multimedia Conference (ACM MM), pp. 1939–1947.
[39] (2019) Pluralistic image completion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1438–1447.