Joint Face Completion and Super-resolution using Multi-scale Feature Relation Learning

Zhilei Liu, et al., Tianjin University. 02/29/2020.

Previous research on face restoration has often focused on repairing a single type of low-quality facial image, such as low-resolution (LR) or occluded facial images. In the real world, however, both forms of image degradation frequently coexist. It is therefore important to design a model that can simultaneously repair face images that are both low-resolution and occluded. This paper proposes a multi-scale feature graph generative adversarial network (MFG-GAN) that restores images in which both degradation modes coexist and can also repair images with a single type of degradation. Building on the GAN framework, the MFG-GAN integrates graph convolution and a feature pyramid network to restore occluded low-resolution face images to non-occluded high-resolution face images. The MFG-GAN uses a set of customized losses to ensure that high-quality images are generated, and the network is designed in an end-to-end manner. Experimental results on the public-domain CelebA and Helen databases show that the proposed approach outperforms state-of-the-art methods when performing face super-resolution (up to 4x or 8x) and face completion simultaneously. Cross-database testing also revealed that the proposed approach generalizes well.


1 Introduction

Generative face restoration aims to recover valuable missing information in face images caused by factors such as low resolution (LR), occlusion, and large pose. It is the subject of extensive research in the field of face recognition, especially with the emergence of convolutional neural networks (CNNs) [28, 13] and generative adversarial networks (GANs) [7]. Many face image restoration sub-tasks have seen great breakthroughs, including face completion [25, 34], face super-resolution or hallucination [5, 16], and face frontal view synthesis [10].

Fig. 1: (a) Traditional two-stage multi-task face restoration model; (b) The proposed end-to-end face restoration model.

Although many methods have been proposed for image completion and image super-resolution reconstruction, most are designed for a single task such as face completion [18, 29] or face super-resolution [4, 5, 16, 30, 38]. These methods are therefore effective on their individual tasks but underperform when applied to multiple tasks. Both occluded and LR images can be regarded as original images to which a degradation matrix has been applied, which can be expressed as follows:

$I_{d} = M \odot I$   (1)

where $I_{d}$ is a low-quality image, $I$ is an original image, and $M$ is a degradation matrix. For occluded images, each entry of $M$ is 0 or 1; for LR images, each entry of $M$ is a real number between 0 and 1; $\odot$ stands for element-wise multiplication. Repairing both forms of degradation requires eliminating the influence of the degradation matrix $M$ on the original image $I$. For both tasks, the most direct approach is to connect two pre-trained models in series: the output of the first model is fed into the second model to repair LR occluded images. However, this approach tends to amplify the noise present in low-quality images, making the repair effect poor. [3] proposed combining two pre-trained models and modifying the training strategy to achieve image completion and image super-resolution simultaneously. Although this method has proven effective at processing multiple tasks, it cannot handle a single task on its own because it relies on two pre-trained models.
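To make this degradation model concrete, the following NumPy sketch applies Eq. (1) with a binary degradation matrix for occlusion and a fractional matrix as a stand-in for resolution loss. The function name, patch size, and mask-placement logic are illustrative assumptions, not part of the proposed method.

```python
import numpy as np

def degrade(image, mode="occlusion", patch=(32, 32), rng=None):
    """Toy illustration of Eq. (1): I_d = M (element-wise *) I.

    mode="occlusion": M is binary (0 inside a random patch, 1 elsewhere).
    mode="lr":        each entry of M is a real number in [0, 1], standing in
                      for resolution loss as described in the text.
    """
    rng = rng or np.random.default_rng(0)
    h, w = image.shape[:2]
    M = np.ones_like(image, dtype=np.float32)
    if mode == "occlusion":
        ph, pw = patch
        top = rng.integers(0, h - ph)
        left = rng.integers(0, w - pw)
        M[top:top + ph, left:left + pw] = 0.0   # binary mask: occluded pixels set to 0
    else:
        M = rng.uniform(0.0, 1.0, size=image.shape).astype(np.float32)
    return M * image                            # element-wise multiplication

# Example: a 128x128 RGB image degraded by a 32x32 occlusion.
I_gt = np.random.rand(128, 128, 3).astype(np.float32)
I_d = degrade(I_gt, mode="occlusion")
```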

In this paper, we propose an end-to-end model that can not only achieve image completion and image super-resolution simultaneously, but is also effective on each single task. The comparison of the two different models is shown in Fig. 1. Compared with the general model, our model can perform both multi-task and single-task repair. In addition, unlike existing face image restoration models, to learn the features of occluded face patches from the non-occluded patches, we propose an improved graph convolutional network (IGCN), whose structure is shown in Fig. 3. The IGCN can also improve the resolution of non-occluded face patches using other visible face patches by exploring the correlations of different facial components under different facial expressions. Building on the proposed IGCN, a region relation modeling block (RRMB) is constructed to capture facial features at different scales for face restoration, as shown in Fig. 4. Given a finer facial division, the proposed framework can represent the relations among different face patches very accurately. Because of the accurate adjacency matrix in the proposed IGCN, our model can effectively restore the feature map in deep networks using the features of the visible patches. Furthermore, to account for the relationship between features of different sizes in the face region, we incorporate the feature pyramid network (FPN) [19] into the model and use the pre-trained VGG-19 network as the feature extraction network in the FPN. Finally, we design a discriminator to enable the generator to generate realistic faces. The model includes three losses: (i) pixel loss for effective reconstruction of non-occluded high-resolution face images; (ii) adversarial loss for differentiating between real and generated face images; (iii) perceptual loss for improving the quality of the generated faces.

The main contributions of this work include the following: (i) an end-to-end model that can not only perform image completion and image super-resolution jointly but also handle each single task; (ii) unlike existing models, we exploit the relationships between different regions of the image by using graph convolution instead of conventional convolution, and, considering that different facial features have different sizes, we incorporate the FPN into the model; (iii) compared with state-of-the-art face image completion and face image super-resolution methods, the results are promising.

This work is an extension of our previous work on the IGCN [21]. The essential improvements over our previous work include the following: (i) improvement of the model and incorporation of the FPN; (ii) verification of the effectiveness of the model on single tasks; (iii) extension of the experiments to general face datasets.

2 Related Work

We will briefly describe the existing research on image restoration, as well as provide an introduction to FPN and graph convolutional networks (GCNs).

2.1 Image Restoration

Image restoration includes image super-resolution or hallucination [5, 16], image completion [18, 34], face frontal view synthesis [10], image denoising [24], image deraining [32], image dehazing [6], image deblurring [17], and shadow removal [31]. In this study, we mainly addressed the image completion and image super-resolution tasks.

In image completion, the visible area of the image is utilized to fill in the missing area. Early image completion methods typically utilized information from the pixels surrounding the occluded area to recover the missing parts. Ballester et al. [1] proposed joint interpolation of the gray values and gradient directions of the image to fill in the missing areas; however, this method is ineffective when the missing areas are large or contain pixels whose values vary significantly from those of the complete area. Hays and Efros [8] attempted to use a data-driven method to address the shortcomings of the previous approach: when no similar patch can be found in the visible parts of the image, material can be obtained from the large number of pictures available on the Internet. To improve overall efficiency, their method extracts a complete block directly from other images to fill the hole. Efros et al. [3] proposed a patch-based method that searches for relevant patches in the non-occluded areas of the image and uses them to gradually fill in the missing areas from the outside inward. Although this method improves on the earlier work, the patch search process is time-consuming. To address this problem, Barnes et al. [2] proposed a fast patch search algorithm; however, their method still cannot perform image completion in real time. With the development of deep learning, deep models began to be applied to face image restoration. Liu et al. [20] used CNNs to gradually recover lost pixels. Pathak et al. [25] proposed using CNNs to learn high-level features in images, which are then used to guide the generation of the missing parts. Recently, image restoration has attracted significant attention because the GAN [7] can generate an image from a random vector and use a discriminator to distinguish real images from generated ones [5, 29]. The conditional GAN was proposed to constrain the distribution of generated images [23]. Many methods have been developed to tackle the task of image completion; however, as discussed above, they are generally not effective at multi-task processing.

In image super-resolution, a set of LR images (or a motion sequence) is used to generate a single high-resolution image. Image super-resolution reconstruction has a wide range of applications and important prospects in areas such as the military, medicine, public safety, and computer vision. Before deep learning, learning-based super-resolution algorithms were a popular research topic. These algorithms use a large library of high-resolution (HR) images to build a learning model; to repair LR images, they exploit the prior knowledge obtained from the learning model to recover the high-frequency details of the image and achieve a better restoration effect.

Deep learning was first applied to image super-resolution by Dong et al. [27], who proposed the super-resolution CNN (SRCNN). The main steps of the algorithm are as follows: 1) a coarse HR image is first obtained from the LR image by interpolation; 2) the final HR image is then obtained through neural network feature extraction, non-linear feature mapping, and reconstruction. Compared with traditional methods, this approach greatly improved the reconstruction of HR images. Kim et al. [11] proposed the deeply-recursive convolutional network (DRCN), whose network is deeper than that of the SRCNN. Although CNNs achieved some breakthroughs in HR image reconstruction, they also have certain limitations: when the resolution of the input is too low, CNNs often cannot repair it effectively. Huang et al. [9] proposed a wavelet-based CNN method, Wavelet-SRNet, to process extremely LR pictures. The wavelet transform can describe the context and texture information of the image at different levels, so that the repaired picture is closer to the real one. Other methods that repair extremely LR images by modifying the network structure have also been proposed. In 2017, Ledig et al. proposed the super-resolution GAN (SRGAN) [16], pioneering the use of GANs to solve the super-resolution problem. They noted that when the mean square error is used as the loss function during training, the recovered images usually lose high-frequency information, which degrades the visual quality. The SRGAN therefore uses perceptual loss and adversarial loss to improve the realism of the recovered images. Although the recovered images have lower peak signal-to-noise ratio (PSNR) values, they exhibit realistic visual effects. However, the SRGAN is likewise only applicable to the single task of image super-resolution.

2.2 Feature Pyramid Networks

In 2017, Lin et al. [19] proposed the FPN, a multi-scale object detection method. Most earlier object detection algorithms utilize only top-level features for prediction. Lower-level features carry relatively little semantic information, but their target positions are accurate; high-level features carry rich semantic information, but their target positions are imprecise. In addition, although some algorithms use multi-scale feature fusion, they generally make predictions only from the fused features. The uniqueness of the FPN is that prediction is performed independently at different feature layers. The effectiveness of the FPN in object recognition and behavior recognition has also popularized it in the field of image restoration. In image super-resolution, [14] proposed the deep Laplacian pyramid super-resolution network for fast and accurate image super-resolution; this method greatly reduced the number of parameters and saved running time. This introduction of multi-level features into image restoration also inspired the use of the FPN in restoration tasks. In image completion, [35] proposed restoring the image at the feature level and image level simultaneously; the restored low-level features enable the restoration of high-level features, which proved to be effective. In this study, to better retain the information of the face features, we also used the FPN as the feature extraction component of the network.
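To make the top-down, per-level structure concrete, the following is a minimal PyTorch sketch of an FPN over three backbone feature maps. The channel sizes (64, 128, 256, roughly the first three VGG-19 blocks) and all names are illustrative assumptions, not the configuration of [19] or of this paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFPN(nn.Module):
    """Minimal top-down FPN over three backbone feature maps (c3, c4, c5)."""
    def __init__(self, in_channels=(64, 128, 256), out_channels=128):
        super().__init__()
        # 1x1 lateral convs align channel counts before fusion.
        self.laterals = nn.ModuleList([nn.Conv2d(c, out_channels, 1) for c in in_channels])
        # 3x3 convs smooth each fused map; predictions can then be made per level.
        self.smooth = nn.ModuleList([nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                     for _ in in_channels])

    def forward(self, c3, c4, c5):
        # Top-down pathway: upsample the coarser map and add the lateral connection.
        p5 = self.laterals[2](c5)
        p4 = self.laterals[1](c4) + F.interpolate(p5, scale_factor=2, mode="nearest")
        p3 = self.laterals[0](c3) + F.interpolate(p4, scale_factor=2, mode="nearest")
        return [self.smooth[0](p3), self.smooth[1](p4), self.smooth[2](p5)]
```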

2.3 Graph Convolutional Networks

Considering that there is a relationship between the region to be repaired and the visible region, we also applied the GCN [37] to the overall experimental model. GCN-based methods can be either spectral-based or spatial-based. The spectral-based approach introduces filters from the perspective of graph signal processing to define graph convolution, where the graph convolution operation is interpreted as denoising the graph signal. The spatial-based method represents graph convolution as the aggregation of feature information from the neighborhood. When the graph convolution algorithm operates at the node level, graph pooling modules and graph convolution layers can be interleaved to coarsen the graph into a higher-level substructure. Inspired by first-order graph Laplacian methods, [12] proposed the GCN: a link matrix is defined according to the overall structure of the graph, and related nodes are connected through this link matrix to generate a new feature graph. Graph convolution takes into account the relationships between features, which provides the theoretical basis for using graph convolution rather than conventional convolution. Therefore, rather than using the linear transformation of the conventional GCN, we improve the conventional GCN by combining tensor inputs with standard convolutional layers to retain facial structure information.
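For reference, one layer of the first-order GCN of [12], with vector node features and a linear transformation, can be sketched as follows (PyTorch; names are illustrative):

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One first-order GCN layer: H' = relu(A_hat @ H @ W), following [12]."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    @staticmethod
    def normalize(adj):
        # A_hat = D^{-1/2} (A + I) D^{-1/2}, with self-loops added.
        a = adj + torch.eye(adj.size(0), device=adj.device)
        d_inv_sqrt = a.sum(dim=1).pow(-0.5)
        return d_inv_sqrt.unsqueeze(1) * a * d_inv_sqrt.unsqueeze(0)

    def forward(self, h, adj):
        # h: (num_nodes, in_dim) vector node features; adj: (num_nodes, num_nodes).
        return torch.relu(self.normalize(adj) @ self.linear(h))
```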

3 Proposed Method

In this section, the proposed method is described in detail. First, the overall framework of our MFG-GAN, consisting of a generator and a discriminator, is introduced in Section 3.1. Then, in Section 3.2, we introduce the three different losses used in our model.

Fig. 2: The overall framework of our proposed MFG-GAN model. The input is a low-quality image, which can be an occluded image, an LR image, or an LR occluded image. The generator includes three graph convolution pyramid blocks (GCPBs), each of which is followed by an improved graph convolution layer; the last layer is a deconvolution layer. The discriminator consists of six improved graph convolution layers and a fully connected layer that distinguishes the generated image from the real image. In addition, both the perceptual loss from VGG-19 and the pixel-wise loss are considered.

3.1 Network Architecture

The structure of our proposed method is shown in Fig. 2; it consists of a generator that restores the entire face image and a discriminator that determines whether the generated face image is real or fake. To fully utilize the non-occluded face patches, we address the face completion and super-resolution problems simultaneously. For face completion, we model the relations between the non-occluded face patches and the occluded patches. For face super-resolution, the correlations among different non-occluded face patches are modeled. The aim is to ensure the global harmony of the generated face images. In addition, we use graph convolution instead of general convolution. Unlike general graph convolution, we propose an improved GCN (IGCN). In the conventional graph convolutional network, the node features are vectors defined on non-Euclidean data [7]. In contrast, every patch of a face image is related to the other patches and constitutes Euclidean data. To directly use face patches as nodes by modeling the relations among the different facial patches, we propose the IGCN, whose overall structure is shown in Fig. 3.

Fig. 3: Structure of the IGCN with stride 1. $A_{ij}$ represents the link between the $i$-th patch and the $j$-th patch; $A$ represents the adjacency matrix.

First, we split the feature map of the face image into face patches with a specific order ID. We use a convolutional layer to capture the representation of each face patch. Note that conventional GCNs use vectors as node features and a linear transformation layer to capture representations; unlike conventional GCNs, our proposed IGCN uses 4-D tensors of face patches as features and a convolutional layer to capture representations. The weights of the convolutional layers for all patches are shared within one layer of the IGCN. According to the symmetric adjacency matrix, every face patch feature is obtained after a summation operation. Finally, the face patch features are converted back into a feature map according to their original position IDs. Note that a deconvolutional layer can also be used in the IGCN. The adjacency matrix is pre-defined using the facial structure. The IGCN can be defined as follows:

$Z = \hat{A} \otimes \mathrm{Conv}(X)$   (2)

where $X$ is the stacked patch features, $\hat{A}$ is the normalized adjacency matrix, $\mathrm{Conv}(\cdot)$ denotes the shared convolutional layer applied to each patch, and $\otimes$ represents the tensor product. The adjacency matrix is defined by the correlations among the various facial regions: if two patches are correlated, the link between them is set to 1, and otherwise to 0. The GCN makes it possible to use the non-occluded regions to complete the occluded regions via the pre-defined relations; for instance, a non-occluded left eye can be used to restore an occluded right eye. Likewise, a non-occluded region can be used to enhance the quality of other non-occluded regions. In addition, to capture the relationship between features of different sizes, we design a graph convolution pyramid block (GCPB) to replace the general convolution layer; each GCPB is composed of an FPN and three RRMBs, as shown in Fig. 4(a).
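To make Eq. (2) and Fig. 3 concrete, the following is a minimal PyTorch sketch of one IGCN layer as described above: the feature map is split into a grid of patches in a fixed order, a shared convolution replaces the linear transformation of a standard GCN, and the pre-defined normalized adjacency matrix aggregates related patch features before the map is reassembled. The grid size, channel counts, activation, and helper names are illustrative assumptions rather than the exact configuration of our implementation.

```python
import torch
import torch.nn as nn

class IGCNLayer(nn.Module):
    """Sketch of the improved GCN: patches of a feature map act as graph nodes,
    a shared conv layer replaces the linear transform of a standard GCN, and a
    pre-defined normalized adjacency matrix mixes related patches (Eq. (2))."""
    def __init__(self, channels, grid=8):
        super().__init__()
        self.grid = grid
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)  # weights shared by all patches

    def forward(self, x, adj_norm):
        # x: (B, C, H, W); adj_norm: (grid*grid, grid*grid), pre-defined from the facial structure.
        b, c, h, w = x.shape
        g = self.grid
        ph, pw = h // g, w // g
        # Split into g*g patches in a fixed order; each node feature is a 4-D tensor.
        patches = x.unfold(2, ph, ph).unfold(3, pw, pw)           # (B, C, g, g, ph, pw)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(b, g * g, c, ph, pw)
        # Shared convolution applied to every patch.
        feats = self.conv(patches.reshape(b * g * g, c, ph, pw)).reshape(b, g * g, -1)
        # Aggregate related patches through the normalized adjacency matrix.
        mixed = torch.einsum("nm,bmf->bnf", adj_norm, feats).reshape(b, g, g, c, ph, pw)
        # Reassemble patches into a feature map at their original positions.
        out = mixed.permute(0, 3, 1, 4, 2, 5).reshape(b, c, h, w)
        return torch.relu(out)
```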

(a) Graph Convolution Pyramid Block (GCPB)
(b) Region Relation Modeling Block (RRMB)
Fig. 4: Framework of the GCPB and RRMB. (a) GCPB: the image is fed into the VGG-19 network for feature extraction; the extracted features are fed into the RRMBs for graph convolution; the convolved features are then concatenated and fed into the corresponding deconvolutional layer. (b) The RRMB consists of three IGCNs: the first splits the input into only one patch, the second splits the input into four patches, and the third splits the input into 64 patches.

We used the pre-trained VGG-19 network as the feature extraction component of the FPN, sent the features extracted by its first three blocks to the RRMBs for graph convolution, and then concatenated the resulting features into a new feature. The RRMB, shown in Fig. 4(b), is proposed for image feature representation learning. To capture features at different scales, we used three scales, obtained by splitting the input into 1 patch, 4 patches, and 64 patches, respectively. Splitting into a single patch in the IGCN is equivalent to a standard convolutional layer; this scale is designed to capture global, image-level features. The second scale, obtained by splitting into 4 patches in the IGCN, is designed to keep the features stable under flipping; this scale setting captures object-level features. The third scale, obtained by splitting into 64 patches in the IGCN, is designed to construct the relationships between spatially related patches, such as the eyes and mouth; this scale setting captures patch-level features. The features from all these scales are summed pixel-wise to obtain the final output features.
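A corresponding sketch of an RRMB, reusing the IGCNLayer sketch above and the three patch grids described in the text (1, 4, and 64 patches, i.e., 1x1, 2x2, and 8x8 grids); the class and argument names are illustrative assumptions.

```python
import torch.nn as nn

class RRMB(nn.Module):
    """Region Relation Modeling Block: three IGCN branches at 1x1, 2x2 and 8x8
    patch grids (image-, object- and patch-level), summed pixel-wise.
    Assumes the IGCNLayer sketch defined earlier is in scope."""
    def __init__(self, channels):
        super().__init__()
        self.igcn_1 = IGCNLayer(channels, grid=1)   # whole image as a single patch
        self.igcn_2 = IGCNLayer(channels, grid=2)   # 4 patches
        self.igcn_8 = IGCNLayer(channels, grid=8)   # 64 patches

    def forward(self, x, adj_1, adj_2, adj_8):
        # adj_k: pre-defined normalized adjacency matrix for the k x k grid.
        return self.igcn_1(x, adj_1) + self.igcn_2(x, adj_2) + self.igcn_8(x, adj_8)
```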

3.2 Loss Functions

Three loss functions are mainly used in our model: pixel loss, perceptual loss, and adversarial loss.

  • pixel loss: The pixel loss used in this study is defined as follows:

    $\mathcal{L}_{pixel} = \lVert \hat{I} - I \rVert_2^2$   (3)

    where $\hat{I}$ is the face image generated by the generator and $I$ is the ground-truth face image. The pixel loss constrains the generator to reconstruct the ground-truth face image.

  • perceptual loss: We used the pre-trained 19-layer VGG to compute the perceptual loss to obtain more facial details [16]. The perceptual loss is defined as follows:

    $\mathcal{L}_{perceptual} = \frac{1}{W_{i,j} H_{i,j}} \sum_{x=1}^{W_{i,j}} \sum_{y=1}^{H_{i,j}} \left( \phi_{i,j}(I)_{x,y} - \phi_{i,j}(\hat{I})_{x,y} \right)^2$   (4)

    where $\phi_{i,j}$ denotes the feature map obtained by the $j$-th convolution layer before the $i$-th max-pooling layer in the VGG-19, $W_{i,j}$ and $H_{i,j}$ are the dimensions of that feature map, $\phi_{i,j}(\hat{I})$ is the generated face's feature map, and $\phi_{i,j}(I)$ is the ground-truth face's feature map.

  • adversarial loss: The adversarial loss is used to constrain the generated face image to be closer to the real image; it is defined as follows:

    $\mathcal{L}_{adv} = \mathbb{E}\left[\log D(I)\right] + \mathbb{E}\left[\log\left(1 - D(\hat{I})\right)\right]$   (5)

    where $D$ is the discriminator that distinguishes the ground-truth face image from the generated one.

  • overall loss: The overall loss of the proposed face image restoration framework is as follows:

    $\mathcal{L} = \lambda_{pixel}\mathcal{L}_{pixel} + \lambda_{perceptual}\mathcal{L}_{perceptual} + \lambda_{adv}\mathcal{L}_{adv}$   (6)

    where $\lambda_{pixel}$, $\lambda_{perceptual}$, and $\lambda_{adv}$ are the trade-off parameters.
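The following PyTorch sketch illustrates how the weighted sum in Eq. (6) can be assembled. The names `vgg_feat` (a VGG-19 feature extractor) and `D` (a sigmoid-output discriminator) are hypothetical, and the squared-error forms follow the reconstructions above; the λ values follow Section 4.1.

```python
import torch
import torch.nn.functional as F

def generator_loss(fake, real, D, vgg_feat,
                   lam_pixel=1.0, lam_perc=0.01, lam_adv=0.0005):
    """Overall generator objective (Eq. (6)) as a weighted sum of the three losses."""
    # (3) pixel loss: per-pixel distance between generated and ground-truth faces.
    l_pixel = F.mse_loss(fake, real)
    # (4) perceptual loss: distance between VGG-19 feature maps (phi_{i,j}).
    l_perc = F.mse_loss(vgg_feat(fake), vgg_feat(real))
    # (5) adversarial loss (generator side): push D to label generated faces as real.
    d_fake = D(fake)                     # assumed to output probabilities in [0, 1]
    l_adv = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
    return lam_pixel * l_pixel + lam_perc * l_perc + lam_adv * l_adv
```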

4 Experiment

4.1 Datasets and Settings

Datasets: We performed experimental evaluations using two public-domain datasets: CelebA [22] and Helen [15]. CelebA is a large-scale face attribute dataset with 10,177 subjects and 202,599 face images. We adhered to the standard protocol, and divided the dataset into a training set (162,770 images), a validation set (19,867 images), and a test set (19,962 images). Helen is composed of 2,330 face images. Based on the standard protocol of Helen, we used 2,000 images for training and 330 images for testing. In our experiments, CelebA was used to train the network, and obtain test results. Helen was used to perform cross-validation to further verify the validity of the model.

Implementation details: For CelebA, we roughly aligned the faces to 144×144 and then randomly cropped the images to 128×128 as inputs, following [26]. For Helen, we aligned the images based on the 5-point landmarks detected by the multi-task cascaded convolutional networks (MTCNN) [36] and resized them to 128×128×3. For the multi-task experiments, the degraded input face images were produced by resizing the high-resolution face images to 32×32 and 16×16 through bicubic interpolation and adding a random binary mask whose size is one-fourth of the input size. For the face completion experiment, a mask concealing a quarter of the full image was added to the input image; for face super-resolution, we resized the high-resolution face images to 16×16 through bicubic interpolation to conduct the experiment with eight-times downsampled images. We used the Adam algorithm with an initial learning rate of to optimize the face image restoration network. The trade-off parameters were set to $\lambda_{pixel}$=1, $\lambda_{perceptual}$=0.01, and $\lambda_{adv}$=0.0005. The batch size was 24, and the kernel size was 3.
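As an illustration of the input-generation procedure described above (bicubic downsampling plus a random binary mask covering one quarter of the downsampled input), a PyTorch sketch might look as follows; the function name and mask-placement logic are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def make_multitask_input(hr, scale=4, rng=None):
    """Produce a low-quality training input from a 128x128 HR face tensor (C, H, W):
    bicubic downsampling by `scale` plus a random binary mask covering 1/4 of the
    downsampled image, following the multi-task setting in Section 4.1."""
    rng = rng or torch.Generator().manual_seed(0)
    lr = F.interpolate(hr.unsqueeze(0), scale_factor=1.0 / scale,
                       mode="bicubic", align_corners=False).squeeze(0).clamp(0, 1)
    c, h, w = lr.shape
    mh, mw = h // 2, w // 2                     # an (h/2 x w/2) block covers 1/4 of the area
    top = torch.randint(0, h - mh + 1, (1,), generator=rng).item()
    left = torch.randint(0, w - mw + 1, (1,), generator=rng).item()
    lr[:, top:top + mh, left:left + mw] = 0.0   # binary occlusion mask
    return lr                                   # e.g. 3 x 32 x 32 for scale=4
```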

Evaluation metrics: We evaluated the model mainly from the qualitative and quantitative aspects.

  • Qualitative evaluation metrics: We evaluated the images based on multi-task face super-resolution and face completion, and conducted an ablation study to observe the quality of the repair.

  • Quantitative evaluation metrics: We quantitatively evaluated the repaired images using two main metrics, the peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM) [33].

4.2 Qualitative Results

Qualitative results provide an intuitive view of the face images recovered by different methods. For joint face completion and face super-resolution, we performed two experiments on the CelebA dataset: SRFC x4, whose input is a four-times downsampled image (from 128×128 to 32×32) with 1/4 of its area occluded, and SRFC x8, whose input is an eight-times downsampled image (from 128×128 to 16×16) with 1/4 of its area occluded. The qualitative results are shown in Fig. 5(a) and Fig. 5(b).

(a) Input: 8-times downsampled image with 1/4 area of occlusion
(b) Input: 4-times downsampled image with 1/4 area of occlusion
Fig. 5: Qualitative results on multi-task experiment under different input conditions. The first row is the real picture, the second row is the input, and the last is the output.

To verify the effectiveness of the model on single tasks, we compared it with two baseline methods, bicubic interpolation and the SRCNN [27], on face image super-resolution. The comparison results are shown in Fig. 6. In addition, we also verified the robustness of the model to variations in the size of the occlusion for face completion: we set the size of the mask to 1/4, 1/8, and 1/16 of the original image and observed the results obtained for the different mask sizes, as shown in Fig. 7.

Fig. 6: Visualization of face super-resolution results on CelebA. Our super-resolution results vastly outperformed other methods in terms of visual quality.
Fig. 7: Visualization of face completion results on CelebA. From left to right are the ground truth, and the input and output results obtained for the images with the 1/4 mask, 1/8 mask, and the 1/16 mask, respectively.

It can be observed that the smaller the occluded area, the better the restored image’s quality. This also shows that the model is robust for face completion.

4.3 Quantitative Results

In addition to visual quality, we also quantified the effectiveness of the proposed approach at face completion and super-resolution using two measurements. One is the PSNR, which is widely used in image compression to measure the fidelity of the reconstructed signal w.r.t. the ground truth. The other is the SSIM [33], a perceptual measure that considers image degradation based on several types of perceptual information, such as luminance and contrast, in addition to the perceived change in structural information.
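For reference, PSNR follows directly from the mean squared error, and SSIM is most easily taken from an existing implementation; the sketch below assumes images scaled to [0, 1] and uses scikit-image's structural_similarity (version 0.19 or later for the channel_axis argument).

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def psnr(pred, gt, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_val]."""
    mse = np.mean((pred.astype(np.float64) - gt.astype(np.float64)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

# SSIM via scikit-image; channel_axis selects the color dimension for an (H, W, 3) layout.
# score = ssim(pred, gt, data_range=1.0, channel_axis=-1)
```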

SRFC | SRFC x8 | FCSR-GAN x4 | SRFC x4
PSNR | 21.19 | 22.23 | 23.49
SSIM | 0.634 | 0.657 | 0.714
TABLE I: Quantitative results for joint face completion and super-resolution. "SRFC x4" represents the four-times downsampled input image with a 1/4 area of occlusion; "SRFC x8" represents the eight-times downsampled input image with a 1/4 area of occlusion. The red type indicates the best performance.

The quantitative results for joint face completion and super-resolution are shown in Table I, where "SRFC x4" represents the four-times downsampled input image with 1/4 area of occlusion and "SRFC x8" represents the eight-times downsampled input image with 1/4 area of occlusion; it can be observed that the proposed model outperforms the FCSR-GAN on both the PSNR and SSIM indicators. We also conducted a cross-dataset validation to evaluate the generalizability of our MFG-GAN, i.e., the model was trained on CelebA but tested on Helen. Then, the model pre-trained on CelebA was fine-tuned and tested on Helen. Throughout, the inputs were face images with 1/4 area of occlusion. The results are shown in Table II.

SRFC | train | fine-tune
PSNR | 20.772 | 22.442
SSIM | 0.639 | 0.692
TABLE II: Quantitative results for joint face completion and super-resolution on Helen. "train" indicates that the model was trained on CelebA and tested on Helen. "fine-tune" indicates that the model was trained on CelebA and fine-tuned on Helen.

Our MFG-GAN trained on CelebA achieved 20.772 dB PSNR and 0.639 SSIM on Helen; and fine-tuning it on Helen achieved 22.442 dB PSNR and 0.692 SSIM. Compared with the intra-database testing results on CelebA (23.49 dB PSNR and 0.714 SSIM), these results are quite encouraging, considering the difference of the data distributions between CelebA and Helen.

For face completion, we compared the proposed model with two baseline models, CE [25] and GFC [18]; the inputs in all instances were face images with 1/4 area of occlusion, as shown in Table III. For face super-resolution, we compared the proposed model with two baseline models, bicubic interpolation and the SRCNN [27]; the inputs in all instances were eight-times downsampled face images, as shown in Table IV.

FC | CE [25] | GFC [18] | ours
PSNR | 24.499 | 24.281 | 25.413
SSIM | 0.732 | 0.837 | 0.861
TABLE III: Quantitative results of face completion on the CelebA testing set. Red type indicates the best performance.
SR | Bicubic | SRCNN [27] | ours
PSNR | 21.049 | 21.938 | 23.307
SSIM | 0.601 | 0.632 | 0.701
TABLE IV: Quantitative results of face super-resolution on the CelebA testing set. Red type indicates the best performance.

Furthermore, we verified the robustness of the model to variations in the occlusion size, as shown in Table V, where "mask:1/4" corresponds to input face images with 1/4 area of occlusion, "mask:1/8" to input face images with 1/8 area of occlusion, and "mask:1/16" to face images with 1/16 area of occlusion.

FC | mask:1/4 | mask:1/8 | mask:1/16
PSNR | 25.413 | 28.891 | 33.417
SSIM | 0.861 | 0.922 | 0.962
TABLE V: Quantitative results of face completion on the CelebA testing set with different occlusion sizes.

It can be observed that as the size of the occlusion became smaller, the repair effect improved, which shows that the model demonstrates some robustness in face completion.

4.4 Ablation Study

The proposed MFG-GAN consists of the IGCN and the FPN. We designed three models to verify the effectiveness of both parts: M1, M2, and M3. M1 omits the FPN; M2 uses general convolution in place of graph convolution; M3 integrates both the IGCN and the FPN.

Fig. 8: Visualization of the ablation study. The first two columns are the real image and input respectively. The third column shows the output results of the model without the FPN. The fourth column shows the output results when graph convolution is replaced with conventional convolution; the last column is the output of the experimental model. It can be observed that the model combining both the IGCN and FPN achieves the best qualitative results.
SRFC | FPN | IGCN | PSNR | SSIM
M1 | ✗ | ✓ | 21.544 | 0.624
M2 | ✓ | ✗ | 22.569 | 0.687
M3 | ✓ | ✓ | 23.499 | 0.714
TABLE VI: Quantitative results of the ablation study, where "M1" is the model without the FPN, "M2" uses general convolution instead of graph convolution, and "M3" combines the IGCN and FPN. Red type indicates the best performance.

All the experiments use the same input: a four-times downsampled image (i.e., from 128×128 to 32×32) with 1/4 area of occlusion. The results are shown in Table VI. The qualitative results obtained using these three models on CelebA are shown in Fig. 8. In addition, to demonstrate the impact of the IGCN and FPN on the restored results more intuitively, we show close-up details of the results.

Fig. 9: Comparison results of face image restoration method with and without IGCN. By making it possible to learn the correlation among the various regions of the face, the IGCN made the restoration of facial features more accurate.
Fig. 10: Comparison of image restoration with and without FPN. Incorporating the FPN improved the repair results of some small facial features such as eyes and nose.

Fig. 9 and Fig. 10 show the impact of the IGCN and the FPN on the restoration results, respectively. Comparing the results in the two figures, it is apparent that because the IGCN takes into account the relationships between features, the detailed information learned in some samples is more accurate; furthermore, the extraction of multi-scale features by the FPN also remarkably improves the learning of some small-sized features such as the eyes and nose.

We also show some unrealistic repair results. As shown in Fig. 11, most of these pictures are profile faces. This may be because the correlations among the facial features cannot be fully learned due to the special nature of such data; thus, the repair is ineffective.

Fig. 11: Unrealistic restoration results for special faces. Due to the lack of facial information, the relationship between the repaired area and the complete area cannot be learned, resulting in the inability to restore the original effect.

5 Conclusion

In this paper, based on the integration of an IGCN and an FPN, we proposed a joint face completion and face super-resolution method (MFG-GAN) that leverages multi-task learning to recover non-occluded high-resolution face images from LR face images with occlusions. The experimental results on the public datasets CelebA and Helen show that the proposed model outperforms the baseline methods when simultaneously tackling the tasks of face completion and face image super-resolution. Furthermore, the proposed method proved effective in both cross-dataset and intra-dataset testing. In addition, we verified the model's effectiveness on the tasks of face completion and face super-resolution individually, and achieved outstanding results. The proposed framework also has prospects for other face restoration tasks and other multi-task problems such as face recognition and facial attribute analysis.

The proposed method is mainly intended to repair occluded and LR face images. The evaluation indicators used were the most common image quality metrics (PSNR and SSIM). In future work, we will further evaluate the repair results based on the identity information of the face and attempt to preserve it as much as possible. We will also exploit prior knowledge of faces to optimize the repair results.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (No. 41806116 and No. 61503277). We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan V GPU used for this research.

References

  • [1] C. Ballester, M. Bertalmio, V. Caselles, G. Sapiro, and J. Verdera (2001) Filling-in by joint interpolation of vector fields and gray levels. IEEE transactions on image processing 10 (8), pp. 1200–1211. Cited by: §2.1.
  • [2] C. Barnes, E. Shechtman, A. Finkelstein, and D. B. Goldman (2009) PatchMatch: a randomized correspondence algorithm for structural image editing. In ACM Transactions on Graphics (ToG), Vol. 28, pp. 24. Cited by: §2.1.
  • [3] J. Cai, H. Han, S. Shan, and X. Chen (2019) FCSR-gan: joint face completion and super-resolution via multi-task learning. IEEE Transactions on Biometrics, Behavior, and Identity Science. Cited by: §1, §2.1.
  • [4] Q. Cao, L. Lin, Y. Shi, X. Liang, and G. Li (2017) Attention-aware face hallucination via deep reinforcement learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 690–698. Cited by: §1.
  • [5] Y. Chen, Y. Tai, X. Liu, C. Shen, and J. Yang (2018) Fsrnet: end-to-end learning face super-resolution with facial priors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2492–2501. Cited by: §1, §1, §2.1, §2.1.
  • [6] D. Engin, A. Genç, and H. Kemal Ekenel (2018) Cycle-dehaze: enhanced cyclegan for single image dehazing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 825–833. Cited by: §2.1.
  • [7] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §1, §2.1, §3.1.
  • [8] J. Hays and A. A. Efros (2007) Scene completion using millions of photographs. ACM Transactions on Graphics (TOG) 26 (3), pp. 4. Cited by: §2.1.
  • [9] H. Huang, R. He, Z. Sun, and T. Tan (2017) Wavelet-srnet: a wavelet-based cnn for multi-scale face super resolution. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1689–1697. Cited by: §2.1.
  • [10] R. Huang, S. Zhang, T. Li, and R. He (2017) Beyond face rotation: global and local perception gan for photorealistic and identity preserving frontal view synthesis. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2439–2448. Cited by: §1, §2.1.
  • [11] J. Kim, J. Kwon Lee, and K. Mu Lee (2016) Deeply-recursive convolutional network for image super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1637–1645. Cited by: §2.1.
  • [12] T. N. Kipf and M. Welling (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §2.3.
  • [13] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §1.
  • [14] W. Lai, J. Huang, N. Ahuja, and M. Yang (2018) Fast and accurate image super-resolution with deep laplacian pyramid networks. IEEE transactions on pattern analysis and machine intelligence. Cited by: §2.2.
  • [15] V. Le, J. Brandt, Z. Lin, L. Bourdev, and T. S. Huang (2012) Interactive facial feature localization. In European conference on computer vision, pp. 679–692. Cited by: §4.1.
  • [16] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. (2017) Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4681–4690. Cited by: §1, §1, §2.1, §2.1, 1st item.
  • [17] L. Li, J. Pan, W. Lai, C. Gao, N. Sang, and M. Yang (2018) Learning a discriminative prior for blind image deblurring. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6616–6625. Cited by: §2.1.
  • [18] Y. Li, S. Liu, J. Yang, and M. Yang (2017) Generative face completion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3911–3919. Cited by: §1, §2.1, §4.3, TABLE III.
  • [19] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2117–2125. Cited by: §1, §2.2.
  • [20] G. Liu, F. A. Reda, K. J. Shih, T. Wang, A. Tao, and B. Catanzaro (2018) Image inpainting for irregular holes using partial convolutions. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 85–100. Cited by: §2.1.
  • [21] Z. Liu, L. Li, Y. Wu, and C. Zhang (2020) Facial expression restoration based on improved graph convolutional networks. In International Conference on Multimedia Modeling, pp. 527–539. Cited by: §1.
  • [22] Z. Liu, P. Luo, X. Wang, and X. Tang (2018) Large-scale celebfaces attributes (celeba) dataset. Retrieved August 15, 2018. Cited by: §4.1.
  • [23] M. Mirza and S. Osindero (2014) Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784. Cited by: §2.1.
  • [24] N. Muhammad, N. Bibi, A. Jahangir, and Z. Mahmood (2018) Image denoising with norm weighted fusion estimators. Pattern Analysis and Applications 21 (4), pp. 1013–1022. Cited by: §2.1.
  • [25] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros (2016) Context encoders: feature learning by inpainting. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2536–2544. Cited by: §1, §2.1, §4.3, TABLE III.
  • [26] A. Radford, L. Metz, and S. Chintala (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434. Cited by: §4.1.
  • [27] C. Ren, X. He, Q. Teng, Y. Wu, and T. Q. Nguyen (2016) Single image super-resolution using local geometric duality and non-local similarity. IEEE Transactions on Image Processing 25 (5), pp. 2168–2183. Cited by: §2.1, §4.2, §4.3, TABLE IV.
  • [28] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §1.
  • [29] L. Song, J. Cao, L. Song, Y. Hu, and R. He (2018) Geometry-aware face completion and editing. arXiv preprint arXiv:1809.02967. Cited by: §1, §2.1.
  • [30] Y. Song, J. Zhang, S. He, L. Bao, and Q. Yang (2017) Learning to hallucinate face images via component generation and enhancement. arXiv preprint arXiv:1708.00223. Cited by: §1.
  • [31] J. Wang, X. Li, and J. Yang (2018) Stacked conditional generative adversarial networks for jointly learning shadow detection and shadow removal. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1788–1797. Cited by: §2.1.
  • [32] Y. Wang, X. Zhao, T. Jiang, L. Deng, Y. Chang, and T. Huang (2018) Rain streak removal for single image via kernel guided cnn. arXiv preprint arXiv:1808.08545. Cited by: §2.1.
  • [33] Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli, et al. (2004) Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13 (4), pp. 600–612. Cited by: 1st item, §4.3.
  • [34] R. A. Yeh, C. Chen, T. Yian Lim, A. G. Schwing, M. Hasegawa-Johnson, and M. N. Do (2017) Semantic image inpainting with deep generative models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5485–5493. Cited by: §1, §2.1.
  • [35] Y. Zeng, J. Fu, H. Chao, and B. Guo (2019) Learning pyramid-context encoder network for high-quality image inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1486–1494. Cited by: §2.2.
  • [36] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao (2016) Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters 23 (10), pp. 1499–1503. Cited by: §4.1.
  • [37] J. Zhou, G. Cui, Z. Zhang, C. Yang, Z. Liu, L. Wang, C. Li, and M. Sun (2018) Graph neural networks: a review of methods and applications. arXiv preprint arXiv:1812.08434. Cited by: §2.3.
  • [38] S. Zhu, S. Liu, C. C. Loy, and X. Tang (2016) Deep cascaded bi-network for face hallucination. In European conference on computer vision, pp. 614–630. Cited by: §1.