Cross-Resolution Face Recognition via Prior-Aided Face Hallucination and Residual Knowledge Distillation

05/26/2019 ∙ by Hanyang Kong, et al. ∙ National University of Singapore, Xi'an Jiaotong University, Microsoft

Recent deep learning based face recognition methods have achieved great performance, but it still remains challenging to recognize very low-resolution query faces, e.g. 28x28 pixels, captured when a CCTV camera is far from the subject. Such very low-resolution faces lack the detailed identity information of their normal-resolution counterparts in a gallery, making it hard to find the corresponding faces therein. To this end, we propose a Resolution Invariant Model (RIM) for addressing such cross-resolution face recognition problems, with three distinct novelties. First, RIM is a novel and unified deep architecture containing a Face Hallucination sub-Net (FHN) and a Heterogeneous Recognition sub-Net (HRN), which are jointly learned end to end. Second, FHN is a well-designed tri-path Generative Adversarial Network (GAN) which simultaneously perceives facial structure and geometry prior information, i.e. landmark heatmaps and parsing maps, incorporated with an unsupervised cross-domain adversarial training strategy to super-resolve a very low-resolution query image to an 8x larger one without requiring the pair to be well aligned. Third, HRN is a generic Convolutional Neural Network (CNN) for heterogeneous face recognition with our proposed residual knowledge distillation strategy for learning discriminative yet generalized feature representations. Quantitative and qualitative experiments on several benchmarks demonstrate the superiority of the proposed model over the state-of-the-arts. Codes and models will be released upon acceptance.


1 Introduction

In recent years, face recognition based on various deep learning architectures has achieved tremendous results under challenging scenarios such as variations of illumination [18], pose [32] and age [31]. However, resolution difference, especially a very large resolution gap between query and gallery images, is also a problem that needs to be solved in real-world applications. Specifically, query images are often low-resolution because of limited camera performance or a long shooting distance between the camera and the subject of interest, while the pre-enrolled face images in the database are all high-resolution. How to match very low-resolution (LR) queries with high-resolution (HR) gallery images is therefore a problem worth considering.

In this work, we focus on the problem of cross-resolution face recognition. Most of the existing solutions can be separated into two categories. One is to reconstruct HR query images from LR ones before recognition [13, 26, 28], known as the hallucination approach. Although face hallucination can generate missing facial details, it is optimized for reconstruction rather than recognition, so the hallucinated faces may not be optimal for recognition. The other category is to transform LR query images and corresponding HR gallery images into a common domain-invariant subspace [9, 27, 21], which can take full advantage of identity information to learn a discriminative representation. Nevertheless, naively learning from LR query images, especially those with very low resolution, can be problematic due to the absence of facial details, which causes the learned face recognition model to fail to extract discriminative features and to generalize poorly.

Figure 1: Cross-resolution face recognition in the wild. Our proposed RIM can learn resolution-invariant face representations and recover super-resolved faces efficiently with the aid of easy-to-collect prior estimation (i.e. landmark heatmaps and parsing maps) and an unsupervised domain adversarial training strategy. Best viewed in color.

Considering the advantages and limitations of the above methods, we propose a unified deep architecture, named Resolution Invariant Model (RIM), to super-resolve a very low-resolution probe to its 8× larger counterpart and simultaneously learn a domain-invariant feature representation between images with different resolutions. RIM takes a LR probe image and the corresponding HR gallery image as a paired input. It outputs a reconstructed super-resolution (SR) probe with the help of prior estimation, i.e. landmark heatmaps and parsing maps, and meanwhile preserves discriminative representations across different identities, which offers strong robustness to resolution variations, as illustrated in Fig. 1.

In particular, RIM consists of a Face Hallucination sub-Net (FHN) and a Heterogeneous Recognition sub-Net (HRN). FHN employs a tri-path prior-aided generator with the aid of facial geometry estimation to better reconstruct the lost high-frequency information. The three pathways focus on the inference of global structure, landmark heatmaps and parsing maps, respectively. After that, the concatenated feature maps of global structure and prior estimation are fed into a mix-adversarial discriminator to reconstruct the LR probes into SR ones, while simultaneously maximizing the Multi-Kernel Maximum Mean Discrepancy (MK-MMD) in an adversarial manner to learn feature representations invariant to the covariate shift between domains. HRN is utilized for face verification via residual knowledge distillation between the HR gallery image and the corresponding SR probe recovered from the LR one by FHN. In contrast to vanilla knowledge distillation methods [1, 2], we further introduce a teacher assistant network to compensate for the residual error between the transferred knowledge (feature maps) of the student and teacher networks. With this strategy, the final output feature map is more similar to that of the teacher network, hence guaranteeing excellent recognition accuracy for cross-resolution face recognition.

Our contributions are summarized as follows:

  • We propose a unified deep architecture to achieve super-resolution face reconstruction and cross-resolution face recognition jointly.

  • We design a novel face hallucination network that can super-resolve LR images to SR ones with the aid of prior knowledge estimation and cross-domain adversarial learning strategy.

  • We develop an effective and novel training strategy, i.e. residual knowledge distillation, for the recognition network, which can efficiently transfer knowledge between images with different resolution and generate powerful face representation.

Based on the above technical contributions, we present a high-performance cross-resolution face recognition system which obtains competitive performance compared with many state-of-the-art methods.

2 Related Work

Generic Face Recognition

Face recognition via deep learning has achieved a series of breakthroughs in recent years [23, 22, 5, 25]. For instance, DeepID [23] and DeepID2 [22] can be effectively learned by jointly performing challenging multi-class identification and verification, achieving excellent recognition performance. Wen et al. [25] propose a center loss to further enhance the capacity of discriminative feature learning. Deng et al. [5] introduce an additive angular margin loss to obtain highly discriminative features for face recognition. Generally speaking, deep learning models have achieved outstanding results on face recognition. However, these methods hardly perform satisfactorily on cross-resolution face recognition because of the absence of facial details in very low-resolution face images.

Face Hallucination

Face hallucination aims to reconstruct a HR image from a LR input, which is a domain-specific super-resolution problem. Recently, various face hallucination methods based on deep convolutional neural networks have achieved state-of-the-art performance. For example, Kim et al. [14] utilize a very deep convolutional network built by cascading many small filters to extract contextual information over HR images. Lai et al. [15] propose a Laplacian pyramid super-resolution network to reconstruct HR images based on a cascade of convolutional neural networks and the residual error between the upsampled feature maps and the ground truth HR images at each level.

Chen et al. [4] recover LR images with the help of geometry prior estimation, i.e. facial landmark heatmaps and parsing segmentation information. Although the above methods can recover LR images to make up for missing facial details, their optimization objectives are tailored for reconstructing LR images rather than recognizing faces, which consumes computing resources and limits recognition efficiency.

Knowledge Distillation

Knowledge distillation [11] is one of the most efficient methods for model compression and knowledge transfer, which aims at training a smaller network to mimic a more complex teacher network. Most knowledge distillation methods typically apply one teacher network to supervise one student network. For instance, Ashok et al. [1] use reinforcement learning to prune the student network under the guidance of the teacher network. Belagiannis et al. [2] apply an adversarial strategy to knowledge distillation, further utilizing a discriminator to measure whether the student model and the teacher model are close enough. It is worth considering, however, that there is still a certain discrepancy between the learning capacities of the teacher network and the student network. Inspired by residual representation [10], we adopt an additional teacher assistant network to learn the representation gap between the teacher network and the student network at different feature levels.

3 Resolution-Invariant Face Recognition Model

Figure 2: Resolution Invariant Model (RIM) for face recognition in the wild. The RIM contains a Face Hallucination sub-Net (FHN) and a Heterogeneous Recognition sub-Net (HRN) that jointly learn end-to-end. Best viewed in color.

3.1 Face Hallucination Sub-Net

As illustrated in Fig. 2, the Face Hallucination sub-Net (FHN) consists of a prior-aided tri-path generator and a mix-adversarial discriminator. The prior-aided tri-path generator is utilized to extract feature maps of the facial structure and prior knowledge, i.e. landmark heatmaps and parsing maps, and integrate them into concatenated feature maps. We further apply a mix-adversarial discriminator to reconstruct SR images from the concatenated feature maps and reduce the domain gap between images with different resolutions in an adversarial manner. With this iterative adversarial training strategy, FHN can learn feature representations invariant to the domain shift between images with different resolutions. We now present FHN in detail.


3.1.1 Prior-Aided Tri-Path Generator

Due to the deficiency of facial details and high-frequency information, using a LR face image as the input query severely degrades recognition accuracy. The generic scheme to solve this problem is to super-resolve LR probes to SR counterparts before recognition. Different from general approaches that typically adopt a single CNN model for reconstruction, which cannot adequately capture facial spatial information, we further apply geometry prior knowledge, i.e. facial landmark heatmaps and parsing maps, to aid the process of LR face reconstruction, as inspired by [4].

To be specific, the prior-aided tri-path generator consists of a coarse SR network $G_c$ and a tri-path generator $G_t$ with three pathways: a global feature extraction path, a landmark path and a parsing map path. Given a pair of face images with different resolutions, we first apply $G_c$ to roughly super-resolve the LR probe to a coarse HR one, which benefits subsequent feature extraction and prior estimation. Then the coarse HR probe and the corresponding real HR image are fed into the tri-path generator simultaneously to extract global image feature maps and estimate prior information, i.e. landmark heatmaps and parsing maps, respectively. Finally, these output feature maps are concatenated as the output feature maps of $G_t$.

Formally, we denote the real HR face, the corresponding LR face and the super-resolved one as $I_{HR}$, $I_{LR}$ and $I_{SR}$, respectively. After coarsely super-resolving $I_{LR}$ to a coarse HR image $I_C$, $I_C$ is fed into $G_t$ to extract the concatenated feature maps. There are two key requirements to guarantee the performance of $G_t$: 1) $G_t$ needs to learn global facial feature maps and estimate geometry priors, so as to better recover an SR face image that visually resembles the real one and maintains the original identity information and textures. 2) The data distributions of the source domain (HR images) and the target domain (LR images) should be consistent, so as to make the reconstructed image more similar to the real one.

To this end, we propose to learn the parameters $\theta_G$ of $G_t$ by minimizing the following combined loss:

$\mathcal{L}_{G} = \mathcal{L}_{adv} + \alpha \mathcal{L}_{pix} + \beta \mathcal{L}_{lm} + \gamma \mathcal{L}_{par}$   (1)

where $\mathcal{L}_{adv}$ is the unsupervised domain adversarial loss for domain adaptation between images with different resolutions, $\mathcal{L}_{pix}$ is the pixel-wise Euclidean loss for constraining the visual quality and content consistency, $\mathcal{L}_{lm}$ is the landmark heatmap loss for enforcing structural consistency, $\mathcal{L}_{par}$ is the cross-entropy loss for facial parsing segmentation, and $\alpha$, $\beta$ and $\gamma$ are weighting parameters among the different losses.

In order to enhance the visual quality of super-resolved images, we apply $\mathcal{L}_{pix}$ to constrain the consistency between the hallucinated faces and the corresponding HR ground truths by penalizing the pixel-wise Euclidean distance between them:

$\mathcal{L}_{pix} = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} \| I_{SR}(i,j) - I_{HR}(i,j) \|_2^2$   (2)

where $I_{SR}(i,j)$ and $I_{HR}(i,j)$ are the super-resolved face and the corresponding HR ground truth at pixel $(i,j)$.

The landmark heatmap loss $\mathcal{L}_{lm}$ is introduced for the landmark path to enforce the facial structural consistency of the hallucinated face. Note that rather than training the landmark path to regress the $x$ and $y$ landmark coordinates, the landmarks are represented by a set of heatmaps: each landmark point corresponds to an output channel containing a 2D Gaussian distribution centered at that landmark point, and the landmark path is trained to regress these 2D Gaussian maps (heatmaps). Accordingly, we apply the landmark heatmap loss to enforce that SR probes and the corresponding HR images yield the same heatmaps, while capturing spatial context and structural part relationships. It is defined as:

$\mathcal{L}_{lm} = \frac{1}{N} \sum_{n=1}^{N} \sum_{i,j} \left( M_n(i,j) - \hat{M}_n(i,j) \right)^2$   (3)

where $M_n(i,j)$ represents the predicted landmark heatmap of the $n$-th landmark channel at pixel $(i,j)$, $\hat{M}_n(i,j)$ represents the corresponding ground truth landmark heatmap, and $N$ denotes the number of landmark points.
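The heatmap encoding above can be sketched as follows; this is a minimal NumPy illustration (the function names and the bandwidth sigma=2.0 are illustrative choices, not taken from the paper):

```python
import numpy as np

def gaussian_heatmap(h, w, cx, cy, sigma=2.0):
    # Render one landmark as a 2D Gaussian centered at (cx, cy).
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

def landmark_heatmaps(landmarks, h, w, sigma=2.0):
    # One output channel per landmark point: an (N, H, W) stack.
    return np.stack([gaussian_heatmap(h, w, x, y, sigma) for x, y in landmarks])

def heatmap_loss(pred, gt):
    # Mean squared error over all channels and pixels, as in Eq. (3).
    return float(np.mean((pred - gt) ** 2))
```

Each channel peaks at 1 exactly at its landmark location, so the landmark path regresses dense maps rather than sparse coordinates.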

The parsing segmentation loss $\mathcal{L}_{par}$ is a pixel-wise cross-entropy loss which examines each pixel individually and compares the class prediction with the one-hot encoded label vector. The parsing segmentation loss can be calculated as:

$\mathcal{L}_{par} = -\sum_{i,j} \sum_{c} \hat{p}_c(i,j) \log p_c(i,j)$   (4)

where $p_c(i,j)$ and $\hat{p}_c(i,j)$ are the predicted probability and the one-hot label of class $c$ at pixel $(i,j)$, respectively.
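The supervised terms of the FHN objective can be sketched together. Below is a minimal NumPy version, where the adversarial term of Eq. (1) is omitted and the weights lam_pix, lam_lm and lam_par stand in for the paper's (unspecified) weighting parameters:

```python
import numpy as np

def pixel_loss(sr, hr):
    # Pixel-wise Euclidean loss between SR output and HR ground truth (Eq. 2).
    return float(np.mean((sr - hr) ** 2))

def parsing_loss(pred_probs, one_hot):
    # Pixel-wise cross-entropy (Eq. 4); pred_probs is (C, H, W) with
    # softmax already applied, one_hot is the (C, H, W) label volume.
    eps = 1e-12
    return float(-np.mean(np.sum(one_hot * np.log(pred_probs + eps), axis=0)))

def fhn_supervised_loss(sr, hr, hm_pred, hm_gt, seg_pred, seg_gt,
                        lam_pix=1.0, lam_lm=1.0, lam_par=1.0):
    # Weighted sum of the supervised FHN terms (Eqs. 2-4).
    l_lm = float(np.mean((hm_pred - hm_gt) ** 2))
    return (lam_pix * pixel_loss(sr, hr) + lam_lm * l_lm
            + lam_par * parsing_loss(seg_pred, seg_gt))
```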

3.1.2 Mix-Adversarial Discriminator

To increase the realism of hallucinated images for better face recognition, it is necessary to narrow the domain gap between images with different resolutions. Hence a mix-adversarial discriminator is introduced to integrate the concatenated feature maps into SR images, while reducing the domain discrepancy between $I_{HR}$ and $I_C$, which is roughly super-resolved by $G_c$.

Specifically, the mix-adversarial discriminator is composed of two pathways: a feature integrator path $D_f$ (with learnable parameters $\theta_f$) and a domain discriminator path $D_d$ (with learnable parameters $\theta_d$). $D_f$ is applied to reconstruct the SR probe from the concatenated features embedded with geometry prior estimation, which are extracted from $G_t$. The domain discriminator $D_d$ and the tri-path generator $G_t$ compete with each other via a domain adversarial strategy, so as to reduce the domain discrepancy between images with different resolutions. To this end, we propose to learn the parameters $\theta_f$ by minimizing the following loss:

$\mathcal{L}_{D_f} = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} \| I_{SR}(i,j) - I_{HR}(i,j) \|_2^2$   (5)

where $\mathcal{L}_{D_f}$ is a pixel-wise distance enforcing the multi-scale content consistency between the hallucinated faces and the ground truths, taking the same form as Eq. (2).

The learnable parameters $\theta_d$ can be learned by minimizing the following loss:

$\mathcal{L}_{D_d} = -\mathrm{MMD}^2(X_s, X_t)$   (6)

where $\mathrm{MMD}^2$ is the Multi-Kernel Maximum Mean Discrepancy (MK-MMD) [17], a non-parametric criterion that computes the mean squared distance between data from different domains by mapping them into a Reproducing Kernel Hilbert Space (RKHS). Formally, $\mathrm{MMD}^2$ can be represented as:

$\mathrm{MMD}^2(X_s, X_t) = \big\| \frac{1}{n}\sum_{i=1}^{n}\phi(x^s_i) - \frac{1}{m}\sum_{j=1}^{m}\phi(x^t_j) \big\|^2_{\mathcal{H}}$   (7)

where $x^s_i$ and $x^t_j$ are sampled from the source domain $X_s$ and the target domain $X_t$, respectively, and $\phi(\cdot)$ maps data samples into the RKHS. Note that in a Hilbert space the squared norm can be expressed through inner products, where $\langle \phi(x), \phi(x') \rangle_{\mathcal{H}} = k(x, x')$, so Eq. (7) can be rewritten by the kernel trick as

$\mathrm{MMD}^2(X_s, X_t) = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{i'=1}^{n} k(x^s_i, x^s_{i'}) + \frac{1}{m^2}\sum_{j=1}^{m}\sum_{j'=1}^{m} k(x^t_j, x^t_{j'}) - \frac{2}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m} k(x^s_i, x^t_j)$   (8)

where $k$ is a characteristic kernel obtained as a convex combination of several kernels $k_u$. The multi-kernel $k$ associated with the feature maps can be defined as

$k = \sum_{u=1}^{U} \beta_u k_u, \quad \beta_u \geq 0, \ \sum_{u=1}^{U} \beta_u = 1$   (9)

where each kernel $k_u$ is a Gaussian kernel, defined as $k_u(x, x') = \exp\left(-\|x - x'\|^2 / 2\sigma_u^2\right)$, and the bandwidth parameter $\sigma_u$ determines the statistical efficiency of MK-MMD.
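A compact NumPy sketch of the empirical MK-MMD of Eqs. (7)-(9), using an equal-weight convex combination of Gaussian kernels (the bandwidths in sigmas are illustrative; the paper does not specify its kernel set):

```python
import numpy as np

def gaussian_kernel(a, b, sigma):
    # k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)) for all pairs of rows.
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def mk_mmd2(xs, xt, sigmas=(1.0, 2.0, 4.0)):
    # Squared MMD between source features xs (n x d) and target
    # features xt (m x d), averaged over several kernel bandwidths.
    mmd2 = 0.0
    for s in sigmas:
        k_ss = gaussian_kernel(xs, xs, s).mean()
        k_tt = gaussian_kernel(xt, xt, s).mean()
        k_st = gaussian_kernel(xs, xt, s).mean()
        mmd2 += (k_ss + k_tt - 2.0 * k_st) / len(sigmas)
    return mmd2
```

The discriminator path would maximize this quantity while the generator minimizes it, matching the minimax formulation of the adversarial training.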

In conclusion, the domain-invariant representation learning between the tri-path prior-aided generator and the domain discriminator can be achieved by solving the following minimax problem:

$\min_{\theta_G} \max_{\theta_d} \mathrm{MMD}^2(X_s, X_t)$   (10)

3.2 Heterogeneous Recognition Sub-Net

Directly applying a state-of-the-art face recognition model only performs well when both query and gallery images are HR, while re-training the model on cross-resolution images would also degrade performance because of the distribution gap between images with different resolutions. It is therefore crucial to incorporate the information extracted from HR images into the corresponding lower-resolution ones. Hence, we apply knowledge distillation in the Heterogeneous Recognition sub-Net (HRN) to train a generic feature extractor.

Vanilla knowledge distillation methods [1, 2] train a student network to learn from a large and complex teacher network which can extract discriminative features. Considering the gap in learning capacity between the teacher model and the student model, there is no guarantee that the student network can learn enough discriminative knowledge to reproduce the relatively high performance of the teacher network. We therefore employ an additional network, named the teacher assistant network, to perform further distillation by learning the residual error between the teacher network and the student network. That is to say, the assistant network helps the student network fine-tune its output and transfer knowledge from the teacher network more thoroughly, which is consistent with the coarse-to-fine principle.

To sum up, our goal is to train a student network to reproduce the predictive capability of the teacher network. As shown in Fig. 2, we first divide the teacher network $T$ and the student network $S$ into $B$ blocks respectively and regard the output of each block as a feature map. Accordingly, to transfer knowledge (feature maps are treated as knowledge in this work) from the teacher network to the student network, the knowledge distillation process can be represented by the following loss

$\mathcal{L}_{S} = \sum_{b=1}^{B} \| F^b_T - F^b_S \|^2_2$   (11)

where $F^b_T$ and $F^b_S$ are respectively the output feature maps of the $b$-th blocks of the teacher and student networks. As shown in Eq. (11), $S$ tries to reproduce the same feature maps as $T$ so as to obtain powerful performance. Since there is still a gap between the feature maps extracted from $T$ and $S$, it is non-trivial for $S$ to capture the underlying feature maps with the help of Eq. (11) alone. To this end, we utilize an additional network, named the teacher assistant network $A$, to ease the difference between the representation capacities of $T$ and $S$. To be specific, $A$ is also divided into $B$ blocks and optimized to learn the residual errors between $T$ and $S$ at each corresponding block. Accordingly, the loss function of $A$ can be formulated as

$\mathcal{L}_{A} = \sum_{b=1}^{B} \| (F^b_T - F^b_S) - F^b_A \|^2_2$   (12)

where $F^b_T$, $F^b_S$ and $F^b_A$ are the output feature maps of the teacher, student and assistant networks at the $b$-th block, respectively. After the assistant and student networks reach their optima, the student feature map summed with the predicted residual error, i.e. $F^b_S + F^b_A$, is finally applied for inference.
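The residual distillation objectives can be sketched per block; a minimal NumPy illustration over lists of per-block feature maps (function names are illustrative, not from the paper):

```python
import numpy as np

def distill_loss(feats_t, feats_s):
    # Vanilla per-block distillation: student mimics teacher features (Eq. 11).
    return sum(float(np.mean((ft - fs) ** 2))
               for ft, fs in zip(feats_t, feats_s))

def assistant_loss(feats_t, feats_s, feats_a):
    # Assistant regresses the teacher-student residual at each block (Eq. 12).
    return sum(float(np.mean(((ft - fs) - fa) ** 2))
               for ft, fs, fa in zip(feats_t, feats_s, feats_a))

def fused_features(feats_s, feats_a):
    # At inference, the student feature plus the predicted residual is used.
    return [fs + fa for fs, fa in zip(feats_s, feats_a)]
```

When the assistant regresses the residual perfectly, the fused student feature reproduces the teacher feature exactly.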

3.3 Training and Inference

The goal of RIM is to use pairs of LR face probes and their corresponding HR counterparts to train FHN and HRN so that they mutually boost each other and jointly accomplish cross-resolution face recognition. Each separate loss acts as a unique supervisor within the nested structure to make the whole framework converge. The training process of RIM is an end-to-end procedure that can be optimized with the various loss functions via the adversarial training strategy and the back-propagation algorithm. During testing, we feed a pair of images with different resolutions into RIM to obtain their embedding features $f_1$ and $f_2$, then calculate the cosine distance between the two embeddings and compare it with a threshold to determine whether the two faces belong to the same identity.
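The verification step at test time reduces to a thresholded cosine similarity; a minimal NumPy sketch (the threshold value 0.35 is a hypothetical placeholder; in practice it is tuned on a validation set):

```python
import numpy as np

def cosine_similarity(f1, f2):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(f1, f2) / (np.linalg.norm(f1) * np.linalg.norm(f2)))

def same_identity(f_gallery, f_query, threshold=0.35):
    # Accept the pair as the same identity if similarity exceeds the threshold.
    return cosine_similarity(f_gallery, f_query) >= threshold
```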

4 Experiments

Training Dataset

We use the Helen [16] dataset as training set. The Helen dataset has 2,330 face images, each with ground truth labels of 194 landmarks and 11 parsing maps. We first perform face alignment and crop each image with MTCNN [29] according to the ground truth of 5 facial landmarks, and then flip each image horizontally as data augmentation. After pre-processing, we use 2,800 images for training RIM and another 1,100 for testing the performance of face hallucination and cross-resolution face recognition.

Testing Dataset

We evaluate the performance of RIM on face hallucination and cross-resolution face recognition over the Helen [16], LFW [12], CALFW [33] and CPLFW [20] datasets, respectively. For face hallucination, we use the Helen test set and part of the LFW dataset. The LFW dataset contains 13,233 face images of 5,749 identities; we randomly choose 1,000 aligned faces to evaluate the face hallucination performance of RIM. For cross-resolution face recognition, the input of RIM is a cross-resolution image pair consisting of a LR query and a HR gallery image. We select 500 image pairs of the same identity and 500 image pairs of different identities from the Helen test set as one face recognition test set, and randomly choose 1,500 same-identity pairs and 1,500 different-identity pairs as another test set for face hallucination and LR face recognition. Furthermore, we also apply the CALFW [33] and CPLFW [20] datasets to evaluate the performance of cross-resolution face recognition.

Implementation details

We first enlarge the LR images by bicubic interpolation to serve as the input of our model. The RMSprop solver is applied with the same initial learning rate for all sub-nets. We apply a pre-trained ResNet50-IR [5] face recognition model [30] as the teacher model and choose ResNet34 to serve as both the student network and the assistant network in HRN. Following the ResNet structure, each network is divided into 4 blocks. To compute the residual error between the teacher network and the student network, the outputs of corresponding blocks from different networks must have the same size, especially in the channel dimension. We implement all the experiments by extending the publicly available PyTorch framework [7] on a single NVIDIA TITAN Xp GPU with 12 GB memory.

4.1 Ablation Study

To clarify the role of each component in our model structure, we combine different sub-nets together and train them respectively for face recognition on LFW dataset. Various combinations and corresponding results are reported in Tab. 1.

We can draw the following conclusions. First, experiments (a), (b) and (c) use the LR query directly (without hallucination) to train the teacher network, to train the student network, and to distill knowledge from the teacher network to the student network, respectively. The results show that discriminative feature representations cannot be extracted from LR images directly. In experiments (d), (e) and (f), FHN is applied to reconstruct SR images from LR ones before face recognition. The comparative results of (d) and (g) show that using the SR image reconstructed by face hallucination and using the HR image as the input of the teacher network achieve almost the same recognition accuracy (0.997 vs. 0.998). (e) and (f) indicate that after knowledge distillation the recognition accuracy of the student network decreases to a certain extent, because the student network used for identification is much less complex than the teacher network. After introducing the teacher assistant network in HRN, the recognition accuracy is improved by 2.2% (from 0.966 to 0.988) compared with (e).


Table 1: Analysis of the role of each component. We run the experiment by combining different sub-nets as indicated by the checkmarks.

                    (a)    (b)    (c)    (d)    (e)    (f)    (g)
  Gallery-Query     HR-LR  HR-LR  HR-LR  HR-LR  HR-LR  HR-LR  HR-HR
  FHN                -      -      -      ✓      ✓      ✓      -
  HRN
    T: ResNet50-IR   ✓      -      ✓      ✓      ✓      ✓      ✓
    S: ResNet34      -      ✓      ✓      -      ✓      ✓      -
    A: ResNet34      -      -      -      -      -      ✓      -
  Acc               0.843  0.687  0.802  0.997  0.966  0.988  0.998

4.2 Comparisons with the State-of-the-Arts

4.2.1 Evaluation on Cross-Resolution Face Recognition

We compare our method with the state-of-the-art face recognition methods [5, 3, 25] to evaluate the performance of cross-resolution face recognition. ArcFace [5] proposes an additive angular margin loss to obtain highly discriminative features with a clear geometric interpretation. VGGFace2 [3] utilizes ResNet50 [10] to assess performance on face recognition. CenterFace [25] proposes a center loss which learns a center for the deep features of each class and penalizes the distances between the deep features and their corresponding class centers. We respectively utilize HR, SR and LR images as the input of these methods and compare the recognition results with our method for cross-resolution face recognition. All results are reported in Tab. 2. Note that [5, 3, 25] are all pre-trained on HR images.

Gallery-Query Method Helen[16] LFW[12] CALFW[33] CPLFW[20]
(a) HR-HR ArcFace[5] 0.998 0.998 0.952 0.921
(b) HR-HR VGGFace2[3] - 0.994 - 0.840
(c) HR-HR Centerface[25] - 0.988 - 0.775
(d) HR-SR ArcFace[5] 0.998 0.998 0.950 0.920
(e) HR-SR VGGFace2[3] - 0.990 - 0.827
(f) HR-SR Centerface[25] - 0.975 - 0.763
(g) HR-LR ArcFace[5] 0.843 0.807 0.781 0.797
(h) HR-LR RIM(Ours) 0.987 0.988 0.943 0.908
Table 2: Cross-resolution face identification evaluation on four different datasets.

From Tab. 2, we can draw the following conclusions: (1) For almost all methods above, compared with the recognition results where both inputs are HR images, using the hallucinated SR images as input hardly reduces the recognition accuracy of the models. The likely explanation is that the plentiful facial details reconstructed from LR face images make it easier for the models to extract discriminative features, which benefits recognition performance. (2) As shown in (g), directly using HR-LR image pairs as the input of ArcFace [5] dramatically decreases the recognition accuracy even though the model has a powerful capacity to extract discriminative features. This indicates the importance of face hallucination for LR images in cross-resolution face recognition tasks. (3) The performance of our proposed network in cross-resolution face recognition is very close to, and can even exceed, that of various HR-HR face recognition methods. This benefits from two aspects. For one thing, with the help of prior information and domain adaptation between LR and HR images, our network obtains high-fidelity reconstructed SR images which offer sufficient facial details. For another, instead of training a recognition network directly, we adopt residual knowledge distillation to transfer knowledge from the teacher network to the student network; to transfer knowledge more efficiently, an assistant network is incorporated to bridge the gap between the teacher network and the student network.

Method Training Data Testing (Gallery-Query)
HR-HR HR-LR
(a) Teacher Network HR 0.998 0.843
(b) Student Network LR - 0.807
(c) Staged-CNN[19] HR & LR 0.908 0.902
(d) Guided-CNN[8] HR & LR 0.974 0.938
(e) RIM(Ours) HR & LR 0.998 0.988
Table 3: Cross-resolution face verification on LFW [12].

Furthermore, we compare our proposed method with two baselines and other cross-resolution face recognition methods on the LFW dataset. [19] applies a CNN model with staged training to address this problem: a simple stage-wise training procedure that first trains the model on HR images and then artificially lowers the resolution of the training images to reduce the domain gap between images with different resolutions. [8] utilizes parallel sub-CNN models as guide and learners for cross-resolution recognition. From Tab. 3, we can observe that training the baselines only on HR or only on LR images does not perform well on cross-resolution face recognition. [19] and [8] improve the recognition accuracy by transferring knowledge from HR images to LR ones. It is worth noting that, compared with baselines evaluated under the HR-HR face verification setting, RIM achieves comparable performance under the much more challenging HR-LR setting, and outperforms the second-best HR-LR method, Guided-CNN [8], by 0.05.

4.2.2 Evaluation on Face Hallucination

Figure 3: Qualitative comparison on the Helen [16] and LFW [12] datasets. The first row is sampled from Helen and the bottom row from LFW.
Method Bicubic VDSR[14] SRCNN[6] LapSRN[15] ESRGAN[24] RIM(Ours)
PSNR 24.965 25.613 25.712 26.380 25.607 28.572
SSIM 0.629 0.730 0.713 0.786 0.783 0.882
Table 4: Quantitative evaluation on LFW [12] dataset with PSNR/SSIM criterion.
Figure 4: Cumulative Score Distribution (CSD) scores for PSNR (left) and SSIM (right) on LFW [12] datasets. Curves further to the right are of better quantitative performance. Best viewed in color.

For qualitative comparison, we compare our face hallucination performance with four state-of-the-art methods, VDSR [14], SRCNN [6], LapSRN [15] and ESRGAN [24], in Fig. 3. For a fair comparison, we train all models with their released code on the same training set. From Fig. 3, we can draw the following conclusions: (1) At a high magnification scale of 8×, bicubic interpolation cannot provide sufficient facial details. (2) VDSR and SRCNN supplement some facial details by utilizing cascaded CNNs, but fail to provide substantial texture information. (3) LapSRN and ESRGAN recover facial details and texture information better than VDSR and SRCNN, but some facial details are still lost and some noise present in the LR inputs is amplified. (4) Benefiting from prior information and the reduced distribution discrepancy between images with different resolutions, our method significantly outperforms the other methods for face hallucination.

Furthermore, we quantitatively compute the average PSNR and SSIM over the LFW dataset, as reported in Tab. 4. Our method outperforms the second-best method, LapSRN, by 2.192 dB in PSNR and 0.096 in SSIM. Furthermore, to gain better insight into the performance, we present Cumulative Score Distribution (CSD) curves for PSNR and SSIM, as illustrated in Fig. 4. We can observe an obvious gap between the quantitative results of the different face hallucination methods. VDSR and SRCNN have similar quantitative results in terms of both PSNR and SSIM, which is consistent with their visualizations. LapSRN and ESRGAN are very close in terms of SSIM-based CSD curves, but the difference becomes larger under the PSNR criterion. By comprehensive comparison, our proposed RIM is superior to the other methods for super-resolving LR faces. Both qualitative and quantitative comparisons clearly demonstrate the superiority of our proposed RIM for face hallucination.
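The PSNR figures in Tab. 4 follow the standard definition; a minimal NumPy implementation for 8-bit images:

```python
import numpy as np

def psnr(img1, img2, max_val=255.0):
    # Peak Signal-to-Noise Ratio in dB between two images of equal shape.
    mse = np.mean((img1.astype(np.float64) - img2.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```

SSIM additionally compares local luminance, contrast and structure statistics, which is why the two criteria can rank methods differently, as seen for LapSRN and ESRGAN.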

5 Conclusion

We propose a novel Resolution-Invariant Model (RIM) to address the challenging cross-resolution face recognition problem. RIM unifies a Face Hallucination sub-Net (FHN) and a Heterogeneous Recognition sub-Net (HRN) for resolution-invariant recognition in an end-to-end deep architecture. The FHN introduces a well-designed face hallucination model with the aid of geometry prior knowledge, i.e. facial landmark heatmaps and parsing maps, to super-resolve a low-resolution query to a larger one with recovered high-fidelity facial details. The HRN introduces a generic convolutional neural network with a new residual knowledge distillation strategy. Comprehensive experiments demonstrate the superiority of RIM over the state-of-the-arts.

Acknowledgement

The work of Jian Zhao was partially supported by China Scholarship Council (CSC) grant 201503170248.

The work of Junliang Xing was partially supported by the National Science Foundation of China 61672519.

The work of Jiashi Feng was partially supported by NUS IDS R-263-000-C67-646, ECRA R-263-000-C87-133 and MOE Tier-II R-263-000-D17-112.

References