Hierarchical Deep CNN Feature Set-Based Representation Learning for Robust Cross-Resolution Face Recognition

by Guangwei Gao, et al.
Nanjing University

Cross-resolution face recognition (CRFR), which is important in intelligent surveillance and biometric forensics, refers to the problem of matching a low-resolution (LR) probe face image against high-resolution (HR) gallery face images. Existing shallow learning-based and deep learning-based methods focus on mapping the HR-LR face pairs into a joint feature space where the resolution discrepancy is mitigated. However, few works consider how to extract and utilize the intermediate discriminative features from the noisy LR query faces to further mitigate the resolution discrepancy caused by resolution limitations. In this study, we fully exploit the multi-level deep convolutional neural network (CNN) feature set for robust CRFR. In particular, our contributions are threefold. (i) To learn more robust and discriminative features, we adaptively fuse the contextual features from different layers. (ii) To fully exploit these contextual features, we design a feature set-based representation learning (FSRL) scheme to collaboratively represent the hierarchical features for more accurate recognition. Moreover, FSRL utilizes the primitive form of the feature maps to preserve the latent structural information, especially in noisy cases. (iii) To further promote recognition performance, we fuse the hierarchical recognition outputs from different stages, so that the discriminability of different scales is fully integrated. Experimental results on several face datasets have verified the superiority of the presented algorithm over other competitive CRFR approaches.






I Introduction

During the past few decades, the noise-robust face recognition (FR) problem has been a vibrant topic due to increasing demands in law enforcement and biometric applications [24, 35, 7, 22, 4]. Promising performance has been achieved under controlled conditions, where the acquired face region contains sufficient discriminative information [48, 30, 5, 21, 51, 45, 58]. Nevertheless, in real surveillance scenes, the desired unambiguous high-resolution (HR) face images may not always be available because of the large distances between cameras and subjects. As a result, the captured faces are usually of low resolution (LR), with considerable variation in pose and illumination. Fig. 1(a) shows some real examples of low-resolution faces. The primary challenge is how to match an observed noisy LR probe against the HR candidates in a face image gallery. In this case, conventional feature extraction and metric learning methods cannot be directly used due to the semantic resolution discrepancy between the LR and HR image spaces.

(a) Some high-resolution and low-resolution face pairs
(b) Proposed hierarchical feature set-based representation learning (HFSRL)
Fig. 1: Significant novelties lie in (i) intermediate FSRL is exploited to mitigate the resolution discrepancy, and (ii) hierarchical predictions from different stages are fused to boost the recognition performance.

Recently, we have witnessed some advanced methods investigating the use of deep neural networks for the cross-resolution face recognition (CRFR) problem [33, 1, 26, 8, 3, 15, 43]. Most of these deep architectures explore pre-trained models or train deep architectures in a feed-forward way to extract features (see the traditional deep learning method in Fig. 1(b)). Usually, convolutional layers with various kernel sizes are applied successively to capture local salient features, and pooling layers are adopted to reduce the size of the extracted feature maps while enlarging the receptive fields. The final output of the fully connected layers is a high-dimensional vector, which is used to represent the features of LR and HR face samples for the recognition task.

Due to the characteristics of LR images, the performance of CRFR is affected by two factors: how to learn more effective feature representations, and how to exploit them for the face recognition task. Carefully designed networks can extract representative and discriminative features for the recognition task. However, previous methods do not fully study the discriminability of the learned representations across multiple latent feature extraction stages, which can provide complementary information for the final recognition. Therefore, in this paper, we propose to fully explore multi-level deep convolutional neural network (CNN) features through a set representation for CRFR (Fig. 1). First, we learn multi-scale features at different stages and utilize a simple yet efficient approach to adaptively fuse them. Then, for the resultant hierarchical features, we develop a novel feature set-based representation learning (termed FSRL) scheme to fully explore these features for more accurate recognition. In addition, based on the observation that features from different stages contain distinct information, we propose to fuse the hierarchical recognition outputs at various scales to further improve performance. Experiments demonstrate the effectiveness of the presented algorithm in various application scenarios.

We organize the rest of this paper as follows. In Section II, we introduce two categories of the relevant works, and the proposed method is presented in Section III. The experimental results and analysis are given in Section IV. Finally, we conclude this paper in Section V.

II Related Work

We briefly introduce the previous relevant works on CRFR in this section. To recognize an LR probe face with limited details, researchers have concentrated on two main approaches: super-resolution methods, which recognize faces in the synthesized HR domain, and resolution-robust mapping methods, in which face samples of different resolutions are matched in a unified feature space.

II-A Super-Resolution Reconstruction Algorithms

Super-resolution (SR) algorithms have been investigated over the past decades [53, 6]. These methods first super-resolve the desired HR face sample from the acquired LR one, and then perform similarity metric learning in the same resolution space by means of classical HR image recognition technologies. The authors of [18, 29] proposed to obtain the super-resolved face images and remove the noise simultaneously. With the help of a carefully designed representation learning strategy, an efficient face image super-resolution method was presented in [20]. To fully utilize model-based priors, a deep CNN denoiser together with a multi-layer neighbor embedding method was proposed in [19]. A component generation and enhancement method was proposed in [44]: it first obtains the basic facial structure through several parallel CNNs, and then predicts fine-grained facial structures via a component enhancement algorithm. To recover identity information when generating HR images, the authors of [56] designed a super-identity CNN model. A siamese generative adversarial network (GAN) was proposed in [14] for identity-preserving face image SR. Similarly, the authors of [11] recently designed a cascaded super-resolution framework together with identity priors to achieve superior performance. In [41], several adaptive kernel mappings were trained to predict the useful high-frequency details from the given LR input.

Fig. 2: Flowchart of the proposed feature extraction network (FEN), which is divided into four stages, each producing a feature set. The outputs of the four MSFBs are fused by a bottleneck layer, whose output forms a more discriminative visual feature of the LR and HR face images.

II-B Discriminative Feature Learning Methods

Resolution-robust algorithms adopt coupled mappings to simultaneously embed the LR inputs and the related HR pairs into a unified feature space for similarity metric learning. The main challenge of these coupled mapping methods is to design a reasonable discriminant criterion based on certain manifold assumptions. Several discriminant subspace methods have been proposed on the basis of linear discriminant analysis [39, 17, 12, 34]. The multidimensional scaling (MDS) methods [2, 32] first apply facial landmark localization to the LR inputs and then embed the LR and HR pairs into a unified metric space where their distances approximate those in the HR space. To ensure discriminability, two discriminative MDS methods were presented in [49] that take full advantage of both intra-class and inter-class distances to project the coupled LR and HR faces into a unified space where their large distance gap is mitigated. In [28, 55], multi-resolution face samples were involved simultaneously to extract resolution-invariant features for better recognition. Recently, many deep CNN-based models have been developed. For example, robust partially coupled networks were established in [46] to simultaneously achieve feature enhancement and recognition. Motivated by the pioneering work in [13], the authors of [31] applied a deep coupled residual network to embed the LR and HR face pairs into a unified space. To investigate the scale-adaptive LR recognition problem, a cascaded SR GAN framework was proposed in [47]. Aghdam et al. [1] reported a deep CNN model for LR face recognition in which various training resolutions are used for feature extraction. In [27], the authors introduced a GAN pre-training architecture to further enhance the accuracy of several deep learning-based approaches, and a semi-supervised local GAN [37] was also presented to impose a label consistency prior, showing better performance by exploiting unlabeled data.
The authors of [9] presented a two-stream CNN method based on selective knowledge distillation to identify LR faces at low computational cost. Adversarial training of deep networks has also been proposed to extract the most discriminative features from generated hard triplets [57]. Contextual information can also be incorporated into the discriminative features through hierarchically gated deep networks [38]. Feature matching between similar images by considering discriminative spatial contexts has also been studied in the literature [36]. Shu et al. [42] proposed fine-grained dictionaries to achieve better recognition accuracy, which is also related to the proposed CRFR approach.

In contrast to the existing competitive CRFR approaches, in our method, different intermediate features are learned at different stages and fused by a bottleneck layer to obtain a more discriminative feature with richer local salient context. Moreover, a feature set-based representation learning scheme is designed to collaboratively represent these extracted hierarchical features for better recognition. Meanwhile, the discriminability at different scales is integrated to further boost the recognition accuracy.

III Proposed Approach

The challenging issue in CRFR is how to extract discriminative and resolution-invariant features from the pairs of LR and HR face images. To this end, in this work, multi-level deep CNN feature sets are output from different stages to investigate the discriminative capability of the intermediate features. Additionally, a feature set-based representation learning approach is developed to mitigate the resolution discrepancy. The hierarchical recognition results calculated from the CNN feature sets of different stages are then fused to boost the recognition performance.

III-A Feature Extraction Network

Network Architecture. Fig. 2 details the flowchart of the proposed feature extraction network (FEN), which is a ResNet-like CNN [13]. The network extracts discriminative and meaningful features shared by different resolutions. The LR faces are generated as follows: we first downsample the original HR faces by a scale factor $s$, and then upsample the LR faces back to the original size by interpolation.
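As an illustration, the LR-probe simulation described above can be sketched as follows. This is a dependency-free NumPy sketch: the scale factor, average-pool downsampling, and nearest-neighbour upsampling are our assumptions, standing in for whichever interpolation the pipeline actually uses.

```python
import numpy as np

def make_lr_face(hr: np.ndarray, scale: int = 4) -> np.ndarray:
    """Simulate an LR probe: average-pool the HR face by `scale`,
    then upsample back to the original size by nearest-neighbour
    repetition (bicubic would be closer to typical pipelines, but
    this keeps the sketch dependency-free)."""
    h, w = hr.shape
    # block-average downsampling by `scale`
    lr = hr.reshape(h // scale, scale, w // scale, scale).mean(axis=(1, 3))
    # nearest-neighbour upsampling back to (h, w)
    return np.repeat(np.repeat(lr, scale, axis=0), scale, axis=1)

hr = np.random.rand(128, 128)      # a dummy single-channel HR face
lr = make_lr_face(hr, scale=4)
print(lr.shape)                    # (128, 128): same size as HR, 4x less detail
```

The LR image keeps the HR spatial size (so HR and LR branches can share one architecture) while each 4×4 block collapses to a single value.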

Each convolution layer has a 3×3 kernel with stride and padding both set to 1, while max pooling is performed with a 2×2 kernel and a stride of 2. We add a ReLU nonlinear activation after each convolution layer. The feature maps in each convolution layer have 32 channels, and a fully connected layer produces the output feature vector as the last layer.

Following [25], we use multi-scale feature extraction blocks (MSFBs) to extract face image features at various scales, as shown in Fig. 3. An MSFB uses two branches with different kernel sizes. We formulate the operation in the MSFB as follows:


where $\sigma(\cdot)$ denotes the ReLU operation and $[\cdot\,,\cdot]$ stands for concatenation. It should be noted that the input and the output of the first and second convolution layers in the MSFB possess the same number of feature maps. A 1×1 convolution layer is applied to reduce the number of feature maps to 32 in the MSFB.
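A minimal PyTorch sketch of such a two-branch block follows. The 3×3/5×5 kernel sizes and the two-convolution depth of each branch are our assumptions about the block in [25]; only the concatenate-then-1×1-reduce structure is taken from the text.

```python
import torch
import torch.nn as nn

class MSFB(nn.Module):
    """Sketch of a multi-scale feature extraction block: two parallel
    branches with different (assumed 3x3 and 5x5) kernel sizes, whose
    outputs are concatenated and reduced back to `channels` feature
    maps by a 1x1 convolution."""
    def __init__(self, channels: int = 32):
        super().__init__()
        self.branch3 = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True))
        self.branch5 = nn.Sequential(
            nn.Conv2d(channels, channels, 5, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 5, padding=2), nn.ReLU(inplace=True))
        self.reduce = nn.Conv2d(2 * channels, channels, 1)  # back to 32 maps

    def forward(self, x):
        return self.reduce(torch.cat([self.branch3(x), self.branch5(x)], dim=1))

x = torch.rand(1, 32, 28, 28)
print(MSFB(32)(x).shape)   # torch.Size([1, 32, 28, 28])
```

Both branches preserve spatial size (padding matches kernel), so the block can be stacked stage after stage without bookkeeping.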

In our experiments, we find that the output of each MSFB contains distinct features. Therefore, we explore these contextual features from the various stages. A simple yet effective feature fusion strategy is used: all output features from the foregoing MSFBs are sent to the end of the network. To adaptively fuse these contextual features, a bottleneck layer composed of a convolution layer with a 1×1 kernel is utilized.

The fusion strategy is defined as:


where $F_i$ denotes the output of the $i$-th MSFB, and the numbers (8, 4, and 2) in the parentheses denote the strides of the max pooling operations.
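The stride-8/4/2 pooling plus 1×1 bottleneck fusion described above might be sketched as follows; the stage resolutions and channel counts here are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BottleneckFusion(nn.Module):
    """Sketch of the adaptive fusion: outputs of the four MSFB stages
    are max-pooled to a common spatial size (strides 8, 4, 2 for the
    first three stages, per the text), concatenated, and mixed by a
    1x1 bottleneck convolution."""
    def __init__(self, channels: int = 32):
        super().__init__()
        self.bottleneck = nn.Conv2d(4 * channels, channels, kernel_size=1)

    def forward(self, f1, f2, f3, f4):
        # pool earlier (larger) stages down to the last stage's size
        pooled = [F.max_pool2d(f1, 8), F.max_pool2d(f2, 4), F.max_pool2d(f3, 2), f4]
        return self.bottleneck(torch.cat(pooled, dim=1))

f1 = torch.rand(1, 32, 64, 64)   # assumed per-stage resolutions
f2 = torch.rand(1, 32, 32, 32)
f3 = torch.rand(1, 32, 16, 16)
f4 = torch.rand(1, 32, 8, 8)
print(BottleneckFusion(32)(f1, f2, f3, f4).shape)  # torch.Size([1, 32, 8, 8])
```

The 1×1 convolution lets the network learn per-channel weights over all four stages, which is what makes the fusion "adaptive" rather than a fixed average.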

Training Loss. Let $h_i$ and $l_i$ denote the feature vectors extracted by the proposed FEN from the $i$-th HR face and its LR counterpart, respectively. During the training of FEN, we first seek to maximize the inter-class distance to learn discriminative identity features in the respective HR and LR feature spaces. To this end, the following softmax loss is used:

$$L_s = -\frac{1}{N}\sum_{i=1}^{N}\left(\log\frac{e^{(W^H_{y_i})^\top h_i + b^H_{y_i}}}{\sum_{j=1}^{C} e^{(W^H_j)^\top h_i + b^H_j}} + \log\frac{e^{(W^L_{y_i})^\top l_i + b^L_{y_i}}}{\sum_{j=1}^{C} e^{(W^L_j)^\top l_i + b^L_j}}\right), \quad (3)$$

where $N$ denotes the number of training sample pairs, $C$ denotes the number of object classes in the training set, $y_i$ represents the label of the $i$-th sample image, $W^H_j$ and $W^L_j$ are the $j$-th columns of the weight matrices $W^H$ and $W^L$ in the final fully connected layer, and $b^H$ and $b^L$ are the biases for the respective HR and LR feature spaces.

Meanwhile, we aim to reduce the intra-class difference between an individual face sample and the center of its identity in the feature space. The center loss [48] is written as

$$L_c = \frac{1}{2}\sum_{i=1}^{N}\left(\left\|h_i - c^H_{y_i}\right\|_2^2 + \left\|l_i - c^L_{y_i}\right\|_2^2\right), \quad (4)$$

where $c^H_{y_i}$ and $c^L_{y_i}$ are the centers of the HR and LR features corresponding to the $y_i$-th class, respectively.

As shown in Fig. 2, the critical challenge of CRFR comes from the limited discriminative detail in the observed LR face images. Fortunately, the HR training samples can be utilized to guide the extraction of discriminative features from the LR faces: for the CRFR task, the features of LR face images should be as close as possible to their HR counterparts. For simplicity, we adopt the following Euclidean loss:

$$L_e = \frac{1}{2N}\sum_{i=1}^{N}\left\|h_i - l_i\right\|_2^2. \quad (5)$$

Combining the three losses above, the overall loss of the proposed method can be written as

$$L = L_s + \lambda_1 L_c + \lambda_2 L_e, \quad (6)$$

where $\lambda_1$ and $\lambda_2$ are two balancing hyper-parameters that control the contributions of the center loss and the Euclidean loss. In this fashion, the proposed method accounts for both the discriminative and the representative ability of the learned features, making CRFR more effective in the learned feature space.
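The combined objective can be sketched in PyTorch as follows. The feature dimensions, class counts, and loss weights are illustrative assumptions, and the class centers are passed in as plain tensors rather than maintained with the center-loss update rule from [48].

```python
import torch
import torch.nn.functional as F

def hfsrl_loss(logits_h, logits_l, feats_h, feats_l, centers_h, centers_l,
               labels, lam1=0.01, lam2=0.1):
    """Sketch of the combined training loss: softmax (cross-entropy)
    in both domains, a center loss pulling features toward their class
    centers, and a Euclidean loss tying each LR feature to its HR
    counterpart. lam1/lam2 are illustrative, not the paper's values."""
    l_softmax = F.cross_entropy(logits_h, labels) + F.cross_entropy(logits_l, labels)
    l_center = 0.5 * ((feats_h - centers_h[labels]).pow(2).sum(1).mean()
                      + (feats_l - centers_l[labels]).pow(2).sum(1).mean())
    l_euclid = 0.5 * (feats_h - feats_l).pow(2).sum(1).mean()
    return l_softmax + lam1 * l_center + lam2 * l_euclid

logits_h, logits_l = torch.rand(4, 10), torch.rand(4, 10)
feats_h, feats_l = torch.rand(4, 16), torch.rand(4, 16)
centers = torch.rand(10, 16)          # one center per class and domain
labels = torch.randint(0, 10, (4,))
loss = hfsrl_loss(logits_h, logits_l, feats_h, feats_l, centers, centers, labels)
print(loss.item() > 0)                # True
```

All three terms are differentiable, so the whole network (and the centers, if made parameters) can be trained end to end.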

Fig. 3: Multi-scale feature extraction block (MSFB).

III-B Feature Set-Based Representation Learning

In previous methods, the final features extracted by the trained network (i.e., the outputs of the last fully connected layer in the aforementioned section) are usually used to train the classifiers directly for the recognition task. However, the features extracted from the MSFBs are not explored to their full potential. In this section, we elaborate on how these multi-level features can be utilized to mitigate the resolution discrepancy for better recognition performance.

Vector Set-Based Collaborative Representation. In this part, we use a vector set to represent a face image. The features extracted by FEN from an LR query face image at a specific stage are denoted as $Q = [q_1, \ldots, q_m] \in \mathbb{R}^{d \times m}$, where each column of $Q$ is a reshaped feature map, $m$ denotes the number of feature maps in a query stage, and $d$ is the size of each reshaped feature map. Denote by $X_j \in \mathbb{R}^{d \times m}$ the features extracted from the $j$-th ($j = 1, \ldots, G$) HR gallery face image at the same stage, and let $X = [X_1, \ldots, X_G] \in \mathbb{R}^{d \times n}$ be the concatenation of the features from all the HR gallery faces, where $n$ denotes the total number of the resultant feature maps.

For the query feature set $Q$, its $\ell_2$-norm regularized hull can be defined as

$$H(Q) = \left\{ Qa \;\middle|\; \textstyle\sum_{k} a_k = 1 \right\}, \quad (7)$$

where $a$ is the coefficient vector. Then, we can define the representation of the hull over the gallery feature set $X$ as follows:

$$\min_{a, b} \; \|Qa - Xb\|_2^2 + \lambda_1 \|a\|_2^2 + \lambda_2 \|b\|_2^2, \quad \text{s.t.} \; \textstyle\sum_{k} a_k = 1, \quad (8)$$

where $b$ is the representation vector, the constraint $\sum_k a_k = 1$ is used to prevent the trivial solution $a = 0, b = 0$, and $\lambda_1$ and $\lambda_2$ are hyper-parameters that balance the regularization terms on $a$ and $b$, respectively.

Either the $\ell_1$-norm or the $\ell_2$-norm could be used to constrain $a$ and $b$. For the sake of efficiency and effectiveness, we use the $\ell_2$-norm here, in which case Eq. (8) has a closed-form solution. The Lagrangian function of Eq. (8) can be written as

$$\mathcal{L}(w, \gamma) = w^\top \left( B^\top B + \Lambda \right) w + \gamma \left( e w - 1 \right), \quad (9)$$

where $e = [1, \ldots, 1, 0, \ldots, 0]$ is a row vector whose first $m$ entries (those corresponding to $a$) are all one, and

$$w = \begin{bmatrix} a \\ b \end{bmatrix}, \quad B = [\, Q, \; -X \,], \quad \Lambda = \begin{bmatrix} \lambda_1 I_m & 0 \\ 0 & \lambda_2 I_n \end{bmatrix}. \quad (10)$$

By taking the derivatives of the Lagrangian function with respect to the multiplier $\gamma$ and the decision variable $w$, and equating the results to zero, we obtain

$$2 \left( B^\top B + \Lambda \right) w + \gamma e^\top = 0, \qquad e w = 1. \quad (11)$$

Then, we can obtain the closed-form solution to Eq. (9):

$$\hat{w} = \frac{P e^\top}{e P e^\top}, \quad (12)$$

where $P = \left( B^\top B + \Lambda \right)^{-1}$.
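Assuming Eq. (8) takes the standard $\ell_2$-regularized-hull form used in set-based collaborative representation (minimize $\|Qa - Xb\|^2$ plus ridge penalties, subject to the entries of $a$ summing to one), the closed-form solution can be sketched in NumPy:

```python
import numpy as np

def vector_set_crc(Q, X, lam1=1e-2, lam2=1e-2):
    """Closed-form solver for the assumed model
        min_{a,b} ||Q a - X b||^2 + lam1 ||a||^2 + lam2 ||b||^2
        s.t. sum(a) = 1,
    i.e. the l2-regularized hull of the query set Q collaboratively
    represented over the concatenated gallery set X."""
    m, n = Q.shape[1], X.shape[1]
    B = np.hstack([Q, -X])                        # stacked dictionary [Q, -X]
    M = B.T @ B + np.diag(np.r_[np.full(m, lam1), np.full(n, lam2)])
    e = np.r_[np.ones(m), np.zeros(n)]            # selects the a-part of w
    w = np.linalg.solve(M, e)                     # w ∝ M^{-1} e^T
    w /= e @ w                                    # normalize so sum(a) = 1
    return w[:m], w[m:]                           # a, b

Q = np.random.rand(64, 5)     # query feature set: 5 maps of dimension 64
X = np.random.rand(64, 40)    # gallery feature sets, concatenated
a, b = vector_set_crc(Q, X)
print(round(a.sum(), 6))      # 1.0: the hull constraint holds
```

The cost is a single symmetric solve of size $m + n$, which is why the vector-set variant is much faster than the iterative matrix-set variant.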

Matrix Set-Based Collaborative Representation. In contrast to the previous section, where each feature map is treated as a vector, here we adopt the original matrix form of the feature maps to represent a face image. Existing works [50] have revealed that a nuclear norm constraint can be more suitable for preserving the 2D structure of a feature map. The features extracted from an LR query face image and from all the HR gallery faces at a certain stage are denoted by $\{Q_k\}_{k=1}^{m}$ and $\{X_k\}_{k=1}^{n}$, respectively.

Then, we can define the representation of the hull over the corresponding gallery feature set by


where $\|\cdot\|_*$ denotes the nuclear norm of a matrix.

For convenience, Eq. (13) can be rewritten as


The alternating direction method of multipliers (ADMM) is then adopted to solve this optimization problem, with the following augmented Lagrangian function:


where $\langle \cdot, \cdot \rangle$ denotes the inner product, the two auxiliary Lagrange multipliers are associated with the equality constraints, and $\mu$ is a positive penalty constant.

The optimal variables can then be solved for alternately. Specifically, with the other variables fixed, the first subproblem is


Thus, Eq. (16) has a closed-form solution as

Fig. 4: The proposed HFSRL scheme for the CRFR process. The FEN is used to extract discriminative feature sets. First, multi-scale features are extracted at each stage. Then, based on these hierarchical features, the FSRL scheme is designed to fully exploit the deep CNN features for more accurate recognition. Finally, the hierarchical recognition outputs are fused to further promote the recognition performance.

Once this variable is obtained, the next one is updated by solving the following minimization problem:


The closed-form solution of Eq. (18) is given as


By fixing the other parameters, the remaining subproblem can be solved via


The solution of problem (20) is given by


in which $r$ denotes the rank of the involved matrix.

Once the primal variables are obtained, the auxiliary Lagrange multipliers can be updated as


The procedure for solving Eq. (14) is summarized in Algorithm 1.

Input: The extracted feature set from an LR query face, and the concatenated feature set from all the HR gallery faces.
Output: The optimal representation vectors.
1 Parameter: The model parameters $\lambda_1$ and $\lambda_2$, and the termination condition parameter $\varepsilon$.
2 Initialize the variables and the Lagrange multipliers.
3 while not converged do
4       Update the first representation variable via Eq. (17);
5       Update the second representation variable via Eq. (19);
6       Update the auxiliary variable via Eq. (21);
7       Update the Lagrange multipliers via Eq. (22);
8       Check the termination condition against $\varepsilon$.
9 end while
Algorithm 1 Solving Eq. (14) via ADMM.
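The nuclear-norm subproblem inside such an ADMM loop is typically handled by singular value thresholding (SVT), the proximal operator of the nuclear norm; the following sketch shows that operator (our assumption about what the corresponding update computes):

```python
import numpy as np

def svt(Z, tau):
    """Singular value thresholding: the proximal operator of
    tau * ||.||_*, i.e. shrink every singular value of Z by tau
    and clip at zero. This is the standard closed form for the
    nuclear-norm step in ADMM solvers."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

Z = np.random.rand(6, 4)
Z_low = svt(Z, 0.5)
print(np.linalg.matrix_rank(Z_low) <= np.linalg.matrix_rank(Z))  # True
```

Each ADMM iteration therefore costs one SVD of the residual matrix, which is why the matrix-set variant is slower than the vector-set one (cf. the speed comparison in Section IV-D).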

III-C Hierarchical Prediction Fusion

It is well known that the features obtained from different layers contain distinct information. The features learned from the shallow layers contain the low level information such as edges and corners, while the features with rich semantics can be extracted from the deeper layers. Fully exploring the discriminative abilities of such hierarchical features is essential to the recognition tasks [54].

Suppose that we have obtained the optimal representation vectors by solving the aforementioned feature set-based representation learning problem. The gallery representation vector can be partitioned class-wise, where each sub-vector collects the coefficients corresponding to the $c$-th class. Then the regularized representation residual of the hull over each class can be computed as


The query feature set is then assigned the identity of the class with the minimal residual.
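The minimal-residual decision rule can be sketched as follows. The class-partitioned residual form is our reading of Eq. (23), and `class_index` is a hypothetical bookkeeping array mapping gallery columns to their classes.

```python
import numpy as np

def classify_by_residual(Q, X, a, b, class_index):
    """Assign the query set to the class whose sub-dictionary best
    reconstructs the hull Q a, using only that class's coefficients
    of b. class_index[j] gives the class of gallery column j."""
    hull = Q @ a
    classes = np.unique(class_index)
    residuals = [np.linalg.norm(hull - X[:, class_index == c] @ b[class_index == c])
                 for c in classes]
    return classes[int(np.argmin(residuals))]

# Toy example: class 1's gallery columns are exactly the query maps,
# so its residual is zero and it should win.
Q = np.array([[1., 0.], [0., 1.], [1., 1.], [0., 0.]])
X = np.hstack([np.ones((4, 2)), Q])          # class 0: junk; class 1: Q
class_index = np.array([0, 0, 1, 1])
a = np.array([0.5, 0.5])
b = np.array([0., 0., 0.5, 0.5])             # mass only on class 1
print(classify_by_residual(Q, X, a, b, class_index))  # 1
```

In the full method `a` and `b` come from the FSRL solver; here they are hand-set just to exercise the rule.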

Now the problem boils down to how to fuse the hierarchical outputs from the different stages (scales) to achieve better performance. Given a dataset and $s$ scales (in our model, the output of the $i$-th stage is treated as the $i$-th scale due to the use of pooling operations), a decision matrix can be defined as follows:


where $y_i$ is the real label of sample $i$, while $\tilde{y}_{i,j}$ represents the predicted label of sample $i$ on the $j$-th scale.

In order to obtain the best recognition result from the different scales, we define the following objective function:


where $w$ is the scale weight vector of length $s$ and $\eta$ is the regularization parameter. Eq. (25) can be rewritten as


The solution of problem (26) can be easily obtained by a widely used solver [23]. Once the optimal scale weights are obtained, the fused prediction is computed by weighting the predictions from all scales with the learned weights. The overall evaluation process is given in Fig. 4.
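As a simplified stand-in for the scale-weight learning step, the sketch below ridge-regresses a per-scale correctness matrix onto the all-correct target and normalizes the result; this is not the paper's exact objective or solver [23], only an illustration of weighting scales by their reliability.

```python
import numpy as np

def fuse_scale_predictions(D, y, eta=1e-2):
    """Learn non-negative scale weights w by ridge regression:
    D[i, j] = 1 if scale j classifies sample i correctly, y is the
    all-ones target. Reliable scales receive larger weight."""
    s = D.shape[1]
    w = np.linalg.solve(D.T @ D + eta * np.eye(s), D.T @ y)
    w = np.maximum(w, 0.0)        # keep weights non-negative
    return w / w.sum()            # normalize to a convex combination

D = np.array([[1, 1, 0],
              [0, 1, 1],
              [1, 1, 1],
              [0, 1, 0]], dtype=float)   # scale 1 is right on every sample
y = np.ones(4)
w = fuse_scale_predictions(D, y)
print(w.argmax())                 # 1: the always-correct scale dominates
```

At test time, the fused label would be the weighted vote of the per-scale predictions under these weights.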

IV Experimental Results and Analysis

In this section, we conduct experiments to validate the effectiveness of our model. Following previous work, we use CASIA-WebFace [52] to train our FEN. The detected faces are normalized and resized to a fixed size. In the following, we first describe the datasets and the experimental settings, and then compare our proposed approach with several competitive CRFR approaches. We implement our model in PyTorch on an NVIDIA Titan Xp GPU.

Fig. 5: Example face samples from the (a) UCCS dataset, (b) NJU-ID dataset, and (c) SCface dataset. Each column lists three images of the same identity: samples in the first row are HR, while those in the second (third) row are LR without (with) block occlusion.

IV-A Datasets and Settings

Experiments are performed on three well-known face datasets: UCCS (UnConstrained College Students) [40], NJU-ID (Nanjing University ID Card Face) [16], and SCface (Surveillance Cameras Face) [10]. Some HR-LR image pairs from these datasets are shown in Fig. 5. We detail the three datasets below.

UCCS dataset. The UCCS dataset collects face images of college students. The distance between the HR surveillance camera and the subjects is about 100 to 150 meters. The images, captured at a large standoff distance in unconstrained surveillance settings, make the recognition problem more difficult. Face images from 1,732 labeled persons are used, which exhibit blur, occlusion, and poor illumination. Following the experimental protocol in [46], we choose the top 180 subjects by number of images. In this experiment, we split the images of each subject at a ratio of 1:4 to form the probe and gallery sets. The gallery face samples are used as the HR sets, while the probe face samples are first down-sampled and then resized back to form the LR sets. Face samples of the same sizes from the CASIA-WebFace dataset are used to train the FEN.

NJU-ID dataset. The NJU-ID dataset includes face samples from 256 persons. The ID card used here is the second-generation resident ID card in China, which embeds a non-contact IC chip. Due to the storage limitations of the ID card, the stored images natively have low resolution. For each person, there is one HR image captured by a digital camera and one LR card image. All the card and camera images are resized to the same size. To make the problem more challenging, we further down-sample the ID card images to form the LR query images.

SCface dataset. The SCface dataset uses five video surveillance cameras of various qualities to collect uncontrolled indoor face images from 130 subjects, and can be regarded as a real-world LR dataset. For each person, there is one frontal mugshot face sample captured by a digital camera and 15 images (five at each distance) taken by the five surveillance cameras at three distances (1.0 m, 2.6 m, and 4.2 m). In this experiment, 50 of the 130 persons are randomly picked to fine-tune the FEN, while the rest are used for testing. The CASIA-WebFace images are taken as the HR images, while their down-sampled versions at three resolutions are taken as the LR images to train the FEN for the three distances.

IV-B Ablation Study

Fig. 6: Ablation study on effects of the feature fusion (top) and the hierarchical prediction fusion (bottom).

Fig. 6 presents the ablation study on the feature fusion and the hierarchical prediction fusion. In this part, for convenience, we use HFSRL to denote hierarchical vector set-based collaborative learning. Compared to HFSRL, HFSRL-NF removes the feature connections from the other stages. FSRL-$i$ ($i$ = 1, 2, 3, 4) indicates using the feature sets from the $i$-th stage for representation learning. From Fig. 6, we can see that HFSRL obtains better recognition accuracy than HFSRL-NF, which reveals that the feature fusion strategy is useful for recognition. The reason may be that the features from other stages carry discriminative information from the early layers to the later layers.

From Fig. 6, we can also see that the performance of different stages varies considerably. Generally, the features extracted from the lower layers perform worst, since the semantic information they reveal is limited. The features extracted from the higher layers achieve better performance because they contain more semantic information, which is essential for recognition tasks. Moreover, our fusion method obtains the best performance, which reveals that fusing the results from the latent layers brings complementary discriminative ability to the final recognition.

IV-C Competitive Results

Methods Accuracy (%) Year
SICNN [56] 66.5 2018
SiGAN [14] 67.2 2019
PCN [46] 55.4 2016
DCR [31] 70.3 2018
DAlign [34] 71.9 2019
SKD [9] 75.2 2019
Centerloss [27] 76.4 2019
HFSRL-v 79.5 -
HFSRL-m 80.8 -
TABLE I: Face recognition accuracy (%) of the respective methods on the UCCS dataset. The boldface indicates our method.
Methods Accuracy (%) Year
SICNN [56] 62.4 2018
SiGAN [14] 62.8 2019
PCN [46] 58.5 2016
DCR [31] 63.7 2018
DAlign [34] 64.5 2019
SKD [9] 67.8 2019
Centerloss [27] 68.4 2019
HFSRL-v 71.4 -
HFSRL-m 72.6 -
TABLE II: Face recognition accuracy (%) of the respective methods on the NJU-ID dataset. The boldface indicates our method.
Methods Dist 1 Dist 2 Dist 3 Year
SICNN [56] 28.3 38.2 44.5 2018
SiGAN [14] 28.8 38.7 44.8 2019
PCN [46] 26.8 38.2 43.5 2016
DCR [31] 30.3 40.5 45.3 2018
DAlign [34] 32.4 42.7 48.7 2019
SKD [9] 38.5 48.0 54.7 2019
Centerloss [27] 40.5 51.8 57.5 2019
HFSRL-v 44.2 54.3 59.5 -
HFSRL-m 45.3 55.3 60.6 -
TABLE III: Face recognition accuracy (%) of the respective methods on the SCface dataset. The boldface indicates our method.
Fig. 7: Face recognition accuracy (%) of the respective methods on the UCCS dataset with random occlusion.
Fig. 8: Face recognition accuracy (%) of the respective methods on the NJU-ID dataset with random occlusion.
Fig. 9: Face recognition accuracy (%) of the respective methods on the SCface dataset with random occlusion.

We compare our presented algorithm with two categories of advanced approaches to handling the resolution mismatch issue: super-resolution methods, such as SICNN [56] and SiGAN [14], combined with a deep learning-based recognition method, i.e., DFL [48]; and resolution-robust methods, such as PCN [46], DCR [31], DAlign [34], SKD [9], and Centerloss [27]. For the super-resolution approaches, we adopt the CASIA-WebFace dataset for training; for the resolution-robust approaches, we employ the same probe and gallery sets. We use HFSRL-v and HFSRL-m to denote the hierarchical feature set-based representation learning with the vector and matrix forms, respectively.

Tables I-III show the recognition results. We see that directly feeding the super-resolved faces into a classical recognition method contributes little to the final recognition, since the synthesized faces may not be optimized for recognition tasks. By comparison, the resolution-robust approaches (i.e., PCN, DCR, DAlign, SKD, and Centerloss) take the discriminability of the features into account and achieve better recognition performance. The quantitative comparisons on the three datasets also validate that our HFSRL approach achieves the best performance among all competitors. By fully exploiting the multi-level deep CNN features, the proposed HFSRL can dramatically boost the recognition accuracy.

Owing to complicated and unknown imaging conditions, the effect of noise cannot be neglected in real-world applications. In this part, the observed LR query face samples are corrupted by a square "baboon" image at a random location with an occlusion ratio of 20%. Some examples are displayed in Fig. 5. The recognition results of the competing approaches are given in Figs. 7-9. We can observe that the performance of all methods drops drastically. Our method (both HFSRL-v and HFSRL-m) still performs better than the other competitors. In particular, by considering the latent structural information of the feature set, the proposed HFSRL-m handles noise better and outperforms HFSRL-v.

IV-D Speed Comparisons

In this part, we examine the computational speed of the competing methods. We conduct tests on an Intel CPU @ 3.4 GHz. For simplicity, we only provide the comparisons on the NJU-ID dataset. The average inference times of the respective methods are tabulated in Table IV. The two super-resolution methods, SICNN and SiGAN, cost more time due to the extra resolution enhancement step. By performing recognition directly, the resolution-robust approaches (PCN, DCR, DAlign, SKD, and Centerloss) require relatively low computational cost. Unlike previous methods, which directly use the final extracted feature vector for recognition, our proposed methods fully take the multi-level hierarchical features into account and thus cost more computational time. HFSRL-v has a closed-form solution that only involves a matrix inversion, so its time consumption is comparable to that of the other methods. HFSRL-m obtains the best performance at the cost of higher time consumption due to the iterative procedure in representation learning. In future work, we will investigate fast and efficient ADMM variants to accelerate the representation learning procedure.

Methods Time (seconds) Year
SICNN [56] 0.92 2018
SiGAN [14] 1.15 2019
PCN [46] 0.33 2016
DCR [31] 0.46 2018
DAlign [34] 0.62 2019
SKD [9] 0.25 2019
Centerloss [27] 0.53 2019
HFSRL-v 1.62 -
HFSRL-m 4.50 -
TABLE IV: Speed comparisons (seconds) of respective methods on the NJU-ID dataset.

V Conclusions

In this work, we propose to exploit the multi-level deep CNN feature set to further mitigate the resolution discrepancy for better CRFR. An end-to-end feature extraction network is designed to learn a more discriminative feature representation that captures richer visual and contextual details. A feature set-based representation learning scheme is then proposed to jointly represent the hierarchical features. By fusing the recognition results generated by the hierarchical features at different stages, CRFR accuracy is further improved. Experimental results on three popular face datasets with various recognition scenes verify that the presented approach outperforms several competitive CRFR approaches.
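The stage-wise fusion of recognition outputs can be sketched generically as below. This is a hypothetical score-level fusion helper, not the paper's actual fusion rule: each stage produces class-wise matching scores (e.g., reconstruction residuals, lower is better), which are min-max normalized per stage and combined by a weighted average before the final decision:

```python
import numpy as np

def fuse_stages(stage_scores, weights=None):
    """Fuse per-stage class scores (lower = better) by min-max
    normalizing each stage, then taking a weighted average.
    Returns the index of the winning class."""
    S = np.asarray(stage_scores, dtype=float)  # (n_stages, n_classes)
    mins = S.min(axis=1, keepdims=True)
    spans = S.max(axis=1, keepdims=True) - mins
    spans[spans == 0] = 1.0  # guard against constant rows
    N = (S - mins) / spans   # each stage normalized to [0, 1]
    if weights is None:
        weights = np.full(len(S), 1.0 / len(S))
    fused = np.average(N, axis=0, weights=weights)
    return int(np.argmin(fused))
```

Normalizing each stage first keeps a stage with a larger score range from dominating the fused decision; the weights could, for instance, reflect the validation accuracy of each stage.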

In future work, we will incorporate face priors such as facial landmarks and face parsing maps into the attention network to enhance the discriminability of the features. We will also explore graph neural networks for handling the multi-level hierarchical features, and investigate adversarial metric learning methods to robustly match cross-resolution face image pairs.


  • [1] O. Abdollahi Aghdam, B. Bozorgtabar, H. Kemal Ekenel, and J. Thiran (2019) Exploring factors for improving low resolution face recognition. In Proc. IEEE Conf. CVPR Workshops, pp. 1–8. Cited by: §I, §II-B.
  • [2] S. Biswas, G. Aggarwal, P. J. Flynn, and K. W. Bowyer (2013) Pose-robust recognition of low-resolution face images. IEEE Trans. Pattern Anal. Mach. Intell. 35 (12), pp. 3037–3049. Cited by: §II-B.
  • [3] Y. Chen, Y. Tai, X. Liu, C. Shen, and J. Yang (2018) FSRNet: end-to-end learning face super-resolution with facial priors. In Proc. IEEE Conf. CVPR, pp. 2492–2501. Cited by: §I.
  • [4] W. Deng, J. Hu, and J. Guo (2019) Compressive binary patterns: designing a robust binary face descriptor with random-field eigenfilters. IEEE Trans. Pattern Anal. Mach. Intell. 41 (3), pp. 758–767. Cited by: §I.
  • [5] G. Gao, J. Yang, X. Jing, F. Shen, W. Yang, and D. Yue (2017) Learning robust and discriminative low-rank representations for face recognition with occlusion. Pattern Recogn. 66, pp. 129–143. Cited by: §I.
  • [6] G. Gao, Y. Yu, J. Xie, J. Yang, M. Yang, and J. Zhang (2020) Constructing multilayer locality-constrained matrix regression framework for noise robust face super-resolution. Pattern Recogn. 110, pp. 107539. Cited by: §II-A.
  • [7] G. Gao, Y. Yu, M. Yang, H. Chang, P. Huang, and D. Yue (2020) Cross-resolution face recognition with pose variations via multilayer locality-constrained structural orthogonal procrustes regression. Inf. Sci. 506, pp. 19–36. Cited by: §I.
  • [8] S. Ge, S. Zhao, X. Gao, and J. Li (2019) Fewer-shots and lower-resolutions: towards ultrafast face recognition in the wild. In Proc. ACM Conf. Multimedia, pp. 229–237. Cited by: §I.
  • [9] S. Ge, S. Zhao, C. Li, and J. Li (2019) Low-resolution face recognition in the wild via selective knowledge distillation. IEEE Trans. Image Process. 28 (4), pp. 2051–2062. Cited by: §II-B, §IV-C, TABLE I, TABLE II, TABLE III, TABLE IV.
  • [10] M. Grgic, K. Delac, and S. Grgic (2011) SCface–surveillance cameras face database. Multimed. Tools. Appl. 51 (3), pp. 863–879. Cited by: §IV-A.
  • [11] K. Grm, W. J. Scheirer, and V. Štruc (2020) Face hallucination using cascaded super-resolution and identity priors. IEEE Trans. Image Process. 29 (1), pp. 2150–2165. Cited by: §II-A.
  • [12] M. Haghighat and M. Abdel-Mottaleb (2017) Low resolution face recognition in surveillance systems using discriminant correlation analysis. In Proc. FG, pp. 912–917. Cited by: §II-B.
  • [13] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proc. IEEE Conf. CVPR, pp. 770–778. Cited by: §II-B, §III-A.
  • [14] C. Hsu, C. Lin, W. Su, and G. Cheung (2019) SiGAN: siamese generative adversarial network for identity-preserving face hallucination. IEEE Trans. Image Process. 28 (12), pp. 6225–6236. Cited by: §II-A, §IV-C, TABLE I, TABLE II, TABLE III, TABLE IV.
  • [15] X. Hu, P. Ma, Z. Mai, S. Peng, Z. Yang, and L. Wang (2019) Face hallucination from low quality images using definition-scalable inference. Pattern Recogn. 94, pp. 110–121. Cited by: §I.
  • [16] J. Huo, Y. Gao, Y. Shi, W. Yang, and H. Yin (2016) Ensemble of sparse cross-modal metrics for heterogeneous face recognition. In Proc. ACM Conf. Multimedia, pp. 1405–1414. Cited by: §IV-A.
  • [17] M. Jian and K. Lam (2015) Simultaneous hallucination and recognition of low-resolution faces based on singular value decomposition. IEEE Trans. Circuits Syst. Video Technol. 25 (11), pp. 1761–1772. Cited by: §II-B.
  • [18] J. Jiang, R. Hu, Z. Wang, and Z. Han (2014) Noise robust face hallucination via locality-constrained representation. IEEE Trans. Multimedia 16 (5), pp. 1268–1281. Cited by: §II-A.
  • [19] J. Jiang, Y. Yu, J. Hu, S. Tang, and J. Ma (2018) Deep cnn denoiser and multi-layer neighbor component embedding for face hallucination. In Proc. IJCAI, pp. 771–778. Cited by: §II-A.
  • [20] J. Jiang, Y. Yu, S. Tang, J. Ma, A. Aizawa, and K. Aizawa (2019) Context-patch based face hallucination via thresholding locality-constrained representation and reproducing learning. IEEE Trans. Cybern. 50 (1), pp. 324–337. Cited by: §II-A.
  • [21] C. Jing, Z. Dong, M. Pei, and Y. Jia (2019) Heterogeneous hashing network for face retrieval across image and video domains. IEEE Trans. Multimedia 21 (3), pp. 782–794. Cited by: §I.
  • [22] F. Keinert, D. Lazzaro, and S. Morigi (2019) A robust group-sparse representation variational method with applications to face recognition. IEEE Trans. Image Process. 28 (6), pp. 2785–2798. Cited by: §I.
  • [23] K. Koh, S. Kim, and S. Boyd (2007) An interior-point method for large-scale l1-regularized logistic regression. J. Mach. Learn. Res. 8, pp. 1519–1555. Cited by: §III-C.
  • [24] J. Li, J. Zhao, F. Zhao, H. Liu, J. Li, S. Shen, J. Feng, and T. Sim (2016) Robust face recognition with deep multi-view representation learning. In Proc. ACM Conf. Multimedia, pp. 1068–1072. Cited by: §I.
  • [25] J. Li, F. Fang, K. Mei, and G. Zhang (2018) Multi-scale residual network for image super-resolution. In Proc. ECCV, pp. 517–532. Cited by: §III-A.
  • [26] M. Li, Z. Zhang, G. Xie, and J. Yu (2020) A deep learning approach for face hallucination guided by facial boundary responses. ACM Trans. Multim. Comput. 16 (1), pp. 1–23. Cited by: §I.
  • [27] P. Li, L. Prieto, D. Mery, and P. J. Flynn (2019) On low-resolution face recognition in the wild: comparisons and new techniques. IEEE Trans. Inf. Forensics Secur. 14 (8), pp. 2000–2012. Cited by: §II-B, §IV-C, TABLE I, TABLE II, TABLE III, TABLE IV.
  • [28] X. Li, W. Zheng, X. Wang, T. Xiang, and S. Gong (2015) Multi-scale learning for low-resolution person re-identification. In Proc. IEEE Conf. ICCV, pp. 3765–3773. Cited by: §II-B.
  • [29] L. Liu, S. Li, and C. P. Chen (2019) Iterative relaxed collaborative representation with adaptive weights learning for noise robust face hallucination. IEEE Trans. Circuits Syst. Video Technol. 29 (5), pp. 1284–1295. Cited by: §II-A.
  • [30] L. Liu, C. Xiong, H. Zhang, Z. Niu, M. Wang, and S. Yan (2016) Deep aging face verification with large gaps. IEEE Trans. Multimedia 18 (1), pp. 64–75. Cited by: §I.
  • [31] Z. Lu, X. Jiang, and A. Kot (2018) Deep coupled resnet for low-resolution face recognition. IEEE Signal Proc. Lett. 25 (4), pp. 526–530. Cited by: §II-B, §IV-C, TABLE I, TABLE II, TABLE III, TABLE IV.
  • [32] S. P. Mudunuri and S. Biswas (2016) Low resolution face recognition across variations in pose and illumination. IEEE Trans. Pattern Anal. Mach. Intell. 38 (5), pp. 1034–1040. Cited by: §II-B.
  • [33] S. P. Mudunuri, S. Sanyal, and S. Biswas (2018) GenLR-net: deep framework for very low resolution face and object recognition with generalization to unseen categories. In Proc. IEEE Conf. CVPR Workshops, pp. 489–498. Cited by: §I.
  • [34] S. P. Mudunuri, S. Venkataramanan, and S. Biswas (2019) Dictionary alignment with re-ranking for low-resolution nir-vis face recognition. IEEE Trans. Inf. Forensics Secur. 14 (4), pp. 886–896. Cited by: §II-B, §IV-C, TABLE I, TABLE II, TABLE III, TABLE IV.
  • [35] C. Peng, N. Wang, J. Li, and X. Gao (2019) Re-ranking high-dimensional deep local representation for nir-vis face recognition. IEEE Trans. Image Process. 28 (9), pp. 4553–4565. Cited by: §I.
  • [36] G. Qi, X. Hua, Y. Rui, J. Tang, and H. Zhang (2010) Image classification with kernelized spatial-context. IEEE Trans. Multimedia 12 (4), pp. 278–287. Cited by: §II-B.
  • [37] G. Qi, L. Zhang, H. Hu, M. Edraki, J. Wang, and X. Hua (2018) Global versus localized generative adversarial nets. In Proc. IEEE Conf. CVPR, pp. 1517–1525. Cited by: §II-B.
  • [38] G. Qi (2016) Hierarchically gated deep networks for semantic segmentation. In Proc. IEEE Conf. CVPR, pp. 2267–2275. Cited by: §II-B.
  • [39] C. Ren, D. Dai, and H. Yan (2012) Coupled kernel embedding for low-resolution face image recognition. IEEE Trans. Image Process. 21 (8), pp. 3770–3783. Cited by: §II-B.
  • [40] A. Sapkota and T. E. Boult (2013) Large scale unconstrained open set face database. In Proc. IEEE Conf. BTAS, pp. 1–8. Cited by: §IV-A.
  • [41] J. Shi and G. Zhao (2019) Face hallucination via coarse-to-fine recursive kernel regression structure. IEEE Trans. Multimedia 21 (9), pp. 2223–2236. Cited by: §II-A.
  • [42] X. Shu, J. Tang, G. Qi, Z. Li, Y. Jiang, and S. Yan (2016) Image classification with tailored fine-grained dictionaries. IEEE Trans. Circuits Syst. Video Technol. 28 (2), pp. 454–467. Cited by: §II-B.
  • [43] M. Singh, S. Nagpal, R. Singh, and M. Vatsa (2019) Dual directed capsule network for very low resolution image recognition. In Proc. IEEE Conf. ICCV, pp. 340–349. Cited by: §I.
  • [44] Y. Song, J. Zhang, S. He, L. Bao, and Q. Yang (2017) Learning to hallucinate face images via component generation and enhancement. In Proc. IJCAI, pp. 4537–4543. Cited by: §II-A.
  • [45] Z. Wang, C. Zhao, Y. Qin, Q. Zhou, G. Qi, J. Wan, and Z. Lei (2018) Exploiting temporal and depth information for multi-frame face anti-spoofing. arXiv preprint arXiv:1811.05118. Cited by: §I.
  • [46] Z. Wang, S. Chang, Y. Yang, D. Liu, and T. S. Huang (2016) Studying very low resolution recognition using deep networks. In Proc. IEEE Conf. CVPR, pp. 4792–4800. Cited by: §II-B, §IV-A, §IV-C, TABLE I, TABLE II, TABLE III, TABLE IV.
  • [47] Z. Wang, M. Ye, F. Yang, X. Bai, and S. Satoh (2018) Cascaded sr-gan for scale-adaptive low resolution person re-identification.. In Proc. IJCAI, pp. 3891–3897. Cited by: §II-B.
  • [48] Y. Wen, K. Zhang, Z. Li, and Y. Qiao (2016) A discriminative feature learning approach for deep face recognition. In Proc. ECCV, pp. 499–515. Cited by: §I, §III-A, §IV-C.
  • [49] F. Yang, W. Yang, R. Gao, and Q. Liao (2018) Discriminative multidimensional scaling for low-resolution face recognition. IEEE Signal Proc. Lett. 25 (3), pp. 388–392. Cited by: §II-B.
  • [50] J. Yang, L. Luo, J. Qian, Y. Tai, F. Zhang, and Y. Xu (2017) Nuclear norm based matrix regression with applications to face recognition with occlusion and illumination changes. IEEE Trans. Pattern Anal. Mach. Intell. 39 (1), pp. 156–171. Cited by: §III-B.
  • [51] M. Yang, W. Wen, X. Wang, L. Shen, and G. Gao (2020) Adaptive convolution local and global learning for class-level joint representation of facial recognition with a single sample per data subject. IEEE Trans. Inf. Forensics Secur. 15, pp. 2469–2484. Cited by: §I.
  • [52] D. Yi, Z. Lei, S. Liao, and S. Z. Li (2014) Learning face representation from scratch. arXiv preprint arXiv:1411.7923. Cited by: §IV.
  • [53] H. Yu, D. Liu, H. Shi, H. Yu, Z. Wang, X. Wang, B. Cross, M. Bramler, and T. S. Huang (2017) Computed tomography super-resolution using convolutional neural networks. In Proc. ICIP, pp. 3944–3948. Cited by: §II-A.
  • [54] H. Yu, X. Chen, H. Shi, T. Chen, T. S. Huang, and S. Sun (2020) Motion pyramid networks for accurate and efficient cardiac motion estimation. In Proc. MICCAI, pp. 436–446. Cited by: §III-C.
  • [55] D. Zeng, H. Chen, and Q. Zhao (2016) Towards resolution invariant face recognition in uncontrolled scenarios. In Proc. IJCB, pp. 1–8. Cited by: §II-B.
  • [56] K. Zhang, Z. Zhang, C. Cheng, W. H. Hsu, Y. Qiao, W. Liu, and T. Zhang (2018) Super-identity convolutional neural network for face hallucination. In Proc. ECCV, pp. 183–198. Cited by: §II-A, §IV-C, TABLE I, TABLE II, TABLE III, TABLE IV.
  • [57] Y. Zhao, Z. Jin, G. Qi, H. Lu, and X. Hua (2018) An adversarial approach to hard triplet generation. In Proc. ECCV, pp. 501–517. Cited by: §II-B.
  • [58] X. Zhu, H. Liu, Z. Lei, H. Shi, F. Yang, D. Yi, G. Qi, and S. Z. Li (2019) Large-scale bisample learning on id versus spot face recognition. Int. J. Comput. Vis. 127 (6-7), pp. 684–700. Cited by: §I.