Face recognition is an important and challenging problem in computer vision. Despite great progress in recent years, many face recognition scenarios remain challenging, for example, recognition on images captured in heterogeneous environments. Conventional face recognition methods generally perform poorly because of large texture (or style) differences between heterogeneous face images. Matching heterogeneous face images, i.e., heterogeneous face recognition (HFR), is attracting growing attention on account of its theoretical challenges and great potential applications. For example, in law enforcement, when no face image of the suspect is available or only poor-quality video surveillance images exist, face sketches created by forensic artists (see http://www.askaforensicartist.com/composite-sketch-leads-to-arrest-in-virginia-highland-robbery/ and http://www.askaforensicartist.com/phoenix-police-sketch-leads-to-arrest-of-kidnapper/) or by composite-generation software are commonly matched against mug shot photos. In complex illumination environments, near infrared (NIR) images or thermal infrared (TIR) images are preferred for authentication by matching them with previously enrolled visible light (VIS) face images captured under controlled indoor conditions. These scenarios pose a great challenge to face recognition systems, and in this paper we present a sparse graphical representation based discriminant analysis (SGR-DA) approach for them.
Many HFR approaches have been proposed to date, which can be broadly classified into three categories: image synthesis-based methods, common subspace projection-based methods, and modality invariant feature descriptor-based methods. Image synthesis-based methods [4, 5, 6, 7, 8, 9, 10, 11, 12, 13] transform face images from one modality into another so that they become homogeneous and conventional face recognition methods can be applied directly. However, image synthesis is a complex problem in itself, even more difficult than the recognition task. Furthermore, each of these synthesis-based methods is designed for a fixed pair of modalities and does not generalize well to different HFR scenarios. The common subspace projection-based methods [2, 14, 15, 16, 17, 18, 19, 20] usually learn modality-specific mappings to project heterogeneous face images into a common latent space, where they can be matched directly. However, the projection procedure reduces discriminability, which degrades the performance of HFR methods. The modality invariant feature descriptor-based methods [1, 3, 21, 22, 23, 24, 25, 26, 27, 28, 29] first represent face images by extracting modality invariant features, which are then compared for matching. Yet most of these methods extract feature descriptors without regard to facial spatial information and thus have limited discriminability.
Recently, a graphical representation-based approach has been proposed to deal with the HFR problem. The spatial information of facial structures is taken into consideration by jointly modeling heterogeneous face patches through Markov networks. A new similarity metric is developed for matching, and the method is effective and efficient on multiple HFR scenarios. Different from feature representations learnt directly from raw pixels [31, 32], the graphical representations are generated through a state-of-the-art face synthesis model (Markov networks). The basic assumption is that the heterogeneous faces of the same person tend to have similar weight matrices during the synthesis process. Based on this assumption, a representation dataset containing some heterogeneous face pairs is constructed to encode the faces. The graphical representations are generated with spatial information taken into consideration. Furthermore, the graphical representation-based approach can be easily and effectively generalized to multiple HFR scenarios. However, the Markov networks employed in that approach suffer from the same shortcoming as those used in synthesis scenarios [5, 6]: a fixed number of nearest neighbors of the probe image patch are selected when constructing the Markov networks. The performance of these methods is heavily affected by the number of nearest neighbors, which is manually determined. In addition, that approach matches the whole face image through a single classifier without considering the complex facial structure. Although several methods [21, 22, 33, 34] divide face images into local regions and perform discriminant analysis on each region separately, improving the discriminability with respect to the complex facial structure remains an unresolved problem.
In this paper, we propose a novel sparse graphical representation based discriminant analysis (SGR-DA) method for heterogeneous face recognition. Firstly, a new Markov networks model is deployed to generate an adaptive sparse graphical representation. Unlike selecting nearest neighbors as employed in [5, 6, 30], the proposed method skips the nearest neighbor searching process and all related image patches are considered when the Markov networks model is constructed. The non-negative sparse regularization in the Markov networks model results in adaptive sparse vectors. Secondly, a spatial partition-based discriminant analysis framework is proposed to handle the complex facial structure and improve the discriminability. Three spatial partition strategies are developed and discriminant analysis is performed separately on each spatial partition region. The proposed discriminant analysis framework is simple yet effective and it results in high recognition accuracy. Experimental results on six commonly used heterogeneous face datasets demonstrate the effectiveness of the proposed method.
The main contributions of this paper are summarized as follows.
We propose an adaptive sparse graphical representation scheme to represent heterogeneous face images. By skipping the nearest neighbor selection process, adaptive sparse vectors can be generated from the Markov networks model;
We develop a spatial partition-based discriminant analysis framework for heterogeneous face matching. With the proposed spatial partition strategies, the discriminability of heterogeneous face images is improved.
II Related Work
In this section, we briefly review representative HFR methods from the three aforementioned categories: image synthesis-based methods, common subspace projection-based methods, and modality invariant feature descriptor-based methods.
Image synthesis-based methods transform heterogeneous face images into the same modality. An eigen-transformation algorithm was first proposed for face sketch-photo synthesis. To address the drawback of performing synthesis on the whole face, local linear embedding (LLE) was later employed for face sketch-photo synthesis and for NIR-VIS synthesis. To take the relationship between a face image patch and its neighboring patches into consideration, the Markov random field (MRF) model was introduced for face sketch-photo synthesis and for TIR-VIS synthesis. These MRF model-based methods select only the “best” candidate patch to represent the probe patch, which can cause facial deformations. A Markov weight field (MWF) model was therefore proposed, which selects a number of candidates to construct the Markov networks model and is capable of synthesizing new patches that do not exist in the training set. The test image was further incorporated into the learning process through a transductive face sketch-photo synthesis (TFSPS) framework. A multiple representations-based approach was proposed by Peng et al. More recently, a real-time face sketch synthesis method was proposed that treats the synthesis procedure as a denoising problem. Inspired by the wide application of convolutional neural networks (CNNs) in computer vision, a CNN-based sketch-photo synthesis method was developed that takes the whole face photo as input and generates the corresponding whole face sketch.
Common subspace projection-based methods attempt to project heterogeneous face images into a latent subspace where the heterogeneity is minimized. This line of work began with a common discriminant feature extraction (CDFE) approach. Canonical correlation analysis, a correlational regression method, was then used to map NIR and VIS images into a common feature space. A coupled spectral regression (CSR) based method was proposed for NIR-VIS matching and was later improved by learning mappings from both modalities. The partial least squares (PLS) algorithm was exploited to learn linear mapping transformations between face images in different modalities. A cross modal metric learning (CMML) method for heterogeneous face matching took both positive and negative constraints into consideration during the metric learning process. A multi-view discriminant analysis (MvDA) method exploited both inter-view and intra-view correlations of heterogeneous face images. Inspired by unsupervised deep learning algorithms, Restricted Boltzmann Machines were utilized to learn a shared representation for HFR.
The modality invariant feature descriptor-based methods encode face images with local feature descriptors, which can then be utilized for recognition. A difference of Gaussian (DoG) filter and multiblock local binary patterns (MB-LBP) were first used for matching NIR and VIS images. Later, the scale invariant feature transform (SIFT) feature and the multiscale local binary pattern (MLBP) feature were explored in a local feature-based discriminant analysis (LFDA) framework for forensic sketch recognition with a populated gallery. A learning-based feature was designed by coupled information-theoretic encoding (CITE) for matching viewed sketches with photos. Two other binary pattern features, i.e., local radon binary pattern (LRBP) and local difference of Gaussian binary pattern (LDoGBP), were also designed for viewed sketch recognition. To bridge the gap between viewed sketch recognition and forensic sketch recognition, a semi-forensic sketch dataset was proposed and the multi-scale circular Weber’s local descriptor (MCWLD) was deployed for matching. Nonlinear kernel similarities were utilized to represent face images in the prototype random subspace (P-RS) approach, which was evaluated on four HFR scenarios. Recently, a number of composite sketch recognition methods [1, 27, 28, 29] were proposed. Mittal et al. presented a transfer learning-based deep learning representation for composite sketch recognition. Considering the insufficient usage of facial spatial information, a graphical representation-based HFR approach was proposed recently. The graphical representation was extracted by Markov networks, and a coupled representation similarity metric was designed to suit the obtained representations. However, this graphical representation suffers from the same shortcoming as existing methods [5, 6]: its performance is seriously affected by the nearest neighbor searching process, in which the number of neighbors is defined manually. In this paper, we skip this process and propose an adaptive sparse graphical representation. Additionally, we propose a new spatial partition-based discriminant analysis framework to improve the discriminability and test our method on six different heterogeneous face datasets.
III Sparse Graphical Representation based Discriminant Analysis for Heterogeneous Face Recognition
In this section, we present the proposed sparse graphical representation based discriminant analysis method for heterogeneous face recognition. We first give the formulation and analysis of the adaptive sparse graphical representation. Then, we introduce the spatial partition-based discriminant analysis framework. Finally, the whole SGR-DA approach is developed. Without loss of generality and for ease of presentation, we take viewed sketch recognition as an example here to introduce the proposed method. It can be seen from the experimental section that the proposed approach can be generalized to other HFR scenarios.
III-A Adaptive Sparse Graphical Representation
A representation dataset consisting of face sketch-photo pairs is constructed first. We divide each face image into overlapping patches and represent each patch by a feature descriptor (SIFT, for example). Given a probe sketch and a gallery photo, we divide them into patches and represent each patch by the feature descriptor in exactly the same way as for the representation dataset. For each probe sketch patch, the nearest sketch patch on each face sketch in the representation dataset, within the search region around the location of the probe patch, is selected based on the Euclidean distance between feature descriptors. Each probe sketch patch therefore has one related sketch patch per face sketch in the representation dataset. Likewise, related photo patches for each gallery photo patch can be found.
In existing methods [5, 6, 30], a fixed number of nearest related neighbors are selected from these related patches, and the probe sketch patch is regarded as a linear combination of these nearest related neighbors weighted by a column vector. A Markov networks model can then be built by jointly modeling all probe sketch patches and their nearest neighbors:
where denotes the th probe sketch patch and the th probe sketch patch are adjacent. denotes the linear combination of feature descriptors on the nearest related neighbors, i.e., . and are the local evidence function and the neighboring compatibility function respectively. Maximizing the problem in (1) can be formulated as the following problem (2). The detailed proof can be found in the Appendix.
The shortcoming of the above procedure is that the number of nearest related neighbors is always defined manually. For example, it was set to 10 in [5, 6] and to 15-40 in the graphical representation-based approach. However, the performance of these methods is heavily affected by this number. Therefore, in this paper we propose to skip the nearest neighbor searching process and consider all the related image patches. Now, the problem (2) becomes the following optimization problem:
The constraint in function (3), i.e., , is identical to the following constraint when :
which is a non-negative sparse regularization. The non-negative constraint prevents subtraction from occurring in the linear combination of the related image patches; subtraction would be contrary to the intuitive notion of combining parts to form a whole. It has been shown that the non-negativity property is advantageous. Different from existing sparse graphical representations applied in several computer vision applications, the proposed adaptive sparse graphical representation is generated from a state-of-the-art face synthesis model (Markov networks), which can take spatial information into consideration.
The non-negative sparse regularization in our Markov networks model produces an adaptive sparse representation of the data. In our experiments, statistics show that above 90% of the elements of the adaptive sparse graphical representation are near zero (10). Examples of several representation vectors are shown in Fig. 1. Note that the size of the representation dataset is far larger than the number of nearest neighbors used in existing methods. Instead of manually fixing the number of related neighbors for each probe sketch patch at the beginning, the proposed method adaptively utilizes different numbers of related neighbors for different probe sketch patches. This adaptive sparse property makes face images of different identities highly discriminable. We validate its effectiveness in the experiment section. The obtained adaptive sparse vectors are regarded as the adaptive sparse graphical representation of the probe sketch. The adaptive sparse graphical representation of the gallery photo can be obtained in a similar way.
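To give some intuition for how a non-negativity constraint can yield sparse combination weights, the following sketch solves a simplified single-patch problem. It is only an illustration under stated assumptions: it ignores the neighboring compatibility term of the Markov networks model, it assumes the weights are constrained to be non-negative and to sum to one, and the projected-gradient solver, the function names, and the random data are all hypothetical rather than the formulation or solver used in this paper.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]                      # sort in descending order
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, v.size + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]
    theta = css[rho] / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def sparse_weights(f_probe, F_related, n_iter=500):
    """Minimise ||f_probe - F_related @ w||^2 over the simplex (assumed constraint)."""
    m = F_related.shape[1]
    w = np.full(m, 1.0 / m)                   # start from uniform weights
    # Step size 1/L, where L is the Lipschitz constant of the gradient.
    step = 1.0 / (2.0 * np.linalg.norm(F_related, 2) ** 2 + 1e-12)
    for _ in range(n_iter):
        grad = 2.0 * F_related.T @ (F_related @ w - f_probe)
        w = project_to_simplex(w - step * grad)
    return w

# Hypothetical data: 250 related patches described by 64-d descriptors.
rng = np.random.default_rng(0)
F = rng.random((64, 250))
f = 0.7 * F[:, 3] + 0.3 * F[:, 17]            # probe built from two related patches
w = sparse_weights(f, F)
```

The returned vector is non-negative and sums to one by construction; how many entries end up near zero depends on the data and on the number of iterations.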
III-B Spatial Partition-based Discriminant Analysis
After obtaining the adaptive sparse graphical representations of both probe sketches and gallery photos, we refine these representations through discriminant analysis for face matching. The representations of all face image patches could simply be concatenated and classical subspace analysis, such as principal component analysis (PCA) and linear discriminant analysis (LDA), applied to extract discriminative information for matching. However, the facial structure is complex and this direct concatenation approach neglects the spatial facial structure. Furthermore, it is prone to overfitting due to the small sample size.
In order to handle the complex facial structure and improve the discriminability, many discriminant analysis strategies have been proposed. The LFDA approach divided face image patches into “slices”, where a “slice” corresponds to the concatenation of features from one column of image patches. However, combining only one column of image patches may not be the most discriminative strategy. It is also practical to concatenate features from several rows of image patches, which that approach did not exploit. Other methods divide the face image into local regions manually. For example, the CITE approach divided a face image into 75 local regions with equal size. Another method manually divided a face image into five local regions corresponding to five facial components (eye, eyebrow, cheeks, nose, and mouth). These approaches share the problem that the local regions are designed manually, without considering the characteristics of the face image data. The semantic pixel sets-based method exploited the semantic pixel relation by intensity distribution and clustered face regions by pixel intensity values. However, as illustrated below, this clustering-based strategy is complementary to the column-based and row-based strategies. The fusion of different spatial partition strategies to further improve the discriminability has not yet been investigated.
Three spatial partition strategies are developed in this paper to address the drawbacks of aforementioned approaches and improve the discriminability:
(1) Considering the shortcoming of combining only one column of image patches, we propose to combine a given number of columns of image patches as a spatial partition region. Discriminant analysis can then be performed separately on each spatial partition region, and the extracted features are concatenated for matching. The column-based spatial partition strategy is shown in the first row of Fig. 2 for 1, 2, 3, 4, and 5 columns per region. Note that patches belonging to the same spatial partition region are drawn in the same color in the first row of Fig. 2.
(2) In order to exploit the row-based spatial partition strategy, we combine a given number of rows of image patches as a spatial partition region. Discriminant analysis can be performed on each region similarly. The illustration for 1, 2, 3, 4, and 5 rows per region is shown in the second row of Fig. 2.
(3) Instead of manually dividing the face image into local regions, we further exploit a learning-based spatial partition strategy. The image patches can be clustered together through machine learning techniques. We use k-means clustering here. The features of image patches at the same location of each face image are concatenated as a vector. The purpose is to cluster the locations of face images in the dataset. In our experiments, the clusters are determined by the long feature vectors created by concatenating the feature descriptors from coupled heterogeneous face images across the training set. Therefore, different heterogeneous face datasets (e.g., a face sketch-photo dataset and an NIR-VIS dataset) may produce different clustering results. Illustrations of the learning-based spatial partition strategy on the CUHK face sketch FERET database are shown in the last row of Fig. 2 for cluster numbers 3, 5, 7, 9, and 11. Discriminant analysis can be performed on each clustered region.
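The column-based and row-based strategies in (1) and (2) can be sketched as a simple grouping of patch positions. This is a minimal illustration, not the authors' implementation: it assumes that the grouped columns (or rows) are consecutive and that a leftover group smaller than the requested size is kept as its own region; the function names and the 24 x 19 patch grid are hypothetical.

```python
def group_indices(n_items, group_size):
    """Split indices 0..n_items-1 into consecutive groups of at most group_size."""
    return [list(range(start, min(start + group_size, n_items)))
            for start in range(0, n_items, group_size)]

def column_partition(n_rows, n_cols, cols_per_region):
    """Regions as lists of (row, col) patch positions, grouping whole columns."""
    return [[(r, c) for c in cols for r in range(n_rows)]
            for cols in group_indices(n_cols, cols_per_region)]

def row_partition(n_rows, n_cols, rows_per_region):
    """Regions as lists of (row, col) patch positions, grouping whole rows."""
    return [[(r, c) for r in rows for c in range(n_cols)]
            for rows in group_indices(n_rows, rows_per_region)]

# Hypothetical grid of 24 rows x 19 columns of patches.
column_regions = column_partition(n_rows=24, n_cols=19, cols_per_region=4)
row_regions = row_partition(n_rows=24, n_cols=19, rows_per_region=5)
```

Each region then receives its own discriminant analysis; every patch position belongs to exactly one region under either strategy.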
The effects of the number of columns per region, the number of rows per region, and the number of clusters are discussed in the experiment section, and the best values are used. In the discriminant analysis process performed on each spatial partition region, PCA is first applied with 99 percent of the variance preserved. Subsequently, LDA is performed to further reduce the dimensionality and improve the discriminability. Finally, all the projected vectors of the same face image are concatenated, and the cosine similarity measure is used to calculate the similarity score between a probe sketch and a gallery photo.
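Two steps of this pipeline, PCA with 99 percent of the variance preserved and cosine similarity, can be sketched as follows. This is a minimal sketch under stated assumptions: the LDA step between them is omitted for brevity, the function names are hypothetical, and the random low-rank data merely stands in for the representations of one spatial partition region.

```python
import numpy as np

def fit_pca(X, variance_kept=0.99):
    """Return (mean, components), keeping the fewest axes that reach variance_kept."""
    mean = X.mean(axis=0)
    _, s, vt = np.linalg.svd(X - mean, full_matrices=False)
    ratio = np.cumsum(s ** 2) / np.sum(s ** 2)   # cumulative explained variance
    k = int(np.searchsorted(ratio, variance_kept) + 1)
    return mean, vt[:k]

def project(x, mean, components):
    """Project one sample (or a batch of rows) onto the retained components."""
    return (x - mean) @ components.T

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Hypothetical region data: 50 samples, 10 dimensions, intrinsic rank 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2)) @ rng.normal(size=(2, 10))
mean, components = fit_pca(X)
score = cosine_similarity(project(X[0], mean, components),
                          project(X[1], mean, components))
```

In the full method, the projected vectors of all regions of a face would be concatenated before the cosine similarity is computed.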
We further find that the proposed column-based, row-based, and learning-based spatial partition strategies are complementary. The fusion of these three spatial partition strategies can naturally enhance the recognition performance. Details are given in the experiment section. In our work, we simply sum the similarity scores of the different spatial partition strategies after min-max score normalization.
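The sum-rule fusion after min-max normalization can be sketched as below. The function names and the toy scores are hypothetical, and the sketch assumes normalization is applied over whatever set of scores is passed in (for example, the scores of one probe against all gallery photos); the paper does not specify this detail.

```python
import numpy as np

def min_max_normalize(scores):
    """Rescale scores linearly to [0, 1]; constant inputs map to all zeros."""
    scores = np.asarray(scores, dtype=float)
    lo, hi = scores.min(), scores.max()
    if hi == lo:
        return np.zeros_like(scores)
    return (scores - lo) / (hi - lo)

def fuse(score_sets):
    """Sum the min-max normalized scores of several strategies."""
    return sum(min_max_normalize(s) for s in score_sets)

# Toy similarity scores of one probe against three gallery photos.
column_scores = [0.2, 0.8, 0.5]
row_scores = [10.0, 30.0, 20.0]
cluster_scores = [0.1, 0.3, 0.9]
fused = fuse([column_scores, row_scores, cluster_scores])
best_match = int(np.argmax(fused))
```

Normalizing first keeps a strategy with a larger raw score range (here the row-based scores) from dominating the sum.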
III-C SGR-DA Method for HFR
To better illustrate the proposed approach, the whole pipeline is outlined in Fig. 3. Firstly, the face images are divided into patches, and common feature descriptors (SIFT, for example) are used to represent each image patch. Secondly, for a probe sketch (or a gallery photo), a Markov networks model is constructed on the features of the probe sketch patches (or gallery photo patches) and the sketch patches (or photo patches) in the representation dataset. The adaptive sparse graphical representations of the input image can then be generated by solving (3). Thirdly, the column-based, row-based, and learning-based spatial partition strategies are applied to refine the adaptive sparse graphical representations and improve their discriminability. Finally, the cosine similarity measure is used to calculate the similarity scores of the three refined vectors, which are then fused. A nearest neighbor matcher is used for recognition in the end.
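The final nearest neighbor matching step, together with the rank-k accuracy used to report results later, can be sketched from a matrix of fused similarity scores. The function name and the toy matrix are hypothetical, and the sketch assumes that the true mate of probe i is stored at gallery index i.

```python
import numpy as np

def rank_k_accuracy(similarity, k=1):
    """Fraction of probes whose true mate is among the k most similar gallery items.

    similarity[i, j] is the score between probe i and gallery item j;
    the true mate of probe i is assumed to be gallery item i.
    """
    order = np.argsort(-similarity, axis=1)      # most similar first
    hits = [i in order[i, :k] for i in range(similarity.shape[0])]
    return float(np.mean(hits))

# Toy fused similarity scores: 3 probes against 3 gallery photos.
sim = np.array([[0.9, 0.1, 0.3],
                [0.2, 0.4, 0.8],
                [0.1, 0.2, 0.7]])
rank1 = rank_k_accuracy(sim, k=1)
rank2 = rank_k_accuracy(sim, k=2)
```

A nearest neighbor matcher corresponds to k = 1: each probe is assigned the identity of its highest-scoring gallery photo.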
In this section, we evaluate our SGR-DA through extensive experiments on six commonly used heterogeneous face datasets: the CUHK Face Sketch FERET Database (CUFSF), the PRIP Viewed Software-Generated Composite Database (PRIP-VSGC), the IIIT-D Semi-Forensic Sketch Database, the Forensic Sketch Database, the CASIA NIR-VIS 2.0 Face Database, and the Natural Visible and Infrared facial Expression Database (USTC-NVIE). We first introduce the experimental settings of our method and evaluate the effectiveness of our contributions. Then we show that our approach achieves superior performance in comparison with state-of-the-art methods on these six datasets.
Three baseline results are provided in this section: the Fisherface algorithm, the open source face recognition algorithm OpenBR, and the state-of-the-art HFR algorithm P-RS. For the Fisherface algorithm, we combine the heterogeneous face images together to train the projection matrix, and the Euclidean distance is used for matching. For the OpenBR algorithm, we use the public source code, which is freely available online (http://openbiometrics.org/). For the P-RS algorithm, we implemented the prototype random subspace framework and the direct random subspace (D-RS) framework. The results of fusing P-RS and D-RS are reported. Note that the results in this paper are obtained by 10-fold cross validation with random splits of the training and testing sets.
Three features are utilized in this section: the SIFT feature, the speeded up robust features (SURF) feature, and the histograms of oriented gradients (HOG) feature. The SURF feature is extracted with the implementation embedded in the MATLAB R2012b software. The center of the image patch is set as the interest point and the standard SURF-64 version is utilized. The SIFT feature and the HOG feature are extracted through an open source library (http://www.vlfeat.org/). For the SIFT feature, the center of the image patch is set as the interest point. A 128-dimensional vector and a 124-dimensional vector are generated for SIFT and HOG respectively.
IV-A Experimental Settings
All the heterogeneous face images are aligned based on five facial points (centers of the two eyes, nose tip, and left and right mouth corners), which are automatically detected. Because the facial point detection method failed on the TIR images in the USTC-NVIE database, we manually located the five points on the TIR images. After the facial points are located, each face image is cropped to 100×125 based on the five points. The image patch size is 10×10, and a 50% overlapping ratio is kept. The size of the search region is 16. We further conduct tuning experiments on the CUFSF database to determine the other parameters as well as to evaluate the effectiveness of our contributions. 250 sketch-photo pairs in CUFSF constitute the representation dataset and another 250 pairs are used for training. The remaining 694 pairs are used for testing (there are 1194 face sketch-photo pairs in this database in total). In our experiments, we observe that the choice of source for the representation dataset has little influence on the results. The only principle is that the images in the representation dataset should not appear again in the training set or the testing set.
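As a quick check of how many patches these settings produce, the following sketch enumerates patch origins along each axis. It assumes the cropped image is 100 pixels wide and 125 pixels high, that patches are 10×10 with a stride of half the patch size, and that no border padding is used; these details and the function name are assumptions for illustration.

```python
def patch_origins(length, patch, overlap):
    """Top-left coordinates of patches along one axis, without border padding."""
    stride = int(patch * (1 - overlap))
    return list(range(0, length - patch + 1, stride))

cols = patch_origins(100, patch=10, overlap=0.5)   # along the assumed width
rows = patch_origins(125, patch=10, overlap=0.5)   # along the assumed height
n_patches = len(cols) * len(rows)
```

Under these assumptions the grid has 19 patch columns and 24 patch rows, i.e., 456 patches per image.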
The most time-consuming part of the proposed approach is the extraction of the adaptive sparse graphical representation. Although the nearest neighbor searching process is skipped in the proposed method, we still need to find the best image patch on each face image in the representation dataset within the search region. Therefore, the complexity of this process depends on the number of candidates in the search region around one patch, the number of patches per image, the number of face image pairs in the representation dataset, and the dimension of the local descriptor. During the discriminant analysis process, standard PCA and LDA are applied before matching. In our experiments, it takes approximately five minutes to encode one face image through the proposed adaptive sparse graphical representation. This may hinder the usage of the proposed representation in some real applications. However, in real law enforcement scenarios it is quite acceptable, since searching for a suspect manually usually takes several days or even months. In such applications, the recognition performance is more important than the speed of the encoding process. All the experiments in this paper are conducted on an Intel Core i7-4790 3.6GHz PC under the MATLAB R2012b environment.
We first evaluate the effectiveness of the proposed adaptive sparse graphical representation. The SURF feature is utilized as the feature descriptor to represent image patches. Because there are 250 pairs in the representation dataset, the number of related neighbors in our method is 250. To compare with existing methods [5, 6, 30], which manually select the nearest neighbors, we set the number of nearest neighbors to 15, 20, 25, 30, 35, and 40; the accuracies for these fixed numbers of related neighbors are shown in Fig. 4, denoted as “Fixed neighbors approach”. We further implemented a direct feature-based approach by replacing the adaptive sparse graphical representation with the original SURF feature, denoted as “Direct feature approach”. As shown in Fig. 4, the proposed adaptive sparse graphical representation is superior to the existing nearest related neighbor selection strategy.
We then demonstrate the effects of the three spatial partition parameters (the number of columns per region, the number of rows per region, and the number of clusters) in the proposed spatial partition-based discriminant analysis framework. The top-left, top-right, and bottom-left subfigures of Fig. 5 show the rank-1 accuracy as each of these parameters is varied in turn. It can be seen that the values 4, 5, and 9 achieve the best accuracies respectively on the CUFSF database under our experimental settings (image size and patch size). In order to further illustrate the effectiveness of the proposed spatial partition-based discriminant analysis, we compare it with three conventional strategies (concatenating all patches, dividing a face image into 77 local regions, and manually defined regions). As shown in the bottom-right subfigure of Fig. 5, the proposed spatial partition-based discriminant analysis strategy exploits the characteristics of the data and achieves better performance than the conventional strategies. We assume that the best parameters here generalize to other datasets, and the three parameters are fixed to 4, 5, and 9 respectively in the following experiments.
The proposed column-based, row-based, and learning-based spatial partition strategies are complementary. Therefore, the fusion of these three spatial partition strategies can naturally enhance the recognition performance, as shown in the left subfigure of Fig. 6. Because the proposed method represents heterogeneous face images in each modality separately, common features used in conventional face recognition tasks can be used to represent image patches in our method. We utilize the SURF, SIFT, and HOG features in this paper. We further find that fusing these three features improves recognition accuracy, as shown in the right subfigure of Fig. 6.
IV-B CUFSF Viewed Sketch Database
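Table I: Rank-1 recognition accuracies of different methods on the CUFSF database.
| Method | Rank-1 accuracy | Method | Rank-1 accuracy |
| Fisherface | 28.82% | OpenBR | 10.80% |
| TFSPS | 72.62% | MrFSPS | 75.36% |
| PLS | 51.00% | MvDA | 55.50% |
| LRBP | 91.12% | P-RS | 83.95% |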
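Table II: Rank-10 accuracies of different methods on the PRIP-VSGC database under protocol I.
| Method | Rank-10 accuracy | Method | Rank-10 accuracy |
| Fisherface | 21.87% | OpenBR | 16.65% |
| SSD-based | 45.30% | Deep Network | 52.00% |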
The CUFSF database is used to evaluate the proposed method on matching viewed sketches with photos. There are 1194 persons in this database in total. Each person has one photo and one corresponding sketch drawn by an artist. There are illumination variations in the photos and shape exaggerations in the sketches of the database. The viewed sketches are drawn by a professional forensic artist while viewing the photos. Some examples used in this paper are shown in Fig. 7. In our experiment, 250 sketch-photo pairs are randomly selected to construct the representation dataset. Another 250 pairs are randomly selected for training and the remaining 694 pairs are used for testing.
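The random partition of the 1194 pairs into representation, training, and testing sets can be sketched as follows. The function name and the seed handling are hypothetical; repeating the split with different seeds would give the repeated random splits over which results are reported.

```python
import random

def split_pairs(n_pairs, n_representation, n_train, seed):
    """Randomly split pair indices into disjoint representation/train/test sets."""
    ids = list(range(n_pairs))
    random.Random(seed).shuffle(ids)          # seeded for reproducibility
    representation = ids[:n_representation]
    train = ids[n_representation:n_representation + n_train]
    test = ids[n_representation + n_train:]
    return representation, train, test

representation, train, test = split_pairs(1194, 250, 250, seed=0)
```

The three index lists are disjoint by construction, which matches the requirement that images in the representation dataset do not reappear in the training or testing sets.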
We compare the proposed SGR-DA method with the three aforementioned baseline approaches, i.e., Fisherface, OpenBR, and P-RS, as well as several state-of-the-art methods. The rank-1 recognition accuracies of different methods are reported in Table I. The two baseline face recognition methods (Fisherface and OpenBR) performed poorly on the HFR scenario. The image synthesis-based method TFSPS first transformed face sketches and photos into the same modality and then utilized the random sampling LDA method for recognition. The two common subspace projection-based methods (PLS and MvDA) only achieved rank-1 accuracies below 60%. The two modality invariant feature descriptor-based methods (LRBP and P-RS) achieved good performance with 91.12% and 83.94% respectively. The graphical representation-based method (G-HFR) achieved a rank-1 accuracy of 96.04% on CUFSF. The proposed method represents heterogeneous faces with adaptive sparse graphical representations, in which the adaptive sparse property makes face images of different persons discriminative. The proposed spatial partition-based discriminant analysis framework further improves the discriminability and finally achieves a rank-1 accuracy of 96.97%, which is superior to state-of-the-art methods.
IV-C PRIP-VSGC Composite Sketch Database
|Deep Network ||15.60%||48.30%|
The PRIP-VSGC database contains 123 photos from the AR database and corresponding sketches created using composite generation software (FACES, http://www.iqbiometrix.com, and Identi-Kit, http://www.identikit.net). Composite sketches are more easily available than hand-drawn sketches because creating sketches with composite generation software is more affordable than training a professional forensic artist. In our experiment, 123 sketch-photo pairs from the CUHK Student database are used to form the representation dataset, and the 123 composite sketches generated by Identi-Kit are used for testing (currently only the 123 composites generated using Identi-Kit are released on http://biometrics.cse.msu.edu/pubs/databases.html). Some examples of the composite sketch-photo pairs used are shown in Fig. 8.
We first follow the baseline experiment protocol in [28, 29], denoted as protocol I. 48 composite sketch-photo pairs are randomly selected for training. The rest 75 pairs form the testing set. The rank-10 accuracies of different methods under protocol I are reported in Table II. The deep network-based transfer learning approach  and the state-of-the-art P-RS method achieved good performance of 52% and 53.73% respectively. It can be seen that the proposed method outperforms existing methods under protocol I and reached rank-10 accuracy of 70%.
To evaluate the performance of the proposed method on a large database, we then follow the extended experiment protocol in [28, 29], denoted as protocol II. The size of the training set remains 48, and the gallery size is extended to 2400 while the probe size is 75. It should be noted that the methods in [3, 28] used images obtained from law enforcement agencies to extend the gallery, while another method selected images from multiple face databases that are not clearly described. Since the images used to extend the gallery in existing methods are not available to the community, we randomly selected face images from a publicly available dataset, i.e., the Labeled Faces in the Wild-a (LFW-a), to extend the gallery. This helps increase the diversity of the gallery set and mimics real-world face recognition scenarios. We acknowledge that there may be a bias in similarity scores between the LFW-a images and the gallery images. This bias is due to the fact that the face images in LFW-a are collected from the Internet while the gallery photos are captured under controlled conditions. The use of 10000 photos from LFW-a to extend the gallery set aims to make the face recognition problem more challenging. The rank-20 and rank-40 accuracies of different methods under protocol II are reported in Table III. The proposed method achieves a rank-20 accuracy of 54.93% and a rank-40 accuracy of 67.60%, outperforming state-of-the-art methods by at least 12%.
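The gallery extension described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `extend_gallery`, the string identifiers, and the size of the distractor pool are hypothetical, and the sketch assumes that the stated gallery size counts the mated photos plus the randomly drawn distractors.

```python
import random


def extend_gallery(mated_photos, distractor_pool, target_size, seed=0):
    """Pad a gallery of mated photos with randomly chosen distractors.

    mated_photos: identifiers of gallery photos that have a probe sketch.
    distractor_pool: identifiers of unrelated photos (e.g. from LFW-a).
    target_size: assumed total gallery size after extension.
    """
    n_extra = target_size - len(mated_photos)
    if n_extra < 0:
        raise ValueError("target_size is smaller than the mated gallery")
    pool = list(distractor_pool)
    if n_extra > len(pool):
        raise ValueError("not enough distractors in the pool")
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    # Sample without replacement so no distractor appears twice.
    return list(mated_photos) + rng.sample(pool, n_extra)


# Hypothetical identifiers: 75 mated photos and a pool of 5000 distractors.
mates = ["mate_%d" % i for i in range(75)]
pool = ["lfw_%d" % i for i in range(5000)]
gallery = extend_gallery(mates, pool, target_size=2400)
print(len(gallery))  # 2400
```

Fixing the random seed keeps the drawn distractors identical across methods, so that differences in accuracy are not caused by a different gallery.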
Finally, we further extend the gallery size to 10000 to mimic the real-world face retrieval scenarios in law enforcement agencies, denoted as protocol III. Face images from LFW-a are used to extend the gallery. We randomly select 100 sketch-photo pairs from the CUFSF database for training, and all 123 composite sketch-photo pairs are used for testing. The comparison of cumulative match scores with baseline methods under protocol III is shown in Fig. 9. The Fisherface method achieves a rank-50 accuracy of 11.38%. The open source face recognition algorithm OpenBR was developed for general face recognition and performs poorly on the composite sketches (a rank-50 accuracy of 2.55%). The kernel prototype similarities-based P-RS approach achieves good performance on this dataset, with a rank-50 accuracy of 42.72%. Benefiting from the discriminability driven by both the adaptive sparse graphical representation and the spatial partition-based discriminant analysis, the proposed approach achieves a rank-50 accuracy of 55.28%, which is superior to the other methods.
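For reference, the rank-k accuracies and cumulative match scores reported in these experiments can be computed from a probe-by-gallery similarity matrix roughly as below. The sketch assumes a closed-set setting in which every probe has exactly one mate in the gallery and a higher score means greater similarity; the function name and the toy scores are illustrative only and are not taken from the paper.

```python
def cumulative_match_scores(similarity, probe_ids, gallery_ids, max_rank):
    """Return the cumulative match scores for ranks 1..max_rank.

    similarity[i][j] is the score between probe i and gallery item j.
    Distractors are gallery items whose ids match no probe.
    """
    hits = [0] * max_rank
    for scores, pid in zip(similarity, probe_ids):
        mate = gallery_ids.index(pid)
        # Rank of the mate: 1 + number of gallery items scoring strictly
        # higher (ties are resolved in favour of the mate).
        rank = 1 + sum(1 for s in scores if s > scores[mate])
        # A probe matched at rank r counts as a hit at every rank >= r.
        for r in range(rank, max_rank + 1):
            hits[r - 1] += 1
    return [h / len(probe_ids) for h in hits]


# Toy example: 3 probes against a 3-item gallery.
sim = [
    [0.9, 0.1, 0.3],  # probe "a": mate ranked 1st
    [0.2, 0.4, 0.8],  # probe "b": mate ranked 2nd
    [0.5, 0.6, 0.7],  # probe "c": mate ranked 1st
]
cmc = cumulative_match_scores(sim, ["a", "b", "c"], ["a", "b", "c"], max_rank=3)
print([round(v, 2) for v in cmc])  # [0.67, 1.0, 1.0]
```

The first entry of the returned list is the rank-1 accuracy, and entry k is the rank-k accuracy plotted on a cumulative match score curve.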
IV-D IIIT-D Semi-Forensic Sketch Database
The IIIT-D Semi-Forensic Sketch database is composed of 140 semi-forensic sketches and corresponding photos. The semi-forensic sketches are drawn from the memory of the forensic artist after viewing the photos for a few minutes. Therefore, the semi-forensic sketches are less similar to the photos than viewed sketches, which makes them useful for narrowing the gap between viewed sketches and real-world forensic sketches. It has been observed that classifiers trained on semi-forensic sketches fit the forensic sketch recognition scenario better. In our experiment, 123 sketch-photo pairs in the CUHK AR database are used to construct the representation dataset. The protocol in  is followed. The semi-forensic sketches are used for training classifiers, and our collected forensic sketch database, containing 168 forensic sketch-photo pairs, is used for testing. 10000 photos from LFW-a are used to extend the gallery. Some examples of the semi-forensic sketch-photo pairs are shown in Fig. 10.
The cumulative match score results on the semi-forensic dataset are shown in Fig. 11. The two baseline approaches, i.e., Fisherface and OpenBR, achieve similar performance on the semi-forensic sketches. The MCWLD method utilized 6324 photos to extend the gallery and achieved a rank-50 accuracy of 28.52%. The P-RS approach achieved a rank-50 accuracy of 33.93% on this dataset. The proposed method utilized 10000 photos to extend the gallery and outperforms the MCWLD method despite the larger gallery. A rank-50 accuracy of 41.33% is achieved on this challenging dataset, which verifies the effectiveness of the proposed method.
IV-E Forensic Sketch Database
The forensic sketch database contains 168 real-world forensic sketches with corresponding mug shot photos. Examples of forensic sketches and mug shot photos are shown in Fig. 12. Forensic sketches are drawn by a professional forensic artist based on the descriptions of eyewitnesses or victims. Face perception varies between people, and eyewitnesses or victims cannot recall and describe all the details of a face. These effects lead to large differences between the faces in forensic sketches and mug shot photos, which makes the forensic sketch recognition scenario more challenging than viewed sketch or semi-forensic sketch recognition. In our experiment, the CUHK AR database, including 123 sketch-photo pairs, is taken as the representation dataset. The same train-test partition protocol in  is followed in this section. We randomly select 112 persons to form the training set, and the remaining 56 persons form the testing set. The gallery is extended by 10000 face images from the LFW-a database.
Fig. 13 shows the cumulative match scores of the proposed method and baseline methods on the forensic sketch dataset. The forensic sketches and mug shot photos in this dataset vary greatly from each other. Because they are collected from different sources, some of them are of poor quality. The two baseline methods (Fisherface and OpenBR) achieve limited performance, with accuracies increasing very slowly when the rank numbers become large. The P-RS method achieves a rank-50 accuracy of 36.79% (higher than the reported 20.80% in ). The proposed approach exploits the discriminative information in both forensic sketches and mug shot photos, and achieves superior performance with a rank-50 accuracy of 54.64%.
IV-F CASIA NIR-VIS 2.0 Face Database
There are 725 subjects in the CASIA NIR-VIS 2.0 face database. The subjects in this database are of different ages (from children to old people) and their images were collected under different lighting conditions. Instead of using multiple images per subject, we randomly select one NIR-VIS pair per subject for training and testing. This helps to evaluate the performance of HFR methods and to mimic real-world face recognition scenarios with a smaller training set and an extended gallery. Therefore, 10000 face images from the LFW-a dataset are used to extend the gallery in this section. Example NIR-VIS pairs are shown in Fig. 14. We randomly choose 100 NIR-VIS pairs to form the representation dataset. Of the remaining 625 subjects, 417 subjects are used for training and the other 208 subjects form the testing set.
We evaluate the proposed approach on the NIR-VIS dataset and the results are shown in Fig. 15. The near infrared images are captured by NIR cameras to overcome the illumination variation problem. Therefore, there are great appearance differences between NIR and VIS images. The Fisherface and OpenBR methods cannot cope with the NIR scenario and only achieve rank-50 accuracies of 29.90% and 23.09%, respectively. The P-RS method is developed to deal with multiple HFR scenarios including NIR-VIS matching and achieves a rank-50 accuracy of 52.64%. Our proposed method achieves a rank-50 accuracy of 87.84% on this challenging database, benefiting from the discriminative information captured by the adaptive sparse graphical representation and the spatial partition-based discriminant analysis framework.
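The subject-disjoint partition described above (100 subjects for the representation dataset, 417 for training, and the remaining 208 for testing) can be sketched as follows. The function name `split_subjects`, the integer subject identifiers, and the fixed seed are assumptions made for illustration; the paper only states that the selection is random.

```python
import random


def split_subjects(subject_ids, n_representation, n_train, seed=0):
    """Randomly split subjects into three disjoint groups.

    Returns (representation, train, test); the test group receives all
    subjects left over after the first two groups are filled.
    """
    ids = list(subject_ids)
    if n_representation + n_train > len(ids):
        raise ValueError("requested more subjects than are available")
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    rng.shuffle(ids)
    representation = ids[:n_representation]
    train = ids[n_representation:n_representation + n_train]
    test = ids[n_representation + n_train:]
    return representation, train, test


# 725 subjects, one NIR-VIS pair each (ids here are placeholders).
rep, train, test = split_subjects(range(725), n_representation=100, n_train=417)
print(len(rep), len(train), len(test))  # 100 417 208
```

Splitting by subject rather than by image keeps every identity in exactly one group, so no test subject is seen while building the representation dataset or training the classifiers.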
IV-G USTC-NVIE TIR-VIS Database
The USTC-NVIE database is composed of 215 subjects, with VIS and TIR images captured by a visible camera and an infrared camera, respectively. Due to the imaging principle of the thermal infrared camera, whether a subject wears glasses greatly affects the results of TIR imaging, so it is important information for face recognition. There are also lighting and expression variations in this database, which increase the difficulty of matching. Fig. 16 shows some examples of TIR and VIS images. In our experiment, 129 subjects are used, with one TIR-VIS pair per subject (only 129 subjects are available in the USTC-NVIE database because of the reported loss of some TIR and VIS videos). We further extend the gallery by adding 10000 face images from LFW-a in this scenario. 60 TIR-VIS pairs are randomly selected as the representation dataset. We randomly select 30 TIR-VIS pairs to train the classifiers, and the remaining 39 pairs are used for testing.
Fig. 17 shows the cumulative match score results on the TIR-VIS database. Thermal infrared cameras capture the thermal emission from the face, so the TIR images lack facial details. The open source face recognition algorithm OpenBR fails on this database. The Fisherface method achieves a rank-50 accuracy of 40%. The P-RS method can deal with the TIR-VIS matching scenario to some extent, with a rank-50 accuracy of 46.15%. The proposed method achieves excellent performance in this scenario, with a rank-50 accuracy of 93.08%. This drastic improvement benefits from the discriminability of our method driven by both the adaptive sparse graphical representation and the spatial partition-based discriminant analysis.
In this paper, a novel sparse graphical representation based discriminant analysis, denoted as SGR-DA, is proposed for multiple HFR scenarios. The most discriminative information is extracted through two aspects: the adaptive sparse graphical representation and the spatial partition-based discriminant analysis. Firstly, the adaptive sparse property of our method maximizes the discriminability of different subjects. Secondly, the row-based, column-based, and learning-based spatial partition strategies are presented to refine the adaptive sparse vectors and further improve the discriminability. Extensive experiments were conducted on six commonly used heterogeneous face datasets. We achieved superior rank-1 accuracy (5% higher) in comparison with state-of-the-art methods on the CUFSF viewed sketch dataset. We also outperformed existing methods on the composite sketch, semi-forensic sketch, and forensic sketch datasets under different protocols. It is further shown that the proposed approach has excellent generalization ability on both the NIR-VIS dataset and the TIR-VIS dataset. Our future work will focus on (1) further improving the accuracy on each HFR scenario separately, and (2) investigating the relations between different HFR scenarios to help improve the performance on these scenarios together. We will also consider incorporating additional HFR scenarios, such as matching 2D photos with 3D range images and matching faces of different resolutions.
where denotes the vector of intensity values extracted from overlapping area between th patch and th patch in the th related neighbor. is set to 0.25. The quadratic parameters and are given below:
where and are two matrices, with the th column being and , respectively.
-  S. Klum, H. Han, B. Klare, and A. K. Jain, “The FaceSketchID system: Matching facial composites to mugshots,” IEEE Transactions on Information Forensics and Security, vol. 9, no. 12, pp. 2248–2263, 2014.
-  D. Lin and X. Tang, “Inter-modality face recognition,” in Proceedings of European Conference on Computer Vision, 2006, pp. 13–26.
-  B. Klare and A. Jain, “Heterogeneous face recognition using kernel prototype similarities,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 6, pp. 1410–1422, 2013.
-  N. Wang, D. Tao, X. Gao, X. Li, and J. Li, “A comprehensive survey to face hallucination,” International Journal of Computer Vision, vol. 31, no. 1, pp. 9–30, 2014.
-  H. Zhou, Z. Kuang, and K. Wong, “Markov weight fields for face sketch synthesis,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 1091–1097.
-  N. Wang, D. Tao, X. Gao, X. Li, and J. Li, “Transductive face sketch-photo synthesis,” IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 9, pp. 1364–1376, 2013.
-  X. Tang and X. Wang, “Face sketch synthesis and recognition,” in Proceedings of IEEE International Conference on Computer Vision, 2003, pp. 687–694.
-  Q. Liu, X. Tang, H. Jin, H. Lu, and S. Ma, “A nonlinear approach for face sketch synthesis and recognition,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2005, pp. 1005–1010.
-  J. Chen, D. Yi, J. Yang, G. Zhao, S. Li, and M. Pietikainen, “Learning mappings for face synthesis from near infrared to visual light images,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 156–163.
-  X. Wang and X. Tang, “Face photo-sketch synthesis and recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 11, pp. 1955–1967, 2009.
-  J. Li, P. Hao, C. Zhang, and M. Dou, “Hallucinating faces from thermal infrared images,” in Proceedings of IEEE International Conference on Image Processing, 2008, pp. 465–468.
-  Y. Song, L. Bao, Q. Yang, and M. Yang, “Real-time exemplar-based face sketch synthesis,” in Proceedings of European Conference on Computer Vision, 2014, pp. 800–813.
-  L. Zhang, L. Lin, X. Wu, S. Ding, and L. Zhang, “End-to-end photo-sketch generation via fully convolutional representation learning,” [Online]. Available: http://arxiv.org/abs/1501.07180, 2015.
-  D. Yi, R. Liu, R. Chu, Z. Lei, and S. Li, “Face matching between near infrared and visible light images,” in Proceedings of International Conference on Biometrics, 2007, pp. 523–530.
-  Z. Lei and S. Li, “Coupled spectral regression for matching heterogeneous faces,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 1123–1128.
-  Z. Lei, C. Zhou, D. Yi, A. Jain, and S. Li, “An improved coupled spectral regression for heterogeneous face recognition,” in Proceedings of International Conference on Biometrics, 2012, pp. 7–12.
-  A. Sharma and D. Jacobs, “Bypass synthesis: PLS for face recognition with pose, low-resolution and sketch,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 593–600.
-  A. Mignon and F. Jurie, “CMML: a new metric learning approach for cross modal matching,” in Proceedings of Asian Conference on Computer Vision, 2012, pp. 1–14.
-  M. Kan, S. Shan, H. Zhang, S. Lao, and X. Chen, “Multi-view discriminant analysis,” in Proceedings of European Conference on Computer Vision, 2012, pp. 808–821.
-  D. Yi, Z. Lei, and S. Li, “Shared representation learning for heterogeneous face recognition,” in Proceedings of IEEE International Conference on Automatic Face and Gesture Recognition, 2015.
-  B. Klare, Z. Li, and A. Jain, “Matching forensic sketches to mug shot photos,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 3, pp. 639–646, 2011.
-  W. Zhang, X. Wang, and X. Tang, “Coupled information-theoretic encoding for face photo-sketch recognition,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 513–520.
-  S. Liao, D. Yi, Z. Lei, R. Qin, and S. Li, “Heterogeneous face recognition from local structures of normalized appearance,” in Proceedings of IAPR International Conference on Biometrics, 2009.
-  H. Galoogahi and T. Sim, “Face sketch recognition by local radon binary pattern,” in Proceedings of IEEE International Conference on Image Processing, 2012, pp. 1837–1840.
-  A. Alex, V. Asari, and A. Mathew, “Local difference of gaussian binary pattern: robust features for face sketch recognition,” in Proceedings of IEEE International Conference on Systems, Man, and Cybernetics, 2013, pp. 1211–1216.
-  H. Bhatt, S. Bharadwaj, R. Singh, and M. Vatsa, “Memetically optimized MCWLD for matching sketches with digital face images,” IEEE Transactions on Information Forensics and Security, vol. 7, no. 5, pp. 1522–1535, 2012.
-  H. Han, B. Klare, K. Bonnen, and A. Jain, “Matching composite sketches to face photos: a component-based approach,” IEEE Transactions on Information Forensics and Security, vol. 8, no. 1, pp. 191–204, 2013.
-  P. Mittal, A. Jain, G. Goswami, R. Singh, and M. Vatsa, “Recognizing composite sketches with digital face images via SSD dictionary,” in Proceedings of IEEE International Conference on Biometrics, 2014, pp. 1–6.
-  P. Mittal, M. Vatsa, and R. Singh, “Composite sketch recognition via deep network-A transfer learning approach,” in Proceedings of IAPR International Conference on Biometrics, 2015.
-  C. Peng, X. Gao, N. Wang, and J. Li, “Graphical representation for heterogeneous face recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016.
-  Z. Cao, Q. Yin, X. Tang, and J. Sun, “Face recognition with learning-based descriptor,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 2707–2714.
-  Z. Lei, M. Pietikäinen, and S. Li, “Learning discriminant face descriptor,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 2, pp. 289–302, 2014.
-  S. Liu, D. Yi, Z. Lei, and S. Li, “Heterogeneous face image matching using multi-scale features,” in Proceedings of IAPR International Conference on Biometrics, 2012, pp. 79–84.
-  Z. Chai, H. Mendez-Vazquez, R. He, Z. Sun, and T. Tan, “Semantic pixel sets based local binary patterns for face recognition,” in Proceedings of Asian Conference on Computer Vision, 2012, pp. 639–651.
-  C. Peng, X. Gao, N. Wang, D. Tao, X. Li, and J. Li, “Multiple representations-based face sketch-photo synthesis,” IEEE Transactions on Neural Networks and Learning Systems, 2016.
-  D. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
-  T. Ojala, M. Pietikäinen, and T. Mäenpää, “Multiresolution gray-scale and rotation invariant texture classification with local binary patterns,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 971–987, 2002.
-  D. Lee and H. Seung, “Learning the parts of objects with nonnegative matrix factorization,” Nature, vol. 401, pp. 788–791, 1999.
-  B. Cheng, J. Yang, S. Yan, Y. Fu, and T. Huang, “Learning with ℓ1-graph for image analysis,” IEEE Transactions on Image Processing, vol. 19, no. 4, pp. 858–866, 2010.
-  I. Jolliffe, Principal component analysis. New York: Springer, 2002.
-  P. Belhumeur, J. Hespanha, and D. Kriegman, “Eigenfaces vs. fisherfaces: recognition using class specific linear projection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 711–720, 1997.
-  J. MacQueen, “Some methods for classification and analysis of multivariate observations,” in Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability, 1967, pp. 281–297.
-  H. S. Bhatt, S. Bharadwaj, R. Singh, and M. Vatsa, “Memetic approach for matching sketches with digital face images,” IIITD-TR-2011-006, Tech. Rep., 2011.
-  S. Li, D. Yi, Z. Lei, and S. Liao, “The CASIA NIR-VIS 2.0 face database,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2013, pp. 348–353.
-  S. Wang, Z. Liu, S. Lv, Y. Lv, G. Wu, P. Peng, F. Chen, and X. Wang, “A natural visible and infrared facial expression database for expression recognition and emotion inference,” IEEE Transactions on Multimedia, vol. 12, no. 7, pp. 682–691, 2010.
-  J. Klontz, B. Klare, S. Klum, A. Jain, and M. Burge, “Open source biometric recognition,” IEEE Biometrics: Theory, Applications, and Systems (under review), 2013.
-  H. Bay, A. Ess, T. Tuytelaars, and L. Gool, “SURF: speeded up robust features,” Computer Vision and Image Understanding, vol. 110, no. 3, pp. 346–359, 2008.
-  N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2005, pp. 886–893.
-  Y. Sun, X. Wang, and X. Tang, “Deep convolutional network cascade for facial point detection,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3476–3483.
-  P. Mittal, A. Jain, R. Singh, and M. Vatsa, “Boosting local descriptors for matching composite and digital face images,” in Proceedings of IEEE Conference on Image Processing, 2013, pp. 2797–2801.
-  X. Wang and X. Tang, “Random sampling for subspace face recognition,” International Journal of Computer Vision, vol. 70, no. 1, pp. 91–104, 2006.
-  A. Martinez and R. Benavente, “The AR face database,” CVC Technical Report #24, Tech. Rep., 1998.
-  L. Wolf, T. Hassner, and Y. Taigman, “Effective unconstrained face recognition by combining multiple descriptors and learned background statistics,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 10, pp. 1978–1990, 2011.
-  L. Best-Rowden, H. Han, C. Otto, B. Klare, and A. Jain, “Unconstrained face recognition: Identifying a person of interest from a media collection,” IEEE Transactions on Information Forensics and Security, vol. 9, no. 2, pp. 2144–2157, 2014.