1 Introduction
Visual similarity matching is one of the most fundamental problems in computer vision and pattern recognition, and it becomes more challenging when dealing with cross-domain data. For example, in still-video face retrieval, a newly rising task in visual surveillance, faces from still images captured under a constrained environment are used as queries to find matches of the same identity in unconstrained videos. Age-invariant and sketch-photo face verification tasks are also examples of cross-domain image matching. Some examples of these applications are shown in Figure 1.
Conventional approaches (e.g., canonical correlation analysis [1] and partial least square regression [2]) for cross-domain matching usually follow a two-step procedure:

Samples from different modalities are first projected into a common space by learning a transformation. One may simplify the computation by assuming that these cross-domain samples share the same projection.

A certain distance is then utilized for measuring the similarity/dissimilarity in the projection space. Usually the Euclidean distance or the inner product is used.
Suppose that x and y are two samples from different modalities, and U and V are two projection matrices applied to x and y, respectively. U and V are usually formulated as linear similarity transformations, mainly for the convenience of optimization. A similarity transformation has the good property of preserving the shape of an object that goes through it, but it is limited in capturing complex deformations that usually exist in various real problems, e.g., translation, shearing, and their compositions. On the other hand, the Mahalanobis distance, Cosine similarity, and their combination have been widely studied in the research of similarity metric learning, but it remains less investigated how to unify feature learning and similarity learning, in particular, how to combine the Mahalanobis distance with Cosine similarity and integrate the distance metric with deep neural networks for end-to-end learning.
To address the above issues, in this work we present a more general similarity measure and unify it with deep convolutional representation learning. One of the key innovations is that we generalize the existing similarity models in two aspects. First, we extend the similarity transformations U and V to affine transformations by adding a translation vector to each, i.e., replacing Ux and Vy with Ux + a and Vy + b, respectively. An affine transformation is a generalization of the similarity transformation that is not required to preserve the origin of a linear space, and it is able to capture more complex deformations. Second, unlike the traditional approaches that choose either the Mahalanobis distance or Cosine similarity, we combine these two measures under the affine transformation. This combination is realized in a data-driven fashion, as discussed in the Appendix, resulting in a novel generalized similarity measure, defined as:

S(x, y) = [x^T, y^T, 1] [[A, C, d], [C^T, B, e], [d^T, e^T, f]] [x; y; 1],   (1)
where the submatrices A and B are positive semidefinite, representing the self-correlations of the samples in their own domains, and C is a correlation matrix crossing the two domains.
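Concretely, Eqn. (1) can be evaluated as a quadratic form over the stacked vector [x; y; 1]. The following numpy sketch is illustrative only; the function name and the explicit block-matrix construction are ours, not part of the paper:

```python
import numpy as np

def generalized_similarity(x, y, A, B, C, d, e, f):
    """Evaluate the generalized similarity of Eqn. (1) as the quadratic
    form [x; y; 1]^T M [x; y; 1], where M packs the self-correlation
    blocks A and B, the cross-domain block C, the vectors d and e, and
    the scalar f."""
    z = np.concatenate([x, y, [1.0]])
    M = np.block([
        [A,          C,          d[:, None]],
        [C.T,        B,          e[:, None]],
        [d[None, :], e[None, :], np.array([[f]])],
    ])
    return float(z @ M @ z)
```

With C, d, e, and f set to zero, the measure reduces to the sum of the two within-domain quadratic terms, matching the role of A and B as self-correlations.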
Figure 2 intuitively explains the idea. (Figure 2 does not imply that our model geometrically aligns the two samples to be matched; with this example we emphasize the superiority of the affine transformation over the traditional linear similarity transformation in capturing pattern variations in the feature space.) In this example, it can be observed that the Euclidean distance under a linear transformation, as (a) illustrates, is a special case of our model with A = U^T U, B = V^T V, C = -U^T V, d = 0, e = 0, and f = 0. Our similarity model can be viewed as a generalization of several recent metric learning models [3][4]. Experimental results validate that the introduction of the translation vectors and the more flexible setting of the model parameters do improve the matching performance significantly.

Another innovation of this work is that we unify feature representation learning and similarity measure learning. In the literature, most existing models operate in the original data space or in a predefined feature space; that is, feature extraction and the similarity measure are studied separately. These methods may have several drawbacks in practice. For example, the similarity models heavily rely on feature engineering and thus lack generality when handling problems under different scenarios. Moreover, the interaction between the feature representations and similarity measures is ignored or simplified, limiting their performance. Meanwhile, deep learning, especially the Convolutional Neural Network (CNN), has demonstrated its effectiveness in learning discriminative features from raw data and has helped build end-to-end learning frameworks. Motivated by these works, we build a deep architecture to integrate our similarity measure with CNN-based feature representation learning. Our architecture takes raw images of different modalities as inputs and automatically produces their representations by sequentially stacking a shared subnetwork upon domain-specific subnetworks. Upon these layers, we further incorporate the components of our similarity measure via several appended structured neural network layers. Feature learning and similarity model learning are thus integrated for end-to-end optimization.
In sum, this paper makes three main contributions to cross-domain similarity measure learning.

First, it presents a generic similarity measure by generalizing the traditional linear projection and distance metrics into a unified formulation. Our model can be viewed as a generalization of several existing similarity learning models.

Second, it integrates feature learning and similarity measure learning by building an endtoend deep architecture of neural networks. Our deep architecture effectively improves the adaptability of learning with data of different modalities.

Third, we extensively evaluate our framework on four challenging tasks of cross-domain visual matching: person re-identification across views (person re-identification is arguably a cross-domain matching problem; we include it in our experiments since it has been receiving increasing attention recently) and face verification under different modalities (i.e., faces from still images and videos, older and younger faces, and sketch and photo portraits). The experimental results show that our similarity model outperforms other state-of-the-art methods on three of the four tasks and achieves the second best performance on the other one.
The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 introduces our generalized similarity model and discusses its connections to existing works. Section 4 presents the proposed deep neural network architecture, with the learning algorithm described in Section 4.2. The experimental results, comparisons, and ablation studies are presented in Section 5. Section 6 concludes the paper.
2 Related Work
In the literature, to cope with the cross-domain matching of visual data, one can learn a common space for the different domains. CCA [1] learns the common space by maximizing cross-view correlation, while PLS [2] is learned by maximizing cross-view covariance. Coupled information-theoretic encoding has been proposed to maximize the mutual information [5]. Another conventional strategy is to synthesize samples from the input domain into the other domain. Rather than learning the mapping between two domains in the data space, dictionary learning [6][7] can be used to alleviate cross-domain heterogeneity, and semi-coupled dictionary learning (SCDL [7]) has been proposed to model the relationship between the sparse coding vectors of the two domains. Duan et al. proposed another framework called the domain adaptation machine (DAM) [8] for multiple-source domain adaptation, but it requires a set of pre-trained base classifiers.
Various discriminative common space approaches have been developed by utilizing label information. Supervised information can be employed via the Rayleigh quotient [1], by treating the label as the common space [9], or by employing the max-margin rule [10]. Using the SCDL framework, structured group sparsity has been adopted to utilize label information [6]. The generalization of discriminative common spaces to multiple views has also been studied [11]. Kan et al. proposed a multi-view discriminant analysis (MvDA [12]) method to obtain a common space for multiple views by optimizing both the inter-view and intra-view Rayleigh quotients. In [13], a method was proposed to learn shape models using local curve segments with multiple types of distance metrics.
Moreover, for most existing multi-view analysis methods, the objective is defined based on the standard inner product or distance between samples in the feature space. In the field of metric learning, several generalized similarity/distance measures have been studied to improve recognition performance. In [4][14], the generalized distance/similarity measures are formulated as the difference between a distance component and a similarity component, taking into account both the cross inner product term and the two norm terms. Li et al. [3] adopted a second-order decision function as the distance measure, without considering the positive semidefinite (PSD) constraint. Chang and Yeung [15] suggested an approach to learn locally smooth metrics using local affine transformations while preserving the topological structure of the original data. These distance/similarity measures, however, were developed for matching samples from the same domain, and they cannot be directly applied to cross-domain data matching.
To extend traditional single-domain metric learning, Mignon and Jurie [16] suggested a cross-modal metric learning (CMML) model, which learns domain-specific transformations based on a generalized logistic loss. Zhai et al. [17] incorporated joint graph regularization into a heterogeneous metric learning model to improve cross-media retrieval accuracy. In [16, 17], the Euclidean distance is adopted to measure dissimilarity in the latent space. Instead of explicitly learning domain-specific transformations, Kang et al. [18] learned a low-rank matrix to parameterize the cross-modal similarity measure via the accelerated proximal gradient (APG) algorithm. However, these methods are mainly based on common similarity or distance measures, and none of them addresses the feature learning problem under cross-domain scenarios.
Instead of using hand-crafted features, learning feature representations and contextual relations with deep neural networks, especially the convolutional neural network (CNN) [19], has shown great potential in various pattern recognition tasks such as object recognition [20] and semantic segmentation [21]. Significant performance gains have also been achieved in face recognition [22] and person re-identification [23][24][25][26], mainly attributed to the progress in deep learning. Recently, several deep CNN-based models have been explored for similarity matching and learning. For example, Andrew et al. [27] proposed a multi-layer CCA model consisting of several stacked nonlinear transformations. Li et al. [28] learned filter pairs via deep networks to handle misalignment and photometric and geometric transforms, achieving promising results on the person re-identification task. Wang et al. [29] learned fine-grained image similarity with a deep ranking model. Yi et al. [30] presented a deep metric learning approach by generalizing the Siamese CNN. Ahmed et al. [25] proposed a deep convolutional architecture to measure the similarity between a pair of pedestrian images. Besides the shared convolutional layers, their network also includes a neighborhood difference layer and a patch summary layer to compute cross-input neighborhood differences. Chen et al. [26] proposed a deep ranking framework to learn the joint representation of an image pair and return the similarity score directly, in which the similarity model is replaced by fully-connected layers.

Our deep model is partially motivated by the above works, and we target a more powerful solution for cross-domain visual matching by incorporating a generalized similarity function into deep neural networks. Moreover, our network architecture differs from existing works, leading to new state-of-the-art results on several challenging person verification and recognition tasks.
3 Generalized Similarity Model
In this section, we first introduce the formulation of our deep generalized similarity model and then discuss the connections between our model and existing similarity learning methods.
3.1 Model Formulation
According to the discussion in Section 1, our generalized similarity measure extends the traditional linear projection and integrates the Mahalanobis distance and Cosine similarity into a generic form, as shown in Eqn. (1). As we derive in the Appendix, A and B in our similarity measure are positive semidefinite, but C does not obey this constraint. Hence, we can further factorize A, B, and C as:

A = L_A^T L_A,  B = L_B^T L_B,  C = -(L_x^C)^T L_y^C.   (2)

Moreover, our model extracts the feature representations (i.e., f_1(x) and f_2(y)) from the raw input data by utilizing CNNs. Incorporating the feature representations and the above matrix factorization into Eqn. (1), we thus obtain the following similarity model:

S(x, y) = ||L_A f_1(x)||^2 + ||L_B f_2(y)||^2 - 2 (L_x^C f_1(x))^T (L_y^C f_2(y)) + 2 d^T f_1(x) + 2 e^T f_2(y) + f.   (3)
Specifically, L_A f_1(x), L_x^C f_1(x), and d^T f_1(x) can be regarded as the similarity components for x, while L_B f_2(y), L_y^C f_2(y), and e^T f_2(y) serve accordingly for y. These similarity components are modeled as the weights that connect neurons of the last two layers. For example, a portion of the output activations represents L_A f_1(x), obtained by taking f_1(x) as the input and multiplying it by the corresponding weights L_A. In the following, we discuss the formulation of our similarity learning.

The objective of our similarity learning is to seek a function that satisfies a set of similarity/dissimilarity constraints. Instead of learning the similarity function on a hand-crafted feature space, we take the raw data as input and introduce a deep similarity learning framework to integrate nonlinear feature learning and generalized similarity learning. Recall that our deep generalized similarity model is given in Eqn. (1). f_1(x) and f_2(y) are the feature representations for samples of different modalities, and we use W to indicate their parameters. We denote by Φ the set of similarity components for sample matching. Note that our similarity measure is asymmetric, i.e., S(x, y) ≠ S(y, x). This is reasonable for cross-domain matching, because the similarity components are domain-specific.
Assume that D = {({x_i, y_i}, ℓ_i)}_{i=1}^N is a training set of cross-domain sample pairs, where {x_i, y_i} denotes the i-th pair and ℓ_i denotes the corresponding label indicating whether x_i and y_i are from the same class:

ℓ_i = 1 if c(x_i) = c(y_i), and ℓ_i = -1 otherwise,   (4)

where c(x) denotes the class label of the sample x. An ideal deep similarity model is expected to satisfy the following constraints:

S(x_i, y_i) < -1 if ℓ_i = 1, and S(x_i, y_i) > 1 if ℓ_i = -1,   (5)

for any pair ({x_i, y_i}, ℓ_i).
Note that a feasible solution satisfying the above constraints may not exist. To handle this scenario, we relax the hard constraints in Eqn. (5) by introducing a hinge-like loss:

G = Σ_{i=1}^N max(0, 1 + ℓ_i S(x_i, y_i)).   (6)
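As an illustration of such a relaxation, the sketch below implements one common hinge-like loss under the assumptions (made here for concreteness) that ℓ = 1 marks a same-class pair, ℓ = -1 a cross-class pair, and a smaller similarity score means a better match:

```python
import numpy as np

def hinge_like_loss(scores, labels):
    """Two-sided hinge relaxation: a same-class pair (label +1) should
    score below -1 and a cross-class pair (label -1) above +1;
    violations are penalized linearly, satisfied constraints cost
    nothing."""
    return float(np.maximum(0.0, 1.0 + labels * scores).sum())
```

For instance, a same-class pair at score -2 and a cross-class pair at score 3 both satisfy their constraints and incur zero loss, while a same-class pair at 0.5 violates its margin and costs 1.5.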
To improve the stability of the solution, some regularizers are further introduced, resulting in our deep similarity learning model:

min_{W, Φ}  Σ_{i=1}^N max(0, 1 + ℓ_i S(x_i, y_i)) + Ψ,   (7)

where Ψ denotes the regularizer on the parameters of the feature representation and generalized similarity models.
3.2 Connection with Existing Models
Our generalized similarity learning model is a generalization of many existing metric learning models, which can be treated as special cases of our model by imposing extra constraints on its parameters.
A conventional similarity model is usually defined as s(x, y) = x^T M y, and this form is equivalent to our model when A = B = 0, d = e = 0, f = 0, and C = -M/2. Similarly, the Mahalanobis distance (x - y)^T M (x - y) is also a special case of our model, with A = B = M, C = -M, d = e = 0, and f = 0.
In the following, we connect our similarity model to two state-of-the-art similarity learning methods, i.e., LADF [3] and Joint Bayesian [4].
In [3], Li et al. proposed to learn a decision function that jointly models a distance metric and a locally adaptive thresholding rule; the so-called LADF (Locally-Adaptive Decision Function) is formulated as a second-order large-margin regularization problem. Specifically, LADF is defined as:
F(x, y) = x^T A x + y^T A y + 2 x^T C y + d^T (x + y) + b.   (8)

One can observe that our model reduces to this form when we set B = A and e = d.
It should be noted that LADF treats x and y using the same metrics, i.e., A for both x^T A x and y^T A y, and C for the cross terms between x and y. Such a model is reasonable for matching samples of the same modality, but may be unsuitable for cross-domain matching, where x and y have different modalities. Compared with LADF, our model uses A and B to calculate the two self terms x^T A x and y^T B y, and uses the domain-specific components L_x^C and L_y^C to form the cross term, making our model more effective for cross-domain matching.
In [4], Chen et al. extended the classical Bayesian face model by learning the joint distributions (i.e., intra-person and extra-person variations) of sample pairs. Their decision function is posed in the following form:

r(x, y) = x^T A x + y^T A y - 2 x^T G y.   (9)
Note that the similarity metric model proposed in [14] also adopts such a form. Interestingly, this decision function is also a special variant of our model, obtained by setting B = A, C = -G, d = 0, e = 0, and f = 0.
In summary, our similarity model can be regarded as a generalization of many existing cross-domain matching and metric learning models, and it is more flexible and suitable for cross-domain visual data matching.
4 Joint Similarity and Feature Learning
In this section, we introduce our deep architecture that integrates the generalized similarity measure with convolutional feature representation learning.
4.1 Deep Architecture
As discussed above, our model defined in Eqn. (7) jointly handles similarity function learning and feature learning. This integration is achieved by building a deep architecture of convolutional neural networks, illustrated in Figure 3. It is worth mentioning that our architecture is able to handle input samples of the two modalities in unequal numbers, i.e., the numbers of samples from the two domains fed into the network in one batch may differ.
From left to right in Figure 3, two domain-specific subnetworks are first applied to the samples of the two modalities, respectively. Their outputs are then superposed and fed into a shared subnetwork. At the output of the shared subnetwork, the feature representations of the two kinds of samples are extracted separately as f_1(x) and f_2(y), which is indicated by the slice operator in Figure 3. Finally, these learned feature representations are utilized in the structured fully-connected layers that incorporate the similarity components defined in Eqn. (3). In the following, we introduce the detailed settings of the three subnetworks.
Domain-specific subnetwork. We use two separate network branches to handle the samples from the two domains. Each branch includes one convolutional layer with rectified linear activation, followed by a max-pooling operation.

Shared subnetwork. For this component, we stack one convolutional layer (followed by a max-pooling operation) and two fully-connected layers. We further normalize the output of the second fully-connected layer before it is fed to the next subnetwork.

Similarity subnetwork. A slice operator is first applied in this subnetwork, partitioning the output vectors into two groups corresponding to the two domains. For the example in Figure 3, the vectors are grouped into two sets according to the domains of their source samples, and both sets share the same feature dimension. Then the two sets are fed to two branches of the network, each of which includes a fully-connected layer. We divide the activations of these two layers into six parts according to the six similarity components. As shown in Figure 3, the neural layer in the top branch connects to f_1(x) and outputs L_A f_1(x), L_x^C f_1(x), and d^T f_1(x), respectively. In the bottom branch, the layer outputs L_B f_2(y), L_y^C f_2(y), and e^T f_2(y), respectively, by connecting to f_2(y). In this way, the similarity measure is tightly integrated with the feature representations, and the two can be jointly optimized during model training. Note that f is a parameter of the generalized similarity measure in Eqn. (1). Experiments show that the value of f only affects the learning convergence rather than the matching performance, so we fix it empirically in our experiments.
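The dataflow through the three subnetworks (branch, superpose, share, slice) can be sketched with toy dense layers. All shapes, sizes, and nonlinearities below are placeholders of our own choosing, not the actual network configuration:

```python
import numpy as np

rng = np.random.default_rng(2)

def branch(weights, batch):
    # stand-in for a domain-specific subnetwork
    return np.tanh(batch @ weights)

W1 = rng.normal(size=(16, 8))       # first-domain branch
W2 = rng.normal(size=(16, 8))       # second-domain branch
W_shared = rng.normal(size=(8, 4))  # shared subnetwork

# Unequal numbers of samples per domain are allowed in one batch.
xs = rng.normal(size=(5, 16))
ys = rng.normal(size=(3, 16))

stacked = np.vstack([branch(W1, xs), branch(W2, ys)])  # superposition
shared_out = np.tanh(stacked @ W_shared)               # shared layers
fx, fy = shared_out[:len(xs)], shared_out[len(xs):]    # slice operator
assert fx.shape == (5, 4) and fy.shape == (3, 4)
```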
In the deep architecture, we can observe that the similarity components of x and those of y do not interact with each other through the factorization until the final aggregation calculation; that is, computing the components of x is independent of y. This leads to a good property of efficient matching. In particular, for each sample stored in a database, we can precompute its feature representation and the corresponding similarity components, so that the similarity matching in the testing stage is very fast.
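A sketch of this precomputation, assuming the factorized components A = L_A^T L_A, B = L_B^T L_B, and C = -(L_x^C)^T L_y^C (the variable names are ours, and any constant factors are folded into d and e):

```python
import numpy as np

rng = np.random.default_rng(1)
dim, n_gallery = 8, 100
LA, LCx = rng.normal(size=(dim, dim)), rng.normal(size=(dim, dim))
LB, LCy = rng.normal(size=(dim, dim)), rng.normal(size=(dim, dim))
d, e, f = rng.normal(size=dim), rng.normal(size=dim), 1.0

# Offline: cache the gallery-side similarity components once.
gallery = rng.normal(size=(n_gallery, dim))     # features f2(y)
g_norm = np.sum((gallery @ LB.T) ** 2, axis=1)  # ||L_B y||^2 per item
g_cross = gallery @ LCy.T                       # L_C^y y per item
g_bias = gallery @ e                            # e' y per item

# Online: score one probe against the whole gallery in one matmul.
x = rng.normal(size=dim)                        # feature f1(x)
scores = (np.sum((LA @ x) ** 2) + g_norm
          - 2.0 * (g_cross @ (LCx @ x)) + d @ x + g_bias + f)
assert scores.shape == (n_gallery,)
```

Only the probe-side components are computed at query time; the gallery-side terms are read from the cache.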
4.2 Model Training
In this section, we discuss the learning method for our similarity model training. To avoid loading all images into memory, we use the minibatch learning approach, that is, in each training iteration, a subset of the image pairs are fed into the neural network for model optimization.
For notational simplicity in discussing the learning algorithm, we start by introducing the following definitions:
(10) 
where f_1(x) and f_2(y) denote the output-layer activations for the samples x and y. Prior to incorporating Eqn. (10) into the similarity model in Eqn. (3), we introduce three transformation matrices (using Matlab notation):
(11) 
where r equals the dimension of the output of the shared subnetwork (i.e., the dimension of f_1(x) and f_2(y)), and I denotes the identity matrix. Then, our similarity model can be rewritten as:
(12) 
(13) 
where the summation term denotes the hinge-like loss for the cross-domain sample pairs, N is the total number of pairs, W represents the feature representation parameters of the two domains, and Φ represents the similarity model parameters. W and Φ are both embedded as weights connecting neurons of layers in our deep neural network model, as Figure 3 illustrates.
The objective function in Eqn. (13) is defined in a sample-pair-based form. To optimize it using SGD, one should apply a certain scheme to generate mini-batches of sample pairs, which usually costs much computation and memory. Note that the sample pairs in the training set are constructed from the original sets of samples of the two modalities, {x^(p)}_{p=1}^{N_1} and {y^(q)}_{q=1}^{N_2}, where N_1 and N_2 denote the total numbers of samples from the two domains. The superscript denotes the sample index in the original training set, while the subscript denotes the index of a sample pair. For each pair {x_i, y_i} in the pair set, there exist indices p and q such that x_i = x^(p) and y_i = y^(q), and the pair label ℓ_i is determined by the class labels of x^(p) and y^(q).
Therefore, we rewrite Eqn. (13) in a samplebased form:
(14) 
Accordingly, the loss function in Eqn. (7) can also be rewritten in the sample-based form:
(15) 
The objective in Eqn. (15) can be optimized by the mini-batch backpropagation algorithm. Specifically, we update the parameters by gradient descent:
(16) 
where λ denotes the learning rate. The key problem in solving the above equation is calculating the gradients of the objective with respect to the network parameters. As discussed in [31], there are two ways to this end, i.e., pair-based gradient descent and sample-based gradient descent. Here we adopt the latter to reduce the computation and memory cost.
Suppose that a mini-batch of training samples is drawn from the original sets of the two domains. Following the chain rule, calculating the gradient for all pairs of samples is equivalent to summing up the gradients for the individual samples,
(17) 
where the sample can be either x or y.
Taking the samples x of the first domain as an example, we first introduce an indicator function before calculating the partial derivative of the output-layer activation for each sample. Specifically, we let the indicator equal 1 when the corresponding cross-domain pair is present in the batch, and 0 otherwise, and use the pair label to encode whether the two samples are from the same class. With this indicator, the gradient with respect to the activation of x can be written as
(18) 
The calculation for the samples of the other domain can be conducted in a similar way. The algorithm for calculating the partial derivative of the output-layer activation for each sample is summarized in Algorithm 1.
Note that all three subnetworks in our deep architecture are differentiable. We can thus use the standard backpropagation procedure [19] to compute the partial derivatives with respect to the hidden layers and the model parameters. We summarize the overall procedure of deep generalized similarity measure learning in Algorithm 2.
If all possible pairs are used in training, the sample-based form allows us to generate all cross-domain pairs from one mini-batch of samples, whereas the sample-pair-based form would require loading many more (possibly duplicated) samples to generate the same number of pairs. In gradient computation, following Eqn. (18), the sample-based form calculates the activation and gradient of each sample only once, while the sample-pair-based form recomputes them for every pair in which the sample participates. In sum, the sample-based form generally incurs less computation and memory cost.
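The bookkeeping behind this comparison is simple. With the illustrative batch sizes below (our own numbers), a pair-based batch would embed each image many times over:

```python
# Cost of one mini-batch: sample-based vs. pair-based (illustrative).
n1, n2 = 60, 60                 # samples drawn from each domain
pairs = n1 * n2                 # cross-domain pairs derivable from them
forward_sample_based = n1 + n2  # each sample is embedded exactly once
forward_pair_based = 2 * pairs  # each pair embeds both of its members

assert pairs == 3600
assert forward_pair_based // forward_sample_based == 60
```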
Batch process implementation. Suppose that the training image set is divided into a number of categories, each containing images from both domains. We can thus obtain a maximum number of pairwise samples that is quadratically larger than the number of source images. In real applications, since the number of stored images may reach millions, it is impossible to load all the data for network training. To overcome this problem, we implement our learning algorithm in a batch-process manner: in each iteration, only a small subset of cross-domain image pairs is generated and fed to the network for training. According to our experiments, randomly generating image pairs is infeasible: the image distribution over such a batch becomes scattered, so the valid training samples for any given category are very few and the model degenerates. Besides, the two images in a randomly generated pair almost never come from the same class, making positive samples very rare. To overcome this problem, an effective cross-domain image pair generation scheme is adopted to train our generalized similarity model. In each round, we first randomly choose a set of instance categories. For each chosen category, a number of images from the first domain and a number from the second domain are randomly selected. For each selected image in the first domain, we randomly take samples from the second domain such that the proportions of positive and negative samples are equal. In this way, the images distributed over the generated samples are relatively concentrated and the model converges effectively.
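A minimal sketch of this class-centred pair generation; the function, its arguments, and the exact counts are illustrative, and `classes` is assumed to map each label to its two per-domain sample lists:

```python
import random

def generate_batch(classes, n_classes=60, per_class=80, seed=0):
    """Pick a few classes, then build an equal number of positive and
    negative cross-domain pairs around each picked class, so that the
    batch stays concentrated on the chosen classes."""
    rng = random.Random(seed)
    picked = rng.sample(sorted(classes), n_classes)
    batch = []
    for c in picked:
        xs, _ = classes[c]
        for _ in range(per_class // 2):
            # positive pair: both members share class c
            batch.append((rng.choice(xs), rng.choice(classes[c][1]), 1))
            # negative pair: second member drawn from another class
            other = rng.choice([k for k in picked if k != c])
            batch.append((rng.choice(xs),
                          rng.choice(classes[other][1]), -1))
    return batch
```

With 60 classes and 80 pairs per class this yields a 4,800-pair batch, half of it positive.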
5 Experiments
In this section, we apply our similarity model to four representative tasks of matching cross-domain visual data, adopting several benchmark datasets for evaluation: i) person re-identification under different views on the CUHK03 [28] and CUHK01 [32] datasets; ii) age-invariant face recognition on the MORPH [33], CACD [34], and CACD-VS [35] datasets; iii) sketch-to-photo face matching on the CUFS dataset [36]; and iv) face verification over still-video domains on the COX face dataset [37]. On all these tasks, state-of-the-art methods are employed for comparison with our model.
Experimental setting. Mini-batch learning is adopted in our experiments to save memory. In each task, we randomly select a batch of samples from the original training set to generate a number of pairs. The initial parameters of the convolutional and fully-connected layers are drawn from two zero-mean Gaussian distributions with respective standard deviations set empirically. Other settings specific to individual tasks are included in the following subsections.

In addition, ablation studies are presented to reveal the benefit of each main component of our method, e.g., the generalized similarity measure and the joint optimization of the CNN feature representation and the metric model. We also implement several variants of our method with simplified similarity measures for comparison.
5.1 Person Re-identification
Person re-identification, which aims at matching pedestrian images across multiple non-overlapping cameras, has attracted increasing attention in surveillance. Despite considerable efforts, it remains an open problem due to the dramatic variations caused by viewpoint and pose changes. To evaluate this task, the CUHK03 [28] and CUHK01 [32] datasets are adopted in our experiments.
CUHK03 dataset [28] is one of the largest databases for person re-identification. It contains 14,096 images of 1,467 pedestrians collected from 5 different pairs of camera views. Each person is observed by two disjoint camera views and has an average of 4.8 images in each view. We follow the standard setting of CUHK03 and randomly partition the dataset 10 times into a training set (1,367 persons) and a testing set (100 persons) without overlap.
CUHK01 dataset [32] contains 971 individuals, each having two samples from disjoint cameras. Following the setting in [28][25], we partition this dataset into a training set and a testing set: 100 individuals for testing and the others for training.
For evaluation on these two benchmarks, the testing set is further randomly divided, 10 times, into a gallery set of 100 images (i.e., one image per person) and a probe set (containing images of the same individuals from the other camera views) without overlap. We use the Cumulative Matching Characteristic (CMC) [38] as the evaluation metric in this task.
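For reference, a CMC curve reports, for each rank r, the fraction of probes whose true match appears among the top-r gallery items. A minimal sketch (our own implementation, assuming every probe identity occurs in the gallery and higher scores mean better matches):

```python
import numpy as np

def cmc(scores, gallery_ids, probe_ids, max_rank=10):
    """scores[i, j] is the similarity of probe i to gallery item j.
    Returns the CMC accuracies for ranks 1..max_rank."""
    order = np.argsort(-scores, axis=1)             # best match first
    ranked = np.asarray(gallery_ids)[order]
    hits = ranked == np.asarray(probe_ids)[:, None]
    first_hit = hits.argmax(axis=1)                 # rank of true match
    return [float((first_hit < r).mean()) for r in range(1, max_rank + 1)]
```

For a single-shot gallery (one image per person, as in the protocol above), each probe row contains exactly one true match.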
In our model training, all of the images are resized to a fixed size and cropped at the center with a small random perturbation. During every round of learning, 4,800 pairs of samples are constructed by selecting 60 persons (or classes) and constructing 80 pairs for each person (class). For CUHK01, since each individual has only two samples, the 80 pairs per individual contain some duplicated pairs.
Results on CUHK03. We compare our approach with several state-of-the-art methods, which can be grouped into three categories. First, we adopt five distance metric learning methods based on fixed feature representations, i.e., the Information Theoretic Metric Learning (ITML) [5], the Local Distance Metric Learning (LDM) [39], the Large Margin Nearest Neighbors (LMNN) [40], the learning-to-rank method (RANK) [41], and the Kernel-based Metric Learning method (KML) [23]. Following their implementations, hand-crafted features of dense color histograms and dense SIFT uniformly sampled from image patches are adopted. Second, three methods specially designed for person re-identification are employed in the experiments: SDALF [42], KISSME [43], and eSDC [44]. Moreover, several recently proposed deep learning methods, including DRSCH [45], DFPNN [28], and IDLA [25], are also compared with our approach. DRSCH [45] is a supervised hashing framework integrating CNN feature and hash code learning, while DFPNN and IDLA have been introduced in Section 2.
The results are reported in Fig. 4 (a). It is encouraging to see that our approach significantly outperforms the competing methods (e.g., improving the state-of-the-art rank-1 accuracy from 54.74% (IDLA [25]) to 58.39%). Among the competing methods, ITML [5], LDM [39], LMNN [40], RANK [41], KML [23], SDALF [42], KISSME [43], and eSDC [44] are all based on hand-crafted features, and the superiority of our approach over them should be attributed to the deployment of both deep CNN features and the generalized similarity model. DRSCH [45], DFPNN [28], and IDLA [25] adopt CNNs for feature representation, but their matching metrics are defined based on traditional linear transformations.
Results on CUHK01. Fig. 4 (b) shows the results of our method and the other competing approaches on CUHK01. In addition to those used on CUHK03, one more method, LMLF [24], which learns mid-level filters from automatically discovered patch clusters, is included in the comparison. According to the quantitative results, our method achieves a new state of the art with a rank-1 accuracy of 66.50%.
5.2 Age-invariant Face Recognition
Age-invariant face recognition aims to decide whether two images taken at different ages belong to the same identity. The key challenge is to handle the large intra-subject variations caused by the aging process while still distinguishing different identities. Other factors, such as illumination, pose, and expression, make age-invariant face recognition even more difficult. We conduct the experiments using three datasets, i.e., MORPH [33], CACD [34], and CACD-VS [35].
MORPH [33] contains more than 55,000 face images of 13,000 individuals, whose ages range from 16 to 77; the average number of images per individual is 4. The training set consists of 20,000 face images from 10,000 subjects, with each subject having the two images with the largest age gap. The test set is composed of a gallery set and a probe set from the remaining 3,000 subjects: the gallery set contains the youngest face image of each subject, and the probe set contains the oldest face image of each subject. This experimental setting is the same as those adopted in [46] and [34].
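The gallery/probe construction described above (youngest image per subject as gallery, oldest as probe) can be sketched as follows; the record format is hypothetical:

```python
from collections import defaultdict

def build_morph_split(records):
    """records: list of (subject_id, age, image_path) tuples.
    Gallery = youngest image per subject, probe = oldest, per the protocol above."""
    by_subject = defaultdict(list)
    for sid, age, path in records:
        by_subject[sid].append((age, path))
    gallery, probe = {}, {}
    for sid, imgs in by_subject.items():
        imgs.sort()                    # ascending age
        gallery[sid] = imgs[0][1]      # youngest image
        probe[sid] = imgs[-1][1]       # oldest image
    return gallery, probe
```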
CACD [34] is a large-scale dataset released in 2014, which contains more than 160,000 images of 2,000 celebrities. We adopt a subset of 580 individuals from the whole database in our experiment, from which we manually remove the noisy images. Among these 580 individuals, the labels of images from 200 individuals were originally provided, and we annotate the rest of the data. CACD includes large variations not only in pose, illumination, and expression but also in age. Based on CACD, a verification subset called CACD-VS [35] has been further developed, which contains 2,000 positive pairs and 2,000 negative pairs. The setting and testing protocol of CACD-VS are similar to those of the well-known LFW benchmark [47], except that CACD-VS contains many more samples for each person.
All of the images are resized to . For data augmentation, images are cropped to the size of at the center with a small random perturbation when fed to the neural network. The sample-based mini-batch setting is adopted, and 4,800 pairs are constructed for each iteration.
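The center crop with a small random perturbation can be sketched as follows (the crop size and jitter magnitude are placeholders, since the exact values are not reproduced here):

```python
import random

def jittered_center_crop(img_w, img_h, crop_w, crop_h, max_jitter=4):
    """Return the (left, top, right, bottom) box of a center crop
    shifted by a small random offset, as used for data augmentation above."""
    cx = (img_w - crop_w) // 2 + random.randint(-max_jitter, max_jitter)
    cy = (img_h - crop_h) // 2 + random.randint(-max_jitter, max_jitter)
    cx = max(0, min(cx, img_w - crop_w))   # clamp so the box stays inside the image
    cy = max(0, min(cy, img_h - crop_h))
    return (cx, cy, cx + crop_w, cy + crop_h)
```

The returned box can be passed directly to, e.g., PIL's `Image.crop`.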
Results on MORPH. We compare our method with several state-of-the-art methods, including the topological dynamic Bayesian network (TDBN) [48], cross-age reference coding (CARC) [34], probabilistic hidden factor analysis (HFA) [46], multi-feature discriminant analysis (MFDA) [49], and the 3D aging model [50]. The results are reported in Table I (a). Thanks to the use of the CNN representation and the generalized similarity measure, our method achieves a recognition rate of 94.35% and significantly outperforms the competing methods.

(a) Recognition rates on the MORPH dataset.

Method  Recognition rate 
TDBN [48]  60% 
3D Aging Model [50]  79.8% 
MFDA [49]  83.9% 
HFA [46]  91.1% 
CARC [34]  92.8% 
Ours  94.4% 
(b) Verification accuracy on the CACD-VS dataset.

Method  Verification accuracy
HDLBP [51]  81.6%
HFA [46]  84.4%
Deepface [52]  85.4%
CARC [34]  87.6%
Ours  89.8%
Results on CACD. On this dataset, the protocol is to retrieve face images of the same individual from gallery sets by using a probe set, where the age gap between probe face images and gallery face images is large. Following the experimental setting in [34], we set up 4 gallery sets according to the years when the photos were taken: , , , and . We use the set of as the probe set to search for matches in the remaining three sets. We introduce several state-of-the-art methods for comparison, including CARC [34], HFA [46], and one deep learning based method, Deepface [52]. The results of CARC [34] and HFA [46] are taken from their papers. The results of Deepface [52] and our approach (i.e., Ours-1) are obtained on the 200 originally annotated individuals, where 160 samples are used for model training. According to the quantitative results reported in Figure 5, our model achieves superior performance over the competing methods. Furthermore, we also report the result of our method (i.e., Ours-2) obtained by using images of 500 individuals as training samples. One can see that the performance of our model can be further improved by increasing the training data.
Results on CACD-VS. Following the setting in [35], we further evaluate our approach by conducting the general face verification experiment. Specifically, for all of the competing methods, we train the models on CACD and test on CACD-VS, and the optimal threshold value for matching is obtained by exhaustive search. The results produced by our method and the others (i.e., CARC [34], HFA [46], HDLBP [51], and Deepface [52]) are reported in Table I (b). It is worth mentioning that our method improves the state-of-the-art recognition rate from 87.6% (by CARC [34]) to 89.8%. Thanks to the introduction of the generalized similarity measure, our approach achieves higher verification accuracy than Deepface. Note that an explicit face alignment was adopted in [52] before CNN feature extraction, which is not included in our framework.
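The exhaustive threshold search used for verification amounts to sweeping every observed pair score and keeping the threshold with the highest accuracy; a minimal sketch (not the authors' code):

```python
import numpy as np

def best_threshold(scores, labels):
    """scores: similarity per pair; labels: 1 = same identity, 0 = different.
    Try every observed score as a threshold and return the most accurate one."""
    best_t, best_acc = None, -1.0
    for t in np.unique(scores):
        acc = np.mean((scores >= t) == (labels == 1))
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc
```

In practice the threshold is chosen on a held-out split and then applied to the test pairs.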
5.3 Sketch-photo Face Verification
Sketch-photo face verification is an interesting yet challenging task, which aims to verify whether a face photo and a drawn face sketch belong to the same individual. This task has an important application in assisting law enforcement, i.e., using a face sketch to find candidate face photos. It is, however, difficult to match photos and sketches across the two modalities. For example, hand drawing may introduce unpredictable facial distortion and variation compared with the real photo, and face sketches often lack details that can be important cues for preserving identity.
We evaluate our model on this task using the CUFS dataset [36]. There are 188 face photos in this dataset, of which 88 are selected for training and 100 for testing. Each face has a corresponding sketch drawn by an artist. All of the face photos are taken in frontal view under a normal lighting condition and with a neutral expression.
Fig. 6. Results of the ablation studies demonstrating the effectiveness of each main component of our framework. The CMC curve and recognition rate are used for evaluation. The results of different similarity models are shown using the hand-crafted features (in (a) and (b)) and using the deep features (in (c)-(f)), respectively. (g) and (h) show the performance with / without the deep feature learning while keeping the same similarity model.
All of the photos/sketches are resized to , and cropped to the size of at the center with a small random perturbation. 1,200 pairs of photos and sketches (i.e., 30 individuals, each with 40 pairs) are constructed for each iteration during model training. In the testing stage, we use the face photos to form the gallery set and treat the sketches as probes.
We employ several existing approaches for comparison: the eigenface transformation based method (ET) [53], the multi-scale Markov random field based method (MRF) [36], and MRF+ [54] (i.e., the lighting- and pose-robust version of [36]). It is worth mentioning that all of these competing methods need to first synthesize face sketches by photo-sketch transformation and then measure the similarity between the synthesized sketches and the candidate sketches, while our approach works in an end-to-end way. The quantitative results are reported in Table II. Our method achieves a 100% recognition rate on this dataset.
5.4 Still-video Face Recognition
Table III. Recognition rates of the V2S and S2V tasks.

Method  V2S  S2V
PSD [55]  9.90%  11.64%
PMD [56]  6.40%  6.10%
PAHD [57]  4.70%  6.34%
PCHD [58]  7.93%  8.89%
PSDML [59]  12.14%  7.04%
PSCLEA [37]  30.33%  28.39%
Ours  28.45%  29.02%
Matching person faces across still images and videos is an emerging task in intelligent visual surveillance. In these applications, the still images (e.g., ID photos) are usually captured under a controlled environment, while the faces in surveillance videos are acquired under complex scenarios (e.g., various lighting conditions, occlusions, and low resolutions).
For this task, a large-scale still-video face recognition dataset, namely the COX face dataset, has been released recently (the COX face DB is collected by the Institute of Computing Technology, Chinese Academy of Sciences, OMRON Social Solutions Co. Ltd., and Xinjiang University); it is an extension of the COX-S2V dataset [60]. The COX face dataset includes 1,000 subjects, each having one high-quality still image and 3 video clips respectively captured by 3 cameras. Since these cameras are deployed under similar environments (e.g., similar results are reported for the three cameras in [37]), we use the data captured by the first camera in our experiments.
Following the setting of the COX face dataset, we divide the data into a training set (300 subjects) and a testing set (700 subjects), and conduct the experiments with 10 random splits. There are two sub-tasks in the testing: i) matching video frames to still images (V2S) and ii) matching still images to video frames (S2V). For the V2S task we use the video frames as probes and form the gallery set from the still images, and inversely for the S2V task. The split of gallery/probe sets is also consistent with the protocol specified by the dataset creators. All of the images are resized to , and cropped to the size of with a small random perturbation. 1,200 pairs of still images and video frames (i.e., 20 individuals, each with 60 pairs) are constructed for each iteration during model training.
Unlike traditional image-based verification problems, both V2S and S2V are defined as point-to-set matching problems, i.e., one still image is matched against several video frames (i.e., 10 sampled frames). In the evaluation, we calculate the distance between the still image and each video frame using our model and output the average value over all of these distances. For comparison, we employ several existing point-to-set distance metrics: dual-space linear discriminant analysis (PSD) [55], manifold-manifold distance (PMD) [56], hyperplane-based distance (PAHD) [57], kernelized convex geometric distance (PCHD) [58], and covariance kernel based distance (PSDML) [59]. We also compare with the point-to-set correlation learning (PSCLEA) method [37], which was specially developed for the COX face dataset. The recognition rates of all competing methods are reported in Table III; our method achieves excellent performance, i.e., the best in S2V and the second best in V2S. The experiments show that our approach can generally improve performance in image-to-image, image-to-video, and video-to-image matching applications.

5.5 Ablation Studies
In order to provide more insights into the performance of our approach, we conduct a number of ablation studies by isolating each main component (e.g., the generalized similarity measure and feature learning). Besides, we also study the effect of using the sample-pair-based and sample-based batch settings in terms of convergence efficiency.
Generalized Similarity Model. We design two experiments, using hand-crafted features and deep features, respectively, to justify the effectiveness of our generalized similarity measure.
(i) We test our similarity measure using fixed hand-crafted features for person re-identification. The experimental results on CUHK01 and CUHK03 clearly demonstrate the effectiveness of our model against the other similarity models without relying on deep feature learning. Following [44], we extract the feature representation using patch-based color histograms and dense SIFT descriptors. This feature representation is fed into a fully-connected layer for dimensionality reduction to obtain a 400-dimensional vector. We then invoke the similarity sub-network (described in Section 4) to output the measure. On both CUHK01 and CUHK03, we adopt several representative similarity metrics for comparison, i.e., ITML [5], LDM [39], LMNN [40], and RANK [41], using the same feature representation.
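A patch-based color histogram of the kind used in this feature pipeline can be sketched as follows (the patch size and bin count are illustrative, and the dense SIFT part is omitted):

```python
import numpy as np

def patch_color_histograms(img, patch=8, bins=8):
    """Concatenate per-patch, per-channel color histograms for an HxWx3 image
    with values in [0, 1]; patches are taken on a non-overlapping grid."""
    h, w, _ = img.shape
    feats = []
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            block = img[y:y + patch, x:x + patch]
            for c in range(3):
                hist, _ = np.histogram(block[:, :, c], bins=bins, range=(0.0, 1.0))
                feats.append(hist / hist.sum())   # normalize each histogram
            # (dense SIFT descriptors would be concatenated similarly)
    return np.concatenate(feats)
```

The concatenated vector would then be reduced to 400 dimensions by the fully-connected layer described above.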
The quantitative CMC curves and the recognition rates of all these competing models are shown in Fig. 6 (a) and (b) for CUHK03 and CUHK01, respectively, where "Generalized" denotes our similarity measure. It is observed that our model outperforms the others by large margins, e.g., achieving a rank-1 accuracy of 31.85% against 13.51% by LDM on CUHK03. Most of these competing methods learn Mahalanobis distance metrics. In contrast, our metric model combines the Mahalanobis distance with the Cosine similarity in a generic form, leading to a more general and effective solution for matching cross-domain data.
(ii) On the other hand, we incorporate several representative similarity measures into our deep architecture and jointly optimize these measures with the CNN feature learning. Specifically, we simplify our network architecture by removing the top layer (i.e., the similarity model) and measure the similarity either in the Euclidean embedding space (as Baseline-1) or in the inner-product space (as Baseline-2). These two variants can be viewed as two degenerations of our similarity measure (i.e., the affine Euclidean distance and the affine Cosine similarity). To support our discussion in Section 3.2, we also adopt the two distance metric models LADF [3] and BFR (i.e., Joint Bayesian) [4] into our deep neural networks; specifically, we replace our similarity model by the LADF model defined in Eqn. (8) and the BFR model defined in Eqn. (9), respectively. Moreover, we implement one more variant (denoted as "Linear" in this experiment), which applies separate linear transformations to each data modality without the affine offsets; that is, we remove the affine transformation while keeping the separate linear transformations by setting , and in Eqn. (1). Note that the way of incorporating these metric models into the deep architecture is analogous to that of our metric model. The experiment is conducted on four benchmarks, CUHK03, MORPH, COX-V2S, and COX-S2V, and the results are shown in Figure 6 (c), (d), (e), and (f), respectively. Our method outperforms the competing methods by large margins on MORPH and the COX face dataset. On CUHK03 (i.e., Fig. 6 (c)), our method achieves the best rank-1 identification rate (i.e., ) among all the methods. In particular, the performance drops by when the affine transformation is removed on CUHK03.
It is interesting that most of these competing methods can be treated as special cases of our model, and our generalized similarity model can take full advantage of convolutional feature learning through the dedicated deep architecture, consistently achieving superior performance over the other variants.
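The degeneration argument can be made concrete with the generic quadratic form: choosing the blocks appropriately recovers, e.g., the squared Euclidean distance. A small numerical check (the block names follow the generic structure of Eqn. (1); the helper itself is illustrative):

```python
import numpy as np

def generalized_similarity(x, y, A, B, C, d, e, f):
    """S(x, y) = [x; y; 1]^T [[A, C, d], [C^T, B, e], [d^T, e^T, f]] [x; y; 1]."""
    z = np.concatenate([x, y, [1.0]])
    M = np.block([[A, C, d[:, None]],
                  [C.T, B, e[:, None]],
                  [d[None, :], e[None, :], np.array([[f]])]])
    return z @ M @ z

x = np.array([1.0, 2.0]); y = np.array([0.5, -1.0])
I = np.eye(2); zero = np.zeros(2)
# A = B = I, C = -I, d = e = 0, f = 0 reduces S(x, y) to ||x - y||^2:
s = generalized_similarity(x, y, I, I, -I, zero, zero, 0.0)
assert np.isclose(s, np.sum((x - y) ** 2))
```

Other choices of the blocks similarly recover the inner product and related baseline metrics.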
Deep Feature Learning. To show the benefit of deep feature learning, we adopt the hand-crafted features (i.e., color histograms and SIFT descriptors) on the CUHK01 and CUHK03 benchmarks. Specifically, we extract this feature representation from the patches of pedestrian images and then build the similarity measure for person re-identification. The results on CUHK03 and CUHK01 are reported in Fig. 6 (g) and (h), respectively. We denote the result obtained using the hand-crafted features as "hand.fea + gen.sim" and the result obtained by end-to-end deep feature learning as "deep.fea + gen.sim". It is obvious that without the deep feature representation the performance drops significantly, e.g., from 58.4% to 31.85% on CUHK03 and from 66.5% to 39.5% on CUHK01. These results clearly demonstrate the effectiveness of utilizing deep CNNs for learning discriminative feature representations.
Sample-pair-based vs. sample-based batch setting. In addition, we conduct an experiment on the CUHK03 dataset to compare the sample-pair-based and sample-based settings in terms of convergence efficiency. Specifically, for the sample-based batch setting, we select 600 images from 60 people and construct 60,000 pairs in each training iteration. For the sample-pair-based batch setting, 300 pairs are randomly constructed. Note that each person on CUHK03 has 10 images; thus, 600 images are included in each iteration, and the training time per iteration is almost the same for both settings. Our experiment shows that in the sample-based batch setting, the model achieves a rank-1 accuracy of after about 175,000 iterations, while in the other setting the rank-1 accuracy is after 300,000 iterations. These results validate the effectiveness of the sample-based form in saving training cost.
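The sample-based batch setting enumerates pairs from the sampled images instead of sampling pairs directly; the pair construction can be sketched as follows (the negative-sampling strategy shown is an assumption, not the authors' exact sampler):

```python
import itertools
import random

def build_pairs(labels, num_pairs):
    """Enumerate pairs among the sampled images: all same-identity pairs,
    plus randomly sampled cross-identity pairs up to num_pairs total."""
    idx = range(len(labels))
    pos = [(i, j) for i, j in itertools.combinations(idx, 2) if labels[i] == labels[j]]
    neg = [(i, j) for i, j in itertools.combinations(idx, 2) if labels[i] != labels[j]]
    random.shuffle(neg)
    return pos + neg[:max(0, num_pairs - len(pos))]
```

The key point is that a batch of n images yields O(n^2) pairs at the cost of n forward passes, which is why the sample-based setting converges with far less computation per pair.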
6 Conclusion
In this work, we have presented a novel generalized similarity model for cross-domain matching of visual data, which generalizes the traditional two-step methods (i.e., projection followed by a distance-based measure). Furthermore, we integrated our model with feature representation learning by building a deep convolutional architecture. Experiments were performed on several very challenging benchmark datasets for cross-domain matching. The results show that our method outperforms other state-of-the-art approaches.
There are several directions along which we intend to extend this work. The first is to extend our approach to larger-scale heterogeneous data (e.g., web and user behavior data), thereby exploring new applications (e.g., rich information retrieval). Second, we plan to generalize the pairwise similarity metric to triplet-based learning for more effective model training.
Derivation of Equation (1)

As discussed in Section 1, we extend the two linear projections $\mathbf{U}$ and $\mathbf{V}$ into affine transformations and apply them on samples of different domains, $\mathbf{x}$ and $\mathbf{y}$, respectively. That is, we replace $\mathbf{U}\mathbf{x}$ and $\mathbf{V}\mathbf{y}$ with $\mathbf{L}_A\mathbf{x} + \mathbf{a}$ and $\mathbf{L}_B\mathbf{y} + \mathbf{b}$, respectively. Then, the affine Mahalanobis distance is defined as:

$$D(\mathbf{x}, \mathbf{y}) = \left\| (\mathbf{L}_A\mathbf{x} + \mathbf{a}) - (\mathbf{L}_B\mathbf{y} + \mathbf{b}) \right\|_2^2,$$

where the matrix $\mathbf{M}_D$ of this quadratic form can be further unfolded as:

$$\mathbf{M}_D = \begin{bmatrix} \mathbf{L}_A^T\mathbf{L}_A & -\mathbf{L}_A^T\mathbf{L}_B & \mathbf{L}_A^T(\mathbf{a}-\mathbf{b}) \\ -\mathbf{L}_B^T\mathbf{L}_A & \mathbf{L}_B^T\mathbf{L}_B & -\mathbf{L}_B^T(\mathbf{a}-\mathbf{b}) \\ (\mathbf{a}-\mathbf{b})^T\mathbf{L}_A & -(\mathbf{a}-\mathbf{b})^T\mathbf{L}_B & \|\mathbf{a}-\mathbf{b}\|_2^2 \end{bmatrix}. \quad (20)$$

Furthermore, the affine Cosine similarity is defined as the inner product in the space of affine transformations:

$$S(\mathbf{x}, \mathbf{y}) = (\tilde{\mathbf{L}}_A\mathbf{x} + \tilde{\mathbf{a}})^T (\tilde{\mathbf{L}}_B\mathbf{y} + \tilde{\mathbf{b}}).$$

The corresponding matrix $\mathbf{M}_S$ is

$$\mathbf{M}_S = \frac{1}{2}\begin{bmatrix} \mathbf{0} & \tilde{\mathbf{L}}_A^T\tilde{\mathbf{L}}_B & \tilde{\mathbf{L}}_A^T\tilde{\mathbf{b}} \\ \tilde{\mathbf{L}}_B^T\tilde{\mathbf{L}}_A & \mathbf{0} & \tilde{\mathbf{L}}_B^T\tilde{\mathbf{a}} \\ \tilde{\mathbf{b}}^T\tilde{\mathbf{L}}_A & \tilde{\mathbf{a}}^T\tilde{\mathbf{L}}_B & 2\,\tilde{\mathbf{a}}^T\tilde{\mathbf{b}} \end{bmatrix}. \quad (22)$$

We propose to fuse $D(\mathbf{x}, \mathbf{y})$ and $S(\mathbf{x}, \mathbf{y})$ by a weighted aggregation as follows:

$$\tilde{S}(\mathbf{x}, \mathbf{y}) = D(\mathbf{x}, \mathbf{y}) - \mu\, S(\mathbf{x}, \mathbf{y}).$$

Note that $D(\mathbf{x}, \mathbf{y})$ is an affine distance (i.e., non-similarity) measure while $S(\mathbf{x}, \mathbf{y})$ is an affine similarity measure. Analogous to [14], we adopt $\mu$ ($\mu > 0$) to combine $D$ and $S$. The affine transformation parameters and the weight $\mu$ are automatically learned through our learning algorithm. Then, the matrix $\mathbf{M}$ can be obtained by fusing $\mathbf{M}_D$ and $\mathbf{M}_S$:

$$\mathbf{M} = \mathbf{M}_D - \mu\,\mathbf{M}_S = \begin{bmatrix} \mathbf{A} & \mathbf{C} & \mathbf{d} \\ \mathbf{C}^T & \mathbf{B} & \mathbf{e} \\ \mathbf{d}^T & \mathbf{e}^T & f \end{bmatrix}, \quad (24)$$

where

$$\begin{aligned} \mathbf{A} &= \mathbf{L}_A^T\mathbf{L}_A, & \mathbf{B} &= \mathbf{L}_B^T\mathbf{L}_B, \\ \mathbf{C} &= -\mathbf{L}_A^T\mathbf{L}_B - \tfrac{\mu}{2}\tilde{\mathbf{L}}_A^T\tilde{\mathbf{L}}_B, \\ \mathbf{d} &= \mathbf{L}_A^T(\mathbf{a}-\mathbf{b}) - \tfrac{\mu}{2}\tilde{\mathbf{L}}_A^T\tilde{\mathbf{b}}, \\ \mathbf{e} &= -\mathbf{L}_B^T(\mathbf{a}-\mathbf{b}) - \tfrac{\mu}{2}\tilde{\mathbf{L}}_B^T\tilde{\mathbf{a}}, \\ f &= \|\mathbf{a}-\mathbf{b}\|_2^2 - \mu\,\tilde{\mathbf{a}}^T\tilde{\mathbf{b}}. \end{aligned} \quad (25)$$

In the above equations, we use matrix (vector) variables, i.e., $\mathbf{A}$, $\mathbf{B}$, $\mathbf{C}$, $\mathbf{d}$, $\mathbf{e}$, and $f$, to represent the parameters of the generalized similarity model in a generic form. On one hand, given the affine transformation parameters and the weight $\mu$, these matrix variables can be directly determined using Eqn. (25). On the other hand, if we impose the positive semidefinite constraint on $\mathbf{A}$ and $\mathbf{B}$, it can be proved that once $\mathbf{A}$, $\mathbf{B}$, $\mathbf{C}$, $\mathbf{d}$, $\mathbf{e}$, and $f$ are determined there exists at least one solution of the affine transformation parameters; that is, $\tilde{S}(\mathbf{x}, \mathbf{y})$ is guaranteed to be decomposable into the weighted Mahalanobis distance and Cosine similarity. Therefore, the generalized similarity measure can be learned by optimizing $\mathbf{A}$, $\mathbf{B}$, $\mathbf{C}$, $\mathbf{d}$, $\mathbf{e}$, and $f$ under the positive semidefinite constraint on $\mathbf{A}$ and $\mathbf{B}$. In addition, $\mathbf{C}$ is not required to satisfy the positive semidefinite condition, and it may not be a square matrix when the dimensions of $\mathbf{x}$ and $\mathbf{y}$ are unequal.
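The fusion can be verified numerically: building the fused matrix from randomly chosen affine transformations and a positive weight, the quadratic form agrees with the weighted combination of the affine Mahalanobis distance and the affine Cosine similarity (the symbol names in the code are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
dx, dy = 3, 4
LA, a = rng.standard_normal((2, dx)), rng.standard_normal(2)    # affine map on x (distance part)
LB, b = rng.standard_normal((2, dy)), rng.standard_normal(2)    # affine map on y (distance part)
LAt, at = rng.standard_normal((2, dx)), rng.standard_normal(2)  # affine map on x (similarity part)
LBt, bt = rng.standard_normal((2, dy)), rng.standard_normal(2)  # affine map on y (similarity part)
mu = 0.7
x, y = rng.standard_normal(dx), rng.standard_normal(dy)

# Direct evaluation: affine Mahalanobis distance minus weighted affine Cosine similarity.
D = np.sum(((LA @ x + a) - (LB @ y + b)) ** 2)
S = (LAt @ x + at) @ (LBt @ y + bt)
direct = D - mu * S

# The same value as one quadratic form [x; y; 1]^T M [x; y; 1] with the fused blocks.
A = LA.T @ LA
B = LB.T @ LB
C = -LA.T @ LB - 0.5 * mu * LAt.T @ LBt
d = LA.T @ (a - b) - 0.5 * mu * LAt.T @ bt
e = -LB.T @ (a - b) - 0.5 * mu * LBt.T @ at
f = np.sum((a - b) ** 2) - mu * at @ bt
z = np.concatenate([x, y, [1.0]])
M = np.block([[A, C, d[:, None]],
              [C.T, B, e[:, None]],
              [d[None, :], e[None, :], np.array([[f]])]])
assert np.isclose(z @ M @ z, direct)
```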
Acknowledgment
This work was supported in part by Guangdong Natural Science Foundation under Grants S2013050014548 and 2014A030313201, in part by the Program of Guangzhou Zhujiang Star of Science and Technology under Grant 2013J2200067, and in part by the Fundamental Research Funds for the Central Universities. This work was also supported by the Special Program for Applied Research on Super Computation of the NSFC-Guangdong Joint Fund (the second phase).
References
 [1] D. Hardoon, S. Szedmak, and J. Shawe-Taylor, "Canonical correlation analysis: An overview with application to learning methods," Neural Comput., vol. 16, no. 12, pp. 2639–2664, 2004.
 [2] A. Sharma and D. W. Jacobs, "Bypassing synthesis: PLS for face recognition with pose, low-resolution and sketch," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit, 2011, pp. 593–600.
 [3] Z. Li, S. Chang, F. Liang, T. S. Huang, L. Cao, and J. R. Smith, "Learning locally-adaptive decision functions for person verification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit, 2013, pp. 3610–3617.
 [4] D. Chen, X. Cao, L. Wang, F. Wen, and J. Sun, "Bayesian face revisited: A joint formulation," in Proc. Eur. Conf. Comput. Vis. Springer, 2012, pp. 566–579.
 [5] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon, "Information-theoretic metric learning," in Proc. Int'l Conf. Mach. Learn. ACM, 2007, pp. 209–216.

 [6] Y. T. Zhuang, Y. F. Wang, F. Wu, Y. Zhang, and W. M. Lu, "Supervised coupled dictionary learning with group structures for multimodal retrieval," in Twenty-Seventh AAAI Conference on Artificial Intelligence, 2013.
 [7] S. Wang, D. Zhang, Y. Liang, and Q. Pan, "Semi-coupled dictionary learning with applications to image super-resolution and photo-sketch synthesis," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit, 2012, pp. 2216–2223.
 [8] L. Duan, D. Xu, and I. W. Tsang, "Domain adaptation from multiple sources: A domain-dependent regularization approach," IEEE Trans. Neural Networks Learn. Syst., vol. 23, no. 3, pp. 504–518, 2012.

 [9] D. Ramage, D. Hall, R. Nallapati, and C. D. Manning, "Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora," in Proc. Conf. Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2009, pp. 248–256.
 [10] J. Zhu, A. Ahmed, and E. P. Xing, "MedLDA: Maximum margin supervised topic models for regression and classification," in Proc. Int'l Conf. Mach. Learn. ACM, 2009, pp. 1257–1264.
 [11] A. Sharma, A. Kumar, H. Daume III, and D. W. Jacobs, “Generalized multiview analysis: A discriminative latent space,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit, 2012, pp. 2160–2167.
 [12] M. Kan, S. Shan, H. Zhang, S. Lao, and X. Chen, “Multiview discriminant analysis,” in Proc. Eur. Conf. Comput. Vis. Springer, 2012, pp. 808–821.
 [13] P. Luo, L. Lin, and X. Liu, “Learning compositional shape models of multiple distance metrics by information projection.” IEEE Trans. Neural Networks Learn. Syst., 2015.
 [14] Q. Cao, Y. Ying, and P. Li, “Similarity metric learning for face recognition,” in Proc. Int’l Conf. Comput. Vis. IEEE, 2013, pp. 2408–2415.
 [15] H. Chang and D.-Y. Yeung, "Locally smooth metric learning with application to image retrieval," in Proc. Intel. Conf. Comput. Vis, 2007.
 [16] A. Mignon and F. Jurie, "CMML: A new metric learning approach for cross modal matching," in Proc. Asian Conf. Comput. Vis, 2012.
 [17] X. Zhai, Y. Peng, and J. Xiao, "Heterogeneous metric learning with joint graph regularization for cross-media retrieval," in Twenty-Seventh AAAI Conference on Artificial Intelligence, June 2013.
 [18] C. Kang, S. Liao, Y. He, J. Wang, S. Xiang, and C. Pan, "Cross-modal similarity learning: A low rank bilinear formulation," Arxiv, vol. abs/1411.4738, 2014. [Online]. Available: http://arxiv.org/abs/1411.4738

 [19] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, "Backpropagation applied to handwritten zip code recognition," Neural Comput., vol. 1, no. 4, pp. 541–551, 1989.
 [20] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
 [21] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," arXiv preprint arXiv:1411.4038, 2014.
 [22] Y. Sun, Y. Chen, X. Wang, and X. Tang, “Deep learning face representation by joint identificationverification,” in Advances in Neural Information Processing Systems, 2014, pp. 1988–1996.
 [23] F. Xiong, M. Gou, O. Camps, and M. Sznaier, "Person re-identification using kernel-based metric learning methods," in Proc. Eur. Conf. Comput. Vis. Springer, 2014, pp. 1–16.
 [24] R. Zhao, W. Ouyang, and X. Wang, "Learning mid-level filters for person re-identification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. IEEE, 2014, pp. 144–151.
 [25] E. Ahmed, M. Jones, and T. K. Marks, "An improved deep learning architecture for person re-identification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. IEEE, 2015.
 [26] S. Chen, C. Guo, and J. Lai, "Deep ranking for person re-identification via joint representation learning," Arxiv, vol. abs/1505.06821, 2015. [Online]. Available: http://arxiv.org/abs/1505.06821
 [27] G. Andrew, R. Arora, J. Bilmes, and K. Livescu, “Deep canonical correlation analysis,” in Proc. IEEE the 30th Int’l Conf. Mach. Learn., 2013, pp. 1247–1255.
 [28] W. Li, R. Zhao, T. Xiao, and X. Wang, "DeepReID: Deep filter pairing neural network for person re-identification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit, 2014, pp. 152–159.
 [29] J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, and Y. Wu, "Learning fine-grained image similarity with deep ranking," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit, 2014, pp. 1386–1393.
 [30] D. Yi, Z. Lei, and S. Z. Li, "Deep metric learning for practical person re-identification," arXiv preprint arXiv:1407.4979, 2014.
 [31] S. Ding, L. Lin, G. Wang, and H. Chao, "Deep feature learning with relative distance comparison for person re-identification," Pattern Recognition, 2015.
 [32] W. Li, R. Zhao, and X. Wang, "Human re-identification with transferred metric learning," in Proc. Asian Conf. Comput. Vis, 2012, pp. 31–44.
 [33] K. Ricanek and T. Tesafaye, "MORPH: A longitudinal image database of normal adult age-progression," in Proc. IEEE Int'l Conf. Automatic Face and Gesture Recognition. IEEE, 2006, pp. 341–345.
 [34] B.-C. Chen, C.-S. Chen, and W. H. Hsu, "Cross-age reference coding for age-invariant face recognition and retrieval," in Proc. Eur. Conf. Comput. Vis. Springer, 2014, pp. 768–783.
 [35] B.-C. Chen, C.-S. Chen, and W. Hsu, "Face recognition and retrieval using cross-age reference coding with cross-age celebrity dataset," IEEE Trans. Multimedia, vol. 17, no. 6, pp. 804–815, 2015.
 [36] X. Wang and X. Tang, "Face photo-sketch synthesis and recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 11, pp. 1955–1967, 2009.
 [37] Z. Huang, S. Shan, R. Wang, H. Zhang, S. Lao, A. Kuerban, and X. Chen, "A benchmark and comparative study of video-based face recognition on COX face database," IEEE Trans. Image Processing, 2015.
 [38] D. Gray, S. Brennan, and H. Tao, "Evaluating appearance models for recognition, reacquisition, and tracking," in Proc. IEEE Int'l Conf. Workshop on Performance Evaluation for Tracking and Surveillance, vol. 3, no. 5. Citeseer, 2007.
 [39] M. Guillaumin, J. Verbeek, and C. Schmid, “Is that you? metric learning approaches for face identification,” in Proc. Int’l Conf. Comput. Vis, 2009, pp. 498–505.
 [40] K. Q. Weinberger, J. Blitzer, and L. K. Saul, “Distance metric learning for large margin nearest neighbor classification,” in Advances in Neural Information Processing Systems, 2005, pp. 1473–1480.
 [41] B. McFee and G. R. Lanckriet, “Metric learning to rank,” in Proc. Int’l Conf. Mach. Learn., 2010, pp. 775–782.
 [42] M. Farenzena, L. Bazzani, A. Perina, V. Murino, and M. Cristani, “Person reidentification by symmetrydriven accumulation of local features,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit, 2010, pp. 2360–2367.
 [43] M. Kostinger, M. Hirzer, P. Wohlhart, P. M. Roth, and H. Bischof, “Large scale metric learning from equivalence constraints,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit, 2012, pp. 2288–2295.
 [44] R. Zhao, W. Ouyang, and X. Wang, "Unsupervised salience learning for person re-identification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit, 2013, pp. 3586–3593.
 [45] R. Zhang, L. Lin, R. Zhang, W. Zuo, and L. Zhang, "Bit-scalable deep hashing with regularized similarity learning for image retrieval," IEEE Trans. Image Processing, vol. 24, no. 12, pp. 4766–4779, 2015.
 [46] D. Gong, Z. Li, D. Lin, J. Liu, and X. Tang, "Hidden factor analysis for age-invariant face recognition," in Proc. Int'l Conf. Comput. Vis, 2013, pp. 2872–2879.
 [47] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller, "Labeled faces in the wild: A database for studying face recognition in unconstrained environments," Technical Report 07-49, University of Massachusetts, Amherst, Tech. Rep., 2007.
 [48] D. Bouchaffra, "Mapping dynamic Bayesian networks to shapes: Application to human faces identification across ages," IEEE Trans. Neural Networks Learn. Syst., vol. 23, no. 8, pp. 1229–1241, 2012.
 [49] Z. Li, U. Park, and A. K. Jain, "A discriminative model for age invariant face recognition," IEEE Trans. Inf. Forensics Secur., vol. 6, no. 3, pp. 1028–1037, 2011.
 [50] U. Park, Y. Tong, and A. K. Jain, "Age-invariant face recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 5, pp. 947–954, 2010.
 [51] D. Chen, X. Cao, F. Wen, and J. Sun, "Blessing of dimensionality: High-dimensional feature and its efficient compression for face verification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit, 2013, pp. 3025–3032.
 [52] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, "DeepFace: Closing the gap to human-level performance in face verification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. IEEE, 2014, pp. 1701–1708.
 [53] X. Tang and X. Wang, “Face sketch recognition,” IEEE Trans. Circuits Syst. Video Technol., vol. 14, no. 1, pp. 50–57, 2004.
 [54] W. Zhang, X. Wang, and X. Tang, “Lighting and pose robust face sketch synthesis,” in Proc. Eur. Conf. Comput. Vis. Springer, 2010, pp. 420–433.
 [55] X. Wang and X. Tang, "Dual-space linear discriminant analysis for face recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit, vol. 2, 2004, pp. II–564.
 [56] R. Wang, S. Shan, X. Chen, Q. Dai, and W. Gao, “Manifold–manifold distance and its application to face recognition with image sets,” IEEE Trans. Image Process., vol. 21, no. 10, pp. 4466–4479, 2012.
 [57] P. Vincent and Y. Bengio, "K-local hyperplane and convex distance nearest neighbor algorithms," in Advances in Neural Information Processing Systems, 2001, pp. 985–992.
 [58] H. Cevikalp and B. Triggs, “Face recognition based on image sets,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit, 2010, pp. 2567–2573.
 [59] P. Zhu, L. Zhang, W. Zuo, and D. Zhang, “From point to set: Extend the learning of distance metrics,” in Proc. Int’l Conf. Comput. Vis. IEEE, 2013, pp. 2664–2671.
 [60] Z. Huang, S. Shan, H. Zhang, S. Lao, A. Kuerban, and X. Chen, "Benchmarking still-to-video face recognition via partial and local linear discriminant analysis on COX-S2V dataset," in Proc. Asian Conf. Comput. Vis. Springer, 2013, pp. 589–600.