Cross-Domain Visual Matching via Generalized Similarity Measure and Feature Learning

05/13/2016 ∙ by Liang Lin, et al. ∙ IEEE ∙ Xidian University ∙ Sun Yat-sen University

Cross-domain visual data matching is one of the fundamental problems in many real-world vision tasks, e.g., matching persons across ID photos and surveillance videos. Conventional approaches to this problem usually involve two steps: i) projecting samples from different domains into a common space, and ii) computing (dis-)similarity in this space based on a certain distance. In this paper, we present a novel pairwise similarity measure that advances existing models by i) expanding traditional linear projections into affine transformations and ii) fusing affine Mahalanobis distance and Cosine similarity by a data-driven combination. Moreover, we unify our similarity measure with feature representation learning via deep convolutional neural networks. Specifically, we incorporate the similarity measure matrix into the deep architecture, enabling an end-to-end way of model optimization. We extensively evaluate our generalized similarity model in several challenging cross-domain matching tasks: person re-identification under different views and face verification over different modalities (i.e., faces from still images and videos, older and younger faces, and sketch and photo portraits). The experimental results demonstrate superior performance of our model over other state-of-the-art methods.


1 Introduction

Visual similarity matching is arguably one of the most fundamental problems in computer vision and pattern recognition, and it becomes more challenging when dealing with cross-domain data. For example, in still-video face retrieval, a newly rising task in visual surveillance, faces from still images captured under a constrained environment are used as queries to find matches of the same identity in unconstrained videos. Age-invariant and sketch-photo face verification are also examples of cross-domain image matching. Some examples from these applications are shown in Figure 1.

Fig. 1: Typical examples of matching cross-domain visual data. (a) Faces from still images and videos. (b) Front- and side-view persons. (c) Older and younger faces. (d) Photo and sketch faces.

Conventional approaches (e.g., canonical correlation analysis [1] and partial least squares regression [2]) for cross-domain matching usually follow a two-step procedure:

  1. Samples from different modalities are first projected into a common space by learning a transformation. One may simplify the computation by assuming that these cross-domain samples share the same projection.

  2. A certain distance is then utilized for measuring the similarity/dissimilarity in the projection space. Usually the Euclidean distance or the inner product is used.

Suppose that $x$ and $y$ are two samples of different modalities, and $U$ and $V$ are two projection matrices applied on $x$ and $y$, respectively. $Ux$ and $Vy$ are usually formulated as linear similarity transformations, mainly for the convenience of optimization. A similarity transformation has the good property of preserving the shape of an object that goes through it, but it is limited in capturing the complex deformations that usually exist in real problems, e.g., translation, shearing, and their compositions. On the other hand, Mahalanobis distance, Cosine similarity, and their combination have been widely studied in similarity metric learning, but how to unify feature learning and similarity learning remains less investigated, in particular how to combine Mahalanobis distance with Cosine similarity and integrate the distance metric with deep neural networks for end-to-end learning.

To address the above issues, in this work we present a more general similarity measure and unify it with deep convolutional representation learning. One of the key innovations is that we generalize the existing similarity models from two aspects. First, we extend the similarity transformations $Ux$ and $Vy$ to affine transformations by adding a translation vector to each of them, i.e., replacing $Ux$ and $Vy$ with $L_A x + a$ and $L_B y + b$, respectively. An affine transformation generalizes a similarity transformation by dropping the requirement of preserving the origin of a linear space, and it is able to capture more complex deformations. Second, unlike the traditional approaches that choose either Mahalanobis distance or Cosine similarity, we combine these two measures under the affine transformation. This combination is realized in a data-driven fashion, as discussed in the Appendix, resulting in a novel generalized similarity measure, defined as:

$$S(x, y) = \begin{bmatrix} x^\top & y^\top & 1 \end{bmatrix} \begin{bmatrix} A & C & d \\ C^\top & B & e \\ d^\top & e^\top & f \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \qquad (1)$$

where the sub-matrices $A$ and $B$ are positive semi-definite, representing the self-correlations of the samples in their own domains, and $C$ is a correlation matrix crossing the two domains.

Figure 2 intuitively explains the idea. (Footnote 1: Figure 2 does not imply that our model geometrically aligns two samples to be matched. Using this example we emphasize the superiority of the affine transformation over the traditional linear similarity transformation on capturing pattern variations in the feature space.) In this example, it is observed that Euclidean distance under the linear transformation, as (a) illustrates, can be regarded as a special case of our model with $A = U^\top U$, $B = V^\top V$, $C = -U^\top V$, $d = 0$, $e = 0$, and $f = 0$. Our similarity model can be viewed as a generalization of several recent metric learning models [3][4]. Experimental results validate that the introduction of $(d, e, f)$ and a more flexible setting on $(A, B, C)$ do improve the matching performance significantly.
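As a minimal numerical sketch (not the authors' implementation), the measure can be read as the quadratic form $x^\top A x + y^\top B y + 2 x^\top C y + 2 d^\top x + 2 e^\top y + f$ over the stacked vector $[x; y; 1]$. The block names `A`, `B`, `C`, `d`, `e`, `f`, the projections `U` and `V`, the function name `generalized_similarity`, the dimensions, and the random data are all assumptions made for illustration. The check confirms that choosing $A = U^\top U$, $B = V^\top V$, $C = -U^\top V$ with zero $d$, $e$, $f$ recovers the squared Euclidean distance between $Ux$ and $Vy$.

```python
import numpy as np

def generalized_similarity(x, y, A, B, C, d, e, f):
    """Quadratic form x'Ax + y'By + 2x'Cy + 2d'x + 2e'y + f (assumed reading of the measure)."""
    return float(x @ A @ x + y @ B @ y + 2 * x @ C @ y + 2 * d @ x + 2 * e @ y + f)

rng = np.random.default_rng(0)
x, y = rng.normal(size=5), rng.normal(size=7)            # samples from two domains (sizes are arbitrary)
U, V = rng.normal(size=(3, 5)), rng.normal(size=(3, 7))  # hypothetical linear projections

# Special case: no translation terms, blocks built from the two projections.
s = generalized_similarity(x, y, U.T @ U, V.T @ V, -U.T @ V, np.zeros(5), np.zeros(7), 0.0)
euclid = float(np.sum((U @ x - V @ y) ** 2))             # squared Euclidean distance in the common space
print(np.isclose(s, euclid))
```

Allowing nonzero `d`, `e`, `f` and decoupling `C` from `A` and `B` is what gives the general form its extra flexibility over this special case.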

Fig. 2: Illustration of the generalized similarity model. Conventional approaches project data by simply using the linear similarity transformations (i.e., $Ux$ and $Vy$), as illustrated in (a), where Euclidean distance is applied as the distance metric. As illustrated in (b), we improve existing models by i) expanding the traditional linear similarity transformation into an affine transformation and ii) fusing Mahalanobis distance and Cosine similarity. One can see that the case in (a) is a simplified version of our model. Please refer to the Appendix for the derivation details.

Another innovation of this work is that we unify feature representation learning and similarity measure learning. In the literature, most existing models operate in the original data space or in a pre-defined feature space; that is, feature extraction and the similarity measure are studied separately. These methods may have several drawbacks in practice. For example, the similarity models rely heavily on feature engineering and thus lack generality when handling problems under different scenarios. Moreover, the interaction between the feature representations and similarity measures is ignored or simplified, thus limiting their performance. Meanwhile, deep learning, especially the Convolutional Neural Network (CNN), has demonstrated its effectiveness in learning discriminative features from raw data and has facilitated building end-to-end learning frameworks. Motivated by these works, we build a deep architecture to integrate our similarity measure with CNN-based feature representation learning. Our architecture takes raw images of different modalities as inputs and automatically produces their representations by sequentially stacking a shared sub-network upon domain-specific sub-networks. Upon these layers, we further incorporate the components of our similarity measure by simulating them with several appended structured neural network layers. The feature learning and the similarity model learning are thus integrated for end-to-end optimization.

In sum, this paper makes three main contributions to cross-domain similarity measure learning.

  • First, it presents a generic similarity measure by generalizing the traditional linear projection and distance metrics into a unified formulation. Our model can be viewed as a generalization of several existing similarity learning models.

  • Second, it integrates feature learning and similarity measure learning by building an end-to-end deep architecture of neural networks. Our deep architecture effectively improves the adaptability of learning with data of different modalities.

  • Third, we extensively evaluate our framework on four challenging tasks of cross-domain visual matching: person re-identification across views (Footnote 2: Person re-identification is arguably a cross-domain matching problem. We introduce it in our experiments since this problem has been receiving increasing attention recently.), and face verification under different modalities (i.e., faces from still images and videos, older and younger faces, and sketch and photo portraits). The experimental results show that our similarity model outperforms other state-of-the-art methods in three of the four tasks and achieves the second best performance in the other one.

The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 introduces our generalized similarity model and discusses its connections to existing works. Section 4 presents the proposed deep neural network architecture (Section 4.1) and the learning algorithm (Section 4.2). The experimental results, comparisons and ablation studies are presented in Section 5. Section 6 concludes the paper.

2 Related Work

In the literature, to cope with the cross-domain matching of visual data, one can learn a common space for different domains. CCA [1] learns the common space via maximizing cross-view correlation, while PLS [2] is learned via maximizing cross-view covariance. Coupled information-theoretic encoding is proposed to maximize the mutual information [5]. Another conventional strategy is to synthesize samples from the input domain into the other domain. Rather than learning the mapping between two domains in the data space, dictionary learning [6][7] can be used to alleviate cross-domain heterogeneity, and semi-coupled dictionary learning (SCDL [7]) is proposed to model the relationship on the sparse coding vectors from the two domains. Duan et al. proposed another framework called domain adaptation machine (DAM) [8] for multiple source domain adaptation, but it needs a set of pre-trained base classifiers.

Various discriminative common space approaches have been developed by utilizing the label information. Supervised information can be employed by the Rayleigh quotient [1], treating the label as the common space [9], or employing the max-margin rule [10]. Using the SCDL framework, structured group sparsity was adopted to utilize the label information [6]. Generalization of discriminative common space to multiview was also studied [11]. Kan et al. proposed a multiview discriminant analysis (MvDA [12]) method to obtain a common space for multiple views by optimizing both inter-view and intra-view Rayleigh quotient. In [13], a method to learn shape models using local curve segments with multiple types of distance metrics was proposed.

Moreover, for most existing multiview analysis methods, the target is defined based on the standard inner product or distance between the samples in the feature space. In the field of metric learning, several generalized similarity / distance measures have been studied to improve recognition performance. In [4][14], the generalized distance / similarity measures are formulated as the difference between the distance component and the similarity component to take into account both cross inner product term and two norm terms. Li et al. [3] adopted the second-order decision function as distance measure without considering the positive semi-definite (PSD) constraint. Chang and Yeung [15] suggested an approach to learn locally smooth metrics using local affine transformations while preserving the topological structure of the original data. These distance / similarity measures, however, were developed for matching samples from the same domain, and they cannot be directly applied to cross domain data matching.

To extend traditional single-domain metric learning, Mignon and Jurie [16] suggested a cross-modal metric learning (CMML) model, which learns domain-specific transformations based on a generalized logistic loss. Zhai et al. [17] incorporated the joint graph regularization with the heterogeneous metric learning model to improve the cross-media retrieval accuracy. In [16, 17], Euclidean distance is adopted to measure the dissimilarity in the latent space. Instead of explicitly learning domain-specific transformations, Kang et al. [18] learned a low rank matrix to parameterize the cross-modal similarity measure by the accelerated proximal gradient (APG) algorithm. However, these methods are mainly based on the common similarity or distance measures and none of them addresses the feature learning problem under the cross-domain scenarios.

Instead of using hand-crafted features, learning feature representations and contextual relations with deep neural networks, especially the convolutional neural network (CNN) [19], has shown great potential in various pattern recognition tasks such as object recognition [20] and semantic segmentation [21]. Significant performance gains have also been achieved in face recognition [22] and person re-identification [23][24][25][26], mainly attributed to the progress in deep learning. Recently, several deep CNN-based models have been explored for similarity matching and learning. For example, Andrew et al. [27] proposed a multi-layer CCA model consisting of several stacked nonlinear transformations. Li et al. [28] learned filter pairs via deep networks to handle misalignment, photometric and geometric transforms, and achieved promising results for the person re-identification task. Wang et al. [29] learned fine-grained image similarity with a deep ranking model. Yi et al. [30] presented a deep metric learning approach by generalizing the Siamese CNN. Ahmed et al. [25] proposed a deep convolutional architecture to measure the similarity between a pair of pedestrian images. Besides the shared convolutional layers, their network also includes a neighborhood difference layer and a patch summary layer to compute cross-input neighborhood differences. Chen et al. [26] proposed a deep ranking framework to learn the joint representation of an image pair and return the similarity score directly, in which the similarity model is replaced by fully-connected layers.

Our deep model is partially motivated by the above works, and we target a more powerful solution for cross-domain visual matching by incorporating a generalized similarity function into deep neural networks. Moreover, our network architecture is different from existing works, leading to new state-of-the-art results on several challenging person verification and recognition tasks.

3 Generalized Similarity Model

In this section, we first introduce the formulation of our deep generalized similarity model and then discuss the connections between our model and existing similarity learning methods.

3.1 Model Formulation

According to the discussion in Section 1, our generalized similarity measure extends the traditional linear projection and integrates Mahalanobis distance and Cosine similarity into a generic form, as shown in Eqn. (1). As we derive in the Appendix, $A$ and $B$ in our similarity measure are positive semi-definite but $C$ does not obey this constraint. Hence, we can further factorize $A$, $B$ and $C$ as:

$$A = L_A^\top L_A, \qquad B = L_B^\top L_B, \qquad C = -(L_C^x)^\top L_C^y. \qquad (2)$$

Moreover, our model extracts feature representations (i.e., $f_1(x)$ and $f_2(y)$) from the raw input data by utilizing the CNNs. Incorporating the feature representation and the above matrix factorization into Eqn. (1), we can thus have the following similarity model:

$$\tilde{S}(x, y) = S(f_1(x), f_2(y)) = \|L_A f_1(x)\|^2 + \|L_B f_2(y)\|^2 + 2 d^\top f_1(x) - 2 \left(L_C^x f_1(x)\right)^\top \left(L_C^y f_2(y)\right) + 2 e^\top f_2(y) + f. \qquad (3)$$

Specifically, $L_A f_1(x)$, $L_C^x f_1(x)$, $d^\top f_1(x)$ can be regarded as the similarity components for $x$, while $L_B f_2(y)$, $L_C^y f_2(y)$, $e^\top f_2(y)$ accordingly for $y$. These similarity components are modeled as the weights that connect neurons of the last two layers. For example, a portion of output activations represents $L_A f_1(x)$ by taking $f_1(x)$ as the input and multiplying the corresponding weights $L_A$. In the following, we discuss the formulation of our similarity learning.
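The factorization can be checked numerically. The sketch below assumes the block reading $A = L_A^\top L_A$, $B = L_B^\top L_B$, $C = -(L_C^x)^\top L_C^y$ and treats the feature vectors as given arrays rather than CNN outputs; the helper names `components_x`, `components_y`, `aggregate`, and all dimensions are hypothetical. It shows that three components computed from each sample separately, followed by a cheap aggregation, reproduce the full quadratic form.

```python
import numpy as np

def components_x(fx, L_A, L_Cx, d):
    """Three similarity components that depend only on the first-domain feature."""
    return L_A @ fx, L_Cx @ fx, float(d @ fx)

def components_y(fy, L_B, L_Cy, e):
    """Three similarity components that depend only on the second-domain feature."""
    return L_B @ fy, L_Cy @ fy, float(e @ fy)

def aggregate(cx, cy, f):
    """Combine the six components into a single score."""
    ax, cxv, dx = cx
    by, cyv, ey = cy
    return float(ax @ ax + by @ by + 2 * dx - 2 * cxv @ cyv + 2 * ey + f)

rng = np.random.default_rng(1)
fx, fy = rng.normal(size=6), rng.normal(size=6)          # stand-ins for learned features
L_A, L_B = rng.normal(size=(4, 6)), rng.normal(size=(4, 6))
L_Cx, L_Cy = rng.normal(size=(3, 6)), rng.normal(size=(3, 6))
d, e, f = rng.normal(size=6), rng.normal(size=6), 0.5

factored = aggregate(components_x(fx, L_A, L_Cx, d), components_y(fy, L_B, L_Cy, e), f)

# Reference: the unfactorized quadratic form built from the same parameters.
A, B, C = L_A.T @ L_A, L_B.T @ L_B, -L_Cx.T @ L_Cy
direct = float(fx @ A @ fx + fy @ B @ fy + 2 * fx @ C @ fy + 2 * d @ fx + 2 * e @ fy + f)
print(np.isclose(factored, direct))
```

Parameterizing through the factors keeps $A$ and $B$ positive semi-definite by construction, while $C$ remains unconstrained.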

The objective of our similarity learning is to seek a function $\tilde{S}(x, y)$ that satisfies a set of similarity/dissimilarity constraints. Instead of learning the similarity function on a hand-crafted feature space, we take the raw data as input, and introduce a deep similarity learning framework to integrate nonlinear feature learning and generalized similarity learning. Recall that our deep generalized similarity model is in Eqn. (1). $(f_1(x), f_2(y))$ are the feature representations for samples of different modalities, and we use $W$ to indicate their parameters. We denote $\Phi = (L_A, L_B, L_C^x, L_C^y, d, e, f)$ as the similarity components for sample matching. Note that $\tilde{S}(x, y)$ is asymmetric, i.e., $\tilde{S}(x, y) \neq \tilde{S}(y, x)$. This is reasonable for cross-domain matching, because the similarity components are domain-specific.

Assume that $D = \{(\{x_i, y_i\}, \ell_i)\}_{i=1}^{N}$ is a training set of cross-domain sample pairs, where $\{x_i, y_i\}$ denotes the $i$-th pair, and $\ell_i$ denotes the corresponding label of $\{x_i, y_i\}$ indicating whether $x_i$ and $y_i$ are from the same class:

$$\ell_i = \ell(x_i, y_i) = \begin{cases} -1, & c(x_i) = c(y_i) \\ 1, & \text{otherwise} \end{cases} \qquad (4)$$

where $c(x)$ denotes the class label of the sample $x$. An ideal deep similarity model is expected to satisfy the following constraints:

$$\tilde{S}(x_i, y_i) \begin{cases} < -1, & \text{if } \ell_i = -1 \\ \geq 1, & \text{otherwise} \end{cases} \qquad (5)$$

for any $\{x_i, y_i\}$.

Note that a feasible solution that satisfies the above constraints may not exist. To avoid this scenario, we relax the hard constraints in Eqn. (5) by introducing a hinge-like loss:

$$G(W, \Phi) = \sum_{i=1}^{N} \left(1 - \ell_i \tilde{S}(x_i, y_i)\right)_+ \qquad (6)$$

To improve the stability of the solution, some regularizers are further introduced, resulting in our deep similarity learning model:

$$(\hat{W}, \hat{\Phi}) = \arg\min_{W, \Phi} \sum_{i=1}^{N} \left(1 - \ell_i \tilde{S}(x_i, y_i)\right)_+ + \Psi(W, \Phi) \qquad (7)$$

where $\Psi(W, \Phi)$ denotes the regularizer on the parameters of the feature representation and generalized similarity models.
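A small sketch of a hinge-like objective of this kind follows. It assumes labels of -1 for same-class pairs and +1 otherwise, so that matching pairs are pushed toward scores below -1 and non-matching pairs toward scores of at least 1. The squared-norm regularizer, the `weight_decay` value, and the function name `hinge_objective` are assumptions for illustration; the form of the regularizer is not specified here.

```python
import numpy as np

def hinge_objective(scores, labels, params, weight_decay=1e-3):
    """Sum of max(0, 1 - label * score) over pairs, plus an assumed L2 penalty."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    hinge = np.maximum(0.0, 1.0 - labels * scores).sum()
    reg = weight_decay * sum(float(np.sum(p ** 2)) for p in params)
    return hinge + reg

# Four pairs: two same-class (label -1) and two different-class (label +1).
scores = [-2.0, 0.5, 3.0, -0.5]
labels = [-1, -1, 1, 1]
print(hinge_objective(scores, labels, params=[]))  # 3.0
```

Only the second and fourth pairs violate their margins, each contributing 1.5; pairs that already satisfy the constraints contribute nothing.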

3.2 Connection with Existing Models

Our generalized similarity learning model is a generalization of many existing metric learning models, which can be treated as special cases of our model by imposing some extra constraints on $(A, B, C, d, e, f)$.

Fig. 3: Deep architecture of our similarity model. This architecture comprises three parts: domain-specific sub-network, shared sub-network and similarity sub-network. The first two parts extract feature representations from samples of different domains, and are built upon a number of convolutional layers, max-pooling operations and fully-connected layers. The similarity sub-network includes two structured fully-connected layers that incorporate the similarity components in Eqn. (3).

Conventional similarity model usually is defined as , and this form is equivalent to our model, when , , , and . Similarly, the Mahalanobis distance is also regarded as a special case of our model, when , , , and .
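To make the Mahalanobis case concrete, a short worked expansion may help. It assumes the quadratic-form reading $S(x, y) = x^\top A x + y^\top B y + 2 x^\top C y + 2 d^\top x + 2 e^\top y + f$, samples $x$ and $y$ of equal dimension, and a symmetric positive semi-definite matrix $M$ (a symbol introduced here only for illustration):

```latex
\begin{aligned}
(x - y)^\top M (x - y)
  &= x^\top M x + y^\top M y - x^\top M y - y^\top M x \\
  &= x^\top M x + y^\top M y - 2\, x^\top M y
     \qquad \text{(since } M = M^\top \text{)} \\
  &= S(x, y) \quad \text{with } A = B = M,\; C = -M,\; d = e = 0,\; f = 0 .
\end{aligned}
```

Under this reading, the single-metric distance ties all three blocks to one matrix, whereas the general model lets them vary independently per domain.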

In the following, we connect our similarity model to two state-of-the-art similarity learning methods, i.e., LADF [3] and Joint Bayesian [4].

In [3], Li et al. proposed to learn a decision function that jointly models a distance metric and a locally adaptive thresholding rule, and the so-called LADF (i.e., Locally-Adaptive Decision Function) is formulated as a second-order large-margin regularization problem. Specifically, LADF is defined as:

(8)

One can observe that when we set and in our model.

It should be noted that LADF treats and using the same metrics, i.e., for both and , and for and . Such a model is reasonable for matching samples with the same modality, but may be unsuitable for cross-domain matching where and are with different modalities. Compared with LADF, our model uses and to calculate and , and uses and to calculate and , making our model more effective for cross-domain matching.

In [4], Chen et al. extended the classical Bayesian face model by learning joint distributions (i.e., intra-person and extra-person variations) of sample pairs. Their decision function takes the following form:

(9)

Note that the similarity metric model proposed in [14] also adopted such a form. Interestingly, this decision function is also a special variant of our model by setting , , , , and .

In summary, our similarity model can be regarded as the generalization of many existing cross-domain matching and metric learning models, and it is more flexible and suitable for cross-domain visual data matching.

4 Joint Similarity and Feature Learning

In this section, we introduce our deep architecture that integrates the generalized similarity measure with convolutional feature representation learning.

4.1 Deep Architecture

As discussed above, our model defined in Eqn. (7) jointly handles similarity function learning and feature learning. This integration is achieved by building a deep architecture of convolutional neural networks, which is illustrated in Figure 3. It is worth mentioning that our architecture is able to handle the input samples of different modalities with unequal numbers, e.g., samples of and samples of are fed into the network in a way of batch processing.

From left to right in Figure 3, two domain-specific sub-networks and are applied to the samples of two different modalities, respectively. Then, the outputs of and are concatenated into a shared sub-network . We make a superposition of and to feed . At the output of , the feature representations of the two samples are extracted separately as and , which is indicated by the slice operator in Figure 3. Finally, these learned feature representations are utilized in the structured fully-connected layers that incorporate the similarity components defined in Eqn. (3). In the following, we introduce the detailed setting of the three sub-networks.

Domain-specific sub-network. We use two separate network branches to handle the samples from different domains. Each network branch includes one convolutional layer with filters of size and the stride step of pixels. The rectified nonlinear activation is utilized. Then, we apply one max-pooling operation with size of and its stride step is set as pixels.

Shared sub-network. For this component, we stack one convolutional layer and two fully-connected layers. The convolutional layer contains filters of size and the filter stride step is set as pixel. The kernel size of the max-pooling operation is and its stride step is pixels. The output vectors of the two fully-connected layers are of dimensions. We further normalize the output of the second fully-connected layer before it is fed to the next sub-network.

Similarity sub-network. A slice operator is first applied in this sub-network, which partitions the vectors into two groups corresponding to the two domains. For the example in Figure 3, vectors are grouped into two sets, i.e., and , with size of and , respectively. and are both of dimensions. Then, and are fed to two branches of neural network, and each branch includes a fully-connected layer. We divide the activations of these two layers into six parts according to the six similarity components. As is shown in Figure 3, in the top branch the neural layer connects to and outputs , , and , respectively. In the bottom branch, the layer outputs , , and , respectively, by connecting to . In this way, the similarity measure is tightly integrated with the feature representations, and they can be jointly optimized during the model training. Note that is a parameter of the generalized similarity measure in Eqn. (1). Experiments show that the value of only affects the learning convergence rather than the matching performance. Thus we empirically set in our experiments.

In the deep architecture, we can observe that the similarity components of $x$ and those of $y$ do not interact with each other under the factorization until the final aggregation, that is, computing the components of $x$ is independent of $y$. This leads to the useful property of efficient matching. In particular, for each sample stored in a database, we can pre-compute its feature representation and the corresponding similarity components, so that similarity matching in the testing stage is very fast.
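The pre-computation idea can be sketched as follows. This is an illustrative NumPy version rather than the authors' code: features are given arrays, the parameter names (`L_A`, `L_B`, `L_Cx`, `L_Cy`, `d`, `e`, `f`) follow the factorized form assumed above, and the helpers `precompute` and `score_matrix` are hypothetical. Gallery components are computed once; each query then needs only its own three components plus one small matrix product.

```python
import numpy as np

def precompute(F, L_sq, L_c, t):
    """Per-sample components for a feature matrix F (one row per sample)."""
    sq = np.sum((F @ L_sq.T) ** 2, axis=1)   # squared norm of the projected feature
    cross = F @ L_c.T                         # projection used in the cross-domain term
    lin = F @ t                               # linear (translation-related) term
    return sq, cross, lin

def score_matrix(comp_x, comp_y, f):
    """Scores for every (first-domain, second-domain) combination."""
    sq_x, cross_x, lin_x = comp_x
    sq_y, cross_y, lin_y = comp_y
    return (sq_x[:, None] + sq_y[None, :]
            + 2 * lin_x[:, None] + 2 * lin_y[None, :]
            - 2 * cross_x @ cross_y.T + f)

rng = np.random.default_rng(2)
r = 6                                                     # feature dimension (arbitrary)
probes, gallery = rng.normal(size=(2, r)), rng.normal(size=(5, r))
L_A, L_B = rng.normal(size=(4, r)), rng.normal(size=(4, r))
L_Cx, L_Cy = rng.normal(size=(3, r)), rng.normal(size=(3, r))
d, e, f = rng.normal(size=r), rng.normal(size=r), 0.0

gallery_comp = precompute(gallery, L_B, L_Cy, e)   # done once, offline
probe_comp = precompute(probes, L_A, L_Cx, d)      # done per query
S = score_matrix(probe_comp, gallery_comp, f)
ranking = np.argsort(S, axis=1)   # assumes a smaller score means a better match
print(S.shape)  # (2, 5)
```

The ranking direction is an assumption tied to the convention that matching pairs receive low scores; it would be reversed under the opposite label convention.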

4.2 Model Training

In this section, we discuss the learning method for training our similarity model. To avoid loading all images into memory, we use the mini-batch learning approach, that is, in each training iteration, a subset of the image pairs is fed into the neural network for model optimization.

For notation simplicity in discussing the learning algorithm, we start by introducing the following definitions:

(10)

where and denote the output layer’s activations of the samples and . Prior to incorporating Eqn. (10) into the similarity model in Eqn. (3), we introduce three transformation matrices (using Matlab representation):

(11)

where equals the dimension of the output of the shared neural network (i.e., the dimension of and ), and indicates the identity matrix. Then, our similarity model can be rewritten as:

(12)

Incorporating Eqn. (12) into the loss function Eqn. (6), we have the following objective:

(13)

where the summation term denotes the hinge-like loss for the cross domain sample pair , is the total number of pairs, represents the feature representation of different domains and represents the similarity model. and are both embedded as weights connecting neurons of layers in our deep neural network model, as Figure 3 illustrates.

The objective function in Eqn. (13) is defined in sample-pair-based form. To optimize it using SGD, one should apply a certain scheme to generate mini-batches of the sample pairs, which usually costs much computation and memory. Note that the sample pairs in training set are constructed from the original set of samples from different modalities , where and . The superscript denotes the sample index in the original training set, e.g., and , while the subscript denotes the index of sample pairs, e.g., . and denote the total number of samples from different domains. Without loss of generality, we define and . For each pair in , we have and with and . And we also have and .

Therefore, we rewrite Eqn. (13) in a sample-based form:

(14)

Given , the loss function in Eqn. (7) can also be rewritten in the sample-based form:

(15)

The objective in Eqn. (15) can be optimized by the mini-batch back propagation algorithm. Specifically, we update the parameters by gradient descent:

(16)

where denotes the learning rate. The key problem of solving the above equation is calculating . As is discussed in [31], there are two ways to this end, i.e., pair-based gradient descent and sample-based gradient descent. Here we adopt the latter to reduce the requirements on computation and memory cost.

Suppose a mini-batch of training samples from the original set , where and . Following the chain rule, calculating the gradient for all pairs of samples is equivalent to summing up the gradient for each sample,

(17)

where can be either or .

Using as an example, we first introduce an indicator function before calculating the partial derivative of output layer activation for each sample . Specifically, we define when is a sample pair and . Otherwise we let . , indicating whether and are from the same class. With , the gradient of can be written as

(18)

The calculation of can be conducted in a similar way. The algorithm of calculating the partial derivative of output layer activation for each sample is shown in Algorithm 1.

Input: The output layer’s activation for all samples
Output: The partial derivatives of the output layer’s activation for all the samples
for each sample do
    Initialize the partner set of the sample as empty;
    for each pair do
        if the pair contains the sample then
            if the pair satisfies the condition of the indicator function then
                Add the corresponding partner of the sample in the pair to the partner set;
            end if
        end if
    end for
    Compute the derivatives for the sample with all the partners in the partner set, and sum these derivatives to obtain the desired partial derivative for the sample’s output layer’s activation using Eqn. (18);
end for
Algorithm 1 Calculate the derivative of the output layer’s activation for each sample
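The accumulation pattern of Algorithm 1 can be illustrated with a simplified stand-in. The sketch below is not Eqn. (18): it works directly on feature vectors with the block-form measure $x^\top A x + y^\top B y + 2 x^\top C y + 2 d^\top x + 2 e^\top y + f$ (with symmetric $A$), uses labels of -1 for same-class pairs and +1 otherwise, and treats a partner as "active" when the hinge term of the pair is positive. All names and sizes are assumptions. For each first-domain sample, the term that depends only on that sample is computed once, and the contributions of its active partners are summed.

```python
import numpy as np

def score(fx_i, fy_j, A, B, C, d, e, f):
    """Block-form measure on one feature pair (A and B assumed symmetric)."""
    return float(fx_i @ A @ fx_i + fy_j @ B @ fy_j + 2 * fx_i @ C @ fy_j
                 + 2 * d @ fx_i + 2 * e @ fy_j + f)

def grad_wrt_x(fx, fy, pairs, A, B, C, d, e, f):
    """Gradient of sum over pairs of max(0, 1 - label * score) w.r.t. each row of fx."""
    self_term = 2 * fx @ A + 2 * d                      # row i: 2 A fx_i + 2 d, once per sample
    label_sum = np.zeros(fx.shape[0])                   # sum of labels over active partners
    partner_sum = np.zeros((fx.shape[0], fy.shape[1]))  # sum of label * fy_j over active partners
    for i, j, label in pairs:
        if 1.0 - label * score(fx[i], fy[j], A, B, C, d, e, f) > 0:  # hinge term is active
            label_sum[i] += label
            partner_sum[i] += label * fy[j]
    return -(label_sum[:, None] * self_term + 2 * partner_sum @ C.T)

rng = np.random.default_rng(3)
fx, fy = rng.normal(size=(3, 4)), rng.normal(size=(2, 4))
L = rng.normal(size=(4, 4))
A = L.T @ L
B = A.copy()
C = -rng.normal(size=(4, 4))
d, e, f = rng.normal(size=4), rng.normal(size=4), -1.0
pairs = [(0, 0, -1), (0, 1, 1), (1, 0, 1), (2, 1, -1)]  # (x index, y index, label)

g = grad_wrt_x(fx, fy, pairs, A, B, C, d, e, f)
print(g.shape)  # (3, 4)
```

The gradient with respect to the second-domain features follows the same pattern with the roles of the two domains swapped.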

Note that all the three sub-networks in our deep architecture are differentiable. We can easily use the back-propagation procedure [19] to compute the partial derivatives with respect to the hidden layers and model parameters . We summarize the overall procedure of deep generalized similarity measure learning in Algorithm 2.

If all the possible pairs are used in training, the sample-based form allows us to generate sample pairs from a mini-batch of . On the other hand, the sample-pair-based form may require samples or less to generate sample pairs. In gradient computation, from Eqn. (18), for each sample we only require calculating once and times in the sample-based form. While in the sample-pair-based form, and should be computed and times, respectively. In sum, the sample-based form generally results in less computation and memory cost.
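The cost argument can be made concrete with a small counting sketch. The symbols `n_x` and `n_y` (numbers of first- and second-domain images in a mini-batch) and the batch sizes used below are our own illustrative choices, and the pair-based count is an upper bound that assumes each pair feeds both of its images through the network separately.

```python
def forward_passes(n_x, n_y):
    """Network evaluations needed to score all n_x * n_y cross-domain pairs."""
    n_pairs = n_x * n_y
    sample_based = n_x + n_y      # each image passes through the network once
    pair_based = 2 * n_pairs      # upper bound: two images fed per pair
    return n_pairs, sample_based, pair_based

print(forward_passes(20, 30))  # (600, 50, 1200)
```

The gap grows with batch size, since the sample-based count is linear in the number of images while the pair-based bound is quadratic.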

Input: Training set, initialized parameters and , learning rate ,
Output: Network parameters and
while the stopping condition is not met do
    Sample training pairs ;
    Feed the sampled images into the network;
    Perform a feed-forward pass for all the samples and compute the net activations for each sample ;
    Compute the partial derivative of the output layer’s activation for each sample by Algorithm 1;
    Compute the partial derivatives of the hidden layers’ activations for each sample following the chain rule;
    Compute the desired gradients using the back-propagation procedure;
    Update the parameters using Eqn. (16);
end while
Algorithm 2 Generalized Similarity Learning

Batch Process Implementation. Suppose that the training image set is divided into categories, each of which contains images from the first domain and images from the second domain. Thus we can obtain a maximum number of pairwise samples, which is quadratically more than the number of source images . In real applications, since the number of stored images may reach millions, it is impossible to load all the data for network training. To overcome this problem, we implement our learning algorithm in a batch-process manner. Specifically, in each iteration, only a small subset of cross-domain image pairs is generated and fed to the network for training. According to our extensive experiments, randomly generating image pairs is ineffective: it scatters the images in a batch across many categories, leaving very few valid training samples for any given category and degrading the model. Besides, the two images in a random pair almost never come from the same class, so positive samples are very scarce. To overcome this problem, an effective cross-domain image pair generation scheme is adopted to train our generalized similarity model. In each round, we first randomly choose instance categories. For each category, a number of images from the first domain and a number of images from the second domain are randomly selected. For each selected image in the first domain, we randomly take samples from the second domain such that the proportions of positive and negative samples are equal. In this way, the images in the generated samples are concentrated in a few categories and the model converges effectively.
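One possible reading of this pair generation scheme is sketched below using only the standard library. The function name `generate_pairs`, its parameters, the image identifiers, and the label convention (-1 for same category, +1 otherwise) are assumptions for illustration; the actual category and sample counts used in training are not stated here.

```python
import random

def generate_pairs(domain1, domain2, n_categories, n_per_domain, n_partners, seed=0):
    """domain1/domain2 map a category id to a list of image ids.
    Returns (image1, image2, label) triples with label -1 for same category, +1 otherwise."""
    rng = random.Random(seed)
    cats = rng.sample(sorted(domain1), n_categories)           # pick a few categories per round
    chosen1 = {c: rng.sample(domain1[c], n_per_domain) for c in cats}
    chosen2 = {c: rng.sample(domain2[c], n_per_domain) for c in cats}
    half = n_partners // 2                                     # equal positives and negatives
    pairs = []
    for c in cats:
        negatives_pool = [img for o in cats if o != c for img in chosen2[o]]
        for img in chosen1[c]:
            for pos in rng.sample(chosen2[c], half):
                pairs.append((img, pos, -1))
            for neg in rng.sample(negatives_pool, half):
                pairs.append((img, neg, +1))
    return pairs

# Toy data: 6 categories with 5 images in each domain.
domain1 = {c: [f"d1_c{c}_{k}" for k in range(5)] for c in range(6)}
domain2 = {c: [f"d2_c{c}_{k}" for k in range(5)] for c in range(6)}
pairs = generate_pairs(domain1, domain2, n_categories=3, n_per_domain=4, n_partners=4)
print(len(pairs))  # 48
```

Restricting each round to a few categories keeps both positives and negatives available for every selected image, which is the property the scheme relies on.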

5 Experiments

In this section, we apply our similarity model in four representative tasks of matching cross-domain visual data and adopt several benchmark datasets for evaluation: i) person re-identification under different views on CUHK03 [28] and CUHK01 [32] datasets; ii) age-invariant face recognition on MORPH [33], CACD [34] and CACD-VS [35] datasets; iii) sketch-to-photo face matching on CUFS dataset [36]; iv) face verification over still-video domains on COX face dataset [37]. On all these tasks, state-of-the-art methods are employed to compare with our model.

Experimental setting. Mini-batch learning is adopted in our experiments to save memory cost. In each task, we randomly select a batch of samples from the original training set to generate a number of pairs (e.g., ). The initial parameters of the convolutional and the fully-connected layers are set by two zero-mean Gaussian distributions, whose standard deviations are and respectively. Other settings specific to different tasks are included in the following sub-sections.

In addition, ablation studies are presented to reveal the benefit of each main component of our method, e.g., the generalized similarity measure and the joint optimization of CNN feature representation and metric model. We also implement several variants of our method by simplifying the similarity measures for comparison.

5.1 Person Re-identification

Person re-identification, which aims at matching pedestrian images across multiple non-overlapping cameras, has attracted increasing attention in surveillance. Although considerable efforts have been made, it remains an open problem due to the dramatic variations caused by viewpoint and pose changes. To evaluate this task, the CUHK03 [28] and CUHK01 [32] datasets are adopted in our experiments.

CUHK03 dataset [28] is one of the largest databases for person re-identification. It contains 14,096 images of 1,467 pedestrians collected from 5 different pairs of camera views. Each person is observed by two disjoint camera views and has an average of 4.8 images in each view. Following the standard CUHK03 setting, we randomly partition this dataset 10 times, each time obtaining a training set (1,367 persons) and a testing set (100 persons) without overlap.

CUHK01 dataset [32] contains 971 individuals, each having two samples from disjoint cameras. Following the setting in [28][25], we partition this dataset into a training set and a testing set: 100 individuals for testing and the others for training.

For evaluation on these two benchmarks, the testing set is further randomly divided, 10 times, into a gallery set of 100 images (i.e., one image per person) and a probe set (containing images of the individuals from camera views different from those of the gallery set) without overlap. We use the Cumulative Matching Characteristic (CMC) [38] as the evaluation metric in this task.
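For reference, a CMC curve for the single-shot gallery described above can be computed as in the sketch below. The function name cmc_curve, the nested-list distance layout, and the convention that a smaller distance means a better match are assumptions made for illustration.

```python
def cmc_curve(dist, probe_ids, gallery_ids, max_rank):
    """Cumulative Matching Characteristic for a single-shot gallery.

    dist[i][j] is the distance between probe i and gallery image j
    (smaller means more similar). Entry r-1 of the result is the fraction
    of probes whose true match appears within the top r gallery images.
    """
    hits = [0] * max_rank
    for i, pid in enumerate(probe_ids):
        # Gallery indices sorted from the best to the worst match.
        order = sorted(range(len(gallery_ids)), key=lambda j: dist[i][j])
        # Zero-based rank of the first gallery image with the probe's identity.
        rank = next((r for r, j in enumerate(order) if gallery_ids[j] == pid), None)
        if rank is None:
            continue  # identity absent from the gallery: never counted as a hit
        for r in range(rank, max_rank):
            hits[r] += 1
    return [h / len(probe_ids) for h in hits]
```

The first entry of the returned list is the rank-1 accuracy quoted throughout this section.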

In our model training, all of the images are resized to a fixed size and then cropped to a fixed smaller size at the center with a small random perturbation. During every round of learning, 4,800 pairs of samples are constructed by selecting 60 persons (or classes) and constructing 80 pairs for each person (class). For CUHK01, since each individual has only two samples, the 80 pairs per individual contain some duplicated pairs.
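The center crop with a small random perturbation can be illustrated as below. This is only a sketch: the function name, the max_shift parameter, and the uniform integer shift are assumptions, and the image and crop sizes are left to the caller.

```python
import random


def perturbed_center_crop_box(width, height, crop_w, crop_h, max_shift, rng=None):
    """Return (left, top, right, bottom) of a crop near the image center.

    The crop is shifted by a random offset of at most max_shift pixels in
    each direction and clamped so that it stays inside the image.
    """
    if rng is None:
        rng = random.Random()
    # Top-left corner of the exact center crop.
    cx = (width - crop_w) // 2
    cy = (height - crop_h) // 2
    # Small random perturbation (hypothetical choice: uniform integer shift).
    dx = rng.randint(-max_shift, max_shift)
    dy = rng.randint(-max_shift, max_shift)
    left = min(max(cx + dx, 0), width - crop_w)
    top = min(max(cy + dy, 0), height - crop_h)
    return left, top, left + crop_w, top + crop_h
```

Setting max_shift to 0 yields the plain center crop, which is a natural choice at test time.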

Fig. 4: CMC curves on (a) the CUHK03 [28] dataset and (b) the CUHK01 [32] dataset for evaluating person re-identification. Overall, our method performs better than existing state-of-the-art methods.

Results on CUHK03. We compare our approach with several state-of-the-art methods, which can be grouped into three categories. First, we adopt five distance metric learning methods based on fixed feature representation, i.e. the Information Theoretic Metric Learning (ITML) [5], the Local Distance Metric Learning (LDM) [39], the Large Margin Nearest Neighbors (LMNN) [40], the learning-to-rank method (RANK) [41], and the Kernel-based Metric Learning method (KML) [23]. Following their implementation, the handcrafted features of dense color histograms and dense SIFT uniformly sampled from patches are adopted. Second, three methods specially designed for person re-identification are employed in the experiments: SDALF [42], KISSME [43], and eSDC [44]. Moreover, several recently proposed deep learning methods, including DRSCH [45], DFPNN [28] and IDLA [25], are also compared with our approach. DRSCH [45] is a supervised hashing framework for integrating CNN feature and hash code learning, while DFPNN and IDLA have been introduced in Section 2.

The results are reported in Fig. 4 (a). It is encouraging to see that our approach significantly outperforms the competing methods (e.g., improving the state-of-the-art rank-1 accuracy from 54.74% (IDLA [25]) to 58.39%). Among the competing methods, ITML [5], LDM [39], LMNN [40], RANK [41], KML [23], SDALF [42], KISSME [43], and eSDC [44] are all based on hand-crafted features. And the superiority of our approach against them should be attributed to the deployment of both deep CNN features and generalized similarity model. DRSCH [45], DFPNN [28] and IDLA [25] adopted CNN for feature representation, but their matching metrics are defined based on traditional linear transformations.

Results on CUHK01. Fig. 4 (b) shows the results of our method and the other competing approaches on CUHK01. In addition to those used on CUHK03, one more method, i.e. LMLF [24], is used in the comparison experiment. LMLF [24] learns mid-level filters from automatically discovered patch clusters. According to the quantitative results, our method achieves a new state-of-the-art with a rank-1 accuracy of 66.50%.

5.2 Age-invariant Face Recognition

Age-invariant face recognition aims to decide whether two face images taken at different ages belong to the same identity. The key challenge is to handle the large intra-subject variations caused by the aging process while distinguishing different identities. Other factors, such as illumination, pose, and expression, make age-invariant face recognition more difficult. We conduct the experiments using three datasets, i.e., MORPH [33], CACD [34], and CACD-VS [35].

MORPH [33] contains more than 55,000 face images of 13,000 individuals, whose ages range from 16 to 77. The average number of images per individual is 4. The training set consists of 20,000 face images from 10,000 subjects, with each subject having the two images with the largest age gap. The test set is composed of a gallery set and a probe set from the remaining 3,000 subjects. The gallery set is composed of the youngest face image of each subject, and the probe set is composed of the oldest face image of each subject. This experimental setting is the same as that adopted in [46] and [34].

CACD [34] is a large-scale dataset released in 2014, which contains more than 160,000 images of 2,000 celebrities. We adopt a subset of 580 individuals from the whole database in our experiment, from which we manually remove the noisy images. Among these 580 individuals, the image labels of 200 individuals were originally provided, and we annotate the rest of the data. CACD includes large variations not only in pose, illumination, and expression but also in age. Based on CACD, a verification subset called CACD-VS [35] was further developed, which contains 2,000 positive pairs and 2,000 negative pairs. The setting and testing protocol of CACD-VS are similar to those of the well-known LFW benchmark [47], except that CACD-VS contains many more samples for each person.

All of the images are resized to a fixed size. For data augmentation, images are cropped to a fixed smaller size at the center with a small random perturbation when fed to the neural network. A sample-based mini-batch setting is adopted, and 4,800 pairs are constructed for each iteration.

Results on MORPH. We compare our method with several state-of-the-art methods, including the topological dynamic Bayesian network (TDBN) [48], cross-age reference coding (CARC) [34], probabilistic hidden factor analysis (HFA) [46], multi-feature discriminant analysis (MFDA) [49] and the 3D aging model [50]. The results are reported in Table I(a). Thanks to the use of the CNN representation and the generalized similarity measure, our method achieves a recognition rate of 94.35% and significantly outperforms the competing methods.

Fig. 5: Retrieval performance on the CACD dataset for age-invariant face recognition. Ours-1 and Ours-2 both denote our method; the latter uses more training samples.
(a) Recognition rates on the MORPH dataset.

Method                 Recognition rate
TDBN [48]              60%
3D Aging Model [50]    79.8%
MFDA [49]              83.9%
HFA [46]               91.1%
CARC [34]              92.8%
Ours                   94.4%

(b) Verification accuracy on the CACD-VS dataset.

Method                 Verification accuracy
HD-LBP [51]            81.6%
HFA [46]               84.4%
CARC [34]              87.6%
Deepface [52]          85.4%
Ours                   89.8%

TABLE I: Experimental results for age-invariant face recognition.

Results on CACD. On this dataset, the protocol is to retrieve face images of the same individual from gallery sets by using a probe set, where the age gap between probe face images and gallery face images is large. Following the experimental setting in [34], we split the images into 4 sets according to the years when the photos were taken. One of these sets is used as the probe set to search for matches in the remaining three sets. We introduce several state-of-the-art methods for comparison, including CARC [34], HFA [46] and one deep learning based method, Deepface [52]. The results of CARC [34] and HFA [46] are borrowed from their papers. The results of Deepface [52] and our approach (i.e., Ours-1) are obtained using the 200 originally annotated individuals, where 160 samples are used for model training. From the quantitative results reported in Fig. 5, our model achieves superior performance over the competing methods. Furthermore, we also report the result of our method (i.e., Ours-2) when images of 500 individuals are used as training samples. One can see that the performance of our model can be further improved by increasing the training data.

Results on CACD-VS. Following the setting in [35], we further evaluate our approach by conducting the general face verification experiment. Specifically, for all of the competing methods, we train the models on CACD and test on CACD-VS, and the optimal threshold value for matching is obtained by exhaustive search. The results produced by our method and the others (i.e., CARC [34], HFA [46], HD-LBP [51] and Deepface [52]) are reported in Table I(b). It is worth mentioning that our method improves the state-of-the-art verification accuracy from 87.6% (by CARC [34]) to 89.8%. Thanks to the introduction of the generalized similarity measure, our approach also achieves higher verification accuracy than Deepface [52] (85.4%). Note that an explicit face alignment step was adopted in [52] before the CNN feature extraction, whereas our framework does not use one.
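The exhaustive threshold search mentioned above can be sketched as follows. The function name best_threshold and the convention that a larger score means "more likely the same identity" are assumptions; the text does not specify which data split is used to pick the threshold, so that choice is left to the caller.

```python
def best_threshold(scores, labels):
    """Try every observed score as a decision threshold.

    A pair is predicted as 'same identity' when its score is >= threshold.
    Returns (threshold, accuracy) for the threshold with the highest
    verification accuracy; ties keep the smallest such threshold.
    """
    best_t, best_acc = None, -1.0
    for t in sorted(set(scores)):
        correct = sum((s >= t) == bool(y) for s, y in zip(scores, labels))
        acc = correct / len(scores)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc
```

Only the observed scores need to be tried, because the accuracy can change only when the threshold crosses one of them.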

5.3 Sketch-photo Face Verification

Sketch-photo face verification is an interesting yet challenging task, which aims to verify whether a face photo and a hand-drawn face sketch belong to the same individual. This task has an important application in assisting law enforcement, i.e., using a face sketch to find candidate face photos. It is, however, difficult to match photos and sketches across two different modalities. For example, hand drawing may introduce unpredictable face distortion and variation compared to the real photo, and face sketches often lack details that can be important cues for preserving identity.

We evaluate our model on this task using the CUFS dataset [36]. There are 188 face photos in this dataset, in which 88 are selected for training and 100 for testing. Each face has a corresponding sketch that is drawn by the artist. All of these face photos are taken at frontal view with a normal lighting condition and neutral expression.

Fig. 6: Results of the ablation studies demonstrating the effectiveness of each main component of our framework. The CMC curve and recognition rate are used for evaluation. Panels: (a) sim.+hand.fea, CUHK03; (b) sim.+hand.fea, CUHK01; (c) sim.+deep.fea, CUHK03; (d) sim.+deep.fea, MORPH; (e) sim.+deep.fea, COX-V2S; (f) sim.+deep.fea, COX-S2V; (g) deep/hand fea, CUHK03; (h) deep/hand fea, CUHK01. The results of different similarity models are shown using the handcrafted features (in (a) and (b)) and using the deep features (in (c)-(f)), respectively. (g) and (h) show the performances with / without the deep feature learning while keeping the same similarity model.

All of the photos/sketches are resized to a fixed size and cropped to a fixed smaller size at the center with a small random perturbation. During model training, 1,200 pairs of photos and sketches (i.e., 30 individuals with 40 pairs each) are constructed for each iteration. In the testing stage, we use face photos to form the gallery set and treat sketches as the probes.

We employ several existing approaches for comparison: the eigenface transformation based method (ET) [53], the multi-scale Markov random field based method (MRF) [36], and MRF+ [54] (i.e., the lighting and pose robust version of [36]). It is worth mentioning that all of these competing methods need to first synthesize face sketches by photo-sketch transformation, and then measure the similarity between the synthesized sketches and the candidate sketches, while our approach works in an end-to-end way. The quantitative results are reported in Table II. Our method achieves 100% recognition rate on this dataset.

Method        Recognition rate
ET [53]       71.0%
MRF [36]      96.0%
MRF+ [54]     99.0%
Ours          100.0%

TABLE II: Recognition rates on the CUFS dataset for sketch-photo face verification.

5.4 Still-video Face Recognition

Method          V2S       S2V
PSD [55]        9.90%     11.64%
PMD [56]        6.40%     6.10%
PAHD [57]       4.70%     6.34%
PCHD [58]       7.93%     8.89%
PSDML [59]      12.14%    7.04%
PSCL-EA [37]    30.33%    28.39%
Ours            28.45%    29.02%

TABLE III: Recognition rates on the COX face dataset.

Matching person faces across still images and videos is a newly rising task in intelligent visual surveillance. In these applications, the still images (e.g., ID photos) are usually captured under a controlled environment while the faces in surveillance videos are acquired under complex scenarios (e.g., various lighting conditions, occlusions and low resolutions).

For this task, a large-scale still-video face recognition dataset, namely the COX face dataset, has been released recently (the COX face DB is collected by Institute of Computing Technology, Chinese Academy of Sciences, OMRON Social Solutions Co. Ltd, and Xinjiang University). It is an extension of the COX-S2V dataset [60]. The COX face dataset includes 1,000 subjects, each of which has one high-quality still image and 3 video clips captured by 3 cameras, respectively. Since these cameras are deployed under similar environments (e.g., similar results are generated for the three cameras in [37]), we use the data captured by the first camera in our experiments.

Following the setting of the COX face dataset, we divide the data into a training set (300 subjects) and a testing set (700 subjects), and conduct the experiments with 10 random splits. There are two sub-tasks in the testing: i) matching video frames to still images (V2S) and ii) matching still images to video frames (S2V). For the V2S task, we use the video frames as probes and form the gallery set from the still images, and vice versa for the S2V task. The split of gallery/probe sets is also consistent with the protocol required by the dataset creators. All of the images are resized to a fixed size and cropped to a fixed smaller size with a small random perturbation. During model training, 1,200 pairs of still images and video frames (i.e., 20 individuals with 60 pairs each) are constructed for each iteration.

Unlike the traditional image-based verification problems, both V2S and S2V are defined as point-to-set matching problems, i.e., one still image is matched to several video frames (10 sampled frames). In the evaluation, we calculate the distance between the still image and each video frame with our model and output the average over all of these distances. For comparison, we employ several existing point-to-set distance metrics: dual-space linear discriminant analysis (PSD) [55], manifold-manifold distance (PMD) [56], hyperplane-based distance (PAHD) [57], kernelized convex geometric distance (PCHD) [58], and covariance kernel based distance (PSDML) [59]. We also compare with the point-to-set correlation learning (PSCL-EA) method [37], which was specially developed for the COX face dataset. The recognition rates of all competing methods are reported in Table III. Our method achieves the best result in S2V and the second best in V2S. The experiments show that our approach can generally improve performance in image-to-image, image-to-video, and video-to-image matching problems.
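The averaging rule used for point-to-set matching can be sketched as below. The function names and the pair_distance callback (standing in for the learned model, with smaller values meaning a better match) are hypothetical.

```python
def point_to_set_distance(still, frames, pair_distance):
    """Average the pairwise distance between one still image and each frame."""
    dists = [pair_distance(still, frame) for frame in frames]
    return sum(dists) / len(dists)


def identify_v2s(video_frames, gallery_stills, pair_distance):
    """V2S: return the index of the gallery still closest to the probe video."""
    scores = [point_to_set_distance(s, video_frames, pair_distance)
              for s in gallery_stills]
    return min(range(len(scores)), key=scores.__getitem__)


def identify_s2v(still, gallery_videos, pair_distance):
    """S2V: return the index of the gallery video closest to the probe still."""
    scores = [point_to_set_distance(still, frames, pair_distance)
              for frames in gallery_videos]
    return min(range(len(scores)), key=scores.__getitem__)
```

Both sub-tasks reuse the same point-to-set distance; only the roles of probe and gallery are exchanged.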

5.5 Ablation Studies

In order to provide more insight into the performance of our approach, we conduct a number of ablation studies by isolating each main component (e.g., the generalized similarity measure and feature learning). In addition, we study the effect of using sample-pair-based and sample-based batch settings in terms of convergence efficiency.

Generalized Similarity Model. We design two experiments by using handcrafted features and deep features, respectively, to justify the effectiveness of our generalized similarity measure.

(i) We test our similarity measure using fixed handcrafted features for person re-identification. The experimental results on CUHK01 and CUHK03 clearly demonstrate the effectiveness of our model against the other similarity models without relying on deep feature learning. Following [44], we extract the feature representation by using patch-based color histograms and dense SIFT descriptors. This feature representation is fed into a fully connected layer for dimensionality reduction to obtain a 400-dimensional vector. We then invoke the similarity sub-network (described in Section 4) to output the measure. On both CUHK01 and CUHK03, we adopt several representative similarity metrics for comparison, i.e., ITML [5], LDM [39], LMNN [40], and RANK [41], using the same feature representation.

The quantitative CMC curves and the recognition rates of all these competing models are shown in Fig. 6 (a) and (b) for CUHK03 and CUHK01, respectively, where “Generalized” represents our similarity measure. It is observed that our model outperforms the others by large margins, e.g., achieving the rank-1 accuracy of 31.85% against 13.51% by LDM on CUHK03. Most of these competing methods learn Mahalanobis distance metrics. In contrast, our metric model combines Mahalanobis distance with Cosine similarity in a generic form, leading to a more general and effective solution in matching cross-domain data.

(ii) On the other hand, we incorporate several representative similarity measures into our deep architecture and jointly optimize these measures with the CNN feature learning. Specifically, we simplify our network architecture by removing the top layer (i.e., the similarity model), and measure the similarity in either the Euclidean embedding space (as Baseline-1) or the inner-product space (as Baseline-2). These two variants can be viewed as two degenerate forms of our similarity measure (i.e., affine Euclidean distance and affine Cosine similarity). To support our discussion in Section 3.2, we also adopt the two distance metric models LADF [3] and BFR (i.e., Joint Bayesian) [4] in our deep neural networks. Specifically, we replace our similarity model by the LADF model defined in Eqn. (8) and the BFR model defined in Eqn. (9), respectively. Moreover, we implement one more variant (denoted as “Linear” in this experiment), which applies a separate linear transformation to each data modality. That is, we remove the affine transformation while keeping the separate linear transformations by constraining the corresponding terms in Eqn. (1). Note that these metric models are incorporated into the deep architecture in a way analogous to our metric model. The experiment is conducted on four benchmarks: CUHK03, MORPH, COX-V2S and COX-S2V, and the results are shown in Fig. 6 (c), (d), (e), and (f), respectively. Our method outperforms the competing methods by large margins on MORPH and the COX face dataset. On CUHK03 (i.e., Fig. 6 (c)), our method achieves the best rank-1 identification rate among all the methods. In particular, the performance drops when the affine transformation is removed on CUHK03.
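The relation between the two baselines and a weighted combination of the two measures can be illustrated on already-embedded feature vectors. The sketch below is a toy under stated assumptions: the function names, the weights mu and lam, and the sign convention (a smaller value means a better match) are illustrative and are not taken from Eqn. (1).

```python
def squared_euclidean(u, v):
    """Squared Euclidean distance between two embedded feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def inner_product(u, v):
    """Inner-product similarity between two embedded feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def fused_measure(u, v, mu, lam):
    """Weighted combination: distance term minus similarity term.

    mu = 1, lam = 0 reduces to the Euclidean measure (Baseline-1);
    mu = 0, lam = 1 reduces to the negated inner product (Baseline-2).
    """
    return mu * squared_euclidean(u, v) - lam * inner_product(u, v)
```

With both weights non-zero, the measure trades off the two criteria, which is the kind of data-driven combination the full model learns.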

It is interesting to note that most of these competing methods can be treated as special cases of our model. Moreover, with the specific deep architecture we develop, our generalized similarity model can take full advantage of convolutional feature learning and consistently achieves superior performance over the other variants.

Deep Feature Learning. To show the benefit of deep feature learning, we adopt the handcrafted features (i.e., color histograms and SIFT descriptors) on the CUHK01 and CUHK03 benchmarks. Specifically, we extract this feature representation based on the patches of pedestrian images and then build the similarity measure for person re-identification. The results on CUHK03 and CUHK01 are reported in Fig. 6 (g) and (h), respectively. We denote the result obtained with the handcrafted features as “hand.fea + gen.sim” and the result obtained with end-to-end deep feature learning as “deep.fea + gen.sim”. Without the deep feature representation, the performance drops significantly, e.g., from 58.4% to 31.85% on CUHK03 and from 66.5% to 39.5% on CUHK01. These results clearly demonstrate the effectiveness of utilizing deep CNNs for discriminative feature representation learning.

Sample-pair-based vs. sample-based batch setting. In addition, we conduct an experiment to compare the sample-pair-based and sample-based batch settings in terms of convergence efficiency, using the CUHK03 dataset. Specifically, for the sample-based batch setting, we select 600 images from 60 people and construct 60,000 pairs in each training iteration. For the sample-pair-based batch setting, 300 pairs are randomly constructed. Note that each person in CUHK03 has 10 images. Thus, 600 images are included in each iteration, and the training time per iteration is almost the same for both settings. Our experiment compares the rank-1 accuracy reached in the sample-based batch setting after about 175,000 iterations with that reached in the sample-pair-based setting after 300,000 iterations. These results validate the effectiveness of the sample-based form in saving training cost.

6 Conclusion

In this work, we have presented a novel generalized similarity model for cross-domain matching of visual data, which generalizes the traditional two-step methods (i.e., projection and distance-based measure). Furthermore, we integrated our model with feature representation learning by building a deep convolutional architecture. Experiments were performed on several very challenging benchmark datasets for cross-domain matching. The results show that our method outperforms other state-of-the-art approaches.

There are several directions along which we intend to extend this work. The first is to extend our approach for larger scale heterogeneous data (e.g., web and user behavior data), thereby exploring new applications (e.g., rich information retrieval). Second, we plan to generalize the pairwise similarity metric into triplet-based learning for more effective model training.

Derivation of Equation (1)

As discussed in Section 1, we extend the two linear projections and into affine transformations and apply them on samples of different domains, and , respectively. That is, we replace and with and , respectively. Then, the affine Mahalanobis distance is defined as:

where the matrix can be further unfolded as:

(20)

Furthermore, the affine Cosine similarity is defined as the inner product in the space of affine transformations:

The corresponding matrix is,

(22)

We propose to fuse and by a weighted aggregation as follows:

Note that is an affine distance (i.e., nonsimilarity) measure while is an affine similarity measure. Analogous to [14], we adopt () to combine and . The parameters , , and are automatically learned through our learning algorithm. Then, the matrix can be obtained by fusing and :

(24)

where

(25)

In the above equations, we use matrix (vector) variables, i.e., , , , , and , to represent the parameters of the generalized similarity model in a generic form. On one hand, given , , and , these matrix variables can be directly determined using Eqn. (25). On the other hand, if we impose the positive semi-definite constraint on and , it can be proved that once , , , , and are determined there exist at least one solution of , , and , respectively, that is, is guaranteed to be decomposed into the weighted Mahalanobis distance and Cosine similarity. Therefore, the generalized similarity measure can be learned by optimizing , , , , and under the positive semi-definite constraint on and . In addition, is not required to satisfy the positive semidefinite condition and it may not be a square matrix when the dimensions of and are unequal.
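Since the symbols of this derivation did not survive in the text above, the following worked expansion is offered only as a sketch of the kind of algebra involved. All notation is assumed rather than taken from the original: samples x and y, projection matrices L_A and L_B, offset vectors a and b, non-negative weights \mu and \lambda, and block names A, B, C, d, e, f.

```latex
% Assumed notation (hypothetical, chosen for illustration only):
%   x, y      : samples from the two domains
%   L_A, L_B  : projection matrices;  a, b : offset vectors
%   \mu, \lambda \ge 0 : combination weights
\begin{align*}
d_M(x,y) &= \bigl\|(L_A^\top x + a) - (L_B^\top y + b)\bigr\|^2
  = \begin{bmatrix} x^\top & y^\top & 1 \end{bmatrix}
    \mathbf{S}_M
    \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}, \\
\mathbf{S}_M &=
  \begin{bmatrix}
    L_A L_A^\top        & -L_A L_B^\top       & L_A (a-b) \\
    -L_B L_A^\top       & L_B L_B^\top        & L_B (b-a) \\
    (a-b)^\top L_A^\top & (b-a)^\top L_B^\top & \|a-b\|^2
  \end{bmatrix}, \\
s_I(x,y) &= (L_A^\top x + a)^\top (L_B^\top y + b)
  = \begin{bmatrix} x^\top & y^\top & 1 \end{bmatrix}
    \mathbf{S}_I
    \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}, \\
\mathbf{S}_I &=
  \begin{bmatrix}
    0                            & \tfrac{1}{2} L_A L_B^\top    & \tfrac{1}{2} L_A b \\
    \tfrac{1}{2} L_B L_A^\top    & 0                            & \tfrac{1}{2} L_B a \\
    \tfrac{1}{2} b^\top L_A^\top & \tfrac{1}{2} a^\top L_B^\top & a^\top b
  \end{bmatrix}, \\
\mathbf{S} &= \mu\,\mathbf{S}_M - \lambda\,\mathbf{S}_I
  = \begin{bmatrix} A & C & d \\ C^\top & B & e \\ d^\top & e^\top & f \end{bmatrix}, \\
A &= \mu L_A L_A^\top, \qquad
B = \mu L_B L_B^\top, \qquad
C = -\bigl(\mu + \tfrac{\lambda}{2}\bigr) L_A L_B^\top, \\
d &= \mu L_A (a-b) - \tfrac{\lambda}{2} L_A b, \qquad
e = \mu L_B (b-a) - \tfrac{\lambda}{2} L_B a, \qquad
f = \mu \|a-b\|^2 - \lambda\, a^\top b.
\end{align*}
```

Under these assumptions, the two diagonal blocks A and B are positive semi-definite by construction, whereas the off-diagonal block C has as many rows as x has dimensions and as many columns as y has dimensions, so it need not be square or positive semi-definite.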

Acknowledgment

This work was supported in part by Guangdong Natural Science Foundation under Grant S2013050014548 and 2014A030313201, in part by Program of Guangzhou Zhujiang Star of Science and Technology under Grant 2013J2200067, and in part by the Fundamental Research Funds for the Central Universities. This work was also supported by Special Program for Applied Research on Super Computation of the NSFC-Guangdong Joint Fund (the second phase).

References

  • [1] D. Hardoon, S. Szedmak, and J. Shawe-Taylor, “Canonical correlation analysis: An overview with application to learning methods,” Neural Comput., vol. 16, no. 12, pp. 2639–2664, 2004.
  • [2] A. Sharma and D. W. Jacobs, “Bypassing synthesis: Pls for face recognition with pose, low-resolution and sketch,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit, 2011, pp. 593–600.
  • [3] Z. Li, S. Chang, F. Liang, T. S. Huang, L. Cao, and J. R. Smith, “Learning locally-adaptive decision functions for person verification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit, 2013, pp. 3610–3617.
  • [4] D. Chen, X. Cao, L. Wang, F. Wen, and J. Sun, “Bayesian face revisited: A joint formulation,” in Proc. Eur. Conf. Comput. Vis.   Springer, 2012, pp. 566–579.
  • [5] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon, “Information-theoretic metric learning,” in Proc. Int’l Conf. Mach. Learn.   ACM, 2007, pp. 209–216.
  • [6] Y. T. Zhuang, Y. F. Wang, F. Wu, Y. Zhang, and W. M. Lu, “Supervised coupled dictionary learning with group structures for multi-modal retrieval,” in Twenty-Seventh AAAI Conference on Artificial Intelligence, 2013.
  • [7] S. Wang, D. Zhang, Y. Liang, and Q. Pan, “Semi-coupled dictionary learning with applications to image super-resolution and photo-sketch synthesis,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit, 2012, pp. 2216–2223.
  • [8] L. Duan, D. Xu, and I. W. Tsang, “Domain adaptation from multiple sources: A domain-dependent regularization approach,” IEEE Trans. Neural Networks Learn. Syst., vol. 23, no. 3, pp. 504–518, 2012.
  • [9] D. Ramage, D. Hall, R. Nallapati, and C. D. Manning, “Labeled lda: A supervised topic model for credit attribution in multi-labeled corpora,” in Proc. Conf. Empirical Methods in Natural Language Processing.   Association for Computational Linguistics, 2009, pp. 248–256.
  • [10] J. Zhu, A. Ahmed, and E. P. Xing, “Medlda: maximum margin supervised topic models for regression and classification,” in Proc. Int’l Conf. Mach. Learn.   ACM, 2009, pp. 1257–1264.
  • [11] A. Sharma, A. Kumar, H. Daume III, and D. W. Jacobs, “Generalized multiview analysis: A discriminative latent space,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit, 2012, pp. 2160–2167.
  • [12] M. Kan, S. Shan, H. Zhang, S. Lao, and X. Chen, “Multi-view discriminant analysis,” in Proc. Eur. Conf. Comput. Vis.   Springer, 2012, pp. 808–821.
  • [13] P. Luo, L. Lin, and X. Liu, “Learning compositional shape models of multiple distance metrics by information projection.” IEEE Trans. Neural Networks Learn. Syst., 2015.
  • [14] Q. Cao, Y. Ying, and P. Li, “Similarity metric learning for face recognition,” in Proc. Int’l Conf. Comput. Vis.   IEEE, 2013, pp. 2408–2415.
  • [15] H. Chang and D.-Y. Yeung, “Locally smooth metric learning with application to image retrieval,” in Proc. Int’l Conf. Comput. Vis, 2007.
  • [16] A. Mignon and F. Jurie, “Cmml: a new metric learning approach for cross modal matching,” in Proc. Asian Conf. Comput. Vis, 2012.
  • [17] X. Zhai, Y. Peng, and J. Xiao, “Heterogeneous metric learning with joint graph regularization for crossmedia retrieval,” in Twenty-Seventh AAAI Conference on Artificial Intelligence, June 2013.
  • [18] C. Kang, S. Liao, Y. He, J. Wang, S. Xiang, and C. Pan, “Cross-modal similarity learning : A low rank bilinear formulation,” Arxiv, vol. abs/1411.4738, 2014. [Online]. Available: http://arxiv.org/abs/1411.4738
  • [19] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, “Backpropagation applied to handwritten zip code recognition,” Neural Comput., vol. 1, no. 4, pp. 541–551, 1989.
  • [20] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
  • [21] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” arXiv preprint arXiv:1411.4038, 2014.
  • [22] Y. Sun, Y. Chen, X. Wang, and X. Tang, “Deep learning face representation by joint identification-verification,” in Advances in Neural Information Processing Systems, 2014, pp. 1988–1996.
  • [23] F. Xiong, M. Gou, O. Camps, and M. Sznaier, “Person re-identification using kernel-based metric learning methods,” in Proc. Eur. Conf. Comput. Vis.   Springer, 2014, pp. 1–16.
  • [24] R. Zhao, W. Ouyang, and X. Wang, “Learning mid-level filters for person re-identification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.   IEEE, 2014, pp. 144–151.
  • [25] E. Ahmed, M. Jones, and T. K. Marks, “An improved deep learning architecture for person re-identification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.   IEEE, 2015.
  • [26] S. Chen, C. Guo, and J. Lai, “Deep ranking for person re-identification via joint representation learning,” Arxiv, vol. abs/1505.06821, 2015. [Online]. Available: http://arxiv.org/abs/1505.06821
  • [27] G. Andrew, R. Arora, J. Bilmes, and K. Livescu, “Deep canonical correlation analysis,” in Proc. IEEE the 30th Int’l Conf. Mach. Learn., 2013, pp. 1247–1255.
  • [28] W. Li, R. Zhao, T. Xiao, and X. Wang, “Deepreid: Deep filter pairing neural network for person re-identification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit, 2014, pp. 152–159.
  • [29] J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, and Y. Wu, “Learning fine-grained image similarity with deep ranking,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit, 2014, pp. 1386–1393.
  • [30] D. Yi, Z. Lei, and S. Z. Li, “Deep metric learning for practical person re-identification,” arXiv preprint arXiv:1407.4979, 2014.
  • [31] S. Ding, L. Lin, G. Wang, and H. Chao, “Deep feature learning with relative distance comparison for person re-identification,” Pattern Recognition, 2015.
  • [32] W. Li, R. Zhao, and X. Wang, “Human reidentification with transferred metric learning.” in Proc. Asian Conf. Comput. Vis, 2012, pp. 31–44.
  • [33] K. Ricanek and T. Tesafaye, “Morph: A longitudinal image database of normal adult age-progression,” in Proc. IEEE Int’l Conf. Automatic Face and Gesture Recognition.   IEEE, 2006, pp. 341–345.
  • [34] B.-C. Chen, C.-S. Chen, and W. H. Hsu, “Cross-age reference coding for age-invariant face recognition and retrieval,” in Proc. Eur. Conf. Comput. Vis.   Springer, 2014, pp. 768–783.
  • [35] B.-C. Chen, C.-S. Chen, and W. Hsu, “Face recognition and retrieval using cross-age reference coding with cross-age celebrity dataset,” IEEE Trans. Multimedia, vol. 17, no. 6, pp. 804–815, 2015.
  • [36] X. Wang and X. Tang, “Face photo-sketch synthesis and recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 11, pp. 1955–1967, 2009.
  • [37] Z. Huang, S. Shan, R. Wang, H. Zhang, S. Lao, A. Kuerban, and X. Chen, “A benchmark and comparative study of video-based face recognition on COX face database,” IEEE Trans. Image Process., 2015.
  • [38] D. Gray, S. Brennan, and H. Tao, “Evaluating appearance models for recognition, reacquisition, and tracking,” in Proc. IEEE Int’l Workshop on Performance Evaluation for Tracking and Surveillance, vol. 3, no. 5. Citeseer, 2007.
  • [39] M. Guillaumin, J. Verbeek, and C. Schmid, “Is that you? Metric learning approaches for face identification,” in Proc. Int’l Conf. Comput. Vis., 2009, pp. 498–505.
  • [40] K. Q. Weinberger, J. Blitzer, and L. K. Saul, “Distance metric learning for large margin nearest neighbor classification,” in Advances in Neural Information Processing Systems, 2005, pp. 1473–1480.
  • [41] B. McFee and G. R. Lanckriet, “Metric learning to rank,” in Proc. Int’l Conf. Mach. Learn., 2010, pp. 775–782.
  • [42] M. Farenzena, L. Bazzani, A. Perina, V. Murino, and M. Cristani, “Person re-identification by symmetry-driven accumulation of local features,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2010, pp. 2360–2367.
  • [43] M. Kostinger, M. Hirzer, P. Wohlhart, P. M. Roth, and H. Bischof, “Large scale metric learning from equivalence constraints,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2012, pp. 2288–2295.
  • [44] R. Zhao, W. Ouyang, and X. Wang, “Unsupervised salience learning for person re-identification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2013, pp. 3586–3593.
  • [45] R. Zhang, L. Lin, R. Zhang, W. Zuo, and L. Zhang, “Bit-scalable deep hashing with regularized similarity learning for image retrieval,” IEEE Trans. Image Process., vol. 24, no. 12, pp. 4766–4779, 2015.
  • [46] D. Gong, Z. Li, D. Lin, J. Liu, and X. Tang, “Hidden factor analysis for age invariant face recognition,” in Proc. Int’l Conf. Comput. Vis., 2013, pp. 2872–2879.
  • [47] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller, “Labeled faces in the wild: A database for studying face recognition in unconstrained environments,” University of Massachusetts, Amherst, Tech. Rep. 07-49, 2007.
  • [48] D. Bouchaffra, “Mapping dynamic Bayesian networks to α-shapes: Application to human faces identification across ages,” IEEE Trans. Neural Networks Learn. Syst., vol. 23, no. 8, pp. 1229–1241, 2012.
  • [49] Z. Li, U. Park, and A. K. Jain, “A discriminative model for age invariant face recognition,” IEEE Trans. Inf. Forensics Secur., vol. 6, no. 3, pp. 1028–1037, 2011.
  • [50] U. Park, Y. Tong, and A. K. Jain, “Age-invariant face recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 5, pp. 947–954, 2010.
  • [51] D. Chen, X. Cao, F. Wen, and J. Sun, “Blessing of dimensionality: High-dimensional feature and its efficient compression for face verification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2013, pp. 3025–3032.
  • [52] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, “DeepFace: Closing the gap to human-level performance in face verification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. IEEE, 2014, pp. 1701–1708.
  • [53] X. Tang and X. Wang, “Face sketch recognition,” IEEE Trans. Circuits Syst. Video Technol., vol. 14, no. 1, pp. 50–57, 2004.
  • [54] W. Zhang, X. Wang, and X. Tang, “Lighting and pose robust face sketch synthesis,” in Proc. Eur. Conf. Comput. Vis. Springer, 2010, pp. 420–433.
  • [55] X. Wang and X. Tang, “Dual-space linear discriminant analysis for face recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., vol. 2, 2004, pp. II–564.
  • [56] R. Wang, S. Shan, X. Chen, Q. Dai, and W. Gao, “Manifold–manifold distance and its application to face recognition with image sets,” IEEE Trans. Image Process., vol. 21, no. 10, pp. 4466–4479, 2012.
  • [57] P. Vincent and Y. Bengio, “K-local hyperplane and convex distance nearest neighbor algorithms,” in Advances in Neural Information Processing Systems, 2001, pp. 985–992.
  • [58] H. Cevikalp and B. Triggs, “Face recognition based on image sets,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2010, pp. 2567–2573.
  • [59] P. Zhu, L. Zhang, W. Zuo, and D. Zhang, “From point to set: Extend the learning of distance metrics,” in Proc. Int’l Conf. Comput. Vis. IEEE, 2013, pp. 2664–2671.
  • [60] Z. Huang, S. Shan, H. Zhang, S. Lao, A. Kuerban, and X. Chen, “Benchmarking still-to-video face recognition via partial and local linear discriminant analysis on COX-S2V dataset,” in Proc. Asian Conf. Comput. Vis. Springer, 2013, pp. 589–600.