1 Introduction
Due to robustness against varying imaging conditions, Riemannian manifolds have proven powerful representations for video sequences in many branches of computer vision. Two of the most popular Riemannian structures are the manifold of linear subspaces (i.e., Grassmann manifold) and the manifold of Symmetric Positive Definite (SPD) matrices. From a different perspective, these Riemannian representations can be related to modeling a video with a multivariate Gaussian distribution, characterized by its mean and covariance matrix. In the case of the Grassmann manifold, the distances between subspaces can be reduced to distances between multivariate Gaussian distributions by treating linear subspaces as the flattened limit of a zero-mean, homogeneous factor analyzer distribution [1]. In the case of the SPD manifold, a sequence of video frames is represented as the covariance matrix of the image features of frames [2, 3], which therefore essentially encodes a zero-mean Gaussian distribution of the image features. In [4, 5], each video is modeled as a Gaussian distribution with non-zero mean and covariance matrix, which can be combined to construct an SPD matrix, and thus also resides on a specific SPD manifold [6, 7].
The success of Riemannian representations in visual recognition is mainly due to the learning of more discriminant metrics, which encode Riemannian geometry of the underlying manifolds. For example, by exploiting the geometry of the Grassmann manifolds, [8] proposed Grassmann kernel functions to extend the existing Kernel Linear Discriminant Analysis [9] to learn a metric between Grassmannian representations. In [10], a new method is presented to learn a Riemannian metric on a Grassmann manifold by performing a Riemannian geometry-aware dimensionality reduction from the original Grassmann manifold to a lower-dimensional, more discriminative Grassmann manifold where more favorable classification can be achieved. To learn a discriminant metric for an SPD representation, [2] derived a kernel function that explicitly maps the SPD representations from the SPD manifold to a Euclidean space where a traditional metric learning method such as Partial Least Squares [11] can be applied. In [12], an approach is proposed to search for a projection that yields a low-dimensional SPD manifold with maximum discriminative power, encoded via an affinity-weighted similarity measure based on Riemannian metrics on the SPD manifold.
In this paper, we focus on studying the application of Riemannian metric learning to the problem of video-based face recognition, which identifies a subject from his/her face video sequences. Generally speaking, there exist three typical tasks of video-based face recognition, i.e., Video-to-Still (V2S), Still-to-Video (S2V) and Video-to-Video (V2V) face identification/verification. Specifically, the task of V2S face recognition matches a query video sequence against still face images, which are typically taken in a controlled setting. This task commonly occurs in watch list screening systems. In contrast, in the task of S2V face recognition, a still face image is queried against a database of video sequences, which can be applied to locate a person of interest by searching for his/her identity in the stored surveillance videos. The third task, i.e., V2V face recognition, looks for the same face in input video sequences among a set of target video sequences. For example, one could track a person by matching his/her video sequences recorded somewhere against the surveillance videos taken elsewhere.
To handle the tasks of video-based face recognition, state-of-the-art deep feature learning methods [13, 14, 15, 16, 17] typically adopt a mean pooling strategy to fuse deep features from single frames within one face video. However, in addition to the first-order mean pooling, as studied in many works such as [18, 19, 20, 4, 5], the pattern variation (i.e., second-order pooling) of videos provides an important complementary cue for video-based face recognition. With this motivation in mind, we propose a new metric learning scheme across a Euclidean space and a Riemannian manifold to match/fuse appearance mean and pattern variation (i.e., first- and second-order poolings) for still images and video sequences. In particular, the learning scheme employs either raw or even deeply learned feature vectors (i.e., Euclidean data) from still facial images, while representing the faces within one video sequence with both their appearance mean (i.e., Euclidean data) and pattern variation models that are typically treated as Riemannian data. Benefiting from the new architecture, the three typical video-based face recognition tasks can be uniformly tackled. Compared with the previous version [21] that can only handle V2S/S2V face recognition with pattern variation modeling on videos, this paper mainly makes two technical improvements:
To improve the V2S/S2V face recognition task, the new framework represents each video simultaneously with appearance mean and pattern variation models, and derives a more generalized cross Euclidean and Riemannian metric learning scheme to match still images and video sequences.

The original framework is also extended to handle the task of V2V face recognition. To this end, the objective function of the framework is adapted to fuse Euclidean data (i.e., appearance mean) and Riemannian data (i.e., pattern variation) of videos in a unified framework.
The key challenge of Euclidean-to-Riemannian metric learning is the essentially heterogeneous properties of the underlying data spaces, i.e., Euclidean spaces and Riemannian manifolds, which respect totally different geometrical structures and thus are equipped with different metrics, i.e., Euclidean distance and Riemannian metric respectively. As a result, most of the traditional metric learning methods illustrated in Fig.8(a), (b), (c) and (d) break down in the context of learning a metric across a Euclidean space and a Riemannian manifold. For example, Euclidean-to-Euclidean metric learning (Fig.8(b)) merely learns a discriminative distance metric between two Euclidean spaces with different data domain settings, while Riemannian-to-Riemannian metric learning (Fig.8(c)) only explores a discriminative function across two homogeneous Riemannian manifolds. Hence, in the metric learning theory, this work mainly brings the following two innovations:

As depicted in Fig.8(e), a novel heterogeneous metric learning framework is developed to match/fuse Euclidean and Riemannian representations by designing a new objective function that performs well across Euclidean-Riemannian spaces. To the best of our knowledge, it is one of the first attempts to learn the metric across Euclidean and Riemannian spaces.

The proposed metric learning scheme can accommodate a group of typical non-Euclidean (Riemannian) representations widely used in vision problems, e.g., linear subspaces, affine subspaces and SPD matrices. Thus, it is a general metric learning framework to study the problem of fusing/matching hybrid Euclidean and Riemannian data.
2 Related Work
In this section we review relevant Euclidean metric learning and Riemannian metric learning methods. In addition, we also introduce existing applications of Riemannian metric learning to the problem of video-based face recognition.
2.1 Euclidean Metric Learning
In conventional techniques to learn a metric in a Euclidean space, the learned distance metric is usually defined as a Mahalanobis distance, which is the squared Euclidean distance after applying the learned linear transformation(s) to the original Euclidean space(s). According to the number of the source Euclidean spaces, traditional metric learning methods can be categorized into Euclidean metric learning and Euclidean-to-Euclidean metric learning.
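The relation between a learned linear transformation and the Mahalanobis distance can be illustrated with a minimal NumPy sketch. The function names and the toy matrix `L` below are illustrative assumptions, not part of any cited method; the sketch only checks that the squared Euclidean distance after the map `L` equals the quadratic form with `M = L^T L`.

```python
import numpy as np

def mahalanobis_sq_via_map(x, y, L):
    """Squared Euclidean distance after applying the linear map L."""
    diff = L @ (x - y)
    return float(diff @ diff)

def mahalanobis_sq_via_matrix(x, y, M):
    """Same quantity written as the quadratic form (x - y)^T M (x - y)."""
    diff = x - y
    return float(diff @ M @ diff)

# Toy example (hypothetical values): L stretches the first coordinate.
L = np.array([[2.0, 0.0],
              [0.0, 1.0]])
M = L.T @ L
x = np.array([1.0, 1.0])
y = np.array([0.0, 0.0])
d_map = mahalanobis_sq_via_map(x, y, L)
d_mat = mahalanobis_sq_via_matrix(x, y, M)
```

Learning `L` (or equivalently a positive semidefinite `M`) is what the Euclidean metric learning methods reviewed in this subsection optimize, each under its own criterion.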
As shown in Fig.8(a), the Euclidean metric learning methods [22, 23, 24, 25, 26] intend to learn a metric or a transformation of object features from a source Euclidean space to a target Euclidean space. For example, [23] introduced an information-theoretic formulation to learn a metric in one source Euclidean space. In [25], a metric learning method was proposed to learn a transformation from one Euclidean space to a new one for the K-nearest neighbor algorithm by pulling neighboring objects of the same class closer together and pushing others further apart. In [26], an approach was presented to learn a distance metric between (realistic and virtual) data in a single Euclidean space for the definition of a more appropriate point-to-set distance in the application of point-to-set classification.
In contrast, as depicted in Fig.8(b), the Euclidean-to-Euclidean metric learning methods [27, 28, 29, 30, 31, 32, 33] are designed to learn a cross-view metric or multiple transformations mapping object features from multiple source Euclidean spaces to a target common subspace. For instance, [30] proposed a metric learning method to seek multiple projections under a neighborhood preserving constraint for multi-view data in multiple Euclidean spaces. In [31], a multiple kernel/metric learning technique was applied to integrate different object features from multiple Euclidean spaces into a unified Euclidean space. In [33], a metric learning method was developed to learn two projections from two different Euclidean spaces to a common subspace by integrating the structure of cross-view data into a joint graph regularization.
2.2 Riemannian Metric Learning
Riemannian metric learning pursues discriminant functions on Riemannian manifold(s) in order to classify the Riemannian representations more effectively. In general, existing Riemannian metric learning (Fig.8(c)) and Riemannian-to-Riemannian metric learning (Fig.8(d)) adopt one of the following three typical schemes to achieve a more desirable Riemannian metric on/across Riemannian manifold(s).
The first Riemannian metric learning scheme [34, 35, 36, 37, 38, 39] typically first flattens the underlying Riemannian manifold via tangent space approximation, and then learns a discriminant metric in the resulting tangent (Euclidean) space by employing traditional metric learning methods. However, the map between the manifold and the tangent space is locally diffeomorphic, which inevitably distorts the original Riemannian geometry. To address this problem, LogitBoost on SPD manifolds [34] was introduced, by pooling the resulting classifiers in multiple approximated tangent spaces on the calculated Karcher mean on Riemannian manifolds. Similarly, a weighted Riemannian locality preserving projection is exploited by [38] during boosting for classification on Riemannian manifolds.
Another family of Riemannian metric learning methods [8, 1, 40, 2, 3, 41, 42, 43, 4] derives Riemannian metric based kernel functions to embed the Riemannian manifolds into a high-dimensional Reproducing Kernel Hilbert Space (RKHS). As an RKHS respects Euclidean geometry, this learning scheme enables the traditional kernel-based metric learning methods to work in the resulting RKHS. For example, in [8, 1, 40, 3], the projection metric based kernel and its extensions were introduced to map the underlying Grassmann manifold to an RKHS, where kernel learning algorithms developed in vector spaces can be extended to their counterparts. To learn discriminant data on the SPD manifolds, [2, 3, 41, 42, 43, 4] exploited some well-studied Riemannian metrics such as the Log-Euclidean metric [44], to derive positive definite kernels on manifolds that permit embedding a given manifold with a corresponding metric into a high-dimensional RKHS.
The last kind of Riemannian metric learning [45, 12, 10, 46] learns the metric by mapping the original Riemannian manifold to another one equipped with the same Riemannian geometry. For instance, in [12], a metric learning algorithm was introduced to map a high-dimensional SPD manifold into a lower-dimensional, more discriminant one. This work proposed a graph embedding formalism with an affinity matrix that encodes intra-class and inter-class distances based on affine-invariant Riemannian metrics [47, 48] on the SPD manifold. Analogously, on the Grassmann manifold, [10] proposed a new Riemannian metric learning to learn a Mahalanobis-like matrix that can be decomposed into a manifold-to-manifold transformation for geometry-aware dimensionality reduction.
In contrast to Riemannian metric learning performed on a single Riemannian manifold, Riemannian-to-Riemannian metric learning [49, 20] typically learns multiple Riemannian metrics across different types of Riemannian manifolds by employing the second Riemannian metric learning scheme mentioned above. For example, in [49], multiple Riemannian manifolds were first mapped into multiple RKHSs, and a feature combining and selection method based on a traditional Multiple Kernel Learning technique was then introduced to optimally combine the multiple transformed data lying in the resulting RKHSs. Similarly, [20] adopted multiple traditional metric learning methods to fuse the classification scores on multiple Riemannian representations by employing Riemannian metric based kernels on their underlying Riemannian manifolds.
2.3 Riemannian Metric Learning Applied to Video-based Face Recognition
State-of-the-art methods [8, 1, 50, 36, 2, 19, 12, 4, 5, 46] typically model each video sequence of faces with a variation model (e.g., linear subspace, affine subspace and SPD matrices) and learn a discriminant Riemannian metric on the underlying Riemannian manifold for robust video-based face recognition. For example, [8] represented each set of video frames by a linear subspace of their image features. By exploiting the geometry of the underlying Grassmann manifold of linear subspaces, they extended the Kernel Linear Discriminant Analysis method to learn discriminative linear subspaces. As studied in [1], image sets are more robustly modeled by affine subspaces, each of which is obtained by adding an offset (i.e., the data mean) to one linear subspace. Analogously to [8], an affine Grassmann manifold and its Riemannian geometry were exploited by [1] for affine subspace discriminant learning. In [2], each video is modeled as a covariance matrix, which is then treated as an SPD matrix residing on the SPD manifold. To learn discriminative SPD matrices, they applied traditional discriminant analysis methods such as Partial Least Squares on the manifold of SPD matrices. Accordingly, the success of these methods mainly derives from the effective video modeling with Riemannian representations and discriminant metric learning on such Riemannian data.
3 Cross Euclidean-to-Riemannian Metric Learning
In this section, we first formulate the new problem of Cross Euclidean-to-Riemannian Metric Learning (CERML), and then present its objective function. In the end, we develop an optimization algorithm to solve the objective function.
3.1 Problem Formulation
Let $X = \{x_1, x_2, \ldots, x_m\}$, $x_i \in \mathbb{R}^{D}$ be a set of Euclidean data with class labels $\{l^x_1, l^x_2, \ldots, l^x_m\}$ and $Y = \{y_1, y_2, \ldots, y_n\}$, $y_j \in \mathcal{M}$ be a set of Riemannian representations with class labels $\{l^y_1, l^y_2, \ldots, l^y_n\}$, where $y_j$ come as a certain type of Riemannian representations such as linear subspaces, affine subspaces, or SPD matrices.
Given one pair of a Euclidean point $x_i$ and a Riemannian point $y_j$, we use $d(x_i, y_j)$ to represent their distance. To achieve an appropriate distance metric between them for better discrimination, we propose to learn two transformation functions $\psi_x$ and $\psi_y$, which respectively map the Euclidean points and Riemannian points to a common Euclidean subspace. In the common subspace, the learned distance metric between the involved pair of heterogeneous data can be reduced to the classical Euclidean distance as:
$$d(x_i, y_j) = \sqrt{(\psi_x(x_i) - \psi_y(y_j))^{T}(\psi_x(x_i) - \psi_y(y_j))}. \tag{1}$$
However, as the source Euclidean space and Riemannian manifold differ too much in terms of geometrical data structure, it is difficult to employ linear transformations to map them into the common target Euclidean subspace. This motivates us to first transform the Riemannian manifold to a flat Euclidean space so that the heterogeneity between this flattened space and the source Euclidean space reduces. To this end, there exist two strategies: one is tangent space embedding, and the other is Reproducing Kernel Hilbert Space (RKHS) embedding. The first strategy pursues an appropriate tangent space to approximate the local geometry of the Riemannian manifold. In contrast, the second strategy exploits typical Riemannian metrics based kernel functions to encode Riemannian geometry of the underlying manifold. As evidenced by the theory of kernel methods in Euclidean spaces, compared with the tangent space embedding scheme, the RKHS embedding yields much richer high-dimensional representations of the original data, making visual classification tasks easier.
With this idea in mind, and as shown in Fig.2, the proposed framework of Cross Euclidean-to-Riemannian Metric Learning (CERML) first derives the kernel functions based on typical Euclidean and Riemannian metrics to define the inner product of the implicit nonlinear transformations $\varphi_x$ and $\varphi_y$, which respectively map the Euclidean space and the Riemannian manifold into two RKHSs $\mathcal{H}_x$, $\mathcal{H}_y$. After the kernel space embedding, two mappings $f_x$, $f_y$ are learned from the two RKHSs to the target common subspace. Thus, the final goal of this new framework is to employ the two mappings $\psi_x$ and $\psi_y$ to transform the original Euclidean data and Riemannian data into the common subspace, where the distance metric between each pair of Euclidean data point $x_i$ and Riemannian data point $y_j$ is reduced to the classical Euclidean distance defined in Eq.1. In particular, the two linear projections can be represented as $f_x(\varphi_x(x_i)) = V_x^{T}\varphi_x(x_i)$, $f_y(\varphi_y(y_j)) = V_y^{T}\varphi_y(y_j)$, where $V_x$, $V_y$ are two linear projection matrices. Inspired by the classical kernel techniques, we employ the corresponding kernel functions to derive the inner products of these two nonlinear transformations as $\langle \varphi_x(x_i), \varphi_x(x_j) \rangle = K_x(x_i, x_j)$, $\langle \varphi_y(y_i), \varphi_y(y_j) \rangle = K_y(y_i, y_j)$, where $K_x$, $K_y$ are the kernel matrices involved. By parameterizing the inner products in the two RKHSs, the formulations of the two final mapping functions $\psi_x$ and $\psi_y$ can be achieved by $\psi_x(x_i) = W_x^{T} K_{x.i}$, $\psi_y(y_j) = W_y^{T} K_{y.j}$, where $K_{x.i}$, $K_{y.j}$ are respectively the $i$-th and $j$-th columns of the kernel matrices $K_x$, $K_y$. Accordingly, the distance metric Eq.1 between a pair of a Euclidean point and a Riemannian representation can be further formulated as:
$$d(x_i, y_j) = \sqrt{(W_x^{T} K_{x.i} - W_y^{T} K_{y.j})^{T}(W_x^{T} K_{x.i} - W_y^{T} K_{y.j})}. \tag{2}$$
Additionally, according to the above mapping mode, the distance metric between each pair of transformed homogeneous data points in the common Euclidean subspace can also be achieved as:
$$d(x_i, x_j) = \sqrt{(W_x^{T} K_{x.i} - W_x^{T} K_{x.j})^{T}(W_x^{T} K_{x.i} - W_x^{T} K_{x.j})}, \tag{3}$$
$$d(y_i, y_j) = \sqrt{(W_y^{T} K_{y.i} - W_y^{T} K_{y.j})^{T}(W_y^{T} K_{y.i} - W_y^{T} K_{y.j})}, \tag{4}$$
where the specific forms of $K_x$ and $K_y$ will be presented in the following.
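The distances in Eq.2 to Eq.4 can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: each sample is embedded as a transformation matrix (transposed) times the corresponding column of its kernel matrix, and the names `W_x`, `W_y`, `K_x`, `K_y`, `embed` and `pairwise_euclidean` are hypothetical; the toy matrices are not learned values.

```python
import numpy as np

def embed(W, K):
    """Map all samples to the common subspace; column i equals W^T K[:, i]."""
    return W.T @ K

def pairwise_euclidean(U, V):
    """Euclidean distances between the columns of U (d x m) and V (d x n)."""
    diff = U[:, :, None] - V[:, None, :]
    return np.sqrt((diff ** 2).sum(axis=0))

# Toy kernel matrices and transformations (hypothetical values).
K_x = np.eye(2)                      # kernel matrix of 2 Euclidean samples
K_y = np.eye(3)                      # kernel matrix of 3 Riemannian samples
W_x = np.array([[1.0, 0.0],
                [0.0, 1.0]])         # 2 samples -> 2-D common subspace
W_y = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [1.0, 1.0]])         # 3 samples -> 2-D common subspace

U = embed(W_x, K_x)
V = embed(W_y, K_y)
D_xy = pairwise_euclidean(U, V)      # heterogeneous pairs (Eq.2)
D_xx = pairwise_euclidean(U, U)      # Euclidean pairs (Eq.3)
D_yy = pairwise_euclidean(V, V)      # Riemannian pairs (Eq.4)
```

Once both kinds of data live in the same subspace, heterogeneous and homogeneous pairs are compared with the same plain Euclidean distance, which is the point of the construction.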
Now, we need to define the kernel functions $K_x$, $K_y$ for Euclidean data and Riemannian representations. For the Euclidean data, without loss of generality, we exploit the Radial Basis Function (RBF) kernel, which is one of the most popular positive definite kernels. Formally, given a pair of data points $x_i$, $x_j$ in Euclidean space, the kernel function with bandwidth $\sigma_x$ is defined as:
$$K_x(x_i, x_j) = \exp\left(-\|x_i - x_j\|^2 / 2\sigma_x^2\right), \tag{5}$$
which actually employs the Euclidean distance between two Euclidean points $x_i$ and $x_j$.
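As a minimal sketch of the RBF kernel on Euclidean data, the following NumPy function builds the full kernel matrix, assuming the standard form exp(-||x_i - x_j||^2 / (2 sigma^2)). The function name and the bandwidth value are illustrative choices, not settings from the paper.

```python
import numpy as np

def rbf_kernel_matrix(X, sigma):
    """RBF kernel matrix for the rows of X (m x D) with bandwidth sigma."""
    sq_norms = np.sum(X ** 2, axis=1)
    # Squared Euclidean distances via ||a||^2 + ||b||^2 - 2 a.b
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    d2 = np.maximum(d2, 0.0)  # guard against tiny negative round-off
    return np.exp(-d2 / (2.0 * sigma ** 2))

# Two toy points at Euclidean distance 5.
X = np.array([[0.0, 0.0],
              [3.0, 4.0]])
K = rbf_kernel_matrix(X, sigma=5.0)
```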
As for the Riemannian representations, since they are commonly defined on a specific type of Riemannian manifold that respects a non-Euclidean geometry [8, 1, 2], the above kernel function formulation will fail. So, it has to be generalized to Riemannian manifolds. For this purpose, given two elements $y_i$, $y_j$ on a Riemannian manifold, we formally define a generalized kernel function for them as:
$$K_y(y_i, y_j) = \exp\left(-d^2(y_i, y_j) / 2\sigma_y^2\right). \tag{6}$$
The kernel function performed on Riemannian representations actually takes the form of a Gaussian function (note that we also study the linear kernel case in the supplementary material). The most important component in such a kernel function is $d(y_i, y_j)$, which defines the distance between one pair of Riemannian points on the underlying Riemannian manifold. Next, this distance is discussed for three typical Riemannian representations, i.e., Grassmannian data, affine Grassmannian data and SPD data.
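The generalized Gaussian kernel only needs a pairwise distance on the manifold, so it can be sketched with a pluggable distance function. The helper below is an illustrative assumption (its name, the callable `dist` and the bandwidth are not from the paper); any of the Riemannian distances discussed next could be passed in.

```python
import numpy as np

def gaussian_kernel_from_distance(points, dist, sigma):
    """Kernel matrix exp(-d^2 / (2 sigma^2)) for an arbitrary distance `dist`."""
    n = len(points)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = dist(points[i], points[j])
    return np.exp(-D ** 2 / (2.0 * sigma ** 2))

# Toy check with scalars and the absolute difference standing in for a manifold distance.
K_toy = gaussian_kernel_from_distance([0.0, 2.0], lambda a, b: abs(a - b), sigma=1.0)
```

Note that a Gaussian of an arbitrary distance is not automatically positive definite; that property depends on the chosen metric.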
1) For Grassmannian representations
As studied in [51, 8, 1, 40, 3], each Riemannian representation on a Grassmann manifold $\mathcal{G}(q, D)$ refers to a $q$-dimensional linear subspace of $\mathbb{R}^{D}$. The linear subspace can be represented by its orthonormal basis matrix $Y \in \mathbb{R}^{D \times q}$ that is formed by the $q$ leading eigenvectors corresponding to the $q$ largest eigenvalues of the covariance matrix of one Euclidean data set. On a Grassmann manifold, one of the most popular Riemannian metrics is the projection metric [51]. Formally, for one pair of data $Y_i$, $Y_j$ on the Grassmannian, their distance is measured by the projection metric:
$$d(Y_i, Y_j) = 2^{-1/2}\,\|Y_i Y_i^{T} - Y_j Y_j^{T}\|_F, \tag{7}$$
where $\|\cdot\|_F$ denotes the matrix Frobenius norm.
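A small NumPy sketch of the Grassmannian case follows: an orthonormal basis is taken from the leading eigenvectors of a data covariance matrix, and two subspaces are compared with the projection metric, assumed here in its usual form 2^(-1/2) ||Y1 Y1^T - Y2 Y2^T||_F. The function names are hypothetical, and centering the data before computing the covariance is an assumption of this sketch.

```python
import numpy as np

def subspace_basis(F, q):
    """Orthonormal D x q basis from the q leading eigenvectors of the covariance of F (N x D)."""
    Fc = F - F.mean(axis=0)                    # assumption: centered covariance
    C = Fc.T @ Fc / max(F.shape[0] - 1, 1)
    _, vecs = np.linalg.eigh(C)                # eigenvalues in ascending order
    return vecs[:, ::-1][:, :q]                # keep the q leading eigenvectors

def projection_metric(Y1, Y2):
    """Projection metric between subspaces with orthonormal bases Y1 and Y2."""
    return float(np.linalg.norm(Y1 @ Y1.T - Y2 @ Y2.T, "fro") / np.sqrt(2.0))

# Two orthogonal lines in R^3 are at distance 1; a basis and its negation span the same line.
e1 = np.array([[1.0], [0.0], [0.0]])
e2 = np.array([[0.0], [1.0], [0.0]])
d_orth = projection_metric(e1, e2)
d_same = projection_metric(e1, -e1)

# Toy data whose variance is concentrated along the first axis.
F = np.array([[-2.0, 0.0, 0.0],
              [2.0, 0.0, 0.0],
              [0.0, 0.1, 0.0],
              [0.0, -0.1, 0.0]])
Y = subspace_basis(F, q=1)
d_fit = projection_metric(Y, e1)
```

Because the metric depends only on the projector `Y Y^T`, it is invariant to the sign or rotation of the basis inside the subspace.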
2) For affine Grassmannian representations
In contrast to the Grassmannian representation, each affine Grassmannian point is an element on an affine Grassmann manifold, which is the space of dimensional affine subspaces named affine Grassmann manifold . Therefore, each Riemannian representation on is an affine subspace spanned by an orthonormal matrix adding the offset (i.e., the mean) from the origin. On the affine Grassmann manifold, [1] defined a similarity function as between pairs of data points. Alternatively, we extend the similarity function to a distance metric between two Riemannian data on the affine manifold as:
(8)  
where
is the identity matrix.
3) For SPD representations
Each SPD representation is an element of the manifold of Symmetric Positive Definite (SPD) matrices of size $D \times D$. As studied in [47, 44, 2, 43], the set of SPD matrices yields a Riemannian manifold when endowed with a specific Riemannian metric. One of the most commonly used SPD Riemannian metrics is the Log-Euclidean metric [44] due to its effectiveness in encoding the true Riemannian geometry by reducing the manifold to a flat tangent space at the identity matrix. Formally, on the Riemannian SPD manifold, the Log-Euclidean distance metric between two elements $C_i$, $C_j$ is given by classical Euclidean computations in the domain of SPD matrix logarithms as:
$$d(C_i, C_j) = \|\log(C_i) - \log(C_j)\|_F, \tag{9}$$
where $\log(C) = U \log(\Sigma) U^{T}$ with $C = U \Sigma U^{T}$ being the eigendecomposition of the SPD matrix $C$.
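The Log-Euclidean distance can be sketched directly from the eigendecomposition described above: take the matrix logarithm of each SPD matrix by applying the scalar logarithm to its eigenvalues, then measure the Frobenius norm of the difference. The function names below are illustrative.

```python
import numpy as np

def spd_log(C):
    """Matrix logarithm of an SPD matrix via its eigendecomposition C = U diag(w) U^T."""
    w, U = np.linalg.eigh(C)
    return (U * np.log(w)) @ U.T      # equals U diag(log w) U^T

def log_euclidean_distance(C1, C2):
    """Frobenius norm of the difference of the two matrix logarithms."""
    return float(np.linalg.norm(spd_log(C1) - spd_log(C2), "fro"))

# log(I) = 0 and log(e * I) = I, so the distance in the 2 x 2 case is ||I||_F = sqrt(2).
C1 = np.eye(2)
C2 = np.e * np.eye(2)
d_le = log_euclidean_distance(C1, C2)
```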
Similar to our prior work [21], we denote the proposed CERML working in the three studied settings by CERML-EG, CERML-EA and CERML-ES, respectively. By studying the Riemannian metrics defined in Eq.7, Eq.8 and Eq.9, the kernel function corresponding to the specific type of Riemannian manifold can be derived by employing Eq.6. However, according to Mercer's theorem, only positive definite kernels yield a valid RKHS. By employing the approach developed in [43], the positive definiteness of these Gaussian kernels defined on the resulting Riemannian manifolds can be proven. For the details of the proof, readers are referred to [43].
3.2 Objective Function
From Eq.2, Eq.3 and Eq.4, we find that the CERML contains two parameter transformation matrices. In order to learn a discriminant metric between heterogeneous data, we formulate the objective function of this new framework to optimize the two matrices in the following:
(10)  
where is the distance constraint defined on the collections of similarity and dissimilarity constraints. and are, respectively, a geometry constraint and a transformation constraint, both of which are regularizations defined on the target transformation matrices . are balancing parameters.
Distance constraint : This constraint is defined so that the distances between the Euclidean data and the Riemannian data – with the similarity (/dissimilarity) constraints – are minimized (/maximized). In this paper, we adopt a classical expression of the sum of squared distances to define this constraint as:
(11)  
where indicates if the heterogeneous data and are relevant or irrelevant, as inferred from the class label. To balance the effect of similarity and dissimilarity constraints, we normalize the elements of by averaging them over the total number of similar/dissimilar pairs respectively.
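The distance constraint can be sketched as a weighted sum of squared cross-space distances. The exact expression of Eq.11 is not reproduced here, so the sign convention (+1 for same-class pairs, -1 for different-class pairs), the factor 1/2 and the function names are assumptions of this sketch; only the normalization by the number of similar and dissimilar pairs follows the description above.

```python
import numpy as np

def pair_weights(labels_x, labels_y):
    """Weights A(i, j): positive for same-class pairs, negative otherwise,
    each group normalized by its number of pairs (assumed convention)."""
    lx = np.asarray(labels_x)[:, None]
    ly = np.asarray(labels_y)[None, :]
    same = lx == ly
    A = np.where(same, 1.0, -1.0)
    n_sim = int(same.sum())
    n_dis = same.size - n_sim
    if n_sim > 0:
        A[same] /= n_sim
    if n_dis > 0:
        A[~same] /= n_dis
    return A

def distance_constraint(A, D):
    """Weighted sum of squared distances, with D[i, j] the distance between x_i and y_j."""
    return 0.5 * float(np.sum(A * D ** 2))

A = pair_weights([0, 1], [0, 1, 1])
value = distance_constraint(A, np.ones((2, 3)))
```

With this normalization the similar and dissimilar pairs contribute equally in total, regardless of how unbalanced the two groups are.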
Geometry constraint : This constraint aims to preserve Euclidean geometry and Riemannian geometry for the Euclidean and Riemannian points, respectively. Thus, it can be defined as: , which are Euclidean and Riemannian geometry preserving items formulated as:
(12)  
(13)  
where , indicates Euclidean data or Riemannian data . () means data is in the () neighborhood of data or data is in the () neighborhood of data .
Transformation constraint : Since Euclidean distance will be used in the target common subspace where all dimensions are treated uniformly, it is reasonable to require the feature vectors satisfy an isotropic distribution. Thus, this constraint can be expressed in terms of unit covariance:
(14) 
where is the Frobenius norm.
3.3 Optimization Algorithm
To optimize the objective function Eq.10, we develop an iterative optimization algorithm, which first applies the Fisher criterion of Fisher Discriminant Analysis (FDA) [52] to initialize the two transformation matrices, and then employs a strategy of alternately updating their values.
Before introducing the optimization algorithm, we first rewrite Eq.11, Eq.12 and Eq.13 in matrix formulation as:
(15)  
(16)  
(17)  
where , , and are diagonal matrices with , , , .
Initialization. We define the within-class template and between-class template for the pairwise indicator matrix in Eq.11 as:
(18) 
By substituting Eq.18 into Eq.15, the within-class template and between-class template for the distance constraint in Eq.11 can be computed as:
(19)  
(20)  
Likewise, we obtain the within-class and between-class templates for the Euclidean and Riemannian geometry preserving items in Eq.12 and Eq.13, respectively. For the sake of clarity, more details are given in the supplementary material.
Then we can initialize the two transformation matrices by maximizing the sum of between-class templates while minimizing the sum of within-class templates with the Fisher criterion of the traditional Fisher Discriminant Analysis (FDA) [52]:
(21)  
where , . By transforming Eq.21 into matrix formulation, the function for initialization can be further simplified as:
(22)  
where , . Obviously, Eq.22 is a standard generalized eigenvalue problem that can be solved using any eigensolver.
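A generalized eigenvalue problem of this kind can be solved with a Cholesky-based reduction to an ordinary symmetric eigenproblem. The sketch below assumes the generic form S_b v = lambda S_w v with a symmetric "between" matrix `S_b` and a symmetric positive definite "within" matrix `S_w`; these names, the small ridge `eps` and the choice of keeping the largest eigenvalues are assumptions for illustration, not the exact matrices of Eq.22.

```python
import numpy as np

def generalized_eigh(S_b, S_w, d, eps=1e-8):
    """Top-d solutions of S_b v = lambda S_w v for symmetric S_b and SPD S_w."""
    n = S_w.shape[0]
    L = np.linalg.cholesky(S_w + eps * np.eye(n))   # S_w ~ L L^T (eps is an assumed ridge)
    L_inv = np.linalg.inv(L)
    C = L_inv @ S_b @ L_inv.T                       # reduced symmetric problem
    C = (C + C.T) / 2.0                             # symmetrize against round-off
    w, U = np.linalg.eigh(C)                        # ascending eigenvalues
    order = np.argsort(w)[::-1][:d]                 # keep the d largest
    V = L_inv.T @ U[:, order]                       # map back: v = L^{-T} u
    return w[order], V

# Toy matrices (hypothetical values).
S_b = np.diag([4.0, 1.0])
S_w = np.eye(2)
vals, vecs = generalized_eigh(S_b, S_w, d=1)
residual = np.linalg.norm(S_b @ vecs[:, 0] - vals[0] * (S_w @ vecs[:, 0]))
```

If SciPy is available, `scipy.linalg.eigh(S_b, S_w)` solves the same symmetric-definite problem directly.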
Alternately updating. We substitute Eq.15, Eq.16, Eq.17 into the objective function in Eq.10 to derive its matrix form. By differentiating it w.r.t. one of the two transformation matrices and setting the derivative to zero, we have the following equation:
(23)  
Then, by fixing the other transformation matrix, the solution of the first one can be achieved as:
(24) 
Likewise, the solution of the second transformation matrix, when the first one is fixed, can be obtained as:
(25) 
We alternate the above updates of the two transformation matrices for several iterations to search for an optimal solution. While it is hard to provide a theoretical proof of uniqueness or convergence of the proposed iterative optimization, we empirically found that our objective function Eq.10 converges to a desirable solution after only a few tens of iterations. The convergence characteristics are studied in more detail in the experiments.
4 Application to Video-based Face Recognition
In this section we present the application of the proposed Cross Euclidean-to-Riemannian Metric Learning (CERML) to the three typical tasks of video-based face recognition, i.e., V2S, S2V and V2V settings.
4.1 V2S/S2V Face Recognition
As done in several state-of-the-art techniques [8, 36, 2, 19, 12, 4, 5], we represent a set of facial frames within one video with their appearance mean and variation model aforementioned (e.g., linear subspace) simultaneously. Therefore, the task of V2S/S2V face recognition can be formulated as the problem of matching Euclidean representations (i.e., feature vectors) of face images with the Euclidean data (i.e., feature mean) and Riemannian representation (i.e., feature variation) of faces from videos. Formally, the Euclidean data of a face image is written as with labels . The Euclidean data of videos are represented by , with labels , while their Riemannian data are sharing the labels with their Euclidean data. In the following, we describe the components of the proposed CERML framework for this task.
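The two video descriptors mentioned above, the appearance mean and a second-order variation model, can be sketched for the covariance (SPD) case as follows. The function name, the frame-feature layout (one row per frame) and the small ridge term that keeps the covariance strictly positive definite are assumptions of this sketch, not details taken from the paper.

```python
import numpy as np

def video_statistics(F, ridge=1e-3):
    """First- and second-order pooling of per-frame features F (N frames x D dims)."""
    mean = F.mean(axis=0)                        # appearance mean (Euclidean data)
    Fc = F - mean
    cov = Fc.T @ Fc / max(F.shape[0] - 1, 1)     # pattern variation (covariance)
    cov = cov + ridge * np.eye(F.shape[1])       # assumed regularizer to keep it SPD
    return mean, cov

# Toy video with three frames and two-dimensional features.
F = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
mean, cov = video_statistics(F)
```

A still image contributes only a single feature vector, while each video contributes both outputs; this is why the matching problem spans a Euclidean space and a Riemannian manifold.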
Distance metric. The distance metric Eq.2 in Sec.3.1 is instantiated for V2S/S2V face recognition as:
(26)  
Objective function. The objective function Eq.10 in Sec.3.2 takes the form:
(27)  
where the distance constraint , the geometry constraint , the transformation constraint .
4.2 V2V Face Recognition
Similar to the case of V2S/S2V face recognition, each facial video sequence is commonly represented by the appearance mean of its frames and their pattern variation. Therefore, the task of V2V face recognition can be expressed as the problem of fusing the Euclidean data (i.e., feature mean) and the Riemannian representation (i.e., feature variation such as linear subspace) of video sequences of faces. Formally, the Euclidean data of videos are represented by , with labels , while the Riemannian representations of such videos are sharing the labels with the Euclidean data. To adapt the proposed CERML framework to this task, we now define its components.
Distance metric. The distance metrics Eq.3 and Eq.4 in Sec.3.1 are implemented for V2V face recognition as:
(31)  
Objective function. The objective function Eq.10 in Sec.3.2 is instantiated as:
(32)  
where the distance constraint , the geometry constraint , the transformation constraint .
Initialization. The initialized objective function Eq.22 in the optimization algorithm in Sec.3.3 is instantiated by defining the matrix as:
(33) 
(34) 