In pattern recognition and computer vision, an ever-growing share of information reaches us as captured video, such as surveillance footage, handheld camera recordings, and internet videos. Face recognition is one of the representative applications, and traditional recognition methods that base their decision on single-shot images have achieved impressive success under restricted conditions [1, 2, 3, 4, 5, 6]. By treating each video sequence as an image set, image-set based object classification has recently attracted increasing attention [7, 8, 9, 10, 11, 12, 13] and shows broad potential in applications including video-based face recognition, object categorization, and action recognition. This is mainly because an image set provides more information about data variability, which supports more robust video scene parsing under more realistic conditions.
Unlike single-shot image classification, in image set classification both the training and the testing samples are image sets, and each set generally contains a number of image instances that belong to the same category. The appearance of a subject is very likely to vary widely within a set and across sets, owing to rigid and non-rigid deformations, illumination changes, and different shooting conditions. The distribution formed by the set data is therefore often nonlinear, which raises the key issue of how to faithfully characterize its real structure for classification.
As a matter of fact, such nonlinear data are often encountered in classification tasks that are conventionally treated with Euclidean geometry. Examples include covariance descriptors [14, 17, 18], orthogonal linear subspaces, and Gaussian distributions. However, the spaces in which these data reside do not have a vector space structure. They are Riemannian manifolds, specifically the SPD manifold, the Grassmann manifold, and the Gaussian embedded Riemannian manifold, respectively. Hence, applying conventional Euclidean learning techniques directly to manifold-valued data is unreasonable and often leads to poor performance. As a countermeasure, [20, 21, 22, 23] advocated metrics designed for Riemannian manifolds, including the Affine-Invariant Riemannian Metric (AIRM), the Log-Euclidean Metric (LEM), the Stein divergence, and the Projection Metric (PM). With these well-studied Riemannian metrics, some Euclidean learning algorithms can be generalized to Riemannian manifolds through the following strategies.
The first strategy learns a Euclidean feature representation of the original manifold-valued data by mapping the Riemannian manifold into a flat space that approximates a Euclidean space [15, 18]. An alternative strategy embeds the manifold into a high-dimensional Hilbert space via positive definite Riemannian kernels [12, 14, 24, 25]. Compared with the former, this approach makes some Euclidean methods valid in a generalized Euclidean space while yielding a richer feature representation, and in some respects it shows better classification performance [24, 25]. However, both strategies rely on approximate computation and ignore the geometry of the Riemannian manifold to some extent. To handle this problem, some Riemannian manifold dimensionality reduction methods [16, 23] have been suggested that map the original high-dimensional Riemannian manifold directly to a lower-dimensional, more discriminative one. The advantage of this type of method is that the intrinsic manifold geometry of the data is fully considered. Its inherent problem is that a linear mapping function is learned on a non-linear manifold, which inevitably leads to sub-optimal results.
In parallel with the above developments, deep neural networks have become a vital tool in artificial intelligence and pattern recognition. Their advantages stem both from the ability to extract powerful feature representations and from the effective non-linear training procedure based on backpropagation. Inspired by these merits, some authors have extended conventional deep learning to Riemannian manifolds, and several corresponding architectures [26, 27, 28] have been put forward to conduct dimensionality reduction and deep feature learning directly on the manifold. Euclidean learning methods can then be applied to the output of these networks for further computation. The Riemannian manifold deep learning strategy has brought significant improvements in classification performance, mainly for two reasons: 1) the non-linear learning mechanism; 2) Riemannian matrix backpropagation.
The approaches for image set classification mentioned above are based on Riemannian manifolds. Distribution based methods, e.g., the Single Gaussian Model (SGM) and Gaussian Mixture Models (GMM) [7, 10], are also a favorable choice for capturing the variations within a given image set. In theory, once the given image sets are modeled by such distributions, the similarity between any two image sets can be measured with the Kullback-Leibler Divergence (KLD). However, this measure lacks sufficient discriminability for some complicated video based classification tasks. On the basis of information geometry, it has been pointed out that the space formed by $d$-dimensional Gaussian distributions can be embedded into another Riemannian manifold, specifically an SPD manifold spanned by a set of $(d+1) \times (d+1)$ symmetric positive definite matrices. Therefore, the above problem can be addressed by applying Riemannian metric based learning algorithms [33, 10].
In fact, when a collected video sequence exhibits large and complex data variations, the discriminative information that any single Riemannian manifold-valued descriptor (covariance matrix, linear subspace, or Gaussian distribution) can provide is limited [23, 26, 33, 16]. The fundamental reason is that each descriptor models the set data from only one perspective. To tackle this problem, we propose a novel multi-kernel metric learning framework that not only describes the original image set from multiple geometric perspectives but also combines them for improved classification. Specifically, given an image set, we encode it with the covariance matrix, the linear subspace, and the Gaussian distribution simultaneously to obtain complementary features. Since the nonsingular covariance matrix lies on an SPD manifold, the linear subspace resides on a Grassmann manifold, and the Gaussian distribution can be embedded into another Riemannian manifold, fusing these different topologies is not trivial. We first adopt well-studied Riemannian kernels to map each Riemannian manifold into a high-dimensional Reproducing Kernel Hilbert Space (RKHS). Then, a multi-kernel metric learning algorithm is developed to fuse the learned hybrid kernels into a lower-dimensional, more discriminative unified subspace for classification. Extensive classification results on four widely used image set datasets validate the efficacy of the proposed method. Our main contributions can be summarized as follows:
To extract complementary feature representations of an image set for improved classification, we first encode each given image set (one video sequence) with three commonly used Riemannian manifold-valued descriptors simultaneously, i.e., the covariance matrix, the linear subspace, and the Gaussian distribution.
Because the spaces spanned by these three descriptors are heterogeneous, three well-studied Riemannian kernel functions are then exploited to map the corresponding Riemannian manifold-valued features into high-dimensional Reproducing Kernel Hilbert Spaces (RKHS), which facilitates the subsequent fusion operation.
Finally, we develop a multi-kernel metric learning framework that merges the generated hybrid kernel features into a lower-dimensional unified subspace by jointly learning an adaptive discriminative distance metric and an adaptive weight for each local region of the produced kernel spaces. Consequently, the inter-class dispersion and intra-class compactness of the generated subspace features are enhanced.
II Related Work
In image set classification, the covariance matrix, the linear subspace, and the Gaussian distribution are three commonly used Riemannian manifold-valued descriptors for image set description. The advantages of the covariance matrix are its simplicity and its flexibility in capturing the variations within the set [14, 16, 34]. The strengths of the linear subspace stem both from its lower computational cost and from its ability to accommodate the effects of various intra-set variations [12, 23]. In comparison, the strength of the Gaussian distribution is that it describes the variations of the set data by estimating their first-order and second-order statistics simultaneously [33, 10]. Image set classification methods built on these three descriptors fall into the following main lines of work.
Kernel Based Image Set Classification: In this approach [12, 10, 24, 14, 35], the Riemannian manifold is embedded into a high-dimensional Hilbert space via well-studied Riemannian kernel functions, so that Euclidean methods can readily be applied for further computation. Wang et al. employ a Log-Euclidean Metric (LEM) based Riemannian kernel to embed the data from the SPD manifold into a generalized Euclidean space. Kernel Discriminant Analysis (KDA) is then applied to learn a discriminative subspace for classification. Similarly, Huang et al. map the elements of the Grassmann manifold into a flat space with a PM based Riemannian kernel function and learn a discriminant function with KDA. To develop kernel based methods for Gaussian distributions, Wang et al. investigate a series of probabilistic kernels that encode the Riemannian geometry of Gaussian distributions, and the generated kernel space is further reduced to a discriminative lower-dimensional subspace via a weighted KDA algorithm. However, the learning process of such approaches clearly ignores the intrinsic Riemannian geometry of the data.
Manifold Dimensionality Reduction Based Image Set Classification: To circumvent the above problem, some algorithms that jointly perform linear mapping and metric learning directly on the original Riemannian manifold have been suggested recently [16, 23, 15], which yield a discriminative lower-dimensional manifold. Harandi et al. produce a lower-dimensional SPD manifold with an orthogonal mapping obtained from a discriminative metric learning framework defined on the original high-dimensional data. To reduce the computational complexity, Huang et al. put forward a Log-Euclidean metric learning algorithm that forms a desirable SPD manifold by directly embedding the tangent space of the original SPD manifold into a lower-dimensional one. Similarly, Huang et al. learn lower-dimensional and more discriminative Grassmannian-valued feature representations of the original high-dimensional Grassmann manifold under a projection metric learning framework. Because they fully consider the manifold geometry, the above algorithms show good classification performance. Yet they share an inherent design flaw: the mapping, which is defined and learned on a non-linear Riemannian geometry, is linear, which seems unreasonable.
Riemannian Deep Learning Based Image Set Classification: How to effectively measure the similarity between image sets is an open and challenging question, and the Riemannian manifold learning algorithms mentioned above provide some constructive ideas for addressing it. Inspired by the proven effectiveness of deep neural networks, Sun et al. aggregate local match kernels with a deep neural architecture to generate a global deep match kernel for similarity measurement. To take discriminative feature representation into account, Lu et al. extract nonlinear, discriminative, class-specific information for image set classification by integrating manifold metric learning into a CNN. However, a research gap remains in extracting more desirable feature representations for complicated classification tasks. More recently, some researchers have extended the idea of deep learning to Riemannian manifolds and proposed manifold deep learning networks to close this gap. Among them, Huang and Van Gool first design a set of spectral layers to extract appropriate deep feature representations on the SPD manifold, and then propose a Riemannian matrix backpropagation algorithm for model optimization. Meanwhile, a Grassmannian deep learning architecture is devised to learn deep Grassmannian-valued features. Since a specific Riemannian manifold corresponds to a specific deep learning architecture, such approaches have limited versatility and scalability.
Multi-statistical Features Based Image Set Classification: To properly represent the image sets, the algorithms introduced above make prior assumptions such as a Gaussian distribution, a linear subspace, or a covariance matrix. However, videos captured in the wild often exhibit large intra-class ambiguity, which may cause these assumptions to lose key information for classification. To handle this problem, Lu et al. first extract multiple order statistics of each given image set, namely the mean, the covariance matrix, and a tensor, for set modeling, and then design a localized multi-kernel metric learning framework to perform discriminative feature fusion. However, the authors apply a single kernel function derived from the Euclidean metric to the heterogeneous features, which may lose some capability to preserve the original geometric structure of the set data. In contrast, Huang et al. encode each given image set simultaneously with the mean, the covariance matrix, and the Gaussian distribution. Since different statistics span different topologies, the authors first exploit three different Riemannian kernel functions to embed these heterogeneous features into RKHS. Next, a hybrid Euclidean-and-Riemannian metric learning framework is proposed to fuse them for face recognition by learning multiple distance metrics. However, all the local regions in the learned kernel spaces share the same weights, although their importance actually differs.
III Background Theory
This section briefly introduces the geometry of the SPD manifold and the Grassmann manifold, as well as that of the Gaussian distribution, which together provide the theoretical foundation for the proposed algorithm.
III-A The Geometry of SPD Manifold
A real SPD matrix $S \in \mathbb{R}^{d \times d}$ has the intrinsic property that $v^{T} S v > 0$ for all non-zero $v \in \mathbb{R}^{d}$. The space spanned by a set of $d \times d$ SPD matrices is the interior of a convex cone in the Euclidean space of $d \times d$ symmetric matrices, denoted $\mathcal{S}_{+}^{d}$. As studied in [21, 40], endowing this space with an appropriate Riemannian metric forms a specific Riemannian manifold, i.e., the SPD manifold. Because the SPD manifold is locally Euclidean, the derivatives of the curves at a point $S_1$ on the manifold can be defined under the logarithm map $\log_{S_1}: \mathcal{S}_{+}^{d} \rightarrow T_{S_1}\mathcal{S}_{+}^{d}$. Here, $T_{S_1}\mathcal{S}_{+}^{d}$ denotes the tangent space of the SPD manifold at $S_1$, and the family of inner products attached to these tangent spaces is regarded as the Riemannian metric. Specifically, for any two tangent elements $T_1, T_2 \in T_{S}\mathcal{S}_{+}^{d}$, their scalar product is formulated as:
$$\langle T_1, T_2 \rangle_{S} = \langle D_{S}\log.T_1,\; D_{S}\log.T_2 \rangle \tag{1}$$
where $D_{S}\log.T$ is the directional derivative of the matrix logarithm at $S$ along $T$. The logarithmic map related to this Riemannian metric can be defined in terms of matrix logarithms:
$$\log_{S_1}(S_2) = D_{\log(S_1)}\exp.\left(\log(S_2) - \log(S_1)\right) \tag{2}$$
where $D_{\log(S)}\exp. = (D_{S}\log.)^{-1}$, and the matrix exponential map is defined as:
$$\exp_{S_1}(T_2) = \exp\left(\log(S_1) + D_{S_1}\log.T_2\right) \tag{3}$$
According to Eq.1, Eq.2 and Eq.3, the Riemannian metric on the SPD manifold can be formed by:
$$D_{le}(S_1, S_2) = \langle \log_{S_1}(S_2), \log_{S_1}(S_2) \rangle_{S_1} = \|\log(S_1) - \log(S_2)\|_{F}^{2} \tag{4}$$
This metric, named the Log-Euclidean Metric (LEM), is widely used to measure the geodesic distance between any two points on the SPD manifold. When endowed with this Riemannian metric, the space of SPD matrices is reduced to a flat space, and a valid Riemannian kernel on $\mathcal{S}_{+}^{d}$ can therefore be derived by computing the inner product in that space as:
$$k(S_1, S_2) = \mathrm{tr}\left(\log(S_1)\log(S_2)\right) \tag{5}$$
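The Log-Euclidean kernel is commonly written as $k(S_1, S_2) = \mathrm{tr}(\log(S_1)\log(S_2))$. As a minimal sketch under that assumption, it can be evaluated with an eigendecomposition of each SPD matrix. The function names below are illustrative and not taken from the paper, and the distance function returns the unsquared Frobenius norm.

```python
import numpy as np


def spd_logm(S):
    """Matrix logarithm of a symmetric positive definite matrix.

    Uses S = V diag(w) V^T, so log(S) = V diag(log w) V^T.
    """
    eigvals, eigvecs = np.linalg.eigh(S)
    return (eigvecs * np.log(eigvals)) @ eigvecs.T


def log_euclidean_kernel(S1, S2):
    """k(S1, S2) = tr(log(S1) log(S2)), the inner product in the log domain."""
    return float(np.trace(spd_logm(S1) @ spd_logm(S2)))


def log_euclidean_distance(S1, S2):
    """Frobenius norm of log(S1) - log(S2) (not squared)."""
    return float(np.linalg.norm(spd_logm(S1) - spd_logm(S2), ord="fro"))


# Toy example: diagonal SPD matrices whose logarithms are known exactly.
A = np.diag([np.e, np.e ** 2])       # log(A) = diag(1, 2)
B = np.diag([np.e ** 3, np.e ** 4])  # log(B) = diag(3, 4)
kernel_value = log_euclidean_kernel(A, B)
distance_value = log_euclidean_distance(A, B)
```

Because the kernel is an ordinary inner product of the matrix logarithms, the kernel matrix over a gallery of SPD matrices is positive semi-definite by construction.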
III-B The Geometry of Grassmann Manifold
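Given an orthogonal matrix $Q$ of size $D \times D$, its equivalence class can be expressed as follows,
$$[Q] = \left\{ Q \begin{pmatrix} Q_q & 0 \\ 0 & Q_{D-q} \end{pmatrix} : Q_q \in \mathcal{O}_q,\; Q_{D-q} \in \mathcal{O}_{D-q} \right\} \tag{6}$$
whose leading $q$ columns form the same subspace as those of $Q$. Here, $\mathcal{O}_q$ is the orthogonal group composed of $q \times q$ orthogonal matrices. The equivalence class $[Q]$ represents a point on the Grassmann manifold $\mathcal{G}(q, D)$. In other words, a Grassmann manifold $\mathcal{G}(q, D)$ is spanned by a set of $q$-dimensional linear subspaces of $\mathbb{R}^{D}$. Each linear subspace, represented by an orthonormal basis matrix $Y$ of size $D \times q$ ($Y^{T}Y = I_q$, where $I_q$ is the identity matrix of size $q \times q$), is an element of $\mathcal{G}(q, D)$.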
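As studied in [23, 42], each point on the Grassmann manifold, represented by an orthonormal basis matrix $Y$ of size $D \times q$, corresponds to a unique projection matrix $YY^{T}$ of size $D \times D$ with rank $q$. A natural choice of inner product therefore arises under the projection operator $\Phi(Y) = YY^{T}$, namely $\langle Y_1, Y_2 \rangle_{\Phi} = \mathrm{tr}\left(\Phi(Y_1)^{T}\Phi(Y_2)\right)$, which induces a geodesic distance measure named the Projection Metric:
$$d_{p}(Y_1Y_1^{T}, Y_2Y_2^{T}) = 2^{-1/2}\,\|Y_1Y_1^{T} - Y_2Y_2^{T}\|_{F} \tag{7}$$
Since the projection mapping is continuous and differentiable, endowing the Grassmann manifold with this Riemannian metric yields an associated flat space. By computing the inner product in this flat space, we obtain a Grassmann kernel:
$$k(Y_1, Y_2) = \|Y_1^{T}Y_2\|_{F}^{2} \tag{8}$$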
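To make the projection embedding concrete, the following sketch evaluates the Projection Metric and the projection kernel in their commonly used forms, $2^{-1/2}\|Y_1Y_1^{T} - Y_2Y_2^{T}\|_F$ and $\|Y_1^{T}Y_2\|_F^{2}$. The function names are illustrative assumptions, and the inputs are assumed to be matrices with orthonormal columns.

```python
import numpy as np


def projection_metric(Y1, Y2):
    """Distance between the projection matrices Y1 Y1^T and Y2 Y2^T."""
    return float(np.linalg.norm(Y1 @ Y1.T - Y2 @ Y2.T, ord="fro") / np.sqrt(2.0))


def projection_kernel(Y1, Y2):
    """Inner product of the projection matrices: ||Y1^T Y2||_F^2."""
    return float(np.linalg.norm(Y1.T @ Y2, ord="fro") ** 2)


# Two 2-dimensional subspaces of R^3 that share one direction.
Y1 = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])  # span{e1, e2}
Y2 = np.array([[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]])  # span{e1, e3}

# Rotating the basis inside the subspace does not change the point on the manifold.
R = np.array([[0.0, -1.0], [1.0, 0.0]])
Y1_rotated = Y1 @ R
```

For subspaces of dimension $q$, the two quantities are linked by $d_p^{2} = q - k$, which the toy example satisfies with $q = 2$, $k = 1$ and $d_p = 1$.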
III-C The Geometry of Gaussian Distribution
Given an image set, its mean and variations can be captured simultaneously to determine a particular Gaussian distribution, a probability distribution commonly used in probability theory. Therefore, the distribution of each image set can be modeled as a Single Gaussian Model (SGM) by estimating its mean and covariance matrix ,
where is a given image set, and the Expectation-Maximization (EM) algorithm is often exploited for estimation.
As studied in , a family of Gaussian distributions actually forms a space of constant negative curvature. Hence, it is not appropriate to apply Euclidean computations to the SGM. From information geometry [31, 32], the space of Gaussian distributions can be embedded into a Riemannian manifold. To be specific, if a random vector $x$ follows $\mathcal{N}(0, I)$, then its affine transformation $Qx + \mu$ follows $\mathcal{N}(\mu, \Sigma)$, and vice versa. The covariance matrix $\Sigma$ can be decomposed into $\Sigma = QQ^{T}$, where $|Q| > 0$. Therefore, such a Gaussian distribution can be characterized by the affine transformation $(\mu, Q)$. Based on information geometry theory, in the Gaussian embedded space a $(d+1) \times (d+1)$ SPD matrix $P$ uniquely represents a $d$-dimensional Gaussian model as:
$$\mathcal{N}(\mu, \Sigma) \sim P = |\Sigma|^{-\frac{1}{d+1}} \begin{pmatrix} \Sigma + \mu\mu^{T} & \mu \\ \mu^{T} & 1 \end{pmatrix} \tag{10}$$
For a more detailed introduction to the Gaussian embedding process, please refer to .
Since such a $d$-dimensional Single Gaussian Model is embedded into another SPD manifold, the well-studied LEM can replace the KL divergence for measuring the distance between two probability distributions. Moreover, as studied in , a corresponding kernel of Gaussian distributions can be formulated as:
$$k(P_1, P_2) = \mathrm{tr}\left(\log(P_1)\log(P_2)\right) \tag{11}$$
where $P_1$ and $P_2$ are the $(d+1) \times (d+1)$ SPD matrices corresponding to two Gaussian models.
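The Gaussian embedding can be sketched as follows, assuming the commonly used form in which a $d$-dimensional Gaussian with mean $\mu$ and covariance $\Sigma$ is mapped to the $(d+1) \times (d+1)$ matrix $|\Sigma|^{-1/(d+1)} [\Sigma + \mu\mu^{T}, \mu; \mu^{T}, 1]$. The function name and the toy parameters are illustrative.

```python
import numpy as np


def gaussian_to_spd(mu, sigma):
    """Embed N(mu, sigma) as a (d+1) x (d+1) SPD matrix with unit determinant."""
    d = mu.shape[0]
    P = np.empty((d + 1, d + 1))
    P[:d, :d] = sigma + np.outer(mu, mu)  # top-left block: sigma + mu mu^T
    P[:d, d] = mu                         # last column
    P[d, :d] = mu                         # last row
    P[d, d] = 1.0
    # The scaling factor normalises the determinant of the embedded matrix to one.
    return np.linalg.det(sigma) ** (-1.0 / (d + 1)) * P


mu = np.array([1.0, 2.0])
sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
P = gaussian_to_spd(mu, sigma)
```

The embedded matrix is symmetric and positive definite whenever $\Sigma$ is, so the Log-Euclidean tools defined for SPD matrices can be applied to it directly.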
IV Proposed Method
Fig.1 presents a schematic diagram of the proposed method. As discussed before, each given image set is modeled simultaneously with three different Riemannian manifold-valued descriptors in order to extract complementary feature information. Because the spaces spanned by these descriptors are three types of Riemannian manifolds, , and , how to fuse the heterogeneous features becomes an essential problem for classification. To this end, a multi-kernel metric learning framework is designed in the hybrid kernel spaces produced by three well-studied kernel functions, which eventually yields a more discriminative common subspace.
IV-A Set Modeling with Multiple Riemannian Manifold-valued Descriptors
Let $S_i = [s_1, s_2, \dots, s_{n_i}]$ be the $i$-th given image set with $n_i$ entities, where $s_k \in \mathbb{R}^{d}$ represents the $k$-th image sample. Given an image set, we encode it with the following three descriptors, and the extracted complementary feature information is regarded as the new image set features for the subsequent computations.
Set Modeling with Covariance Matrix: Second-order statistics are a widely used model for set representation. Their advantages are simplicity and the flexibility to model the variations within the set without any assumption about the distribution of the set data. For $S_i$, we can compute its covariance matrix as:
$$C_i = \frac{1}{n_i - 1}\sum_{k=1}^{n_i}(s_k - \mu_i)(s_k - \mu_i)^{T} \tag{12}$$
where $\mu_i$ is the mean of $S_i$. As studied in [21, 40], the space spanned by positive definite covariance matrices is an SPD manifold. Hence, we apply the following tactic to maintain the positive definiteness of $C_i$:
$$C_i^{*} = C_i + \lambda I_d \tag{13}$$
where $I_d$ is the identity matrix of size $d \times d$, and the value of $\lambda$ is kept fixed in all the experiments.
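As a minimal sketch of the covariance descriptor, the code below computes the sample covariance of an image set stored column-wise and adds a small multiple of the identity to keep it positive definite. The regularisation weight `lam` and its default value are assumptions made purely for illustration, not the setting used in the experiments.

```python
import numpy as np


def regularized_covariance(S, lam=1e-3):
    """Covariance descriptor of an image set.

    S   : d x n array, one vectorised image per column.
    lam : hypothetical regularisation weight (illustrative value only).
    """
    mu = S.mean(axis=1, keepdims=True)     # set mean, shape d x 1
    centered = S - mu
    C = centered @ centered.T / (S.shape[1] - 1)
    return C + lam * np.eye(S.shape[0])    # shift the spectrum away from zero


# Toy set: two samples in R^2, so the raw covariance is rank deficient.
S = np.array([[0.0, 2.0],
              [0.0, 0.0]])
C_reg = regularized_covariance(S)
```

With fewer samples than dimensions the raw covariance is singular, which is why the identity shift is needed before taking a matrix logarithm.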
Set Modeling with Linear Subspace: The linear subspace can be regarded as a subspace-based statistic, which has the advantages of lower computational complexity and of accommodating the effects of various within-set variations. For the image set $S_i$, its $q$-dimensional linear subspace used for set representation is formed by an orthonormal basis matrix $Y_i$ of size $d \times q$, which can be obtained by:
$$S_iS_i^{T} \simeq Y_i\Lambda_iY_i^{T} \tag{14}$$
where $\Lambda_i$ and $Y_i$ respectively represent the matrices of the $q$ largest eigenvalues and their corresponding eigenvectors. As studied in , the linear subspace resides on the Grassmann manifold.
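The subspace descriptor can be sketched as an eigendecomposition of the set's outer-product matrix, keeping the eigenvectors of the largest eigenvalues as an orthonormal basis. The function name and the toy data are illustrative assumptions.

```python
import numpy as np


def linear_subspace(S, q):
    """Orthonormal basis of the q-dimensional subspace of an image set.

    S : d x n array, one vectorised image per column.
    Returns the d x q basis and the q largest eigenvalues of S S^T.
    """
    eigvals, eigvecs = np.linalg.eigh(S @ S.T)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:q]        # indices of the q largest
    return eigvecs[:, order], eigvals[order]


# Toy set in R^3 whose energy is concentrated along the first axis.
S = np.array([[3.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
Y1, top1 = linear_subspace(S, 1)
Y2, top2 = linear_subspace(S, 2)
```

The returned basis has orthonormal columns, so it can be passed directly to a projection kernel on the Grassmann manifold.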
Set Modeling with Gaussian Distribution: The Gaussian distribution is a probability distribution developed in probability theory. It is often used to model set data by simultaneously capturing its first-order and second-order statistics with the EM algorithm. Therefore, when specifying , a Single Gaussian Model (SGM) can be formulated as:
In information geometry [31, 32], the SGM typically lies on another Riemannian manifold , which is spanned by a family of SPD matrices of size . Therefore, we adopt the same strategy as in Eq.13 to preserve non-singularity.
IV-B Multi-Kernel Metric Learning for Heterogeneous Features Fusion
As stated earlier, different descriptors reside on different Riemannian manifolds, so combining them directly for classification is inappropriate. In this part, we present in detail the multi-kernel metric learning framework designed to fuse the extracted heterogeneous but complementary feature representations.
Let represent the gallery consisting of image sets, where is the -th image set, , and denotes the number of images in this set. Each is modeled with the covariance matrix, the linear subspace, and the Gaussian distribution, respectively. We use to represent the -th generated feature set of the gallery, where is the -th Riemannian manifold-valued feature representation extracted from . Here, the number of descriptors is set to three, because three different models are used to describe the set data. To aggregate such heterogeneous feature representations, three well-studied Riemannian kernel functions are applied for high-dimensional feature embedding, in light of the proven success of kernel learning [43, 44, 45]. This process is implemented by mapping the original Riemannian manifold-valued features into a Hilbert space and then computing the dot product there. We use to represent the new feature representation of , and the non-linear mapping function is formulated as , where denotes the -th Riemannian space and is the transformed Hilbert space. Although the mapping function is usually implicit, we first use it directly to formulate our method for simplicity. For classification, the first and foremost task is to measure the similarity between a given pair of training image sets and by defining a distance metric in as:
where is a gating model defined to assign different positive weights to different , which will be introduced later, and is the Mahalanobis matrix to be learned. Because this matrix is symmetric positive semi-definite, we can seek a non-square matrix to re-represent it as , and Eq.16 can therefore be rewritten as:
Our next target is to learn the transformation matrix , so that the hybrid kernel features can be mapped into a desirable unified space for more discriminative classification. To achieve this purpose, we attempt to simultaneously maximize the distance of all the between-class sample pairs and minimize the distance of all the within-class sample pairs in the gallery with the following objective function:
where denote the between-class dispersion and within-class compactness, respectively, and can be formulated as:
where represent the numbers of intra-class and inter-class sample pairs in the training set, denote the category labels of and , and are the intra-class scatter matrix and the inter-class scatter matrix, which can be formulated as:
Clearly, the subsequent computations cannot be performed directly, because the form of is unknown and it is therefore impossible to compute and in the mapped space. However, we can express the basis as a linear combination of all the training samples in the feature space , i.e.,
where are the representation coefficients. As a result, the above problem can be addressed by using the kernel trick as:
where is a column vector and is its -th entry, and is the -th column of the -th kernel matrix . Here, is of size , generated from the -th Riemannian manifold feature using the corresponding -th Riemannian kernel function.
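The two steps above, factorizing the positive semi-definite Mahalanobis matrix and replacing the implicit features by kernel columns, can be checked numerically. The sketch below is illustrative only: it uses a single linear kernel so that the feature map is explicit, it leaves out the gating weights, and the names `Phi`, `U` and `K` are assumed notation rather than the paper's.

```python
import numpy as np


def kernel_distance_sq(U, K, i, j):
    """Squared projected distance between samples i and j from kernel columns only."""
    diff = U.T @ (K[:, i] - K[:, j])
    return float(diff @ diff)


def explicit_distance_sq(Phi, U, i, j):
    """The same quantity computed with explicit features (columns of Phi)."""
    W = Phi @ U                       # basis written as a combination of training features
    A = W @ W.T                       # Mahalanobis matrix A = W W^T, positive semi-definite
    diff = Phi[:, i] - Phi[:, j]
    return float(diff @ A @ diff)


rng = np.random.default_rng(0)
Phi = rng.standard_normal((6, 4))     # toy explicit features: 4 samples in R^6
U = rng.standard_normal((4, 2))       # hypothetical representation coefficients
K = Phi.T @ Phi                       # linear kernel matrix between the 4 samples

d_kernel = kernel_distance_sq(U, K, 0, 3)
d_explicit = explicit_distance_sq(Phi, U, 0, 3)
```

The two values agree because $W^{T}\phi(x_i) = U^{T}\Phi^{T}\phi(x_i)$, and $\Phi^{T}\phi(x_i)$ is exactly the $i$-th column of the kernel matrix, so the feature map never has to be evaluated explicitly.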
Hence, the intra-class and inter-class scatter matrices can be respectively re-expressed as:
Thus, the objective function can be rewritten as:
Next, the gating model needs to be discussed. Its specific form is in fact not fixed. In this study, it is defined as follows:
where and are the parameters of this gating model. Note that the value of this gating model increases with the importance of , and the softmax guarantees its nonnegativity. However, the gating model cannot yet be evaluated because the form of is implicit. To tackle this, we proceed in a similar way to Eq.23.
Then, this gating model can be reformulated as:
where and are the two parameters of this gating model that need to be learned.
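A softmax gating model of this kind can be sketched as follows. The parametrization is an assumption made for illustration: each kernel space receives a score that is linear in the sample's kernel column (one hypothetical parameter vector and one hypothetical scalar bias per kernel), and the scores are normalised with a softmax across the kernel spaces.

```python
import numpy as np


def gating_weights(kernel_columns, vectors, biases):
    """Softmax weights over kernel spaces for one sample.

    kernel_columns : list of kernel columns, one per kernel space.
    vectors        : list of hypothetical gating vectors, one per kernel space.
    biases         : list of hypothetical scalar biases, one per kernel space.
    """
    scores = np.array(
        [v @ k + b for k, v, b in zip(kernel_columns, vectors, biases)],
        dtype=float,
    )
    scores -= scores.max()            # shift for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()    # positive and summing to one


# Identical scores give uniform weights over three kernel spaces.
uniform = gating_weights([np.ones(3)] * 3, [np.zeros(3)] * 3, [0.0, 0.0, 0.0])

# A larger score in the first kernel space yields a larger weight there.
skewed = gating_weights(
    [np.array([1.0, 0.0]), np.array([0.0, 1.0])],
    [np.array([2.0, 0.0]), np.array([0.0, 0.0])],
    [0.0, 0.0],
)
```

Subtracting the maximum score before exponentiating leaves the softmax unchanged while avoiding overflow for large kernel values.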
A difficulty in solving the trace ratio problem in Eq.27 arises from the fact that a closed-form solution for the transformation matrix is unknown. This is mainly because of the interdependence between and . Hence, we use an iterative scheme to handle this problem. To be specific, we first fix the values of and with a randomly generated vector of size and a randomly generated constant, respectively, to obtain the new , and then update and with the updated , iteratively.
1. Computation of . Conventionally, the nonconvex trace ratio problem in Eq.27 is transformed into a simpler ratio trace problem , shown in Eq.31, to obtain a closed-form solution.
This problem can be easily solved by eigen-decomposition. However, the approximation may sacrifice the potential discriminatory ability of the produced lower-dimensional feature . Instead, we follow an efficient way to directly solve the trace ratio problem defined in Eq.27.
Denote , and assume that ; the trace ratio problem is then equivalently changed to
Without loss of generality, we briefly summarize this procedure as the following two steps:
(1) Remove the Null Space of . Because and are positive semi-definite, the intersection of their null spaces is equal to the null space of . No discriminative feature information lies in the null space of , so it can be removed from the solution space without losing accuracy. For , its singular value decomposition is expressed as:
where , and represents the number of positive singular values of . Then, the solution space is restricted to a new space formed by the column vectors of , which is , and is of size . Consequently, the trace ratio problem in Eq.32 can be converted into:
where and . Now, is positive definite.
(2) Iterative Optimization. Based on the trace ratio problem defined in Eq.34, we solve a trace difference problem in each step:
where , and is the -th trace ratio value computed from the transformation matrix of the last step and can be formulated as:
Here, is initialized with a random column-orthogonal matrix, and with the obtained , the trace difference problem of the -th step is constructed as:
At this point, eigen-decomposition is used to obtain the -th projection matrix . For the sake of orthogonal transformation invariance, is reshaped by applying singular value decomposition to as , where . The above operations are then iterated to obtain the desired projection matrix . This algorithm has been proved to converge to the global optimum. For a more detailed treatment, please refer to .
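The iterative scheme can be sketched as below for generic symmetric positive semi-definite scatter matrices `Sb` and `Sw`, which are illustrative stand-ins for the kernelized scatter matrices. The sketch assumes the ratio $\mathrm{tr}(W^{T}S_bW)/\mathrm{tr}(W^{T}(S_b+S_w)W)$ with $W^{T}W = I$, removes the null space of $S_b + S_w$, and then alternates between updating the ratio value and solving a trace difference eigenproblem. The reshaping step for orthogonal transformation invariance is omitted for brevity, and all names are assumptions.

```python
import numpy as np


def iterative_trace_ratio(Sb, Sw, dim, n_iter=100, tol=1e-10, seed=0):
    """Maximise tr(W^T Sb W) / tr(W^T (Sb + Sw) W) over column-orthogonal W."""
    St = Sb + Sw
    # Step 1: restrict the search to the range of St (drop its null space).
    evals, evecs = np.linalg.eigh(St)
    U1 = evecs[:, evals > 1e-10 * evals.max()]
    Sb_r = U1.T @ Sb @ U1
    St_r = U1.T @ St @ U1              # positive definite after the restriction

    # Step 2: random column-orthogonal start, then trace difference iterations.
    rng = np.random.default_rng(seed)
    V, _ = np.linalg.qr(rng.standard_normal((U1.shape[1], dim)))
    ratio = np.trace(V.T @ Sb_r @ V) / np.trace(V.T @ St_r @ V)
    for _ in range(n_iter):
        w, vecs = np.linalg.eigh(Sb_r - ratio * St_r)
        V = vecs[:, np.argsort(w)[::-1][:dim]]      # top eigenvectors
        new_ratio = np.trace(V.T @ Sb_r @ V) / np.trace(V.T @ St_r @ V)
        converged = abs(new_ratio - ratio) < tol
        ratio = new_ratio
        if converged:
            break
    return U1 @ V, float(ratio)


# Toy problem with a known optimum: the per-axis ratios are 0.8, 0.5 and 1/3.
Sb = np.diag([4.0, 1.0, 0.5])
Sw = np.eye(3)
W, best_ratio = iterative_trace_ratio(Sb, Sw, dim=1)
```

Each iteration cannot decrease the ratio, which is why the loop can stop once the change falls below the tolerance.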
2. Computation of and . Having computed , we first take the partial derivative of with respect to as:
where , and the specific forms of and are respectively presented as follows:
where if and 0 otherwise. Similarly, we can easily get the partial derivative of with respect to by referring to the above process, and we omit it here for simplicity.
After obtaining and , the gradient ascent method is applied to train the gating model:
where is the learning rate and configured as in the experiments.
Having updated and , we use them to update the values of , and , respectively, so that the transformation matrix can be updated by re-solving the trace ratio problem defined in Eq.32. This alternating process is repeated for a certain number of iterations until the stopping conditions are satisfied. We summarize the proposed image set classification algorithm in Algorithm 1.
Algorithm 1. Metric Learning for Heterogeneous Features Fusion
Input: Training image sets , label matrix , different kernel matrices (), the number of iterations , target feature dimension and convergence error .
Output: Target transformation matrix and two parameters .
Step 1 (Initialization): Initialize with an arbitrary column vector, and with a small random number.
Step 2 (Optimization): For , repeat
Use Eq.25 and Eq.26 to compute , respectively.
Solve the trace ratio problem defined in Eq.32 and get the -th transformation matrix .
Update and by using the gradient ascent method:
if , and or , go to Step 3.
Step 3 (Output): Transformation matrix and
In the testing phase, we first apply the three Riemannian manifold-valued descriptors to encode a given testing image set , and we use to represent the extracted -th Riemannian feature, where . Then, we measure the similarity between and all the training sets in the form of three computed kernel vectors, each denoted by . Afterwards, the distance between and each training set is formulated as follows:
Lastly, we use the nearest neighbor classifier for classification.
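The final decision step can be sketched as a nearest neighbor search. The sketch assumes that the fused low-dimensional features of the gallery and of the test set have already been computed, and it uses the squared Euclidean distance between them as a stand-in for the learned distance. The function name and the toy data are illustrative.

```python
import numpy as np


def classify_nearest(test_feature, gallery_features, gallery_labels):
    """Return the label of the closest gallery set and all squared distances.

    test_feature     : fused feature of the test image set, shape (dim,).
    gallery_features : fused features of the training sets, shape (N, dim).
    gallery_labels   : list of N class labels.
    """
    dists = np.sum((gallery_features - test_feature) ** 2, axis=1)
    return gallery_labels[int(np.argmin(dists))], dists


gallery = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
labels = ["a", "b", "c"]
predicted, dists = classify_nearest(np.array([0.9, 1.2]), gallery, labels)
```

Since only the ordering of the distances matters for the decision, the squared distance can be used without taking a square root.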
IV-E Computational Complexity
According to Algorithm 1, the time consumption of the training stage is dominated by three parts: 1) building the different kernel matrices, which costs , where is the number of training sets; 2) computing the intra-class scatter matrix and the inter-class scatter matrix . Here, we use to represent the number of training image sets in the -th category, so computing them costs and , respectively; 3) updating and , which costs ( is the number of iterations). As a result, the computational complexity of the training phase is . In the testing phase, the main time cost lies in constructing the different similarity matrices and computing the distance between each testing image set and each training image set, with complexities and , respectively, where represents the number of testing samples. Considering that and , the primary time cost of the proposed algorithm is .
IV-F Relation with the Previous Works
Relation with : The proposed algorithm is an extension of our previous work . The essential differences between this work and the conference paper lie in the following five aspects: 1) in addition to the set modeling with covariance matrices and linear subspaces in , this paper also exploits the Gaussian distribution to encode the original set data in order to mine more useful information about intra-class variations; 2) because the space formed by a set of Gaussian distributions is another Riemannian manifold , a well-equipped Riemannian kernel function is further applied to it in order to preserve the structural information of the Riemannian manifold-valued data in the Hilbert space embedding; 3) because the discriminability of each local region in the produced kernel spaces is different, this paper integrates the devised multi-kernel learning algorithm into our originally proposed metric learning framework  in order to learn an adaptive weight for each region, while  assigns the same weight to all of them; 4) to optimize the transformation matrix, this paper follows an efficient method  that directly solves the trace ratio problem, while  transforms this problem into a simpler ratio trace problem that is solved approximately; 5) besides the video-based face recognition and set-based object categorization tasks in the conference paper, we further assess the proposed work on video-based emotion recognition and dynamic scene classification through extensive experiments on two challenging video-based datasets: AFEW  and MDSD .
Relation with : Both the proposed method and  focus not only on building reliable set models, but also on learning discriminative subspace feature representations. However, their essential differences are as follows: 1) besides the covariance matrix and the Gaussian distribution used for set description, the proposed algorithm also makes use of the linear subspace to characterize the original set data, while  extracts their first-order information. The linear subspace has been shown to accommodate the effects of various intra-set variations; 2) in this paper, the weight corresponding to each local region of the learned kernel spaces is obtained adaptively under the designed multi-kernel metric learning framework, while the weights are identical in ; 3) for optimization, this paper first formulates the feature fusion problem in the trace ratio form, and then exploits ITR  and the gradient ascent method to solve it iteratively, while  utilizes the LogDet divergence  based constraint to formulate the feature fusion problem, and adopts the cyclic Bregman projection method  to solve it; 4) this paper evaluates the proposed algorithm on four different video-based classification tasks, and the extensive classification results justify its effectiveness, while  concentrates on the video-based face recognition task.
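To make the three set models discussed above concrete, the following sketch computes a covariance matrix, a linear subspace, and a Gaussian descriptor from one image set. It is an illustration under stated assumptions rather than the implementation of this paper: the function name `set_descriptors`, the subspace dimension `q`, the perturbation `eps`, and the unnormalized block form used to embed the Gaussian as an SPD matrix are all choices made for the example.

```python
import numpy as np

def set_descriptors(X, q=3, eps=1e-3):
    """X has shape (feature_dim, n_images): one vectorized image per column."""
    dim, n = X.shape
    mu = X.mean(axis=1, keepdims=True)
    Xc = X - mu
    # Second-order statistics; the small perturbation keeps the matrix SPD.
    cov = Xc @ Xc.T / (n - 1) + eps * np.eye(dim)
    # Linear subspace: the q leading left singular vectors of the set.
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    subspace = U[:, :q]
    # Gaussian (mean and covariance) embedded as a (dim + 1) x (dim + 1) SPD matrix.
    # This unnormalized block form is an assumption made for the example.
    top = np.hstack([cov + mu @ mu.T, mu])
    bottom = np.hstack([mu.T, np.ones((1, 1))])
    gauss = np.vstack([top, bottom])
    return cov, subspace, gauss

# Toy image set (assumed for the example): 20 images with 6 features each.
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 20))
cov, subspace, gauss = set_descriptors(X)
```

The covariance and Gaussian descriptors then reside on SPD manifolds and the subspace on a Grassmann manifold, so each can be paired with its own Riemannian kernel.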
In this section, we evaluate the proposed algorithm (the source code will be released at https://github.com/GitWR) on four image set classification tasks: video-based face recognition, set-based object categorization, video-based emotion recognition and dynamic scene classification.
V-A Comparative Methods and Settings
In this paper, we compare the proposed algorithm with some representative image set classification methods which can be divided into five categories as follows:
Kernel based methods: Grassmann Discriminant Analysis (GDA) , Grassmannian Graph-Embedding Discriminant Analysis (GEDA) , Covariance Discriminative Learning (CDL) , Riemannian Sparse Representation (RSR)  and Discriminant Analysis on Riemannian manifold of Gaussian distributions (DARG) .
Here, we should point out that the classification results of GDA, CDL, PML and LEML are obtained with our own careful implementations. For the other comparative methods, except LMKML and MMDML, we run the source codes provided by the original authors. Since the source codes of LMKML and MMDML have not been released, we use the classification rates already reported in [38, 34]. For a fair comparison, the parameters of the comparative methods are empirically tuned according to the original works. For CDL, KDA is used for discriminative subspace learning and the perturbation is set to . In PML, the number of iterations and the value of the trade-off coefficient are set according to the original authors , and the target dimensionality of the generated new Grassmann manifold is determined by cross-validation. For GDA, we make use of the Projection Metric  and its corresponding projection kernel. Moreover, the numbers of basis vectors for the subspaces in GDA and GEDA are determined by cross-validation. In LEML, and are the only two parameters that need to be optimized, and we search their values in the ranges of and . In SPDNet, the sizes of the transformation matrices are configured to , and , respectively. Other parameters such as the number of epochs and the size of the input data are set to and , while the learning rate and batch size are chosen by cross-validation. For RSR, we tune the value of in the range of . The two parameters and in SPDML-AIM and SPDML-Stein are searched by referring to , while the target dimensionality of the resulting new SPD manifold is set by cross-validation. In LMKML, the learning rate is set to and the bandwidth of the Gaussian kernel is tuned by cross-validation. For HERML, we tune the two parameters and in the ranges of and , respectively. For DCC and MMD, we follow the default settings in [54, 55].
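As a reference for the kernel based competitors, the sketch below shows two kernels that are commonly paired with Grassmann and SPD descriptors: the projection kernel associated with the Projection Metric, and a log-Euclidean inner-product kernel for SPD matrices. The function names are hypothetical, and the assumption that exactly these kernel forms match the settings of every compared method is ours.

```python
import numpy as np

def spd_log(C):
    """Matrix logarithm of an SPD matrix via its eigendecomposition."""
    vals, vecs = np.linalg.eigh(C)
    return (vecs * np.log(vals)) @ vecs.T

def log_euclidean_kernel(C1, C2):
    """Inner product of the matrix logarithms of two SPD matrices."""
    return np.trace(spd_log(C1) @ spd_log(C2))

def projection_kernel(Y1, Y2):
    """Projection kernel between two orthonormal subspace bases."""
    return np.linalg.norm(Y1.T @ Y2, "fro") ** 2

# Toy inputs (assumed for the example).
rng = np.random.default_rng(0)
Y1, _ = np.linalg.qr(rng.standard_normal((10, 3)))
Y2, _ = np.linalg.qr(rng.standard_normal((10, 3)))
C = np.diag([np.e, np.e ** 2])
k_same = projection_kernel(Y1, Y1)   # equals the subspace dimension
k_cross = projection_kernel(Y1, Y2)  # lies between 0 and the subspace dimension
k_spd = log_euclidean_kernel(C, C)
```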
V-B Video-based Face Recognition
In this paper, the challenging and widely used YouTube Celebrities (YTC) [14, 10, 23, 33] dataset is used for the task of video-based face recognition. This dataset contains 1,910 video clips of 47 subjects collected from YouTube. Each clip comprises hundreds of frames, most of which exhibit large variations in expression, illumination and pose. The number of image sets available for each subject is not fixed. Some sample face frames of this dataset are shown in Fig.2. Following the previous works [14, 10, 23, 33, 38, 15], in our experiments we first convert each face image into a grayscale one, and histogram equalization is adopted as pre-processing to eliminate the effects of lighting. Then, we randomly select nine image sets for each subject, with three for training and six for testing. Finally, we repeat the experiment over ten different gallery/probe combinations and report the average recognition rates of the different methods in Table 1.
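The random gallery/probe protocol described above can be sketched as follows. The function name `sample_gallery_probe`, the per-trial seeding, and the toy label vector are assumptions made for the example; the sketch also assumes that every subject has at least nine image sets.

```python
import numpy as np

def sample_gallery_probe(set_labels, n_train=3, n_test=6, seed=0):
    """Pick n_train + n_test image sets per subject and split them at random."""
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for c in np.unique(set_labels):
        idx = np.flatnonzero(set_labels == c)
        chosen = rng.choice(idx, size=n_train + n_test, replace=False)
        train_idx.extend(chosen[:n_train])
        test_idx.extend(chosen[n_train:])
    return np.array(train_idx), np.array(test_idx)

# Toy data (assumed for the example): 4 subjects with 12 image sets each.
labels = np.repeat(np.arange(4), 12)
# Ten different gallery/probe combinations, one seed per trial.
splits = [sample_gallery_probe(labels, seed=t) for t in range(10)]
```

The reported figure is then the mean recognition rate over the ten splits.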
According to the results listed in Table 1, we make several observations. Firstly, the recognition rate of GDA is inferior to that of GEDA, which demonstrates that considering the local structure of the data helps to extract more discriminative information when performing discriminant analysis on the Grassmann manifold. Furthermore, the classification performance of both GDA and GEDA is lower than that of PML. The most fundamental reason is that directly performing dimensionality reduction on the Grassmann manifold characterizes the geometry of the original set data more faithfully than the Euclidean treatment. The same reason also explains the difference in recognition rates between CDL and LEML.
Secondly, the proposed algorithm achieves better classification performance than the state-of-the-art methods on this dataset. Among these results, we first compare SPDML-AIM/SPDML-Stein with LEML. LEML outperforms the former two, which justifies the potential superiority of the LEM based Riemannian manifold dimensionality reduction framework over the AIM and Stein divergence based ones. An essential reason is that the inherent matrix form of LEML can preserve more structural information of the space spanned by SPD matrices than the vector form of SPDML-AIM/SPDML-Stein. We then discuss the performance of SPDNet. It performs relatively poorly compared with the other SPD matrix learning methods, which may be due to the limited number of training samples.
Lastly, the comparison between the proposed algorithm and LMKML and HERML is of particular interest on this dataset. LMKML and HERML outperform most of the comparative methods in terms of recognition rate, which indicates that combining multiple statistical features of the original set data yields more discriminative information than single-model based methods. However, LMKML is clearly surpassed by the proposed algorithm. As discussed before, the reason is that the proposed algorithm learns data-specific kernel features instead of the unified one learned in LMKML, which better preserves the original set data structure. HERML makes no distinction between different local regions in the produced kernel spaces in terms of discriminability, which leads to weaker performance than ours.
V-C Set-based Object Categorization
For the set-based object categorization task, we conduct experiments on the ETH-80 dataset [14, 15]. This dataset consists of 8 categories: cows, cups, horses, dogs, tomatoes, cars, pears, and apples, with 10 image sets per category. Each image set contains 41 images collected from different perspectives, and the size of each image is . Some sample images from the ETH-80 dataset are shown in Fig.3. To be consistent with the original works [14, 15, 39, 38, 34], we first extract the gray-scale features of each original instance and adjust its size to . Then, we randomly choose five objects in each category as training sets and the remaining five as query sets. We randomly split this dataset into ten different pairs of training and testing sets, and Table 2 reports the average classification accuracies of the different methods.
From the classification results reported in Table 2, we summarize our observations in four aspects. First, the classification performance of GEDA surpasses that of GDA, which further demonstrates the importance of exploiting the local structure of the set data. Second, the classification results of SPDML-AIM/SPDML-Stein are both inferior to that of LEML, which further justifies that matrix-form SPD matrix feature learning is more effective than the vector-form one. Third, MMDML achieves better classification accuracy than SPDNet and the other set-based methods. This is mainly because the class-specific deep network designed in MMDML can extract more discriminative feature information for classification. In contrast, SPDNet produces a relatively mediocre classification rate on this dataset, which further indicates that the number of training sets plays a vital role in SPD manifold deep learning. Lastly, we examine the classification performance of HERML and LMKML on this dataset. As can be seen in Table 2, HERML achieves an impressive classification result and the performance of LMKML is comparable. This again shows that the complementary feature information provided by multiple statistics helps to boost image set classification performance. The proposed algorithm yields a state-of-the-art classification result, which again demonstrates its effectiveness.
V-D Video-based Emotion Recognition
For further evaluation, we apply the proposed algorithm to a much more difficult facial expression dataset for the emotion recognition task. This dataset, Acted Facial Expression in Wild (AFEW) , depicts natural facial expressions in unconstrained environments and contains 1,345 video sequences of facial expressions collected from movies with close to real-world scenarios. Some examples from the AFEW dataset are presented in Fig.4. To comply with the standard protocol of the Emotion Recognition in the Wild Challenge (EmotiW2014) , we divide this dataset into three parts: training set, validation set and test set. Then, we follow [50, 26] to split the training video sequences into 1,746 small clips for data augmentation. For the task of classifying each video sequence into one of the seven expression classes, we first resize each facial frame to an gray-scale image, and then follow [50, 26] to report the recognition results of the different competitors on the validation set, because the ground truth of the test set is not publicly available.
According to the recognition results tabulated in Table 3, the performance of HERML exceeds that of most of the comparative methods on the emotion recognition task. There are two reasons: 1) as discussed before, multiple statistics provide complementary feature information; 2) by jointly learning Euclidean-and-Riemannian metrics, more useful geometric information can be explored on this complicated video-based dataset. Another interesting observation is that LEML and SPDML-AIM/SPDML-Stein perform worse than CDL. A possible explanation is that the linear transformation functions of LEML and SPDML-AIM/SPDML-Stein are learned on the non-linear manifold, and thus lose some ability to parse the structural information of complicated scenarios. On this large-scale facial expression dataset, SPDNet shows its superiority in emotion recognition over the other representative methods. Meanwhile, our method outperforms all the competitors, which demonstrates that integrating multiple Riemannian manifold-valued descriptors improves the final classification accuracy.
V-E Dynamic Scene Classification
Dynamic scene classification in an unconstrained setting is a fundamental and challenging task in computer vision. Recently, image set classification has provided a new direction to address this task. In this paper, we report the classification performance of our method on the MDSD [37, 56] dataset. This dataset comprises 13 different categories of dynamic scenes, each with 10 video sequences. Some sample images from the MDSD dataset are presented in Fig.5. Due to the large intra-class variation in illumination, resolution, physical morphology and background, this classification task is very demanding. In our experiments, we follow the same protocol as in the above experimental settings to preprocess each video frame, and use the seventy-thirty-ratio (STR) protocol, which builds the gallery and probes by randomly selecting 7 videos in each category for the training set and the rest for the query set. We again repeat the experiment over ten different gallery/probe combinations. The final average classification results are given in Table 4.
For evaluation, we compare the proposed algorithm with eleven state-of-the-art image set classification methods, as listed in Table 4. As can be seen in this table, our method produces relatively good classification performance compared with the others. However, the low classification accuracies obtained by all these methods illustrate that this dynamic scene classification task is challenging. It is also interesting to see that DARG surpasses the other kernel based methods. The main reason is that the dissimilarity between Gaussians in DARG is measured by separately computing the dissimilarity between means and between covariance matrices with the Mahalanobis distance and LEM, respectively, which can extract more structural information for classification. HERML also performs well on this dataset, and the same observation holds on the other three datasets. This further justifies the effectiveness of jointly learning multiple statistics of image sets. Lastly, the state-of-the-art classification results achieved by the proposed method on all the datasets verify its feasibility and utility.
V-F Ablation Study for Different Riemannian Manifold-valued Descriptors
In the previous experiments, the proposed algorithm has shown its superiority in image set classification over some representative set-based methods. Here, we further conduct experiments to observe the classification performance of each Riemannian manifold-valued descriptor when combined with the proposed metric learning framework. Table 5 lists their classification results on the four datasets, and the observations can be summarized in two aspects. Firstly, the relative performance of the descriptors differs across datasets. Specifically, on the YTC and AFEW datasets the Grassmann manifold-valued descriptor achieves better classification performance than the other two, which may indicate that the linear subspace is more effective in characterizing the structural information of face images. In contrast, the Gaussian distribution yields the best recognition rates on the ETH-80 and MDSD datasets. Two reasons may explain this: 1) most of the image sets in these two datasets conform to a Gaussian distribution; 2) the Gaussian model contains both the first-order and the second-order statistics of the set data. Secondly, the proposed algorithm, which simultaneously couples the three Riemannian manifold-valued descriptors with our multi-kernel metric learning framework, outperforms each descriptor used separately, which further justifies the complementarity of these three descriptors in set data modeling.
V-G Ablation Study for Convergence Behavior
As discussed in Section 4.2.2, we aim to learn the transformation matrix but have to infer and simultaneously. Hence, we solve this problem iteratively. To optimize , we follow  to directly solve the trace ratio problem defined in Eq.27, and for and we utilize the gradient ascent method. Although it is hard to provide a systematic theoretical proof of the convergence of this optimization problem, we find experimentally that after several iterations the objective function Eq.32 reaches a stable value. Fig.6 and Fig.7 were obtained on the AFEW and YTC datasets, respectively, and they show that as the number of iterations increases, the value of the objective function fluctuates only within a very small range. Furthermore, we also increase the number of iterations to 40, at which point the values of the objective function on the two datasets are 0.8398 and 0.9207, respectively. This indicates that our algorithm achieves stable classification performance as the number of iterations grows.
V-H Parameter Discussion
According to the description in Section 4, the resulting high-dimensional heterogeneous and complementary Riemannian manifold-valued features are fused into a -dimensional Euclidean space under the proposed multi-kernel metric learning framework. Since more useful and more compact feature representations often reside in a lower-dimensional feature space, it is necessary to find a suitable value of . Therefore, we conduct experiments on the MDSD and YTC datasets to compare the impact of different on the final classification results of our method. The experimental results are presented in Fig.8 and Fig.9, respectively. From Fig.8, we can see that our method achieves its best classification result when the value of is 25. Moreover, when reaches its maximum value on the MDSD dataset, which is 91, the resulting classification accuracy of 5.13% is far lower than in the other cases. From Fig.9, we find that 70 is the best value of on the YTC dataset. When is increased to 141, its maximum value on this dataset, the resulting recognition rate of 72.22% is somewhat lower.
With the above observations in mind, two reasons can explain the trends of the curves depicted in Fig.8 and Fig.9. First, when the value of is low, the learned features carry insufficient discriminative information to separate some overlapping samples, which may lead to lower classification accuracies. Second, when the value of is high, some redundant information cannot be effectively filtered out of the extracted features, which may also lead to undesirable recognition rates. Besides, on the AFEW and ETH-80 datasets, the best values of are 70 and 8, respectively.
In this paper, we propose a novel image set classification algorithm that fuses multiple Riemannian manifold-valued features of image sets with a designed multi-kernel metric learning framework. The proposed algorithm has been assessed on four image set classification tasks: video-based face recognition, set-based object categorization, video-based emotion recognition and dynamic scene classification. Extensive experimental results on four datasets demonstrate its superiority over some representative image set classification methods. Besides, the comparison between each single Riemannian manifold-valued descriptor and their combination justifies their complementarity in encoding the set data, and shows that their fusion helps to improve the classification performance on video-based set data.
Since the temporal order is an important factor in describing the frames of a video, we plan to integrate it into the proposed framework and hope that this will improve its discriminative ability on some complicated classification tasks. Another possible direction for future work is to investigate other metric learning methods to fuse the heterogeneous and complementary features. Finally, we would like to transfer some popular Euclidean deep learning architectures to Riemannian manifolds for better recognition on large-scale video-based datasets.
-  J. R. Barr, K. W. Bowyer, P. J. Flynn, S. Biswas, Face recognition from video: A review, International Journal of Pattern Recognition and Artificial Intelligence 26 (05) (2012) 1266002126600253.
-  S. K. Zhou, R. Chellappa, B. Moghaddam, Visual tracking and recognition using appearance-adaptive models in particle filters, IEEE TIP 13 (11) (2004) 1491–1506.
-  N. Ye, Y. Ning, T. Sim, Towards general motion-based face recognition, in: CVPR (2010) 2598–2650.
-  M. Kim, S. Kumar, V. Pavlovic, H. Rowley, Face tracking and recognition with visual constraints in real-world videos, in: CVPR (2008) 1–8.
-  Z. Chen, X. J. Wu, J. Kittler, A sparse regularized nuclear norm based matrix regression for face recognition with contiguous occlusion, Pattern Recognition Letters, 2019.
-  D. Yu, X. J. Wu, 2DPCANet: a deep leaning network for face recognition, Multimedia Tools and Applications 77 (10) (2018) 12919–12934.
-  O. Arandjelovic, G. Shakhnarovich, J. Fisher, R. Cipolla, T. Darrell, Face recognition with image sets using manifold density divergence, in: CVPR (2005) 581–588.
-  H. Cevikalp, B. Triggs, Face recognition based on image sets, in: CVPR (2010) 2567–2573.
-  Y. Hu, A. S. Mian, R. Owens, Sparse approximated nearest points for image set classification, in: CVPR (2011) 121–128.
-  W. Wang, R. Wang, Z. Huang, S. Shan, X. Chen, Discriminant analysis on riemannian manifold of gaussian distributions for face recognition with image sets, in: CVPR (2015) 2048–2057.
-  O. Yamaguchi, K. Fukui, K.-i. Maeda, Face recognition using temporal image sequence, in: FG (1998) 318–323.
-  J. Hamm, D. D. Lee, Grassmann discriminant analysis: a unifying view on subspace-based learning, in: ICML (2008) 376–383.
-  Z. Wu, Y. Huang, L. Wang, Learning representative deep features for image set analysis, IEEE TMM 17 (11) (2015) 1960–1968.
-  R. Wang, H. Guo, L. S. Davis, Q. Dai, Covariance discriminative learning: A natural and efficient approach to image set classification, in: CVPR (2012) 2496–2503.
-  Z. Huang, R. Wang, S. Shan, X. Li, X. Chen, Log-euclidean metric learning on symmetric positive definite manifold with application to image set classification, in: ICML (2015) 720–729.
-  M. Harandi, M. Salzmann, R. Hartley, Dimensionality reduction on spd manifolds: The emergence of geometry-aware methods, IEEE TPAMI (2018) 48–62.
-  O. Tuzel, F. Porikli, P. Meer, Region covariance: A fast descriptor for detection and classification (2006) 589–600.
-  O. Tuzel, F. Porikli, P. Meer, Pedestrian detection via classification on riemannian manifolds, IEEE TPAMI (2008) 1713–1727.
-  P. Turaga, A. Veeraraghavan, A. Srivastava, R. Chellappa, Statistical computations on grassmann and stiefel manifolds for image and video-based recognition, IEEE TPAMI (2011) 2273–2286.
-  X. Pennec, P. Fillard, N. Ayache, A riemannian framework for tensor computing, IJCV (2006) 41–66.
-  V. Arsigny, P. Fillard, X. Pennec, N. Ayache, Geometric means in a novel vector space structure on symmetric positive-definite matrices, SIAM Journal on Matrix Analysis and Applications (2007) 328–347.
-  S. Sra, Positive definite matrices and the s-divergence, in: Proceedings of the American Mathematical Society (2016) 2787–2797.
-  Z. Huang, R. Wang, S. Shan, X. Chen, Projection metric learning on grassmann manifold with application to video based face recognition, in: CVPR (2015) 140–149.
-  M. T. Harandi, C. Sanderson, R. Hartley, B. C. Lovell, Sparse coding and dictionary learning for symmetric positive definite matrices: A kernel approach, in: ECCV (2012) 216–229.
-  S. Jayasumana, R. Hartley, M. Salzmann, H. Li, M. Harandi, Kernel methods on the riemannian manifold of symmetric positive definite matrices, in: CVPR (2013) 73–80.
-  Z. Huang, L. J. Van Gool, A riemannian network for spd matrix learning, in: AAAI (2017) 3.
-  Z. Huang, J. Wu, L. Van Gool, Building deep networks on grassmann manifolds, in: AAAI (2018).
-  Z. Huang, C. Wan, T. Probst, L. Van Gool, Deep learning on lie groups for skeleton-based action recognition, in: CVPR (2017) 1243–1252.
-  G. Shakhnarovich, J. W. Fisher, T. Darrell, Face recognition from long-term observations, in: ECCV (2002) 851–865.
-  T. M. Cover, J. A. Thomas, Elements of information theory, John Wiley & Sons, 2012.
-  S.-i. Amari, H. Nagaoka, Meth