In the past decade, sparsity has become a popular term in neuroscience, information theory, signal processing and related areas Olshausen_1996_Nature ; CANDES_2006 ; Donoho_2006 ; Wright:PAMI:2009 ; ELAD_SR_BOOK_2010 . Through sparse representation and compressive sensing, it is possible to represent natural signals like images using only a few non-zero coefficients of a suitable basis. In computer vision, sparse and overcomplete image representations were first introduced by Olshausen_1996_Nature for modeling the spatial receptive fields of simple cells in the human visual system. The linear decomposition of a signal using a few atoms of a dictionary has been shown to deliver notable results for various visual inference tasks, such as face recognition Wright:PAMI:2009 ; Wright_2010_IEEE , image classification Yang_CVPR_2009 ; Mairal_2008_CVPR , subspace clustering Elhamifar_2013_PAMI , image restoration Mairal_TIP_2008 , and motion segmentation Rao_CVPR_2008 , to name a few. While significant steps have been taken to develop the theory of sparse coding and dictionary learning in Euclidean spaces, similar problems in non-Euclidean geometries have received comparatively little attention Yuan_ACCV_2009 ; Harandi_TNNLS15 ; Harandi_2013_ICCV ; Guo_TIP_2013 ; Vemuri_ICML_2013 ; Cetingul_TMI_2014 .
This paper introduces techniques to sparsely represent $p$-dimensional linear subspaces in $\mathbb{R}^n$ using a combination of linear subspaces. Linear subspaces can be considered the core of many inference algorithms in computer vision and machine learning. For example, the set of all reflectance functions produced by Lambertian objects lies in a linear subspace Basri_2003_PAMI ; Ramamoorthi_2002_PAMI . Several state-of-the-art methods for matching videos or image sets model the given data by subspaces HAMM2008_ICML ; Harandi_CVPR_2011 ; Turaga_PAMI_2011 ; Vemula:CVPR:2013 ; Sanderson_AVSS_2012 ; Shaokang:CVPR:2013 . Autoregressive and moving average models, which are typically employed to model dynamics in spatio-temporal processing, can also be expressed by linear subspaces Turaga_PAMI_2011 . Further applications of linear subspaces in computer vision include chromatic noise filtering Subbarao:IJCV:2009 , subspace clustering Elhamifar_2013_PAMI , motion segmentation Rao_CVPR_2008 , domain adaptation GFK:CVPR:2012 ; Gopalan:PAMI:2013 , and object tracking Shirazi_2015 .
Despite their wide applications and appealing properties, subspaces lie on a special type of Riemannian manifold, namely the Grassmann manifold, which makes their analysis very challenging. This paper tackles and provides efficient solutions to the following two fundamental problems on Grassmann manifolds (see Fig. 1 for a conceptual illustration):
Coding. Given a subspace $\mathcal{X}$ and a set $\mathbb{D} = \{\mathcal{D}_j\}_{j=1}^{N}$ with $N$ elements (also known as atoms), where $\mathcal{X}$ and the $\mathcal{D}_j$ are linear subspaces, how can $\mathcal{X}$ be approximated by a combination of atoms in $\mathbb{D}$?
Dictionary learning. Given a set of subspaces $\{\mathcal{X}_i\}_{i=1}^{m}$, how can a smaller set of subspaces $\mathbb{D} = \{\mathcal{D}_j\}_{j=1}^{N}$, with $N < m$, be learned to represent $\{\mathcal{X}_i\}_{i=1}^{m}$ accurately?
Our main motivation here is to develop new methods for analyzing video data and image sets. This is inspired by the success of sparse signal modeling and related topics that suggest natural signals like images (and hence video and image sets as our concern here) can be efficiently approximated by superposition of atoms of a dictionary. We generalize the traditional notion of coding, which operates on vectors, to coding on subspaces. Coding with the dictionary of subspaces can then be seamlessly used for categorizing video data.
Considering the problem of coding and dictionary learning on Grassmann manifolds, previous studies (e.g., Vemuri_ICML_2013 ; Cetingul_ISBI_2011 ; Cetingul_TMI_2014 ) opt for an intrinsic and general framework for sparse coding on Riemannian manifolds. This intrinsic formulation exploits the tangent bundle of the manifold for sparse coding. Due to the computational complexity of the logarithm map on Grassmann manifolds, performing intrinsic sparse coding can be computationally demanding for the problems we are interested in (e.g., video analysis). Moreover, learning a dictionary based on the intrinsic formulation proposed by Vemuri_ICML_2013 requires computing the gradient of a cost function that includes terms based on the logarithm map. As will be shown later, the involvement of the logarithm map (which does not have an analytic formulation on Grassmann manifolds) deprives us of a closed-form solution for learning a Grassmann dictionary intrinsically.
Contributions. In light of the above discussion, in this paper we introduce extrinsic methods for coding and dictionary learning on Grassmann manifolds. To this end, we propose to embed Grassmann manifolds into the space of symmetric matrices by a diffeomorphism that preserves several properties of the Grassmannian structure. We show how coding can be accomplished in the induced space and devise an algorithm for updating a Grassmann dictionary atom by atom. Furthermore, in order to accommodate non-linearity in data, we propose kernelized versions of our coding and dictionary learning algorithms. Our contributions are therefore three-fold:
We propose to perform coding and dictionary learning for data points on Grassmann manifolds by embedding the manifolds into the space of symmetric matrices.
We derive kernelized versions of the proposed coding and dictionary learning algorithms (i.e., embedded into Hilbert spaces), which can address non-linearity in data.
We apply the proposed Grassmann dictionary learning methods to several computer vision tasks where the data are videos or image sets. Our proposed algorithms outperform state-of-the-art methods on a wide range of classification tasks, including gender recognition from gait, scene analysis, face recognition from image sets, action recognition and dynamic texture classification.
2 Background Theory
This section overviews Grassmann geometry and provides the groundwork for techniques described in following sections. Since the term “manifold” itself is often used in computer vision in a somewhat loose sense, we emphasize that the word is used in this paper in its strict mathematical sense.
Throughout the paper, bold capital letters denote matrices (e.g., $\mathbf{X}$) and bold lower-case letters denote column vectors (e.g., $\mathbf{x}$). The notations $[\mathbf{x}]_i$ and $[\mathbf{X}]_{i,j}$ denote the element at position $i$ of a vector and position $(i,j)$ of a matrix, respectively. $\mathbf{1}_d$ and $\mathbf{0}_d$ are $d$-dimensional vectors of ones and zeros. $\mathbf{I}_d$ is the $d \times d$ identity matrix. $\|\cdot\|_1$ and $\|\cdot\|_2$ denote the $\ell_1$ and $\ell_2$ norms, respectively, with $^T$ indicating transposition. $\|\cdot\|_F$ designates the Frobenius norm, with $\mathrm{Tr}(\cdot)$ computing the matrix trace.
2.1 Grassmann Manifolds and their Riemannian Structure
For $p < n$, the space of $n \times p$ matrices with orthonormal columns is not a Euclidean space but a Riemannian manifold, the Stiefel manifold $\mathcal{V}(p,n)$. That is,

$$\mathcal{V}(p,n) \triangleq \big\{ \mathbf{X} \in \mathbb{R}^{n \times p} : \mathbf{X}^T \mathbf{X} = \mathbf{I}_p \big\}.$$
By grouping together all points on $\mathcal{V}(p,n)$ that span the same subspace we obtain the Grassmann manifold $\mathcal{G}(p,n)$. More formally, the Stiefel manifold admits a right action by the orthogonal group $\mathcal{O}(p)$ (consisting of $p \times p$ orthogonal matrices); for $\mathbf{X} \in \mathcal{V}(p,n)$ and $\mathbf{R} \in \mathcal{O}(p)$, the matrix $\mathbf{X}\mathbf{R}$ is also an element of $\mathcal{V}(p,n)$. Furthermore, the columns of $\mathbf{X}$ and $\mathbf{X}\mathbf{R}$ span the same subspace of $\mathbb{R}^n$, and are to be thought of as representatives of the same element of the Grassmann manifold $\mathcal{G}(p,n)$. Thus, the orbits of this group action form the elements of the Grassmann manifold. The resulting set of orbits is a manifold according to the quotient manifold theorem (see Theorem 21.10 in Lee_Book_2012 ). The details of this construction are not critical to an understanding of the rest of this paper.
An element $\mathcal{X}$ of $\mathcal{G}(p,n)$ can be specified by a basis, i.e., a set of $p$ linearly independent vectors such that $\mathcal{X}$ is the set of all their linear combinations. When the basis vectors are ordered as the columns of a matrix $\mathbf{X} \in \mathcal{V}(p,n)$, then $\mathbf{X}$ is said to span $\mathcal{X}$ and we write $\mathcal{X} = \mathrm{span}(\mathbf{X})$. In what follows, we refer to a subspace, and hence a point on $\mathcal{G}(p,n)$, by its basis matrix $\mathbf{X}$. The choice of basis is not unique, but it has no effect on what we develop later.
A Riemannian metric on a manifold is defined formally as a smooth inner product on the tangent bundle (see Absil_2004 for the form of the Riemannian metric on $\mathcal{G}(p,n)$). However, we shall be concerned only with geodesic distances on the Grassmann manifold, which allows us to avoid many technical points and give a straightforward definition.
On a Riemannian manifold, points are connected via smooth curves. The geodesic distance between two points is defined as the length of the shortest curve in the manifold (called a geodesic) connecting them. The Stiefel manifold $\mathcal{V}(p,n)$ is embedded in $\mathbb{R}^{n \times p}$, which may be seen as a Euclidean space with distances defined by the Frobenius norm. Consequently, the length of a smooth curve (or path) in $\mathcal{V}(p,n)$ is defined as its length as a curve in $\mathbb{R}^{n \times p}$. Now, given two points $\mathcal{X}$ and $\mathcal{Y}$ in $\mathcal{G}(p,n)$, the distance $d_{\mathrm{geo}}(\mathcal{X}, \mathcal{Y})$ is defined as the length of the shortest path in $\mathcal{V}(p,n)$ between any two points $\mathbf{X}$ and $\mathbf{Y}$ that are members of the equivalence classes $\mathcal{X}$ and $\mathcal{Y}$.
The geodesic distance has an interpretation as the magnitude of the smallest rotation that takes one subspace to the other. If $\Theta = [\theta_1, \theta_2, \ldots, \theta_p]$ is the sequence of principal angles between two subspaces $\mathcal{X}$ and $\mathcal{Y}$, then $d_{\mathrm{geo}}(\mathcal{X}, \mathcal{Y}) = \|\Theta\|_2$.
Definition 1 (Principal Angles)
Let $\mathbf{X}$ and $\mathbf{Y}$ be two matrices of size $n \times p$ with orthonormal columns. The principal angles $0 \leq \theta_1 \leq \theta_2 \leq \cdots \leq \theta_p \leq \pi/2$ between the two subspaces $\mathrm{span}(\mathbf{X})$ and $\mathrm{span}(\mathbf{Y})$ are defined recursively by

$$\cos\theta_i = \max_{\mathbf{u} \in \mathrm{span}(\mathbf{X})} \; \max_{\mathbf{v} \in \mathrm{span}(\mathbf{Y})} \mathbf{u}^T \mathbf{v}, \quad \text{s.t.} \;\; \|\mathbf{u}\|_2 = \|\mathbf{v}\|_2 = 1, \;\; \mathbf{u}^T \mathbf{u}_j = \mathbf{v}^T \mathbf{v}_j = 0 \;\; \text{for } j < i.$$
In other words, the first principal angle is the smallest angle between all pairs of unit vectors in the first and the second subspaces. The rest of the principal angles are defined similarly.
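As a concrete illustration (the helper names below are ours, not part of the paper), the principal angles can be read off from the singular values of $\mathbf{X}^T\mathbf{Y}$, and the geodesic distance is the $\ell_2$ norm of the angle vector:

```python
import numpy as np

def principal_angles(X, Y):
    """Principal angles between span(X) and span(Y); X, Y have
    orthonormal columns, so cos(theta_i) are the singular values
    of X^T Y."""
    s = np.linalg.svd(X.T @ Y, compute_uv=False)
    # Clip guards against round-off pushing cosines slightly above 1.
    return np.arccos(np.clip(s, -1.0, 1.0))

def geodesic_distance(X, Y):
    """Arc-length distance: l2 norm of the principal-angle vector."""
    return np.linalg.norm(principal_angles(X, Y))

# Two 2-dimensional subspaces of R^4 with a known relation:
X = np.eye(4)[:, :2]          # span{e1, e2}
Y = np.eye(4)[:, 2:]          # span{e3, e4}, orthogonal to X
```

For the orthogonal pair above both principal angles equal $\pi/2$, so the geodesic distance is $\pi/\sqrt{2}$.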
Two operators, namely the logarithm map and its inverse, the exponential map, are defined on Riemannian manifolds to switch between the manifold and the tangent space at a point $\mathcal{X}$. A key point here is that neither the logarithm map nor its inverse has a closed-form solution on Grassmann manifolds. Efficient numerical approaches for computing both maps were proposed by ANUJ_2003 ; Begelfor_CVPR_2006 . In this paper, however, the exponential and logarithm maps will only be used when describing previous work of other authors.
3 Problem Statement
In vector spaces, by coding we mean the general notion of representing a vector $\mathbf{x} \in \mathbb{R}^d$ (the query) as some combination of other vectors belonging to a dictionary. Typically, $\mathbf{x}$ is expressed as a linear combination $\sum_j y_j \mathbf{d}_j$, or else as an affine combination in which the coefficients additionally satisfy the constraint $\sum_j y_j = 1$. (This constraint may also be written as $\mathbf{1}^T \mathbf{y} = 1$.)
In sparse coding one seeks to express the query in terms of a small number of dictionary elements. Given a query $\mathbf{x} \in \mathbb{R}^d$ and a dictionary of size $N$, i.e., $\mathbb{D} = \{\mathbf{d}_j\}_{j=1}^{N}$ with atoms $\mathbf{d}_j \in \mathbb{R}^d$, the problem of coding $\mathbf{x}$ can be formulated as solving the minimization problem:

$$\ell(\mathbf{x}, \mathbb{D}) \triangleq \min_{\mathbf{y}} \Big\| \mathbf{x} - \sum_{j=1}^{N} y_j \mathbf{d}_j \Big\|_2^2 + \lambda f(\mathbf{y}). \qquad (3)$$
The domain of $\mathbf{y}$ may be the whole of $\mathbb{R}^N$, so that the sum runs over all linear combinations of dictionary elements (or atoms), or alternatively, the extra constraint $\mathbf{1}^T \mathbf{y} = 1$ may be imposed to restrict the sum to affine combinations.
The idea here is to (approximately) reconstruct the query by a combination of dictionary atoms while forcing the coefficients of the combination, i.e., $\mathbf{y}$, to have some structure. The quantity $\ell(\mathbf{x}, \mathbb{D})$ can be thought of as a coding cost combining the squared residual coding error, reflected in the energy term in (3), along with a penalty term $\lambda f(\mathbf{y})$, which encourages some structure such as sparsity. The function $f(\cdot)$ could be the $\ell_1$ norm, as in the Lasso problem Tibshirani_1996 , or some form of locality as proposed by Yu:NIPS:2009 and Wang_CVPR_2010_LLC .
The problem of dictionary learning is to determine $\mathbb{D}$ given a finite set of observations $\{\mathbf{x}_i\}_{i=1}^{m}$, by minimizing the total coding cost for all observations, namely

$$\min_{\mathbb{D}} \; \sum_{i=1}^{m} \ell(\mathbf{x}_i, \mathbb{D}).$$
A “good” dictionary has a small residual coding error for all observations while producing codes with the desired structure. For example, in the case of sparse coding, the $\ell_1$ norm is usually taken as $f(\cdot)$ to obtain the most common form of dictionary learning in the literature. More specifically, the sparse dictionary learning problem may be written in full as that of jointly minimizing the total coding cost over all choices of coefficients and dictionary:

$$\min_{\mathbb{D}, \{\mathbf{y}_i\}} \; \sum_{i=1}^{m} \Big\| \mathbf{x}_i - \sum_{j=1}^{N} [\mathbf{y}_i]_j \mathbf{d}_j \Big\|_2^2 + \lambda \|\mathbf{y}_i\|_1. \qquad (5)$$
A common approach to solving this is to alternate between the two sets of variables, $\{\mathbf{y}_i\}$ and $\mathbb{D}$, as proposed for example by Aharon_2006_ksvd (see ELAD_SR_BOOK_2010 for a detailed treatment). Minimizing (5) over the sparse codes while the dictionary is fixed is a convex problem. Similarly, minimizing the overall problem over $\mathbb{D}$ with the codes fixed is convex as well.
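The alternation can be sketched as follows; this is a minimal illustration of the idea (ISTA for the coding step and a ridge-regularized least-squares dictionary update), deliberately simpler than the K-SVD update of Aharon_2006_ksvd, with all function names our own:

```python
import numpy as np

def sparse_code(D, x, lam, n_iter=200):
    """ISTA for min_y 0.5*||x - D y||_2^2 + lam*||y||_1 (coding step)."""
    L = np.linalg.norm(D, 2) ** 2              # Lipschitz constant of gradient
    y = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = y - (D.T @ (D @ y - x)) / L        # gradient step
        y = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # shrinkage
    return y

def learn_dictionary(X, n_atoms, lam, n_iter=10, seed=0):
    """Alternate between coding (dictionary fixed) and a closed-form
    least-squares dictionary update (codes fixed), renormalising atoms."""
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((X.shape[0], n_atoms))
    D /= np.linalg.norm(D, axis=0)
    for _ in range(n_iter):
        Y = np.stack([sparse_code(D, x, lam) for x in X.T], axis=1)
        D = X @ Y.T @ np.linalg.pinv(Y @ Y.T + 1e-8 * np.eye(n_atoms))
        D /= np.maximum(np.linalg.norm(D, axis=0), 1e-12)
    Y = np.stack([sparse_code(D, x, lam) for x in X.T], axis=1)
    return D, Y

rng = np.random.default_rng(1)
X = rng.standard_normal((8, 50))               # 50 toy observations in R^8
D, Y = learn_dictionary(X, n_atoms=12, lam=0.1)
```

Each coding step starts from $\mathbf{y} = \mathbf{0}$ and monotonically decreases its objective, so the final reconstruction error is never worse than representing every sample by zero.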
In generalizing the coding problem to a more general space $\mathcal{M}$ (e.g., a Riemannian manifold), one may write (3) as

$$\ell(\mathcal{X}, \mathbb{D}) \triangleq \min_{\mathbf{y}} \; d^2\big(\mathcal{X}, \, C(\mathbb{D}, \mathbf{y})\big) + \lambda f(\mathbf{y}). \qquad (6)$$

Here $\mathcal{X}$ and $C(\mathbb{D}, \mathbf{y})$ are points in the space $\mathcal{M}$, while $d(\cdot, \cdot)$ is some distance metric and $C(\cdot, \cdot)$ is an encoding function, assigning an element of $\mathcal{M}$ to every choice of coefficients and dictionary. Note that (3) is a special case of this, in which $C$ represents linear or affine combination and $d$ is the Euclidean distance metric. To define the coding, one need only specify the metric $d$ to be used and the encoding function $C$. Although this formulation may apply to a wide range of spaces, here we shall be concerned chiefly with coding on Grassmann manifolds.
4 Related Work
A seemingly straightforward method for coding and dictionary learning is through embedding manifolds into Euclidean spaces via a fixed tangent space. The embedding function in this case would be $\log_{\mathcal{P}}(\cdot)$, where $\mathcal{P}$ is some default base point on $\mathcal{G}(p,n)$.
By mapping points in the manifold to the tangent space, the problem at hand is transformed to its Euclidean counterpart. For example, in the case of sparse coding, instead of (6), the encoding cost may be defined as follows:

$$\ell(\mathcal{X}, \mathbb{D}) = \min_{\mathbf{y}} \Big\| \log_{\mathcal{P}}(\mathcal{X}) - \sum_{j=1}^{N} y_j \log_{\mathcal{P}}(\mathcal{D}_j) \Big\|_{\mathcal{P}}^2 + \lambda \|\mathbf{y}\|_1,$$

where the notation $\|\cdot\|_{\mathcal{P}}$ reminds us that the norm is taken in the tangent space at $\mathcal{P}$. We shall refer to this straightforward approach as Log-Euclidean sparse coding (the corresponding steps for Grassmann manifolds are given in Algorithm 1), following the terminology used in Arsigny:2006 . This idea has been deployed for action recognition on the manifold of symmetric positive definite matrices by Yuan_ACCV_2009 and Guo_TIP_2013 . Since on a tangent space only distances to the base point are equal to true geodesic distances, the Log-Euclidean solution does not take into account the true structure of the underlying Riemannian manifold. Moreover, the solution depends on the particular point used as the base point.
A more elegant and intrinsic approach is to work in the tangent bundle of the manifold, varying the particular tangent space according to the point being approximated. Such an idea has roots in the work of Goh_CVPR_2008 , which extends various methods of dimensionality reduction to Riemannian manifolds. As for sparse coding, Cetingul_ISBI_2011 , Cetingul_TMI_2014 and Vemuri_ICML_2013 show that by working in the tangent space at $\mathcal{X}$, i.e., $T_{\mathcal{X}}\mathcal{M}$, the encoding cost in (6) can be written as

$$\ell(\mathcal{X}, \mathbb{D}) = \min_{\mathbf{y}} \Big\| \sum_{j=1}^{N} y_j \log_{\mathcal{X}}(\mathcal{D}_j) \Big\|_{\mathcal{X}}^2 + \lambda \|\mathbf{y}\|_1, \quad \text{s.t.} \;\; \mathbf{1}^T \mathbf{y} = 1. \qquad (8)$$

The extra affine constraint, i.e., $\mathbf{1}^T \mathbf{y} = 1$, is necessary to avoid a trivial solution and has been used successfully in other applications such as dimensionality reduction Roweis_2000_LLE , subspace clustering Elhamifar_2013_PAMI and coding Yu_2010_ICML ; Wang_CVPR_2010_LLC , to name a few.
Similar to the Euclidean case, the problem in (9) is solved by iterative optimization over $\mathbb{D}$ and the codes $\{\mathbf{y}_i\}$. Computing the sparse codes is done by solving (8). To update an atom $\mathcal{D}_r$, Vemuri_ICML_2013 proposed a gradient descent approach along geodesics. That is, the update of $\mathcal{D}_r$ at iteration $t$, while the codes and the remaining atoms are kept fixed, has the form

$$\mathcal{D}_r^{(t+1)} = \exp_{\mathcal{D}_r^{(t)}}\big( -\eta \, \mathbf{v} \big), \qquad (10)$$

where $\eta$ is a step size and the tangent vector $\mathbf{v}$ represents the direction of maximum ascent, i.e., the Riemannian gradient of the cost function at $\mathcal{D}_r^{(t)}$. (On an abstract Riemannian manifold $\mathcal{M}$, the gradient of a smooth real function $f$ at a point $x$, denoted by $\mathrm{grad} f(x)$, is the element of $T_x\mathcal{M}$ satisfying $\langle \mathrm{grad} f(x), \zeta \rangle_x = D f(x)[\zeta]$ for all $\zeta \in T_x\mathcal{M}$, where $D f(x)[\zeta]$ denotes the directional derivative of $f$ at $x$ in the direction of $\zeta$. The interested reader is referred to Absil:2008 for more details on how the gradient of a function on Grassmann manifolds can be computed.)
Here is where the difficulty arises. Since the logarithm map does not have a closed-form expression on Grassmann manifolds, an analytic expression for the gradient in Eq. (10) cannot be obtained for the case of interest in this work, i.e., Grassmann manifolds (this is acknowledged by Vemuri_ICML_2013 ). With this in mind, we propose extrinsic approaches to coding and dictionary learning specialized for Grassmann manifolds. Our proposal differs from the intrinsic method in the following respects:
As compared to the intrinsic approach, our extrinsic coding methods are noticeably faster. This is especially attractive for vision applications where the dimensionality of Grassmann manifolds is high.
Similar to the intrinsic method, our proposed dictionary learning approach is an alternating method. However, in contrast to the intrinsic method, the updating rule for dictionary atoms admits an analytic form.
Our proposed extrinsic methods can be kernelized. Such kernelization is not possible for the intrinsic method, due to the fact that the logarithm map does not have a closed-form, analytic expression on Grassmann manifolds. Kernelized coding enables us to better model non-linearity in data (think of samples that do not lie on a subspace in a low-dimensional space but could form one in a higher-, possibly infinite-, dimensional space). As shown in our experiments, kernelized coding can result in higher recognition accuracies as compared to linear coding.
5 Coding on Grassmann Manifolds
In this work, we propose to embed Grassmann manifolds into the space of symmetric matrices via the projection embedding Chikuse:2003 . The projection embedding has previously been used in subspace tracking Srivastava:2004 , clustering Cetingul:CVPR:2009 , discriminant analysis HAMM2008_ICML ; Harandi_CVPR_2011 and classification Vemula:CVPR:2013 . Let $\mathcal{P}(p,n)$ be the set of $n \times n$ idempotent and symmetric matrices of rank $p$. The projection embedding is the mapping $\Pi : \mathcal{G}(p,n) \to \mathcal{P}(p,n)$ given by $\Pi(\mathbf{X}) = \mathbf{X}\mathbf{X}^T$, where $\mathbf{X} \in \mathcal{V}(p,n)$ is an orthonormal basis of the subspace. The mapping $\Pi$ is a diffeomorphism Chikuse:2003 , and $\mathcal{P}(p,n)$ may be thought of as simply an alternative form of the Grassmann manifold. It is a smooth, compact submanifold of $\mathbb{R}^{n \times n}$ of dimension $p(n-p)$ Helmke:2007 .
From its embedding in $\mathbb{R}^{n \times n}$, the manifold $\mathcal{P}(p,n)$ inherits a Riemannian metric (and hence a notion of path length) from the Frobenius norm on $\mathbb{R}^{n \times n}$. It is an important fact that the mapping $\Pi$ is an isometry with respect to this Riemannian metric on $\mathcal{P}(p,n)$ and the standard Riemannian metric on $\mathcal{G}(p,n)$ defined in Section 2 Chikuse:2003 . Hence, $\Pi$ preserves the length of curves Harandi_2013_ICCV . The shortest path length between two points in $\mathcal{P}(p,n)$ defines a distance metric called the geodesic metric.
Working with $\mathcal{P}(p,n)$ instead of $\mathcal{G}(p,n)$ has the advantage that each element of $\mathcal{P}(p,n)$ is a single matrix, whereas elements of $\mathcal{G}(p,n)$ are equivalence classes of matrices. In other words, if $\mathbf{X}_1$ and $\mathbf{X}_2$ are two bases for $\mathcal{X}$, then $\mathbf{X}_1\mathbf{X}_1^T = \mathbf{X}_2\mathbf{X}_2^T$.
In what follows, we shall denote $\Pi(\mathbf{X})$ by $\widehat{\mathbf{X}}$, the hat representing the action of the projection embedding. Furthermore, $\langle \cdot, \cdot \rangle$ represents the Frobenius inner product: thus $\langle \widehat{\mathbf{X}}, \widehat{\mathbf{Y}} \rangle = \mathrm{Tr}\big(\mathbf{X}\mathbf{X}^T\mathbf{Y}\mathbf{Y}^T\big)$. Note that in computing $\langle \widehat{\mathbf{X}}, \widehat{\mathbf{Y}} \rangle$ it is not necessary to compute $\widehat{\mathbf{X}}$ and $\widehat{\mathbf{Y}}$ explicitly (they may be large $n \times n$ matrices). Instead, note that $\langle \widehat{\mathbf{X}}, \widehat{\mathbf{Y}} \rangle = \|\mathbf{X}^T\mathbf{Y}\|_F^2$. This is advantageous, since $\mathbf{X}^T\mathbf{Y}$ is a substantially smaller $p \times p$ matrix.
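A small numerical sketch (with helper names of our own choosing) of the embedding and of the shortcut $\langle \widehat{\mathbf{X}}, \widehat{\mathbf{Y}} \rangle = \|\mathbf{X}^T\mathbf{Y}\|_F^2$:

```python
import numpy as np

def proj_embed(X):
    """Projection embedding Pi(X) = X X^T: an n x n idempotent,
    symmetric matrix of rank p representing span(X)."""
    return X @ X.T

def frob_inner(X, Y):
    """<Pi(X), Pi(Y)>_F computed as ||X^T Y||_F^2, using only the
    small p x p matrix X^T Y instead of the n x n projections."""
    return np.linalg.norm(X.T @ Y, 'fro') ** 2

rng = np.random.default_rng(0)
X = np.linalg.qr(rng.standard_normal((100, 4)))[0]   # basis with n=100, p=4
Y = np.linalg.qr(rng.standard_normal((100, 4)))[0]
```

Here the shortcut manipulates only $4 \times 4$ matrices rather than $100 \times 100$ projections.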
Apart from the geodesic distance metric, an important metric used in this paper is the chordal metric, defined by

$$d_{\mathrm{chord}}(\mathcal{X}, \mathcal{Y}) = \frac{1}{\sqrt{2}} \big\| \widehat{\mathbf{X}} - \widehat{\mathbf{Y}} \big\|_F.$$

This metric will be used in the context of (6) to recast the coding, and consequently the dictionary-learning problem, in terms of the chordal distance. Before presenting our proposed methods, we establish an interesting link between coding and the notion of a weighted mean in a metric space.
Weighted Karcher mean.
The underlying concept of coding using a dictionary is to represent a point in a space of interest as a combination of other elements in that space. In the usual method of coding in $\mathbb{R}^d$ given by (3), each $\mathbf{x}$ is represented by a linear combination $\sum_j y_j \mathbf{d}_j$ of dictionary elements, and the first term of (3) represents the coding error. For coding in a manifold, the problem to address is that linear combinations do not make sense there. We wish to find some way in which an element may be represented in terms of dictionary elements, as suggested in (6); preferably, the method should be a direct generalization of the Euclidean case.
In $\mathbb{R}^d$, a different way to view the expression in (3) is as a weighted mean of the points $\mathbf{d}_j$. This observation relies on the following fact, which is verified using a Lagrange multiplier method.
Given coefficients $y_j$ with $\sum_{j=1}^{N} y_j = 1$, and dictionary elements $\mathbf{d}_j$ in $\mathbb{R}^d$, the point $\mathbf{x}^{*}$ that minimizes $\sum_{j=1}^{N} y_j \|\mathbf{x} - \mathbf{d}_j\|_2^2$ is given by $\mathbf{x}^{*} = \sum_{j=1}^{N} y_j \mathbf{d}_j$.
In other words, the affine combination of dictionary elements is equal to their weighted mean. Although linear combinations are not defined for points on manifolds or metric spaces, a weighted mean is.
Given points $\mathcal{X}_1, \ldots, \mathcal{X}_N$ on a Riemannian manifold $\mathcal{M}$, and weights $y_1, \ldots, y_N$, the point $\mathcal{X}^{*}$ that minimizes $\sum_{j=1}^{N} y_j \, d_{\mathrm{geo}}^2(\mathcal{X}, \mathcal{X}_j)$ is called the weighted Karcher mean of the points with weights $y_j$. Here, $d_{\mathrm{geo}}$ is the geodesic distance on $\mathcal{M}$.
Generally, finding the Karcher mean Karcher_1977 on a manifold involves an iterative procedure, which may converge to a local minimum, even on simple manifolds Manton_2004 ; Hartley_IJCV_13 . However, one may replace the geodesic metric with a different metric in order to simplify the calculation. To this end, we propose the chordal metric on the Grassmann manifold, defined for $\mathbf{X}$ and $\mathbf{Y}$ in $\mathcal{V}(p,n)$ by

$$d_{\mathrm{chord}}(\mathbf{X}, \mathbf{Y}) = \frac{1}{\sqrt{2}} \big\| \mathbf{X}\mathbf{X}^T - \mathbf{Y}\mathbf{Y}^T \big\|_F.$$
The corresponding mean, as in Definition 2 (but using the chordal metric), is called the weighted chordal mean of the points. In contrast to the Karcher mean, the weighted chordal mean on a Grassmann manifold has a simple closed form.
The weighted chordal mean of a set of points $\{\mathcal{X}_j\}_{j=1}^{N}$ with weights $\{y_j\}_{j=1}^{N}$ is equal to $\mathrm{proj}\big( \sum_{j=1}^{N} y_j \widehat{\mathbf{X}}_j \big)$, where $\mathrm{proj}(\cdot)$ maps a symmetric matrix to the closest point on $\mathcal{P}(p,n)$.
The projection $\mathrm{proj}(\cdot)$ has a closed-form solution in terms of the singular value decomposition. For proofs of these results, see the proofs of Theorems 10.1 and 10.2 in the appendix.
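A sketch of this closed form (our own naming; since the weighted sum is symmetric, its eigendecomposition plays the role of the SVD): the weighted chordal mean is obtained by keeping the top-$p$ eigenvectors of $\sum_j y_j \widehat{\mathbf{X}}_j$.

```python
import numpy as np

def weighted_chordal_mean(bases, weights):
    """Project the weighted Euclidean mean sum_j y_j X_j X_j^T back to
    the embedded Grassmannian: keep its top-p eigenvectors."""
    p = bases[0].shape[1]
    M = sum(w * X @ X.T for w, X in zip(weights, bases))
    # M is symmetric, so eigh gives its spectral decomposition,
    # eigenvalues in ascending order.
    _, V = np.linalg.eigh(M)
    return V[:, -p:]                     # orthonormal basis of the mean

rng = np.random.default_rng(0)
bases = [np.linalg.qr(rng.standard_normal((10, 3)))[0] for _ in range(4)]
weights = [0.1, 0.2, 0.3, 0.4]
M_bar = weighted_chordal_mean(bases, weights)
```

A sanity check: averaging several copies of the same subspace must return that subspace (compare projection matrices, since bases are only unique up to rotation).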
The chordal metric on the Grassmann manifold is not a geodesic metric (that is, it is not equal to the length of a shortest geodesic under the Riemannian metric). However, it is closely related. In fact, one may easily show that for $\mathcal{X}$ and $\mathcal{Y}$ with principal angles $\theta_1, \ldots, \theta_p$,

$$d_{\mathrm{chord}}(\mathcal{X}, \mathcal{Y}) = \Big( \sum_{i=1}^{p} \sin^2 \theta_i \Big)^{1/2} \;\leq\; \Big( \sum_{i=1}^{p} \theta_i^2 \Big)^{1/2} = d_{\mathrm{geo}}(\mathcal{X}, \mathcal{Y}).$$

Furthermore, the path-metric Hartley_IJCV_13 induced by $d_{\mathrm{chord}}$ is equal to the geodesic distance.
5.1 Sparse Coding
Given a dictionary $\mathbb{D} = \{\mathcal{D}_j\}_{j=1}^{N}$ with atoms $\mathcal{D}_j \in \mathcal{G}(p,n)$ and a query sample $\mathcal{X} \in \mathcal{G}(p,n)$, the problem of sparse coding can be recast extrinsically as (see Fig. 2 for a conceptual illustration):

$$\ell(\mathcal{X}, \mathbb{D}) = \min_{\mathbf{y}} \Big\| \widehat{\mathbf{X}} - \sum_{j=1}^{N} y_j \widehat{\mathbf{D}}_j \Big\|_F^2 + \lambda \|\mathbf{y}\|_1. \qquad (14)$$

The formulation here varies slightly from the general form given in (6), in that the point $\sum_j y_j \widehat{\mathbf{D}}_j$ does not lie exactly on the manifold $\mathcal{P}(p,n)$, since it is neither idempotent nor necessarily of rank $p$. We call this solution an extrinsic solution; the point coded by the dictionary is allowed to step out of the manifold. There is no reason to see this as a major flaw, as will be discussed in Section 5.4.
Expanding the Frobenius norm term in (14) results in a convex function in $\mathbf{y}$:

$$\Big\| \widehat{\mathbf{X}} - \sum_{j} y_j \widehat{\mathbf{D}}_j \Big\|_F^2 = \langle \widehat{\mathbf{X}}, \widehat{\mathbf{X}} \rangle - 2 \sum_{j} y_j \big\langle \widehat{\mathbf{X}}, \widehat{\mathbf{D}}_j \big\rangle + \sum_{i,j} y_i y_j \big\langle \widehat{\mathbf{D}}_i, \widehat{\mathbf{D}}_j \big\rangle.$$

The sparse codes can therefore be obtained without explicitly embedding the manifold into $\mathcal{P}(p,n)$ via $\Pi(\cdot)$. This can be seen by defining $\mathbf{k} \in \mathbb{R}^{N}$ as an $N$-dimensional vector storing the similarity between the signal and the dictionary atoms in the induced space, $[\mathbf{k}]_j = \|\mathbf{X}^T \mathbf{D}_j\|_F^2$, and $\mathbf{K} \in \mathbb{R}^{N \times N}$ as a symmetric matrix encoding the similarities between dictionary atoms, $[\mathbf{K}]_{i,j} = \|\mathbf{D}_i^T \mathbf{D}_j\|_F^2$ (which can be computed offline). Then, the sparse coding in (14) can be written as:

$$\min_{\mathbf{y}} \; \mathbf{y}^T \mathbf{K} \mathbf{y} - 2 \mathbf{k}^T \mathbf{y} + \lambda \|\mathbf{y}\|_1. \qquad (15)$$
The symmetric matrix $\mathbf{K}$ is positive semidefinite since for all $\mathbf{v} \in \mathbb{R}^N$:

$$\mathbf{v}^T \mathbf{K} \mathbf{v} = \sum_{i,j} v_i v_j \big\langle \widehat{\mathbf{D}}_i, \widehat{\mathbf{D}}_j \big\rangle = \Big\| \sum_{i} v_i \widehat{\mathbf{D}}_i \Big\|_F^2 \geq 0.$$

Therefore, the problem is convex and can be efficiently solved using common packages like CVX CVX1 ; CVX2 or SPAMS Mairal_JMLR_2010 . The problem in (15) can also be transposed into a vectorized sparse coding problem. More specifically, let $\mathbf{K} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{U}^T$ be the SVD of $\mathbf{K}$. Then (15) is equivalent, up to a constant independent of $\mathbf{y}$, to

$$\min_{\mathbf{y}} \; \big\| \boldsymbol{\Sigma}^{-1/2} \mathbf{U}^T \mathbf{k} - \boldsymbol{\Sigma}^{1/2} \mathbf{U}^T \mathbf{y} \big\|_2^2 + \lambda \|\mathbf{y}\|_1.$$
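A minimal sketch of solving (15) with our own proximal-gradient (ISTA) routine rather than CVX or SPAMS: build $\mathbf{k}$ and $\mathbf{K}$ from the small matrices $\mathbf{X}^T\mathbf{D}_j$, then iterate gradient and shrinkage steps on the quadratic objective.

```python
import numpy as np

def grassmann_kernels(atoms, X):
    """[k]_j = ||X^T D_j||_F^2 and [K]_{ij} = ||D_i^T D_j||_F^2."""
    k = np.array([np.linalg.norm(X.T @ D, 'fro') ** 2 for D in atoms])
    K = np.array([[np.linalg.norm(Di.T @ Dj, 'fro') ** 2
                   for Dj in atoms] for Di in atoms])
    return k, K

def grassmann_sparse_code(atoms, X, lam, n_iter=500):
    """ISTA on y^T K y - 2 k^T y + lam*||y||_1, i.e., objective (15)."""
    k, K = grassmann_kernels(atoms, X)
    step = 1.0 / (2 * np.linalg.norm(K, 2))      # 1 / Lipschitz constant
    y = np.zeros(len(atoms))
    for _ in range(n_iter):
        z = y - 2 * (K @ y - k) * step           # gradient step
        y = np.sign(z) * np.maximum(np.abs(z) - lam * step, 0.0)
    return y

rng = np.random.default_rng(0)
atoms = [np.linalg.qr(rng.standard_normal((8, 2)))[0] for _ in range(5)]
y = grassmann_sparse_code(atoms, atoms[2], lam=0.5)   # query = atom 2
```

Since the query coincides with the third atom, its coefficient should dominate the code, and the achieved objective can never exceed the value at $\mathbf{y} = \mathbf{0}$.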
A special case is sparse coding on the Grassmann manifold $\mathcal{G}(1,n)$, which can be seen as a problem on the $(n-1)$-dimensional unit sphere, albeit with a subtle difference. More specifically, unlike conventional sparse coding in vector spaces, $\widehat{\mathbf{x}} = \widehat{(-\mathbf{x})}$, which results in antipodal points being equivalent. For this special case, the solution proposed in (14) can be understood as sparse coding in the higher-dimensional quadratic space, i.e., the space of matrices $\mathbf{x}\mathbf{x}^T$. We note that in the quadratic space, $\langle \widehat{\mathbf{x}}, \widehat{\mathbf{y}} \rangle = (\mathbf{x}^T\mathbf{y})^2$ and $\|\widehat{\mathbf{x}} - \widehat{\mathbf{y}}\|_F^2 = 2 - 2(\mathbf{x}^T\mathbf{y})^2$ for unit vectors $\mathbf{x}$ and $\mathbf{y}$.
5.2 Locality-Constrained Coding
Several studies favor locality in the coding process, as locality can lead to sparsity but not necessarily vice versa Roweis_2000_LLE ; Yu:NIPS:2009 ; Wang_CVPR_2010_LLC . In what follows, we describe a coding scheme based on neighborhood information. We show that codes with locality constraints can be obtained in closed form, which avoids the convex optimization problems required for sparse coding. However, there is no free lunch here, since the new algorithm requires a new parameter, namely the number of nearest neighbors, to generate the codes.
In vector spaces, a fast version of Locality-constrained Linear Coding (LLC), as proposed by Wang_CVPR_2010_LLC , is described by:

$$\min_{\mathbf{y}} \; \| \mathbf{x} - \mathbf{B}\mathbf{y} \|_2^2, \quad \text{s.t.} \;\; \mathbf{1}^T \mathbf{y} = 1. \qquad (17)$$

In (17), $\mathbf{x} \in \mathbb{R}^d$ is the query, $\mathbf{B} \in \mathbb{R}^{d \times k}$ is a local basis obtained by simply stacking the $k$ nearest neighbors of $\mathbf{x}$ from a global dictionary, and $\mathbf{y}$ is the $k$-dimensional LLC vector. Recasting the LLC problem depicted in (17) to Grassmann manifolds using the mapping $\Pi(\cdot)$, we obtain:

$$\min_{\mathbf{y}} \; \Big\| \widehat{\mathbf{X}} - \sum_{j=1}^{k} y_j \widehat{\mathbf{B}}_j \Big\|_F^2, \quad \text{s.t.} \;\; \mathbf{1}^T \mathbf{y} = 1.$$

Observing the constraint $\mathbf{1}^T \mathbf{y} = 1$, we may write

$$\Big\| \widehat{\mathbf{X}} - \sum_{j=1}^{k} y_j \widehat{\mathbf{B}}_j \Big\|_F^2 = \Big\| \sum_{j=1}^{k} y_j \big( \widehat{\mathbf{X}} - \widehat{\mathbf{B}}_j \big) \Big\|_F^2 = \mathbf{y}^T \mathbf{G} \mathbf{y},$$

where the elements of the matrix $\mathbf{G}$ are

$$[\mathbf{G}]_{i,j} = \big\langle \widehat{\mathbf{X}} - \widehat{\mathbf{B}}_i, \; \widehat{\mathbf{X}} - \widehat{\mathbf{B}}_j \big\rangle.$$
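The resulting code can be sketched in closed form (function names are ours; the small $\epsilon$ regularizer guarding against a singular $\mathbf{G}$ is our assumption, in the spirit of Wang_CVPR_2010_LLC): solve $(\mathbf{G} + \epsilon \mathbf{I})\mathbf{y} = \mathbf{1}$ and rescale so that $\mathbf{1}^T\mathbf{y} = 1$.

```python
import numpy as np

def frob_inner(A, B):
    """<Pi(A), Pi(B)>_F = ||A^T B||_F^2."""
    return np.linalg.norm(A.T @ B, 'fro') ** 2

def grassmann_llc(neighbors, X, eps=1e-4):
    """Locality-constrained code of span(X) over its k nearest atoms.
    G[i, j] = <Pi(X) - Pi(B_i), Pi(X) - Pi(B_j)>, expanded with the
    Frobenius shortcut so no n x n matrix is ever formed."""
    p = X.shape[1]                       # <Pi(X), Pi(X)> = p
    k = len(neighbors)
    G = np.empty((k, k))
    for i, Bi in enumerate(neighbors):
        for j, Bj in enumerate(neighbors):
            G[i, j] = (p - frob_inner(X, Bi) - frob_inner(X, Bj)
                       + frob_inner(Bi, Bj))
    y = np.linalg.solve(G + eps * np.eye(k), np.ones(k))
    return y / y.sum()                   # enforce the affine constraint

rng = np.random.default_rng(0)
neighbors = [np.linalg.qr(rng.standard_normal((10, 2)))[0] for _ in range(4)]
X = np.linalg.qr(rng.standard_normal((10, 2)))[0]
y = grassmann_llc(neighbors, X)
```

Note that only one linear solve of size $k \times k$ is needed, in contrast to the iterative solvers required for the $\ell_1$-penalized formulation.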
A similar, albeit intrinsic, formulation for the purpose of nonlinear embedding of Riemannian manifolds was developed by Goh_CVPR_2008 . Aside from the different purpose (coding versus embedding), gLC can exploit an additional codebook learning step (as explained in § 6), whereas dictionary learning based on the intrinsic formulation has no analytic solution.
5.3 Classification Based on Coding
If the atoms in the dictionary are not labeled (e.g., if the dictionary is learned), the generated codes can be fed to a Euclidean-based classifier such as a support vector machine Shawe-Taylor:2004:KMP for classification. Inspired by the Sparse Representation Classifier (SRC) Wright:PAMI:2009 , when the atoms in the sparse dictionary are labeled, the generated codes of the query sample can be directly used for classification. In doing so, let

$$\big[ \mathbf{y}^{(c)} \big]_j = \delta(l_j - c) \, y_j$$

be the class-specific sparse codes, where $l_j$ is the class label of atom $j$ and $\delta(\cdot)$ is the discrete Dirac function. An efficient way of utilizing class-specific sparse codes is through computing residual errors. In this case, the residual error of the query sample $\mathcal{X}$ for class $c$ is defined as:

$$\varepsilon_c = \Big\| \widehat{\mathbf{X}} - \sum_{j=1}^{N} \big[ \mathbf{y}^{(c)} \big]_j \widehat{\mathbf{D}}_j \Big\|_F^2. \qquad (21)$$
Alternatively, the similarity between the query sample and class $c$ can be defined as $s_c = h\big( \mathbf{y}^{(c)} \big)$, where the function $h(\cdot)$ could be a linear function of the class-specific coefficients or even a non-linear one. Preliminary experiments suggest that Eq. (21) leads to higher classification accuracies when compared to these alternatives.
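A sketch of the residual-based rule (naming is ours): for each class, zero out the coefficients of atoms from other classes and measure the reconstruction error in the induced space, again via the kernel shortcut so only $p \times p$ products are formed.

```python
import numpy as np

def classify_by_residual(atoms, labels, X, y):
    """Return the class whose class-specific codes best reconstruct
    Pi(X): eps_c = p - 2*y_c^T k + y_c^T K y_c, with k, K as in (15)."""
    def sim(A, B):
        return np.linalg.norm(A.T @ B, 'fro') ** 2
    p = X.shape[1]
    k = np.array([sim(D, X) for D in atoms])
    K = np.array([[sim(Di, Dj) for Dj in atoms] for Di in atoms])
    labels = np.array(labels)
    best_class, best_res = None, np.inf
    for c in np.unique(labels):
        yc = np.where(labels == c, y, 0.0)       # delta-masked codes
        res = p - 2 * yc @ k + yc @ K @ yc       # squared residual (21)
        if res < best_res:
            best_class, best_res = int(c), res
    return best_class

rng = np.random.default_rng(0)
atoms = [np.linalg.qr(rng.standard_normal((8, 2)))[0] for _ in range(4)]
labels = [0, 0, 1, 1]
y = np.array([0.0, 0.0, 1.0, 0.0])    # suppose coding selected atom 2
pred = classify_by_residual(atoms, labels, atoms[2], y)
```

With the query equal to atom 2, the class-1 residual vanishes while the class-0 residual equals $p$, so the predicted label is 1.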
5.4 Extrinsic Nature of the Solution
The solutions proposed in this section (e.g., (14)) apply to points in $\mathcal{P}(p,n)$ and solve coding extrinsically, meaning the reconstruction $\sum_j y_j \widehat{\mathbf{D}}_j$ is not necessarily a point on $\mathcal{P}(p,n)$. If, however, it is required that the linear combination of elements actually be used to represent a point on the manifold, then this point can be found by projecting to the closest point on $\mathcal{P}(p,n)$ using Theorem 10.2 (see appendix). As shown there, the resulting point is the weighted chordal mean of the dictionary atoms with coefficients $y_j$.
Furthermore, the proposed methods follow the general principle of coding in that, thanks to the over-completeness of $\mathbb{D}$, the reconstruction will approximate $\widehat{\mathbf{X}}$ closely and can be expected to lie close to a point of the embedded Grassmann manifold. An intrinsic version of (14) can be written as:

$$\min_{\mathbf{y}} \; \Big\| \widehat{\mathbf{X}} - \mathrm{proj}\Big( \sum_{j=1}^{N} y_j \widehat{\mathbf{D}}_j \Big) \Big\|_F^2 + \lambda \|\mathbf{y}\|_1, \qquad (22)$$

where $\mathrm{proj}\big( \sum_j y_j \widehat{\mathbf{D}}_j \big)$ is the weighted chordal mean of the dictionary atoms, as shown by Theorem 10.2. This formulation is precisely of the form (6), where the encoding function is the weighted chordal mean. The involvement of the SVD makes solving (22) challenging. While seeking efficient ways of solving (22) is interesting, it is beyond the scope of this work. The coding errors given by (22) and (14) will normally be very close, making (14) an efficient compromise solution.
6 Dictionary Learning
Given a finite set of observations $\mathbb{X} = \{\mathcal{X}_i\}_{i=1}^{m}$, the problem of dictionary learning on Grassmann manifolds is defined as minimizing the following cost function:

$$\min_{\mathbb{D}} \; \sum_{i=1}^{m} \ell\big( \mathcal{X}_i, \mathbb{D} \big),$$

with $\mathbb{D} = \{\mathcal{D}_j\}_{j=1}^{N}$ being a dictionary of size $N$. Here, $\ell(\cdot, \cdot)$ is a loss function and should be small if $\mathbb{D}$ is “good” at representing $\mathcal{X}_i$. In the following text, we elaborate on how a Grassmann dictionary can be learned.
Aiming for sparsity, the $\ell_1$-norm regularization is usually employed to obtain the most common form of $\ell(\cdot, \cdot)$, as depicted in Eq. (14). With this choice, the problem of dictionary learning on Grassmann manifolds can be written as:

$$\min_{\mathbb{D}, \{\mathbf{y}_i\}} \; \sum_{i=1}^{m} \Big\| \widehat{\mathbf{X}}_i - \sum_{j=1}^{N} [\mathbf{y}_i]_j \widehat{\mathbf{D}}_j \Big\|_F^2 + \lambda \|\mathbf{y}_i\|_1.$$