1 Introduction
In the past years, the subspace clustering or segmentation has attracted great interest in computer vision, pattern recognition and signal processing
[1, 2, 3]. The basic idea of subspace clustering is based on the fact that most data often have intrinsic subspace structures and can be regarded as the samples of a mixture of multiple subspaces. Thus the main goal of subspace clustering is to group data into different clusters, data points in each of which justly come from one subspace. To investigate and represent the underlying subspace structure, many subspace methods have been proposed, such as the conventional iterative methods [4, 5], the statistical methods [6, 7], the factorizationbased algebraic approaches [8, 9, 10], and the spectral clusteringbased methods
[11, 12, 3, 13, 14, 15, 16]. And they have been successfully applied in many scenarios, such as image representation [10], motion segement[8], face classification [13] and saliency detection [16], etc.Among all subspace clustering methods aforementioned, the spectral clustering methods based on affinity matrix are considered having good prospects
[3], in which an affinity matrix is firstly learned from the given data and then the final clustering results are obtained by spectral clustering algorithms such as Kmeans or Normalized Cuts (NCut)
[17]. The main component of the spectral clustering methods is to construct a proper affinity matrix for different data. In the typical method, Sparse Subspace Clustering (SSC) [3], one assumes that the data of subspaces are independent and are sparsely represented under the socalled Subspace Detection Property [18], in which the withinclass affinities are sparse and the betweenclass affinities are all zeros. It has been proved that under certain conditions the multiple subspace structures can be exactly recovered via minimization [19]. In most of current sparse subspace methods, one mainly focuses on independent sparse representation for data objects.However, the relation among data objects or the underlying structure of subspaces that generate the subsets of data to be grouped is usually not well considered, while these intrinsic properties are very important for clustering applications. So some researchers explore these intrinsic properties and relations among data objects and then revise the sparse representation model to represent these properties by introducing extra constraints, such as Label Consistent [20], Sequential property [21], Low rank constraint [14] and its Laplace regularization [22], etc. In these constraints, the holistic constraints such as the low rank or nuclear norm are proposed in favour of structural sparsity. The Low Rank Representation (LRR) model [23]
is one of representatives. The LRR model tries to reveal the latent sparse property embedded in a data set in high dimensional space. It has been proved that, when the highdimensional data set is actually from a union of several low dimension subspaces, the LRR model can reveal this structure through subspace clustering
[23].Although most current subspace clustering methods show good performance in various applications, the similarity among data objects is measured in the original data domain. For example, the current LRR method is based on the principle of data self representation and the representation error is measured in terms of Euclidean alike distance. However, this hypothesis may not be always true for many highdimensional data in practice where data may not reside in a linear space. In fact, it has been proved that many highdimensional data are embedded in low dimensional manifolds. For example, the human face images are considered as samples from a nonlinear submanifold [24]. It is desired to reveal the nonlinear manifold structure underlying these highdimensional data.
There are two types of manifold related learning tasks. In the socalled manifold learning, one has to respect the local geometry existed in the data but unknown to learners. The classic representative algorithms for manifold learning include LLE (Locally Linear Embedding) [25], ISOMAP [26], LLP (Locally Linear Projection) [27], LE (Laplacian Embedding) [28] and LTSA (Local Tangent Space Alignment) [29]. In the case of the other type of learning tasks, we clearly know manifolds where the data come from. For example, in image analysis, people usually use covariance matrices of features as a region descriptor [30]. In this case, one must respect the fact that the descriptor is a point on the manifold of symmetrical positive definite matrices. In dealing with data from a known manifold, one powerful way is to use a nonlinear mapping to ”flat” the data, like kernel methods. In computer vision, it is common to collect data on the socalled Grassmann manifold [31]. In these cases, the properties of the manifold is known, thus how to incorporate the manifold properties for some practical tasks is a challenging work. This type of tasks incorporating manifold properties in learning is called learning on manifolds.
In this paper, we explore the LRR model to be used for clustering a set of data objects on Grassmann manifold. The intrinsic characteristics and geometry properties of Grassmann manifold will be exploited in algorithm design of LRR learning. Grassmann manifold has a nice property that it can be embedded into the linear space of symmetric matrices. By this way, all the abstract points (subspaces) on Grassmann manifold can be embedded into a Euclidean space where the classic LRR model can be applied. Then an LRR model can be constructed in the embedding space, where the error measure is simply taken as the Euclidean metric. This idea can also be seen in the recent work [32] for computer vision tasks.
The contributions of this work are listed as follows:

Reviewing and extending the LRR model on Grassmann Manifold introduced in our conference paper [33];

Giving the solutions and practical algorithms to the problems of the extended Grassmann LRR model under different noise models, particularly defined by Frobenius norm and norm;

Presenting a new kernelized LRR model on Grassmann manifold.
The rest of the paper is organized as follows. In Section 2, we review some related works. In Section 3, the proposed LRR on Grassmann Manifold (GLRR) is described and the solutions to the GLRR models with different noises assumptions are given in detail. In Section 4, we introduce a general framework for the LRR model on Grassmann manifold from the kernelization point of view. In Section 5, the performance of the proposed methods is evaluated on clustering problems with several public databases. Finally, conclusions and suggestions for future work are provided in Section 6.
2 Related Works
In this section, we briefly review the existing sparse subspace clustering methods including the classic Sparse Subspace Clustering (SSC) and the Low Rank Representation (LRR) and summarize the properties of Grassmann manifold that are related to the work presented in this paper.
2.1 Sparse Subspace Clustering (SSC)
Given a set of data drawn from a union of unknown subspaces, the task of subspace clustering is to find the number of subspaces and their dimensions and the bases, and then segment the data set according to the subspaces. In recent years, sparse representation has been applied to subspace clustering, and the proposed Sparse Subspace Clustering (SSC) aims to find the sparsest representation for the data set using approximation [2]. The general SSC can be formulated as the follows:
(1) 
where is a set of signals in dimension and is the correspondent sparse representation of under the dictionary , and represents the observation noise or the error between the signals and its reconstructed values, which is measured by norm , particularly in terms of Euclidean norm, i.e., (or ) denoting the Frobenius norm to deal with the Gaussian noise, or (Laplacian noise) to deal with the random gross corruptions or to deal with the samplespecific corruptions. Finally is a penalty parameter to balance the sparse term and the reconstruction error.
In the above sparse model, it is critical to use an appropriate dictionary to represent signals. Generally, a dictionary can be learned from some training data by using one of many dictionary learning methods, such as the KSVD method [34]. However, a dictionary learning procedure is usually timeconsuming and so should be done in an offline manner. So many researchers adopt a simple and direct way to use the original signals themselves as the dictionary, which is known as the selfexpressiveness property [3] to find subspaces, i.e. each data point in a union of subspaces can be efficiently reconstructed by a linear combination of other points in dataset. More specifically, every point in the dataset can be represented as a sparse linear combinations of other points from the same subspace. Mathematically we write this sparse formulation as
(2) 
From these sparse representations an affinity matrix Z is compiled. This affinity matrix is interpreted as a graph upon which a clustering algorithm such as Normalized Cuts (NCut)[17] is applied for final segmentation. This is the typical approach of modern subspace clustering techniques.
2.2 LowRank Representation (LRR)
The LRR can be regarded as one special type of sparse representation, in which rather than compute the sparsest representation of each data point individually, the global structure of the data is incorporeally computed by the lowest rank representation of a set of data points. The low rank measurement has long been utilized in matrix completion from corrupted or missing data [35, 36]. Specifically for clustering applications, it has been proved that, when a highdimensional data set is actually composed of data from a union of several low dimension subspaces, LRR model can reveal this structure through subspace clustering [23]
. It is also proved that LRR has good clustering performance in dealing with the challenges in subspace clustering, such as the unclean data corrupted by noise or outliers, no prior knowledge of the subspace parameters, and lacking of theoretical guarantees for the optimality of the method
[14, 37, 16]. The general LRR model can be formulated as the following optimization problem:(3) 
where is the low rank representationa of the data set by itself. Here the low rank constraint is achieved by approximating rank with the nuclear norm
, which is defined as the sum of singular values of a matrix and is the low envelop of the rank function of matrices
[38].Although the current LRR method has good performance in subspace clustering, it relies on Euclidean distance for measuring the similarity of the raw data. However, this measurement is not suitable to highdimensional data with embedding low manifold structure. To characterize the local geometry of data on an unknown manifold, the LapLRR method [22] uses the graph Laplacian matrix derived from the data objects as a regularized term for the LRR model to represent the nonlinear structure of high dimensional data, while the reconstruction error of the revised model is still computed in Euclidean space.
2.3 Grassmann Manifold
This paper is concerned with the points particularly on a known manifold. Generally manifolds can be considered as low dimensional smooth ”surfaces” embedded in a higher dimensional Euclidean space. At each point of the manifold, manifold is locally similar to Euclidean space. In recent years, Grassmann manifold has attracted great interest in the computer vision research community. Although Grassmann manifold itself is an abstract manifold, it can be well represented as a matrix quotient manifold and its Riemannian geometry has been investigated for algorithmic computation [39].
Grassmann manifold has a nice property that it can be embedded into the space of symmetric matrices via the projection embedding, referring to Section 3.1 below. This property was used in subspace analysis, learning and representation[40, 41, 42]. The sparse coding and dictionary learning within the space of symmetric positive definite matrices have been investigated by using kerneling method [43]. For clustering applications, the mean shift method was discussed on Stiefel and Grassmann manifolds in [44]. Recently, a new version of Kmeans method was proposed to cluster Grassmann points, which is constructed by a statistical modeling method[45]. These works try to expand the clustering methods within Euclidean space to more practical situations on nonlinear spaces. Along with this direction, we further explore the subspace clustering problems on Grassmann manifold and try to establish a novel and feasible LRR model on Grassmann manifold.
3 LRR on Grassmann Manifolds
3.1 LRR on Grassmann Manifolds
In most of cases, the reconstruction error of LRR model in (3) is computed in the original data domain. For example, the common form of the reconstruction error is Frobenius norm in original data space, i.e. the error term can be chosen as . In practice, many high dimension data have their intrinsic manifold structures. For example, it has been proved that human faces in images have an underlying manifold structure [46]. In an ideal scenario, the error should be measured according to the manifold geometry. So we consider signal representation for the data with manifold structure and employ an error measurement in LRR model based on the distance defined on manifold spaces.
However the linear relation defined by is no longer valid on a manifold. One way to get around this difficulty is to use the log map on a manifold to lift points (data) on a manifold onto the tangent space at a data point. This idea has been applied for clustering and dimensionality reduction on manifold in [47].
However when the underlying manifold is Grassmannian, we can use the distance over its embedded space to replace the manifold distance and the linear relation can be implemented in its embedding Euclidean space naturally, as detailed below.
Grassmann manifold [48] is the space of all dimensional linear subspaces of for . A point on Grassmann manifold is a dimensional subspace of which can be represented by any of orthonormal basis . The chosen orthonormal basis is called a representative of a subspace . Grassmann manifold has onetoone correspondence to a quotient manifold of , see [48]. On the other hand, we can embed Grassmann manifold into the space of symmetric matrices by the following mapping, see [43],
(4) 
The embedding is diffeomorphism [49] (a onetoone, continuous, differentiable mapping with a continuous, differentiable inverse). Then it is reasonable to replace the distance on Grassmann manifold by the following distance defined on the symmetric matrix space under this mapping,
(5) 
3.1.1 LRR on Grassmann Manifold with Gaussian Noise (GLRRF) [33]
Given a set of data points on Grassmann manifold, i.e., a set of subspaces of dimension accordingly, we have their mapped symmetric matrices . Similar to the LRR model in (3), we represent these symmetric matrices by itself and use the error measurement defined in (5) to construct the LRR model on Grassmann manifold as follows:
(6) 
where is a 3order tensor by stacking all mapped symmetric matrices , along the 3rd mode, is the error tensor and means the mode3 multiplication of a tensor and a matrix, see [50]. The representation of and the 3order product operation are illustrated in Fig. 1.
The use of the Frobenius norm in (6) makes an assumption that the model fits to Gaussian noise. We call this model the Frobenius norm constrained GLRR (GLRRF). In this case, we have
(7) 
where is the th slice of , which represents the distance between the symmetric matrix and its reconstruction .
3.1.2 LRR on Grassmann Manifold with Noise (GLRR21)
When there exist outliers in the data set, the Gaussian noise model is no longer a favoured choice. Instead we propose using the socalled noise model. For example, in LRR clustering applications [23], [14], is used to cope with columnwise gross errors in signals. In a similar fashion, we formulate the following norm constrained GLRR model (GLRR21),
(8) 
where the norm of a tensor is defined as the sum of the Frobenius norm of the 3mode slices as the following form:
(9) 
3.2 Algorithms for LRR on Grassmann Manifold
The GLRR models in (6) and (8) present two typical optimization problems. In this subsection, we propose appropriate algorithms to solve them.
The GLLRF model was proposed in our earlier ACCV paper [33] where an algorithm based on ADMM was proposed. In this paper, we provide an even fast closed form solution for (6) and further investigate the structure of tensor used in these models for a practical solution for (8).
Intuitively, the tensor calculation can be converted to matrix operation by tensorial matricization, see [50]. For example, we can matricize the tensor in mode3 and obtain a matrix of data points (in rows). So it seems that the problem has been solved using the method of the standard LRR model. However, as the dimension is often too large in practical problems, the existing LRR algorithm could break down. To avoid this scenario, we carefully analyze the representation of the construction tensor error terms and convert the optimization problems to its equivalent and readily solvable optimization model. In the following two subsections, we will give the detail of these solutions.
3.2.1 Algorithm for the Frobenius Norm Constrained GLRR Model
We follow the notation used in [33]. By using variable elimination, we can convert problem (6) into the following problem
(10) 
We note that has a small dimension which is easy to handle. To simplify expression of the objective function (6), we denote
(11) 
Clearly . Define an symmetric matrix
(12) 
Then we have the following Lemma.
Lemma 1.
Given a set of matrices , if with element , then the matrix is semipositive definite.
Proof.
Define a matrix . Then it is easy to show that
So is a semipositive definite matrix. ∎
Based on the conclusion from Lemma 1, we have the eigenvector decomposition for
defined bywhere and
with nonnegative eigenvalues
. Define the square root of bythen it is not hard to prove that problem (10) is equivalent to the following problem
(13) 
Finally we have
Theorem 2.
Given that as defined above, the solution to (13) is given by
where is a diagonal matrix with its th element defined by
Proof.
Please refer to the proof of Lemma 1 in [15]. ∎
3.2.2 Algorithm for the Norm Constrained GLRR Model
Now we turn to the GLRR12 problem (8). Because the existence of norm in error measure, the objective function is not differentiable but convex. We propose using the alternating direction method (ADM) method to solve this problem.
Firstly, we construct the following augmented Lagrangian function:
(14) 
where is the standard inner product of two tensors in the same order, is the Lagrange multiplier, and is the penalty parameter.
Then ADM is used to decompose the minimization of w.r.t and simultaneously into two subproblems w.r.t and , respectively. More specifically, the iteration of ADM goes as follows:
(15)  
(16)  
(17) 
where we have used an adaptive parameter . The adaptive rule will be specified later in Algorithm 1.
The above ADM is appealing only if we can find closed form solutions for the subproblems (15) and (16).
First we consider problem (15). Denote and for any 3order tensor we use to denote the th slice along the 3mode as a shorten notation. Then we observe that (15) is separable in terms of matrix variable as follows:
(18)  
From Lemma 3.2 in [23], we know that the problem in (18) has a closed form solution, given by
(19) 
where .
Denoting by
problem (16) becomes
(20) 
We adopt the linearization method to solve the above problem. For this purpose, we need to compute w.r.t. . To do so, we firstly utilize the matrices in each slice to compute the tensor operation in the definition of . For the th slice of the first term in , we have
Define a new matrix by
then the first term in has the following representation:
(21) 
For the th slice of the second term of , we have
Denoting a matrix by
and noting (12), we will have
(22)  
Combining (21) and (22), we have
Thus we have
Finally we can use the following linearized proximity approximation to replace (20) as follows
(23) 
with a constant where is the matrix norm of the third mode matricization of the tensor . The new problem (23) has a closed form solution given by, see [51],
(24) 
where is the SVD of and is the Singular Value Thresholding (SVT) operator defined by
Finally the procedure of solving the norm constrained GLRR problem (8) is summarized in Algorithm 1. For the purpose of the selfcompletion of the paper, we borrow the convergence analysis for Algorithm 1 from [52] without proof.
Theorem 3.
If is nondecreasing and upper bounded, , then the sequence generated by Algorithm 1 converges to a KKT point of problem (8).
4 Kernelized LRR on Grassmann Manifold
4.1 Kernels on Grassmann Manifold
In this section, we consider the kernelization of the GLRRF model. In fact, the LRR model on Grassman manifold (6) can be regarded a kernelized LRR with a kernel feature mapping defined by (4). It is not surprised that is semidefinite positive as it serves as a kernel matrix. It is natural to further generalize the GLRRF based on kernel functions on Grassmann manifold.
There are a number of kernel functions proposed in recent years in computer vision and machine learning communities, see
[53, 42, 32, 54]. For simplicity, we focus on the following kernels:1. The Projection Kernel: This kernel is defined in [42]. For any two Grassmann points and , the kernel value is
The feature mapping of the kernel is actually the mapping defined in (4).
2. Canonical Correlation Kernel: Referring to [42], this kernel is based on the cosine values of the socalled principal angle between two subspaces defined as follows
We can use the largest canonical correlation value (the cosine of the first principal angle) as the kernel value as done in [55], i.e.,
The cosine of principal angles of two subspaces can be calculated by using SVD as discussed in [56], see Theorem 2.1 there.
Consider two subspaces and as two Grassmann points where and are given bases. If we take the following SVD
then the values on the diagonal matrix are the cosine values of all the principal angles. The kernel uses partial information regarding the two subspaces. To increase its performance in our LRR, in this paper, we use the sum of all the diagonal values of as the kernel value between and . We still call this revised version the canonical correlation kernel.
4.2 Kernelized LRR on Grassmann Manifold
Let be any kernel function on Grassmann manifold. According to the kernel theory [57], there exists a feature mapping such that
where is the relevant feature space under the given kernel .
Give a set of points on Grassmann manifold , we define the following LRR model
(25) 
We call the above model the Kernelized LRR on Grassman manifold, denoted by KGLRR, and KGLRRcc, KGLRRproj for and respectively. However, for KGLRRproj, the above model (25) becomes the LRR model on Grassmann manifold (10).
Denote by the kernel matrix over all the data points ’s. By using the similar derivation in [33], we can prove that the model (25) is equivalent to
which is equivalent to
(26) 
where is the square root matrix of the kernel matrix . So the Kernelized model KGLRRproj is similar to GLRRF model in Section 3.
It has been proved that using multiple kernel functions may obtain improving performance in many application scenarios [58], due to the virtues of different kernel functions for the complex data. So in practice, we can employ different kernel functions to implement the model in (25), even we can adopt a combined kernel function. For example, in our experiments, we use a combination of the above two kernel functions and as follows.
where is the hand assigned combination coefficient. We denote the Kernelized LRR model of by KGLRRcc+proj.
4.3 Algorithm for KGLRR
It is straightforward to use Theorem 2 to solve (26). For the sake of convenience, we present the algorithm below.
Let us take the eigenvector decomposition of the kernel matrix
where is the diagonal matrix of all the eigenvalues. Then the solution to (26) is given by
where is the diagonal matrix with elements defined by
This algorithm is valid for any kernel functions on Grassmann manifold.
5 Experiments
To investigate the performance of our proposed methods, GLRR21, GLRRF/KGLRRproj, KGLRRcc, KGLRRcc+proj, we conduct clustering experiments on several widely used public databases, the MNIST handwritten digits database [59], the DynTex++ database [60], the Highway Traffic Dataset [61] and the YouTube Celebrity (YTC) dataset [62, 63]. The clustering results are compared with three stateoftheart clustering algorithms, SSC, LRR and the Statistical Computations on Grassmann and Stiefel Manifold (SCGSM) in [45]. All the algorithms are coded in Matlab 2014a and implemented on an Intel Core i74770K 3.5GHz CPU machine with 16G RAM. In the following, we first describe each dataset and experiment setting, then report and analyze our experiment results.
5.1 Datasets and Experiment Setting
5.1.1 Datasets
Four widely used public datasets are used to test the chosen algorithms. They are
1) MNIST handwritten digit database [59]
The database consists of approximately 70,000 digit images written by 250 volunteers. For recognition applications, 60,000 images are generally used as training sets and the other 10,000 images are used as testing sets. All the digit images have been sizenormalized and centered in a fixed size of . Some samples of this database are shown in Fig. 2. As the samples in this database are sufficient and the images are almost noisefree, we choose this database to test the performance of our clustering methods in an ideal condition and in noisy condition at different levels in order to get some insight of the new methods.
2) DynTex++ database [60]
The database is derived from a total of 345 video sequences in different scenarios, which contains river water, fish swimming, smoke, cloud and so on. Some frames of the videos are shown in Fig. 3. The videos are labeled as 36 classes and each class has 100 subsequences (totally 3600 subsequences) with a fixed size of (50 gray frames). This is a challenging database for clustering because most textures from different classes are fairly similar and the number of classes is quite large. We select this database to test the clustering performance of the proposed methods for the case of large number of classes.
3) YouTube Celebrity dataset (YTC) [62, 63]
The dataset is downloaded from Youtube. It contains videos of celebrities joining activities under reallife scenarios in various environments, such as news interviews, concerts, films and so on. The dataset is comprised of 1,910 video clips of 47 subjects and each clip has more than 100 frames. We test the proposed methods on a face dataset detected from the vidoe clips. It is a quite challenging dataset since the faces are all of low resolution with variations of expression, pose and background. Some samples of YTC dataset are shown in Fig. 4.
4) Highway traffic dataset [61]
The dataset contains 253 video sequences of highway with three traffic levels, light, medium and heavy, in various weather scenes such as sunny, cloudy and rainy. Each video sequence has 42 to 52 frames. Fig. 5 shows some frames of traffic scene of three levels. The video sequences are converted to grey images and each image is normalized to size
with mean zero and unit variance. This database has much challenge as the scenes and its weather context are changing timely. So it is a good dataset for evaluating the clustering methods in real world scene.
5.1.2 Experiment Setting
GLRR model is designed to cluster Grassmann points, which are subspaces instead of raw object/signal points. Thus before implementing the main components of GLRR and the spectral clustering algorithm (here we use Ncut algorithm), we must represent the raw signals in a subspace form, i.e., the points on Grassmann manifold. As a subspace can be generally represented by an orthonormal basis, we utilize the samples drawn from the same subspace to construct its basis representation. Similar to the previous work [42] [64]
, we simply adopt Singular Value Decomposition (SVD) to construct a subspace basis. Concretely, given a set of images, e.g., the same digits written by the same person, denoted by
and each is a greyscale image with dimension , we can construct a matrix of size by vectorizing each image . Then is decomposed by SVD as . We can pick the first singularvectors of to represent the image set as a point on Grassmann manifold .The setting of the model parameters affects the performance of our proposed methods. is the most important penalty parameter for balancing the error term and the lowrank term in our proposed methods. Empirically, the value of in different applications has big gaps, and the best value for has to be chosen from a large range of values to get a better performance in a particular application. From our experiments, we have observed that, for a fixed database, when the cluster number is increasing, the best is decreasing, and that will be smaller when the noise of data is lower while larger if the noise level higher. This observation can be used as a guidance for future applications of the methods. On the other hand, the error tolerances are also important in controlling the terminal condition, which bound the allowed reconstructed errors. We experimentally seek a proper value of to make algorithms terminate at an appropriate stage with better errors.
For the conventional SSC and LRR methods, Grassmann points cannot be used as inputs. In fact our experiments confirm this naive strategy results in poorer performance for both SSC and LRR. To construct a fair comparison between SSC or LRR and our Grassmann based algorithms, we adopt the following strategy to construct training data for SSC and LRR. For each image set, we “vectorize” them into a long vector with all the raw data in the image set, in a carefully chosen order, e.g., in the frame order etc. In most of the experiments, we cannot simply take these vectors as inputs to SSC and LRR algorithms because of high dimensionality for a larger image sets. In this case, we apply PCA to reduce the raw vectors to a low dimension which equals to either the dimension of subspaces of Grassmann manifold or the number of PCA components retaining 90% of its variance energy. Then PCA projected vectors will be taken as the inputs to SSC and LRR algorithms.
5.2 MNIST Handwritten Digit Clustering
In this experiment, we simply test our algorithms on the test dataset of MNIST. We divide 10,000 images into subgroups so that each subgroup consists of 20 images of a particular digit to simulate the images from the same person. Thus our task is to cluster image subgroups into 10 categories. As described in the last section, we use leading singular vectors to represent each subgroup as a Grassmann point . Thus the size of the representative matrix of a Grassmann point is .
For SSC and LRR, the size of the input vector becomes , which is too large to handle on a desktop machine. We use PCA to reduce each vector to by keeping 90% variance energy. And this dimension will increase when the noise level increases.
After getting the lowrank representation of Grassmann points mentioned above, we pipeline the coefficient matrix to NCut for clustering. The experiment results are reported in Table I. It is shown that the accuracy of our proposed algorithms, GLRR21, GLRRF/ KGLRRproj, KGLRRcc, KGLRRcc+proj, are all , outperforming other methods more than 10 percents. The manifold mapping extracts more useful information about the differences among sample data. Thus the combination of Grassmann geometry and LRR model brings better accuracy for NCut clustering.
SSC [3]  LRR [14]  SCGSM [45]  GLRR21  GLRRF [33]  KGLRRcc  KGLRRcc+proj  
/KGLRRproj  
Accuracy  0.7576  0.8667  0.8646  1  1  1  1 
To test the robustness of the proposed algorithms, we add Gaussian noise onto all the digit images and then cluster them by different algorithms mentioned above. Fig. 6 shows some digit images with noise . Generally, the noises will effect the performance of the clustering algorithms, especially when the noise is heavy. Table II
shows the clustering performance of different methods with the noise standard deviation
ranging from 0.05 to 0.35. It indicates that our algorithm keeps 100 accuracy for the standard deviation up to 0.3, while the accuracy of other methods is generally lower than our method and behaves unstable when the noise standard deviation varies. This indicates that our proposed algorithms are robust for certain level of noises.Noise  SSC [3]  LRR [14]  SCGSM [45]  GLRR21  GLRRF [33]  KGLRRcc  KGLRRcc+proj 
/KGLRRproj  
0.05  0.7838  0.8667  0.8646  1  1  1  1 
0.1  0.7596  0.8889  0.7091  1  1  1  1 
0.15  0.7475  0.9939  0.8667  1  1  1  1 
0.2  0.6202  0.9960  0.7374  1  1  1  1 
0.25  0.3374  0.9960  0.7293  1  1  1  1 
0.3  0.2020  0.8909  0.6828  1  1  1  1 
0.35  0.1556  0.8263  0.2646  0.8889  0.996  0.8566  0.9838 
We further study the impact of on the performance of the clustering methods by varying value. From these experiments, it is observed that depends on noise levels. Generally, a relatively larger will give better clustering results when the noise level is higher. This explains that the noise level will impact the rank of the lowrank representation . A larger noise level will increase the rank of the represented coefficient matrix. So should be increased if we have a prior knowledge of higher level of noises.
5.3 Dynamic Texture Clustering
For the texture video sequences, the dynamic texture descriptor, Local Binary Patterns from Three Orthogonal Plans (LBPTOP) [65], is considered more suitable to capture its spacial and temporary features. So we use LBPTOP to construct the dynamic texture points on Grassmann manifold instead of the former SVD method. Generally, the LBPTOP method extracts the local cooccurrence features of a dynamic texture from three orthogonal planes of the sequential space. For 3600 subsequences in the DynTex++ database, the LBPTOP features are extracted to obtain 3600 matrices each in size of . We directly use these feature matrices as the points on Grassmann manifold. As the class number of all the 3600 subsequences is large, we pick the first classes from 36 classes and 50 subsequences for each class to cluster. The experiments are repeated several times for each . For SSC and LRR, the size of the input vector is , which is even too large for PCA algorithm. So we employ 2D PCA [66] to reduce the dimension to the subspace dimension of the Grassmann manifold.
The clustering results for DynTex++ database are shown in Table III. For more than 4 classes, the accuracy of the proposed methods are superior to the other methods around 10 percents. The accuracy of KGLRRcc+proj is higher than GLRR21 and GLRRF except for the case of 9 classes, which means the kernel version is more stable. We also observe that the accuracy decreases as the number of classes increases. This may be caused by the clustering challenge when more similar texture images are added into the data set.
Class  SSC [3]  LRR [14]  SCGSM [45]  GLRR21  GLRRF [33]  KGLRRcc  KGLRRcc+proj 

/KGLRRproj  
3  0.6700  0.9967  1  1  1  1  1 
4  0.7075  0.8625  0.9050  0.9975  0.9975  0.9975  0.9975 
5  0.5060  0.7280  0.8340  0.9840  0.9740  0.9980  0.9880 
6  0.4167  0.5933  0.7367  0.9250  0.8683  0.9150  0.9300 
7  0.3371  0.5643  0.5914  0.9071  0.8857  0.8757  0.9100 
8  0.4187  0.4788  0.6313  0.8725  0.8513  0.8675  0.8738 
9  0.3556  0.4378  0.5044  0.7689  0.8056  0.7867  0.7800 
10  0.2550  0.4440  0.4790  0.6940  0.7620  0.7150  0.8110 
5.4 YouTube Celebrity Clustering
In order to create a face dataset from the YTC videos, a face detection algorithm is exploited to extract face regions and resize each face to a
image. We treat the faces extracted from each video as an image set, which is represented as a point on Grassmann manifold by the SVD method as used for the handwritten digit case. Each face image set contains varying number of face images, however we fix the dimension of subspaces to . Since there is a big gap between 13 and 349 frames in the YTC videos and PCA algorithm requires each sample has the same dimension, it is unfair to select only few frames equally from each video as the input data for SSC and LRR algorithms. Hence we give up comparing our methods with SSC and LRR.We simply choose persons, respectively, as the target classes from totally 47 persons and test the proposed algorithms over all the face image sets of the chosen persons. Table IV shows the clustering results on YTC face dataset with different number of selected persons. The accuracy of our methods, especially the kernel methods, are significantly higher than other methods. Like the Dyntex texture experiment, with the number of persons (classes) increasing, the accuracy for most algorithms decreases and KGLRRcc+proj behaves more stably. Because GLRR21 consumes so much CPU memory resource that we could not test a wide range of to get a better experiment result, actually we have to relax the terminal condition and empirically select some . The accuracy of GLRR21 reported is not the best result. All the other methods are tested on a wide range of from 0.1 to 50.
Class  SCGSM [45]  GLRR21  GLRRF [33]  KGLRRcc  KGLRRcc+proj 

/KGLRRproj  
4  0.5282  0.6972  0.8944  0.8944  0.9085 
5  0.7188  0.9167  0.9167  0.9167  0.9167 
6  0.5925  0.8566  0.8604  0.8604  0.8604 
7  0.5955  0.7612  0.8034  0.7697  0.8174 
8  0.6624  0.8135  0.8264  0.8006  0.8264 
9  0.6974  0.6785  0.7470  0.7447  0.7825 
10  0.5264  0.6892  0.7400  0.7294  0.7569 
5.5 UCSD Traffic Clustering
The traffic video clips in the database are labeled into three classes based on the level of traffic jam. There are 44 clips of heavy level, 45 clips of medium level and 164 clips of light level. We regard each video as an image set to construct a point on Grassmann manifold, also by using the SVD method. The subspace dimension is selected as 20, the cluster number and the total number of samples . For SSC and LRR, we vectorize the former 42 frames of each clip (there are 42 to 52 frames in a clip) and then use PCA to reduce the dimension (24*24*42) to 147 by keeping 90% variance energy. Note that the level of traffic jam doesn’t have a sharp borderline. For some confused clips, it is difficult to say whether they belong to heavy, medium or light level. So it is a great challenging task for clustering methods.
Table V presents the clustering performance of all the algorithms on the Traffic dataset with two different frame sizes. The accuracy of our methods except for KGLRRproj are at least 10 percent higher than the other methods. When the frame size is
, the KGLRRcc+proj gets the highest accuracy 0.8972 which almost reaches the accuracy of some supervised learning based classification algorithms
[67]. However, constrained by the CPU resource, we cannot report the results from GLRR21, SSC and LRR.6 Conclusion and Future Work
In this paper, we propose a novel LRR model on Grassmann manifold by utilizing the embedding mapping from the manifold onto the space of symmetric matrices to construct a metric in Euclidean space. To treat different noises, the proposed GLRR is further extended to two models, GLRRF and GLRR21, to deal with Gaussian noise and nonGaussian noise with outliers, respectively. We derive an equivalent optimization problem which has a closedform solution for GLRRF. In addition, we show that the LRR model on Grassmann manifold can be generalized under the kernel framework and two special kernel functions on Grassmann manifold are incorporated into the kernelized GLRR model. The proposed models and algorithms are evaluated on several public databases against several existing clustering algorithms. The experimental results show that the proposed methods outperform the stateoftheart methods and behave robustly to noises and outliers. This work provides a novel idea to construct LRR model for data on manifolds and it has demonstrated that incorporating geometrical property of manifolds via embedding mapping actually facilitate learning on manifold. In the future work, we will focus on the exploring the intrinsic property of Grassmann manifold to construct LRR on it.
Acknowledgements
The research project is supported by the Australian Research Council (ARC) through the grant DP130100364 and also partially supported by National Natural Science Foundation of China under Grant No. 61390510, 61133003, 61370119, 61171169, 61300065 and Beijing Natural Science Foundation No. 4132013.
References

[1]
R. Xu and D. WunschII, “Survey of clustering algorithms,”
IEEE Transactions on Neural Networks
, vol. 16, no. 2, pp. 645–678, 2005.  [2] R. Vidal, “Subspace clustering,” IEEE Signal Processing Magazine, vol. 28, no. 2, pp. 52–68, 2011.
 [3] E. Elhamifar and R. Vidal, “Sparse subspace clustering: Algorithm, Theory, and Applications,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 2765–2781, 2013.
 [4] P. Tseng, “Nearest flat to points,” Journal of Optimization Theory and Applications, vol. 105, no. 1, pp. 249–252, 2000.
 [5] J. Ho, M. H. Yang, J. Lim, K. Lee, and D. Kriegman, “Clustering appearances of objects under varying illumination conditions,” in IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, 2003, pp. 11–18.
 [6] M. Tipping and C. Bishop, “Mixtures of probabilistic principal component analyzers,” Neural Computation, vol. 11, no. 2, pp. 443–482, 1999.
 [7] A. Gruber and Y. Weiss, “Multibody factorization with uncertainty and missing data using the EM algorithm,” in IEEE Conference on Computer Vision and Pattern Recognition, vol. I, 2004, pp. 707–714.
 [8] K. Kanatani, “Motion segmentation by subspace separation and model selection,” in IEEE International Conference on Computer Vision, vol. 2, 2001, pp. 586–591.

[9]
Y. Ma, A. Yang, H. Derksen, and R. Fossum, “Estimation of subspace arrangements with applications in modeling and segmenting mixed data,”
SIAM Review, vol. 50, no. 3, pp. 413–458, 2008.  [10] W. Hong, J. Wright, K. Huang, and Y. Ma, “Multiscale hybrid linear models for lossy image representation,” IEEE Transactions on Image Processing, vol. 15, no. 12, pp. 3655–3671, 2006.
 [11] U. von Luxburg, “A tutorial on spectral clustering,” Statistics and Computing, vol. 17, no. 4, pp. 395–416, 2007.
 [12] G. Chen and G. Lerman, “Spectral curvature clustering,” International Journal of Computer Vision, vol. 81, no. 3, pp. 317–330, 2009.

[13]
G. Liu and S. Yan, “Latent lowrank representation for subspace segmentation and feature extraction,” in
IEEE International Conference on Computer Vision, 2011, pp. 1615–1622.  [14] G. Liu, Z. Lin, J. Sun, Y. Yu, and Y. Ma, “Robust recovery of subspace structures by lowrank representation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 171–184, 2013.
 [15] P. Favaro, R. Vidal, and A. Ravichandran, “A closed form solution to robust subspace estimation and clustering,” in IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 1801–1807.
 [16] C. Lang, G. Liu, J. Yu, and S. Yan, “Saliency detection by multitask sparsity pursuit,” IEEE Transactions on Image Processing, vol. 21, no. 1, pp. 1327–1338, 2012.
 [17] J. Shi and J. Malik, “Normalized cuts and image segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 1, pp. 888–905, 2000.
 [18] D. Donoho, “For most large underdetermined systems of linear equations the minimal l1norm solution is also the sparsest solution,” Comm. Pure and Applied Math., vol. 59, pp. 797–829, 2004.
 [19] G. Lerman and T. Zhang, “Robust recovery of multiple subspaces by geometric minimization,” The Annuals of Statistics, vol. 39, no. 5, pp. 2686–2715, 2011.
 [20] Z. Jiang, Z. Lin, and L. S. Davis, “Label consistent KSVD: Learning a discriminative dictionary for recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 11, pp. 2651–2664, 2013.
 [21] S. Tierney, J. Gao, and Y. Guo, “Subspace clustering for sequential data,” in IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1019–1026.
 [22] J. Liu, Y. Chen, J. Zhang, and Z. Xu, “Enhancing lowrank subspace clustering by manifold regularization,” IEEE Transactions on Image Processing, vol. 23, no. 9, pp. 4022–4030, 2014.
 [23] G. Liu, Z. Lin, and Y. Yu, “Robust subspace segmentation by lowrank representation,” in International Conference on Machine Learning, 2010, pp. 663–670.
 [24] R. Wang, S. Shan, X. Chen, and W. Gao, “Manifoldmanifold distance with application to face recognition based on image set,” in IEEE Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–8.
 [25] S. Roweis and L. Saul, “Nonlinear dimensionality reduction by locally linear embedding,” Science, vol. 290, no. 1, pp. 2323–2326, 2000.
 [26] J. Tenenbaum, V. Silva, and J. Langford, “A global geometric framework for nonlinear dimensionality reduction,” Optimization Methods and Software, vol. 290, no. 1, pp. 2319–2323, 2000.
 [27] X. He and P. Niyogi, “Locality preserving projections,” in Advances in Neural Information Processing Systems, vol. 16, 2003.
 [28] M. Belkin and P. Niyogi, “Laplacian eigenmaps and spectral techniques for embedding and clustering,” in Advances in Neural Information Processing Systems, vol. 14, 2001.
 [29] Z. Zhang and H. Zha, “Principal manifolds and nonlinear dimension reduction via local tangent space alignment,” SIAM Journal of Scientific Computing, vol. 26, no. 1, pp. 313–338, 2005.
 [30] O. Tuzel, F. Porikli, and P. Meer, “Region covariance: A fast descriptor for detection and classification,” European Conference on Computer Vision, vol. 3952, pp. 589–600, 2006.
 [31] P. K. Turaga, A. Veeraraghavan, and R. Chellappa, “Statistical analysis on Stiefel and Grassmann manifolds with applications in computer vision,” in IEEE Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–8.
 [32] M. T. Harandi, M. Salzmann, S. Jayasumana, R. Hartley, and H. Li, “Expanding the family of Grassmannian kernels: An embedding perspective,” in European Conference on Computer Vision, vol. 8695, 2014, pp. 408–423.
 [33] B. Wang, Y. Hu, J. Gao, Y. Sun, and B. Yin, “Low rank representation on Grassmann manifolds,” in Asian Conference on Computer Vision, 2014.
 [34] M. Aharon, M. Elad, and A. Bruckstein, “KSVD: An algorithm for designing overcomplete dictionaries for sparse representation,” IEEE Transactions on Signal Processing, vol. 54, no. 1, pp. 4311–4322, 2006.

[35]
J. Wright, A. Ganesh, S. Rao, Y. Peng, and Y. Ma, “Robust principal component analysis: Exact recovery of corrupted lowrank matrices via convex optimization,” in
Advances in Neural Information Processing Systems, vol. 22, 2009.  [36] E. J. Candés, X. Li, Y. Ma, and J. Wright, “Robust principal component analysis?” Journal of the ACM, vol. 58, no. 3, pp. 1–37, 2011.
 [37] B. Cheng, G. Liu, J. Wang, Z. Huang, and S. Yan, “Multitask lowrank affinity pursuit for image segmentation,” in International Conference on Computer Vision, 2011, pp. 2439–2446.
 [38] M. Fazel, “Matrix rank minimization with applications,” PhD thesis, Stanford University, 2002.
 [39] P. A. Absil, R. Mahony, and R. Sepulchre, “Riemannian geometry of Grassmann manifolds with a view on algorithmic computation,” Acta Applicadae Mathematicae, vol. 80, no. 2, pp. 199–220, 2004.

[40]
A. Srivastava and E. Klassen, “Bayesian and geometric subspace tracking,”
Advances in Applied Probability
, vol. 36, no. 1, pp. 43–56, 2004.  [41] J. Hamm and D. Lee, “Grassmann discriminant analysis: a unifying view on subspacebased learning,” in International Conference on Machine Learning, 2008, pp. 376–383.
 [42] M. T. Harandi, C. Sanderson, S. A. Shirazi, and B. C. Lovell, “Graph embedding discriminant analysis on Grassmannian manifolds for improved image set matching,” in IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 2705–2712.
 [43] M. T. Harandi, C. Sanderson, C. Shen, and B. Lovell, “Dictionary learning and sparse coding on Grassmann manifolds: An extrinsic solution,” in International Conference on Computer Vision, 2013, pp. 3120–3127.
 [44] H. Cetingul and R. Vidal, “Intrinsic mean shift for clustering on Stiefel and Grassmann manifolds,” in IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 1896–1902.
 [45] P. Turaga, A. Veeraraghavan, A. Srivastava, and R. Chellappa, “Statistical computations on Grassmann and Stiefel manifolds for image and videobased recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 11, pp. 2273–2286, 2011.
 [46] C. Xu, T. Wang, J. Gao, S. Cao, W. Tao, and F. Liu, “An orderedpatchbased image classification approach on the image Grassmannian manifold,” IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 4, pp. 728–737, 2014.
 [47] A. Goh and R. Vidal, “Clustering and dimensionality reduction on Riemannian manifolds,” in IEEE Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–7.
 [48] P. Absil, R. Mahony, and R. Sepulchre, Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2008.
 [49] J. T. Helmke and K. Hüper, “Newton’s method on Grassmann manifolds.” Preprint: [arXiv:0709.2205], Tech. Rep., 2007.
 [50] G. Kolda and B. Bader, “Tensor decomposition and applications,” SIAM Review, vol. 51, no. 3, pp. 455–500, 2009.
 [51] J. F. Cai, E. J. Candès, and Z. Shen, “A singular value thresholding algorithm for matrix completion,” SIAM J. on Optimization, vol. 20, no. 4, pp. 1956–1982, 2008.
 [52] Z. Lin, R. Liu, and Z. Su, “Linearized alternating direction method with adaptive penalty for low rank representation,” in Advances in Neural Information Processing Systems, vol. 23, 2011.
 [53] L. Wolf and A. Shashua, “Learning over sets using kernel principal angles,” Journal of Machine Learning Research, vol. 4, pp. 913–931, 2003.
 [54] S. Jayasumana, R. Hartley, M. Salzmann, H. Li, and M. Harandi, “Optimizing over radial kernels on compact manifolds,” in IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3802 – 3809.
 [55] O. Yamaguchi, K. Fukui, and K. Maeda, “Face recognition using temporal image sequence,” in Automatic Face and Gesture Recognition, 1998, pp. 318–323.
 [56] A. Björck and G. H. Golub, “Numerical methods for computing angles between linear subspaces,” Mathematics of Computation, vol. 27, pp. 579–594, 1973.
 [57] B. Schölkopf and A. Smola, Learning with Kernels. Cambridge, Massachusetts: The MIT Press, 2002.
 [58] F. R. Bach, G. R. G. Lanckriet, and M. I. Jordan, “Multiple kernel learning, conic duality, and the SMO algorithm,” in International Conference on Machine Learning, 2004, pp. 6–.
 [59] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, “Gradientbased learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 1, pp. 2278–2324, 1998.
 [60] B. Ghanem and N. Ahuja, “Maximum margin distance learning for dynamic texture recognition,” in European Conference on Computer Vision, 2010, pp. 223–236.
 [61] A. B. Chan and N. Vasconcelos, “Modeling, clustering, and segmenting video with mixtures of dynamic textures,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 5, pp. 909–926, 2008.
 [62] M. Kim, S. Kumar, V. Pavlovic, and H. Rowley, “Face tracking and recognition with visual constraints in realworld videos,” in IEEE Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–8.
 [63] R. Wang, H. Guo, L. Davis, and Q. Dai, “Covariance discriminative learning: A natural and efficient approach to image set classification,” in IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 2496–2503.
 [64] M. Harandi, C. Sanderson, R. Hartley, and B. Lovell, “Sparse coding and dictionary learning for symmetric positive definite matrices: A kernel approach,” in European Conference on Computer Vision, 2012, pp. 216–229.
 [65] G. Zhao and M. Pietikäinen, “Dynamic texture recognition using local binary patterns with an application to facial expressions,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 1, pp. 915–928, 2007.
 [66] S. Yu, J. Bi, and J. Ye, “Probabilistic interpretations and extensions for a family of 2D PCAstyle algorithms,” in Workshop on Data Mining Using Matrices Tensors, 2008, pp. 1–7.
 [67] A. Sankaranarayanan, P. Turaga, R. Baraniuk, and R. Chellappa, “Compressive acquisition of dynamic scenes,” in European Conference on Computer Vision, vol. 6311, 2010, pp. 129–142.
Comments
There are no comments yet.