Subspace Clustering Library
This paper studies the subspace segmentation problem. Given a set of data points drawn from a union of subspaces, the goal is to partition them into the underlying subspaces they were drawn from. Spectral clustering is used as the framework. It requires an affinity matrix that is close to block diagonal, with nonzero entries corresponding to pairs of data points from the same subspace. In this work, we argue that both sparsity and the grouping effect are important for subspace segmentation. A sparse affinity matrix tends to be block diagonal, with fewer connections between data points from different subspaces. The grouping effect ensures that highly correlated data, which usually come from the same subspace, are grouped together. Sparse Subspace Clustering (SSC), by using $\ell_1$-minimization, encourages sparsity for data selection, but it lacks the grouping effect. In contrast, Low-Rank Representation (LRR), by rank minimization, and Least Squares Regression (LSR), by $\ell_2$-regularization, exhibit a strong grouping effect, but they fall short in subset selection. Thus the affinity matrix obtained by SSC is usually very sparse, whereas those obtained by LRR and LSR are very dense. In this work, we propose the Correlation Adaptive Subspace Segmentation (CASS) method by using trace Lasso. CASS is a data correlation dependent method which simultaneously performs automatic data selection and groups correlated data together. It can be regarded as a method which adaptively balances SSC and LSR. Both theoretical and experimental results show the effectiveness of CASS.
This paper focuses on subspace segmentation, the goal of which is to segment a given data set into clusters, ideally with each cluster corresponding to a subspace. Subspace segmentation is an important problem in both the computer vision and machine learning literature. It has numerous applications, such as motion segmentation, face clustering, and image segmentation, owing to the fact that real-world data often approximately lie in a mixture of subspaces. The problem is formally defined as follows:
Definition 1 (Subspace Segmentation). Given a set of sufficiently sampled data vectors $X = [x_1, \ldots, x_n] \in \mathbb{R}^{d \times n}$, where $d$ is the feature dimension and $n$ is the number of data vectors. Assume that the data are drawn from a union of $k$ subspaces $\{S_i\}_{i=1}^k$ of unknown dimensions $\{r_i\}_{i=1}^k$, respectively. The task is to segment the data according to the underlying subspaces they are drawn from.
Some notations are used in this work. We use capital and lowercase symbols to represent matrices and vectors, respectively. In particular, $\mathbf{1}$ denotes the vector of all 1's, $e_i$ is a vector whose $i$-th entry is 1 and 0 for the others, and $I$ is used to denote the identity matrix. $\mathrm{Diag}(v)$ converts the vector $v$ into a diagonal matrix in which the $i$-th diagonal entry is $v_i$. $\mathrm{diag}(A)$ is a vector whose $i$-th entry is $A_{ii}$ of a square matrix $A$. $\mathrm{tr}(A)$ is the trace of a square matrix $A$. $A_i$ denotes the $i$-th column of a matrix $A$. $\mathrm{sign}(x)$ is the sign function defined as $\mathrm{sign}(x) = x/|x|$ if $x \neq 0$ and 0 otherwise. $v \to v_0$ denotes that $v$ converges to $v_0$.
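Some vector and matrix norms will be used. $\|v\|_0$, $\|v\|_1$, $\|v\|_2$, and $\|v\|_\infty$ denote the $\ell_0$-norm (number of nonzero entries), $\ell_1$-norm (sum of the absolute value of each entry), $\ell_2$-norm and $\ell_\infty$-norm of a vector $v$. $\|A\|_1$, $\|A\|_F$, $\|A\|_{2,1}$, $\|A\|_\infty$, and $\|A\|_*$ denote the $\ell_1$-norm ($\sum_{i,j} |A_{ij}|$), Frobenius norm, $\ell_{2,1}$-norm ($\sum_j \|A_j\|_2$), $\ell_\infty$-norm ($\max_{i,j} |A_{ij}|$), and nuclear norm (the sum of all the singular values) of a matrix, respectively.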
There has been a large body of research on subspace segmentation [23, 3, 13, 24, 17, 5, 8]. Most recently, the Sparse Subspace Clustering (SSC) [3, 4], Low-Rank Representation (LRR) [13, 12, 2], and Least Squares Regression (LSR) techniques have been proposed for subspace segmentation and have attracted much attention. These methods learn an affinity matrix whose entries measure the similarities among the data points and then perform spectral clustering on the affinity matrix to segment the data. Ideally, the affinity matrix should be block diagonal (or block sparse in vector form), with nonzero entries corresponding to data point pairs from the same subspace. A typical choice for the measure of similarity between $x_i$ and $x_j$ is $W_{ij} = \exp(-\|x_i - x_j\|/\sigma)$, where $\sigma > 0$. However, such a measure is unable to utilize the underlying linear subspace structure of the data. The constructed affinity matrix is usually not block diagonal even under certain strong assumptions, e.g., independent subspaces (a collection of $k$ linear subspaces $\{S_i\}_{i=1}^k$ is independent if and only if $S_i \cap \sum_{j \neq i} S_j = \{0\}$ for all $i$, or equivalently $\sum_{i=1}^k S_i = \oplus_{i=1}^k S_i$). For a new point $y \in \mathbb{R}^d$ in the subspaces, SSC pursues a sparse representation:
$$\min_{w} \|w\|_1 \quad \text{s.t.} \quad y = Xw. \tag{1}$$
Problem (1) can be extended to handle noisy data, which leads to the Lasso formulation:
$$\min_{w} \|y - Xw\|_2^2 + \lambda \|w\|_1, \tag{2}$$
where $\lambda > 0$ is a parameter. SSC solves problem (1) or (2) for each data point in the dataset with all the other data points as the dictionary. Then it uses the derived representation coefficients to measure the similarities between data points and constructs the affinity matrix. It is shown that, if the subspaces are independent, the sparse representation is block sparse. However, if the data from the same subspace are highly correlated or clustered, the $\ell_1$-minimization will generally select a single representative at random and ignore the other correlated data. This leads to a sparse solution but misses data correlation information. Thus SSC may produce a sparse affinity matrix and still perform unsatisfactorily.
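The per-point sparse coding step of SSC can be illustrated with a small solver. The sketch below is not the solver used by the SSC authors: it applies plain ISTA (iterative soft thresholding) to the Lasso objective, assumed here in the form $\frac{1}{2}\|y - Xw\|_2^2 + \lambda\|w\|_1$ (the factor $\frac{1}{2}$ only rescales $\lambda$). The names `soft_threshold` and `lasso_ista` are hypothetical.

```python
import numpy as np

def soft_threshold(v, tau):
    """Elementwise soft thresholding, the proximal operator of tau * l1-norm."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Minimize 0.5 * ||y - X w||_2^2 + lam * ||w||_1 with ISTA.

    X has one data point per column; y is the point to be represented.
    """
    # Lipschitz constant of the gradient of the smooth term.
    L = np.linalg.norm(X, 2) ** 2
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y)
        w = soft_threshold(w - grad / L, lam / L)
    return w
```

In an SSC-style pipeline one would call `lasso_ista` once per data point, passing all the other points as the columns of `X`.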
Low-Rank Representation (LRR) is a method which aims to group the correlated data together. It solves the following convex optimization problem:
$$\min_{W} \|W\|_* \quad \text{s.t.} \quad X = XW. \tag{3}$$
The above problem can be extended to the noisy case:
$$\min_{W} \|X - XW\|_{2,1} + \lambda \|W\|_*, \tag{4}$$
where $\lambda > 0$ is a parameter. Although LRR is guaranteed to produce a block diagonal solution when the data are noise free and drawn from independent subspaces, real data are usually contaminated with noise or outliers. So the solution to problem (4) is usually very dense and far from block diagonal. The reason is that nuclear norm minimization lacks the ability of subset selection. Thus, LRR generally groups correlated data together, but sparsity cannot be achieved.
In the context of statistics, ridge regression ($\ell_2$-regularization) may behave similarly to LRR. Below is the most recent work, which uses Least Squares Regression (LSR) for subspace segmentation:
$$\min_{W} \|X - XW\|_F^2 + \lambda \|W\|_F^2. \tag{5}$$
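The grouping effect of LSR can be checked numerically. Assuming the LSR objective has the ridge form $\|X - XW\|_F^2 + \lambda\|W\|_F^2$, setting its gradient to zero gives the closed-form solution $W = (X^T X + \lambda I)^{-1} X^T X$. The sketch below uses a hypothetical helper name, `lsr_coefficients`, and omits any constraint on the diagonal of $W$.

```python
import numpy as np

def lsr_coefficients(X, lam):
    """Closed-form minimizer of ||X - X W||_F^2 + lam * ||W||_F^2.

    X has one data point per column; the result is an n x n coefficient matrix.
    """
    n = X.shape[1]
    G = X.T @ X  # Gram matrix, encodes the correlation among data points
    return np.linalg.solve(G + lam * np.eye(n), G)
```

From $\lambda W = X^T (X - XW)$, row $i$ of $W$ equals $x_i^T (X - XW)/\lambda$, so two identical data points receive identical coefficient rows. The solution is also dense in general, which is the lack of subset selection discussed here.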
Both LRR and LSR encourage the grouping effect but lack sparsity. In fact, for subspace segmentation, both sparsity and the grouping effect are very important. Ideally, the affinity matrix should be sparse, with no connections between clusters. On the other hand, the affinity matrix should not be too sparse, i.e., the nonzero connections within a cluster should be sufficient for grouping correlated data in the same subspace. Thus, it is expected that the model can automatically group the correlated data within a cluster (like LRR and LSR) and eliminate the connections between clusters (like SSC). Trace Lasso, defined as $\|X\mathrm{Diag}(w)\|_*$, is such a newly established regularizer, which interpolates between the $\ell_1$-norm and $\ell_2$-norm of $w$. It is adaptive and depends on the correlation among the samples in $X$, which can be encoded by $X^T X$. In particular, when the data are highly correlated ($X^T X$ is close to $\mathbf{1}\mathbf{1}^T$), it will be close to the $\ell_2$-norm, while when the data are almost uncorrelated ($X^T X$ is close to $I$), it will behave like the $\ell_1$-norm. We use trace Lasso to regularize the representation coefficient matrix, build an affinity matrix from the resulting coefficients, and apply spectral clustering to its normalized Laplacian. Such a model is called Correlation Adaptive Subspace Segmentation (CASS) in this work. CASS can be regarded as a method which adaptively interpolates between SSC and LSR. An intuitive comparison of the coefficient matrices derived by these four methods can be found in Figure 1. For CASS, we can see that most large representation coefficients cluster on the data points from the same subspace as $y$. In comparison, the connections within a cluster are very sparse for SSC, and the connections between clusters are very dense for LRR and LSR.
We summarize the contributions of this paper as follows:
We propose a new subspace segmentation method, called Correlation Adaptive Subspace Segmentation (CASS), by using trace Lasso. CASS is the first method that takes the data correlation into account for subspace segmentation, so it is self-adaptive to different types of data.
In theory, we show that if the data are from independent subspaces, and the objective function satisfies the proposed Enforced Block Sparse (EBS) conditions, then the obtained solution is block sparse. Trace Lasso is a special case which satisfies the EBS conditions.
We theoretically prove that trace Lasso has the grouping effect, i.e., the coefficients of a group of correlated data are approximately equal.
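Trace Lasso is a recently proposed norm which balances the $\ell_1$-norm and $\ell_2$-norm. It is formally defined as
$$\Omega(w) = \|X\mathrm{Diag}(w)\|_*.$$
A main difference between trace Lasso and the existing norms is that trace Lasso involves the data matrix $X$, which makes it adaptive to the correlation of data. Actually, it only depends on the matrix $X^T X$, which encodes the correlation information among data. In particular, if the norm of each column of $X$ is normalized to one, we have the following decomposition of $X\mathrm{Diag}(w)$:
$$X\mathrm{Diag}(w) = \sum_{i=1}^n |w_i| \, (\mathrm{sign}(w_i)\, x_i)\, e_i^T.$$
If the data are uncorrelated (the data points are orthogonal, $X^T X = I$), the above equation gives the singular value decomposition of $X\mathrm{Diag}(w)$. In this case, trace Lasso is equal to the $\ell_1$-norm:
$$\|X\mathrm{Diag}(w)\|_* = \|\mathrm{Diag}(w)\|_* = \sum_{i=1}^n |w_i| = \|w\|_1.$$
If the data are highly correlated (the data points are all the same, $X = x_1 \mathbf{1}^T$, $X^T X = \mathbf{1}\mathbf{1}^T$), trace Lasso is equal to the $\ell_2$-norm:
$$\|X\mathrm{Diag}(w)\|_* = \|x_1 w^T\|_* = \|x_1\|_2 \|w\|_2 = \|w\|_2.$$
For other cases, trace Lasso interpolates between the $\ell_2$-norm and $\ell_1$-norm:
$$\|w\|_2 \le \|X\mathrm{Diag}(w)\|_* \le \|w\|_1.$$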
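The two extreme cases and the interpolation bound are easy to verify numerically. The following is a minimal sketch, assuming unit-norm columns; the function name `trace_lasso` is hypothetical and not from any library.

```python
import numpy as np

def trace_lasso(X, w):
    """Trace Lasso of w with respect to X: the nuclear norm of X @ Diag(w)."""
    # X * w scales column i of X by w[i], which equals X @ np.diag(w).
    return np.linalg.norm(X * w, ord="nuc")

rng = np.random.default_rng(0)
w = rng.standard_normal(4)

# Uncorrelated data: orthonormal columns, so trace Lasso equals the l1-norm.
Q, _ = np.linalg.qr(rng.standard_normal((6, 4)))
print(trace_lasso(Q, w), np.abs(w).sum())

# Fully correlated data: identical unit-norm columns give the l2-norm.
x = rng.standard_normal(6)
x /= np.linalg.norm(x)
X_same = np.tile(x[:, None], (1, 4))
print(trace_lasso(X_same, w), np.linalg.norm(w))
```

For a generic matrix with unit-norm columns the value lies between these two numbers, which is the adaptive behavior that CASS relies on.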
We use trace Lasso for subset selection from all the data adaptively, which leads to the Correlation Adaptive Subspace Segmentation (CASS) method. We first consider the subspace segmentation problem with clean data by CASS and then extend it to the noisy case.
Let $X = [x_1, \ldots, x_n] = [X_1, \ldots, X_k]\Gamma$ be a set of data drawn from $k$ subspaces $\{S_i\}_{i=1}^k$, where $X_i$ denotes a collection of $n_i$ data points from the $i$-th subspace $S_i$, $n = \sum_{i=1}^k n_i$, and $\Gamma$ is a hypothesized permutation matrix which rearranges the data to the true segmentation of data. A given data point $y \in S_i$ can be represented as a linear combination of all the data points $X$. Different from the previous methods SSC, LRR and LSR, CASS uses trace Lasso as the objective function and solves the following problem:
$$\min_{w} \|X\mathrm{Diag}(w)\|_* \quad \text{s.t.} \quad y = Xw. \tag{6}$$
The methods SSC, LRR and LSR show that if the data are sufficiently sampled from independent subspaces, a block diagonal solution can be achieved. It has further been shown that it is easy to get a block diagonal solution if the objective function satisfies the Enforced Block Diagonal (EBD) conditions. But the EBD conditions cannot be applied to trace Lasso directly, since trace Lasso is a function involving both the data $X$ and $w$. Here we extend the EBD conditions to the Enforced Block Sparse (EBS) conditions and show that the obtained solution is block sparse when the objective function satisfies the EBS conditions. Trace Lasso is a special case which satisfies the EBS conditions and thus leads to a block sparse solution.
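Enforced Block Sparse (EBS) Conditions. Assume $f$ is a function with regard to a matrix $X \in \mathbb{R}^{d \times n}$ and a vector $w = [w_a; w_b; w_c] \in \mathbb{R}^n$, $w \neq 0$. Let $w^B = [0; w_b; 0] \in \mathbb{R}^n$. The EBS conditions are:
(1) $f(X, w) = f(XP, P^{-1}w)$, for any permutation matrix $P \in \mathbb{R}^{n \times n}$;
(2) $f(X, w) \ge f(X, w^B)$, and the equality holds if and only if $w = w^B$.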
For some cases, the EBS conditions can be regarded as extensions of the EBD conditions (footnote 2: For example, , where . It is easy to see that satisfies the EBS conditions and satisfies the EBD conditions.). The EBS conditions will enforce the solution to the following problem
$$\min_{w} f(X, w) \quad \text{s.t.} \quad y = Xw \tag{7}$$
to be block sparse when the subspaces are independent.
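Theorem 1. Let $X = [x_1, \ldots, x_n] = [X_1, \ldots, X_k]\Gamma \in \mathbb{R}^{d \times n}$ be a data matrix whose column vectors are sufficiently drawn from a union of $k$ independent subspaces $\{S_i\}_{i=1}^k$, $x_j \neq 0$, $j = 1, \ldots, n$ (sufficient sampling makes sure that problem (7) has a feasible solution). For each $i$, $X_i \in \mathbb{R}^{d \times n_i}$ and $n = \sum_{i=1}^k n_i$. Let $y \in \mathbb{R}^d$ be a new point in $S_i$. Then the solution $w^* = \Gamma^{-1}[z_1^*; \ldots; z_k^*] \in \mathbb{R}^n$ to problem (7) is block sparse, i.e., $z_i^* \neq 0$ and $z_j^* = 0$ for all $j \neq i$.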
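Proof. For $y \in S_i$, let $w^* = \Gamma^{-1}[z_1^*; \ldots; z_k^*]$ be the optimal solution to problem (7), where $z_i^* \in \mathbb{R}^{n_i}$ corresponds to $X_i$ for each $i = 1, \ldots, k$. We decompose $w^*$ into two parts $w^* = u^* + v^*$, where $u^* = \Gamma^{-1}[0; \ldots; z_i^*; \ldots; 0]$ and $v^* = \Gamma^{-1}[z_1^*; \ldots; 0; \ldots; z_k^*]$. We have
$$y = Xw^* = Xu^* + Xv^* = X_i z_i^* + \sum_{j \neq i} X_j z_j^*.$$
Since $y \in S_i$ and $X_i z_i^* \in S_i$, $y - X_i z_i^* \in S_i$. Thus $\sum_{j \neq i} X_j z_j^* = y - X_i z_i^* \in S_i \cap \oplus_{j \neq i} S_j$. Considering that the subspaces are independent, $S_i \cap \oplus_{j \neq i} S_j = \{0\}$, we have $y = X_i z_i^* = Xu^*$ and $X_j z_j^* = 0$, $j \neq i$. So $u^*$ is feasible to problem (7). On the other hand, by the definition of $u^*$ and the EBS conditions (2), we have
$$f(X, w^*) \ge f(X, u^*).$$
Noticing that $w^*$ is optimal to problem (7), $f(X, w^*) \le f(X, u^*)$. Thus the equality holds. By the EBS conditions (2), we get $w^* = u^*$. Therefore, $z_i^* \neq 0$, and $z_j^* = 0$ for all $j \neq i$.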
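The EBS conditions greatly extend the family of objective functions which possess the block sparse property. It is easy to check that trace Lasso satisfies the EBS conditions. Let $f(X, w) = \|X\mathrm{Diag}(w)\|_*$. For any permutation matrix $P$,
$$f(XP, P^{-1}w) = \|XP\,\mathrm{Diag}(P^{-1}w)\|_* = \|X\mathrm{Diag}(w)P\|_* = \|X\mathrm{Diag}(w)\|_* = f(X, w),$$
so the EBS conditions (1) hold.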
Trace Lasso also satisfies the EBS conditions (2) by the following lemma:
Lemma 1 [18, Lemma 11]. Let $A \in \mathbb{R}^{d \times n}$ be partitioned in the form $A = [A_1, A_2]$. Then $\|A\|_* \ge \|A_1\|_*$ and the equality holds if and only if $A_2 = 0$.
In a similar way, CASS has the block sparse property:
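Theorem 2. Let $X = [x_1, \ldots, x_n] = [X_1, \ldots, X_k]\Gamma \in \mathbb{R}^{d \times n}$ be a data matrix whose column vectors are sufficiently drawn from a union of $k$ independent subspaces $\{S_i\}_{i=1}^k$, $x_j \neq 0$, $j = 1, \ldots, n$. For each $i$, $X_i \in \mathbb{R}^{d \times n_i}$ and $n = \sum_{i=1}^k n_i$. Let $y$ be a new point in $S_i$. It holds that the solution $w^* = \Gamma^{-1}[z_1^*; \ldots; z_k^*] \in \mathbb{R}^n$ to problem (6) is block sparse, i.e., $z_i^* \neq 0$ and $z_j^* = 0$ for all $j \neq i$. Furthermore, $z_i^*$ is also optimal to the following problem:
$$\min_{z_i} \|X_i \mathrm{Diag}(z_i)\|_* \quad \text{s.t.} \quad y = X_i z_i. \tag{8}$$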
The block sparse property of CASS is the same as those of SSC, LRR and LSR when the data are from independent subspaces. This is also the motivation for using trace Lasso for subspace segmentation. For the noisy case, different from the previous methods, CASS may also lead to a solution which is close to block sparse, and it also has the grouping effect (see Section 2.3).
The noise free and independent subspaces assumptions may be violated in real applications. Problem (6) can be extended to handle noise of different types. For small magnitude and dense noise (e.g., Gaussian), a reasonable strategy is to use the $\ell_2$-norm to model the noise:
$$\min_{w} \frac{1}{2}\|y - Xw\|_2^2 + \lambda \|X\mathrm{Diag}(w)\|_*. \tag{9}$$
Here $\lambda > 0$ is a parameter balancing the effects of the two terms. For data with a small fraction of gross corruptions, the $\ell_1$-norm is a better choice:
$$\min_{w} \|y - Xw\|_1 + \lambda \|X\mathrm{Diag}(w)\|_*. \tag{10}$$
Namely, the choice of the norm depends on the noises. It is important for subspace segmentation but not the main focus of this paper.
In the case of data contaminated with noise, it is difficult to obtain a block sparse solution. Though the representation coefficient derived by SSC tends to be sparse, it is unable to group correlated data together. On the other hand, LRR and LSR lead to dense representations which lack the ability of subset selection. CASS, by using trace Lasso, takes the correlation of data into account and thereby trades off sparsity against the grouping effect. Thus it can be regarded as a method which balances SSC and LSR.
For SSC, LRR, LSR and CASS, each data point is expressed as a linear combination of all the data with a coefficient vector. These coefficient vectors can be arranged as a matrix measuring the similarities between data points. Figure 2 illustrates the coefficient matrices derived by these four methods on the Extended Yale B database (see Section 3.1 for the detailed experimental setting). We can see that the coefficient matrix derived by SSC is so sparse that it is even difficult to identify how many groups there are. This phenomenon confirms that SSC loses the data correlation information. Thus SSC does not perform well for data with strong correlation. In contrast, the coefficient matrices derived by LRR and LSR are very dense. They group many data points together, but do not perform subset selection. There are many nonzero connections between clusters, and some are very large. Thus the affinities of LRR and LSR may contain much erroneous information. Our proposed method CASS, by using trace Lasso, achieves a more accurate coefficient matrix, which is close to block diagonal, and it also groups data within each cluster. This comparison suggests that CASS reveals the true data structure more accurately for subspace segmentation.
It has been shown that the effectiveness of LSR by $\ell_2$-regularization comes from the grouping effect, i.e., the coefficients of a group of correlated data are approximately equal. In this work, we show that trace Lasso also has the grouping effect for correlated data.
Theorem 3. Given a data vector $y \in \mathbb{R}^d$, data points $X = [x_1, \ldots, x_n] \in \mathbb{R}^{d \times n}$ and parameter $\lambda > 0$. Let $w^* = [w_1^*, \ldots, w_n^*]^T \in \mathbb{R}^n$ be the optimal solution to problem (9). If $x_i \to x_j$, then $w_i^* \to w_j^*$.
The proof of Theorem 3 can be found in the supplementary materials.
If each column of $X$ is normalized, $x_i \to x_j$ implies that the sample correlation $r = x_i^T x_j \to 1$. Namely, $x_i$ and $x_j$ are highly correlated. Then these two data points will be grouped together by CASS due to the grouping effect. Illustrations of the grouping effect are shown in Figures 1 and 2. One can see that the connections within a cluster by CASS are dense, similar to LRR and LSR. The grouping effect of CASS may be weaker than that of LRR and LSR, since it also encourages sparsity between clusters, but it is sufficient for grouping correlated data together.
Performing CASS requires solving the convex optimization problem (9), which can be handled by off-the-shelf solvers. Previous work introduces an iteratively reweighted least squares method for solving problem (9), but the solution is not necessarily globally optimal, because a term is added to avoid the non-invertibility issue. Motivated by the optimization method used in low-rank minimization [1, 15], we adopt the Alternating Direction Method (ADM) to solve problem (9). We first convert it to the following equivalent problem:
$$\min_{J, w} \frac{1}{2}\|y - Xw\|_2^2 + \lambda \|J\|_* \quad \text{s.t.} \quad J = X\mathrm{Diag}(w). \tag{11}$$
This problem can be solved by the ADM method, which operates on the following augmented Lagrangian function:
$$L(J, w) = \frac{1}{2}\|y - Xw\|_2^2 + \lambda \|J\|_* + \mathrm{tr}\!\left(Y^T (J - X\mathrm{Diag}(w))\right) + \frac{\mu}{2}\|J - X\mathrm{Diag}(w)\|_F^2, \tag{12}$$
where $Y \in \mathbb{R}^{d \times n}$ is the Lagrange multiplier and $\mu > 0$ is the penalty parameter for violation of the linear constraint. We can see that $L(J, w)$ is separable, thus it can be decomposed into two subproblems and minimized with regard to $J$ and $w$, respectively. The whole procedure for solving problem (9) is outlined in Algorithm 1. It iteratively solves two subproblems which have closed form solutions. By the theory of ADM and the convexity of problem (9), Algorithm 1 converges globally.
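The alternating updates can be sketched in NumPy. This is a minimal illustration under stated assumptions, not a transcription of the paper's Algorithm 1: it targets $\min_w \frac{1}{2}\|y - Xw\|_2^2 + \lambda\|X\mathrm{Diag}(w)\|_*$ with the splitting $J = X\mathrm{Diag}(w)$, and the names `svt` and `cass_adm`, the fixed penalty `mu`, the zero initialization, and the stopping rule are illustrative choices. The $J$-step is singular value thresholding; the $w$-step solves the linear system $(X^T X + \mu\,\mathrm{Diag}(\mathrm{diag}(X^T X)))\,w = X^T y + \mathrm{diag}(X^T (Y + \mu J))$ obtained by setting the gradient of the augmented Lagrangian to zero.

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: proximal operator of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def cass_adm(X, y, lam, mu=1.0, max_iter=500, tol=1e-10):
    """ADM sketch for min_w 0.5*||y - Xw||^2 + lam*||X Diag(w)||_*.

    Assumes X has no zero column, so the w-step system is positive definite.
    """
    d, n = X.shape
    w = np.zeros(n)
    Y = np.zeros((d, n))            # Lagrange multiplier
    G = X.T @ X
    A = G + mu * np.diag(np.diag(G))  # system matrix of the w-step
    Xty = X.T @ y
    for _ in range(max_iter):
        # J-step: minimize lam*||J||_* + (mu/2)*||J - (X Diag(w) - Y/mu)||_F^2
        J = svt(X * w - Y / mu, lam / mu)
        # w-step: diag(X^T M) is the column-wise sum of X * M
        rhs = Xty + np.sum(X * (Y + mu * J), axis=0)
        w_new = np.linalg.solve(A, rhs)
        # Multiplier update on the constraint residual J - X Diag(w)
        R = J - X * w_new
        Y = Y + mu * R
        converged = max(np.abs(R).max(), np.abs(w_new - w).max()) < tol
        w = w_new
        if converged:
            break
    return w
```

When the columns of `X` are orthonormal, trace Lasso reduces to the $\ell_1$-norm, so the output should match soft thresholding of $X^T y$ at level $\lambda$, which gives a convenient sanity check for the sketch.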
To solve the subspace segmentation problem by trace Lasso, we first solve problem (9) for each data point $x_i$ with $\hat{X}_i = [x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n]$, which excludes $x_i$ itself, and obtain the corresponding coefficients. These coefficients are then arranged as a matrix $W$. The affinity matrix is defined as $Z = (|W| + |W^T|)/2$. Finally, we use Normalized Cuts (NCuts) to segment the data into $k$ groups. The whole procedure of the CASS algorithm is outlined in Algorithm 2.
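The last two steps can be illustrated as follows. The sketch symmetrizes a coefficient matrix into an affinity matrix and then performs a two-way split using the second eigenvector of the normalized Laplacian, which is the standard spectral relaxation of a normalized cut. It is an assumption-laden stand-in for a full NCuts implementation with $k$ groups; the function names are hypothetical, and every point is assumed to have positive degree.

```python
import numpy as np

def affinity_from_coefficients(W):
    """Symmetric affinity matrix (|W| + |W^T|) / 2 from a coefficient matrix."""
    return (np.abs(W) + np.abs(W.T)) / 2.0

def two_way_normalized_cut(Z):
    """Split the graph with affinity Z into two groups (labels 0 and 1)."""
    deg = Z.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    # Normalized Laplacian: I - D^{-1/2} Z D^{-1/2}
    L = np.eye(len(Z)) - d_inv_sqrt[:, None] * Z * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L)  # eigenvalues in ascending order
    # Second-smallest eigenvector, mapped back by D^{-1/2}
    fiedler = d_inv_sqrt * vecs[:, 1]
    return (fiedler > 0).astype(int)
```

For more than two subspaces, the usual extension embeds each point with the first $k$ eigenvectors and clusters the rows, for example with k-means.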
We evaluate CASS for subspace segmentation on three databases: the Hopkins 155 motion database, the Extended Yale B database, and the MNIST database (http://yann.lecun.com/exdb/mnist/) of handwritten digits. CASS is compared with SSC, LRR and LSR, which are representative state-of-the-art methods for subspace segmentation. The affinity matrices derived by all algorithms are also evaluated on a semi-supervised learning task on the Extended Yale B database. For fair comparison with previous works, we follow their experimental settings. The parameters for each method are tuned to achieve the best performance. The segmentation accuracy/error is used to evaluate the subspace segmentation performance. The accuracy is calculated as the best matching rate between the predicted labels and the ground truth.
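Because cluster labels are only defined up to a permutation, the accuracy has to be maximized over all matchings between predicted and true labels. A minimal sketch of this metric follows; the function name `clustering_accuracy` is hypothetical, and the brute-force search over permutations is an illustrative choice that is only practical for a small number of clusters.

```python
from itertools import permutations

import numpy as np

def clustering_accuracy(true_labels, pred_labels):
    """Best matching rate between predicted cluster ids and true labels."""
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)
    true_ids = list(np.unique(true_labels))
    pred_ids = list(np.unique(pred_labels))
    # Pad with None so that surplus predicted clusters can stay unmatched.
    candidates = true_ids + [None] * max(0, len(pred_ids) - len(true_ids))
    best = 0
    for assignment in permutations(candidates, len(pred_ids)):
        hits = sum(
            np.sum((pred_labels == p) & (true_labels == t))
            for p, t in zip(pred_ids, assignment)
            if t is not None
        )
        best = max(best, int(hits))
    return best / len(true_labels)
```

The segmentation error is one minus this value. For many clusters, an assignment solver such as `scipy.optimize.linear_sum_assignment` applied to the confusion matrix gives the same result without enumerating permutations.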
The Hopkins 155 motion database contains 156 sequences, each of which has 39 to 550 data points drawn from two or three motions (a motion corresponds to a subspace). Each sequence is a separate data set, so there are 156 subspace segmentation problems in total. We first use PCA to project the data into a 12-dimensional subspace. All the algorithms are run on each sequence, and the maximum, mean and standard deviation of the error rates are reported.
Table 1: Motion segmentation errors on the Hopkins 155 database, in two groups: comparison under the same setting, and comparison to state-of-the-art methods.
Extended Yale B is challenging for subspace segmentation due to large noise. It consists of 2,414 frontal face images of 38 subjects under various lighting, poses and illumination conditions. Each subject has 64 faces. We construct three subspace segmentation tasks based on the face images of the first 5, 8 and 10 subjects of this database. The data are first projected into a $5 \times 6$, $8 \times 6$, and $10 \times 6$-dimensional subspace by PCA, respectively. Then the algorithms are applied to these three tasks and the accuracies are reported.
To further evaluate the effectiveness of CASS for other learning problems, we also use the derived affinity matrix for semi-supervised learning. The Markov random walks algorithm is employed in this experiment. It performs a $t$-step Markov random walk on the graph or affinity matrix. The influence of one example on another is proportional to the affinity between them. We test on the 10-subject face classification problem. For each subject, 4, 8, 16 and 32 face images are randomly selected to form the training data set, and the remaining images are used for testing. Our goal is to predict the labels of the test data by Markov random walks on the affinity matrices learnt by $k$NN, SSC, LRR, LSR and CASS. The number of neighbors $k$ for $k$NN is selected experimentally. The experiment is repeated 20 times, and the accuracy and standard deviation are reported for evaluation.
The MNIST database of handwritten digits is also widely used in subspace learning and clustering. It has 10 subjects, corresponding to the 10 handwritten digits 0 to 9. For this experiment we select a subset with a size similar to that of the above face clustering problem, which consists of the first 50 samples of each subject. The accuracies of SSC, LRR, LSR and CASS are reported.
Table 1 tabulates the motion segmentation errors of the four methods on the Hopkins 155 database. It shows that CASS gets a misclassification error of 2.42% over all 156 sequences, while the best previously reported result is 2.50% by LSR. The improvement of CASS on this database is limited for several reasons. First, previous methods already perform very well on data with only slight corruptions, so the room for improvement is limited. Second, the reported error is the mean of 156 segmentation errors, most of which are zero. So even if there are large improvements on some challenging sequences, the improvement of the mean error remains limited. Third, the correlation of data is strong, as the dimension of each affine subspace is no more than three, thus CASS tends to be close to LSR in this case. Due to the dimensionality reduction by PCA and the sufficient data sampling in each motion, CASS may behave like LSR with a strong grouping effect. Furthermore, in order to compare with the state-of-the-art methods, we follow their post-processing, which may not be optimal for CASS, and the error of CASS is reduced to 1.47%. The best performance, however, is 0.85% by Latent LRR, which is much better than the other methods. That is because Latent LRR further employs unobserved hidden data as the dictionary and has complex pre-processing and post-processing with several parameters. The idea of incorporating unobserved hidden data may also be considered in CASS. This will be our future work.
Table 2 shows the clustering results on the Extended Yale B database. We can see that CASS outperforms SSC, LRR and LSR on all three clustering tasks. In particular, CASS gets accuracies of 94.03%, 91.41%, and 81.88% for face clustering with 5, 8, and 10 subjects, respectively, which outperforms the state-of-the-art method LSR. For the 5-subject face clustering problem, all four methods perform well, and no big improvement is made by CASS. But for the 8-subject and 10-subject face clustering problems, CASS achieves significant improvements. For these two clustering tasks, both LRR and LSR perform much better than SSC, which can be attributed to the strong grouping effect of the two methods. However, both methods lack the ability of subset selection, and therefore may group together some data points from different clusters. CASS not only preserves the grouping effect within each cluster but also enhances the sparsity between clusters. The intuitive comparison of these four methods can be found in Figure 2. It confirms that CASS usually leads to an approximately block diagonal affinity matrix, which results in a more accurate segmentation result. This phenomenon is also consistent with the analysis in Theorems 2 and 3.
For semi-supervised learning, the comparison of the classification accuracies for different numbers of training data is shown in Figure 3. CASS achieves the best performance in all of these settings. Notice that the accuracies are much higher than the clustering accuracies in Table 2. This is mainly due to the mechanism of semi-supervised learning, which makes use of both labeled and unlabeled data for training. Accurate graph construction is the key step for semi-supervised learning. This example shows that the affinity matrix by trace Lasso is also effective for semi-supervised learning.
Table 3 shows the clustering accuracies by SSC, LRR, LSR, and CASS on the MNIST database. The comparison of the affinity matrices derived by these four methods is illustrated in Figure 4. We can see that CASS obtains an affinity matrix which is close to block diagonal while preserving the grouping effect. None of these four methods performs perfectly on this database. Nonetheless, our proposed CASS method achieves the best accuracy. The main reason for the imperfect results may lie in the fact that the handwritten digit data do not fit the subspace structure well. This is also the main challenge for real-world applications of subspace segmentation.
In this work, we propose the Correlation Adaptive Subspace Segmentation (CASS) method by using trace Lasso. Compared with the existing SSC, LRR, and LSR, CASS simultaneously encourages the grouping effect and sparsity. The adaptive advantage of CASS comes from the mechanism of trace Lasso, which balances between the $\ell_1$-norm and $\ell_2$-norm. In theory, we show that CASS is able to reveal the true segmentation result when the subspaces are independent. The grouping effect of trace Lasso is first established in this work. Finally, the experimental results on the Hopkins 155, Extended Yale B, and MNIST databases show the effectiveness of CASS. Similar improvement can also be observed in the semi-supervised learning setting on the Extended Yale B database. However, there still remain many problems for future exploration. First, the data themselves, which may be noisy, are used as the dictionary for linear construction. It may be better to learn a compact and discriminative dictionary for trace Lasso. Second, trace Lasso may have many other applications, e.g., classification, dimensionality reduction, and semi-supervised learning. Third, more scalable optimization algorithms should be developed for large scale subspace segmentation.
This research is supported by the Singapore National Research Foundation under its International Research Centre @Singapore Funding Initiative and administered by the IDM Programme Office. Z. Lin is supported by National Natural Science Foundation of China (Grant nos. 61272341, 61231002, and 61121002).
From few to many: Illumination cone models for face recognition under variable lighting and pose. TPAMI, 23(6):643–660, 2001.
Ridge regression: biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.
Latent low-rank representation for subspace segmentation and feature extraction. In ICCV, pages 1615–1622, 2011.
Generalized principal component analysis (GPCA). TPAMI, 27(12):1945–1959, 2005.