Matrix rank minimization fazel2002matrix is ubiquitous in machine learning, computer vision, control, signal processing and system identification. For instance, low-rank representation based subspace clustering liu2010robust ; liu2013robust ; vidal2014low and matrix completion candes2009exact ; hu2013fast methods have achieved great success recently. Subspace clustering vidal2010tutorial is one of the fundamental topics with numerous applications, e.g., image representation eldar2009robust ; yang2008unsupervised , face clustering elhamifar2013sparse ; liu2013robust , and motion segmentation rao2010motion ; lauer2009spectral . It is assumed that high-dimensional data is more likely to lie in a union of low-dimensional subspaces than in one individual subspace. For example, different subspaces are needed to describe the trajectories of different moving objects in a video sequence. Subspace clustering is an intrinsically difficult problem, since we need to simultaneously cluster all data points into multiple groups and find a low-dimensional subspace fitting each group of points.
Subspace clustering has been an active research topic over the past decades. Four main categories of methods have been proposed elhamifar2013sparse : iterative, algebraic, statistical, and spectral clustering-based methods. The first three kinds of approaches are sensitive to initialization, noise and outliers; in addition, they are difficult to optimize elhamifar2013sparse . Spectral clustering-based methods have achieved promising performance, where the key is to learn a good affinity matrix of data points. For instance, the algorithms of local subspace affinity (LSA) yan2006general , locally linear manifold clustering (LLMC) goh2007segmenting , and spectral local best-fit flats (SLBF) zhang2012hybrid use local information around each point to construct the affinity matrix, while the spectral curvature clustering (SCC) chen2009spectral method preserves the global structures of the whole data set in deriving the affinity matrix. Subsequently, K-means jing2007entropy or Normalized Cuts (NCut) shi2000normalized ; von2007tutorial are applied to the affinity matrix to obtain clustering results.
Recently, some spectral clustering-based methods, such as sparse subspace clustering (SSC) elhamifar2013sparse and low-rank representation (LRR) liu2013robust , have been proposed and obtain state-of-the-art results in subspace clustering. SSC represents each data point as a sparse linear combination of the other points and solves an $\ell_1$-norm regularized minimization problem for sparsity. SSC shows promising results if the subspaces are either independent or disjoint elhamifar2010clustering .
The basic idea of LRR is to learn a low-rank representation of data by capturing the global Euclidean structure of the whole data. In this scheme, each data point is represented as a linear combination of the examples in the data matrix itself, and a convex nuclear norm minimization is used as a surrogate of the rank function to obtain the desired low-rank representation. Though its optimization is well-studied and has a global optimum, its performance may be far from optimal in real applications because the nuclear norm might not be a good approximation to the rank function. Compared to the rank function, to which all nonzero singular values contribute equally, the nuclear norm treats those values differently by simply adding them together. As a result, the nuclear norm may be dominated by a few very large singular values and deviate significantly from the true value of the rank. Several papers have considered this problem of the nuclear norm and designed methods to alleviate it by either thresholding or removing some of the singular values; for instance, singular value thresholding cai2010singular and truncated nuclear norm hu2013fast both considerably enhance the performance of matrix completion.
In this paper, we propose to use a log-determinant (LogDet) function for rank approximation and study its minimization in subspace clustering. Different from the nuclear norm-based approaches, which minimize the sum of all singular values, our approach aims to minimize the rank by making the contribution of a large singular value much closer to one and the contribution of a small singular value close to zero. In this way, we can get a closer and more robust approximation to the rank function than the nuclear norm. Since the LogDet function is non-convex, we apply the method of augmented Lagrange multipliers (ALM) to solve the associated optimization for potentially large-scale applications, in which the subproblem for minimizing the LogDet function in each iteration has a closed-form solution. To demonstrate the effectiveness of our LogDet minimization method, we apply it to subspace clustering. By employing a rather simple formulation based on the LogDet function, we obtain a low-rank representation for subspace clustering. Subsequently, we exploit the angular information of principal directions of such a representation to further enhance the separation ability of the affinity matrix. In summary, the main contributions of this work include:
- A more accurate and robust rank approximation is used to obtain the low-rank representation, which is able to capture the global structure of the dataset.
- An iterative optimization algorithm is designed for minimizing this rank approximation-based objective function. Theoretical analysis shows that our algorithm converges to a stationary point. Specifically, the proposed optimization method is applied to subspace clustering.
- Angular information of principal directions of the low-rank representation is employed to further exploit the intrinsic local geometrical structure relevant to the membership of data points.
- Extensive experiments demonstrate the effectiveness of the proposed LogDet minimization method for rank approximation. In particular, when used for subspace clustering, our simple formulation shows favorable performance compared to other state-of-the-art methods, although we do not explicitly account for outliers in our model. This demonstrates the robustness of our approach.
The remainder of the paper is organized as follows: Section 2 provides a brief review of LRR and SSC. In Section 3, we present the proposed approximation and design an efficient optimization scheme. We give convergence analysis in Section 4. Experimental results are shown in Section 5. Finally, conclusions are drawn in Section 6.
2 Review of LRR and SSC
In this section, we give a brief review of SSC and LRR.
Let $X = [x_1, \dots, x_n] \in \mathbb{R}^{d \times n}$ be a set of $d$-dimensional data points drawn from an unknown union of $k$ linear subspaces $\{S_i\}_{i=1}^{k}$. The task of subspace clustering is to segment the data points into their respective subspaces.
LRR tries to seek the lowest-rank representation among the many possible linear combinations of the bases in a given dictionary, which typically is the data matrix itself. The problem can be formulated as:
$$\min_{Z} \ \operatorname{rank}(Z) \quad \text{s.t.} \quad X = XZ, \tag{1}$$
where $Z = [z_1, \dots, z_n] \in \mathbb{R}^{n \times n}$ is the coefficient matrix with each $z_i$ being the representation of $x_i$. The above problem is NP-hard due to the combinatorial nature of the rank function.
The tightest convex relaxation of the rank function recht2010guaranteed is the nuclear norm. For a matrix $Z$, its nuclear norm is defined as $\|Z\|_* = \sum_i \sigma_i$, where $\sigma_i$ means the $i$-th singular value of $Z$. Using this relaxation, LRR solves the following problem:
$$\min_{Z} \ \|Z\|_* \quad \text{s.t.} \quad X = XZ. \tag{2}$$
After obtaining $Z$, the affinity matrix $W$ is defined as
$$W = |Z| + |Z^T|. \tag{3}$$
Then the spectral clustering algorithm Normalized Cuts shi2000normalized is used to produce the final segmentation.
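The LRR pipeline reviewed above can be illustrated with a small sketch. It relies on a known result from the LRR literature liu2013robust : for noise-free data, the nuclear norm problem with the data matrix as the dictionary is minimized by the shape interaction matrix $VV^T$, where $V$ holds the right singular vectors of the skinny SVD of the data. The variable names (`X`, `Z`, `W`), the toy dimensions, and the symmetrized affinity $|Z| + |Z^T|$ are assumptions chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two independent 2-dimensional subspaces in R^10, 6 points each.
bases = [np.linalg.qr(rng.standard_normal((10, 2)))[0] for _ in range(2)]
X = np.hstack([B @ rng.standard_normal((2, 6)) for B in bases])  # 10 x 12

# Skinny SVD of X, keeping only the numerically nonzero singular values.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
r = int(np.sum(s > 1e-10 * s[0]))
V = Vt[:r].T

# Shape interaction matrix: minimizer of ||Z||_* s.t. X = XZ in the noise-free case.
Z = V @ V.T

nuclear_norm = np.linalg.svd(Z, compute_uv=False).sum()  # sum of singular values
W = np.abs(Z) + np.abs(Z.T)                              # symmetric affinity matrix

print("rank of X:", r)
print("self-expression residual:", np.linalg.norm(X - X @ Z))
print("nuclear norm of Z:", nuclear_norm)
print("largest cross-subspace affinity:", W[:6, 6:].max())
```

Because the two subspaces are independent, the affinity between points from different subspaces vanishes up to rounding, which is what makes the subsequent spectral clustering step easy in this idealized setting.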
SSC aims to find a sparse representation of by solving the following convex optimization problem:
where , is a sparse matrix containing the gross error, and , is a matrix of fitting residuals. After obtaining , subsequent procedures are similar to LRR.
3 LogDet Rank Approximation and Its Minimization Algorithm
A function is absolutely symmetric if is invariant under arbitrary permutations and sign changes of the elements of . Based on this function , we have the following theorem lewis1995convex .
Function is unitarily invariant if , where whose singular value decomposition is
, are singular values of , and . And the gradient of at is
whose singular value decomposition is
In this work, we utilize the unitarily invariant LogDet function to achieve a closer, though non-convex, rank relaxation than the nuclear norm. We apply the ALM method to the minimization problem associated with the LogDet rank approximation. To explain our method, we specifically consider using LogDet as a rank surrogate in subspace clustering. We first obtain a low-rank representation of high-dimensional data based on the LogDet optimization. Then we construct an affinity graph matrix for spectral clustering by using the angular information of principal directions of the low-rank representation.
3.1 LogDet rank minimization
We use $F(Z) = \operatorname{logdet}(I + Z^T Z) = \sum_i \log(1 + \sigma_i^2)$ as a surrogate of the rank function of $Z$, where $\sigma_i$ is the $i$-th singular value of $Z$. It is obvious that $\operatorname{logdet}(I + Z^T Z) \le \|Z\|_*$, because it can be easily verified that for any $\sigma_i \ge 0$ we always have $\log(1 + \sigma_i^2) \le \sigma_i$; in particular, if there are large nonzero singular values, the LogDet function will be much smaller than the nuclear norm, since $\log(1 + \sigma_i^2) \ll \sigma_i$ for a large $\sigma_i$. It is also noted that the contribution of small nonzero singular values to the LogDet function is significantly reduced compared to the nuclear norm. Because small nonzero singular values are often regarded as coming from noise in the data, the LogDet function suppresses the effect of noise more than the nuclear norm does.
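The comparison between the rank, the nuclear norm, and the LogDet surrogate can be checked numerically. The following is a minimal sketch, assuming the surrogate has the form $\operatorname{logdet}(I + Z^T Z) = \sum_i \log(1 + \sigma_i^2)$; the helper name `rank_surrogates` and the toy matrices are hypothetical and serve only as an illustration.

```python
import numpy as np

def rank_surrogates(Z, tol=1e-10):
    """Return (numerical rank, nuclear norm, LogDet value) from the singular values of Z."""
    s = np.linalg.svd(Z, compute_uv=False)
    rank = int(np.sum(s > tol * max(s.max(), 1.0)))
    nuclear = float(s.sum())
    logdet = float(np.log1p(s ** 2).sum())  # equals log det(I + Z^T Z)
    return rank, nuclear, logdet

rng = np.random.default_rng(0)
L = rng.standard_normal((30, 3)) @ rng.standard_normal((3, 30))  # exactly rank 3
noisy = 10.0 * L + 0.01 * rng.standard_normal((30, 30))          # large signal plus small noise

for name, M in [("clean", L), ("scaled + noise", noisy)]:
    r, nuc, ld = rank_surrogates(M)
    print(f"{name:>15}: rank={r}, nuclear={nuc:.2f}, logdet={ld:.2f}")

# Cross-check the singular-value formula against a direct determinant evaluation.
sign, direct = np.linalg.slogdet(np.eye(30) + L.T @ L)
print("direct logdet:", direct)
```

Scaling the matrix by 10 multiplies the nuclear norm by 10, whereas the LogDet value grows only logarithmically; the small singular values introduced by the noise contribute roughly $\sigma_i^2$ to the LogDet value instead of $\sigma_i$.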
It is worthwhile to note that a similar function
was proposed in fazel2003log to approximate rank and iterative linearization was used to find a local minimum. However, is a very small constant (e.g., ), which leads to biased approximation for small singular values.
This LogDet function is differentiable with respect to the singular values by Theorem 1, and even though it is non-convex, its minimization is rather simple by using our optimization method. To explain its minimization, we consider its specific application to subspace clustering. By employing the above LogDet function, we simply formulate the subspace clustering into the following unconstrained nonconvex minimization problem:
is the identity matrix. The first term of (6) is to minimize the rank of , while the second is a relaxation of , which is referred to as the self-expressiveness of with representing the similarity between data points. Because the LogDet function is not convex in , we resort to ALM technique to solve (6), by re-writing (6) as follows:
We turn to minimizing the following augmented Lagrangian function:
where is a penalty parameter and is the Lagrangian dual variable. With a sufficiently large , the objective function converges to the objective function in (6). This can be solved by updating , , and alternately while fixing the other variables. Specifically, assume that at the th iteration we have obtained , and ; then for the th iteration, the optimization problem (8) can be updated via the following four steps.
Input: data matrix , parameters .
Initialize: , .
Until stopping criterion is satisfied.
Step 1: Computing . Fix and and then calculate :
which has a closed-form solution,
Step 2: Computing . Fix and , and minimize as follows:
This can be converted to a scalar minimization problem due to the following theorem. We note that it can also be written as a special case of the problem studied in a recent work lu2015generalized .
For unitarily invariant function , assuming SVD of is , , the optimal solution to the following problem
is , with obtained by solving scalar minimization problems
Let be the SVD of , then . Denoting which has exactly the same singular values as , i.e., , we have
In the above, (15) holds because the Frobenius norm is unitarily invariant; (16) holds because is unitarily invariant; (17) is true by von Neumann’s inequality; and (20) holds as . The inequality between (15) and (19) can also be obtained by the Hoffman-Wielandt inequality. Therefore, (20) is a lower bound of (14), where is obtained by minimizing (20). Note that the equality in (18) is attained if . Because , the SVD of is , which is the minimizer of problem (12). Hence the proof is completed.
where SVD of is . The above equation is cubic and gives three roots. In addition, we need to enforce the nonnegativity of . It is easily seen that there exists at least one nonnegative root. And there is a unique minimizer if . Finally, we obtain the update of variable with .
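The scalar step can be sketched in code. This is a hedged illustration, assuming the per-singular-value problem has the form $\min_{\sigma \ge 0} \log(1 + \sigma^2) + \frac{\beta}{2}(\sigma - a)^2$, where $a$ is a singular value of the matrix being approximated and $\beta$ is the penalty weight (both symbols are assumptions made here). Setting the derivative to zero and multiplying by $1 + \sigma^2$ gives the cubic $\beta\sigma^3 - \beta a \sigma^2 + (\beta + 2)\sigma - \beta a = 0$. The function names `solve_scalar` and `logdet_prox` are hypothetical.

```python
import numpy as np

def scalar_objective(sigma, a, beta):
    return np.log1p(sigma ** 2) + 0.5 * beta * (sigma - a) ** 2

def solve_scalar(a, beta):
    """Minimize log(1 + s^2) + (beta/2)(s - a)^2 over s >= 0 using the roots of the cubic."""
    roots = np.roots([beta, -beta * a, beta + 2.0, -beta * a])
    # Keep (numerically) real roots, enforce nonnegativity, and always allow s = 0.
    real = roots[np.abs(roots.imag) < 1e-8 * (1.0 + np.abs(roots))].real
    candidates = np.concatenate(([0.0], real[real >= 0.0]))
    return float(candidates[np.argmin(scalar_objective(candidates, a, beta))])

def logdet_prox(A, beta):
    """Sketch of argmin_J logdet(I + J^T J) + (beta/2) ||J - A||_F^2 via the SVD of A."""
    U, a, Vt = np.linalg.svd(A, full_matrices=False)
    sigma = np.array([solve_scalar(ai, beta) for ai in a])
    return (U * sigma) @ Vt, sigma, a

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 8))
J, sigma, a = logdet_prox(A, beta=1.0)
print("singular values of A:", np.round(a, 3))
print("shrunk singular values:", np.round(sigma, 3))
```

Comparing the objective over all nonnegative real roots (plus zero) selects the global scalar minimizer even when the cubic has three real roots. Under the assumed form, the scalar objective is strictly convex once $\beta > 1/4$, since the second derivative of $\log(1 + \sigma^2)$ is bounded below by $-1/4$, so the minimizer is then unique.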
Step 3: Computing . Fix and , and then we calculate as follows:
Step 4: Updating as . The complete procedure is summarized in Algorithm 1.
Problem (6) is nonconvex, so it is difficult to give a rigorous mathematical argument for convergence to a (local) optimum. Instead, we will provide a theoretical proof that our algorithm converges to an accumulation point and that this accumulation point is a stationary point. Our empirical experiments confirm the convergence of the proposed method on the benchmark datasets. The experimental results are promising, even though the solution obtained by the proposed optimization method may be a local optimum.
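The four steps above can be put together in a compact sketch. Since the exact symbols are not reproduced here, the code makes the following assumptions: the objective is $\operatorname{logdet}(I + J^T J) + \lambda \|X - XZ\|_F^2$ subject to $Z = J$, the multiplier is `Y`, the penalty is `mu`, and the penalty is increased by a factor `gamma` in each iteration. The names `logdet_alm`, `lam`, `gamma` and all default values are hypothetical; this is an illustration of the alternating scheme, not the authors' implementation.

```python
import numpy as np

def solve_scalar(a, beta):
    """argmin over s >= 0 of log(1 + s^2) + (beta/2)(s - a)^2, via the cubic's roots."""
    roots = np.roots([beta, -beta * a, beta + 2.0, -beta * a])
    real = roots[np.abs(roots.imag) < 1e-8 * (1.0 + np.abs(roots))].real
    cand = np.concatenate(([0.0], real[real >= 0.0]))
    vals = np.log1p(cand ** 2) + 0.5 * beta * (cand - a) ** 2
    return float(cand[np.argmin(vals)])

def logdet_alm(X, lam=1.0, mu=1.0, gamma=1.2, max_iter=60, tol=1e-6):
    """ALM sketch for min logdet(I + J^T J) + lam * ||X - XZ||_F^2  s.t.  Z = J."""
    n = X.shape[1]
    G = X.T @ X
    I = np.eye(n)
    Z, J, Y = np.zeros((n, n)), np.zeros((n, n)), np.zeros((n, n))
    for _ in range(max_iter):
        # Step 1: quadratic subproblem in Z, solved in closed form.
        Z = np.linalg.solve(2.0 * lam * G + mu * I, 2.0 * lam * G + mu * J + Y)
        # Step 2: LogDet subproblem in J, reduced to scalar problems on singular values.
        U, a, Vt = np.linalg.svd(Z - Y / mu, full_matrices=False)
        sigma = np.array([solve_scalar(ai, mu) for ai in a])
        J = (U * sigma) @ Vt
        # Step 3: multiplier update on the constraint residual J - Z.
        R = J - Z
        Y = Y + mu * R
        # Step 4: increase the penalty parameter.
        mu = gamma * mu
        if np.linalg.norm(R) <= tol * max(1.0, np.linalg.norm(Z)):
            break
    return Z, J, Y

rng = np.random.default_rng(0)
bases = [np.linalg.qr(rng.standard_normal((10, 2)))[0] for _ in range(2)]
X = np.hstack([B @ rng.standard_normal((2, 20)) for B in bases])  # 10 x 40
Z, J, Y = logdet_alm(X)
print("constraint residual ||Z - J||_F:", np.linalg.norm(Z - J))
print("relative fit ||X - XZ||_F / ||X||_F:", np.linalg.norm(X - X @ Z) / np.linalg.norm(X))
```

Starting with `mu = 1.0` keeps the scalar subproblems strictly convex under the assumed form of the objective. After each multiplier update, `Y` equals the negative gradient of the LogDet term at `J`, whose singular values $2\sigma/(1 + \sigma^2)$ never exceed one, so the multiplier stays bounded in this sketch.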
3.2 Affinity graph matrix construction
Now we will construct an affinity matrix for subspace clustering. Optimal may not accurately describe the relationship between samples if the data is severely corrupted. Therefore, in general, it is not a good idea to construct by directly using . In the spirit of liu2013robust ; lauer2009spectral , we construct an affinity matrix in the following way.
Assuming the skinny SVD of is , we define and . Based on the weighted eigenvector matrix or , we construct an affinity matrix as follows:
where () and () represent the -th and -th columns (rows) of (), respectively, and parameter tunes the sharpness of the affinity between two points, with helping separate the clusters. When increases, the between-cluster separability can be increased, but the intra-cluster cohesiveness is degraded. Thus, a suitable needs to balance within-cluster cohesiveness and between-cluster separability. In this paper, we set to be 2. Then we apply the same post-processing as LRR. (Footnote 1: For LRR, we use equation (12) in liu2013robust rather than (3) to construct . We also confirmed with an author of liu2013robust that the power 2 in equation (12) is a typo; it should be 4.) As or spans the principal directions of , we employ the angle information, or powered correlation coefficients of the examples, because their lengths may be affected significantly by the noise or outliers in the data.
Using the resultant affinity matrix, we can now apply a spectral clustering algorithm to perform the segmentation. In this paper, we simply perform NCut shi2000normalized on . The proposed subspace clustering procedure is summarized in Algorithm 2.
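The angle-based affinity construction can be sketched as follows. The exact formula is not reproduced in this text, so the code assumes one plausible reading, in the spirit of the LRR post-processing: weight the left singular vectors by the square roots of the singular values, normalize each row to unit length so that only directions matter, and raise the cosine between rows to the power $2\alpha$ with $\alpha = 2$. The function name `angular_affinity` and the toy representation `Z` are hypothetical.

```python
import numpy as np

def angular_affinity(Z, alpha=2, tol=1e-8):
    """Affinity from powered cosines between rows of the weighted singular-vector matrix."""
    U, s, _ = np.linalg.svd(Z, full_matrices=False)
    r = int(np.sum(s > tol * s[0]))              # skinny SVD: keep nonzero singular values
    M = U[:, :r] * np.sqrt(s[:r])                # weighted left singular vectors
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    M_unit = M / np.maximum(norms, 1e-12)        # discard lengths, keep directions
    C = np.clip(M_unit @ M_unit.T, -1.0, 1.0)    # cosines of pairwise angles
    return np.abs(C) ** (2 * alpha)

# Toy low-rank representation: shape interaction matrix of two independent subspaces.
rng = np.random.default_rng(0)
bases = [np.linalg.qr(rng.standard_normal((10, 2)))[0] for _ in range(2)]
X = np.hstack([B @ rng.standard_normal((2, 6)) for B in bases])
_, s, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt[: int(np.sum(s > 1e-10 * s[0]))].T
Z = V @ V.T

W = angular_affinity(Z)
print("smallest within-cluster affinity:", W[:6, :6].min())
print("largest between-cluster affinity:", W[:6, 6:].max())
```

Raising the cosines to a higher power pushes moderate affinities toward zero, which sharpens the separation between clusters at the cost of weakening links within a cluster, matching the trade-off described above.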
4 Convergence Analysis
In this section, we give the convergence analysis for Algorithm 1. We will show that our optimization algorithm attains at least one stationary point of problem (7). We first rewrite the objective function of (7) as
The sequence is bounded.
To minimize at step , the optimal needs to satisfy the first-order optimality condition
Note that the updating rule for is
thus . We know from (5) that
and , so is bounded. Then it is seen that , i.e., is bounded.
and are bounded if and .
Since the second term in above inequality is finite,
is bounded. We can rewrite
Because and are bounded and each term on the right hand side of the equation (34) is nonnegative, each term will be bounded. being bounded implies that all singular values of are bounded and is bounded. Since , clearly we have bounded . Therefore and are bounded.
has at least one accumulation point , and is a stationary point of optimization problem (7) with the assumption that .
is a bounded sequence, hence by the Bolzano-Weierstrass theorem, there must be at least one accumulation point, which is denoted by . Without loss of generality, we assume that itself converges to . Next, we prove that this accumulation point is a stationary point of problem (26). As , we have . Because and is bounded, we get , i.e., . By first-order optimality condition and the definition of , we have . Let , we get . At the th step, satisfies , i.e., . With the assumption that nie2014new , we get .
Now we can see that satisfies the KKT conditions of and thus is a stationary point of (7).
| Method | Face clustering: Scenario 1 | Face clustering: Scenario 2 | Motion segmentation |
5 Experiments and Analysis
In this section, we conduct experiments on the subspace clustering task with both synthetic and real data.
5.1 Experiments with Synthetic Data
We construct 5 independent subspaces whose bases are generated by a random rotation matrix through , , where is a random orthogonal matrix liu2010robust . We sample 20 data vectors from each subspace by , , where is an i.i.d. matrix. Some data vectors are randomly chosen to be corrupted; for example, for a data vector , it is corrupted by adding Gaussian noise with zero mean and variance. We then use SCLD to segment the data into 5 clusters. The subspace clustering error rate, defined as , is used to assess the performance. We report the clustering error rate (averaged over 30 trials) with different corruption levels in Figure 1. Without any corruption, SCLD can cluster all data points correctly.
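The synthetic protocol can be sketched as follows. The ambient dimension (100), the subspace dimension (4), the noise scale, and all function names are assumptions made for illustration, since these specifics are not reproduced in the text above. The error rate is taken to be the fraction of misclassified points under the best one-to-one matching between predicted and true cluster labels, which is the usual definition.

```python
import itertools
import numpy as np

def make_subspaces(k=5, ambient=100, dim=4, per=20, seed=0):
    """Sample `per` points from each of k subspaces with bases U, TU, T^2 U, ..."""
    rng = np.random.default_rng(seed)
    T, _ = np.linalg.qr(rng.standard_normal((ambient, ambient)))  # random orthogonal matrix
    U, _ = np.linalg.qr(rng.standard_normal((ambient, dim)))      # basis of the first subspace
    blocks, labels = [], []
    for i in range(k):
        blocks.append(U @ rng.standard_normal((dim, per)))        # i.i.d. Gaussian coefficients
        labels += [i] * per
        U = T @ U                                                 # rotate to get the next basis
    return np.hstack(blocks), np.array(labels)

def corrupt(X, fraction, scale, rng):
    """Add zero-mean Gaussian noise to a randomly chosen subset of the columns of X."""
    Xc = X.copy()
    idx = rng.choice(X.shape[1], size=int(round(fraction * X.shape[1])), replace=False)
    for j in idx:
        std = scale * np.linalg.norm(X[:, j])                     # hypothetical noise level
        Xc[:, j] += std * rng.standard_normal(X.shape[0])
    return Xc, idx

def clustering_error(true, pred, k):
    """Fraction of misclassified points under the best relabeling of the k clusters."""
    best = 0
    for perm in itertools.permutations(range(k)):
        mapped = np.array(perm)[pred]
        best = max(best, int(np.sum(mapped == true)))
    return 1.0 - best / len(true)

X, labels = make_subspaces()
Xc, corrupted = corrupt(X, fraction=0.2, scale=0.3, rng=np.random.default_rng(1))
print("data shape:", X.shape, "corrupted columns:", len(corrupted))
print("error of a relabeled perfect clustering:", clustering_error(labels, (labels + 1) % 5, 5))
```

Enumerating all label permutations is affordable for 5 clusters (120 permutations); for many clusters an assignment solver would be the more practical choice.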
5.2 Experiments with Real Data
In this section, we evaluate the effectiveness and robustness of SCLD on two benchmark datasets, Extended Yale B (EYaleB) georghiades2001few ; lee2005acquiring and Hopkins 155 tron2007benchmark . We compare the proposed method SCLD with several state-of-the-art subspace clustering algorithms: LRR liu2013robust , SSC elhamifar2013sparse , LRSC favaro2011closed ; vidal2014low , and local subspace affinity (LSA) yan2006general . For these methods, we use the parameters given by the respective authors. For our method, we also tune to obtain the best performance. Generally, should be relatively large if the data are slightly corrupted. and have little influence on the clustering results, so we just set to ensure the uniqueness of the minimizer and use empirically. Other parameters are shown in Table 1. The experiments are conducted on Windows 7 with 16 GB memory and an Intel Core i5-2300 CPU.
5.2.1 Face Clustering
Face clustering aims to cluster a set of face images from multiple individuals in the hope of revealing the identity of these individuals. The EYaleB database includes 2414 frontal images of 38 individuals. For each individual, the images are taken under 64 lighting conditions and can be described by a low-dimensional subspace basri2003lambertian . The images are resized to 48 × 42 pixels and each vectorized image is regarded as a data point. Fig. 2 shows some example images from the database.
5.2.1.1 First Experiment Scenario
|error rate (%)||20.94||35||59.52||35.78||3.59|
As done in liu2010robust , we test the algorithms on the first 10 classes of EYaleB, which consists of 640 frontal face images. More than half of the images are corrupted by shadow and noise. We use this heavily corrupted data to test the effectiveness of our method. As shown in Table 2, SCLD significantly enhances the performance. Specifically, it improves the clustering accuracy by at least when compared to the other algorithms. Since the only difference between our approach and LRR is rank approximation, this improvement is due to LogDet.
5.2.1.2 Second Experiment Scenario
For a fair comparison, we have followed the experimental setup of elhamifar2013sparse . We divide the 38 subjects into four groups: subjects 1 to 10, 11 to 20, 21 to 30, and 31 to 38. We consider all choices of subjects for the first three groups. For the last group, we consider all choices of . We implement our subspace clustering algorithm on each set of subjects. For all experiments, the stopping criterion for is triggered by a relative difference of between two successive iterations, or by a maximum of 100 iterations.
The results are presented in Table 3. For the other methods, we cite the results from Table 5 of elhamifar2013sparse . SCLD consistently has low clustering error rates and is more stable than the other methods, whose error rates increase drastically as the number of subjects increases to 8 and 10. As shown in Figure 2, there are many sparse within-sample outliers in the face images, e.g., shadows. Although LRR uses a regularization term to account for corruptions, the regularization term does not appear to be well suited to EYaleB. LSA has inferior performance, possibly because it does not explicitly exploit the low-rank structure of the data.
5.2.1.3 Third Experiment Scenario
In this section, we compare SCLD with other algorithms that use RPCA candes2011robust as a preprocessing step. In practice, we do not know the clustering of the data beforehand, and hence we apply RPCA to the collection of all data points for each trial prior to clustering. As shown in Table 4, SCLD is still superior to the other methods even though they apply RPCA to deal with sparse outlying entries. Compared to Table 3, only the clustering error rates of LRSC are reduced in some cases. We can conclude that applying RPCA to all data points simultaneously is not effective in improving clustering performance. This is due to the fact that RPCA seeks a common low-rank subspace, which will decrease the principal angles between subspaces and decrease the distance between data points in different subjects elhamifar2013sparse .
5.2.2 Motion Segmentation
Motion segmentation aims to segment the trajectories associated with different moving objects into different groups according to their motions in a video sequence. Because different motions can be treated as different subspaces, we use the Hopkins 155 Dataset to validate SCLD. This dataset is slightly corrupted, as shown in Figure 3. It consists of 155 sequences of two or three motions and 1 sequence of 5 motions; the latter is regarded as an outlier. Each sequence is regarded as a separate clustering problem.
The experimental results are reported in Table 5. We also used the results in Table 1 of elhamifar2013sparse . It can be seen that SCLD produces superior results compared to the other methods. For all 155 sequences, the error rate is as low as 1.79. If we use all 156 sequences, the overall error rate of our proposed algorithm will be 1.87. We report the average computation time for every sequence at the bottom of Table 5. The computational cost of LRSC is much lower than the other methods, while LRR, SSC and SCLD are comparable.
To test the influence of parameter in our algorithm, we show the clustering error rates of SCLD for different over all 155 sequences in Figure 4. As we can see, when is between 1 and 200, the clustering error varies between 1.79 and 4.67. This implies that SCLD performs well under a wide range of values of .
To test the dependence of SCLD on initialization, we apply another two different initializations. First, we use the solutions from LRR as initial guess for SCLD. Second, we just generate some random numbers. We find that we can still get the same results. Actually, it is recommended to use convex relaxation solutions as initialization for nonconvex formulations fan2014strong ; zhang2010analysis .
6 Conclusion
In this paper, we propose to use a log-determinant function (LogDet) as a rank approximation to recover the low-rank representation of high-dimensional data. When applied to subspace clustering, the proposed algorithm, called SCLD, exploits both global and local structures of the data through the LogDet rank approximation and the angle-based affinity matrix. Consequently, it captures more intrinsic information of the data that benefits subspace clustering. Our extensive experimental results show that it outperforms other low-rank representation algorithms based on the nuclear norm. Therefore, LogDet appears to be an effective rank approximation function well suited to subspace clustering applications. Although our model is simple and does not explicitly model outliers, it is resilient to various corruptions. Our future research will consider modeling corruptions explicitly.
Acknowledgements. This work is supported in part by US National Science Foundation grants IIS 1218712.
References
- (1) M. Fazel, Matrix rank minimization with applications. Ph.D. thesis, Stanford University (2002)
- (2) G. Liu, Z. Lin, Y. Yu, in Proceedings of the 27th International Conference on Machine Learning (ICML-10) (2010), pp. 663–670
- (3) G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, Y. Ma, Pattern Analysis and Machine Intelligence, IEEE Transactions on 35(1), 171 (2013)
- (4) R. Vidal, P. Favaro, Pattern Recognition Letters 43, 47 (2014)
- (5) E.J. Candès, B. Recht, Foundations of Computational mathematics 9(6), 717 (2009)
- (6) Y. Hu, D. Zhang, J. Ye, X. Li, X. He, Pattern Analysis and Machine Intelligence, IEEE Transactions on 35(9), 2117 (2013)
- (7) R. Vidal, IEEE Signal Processing Magazine 28(2), 52 (2010)
- (8) Y.C. Eldar, M. Mishali, Information Theory, IEEE Transactions on 55(11), 5302 (2009)
- (9) A.Y. Yang, J. Wright, Y. Ma, S.S. Sastry, Computer Vision and Image Understanding 110(2), 212 (2008)
- (10) E. Elhamifar, R. Vidal, Pattern Analysis and Machine Intelligence, IEEE Transactions on 35(11), 2765 (2013)
- (11) S. Rao, R. Tron, R. Vidal, Y. Ma, Pattern Analysis and Machine Intelligence, IEEE Transactions on 32(10), 1832 (2010)
- (12) F. Lauer, C. Schnorr, in Computer Vision, 2009 IEEE 12th International Conference on (IEEE, 2009), pp. 678–685
- (13) J. Yan, M. Pollefeys, in Computer Vision–ECCV 2006 (Springer, 2006), pp. 94–106
- (14) A. Goh, R. Vidal, in Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on (IEEE, 2007), pp. 1–6
- (15) T. Zhang, A. Szlam, Y. Wang, G. Lerman, International Journal of Computer Vision 100(3), 217 (2012)
- (16) G. Chen, G. Lerman, International Journal of Computer Vision 81(3), 317 (2009)
- (17) L. Jing, M.K. Ng, J.Z. Huang, Knowledge and Data Engineering, IEEE Transactions on 19(8), 1026 (2007)
- (18) J. Shi, J. Malik, Pattern Analysis and Machine Intelligence, IEEE Transactions on 22(8), 888 (2000)
- (19) U. Von Luxburg, Statistics and computing 17(4), 395 (2007)
- (20) E. Elhamifar, R. Vidal, in Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on (IEEE, 2010), pp. 1926–1929
- (21) J.F. Cai, E.J. Candès, Z. Shen, SIAM Journal on Optimization 20(4), 1956 (2010)
- (22) B. Recht, M. Fazel, P.A. Parrilo, SIAM review 52(3), 471 (2010)
- (23) A.S. Lewis, Journal of Convex Analysis 2(1), 173 (1995)
- (24) M. Fazel, H. Hindi, S.P. Boyd, in American Control Conference, 2003. Proceedings of the 2003, vol. 3 (IEEE, 2003), pp. 2156–2162
- (25) C. Lu, C. Zhu, C. Xu, S. Yan, Z. Lin, in AAAI (2015)
- (26) F. Nie, Y. Huang, X. Wang, H. Huang, in Proceedings of International Conference on Machine Learning (2014)
- (27) A.S. Georghiades, P.N. Belhumeur, D. Kriegman, Pattern Analysis and Machine Intelligence, IEEE Transactions on 23(6), 643 (2001)
- (28) K.C. Lee, J. Ho, D. Kriegman, Pattern Analysis and Machine Intelligence, IEEE Transactions on 27(5), 684 (2005)
- (29) R. Tron, R. Vidal, in Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on (IEEE, 2007), pp. 1–8
- (30) P. Favaro, R. Vidal, A. Ravichandran, in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on (IEEE, 2011), pp. 1801–1807
- (31) R. Basri, D.W. Jacobs, Pattern Analysis and Machine Intelligence, IEEE Transactions on 25(2), 218 (2003)
- (32) E.J. Candès, X. Li, Y. Ma, J. Wright, Journal of the ACM (JACM) 58(3), 11 (2011)
- (33) J. Fan, L. Xue, H. Zou, Annals of statistics 42(3), 819 (2014)
- (34) T. Zhang, The Journal of Machine Learning Research 11, 1081 (2010)