1 Introduction
Subspace clustering denotes the problem of clustering data points drawn from a union of low-dimensional linear (or affine) subspaces into their respective subspaces. This problem has many applications in computer vision, such as motion segmentation and image clustering. To give a concrete example, under an affine camera model, the trajectories of points on a rigidly moving object lie in a linear subspace of dimension up to four; thus motion segmentation can be cast as a subspace clustering problem [7].
Existing subspace clustering methods can be roughly divided into three categories [40]: algebraic algorithms, statistical methods, and spectral clustering-based methods. We refer the reader to [40] for a comprehensive review of the subspace clustering literature. Recently, there has been a surge of spectral clustering-based methods [7, 26, 27, 42, 17, 18, 46, 45], which consist of first constructing an affinity matrix and then applying spectral clustering
[29]. All these methods, however, can only handle linear (or affine) subspaces. In practice, the data points may not fit a linear subspace model exactly. For example, in motion segmentation (see Figure 1), the camera often has some degree of perspective distortion, so that the affine camera assumption does not hold; in this case, the trajectories of one motion instead lie in a nonlinear subspace (or submanifold).¹ A few other methods [31, 32, 30, 43, 44] extended linear subspace clustering to nonlinear counterparts by exploiting the kernel trick. In particular, [31, 32] kernelized sparse subspace clustering (SSC) [7, 8] by replacing the inner product of the data matrix with polynomial kernel or Gaussian RBF kernel matrices; similarly, [30, 43] kernelized the method of low-rank representation (LRR) [26, 25]. [44] assumed that data points were drawn from symmetric positive definite (SPD) matrices and applied the Log-Euclidean kernel on SPD matrices to kernelize SSC. However, with the predefined kernels used in all these methods, the data after the (implicit) mapping to feature space are not guaranteed to be low-rank, and are thus very unlikely to form multiple low-dimensional subspace structures in the feature space.
¹In this paper, we confine our discussion to data structures that deviate nonlinearly from linear subspaces, but are not arbitrarily far from them. Therefore, arbitrary manifold clustering [38], e.g., for Olympic rings and spirals, is out of the scope of this paper.
In this paper, by contrast, we propose a joint formulation to adaptively solve (in an unsupervised manner) for a low-rank kernel mapping such that the data in the resulting feature space are both low-rank and self-expressive. Intuitively, enforcing the kernel feature mapping to be low-rank encourages the data to form linear subspace structures in feature space. Our idea of low-rank kernel mapping is general and could, in principle, be implemented within most self-expressiveness-based subspace clustering frameworks [7, 26, 27, 42]. Here, in particular, we make use of the SSC formulation of [7]. This allows us to employ the Alternating Direction Method of Multipliers (ADMM) [24, 2] to derive an efficient solution to the resulting optimization problem.
We extensively evaluate our method on multiple motion segmentation and image clustering datasets, and show that it significantly outperforms the linear subspace clustering methods of [8, 41, 25, 46] and the method of [32] based on predefined kernels. Specifically, we achieve state-of-the-art results on the Hopkins155 motion segmentation dataset [39], the Extended Yale B face image clustering dataset [20], the ORL face image clustering dataset [34], and the COIL100 image clustering dataset [28].
2 Subspace Self-Expressiveness
Modern subspace clustering methods rely on building an affinity matrix such that data points from the same subspace have high affinity values and those from different subspaces have low affinity values (ideally zero). Recent self-expressiveness-based methods [8, 25, 45] resort to the so-called subspace self-expressiveness property, i.e., a point from one subspace can be represented as a linear combination of other points in the same subspace, and leverage the self-expression coefficient matrix as the affinity matrix for spectral clustering.
Specifically, given a data matrix $X$ (with each column a data point), subspace self-expressiveness means that one can express $X = XC$, where $C$ is the self-expression coefficient matrix. As shown in [17], under the assumption of independent subspaces, the optimal solution for $C$ obtained by minimizing certain norms of $C$ has a block-diagonal structure (up to permutations), i.e., $c_{ij} \neq 0$ only if points $i$ and $j$ are from the same subspace. In other words, we can address the subspace clustering problem by solving the following optimization problem
$\min_{C} \|C\| \quad \mathrm{s.t.} \quad X = XC, \; \mathrm{diag}(C) = 0$  (1)
where $\|\cdot\|$ denotes an arbitrary matrix norm, and the constraint $\mathrm{diag}(C) = 0$ prevents the trivial identity solution for sparse norms of $C$.² In the literature, various norms on $C$ have been used to regularize subspace clustering, such as the $\ell_1$ norm in [7, 8], the nuclear norm in [26, 25, 9, 41], the Frobenius norm in [27, 17], a structured norm in [21], and mixtures of the $\ell_1$ norm with other norms in [42, 45].
²$\mathrm{diag}(C)$ denotes the diagonal matrix whose diagonal elements are the same as those on the diagonal of $C$.
Compared to another line of research based on local higher-order models [5, 14, 33], where affinities are constructed from the residuals of local subspace model fitting, self-expressiveness-based methods build holistic connections (or affinities) for all points in a single, principled optimization problem. Moreover, this formulation is convex (after certain relaxations), which guarantees globally-optimal solutions. Unfortunately, subspace self-expressiveness only holds for linear (or affine) subspaces. In the following section, we show how to jointly solve for a low-rank kernel for nonlinear subspace clustering within the framework of self-expressiveness-based subspace clustering, and derive efficient solutions for the resulting formulations.
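To make the block-diagonal property above concrete, the following numpy sketch (ours, not the paper's code) samples points from two independent subspaces and forms a known closed-form minimizer of the nuclear-norm instance of this problem, the shape interaction matrix $C = V_r V_r^\top$ [9, 26]; under independent subspaces, the cross-subspace entries of $C$ vanish. All sizes and seeds are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two independent 2-D linear subspaces in R^5, 20 points each.
U1 = np.linalg.qr(rng.standard_normal((5, 2)))[0]
U2 = np.linalg.qr(rng.standard_normal((5, 2)))[0]
X = np.hstack([U1 @ rng.standard_normal((2, 20)),
               U2 @ rng.standard_normal((2, 20))])  # 5 x 40

# Closed-form minimizer of min ||C||_* s.t. X = XC: C = V_r V_r^T,
# where V_r holds the top-r right singular vectors of X (r = rank(X)).
_, S, Vt = np.linalg.svd(X, full_matrices=False)
r = int((S > 1e-10).sum())          # rank is 4 here (2 + 2)
C = Vt[:r].T @ Vt[:r]

# Cross-subspace affinities vanish: C is block-diagonal up to permutation.
off_block = np.abs(C[:20, 20:]).max()
print(off_block < 1e-8)  # True
```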
3 Low-Rank Kernel Subspace Clustering
Kernel methods map data points to a high-dimensional (or even infinite-dimensional) feature space where linear pattern analysis can be performed, corresponding to nonlinear analysis in the input space [35]. Instead of explicitly computing coordinates in the high-dimensional feature space, common practice in kernel methods consists of using the “kernel trick”, where the feature mapping is implicit and inner products between pairs of data points in the feature space are computed as kernel values. While the “kernel trick” is computationally cheap, for commonly used kernels such as the Gaussian RBF, we do not know explicitly how the data points are mapped to feature space. Specifically, in the context of subspace clustering, it is very likely that, after an implicit feature mapping, we do not have the desired low-dimensional linear subspace structure in the feature space.
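As a small illustration of the kernel trick (our example, not from the paper), the homogeneous degree-2 polynomial kernel $(\mathbf{x}^\top\mathbf{y})^2$ equals an inner product between explicit 9-dimensional feature vectors, so the mapping never has to be formed in general:

```python
import numpy as np

rng = np.random.default_rng(1)
x, y = rng.standard_normal(3), rng.standard_normal(3)

# Explicit degree-2 feature map: all products x_i * x_j (9-dim for R^3).
phi = lambda v: np.outer(v, v).ravel()

k_trick = (x @ y) ** 2          # kernel value, no explicit mapping needed
k_explicit = phi(x) @ phi(y)    # inner product in the feature space
print(np.isclose(k_trick, k_explicit))  # True
```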
3.1 Problem Formulation
In this work, we aim to solve for a low-rank kernel mapping that projects the data into a high-dimensional Hilbert space where they have the structure of linear subspaces (see Figure 2). While our approach is general, we implement it within the sparse subspace clustering (SSC) framework [7, 8], formulated as
$\min_{C} \|C\|_1 \quad \mathrm{s.t.} \quad X = XC, \; \mathrm{diag}(C) = 0$  (2)
which seeks a sparse representation of every data point using the other points as the dictionary.
When the structure of linear subspaces is present in the Hilbert feature space, the feature mapping should have low rank. Since we would like the data in Hilbert space to still lie on multiple linear subspaces, we also expect them to be self-expressive. Combining these properties leads to the following optimization problem
$\min_{\Phi, C} \|\Phi(X)\|_* + \lambda \|C\|_1 \quad \mathrm{s.t.} \quad \Phi(X) = \Phi(X)C, \; \mathrm{diag}(C) = 0$  (3)
where $\Phi(\cdot)$ is an unknown kernel mapping function and $\lambda$ is a trade-off parameter. Here, we minimize the nuclear norm $\|\Phi(X)\|_*$, which is a convex surrogate of $\mathrm{rank}(\Phi(X))$, to encourage $\Phi(X)$ to have low rank. In practice, the data points often contain noise. Therefore, we can relax the equality constraint in (3) and make it a regularization term in the objective, i.e., $\|\Phi(X) - \Phi(X)C\|_F^2$. In kernel methods, we normally do not know the explicit form of $\Phi(\cdot)$, so we need to apply the “kernel trick”. To this end, we can expand this regularization term and obtain
$\|\Phi(X) - \Phi(X)C\|_F^2 = \mathrm{tr}\!\left(K - 2KC + C^\top K C\right)$  (4)
which does not explicitly depend on $\Phi(X)$ anymore, but only on the kernel Gram matrix $K = \Phi(X)^\top \Phi(X)$.
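This expansion can be sanity-checked numerically with an explicit feature map, so that both sides are directly computable; the degree-2 polynomial map and the sizes below are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((3, 6))           # data, one point per column
C = rng.standard_normal((6, 6)) * 0.1     # arbitrary coefficient matrix

# Explicit degree-2 polynomial feature map, so Phi can be formed directly.
Phi = np.stack([np.outer(x, x).ravel() for x in X.T], axis=1)  # 9 x 6
K = Phi.T @ Phi                            # Gram matrix, K_ij = k(x_i, x_j)

lhs = np.linalg.norm(Phi - Phi @ C, 'fro') ** 2
rhs = np.trace(K - 2 * K @ C + C.T @ K @ C)
print(np.isclose(lhs, rhs))  # True
```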
However, there are still two hurdles in optimizing (3): (i) the first term in the objective, $\|\Phi(X)\|_*$, depends on the kernel mapping, which has no explicit form in most kernel methods; (ii) the kernel mapping function in this formulation is under-constrained, in the sense that one can always map the data to all zeros to achieve the minimum value of the objective.
For the first hurdle, one may think of minimizing $\|K\|_*$ instead of $\|\Phi(X)\|_*$, since the rank of $K$ is equal to the rank of $\Phi(X)$. However, minimizing $\|K\|_*$ will not lead to a low-rank $\Phi(X)$: since $K$ is positive semi-definite, we have $\|K\|_* = \mathrm{tr}(K) = \|\Phi(X)\|_F^2$. So minimizing $\|K\|_*$ is equivalent to minimizing $\|\Phi(X)\|_F^2$, which does not encourage the data in feature space to have low rank [10]. Fortunately, it has been shown in [10] that this hurdle can be circumvented by a re-parametrization, which leads to a closed-form solution for robust rank minimization in the feature space. Since the kernel matrix $K$ is symmetric positive semi-definite, we can factorize it as $K = A^\top A$, where $A$ is a square matrix. It is easy to show that
$\|\Phi(X)\|_* = \|A\|_*$  (5)
Thus, we can replace $\|\Phi(X)\|_*$ with $\|A\|_*$ in the objective of (3).
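This factorization step can be illustrated numerically: both $A$ and $\Phi(X)$ share the same singular values (the square roots of the eigenvalues of $K$), so their nuclear norms coincide. An explicit feature map is used below only so that the left-hand side is computable; all names are ours:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((3, 5))

# Explicit degree-2 feature map so ||Phi(X)||_* can be computed directly.
Phi = np.stack([np.outer(x, x).ravel() for x in X.T], axis=1)  # 9 x 5
K = Phi.T @ Phi

# Factorize the PSD Gram matrix as K = A^T A via its eigendecomposition.
lam, V = np.linalg.eigh(K)
lam = np.clip(lam, 0.0, None)
A = np.diag(np.sqrt(lam)) @ V.T            # square 5 x 5 factor

nuc = lambda M: np.linalg.svd(M, compute_uv=False).sum()
print(np.isclose(nuc(A), nuc(Phi)))  # True
```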
For the second hurdle, to regularize the kernel mapping, we further enforce our adaptive kernel matrix $K = A^\top A$ to be close to a predefined kernel matrix $K_G$ built from user-specified kernels. With an additional affine constraint for affine subspaces, our optimization problem translates to
$\min_{A, C} \|A\|_* + \lambda_1 \|C\|_1 + \frac{\lambda_2}{2} \|\Phi(X) - \Phi(X)C\|_F^2 + \frac{\lambda_3}{2} \|A^\top A - K_G\|_F^2 \quad \mathrm{s.t.} \quad K = \Phi(X)^\top \Phi(X) = A^\top A, \; \mathrm{diag}(C) = 0, \; \mathbf{1}^\top C = \mathbf{1}^\top$  (6)
where $K_G$ denotes the predefined kernel Gram matrix and $\mathbf{1}$ is an all-one column vector. The idea of our formulation is that we want to solve for an adaptive kernel matrix $K = A^\top A$ such that:
the mapped data in the feature space has low rank;

the unknown kernel matrix (to be solved) is not arbitrary but close to a predefined kernel matrix (e.g., a polynomial kernel);

the data points in the feature space still form a multiple linear subspace structure, and are thus self-expressive.
To handle the diagonal constraint on $C$, we introduce an auxiliary variable $D$ such that $D = C - \mathrm{diag}(C)$, as in [8]. Substituting Eq. (4) into (6), we obtain the following equivalent formulation
$\min_{A, C, D} \|A\|_* + \lambda_1 \|C\|_1 + \frac{\lambda_2}{2} \mathrm{tr}\!\left(A^\top A - 2A^\top A D + D^\top A^\top A D\right) + \frac{\lambda_3}{2} \|A^\top A - K_G\|_F^2 \quad \mathrm{s.t.} \quad D = C - \mathrm{diag}(C), \; \mathbf{1}^\top D = \mathbf{1}^\top$  (7)
Below, we show how to solve this problem efficiently.
3.2 Solutions via ADMM
The above optimization problem is non-convex (or bi-convex) due to the bilinear terms in the objective. Here, we propose to solve it via the Alternating Direction Method of Multipliers (ADMM) [24, 2]. Recently, the ADMM has gained popularity for solving non-convex problems [22], especially bilinear ones [6, 15, 36]. A convergence analysis of the ADMM for certain non-convex problems is provided in [13]. We also give an empirical convergence analysis in the next section.
To derive the ADMM solution to (7), we first need to compute its augmented Lagrangian. This is given by
(8) 
where $Y_1$ and $Y_2$ are the Lagrange multipliers corresponding to the equality constraints in (7), and $\rho$ is the penalty parameter of the augmentation term in the Lagrangian.
The ADMM works in an iterative manner, updating one variable at a time while fixing the others [24], i.e.,
(9a)  
(9b)  
(9c) 
and then updating the Lagrange multipliers.
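The alternating scheme above can be illustrated on a simplified convex relative of our problem: a minimal ADMM for $\min_{C,D} \frac{\lambda}{2}\|X - XD\|_F^2 + \|C\|_1$ s.t. $C = D$, which keeps the ℓ1/least-squares splitting but omits the kernel, nuclear-norm, and diagonal terms. This is a hedged sketch; all names and parameter values are illustrative, not the paper's exact updates:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((5, 12))

lam, rho = 10.0, 1.0
N = X.shape[1]
C = np.zeros((N, N)); D = np.zeros((N, N)); Y = np.zeros((N, N))
XtX = X.T @ X
soft = lambda M, t: np.sign(M) * np.maximum(np.abs(M) - t, 0.0)

for _ in range(300):
    # D-step: least-squares subproblem, closed form via a linear solve.
    D = np.linalg.solve(lam * XtX + rho * np.eye(N),
                        lam * XtX + rho * C + Y)
    # C-step: l1 proximal step = element-wise soft-thresholding.
    C = soft(D - Y / rho, 1.0 / rho)
    # Dual ascent on the multiplier of the constraint C = D.
    Y = Y + rho * (C - D)

print(np.abs(C - D).max())  # small: the constraint C = D is nearly satisfied
```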
(1) Updating $C$
The update of $C$ can be achieved by solving the following subproblem
(10) 
This subproblem has a closed-form solution given by
(11) 
where $\mathcal{S}_{\tau}(\cdot)$ is the element-wise soft-thresholding operator, defined as $\mathcal{S}_{\tau}(x) = \mathrm{sign}(x)\max(|x| - \tau, 0)$.
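The soft-thresholding operator admits a one-line numpy implementation (a generic sketch, independent of the rest of the algorithm):

```python
import numpy as np

def soft_threshold(M, tau):
    """Element-wise soft-thresholding: sign(x) * max(|x| - tau, 0)."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

# Entries within [-tau, tau] are zeroed; the rest shrink toward zero by tau.
print(np.allclose(soft_threshold(np.array([-2.0, -0.3, 0.0, 0.5, 1.7]), 0.5),
                  [-1.5, 0.0, 0.0, 0.0, 1.2]))  # True
```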
(2) Updating $D$
To update $D$, we must solve the subproblem
(12) 
This can be achieved by setting the derivative of this subproblem to zero, which again yields a closed-form solution given by
(13) 
where $I$ is an identity matrix.
(3) Updating $A$
$A$ can be updated by solving the following subproblem
(14) 
³In our implementation, we make the matrix involved symmetric by averaging it with its transpose. Fortunately, this subproblem also has a closed-form solution given by
(15) 
where both factors are related to the singular value decomposition (SVD) of the matrix defined above. The diagonal factor is obtained by first solving a set of depressed cubic equations whose first-order coefficients come from the singular values of that matrix, and then selecting, for each singular value, the non-negative root that minimizes the corresponding objective; the other factor contains the associated singular vectors (see [10] for a complete proof of this result). Note that the solution to (14) is non-unique, since one can multiply an arbitrary orthogonal matrix on the left of (15) without changing the objective value in (14). However, this non-uniqueness does not affect the final clustering, as the solution for $C$ remains the same.
3.3 Handling Gross Corruptions
Our formulation in (6) can be sensitive to gross data corruptions (e.g., Laplacian noise) due to the squared Frobenius-norm regularization. When data points are grossly contaminated, we need to model the gross errors in the kernel matrix. To this end, we assume that the gross corruptions in the data are sparse, so that we can model them with an $\ell_1$ regularizer.⁴ This lets us derive the following formulation
(16)
⁴In principle, one can also model the gross corruptions with structured sparse norms, such as the $\ell_{2,1}$ norm, if some data points are complete outliers. Here, we use the $\ell_1$ regularizer because we assume that gross corruptions only affect some sparse entries of the data vectors.
where we decompose the predefined kernel matrix into the sum of a low-rank kernel matrix and a sparse outlier term $E$.
Similarly to (7), we can again solve (16) with the ADMM. To this end, we derive its augmented Lagrangian as
(17) 
We can then derive the subproblems to update each of the variables by minimizing this augmented Lagrangian.
(1) Updating $C$, $D$ and $A$
The subproblems for updating $C$ and $D$ are exactly the same as in (10) and (12), and the solutions are likewise given by (11) and (13). The subproblem to update $A$ is similar to (14), except that the matrix involved is now defined as
(18) 
(2) Updating $E$
The subproblem to update $E$ can be written as
(19) 
which has a closed-form solution given by
(20) 
3.4 The Complete Algorithm
Given the data matrix $X$, we solve either (7) or (16) with Algorithm 1 or 2, respectively, depending on whether the data points are grossly contaminated. After obtaining the coefficient matrix $C$, we construct the affinity matrix with an extra normalization step on $C$, as in SSC [8]. Finally, we apply spectral clustering [29, 37] to obtain the clustering results. Our complete algorithm for low-rank kernel subspace clustering (LRKSC) is outlined in Algorithm 3.
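As a toy end-to-end illustration of this pipeline (not Algorithm 3 itself), the sketch below builds a stand-in coefficient matrix in closed form, turns it into a symmetric affinity matrix, and runs a simple two-cluster spectral step; all sizes, seeds, and thresholds are ours:

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy data from two independent 2-D subspaces in R^6 (15 points each).
U1 = np.linalg.qr(rng.standard_normal((6, 2)))[0]
U2 = np.linalg.qr(rng.standard_normal((6, 2)))[0]
X = np.hstack([U1 @ rng.standard_normal((2, 15)),
               U2 @ rng.standard_normal((2, 15))])

# Stand-in coefficient matrix: closed-form minimizer of ||C||_* s.t. X = XC.
_, S, Vt = np.linalg.svd(X, full_matrices=False)
r = int((S > 1e-10).sum())
C = Vt[:r].T @ Vt[:r]

# Symmetrized affinity matrix built from |C|.
W = np.abs(C) + np.abs(C).T
np.fill_diagonal(W, 0.0)

# Spectral step on the normalized Laplacian: embed each point with the two
# smallest eigenvectors, then group points whose embeddings align.
d = W.sum(axis=1)
L = np.eye(W.shape[0]) - W / np.sqrt(np.outer(d, d))
_, vecs = np.linalg.eigh(L)
emb = vecs[:, :2]
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
labels = (emb @ emb[0] > 0.5).astype(int)

print(labels[:15].min() == labels[:15].max(),
      labels[15:].min() == labels[15:].max(),
      labels[0] != labels[15])  # True True True
```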
4 Experiments
We compare our method with the following baselines: low-rank representation (LRR) [25], sparse subspace clustering (SSC) [8], kernel sparse subspace clustering (KSSC) [32], low-rank subspace clustering (LRSC) [41], and sparse subspace clustering by orthogonal matching pursuit (SSC-OMP) [46]. Specifically, KSSC is the kernelized version of SSC, and SSC-OMP is a scalable version of SSC.⁵ The metric for quantitative evaluation is the clustering error, i.e., the ratio of wrongly clustered points.
⁵While there are other kernel subspace clustering methods [43, 44] in the literature, their source code is not yet publicly available, so we are unable to compare against their results. However, our comparison with KSSC already shows the benefits of solving for a low-rank kernel over using a predefined kernel.
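This metric can be computed by matching predicted clusters to ground-truth labels with the Hungarian algorithm, since cluster indices are only defined up to permutation; a hedged sketch, with a helper name of our choosing:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_error(pred, gt):
    """Ratio of wrongly clustered points under the best label permutation."""
    k = max(pred.max(), gt.max()) + 1
    cost = np.zeros((k, k))
    for p, g in zip(pred, gt):
        cost[p, g] -= 1                       # negative counts: maximize matches
    rows, cols = linear_sum_assignment(cost)  # Hungarian algorithm
    matched = -cost[rows, cols].sum()
    return 1.0 - matched / len(gt)

gt = np.array([0, 0, 0, 1, 1, 1])
pred = np.array([1, 1, 0, 0, 0, 0])           # labels permuted + one mistake
print(clustering_error(pred, gt))  # 1/6 of the points are wrong
```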
For LRR, SSC, LRSC and SSC-OMP, we used the source code released by the authors. For KSSC, we used the same ADMM framework as ours (with the parameters tuned for best performance), and updated the variables according to the descriptions in the original paper [32].
4.1 Kernel Selection
For all kernel methods, kernel selection remains an open but important problem, since there is a vast range of possible kernels in practice. The selection of a kernel depends on the task at hand, on prior knowledge about the data, and on the types of patterns we expect to discover [35]. In the supervised setting, one can use a validation set to choose the kernel that gives the best (classification) results. However, in the unsupervised setting, it is hard to define a measure of “goodness” to guide kernel selection. Nonetheless, in the case of subspace clustering, we can make use of our prior knowledge (or assumption) that the data points are not too far away from linear subspaces, and rely on simple kernels to define the predefined kernel matrix. In this paper, we advocate the use of polynomial kernels for kernel subspace clustering, and argue that more complex kernels, such as Gaussian RBF kernels, would destroy the subspace structure in the feature space to the extent that, even with our proposed adaptive kernel approximation, the subspace structure cannot be restored. Specifically, in all our experiments, we use polynomial kernels of the form $k(\mathbf{x}, \mathbf{y}) = (\mathbf{x}^\top \mathbf{y} + b)^d$. The degree parameter $d$ is set to either 2 or 3, and the bias parameter $b$ is selected from a small set of candidate values.
4.2 Convergence Analysis
In this part of the experiments, we examine how Algorithms 1 and 2 converge. Specifically, we compute the objective values of (7) and (16), and their primal residuals, at each iteration. The primal residuals are computed from the violations of the equality constraints of (7) and (16), respectively. We show typical convergence curves in Figure 3, with data sampled from Hopkins155 and Extended Yale B. We can see that both algorithms converge quickly (within 15 iterations), with the primal residuals rapidly driven close to zero. This is mainly due to our use of a relatively large penalty update factor in the ADMM, which yields a large penalty parameter within a few iterations and thus greatly accelerates convergence. As we will see in the following sections, the solutions obtained by the ADMM are good in the sense that the corresponding results outperform the state-of-the-art on multiple datasets.
4.3 Motion Segmentation on Hopkins155
Hopkins155 [39] is a standard motion segmentation dataset consisting of 155 sequences with two or three motions. The sequences can be divided into three categories, i.e., indoor checkerboard sequences (104 sequences), outdoor traffic sequences (38 sequences), and articulated/non-rigid sequences (13 sequences). This dataset provides ground-truth motion labels and outlier-free feature trajectories ((x, y)-coordinates) across the frames, with moderate noise. The number of feature trajectories per sequence ranges from 39 to 556, and the number of frames from 15 to 100. Since, under the affine camera model, the trajectories of one motion lie in an affine subspace of dimension up to three, subspace clustering methods can be applied to motion segmentation.
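The rank bound underlying this can be checked on synthetic data: under an affine camera, the 2F × N trajectory matrix factors as a stack of 2 × 4 camera matrices times a 4 × N structure matrix, hence has rank at most four. A small numpy sketch with random stand-in cameras and points:

```python
import numpy as np

rng = np.random.default_rng(6)
F, N = 10, 50                      # frames, tracked points
P = rng.standard_normal((3, N))    # 3-D points on one rigid object

# Affine camera per frame: x_f = A_f P + t_f, with A_f a 2x3 matrix.
W = np.vstack([rng.standard_normal((2, 3)) @ P
               + rng.standard_normal((2, 1)) for _ in range(F)])  # 2F x N

# The trajectory matrix factors as [A_f | t_f] [P; 1], hence rank <= 4.
print(np.linalg.matrix_rank(W))  # 4
```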
Method  LRR  SSC  KSSC  LRSC  SSC-OMP  Ours

2 motions  
Mean  2.13  1.53  1.85  2.57  10.34  1.22 
Median  0.00  0.00  0.00  0.00  3.48  0.00 
3 motions  
Mean  4.13  4.41  3.57  6.64  18.58  3.10 
Median  1.43  0.56  0.30  1.76  13.01  0.56 
All  
Mean  2.56  2.18  2.24  3.47  12.20  1.64 
Median  0.00  0.00  0.00  0.09  5.11  0.00 
For our method, we solve Formulation (7) using Algorithm 1 and use a polynomial kernel to define the predefined kernel matrix; we use the same polynomial kernel for KSSC. Since the input subspaces are affine but the affine constraint in (7) is in feature space, we append an all-one row to the data matrix, which acts as an affine constraint for the input space. This trick is also applied to KSSC. For the other baselines, we either use the parameters suggested in the original papers (when given), or tune them for best performance. We show the results on the Hopkins155 motion segmentation dataset in Table 1. Since most sequences in this dataset fit the affine camera model well, most baselines perform well. Note that our method still achieves the lowest clustering errors, whereas KSSC performs slightly worse than SSC. The performance gain of our method over SSC mainly comes from its ability to handle the nonlinear structure that arises when the affine camera assumption is not strictly fulfilled. For example, in Figure 1, we show the results on the 2RT3RCR sequence, which contains noticeable perspective distortion; our method performs significantly better than SSC there.
We further test our method for two-frame perspective motion segmentation [23, 16] on Hopkins155 to rule out the effects of inaccurate camera model assumptions. For two perspective images, the subspace structure comes from rewriting the epipolar constraint [12] $\mathbf{x}_2^\top F \mathbf{x}_1 = 0$, where $F$ is the fundamental matrix, and $\mathbf{x}_1$, $\mathbf{x}_2$ are the homogeneous coordinates of two corresponding points. The epipolar constraint can be rewritten as [16]
$(\mathbf{x}_1 \otimes \mathbf{x}_2)^\top \mathrm{vec}(F) = 0$  (21)
where $\mathrm{vec}(F)$ is the vectorization of the fundamental matrix $F$ and $\otimes$ denotes the Kronecker product. So $\mathbf{x}_1 \otimes \mathbf{x}_2$ lies in the epipolar subspace (i.e., the orthogonal complement of $\mathrm{vec}(F)$) of dimension up to eight [16]. Since different motions correspond to different fundamental matrices, we have multiple epipolar subspaces for two perspective images with multiple motions.
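This Kronecker-product rewriting can be verified numerically with random stand-in values (column-major vectorization assumed):

```python
import numpy as np

rng = np.random.default_rng(7)
F = rng.standard_normal((3, 3))               # stand-in fundamental matrix
x1, x2 = rng.standard_normal(3), rng.standard_normal(3)

# Bilinear epipolar form x2^T F x1 equals (x1 kron x2)^T vec(F),
# with vec() stacking the columns of F.
lhs = x2 @ F @ x1
rhs = np.kron(x1, x2) @ F.flatten(order='F')
print(np.isclose(lhs, rhs))  # True
```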
We take the first and last frames of each Hopkins155 sequence to construct the two-frame Hopkins155 dataset. Note that the dimension of the ambient space of the epipolar subspaces is only nine, so the epipolar subspaces are very likely to be dependent. To increase the ambient dimension, we replicate the data 30 times.⁶ For our method, we solve Formulation (7) using Algorithm 1 with a polynomial kernel. The results are shown in Table 2, where our method achieves the lowest overall clustering errors. Note that our method with only two frames even outperforms LRSC with the whole sequences.
⁶In Matlab, this is done by repmat(·, 30, 1) on the data matrix.
Method  LRR  SSC  KSSC  LRSC  SSC-OMP  Ours

2 motions  
Mean  6.18  3.52  2.75  3.74  14.41  2.53 
Median  0.10  0.10  0.00  0.85  9.60  0.36 
3 motions  
Mean  14.92  10.93  8.60  22.28  37.32  5.93 
Median  10.16  8.65  5.40  20.72  39.17  0.20 
All  
Mean  8.15  5.20  4.07  7.93  19.58  3.30 
Median  0.54  0.52  0.40  1.81  15.98  0.36 
4.4 Face Image Clustering on Extended Yale B
The Extended Yale B dataset [20] consists of aligned face images of 38 subjects. Each subject has 64 frontal face images, acquired under a fixed pose with varying lighting conditions. It has been shown in [1] that, under Lambertian reflectance, the images of a subject under the same pose with different illuminations lie close to a 9-dimensional linear subspace. Following [8], we downsample the face images and vectorize them to form 2016-dimensional vectors, each of which lies close to a low-dimensional subspace. The dataset contains sparse gross corruptions due to specular reflections, which are non-Lambertian. Nonlinearity arises because the face poses and expressions were not exactly the same across images of the same subject.
We test our method and the baselines on this dataset with different numbers of subjects (n = 10, 15, 20, 25, 30, 35, or 38). We number the subjects from 1 to 38, first take the first n subjects, and then shift by one subject until all of them are considered. For example, for 10 subjects, we take all the images from subjects 1–10, 2–11, …, or 29–38 to form the data matrix for each trial.
We use Formulation (16) (solved using Algorithm 2) for our method, with a polynomial kernel; for our method, we also normalize the data. We use the same polynomial kernel for KSSC. The results on Extended Yale B are shown in Figure 4. We can see that KSSC improves the clustering accuracies over SSC for 20, 25, 30, and 35 subjects. Our low-rank kernel subspace clustering method achieves the lowest clustering errors on this dataset for all numbers of subjects. For 20, 25, 30, 35 and 38 subjects, our method almost halves the clustering errors of SSC, and also significantly outperforms all other baselines, including KSSC.
4.5 Face Image Clustering on the ORL Dataset
The ORL dataset [34] is composed of face images of 40 distinct subjects. Each subject has ten different images taken under varying lighting conditions, with different facial expressions (open/closed eyes, smiling/not smiling) and facial details (glasses/no glasses). Following [4], we crop the images and vectorize them to 1024-dimensional vectors. We use a similar experimental setting as for Extended Yale B, and test the algorithms with different numbers of subjects (n = 10, 15, 20, 25, 30, 35, 40). Compared to Extended Yale B, ORL has fewer images per subject (10 vs. 64) and higher subspace nonlinearity due to the variations in facial expressions and details, and is thus more challenging.
We use Formulation (16) (solved using Algorithm 2) for our method, with a polynomial kernel, and normalize the data. We use the same polynomial kernel for KSSC. We show the results on the ORL dataset for all competing methods in Figure 4. We can see that KSSC cannot outperform SSC for 20, 25, 30, 35 and 40 subjects on this dataset. We conjecture that this is mainly because each subject has only ten images, which are too few to span the whole subspace. Our superior performance verifies that, by adaptively solving for a low-rank feature mapping, our method can better handle this “very-few-sample” regime.
4.6 Object Image Clustering on COIL100
The COIL100 dataset [28] contains images of 100 objects, each with 72 images viewed from varying angles. Following [3], we downsample them to grayscale images, and vectorize each image into a 1024-dimensional vector, which corresponds to a point lying close to a low-dimensional subspace. As in the previous experiment on Extended Yale B, we also test our method and the baselines on this dataset for different numbers of objects, with n = 3, 4, 5, 6, 7, 8, or 9. For n = 3 (i.e., three objects), we take all the images from objects 1–3, 2–4, …, or 98–100 to form the data matrix for each trial. The data matrices for the other values of n are formed in a similar manner.
Again, we use Formulation (16) (solved using Algorithm 2) for our method, with a polynomial kernel, and use the same kernel for KSSC. We show the results on COIL100 in Figure 4. We can see that KSSC consistently outperforms SSC in this setting, which indicates that there is a considerable amount of nonlinearity in this dataset. Our method, by solving for a low-rank kernel mapping, achieves the lowest clustering errors among all the baselines.
Discussion:
For all our experiments, we do not claim that our results are the best in the entire literature, but rather showcase that our adaptive low-rank kernel subspace clustering improves over its linear counterpart and over kernel methods that use fixed kernels. For example, we note that, on Hopkins155, better results were reported by [19, 18]. However, to the best of our knowledge, our method is the first kernel subspace clustering method that achieves better results than its linear counterpart on Hopkins155, where most of the data conforms very well to the linear subspace structure.
5 Conclusion
In this paper, we have proposed a novel formulation for kernel subspace clustering that jointly optimizes an adaptive low-rank kernel and the pairwise affinities between data points (through subspace self-expressiveness). Our key insight is that, instead of using fixed kernels, we should learn a low-rank feature mapping such that the desired linear subspace structure is present in the feature space. We have derived efficient ADMM solutions to the resulting formulations, with closed-form solutions for each subproblem. We have shown through extensive experiments that the proposed method significantly outperforms kernel subspace clustering with predefined kernels, as well as state-of-the-art linear subspace clustering methods.
The main limitation of the current method is that we still need to manually select a kernel function to construct the predefined kernel matrix. In the future, we plan to explore the possibility of employing multiple kernel learning [11] to determine it automatically.
References
 [1] R. Basri and D. W. Jacobs. Lambertian reflectance and linear subspaces. TPAMI, 25(2):218–233, 2003.

[2] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
 [3] D. Cai, X. He, J. Han, and T. Huang. Graph regularized nonnegative matrix factorization for data representation. TPAMI, 33(8):1548–1560, 2011.

[4] D. Cai, X. He, Y. Hu, J. Han, and T. Huang. Learning a spatially smooth subspace for face recognition. In CVPR, pages 1–7. IEEE, 2007.
 [5] G. Chen and G. Lerman. Spectral curvature clustering (SCC). IJCV, 81(3):317–330, 2009.
 [6] A. Del Bue, J. Xavier, L. Agapito, and M. Paladini. Bilinear factorization via augmented lagrange multipliers. In ECCV, pages 283–296. Springer, 2010.
 [7] E. Elhamifar and R. Vidal. Sparse subspace clustering. In CVPR, pages 2790–2797. IEEE, 2009.
 [8] E. Elhamifar and R. Vidal. Sparse subspace clustering: Algorithm, theory, and applications. TPAMI, 35(11):2765–2781, 2013.

[9] P. Favaro, R. Vidal, and A. Ravichandran. A closed form solution to robust subspace estimation and clustering. In CVPR, pages 1801–1807. IEEE, 2011.
 [10] R. Garg, A. Eriksson, and I. Reid. Non-linear dimensionality regularizer for solving inverse problems. arXiv:1603.05015, 2016.
 [11] M. Gönen and E. Alpaydın. Multiple kernel learning algorithms. Journal of Machine Learning Research, 12(Jul):2211–2268, 2011.
 [12] R. Hartley and A. Zisserman. Multiple view geometry in computer vision. Cambridge university press, 2003.
 [13] M. Hong, Z.Q. Luo, and M. Razaviyayn. Convergence analysis of alternating direction method of multipliers for a family of nonconvex problems. SIAM Journal on Optimization, 26(1):337–364, 2016.
 [14] S. Jain and V. Madhav Govindu. Efficient higher-order clustering on the Grassmann manifold. In ICCV, pages 3511–3518, 2013.
 [15] P. Ji, H. Li, M. Salzmann, and Y. Dai. Robust motion segmentation with unknown correspondences. In ECCV, pages 204–219. Springer, 2014.
 [16] P. Ji, H. Li, M. Salzmann, and Y. Zhong. Robust multi-body feature tracker: a segmentation-free approach. In CVPR, pages 3843–3851, 2016.
 [17] P. Ji, M. Salzmann, and H. Li. Efficient dense subspace clustering. In WACV, pages 461–468. IEEE, 2014.
 [18] P. Ji, M. Salzmann, and H. Li. Shape interaction matrix revisited and robustified: Efficient subspace clustering with corrupted and incomplete data. In ICCV, pages 4687–4695, 2015.
 [19] H. Jung, J. Ju, and J. Kim. Rigid motion segmentation using randomized voting. In CVPR, pages 1210–1217, 2014.
 [20] K.C. Lee, J. Ho, and D. J. Kriegman. Acquiring linear subspaces for face recognition under variable lighting. TPAMI, 27(5):684–698, 2005.
 [21] C.G. Li and R. Vidal. Structured sparse subspace clustering: A unified optimization framework. In CVPR, pages 277–286, 2015.
 [22] G. Li and T. K. Pong. Global convergence of splitting methods for nonconvex composite optimization. SIAM Journal on Optimization, 25(4):2434–2460, 2015.
 [23] Z. Li, J. Guo, L.F. Cheong, and S. Zhiying Zhou. Perspective motion segmentation via collaborative clustering. In ICCV, pages 1369–1376, 2013.
 [24] Z. Lin, M. Chen, and Y. Ma. The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices. arXiv:1009.5055, 2010.
 [25] G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, and Y. Ma. Robust recovery of subspace structures by low-rank representation. TPAMI, 35(1):171–184, 2013.
 [26] G. Liu, Z. Lin, and Y. Yu. Robust subspace segmentation by low-rank representation. In ICML, pages 663–670, 2010.
 [27] C.Y. Lu, H. Min, Z.Q. Zhao, L. Zhu, D.S. Huang, and S. Yan. Robust and efficient subspace segmentation via least squares regression. In ECCV, pages 347–360. Springer, 2012.
 [28] S. A. Nene, S. K. Nayar, and H. Murase. Columbia object image library (COIL100). Technical Report CUCS00696, 1996.

[29] A. Y. Ng, M. I. Jordan, Y. Weiss, et al. On spectral clustering: Analysis and an algorithm. In NIPS, pages 849–856, 2001.
 [30] H. Nguyen, W. Yang, F. Shen, and C. Sun. Kernel low-rank representation for face recognition. Neurocomputing, 155:32–42, 2015.
 [31] V. M. Patel, H. Van Nguyen, and R. Vidal. Latent space sparse subspace clustering. In ICCV, pages 225–232, 2013.
 [32] V. M. Patel and R. Vidal. Kernel sparse subspace clustering. In ICIP, pages 2849–2853. IEEE, 2014.
 [33] P. Purkait, T.J. Chin, A. Sadri, and D. Suter. Clustering with hypergraphs: the case for large hyperedges. TPAMI, 2016.
 [34] F. S. Samaria and A. C. Harter. Parameterisation of a stochastic model for human face identification. In Applications of Computer Vision, 1994., Proceedings of the Second IEEE Workshop on, pages 138–142. IEEE, 1994.
 [35] J. ShaweTaylor and N. Cristianini. Kernel methods for pattern analysis. Cambridge university press, 2004.
 [36] Y. Shen, Z. Wen, and Y. Zhang. Augmented lagrangian alternating direction method for matrix separation based on lowrank factorization. Optimization Methods and Software, 29(2):239–263, 2014.
 [37] J. Shi and J. Malik. Normalized cuts and image segmentation. TPAMI, 22(8):888–905, 2000.
 [38] R. Souvenir and R. Pless. Manifold clustering. In ICCV, pages 648–653. IEEE, 2005.
 [39] R. Tron and R. Vidal. A benchmark for the comparison of 3d motion segmentation algorithms. In CVPR, pages 1–8. IEEE, 2007.
 [40] R. Vidal. Subspace clustering. IEEE Signal Processing Magazine, 28(2):52–68, 2011.
 [41] R. Vidal and P. Favaro. Low rank subspace clustering (LRSC). Pattern Recognition Letters, 43:47–61, 2014.
 [42] Y.X. Wang, H. Xu, and C. Leng. Provable subspace clustering: When lrr meets ssc. In NIPS, pages 64–72, 2013.

[43] S. Xiao, M. Tan, D. Xu, and Z. Y. Dong. Robust kernel low-rank representation. IEEE Transactions on Neural Networks and Learning Systems, 27(11):2268–2281, 2016.
 [44] M. Yin, Y. Guo, J. Gao, Z. He, and S. Xie. Kernel sparse subspace clustering on symmetric positive definite manifolds. In CVPR, pages 5157–5164, 2016.
 [45] C. You, C.G. Li, D. P. Robinson, and R. Vidal. Oracle based active set algorithm for scalable elastic net subspace clustering. In CVPR, pages 3928–3937, 2016.
 [46] C. You, D. Robinson, and R. Vidal. Scalable sparse subspace clustering by orthogonal matching pursuit. In CVPR, pages 3918–3927, 2016.