1 Introduction
Subspace clustering is a classic problem where one is given points in a highdimensional ambient space and would like to approximate them by a union of lowerdimensional linear subspaces. In particular, each subspace contains a subset of the points. This problem is hard because one needs to jointly find the subspaces, and the points corresponding to each; the data we are given are unlabeled. The unions of subspaces model naturally arises in settings where data from multiple latent phenomena are mixed together and need to be separated. Applications of subspace clustering include motion segmentation [23], face clustering [8], gene expression analysis [10], and system identification [22]. In these applications, data points with the same label (e.g., face images of a person under varying illumination conditions, feature points of a moving rigid object in a video sequence) lie on a lowdimensional subspace, and the mixed dataset can be modeled by unions of subspaces. For detailed description of the applications, we refer the readers to the reviews [10, 20] and references therein.
There is now a sizable literature on empirical methods for this particular problem and some statistical analysis as well. Many recently proposed methods, which perform remarkably well and have theoretical guarantees on their performances, can be characterized as involving two steps: (a) finding a “neighborhood” for each data point, and (b) finding the subspaces and/or clustering the points given these neighborhoods. Here, neighbors of a point are other points that the algorithm estimates to lie on the same subspace as the point (and not necessarily just closest in Euclidean distance).
Our contributions: In this paper we devise new algorithms for each of the two steps above; (a) we develop a new method, Nearest Subspace Neighbor (NSN), to determine a neighborhood set for each point, and (b)
a new method, Greedy Subspace Recovery (GSR), to recover subspaces from given neighborhoods. Each of these two methods can be used in conjunction with other methods for the corresponding other step; however, in this paper we focus on two algorithms that use NSN followed by GSR and Spectral clustering, respectively. Our main result is establishing
statistical guarantees for exact clustering with general subspace conditions, in the standard models considered in recent analytical literature on subspace clustering. Our condition for exact recovery is weaker than the conditions of other existing algorithms that only guarantee correct neighborhoods^{1}^{1}1By correct neighborhood, we mean that for each point every neighbor point lies on the same subspace., which do not always lead to correct clustering. We provide numerical results which demonstrate our theory. We also show that for the realworld applications our algorithm performs competitively against those of stateoftheart algorithms, but the computational cost is much lower than them. Moreover, our algorithms are much simpler to implement.Subspace  Conditions for:  
Algorithm  What is guaranteed  condition  Fully random model  Semirandom model 
SSC [4, 16]  Correct neighborhoods  None  
LRR [14]  Exact clustering  No intersection     
SSCOMP [3]  Correct neighborhoods  No intersection     
TSC [6, 7]  Exact clustering  None  
LRSSC [24]  Correct neighborhoods  None    
NSN+GSR  Exact clustering  None  
NSN+Spectral  Exact clustering  None   
1.1 Related work
The problem was first formulated in the data mining community [10]
. Most of the related work in this field assumes that an underlying subspace is parallel to some canonical axes. Subspace clustering for unions of arbitrary subspaces is considered mostly in the machine learning and the computer vision communities
[20]. Most of the results from those communities are based on empirical justification. They provided algorithms derived from theoretical intuition and showed that they perform empirically well with practical dataset. To name a few, GPCA [21], Spectral curvature clustering (SCC) [2], and many iterative methods [1, 19, 26] show their good empirical performance for subspace clustering. However, they lack theoretical analysis that guarantees exact clustering.As described above, several algorithms with a common structure are recently proposed with both theoretical guarantees and remarkable empirical performance. Elhamifar and Vidal [4] proposed an algorithm called Sparse Subspace Clustering (SSC), which uses minimization for neighborhood construction. They proved that if the subspaces have no intersection^{2}^{2}2By no intersection between subspaces, we mean that they share only the null point., SSC always finds a correct neighborhood matrix. Later, Soltanolkotabi and Candes [16] provided a statistical guarantee of the algorithm for subspaces with intersection. Dyer et al. [3] proposed another algorithm called SSCOMP, which uses Orthogonal Matching Pursuit (OMP) instead of minimization in SSC. Another algorithm called LowRank Representation (LRR) which uses nuclear norm minimization is proposed by Liu et al. [14]. Wang et al. [24] proposed an hybrid algorithm, LowRank and Sparse Subspace Clustering (LRSSC), which involves both norm and nuclear norm. Heckel and Bölcskei [6] presented Thresholding based Subspace Clustering (TSC), which constructs neighborhoods based on the inner products between data points. All of these algorithms use spectral clustering for the clustering step.
The analysis in those papers focuses on neither exact recovery of the subspaces nor exact clustering in general subspace conditions. SSC, SSCOMP, and LRSSC only guarantee correct neighborhoods which do not always lead to exact clustering. LRR guarantees exact clustering only when the subspaces have no intersections. In this paper, we provide novel algorithms that guarantee exact clustering in general subspace conditions. When we were preparing this manuscript, it is proved that TSC guarantees exact clustering under certain conditions [7], but the conditions are stricter than ours. (See Table 1)
1.2 Notation
There is a set of data points in , denoted by . The data points are lying on or near a union of subspaces . Each subspace is of dimension which is smaller than . For each point , denotes the index of the nearest subspace. Let denote the number of points whose nearest subspace is , i.e.,
. Throughout this paper, sets and subspaces are denoted by calligraphic letters. Matrices and key parameters are denoted by letters in upper case, and vectors and scalars are denoted by letters in lower case. We frequently denote the set of
indices by . As usual, denotes a subspace spanned by a set of vectors. For example, . is defined as the projection of onto subspace . That is, . denotes the indicator function which is one if the statement is true and zero otherwise. Finally, denotes the direct sum.2 Algorithms
We propose two algorithms for subspace clustering as follows.

NSN+GSR : Run Nearest Subspace Neighbor (NSN) to construct a neighborhood matrix , and then run Greedy Subspace Recovery (GSR) for .

NSN+Spectral : Run Nearest Subspace Neighbor (NSN) to construct a neighborhood matrix , and then run spectral clustering for .
2.1 Nearest Subspace Neighbor (NSN)
NSN approaches the problem of finding neighbor points most likely to be on the same subspace in a greedy fashion. At first, given a point without any other knowledge, the one single point that is most likely to be a neighbor of is the nearest point of the line . In the following steps, if we have found a few correct neighbor points (lying on the same true subspace) and have no other knowledge about the true subspace and the rest of the points, then the most potentially correct point is the one closest to the subspace spanned by the correct neighbors we have. This motivates us to propose NSN described in the following.
NSN collects neighbors sequentially for each point. At each step , a dimensional subspace spanned by the point and its neighbors is constructed, and the point closest to the subspace is newly collected. After , the subspace constructed at the th step is used for collecting neighbors. At last, if there are more points lying on , they are also counted as neighbors. The subspace can be stored in the form of a matrix whose columns form an orthonormal basis of . Then can be computed easily because it is equal to . While a naive implementation requires computational cost, this can be reduced to , and the faster implementation is described in Section A.1. We note that this computational cost is much lower than that of the convex optimization based methods (e.g., SSC [4] and LRR [14]) which solve a convex program with variables and constraints.
NSN for subspace clustering shares the same philosophy with Orthogonal Matching Pursuit (OMP) for sparse recovery in the sense that it incrementally picks the point (dictionary element) that is the most likely to be correct, assuming that the algorithms have found the correct ones. In subspace clustering, that point is the one closest to the subspace spanned by the currently selected points, while in sparse recovery it is the one closest to the residual of linear regression by the selected points. In the sparse recovery literature, the performance of OMP is shown to be comparable to that of Basis Pursuit (
minimization) both theoretically and empirically [18, 11]. One of the contributions of this work is to show that this highlevel intuition is indeed born out, provable, as we show that NSN also performs well in collecting neighbors lying on the same subspace.2.2 Greedy Subspace Recovery (GSR)
Suppose that NSN has found correct neighbors for a data point. How can we check if they are indeed correct, that is, lying on the same true subspace? One natural way is to count the number of points close to the subspace spanned by the neighbors. If they span one of the true subspaces, then many other points will be lying on the span. If they do not span any true subspaces, few points will be close to it. This fact motivates us to use a greedy algorithm to recover the subspaces. Using the neighborhood constructed by NSN (or some other algorithm), we recover the subspaces. If there is a neighborhood set containing only the points on the same subspace for each subspace, the algorithm successfully recovers the unions of the true subspaces exactly.
Recall that the matrix contains the labelings of the points, so that if point is assigned to subspace . Top denotes the dimensional principal subspace of the set of vectors . This can be obtained by taking the first left singular vectors of the matrix whose columns are the vector in the set. If there are only vectors in the set, GramSchmidt orthogonalization will give us the subspace. As in NSN, it is efficient to store a subspace in the form of its orthogonal basis because we can easily compute the norm of a projection onto the subspace.
Testing a candidate subspace by counting the number of near points has already been considered in the subspace clustering literature. In [25], the authors proposed to run RANdom SAmple Consensus (RANSAC) iteratively. RANSAC randomly selects a few points and checks if there are many other points near the subspace spanned by the collected points. Instead of randomly choosing sample points, GSR receives some candidate subspaces (in the form of sets of points) from NSN (or possibly some other algorithm) and selects subspaces in a greedy way as specified in the algorithm above.
3 Theoretical results
We analyze our algorithms in two standard noiseless models. The main theorems present sufficient conditions under which the algorithms cluster the points exactly with high probability. For simplicity of analysis, we assume that every subspace is of the same dimension, and the number of data points on each subspace is the same, i.e.,
. We assume that is known to the algorithm. Nonetheless, our analysis can extend to the general case.3.1 Statistical models
We consider two models which have been used in the subspace clustering literature:

Fully random model: The subspaces are drawn iid uniformly at random, and the points are also iid randomly generated.

Semirandom model: The subspaces are arbitrarily determined, but the points are iid randomly generated.
Let be a matrix whose columns form an orthonormal basis of . An important measure that we use in the analysis is the affinity between two subspaces, defined as
where is the th principal angle between and . Two subspaces and are identical if and only if . If , every vector on is orthogonal to any vectors on . We also define the maximum affinity as
There are points, and there are points exactly lying on each subspace. We assume that each data point is drawn iid uniformly at random from where is the unit sphere in . Equivalently,
As the points are generated randomly on their corresponding subspaces, there are no points lying on an intersection of two subspaces, almost surely. This implies that with probability one the points are clustered correctly provided that the true subspaces are recovered exactly.
3.2 Main theorems
The first theorem gives a statistical guarantee for the fully random model.
Theorem 3.1.
Suppose dimensional subspaces and points on each subspace are generated in the fully random model with polynomial in . There are constants such that if
(1) 
then with probability at least , NSN+GSR^{3}^{3}3NSN with followed by GSR with arbitrarily small . clusters the points exactly. Also, there are other constants such that if (1) with and replaced by and holds then NSN+Spectral^{4}^{4}4NSN with . clusters the points exactly with probability at least . is the exponential constant.
Our sufficient conditions for exact clustering explain when subspace clustering becomes easy or difficult, and they are consistent with our intuition. For NSN to find correct neighbors, the points on the same subspace should be many enough so that they look like lying on a subspace. This condition is spelled out in the first inequality of (1). We note that the condition holds even when is a constant, i.e., is linear in . The second inequality implies that the dimension of the subspaces should not be too high for subspaces to be distinguishable. If is high, the random subspaces are more likely to be close to each other, and hence they become more difficult to be distinguished. However, as increases, the points become dense on the subspaces, and hence it becomes easier to identify different subspaces.
Let us compare our result with the conditions required for success in the fully random model in the existing literature. In [16], it is required for SSC to have correct neighborhoods that should be superlinear in when fixed. In [6, 24], the conditions on becomes worse as we have more points. On the other hand, our algorithms are guaranteed exact clustering of the points, and the sufficient condition is orderwise at least as good as the conditions for correct neighborhoods by the existing algorithms (See Table 1). Moreover, exact clustering is guaranteed even when is linear in , and fixed.
For the semirandom model, we have the following general theorem.
Theorem 3.2.
Suppose dimensional subspaces are arbitrarily chosen, and points on each subspace are generated in the semirandom model with polynomial in . There are constants such that if
(2) 
then with probability at least , NSN+GSR^{5}^{5}5NSN with and followed by GSR with arbitrarily small . clusters the points exactly.
In the semirandom model, the sufficient condition does not depend on the ambient dimension . When the affinities between subspaces are fixed, and the points are exactly lying on the subspaces, the difficulty of the problem does not depend on the ambient dimension. It rather depends on , which measures how close the subspaces are. As they become closer to each other, it becomes more difficult to distinguish the subspaces. The second inequality of (2) explains this intuition. The inequality also shows that if we have more data points, the problem becomes easier to identify different subspaces.
Compared with other algorithms, NSN+GSR is guaranteed exact clustering, and more importantly, the condition on improves as grows. This remark is consistent with the practical performance of the algorithm which improves as the number of data points increases, while the existing guarantees of other algorithms are not. In [16], correct neighborhoods in SSC are guaranteed if . In [6], exact clustering of TSC is guaranteed if . However, these algorithms perform empirically better as the number of data points increases.
4 Experimental results
In this section, we empirically compare our algorithms with the existing algorithms in terms of clustering performance and computational time (on a single desktop). For NSN, we used the fast implementation described in Section A.1. The compared algorithms are means, flats^{6}^{6}6flats is similar to means. At each iteration, it computes top principal subspaces of the points with the same label, and then labels every point based on its distances to those subspaces., SSC, LRR, SCC, TSC^{7}^{7}7The MATLAB codes for SSC, LRR, SCC, and TSC are obtained from http://www.cis.jhu.edu/~ehsan/code.htm, https://sites.google.com/site/guangcanliu, and http://www.math.duke.edu/~glchen/scc.html, http://www.nari.ee.ethz.ch/commth/research/downloads/sc.html, respectively., and SSCOMP^{8}^{8}8For each data point, OMP constructs a neighborhood for each point by regressing the point on the other points up to accuracy.. The numbers of replicates in means, flats, and the means used in the spectral clustering are all fixed to . The algorithms are compared in terms of Clustering error (CE) and Neighborhood selection error (NSE), defined as
where is the permutation space of . CE is the proportion of incorrectly labeled data points. Since clustering is invariant up to permutation of label indices, the error is equal to the minimum disagreement over the permutation of label indices. NSE measures the proportion of the points which do not have all correct neighbors.^{9}^{9}9For the neighborhood matrices from SSC, LRR, and SSCOMP, the points with the maximum weights are regarded as neighbors for each point. For TSC, the nearest neighbors are collected for each point.
4.1 Synthetic data
We compare the performances on synthetic data generated from the fully random model. In , five dimensional subspaces are generated uniformly at random. Then for each subspace unitnorm points are generated iid uniformly at random on the subspace. To see the agreement with the theoretical result, we ran the algorithms under fixed and varied and . We set so that each pair of subspaces has intersection. Figures 1 and 2 show CE and NSE, respectively. Each error value is averaged over trials. Figure 1 indicates that our algorithm clusters the data points better than the other algorithms. As predicted in the theorems, the clustering performance improves as the number of points increases. However, it also improves as the dimension of subspaces grows in contrast to the theoretical analysis. We believe that this is because our analysis on GSR is not tight. In Figure 2, we can see that more data points obtain correct neighbors as increases or decreases, which conforms the theoretical analysis.
We also compare the computational time of the neighborhood selection algorithms for different numbers of subspaces and data points. As shown in Figure 3, the greedy algorithms (OMP, Thresholding, and NSN) are significantly more scalable than the convex optimization based algorithms (minimization and nuclear norm minimization).
4.2 Realworld data : motion segmentation and face clustering
We compare our algorithm with the existing ones in the applications of motion segmentation and face clustering. For the motion segmentation, we used Hopkins155 dataset [17], which contains video sequences of or motions. For the face clustering, we used Extended Yale B dataset with cropped images from [5, 13]. The dataset contains images for each of individuals in frontal view and different illumination conditions. To compare with the existing algorithms, we used the set of resized raw images provided by the authors of [4]. The parameters of the existing algorithms were set as provided in their source codes.^{10}^{10}10As SSCOMP and TSC do not have proposed number of parameters for motion segmentation, we found the numbers minimizing the mean CE. The numbers are given in the table. Tables 2 and 3 show CE and average computational time.^{11}^{11}11The LRR code provided by the author did not perform properly with the face clustering dataset that we used. We did not run NSN+GSR since the data points are not well distributed in its corresponding subspaces. We can see that NSN+Spectral performs competitively with the methods with the lowest errors, but much faster. Compared to the other greedy neighborhood construction based algorithms, SSCOMP and TSC, our algorithm performs significantly better.

Algorithms  means  flats  SSC  LRR  SCC  SSCOMP(8)  TSC(10)  NSN+Spectral(5) 

Mean CE ()  19.80  13.62  1.52  2.13  2.06  16.92  18.44  3.62  
2  Median CE ()  17.92  10.65  0.00  0.00  0.00  12.77  16.92  0.00 
Avg. Time (sec)    0.80  3.03  3.42  1.28  0.50  0.50  0.25  
Mean CE ()  26.10  14.07  4.40  4.03  6.37  27.96  28.58  8.28  
3  Median CE ()  20.48  14.18  0.56  1.43  0.21  30.98  29.67  2.76 
Avg. Time (sec)    1.89  5.39  4.05  2.16  0.82  1.15  0.51  


Algorithms  means  flats  SSC  SSCOMP  TSC  NSN+Spectral 

Mean CE ()  45.98  37.62  1.77  4.45  11.84  1.71  
2  Median CE ()  47.66  39.06  0.00  1.17  1.56  0.78 
Avg. Time (sec)    15.78  37.72  0.45  0.33  0.78  
Mean CE ()  62.55  45.81  5.77  6.35  20.02  3.63  
3  Median CE ()  63.54  47.92  1.56  2.86  15.62  3.12 
Avg. Time (sec)    27.91  49.45  0.76  0.60  3.37  
Mean CE ()  73.77  55.51  4.79  8.93  11.90  5.81  
5  Median CE ()  74.06  56.25  2.97  5.00  33.91  4.69 
Avg. Time (sec)    52.90  74.91  1.41  1.17  5.62  
Mean CE ()  79.92  60.12  7.75  12.90  38.55  8.46  
8  Median CE ()  80.18  60.64  5.86  12.30  40.14  7.62 
Avg. Time (sec)    101.3  119.5  2.84  2.24  11.51  
Mean CE ()  82.68  62.72  9.43  15.32  39.48  9.82  
10  Median CE ()  82.97  62.89  8.75  17.11  39.45  9.06 
Avg. Time (sec)    134.0  157.5  5.26  3.17  14.73  

References
 Bradley and Mangasarian [2000] P. S. Bradley and O. L. Mangasarian. Kplane clustering. Journal of Global Optimization, 16(1):23–32, 2000.
 Chen and Lerman [2009] G. Chen and G. Lerman. Spectral curvature clustering. International Journal of Computer Vision, 81(3):317–330, 2009.

Dyer et al. [2013]
E. L. Dyer, A. C. Sankaranarayanan, and R. G. Baraniuk.
Greedy feature selection for subspace clustering.
The Journal of Machine Learning Research (JMLR), 14(1):2487–2517, 2013.  Elhamifar and Vidal [2013] E. Elhamifar and R. Vidal. Sparse subspace clustering: Algorithm, theory, and applications. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(11):2765–2781, 2013.

Georghiades et al. [2001]
A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman.
From few to many: Illumination cone models for face recognition under variable lighting and pose.
IEEE Trans. Pattern Anal. Mach. Intelligence, 23(6):643–660, 2001.  Heckel and Bölcskei [2013] R. Heckel and H. Bölcskei. Subspace clustering via thresholding and spectral clustering. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), May 2013.
 Heckel and Bölcskei [2014] R. Heckel and H. Bölcskei. Robust subspace clustering via thresholding. arXiv preprint arXiv:1307.4891v2, 2014.

Ho et al. [2003]
J. Ho, M.H. Yang, J. Lim, K.C. Lee, and D. Kriegman.
Clustering appearances of objects under varying illumination
conditions.
In
IEEE conference on Computer Vision and Pattern Recognition (CVPR)
, 2003. 
Inglot [2010]
T. Inglot.
Inequalities for quantiles of the chisquare distribution.
Probability and Mathematical Statistics, 30(2):339–351, 2010. 
Kriegel et al. [2009]
H.P. Kriegel, P. Kröger, and A. Zimek.
Clustering highdimensional data: A survey on subspace clustering, patternbased clustering, and correlation clustering.
ACM Transactions on Knowledge Discovery from Data (TKDD), 3(1):1, 2009.  Kunis and Rauhut [2008] S. Kunis and H. Rauhut. Random sampling of sparse trigonometric polynomials, ii. orthogonal matching pursuit versus basis pursuit. Foundations of Computational Mathematics, 8(6):737–763, 2008.
 Ledoux [2005] M. Ledoux. The concentration of measure phenomenon, volume 89. AMS Bookstore, 2005.
 Lee et al. [2005] K. C. Lee, J. Ho, and D. Kriegman. Acquiring linear subspaces for face recognition under variable lighting. IEEE Trans. Pattern Anal. Mach. Intelligence, 27(5):684–698, 2005.
 Liu et al. [2013] G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, and Y. Ma. Robust recovery of subspace structures by lowrank representation. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(1):171–184, 2013.
 Milman and Schechtman [1986] V. D. Milman and G. Schechtman. Asymptotic Theory of Finite Dimensional Normed Spaces: Isoperimetric Inequalities in Riemannian Manifolds. Lecture Notes in Mathematics. Springer, 1986.

Soltanolkotabi and Candes [2012]
M. Soltanolkotabi and E. J. Candes.
A geometric analysis of subspace clustering with outliers.
The Annals of Statistics, 40(4):2195–2238, 2012.  Tron and Vidal [2007] R. Tron and R. Vidal. A benchmark for the comparison of 3d motion segmentation algorithms. In IEEE conference on Computer Vision and Pattern Recognition (CVPR), 2007.
 Tropp and Gilbert [2007] J. A. Tropp and A. C. Gilbert. Signal recovery from random measurements via orthogonal matching pursuit. Information Theory, IEEE Transactions on, 53(12):4655–4666, 2007.
 Tseng [2000] P. Tseng. Nearest qflat to m points. Journal of Optimization Theory and Applications, 105(1):249–252, 2000.
 Vidal [2011] R. Vidal. Subspace clustering. Signal Processing Magazine, IEEE, 28(2):52–68, 2011.

Vidal et al. [2003a]
R. Vidal, Y. Ma, and S. Sastry.
Generalized principal component analysis.
In IEEE conference on Computer Vision and Pattern Recognition (CVPR), 2003a.  Vidal et al. [2003b] R. Vidal, S. Soatto, Y. Ma, and S. Sastry. An algebraic geometric approach to the identification of a class of linear hybrid systems. In Decision and Control, 2003. Proceedings. 42nd IEEE Conference on, volume 1, pages 167–172. IEEE, 2003b.
 Vidal et al. [2008] R. Vidal, R. Tron, and R. Hartley. Multiframe motion segmentation with missing data using power factorization and GPCA. International Journal of Computer Vision, 79(1):85–105, 2008.
 Wang et al. [2013] Y.X. Wang, H. Xu, and C. Leng. Provable subspace clustering: When LRR meets SSC. In Advances in Neural Information Processing Systems (NIPS), December 2013.
 Yang et al. [2006] A. Y. Yang, S. R. Rao, and Y. Ma. Robust statistical estimation and segmentation of multiple subspaces. In IEEE conference on Computer Vision and Pattern Recognition (CVPR), 2006.
 Zhang et al. [2012] T. Zhang, A. Szlam, Y. Wang, and G. Lerman. Hybrid linear modeling via local bestfit flats. International journal of computer vision, 100(3):217–240, 2012.
Appendix A Discussion on implementation issues
a.1 A faster implementation for NSN
At each step of NSN, the algorithm computes the projections of all points onto a subspace and find one with the largest norm. A naive implementation of the algorithm requires time complexity.
In fact, we can reduce the complexity to . Instead of finding the maximum norm of the projections, we can find the maximum squared norm of the projections. Let be the subspace at step . For any data point , we have
where is the new orthogonal axis added from to make . That is, and . As is already computed in the ’th step, we do not need to compute it again at step . Based on this fact, we have a faster implementation as described in the following. Note that at the th step is equal to in the original NSN algorithm.
a.2 Estimation of the number of clusters
When is unknown, it can be estimated at the clustering step. For Spectral clustering, a wellknown approach to estimate
is to find a “knee” point in the singular values of the neighborhood matrix. It is the point where the difference between two consecutive singular values are the largest. For GSR, we do not need to estimate the number of clusters a priori. Once the algorithms finishes, the number of the resulting groups will be our estimate of
.a.3 Parameter setting
The choices of and depend on the dimension of the subspaces . If data points are lying exactly on the model subspaces, is enough for GSR to recover a subspace. In practical situations where the points are near the subspaces, it is better to set to be larger than . can also be larger than because the additional dimensions, which may be induced from the noise, do no intersect with the other subspaces in practice. For Extended Yale B dataset and Hopkins155 dataset, we found that NSN+Spectral performs well if is set to be around .
Appendix B Proofs
b.1 Proof outline
We describe the first few highlevel steps in the proofs of our main theorems. Exact clustering of our algorithms depends on whether NSN can find all correct neighbors for the data points so that the following algorithm (GSR or Spectral clustering) can cluster the points exactly. For NSN+GSR, exact clustering is guaranteed when there is a point on each subspace that have all correct neighbors which are at least . For NSN+Spectral, exact clustering is guaranteed when each data point has only the other points on the same subspace as neighbors. In the following, we explain why these are true.
Step 11: Exact clustering condition for GSR
The two statistical models have a property that for any dimensional subspace in other than the true subspaces the probability of any points lying on the subspace is zero. Hence, we claim the following.
Fact B.1 (Best dimensional fit).
With probability one, the true subspaces are the subspaces containing the most points among the set of all possible dimensional subspaces.
Then it suffices for each subspace to have one point whose neighbors are all correct points on the same subspace. This is because the subspace spanned by those points is almost surely identical to the true subspace they are lying on, and that subspace will be picked by GSR.
Fact B.2.
If NSN with finds all correct neighbors for at least one point on each subspace, GSR recovers all the true subspaces and clusters the data points exactly with probability one.
In the following steps, we consider one data point for each subspace. We show that NSN with finds all correct neighbors for the point with probability at least . Then the union bound and Fact B.2 establish exact clustering with probability at least .
Step 12: Exact clustering condition for spectral clustering
It is difficult to analyze spectral clustering for the resulting neighborhood matrix of NSN. A trivial case for a neighborhood matrix to result in exact clustering is when the points on the same subspace form a single fully connected component. If NSN with finds all correct neighbors for every data point, the subspace at the last step () is almost surely identical to the true subspace that the points lie on. Hence, the resulting neighborhood matrix form fully connected components each of which contains all of the points on the same subspace.
In the rest of the proof, we show that if (1) holds, NSN finds all correct neighbors for a fixed point with probability . Let us assume that this is true. If (1) with and replaced by and holds, we have
Then it follows from the union bound that NSN finds all correct neighbors for all of the points on each subspace with probability at least , and hence we obtain for every . Exact clustering is guaranteed.
Step 2: Success condition for NSN
Now the only proposition that we need to prove is that for each subspace NSN finds all correct neighbors for a data point (which is a uniformly random unit vector on the subspace) with probability at least . As our analysis is independent of the subspaces, we only consider . Without loss of generality, we assume that lies on () and focus on this data point.
When NSN finds neighbors of , the algorithm constructs subspaces incrementally. At each step , if the largest projection onto of the uncollected points on the same true subspace is greater than the largest projection among the points on different subspaces, then NSN collects a correct neighbor. In a mathematical expression, we want to satisfy
(3) 
for each step of .
The rest of the proof is to show (1) and (2) lead to (3) with probability in their corresponding models. It is difficult to prove (3) itself because the subspaces, the data points, and the index set are all dependent of each other. Instead, we introduce an Oracle algorithm whose success is equivalent to the success of NSN, but the analysis is easier. Then the Oracle algorithm is analyzed using stochastic ordering, bounds on order statistics of random projections, and the measure concentration inequalities for random subspaces. The rest of the proof is provided in Sections B.3 and B.4.
b.2 Preliminary lemmas
Before we step into the technical parts of the proof, we introduce the main ingredients which will be used. The following lemma is about upper and lower bounds on the order statistics for the projections of iid uniformly random unit vectors.
Lemma B.3.
Let be drawn iid uniformly at random from the dimensional unit ball . Let denote the ’th largest value of where .

Suppose that the rows of are orthonormal to each other. For any , there exists a constant such that for where
(4) we have
(5) with probability at least .

For any matrix ,
(6) with probability at least .
b.3 Proof of Theorem 3.2
Following Section B.1, we show in this section that if (2) holds then NSN finds all correct neighbors for (which is assumed to be on ) with probability at least .
Step 3: NSN Oracle algorithm
Consider the Oracle algorithm in the following. Unlike NSN, this algorithm knows the true label of each data point. It picks the point closest to the current subspace among the points with the same label. Since we assume , the Oracle algorithm for picks a point in at every step.
Note that the Oracle algorithm returns failure if and only if the original algorithm picks an incorrect neighbor for . The reason is as follows. Suppose that NSN for picks the first incorrect point at step . By the step , correct points have been chosen because they are the nearest points for the subspaces in the corresponding steps. The Oracle algorithm will also pick those points because they are the nearest points among the correct points. Hence . At step , NSN picks an incorrect point as it is the closest to . The Oracle algorithm will declare failure because that incorrect point is closer than the closest point among the correct points. In the same manner, we see that NSN fails if the Oracle NSN fails. Therefore, we can instead analyze the success of the Oracle algorithm. The success condition is written as
(7) 
Note that ’s are independent of the points . We will use this fact in the following steps.
Step 4: Lower bounds on the projection of correct points (the LHS of (7))
Let be such that the columns of form an orthogonal basis of . Such a exists because is a dimensional subspace of . Then we have
In this step, we obtain lower bounds on for and for .
It is difficult to analyze because and
are dependent. We instead analyze another random variable that is stochastically dominated by
. Then we use a highprobability lower bound on that variable which also lower bounds with high probability.Define as the ’th largest norm of the projections of iid uniformly random unit vector on onto a dimensional subspace. Since the distribution of the random unit vector is isotropic, the distribution of is identical for any dimensional subspaces independent of the random unit vectors. We have the following key lemma.
Lemma B.5.
stochastically dominates , i.e.,
for any . Moreover, for any .
The proof of the lemma is provided in Appendix B.5. Now we can use the lower bound on given in Lemma B.3a to bound . Let us pick and for which the lemma holds. The first inequality of (2) with leads to , and also
(8) 
Hence, it follow from Lemma B.3a that for each , we have
(9) 
with probability at least .
For , we want to bound . We again use Lemma B.5 to obtain the bound. Since the condition for the lemma holds as shown in (8), we have
(10) 
with probability at least , for every .
Step 5: Upper bounds on the projection of incorrect points (the RHS of (7))
Since we have , the RHS of (7) can be written as
(11) 
In this step, we want to bound (11) for every by using the concentration inequalities for and . Since and are independent, the inequality for holds for any fixed .
It follows from Lemma B.3b and the union bound that with probability at least ,
for all . The last inequality follows from the fact . Since for every such that , we have
(12) 
Now let us consider . In our statistical model, the new axis added to at the th step ( in Algorithm 3) is chosen uniformly at random from the subspace in orthogonal to . Therefore,
is a random matrix drawn uniformly from the
Stiefel manifold, and the probability measure is the normalized Haar (rotationinvariant) measure. From Lemma B.4b and the union bound, we obtain that with probability at least ,
Comments
There are no comments yet.