Subspace clustering aims to approximate data points collected from a high-dimensional space as a union of several low-dimensional linear subspaces. This problem has gained increasing attention in the machine learning community due to its wide range of applications, e.g., motion segmentation [23, 3, 31], face clustering [32, 24, 29, 30], handwritten digits clustering [28, 6, 11], and manifold learning [2, 26]. At a high level, subspace clustering conducts dimension reduction and clustering simultaneously. Unlike traditional dimension reduction methods [1, 8], it is more flexible because it allows more than one linear structure to be embedded in the ambient space. In addition, unlike many other clustering algorithms [16, 5], subspace clustering makes use of the linear structure of the data clusters, which can lead to superior clustering accuracy.
I-A Previous Work
Most subspace clustering algorithms are composed of two main stages. In the first stage, an affinity matrix is constructed through some measure of similarity calculated for each pair of data points. In the second stage, the spectral clustering algorithm is applied to the computed affinity matrix to identify the clusters. The spectral clustering algorithm is guaranteed to deliver exact clustering results if the graph induced by the affinity matrix satisfies the following two properties: the affinity matrix is error-less and the nodes within each cluster are connected. Subspace clustering algorithms mainly focus on computing an error-less affinity matrix, i.e., a matrix in which the computed affinity value between every pair of data points that lie in different clusters is zero.
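The two properties above can be illustrated with a small sketch. It is not a full spectral clustering implementation (there is no k-means step); it only shows that, for an error-less affinity matrix whose clusters are internally connected, the number of zero eigenvalues of the graph Laplacian equals the number of clusters and the connected components coincide with the clusters. The function names `count_clusters` and `connected_components` and the toy matrix are assumptions made for illustration.

```python
import numpy as np

def count_clusters(affinity, tol=1e-8):
    """Count the (near-)zero eigenvalues of the unnormalized graph Laplacian."""
    A = (np.abs(affinity) + np.abs(affinity).T) / 2.0  # symmetrize
    np.fill_diagonal(A, 0.0)                            # ignore self-affinity
    laplacian = np.diag(A.sum(axis=1)) - A
    eigvals = np.linalg.eigvalsh(laplacian)
    return int(np.sum(eigvals < tol))

def connected_components(affinity, tol=1e-12):
    """Label each node by the connected component of the affinity graph."""
    A = np.abs(affinity) + np.abs(affinity).T
    n = A.shape[0]
    labels = -np.ones(n, dtype=int)
    current = 0
    for start in range(n):
        if labels[start] >= 0:
            continue
        labels[start] = current
        stack = [start]
        while stack:  # depth-first traversal of one component
            i = stack.pop()
            for j in np.nonzero(A[i] > tol)[0]:
                if labels[j] < 0:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

# Toy error-less affinity matrix: nodes {0, 1, 2} and {3, 4} form two clusters,
# each connected, with zero affinity across clusters.
toy_affinity = np.array([
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
], dtype=float)

num_clusters = count_clusters(toy_affinity)
labels = connected_components(toy_affinity)
```

If either property fails (a non-zero cross-cluster entry, or a cluster that splits into disconnected pieces), the component labels no longer match the true clusters, which is why the first stage concentrates on producing an error-less affinity matrix.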
A widely used approach for computing the affinity matrix leverages the linear structure of clusters and uses the self-representation technique to find the neighbouring points of each data points and compute their corresponding affinity values [3, 13, 12]. For instance, the Sparse Subspace Clustering (SSC) method  computes the affinity values between data column and the rest of the data points as
where is the column of and is the element of . Each data point in a linear cluster can be constructed using a few other data points which lie in the same cluster. The optimization problem in (1) utilizes the $\ell_1$-norm to ensure that the non-zero values of the optimal representation vector correspond to the data points which lie in the same cluster that lies in. Performance guarantees for SSC were established in [22, 27]. In [29] and [30], the $\ell_1$-norm of SSC was replaced by the elastic net objective function and orthogonal matching pursuit, respectively. In  and  two methods were proposed to further speed up the SSC algorithm.
A different line of approaches computes the similarity between all pairs of data points to build the affinity matrix. For instance, the absolute value of the inner product has been used to measure the similarity between data points. However, when the clusters are close to each other (i.e., the principal angles between the subspaces are small), the inner product may no longer be reliable for estimating the affinity. To tackle this challenging scenario, [20, 19, 21] proposed the Innovation Pursuit (iPursuit) algorithm, in which the directions of innovation are utilized to measure the resemblance between the data points. They showed that iPursuit can notably outperform the self-representation based methods, especially when the clusters are close to each other. In this paper, we focus on analyzing the theoretical properties of the iPursuit algorithm. Based on our analysis, a simple method is developed to further improve iPursuit when the subspaces are heavily intersected.
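As a point of comparison for the inner-product similarity mentioned above, the following sketch builds an affinity matrix from absolute inner products between normalized data columns. Keeping only the `q` largest entries per point is an assumption made for this illustration, as is the function name `inner_product_affinity`; the text does not specify how the similarities are thresholded.

```python
import numpy as np

def inner_product_affinity(D, q):
    """Affinity from |<d_i, d_j>|, keeping the q largest entries per point.

    D is an (ambient_dim x num_points) matrix with non-zero columns.
    """
    D = D / np.linalg.norm(D, axis=0, keepdims=True)  # unit-norm columns
    G = np.abs(D.T @ D)
    np.fill_diagonal(G, 0.0)                          # no self-affinity
    A = np.zeros_like(G)
    idx = np.argsort(-G, axis=1)[:, :q]               # q most similar points
    rows = np.arange(G.shape[0])[:, None]
    A[rows, idx] = G[rows, idx]
    return np.maximum(A, A.T)                         # symmetrize

# Two orthogonal lines in the plane: points 0, 1 on one line, 2, 3 on the other.
D_lines = np.array([[1.0, -1.0, 0.0, 0.0],
                    [0.0,  0.0, 1.0, 1.0]])
A_lines = inner_product_affinity(D_lines, q=1)
```

When the subspaces are well separated, as in this toy example, the cross-cluster inner products vanish. When the principal angles are small, points from different clusters can have inner products as large as points from the same cluster, which is the failure mode that motivates using directions of innovation instead.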
The iPursuit algorithm has demonstrated superior performance in numerical experiments. In this paper, we present an analysis of iPursuit under various probabilistic models. Specifically, we focus on understanding its superior performance in the challenging cases where the subspaces have a large dimension of intersection, in which other subspace clustering algorithms typically fail to distinguish the clusters. Importantly, our work reveals that the “innovative component” of the subspaces is the main factor which affects the performance of iPursuit. That is, in sharp contrast to many other methods, which require the subspaces to be sufficiently incoherent with each other, iPursuit requires incoherence between the innovative components of the subspaces.
Moreover, we show that the performance of iPursuit can be guaranteed even if the dimension of intersection is linear in the rank of the data. To the best of our knowledge, this is the first time that a subspace clustering algorithm provably tolerates this level of intersection between subspaces. Inspired by the presented analysis, we introduce a simple yet effective technique to boost the performance of iPursuit. The proposed technique reduces the affinities between subspaces and enhances the innovation component of the data points, which helps the algorithm distinguish the clusters.
I-C Notation and Definitions
In this paper, is the given data matrix and is its -th column, is the (i,j)-th entry. The column space of is denoted as . The columns of lie in a union of linear subspaces. The value of might be estimated in various ways, e.g., from the Laplacian matrix . In this paper, it is assumed that is known. Let denote the span of the clusters. The sub-matrix of whose columns lie in is denoted as . Subspace contains data points and the dimension of is . The columns of orthonormal matrix form a basis for . To explicitly model the intersection between subspaces, each basis is written as
Here models the non-trivial intersection among all the subspaces while is the innovation component of . Define as the dimension of the column space of (a large value of means the subspaces are heavily intersected and they are very close to each other). Note that the formulation of in (2) does not lose generality and it simplifies the representation of the challenging clustering scenarios.
The -th point in cluster is denoted as , , and . The direct sum of subspaces and is denoted by . We define and as an orthonormal basis matrix for . Matrix denotes matrix after removing the columns of . We also define as the cosine of the smallest principal angle  between and . We assume that the columns of are normalized to unit -norm.
II Analysis of the iPursuit Method
Here we present a theoretical analysis of the iPursuit method.
II-A The iPursuit Algorithm
The iPursuit algorithm [20, 19, 21] computes an optimal direction for each data point and the optimal direction is used to measure the similarity between that data point and the rest of the data. The optimal direction, termed direction of innovation corresponding to , is computed as the optimal point of
The optimization problem (3) finds an optimal direction corresponding to which has a non-zero inner product with and it has the minimum projection on the rest of the data. Since the $\ell_1$-norm is used in the cost function, (3) prefers a direction which is orthogonal to a maximum number of data points. Suppose lies in and assume that and define as the span of (subspace is the innovative component of which does not lie in the direct sum of other subspaces). If some sufficient conditions are satisfied, the optimal solution of (3) lies in which means that it is orthogonal to all the data points which lie in the other clusters. Accordingly, vector can be used as a proper vector of affinity values between and the rest of data points. Algorithm 1 presents the iPursuit algorithm, which utilizes spectral clustering to process the computed affinity matrix. The optimization problem in Algorithm 1 is equivalent to (3), but it computes the directions corresponding to all the data points simultaneously.
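The direction search described above can be sketched as a linear program. The formulation below, which minimizes the $\ell_1$-norm of the projections of all data points on a direction `c` subject to a unit inner product with the chosen data point, follows the form used in [19, 20] and is assumed here to match (3), whose symbols did not survive in this text. The function name `innovation_direction`, the use of `scipy.optimize.linprog`, and taking absolute values to form the affinity vector are choices made for this illustration; the sketch handles one data point at a time, whereas Algorithm 1 computes all directions simultaneously.

```python
import numpy as np
from scipy.optimize import linprog

def innovation_direction(D, i):
    """Solve  min_c ||D^T c||_1  s.t.  c^T d_i = 1  as a linear program.

    Variables are [c, t] with |d_j^T c| <= t_j, and the cost is sum_j t_j.
    """
    m, n = D.shape
    d = D[:, i]
    cost = np.concatenate([np.zeros(m), np.ones(n)])
    # D^T c - t <= 0  and  -D^T c - t <= 0  encode |D^T c| <= t.
    A_ub = np.block([[D.T, -np.eye(n)], [-D.T, -np.eye(n)]])
    b_ub = np.zeros(2 * n)
    A_eq = np.concatenate([d, np.zeros(n)])[None, :]
    b_eq = np.array([1.0])
    bounds = [(None, None)] * m + [(0, None)] * n  # c is free, t >= 0
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:m]

# Toy data: two 2-D subspaces of R^3 sharing the first coordinate axis.
rng = np.random.default_rng(0)
basis1 = np.eye(3)[:, [0, 1]]
basis2 = np.eye(3)[:, [0, 2]]
X1 = basis1 @ rng.standard_normal((2, 20))
X2 = basis2 @ rng.standard_normal((2, 20))
D = np.hstack([X1, X2])
D = D / np.linalg.norm(D, axis=0, keepdims=True)

c = innovation_direction(D, 0)   # direction for the first data point
affinity = np.abs(D.T @ c)       # affinity of point 0 with every point
```

When the optimal direction lies in the innovation subspace of the first cluster, the entries of `affinity` that correspond to the second cluster are zero; whether this happens for a given data set depends on the sufficient conditions analyzed in the following sections.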
II-B Data Model and Innovation Assumption
Define as the optimal point of (3) (the optimal direction corresponding to ). In the presented results, we establish the sufficient conditions to guarantee that all the computed directions yield proper affinity vectors. A proper affinity vector is defined as follows.
Suppose lies in . The optimal direction yields a proper affinity vector if all the data points corresponding to the nonzero entries of lie in .
If all the directions yield proper affinity vectors, then iPursuit delivers an error-less affinity matrix. Next, we provide the sufficient conditions for a deterministic data model, which reveal the requirements of the algorithm about the distribution of the subspaces and the distribution of the data points inside the subspaces. Then, we present performance guarantees with random models for the distribution of data points/subspaces. The presumed models can be summarized as follows.
Deterministic Model: both and are deterministic.
Semi-Random Model: the subspaces are deterministic but are sampled uniformly at random from where was defined as .
Fully Random Model: the subspaces are sampled uniformly at random from Grassmannian manifold with dimension and are sampled uniformly at random from .
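The random models listed above can be simulated with a short sketch. It draws an orthonormal basis whose span is uniformly distributed on the Grassmannian (via the QR factorization of a Gaussian matrix) and samples points uniformly from the unit sphere inside a given subspace. The function names and the dimensions in the example are assumptions made for illustration, not values taken from the paper.

```python
import numpy as np

def random_subspace(ambient_dim, dim, rng):
    """Orthonormal basis of a uniformly random dim-dimensional subspace."""
    Q, _ = np.linalg.qr(rng.standard_normal((ambient_dim, dim)))
    return Q

def sample_on_sphere(U, num_points, rng):
    """Points drawn uniformly from the unit sphere of span(U).

    U must have orthonormal columns; normalized Gaussian coefficients
    give a uniform distribution on the sphere of the subspace.
    """
    G = rng.standard_normal((U.shape[1], num_points))
    G /= np.linalg.norm(G, axis=0, keepdims=True)
    return U @ G

rng = np.random.default_rng(1)
U_rand = random_subspace(ambient_dim=30, dim=5, rng=rng)  # fully random model
X_rand = sample_on_sphere(U_rand, num_points=50, rng=rng)  # points in the subspace
```

In the Semi-Random Model only `sample_on_sphere` is random and the bases are fixed in advance; in the Fully Random Model both steps are random.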
Note that in the fully random model, we assume have the same dimension and same number of points . In the first three theorems, it is assumed that each cluster carries an innovation with respect to the other clusters. Specifically, it is assumed that each subspace satisfies the Innovation Assumption described as follows.
The Innovation Assumption is satisfied if for any :
Note that Innovation Assumption does not mean independence between the subspaces and it allows the subspaces to have large dimension of intersection. In the rest of the paper, by the innovation subspace corresponding to , we mean the projection of onto the complement of . Define as an orthonormal basis of . Accordingly, Innovation Assumption is equivalent to . The presented results show that iPursuit yields an error-less affinity matrix even if (the dimension of intersection between the span of clusters) is large.
Two main factors play an important role in the performance of the subspace clustering algorithms: the affinity between each pair of subspaces and the distribution of the data points inside the subspaces. The following definition specifies these parameters which are used in the presented results.
Define and as
Define as the minimum positive real numbers such that
Quantity is known as the permeance statistic [10], which characterises how well the data points are distributed in each subspace. A large value of means that data points are well scattered in their corresponding subspaces. In other words, is small if the data points are concentrated along certain directions in the subspace and vice versa. Recall is the cosine value of the smallest principal angle between and . The parameters and determine the coherence between the innovative components of the subspaces. The smaller and are, the more distinguishable the innovative components are.
Another measure of affinity between subspaces is
where are the principal angles between and . A key difference between and is that takes into account the intersection between the subspaces, hence will be nearly equal to one when is large, while only considers the similarity between the innovation parts. The quantity characterises the innovative component of the data points, i.e., a small value of means that data points are close to their corresponding innovation subspaces. In order to understand the importance of , consider a scenario where some of the data points completely lie in . In this scenario, no clustering algorithm can assign a clustering label to those data points because they could belong to any of the clusters (since they completely lie in the intersection subspace).
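The contrast between the two affinity measures can be made concrete with a small numerical sketch. The cosines of the principal angles between two subspaces are the singular values of the product of their orthonormal bases. The example uses a normalized affinity (root of the sum of squared cosines divided by the root of the smaller dimension), which is assumed here because the exact expression was lost from this text; the function name `principal_cosines` and the toy bases are also illustrative assumptions.

```python
import numpy as np

def principal_cosines(U1, U2):
    """Cosines of the principal angles between span(U1) and span(U2).

    U1 and U2 must have orthonormal columns.
    """
    s = np.linalg.svd(U1.T @ U2, compute_uv=False)
    return np.clip(s, 0.0, 1.0)  # guard against rounding slightly above 1

I4 = np.eye(4)
shared = I4[:, [0, 1]]   # intersection shared by both subspaces
innov1 = I4[:, [2]]      # innovation component of the first subspace
innov2 = I4[:, [3]]      # innovation component of the second subspace
U1 = np.hstack([shared, innov1])
U2 = np.hstack([shared, innov2])

cos_full = principal_cosines(U1, U2)           # whole subspaces
cos_innov = principal_cosines(innov1, innov2)  # innovation parts only

# Assumed normalized affinity between the whole subspaces.
aff_full = np.sqrt(np.sum(cos_full ** 2)) / np.sqrt(min(U1.shape[1], U2.shape[1]))
```

Here the whole subspaces share two of their three dimensions, so their normalized affinity is sqrt(2/3), which is about 0.82, while the innovation components are orthogonal and have cosine zero. This is the situation in which a condition on the innovative components can hold even though the subspaces themselves are highly coherent.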
II-C Theoretical Guarantees with the Deterministic Model
Theorem 1 provides sufficient conditions to guarantee that iPursuit yields an error-less affinity matrix.
Assume Innovation Assumption is satisfied and
then all the optimal directions yield proper affinity vectors and iPursuit constructs an error-less affinity matrix.
The sufficient conditions are satisfied if are sufficiently small and is close enough to . Geometrically, this means that the algorithm requires the innovative components of each pair of subspaces (measured by ) to be sufficiently far from each other and the data points to be well scattered inside the subspaces (measured by ). Importantly, these conditions allow the subspaces to have a large dimension of intersection because, in contrast to the other methods, iPursuit requires incoherence between the innovative components, not between the subspaces themselves. Consider an extreme case where . The first inequality requires hence . Thus, if the data points are distributed well in the subspace, can be almost equal to which means that the dimension of intersection between the subspaces might be close to or at least linear with . To the best of our knowledge, no existing subspace clustering algorithm provably allows this degree of intersection.
II-D Theoretical Guarantees with the Semi-Random Model
In the semi-random model, it is assumed that the data points in each subspace are distributed uniformly at random.
Suppose the data follows the Semi-Random Model and the Innovation Assumption is satisfied. Define
Define as in Definition 3, and
To understand the significance of the result, consider an extreme scenario in which (i.e., the innovative components are fully incoherent with each other). If there are a sufficient number of data points in each cluster (i.e., is large), we have , hence we need . To guarantee that , (the dimension of intersection) cannot be larger than . This means that iPursuit is able to allow to be in linear order with while other algorithms such as SSC and TSC need to be in the order of [27, 22, 4]. This assumption will be violated if is in linear order with . Therefore, our results require a much weaker condition on the affinities between these subspaces. The new information we learn from this theorem is that if the dimension of the subspaces increases, more data points are required to guarantee the performance of iPursuit. This requirement is intuitively correct because as increases, more data points are required to populate the clusters.
II-E Theoretical Guarantees with the Fully-Random Model
We now consider the fully random model. In order to model the intersection between subspaces explicitly, assume and are sampled uniformly at random from Grassmannian manifolds with dimension and , respectively. Note that this setting is different from the traditional fully random model, in which are sampled independently from a Grassmannian manifold with dimension . In this section, it is assumed that is linear with . In order to derive the sufficient conditions, first we need to establish upper-bounds for and in (4). Under the fully random model,
is the largest root of a multivariate beta distribution with the following limiting behavior as
is a special random variable which has the mean, with typical fluctuations of and atypical large deviations of  . This means random variable is in the order of almost surely. In addition, scales with and the first term in the RHS of (7) has an order of , hence for large the RHS of (7) is dominated by its first term almost surely. Similarly .
The sufficient conditions in Theorem 3 roughly state that we need to be sufficiently small and this confirms our intuition since it implies that it is more likely to sample subspaces with small affinities.
III Enhancing the Performance of iPursuit
In this section, we address a performance bottleneck of iPursuit. While iPursuit performs well in handling subspaces with a high dimension of intersection, the algorithm could fail to obtain a proper affinity vector for when the projection of in the span of (the intersection part) is strong. The main reason is that if is close to the span of , then (the optimal point of (3)) should have a large norm to lie in the innovation subspace and to satisfy the constraint because is strongly aligned with the span of . Therefore, the optimal point of (3) may not lie in the innovation subspace or be close to it since the cost function of (3) increases with .
We present a technique which enables iPursuit to handle data points with weak projection on their corresponding innovation subspaces. Define as and where is an orthonormal basis for the column space of . In other words, is the projection of the data points in and is the projection of the data points in the complement of , i.e., the columns of are the projection of data points in the innovation subspaces. When the dimension of intersection is large, is much larger than , i.e., most of the power of the data lies in . Accordingly, we approximate with the span of the dominant left singular vectors. The next theorem shows that the first singular vector is close to when is large.
Assume the semi random data model. Define as the first left singular vector of , and / as / . For positive constant we have
with probability at least , where
where and .
Motivated by the observation that the intersection subspace is close to the span of dominant singular vectors, we improve the performance of iPursuit by filtering the projection of the data points in the span of the dominant singular vectors. Define
as the matrix of left singular vectors corresponding to the non-zero singular values and is formed by removing the first columns of . The parameter is our estimate of . Accordingly, we update each column of as to remove the projection of each data point in the span of the dominant singular vectors and enhance the innovative component of the data points.
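The filtering step described above can be sketched as follows. The code removes from every data point its projection on the first `q` left singular vectors of the data matrix, which is equivalent to projecting onto the remaining left singular vectors. Renormalizing the filtered columns to unit norm is an assumption of this sketch, since the exact update formula did not survive in this text; the function name `filter_dominant_directions` and the synthetic data are also illustrative.

```python
import numpy as np

def filter_dominant_directions(D, q):
    """Remove the projection of each column of D on the top-q left singular vectors."""
    U, _, _ = np.linalg.svd(D, full_matrices=False)
    U_top = U[:, :q]
    D_filtered = D - U_top @ (U_top.T @ D)
    # Assumed step: renormalize the filtered points to unit norm.
    norms = np.linalg.norm(D_filtered, axis=0, keepdims=True)
    return D_filtered / np.maximum(norms, 1e-12)

# Synthetic data with a strong shared direction plus weaker generic components.
rng = np.random.default_rng(2)
shared_dir = rng.standard_normal((10, 1))
D_raw = 5.0 * shared_dir @ rng.standard_normal((1, 40)) + rng.standard_normal((10, 40))
D_raw = D_raw / np.linalg.norm(D_raw, axis=0, keepdims=True)

q = 1  # estimate of the dimension of the intersection
D_enh = filter_dominant_directions(D_raw, q)
```

The filtered matrix `D_enh` would then be passed to iPursuit in place of the original data. In practice `q` can be chosen by inspecting the gap between consecutive singular values, as is done for the real data sets in Section IV-B.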
IV Numerical Experiments
We report two sets of experiments. First, we analyze the approximation studied in (7); second, we show the superior performance of iPursuit equipped with the technique presented in Section III, using both synthetic and real datasets, measured by clustering accuracy. We refer to the iPursuit method equipped with the technique in Section III as Enhanced iPursuit.
In Figure 1, we simulated pairs of random subspaces under different settings and calculated the ratio between true cosine values and the corresponding estimations (i.e., first term in RHS of (7)). As we increase , the approximation tends to be very stable and slightly below which confirms our analysis in Section II-E. In this simulation we fix the ratio between and . With fixed , the first term at the RHS of (7) is more likely to be the dominant term as we increase .
IV-A Results on Synthetic Data
We first compare the performance of iPursuit and Enhanced iPursuit using synthetic datasets. Specifically, two groups of experiments are considered: in the first group, we fix , , , and increase from to ; in the second group, we fix , , , and increase from to . In both groups we choose .
Figure 2 illustrates that Enhanced iPursuit outperforms iPursuit. In the first plot, the clustering accuracy decreases as the dimension of intersection increases. In the second group, we observe that the clustering accuracy of iPursuit increases with which confirms our analysis in Section II.
IV-B Results on Handwritten Digits
We now demonstrate the results of Enhanced iPursuit on two popular handwritten digits datasets: MNIST and ZIPCODE. For MNIST, we use the CNN extracted features which are 3472-dimensional vectors. Using PCA, we reduced the dimension to . For ZIPCODE, we conducted a PCA step to reduce the dimension from to . In Figure 3, we observe a clear gap between the first and second singular values in both MNIST and ZIPCODE; hence we set . The results are demonstrated in Table I (for both experiments we fix .). One can observe that the proposed technique enhanced the performance of iPursuit from to nearly on MNIST and to on ZIPCODE. In both datasets, Enhanced iPursuit yielded the best clustering accuracy.
While many subspace clustering algorithms cannot deal with heavily intersected subspaces, iPursuit [20, 19, 21] was shown to notably outperform the other methods in this challenging setting. In this paper, we provided an analysis of iPursuit to understand its capability in handling heavily intersected subspaces. It was shown that, in sharp contrast to the other methods, which require the subspaces to be sufficiently incoherent with each other, iPursuit requires incoherence between the innovative components of the subspaces. In addition, we proved that the performance of iPursuit can be guaranteed even if the dimension of intersection is linear in the rank of the data. We also proposed a projection-based method to further improve iPursuit.
-  Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput., 15(6):1373–1396, 2003.
-  Ling Ding, Ping Tang, and Hongyi Li. Subspace feature analysis of local manifold learning for hyperspectral remote sensing images classification. Applied Mathematics & Information Sciences, 8(4):1987, 2014.
-  Ehsan Elhamifar and René Vidal. Sparse subspace clustering. In , pages 2790–2797, Miami, FL, 2009.
-  Reinhard Heckel and Helmut Bölcskei. Robust subspace clustering via thresholding. IEEE Trans. Inf. Theory, 61(11):6320–6342, 2015.
-  Anil K. Jain. Data clustering: 50 years beyond k-means. Pattern Recognit. Lett., 31(8):651–666, 2010.
-  Pan Ji, Tong Zhang, Hongdong Li, Mathieu Salzmann, and Ian D. Reid. Deep subspace clustering networks. In Advances in Neural Information Processing (NIPS), pages 24–33, Long Beach, CA, 2017.
-  Iain M. Johnstone. Multivariate analysis and jacobi ensembles: Largest eigenvalue, tracy–widom limits and rates of convergence. Annals of Statistics, 36(6):2638, 2008.
-  Ian T Jolliffe and Jorge Cadima. Principal component analysis: a review and recent developments. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 374(2065):20150202, 2016.
-  Andrew V Knyazev and Peizhen Zhu. Principal angles between subspaces and their tangents. arXiv preprint arXiv:1209.0523, 2012.
-  Gilad Lerman, Michael B. McCoy, Joel A. Tropp, and Teng Zhang. Robust computation of linear models by convex relaxation. Found. Comput. Math., 15(2):363–410, 2015.
-  Weiwei Li, Jan Hannig, and Sayan Mukherjee. Subspace clustering through sub-clusters. Journal of Machine Learning Research, 22(53):1–37, 2021.
-  Guangcan Liu and Ping Li. Recovery of coherent data via low-rank dictionary pursuit. In Advances in Neural Information Processing (NIPS), pages 1206–1214, Montreal, Canada, 2014.
-  Guangcan Liu, Zhouchen Lin, Shuicheng Yan, Ju Sun, Yong Yu, and Yi Ma. Robust recovery of subspace structures by low-rank representation. IEEE Trans. Pattern Anal. Mach. Intell., 35(1):171–184, 2013.
-  Celine Nadal and Satya N Majumdar. Journal of Statistical Mechanics: Theory and Experiment, 2011(04):P04001, 2011.
-  Andrew Y Ng, Michael I Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems (NIPS), pages 849–856, Vancouver, Canada, 2002.
-  Hae-Sang Park and Chi-Hyuck Jun. A simple and fast algorithm for k-medoids clustering. Expert systems with applications, 36(2):3336–3341, 2009.
-  Xi Peng, Lei Zhang, and Zhang Yi. Scalable sparse subspace clustering. In Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition, pages 430–437, Portland, OR, 2013.
-  Mostafa Rahmani and George K Atia. Coherence pursuit: Fast, simple, and robust principal component analysis. IEEE Transactions on Signal Processing, 65(23):6260–6275, 2017.
-  Mostafa Rahmani and George K Atia. Innovation pursuit: A new approach to subspace clustering. IEEE Transactions on Signal Processing, 65(23):6276–6291, 2017.
-  Mostafa Rahmani and George K Atia. Subspace clustering via optimal direction search. IEEE Signal Processing Letters, 24(12):1793–1797, 2017.
-  Mostafa Rahmani and Ping Li. Outlier detection and robust PCA using a convex measure of innovation. In Advances in Neural Information Processing Systems (NeurIPS), pages 14200–14210, Vancouver, Canada, 2019.
-  Mahdi Soltanolkotabi and Emmanuel J. Candes. A geometric analysis of subspace clustering with outliers. Annals of Statistics, 40(4):2195–2238, 2012.
-  Roberto Tron and René Vidal. A benchmark for the comparison of 3-d motion segmentation algorithms. In Proceedings of the 2007 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Minneapolis, MN, 2007.
-  René Vidal and Paolo Favaro. Low rank subspace clustering (LRSC). Pattern Recognit. Lett., 43:47–61, 2014.
-  Ulrike Von Luxburg. A tutorial on spectral clustering. Statistics and computing, 17(4):395–416, 2007.
-  Lei Wang, Danping Li, Tiancheng He, and Zhong Xue. Manifold regularized multi-view subspace clustering for image representation. In Proceedings of the 23rd International Conference on Pattern Recognition (ICPR), pages 283–288, Cancún, Mexico, 2016.
-  Yu-Xiang Wang and Huan Xu. Noisy sparse subspace clustering. The Journal of Machine Learning Research, 17(1):320–360, 2016.
-  Chong You, Claire Donnat, Daniel P. Robinson, and René Vidal. A divide-and-conquer framework for large-scale subspace clustering. In Proceedings of the 50th Asilomar Conference on Signals, Systems and Computers (ASILOMAR), pages 1014–1018, Pacific Grove, CA, 2016.
-  Chong You, Chun-Guang Li, Daniel P. Robinson, and René Vidal. Oracle based active set algorithm for scalable elastic net subspace clustering. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3928–3937, Las Vegas, NV, 2016.
-  Chong You, Daniel P. Robinson, and René Vidal. Scalable sparse subspace clustering by orthogonal matching pursuit. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3918–3927, Las Vegas, NV, 2016.
-  Xiao-Tong Yuan and Ping Li. Sparse additive subspace clustering. In Proceedings of the 13th European Conference on Computer Vision (ECCV), Part III, pages 644–659, Zurich, Switzerland, 2014.
-  Chengju Zhou, Changqing Zhang, Xuewei Li, Gaotao Shi, and Xiaochun Cao. Video face clustering via constrained sparse representation. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), pages 1–6, Chengdu, China, 2014.
Proofs of Theorems and Auxiliary Lemmas
Let be orthonormal matrices in and assume
here is a constant. Then for any vector with unit -norm, the following two inequalities hold:
Proof of Lemma 1: We prove the first part of Lemma 1 by mathematical induction on ; the second part can be proved similarly. For , the result is straightforward. Now consider . For , we write , where both and are vectors in . Since has unit norm, we then have
After some manipulations the above equation leads to
It suffices to have
as the singular value decomposition (SVD) of . Since an orthogonal transformation does not change the $\ell_2$-norm, for ease of notation we simply write and . Let the diagonal of be . Expanding (11) and combining it with (10), it suffices to have
Finally, we note that , which means . Also, from we know , therefore (12) is true. Hence the result is true for .
Now assume the result is true for ; we prove by contradiction that the result is also true for . Specifically, we can apply Gram-Schmidt orthogonalization to the columns of (it is fine to have matrices, we just add empty columns at the end) and get . Write , where is the projection of onto . Then we know
By mathematical induction, the result follows naturally if we can show for some . Now we assume for any . If the result is not true for , by induction we have
Similarly we can get , for any . This means
which is a contradiction! Hence the result is also true for , and the lemma is proved.
(See Lemma 1 in ) Let sampled uniformly from , and be constants such that . For constant , we write and . Assume that , then
with probability at least , where
With Lemma 1, Lemma 2, and Lemma 3 being true, we are now ready to prove Theorem 1.
Proof of Theorem 1: Assume WLOG that . We prove that the iPursuit optimization problem (3) has the same optimal solution as
Let , be the optimal solution of the iPursuit optimization problem and (13) associated with , respectively. Note that if iPursuit and (13) share the same optimal solution, all the points from other than would be orthogonal to . This means only points from can have positive affinities with .
Following the proof in , it suffices to show
for any = 0. This can be reduced (see ) to be
where , , and . The LHS of (14) can be further lower bounded by
Here we write into two parts and such that is orthogonal to both and columns of , and .
Write and , where is the orthogonal basis of . From the definition of we know
Note that inequality (14) holds trivially with . We now bound the terms in (14) with . Specifically, we will prove that the sum of the first three terms on the LHS of (15) is non-negative.
Write we have
where is normalized . We can write such that is the projection of onto , then we have and