# A geometric analysis of subspace clustering with outliers

This paper considers the problem of clustering a collection of unlabeled data points assumed to lie near a union of lower-dimensional planes. As is common in computer vision or unsupervised learning applications, we do not know in advance how many subspaces there are nor do we have any information about their dimensions. We develop a novel geometric analysis of an algorithm named sparse subspace clustering (SSC) [In IEEE Conference on Computer Vision and Pattern Recognition, 2009. CVPR 2009 (2009) 2790-2797. IEEE], which significantly broadens the range of problems where it is provably effective. For instance, we show that SSC can recover multiple subspaces, each of dimension comparable to the ambient dimension. We also prove that SSC can correctly cluster data points even when the subspaces of interest intersect. Further, we develop an extension of SSC that succeeds when the data set is corrupted with possibly overwhelmingly many outliers. Underlying our analysis are clear geometric insights, which may bear on other sparse recovery problems. A numerical study complements our theoretical analysis and demonstrates the effectiveness of these methods.


## 1 Introduction

### 1.1 Motivation

One of the most fundamental steps in data analysis and dimensionality reduction consists of approximating a given data set by a single low-dimensional subspace, which is classically achieved via Principal Component Analysis (PCA).

In many problems, however, a collection of points may not lie near a low-dimensional plane but near a union of multiple subspaces, as shown in Figure 1. It is then of interest to find or fit all these subspaces. Furthermore, because our data points are unlabeled in the sense that we do not know in advance to which subspace they belong, we need to simultaneously cluster these data into multiple subspaces and find a low-dimensional subspace approximating all the points in a cluster. This problem is known as subspace clustering and has numerous applications; we list just a few:

• Unsupervised learning. In unsupervised learning the goal is to build representations of machine inputs, which can be used for decision making, predicting future inputs, efficiently communicating the inputs to another machine and so on.

In some unsupervised learning applications, the standard assumption is that the data are well approximated by a union of lower-dimensional manifolds. Furthermore, these manifolds are sometimes well approximated by subspaces whose dimension is only slightly higher than that of the manifold under study. One such example is handwritten digits. When looking at handwritten characters for recognition, the human eye is able to allow for simple transformations such as rotations, small scalings, location shifts and changes in character thickness. Therefore, any reasonable model should be insensitive to such changes as well. Simard et al. Simard93 characterize this invariance with a low-dimensional manifold; that is, different transformations of a single digit are well approximated by a manifold of small dimension. As illustrated by Hastie et al. Hastie97, these manifolds are in turn well approximated by low-dimensional subspaces. Thus, in certain cases, unsupervised learning can be formulated as a subspace clustering problem.

• Computer vision. There has been an explosion of visual data in the past few years. Cameras are now everywhere: street corners, traffic lights, airports and so on. Furthermore, millions of videos and images are uploaded monthly on the web. This visual data deluge has motivated the development of low-dimensional representations based on appearance, geometry and dynamics of a scene. In many such applications, the low-dimensional representations are characterized by multiple low-dimensional subspaces. One such example is motion segmentation Vidal08 . Here, we have a video sequence which consists of multiple moving objects, and the goal is to segment the trajectories of the objects. Each trajectory approximately lies in a low-dimensional subspace. To understand scene dynamics, one needs to cluster the trajectories of points on moving objects based on the subspaces (objects) they belong to, hence the need for subspace clustering.

Other applications of subspace clustering in computer vision include image segmentation Yang08 , face clustering Ho03 , image representation and compression Hong06 , and systems theory Vidal03 . Over the years, various methods for subspace clustering have been proposed by researchers working in this area. For a comprehensive review and comparison of these algorithms, we refer the reader to the tutorial Vidal11 and references therein Boult91 , Costeira98 , Gear98 , Vidal05 , Bradley00 , Tseng00 , Agarwal04 , Lu06 , Zhang09 , Tipping99 , Sugaya04 , Ma07 , Rao08 , Yang06 , Yan06 , Zhang10 , Goh07 , Elhamifar09 , Elhamifar10 , Liu10 , Chen09 .

• Disease detection. In order to detect a class of diseases of a specific kind (e.g., metabolic), doctors screen specific factors (e.g., metabolites). For this purpose, various tests (e.g., blood tests) are performed on newborns and the levels of those factors are measured. One can then construct a newborn-factor level matrix, where each row contains the factor levels of a different newborn; that is to say, each newborn is associated with a vector containing the values of the factors. Doctors wish to cluster groups of newborns based on the disease they suffer from. Usually, each disease causes a correlation between a specific set of factors. Such an assumption implies that points corresponding to newborns suffering from a given disease lie on a lower-dimensional subspace Peterkriegel08. Therefore, the clustering of newborns based on their specific disease, together with the identification of the relevant factors associated with each disease, can be modeled as a subspace clustering problem.

PCA is perhaps the single most important tool for dimensionality reduction. However, in many problems, the data set under study is not well approximated by a linear subspace of lower dimension. Instead, as we hope we have made clear, the data often lie near a union of low-dimensional subspaces, reflecting the multiple categories or classes a set of observations may belong to. Given its relevance in data analysis, we find it surprising that subspace clustering has been well studied in the computer science literature but has received comparably little attention from the statistical community. This paper begins with a very recent approach to subspace clustering and proposes a framework in which one can develop some useful statistical theory. As we shall see, insights from sparse regression analysis in high dimensions, a subject that has been well developed in the statistics literature in recent years, inform the subspace clustering problem.

### 1.2 Problem formulation

In this paper we assume we are given data points that are distributed on a union of unknown linear subspaces $S_1, S_2, \ldots, S_L$; that is, there are $L$ subspaces of $\mathbb{R}^n$ of unknown dimensions $d_1, d_2, \ldots, d_L$. More precisely, we have a point set $X \subset \mathbb{R}^n$ consisting of $N$ points, which may be partitioned as

$$X = X_0 \cup X_1 \cup \cdots \cup X_L; \tag{1}$$

for each $\ell \ge 1$, $X_\ell$ is a collection of $N_\ell$ unit-normed vectors chosen from $S_\ell$. The careful reader will notice that we have an extra subset $X_0$ in (1) accounting for possible outliers. Unless specified otherwise, we assume that this special subset consists of $N_0$ points chosen independently and uniformly at random on the unit sphere. The task is now simply stated. Without any prior knowledge about the number of subspaces, their orientation or their dimension,

1. identify all the outliers, and

2. segment or assign each data point to a cluster so as to recover all the hidden subspaces.

It is worth emphasizing that our model assumes normalized data vectors; this is not a restrictive assumption, since one can always normalize the inputs before applying any subspace clustering algorithm. Although we consider linear subspaces, one can extend the methods of this paper to affine subspace clustering, as explained in Section 1.3.1.

We now turn to methods for achieving these goals. Our focus is on noiseless data and we leave noisy subspace clustering to future work.

### 1.3 Methods and contributions

To introduce our methods, we first consider the case in which there are no outliers before treating the more general case. From now on, it will be convenient to arrange the observed data points as columns of a matrix $X = [x_1, \ldots, x_N] \in \mathbb{R}^{n \times N}$, where $N$ is the total number of points.

#### 1.3.1 Methods

Subspace clustering has received quite a bit of attention in recent years and, in particular, Elhamifar and Vidal introduced a clever algorithm based on insights from the compressive sensing literature. The key idea of the Sparse Subspace Clustering (SSC) algorithm Elhamifar09 is to find the sparsest expansion of each column $x_i$ of $X$ as a linear combination of all the other columns. This makes a lot of sense because under some generic conditions, one expects that the sparsest representation of $x_i$ would only select vectors from the subspace in which $x_i$ happens to lie. This motivates Elhamifar and Vidal to consider the sequence of optimization problems

$$\min_{z \in \mathbb{R}^N} \; \|z\|_{\ell_1} \quad \text{subject to} \quad Xz = x_i \ \text{and} \ z_i = 0. \tag{2}$$

The hope is that whenever $z_j \neq 0$, $x_i$ and $x_j$ belong to the same subspace. This property is captured by the definition below.

**Definition** (Subspace detection property). The subspaces $\{S_\ell\}_{\ell=1}^{L}$ and points $X$ obey the subspace detection property if and only if it holds that for all $i$, the optimal solution to (2) has nonzero entries only when the corresponding columns of $X$ are in the same subspace as $x_i$.

In certain cases the subspace detection property may not hold; that is, the support of the optimal solution to (2) may include points from other subspaces. However, it might still be possible to detect and construct reliable clusters. A strategy is to arrange the optimal solutions to (2) as columns of a matrix $Z$, build an affinity graph $G$ with $N$ vertices and weights $w_{ij} = |Z_{ij}| + |Z_{ji}|$, construct the normalized Laplacian of $G$, and use a gap in the distribution of the eigenvalues of this matrix to estimate the number of subspaces. Using the estimated number of subspaces, spectral clustering techniques (e.g., Shi00, Ng02) can be applied to the affinity graph to cluster the data points. The main steps of this procedure are summarized in Algorithm 1. This algorithm clusters linear subspaces but can also cluster affine subspaces by adding the constraint $z^T \mathbf{1} = 1$ to (2).
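To make the procedure concrete, here is a small illustrative sketch in Python. This is not the authors' reference implementation: each $\ell_1$ program (2) is recast as a linear program with $z = u - v$, $u, v \ge 0$, and, because the data considered here are noiseless, the clustering step simply reads off the connected components of the affinity graph instead of running full spectral clustering.

```python
import numpy as np
from scipy.optimize import linprog
from scipy.sparse.csgraph import connected_components

def sparse_self_expression(X):
    """For each column x_i, solve min ||z||_1 s.t. Xz = x_i, z_i = 0 (problem (2)).

    The l1 program is recast as a linear program in (u, v), with z = u - v
    and u, v >= 0, so that ||z||_1 = sum(u) + sum(v) at the optimum.
    """
    n, N = X.shape
    Z = np.zeros((N, N))
    A_eq = np.hstack([X, -X])                  # X(u - v) = x_i
    for i in range(N):
        bounds = [(0, None)] * (2 * N)
        bounds[i] = bounds[N + i] = (0, 0)     # enforce z_i = 0
        res = linprog(np.ones(2 * N), A_eq=A_eq, b_eq=X[:, i],
                      bounds=bounds, method="highs")
        Z[:, i] = res.x[:N] - res.x[N:]
    return Z

def ssc_cluster(X, tol=1e-6):
    """Build the affinity graph w_ij = |Z_ij| + |Z_ji| and, in the noiseless
    case, read the clusters off as connected components of that graph."""
    Z = sparse_self_expression(X)
    W = np.abs(Z) + np.abs(Z).T
    W[W < tol] = 0.0
    return connected_components(W, directed=False)
```

When the subspace detection property holds, the affinity matrix is block diagonal, so no connected component can mix points from two subspaces; with noisy data one would instead estimate the number of subspaces from the Laplacian eigenvalue gap and apply spectral clustering, as in Algorithm 1.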

#### 1.3.2 Our contributions

In Section 3 we will review existing conditions involving a restriction on the minimum angle between subspaces under which Algorithm 1 is expected to work. The main purpose of this paper is to show that Algorithm 1 works in much broader situations.

• Subspaces with nontrivial intersections. Perhaps unexpectedly, we shall see that our results assert that SSC can correctly cluster data points even when our subspaces intersect so that the minimum principal angle vanishes. This is a phenomenon which is far from being explained by current theory.

• Subspaces of nearly linear dimension. We prove that in generic settings, SSC can effectively cluster the data even when the dimensions of the subspaces grow almost linearly with the ambient dimension. We are not aware of other literature explaining why this should be so. To be sure, in most favorable cases, earlier results only seem to allow the dimensions of the subspaces to grow at most like the square root of the ambient dimension.

• Outlier detection. We present modifications to SSC that succeed when the data set is corrupted with many outliers, even when their number far exceeds the total number of clean observations. To the best of our knowledge, this is the first algorithm provably capable of handling this many corruptions.

• Geometric insights. Such improvements are possible because of a novel approach to analyzing the sparse subspace clustering problem. This analysis combines tools from convex optimization, probability theory and geometric functional analysis. Underlying our methods are clear geometric insights explaining quite precisely when SSC is successful and when it is not. This viewpoint might prove fruitful to address other sparse recovery problems.

Section 3 proposes a careful comparison with the existing literature. Before doing so, we first need to introduce our results, which is the object of Sections 1.4 and 2.

### 1.4 Models and typical results

#### 1.4.1 Models

In order to better understand the regime in which SSC succeeds as well as its limitations, we will consider three different models. Our aim is to give informative bounds for these models highlighting the dependence upon key parameters of the problem such as (1) the number of subspaces, (2) the dimensions of these subspaces, (3) the relative orientations of these subspaces, (4) the number of data points per subspace and so on.

• Deterministic model. In this model the orientation of the subspaces as well as the distribution of the points on each subspace are nonrandom. This is the setting considered by Elhamifar et al. and is the subject of Theorem 2.1, which guarantees that the subspace detection property holds as long as for any two subspaces, pairs of (primal and dual) directions taken on each subspace have a sufficiently small inner product.

• Semi-random model. Here, the subspaces are fixed but the points are distributed at random on each of the subspaces. This is the subject of Theorem 2.2, which uses a notion of affinity to measure closeness between any two subspaces. This affinity is maximal and equal to the square root of the dimension of the subspaces when they overlap perfectly. Here, our results state that if the affinity is smaller, by a logarithmic factor, than its maximum possible value, then SSC recovers the subspaces exactly.

• Fully random model. Here, both the orientation of the subspaces and the distribution of the points are random. This is the subject of Theorem 1.1; in a nutshell, SSC succeeds as long as the dimensions of the subspaces are within at most a logarithmic factor from the ambient dimension.

#### 1.4.2 Segmentation without outliers

Consider the fully random model first. We establish that the subspace detection property holds as long as the dimensions of the subspaces are roughly linear in the ambient dimension. Put differently, SSC can provably achieve perfect subspace recovery in settings not previously understood.

Our results make use of a constant $c(\rho)$ depending only upon the density of inliers $\rho$ (the number of points on each subspace is $\rho d + 1$), which obeys the following two properties:

(i) for all $\rho > 1$, $c(\rho) > 0$;

(ii) there is a numerical value $\rho_0$ such that, for all $\rho \ge \rho_0$, one can take $c(\rho)$ to be a fixed numerical constant.

###### Theorem 1.1

Assume there are $L$ subspaces, each of dimension $d$, chosen independently and uniformly at random. Furthermore, suppose there are $\rho d + 1$ points chosen independently and uniformly at random on each subspace. (From here on, when we say that points are chosen from a subspace, we implicitly assume they are unit normed. For ease of presentation, we state our results for densities below a fixed threshold $\rho^\star$; that is, the number of points on each subspace is not exponentially large in terms of the dimension of that subspace. The results hold for all $\rho$ by replacing $c(\rho)$ with $c(\rho^\star)$.) Then the subspace detection property holds with large probability as long as

$$d < \frac{c^2(\rho)\,\log\rho}{12\,\log N}\; n \tag{3}$$

[$N = L(\rho d + 1)$ is the total number of data points]. The probability of success here is calculated for values of $d$ close to the upper bound; for lower values of $d$, the probability of success is of course much higher, as explained below.

Theorem 1.1 is in fact a special instance of a more general theorem that we shall discuss later, and which holds under less restrictive assumptions on the orientations of the subspaces as well as the number and positions of the data points on each subspace. This theorem conforms to our intuition, since clustering becomes more difficult as the dimensions of the subspaces increase. Intuitively, another difficult regime concerns a situation in which we have very many subspaces of small dimensions. This difficulty is reflected in the dependence of the denominator in (3) on the number of subspaces $L$ (through the total number of points $N$). A more comprehensive explanation of this effect is provided in Section 2.1.2.

As becomes clear in the proof (see Section 7), a slightly more general version of Theorem 1.1 holds: with $0 < \beta \le 1/2$, the subspace detection property holds as long as

$$d < 2\beta\,\frac{c^2(\rho)\,\log\rho}{12\,\log N}\; n, \tag{4}$$

with a probability of success that increases as $\beta$ decreases. Therefore, if $d$ is a small fraction of the right-hand side in (3), the subspace detection property holds with much higher probability, as expected.

An interesting regime is when the number of subspaces $L$ is fixed and the density of points per subspace is $\rho = n^\eta$, for a small $\eta > 0$. Then as $n \to \infty$ with the ratio $d/n$ fixed, it follows from (ii) and (4) that the subspace detection property holds as long as

$$d < \frac{\eta}{48(1+\eta)}\, n.$$

This justifies our earlier claims, since we can have subspace dimensions growing linearly in the ambient dimension. It should be noted that this asymptotic statement is only a constant factor away from what is observed in simulations, which demonstrates a relatively small gap between our theoretical predictions and empirical behavior.

#### 1.4.3 Segmentation with outliers

We now turn our attention to the case where there are extraneous points in the data, in the sense that there are outliers assumed to be distributed uniformly at random on the unit sphere. Here, we wish to correctly identify the outlier points and then apply any subspace clustering algorithm to the remaining samples. We propose a very simple detection procedure for this task. As in SSC, decompose each point $x_i$ as a linear combination of all the other points by solving an $\ell_1$-minimization problem. One then expects the expansion of an outlier to be less sparse. This suggests the following detection rule: declare $x_i$ to be an outlier if and only if the optimal value of (2) is above a fixed threshold. This makes sense because if $x_i$ is an outlier, one expects the optimal value to be on the order of $\sqrt{n}$ (provided $N$ is at most polynomial in $n$), whereas this value will be at most on the order of $\sqrt{d}$ if $x_i$ belongs to a subspace of dimension $d$. In short, we expect a gap, a fact we will make rigorous in the next section. The main steps of the procedure are shown in Algorithm 2.
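This detection rule is easy to prototype; the sketch below is an illustration of the idea rather than the authors' Algorithm 2, and it takes the threshold as a plain user-supplied number instead of the calibrated value analyzed in the next section.

```python
import numpy as np
from scipy.optimize import linprog

def l1_expansion_values(X):
    """Optimal value of min ||z||_1 s.t. Xz = x_i, z_i = 0, for each column x_i."""
    n, N = X.shape
    A_eq = np.hstack([X, -X])              # z = u - v with u, v >= 0
    vals = np.zeros(N)
    for i in range(N):
        bounds = [(0, None)] * (2 * N)
        bounds[i] = bounds[N + i] = (0, 0)
        res = linprog(np.ones(2 * N), A_eq=A_eq, b_eq=X[:, i],
                      bounds=bounds, method="highs")
        vals[i] = res.fun if res.status == 0 else np.inf  # infeasible: surely an outlier
    return vals

def flag_outliers(X, threshold):
    """Declare x_i an outlier iff its optimal l1 value exceeds the threshold."""
    return l1_expansion_values(X) > threshold
```

For unit-normed columns the optimal value is always at least 1 (since $\|Xz\|_2 \le \|z\|_1$), with equality when $x_i$ is aligned with another point; points inside a low-dimensional cluster have small values, while generic points on the sphere have noticeably larger ones.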

Our second result asserts that as long as the number of outliers is not overwhelming, Algorithm 2 detects all of them.

###### Theorem 1.2

Assume there are $N_d$ points to be clustered together with $N_0$ outliers sampled uniformly at random on the $(n-1)$-dimensional unit sphere ($N = N_0 + N_d$). Algorithm 2 detects all of the outliers with high probability as long as

$$N_0 < \frac{1}{n}\, e^{c\sqrt{n}} - N_d,$$

where $c$ is a numerical constant. Furthermore, suppose the subspaces are $d$-dimensional and of arbitrary orientation, and that each contains $\rho d$ points sampled independently and uniformly at random. Then with high probability, Algorithm 2 does not detect any subspace point as an outlier provided that

$$N_0 < \frac{1}{n}\, e^{c(\rho)\sqrt{n}} - N_d,$$

in which $c(\rho)$ depends only on the density of points per subspace.

This result shows that our outlier detection scheme can reliably detect all outliers even when their number grows exponentially in the square root of the ambient dimension. We emphasize that this holds without making any assumption whatsoever about the orientation of the subspaces or the distribution of the points on each subspace. Furthermore, if the points on each subspace are uniformly distributed, our scheme will not wrongfully detect a subspace point as an outlier. In the next section we show that similar results hold under less restrictive assumptions.

## 2 Main results

### 2.1 Segmentation without outliers

In this section we shall give sufficient conditions in the fully deterministic and semi-random model under which the SSC algorithm succeeds (we studied the fully random model in Theorem 1.1).

Before we explain our results, we introduce some basic notation. We will arrange the $N_\ell$ points on subspace $S_\ell$ as columns of a matrix $X^{(\ell)}$. For $\ell = 1, \ldots, L$ and $i = 1, \ldots, N_\ell$, we use $X^{(\ell)}_{-i}$ to denote all points on subspace $S_\ell$ excluding the $i$th point, $x^{(\ell)}_i$. We use $U^{(\ell)}$ to denote an arbitrary orthonormal basis for $S_\ell$. This induces a factorization $X^{(\ell)} = U^{(\ell)} A^{(\ell)}$, where $A^{(\ell)}$ is a $d_\ell \times N_\ell$ matrix of coordinates with unit-norm columns. For any matrix $A$, the shorthand notation $P(A)$ denotes the symmetrized convex hull of its columns, $P(A) = \operatorname{conv}(\pm a_1, \pm a_2, \ldots)$. Also, $P^{(\ell)}_{-i}$ stands for $P(X^{(\ell)}_{-i})$. Finally, $\|A\|$ is the operator norm of $A$ and $\|A\|_\infty$ the maximum absolute value of its entries.

#### 2.1.1 Deterministic model

We first introduce some basic concepts needed to state our deterministic result.

**Definition** (Dual point). Consider a vector $y \in \mathbb{R}^d$ and a matrix $A$, and let $\Lambda(y, A)$ be the set of optimal solutions to

$$\max_{\boldsymbol\lambda \in \mathbb{R}^d} \; \langle y, \boldsymbol\lambda \rangle \quad \text{subject to} \quad \bigl\|A^T \boldsymbol\lambda\bigr\|_{\ell_\infty} \le 1.$$

The dual point $\boldsymbol\lambda(y, A)$ is defined as a point in $\Lambda(y, A)$ with minimum Euclidean norm (if this point is not unique, take $\boldsymbol\lambda(y, A)$ to be any optimal point with minimum Euclidean norm). A geometric representation is shown in Figure 2.

**Definition** (Dual directions). Define the dual directions $v^{(\ell)}_i$ [arranged as columns of a matrix $V^{(\ell)}$] corresponding to the dual points $\boldsymbol\lambda^{(\ell)}_i$ as

$$v^{(\ell)}_i = \frac{U^{(\ell)} \boldsymbol\lambda^{(\ell)}_i}{\|\boldsymbol\lambda^{(\ell)}_i\|_{\ell_2}}.$$

The dual direction $v^{(\ell)}_i$, corresponding to the point $x^{(\ell)}_i$ from subspace $S_\ell$, is shown in Figure 3.

**Definition** (Inradius). The inradius of a convex body $P$, denoted by $r(P)$, is defined as the radius of the largest Euclidean ball inscribed in $P$.

**Definition** (Subspace incoherence). The subspace incoherence of a point set $X_\ell$ vis-à-vis the other points is defined by

$$\mu(X_\ell) = \max_{x \in X \setminus X_\ell} \bigl\|V^{(\ell)T} x\bigr\|_{\ell_\infty},$$

where $V^{(\ell)}$ is the matrix of dual directions from the previous definition.

###### Theorem 2.1

If

$$\mu(X_\ell) < \min_{i\,:\,x_i \in X_\ell} r\bigl(P\bigl(X^{(\ell)}_{-i}\bigr)\bigr) \tag{5}$$

for each $\ell = 1, \ldots, L$, then the subspace detection property holds. If (5) holds for a given $\ell$, then a local subspace detection property holds in the sense that for all $x_i \in X_\ell$, the solution to (2) has nonzero entries only when the corresponding columns of $X$ are in the same subspace as $x_i$.

The incoherence parameter of a set of points on one subspace with respect to the other points is a measure of affinity between subspaces. To see why, notice that if the incoherence is high, there is a point on one subspace and a direction on another (a dual direction) such that the angle between them is small; that is, there are two “close” subspaces, and hence clustering becomes hard. The inradius measures the spread of the points. A very small minimum inradius implies that the distribution of points is skewed toward certain directions, and thus subspace clustering using an $\ell_1$ penalty is difficult. To see why this is so, assume the subspace is two-dimensional and all of the points on the subspace are skewed toward one line, except for one special point which lies in the direction orthogonal to that line. This is shown in Figure 4 with the special point in red and the others in blue. To synthesize this special point as a linear combination of the other points from its subspace, we would need huge coefficient values, and this is why it may very well be more economical, in an $\ell_1$ sense, to select points from other subspaces. This is a situation where $\ell_0$ minimization would still be successful but its convex $\ell_1$ surrogate is not (researchers familiar with sparse regression will recognize a setting in which variables are correlated, which is challenging for the LASSO). Theorem 2.1 essentially states that as long as different subspaces are not similarly oriented and the points on a single subspace are well spread, SSC can cluster the data correctly. A geometric perspective on (5) is provided in Section 4.
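To make the inradius discussion tangible, one can estimate $r(P)$ numerically. The helper below is our own toy illustration (a Monte-Carlo approximation, not an exact computation): it uses the identity $r(P) = \min_{\|w\|_2 = 1} \max_j |\langle a_j, w \rangle|$ and compares a hypothetical well-spread planar configuration with one whose points all hug a single line, as in Figure 4.

```python
import numpy as np

def approx_inradius(A, n_dirs=20000, seed=0):
    """Estimate the inradius of P(A) = conv(+-a_1, ..., +-a_m), columns a_j in R^d.

    Since r(P) is the minimum over unit directions w of the support function
    max_j |<a_j, w>|, sampling random directions yields an upper bound on r(P)
    that tightens as n_dirs grows.
    """
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n_dirs, A.shape[0]))
    W /= np.linalg.norm(W, axis=1, keepdims=True)
    return float(np.min(np.max(np.abs(W @ A), axis=1)))

# Eight unit vectors at evenly spaced angles: a well-spread configuration.
theta = np.linspace(0, np.pi, 8, endpoint=False)
spread = np.vstack([np.cos(theta), np.sin(theta)])

# Eight unit vectors bunched inside a narrow cone: a skewed configuration.
phi = np.linspace(0, 0.15, 8)
skewed = np.vstack([np.cos(phi), np.sin(phi)])
```

The spread configuration has inradius close to 1, whereas the skewed one has inradius on the order of the cone's half-width (about 0.075 here); in light of Theorem 2.1, the skewed configuration tolerates far less incoherence between subspaces.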

To get concrete results, one needs to estimate both the incoherence and inradius in terms of the parameters of interest, which include the number of subspaces, the dimensions of the subspaces, the number of points on each subspace and so on. To do this, we use the probabilistic models we introduced earlier. This is our next topic.

#### 2.1.2 Semi-random model

The following definitions capture notions of similarity/affinity between two subspaces.

**Definition** (Principal angles). The principal angles $\theta^{(1)}_{k\ell}, \ldots, \theta^{(d_k \wedge d_\ell)}_{k\ell}$ between two subspaces $S_k$ and $S_\ell$ of dimensions $d_k$ and $d_\ell$ are recursively defined by

$$\cos\bigl(\theta^{(i)}_{k\ell}\bigr) = \max_{y \in S_k} \max_{z \in S_\ell} \frac{y^T z}{\|y\|_{\ell_2} \|z\|_{\ell_2}} := \frac{y_i^T z_i}{\|y_i\|_{\ell_2} \|z_i\|_{\ell_2}},$$

with the orthogonality constraints $y^T y_j = 0$ and $z^T z_j = 0$ for $j = 1, \ldots, i-1$. Alternatively, if the columns of $U_k$ and $U_\ell$ are orthobases of $S_k$ and $S_\ell$, then the cosines of the principal angles are the singular values of $U_k^T U_\ell$. We write the smallest principal angle as $\theta^{(1)}_{k\ell}$, so that $\cos(\theta^{(1)}_{k\ell})$ is the largest singular value of $U_k^T U_\ell$.

**Definition** (Affinity). The affinity between two subspaces is defined by

$$\operatorname{aff}(S_k, S_\ell) = \sqrt{\cos^2\theta^{(1)}_{k\ell} + \cdots + \cos^2\theta^{(d_k \wedge d_\ell)}_{k\ell}}.$$
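Both the principal angles and the affinity are easy to compute from orthonormal bases; the snippet below is a minimal sketch of these definitions (our own illustration). Note that if two subspaces share an $s$-dimensional intersection, then $s$ of the cosines equal one, so the affinity is at least $\sqrt{s}$.

```python
import numpy as np

def principal_angles(Uk, Ul):
    """Principal angles (radians, ascending) between subspaces with orthobases Uk, Ul.

    The cosines of the principal angles are the singular values of Uk^T Ul;
    svd sorts singular values in decreasing order, so the angles come out ascending.
    """
    s = np.clip(np.linalg.svd(Uk.T @ Ul, compute_uv=False), 0.0, 1.0)
    return np.arccos(s)

def affinity(Uk, Ul):
    """aff(Sk, Sl): root-sum-of-squares of the cosines, i.e. the Frobenius
    norm of Uk^T Ul."""
    return np.linalg.norm(Uk.T @ Ul)
```

For instance, two orthogonal planes in $\mathbb{R}^4$ have affinity 0, two identical planes have affinity $\sqrt{2}$, and two planes sharing exactly one line have affinity 1.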

In case the distribution of the points is uniform on their corresponding subspaces, the geometric condition (5) reduces to a simple statement about the affinity. This is the subject of the next theorem.

###### Theorem 2.2

Suppose $N_\ell = \rho_\ell d_\ell$ points are chosen on each subspace $S_\ell$ at random, $\ell = 1, \ldots, L$. Then as long as

$$\max_{k\,:\,k \neq \ell} \; 4\sqrt{2}\,\bigl(\log[N_\ell(N_k+1)] + \log L + t\bigr)\,\frac{\operatorname{aff}(S_k, S_\ell)}{\sqrt{d_k}} \;<\; c(\rho_\ell)\,\sqrt{\log \rho_\ell} \qquad \text{for each } \ell, \tag{6}$$

the subspace detection property holds with probability at least

$$1 - \sum_{\ell=1}^{L} N_\ell\, e^{-\sqrt{d_\ell}\sqrt{N_\ell - 1}} - \frac{1}{L^2} \sum_{k \neq \ell} \frac{4\,e^{-2t}}{(N_k+1)\,N_\ell}.$$

Hence, ignoring log factors, subspace clustering is possible if the affinity between the subspaces is less than about the square root of the dimension of these subspaces.

To derive useful results, assume for simplicity that we have $L$ subspaces of the same dimension $d$ with $\rho d$ points per subspace, so that $N \approx L \rho d$. Then perfect clustering occurs with high probability if

$$\operatorname{aff}(S_k, S_\ell) \;<\; c(\rho)\,\frac{\sqrt{\log \rho}}{\log N}\,\sqrt{d}. \tag{7}$$

Our notion of affinity matches our basic intuition. To be sure, if the subspaces are too close to each other (in terms of our defined notion of affinity), subspace clustering is hard. Having said this, our result has an element of surprise. Indeed, the affinity can at most be $\sqrt{d}$ ($\sqrt{d_k \wedge d_\ell}$ in general) and, therefore, our result essentially states that if the affinity is smaller than its maximum possible value by logarithmic factors, then SSC works. Now this allows for subspaces to intersect and, yet, SSC still provably clusters all the data points correctly!

To discuss other aspects of this result, assume as before that all subspaces have the same dimension $d$. When $d$ is small and the total number of subspaces is large, the problem is inherently hard because it involves clustering all the points into many small subgroups. This is reflected by the low probability of success in Theorem 2.2. Of course, if one increases the number of points chosen from each subspace, the problem should intuitively become easier. The probability associated with (7) allows for such a trend: when $d$ is small, one can increase the probability of success by increasing $\rho$. Introducing an additional parameter, the condition (7) can be modified so as to hold with correspondingly higher probability; the more general condition (6) and the corresponding probability can be modified in a similar manner.

### 2.2 Segmentation with outliers

To see how Algorithm 2 works in the presence of outliers, we begin by introducing a proper threshold function and define

$$\lambda(\gamma) = \begin{cases} \sqrt{\dfrac{2}{\pi}}\;\dfrac{1}{\sqrt{\gamma}}, & 1 \le \gamma \le e, \\[8pt] \sqrt{\dfrac{2}{\pi e}}\;\dfrac{1}{\sqrt{\log \gamma}}, & \gamma \ge e, \end{cases} \tag{9}$$

shown in Figure 5. The theorem below justifies the claims made in the introduction.
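A direct implementation of the threshold function (9), reading its two constants as $\sqrt{2/\pi}$ and $\sqrt{2/(\pi e)}$, a choice under which the two branches agree at $\gamma = e$, so that $\lambda$ is continuous and decreasing:

```python
import numpy as np

def lam(gamma):
    """Threshold function lambda(gamma) of (9), for gamma = N / n >= 1."""
    gamma = np.asarray(gamma, dtype=float)
    small = np.sqrt(2 / np.pi) / np.sqrt(gamma)          # branch 1 <= gamma <= e
    large = np.sqrt(2 / (np.pi * np.e)) / np.sqrt(np.maximum(np.log(gamma), 1e-300))
    return np.where(gamma <= np.e, small, large)         # branch gamma >= e
```

Note that $\lambda$ decays only like $1/\sqrt{\log \gamma}$ for large $\gamma$, so the detection threshold degrades very slowly as the number of points grows relative to the ambient dimension.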

###### Theorem 2.3

Suppose the outlier points are chosen uniformly at random on the unit sphere and set $\gamma = N/n$; then, using the threshold value $\lambda(\gamma)\sqrt{n}$, all outliers are identified correctly with high probability. Furthermore, for a fixed $0 < t < 1$, we have the following guarantees in the deterministic and semi-random models:

(a) If in the deterministic model

$$\max_{\ell, i}\; \frac{1}{r\bigl(P\bigl(X^{(\ell)}_{-i}\bigr)\bigr)} \;<\; (1-t)\,\frac{\lambda(\gamma)}{\sqrt{e}}\,\sqrt{n}, \tag{10}$$

then no “real” data point is wrongfully detected as an outlier.

(b) If in the semi-random model

$$\max_{\ell}\; \frac{\sqrt{2 d_\ell}}{c(\rho_\ell)\sqrt{\log \rho_\ell}} \;<\; (1-t)\,\frac{\lambda(\gamma)}{\sqrt{e}}\,\sqrt{n}, \tag{11}$$

then with high probability, no “real” data point is wrongfully detected as an outlier.

The threshold on the right-hand side of (10) and (11) is essentially $\sqrt{n}$ multiplied by a factor $\lambda(\gamma)$ which depends only on the ratio between the number of points and the dimension of the ambient space.

As in the situation with no outliers, when $d$ is small we need to increase $\rho$ to get a result holding with high probability. Again this is expected, because when $d$ is small, we need to be able to separate the outliers from many small clusters, which is inherently a hard problem for small values of $d$.

The careful reader will notice a factor $\sqrt{e}$ discrepancy between the threshold presented in Algorithm 2 and what is proven in (10) and (11). We believe that this is an artifact of our analysis (more specifically, of switching from the mean width to a volumetric argument by means of Urysohn's inequality), and we conjecture that (10) and (11) hold without the $\sqrt{e}$ factor in the denominator. Our simulations in Section 5 support this conjecture.

## 3 Discussion and comparison with other work

It is time to compare our results with a couple of previous important theoretical advances. To introduce these earlier works, we first need some definitions. The subspaces are said to be independent if and only if $\dim(S_1 \oplus \cdots \oplus S_L) = \sum_{\ell} \dim(S_\ell)$, where $\oplus$ is the direct sum. For instance, three lines in $\mathbb{R}^2$ cannot be independent. The subspaces are said to be disjoint if and only if for all pairs $k \neq \ell$, $S_k \cap S_\ell = \{0\}$.

The geodesic distance between two subspaces $S_k$ and $S_\ell$ of dimensions $d_k$ and $d_\ell$, denoted by $\operatorname{dist}(S_k, S_\ell)$, is defined by

$$\operatorname{dist}(S_k, S_\ell) = \sqrt{\sum_{i=1}^{d_k \wedge d_\ell} \bigl(\theta^{(i)}_{k\ell}\bigr)^2}.$$

### 3.1 Segmentation without outliers

In Elhamifar09 , Elhamifar and Vidal show that the subspace detection property holds as long as the subspaces are independent. In Elhamifar10 , the same authors show that under less restrictive conditions the subspace detection property still holds. Formally, they show that if

$$\frac{1}{\sqrt{d_\ell}}\; \max_{Y \in W_{d_\ell}(X^{(\ell)})} \sigma_{\min}(Y) \;>\; \max_{k\,:\,k \neq \ell} \cos\bigl(\theta^{(1)}_{k\ell}\bigr) \qquad \text{for all } \ell = 1, \ldots, L, \tag{12}$$

then the subspace detection property holds. In the above formulation, $\sigma_{\min}(Y)$ denotes the smallest singular value of $Y$ and $W_{d_\ell}(X^{(\ell)})$ denotes the set of all full-rank sub-matrices of $X^{(\ell)}$ of size $n \times d_\ell$. The interesting part of the above condition is the appearance of the smallest principal angle on the right-hand side. However, the left-hand side is not particularly insightful (i.e., it does not tell us anything about the important parameters involved in the subspace clustering problem, such as the dimensions, the number of subspaces and so on), and it is in fact NP-hard even to calculate it.
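Although evaluating the left-hand side of (12) is NP-hard in general, for toy configurations it can be computed by brute force over all column subsets. The sketch below (our own illustration, with hypothetical helper names) does exactly that and compares it against the right-hand side:

```python
from itertools import combinations

import numpy as np

def lhs_condition12(Xl, d):
    """(1/sqrt(d)) * max over n-by-d submatrices Y of X^(l) of sigma_min(Y).

    Exponential-time brute force; feasible only for small toy examples."""
    _, N = Xl.shape
    best = 0.0
    for idx in combinations(range(N), d):
        smin = np.linalg.svd(Xl[:, idx], compute_uv=False)[-1]  # smallest singular value
        best = max(best, smin)                                  # full rank iff smin > 0
    return best / np.sqrt(d)

def rhs_condition12(bases, l):
    """max over k != l of cos(theta^(1)_{kl}), the largest singular value of Uk^T Ul."""
    return max(np.linalg.svd(Uk.T @ bases[l], compute_uv=False)[0]
               for k, Uk in enumerate(bases) if k != l)
```

For two orthogonal lines the condition holds comfortably; for two identical subspaces the two sides coincide and the strict inequality fails, consistent with the fact that (12) can never hold for intersecting subspaces.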

• Deterministic model. This paper also introduces a sufficient condition (5) under which the subspace detection property holds in the fully deterministic setting; compare Theorem 2.1. This sufficient condition is much less restrictive, as any configuration obeying (12) also obeys (5). More precisely, $\mu(X_\ell) \le \max_{k\,:\,k \neq \ell} \cos(\theta^{(1)}_{k\ell})$ and $\frac{1}{\sqrt{d_\ell}} \max_{Y \in W_{d_\ell}(X^{(\ell)})} \sigma_{\min}(Y) \le \min_i r(P(X^{(\ell)}_{-i}))$ (the latter is a simple consequence of Lemma 7.7). As for (12), checking that (5) holds is also NP-hard in general. However, to prove that the subspace detection property holds, it is sufficient to check a slightly less restrictive condition than (5); this is tractable, see Lemma 7.1.

• Semi-random model. Assume that all subspaces are of the same dimension $d$ and that there are $\rho d$ points on each subspace. Since the columns of $X^{(\ell)}$ have unit norm, it is easy to see that the left-hand side of (12) is strictly less than $1/\sqrt{d}$. Thus, (12) at best restricts the range for perfect subspace recovery to $\cos(\theta^{(1)}_{k\ell})<1/\sqrt{d}$ [by looking at (12), it is not entirely clear that this would even be achievable]. In comparison, Theorem 2.2 (excluding some logarithmic factors for ease of presentation) requires

$$\operatorname{aff}(S_k,S_\ell)=\sqrt{\frac{\cos^2(\theta^{(1)}_{k\ell})+\cdots+\cos^2(\theta^{(d)}_{k\ell})}{d}}<\frac{c\sqrt{\log\rho}}{\sqrt{d}}.$$

The left-hand side can be much smaller than $\cos(\theta^{(1)}_{k\ell})$ and is, therefore, less restrictive.

To be more specific, assume that in the model described above we have two subspaces of dimension $d$ with an intersection of dimension $s$. Because the two subspaces intersect, we have $\cos(\theta^{(1)}_{k\ell})=1$, so the condition given by Elhamifar and Vidal becomes $1/\sqrt{d}>1$, which cannot hold. In comparison, our condition (3.1) simplifies to

$$\cos^2\bigl(\theta^{(s+1)}_{k\ell}\bigr)+\cdots+\cos^2\bigl(\theta^{(d)}_{k\ell}\bigr)<c^2\log\rho-s,$$

which holds as long as $s$ is not too large and/or a fraction of the remaining angles are not too small. From an application standpoint, this is important because it explains why SSC can often succeed even when the subspaces are not disjoint.

• Fully random model. As before, assume for simplicity that all subspaces are of the same dimension $d$ and that there are $\rho d$ points on each subspace. We have seen that (12) imposes $\cos(\theta^{(1)}_{k\ell})<1/\sqrt{d}$. It can be shown that in the fully random setting, $\cos(\theta^{(1)}_{k\ell})\approx\sqrt{d/n}$.¹¹ (¹¹One can see this by noticing that the square of this parameter is the largest root of a multivariate beta distribution; the asymptotic value of this root can be calculated, see Johnstone08.) Therefore, (12) would put a restriction of the form

$$d\lesssim\sqrt{n}.$$

In comparison, Theorem 1.1 requires

$$d<c\,\frac{n}{\log N},$$

which allows for the dimension of the subspaces to be almost linear in the ambient dimension.

Such improvements come from a geometric insight: it becomes apparent that the SSC algorithm succeeds if the actual subspace points (primal directions) have small inner products with the dual directions on the other subspaces. This is in contrast with Elhamifar and Vidal's condition, which requires that the inner products between any direction on one subspace and any direction on another be small. Further geometric explanations are given in Section 4.2.

### 3.2 Segmentation with outliers

To the best of our knowledge, there is only one other theoretical result regarding outlier detection. In Lerman10, Lerman and Zhang study the effectiveness of recovering subspaces in the presence of outliers by some sort of $\ell_p$ minimization for different values of $p$. They address simultaneous recovery of all subspaces by minimizing the functional

$$e_{\ell_p}(\mathcal{X},S_1,\dots,S_L)=\sum_{x\in\mathcal{X}}\min_{1\le\ell\le L}\bigl(\operatorname{dist}(x,S_\ell)\bigr)^p. \tag{14}$$

Here, the subspaces $S_1,\dots,S_L$ are the optimization variables and $\mathcal{X}$ is our data set. This is not a convex optimization for any value of $p$, since the feasible set is the Grassmannian.

In the semi-random model, the result of Lerman and Zhang states that under the assumptions stated in Theorem 1.2, with $0<p\le 1$ and $\tau_0$ a constant,¹² (¹²The result of Lerman10 is a bit more general in that the points on each subspace can be sampled from a single distribution obeying certain regularity conditions, other than the uniform measure. In this case, $\tau_0$ depends on this distribution as well.) the subspaces minimize (with large probability) the energy (14) among all $d$-dimensional subspaces in $\mathbb{R}^n$ if

$$N_0<\tau_0\,\rho d\,\min\Bigl(1,\ \min_{k\neq\ell}\frac{\operatorname{dist}(S_k,S_\ell)^p}{2^p}\Bigr). \tag{15}$$

It is easy to see that the right-hand side of (15) is upper bounded by $\tau_0\rho d$, that is, a constant times the typical number of points on each subspace. Notice that our analogous result in Theorem 1.1 allows for a much larger number of outliers. In fact, the number of outliers can sometimes even be much larger than the total number of data points on all subspaces combined. Our proposed algorithm also has the added benefit that it is convex and, therefore, practical. Having said this, it is worth mentioning that the results in Lerman10 hold for a more general outlier model. Also, an interesting byproduct of the result from Lerman and Zhang is that the energy minimization can perform perfect subspace recovery when no outliers are present. In fact, they even extend this to the case when the subspace points are noisy.

Finally, while this manuscript was in preparation, Liu Guangcan brought to our attention a new paper Liu11, which also addresses outlier detection. However, the suggested scheme limits the number of outliers to $n-\sum_\ell d_\ell$. That is, when the total dimension of the subspaces ($\sum_\ell d_\ell$) exceeds the ambient dimension $n$, outlier detection is not possible based on the suggested scheme. In contrast, our results guarantee perfect outlier detection even when the number of outliers far exceeds the number of data points.

## 4 Geometric perspective on the separation condition

The goal of this section is twofold. One aim is to provide a geometric understanding of the subspace detection property and of the sufficient condition presented in Section 2.1. Another is to introduce concepts such as $\mathcal{K}$-norms and polar sets, which will play a crucial role in our analysis.

### 4.1 Linear programming theory

We are interested in finding the support of the optimal solution to

$$\min_{x\in\mathbb{R}^N}\|x\|_{\ell_1}\quad\text{subject to}\quad Ax=y, \tag{16}$$

where both $y$ and the columns of $A$ have unit norm. The dual takes the form

$$\max_{z\in\mathbb{R}^n}\langle y,z\rangle\quad\text{subject to}\quad \bigl\|A^Tz\bigr\|_{\ell_\infty}\le 1. \tag{17}$$

Since strong duality always holds in linear programming, the optimal values of (16) and (17) are equal. We now introduce some notation to express the dual program differently. The norm of a vector $y$ with respect to a symmetric convex body $\mathcal{K}\subset\mathbb{R}^n$ is defined as

$$\|y\|_{\mathcal{K}}=\inf\{t>0:\ y/t\in\mathcal{K}\}. \tag{18}$$

This norm is shown in Figure 6(a). The polar set of $\mathcal{K}$ is defined as

$$\mathcal{K}^o=\{y\in\mathbb{R}^n:\ \langle x,y\rangle\le 1\text{ for all }x\in\mathcal{K}\}. \tag{19}$$

Set $\mathcal{K}=\operatorname{conv}(\pm a_1,\dots,\pm a_N)$, where $a_1,\dots,a_N$ are the columns of $A$; then $\mathcal{K}^o=\{z:\|A^Tz\|_{\ell_\infty}\le 1\}$ and our dual problem (17) is of the form

$$\max_{z\in\mathbb{R}^n}\langle y,z\rangle\quad\text{subject to}\quad z\in\mathcal{K}^o. \tag{20}$$

It then follows from the definitions above that the optimal value of (16) is given by $\|y\|_{\mathcal{K}}$; that is to say, the minimum value of the $\ell_1$ norm is the norm of $y$ with respect to the symmetrized convex hull of the columns of $A$. In other words, this perspective asserts that support detection in an $\ell_1$ minimization problem is equivalent to finding the face of the polytope $\mathcal{K}$ that passes through the ray $\{ty:t\ge 0\}$; the extreme points of this face reveal those indices with a nonzero entry. We will refer to the face passing through this ray as the face closest to $y$. Figure 6(b) illustrates some of these concepts.
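As a concrete illustration of the primal-dual pair (16)-(17), the following sketch (ours, not from the paper) solves the $\ell_1$ problem as a linear program with SciPy and reads off the support; the dictionary and signal are synthetic and the threshold is illustrative.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)

# Dictionary A with unit-norm columns; y is a sparse combination of two of them.
n, N = 10, 30
A = rng.standard_normal((n, N))
A /= np.linalg.norm(A, axis=0)
y = A[:, 0] + 0.5 * A[:, 1]
y /= np.linalg.norm(y)

# min ||x||_1 s.t. Ax = y, via the standard split x = u - v with u, v >= 0:
# min 1'u + 1'v subject to [A, -A][u; v] = y.
res = linprog(np.ones(2 * N), A_eq=np.hstack([A, -A]), b_eq=y, bounds=(0, None))
x = res.x[:N] - res.x[N:]
support = np.flatnonzero(np.abs(x) > 1e-6)

# The optimal value res.fun equals ||y||_K for K the symmetrized convex hull
# of the columns of A, and `support` indexes the face of K closest to y.
```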

### 4.2 A geometric view of the subspace detection property

We have seen that the subspace detection property holds if for each point , the closest face to resides in the same subspace. To establish a geometric characterization, consider an arbitrary point, for instance, as in Figure 7. Now construct the symmetrized convex hull of all the other points in indicated by in the figure. Consider the face of that is closest to ; this face is shown in Figure 7 by the line segment in red. Also, consider the plane passing through this segment and orthogonal to along with its reflection about the origin; this is shown in Figure 7 by the light grey planes. Set to be the region of space restricted between these two planes. Intuitively, if no two points on the other subspaces lie outside of , then the face chosen by the algorithm is as in the figure and lies in .

To illustrate this point further, suppose there are two points not in lying outside of the region as in Figure 8. In this case, the closest face does not lie in as can be seen in the figure. Therefore, one could intuitively argue that a sufficient condition for the closest face to lie in is that the projections onto of the points from all the other subspaces do not lie outside of regions for all points in subspace . This condition is closely related to the sufficient condition stated in Theorem 2.1. More precisely, the dual directions approximate the normal directions to the restricting planes , and the distance of these planes from the origin.

Finally, to understand the sufficient condition of Theorem 2.1, we will use Figure 9. We focus on a single subspace, say, . As previously stated, a sufficient condition is to have all points not in to have small coherence with the dual directions of the points in . The dual directions are depicted in Figure 9 (blue dots). One such dual direction line is shown as the dashed blue line in the figure. The points that have low coherence with the dual directions are the points whose projection onto subspace lie inside the red polytope. As can be seen, this polytope approximates the intersection of regions () and subspace . This helps in understanding the difference between the condition imposed by Elhamifar and Vidal and our condition; in this setting, their condition essentially states that the projection of the points on all other subspaces onto subspace must lie inside the blue circle. By looking at Figure 9, one might draw the conclusion that these conditions are very similar, that is, the red polytope and the blue ball restrict almost the same region. This is not the case, because as the dimension of the subspace increases most of the volume of the red polytope will be concentrated around its vertices and the ball will only occupy a very small fraction of the total volume of the polytope.

## 5 Numerical results

This section presents numerical experiments on synthetic data to further our understanding of the behavior and limitations of SSC, of our analysis, and of our proposed outlier detection scheme. In this numerical study we restrict ourselves to understanding the effect of noise on the spectral gap and the estimation of the number of subspaces. For a more comprehensive analytical and numerical study of SSC in the presence of noise, we refer the reader to Elhamifar11. For comparison of SSC with more recent methods on motion segmentation data, we refer the reader to Liu10, Favaro11. These papers indicate that SSC has the best performance on the Hopkins 155 data Tron07 when corrupted trajectories are present, and a performance competitive with the state of the art when there are no corrupted trajectories. In the spirit of reproducible research, the Matlab code generating all the plots is available at http://www.stanford.edu/~mahdisol/Software.

### 5.1 Segmentation without outliers

As mentioned in the Introduction, the subspace detection property can hold even when the dimensions of the subspaces are large in comparison with the ambient dimension . SSC can also work beyond the region where the subspace detection property holds because of further spectral clustering. Section 5.1.1 introduces several metrics to assess performance and Section 5.1.2 demonstrates that the subspace detection property can hold even when the subspaces intersect. In Section 5.1.3 we study the performance of SSC under changes in the affinity between subspaces and the number of points per subspace. In Section 5.1.4 we illustrate the effect of the dimension of the subspaces on the subspace detection property and the spectral gap. In Section 5.1.5 we study the effect of noise on the spectral gap. In the final subsection we study the capability of SSC in estimating the correct number of subspaces and compare it with a classical algorithm.

#### 5.1.1 Error metrics

The four different metrics we use are as follows (see Elhamifar10 for simulations using similar metrics):

• Feature detection error. For each point $x_i$, partition the optimal solution $z_i$ of SSC as

$$z_i=\boldsymbol{\Gamma}\begin{bmatrix}z_{i1}\\ z_{i2}\\ \vdots\\ z_{iL}\end{bmatrix}.$$

In this representation, $\boldsymbol{\Gamma}$ is our unknown permutation matrix and $z_{i1},\dots,z_{iL}$ denote the coefficients corresponding to each of the $L$ subspaces. Using $N$ as the total number of points, the feature detection error is

$$\frac{1}{N}\sum_{i=1}^{N}\biggl(1-\frac{\|z_{ik_i}\|_{\ell_1}}{\|z_i\|_{\ell_1}}\biggr), \tag{21}$$

in which $k_i$ is the subspace $x_i$ belongs to. The quantity between brackets in (21) measures how far we are from choosing all our neighbors in the same subspace; when the subspace detection property holds, this term is equal to $0$, whereas it takes on the value $1$ when all the points are chosen from the other subspaces.

• Clustering error.

Here, we assume knowledge of the number of subspaces and apply spectral clustering to the affinity matrix built by the SSC algorithm. After the spectral clustering step, the clustering error is simply defined as

$$\frac{\#\text{ of misclassified points}}{\text{total }\#\text{ of points}}. \tag{22}$$
• Error in estimating the number of subspaces. This is a 0-1 error which takes on the value $0$ if the true number of subspaces is correctly estimated, and $1$ otherwise.

• Smallest nonzero eigenvalue. We use the $L$th smallest eigenvalue of the normalized Laplacian¹³ (¹³After building the symmetrized affinity graph $W$, we form the normalized Laplacian $\mathcal{L}=I-D^{-1/2}WD^{-1/2}$, where $D$ is a diagonal matrix whose $i$th diagonal entry is equal to the sum of the elements in column $i$ of $W$. This form of the Laplacian works better for spectral clustering as observed in many applications Ng02.) as a numerical check on whether the subspace detection property holds (when the subspace detection property holds, this value vanishes).
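Two of these metrics can be sketched in code (our illustration, on a toy coefficient matrix). Note the sketch simply counts near-zero Laplacian eigenvalues to estimate the number of subspaces, rather than using the sharpest-descent rule of Algorithm 1.

```python
import numpy as np

def feature_detection_error(Z, labels):
    """Eq. (21): average l1 mass each column of Z assigns outside its
    own subspace; 0 iff the subspace detection property holds."""
    errs = []
    for i in range(Z.shape[1]):
        total = np.abs(Z[:, i]).sum()
        same = np.abs(Z[:, i][labels == labels[i]]).sum()
        errs.append(1.0 - same / total)
    return float(np.mean(errs))

def estimate_num_subspaces(C):
    """Count (near-)zero eigenvalues of the normalized Laplacian
    L = I - D^{-1/2} W D^{-1/2} of the symmetrized graph W = |C| + |C|^T."""
    W = np.abs(C) + np.abs(C).T
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(W.sum(axis=1), 1e-12))
    L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    return int(np.sum(np.linalg.eigvalsh(L) < 1e-8))

# Toy coefficients: each point represented only by its cluster mate.
C = np.array([[0, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
labels = np.array([0, 0, 1, 1])
print(feature_detection_error(C, labels))  # → 0.0
print(estimate_num_subspaces(C))           # → 2
```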

#### 5.1.2 Subspace detection property holds even when the subspaces intersect

We wish to demonstrate that the subspace detection property holds even when the subspaces intersect. To this end, we generate two subspaces of dimension $d$ in $\mathbb{R}^n$ with an intersection of dimension $s$. We sample one subspace ($S_1$) of dimension $d$ uniformly at random among all $d$-dimensional subspaces, and a subspace of dimension $s$ (denoted by $S_\cap$) inside that subspace, again, uniformly at random. We then sample another subspace of dimension $d-s$ uniformly at random and set $S_2$ to be the direct sum of $S_\cap$ and this subspace.

Our experiment selects points uniformly at random from each subspace. We generate instances from this model and report the average of the first three error criteria over these instances; see Figure 10. Here, the subspace detection property holds up to . Also, after the spectral clustering step, SSC has a vanishing clustering error even when the dimension of the intersection is as large as .
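The sampling scheme above can be sketched as follows (our code; the specific values of $n$, $d$ and $s$ are placeholders, not the paper's). The singular values of $U_1^TU_2$ certify the dimension of the intersection.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sizes: ambient dimension n, subspace dimension d,
# intersection dimension s.
n, d, s = 50, 10, 4

B = np.linalg.qr(rng.standard_normal((n, s)))[0]  # shared orthonormal basis
U1 = np.linalg.qr(np.hstack([B, rng.standard_normal((n, d - s))]))[0]
U2 = np.linalg.qr(np.hstack([B, rng.standard_normal((n, d - s))]))[0]

# The singular values of U1^T U2 are the cosines of the principal angles;
# the s cosines equal to 1 reveal an intersection of dimension s.
cosines = np.linalg.svd(U1.T @ U2, compute_uv=False)
print(int(np.sum(cosines > 1 - 1e-8)))  # → 4
```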

#### 5.1.3 Effect of the affinity between subspaces

In Section 2.1.2 we showed that in the semi-random model, the success of SSC depends upon the affinity between the subspaces and upon the density of points per subspace (recovery becomes harder as the affinity increases and as the density of points per subspace decreases). We study here this trade-off in greater detail through experiments on synthetic data.

We generate three subspaces $S_1$, $S_2$ and $S_3$, each of dimension $d$ in $\mathbb{R}^{2d}$. The choice $n=2d$ makes the problem challenging since every data point on one subspace can also be expressed as a linear combination of points on other subspaces. The bases we choose for $S_1$ and $S_2$ are

$$U^{(1)}=\begin{bmatrix}I_d\\ 0_{d\times d}\end{bmatrix},\qquad U^{(2)}=\begin{bmatrix}0_{d\times d}\\ I_d\end{bmatrix}, \tag{23}$$

whereas for $S_3$,

$$U^{(3)}=\begin{bmatrix}\operatorname{diag}\bigl(\cos(\theta_1),\dots,\cos(\theta_d)\bigr)\\[2pt] \operatorname{diag}\bigl(\sin(\theta_1),\dots,\sin(\theta_d)\bigr)\end{bmatrix}. \tag{24}$$

Above, the principal angles are set in such a way that decreases linearly from to , where and are fixed parameters; that is to say, , .

In our experiments we sample points uniformly at random from each subspace. We fix and vary and . Since , as increases from to , the normalized maximum affinity decreases from to (recall that a normalized affinity equal to 1 indicates a perfect overlap, that is, two subspaces are the same). For each value of and , we evaluate the SSC performance according to the three error criteria above. The results, shown in Figure 11, indicate that SSC is successful even for large values of the maximum affinity as long as the density is sufficiently large. Also, the figures display a clear correlation between the three different error criteria, indicating that each could be used as a proxy for the other two. An interesting point is and ; here, the algorithm can identify the number of subspaces correctly and perform perfect subspace clustering (clustering error is $0$). This indicates that the SSC algorithm in its full generality can achieve perfect subspace clustering even when the subspaces are very close.
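The construction (23)-(24) is easy to reproduce; in the sketch below (ours), $d$ and the range of the angles $\theta_i$ are assumed values. By construction, the principal angles between $S_1$ and $S_3$ are exactly the $\theta_i$.

```python
import numpy as np

# Sketch of the bases (23)-(24); d and the angle range are assumptions.
d = 10
theta = np.linspace(np.pi / 6, np.pi / 3, d)
U1 = np.vstack([np.eye(d), np.zeros((d, d))])                     # U^(1)
U3 = np.vstack([np.diag(np.cos(theta)), np.diag(np.sin(theta))])  # U^(3)

# Singular values of U1^T U3 = diag(cos theta) are the principal-angle cosines:
cosines = np.linalg.svd(U1.T @ U3, compute_uv=False)
affinity = np.linalg.norm(cosines) / np.sqrt(d)  # normalized affinity
print(np.allclose(np.sort(cosines), np.sort(np.cos(theta))))  # → True
```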

#### 5.1.4 Effect of dimension on subspace detection property and spectral gap

In order to illustrate the effect an increase in the dimension of subspaces has on the spectral gap, we generate subspaces chosen uniformly at random from all $d$-dimensional subspaces in $\mathbb{R}^n$. We consider different values for $d$, namely, 5, 10, 15, 20, 25. In all these cases, the total dimension of the subspaces is more than the ambient dimension $n$. We generate unit-normed points on each subspace uniformly at random. The corresponding singular values of the normalized Laplacian are displayed in Figure 12. As evident from this figure, the subspace detection property holds when the dimension of the subspaces is sufficiently small (this corresponds to the last eigenvalues being exactly equal to $0$). Beyond that, the gap is still evident; however, the gap decreases as $d$ increases. In all these cases, the gap was detectable using the sharpest descent heuristic presented in Algorithm 1 and, thus, the correct estimates for the number of subspaces were always found.

#### 5.1.5 Effect of noise on spectral gap

In order to illustrate the effect of noise on the spectral gap, we sample subspaces chosen uniformly at random from all $d$-dimensional subspaces in $\mathbb{R}^n$. The total dimension of the subspaces is once again more than the ambient dimension $n$. We then sample points on each subspace, as before, and perturb each unit-norm data point $x$ by a noisy vector $z$ chosen independently and uniformly at random on the sphere of radius $\sigma$ (the noise level), and then normalize to have unit norm. The noisy samples are $\tilde{x}=(x+z)/\|x+z\|_{\ell_2}$. We consider different values for the noise level $\sigma$, namely, 0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4. The corresponding singular values of the normalized Laplacian are shown in Figure 13. As evident from this figure, we are in a regime where the subspace detection property does not hold even for noiseless data (this corresponds to the last eigenvalues not being exactly equal to $0$). For positive $\sigma$, the gap is still evident but decreases as a function of $\sigma$. In all these cases, the gap was detectable using the sharpest descent heuristic presented in Algorithm 1 and, thus, the number of subspaces was always correctly inferred.
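The noise model can be sketched as follows (our code; the sizes are illustrative). Each column is perturbed by a direction uniform on the sphere, scaled to radius $\sigma$, and renormalized.

```python
import numpy as np

rng = np.random.default_rng(2)

def add_spherical_noise(X, sigma, rng):
    """Perturb each unit-norm column by a vector uniform on the sphere
    of radius sigma, then renormalize to unit norm."""
    Z = rng.standard_normal(X.shape)
    Z *= sigma / np.linalg.norm(Z, axis=0)  # uniform direction, fixed radius
    Y = X + Z
    return Y / np.linalg.norm(Y, axis=0)

X = rng.standard_normal((20, 5))
X /= np.linalg.norm(X, axis=0)
Y = add_spherical_noise(X, 0.1, rng)
print(np.allclose(np.linalg.norm(Y, axis=0), 1.0))  # → True
```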

#### 5.1.6 Comparison with other methods

We now hope to demonstrate that one of the main advantages of SSC is its ability to identify, in much broader circumstances, the correct number of subspaces using the eigen-gap heuristic. Before we discuss the pertaining numerical results, we quickly review a classical method in subspace clustering Costeira98. Start with the rank-$r$ SVD $X=U\Sigma V^T$ of the data matrix and use $VV^T$ as the affinity matrix. (Interestingly, the nuclear-norm heuristic also results in the same affinity matrix Liu10, Favaro11.) It was shown in Costeira98 that when the subspaces are independent, the affinity matrix will be block diagonal and one can thus perform perfect subspace clustering. When the subspaces are not independent, the affinity matrix may occasionally be approximately block diagonal as observed empirically in some particular computer vision applications. In the presence of noise, or when the independence assumption is violated, various methods have been proposed to “clean up” the affinity matrix and put it into block diagonal form Costeira98, Kanatani01, Ichimura99, Wu01, Kanatani02, Kanatani98. As noted by Vidal in Vidal11, most of these algorithms need some knowledge of the true data rank and/or dimension of the subspaces. Furthermore, none of these algorithms have been proven to work when the independence criterion is violated—in contrast with the analysis presented in this paper.
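The classical affinity can be sketched in a few lines (our illustration; `shape_interaction_affinity` is a hypothetical helper name). For two independent subspaces the affinity is exactly block diagonal, as claimed:

```python
import numpy as np

def shape_interaction_affinity(X, r):
    """Classical affinity: rank-r SVD X = U S V^T, affinity |V V^T|
    (exactly block diagonal when the subspaces are independent)."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    V = Vt[:r].T
    return np.abs(V @ V.T)

# Two independent 1-dim subspaces in R^4, two points each (columns).
X = np.array([[1., 2., 0., 0.],
              [1., 2., 0., 0.],
              [0., 0., 1., 3.],
              [0., 0., -1., -3.]])
W = shape_interaction_affinity(X, r=2)
print(np.allclose(W[:2, 2:], 0))  # → True (cross-block entries vanish)
```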

We believe that a major advantage of SSC vis a vis more recent approaches Liu10 , Favaro11 is that the eigen-gap heuristic is applicable under broader circumstances. To demonstrate this, we sample subspaces chosen uniformly at random from all -dimensional subspaces in . The total dimension is once more larger than the ambient dimension . The eigenvalues of the normalized Laplacian of the affinity matrix for both SSC and the classical method () are shown in Figure 14(a). Observe that the gap exists in both plots. However, SSC demonstrates a wider gap and, therefore, the estimation of the number of subspaces is more robust to noise. To illustrate this point further, consider Figure 14(b) in which points are sampled according to the same scheme but with , and with noise possibly added just as in Section 5.1.5. Both in the noisy and noiseless cases, the classical method does not produce a detectable gap, while the gap is detectable using the simple methodology presented in Algorithm 1.

### 5.2 Segmentation with outliers

We now turn to outlier detection. For this purpose, we consider three different setups in which

• , ,

• , ,

• , .

In each case, we sample subspaces chosen uniformly at random so that the total dimension . For each subspace, we generate points uniformly at random so that the total number of data points is . We add outliers chosen uniformly at random on the sphere. Hence, the number of outliers is equal to the number of data points. The optimal values of the optimization problems (2) are plotted in Figure 15. The first values correspond to the data points and the next values to the outliers. As can be seen in all the plots, a gap appears in the values of the norm of the optimal solutions. That is, the optimal value for data points is much smaller than the corresponding optimal value for outlier points. We have argued that the critical parameter for outlier detection is the ratio . The smaller, the better. As can be seen in Figure 15(a), the ratio is already small enough for the conjectured threshold of Algorithm 2 to work and detect all outlier points correctly. However, it wrongfully considers a few data points as outliers. In Figure 15(b), and the conjectured threshold already works perfectly, but the proven threshold is still not able to do outlier detection well. In Figure 15(c), , both the conjectured and proven thresholds can perform perfect outlier detection. (In practice, it is of course not necessary to use the threshold as a criterion for outlier detection; one can instead use a gap in the optimal values.) It is also worth mentioning that if is larger, the optimal value is more concentrated for the data points and, therefore, both the proven and conjectured threshold would work for smaller ratios of (this is different from the small values of above).
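The outlier experiment can be sketched as follows (our code, with illustrative sizes rather than the paper's three setups); each point's optimal value is computed by recasting the $\ell_1$ program as a linear program, and the gap between inlier and outlier values appears directly.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(3)

def l1_optimal_value(X, i):
    """Optimal value of min ||c||_1 s.t. x_i = sum_{j != i} c_j x_j,
    solved as an LP via the split c = u - v with u, v >= 0."""
    y = X[:, i]
    A = np.delete(X, i, axis=1)
    m = A.shape[1]
    res = linprog(np.ones(2 * m), A_eq=np.hstack([A, -A]), b_eq=y,
                  bounds=(0, None))
    return res.fun

# Hypothetical setup: 20 inliers on a 3-dim subspace of R^15, plus 25
# outliers uniform on the sphere (enough outliers to keep every program
# feasible), all normalized to unit norm.
n, d, Nin, Nout = 15, 3, 20, 25
U = np.linalg.qr(rng.standard_normal((n, d)))[0]
X = np.hstack([U @ rng.standard_normal((d, Nin)),
               rng.standard_normal((n, Nout))])
X /= np.linalg.norm(X, axis=0)

vals = np.array([l1_optimal_value(X, i) for i in range(Nin + Nout)])
# Inliers admit sparse in-subspace representations, so their optimal
# values are markedly smaller than those of the outliers.
```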

## 6 Background on Geometric Functional Analysis

Our proofs rely heavily on techniques from Geometric Functional Analysis and we now introduce some basic concepts and results from this field. Most of our exposition is adapted from Vershynin11 .

The maximal and average values of $\|\cdot\|_{\mathcal{K}}$ on the sphere are defined by

$$b(\mathcal{K})=\sup_{x\in S^{n-1}}\|x\|_{\mathcal{K}}\qquad\text{and}\qquad M(\mathcal{K})=\int_{S^{n-1}}\|x\|_{\mathcal{K}}\,d\sigma(x).$$

Above, $\sigma$ is the uniform probability measure on the sphere $S^{n-1}$.

The mean width of a symmetric convex body $\mathcal{K}$ in $\mathbb{R}^n$ is the expected value of the dual norm over the unit sphere,

$$M^*(\mathcal{K})=M(\mathcal{K}^o)=\int_{S^{n-1}}\|y\|_{\mathcal{K}^o}\,d\sigma(y)=\int_{S^{n-1}}\max_{z\in\mathcal{K}}\langle y,z\rangle\,d\sigma(y).$$
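To make the definition concrete, here is a small Monte Carlo sketch (ours, not from the paper) estimating $M^*(\mathcal{K})$ for $\mathcal{K}$ the unit $\ell_1$-ball in $\mathbb{R}^n$, for which $\max_{z\in\mathcal{K}}\langle y,z\rangle=\|y\|_{\ell_\infty}$; NumPy is assumed.

```python
import numpy as np

rng = np.random.default_rng(4)

# Monte Carlo estimate of M*(K) for K the unit l1-ball: average ||y||_inf
# over y uniform on the unit sphere S^{n-1}.
n, trials = 100, 20000
Y = rng.standard_normal((trials, n))
Y /= np.linalg.norm(Y, axis=1, keepdims=True)  # uniform on S^{n-1}
est = np.abs(Y).max(axis=1).mean()

# The known asymptotic order for this mean width is sqrt(2 log n / n).
print(abs(est / np.sqrt(2 * np.log(n) / n) - 1) < 0.3)  # → True
```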

With this in place, we now record some useful results.

###### Lemma 6.1

We always have $M(\mathcal{K})\,M^*(\mathcal{K})\ge 1$.

Observe that since $\|\cdot\|_{\mathcal{K}^o}$ is the dual norm of $\|\cdot\|_{\mathcal{K}}$, we have $1=\langle x,x\rangle\le\|x\|_{\mathcal{K}}\|x\|_{\mathcal{K}^o}$ for every $x\in S^{n-1}$, and, thus,

$$1\le\biggl(\int_{S^{n-1}}\sqrt{\|x\|_{\mathcal{K}}\|x\|_{\mathcal{K}^o}}\,d\sigma(x)\biggr)^2\le M(\mathcal{K})\,M^*(\mathcal{K}),$$

where the inequality follows from Cauchy-Schwarz. The following theorem deals with concentration properties of norms. According to Klartag07, these appear in the first pages of Milman86.

###### Theorem 6.2 (Concentration of measure)

For each $t>0$, we have

$$\sigma\bigl\{x\in S^{n-1}:\ \bigl|\|x\|_{\mathcal{K}}-M(\mathcal{K})\bigr|>t\,M(\mathcal{K})\bigr\}\le C\exp\Bigl(-c\,t^2\,n\,\bigl(M(\mathcal{K})/b(\mathcal{K})\bigr)^2\Bigr)$$

for positive numerical constants $C$ and $c$.