Subspace clustering is a problem motivated by many real applications. It is now widely known that many kinds of high-dimensional data, including motion trajectories (Costeira and Kanade, 1998), face images (Basri and Jacobs, 2003), network hop counts (Eriksson et al., 2012), movie ratings (Zhang et al., 2012) and social graphs (Jalali et al., 2011), can be modelled as samples drawn from the union of multiple low-dimensional linear subspaces (illustrated in Figure 1). Subspace clustering, arguably the most crucial step in understanding such data, refers to the task of clustering the data into their original subspaces and thereby uncovering the underlying structure of the data. The partitions correspond to different rigid objects for motion trajectories, different people for face data, subnets for network data, like-minded users in movie databases and latent communities for social graphs.
Subspace clustering has drawn significant attention in the last decade and a great number of algorithms have been proposed, including Expectation-Maximization-like local optimization algorithms, e.g., K-plane (Bradley and Mangasarian, 2000) and Q-flat (Tseng, 2000), algebraic methods, e.g., Generalized Principal Component Analysis (GPCA) (Vidal et al., 2005), matrix factorization methods (Costeira and Kanade, 1998, 2000), spectral clustering-based methods (Lauer and Schnorr, 2009; Chen and Lerman, 2009), bottom-up local sampling and affinity-based methods (e.g., Yan and Pollefeys, 2006; Rao et al., 2008), and convex optimization-based methods, namely Low Rank Representation (LRR) (Liu et al., 2010, 2013) and Sparse Subspace Clustering (SSC) (Elhamifar and Vidal, 2009, 2013). For a comprehensive survey and comparisons, we refer the reader to the tutorial (Vidal, 2011). Among these algorithms, SSC is known to enjoy superb empirical performance, even for noisy data. For example, it is the state-of-the-art algorithm for motion segmentation on the Hopkins155 benchmark (Tron and Vidal, 2007; Elhamifar and Vidal, 2009), and has been shown to be more robust than LRR as the number of clusters increases (Elhamifar and Vidal, 2013).
The key idea of SSC is to represent each data point by a sparse linear combination of the remaining data points using $\ell_1$ minimization. Without introducing the notation (which is deferred to Section 3), the noiseless and noisy versions of SSC solve, respectively,
for each data column , and the hope is that will be supported only on indices of the data points from the same subspace as . While this formulation is for linear subspaces, affine subspaces can also be dealt with by augmenting data points with an offset variable .
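To make the idea concrete, here is a minimal sketch of the noisy variant. It assumes the objective has the Lasso form $\|c\|_1 + \frac{\lambda}{2}\|x_i - X_{-i}c\|^2$, and it uses plain ISTA (proximal gradient) as the solver; the solver choice, the function names `soft_threshold` and `lasso_ssc`, the parameter name `lam` and the toy data are all illustrative assumptions rather than anything prescribed by the text.

```python
import numpy as np

def soft_threshold(v, t):
    """Entrywise soft-thresholding: the proximal map of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ssc(X, lam, n_iter=500):
    """Column i of the result approximately minimizes
    ||c||_1 + (lam / 2) * ||x_i - X_{-i} c||^2   (assumed objective),
    computed with plain ISTA; the diagonal is kept at zero."""
    n, N = X.shape
    C = np.zeros((N, N))
    for i in range(N):
        idx = np.delete(np.arange(N), i)       # all columns except column i
        A, x = X[:, idx], X[:, i]
        # Lipschitz constant of the gradient of the smooth (quadratic) term.
        lipschitz = lam * np.linalg.norm(A, 2) ** 2
        c = np.zeros(N - 1)
        for _ in range(n_iter):
            grad = lam * (A.T @ (A @ c - x))
            c = soft_threshold(c - grad / lipschitz, 1.0 / lipschitz)
        C[idx, i] = c
    return C

# Toy data: two orthogonal planes in R^4, five unit vectors on each, tiny noise.
rng = np.random.default_rng(0)
angles = np.deg2rad([0.0, 36.0, 72.0, 108.0, 144.0])
Y = np.zeros((4, 10))
Y[0, :5], Y[1, :5] = np.cos(angles), np.sin(angles)
Y[2, 5:], Y[3, 5:] = np.cos(angles), np.sin(angles)
X = Y + 1e-3 * rng.standard_normal(Y.shape)

C = lasso_ssc(X, lam=10.0)
cross = max(np.abs(C[5:, :5]).max(), np.abs(C[:5, 5:]).max())
print("largest cross-subspace coefficient:", cross)
```

In this deliberately easy configuration the coefficients that link points from different planes are thresholded to zero, which is the behaviour the text hopes for. Harder geometries and larger noise are exactly what the analysis in the paper is about.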
Efforts have been made to explain the practical success of SSC by analyzing the noiseless version. Elhamifar and Vidal (2010) show that under certain conditions, disjoint subspaces (i.e., subspaces that do not overlap) can be exactly recovered. A recent geometric analysis of SSC (Soltanolkotabi and Candes, 2012) broadens the scope of the results significantly to the case when subspaces can be overlapping. However, while these analyses advanced our understanding of SSC, one common drawback is that data points are assumed to lie exactly on the subspaces. This assumption can hardly be satisfied in practice. For example, motion trajectory data are only approximately of rank 4 due to perspective distortion of the camera, tracking errors and pixel quantization (Costeira and Kanade, 1998); similarly, face images are not precisely of rank 9 since human faces are at best approximated by a convex body (Basri and Jacobs, 2003).
In this paper, we address this problem and provide a theoretical analysis of SSC with noisy or corrupted data. Our main result shows that a modified version of SSC (see Eq. (3.2)) succeeds when the magnitude of noise does not exceed a threshold determined by a geometric gap between the inradius and the subspace incoherence (see below for precise definitions). This complements the result of Soltanolkotabi and Candes (2012) that shows the same geometric gap determines whether SSC succeeds for the noiseless case. Indeed, when the noise vanishes, our results reduce to the noiseless case results of Soltanolkotabi and Candes.
While our analysis is based upon the geometric analysis of Soltanolkotabi and Candes (2012), the analysis is more involved: In SSC, sample points are used as the dictionary for sparse recovery, and therefore noisy SSC requires analyzing a noisy dictionary. We also remark that our results on noisy SSC are exact, i.e., as long as the noise magnitude is smaller than a threshold, the recovered subspace clusters are correct. This is in sharp contrast to the majority of previous work on structure recovery for noisy data where stability/perturbation bounds are given – i.e., the obtained solution is approximately correct, and the approximation gap goes to zero only when the noise diminishes.
Lastly, we remark that an independently developed work (Soltanolkotabi et al., 2014) analyzed the same algorithm under a statistical model that generates the data. In contrast, our main results focus on the cases when the data are deterministic. Moreover, when we specialize our general result to the same statistical model, we show that we can handle a significantly larger amount of noise under certain regimes.
The paper is organized as follows. In Section 2, we review previous and ongoing work related to this paper. In Section 3, we formally define the notation and explain our method and the models of our analysis. We then present our main theoretical results in Section 4, with examples and remarks to explain the practical implications of each theorem. In Sections 5 and 6, proofs of the deterministic and randomized results are provided. We then evaluate our method experimentally in Section 7 with both synthetic and real-life data, which confirms the predictions of the theoretical results. Lastly, Section 8 summarizes the paper and discusses some open problems for future research on subspace clustering.
2 Related works
In this section, we review previous and ongoing theoretical studies on the problem of subspace clustering.
2.1 Nominal performance guarantee for noiseless data
Most previous analyses concern the nominal performance of a particular subspace clustering algorithm with noiseless data. The focus is on relaxing the assumptions on the underlying subspaces and data generation.
A number of methods have been shown to work under the independent subspace assumption, including the early factorization-based methods (Costeira and Kanade, 1998; Kanatani, 2001), LRR (Liu et al., 2010) and the initial guarantee of SSC (Elhamifar and Vidal, 2009). Recall that the data points are drawn from a union of subspaces. The independent subspace assumption requires each subspace to be linearly independent of the span of all other subspaces. Equivalently, this assumption requires the sum of the subspaces’ dimensions to be equal to the dimension of the span of all subspaces. For example, in a two-dimensional plane, one can only have 2 independent lines. If there are three lines intersecting at the origin, then even if each pair of lines is independent, they are not considered independent as a whole.
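The three-lines example can be checked numerically with a rank computation. The sketch below is only an illustration; the specific angles and variable names are arbitrary choices made for this example.

```python
import numpy as np

# Three distinct lines through the origin in the plane, each spanned by one vector.
angles = np.deg2rad([0.0, 60.0, 120.0])
bases = [np.array([[np.cos(a)], [np.sin(a)]]) for a in angles]

dims = [int(np.linalg.matrix_rank(B)) for B in bases]
span_dim = int(np.linalg.matrix_rank(np.hstack(bases)))

# Independence requires the dimensions to add up to the dimension of the joint span.
independent = sum(dims) == span_dim

# Pairwise, any two of the lines do span the whole plane.
pairwise_independent = all(
    np.linalg.matrix_rank(np.hstack([bases[i], bases[j]])) == 2
    for i in range(3) for j in range(i + 1, 3)
)
print(sum(dims), span_dim, independent, pairwise_independent)
```

The dimensions sum to 3 while the joint span has dimension 2, so the lines are pairwise independent (disjoint) but not independent as a whole.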
The disjoint subspace assumption only requires pairwise linear independence, and hence is more meaningful in practice. To the best of our knowledge, only GPCA (Vidal et al., 2005) and SSC (Elhamifar and Vidal, 2010, 2013) have been shown to provably handle data under the disjoint subspace assumption. GPCA, however, is not a polynomial-time algorithm: its computational complexity increases exponentially with respect to the number and dimension of the subspaces.
Soltanolkotabi and Candes (2012) developed a geometric analysis that further extends the performance guarantee of SSC; in particular, it covers some cases in which the underlying subspaces are slightly overlapping, meaning that two subspaces can even share a basis. The analysis reveals that the success of SSC relies on the difference of two geometric quantities (inradius $r$ and incoherence $\mu$) being greater than $0$, which leads to by far the most general and strongest theoretical guarantee for noiseless SSC. A summary of these assumptions on the subspaces and their formal definitions is given in Table 1.
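|Disjoint Subspaces||$\mathcal{S}_k \cap \mathcal{S}_\ell = \{0\}$ for all $k \neq \ell$.|
|Overlapping Subspaces||No point lies in $\mathcal{S}_k \cap \mathcal{S}_\ell$ for any $k \neq \ell$.|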
2.2 Robust performance guarantee
Previous studies of subspace clustering under noise have been mostly empirical. For instance, the factorization, spectral clustering and local affinity based approaches mentioned above are able to produce a (sometimes good) solution even for noisy real data. Convex optimization based approaches like LRR and SSC can be naturally reformulated as robust methods by relaxing the hard equality constraints into a penalty term in the objective function. In fact, the superior results of SSC and LRR on motion segmentation and face clustering data are produced using the robust extensions in Elhamifar and Vidal (2009) and Liu et al. (2010) instead of the well-studied noiseless versions.
At the time of writing, very few subspace clustering methods are guaranteed to work when data are noisy. Besides the conference version of the current paper (Wang and Xu, 2013), an independent work (Soltanolkotabi et al., 2014) also analyzed SSC under noise. Subsequently, guarantees under noise have been obtained for other algorithms, e.g., the thresholding based approach (Heckel and Bölcskei, 2013) and orthogonal matching pursuit (Dyer et al., 2013).
The main difference between our work and Soltanolkotabi et al. (2014) is that our guarantee works for a more general set of problems in which the data and noise may not be random, whereas the key arguments in the proof of Soltanolkotabi et al. (2014) rely on the assumption that data points are uniformly distributed on the unit sphere within each subspace, which corresponds to the “semi-random model” in our paper. As illustrated in Elhamifar and Vidal (2013, Figures 9 and 10), the semi-random model is not a good fit for either the motion segmentation or the face clustering data sets, as in these data sets there is a fast decay in the singular values of each subspace. The uniform distribution assumption becomes even harder to justify as the dimension of each subspace gets larger, which is the regime that the analysis in Soltanolkotabi et al. (2014) focuses on.
Moreover, with a minor modification in our analysis that sharpens the bound of the tuning parameter that ensures the solution is non-trivial, we are able to get a result that is stronger than Soltanolkotabi et al. (2014) in cases when the dimension of each subspace . (Admittedly, Soltanolkotabi et al. (2014) obtained better noise tolerance than the comparable result in our conference version (Wang and Xu, 2013).) This result extends the provable guarantee of SSC to a setting where the signal-to-noise ratio (SnR) is allowed to go to $0$ as the ambient dimension gets large. In summary, we compare our results in terms of the level of noise that can be provably tolerated in Table 2. These comparisons are in the same setting modulo some slight differences in the noise model and success criteria. It is worth noting that when , Soltanolkotabi et al. (2014)’s bound is sharper. We will provide more details in the Appendix.
|This paper||(Wang and Xu, 2013)||Soltanolkotabi et al. (2014)|
|Deterministic + random noise||N.A.|
|Semi-random data + random noise|
|Fully-random data + random noise|
Lastly, we note that the notion of robustness in this paper is confined to noise/arbitrary corruptions added to the legitimate data. It is not robustness against outliers in the data, unless otherwise specified. Handling outliers is a completely different problem. Solutions have been proposed for LRR in Liu et al. (2012), by separating out a column-wise sparse component penalized with the $\ell_{2,1}$ norm, and for SSC in Soltanolkotabi and Candes (2012), by objective value thresholding. However, these results require non-outlier data points to be free of noise, and are therefore not comparable to the study in this paper.
3 Problem setup
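We denote the uncorrupted data matrix by $Y \in \mathbb{R}^{n \times N}$, where each column of $Y$ (normalized to a unit vector) belongs to a union of $L$ subspaces $\mathcal{S}_1 \cup \mathcal{S}_2 \cup \dots \cup \mathcal{S}_L$. (We assume the normalization condition for ease of presentation. Our results can be extended to the case when each column of the noisy data points is normalized, as well as the case where no normalization is performed at all, under simple modifications to the conditions.)
Each subspace $\mathcal{S}_\ell$ is of dimension $d_\ell$ and contains $N_\ell$ data samples with $N_1 + N_2 + \dots + N_L = N$. We observe the noisy data matrix $X = Y + Z$, where $Z$ is some arbitrary noise matrix. Let $Y^{(\ell)} \in \mathbb{R}^{n \times N_\ell}$ denote the selection of columns in $Y$ that belong to $\mathcal{S}_\ell \subset \mathbb{R}^n$, and denote the corresponding columns in $X$ and $Z$ by $X^{(\ell)}$ and $Z^{(\ell)}$, respectively. Without loss of generality, let $X = [X^{(1)}, X^{(2)}, \dots, X^{(L)}]$ be ordered. In addition, we use the subscript “$-i$” to represent a matrix that excludes column $i$, e.g., $X^{(\ell)}_{-i} = [x^{(\ell)}_1, \dots, x^{(\ell)}_{i-1}, x^{(\ell)}_{i+1}, \dots, x^{(\ell)}_{N_\ell}]$. Calligraphic letters such as $\mathcal{X}$ and $\mathcal{Y}^{(\ell)}$ represent the set containing all columns of the corresponding matrix (e.g., $X$ and $Y^{(\ell)}$).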
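For any matrix $X$, $\mathcal{P}(X)$ represents the symmetrized convex hull of its columns, i.e., $\mathcal{P}(X) = \mathrm{conv}(\pm\mathcal{X})$. Also let $\mathcal{P}^{(\ell)}_{-i} := \mathcal{P}(X^{(\ell)}_{-i})$ and $\mathcal{Q}^{(\ell)}_{-i} := \mathcal{P}(Y^{(\ell)}_{-i})$ for short. $\mathbb{P}_{\mathcal{S}}$ and $\mathrm{Proj}_{\mathcal{S}}$ denote respectively the projection matrix and projection operator (acting on a set) to subspace $\mathcal{S}$. Throughout the paper, $\|\cdot\|$ represents the $\ell_2$-norm for vectors and the operator norm for matrices; other norms will be explicitly specified (e.g., $\|\cdot\|_1$, $\|\cdot\|_\infty$).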
Original SSC solves the linear program

$$\min_{c_i} \|c_i\|_1 \quad \text{s.t.} \quad x_i = X_{-i} c_i \qquad (3.1)$$

for each data point $x_i$. Solutions are arranged into a matrix $C = [c_1, \dots, c_N]$, then spectral clustering techniques such as Ng et al. (2002) are applied on the affinity matrix $|C| + |C|^T$ ($|\cdot|$ represents entrywise absolute value). Note that when $Z \neq 0$, this method breaks down: indeed (3.1) may even be infeasible.
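As a small illustration of the post-processing step, the sketch below builds a symmetric affinity from a coefficient matrix and counts the connected components of the resulting graph through the spectrum of its Laplacian. The form $|C| + |C|^T$ of the affinity is taken as an assumption here, the helper names are hypothetical, and a full spectral clustering routine in the style of Ng et al. (2002) would additionally run k-means on the leading eigenvectors, which is omitted.

```python
import numpy as np

def affinity_from_coefficients(C):
    """Symmetric affinity W = |C| + |C|^T (assumed form) from coefficients C."""
    A = np.abs(C)
    return A + A.T

def count_components(W, tol=1e-8):
    """Number of connected components of the weighted graph W, i.e. the
    multiplicity of the eigenvalue 0 of the unnormalized graph Laplacian."""
    laplacian = np.diag(W.sum(axis=1)) - W
    eigvals = np.linalg.eigvalsh(laplacian)
    return int(np.sum(eigvals < tol))

# A block-diagonal coefficient matrix: points {0, 1} and {2, 3} only use each other.
C = np.array([[0.0, 0.5, 0.0, 0.0],
              [0.7, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 0.3],
              [0.0, 0.0, 0.9, 0.0]])
W = affinity_from_coefficients(C)
n_components = count_components(W)
print(n_components)
```

When the coefficient matrix is block diagonal and every block is connected, the number of zero eigenvalues equals the number of clusters, which is what makes the subsequent spectral step straightforward.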
We will focus on Formulation (3.2) in this paper. Notice that (3.2) coincides with the standard LASSO. Yet, since our task is subspace clustering, the analysis of LASSO (mainly for the task of support recovery) does not extend to SSC. In particular, existing guarantees for LASSO require the dictionary to satisfy the Restricted Isometry Property (RIP for short; Candès, 2008) or the Null-space property (Donoho et al., 2006), but neither of them is satisfied in the subspace clustering setup. (As a simple illustrative example, suppose there exist two identical columns in the dictionary; this violates RIP for 2-sparse signals and attains the maximum possible incoherence.)
In the subspace clustering task, there is no single “ground-truth” to compare the solution against. Instead, the algorithm succeeds if each sample is expressed as a linear combination of samples belonging to the same subspace, as the following definition states.
[LASSO Subspace Detection Property]
We say the subspaces $\{\mathcal{S}_\ell\}_{\ell=1}^{L}$ and noisy sample points $X$ from these subspaces obey the LASSO subspace detection property with parameter $\lambda$ if and only if, for all $i$, the optimal solution $c_i$ to (3.2) with parameter $\lambda$ satisfies:
(1) $c_i$ is not a zero vector, i.e., the solution is non-trivial;
(2) nonzero entries of $c_i$ correspond only to columns of $X$ sampled from the same subspace as $x_i$.
This property ensures that the output matrix $C$ and (naturally) the affinity matrix $|C| + |C|^T$ are exactly block diagonal, with each subspace cluster represented by a disjoint block. The property is illustrated in Figure 2. For convenience, we will refer to the second requirement alone as the “Self-Expressiveness Property” (SEP), as defined in Elhamifar and Vidal (2013).
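Given a coefficient matrix and ground-truth labels, the two requirements of the definition can be verified mechanically. The helper below is a hypothetical checker written for illustration; it assumes that column $i$ of `C` holds the coefficients of point $i$ and treats entries below a small tolerance as zero.

```python
import numpy as np

def check_subspace_detection(C, labels, tol=1e-10):
    """Return (non_trivial, sep) for a coefficient matrix C whose i-th column
    holds the coefficients of point i, given the true subspace labels."""
    C = np.asarray(C, dtype=float)
    labels = np.asarray(labels)
    support = np.abs(C) > tol
    # (1) every column must contain at least one nonzero entry.
    non_trivial = bool(support.any(axis=0).all())
    # same[j, i] is True when points j and i come from the same subspace.
    same = labels[:, None] == labels[None, :]
    # (2) no nonzero coefficient may link points from different subspaces.
    sep = not bool((support & ~same).any())
    return non_trivial, sep

C_ok = np.array([[0.0, 0.5, 0.0, 0.0],
                 [0.7, 0.0, 0.0, 0.0],
                 [0.0, 0.0, 0.0, 0.3],
                 [0.0, 0.0, 0.9, 0.0]])
labels = [0, 0, 1, 1]
result_ok = check_subspace_detection(C_ok, labels)
print(result_ok)
```

A single nonzero entry across two clusters breaks the second requirement, and a single all-zero column breaks the first, so the property is indeed a strong one.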
Note that the LASSO Subspace Detection Property is a strong condition. In practice, spectral clustering does not require the exact block diagonal structure for perfect segmentation (check Figure 9 in our simulation section for details). A caveat is that it is also not sufficient for perfect segmentation, since it does not guarantee each diagonal block forms a connected component. This is a known problem for SSC (Nasihatkon and Hartley, 2011), although we observe that in practice graph connectivity is usually not a big issue. Proving the high-confidence connectivity (even under probabilistic models) remains an open problem, except for the almost trivial cases when the subspaces are independent (Liu et al., 2013; Wang et al., 2013).
Models of analysis:
Our objective here is to provide sufficient conditions under which the LASSO subspace detection property holds in the following four models. Precise definitions of the noise models will be given in Section 4.
|fully deterministic model;|
|deterministic data with random noise;|
|semi-random data with random noise;|
|fully random model.|
4 Main results
4.1 Deterministic model
We start by defining two concepts adapted from the original proposal of Soltanolkotabi and Candes (2012). [Projected Dual Direction] Let be the optimal solution to the dual optimization program (this definition is related to (5.3), the dual problem of (3.2), which we will define in the proof)
and is a low-dimensional subspace. The projected dual direction is defined as
[Projected Subspace Incoherence Property] Compactly denote projected dual direction and . We say that vector set is -incoherent to other points if
Here, measures the incoherence between corrupted subspace samples and clean data points in other subspaces (illustrated in Figure 4). As by the normalization assumption, the range of is . In case of random subspaces in high dimension, is close to zero. Moreover, as we will see later, for deterministic subspaces and random data points, is proportional to their expected angular distance (measured by cosine of canonical angles).
The projected dual direction and the projected subspace incoherence property defined above differ from the dual direction and subspace incoherence property of Soltanolkotabi and Candes (2012) in that we require a projection to a particular subspace to cater to the analysis of the noisy case. Also, since they reduce to the original definitions when data are noiseless and , these definitions can be considered generalizations of their original versions.
[inradius] The inradius of a convex body , denoted by , is defined as the radius of the largest Euclidean ball inscribed in . The inradius of a
describes the dispersion of the data points. Well-dispersed data lead to a larger inradius, whereas a skewed or concentrated distribution of data has a small inradius. An illustration is given in Figure 4.
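The effect of dispersion on the inradius is easy to see in two dimensions. The sketch below assumes the convex body is the symmetrized convex hull of the data columns and uses the fact that, for a symmetric convex body, the inradius equals the minimum of the support function $h(u) = \max_i |\langle x_i, u\rangle|$ over unit directions $u$. The function name and the grid resolution are illustrative choices, and the scan over a grid of angles is only an approximation.

```python
import numpy as np

def inradius_2d(X, n_angles=3601):
    """Approximate inradius of conv(+/- columns of X) for 2-D data by scanning
    unit directions and minimizing the support function of the hull."""
    theta = np.linspace(0.0, np.pi, n_angles)       # symmetry: half a circle suffices
    U = np.vstack([np.cos(theta), np.sin(theta)])   # 2 x n_angles unit directions
    support = np.max(np.abs(X.T @ U), axis=0)       # h(u) = max_i |<x_i, u>|
    return float(support.min())

# Well-dispersed points: the two coordinate directions.
dispersed = np.array([[1.0, 0.0],
                      [0.0, 1.0]])

# Concentrated points: two unit vectors only 10 degrees apart.
a = np.deg2rad(5.0)
concentrated = np.array([[np.cos(a), np.cos(a)],
                         [np.sin(a), -np.sin(a)]])

r_dispersed = inradius_2d(dispersed)
r_concentrated = inradius_2d(concentrated)
print(r_dispersed, r_concentrated)
```

The dispersed pair gives an inradius of about $1/\sqrt{2}$, while the concentrated pair gives about $\sin 5^\circ$, roughly eight times smaller.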
[Deterministic noise model] Consider arbitrary additive noise to , each column is bounded by the two quantities below:
As we assume the uncorrupted data point has unit norm, essentially describes the amount of allowable relative error.
Under the deterministic noise model, compactly denote
If for each , furthermore
then LASSO subspace detection property holds for all weighting parameter in the range
which is guaranteed to be non-empty. We now offer some discussions of the theorem and the proof will be given in Section 5.
Condition can be interpreted as the breaking point under increasing magnitude of attack. This suggests that SSC by (3.2) is provably robust to arbitrary noise having signal-to-noise ratio (SNR) greater than . (Notice that , and hence .)
Tuning parameter .
The range of the parameter in the theorem depends on the unknown quantities , and , and therefore cannot be used to choose the parameter in practice. It does, however, justify that when is small, the range of $\lambda$ for which Lasso-SSC works is large, and therefore not hard to tune. In practice, we do not need to know a priori. One approach is to trace the Lasso path (Tibshirani et al., 2013) until we have about non-zero entries in the coefficient vector. If we would like to use a single $\lambda$ for all columns, a good starting point is to take $\lambda$ to be of the order of , which ensures that the solution is at least non-trivial.
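One concrete way to see where non-triviality kicks in: if the objective is assumed to be $\|c\|_1 + \frac{\lambda}{2}\|x_i - X_{-i}c\|^2$, the subgradient condition at $c = 0$ shows that the zero vector is optimal exactly when $\lambda \|X_{-i}^T x_i\|_\infty \le 1$. The sketch below computes this per-column threshold; the objective form, the function name and the toy data are assumptions made for illustration, and this is not claimed to be the specific order of magnitude the text recommends.

```python
import numpy as np

def nontrivial_thresholds(X):
    """For each column x_i, return 1 / max_{j != i} |<x_j, x_i>|.
    Under the assumed objective, any lam above this value gives c_i != 0."""
    G = np.abs(X.T @ X)
    np.fill_diagonal(G, 0.0)                 # exclude the column itself
    with np.errstate(divide="ignore"):       # a column orthogonal to all others gives inf
        return 1.0 / G.max(axis=0)

# Three unit vectors in the plane at 0, 60 and 90 degrees.
angles = np.deg2rad([0.0, 60.0, 90.0])
X = np.vstack([np.cos(angles), np.sin(angles)])

thresholds = nontrivial_thresholds(X)
lam_single = thresholds.max()                # one lam that is non-trivial for every column
print(thresholds, lam_single)
```

Taking the largest threshold over all columns gives a single value that keeps every column's solution away from zero; how far above it one should go is governed by the range in the theorem.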
Agnostic subspace clustering.
The robustness to deterministic error is important, since in practice the union-of-subspace structures are usually only good approximations. If each subspace has decaying singular values (e.g., motion segmentation, face clustering (Elhamifar and Vidal, 2013) and hybrid system identification (Vidal et al., 2003)), the deterministic guarantee allows for flexibility in choosing the cut-off points, e.g., taking 90% of the energy as signal and treating the remaining spectrum as noise. If one keeps a smaller number of singular values (i.e., a smaller subspace dimension), the inradius will likely be larger (a formal relationship between the inradius and the smallest singular value is described in Wang et al., 2013), although the noise level also increases. It is possible that the conditions in Theorem 4.1 are satisfied for some decompositions (e.g., those with a large spectral gap) but not others. The nice thing is that this is not a tuning parameter, but rather a theoretical property that the user can remain agnostic to. In fact, the algorithm will be provably effective as long as the conditions are satisfied for any signal-noise decomposition (not restricted to rank-projection). None of this is possible if distributional assumptions are made on either the data or the noise.
4.2 Randomized models
We further analyze three randomized models with increasing level of randomness.
- Deterministic+Random Noise.
Subspaces and samples in subspace are arbitrary; the noise obeys the Random Noise model (Definition 4.2).
- Semi-random+Random Noise.
Subspace is deterministic, but samples in each subspace are drawn iid uniformly from the intersection of the unit sphere and the subspace; the noise obeys the Random Noise model.
- Fully random.
Both subspace and samples are drawn uniformly at random from their respective domains; the noise is iid Gaussian.
In each of these models, we improve the performance guarantee over our conference version (Wang and Xu, 2013). In the most well-studied semi-random model, we are able to handle cases where the noise level is much larger than the signal, and hence improve upon the best known result for SSC (Soltanolkotabi et al., 2014). A detailed comparison of the noise tolerance of these methods is given in Table 2.
[Random noise model] Our random noise model is defined to be any additive that is (1) columnwise iid; (2) spherically symmetric; and (3) for all
with probability at least. A good example of our random noise model is iid Gaussian noise. Let each entry . It is known that (see Lemma 6) for some constant
[Deterministic+Random Noise] Under random noise model, compactly denote , and as in Theorem 4.1, furthermore let
If for all ,
then with probability at least , LASSO subspace detection property holds for all weighting parameter in the range
which is guaranteed to be non-empty.
Low SnR paradigm.
Compared to Theorem 4.1, Theorem 4.2 considers a more benign noise, which leads to a stronger result. In particular, without assuming any statistical model on how data are generated, we show that Lasso-SSC is able to tolerate noise of level or (whichever is smaller). This extends SSC’s guarantee with deterministic data to cases where the noise can be significantly larger than the signal. In fact, the SnR can go to $0$ as the ambient dimension gets large.
On the other hand, Theorem 4.2 shows that Lasso-SSC is able to tolerate a constant level of noise when the geometric gap is as small as . This is arguably near-optimal (when is small) as the projection of a constant-level random noise into a -dimensional subspace has an expected magnitude of the same order, which could easily close up the small geometric gap for some non-trivial probability if the noise is much larger.
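The remark about projecting random noise onto a low-dimensional subspace can be illustrated with a quick simulation. The sketch assumes iid Gaussian noise with entries of variance $\sigma^2/n$, so that each noise column has norm close to $\sigma$; the dimensions, the seed and the variable names are arbitrary choices for this example. Under that scaling, the projection onto a fixed $d$-dimensional subspace has norm close to $\sigma\sqrt{d/n}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, N, sigma = 2000, 20, 200, 1.0

# Gaussian noise scaled so that each column has norm close to sigma (assumed scaling).
Z = rng.normal(0.0, sigma / np.sqrt(n), size=(n, N))

# Orthonormal basis of a random d-dimensional subspace of R^n.
U, _ = np.linalg.qr(rng.standard_normal((n, d)))

full_norm = np.linalg.norm(Z, axis=0).mean()        # average noise magnitude
proj_norm = np.linalg.norm(U.T @ Z, axis=0).mean()  # average magnitude after projection
print(full_norm, proj_norm, sigma * np.sqrt(d / n))
```

Although the noise has roughly unit norm here, only about a tenth of it falls inside the 20-dimensional subspace, which is why random noise is so much more benign than adversarial noise of the same magnitude.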
Margin of error.
The bound depends critically on $r - \mu$, the difference of inradius and incoherence, which is the geometric gap that appears in the noiseless guarantee of Soltanolkotabi and Candes (2012). We will henceforth call this gap the margin of error.
We now analyze this margin of error under different generative models. We start from the semi-random model, where the distance between two subspaces is measured as follows. The affinity between two subspaces is defined by:
where is the canonical angle between the two subspaces. Let and be a set of orthonormal bases of each subspace, then . When data points are randomly sampled from each subspace, the geometric entity can be expressed using this (more intuitive) subspace affinity, which leads to the following theorem.
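For readers who want to compute such an affinity, the sketch below assumes it is defined as the square root of the sum of squared cosines of the canonical angles, which coincides with the Frobenius norm $\|U^T V\|_F$ for orthonormal bases $U$ and $V$ because the singular values of $U^T V$ are exactly those cosines. The function names are hypothetical, and whether the definition carries an additional normalization is not assumed here.

```python
import numpy as np

def canonical_cosines(U, V):
    """Cosines of the canonical angles between span(U) and span(V);
    U and V must have orthonormal columns."""
    return np.linalg.svd(U.T @ V, compute_uv=False)

def subspace_affinity(U, V):
    """sqrt(sum of squared canonical cosines), equal to ||U^T V||_F (assumed definition)."""
    return float(np.sqrt(np.sum(canonical_cosines(U, V) ** 2)))

# Two planes in R^3 sharing the direction e1 and meeting at 45 degrees otherwise.
s = 1.0 / np.sqrt(2.0)
U = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
V = np.array([[1.0, 0.0],
              [0.0, s],
              [0.0, s]])

cosines = canonical_cosines(U, V)
aff = subspace_affinity(U, V)
print(cosines, aff, np.linalg.norm(U.T @ V, "fro"))
```

The two planes share one direction (cosine 1) and are at 45 degrees in the other (cosine $1/\sqrt{2}$), so the affinity is $\sqrt{1.5}$; orthogonal subspaces would give 0.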
[Semi-random model+random noise] Under the semi-random model with random noise, there exists a non-empty range of such that LASSO subspace detection property holds with probability as long as the noise level obeys
where , , , , and are absolute constants. The proof essentially substitutes meaningful bounds for the incoherence and inradius parameters in Theorem 4.2, so the theorem above can be regarded as a corollary of Theorem 4.2.
Comparison to Soltanolkotabi et al. (2014).
In the high dimensional setting when , our result is able to handle the low SnR regime when , while Soltanolkotabi et al. (2014) needs to be bounded by a small constant.
In the case when is a constant fraction of , however, our bound is worse by a factor of . Soltanolkotabi et al. (2014) is still able to handle a small constant noise while we need . The suboptimal bound might be due to the fact that we simply develop the theorem for the semi-random model as a corollary of Theorem 4.2 and have not fully exploited the structure of the semi-random model in the proof.
We now turn to the fully random case. [Fully random model] Suppose there are subspaces each with dimension , chosen independently and uniformly at random. For each subspace, points are chosen independently and uniformly from the unit sphere inside each subspace. Each measurement is corrupted by iid Gaussian noise . Furthermore, if
then with probability at least , the LASSO subspace detection property holds for any in the range
which is guaranteed to be non-empty. Here, are absolute constants. The results under this simple model are very interpretable. They provide intuitive guidelines on how the robustness of Lasso-SSC changes with respect to the various parameters of the data. On one hand, it is sensitive to the dimension of each subspace , since the . This dependence on subspace dimension is not a critical limitation, as most interesting applications indeed have very low subspace dimension, as summarized in Table 3. On the other hand, the dependence on the number of subspaces (in both and since ) is only logarithmic. This suggests that SSC is robust even when there are many clusters, and .
|3D motion segmentation (Costeira and Kanade, 1998)|
|Face clustering (with shadow) (Basri and Jacobs, 2003)|
|Diffuse photometric face (Zhou et al., 2007)|
|Network topology discovery (Eriksson et al., 2012)|
|Hand writing digits (Hastie and Simard, 1998)|
|Social graph clustering (Jalali et al., 2011)|
4.3 Geometric interpretations
The left pane depicts the separation condition in Theorem 2.5 of Soltanolkotabi and Candes (2012). The blue polygon represents the intersection of halfspaces defined by dual directions that are also tangent to the red inscribed sphere. More precisely, this is . From our illustration of in Figure 4, we can easily tell that if and only if the projections of external data points fall inside this solid blue polygon. We call this solid blue polygon the successful region.
The right pane illustrates our guarantee of Theorem 4.1 under bounded deterministic noise. The success condition requires that the whole red ball (analogous to the uncertainty set in Robust Optimization (Ben-Tal and Nemirovski, 1998; Bertsimas and Sim, 2004)) around each external data point fall inside the dashed red polygon, which is smaller than the blue polygon by a factor related to the noise level and the inradius.
The successful region is affected by the noise because the design matrix is also arbitrarily perturbed and the dual solution is no longer within each subspace . Specifically, as will become clear in the proof, the key to showing SEP boils down to proving for all pairs of where
and is any point from another subspace. In the noiseless case we can always take and . For noisy data and Lasso-SSC, we can no longer do that. In fact, for any fixed , the dual solution will be uniquely determined by a projection of onto the feasible region (see the first pane of Figure 4). The absolute value of the inner product will depend on the magnitude of the dual solution, especially its component perpendicular to the current subspace. Indeed, by carefully choosing the error, we can make very correlated with some external data point .
To illustrate this further, we plot the shape of this feasible region in 3D (see Figure 7(b)). From the feasible region alone, it seems that the magnitude of dual variable can potentially be quite large. Luckily, the quadratic penalty in the objective function allows us to exploit the optimality of the solution and bound the “out-of-subspace” component of , which results in a much smaller region where the solution can potentially be (given in Figure 7(c)). The region for the “in-subspace” component is also smaller as is shown in Figure 7. A more detailed argument of this is given in Section 5.3 of the proof.
Admittedly, the geometric interpretation under noise is slightly messier than in the noiseless case, but it is clear that the largest deterministic noise Lasso-SSC can tolerate must be smaller than the geometric gap . Theorem 4.1 shows that a sufficient condition is . It remains unclear whether this gap can be closed without additional assumptions.
Finally, we note that for the random noise model in Theorem 4.2, the geometric interpretation is similar, except that the impact of the noise is weakened. Thanks to the randomness and the corresponding concentration of measure, we may bound the reduction of the successful region with a much smaller value compared to the adversarial noise case.
5 Proof of the Deterministic Result
In this section, we provide the proof for Theorem 4.1.
Instead of analyzing (3.2) directly, we consider an equivalent constrained version by introducing slack variable :
The constraint can be rewritten as
The dual program of (5.1) is:
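A short derivation sketch may help here. It assumes, and this is an assumption of the sketch since the displayed programs are not reproduced above, that the constrained problem has the form $\min_{c,e} \|c\|_1 + \frac{\lambda}{2}\|e\|^2$ subject to $x = Xc + e$, with the column index dropped and $\nu$ denoting the dual variable. Under that assumption, minimizing the Lagrangian over $e$ and $c$ separately gives the dual:

```latex
\begin{align*}
\mathcal{L}(c, e, \nu)
  &= \|c\|_1 + \tfrac{\lambda}{2}\|e\|^2 + \langle \nu,\; x - Xc - e \rangle, \\
\min_{e}\ \tfrac{\lambda}{2}\|e\|^2 - \langle \nu, e \rangle
  &= -\tfrac{1}{2\lambda}\|\nu\|^2
  \quad (\text{attained at } e = \nu/\lambda), \\
\min_{c}\ \|c\|_1 - \langle X^T \nu, c \rangle
  &= \begin{cases} 0, & \|X^T \nu\|_\infty \le 1, \\ -\infty, & \text{otherwise,} \end{cases} \\
\Longrightarrow \quad
\max_{\nu}\ \langle x, \nu \rangle - \tfrac{1}{2\lambda}\|\nu\|^2
  \quad &\text{s.t.} \quad \|X^T \nu\|_\infty \le 1 .
\end{align*}
```

In this form the optimal slack and the optimal dual variable are tied together by $e = \nu/\lambda$, so bounding the dual variable is the same as bounding the residual.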
Recall that we want to establish the conditions on noise magnitude , structure of the data ( and in the deterministic model and affinity in the semi-random model), and ranges of valid such that by Definition 3, the solution is non-trivial and has support indices inside the column set (i.e., satisfies SEP).
The proof is hence organized into three main steps:
Proving non-trivialness by showing that the optimal value is smaller than the value of the trivial solution (i.e., and ). This step is given in Section 5.4.
Showing the existence of a proper $\lambda$. As will be made clear later, conditions for (1) include and (2) requires for some expressions and . Then it is natural to request , so that a valid $\lambda$ exists. It turns out that this condition boils down to for some expression . This argument is carried out in Section 5.5.
5.1 Optimality Condition
Consider a general convex optimization problem:
We state Lemma 5.1, which extends Lemma 7.1 in Soltanolkotabi and Candes (2012). Consider a vector and a matrix . If there exists a triplet obeying and has support , furthermore the dual certificate vector satisfies
then any optimal solution to (5.4) obeys . For optimal solution , we have:
To see , note that the right hand side equals , which takes a maximal value of when . The last equation holds because both and are feasible solutions, such that . Also, note that .
With the inequality constraints of given in the lemma statement, we have
Substituting into (5.5), we get:
where is strictly greater than .
Using the fact that is an optimal solution, . Therefore, and is also an optimal solution. This concludes the proof.
5.2 Construction of Dual Certificate
To construct the dual certificate, we consider the following fictitious optimization problem (and its dual) that explicitly requires that all feasible solutions satisfy SEP (to be precise, it is the corresponding that satisfies SEP). Note that one cannot solve such a problem in practice without knowing the subspace clusters, hence the name “fictitious”.
This optimization problem is feasible because so any obeying and the corresponding form a feasible pair. Then by strong duality, the dual program is also feasible, which implies that for every optimal solution of (5.6) with supported on , there exist satisfying:
This construction of satisfies all conditions in Lemma 5.1 with respect to
i.e., we must check for all data points ,
Thus, if we show that the solution of (5.7) also satisfies (5.9), we can conclude that is a dual certificate required in Lemma 5.1, which implies that the candidate solution (5.8) associated with optimal of (5.6) is indeed the optimal solution of (5.1) and therefore SEP holds.
5.3 Dual separation condition
Our strategy to show (5.9) is to provide an upper bound of then impose the inequality on the upper bound.
First, we find it appropriate to project to the subspace and to its orthogonal complement, and then analyze the two components separately. For convenience, denote , . Then
To see the last inequality, check that by Definition 4.1, .
Since we are considering general (possibly adversarial) noise, we will use the relaxation for all cosine terms (a better bound under random noise will be given later). Thus, what is left is to bound and (note ).
We first bound by exploiting the feasible region of in (5.7):
which is equivalent to
Decompose the condition into
and relax the expression into
The relaxed condition contains the feasible region of in (5.7). It turns out that the geometric properties of this relaxed feasible region provide an upper bound of . [polar set] The polar set of set is defined as
By the polytope geometry, we have
Now we introduce the concept of circumradius. [circumradius] The circumradius of a convex body , denoted by , is defined as the radius of the smallest Euclidean ball containing . The magnitude is bounded by . Moreover, by the following lemma we may find the circumradius by analyzing the polar set of instead. By the property of the polar operator, the polar of a polar set gives the tightest convex envelope of the original set, i.e., . Since is convex in the first place, the polar set of is . [Page 448 in Brandenberg et al. (2004)] For a symmetric convex body , i.e. , the inradius of and the circumradius of the polar set of satisfy:
Given , denote , furthermore where is a linear subspace, then we have: