A Theoretical Analysis of Noisy Sparse Subspace Clustering on Dimensionality-Reduced Data

10/24/2016 ∙ by Yining Wang, et al. ∙ Carnegie Mellon University

Subspace clustering is the problem of partitioning unlabeled data points into a number of clusters so that the data points within each cluster lie approximately on a low-dimensional linear subspace. In many practical scenarios, the dimensionality of the data points to be clustered is compressed due to constraints of measurement, computation or privacy. In this paper, we study the theoretical properties of a popular subspace clustering algorithm named sparse subspace clustering (SSC) and establish formal success conditions of SSC on dimensionality-reduced data. Our analysis applies to the most general fully deterministic model, where both the underlying subspaces and the data points within each subspace are deterministically positioned, and to a wide range of dimensionality reduction techniques (e.g., Gaussian random projection, uniform subsampling, sketching) that fall into a subspace embedding framework (Meng & Mahoney, 2013; Avron et al., 2014). Finally, we apply our analysis to a differentially private SSC algorithm and establish both privacy and utility guarantees of the proposed method.


1 Introduction

Subspace clustering is an unsupervised learning paradigm aiming at grouping unlabeled data points into disjoint clusters so that data points within each cluster lie near a low-dimensional linear subspace. It has found many successful applications in computer vision and machine learning, as many high-dimensional data can be approximated by a union of low-dimensional subspaces. Examples include motion trajectories (Costeira & Kanade, 1998), face images (Basri & Jacobs, 2003), network hop counts (Eriksson et al., 2012), movie ratings (Zhang et al., 2012) and social graphs (Jalali et al., 2011).

The success in applications is made possible by two decades of algorithmic research on this problem. Popular approaches include Expectation-Maximization-style methods such as K-plane (Bradley & Mangasarian, 2000) and Q-flat (Tseng, 2000), algebraic methods such as generalized principal component analysis (Vidal et al., 2005), matrix factorization methods (Costeira & Kanade, 1998), bottom-up local affinity-based methods such as those proposed by Yan & Pollefeys (2006) and Ma et al. (2007), and convex optimization based approaches including Low Rank Representation (LRR, Liu et al., 2013) and Sparse Subspace Clustering (SSC, Elhamifar & Vidal, 2013).

In this paper we consider the SSC algorithm, which has attracted much interest in the literature. It is known that SSC enjoys superb performance in practice (Elhamifar & Vidal, 2013) and has theoretical guarantees under fairly general conditions (Soltanolkotabi et al., 2012; Wang & Xu, 2013; Soltanolkotabi et al., 2014). Let X = [x_1, …, x_N] ∈ ℝ^{d×N} denote the data matrix, where d is the ambient dimension and N is the number of data points. For noiseless data (i.e., data points lying exactly on low-rank subspaces), the exact SSC algorithm solves

min_{c_i ∈ ℝ^N} ‖c_i‖_1   subject to   x_i = X c_i,  c_{ii} = 0    (1.1)

for each data point x_i to obtain the self-regression solutions c_i. For noisy data, the following Lasso version of SSC is often used in practice:

min_{c_i ∈ ℝ^N} (1/2)‖x_i − X c_i‖_2^2 + λ‖c_i‖_1   subject to   c_{ii} = 0.    (1.2)

The intuition of SSC is to learn a sparse “self-representation” matrix C = [c_1, …, c_N] over all data points by imposing an ℓ1 penalty on the representation coefficients. This can also be thought of as a convex-optimization-based graph embedding that captures a specific type of relationship among data points: each non-zero entry of C is an edge connecting two data points. It has been shown in Soltanolkotabi et al. (2012); Wang & Xu (2013); Soltanolkotabi et al. (2014) that under mild conditions the learned representation/similarity matrix contains no false connections, in that every such edge connects only data points that belong to the same cluster. Finally, spectral clustering (Ng et al., 2002) is performed on the learned similarity matrix to cluster the data points into disjoint clusters.
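For concreteness, the following sketch implements the Lasso self-regression step of Eq. (1.2) for every column of a data matrix. It is a minimal illustration using scikit-learn's Lasso solver rather than the ALM implementation used in our experiments; the function name and the mapping from λ to scikit-learn's alpha are illustrative conventions.

```python
# Minimal sketch of the Lasso SSC self-regression step (Eq. 1.2), using
# scikit-learn's Lasso solver.  Function name and parameter choices are
# illustrative, not the reference implementation used in the experiments.
import numpy as np
from sklearn.linear_model import Lasso

def ssc_self_representation(X, lam=0.1):
    """X: d x N matrix with unit-norm columns; returns the N x N coefficient matrix C."""
    d, N = X.shape
    C = np.zeros((N, N))
    for i in range(N):
        x_i = X[:, i]
        X_rest = np.delete(X, i, axis=1)       # drop x_i itself to enforce c_ii = 0
        # scikit-learn minimizes 1/(2d) * ||x_i - X_rest c||_2^2 + alpha * ||c||_1,
        # so alpha = lam / d matches the objective (1/2)||.||_2^2 + lam * ||c||_1.
        lasso = Lasso(alpha=lam / d, fit_intercept=False, max_iter=10000)
        c = lasso.fit(X_rest, x_i).coef_
        C[np.arange(N) != i, i] = c            # re-insert coefficients, keeping C_ii = 0
    return C
```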

Although success conditions for both exact SSC and Lasso SSC have been extensively analyzed in previous literature, in practice it is inefficient or even infeasible to operate on high-dimensional data, and some form of dimensionality reduction is usually required (Vidal, 2010). In this paper, we propose a theoretical framework that analyzes SSC under many popular dimensionality reduction settings, including:

  • Compressed measurement: With compressed measurements, dimensionality-reduced data are obtained by multiplying the original data by a (typically random Gaussian) measurement matrix. We show that SSC provably succeeds when the projected dimension is on the order of a low-degree polynomial of the maximum intrinsic rank of each subspace.

  • Efficient computation: By using fast Johnson-Lindenstrauss transform (FJLT) (Ailon & Chazelle, 2009) or sketching (Charikar et al., 2004; Clarkson & Woodruff, 2013) one can reduce the data dimension for computational efficiency while still preserving important structures in the underlying data. We prove similar results for both FJLT and sketching.

  • Handling missing data: In many applications, the data matrix may be incomplete due to measurement and sensing limits. It is shown in this paper that, when the data meet certain incoherence criteria, uniform feature sampling suffices for SSC.

  • Data privacy: Privacy is an important concern in modern machine learning applications. It was shown that Gaussian random projection with added Gaussian noise preserves both information-theoretic (Zhou et al., 2009) and differential privacy (Kenthapadi et al., 2013). We provide a utility analysis which shows that SSC can achieve exact subspace detection despite stringent privacy constraints.

The main contribution of this paper is a unified framework for analyzing sparse subspace clustering on dimensionality-reduced data. In particular, we prove that a subspace embedding property (Clarkson & Woodruff, 2013; Avron et al., 2014) is sufficient to guarantee successful execution of the SSC algorithm after compression. Furthermore, the lowest dimension we can compress the data into scales only as a low-degree polynomial of the intrinsic dimension r and does not depend (up to poly-logarithmic factors) on either the ambient dimension d or the total number of data points N. This is a much desired property, because in practical subspace clustering applications both d and N are huge, while the intrinsic dimensionality r typically stays nearly constant (Vidal, 2010; Wang & Xu, 2013). We also show, by simple derivations and citing existing results, that all of the above-mentioned data compression schemes (Gaussian projection, uniform subsampling, FJLT, sketching) are subspace embeddings, and hence fall into the perturbation analysis framework we formulate. Finally, as an application of our analysis, we propose a differentially private SSC algorithm based on random projection followed by Gaussian perturbation and prove both privacy and utility guarantees of the proposed algorithm.
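The subspace embedding property at the heart of this framework is easy to probe numerically. The sketch below, with arbitrary illustrative dimensions of our own choosing, projects points drawn from a union of low-dimensional subspaces with a Gaussian matrix and measures how much pairwise inner products are distorted.

```python
# Illustrative numerical check of the subspace embedding property: a Gaussian
# projection approximately preserves norms and inner products of points lying
# on a union of low-dimensional subspaces.  All sizes below are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
d, r, L, n_per = 2000, 5, 3, 40              # ambient dim, intrinsic dim, #subspaces, points each

X = np.hstack([np.linalg.qr(rng.standard_normal((d, r)))[0]
               @ rng.standard_normal((r, n_per)) for _ in range(L)])
X /= np.linalg.norm(X, axis=0)               # unit-norm columns

p = 50                                       # dimension after compression
Psi = rng.standard_normal((p, d)) / np.sqrt(p)   # entries i.i.d. N(0, 1/p)
PX = Psi @ X

print("max inner-product distortion:", np.abs(X.T @ X - PX.T @ PX).max())
```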

1.1 Problem setup and notations

Notations

For a vector x ∈ ℝ^d, ‖x‖_p denotes the vector p-norm of x. For a matrix A, ‖A‖_p denotes the operator p-norm of A. In particular, ‖A‖_2 is the spectral norm of A. The subscript is sometimes dropped for p = 2; that is, ‖x‖ = ‖x‖_2 and ‖A‖ = ‖A‖_2.

The uncorrupted data matrix is denoted as Y = [y_1, …, y_N] ∈ ℝ^{d×N}, where d is the ambient dimension and N is the total number of data points. Y is normalized so that each column has unit two norm, which does not alter the union of subspaces on which Y lies. Each column in Y belongs to a union of L subspaces S_1 ∪ S_2 ∪ ⋯ ∪ S_L. For each subspace S_ℓ we write Y^{(ℓ)} for all columns belonging to S_ℓ, where N_ℓ is the number of data points in S_ℓ and N = N_1 + ⋯ + N_L. We assume the rank of the ℓ-th subspace is r_ℓ and define r = max_ℓ r_ℓ. In addition, we use U^{(ℓ)} ∈ ℝ^{d×r_ℓ} to represent an orthonormal basis of S_ℓ. The observed matrix is denoted by X. Under the noiseless setting we have X = Y; for the noisy setting we have X = Y + Z, where Z is a noise matrix which can be either deterministic or stochastic.

We use the “−i” subscript to denote all except the i-th column in a data matrix; for example, X_{−i} = [x_1, …, x_{i−1}, x_{i+1}, …, x_N], and similarly for Y^{(ℓ)}_{−i}. For any matrix A, let Q(A) denote the symmetric convex hull spanned by all columns in A. For any subspace S and vector v, denote P_S(v) as the projection of v onto S.

Models

We consider three models for the uncorrupted data Y, of increasing strictness of assumptions. Such a hierarchy of models was first introduced in Soltanolkotabi et al. (2012) and has served as a set of reference models in existing analyses of SSC methods (Soltanolkotabi et al., 2014; Heckel & Bolcskei, 2013; Wang & Xu, 2013; Park et al., 2014):

  • Fully deterministic model: in the fully deterministic model both the underlying low-rank subspaces and the data points in each subspace are deterministically placed. This is the most general model for subspace clustering (except the model-free agnostic settings considered in the projective clustering literature (Feldman et al., 2013)) as no stochastic or i.i.d. type assumptions are imposed on either the subspaces or the data points.

  • Semi-random model: In the semi-random model the underlying subspaces are again deterministically placed; however, (uncorrupted) data points within each subspace are assumed to be sampled i.i.d. uniformly at random from the unit sphere of the corresponding low-dimensional subspace. One advantage of semi-random modeling is its interpretability: success conditions of SSC could be fully characterized using affinities between subspaces and number of data points per subspace (Soltanolkotabi et al., 2012, 2014; Wang & Xu, 2013).

  • Fully-random model: In the fully random model both the underlying subspaces and data points within each subspace are sampled uniformly at random.

Apart from data models, we also consider two models for the noise Z that corrupts the data. In the deterministic noise model, each noise vector z_i is adversarially placed, except that its magnitude is upper bounded by a noise parameter ξ, i.e., max_i ‖z_i‖_2 ≤ ξ. In the stochastic noise model, each z_i is assumed to be i.i.d. sampled from a zero-mean multivariate Gaussian distribution N(0, (σ^2/d)·I_d) for some noise parameter σ. Note that here we divide the variance σ^2 by d to keep the magnitude of the noise constant rather than increasing with the number of dimensions. In our analysis, the stochastic noise model allows for a larger magnitude of noise than the deterministic (adversarial) noise setting, as it places stronger assumptions on the properties of the noise.
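As a concrete illustration of the semi-random data model with stochastic noise, the following sketch generates synthetic data of this form; the dimensions, the generator name, and the random-subspace construction are illustrative choices, not part of the analysis.

```python
# Sketch of a generator for the semi-random model with stochastic noise:
# fixed (here randomly drawn) subspaces, points sampled uniformly from each
# subspace's unit sphere, plus i.i.d. N(0, sigma^2/d) coordinate-wise noise.
import numpy as np

def semi_random_data(d=100, r=4, L=5, n_per=50, sigma=0.1, seed=0):
    rng = np.random.default_rng(seed)
    blocks, labels = [], []
    for ell in range(L):
        U = np.linalg.qr(rng.standard_normal((d, r)))[0]   # orthonormal basis of S_ell
        A = rng.standard_normal((r, n_per))
        A /= np.linalg.norm(A, axis=0)                     # uniform on the unit sphere of S_ell
        blocks.append(U @ A)
        labels += [ell] * n_per
    Y = np.hstack(blocks)                                  # uncorrupted data, unit-norm columns
    Z = (sigma / np.sqrt(d)) * rng.standard_normal(Y.shape)  # coordinate-wise variance sigma^2/d
    return Y + Z, np.array(labels)                         # observed X = Y + Z and ground truth
```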

Methods

The first step is to perform dimensionality reduction on the observation matrix X. More specifically, for a target projection dimension p, the projected observation matrix X′ is obtained by first computing ΨX for some random projection matrix Ψ ∈ ℝ^{p×d} and then normalizing it so that each column in X′ has unit two norm. Afterwards, Lasso self-regression as formulated in Eq. (1.2) is performed for each column in X′ to obtain the similarity matrix C. Spectral clustering is then applied to the symmetrized similarity matrix |C| + |C|^⊤ to obtain an explicit clustering of X. In this paper we use the normalized-cut algorithm (Shi & Malik, 2000) for spectral clustering.
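Putting the pieces together, the following sketch mirrors the procedure just described: Gaussian projection, column normalization, Lasso self-regression (reusing ssc_self_representation from the sketch in Section 1), and spectral clustering on the symmetrized similarity matrix. scikit-learn's SpectralClustering is used here as a stand-in for the normalized-cut implementation, and all parameter values are illustrative.

```python
# End-to-end sketch of the pipeline described above: random projection,
# column normalization, Lasso self-regression, spectral clustering on
# |C| + |C|^T.  SpectralClustering stands in for normalized cuts.
import numpy as np
from sklearn.cluster import SpectralClustering

def compressed_ssc(X, n_clusters, p=50, lam=0.05, seed=0):
    d, N = X.shape
    rng = np.random.default_rng(seed)
    Psi = rng.standard_normal((p, d)) / np.sqrt(p)   # Gaussian projection matrix
    Xp = Psi @ X
    Xp /= np.linalg.norm(Xp, axis=0)                 # re-normalize columns
    C = ssc_self_representation(Xp, lam=lam)         # Lasso SSC (earlier sketch)
    W = np.abs(C) + np.abs(C).T                      # symmetrized similarity matrix
    sc = SpectralClustering(n_clusters=n_clusters, affinity='precomputed',
                            random_state=seed)
    return sc.fit_predict(W)

# Example usage with the synthetic generator above:
#   X, labels = semi_random_data(); pred = compressed_ssc(X, n_clusters=5)
```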

Evaluation measures

To evaluate the quality of the obtained similarity matrix C, we consider the Lasso subspace detection property defined in Wang & Xu (2013). More specifically, C satisfies the Subspace Detection Property (SDP) if for each i the following holds: 1) c_i is a non-trivial solution, that is, c_i is not the zero vector; 2) if c_{ij} ≠ 0 then data points x_i and x_j belong to the same subspace cluster. The second condition alone is referred to as the “Self-Expressiveness Property” (SEP) in Elhamifar & Vidal (2013). Note that we do not require c_{ij} ≠ 0 for every pair of points belonging to the same cluster. We also remark that in general SEP is not necessary for spectral clustering to succeed, cf. Wang & Xu (2013); it is, however, almost sufficient for perfect clustering both in practice (Elhamifar & Vidal, 2013) and in theory (Wang et al., 2016).

1.2 Related Work

Heckel et al. (2014) analyzed SSC and thresholding-based subspace clustering (TSC, Heckel & Bolcskei, 2013) on dimensionality-reduced noiseless data. Heckel et al. (2015), which appeared on arXiv concurrently with the earlier ICML version of our paper, further generalized Heckel et al. (2014) with an analysis of TSC and SSC-OMP on noisy data. One important limitation of both Heckel et al. (2014, 2015) is that the data points in each subspace are assumed to be drawn uniformly at random, corresponding to the semi-random model specified in Sec. 1.1. Though amenable to theoretical analysis, the semi-random model deviates significantly from subspace clustering practice because data points, even on the same subspace, are not distributed uniformly at random. This is particularly true when data lie on affine subspaces and the “homogeneous embedding” trick (appending a constant 1 to every data point) is used, which is almost always the case in practice (Elhamifar & Vidal, 2013): the added dimension in homogeneous embedding is constant valued and breaks any semi-random assumption. In this paper, we complement the results of Heckel et al. (2015) by analyzing noisy SSC under the fully deterministic model, where no stochastic assumptions are imposed on the data points. Our analysis naturally leads to interpretable noise conditions under the semi-random model.

Our proof technique also differs significantly from the one in Heckel et al. (2015), which focused primarily on perturbation of subspace affinities (Soltanolkotabi et al., 2012). When data points are not uniformly distributed, subspace affinity as defined by Soltanolkotabi et al. (2012); Heckel et al. (2015) no longer serves as a good characterization of the difficulty of the subspace clustering problem. Instead, we propose a novel perturbation analysis of the dual solution of the noisy SSC formulation, which is applicable under the fully deterministic setting. Finally, we remark that an earlier conference version of this paper (Wang et al., 2015a), which summarized most parts of the deterministic analysis in this paper (including the analysis for noisy data), was published before Heckel et al. (2015). Another difference, as noted in Heckel et al. (2015), is that in our analysis the projected noise is added after normalization of the projected signal. We consider this to be a minor difference because the length of the projected signal is close to the length of the original signal, thanks to Proposition 2.3. In particular, for semi-random or fully-random models the length of each data point is very close to one with high probability, and hence the noisy model we analyze behaves similarly to the one considered in Heckel et al. (2015).

Arpit et al. (2014) proposed a novel dimensionality reduction algorithm to preserve independent subspace structures. They showed that with a target dimension on the order of the number of subspaces L one can preserve the independence structure among subspaces. However, their analysis only applies to noiseless and independent subspaces, while our analysis applies even when the smallest principal angle between two subspaces diminishes, and it can tolerate a fair amount of noise. Furthermore, in our analysis the required target dimension depends on the maximum intrinsic subspace dimension r instead of L, and r is usually quite small in practice (Elhamifar & Vidal, 2013; Basri & Jacobs, 2003).

Another relevant line of research is high-rank matrix completion. In Eriksson et al. (2012) the authors proposed a neighborhood-selection-based algorithm to solve multiple matrix completion problems. However, Eriksson et al. (2012) requires an exponential number of data points to effectively recover the underlying subspaces, whereas in our analysis the number of data points only needs to scale polynomially with the intrinsic dimension. In addition, strong distributional assumptions are imposed in Eriksson et al. (2012) to ensure that data points within the same subspace lie close to each other, while our analysis is applicable to the fully general deterministic setting where no such distributional properties are required.

2 Dimension reduction methods

In this section we review several popular dimensionality reduction methods and show that they are subspace embeddings. To keep the presentation simple, proofs of results in this section are presented in Appendix A. A random linear projection Ψ ∈ ℝ^{p×d} is said to be a subspace embedding with respect to an r-dimensional subspace S ⊆ ℝ^d, with parameters ε and δ, if with probability at least 1 − δ the following holds for all x ∈ S:

(1 − ε)‖x‖_2^2 ≤ ‖Ψx‖_2^2 ≤ (1 + ε)‖x‖_2^2.    (2.1)

The following proposition is a simple property of subspace embeddings.

Proposition 2.1.

Fix ε ∈ (0, 1/2) and δ ∈ (0, 1). Suppose Ψ is a subspace embedding with respect to a union of subspaces containing the columns x_1, …, x_N, with parameters ε and δ. Then with probability at least 1 − δ, for all i we have

|‖Ψx_i‖_2^2 − ‖x_i‖_2^2| ≤ ε‖x_i‖_2^2;    (2.2)

furthermore, for all i ≠ j the following holds:

|⟨Ψx_i, Ψx_j⟩ − ⟨x_i, x_j⟩| ≤ ε‖x_i‖_2‖x_j‖_2.    (2.3)

Random Gaussian projection

In a random Gaussian projection matrix Ψ ∈ ℝ^{p×d}, each entry Ψ_{ij} is generated i.i.d. from the Gaussian distribution N(0, 1/p), where p is the target dimension after projection. Using standard Gaussian tail bounds and a Johnson-Lindenstrauss-type argument we obtain the following proposition.

Proposition 2.2.

A random Gaussian matrix Ψ is a subspace embedding with respect to S_1 ∪ ⋯ ∪ S_L if the target dimension p satisfies

(2.4)

Uniform row sampling

For uniform row sampling, rows of the observed data matrix are sampled independently at random so that the resulting matrix has p non-zero rows. Formally speaking, each row of the projection matrix Ψ is sampled i.i.d. from the distribution that places probability 1/d on each of the vectors sqrt(d/p)·e_1, …, sqrt(d/p)·e_d, where e_i is a d-dimensional indicator vector with only the i-th entry nonzero (the sqrt(d/p) rescaling keeps column norms unbiased).

For uniform row sampling to work, both the observation matrix X and the column space of the uncorrupted data matrix Y should satisfy certain incoherence conditions. In this paper, we apply the following two types of incoherence/spikiness definitions, which are widely used in the low-rank matrix completion literature (Recht, 2011; Balzano et al., 2010; Krishnamurthy & Singh, 2014).

Definition 2.1 (Column space incoherence).

Suppose U is the column space of some matrix with r = dim(U), and let U ∈ ℝ^{d×r} be an orthonormal basis of U. (We require both the column space of the uncorrupted data and the individual observed columns to be incoherent/non-spiky because the noise vectors may not belong to the incoherent subspace U.) The incoherence of U is defined as

μ(U) = (d/r) · max_{1 ≤ i ≤ d} ‖U_{(i)}‖_2^2,    (2.5)

where U_{(i)} indicates the i-th row of U.

Definition 2.2 (Column spikiness).

For a vector x ∈ ℝ^d, the spikiness of x is defined as

μ(x) = d · ‖x‖_∞^2 / ‖x‖_2^2,    (2.6)

where ‖·‖_∞ denotes the vector infinity norm.

Under these two conditions, the uniform row sampling operator is a subspace embedding.

Proposition 2.3.

Suppose μ(U(Y)) ≤ μ_0 and μ(x_i) ≤ μ_0 for all i, for some constant μ_0. The uniform sampling operator is a subspace embedding with respect to S_1 ∪ ⋯ ∪ S_L if

(2.7)
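The sketch below shows how the two quantities in Definitions 2.1 and 2.2 and the uniform row-sampling operator can be computed; the sqrt(d/p) rescaling of sampled rows and the helper names are illustrative conventions.

```python
# Sketch of column-space incoherence (Def. 2.1), column spikiness (Def. 2.2),
# and the uniform row-sampling operator.  The sqrt(d/p) rescaling is a common
# convention that keeps norms unbiased; helper names are illustrative.
import numpy as np

def column_space_incoherence(Y):
    """(d/r) * max_i ||U_(i)||_2^2 for an orthonormal basis U of range(Y)."""
    U, s, _ = np.linalg.svd(Y, full_matrices=False)
    U = U[:, s > 1e-10 * s.max()]                    # orthonormal basis of the column space
    d, r = U.shape
    return (d / r) * (np.linalg.norm(U, axis=1) ** 2).max()

def column_spikiness(x):
    """d * ||x||_inf^2 / ||x||_2^2 for a single column x."""
    return x.size * np.abs(x).max() ** 2 / np.linalg.norm(x) ** 2

def uniform_row_sampling(X, p, seed=0):
    """Keep p uniformly sampled coordinates (rows) of every column of X."""
    rng = np.random.default_rng(seed)
    rows = rng.choice(X.shape[0], size=p, replace=True)   # rows sampled i.i.d. uniformly
    return np.sqrt(X.shape[0] / p) * X[rows, :]
```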

FJLT and sketching

The Fast Johnson-Lindenstrauss Transform (FJLT, Ailon & Chazelle, 2009) computes a compressed version of a data matrix using O(d log d) operations per column (instead of O(pd)) with high probability. The projection matrix can be written as Ψ = PHD, where P is a sparse JL matrix, H is a deterministic Walsh-Hadamard matrix and D is a random diagonal matrix. Details of FJLT can be found in Ailon & Chazelle (2009).

Sketching (Charikar et al., 2004; Clarkson & Woodruff, 2013) is another powerful tool for dimensionality reduction on sparse inputs. The sketching operator is constructed as Ψ = ΦD, where Φ ∈ {0,1}^{p×d} is a random hashing matrix with exactly one nonzero entry per column (placed at a uniformly random row) and D is a random sign diagonal matrix. The projected vector Ψx can be computed in O(nnz(x)) time, where nnz(x) is the number of nonzero entries in x.
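The following sketch implements this kind of hashing-based operator (in the style of CountSketch) for a single vector; it is a generic illustration rather than the exact construction analyzed in the propositions below.

```python
# Generic CountSketch-style sketching of a vector: each coordinate is hashed
# to one of p buckets and multiplied by a random sign, in O(nnz(x)) time.
import numpy as np

def countsketch(x, p, seed=0):
    rng = np.random.default_rng(seed)
    d = x.size
    buckets = rng.integers(0, p, size=d)      # hash h: [d] -> [p]
    signs = rng.choice([-1.0, 1.0], size=d)   # random sign diagonal D
    out = np.zeros(p)
    np.add.at(out, buckets, signs * x)        # out[h(i)] += s(i) * x[i]
    return out
```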

The following two propositions show that both FJLT and sketching are subspace embeddings, meaning that with high probability the inner product and norm of any two vectors on a low-dimensional subspace are preserved uniformly. In fact, they are oblivious in the sense that they work for any low-dimensional subspace .

Proposition 2.4 (Clarkson & Woodruff, 2013).

The FJLT operator is an oblivious subspace embedding if , with considered as a constant.

Proposition 2.5 (Avron et al., 2014).

The sketching operator is an oblivious subspace embedding if .

2.1 Simulations

Figure 1: Relative SEP violation (left) and clustering error (right) for Lasso SSC on the Hopkins-155 dataset. The rightmost two columns in the left figure indicate trivial solutions. White indicates a good similarity graph or clustering and black indicates a poor one.
Figure 2: Comparison of clustering error (left) and relative SEP violation (right) for Lasso SSC, TSC and LRR on the Hopkins-155 dataset.

To gain some intuition into the performance of the SSC algorithm on dimensionality-reduced data, we report empirical results on Hopkins-155, a motion segmentation data set that is specifically designed to serve as a benchmark for subspace clustering algorithms (Tron & Vidal, 2007). The ambient dimension in the data set ranges from 112 to 240, and we compress the data points into dimension p using random Gaussian projection, with p taking values from 5 to 50. We compare Lasso SSC with TSC (Heckel & Bolcskei, 2013) and LRR (Liu et al., 2013). The Lasso SSC algorithm is implemented using the augmented Lagrangian method (ALM, Bertsekas, 2014). The LRR implementation is obtained from Liu (2013). All algorithms are implemented in Matlab.

For evaluation, we report both the clustering error and the relative violation of the Self-Expressiveness Property (SEP). Clustering error is defined as the percentage of mis-clustered data points, up to permutation of the cluster labels. The relative violation of SEP characterizes how much the obtained similarity matrix C violates the self-expressiveness property. It was introduced in Wang & Xu (2013) and is defined as

RelViolation(C, M) = ( Σ_{(i,j)∉M} |C|_{ij} ) / ( Σ_{(i,j)∈M} |C|_{ij} ),    (2.8)

where (i, j) ∈ M means x_i and x_j belong to the same cluster and (i, j) ∉ M means they do not.
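A sketch of both evaluation measures is given below; the Hungarian-matching implementation of the clustering error and the function names are illustrative choices.

```python
# Sketch of the two evaluation measures: relative SEP violation (Eq. 2.8) and
# clustering error (fraction mis-clustered, minimized over label matchings).
import numpy as np
from scipy.optimize import linear_sum_assignment

def rel_violation(C, labels):
    same = labels[:, None] == labels[None, :]      # mask M of same-cluster pairs
    A = np.abs(C)
    return A[~same].sum() / A[same].sum()

def clustering_error(pred, labels):
    k = int(max(pred.max(), labels.max())) + 1
    overlap = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            overlap[i, j] = np.sum((pred == i) & (labels == j))
    row, col = linear_sum_assignment(-overlap)     # best one-to-one label matching
    return 1.0 - overlap[row, col].sum() / labels.size
```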

Figure 1 shows that the relative SEP violation of Lasso SSC goes up when the projection dimension p decreases, or when the regularization parameter λ is too large or too small; the clustering accuracy behaves accordingly. In addition, in Figure 2 we report the clustering error and relative SEP violation for Lasso SSC, TSC and LRR on Hopkins-155. Both clustering error and relative SEP violation are averaged across all 155 sequences. Figure 2 also indicates that Lasso SSC outperforms TSC and LRR under various regularization and projection dimension settings, which is consistent with previous experimental results (Elhamifar & Vidal, 2013).

3 Main results

We present general geometric success conditions for Lasso SSC on dimensionality-reduced data under the fully deterministic setting, where both the underlying low-dimensional subspaces and the data points on each subspace are placed deterministically. We first describe the result for the noiseless case and then extend it to handle a small amount of adversarial perturbation or a much larger amount of stochastic noise. In addition, implications of our success conditions under the much stronger semi-random and fully random models are discussed. The basic idea common to all of the upcoming results is to show that the subspace incoherence and inradius (both are key quantities appearing in analyses of sparse subspace clustering algorithms and will be defined later), and therefore the geometric gap between them, are approximately preserved under dimensionality reduction.

3.1 Deterministic model: the noiseless case

We consider first the noiseless case where X = Y. We begin our analysis with two key concepts introduced in the seminal work of Soltanolkotabi et al. (2012): subspace incoherence and inradius. Subspace incoherence characterizes how well the subspaces associated with different clusters are separated. It is based on the dual direction of the optimization problems in Eq. (1.1) and (1.2), which is defined as follows:

Definition 3.1 (Dual direction, Soltanolkotabi et al., 2012; Wang & Xu, 2013).

Fix a column x_i of X belonging to subspace S_ℓ. Its dual direction ν_i is defined as the solution to the following dual optimization problem (for exact SSC, simply set λ = 0):

ν_i = argmax_ν ⟨x_i, ν⟩ − (λ/2)‖ν‖_2^2   subject to   ‖X_{−i}^⊤ ν‖_∞ ≤ 1.    (3.1)

Note that Eq. (3.1) has a unique solution when λ > 0.

The subspace incoherence μ_ℓ for S_ℓ is defined in Eq. (3.2). Note that it is not related to the column space incoherence defined in Eq. (2.5). The smaller μ_ℓ is, the further S_ℓ is separated from the other subspaces.

Definition 3.2 (Subspace incoherence, Soltanolkotabi et al. (2012); Wang & Xu (2013)).

Subspace incoherence for subspace S_ℓ is defined as

μ_ℓ = max_{x ∈ X ∖ X^{(ℓ)}} ‖V^{(ℓ)⊤} x‖_∞,    (3.2)

where V^{(ℓ)} = [v_1^{(ℓ)}, …, v_{N_ℓ}^{(ℓ)}] and v_i^{(ℓ)} = P_{S_ℓ}(ν_i^{(ℓ)}); here ν_i^{(ℓ)} is the dual direction of x_i^{(ℓ)} defined in Eq. (3.1) and S_ℓ is the low-dimensional subspace on which x_i^{(ℓ)} lies.

The concept of inradius characterizes how well data points are distributed within a single subspace. More specifically, we have the following definition:

Definition 3.3 (Inradius, Soltanolkotabi et al., 2012).

For subspace S_ℓ, its inradius is defined as

ρ_ℓ = min_{i : x_i ∈ X^{(ℓ)}} r( Q(X^{(ℓ)}_{−i}) ),    (3.3)

where r(·) denotes the radius of the largest ball inscribed in a convex body.

The larger ρ_ℓ is, the more uniformly data points are distributed in the ℓ-th subspace. We also remark that both μ_ℓ and ρ_ℓ are between 0 and 1 because of normalization.

With the characterization of subspace incoherence and inradius in place, we are now ready to present our main result, which states a sufficient success condition for Lasso SSC on dimensionality-reduced noiseless data under the fully deterministic setting.

Theorem 3.1 (Compressed SSC on noiseless data).

Suppose X = Y is a noiseless input matrix with subspace incoherences {μ_ℓ} and inradii {ρ_ℓ}. Assume μ_ℓ < ρ_ℓ for all ℓ. Let X′ be the normalized data matrix after compression, and assume the regularization parameter λ lies in the admissible range specified in the analysis in Sec. 5.1. If Ψ satisfies Eq. (2.2, 2.3) with parameter ε, then Lasso SSC satisfies the subspace detection property if ε is upper bounded by

(3.4)

where Δ = min_ℓ (ρ_ℓ − μ_ℓ) is the minimum gap between subspace incoherence and inradius over all subspaces.

We make several remarks on Theorem 3.1. First, an upper bound on ε implies a lower bound on the projection dimension p, and the exact values vary for different data compression schemes; for example, if Gaussian random projection is used then the required accuracy ε translates into a bound on p through Proposition 2.2. In addition, even for noiseless data the regularization coefficient λ cannot be too small if projection error is present (recall that λ = 0 corresponds to the exact SSC formulation). This is because when λ goes to zero the strong convexity of the dual optimization problem decreases. As a result, a small perturbation of the data could result in drastic changes of the dual direction, and Lemma 5.1 fails subsequently. On the other hand, as λ increases the similarity graph connectivity decreases because the optimal solution to Eq. (1.2) becomes sparser. To guarantee that the obtained solution is nontrivial (i.e., at least one nonzero entry in c_i), λ must not exceed the largest absolute inner product between x_i and the remaining (projected, normalized) data points.
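The last remark corresponds to a standard fact about the Lasso under the parametrization in Eq. (1.2); the small sketch below computes the largest λ that still yields a nontrivial solution for a given column.

```python
# Largest regularization value giving a nontrivial solution for column i
# under the objective (1/2)||x_i - X c||_2^2 + lambda*||c||_1 of Eq. (1.2):
# a standard Lasso fact, shown here for illustration.
import numpy as np

def lambda_max(X, i):
    x_i = X[:, i]
    X_rest = np.delete(X, i, axis=1)
    return np.abs(X_rest.T @ x_i).max()   # any lambda above this gives c_i = 0
```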

3.2 Deterministic model: the noisy case

When data are corrupted by either adversarial or stochastic noise, we can still hope to obtain success conditions for Lasso SSC provided that the magnitude of noise is upper bounded. The success conditions can again be stated using the concepts of subspace incoherence and inradius. Although the definition of subspace incoherence in Eq. (3.2) remains unchanged (i.e., it is defined in terms of the noisy data X), the definition of inradius needs to be slightly adjusted under the noisy setting as follows:

Definition 3.4 (Inradius for noisy SSC, Wang & Xu (2013)).

Let X = Y + Z, where Y is the uncorrupted data, Z is the noise matrix and X is the observation matrix. For subspace S_ℓ, its inradius is defined as

ρ_ℓ = min_{i : y_i ∈ Y^{(ℓ)}} r( Q(Y^{(ℓ)}_{−i}) ),    (3.5)

where r(·) denotes the radius of the largest ball inscribed in a convex body.

As a remark, under the noiseless setting we have X = Y, and Definition 3.4 reduces to the definition of inradius for noiseless data.

Finally, we have Theorem 3.2 and Theorem 3.3.

Theorem 3.2 (Compressed-SSC under deterministic noise).

Suppose X = Y + Z is a noisy input matrix with subspace incoherences {μ_ℓ} and inradii {ρ_ℓ}. Assume μ_ℓ < ρ_ℓ for all ℓ and max_i ‖z_i‖_2 ≤ ξ. Suppose X′ = Y′ + Z′, where Y′ is the normalized uncorrupted data matrix after compression and Z′ is the projected noise matrix. If Ψ satisfies Eq. (2.2, 2.3) with parameter ε and λ satisfies

(3.6)

then Lasso SSC satisfies the subspace detection property if the approximation error ε and the noise magnitude ξ satisfy

(3.7)

Here Δ = min_ℓ (ρ_ℓ − μ_ℓ) is the minimum gap between subspace incoherence and inradius.

Theorem 3.3 (Compressed-SSC under stochastic noise).

Suppose X = Y + Z is a noisy input matrix with subspace incoherences {μ_ℓ} and inradii {ρ_ℓ}. Assume the entries of Z are i.i.d. Gaussian with variance σ^2/d for some noise magnitude parameter σ. Suppose X′ = Y′ + Z′, where Y′ is the normalized uncorrupted data matrix after compression and Z′ is the projected noise matrix. If the spectral norm of Ψ is upper bounded by a constant multiple of sqrt(d/p), Ψ satisfies Eq. (2.2, 2.3) with parameter ε, and λ satisfies

(3.8)

for some universal constants, then Lasso SSC satisfies the subspace detection property if the approximation error ε and the noise level σ satisfy

(3.9)

where Δ = min_ℓ (ρ_ℓ − μ_ℓ) is the minimum gap between subspace incoherence and inradius.

Before we proceed, some clarification on the ambient data dimension d is needed. Eq. (3.9) seems to suggest that the tolerable noise variance increases with d. While this is true if the noise magnitude is measured in terms of the ℓ2 norm ‖z_i‖_2, we remark that the coordinate-wise noise variance does not increase with d, because each coordinate of z_i is a Gaussian random variable with variance σ^2/d.

These results put forward an interesting view of the subspace clustering problem in terms of resource allocation. The critical geometric gap Δ (called “Margin of Error” in Wang & Xu (2013)) can be viewed as the amount of resource that we have for a problem while preserving the subspace detection property. It can be used to tolerate noise, compress the data matrix, or alleviate the graph connectivity problem of SSC (Wang et al., 2013). With a larger geometric gap Δ, the approximation error from dimensionality reduction can be tolerated at a larger level, which implies that the original data can be compressed more aggressively, to a smaller dimension after compression, without losing the SDP property of sparse subspace clustering. The results also demonstrate trade-offs between noise tolerance and dimensionality reduction. For example, at a higher level of noise (larger ξ or σ) the regularization parameter λ must be set at a lower level, according to the conditions in Eqs. (3.6, 3.8); subsequently, the approximation error in dimensionality reduction needs to be smaller to ensure success of SSC on the projected data, which places a higher lower bound on the dimension one can compress the original data into.

3.3 Semi-random and Fully-random models

In this section we consider random data models, where either the data points or the underlying low-dimensional subspaces are assumed to be drawn i.i.d. from a uniform distribution. Under the semi-random model the underlying subspaces S_1, …, S_L are still assumed to be fixed but unknown; however, we place stochastic conditions on the data points by assuming that each data point is drawn uniformly at random from the unit sphere of the corresponding low-dimensional subspace. This assumption makes the success conditions of Lasso SSC more transparent, as they now depend only on the number of data points per subspace and the affinity between different subspaces, which is formally defined as follows:

Definition 3.5 (Normalized subspace affinity, Soltanolkotabi et al., 2012; Wang & Xu, 2013).

For two subspaces S_k and S_ℓ of intrinsic dimension r_k and r_ℓ, the affinity between S_k and S_ℓ is defined as

aff(S_k, S_ℓ) = sqrt( (cos^2 θ_1 + ⋯ + cos^2 θ_{min(r_k, r_ℓ)}) / min(r_k, r_ℓ) ),    (3.10)

where θ_1, …, θ_{min(r_k, r_ℓ)} are the canonical angles between S_k and S_ℓ. Note that aff(S_k, S_ℓ) is always between 0 and 1, with a smaller value indicating that one subspace is further apart from the other.
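Since the cosines of the canonical angles are the singular values of U^{(k)⊤} U^{(ℓ)} for orthonormal bases U^{(k)}, U^{(ℓ)}, the affinity is straightforward to compute; a minimal sketch follows.

```python
# Normalized subspace affinity of Definition 3.5: the canonical-angle cosines
# are the singular values of U1^T U2 for orthonormal bases U1 (d x r1), U2 (d x r2).
import numpy as np

def subspace_affinity(U1, U2):
    cosines = np.linalg.svd(U1.T @ U2, compute_uv=False)
    return np.sqrt(np.sum(cosines ** 2) / min(U1.shape[1], U2.shape[1]))
```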

We are now able to state our main theorem on success conditions of dimensionality-reduced noisy SSC under the semi-random model.

Theorem 3.4 (Compressed SSC under Semi-random Model).

Suppose X = Y + Z is a noisy input matrix with the columns of each Y^{(ℓ)} sampled uniformly at random from the unit sphere of S_ℓ, and with the entries of Z i.i.d. Gaussian with variance σ^2/d for some noise magnitude parameter σ. Assume in addition that each subspace contains sufficiently many data points relative to its intrinsic dimension and that the subspace affinities satisfy

(3.11)

Suppose X′ = Y′ + Z′, where Y′ is the normalized uncorrupted data matrix after compression and Z′ is the projected noise matrix. If the spectral norm of Ψ is upper bounded by a constant multiple of sqrt(d/p), Ψ satisfies Eq. (2.2, 2.3) with parameter ε and λ is chosen appropriately, then Lasso SSC satisfies the subspace detection property with high probability if the approximation error ε and the noise level σ satisfy

(3.12)

Furthermore, as a corollary, if Ψ is a random Gaussian projection matrix (with i.i.d. N(0, 1/p) entries), where p is the dimension after projection, then Lasso SSC satisfies the subspace detection property with high probability if the noise magnitude σ satisfies the constraint in Eq. (3.12) and the projected dimension is at least

(3.13)

As a remark, Theorem 3.4 shows that using the Gaussian random projection operator, the dimension after projection p depends polynomially on the intrinsic dimension r. Such dependency is unavoidable, as one cannot hope to compress the data to the point that the data dimension after compression is smaller than the intrinsic dimension. On the other hand, our analysis shows that p depends only poly-logarithmically on both the number of subspaces L and the number of data points N. Such dependency is significantly better than treating the entire union-of-subspaces model as one agglomerate low-rank model, which would require a dimension after compression on the order of rL. However, we conjecture that the polynomial dependence on r in Eq. (3.13), which comes from our argument for strong convexity of the dual problem of SSC, is loose and could be improved to a lower-order polynomial. Finally, we note that p needs to increase with N, the total number of data points, which might seem counter-intuitive. This is because success of the SSC algorithm is defined in terms of the self-expressiveness property over all data points, which becomes more difficult to satisfy with more data points.

We next turn to the fully-random model, where not only the data points but also the underlying subspaces are assumed to be drawn i.i.d. uniformly at random. In this even simpler model, we have the following theorem that directly follows from Theorem 3.4:

Theorem 3.5 (Compressed SSC under fully random model).

With the same notation and conditions as in Theorem 3.4, except that the condition on subspace affinity, Eq. (3.11), is replaced by the following condition, which involves only the number of data points, the ambient dimension and the ratio of data points to intrinsic dimension per subspace:

(3.14)

Then, under the random Gaussian projection operator, Lasso SSC succeeds with high probability if the noise level satisfies the constraint of Theorem 3.4 and the dimension after compression satisfies

(3.15)

4 Application to privacy preserving subspace clustering

Dimensionality reduction is useful in many practical applications involving subspace clustering, as explained in the introduction. In this section, we discuss one particular motivation of compressing data before data analysis in order to protect data privacy. The privacy issue of subspace clustering has received research attention recently (Wang et al., 2015b), as it is applied to sensitive data sets such as medical/genetic or movie recommendation data (Zhang et al., 2012; McWilliams & Montana, 2014). Nevertheless, there has been no prior work on formally establishing privacy claims for sparse subspace clustering algorithms.

In this section we investigate differentially private sparse subspace clustering under a random projection dimension reduction model. This form of privacy protection is called “matrix masking” and has a long history in statistical privacy and disclosure control (see, e.g., Duncan et al., 1991; Willenborg & De Waal, 1996; Hundepool et al., 2012). It has been formally shown more recently that random projections (at least with Gaussian random matrices) protect information privacy (Zhou et al., 2009). Stronger privacy protection can be enforced by injecting additional noise into the dimension-reduced data (Kenthapadi et al., 2013). Algorithmically, this basically involves adding i.i.d. Gaussian noise to the data after we apply a Johnson-Lindenstrauss transform of choice to X and normalize every column. This procedure guarantees differential privacy (Dwork et al., 2006; Dwork, 2006) at the attribute level, which prevents any single entry of the data matrix from being identified “for sure” given the privatized data and arbitrary side information. The amount of noise to add is calibrated according to how “unsure” we need to be and how “spiky” (Definition 2.2) each data point can be. We show in Sec. 4.1 and 4.2 that the proposed variant of the SSC algorithm achieves perfect clustering with high probability while subject to formal privacy constraints. We also prove in Sec. 4.3 that a stronger user-level privacy constraint cannot be attained when perfect clustering of each data point is required. Wang et al. (2015b) discussed alternative solutions to this dilemma by weakening the utility claims.

4.1 Privacy claims

In the classic statistical privacy literature, transforming a data set X by taking ΨX + E for some random matrices Ψ and E is called matrix masking. Zhou et al. (2009) show that random compression makes the mutual information between the output and the raw data converge to 0 (at a rate quantified in their paper) even when the compressed dimension is much smaller than the ambient dimension, and their result directly applies to our problem. The guarantee suggests that the amount of information in the compressed output about the raw data goes to 0 as the ambient dimension gets large.

On the other hand, if E is an i.i.d. Gaussian noise matrix, we can protect the (ε, δ)-differential privacy of every data entry. More specifically, we view the matrix X as a data collection of N users, each corresponding to a column in X associated with a d-dimensional attribute vector. Each entry in X then corresponds to a particular attribute of a specific user. The formal definition of the attribute differential privacy notion is given below:

Definition 4.1 (Attribute Differential Privacy).

Suppose 𝒪 is the set of all possible outcomes. We say a randomized algorithm 𝒜 is (ε, δ)-differentially private at the attribute level if

Pr[𝒜(X) ∈ S] ≤ e^ε · Pr[𝒜(X′) ∈ S] + δ

for any measurable outcome S ⊆ 𝒪 and any X, X′ that differ in only one entry.

This is a well-studied setting in (Kenthapadi et al., 2013). It is weaker than protecting the privacy of individual users (i.e., entire columns in ), but much stronger than the average protection via mutual information. In fact, it forbids any feature of an individual user from being identified “for sure” by an adversary with arbitrary side information.

Theorem 4.1.

Assume the data X (and the data of all other users that we need to protect) satisfy the column spikiness condition with parameter μ_0 as in Definition 2.2. Let Ψ be a Johnson-Lindenstrauss transform with parameter ε. Releasing the compressed data X̃ = ΨX + E, with the entries of E i.i.d. Gaussian with standard deviation calibrated to μ_0, ε and δ as specified in the proof (Sec. 5.6), preserves attribute-level (ε, δ)-differential privacy.

Note that in Theorem 4.1 the exact value of the spikiness parameter is not necessary: an upper bound would be sufficient, which results in a more conservative differentially private procedure.

The proof involves working out the ℓ2-sensitivity of the operator X ↦ ΨX in terms of column spikiness and applying the “Gaussian mechanism” (Dwork, 2006; Dwork & Roth, 2013). We defer the proof to Sec. 5.6. Note that differential privacy is closed under “post-processing”, meaning that any procedure applied to the released data does not change the privacy level. Therefore, applying SSC to the released, noise-injected data remains an (ε, δ)-differentially private procedure.
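A minimal sketch of the release mechanism is given below. It applies a Gaussian JL transform, normalizes columns, and adds Gaussian noise scaled by the standard Gaussian-mechanism formula; the sensitivity argument `delta2` is a placeholder supplied by the caller, which in our setting would be derived from the column spikiness bound rather than chosen by hand.

```python
# Sketch of the private release mechanism of Sec. 4: JL-compress, normalize
# columns, add i.i.d. Gaussian noise via the standard Gaussian mechanism
# (Dwork & Roth, 2013).  `delta2` is an assumed sensitivity bound supplied by
# the caller; the paper calibrates it via column spikiness.
import numpy as np

def private_release(X, p, eps, delta, delta2, seed=0):
    rng = np.random.default_rng(seed)
    d, N = X.shape
    Psi = rng.standard_normal((p, d)) / np.sqrt(p)            # Gaussian JL transform
    Xp = Psi @ X
    Xp /= np.linalg.norm(Xp, axis=0)                          # column normalization
    sigma = delta2 * np.sqrt(2.0 * np.log(1.25 / delta)) / eps  # Gaussian mechanism scale
    return Xp + sigma * rng.standard_normal(Xp.shape)         # privatized output
```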

4.2 Utility claims

We show that if the column spikiness is bounded by a constant, Lasso-SSC is able to provably detect the correct subspace structures despite the privacy constraints.

Corollary 4.1.

Let the raw data X be compressed and privatized using the above-described mechanism to obtain X̃. Assume the same set of notation and assumptions as in Theorem 3.1, and suppose Ψ is a JL transform with parameter ε. If the privacy parameter satisfies

(4.1)

then the solution to Lasso-SSC computed on X̃ obeys the subspace detection property with high probability.

Corollary 4.1 should be interpreted with care: we place a lower bound condition on the privacy parameter ε, which might sound strange because ε is usually specified by users, but such a lower bound condition is only required to establish the utility guarantee. The algorithm itself is always (ε, δ)-differentially private regardless of the value of ε, as stated in Theorem 4.1. On the other hand, if the original data are sufficiently incoherent (i.e., the spikiness parameter is bounded by a constant), the right-hand side of Eq. (4.1) quickly approaches zero as the ambient dimension increases. As a result, the lower bound in Eq. (4.1) is a very mild condition on an incoherent data matrix X.

The proof idea of Corollary 4.1 is simple. We are injecting artificial Gaussian noise into a compressed subspace clustering problem with fixed input, and Theorem 3.3 directly addresses that setting. All we have to do is to replace the geometric quantities (subspace incoherence and inradius) by their respective bounds after compression, given in Corollary 5.1 and Lemma 5.2. The complete proof is deferred to Sec. 5.7.

Before proceeding to the impossibility results, we make some remarks on the condition on the privacy parameter ε. It can be seen that a smaller privacy parameter (i.e., a higher degree of privacy protection) is possible on data sets with a larger “geometric gap”, which makes subspace clustering relatively easy to succeed on such data sets. In addition, incoherence helps with privacy preservation, as ε can be set at a smaller level on more incoherent data.

4.3 Impossibility results

As we described in the main results, attribute-level differential privacy is a much weaker notion of privacy. A stronger privacy notion is user-level differential privacy, where two neighboring databases X and X′ differ by a column rather than an entry, and hence the privacy of the entire attribute vector of each user in X is protected. However, we show in this section that user-level differential privacy cannot be achieved if utility is measured in terms of the (perfect) subspace detection property.

We first give a formal definition of user-level differential privacy:

Definition 4.2 (User-Level Differential Privacy).

We say a randomized algorithm 𝒜 is (ε, δ)-differentially private at the user level if

Pr[𝒜(X) ∈ S] ≤ e^ε · Pr[𝒜(X′) ∈ S] + δ

for any measurable outcome S and any X, X′ that differ in only one column.

Compared with the attribute differential privacy defined in Definition 4.1, the only difference is in how X and X′ may differ. Note that we can arbitrarily replace any single data point (column) in X with any other vector to form X′.

The following proposition shows that user-level differential privacy cannot be preserved if perfect subspace detection or clustering is desired. Its proof is placed in Sec. 5.8.

Proposition 4.1.

User-level differential privacy is NOT achievable for any non-trivial privacy parameters, using any privacy mechanism, if the perfect subspace detection property or perfect clustering results are desired. In addition, if an algorithm achieves perfect clustering or subspace detection with constant probability, user-level differential privacy is again NOT achievable for any non-trivial privacy parameters.

Intuitively, the reason why attribute-level privacy is not subject to the impossibility result in Proposition 4.1 is that its privacy promise is much weaker: even if perfect subspace detection or clustering results are presented, it is still possible to hide the information of a specific attribute of a user, provided that the attribute values are distributed in an incoherent and near-uniform way. On the other hand, a change in a user's complete attribute vector may often alter the cluster that the user belongs to and eventually break perfect subspace clustering.

User-level privacy for sparse subspace clustering, and user-level privacy in general, remains an important open problem. Some progress has been made in Wang et al. (2015b) to address user-level private subspace clustering by weakening the utility guarantee from correct clustering to approximately identifying the underlying subspaces. Nevertheless, the analysis in Wang et al. (2015b) mostly focuses on simpler algorithms such as thresholding-based subspace clustering (Heckel & Bolcskei, 2013) and cannot be easily generalized to state-of-the-art subspace clustering methods such as SSC (Elhamifar & Vidal, 2013) or LRR (Liu et al., 2013).

5 Proofs

Success conditions for exact SSC were proved in Soltanolkotabi et al. (2012) and generalized to the noisy case in Wang & Xu (2013). Below we cite Theorems 6 and 8 of Wang & Xu (2013) for a success condition of Lasso SSC. In general, Lasso SSC succeeds when there is a sufficiently large gap between subspace incoherence and inradius. The results are restated below, with minor simplifications in our notation.

Theorem 5.1 (Wang & Xu, 2013, Theorem 6 and 8).

Suppose X = Y + Z, where Y is the uncorrupted data matrix and Z is a deterministic noise matrix that satisfies max_i ‖z_i‖_2 ≤ ξ. Define Δ = min_ℓ (ρ_ℓ − μ_ℓ). If

(5.1)

then the subspace detection property holds for the Lasso SSC algorithm in Eq. (1.2) if the regularization coefficient λ is in the range

(5.2)

In addition, if the entries of Z are independent Gaussian noise with variance σ^2/d satisfying

(5.3)

for a sufficiently small constant, then with high probability the subspace detection property holds if λ is in the range

(5.4)

Here, the unspecified constants are absolute constants.

5.1 Proof of Theorem 3.1

We first bound the perturbation of dual directions when the data are noiseless.

Lemma 5.1 (Perturbation of dual directions, the noiseless case).

Assume X = Y (the noiseless setting). Fix a column x_i in X with dual direction ν_i and subspace incoherence defined in Eq. (3.1) and (3.2). Let X̃ = ΨX denote the projected data matrix and X′ denote the normalized version of X̃. Suppose the projected dual direction and subspace incoherence are computed using the normalized projected data matrix X′. If Ψ satisfies Eq. (2.2, 2.3) with parameter ε and λ > 0, then the following holds for all i:

(5.5)
Proof.

Fix ℓ and one column x in X^{(ℓ)}. Let S and S′ denote the low-rank subspaces to which x belongs before and after compression; that is, S′ = ΨS = {Ψv : v ∈ S}.

First note that . because and putting we obtain a solution with value . On the other hand, and putting we obtain a solution with value 0. Also, under the noiseless setting , if .

Define

where . Let and denote the values of the optimization problems. The first step is to prove that is feasible and nearly optimal to the projected optimization problem; that is, is close to .

We first show that is a feasible solution with high probability. By Proposition 2.1, the following bound on holds:

(5.6)

Furthermore, with probability

(5.7)

Consequently, by the definition of one has

(5.8)

Next, we compute a lower bound on , which should serve as a lower bound for because is the optimal solution to the dual optimization problem on the projected data. We first remark that due to the optimality condition of the dual problem at and hence

where the last inequality is due to the assumption that and . Consequently, we have the following chain of inequalities: