# Breaking the Small Cluster Barrier of Graph Clustering

This paper investigates graph clustering in the planted cluster model in the presence of small clusters. Traditional results dictate that for an algorithm to provably correctly recover the clusters, all clusters must be sufficiently large (in particular, Ω̃(√n) where n is the number of nodes of the graph). We show that this is not really a restriction: by a more refined analysis of the trace-norm based recovery approach proposed in Jalali et al. (2011) and Chen et al. (2012), we prove that small clusters, under certain mild assumptions, do not hinder recovery of large ones. Based on this result, we further devise an iterative algorithm to recover almost all clusters via a "peeling strategy", i.e., recover large clusters first, leading to a reduced problem, and repeat this procedure. These results are extended to the partial observation setting, in which only a (chosen) part of the graph is observed. The peeling strategy gives rise to an active learning algorithm, in which edges adjacent to smaller clusters are queried more often as large clusters are learned (and removed). At a high level, this paper sheds novel insight on high-dimensional statistics and learning structured data, by presenting a structured matrix learning problem for which a one-shot convex relaxation approach necessarily fails, but a carefully constructed sequence of convex relaxations does the job.


## 1 Introduction

This paper considers a classic problem in machine learning and theoretical computer science, namely graph clustering: given an undirected unweighted graph, partition the nodes into disjoint clusters, so that the density of edges within one cluster is higher than the density of edges across clusters. Graph clustering arises naturally in many applications across science and engineering. Some prominent examples include community detection in social networks Mishra et al. (2007), submarket identification in E-commerce and sponsored search Yahoo!-Inc (2009), and co-authorship analysis in document databases Ester et al. (1995), among others. From the theoretical point of view of binary classification, the edges of the graph are (noisy) labels of similarity or affinity between pairs of objects, and the concept class consists of clusterings of the objects (encoded graphically by identifying clusters with cliques).

Many theoretical results in graph clustering (e.g., Boppana, 1987; Chen et al., 2012; McSherry, 2001) consider the planted partition model, in which the edges are generated randomly; see Section 1.1 for more details. While numerous different methods have been proposed, their performance guarantees all share the following form – under certain conditions on the density of edges (within clusters and across clusters), the proposed method recovers the correct clusters exactly if all clusters are larger than a threshold size, typically Ω̃(√n).

In this paper, we aim to break this small cluster barrier of graph clustering. Correctly identifying extremely small clusters is inherently hard as they are easily confused with “fake” clusters generated by noisy edges111Indeed, even in the more lenient setup where one clique (i.e., a perfect cluster) of size K is embedded in an Erdős–Rényi graph with n nodes and edge probability 1/2, the best known polynomial-time method for recovering this clique requires K = Ω(√n), and it has been a long standing open problem to relax this requirement., and is not the focus of this paper. Instead, in this paper we investigate a question that has not been addressed before: Can we still recover large clusters in the presence of small clusters? Intuitively, this should be doable. To illustrate, consider an extreme example where the given graph consists of two disjoint subgraphs G₁ and G₂, where G₁ is a graph that can be correctly clustered using some existing method, and G₂ is a small clique. G₂ certainly violates the minimum cluster size requirement of previous results, but why should it spoil our ability to cluster G₁?

Our main result confirms this intuition. We show that the cluster size barrier arising in previous work (e.g., Chaudhuri et al., 2012; Bollobás and Scott, 2004; Chen et al., 2012; McSherry, 2001) is not really a restriction, but rather an artifact of the attempt to solve the problem in a single shot using convex relaxation techniques. Using a more careful analysis, we prove that the mixed trace-norm and ℓ₁-norm based convex formulation, initially proposed in Jalali et al. (2011), can recover clusters of size Ω̃(√n) even in the presence of smaller clusters. That is, small clusters do not interfere with recovery of the big clusters.

The main implication of this result is that one can apply an iterative “peeling” strategy, recovering smaller and smaller clusters. The intuition is simple – suppose the number of clusters is limited; then either all clusters are large, or the sizes of the clusters vary significantly. The first case is obviously easy. The second one is equally easy: using the aforementioned convex formulation, the larger clusters can be correctly identified. If we remove all nodes from these larger clusters, the remaining subgraph contains significantly fewer nodes than the original graph, which leads to a much lower threshold on the cluster size required for correct recovery, making it possible to correctly cluster some smaller clusters. By repeating this procedure, we can indeed recover the cluster structure for almost all nodes with no lower bound on the minimal cluster size. We summarize our main contributions and techniques:

(1) We provide a refined analysis of the mixed trace-norm and ℓ₁-norm convex relaxation approach for exact recovery of clusters proposed in Jalali et al. (2011) and Chen et al. (2012), focusing on the case where small clusters exist. We show that in the classical planted partition setting Boppana (1987), if each cluster is either large (more precisely, of size at least ℓ♯) or small (of size at most ℓ♭), then with high probability, this convex relaxation approach correctly identifies all big clusters while “ignoring” the small ones. Notice that the multiplicative gap between the two thresholds is polylogarithmic in n. In addition, it is possible to arbitrarily increase a parameter κ, thus turning a “knob” in quest of an interval that is disjoint from the set of cluster sizes. The analysis is done by identifying a certain feasible solution to the convex program and proving its almost sure optimality using a careful construction of a dual certificate. This feasible solution easily identifies the big clusters. Such an analysis has been performed before only in the case where all clusters are of size Ω̃(√n).

(2) We provide a converse of the result just described. More precisely, we show that if for some value of the knob κ an optimal solution appears to look as if the interval (ℓ♭, ℓ♯) were indeed free of cluster sizes, then the solution is useful (in the sense that it correctly identifies big clusters) even if this weren’t the case.

(3) The last two points imply that if some interval of the form (ℓ♭, ℓ♯) is free of cluster sizes, then an exhaustive search will constructively find big clusters (though not necessarily for that particular interval). This gives rise to an iterative algorithm, using a “peeling strategy”, to recover smaller and smaller clusters that are otherwise impossible to recover. Using the “knob”, we prove that as long as the number of clusters is at most roughly logarithmic in n, regardless of the cluster sizes, we can correctly recover the cluster structure for an overwhelming fraction of the nodes. To the best of our knowledge, this is the first result of provably correct graph clustering without any assumptions on the cluster sizes.

(4) We extend the result to the partial observation case, where only a fraction of similarity labels (i.e., edge/no edge) is known. As expected, smaller observation rates allow identification of larger clusters. Hence, the observation rate ρ serves as the “knob”. This gives rise to an active learning algorithm for graph clustering based on adaptively increasing the rate of sampling in order to hit a “forbidden interval” free of cluster sizes, and concentrating on smaller inputs as we identify big clusters and peel them off.

Besides these technical contributions, this paper provides novel insights into low-rank matrix recovery and, more generally, high-dimensional statistics, where data are typically assumed to obey certain low-dimensional structure. Numerous methods have been developed to exploit this a priori information so that a consistent estimator is possible even when the dimensionality of the data is larger than the number of samples. Our result shows that one may combine these methods with a “peeling strategy” to further push the envelope of learning structured data – by iteratively recovering the easier structure and then reducing the problem size, it is possible to learn structures that are otherwise difficult to learn using previous approaches.

### 1.1 Previous work

The literature on graph clustering is too vast for a detailed survey here; we concentrate on the most related work, specifically those providing theoretical guarantees on cluster recovery.

Planted partition model: The setup we study is the classical planted partition model Boppana (1987), also known as the stochastic block model Rohe et al. (2011). Here, n nodes are partitioned into k subsets, referred to as the “true clusters”, and a graph is randomly generated as follows: for each pair of nodes, depending on whether they belong to the same subset, an edge connecting them is generated with probability p or q, respectively. The goal is to correctly recover the clusters given the random graph. The planted partition model has been studied since as early as the 1980’s Boppana (1987). Earlier work focused on the 2-partition or, more generally, the k-partition case for constant k, i.e., the minimal cluster size is Ω(n) Boppana (1987); Condon and Karp (2001); Carson and Impagliazzo (2001); Bollobás and Scott (2004). Recently, several works have proposed methods to handle sublinear cluster sizes. These works can be roughly classified into three approaches: randomized algorithms (e.g., Shamir and Tsur, 2007), spectral clustering (e.g., McSherry, 2001; Giesen and Mitsche, 2005; Chaudhuri et al., 2012; Rohe et al., 2011), and low-rank matrix decomposition Jalali et al. (2011); Chen et al. (2012); Ames and Vavasis (2011); Oymak and Hassibi (2011). While these works differ in methodology, they all impose constraints on the size of the minimum true cluster – the best result to date requires it to be Ω̃(√n).

Correlation clustering: This problem, originally defined by Bansal, Blum and Chawla Bansal et al. (2004), also considers graph clustering, but in an adversarial noise setting. The goal there is to find the clustering minimizing the total disagreement (intercluster edges plus intracluster nonedges), without there necessarily being a notion of a true clustering (and hence no “exact recovery”). This problem is usually studied in the combinatorial optimization framework and is known to be NP-hard to approximate to within some constant factor. Prominent work includes Demaine et al. (2006); Ailon et al. (2008); Charikar et al. (2005). A PTAS is known in case the number of clusters is fixed Giotis and Guruswami (2006).

Low rank matrix decomposition via trace norm: Motivated by robust PCA, it has recently been shown Chandrasekaran et al. (2011); Candès et al. (2011) that it is possible to recover a low-rank matrix from sparse errors of arbitrary magnitude, where the key ingredient is using the trace norm (aka nuclear norm) as a convex surrogate of the rank. A similar result is also obtained when the low rank matrix is corrupted by other types of noise Xu et al. (2012).

Of particular relevance to this paper are Jalali et al. (2011) and Chen et al. (2012), where the authors apply this approach to graph clustering, and specifically to the planted partition model. Indeed, Chen et al. (2012) achieve state-of-the-art performance guarantees for the planted partition problem. However, they do not overcome the minimal cluster size lower bound.

Active learning/Active clustering: Another line of work that motivates this paper is the study of active learning algorithms (a setting in which labeled instances are chosen by the learner, rather than by nature), and in particular active learning for clustering. The most related work is Ailon et al. (2012), who investigated active learning for correlation clustering. The authors obtain an approximately optimal solution with respect to the optimum, while (actively) querying only a limited number of edges. Their result imposes no restriction on cluster sizes and hence inspired this work, but differs in at least two major ways. First, Ailon et al. (2012) did not consider exact recovery as we do. Second, their guarantees fall in the ERM (Empirical Risk Minimization) framework, with no running time guarantees. Our work recovers true clusters exactly using a convex relaxation algorithm, and is hence computationally efficient. The problem of active learning has also been investigated in other clustering setups, including clustering based on distance matrices Voevodski et al. (2012); Shamir and Tishby (2011); Eriksson et al. (2011); Krishnamurthy et al. (2012). These setups differ from ours and cannot be easily compared.

## 2 Notation and Setup

Throughout, [n] = {1, …, n} denotes the ground set of n elements to be clustered. We assume a ground truth clustering of [n] given by a pairwise disjoint covering C₁, …, C_k, where k is the number of clusters. We say i ∼ j if i, j ∈ C_r for some r, otherwise i ≁ j. We let n_r = |C_r| for all r ∈ [k]. For any i ∈ [n], ⟨i⟩ is the unique index satisfying i ∈ C_⟨i⟩.

For a matrix M and a subset S ⊆ [n] of size s, the matrix M_S is the s × s principal minor of M corresponding to the set of indexes S. For a matrix B, Γ(B) denotes the support of B, namely, the set of index pairs (i, j) such that B_{ij} ≠ 0.

The ground truth clustering matrix, denoted K̄, is defined so that K̄_{ij} is 1 if i ∼ j, otherwise 0. This is a block diagonal matrix, each block consisting of 1’s only. Its rank is k. The input is a symmetric matrix A, a noisy version of K̄. It is generated using the well known planted clustering model, as follows. There are two fixed edge probabilities, 1 ≥ p > q ≥ 0. We think of A as the adjacency matrix of an undirected random graph, where edge (i, j) is in the graph with probability p if i ∼ j, otherwise with probability q, independently of other choices. The error matrix is denoted by B̄ = A − K̄. We let Γ(B̄) denote the noise locations.
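To make the model concrete, the generative process above can be sketched in a few lines of numpy; the function name and the convention of placing 1's on the diagonal are ours, not from the paper:

```python
import numpy as np

def planted_partition(sizes, p, q, rng=None):
    """Generate a symmetric 0/1 adjacency matrix A from the planted
    partition model: P(A_ij = 1) = p within a cluster, q across
    clusters (p > q).  Also returns the ground truth clustering matrix.
    By convention we place 1's on the diagonal (i ~ i)."""
    rng = np.random.default_rng(rng)
    n = sum(sizes)
    labels = np.repeat(np.arange(len(sizes)), sizes)
    K_bar = (labels[:, None] == labels[None, :]).astype(int)  # ground truth
    probs = np.where(K_bar == 1, p, q)
    upper = rng.random((n, n)) < probs          # draw every pair independently
    A = np.triu(upper, 1)                       # keep the strict upper triangle
    A = (A + A.T + np.eye(n, dtype=int)).astype(int)
    return A, K_bar
```

With p well above q, the within-cluster blocks of A are visibly denser than the cross-cluster blocks, which is exactly the signal the convex program below exploits.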

Note that our results apply to the more practical case in which the edge probability of the pair (i, j) is some p_{ij} ≥ p for i ∼ j and some q_{ij} ≤ q for i ≁ j, as long as p > q.

## 3 Results

We remind the reader that the trace norm ∥M∥∗ of a matrix M is the sum of its singular values, and we define the ℓ₁ norm of a matrix to be ∥M∥₁ = Σ_{i,j} |M_{ij}|. Consider the following convex program, combining the trace norm of a matrix variable K with the ℓ₁ norm of another matrix variable B, using two parameters c₁, c₂ that will be determined later:

$$\text{(CP1)}\qquad \min\; \|K\|_* + c_1\big\|P_{\Gamma(A)}(B)\big\|_1 + c_2\big\|P_{\Gamma(A)^c}(B)\big\|_1 \quad \text{s.t.}\;\; K+B=A,\;\; 0\le K_{ij}\le 1\ \ \forall (i,j).$$
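As a quick sanity check of these two norms (plain numpy, our own illustration): a clustering matrix is block diagonal with all-ones blocks, and each block of size n_r contributes exactly one nonzero singular value, equal to n_r, so the trace norm of the ground truth clustering matrix is simply n, regardless of how the clusters split.

```python
import numpy as np

trace_norm = lambda M: np.linalg.svd(M, compute_uv=False).sum()
l1_norm = lambda M: np.abs(M).sum()

# Block diagonal clustering matrix with blocks of sizes 4, 3, 1.
sizes = [4, 3, 1]
labels = np.repeat(np.arange(len(sizes)), sizes)
K = (labels[:, None] == labels[None, :]).astype(float)

assert np.isclose(trace_norm(K), sum(sizes))    # trace norm = n = 8
assert l1_norm(K) == sum(s * s for s in sizes)  # l1 norm = sum of n_r^2
```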
###### Theorem 1.

There exist constants b₁, b₂, b₃, b₄ such that the following holds with high probability. For any parameters κ ≥ 1 and t ∈ (0, 1), define

$$\ell^{\sharp} = b_3\,\kappa\,\frac{\sqrt{p(1-q)\,n}}{p-q}\,\log^2 n\,,\qquad \ell^{\flat} = b_4\,\kappa\,\frac{\sqrt{p(1-q)\,n}}{p-q}\,. \tag{1}$$

If for all r ∈ [k], either n_r ≥ ℓ♯ or n_r ≤ ℓ♭, and if (K, B) is an optimal solution to (CP1) with

$$c_1 = \frac{b_1}{\kappa\sqrt{n\log n}}\sqrt{\frac{1-t}{t}}\,,\qquad c_2 = \frac{b_1}{\kappa\sqrt{n\log n}}\sqrt{\frac{t}{1-t}}\,, \tag{2}$$

then K = 𝒫♯K̄, where K̄ is the ground truth clustering matrix and, for a matrix M, 𝒫♯M is the matrix defined by

$$(\mathcal{P}^{\sharp} M)(i,j)=\begin{cases} M(i,j) & \text{if } \max\{n_{\langle i\rangle},\, n_{\langle j\rangle}\}\ \ge\ \ell^{\sharp},\\ 0 & \text{otherwise}.\end{cases}$$

(Note that by the theorem’s premise, 𝒫♯K̄ is the matrix obtained from K̄ after zeroing out the blocks corresponding to clusters of size at most ℓ♭.) The proof is based on Chen et al. (2012) and is deferred to the supplemental material due to lack of space. The main novelty in this work compared to previous work is the treatment of small clusters of size at most ℓ♭, whereas in previous work only large clusters were treated, and the existence of small clusters did not allow recovery of the big clusters.
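The operator 𝒫♯ has a direct numpy transcription; in the following sketch (argument names are ours), an entry survives exactly when at least one of its two endpoint clusters is large:

```python
import numpy as np

def p_sharp(M, cluster_of, cluster_sizes, ell_sharp):
    """The operator P# from Theorem 1: keep entry (i, j) iff
    max{n_<i>, n_<j>} >= ell_sharp, else zero it out.
    `cluster_of[i]` is the cluster index <i> of node i, and
    `cluster_sizes[r]` is n_r."""
    sizes = cluster_sizes[cluster_of]                       # n_<i> per node
    big = np.maximum(sizes[:, None], sizes[None, :]) >= ell_sharp
    return np.where(big, M, 0)
```

Applied to a block diagonal clustering matrix, this zeroes exactly the blocks of the small clusters, matching the parenthetical note above.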

###### Definition 2.

An n × n matrix K is a partial clustering matrix if there exists a collection of pairwise disjoint sets Ĉ₁, …, Ĉ_s ⊆ [n] (the induced clusters) such that K_{ij} = 1 if and only if i, j ∈ Ĉ_r for some r, otherwise K_{ij} = 0. If K is a partial clustering matrix then σ_min(K) is defined as min_{r ∈ [s]} |Ĉ_r|.

The definition is depicted in Figure 1. Theorem 1 tells us that by choosing κ (and hence ℓ♯, ℓ♭) properly, so that no cluster size falls in the range (ℓ♭, ℓ♯), the unique optimal solution to convex program (CP1) is such that its low-rank part K is a partial clustering matrix induced by the big ground truth clusters.
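A partial clustering matrix can be recognized, and its induced clusters extracted, by a straightforward support check; the following helper is our illustration of Definition 2, not code from the paper:

```python
import numpy as np

def as_partial_clustering(K):
    """If the 0/1 matrix K is a partial clustering matrix (Definition 2),
    return its induced clusters and the size of the smallest one
    (sigma_min); otherwise return None."""
    clusters, seen = [], set()
    for i in range(K.shape[0]):
        if i in seen:
            continue
        members = set(np.flatnonzero(K[i]))
        if not members:
            continue                    # i belongs to no induced cluster
        if i not in members:
            return None                 # a member must relate to itself
        for j in members:               # all members must share one support
            if set(np.flatnonzero(K[j])) != members:
                return None
        clusters.append(sorted(members))
        seen |= members
    if not clusters:
        return [], 0
    return clusters, min(len(c) for c in clusters)
```

In the experiments this is exactly the kind of check one runs on the low-rank part of an optimal (CP1) solution before declaring clusters recovered.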

In order for this fact to be useful algorithmically, we also need a type of converse: there exists an event of high probability (in the random process generating the input), such that for all values of κ, if an optimal solution to the corresponding (CP1) looks like the solution described in Theorem 1, then the blocks of K correspond to actual ground truth clusters.

###### Theorem 3.

There exist constants C₁, C₂ such that with high probability, the following holds. For all κ ≥ 1 and t ∈ (0, 1), if (K, B) is an optimal solution to (CP1) with c₁, c₂ as defined in Theorem 1, and additionally K is a partial clustering matrix induced by Ĉ₁, …, Ĉ_s, and also

$$\sigma_{\min}(K)\ \ge\ \max\left\{\ \frac{C_1\,k\,\log n}{(p-q)^2}\,,\ \ \frac{C_2\,\kappa\sqrt{p(1-q)\,n}\,\log n}{p-q}\ \right\}, \tag{3}$$

then Ĉ₁, …, Ĉ_s are actual ground truth clusters; namely, there exists an injection π : [s] → [k] such that Ĉ_r = C_{π(r)} for all r ∈ [s].

(Note: Our proof of Theorem 3 uses Hoeffding tail bounds for simplicity, which are tight for p, q bounded away from 0 and 1. Bernstein tail bounds can be used to strengthen the result for other regimes of p, q. We elaborate on this in Section 3.1.)

The combination of Theorems 1 and 3 implies that, as long as there exists a relatively small interval which is disjoint from the set of cluster sizes, and such that at least one cluster size is larger than this interval (and large enough), we can recover at least one (large) cluster using (CP1). This is made clear in the following.

###### Corollary 4.

Assume we have a guarantee that there exists a value of κ such that no cluster size falls in the interval (ℓ♭(κ), ℓ♯(κ)), and at least one cluster is of size at least ℓ♯(κ). Then with high probability, we can recover at least one cluster of size at least ℓ♯(κ) efficiently by solving (CP1) with that κ.

Of course we do not know what κ (and hence ℓ♯, ℓ♭) is. We could exhaustively search for κ and hope to recover at least one large cluster. A more interesting question is, when is such a κ guaranteed to exist? Let γ = ℓ♯/ℓ♭ = (b₃/b₄) log² n. The number γ is the (multiplicative) gap size, equaling the ratio between ℓ♯ and ℓ♭ (for any κ). If the number of clusters is a priori bounded by some k₀, we both ensure that there is at least one cluster of size at least n/k₀, and, by the pigeonhole principle, that one of the k₀ + 1 disjoint intervals in the geometric sequence (n/(γk₀), n/k₀], (n/(γ²k₀), n/(γk₀)], … is disjoint from the set of cluster sizes. If, in addition, the smallest interval in the sequence is not too small and p − q is not too small, so that Corollary 4 holds, then we are guaranteed to recover at least one cluster using Algorithm 1. We find this condition difficult to work with. An elegant, useful version of the idea is obtained if we assume p, q are some fixed constants.222In fact, we need only fix p − q, but we wish to keep this exposition simple. As the following lemma shows, it turns out that in this regime, the number of clusters can be assumed to be almost logarithmic in n while still ensuring recovery of at least one cluster.333In comparison, Ailon et al. (2012) require the number of clusters to be constant for their guarantees, as does the Correlation Clustering PTAS of Giotis and Guruswami (2006). In what follows, notation such as b(p, q) denotes universal positive functions that depend on p, q only.
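The pigeonhole argument can be sketched numerically. The helper below is our illustration (with the gap size γ passed in as a parameter): it enumerates the k₀ + 1 candidate intervals, all below n/k₀, so that the largest cluster sits above every one of them while the at most k₀ cluster sizes can block at most k₀ intervals.

```python
import numpy as np

def candidate_intervals(n, k0, gamma):
    """Candidate "gap" intervals (ell, gamma*ell] scanned while turning
    the knob kappa.  With at most k0 clusters, at least one of these
    k0 + 1 disjoint intervals is free of cluster sizes (pigeonhole)."""
    lows = n / k0 / gamma ** np.arange(1, k0 + 2)
    return [(lo, gamma * lo) for lo in lows]
```

For example, with n = 1000, k₀ = 4 and γ = 2, the cluster size profile (500, 300, 120, 80) leaves the interval (125, 250] empty, so a search over κ will eventually land on a usable gap.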

###### Lemma 5.

There exist universal functions of p and q such that the following holds: if the number of clusters k is at most almost-logarithmic in n (with the precise bound depending on these functions), then with high probability Algorithm 1 will recover at least one cluster in at most a logarithmic number of iterations.

The proof is deferred to the supplemental material. Lemma 5 ensures that by trying at most a logarithmic number of values of κ, we can recover at least one large cluster, assuming the number of clusters is roughly logarithmic in n. The next proposition tells us that as long as each such step removes at most a bounded fraction of the remaining elements, the step can be repeated.

###### Proposition 6.

A pair of numbers (n, k) is called good if the guarantee of Lemma 5 holds for instances with n elements and k clusters. If (n, k) is good, then (n′, k′) is good for all (n′, k′) satisfying k′ ≤ k and n′ ≤ n sufficiently close to n.

The proof is trivial. The proposition implies an inductive process in which at least one big (with respect to the current unrecovered node set) cluster can be efficiently recovered and removed, as long as the previous step recovered at most a bounded fraction of its input. Combining, we have proved the following:

###### Theorem 7.

Assume p, q and the number of clusters satisfy the requirements of Lemma 5. Then with high probability, Algorithm 2 recovers clusters covering all but at most a vanishing fraction of the input in the full observation case, without any restriction on the minimal cluster size. Moreover, if we assume that the number of clusters is bounded by a constant k₀, then the algorithm will recover clusters covering all but a constant number of input elements.
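Algorithms 1 and 2 are not reproduced in this excerpt, so the following is only a schematic of the peeling loop, with `solve_cp1` and `find_gap_kappa` as hypothetical stand-ins for solving (CP1) and searching for a gap value of κ:

```python
import numpy as np

def peel_clusters(A, solve_cp1, find_gap_kappa, max_rounds=20):
    """Schematic peeling loop: recover the large clusters of the current
    subgraph, remove their nodes, and recurse on the smaller remainder.
    `solve_cp1(A, kappa)` returns the big clusters of A as sets of local
    indices; `find_gap_kappa(A)` picks the knob value.  Both are
    hypothetical stand-ins for the paper's subroutines."""
    nodes = np.arange(A.shape[0])
    recovered = []
    for _ in range(max_rounds):
        if len(nodes) == 0:
            break
        big_clusters = solve_cp1(A, find_gap_kappa(A))
        if not big_clusters:
            break
        for c in big_clusters:
            recovered.append(nodes[sorted(c)])      # map back to original ids
        removed = np.concatenate([sorted(c) for c in big_clusters])
        keep = np.setdiff1d(np.arange(len(nodes)), removed)
        nodes = nodes[keep]
        A = A[np.ix_(keep, keep)]                   # the reduced problem
    return recovered
```

The point of the loop is exactly the one made above: after the big clusters are peeled off, the same recovery threshold, now measured against a much smaller subgraph, admits clusters that were previously too small.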

### 3.1 Partial Observations

We now consider the case where the input matrix A is not given to us in its entirety; rather, we have oracle access to A_{ij} for pairs (i, j) of our choice. Unobserved values are formally marked with a designated symbol.

Consider a more particular setting in which the edge probabilities defining A are p′ (for i ∼ j) and q′ (for i ≁ j), and we observe each entry A_{ij} with probability ρ, independently. More precisely: for i ∼ j we have A_{ij} = 1 with probability ρp′, A_{ij} = 0 with probability ρ(1 − p′), and A_{ij} unobserved with the remaining probability. For i ≁ j we have A_{ij} = 1 with probability ρq′, A_{ij} = 0 with probability ρ(1 − q′), and A_{ij} unobserved with the remaining probability. Clearly, by pretending that the unobserved values are 0, we emulate the full observation case with p = ρp′, q = ρq′.
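A minimal simulation of this observation process (the function and the −1 marker for unobserved entries are our conventions) makes the zero-filling emulation concrete:

```python
import numpy as np

def observe(A_full, rho, rng=None):
    """Observe each entry of the similarity matrix independently with
    probability rho; unobserved entries are marked -1.  Replacing the
    -1 marks with 0 emulates the full-observation model with effective
    edge probabilities p = rho * p' and q = rho * q'."""
    rng = np.random.default_rng(rng)
    mask = rng.random(A_full.shape) < rho
    mask = np.triu(mask) | np.triu(mask, 1).T     # keep the mask symmetric
    A_obs = np.where(mask, A_full, -1)            # marked version
    A_zero = np.where(mask, A_full, 0)            # zero-filled version
    return A_obs, A_zero
```

On a graph generated with p′, q′, the zero-filled matrix has empirical edge densities close to ρp′ within clusters and ρq′ across them, which is exactly the reduction used below.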

Of particular interest is the case in which p′, q′ are held fixed and ρ tends to zero as n grows. In this regime, by varying ρ and fixing κ, Theorem 1 implies the following:

###### Corollary 8.

There exist functions b₁(p′, q′), b₃(p′, q′), b₄(p′, q′), b₅(p′, q′) such that for any sampling rate parameter ρ ∈ (0, 1], the following holds with high probability. Define

$$\ell^{\sharp} = b_3(p',q')\,\frac{\sqrt{n}}{\sqrt{\rho}}\,\log^2 n\,,\qquad \ell^{\flat} = b_4(p',q')\,\frac{\sqrt{n}}{\sqrt{\rho}}\,.$$

If for all r ∈ [k], either n_r ≥ ℓ♯ or n_r ≤ ℓ♭, and if (K, B) is an optimal solution to (CP1) with

$$c_1 = \frac{b_1(p',q')}{\sqrt{n\log n}}\,\sqrt{\frac{1-b_5(p',q')\,\rho}{b_5(p',q')\,\rho}}\,,\qquad c_2 = \frac{b_1(p',q')}{\sqrt{n\log n}}\,\sqrt{\frac{b_5(p',q')\,\rho}{1-b_5(p',q')\,\rho}}\,,$$

then K = 𝒫♯K̄, where 𝒫♯ is as defined in Theorem 1.

(Note: We’ve abused notation by replacing previously defined global constants (e.g., b₁) with global functions of p′, q′ (e.g., b₁(p′, q′)).) Notice now that the observation probability ρ can be used as a knob for controlling the cluster sizes we are trying to recover, instead of κ. We would also like to obtain a version of Theorem 3. In particular, we would like to understand its asymptotics as ρ tends to 0.

###### Theorem 9.

There exist functions C₁(p′, q′), C₂(p′, q′) such that for all observation rate parameters ρ ∈ (0, 1], the following holds with high probability. If (K, B) is an optimal solution to (CP1) with c₁, c₂ as defined in Corollary 8, and additionally K is a partial clustering matrix induced by Ĉ₁, …, Ĉ_s, and also

$$\sigma_{\min}(K)\ \ge\ \max\left\{\ \frac{C_1(p',q')\,k\,\log n}{\rho}\,,\ \ \frac{C_2(p',q')\,\sqrt{n}\,\log n}{\sqrt{\rho}}\ \right\}, \tag{4}$$

then Ĉ₁, …, Ĉ_s are actual ground truth clusters; namely, there exists an injection π : [s] → [k] such that Ĉ_r = C_{π(r)} for all r ∈ [s].

The proof can be found in the supplemental material. Using the same reasoning as before, we derive the following:

###### Theorem 10.

Let γ = ℓ♯/ℓ♭ = (b₃(p′, q′)/b₄(p′, q′)) log² n (with b₃, b₄ defined in Corollary 8). There exists a constant (depending on p′, q′) such that the following holds. Assume the number of clusters is bounded by some known number k₀, and let ρ₀ denote a suitably chosen initial observation rate. Then there exists ρ in the set {ρ₀, γ²ρ₀, γ⁴ρ₀, …, γ^{2k₀}ρ₀} for which, if A is obtained with observation rate ρ (zeroing unobserved entries), then with high probability, any optimal solution (K, B) to (CP1) with c₁, c₂ from Corollary 8 satisfies (4).

(Note that the upper bound on ρ ensures that it is indeed a probability.) The theorem is proven using a simple pigeonhole principle, noting that one of the k₀ + 1 candidate intervals must be disjoint from the set of cluster sizes, and that there is at least one cluster of size at least n/k₀. The theorem, together with Corollary 8 and Theorem 9, ensures the following. On one end of the spectrum, if k₀ is a constant (and n is large enough), then with high probability we can recover at least one large cluster after querying no more than

$$O\!\left(\, n\,k_0^{2}\left(\frac{b_3(p',q')}{b_4(p',q')}\,\log^2 n\right)^{2k_0} \log^4 n \right) \tag{5}$$

values of A. On the other end of the spectrum, if k₀ grows with n and n is large enough (exponential in k₀), then we can still recover at least one large cluster after querying only a subset of the values of A. (We omit the details of the last fact from this version.) This is summarized in the following:

###### Theorem 11.

Assume an upper bound k₀ on the number of clusters. As long as n is larger than some function of k₀, Algorithm 4 will recover, with high probability, at least one large cluster, regardless of the size of the other (small) clusters. Moreover, if k₀ is a constant, then clusters covering all but a constant number of elements will be recovered with high probability, and the total number of observation queries is (5), hence almost linear in n.

Note that unlike previous results for this problem, the recovery guarantee does not impose any lower bound on the size of the smallest cluster. Also note that the underlying algorithm is an active learning one, because relatively more observations fall in the smaller clusters, which survive deeper in the recursion of Algorithm 4.

## 4 Experiments

We experimented with simplified versions of our algorithms. Here we did not make an effort to compute the various constants defining the algorithms in this work, which makes an exact implementation difficult. Instead, for Algorithm 1, we increase κ by a fixed multiplicative factor in each iteration until a partial clustering matrix is found. Similarly, in Algorithm 3, ρ is increased by a fixed additive amount. Still, our experiments clearly support our theoretical findings. A more practical “user’s guide” for this method, with actual constants, is the subject of future work.

In all experiment reports below, we use a variant of the Augmented Lagrangian Multiplier (ALM) method Lin et al. (2009) to solve the semi-definite program (CP1). Whenever we say that “clusters were recovered”, we mean that the corresponding instantiation of (CP1) resulted in an optimal solution (K, B) for which K was a partial clustering matrix, whose induced clusters are the ones reported as recovered.
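We do not reproduce the ALM solver here, but its central subproblem – the proximal step for the trace norm term of (CP1) – is singular value thresholding, which can be sketched in a few lines (a generic textbook step, not the paper's exact implementation):

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: the proximal operator of the trace
    norm, the workhorse step inside ALM solvers for programs like (CP1).
    Returns argmin_X  tau * ||X||_* + 0.5 * ||X - M||_F^2."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s_thr = np.maximum(s - tau, 0.0)     # shrink each singular value by tau
    return (U * s_thr) @ Vt
```

Shrinking singular values toward zero is what drives the low-rank variable K of (CP1) toward a block structure.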

#### Experiment 1 (Full Observation)

Consider a set of nodes partitioned into clusters of varying sizes. The graph is generated according to the planted partition model, and we assume the full observation setting. We apply a simplified version of Algorithm 2; the clusters recovered at each step are detailed in Table 1.

#### Experiment 2 (Partial Observation - Fixed Sample Rate)

We have a graph with clusters of varying sizes, generated with fixed p′, q′ and a fixed observation rate ρ. We repeatedly solved (CP1) with c₁, c₂ as in Corollary 8. At each iteration, at least one large cluster (relative to the input size at that iteration) was recovered exactly and removed. Results are shown in Table 1.

#### Experiment 3 (Partial Observation - Incremental Sampling Rate)

We tried a simplified version of Algorithm 4 on a graph with clusters of varying sizes, generated with fixed p′, q′ and an observation rate which we now specify. We start with a small ρ and increase it incrementally until we recover (and then remove) at least one cluster, then repeat. Results are shown in Table 1.

#### Experiment 3A

We repeat the experiment with a larger instance containing more clusters. Results are shown in Table 1. Note that we recover even the smallest clusters, whose sizes fall below the √n threshold of previous work.

#### Experiment 4 (Mid-Size Clusters)

Our current theoretical results do not say anything about mid-size clusters – those with sizes between ℓ♭ and ℓ♯. It is interesting to study the behavior of (CP1) in the presence of mid-size clusters. We generated an instance containing a large cluster, a mid-size cluster, and several small clusters, and solved (CP1) with a fixed κ. The low-rank part K of the solution is shown in Fig. 2. The large cluster is completely recovered in K, while the small clusters are entirely ignored. The mid-size cluster, however, exhibits a pattern we find difficult to characterize. This shows that the polylog gap in our theorems is a real phenomenon and not an artifact of our proof technique. Nevertheless, the large cluster appears clean, and might allow recovery using a simple combinatorial procedure. If this is true in general, it might not be necessary to search for a gap free of cluster sizes. Perhaps for any κ, (CP1) identifies all large clusters after a possibly simple mid-size cleanup procedure. Understanding this phenomenon and its algorithmic implications is of much interest.

## 5 Discussion

An immediate future research direction is to better understand the “mid-size crisis”. Our current results say nothing about clusters that are neither big nor small, i.e., those falling in the interval (ℓ♭, ℓ♯). Our numerical experiments confirm that the mid-size phenomenon is real: such clusters are neither completely recovered nor entirely ignored by the optimal K. The part of K restricted to these clusters does not seem to have an obvious pattern. Proving whether we can still efficiently recover large clusters in the presence of mid-size clusters is an interesting open problem.

Our study was mainly theoretical, focusing on the planted partition model. As such, our experiments focused on confirming the theoretical findings with data generated exactly according to the distribution for which we could provide provable guarantees. It would be interesting to apply the presented methodology to real applications, particularly big data sets from web applications and social networks.

Another interesting direction is extending the “peeling strategy” to other high-dimensional learning problems. This requires understanding when such a strategy may work. One intuitive explanation of the small cluster barrier encountered in previous work is ambiguity – when viewing the input at the “big cluster resolution”, a small cluster is both a low-rank matrix and a sparse matrix. Only when “zooming in” (after recovering the big clusters) do small-cluster patterns emerge. There are other formulations with a similar property. For example, in Xu et al. (2012), the authors propose to decompose a matrix into the sum of a low-rank one and a column-sparse one to solve an outlier-resistant PCA task. Notice that a column-sparse matrix is also a low-rank matrix. We hope the “peeling strategy” may also help with that problem.

## References

• Ailon et al. [2008] N. Ailon, M. Charikar, and A. Newman. Aggregating inconsistent information: Ranking and clustering. J. ACM, 55(5):23:1–23:27, 2008.
• Ailon et al. [2012] N. Ailon, R. Begleiter, and E. Ezra. Active learning using smooth relative regret approximations with applications. In COLT, 2012.
• Ames and Vavasis [2011] B. Ames and S. Vavasis. Nuclear norm minimization for the planted clique and biclique problems. Mathematical Programming, 129(1):69–89, 2011.
• Bansal et al. [2004] N. Bansal, A. Blum, and S. Chawla. Correlation clustering. Machine Learning, 56:89–113, 2004.
• Bollobás and Scott [2004] B. Bollobás and AD Scott. Max cut for random graphs with a planted partition. Combinatorics, Prob. and Comp., 13(4-5):451–474, 2004.
• Boppana [1987] R.B. Boppana. Eigenvalues and graph bisection: An average-case analysis. In FOCS, pages 280–285, 1987.
• Candès et al. [2011] E. Candès, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? J. ACM, 58:1–37, 2011.
• Carson and Impagliazzo [2001] T. Carson and R. Impagliazzo. Hill-climbing finds random planted bisections. In SODA, 2001.
• Chandrasekaran et al. [2011] V. Chandrasekaran, S. Sanghavi, S. Parrilo, and A. Willsky. Rank-sparsity incoherence for matrix decomposition. SIAM J. on Optimization, 21(2):572–596, 2011.
• Charikar et al. [2005] Moses Charikar, Venkatesan Guruswami, and Anthony Wirth. Clustering with qualitative information. J. Comput. Syst. Sci., 71(3):360–383, 2005.
• Chaudhuri et al. [2012] K. Chaudhuri, F. Chung, and A. Tsiatas. Spectral clustering of graphs with general degrees in the extended planted partition model. COLT, 2012.
• Chen et al. [2012] Y. Chen, S. Sanghavi, and H. Xu. Clustering sparse graphs. In NIPS. Available on arXiv:1210.3335, 2012.
• Condon and Karp [2001] A. Condon and R.M. Karp. Algorithms for graph partitioning on the planted partition model. Random Structures and Algorithms, 2001.
• Demaine et al. [2006] E. Demaine, D. Emanuel, A. Fiat, and N. Immorlica. Correlation clustering in general weighted graphs. Theoretical Comp. Sci., 2006.
• Eriksson et al. [2011] B. Eriksson, G. Dasarathy, A. Singh, and R. Nowak. Active clustering: Robust and efficient hierarchical clustering using adaptively selected similarities. arXiv:1102.3887, 2011.
• Ester et al. [1995] M. Ester, H. Kriegel, and X. Xu. A database interface for clustering in large spatial databases. In KDD, 1995.
• Giesen and Mitsche [2005] J. Giesen and D. Mitsche. Reconstructing many partitions using spectral techniques. In Fundamentals of Computation Theory, pages 433–444, 2005.
• Giotis and Guruswami [2006] Ioannis Giotis and Venkatesan Guruswami. Correlation clustering with a fixed number of clusters. Theory of Computing, 2(1):249–266, 2006.
• Jalali et al. [2011] A. Jalali, Y. Chen, S. Sanghavi, and H. Xu. Clustering partially observed graphs via convex optimization. In ICML. Available on arXiv:1104.4803, 2011.
• Krishnamurthy et al. [2012] A. Krishnamurthy, S. Balakrishnan, M. Xu, and A. Singh. Efficient active algorithms for hierarchical clustering. arXiv:1206.4672, 2012.
• Lin et al. [2009] Z. Lin, M. Chen, L. Wu, and Y. Ma. The Augmented Lagrange Multiplier Method for Exact Recovery of Corrupted Low-Rank Matrices. UIUC Technical Report UILU-ENG-09-2215, 2009.
• McSherry [2001] F. McSherry. Spectral partitioning of random graphs. In FOCS, pages 529–537, 2001.
• Mishra et al. [2007] N. Mishra, I. Stanton, R. Schreiber, and R. E. Tarjan. Clustering social networks. Algorithms and Models for the Web-Graph, Springer, 2007.
• Oymak and Hassibi [2011] S. Oymak and B. Hassibi. Finding dense clusters via low rank + sparse decomposition. arXiv:1104.5186v1, 2011.
• Rohe et al. [2011] K. Rohe, S. Chatterjee, and B. Yu. Spectral clustering and the high-dimensional stochastic block model. Ann. of Stat., 39:1878–1915, 2011.
• Shamir and Tishby [2011] O. Shamir and N. Tishby. Spectral Clustering on a Budget. In AISTATS, 2011.
• Shamir and Tsur [2007] R. Shamir and D. Tsur. Improved algorithms for the random cluster graph model. Random Struct. & Alg., 31(4):418–449, 2007.
• Voevodski et al. [2012] K. Voevodski, M. Balcan, H. Röglin, S. Teng, and Y. Xia. Active clustering of biological sequences. JMLR, 13:203–225, 2012.
• Xu et al. [2012] H. Xu, C. Caramanis, and S. Sanghavi. Robust PCA via outlier pursuit. IEEE Transactions on Information Theory, 58(5):3047–3064, 2012.
• Yahoo!-Inc [2009] Yahoo!-Inc. Graph partitioning.

## Appendix A Notation and Conventions

We use the following notation and conventions throughout the supplement. For a real matrix $M$, we use the unadorned norm $\|M\|$ to denote its spectral norm. The notation $\|M\|_F$ refers to the Frobenius norm, $\|M\|_1$ is $\sum_{i,j}|M(i,j)|$, and $\|M\|_\infty$ is $\max_{i,j}|M(i,j)|$.

We will also study operators on the space of $n \times n$ matrices. To distinguish them from the matrices studied in this work, we will simply call these objects “operators”, and will denote them using a calligraphic font, e.g. $\mathcal{P}$. The norm of an operator $\mathcal{P}$ is defined as

$$\|\mathcal{P}\| = \sup_{M : \|M\|_F = 1} \|\mathcal{P}M\|_F\,,$$

where the supremum is over matrices $M \in \mathbb{R}^{n \times n}$.

For a fixed, real matrix $M$, we define the matrix linear subspace $T(M)$ as follows:

$$T(M) := \{YM + MX : X, Y \in \mathbb{R}^{n \times n}\}\,.$$

In words, this subspace is the set of matrices spanned by matrices each row of which is in the row space of $M$, and matrices each column of which is in the column space of $M$.

For any given subspace of matrices $S$, we let $\mathcal{P}_S$ denote the orthogonal projection onto $S$ with respect to the inner product $\langle A, B \rangle = \operatorname{tr}(A^\top B)$. This means that for any matrix $M$,

$$\mathcal{P}_S M = \operatorname*{arg\,min}_{X \in S} \|M - X\|_F\,.$$

For a matrix $X$, we let $\Gamma(X)$ denote the set of matrices supported on a subset of the support of $X$. Note that for any matrix $M$,

$$(\mathcal{P}_{\Gamma(X)}M)(i,j) = \begin{cases} M(i,j) & X(i,j) \neq 0 \\ 0 & \text{otherwise.} \end{cases}$$
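For concreteness, the support projection above is an entrywise mask, which takes one line of NumPy (the matrices `X` and `M` here are arbitrary illustrations):

```python
import numpy as np

def proj_support(X, M):
    """Projection onto matrices supported on the support of X:
    keep M(i,j) where X(i,j) != 0, zero elsewhere."""
    return np.where(X != 0, M, 0.0)

X = np.array([[1.0, 0.0],
              [0.0, 2.0]])
M = np.array([[5.0, 6.0],
              [7.0, 8.0]])
out = proj_support(X, M)  # off-support entries of M are zeroed
```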

It is a well known fact that $\mathcal{P}_{T(X)}$ is given as follows:

$$\mathcal{P}_{T(X)}M = \mathcal{P}_{C(X)}M + M\mathcal{P}_{R(X)} - \mathcal{P}_{C(X)}M\mathcal{P}_{R(X)}\,,$$

where $\mathcal{P}_{C(X)}$ is the projection (of a vector) onto the column space of $X$, and $\mathcal{P}_{R(X)}$ is the projection onto the row space of $X$.

For a subspace $S$ we let $S^\perp$ denote the orthogonal subspace with respect to $\langle \cdot, \cdot \rangle$:

$$S^\perp = \{X \in \mathbb{R}^{n \times n} : \langle X, Y \rangle = 0\ \ \forall Y \in S\}\,.$$

Slightly abusing notation, we will use the set complement operator to formally define $\mathcal{P}_{\Gamma^c(X)}$ to be $\mathcal{P}_{\Gamma(X)^\perp}$ (by this we are stressing that the space $\Gamma(X)^\perp$ is given as $\Gamma(Y)$, where $Y$ is any matrix such that $X$ and $Y$ have complementary supports). Note that $\mathcal{P}_{\Gamma^c(X)} = \mathcal{I} - \mathcal{P}_{\Gamma(X)}$.

For a matrix $M$, $\operatorname{sgn} M$ is defined as the matrix satisfying:

$$(\operatorname{sgn} M)(i,j) = \begin{cases} 1 & M(i,j) > 0 \\ -1 & M(i,j) < 0 \\ 0 & \text{otherwise.} \end{cases}$$
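Over the reals this definition coincides with NumPy's entrywise `np.sign` (the matrix below is an arbitrary illustration):

```python
import numpy as np

M = np.array([[3.0, -0.5, 0.0],
              [0.0,  2.0, -7.0]])
# Entrywise sign, matching the definition of sgn(M) above:
# +1 for positive entries, -1 for negative entries, 0 otherwise.
S = np.sign(M)
```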

## Appendix B Proof of Theorem 1

The proof is based on Chen et al. [2012]. We prove it for . The adjustment for is done using a padding argument, presented at the end of the proof.

1. We let $V_\flat$ denote the set of elements $i$ such that $n_{\langle i \rangle} \leq \ell_\flat$. (We remind the reader that $n_{\langle i \rangle}$ denotes the size of the cluster containing node $i$.)

2. We remind the reader that the projection $\mathcal{P}_\sharp$ is defined as follows:

$$(\mathcal{P}_\sharp M)(i,j) = \begin{cases} M(i,j) & \max\{n_{\langle i \rangle}, n_{\langle j \rangle}\} \geq \ell_\sharp \\ 0 & \text{otherwise.} \end{cases}$$
3. The projection $\mathcal{P}_\flat$ is defined as follows:

$$(\mathcal{P}_\flat M)(i,j) = \begin{cases} M(i,j) & \max\{n_{\langle i \rangle}, n_{\langle j \rangle}\} \leq \ell_\flat \\ 0 & \text{otherwise.} \end{cases}$$

In words, $\mathcal{P}_\flat$ projects onto the set of matrices supported on $V_\flat \times V_\flat$. Note that by the theorem assumption, $\mathcal{P}_\sharp = \mathcal{I} - \mathcal{P}_\flat$ (equivalently, $\mathcal{P}_\sharp$ projects onto the set of matrices supported on $(V_\flat \times V_\flat)^c$).
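Both projections are entrywise masks determined by cluster sizes. In this hedged sketch, the cluster labels and the thresholds $\ell_\sharp$, $\ell_\flat$ are made-up values, chosen so that no mid-size cluster exists; the masks then partition the entries exactly as described above:

```python
import numpy as np

# Hypothetical cluster assignment: node i belongs to cluster labels[i].
labels = np.array([0, 0, 0, 1, 1, 2])       # cluster sizes: 3, 2, 1
sizes = np.bincount(labels)
n_i = sizes[labels]                          # n_<i> = size of i's cluster
ell_sharp, ell_flat = 3, 2                   # illustrative thresholds

# max{n_<i>, n_<j>} over all pairs (i, j).
pair_max = np.maximum.outer(n_i, n_i)
big = pair_max >= ell_sharp                  # support mask for P_sharp
small = pair_max <= ell_flat                 # support mask for P_flat

M = np.ones((6, 6))
P_sharp_M = np.where(big, M, 0.0)
P_flat_M = np.where(small, M, 0.0)

# With no mid-size clusters, the two masks partition all entries,
# i.e. P_sharp + P_flat acts as the identity.
assert np.all(big ^ small)
```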

4. Define the set

$$D = \left\{\Delta \in \mathbb{R}^{n \times n} \,\middle|\, \Delta_{ij} \leq 0\ \forall i \sim j,\ (i,j) \notin V_\flat \times V_\flat;\ \ \Delta_{ij} \geq 0\ \forall i \nsim j,\ (i,j) \notin V_\flat \times V_\flat\right\},$$

which contains all feasible deviations from the target solution.

5. For simplicity we write and .

We will make use of the following:

1. .

2. .

3. and commute with each other.

### b.1 Approximate Dual Certificate Condition

###### Proposition 12.

, ) is the unique optimal solution to (CP) if there exists a matrix and a positive number satisfying:

1. :

2. :

###### Proof.

Consider any feasible solution to (CP1); we know due to the inequality constraints in (CP1). We will show that this solution will have a strictly higher objective value than if .

For this , let be a matrix in satisfying and ; such a matrix always exists because . Suppose . Clearly, and, due to desideratum 1, we have . Therefore, is a subgradient of at