# Clustering Partially Observed Graphs via Convex Optimization

This paper considers the problem of clustering a partially observed unweighted graph---i.e., one where for some node pairs we know there is an edge between them, for some others we know there is no edge, and for the remaining we do not know whether or not there is an edge. We want to organize the nodes into disjoint clusters so that there is relatively dense (observed) connectivity within clusters, and sparse across clusters. We take a novel yet natural approach to this problem, by focusing on finding the clustering that minimizes the number of "disagreements"---i.e., the sum of the number of (observed) missing edges within clusters, and (observed) present edges across clusters. Our algorithm uses convex optimization; its basis is a reduction of disagreement minimization to the problem of recovering an (unknown) low-rank matrix and an (unknown) sparse matrix from their partially observed sum. We evaluate the performance of our algorithm on the classical Planted Partition/Stochastic Block Model. Our main theorem provides sufficient conditions for the success of our algorithm as a function of the minimum cluster size, edge density and observation probability; in particular, the results characterize the tradeoff between the observation probability and the edge density gap. When there are a constant number of clusters of equal size, our results are optimal up to logarithmic factors.


## 1 Introduction

This paper is about the following task: given partial observation of an undirected unweighted graph, partition the nodes into disjoint clusters so that there are dense connections within clusters, and sparse connections across clusters. By partial observation, we mean that for some node pairs we know if there is an edge or not, and for other node pairs we do not know – these pairs are unobserved. This problem arises in several fields across science and engineering. For example, in sponsored search, each cluster is a submarket that represents a specific group of advertisers that do most of their spending on a group of query phrases – see e.g. (Inc, 2009) for such a project at Yahoo. In VLSI and design automation, it is useful in minimizing signaling between components, layout, etc. – see e.g. (Kernighan & Lin, 1970) and references therein. In social networks, clusters represent groups of mutual friends; finding clusters enables better recommendations, link prediction, etc. (Mishra et al., 2007). In the analysis of document databases, clustering the citation graph is often an essential and informative first step (Ester et al., 1995). In this paper, we will focus not on specific application domains, but rather on the basic graph clustering problem itself.

As with any clustering problem, this needs a precise mathematical definition. We are not aware of any existing work with provable performance guarantees for partially observed graphs. Even most existing approaches to clustering fully observed graphs, which we review in section 1.1 below, either require an additional input (e.g. the number of clusters, required for spectral or k-means clustering approaches), or do not guarantee the performance of the clustering. Indeed, the specialization of our results to the fully observed case extends the known guarantees there.

Our Formulation: We focus on a natural formulation, one that does not require any other extraneous input besides the graph itself. It is based on minimizing disagreements, which we now define. Consider any candidate clustering; this will have (a) observed node pairs that are in different clusters, but have an edge between them, and (b) observed node pairs that are in the same cluster, but do not have an edge between them. The total number of node pairs of types (a) and (b) is the number of disagreements between the clustering and the given graph. We focus on the problem of finding the optimal clustering – one that minimizes the number of disagreements. Note that we do not pre-specify the number of clusters. For the special case of fully observed graphs, this formulation is exactly the same as the problem of “Correlation Clustering”, first proposed by (Bansal et al., 2002). They showed that exact minimization of the above objective is NP-complete in the worst case – we survey and compare this and other related work in section 1.1. As we will see, our approach and results are very different.

Our Approach: We aim to achieve the combinatorial disagreement minimization objective using matrix splitting via convex optimization. In particular, as we show in section 2 below, one can represent the adjacency matrix of the given graph as the sum of an unknown low-rank matrix (corresponding to "ideal" clusters) and a sparse matrix (corresponding to disagreements from this "ideal" in the given graph). Our algorithm either returns a clustering, which is guaranteed to be disagreement minimizing, or returns a "failure" – it never returns a sub-optimal clustering. Our analysis provides both deterministic and probabilistic guarantees for when our algorithm succeeds, and uses the special structure of our problem to provide much stronger guarantees than current results on general matrix splitting (Chandrasekaran et al., 2009; Candes et al., 2009).

### 1.1 Related Work

Our problem can be interpreted in the general clustering context as one in which the presence of an edge between two points indicates a "similarity", and the lack of an edge means no similarity. The general field of clustering is of course vast, and a detailed survey of all methods therein is beyond our scope here. We focus instead on the two sets of papers most relevant to the problem here, namely the work on Correlation Clustering, and the other approaches to the specific problem of graph clustering.

Correlation Clustering: First formulated in (Bansal et al., 2002), correlation clustering looks at the following problem: given a complete graph where every edge is labelled "+" or "-", cluster the nodes to minimize the total of the number of "-" edges within clusters and "+" edges across clusters. As mentioned, for a completely observed graph, our problem is mathematically precisely the same as correlation clustering; in particular a "+" in correlation clustering corresponds to an edge in graph clustering, and a "-" to the lack of an edge. Disagreements are defined in the same way. Thus, this paper can equivalently be considered an algorithm, and guarantees, for correlation clustering under partial observations. (Bansal et al., 2002) show that exact minimization is NP-complete, and also provide (a) a constant-factor approximation algorithm for the problem of minimizing the number of disagreements, and (b) a PTAS for maximizing agreements. Their algorithms are combinatorial in nature. Subsequently, there has been much work on devising alternative approximation algorithms for both the weighted and unweighted cases, and for both agreement and disagreement objectives (Emmanuel & Immorlica, 2003; Demaine et al., 2005; Swamy, 2004; Charikar et al., 2003; Emmanuel & Fiat, 2003; Becker, 2005). Approximations based on LP relaxation (Becker, 2005) and SDP relaxation (Swamy, 2004), followed by rounding, have also been developed. We emphasize that while we do convex relaxation as well, we do not do rounding; rather, our convex program itself yields an optimal clustering. We emphasize that ours is the first attempt at correlation clustering with partial observations.

Graph Clustering: The problem of graph clustering is well studied, and a very rich literature on the subject exists (see e.g. (Everitt, 1980; Jain & Dubes, 1988) and references therein). One set of approaches seeks to optimize criteria such as k-median, minimum sum or minimum diameter (Bern & Eppstein, 1996); typically these result in NP-hard problems with few global guarantees. Another option is a top-down hierarchical approach, i.e., recursively bisecting the graph into smaller and smaller clusters. Algorithms in this category differ in the criterion used to determine where to split in each iteration; notable examples of such criteria include small cut (Condon & Karp, 2001), maximal flow (Flake et al., 2004), low conductance (Shi & Malik, 2000), eigenvectors of the Laplacian (aka spectral clustering) (Ng et al., 2002), and many others. Due to the iterative nature of these algorithms, global theoretical guarantees are hard to obtain.

As we mentioned before, we are not aware of any work on graph clustering with partial observations and provable guarantees.

## 2 Main Contributions

Our algorithm is based on convex optimization, and either (a) outputs a clustering that is guaranteed to be the one that minimizes the number of observed disagreements, or (b) declares "failure" – in which case one could potentially try some other approximate method. In particular, it never produces a suboptimal clustering. We now briefly present the main idea, then describe the algorithm, and finally present our main results – analytical characterizations of when the algorithm succeeds.

Setup: We are given a partially observed graph, whose adjacency matrix is A – which has a_{ij} = 1 if there is an edge between nodes i and j, a_{ij} = 0 if there is no edge, and a_{ij} = ? if we do not know. Let Ω_obs be the set of observed entries, i.e. the set of index pairs whose entries are known to be 0 or 1. We want to find the optimal clustering, i.e. the one that has the minimum number of disagreements in Ω_obs.

Idea: Consider first the fully observed case, i.e. every a_{ij} = 0 or 1. Suppose also that the graph is already ideally clustered – i.e. there is a partition of the nodes such that there are no edges between partitions, and each partition is a clique. In this case, the matrix I + A is a low-rank matrix, with the rank equal to the number of clusters. This can be seen by noticing that if we re-ordered the rows and columns so that partitions appear together, the result would be a block-diagonal matrix, with each block being an all-ones sub-matrix – and thus rank one. Of course, this re-ordering does not change the rank of the matrix, and hence I + A is (exactly) low-rank.
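This observation is easy to check numerically; the toy example below (ours, not from the paper) builds the block-diagonal all-ones matrix for clusters of arbitrary sizes and verifies that its rank equals the number of clusters, with or without a re-ordering of rows and columns:

```python
import numpy as np

# An ideally clustered graph with clusters of sizes 4, 3 and 5: the
# matrix I + A is block-diagonal with all-ones blocks, so its rank is
# the number of clusters, and permuting rows/columns does not change it.
sizes = [4, 3, 5]
n = sum(sizes)
K = np.zeros((n, n))
offset = 0
for s in sizes:
    K[offset:offset + s, offset:offset + s] = 1.0
    offset += s

assert np.linalg.matrix_rank(K) == len(sizes)
perm = np.random.default_rng(0).permutation(n)
assert np.linalg.matrix_rank(K[perm][:, perm]) == len(sizes)
```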
Consider now any given graph, still fully observed. In light of the above, we are looking for a decomposition of its I + A into a low-rank part K* (block-diagonal all-ones, one block for each cluster) and a remaining part B* (the disagreements) – such that the number of non-zero entries in B* is as small as possible; i.e. B* is sparse. Finally, the problem we look at is recovery of the best (K*, B*) when we do not observe all entries. The idea is depicted in Figure 1.

Convex Optimization Formulation: We propose to do the matrix splitting using convex optimization, an approach recently taken in (Chandrasekaran et al., 2009; Candes et al., 2009) (however, we establish much stronger results for our special problem). Our approach consists of dropping any additional structural requirements, and just looking for a decomposition of the given as the sum of a sparse matrix and a low-rank matrix . In particular, we use the following convex program

 min_{B,K}  η‖B‖₁ + (1−η)‖K‖_*   s.t.   P_{Ω_obs}(B + K) = P_{Ω_obs}(I + A)    (2)

Here, for any matrix M, the term P_{Ω_obs}(M) keeps all elements of M in Ω_obs unchanged, and sets all other elements to 0; the constraints thus state that the sparse and low-rank matrices should, in sum, be consistent with the observed entries. ‖B‖₁ is the ℓ₁ norm of the entries of the matrix, which is well-known to be a convex surrogate for the number of non-zero entries ‖B‖₀. The second term ‖K‖_* is the nuclear norm: the sum of the singular values of K. This has been shown recently to be the convex surrogate for the rank function (Recht et al., 2009) – in particular, it is the ℓ₁ norm of the singular value vector, while rank is the ℓ₀ norm of the same. Thus our objective function is a convex surrogate for the (natural) combinatorial objective η‖B‖₀ + (1−η) rank(K). (2) is, in fact, a semi-definite program (SDP) (Chandrasekaran et al., 2009).
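The surrogate relationship is easy to verify numerically; this short aside (an illustration of ours, not from the paper) checks that the nuclear norm is the ℓ₁ norm of the singular-value vector and that rank is its ℓ₀ "norm":

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((6, 4))
sv = np.linalg.svd(M, compute_uv=False)

# nuclear norm = l1 norm of the singular-value vector
assert np.isclose(sv.sum(), np.linalg.norm(sv, 1))
# rank = number of non-zero singular values (l0 "norm" of the same vector)
assert np.linalg.matrix_rank(M) == np.count_nonzero(sv > 1e-10)
```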

Definition (Validity): The convex program (2) is said to produce a valid output if the low-rank matrix part K of the optimum corresponds to a graph of disjoint cliques; i.e. its rows and columns can be re-ordered to yield a block-diagonal matrix with an all-ones matrix for each block.

Validity of a given K can easily be checked, either via elementary re-ordering operations, or via a singular value decomposition (an SVD of a valid K will yield singular vectors with disjoint supports, with each vector having all its non-zero entries equal to each other; the supports correspond to the clusters). Our first simple, but crucial, insight is that whenever the convex program (2) yields a valid solution, it is the disagreement minimizer. This is true in spite of the fact that we have clearly dropped several constraints of the original problem (e.g. we do not enforce the entries of K to be between 0 and 1, etc.).

###### Theorem 1

For any η ∈ (0, 1), if the optimum of (2) is valid, then it is the clustering that minimizes the number of observed disagreements.

Algorithm: Our algorithm takes the adjacency matrix A of the network and outputs either the optimal clustering or declares failure. By Theorem 1, if the output is valid, then we are guaranteed that the result is a disagreement-minimizing clustering.

We recommend using the fast implementation developed in (Lin et al., 2009), which is specially tailored to matrix splitting. Setting the parameter η can be done via a simple line search from 0 to 1, a binary search, or any other method. Whenever it results in a valid K, we have found the optimal clustering.
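A schematic numpy solver for the splitting step is sketched below. It is a simplified augmented-Lagrangian iteration with singular-value thresholding for K and soft thresholding for B – a stand-in in the spirit of (Lin et al., 2009), not their exact implementation; the default η and the stopping rule are illustrative choices of ours:

```python
import numpy as np

def split_sparse_lowrank(M, observed, eta=0.5, max_iter=300, tol=1e-6):
    """Approximately solve
        min eta*||B||_1 + (1-eta)*||K||_*  s.t.  P_obs(B+K) = P_obs(M)
    by an augmented-Lagrangian iteration. Off the observed set, B simply
    absorbs the residual, so only its observed entries are meaningful."""
    Y = np.zeros_like(M, dtype=float)   # scaled dual variable
    B = np.zeros_like(M, dtype=float)
    mu = 1.25 / max(np.linalg.norm(M, 2), 1e-12)
    rho = 1.5
    for _ in range(max_iter):
        # K-step: singular-value thresholding at level (1-eta)/mu
        U, s, Vt = np.linalg.svd(M - B + Y / mu, full_matrices=False)
        K = (U * np.maximum(s - (1 - eta) / mu, 0.0)) @ Vt
        # B-step: soft thresholding at level eta/mu on observed entries
        T = M - K + Y / mu
        B = np.where(observed,
                     np.sign(T) * np.maximum(np.abs(T) - eta / mu, 0.0), T)
        Z = M - K - B                   # constraint residual
        Y = Y + mu * Z
        mu *= rho
        if np.linalg.norm(Z[observed]) <= tol * max(np.linalg.norm(M), 1.0):
            break
    return B, K
```

A line search over η would simply call this routine for several values and keep any η whose returned K passes the validity check.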

Analysis: The main analytical contribution of this paper is conditions under which the above algorithm will find the clustering that minimizes the number of disagreements among the observed entries. We provide both deterministic/worst-case guarantees, and average-case guarantees under a natural randomness assumption. Let K* be the low-rank matrix corresponding to the optimal clustering (as described above). Let B* be the matrix of observed disagreements for this clustering. Note that the support of B* is contained in Ω_obs. Let K_min be the size of the smallest cluster in K*.

Deterministic guarantees: We first provide deterministic conditions under which (2) will find (K*, B*). For any node i, let c(i) be the cluster in K* that node i belongs to. For any cluster c other than c(i), define

 d_{i,c} = |{ j ∈ c : a_{ij} = ? or a_{ij} = 1 }|

For the case c = c(i), define

 d_{i,c} = |{ j ∈ c : a_{ij} = ? or a_{ij} = 0 }|

In words, for both cases, d_{i,c} is the total number of disagreements and unobserved entries between node i and cluster c. We now define a quantity D_max as follows:

 D_max = max_{i,c}  d_{i,c} / min{ |c|, |c(i)| }

Essentially, D_max is the largest fraction of "bad" entries (i.e. disagreements or unobserved) between a node and a cluster. Thus for the same D_max, a node is allowed to have more bad entries with respect to a larger cluster, but must have fewer with respect to a smaller cluster. It is intuitively clear that a large D_max will cause problems, as a node will have so many disagreements (relative to the corresponding cluster size) that it will be impossible to resolve. We now state our main theorem for the deterministic case.
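The quantities d_{i,c} and D_max can be computed directly from the data. The sketch below is ours: it assumes the graph is represented as a 0/1 array A plus a boolean "known" mask for the observed pairs, and it excludes j = i when scanning a node's own cluster:

```python
import numpy as np

def d_bad(A, known, labels, i, c):
    """Disagreements plus unobserved entries between node i and cluster c:
    within i's own cluster a "bad" known entry is a missing edge (0);
    toward any other cluster it is a present edge (1)."""
    members = np.flatnonzero(labels == c)
    members = members[members != i]
    bad_value = 0 if labels[i] == c else 1
    a = A[i, members]
    k = known[i, members]
    return int(np.sum(~k | (k & (a == bad_value))))

def D_max(A, known, labels):
    clusters, counts = np.unique(labels, return_counts=True)
    size = dict(zip(clusters.tolist(), counts.tolist()))
    best = 0.0
    for i in range(len(labels)):
        for c in clusters:
            denom = min(size[c], size[labels[i]])
            best = max(best, d_bad(A, known, labels, i, c) / denom)
    return best
```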

###### Theorem 2

If D_max ≤ 1/4, then the optimal clustering (K*, B*) is the unique solution of (2) for any

 η ∈ ( 1/(1 + 1/(2K_min)),  1 − K_min/((1 + (3/4)·n·D_max)·K_min − 1) ).

Remarks on Theorem 2: Essentially, Theorem 2 allows the number of disagreements and unobserved entries at a node to be as large as a third of the number of its "good" edges (i.e. edges to its own cluster in the optimal clustering). This means there can be a lot of evidence "against" the optimal clustering, and missing evidence, making it that much harder to find. Theorem 2 allows a node to have many disagreements and unobserved edges overall; it just requires these to be distributed in proportion to the cluster sizes.
In many applications, the size of the typical cluster may be much smaller than the size of the graph. Theorem 2 implies that the smallest cluster must satisfy K_min = Ω(√n) for any non-trivial problem (i.e. one where every cluster has at least one node with at least one disagreement or unobserved edge). Our method can thus handle as many as O(√n) clusters; this can be compared to existing approaches to graph clustering, which often partition nodes into two or a constant number of clusters. The guarantees of this theorem are stronger than what would result from a direct application of the results in (Chandrasekaran et al., 2009).

Probabilistic Guarantees: We now provide much stronger guarantees for the case where both the locations of the observations, and the locations of the observed disagreements, are drawn uniformly at random. Specifically, consider a graph that is generated as follows: start with an initial "ideally clustered" graph with no disagreements – i.e. each cluster is completely connected (a full clique), and different clusters are completely disconnected (no edges between them). Then, for some τ and for each of the n(n−1)/2 possible node pairs, flip the entry in this location with probability τ – from 0 to 1 or from 1 to 0, as the case may be – thus causing it to be a disagreement. There are thus, on average, τn(n−1)/2 disagreements in the resulting graph; the actual number is close to this with high probability, by standard concentration arguments. Further, this graph is observed at locations chosen uniformly at random. Specifically, for each node pair there is a probability p₀ that the pair is in Ω_obs, and this choice is made independently of any other node pair, and of the graph. Note that it may now be possible that the optimal clustering is not the original ideal clustering we started with; the following theorem says that we will still find the optimal clustering with high probability.
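The generative model just described is straightforward to sample from; the sketch below is ours (function name and interface are illustrative):

```python
import numpy as np

def planted_partial_graph(sizes, tau, p0, seed=0):
    """Sample from the model above: an ideal clustering with the given
    cluster sizes, each node pair flipped independently with probability
    tau, then observed independently with probability p0.
    Returns (A, observed): A is a symmetric 0/1 adjacency with zero
    diagonal; observed is a boolean mask over node pairs."""
    rng = np.random.default_rng(seed)
    n = sum(sizes)
    labels = np.repeat(np.arange(len(sizes)), sizes)
    ideal = (labels[:, None] == labels[None, :]).astype(int)
    np.fill_diagonal(ideal, 0)
    # draw symmetric Bernoulli masks by sampling the upper triangle
    flips = np.triu(rng.random((n, n)) < tau, 1)
    flips = flips | flips.T
    observed = np.triu(rng.random((n, n)) < p0, 1)
    observed = observed | observed.T
    A = np.where(flips, 1 - ideal, ideal)
    np.fill_diagonal(A, 0)
    return A, observed
```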

###### Theorem 3

For any constant β > 0, there exist constants C_d and C_k such that, with probability at least 1 − n^{−β}, the optimal clustering (K*, B*) is the unique solution of (2) for a suitable choice of η, provided that

 τ ≤ C_d   and   K_min ≥ C_k √n (log n)⁴ / p₀.

Remarks on Theorem 3: This shows that our algorithm succeeds on the overwhelming majority of instances in which as large as a constant fraction of all observations are disagreements. In particular, the number of disagreements can be an order of magnitude larger than the number of "good" edges (i.e. those that agree with the clustering). This remains true even if we observe a vanishingly small fraction of the total number of node pairs – p₀ above is allowed to be a function of n. A smaller p₀, however, requires K_min to be correspondingly larger. The reason underlying these stronger results is that bounded matrices with random supports are spectrally very diffuse, and thus find it hard to "hide" a clique, which is highly structured. The proof of this theorem goes beyond the probabilistic results of (Candes et al., 2009); in particular, we allow p₀ to be a vanishing function of n.

## 3 Proof of Theorem 1

In this section, we prove Theorem 1; in particular, that if (2) produces a valid low-rank matrix, i.e. one that corresponds to a clustering of the nodes, then this is the disagreement minimizing clustering.

Consider the following non-convex optimization problem

 min_{B,K}  η‖B‖₁ + (1−η)‖K‖_*   s.t.   P_{Ω_obs}(B + K) = P_{Ω_obs}(I + A),   K is valid    (3)

and let (B, K) be any feasible solution. Since K represents a valid clustering, it is positive semidefinite and has all ones along its diagonal. Therefore, any valid K obeys ‖K‖_* = trace(K) = n. On the other hand, because both K and I + A are 0/1 matrices with unit diagonal, the entries of B must be equal to 0, +1 or −1 (i.e. it is a disagreement matrix). Hence ‖B‖₁ = ‖B‖₀ when K is valid. We thus conclude that the above optimization problem is equivalent to minimizing ‖B‖₀ s.t. the constraints in (3) hold. This is exactly the minimization of the number of disagreements on the observed edges. Now notice that (2) is a relaxed version of (3). Therefore, if the optimum of (2) is valid – and hence feasible for (3) – then it is also optimal to (3).
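Both facts used in this argument are easy to verify numerically on a toy example (ours, not from the paper): a valid K has nuclear norm equal to its trace n, and B = (I + A) − K has entries in {−1, 0, +1}, so its ℓ₁ norm counts disagreements exactly (each one twice, by symmetry):

```python
import numpy as np

n = 5
K = np.zeros((n, n))
K[:3, :3] = 1.0          # cluster {0, 1, 2}
K[3:, 3:] = 1.0          # cluster {3, 4}
A = K - np.eye(n)        # ideal adjacency for this clustering
A[0, 3] = A[3, 0] = 1.0  # disagreement: an edge across clusters
A[1, 2] = A[2, 1] = 0.0  # disagreement: a missing edge within a cluster
B = (np.eye(n) + A) - K

# valid K: nuclear norm equals trace(K) = n
assert np.isclose(np.linalg.svd(K, compute_uv=False).sum(), n)
# B is a disagreement matrix with entries in {-1, 0, +1}
assert set(np.unique(B)) <= {-1.0, 0.0, 1.0}
assert np.abs(B).sum() == 4.0   # two disagreements, each counted twice
```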

## 4 Proof Outline for Theorems 2 and 3

We now give an overview of the main steps in the proofs of Theorems 2 and 3; the following sections provide details. Recall that we would like to show that the pair (B*, K*) corresponding to the optimal clustering is the unique optimum of our convex program (2). This involves the following steps:

Step 1: Write down sub-gradient based first-order sufficient conditions that need to be satisfied for (B*, K*) to be the unique optimum of (2). In our case, this involves showing the existence of a matrix Q – the dual certificate – that satisfies certain properties. This step is technically involved – requiring us to delve into the intricacies of sub-gradients since our convex function is not smooth – but otherwise standard. Luckily for us, this has been done by (Chandrasekaran et al., 2009; Candes et al., 2009).

Step 2: Using the assumptions made on the optimal clustering K* and its disagreements B*, construct a candidate dual certificate Q that meets the requirements – and thus certifies (B*, K*) as the unique optimum. This is where the "art" of the proof lies: different assumptions (e.g. we look at deterministic and random ones) and different ways of constructing Q result in different performance guarantees.

The crucial second step is where we go beyond the existing literature on matrix splitting (Chandrasekaran et al., 2009; Candes et al., 2009). In particular, our sparse and low-rank matrices have a lot of additional structure, and we use some of this in new ways to generate dual certificates. This leads to much more powerful performance guarantees than those that could be obtained via a direct application of existing sparse and low-rank matrix splitting results.

### 4.1 Preliminaries

We now provide several definitions that will be useful in stating the first-order sufficient conditions and constructing the dual certificate.

Definitions related to K*: By symmetry, the SVD of K* is of the form UΣUᵀ. We define T, the sub-space spanned by all matrices that share either the same column space or the same row space as K*:

 T = { UXᵀ + YUᵀ : X, Y ∈ ℝ^{n×p} }.

For any matrix M, its orthogonal projection onto the space T is given by

 P_T(M) = UUᵀM + MUUᵀ − UUᵀMUUᵀ.

We also define the projection onto T⊥, the orthogonal complement of T:

 P_{T⊥}(M) = M − P_T(M).

Definitions related to B*: For any matrix M, define its support set as supp(M) = {(i, j) : m_{ij} ≠ 0}. Let Ω be the space of matrices whose support set is a subset of the support set of B*, i.e.,

 Ω = { B ∈ ℝ^{n×n} : supp(B) ⊆ supp(B*) }.

Let P_Ω(M) be the projection of the matrix M onto the space Ω, i.e., P_Ω(M) is obtained from M by setting all entries outside supp(B*) to zero. Let Ω⊥ be the orthogonal space to Ω – the space of all matrices whose entries in supp(B*) are zero; the projection P_{Ω⊥} is defined accordingly. Finally, let sgn(B*) be the matrix whose entries are +1 for every positive entry in B*, −1 for every negative entry, and 0 for all zero entries.

Definitions related to partial observations: Let Γ be the space of matrices whose support is contained in the set of observed entries, and Γ∖Ω the space of matrices whose support lies within the observed entries but outside the set of disagreements. Accordingly, define the projections P_Γ, P_{Γ⊥} and P_{Γ∖Ω}, in analogy with P_Ω and P_{Ω⊥}.

Norms: Several matrix norms are used in what follows. ‖M‖ represents the spectral norm of the matrix M. ‖M‖_* is the nuclear norm, equal to the sum of the singular values of M. ‖M‖₁ is the sum of the absolute values of all entries of M, and ‖M‖_∞ is the element-wise maximum magnitude of M. We also use ‖M‖_F to denote the Frobenius norm.
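All of these norms are available through standard numerical routines; a concrete 2×2 example (ours) for reference:

```python
import numpy as np

M = np.array([[3.0, -1.0],
              [0.0,  2.0]])
spectral = np.linalg.norm(M, 2)                     # largest singular value
nuclear = np.linalg.svd(M, compute_uv=False).sum()  # sum of singular values
l1 = np.abs(M).sum()                                # entrywise l1 norm
linf = np.abs(M).max()                              # entrywise max magnitude
fro = np.linalg.norm(M, 'fro')                      # Frobenius norm

assert l1 == 6.0 and linf == 3.0
assert np.isclose(fro, np.sqrt(14.0))
# for a 2x2 matrix, (s1 + s2)^2 = ||M||_F^2 + 2|det M|
assert np.isclose(nuclear, np.sqrt(26.0))
```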

## 5 Worst Case Analysis

In this section, we prove our worst-case guarantee stated in Theorem 2. We first state the deterministic first-order conditions required for B* and K* to be the unique optimum of our convex program (2). The proof of the following lemma can be found in (Chandrasekaran et al., 2009).

###### Lemma 1 (Deterministic Sufficient Optimality)

B* and K* are the unique solutions to (2) provided that Ω ∩ T = {0} and there exists a matrix Q satisfying certain equality conditions (fixing the projections of Q onto Ω and T) and inequality conditions (bounding the projections of Q onto Ω⊥ and T⊥).

The first condition, Ω ∩ T = {0}, is satisfied under the assumption of the theorem; see supplementary materials for details. Next we need to construct a suitable dual certificate Q satisfying the remaining conditions. We use the alternating projection method (see (Candes & Recht, 2009)) to construct Q. The novelty of our analysis is that, by taking advantage of the rich structure of the matrices K* and B* – symmetry, block-diagonal support, etc. – we extend the existing guarantees (Chandrasekaran et al., 2009; Candes et al., 2009) to a much larger class of matrices.

Dual Certificate Construction: For M = sgn(B*) and N = UUᵀ, consider the infinite alternating-projection sums

 S_M = M − P_T(M) + P_{Γ⊥}P_T(M) − P_T P_{Γ⊥}P_T(M) + ⋯
 V_N = N − P_{Γ⊥}(N) + P_T P_{Γ⊥}(N) − P_{Γ⊥}P_T P_{Γ⊥}(N) + ⋯

Provided that these two sums converge, let

 Q = (1−η) V_{UUᵀ} + η S_{sgn(B*)}.

It is easy to check that the equality conditions in Lemma 1 are satisfied. It remains to show that (i) the sums converge, and (ii) the inequality conditions in Lemma 1 are satisfied. The proof requires suitable bounds on the norms of the relevant projections, which crucially depend on the assumptions imposed on K* and B*; see supplementary materials. Combining the above discussion establishes the theorem.

## 6 Average Case Analysis

In this section, we prove our average-case guarantee in Theorem 3. We first state the probabilistic first-order conditions required for B* and K* to be the unique optimum of our convex program (2) with high probability. Here and in the sequel, by "with high probability" we mean with probability at least 1 − cn^{−10} for some constant c > 0. The proof of the following lemma can be found in (Candes et al., 2009).

###### Lemma 2 (Probabilistic Sufficient Optimality)

Under the assumptions of Theorem 3, B* and K* are the unique solutions to (2) with high probability provided that there exist matrices W^B and W^K such that

 (S1) ‖P_T(W^B)‖_F ≤ 1/(2n²)
 (L1) ‖P_{T⊥}(W^K)‖ < 1/4
 (L2) ‖P_T(W^K) − UUᵀ‖_F ≤ 1/(2n²)
 (L3) P_{Γ⊥}(W^K) = 0
 (L4) ‖P_Γ(W^K)‖_∞ < (1/4)·η/(1−η)

In the sequel, we construct W^B and W^K explicitly.

Dual Certificate Construction: We use the so-called Golfing Scheme (Candes et al., 2009; Gross, 2009) to construct W^B and W^K. Our application of the Golfing Scheme differs from (Candes et al., 2009), and the proof utilizes additional structure in our problem, which leads to stronger guarantees. In particular, we go beyond existing results by allowing the fraction of observed entries to be vanishing.

Now for the details. With slight abuse of notation, we use Ω, Γ and T to denote both the spaces of matrices and the sets of indices on which these matrices are supported. By definition, Γ (as a set of indices) contains each entry index with probability p₀. Observe that Γ may be considered to be generated by Γ = Γ₁ ∪ ⋯ ∪ Γ_{k₀}, where each Γ_k contains each entry index with probability q independently of all others, with q and k₀ suitably chosen. For k = 1, …, k₀, define the operator R_{Γ_k} by

 R_{Γ_k}(M) = Σ_{i=1}^{n} m_{ii} e_i e_iᵀ + q⁻¹ Σ_{1≤i<j≤n} δ_{ij} m_{ij} (e_i e_jᵀ + e_j e_iᵀ),

where δ_{ij} = 1 if (i, j) ∈ Γ_k and 0 otherwise, and e_i is the i-th standard basis vector – i.e., the column vector with a 1 in its i-th entry and 0 elsewhere. W^B and W^K are defined as

 W^B = W^B_{k₀} + (η/(1−η))·sgn(B*),   W^K = W^K_{k₀},

where the sequence (W^B_k, W^K_k) is defined recursively by setting W^B₀ = W^K₀ = 0 and, for all k ≥ 1,

 W^B_k = W^B_{k−1} − R_{Γ_k} P_T( (η/(1−η))·P_T(sgn(B*)) + W^B_{k−1} )
 W^K_k = W^K_{k−1} + R_{Γ_k} P_T( UUᵀ − W^K_{k−1} ).

It is straightforward to verify that the equality constraints in Lemma 2 are satisfied. Moreover, W^K satisfies the inequality constraints; the proof is nearly identical to that in section 7.3 of (Candes et al., 2009). It remains to prove that W^B also satisfies the corresponding inequalities in Lemma 2. As in the worst-case analysis, the proof involves upper-bounding the norms of matrices after certain (random) linear transformations. These bounds are again proven using the assumptions imposed on τ, p₀ and K_min.
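As we read the definition above, the operator R_{Γ_k} keeps the diagonal of its argument, zeroes out the off-diagonal pairs outside Γ_k, and rescales the sampled pairs by 1/q, making it unbiased for symmetric matrices. A small sketch (ours, with an illustrative interface):

```python
import numpy as np

def R_gamma(M, gamma, q):
    """Sampling operator R_{Gamma_k}: keep the diagonal of M, zero the
    off-diagonal pairs outside gamma, and rescale the sampled pairs by
    1/q so that E[R(M)] = M for symmetric M when each pair lies in
    gamma independently with probability q."""
    out = np.diag(np.diag(M)).astype(float)
    mask = np.triu(gamma, 1)        # treat (i, j) and (j, i) together
    for i, j in zip(*np.nonzero(mask)):
        out[i, j] = M[i, j] / q
        out[j, i] = M[j, i] / q
    return out
```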

## 7 Experimental Results

We explore the performance of our algorithm as various graph parameters of interest are varied, via simulation. We see that the performance matches the theory well.

We first verify our deterministic guarantees for fully observed graphs and consider two cases: (1) all clusters have the same size K_min, and the number of disagreements involving each node is fixed at the same value across all nodes; (2) the number of disagreements per node is again fixed, but clusters may have different sizes, no smaller than K_min. For each parameter pair, a graph is picked randomly from all graphs with the desired property, and we use our algorithm to find B* and K*. The optimization problem (2) is solved using the fast algorithm in (Lin et al., 2009), with η set via line search. We check whether the solution is a valid clustering and is equal to the underlying ideal cluster structure. The experiment is repeated multiple times and we plot the probability of success in Fig. 2 and 3. One can see that the margin in the number of disagreements is higher in the second case, as these graphs typically have larger clusters than in the first case.

We next consider partially observed graphs. A test case is constructed by generating an n-node graph with equal cluster size K_min, and then placing a disagreement on each (potential) node pair with probability τ, independently of all others. Each pair is observed with probability p₀. In the first set of experiments, we fix τ and vary the remaining parameters; the probability of success is plotted in Fig. 4. The second set of experiments has p₀ fixed and τ varying, with results plotted in Fig. 5. One can see that our algorithm succeeds even for small p₀, with the average number of disagreements per node on the same order as the cluster size. We expect that the fraction of observed entries can be even smaller for larger networks, where the concentration effect is more significant.

## References

• Bansal et al. (2002) Bansal, N., Blum, A., and Chawla, S. Correlation clustering. In Proceedings of the 43rd Symposium on Foundations of Computer Science, 2002.
• Becker (2005) Becker, H. A survey of correlation clustering. Available online at http://www1.cs.columbia.edu/ hila/clustering.pdf, 2005.
• Bern & Eppstein (1996) Bern, M. and Eppstein, D. Approximation Algorithms for Geometric Problems. In Approximation Algorithms for NP-Hard Problems, edited by D. S. Hochbaum, Boston: PWS Publishing Company, 1996.
• Candes & Recht (2009) Candes, E. and Recht, B. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 2009.
• Candes et al. (2009) Candes, E., Li, X., Ma, Y., and Wright, J. Robust principal component analysis? Technical report, Stanford University, CA, 2009.
• Chandrasekaran et al. (2009) Chandrasekaran, V., Sanghavi, S., Parrilo, S., and Willsky, A. Rank-sparsity incoherence for matrix decomposition. Available on arXiv:0906.2220v1, 2009.
• Charikar et al. (2003) Charikar, M., Guruswami, V., and Wirth, A. Clustering with qualitative information. In Proceedings of the 44th Annual IEEE Symposium on Foundations of Computer Science, 2003.
• Condon & Karp (2001) Condon, A. and Karp, R.M. Algorithms for graph partitioning on the planted partition model. Random Structures & Algorithms, 2001.
• Demaine et al. (2005) Demaine, E. D., Immorlica, N., Emmanuel, D., and Fiat, A. Correlation clustering in general weighted graphs. SIAM special issue on approximation and online algorithms, 2005.
• Emmanuel & Fiat (2003) Emmanuel, D. and Fiat, A. Correlation clustering minimizing disagreements on arbitrary weighted graphs. In Proceedings of the 11th Annual European Symposium on Algorithms, 2003.
• Emmanuel & Immorlica (2003) Emmanuel, D. and Immorlica, N. Correlation clustering with partial information. In Proceedings of the 6th International Workshop on Approximation Algorithms for Combinatorial Optimization Problems, 2003.
• Ester et al. (1995) Ester, M., Kriegel, H., and Xu, X. A database interface for clustering in large spatial databases. In Proceedings of KDD, 1995.
• Everitt (1980) Everitt, B. Cluster Analysis. New York: Halsted Press, 1980.
• Flake et al. (2004) Flake, G.W., Tarjan, R.E., and Tsioutsiouliklis, K. Graph clustering and minimum cut trees. Internet Mathematics, 2004.
• Gross (2009) Gross, D. Recovering low-rank matrices from few coefficients in any basis. Available on arXiv:0910.1879v4, 2009.
• Yahoo! Inc. (2009) Yahoo! Inc. Graph partitioning. Available at http://research.yahoo.com/project/2368, 2009.
• Jain & Dubes (1988) Jain, A. K. and Dubes, R. C. Algorithms for Clustering Data. Englewood Cliffs, NJ: Prentice-Hall, 1988.
• Kernighan & Lin (1970) Kernighan, B. W. and Lin, S. An efficient heuristic procedure for partitioning graphs. Bell System Technical Journal, 1970.
• Lin et al. (2009) Lin, Z., Chen, M., Wu, L., and Ma, Y. The Augmented Lagrange Multiplier Method for Exact Recovery of Corrupted Low-Rank Matrices. UIUC Technical Report UILU-ENG-09-2215, 2009.
• Mishra et al. (2007) Mishra, N., Schreiber, R., Stanton, I., and Tarjan, R. E. Clustering social networks. In Algorithms and Models for the Web-Graph, Springer, 2007.
• Ng et al. (2002) Ng, A.Y., Jordan, M.I., and Weiss, Y. On spectral clustering: Analysis and an algorithm. In NIPS, 2002.
• Recht (2009) Recht, B. A Simpler Approach to Matrix Completion. Arxiv preprint arXiv:0910.0651, 2009.
• Recht et al. (2009) Recht, B., Fazel, M., and Parrilo, P. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. Available on arXiv:0706.4138v1, 2009.
• Shi & Malik (2000) Shi, J. and Malik, J. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000.
• Swamy (2004) Swamy, C. Correlation clustering: maximizing agreements via semidefinite programming. In Proceedings of the 15th Annual ACM-SIAM Symposium on Discrete Algorithms, 2004.
• Tropp (2010) Tropp, J.A. User-friendly tail bounds for sums of random matrices. Arxiv preprint arXiv:1004.4389, 2010.
• Vershynin (2007) Vershynin, R. Math 280 lecture notes. Available at http://www-stat.stanford.edu/~dneedell/280, 2007.

## Appendix A Overview

In this supplementary material, we provide details for the proofs of our worst-case and average-case guarantees, which are repeated below for completeness. Recall that we propose to solve the following program:

$$\min_{B,K}\;\; \eta\,\|B\|_1 + (1-\eta)\,\|K\|_* \quad \text{s.t.}\quad \mathcal{P}_{\Omega_{\mathrm{obs}}}(B+K) = \mathcal{P}_{\Omega_{\mathrm{obs}}}(I+A),$$

and the guarantees are given in the next two theorems.

###### Theorem 4

If $3\left(1-\frac{1}{K_1}\right)D_{\max}+\frac{1}{K_p^2}\le\frac{1}{4}$, then the optimal clustering is the unique solution of (2) for any

$$\eta\in\left(\frac{1}{1+\frac{1}{2}K_{\min}},\;\; 1-\frac{K_{\min}}{\left(1+\frac{3}{4nD_{\max}}\right)K_{\min}-1}\right).$$

###### Theorem 5

For any constant $c>0$, there exist constants $C_d$ and $C_k$ such that, with probability at least $1-n^{-c}$, the optimal clustering is the unique solution of (2) with an appropriate choice of $\eta$, provided that

$$\tau\le C_d\quad\text{and}\quad K_{\min}\ge C_k\sqrt{\frac{n(\log n)^4}{p_0}}.$$

Definitions related to $K^*$: For the purpose of analysis only, and without loss of generality, by appropriately permuting rows and columns, $K^*$ can be assumed to be of the block-diagonal form

$$K^*=\begin{pmatrix}K^*_1 & & & \\ & K^*_2 & & \\ & & \ddots & \\ & & & K^*_p\end{pmatrix},$$

where each $K^*_i$ is a $K_i\times K_i$ all-ones matrix, and $\sum_{i=1}^{p}K_i=n$, where $p$ is the number of clusters. All other entries of the matrix $K^*$, i.e., those outside these all-ones blocks, are zero. We will assume this in all that follows, since all our arguments remain the same if rows and columns are permuted in the same way.

It is easy to show that the SVD of $K^*$ is of the form $K^*=U\Sigma U^\top$, where $\Sigma=\operatorname{diag}(K_1,\dots,K_p)$ and $U=[u_1,\dots,u_p]$, where, for all $i$, the column $u_i$ has the following form:

$$u_i=\frac{1}{\sqrt{K_i}}\begin{pmatrix}\mathbf{0}_{\sum_{j=1}^{i-1}K_j}\\[2pt] \mathbf{1}_{K_i}\\[2pt] \mathbf{0}_{\,n-\sum_{j=1}^{i}K_j}\end{pmatrix}_{n\times 1}.$$

That is, $u_i$ is non-zero only in those rows that correspond to nodes in cluster $i$; these non-zero entries are all equal to $1/\sqrt{K_i}$.
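The block structure of $K^*$ and its singular vectors can be checked numerically; the cluster sizes below are an arbitrary example of ours:

```python
import numpy as np

# Block-diagonal K* for cluster sizes K_1, ..., K_p, together with the
# singular vectors u_i (nonzero, equal to 1/sqrt(K_i), exactly on cluster i).
sizes = [3, 2, 4]                  # illustrative cluster sizes
n, p = sum(sizes), len(sizes)
K_star = np.zeros((n, n))
U = np.zeros((n, p))
start = 0
for i, Ki in enumerate(sizes):
    K_star[start:start + Ki, start:start + Ki] = 1.0   # all-ones block
    U[start:start + Ki, i] = 1.0 / np.sqrt(Ki)         # i-th singular vector
    start += Ki
Sigma = np.diag(sizes).astype(float)                   # singular values are K_i
```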

We now define a subspace $T$ of the set of all $n\times n$ matrices, consisting of those matrices that share either the same column space or the same row space as $K^*$:

$$T=\left\{UX^\top+YU^\top : X,Y\in\mathbb{R}^{n\times p}\right\}.$$

Now, for an arbitrary matrix $M$, we can define its orthogonal projection onto the space $T$ as follows:

$$\mathcal{P}_T(M)=UU^\top M+MUU^\top-UU^\top MUU^\top.$$

Note that $\mathcal{P}_T(M)$ is also an $n\times n$ matrix. We will also be interested in projections onto $T^\perp$, the orthogonal complement of $T$, i.e., the set of all matrices that have zero inner product with every matrix in $T$. The projection of any matrix $M$ onto $T^\perp$ is as follows:

$$\mathcal{P}_{T^\perp}(M)=M-\mathcal{P}_T(M).$$
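A small numpy sanity check (ours) confirms that the formula above is indeed an orthogonal projection: it is idempotent, it fixes elements of $T$, and its residual is orthogonal to the projected part:

```python
import numpy as np

def P_T(M, U):
    """Orthogonal projection onto T = {U X^T + Y U^T}, following the
    formula in the text; U is assumed to have orthonormal columns."""
    P = U @ U.T
    return P @ M + M @ P - P @ M @ P

rng = np.random.default_rng(1)
U, _ = np.linalg.qr(rng.standard_normal((8, 3)))   # any orthonormal basis
M = rng.standard_normal((8, 8))
PTM = P_T(M, U)                                    # projection of M onto T
residual = M - PTM                                 # projection onto T-perp
```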

Definitions related to $B^*$: The symmetric matrix $B^*$ represents the disagreements between the given graph and $K^*$. We reorder the rows and columns of this matrix as well, in a way that is consistent with the reordering of $K^*$ described above. Thus the first $K_1$ rows (and columns) of $B^*$ correspond to nodes in cluster 1, the next $K_2$ to nodes in cluster 2, and so on. Thus we have that

$$B^*=\begin{pmatrix}B^*_{1,1}&B^*_{1,2}&\cdots&B^*_{1,p}\\ B^{*\top}_{1,2}&B^*_{2,2}&\cdots&B^*_{2,p}\\ \vdots&\vdots&\ddots&\vdots\\ B^{*\top}_{1,p}&B^{*\top}_{2,p}&\cdots&B^*_{p,p}\end{pmatrix}.$$

Now, for any two clusters $i$ and $j$, the entries of $B^*_{i,i}$ are either $-1$, corresponding to the missing edges inside cluster $i$, or 0; the entries of $B^*_{i,j}$ for $i\ne j$ are either $+1$, corresponding to the edges between clusters $i$ and $j$, or 0.

For any matrix $M$, define $\operatorname{supp}(M)$ to be the support set of $M$, i.e., the set of indices of its non-zero entries. Let $\Omega$ be the space of matrices whose support set is a subset of the support set of $B^*$, i.e.,

$$\Omega=\left\{B\in\mathbb{R}^{n\times n} : \operatorname{supp}(B)\subseteq\operatorname{supp}(B^*)\right\}.$$

Let $\mathcal{P}_\Omega(N)$ be the projection of the matrix $N$ onto the space $\Omega$, i.e.,

$$\left(\mathcal{P}_\Omega(N)\right)_{i,j}=\begin{cases}N_{i,j} & (i,j)\in\operatorname{supp}(B^*),\\ 0 & \text{otherwise}.\end{cases}$$

In words, $\mathcal{P}_\Omega(N)$ is obtained from $N$ by setting all entries outside the set $\operatorname{supp}(B^*)$ to zero.

Let $\Omega^\perp$ be the orthogonal complement of $\Omega$; it is the space of all matrices whose entries in the set $\operatorname{supp}(B^*)$ are zero. The projection onto $\Omega^\perp$ is as follows:

$$\mathcal{P}_{\Omega^\perp}(N)=N-\mathcal{P}_\Omega(N).$$

Now, $\mathcal{P}_{\Omega^\perp}(N)$ is obtained from $N$ by setting all entries in the set $\operatorname{supp}(B^*)$ to zero.
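Both projections are simple entrywise masks, as the following sketch (ours, with a random stand-in for $\operatorname{supp}(B^*)$) illustrates:

```python
import numpy as np

def P_Omega(N, support):
    """Project onto matrices supported on `support` (a boolean mask):
    zero out every entry outside the support."""
    return np.where(support, N, 0.0)

rng = np.random.default_rng(2)
N = rng.standard_normal((5, 5))
support = rng.random((5, 5)) < 0.3     # stand-in for supp(B*)
P_N = P_Omega(N, support)              # keeps only supported entries
P_perp_N = N - P_N                     # projection onto the complement
```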

Extra definitions for the partially observed case: Let $\Omega_{\mathrm{obs}}$ be the space of matrices with support contained in the set of observed entries, and

$$\Gamma=\Omega^\perp\cap\Omega_{\mathrm{obs}}$$

is the space of matrices with support within the set of observed entries but outside the set of disagreements. Accordingly, define $\mathcal{P}_{\Omega_{\mathrm{obs}}}$, $\mathcal{P}_\Gamma$, and $\mathcal{P}_{\Gamma^\perp}$, similarly to the definitions of $\mathcal{P}_\Omega$ and $\mathcal{P}_{\Omega^\perp}$.

In our probabilistic analysis, we assume that each of the possible disagreement edges is present with probability $\tau$, independently of all others. Accordingly, for each present disagreement, we set the corresponding entries of $B^*$ to $+1$ if the edge is between two clusters and $-1$ otherwise. Similarly, we assume that each edge is observed with probability $p_0$. Due to the symmetric structure of the matrices, observing an edge is equivalent to observing two (equal) entries of the matrix.

Norms: We now define the several matrix norms used in what follows. We use $\|M\|$ to represent the spectral norm of the matrix $M$; $\|M\|_*$ is the nuclear norm of $M$, equal to the sum of its singular values. With a slight abuse of notation, we extend the vector $\ell_1$ and $\ell_\infty$ norms to matrices: we define $\|M\|_1$ to be the sum of the absolute values of all entries of $M$, and $\|M\|_\infty$ to be the element-wise maximum magnitude of $M$. We also use $\|M\|_F$ to denote the Frobenius norm.
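For concreteness, here is how each of these norms can be computed with numpy (an illustrative snippet, not from the paper):

```python
import numpy as np

# The matrix norms used in the analysis, spelled out with numpy.
rng = np.random.default_rng(3)
M = rng.standard_normal((6, 6))

spectral = np.linalg.norm(M, 2)                       # ||M||: largest singular value
nuclear = np.linalg.svd(M, compute_uv=False).sum()    # ||M||_*: sum of singular values
l1 = np.abs(M).sum()                                  # ||M||_1: entrywise sum of |.|
linf = np.abs(M).max()                                # ||M||_inf: entrywise max of |.|
fro = np.linalg.norm(M, 'fro')                        # ||M||_F: Frobenius norm
```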

## Appendix C Proof of Theorem 2

We prove Theorem 2 in this section. Recall that we need to prove $\Gamma^\perp\cap T=\{0\}$, and show the existence of a dual certificate $Q$ obeying the following sufficient optimality conditions:

• $\mathcal{P}_T(Q)=(1-\eta)\,UU^\top$.

• $\|\mathcal{P}_{T^\perp}(Q)\|<1-\eta$.

• $\mathcal{P}_\Omega(Q)=\eta\,\operatorname{sgn}(B^*)$.

• $\|\mathcal{P}_{\Omega^\perp}(Q)\|_\infty<\eta$.

We propose to construct $Q$ as follows. For $M=UU^\top$ and $N=\operatorname{sgn}(B^*)$, consider the infinite sums $V_M$ and $S_N$, where

$$S_N=N-\mathcal{P}_{\Gamma^\perp}(N)+\mathcal{P}_T\mathcal{P}_{\Gamma^\perp}(N)-\mathcal{P}_{\Gamma^\perp}\mathcal{P}_T\mathcal{P}_{\Gamma^\perp}(N)+\cdots$$

and $V_M$ is defined by an analogous alternating series. Provided that these two sums converge, we let

$$Q=(1-\eta)\,V_{UU^\top}+\eta\,S_{\operatorname{sgn}(B^*)}.$$

The convergence of the infinite sums is guaranteed by Lemma 5 in the next subsection. It is also easy to check that $Q$ satisfies the equality conditions. The following lemma proves $\Gamma^\perp\cap T=\{0\}$.

###### Lemma 3

Under the assumptions of Theorem 2, $\Gamma^\perp\cap T=\{0\}$.

Proof: Using (Chandrasekaran et al., 2009) (see Proposition 1 therein, replacing $\Omega$ with $\Gamma^\perp$), it suffices to show that the incoherence product of the two spaces is strictly less than one, which is implied by our assumption.

It remains to show that $Q$ satisfies the inequality conditions. The next lemma provides this result. The proof utilizes the auxiliary lemmas given in the next subsection. Define $\alpha=3\left(1-\frac{1}{K_1}\right)D_{\max}+\frac{1}{K_p^2}$.

###### Lemma 4

Under the assumptions of Theorem 2, $Q$ satisfies the inequality conditions.

Proof: W.l.o.g. we only consider the non-degenerate case where there is at least one disagreement and more than one cluster in the graph, i.e., $B^*\ne 0$ and $p\ge 2$. Under the assumptions of Theorem 2, we have $\alpha<\frac{1}{2}$, and the range for $\eta$ is non-empty. By Lemma 6 and the sum of the geometric series, we have

$$\|\mathcal{P}_{\Omega^\perp}(Q)\|_\infty \le \frac{1}{1-\alpha}\left\|(1-\eta)\,\mathcal{P}_\Omega(UU^\top)-\eta\,\operatorname{sgn}(B^*)\right\|_\infty \le \frac{1}{1-\alpha}\cdot\frac{1-\eta}{K_{\min}}+\frac{\alpha}{1-\alpha}\,\eta<\eta.$$

Here, we used the special structure of $UU^\top$, whose entries have magnitude at most $1/K_{\min}$. In the second inequality, we used the triangle inequality. The strict inequality holds if $\eta>\frac{1}{1+(1-2\alpha)K_{\min}}$. Moreover, by the result of Lemma 7 for the spectral norm of elements of $\Gamma^\perp$, we have

$$\|\mathcal{P}_{T^\perp}(Q)\| \le nD_{\max}\left\|\mathcal{P}_{\Gamma^\perp}\left(\sum_{i=0}^{\infty}\left(\mathcal{P}_T\mathcal{P}_{\Gamma^\perp}\right)^i\left((1-\eta)\,\mathcal{P}_{\Gamma^\perp}(UU^\top)+\eta\,\operatorname{sgn}(B^*)\right)\right)\right\|_\infty.$$

Consequently, we have

$$\|\mathcal{P}_{T^\perp}(Q)\| \le \frac{nD_{\max}}{1-\alpha}\left((1-\eta)\frac{1}{K_{\min}}+\eta\right) < 1-\eta.$$

The strict inequality holds if $\eta<1-\frac{K_{\min}}{\left(1+\frac{1-\alpha}{nD_{\max}}\right)K_{\min}-1}$. Combining the conditions on $\eta$, we need

$$\frac{1}{1+(1-2\alpha)K_{\min}}<\eta<1-\frac{K_{\min}}{\left(1+\frac{1-\alpha}{nD_{\max}}\right)K_{\min}-1},$$

which is implied by the range of $\eta$ in Theorem 2.

### C.1 Auxiliary Lemmas

In this sub-section we provide several lemmas required in the preceding proofs.

###### Lemma 5

If $\Gamma^\perp\cap T=\{0\}$, then for any matrices $M$ and $N$, the series $V_M$ and $S_N$ converge.

Proof: The proof follows from the fact that if $\Gamma^\perp\cap T=\{0\}$, then the projection of an element of one of these spaces onto the other is a contraction. More formally, let $\{g_i\}$ and $\{h_j\}$ be orthonormal bases for the spaces $T$ and $\Gamma^\perp$, respectively. Let $g$ and $h$ be unit-length linear combinations of the $g_i$'s and the $h_j$'s, respectively. Since $\Gamma^\perp\cap T=\{0\}$, there exists $\rho<1$ such that $\langle g,h\rangle\le\rho$ for all such $g$ and $h$. Thus, we have $\|\mathcal{P}_{\Gamma^\perp}\mathcal{P}_T(M)\|_F\le\rho\,\|M\|_F$, and the series $S_N$ converges geometrically fast. With a similar argument, one can show that $V_M$ also converges.
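The contraction behind Lemma 5 is easy to observe numerically: alternately projecting onto a low-dimensional space of the form $T$ and an entrywise-support space shrinks the Frobenius norm at every step. The spaces below are random stand-ins of ours, not the actual $T$ and $\Gamma^\perp$ of the proof:

```python
import numpy as np

def proj_T(M, U):
    """Orthogonal projection onto T = {U X^T + Y U^T}."""
    P = U @ U.T
    return P @ M + M @ P - P @ M @ P

def proj_support(M, mask):
    """Orthogonal projection onto matrices supported on `mask`."""
    return np.where(mask, M, 0.0)

rng = np.random.default_rng(4)
U, _ = np.linalg.qr(rng.standard_normal((10, 2)))  # random low-dim T
mask = rng.random((10, 10)) < 0.2                  # random sparse support
X = rng.standard_normal((10, 10))
norms = []
for _ in range(8):
    # One round of the alternating projections from the series.
    X = proj_support(proj_T(X, U), mask)
    norms.append(np.linalg.norm(X))                # Frobenius norm
```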

###### Lemma 6

For any $M\in\Gamma^\perp$, we have $\|\mathcal{P}_T(\operatorname{sgn}(M))\|_\infty\le\alpha$.

Proof: For any matrix $M$, by linearity of the projection, we have $\mathcal{P}_T(\operatorname{sgn}(M))=UU^\top\operatorname{sgn}(M)+\operatorname{sgn}(M)\,UU^\top-UU^\top\operatorname{sgn}(M)\,UU^\top$. For any entry $(k,l)$ belonging to the submatrix $B^*_{i,j}$ (WLOG $i\le j$), we have

$$\begin{aligned}
\left|\left(\mathcal{P}_T(\operatorname{sgn}(M))\right)_{k,l}\right| &\le \left(\frac{1}{K_i}-\frac{1}{K_iK_j}\right)\sum_{t=1}^{K_i}\mathbf{1}_{\Gamma^\perp_{t,l}} + \left(\frac{1}{K_j}-\frac{1}{K_iK_j}\right)\sum_{q=1}^{K_j}\mathbf{1}_{\Gamma^\perp_{k,q}} \\
&\quad+\frac{1}{K_iK_j}\sum_{\substack{t=1\\ t\ne k}}^{K_i}\;\sum_{\substack{q=1\\ q\ne l}}^{K_j}\mathbf{1}_{\Gamma^\perp_{t,q}}+\frac{1}{K_iK_j} \\
&\le \left(\frac{1}{K_i}-\frac{1}{K_iK_j}\right)D_{\max}K_j+\left(\frac{1}{K_j}-\frac{1}{K_iK_j}\right)D_{\max}K_j \\
&\quad+\frac{1}{K_iK_j}\,D_{\max}K_j(K_j-1)+\frac{1}{K_iK_j} \\
&\le 3\left(1-\frac{1}{K_1}\right)D_{\max}+\frac{1}{K_p^2}=\alpha.
\end{aligned}$$

This concludes the proof of the lemma.

###### Lemma 7

For any $M\in\Gamma^\perp$, we have $\|M\|\le nD_{\max}\,\|M\|_\infty$.

Proof: Note that , where, with . Moreover, by definition of , we have