# Finding Dense Clusters via "Low Rank + Sparse" Decomposition

Finding "densely connected clusters" in a graph is an important and well-studied problem in the literature [Schaeffer]. It has various applications in pattern recognition, social networking and data mining [Duda, Mishra]. Recently, Ames and Vavasis have suggested a novel method for finding cliques in a graph by using convex optimization over the adjacency matrix of the graph [Ames, Ames2]. Also, there have been recent advances in decomposing a given matrix into its "low rank" and "sparse" components [Candes, Chandra]. In this paper, inspired by these results, we view "densely connected clusters" as imperfect cliques, where imperfections correspond to missing edges, which are relatively sparse. We analyze the problem in a probabilistic setting and aim to detect disjointly planted clusters. Our main result suggests that one can find dense clusters in a graph as long as the clusters are sufficiently large. We conclude by discussing possible extensions and future research directions.


## 1 Introduction

Recently, convex optimization methods have become increasingly popular for data analysis. For example, in compressed sensing [1], we observe the measurements and aim to recover an unknown sparse solution of a system of linear equations via $\ell_1$ minimization. In many other cases, we have perfect knowledge of a signal which possibly looks complicated; however, it has a simpler underlying structure and we aim to reveal this structure by decomposing it into meaningful pieces. For example, decomposing a signal into a sparse superposition of sines and spikes is one of the well-known problems of this type [13]. Decomposing a matrix into low rank and sparse components is another key problem of this nature and it has recently been studied in various settings [4, 5, 6, 7]. In this problem, we observe the matrix $L^0 + S^0$, where $L^0$ is low rank and $S^0$ is sparse, and we aim to find $L^0$ and $S^0$. The suggested convex optimization program is as follows:

$$
\min_{L,S}\ \|L\|_\star + \lambda \|S\|_1 \quad \text{subject to} \quad L + S = L^0 + S^0 \qquad (1)
$$

Here $\|\cdot\|_\star$ is the nuclear norm, i.e., the sum of the singular values of a matrix, and $\|\cdot\|_1$ is the $\ell_1$ norm, i.e., the sum of the absolute values of the entries. Problem (1) can be considered as the natural convex relaxation of “low rank + sparse” decomposition, as the $\ell_1$ norm and the nuclear norm are the tightest convex relaxations of the sparsity and rank functions respectively. Consequently, this program promotes sparsity for $S$ and low rankness for $L$. For the correct choice of $\lambda$, if $L^0$ and $S^0$ satisfy certain incoherence requirements, it is known that we’ll have $(L^*, S^*) = (L^0, S^0)$, where $(L^*, S^*)$ is the output of problem (1).

This result is very useful, as low rankness and sparsity are the underlying structures in many problems. In [8], Gaussian graphical models with latent variables were investigated and the problem of finding conditional dependencies of the observed variables was connected to problem (1). On the other hand, in the problem of finding cliques in a given unweighted graph, the key observation is that, in the adjacency matrix, a clique corresponds to a submatrix of all $1$’s, which is clearly rank $1$. Based on these observations, in this paper, we aim to extend the results of Ames and Vavasis [2, 3] for detection of planted cliques to detection of “densely connected clusters”. This problem comes up naturally, as, most of the time, it is unreasonable to expect full cliques in a graph. For example, there might be missing edges naturally, or data might be corrupted, or we might be observing only partial information. However, even if we miss some of the edges, it is very likely that most of the edges will be preserved and the cluster we want to identify will still be denser than the rest. We’ll view these dense clusters as imperfect cliques with some missing edges and, in our approach, full cliques will correspond to the low rank piece, $L$, whereas the missing edges inside and the (extra) edges outside of the clusters will correspond to the sparse piece, $S$.

We analyze the problem under a general probabilistic setting, which we call the “probabilistic cluster model”. In our model, an edge inside the $i$’th cluster exists with probability $p_i$ and an edge which is not inside any of the clusters exists with probability $q$, independent of other edges in the graph, where $\{p_i\}$ and $q$ are constants. Here, by “inside a cluster”, we mean an edge lying between two nodes which belong to the same cluster. Notice that this model can be viewed as a slight modification of the well-known Erdös-Rényi random graph model, where we introduce a nonuniform distribution which makes the clusters identifiable. We additionally assume the clusters are disjoint.

We’ll analyze two convex programs for detection of the clusters using the knowledge of the graph. We name the first program the “blind approach”; it is just a slight modification of problem (1), given in (10), and we show that if

$$
\min_i p_i = p_{\min} > \frac{1}{2} > q \qquad (2)
$$

then, as long as the clusters are sufficiently large, problem (10) can detect the clusters with high probability. Our second program is called the “intelligent approach”, which is given in problem (15). In this case, we require extra information but we can guarantee the detection for any $p_{\min} > q$. Problem (15) can be considered as a mixture of (1) and (12) of [2], because it focuses on the subgraph induced by the edges inside the clusters, similar to (12) of [2], but additionally accounts for the missing edges. This approach also trivially extends to the case where we observe the partial graph, in which each edge is observed with the same probability, independent of the others. In this case, clusters can still be recovered but we need the clusters to be slightly larger compared to the case where we observe the full graph.

## 2 Basic Definitions and Notations

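Let $[m]$ denote the set $\{1, 2, \dots, m\}$ for all integers $m \geq 1$. We differentiate a subset of nodes in a graph by calling that subset a cluster. For the rest of the paper, we assume the graph $G$ is unweighted with $n$ nodes, and there are $t$ disjoint planted clusters with sizes $\{k_i\}_{i=1}^{t}$. By unweighted we mean edges do not carry weights. Assume nodes are labeled from $1$ to $n$ and let $C_i \subseteq [n]$ be the set of the nodes inside the $i$’th cluster; hence $|C_i| = k_i$, and $C_i \cap C_j = \emptyset$ for any $i \neq j$. We also let $C_{t+1}$ denote the rest of the nodes, i.e., $C_{t+1} = [n] \setminus \bigcup_{i=1}^{t} C_i$ and $k_{t+1} = |C_{t+1}|$.

We call a subset $\beta$ of $[n] \times [n]$ a region. $\beta^c$ denotes the complement, which is given by $\beta^c = ([n] \times [n]) \setminus \beta$.

Let $R$ be the region corresponding to the union of regions induced by the clusters, i.e., $R = \bigcup_{i=1}^{t} C_i \times C_i$. Note that $R$ is simply a subset of $[n] \times [n]$. We also let $R_{i,j} = C_i \times C_j$ for $1 \leq i, j \leq t+1$. $\{R_{i,j}\}$ basically divides $[n] \times [n]$ into $(t+1)^2$ disjoint regions, similar to a grid. Also, $R_{i,i}$ is simply the region induced by the $i$’th cluster for any $i \leq t$.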

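Let $a, b \in \mathbb{R}$ and $0 \leq r \leq 1$. We say a random variable $X$ is $\mathrm{Bern}(a, b, r)$ if

$$
\begin{aligned}
P(X = a) &= r && (3)\\
P(X = b) &= 1 - r && (4)
\end{aligned}
$$

For a given matrix $X$, $X_{i,j}$ denotes the entry lying on the $i$’th row and $j$’th column. $1^{n \times n}$ is an $n \times n$ matrix where entries are all $1$’s. Assume $\beta$ is a subset of $[n] \times [n]$. Then, $\beta$ can be viewed as a set of coordinates and, if $X \in \mathbb{R}^{n \times n}$, we denote the matrix which is induced by the entries of $X$ on $\beta$ by $X_\beta$:

$$
(X_\beta)_{i,j} = \begin{cases} X_{i,j} & \text{if } (i,j) \in \beta \\ 0 & \text{else} \end{cases} \qquad (5)
$$

In particular, $1^{n \times n}_\beta$ is a matrix whose entries on $\beta$ are $1$ and the rest of the entries are $0$.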

Now, we introduce some definitions to explain the model we’ll work on.

###### Definition 1 (Random Support).

A random set $\beta \subseteq [n] \times [n]$ is called a “random support” with parameter $0 \leq r \leq 1$ if each coordinate is an element of $\beta$ with probability $r$, independent of other coordinates.

A random set $\Gamma$ is called a “corrected random support” with parameter $r$ if it is statistically identical to $\beta \cup \{(i,i) : i \in [n]\}$, where $\beta$ is a random support with parameter $r$. Basically, we include the diagonal coordinates.

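Let $A$ be the adjacency matrix of $G$. For simplicity, we let $A_{i,i} = 1$ for all $i$. Also, for $i \neq j$,

$$
A_{i,j} = \begin{cases} 1 & \text{if an edge exists between nodes } i, j \\ 0 & \text{else} \end{cases} \qquad (6)
$$

Note that $A$ is symmetric, i.e., $A_{i,j} = A_{j,i}$ for all $i, j$; as a result, it is uniquely determined by the entries on the lower triangular part.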

###### Definition 2 (Probabilistic Cluster Model).

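Recall that $C_i \subseteq [n]$ with $|C_i| = k_i$ and $C_i \cap C_j = \emptyset$ for all $i \neq j$. Also $R_{i,j} = C_i \times C_j$ for all $i, j$ and $R = \bigcup_{i=1}^{t} R_{i,i}$. Let $\{p_i\}_{i=1}^{t}, q$ be constants between $0$ and $1$. Then, a random graph $G$, generated according to the probabilistic cluster model, has the following adjacency matrix. Entries of $A$ on the lower triangular part are independent random variables and, for any $i > j$:

$$
A_{i,j} = \begin{cases} \mathrm{Bern}(1, 0, p_l) \text{ random variable} & \text{if } (i,j) \in R_{l,l} \text{ for some } l \leq t \\ \mathrm{Bern}(1, 0, q) \text{ random variable} & \text{else} \end{cases} \qquad (7)
$$

Verbally, an edge inside the $l$’th cluster exists with probability $p_l$ and an edge which is not inside any of the clusters exists with probability $q$, independent of other edges, where $1 \leq l \leq t$. In order to distinguish clusters, we’ll assume they are denser, i.e., an edge inside the region $R$ is more likely to exist compared to an edge which is not. Consequently, we have:

$$
\min_{i \leq t} p_i = p_{\min} > q \qquad (8)
$$

for the rest of the paper. One can similarly treat the case where $\max_i p_i < q$ by considering the complement graph, whose adjacency matrix $\bar{A}$ satisfies $\bar{A}_{i,j} = 1 - A_{i,j}$ for all $i \neq j$. In this case, $\bar{A}$ will still satisfy the probabilistic model with inside and outside cluster edge probabilities $\{1 - p_i\}$ and $1 - q$ respectively, where $1 - p_i > 1 - q$. Notice that, in the special case of cliques, we have $p_i = 1$ for all $i \leq t$.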
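As an illustration of Definition 2, the following minimal sketch samples an adjacency matrix from the probabilistic cluster model. The function name `sample_cluster_graph`, the 0-based node labels and the example parameters are assumptions made for this sketch; they do not come from the paper.

```python
import random


def sample_cluster_graph(n, clusters, p, q, seed=0):
    """Sample a symmetric 0/1 adjacency matrix from the probabilistic cluster model.

    clusters: list of disjoint lists of node indices (0-based in this sketch).
    p: within-cluster edge probabilities, one per cluster.
    q: edge probability for pairs that do not share a cluster.
    """
    rng = random.Random(seed)
    label = [-1] * n  # -1 marks nodes outside every cluster
    for l, nodes in enumerate(clusters):
        for v in nodes:
            label[v] = l
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        A[i][i] = 1  # convention A_{i,i} = 1
        for j in range(i):  # lower triangular part, i > j
            same_cluster = label[i] != -1 and label[i] == label[j]
            prob = p[label[i]] if same_cluster else q
            A[i][j] = A[j][i] = 1 if rng.random() < prob else 0
    return A


# Hypothetical example: two planted clusters among 10 nodes.
A = sample_cluster_graph(10, [[0, 1, 2, 3], [4, 5, 6]], p=[0.9, 0.8], q=0.1)
```

Only the lower triangular part is sampled and then mirrored, which matches the remark that a symmetric $A$ is determined by its lower triangular entries.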

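In this model, $A$ can also be characterized by using random supports:

$$
A = \sum_{i=1}^{t} 1^{n \times n}_{R_{i,i} \cap \beta_i} + 1^{n \times n}_{R^c \cap \Gamma} \qquad (9)
$$

where $\{\beta_i\}_{i=1}^{t}, \Gamma$ are independent corrected random supports with parameters $\{p_i\}_{i=1}^{t}, q$ respectively.

Let $\mathcal{A}$ be the set of nonzero coordinates of $A$, i.e., $\mathcal{A} = \{(i,j) : A_{i,j} \neq 0\}$. Basically, $\mathcal{A}$ is the region induced by the edges inside the graph with the addition of the diagonal coordinates. For example, the set $\mathcal{A}^c \cap R$ corresponds to the missing edges inside the clusters. Clearly, $\mathcal{A}$ is random, as $G$ is drawn from the probabilistic cluster model.

We’ll call a matrix (or vector) positive (negative) if all its entries are positive (negative). Finally, we let $\mathrm{sum}(X)$ denote the sum of the entries of $X$, i.e., $\mathrm{sum}(X) = \sum_{i,j} X_{i,j}$ for $X \in \mathbb{R}^{n \times n}$. If the matrix $X$ is nonnegative then $\mathrm{sum}(X) = \|X\|_1$.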

## 3 Proposed Convex Programs

Our aim is to find the clusters in a graph drawn from the probabilistic cluster model described in Definition 2. This can be achieved by finding $1^{n \times n}_R$. This is not hard to see because, in the matrix $1^{n \times n}_R$, the nonzero entries of each column exactly correspond to one of the clusters, as the clusters are disjoint. Then, we can simply scan through all columns to find the clusters.
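The column scan described above can be sketched as follows. The helper names `indicator_of_clusters` and `clusters_from_indicator` and the 0-based node labels are assumptions of this sketch, not part of the paper.

```python
def indicator_of_clusters(n, clusters):
    """Return the n x n 0/1 matrix with ones exactly on the blocks C_i x C_i."""
    L = [[0] * n for _ in range(n)]
    for nodes in clusters:
        for i in nodes:
            for j in nodes:
                L[i][j] = 1
    return L


def clusters_from_indicator(L):
    """Recover clusters by grouping columns of L that share the same nonzero support."""
    n = len(L)
    found = {}  # support tuple -> None; dicts keep insertion order
    for j in range(n):
        support = tuple(i for i in range(n) if L[i][j] != 0)
        if support:  # a node outside every cluster has an all-zero column
            found.setdefault(support, None)
    return [list(s) for s in found]


# Hypothetical example: clusters {0, 1, 2} and {3, 4}; node 5 is in no cluster.
L = indicator_of_clusters(6, [[0, 1, 2], [3, 4]])
recovered = clusters_from_indicator(L)
```

Because the clusters are disjoint, every column inside a cluster has that cluster as its support, so distinct supports are exactly the distinct clusters.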

### 3.1 Blind Approach

As our first approach, in order to find $1^{n \times n}_R$, we suggest the following, slightly modified version of problem (1):

$$
\begin{aligned}
\min_{L,S}\ & \|L\|_\star + \lambda \|S\|_1 && (10)\\
\text{subject to}\ & 1 \geq L_{i,j} \geq 0 \ \text{ for all } i, j && (11)\\
& L + S = A && (12)
\end{aligned}
$$

The advantage of this approach is that we don’t need any additional information about the clusters, such as the number (or sizes) of the clusters. The desired solution is $(L^0, S^0)$, where $L^0$ corresponds to the full cliques, when missing edges inside $R$ are completed, and $S^0$ corresponds to the missing edges and the extra edges between the clusters. In particular we want:

$$
\begin{aligned}
L^0 &= 1^{n \times n}_R && (13)\\
S^0 &= 1^{n \times n}_{\mathcal{A} \cap R^c} - 1^{n \times n}_{\mathcal{A}^c \cap R} && (14)
\end{aligned}
$$

It is easy to see that the pair $(L^0, S^0)$ is feasible; later we’ll argue that, under the correct assumptions, $(L^0, S^0)$ is indeed the unique optimal solution.
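The feasibility claim for the blind approach can be checked on a toy instance. The sketch below builds the candidate pair from (13) and (14) and verifies constraint (12); the function names and the 5-node graph are hypothetical and chosen only for illustration.

```python
def region_mask(n, clusters):
    """mask[i][j] is True exactly when (i, j) lies in R, the union of C_l x C_l."""
    mask = [[False] * n for _ in range(n)]
    for nodes in clusters:
        for i in nodes:
            for j in nodes:
                mask[i][j] = True
    return mask


def blind_candidates(A, clusters):
    """Build (L0, S0) as in (13) and (14) from a 0/1 adjacency matrix A."""
    n = len(A)
    in_R = region_mask(n, clusters)
    L0 = [[1 if in_R[i][j] else 0 for j in range(n)] for i in range(n)]
    S0 = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if A[i][j] == 1 and not in_R[i][j]:
                S0[i][j] = 1   # nonzero entry of A outside R
            elif A[i][j] == 0 and in_R[i][j]:
                S0[i][j] = -1  # missing edge inside a cluster
    return L0, S0


# Hypothetical graph: cluster {0, 1, 2} with edge (0, 2) missing,
# extra edges (0, 3) and (3, 4), and the diagonal set to 1.
A = [
    [1, 1, 0, 1, 0],
    [1, 1, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
]
L0, S0 = blind_candidates(A, [[0, 1, 2]])
feasible = all(L0[i][j] + S0[i][j] == A[i][j] for i in range(5) for j in range(5))
```

Here `S0` carries both signs: `+1` on nonzero entries of `A` outside $R$ and `-1` on missing edges inside $R$, which is what makes $L^0 + S^0 = A$ hold entrywise.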

### 3.2 Intelligent Approach

The second convex problem to be analyzed is a mixture of problems (1) and (12) of [2]. We’ll require extra information, which is the size of the region induced by the clusters, i.e., $|R|$. The suggested program focuses on the subgraph induced by the edges inside the clusters and is given below:

$$
\begin{aligned}
\min_{L,S}\ & \|L\|_\star + \lambda \|S\|_1 && (15)\\
\text{subject to}\ & 1 \geq L_{i,j} \geq S_{i,j} \geq 0 \ \text{ for all } i, j && (16)\\
& \mathrm{trace}\big((1^{n \times n} - A)^T (L - S)\big) = 0 && (17)\\
& \mathrm{sum}(L) \geq |R| = \sum_{i=1}^{t} k_i^2 && (18)
\end{aligned}
$$

The knowledge of $|R|$ will help us guess the solution of problem (15) (under the right assumptions). $L^0$ should correspond to the full cliques, similar to (13); however, $S^0$ should only correspond to the missing edges inside the clusters. Formally, we want:

$$
\begin{aligned}
L^0 &= 1^{n \times n}_R && (19)\\
S^0 &= 1^{n \times n}_{\mathcal{A}^c \cap R} && (20)
\end{aligned}
$$
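A small sanity check of the candidate pair in (19) and (20) against constraints (16), (17) and (18) might look as follows. The function names and the toy graph are assumptions of this sketch; the trace in (17) is evaluated as the entrywise sum of $(1 - A_{i,j})(L_{i,j} - S_{i,j})$.

```python
def intelligent_candidates(A, clusters):
    """Build (L0, S0) as in (19) and (20): S0 marks only missing in-cluster edges."""
    n = len(A)
    in_R = [[False] * n for _ in range(n)]
    for nodes in clusters:
        for i in nodes:
            for j in nodes:
                in_R[i][j] = True
    L0 = [[1 if in_R[i][j] else 0 for j in range(n)] for i in range(n)]
    S0 = [[1 if in_R[i][j] and A[i][j] == 0 else 0 for j in range(n)]
          for i in range(n)]
    return L0, S0


def satisfies_constraints(A, L, S, region_size):
    """Check constraints (16), (17) and (18) entry by entry."""
    n = len(A)
    idx = [(i, j) for i in range(n) for j in range(n)]
    box = all(1 >= L[i][j] >= S[i][j] >= 0 for i, j in idx)         # (16)
    trace_term = sum((1 - A[i][j]) * (L[i][j] - S[i][j]) for i, j in idx)  # (17)
    total = sum(L[i][j] for i, j in idx)                            # (18)
    return box and trace_term == 0 and total >= region_size


# Hypothetical graph: cluster {0, 1, 2} with edge (0, 2) missing; |R| = 3^2 = 9.
A = [
    [1, 1, 0, 1, 0],
    [1, 1, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
]
L0, S0 = intelligent_candidates(A, [[0, 1, 2]])
ok = satisfies_constraints(A, L0, S0, region_size=9)
```

Dropping `S0` (using an all-zero matrix instead) violates (17) in this example, since the missing edge $(0, 2)$ would no longer be compensated.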

In the next section, we’ll state the main results of the paper regarding problems (10) and (15). Proofs of the theorems in Section 4 will be given in Sections 6, 7 and 8. Finally, Section 5 will conclude the paper.

## 4 Main Results

In this section, we’ll explain the conditions under which the candidates given in (13) and (19) are the unique optimal solutions of problems (10) and (15) respectively. This will also naturally answer the question of finding the densely connected clusters $\{C_i\}_{i=1}^{t}$. Let $k_{\min}$ be the size of the minimum cluster

$$
k_{\min} = \min_{1 \leq i \leq t} k_i \qquad (21)
$$

and recall that $p_{\min}$ was given previously in (8). Our analysis yields the following fundamental constraints.

• for some constant (In particular will work).

• for all .

Actually, both of these constraints are natural. In [4], is used as the weight for problem (1). It is not surprising that we are using a similar weight as our random graph model has strong similarities with the uniformly random support of the sparse component in [4]. Secondly, we observe that implies which suggests that for recoverability, we need size of the ’th cluster to be at least and as gets smaller, this size should grow. This condition is consistent with the previous results of [2, 3] which says for recoverability of disjoint cliques, one needs a minimum clique size of .

The main results of this paper are summarized in the following theorems.

###### Theorem 1 (Main Result for Intelligent Approach).

Set . Let be a random graph generated according to the probabilistic cluster model (2) with cluster sizes and parameters . Assume and for all . Then, (independent of rest of the parameters) there exists constants such that as a result of convex program (15) we have

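$$
\begin{aligned}
L^0 &= 1^{n \times n}_R = \sum_{i=1}^{t} 1^{n \times n}_{R_{i,i}} && (22)\\
S^0 &= 1^{n \times n}_{\mathcal{A}^c \cap R} && (23)
\end{aligned}
$$

with probability at least (w.p.a.l.)

$$
1 - c\,n^2 \exp\!\big(-C (p_{\min} - q)^2 k_{\min}\big) \qquad (24)
$$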

In Theorem 1, one can simplify the condition on by simply requiring however statement of the theorem will be weaker unless all ’s are equal. Following corollary gives an idea about the case where we observe the partial graph.

###### Corollary 1 (Result for Partially Observed Graphs).

Let be a random graph as described in Theorem 1 and we observe each edge of with probability independent of the other edges. Let be the adjacency matrix of the observed subgraph. Then, statement of Theorem 1 holds with variables instead of respectively where and for all . Hence for recovery (w.h.p.), we require for all .

###### Theorem 2 (Main Result for Blind Approach).

Let and be a random graph generated according to the probabilistic cluster model with cluster sizes . Set and assume for all . Then, there exists constants such that as the output of problem 10 we have

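$$
\begin{aligned}
L^0 &= 1^{n \times n}_R && (25)\\
S^0 &= 1^{n \times n}_{\mathcal{A} \cap R^c} - 1^{n \times n}_{\mathcal{A}^c \cap R} && (26)
\end{aligned}
$$

with probability at least

$$
1 - c\,n^2 \exp\!\big(-C (\min\{2 p_{\min} - 1,\ 1 - 2q\})^2 k_{\min}\big) \qquad (27)
$$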

We should emphasize that slightly stronger results can be given for both theorems. For example, we can reduce the lower bound required for by a factor of four in both theorems at the expense of the error exponent . In fact, one can get even better lower bounds for by choosing as a function of however we preferred to make independent of .

The following theorem provides a converse result for blind method.

###### Theorem 3.

Let be a random graph generated according to the probabilistic cluster model with and assume , for some constant . Then, if

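$$
\frac{1}{2} \geq p_{\min} \quad \text{or} \quad q > \frac{1}{2} \ \text{ and } \ R \neq [n] \times [n] \qquad (28)
$$

as $n \to \infty$, $(L^0, S^0)$ given in (13) is not a minimizer of problem (10) with probability approaching $1$.

Remark: Note that if $R = [n] \times [n]$ there is nothing to solve, as all nodes are in the same cluster.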

## 5 Future Extensions and Conclusion

### 5.1 Simulation Results

We considered two relatively small cases. For the first case, we have , , , and is variable. We plotted the empirical probability of success for both methods as a function of in Figure 1.

Secondly, in order to illustrate the difference between intelligent and blind approaches, we set , , , and varied . Due to Theorem 3, for blind approach to work, we always need . On the other hand, intelligent approach will work for any as long as is sufficiently large. Hence, when we increase we expect to see a better recovery region for intelligent approach compared to blind. We should remark that, in a probabilistic setting, case is trivial as we can find the cluster with high probability by looking at the nodes with high degree. Empirical curve is given in Figure 2.

Remark: In order to keep the model size small, we used in both of the simulations.

### 5.2 Future Extensions

#### 5.2.1 Alternative approaches

Our simulation results indicate that a slight modification to problem (1) of [3] might be an alternative to the methods analyzed in this paper. Let $e \in \mathbb{R}^n$ be the vector of all $1$’s. Then, assuming we know the number of clusters $t$, the proposed convex program is as follows:

$$
\begin{aligned}
\max_{L}\ & \mathrm{sum}(L_{\mathcal{A}}) && (29)\\
\text{subject to}\ & && (30)\\
& L \succeq 0 \ \text{ (positive semi-definite)} && (31)\\
& \mathrm{trace}(L) = t && (32)\\
& L_{i,j} \geq 0 \ \text{ for all } 1 \leq i, j \leq n && (33)\\
& (Le)_i \leq e_i = 1 \ \text{ for all } 1 \leq i \leq n && (34)
\end{aligned}
$$

The desired solution of this problem is

$$
L^0 = \sum_{i=1}^{t} \frac{1}{k_i} 1^{n \times n}_{R_{i,i}} \qquad (35)
$$

It is easy to see that $L^0$ is feasible. Program (29) might be a more useful approach compared to (15), as it requires the number of clusters $t$ as prior information instead of $|R|$. However, we only considered “low rank + sparse” decompositions in this paper.

#### 5.2.2 Removing the disjointness assumption

As a natural extension, we consider removing the assumption of disjoint clusters. When clusters are allowed to intersect, intuitively is no longer low rank. Although, we don’t provide a proof, we believe rank of is equal to the number of distinct nonempty sets of type where . This suggests rank of can be as high as which grows exponentially with number of clusters. This intuition is verified by simulation results. Consequently, convex programs (15) and (10) might not be good candidates when clusters are allowed to intersect as we aim to find as a solution in these approaches. As a result, an alternative approach which will naturally result in a low rank solution is of significant interest. Another related problem is, when clusters can intersect, how to obtain from the knowledge of assuming we are able to find as a result of the optimization. Certainly, we may not always be able to uniquely decompose into , but in general decomposition which yields smallest number of clusters might be of interest.

#### 5.2.3 Extremely sparse graph

In many cases, the edge probability decays as the model size $n$ grows. For example, in order to have a connected graph with high probability, the Erdös-Rényi model with edge probability $p$ requires only $p$ on the order of $\frac{\log n}{n}$, i.e., an average node degree on the order of $\log n$. Sparse graphs are very common and useful in social networks [16] and web graphs [17]; hence it would be of interest to extend the results of this paper to the setting where $\{p_i\}, q$ are not constant. We believe this can be done by using concentration results specific to the spectral norm of sparse matrices.

In this paper, we analyzed two novel approaches for detection of disjoint clusters in a general probabilistic model. Our results are consistent with the existing works in literature and significantly extend results of [2], [3]. Simulation results suggest that even for a relatively small model, our methods yield the desired result with high probability.

## 6 Proof of Theorem 1

Analysis of problems (15) and (10) are similar to a great extent. Therefore, many of the results for this section will also be used for section 7. In the following discussion, and is always assumed to be positive.

### 6.1 Perturbation Analysis for $(L^0, S^0)$

Let $(L^*, S^*)$ denote the optimal solution of problem (15). We’ll follow a conventional proof strategy to show that, under some conditions, for any feasible nonzero perturbation $(E^L, E^S)$ over $(L^0, S^0)$ given in (19), the objective function strictly increases, i.e.,

$$
\|L^0 + E^L\|_\star + \lambda \|S^0 + E^S\|_1 > \|L^0\|_\star + \lambda \|S^0\|_1 \qquad (36)
$$

Consequently, due to convexity, we’ll conclude $(L^*, S^*) = (L^0, S^0)$.

#### 6.1.1 Observations

###### Lemma 1.

For the optimal solution $(L^*, S^*)$ of problem (15), we have $S^* = L^*_{\mathcal{A}^c}$.

###### Proof.

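From (17) we have

$$
\mathrm{sum}\big((L^* - S^*)_{\mathcal{A}^c}\big) = 0 \qquad (37)
$$

This follows from the fact that $1^{n \times n} - A = 1^{n \times n}_{\mathcal{A}^c}$. Combining this with (16), we can conclude that

$$
L^*_{\mathcal{A}^c} = S^*_{\mathcal{A}^c} \qquad (38)
$$

since $L^*_{i,j} - S^*_{i,j} \geq 0$ for all $i, j$.

Secondly, one can observe that if $(L^*, S^*)$ is feasible for problem (15), then $(L^*, S^*_{\mathcal{A}^c})$ is also feasible and gives a lower (or equal) cost. This is because:

• The only constraint on the entries of $S$ over $\mathcal{A}$ is $L_{i,j} \geq S_{i,j} \geq 0$, and $S_{i,j} = 0$ will trivially satisfy this.

• $\|S^*\|_1 \geq \|S^*_{\mathcal{A}^c}\|_1$, with equality if and only if $S^*_{\mathcal{A}} = 0$. Therefore, the objective will not increase by substituting $S^*$ by $S^*_{\mathcal{A}^c}$.

Hence for optimality, we require:

$$
S^*_{\mathcal{A}} = 0 \qquad (39)
$$

Using (38) and (39), WLOG, $S^*$ takes the following simple form:

$$
S^* = L^*_{\mathcal{A}^c} \qquad (40)
$$

A natural interpretation of (40) is that the only role of $S$ is filling the missing edges inside the clusters. Actually, we can write a simpler and equivalent optimization, where we get rid of the variable $S$ but still get the same result as problem (15), as follows:

$$
\begin{aligned}
\min_{L}\ & \|L\|_\star + \lambda \|L_{\mathcal{A}^c}\|_1 && (41)\\
\text{subject to}\ & 1 \geq L_{i,j} \geq 0 \ \text{ for all } i, j\\
& \mathrm{sum}(L) \geq |R|
\end{aligned}
$$

Finally, notice that $(L^0, S^0)$ satisfies (40), as expected.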

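#### 6.1.2 Optimality Conditions for $(L^0, S^0)$

Let $\langle \cdot, \cdot \rangle$ denote the usual inner product, i.e., $\langle X, Y \rangle = \mathrm{trace}(X^T Y)$. Also let $\mathrm{sign}(\cdot) : \mathbb{R}^{n \times n} \to \mathbb{R}^{n \times n}$ be such that

$$
\mathrm{sign}(X)_{i,j} = \begin{cases} 1 & \text{if } X_{i,j} > 0 \\ 0 & \text{if } X_{i,j} = 0 \\ -1 & \text{if } X_{i,j} < 0 \end{cases} \qquad (42)
$$

We would like to show that any feasible nonzero perturbation $(E^L, E^S)$ over $(L^0, S^0)$ will strictly increase the objective. Due to Lemma 1, we can assume

$$
E^S = E^L_{\mathcal{A}^c} \qquad (43)
$$

as $(L^0, S^0)$ satisfies (40). In the following discussion, we analyze the increase in the objective due to the perturbation.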

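Increase due to $E^S$: Similar to [4], by using the subgradient of the $\ell_1$ norm we can write:

$$
\|S^0 + E^S\|_1 \geq \|S^0\|_1 + \langle \mathrm{sign}(S^0) + Q, E^S \rangle \qquad (44)
$$

for all $Q$ with $\|Q\|_\infty \leq 1$ and $Q_{\mathcal{A}^c \cap R} = 0$, as $S^0$ is nonzero over $\mathcal{A}^c \cap R$. Here $\|\cdot\|_\infty$ is the infinity norm, i.e., $\|X\|_\infty = \max_{i,j} |X_{i,j}|$.

Note that $\mathrm{sign}(S^0) = 1^{n \times n}_{\mathcal{A}^c \cap R}$. Then, by choosing $Q = 1^{n \times n}_{\mathcal{A}^c \cap R^c}$ and using (43) in (44), we find:

$$
\|S^0 + E^S\|_1 \geq \|S^0\|_1 + \mathrm{sum}(E^L_{\mathcal{A}^c}) \qquad (45)
$$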
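The bound (45) can be checked numerically on a toy instance. The sketch below assumes the subgradient choice $Q = 1^{n \times n}_{\mathcal{A}^c \cap R^c}$, which is one choice that turns (44) into (45); the graph, the random perturbation and all variable names are hypothetical.

```python
import random

# Hypothetical graph: cluster {0, 1, 2} with edge (0, 2) missing, diagonal set to 1.
A = [
    [1, 1, 0, 1, 0],
    [1, 1, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
]
n = len(A)
cluster = {0, 1, 2}
in_R = [[i in cluster and j in cluster for j in range(n)] for i in range(n)]

# S0 is the indicator of missing edges inside the cluster, as in (20).
S0 = [[1.0 if in_R[i][j] and A[i][j] == 0 else 0.0 for j in range(n)]
      for i in range(n)]

# Arbitrary perturbation E^L, and E^S = E^L restricted to the complement of
# the support of A, as in (43).
rng = random.Random(1)
EL = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
ES = [[EL[i][j] if A[i][j] == 0 else 0.0 for j in range(n)] for i in range(n)]


def l1_norm(X):
    """Sum of absolute values of the entries."""
    return sum(abs(x) for row in X for x in row)


lhs = l1_norm([[S0[i][j] + ES[i][j] for j in range(n)] for i in range(n)])
rhs = l1_norm(S0) + sum(x for row in ES for x in row)
holds = lhs >= rhs - 1e-12  # small tolerance for floating-point rounding
```

The inequality holds entry by entry: on the support of `S0` it reads $|1 + e| \geq 1 + e$, and elsewhere it reads $|e| \geq e$.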

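Increase due to $E^L$: Let $u_l \in \mathbb{R}^n$ be the characteristic vector of $C_l$ with unit norm, i.e., for $1 \leq l \leq t$, the $i$’th entry of $u_l$ is

$$
u_{l,i} = \begin{cases} \frac{1}{\sqrt{k_l}} & \text{if } i \in C_l \\ 0 & \text{else} \end{cases} \qquad (46)
$$

Let $U = [u_1\ u_2\ \dots\ u_t]$ and $\mathcal{M}_U = \{X \in \mathbb{R}^{n \times n} : U^T X = 0,\ X U = 0\}$. Also, $\|\cdot\|$ denotes the spectral norm, i.e., the maximum singular value. Then, the following lemma characterizes the increase in the objective due to $E^L$.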

###### Lemma 2.

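For any $E^L$ and $W \in \mathcal{M}_U$ with $\|W\| \leq 1$, we have

$$
\|L^0 + E^L\|_\star \geq \|L^0\|_\star + \sum_{l=1}^{t} \frac{1}{k_l} \mathrm{sum}(E^L_{R_{l,l}}) + \langle E^L, W \rangle \qquad (47)
$$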
###### Proof.

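The singular value decomposition of $L^0$ can be written as

$$
L^0 = \sum_{l=1}^{t} k_l u_l u_l^T = U \begin{bmatrix} k_1 & & & \\ & k_2 & & \\ & & \ddots & \\ & & & k_t \end{bmatrix} U^T \qquad (48)
$$

As a result, the columns of $U$, $\{u_l\}_{l=1}^{t}$, are the left and right singular vectors of $L^0$. Then, we have

$$
\|L^0 + E^L\|_\star \geq \|L^0\|_\star + \langle E^L, W + U U^T \rangle \qquad (49)
$$

for any $W \in \mathcal{M}_U$ with $\|W\| \leq 1$, which follows from the subgradient of the nuclear norm, similar to [4]. Finally, observe that

$$
U U^T = \sum_{l=1}^{t} \frac{1}{k_l} 1^{n \times n}_{R_{l,l}} \implies \langle E^L, U U^T \rangle = \sum_{l=1}^{t} \frac{1}{k_l} \mathrm{sum}(E^L_{R_{l,l}}) \qquad (50)
$$

to conclude.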
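The two identities used in this proof, (48) and the first part of (50), can be verified numerically on a small example. The cluster layout and helper name `weighted_outer_sum` below are assumptions of this sketch.

```python
import math

# Hypothetical layout: n = 6 nodes, clusters {0, 1, 2} and {3, 4}; node 5 is in no cluster.
n = 6
clusters = [[0, 1, 2], [3, 4]]
sizes = [len(c) for c in clusters]

# Unit-norm characteristic vectors u_l, as in (46).
U_cols = [[1.0 / math.sqrt(len(c)) if i in c else 0.0 for i in range(n)]
          for c in clusters]


def weighted_outer_sum(vectors, weights):
    """Return sum_l weights[l] * u_l u_l^T as a nested list."""
    m = len(vectors[0])
    return [[sum(w * u[i] * u[j] for u, w in zip(vectors, weights))
             for j in range(m)] for i in range(m)]


L0_from_svd = weighted_outer_sum(U_cols, sizes)            # left side of (48)
UUt = weighted_outer_sum(U_cols, [1.0] * len(clusters))    # U U^T


def block_value(i, j, value_for_size):
    """Value at (i, j): value_for_size(k_l) inside block C_l x C_l, else 0."""
    for c in clusters:
        if i in c and j in c:
            return value_for_size(len(c))
    return 0.0


svd_matches = all(
    math.isclose(L0_from_svd[i][j], block_value(i, j, lambda k: 1.0), abs_tol=1e-12)
    for i in range(n) for j in range(n))
uut_matches = all(
    math.isclose(UUt[i][j], block_value(i, j, lambda k: 1.0 / k), abs_tol=1e-12)
    for i in range(n) for j in range(n))
```

Both checks compare against block matrices: all ones on each $C_l \times C_l$ for $L^0$, and $1/k_l$ on each $C_l \times C_l$ for $U U^T$.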

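Overall increase: By combining (45) and Lemma 2, we have the following lower bound for the increase of the objective:

$$
\big(\|L^0 + E^L\|_\star - \|L^0\|_\star\big) + \lambda \big(\|S^0 + E^S\|_1 - \|S^0\|_1\big) \geq \sum_{l=1}^{t} \frac{1}{k_l} \mathrm{sum}(E^L_{R_{l,l}}) + \lambda\, \mathrm{sum}(E^L_{\mathcal{A}^c}) + \langle E^L, W \rangle \qquad (51)
$$

for any $W \in \mathcal{M}_U$, $\|W\| \leq 1$. Then, as long as the right hand side of (51) can be made strictly positive for all feasible nonzero $E^L$ (by properly choosing $W$), $(L^0, S^0)$ is the unique optimal solution of problem (15). Let us call

$$
f(E^L, W) = \sum_{l=1}^{t} \frac{1}{k_l} \mathrm{sum}(E^L_{R_{l,l}}) + \lambda\, \mathrm{sum}(E^L_{\mathcal{A}^c}) + \langle E^L, W \rangle \qquad (52)
$$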

#### 6.1.3 Main Cases

The following lemma will help us separate the problem into two main cases.

###### Lemma 3.

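Given $E^L$, assume there exists $W_0 \in \mathcal{M}_U$ with $\|W_0\| < 1$ such that $f(E^L, W_0) \geq 0$. Then at least one of the following holds:

• There exists $W^* \in \mathcal{M}_U$ with $\|W^*\| \leq 1$ and $f(E^L, W^*) > 0$.

• For all $W \in \mathcal{M}_U$, $\langle E^L, W \rangle = 0$.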

###### Proof.

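Let $c = 1 - \|W_0\| > 0$. Assume $\langle E^L, W' \rangle \neq 0$ for some $W' \in \mathcal{M}_U$. Since $\langle E^L, W' \rangle$ is linear in $W'$, WLOG, let $\langle E^L, W' \rangle > 0$, $\|W'\| = 1$. Then choose $W^* = W_0 + c W'$. Clearly, $\|W^*\| \leq 1$, $W^* \in \mathcal{M}_U$ and

$$
f(E^L, W^*) = f(E^L, W_0) + \langle E^L, c W' \rangle > f(E^L, W_0) \geq 0 \qquad (53)
$$

Notice that, for all $W \in \mathcal{M}_U$, $\langle E^L, W \rangle = 0$ is equivalent to $E^L \in \mathcal{M}_U^\perp$, which is the orthogonal complement of $\mathcal{M}_U$ in $\mathbb{R}^{n \times n}$. $\mathcal{M}_U^\perp$ has the following simple characterization:

$$
\mathcal{M}_U^\perp = \{X \in \mathbb{R}^{n \times n} : X = U M^T + N U^T \ \text{ for some } M, N \in \mathbb{R}^{n \times t}\} \qquad (54)
$$

In the following discussion, based on Lemma 3, as a first step, in Section 6.2, we’ll show that, under certain conditions, for all nonzero feasible $E^L \in \mathcal{M}_U^\perp$, with high probability (w.h.p.)

$$
g(E^L) = \sum_{l=1}^{t} \frac{1}{k_l} \mathrm{sum}(E^L_{R_{l,l}}) + \lambda\, \mathrm{sum}(E^L_{\mathcal{A}^c}) > 0 \qquad (55)
$$

Secondly, in Section 6.3, we’ll argue that, under certain conditions, there exists a $W \in \mathcal{M}_U$ with $\|W\| < 1$ such that w.h.p. $f(E^L, W) \geq 0$ for all feasible $E^L$. This $W$ is called the dual certificate. Finally, combining these two arguments, we’ll conclude that $(L^0, S^0)$ is the unique optimal solution w.h.p.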

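### 6.2 Solving for $E^L \in \mathcal{M}_U^\perp$ case

In order to simplify the following discussion, we let

$$
\begin{aligned}
g_1(X) &= \sum_{l=1}^{t} \frac{1}{k_l} \mathrm{sum}(X_{R_{l,l}}) && (56)\\
g_2(X) &= \mathrm{sum}(X_{\mathcal{A}^c})
\end{aligned}
$$

so that $g(X) = g_1(X) + \lambda g_2(X)$ in (55). Also let $V = [v_1\ v_2\ \dots\ v_t]$ where $v_i = \sqrt{k_i}\, u_i$. Thus, $V$ is obtained by rescaling the columns of $U$ to make its nonzero entries $1$. Assume $E^L \in \mathcal{M}_U^\perp$. Then, we can write

$$
E^L = V M^T + N V^T \qquad (57)
$$

Let $m_i, n_i$ denote the $i$’th columns of $M, N$ respectively. Notice that $\mathrm{sum}(L^0) = |R|$; hence from (18)

$$
\mathrm{sum}(E^L) \geq 0 \qquad (58)
$$

Similarly, from $L^0 = 1^{n \times n}_R$ and (16) it follows that

$$
\begin{aligned}
& E^L_{R^c} \ \text{ is (entrywise) nonnegative} && (59)\\
& E^L_{R} \ \text{ is nonpositive}
\end{aligned}
$$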

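Now, we list some simple observations regarding the structure of $E^L$. We can write

$$
E^L = \sum_{i=1}^{t} \big(v_i m_i^T + n_i v_i^T\big) = \sum_{i=1}^{t+1} \sum_{j=1}^{t+1} E^L_{R_{i,j}} \qquad (60)
$$

Notice that $E^L_{R_{i,j}}$ is contributed by only two components, which are $v_i m_i^T$ and $n_j v_j^T$.

Let $\{a_{i,c}\}_{c=1}^{k_i}$ be an (arbitrary) indexing of the elements of $C_i$, i.e., $C_i = \{a_{i,1}, \dots, a_{i,k_i}\}$. For a vector $x \in \mathbb{R}^n$, let $x^i \in \mathbb{R}^{k_i}$ denote the vector induced by the entries of $x$ in $C_i$. Basically, for any $1 \leq c \leq k_i$, $x^i_c = x_{a_{i,c}}$. Also, let $E^{i,j} \in \mathbb{R}^{k_i \times k_j}$ be $E^L$ induced by the entries on $R_{i,j}$. In other words,

$$
E^{i,j}_{c,d} = E^L_{a_{i,c},\, a_{j,d}} \quad \text{for any } (i,j) \in C_i \times C_j \text{ and for any } 1 \leq c \leq k_i,\ 1 \leq d \leq k_j \qquad (61)
$$

Basically, $E^{i,j}$ is the same as $E^L_{R_{i,j}}$ when we get rid of the trivial zero rows and zero columns. Then

$$
E^{i,j} = 1^{k_i} \big(m_i^j\big)^T + n_j^i \big(1^{k_j}\big)^T \qquad (62)
$$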

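Clearly, given $\{E^{i,j}\}$, $E^L$ is uniquely determined. Now, assume we fix $\mathrm{sum}(E^{i,j})$ for all $i, j$ and we would like to find the worst $E^L$ subject to these constraints. The variables in such an optimization are $\{m_i^j, n_j^i\}$. Basically we are interested in

$$
\begin{aligned}
\min\ & g(E^L) && (63)\\
\text{subject to}\ & && (64)\\
& \mathrm{sum}(E^{i,j}) = c_{i,j} \ \text{ for all } i, j && (65)\\
& E^{i,j} \ \begin{cases} \text{nonnegative} & \text{if } i \neq j \\ \text{nonpositive} & \text{if } i = j \end{cases} && (66)
\end{aligned}
$$

where $\{c_{i,j}\}$ are constants. Constraint (66) follows from (59). Essentially, based on (58), we would like to show that, with high probability, for any nonzero $E^L$ with $\mathrm{sum}(E^L) \geq 0$ we have $g(E^L) > 0$. Remark: For the special case of $i = j = t+1$, notice that $E^{t+1,t+1} = 0$.

In (63), $g_1(E^L)$ is fixed and equal to $\sum_{i=1}^{t} \frac{1}{k_i} c_{i,i}$. Consequently, based on (56), we just need to do the optimization with the objective $g_2(E^L) = \mathrm{sum}(E^L_{\mathcal{A}^c})$.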

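Let $\beta_{i,j} \subseteq [k_i] \times [k_j]$ be a set of coordinates defined as follows. For any $(c, d) \in [k_i] \times [k_j]$,

$$
(c, d) \in \beta_{i,j} \ \text{ iff } \ (a_{i,c}, a_{j,d}) \in \mathcal{A} \qquad (67)
$$

For $(i_1, j_1) \neq (i_2, j_2)$, $\{m_{i_1}^{j_1}, n_{j_1}^{i_1}\}$ and $\{m_{i_2}^{j_2}, n_{j_2}^{i_2}\}$ are independent variables. Consequently, due to (62), we can partition problem (63) into the following smaller disjoint problems.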

 minmji,nij