Randomized Robust Matrix Completion for the Community Detection Problem

05/25/2018 ∙ by Mostafa Rahmani, et al. ∙ 0

This paper focuses on the unsupervised clustering of large partially observed graphs. We propose a provable randomized framework in which a clustering algorithm is applied to a graphs adjacency matrix generated from a stochastic block model. A sub-matrix is constructed using random sampling, and the low rank component is found using a convex-optimization based matrix completion algorithm. The clusters are then identified based on this low rank component using a correlation based retrieval step. Additionally, a new random node sampling algorithm is presented which significantly improves upon the performance of the clustering algorithm with unbalanced data. Given a partially observed graph with adjacency matrix A ∈ R^N × N , the proposed approach can reduce the computational complexity from O(N^2) to O(N).



There are no comments yet.


page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

Community detection concerns identifying clusters (communities) that are more densely connected in graph-structured data, a central task of many important problems in data mining and modern network science (see [1, 2, 3] and references therein). Real-world applications of the community detection problem abound, including ones in social networking [4, 5], biology (e.g., to study brain connectivity [6]

), and machine learning (e.g., recommendation systems


Various techniques have been proposed to this end including, but not to limited to, approaches leveraging random walks [8]

, spectral clustering

[9], semidefinite programming [10], matrix decomposition [11, 12, 13], and word embedding [14, 15].

As in much of the prior work, we consider graph-structured data randomly generated from the canonical Stochastic Block Model (SBM), a popular probabilistic generative model for random graphs with latent structure (planted clusters) [16, 17]

. The nodes are connected with intra-cluster and inter-cluster probabilities

and , respectively. Due to partial observability, some of the existing edges are unobserved. The fully-observed case, known as correlation clustering, was introduced in [18].

To detect densely connected clusters, the authors in [11] devised a decomposition-based approach [19, 20, 21, 22] to correlation clustering, in which the adjacency matrix of the graph is decomposed as a sum of a low rank and a sparse component. The low rank component captures the non-overlapping full cliques, and the sparse component captures missing edges within clusters and extra edges across clusters. The validity of the low rank plus sparse structure emerges from the fact that , i.e., the higher density of intra-cluster over inter-cluster connections. The use of matrix decomposition was later extended to the partially observed case [12, 13].

Our work is motivated by two main limitations of decomposition-based approaches (and most other community detection algorithms). First, existing algorithms work with the adjacency matrix of the full graph, which limits their scalability to large graph sizes. Second, they often fall short of discovering small clusters in unbalanced graphs (i.e., graphs in which the largest and smallest clusters differ greatly in size).

We build on the previous works of [12, 13]

by presenting and analyzing a randomized scalable framework, which significantly reduces the complexity of the clustering algorithm. Randomized sketching techniques have been instrumental in devising scalable solutions to many high-dimensional unsupervised learning problems, such as Principal Component Analysis (PCA)

[20], matrix decomposition [23, 24, 25]

, outlier detection

[26, 27], low rank matrix approximation [28, 29, 30], data summarization [31, 28, 32], and clustering [33, 34, 35, 36].

There are major distinctions between the approach proposed herein and existing randomized data clustering algorithms. For example, the approach in [37] uses random node sampling, but considers altogether different scenarios in which the data feature matrices are known. In sharp contrast, our setup only assumes availability of similarity and dissimilarity information given the partially observed topology, otherwise no data features are available. The work in [35] targets the graph clustering problem, but uses a very different edge sampling mechanism as opposed to node sampling. We focus on node sampling to obtain balanced sub-graphs. As pointed out earlier, the primary focus of this paper is to devise a scalable solution to the graph clustering problem leveraging the decomposition-based approach presented in [12] along with graph sketching, and to provide robustness to data unbalancedness.

The adjacency matrix is assumed to follow the Planted Partition/Stochastic Block Model [16, 17], with a modification to account for partial observations. This data model, originally found in [12], is defined as follows.

Data Model 1.

The graph consists of nodes partitioned into clusters. Any two nodes within a cluster are connected with probability , and two nodes belonging to different clusters are connected with probability . Any given edge is observed with probability .

Given an adjacency matrix , the decomposition into the low rank and sparse matrices takes the form . The matrix captures the connectivity of the nodes within a cluster and has no inter-cluster edges, while indicates the missing intra-cluster and extra inter-cluster edges. The diagonal elements of are assumed to be all ones. We note that for an ideal adjacency matrix with , and we have and .

I-a Summary of contributions

We develop and analyze a scalable graph clustering algorithm in which matrix recovery and completion are applied to the adjacency matrix of a small random sub-graph obtained from the original graph through random sampling. We establish performance guarantees for the presented randomized approach to yield exact clustering. Moreover, we present a new node sampling method, dubbed Sparsity-based Sampling, which significantly enhances the performance of the clustering algorithm with unbalanced graphs. Setting the sampling probabilities inversely proportional to the node degrees (equivalently, proportional to the sparsity levels of the columns of the adjacency matrix), Sparsity-based Sampling achieves the twin objective of capturing smaller clusters and yielding balanced sketches which enhance the success of the subsequent completion and clustering.

The algorithms in [12, 13] require clusters of minimum size to guarantee a high probability of success, while at the same time requiring the high computational cost of working with the full matrix. In contrast, our proposed random sampling method maintains similar performance guarantees while only requiring samples to be made, where is the size of the smallest cluster (assuming the number of clusters , and the SBM parameters are fixed). Furthermore, the use of Sparsity-based Sampling improves the probability of sampling from the smallest clusters, thus allowing the restriction on minimum cluster size to be relaxed.

Ii Proposed Approach

In this section, we propose a randomized framework which provides a scalable solution for the community detection problem with big data. We focus on the convex-optimization-based method analyzed in [12, 13] which translates the problem of clustering a partially observed graph to one of robust low rank matrix completion. As discussed in Section I

, the two main shortcomings of existing methods we seek to address are the computational complexity and the inability of these algorithms to detect the small clusters in unbalanced graphs. First, the randomized approach proposed is presented and analyzed with uniform random sampling where it is shown to significantly reduce the computational complexity. Subsequently, Sparsity-based Sampling is presented to handle unbalanced high-dimensional data. The proofs of all stated results are deferred to an extended version of this work.

Ii-a Randomized graph clustering using random node sampling

The proposed method is detailed in Algorithm 1. It consists of three main steps: node sampling, sub-graph clustering, and full data clustering.

Ii-A1 Uniform random node sampling

The basic idea underlying the proposed randomized approach is to apply the clustering algorithm to a random sketch rather than the full data. The clustering method is applied to a random sub-graph obtained by sampling nodes. The sub-graph is equivalently represented by a sub-matrix (referred to as the sketch matrix) of the full adjacency matrix . As with the full adjacency matrix, the sketch can be decomposed into low rank and sparse matrices. The random selection is performed using uniform random sampling. It is important to ensure that at least one node is sampled from each cluster, i.e., the columns of span the column space of . The following lemma shows that the sufficient number of random samples is linear with , where is the number of nodes in the smallest cluster.

Lemma 1.

Suppose the adjacency matrix follows Data Model 1. Define as the low rank component of the sketched adjacency matrix, constructed using randomly sampled (uniformly without replacement) nodes. If then the rank of is equal to the rank of with probability at least .

Ii-A2 Robust sub-graph clustering

In the second step, we apply the robust matrix completion algorithm (3) to cluster the sampled nodes by finding the low rank component of . The following lemma, which is based in part on [12, Theorem 4], provides sufficient conditions for the convex matrix completion algorithm to yield the correct .

Lemma 2.

Let , , and , where is a constant real number. If


then the optimal point of (3) yields the exact low rank component of with probability at least , where is a constant real number.

The term can be thought as an indicator of the ‘balancedness’ of the clusters, where a larger value indicates less balance. In fact, for uniform random sampling is the inverse of the probability of sampling from the smallest cluster.

Ii-A3 Full data clustering

In the last step, we compare the connectivity of each node of the full graph to the clusters obtained in the second step from the sketch. For an exact description of the last step, we refer the reader to the third step of Algorithm 1. The operator

returns the observed values of its vector or matrix argument.

Input: Given adjacency matrix
1. Random Node Sampling:

1.1 Form the set consisting of indices of randomly sampled nodes without replacement. Sampling is accomplished using either Random Sampling, or Sparsity-based Sampling.

1.2 Construct as the sub-matrix of corresponding to the sampled nodes.

2. Sub-graph Clustering:

2.1 Define and as the optimal point of


as described in [12]. In the experiments, the problem is solved using .

2.2 Cluster the sub-graph corresponding to using (we use Spectral Clustering in our experiments).

2.3 If is the number of detected clusters, define as the set of columns of , which span the column space of (each vector essentially represents the connectivities within a certain cluster).

3. Full Data Clustering:
Define as the vector of elements of ( column of ) indexed by set . Let be the number of elements in the cluster of the sketch (as identified in Step 2 of the algorithm).
For from 1 to

 Assign the node to the cluster.
End For

Algorithm 1 Structure of Proposed Approach

We can readily state the following theorem which establishes a sufficient condition for Algorithm 1 to yield exact clustering with high probability.

Theorem 3.

Suppose the adjacency matrix follows Data Model 1, and that satisfies (1) and (2). If additionally


then Algorithm 1 exactly clusters the graph with probability at least , where , , and are as defined in Lemma 2.

One can observe that the sufficient number of sampled nodes depends on the size of the graph through the factor . However, if the graph is balanced, then . Thus, for balanced graphs, the sufficient number of randomly sampled nodes is almost independent of the size of the graph.

Ii-B Sparsity-based Sampling

While the clustering algorithm that uses the full-scale data has quadratic computational complexity in the graph size, the complexity of Algorithm 1 is only linear in when the data is balanced. This affords substantial speedup for high-dimensional graphs. Nevertheless, according to Lemma 1, if the data is unbalanced and uniform random sampling is utilized, the sketch will be similarly unbalanced and a large number of samples will be needed to ensure the small clusters are sufficiently represented in the sketch. At the same time, the existing clustering algorithms can hardly capture small clusters when the data is unbalanced, a fact that is also confirmed by results from the theory developed in [12].

In this section, a new sampling method which can yield a more balanced sketch of unbalanced data is presented. The basic idea is to sample the sparser columns of the adjacency matrix, which represent nodes with fewer connections, with a higher probability, hence the appellation ‘Sparsity-based Sampling’. Specifically, the probability of the node being selected is set proportional to , a factor which is used as a measure of sparsity, where denotes the -norm. Here, the entries corresponding to unobserved edges are set to zero. Note that since the diagonal entries of the adjacency matrix are set to one. The sampling probabilities are properly normalized so that they sum up to one.

For ease of exposition, suppose the graph is composed of two clusters with population sizes and (). In the case where the adjacency matrix is clean and complete, i.e., , the following lemma shows that the probabilities of sampling from the different clusters are equal if we use the Sparsity-based Sampling scheme, independent of their relative population sizes.

Lemma 4.

Suppose , , and all edges are observed. Then, using Sparsity-based Sampling, the sampling probabilities from both clusters are equal.

According to Lemma 4, the new sampling algorithm can efficiently yield perfectly balanced sketches. Now, we will show that the Sparsity-based Sampling scheme can improve the balancedness of the sketched data even if the graph-structured data is corrupted.

Recall that we are sampling without replacement. To accomplish this, suppose that the probability of sampling each column is fixed throughout the sampling process. Columns may be sampled more than once, but duplicates will be discarded and not counted towards the budget of samples. Now, let be the probability of sampling from the smallest cluster. For a randomly generated graph, the following lemma places a lower bound on which holds with high probability.

Lemma 5.

Suppose the adjacency matrix follows Data Model 1, has two clusters with sizes , and all the conditions in the statement of Lemma 2 hold. Define


where . If sampling is performed using Sparsity-based Sampling, then with probability at least .

If and are small, then will be close to , indicating that both clusters have a roughly equal chance of being sampled. Contrast this with uniform random sampling, where the probability of sampling from the smallest cluster is exactly . This means that under proper conditions, Sparsity-based Sampling will tend to produce more balanced sketches than uniform random sampling. In Section III, we show through numerical experiments that Sparsity-based Sampling indeed significantly enhances the performance of the randomized approach, and can even outperform the full-scale decomposition of the full adjacency matrix .

Iii Numerical Experiments

A set of numerical simulations are presented to study the performance of the proposed randomized framework. First, it is shown that the proposed approach brings about substantial speedups. Then, an experiment with unbalanced data is presented to showcase the effectiveness of the Sparsity-based Sampling scheme. We show that in some cases the proposed method with Sparsity-based Sampling can even outperform full-scale decomposition. An experiment is considered successful if the algorithm reconstructs the exact low rank matrix

. We estimate the number of clusters using Spectral Clustering


applied to the obtained low rank component. Each point in the presented phase transition results is obtained by averaging over 20 independent runs.

Iii-a Running time

We compare the running time of the proposed randomized method with the full-scale decomposition algorithm. The adjacency matrix follows Data Model 1 with , , , , and . In this experiment, . Table I compares the running time in seconds for different values of . In all cases, both the full-scale and sketching-based decompositions cluster the data accurately, but the randomized method is substantially faster. The main reason is that the matrix completion algorithm is applied to the much smaller sketch matrix. Since the complexity of the last step of Algorithm 1 is linear with , it does not have a significant impact on the run time.

Randomized approach Full-scale decomposition
500 1.2 s 3.3 s
1000 1.2 s 15.0 s
5000 1.2 s 174.0 s
10000 1.3 s 647.0 s
TABLE I: Run time in seconds

Iii-B Clustering unbalanced graphs

In this experiment, we study the performance of the proposed randomized approach with Sparsity-based Sampling. The adjacency matrix adheres to Data Model 1 with , , , and . The graph consists of two small clusters and one dominant cluster, with population sizes and . Fig. 1 compares the phase transition plots of the proposed randomized methods in the sample complexity and minimum cluster size plane. Results with uniform random sampling are shown on the left and with Sparsity-based Sampling on the right. The phase transitions are shown for and . One can observe that the algorithm which uses Sparsity-based Sampling can yield exact clustering even when the data is highly unbalanced in which case the uniform random sampling based algorithm fails. For instance, when , the Sparsity-based Sampling algorithm can yield accurate clustering with only randomly sampled nodes (only 4% of the total number of nodes). The main reason is that the probability of sampling from the small clusters using Sparsity-based Sampling is larger than the corresponding probability with uniform random sampling, which results in a more balanced sketch.

Fig. 1: Phase transition plots for a) uniform random sampling and b) Sparsity-based Sampling. White regions indicate success and black regions failure.

Interestingly, in addition to being significantly faster than the full-scale decomposition algorithm, the proposed randomized approach (with Sparsity-based Sampling) can even outperform full-scale decomposition in terms of success rate when the inter-cluster probability is sufficiently small. We remark that this does not violate the data processing inequality which indicates that post-processing cannot increase information [39]. Rather, the full-scale decomposition algorithm is not robust to data unbalancedness in the sense that it often fails to yield accurate clustering with unbalanced data. For example, consider , , , , with a graph composed of three clusters with population sizes and . Fig. 2a shows the probability of success of the proposed randomized approach and the full-scale decomposition algorithm versus . The proposed method uses 500 sampled nodes (only 10% of the total number nodes). As shown, the full-scale algorithm fails to yield accurate clustering for . Meanwhile, the randomized approach with Sparsity-based Sampling yields exact clustering even when . Fig. 2b shows the phase transition of the randomized approach with Sparsity-based Sampling under the same setup.

Fig. 2: a) Success probability for full-scale decomposition and the randomized approach with Sparsity-based Sampling versus . b) Phase transition with Sparsity-based Sampling.

Iv Conclusion

We presented and analyzed a randomized framework for the community detection problem with partially observed graphs. It was shown that the proposed method can substantially reduce the computational complexity from to with balanced data. We also developed a new node sampling scheme, Sparsity-based Sampling, which efficiently obtains a descriptive sketch of unbalanced graphs. It was shown that the new sampling method can improve the performance of the clustering algorithm by enhancing the balancedness of the data.

Acknowledgment This work was supported in part by NSF CAREER Award CCF-1552497.


  • [1] S. Fortunato and D. Hric, “Community detection in networks: A user guide,” Phys. Rep., vol. 659, pp. 1 – 44, 2016.
  • [2] A. K. Jain, M. N. Murty, and P. J. Flynn, “Data clustering: A review,” ACM Comput. Surv., vol. 31, no. 3, pp. 264–323, Sep. 1999.
  • [3] M. A. Porter, J.-P. Onnela, and P. J. Mucha, “Communities in networks,” Not. AMS, vol. 56, no. 9, Oct. 2009.
  • [4] P. Bedi and C. Sharma, “Community detection in social networks,” Data Mining Knowl. Discov., vol. 6, no. 3, pp. 115–135.
  • [5] N. Mishra, R. Schreiber, I. Stanton, and R. E. Tarjan, “Clustering social networks,” in Proc. 5th Int’l Conf. Algor. Models Web-graph.   Berlin, Heidelberg: Springer-Verlag, 2007, pp. 56–67.
  • [6] C. Nicolini, C. Bordier, and A. Bifone, “Community detection in weighted brain connectivity networks beyond the resolution limit,” ArXiv e-prints, Sep. 2016.
  • [7] J. Stan, F. Muhlenbach, and C. Largeron, “Recommender systems using social network analysis: Challenges and future trends,” in Encyclopedia of Social Network Analysis and Mining.   Springer, 2014, pp. 1–22.
  • [8] P. Pons and M. Latapy, “Computing communities in large networks using random walks,” in Proc. 20th Int. Symp. Comput. Inform. Sci.   Springer, 2005, pp. 284–293.
  • [9] U. Von Luxburg, “A tutorial on spectral clustering,” Stat. Comput., vol. 17, no. 4, pp. 395–416, 2007.
  • [10] B. Hajek, Y. Wu, and J. Xu, “Semidefinite programs for exact recovery of a hidden community,” in J. Mach. Learn. Res., vol. 49, 23–26 Jun 2016, pp. 1051–1095.
  • [11] S. Oymak and B. Hassibi, “Finding dense clusters via “low rank + sparse” decomposition,” ArXiv e-prints, Apr. 2011.
  • [12] Y. Chen, A. Jalali, S. Sanghavi, and H. Xu, “Clustering partially observed graphs via convex optimization,” J. Mach. Learn. Res., vol. 15, no. 1, pp. 2213–2238, Jan. 2014.
  • [13] R. Korlakai Vinayak, S. Oymak, and B. Hassibi, “Graph clustering with missing data: Convex algorithms and analysis,” in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 2996–3004.
  • [14] V. W. Zheng, S. Cavallari, H. Cai, K. Chen-Chuan Chang, and E. Cambria, “From node embedding to community embedding,” ArXiv e-prints, Oct. 2016.
  • [15] W. Ding, C. Lin, and P. Ishwar, “Node embedding via word embedding for network community discovery,” IEEE Trans. Signal Inf. Process. Netw., vol. 3, no. 3, pp. 539–552, Sept 2017.
  • [16] P. W. Holland, K. B. Laskey, and S. Leinhardt, “Stochastic blockmodels: First steps,” Social Networks, vol. 5, no. 2, pp. 109 – 137, 1983.
  • [17] A. Condon and R. M. Karp, “Algorithms for graph partitioning on the planted partition model,” Random Struct. Algor., vol. 18, no. 2, pp. 116–140, 2001.
  • [18] N. Bansal, A. Blum, and S. Chawla, “Correlation clustering,” Mach. Learn., vol. 56, no. 1, pp. 89–113, Jul 2004.
  • [19] V. Chandrasekaran, S. Sanghavi, P. A. Parrilo, and A. S. Willsky, “Rank-sparsity incoherence for matrix decomposition,” SIAM J. Optim., vol. 21, no. 2, pp. 572–596, 2011.
  • [20] E. J. Candès, X. Li, Y. Ma, and J. Wright, “Robust principal component analysis?” J. ACM, vol. 58, no. 3, pp. 11:1–11:37, Jun. 2011.
  • [21] D. Hsu, S. M. Kakade, and T. Zhang, “Robust matrix decomposition with sparse corruptions,” IEEE Trans. Inf. Theory, vol. 57, no. 11, pp. 7221–7234, Nov 2011.
  • [22] Y. Chen, A. Jalali, S. Sanghavi, and C. Caramanis, “Low-rank matrix recovery from errors and erasures,” IEEE Trans. Inf. Theory, vol. 59, no. 7, pp. 4324–4337, July 2013.
  • [23] M. Rahmani and G. K. Atia, “High dimensional low rank plus sparse matrix decomposition,” IEEE Trans. Signal Process., vol. 65, no. 8, pp. 2004–2019, April 2017.
  • [24] M. Rahmani and G. Atia, “A subspace learning approach for high dimensional matrix decomposition with efficient column/row sampling,” in Proc. 33rd Int. Conf. Mach. Learn., 2016, pp. 1206–1214.
  • [25] L. W. Mackey, M. I. Jordan, and A. Talwalkar, “Divide-and-conquer matrix factorization,” in Proc. Adv. Neural Inf. Process. Syst, 2011, pp. 1134–1142.
  • [26] M. Rahmani and G. Atia, “Randomized robust subspace recovery and outlier detection for high dimensional data matrices,” IEEE Trans. Signal Process., vol. 65, no. 6, March 2017.
  • [27] X. Li and J. Haupt, “Identifying outliers in large matrices via randomized adaptive compressive sampling,” IEEE Trans. Signal Process., vol. 63, no. 7, pp. 1792–1807, 2015.
  • [28] N. Halko, P.-G. Martinsson, and J. A. Tropp, “Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions,” SIAM review, vol. 53, no. 2, pp. 217–288, 2011.
  • [29] M. Gu and S. C. Eisenstat, “Efficient algorithms for computing a strong rank-revealing QR factorization,” SIAM J. Sci. Comput., vol. 17, no. 4, pp. 848–869, 1996.
  • [30] N. H. Nguyen, T. T. Do, and T. D. Tran, “A fast and efficient algorithm for low-rank approximation of a matrix,” in Proc. 41st ACM Symp. Theory Comput., 2009, pp. 215–224.
  • [31] M. Rahmani and G. K. Atia, “Spatial random sampling: A structure-preserving data sketching tool,” IEEE Signal Process. Lett., vol. 24, no. 9, pp. 1398–1402, 2017.
  • [32] N. Ailon and B. Chazelle, “Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform,” in Proc. 38th ACM Symp. Theory Comput., 2006, pp. 557–563.
  • [33]

    S. Ben-David, “A framework for statistical clustering with constant time approximation algorithms for k-median and k-means clustering,”

    Mach. Learn., vol. 66, no. 2-3, pp. 243–257, 2007.
  • [34] A. Czumaj and C. Sohler, “Sublinear-time approximation for clustering via random sampling,” in Proc. 31st ICALP.   Springer, 2004, pp. 396–407.
  • [35] S.-Y. Yun and A. Proutiere, “Community detection via random and adaptive sampling,” in Proc. Conf. Learn. Theory, 2014, pp. 138–175.
  • [36] I. Kirianovskii, O. Granichin, and A. Proskurnikov, “A new randomized algorithm for community detection in large networks,” IFAC-PapersOnLine, vol. 49, no. 13, pp. 31–35, 2016.
  • [37] K. Voevodski, M.-F. Balcan, H. Roglin, S.-H. Teng, and Y. Xia, “Efficient clustering with limited distance information,” arXiv preprint arXiv:1009.5168, 2010.
  • [38] E. Elhamifar and R. Vidal, “Sparse subspace clustering: Algorithm, theory, and applications,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 11, pp. 2765–2781, Nov 2013.
  • [39] T. M. Cover and J. A. Thomas, Elements of Information Theory.   Wiley-Interscience, 2006.