1 Introduction
Graph clustering is one of the most fundamental tasks in artificial intelligence and machine learning
[Giatsidis et al.2014, Tian et al.2014, Anagnostopoulos et al.2016]. Given a graph consisting of a node set and an edge set, graph clustering asks to partition the graph nodes into clusters such that nodes within the same cluster are “densely connected” by graph edges, while nodes in different clusters are “loosely connected”. Graph clustering on modern large-scale graphs imposes high computational and storage requirements, which are too expensive, if not impossible, to meet on a single machine. In contrast, distributed computing clusters and server storage are a popular and economical way to meet these requirements. Distributed graph clustering has thus received considerable research interest [Hui et al.2007, Yang and Xu2015, Chen et al.2016, Sun and Zanetti2017]. However, the dynamic nature of modern graphs makes the clustering problem even more challenging. We discuss several motivating examples and their characteristics as follows.

Citation Networks. Graph clustering on citation networks aims to generate groups of papers/manuscripts/patents with many similar citations. This implies that the authors within each cluster share similar research interests. The clustering results can be useful for recommending research collaboration, e.g., in ResearchGate. Large-scale citation networks, e.g., the US patent citation network (1963-1999) (https://snap.stanford.edu/data/cit-Patents.html), contain millions of patents and tens of millions of citations, and they are dynamic with frequent insertions. New papers are published every day, with new citations to be added to the network graph. Citation networks usually have negligible deletions because very few works get revoked.
Large Images.
Image segmentation is a fundamental task in computer vision
[Arbelaez et al.2011]. Graph-based image segmentation has been studied extensively [Shi and Malik2000, Maier, Luxburg, and Hein2009, Kim et al.2011]. In these methods, each pixel is mapped to a node in a high-dimensional space (considering coordinates and intensity), which is then connected to its nearest neighbors. In many applications, such as astronomy and microscopy, high-resolution images are captured at extremely large sizes, up to gigapixels. Segmentation of these images usually requires pipelining, e.g., with deblurring as a preprocessing step, so new pixels can be added for image segmentation over time. Similar to citation networks, pixels and their edges are not deleted once they are inserted into the images.

Web Graphs. In a web graph with web pages as nodes and hyperlinks between pages as edges, web pages within the same community are usually densely connected. Clustering results on a web graph can be helpful for eliminating duplicates and recommending related pages. As of July 2018, there were over 46 billion web pages on the WWW [Worldwidewebsize2018], and this size grows fast as new web pages are constantly crawled over time. Deletions of web pages are much less frequent and more difficult to discover than insertions. In some cases, deleted web pages are still kept in web graphs for analytic purposes.
All these examples require effective ways to cluster over large-scale dynamic graphs, when node/edge insertions/deletions are observed distributively and over time. For notational convenience, we assume that we know an estimated total number of nodes in the graphs, so that node insertions and deletions can be treated as insertions/deletions of their incident edges. Since deletions seldom happen, we first consider only node/edge insertions, and then discuss in detail how to handle a small number of deletions. Formally, there are $s$ distributed remote sites and a coordinator. At each time point $t \in \{1, \ldots, T\}$, each of these sites $i$ observes a graph update stream $S_i(t)$, defining the local graph $G_i(t)$ observed up to the time point $t$, and these sites cooperate with the coordinator to generate a graph clustering over the global graph $G(t) = \bigcup_{i=1}^{s} G_i(t)$. For simplicity, edge weights cannot be updated, but an edge can be observed at different sites. We illustrate the problem by an example in Fig. 1.

For distributed systems, communication cost is one of the major performance measures we aim to optimize. In this paper, we consider two well-established communication models from the multiparty communication literature [Phillips, Verbin, and Zhang2016], namely the message passing and the blackboard models. In the former model, there is a communication channel between each of the $s$ remote sites and a distinguished coordinator. Each site can send a message to another site by first sending it to the coordinator, who then forwards the message to the destination. In the latter model, there is a broadcast channel to which a message sent is visible to all sites. Note that both models abstract away issues of message delay, synchronization and loss, and assume that each message is delivered immediately. These assumptions can be removed by using the standard techniques of timestamping, acknowledgements and resending, respectively. We measure communication cost in terms of the total number of bits communicated.
Unfortunately, existing graph clustering algorithms do not work well for the problem we consider. To illustrate the challenge, we discuss two natural methods, central (CNTRL) and static (ST). For every time point $t \in \{1, \ldots, T\}$, CNTRL centralizes all graph updates that are distributively arriving and then applies any centralized graph clustering algorithm. However, the total communication cost of CNTRL is very high, especially when the number of edges is very large. On the other hand, for every time point $t$, ST applies any distributed static graph clustering algorithm on the current graph $G(t)$, thus adapting it to the distributed dynamic setting. According to [Chen et al.2016], the lower bounds on communication cost for distributed graph clustering in the message passing and the blackboard models are $\Omega(ns)$ and $\Omega(n+s)$, respectively, where $n$ is the number of nodes in the graph and $s$ is the number of sites. Summing over $T$ time points, the total communication costs of ST are $\Omega(nsT)$ and $\Omega((n+s)T)$, resp., which can be very high, especially when $T$ is very large. Therefore, designing new algorithms for distributed dynamic graph clustering is significant and challenging, because of the scarcity of valid algorithms.
Contribution. The contributions of our work are summarized as follows.

For the message passing model, we analyze the drawback of ST and propose an algorithm framework named Distributed Dynamic Clustering Algorithm with Monotonicity Property (DCAMP), which significantly reduces the total communication cost to $\tilde{O}(ns)$ for an $n$-node graph distributively observed at $s$ sites over a time interval $[T]$. Any spectral sparsification algorithm (formally introduced in Sec. 2) satisfying the monotonicity property can be used in DCAMP to achieve this communication cost.

We propose an algorithm named Distributed Dynamic Clustering Algorithm for the BLackboard model (DCABL) with communication cost $\tilde{O}(n+s)$ by adapting the online spectral sparsification algorithm of [Cohen, Musco, and Pachocki2016]. DCABL is also a new static distributed graph clustering algorithm with nearly-optimal communication cost, matching the state of the art [Chen et al.2016] based on the iterative sampling approach [Li, Miller, and Peng2013]. However, it is much simpler and also works for the more complicated distributed dynamic setting.

More importantly, we show that the communication costs of DCAMP and DCABL match their lower bounds $\Omega(ns)$ and $\Omega(n+s)$ up to polylogarithmic factors, respectively. We then prove that, at every time point, DCAMP and DCABL generate clustering results of quality nearly as good as CNTRL.

Finally, we have conducted extensive experiments on both synthetic and real-world networks to compare DCAMP and DCABL with CNTRL and ST. The results show that our algorithms achieve communication costs significantly smaller than those of the baselines, while generating nearly the same clustering results.
Related Work. Geometric clustering has been studied by [Cormode, Muthukrishnan, and Wei2007] in the distributed dynamic setting. They presented an algorithm for k-center clustering with theoretical bounds on the clustering quality and the communication cost. However, it does not apply to graph clustering. There has been extensive research on graph clustering in the distributed setting [Hui et al.2007, Yang and Xu2015, Chen et al.2016, Sun and Zanetti2017], where the graph is static (does not change over time) but distributed. [Yang and Xu2015] proposed a divide-and-conquer method for distributed graph clustering. [Chen et al.2016] used spectral sparsifiers in graph clustering for two distributed communication models to reduce communication cost. [Sun and Zanetti2017] presented a node-degree-based sampling scheme for distributed graph clustering; their method does not need to compute approximate effective resistances. However, as discussed earlier, all these methods suffer from very high communication costs, depending on the time duration, and thus cannot be used in the studied distributed dynamic clustering. Independently, [Jian, Lian, and Chen2018] studied distributed community detection on dynamic social networks. However, their algorithm is not optimized for communication cost, focuses on finding overlapping clusters, and only accepts unweighted graphs. In contrast, our algorithms are optimized for communication cost; they generate non-overlapping clusters and process both weighted and unweighted graphs.
2 The Proposed Algorithms
We first introduce spectral sparsification, which we will use in the subsequent algorithm design. Recall that the message passing communication model represents distributed systems with point-to-point communication, while the blackboard model represents distributed systems with a broadcast channel, which can be used to broadcast a message to all sites. We then propose two algorithms for these different practical scenarios in Sec. 2.1 and 2.2, respectively.
Graph Sparsification. In this paper, we consider weighted undirected graphs $G = (V, E, w)$, and we use $n = |V|$ and $m = |E|$ to denote the numbers of nodes and edges in $G$, respectively. Graph sparsification is the procedure of constructing sparse subgraphs of the original graphs such that certain important properties of the original graphs are well approximated. For instance, a subgraph $H$ of $G$ is called an $\alpha$-spanner of $G$ if for every $u, v \in V$, the shortest distance $d_H(u, v)$ between $u$ and $v$ in $H$ is at most $\alpha$ times their distance $d_G(u, v)$ in $G$ [Peleg and Schaffer1989]. Let $A$ be the adjacency matrix of $G$, that is, $A(u, v) = w(u, v)$ if $(u, v) \in E$ and zero otherwise. Let $D$ be the degree matrix of $G$ defined as $D(u, u) = \sum_{v} w(u, v)$, and zero otherwise. Then the unnormalized Laplacian matrix and normalized Laplacian matrix of $G$ are defined as $L = D - A$ and $\mathcal{L} = D^{-1/2} L D^{-1/2}$, resp. [Spielman and Teng2011] introduced spectral sparsification: a $(1+\epsilon)$-spectral sparsifier for $G$ is a (reweighted) subgraph $H$ of $G$ such that for every $x \in \mathbb{R}^{n}$, the inequality $(1-\epsilon)\, x^{\top} L_G x \le x^{\top} L_H x \le (1+\epsilon)\, x^{\top} L_G x$ holds. There is a rich literature on improving the trade-off between the size of spectral sparsifiers and the construction time, e.g., [Spielman and Srivastava2011, Zhu, Liao, and Orecchia2015, Lee and Sun2017]. Recently, [Lee and Sun2017] proposed a state-of-the-art algorithm that constructs a spectral sparsifier of optimal size $O(n/\epsilon^{2})$ (up to a constant factor) in nearly linear time.
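To make these definitions concrete, the following minimal numpy sketch (ours, not part of any cited algorithm) builds both Laplacians from an edge list and sanity-checks the sparsifier inequality on random test vectors. Random vectors give only a necessary condition, since the guarantee must hold for all $x$.

```python
import numpy as np

def laplacians(edges, weights, n):
    """Build the unnormalized (L = D - A) and normalized (D^-1/2 L D^-1/2) Laplacians."""
    A = np.zeros((n, n))
    for (u, v), w in zip(edges, weights):
        A[u, v] += w
        A[v, u] += w
    deg = A.sum(axis=1)
    L = np.diag(deg) - A
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.clip(deg, 1e-12, None)))
    return L, d_inv_sqrt @ L @ d_inv_sqrt

def looks_like_sparsifier(L_G, L_H, eps, trials=1000, seed=0):
    """Test (1-eps) x'L_G x <= x'L_H x <= (1+eps) x'L_G x on random vectors x."""
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        x = rng.standard_normal(L_G.shape[0])
        qg, qh = x @ L_G @ x, x @ L_H @ x
        if not ((1 - eps) * qg - 1e-9 <= qh <= (1 + eps) * qg + 1e-9):
            return False
    return True
```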
2.1 The Message Passing Model
Because spectral sparsifiers have far fewer edges than the original graphs but preserve the cut-based clustering and spectrum information of the original graphs [Spielman and Srivastava2011], we propose an algorithm framework as follows. At each time point $t$, each site $i$ first constructs a spectral sparsifier $H_i(t)$ for its local graph $G_i(t)$, and then transmits the much smaller $H_i(t)$, instead of $G_i(t)$ itself, to the coordinator. Upon receiving the spectral sparsifier from every site at time $t$, the coordinator first takes their union $H(t) = \bigcup_{i=1}^{s} H_i(t)$ and then applies a standard centralized graph clustering algorithm, e.g., the spectral clustering algorithm
[Ng, Jordan, and Weiss2001], on $H(t)$ to get the clustering $C(t)$; a sketch of this clustering step is given below. This process is repeated at the next time point $t+1$ to get the clustering $C(t+1)$, until $t = T$.

However, simply reconstructing spectral sparsifiers from scratch at every time point does not provide any bound on the size of the updates to the previous spectral sparsifiers needed to obtain $H_i(t)$ at every time point $t$, and thus each site needs to communicate its entire spectral sparsifier of size $\tilde{O}(n)$ at every time point. Summing over all $s$ sites and all $T$ time points, the total communication cost is $\tilde{O}(nsT)$.
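For concreteness, here is a minimal numpy/scipy sketch of the clustering step at the coordinator, in the style of [Ng, Jordan, and Weiss2001]; the dense eigensolver and the k-means routine are our illustrative choices, not part of the framework itself.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def spectral_clustering(L_norm, k):
    """Embed nodes with the k bottom eigenvectors of the normalized Laplacian,
    row-normalize, then run k-means, following [Ng, Jordan, and Weiss 2001]."""
    _, vecs = np.linalg.eigh(L_norm)          # eigenvalues in ascending order
    U = vecs[:, :k]                           # spectral embedding of the nodes
    U = U / np.clip(np.linalg.norm(U, axis=1, keepdims=True), 1e-12, None)
    _, labels = kmeans2(U, k, minit='++')     # cluster the embedded points
    return labels                             # labels[v] = cluster of node v
```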
It is natural to consider algorithms for dynamically maintaining spectral sparsifiers in dynamic computational models [Abraham et al.2016, Kelner and Levin2013, Kapralov et al.2014]. Unfortunately, applying them also does not provide such a bound, incurring the same communication cost! To see this, note that the core of (algorithms in) dynamic computational models is a data structure for dynamically maintaining the result of a computation while the underlying input data are updated periodically. For instance, dynamic algorithms [Abraham et al.2016] are allowed, after each update to the input data, to process the update to compute the new result quickly; online algorithms [Kelner and Levin2013] process input data that are revealed step by step; and streaming algorithms [Kapralov et al.2014] impose a space constraint while processing input data that are revealed step by step. The main principle of all these computational models is to efficiently process the dynamically changing input data, not to bound the size of the updates to the previously output result over time.
We define a new type of spectral sparsification algorithm that does provide such a bound, as follows.
Definition 1.
For an $n$-node graph $G = (V, E)$, let $G_j = (V, E_j)$ be the graph consisting of the first $j$ edges of $G$. A spectral sparsification algorithm is called a Spectral Sparsification Algorithm with Monotonicity Property (SAMP) if the spectral sparsifiers $H_1, H_2, \ldots, H_m$, constructed for $G_1, G_2, \ldots, G_m$, respectively, satisfy that (1) $H_1 \subseteq H_2 \subseteq \cdots \subseteq H_m$; and (2) $H_m$ has size $\tilde{O}(n)$.
We show that, by using any SAMP in the algorithm framework mentioned above, we can reduce the total communication cost from $\tilde{O}(nsT)$ to $\tilde{O}(ns)$, removing a factor of $T$. We refer to the resultant algorithm framework as Distributed Dynamic Clustering Algorithm with Monotonicity Property (DCAMP). The intuition for the significant reduction in the total communication cost is that the monotonicity property guarantees that, for every time point $t$, the constructed spectral sparsifier $H_i(t)$ is a superset of $H_i(t-1)$ at the previous time point $t-1$. Then we only need to transmit the edges that are in $H_i(t)$ but not in $H_i(t-1)$ to the coordinator for maintaining $H(t)$; a sketch of this delta transmission follows below. Every bit transmitted at time point $t$ is used at all subsequent time points, and thus no communication is “wasted”. Furthermore, we show that by merely switching from an arbitrary spectral sparsification algorithm to a SAMP, the total communication cost achieved is already optimal, up to a polylogarithmic factor. That is, no algorithm can achieve communication cost smaller than that of DCAMP by more than a polylogarithmic factor.
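A minimal sketch of the delta transmission, assuming edges are represented as hashable tuples and some SAMP supplies the monotone sparsifiers; the class and function names are ours, for illustration only.

```python
class Site:
    """A site ships only the newly sampled sparsifier edges at each time point."""
    def __init__(self):
        self.prev = set()                 # H_i(t-1); empty at t = 0

    def delta(self, sparsifier_edges):
        """Given H_i(t) from a SAMP (a superset of H_i(t-1)), send H_i(t) \\ H_i(t-1)."""
        new = set(sparsifier_edges) - self.prev
        self.prev |= new
        return new                        # O~(n) edges in total over ALL time points

class Coordinator:
    """Maintains H(t), the union of all sites' sparsifiers, incrementally."""
    def __init__(self):
        self.union = set()

    def receive(self, delta_edges):
        self.union |= delta_edges         # every received edge is reused forever
```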
We summarize the results in Theorem 3. For every node set $S$ in $G$, let its volume and conductance be $\mathrm{vol}(S) = \sum_{u \in S, v \in V} w(u, v)$ and $\phi(S) = \sum_{u \in S, v \notin S} w(u, v) / \mathrm{vol}(S)$, respectively. Intuitively, a small value of conductance implies that the nodes in $S$ are likely to form a cluster. A collection $\{A_1, \ldots, A_k\}$ of subsets of nodes is called a ($k$-way) partition of $G$ if (1) $A_i \cap A_j = \emptyset$ for $i \neq j$; and (2) $\bigcup_{i=1}^{k} A_i = V$. The $k$-way expansion constant is defined as $\rho(k) = \min_{\text{partition } A_1, \ldots, A_k} \max_{1 \le i \le k} \phi(A_i)$. The eigenvalues of $\mathcal{L}_G$ are denoted as $\lambda_1(\mathcal{L}_G) \le \cdots \le \lambda_n(\mathcal{L}_G)$. The high-order Cheeger inequality shows that $\lambda_k(\mathcal{L}_G)/2 \le \rho(k) \le O(k^3)\sqrt{\lambda_k(\mathcal{L}_G)}$ [Lee, Gharan, and Trevisan2014]. A lower bound on the gap $\Upsilon(k) = \lambda_{k+1}(\mathcal{L}_G)/\rho(k)$ implies that $G$ has exactly $k$ well-defined clusters [Peng, Sun, and Zanetti2015]. This is because a large gap between $\lambda_{k+1}(\mathcal{L}_G)$ and $\rho(k)$ guarantees the existence of a $k$-way partition with bounded conductance $\rho(k)$, while any $(k+1)$-way partition contains a subset with significantly higher conductance compared with $\rho(k)$. For any two sets $X$ and $Y$, the symmetric difference of $X$ and $Y$ is defined as $X \triangle Y = (X \setminus Y) \cup (Y \setminus X)$. To prove Theorem 3, we will use the following lemma and theorems.
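As a companion to these definitions, a small numpy sketch (function name ours) computing the volume and conductance of a node subset from a weighted adjacency matrix:

```python
import numpy as np

def volume_and_conductance(A, S):
    """vol(S) = total edge weight incident to S; phi(S) = w(S, V\\S) / vol(S)."""
    in_S = np.zeros(A.shape[0], dtype=bool)
    in_S[list(S)] = True
    vol = A[in_S].sum()                   # sum of (weighted) degrees of nodes in S
    cut = A[np.ix_(in_S, ~in_S)].sum()    # weight of edges crossing the cut (S, V\S)
    return vol, (cut / vol if vol > 0 else 1.0)
```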
Lemma 1.
[Chen et al.2016] Let $H$ be a $(1+\epsilon)$-spectral sparsifier of $G$ for some sufficiently small $\epsilon$. For all node sets $S \subseteq V$, the inequality $(1 - O(\epsilon))\, \phi_G(S) \le \phi_H(S) \le (1 + O(\epsilon))\, \phi_G(S)$ holds.
Theorem 1.
[Chen et al.2016] Let $G$ be an $n$-node graph whose edges are distributed among $s$ sites. Any algorithm that correctly outputs a constant fraction of each cluster in $G$ requires $\Omega(ns)$ bits of communication.
Theorem 2.
[Peng, Sun, and Zanetti2015] Given a graph $G$ with $\Upsilon(k) = \Omega(k^3)$ and an optimal partition $P_1, \ldots, P_k$ achieving $\rho(k)$ for some positive integer $k$, the spectral clustering algorithm can output a partition $A_1, \ldots, A_k$ such that, for every $1 \le i \le k$, the inequality $\mathrm{vol}(A_i \triangle P_i) = O(k^3\, \Upsilon^{-1}\, \mathrm{vol}(P_i))$ holds.
Theorem 3 (The Message Passing model).
For every time point $t \in [T]$, suppose that $G(t)$ satisfies $\Upsilon(k) = \Omega(k^3)$ and there is an optimal partition $P_1(t), \ldots, P_k(t)$ that achieves $\rho(k)$ for some positive integer $k$. Then DCAMP can output a partition $A_1(t), \ldots, A_k(t)$ at the coordinator such that for every $1 \le i \le k$, $\mathrm{vol}(A_i(t) \triangle P_i(t)) = O(k^3\, \Upsilon^{-1}\, \mathrm{vol}(P_i(t)))$ holds. Summing over all $T$ time points, the total communication cost is $\tilde{O}(ns)$ bits. It is optimal up to a polylogarithmic factor.
Proof.
We start by proving that for every time point $t$, the structure $H(t)$ constructed at the coordinator is a spectral sparsifier of the graph $G(t)$ received up to the time point $t$. By the monotonicity property of a SAMP, for every $i \in [s]$, $H_i(t)$ is a spectral sparsifier of the local graph $G_i(t)$. The decomposability of spectral sparsifiers states that the union of spectral sparsifiers of some graphs is a spectral sparsifier for the union of the graphs [Sun and Zanetti2017]. Then, by this property, the union $H(t) = \bigcup_{i=1}^{s} H_i(t)$ obtained at the coordinator is a spectral sparsifier of the graph $G(t) = \bigcup_{i=1}^{s} G_i(t)$.
Now we prove that for every time point $t$, if $G(t)$ satisfies $\Upsilon(k) = \Omega(k^3)$, then $H(t)$ also satisfies $\Upsilon(k) = \Omega(k^3)$. By the definition of $\Upsilon(k) = \lambda_{k+1}(\mathcal{L})/\rho(k)$, it suffices to prove that $\rho_{H(t)}(k) = O(\rho_{G(t)}(k))$ and $\lambda_{k+1}(\mathcal{L}_{H(t)}) = \Omega(\lambda_{k+1}(\mathcal{L}_{G(t)}))$. The former follows from the fact that for every node set $S$, the inequality $(1 - O(\epsilon))\, \phi_{G(t)}(S) \le \phi_{H(t)}(S) \le (1 + O(\epsilon))\, \phi_{G(t)}(S)$ holds, according to Lemma 1. For the latter, by the definition of spectral sparsifier and simple math, it holds for every vector $x \in \mathbb{R}^n$ that $x^{\top} L_{H(t)} x = (1 \pm \epsilon)\, x^{\top} L_{G(t)} x$. By the definition of the normalized graph Laplacian $\mathcal{L} = D^{-1/2} L D^{-1/2}$, and the fact that the degrees in $H(t)$ approximate the corresponding degrees in $G(t)$ within a factor of $(1 \pm \epsilon)$, we have that for every $x$, $x^{\top} \mathcal{L}_{H(t)} x = \Omega(x^{\top} \mathcal{L}_{G(t)} x)$, which implies that $\lambda_{k+1}(\mathcal{L}_{H(t)}) = \Omega(\lambda_{k+1}(\mathcal{L}_{G(t)}))$. Then we can apply the spectral clustering algorithm on $H(t)$ to get the desired properties, according to Theorem 2.
For the upper bound on the communication cost, by the monotonicity property of a SAMP, each site only needs to transmit $\tilde{O}(n)$ edges over all time points. Summing over all $s$ sites, the total communication cost is $\tilde{O}(ns)$ bits.
For the lower bound, we show the following statement. For every time point $t \in [T]$, suppose $G(t)$ satisfies $\Upsilon(k) = \Omega(k^3)$ and there is an optimal partition $P_1(t), \ldots, P_k(t)$ that achieves $\rho(k)$ for a positive integer $k$, and suppose that in the message passing model there is an algorithm that can output $A_1(t), \ldots, A_k(t)$ at the coordinator such that for every $1 \le i \le k$, $\mathrm{vol}(A_i(t) \triangle P_i(t)) = O(k^3\, \Upsilon^{-1}\, \mathrm{vol}(P_i(t)))$ holds. Then the algorithm requires total communication cost $\Omega(ns)$ bits over the $T$ time points.
Consider any time point $t$. We assume towards a contradiction that there exists an algorithm that can output $A_1(t), \ldots, A_k(t)$ at the coordinator, such that for every $1 \le i \le k$, $\mathrm{vol}(A_i(t) \triangle P_i(t)) = O(k^3\, \Upsilon^{-1}\, \mathrm{vol}(P_i(t)))$ holds, using $o(ns)$ bits of communication. Then the algorithm could be used to solve the corresponding graph clustering problem in the distributed but static setting using $o(ns)$ bits of communication. This contradicts Theorem 1 and completes the proof. ∎
Combining Theorems 2 and 3, DCAMP generates clusterings of quality asymptotically the same as CNTRL. We stress that the monotonicity property in general can be helpful for improving the communication efficiency over distributed dynamic graphs. In Sec. 3, we will discuss another application that also benefits from this property.
As mentioned earlier, any SAMP can be plugged into DCAMP, e.g., the online sampling technique [Cohen, Musco, and Pachocki2016]. The resultant algorithm, however, is a randomized algorithm that succeeds w.h.p., because the constructed subgraphs are spectral sparsifiers w.h.p. Another SAMP is the online-BSS algorithm [Batson, Spielman, and Srivastava2012, Cohen, Musco, and Pachocki2016], which has a slightly smaller communication cost (by a logarithmic factor) but requires more memory and is more complicated.
2.2 The Blackboard Model
How to efficiently exploit the broadcast channel in the blackboard model to reduce the communication complexity of distributed graph clustering is non-trivial. For example, [Chen et al.2016] proposed to construct spectral sparsifiers as a chain in the blackboard based on the iterative sampling technique [Li, Miller, and Peng2013]. Each spectral sparsifier in the chain is a spectral sparsifier of the one following it. However, the technique fails to extend to the dynamic setting, as each graph update can incur a large number of updates to the maintained spectral sparsifiers, especially those in the latter part of the chain.
We propose a simple algorithm called Distributed Dynamic Clustering Algorithm for the BLackboard model (DCABL), based on adapting Cohen et al.'s online row sampling algorithm [Cohen, Musco, and Pachocki2016]. The basic idea is that all sites cooperate to construct a spectral sparsifier for $G(t)$ in the blackboard at each time point $t$.
The edge-node incidence matrix $B \in \mathbb{R}^{m \times n}$ of $G$ is defined as $B(e, u) = 1$ if $u$ is $e$'s head, $B(e, v) = -1$ if $v$ is $e$'s tail, and zero otherwise. At the beginning, the parameters $\epsilon$ and $c$ of the algorithm are set by a distinguished site and then sent to every site, and the blackboard holds an empty spectral sparsifier, or equivalently an empty incidence matrix $\tilde{B}$ of dimension $0 \times n$. Consider the time point $t$. Suppose that at the previous time point $t-1$, the incidence matrix $\tilde{B}(t-1)$ for $H(t-1)$ was in the blackboard. For each newly observed edge $e$ at the time point $t$, the site observing $e$ computes the online ridge leverage score $l_e = b_e^{\top} (\tilde{B}^{\top} \tilde{B} + \lambda I)^{-1} b_e$ by accessing the incidence matrix $\tilde{B}$ currently in the blackboard, where $b_e$ is an $n$-dimensional vector with all zeroes except that the entries corresponding to $e$'s head and tail are 1 and -1, resp., and $\lambda$ is a regularization parameter.

Let the sampling probability be $p_e = \min\{c\, l_e\, \epsilon^{-2} \log n,\ 1\}$. With probability $p_e$, $e$ is sampled, and it is discarded otherwise. If $e$ is sampled, the site transmits the rescaled vector $b_e / \sqrt{p_e}$ corresponding to $e$ to the blackboard to append it at the end of $\tilde{B}$. After all the newly observed edges at the time point $t$ at all the sites are processed, $\tilde{B}(t)$ for $H(t)$ will be in the blackboard. Then the coordinator applies any standard graph clustering algorithm, e.g., [Ng, Jordan, and Weiss2001], on $H(t)$ to get the clustering $C(t)$. The process is repeated for every subsequent time point until $t = T$. The algorithm is summarized in Alg. 1.
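A minimal numpy sketch of one site's sampling step follows. The constants EPS, C and the ridge term LAM are illustrative assumptions: [Cohen, Musco, and Pachocki2016] set the regularization as a function of $\epsilon$ and the spectrum of the full matrix, and a real implementation would use a fast solver rather than a dense linear solve.

```python
import numpy as np

EPS, C, LAM = 0.5, 1.0, 1e-3   # assumed accuracy, oversampling, and ridge constants

def process_edge(u, v, B_tilde, n, rng):
    """Online row sampling step for edge e = (u, v) against the blackboard B_tilde."""
    b_e = np.zeros(n)
    b_e[u], b_e[v] = 1.0, -1.0
    # Online ridge leverage score: l_e = b_e' (B'B + lam*I)^{-1} b_e, capped at 1.
    M = B_tilde.T @ B_tilde + LAM * np.eye(n)
    l_e = min(float(b_e @ np.linalg.solve(M, b_e)), 1.0)
    p_e = min(C * l_e * np.log(n) / EPS**2, 1.0)       # sampling probability
    if rng.random() < p_e:                             # keep edge: broadcast the row
        return np.vstack([B_tilde, b_e / np.sqrt(p_e)])
    return B_tilde                                     # discard edge: no message

# Usage: B = np.zeros((0, n)) initially; B = process_edge(u, v, B, n, rng) per edge.
```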
Our results for the blackboard model are summarized in Theorem 4. To prove Theorem 4, first, it follows from [Cohen, Musco, and Pachocki2016] that the subgraph constructed in the blackboard at every time point $t$ is a spectral sparsifier of $G(t)$ w.h.p. The rest of the proof is then the same as the proof of Theorem 3. In the algorithm, processing an edge requires only $\tilde{B}$, which is in the blackboard and visible to every site. Therefore, each site can process its edges locally and only transmit the sampled edges to the blackboard. The total communication cost is $\tilde{O}(n + s)$, because the constructed spectral sparsifier has size $\tilde{O}(n)$ and each site transmits at least one bit of information. It is easy to see that this communication cost is optimal up to polylogarithmic factors, because even for a single time point, the clustering result itself carries $\Omega(n)$ bits of information and each site has to transmit at least one bit of information.
Theorem 4 (The Blackboard model).
For every time point $t \in [T]$, suppose that $G(t)$ satisfies $\Upsilon(k) = \Omega(k^3)$ and there is an optimal partition $P_1(t), \ldots, P_k(t)$ that achieves $\rho(k)$ for some positive integer $k$. Then, w.h.p., DCABL can output a partition $A_1(t), \ldots, A_k(t)$ at the coordinator such that for every $1 \le i \le k$, $\mathrm{vol}(A_i(t) \triangle P_i(t)) = O(k^3\, \Upsilon^{-1}\, \mathrm{vol}(P_i(t)))$ holds. Summing over $T$ time points, the total communication cost is $\tilde{O}(n + s)$ bits. It is optimal up to a polylogarithmic factor.
DCABL also works in the distributed static setting by considering that there is only one time point, at which all graph information arrives together. As mentioned earlier, it is a brand-new algorithm with nearly-optimal communication complexity, the same as the state-of-the-art algorithm [Chen et al.2016], but it is much simpler, as it does not have to maintain a chain of spectral sparsifiers. Another advantage is the simplicity of having one algorithm work for both distributed settings. The computational cost of computing the online ridge leverage score of each edge in Alg. 1 is dominated by solving a linear system in the maintained matrix $\tilde{B}^{\top}\tilde{B} + \lambda I$. To save computational cost, every site can batch process the new edges observed at each time point. By using the Johnson-Lindenstrauss random projection trick [Spielman and Srivastava2011], we can approximate the online ridge leverage scores of a whole batch of edges together, and then sample all edges in the batch according to the computed scores.
3 Discussions
Another Application of the Monotonicity Property. Consider the same computational and communication models. When the queries posed at the coordinator are changed to approximate shortest path distance queries between two given nodes, we use graph spanners [Peleg and Schaffer1989, Althofer et al.1993] to sparsify the original graphs while approximately preserving all-pairs shortest path distances in the original graphs.
We now describe the algorithm. In the message passing model, at each time point $t$ each site $i$ first constructs a graph spanner $Z_i(t)$ of its local graph $G_i(t)$ using a spanner construction algorithm with the monotonicity property [Elkin2011], and then transmits the newly added spanner edges to the coordinator. Upon receiving them from every site, the coordinator first takes the union $Z(t) = \bigcup_{i=1}^{s} Z_i(t)$ and then applies a point-to-point shortest path algorithm (e.g., Dijkstra's algorithm [Dijkstra1959]) on $Z(t)$ to get the (approximate) shortest distance between the two nodes at the time point $t$; a sketch is given below. This process is repeated for every $t \le T$. The theoretical guarantees of the algorithm are summarized in Theorem 5, and its proof is in Sec. 3 of the Appendix.
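A minimal scipy sketch of the coordinator side, under the assumption that each site's spanner is shipped as a list of weighted edge triples; the representation and function name are ours.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import dijkstra

def coordinator_distance(spanners, n, u, v):
    """Union the sites' spanners Z_i(t) into Z(t) and run Dijkstra from u."""
    best = {}                                  # dedupe edges seen at several sites
    for edges in spanners:                     # each: iterable of (a, b, w) triples
        for a, b, w in edges:
            key = (min(a, b), max(a, b))
            best[key] = min(w, best.get(key, np.inf))
    rows, cols, vals = [], [], []
    for (a, b), w in best.items():
        rows += [a, b]; cols += [b, a]; vals += [w, w]
    Z = csr_matrix((vals, (rows, cols)), shape=(n, n))
    dist = dijkstra(Z, directed=False, indices=u)
    return dist[v]                             # at most (2k-1) * d_G(u, v) stretch
```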
Theorem 5.
Given two nodes $u, v$ and an integer $k \ge 1$, for every time point $t$, the proposed algorithm can answer, at the coordinator in the message passing model, an approximate shortest distance between $u$ and $v$ that is no larger than $2k-1$ times their actual shortest distance. Summing over $T$ time points, the total communication cost is $\tilde{O}(s \cdot n^{1+1/k})$.
Dynamic Graph Streams. When the graph update stream observed at each site is a fully dynamic stream containing a small number of node/edge deletions, we present a simple trick that keeps our algorithms performing well. We observe that the spectral sparsifiers will probably remain unchanged when there is only a small number of deletions. This is reasonable because spectral sparsifiers are sparse subgraphs that may contain far fewer edges than the original graphs. When the number of deletions is small, the deletions may not affect the spectral sparsifiers at all. Even when the deletions lead to small changes in the spectral sparsifiers, there is a high probability that the clustering does not change significantly. Therefore, in order to save communication and computation, we can ignore these deletions, i.e., neither process nor transmit them, while still approximately preserving the clustering; a sketch follows below. We experimentally confirm the effect of this trick in the experiment section.
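A sketch of this filtering at a site, assuming updates arrive as ('ins'/'del', edge) pairs; the update encoding is our assumption, for illustration only.

```python
def filter_updates(stream):
    """Forward only insertions to the sparsification routine; drop deletions.
    With few deletions, the maintained sparsifier and its clustering are
    (w.h.p.) approximately preserved, and no communication is spent on them."""
    for op, edge in stream:
        if op == 'ins':
            yield edge                 # goes on to the SAMP / online row sampling
        # 'del' updates are silently ignored
```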
4 Experiments
In this section, we present the results of experiments conducted on both synthetic and real-life datasets, in which we compared the proposed algorithms DCAMP and DCABL with the baseline algorithms CNTRL and ST. For ST, we used the distributed static graph clustering algorithms [Chen et al.2016] in the message passing and the blackboard models, and refer to the resultant algorithms as STMP and STBL, respectively. For measuring the quality of the clustering results, we used the normalized cut value (NCut) of the clustering [Sun and Zanetti2017]; a smaller NCut value implies a better clustering. For simplicity, we used the total number of edges communicated as the communication cost, which approximates the total number of bits up to a logarithmic factor. We implemented all five algorithms in Matlab, and conducted the experiments on a machine equipped with an Intel i7 7700 2.8GHz CPU, 8 GB RAM and 1 TB disk storage.
The details of the datasets used in the experiments are as follows. The Gaussians dataset consists of 800 nodes and 47,897 edges. Each point in each of four clusters is sampled from an isotropic Gaussian of variance 0.01. We consider each point to be a node in constructing the similarity graph. For every two nodes $u$ and $v$ such that one is among the 100 nearest points of the other, we add an edge of weight $w(u, v) = \exp(-\|u - v\|_2^2 / \sigma^2)$ for a scaling parameter $\sigma$. The number of clusters is 4. For the Sculpture dataset, we used a version of a photo of The Greek Slave (http://artgallery.yale.edu/collections/objects/14794), and it contains 1,980 nodes and 61,452 edges. We consider each pixel to be a node by mapping each pixel to a point in $\mathbb{R}^5$, i.e., $(x, y, r, g, b)$, where the last three coordinates are the RGB values. For every two nodes $u$ and $v$ such that one is among the 80 nearest points of the other, we add an edge of weight $w(u, v) = \exp(-\|u - v\|_2^2 / \sigma^2)$. The number of clusters is 3.

In the problem studied, the site and the time point at which each edge arrives are arbitrary. We therefore let the edges of nodes with smaller coordinates have earlier arrival times than the edges of nodes with larger coordinates. Intuitively, this means that the edges of nodes on the left side come before the edges of nodes on the right side, which helps us easily monitor the changes in the clustering results. Independently, the site at which each edge arrives is picked uniformly at random from the $s$ sites.
Table 1: Communication costs (in number of edges) of DCAMP and DCABL at different time points when varying the number of sites s.

Time  s   Gaussians DCAMP  Gaussians DCABL  Sculpture DCAMP  Sculpture DCABL
50  15  4485  3132  15292  7130 
30  4607  3133  15235  6054  
45  4660  3126  15560  6076  
60  4669  3095  15764  6705  
90  15  7342  4988  27036  12153 
30  7533  4982  27020  10287  
45  7586  4979  27700  10336  
60  7630  4960  28001  11421  
100  15  7748  5238  28408  12846 
30  7988  5230  28338  10874  
45  7998  5235  29038  10897  
60  8062  5218  29343  12062 
Table 2: Communication costs (in number of edges) of DCAMP and DCABL at different time points (as fractions of T) when varying the total number of time points T.

Time  T   Gaussians DCAMP  Gaussians DCABL  Sculpture DCAMP  Sculpture DCABL
50%  10  4562  3127  15078  5998 
30  4645  3126  15278  6063  
100  4607  3133  15235  6054  
300  4620  3113  15269  6064  
90%  10  7467  4979  26699  10202 
30  7581  4983  27012  10278  
100  7533  4982  27020  10287  
300  7618  4958  27042  10299  
100%  10  7917  5225  28046  10779 
30  8045  5234  28278  10847  
100  7988  5230  28338  10874  
300  8031  5211  28345  10869 
Experimental Results. As the baseline setting, we selected the total number of time points $T = 100$ and the total number of sites $s = 30$. The communication cost and NCut of the different algorithms on both datasets are shown in Fig. 1. On both datasets, the communication costs of DCAMP and DCABL are much smaller than those of CNTRL, STMP and STBL. Specifically, on the Gaussians dataset, the communication cost of DCAMP can be as low as 4% of that of STMP and is on average 16% of that of CNTRL. The communication cost of DCABL is on average 11% of that of CNTRL and can be as low as 12% of that of STBL. STMP has communication cost even much larger than CNTRL. DCABL has a smaller communication cost than DCAMP. On the Sculpture dataset, the communication cost of DCAMP can be as low as 11% of that of STMP and is on average 49% of that of CNTRL. The communication cost of DCABL can be as low as 15% of that of STBL and is on average 21% of that of CNTRL. Similar to STMP, STBL also has communication cost larger than CNTRL. DCABL has a much smaller communication cost than DCAMP, and the difference here is larger than on the Gaussians dataset.
For both datasets, all algorithms have comparable NCut at every time point, except that on the Gaussians dataset, at time point 9, DCABL has a slightly larger NCut. This could be because DCABL is a randomized algorithm that succeeds only with high probability. In Fig. 1(e-l), the clustering results of CNTRL and DCAMP on both datasets at time points 9 and 10 are visually very similar. (The same cluster colors in different figures are unrelated.) For the Sculpture dataset at time point 9, the clustering result of DCAMP even looks visually more reasonable.
We then varied the value of $s$ from 15 to 60 with a step of 15, and the value of $T$ from 10 to 300 by a factor of roughly 3 (i.e., 10, 30, 100, 300), while keeping the other parameters unchanged as in the baseline setting. Due to space limits, we only show the resultant communication costs of DCAMP and DCABL on both datasets in Tables 1 and 2; the complete results are deferred to the Appendix. When we varied the value of $s$, the communication cost of DCAMP increases roughly linearly as $s$ grows from 15 to 60, while that of DCABL does not obviously increase with $s$. These observations are consistent with their theoretical communication costs $\tilde{O}(ns)$ and $\tilde{O}(n+s)$, respectively. When we varied the value of $T$, the communication costs of both DCAMP and DCABL remain roughly the same, also supporting our theory above.
Finally, we tested the performance of DCAMP and DCABL on dynamic graph streams. We randomly chose 5% of the edges to delete at a random time point after their arrival. This increases the communication cost of CNTRL by 5%, as CNTRL sends every deletion to the coordinator/blackboard. However, the communication costs of DCAMP and DCABL are unchanged. More importantly, even though they ignore the deletions, the resultant clusterings of DCAMP and DCABL at every time point have NCut comparable to that of CNTRL. Due to space limits, we refer to Fig. 1 in the Appendix.
5 Conclusion and Future Work
In this paper, we study how to efficiently perform graph clustering over modern graph data that are often dynamic and collected at distributed sites. We design communication-optimal algorithms DCAMP and DCABL for two different communication models and rigorously prove their optimality. Finally, we conducted extensive simulations confirming that DCAMP and DCABL significantly outperform baseline algorithms in practice. As future work, we will study whether and how similar results can be achieved for geometric clustering, and how to achieve better computational bounds for the studied problems. We will also study other related problems in the distributed dynamic setting, such as low-rank approximation [Bringmann, Kolev, and Woodruff2017], sourcewise and standard roundtrip spanner constructions [Zhu and Lam2017, Zhu and Lam2018] and cut sparsifier constructions [Abraham et al.2016].
Acknowledgments
This work was partially supported by NSF grants DBI-1356655, CCF-1514357, and IIS-1718738, as well as NIH grants R01DA037349 and K02DA043063 to Jinbo Bi.
References
 [Abraham et al.2016] Abraham, I.; Durfee, D.; Koutis, I.; Krinninger, S.; and Peng, R. 2016. On fully dynamic graph sparsifiers. In Proceedings of FOCS Conference, 335–344.
 [Althofer et al.1993] Althofer, I.; Das, G.; Dobkin, D.; Joseph, D.; and Soares, J. 1993. On sparse spanners of weighted graphs. Discrete Computational Geometry 9:81–100.
 [Anagnostopoulos et al.2016] Anagnostopoulos, A.; Lacki, J.; Lattanzi, S.; Leonardi, S.; and Mahdian, M. 2016. Community detection on evolving graphs. In Proceedings of NIPS Conference, 3530–3538.
 [Arbelaez et al.2011] Arbelaez, P.; Maire, M.; Fowlkes, C.; and Malik, J. 2011. Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 33(5):898–916.
 [Batson, Spielman, and Srivastava2012] Batson, J.; Spielman, D.; and Srivastava, N. 2012. Twice-Ramanujan sparsifiers. SIAM Journal on Computing 41(6):1704–1721.
 [Bringmann, Kolev, and Woodruff2017] Bringmann, K.; Kolev, P.; and Woodruff, D. 2017. Approximation algorithms for low rank approximation. In Proceedings of NIPS Conference, 6648–6659.
 [Chen et al.2016] Chen, J.; Sun, H.; Woodruff, D.; and Zhang, Q. 2016. Communicationoptimal distributed clustering. In Proceedings of NIPS Conference, 3720–3728.
 [Cohen, Musco, and Pachocki2016] Cohen, M.; Musco, C.; and Pachocki, J. 2016. Online row sampling. In Proceedings of APPROXRANDOM Conference, 7:1–7:18.
 [Cormode, Muthukrishnan, and Wei2007] Cormode, G.; Muthukrishnan, S.; and Wei, Z. 2007. Conquering the divide: continuous clustering of distributed data streams. In Proceedings of ICDE Conference, 1036–1045.
 [Dijkstra1959] Dijkstra, E. 1959. A note on two problems in connexion with graphs. Numerische Mathematik 1(1):269–271.
 [Elkin2011] Elkin, M. 2011. Streaming and fully dynamic centralized algorithms for constructing and maintaining sparse spanners. ACM Transactions on Algorithms 7(2):20.
 [Giatsidis et al.2014] Giatsidis, C.; Malliaros, F.; Thilikos, D.; and Vazirgiannis, M. 2014. CORECLUSTER: A degeneracy based graph clustering framework. In Proceedings of AAAI Conference, 44–50.
 [Hui et al.2007] Hui, P.; Yoneki, E.; Chan, S.; and Crowcroft, J. 2007. Distributed community detection in delay tolerant networks. In Proceedings of 2nd ACM/IEEE International Workshop on Mobility in the Evolving Internet Architecture.
 [Jian, Lian, and Chen2018] Jian, X.; Lian, X.; and Chen, L. 2018. On efficiently detecting overlapping communities over distributed dynamic graphs. In Proceedings of ICDE Conference.
 [Kapralov et al.2014] Kapralov, M.; Lee, Y.; Musco, C.; Musco, C.; and Sidford, A. 2014. Single pass spectral sparsification in dynamic streams. In Proceedings of FOCS Conference, 561–570.
 [Kelner and Levin2013] Kelner, J., and Levin, A. 2013. Spectral sparsification in the semi-streaming setting. Theory of Computing Systems 53(2):243–262.
 [Kim et al.2011] Kim, S.; Nowozin, S.; Kohli, P.; and Yoo, C. 2011. Higher-order correlation clustering for image segmentation. In Proceedings of NIPS Conference, 1530–1538.
 [Lee and Sun2017] Lee, Y., and Sun, H. 2017. An SDPbased algorithm for linearsized spectral sparsification. In Proceedings of STOC Conference, 678–687.
 [Lee, Gharan, and Trevisan2014] Lee, J.; Gharan, S.; and Trevisan, L. 2014. Multiway spectral partitioning and higherorder Cheeger inequalities. Journal of the ACM 61(6):37.
 [Li, Miller, and Peng2013] Li, M.; Miller, G.; and Peng, R. 2013. Iterative row sampling. In Proceedings of FOCS Conference, 127–136.
 [Maier, Luxburg, and Hein2009] Maier, M.; Luxburg, U.; and Hein, M. 2009. Influence of graph construction on graphbased clustering measures. In Proceedings of NIPS Conference, 1025–1032.

 [Ng, Jordan, and Weiss2001] Ng, A.; Jordan, M.; and Weiss, Y. 2001. On spectral clustering: analysis and an algorithm. In Proceedings of NIPS Conference, 849–856.
 [Peleg and Schaffer1989] Peleg, D., and Schaffer, A. 1989. Graph spanners. Journal of Graph Theory 13(1):99–116.
 [Peng, Sun, and Zanetti2015] Peng, R.; Sun, H.; and Zanetti, L. 2015. Partitioning wellclustered graphs: spectral clustering works! In Proceedings of COLT Conference, 1423–1455.
 [Phillips, Verbin, and Zhang2016] Phillips, J.; Verbin, E.; and Zhang, Q. 2016. Lower bounds for numberinhand multiparty communication complexity, made easy. SIAM Journal on Computing 45(1):174–196.
 [Shi and Malik2000] Shi, J., and Malik, J. 2000. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(8):888–905.
 [Spielman and Srivastava2011] Spielman, D., and Srivastava, N. 2011. Graph sparsification by effective resistances. SIAM Journal on Computing 40(6):1913–1926.
 [Spielman and Teng2011] Spielman, D., and Teng, S.H. 2011. Spectral sparsification of graphs. SIAM Journal on Computing 40(4):981–1025.
 [Sun and Zanetti2017] Sun, H., and Zanetti, L. 2017. Distributed graph clustering and sparsification. https://arxiv.org/abs/1711.01262.
 [Tian et al.2014] Tian, F.; Gao, B.; Cui, Q.; Chen, E.; and Liu, T.Y. 2014. Learning deep representations for graph clustering. In Proceedings of AAAI Conference, 1293–1299.
 [Worldwidewebsize2018] Worldwidewebsize. 2018. http://www.worldwidewebsize.com/. [Online; accessed 05-07-2018].
 [Yang and Xu2015] Yang, W., and Xu, H. 2015. A divide and conquer framework for distributed graph clustering. In Proceedings of ICML Conference, 504–513.
 [Zhu and Lam2017] Zhu, C., and Lam, K.Y. 2017. Sourcewise roundtrip spanners. Information Processing Letters 124(C):42–45.
 [Zhu and Lam2018] Zhu, C., and Lam, K.Y. 2018. Deterministic improved roundtrip spanners. Information Processing Letters 129:57–60.
 [Zhu, Liao, and Orecchia2015] Zhu, Z.; Liao, Z.; and Orecchia, L. 2015. Spectral sparsification and regret minimization beyond matrix multiplicative updates. In Proceedings of STOC Conference, 237–245.
Appendix
1 The Complete Results when Varying the Values of s and T
The results of communication cost and normalized cut (NCut) at every time point when varying the value of $s$ on the Gaussians dataset and the Sculpture dataset are presented in Tables 1 and 2, respectively. Basically, the communication cost of DCAMP increases roughly linearly with $s$; the increase for DCABL is not obvious. The results of communication cost and NCut at every time point that is a multiple of 10% of the total number of time points when varying the value of $T$ on the Gaussians dataset and the Sculpture dataset are presented in Tables 3 and 4, respectively. The communication costs of DCAMP and DCABL remain roughly unchanged. In all the tables, the NCut values of the different algorithms are comparable, except in some rare cases where an algorithm does not succeed.
Table 1: NCut and communication cost (Comm., in number of edges) on the Gaussians dataset at each time point when varying the number of sites s. CNTRL does not depend on s, so its values are listed once per time point.

Time  s   CNTRL NCut  CNTRL Comm.  DCAMP NCut  DCAMP Comm.  DCABL NCut  DCABL Comm.
10  15  2.941  6556  2.862  1025  2.953  784 
30  2.921  1067  2.839  784  
45  2.869  1083  2.913  772  
60  2.96  1050  2.843  791  
20  15  2.918  11265  2.954  1784  2.923  1275 
30  2.954  1886  2.88  1284  
45  2.918  1916  2.881  1281  
60  2.982  1872  2.853  1263  
30  15  2.729  15872  2.855  2533  2.931  1800 
30  2.842  2651  2.932  1804  
45  2.93  2707  2.787  1831  
60  2.896  2612  2.905  1802  
40  15  2.744  21802  2.91  3510  2.707  2510 
30  2.854  3643  2.939  2500  
45  2.961  3698  2.886  2505  
60  2.856  3632  2.822  2481  
50  15  2.721  27748  2.763  4485  2.87  3132 
30  2.897  4607  2.909  3133  
45  2.748  4660  2.758  3126  
60  2.753  4669  2.814  3095  
60  15  2.712  32649  2.841  5297  2.785  3623 
30  2.829  5407  2.704  3623  
45  2.829  5473  2.866  3602  
60  2.651  5497  2.914  3597  
70  15  2.846  35976  2.853  5863  2.908  3959 
30  2.868  6003  2.743  3972  
45  2.707  6020  2.681  3956  
60  2.855  6086  2.823  3911  
80  15  2.794  39445  2.847  6430  2.854  4372 
30  2.766  6592  2.932  4377  
45  2.804  6599  2.844  4346  
60  2.875  6645  2.881  4312  
90  15  0.198  45250  0.216  7342  0.206  4988 
30  0.199  7533  0.22  4982  
45  1.102  7586  0.224  4979  
60  0.205  7630  0.207  4960  
100  15  0.198  47897  0.208  7748  0.206  5238 
30  0.2  7988  0.22  5230  
45  0.206  7998  0.215  5235  
60  0.205  8062  0.204  5218 
Table 2: NCut and communication cost (Comm., in number of edges) on the Sculpture dataset at each time point when varying the number of sites s. CNTRL does not depend on s, so its values are listed once per time point.

Time  s   CNTRL NCut  CNTRL Comm.  DCAMP NCut  DCAMP Comm.  DCABL NCut  DCABL Comm.
10  15  1.874  5798  1.991  3210  1.973  1574 
30  1.121  3145  1.984  1348  
45  1.12  3254  1.883  1344  
60  1.117  3263  1.963  1491  
20  15  1.924  11792  1.971  6264  1.974  2928 
30  0.466  6210  1.947  2507  
45  0.475  6394  1.935  2511  
60  1.102  6447  1.817  2785  
30  15  1.698  18810  1.933  9503  1.846  4478 
30  1.087  9421  1.843  3834  
45  1.034  9659  1.988  3812  
60  1.091  9787  1.955  4241  
40  15  1.89  25388  1.845  12562  1.97  5856 
30  0.434  12501  1.852  5019  
45  0.235  12798  1.806  4982  
60  0.23  12976  2.002  5546  
50  15  1.927  31256  1.765  15292  1.804  7130 
30  0.305  15235  1.788  6054  
45  0.653  15560  1.678  6076  
60  0.755  15764  1.965  6705  
60  15  1.742  37954  1.745  18434  1.929  8500 
30  1.079  18387  1.924  7233  
45  1.983  18798  1.889  7234  
60  1.043  19033  1.941  7997  
70  15  1.949  44566  1.823  21436  1.952  9877 
30  1.948  21421  1.726  8378  
45  1.835  21888  1.911  8394  
60  1.39  22156  1.939  9264  
80  15  1.892  51437  0.086  24676  1.914  11329 
30  1.856  24654  1.845  9598  
45  1.56  25225  1.867  9633  
60  1.848  25512  1.726  10647  
90  15  1.945  56331  1.749  27036  1.779  12153 
30  1.878  27020  1.825  10287  
45  1.695  27700  1.906  10336  
60  1.868  28001  1.774  11421  
100  15  0.009  61452  0.01  28408  0.011  12846 
30  0.009  28338  0.011  10874  
45  0.009  29038  0.009  10897  
60  0.013  29343  0.013  12062 
Table 3: NCut and communication cost (Comm., in number of edges) on the Gaussians dataset when varying the total number of time points T. Time is given as a percentage of T. At a fixed time percentage, CNTRL does not depend on T, so its values are listed once per row group.

Time  T   CNTRL NCut  CNTRL Comm.  DCAMP NCut  DCAMP Comm.  DCABL NCut  DCABL Comm.
10%  10  2.941  6556  2.869  1091  2.96  763 
30  2.927  1148  2.882  790  
100  2.921  1067  2.839  784  
300  2.975  1152  2.806  725  
20%  10  2.918  11265  2.965  1857  2.866  1253 
30  2.822  1935  2.908  1261  
100  2.954  1886  2.88  1284  
300  2.979  1929  2.863  1189  
30%  10  2.729  15872  2.788  2569  2.981  1820 
30  2.868  2692  2.895  1816  
100  2.842  2651  2.932  1804  
300  2.926  2700  2.868  1709  
40%  10  2.744  21802  2.91  3568  2.692  2521 
30  2.832  3680  2.891  2516  
100  2.854  3643  2.939  2500  
300  2.834  3648  2.844  2438  
50%  10  2.721  27748  2.945  4562  2.771  3127 
30  2.797  4645  2.889  3126  
100  2.897  4607  2.909  3133  
300  2.883  4620  2.846  3113  
60%  10  2.712  32649  2.904  5397  2.749  3616 
30  2.861  5479  2.807  3616  
100  2.829  5407  2.704  3623  
300  2.706  5465  2.689  3599  
70%  10  2.846  35976  2.821  5971  2.944  3948 
30  2.855  6070  2.814  3953  
100  2.868  6003  2.743  3972  
300  2.825  6044  2.905  3913  
80%  10  2.794  39445  2.749  6538  2.851  4348 
30  2.876  6650  2.816  4346  
100  2.766  6592  2.932  4377  
300  2.829  6616  2.809  4316  
90%  10  0.198  45250  0.209  7467  1.042  4979 
30  1.033  7581  0.221  4983  
100  0.199  7533  0.22  4982  
300  1.039  7618  1.017  4958  
100%  10  0.198  47897  0.205  7917  0.206  5225 
30  0.204  8045  0.215  5234  
100  0.2  7988  0.22  5230  
300  0.202  8031  0.211  5211 
Table 4: NCut and communication cost (Comm., in number of edges) on the Sculpture dataset when varying the total number of time points T. Time is given as a percentage of T. At a fixed time percentage, CNTRL does not depend on T, so its values are listed once per row group.

Time  T   CNTRL NCut  CNTRL Comm.  DCAMP NCut  DCAMP Comm.  DCABL NCut  DCABL Comm.
10%  10  1.874  5798  1.106  3207  1.988  1361 
30  1.121  3204  1.976  1341  
100  1.121  3145  1.984  1348  
300  1.115  3276  1.974  1360  
20%  10  1.924  11792  1.092  6188  1.967  2486 
30  0.463  6277  1.994  2540  
100  0.466  6210  1.947  2507  
300  1.088  6296  1.945  2497  
30%  10  1.698  18810  0.283  9399  1.813  3804 
30  0.977  9489  1.75  3865  
100  1.087  9421  1.843  3834  
300  1.104  9505  1.902  3861  
40%  10  1.89  25388  0.232  12407  1.937  4952 
30  0.571  12593  1.884  5045  
100  0.434  12501  1.852  5019  
300  0.239  12573  1.982  4997  
50%  10  1.927  31256  0.448  15078  1.848  5998 
30  0.2  15278  1.967  6063  
100  0.305  15235  1.788  6054  
300  0.562  15269  1.69  6064  
60%  10  1.742  37954  1.079  18195  1.924  7139 
30  0.93  18464  1.784  7201  
100  1.079  18387  1.924  7233  
300  0.952  18392  1.8  7221  
70%  10  1.949  44566  1.942  21171  1.927  8289 
30  1.957  21400  1.938  8330  
100  1.948  21421  1.726  8378  
300  1.934  21387  1.865  8377  
80%  10  1.892  51437  1.904  24363  2.001  9496 
30  1.73  24582  1.901  9566  
100  1.856  24654  1.845  9598  
300  1.839  24647  1.868  9609  
90%  10  1.945  56331  1.902  26699  1.845  10202 
30  1.941  27012  1.879  10278  
100  1.878  27020  1.825  10287  
300  1.911  27042  1.755  10299  
100%  10  0.009  61452  0.011  28046  0.011  10779 
30  0.011  28278  0.012  10847  
100  0.009  28338  0.011  10874  
300  0.01  28345  0.013  10869 
2 The Complete Results for Dynamic Graph Streams
The complete results of communication cost and NCut under dynamic graph streams on the Gaussians and Sculpture datasets are plotted in Fig. 1. It can be seen that even though DCAMP and DCABL do not process the deletions, their NCut remains comparable to that of CNTRL. This trick saves communication cost, keeping it much smaller than that of CNTRL.
3 Proof of Theorem 5
Theorem 5.
Given two nodes $u, v$ and an integer $k \ge 1$, for every time point $t$, the proposed algorithm can answer, at the coordinator in the message passing model, an approximate shortest distance between $u$ and $v$ that is no larger than $2k-1$ times their actual shortest distance. Summing over $T$ time points, the total communication cost is $\tilde{O}(s \cdot n^{1+1/k})$.
Proof.
We first prove that at every time point $t$, the constructed subgraph $Z(t)$ is a $(2k-1)$-spanner of the graph $G(t)$ received up to the time point $t$. For each edge $(u, v) \in G_i(t)$, there is a path between $u$ and $v$ in the spanner $Z_i(t)$ of distance no larger than $(2k-1) \cdot w(u, v)$, because $Z_i(t)$ is a $(2k-1)$-spanner of $G_i(t)$. In the union graph $Z(t) = \bigcup_{i=1}^{s} Z_i(t)$, this path is still present. Therefore, for every edge $(u, v)$ in $G(t)$, there is a path between $u$ and $v$ in $Z(t)$ of distance no larger than $(2k-1) \cdot w(u, v)$. This implies that $Z(t)$ is a $(2k-1)$-spanner of $G(t)$.
By the monotonicity property, each site only needs to transmit $\tilde{O}(n^{1+1/k})$ edges summed over all time points. Summing over $s$ sites, the total communication cost is $\tilde{O}(s \cdot n^{1+1/k})$. ∎