Flowless: Extracting Densest Subgraphs Without Flow Computations

October 15, 2019 · by Digvijay Boob, et al.

We propose a simple and computationally efficient method for dense subgraph discovery, which is a classic problem both in theory and in practice. It is well known that dense subgraphs can have strong correlations with structures of interest in real-world networks across various domains, ranging from social networks to biological systems [Gionis and Tsourakakis '15]. For the densest subgraph problem, Charikar's greedy algorithm [Asashiro '00, Charikar '00] is very simple and can typically find results of quality much better than the provable factor-2 approximation, which makes it very popular in practice. However, it is also known to give suboptimal output on many real-world examples. On the other hand, finding the exact optimal solution requires the computation of a maximum flow. Despite the existence of highly optimized maximum flow solvers, such computation still incurs prohibitive computational costs for the massive graphs arising in modern data science applications. We devise an iterative algorithm which naturally generalizes the greedy algorithm of Charikar. Our algorithm draws insights from iterative approaches in convex optimization, and also exploits the dual interpretation of the densest subgraph problem. We have empirical evidence that our algorithm is much more robust against the structural heterogeneities in real-world datasets, and converges to the optimal subgraph density even when the simple greedy algorithm fails. On the other hand, in instances where Charikar's algorithm performs well, our algorithm is able to quickly verify its optimality. Furthermore, we demonstrate that our method is significantly faster than the maximum-flow-based exact algorithm. We conduct experiments on datasets from broad domains, and our algorithm achieves a ∼145× speedup on average to find subgraphs whose density is at least 90% of the optimal value.


1 Introduction

Finding dense components in graphs is a major topic in graph mining, with diverse applications including DNA motif detection, unsupervised real-time detection of interesting stories from micro-blogging streams, indexing graphs for efficient distance query computation, and anomaly detection in financial and social networks [16]. The densest subgraph problem (DSP) is one of the major formulations for dense subgraph discovery: given an undirected weighted graph G = (V, E, w), we want to find a set of nodes S ⊆ V that maximizes the degree density d(S) = e(S)/|S|, where e(S) is the sum of the weights of the edges in the graph induced by S. When the weights are non-negative, the problem is solvable in polynomial time using maximum flows [17]. Since maximum flow computations remain expensive despite the theoretical progress achieved over recent years, Charikar's greedy peeling algorithm is frequently used in practice [8]. This algorithm iteratively peels the lowest-degree node from the graph, thus producing a sequence of subsets of nodes, of which it outputs the densest one. This simple, linear-time and linear-space algorithm provides a 2-approximation for the DSP. However, when the edge weights are allowed to be negative, the DSP becomes NP-hard [36].

Our work was originally motivated by a natural question: How can we quickly assess whether the output of Charikar's algorithm on a given graph instance is closer to optimality or to the worst-case factor-2 approximation guarantee? However, we ended up answering the following intriguing question, which we state as the next problem:

Problem 1.1.

Can we design an algorithm that performs (i) as well as Charikar’s greedy algorithm in terms of efficiency, and (ii) as well as the maximum flow-based exact algorithm in terms of output quality?

Contributions. The contributions of this paper are summarized as follows:

We design a novel algorithm Greedy++ for the densest subgraph problem, a major dense subgraph discovery primitive that “lies at the heart of large-scale data mining” [5]. Greedy++ combines the best of two different worlds, the accuracy of the exact maximum flow based algorithm [17, 14], and the efficiency of Charikar’s greedy peeling algorithm [8].

It is worth outlining that Charikar's greedy algorithm typically performs much better on real-world graphs than its worst-case 2-approximation guarantee; on the variety of datasets we have tried, the worst approximation ratio observed was 0.8. Nonetheless, the only way to verify how close the output is to optimality relies on computing the exact solution using maximum flow. Our proposed method Greedy++ can be used to assess the accuracy of Charikar's algorithm in practice. Specifically, we find empirically that for all graph instances where Greedy++ after a couple of iterations does not significantly improve the output density, the output of Charikar's algorithm is near-optimal.

We implement our proposed algorithm in C++ and apply it on a variety of real-world datasets. We verify the practical value of Greedy++. Our empirical results indicate that Greedy++ is a valuable addition to the toolbox of dense subgraph discovery; on real-world graphs, Greedy++ is both fast in practice, and converges to a solution with an arbitrarily small approximation factor.

Notation. Let G = (V, E) be an undirected graph, where |V| = n and |E| = m. For a given subset of nodes S ⊆ V, e(S) denotes the number of edges induced by S. When the graph is weighted, i.e., there exists a weight function w : E → R, e(S) denotes the sum of the weights of the edges induced by S. We use N(u) to denote the set of neighbors of u, and N(S) = ∪_{u∈S} N(u). We use deg_S(u) to denote u's degree in S, i.e., the number of neighbors of u within the set of nodes S. We use Δ(G) to denote the maximum degree in G. Finally, the degree density of a vertex set S is defined as d(S) = e(S)/|S|, with e(S) being the weighted edge count when the graph is weighted.
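The degree density can be made concrete with a short snippet. This is an illustrative Python helper (not part of the paper's C++ code), representing a graph as a list of vertex pairs with unit weights:

```python
def degree_density(edges, S):
    """Degree density d(S) = e(S) / |S|, where e(S) counts the edges
    with both endpoints inside S (unit weights for simplicity)."""
    S = set(S)
    e_S = sum(1 for u, v in edges if u in S and v in S)
    return e_S / len(S)

# A 4-clique on {0,1,2,3} plus a pendant vertex 4 attached to 3.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4)]
print(degree_density(edges, {0, 1, 2, 3}))     # 6/4 = 1.5
print(degree_density(edges, {0, 1, 2, 3, 4}))  # 7/5 = 1.4
```

Note that here the 4-clique is denser than the whole graph, which is why peeling the pendant vertex first pays off.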

2 Related Work

Dense subgraph discovery. Detecting dense components is a major problem in graph mining. It is not surprising that many different notions of a dense subgraph are used in practice. The prototypical dense subgraph is a clique. However, the maximum clique problem is not only NP-hard, but also strongly inapproximable; see [19]. The notion of optimal quasi-cliques has been developed to detect subgraphs that are not necessarily fully interconnected but are very dense [35]. However, finding optimal quasi-cliques is also NP-hard [21, 34]. Another popular and scalable approach to finding dense components is based on k-cores [12]. Recently, k-cores have also been used to detect anomalies in large-scale networks [15, 29].

The interested reader may refer to the recent survey by Gionis and Tsourakakis on the broader topic of dense subgraph discovery [16]. In the following, we only provide a brief overview of work related to the densest subgraph problem.

Densest subgraph problem (DSP). The goal of the densest subgraph problem (DSP) is to find the set of nodes S which maximizes the degree density d(S) = e(S)/|S|. The densest subgraph can be identified in polynomial time by solving a maximum flow problem [14, 22, 17]. Charikar [8] proved that the greedy algorithm proposed by Asashiro et al. [3] produces a 2-approximation of the densest subgraph in linear time. (Despite the fact that the greedy algorithm was originally proposed in [3], it is widely known as Charikar's greedy algorithm.) To obtain fast algorithms with better approximation factors, McGregor et al. [25] and Mitzenmacher et al. [26] uniformly sparsify the input graph and compute the densest subgraph in the resulting sparse graph. The first near-linear time (1 + ε)-approximation algorithm for the DSP, given by Bahmani et al. [4], relies on approximately solving the LP dual of the DSP. It is worth mentioning that Kannan and Vinay [20] gave a spectral approximation algorithm for a related notion of density.

Charikar’s greedy peeling algorithm. Since our algorithm Greedy++ is an improvement over Charikar’s greedy algorithm, we discuss the latter algorithm in greater detail. The algorithm removes, in each iteration, the node with the smallest degree. This process creates a nested sequence of sets of nodes V = S_n ⊇ S_{n−1} ⊇ ⋯ ⊇ S_1. The algorithm outputs the subgraph that maximizes the degree density among S_n, …, S_1. The pseudocode is shown in Algorithm 1.

Input: Undirected graph G = (V, E)

Output: A dense subgraph of G: S_G.

1: S_G ← V
2: S ← V;
3: while S ≠ ∅ do
4:     Find the vertex v with minimum deg_S(v);
5:     Remove v and all its adjacent edges from S;
6:     if d(S) > d(S_G) then
7:         S_G ← S
8:     end if
9: end while
10: Return S_G.
Algorithm 1 Greedy
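For concreteness, here is a minimal Python sketch of the peeling algorithm above (the paper's own implementation is in C++ and is not reproduced here). It uses a lazy-deletion heap to locate the minimum-degree vertex, and returns the densest suffix of the peeling order:

```python
import heapq

def charikar_greedy(n, edges):
    """Charikar's peeling: repeatedly remove a minimum-degree vertex and
    return the densest of the n subgraphs seen along the way."""
    adj = [set() for _ in range(n)]
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    deg = [len(a) for a in adj]
    heap = [(deg[v], v) for v in range(n)]
    heapq.heapify(heap)
    alive = [True] * n
    m_cur, n_cur = len(edges), n
    best_density = m_cur / n_cur
    order, cut = [], 0  # peel `cut` vertices of `order` to get the best set
    while n_cur > 1:
        d, v = heapq.heappop(heap)
        if not alive[v] or d != deg[v]:
            continue  # stale heap entry (lazy deletion)
        alive[v] = False
        order.append(v)
        m_cur -= deg[v]
        n_cur -= 1
        for w in adj[v]:
            if alive[w]:
                deg[w] -= 1
                heapq.heappush(heap, (deg[w], w))
        if m_cur / n_cur > best_density:
            best_density = m_cur / n_cur
            cut = len(order)
    S = set(range(n)) - set(order[:cut])
    return S, best_density
```

On the 4-clique-plus-pendant example, the algorithm peels the pendant vertex first and reports the clique with density 1.5.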

Fast numerical approximation algorithms for DSP. Bahmani et al. [4] approached the DSP via its dual problem, which in turn they reduced to O(log_{1+ε} n) instances of solving a positive linear program. To solve these LPs, they employed the multiplicative weights update framework [2, 27] to achieve a (1 + ε)-approximation in O(log n / ε²) iterations, where each iteration requires O(m) work.

Notable extensions of the DSP. The DSP has been studied in weighted graphs, as well as directed graphs. When the edge weights are non-negative, both the maximum flow algorithm and Charikar’s greedy algorithm maintain their theoretical guarantees. In the presence of negative weights, the DSP in general becomes NP-hard [36]. For directed graphs, Charikar [8] provided a linear programming approach, which requires solving O(n²) linear programs, and a polynomial-time 2-approximation algorithm. Khuller and Saha have provided more efficient implementations of the exact and approximation algorithms for the undirected and directed versions of the DSP [22]. Furthermore, Tsourakakis et al. recently extended the DSP to the k-clique and the (p, q)-biclique densest subgraph problems [33, 26]. These extensions can be used for finding large near-cliques in general graphs and bipartite graphs respectively. The DSP has also been studied in the dynamic setting [7, 10, 28], the streaming setting [5, 7, 25, 11], and in the MapReduce computational model [5]. Bahmani, Goel, and Munagala use the multiplicative weights update framework [2, 27] to design an improved MapReduce algorithm [4]. We discuss this method in greater detail in Section 3. Tatti and Gionis [32] introduced a novel graph decomposition, known as the locally-dense decomposition, which imposes certain insightful constraints on the k-core decomposition. Further, efficient algorithms to find locally-dense subgraphs were developed by Danisch et al. [9].

We notice that in the DSP there are no restrictions on the size of the output. When restrictions on the size of S are imposed, the problem becomes NP-hard. The densest-k-subgraph problem asks to find the subgraph with maximum degree density among all possible sets S such that |S| = k. The state-of-the-art algorithm is due to Bhaskara et al. [6], and provides an O(n^{1/4+ε})-approximation in n^{O(1/ε)} time. A long-standing question is closing the gap between this upper bound and the known lower bounds. Other versions, where |S| ≤ k or |S| ≥ k, have also been considered in the literature; see [1].

3 Proposed Method

3.1 The Greedy++ algorithm

As we discussed earlier, Charikar’s peeling algorithm greedily removes the node of smallest degree from the graph, and returns the densest subgraph among the sequence of n subgraphs created by this procedure. While ties may exist and are broken arbitrarily, for the moment it is useful to think of Charikar’s greedy algorithm as producing a single permutation of the nodes, which naturally defines a nested sequence of subgraphs.

Algorithm description. Our proposed algorithm Greedy++ iteratively runs Charikar’s peeling algorithm, while keeping some information about the past runs. This information is crucial, as it results in different permutations that naturally yield higher quality outputs. The pseudocode for Greedy++ is shown in Algorithm 2. It takes as input the graph G and a parameter T, the number of passes to be performed, and runs an iterative, weighted peeling procedure. In each round, the load of each node is a function of its induced degree and the load from the previous rounds. It is worth outlining that the algorithm is easy to implement, as it is essentially T instances of Charikar’s algorithm. What is less obvious, perhaps, is why this algorithm makes sense and works well. We answer this question in detail in Section 3.2.

Input: Undirected graph G = (V, E), iteration count T

Output: An approximately densest subgraph of G: S_G.

1: S_G ← V
2: Initialize the vertex load vector ℓ(u) ← 0 for all u ∈ V;
3: for t ← 1 to T do
4:     S ← V;
5:     while S ≠ ∅ do
6:         Find the vertex v with minimum ℓ(v) + deg_S(v);
7:         ℓ(v) ← ℓ(v) + deg_S(v);
8:         Remove v and all its adjacent edges from S;
9:         if d(S) > d(S_G) then
10:             S_G ← S
11:         end if
12:     end while
13: end for
14: Return S_G.
Algorithm 2 Greedy++
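A minimal Python sketch of Algorithm 2, mirroring the pseudocode (again, not the authors' C++ implementation): the only change relative to plain peeling is that vertices are ordered by load plus current degree, and loads persist across rounds.

```python
import heapq

def greedy_plus_plus(n, edges, T):
    """Greedy++ sketch: T rounds of weighted peeling. In each round a
    vertex's key is load[v] + deg[v]; a peeled vertex adds its current
    induced degree to its load, which reorders later rounds."""
    adj = [set() for _ in range(n)]
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    load = [0] * n
    best_density, best_S = 0.0, set()
    for _ in range(T):
        deg = [len(adj[v]) for v in range(n)]
        alive = [True] * n
        heap = [(load[v] + deg[v], v) for v in range(n)]
        heapq.heapify(heap)
        m_cur, n_cur = len(edges), n
        while n_cur > 0:
            k, v = heapq.heappop(heap)
            if not alive[v] or k != load[v] + deg[v]:
                continue  # stale heap entry (lazy deletion)
            # The currently alive vertex set is a candidate subgraph.
            if m_cur / n_cur > best_density:
                best_density = m_cur / n_cur
                best_S = {u for u in range(n) if alive[u]}
            alive[v] = False
            load[v] += deg[v]  # charge v its induced degree at removal time
            m_cur -= deg[v]
            n_cur -= 1
            for w in adj[v]:
                if alive[w]:
                    deg[w] -= 1
                    heapq.heappush(heap, (load[w] + deg[w], w))
    return best_S, best_density
```

With T = 1, all loads are zero throughout the round, so the procedure coincides with a single run of the greedy peeling algorithm.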

Example. We provide a graph instance that clearly illustrates why Greedy++ is a significant improvement over the classical greedy algorithm. We discuss the first two rounds of Greedy++. Consider a graph G that is the disjoint union of a complete bipartite graph K_{d,D} and a large collection of (d+1)-cliques, where D ≫ d. G is pictured in Figure 1(a). When the cliques dominate, the density of G is close to d/2,

which is precisely the density of any (d+1)-clique. However, the density of K_{d,D} is dD/(d+D), which approaches d as D grows and is in fact the optimal solution. Charikar’s algorithm may output G itself: every clique vertex and every vertex on the large side of K_{d,D} has degree d, so, depending on how ties are broken, the algorithm starts eliminating degree-d nodes from K_{d,D}, and by doing this it never sees a subgraph with density much above d/2. This example illustrates that the 2-approximation is essentially tight. Consider now a run of Greedy++.

In its first iteration, it simply emulates Charikar’s algorithm. The vertices of K_{d,D} which were eliminated first each receive load d. Within each (d+1)-clique, the vertices receive loads d, d−1, …, 1, 0 in the order in which they are peeled. Figure 1(b) shows the cumulative degrees of the vertices of G after one iteration of Greedy++.

In the second iteration, every vertex on the large side of K_{d,D} has cumulative degree (load plus degree) 2d, whereas in each clique the vertex with load 0 has cumulative degree only d. These clique vertices therefore get deleted first, and once a clique starts unraveling, the cumulative degrees of its remaining vertices stay at d, so all the cliques get peeled away before the algorithm touches K_{d,D}. This leaves us with the bipartite graph as the output after the second iteration, whose density is almost optimal.

[Figure 1 appears here.]
(a) Initial degrees of G
(b) Cumulative degrees (degree + load) of G after one iteration
Figure 1: Illustration of two iterations of Greedy++ on G. The output after one iteration is G itself, whereas the output after the second iteration is the bipartite graph K_{d,D}, whose density is near-optimal.
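The two candidate densities in the hard instance can be checked by direct arithmetic. The concrete values d = 3, D = 9 below are illustrative choices, not taken from the paper:

```python
# Densities in the hard instance: K_{d,D} has d*D edges on d+D vertices,
# while a (d+1)-clique has d*(d+1)/2 edges on d+1 vertices.
d, D = 3, 9
bip = d * D / (d + D)   # density of K_{3,9}: 27/12 = 2.25
clique = d / 2          # density of a 4-clique: 1.5
print(bip, clique)      # the ratio bip/clique approaches 2 as D grows
```

As D → ∞ the bipartite density tends to d, twice the clique density, which is exactly the regime where the plain greedy algorithm loses a factor close to 2.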

Theoretical guarantees. Before we prove the theoretical properties of our proposed algorithm Greedy++, it is worth outlining that experiments indicate that the performance of Greedy++ is significantly better than the worst-case analysis we perform. Furthermore, we conjecture that our guarantees are not tight from a theoretical perspective; an interesting open question is to extend our analysis in Section 3.2 for Greedy++ to prove that it asymptotically provides an optimal solution for the DSP. We conjecture that our algorithm, when run for O(1/ε²) iterations, is a (1 + ε)-approximation algorithm for the DSP. Our first lemma states that Greedy++ is a 2-approximation algorithm for the DSP.

Lemma 3.1.

Let S_G be the output of Greedy++. Then d(S_G) ≥ λ*/2, where λ* denotes the optimum value of the problem.

Proof.

Notice that the first iteration is identical to Charikar's 2-approximation algorithm, and S_G is at least as dense as the output of the first iteration. ∎

The next lemma bounds the quality of the dual solution, i.e., at each iteration the average load (averaged over the algorithm's iterations) assigned to any vertex is at most twice the optimum density λ*.

Lemma 3.2.

The following invariant holds for Greedy++: for any vertex u and iteration t, the load of u at the end of iteration t satisfies ℓ_t(u) ≤ 2tλ*.

Proof.

First, let t = 1. The proof for this base case goes through identically as in [8].

Now, assume that the statement is true for some iteration index t. Consider the point at which vertex u is chosen in iteration t + 1, and denote the graph at that instant by H. For any vertex v of H at that point, the cumulative degree is ℓ_t(v) + deg_H(v). Since u has the minimum cumulative degree at that point, its cumulative degree is at most the average over H, so

    ℓ_{t+1}(u) = ℓ_t(u) + deg_H(u) ≤ (1/|V(H)|) Σ_{v∈H} (ℓ_t(v) + deg_H(v)) ≤ 2tλ* + 2d(H) ≤ 2(t + 1)λ*,

where the second inequality uses the inductive hypothesis together with Σ_{v∈H} deg_H(v) = 2e(H), and the last inequality uses d(H) ≤ λ*. ∎

Running time. Finally, we bound the running time of the algorithm as follows. The next lemma states that each iteration of our algorithm can be implemented to run in linear time, so T iterations take O(T(m + n)) time overall.

Lemma 3.3.

Each iteration of the above algorithm runs in time O(m + n).

Proof.

The deletion operation, along with assigning edges to a vertex and updating degrees, takes O(m) total time, since every edge is assigned exactly once. Finding the minimum cumulative-degree vertex can be implemented in two ways:

  1. Since the cumulative degrees in our algorithm are integers, we can create a list for each separate integer value and scan the lists upwards to find the minimum. After deleting a vertex with cumulative degree k, the keys of its neighbors decrease by one each, so the minimum cannot drop below k − 1 and we only need to resume the scan from k − 1 onwards. The total scanning time is therefore proportional to the number of deletions plus key decrements, i.e., O(m + n).

  2. We can maintain a priority queue, which needs a total of O(m + n) update operations, each taking O(log n) time. ∎

Note that in the case of weighted graphs, we cannot maintain lists for each possible degree, and hence, it is necessary to use a priority queue.
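The bucket-list idea from the proof can be sketched as follows for the unweighted first pass, where keys are plain degrees; the same structure extends to integer cumulative degrees. This is an illustrative sketch, not the paper's implementation:

```python
def peel_order_buckets(n, edges):
    """Bucket-based peeling: one list per integer degree value. Total
    work is O(m + n) because the scan pointer can only move down by one
    each time a deleted vertex's neighbor loses a degree."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    deg = [len(a) for a in adj]
    buckets = [set() for _ in range(n)]  # degrees lie in 0..n-1
    for v in range(n):
        buckets[deg[v]].add(v)
    alive = [True] * n
    order, d = [], 0
    for _ in range(n):
        while not buckets[d]:
            d += 1  # scan upwards for the next non-empty bucket
        v = buckets[d].pop()
        alive[v] = False
        order.append(v)
        for w in adj[v]:
            if alive[w]:
                buckets[deg[w]].remove(w)
                deg[w] -= 1
                buckets[deg[w]].add(w)
        d = max(d - 1, 0)  # a neighbor's key may have dropped below d
    return order
```

On a triangle with a pendant vertex, the pendant (degree 1) is peeled first, then the triangle unravels.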

3.2 Why does Greedy++ work well?

Explaining the intuition behind Greedy++ requires an understanding of the load balancing interpretation of Charikar’s LP for the DSP [8], and of the multiplicative weights update (MWU) framework by Plotkin, Shmoys and Tardos [27] for packing/covering LPs. In the context of the DSP, the MWU framework was first used by Bahmani, Goel, and Munagala [4]. We include a self-contained exposition of the required concepts from [4, 8] in this section, which flows naturally and concludes with our algorithmic contributions. Intuitively, the additional passes that Greedy++ performs improve the load balancing.

Charikar’s LP and the load balancing interpretation. The following is a well-known LP formulation of the densest subgraph problem, introduced in [8]; we refer to it as the primal LP. Its optimal objective value is known to be λ*.

    maximize   Σ_{e∈E} x_e
    subject to x_e ≤ y_u, x_e ≤ y_v   for all e = (u, v) ∈ E
               Σ_{u∈V} y_u ≤ 1
               x_e, y_u ≥ 0.

We then construct the dual LP for the above problem. Let f_e(u) be the dual variable associated with the first constraints of the form x_e ≤ y_u, and let D be associated with the last constraint. We get the following LP, which we refer to as the dual LP, and whose optimum is also λ*:

    minimize   D
    subject to f_e(u) + f_e(v) ≥ 1         for all e = (u, v) ∈ E
               Σ_{e : u ∈ e} f_e(u) ≤ D    for all u ∈ V
               f_e(u) ≥ 0.

This LP can be visualized as follows. Each edge e = (u, v) has a load of 1, which it wants to split between its endpoints as f_e(u) and f_e(v), such that the total load of any vertex u, Σ_{e : u ∈ e} f_e(u), is at most D. The objective is to find the minimum D for which such a load assignment is feasible.

For a fixed D, the above dual problem can be framed as a flow problem on a bipartite graph as follows: let the left side represent E and the right side represent V. Add a super-source s and edges from s to all left-side vertices with capacity 1. Add an edge from e to u if e is incident on u in G, and let every right-side vertex absorb at most D units of load (equivalently, add a super-sink t with an edge of capacity D from every u ∈ V). Although Goldberg's initial reduction [17] involved a different flow network, this graph can also be used to compute a maximum flow and thereby find the exact optimum of our problem. From strong duality, we know that the optimal objective values of both linear programs are equal, namely λ*. Let λ be the objective of any feasible solution to the primal LP, and let D̂ be the objective of any feasible solution to the dual LP. Then, by weak duality, λ ≤ λ* ≤ D̂.
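The load-balancing view gives a convenient certificate: any assignment of each edge's full unit load to one of its endpoints is a feasible dual solution, so the maximum resulting vertex load upper-bounds λ*. A small illustrative sketch (the orientation encoding is our own, not from the paper):

```python
def density_upper_bound(edges, orientation):
    """Weak-duality witness: orientation[i] == True sends edge i's unit
    load to its first endpoint, else to its second. The maximum vertex
    load of this integral dual solution upper-bounds lambda*."""
    load = {}
    for (u, v), to_u in zip(edges, orientation):
        head = u if to_u else v
        load[head] = load.get(head, 0) + 1
    return max(load.values())

# Triangle: orienting the edges around the cycle loads each vertex with
# exactly one edge, certifying that the optimal density is at most 1
# (and the triangle itself achieves density 3/3 = 1).
triangle = [(0, 1), (1, 2), (2, 0)]
print(density_upper_bound(triangle, [False, False, False]))  # 1
```

A poorly balanced orientation only weakens the certificate: sending all three loads to vertex 0 would give the looser bound 3.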

Bahmani et al. [4] use the following covering LP formulation: for a fixed D, decide the feasibility of the constraints f_e(u) + f_e(v) ≥ 1 for each edge e = (u, v), subject to the polyhedral constraints:

    Σ_{e : u ∈ e} f_e(u) ≤ D for all u ∈ V,   f ≥ 0.

The width of this linear program is the maximum value of f_e(u) + f_e(v) over vectors f that satisfy the constraints of the program. In order to provably bound the width of the above LP, Bahmani et al. introduce another set of simple constraints of the form f_e(u) ≤ 1 + ε, where ε is a small constant. So, for a particular value of D, they verify the approximate feasibility of the covering problem using the MWU framework. However, this necessitates running a binary search over all possible values of D and finding the lowest value of D for which the LP is feasible. Since the precision required for D can be as small as 1/(n(n−1)), this binary search is inefficient in practice. Furthermore, due to the constraints added to bound the width, extracting the primal solution (i.e., an approximately densest subgraph) from the dual is no longer straightforward, and the additional rounding step needed to overcome this incurs additional loss in the approximation factor.

In order to overcome these practical issues, we propose an alternate MWU formulation which sacrifices the width bounds but avoids the binary search over D. Eliminating the artificial width bound makes it straightforward to extract a primal solution. Moreover, our experiments on real-world graphs suggest that the width is not a bottleneck for the running time of the MWU algorithm. Even more importantly, our alternate formulation naturally yields Greedy++, as we explain in the following.

Our MWU formulation. We can denote the dual LP succinctly as follows:

    minimize   D
    subject to Af ≤ D·1,  f ∈ P,

where f is the vector representation of all the variables f_e(u), A is the n × 2m matrix denoting the left-hand sides of all constraints of the form Σ_{e : u ∈ e} f_e(u) ≤ D, 1 denotes the vector of 1's, and P is a polyhedral constraint set defined as follows:

    P = { f ≥ 0 : f_e(u) + f_e(v) = 1 for all e = (u, v) ∈ E }.

Note that for any f, the minimum D satisfying Af ≤ D·1 is equal to max_{u∈V} (Af)_u. This follows due to the non-negativity of (Af)_u for any u. Now a simple observation shows that for any non-negative vector y, we can write max_u y_u = max_{p ∈ Δ_n} ⟨p, y⟩, where Δ_n = { p ≥ 0 : Σ_u p_u = 1 } is the probability simplex. Hence, we can now write the dual LP as:

    min_{f ∈ P} max_{p ∈ Δ_n} ⟨p, Af⟩ = max_{p ∈ Δ_n} min_{f ∈ P} ⟨p, Af⟩.     (1)

Here the last equality follows from strong duality (the minimax theorem) of convex optimization.

The “inner” minimization part of (1) can be performed easily. In particular, we need an oracle which, given a vector p ∈ Δ_n, solves min_{f ∈ P} ⟨p, Af⟩.

Lemma 3.4.

Given a vector p ∈ Δ_n, f*(p) = argmin_{f ∈ P} ⟨p, Af⟩ can be computed in O(m) time.

Proof.

For each edge e = (u, v), simply check which of p_u and p_v is smaller. WLOG, assume it is p_u. Then, set f_e(u) = 1 and f_e(v) = 0. ∎
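The oracle is a single pass over the edges. A hypothetical Python rendering, storing each edge's pair (f_e(u), f_e(v)) explicitly:

```python
def mwu_oracle(edges, p):
    """Oracle for the inner problem of (1): given vertex weights p, each
    edge sends its full unit load to the endpoint with the smaller weight,
    minimizing <p, Af> over {f >= 0 : f_e(u) + f_e(v) = 1}."""
    f = []
    for u, v in edges:
        if p[u] <= p[v]:
            f.append((1.0, 0.0))  # load goes to u: (f_e(u), f_e(v))
        else:
            f.append((0.0, 1.0))  # load goes to v
    return f

p = [0.5, 0.2, 0.3]
print(mwu_oracle([(0, 1), (1, 2)], p))  # [(0.0, 1.0), (1.0, 0.0)]
```

Since each coordinate of the returned vector is 0 or 1, every vertex can collect at most deg(u) ≤ Δ(G) units of load, which is the width bound used below.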

We denote the optimal f for a given p by f*(p). Now, using the above oracle, we can apply the MWU algorithm to the “outer” problem of (1), i.e., max_{p ∈ Δ_n} min_{f ∈ P} ⟨p, Af⟩. Additionally, to apply the MWU framework, we need to estimate the width of this linear program. The width of (1) can be bounded by the largest degree Δ(G) of the graph G. Indeed, we see in Lemma 3.4 that f*(p) is a 0/1 vector; in that case, max_u (Af*(p))_u ≤ Δ(G).

We conclude our analysis of this alternative dual formulation of the DSP with the following theorem.

Theorem 3.5.

Our alternative dual formulation admits an MWU algorithm that outputs an f̂ ∈ P such that max_{u∈V} (Af̂)_u ≤ (1 + ε)·λ*.

For the sake of completeness, we detail the MWU algorithm and the proof of Theorem 3.5 in Appendix A.

Let us now view Charikar’s peeling algorithm in the context of this dual problem. In a sense, the greedy peeling algorithm resembles one “inner” iteration of the MWU algorithm: whenever a vertex is removed, its remaining edges assign their load to it. Keeping this in mind, we designed Greedy++ to add “outer” iterations to the peeling algorithm, thus improving the approximation factor arbitrarily as the iteration count increases. By weighting vertices using their loads from previous iterations, Greedy++ implicitly performs a form of load balancing on the graph, thus arriving at a better dual solution.

4 Experiments

4.1 Experimental setup

Name #Nodes #Edges
web-trackers [23] 40 421 974 140 613 762
orkut [23] 3 072 441 117 184 899
livejournal-affiliations [23] 10 690 276 112 307 385
wiki-topcats 1 791 489 25 447 873
cit-Patents 3 774 768 16 518 948
actor-collaborations [23] 382 219 15 038 083
ego-gplus 107 614 12 238 285
dblp-author 5 425 963 8 649 016
web-BerkStan 685 230 6 649 470
flickr [37] 80 513 5 899 882
wiki-Talk 2 394 385 4 659 565
web-Google 875 713 4 322 051
com-youtube 1 134 890 2 987 624
roadNet-CA 1 965 206 2 766 607
web-Stanford 281 903 1 992 636
roadNet-TX 1 379 917 1 921 660
roadNet-PA 1 088 092 1 541 898
Ego-twitter 81 306 1 342 296
com-dblp 317 080 1 049 866
com-Amazon 334 863 925 872
soc-slashdot0902 82 168 504 230
soc-slashdot0811 77 360 469 180
soc-Epinions 75 879 405 740
blogcatalog [37] 10 312 333 983
email-Enron 36 692 183 831
ego-facebook 4 039 88 234
ppi [31] 3 890 37 845
twitter-retweet [30] 316 662 1 122 070
twitter-favorite [30] 226 516 1 210 041
twitter-mention [30] 571 157 1 895 094
twitter-reply [30] 196 697 296 194
soc-sign-slashdot081106 77 350 468 554
soc-sign-slashdot090216 81 867 497 672
soc-sign-slashdot090221 82 140 500 481
soc-sign-epinions 131 828 711 210
Table 1: Datasets used in our experiments.

The experiments were performed on a single machine, with an Intel(R) Core(TM) i7-2600 CPU at 3.40GHz (4 cores), 8MB cache size, and 8GB of main memory. We find exact densest subgraphs using binary search and maximum flow computations. The flow computations were done using HiPR, a C++ implementation of the push-relabel algorithm [18], available at http://www.avglab.com/andrew/soft/hipr.tar. We have implemented our algorithm Greedy++ and Charikar's greedy algorithm in C++. Our implementations are efficient, and our code for Greedy++ and the exact algorithm is publicly available at the anonymous link https://www.dropbox.com/s/jzouo9fjoytyqg3/code-greedy%2B%2B.zip?dl=0.

Figure 2: Number of iterations for Greedy++. Histograms of number of iterations to reach (a) 99% of the optimum degree density, (b) the optimum degree density.
Figure 3: Scalability. (a) Running time in seconds of each iteration of Greedy++ versus the number of edges. (b) Speedup achieved by Greedy++ vs. number of edges in the graph. Specifically, the y-axis is the ratio of the run time of the exact max-flow algorithm divided by the run time of Greedy++ to find 90% of the optimal solution.

We use a variety of datasets obtained from Stanford's SNAP database [24], ASU's Social Computing Data Repository [37], BioGRID [31], and the Koblenz Network Collection [23], shown in Table 1. A majority of the datasets are from SNAP, and hence we mark only the rest with their sources. Multiple edges and self-loops are removed, and directionality is ignored for directed graphs. The first cluster of datasets consists of unweighted graphs. The largest unweighted graph is the web-trackers graph with roughly 141M edges, while the smallest unweighted graph has roughly 25K edges. For weighted graphs, we use a set of Twitter graphs that were crawled during the first week of February 2018 [30]. Finally, we use a set of signed networks (slashdot, epinions). We remind the reader that while the DSP is NP-hard on signed graphs, Charikar's algorithm does provide certain theoretical guarantees; see Theorem 2 in [36].

4.2 Experimental results

Before we delve in detail into our experimental findings, we summarize our key findings here:

• Our algorithm Greedy++, when given a sufficient number of iterations, always finds the optimal value and the densest subgraph. This agrees with our conjecture that running O(1/ε²) iterations of Greedy++ gives a (1 + ε)-approximation to the DSP.

  • Experimentally, Charikar’s greedy algorithm always achieves at least 80% accuracy, and occasionally finds the optimum.

• For graphs on which the performance of Charikar's greedy algorithm is optimal, the first couple of iterations of Greedy++ suffice to safely deduce convergence, and thus act in practice as a certificate of optimality. To the best of our knowledge, this is the first method that can be used to quickly infer the actual approximation quality of Charikar's algorithm on a given graph instance.

• When Charikar’s algorithm does not yield an optimal solution, Greedy++ within a few iterations is able to increase the accuracy to 99% of the optimum density, and with a few more iterations is able to find the optimal density and extract an optimal output.

• When we are able to run the exact algorithm on our machine (for graphs with more than 8M edges, the maximum flow code crashes), the average speedup that our algorithm provides to reach the optimum is 144.6×, with a standard deviation equal to 57.4. The smallest speedup observed was 67.9×, and the largest 290×. Additionally, we remark that the exact algorithm is only able to find solutions up to a limited numerical accuracy on most graphs.

  • The speedup typically increases as the size of the graph increases. In fact, the maximum flow exact algorithm cannot complete on the largest graphs we use.

  • The maximum number of iterations needed to reach 90% of the optimum is at most 3, i.e., by running two more passes compared to Charikar’s algorithm, we are able to boost the accuracy by 10%.

  • The same remarks hold for both weighted and unweighted graphs.

Figure 4: Convergence to optimum as a function of the number of iterations of Greedy++. (a) roadNet-CA, (b) roadNet-PA, (c) roadNet-TX, (d) com-Amazon, (e) dblp-author, (f) ego-twitter, (g) twitter-favorite, (h) twitter-reply. Here, the accuracy is given by d(S_k)/λ*, where S_k is the output of Greedy++ after k iterations.

Number of iterations. We first measure how many iterations are needed to reach 99% of the optimum, or even the optimum itself. Figures 2(a) and (b) answer these questions respectively. We observe the impressive performance of Charikar's greedy algorithm: for the majority of the graph instances it finds a near-optimal densest subgraph. Nonetheless, even for those graph instances, as we have emphasized earlier, our algorithm Greedy++ acts as a certificate of optimality. Namely, we observe that the objective remains the same after a couple of iterations if and only if the algorithm has reached the optimum. For the rest of the graphs, where Charikar's greedy algorithm outputs an approximation greater than 80% but less than 99%, we observe the following: for five datasets it takes at most 3 iterations, for one graph it takes nine iterations, and there exist three graphs for which Greedy++ requires 10, 22, and 29 iterations respectively. If we insist on finding the optimum densest subgraph, we observe that the maximum number of iterations can go up to 100. On average, Greedy++ requires 12.69 iterations to reach the optimum densest subgraph.

Scalability. Our experiments verify the intuitive facts that (i) each iteration of the greedy algorithm runs fast, and (ii) the exact algorithm that uses maximum flows is comparatively slow. We restrict ourselves to the set of datasets on which we were able to run the exact algorithm. Figure 3(a) shows the time that each iteration of Greedy++ takes on average (runtimes are well concentrated around the average) over the iterations performed to reach the optimal densest subgraph. Figure 3(b) shows the speedup achieved by our algorithm, conditioned on obtaining at least 90% (frequently the actual accuracy exceeds 95%) of the optimal solution, over the exact max-flow based algorithm. Specifically, we plot the ratio of the running time of the exact algorithm to that of Greedy++ versus the number of edges. Notice that for small graphs the speedups are very large, then they drop, and then they exhibit an increasing trend as the graph size grows. For the largest graphs in our collection, the exact algorithm is infeasible to run on our machine.

Figure 5: Log-log plot of optimal degree density versus the number of edges in the graph.

Convergence. Figure 4 illustrates the convergence of Greedy++ for various datasets. Specifically, each figure plots the accuracy of Greedy++ after t iterations versus t. The accuracy is measured as the ratio of the degree density achieved by Greedy++ to the optimal degree density. Figures 4(a),(b),(c),(d),(e),(f),(g),(h) correspond to the convergence behavior of roadNet-CA, roadNet-PA, roadNet-TX, com-Amazon, dblp-author, ego-twitter, twitter-favorite, and twitter-reply respectively. These plots illustrate various interesting properties of Greedy++ in practice. Observe Figure 4(e): Greedy++ keeps outputting the same subgraph for a few consecutive iterations, but then, around the 10th iteration, it suddenly “jumps” and finds an even denser subgraph. Recall that on average over our collection of datasets for which we can run the exact algorithm (i.e., datasets with fewer than 8M edges), Greedy++ requires roughly 12 iterations to reach the optimum densest subgraph. For this reason we suggest running Greedy++ for that many iterations in practice. Furthermore, we typically observe an improvement already over the first pass, with the exception of the weighted graph twitter-reply, where the “jump” happens at the end of the third iteration.

Anomaly detection. It is worth outlining that Greedy++ provides a way to compute the densest subgraph in graphs where the maximum flow approach does not scale. For example, for graphs with more than 8 million edges, the exact method does not run on our machine. By running Greedy++ for enough iterations we can compute a near-optimal or the optimal solution. This allows us to compute a proxy of ρ* for the largest graphs, like orkut and trackers. We examined to what extent there exists a pattern between the size of the graph and the optimal density. In contrast to the power law relationship between the k-cores and the graph size claimed in [29], we do not observe a similar power law when we plot ρ* (the exact optimal value, or the proxy value found by Greedy++ after 100 iterations for the largest graphs) versus the number of edges in the graph. This is shown in Figure 5. Part of the reason why we do not observe such a law is the presence of anomalies in graphs. For instance, we observe that small graphs may contain extremely dense subgraphs, thus resulting in significant outliers.

5 Conclusion

In this paper we provide a powerful algorithm for the densest subgraph problem, a popular and important objective for discovering dense components in graphs. The practical value of our Greedy++ algorithm is two-fold. First, by running a few more iterations of Charikar’s greedy algorithm we obtain (near-)optimal results that previously required maximum flow computations. Second, Greedy++ can be used to answer for the first time the question “Is the approximation of Charikar’s algorithm on this graph instance closer to 1/2 or to 1?” without computing the optimal density via maximum flows. Empirically, we have verified that Greedy++ combines the best of two worlds on real data: the time and space efficiency of the greedy peeling algorithm, and the accuracy of the exact maximum flow algorithm. We believe that Greedy++ is a valuable addition to the algorithmic toolbox for dense subgraph discovery.

We conclude our work with the following intriguing open question stated as a conjecture:

Conjecture 5.1.

Greedy++ is a (1 + O(1/√T))-approximation algorithm for the DSP, where T is the number of iterations it performs.

References

  • [1] R. Andersen and K. Chellapilla. Finding dense subgraphs with size bounds. In International Workshop on Algorithms and Models for the Web-Graph, pages 25–37. Springer, 2009.
  • [2] S. Arora, E. Hazan, and S. Kale. The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 8(1):121–164, 2012.
  • [3] Y. Asahiro, K. Iwama, H. Tamaki, and T. Tokuyama. Greedily finding a dense subgraph. Journal of Algorithms, 34(2):203–221, 2000.
  • [4] B. Bahmani, A. Goel, and K. Munagala. Efficient primal-dual graph algorithms for mapreduce. In International Workshop on Algorithms and Models for the Web-Graph, pages 59–78. Springer, 2014.
  • [5] B. Bahmani, R. Kumar, and S. Vassilvitskii. Densest subgraph in streaming and mapreduce. Proceedings of the VLDB Endowment, 5(5):454–465, 2012.
  • [6] A. Bhaskara, M. Charikar, E. Chlamtac, U. Feige, and A. Vijayaraghavan. Detecting high log-densities: an O(n^{1/4}) approximation for densest k-subgraph. In Proceedings of the forty-second ACM symposium on Theory of computing, pages 201–210. ACM, 2010.
  • [7] S. Bhattacharya, M. Henzinger, D. Nanongkai, and C. Tsourakakis. Space-and time-efficient algorithm for maintaining dense subgraphs on one-pass dynamic streams. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pages 173–182. ACM, 2015.
  • [8] M. Charikar. Greedy approximation algorithms for finding dense components in a graph. In International Workshop on Approximation Algorithms for Combinatorial Optimization, pages 84–95. Springer, 2000.
  • [9] M. Danisch, T. H. Chan, and M. Sozio. Large scale density-friendly graph decomposition via convex programming. In Proceedings of the 26th International Conference on World Wide Web, WWW 2017, Perth, Australia, April 3-7, 2017, pages 233–242, 2017.
  • [10] A. Epasto, S. Lattanzi, and M. Sozio. Efficient densest subgraph computation in evolving graphs. In Proceedings of the 24th International Conference on World Wide Web, pages 300–310. International World Wide Web Conferences Steering Committee, 2015.
  • [11] H. Esfandiari, M. Hajiaghayi, and D. P. Woodruff. Applications of uniform sampling: Densest subgraph and beyond. arXiv preprint arXiv:1506.04505, 2015.
  • [12] H. Esfandiari, S. Lattanzi, and V. Mirrokni. Parallel and streaming algorithms for k-core decomposition. arXiv preprint arXiv:1808.02546, 2018.
  • [13] Y. Freund and R. E. Schapire. Game theory, on-line prediction and boosting. In Proceedings of the Ninth Annual Conference on Computational Learning Theory, COLT ’96, pages 325–332, New York, NY, USA, 1996. ACM.
  • [14] G. Gallo, M. D. Grigoriadis, and R. E. Tarjan. A fast parametric maximum flow algorithm and applications. SIAM Journal on Computing, 18(1):30–55, 1989.
  • [15] C. Giatsidis, F. Malliaros, D. Thilikos, and M. Vazirgiannis. Corecluster: A degeneracy based graph clustering framework. In Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014.
  • [16] A. Gionis and C. E. Tsourakakis. Dense subgraph discovery: KDD 2015 tutorial. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 2313–2314. ACM, 2015.
  • [17] A. V. Goldberg. Finding a maximum density subgraph. University of California Berkeley, CA, 1984.
  • [18] A. V. Goldberg and R. E. Tarjan. A new approach to the maximum-flow problem. Journal of the ACM (JACM), 35(4):921–940, 1988.
  • [19] J. Håstad. Clique is hard to approximate within n^{1−ε}. Acta Mathematica, 182(1), 1999.
  • [20] R. Kannan and V. Vinay. Analyzing the structure of large graphs. Rheinische Friedrich-Wilhelms-Universität Bonn Bonn, 1999.
  • [21] Y. Kawase and A. Miyauchi. The densest subgraph problem with a convex/concave size function. Algorithmica, 80(12):3461–3480, 2018.
  • [22] S. Khuller and B. Saha. On finding dense subgraphs. In International Colloquium on Automata, Languages, and Programming, pages 597–608. Springer, 2009.
  • [23] J. Kunegis. Konect: The koblenz network collection. In Proceedings of the 22nd International Conference on World Wide Web, WWW ’13 Companion, pages 1343–1350, New York, NY, USA, 2013. ACM.
  • [24] J. Leskovec and A. Krevl. SNAP Datasets: Stanford large network dataset collection, June 2014.
  • [25] A. McGregor, D. Tench, S. Vorotnikova, and H. T. Vu. Densest subgraph in dynamic graph streams. In International Symposium on Mathematical Foundations of Computer Science, pages 472–482. Springer, 2015.
  • [26] M. Mitzenmacher, J. Pachocki, R. Peng, C. Tsourakakis, and S. C. Xu. Scalable large near-clique detection in large-scale networks via sampling. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 815–824. ACM, 2015.
  • [27] S. A. Plotkin, D. B. Shmoys, and É. Tardos. Fast approximation algorithms for fractional packing and covering problems. Mathematics of Operations Research, 20(2):257–301, 1995.
  • [28] S. Sawlani and J. Wang. Near-optimal fully dynamic densest subgraph. arXiv preprint arXiv:1907.03037, 2019.
  • [29] K. Shin, T. Eliassi-Rad, and C. Faloutsos. Corescope: graph mining using k-core analysis: patterns, anomalies and algorithms. In 2016 IEEE 16th International Conference on Data Mining (ICDM), pages 469–478. IEEE, 2016.
  • [30] K. Sotiropoulos, J. W. Byers, P. Pratikakis, and C. E. Tsourakakis. Twittermancer: Predicting interactions on twitter accurately. arXiv preprint arXiv:1904.11119, 2019.
  • [31] C. Stark, B.-J. Breitkreutz, T. Reguly, L. Boucher, A. Breitkreutz, and M. Tyers. Biogrid: A general repository for interaction datasets. Nucleic acids research, 34:D535–9, 01 2006.
  • [32] N. Tatti and A. Gionis. Density-friendly graph decomposition. In Proceedings of the 24th International Conference on World Wide Web, WWW 2015, Florence, Italy, May 18-22, 2015, pages 1089–1099, 2015.
  • [33] C. Tsourakakis. The k-clique densest subgraph problem. In Proceedings of the 24th international conference on world wide web, pages 1122–1132. International World Wide Web Conferences Steering Committee, 2015.
  • [34] C. Tsourakakis. Streaming graph partitioning in the planted partition model. In Proceedings of the 2015 ACM on Conference on Online Social Networks, pages 27–35. ACM, 2015.
  • [35] C. Tsourakakis, F. Bonchi, A. Gionis, F. Gullo, and M. Tsiarli. Denser than the densest subgraph: extracting optimal quasi-cliques with quality guarantees. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 104–112. ACM, 2013.
  • [36] C. E. Tsourakakis, T. Chen, N. Kakimura, and J. Pachocki. Novel dense subgraph discovery primitives: Risk aversion and exclusion queries. arXiv preprint arXiv:1904.08178, 2019.
  • [37] R. Zafarani and H. Liu. Social computing data repository at ASU, 2009.

Appendix A Multiplicative Weights Update Algorithm

In this section, we give an algorithm to solve the zero-sum game that corresponds to the dual of the densest subgraph problem, as described in Section 3.2. Given oracle access to an exact best response for the inner player, we can use the multiplicative weights update framework to obtain an additive ε-approximation of the value of the game [13].

The pseudocode for the MWU algorithm is shown in Algorithm 3.

Input: Matrix A, approximation factor ε.

Output: An approximate solution to the zero-sum game.

1:Initialize the weight vector as w_i^(1) = 1 for all i
2:Initialize the number of rounds T
3:for t = 1, …, T do
4:     Set p_i^(t) = w_i^(t) / Σ_j w_j^(t) for all i.
5:     Find x^(t) using Oracle(p^(t)).
6:     Set m^(t) = A x^(t).
7:     Let m_i^(t) be the i-th element in m^(t).
8:     Update the weights as w_i^(t+1) = w_i^(t) · (1 − ε m_i^(t) / ρ).
9:end for
10:Return x̄ = (1/T) Σ_t x^(t) as the solution.
Algorithm 3 Multiplicative Weight Update Algorithm
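As a concrete reference point, here is a minimal numpy sketch of the scheme in Algorithm 3, instantiated for a generic zero-sum game with a brute-force best-response oracle. The brute-force oracle, the variable names, and the payoff-matrix interface are illustrative assumptions for this sketch; they stand in for the linear-time DSP oracle used in the paper.

```python
import numpy as np

def mwu_zero_sum(A, eps, T):
    """Sketch of the multiplicative weights scheme of Algorithm 3.

    A is the payoff matrix of a zero-sum game: the row player (tracked by
    the weight vector w) wants small payoff, the column player wants large
    payoff.  The best-response oracle here simply scans all columns."""
    n, k = A.shape
    rho = np.abs(A).max() or 1.0   # width of the game (scale of the losses)
    w = np.ones(n)                 # weights over the row player's strategies
    responses = []
    for _ in range(T):
        p = w / w.sum()            # current mixed strategy of the row player
        j = int(np.argmax(p @ A))  # oracle: best-response column against p
        responses.append(j)
        loss = A[:, j]             # loss vector induced by the oracle's answer
        w = w * (1.0 - eps * loss / rho)  # multiplicative weights update
    # the uniform average of the oracle's responses approximates the
    # equilibrium strategy of the column player
    x = np.bincount(np.array(responses), minlength=k) / T
    return x
```

On matching pennies (A = [[1, −1], [−1, 1]]), for instance, the returned mixed strategy converges toward the uniform equilibrium (1/2, 1/2) as T grows.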

To prove the convergence of Algorithm 3, we use the following theorem from [2]. We modify it slightly to accommodate the fact that the width of the DSP, ρ, can be at most Δ, the maximum degree of the graph. In other words, the oracle can assign at most Δ edges to any particular vertex.

Lemma A.1 (Theorem 3.1 from [2]).

Given an error parameter δ > 0, there is an algorithm which solves the zero-sum game up to an additive factor of δ using O(ρ² log n / δ²) calls to Oracle, with an additional processing time of O(n) per call, where ρ is the width of the problem.

Using the fact that our Oracle runs in time O(m) (from Lemma 3.4), and using ρ ≤ Δ, we get the following corollary.

Corollary A.2.

The Multiplicative Weight Update algorithm (Algorithm 3) outputs a (1 + ε)-approximate solution to the densest subgraph problem in time Õ(mΔ²/ε²).