Retrieving Top Weighted Triangles in Graphs

10/01/2019 ∙ by Raunak Kumar, et al. ∙ Cornell University ∙ Stanford University

Pattern counting in graphs is a fundamental primitive for many network analysis tasks, and a number of methods have been developed for scaling subgraph counting to large graphs. Many real-world networks carry a natural notion of strength of connection between nodes, which is often modeled by a weighted graph, but existing scalable graph algorithms for pattern mining are designed for unweighted graphs. Here, we develop a suite of deterministic and random sampling algorithms that enable the fast discovery of the 3-cliques (triangles) with the largest weight in a graph, where weight is measured by a generalized mean of a triangle's edges. For example, one of our proposed algorithms can find the top-1000 weighted triangles of a weighted graph with billions of edges in thirty seconds on a commodity server, which is orders of magnitude faster than existing "fast" enumeration schemes. Our methods thus open the door towards scalable pattern mining in weighted graphs.

1. Introduction

Small subgraph patterns, also called graphlets or network motifs, have proven fundamental for the understanding of the structure of complex networks (Milo et al., 2004; Benson et al., 2016; Milo et al., 2002). One of the simplest non-trivial subgraph patterns is the triangle (3-clique), and the basic problem of triangle counting and enumeration has been studied extensively from theoretical and practical perspectives (Avron, 2010; Eden et al., 2017; Seshadhri et al., 2013; Berry et al., 2015; Stefani et al., 2017a). These developments are often driven by the desire to scale graph counting to large networks, where performing computations naively is intractable. The focus on triangles is in part spurred by the widespread use of the pattern in graph mining applications, including community detection (Berry et al., 2011; Gleich and Seshadhri, 2012; Rohe and Qin, 2013), network comparison (Contractor et al., 2006; Mahadevan et al., 2007; Pržulj, 2007), representation learning (Henderson et al., 2012; Rossi and Ahmed, 2015), and generative modeling (Robins et al., 2007; Robles et al., 2016). In addition, triangle-based network statistics such as the clustering coefficient are used extensively in the social sciences (Durak et al., 2012; Lawrence, 2006; Burt, 2007; Welles et al., 2010).

Nearly all of the algorithmic literature on scalable counting or enumeration of triangles focuses on unweighted graphs. However, many real-world network datasets have a natural notion of weight attached to the edges of the graph (Barrat et al., 2004). For example, edge weights can capture tie strength in social networks (Wasserman and Faust, 1994), traffic flows in transportation networks (Jia et al., 2019), or co-occurrence counts in projections of bipartite networks (Xu et al., 2014). Such edge weights offer additional insight into the structure of these networks. Moreover, edge weights can enrich the types of small subgraph patterns that are used in analysis. For instance, the network clustering coefficient has been generalized to account for edge weights (Opsahl and Panzarasa, 2009; Onnela et al., 2005b); in these cases, a triangle is given a weight derived from the weights of its constituent edges. Roughly speaking, the weight of a triangle is typically some combination of the arithmetic mean, geometric mean, minimum and maximum of the edge weights of the triangle. All this being said, we still lack the algorithmic tools for fast analysis of modern large-scale weighted networks, especially in the area of weighted triangle listing and counting.

In applications of weighted triangles in this big data regime, it can often suffice to retrieve only the k triangles of largest weight for some suitable k. For example, in large online social networks, the weight of an edge could reflect how likely it is for users to communicate with each other, and top-weighted triangles and cliques in this network could be used for group chat recommendations. In such a scenario, we would typically only be interested in a small number of triangles whose nodes are very likely to communicate with each other, as opposed to finding all triangles in the graph.

Another application of finding top-weighted triangles appears in prediction tasks involving higher-order network interactions. The goal of the "higher-order link prediction" problem is to predict which new groups of nodes will simultaneously interact (such as which group of authors will co-author a scientific paper in the future) (Benson et al., 2018). In this setting, existing algorithms first create a weighted graph where an edge weight is the number of prior interactions that involve the two endpoints, and then predict that the top-weighted triangles in this weighted graph will appear as higher-order interactions in the future (weight here is measured by a generalized mean of the triangle's edge weights); a small sketch of this graph construction appears below. Again, it is not necessary to find all triangles since only the top predictions will be acted upon. Existing triangle enumeration algorithms do not scale to massive graphs for these problems, and we need efficient algorithms for retrieving triangles in large weighted graphs.
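For concreteness, here is a minimal sketch of that graph construction, assuming the prior interactions are given as lists of node ids; the function and variable names are ours, not the paper's.

```python
from collections import Counter
from itertools import combinations

def build_weighted_projection(interactions):
    """Weighted graph from group interactions: the weight of edge (u, v)
    is the number of interactions containing both u and v."""
    weights = Counter()
    for group in interactions:
        for u, v in combinations(sorted(set(group)), 2):
            weights[(u, v)] += 1
    return weights  # dict mapping (u, v) with u < v to a weight

# Example: three past "papers"; authors 1 and 2 have collaborated twice.
print(build_weighted_projection([[1, 2, 3], [1, 2], [2, 4]]))
```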

In this work, we address the problem of enumerating the top-weighted triangles in a weighted graph. To be precise, let G = (V, E) be a simple, undirected graph with positive edge weights w. We define the weight of a triangle in G to be equal to a generalized p-mean of its edge weights; specifically, if a triangle t has edge weights w_ij, w_jk, and w_ik, then the triangle weight is

(1)   w_p(t) = ( (w_ij^p + w_jk^p + w_ik^p) / 3 )^(1/p).

Given G and an integer parameter k, we develop algorithms to extract the top-k weighted triangles in G. We use top-k to refer to the k triangles having the largest weights, or in other words, the k heaviest triangles. Note that some special cases of the p-mean include the arithmetic mean (p = 1), geometric mean (p → 0), harmonic mean (p = -1), minimum (p → -∞), and maximum (p → +∞). This family of means is more general and includes those previously examined by Opsahl and Panzarasa (Opsahl and Panzarasa, 2009) and Benson et al. (Benson et al., 2018).
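For reference, a small sketch of the triangle weight in Eq. 1, including the limiting cases of the generalized p-mean (the helper name is ours):

```python
import math

def p_mean_weight(w1, w2, w3, p):
    """Generalized p-mean of a triangle's three edge weights (Eq. 1)."""
    if p == 0:                                   # geometric mean (limit p -> 0)
        return (w1 * w2 * w3) ** (1.0 / 3.0)
    if math.isinf(p):                            # limits p -> +/- infinity
        return max(w1, w2, w3) if p > 0 else min(w1, w2, w3)
    return ((w1 ** p + w2 ** p + w3 ** p) / 3.0) ** (1.0 / p)

print(p_mean_weight(1, 4, 16, 1))   # arithmetic mean: 7.0
print(p_mean_weight(1, 4, 16, 0))   # geometric mean: 4.0
```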

At a high level, we develop two families of algorithms for extracting top-weighted triangles. The first family of algorithms (Section 3) is deterministic and optimized for extracting the top-k weighted triangles for small k (typically up to a few tens of thousands). These algorithms take advantage of the heavy-tailed edge weight distributions common in real-world networks. In the most general case, we show that under a modified configuration model, these algorithms are even "distribution-oblivious," in the sense that they can automatically compute optimal hyperparameters for a wide range of input weight distributions. Additionally, the algorithmic analysis is done in a continuous sense (rather than discrete), which may be of independent interest. The second family of algorithms (Section 4) is randomized and aims to extract a large number of heavy triangles (not necessarily the top-k). We show that this family of sampling algorithms is closely connected to prior sampling algorithms for counting triangles in unweighted graphs (Seshadhri et al., 2014). Furthermore, we show that these sampling algorithms are easily parallelizable.

We find that a carefully tuned parallel implementation of our deterministic algorithm performs well across a broad range of large weighted graphs, even outperforming the fast random sampling algorithms that are not guaranteed to enumerate all of the top-weighted triangles. A parallel implementation of our algorithm running on a commodity server with 64 cores can find the top 1000 weighted triangles in under 10 seconds on several graphs with hundreds of millions of weighted edges and in 30 seconds on a graph with nearly two billion weighted edges. We compare this with the off-the-shelf alternative approach, which would be an intelligent triangle enumeration algorithm that maintains a heap of the top-weighted triangles. Our proposed algorithms are orders of magnitude faster than this standard approach.

2. Additional related work

Due to the wide applicability of triangle counting, there is a plethora of work on triangle-related algorithms for unweighted graphs. In the context of enumeration algorithms, recent attention has focused on enumeration in the distributed and parallel setting (Chu and Cheng, 2011; Suri and Vassilvitskii, 2011; Arifuzzaman et al., 2013; Rahman and Hasan, 2013). These algorithms typically employ an optimized brute-force method on each machine (Latapy, 2008; Berry et al., 2010), with the main algorithmic difficulty being how to partition the data amongst the machines. Although each machine employs a brute-force algorithm, early research showed that such algorithms run in time almost linear in the number of edges so long as the degeneracy of the graph is small (Chiba and Nishizeki, 1985), which has led to efficient enumeration algorithms (Berry et al., 2015; Suri and Vassilvitskii, 2011). For comparison with our methods, we modify such a fast enumeration method (specifically NodeIterator++ (Suri and Vassilvitskii, 2011)) to retain the top-k weighted triangles. Although enumeration algorithms are agnostic to edge weights, we note that the sheer number of triangles in massive graphs renders such an approach prohibitively expensive.

When enumeration becomes intractable, triangle-related algorithms focus instead on triangle counts or graph statistics such as clustering coefficients. Again, these statistics are in the unweighted regime, as only the number of triangles is considered. There is a progression of sampling methods depending on what kind of structure one samples from the graph. At a basic level, edge-based sampling methods sample an edge and count all triangles incident on that edge. So-called wedge-based methods sample length-2 paths (Seshadhri et al., 2013), and this concept has been generalized for counting 4-cliques (Jha et al., 2015). Finally, tiered sampling combines sampling of arbitrary subgraphs to count the occurrences of larger graphs (with a focus on 4-cliques and 5-cliques) (Stefani et al., 2017b).

Beyond enumeration and sampling, there are numerous other methods for triangle-based algorithms, such as graph sparsification methods (Tsourakakis et al., 2009; Pagh and Tsourakakis, 2012a; Etemadi et al., 2016), spectral and matrix based methods (Tsourakakis, 2008; Alon et al., 1997), and a multitude of methods for computing clustering and closure coefficients (Rahman and Hasan, 2014; Seshadhri et al., 2014; Schank and Wagner, 2005; Yin et al., 2019). For a deeper background on triangle counting, we refer the reader to the overview of Hasan and Dave (Hasan and Dave, 2018).

All of the above methods are for triangles. These ideas have been extended in several ways. There are sampling methods for estimating counts of more general motifs (Ahmed et al., 2014; Jain and Seshadhri, 2017; Bressan et al., 2017) and motifs with temporal structure (Liu et al., 2019), as well as parallel clique enumeration methods (Danisch et al., 2018). Still, these methods do not work for weighted graphs, where subgraph patterns appear in generalizations of the clustering coefficient (Onnela et al., 2005a) as well as in graph decompositions (Soufiani and Airoldi, 2012).

3. Deterministic algorithms

We begin by developing two types of deterministic algorithms for finding the top-k weighted triangles in a graph. As mentioned above, the triangle weight is given by the generalized p-mean of its edge weights as defined in Eq. 1. A simple and robust baseline algorithm is to employ a fast triangle enumeration algorithm for unweighted graphs, compute the weight of each triangle, and pick out the top-k weighted triangles according to the weighting function (or, to save memory, we can maintain a heap of the top-weighted triangles, as sketched below). In our tests, we use an optimized sequential version of NodeIterator++ (Berry et al., 2015; Suri and Vassilvitskii, 2011; Chiba and Nishizeki, 1985), which is the basis for many parallel enumeration algorithms. We call this a "brute force" approach. We note that faster, parallel versions of the brute force approach exist. However, as the results in Section 5 show, even a brute force enumerator with perfect parallelism would require over 2000 cores (or machines) to beat our sequential deterministic algorithm in certain cases.
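The heap-based bookkeeping mentioned above is standard; a minimal sketch (the enumerator itself is abstracted away, and the names are ours):

```python
import heapq

def top_k_from_enumeration(triangle_iter, weight_fn, k):
    """Keep the k heaviest triangles from an enumeration stream using a
    size-k min-heap, so memory stays O(k) regardless of triangle count."""
    heap = []  # (weight, triangle); the lightest retained triangle is at the root
    for t in triangle_iter:
        w = weight_fn(t)
        if len(heap) < k:
            heapq.heappush(heap, (w, t))
        elif w > heap[0][0]:
            heapq.heapreplace(heap, (w, t))
    return sorted(heap, reverse=True)  # heaviest first
```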

The brute force approach is agnostic to the distribution of edge weights—it is the same regardless of the weights. Intuitively, however, we expect that triangles of large weight are formed by edges of large weight. We exploit this intuition below to develop faster algorithms. At a high level, our main deterministic algorithm dynamically partitions the edges into "heavy" and "light" classes based on edge weight. Following this partition, we find triangles adjacent to the heavy edges until the top-k heaviest are identified.

Input: Weighted graph G, scaling p, number of triangles k, weight threshold τ
1  H ← {e ∈ E : w_e ≥ τ}  (the "heavy" edges)
2  T ← all triangles formed by edges in H
return the k triangles in T with largest p-mean weight.
Algorithm 1 Static heavy-light algorithm for finding top-weighted triangles.

A simple heavy-light algorithm.  As a precursor to our dynamic algorithm, consider a simple threshold-based algorithm: given a threshold τ, partition the edges into a "heavy" set H and a "light" set L. For a large threshold τ, we expect most of the edges to be in the "light" class. Thus, the subgraph induced by H is small, and we may run any enumeration algorithm on it to get a collection of heavy triangles (Algorithm 1). This is not by itself guaranteed to find the k heaviest triangles—edges in H might only appear in triangles with edges in L. However, we find that in practice the heaviest triangles always have all of their edges in H. We note that with no additional asymptotic cost, we can use existing triangle enumeration algorithms to check for triangles with only one or two heavy edges (and the rest light). Unfortunately, the constant-factor slowdowns substantially increase the running time on real-world graphs.
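A minimal sketch of the static heavy-light idea in Algorithm 1, using a naive enumerator over the heavy subgraph; the data layout (edges keyed as (u, v) with u < v) and the assumption p ≥ 1 are ours:

```python
def static_heavy_light(weights, tau, k, p=1):
    """Algorithm 1 sketch: keep edges of weight >= tau and enumerate
    triangles inside the heavy subgraph only."""
    heavy = {e: w for e, w in weights.items() if w >= tau}
    adj = {}
    for u, v in heavy:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    triangles = []
    for (u, v), w_uv in heavy.items():
        for x in adj[u] & adj[v]:            # common neighbors within the heavy subgraph
            if x > v:                        # emit each triangle exactly once (u < v < x)
                s = (w_uv ** p
                     + heavy[(min(u, x), max(u, x))] ** p
                     + heavy[(min(v, x), max(v, x))] ** p)
                triangles.append((s, (u, v, x)))
    return sorted(triangles, reverse=True)[:k]   # ranked by sum of p-th powers
```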

In practice, we find that this simple algorithm vastly outperforms brute force and can always find the top-weight triangle given a proper threshold; thus, this method will serve as a robust baseline. Nonetheless, there are a couple of issues with this static heavy-light algorithm. The first is that, since the algorithm relies on a single threshold to partition the edges into the light and heavy sets, many more triangles may be enumerated than is necessary. The second is that it is difficult to know what an appropriate value for τ may be given no prior knowledge of the input graph. In the next section, we develop a dynamic variant of Algorithm 1 to deal with these issues.

3.1. Dynamic heavy-light algorithm

Input: Weighted graph G, scaling p, number of triangles k, parameter c.
Sort the edges e_1, e_2, …, e_m in decreasing order of weight. Initialize the threshold θ, the triangle set T ← ∅, the partitions S ← ∅, H ← ∅, L ← E, and the edge pointers i ← 0, j ← 0.  // We take the convention that S = {e_1, …, e_i}, H = {e_{i+1}, …, e_j}, and L = {e_{j+1}, …, e_m}.
1 while there are fewer than k triangles of weight above θ in T do
2       if promoting a light edge is the better move (as determined by c; see Section 3.3) then
3             Move e_{j+1} from L to H. Add to T all triangles formed by e_{j+1} and 2 edges from H, and all triangles formed by e_{j+1}, 1 edge from H, and 1 edge from L.
4       else
5             Move e_{i+1} from H to S. Add to T all triangles formed by e_{i+1} and 2 edges from L.
6       end if
7      Update the threshold θ.
8 end while
return the k triangles in T with largest p-mean weight.
Algorithm 2 Dynamic heavy-light algorithm for finding the top-k weighted triangles.

We now develop a dynamic algorithm that uses the concepts of Algorithm 1 but is significantly more efficient.

Suppose we have preprocessed the edges so that they are sorted in decreasing order of weight, w_1 ≥ w_2 ≥ ⋯ ≥ w_m, where w_r denotes the weight of the r-th edge e_r. Our dynamic heavy-light algorithm maintains a dynamic partition of the edges into three sets based on weight:

  • S = {e_1, …, e_i} are the "super-heavy" edges of the i largest weights;

  • H = {e_{i+1}, …, e_j} are the "heavy" edges consisting of the next j − i edges of largest weights; and

  • L = {e_{j+1}, …, e_m} are the remaining "light" edges that are neither "heavy" nor "super-heavy."

As the algorithm evolves, we adjust the sets S, H, and L by changing the values of the pointers i and j. Any triangle can be broken down into a combination of "super-heavy," "heavy," and "light" edges. As a first-order approximation, we intuitively expect the heaviest class of triangles to have three super-heavy edges, the next heaviest to have two super-heavy edges and one heavy edge, and so on down to the case of three light edges. Furthermore, by considering the edges in a specific order, we can also obtain useful bounds on the weight of the heaviest triangles we have not yet enumerated. Suppose we have enumerated all triangles containing three super-heavy edges. Then the heaviest triangle not yet enumerated must have at least one edge from either H or L, which upper bounds the sum of the p-th powers of that triangle's edge weights by 2·w_1^p + w_{i+1}^p. Our method will try to enumerate triangles so that this kind of bound decreases as quickly as possible.

Algorithm 2 outlines our procedure. Each iteration of the algorithm consists of two steps: (i) update the partition by moving an edge to a heavier class; and (ii) enumerate triangles whose edges come from certain classes. At the end of each iteration, we maintain the invariant that we have enumerated all triangles with at least one super-heavy edge or at least two heavy edges. This invariant is what allows us to obtain a bound on the heaviest triangle we have not yet enumerated. The constant c of Algorithm 2 determines how edges get promoted to heavier classes; this parameter is used in our analysis to optimize the expected decrease in the threshold θ. We will specify this constant later in our analysis. When the algorithm begins, the partitions are initialized with S = ∅, H = ∅, and L = E. The algorithm runs until it enumerates k triangles above a dynamically decreasing threshold θ (the while-loop condition in Algorithm 2).

Let w_(k) be the weight of the k-th heaviest triangle. As soon as θ ≤ w_(k), we will have enumerated all of the top-k triangles. This algorithm solves both issues of our static heavy-light algorithm. If the threshold hits w_(k) exactly, we only enumerate around k triangles. As θ is computed on the fly, there is no need to choose in advance the threshold at which we partition the edges. As a further benefit, we give a parameter-free version of our algorithm where c is implicitly computed. The algorithmic analysis is done in a continuous sense (rather than discrete), which may be of independent interest.
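To make the mechanics concrete, here is a simplified sketch of the dynamic heavy-light idea. It maintains the invariant from Section 3.2 and the stopping bound from Section 3.2 (heaviest heavy edge plus twice the heaviest light edge, in p-th power space), but it chooses which pointer to advance with a naive greedy rule (largest immediate drop in the bound) rather than the cost-aware rule of Section 3.3, and it assumes p ≥ 1; the data layout and names are ours.

```python
def dynamic_heavy_light(weights, k, p=1):
    """Simplified sketch of Algorithm 2. weights maps (u, v), u < v, to a weight."""
    edges = sorted(weights, key=weights.get, reverse=True)
    m, rank, adj = len(edges), {}, {}
    for r, (u, v) in enumerate(edges):
        rank[(u, v)] = r
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    def key(a, b):
        return (a, b) if a < b else (b, a)

    tri = {}  # sorted node triple -> sum of p-th powers of its edge weights

    def add_triangles(idx, need_rank_below=None):
        # Add triangles containing edges[idx]; optionally require that one of
        # the other two edges already lies in S or H (rank < need_rank_below).
        u, v = edges[idx]
        for x in adj[u] & adj[v]:
            e1, e2 = key(u, x), key(v, x)
            if need_rank_below is not None and min(rank[e1], rank[e2]) >= need_rank_below:
                continue
            t = tuple(sorted((u, v, x)))
            tri[t] = weights[edges[idx]] ** p + weights[e1] ** p + weights[e2] ** p

    i = j = 0  # S = edges[:i], H = edges[i:j], L = edges[j:]
    while True:
        hi = weights[edges[i]] ** p if i < m else 0.0
        lo = weights[edges[j]] ** p if j < m else 0.0
        bound = hi + 2.0 * lo  # no unenumerated triangle can exceed this power sum
        if i >= m or sum(1 for s in tri.values() if s >= bound) >= k:
            break
        next_hi = weights[edges[i + 1]] ** p if i + 1 < m else 0.0
        next_lo = weights[edges[j + 1]] ** p if j + 1 < m else 0.0
        if i < j and (j >= m or next_hi + 2 * lo <= hi + 2 * next_lo):
            add_triangles(i)                       # promote H -> S: all triangles on this edge
            i += 1
        else:
            add_triangles(j, need_rank_below=j)    # promote L -> H: needs an edge already in S or H
            j += 1
    return sorted(tri, key=tri.get, reverse=True)[:k]
```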

3.2. Algorithm correctness

We first bound the weight of the heaviest triangle not yet enumerated. Observe that when an edge moves from L to H (line 3 of Algorithm 2), all triangles including that edge and at least one edge from H are enumerated; when an edge moves from H to S (line 5), all triangles including that edge are enumerated. Thus, the if-else clause maintains the invariant that all triangles with at least one edge from S or at least two edges from H are enumerated. Now consider any triangle whose edge weights have p-th power sum greater than w_{i+1}^p + 2·w_{j+1}^p. By case analysis, there must either exist two edges with weight at least w_{j+1}, or one edge with weight at least w_{i+1}: otherwise, every edge of the triangle has weight below w_{i+1} and at least two of its edges have weight below w_{j+1}, so its p-th power sum cannot exceed w_{i+1}^p + 2·w_{j+1}^p. This means that either two edges are from S ∪ H, or one edge is from S. In either case, our invariant ensures that the triangle must have been enumerated. In fact, similar reasoning shows a tight threshold is w_{i+1}^p + w_{j+1}^p + w_{j+2}^p, as a triangle formed by e_{i+1}, e_{j+1}, and e_{j+2} is potentially an unenumerated triangle. However, this sum is at most w_{i+1}^p + 2·w_{j+1}^p due to the monotonicity of the edge weights.

Since θ is monotonically decreasing, this implies that at the end of each iteration of the while loop in Algorithm 2, T contains every triangle in the graph with weight above θ (there may be triangles not enumerated with weight equal to that of one of the triangles in T). Therefore, Algorithm 2 correctly returns the top-k triangles, provided the graph has at least k triangles.

3.3. Algorithm analysis

Although Algorithm 2 is a discrete algorithm, we present a simple analysis using continuous differentials. Let w_i(t) and w_j(t) denote the weights of edges e_i and e_j at time t, respectively (one can think of t as a continuous counter for the while-loop iterations). At time t, the threshold is θ(t), a function of w_i(t) and w_j(t) as described above. As a proxy for optimizing the triangle enumeration rate, we would like to maximize the rate at which the threshold decreases instead. To do this, we maximize the derivative dθ/dt by adjusting how we advance the pointers i and j based on the input parameter c.

Let us consider what dw_i/dt and dw_j/dt represent. These derivatives approximate the maximum change we can make in w_i or w_j "per unit of computation time." In each iteration of the while loop, we can choose to spend time decreasing either w_i or w_j. Thus, a rough approximation to the derivatives is the ratio of the change in weight (by incrementing either the i or the j pointer) to the computational cost of changing the corresponding pointer.

Let m = |E|, and let F and f be the cumulative distribution function and probability density function of the edge weight distribution, respectively. If we move the edge pointer i, then the average change in w_i is the ratio of a small weight interval to the number of edges whose weight falls in that interval around w_i. The number of edges with weight in an interval of width Δw around w_i is proportional to m·f(w_i)·Δw, so the average change in w_i per pointer move is approximately

(2)   Δw_i ≈ 1 / (m·f(w_i)).

Similarly, the change in w_j is approximately 1 / (m·f(w_j)).

Analysis for power-law distributed weights.  At this point, we are free to continue our analysis with any model for the distribution of weights on E. One important case that is analyzable is a power law distribution on the edge weights, and we find that this type of distribution is a reasonable model for several of our datasets (see Fig. 1 for examples). Thus, in this section, we carry out the analysis assuming that the edge weights follow a power law distribution with parameter α.

Figure 1. Edge weight distribution in two datasets (see Section 5.1 for a description of the datasets). These plots suggest that a power law distribution on edge weights is a reasonable assumption. With this, we have a simple condition (Eq. 4) to choose which pointer to move in the dynamic heavy-light algorithm (Algorithm 2).

Formally, let X be a random variable. We say that X follows a power law distribution with parameter α and some constant C if Pr(X ≥ x) = C·x^(−α) for large x. Thus, the probability that a random edge weight is greater than or equal to x is C·x^(−α), and this implies that the probability that a random edge weight is equal to x is proportional to x^(−(α+1)) for large x. Using this assumption, we can write the changes in w_i and w_j as proportional to w_i^(α+1)/m and w_j^(α+1)/m, respectively.

Now we analyze the computational cost of changing w_i and w_j. To do this, we impose a simple configuration model on the way that G is generated (Newman et al., 2001). We assume that each vertex draws its degree from a univariate degree distribution, with the sum of degrees being even, and that the graph is generated from the following random process: each vertex v starts out with d_v stubs; while there are stubs available, two random stubs are drawn from the set of all stubs, and the vertices corresponding to those stubs are connected; upon connection, a random edge weight drawn from the edge weight distribution is assigned to the edge. At the end, all self-loops in the graph are discarded. Note that these assumptions are quite strong; however, we find that even this simple analysis yields good estimates for the optimal value of c in practice (see Section 5). Continuing with the analysis, we now analyze the expected cost of incrementing either the i or the j pointer. With appropriate data structures for checking the existence of edges, the cost of incrementing j is bounded by the degree sum of the endpoints of e_{j+1} in the subgraph induced by S ∪ H, which on average is proportional to the average degree of that subgraph. Assuming that the subgraph induced by H has approximately as many edges as the one induced by S ∪ H (valid when S is small), the computational cost of moving j is approximately proportional to the number of edges currently in S ∪ H divided by the number of nodes.
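For intuition, a small sketch of the modified configuration model just described, assuming a given degree sequence and an arbitrary weight sampler; collapsing parallel edges to a single weight is a simplification on our part.

```python
import random

def weighted_configuration_model(degrees, sample_weight):
    """Pair up stubs uniformly at random, assign each realized connection a
    random weight, and discard self-loops (as described in the text)."""
    stubs = [v for v, d in enumerate(degrees) for _ in range(d)]
    random.shuffle(stubs)                        # random pairing of stubs
    weights = {}
    for a, b in zip(stubs[::2], stubs[1::2]):
        if a == b:
            continue                             # discard self-loops
        weights[(min(a, b), max(a, b))] = sample_weight()
    return weights

# Example: 5 nodes, even degree sum, Pareto (power-law-like) edge weights.
g = weighted_configuration_model([2, 2, 2, 3, 3], lambda: random.paretovariate(2.0))
```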

Similarly, the computational cost of moving i is bounded by the degree sum of the endpoints of e_{i+1} in G, which on average is proportional to the average degree of G. Since the number of edges in G is exactly m, and using the assumptions on the weight distribution, the computational cost of moving i is approximately proportional to m divided by the number of nodes.

Combining this with Eq. 2, we obtain the following expressions for the derivatives

(3)

Note that since w_i and w_j are decreasing, both derivatives are monotonically decreasing as the algorithm progresses. This monotonicity property means that greedily choosing which pointer to increment is optimal. The threshold decrease rate is dθ/dt. Since at each iteration of the algorithm we can only choose to change one of w_i or w_j, we should greedily change the pointer that gives the most "bang per buck", i.e., choose i and j such that

(4)

In other words, we should maintain the edge pointers i and j such that the weights w_i and w_j remain geometrically separated, by a factor determined by c.

Distribution oblivious dynamic heavy-light algorithm.  The analysis in the previous section yields a fast algorithm given a known prior on the power-law parameter α of the weight distribution. In many applications, this can be easily and robustly estimated. In this section, we present a method for which the parameter can be implicitly estimated on the fly, albeit with less robustness in practice than simply setting c.

Although we assumed a power-law distribution on the edge weights, our analysis is actually much more general than that. Note that as long as the derivatives dw_i/dt and dw_j/dt are monotonic, our greedy method of incrementing the pointers will be optimal. For the derivatives to be monotonically decreasing, the only requirement is that the probability density of the weight distribution is monotonically increasing as the weight decreases. This includes a wide family of distributions, such as power laws and uniform distributions.

Furthermore, the analysis we used to derive c can also be used to compute it implicitly. By maintaining estimates of the derivatives dw_i/dt and dw_j/dt as the algorithm runs, we can compute all the derivatives used in the analysis on the fly and greedily change the pointer with the higher value in Eq. 4.

Following the analysis in the previous section, the change in weight for w_j is estimated by the ratio of a small weight interval to the number of edges whose weight falls in that interval, and similarly for w_i. The computational cost of changing j can be estimated by the sum of the degrees of the endpoints of e_{j+1} in the subgraph induced by S ∪ H, and similarly for i with the graph G. Consequently, we can obtain a "distribution-oblivious" algorithm that works on a family of monotone distributions.

In our experiments, we find that this automatic way of implicitly computing the promotion rule is quite successful, although in practice noise in the derivative estimates may cause this algorithm to be slower than using a set value of c.

4. Random sampling algorithms

In this section, we develop random sampling algorithms designed to sample a large collection of triangles with large weight. More formally, given a generalized p-mean as a weight function, these algorithms all sample triangles with probability exactly proportional to their weight. The main difference between the algorithms is how efficiently they can generate samples.

We specifically generalize three types of sampling schemes that have been used to estimate triangle counts in unweighted graphs. The first scheme is based on edge sampling (Kolountzakis et al., 2012; Tsourakakis et al., 2011; Pagh and Tsourakakis, 2012b), where we first sample one edge and then enumerate triangles adjacent to the sampled edge. The second method uses ideas from wedge sampling (Seshadhri et al., 2014), where we sample two adjacent edges and check whether these two edges induce a triangle. The final approach generalizes the idea of path sampling (Jha et al., 2015); we sample a three-edge path and check if it induces a triangle. Although these approaches were all designed for triangle counting in unweighted graphs, they generalize quite seamlessly to a simple scheme for sampling highly weighted triangles. The main benefits of these algorithms are that they are simple to implement and also easy to parallelize, since samples can be generated trivially in parallel.

Throughout this section, we assume that our weighting function for a triangle is a generalized p-mean as given in Eq. 1. Since the ordering of triangles by weight is independent of the scaling by 1/3 and the exponent 1/p, we can instead consider the simpler weighting function

(5)   w(t) = w_ij^p + w_jk^p + w_ik^p.

For a given vertex u, we will use N(u) to denote the set of neighbors of u, and d_u to denote the (unweighted) degree of node u (i.e., d_u = |N(u)|).

4.1. Weighted edge sampling

Input: Weighted graph G, scaling p, number of iterations s, number of triangles k
1 Initialize triangle set T ← ∅
2 for iteration 1, …, s do
3     Sample an edge (u, v) with probability proportional to w_uv^p
4     for each neighbor x ∈ N(u) ∩ N(v) do
5         Add the triangle {u, v, x} to T
6     end for
7 end for
return the k triangles in T with largest p-mean weight.
Algorithm 3 Weighted edge random sampling algorithm

We first discuss the edge sampling algorithm (Algorithm 3). The algorithm is based on a simple two-step procedure. First, we sample a single edge (u, v), choosing each edge with probability proportional to the p-th power of its weight. Second, after we sample the edge (u, v), we enumerate all triangles incident to it. These two steps are repeated several times.

The above procedure has a few issues. First, it takes time proportional to the degrees of u and v to find the triangles adjacent to an edge (u, v), which can be expensive in graphs where high-degree nodes are connected; we get around this in our next sampling scheme. Second, there is no guarantee that the above procedure will generate at least k unique triangles, and moreover, even if the algorithm samples a sufficient number of triangles, it is not necessarily the case that these are the top-weighted triangles. The latter issue is an inherent limitation of random sampling schemes in general. All that being said, the procedure has the nice property of being biased towards sampling triangles with high weight, formalized as follows.

Proposition 4.1.

The probability that a triangle t is enumerated in a given iteration of Algorithm 3 is w(t)/Z, where Z = Σ_{e ∈ E} w_e^p.

Proof.

The probability that any particular edge e is sampled in the first step is w_e^p / Z. Triangle t is enumerated if any one of its three edges is this sampled edge. ∎

While the ES method is simple to describe, making the algorithm fast in practice requires some careful implementation. First, a natural way of sampling an edge is to simply pick one at random with probability proportional to its weight, but this is slow because there are a large number of edges. However, there is typically a much smaller number of unique edge weights. Thus, we first sample an edge weight and then sample an edge with this weight. The pre-processing calculation of sampling probabilities for this approach involves iterating over the edges and computing two quantities—a cumulative edge weight (in order to sample an edge weight) and a map from edge weight to edges (in order to sample an edge given an edge weight). This pre-processing step can take much longer than the sampling loop if implemented naively. However, in a sorted list of edges, all edges that share the same edge weight lie in a contiguous chunk, and this significantly speeds up the process of computing the above quantities.
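A minimal sketch of weighted edge sampling (Algorithm 3) with the weight-bucket trick described above; the data layout (edges keyed as (u, v) with u < v, plus a neighbor-set map) and the names are ours, and as discussed, the output is not guaranteed to be the exact top-k.

```python
import bisect
import random
from collections import defaultdict

def weighted_edge_sampling(weights, adj, k, p, iters):
    """Sample edges with probability proportional to w^p by first sampling a
    weight value, then an edge with that weight; enumerate incident triangles."""
    buckets = defaultdict(list)
    for e, w in weights.items():
        buckets[w].append(e)
    values = sorted(buckets)
    cum, total = [], 0.0
    for w in values:                     # cumulative mass: (#edges of weight w) * w^p
        total += len(buckets[w]) * (w ** p)
        cum.append(total)

    tri = {}
    for _ in range(iters):
        r = random.random() * total
        w = values[bisect.bisect_left(cum, r)]      # stage 1: a weight value
        u, v = random.choice(buckets[w])            # stage 2: an edge of that weight
        for x in adj[u] & adj[v]:                   # triangles incident to the edge
            t = tuple(sorted((u, v, x)))
            if t not in tri:
                e1 = (min(u, x), max(u, x))
                e2 = (min(v, x), max(v, x))
                tri[t] = w ** p + weights[e1] ** p + weights[e2] ** p
    return sorted(tri.items(), key=lambda kv: kv[1], reverse=True)[:k]
```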

4.2. Weighted wedge sampling

Input: Weighted graph G, scaling p, number of iterations s, number of triangles k
1 Initialize triangle set T ← ∅
2 for iteration 1, …, s do
3     Sample a node u with probability as in Eq. 6
4     Sample a neighbor v of u with probability as in Eq. 7
5     Sample another neighbor x of u with probability as in Eq. 8
6     if the nodes u, v, x form a triangle then add {u, v, x} to T
7 end for
return the k triangles in T with largest p-mean weight.
Algorithm 4 Weighted wedge random sampling algorithm

One of the issues with the simple edge sampling scheme described above is that we have to look over the neighbors of the end points of the sampled edge in order to find triangles. This can be expensive if the degrees of these nodes are large. An alternative approach is to sample adjacent edges—also called wedges—with large weight and then check if each wedge induces a triangle. This scheme is called wedge sampling and has been used as a mechanism for estimating the total number of triangles in an unweighted graph (Seshadhri et al., 2014; Türkoglu and Turk, 2017).

The overall sampling scheme is outlined in Algorithm 4. Each iteration of Algorithm 4 proceeds in three steps. First, we sample a single node with a bias towards nodes that participate in heavily weighted edges. Specifically, let W_u denote the sum of the edge weights incident to node u. We sample a node u according to the following distribution:

(6)

where the denominator is a normalizing constant. Next, we sample a neighbor v of node u, again with a bias towards nodes that participate in heavily weighted edges. The specific distribution is

(7)

where the denominator is a normalizing constant. We have now produced a single edge (u, v) and want to produce an adjacent edge. We do this by sampling another neighbor x of u, this time with probability

(8)

where the denominator is again a normalizing constant. If the sampled wedge (v, u, x) induces a triangle, then we add it to our collection.
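As a rough structural illustration only: the sketch below follows the sample-center, sample-two-neighbors, check-closure pattern of Algorithm 4, but it does not reproduce the exact distributions of Eqs. 6–8 (which are not rendered in this copy); the bias used here is simply the p-th power of the edge weights.

```python
import random

def weighted_wedge_sampling(weights, adj, k, p, iters):
    """Structural sketch of Algorithm 4: biased center node, two biased
    neighbors, keep the wedge if it closes into a triangle."""
    def wkey(a, b):
        return (a, b) if a < b else (b, a)

    nodes = list(adj)
    node_mass = [sum(weights[wkey(u, x)] ** p for x in adj[u]) for u in nodes]

    tri = {}
    for _ in range(iters):
        u = random.choices(nodes, weights=node_mass, k=1)[0]
        nbrs = list(adj[u])
        if len(nbrs) < 2:
            continue
        nbr_mass = [weights[wkey(u, x)] ** p for x in nbrs]
        v, x = random.choices(nbrs, weights=nbr_mass, k=2)   # sampled with replacement
        if v != x and x in adj[v]:                           # the wedge closes
            t = tuple(sorted((u, v, x)))
            tri[t] = sum(weights[wkey(a, b)] ** p for a, b in ((u, v), (u, x), (v, x)))
    return sorted(tri.items(), key=lambda kv: kv[1], reverse=True)[:k]
```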

Similar to the unweighted scheme of Seshadhri et al. (Seshadhri et al., 2014), we show the following:

Proposition 4.2.

A given iteration of Algorithm 4 samples a triangle t with probability proportional to its weight w(t).

Proof.

The normalizing constants in Eqs. 8, 7 and 6 are , , and Thus, the probability of sampling wedge centered on is equal to

Thus, the probability of sampling any of the three wedges consisting of nodes and is equal to

Plugging in the expression for the triangle weight shows that this probability is proportional to w(t). Since a triangle is sampled if any of its three wedges is sampled, this yields the desired result. ∎

4.3. Weighted path sampling

Input: Weighted graph , scaling , number of iterations , number of triangles .
1 Initialize triangle set for  do
2       Sample edge with probability as in Eq. 9 Sample with probability as in Eq. 10 Sample with probability as in Eq. 11 if  then
3            
4      
5 end for
return triangles in with largest -mean weight.
Algorithm 5 Weighted path sampling algorithm

We can also design a sampling scheme based on path sampling (Jha et al., 2015). Previously, edge sampling found triangles after sampling a single edge and wedge sampling found triangles after sampling two adjacent edges; here, path sampling starts by sampling three edges that are biased towards forming a heavily weighted triangle. This sampling scheme performs poorly in practice, but we include it here for the sake of theoretical interest.

For path sampling, we first sample an edge (u, v) and then sample two more edges, one incident to u and one incident to v, producing a length-three path. A triangle is obtained when this length-three path closes into a triangle (i.e., the two sampled endpoints coincide). The sampling probabilities are as follows:

(9)
(10)
(11)

where the denominators are normalizing constants.

Under these sampling probabilities, the path sampling method (Algorithm 5) samples triangles proportional to their weight, just as in Algorithms 4 and 3.

Proposition 4.3.

A given iteration of Algorithm 5 samples a triangle t with probability proportional to its weight w(t).

Proof.

Let Z_1, Z_2, and Z_3 denote the normalization constants of the distributions in Eqs. 9, 10, and 11, respectively. We have

The probability of sampling a given triangle is then equal to

Thus, the probability of sampling a triangle is proportional to its weight. ∎

4.4. Number of samples

Propositions 4.1, 4.2 and 4.3 say that Algorithms 3, 4 and 5 tend to sample triangles with large weight. However, there is no guarantee that the top-weighted triangles are actually enumerated. Some standard probabilistic analysis can at least give us a sense of how many iterations we need. Specifically, if p_t is the probability of sampling a triangle t in a single iteration, then for any 0 < ε < 1, taking s ≥ ln(1/ε)/p_t samples guarantees that t is enumerated with probability at least 1 − ε. To see this, let q denote the probability that triangle t is sampled at least once in s samples. Using the fact that 1 − x ≤ e^(−x), we get that q = 1 − (1 − p_t)^s ≥ 1 − e^(−s·p_t) ≥ 1 − ε.

We can immediately see that the normalizing constants for wedge and path sampling drive up the number of samples required to get top weighted triangles, as compared to edge sampling. However, obtaining samples with Algorithm 3 can be much more costly if we have to find common neighbors of nodes with large degree. It is not immediately clear which algorithm should be better, but our experimental results in the next section show that edge sampling is superior in practice.

4.5. Extensions to cliques

All of our sampling algorithms can also be used to sample cliques of arbitrary size ℓ. The high-level objective of each sampling algorithm is to sample some subgraph with probability proportional to its weight. The three sampling algorithms we present are ways to sample edges, wedges, and length-3 paths, respectively. To convert one of our triangle samplers into a clique sampler, we can simply sample a subgraph with one of our three algorithms and then enumerate all cliques incident on that subgraph. For edge sampling, we can sample an edge and enumerate all ℓ-cliques incident on that edge (see the sketch below). For wedge and path sampling, we can sample wedges and paths until we sample a triangle. Once a triangle is sampled, we can then enumerate all ℓ-cliques incident on that triangle. To test this concept, we include additional experiments where we extend edge sampling into a clique sampler.
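For concreteness, a small sketch of the edge-sampling extension for ℓ = 4 (names ours): given a sampled edge, its incident 4-cliques are pairs of common neighbors that are themselves connected.

```python
from itertools import combinations

def four_cliques_on_edge(u, v, adj):
    """All 4-cliques containing the edge (u, v), given a neighbor-set map."""
    common = adj[u] & adj[v]
    cliques = []
    for x, y in combinations(sorted(common), 2):
        if y in adj[x]:
            cliques.append(tuple(sorted((u, v, x, y))))
    return cliques
```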

This natural extension to cliques is an advantage over our dynamic heavy-light algorithm from Section 3, which is not easily extended to larger cliques. However, our main focus is on triangles, for which we provide several numerical experiments in the next section. We provide some experiments for cliques in the appendix.

5. Numerical experiments

We now report the results of several experiments measuring the performance of our deterministic and random sampling algorithms compared to competitive baselines such as the static heavy-light thresholding method in Algorithm 1. Implementations of our algorithms and code to reproduce results are available at https://github.com/raunakkmr/Retrieving-top-weighted-triangles-in-graphs.

We find that edge sampling works much better than wedge sampling among our random sampling algorithms, but that our deterministic heavy-light algorithm is even faster across a wide range of datasets, outperforming the baselines by orders of magnitude in terms of running time.

dataset                | # nodes | # edges | mean edge weight | max edge weight
tags-stack-overflow    | 50K     | 4.2M    | 13               | 469
threads-stack-overflow | 2.3M    | 21M     | 1.1              | 546
Wikipedia-clickstream  | 4.4M    | 23M     | 347              | 817M
Ethereum               | 38M     | 103M    | 2.8              | 1.9M
AMiner                 | 93M     | 324M    | 1.3              | 13K
reddit-reply           | 8.4M    | 435M    | 1.5              | 165K
MAG                    | 173M    | 545M    | 1.7              | 38K
Spotify                | 3.6M    | 1.9B    | 8.6              | 2.8M
Table 1. Summary statistics of datasets.

5.1. Data

We used a number of datasets in order to test the performance of our algorithms. The datasets can be found at http://www.cs.cornell.edu/~arb/data/index.html. Table 1 lists summary statistics of the datasets, and we describe them briefly below.

tags-stack-overflow (Benson et al., 2018).  On Stack Overflow, users ask, answer, and discuss computer programming questions, and users annotate questions with 1–5 tags. The nodes in this graph correspond to tags, and the weight between two nodes is the number of questions jointly annotated by the two tags.

threads-stack-overflow (Benson et al., 2018).  This graph is constructed from a dataset of user co-participation on Stack Overflow question threads that last at most 24 hours. Nodes correspond to users and the weight of an edge is the number of times that two users appeared in one of these short question threads.

Wikipedia-clickstream (Wulczyn and Taraborelli, 2017).  This graph is derived from Wikipedia clickstream data from request logs in January 2017 that capture how users transition between articles (only transitions appearing at least 11 times were recorded). The nodes of the graph correspond to Wikipedia articles, and the weight between two nodes is the number of times users transitioned between the two pages.

Ethereum.  Ethereum is a blockchain-based computing platform for decentralized applications. Transactions are used to update state in the Ethereum network, and each transaction has a sender and a receiver address. We create a graph where the nodes are addresses and the weight between two nodes is the number of transactions between the two addresses, using all transactions on the platform up to August 17, 2018, as provided by blockchair.com.

AMiner and MAG (Tang et al., 2008; Sinha et al., 2015).  These graphs are constructed from two large bibliographic databases—AMiner and the Microsoft Academic Graph. We construct weighted co-authorship graphs, where nodes represent authors and the weight between two nodes is the number of papers they have co-authored. Papers with more than 25 authors were omitted from the graph construction.

reddit-reply (Hessel et al., 2016; Liu et al., 2019).  Users on the social media web site reddit.com interact by commenting on each other’s posts. We derive a graph from a collection of user comments. Nodes are users and the weight of an edge is the number of interactions between the two users.

Spotify.  As part of a machine learning challenge, the music streaming platform Spotify released a large number of user "listening sessions," each consisting of a set of songs. We constructed a weighted graph where the nodes represent songs and the weight of an edge is the number of times that the songs co-appeared in a session.

5.2. Algorithm benchmarking

We evaluate the performance of our proposed algorithms: (i) random edge sampling (ES) as in Algorithm 3; (ii) random wedge sampling (WS) as in Algorithm 4; (iii) random path sampling (PS) as in Algorithm 5; (iv) the static heavy-light (SHL) scheme as in Algorithm 1 (see below for how we set the threshold); (v) the dynamic heavy-light scheme (DHL) as in Algorithm 2; and (vi) auto heavy-light (Auto-HL), the distribution-oblivious adaptation of DHL that automatically adjusts edge promotion to optimize the decrease in the threshold. As a baseline, we use an optimized sequential version of NodeIterator++ (Berry et al., 2015; Suri and Vassilvitskii, 2011; Chiba and Nishizeki, 1985) and refer to this as the "brute force" approach (BF). Essentially, this algorithm iterates over vertices in decreasing order of degree, and for each vertex it only enumerates triangles that are formed by neighboring vertices with lower degree than itself. All algorithms were implemented in C++, and all experiments were executed on a 64-core 2.20 GHz Intel Xeon CPU with 200 GB of RAM. We used parallel sorting for all algorithms and parallel sampling for the random sampling algorithms; other parts of the algorithms were executed sequentially. We evaluated the algorithms for two values of k: 1,000 and 100,000. We also use the arithmetic mean (p = 1 in Eq. 1).

Recall that DHL (Algorithm 2) uses a power law model of the edge weights and sets its parameter c based on the power-law exponent. We fix a single value of c for our experiments, which works well across a range of datasets.

The random sampling algorithms are not guaranteed to enumerate all of the top-k weighted triangles. Instead, we measure the performance of these algorithms in terms of running time and accuracy (the fraction of the top-k triangles actually enumerated). We ran ES long enough for it to achieve at least 94% (k = 1,000) or 50% (k = 100,000) accuracy on all datasets. We also ran WS long enough for it to achieve at least 50% accuracy (for both values of k). However, in practice, its performance is poor, and we terminate the algorithm if it takes longer than BF to achieve this accuracy level. PS takes longer than BF to achieve this accuracy level on all datasets, so we do not include it in Table 2.

Similarly, the static heavy-light algorithm (Algorithm 1) is not guaranteed to achieve 100% accuracy, since it relies on a fixed threshold to partition the edges into heavy and light and only enumerates triangles formed by heavy edges. As the threshold decreases, a larger number of edges are labelled heavy. This increases the accuracy but also slows down the algorithm. Figure 2 illustrates this trade-off on the Ethereum dataset; a similar trend is observed on the other datasets. In our experiments, we labelled the top 10% of edges as heavy and report the achieved accuracy. As we discuss below, we find that SHL is slower and attains sub-100% accuracy in practice; improving the accuracy would only make this baseline slower.

Figure 2. Accuracy and running time as a function of the fraction of edges labeled "heavy" by the thresholding for the static heavy-light algorithm (Algorithm 1) on the Ethereum dataset for k = 1,000. As the threshold decreases, a larger percentage of edges are labelled heavy. This increases the accuracy but also increases the running time. For reasonable accuracy levels, we find that the running time is slower than our optimized dynamic heavy-light algorithm (Algorithm 2), which achieves 100% accuracy (see Table 2).
k = 1,000
dataset                | BF      | ES    | WS      | DHL   | Auto-HL | SHL     | SHL accuracy
tags-stack-overflow    | 12.71   | 0.57  | 11.28   | 0.08  | 0.09    | 0.54    | 0.99
threads-stack-overflow | 34.92   | 1.31  | 34.92   | 0.53  | 0.38    | 1.55    | 0.55
Wikipedia-clickstream  | 16.32   | 14.31 | 16.32   | 5.44  | 7.26    | 2.02    | 0.87
Ethereum               | 52.91   | 9.03  | 52.91   | 8.12  | 6.94    | 11.90   | 0.28
AMiner                 | 243.75  | 3.72  | 243.75  | 13.35 | 12.36   | 43.47   | 0.32
reddit-reply           | 4047.62 | 5.19  | 341.17  | 5.02  | 4.74    | 102.65  | 0.51
MAG                    | 512.24  | 4.92  | 48.58   | 29.19 | 20.89   | 72.49   | 0.91
Spotify                | 86400   | 60.33 | 5300    | 31.82 | 30.79   | 5388.45 | 1.00

k = 100,000
dataset                | BF      | ES    | WS      | DHL   | Auto-HL | SHL     | SHL accuracy
tags-stack-overflow    | 13.06   | 0.58  | 13.06   | 0.23  | 0.23    | 0.62    | 0.28
threads-stack-overflow | 33.99   | 1.19  | 33.99   | 1.82  | 1.73    | 1.63    | 0.32
Wikipedia-clickstream  | 17.34   | 13.64 | 17.34   | 5.49  | 7.24    | 2.15    | 0.13
Ethereum               | 57.35   | 10.03 | 57.35   | 18.11 | 19.87   | 11.70   | 0.11
AMiner                 | 245.28  | 3.45  | 245.28  | 15.38 | 13.91   | 43.28   | 0.24
reddit-reply           | 3857.57 | 6.52  | 3857.57 | 6.87  | 7.49    | 98.34   | 0.34
MAG                    | 524.80  | 4.25  | 524.80  | 29.52 | 21.37   | 75.97   | 0.10
Spotify                | 86400   | 47.27 | 5300    | 30.57 | 29.89   | 5384.17 | 0.92

Table 2. Running times (in seconds) of all of our algorithms, averaged over 10 runs. BF is brute force enumeration of triangles, which is the out-of-the-box baseline; ES is the parallel edge sampling algorithm (Algorithm 3); WS is the parallel wedge sampling algorithm (Algorithm 4); SHL is the static heavy-light threshold deterministic algorithm (Algorithm 1); DHL is the dynamic heavy-light deterministic algorithm (Algorithm 2); and Auto-HL is the distribution-oblivious modification of the dynamic heavy-light algorithm discussed in Section 3.3. ES was run to achieve 94% (k = 1,000) or 50% (k = 100,000) accuracy, while WS was run to achieve just 50% accuracy and was stopped early if it took longer than BF (or longer than SHL on the Spotify dataset). SHL is also an approximation, and we report its accuracy in the final column. Overall, our deterministic algorithms (DHL and Auto-HL) are fast and achieve 100% accuracy; in some cases, our ES technique is slightly faster, but it is approximate.

Table 2 shows the running times of all of our algorithms. BF did not terminate on Spotify after 24 hours, so running times for this baseline are not available on that dataset. We highlight a few important findings. First, our deterministic methods DHL and Auto-HL excel at retrieving the top triangles. They achieve perfect accuracy while running orders of magnitude faster than BF. For instance, these algorithms get a roughly 1000x speedup on the reddit-reply dataset and more than a 2000x speedup on the Spotify dataset. These algorithms also outperform SHL by a significant margin in terms of both time and accuracy. For example, despite being 30x slower on reddit-reply, SHL only achieves 50% accuracy. Again, our deterministic algorithms always achieve 100% accuracy, and do so in a fraction of the time taken by the baseline methods BF and SHL.

Among the sampling algorithms, ES performs much better than WS. WS struggles to achieve high accuracy and is not competitive with the BF baseline or SHL. On the other hand, ES is quite competitive with even DHL and Auto-HL. ES retrieves the top 1000 triangles on the AMiner and MAG datasets with 99% accuracy at speedups of 2x to 4x over DHL and Auto-HL. A similar speedup is observed for k = 100,000, but ES only achieves 50% accuracy in these cases. Even though ES works well in these cases, our deterministic algorithms are still competitive; we conclude that intelligent deterministic approaches work extremely well for finding top-weighted triangles in large weighted graphs.

All of our algorithms except BF and WS use a pre-processing step of sorting edges by weight. Surprisingly, we find that this pre-processing step is the bottleneck in our computations, and sorting in parallel is crucial to achieving high performance. In turn, this negates the possible benefit of parallel sampling for the randomized algorithms over our deterministic methods, whose main routines are inherently sequential.

6. Discussion

Subgraph patterns, and in particular, triangles, have been used extensively in graph mining applications. However, most of the existing literature focuses on counting or enumeration tasks in unweighted graphs. In this paper, we developed deterministic and random sampling algorithms for finding the heaviest triangles in large weighted graphs. With some tuning, our main deterministic algorithm can find these triangles in a few seconds on graphs with hundreds of millions of edges or in 30 seconds on a graph with billions of edges. This is orders of magnitude faster than what one could achieve with existing fast enumeration schemes and is usually much faster than even our randomized sampling algorithms.

We anticipate that our work will enable scientists to better explore large-scale weighted graphs, and we also expect that our work will spur new algorithmic developments on subgraph listing and counting in weighted graphs. For example, an interesting avenue for future research would be the development of random sampling algorithms that sample triangles with probability proportional to some arbitrary function of their weight, chosen to converge to the top-weighted triangles faster. This could make random sampling approaches competitive with our fast deterministic methods. The edge sampling method can also be generalized to ℓ-clique sampling by sampling an edge and then enumerating adjacent ℓ-cliques. How to extend the deterministic algorithms to top-weighted ℓ-clique enumeration is less clear, so sampling may be more appropriate for larger clique patterns.

Acknowledgements

This research was supported in part by NSF Award DMS-1830274, ARO Award W911NF19-1-0057, an ARO MURI award, NSF grant CCF-1617577, a Simons Investigator Award, a Google Faculty Research Award, an Amazon Research Award, and Google Cloud resources.

References

  • Ahmed et al. (2014) Nesreen K. Ahmed, Nick Duffield, Jennifer Neville, and Ramana Kompella. 2014. Graph sample and hold: a framework for big-graph analytics. In KDD.
  • Alon et al. (1997) Noga Alon, Raphael Yuster, and Uri Zwick. 1997. Finding and Counting Given Length Cycles. Algorithmica 17, 3 (1997), 209–223.
  • Arifuzzaman et al. (2013) Shaikh Arifuzzaman, Maleq Khan, and Madhav V. Marathe. 2013. PATRIC: a parallel algorithm for counting triangles in massive networks. In ICDM. 529–538.
  • Avron (2010) Haim Avron. 2010. Counting triangles in large graphs using randomized matrix trace estimation. In Workshop on Large-scale Data Mining: Theory and Applications.
  • Barrat et al. (2004) Alain Barrat, Marc Barthelemy, Romualdo Pastor-Satorras, and Alessandro Vespignani. 2004. The architecture of complex weighted networks. Proceedings of the national academy of sciences 101, 11 (2004), 3747–3752.
  • Benson et al. (2018) Austin R. Benson, Rediet Abebe, Michael T. Schaub, Ali Jadbabaie, and Jon Kleinberg. 2018. Simplicial closure and higher-order link prediction. Proceedings of the National Academy of Sciences 115, 48 (2018), E11221–E11230.
  • Benson et al. (2016) Austin R. Benson, David F. Gleich, and Jure Leskovec. 2016. Higher-order organization of complex networks. Science 353, 6295 (2016), 163–166.
  • Berry et al. (2015) Jonathan W. Berry, Luke A. Fostvedt, Daniel J. Nordman, Cynthia A. Phillips, C. Seshadhri, and Alyson G. Wilson. 2015. Why Do Simple Algorithms for Triangle Enumeration Work in the Real World? Internet Mathematics 11, 6 (2015), 555–571.
  • Berry et al. (2011) Jonathan W Berry, Bruce Hendrickson, Randall A LaViolette, and Cynthia A Phillips. 2011. Tolerating the community detection resolution limit with edge weighting. Physical Review E 83, 5 (2011), 056119.
  • Berry et al. (2010) Jonathan W Berry, Daniel J Nordman, Cynthia A Phillips, and Alyson G Wilson. 2010. Listing triangles in expected linear time on a class of power law graphs. Technical Report. Technical report, Sandia National Laboratories.
  • Bressan et al. (2017) Marco Bressan, Flavio Chierichetti, Ravi Kumar, Stefano Leucci, and Alessandro Panconesi. 2017. Counting Graphlets: Space vs Time. In WSDM. ACM.
  • Burt (2007) Ronald S Burt. 2007. Secondhand brokerage: Evidence on the importance of local structure for managers, bankers, and analysts. Academy of Management Journal 50, 1 (2007), 119–148.
  • Chiba and Nishizeki (1985) Norishige Chiba and Takao Nishizeki. 1985. Arboricity and Subgraph Listing Algorithms. SIAM J. Comput. 14, 1 (1985), 210–223.
  • Chu and Cheng (2011) Shumo Chu and James Cheng. 2011. Triangle listing in massive networks and its applications. In KDD. 672–680.
  • Contractor et al. (2006) Noshir S Contractor, Stanley Wasserman, and Katherine Faust. 2006. Testing multitheoretical, multilevel hypotheses about organizational networks: An analytic framework and empirical example. Academy of Management Review 31, 3 (2006), 681–703.
  • Danisch et al. (2018) Maximilien Danisch, Oana Denisa Balalau, and Mauro Sozio. 2018. Listing k-cliques in Sparse Real-World Graphs. In WWW. 589–598.
  • Durak et al. (2012) Nurcan Durak, Ali Pinar, Tamara G Kolda, and C Seshadhri. 2012. Degree relations of triangles in real-world networks and graph models. In ACM CIKM. 1712–1716.
  • Eden et al. (2017) Talya Eden, Amit Levi, Dana Ron, and C. Seshadhri. 2017. Approximately Counting Triangles in Sublinear Time. SIAM J. Comput. 46, 5 (jan 2017), 1603–1646.
  • Etemadi et al. (2016) Roohollah Etemadi, Jianguo Lu, and Yung H. Tsin. 2016. Efficient Estimation of Triangles in Very Large Graphs. In CIKM. 1251–1260.
  • Gleich and Seshadhri (2012) David F Gleich and C Seshadhri. 2012. Vertex neighborhoods, low conductance cuts, and good seeds for local community methods. In KDD. ACM, 597–605.
  • Hasan and Dave (2018) Mohammad Al Hasan and Vachik S. Dave. 2018. Triangle counting in large networks: a review. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 8, 2 (2018).
  • Henderson et al. (2012) Keith Henderson, Brian Gallagher, Tina Eliassi-Rad, Hanghang Tong, Sugato Basu, Leman Akoglu, Danai Koutra, Christos Faloutsos, and Lei Li. 2012. RolX: structural role extraction & mining in large graphs. In KDD.
  • Hessel et al. (2016) Jack Hessel, Chenhao Tan, and Lillian Lee. 2016. Science, AskScience, and BadScience: On the Coexistence of Highly Related Communities. In ICWSM. 171–180.
  • Jain and Seshadhri (2017) Shweta Jain and C. Seshadhri. 2017. A Fast and Provable Method for Estimating Clique Counts Using Turán’s Theorem. In WWW. 441–449.
  • Jha et al. (2015) Madhav Jha, C. Seshadhri, and Ali Pinar. 2015. Path Sampling: A Fast and Provable Method for Estimating 4-Vertex Subgraph Counts. In WWW. 495–505.
  • Jia et al. (2019) Junteng Jia, Michael T. Schaub, Santiago Segarra, and Austin R. Benson. 2019. Graph-based Semi-Supervised & Active Learning for Edge Flows. In KDD. ACM.
  • Kolountzakis et al. (2012) Mihail N Kolountzakis, Gary L Miller, Richard Peng, and Charalampos E Tsourakakis. 2012. Efficient triangle counting in large graphs via degree-based vertex partitioning. Internet Mathematics 8, 1-2 (2012), 161–185.
  • Latapy (2008) Matthieu Latapy. 2008. Main-memory triangle computations for very large (sparse (power-law)) graphs. Theor. Comput. Sci. 407, 1-3 (2008), 458–473.
  • Lawrence (2006) Barbara S Lawrence. 2006. Organizational reference groups: A missing perspective on social context. Organization Science 17, 1 (2006), 80–100.
  • Liu et al. (2019) Paul Liu, Austin R. Benson, and Moses Charikar. 2019. Sampling Methods for Counting Temporal Motifs. In WSDM. ACM, 294–302.
  • Mahadevan et al. (2007) Priya Mahadevan, Calvin Hubble, Dmitri Krioukov, Bradley Huffaker, and Amin Vahdat. 2007. Orbis: rescaling degree correlations to generate annotated internet topologies. In ACM SIGCOMM Computer Communication Review, Vol. 37. ACM, 325–336.
  • Milo et al. (2004) Ron Milo, Shalev Itzkovitz, Nadav Kashtan, Reuven Levitt, Shai Shen-Orr, Inbal Ayzenshtat, Michal Sheffer, and Uri Alon. 2004. Superfamilies of Evolved and Designed Networks. Science (2004).
  • Milo et al. (2002) Ron Milo, Shai Shen-Orr, Shalev Itzkovitz, Nadav Kashtan, Dmitri Chklovskii, and Uri Alon. 2002. Network motifs: simple building blocks of complex networks. Science 298, 5594 (2002), 824–827.
  • Newman et al. (2001) Mark EJ Newman, Steven H Strogatz, and Duncan J Watts. 2001. Random graphs with arbitrary degree distributions and their applications. Physical review E 64, 2 (2001), 026118.
  • Onnela et al. (2005a) Jukka-Pekka Onnela, Jari Saramäki, János Kertész, and Kimmo Kaski. 2005a. Characterizing Motifs in Weighted Complex Networks. In AIP. AIP.
  • Onnela et al. (2005b) Jukka-Pekka Onnela, Jari Saramäki, János Kertész, and Kimmo Kaski. 2005b. Intensity and coherence of motifs in weighted complex networks. Phys. Rev. E 71 (Jun 2005), 065103. Issue 6.
  • Opsahl and Panzarasa (2009) Tore Opsahl and Pietro Panzarasa. 2009. Clustering in weighted networks. Social networks 31, 2 (2009), 155–163.
  • Pagh and Tsourakakis (2012a) Rasmus Pagh and Charalampos E. Tsourakakis. 2012a. Colorful triangle counting and a MapReduce implementation. Inf. Process. Lett. 112, 7 (2012), 277–281.
  • Pagh and Tsourakakis (2012b) Rasmus Pagh and Charalampos E Tsourakakis. 2012b. Colorful triangle counting and a mapreduce implementation. Inform. Process. Lett. 112, 7 (2012), 277–281.
  • Pržulj (2007) Nataša Pržulj. 2007. Biological network comparison using graphlet degree distribution. Bioinformatics 23, 2 (2007), e177–e183.
  • Rahman and Hasan (2013) Mahmudur Rahman and Mohammad Al Hasan. 2013. Approximate triangle counting algorithms on multi-cores. In IEEE Big Data. 127–133.
  • Rahman and Hasan (2014) Mahmudur Rahman and Mohammad Al Hasan. 2014. Sampling Triples from Restricted Networks using MCMC Strategy. In CIKM. 1519–1528.
  • Robins et al. (2007) Garry Robins, Pip Pattison, Yuval Kalish, and Dean Lusher. 2007. An introduction to exponential random graph models for social networks. Social Networks 29, 2 (May 2007), 173–191.
  • Robles et al. (2016) Pablo Robles, Sebastian Moreno, and Jennifer Neville. 2016. Sampling of attributed networks from hierarchical generative models. In KDD. ACM, 1155–1164.
  • Rohe and Qin (2013) Karl Rohe and Tai Qin. 2013. The blessing of transitivity in sparse and stochastic networks. arXiv (2013).
  • Rossi and Ahmed (2015) Ryan A. Rossi and Nesreen K. Ahmed. 2015. Role Discovery in Networks. IEEE Transactions on Knowledge and Data Engineering 27, 4 (apr 2015), 1112–1131.
  • Schank and Wagner (2005) Thomas Schank and Dorothea Wagner. 2005. Approximating Clustering Coefficient and Transitivity. J. Graph Algorithms Appl. 9, 2 (2005), 265–275.
  • Seshadhri et al. (2013) C. Seshadhri, Ali Pinar, and Tamara G. Kolda. 2013. Triadic Measures on Graphs: The Power of Wedge Sampling. In Proceedings of SDM.
  • Seshadhri et al. (2014) C. Seshadhri, Ali Pinar, and Tamara G. Kolda. 2014. Wedge sampling for computing clustering coefficients and triangle counts on large graphs. Statistical Analysis and Data Mining 7, 4 (2014), 294–307.
  • Sinha et al. (2015) Arnab Sinha, Zhihong Shen, Yang Song, Hao Ma, Darrin Eide, Bo-June (Paul) Hsu, and Kuansan Wang. 2015. An Overview of Microsoft Academic Service (MAS) and Applications. In WWW '15 Companion. ACM.
  • Soufiani and Airoldi (2012) Hossein Azari Soufiani and Edo Airoldi. 2012. Graphlet decomposition of a weighted network. In AISTATS. 54–63.
  • Stefani et al. (2017a) Lorenzo De Stefani, Alessandro Epasto, Matteo Riondato, and Eli Upfal. 2017a. TRIÈST: Counting Local and Global Triangles in Fully Dynamic Streams with Fixed Memory Size. ACM TKDD 11, 4 (jun 2017), 1–50.
  • Stefani et al. (2017b) Lorenzo De Stefani, Erisa Terolli, and Eli Upfal. 2017b. Tiered sampling: An efficient method for approximate counting sparse motifs in massive graph streams. In IEEE BigData 2017. 776–786.
  • Suri and Vassilvitskii (2011) Siddharth Suri and Sergei Vassilvitskii. 2011. Counting triangles and the curse of the last reducer. In WWW. 607–614.
  • Tang et al. (2008) Jie Tang, Jing Zhang, Limin Yao, Juanzi Li, Li Zhang, and Zhong Su. 2008. ArnetMiner. In KDD. ACM.
  • Tsourakakis (2008) Charalampos E. Tsourakakis. 2008. Fast Counting of Triangles in Large Real Networks without Counting: Algorithms and Laws. In ICDM. 608–617.
  • Tsourakakis et al. (2009) Charalampos E Tsourakakis, U Kang, Gary L Miller, and Christos Faloutsos. 2009. Doulion: counting triangles in massive graphs with a coin. In KDD. ACM, 837–846.
  • Tsourakakis et al. (2011) Charalampos E Tsourakakis, Mihail N Kolountzakis, and Gary L Miller. 2011. Triangle Sparsifiers. J. Graph Algorithms Appl. 15, 6 (2011), 703–726.
  • Türkoglu and Turk (2017) Duru Türkoglu and Ata Turk. 2017. Edge-Based Wedge Sampling to Estimate Triangle Counts in Very Large Graphs. In IEEE ICDM. IEEE, 455–464.
  • Wasserman and Faust (1994) Stanley Wasserman and Katherine Faust. 1994. Social network analysis: Methods and applications. Vol. 8. Cambridge university press.
  • Welles et al. (2010) Brooke Foucault Welles, Anne Van Devender, and Noshir Contractor. 2010. Is a "friend" a friend? Investigating the structure of friendship networks in virtual worlds. In CHI 2010. 4027–4032.
  • Wulczyn and Taraborelli (2017) Ellery Wulczyn and Dario Taraborelli. 2017. Wikipedia Clickstream.
  • Xu et al. (2014) Kuai Xu, Feng Wang, and Lin Gu. 2014. Behavior Analysis of Internet Traffic via Bipartite Graphs and One-Mode Projections. IEEE/ACM Transactions on Networking 22, 3 (June 2014), 931–942.
  • Yin et al. (2019) Hao Yin, Austin R. Benson, and Jure Leskovec. 2019. The Local Closure Coefficient: A New Perspective On Network Clustering. In WSDM. 303–311.

Appendix A Additional Numerical Experiments

While our methods focused on finding top weighted triangles, some of the sampling methods naturally extend to larger cliques. In this case, the weight of a clique is a generalized p-mean of the weights of the edges in the clique. We found that the extension of edge sampling, as described in Section 4.5, performed best in practice, and we compare the performance of the edge sampling algorithm to an intelligent brute force approach for finding top weighted 4-cliques and 5-cliques, which enumerates all cliques using the algorithm of Chiba and Nishizeki (Chiba and Nishizeki, 1985). Our main finding is that edge sampling can approximately retrieve the top weighted cliques much faster; however, its performance shows higher variance than in the triangle case.
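
As a concrete illustration, the following sketch computes the generalized p-mean weight of a candidate clique from its edge weights. The function name and the dictionary-based weight lookup are illustrative assumptions, not the paper's implementation.

```python
from itertools import combinations

def clique_weight(nodes, edge_weight, p=1.0):
    """Generalized p-mean of the edge weights of the clique on `nodes`.

    `edge_weight` maps a sorted node pair (u, v) to its weight; p = 1 gives
    the arithmetic mean, and p -> 0 approaches the geometric mean.
    """
    weights = [edge_weight[tuple(sorted(pair))] for pair in combinations(nodes, 2)]
    return (sum(w ** p for w in weights) / len(weights)) ** (1.0 / p)
```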

Since brute force enumeration of even 4-cliques is computationally expensive, we use smaller datasets for these experiments than the ones in Table 1. We construct weighted graphs from 5 temporal hypergraph datasets (Benson et al., 2018), where the weight of an edge (i, j) is the number of hyperedges that contain both nodes i and j. Table 3 shows summary statistics for the data.
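
A minimal sketch of this construction is below, assuming the hypergraph is given as a list of hyperedges (node sets); the pair-counting representation is our own illustration, not the released data format.

```python
from collections import Counter
from itertools import combinations

def weighted_projection(hyperedges):
    """Project a hypergraph onto a weighted graph: the weight of edge (i, j)
    is the number of hyperedges containing both i and j."""
    weight = Counter()
    for he in hyperedges:
        for i, j in combinations(sorted(set(he)), 2):
            weight[(i, j)] += 1
    return weight

# Example: node pair (2, 3) appears in all three hyperedges, so its weight is 3.
print(weighted_projection([{1, 2, 3}, {2, 3}, {2, 3, 4}]))
```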

dataset                   # nodes   # edges   mean edge weight   max edge weight
email-Enron               143       1800      16                 819
email-Eu                  979       29K       25                 48K
contact-high-school       327       5.8K      32                 29K
contact-primary-school    242       8.3K      15                 780
tags-math-sx              1.6K      91K       17                 16K

Table 3. Summary statistics of datasets used in clique sampling experiments.

We evaluate the performance of our proposed algorithm, random edge sampling (ES), as in Algorithm 3 with the modifications mentioned in Section 4.5. As a baseline, we use an optimized sequential clique enumerator, which we refer to as the "brute force" approach (BF). The rest of the experimental setup is the same as in Section 5. We evaluated the algorithm for two values of k (k = 1000 and k = 100,000, as in Table 4) and two clique sizes (4 and 5), using the arithmetic mean as the clique weight (p = 1 in Eq. 5).
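
For reference, the sketch below conveys the general flavor of weighted edge sampling for clique retrieval: edges are sampled with probability proportional to their weight raised to the power p, cliques are then searched for among the sampled endpoints, and the heaviest ones are retained. The sampling loop, stopping rule, and clique check are simplified placeholders, not the exact Algorithm 3.

```python
import heapq
import random
from itertools import combinations

def sample_top_cliques(edge_weight, clique_size, k, num_samples, p=1.0, seed=0):
    """Heuristic sketch: sample edges proportional to w**p, then brute-force
    cliques among the sampled endpoints, ranked by arithmetic-mean edge weight."""
    rng = random.Random(seed)
    edges = list(edge_weight.items())          # edge_weight maps sorted (u, v) -> w
    total = sum(w ** p for _, w in edges)
    sampled_nodes = set()
    for _ in range(num_samples):
        r, acc = rng.uniform(0, total), 0.0
        for (u, v), w in edges:                # linear scan; fine for a sketch
            acc += w ** p
            if acc >= r:
                sampled_nodes.update((u, v))
                break
    top = []                                   # min-heap of the k heaviest cliques
    for nodes in combinations(sorted(sampled_nodes), clique_size):
        pairs = list(combinations(nodes, 2))
        if all(pair in edge_weight for pair in pairs):
            w = sum(edge_weight[pair] for pair in pairs) / len(pairs)
            heapq.heappush(top, (w, nodes))
            if len(top) > k:
                heapq.heappop(top)
    return sorted(top, reverse=True)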

The random edge sampling algorithm is not guaranteed to enumerate all of the top-k weighted cliques. Instead, we measure its performance in terms of running time and accuracy (the fraction of the top-k cliques actually enumerated). We ran ES long enough for it to achieve at least 50% accuracy (clique size 4) or at least 40% accuracy (clique size 5) on all datasets.
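
The accuracy measure is simply the overlap with the true top-k set; a small check along these lines (function name ours) suffices:

```python
def top_k_accuracy(retrieved, true_top_k):
    """Fraction of the true top-k weighted cliques that the sampler returned."""
    return len(set(retrieved) & set(true_top_k)) / len(true_top_k)
```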

Table 4 shows the running times of BF and ES. We observe substantial speedups for ES on all datasets; for example, we see speedups of at least 14x on the email-Eu dataset for both 4-cliques and 5-cliques, although ES is only required to reach 40 to 50% accuracy in these cases. We find that, unlike for top weighted triangles, the performance of ES for top weighted cliques has higher variance across runs. In some cases, ES can even become more expensive than the brute force approach (e.g., 5-cliques on the tags-math-sx dataset); a better understanding of this variance remains an open question.

                                   4-cliques         5-cliques
k         dataset                  BF       ES       BF       ES
1000      email-Enron              1.13     0.05     1.1      0.3
          email-Eu                 9.17     0.4      83       5.8
          contact-high-school      1.32     0.08     1.24     0.5
          contact-primary-school   1.63     0.3      9        2.5
          tags-math-sx             398      4.5      9340     9340
100000    email-Enron              1.13     0.05     1.1      0.3
          email-Eu                 9.17     0.45     83       5.8
          contact-high-school      1.32     0.08     1.24     0.5
          contact-primary-school   1.63     0.2      9        2.5
          tags-math-sx             398      4.5      9340     9340
Table 4. Running times in seconds for brute force enumeration (BF) and parallel edge sampling (ES), averaged over 10 runs, for 4-cliques and 5-cliques. ES was run long enough to achieve 50% (4-cliques) and 40% (5-cliques) accuracy for k = 1000 and k = 100,000.