DegreeSketch: Distributed Cardinality Sketches on Massive Graphs with Applications

04/08/2020 ∙ by Benjamin W. Priest, et al. ∙ Lawrence Livermore National Laboratory

We present DegreeSketch, a semi-streaming distributed sketch data structure, and demonstrate its utility for estimating local neighborhood sizes and local triangle count heavy hitters on massive graphs. DegreeSketch consists of vertex-centric cardinality sketches distributed across a set of processors that are accumulated in a single pass, after which it behaves as a persistent query engine capable of approximately answering graph queries pertaining to the sizes of adjacency set unions and intersections. The tth local neighborhood of a vertex v is the number of vertices reachable from v in G by traversing at most t edges, whereas its local triangle count is the number of 3-cycles in which it is included. Both metrics are useful in graph analysis applications, but exact computations scale poorly as graph sizes grow. We present efficient algorithms for estimating both local neighborhood sizes and local triangle count heavy hitters using DegreeSketch. In our experiments we implement DegreeSketch using the celebrated HyperLogLog cardinality sketch and utilize the distributed communication tool YGM to achieve state-of-the-art performance in distributed memory.


1 Introduction and Related Work

As graph dataset scales continue to grow across applications, basic queries are becoming increasingly difficult to answer. How connected are the proteins in an interaction network in aggregate? Which hyperlinks shortcut large numbers of possible intermediate webpages? How many friends of friends of friends does a particular profile in a social network have? Many such queries amount to reasoning about the unions and intersections of the neighbor sets of the vertices in a graph. However, answering such queries exactly typically incurs compute time and communication superlinear in the graph size in distributed implementations, which is untenable for massive graphs.

Furthermore, the simple storage of large graphs can become burdensome, particularly as scale-free graphs include vertices whose degree is linear in the size of the graph. Communicating neighborhood set information about such vertices is likewise impractical. It is therefore tempting to consider schemata for sublinearly summarizing the information contained in vertex adjacency sets, and estimating unions and intersections. It is known that any data structure that deterministically provides relative error guarantees for the cardinality of a multiset with n unique elements requires Ω(n) space (alon1999space). Consequently, investigators have developed many so-called cardinality sketches that provide such relative error guarantees in sublinear space while admitting a small probability of failure, such as PCSA (flajolet1985probabilistic), MinCount (bar2002counting), LogLog (durand2003loglog), Multiresolution Bitmap (estan2003bitmap), HyperLogLog (flajolet2007hyperloglog), and the space-optimal solution of (kane2010optimal). While all of these cardinality sketches have a natural union operation that allows one to combine the sketches of two multisets into a sketch of their union, most have no closed intersection operation. Many, however, admit a heuristic intersection estimator, which cannot obtain bounded error due to known lower bounds.

We present the DegreeSketch data structure, which maintains a cardinality sketch for each vertex, distributed over a set of processors. These sketches accumulate in a single pass over a data stream describing the graph, and use a total amount of space polyloglinear in the number of vertices; i.e., DegreeSketch is a semi-streaming data structure. We demonstrate its utility with distributed algorithms that estimate local t-neighborhood sizes, as well as edge- and vertex-local triangle count heavy hitters.

The local t-neighborhood of a vertex v is the number of vertices that can be reached from v in at most t hops. As t increases, the t-neighborhoods describe how the “ball” around the vertex grows. Knowledge of these ball sizes can be useful for applications such as edge prediction in social networks (gupta2013wtf) and probabilistic distance calculations (boldi2011hyperanf; myers2014information). For example, knowing the 3-neighborhood size of a user profile in a social network predicts the cost of performing a computation over the set of its friends of friends of friends. The ANF (palmer2002anf) and HyperANF (boldi2011hyperanf) algorithms estimate the neighborhood function, the “average ball” around vertices in a graph, by individually estimating and summing the local t-estimates for all vertices using Flajolet-Martin and HyperLogLog cardinality sketches, respectively. We present an algorithm producing an estimate of a similar form, but its distributed implementation allows it to scale to much larger graphs. Moreover, DegreeSketch is a leave-behind, reusable data structure.

Counting the number of triangles in simple graphs is a canonical problem in network science. A “triangle” is a trio of co-adjacent vertices, and is the smallest nontrivial community structure. Both the global count of triangles and the vertex-local counts, i.e. the number of triangles incident upon each vertex, are key to network analysis and graph theory topics such as cohesiveness (lim2015mascot), global and local clustering coefficients (tsourakakis2008fast), and trusses (cohen2008trusses). Local triangle counts are useful in protein interaction analysis (milo2002network), spam detection (becchetti2010efficient), and community discovery (wang2010triangulation; berry2011tolerating).

Although many exact algorithms have been proposed for the triangle counting problem (tsourakakis2008fast; becchetti2010efficient; chu2011triangle; suri2011counting; wolf2017fast), their time complexity is superlinear in the number of edges m. In order to avoid this dreaded superlinear scaling for applications involving large graphs, many researchers have turned to streaming approximations. These serial streaming algorithms maintain a limited number of sampled edges from an edge stream. Streaming global triangle estimation algorithms have arisen that sample edges with equal probability (tsourakakis2009doulion), sample edges with probability relative to counted adjacent sampled edges and incident triangles (ahmed2017sampling), and sample edges along with paths of length two (jha2013space). The first proposed semi-streaming local triangle estimation algorithm relies upon min-wise independent permutations accumulated over a logarithmic number of passes (becchetti2008efficient). More recently, true single-pass sampling algorithms have arisen, such as Mascot (lim2015mascot) and Triést (stefani2017triest).

While many distributed global and vertex-local triangle counting algorithms have been proposed, the overwhelming majority store the graph in distributed memory and return exact solutions (suri2011counting; arifuzzaman2013patric; pearce2017triangle). Recently, the study of distributed streaming vertex-local triangle counting was initiated in earnest with the presentation of Tri-Fly (shin2018tri), a parallelized generalization of Triést, and its follow-up DiSLR, which introduced limited edge redundancy (shin2018dislr). Our approach is fundamentally different from these methods, depending upon sketching rather than sampling as its core primitive. DegreeSketch also permits the estimation of edge-local triangle counts, or the number of triangles in which individual edges participate. While the sampling approaches produce estimates and a stochastically sampled sparse graph, we produce a leave-behind queryable data structure.

We begin with a discussion of preliminaries and notation in Section 2. Section 3 introduces DegreeSketch and describes algorithms that utilize it to perform local neighborhood size estimation and to approximately recover edge- and vertex-local triangle count heavy hitters. Section 4 describes the details of a particular implementation of DegreeSketch using HyperLogLog cardinality sketches. We conclude with experiments in Section 5.

2 Preliminaries and Notation

Throughout this document we will consider an undirected graph G = (V, E), which we assume to be large. We adopt the usual conventions that n = |V| and m = |E|. Let d_G(u, v) be the length of the shortest path in G between u and v, let Γ(v) be the adjacency set of v ∈ V, and let d_v = |Γ(v)| be the degree of v. We will consider a universe of processors P, and further assume a partition function π : V → P assigning vertices to processors. We will occasionally abuse notation and use π⁻¹(p) to describe the set of vertices that map to processor p. We make no assumptions about the particulars of π, noting that vertex partitioning is a subject of intense academic scrutiny. In effect, our algorithms are designed to work alongside any reasonable π. In our algorithms we will occasionally use the keyword Reduce to refer to a global sum, except where the operand is a max heap, in which case it is the creation of a global max heap.

We assume that G is given by a data stream ℰ, a sequential list of the edges of G. ℰ is further partitioned by some unknown means into substreams ℰ_p, one to be read by each processor p ∈ P. We assume that each processor p has send and receive buffers Q_s[p] and Q_r[p], respectively. Algorithms are given broken up into Send, Receive, and local Computation Contexts. We make no assumptions as to how processors handle switching between contexts. In our implementations we use the software package YGM (priest2019you) to manage the send and receive buffers, as well as context switching, in a manner that is opaque to the client algorithm.

Consider a vertex v ∈ V. For t ≥ 1, let n_t(v) be the local t-neighborhood size of v, defined as

n_t(v) = |{u ∈ V : d_G(u, v) ≤ t}|.    (1)

Moreover, let the global t-neighborhood be defined by

n_t = Σ_{v ∈ V} n_t(v).    (2)

We define the edge-local triangle count of an edge (u, v) ∈ E as

Δ_uv = |Γ(u) ∩ Γ(v)|.    (3)

This allows us to define the more conventional notions of triangle counting. We define the vertex-local triangle count of v as

Δ_v = |{{u, w} ⊆ Γ(v) : (u, w) ∈ E}|.    (4)

We can equivalently define vertex-local triangle counts in terms of edge-local triangle counts as

Δ_v = (1/2) Σ_{u ∈ Γ(v)} Δ_uv.    (5)

We will also refer to the global number of triangles in a graph,

Δ = (1/3) Σ_{v ∈ V} Δ_v = (1/3) Σ_{(u,v) ∈ E} Δ_uv.    (6)

We will drop the subscripts where they are clear from context.

As we have stated, DegreeSketch consists of a set of cardinality sketches. Many such sketches would suffice to implement DegreeSketch, so for the purposes of discussion we will abstract many of the particulars until Section 4. We will assume a notional sketch requiring space sublinear in n and parameterized by an accuracy parameter ε. We assume that this sketch supports an Insert operation to add elements and admits an Estimate operator, providing an approximation that is within a multiplicative factor (1 ± ε) of the number of unique inserted items with high probability. We also assume that the sketch affords a closed Merge operator to combine sketches, and an Intersect operator to estimate intersection cardinalities. For reasons that will be described in Section 4.1, we do not assume that Intersect has the same error properties as Estimate.
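To make these assumptions concrete, the following C++ stand-in fixes the semantics of the four assumed operations. It is a toy that stores elements exactly (so it is not sublinear), and the name NotionalSketch and its members are ours; any real cardinality sketch satisfying the stated guarantees, such as the HyperLogLog sketch of Section 4, implements the same interface.

// Toy stand-in for the notional cardinality sketch assumed in this
// section. It stores elements exactly and is therefore NOT sublinear;
// it only fixes the semantics of Insert/Estimate/Merge/Intersect.
#include <cstdint>
#include <unordered_set>

struct NotionalSketch {
  std::unordered_set<uint64_t> elems;  // a real sketch keeps O(2^b) registers

  void Insert(uint64_t x) { elems.insert(x); }      // add an element
  double Estimate() const { return elems.size(); }  // (1 +/- eps)-approximation
  void Merge(const NotionalSketch& o) {             // closed union operator
    elems.insert(o.elems.begin(), o.elems.end());
  }
  // Heuristic in real sketches: no bounded-error guarantee, especially
  // for small intersections (Section 4.1, Appendix B).
  static double Intersect(const NotionalSketch& a, const NotionalSketch& b) {
    double n = 0;
    for (uint64_t x : a.elems) n += b.elems.count(x);
    return n;
  }
};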

3 DegreeSketch

DegreeSketch maintains a distributed data structure that can be queried for an estimate of a vertex’s degree. For each v ∈ V we maintain a cardinality sketch S_v of the adjacency set Γ(v), which affords the approximation of d_v. We will assume throughout that π(v) is the processor that holds S_v.

Algorithm 1 describes the distributed accumulation of a DegreeSketch instance D on a universe of processors P. In a distributed pass over the partitioned stream, processors use the partition function π to send edges to the cognizant processors for each endpoint. These processors each maintain cardinality sketches for their assigned vertices. When p = π(u) receives an edge (u, v), it performs Insert(S_u, v). Once all processors are done reading and communicating, D = {S_v : v ∈ V} is accumulated.

Input: ℰ - edge stream divided into substreams {ℰ_p}

P - universe of processors

Q_s - dictionary mapping P to send queues

Q_r - dictionary mapping P to receive queues

π - partition mapping

Returns: D - accumulated DegreeSketch

1:Send Context for p ∈ P:
2:while Q_s[p] is not empty do
3:     (q, (u, v)) ← pop(Q_s[p])
4:     push (u, v) to Q_r[q]
5:Receive Context for p ∈ P:
6:while Q_r[p] is not empty do
7:     (u, v) ← pop(Q_r[p])
8:     if u ∉ D_p then D_p[u] ← empty sketch
9:     Insert(D_p[u], v)
10:Computation Context for p ∈ P:
11:D_p ← empty DegreeSketch dictionary
12:while ℰ_p has unread element (u, v) do
13:     push (π(u), (u, v)) to Q_s[p]
14:     push (π(v), (v, u)) to Q_s[p]
15:return D = ⋃_p D_p
Algorithm 1 Accumulation
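Ignoring distribution and message passing, the accumulation semantics of Algorithm 1 reduce to a short loop. The following serial C++ sketch, reusing the hypothetical NotionalSketch type above, shows the invariant the distributed version maintains: after the pass, D[v] is a sketch of Γ(v) for every vertex v seen in the stream.

// Serial illustration of Algorithm 1's accumulation. In the distributed
// version, the insert for endpoint u happens on processor pi(u) after
// the edge is forwarded there.
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

using Edge = std::pair<uint64_t, uint64_t>;

std::unordered_map<uint64_t, NotionalSketch>
AccumulateDegreeSketch(const std::vector<Edge>& stream) {
  std::unordered_map<uint64_t, NotionalSketch> D;
  for (const auto& [u, v] : stream) {
    D[u].Insert(v);  // the sketch of Gamma(u) gains v
    D[v].Insert(u);  // and vice versa, since the edge is undirected
  }
  return D;
}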

DegreeSketch can be implemented with any cardinality sketch that admits some form of closed union operator and intersection estimation. In fact, the algorithms in Sections 3.2 and 3.3 do not even require a closed union operator. In our experiments, we focus on the well-known HyperLogLog, or HLL, cardinality sketch.

We describe HLL and discuss these features in greater detail in Section 4. First, however, we describe algorithms utilizing DegreeSketch for neighborhood size estimation in Section 3.1, recovering edge-local triangle count heavy hitters in Section 3.2, and finally recovering vertex-local triangle count heavy hitters in Section 3.3.

3.1 Neighborhood Size Estimation

Let D be an instance of DegreeSketch as described, so that for each v ∈ V, S_v is a cardinality sketch of the adjacency set of v. By the properties of cardinality sketches, if we know v’s neighbors we can compute an estimate of its 2-neighborhood size by computing

n̂_2(v) = Estimate(S_v ∪ ⋃_{u ∈ Γ(v)} S_u),    (7)

where the unions denote Merge operations over the sketches.

Input: ℰ - edge stream divided into substreams {ℰ_p}

P - universe of processors

D - accumulated DegreeSketch

Q_s - dictionary mapping P to send queues

Q_r - dictionary mapping P to receive queues

π - partition mapping

T - maximum neighborhood depth

Returns: n̂_t(v) for all v ∈ V and t ≤ T, and n̂_t

1:Send Context for p ∈ P:
2:while Q_s[p] is not empty do
3:     if next message is an Edge then
4:         (q, (u, v)) ← pop(Q_s[p])
5:         push Edge (u, v) to Q_r[q]
6:     else if next message is a Sketch then
7:         (q, (v, S)) ← pop(Q_s[p])
8:         push Sketch (v, S) to Q_r[q]
9:Receive Context for p ∈ P:
10:while Q_r[p] is not empty do
11:     if next message is an Edge then
12:         (u, v) ← pop(Q_r[p])
13:         push (π(v), Sketch(v, S_u^(t−1))) to Q_s[p]
14:     else if next message is a Sketch then
15:         (v, S) ← pop(Q_r[p])
16:         Merge(S_v^(t), S)
17:Computation Context for p ∈ P:
18:t ← 1; S_v^(1) ← S_v for all v ∈ π⁻¹(p)
19:while true do
20:     n̂_t(v) ← Estimate(S_v^(t)) for all v ∈ π⁻¹(p)
21:     N_p ← Σ_{v ∈ π⁻¹(p)} n̂_t(v)
22:     n̂_t ← Reduce(N_p)
23:     t ← t + 1
24:     if t > T then break
25:     Reset ℰ_p
26:     S_v^(t) ← S_v^(t−1) for all v ∈ π⁻¹(p)
27:     while ℰ_p has unread element (u, v) do
28:         push (π(u), Edge(u, v)) to Q_s[p]
29:         push (π(v), Edge(v, u)) to Q_s[p]
Algorithm 2 Neighborhood Approximation

Higher-order union operations of the form of Equation (7) are the core of the ANF (palmer2002anf) and HyperANF (boldi2011hyperanf) algorithms. Algorithm 2 is a distributed generalization of HyperANF using DegreeSketch. After accumulating D, an instance of DegreeSketch, the algorithm takes a number of additional passes over ℰ. For t starting at 2, we accumulate

S_v^(t) = S_v^(t−1) ∪ ⋃_{u ∈ Γ(v)} S_u^(t−1)    (8)

by way of a message-passing scheme similar to Algorithm 1. When p = π(u) receives an edge (u, v), it forwards S_u^(t−1) to π(v). When π(v) receives S_u^(t−1), it merges it into its next-layer local sketch for v, S_v^(t), computing Equation (8) once all messages are processed. By construction, we have that

S_v^(t) is a sketch of ⋃_{u : d_G(u,v) ≤ t−1} Γ(u) = {w ∈ V : 1 ≤ d_G(w, v) ≤ t}.    (9)

Ergo, the set of elements inserted into S_v^(t) consists of all w such that d_G(w, v) ≤ t, which is to say that Estimate(S_v^(t)) directly approximates n_t(v) (Equation (1)). These data structures can be maintained for later use by simply storing all of the S_v^(t) between passes.
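A serial C++ sketch of the recursion in Equation (8) may help fix ideas; it assumes local access to adjacency lists and to the previous layer’s sketches, which the distributed Algorithm 2 instead obtains via Edge and Sketch messages. The helper names and types are ours.

// One pass of Equation (8): merge each neighbor's layer-(t-1) sketch
// into the endpoint's layer-t sketch. Reuses NotionalSketch and assumes
// every vertex appearing in G already has a sketch in `prev`.
#include <cstdint>
#include <unordered_map>
#include <vector>

using AdjList = std::unordered_map<uint64_t, std::vector<uint64_t>>;
using Layer = std::unordered_map<uint64_t, NotionalSketch>;

Layer NextLayer(const AdjList& G, const Layer& prev) {
  Layer next = prev;  // S_v^(t) starts as a copy of S_v^(t-1)
  for (const auto& [u, nbrs] : G)
    for (uint64_t v : nbrs)
      next[v].Merge(prev.at(u));  // "forward" S_u^(t-1) to v and merge
  return next;
}

// n_t estimate in the form of Equation (2): sum per-vertex estimates.
double NeighborhoodFunction(const Layer& layer) {
  double total = 0;
  for (const auto& [v, s] : layer) total += s.Estimate();
  return total;
}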

The summations over all local sketch estimates in line 21 of Algorithm 2 estimate n_t in the form of Equation (2). Note that these summations are performed as distributed Reduce operations, and occur between passes over ℰ. The following theorem states the approximation quality of Algorithm 2 when implemented with HLLs, and is inspired by Theorem 1 of (boldi2011hyperanf).

Theorem 1.

Let δ and σ be the multiplicative bias and standard deviation for HLLs given in Theorem 1 of (flajolet2007hyperloglog). The outputs n̂_t(v) and n̂_t for v ∈ V at the t-th iteration satisfy

E[n̂_t(v)] = (1 + δ) n_t(v)  and  E[n̂_t] = (1 + δ) n_t,

i.e. they are nearly unbiased. Furthermore, both also have standard deviation bounded by σ. That is,

√Var[n̂_t(v)] ≤ σ n_t(v)  and  √Var[n̂_t] ≤ σ n_t.

Proof.

For each v ∈ V, n̂_t(v) = Estimate(S_v^(t)), where S_v^(t) is a union of HLLs into which every u such that d_G(u, v) ≤ t has been inserted, as we noted from Equation (9). Thus by Theorem 1 of (flajolet2007hyperloglog),

E[n̂_t(v)] = (1 + δ) n_t(v)  and  √Var[n̂_t(v)] ≤ σ n_t(v).

Thus, by the linearity of expectation and the subadditivity of the standard deviation,

E[n̂_t] = Σ_{v ∈ V} E[n̂_t(v)] = (1 + δ) n_t  and  √Var[n̂_t] ≤ Σ_{v ∈ V} √Var[n̂_t(v)] ≤ σ n_t. ∎

Theorem 1 tells us that the estimates of n_t(v) and n_t retain the approximation guarantees of their underlying sketch. Analogous theorems for other cardinality sketches with closed union operators are similarly simple to prove. Hence, we are able to guarantee that all approximations produced by Algorithm 2 retain the guarantees of their underlying cardinality sketches.

3.2 Edge-Local Triangle Count Heavy Hitters

In addition to estimating local neighborhood sizes, DegreeSketch affords an analysis of local triangle counts using intersection estimation. Furthermore, while sampling-based streaming algorithms are limited to vertex-local triangle counts, DegreeSketch affords the analysis of edge-local triangle counts, which can be thought of as a generalization of vertex-local triangle counts. Given the edge-local triangle counts for each edge incident upon a vertex, we can easily compute its vertex-local triangle count as in Equation (5). The reverse is not true.

Edge-local triangle counts have understandably not received much attention in the streaming literature, considering that even enumerating them requires Ω(m) space. Given an accumulated DegreeSketch D, we can estimate Δ_uv using the intersection estimation operator,

Δ̂_uv = Intersect(S_u, S_v).    (10)

This procedure is similar in spirit to the well-known if suboptimal intersection method for local triangle counting. We can also estimate the total number of triangles following Equation (6) by

Δ̂ = (1/3) Σ_{(u,v) ∈ E} Δ̂_uv.    (11)
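In serial form, Equations (10) and (11) amount to one Intersect call per edge, as the following C++ sketch shows (reusing the hypothetical Edge and NotionalSketch types from above); the division by 3 reflects that each triangle is observed once per participating edge.

// Serial illustration of Equations (10) and (11): estimate Delta_uv per
// edge via sketch intersection, then sum and divide by 3 for Delta-hat.
#include <cstdint>
#include <unordered_map>
#include <vector>

double GlobalTriangleEstimate(
    const std::vector<Edge>& edges,
    const std::unordered_map<uint64_t, NotionalSketch>& D) {
  double total = 0;
  for (const auto& [u, v] : edges)
    total += NotionalSketch::Intersect(D.at(u), D.at(v));  // Delta_uv estimate
  return total / 3.0;  // each triangle contributes once per its 3 edges
}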

Input: ℰ - edge stream divided into substreams {ℰ_p}

k - integral heavy hitter count

P - universe of processors

D - accumulated DegreeSketch

Q_s - dictionary mapping P to send queues

Q_r - dictionary mapping P to receive queues

π - partition mapping

1:Computation Context for p ∈ P:
2:H_p ← empty max k-heap
3:Δ̂_p ← 0
4:while ℰ_p has unread element (u, v) do
5:     push (π(u), Edge(u, v)) to Q_s[p]
6:Δ̂ ← Reduce(Δ̂_p)
7:Δ̂ ← Δ̂ / 3
8:H ← Reduce(H_p)
Algorithm 3 Local Triangle Count Heavy Hitters Chassis

Unfortunately, while most cardinality sketches have a native and closed union operation, they all lack a satisfactory intersection operation, a consequence of the fact that detection of a trivial intersection is impossible in sublinear memory. Hence, we must instead make use of heuristic intersection operations in practice, which have been a focus of recent research (ting2016towards; cohen2017minimal; ertl2017new). We will discuss these in more detail in Section 4.1, and analyze their shortcomings in Appendix B. For our purposes, we will suppose that Intersect is reliable only where intersections are large. Consequently, we will attempt only to recover the heavy hitters, i.e. the edges participating in the greatest number of triangles.

Algorithm 3 provides a chassis for Algorithms 4 and 5, which differ only in their communication behavior. In Algorithm 3, all processors read over their edge streams and forward edges to one of their endpoints, similar to the behavior in the Computation Context of Algorithm 2. They also initialize a counter Δ̂_p and a max heap H_p with a maximum size of k. These values are modified in the Send Context and Receive Context.

Returns: Δ̂, H - the top k edge estimates

1:Send Context for p ∈ P:
2:while Q_s[p] is not empty do
3:     if next message is an Edge then
4:         (q, (u, v)) ← pop(Q_s[p])
5:         push Edge (u, v) to Q_r[q]
6:     else if next message is a Sketch then
7:         (q, (u, v, S)) ← pop(Q_s[p])
8:         push Sketch (u, v, S) to Q_r[q]
9:Receive Context for p ∈ P:
10:while Q_r[p] is not empty do
11:     if next message is an Edge then
12:         (u, v) ← pop(Q_r[p])
13:         push (π(v), Sketch(u, v, S_u)) to Q_s[p]
14:     else if next message is a Sketch then
15:         (u, v, S) ← pop(Q_r[p])
16:         Δ̂_uv ← Intersect(S, S_v)
17:         Δ̂_p ← Δ̂_p + Δ̂_uv
18:         Try to insert ((u, v), Δ̂_uv) into H_p
19:Computation Context for p ∈ P:
20:Run Algorithm 3 using these communication contexts
Algorithm 4 Edge-Local Triangle Count Heavy Hitters

Algorithm 4 issues a chain of messages for each read edge, not unlike the procedure in Algorithm 2. Processor p reads (u, v) and issues a message of type Edge containing (u, v) to π(u). Upon receipt, π(u) issues a message of type Sketch containing (u, v, S_u) to π(v). When π(v) receives this message, it computes Δ̂_uv via Equation (10) and updates Δ̂_p and H_p. Once computation is complete and all receive queues are flushed, the algorithm computes a global Reduce sum to find Δ̂, and similarly finds the global top k estimates via a reduce on the H_p. The algorithm returns Δ̂/3 (each triangle is counted 3 times) and H.

Algorithm 4 addresses edge-local triangle count heavy hitter recovery using memory sublinear in the size of G. It requires time and communication linear in m, given our assumptions, and a total of O(n ε⁻² log log n) space, where DegreeSketch is implemented using HyperLogLog sketches with standard error ε. Unfortunately, we are unable to provide an analytic bound on the error of this algorithm, due to the nature of sublinear intersection estimation. We provide an experimental exploration of this problem in Appendix B, and evaluate the performance of Algorithm 4 in Section 5.

3.3 Vertex-Local Triangle Count Heavy Hitters

Given access to a DegreeSketch instance D and the neighbors of v, we can compute an estimate of Δ_v using

Δ̂_v = (1/2) Σ_{u ∈ Γ(v)} Intersect(S_u, S_v),    (12)

following Equations (5) and (10). We must still limit our scope to the recovery of vertex-local triangle count heavy hitters due to the problem of estimating small intersections.

Algorithm 5 performs vertex-local triangle count estimation in a manner similar to Algorithm 4, with some additional steps. We maintain a running estimate Δ̂_v for each v ∈ V. The algorithm performs work similar to Algorithm 4 up to the point where processor π(v) estimates Δ̂_uv. Instead of inserting this estimate into a local max heap, we add Δ̂_uv/2 to Δ̂_v and forward it to π(u) so that it can add Δ̂_uv/2 to Δ̂_u, per Equation (12). This message has the Est type, to distinguish it from Edge and Sketch messages. After all processors are done communicating and updating their local counts, they assemble and reduce a max k-heap of vertex-local triangle count heavy hitters. Note that we could also return Δ̂_v for all vertices at no additional cost, but this is not generally a good idea, as explained in Appendix B.

Algorithm 5 addresses vertex-local triangle count heavy hitter recovery using the same asymptotic computation, memory, and communication costs as Algorithm 4. Unfortunately, we are similarly unable to provide an a priori analytic bound on the error of this algorithm.

Output: Δ̂, H - the top k vertex estimates

1:Send Context for p ∈ P:
2:while Q_s[p] is not empty do
3:     if next message is an Edge then
4:         (q, (u, v)) ← pop(Q_s[p])
5:         push Edge (u, v) to Q_r[q]
6:     else if next message is a Sketch then
7:         (q, (u, v, S)) ← pop(Q_s[p])
8:         push Sketch (u, v, S) to Q_r[q]
9:     else if next message is an Est then
10:         (q, (u, δ)) ← pop(Q_s[p])
11:         push Est (u, δ) to Q_r[q]
12:Receive Context for p ∈ P:
13:while Q_r[p] is not empty do
14:     if next message is an Edge then
15:         (u, v) ← pop(Q_r[p])
16:         push (π(v), Sketch(u, v, S_u)) to Q_s[p]
17:     else if next message is a Sketch then
18:         (u, v, S) ← pop(Q_r[p])
19:         Δ̂_uv ← Intersect(S, S_v)
20:         Δ̂_p ← Δ̂_p + Δ̂_uv
21:         Δ̂_v ← Δ̂_v + Δ̂_uv/2
22:         push (π(u), Est(u, Δ̂_uv/2)) to Q_s[p]
23:     else if next message is an Est then
24:         (u, δ) ← pop(Q_r[p])
25:         Δ̂_u ← Δ̂_u + δ
26:Computation Context for p ∈ P:
27:Δ̂_v ← 0 for all v ∈ π⁻¹(p)
28:Run Algorithm 3 using these communication contexts
29:Accumulate max k-heap H_p from {(v, Δ̂_v) : v ∈ π⁻¹(p)}
Algorithm 5 Vertex-Local Triangle Count Heavy Hitters

4 The HyperLogLog Sketch

The HyperLogLog sketch is arguably the most popular cardinality sketch in applications, and has attained widespread adoption (flajolet2007hyperloglog). The sketch relies on the key insight that the binary representation of a random machine word starts with k − 1 zeros followed by a one with probability 2⁻ᵏ. Thus, if the maximum number of leading zeros observed in a set of random words is k − 1, then 2ᵏ is a good estimate of the cardinality of the set (flajolet1985probabilistic). However, this estimator clearly has high variance. The variance is traditionally minimized using stochastic averaging to simulate parallel random trials (flajolet1985probabilistic).

Assume we have a stream S of random machine words of fixed size. For a word w, let idx(w) be the first b bits of w, and let ρ(w) be the number of leading zeros of its remaining bits, plus one. We pseudorandomly partition the elements of S into m = 2^b substreams of the form S_i = {w ∈ S : idx(w) = i}; throughout this section, m denotes the number of registers rather than the number of edges. For each of these approximately equally-sized substreams, we maintain an independent estimator of the above form. Each register M[i], 0 ≤ i < m, accumulates the value

M[i] = max_{w ∈ S_i} ρ(w).    (13)

Of course, in practice we simulate randomness using hash functions. We utilize the non-cryptographic xxHash (xxhash) in our implementation. We will assume throughout that algorithms have access to such a hash function h.
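Concretely, for 64-bit hashed words the two register-addressing primitives of the preceding paragraphs can be written as below. This is a minimal C++20 sketch of idx and ρ under our stated conventions, not an excerpt of our implementation.

// idx(w): the first b bits of w. rho(w): the number of leading zeros of
// the remaining 64 - b bits, plus one. Assumes 4 <= b <= 16.
#include <bit>
#include <cstdint>

inline uint64_t idx(uint64_t w, int b) { return w >> (64 - b); }

inline uint8_t rho(uint64_t w, int b) {
  uint64_t rest = w << b;                // drop the b-bit prefix
  if (rest == 0) return 64 - b + 1;      // all remaining bits are zero
  return std::countl_zero(rest) + 1;     // leading zeros of suffix, plus one
}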

After accumulation, M[i] stores the maximum number of leading zeroes over the substream S_i, plus one. The authors of HyperLogLog show in (flajolet2007hyperloglog) that the normalized bias-corrected harmonic mean of these registers,

E = α_m m² (Σ_{i=0}^{m−1} 2^{−M[i]})⁻¹,    (14)

where the bias correction term α_m is given by

α_m = (m ∫₀^∞ (log₂((2 + x)/(1 + x)))^m dx)⁻¹,    (15)

is a good estimator of n, the number of unique elements in S. The estimate E has standard error 1.04/√m, i.e. it satisfies

|E − n| ≤ (1.04/√m) n    (16)

with high probability. However, Equation (14) is known to exhibit bias in practice on small and very large cardinalities n. Expanding the hash function to emit 64-bit words instead of the original 32-bit words practically eliminates the large-cardinality bias in most practical problems. Although modifications to handle small-cardinality bias abound, we choose the approach of LogLogBeta (qin2016loglog) for its simplicity and replace Equation (14) with

E = α_m m (m − z) (β(z) + Σ_{i=0}^{m−1} 2^{−M[i]})⁻¹.    (17)

Here z is the number of zero registers in M and β(z) is an experimentally determined bias minimizer. We follow the lead of the authors of LogLogBeta and determine β(z) as a 7th-degree polynomial in log(z + 1), whose weights are set experimentally by solving a least-squares problem as in Section II.C of (qin2016loglog).
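As an illustration, the dense-mode raw estimator of Equation (14) takes only a few lines of C++. Here α_m uses the closed-form approximation 0.7213/(1 + 1.079/m) given in (flajolet2007hyperloglog) for large m; the β(z) correction of Equation (17) is omitted because its fitted polynomial weights are not reproduced in this document.

// Raw HyperLogLog estimate of Equation (14) over a dense register array
// M of size m = 2^b. The LogLogBeta form of Equation (17) would instead
// use m(m - z) in the numerator and add beta(z) to the sum.
#include <cmath>
#include <cstdint>
#include <vector>

double RawEstimate(const std::vector<uint8_t>& M) {
  const double m = static_cast<double>(M.size());
  const double alpha = 0.7213 / (1.0 + 1.079 / m);  // approximates Eq. (15)
  double sum = 0.0;
  for (uint8_t r : M) sum += std::ldexp(1.0, -static_cast<int>(r));  // 2^{-M[i]}
  return alpha * m * m / sum;
}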

The majority of vertices in many application graphs have a small number of neighbors, which suggests that to maximize memory and communication efficiency we would like a cheaper way to represent sketches in which most of M is empty. We adopt the sparse representation for HyperLogLog sketches suggested by Heule et al. (heule2013hyperloglog). Mathematically, the sparsification procedure is tantamount to maintaining the set M_sp = {(i, M[i]) : M[i] ≠ 0}. M_sp requires less memory than M when the cardinality of the underlying multiset is small. Moreover, it is straightforward to saturate a sparse sketch into a dense one once it is no longer cost effective to maintain, by instantiating M while assuming that all registers not set in M_sp are zero. We will assume that M_sp is implemented as a map, where an element M_sp[i] takes the value M[i] if (i, M[i]) ∈ M_sp and is zero otherwise.

A particular HyperLogLog sketch thus consists of a hash function h, a prefix size b (typically between 4 and 16), a maximum register value, and an array of m = 2^b registers, initially an empty sparse register list M_sp. We summarize references to such a sketch as S = (h, b, M). Algorithm 6 describes the functions supported by our version of the HyperLogLog sketch.

State Variables for S = (h, b, M):

    mode ∈ {sparse, dense}, initially sparse

    m = 2^b - number of registers

    M - dense register list of size m, initially unallocated

    M_sp - sparse register set, initially ∅

    c - count of nonzero elements of M_sp or M, initially 0

1:function Insert(S, x)
2:     w ← h(x)
3:     i ← idx(w)
4:     r ← ρ(w)
5:     Insert(S, (i, r))
6:function Insert(S, (i, r))
7:     if mode = dense then
8:         M[i] ← max(M[i], r)
9:     else if mode = sparse then
10:         merge (i, r) into M_sp (see Figures 6 & 7 of (heule2013hyperloglog))
11:         if M_sp is no longer cost effective then
12:             Saturate(S)
13:function Saturate(S)
14:     M ← length-m array of zeros
15:     for (i, r) ∈ M_sp do
16:         M[i] ← r
17:     mode ← dense
18:function Merge(S, S′)
19:     Saturate(S) and Saturate(S′) if sparse
20:     for i = 0, …, m − 1 do
21:         if M′[i] > M[i] then
22:             if M[i] = 0 then c ← c + 1
23:             M[i] ← M′[i]
24:     return S
25:function Estimate(S)
26:     if mode = sparse then
27:         return the LinearCounting estimate over M_sp (heule2013hyperloglog)
28:     else
29:         return E of Equation (17)
Algorithm 6 HLL Operations

Note that HLLs support a natural merge operation: taking the element-wise maximum of each index of a group of register vectors. This requires that the two sketches were generated using the same hash function. We assume throughout that all of the sketches in an instance of DegreeSketch share a fixed hash function h and prefix size b. Estimate error scales with the inverse square root of m per Equation (16); thus, increasing b improves estimation accuracy at the cost of computation and memory overhead.
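For two saturated (dense) sketches the merge is exactly this element-wise maximum, e.g. the following C++ fragment over raw register arrays:

// Element-wise-maximum merge of dense HLL register arrays. Only valid
// when both sketches share the same hash function and prefix size b.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

void MergeDense(std::vector<uint8_t>& M, const std::vector<uint8_t>& other) {
  for (std::size_t i = 0; i < M.size(); ++i)
    M[i] = std::max(M[i], other[i]);
}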

4.1 Intersection Estimation

A naïve approach to estimating the intersection of two sets A and B using cardinality sketches might involve computing the intersection via the inclusion-exclusion principle:

|A ∩ B| = |A| + |B| − |A ∪ B|.    (18)

However, if we attempt to estimate each quantity on the right side of Equation (18), the error noise in each estimate could result in a negative answer! Furthermore, if the true intersection is small relative to the set sizes, or if one set is much larger than the other, the variance will be quite high.
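A C++ rendering of this naïve estimator makes the failure mode visible: with a real sketch, the subtraction can go negative under estimation noise, so implementations typically clamp at zero. This reuses the hypothetical NotionalSketch interface of Section 2.

// Inclusion-exclusion intersection estimate of Equation (18). With noisy
// Estimate() calls the raw difference can be negative, hence the clamp.
#include <algorithm>

double InclusionExclusion(const NotionalSketch& a, const NotionalSketch& b) {
  NotionalSketch u = a;
  u.Merge(b);  // sketch of A ∪ B
  double est = a.Estimate() + b.Estimate() - u.Estimate();
  return std::max(0.0, est);
}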

Ertl describes a better intersection estimator using a maximum likelihood principle (ertl2017new). The estimator yields estimates of |A \ B|, |B \ A|, and |A ∩ B|. The algorithm depends on the optimization of a Poisson model, where it is assumed that |A \ B| is drawn from a Poisson distribution with parameter λ_a, and similarly |B \ A| and |A ∩ B| use Poisson parameters λ_b and λ_x. These parameters can be related to the observed HyperLogLog register lists corresponding to A and B, M_A and M_B, via a log-likelihood function given by Equation (70) of (ertl2017new), which we do not reproduce due to space constraints. This log-likelihood is a function of count statistics recording, for each register value k, the number of indices i at which M_A[i] is less than, equal to, or greater than M_B[i], denoted

C_k^<, C_k^=, and C_k^>,    (19)

which capture the differences in register list distribution. The inclusion-exclusion estimator loses information present in these more detailed count statistics. Algorithm 9 of (ertl2017new) describes the estimation of |A \ B|, |B \ A|, and |A ∩ B| by accumulating the sufficient statistics (19) and using them to find the maximum of Equation (70) in the source via maximum likelihood estimation. The author shows extensive simulation evidence indicating that this method significantly improves upon the estimation error of the naïve estimator. We provide further analyses of intersection estimation edge cases in Appendix B, and reaffirm some of Ertl’s findings.

5 Experiments

We now evaluate the algorithms and claims made throughout this document.

Implementation: We implemented all of our algorithms in C++, using MVAPICH2 2.3 for MPI. Inter- and intra-node communication is managed using the pseudo-asynchronous MPI-enabled communication software package YGM (priest2019you). We used xxHash as our hash function implementation (xxhash).

Figure 1: Mean relative error estimating n_t(v) for all v and all t up to 5 for 10 moderate graphs. Here a prefix size of b = 8 was used, guaranteeing standard error around 0.06.

Hardware: All of the experiments were performed on a cluster of compute nodes with twenty-four 2.4 GHz Intel Xeon E5-2695 v2 processor cores and 128GB memory per node. We varied the number of nodes per experiment depending on scalability requirements and the size of the graph. We consider graph partitioning to be a separate problem, and accordingly use simple round-robin assignment for our experiments.

Graphs: We ran our implementations on a collection of real and synthetic graphs. Many of these graphs are provided by Stanford’s widely used SNAP dataset (snapnets). These graphs are collected from natural sources, such as social media, transportation networks, email records, peer-to-peer communications, and so on. We cast each graph as unweighted, ignoring directionality, self-loops, and repeated edges. We also used 5 graphs derived from nonstochastic Kronecker products of smaller graphs. These graphs are described in detail in Appendix C.

Experiments: Except where noted otherwise, we ran each experiment 100 times using the same settings while varying the random seed. We set the prefix size b in each experiment depending on the accuracy needs, where larger b implies greater accuracy at a higher cost. We report mean relative error (MRE), where the relative error of an estimate x̂ of a true quantity x is |x̂ − x| / x.

Analysis: We performed experiments on real graph datasets for the purpose of establishing the following of our algorithms:

  1. Estimation Quality: Do the algorithms yield good local t-neighborhood estimates? How accurate are the global and edge- and vertex-local triangle count estimates?

  2. Heavy Hitter Recovery: Do the heavy hitters returned by Algorithms 4 and 5 correspond to the ground truth heavy hitters?

  3. Speed & Scalability: How fast is accumulation? Estimation? How does wall time scale with graph size and computing resources?

Figure 2: Precision versus recall for the top 10, top 100, and top 1000 ground truth edge-local triangle count heavy hitters of all SNAP graphs and all 5 synthetic Kronecker graphs.

We examined the performance of the local t-neighborhood estimation algorithm (Algorithm 2) with a prefix size of 8 on 10 moderately sized SNAP graphs. Figure 1 displays the MRE of the returned local estimates. We find that the MRE matches our expectations informed by Theorem 1. In early passes, most of the neighborhoods are relatively small, and so in practice the cardinality sketches give very precise estimates. As t grows, the neighborhood balls grow to saturate the graph, and so the estimation error accordingly grows until leveling off around the theoretical error guarantee.

Figure 3: The triangle counts and triangle densities of the top edge-local triangle count heavy hitters for four graphs.

We experimented with Algorithm 4 with a prefix size of 12, attempting to collect the top k heavy hitters for all of the SNAP and Kronecker graphs. For each graph, we ran the algorithm with k ranging from 10 up to 1000, producing heavy hitter sets H. We performed a similar analysis and found similar results for Algorithm 5, and omit them for space.

Treating H as a one-class classifier of elements in the ground truth top-k set, an edge is a true positive if it appears in both H and the ground truth set, a false negative if it appears only in the ground truth set, a false positive if it appears only in H, or a true negative if it appears in neither. We report the quality of experiments in terms of recall versus precision for each setting of graph and k in Figure 2. The precision versus recall tradeoff is a common metric in information retrieval, where the goal is to tune model parameters so as to force both the precision and recall as close to one as possible. Although these measures are known to exhibit bias, they are accepted as reasonable for heavily uneven classification problems such as ours, where the class of interest is a small proportion of the samples (boughorbel2017optimal). Although most graphs show very good precision versus recall curves, there are a few notable outliers.

Figure 3 contrasts the edge-local triangle count distributions of a graph with good performance in Figure 2 and three with relatively poor performance. We find that triangle density, or the ratio of an edge’s triangle count to the size of the union of its endpoints’ adjacency sets, is a powerful determiner of performance. Triangle density corresponds to the Jaccard similarity of the edge’s endpoint neighbor sets, and for neighbor sets Γ(u) and Γ(v) is computed as |Γ(u) ∩ Γ(v)| / |Γ(u) ∪ Γ(v)|. In other words, again, relatively small intersections produce high error and uncertain heavy hitter recovery.

The cit-Patents graph exhibits good performance in Figure 2, and demonstrates a reasonable triangle count distribution as well as high triangle density throughout in Figure 3. The other three graphs demonstrate poor performance in Figure 2. The Kronecker em em graph exhibits an unusual number of ties in its triangle count distribution due to its construction, in addition to low triangle density among its heavy hitters. The P2P-Gnutella24 graph has very low triangle density, and 3 or fewer triangles for the vast majority of its edges. The ca-HepTh graph exhibits an unusual triangle distribution, where a huge portion of its edges tie at 30 triangles. Consequently, even a perfect heavy hitter extraction procedure will fail on this graph. Notably, the two edges with the largest triangle counts are reliably returned.

graph     |V|            |E|              Type
patents   3,774,768      16,518,947       Citation
ye ye     5,574,320      88,338,632       Kronecker
or or     131,859,288    1,095,962,562    Kronecker
Twitter   41,652,224     1,201,045,942    Social Network (kunegis2013konect)
WDC       3,563,602,788  128,736,914,864  Web
Table 1: Scaling Graphs
Figure 4: The time in seconds to perform local t-neighborhood size estimation for the or or graph as the number of compute nodes varies.
Figure 5: The time in seconds to accumulate DegreeSketch and perform local triangle count estimation using a fixed number of compute nodes for all graphs listed in Table 1.
Figure 6: The time in seconds to accumulate DegreeSketch and perform local triangle count estimation on the cit-Patents graph as the number of compute nodes varies up to 72.

We also examined the performance scaling of the algorithms as a function of data and computing resource sizes. For each experiment we used a prefix size of 8 and discounted the I/O time spent reading data streams from files. Algorithms 4 and 5 exhibit near-identical time performance, and so we report only the latter.

We ran Algorithm 2 on the or or Kronecker graph, varying the number of compute nodes (Figure 4). We note nice strong scaling behavior: as the computational resources double, the time roughly halves. In particular, pass 2 tends to take more time because of the sparsity settings in our implementations; merges are less efficient for sparse sketches. Once the sketches saturate, the time decreases significantly in later passes. One could implement DegreeSketch using only dense sketches to avoid this hump. If one only intends to perform local t-neighborhood estimation, omitting sparse sketches is a good idea, as all sketches will eventually saturate as t grows.

Similarly, Figure 6 measures the time in seconds spent on Algorithms 1 and 5 on the cit-Patents graph, where the number of compute nodes varies from 1 up to 72. This scaling experiment demonstrates significant performance improvements on a fixed amount of work as resources increase.

It is difficult to demonstrate scaling with problem size using graph data, as graphs (especially realistic ones) do not scale smoothly like, say, linear algebra. We instead present a scaling experiment over the 5 large graphs in Table 1, holding compute resources fixed. We found that subsequent passes of Algorithm 2 exhibited similar behavior to Algorithm 5, and so we only report results for the latter.

Figure 5 measures the time in seconds spent accumulating DegreeSketch and performing the vertex-local estimation on each graph, plotted against the number of edges in each graph. We used a fixed number of compute nodes in each case. As promised, the wall time is linear in the number of edges for both accumulation and estimation, showing good scaling with graph size on fixed resources. We found in experiments that the subsequent passes of Algorithm 2 experienced similar linear scaling.

It is worth noting that a competing state-of-the-art exact triangle counting algorithm required substantially more compute nodes simply to load the largest graph, from WebDataCommons, into distributed memory (pearce2017triangle).

6 Conclusions

We have herein demonstrated the efficacy of DegreeSketch for scalably and approximately answering queries on massive graphs. Although we have focused on estimating neighborhood sizes and counting triangles, DegreeSketch’s utility extends to more general queries that can be phrased as unions and possibly an intersection of adjacency sets. In particular, although we have focused on simple undirected graphs in this work, colored graphs are an interesting area of future work. A simple generalization of the work presented here would allow us to estimate interesting queries of the form ”how many of v’s t-neighbors are both red and green?” or ”how many of v’s t-neighbors are not blue?” DegreeSketch’s demonstrated scalability coupled with its demonstrated performance should prove useful in applications where such queries are prevalent, such as motif counting.

7 Acknowledgments

This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344 (LLNL-CONF-757958). Experiments were performed at the Livermore Computing Facility.

References

Appendix A Vertex-Local Variance Bound

The following theorem bounds the estimator error variance of Algorithm 5 in terms of the variances of the atomic edge-local estimates, using the subadditivity of the standard deviation; i.e., if random variables X and Y have finite variance, then √Var[X + Y] ≤ √Var[X] + √Var[Y]:

Theorem 2.

Let Δ̂_v = (1/2) Σ_{u ∈ Γ(v)} Δ̂_uv be the estimated output of Algorithm 5 for v ∈ V, where Δ̂_uv is the estimated edge triangle count for (u, v) ∈ E. Assume further that for each (u, v), we know a standard deviation bound σ_uv so that

√Var[Δ̂_uv] ≤ σ_uv Δ_uv.    (20)

Furthermore, let σ* = max_{u ∈ Γ(v)} σ_uv. Then Δ̂_v has at most twice this maximum standard deviation relative to Δ_v. That is,

√Var[Δ̂_v] ≤ 2 σ* Δ_v.

Proof.
√Var[Δ̂_v] ≤ (1/2) Σ_{u ∈ Γ(v)} √Var[Δ̂_uv]    (subadditivity)
≤ (1/2) Σ_{u ∈ Γ(v)} σ_uv Δ_uv ≤ σ* (1/2) Σ_{u ∈ Γ(v)} Δ_uv    (Equation (20))
= σ* Δ_v ≤ 2 σ* Δ_v.    (Equation (5)) ∎

Theorem 2 shows that if we can bound the error variance of the edge-local triangle count estimates produced using DegreeSketch, we can also bound the error variance of the vertex-local triangle count estimates produced by Algorithm 5. Unfortunately, we are unable to provide these bounds a priori, as they depend upon the sizes of all of the adjacency sets and their intersections, which are an unknown function of the graph. However, if we are somehow promised that the triangle density of every edge is above a given threshold, we are able to produce analytic guarantees of the estimation error.

Appendix B Dominations and Small Intersections

We have noted that there are limitations to the sketch intersection estimation in Section 4.1. There appear to be two main sources of large estimation error in practice. Throughout we will borrow the parlance of Section 4.1.

The first source of error is the phenomenon where M_A[i] > M_B[i] for all i where M_B[i] ≠ 0, resulting in C_k^< = 0 for all k and C_k^= = 0 for all k > 0. This generally only occurs when |A| is much larger than |B| or when b is small. We say that such an M_A strictly dominates M_B. In this case, Equation (70) of (ertl2017new) can be rewritten as the sum of functions depending upon λ_a and λ_b alone. This means that the optimization relative to λ_a does not depend upon λ_x or λ_b. The optimization relative to λ_b is similarly independent of λ_x, and thus is tantamount to using the maximum likelihood estimator for B independently of A. Consequently, λ_x could be anything between 0 and min(|A|, |B|) without affecting the optimum joint likelihood, resulting in arbitrary estimates for the intersection.

Figure 7: Mean relative intersection error as a function of |B|, with |A| and the relative intersection size held fixed.

We also consider the phenomenon where M_A[i] ≥ M_B[i] for all i, resulting in C_k^< = 0 for all k. We say that such an M_A dominates M_B. We are unable to make the same analytic statements about Equation (70), as the terms dependent upon λ_x are not eliminated. Consequently, the optimum estimate for λ_x depends upon M_A and M_B. If M_A dominates M_B, the count statistics given by Equation (19) are unable to distinguish whether B is a subset of A. Many and large nonzero values of C_k^= for large k will bias the optimization towards larger intersections, whereas the converse is true if C_k^= is nonzero for only a few small values of k. If |B| is small, then the latter might occur whether the intersection is large or small. Furthermore, note that if B ⊆ A, then M_A will (possibly strictly) dominate M_B.

If M_A dominates M_B, then Estimate(Merge(S_A, S_B)) = Estimate(S_A). Ergo, the inclusion-exclusion estimator returns the estimated value of |B|. This estimate is dubious, given that we have no evidence that the sets A and B hold any elements in common. This is especially true if |B| is small. Hence, both the naïve and maximum likelihood estimators may suffer from bias when a domination event occurs. Figure 7 plots the mean relative error as one of the sets decreases in size, with a fixed relative intersection size. As |B| gets smaller, the likelihood of a domination increases, and dominations occur in a growing fraction of cases. In particular, in the two smallest cases of |B|, when a domination does not occur the maximum likelihood estimator returns exactly 1. So, for a fixed intersection size relative to |B|, both the inclusion-exclusion and maximum likelihood estimators return more reasonable estimates when dominations do not occur. However, there is no known reliable method to avoid them in practice.

Consequently, it might be safest to disregard intersection estimates in which a domination occurs, as trusting them is theoretically unsound and likely to produce large and arbitrary error. However, this poses a problem for many graph applications, as one will frequently have to compare the sketches of high-degree vertices with those of comparatively low degree to find their joint triangle count.

Figure 8: HLL inclusion-exclusion and maximum likelihood intersection estimator performance where |A| = |B| is held fixed and the intersection size varies from small up to |A|.

We have also noted the problem of small intersections. As discussed above, the maximum likelihood intersection estimate is proportional to the number (and size) of the nonzero equality statistics C_k^=, where larger k biases the estimate toward larger intersections. If the ground truth intersection is small relative to |A| and |B|, however, Equation (70) will exhibit high variance.

Figure 8 compares the performance of the inclusion-exclusion estimator to that of the maximum likelihood estimator for a prefix size of 12, with the set sizes kept constant. Note that the mean relative error grows quite large as the relative intersection size decreases, although the maximum likelihood estimator consistently outperforms the inclusion-exclusion estimator by roughly an order of magnitude.

Appendix C Kronecker Graph Construction

Nonstochastic Kronecker graphs (weichsel1962kronecker) have adjacency matrices that are Kronecker products of smaller factor matrices, where the factors are also adjacency matrices. This type of synthetic graph is attractive for testing graph analytics at massive scale (leskovec2010kronecker; kepner2018design), as ground truth solutions are often cheaply computable. For such graphs, the global triangle count and the triangle counts at edges are computed via Kronecker formulas (sanders2018large): for a graph with m edges, the worst-case cost of computing the global triangle count is sublinear in m, whereas the cost of computing the full set of edge-local counts is linear in m.
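For concreteness, the adjacency matrix of a Kronecker product of two small dense factors can be formed as follows. This C++ sketch is for illustration on tiny factors only; it does not reproduce the triangle-count formulas of (sanders2018large).

// Kronecker product C = A (x) B of dense 0/1 adjacency matrices: vertex
// (i, p) of C is adjacent to (j, q) exactly when i~j in A and p~q in B.
#include <cstddef>
#include <vector>

using Dense = std::vector<std::vector<int>>;

Dense Kron(const Dense& A, const Dense& B) {
  const std::size_t nA = A.size(), nB = B.size();
  Dense C(nA * nB, std::vector<int>(nA * nB, 0));
  for (std::size_t i = 0; i < nA; ++i)
    for (std::size_t j = 0; j < nA; ++j)
      if (A[i][j])
        for (std::size_t p = 0; p < nB; ++p)
          for (std::size_t q = 0; q < nB; ++q)
            C[i * nB + p][j * nB + q] = B[p][q];
  return C;
}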

Here, we build each synthetic graph from identical factors that come from a small set of graphs from the University of Florida sparse matrix collection (polbooks, celegans, geom, yeast (davis2011university)). All graphs were forced to be undirected, unweighted, and without self loops. We compute the number of triangles at each edge of the factor graph and use the Kronecker formula in (sanders2018large) to get the respective quantities for the product graph. Summing over the edges and dividing by 3 gives the global triangle count.