Improved 3-pass Algorithm for Counting 4-cycles in Arbitrary Order Streaming

07/27/2020 ∙ by Sofya Vorotnikova, et al.

The problem of counting small subgraphs, and specifically cycles, in the streaming model has received a lot of attention over the past few years. In this paper, we consider arbitrary order insertion-only streams, improving over the state-of-the-art result on counting 4-cycles. Our algorithm computes a $(1+\epsilon)$-approximation by taking three passes over the stream and using space $O(m \log n / (\epsilon^2 T^{1/3}))$, where $m$ is the number of edges in the graph and $T$ is the number of 4-cycles.


1 Introduction

Subgraph counting is a fundamental graph problem and an important primitive in massive graph analysis. It has many applications in data mining and analyzing the structure of large networks. This problem has also received a lot of attention in the streaming community, with the main focus on counting triangles [1, 2, 4, 6, 8, 14, 7, 5, 15, 3, 9, 13]. Several papers considered counting larger cycles and cliques [3, 9, 12], and a few studied arbitrary subgraphs of constant size [3, 10, 11]. There is also work on counting 4-cycles in the case when the underlying graph is bipartite [16]. Since a 4-cycle is also a 2-by-2 biclique, it is the most basic motif in bipartite graphs and plays essentially the same role as a triangle does in general graphs.

In this paper, we concentrate on counting 4-cycles in the arbitrary order insertion-only streaming model, improving over the state-of-the-art algorithm presented by McGregor and Vorotnikova [13].

1.1 Our Result and Previous Work

Throughout this paper, we use $n$ to denote the number of vertices in the graph, $m$ to denote the number of edges, and $T$ for the number of 4-cycles. Note that our algorithm is parameterized in terms of $T$, which is a convention adopted in the literature. In practice, the quantities in the algorithm would be initialized based on a promised lower bound on $T$.

Our result is as follows.

Theorem 1.

There exists an $O(m \log n / (\epsilon^2 T^{1/3}))$ space algorithm that takes three passes over an arbitrary order stream and returns a $(1+\epsilon)$ multiplicative approximation to the number of 4-cycles in the graph with probability at least

.

By running copies of the algorithm in parallel and taking the median of their outputs, we can increase the success probability to , where .
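The median trick mentioned above can be illustrated with a small simulation. This is only a sketch: `noisy_estimate` is a hypothetical stand-in for one run of an estimator that succeeds with constant probability, and the success probability 0.7, the 10% tolerance, and the 31 copies are illustrative assumptions rather than values taken from the theorem.

```python
import random
import statistics


def noisy_estimate(true_value, success_prob, rng):
    """Hypothetical stand-in for a single run of an estimator.

    With probability success_prob the output is within 10% of true_value;
    otherwise it is far off (either 0 or ten times too large).
    """
    if rng.random() < success_prob:
        return true_value * rng.uniform(0.9, 1.1)
    return true_value * rng.choice([0.0, 10.0])


def median_of_copies(true_value, copies, success_prob, rng):
    """Run independent copies and return the median of their outputs."""
    outputs = [noisy_estimate(true_value, success_prob, rng) for _ in range(copies)]
    return statistics.median(outputs)


rng = random.Random(0)
trials = 2000
# Count how often a single run, and the median of 31 runs, land within 10%.
single_ok = sum(
    abs(noisy_estimate(100.0, 0.7, rng) - 100.0) <= 10.0 for _ in range(trials)
)
median_ok = sum(
    abs(median_of_copies(100.0, 31, 0.7, rng) - 100.0) <= 10.0 for _ in range(trials)
)
print(single_ok / trials, median_ok / trials)
```

The median is wrong only if at least half of the copies fail in the same direction, which is why the failure probability drops exponentially in the number of copies.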

Our algorithm can be directly compared to the space algorithm by McGregor and Vorotnikova [13] (footnote 2: we use $\tilde{O}$ notation to hide $\log$ and $\epsilon$ factors). It takes the same number of passes over the stream and has the same approximation guarantees. We believe that the space of our algorithm is tight; however, the best known lower bound is currently  [13].

In [3], Bera and Chakrabarti present a different 4-cycle counting algorithm which takes four passes and uses space . Note that the space used by our algorithm is as good or better when . McGregor and Vorotnikova [13] also present a 2-pass space algorithm which distinguishes between graphs with 0 and $T$ 4-cycles.

2 Algorithm and Analysis

2.1 Notation

A wedge is a path of length 2. For wedge we call vertices and the endpoints of the wedge and vertex the center.

We use to denote the set of neighbors of vertex . Consider sets of vertices and . Edges between these two sets form a complete bipartite graph, which we call a diamond with endpoints and . We say that wedge is a part of diamond if they have the same endpoints. Note that a diamond with endpoints and consists of wedges and involves 4-cycles.
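As a sanity check on the diamond definition, the following sketch builds a diamond whose two endpoints share $k$ common neighbors (so it consists of $k$ wedges) and counts its 4-cycles by brute force. The symbol $k$ and the helper names `diamond_edges` and `count_four_cycles` are our own illustrative choices; the check confirms the standard fact that such a complete bipartite graph contains $\binom{k}{2}$ 4-cycles.

```python
from itertools import combinations
from math import comb


def diamond_edges(k):
    """Edges of a diamond: endpoints 'u' and 'v' joined to k common neighbors."""
    centers = [f"c{i}" for i in range(k)]
    return {frozenset((end, c)) for end in ("u", "v") for c in centers}


def count_four_cycles(edges):
    """Brute-force 4-cycle count for a simple undirected graph.

    A 4-cycle is determined by a pair of opposite corners together with two
    of their common neighbors. Every cycle has two pairs of opposite corners,
    so the sum over vertex pairs counts each cycle twice.
    """
    adj = {}
    for e in edges:
        a, b = tuple(e)
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    total = 0
    for a, b in combinations(sorted(adj), 2):
        total += comb(len(adj[a] & adj[b]), 2)
    return total // 2


print(count_four_cycles(diamond_edges(5)), comb(5, 2))  # -> 10 10
```

Each wedge of the diamond pairs with each of the other $k-1$ wedges to form a 4-cycle, which is another way to see the $\binom{k}{2}$ total.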

Throughout the paper we use , , and to denote the number of 4-cycles involving edge , wedge , or involved in diamond respectively. For any quantity , we use to denote its estimate.

In Section 2.3, we define heavy/light edges, wedges, and diamonds, where “heavy” roughly corresponds to “involved in many 4-cycles” and “light” to “involved in few 4-cycles”. Note that these are defined by the algorithm and depend on the collected samples of vertices and edges. We define to be the number of 4-cycles with at least one heavy wedge and as the number of 4-cycles with no heavy wedges and at most one heavy edge.

2.2 Main Idea

The most basic algorithm approximating the number of 4-cycles in a graph is as follows:

Pass 1:

Sample edges with probability , call set .

Pass 2:

For each edge in the stream, let be the number of 3-paths with all edges in that completes to a 4-cycle.

Return:

.

In expectation, the value returned by this algorithm is $T$. However, because some edges or wedges in the graph can be involved in a large number of 4-cycles, the variance of this estimator is large. If an edge or wedge participates in many 4-cycles, call it “bad”. In this paper, we show that it is possible to identify such bad edges and wedges and take care of them separately, leading to an accurate approximation.
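The basic two-pass estimator described above can be sketched as follows. This is a minimal offline simulation, assuming the stream is a list of undirected edges of a simple graph that can be read twice. The function name, the parameter `p`, and the normalization $1/(4p^3)$ are our reconstruction: each 4-cycle is discovered once per edge (four times in total), and each time the other three edges must all have been sampled, which happens with probability $p^3$.

```python
import random
from collections import defaultdict


def basic_four_cycle_estimate(stream, p, seed=0):
    """Two-pass estimate of the number of 4-cycles (illustrative sketch)."""
    rng = random.Random(seed)

    # Pass 1: keep each edge independently with probability p.
    adj = defaultdict(set)
    for u, v in stream:
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)

    # Pass 2: for each stream edge (u, v), count sampled 3-paths u-a-b-v
    # that the edge closes into a 4-cycle.
    total = 0
    for u, v in stream:
        for a in adj[u]:
            if a == v:
                continue
            for b in adj[a]:
                if b != u and b != v and v in adj[b]:
                    total += 1

    # Every 4-cycle is counted once per edge, each time with probability p^3.
    return total / (4 * p ** 3)


square = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(basic_four_cycle_estimate(square, 1.0))  # -> 1.0
```

With `p = 1.0` every edge is kept and the count is exact; for smaller `p` the output is unbiased but can fluctuate heavily when a few edges or wedges lie on many 4-cycles, which is the variance problem discussed above.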

We observe that if wedge is bad, then it is a part of a large diamond with endpoints and . If we sample vertices in and collect all incident edges, we will detect the diamond and accurately estimate its size. Using this method, we approximate the total number of cycles with bad wedges.
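The detection idea in the previous paragraph can be sketched as below. This is an offline illustration rather than a streaming implementation: it builds the full adjacency structure for convenience, whereas the algorithm would only store edges incident to sampled vertices. The sampling rate `q` and the function name are hypothetical.

```python
import random
from collections import defaultdict
from itertools import combinations


def estimate_diamond_sizes(edges, q, seed=0):
    """Estimate, for each pair of endpoints, the number of wedges between them.

    Each vertex is sampled with probability q. All edges incident to a sampled
    vertex are kept, so every wedge centered at that vertex is visible.
    """
    rng = random.Random(seed)
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    sampled_centers = [c for c in sorted(adj) if rng.random() < q]

    seen = defaultdict(int)  # (u, v) -> number of wedges with a sampled center
    for c in sampled_centers:
        for u, v in combinations(sorted(adj[c]), 2):
            seen[(u, v)] += 1

    # A wedge is seen exactly when its center is sampled (probability q),
    # so seen / q is an unbiased estimate of the true wedge count.
    return {pair: count / q for pair, count in seen.items()}


# Diamond with endpoints 0 and 1 and five common neighbors 2..6.
diamond = [(end, c) for end in (0, 1) for c in range(2, 7)]
print(estimate_diamond_sizes(diamond, 1.0)[(0, 1)])  # -> 5.0
```

Large diamonds have many wedge centers, so a sufficiently high sampling rate hits enough of them for the scaled count to concentrate, while pairs with few common neighbors are rarely seen at all.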

We then separately approximate the number of cycles with no bad wedges and at most one bad edge. This procedure follows the same template as the arbitrary order 4-cycle counting algorithm in [13]. Sampling edges uniformly at a certain rate allows us to obtain some 3-paths which are involved in 4-cycles with no bad wedges. Additionally, sampling vertices uniformly and storing all incident edges allows us to build an oracle roughly classifying edges as good or bad. We use this oracle to compute the number of bad edges in each of the cycles we discover. Note that the oracle takes an extra pass over the stream, and thus in total our algorithm uses three passes.

2.3 Algorithm

The algorithm in this section computes estimates to and separately and then returns their sum. We later show that is an accurate approximation of .

Within the algorithm, we define heavy/light diamonds and wedges. Roughly speaking, a heavy diamond consists of wedges and a light diamond consists of wedges. A wedge is then defined as heavy or light if it is a part of a heavy or light diamond respectively.

In the third pass, we refer to the oracle which classifies edges as heavy or light. It is described separately after the main algorithm.

Pass 1:
  • Let .

  • Sample edges with probability , call set .

  • Sample vertices with probability , call set . Collect all incident edges, call set .

  • Sample vertices with probability , call set . Collect all incident edges, call set .

After Pass 1:
  • For a pair of vertices , let be the number of wedges with center in and endpoints and .

  • Define diamond with endpoints and to be heavy if and light otherwise. Let .

  • Define wedge with endpoints and to be heavy if it is part of a heavy diamond and light otherwise. Let .

  • Find all pairs of vertices which are endpoints of heavy diamonds/wedges.

  • Let , where is a heavy diamond.

Pass 2:

For every edge in the stream:

  • Check if completes any 3 edges from to a 4-cycle (call it ). Check whether has a heavy wedge; if not, store .

Pass 3:
  • For all edges involved in cycles stored in pass 2, use to classify them as heavy or light.

  • Let be the number of pairs s.t. has no heavy edges.

  • Let be the number of pairs s.t. is heavy and the other 3 edges in are light.

  • Let

Return:

Oracle.

Below, we describe the oracle which classifies edges as heavy or light. Roughly speaking, heavy edges are involved in 4-cycles and light edges in .

Suppose that we need to classify edge as heavy or light. We then look at edges sharing a vertex with . In the post-processing of the first pass, we determined all pairs of vertices which are endpoints of heavy diamonds/wedges. Thus, for wedge we can refer to that list to check whether it is heavy or not. If it is heavy, we also get an estimate of the number of 4-cycles it is involved in and thus contributes to . Separately, we approximate the total number of 4-cycles on which involve two light wedges and .

oracle(, , ):
  • Let and .

  • For wedges of the form , where : if is heavy, “exclude” from . (Footnote 3: When we talk about “excluding” edges from , we need to “exclude” different sets of edges for different instances of the oracle. In practice, each instance marks those edges and ignores them; other instances may still use them.)

  • For each edge in the stream, s.t. shares a vertex with :

    • Look up whether is heavy.

    • If heavy, .

    • If light and , let be the number of vertices , such that is a 4-cycle. .

  • Let .

  • Return:

2.4 Correctness

2.4.1 Oracle

In Lemma 2, we show that light edges are involved in at most 4-cycles and heavy edges are involved in at least cycles. Note that the oracle relies on the procedure estimating the number of 4-cycles on a heavy wedge, so in the proof we refer to Lemma 3 below.

Lemma 2.

With high probability

a.

implies

b.

implies

Proof.

Let be the number of 4-cycles on , where is a part of a heavy wedge. Let . Let and be our estimates of those two quantities.

Note that in the process of approximating , we are double-counting 4-cycles with two heavy wedges involving . However, we can show that this double-count is negligible. Let be the number of heavy diamonds which involve . Since each 4-cycle can belong to at most 2 diamonds, we are double-counting at most cycles. From Lemma 3 part (b), it follows that the number of 4-cycles in a heavy diamond is at least . Therefore, and . If is sufficiently large, then .

From Lemma 3 part (c) it follows that

Taking double-counting into account,

(1)

Recall that and let be the number of cycles with no heavy wedges if , and otherwise. Let and note that .

If , then , and from the Chernoff bound it follows that

(2)

where the first inequality follows from the fact that for all . Similarly, if , then

(3)

We first prove the contrapositive of (a). Assume . Then from Eq. 1 (taking ) and Eq. 3,

Similarly, we prove the contrapositive of (b) from Eq. 1 and 2. If , then

2.4.2 Estimating

In Lemma 3, we prove that we can distinguish between large and small diamonds and estimate the number of 4-cycles in a heavy diamond or on a heavy wedge.

Lemma 3.

Let be the number of wedges in diamond , and let be the number of those wedges with center in . Recall that is the number of 4-cycles in diamond , and is the number of 4-cycles involving wedge . Then with high probability,

a.

If diamond is heavy (), then

b.

If diamond is light (), then

c.

If wedge is heavy, then

d.

If diamond is heavy, then

Proof.

Observe that . By an application of the Chernoff bound, if , then

proving (a). Statement (b) is proved similarly.

Note that the number of 4-cycles in a diamond grows as the square of the number of wedges. Therefore, to get a -approximation to , we need to estimate to a higher accuracy. If , from Chernoff it follows that

Recall that if a diamond consists of wedges, then the number of 4-cycles on each of those wedges is . Therefore, statement (c) follows since . Statement (d) follows since and . ∎
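The remark that the wedge count must be estimated to a higher accuracy can be made concrete with a short calculation. The notation here is our own assumption: $k$ is the true number of wedges in the diamond, $\hat{k}$ is its estimate, and $\gamma \in (0, 1]$ is the relative error of $\hat{k}$. The constants are illustrative and need not match those used in the proof.

```latex
% Assumed notation: k = true number of wedges in the diamond,
% \hat{k} = its estimate, \gamma \in (0, 1] = relative error of \hat{k}.
\[
  \binom{k}{2} = \frac{k(k-1)}{2},
  \qquad
  |\hat{k} - k| \le \gamma k
  \;\Longrightarrow\;
  (1-\gamma)^2 k^2 \le \hat{k}^2 \le (1+\gamma)^2 k^2 .
\]
\[
  (1+\gamma)^2 = 1 + 2\gamma + \gamma^2 \le 1 + 3\gamma,
  \qquad
  (1-\gamma)^2 \ge 1 - 2\gamma \ge 1 - 3\gamma .
\]
% Hence \gamma = \epsilon/3 already gives \hat{k}^2 \in (1 \pm \epsilon) k^2.
```

In other words, squaring roughly doubles the relative error, so a constant-factor tighter estimate of the wedge count suffices for the quadratic term of the cycle count.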

Lemma 4.

With high probability, .

Proof.

First, note that our algorithm double-counts 4-cycles which are involved in two heavy diamonds. As was mentioned before, the number of 4-cycles in a heavy diamond is at least , and thus the number of heavy diamonds is at most . Since two diamonds can have at most one cycle in common, we are double-counting at most cycles. The rest of the proof follows from Lemma 3 part (d). ∎

2.4.3 Estimating

Lemma 5.

With constant probability, .

Proof.

Let be the number of 4-cycles in with heavy edges. Let and . Note that and .

We now show that with constant probability, and .

By an application of the Chebyshev bound,

as long as . We now give a bound on the variance of . Let be the set of 3-paths which are involved in 4-cycles in . Let be 1 if all 3 edges of path were sampled and 0 otherwise. Then

(4)
(5)

Equation 4 follows from the fact that any path intersects at most other paths in at one edge and at most paths at two edges (from Lemmas 2 and 3). Equation 5 follows from our definition of .

Proving follows along the same lines. ∎

2.4.4 Estimating

We refer to one of the lemmas in [13], which bounds the number of 4-cycles with at most one edge which is involved in a lot of cycles.

Lemma 6 (McGregor and Vorotnikova [13]).

We call an edge “bad” if it is contained in at least 4-cycles, and “good” otherwise. There are at least cycles containing no more than one bad edge.

Applying this lemma with , we get that the number of cycles with at most one bad edge is at least . We can now prove the main lemma.

Lemma 7.

With constant probability, .

Proof.

Let be the number of cycles with at least one heavy wedge and at most one heavy edge. Note that . Since good edges (with ) are classified as light w.h.p.,

where the first inequality follows from Lemma 6. The rest of the proof follows from Lemmas 4 and 5. ∎

2.5 Space analysis

Sets , , and all have the same expected size . The expected number of cycles stored in pass 2 is . Finally, the extra space used by each instance of is in expectation , since it keeps track of a constant number of counters and “excluded” edges, corresponding to heavy wedges involving among the input of the instance. Therefore, the total space used by the algorithm is .

References

  • [1] Kook Jin Ahn, Sudipto Guha, and Andrew McGregor. Graph sketches: sparsification, spanners, and subgraphs. In Proceedings of the 31st ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS), pages 5–14, 2012.
  • [2] Ziv Bar-Yossef, Ravi Kumar, and D. Sivakumar. Reductions in streaming algorithms, with an application to counting triangles in graphs. In Proceedings of the 13th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 623–632, 2002.
  • [3] Suman K. Bera and Amit Chakrabarti. Towards Tighter Space Bounds for Counting Triangles and Other Substructures in Graph Streams. In Heribert Vollmer and Brigitte Vallée, editors, 34th Symposium on Theoretical Aspects of Computer Science (STACS 2017), volume 66 of Leibniz International Proceedings in Informatics (LIPIcs), pages 11:1–11:14, Dagstuhl, Germany, 2017. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik.
  • [4] Vladimir Braverman, Rafail Ostrovsky, and Dan Vilenchik. How hard is counting triangles in the streaming model? In Automata, Languages, and Programming - 40th International Colloquium, ICALP 2013, Riga, Latvia, July 8-12, 2013, Proceedings, Part I, pages 244–254, 2013.
  • [5] Laurent Bulteau, Vincent Froese, Konstantin Kutzkov, and Rasmus Pagh. Triangle counting in dynamic graph streams. Algorithmica, 76(1):259–278, Sep 2016.
  • [6] Luciana S. Buriol, Gereon Frahling, Stefano Leonardi, Alberto Marchetti-Spaccamela, and Christian Sohler. Counting triangles in data streams. In Proceedings of the 29th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS), pages 253–262, 2006.
  • [7] Graham Cormode and Hossein Jowhari. A second look at counting triangles in graph streams. Theor. Comput. Sci., 552:44–51, 2014.
  • [8] Hossein Jowhari and Mohammad Ghodsi. New streaming algorithms for counting triangles in graphs. In Proceedings of the 11th International Computing and Combinatorics Conference (COCOON), pages 710–716, 2005.
  • [9] John Kallaugher, Andrew McGregor, Eric Price, and Sofya Vorotnikova. The complexity of counting cycles in the adjacency list streaming model. In Proceedings of the 38th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2019, Amsterdam, The Netherlands, June 30 - July 5, 2019, pages 119–133, 2019.
  • [10] John Kallaugher and Eric Price. A hybrid sampling scheme for triangle counting. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’17, pages 1778–1797, Philadelphia, PA, USA, 2017. Society for Industrial and Applied Mathematics.
  • [11] Daniel M. Kane, Kurt Mehlhorn, Thomas Sauerwald, and He Sun. Counting arbitrary subgraphs in data streams. In Proceedings of the 39th International Colloquium Conference on Automata, Languages, and Programming - Volume Part II, ICALP’12, pages 598–609, Berlin, Heidelberg, 2012. Springer-Verlag.
  • [12] Madhusudan Manjunath, Kurt Mehlhorn, Konstantinos Panagiotou, and He Sun. Approximate counting of cycles in streams. In Algorithms - ESA 2011 - 19th Annual European Symposium, Saarbrücken, Germany, September 5-9, 2011. Proceedings, pages 677–688, 2011.
  • [13] Andrew McGregor and Sofya Vorotnikova. Triangle and four cycle counting in the data stream model. In Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS’20, pages 445–456, New York, NY, USA, 2020. Association for Computing Machinery.
  • [14] Andrew McGregor, Sofya Vorotnikova, and Hoa T. Vu. Better algorithms for counting triangles in data streams. In Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2016, San Francisco, CA, USA, June 26 - July 01, 2016, pages 401–411, 2016.
  • [15] A. Pavan, Kanat Tangwongsan, Srikanta Tirthapura, and Kun-Lung Wu. Counting and sampling triangles from a graph stream. PVLDB, 6(14):1870–1881, 2013.
  • [16] Seyed-Vahid Sanei-Mehri, Yu Zhang, Ahmet Erdem Sariyüce, and Srikanta Tirthapura. Fleet: Butterfly estimation from a bipartite graph stream. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM ’19, pages 1201–1210, New York, NY, USA, 2019. Association for Computing Machinery.