1 Introduction
Subgraph counting is a fundamental graph problem and an important primitive in massive graph analysis. It has many applications in data mining and in analyzing the structure of large networks. This problem has also received a lot of attention in the streaming community, with the main focus on counting triangles [1, 2, 4, 6, 8, 14, 7, 5, 15, 3, 9, 13]. Several papers considered counting larger cycles and cliques [3, 9, 12], and a few studied arbitrary subgraphs of constant size [3, 10, 11]. There is also work on counting 4-cycles in the case when the underlying graph is bipartite [16]. Since a 4-cycle is also a 2-by-2 biclique, it is the most basic motif in bipartite graphs and plays essentially the same role as a triangle does in general graphs.
In this paper, we concentrate on counting 4-cycles in the arbitrary order insertion-only streaming model, improving over the state-of-the-art algorithm presented by McGregor and Vorotnikova [13].
1.1 Our Result and Previous Work
Throughout this paper, we use to denote the number of vertices in the graph, to denote the number of edges, and for the number of 4-cycles. Note that our algorithm is parameterized in terms of , which is a convention adopted in the literature. In practice, the quantities in the algorithm would be initialized based on a promised lower bound on .
Our result is as follows.
Theorem 1.
There exists an space algorithm that takes three passes over an arbitrary order stream and returns a multiplicative approximation to the number of 4-cycles in the graph with probability at least . By running copies of the algorithm in parallel and taking the median of their outputs, we can increase the success probability to , where .
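The median step mentioned in the theorem can be sketched as follows. This is only an illustration of the standard amplification trick, not the paper's implementation: `median_of_copies`, `toy_estimator`, and the failure behavior of the toy estimator are all hypothetical names and assumptions introduced here.

```python
import random
import statistics


def median_of_copies(run_once, copies, rng):
    """Run `copies` independent instances of an estimator and return the median output."""
    outputs = [run_once(rng) for _ in range(copies)]
    return statistics.median(outputs)


def toy_estimator(rng, truth=1000.0):
    """Hypothetical stand-in for one run: within 10% of `truth` with probability 0.7,
    and far off (0 or 10x) otherwise."""
    if rng.random() < 0.7:
        return truth * rng.uniform(0.9, 1.1)
    return truth * rng.choice([0.0, 10.0])


if __name__ == "__main__":
    rng = random.Random(0)
    print("single run :", toy_estimator(rng))
    print("median of 31:", median_of_copies(toy_estimator, 31, rng))
```

The median is far from the truth only if at least half of the copies fail, which is why a logarithmic number of copies suffices once a single copy succeeds with probability bounded above one half.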
Our algorithm can be directly compared to the space algorithm by McGregor and Vorotnikova [13]. (Footnote 2: We use notation to hide and factors.) It takes the same number of passes over the stream and has the same approximation guarantees. We believe that the space of our algorithm is tight; however, the best known lower bound is currently [13].
In [3], Bera and Chakrabarti present a different 4-cycle counting algorithm which takes four passes and uses space . Note that the space used by our algorithm is as good or better when . McGregor and Vorotnikova [13] also present a 2-pass space algorithm which distinguishes between graphs with 0 and 4-cycles.
2 Algorithm and Analysis
2.1 Notation
A wedge is a path of length 2. For wedge we call vertices and the endpoints of the wedge and vertex the center.
We use to denote the set of neighbors of vertex . Consider sets of vertices and . Edges between these two sets form a complete bipartite graph, which we call a diamond with endpoints and . We say that wedge is a part of diamond if they have the same endpoints. Note that a diamond with endpoints and consists of wedges and involves 4-cycles.
Throughout the paper we use , , and to denote the number of 4-cycles involving edge , wedge , or involved in diamond respectively. For any quantity , we use to denote its estimate.
In Section 2.3, we define heavy/light edges, wedges, and diamonds, where “heavy” roughly corresponds to “involved in many 4-cycles” and “light” to “involved in few 4-cycles”. Note that these are defined by the algorithm and depend on the collected samples of vertices and edges. We define to be the number of 4-cycles with at least one heavy wedge and as the number of 4-cycles with no heavy wedges and at most one heavy edge.
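To make the wedge/diamond vocabulary concrete, here is a small offline sketch (not a streaming algorithm) that groups wedges by their endpoints and counts 4-cycles from the resulting diamonds. It relies on the standard fact that a diamond with k wedges contains k(k-1)/2 four-cycles, and that every 4-cycle lies in exactly two diamonds, one per pair of opposite vertices. The function names are illustrative, and the input is assumed to be a simple undirected graph.

```python
from collections import defaultdict
from itertools import combinations
from math import comb


def wedge_counts(edges):
    """Map each endpoint pair (u, v), u < v, to the number of wedges u - c - v."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    counts = defaultdict(int)
    for center, nbrs in adj.items():
        # Every pair of neighbors of `center` forms one wedge centered there.
        for u, v in combinations(sorted(nbrs), 2):
            counts[(u, v)] += 1
    return counts


def four_cycles_from_diamonds(edges):
    """Exact 4-cycle count: sum C(k, 2) over all diamonds, halved because
    each 4-cycle is seen from both of its diagonals."""
    return sum(comb(k, 2) for k in wedge_counts(edges).values()) // 2


if __name__ == "__main__":
    # Complete bipartite graph K_{2,3}: one diamond with endpoints u, v and 3 wedges.
    k23 = [("u", x) for x in "abc"] + [("v", x) for x in "abc"]
    print(wedge_counts(k23)[("u", "v")])   # wedges in the u-v diamond
    print(four_cycles_from_diamonds(k23))  # 4-cycles in the whole graph
```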
2.2 Main Idea
The most basic algorithm approximating the number of 4-cycles in a graph is as follows:
 Pass 1:

Sample edges with probability , call set .
 Pass 2:

For each edge in the stream, let be the number of 3-paths with all edges in that completes to a 4-cycle.
 Return:

.
In expectation, the value returned by this algorithm is . However, due to the fact that some edges or wedges in the graph can be involved in a large number of 4-cycles, the variance of this estimator is large. If an edge or wedge participates in many 4-cycles, call it “bad”. In this paper, we show that it is possible to identify such bad edges and wedges and take care of them separately, leading to an accurate approximation.
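The basic two-pass estimator described above can be sketched as follows. The normalization is an assumption made for this sketch, since the returned expression did not survive in the text: each (cycle, edge) pair is discovered exactly when the other three edges of the cycle were sampled, which happens with probability p^3, and every cycle has four such pairs, so the raw count is divided by 4p^3. The names `count_completed_3paths` and `basic_four_cycle_estimate` are illustrative.

```python
import random
from collections import defaultdict


def count_completed_3paths(sample_adj, u, v):
    """Count 3-paths u - a - b - v inside the sample that edge (u, v) closes into a 4-cycle."""
    total = 0
    for a in sample_adj.get(u, ()):
        if a == v:
            continue
        for b in sample_adj.get(a, ()):
            if b != u and b != v and v in sample_adj.get(b, ()):
                total += 1
    return total


def basic_four_cycle_estimate(stream, p, rng):
    """Two-pass estimate of the number of 4-cycles.
    `stream` must be re-iterable (e.g. a list of edges of a simple graph)."""
    # Pass 1: keep each edge independently with probability p.
    sample_adj = defaultdict(set)
    for u, v in stream:
        if rng.random() < p:
            sample_adj[u].add(v)
            sample_adj[v].add(u)
    # Pass 2: for each streamed edge, count sampled 3-paths it completes.
    total = sum(count_completed_3paths(sample_adj, u, v) for u, v in stream)
    # Assumed normalization: 4 edges per cycle, each found with probability p**3.
    return total / (4 * p ** 3)


if __name__ == "__main__":
    k23 = [("u", x) for x in "abc"] + [("v", x) for x in "abc"]
    # With p = 1 every edge is kept, so the estimate equals the exact count.
    print(basic_four_cycle_estimate(k23, 1.0, random.Random(0)))
    # With p < 1 the output is unbiased but noisy, which is the variance issue discussed above.
    print(basic_four_cycle_estimate(k23, 0.8, random.Random(0)))
```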
We observe that if wedge is bad, then it is a part of a large diamond with endpoints and . If we sample vertices in and collect all incident edges, we will detect the diamond and accurately estimate its size. Using this method, we approximate the total number of cycles with bad wedges.
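The vertex-sampling idea in the paragraph above can be sketched as follows. This is a simplified illustration under stated assumptions: the vertex set is taken as known up front (a streaming implementation would more likely sample vertices via a hash function), the sampling rate `p` is a free parameter, and the estimate simply rescales the observed wedge count by 1/p. The paper's actual rates and heaviness thresholds are not reproduced here.

```python
import random
from collections import defaultdict
from itertools import combinations


def estimate_diamond_sizes(edges, p, rng):
    """Estimate, for each endpoint pair, the number of wedges in its diamond
    using only edges incident to a random vertex sample."""
    vertices = sorted({x for e in edges for x in e})
    sampled = {x for x in vertices if rng.random() < p}

    # Store only the edges incident to sampled vertices.
    adj = defaultdict(set)
    for a, b in edges:
        if a in sampled:
            adj[a].add(b)
        if b in sampled:
            adj[b].add(a)

    # A wedge u - c - v is observed exactly when its center c was sampled.
    observed = defaultdict(int)
    for center in sampled:
        for u, v in combinations(sorted(adj[center]), 2):
            observed[(u, v)] += 1

    # Each wedge is observed with probability p, so rescale by 1 / p.
    return {pair: count / p for pair, count in observed.items()}


if __name__ == "__main__":
    # A diamond with endpoints "u", "v" and 200 wedges.
    diamond = [(e, c) for c in range(200) for e in ("u", "v")]
    est = estimate_diamond_sizes(diamond, 0.25, random.Random(0))
    print(est.get(("u", "v")))  # a noisy estimate of 200
```

Large diamonds have many centers, so a modest vertex sample already hits enough of them to estimate their size well; small diamonds are the ones this procedure cannot resolve, which is why they are handled by the edge-sampling branch instead.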
We then separately approximate the number of cycles with no bad wedges and at most one bad edge. This procedure follows the same template as the arbitrary order 4-cycle counting algorithm in [13]. Sampling edges uniformly at a certain rate allows us to obtain some 3-paths which are involved in 4-cycles with no bad wedges. Additionally, sampling vertices uniformly and storing all incident edges allows us to build an oracle roughly classifying edges as good or bad. We use this oracle to compute the number of bad edges in each of the cycles we discover. Note that the oracle takes an extra pass over the stream, and thus in total our algorithm uses three passes.
2.3 Algorithm
The algorithm in this section computes estimates to and separately and then returns their sum. We later show that is an accurate approximation of .
Within the algorithm, we define heavy/light diamonds and wedges. Roughly speaking, a heavy diamond consists of wedges and a light diamond consists of wedges. A wedge is then defined as heavy or light if it is a part of a heavy or light diamond respectively.
In the third pass, we refer to the oracle which classifies edges as heavy or light. It is described separately after the main algorithm.
 Pass 1:


Let .

Sample edges with probability , call set .

Sample vertices with probability , call set . Collect all incident edges, call set .

Sample vertices with probability , call set . Collect all incident edges, call set .

 After Pass 1:


For a pair of vertices , let be the number of wedges with center in and endpoints and .

Define diamond with endpoints and to be heavy if and light otherwise. Let .

Define wedge with endpoints and to be heavy if it is part of a heavy diamond and light otherwise. Let .

Find all pairs of vertices which are endpoints of heavy diamonds/wedges.

Let , where is a heavy diamond.

 Pass 2:

For every edge in the stream:

Check if completes any 3 edges from to a 4-cycle (call it ). Check whether has a heavy wedge; if not, store .

 Pass 3:


For all edges involved in cycles stored in pass 2, use to classify them as heavy or light.

Let be the number of pairs s.t. has no heavy edges.

Let be the number of pairs s.t. is heavy and the other 3 edges in are light.

Let

 Return:

Oracle.
Below, we describe the oracle which classifies edges as heavy or light. Roughly speaking, heavy edges are involved in 4-cycles and light edges in .
Suppose that we need to classify edge as heavy or light. We then look at edges sharing a vertex with . In the postprocessing of the first pass, we determined all pairs of vertices which are endpoints of heavy diamonds/wedges. Thus, for wedge we can refer to that list to check whether it is heavy or not. If it is heavy, we also get an estimate of the number of 4-cycles it is involved in and thus contributes to . Separately, we approximate the total number of 4-cycles on which involve two light wedges and .
 oracle(, , ):


Let and .

For wedges of the form , where : if is heavy, “exclude” from . (Footnote 3: When we talk about “excluding” edges from , we need to “exclude” different sets of edges for different instances of the oracle. In practice, for each instance mark those edges and ignore them. However, they might be used by other instances.)

For each edge in the stream, s.t. shares a vertex with :

Look up whether is heavy.

If heavy, .

If light and , let be the number of vertices , such that is a 4-cycle. .


Let .

Return:

2.4 Correctness
2.4.1 Oracle
In Lemma 2, we show that light edges are involved in at most 4-cycles and heavy edges are involved in at least cycles. Note that the oracle relies on the procedure estimating the number of 4-cycles on a heavy wedge, so in the proof we refer to Lemma 3 below.
Lemma 2.
With high probability
 a.

implies
 b.

implies
Proof.
Let be the number of 4-cycles on , where is a part of a heavy wedge. Let . Let and be our estimates of those two quantities.
Note that in the process of approximating , we are double-counting 4-cycles with two heavy wedges involving . However, we can show that this double-count is negligible. Let be the number of heavy diamonds which involve . Since each 4-cycle can belong to at most 2 diamonds, we are double-counting at most cycles. From Lemma 3 part (b), it follows that the number of 4-cycles in a heavy diamond is at least . Therefore, and . If is sufficiently large, then .
Recall that and let be the number of cycles with no heavy wedges if , and otherwise. Let and note that .
If , then , and from the Chernoff bound it follows that
(2) 
where the first inequality follows from the fact that for all . Similarly, if , then
(3) 
We first prove the contrapositive of (a). Assume . Then from Eq. 1 (taking ) and Eq. 3,
Similarly, we prove the contrapositive of (b) from Eq. 1 and 2. If , then
∎
2.4.2 Estimating
In Lemma 3, we prove that we can distinguish between large and small diamonds and estimate the number of 4-cycles in a heavy diamond or on a heavy wedge.
Lemma 3.
Let be the number of wedges in diamond , and let be the number of those wedges with center in . Recall that is the number of 4-cycles in diamond , and is the number of 4-cycles involving wedge . Then with high probability,
 a.

If diamond is heavy (), then
 b.

If diamond is light (), then
 c.

If wedge is heavy, then
 d.

If diamond is heavy, then
Proof.
Observe that . By an application of the Chernoff bound, if , then
proving (a). Statement (b) is proved similarly.
Note that the number of 4-cycles in a diamond grows as the square of the number of wedges. Therefore, to get a approximation to , we need to estimate to a higher accuracy. If , from Chernoff it follows that
Recall that if a diamond consists of wedges, then the number of 4-cycles on each of those wedges is . Therefore, statement (c) follows since . Statement (d) follows since and . ∎
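The remark in the proof above, that the wedge count must be estimated more accurately than the target accuracy for the cycle count, can be illustrated with a short worked bound. The symbols here are assumptions introduced for this illustration, since the original notation did not survive: $k \ge 2$ is the true number of wedges in the diamond, $\hat{k} = (1 \pm \delta)k$ is its estimate, and $\varepsilon \le 1$ is the target relative error for the number of 4-cycles $\binom{k}{2}$. The constants are not necessarily those used in the paper.

```latex
\[
  \frac{\binom{\hat{k}}{2}}{\binom{k}{2}}
  = \frac{\hat{k}(\hat{k}-1)}{k(k-1)}
  = (1 \pm \delta)\left(1 \pm \delta\,\frac{k}{k-1}\right),
  \qquad \frac{k}{k-1} \le 2 \ \text{ for } k \ge 2 .
\]
\[
  \text{Hence}\quad
  (1-\delta)(1-2\delta) \;\le\;
  \frac{\binom{\hat{k}}{2}}{\binom{k}{2}}
  \;\le\; (1+\delta)(1+2\delta) = 1 + 3\delta + 2\delta^{2}.
\]
\[
  \text{Taking } \delta \le \varepsilon/4 \text{ with } \varepsilon \le 1:\qquad
  1 + 3\delta + 2\delta^{2} \le 1 + \tfrac{3}{4}\varepsilon + \tfrac{1}{8}\varepsilon^{2} \le 1 + \varepsilon,
  \qquad
  (1-\delta)(1-2\delta) \ge 1 - 3\delta \ge 1 - \varepsilon .
\]
```

So a relative error of roughly a quarter of the target on the wedge count is enough to give the target relative error on the number of 4-cycles in the diamond.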
Lemma 4.
With high probability, .
Proof.
First, note that our algorithm double-counts 4-cycles which are involved in two heavy diamonds. As was mentioned before, the number of 4-cycles in a heavy diamond is at least , and thus the number of heavy diamonds is at most . Since two diamonds can have at most one cycle in common, we are double-counting at most cycles. The rest of the proof follows from Lemma 3 part (d). ∎
2.4.3 Estimating
Lemma 5.
With constant probability, .
Proof.
Let be the number of 4-cycles in with heavy edges. Let and . Note that and .
We now show that with constant probability, and .
By an application of the Chebyshev bound,
as long as . We now give a bound on the variance of . Let be the set of 3-paths which are involved in 4-cycles in . Let be 1 if all 3 edges of path were sampled and 0 otherwise. Then
(4)  
(5) 
Equation 4 follows from the fact that any path intersects at most other paths in at one edge and at most paths at two edges (from Lemmas 2 and 3). Equation 5 follows from our definition of .
Proving follows along the same lines. ∎
2.4.4 Estimating
We refer to one of the lemmas in [13], which bounds the number of 4-cycles with at most one edge which is involved in a lot of cycles.
Lemma 6 (McGregor and Vorotnikova [13]).
We call an edge “bad” if it is contained in at least 4-cycles, and “good” otherwise. There are at least cycles containing no more than one bad edge.
Applying this lemma with , we get that the number of cycles with at most one bad edge is at least . We can now prove the main lemma.
Lemma 7.
With constant probability, .
2.5 Space analysis
Sets , , and all have the same expected size . The expected number of cycles stored in pass 2 is . Finally, the extra space used by each instance of is in expectation , since it keeps track of a constant number of counters and “excluded” edges, corresponding to heavy wedges involving among the input of the instance. Therefore, the total space used by the algorithm is .
References
 [1] Kook Jin Ahn, Sudipto Guha, and Andrew McGregor. Graph sketches: sparsification, spanners, and subgraphs. In Proceedings of the 31st ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS), pages 5–14, 2012.
 [2] Ziv Bar-Yossef, Ravi Kumar, and D. Sivakumar. Reductions in streaming algorithms, with an application to counting triangles in graphs. In Proceedings of the 13th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 623–632, 2002.
 [3] Suman K. Bera and Amit Chakrabarti. Towards Tighter Space Bounds for Counting Triangles and Other Substructures in Graph Streams. In Heribert Vollmer and Brigitte Vallée, editors, 34th Symposium on Theoretical Aspects of Computer Science (STACS 2017), volume 66 of Leibniz International Proceedings in Informatics (LIPIcs), pages 11:1–11:14, Dagstuhl, Germany, 2017. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik.
 [4] Vladimir Braverman, Rafail Ostrovsky, and Dan Vilenchik. How hard is counting triangles in the streaming model? In Automata, Languages, and Programming - 40th International Colloquium, ICALP 2013, Riga, Latvia, July 8-12, 2013, Proceedings, Part I, pages 244–254, 2013.
 [5] Laurent Bulteau, Vincent Froese, Konstantin Kutzkov, and Rasmus Pagh. Triangle counting in dynamic graph streams. Algorithmica, 76(1):259–278, Sep 2016.
 [6] Luciana S. Buriol, Gereon Frahling, Stefano Leonardi, Alberto Marchetti-Spaccamela, and Christian Sohler. Counting triangles in data streams. In Proceedings of the 29th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS), pages 253–262, 2006.
 [7] Graham Cormode and Hossein Jowhari. A second look at counting triangles in graph streams. Theor. Comput. Sci., 552:44–51, 2014.
 [8] Hossein Jowhari and Mohammad Ghodsi. New streaming algorithms for counting triangles in graphs. In Proceedings of the 11th International Computing and Combinatorics Conference (COCOON), pages 710–716, 2005.
 [9] John Kallaugher, Andrew McGregor, Eric Price, and Sofya Vorotnikova. The complexity of counting cycles in the adjacency list streaming model. In Proceedings of the 38th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2019, Amsterdam, The Netherlands, June 30 - July 5, 2019, pages 119–133, 2019.
 [10] John Kallaugher and Eric Price. A hybrid sampling scheme for triangle counting. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’17, pages 1778–1797, Philadelphia, PA, USA, 2017. Society for Industrial and Applied Mathematics.
 [11] Daniel M. Kane, Kurt Mehlhorn, Thomas Sauerwald, and He Sun. Counting arbitrary subgraphs in data streams. In Proceedings of the 39th International Colloquium Conference on Automata, Languages, and Programming - Volume Part II, ICALP’12, pages 598–609, Berlin, Heidelberg, 2012. Springer-Verlag.
 [12] Madhusudan Manjunath, Kurt Mehlhorn, Konstantinos Panagiotou, and He Sun. Approximate counting of cycles in streams. In Algorithms - ESA 2011 - 19th Annual European Symposium, Saarbrücken, Germany, September 5-9, 2011. Proceedings, pages 677–688, 2011.
 [13] Andrew McGregor and Sofya Vorotnikova. Triangle and four cycle counting in the data stream model. In Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS’20, pages 445–456, New York, NY, USA, 2020. Association for Computing Machinery.
 [14] Andrew McGregor, Sofya Vorotnikova, and Hoa T. Vu. Better algorithms for counting triangles in data streams. In Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2016, San Francisco, CA, USA, June 26 - July 01, 2016, pages 401–411, 2016.
 [15] A. Pavan, Kanat Tangwongsan, Srikanta Tirthapura, and Kun-Lung Wu. Counting and sampling triangles from a graph stream. PVLDB, 6(14):1870–1881, 2013.
 [16] Seyed-Vahid Sanei-Mehri, Yu Zhang, Ahmet Erdem Sariyüce, and Srikanta Tirthapura. Fleet: Butterfly estimation from a bipartite graph stream. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM ’19, pages 1201–1210, New York, NY, USA, 2019. Association for Computing Machinery.