We study the problem of estimating the number of edges of a simple undirected graph in the context of sublinear-time graph algorithms. The goal is to design a highly efficient randomized algorithm that, given a certain type of oracle access to an underlying graph , outputs a number that approximates the number of edges of . The first result in this direction was by Feige [Fei06], who studied this problem when the oracle is a degree oracle: the degree oracle answers queries of the form “what is the degree of a given vertex ?” The algorithm of Feige makes queries to the degree oracle, where denotes the number of edges of the input graph , and outputs a -approximation to for any constant . Moreover, Feige showed that the upper bound of is tight for a -approximation, and indeed degree queries are necessary for a -approximation. Soon thereafter, Goldreich and Ron [GR08] considered an oracle that, in addition to degree queries, can answer neighbor queries (i.e., given a vertex and an index , the oracle returns the th neighbor of according to some fixed ordering). Their algorithm uses queries and outputs a -approximation to for any constant ; they further showed that the upper bound is tight up to a factor. [Footnote 1: We use and to suppress factors.]
Since then, sublinear-time algorithms have been developed for a variety of graph problems, including estimating the number of stars [GRS11, ABG16], triangles [ELRS17], -cliques [ERS18], and arbitrary small subgraphs [AKK19], finding forbidden graph minors [KSS18, KSS19], sampling edges almost uniformly [ER18], approximating the minimum weight spanning tree [CRT05, CS09, CEF05], maximum matching [NO08, YYI09], and minimum vertex cover [PR07, MR09, NO08, YYI09, HKNO09, ORRR12]. As noted in a recent work of Beame, Har-Peled, Ramamoorthy, Rashtchian, and Sinha [BHR18], all these algorithms interact with oracles that provide only local information about the underlying graph (such as degree, neighbor, and edge existence queries where an algorithm can ask “is vertex connected to vertex ?”). [Footnote 2: One exception is that [AKK19] also uses uniform edge sampling in addition to the queries specified above.] They suggested that non-local oracle models may be natural in certain scenarios of graph parameter estimation and their non-locality may enable more efficient graph algorithms.
Along this line of investigation, [BHR18] introduced both the independent set oracle and the bipartite independent set oracle and studied the problem of estimating the number of edges under these two query models. The independent set oracle for a graph can be queried with a set of vertices and outputs whether or not is an independent set in , i.e. whether or not there exist vertices with . The bipartite independent set oracle, on the other hand, can be queried with a pair of disjoint sets and outputs whether or not is a bipartite independent set in , i.e. whether or not there exist and with .
[Footnote 3: We remark that the bipartite independent set oracle is at least as powerful, up to poly-logarithmic factors, as the independent set oracle. Consider a graph , a set of vertices, and the question of whether or not is an independent set. Letting be a uniformly random partition of , we may query the bipartite independent set oracle with and . If is an independent set, then will be a bipartite independent set; if is not an independent set, then will not be a bipartite independent set with probability at least . Thus, bipartite independent set queries can simulate an independent set query with probability at least .]
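The simulation in the footnote above can be sketched in a few lines. This is a minimal illustration, not the construction of [BHR18]: the names `make_bis_oracle` and `simulate_is_query` are hypothetical, and the oracle is backed by an explicit edge list only so that the sketch can run. It relies on one elementary fact: under a uniformly random partition, the two endpoints of any fixed edge land on opposite sides with probability exactly 1/2.

```python
import random

def make_bis_oracle(edges):
    """Hypothetical bipartite independent set oracle backed by an edge list."""
    edge_set = {frozenset(e) for e in edges}

    def bis(left, right):
        # True iff no edge has one endpoint in `left` and the other in `right`.
        return not any(frozenset((u, v)) in edge_set for u in left for v in right)

    return bis

def simulate_is_query(bis, vertices, trials, rng):
    """Answer 'is `vertices` an independent set?' using `trials` BIS queries."""
    for _ in range(trials):
        left, right = [], []
        for v in vertices:
            # Uniformly random partition: each vertex picks a side independently.
            (left if rng.random() < 0.5 else right).append(v)
        if not bis(left, right):
            return False  # a crossing edge was found: certainly not independent
    # No crossing edge seen. If an edge exists, each trial misses it with
    # probability 1/2, so this answer is wrong with probability <= 2**-trials.
    return True
```

The error is one-sided: an independent set is always reported as independent, so only the number of trials controls the failure probability.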
The problem of edge estimation using (bipartite) independent set queries resembles the classical problem of group testing, which dates back to 1943 [Dor43] and has found many recent applications in computer science [Swa85, CS90, DH00, ND00, MP04, CM05, INR10]. In group testing one needs to recover an unknown subset of a known universe by making subset queries: an algorithm can pick a subset of and ask whether contains any element from . The graph setting of the current paper is a natural generalization of group testing in which the unknown object is a binary relation over a known universe . The goal of estimating the number of edges, on the other hand, is a relaxation of group testing because it suffices to obtain an approximation of the size of the unknown binary relation, instead of recovering the relation itself exactly. The same relaxation of the original group testing setting (i.e., using subset queries to estimate the size of an unknown subset ) was studied by Ron and Tsur [RT16]. Besides group testing, edge estimation using independent set queries is motivated by connections to problems that arise in computational geometry and counting complexity, for which we refer the interested reader to [BHR18].
Perhaps surprisingly, [BHR18] gave an algorithm that returns a -approximation to the number of edges by making only queries to the bipartite independent set oracle. So in this setting, the non-locality indeed brings down the query complexity significantly for the edge estimation problem (compared to [Fei06] and [GR08], both of which use local queries only). For the independent set oracle, [BHR18] obtained an algorithm for a -approximation of the number of edges with query complexity . It was left as an open problem in [BHR18] to improve current understanding of edge estimation under independent set queries.
1.1 Our results
[Upper bound] There is a randomized algorithm that takes as input (1) an accuracy parameter , (2) a positive integer as the number of vertices and (3) access to the independent set oracle of an undirected graph with . With probability at least , the algorithm makes no more than many independent set queries and outputs a number that satisfies . [Footnote 4: The assumption of is merely for convenience; it avoids the issue that the query complexity upper bound claimed would be when . We note that whether a graph is empty or not can be determined by a single independent set query.]
The improvement over the upper bound of [BHR18] is due to a new algorithm for edge estimation that uses independent set queries (Theorem 3). Note that the query complexity achieved by the algorithm underlying Theorem 1.1 is essentially the same as [GR08]; however, the two algorithms access the graph in very different ways (independent set oracle versus degree and neighbor oracles). The proof of Theorem 1.1 requires new ideas and algorithmic techniques that are developed for independent set queries. See further discussion in Section 1.2.
[Lower bound] Let and be two positive integers with . Any randomized algorithm with access to the independent set oracle of an undirected graph must make at least queries in order to determine whether or with probability at least .
The two theorems above essentially settle the query complexity of edge estimation with independent set queries at . The upper bound brings down the overall complexity of the problem from [BHR18] to ; the worst case is when the number of edges is linear in . The lower bound, on the other hand, shows that no algorithm with independent set queries can achieve sub-polynomial query complexity. This gives an exponential separation between the power of the bipartite independent set oracle and the independent set oracle for the task of edge estimation.
1.2 Overview of techniques
We first give a high-level overview of the lower bound because some key ideas from the lower bound will be helpful in understanding the main algorithm later. For convenience we will slightly abuse the notation and to hide factors of in the discussion below. Outside of Section 1.2 they follow the convention described in footnote 1.
1.2.1 Lower bound
We describe our construction for the case when , where we seek a lower bound of . The complement case follows from a reduction to this case.
The plan is to follow Yao’s principle. We construct two distributions and over graphs with vertices so that has no more than edges with probability at least and has at least edges with probability at least . We then show that no deterministic algorithm with access to an independent set oracle can distinguish these two distributions.
A graph is generated by first sampling a uniformly random partition of vertices into and then forming the bipartite graph by including each pair with and as an edge independently with probability , where . In expectation has about edges and thus, has no more than edges with probability . On the other hand, a graph is generated by sampling a uniformly random partition of , as well as a subset by including each vertex of independently with probability . Similar to , a pair where and is included as an edge independently with probability . The main difference compared to is that every pair , where and , is included as an edge (so form a complete bipartite graph). Given that and with high probability, the number of edges in the graph is with probability at least .
We make the following two observations. The first is that a graph can be generated by first drawing a graph with partition , then sampling by including each vertex in independently with probability , and finally adding all pairs between and as edges in . This suggests that, in order for an algorithm to distinguish from , a (seemingly quite weak) necessary condition is for one of its queries to overlap with when it runs on .
For the second observation, we consider a query set of size larger than . In both and , we have with high probability and when this happens, is not an independent set with high probability, given that there are at least
pairs between and and each is included in the graph with probability . Since is not an independent set in both and with high probability, such a query conveys very little information in distinguishing the two distributions. Thus, a reasonable algorithm should only make queries of size smaller than . This intuition, that algorithms should not make queries of size larger than , will be helpful in our discussion of the algorithm later, and we will frequently refer to the quantity as the critical threshold. However, if all the queries an algorithm makes are smaller than , then queries are necessary for at least one of them to overlap with ; otherwise, given that , the probability that one of the queries overlaps with is negligible.
To formalize the above intuition and simplify the presentation of our lower bound proof, we introduce the notion of an augmented (independent set) oracle in Section 5.2. We first show that any algorithm with access to the standard independent set oracle can be simulated using an augmented oracle with the same query complexity. Then, we prove an lower bound for algorithms that distinguish and with access to an augmented oracle.
1.2.2 Upper Bound
Our goal is to obtain a -approximation algorithm for edge estimation with independent set queries, where denotes the number of edges of the input graph (Theorem 3). Theorem 1.1 follows by combining it with the algorithm of [BHR18] by running both algorithms in parallel and outputting the result of whichever finishes first.
In the sketch of the algorithm below, we assume that a rough estimate of the number of edges is given, satisfying . The goal is to refine it to obtain a -approximation of .
An Initial Plan:
At a high level, we partition the vertex set into many buckets according to their degrees: a vertex belongs to the th bucket if is between and . We refer to as the degree of bucket for convenience. Our initial plan is to develop efficient algorithms for the following two tasks:
Task 1: Develop a subroutine that, given a vertex and an index , checks if belongs to . [Footnote 5: The goal of the subroutine as described above may not sound reasonable. If lies very close to the boundary of two buckets and , determining which of the two buckets lies in may be expensive with independent set queries. This is indeed one source of errors we need to handle. We focus on the high-level ideas behind the algorithm and skip details such as errors most of the time, and briefly discuss how we analyze the algorithm in the presence of errors at the end of the sketch.]
Task 2: Use the first subroutine to estimate the size of each bucket .
We point out that this initial plan looks very similar to the framework of the algorithm of [GR08], where ideally one would like to estimate the size of each by drawing enough random samples and running the subroutine in Task 1 on each sample to obtain an estimate of . The similarity, however, stops here as we start discussing more details about how to implement the plan with an independent set oracle.
We consider Task 1 first (which is trivial with a degree oracle). Note that when , checking whether a vertex has or requires independent set queries. As a result, it requires to tell if when the degree of is at least . The bad news is that the same task becomes significantly more challenging as goes down from . This challenge leads to a major revision of our initial plan.
To gain some intuition we consider the task of distinguishing and when . [Footnote 6: For convenience we consider the case of in the sketch but the same idea works when .] Suppose we sample a set from by including each vertex with probability and then make two independent set queries on and . Let denote the event that is an independent set but is not (so contains at least one neighbor of ). Then we claim that there is a significant gap in the probability of when versus . This gap in the probability of is large enough so that one can repeat the experiment times (each time making two independent set queries) to distinguish the two cases with high probability.
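The experiment in the paragraph above can be sketched as follows. This is an illustration under stated assumptions rather than the paper's subroutine: the function names, the explicit edge-list oracle, and the choice to exclude the vertex itself from the sample are all hypothetical, and the inclusion probability is left as a free parameter `p` because its value depends on the degree threshold being tested.

```python
import random

def make_is_oracle(edges):
    """Hypothetical independent set oracle backed by an explicit edge list."""
    edge_list = [tuple(e) for e in edges]

    def is_independent(vertex_set):
        return not any(u in vertex_set and w in vertex_set for u, w in edge_list)

    return is_independent

def event_frequency(is_independent, n, v, p, repetitions, rng):
    """Fraction of repetitions where the sample is independent but sample + {v} is not."""
    hits = 0
    for _ in range(repetitions):
        # Include every vertex other than v independently with probability p.
        sample = {u for u in range(n) if u != v and rng.random() < p}
        # Two independent set queries per repetition, as in the text.
        if is_independent(sample) and not is_independent(sample | {v}):
            hits += 1
    return hits / repetitions
```

Comparing the returned frequency against a cutoff between the two expected values is what separates the two degree cases; picking that cutoff requires the quantities from the analysis and is not attempted here.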
Now we turn to the case when . In this case, the algorithm is limited to query sets of size much smaller than . Therefore, we limit to include each vertex with probability instead of . Two issues arise. The first (minor) issue is that, given that the size of is roughly , even to hit a neighbor of (with degree roughly ) one needs to draw at least many times. This suggests that queries are needed for Task 1 when the degree of the bucket we are interested in is less than .
There is, however, a more serious issue that is subtle but leads to a major revision of the initial plan. Consider the scenario where has neighbors and every neighbor has degree . If we sample by including each vertex with probability , it is very unlikely that contains a neighbor of but is at the same time independent (since when conditioning on containing a neighbor of , most likely also contains a neighbor of given the large degree of ). Because of the second issue, we change the goal of the subroutine in Task 1 from finding the right bucket of according to the degree of to finding the right bucket according to the number of neighbors of with degree at most , when . For vertices with degree at least , we still would like to partition them into buckets according to their degrees.
A Revised Plan:
By the above, we arrived at the following revised plan:
Task 0: Develop a subroutine that, given a vertex , decides if (which we refer to as high-degree vertices and denote the set by ) or (which we refer to as low-degree vertices and denote the set by ). [Footnote 7: Again we need to handle errors when is close to .] High-degree vertices are further partitioned into buckets according to their degrees. Low-degree vertices, on the other hand, are partitioned into buckets according to their degrees to low-degree vertices, denoted by for a vertex .
Task 1: Develop a subroutine that, given a vertex (or ) and an index , decides if belongs to the bucket (or ).
Task 2: Use the two subroutines to obtain -estimations of the size of each and .
Looking ahead, with -approximations and for and , one can compute
as roughly a -approximation of the number of edges . We only get a -approximation because, in the sum, edges between vertices in and edges between vertices in are counted twice but edges between and are only counted once. We discuss below how to further revise the plan to obtain a -approximation; for now let us consider Task 2.
Note that Task 2 for buckets is easy. Consider a low-degree bucket with . Unless , has negligible impact on the final estimate. When , it takes samples to get a sufficient number of vertices in . We can then get a good estimation of by running subroutines for Task 0 and 1 on these vertices. We pay queries for each vertex so the overall query complexity is
as desired. In contrast, uniformly sampling vertices and checking individually if each of them lies in is too inefficient for high-degree buckets, given that when .
Estimating the size of each high-degree bucket is where we fully take advantage of the non-locality of independent set queries. To explain the intuition, let us consider the task of distinguishing versus for some parameter where denotes the degree of the bucket . To this end, it suffices to have a procedure that can take a random set of size and answers the question “does there exist that belongs to ?” with queries. With such a procedure it suffices to draw and run the procedure on for
many times in order to obtain a good estimation of .
As discussed earlier, the revised plan ultimately leads to a -approximation algorithm with independent set queries. We achieve -approximation by revising the plan further. First we divide high-degree vertices into buckets where is related to the degree of (as usual), but the second index is related to the fraction of neighbors of in ; see Definition 3.2 for details. Task 1 is updated to develop a subroutine that can decide whether belongs to or not. Task 2 is updated to estimate the size of each (with similar ideas in the approximation of sketched above) and . Together they lead to a -approximation of the number of edges between low-degree and high-degree vertices, and ultimately a -approximation of .
Now extra care must be taken to handle errors when executing the above plan. As alerted in two footnotes, one cannot hope for a subroutine that returns the true bucket of a vertex . To simplify the presentation of the algorithm and its analysis, we introduce the notion of -degree oracles (see Definition 3.2). An -degree oracle can answer questions listed in Tasks 0 and 1 consistently and accurately up to certain errors (as captured by the notion of an -degree partition in Definition 3.2 underlying each -degree oracle). We first present an algorithm in Section 3.3 that has query access to a -degree oracle. We finish the proof of Theorem 3 by giving an efficient implementation of a -degree oracle using an independent set oracle in Section 4.
Given a positive integer , we write to denote . Similarly, for two non-negative integers , we write to denote . All graphs considered in this paper are undirected and simple (meaning that there are no parallel edges or loops), and have as their vertex set.
[Independent set oracle] Given an undirected graph , its independent set oracle is a map which satisfies that for any set of vertices , if and only if is an independent set of (i.e., for all ).
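For experimentation, the oracle in this definition can be modeled by a small class that also counts queries, since query count is the cost measure used throughout. This is a sketch: the class name, the edge-list representation, and the boolean return type are assumptions made for illustration, and a real algorithm would of course see only the answers and not the edge list.

```python
class IndependentSetOracle:
    """Hypothetical query-counting independent set oracle on vertices 0..n-1."""

    def __init__(self, n, edges):
        self.n = n
        self._edges = [(u, v) for u, v in edges]  # hidden from the algorithm
        self.queries = 0  # number of oracle calls made so far

    def query(self, vertices):
        """Return True iff `vertices` spans no edge of the hidden graph."""
        self.queries += 1
        chosen = set(vertices)
        return not any(u in chosen and v in chosen for u, v in self._edges)
```

Querying the whole vertex set, `oracle.query(range(oracle.n))`, decides with a single query whether the graph has any edge at all.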
We use to denote the degree of a vertex . Given and , we let
Note that can lie in , but since we only consider simple graphs, . For the sake of brevity, we write . We usually skip the subscript in and when the underlying graph is clear from the context.
The following simple lemma will be used multiple times. Let be an undirected graph, be a set of vertices, and be an upper bound on the number of edges in the subgraph induced by . Let be a random subset given by independently including each vertex of with probability . Then,
Proof: The expected number of edges where both vertices lie in is at most . By Markov’s inequality the probability that contains at least one edge is at most .
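Written out, the two steps of this proof read as follows. The symbol names are chosen here for illustration and may differ from the paper's: $S$ is the vertex set, $m$ the bound on the number of edges induced by $S$, $p$ the inclusion probability, $T$ the random subset, and $E(\cdot)$ the set of edges induced by a vertex set.

```latex
\mathbb{E}\bigl[\,|E(T)|\,\bigr]
  = \sum_{\{u,v\} \in E(S)} \Pr[\,u \in T \text{ and } v \in T\,]
  = p^{2} \cdot |E(S)|
  \le p^{2} m,
\qquad
\Pr\bigl[\,|E(T)| \ge 1\,\bigr]
  \le \mathbb{E}\bigl[\,|E(T)|\,\bigr]
  \le p^{2} m .
```

The first equality uses the independence of the inclusion events for the two endpoints of each edge; the final inequality is Markov's inequality applied to the non-negative integer-valued variable $|E(T)|$.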
2.1 Binary search using the independent set oracle
We present a subroutine based on binary search for finding an edge using independent set queries:
There is a randomized algorithm, Binary-Search, that takes as input (1) a positive integer , (2) access to the independent set oracle of an undirected graph , (3) a set of vertices such that is not an independent set of , and (4) an error parameter . Binary-Search makes queries to and outputs with with probability at least .
Proof: We consider an execution of in Figure 1. Note that we maintain the invariant that is never an independent set. This is because is not an independent set in step 1, and whenever is updated in step 2(b), it is never assigned an independent set. It suffices to show that after iterations, with high probability.
An iteration of step 2 makes progress if the size of the set decreases by at least a constant factor. If, in any iteration of step 2(b), the partition of into and has at least one edge fully contained in or , then that iteration will make progress. Since there is always at least one edge in , this occurs independently in each iteration with probability at least . Since it only takes rounds for the size of to drop to , it follows from a Chernoff bound that the subroutine fails with probability at most .
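The halving argument above can be sketched in code. Figure 1 is not reproduced in this text, so the following is an assumption-laden illustration and not the paper's Binary-Search: it splits the current set into two near-equal random halves (the paper's partition rule may differ), the names `make_is_oracle` and `find_edge` are hypothetical, and the round budget `max_rounds` stands in for the iteration bound of the analysis.

```python
import random

def make_is_oracle(edges):
    """Hypothetical independent set oracle backed by an explicit edge list."""
    edge_list = [tuple(e) for e in edges]

    def is_independent(vertices):
        chosen = set(vertices)
        return not any(u in chosen and w in chosen for u, w in edge_list)

    return is_independent

def find_edge(is_independent, vertices, max_rounds, rng):
    """Shrink a non-independent set to a single edge by random halving.

    Assumes `vertices` has at least two elements and is not independent.
    Returns (edge, number_of_queries); edge is None if the budget runs out.
    """
    current = list(vertices)
    queries = 0
    for _ in range(max_rounds):
        if len(current) == 2:
            return tuple(current), queries  # a non-independent pair is an edge
        rng.shuffle(current)
        half = len(current) // 2
        for part in (current[:half], current[half:]):
            queries += 1
            if not is_independent(part):
                current = part  # invariant: `current` still contains an edge
                break
        # If both halves were independent, every edge was split; retry.
    return None, queries
```

Each round spends at most two queries, and the set is replaced only by a part that still contains an edge, which mirrors the invariant used in the proof.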
We will always invoke Binary-Search with the parameter . [Footnote 8: For example, setting will suffice for our purposes.] The subroutine will always make queries, and will fail with probability at most .
3 Upper bound
In this section we prove the following upper bound:
There is a randomized algorithm that takes as input (1) an accuracy parameter , (2) a positive integer , and (3) access to the independent set oracle of a graph with . With probability at least , Estimate-Edges makes queries and outputs a number satisfying
We recall the following lemma from [BHR18]. [Lemma 5.6 from [BHR18]] There is a randomized algorithm that takes as input (1) an accuracy parameter , (2) a positive integer , and (3) access to the independent set oracle of a graph with . With probability at least , the algorithm makes queries and outputs a number satisfying .
The upper bound claimed in Theorem 1.1 of follows by running the algorithm of Theorem 3 and the algorithm of Lemma 3 in parallel. Specifically, we alternate queries between the two algorithms until one of them terminates. Once one terminates with an estimate to , we output .
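The alternation described above is a generic pattern and can be sketched with generators. In this illustration each algorithm is modeled as a generator that yields once per oracle query and returns its estimate when done; `toy_algorithm` and `run_interleaved` are hypothetical names, and the toy algorithms make no real queries.

```python
def toy_algorithm(num_queries, estimate):
    """Stand-in for an estimator: 'makes' num_queries queries, then returns."""
    for _ in range(num_queries):
        yield  # one oracle query would happen here
    return estimate

def run_interleaved(first, second):
    """Alternate single steps of two generators; return the first result produced."""
    while True:
        for runner in (first, second):
            try:
                next(runner)  # advance this algorithm by one query
            except StopIteration as done:
                return done.value  # this algorithm terminated with an estimate
```

Because the two runs advance in lockstep, the total number of queries is roughly twice that of the cheaper algorithm, which is why the combined bound is the minimum of the two bounds up to a constant factor.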
3.1 Reduction to edge estimation with advice
We prove Theorem 3 using the lemma stated next. We will provide an algorithm, which we call Estimate-With-Advice, for estimating given an extra parameter which is promised to be an upper bound for .
[Estimation with advice] There is a randomized algorithm, Estimate-With-Advice, that takes four inputs: (1) an accuracy parameter , (2) two positive integers , and (3) access to an independent set oracle of with . Estimate-With-Advice makes queries and with probability at least outputs that satisfies
Note that at the end of each iteration of step 2 in Figure 2, either the algorithm terminates or is halved. Since is initially , the maximum number of iterations of step 2 (before ) in Estimate-Edges is . It follows from Lemma 3.1 and a union bound that, with probability at least , every execution of Estimate-With-Advice in step 2(a) of Estimate-Edges returns a correct value (meaning that if of this run indeed satisfies , then its output satisfies (1) but with set to ). We show that the following holds when this is the case:
(): Estimate-Edges terminates in the while loop (instead of going to line 3) with the final value of satisfying .
Assume that () holds, and let be the output of . Since in every iteration of step 2 and the final iteration also satisfies , Theorem 3 would follow from two observations: (i) the query complexity of Estimate-Edges can be bounded using , and (ii) since the final run of Estimate-With-Advice is correct, we have (using )
It suffices to show that () holds when every run of Estimate-With-Advice returns a correct value.
Assume for a contradiction of () that the final value of is smaller than . This implies that in one of the runs of Estimate-With-Advice in Estimate-Edges. Since it returns a correct value (and note that for this run we still have ), the same calculation as in (2) implies that and thus, and the algorithm should have terminated at the end of this run, a contradiction. On the other hand, assume for a contradiction of () that the final value of is larger than . Since the final run returns a correct value , and thus, ; however, step 2(b) should have terminated if , a contradiction. This finishes the proof of the theorem.
We prove Lemma 3.1 in the rest of the section. From now on, let be the accuracy parameter, be a positive integer, and be a graph with as in the statement of Lemma 3.1. Let and let be the unique positive integer such that
We also write to denote the smallest integer such that , and to denote the smallest integer such that (so ). It may be helpful to the reader to consider the case when is only a constant factor larger than , so the algorithm’s task is to refine an approximation to the number of edges given a crude approximation; however, the proof of Lemma 3.1 assumes just the upper bound .
3.2 Degree oracles and the high-level plan
To simplify the presentation and analysis of our algorithm, Estimate-With-Advice, we introduce the notion of -degree partitions and -degree oracles. Roughly speaking, an -degree partition of an undirected graph is a partition of (so ’s and ’s are pairwise disjoint subsets of whose union is ) such that the placement of a vertex reveals important degree information of (see Definition 3.2 for details). An -degree oracle, on the other hand, contains an underlying -degree partition and the latter can be accessed via queries such as “does belong to ” or “does belong to .” There is also a cost associated with each such query (see Definition 3.2 for details).
With the definition of degree partitions and degree oracles, our proof of Lemma 3.1 proceeds in the following two steps. First we present in Lemma 3.2 an algorithm that achieves the same goal as Estimate-With-Advice, namely (1) in Lemma 3.1 with high probability. The difference, however, is that is given access to not only an independent set oracle but also an -degree oracle. Next, we show in Lemma 3.2 that an -degree oracle can be implemented efficiently using access to the independent set oracle. This allows us to convert into Estimate-With-Advice with a similar performance guarantee, and Lemma 3.1 follows directly from Lemma 3.2 and Lemma 3.2.
We start with the definition of -degree partitions:
Let be a graph. An -degree partition of is a partition
of its vertex set (so the sets in are disjoint and their union is ) such that
Let and (so we have ). Every vertex satisfies
and every vertex satisfies .
Every vertex satisfies and every vertex , , satisfies
Let for each . Then every vertex satisfies
Moreover, every vertex for some satisfies
and every satisfies
It is worth pointing out that the intervals used in (4), (5), and (6) are not disjoint (and neither are the conditions on in the first item). As a result, such partitions are in general not unique for a given . For example, a vertex with degree between and can lie in either or .
Next we define -degree oracles:
Let be an undirected graph. An -degree oracle of contains an underlying -degree partition of and can be accessed via two maps and , where
For every vertex , if and otherwise.
For every vertex , if and otherwise.
The cost of each query on is and the cost of each query is .
We will be interested in algorithms that have access to both the independent set oracle and an -degree oracle D of a graph . For such an algorithm (for clarity we always use to mark algorithms that have access to such a pair of oracles), we are interested in its total cost. The cost of each query on the independent set oracle is , and the cost of each query on the degree oracle is specified in Definition 3.2. The total cost of an algorithm is the sum of the costs of individual queries.
[Estimation with degree oracles] There is a randomized algorithm, Estimate-With- , that takes four inputs: an accuracy parameter , two positive integers and , and access to both the independent set oracle and an -degree oracle D of a graph with . Its worst-case total cost is and with probability at least , it returns satisfying
We point out that, because -degree partitions are not unique, in Lemma 3.2 needs to work with an -degree oracle with any underlying -degree partition (as long as it satisfies Definition 3.2). Lemma 3.2 below says that one can simulate a degree oracle efficiently using the independent set oracle.
[Simulation of degree oracles] Let and be positive integers. There are a positive integer and a pair of deterministic algorithms and , where takes as input a vertex , , access to the independent set oracle of a graph with , and a string ; takes the same inputs but has replaced by and . Both algorithms output a value in and together have the following performance guarantee:
makes queries to and makes queries to .
Given any graph with , when is drawn uniformly at random, viewed as a map from and viewed as a map from together form an -degree oracle of with probability at least (over the randomness of ).
Proof of Lemma 3.1 Assuming Lemma 3.2 and 3.2: The algorithm Estimate-With-Advice draws a string uniformly at random, where as in Lemma 3.2, and simulates . When the latter makes a query on its given degree oracle, Estimate-With-Advice runs either or using and uses its output to continue the simulation of Estimate-With-Advice. The query complexity of Estimate-With-Advice can be bounded using the total cost of and complexity of and . The error probability of Estimate-With-Advice is at most (for the probability that fails to produce an -degree oracle) plus (for the error probability of ), which is smaller than . This finishes the proof of Lemma 3.1.
3.3 Estimation of and .
Let be the input graph with . We are given access to the independent set oracle and an -degree oracle of , where we use to denote the degree partition underlying the degree oracle D. To obtain a good estimation of , it suffices to obtain good estimations of cardinalities of ’s and ’s (the latter would also lead to good estimations of ; recall that ). Roughly speaking, estimations of ’s allow us to approximately count the number of edges in the subgraph induced by ; estimations of ’s allow us to approximately count the total degree of vertices in ; estimations of ’s allow us to approximately count the number of edges between and .
[Estimation of ] Let and be a positive integer. There is a randomized algorithm that runs on graphs with via access to the independent set oracle and an -degree oracle of with an underlying -degree partition . It has total cost and returns a number for each satisfying
with probability at least . Proof: Fix an and let . We show how to compute . If
vertices uniformly at random from (with replacement). For each vertex sampled, we query the degree oracle with a cost of to tell if it belongs to . The fraction of times that a vertex sampled belongs to