Set Cover in Sub-linear Time

by   Piotr Indyk, et al.

We study the classic set cover problem from the perspective of sub-linear algorithms. Given access to a collection of m sets over n elements in the query model, we show that sub-linear algorithms derived from existing techniques have almost tight query complexities. On the one hand, we first show an adaptation of the streaming algorithm presented in Har-Peled et al. [2016] to the sub-linear query model that returns an α-approximate cover using Õ(m(n/k)^{1/(α-1)} + nk) queries to the input, where k denotes the value of a minimum set cover. We then complement this upper bound by proving that for lower values of k, the required number of queries is Ω̃(m(n/k)^{1/(2α)}), even for estimating the optimal cover size. Moreover, we prove that even checking whether a given collection of sets covers all the elements would require Ω(nk) queries. These two lower bounds provide strong evidence that the upper bound is almost tight for certain values of the parameter k. On the other hand, we show that this bound is not optimal for larger values of the parameter k, as there exists a (1+ε)-approximation algorithm with Õ(mn/(kε^2)) queries. We show that this bound is essentially tight for sufficiently small constant ε, by establishing a lower bound of Ω̃(mn/k) query complexity.


1 Introduction

Set Cover is a classic combinatorial optimization problem, in which we are given a set (universe) of n elements and a collection of m sets over it. The goal is to find a set cover of the universe, i.e., a collection of the given sets whose union is the whole universe, of minimum size. Set Cover is a well-studied problem with applications in operations research [16], information retrieval and data mining [32], learning theory [19], web host analysis [9], and many others. Recently, this problem and other related coverage problems have gained a lot of attention in the context of massive data sets, e.g., the streaming model [32, 12, 10, 17, 7, 3, 24, 2, 5, 18] or the MapReduce model [22, 25, 4].

Although the problem of finding an optimal solution is NP-complete, a natural greedy algorithm, which iteratively picks the “best” remaining set (the set that covers the largest number of uncovered elements), is widely used. The algorithm finds a solution of size at most k ln n, where k is the optimum cover size, and can be implemented to run in time linear in the input size. However, the input size itself could be as large as Θ(mn), so for large data sets even reading the input might be infeasible.
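The greedy rule described above can be sketched as follows. This is a minimal illustration rather than the linear-time implementation mentioned in the text: it rescans every set in each round, and the function name `greedy_set_cover` and the toy instance are assumptions made for the example.

```python
def greedy_set_cover(universe, sets):
    """Return indices of the sets chosen by the greedy rule, or None if no cover exists."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Pick the set that covers the largest number of still-uncovered elements.
        best = max(range(len(sets)), key=lambda i: len(sets[i] & uncovered))
        if not sets[best] & uncovered:
            return None  # the remaining elements appear in no set
        chosen.append(best)
        uncovered -= sets[best]
    return chosen


# Toy instance (hypothetical): 7 elements, 5 sets.
universe = range(1, 8)
sets = [{1, 2, 3, 4}, {4, 5, 6}, {6, 7}, {1, 5}, {2, 7}]
cover = greedy_set_cover(universe, sets)
```

On this toy instance the greedy rule selects three sets, which happens to be optimal here; in general it only guarantees a logarithmic approximation.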

This raises a natural question: is it possible to solve minimum set cover in sub-linear time? This question was previously addressed in [28, 33], who showed that one can design constant running-time algorithms by simulating the greedy algorithm, under the assumption that the sets are of constant size and each element occurs in a constant number of sets. However, those constant-time algorithms have a few drawbacks: they only provide a mixed multiplicative/additive guarantee (the output cover size is guaranteed to be at most ), the dependence of their running times on the maximum set size is exponential, and they only output the (approximate) minimum set cover size, not the cover itself. From a different perspective, [20] (building on [15]) showed that an -approximate solution to the fractional version of the problem can be found in time . [Footnote 1: The method can be further improved to (N. Young, personal communication).] Combining this algorithm with randomized rounding yields an -approximate solution to Set Cover with the same complexity.

In this paper we initiate a systematic study of the complexity of sub-linear time algorithms for set cover with multiplicative approximation guarantees. Our upper bounds complement the aforementioned result of [20] by presenting algorithms which are fast when k is large, as well as algorithms that provide more accurate solutions (even with a constant-factor approximation guarantee) that use a sub-linear number of queries. [Footnote 2: Note that polynomial-time algorithms with sub-logarithmic approximation guarantees are unlikely to exist.] Equally importantly, we establish nearly matching lower bounds, some of which even hold for estimating the optimal cover size. Our algorithmic results and lower bounds are presented in Table 1.1.

Data access model. As in the prior work [28, 33] on Set Cover, our algorithms and lower bounds assume that the input can be accessed via the adjacency-list oracle. [Footnote 3: In the context of graph problems, this model is also known as the incidence-list model, and has been studied extensively; see, e.g., [8, 14, 6].] More precisely, the algorithm has access to the following two oracles:

  1. EltOf: Given a set S and an index j, the oracle returns the j-th element of S. If S has fewer than j elements, ⊥ is returned.

  2. SetOf: Given an element e and an index j, the oracle returns the j-th set containing e. If e appears in fewer than j sets, ⊥ is returned.
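The two oracles can be mimicked by a pair of adjacency lists. The sketch below is only an illustration of the access model: the class name `SetSystemOracle`, the 1-based indexing, the ordering of each list, and the use of `None` for the "nothing to return" answer are all assumptions of the example. It also counts queries, since query complexity is the measure of interest.

```python
class SetSystemOracle:
    """Adjacency-list access to a set system, counting every query made."""

    def __init__(self, sets):
        # elt_lists[i] lists the elements of set i; set_lists[e] lists the sets containing e.
        self.elt_lists = [sorted(s) for s in sets]
        self.set_lists = {}
        for i, elems in enumerate(self.elt_lists):
            for e in elems:
                self.set_lists.setdefault(e, []).append(i)
        self.queries = 0

    def elt_of(self, i, j):
        """Return the j-th element (1-based) of set i, or None if set i is too small."""
        self.queries += 1
        elems = self.elt_lists[i]
        return elems[j - 1] if 1 <= j <= len(elems) else None

    def set_of(self, e, j):
        """Return the j-th set (1-based) containing e, or None if e is in fewer than j sets."""
        self.queries += 1
        containing = self.set_lists.get(e, [])
        return containing[j - 1] if 1 <= j <= len(containing) else None


oracle = SetSystemOracle([{1, 2}, {2, 3}])
```

A membership query ("is e in set i?") is not directly available in this model; it has to be simulated by scanning one of the two lists.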

This is a natural model, providing a “two-way” connection between the sets and the elements. Furthermore, for some graph problems modeled by Set Cover (such as Dominating Set or Vertex Cover), such oracles are essentially equivalent to the aforementioned incidence-list model studied in sub-linear graph algorithms. We also note that the other popular access model employing the membership oracle, where we can query whether a given element is contained in a given set, is not suitable for Set Cover, as it can be easily seen that even checking whether a feasible cover exists requires time.

1.1 Overview of our results

In this paper we present algorithms and lower bounds for the Set Cover problem. The results are summarized in Table 1.1. The NP-hardness of this problem (or even of approximating it within a sub-logarithmic factor [13, 31, 1, 26, 11]) precludes the existence of highly accurate algorithms with fast running times, while (as we show) it is still possible to design algorithms with sub-linear query complexities and low approximation factors. The lower bound proofs hold for the running time of any algorithm approximating set cover under the defined data access model.

We present two algorithms with a sub-linear number of queries. First, we show that the streaming algorithm presented in [17] can be adapted so that it returns an α-approximate cover using Õ(m(n/k)^{1/(α-1)} + nk) queries, which could be quadratically smaller than the input size. Second, we present a simple algorithm which is tailored to the case when the value of k is large. This algorithm computes an -approximate cover in time (not just query complexity). Hence, by combining it with the algorithm of [20], we get an -approximation algorithm that runs in time .

We complement the first result by proving that for low values of k, the required number of queries is Ω̃(m(n/k)^{1/(2α)}), even for estimating the size of the optimal cover. This shows that the first algorithm is essentially optimal for the values of k where the first term in the runtime bound dominates. Moreover, we prove that even the Cover Verification problem, which is checking whether a given collection of sets covers all the elements, would require Ω(nk) queries. This provides strong evidence that the nk term in the first algorithm is unavoidable. Lastly, we complement the second algorithm by showing a lower bound of Ω̃(mn/k) if the approximation ratio is a small constant.

Problem Approximation Constraints Query Complexity Section Set Cover 4.2 - 4.3 B 3.2 Cover Verification - 5

Table 1.1: A summary of our algorithms and lower bounds. We use the following notation: k denotes the size of the optimum cover; α denotes a parameter that determines the trade-off between the approximation quality and query/time complexities; denotes the approximation factor of a “black box” algorithm for set cover used as a subroutine. We assume that and .

1.2 Related work

Sub-linear algorithms for Set Cover under the oracle model have been previously studied as an estimation problem; the goal is only to approximate the size of the minimum set cover rather than constructing one. Nguyen and Onak [28] consider Set Cover under the oracle model we employ in this paper, in a specific setting where both the maximum cardinality of sets in , and the maximum number of occurrences of an element over all sets, are bounded by some constants and ; this allows algorithms whose time and query complexities are constant, , containing no dependency on or . They provide an algorithm for estimating the size of the minimum set cover that, unlike our work, allows both multiplicative and additive errors. Their result has been subsequently improved to by Yoshida et al. [33]. Additionally, the results of Kuhn et al. [21] on general packing/covering LPs in the distributed model, together with the reduction method of Parnas and Ron [30], imply that estimating set cover size to within a -multiplicative factor (with additive error) can be performed in time/query complexities.

Set Cover can also be considered as a generalization of the Vertex Cover problem. The estimation variant of Vertex Cover under the adjacency-list oracle model has been studied in [30, 23, 29, 33]. Set Cover has also been studied in the sub-linear space context, most notably for the streaming model of computation [32, 12, 7, 3, 2, 5, 18, 10, 17]. In this model, there are algorithms that compute approximate set covers with only multiplicative errors. Our algorithms use some of the ideas introduced in the last two papers [10, 17].

1.3 Overview of the Algorithms

The algorithmic results presented in Section 4 use the techniques introduced for the streaming Set Cover problem by [10, 17] to get new results in the context of sub-linear time algorithms for this problem. Two components previously used for the set cover problem in the context of streaming are Set Sampling and Element Sampling. Assuming the size of the minimum set cover is k, Set Sampling randomly samples sets and adds them to the maintained solution. This ensures that all the elements that are well represented in the input (i.e., appearing in at least sets) are covered by the sampled sets. On the other hand, the Element Sampling technique samples roughly elements, and finds a set cover for the sampled elements. It can be shown that the cover for the sampled elements covers a fraction of the original elements.
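A rough sketch of the Set Sampling step is given below. The number of sampled sets is left as a free parameter `num_samples`, sampling is done without replacement, and the toy instance is invented; all three are assumptions of the example rather than choices made in the paper.

```python
import random


def set_sampling(sets, num_samples, rng):
    """Sample sets uniformly without replacement; report which elements stay uncovered."""
    picked = rng.sample(range(len(sets)), num_samples)
    covered = set().union(*(sets[i] for i in picked))
    all_elements = set().union(*sets)
    return picked, all_elements - covered


# Element 0 appears in every set (well represented); elements 1..4 appear in one set each.
sets = [{0, 1}, {0, 2}, {0, 3}, {0, 4}]
picked, left = set_sampling(sets, 2, random.Random(0))
```

Whatever two sets are drawn, the well-represented element 0 is covered, while two of the rare elements remain for a later stage to handle.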

Specifically, the first algorithm performs a constant number of iterations. Each iteration uses element sampling to compute a “partial” cover, removes the elements covered by the sets selected so far and recurses on the remaining elements. However, making this process work in sub-linear time (as opposed to sub-linear space) requires new technical development. For example, the algorithm of [17] relies on the ability to test membership for a set-element pair, which generally cannot be efficiently performed in our model.

The second algorithm performs only one round of set sampling, and then identifies the elements that are not covered by the sampled sets, without performing a full scan of those sets. This is possible because with high probability only those elements that belong to few input sets are not covered by the sampled sets. Therefore, we can efficiently enumerate all (element, set) pairs in which the set contains the element, for those elements that were not covered by the sampled sets. We then run a black box algorithm only on the set system induced by those pairs. This approach lets us avoid the nk term present in the query and runtime bounds for the first algorithm, which makes the second algorithm highly efficient for large values of k.

1.4 Overview of the Lower Bounds

The Set Cover lower bound for smaller optimal value k. We establish our lower bound for the problem of estimating the size of the minimum set cover, by constructing two distributions of set systems. All systems in the same distribution share the same optimal set cover size, but these sizes differ by a factor between the two distributions; thus, the algorithm is required to determine from which distribution its input set system is drawn, in order to correctly estimate the optimal cover size. Our distributions are constructed by a novel use of the probabilistic method. Specifically, we first probabilistically construct a set system called a median instance (see Lemma 3.8): this set system has the property that (a) its minimum set cover size is and (b) a small number of changes to the instance reduces the minimum set cover size to . We set the first distribution to be always this median instance. Then, we construct the second distribution by a random process that performs the changes (depicted in Figure 3.1) resulting in a modified instance. This process distributes the changes almost uniformly throughout the instance, which implies that the changes are unlikely to be detected unless the algorithm performs a large number of queries. We believe that this construction might find applications to lower bounds for other combinatorial optimization problems.

The Set Cover lower bound for larger optimal value k. Our lower bound for the problem of computing an approximate set cover leverages the construction above. We create a combined set system consisting of multiple modified instances all chosen independently at random, allowing instances with much larger k. By the properties of the random process generating modified instances, we observe that most of these modified instances have different optimal set cover solutions, and that distinguishing these instances from one another requires many queries. Thus, it is unlikely for the algorithm to be able to compute an optimal solution to a large fraction of these modified instances, and therefore it fails to achieve the desired approximation factor for the overall combined instance.

The Cover Verification lower bound for a cover of size . For Cover Verification, however, we instead give an explicit construction of the distributions. We first create an underlying set structure such that initially, the candidate sets contain all but elements. Then we may swap in each uncovered element from a non-candidate set. Our set structure is systematically designed so that each swap only modifies a small fraction of the answers from all possible queries; hence, each swap is hard to detect without queries. The distribution of valid set covers is composed of instances obtained by swapping in every uncovered element, and that of non-covers is similarly obtained but leaving one element uncovered.

2 Preliminaries for the Lower Bounds

First, we formally specify the representation of the set structures of input instances, which applies to both Set Cover and Cover Verification.

Our lower bound proofs rely mainly on the construction of instances that are hard to distinguish by the algorithm. To this end, we define the operation that exchanges a pair of elements between two sets, and how this is implemented in the actual representation.

Definition 2.1 (swap operation).

Consider two sets S and S′. A swap on S and S′ is defined over two elements e, e′ such that e ∈ S ∖ S′ and e′ ∈ S′ ∖ S, where S and S′ exchange e and e′. Formally, after performing swap(e, e′), S = (S ∪ {e′}) ∖ {e} and S′ = (S′ ∪ {e}) ∖ {e′}. As for the representation via EltOf and SetOf, each application of swap only modifies two entries for each oracle. That is, if previously e = EltOf(S, i), S = SetOf(e, j), e′ = EltOf(S′, i′), and S′ = SetOf(e′, j′), then their new values change as follows: e′ = EltOf(S, i), S′ = SetOf(e, j), e = EltOf(S′, i′), and S = SetOf(e′, j′).

In particular, we extensively use the property that the amount of changes to the oracle’s answers incurred by each swap is minimal. We remark that when we perform multiple swaps on multiple disjoint set-element pairs, the swaps modify distinct entries and do not interfere with one another.
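To make the locality of a swap concrete, the following sketch applies one swap to explicit EltOf and SetOf tables and counts how many entries change. The table layout (dictionaries of lists), the function names, and the toy instance are assumptions of this illustration.

```python
import copy


def swap(elt_of, set_of, s, s_prime, e, e_prime):
    """Exchange e (in s, not in s_prime) with e_prime (in s_prime, not in s), in place."""
    assert e in elt_of[s] and e not in elt_of[s_prime]
    assert e_prime in elt_of[s_prime] and e_prime not in elt_of[s]
    # Two EltOf entries change: the slot of e in s and the slot of e_prime in s_prime.
    i, i_prime = elt_of[s].index(e), elt_of[s_prime].index(e_prime)
    elt_of[s][i], elt_of[s_prime][i_prime] = e_prime, e
    # Two SetOf entries change: the slot of s for e and the slot of s_prime for e_prime.
    j, j_prime = set_of[e].index(s), set_of[e_prime].index(s_prime)
    set_of[e][j], set_of[e_prime][j_prime] = s_prime, s


def count_changed_entries(before, after):
    """Count table cells whose content differs between two snapshots."""
    return sum(x != y for key in before for x, y in zip(before[key], after[key]))


elt_of = {"A": [1, 2, 3], "B": [3, 4, 5]}
set_of = {1: ["A"], 2: ["A"], 3: ["A", "B"], 4: ["B"], 5: ["B"]}
old_elt_of, old_set_of = copy.deepcopy(elt_of), copy.deepcopy(set_of)
swap(elt_of, set_of, "A", "B", 1, 4)
```

All list lengths are preserved, so a query can only notice the swap by landing on one of the four touched cells.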

Lastly, we define the notion of query-answer history, which is a common tool for establishing lower bounds for sub-linear algorithms under query models.

Definition 2.2.

By query-answer history, we denote the sequence of query-answer pairs ⟨(q_1, a_1), …, (q_r, a_r)⟩ recording the communication between the algorithm and the oracles, where each new query q_{i+1} may only depend on the query-answer pairs (q_1, a_1), …, (q_i, a_i). In our case, each q_i represents either a SetOf query or an EltOf query made by the algorithm, and each a_i is the oracle’s answer to that respective query according to the set structure instance.
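The role of the query-answer history in the lower bounds can be illustrated with a small recorder. In this sketch the oracle tables are a plain dictionary keyed by hypothetical query tuples, and `probe_first_entries` is an invented toy algorithm; the point is only that a deterministic algorithm behaves identically on two instances that agree on every entry it reads.

```python
def run_with_history(algorithm, tables):
    """Run `algorithm`, recording each (query, answer) pair exchanged with the oracle."""
    history = []

    def ask(query):
        answer = tables.get(query)  # None stands for the oracle's "no such entry" reply
        history.append((query, answer))
        return answer

    return algorithm(ask), history


def probe_first_entries(ask):
    """A toy deterministic algorithm whose second query depends on the first answer."""
    first = ask(("EltOf", 0, 1))
    second = ask(("SetOf", first, 1))
    return second is not None


original = {("EltOf", 0, 1): 5, ("EltOf", 0, 2): 7, ("SetOf", 5, 1): 0}
modified = dict(original)
modified[("EltOf", 0, 2)] = 9  # differs only in an entry the algorithm never reads

out_a, hist_a = run_with_history(probe_first_entries, original)
out_b, hist_b = run_with_history(probe_first_entries, modified)
```

Since the two histories coincide, the outputs coincide as well, so the algorithm must be wrong on at least one of the two instances whenever their correct answers differ.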

3 Lower Bounds for the Set Cover Problem

In this section, we present lower bounds for Set Cover both for small values of the optimal cover size k (in Section 3.1), and for large values of k (in Section 3.2). For low values of k, we prove the following theorem, whose proof is postponed to Appendix B.

Theorem 3.3.

For and , any randomized algorithm that solves the Set Cover problem with approximation factor and success probability at least requires queries.

Instead, in Section 3.1 we focus on the simple setting of this theorem, which applies to approximation protocols for distinguishing between instances with minimum set cover sizes 2 and 3, and show a lower bound of (which is tight up to a polylogarithmic factor) for approximation factor . We make this simplification both for clarity and because the result for this case is used in Section 3.2 to establish our lower bound for large values of k.

High level idea. Our approach for establishing the lower bound is as follows. First, we construct a median instance for Set Cover, whose minimum set cover size is 3. We then apply a randomized procedure GenModifiedInst, which slightly modifies the median instance into a new instance containing a set cover of size 2. Applying Yao’s principle, the distribution of the input to the deterministic algorithm is either the median instance with probability 1/2, or a modified instance generated through GenModifiedInst, again with probability 1/2. Next, we consider the execution of the deterministic algorithm. We show that unless the algorithm asks at least queries, the resulting query-answer history generated over the median instance would be the same as those generated over instances constituting a constant fraction of the modified instances, reducing the algorithm’s success probability to below . More specifically, we will establish the following theorem.

Theorem 3.4.

Any algorithm that can distinguish whether the input instance is or belongs to with probability of success greater than , requires queries.

Corollary 3.5.

For , and , any randomized algorithm that approximates by a factor of , the size of the optimal cover for the Set Cover problem with success probability at least requires queries.

For simplicity, we assume that the algorithm has the knowledge of our construction (which may only strengthen our lower bounds); this includes the median instance and the distribution of modified instances, along with their representation via EltOf and SetOf. The objective of the algorithm is simply to distinguish them. Since we are distinguishing a distribution of instances against the single median instance, we may individually upper bound the probability that each query-answer pair reveals the modified part of the instance, then apply the union bound directly. However, establishing such a bound requires a certain set of properties that we obtain through a careful design of the median instance and GenModifiedInst. We remark that our approach shows the hardness of distinguishing instances with different cover sizes. That is, our lower bound on the query complexity also holds for the problem of approximating the size of the minimum set cover (without explicitly finding one).

Lastly, in Section 3.2 we provide a construction that uses Theorem 3.4 to extend Corollary 3.5 and establish the following theorem on lower bounds for larger minimum set cover sizes.

Theorem 3.6.

For any sufficiently small approximation factor and , any randomized algorithm that computes an -approximation to the Set Cover problem with success probability at least requires queries.

3.1 The Set Cover Lower Bound for Small Optimal Value

3.1.1 Construction of the Median Instance

Let be a collection of sets such that (independently for each set-element pair ) contains with probability , where (note that since we assume for large enough , we can assume that ). Equivalently, we may consider the incidence matrix of this instance: each entry is either (indicating ) with probability , or (indicating ) otherwise. We write denoting the collection of sets obtained from this construction.
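The random construction above can be sketched as follows, together with a brute-force check of the first median property stated below (no two sets cover all the elements). The inclusion probability is left as a free parameter `p_in`, and the sizes used in the example are arbitrary; neither is the setting used in the paper.

```python
import itertools
import random


def random_instance(n, m, p_in, rng):
    """Draw m sets over elements 0..n-1; each set contains each element independently w.p. p_in."""
    return [{e for e in range(n) if rng.random() < p_in} for _ in range(m)]


def no_two_sets_cover(n, sets):
    """Check that no pair of sets covers the whole universe (assumes at least two sets)."""
    universe = set(range(n))
    return all(a | b != universe for a, b in itertools.combinations(sets, 2))


sets = random_instance(n=200, m=20, p_in=0.5, rng=random.Random(0))
```

With these toy parameters a fixed pair of sets covers all 200 elements with probability 0.75^200, so the pairwise check passes except with negligible probability; the other median properties would be verified by similar counting loops.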

Definition 3.7 (Median instance).

An instance of Set Cover, , is a median instance if it satisfies all the following properties.

  (a) No two sets cover all the elements. (The size of its minimum set cover is at least 3.)

  (b) For any two sets the number of elements not covered by the union of these sets is at most .

  (c) The intersection of any two sets has size at least .

  (d) For any pair of elements , the number of sets s.t. but is at least .

  (e) For any triple of sets and , .

  (f) For each element, the number of sets that do not contain that element is at most .

Lemma 3.8.

There exists a median instance satisfying all properties from Definition 3.7. In fact, with high probability, an instance drawn from the distribution in which independently at random, satisfies the median properties.

The proof of the lemma follows from standard applications of concentration bounds. Specifically, it follows from the union bound and Lemmas A.27–A.32, appearing in Appendix A.

3.1.2 Distribution of Modified Instances Derived from

Fix a median instance . We now show that we may perform operations on so that the size of the minimum set cover in the modified instance becomes . Moreover, its incidence matrix differs from that of in entries. Consequently, the number of queries to EltOf and SetOf that induce different answers from those of is also at most .

We define as the distribution of instances generated from a median instance by GenModifiedInst, given below in Figure 3.1, as follows. Assume that . We select two different sets from uniformly at random; we aim to turn these two sets into a set cover. To do so, we swap out some of the elements in and bring in the uncovered elements. For each uncovered element , we pick an element that is also covered by . Next, consider the candidate sets with which we may exchange these two elements:

Definition 3.9 (Candidate set).

For any pair of elements , the candidate set of are all sets that contain but not . The collection of candidate sets of is denoted by . Note that (in fact, these two collections are disjoint).


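GenModifiedInst (input: a median instance):
  pick two different sets from the collection uniformly at random
  for each element not covered by the two picked sets:
    pick uniformly at random an unmatched element covered by both picked sets
    pick a random set in the candidate set of this pair of elements
    swap the two elements between that set and one of the two picked sets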


Figure 3.1: The procedure of constructing a modified instance of .

We choose a random set from , and swap with so that now contains . We repeatedly apply this process for all initially uncovered so that eventually and form a set cover. We show that the proposed algorithm, GenModifiedInst, can indeed be executed without getting stuck.
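The procedure can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's exact procedure: sets are plain Python sets rather than oracle tables, elements are always swapped into the second of the two picked sets, and the toy input (complements of the seven lines of the Fano plane) is chosen only because it is small, has no two sets covering the universe, and never lets the procedure get stuck. It is not a median instance in the sense of Definition 3.7.

```python
import random
from itertools import combinations


def gen_modified_inst(sets, n, rng):
    """Return a modified copy of `sets` plus the indices of the two sets turned into a cover."""
    sets = [set(s) for s in sets]
    first, second = rng.sample(range(len(sets)), 2)
    uncovered = [e for e in range(n) if e not in sets[first] and e not in sets[second]]
    shared = sorted(sets[first] & sets[second])
    # Random matching: raises ValueError if there are fewer shared than uncovered elements.
    partners = rng.sample(shared, len(uncovered))
    for e, e_prime in zip(uncovered, partners):
        # Candidate sets contain e but not e_prime; rng.choice raises IndexError if none exist.
        candidates = [i for i, s in enumerate(sets) if e in s and e_prime not in s]
        c = rng.choice(candidates)
        sets[c].remove(e)
        sets[c].add(e_prime)
        sets[second].remove(e_prime)
        sets[second].add(e)
    return sets, (first, second)


# Toy instance: complements of the seven lines of the Fano plane over elements 0..6.
LINES = [{0, 1, 2}, {0, 3, 4}, {0, 5, 6}, {1, 3, 5}, {1, 4, 6}, {2, 3, 6}, {2, 4, 5}]
UNIVERSE = set(range(7))
toy = [UNIVERSE - line for line in LINES]
modified, (a, b) = gen_modified_inst(toy, 7, random.Random(1))
```

Each swap preserves every set size and every element's number of occurrences, so the modified instance looks the same as the original except at the few touched entries.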

Lemma 3.10.

The procedure GenModifiedInst is well-defined under the precondition that the input instance is a median instance.


Proof. To carry out the algorithm, we must ensure that the number of the initially uncovered elements is at most that of the elements covered by both and . This follows from the properties of median instances (Definition 3.7): the former is bounded from above by property (b), while the size of the intersection of and is greater than by property (c). That is, in our construction there are sufficiently many possible choices for to be matched and swapped with each uncovered element . Moreover, by property (d) there are plenty of candidate sets for performing with .

3.1.3 Bounding the Probability of Modification

Let denote the distribution of instances generated by GenModifiedInst. If an algorithm is to distinguish between the median instance and an instance drawn from this distribution, it must find some cell in the EltOf or SetOf tables that would have been modified by GenModifiedInst, to confirm that GenModifiedInst was indeed executed; otherwise it would make wrong decisions half of the time. We will show an additional property of this distribution: none of the entries of EltOf and SetOf is significantly more likely than the others to be modified during the execution of GenModifiedInst. Consequently, no algorithm may strategically detect the difference between the two cases with the desired probability, unless the number of queries is asymptotically the reciprocal of the maximum probability of modification over all cells.

Define as the probability that an element is swapped by a set. More precisely, for an element and a set , if in the median instance , then ; otherwise, it is equal to the probability that swaps . We note that these probabilities are taken over where is a fixed median instance. That is, as per Figure 3.1, they correspond to the random choices of , the random matching between and , and their random choices of choosing each candidate set . We bound the values of via the following lemma.

Lemma 3.11.

For any and , where the probability is taken over .


Let denote the first two sets picked (uniformly at random) from to construct a modified instance of . For each element and a set such that in the basic instance ,

where all probabilities are taken over . Next we bound each of the above six terms. Since we choose the sets randomly, . We bound the second term by . For the third term, since we pick a matching uniformly at random among all possible (maximum) matchings between and , by symmetry, the probability that a certain element is in the matching is (by properties (b) and (c) of median instances),

We bound the fourth term by . To compute the fifth term, let denote the number of sets in that do not contain . By property (f) of median instances, the probability that is in given that is at most,

Finally for the last term, note that by symmetry, each pair of matched elements is picked by GenModifiedInst equiprobably. Thus, for any , the probability that each element is matched to is . By properties (c)–(e) of median instances, the last term is at most


3.1.4 Proof of Theorem 3.4

Now we consider a median instance , and its corresponding family of modified sets . To prove the promised lower bound for randomized protocols distinguishing and , we apply Yao’s principle and instead show that no deterministic algorithm may determine whether the input is or with success probability at least using queries. Recall that if ’s query-answer history when executed on is the same as that of , then must unavoidably return a wrong decision for the probability mass corresponding to . We bound the probability of this event as follows.

Lemma 3.12.

Let be the set of queries made by on . Let where is a given median instance. Then the probability that returns different outputs on and is at most .


Let denote the algorithm’s output for input instance (whether the given instance is or drawn from ). For each query , let denote the answer of to query . Observe that since is deterministic, if all of the oracle’s answers to its previous queries are all the same, then it must make the same next query. Combining this fact with the union bound, we may lower bound the probability that returns the same outputs on and as follows:

For each , let and denote respectively the set and element queried by . Applying Lemma 3.11, we obtain

Proof of Theorem 3.4. If does not output correctly on , the probability of success of is less than ; thus, we can assume that returns the correct answer on . This implies that returns an incorrect solution on the fraction of for which . Now recall that the distribution in which we apply Yao’s principle consists of with probability , and drawn uniformly at random from also with probability . Then over this distribution, by Lemma 3.12,

Thus, if the number of queries made by is less than , then the probability that returns the correct answer over the input distribution is less than and the proof is complete.

3.2 The Set Cover Lower Bound for Large Optimal Value

Our construction of the median instance and its associated distribution of modified instances also leads to the lower bound of Ω̃(mn/k) for the problem of computing an approximate solution to Set Cover. This lower bound matches the performance of our algorithm for large optimal value k and shows that it is tight for some range of values of k, although it only applies to sufficiently small approximation factors.

Proof overview. We construct a distribution over compounds: a compound is a Set Cover instance that consists of smaller instances , where each of these instances is either the median instance or a random modified instance drawn from . By our construction, a large majority of our distribution is composed of compounds that contain at least modified instances such that any deterministic algorithm must fail to distinguish them from the median instance when it is only allowed to make a small number of queries. A deterministic algorithm can safely cover these modified instances with three sets, incurring a cost (sub-optimality) of . Still, the algorithm may choose to cover such a modified instance with two sets to reduce its cost, but it then must err on a different compound in which that instance is replaced with the median instance. We track the trade-off between the amount of cost that the algorithm saves on these compounds by covering these instances with two sets, and the amount of error its scheme incurs on other compounds. The algorithm is allowed a small probability of error, which we then use to upper-bound the expected cost that it may save, and conclude that it still incurs an expected cost of overall. We apply Yao’s principle (for algorithms with errors) to obtain that randomized algorithms also incur an expected cost of , on compounds with optimal solution size , yielding the impossibility result for computing solutions with approximation factor when given insufficient queries.

3.2.1 Overall Lower Bound Argument

Compounds. Consider the median instance and its associated distribution of modified instances for Set Cover with elements and sets, and let be a positive integer parameter. We define a compound as a set structure instance consisting of median or modified instances , forming a set structure of elements and sets, in such a way that each instance occupies separate elements and sets. Since the optimal solution to each instance is if , and if is any modified instance, the optimal solution for the compound is plus the number of occurrences of the median instance; this optimal objective value is always .
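Because the instances of a compound occupy separate elements and sets, the optimum of the compound is the sum of the optima of its parts. The sketch below checks this on a tiny example by brute force; the helper names and the three toy instances (one with optimum 3 standing in for the median instance, two with optimum 2 standing in for modified instances) are assumptions of the illustration.

```python
from itertools import combinations


def min_cover_size(n, sets):
    """Brute-force minimum set cover size over elements 0..n-1 (exponential; toy sizes only)."""
    universe = set(range(n))
    for size in range(1, len(sets) + 1):
        for combo in combinations(sets, size):
            if set().union(*combo) == universe:
                return size
    return None


def make_compound(instances):
    """Place instances (n_i, sets_i) side by side on disjoint, shifted element ranges."""
    total, compound = 0, []
    for n_i, sets_i in instances:
        compound.extend({e + total for e in s} for s in sets_i)
        total += n_i
    return total, compound


median_like = (3, [{0}, {1}, {2}])       # optimum 3
modified_like = (3, [{0, 1}, {2}, {1}])  # optimum 2
n, sets = make_compound([median_like, modified_like, modified_like])
```

Here the compound consists of three instances, one of which is median-like, and its optimum is 2 · 3 + 1 = 7, matching the count described above.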

Random distribution over compounds. Employing Yao’s principle, we construct a distribution of compounds : it will be applied against any deterministic algorithm for computing an approximate minimum set cover, which is allowed to err on at most a -fraction of the compounds from the distribution (for some small constant ). For each , we pick the median instance with probability where is a sufficiently large constant. Otherwise, we simply draw a random modified instance . We aim to show that, in expectation over , the algorithm must output a solution of size more than the optimal set cover size of the given instance .

The algorithm frequently leaves many modified instances undetected. Consider an instance containing at least modified instances. These instances constitute at least a -fraction of : the expected number of occurrences of the median instance in each compound is only , so by Markov’s inequality, the probability that there are more than median instances is at most for large . We make use of the following useful lemma, whose proof is deferred to Section 3.2.2. In what follows, we say that the algorithm “distinguishes” or “detects the difference” between and if it makes a query that induces different answers, and thus may deduce that one of or cannot be the input instance. In particular, if then detecting the difference between them would be impossible.

Lemma 3.13.

Fix and consider the distribution over compounds with for and for . If makes at most queries to , then it may detect the differences between and at least of the modified instances , with probability at most .

We apply this lemma for any (although the statement holds for any , even vacuously for ). Thus, for -fraction of , fails to identify, for at least modified instances in , whether it is a median instance or a modified instance. Observe that the query-answer history of on such would not change if we were to replace any combination of these modified instances by copies of . Consequently, if the algorithm were to correctly cover by using two sets for some of these , it must unavoidably err (return a non-cover) on the compound where these ’s are replaced by copies of the median instance.

Charging argument. We call a compound tough if does not err on , and fails to detect at least modified instances; denote by the conditional distribution of restricted to tough instances. For tough , let denote the number of modified instances that the algorithm decides to cover with three sets. That is, for each tough compound , measures how far the solution returned by is from the optimal set cover size. Then, there are at least modified instances that chooses to cover with only two sets despite not being able to verify whether or not. Let denote the set of the indices of these modified instances, so . By doing so, then errs on the replaced compound , that is, the compound identical to except that each modified instance for is replaced by . In this event, we say that the tough compound charges the replaced compound via . Recall that the total error of is : this quantity upper-bounds the total probability mass of charged instances, which we will then manipulate to obtain a lower bound on .

Instances must share optimal solutions for to charge the same replaced instance. Observe that many tough instances may charge the same replaced instance: we must handle these duplicates. First, consider two tough instances charging the same via the same . As but , these tough instances differ on some modified instances with indices in . Nonetheless, the query-answer histories of operating on and must be the same, as their instances in are both indistinguishable from by the deterministic . Since does not err on tough instances (by definition), both tough and must share the same optimal set cover on every instance in . Consequently, for each fixed , only tough instances that have the same optimal solution for modified instances in may charge the same replaced instance via .

Charged instance is much heavier than charging instances combined. By our construction of drawn from , for the median instance. On the other hand, for modified instances sharing the same optimal set cover, because they are all modified instances constructed to have the two sets chosen by GenModifiedInst as their optimal set cover: each pair of sets is chosen uniformly with probability . Thus, the probability that is chosen is more than times the total probability that any is chosen. Generalizing this observation, we consider tough instances charging the same via , and bound the difference in probabilities that and any are drawn. For each index in , it is more than times more likely for to draw the median instance than any modified instance with a fixed optimal solution. Then, for the replaced compound that errs, (where denotes the probability mass in , not in ). In other words, the probability mass of the replaced instance charged via is always at least times the total probability mass of the charging tough instances.

Bounding the expected cost using . In our charging argument by tough instances above, we only bound the amount of charges on the replaced instances via a fixed . As there are up to choices for , we scale down the total amount charged to a replaced instance by a factor of , so that lower bounds the total probability mass of the replaced instances that errs.

Let us first focus on the conditional distribution restricted to tough instances. Recall that at least a -fraction of the compounds in are tough: fails to detect the differences between modified instances and the median instance with probability , and among these compounds, may err on at most a -fraction. So in the conditional distribution over tough instances, the individual probability mass is scaled up to . Thus,

As the probability mass above cannot exceed the total allowed error , we have

where Jensen’s inequality is applied in the last step above. So,

for sufficiently large (and ) when choosing .

We now return to the expected cost over the entire distribution . For simplicity, define for any non-tough . This yields , establishing the expected cost of any deterministic with probability of error at most over .

Establishing the lower bound for randomized algorithms. Lastly, we apply Yao’s principle (here we use the Monte Carlo version where the algorithm may err, and use cost instead of time complexity as our measure of performance; see, e.g., Proposition 2.6 in [27] and the description therein) to obtain that, for any randomized algorithm with error probability , its expected cost under the worst input is at least . Recall now that our cost here lower-bounds the sub-optimality of the computed set cover (that is, the algorithm uses at least cost more sets to cover the elements than the optimal solution does). Since our input instances have optimal solution and the randomized algorithm returns a solution with cost at least in expectation, it achieves an approximation factor of no better than with queries. Theorem 3.6 then follows, noting the substitution of our problem size: .

3.2.2 Proof of Lemma 3.13

First, we recall the following result from Lemma 3.12 for distinguishing between and a random .

Corollary 3.14.

Let be the number of queries made by on over elements and sets, where is a median instance. Then the probability that detects a difference between and in one of its queries is at most .

Marbles and urns. Fix a compound . Let , and then consider the following, entirely different, scenario. Suppose that we have urns, where each urn contains marbles. In the urn, in case is a modified instance, we put in this urn one marble and marbles; otherwise if , we put in white marbles. Observe that the probability of obtaining a marble by drawing marbles from a single urn without replacement is exactly (for ). Now, we will relate the probability of drawing marbles to the probability of successfully distinguishing instances. We emphasize that we are only comparing the probabilities of events for the sake of analysis, and we do not imply or suggest any direct analogy between the events themselves.
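The single-urn probability used here can be verified exactly. The sketch below assumes an urn of `s` marbles containing exactly one special marble, from which `q` marbles are drawn without replacement; the symbols `s` and `q` and the function name are placeholders chosen for illustration. Under that assumption, the standard value of the probability is q/s.

```python
from fractions import Fraction

def prob_find_special(s, q):
    """Pr[the special marble is among q draws without replacement
    from an urn of s marbles, exactly one of which is special]."""
    miss = Fraction(1)
    for j in range(q):
        # Draw j+1 misses the special marble, given all earlier draws missed.
        miss *= Fraction(s - 1 - j, s - j)
    return 1 - miss

s, q = 12, 5  # illustrative sizes only
assert prob_find_special(s, q) == Fraction(q, s)
```

The product of the miss probabilities telescopes to (s - q)/s, which is why the probability of finding the marble grows linearly in the number of draws.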

Corollary 3.14 above bounds the probability that the algorithm successfully distinguishes a modified instance from with . Then, the probability of distinguishing between and using queries, is bounded from above by the probability of obtaining a marble after drawing marbles from an urn. Consequently, the probability that the algorithm distinguishes instances is bounded from above by the probability of drawing the marbles from at least urns. Hence, to prove that the event of Lemma 3.13 occurs with probability at most , it is sufficient to upper-bound the probability that an algorithm obtains marbles by .

Consider an instance of urns; for each urn corresponding to a modified instance , exactly one of its marbles is . An algorithm may draw marbles from each urn, one by one without replacement, for potentially up to times. By the principle of deferred decisions, the marble is equally likely to appear in any of these draws, independent of the events for other urns. Thus, we can create a tuple of random variables such that for each , is chosen uniformly at random from . The variable represents the number of draws required to obtain the marble in the urn; that is, only the draw from the urn finds the marble from that urn. In case is a median instance, we simply set indicating that the algorithm never detects any difference as and are the same instance.
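The deferred-decisions view above can be sketched as a small simulation: fix, for each urn, the draw index at which its special marble would appear, then count the urns in which the algorithm draws far enough to reach it. Everything named here is a hypothetical choice for illustration: `s` is the urn size, `draws[i]` is the number of draws the algorithm spends on urn `i`, and `None` stands in for the value assigned to a median instance, whose urn contains no special marble.

```python
import random

def sample_positions(is_modified, s, rng):
    """For each urn, the draw index (1..s) at which the special marble appears,
    chosen uniformly at random; None when the urn has no special marble."""
    return [rng.randint(1, s) if m else None for m in is_modified]

def count_found(positions, draws):
    """Number of urns whose special marble is found, given that the
    algorithm makes draws[i] draws from urn i."""
    return sum(1 for pos, d in zip(positions, draws)
               if pos is not None and d >= pos)

rng = random.Random(0)
is_modified = [True, True, False, True]
positions = sample_positions(is_modified, s=10, rng=rng)
print(count_found(positions, draws=[10, 0, 10, 3]))
```

Since the positions are independent across urns, the only choice left to the algorithm is how to split its total budget of draws among the urns, which is the quantity the subsequent lemmas bound.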

We now show the following two lemmas in order to bound the number of marbles the algorithm may encounter throughout its execution.

Lemma 3.15.

Let be a fixed constant and define