We consider the Max Coverage and Max Unique Coverage problems in the data stream model. The input to both problems consists of subsets of a universe of size and a value . In Max Coverage, the problem is to find a collection of at most sets such that the number of elements covered by at least one set is maximized. In Max Unique Coverage, the problem is to find a collection of at most sets such that the number of elements covered by exactly one set is maximized. In the data stream model, we assume is provided but that the sets are revealed online, and our goal is to design single-pass algorithms that use space that is sub-linear in the input size.
Max Coverage is a classic NP-Hard problem that has a wide range of applications including facility and sensor allocation , information retrieval , influence maximization in marketing strategy design , and the blog monitoring problem . It is well-known that the greedy algorithm, which repeatedly picks the set that covers the largest number of uncovered elements, is a approximation and that unless , this approximation factor is the best possible in polynomial time .
Max Unique Coverage was first studied in the offline setting by Demaine et al. . A motivating application for this problem was in the design of wireless networks where we want to place base stations that cover mobile clients. Each station could cover multiple clients but unless a client is covered by a unique station the client would experience too much interference. Demaine et al.  gave a polynomial time approximation. Furthermore, they showed that Max Unique Coverage is hard to approximate within a factor for some constant under reasonable complexity assumptions. Erlebach and van Leeuwen  and Ito et al.  considered a geometric variant of the problem and Misra et al.  considered the parameterized complexity of the problem. This problem is also closely related to Minimum Membership Set Cover where one has to cover every element and minimizes the maximum overlap on any element [53, 26].
In the streaming set model, Max Coverage and the related Set Cover problem (that is, finding the minimum number of sets that cover the entire universe) have both received a significant amount of attention [38, 64, 36, 15, 27, 7, 61, 39]. The most relevant result is a single-pass approximation using space [61, 8], although a better approximation is possible in a similar amount of space if multiple passes are permitted  or if the stream is randomly ordered [63, 2]. In this paper, we almost exclusively consider single-pass algorithms where the sets arrive in an arbitrary order.
The unique coverage problem has not been studied in the data stream model although it, and Max Coverage, are closely related to various graph problems that have been studied.
Relationship to Graph Streaming.
There are two main variants of the graph stream model. In the arbitrary order model, the stream consists of the edges of the graph in arbitrary order. In the adjacency list model, all edges that include the same node are grouped together. Both models generalize naturally to hypergraphs where each edge could consist of more than two nodes. The arbitrary order model has been more heavily studied than the adjacency list model but there has still been a significant amount of work in the latter model [59, 57, 11, 50, 36, 6, 7, 42, 58]. For further details, see a recent survey of work on the graph stream model .
To explore the relationship between Max Coverage and Max Unique Coverage and various graph stream problems, it makes sense to introduce two additional parameters beyond (the number of sets) and (the size of the universe). Specifically, throughout the paper we let denote the maximum cardinality of a set in the input and let denote the maximum multiplicity of an element in the universe, where the multiplicity is the number of sets in which an element appears. (Note that and are dual parameters in the sense that if the input is and we define then and .) Then an input to Max Coverage and Max Unique Coverage can define a (hyper)graph in one of the following two natural ways:
First Interpretation: A sequence of (hyper-)edges on a graph with nodes of maximum degree (where the degree of a node corresponds to how many hyperedges include that node) and hyperedges where each hyperedge has size at most . In the case where every set has size , the hypergraph is an ordinary graph, i.e., a graph where every edge just has two endpoints. With this interpretation, the graph is being presented in the arbitrary order model.
Second Interpretation: A sequence of adjacency lists (where the adjacency list for a given node includes all the hyperedges that include that node) on a graph with nodes of maximum degree and hyperedges of maximum size . In this interpretation, if every element appears in exactly sets, then this corresponds to an ordinary graph where each element corresponds to an edge and each set corresponds to a node. With this interpretation, the graph is being presented in the adjacency list model.
Under the first interpretation, the Max Coverage problem and the Max Unique Coverage problem when all sets have size exactly naturally generalize the problem of finding a maximum matching in an ordinary graph in the sense that if there exists a matching with at least edges, the optimum solution to either Max Coverage or Max Unique Coverage will be a matching. There is a large body of work on graph matchings in the data stream model [3, 28, 55, 66, 23, 24, 31, 44, 43, 34, 50, 51, 12, 49, 35] including work specifically on solving the problem exactly if the matching size is bounded [20, 18]. More precisely, Max Coverage corresponds to the partial vertex cover problem : what is the maximum number of edges that can be covered by selecting nodes? For larger sets, Max Coverage and Max Unique Coverage are at least as hard as finding partial vertex covers and matchings in hypergraphs.
Under the second interpretation, when all elements have multiplicity 2, the Max Unique Coverage problem corresponds to finding the capacitated maximum cut, i.e., a set of at most vertices such that the number of edges with exactly one endpoint in this set is maximized. In the offline setting, Ageev and Sviridenko  and Gaur et al.  presented a 2-approximation for this problem using linear programming and local search respectively. The (uncapacitated) maximum cut problem has been studied in the data stream model by Kapralov et al. [45, 46, 47]; a 2-approximation is trivial in logarithmic space (it suffices to count the number of edges since there is always a cut whose size is at least ) but improving on this requires space that is polynomial in the size of the graph. The capacitated problem is a special case of the problem of maximizing a non-monotone submodular function subject to a cardinality constraint. This general problem has been considered in the data stream model [8, 16, 13, 37] but in that line of work it is assumed that there is oracle access to the function being optimized, e.g., given any set of nodes, the oracle will return the number of edges cut. Alaluf et al.  presented a approximation in this setting, assuming exponential post-processing time. In contrast, our algorithm does not assume an oracle while obtaining a approximation (and also works for the more general problem Max Unique Coverage).
1.1 Our Results
Our main results are the following single-pass streaming algorithms. (Throughout we use to denote that logarithmic factors of and are being omitted.)
- (A) Bounded Set Cardinality.
If all sets have size at most , there exists a space data stream algorithm that solves Max Unique Coverage and Max Coverage exactly. We show that this is nearly optimal in the sense that any exact algorithm requires space for constant .
- (B) Bounded Multiplicity.
If every element appears in at most sets, we present the following algorithms:
(B1) Max Unique Coverage: There exists a approximation using space.
(B2) Max Coverage: There exists a approximation algorithm using space.
In contrast to the above results, when and are arbitrary, any constant pass approximation algorithm for either problem requires space . (The lower bound result by Assadi  was for the case of Max Coverage but we will explain that it also applies in the case of Max Unique Coverage.) We also generalize the lower bound for Max Coverage  to Max Unique Coverage to show that any constant-pass algorithm with an approximation better than requires space. We also present a single-pass algorithm with an approximation for Max Unique Coverage using space, i.e., the space is independent of and but the approximation factor depends on . This algorithm is a simple combination of a Max Coverage algorithm due to McGregor and Vu  and an algorithm for Max Unique Coverage in the offline setting due to Demaine et al. . Finally, our Max Coverage algorithm (B2) also yields a new multi-pass result for a parameterized version of the streaming Set Cover problem. We will also show that results (A) and (B2) can be made to handle stream deletions. The generalization for result (A) that we present requires space that scales with rather than . However, in subsequent work we have shown that space that scales with is also sufficient in the insert/delete setting.
1.2 Technical Summary and Comparisons
Our results are essentially streamable kernelization results, i.e., the algorithm “prunes” the input (in the case of Max Unique Coverage and Max Coverage this corresponds to ignoring some of the input sets) to produce a “kernel” in such a way that a) solving the problem optimally on the kernel yields a solution that is as good (or almost as good) as the optimal solution on the original input and b) the kernel can be constructed in the data stream model and is sufficiently smaller than the original input such that it is possible to find an optimal solution for the kernel in significantly less time than it would take to solve on the original input. In the field of fixed parameter tractability, the main requirement is that the kernel can be produced in polynomial time. In the growing body of work on streaming kernelization [18, 19, 17] the main requirement is that the kernel can be constructed using small space in the data stream model. Our results fit in with this line of work and the analysis requires numerous combinatorial insights into the structure of the optimum solution for Max Unique Coverage and Max Coverage.
Our technical contributions can be outlined as follows.
Result (A) relies on a key combinatorial lemma. This lemma provides a rule to discard sets such that there is an optimum solution that does not contain any of the discarded sets. Furthermore, the number of stored sets can be bounded in terms of and .
Result (B1) uses the observation that each set of any optimal solution intersects some maximal collection of disjoint sets. The main technical step is to demonstrate that storing a small number of intersecting sets, in terms of and , suffices to preserve the optimal solution.
Result (B2) is based on the very simple idea of first collecting the largest sets and then solving the problem optimally on these sets. This can be done in a space-efficient manner using an existing sketch for estimation in the case of Max Coverage. While the approach is simple, showing that it yields the required approximations requires some work that builds on a recent result by Manurangsi . We also extend the algorithm to the model where sets can be inserted and deleted.
Comparison to Related Work.
In the context of streaming algorithms, for the Max Coverage problem, McGregor and Vu  showed that any approximation better than requires space. For the more general problem of streaming submodular maximization subject to a cardinality constraint, Feldman et al.  very recently showed a stronger lower bound that any approximation better than 2 requires space. Our results provide a route to circumvent these bounds via parameterization on and .
Result (B2) also leads to a parameterized algorithm for streaming Set Cover. This new algorithm uses space which improves upon the algorithm by Har-Peled et al.  that uses space, where is an upper bound for the size of the minimum set cover, in the case . Both algorithms use passes and yield an approximation.
In the context of offline parameterized algorithms, Bonnet et al.  showed that Max Coverage is fixed-parameter tractable in terms of and . However, their branching-search algorithm cannot be implemented in the streaming setting. Misra et al.  showed that the maximum unique coverage problem in which the aim is to maximize the number of uniquely covered elements (without any restriction on the number of sets) admits a kernel of size . On the other hand, they showed that the budgeted version of this problem (where each element has a profit and each set has a cost and the goal is to maximize the profit subject to a budget constraint) is -hard when parameterized by the budget. (In the Max Unique Coverage problem that we consider, all costs and profits are one and the budget is .) In this context, our result shows that a parameterization on both the maximum set size and the budget is possible (at least when all costs and profits are unit).
2.1 Notation and Parameters
Throughout the paper, will denote the number of sets, will denote the size of the universe, and will denote the maximum number of sets that can be used in the solution. Given input sets , let
be the maximum set size and let
be the maximum number of sets that contain the same element.
Suppose is a collection of sets. We let (and ) be the set of elements covered (and uniquely covered) by an optimal solution in . Furthermore, let and . In other words, is the maximum number of elements that can be covered by sets. Similarly, is the maximum number of elements that can be uniquely covered by sets. Furthermore, let and be the set of elements covered and uniquely covered respectively by the sets in .
To ease the notation, if is a collection of sets and is a set, we often use to denote and to denote .
We use to denote the collection of all sets in the stream. Therefore, the optimal value to Max Coverage and Max Unique Coverage are and respectively.
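The two objective functions defined above can be made concrete with a small brute-force sketch. The function names (`covered`, `uniquely_covered`, `brute_force_opt`) are hypothetical and chosen only for illustration; the exhaustive search is exponential and is meant to clarify the definitions, not to suggest an algorithm.

```python
from itertools import combinations

def covered(collection):
    """Elements that lie in at least one set of the collection."""
    result = set()
    for s in collection:
        result |= s
    return result

def uniquely_covered(collection):
    """Elements that lie in exactly one set of the collection."""
    counts = {}
    for s in collection:
        for e in s:
            counts[e] = counts.get(e, 0) + 1
    return {e for e, c in counts.items() if c == 1}

def brute_force_opt(sets, k, objective):
    """Largest value of len(objective(Q)) over sub-collections Q of at most k sets."""
    best = 0
    for size in range(1, min(k, len(sets)) + 1):
        for combo in combinations(sets, size):
            best = max(best, len(objective(combo)))
    return best

# A triangle-like instance where the two objectives differ.
triangle = [frozenset({1, 2}), frozenset({2, 3}), frozenset({1, 3})]
max_cov = brute_force_opt(triangle, 2, covered)
max_unique = brute_force_opt(triangle, 2, uniquely_covered)
```

On this instance any two sets cover all three elements, but they share one element, so only two elements are uniquely covered.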
Throughout this paper, we say an algorithm is correct with high probability if the probability of failure is inversely polynomial in.
2.2 Sketches and Subsampling
Given a vector, is defined as the number of elements of which are non-zero. If given a subset , we define to be the characteristic vector of (i.e., iff ) then given sets note that is exactly the number of elements covered by . We will use the following result for estimating .
[ Sketch [21, 9]] Given a set , there exists an -space algorithm that constructs a data structure (called an sketch of ). The sketch has the property that the number of distinct elements in a collection of sets can be approximated up to a factor with probability at least provided the collection of sketches .
Note that if we set in the above result we can try each collection of sets amongst and get a approximation for the coverage of each collection with high probability.
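To illustrate what a mergeable distinct-elements sketch looks like, here is a minimal sketch based on the k-minimum-values idea. This is a stand-in chosen for brevity and is not the sketch from the cited result; the names `kmv_sketch`, `merge`, and `estimate`, the parameter `t`, and the use of SHA-256 as the hash are all assumptions made for this example.

```python
import hashlib

def _h(x):
    """Deterministic hash of x to a float in (0, 1]."""
    digest = hashlib.sha256(repr(x).encode()).digest()
    return (int.from_bytes(digest[:8], "big") + 1) / 2**64

def kmv_sketch(s, t):
    """Summarize a set by the t smallest hash values of its elements."""
    return sorted({_h(e) for e in s})[:t]

def merge(sketches, t):
    """Sketch of the union of the underlying sets: the t smallest values overall."""
    values = set()
    for sk in sketches:
        values.update(sk)
    return sorted(values)[:t]

def estimate(sketch, t):
    """Estimate the number of distinct elements summarized by the sketch."""
    if len(sketch) < t:
        return float(len(sketch))  # fewer than t distinct hashes seen: exact count
    return (t - 1) / sketch[t - 1]  # standard k-minimum-values estimator

t = 256
a = kmv_sketch(range(0, 60), t)
b = kmv_sketch(range(40, 100), t)
union_estimate = estimate(merge([a, b], t), t)
```

The point of the example is the merge step: the coverage of any collection of stored sets can be estimated from the per-set sketches alone, without revisiting the sets themselves.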
Unique Coverage Sketch.
For unique coverage, our sketch of a set corresponds to subsampling the universe via some hash function where is chosen randomly such that for each , for some appropriate value . Specifically, rather than processing an input set , we process . Note that has size in expectation. This approach was used by McGregor and Vu  in the context of Max Coverage and it extends easily to Max Unique Coverage; see Section 7. The consequence is that if there is a streaming algorithm that finds a approximation, we can turn that algorithm into a approximation algorithm in which we can assume that with high probability by running the algorithm on the subsampled sets rather than the original sets. Note that this also allows us to assume input sets have size since . Hence each “sketched” set can be stored using bits.
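The subsampling step can be sketched as follows. The helper names `keeps` and `subsample`, the `seed` parameter, and the use of SHA-256 in place of the limited-independence hash function are assumptions for illustration; the sampling probability `p` is a free parameter here because its value is not legible in the text above.

```python
import hashlib

def keeps(element, p, seed=0):
    """Hash-based coin flip: the same element always gets the same answer."""
    digest = hashlib.sha256(f"{seed}:{element!r}".encode()).digest()
    # Compare as integers to avoid floating-point rounding at p = 1.
    return int.from_bytes(digest[:8], "big") < int(p * 2**64)

def subsample(s, p, seed=0):
    """Restrict a set to the sampled part of the universe."""
    return {e for e in s if keeps(e, p, seed)}

a = set(range(0, 1000))
b = set(range(500, 1500))
sampled_a = subsample(a, 0.1)
sampled_b = subsample(b, 0.1)
```

Because the decision depends only on the element and not on the set being processed, every input set is restricted to the same sampled sub-universe, which is what allows coverage and unique coverage to be compared across sets after sampling.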
An Algorithm with Memory.
We will use the above sketches in a more interesting context later in the paper, but note that they immediately imply a trivial algorithmic result. Consider the naive algorithm that stores every set and finds the best solution; note that this requires exponential time. We note that since we can assume , each set has size at most . Hence, we need memory to store all the sets. This approach was noted in  in the context of Max Coverage but also applies to Max Unique Coverage. We will later show that for a approximation, the above trivial algorithm is optimal up to polylogarithmic factors for constant .
3 An Exact Algorithm
Our algorithm, though perhaps non-intuitive, is simple to state:
Initialize to be an empty collection of sets. Let .
Let be the sub-collection of that contains sets of size .
For each set in the stream: Suppose . Add to if there does not exist that occurs as a subset of sets of .
Post-processing: Return the best solution in .
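The steps above can be sketched in code. The per-subset thresholds are not legible in this text, so the sketch takes them as a caller-supplied function `threshold(a, t)`, where `a` is the size of the arriving set and `t` the size of the subset being tested. The example threshold `(d*(k-1)+1)**(a-t)` is an illustrative assumption, as are the function names; only Max Coverage post-processing is shown.

```python
from itertools import combinations

def build_kernel(stream, d, threshold):
    """One pass: keep a set unless some subset of it is already contained in
    threshold(a, |T|) stored sets of the same size a."""
    kernel = {a: [] for a in range(1, d + 1)}
    for s in stream:
        s = frozenset(s)
        a = len(s)
        if not 1 <= a <= d:
            continue  # outside the bounded-cardinality setting; ignored in this sketch
        blocked = any(
            sum(1 for x in kernel[a] if frozenset(t) <= x) >= threshold(a, size)
            for size in range(a + 1)
            for t in combinations(s, size)
        )
        if not blocked:
            kernel[a].append(s)
    return kernel

def best_coverage(sets, k):
    """Exhaustive post-processing: best coverage using at most k sets."""
    best = 0
    for size in range(1, min(k, len(sets)) + 1):
        for combo in combinations(sets, size):
            best = max(best, len(frozenset().union(*combo)))
    return best

d, k = 2, 2
example_threshold = lambda a, t: (d * (k - 1) + 1) ** (a - t)  # assumed, see lead-in
# A star with ten leaves plus one disjoint edge.
stream = [{0, i} for i in range(1, 11)] + [{11, 12}]
kernel = build_kernel(stream, d, example_threshold)
stored = kernel[1] + kernel[2]
```

In this example only three of the ten star edges are kept, since further edges through the center are interchangeable for any solution of at most `k` sets, while the disjoint edge is retained.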
Our algorithm relies on the following combinatorial lemma.
Let be a collection of distinct sets where each and . Suppose for all with there exist at most
sets in that contain . Furthermore, suppose there exists a set such that this inequality is tight. Then, for all disjoint from with there exists a set such that and
If , then and we can simply set . Henceforth, assume . Consider the sets in that are supersets of . Call this collection . For any , there are at most sets that include . Since there are choices for , at most
sets in contain an element in . Hence, at least one set in does not contain any element in . ∎
We show that the algorithm indeed obtains an exact kernel for the problems. Recall that is the collection of all sets in the stream, i.e., the optimal solution has size .
The output of the algorithm is optimal. In particular, and .
Recall that is the collection of all stored sets. We define
Clearly, . Now, suppose there exists such that . Let be the smallest such index. Let be an optimal solution of (note that is also an overall optimal solution based on the minimal assumption on ). Let be the th set that was not stored in . If then we have a contradiction since . Thus, assume . Suppose .
There exists in such that .
Note that was not stored because there existed such that was a subset of sets in . Consider the set . Clearly, and . By Lemma 3, there is a set in such that .
Let and Note that since . Define indicator variables iff and iff . Note that
where the last equation uses the fact that is disjoint from . Then
Thus, which is a contradiction. Hence, there is no such and the claim follows. The proof for unique coverage is almost identical: for the analogous claim we define indicator variables iff and iff . The proof goes through with and replaced by and since it is still the case that
where now the last two equations use the fact that is disjoint from . ∎
The space used by the algorithm is .
Recall that one of the requirements for a set to be added to is that the number of sets in that are supersets of any subset of of size is at most . This includes the empty subset and since every set in is a superset of the empty set, we deduce that . Since each set needs bits to store, and , the total space is . ∎
We summarize the above as a theorem.
There exist deterministic single-pass algorithms using space that yield an exact solution to Max Coverage and Max Unique Coverage.
Handling Insertion-Deletion Streams.
We outline another exact algorithm that works for insertion-deletion streams, although with a worse space bound , in Section 6.1. There exist randomized single-pass algorithms using space and allowing deletions that w.h.p. yield an exact solution to Max Coverage and Max Unique Coverage. (This improves upon our earlier result in the ICDT version of the paper that uses space.)
4 Approximation Algorithms
In this section, we present a variety of different approximation algorithms where the space used by the algorithm is independent of but, in some cases, may depend on . The first algorithm uses memory and obtains a approximation to both problems. The second algorithm uses memory and obtains a approximation to Max Coverage and a approximation to Max Unique Coverage; it can also be extended to streams with deletions.
4.1 A Approximation
Given a collection of sets , we say a sub-collection is a matching if the sets in are mutually disjoint. is a maximal matching if there does not exist such that is disjoint from all sets in . For any input , let be an optimal solution for either the Max Coverage or Max Unique Coverage problem. Let be a maximal matching amongst the input sets of size . Then every set of size in intersects with some set in .
Let have size . If it was disjoint from all sets in then it could be added to and the resulting collection would still be a matching. This violates the assumption that is maximal. ∎
The next lemma extends the above result to show that we can potentially remove many sets from each and still argue that there is an optimal solution for the original instance amongst the sets that intersect a set in some .
Consider an input of sets of size at most . For , let be a maximal matching amongst the input set of size and let be an arbitrary subset of of size . Let be the collection of all sets that intersect a set in . Then contains an optimal solution to both the Max Unique Coverage and Max Coverage problem.
If for all then the result follows from Lemma 4.1. If not, let . Let be an optimal solution and let be all the sets in of size . We know that every set in is in
Hence, the number of elements (uniquely) covered by is at most the number of elements (uniquely) covered by plus since every set in (uniquely) covers at most additional elements. But we can (uniquely) cover at least the number of elements (uniquely) covered by plus . This is because contains disjoint sets of size and at least of these are disjoint from all sets in . Hence, there is a solution amongst that is at least as good as and hence is also optimal. ∎
The above lemma suggests an exact algorithm that stores the sets in and finds the optimum solution among these sets. In particular, we construct matchings of each size greedily up to the appropriate size and store all intersecting sets. Note that since each element belongs to at most sets, the total space is . Applying the sub-sampling framework, we have and the approximation factor becomes .
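A minimal sketch of the greedy construction described above follows. The cap on the matching size is not legible in this text, so it is exposed as the hypothetical parameter `cap`; the sketch also assumes that an arriving set is compared only against the matching for sets of its own size, which is one reading of the lemma.

```python
def matching_kernel(stream, cap):
    """One pass: grow a greedy matching per set size (up to `cap` sets each)
    and store every set that joins or intersects the matching of its size."""
    matchings = {}  # set size -> list of pairwise disjoint sets
    stored = []
    for s in stream:
        s = frozenset(s)
        m = matchings.setdefault(len(s), [])
        if any(s & x for x in m):
            stored.append(s)  # intersects a matching set: keep it
        elif len(m) < cap:
            m.append(s)       # disjoint and room left: extend the matching
            stored.append(s)
        # otherwise: disjoint from a full matching, so the set is dropped
    return matchings, stored

stream = [{1, 2}, {3, 4}, {1, 3}, {5, 6}, {7, 8}]
matchings, stored = matching_kernel(stream, cap=2)
```

Here the matching for size 2 fills up with the first two sets, the third set is kept because it intersects them, and the last two sets are discarded because they are disjoint from a matching that has already reached its cap.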
There exists a randomized one-pass algorithm using space that finds a approximation to Max Unique Coverage and Max Coverage.
4.2 A More Efficient Approximation for Maximum Coverage
In this section, we generalize the approach of Manurangsi  and combine that with the -sketching technique to obtain a approximation using space for maximum coverage. This saves a factor and the generalized analysis might be of independent interest. Let denote the optimal coverage of the input stream.
Manurangsi  showed that for the maximum -vertex cover problem, the vertices with highest degrees form a approximation kernel for the maximum vertex coverage problem. That is, there exist vertices among those that cover edges. We now consider a set system in which an element belongs to at most sets (this can also be viewed as a hypergraph where each set corresponds to a vertex and each element corresponds to a hyperedge; we then want to find vertices that touch as many hyperedges as possible).
We begin with the following lemma that generalizes the aforementioned result in . We may assume that since otherwise, we can store all the sets. Suppose . Let be the collection of sets with largest sizes (tie-broken arbitrarily). There exist sets in that cover elements.
Let denote the collection of sets in some optimal solution. Let and . We consider a random subset of size . We will show that the sets in cover elements in expectation; this implies the claim.
Let denote the indicator variable for event . We rewrite
Furthermore, the probability that we pick a set in to add to is
Next, we upper bound . We have
We lower bound as follows.
In the above derivation, the second inequality follows from the observation that
The third inequality is because since each element belongs to at most sets.
For all , we must have
Putting it together,
With the above lemma in mind, the following algorithm’s correctness is immediate.
Store -sketches of the largest sets, where the failure probability of the sketches is set to .
At the end of the stream, return the sets with the largest coverage based on the estimates given by the -sketches.
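The two steps above can be sketched as follows. The number of sets to retain is not legible in this text, so it appears as the hypothetical parameter `num_keep` (assumed to be at least 1), and exact set unions stand in for the sketch-based coverage estimates; the exhaustive search over retained sets is for illustration only.

```python
import heapq
from itertools import combinations

def largest_sets(stream, num_keep):
    """One pass: retain the num_keep largest sets (earlier arrivals win ties)."""
    heap = []  # min-heap of (size, -arrival index, set); smallest kept set on top
    for i, s in enumerate(stream):
        item = (len(s), -i, frozenset(s))
        if len(heap) < num_keep:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            heapq.heapreplace(heap, item)
    return [s for _, _, s in heap]

def best_k(sets, k):
    """Post-processing: the k retained sets with the largest exact coverage."""
    size = min(k, len(sets))
    best_value, best_combo = 0, ()
    for combo in combinations(sets, size):
        value = len(frozenset().union(*combo))
        if value > best_value:
            best_value, best_combo = value, combo
    return best_value, best_combo

stream = [{1}, {1, 2, 3}, {4, 5}, {2}, {6, 7, 8, 9}]
kept = largest_sets(stream, num_keep=3)
value, solution = best_k(kept, k=2)
```

Since coverage is monotone, searching only collections of exactly `k` retained sets (or all of them, if fewer are retained) loses nothing in the post-processing step.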
We restate our result as a theorem.
There exists a randomized one-pass, -space, algorithm that with high probability finds a approximation to Max Coverage.
Obtaining a approximation to Max Unique Coverage.
We note that finding the best solution to Max Unique Coverage in will yield a approximation. This is a worse approximation than that of the previous subsection. However, we save a factor of in memory. Furthermore, this approach also allows us to handle streams with deletions.
To see that we get a approximation to Max Unique Coverage, note that . Furthermore, a similar derivation shows . Specifically, in the derivation in Eq. 1, we can simply replace with . This gives us .
Extension to Insert/Delete Streams.
The result can be extended to the case where sets are inserted and deleted. For the full details, see Section 6.2.
4.3 An Approximation for Unique Coverage
We now present an algorithm whose space does not depend on but the result comes at the cost of increasing the approximation factor to . It also has the feature that the running time is polynomial in in addition to being polynomial in and .
The basic idea is as follows: We consider an existing algorithm that first finds a 2.01 approximation to Max Coverage. It then finds the best solution of Max Unique Coverage among the sets in .
There exists a randomized one-pass, -space, algorithm that with high probability finds a approximation to Max Unique Coverage.
From previous work [61, 8], we can find a approximation to Max Coverage using memory. Note that their algorithm maintains a collection of sets during the stream. Demaine et al.  proved that if is the best solution to Max Unique Coverage among the sets in , then is an approximation to Max Unique Coverage. In fact, they presented a polynomial time algorithm to find from such that the number of uniquely covered elements is at least
Note that storing each set in requires memory. Hence, the total memory is . Applying the sub-sampling framework, we obtain an memory algorithm.
4.4 Application to Parameterized Set Cover
We parameterize the set cover problem as follows. Given a set system, either A) output a set cover of size if where the approximation factor or B) correctly declare that a set cover of size does not exist.
For , there exists a randomized, -pass, -space, algorithm that with high probability finds a approximation to the parameterized Set Cover problem.
In each pass, we run the algorithm in Theorem 4.2 with parameters and on the remaining uncovered elements. The space used is . Here, we need additional space to keep track of the remaining uncovered elements.
Note that if , after each pass, the number of uncovered elements is reduced by a factor . This is because if is the number of uncovered elements at the beginning of a pass, then after that pass, we cover all but at most of those elements. After passes, the number of remaining uncovered elements is ; we therefore use at most passes until we are done. At the end, we have a set cover of size .
If after passes, there are still remaining uncovered elements, we declare that such a solution does not exist. ∎
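The multi-pass scheme in the proof can be sketched as a loop around a one-pass Max Coverage routine. In the sketch below a simple greedy routine stands in for the algorithm of Theorem 4.2, and the names `greedy_pass`, `multipass_set_cover`, and `max_passes` are hypothetical; the pass budget is a free parameter because its value is not legible in this text.

```python
def greedy_pass(sets, uncovered, k):
    """Stand-in for one Max Coverage pass: greedily pick up to k sets
    with respect to the currently uncovered elements."""
    chosen, remaining = [], set(uncovered)
    for _ in range(k):
        best = max(sets, key=lambda s: len(s & remaining), default=None)
        if best is None or not (best & remaining):
            break
        chosen.append(best)
        remaining -= best
    return chosen

def multipass_set_cover(sets, universe, k, max_passes, one_pass=greedy_pass):
    """Repeat a Max Coverage pass on the uncovered elements.
    Returns a cover, or None if the pass budget runs out first."""
    uncovered, cover = set(universe), []
    for _ in range(max_passes):
        if not uncovered:
            break
        picked = one_pass(sets, uncovered, k)
        for s in picked:
            uncovered -= s
        cover.extend(picked)
    # With the greedy stand-in, None only means the budget was exhausted;
    # the "no small cover exists" conclusion relies on the paper's guarantee.
    return cover if not uncovered else None

sets = [{1, 2, 3}, {4, 5}, {6}, {1, 4}]
universe = {1, 2, 3, 4, 5, 6}
cover = multipass_set_cover(sets, universe, k=1, max_passes=3)
```

Each pass adds at most `k` sets, so the size of the returned cover is at most `k` times the number of passes used.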
Our algorithm improves upon the algorithm by Har-Peled et al.  that uses space for when . Both algorithms yield an approximation and use passes.
5 Lower Bounds
5.1 Lower Bounds for Exact Solutions
As observed earlier, any exact algorithm for either the Max Coverage or Max Unique Coverage problem on an input where all sets have size will return a matching of size if one exists. However, by a lower bound due to Chitnis et al.  we know that determining if there exists a matching of size in a single pass requires space. This immediately implies the following theorem.
Any single-pass algorithm that solves Max Coverage or Max Unique Coverage exactly with probability at least requires space.
5.2 Lower bound for a approximation
The strategy is similar to previous work on Max Coverage [60, 61]. However, we need to argue that the relevant probabilistic construction works for all collections of fewer than sets since the unique coverage function is not monotone.
We make a reduction from the communication problem -player set disjointness, denoted by . In this problem, there are players where the th player has a set . It is promised that exactly one of the following two cases happens a) NO instance: All the sets are pairwise disjoint and b) YES instance: There is a unique element such that for all and all other elements belong to at most one set. The (randomized) communication complexity (in the one-way model or the blackboard model), for some large enough constant success probability, of the above problem is even if the players may use public randomness . We can assume that
via a padding argument.
Any constant-pass randomized algorithm with an approximation better than to Max Unique Coverage requires space.
For each , let be a random partition of into sets such that an element in the universe belongs to exactly one of these sets uniformly at random. In particular, for all and ,
The partitions are chosen independently using public randomness before receiving the input. For each player , if , then they put in the stream. Note that the stream consists of sets.
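The stream construction used in this reduction can be sketched as follows. The function name `build_stream`, the 0-based indexing, and the use of a seeded pseudo-random generator in place of public randomness are assumptions for illustration; `universe_size` is a free parameter of the sketch.

```python
import random

def build_stream(player_sets, n, universe_size, seed=0):
    """For each i in range(n), draw a random partition of the universe into
    one block per player; player j contributes block j of partition i for
    every i in their input set."""
    rng = random.Random(seed)  # stands in for shared public randomness
    t = len(player_sets)
    # part[i][u] is the index of the block of partition i that contains u.
    part = [[rng.randrange(t) for _ in range(universe_size)] for _ in range(n)]
    stream = []
    for j, a_j in enumerate(player_sets):
        for i in sorted(a_j):
            block = frozenset(u for u in range(universe_size) if part[i][u] == j)
            stream.append(block)
    return stream

# A YES instance: element 0 is held by every player, all others by one player.
players = [{0, 1}, {0, 2}, {0, 3}]
stream = build_stream(players, n=4, universe_size=50)
shared_blocks = [stream[0], stream[2], stream[4]]  # the blocks for element 0
```

In a YES instance the blocks contributed for the shared element form a complete partition, so together they uniquely cover the whole universe.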
If the input is a NO instance, then for each , there is at most one set in the stream. Therefore, for each element and any collection of sets in the stream,
Therefore, in expectation, . By an application of Hoeffding’s inequality,
The last inequality follows by letting . The following claim shows that for large , in expectation, picking sets is optimal in terms of unique coverage. The function is increasing in the interval and decreasing in the interval .
We take the partial derivative of with respect to
and observe that it is non-negative if and only if . ∎
By appealing to the union bound over all possible collections sets, we deduce that with high probability, for all collections of sets ,
If the input is a YES instance, then clearly, the maximum -unique coverage is . This is because there exists such that and therefore are in the stream and these sets uniquely cover all elements.
Therefore, any constant pass algorithm that returns better than a approximation to Max Unique Coverage for some large enough constant success probability implies a protocol to solve . Thus, space is required. ∎
5.3 Lower bound for approximation
Assadi  presents a lower bound for the space required to compute a approximation for Max Coverage when , even when the stream is in a random order and the algorithm is permitted constant passes. This is proved via a reduction to multiple instances of the Gap-Hamming Distance problem on a hard input distribution, where an input with high maximum coverage corresponds to a YES answer for some Gap-Hamming Distance instance, and a low maximum coverage corresponds to a NO answer for all GHD instances. This hard distribution has the additional property that high maximum coverage inputs also have high maximum unique coverage, and low maximum coverage inputs have low maximum unique coverage. Therefore, the following corollary holds:
Any constant-pass randomized algorithm with an approximation factor for Max Unique Coverage requires space.
6 Handling Insert-Delete Streams
6.1 Proof of Theorem 3
Consider coloring the elements of a universe with a -wise hash-function such that each element is equally likely to get one of colors.
We say a set has color if the colors of its elements are all different and form the set . Then, via sampling , use space to sample a set (if one exists) that is colored (i.e., for each color in there is exactly one element in the sampled set with this color) for each subset of size at most .
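The coloring and per-color-set sampling can be sketched as follows in the insertion-only case. A lazily filled random table stands in for the limited-independence hash function, and keeping the first set seen for each color set stands in for the sampling primitive that handles deletions; the names `color_and_sample` and `num_colors` are hypothetical.

```python
import random

def color_and_sample(sets, num_colors, seed=0):
    """Color each element at random and keep one representative set for every
    color set that is realized by a set whose elements get distinct colors."""
    rng = random.Random(seed)
    color = {}

    def color_of(e):
        if e not in color:
            color[e] = rng.randrange(num_colors)
        return color[e]

    representatives = {}  # frozenset of colors -> one set with exactly those colors
    for s in sets:
        colors = [color_of(e) for e in sorted(s)]
        if len(set(colors)) == len(colors):  # all elements colored differently
            representatives.setdefault(frozenset(colors), frozenset(s))
    return color, representatives

sets = [{1, 2}, {2, 3}, {4, 5, 6}, {7}]
color, reps = color_and_sample(sets, num_colors=8)
```

The number of retained sets is bounded by the number of possible color sets rather than by the length of the stream, which is where the space saving comes from.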
Let be a collection of at most sets where each set has size at most . Say a set in is good with respect to if the elements of receive different colors and they are all different from the colors received by elements in .
For any good set in the collection, let be the set found by the sampling algorithm that is colored the same as set . We call the replacement for .
Removing sets that are good with respect to (w.r.t.) from and replacing them by yields a new collection that (uniquely) covers at least the same number of elements as .
Let be the set of colors used to color elements in and let be the set of colors used to color elements in . Because are good sets, and . After replacing by , the multiplicity of an element with a color in is unchanged. For any color in , let be the element in with this color. There will be at least one element with the same color as after the collection is transformed. It follows that the coverage of the collection does not decrease: the removal of reduces the coverage by at most but adding increases the coverage by at least . To argue that the unique coverage of the collection does not decrease, note that if had multiplicity 1 then the element with the same color as after the transformation also has multiplicity 1.
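The replacement argument can be checked on small instances with the following Python sketch. It assumes that "good" means the colors inside the set are pairwise distinct and disjoint from the colors of every element of the other sets in the collection. The function names and the `replacements` dictionary (color set to stored set) are hypothetical.

```python
from collections import Counter

def coverage(collection):
    """Number of elements covered by at least one set."""
    return len(set().union(*collection))

def unique_coverage(collection):
    """Number of elements covered by exactly one set."""
    counts = Counter(e for s in collection for e in s)
    return sum(1 for c in counts.values() if c == 1)

def is_good(i, collection, coloring):
    """Assumed reading of 'good': the elements of collection[i] get
    distinct colors, none of which appears on an element of another set."""
    own = [coloring[e] for e in collection[i]]
    if len(set(own)) != len(own):
        return False
    others = {coloring[e]
              for j, t in enumerate(collection) if j != i
              for e in t}
    return others.isdisjoint(own)

def replace_good_sets(collection, replacements, coloring):
    """Swap every good set for the stored set with the same color set.

    Goodness is always evaluated against the original collection.
    A good set with no stored replacement is kept unchanged.
    """
    out = []
    for i, s in enumerate(collection):
        if is_good(i, collection, coloring):
            sig = frozenset(coloring[e] for e in s)
            out.append(replacements.get(sig, s))
        else:
            out.append(s)
    return out
```

Comparing `coverage` and `unique_coverage` before and after `replace_good_sets` on a few hand-built colorings is a quick sanity check of the claim, not a substitute for the proof.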
For any , .
First note that a set is not good if one of its elements shares a color with an element in that set or in another set in the collection. By the union bound,
Hence, for any subset of , and the lemma follows via Markov's inequality. ∎
After repeating the random coloring and sampling times, we have a collection of sets that includes the collection of size at most that (uniquely) covers the maximum number of elements.
For the sake of analysis, let be a collection of at most sets with optimum (unique) coverage. Let .
Randomly color elements. Let be the collection formed from by replacing all sets in that are good sets w.r.t. by their replacements. Remove all good sets (w.r.t. ) from .
Randomly color elements. Let be the collection formed from by replacing all sets in that are good sets w.r.t. by their replacements. Remove all good sets (w.r.t. ) from .
…continue in this way for steps.
In each step, the size of decreases by a constant factor with constant probability by appealing to Lemma 6.1. Hence after steps . Note that the (unique) coverage of is at least the (unique) coverage of by Lemma 6.1. ∎
Noting that the colorings/sampling can be performed in parallel, we have a single-pass algorithm.
6.2 Handling deletions for the algorithm in Theorem 4.2
We now explain how the approach used in Theorem 4.2 can be extended to the case where sets may be inserted and deleted. In this setting, it is not immediately obvious how to select the largest sets; the approach used when sets are only inserted does not extend. Note that in this model we can set to be the maximum number of sets that have been inserted and not deleted at any prefix of the stream rather than the total number of sets inserted/deleted.
However, we can extend the result as follows. Suppose the sketch of a set for approximating maximum (unique) coverage requires bits; recall from Section 2.2 that suffices. We can encode such a sketch of a set as an integer . Suppose we know that exactly sets have size at least some threshold . We will remove this assumption shortly. Consider the vector where that is initially 0 and then is updated by a stream of set insertions/deletions as follows:
When is inserted, if , then .
When is deleted, if , then .
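The two update rules above can be sketched in Python as follows. This is an illustration under stated assumptions: the bitmask encoding of a set over the universe {0, 1, ...} is a hypothetical stand-in for the integer encoding of a set's sketch, the function names are invented, and the vector is held explicitly as a dictionary, which does not respect the space bound discussed in the text.

```python
from collections import defaultdict

def encode(s):
    """Hypothetical encoding of a set of non-negative integers as a bitmask."""
    code = 0
    for e in s:
        code |= 1 << e
    return code

def decode(code):
    """Inverse of encode."""
    return frozenset(i for i in range(code.bit_length()) if (code >> i) & 1)

def apply_stream(updates, threshold):
    """Maintain the vector indexed by encodings under insertions/deletions.

    updates: iterable of ("insert" | "delete", set) pairs.
    Only sets of size at least `threshold` touch the vector.
    Returns the non-zero coordinates at the end of the stream.
    """
    x = defaultdict(int)
    for op, s in updates:
        if len(s) < threshold:
            continue
        x[encode(s)] += 1 if op == "insert" else -1
    return {code: v for code, v in x.items() if v != 0}
```

Decoding the non-zero coordinates recovers the surviving large sets. Because every update is an addition to a single coordinate, the same stream could instead be fed to any linear sketch of the vector.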
At the end of this process , , and we can reconstruct the sketches of largest sets given . Unfortunately, storing explicitly in small space is not possible since, while we are promised that at the end of the stream , during the stream it could be that is an arbitrary binary string with ones and this requires memory to store. To get around this, it is sufficient to maintain a linear sketch of itself that supports sparse recovery. For our purposes, the CountMin Sketch  is sufficient although other approaches are possible. The CountMin Sketch allows to be reconstructed with probability using a sketch of size