
Improved Multi-Pass Streaming Algorithms for Submodular Maximization with Matroid Constraints

We give improved multi-pass streaming algorithms for the problem of maximizing a monotone or arbitrary non-negative submodular function subject to a general p-matchoid constraint in the model in which elements of the ground set arrive one at a time in a stream. The family of constraints we consider generalizes both the intersection of p arbitrary matroid constraints and p-uniform hypergraph matching. For monotone submodular functions, our algorithm attains a guarantee of p+1+ε using O(p/ε) passes and requires storing only O(k) elements, where k is the maximum size of a feasible solution. This immediately gives O(1/ε)-pass algorithms that attain a (2+ε)-approximation for monotone submodular maximization in a matroid and a (3+ε)-approximation for monotone submodular matching. Our algorithm is oblivious to the choice of ε and can be stopped after any number of passes, delivering the appropriate guarantee. We extend our techniques to obtain the first multi-pass streaming algorithm for general, non-negative submodular functions subject to a p-matchoid constraint with a number of passes independent of the size of the ground set and k. We show that a randomized O(p/ε)-pass algorithm storing O(p^3 k log(k)/ε^3) elements gives a (p+1+γ̅+O(ε))-approximation, where γ̅ is the guarantee of the best-known offline algorithm for the same problem.





1 Introduction

Many discrete optimization problems in theoretical computer science, operations research, and machine learning can be cast as special cases of maximizing a submodular function subject to some constraint. Formally, a function $f : 2^X \to \mathbb{R}_{\ge 0}$ is submodular if and only if $f(A) + f(B) \ge f(A \cup B) + f(A \cap B)$ for all $A, B \subseteq X$. One reason for the ubiquity of submodularity in optimization settings is that it also captures a natural “diminishing returns” property. Let $f(e \mid A) = f(A + e) - f(A)$ be the marginal increase obtained in $f$ when adding an element $e$ to a set $A$ (where here and throughout we use the shorthands $A + e$ and $A - e$ for $A \cup \{e\}$ and $A \setminus \{e\}$, respectively). It is well-known that $f$ is submodular if and only if $f(e \mid A) \ge f(e \mid B)$ for any $A \subseteq B$ and any $e \notin B$. If additionally we have $f(e \mid A) \ge 0$ for all $A$ and $e \notin A$, we say that $f$ is monotone.
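The two definitions above can be checked by brute force on a small instance. The following is a minimal sketch, assuming a toy coverage function; the instance `COVERS` and all helper names are hypothetical and chosen only for illustration.

```python
from itertools import combinations

# Hypothetical toy instance: each element covers a set of items.
COVERS = {"a": {1, 2}, "b": {2, 3}, "c": {3, 4, 5}, "d": {1, 5}}


def f(S):
    """Coverage function: number of items covered by the elements of S."""
    covered = set()
    for e in S:
        covered |= COVERS[e]
    return len(covered)


def marginal(e, A):
    """Marginal gain f(e | A) = f(A + e) - f(A)."""
    A = frozenset(A)
    return f(A | {e}) - f(A)


def subsets(ground):
    """Enumerate all subsets of the ground set as frozensets."""
    items = sorted(ground)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)


def has_diminishing_returns(ground):
    """Check f(e | A) >= f(e | B) for all A subset of B and all e outside B."""
    for B in subsets(ground):
        for A in subsets(B):
            for e in ground - B:
                if marginal(e, A) < marginal(e, B):
                    return False
    return True


def is_monotone(ground):
    """Check f(e | A) >= 0 for all A and all e outside A."""
    return all(marginal(e, A) >= 0 for A in subsets(ground) for e in ground - A)
```

The enumeration is exponential in the size of the ground set, so this is only a sanity check for tiny instances, not something a streaming algorithm would do.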

Here, we consider the problem of maximizing both monotone and arbitrary submodular functions subject to an arbitrary $p$-matchoid constraint on the set of elements that can be selected. Formally, a $p$-matchoid on $X$ is given by a collection of matroids $\{M_i = (X_i, \mathcal{I}_i)\}$, each defined on some subset $X_i$ of $X$, where each $e \in X$ is present in at most $p$ of these subsets. A set $S \subseteq X$ is then independent if and only if $S \cap X_i \in \mathcal{I}_i$ for each matroid $M_i$. One can intuitively think of a $p$-matchoid as a collection of matroids in which each element “participates” in at most $p$ of the matroid constraints. The resulting family of constraints is quite general and captures both intersections of $p$ matroid constraints (by letting $X_i = X$ for all $i$) and matchings in $p$-uniform hypergraphs (by considering $X$ as a collection of hyperedges and defining a uniform matroid constraint for each vertex, ensuring that at most one hyperedge containing this vertex is selected).
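As an illustration of the hypergraph-matching special case, here is a minimal sketch of an independence oracle for a matchoid built from one rank-1 uniform matroid per vertex. Representing a matroid as a pair (ground set, independence predicate) is an assumption made for this example, and all function names are hypothetical.

```python
def make_uniform_matroid(ground, capacity):
    """Return (ground, oracle) for the uniform matroid: sets of size <= capacity."""
    ground = frozenset(ground)
    return ground, (lambda T: len(T) <= capacity)


def matching_matchoid(edges):
    """One rank-1 uniform matroid per vertex, defined on the (hyper)edges touching it."""
    by_vertex = {}
    for edge in edges:
        for v in edge:
            by_vertex.setdefault(v, set()).add(edge)
    return [make_uniform_matroid(es, 1) for es in by_vertex.values()]


def is_independent(S, matroids):
    """S is independent iff its restriction to each ground set is independent there."""
    S = frozenset(S)
    return all(oracle(S & ground) for ground, oracle in matroids)


def participation(element, matroids):
    """Number of matroids whose ground set contains the element."""
    return sum(1 for ground, _ in matroids if element in ground)
```

For an ordinary graph every edge touches two vertices, so each element participates in exactly two matroids and the construction is a 2-matchoid; with hyperedges of size $p$ the same code yields a $p$-matchoid.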

In many applications of submodular optimization, such as summarization [Badanidiyuru:2014ib, lin:2010wpa, mirzasoleiman2016fast, mirzasoleiman18streaming], we must process datasets so large that they cannot be stored in memory. Thus, there has been recent interest in streaming algorithms for submodular optimization problems. In this context, we suppose the ground set $X$ is initially unknown and elements arrive one-by-one in a stream. We suppose that the algorithm has an efficient oracle for evaluating the submodular function on any given subset of $X$, but has only enough memory to store a small number of elements from the stream. Variants of standard greedy and local search algorithms have been developed that obtain a constant-factor approximation in this setting, but their approximation guarantees are considerably worse than those of their simple, offline counterparts.

Here, we consider the multi-pass setting in which the algorithm is allowed to perform several passes over a stream: in each pass all of $X$ arrives in some order, and the algorithm is still only allowed to store a small number of elements. In the offline setting, simple variants of greedy [FisherNemhauserWolsey] or local search [Feldman2011, Lee:2010] algorithms in fact give the best-known approximation guarantees for maximizing submodular functions subject to $p$ matroid constraints or a general $p$-matchoid constraint. However, these algorithms potentially require considering all elements in $X$ each time a choice is made. It is natural to ask whether this is truly necessary, or whether we could instead recover an approximation ratio nearly equal to that of these offline algorithms by using only a constant number of passes through the data stream.

1.1 Our Results

Here we show that for monotone submodular functions, -passes suffice to obtain guarantees only times worse than those guaranteed by the offline local search algorithm. We give an $O(p/\varepsilon)$-pass streaming algorithm that gives a $p+1+\varepsilon$ approximation for maximizing a monotone submodular function subject to an arbitrary $p$-matchoid constraint. It immediately gives us an $O(1/\varepsilon)$-pass streaming algorithm attaining a $2+\varepsilon$ approximation for matroid constraints and a $3+\varepsilon$ approximation for matching constraints in graphs. Each pass of our algorithm is equivalent to a single pass of the streaming local search algorithm described by Chakrabarti and Kale [DBLP:journals/mp/ChakrabartiK15] and Chekuri, Gupta, and Quanrud [DBLP:conf/icalp/ChekuriGQ15]. However, obtaining a rapid convergence to a $p+1+\varepsilon$ approximation requires some new insights. We show that if a pass makes either large or small progress in the value of $f(S)$, then the guarantee obtained at the end of this pass can be improved. Balancing these two effects then leads to a carefully chosen sequence of parameters for each pass. Our general approach is similar to that of Chakrabarti and Kale [DBLP:journals/mp/ChakrabartiK15], but our algorithm is oblivious to the choice of $\varepsilon$. This allows us to give a uniform bound on the convergence of the approximation factor obtained after some number of passes. This bound is actually available to the algorithm, and so we can certify the quality of the current solution after each pass. In practice, this allows for terminating the algorithm early if a sufficient guarantee has already been obtained. Even in the worst case, however, we improve on the number of passes required by similar previous results by a factor of . Our algorithm only requires storing $O(k)$ elements, where $k$ is the rank of the given $p$-matchoid, defined as the size of the largest independent set of elements.

Building on these ideas, we also give a randomized, multi-pass algorithm that uses $O(p/\varepsilon)$ passes and attains a $p+1+\bar{\gamma}+O(\varepsilon)$ approximation for maximizing an arbitrary submodular function subject to a $p$-matchoid constraint, where $\bar{\gamma}$ is the approximation ratio attained by the best-known offline algorithm for the same problem. To the best of our knowledge, ours is the first multipass algorithm when the function is non-monotone with a number of passes independent of $n$ and $k$, where $n$ is the size of the ground set. In this case, our algorithm requires storing $O(p^3 k \log(k)/\varepsilon^3)$ elements. We remark that to facilitate comparison with existing work, we have stated all approximation guarantees as factors . However, we note that if one states ratios of the form less than 1, then our results lead to approximations in which all dependence on $p$ can be eliminated (by simply selecting some ).

1.2 Related Work

Current State of the Art

Constraint | Offline, M | Offline, NN | Streaming, M | Streaming, NN
matroid | [Calinescu:2011ju] | [Buchbinder:2016] | [DBLP:journals/mp/ChakrabartiK15, DBLP:conf/icalp/ChekuriGQ15, Feldman2018DoLess] | [Feldman2018DoLess]
$p$-hyp.m | [Feldman2011] | [Feldman2011] | [DBLP:conf/icalp/ChekuriGQ15, Feldman2018DoLess] | [Feldman2018DoLess]
$p$-matroid intersection | [Lee:2010] | [Lee:2010] | [DBLP:journals/mp/ChakrabartiK15, DBLP:conf/icalp/ChekuriGQ15, Feldman2018DoLess] | [Feldman2018DoLess]
$p$-matchoid | [Badanidiyuru:2013jc, FisherNemhauserWolsey] | [DBLP:journals/siamcomp/ChekuriVZ14, DBLP:conf/focs/FeldmanNS11] | [DBLP:conf/icalp/ChekuriGQ15, Feldman2018DoLess] | [Feldman2018DoLess]
Table 1: Approximation ratio in offline and streaming setting
  Multipass   Our results  
Constraint   M -passes NN   M NN -passes  


matroid   [DBLP:journals/mp/ChakrabartiK15] [DBLP:journals/mp/ChakrabartiK15]    
  [DBLP:journals/mp/ChakrabartiK15] [DBLP:journals/mp/ChakrabartiK15]      
  [DBLP:journals/mp/ChakrabartiK15] [DBLP:journals/mp/ChakrabartiK15]    
Table 2: Summary of results for maximizing a submodular function in the multipass streaming setting.

We use the following abbreviations: M means monotone and NN means that is non-negative. means -matroid intersection and -hyp.m denotes rank -hypergraph -matching.
: If we restrict ourselves with algorithms performing -passes then only the -pass setting is understood.

There is a vast literature on submodular maximization with various constraints and different models of computation. In the offline model, the work on maximizing a monotone submodular function goes back to Nemhauser, Wolsey and Fisher [Nemhauser:1978dm]. Monotone submodular functions are well studied and many new and powerful results have been obtained since then. The best approximation algorithm under a matroid constraint is due to Calinescu et al. [Calinescu:2011ju], which is the best that can be done using a polynomial number of queries [Nemhauser:1978dm] (if $f$ is given as a value oracle) or assuming $P \neq NP$ [Feige1998] (if $f$ is given explicitly). For more general constraints, Lee, Sviridenko and Vondrák obtained a $p+\varepsilon$ approximation algorithm under a $p$-matroid intersection constraint [Lee:2010]. Feldman et al. [Feldman2011] obtained the same approximation ratio for the general class of $p$-exchange systems. For general $p$-matchoid constraints, the best approximation ratio is $p+1$, which is attained by the standard greedy algorithm [FisherNemhauserWolsey].

Non-monotone objectives are less understood even under the simplest assumptions. The current best-known result for maximizing a submodular function under a matroid constraint is [Buchbinder:2016], which is far from the hardness result [Gharan:2011]. Table 1 gives the best known bounds for the constraints that we consider in the paper.

Due to the large volume of data in modern applications, there has also been a line of research focused on developing fast algorithms for submodular maximization [Badanidiyuru:2013jc, Mirzasoleiman:2015]. However, all results we have discussed so far assume that the entire instance is available at any time, which may not be feasible for massive datasets. This has motivated the study of streaming submodular maximization algorithms with low memory requirements. Badanidiyuru et al. [Badanidiyuru:2014ib] achieved a $2+\varepsilon$ approximation algorithm for maximizing a monotone submodular function under a cardinality constraint in the streaming setting. This was recently shown to be the best possible bound attainable in one pass with memory sublinear in the size of the instance [Feldman2020]. Chakrabarti and Kale [DBLP:journals/mp/ChakrabartiK15] gave a $4p$ approximation for a $p$-matroid intersection constraint or $p$-uniform hypergraph matching. Later, Chekuri et al. [DBLP:conf/icalp/ChekuriGQ15] generalized their argument to arbitrary $p$-matchoid constraints, and also gave a modified algorithm for handling non-monotone submodular objectives. A fast, randomized variant of the algorithm of [DBLP:journals/mp/ChakrabartiK15] was studied by Feldman, Karbasi and Kazemi [Feldman2018DoLess], who showed that it has the same approximation guarantee when $f$ is monotone and achieves a approximation for general submodular functions. Related to our work, there is an active research direction focusing on streaming (sub)modular maximization subject to matching constraints. For submodular maximization, the best approximation is and for monotone and non-monotone functions respectively [Levin:2020:Streaming].

When multiple passes through the stream are allowed, less is known and the tradeoff between the approximation guarantee and the number of passes requires more attention. Assuming cardinality constraints, one can obtain a multipass streaming algorithm in -passes (see [Badanidiyuru:2013jc, HuangKakimuraMultiPass2018, mcGregor2017, mirzasoleiman2016fast, Ashkan2018]). Huang et al. [HuangKakimuraMultiPass2018] achieved a approximation under a knapsack constraint in passes. For the intersection of $p$ partition matroids or rank-$p$ hypergraph matching, the number of passes becomes dependent on $p$. Chakrabarti and Kale [DBLP:journals/mp/ChakrabartiK15] showed that if one allows -passes, a $p+1+\varepsilon$ approximation is possible. (Footnote 3: In [DBLP:journals/mp/ChakrabartiK15] a bound of is stated. We note that there appears to be a small oversight in their analysis, arising from the fact that their convergence parameter in this case is . In any case, it seems reasonable to assume that $p$ is a small constant in most cases.) Here we show how to obtain the same guarantee for an arbitrary $p$-matchoid constraint, while reducing the number of passes to $O(p/\varepsilon)$.

2 The main multi-pass streaming algorithm

For monotone functions, our main multi-pass algorithm is given by the procedure MultipassLocalSearch in Algorithm 1. We suppose that we are given a submodular function $f$ and a $p$-matchoid constraint on $X$ given as a collection of matroids $\{M_i = (X_i, \mathcal{I}_i)\}$. Our procedure runs for $d$ passes, each of which uses a modification of the algorithm of Chekuri, Gupta, and Quanrud [DBLP:conf/icalp/ChekuriGQ15], given as the procedure StreamingLocalSearch. In each pass, procedure StreamingLocalSearch maintains a current solution $S$, which is initially set to some $S_{\mathrm{init}}$. Whenever an element $e \in S_{\mathrm{init}}$ arrives again in the subsequent stream, the procedure simply discards $e$. For all other elements $e$, the procedure invokes a helper procedure Exchange, given formally in Algorithm 2, to find an appropriate set $C \subseteq S$ of up to $p$ elements so that $S - C + e$ is independent. It then exchanges $e$ with $C$ if this gives a significantly improved solution.

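procedure MultipassLocalSearch($\alpha, \beta_1, \ldots, \beta_d$)
       $S_0 \leftarrow \emptyset$;
       for $i = 1$ to $d$ do
             Let $\tilde{S}$ be the output of StreamingLocalSearch($S_{i-1}, \alpha, \beta_i$);
             $S_i \leftarrow \tilde{S}$;
       return $S_d$;
procedure StreamingLocalSearch($S_{\mathrm{init}}, \alpha, \beta$)
       $S \leftarrow S_{\mathrm{init}}$;
       foreach $x$ in the stream do
             if $x \in S_{\mathrm{init}}$ then discard $x$;
             $C \leftarrow$ Exchange($x, S$);
             if $f(x \mid S) \ge \alpha + (1+\beta) \sum_{e \in C} \nu(e, S)$ then $S \leftarrow S + x - C$;
       return $S$;
Algorithm 1 The multi-pass streaming local search algorithm
procedure Exchange($x, S$)
       $C \leftarrow \emptyset$;
       foreach $M_\ell = (X_\ell, \mathcal{I}_\ell)$ with $x \in X_\ell$ do
             $S_\ell \leftarrow S \cap X_\ell$;
             if $S_\ell + x \notin \mathcal{I}_\ell$ then $C \leftarrow C + \arg\min \{ \nu(t, S) : t \in S_\ell,\ S_\ell - t + x \in \mathcal{I}_\ell \}$;
       return $C$;
Algorithm 2 The procedure Exchange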
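The following sketch shows one way the two procedures could be put together. It rests on several assumptions that are not confirmed by the text: incremental values are taken to be the marginal gain over earlier-arriving members of the solution, the acceptance test is read as "marginal gain at least $\alpha$ plus $(1+\beta)$ times the incremental values of the exchanged elements", matroids are represented as (ground set, independence predicate) pairs, and the stream is held in a list for convenience rather than read once per pass. All function names are hypothetical.

```python
def incremental(f, e, S, pos):
    """Gain of e over the members of S that precede it in the pretend order."""
    earlier = frozenset(s for s in S if pos[s] < pos[e])
    return f(earlier | {e}) - f(earlier)


def exchange(x, S, matroids, f, pos):
    """Return a set C with S - C + x independent, or None if x is dependent on its own."""
    C = set()
    for ground, is_indep in matroids:
        if x not in ground:
            continue
        S_l = S & ground
        if is_indep(S_l | {x}):
            continue
        # Elements whose removal restores independence in this matroid.
        T = [y for y in S_l if is_indep((S_l - {y}) | {x})]
        if not T:
            return None
        C.add(min(T, key=lambda t: incremental(f, t, S, pos)))
    return C


def streaming_local_search(stream, s_init, f, matroids, alpha, beta):
    """One pass; s_init is a list given in pretend-arrival order."""
    init = set(s_init)
    order = list(s_init) + [x for x in stream if x not in init]
    pos = {x: i for i, x in enumerate(order)}
    S = set(init)
    for x in stream:
        if x in init:
            continue  # already part of the starting solution
        C = exchange(x, S, matroids, f, pos)
        if C is None:
            continue
        threshold = alpha + (1 + beta) * sum(incremental(f, c, S, pos) for c in C)
        if f(S | {x}) - f(S) >= threshold:
            S = (S - C) | {x}
    return sorted(S, key=pos.get)


def multipass_local_search(stream, f, matroids, alpha, betas):
    """Run one pass per entry of betas, feeding each output into the next pass."""
    solution = []
    for beta in betas:
        solution = streaming_local_search(stream, solution, f, matroids, alpha, beta)
    return solution


def matching_matchoid(edges):
    """Toy constraint: at most one selected edge per vertex."""
    by_vertex = {}
    for edge in edges:
        for v in edge:
            by_vertex.setdefault(v, set()).add(edge)
    return [(frozenset(es), lambda T: len(T) <= 1) for es in by_vertex.values()]
```

On a weighted path with edge weights 2, 3, 2 and parameters $\alpha = 0$, $\beta = 1$, the middle edge is rejected on arrival because its gain of 3 is below $(1+\beta) \cdot 2 = 4$, so the pass keeps the two outer edges.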

The improvement is measured with respect to a set of auxiliary weights maintained by the algorithm. For $a, b \in X$, let $a \prec b$ denote that “element $a$ arrives before $b$” in the stream. Then, we define the incremental value of an element $e$ with respect to a set $S$ as $\nu(e, S) = f(e \mid \{s \in S : s \prec e\})$.

There is a slight difficulty here in that we must also define incremental values for the elements of $S_{\mathrm{init}}$. To handle this difficulty, we in fact define $\prec$ with respect to a pretend stream ordering. Note that in all invocations of the procedure StreamingLocalSearch made by MultipassLocalSearch, the set $S_{\mathrm{init}}$ is either $\emptyset$ or the result of a previous application of StreamingLocalSearch. In our pretend ordering ($\prec$) all of $S_{\mathrm{init}}$ first arrives in the same relative pretend ordering as the previous pass, followed by all of $X \setminus S_{\mathrm{init}}$ in the same order given by the stream $X$. We then define our incremental values with respect to this pretend stream ordering.
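A small sketch of incremental values under the pretend ordering follows. It assumes that the incremental value of an element is its marginal gain over the members of the set that arrive earlier, which is the form used by Chekuri, Gupta, and Quanrud; the toy coverage instance and the function names are hypothetical.

```python
# Hypothetical toy instance: each element covers a set of items.
COVERS = {"a": {1, 2}, "b": {2, 3}, "c": {3, 4, 5}, "d": {1, 5}}


def f(S):
    """Coverage function: number of items covered by the elements of S."""
    covered = set()
    for e in S:
        covered |= COVERS[e]
    return len(covered)


def pretend_order(previous_solution_order, stream):
    """Previous solution first (in its old relative order), then the rest of the stream."""
    kept = list(previous_solution_order)
    seen = set(kept)
    order = kept + [x for x in stream if x not in seen]
    return {x: i for i, x in enumerate(order)}


def incremental_value(f, e, S, position):
    """Marginal gain of e over the members of S that arrive before e."""
    earlier = frozenset(s for s in S if position[s] < position[e])
    return f(earlier | {e}) - f(earlier)
```

With the starting solution `["c"]` and the stream `a, b, c, d`, the pretend order is `c, a, b, d`. For the set `{a, b, c}` the incremental values are 3, 2 and 0 for `c`, `a` and `b`, which sum to the value 5 of the whole set, since the gains telescope along the order.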

Using these incremental values, StreamingLocalSearch proceeds as follows. When an element $e$ arrives, StreamingLocalSearch computes a set $C \subseteq S$ of elements that can be exchanged for $e$. StreamingLocalSearch replaces $C$ with $e$ if and only if the marginal value $f(e \mid S)$ with respect to $S$ is at least $(1+\beta)$ times larger than the sum of the current incremental values $\nu(c, S)$ of all elements $c \in C$ plus some threshold $\alpha$, where $\alpha$ and $\beta$ are given as parameters. In this case, we say that the element $e$ is accepted. Otherwise, we say that $e$ is rejected. An element $e$ that has been accepted may later be removed from $S$ if it belongs to the exchange set $C$ of some later element that arrives in the stream. In this case we say that $e$ is evicted.

The approximation ratio obtained by one pass of StreamingLocalSearch depends on the parameter $\beta$ in two ways, which can be intuitively understood in terms of the standard analysis of the offline local search algorithm for the problem. Intuitively, if $\beta$ is chosen to be too large, more valuable elements will be rejected upon arrival and so, in the offline setting, our solution would be only approximately locally optimal, leading to a deterioration of the guarantee by a factor of $(1+\beta)$. However, in the streaming setting, the algorithm only attempts to exchange an element upon its arrival, and so the final solution will not necessarily be even $(1+\beta)$-approximately locally optimal: an element $e$ may be rejected because $f(e \mid S)$ is small when it arrives, but the processing of later elements in the stream can evict some elements of $S$. After these evictions, $f(e \mid S)$ could be larger. The key observation in the analyses of [DBLP:journals/mp/ChakrabartiK15, DBLP:conf/icalp/ChekuriGQ15] is that the total value of these evicted elements, and so also the total increase in the marginal value of all rejected elements, can be bounded by $1/\beta$ times the final value of $f(S)$ at the end of the algorithm. Intuitively, if $\beta$ is chosen to be too small, the algorithm will make more exchanges, evicting more elements, which may result in rejected elements being much more valuable with respect to the final solution. Selecting the optimal value of $\beta$ thus requires balancing these two effects.

Here, we observe that this second effect depends only on the total value of those elements that were accepted after an element arrives. To use this observation, we measure the ratio between the value $f(S_{\mathrm{init}})$ of the initial solution of some pass of StreamingLocalSearch and the value $f(\tilde{S})$ of the final solution produced by this pass. If this ratio is relatively small, so that one pass makes a lot of progress, then this pass gives a corresponding improvement over the ratio already guaranteed by the previous pass. On the other hand, if this ratio is relatively large, so that one pass does not make much progress, then the total increase in the value of our rejected elements can be bounded in terms of the small gain made by the pass, and so the potential loss due to only testing these elements at arrival is relatively small. Balancing these two effects allows us to set $\beta$ smaller in each subsequent pass and obtain an improved guarantee.

We now turn to the analysis of our algorithm. Here we focus on a single pass of StreamingLocalSearch. For we let . Throughout, we use $S$ to denote the current solution maintained by this pass (initially, $S = S_{\mathrm{init}}$). The following key properties of incremental values will be useful in our analysis. We defer the proof to the Appendix.

Lemma. For any ,

  1. .

  2. for all .

  3. .

  4. At all times during the execution of StreamingLocalSearch, for all .

Let $A$ denote the set of elements accepted during the present pass. These are the elements which were present in the solution $S$ at some previous time during the execution of this pass. Initially we have $A = S_{\mathrm{init}}$ and whenever an element is added to $S$ during this pass we also add this element to $A$. Let $\tilde{S}$ denote the set $S$ at the end of this pass, and from here on let $A$ denote the set of accepted elements at the end of this pass. Note that we regard all elements of $S_{\mathrm{init}}$ as having been accepted at the start of the pass. The following lemma follows from the analysis of Chekuri, Gupta, and Quanrud [DBLP:conf/icalp/ChekuriGQ15] in the single-pass setting. We give a complete, self-contained proof in Appendix A. Each element $e \in A \setminus \tilde{S}$ was accepted but later evicted by the algorithm. For any such evicted element, we let $\delta(e)$ denote the value of $\nu(e, S)$ at the moment that $e$ was removed from $S$.

Lemma. Let $f$ be a submodular function. Suppose $\tilde{S}$ is the solution produced at the end of one pass of StreamingLocalSearch and let $A$ be the set of all elements accepted during this pass. Then,

We now derive a bound for the summation (representing the value of evicted elements) in terms of the total gain made by the pass, and also bound the total number of accepted elements in terms of .

Lemma 2.1.

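Let $f$ be a submodular function. Suppose that $\tilde{S}$ is the solution produced at the end of one pass of StreamingLocalSearch and $A$ is the set of all elements accepted during this pass. Then, $|A| \le f(\tilde{S})/\alpha$ and

$$\sum_{e \in A \setminus \tilde{S}} \delta(e) \le \frac{1}{\beta}\left(f(\tilde{S}) - f(S_{\mathrm{init}})\right).$$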


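Proof. We consider the quantity $\sum_{e \in A \setminus S} \delta(e)$. Suppose some element $x$ with exchange set $C$ is added to $S$ by the algorithm, evicting the elements of $C$. Then (as each element can be evicted only once) $\sum_{e \in A \setminus S} \delta(e)$ increases by precisely $\sum_{c \in C} \nu(c, S)$. Let $S^-$ and $S^+$ be the sets $S$ immediately before and after $x$ is accepted, respectively. Let $\Delta = f(S^+) - f(S^-)$ be the change in the objective function after the exchange between $x$ and $C$. Since $x$ is accepted, we must have $f(x \mid S^-) \ge \alpha + (1+\beta) \sum_{c \in C} \nu(c, S^-)$. Then,

$$\begin{aligned}
\Delta &= f(S^- - C + x) - f(S^-) \\
&\ge f(x \mid S^-) - f(C \mid S^- \setminus C) && \text{(by submodularity)} \\
&\ge f(x \mid S^-) - \sum_{c \in C} \nu(c, S^-) && \text{(by Lemma 2 (3))} \\
&\ge \alpha + \beta \sum_{c \in C} \nu(c, S^-) && \text{(since $x$ is accepted)}
\end{aligned}$$

It follows that whenever $\sum_{e \in A \setminus S} \delta(e)$ increases by $\sum_{c \in C} \nu(c, S^-)$, $f(S)$ must increase by at least $\beta \sum_{c \in C} \nu(c, S^-)$. Initially, $\sum_{e \in A \setminus S} \delta(e) = 0$ and $f(S) = f(S_{\mathrm{init}})$, and at the end of the algorithm, $\sum_{e \in A \setminus S} \delta(e) = \sum_{e \in A \setminus \tilde{S}} \delta(e)$ and $f(S) = f(\tilde{S})$. Thus, $\beta \sum_{e \in A \setminus \tilde{S}} \delta(e) \le f(\tilde{S}) - f(S_{\mathrm{init}})$.

It remains to show that $|A| \le f(\tilde{S})/\alpha$. For this, we note that the above chain of inequalities also implies that every time an element is accepted (and so $|A|$ increases by one), $f(S)$ also increases by at least $\alpha$. Thus, we have $|A| \le f(\tilde{S})/\alpha$. ∎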

Using Lemma 2.1 to bound the sum of exit values in Lemma 2 then immediately gives us the following guarantee for each pass performed in MultipassLocalSearch. In the $i$th such pass, we will have $S_{\mathrm{init}} = S_{i-1}$, $\tilde{S} = S_i$, and $\beta = \beta_i$. We let $A_i$ denote the set of all elements accepted during this particular pass.

Lemma 2.2.

Let be a submodular function. Consider the pass of StreamingLocalSearch performed by MultipassLocalSearch, and let be the set of all elements accepted during this pass. Then, and

3 Analysis of the multipass algorithm for monotone functions.

We now show how to use Lemma 2.2 together with a careful selection of parameters and to derive guarantees for the solution produced after the pass made in MultipassLocalSearch. Here, we consider the case that is a monotone function. In this case, we have for all . We set in each pass. In the first pass, we will set . Then, since Lemma 2.2 immediately gives:


For passes , we use the following, which relates the approximation guarantee obtained in this pass to that from the previous pass.

Theorem 1.

For , suppose that and define as the ratio between the two previous passes. Then,


From the definition of and , we have:

On the other hand, . Thus, Lemma 2.2 gives:

Now, we observe that for any fixed guarantee from the previous pass, is an increasing function of and is a decreasing function of . Thus, the guarantee we obtain in Theorem 1 is always at least as good as that obtained when these two values are equal. Setting:

and solving for gives us:


In the following analysis, we consider this value of since the guarantee given by Theorem 1 will always be no worse than that given by this value. The analysis for a single matroid constraint follows from our results for -matchoids, but the analysis and parameter values obtained are much simpler, so we present it separately, first.

Theorem 2.

Suppose we run Algorithm 1 for an arbitrary matroid constraint and monotone submodular function , with . Then for all . In particular, after passes, .


Let be the guarantee for our algorithm after passes. We show, by induction on , that . For , we have and so from (1) we have , as required. For , suppose that . Since and , identity (2) gives:

Thus, by Theorem 1, the pass of our algorithm has guarantee satisfying:

as required. ∎

Theorem 3.

Suppose we run Algorithm 1 for an arbitrary -matchoid constraint and monotone submodular function , and

for , where is given by the recurrence and

for . Then for all . In particular, after passes, .


We first show that the approximation guarantee of our algorithm after passes is given by . Setting , we obtain from (1), agreeing with our definition. For passes , let . As in the case of a matroid constraint, Theorem 1 implies that the guarantee for pass will be at most , where is chosen to satisfy (2). Specifically, if we set

then we have .

We now show by induction on that . In the case , we have and the claim follows immediately from . In the general case , and we may assume without loss of generality that . Otherwise the theorem holds immediately, as each subsequent pass can only increase the value of the solution. Then, we note (as shown in Appendix B) that for and , is an increasing function of . By the induction hypothesis, . Therefore:

as required. The last inequality above follows from straightforward but tedious algebraic manipulations, which can be found in Appendix B. ∎

4 A multi-pass algorithm for general submodular functions

In this section, we show that the guarantees for monotone submodular maximization can be extended to non-monotone submodular maximization even when dealing with multiple passes. Our main algorithm is given by procedure MultipassRandomizedLocalSearch in Algorithm 3. In each pass, it calls a procedure RandomizedLocalSearch, which is an adaptation of StreamingLocalSearch, to process the stream. Note that each such pass produces a pair of feasible solutions and , which we now maintain throughout MultipassRandomizedLocalSearch. The set is maintained similarly as before and gradually improves by exchanging “good” elements into a solution throughout the pass. The set will be maintained by considering the best output of an offline algorithm that we run after each pass as described in more detail below.

       for  to  do
             Let be the output of ;
             , ;
      return ;
       ; ;
       foreach  in the stream do
             if  then
            if   then
                   uniformly random element from ;
                   ; ;
                   foreach  in  do
                         if  then
       return ;
Algorithm 3 The randomized multi-pass streaming algorithm
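The buffering step of Algorithm 3 can be sketched as follows. This is only an illustration of the buffer-and-sample mechanism: the exchange step and the matchoid constraint are deliberately omitted, "good" is modeled by a hypothetical marginal-gain threshold `alpha`, and the toy coverage function, `buffer_size`, and all function names are assumptions made for the example.

```python
import random

# Hypothetical toy instance: each element covers a set of items.
COVERS = {"a": {1, 2}, "b": {2, 3}, "c": {3, 4, 5}, "d": {1, 5}}


def f(S):
    """Coverage function: number of items covered by the elements of S."""
    covered = set()
    for e in S:
        covered |= COVERS[e]
    return len(covered)


def buffered_random_pass(stream, is_good, accept, buffer_size, rng):
    """Collect good elements; when the buffer is full, accept one uniformly at random."""
    buffer = []
    for x in stream:
        if is_good(x):
            buffer.append(x)
        if len(buffer) == buffer_size:
            chosen = rng.choice(buffer)
            buffer.remove(chosen)
            accept(chosen)
            # Accepting `chosen` may change which buffered elements are still good.
            buffer = [y for y in buffer if is_good(y)]
    return buffer  # leftover elements, to be handed to an offline algorithm


def run_demo(seed, alpha=1, buffer_size=2):
    """Run one pass on the toy instance and return (solution, leftover buffer)."""
    S = set()
    is_good = lambda x: x not in S and f(S | {x}) - f(S) >= alpha
    rng = random.Random(seed)
    leftover = buffered_random_pass(list(COVERS), is_good, S.add, buffer_size, rng)
    return S, leftover
```

Because a selection happens only when the buffer is full, any fixed element is picked with probability at most one over the buffer size at each selection, which is the property the analysis below relies on.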

To deal with non-monotone submodular functions, we will limit the probability of elements being added to $S$. Instead of exchanging good elements on arrival, we store them in a buffer of size . When the buffer becomes full, an element is chosen uniformly at random and added to $S$. Adding a new element to the current solution may affect the quality of the remaining elements in the buffer and thus we need to re-evaluate them and remove the elements that are no longer good. As before, we let $A$ denote the set of elements that were previously added to $S$ during the current pass of the algorithm. Note that we do not consider an element to be accepted until it has actually been added to $S$ from the buffer. For any fixed set of random choices, the execution of RandomizedLocalSearch can be considered as the execution of StreamingLocalSearch on the following stream: we suppose that an element arrives whenever it is selected from the buffer and accepted into $S$. All elements that are discarded from the buffer after accepting then arrive, and will also be rejected by StreamingLocalSearch. Any elements remaining in the buffer after the execution of the algorithm do not arrive in the stream. Applying Lemma 2.2 with respect to this pretend stream ordering allows us to bound with respect to (that is, the value of the part of that does not remain in the buffer ) after a single pass of RandomizedLocalSearch. Formally, let be the value of the buffer after the pass of our algorithm. Then, applying Lemma 2.2 to the set , and taking expectation, gives:


In order to bound the value of the elements in , we apply any offline -approximation algorithm Offline to the buffer at the end of the pass to obtain a solution . In MultipassRandomizedLocalSearch, we then remember the best such offline solution computed across the first passes. Then, in the pass, we have


From submodularity of and we have . Thus, combining (3) and (4) we have:


To relate the right-hand side to we use the following result from Buchbinder et al. [DBLP:conf/soda/BuchbinderFNS14]:

Lemma 4.1 (Lemma 2.2 in [DBLP:conf/soda/BuchbinderFNS14]).

Let $f$ be a non-negative submodular function. Suppose that $R$ is a random set where no element appears in $R$ with probability more than $q$. Then, $\mathbb{E}[f(R)] \ge (1-q) f(\emptyset)$. Moreover, for any set $Y$, it follows that $\mathbb{E}[f(R \cup Y)] \ge (1-q) f(Y)$.
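The lemma can be checked exactly on a small non-monotone instance. The sketch below assumes the usual reading of the cited result: if every element appears in the random set with probability at most $q$, the expected value of the function on the random set united with any fixed $Y$ is at least $(1-q)$ times its value on $Y$. The symbol $q$, the cut-function instance, and the function names are chosen for this example; independent inclusion is used only because it makes the expectation easy to enumerate.

```python
from fractions import Fraction
from itertools import product


def cut_value(S, edges):
    """Undirected cut function: number of edges with exactly one endpoint in S."""
    return sum(1 for u, v in edges if (u in S) != (v in S))


def expected_value(f, ground, probs, Y=frozenset()):
    """Exact expectation of f(R | Y) when each e joins R independently with probs[e]."""
    elements = sorted(ground)
    total = Fraction(0)
    for bits in product([0, 1], repeat=len(elements)):
        weight = Fraction(1)
        R = set(Y)
        for e, b in zip(elements, bits):
            weight *= probs[e] if b else 1 - probs[e]
            if b:
                R.add(e)
        total += weight * f(R)
    return total
```

On the 4-cycle with $Y = \{1, 3\}$ and inclusion probabilities at most $1/2$, the expectation equals $2 = (1 - 1/2) \cdot f(Y)$, so the bound holds with equality on this instance.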

We remark that a similar theorem also appeared earlier in Feige, Mirrokni, and Vondrák [DBLP:journals/siamcomp/FeigeMV11] for a random set that contains each element independently with probability exactly . Here, the probability that an element occurs in is delicate to handle because such an element may either originate from the starting solution or be added during the pass. Thus, we use a rougher estimate. By definition . Thus, . The number of selections during the pass is at most and by Lemma 2.2 (applied to the set due to our pretend stream ordering in each pass ), in any pass. Here, the second inequality follows from the optimality of , and the fact that any subset of the feasible solution is also feasible for our $p$-matchoid constraint. Thus, the total number of selections in the first passes is at most . We select an element only when the buffer is full, and each selection is made independently and uniformly at random from the buffer. Thus, the probability that any given element is selected when the algorithm makes a selection is at most and by a union bound, . Let be the number of passes that the algorithm makes and suppose we set (in Appendix C we show that this can be accomplished approximately by guessing , which can be done at the expense of an extra factor space). Finally, let . Then, applying Lemma 4.1, after passes we have:


Our definition of also implies that . Using this and equation (6) in (5), we obtain:


As we show in Appendix C, the rest of the analysis then follows similarly to that in Section 3, using the fact that .

Theorem. Let be a $p$-matchoid of rank $k$ and let $f$ be a non-negative submodular function. Suppose there exists an algorithm for the offline instance of the problem with approximation factor $\bar{\gamma}$. For any $\varepsilon > 0$, the randomized streaming local-search algorithm returns a solution that is a $(p+1+\bar{\gamma}+O(\varepsilon))$-approximation, using a total space of $O(p^3 k \log(k)/\varepsilon^3)$ and $O(p/\varepsilon)$ passes.