In this paper, we study the problem of maximizing a non-monotone submodular function subject to a size constraint in the streaming model. This problem captures problems of interest in a wide range of domains, such as machine learning, data mining, combinatorial optimization, algorithmic game theory, social networks, and many others. A representative application is data summarization, where the goal is to select a very small subset of the data that captures the salient features of the overall dataset. One can model these problems as submodular maximization with a cardinality constraint: the submodular objective captures how informative the summary is, as well as other considerations such as how diverse the summary is, and the size constraint ensures that the summary is small. Obtaining such a summary is very beneficial when working with massive data sets that may not even fit into memory, since it makes it possible to analyze the data using algorithms that would be prohibitive to run on the entire dataset.
There have been two main approaches to deal with the large size of modern datasets: the distributed computation approach partitions the data across many machines and uses local computation on the machines and communication across the machines in order to perform the analysis, and the streaming computation approach processes the data in a stream using only a small amount of memory and ideally only one pass over the data. Classical algorithms for submodular maximization, such as the Greedy algorithm, are not suitable in these settings since they are centralized and they require many passes over the data. Motivated by the applications as well as theoretical considerations, there has been a significant interest in studying submodular maximization problems both in the distributed and the streaming setting, leading to many new results and insights [14, 19, 2, 7, 8, 15, 4, 18, 3, 10, 17, 11, 21, 1].
Despite this significant progress, several fundamental questions remain open in both the streaming and the distributed setting. In the streaming setting, which is the main focus of this paper, submodular maximization is fairly well understood when the objective function is additionally monotone: we have $f(A) \le f(B)$ whenever $A \subseteq B$. In the centralized setting, the classical Greedy algorithm achieves an optimal $1 - 1/e$ approximation when the function is monotone [20]. The Greedy approach can also be adapted to the streaming model [14, 2]. A particularly effective approach in the streaming setting is provided by the single-threshold Greedy algorithm: we make a single pass over the data and select an item if its marginal gain exceeds a suitably chosen threshold. If the threshold is chosen to be $\frac{1}{2} \cdot \frac{f(\mathrm{OPT})}{k}$, where $f(\mathrm{OPT})$ is the value of the optimal solution and $k$ is the cardinality constraint, the single-threshold Greedy algorithm is guaranteed to achieve a $\frac{1}{2}$-approximation. Although the value of the optimal solution is unknown, it can be estimated based on the largest singleton value even in the streaming setting: the algorithm uses the maximum singleton value to make guesses for $f(\mathrm{OPT})$ and, for each guess, it runs the single-threshold Greedy algorithm; this approach leads to a $\frac{1}{2} - \varepsilon$ approximation. Remarkably, this approximation guarantee is essentially optimal in the streaming model even if we allow unbounded computational power: Norouzi-Fard et al. [21] showed that any algorithm for monotone submodular maximization that achieves an approximation better than $\frac{1}{2}$ requires $\Omega(n/k)$ memory, where $n$ is the length of the stream, under the assumption that the algorithm only evaluates the function on feasible sets (sets of size at most $k$), which is the case for the vast majority of existing algorithms. Additionally, the single-threshold Greedy algorithm enjoys a fast update time of $O(\log k / \varepsilon)$ marginal value computations per item, and it uses $O(k \log k / \varepsilon)$ space.
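The single-threshold Greedy pass described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `single_threshold_greedy`, the toy coverage objective, and the assumption that the optimal value is known in advance are all introduced here for the example, and the sketch accepts an item when its gain is at least the threshold.

```python
from typing import Callable, FrozenSet, Hashable, Iterable, List


def single_threshold_greedy(
    stream: Iterable[Hashable],
    f: Callable[[FrozenSet], float],
    k: int,
    tau: float,
) -> List[Hashable]:
    """One pass over the stream: keep an item if its marginal gain is at
    least tau, stopping once k items have been kept."""
    solution: List[Hashable] = []
    current = f(frozenset())
    for item in stream:
        if len(solution) >= k:
            break  # the budget is filled
        candidate = f(frozenset(solution) | {item})
        if candidate - current >= tau:
            solution.append(item)
            current = candidate
    return solution


# Toy monotone objective (hypothetical data): number of covered points.
SETS = {"a": {1, 2, 3}, "b": {3, 4}, "c": {5}, "d": {1, 2, 3, 4, 5, 6}}


def coverage(S: FrozenSet) -> int:
    return len(set().union(*(SETS[x] for x in S)))


k = 2
opt_value = 6             # assumed known here; in general it must be guessed
tau = opt_value / (2 * k)  # threshold (1/2) * f(OPT) / k
picked = single_threshold_greedy(["a", "b", "c", "d"], coverage, k, tau)
```

On this toy stream the pass keeps `a`, rejects `b` and `c` (each adds only one new point), and keeps `d`.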
In contrast, the general problem with a non-monotone objective has proved to be considerably more challenging. Even in the centralized setting, the Greedy algorithm fails to achieve any approximation guarantee when the objective is non-monotone. Several approaches have been developed in the centralized setting that use either continuous optimization or sampling to handle non-monotone objectives [12, 6, 9, 5]; the best approximation guarantee is $0.385$ [5] and the best hardness is $0.491$ [13], and it remains a long-standing open problem to settle the approximability of submodular maximization. It remains a challenging task to adapt these techniques to the streaming setting, and the approximation guarantees are weaker. The works [8, 11, 17] have adapted the local search approach of Chakrabarti and Kale [7] for monotone functions, leading to a approximation. Recently, Alaluf and Feldman [1] gave an exponential-time algorithm that achieves a approximation and a polynomial-time algorithm that achieves a approximation. These algorithms work in the continuous setting and they need to evaluate the multilinear extension, and thus the update time as well as the overall running time are higher than those of discrete approaches. As noted by , similarly to the monotone setting, a $\frac{1}{2}$ approximation is best possible: any algorithm that achieves better than a $\frac{1}{2}$ approximation must use space, even if the algorithm is allowed to use unbounded computational power and can evaluate the function on any set. Thus some of the main questions that are left open by these works are:
What is the optimal approximation achievable for submodular maximization in the streaming model?
Is the tight approximation achievable using an algorithm that uses only space?
Is there a streaming algorithm for non-monotone functions that enjoys the fast update time and practical potential of the single-threshold Greedy algorithm for monotone functions?
How can existing heuristics for the offline problem be exploited in the streaming setting?
Our contributions. In this work, we give an affirmative answer to all of the above questions. Specifically, we give a streaming algorithm StreamProcess that performs a single pass over the stream and outputs a set of size that can be post-processed using any offline algorithm for submodular maximization. The post-processing is itself quite straightforward: we simply run the offline algorithm on the output set to obtain a solution of size at most $k$. We show that, if the offline algorithm achieves an $\alpha$ approximation, then we obtain a $\frac{\alpha}{1+\alpha} - \varepsilon$ approximation. In particular, if we post-process using an exact (exponential time) algorithm, we obtain a $\frac{1}{2} - \varepsilon$ approximation. Thus we answer affirmatively the first two questions, and settle the approximability of the problem if exponential-time computation is allowed.
The best (polynomial-time) approximation guarantee that is currently known in the offline setting is $\alpha = 0.385$ [5]. If we post-process using this algorithm, we obtain a $0.2779$ approximation in polynomial time, improving over the previously best polynomial-time approximation of due to . The offline algorithm of [5] is based on the multilinear extension and it is slower than discrete algorithms. One can obtain a more efficient overall algorithm by using the discrete random Greedy algorithm of [6] that achieves a $1/e$ approximation. Furthermore, any existing heuristic for the offline problem can be used for post-processing, exploiting its effectiveness beyond the worst case.
Our techniques. Our StreamProcess algorithm takes inspiration both from the single-threshold Greedy algorithm for monotone maximization and the distributed algorithms that randomly partition the data [16, 4, 3]: we randomly partition the elements into parts as they arrive in the stream, and we run the single-threshold Greedy algorithm on each part; we repeat this process independently and in parallel times. Since the main engine behind our algorithm is the very efficient and practical single-threshold Greedy algorithm, our StreamProcess algorithm inherits its very efficient update time and practical potential. Compared to the optimal streaming algorithm for monotone maximization discussed above, our algorithm is quite similar: the monotone algorithm runs instances of single-threshold Greedy, each of which processes all items; StreamProcess runs instances of single-threshold Greedy, each of which processes items with high probability.
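The random-partition idea above can be sketched as follows. This is a hedged illustration only: the name `stream_process`, the parameters `num_parts`, `num_reps`, and `tau`, and the choice to return both the per-part solutions and their union are assumptions made for the example (the paper's parameter settings are not reproduced here). The toy objective is a graph cut function, which is non-negative, submodular, and non-monotone.

```python
import random
from typing import Callable, FrozenSet, Hashable, Iterable, List, Set, Tuple


def stream_process(
    stream: Iterable[Hashable],
    f: Callable[[FrozenSet], float],
    k: int,
    tau: float,
    num_parts: int,
    num_reps: int,
    seed: int = 0,
) -> Tuple[List[List[List[Hashable]]], Set[Hashable]]:
    """In each of num_reps independent repetitions, send every arriving item
    to one of num_parts parts uniformly at random and run single-threshold
    Greedy (threshold tau, budget k) on that part."""
    rng = random.Random(seed)
    # partial[r][j] is the threshold-greedy solution of part j in repetition r.
    partial = [[[] for _ in range(num_parts)] for _ in range(num_reps)]
    for item in stream:
        for r in range(num_reps):
            sol = partial[r][rng.randrange(num_parts)]
            if len(sol) >= k:
                continue  # this part has already filled its budget
            base = frozenset(sol)
            if f(base | {item}) - f(base) >= tau:
                sol.append(item)
    pool = {x for rep in partial for sol in rep for x in sol}
    return partial, pool


# Toy non-monotone objective (hypothetical graph): size of the cut (S, V \ S).
EDGES = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]


def cut(S: FrozenSet) -> int:
    return sum((u in S) != (v in S) for u, v in EDGES)


partial, pool = stream_process(range(4), cut, k=2, tau=1, num_parts=1, num_reps=1)
```

With a single part and a single repetition the sketch reduces to plain single-threshold Greedy; more parts mean that each instance sees only a random sample of the stream.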
Although our algorithm is quite simple and is inspired by ideas and techniques from the distributed setting, the analysis is a significant departure from previous work. There are subtle technical issues that arise when we combine the random partitioning approach with the single-threshold Greedy algorithm, and handling the event that the single-threshold Greedy algorithm fills up the budget requires some care. As noted by , this was a source of difficulty in previous work and it led to a subtle error in one of the results of . Our approach here is simple in retrospect. Specifically, we consider two cases depending on the probability that the budget is filled up in a run (this is a good event since the resulting solution has good value). If this probability is sufficiently large (at least ), we repeat the basic algorithm times to boost the probability of this good event to . Otherwise, the probability that the budget is not filled up in a run is at least and conditioning on this event changes the probabilities by only an factor. Besides properly handling this conditioning, our analysis uses novel techniques to analyze the approximation guarantee of the algorithm, in addition to some techniques borrowed from  (see the discussion at the beginning of Subsection 3.2).
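Basic notation. We let $V$ denote a finite ground set of size $n$, and we may assume without loss of generality that $V = \{1, 2, \dots, n\}$. We use e.g. $x = (x_1, \dots, x_n)$ to denote a vector in $\mathbb{R}^V$. For two vectors $x, y \in \mathbb{R}^V$, we let $x \vee y$ be the vector such that $(x \vee y)_i = \max\{x_i, y_i\}$ and we let $x \wedge y$ be the vector such that $(x \wedge y)_i = \min\{x_i, y_i\}$ for all $i \in V$. For a set $S \subseteq V$, we let $\mathbf{1}_S$ denote the indicator vector of $S$, i.e., the vector that has a $1$ in every coordinate $i \in S$ and a $0$ in every coordinate $i \in V \setminus S$.
If $S$ is a random subset of $V$, we use $\mathbb{E}[\mathbf{1}_S]$ to denote the vector $x$ such that $x_i = \Pr[i \in S]$ for all $i \in V$ (i.e., the expectation is applied coordinate-wise).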
Submodular functions. In this paper, we consider the problem of maximizing a non-negative submodular function subject to a cardinality constraint. A set function $f : 2^V \to \mathbb{R}$ is submodular if $f(A) + f(B) \ge f(A \cap B) + f(A \cup B)$ for all subsets $A, B \subseteq V$.
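As a small sanity check of the definition, the following sketch verifies the inequality $f(A) + f(B) \ge f(A \cap B) + f(A \cup B)$ by brute force on a tiny ground set. The helper names and the example graph are hypothetical and chosen only for illustration; the cut function of a graph is a standard example of a non-negative submodular function that is not monotone.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Iterable, List


def powerset(ground: Iterable) -> List[FrozenSet]:
    items = list(ground)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]


def is_submodular(f: Callable[[FrozenSet], float], ground: Iterable) -> bool:
    """Check f(A) + f(B) >= f(A | B) + f(A & B) over all pairs of subsets.
    Exponential in the ground set size, so only suitable for tiny examples."""
    subsets = powerset(ground)
    return all(f(A) + f(B) >= f(A | B) + f(A & B)
               for A in subsets for B in subsets)


# Hypothetical example graph on vertices 0..3.
EDGES = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]


def cut(S: FrozenSet) -> int:
    return sum((u in S) != (v in S) for u, v in EDGES)


cut_is_submodular = is_submodular(cut, range(4))
square_is_submodular = is_submodular(lambda S: len(S) ** 2, range(3))
```

The cut function passes the check, while $|S|^2$ fails it (take two disjoint singletons). The cut function is also non-monotone: the full vertex set has cut value $0$, which is smaller than the cut value of a single vertex.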
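The analysis of our algorithm makes use of the Lovász extension $\hat{f} : [0,1]^V \to \mathbb{R}_{\ge 0}$, which extends $f$ from the discrete domain $2^V$ to the continuous domain $[0,1]^V$. (We emphasize that our main algorithm is discrete and we make use of the continuous extension for analysis purposes only.) The Lovász extension is defined as follows. For every $x \in [0,1]^V$, we have
$$\hat{f}(x) = \mathbb{E}_{\theta \sim [0,1]}\left[ f(\{ i \in V : x_i \ge \theta \}) \right],$$
where we use the notation $\theta \sim [0,1]$ to denote a value chosen uniformly at random from the interval $[0,1]$. The Lovász extension has the following properties: (1) convexity: $\hat{f}(\alpha x + (1-\alpha) y) \le \alpha \hat{f}(x) + (1-\alpha) \hat{f}(y)$ for all $x, y \in [0,1]^V$ and all $\alpha \in [0,1]$; (2) restricted scale invariance: $\hat{f}(\alpha x) \ge \alpha \hat{f}(x)$ for all $x \in [0,1]^V$ and all $\alpha \in [0,1]$.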
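The expectation in the definition can be evaluated exactly, because the level set $\{i : x_i \ge \theta\}$ only changes when $\theta$ crosses one of the coordinate values. The sketch below does this for a small cut function and numerically spot-checks the two stated properties; the function name `lovasz_extension`, the example graph, and the test vectors are assumptions made for illustration.

```python
from typing import Callable, FrozenSet, List, Sequence


def lovasz_extension(f: Callable[[FrozenSet], float], x: Sequence[float]) -> float:
    """Exact value of E_{theta ~ U[0,1]}[ f({i : x_i >= theta}) ] for x in [0,1]^n."""
    levels: List[float] = sorted(set(x) | {0.0, 1.0})
    total = 0.0
    for lo, hi in zip(levels, levels[1:]):
        # For every theta in (lo, hi], the level set equals {i : x_i >= hi}.
        level_set = frozenset(i for i in range(len(x)) if x[i] >= hi)
        total += (hi - lo) * f(level_set)
    return total


# Hypothetical example graph on vertices 0..3.
EDGES = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]


def cut(S: FrozenSet) -> int:
    return sum((u in S) != (v in S) for u, v in EDGES)


x = [1.0, 0.0, 0.0, 0.0]
y = [0.0, 0.0, 1.0, 0.0]
mid = [0.5 * a + 0.5 * b for a, b in zip(x, y)]

on_indicator = lovasz_extension(cut, [1.0, 0.0, 1.0, 0.0])  # equals cut({0, 2})
convex_gap = 0.5 * lovasz_extension(cut, x) + 0.5 * lovasz_extension(cut, y) \
    - lovasz_extension(cut, mid)
scale_gap = lovasz_extension(cut, [0.5 * a for a in x]) - 0.5 * lovasz_extension(cut, x)
```

On indicator vectors the extension agrees with $f$, and both `convex_gap` and `scale_gap` are non-negative here, as the properties predict.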
2 The algorithm
The streaming algorithm is shown in Algorithm 1. For simplicity, we describe the algorithm assuming the knowledge of an estimate of the value of the optimal solution, $f(\mathrm{OPT})$. To remove this assumption, we use the standard technique introduced by [2]. The basic idea is to use the maximum singleton value as a $k$-approximation of $f(\mathrm{OPT})$. Given this approximation, one can guess a $1 + \varepsilon$ approximation of $f(\mathrm{OPT})$ from a set of values ranging from to ($\alpha$ is the approximation guarantee of the offline algorithm OfflineAlg that we use in the post-processing step). The final streaming algorithm is simply copies of the basic algorithm running in parallel with different guesses. As new elements appear in the stream, the maximum singleton value also increases over time and thus existing copies of the basic algorithm with small guesses are dropped and new copies with higher guesses are added. An important observation is that when we introduce a new copy with a large guess, starting it from mid-stream has exactly the same outcome as if we started it from the beginning of the stream: all previous elements have marginal gain much smaller than the guess and smaller than the threshold, so they would have been rejected anyway. We refer to  for the full details.
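The guessing step can be illustrated with a geometric grid of candidate values. The sketch below is an assumption-laden simplification: it covers the interval from the maximum singleton value $v$ to $k \cdot v$, which brackets $f(\mathrm{OPT})$ by submodularity, whereas the range used in the paper also depends on $\alpha$ and is not reproduced here. The function name `opt_guesses` is hypothetical.

```python
import math
from typing import List


def opt_guesses(max_singleton: float, k: int, eps: float) -> List[float]:
    """Powers of (1 + eps) covering [v, k * v], where v is the maximum
    singleton value. Some guess g then satisfies g <= OPT <= (1 + eps) * g
    for any OPT in that interval."""
    if max_singleton <= 0:
        return []
    base = 1.0 + eps
    lo_exp = math.floor(math.log(max_singleton, base))
    hi_exp = math.ceil(math.log(k * max_singleton, base))
    return [base ** j for j in range(lo_exp, hi_exp + 1)]


guesses = opt_guesses(max_singleton=3.0, k=5, eps=0.1)
```

One copy of the basic algorithm would be run per guess. The number of guesses grows like $\log(k) / \varepsilon$ for this simplified range.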
There is a streaming algorithm StreamProcess for non-monotone submodular maximization with the following properties ($\varepsilon > 0$ is any desired accuracy and it is given as input to the algorithm):
The algorithm makes a single pass over the stream.
The algorithm uses space.
The update time per item is marginal gain computations.
At the end of the stream, we post-process the output of StreamProcess using any offline algorithm OfflineAlg for submodular maximization. The resulting solution is a $\frac{\alpha}{1+\alpha} - \varepsilon$ approximation, where $\alpha$ is the approximation of OfflineAlg.
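The post-processing step can be sketched as follows, using exhaustive search as a stand-in for OfflineAlg. This is an illustration under stated assumptions, not the paper's OfflinePostProcess: the names `exact_offline_alg` and `post_process` are hypothetical, and returning the better of the per-part solutions and the offline solution is an assumption about how the candidates are combined.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Hashable, Iterable, List


def exact_offline_alg(pool: Iterable[Hashable],
                      f: Callable[[FrozenSet], float], k: int) -> FrozenSet:
    """Exhaustive search for the best subset of pool of size at most k.
    Exponential time; plays the role of an exact offline algorithm."""
    best: FrozenSet = frozenset()
    items = sorted(pool)
    for size in range(1, k + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            if f(candidate) > f(best):
                best = candidate
    return best


def post_process(partial_solutions: List[List[Hashable]], pool: Iterable[Hashable],
                 f: Callable[[FrozenSet], float], k: int,
                 offline_alg: Callable = exact_offline_alg) -> FrozenSet:
    """Return the best candidate among the stored solutions and the offline
    algorithm's solution on the collected pool (assumed combination rule)."""
    candidates = [frozenset(s) for s in partial_solutions]
    candidates.append(offline_alg(pool, f, k))
    return max(candidates, key=f)


# Hypothetical example graph on vertices 0..3.
EDGES = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]


def cut(S: FrozenSet) -> int:
    return sum((u in S) != (v in S) for u, v in EDGES)


final = post_process([[0], [1, 3]], pool={0, 1, 2, 3}, f=cut, k=2)
```

Any offline heuristic with the same signature could be passed as `offline_alg` in place of the exhaustive search.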
3 The analysis
In this section, we analyze Algorithm 1 and show that it achieves a $\frac{\alpha}{1+\alpha} - \varepsilon$ approximation, where $\alpha$ is the approximation guarantee of the offline algorithm OfflineAlg.
We divide the analysis into two cases, depending on the probability of the event that a set (for some ) constructed by StreamProcess has size . For every , let be the event that . Since each of the repetitions (iterations of the for loop of StreamProcess) use independent randomness to partition , the events are independent. Additionally, the events have the same probability. We divide the analysis into two cases, depending on whether or . In the first case, since we are repeating times, the probability that there is a set of size is at least and we obtain the desired approximation since . In the second case, we have and we argue that contains a good solution. We now give the formal argument for each of the cases.
As noted earlier, the events are independent and they have the same probability. Thus we have
since . Thus .
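As a hedged illustration of this boosting step, write $p$ for the common probability of the events and $r$ for the number of independent repetitions. Both symbols are introduced here for illustration, and the paper's exact thresholds are not reproduced. Independence gives:

```latex
\Pr[\text{no repetition fills the budget}] \;=\; (1-p)^{r} \;\le\; e^{-pr},
\qquad\text{hence}\qquad
\Pr[\text{some repetition fills the budget}] \;\ge\; 1 - e^{-pr}.
```

For instance, if $p \ge \varepsilon$ and $r \ge \ln(1/\varepsilon)/\varepsilon$, then $e^{-pr} \le \varepsilon$, so the good event occurs with probability at least $1 - \varepsilon$.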
Conditioned on the event , we obtain the desired approximation due to the following lemma. The lemma follows immediately from the standard analysis of the single threshold greedy algorithm, and we include it in the appendix for completeness.
We have for all , .
We can combine the two facts and obtain the desired approximation as follows. Let be the random variable equal to the solution returned by OfflinePostProcess. We have
In this case, we show that the solution returned on the last line of OfflinePostProcess has good value in expectation. Our analysis borrows ideas and techniques from the work of Barbosa et al. : the probabilities defined below are analogous to the probabilities used in that work; the division of into two sets based on these probabilities is analogous to the division employed in Section 7.3 in that work; Lemma 3.3 shows a consistency property for the single threshold greedy algorithm that is analogous to the consistency property shown for the standard greedy algorithm and other algorithms by Barbosa et al. We emphasize that Barbosa et al. use these concepts in a different context (specifically, monotone maximization in the distributed setting). When applied to our context — non-monotone maximization in the streaming setting — the framework of Barbosa et al. requires memory if used with a single pass (alternatively, they use passes) and achieves worse approximation guarantees.
Notation and definitions. For analysis purposes only, we make use of the Lovasz extension . We fix an optimal solution . Let be the distribution of -samples of , where a -sample of includes each element of independently at random with probability . Note that for every , (see StreamProcess). Additionally, for each , is a partition of into -samples.
For a subset , we let be the output of the single threshold greedy algorithm when run as follows (see also Algorithm 2 for a formal description of the algorithm): the algorithm processes the elements of in the order in which they arrive in the stream and it uses the same threshold as StreamProcess; starting with the empty solution and continuing until the size constraint of is reached, the algorithm adds an element to the current solution if its marginal gain is above the threshold. Note that for all . For analysis purposes only, we also consider for sets that do not correspond to any set .
For each , we define
We partition into two sets:
We also define the following subset of :
Note that is a deterministic partition of , whereas is a random subset of .
The role of the sets will become clearer in the analysis. The intuition is that, using the repetition, we can ensure that each element of ends up in the collected set with good probability: each iteration ensures that an element is in with probability and, since we repeat times, we will ensure that . We also have that : an element ends up being picked by STGreedy when run on input , which is a low probability event for the elements in ; more precisely, the probability of this event is equal to (since ) and (since ). Thus , which implies that the expected value of is at least . However, whereas is available in the post-processing phase, elements of may not be available and they may account for most of the value of . The key insight is to show that makes up for the lost value from these elements.
We now dive into the analysis. We start with two helper lemmas, which follow from standard arguments that have been used in previous works. We include the proofs in the appendix for completeness.
The following lemma follows from a standard argument based on the Lovász extension and its properties.
Let . Let and be random sets such that and . Then .
The following lemma establishes a consistency property for the STGreedy algorithm, analogous to the consistency property shown and used by Barbosa et al. for algorithms such as the standard Greedy algorithm. The proof is also very similar to the proof shown by Barbosa et al. and we include it in the appendix for completeness.
Conditioned on the event , we have .
We now proceed with the main analysis. Recall that OfflinePostProcess runs the algorithm OfflineAlg on to obtain a solution , and returns . In the following lemma, we show that the value of this solution is proportional to . Note that may not be feasible, since we could have , and hence the scaling based on .
To simplify notation, we let . Let . First, we analyze . Let be a random subset of such that and . We can select such a subset as follows: we first choose a permutation of uniformly at random, and let be the first elements in the permutation. For each element of , we add it to with probability .
Since is a feasible solution contained in and OfflineAlg achieves an -approximation, we have
Recall that $\hat{f}$ is the Lovász extension of $f$, and that it is convex and has the restricted scale invariance property. We use the above inequality and the properties of $\hat{f}$ to obtain:
By taking expectation over only (more precisely, the random sampling that we used to select ):
Next, we lower bound using a convex combination with coefficient
Note that, since , we have . By taking this convex combination and using the previous inequality lower bounding , we obtain:
(We note that we chose to make the coefficients of and to be equal, and this allowed us to relate the value of the final solution to .) ∎
Next, we analyze the expected value of . We do so in two steps: first we analyze the marginal gain of on top of and show that it is suitably small, and then we analyze and show that its expected value is proportional to . We use the notation to denote the marginal gain of on top of , i.e., .
We have .
As before, to simplify notation, we let and . We break down the expectation using the law of total expectation as follows:
Above, we have used that , where the first inequality follows by submodularity. We have also used that . Thus it only remains to show that .
We condition on the event for the remainder of the proof. By Lemma 3.3, we have . Since , each element of was rejected because its marginal gain was below the threshold when it arrived in the stream. This together with submodularity implies that
We have .
Let . We show below that . Assuming this claim, we obtain the desired result by applying Lemma 3.2 with and . Since and is a -sample of , we have and thus we can take . By our claim, we have and thus we can take .
Thus it only remains to show that, for each , we have . Equivalently, we want to show that, for each , we have . Recall that is a deterministic partition of . Thus belongs to exactly one of and and we consider each of these cases in turn.
Suppose that . A single iteration of the for loop of StreamProcess ensures that is in with probability . Since we perform independent iterations, we have .
Suppose that . We have
where the first equality follows from the definition of , the second equality follows from the definition of and the fact that , and the inequality follows from the definition of . ∎
We have .
Recall that we use the notation . We have
where the inequality is by submodularity.
We have .
-  Naor Alaluf and Moran Feldman. Making a sieve random: Improved semi-streaming algorithm for submodular maximization under a cardinality constraint. CoRR, abs/1906.11237, 2019.
-  Ashwinkumar Badanidiyuru, Baharan Mirzasoleiman, Amin Karbasi, and Andreas Krause. Streaming submodular maximization: Massive data summarization on the fly. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 671–680. ACM, 2014.
-  Rafael da Ponte Barbosa, Alina Ene, Huy L Nguyen, and Justin Ward. A new framework for distributed submodular maximization. In IEEE Foundations of Computer Science (FOCS), pages 645–654, 2016.
-  Rafael D.P. Barbosa, Alina Ene, Huy L. Nguyen, and Justin Ward. The power of randomization: Distributed submodular maximization on massive datasets. In International Conference on Machine Learning (ICML), 2015.
-  Niv Buchbinder and Moran Feldman. Constrained submodular maximization via a nonsymmetric technique. Mathematics of Operations Research, 2019.
-  Niv Buchbinder, Moran Feldman, Joseph Naor, and Roy Schwartz. Submodular maximization with cardinality constraints. In ACM-SIAM Symposium on Discrete Algorithms (SODA), 2014.
-  Amit Chakrabarti and Sagar Kale. Submodular maximization meets streaming: Matchings, matroids, and more. Mathematical Programming, 154(1-2):225–247, 2015.
-  Chandra Chekuri, Shalmoli Gupta, and Kent Quanrud. Streaming algorithms for submodular function maximization. In International Colloquium on Automata, Languages and Programming (ICALP), pages 318–330. Springer, 2015.
-  Alina Ene and Huy L Nguyen. Constrained submodular maximization: Beyond 1/e. In IEEE Foundations of Computer Science (FOCS), pages 248–257. IEEE, 2016.
-  Alessandro Epasto, Vahab Mirrokni, and Morteza Zadimoghaddam. Bicriteria distributed submodular maximization in a few rounds. In PACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 25–33, 2017.
-  Moran Feldman, Amin Karbasi, and Ehsan Kazemi. Do less, get more: Streaming submodular maximization with subsampling. In Advances in Neural Information Processing Systems (NIPS), pages 732–742, 2018.
-  Moran Feldman, Joseph (Seffi) Naor, and Roy Schwartz. A unified continuous greedy algorithm for submodular maximization. In 52nd IEEE Foundations of Computer Science (FOCS), pages 570–579. IEEE Computer Society, 2011.
-  Shayan Oveis Gharan and Jan Vondrák. Submodular maximization by simulated annealing. In ACM-SIAM Symposium on Discrete Algorithms (SODA), 2011.
-  Ravi Kumar, Benjamin Moseley, Sergei Vassilvitskii, and Andrea Vattani. Fast greedy algorithms in mapreduce and streaming. In PACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 1–10, 2013.
-  Vahab Mirrokni and Morteza Zadimoghaddam. Randomized composable core-sets for distributed submodular maximization. In ACM Symposium on Theory of Computing (STOC), 2015.
-  Vahab S. Mirrokni and Morteza Zadimoghaddam. Randomized composable core-sets for distributed submodular maximization. In ACM Symposium on Theory of Computing (STOC), 2015.
-  Baharan Mirzasoleiman, Stefanie Jegelka, and Andreas Krause. Streaming non-monotone submodular maximization: Personalized video summarization on the fly. In Thirty-second AAAI Conference on Artificial Intelligence, 2018.
-  Baharan Mirzasoleiman, Amin Karbasi, Ashwinkumar Badanidiyuru, and Andreas Krause. Distributed submodular cover: Succinctly summarizing massive data. In Advances in Neural Information Processing Systems, pages 2881–2889, 2015.
-  Baharan Mirzasoleiman, Amin Karbasi, Rik Sarkar, and Andreas Krause. Distributed submodular maximization: Identifying representative elements in massive data. In Advances in Neural Information Processing Systems (NIPS), pages 2049–2057, 2013.
-  G L Nemhauser, L A Wolsey, and M L Fisher. An analysis of approximations for maximizing submodular set functions—i. Mathematical Programming, 14(1):265–294, 1978.
-  Ashkan Norouzi-Fard, Jakub Tarnawski, Slobodan Mitrovic, Amir Zandieh, Aidasadat Mousavifar, and Ola Svensson. Beyond 1/2-approximation for submodular maximization on massive data streams. In International Conference on Machine Learning, pages 3826–3835, 2018.
Appendix A Omitted proofs
(Proof of Lemma 3.1) To simplify notation, let . Let be the elements of in the order in which they were added to . Let . We have and thus
(Proof of Lemma 3.3) To simplify notation, we let and . Let . Suppose for contradiction that . Let be the elements of in the order in which they arrived in the stream. Let be the smallest index such that . By the choice of , we have . Note that , since and by assumption. Since , we must have (and thus ) and . The latter implies that : after processing all of the elements of that arrived before , the partial greedy solution is ; when arrives, it is added to the solution since and . But then , which is a contradiction. ∎
(Proof of Lemma 3.2) Let $\hat{f}$ be the Lovász extension of $f$. Using the fact that $\hat{f}$ is an extension and it is convex, we obtain
Let . Note that
The first equality is the definition of . The inequality is by the non-negativity of . The second equality is due to the fact that, for , we have . ∎