
Multi-Pass Streaming Algorithms for Monotone Submodular Function Maximization

We consider maximizing a monotone submodular function under a cardinality constraint or a knapsack constraint in the streaming setting. In particular, the elements arrive sequentially, and at any point of time the algorithm has access to only a small fraction of the data stored in primary memory. We propose the following streaming algorithms taking O(ε^{-1}) passes: (i) a (1 − e^{-1} − ε)-approximation algorithm for the cardinality-constrained problem, and (ii) a (0.5 − ε)-approximation algorithm for the knapsack-constrained problem. Both of our algorithms run in O^*(n) time, using O^*(K) space, where n is the size of the ground set and K is the size of the knapsack. Here the term O^* hides a polynomial of log K and ε^{-1}. Our streaming algorithms can also be used as fast approximation algorithms. In particular, for the cardinality-constrained problem, our algorithm takes O(nε^{-1} log(ε^{-1} log K)) time, improving on the algorithm of Badanidiyuru and Vondrák that takes O(nε^{-1} log(ε^{-1} K)) time.


1 Introduction

A set function f: 2^E → R_+ on a ground set E is submodular if it satisfies the diminishing marginal return property, i.e., for any subsets S ⊆ T ⊆ E and any element e ∈ E \ T,

f(S ∪ {e}) − f(S) ≥ f(T ∪ {e}) − f(T).

A function f is monotone if f(S) ≤ f(T) for any S ⊆ T ⊆ E. Submodular functions play a fundamental role in combinatorial optimization, as they capture rank functions of matroids, edge cuts of graphs, and set coverage, just to name a few examples. Besides their theoretical interest, submodular functions have attracted much attention from the machine learning community because they can model various practical problems such as online advertising [1, 16, 26], sensor location [17], text summarization [21, 22], and maximum entropy sampling [19].
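To make the definitions concrete, here is a minimal Python sketch (ours, not from the paper) of a set-coverage function, which is monotone and submodular; the ground set and the regions are made up for illustration.

regions = {
    'a': {1, 2, 3},
    'b': {3, 4},
    'c': {4, 5, 6},
    'd': {1, 6},
}

def f(S):
    # Coverage: the number of points covered by the regions indexed by S.
    covered = set()
    for e in S:
        covered |= regions[e]
    return len(covered)

def marginal(e, S):
    # Marginal return f_S(e) = f(S + e) - f(S).
    return f(S | {e}) - f(S)

# Diminishing returns: for S ⊆ T and e ∉ T, f_S(e) >= f_T(e).
S, T, e = {'a'}, {'a', 'b'}, 'c'
assert marginal(e, S) >= marginal(e, T)  # here 3 >= 2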

Many of the aforementioned applications can be formulated as the maximization of a monotone submodular function under a knapsack constraint. In this problem, we are given a monotone submodular function f: 2^E → R_+, a size function c: E → Z_+, and an integer K ∈ Z_+, where Z_+ denotes the set of positive integers. The problem is defined as

maximize f(S) subject to c(S) ≤ K, S ⊆ E, (1)

where we denote c(S) = Σ_{e∈S} c(e) for a subset S ⊆ E. Throughout this paper, we assume that every item e ∈ E satisfies c(e) ≤ K, as otherwise we can simply discard it. Note that, when c(e) = 1 for every item e ∈ E, the constraint coincides with a cardinality constraint:

maximize f(S) subject to |S| ≤ K, S ⊆ E. (2)

The problem of maximizing a monotone submodular function under a knapsack or a cardinality constraint is classical and well studied [13, 28]. The problem is known to be NP-hard but can be approximated within a factor of (close to) 1 − e^{-1}; see, e.g., [3, 10, 14, 18, 27, 29]. Notice that for both problems, it is standard to assume that an oracle for f is given and that the complexity of an algorithm is measured by the number of oracle calls.
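For reference, the classical greedy algorithm behind the (1 − e^{-1}) bound of [14] can be sketched in a few lines (a simplified rendering that reuses the toy f above; it needs K rounds of random access over the whole ground set, which is exactly what the streaming setting forbids):

def greedy(E, f, K):
    # K rounds; each adds the item with the largest marginal return.
    # O(nK) oracle calls and random access to the ground set E.
    S = set()
    for _ in range(K):
        best = max((e for e in E if e not in S),
                   key=lambda e: f(S | {e}) - f(S),
                   default=None)
        if best is None or f(S | {best}) - f(S) <= 0:
            break
        S.add(best)
    return S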

In this work, we study the two problems with a focus on designing space- and time-efficient approximation algorithms. In particular, we assume the streaming setting: each item in the ground set arrives sequentially, and we can keep only a small number of the items in memory at any point. This setting renders most of the techniques in the literature ineffective, as they typically require random access to the data.
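Operationally, a pass can be modeled as one full scan of the arrival order; the code sketches below use the following helper (our convention, not the paper's notation):

def make_stream(items):
    # One "pass" = one complete scan of the arrival order. Multi-pass
    # algorithms may call stream() several times, but may keep only
    # small (roughly O(K)-size) state during and between passes.
    def stream():
        return iter(items)
    return stream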

Our Contribution

Our contributions are summarized as follows.

Theorem 1.1 (Cardinality Constraint).

Let ε > 0. We design streaming (1 − e^{-1} − ε)-approximation algorithms for the problem (2) requiring either

  1. O(K) space, O(ε^{-1} log(ε^{-1} log K)) passes, and O(nε^{-1} log(ε^{-1} log K)) running time, or

  2. O(Kε^{-1}) space, O(ε^{-1}) passes, and O(nε^{-2}) running time.

Theorem 1.2 (Knapsack Constraint).

Let ε > 0. We design a streaming (0.5 − ε)-approximation algorithm for the problem (1) requiring O^*(K) space, O(ε^{-1}) passes, and O^*(n) running time, where O^* hides a polynomial of log K and ε^{-1}.

To put our results in a better context, we list related work in Tables 1 and 2. For the cardinality-constrained problem, our first algorithm achieves the same ratio as Badanidiyuru and Vondrák [3], using the same space, while strictly improving on the running time and the number of passes. The second algorithm improves the number of passes further to O(ε^{-1}), which is independent of K and n, but slightly loses out in the running time and the space requirement.

For the knapsack-constrained problem, our algorithm gives the best ratio so far using only small space (though at the cost of using more passes than [15, 30]). In the non-streaming setting, Sviridenko [27] gave a (1 − e^{-1})-approximation algorithm based on partial enumeration, which takes O(n^5) time. Very recently, Ene and Nguyễn [11] gave a (1 − e^{-1} − ε)-approximation algorithm, which takes O((1/ε)^{O(1/ε^4)} n log^2 n) time.[1]

[1] In [3], a (1 − e^{-1} − ε)-approximation algorithm with a smaller running time was claimed. However, this algorithm seems to require some assumption on the curvature of the submodular function. See [11, 29] for details on this issue.

Table 1: The cardinality-constrained problem. The columns compare the approximation ratio, number of passes, space, and running time of: Badanidiyuru et al. [2] (single pass), our two algorithms (Theorem 1.1), Badanidiyuru–Vondrák [3], Mirzasoleiman et al. [24] (ratio in expectation), and Greedy [14]. [The numeric entries of the table were lost in extraction.]
Table 2: The knapsack-constrained problem. The columns compare the approximation ratio, number of passes, space, and running time of: Yu et al. [30] (single pass), Huang et al. [15] (single pass and 3 passes), our three algorithms (Sections 3–5), Ene and Nguyễn [11], and Sviridenko [27]. The algorithms [11, 27] are not for the streaming setting. [The numeric entries of the table were lost in extraction.]
Our Technique

We first give an algorithm, called Simple, for the cardinality-constrained problem (2). This algorithm is later used as a subroutine for the knapsack-constrained problem (1). The basic idea of Simple is similar to those in [3, 23]: in each pass, a certain threshold is set; items whose marginal value exceeds the threshold are added to the collection, and the others are simply ignored. In [3, 23], the threshold is decreased in a conservative way (by a factor of 1 − ε) in each pass. In contrast, we adjust the threshold dynamically, based on the f-value of the current collection. We show that, after O(ε^{-1}) passes, we reach a (1 − e^{-1} − ε)-approximation. To set the threshold, we need a prior estimate of the optimal value, which we show can be found by a pre-processing step requiring either O(Kε^{-1}) space and a single pass, or O(K) space and a binary search over the candidate estimates that takes additional passes. The implementation and analysis of the algorithm are very simple. See Section 2 for the details.

For the knapsack-constrained problem (1), let us first point out the challenges in the streaming setting. The techniques achieving the best ratios in the literature are those in [11, 27]. In [27], partial enumeration and density greedy are used. In the former, small sets (each of size at most 3) of items are guessed, and for each guess, density greedy adds items in decreasing order of marginal density (i.e., the marginal value divided by the item size). Implementing density greedy in the streaming setting would require a large number of passes. In [11], partial enumeration is replaced by a more sophisticated multi-stage guessing strategy (where fractional items are added based on the technique of multilinear extension), and a "lazy" version of density greedy is used so as to keep down the time complexity. This version of density greedy nonetheless requires a priority queue storing the densities of all items, and hence large space.

We present algorithms, in increasing order of sophistication, in Sections 3 to 5, that give (1/3 − ε)-, (2/5 − ε)-, and (1/2 − ε)-approximations, respectively. The first, simpler algorithms are useful for illustrating the main ideas, and they are also used as subroutines in the later, more involved algorithms. The first algorithm adapts the algorithm Simple from the cardinality-constrained case. We show that Simple still performs well if all items in the optimal solution (henceforth denoted by OPT) are small in size. Therefore, by ignoring the largest optimal item e_1, we can obtain a (1/3 − ε)-approximate solution (see Section 3).

The difficulty arises when c(e_1) is large and the value f(e_1) is too large to be ignored. To take care of such a large item, we first aim at finding a good item e whose size approximates that of e_1, using a single pass [15]. This item satisfies the following properties: (1) f(e) is large, and (2) the marginal value of OPT − e_1 with respect to e is large. Then, after finding this item e, we apply Simple to pack items of OPT − e_1 into the remaining space. Since the largest item size in OPT − e_1 is smaller, the performance of Simple is better than just applying Simple to the original instance. The same argument can be applied to OPT − e_1 − e_2, where e_2 is the second-largest item of OPT. These solutions together yield a (2/5 − ε)-approximation (see Section 4 for the details).

The above strategy would give a (1/2 − ε)-approximation if c(e_1) were large enough. When c(e_1) is small, we need to generalize the above ideas further. In Section 5, we propose a two-phase algorithm. In Phase 1, an initial good set S is chosen (instead of a single good item); in Phase 2, we pack items of some subset OPT' ⊆ OPT using the remaining space. Ideally, the good set S should satisfy the following properties: (1) f(S) is large, (2) the marginal value of OPT' with respect to S is large, and (3) the remaining space, K − c(S), is sufficiently large to pack the items of OPT'. To find such a set S, we design two strategies, depending on the sizes c(e_1) and c(e_2) of the two largest items of OPT.

The first case is when c(e_2) is large. As mentioned above, we may assume that f(e_1) is small. In a similar way, we can show that f(e_2) is small. Then there exists a "dense" set of small items in OPT, i.e., a subset whose f-value is large relative to its total size. The good set S can thus consist of small items approximating this dense set while still leaving enough space for Phase 2.

The other case is when c(e_2) is small. In this case, we apply a modified version of Simple to obtain a good set S. The modification allows us to lower-bound the marginal value of OPT' with respect to S. Furthermore, we can show that S is already a (1/2 − ε)-approximation when c(S) is large. Thus we may assume that c(S) is small, implying that we still have enough space to pack the items of OPT' in Phase 2.

Related Work

Maximizing a monotone submodular function subject to various constraints is a subject that has been studied extensively in the literature. We do not attempt to give a complete survey here and only highlight the most relevant results. Besides a knapsack constraint or a cardinality constraint as mentioned above, the problem has also been studied under (multiple) matroid constraint(s), p-system constraints, and multiple knapsack constraints. See [5, 7, 8, 10, 12, 18, 20] and the references therein. In the streaming setting, researchers have considered the same problem under a matroid constraint [6] and a knapsack constraint [15, 30], as well as the problem without monotonicity [9, 25].

For the special case of a set-coverage function with a cardinality constraint, McGregor and Vu [23] give a (1 − e^{-1} − ε)-approximation algorithm in the streaming setting. They use a sampling technique to estimate the value of f, and then collect items based on thresholds in O(ε^{-1}) passes. Bateni et al. [4] independently proposed a streaming algorithm with a sketching technique for the same problem.

Notation

For a subset S ⊆ E and an element e ∈ E, we use the shorthands S + e and S − e to stand for S ∪ {e} and S \ {e}, respectively. For a function f: 2^E → R_+, we also use the shorthand f(e) to stand for f({e}). The marginal return of adding e ∈ E with respect to S ⊆ E is defined as f_S(e) = f(S + e) − f(S).

2 Cardinality Constraint

2.1 Simple Algorithm with Approximated Optimal Value

In this section, we introduce a procedure Simple (see Algorithm 1). This procedure can be used to give a (1 − e^{-1} − ε)-approximation for the cardinality-constrained problem; moreover, it will be adapted to the knapsack-constrained problem in Section 3.

The input of Simple consists of

  1. An instance (f, K) of the problem (2).

  2. Approximated values v and k of f(OPT) and |OPT|, respectively, where OPT is an optimal solution of the instance. Specifically, we suppose (1 + ε)v ≤ f(OPT) and k ≥ |OPT|.

The output of Simple is a set S that satisfies f(S) ≥ γv for a constant γ (close to 1 − e^{-1}) that will be determined later. If in addition (1 − ε)f(OPT) ≤ (1 + ε)v, then the output turns out to be a (1 − e^{-1} − O(ε))-approximation. We will describe how to find a value v satisfying these conditions in the next subsection.

1:procedure Simple(ε, v, k) ▷ (1 + ε)v ≤ f(OPT) and k ≥ |OPT|
2:     S ← ∅.
3:     repeat
4:          S_0 ← S and τ ← (v − f(S_0))/k.
5:         for each e ∈ E do
6:              if f_S(e) ≥ τ and |S| < k then S ← S + e.
7:         Δ ← f(S) − f(S_0).
8:     until Δ < εv
9:     return S.
Algorithm 1
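A direct Python rendering of Algorithm 1 (a minimal sketch: the threshold (v − f(S))/k and the stopping rule "a pass gains less than εv" follow our reconstruction above, and stream/f are as in the helpers of Section 1):

def simple(stream, f, k, v, eps):
    # Multi-pass thresholding for the cardinality constraint.
    # v: prior estimate with (1 + eps) * v <= f(OPT); k: cardinality budget.
    S = set()
    while True:
        base = f(S)
        tau = (v - base) / k  # dynamic threshold from the current f-value
        for e in stream():
            if len(S) >= k:
                break
            if e not in S and f(S | {e}) - f(S) >= tau:
                S.add(e)
        if len(S) >= k or f(S) - base < eps * v:
            return S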

The following observations hold for the algorithm Simple.

Lemma 2.1.

During the execution of Simple in each round (in Lines 3–8), the following hold:

  1. The current set S always satisfies f(S) ≥ f(S_0) + τ|S \ S_0|, where S_0 is the set at the beginning of the round and τ is the threshold of the round.

  2. If an item e fails the condition at Line 6, where S' is the set just before e arrives, then the final set S in the round satisfies f_S(e) < τ.

Proof.

(1) Every item e added in the round satisfies f_{S'}(e) ≥ τ, where S' is the set just before e arrives. Hence f(S) ≥ f(S_0) + τ|S \ S_0|. (2) follows from the definition of submodularity: since S' ⊆ S, we have f_S(e) ≤ f_{S'}(e) < τ. ∎

Moreover, we can bound f(S) from below using the size of S.

Lemma 2.2.

At the end of each round (in Lines 3–8), we have

f(S) ≥ (1 − (1 − 1/k)^{|S|}) v.

Proof.

We prove the statement by induction on the number of rounds. Let S be the set at the end of some round. Furthermore, let S_0 and τ = (v − f(S_0))/k be the corresponding set and threshold at the beginning of that round; thus S_0 ⊆ S. By the induction hypothesis, we have

f(S_0) ≥ (1 − (1 − 1/k)^{|S_0|}) v.

Note that in the first round S_0 = ∅, which also satisfies the above inequality.

Due to Lemma 2.1(1), it holds that f(S) ≥ f(S_0) + τ|S \ S_0|, where τ = (v − f(S_0))/k. Hence it holds that

v − f(S) ≤ (1 − |S \ S_0|/k)(v − f(S_0)) ≤ (1 − 1/k)^{|S \ S_0|}(v − f(S_0)) ≤ (1 − 1/k)^{|S|} v,

where the second inequality uses 1 − j/k ≤ (1 − 1/k)^j and the third uses the induction hypothesis. Since |S_0| + |S \ S_0| = |S|, we have

f(S) ≥ (1 − (1 − 1/k)^{|S|}) v,

which proves the lemma. ∎

The next lemma says that the function value increases by at least εv in each round in which the cardinality bound is not attained. This implies that the algorithm terminates in O(ε^{-1}) rounds.

Lemma 2.3.

Suppose that we run Simple(ε, v, k) with (1 + ε)v ≤ f(OPT) and k ≥ |OPT|. At the end of each round, if the final set S (at Line 7) satisfies |S| < k, then f(S) − f(S_0) ≥ εv, where S_0 is the set at the beginning of the round.

Proof.

Suppose that the final set S satisfies |S| < k. This means that, in the last round, each item of OPT \ S is discarded because its marginal return does not exceed the threshold τ = (v − f(S_0))/k, which implies that f_S(e) < τ for every e ∈ OPT \ S by Lemma 2.1(2). As k ≥ |OPT|, we have from submodularity that

f_S(OPT) ≤ Σ_{e ∈ OPT\S} f_S(e) < kτ = v − f(S_0).

Since (1 + ε)v ≤ f(OPT) ≤ f(S) + f_S(OPT) by monotonicity, we obtain f(S) ≥ (1 + ε)v − (v − f(S_0)) = f(S_0) + εv, which proves the lemma. ∎

From Lemmas 2.2 and 2.3, we have the following.

Theorem 2.4.

Let (f, K) be an instance of the cardinality-constrained problem (2). Suppose that (1 − ε)f(OPT) ≤ (1 + ε)v ≤ f(OPT). Then Simple(ε, v, K) can compute a (1 − e^{-1} − O(ε))-approximate solution in O(ε^{-1}) passes and O(K) space. The total running time is O(nε^{-1}).

Proof.

While |S| < K, the f-value is increased by at least εv in each round by Lemma 2.3. Since f(S) ≤ f(OPT) ≤ 2v for ε ≤ 1/3, the number of rounds is at most O(ε^{-1}). As each round takes O(n) time, the total running time is O(nε^{-1}). Since we only store a set S of size at most K, the space required is clearly O(K).

The algorithm therefore terminates with |S| = K. From Lemma 2.2 and the fact that (1 − 1/K)^K ≤ e^{-1}, we have

f(S) ≥ (1 − e^{-1}) v ≥ (1 − e^{-1} − O(ε)) f(OPT). ∎
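Running the earlier sketch on the toy coverage instance of Section 1 (the helpers make_stream, f, and simple are our assumed names; the numbers are only illustrative):

stream = make_stream(['a', 'b', 'c', 'd'])
S = simple(stream, f, k=2, v=5, eps=0.1)  # v = 5 <= f(OPT) = 6 for k = 2
print(sorted(S), f(S))                    # here: ['a', 'c'] 6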

2.2 Algorithm with Guessing the Optimal Value

We first note that v_max ≤ f(OPT) ≤ K·v_max, where v_max = max_{e∈E} f(e). Hence, if we prepare the set of guesses V = {v_max(1 + ε)^i : 0 ≤ i ≤ ⌈log_{1+ε} K⌉}, then we can guess v ∈ V such that (1 − ε)f(OPT) ≤ (1 + ε)v ≤ f(OPT). As the size of V is O(ε^{-1} log K), if we run Simple in parallel for each element of V, we need O(Kε^{-1} log K) space and O(ε^{-1}) passes in the streaming setting. This, however, will take O(nε^{-2} log K) running time. We remark that, using a constant-factor approximate solution obtained by a single-pass streaming algorithm [3], we can restrict the guesses to a range of constant multiplicative width, which leads to O(Kε^{-1}) space and O(nε^{-2}) time, taking O(ε^{-1}) passes. This proves the second part of Theorem 1.1.
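A sketch of the guessing step under our reading of the construction (guess_grid is a hypothetical helper name):

import math

def guess_grid(lo, hi, eps):
    # Geometric series {lo * (1 + eps)^i} covering [lo, hi]: if
    # lo <= f(OPT) <= hi, then some grid point v satisfies
    # (1 - eps) * f(OPT) <= v <= f(OPT).
    m = int(math.ceil(math.log(hi / lo, 1.0 + eps)))
    return [lo * (1.0 + eps) ** i for i in range(m + 1)]

# With only v_max: O(eps^-1 log K) candidates on [v_max, K * v_max].
# With a single-pass constant-factor estimate of f(OPT), a constant-width
# range remains, leaving only O(eps^-1) candidates.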

Below we explain how to reduce the running time to O(nε^{-1} log(ε^{-1} log K)) by a binary search.

Theorem 2.5.

We can find a (1 − e^{-1} − ε)-approximate solution in O(ε^{-1} log(ε^{-1} log K)) passes and O(K) space, running in O(nε^{-1} log(ε^{-1} log K)) time.

Proof.

We here describe an algorithm using Simple with a slight modification. Let m^* be the minimum integer that satisfies (1 + ε)^{m^*} ≥ K. It follows that m^* = O(ε^{-1} log K).

We perform a binary search over the exponents {0, 1, …, m^*} of the guesses v_max(1 + ε)^m. Set s = 0 and t = m^*. Suppose that we currently have an interval [s, t] with s ≤ t, and take the middle m = ⌊(s + t)/2⌋ and v = v_max(1 + ε)^m. Perform Simple(ε, v, K), but stop the repetition after ⌈ε^{-1}⌉ + 1 rounds.

Suppose that the output S is of size K. Then we have f(S) ≥ (1 − e^{-1})v by Lemma 2.2. Hence the guess v is certified, and we record (m, f(S)) and set s = m + 1 to try a larger guess.

Suppose that the output S is of size less than K. It follows from Lemma 2.3 that, if (1 + ε)v ≤ f(OPT), then every round increases the f-value by at least εv. Hence, after ⌈ε^{-1}⌉ + 1 rounds, we would have f(S) ≥ f(OPT) while the last round still increases the f-value, a contradiction. Thus we are sure that (1 + ε)v > f(OPT), and we set t = m − 1.

We repeat the above binary search until the interval becomes empty. As m^* = O(ε^{-1} log K), the number of iterations is O(log(ε^{-1} log K)). Since each iteration takes O(ε^{-1}) passes, the algorithm takes O(ε^{-1} log(ε^{-1} log K)) passes in total. The running time is O(nε^{-1} log(ε^{-1} log K)). Notice that there is no need to store the solutions obtained in each iteration; rather, just the function values and the corresponding indices are enough to find the best guess, with which we run Simple once more at the end. Therefore, just O(K) space suffices. The algorithm description is given in Algorithm 2. ∎

1:procedure Cardinality(ε)
2:     v_max ← max_{e∈E} f(e), and let m^* be the minimum integer that satisfies (1 + ε)^{m^*} ≥ K.
3:     s ← 0, t ← m^*, and A ← ∅.
4:     while s ≤ t do
5:          m ← ⌊(s + t)/2⌋ and v ← v_max(1 + ε)^m.
6:         S ← ∅. ▷ Perform Simple(ε, v, K) but stop in ⌈ε^{-1}⌉ + 1 rounds
7:         for j = 1, 2, …, ⌈ε^{-1}⌉ + 1 do
8:               S_0 ← S and τ ← (v − f(S_0))/K.
9:              for each e ∈ E do
10:                  if f_S(e) ≥ τ and |S| < K then S ← S + e.
11:              if f(S) − f(S_0) < εv then break.
12:         if |S| = K then
13:               A ← A ∪ {(m, f(S))} and s ← m + 1.
14:         else
15:               t ← m − 1.
16:      (m', ·) ← the pair in A with the largest second coordinate, and v ← v_max(1 + ε)^{m'}.
17:      return Simple(ε, v, K).
Algorithm 2 Algorithm for the cardinality-constrained problem
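An executable sketch of the binary-search wrapper (our reading of Algorithm 2: a guess is "certified" when the round-capped run fills all k slots, in which case Lemma 2.2 bounds f(S) from below; otherwise the guess is too optimistic):

import math

def simple_capped(stream, f, k, v, eps):
    # simple() from Section 2.1's sketch, capped at ceil(1/eps) + 1 rounds.
    S = set()
    for _ in range(int(math.ceil(1.0 / eps)) + 1):
        base = f(S)
        tau = (v - base) / k
        for e in stream():
            if len(S) >= k:
                break
            if e not in S and f(S | {e}) - f(S) >= tau:
                S.add(e)
        if len(S) >= k or f(S) - base < eps * v:
            break
    return S

def cardinality(stream, f, k, eps, grid):
    lo, hi = 0, len(grid) - 1
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        S = simple_capped(stream, f, k, grid[mid], eps)
        if len(S) >= k:
            best = grid[mid]  # certified guess; try a larger one
            lo = mid + 1
        else:
            hi = mid - 1      # guess likely exceeds f(OPT); go lower
    return simple_capped(stream, f, k, best, eps) if best is not None else set()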

3 Simple Algorithm for the Knapsack-Constrained Problem

In the rest of the paper, let (f, c, K) be an input instance of the problem (1). Let OPT denote an optimal solution, i.e., a set attaining max{f(S) : c(S) ≤ K, S ⊆ E}. We denote by e_i the item of OPT with the i-th largest size, so that c(e_1) ≥ c(e_2) ≥ ⋯.

Similarly to Section 2, we suppose that we know in advance an approximate value v of f(OPT), i.e., (1 − ε)f(OPT) ≤ (1 + ε)v ≤ f(OPT). Such a value can be found with a single-pass streaming algorithm with a constant approximation ratio [30] in O(n) time and O(K) space. Specifically, letting w be the value of the output of a single-pass constant-factor approximation algorithm, we know that the optimal value lies between w and a constant multiple of w. We can guess v by a geometric series in this range, and then the number of guesses is O(ε^{-1}). Thus, if we design an algorithm that runs in T time and M space provided the approximate value v, then the total running time is O(ε^{-1}(n + T)) and the space required is O(ε^{-1}M + K).

3.1 Simple Algorithm

We first claim that the algorithm Simple in Section 2 can be adapted for the knapsack-constrained problem (1) as below (Algorithm 3). At Line 6, we pick an item when its marginal return per unit size exceeds the threshold τ and the item still fits into the remaining capacity. We stop the repetition when a round increases the f-value by less than εv. Clearly, the algorithm terminates.

1:procedure Simple(ε, v, K)
2:     S ← ∅.
3:     repeat
4:          S_0 ← S and τ ← (v − f(S_0))/K.
5:         for each e ∈ E do
6:              if f_S(e) ≥ τ·c(e) and c(S) + c(e) ≤ K then S ← S + e.
7:         Δ ← f(S) − f(S_0).
8:     until Δ < εv
9:     return S.
Algorithm 3
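A Python rendering of Algorithm 3 under the same assumed details as the cardinality sketch (threshold (v − f(S))/K on the marginal value per unit size, termination on a low-gain pass):

def simple_knapsack(stream, f, c, K, v, eps):
    # c: item-size function; K: knapsack capacity; v: estimate of f(OPT).
    S, used = set(), 0
    while True:
        base = f(S)
        tau = (v - base) / K  # threshold on marginal density
        for e in stream():
            if e in S or used + c(e) > K:
                continue
            if f(S | {e}) - f(S) >= tau * c(e):
                S.add(e)
                used += c(e)
        if f(S) - base < eps * v:
            return S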

In a similar way to Lemmas 2.1 and 2.2, we have the following observations. We omit the proof.

Lemma 3.1.

During the execution of Simple in each round (in Lines 3–8), the following hold:

  1. The current set S always satisfies f(S) ≥ f(S_0) + τ·c(S \ S_0), where S_0 is the set at the beginning of the round and τ is the threshold of the round.

  2. If an item e with c(S') + c(e) ≤ K fails the condition at Line 6, where S' is the set just before e arrives, then the final set S in the round satisfies f_S(e) < τ·c(e).

  3. At the end of each round, we have

f(S) ≥ (1 − e^{-c(S)/K}) v.

Furthermore, similarly to the proof of Lemma 2.3, we see that the output has size more than K − max_{e∈OPT} c(e).

Lemma 3.2.

Suppose that we run Simple(ε, v, K) with (1 + ε)v ≤ f(OPT). At the end of the algorithm, it holds that c(S) > K − max_{e∈OPT} c(e).

Proof.

Suppose to the contrary that c(S) ≤ K − c(e) for every item e ∈ OPT in the end. Then, in the last round, each item of OPT \ S is discarded because its marginal return is not large enough (every such item would still fit), which implies that f_S(e) < τ·c(e) for every e ∈ OPT \ S by Lemma 3.1(2), where τ = (v − f(S_0))/K and S_0 is the initial set in the last round. As c(OPT) ≤ K and (1 + ε)v ≤ f(OPT) ≤ f(S) + f_S(OPT), we have

f(S) ≥ f(OPT) − Σ_{e ∈ OPT\S} f_S(e) > (1 + ε)v − τ·c(OPT) ≥ (1 + ε)v − (v − f(S_0)) = f(S_0) + εv.

Since the last round increases the f-value by less than εv, we obtain a contradiction, which proves the lemma. ∎

Thus, we obtain the following approximation ratio, which depends on the size of the largest item in OPT.

Lemma 3.3.

Let (f, c, K) be an instance of the problem (1). Suppose that (1 − ε)f(OPT) ≤ (1 + ε)v ≤ f(OPT), and let c_o = max_{e∈OPT} c(e). The algorithm Simple(ε, v, K) can find, in O(ε^{-1}) passes and O(K) space, a set S such that c(S) ≤ K and

f(S) ≥ (1 − e^{-(K − c_o)/K}) v. (3)

The total running time is O(nε^{-1}).

Proof.

Let S be the final set of Simple(ε, v, K). By Lemma 3.2, the final set satisfies c(S) > K − c_o. Hence (3) follows from Lemma 3.1(3). The number of passes is O(ε^{-1}), as each round except the last increases the f-value by at least εv and f(S) ≤ f(OPT) ≤ 2v for small ε. Hence the running time is O(nε^{-1}), and the space required is clearly O(K). ∎

Lemma 3.3 gives us a good ratio when c_o is small compared with K (see Corollary 5.1 in Section 5.1). However, the ratio worsens as c_o becomes larger. In the next subsection, we show that Simple can be used to obtain a (1/3 − ε)-approximation by ignoring large items.

3.2 (1/3 − ε)-Approximation: Ignoring Large Items

Let us remark that Simple would work for finding a set that approximates any subset X ⊆ E, not only OPT. More precisely, given an instance (f, c, K) of the problem (1), consider finding a feasible set S of the instance that approximates

(∗) a subset X ⊆ E such that (1 − ε)f(X) ≤ (1 + ε)v ≤ f(X) and c(X) ≤ k.

This means that v and k are the approximated values of f(X) and c(X), respectively. Let c_X = max_{e∈X} c(e). Note that X is not necessarily feasible to the instance, i.e., k (and thus c(X)) may be larger than K, but we assume that c(e) ≤ K for any e ∈ X. Then Simple(ε, v, k) can find an approximation of X.

Corollary 3.4.

Suppose that we are given an instance (f, c, K) of the problem (1) and a pair (v, k) satisfying the above condition (∗) for some subset X. Then Simple(ε, v, k) can find a set S in O(ε^{-1}) passes and O(K) space such that c(S) ≤ K and

f(S) ≥ (1 − e^{-(k − c_X)/k} − ε) v.

The total running time is O(nε^{-1}).

In particular, Corollary 3.4 can be applied to approximate OPT − e_1, given estimates of f(OPT − e_1) and c(OPT − e_1).

Corollary 3.5.

Suppose that we are given an instance (f, c, K) of the problem (1) such that e_1 is a largest-size item of OPT and f(e_1) ≤ βf(OPT) for some β ∈ [0, 1]. We further suppose that we are given v with (1 − ε)f(OPT) ≤ (1 + ε)v ≤ f(OPT) and k with c(OPT − e_1) ≤ k ≤ K − c(e_1). Then we can find a set S in O(ε^{-1}) passes and O(K) space such that c(S) ≤ K and

f(S) ≥ (1 − e^{-(k − c(e_2))/k} − ε)(1 − β) v.

In particular, when c(e_2) ≤ εk, we have

f(S) ≥ (1 − e^{-1} − 2ε)(1 − β) v. (4)
Proof.

We may assume that f(e) ≤ (1 − e^{-1} − 2ε)(1 − β)v for every item e ∈ E, as otherwise, by taking a singleton e^* with maximum return, we have f(e^*) ≥ (1 − e^{-1} − 2ε)(1 − β)v, implying that {e^*} satisfies the inequality as c(e^*) ≤ K. Moreover, it holds that f(OPT − e_1) ≥ f(OPT) − f(e_1) ≥ (1 − β)f(OPT) by submodularity, and c(OPT − e_1) ≤ k, and thus the pair ((1 − β)v, k) satisfies the condition (∗) with X = OPT − e_1. Using this fact, we perform Simple(ε, (1 − β)v, k) to approximate OPT − e_1. Since the largest size in OPT − e_1 is c(e_2), by Corollary 3.4, we can find a set S such that c(S) ≤ K and

f(S) ≥ (1 − e^{-(k − c(e_2))/k} − ε)(1 − β) v.

Thus the first part of the corollary holds.

When c(e_2) ≤ εk, the above bound is at least

(1 − e^{-(1 − ε)} − ε)(1 − β) v. (5)

We note that

1 − e^{-(1 − ε)} ≥ 1 − e^{-1} − ε.

Indeed, e^{-(1 − ε)} = e^{-1}·e^{ε} ≤ e^{-1}(1 + 2ε) ≤ e^{-1} + ε, where the middle inequality holds for ε ≤ 1 and the last inequality holds since 2e^{-1} ≤ 1. Thus we have (4) from (5). ∎

The above corollary, together with Lemma 3.3, delivers a (1/3 − ε)-approximation.

Corollary 3.6.

Suppose that we are given an instance (f, c, K) of the problem (1) with (1 − ε)f(OPT) ≤ (1 + ε)v ≤ f(OPT). Then we can find a (1/3 − ε)-approximate solution in O(ε^{-1}) passes and O(Kε^{-1} log K) space. The total running time is O(nε^{-2} log K).

Proof.

First suppose that c(e_1) ≤ K/2. Then Lemma 3.3 with c_o ≤ K/2 implies that we can find a set S such that

f(S) ≥ (1 − e^{-1/2} − ε) v ≥ (1/3 − ε) v.

Thus we may suppose that c(e_1) > K/2. We guess k with c(OPT − e_1) ≤ k ≤ (1 + ε)c(OPT − e_1) by a geometric series on the interval [1, K], i.e., we run the algorithm in parallel for O(ε^{-1} log K) candidate values of k, using O(Kε^{-1} log K) space. We may also suppose that f(e) ≤ v/3 for every item e ∈ E, as otherwise we can just take a singleton with maximum return from E. By Corollary 3.5 with β = 1/3 and the guessed k, we can find a set S such that

f(S) ≥ (1 − e^{-1} − 2ε)(2/3) v.

Since (1 − e^{-1})·(2/3) > 0.42, we have

f(S) ≥ (0.42 − 2ε) v ≥ (1/3 − 2ε) v.

Therefore, taking the best among all the candidate solutions, it holds that

f(S) ≥ (1/3 − 2ε) v ≥ (1/3 − O(ε)) f(OPT),

and rescaling ε completes the proof. ∎
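The case analysis of Corollary 3.6 suggests the following driver (our sketch, not the paper's algorithm: the density-threshold routine covers the case where no single item is too valuable, and the best feasible singleton covers the rest):

def one_third_approx(stream, f, c, K, v, eps):
    S = simple_knapsack(stream, f, c, K, v, eps)
    best = max((e for e in stream() if c(e) <= K),
               key=lambda e: f({e}), default=None)
    candidates = [S] + ([{best}] if best is not None else [])
    return max(candidates, key=f)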

4 (2/5 − ε)-Approximation Algorithm

In this section, we present a (2/5 − ε)-approximation algorithm for the knapsack-constrained problem. In our algorithm, we assume that we know in advance approximations of f(OPT) and c(e_i). That is, we are given v such that (1 − ε)f(OPT) ≤ (1 + ε)v ≤ f(OPT) and k_i such that k_i ≤ c(e_i) ≤ (1 + ε)k_i for i = 1, 2. Define … . We call items in