1 Introduction
A set function $f\colon 2^E \to \mathbb{R}_+$ on a ground set $E$ is submodular if it satisfies the diminishing marginal return property, i.e.,
$$f(S \cup \{e\}) - f(S) \;\geq\; f(T \cup \{e\}) - f(T)$$
for any subsets $S \subseteq T \subseteq E$ and any $e \in E \setminus T$.
A function $f$ is monotone if $f(S) \leq f(T)$ for any $S \subseteq T$. Submodular functions play a fundamental role in combinatorial optimization, as they capture rank functions of matroids, edge cuts of graphs, and set coverage, to name just a few examples. Besides their theoretical interest, submodular functions have attracted much attention from the machine learning community because they can model various practical problems such as online advertising
[1, 16, 26], sensor location [17, 21, 22], and maximum entropy sampling [19].

Many of the aforementioned applications can be formulated as the maximization of a monotone submodular function under a knapsack constraint. In this problem, we are given a monotone submodular function $f\colon 2^E \to \mathbb{R}_+$, a size function $c\colon E \to \mathbb{Z}_+$, and an integer $K \in \mathbb{Z}_+$, where $\mathbb{Z}_+$ denotes the set of positive integers. The problem is defined as
$$\max\,\{\, f(S) \;:\; c(S) \leq K,\ S \subseteq E \,\}. \qquad (1)$$
where we denote $c(S) = \sum_{e \in S} c(e)$ for a subset $S \subseteq E$. Throughout this paper, we assume that every item $e \in E$ satisfies $c(e) \leq K$, as otherwise we can simply discard it. Note that, when $c(e) = 1$ for every item $e \in E$, the constraint coincides with a cardinality constraint:
$$\max\,\{\, f(S) \;:\; |S| \leq K,\ S \subseteq E \,\}. \qquad (2)$$
The problem of maximizing a monotone submodular function under a knapsack or a cardinality constraint is classical and well-studied [13, 28]. The problem is known to be NP-hard but can be approximated within a factor of (close to) $1-1/e$; see e.g., [3, 10, 14, 18, 27, 29]. Notice that for both problems, it is standard to assume that a function oracle is given and that the complexity of the algorithms is measured by the number of oracle calls.
In this work, we study the two problems with a focus on designing space- and time-efficient approximation algorithms. In particular, we assume the streaming setting: each item in the ground set arrives sequentially, and we can keep only a small number of items in memory at any point. This setting renders most of the techniques in the literature ineffective, as they typically require random access to the data.
Our contribution
Our contributions are summarized as follows.
Theorem 1.1 (Cardinality Constraint).
Let . We design streaming approximation algorithms for the problem (2) requiring either

space, passes, and running time, or

space, passes, and running time.
Theorem 1.2 (Knapsack Constraint).
Let . We design streaming approximation algorithms for the problem (1) requiring space, passes, and running time.
To put our results in better context, we list related work in Tables 1 and 2. For the cardinality-constrained problem, our first algorithm achieves the same ratio as Badanidiyuru and Vondrák [3], using the same space, while strictly improving on the running time and the number of passes. The second algorithm further improves the number of passes to , which is independent of and , but slightly loses out in the running time and the space requirement.
For the knapsack-constrained problem, our algorithm gives the best ratio so far using only small space (though at the cost of using more passes than [15, 30]). In the non-streaming setting, Sviridenko [27] gave a approximation algorithm, which takes time. Very recently, Ene and Nguyễn [11] gave an approximation algorithm, which takes time.^1

^1 In [3], an approximation algorithm of running time was claimed. However, this algorithm seems to require some assumption on the curvature of the submodular function. See [11, 29] for details on this issue.
Our Technique
We first give an algorithm, called Simple, for the cardinality-constrained problem (2). This algorithm is later used as a subroutine for the knapsack-constrained problem (1). The basic idea of Simple is similar to those in [3, 23]: in each pass, a certain threshold is set; items whose marginal value exceeds the threshold are added to the collection, while the others are simply ignored. In [3, 23], the threshold is decreased conservatively (by a factor of ) in each pass. In contrast, we adjust the threshold dynamically, based on the value of the current collection. We show that, after passes, we reach a
approximation. To set the threshold, we need a prior estimate of the optimal value, which we show can be found by a preprocessing step requiring either
space and a single pass, or space and passes. The implementation and analysis of the algorithm are very simple. See Section 2 for the details.

For the knapsack-constrained problem (1), let us first point out the challenges in the streaming setting. The techniques achieving the best ratios in the literature are those in [11, 27]. In [27], partial enumeration and density greedy are used. In the former, small sets (each of size at most 3) of items are guessed, and for each guess, density greedy adds items in decreasing order of marginal ratio (i.e., the marginal value divided by the item size). Implementing density greedy in the streaming setting would require a large number of passes. In [11], partial enumeration is replaced by a more sophisticated multi-stage guessing strategy (where fractional items are added based on the technique of multilinear extension), and a "lazy" version of density greedy is used to keep down the time complexity. This version of density greedy nonetheless requires a priority queue storing the densities of all items, and hence large space.
We present algorithms, in increasing order of sophistication, in Sections 3 to 5, that give , , and approximations, respectively. The simpler algorithms are useful for illustrating the main ideas and are also used as subroutines in the later, more involved algorithms. The first algorithm adapts the algorithm Simple from the cardinality-constrained case. We show that Simple still performs well if all items in the optimal solution (henceforth denoted by OPT) are small in size. Therefore, by ignoring the largest optimal item , we can obtain a approximate solution (see Section 3).
The difficulty arises when is large and the function value is too large to be ignored. To take care of such a large item, we first aim at finding a good item whose size approximates that of , using a single pass [15]. This item satisfies the following properties: (1) is large, and (2) the marginal value of with respect to is large. Once we have this item , we apply Simple to pack items in . Since the largest item size in is smaller, the performance of Simple is better than when it is applied to the original instance. The same argument can be applied to , where is the second largest item. These solutions, together with , yield a approximation (see Section 4 for the details).
The above strategy would give a approximation if is large enough. When is small, we need to generalize the above ideas further. In Section 5, we propose a two-phase algorithm. In Phase 1, an initial good set is chosen (instead of a single good item); in Phase 2, we pack items in some subset using the remaining space. Ideally, the good set should satisfy the following properties: (1) is large, (2) the marginal value of with respect to is large, and (3) the remaining space, , is sufficiently large to pack items in . To find such a set , we design two strategies, depending on the sizes, , of the two largest items in OPT.
The first case is when is large. As mentioned above, we may assume that is small. In a similar way, we can show that is small. Then there exists a "dense" set of small items in OPT, i.e., is large. The good set can thus consist of small items approximating , while still leaving enough space for Phase 2.
The other case is when is small. In this case, we apply a modified version of Simple to obtain a good set . The modification allows us to bound from below the marginal value of with respect to . Furthermore, we can show that is already a approximation when is large. Thus we may assume that is small, implying that we still have enough space to pack items in in Phase 2.
Related Work
Maximizing a monotone submodular function subject to various constraints has been extensively studied in the literature. We do not attempt to give a complete survey here and only highlight the most relevant results. Besides a knapsack constraint or a cardinality constraint mentioned above, the problem has also been studied under (multiple) matroid constraints, a system constraint, and multiple knapsack constraints. See [5, 7, 8, 10, 12, 18, 20] and the references therein. In the streaming setting, researchers have considered the same problem with a matroid constraint [6] and a knapsack constraint [15, 30], as well as the problem without monotonicity [9, 25].
For the special case of a set-covering function with a cardinality constraint, McGregor and Vu [23] give a approximation algorithm in the streaming setting. They use a sampling technique to estimate the value of and then collect items based on thresholds using passes. Bateni et al. [4] independently proposed a streaming algorithm with a sketching technique for the same problem.
Notation
For a subset $S \subseteq E$ and an element $e \in E$, we use the shorthand $S + e$ and $S - e$ to stand for $S \cup \{e\}$ and $S \setminus \{e\}$, respectively. For a function $f$, we also use the shorthand $f(e)$ to stand for $f(\{e\})$. The marginal return of adding $e$ with respect to $S$ is defined as $f(e \mid S) = f(S + e) - f(S)$.
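As a concrete instance of this notation, the small Python sketch below builds a set-coverage function (one of the examples from the introduction) and checks the diminishing-return property; the helper names `make_coverage` and `marginal` are our own, not from the paper.

```python
def make_coverage(sets):
    """f(S) = size of the union of the chosen sets (monotone submodular)."""
    def f(S):
        covered = set()
        for i in S:
            covered |= sets[i]
        return len(covered)
    return f

def marginal(f, e, S):
    """Marginal return f(e | S) = f(S + e) - f(S)."""
    return f(S | {e}) - f(S)

sets = {0: {1, 2}, 1: {2, 3}, 2: {3, 4, 5}}
f = make_coverage(sets)

# Diminishing returns: f(e | S) >= f(e | T) whenever S is a subset of T.
S, T, e = {0}, {0, 1}, 2
assert marginal(f, e, S) >= marginal(f, e, T)
```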
2 Cardinality Constraint
2.1 Simple Algorithm with Approximated Optimal Value
In this section, we introduce a procedure Simple (see Algorithm 1). This procedure can be used to give a approximation under the cardinality constraint; moreover, it will be adapted to the knapsack-constrained problem in Section 3.
The input of Simple consists of

An instance for the problem (2).

Approximate values and of and , respectively, where OPT is an optimal solution of . Specifically, we suppose and .
The output of Simple is a set that satisfies for some constant to be determined later. If, in addition, , then the output turns out to be a approximation. We will describe how to find such values in the next subsection.
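Since the numeric details of Algorithm 1 are elided above, the following Python sketch only illustrates the multi-pass thresholding idea: in each round the threshold is recomputed from the value of the current collection, and a pass over the stream picks items whose marginal value clears it. The concrete rule `tau = (alpha * v - f(S)) / K`, the parameter names, and the toy coverage instance are our own assumptions, not the paper's exact algorithm.

```python
def simple(stream, f, K, v, alpha, max_rounds):
    """Multi-pass thresholding sketch: v approximates f(OPT),
    alpha is the target ratio, stream() replays the items each pass."""
    S = set()
    for _ in range(max_rounds):
        if len(S) >= K:
            break
        tau = (alpha * v - f(S)) / K   # threshold from current value (assumed rule)
        if tau <= 0:
            break
        changed = False
        for e in stream():             # one pass over the data
            if len(S) >= K:
                break
            if e not in S and f(S | {e}) - f(S) >= tau:
                S.add(e)
                changed = True
        if not changed:
            break
    return S

# Toy run on a coverage function (our own instance).
sets = {0: {1, 2}, 1: {2, 3}, 2: {3, 4, 5}}
def f(S):
    return len(set().union(*(sets[i] for i in S)))
S = simple(lambda: [0, 1, 2], f, K=2, v=5, alpha=0.5, max_rounds=10)
```

On this instance the sketch keeps the two items covering the most new elements and skips item 1, whose marginal value falls below the round's threshold.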
The following observations hold for the algorithm Simple.
Lemma 2.1.
During the execution of Simple in each round (in Lines 3–8), the following hold:

The current set always satisfies , where .

If an item fails the condition at Line 6, where is the set just before arrives, then the final set in the round satisfies .
Proof.
(1) Every item satisfies , where is the set just before arrives. Hence . (2) follows from the definition of submodularity. ∎
Moreover, we can bound from below using the size of .
Lemma 2.2.
At the end of each round (in Lines 3–8), we have
Proof.
We prove the statement by induction on the number of rounds. Let be the set at the end of some round. Furthermore, let and be the two corresponding sets in that round; thus . By the induction hypothesis, we have
Note that in the first round, that also satisfies the above inequality.
Due to Lemma 2.1(1), it holds that , where . Hence it holds that
where the second inequality uses the induction hypothesis. Since , we have
which proves the lemma. ∎
The next lemma says that the function value increases by at least in each round, which implies that the algorithm terminates in rounds.
Lemma 2.3.
Suppose that we run Simple() with and . At the end of each round, if the final set (at Line 7) satisfies , then .
Proof.
Suppose that the final set satisfies . This means that, in the last round, each item in is discarded because the marginal return is not large, which implies that by Lemma 2.1(2). As and , we have from submodularity that
Since , this proves the lemma. ∎
Theorem 2.4.
Let be an instance of the cardinalityconstrained problem (2). Suppose that . Then Simple() can compute a approximate solution in passes and space. The total running time is .
Proof.
While , the value increases by at least in each round by Lemma 2.3. Hence, after rounds, the current set satisfies . Since , the number of rounds is at most . As each round takes time, the total running time is . Since we only store the set , the space required is clearly . ∎
2.2 Algorithm with guessing the optimal value
We first note that , where . Hence, if we prepare , then we can guess such that . As the size of is equal to , if we run Simple for each element of , we need space and passes in the streaming setting. This, however, takes running time. We remark that, using an approximate solution obtained by a single-pass streaming algorithm [3], we can guess from the range between and , which leads to space and time, using passes. This proves the second part of Theorem 1.1.
Below we explain how to reduce the running time to by binary search.
Theorem 2.5.
We can find a approximate solution in passes and space, running in time.
Proof.
Here we describe an algorithm using Simple with a slight modification. Let be the minimum integer satisfying . It follows that .
We set and . Suppose that for some . Set , and take the midpoint . Perform Simple(), but stop the repetition after rounds.
Suppose that the output is of size . Then, if , we have by Lemma 2.2. Hence we may assume that . So we set and .
Suppose that the output is of size . It follows from Lemma 2.3 that, if , then after rounds. Hence, after rounds, we would have , a contradiction. Thus we are sure that . So we see that , and we set and .
We repeat the above binary search until the length of the interval is 1. As , the number of iterations is . Since each iteration takes passes, the algorithm takes passes in total. The running time is . Notice that there is no need to store the solutions obtained in each iteration; rather, just the function values and the corresponding indices are enough to find the best solution. Therefore, space suffices. The algorithm description is given in Algorithm 2. ∎
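The proof above searches the geometric guesses with a binary search: each trial run of (round-capped) Simple tells us whether the current guess is below or above the optimal value. The sketch below isolates that search; `guess_is_low(i)`, which abstracts the size/value case analysis of the proof, is assumed to be monotone in the guess index `i`, and the toy predicate is our own.

```python
def binary_search_guess(guess_is_low, lo, hi):
    """Return the largest index i in [lo, hi) with guess_is_low(i) True,
    assuming guess_is_low(lo) is True, guess_is_low(hi) is False,
    and the predicate is monotone."""
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if guess_is_low(mid):
            lo = mid   # guess mid is at most the optimum: search above
        else:
            hi = mid   # guess mid overshoots: search below
    return lo

# Toy check: find the largest i with (1 + eps)^i <= 100.
eps = 0.1
i_star = binary_search_guess(lambda i: (1 + eps) ** i <= 100, 0, 1000)
```

Each iteration halves the index interval, which is how the number of trial runs (and hence passes) drops from linear in the number of guesses to logarithmic.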
3 Simple Algorithm for the Knapsack-Constrained Problem
In the rest of the paper, let be an input instance of the problem (1). Let denote an optimal solution with . We denote for .
Similarly to Section 2, we suppose that we know in advance an approximate value of , i.e., . The value can be found with a single-pass streaming algorithm with a constant ratio [30] in time and space. Specifically, letting be the output of a single-pass approximation algorithm, we know that the optimal value is between and . We can guess by a geometric series in this range, and then the number of guesses is . Thus, if we design an algorithm running in time and space given the approximate value , then the total running time is and the space required is .
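The guessing step above can be sketched as follows: given the value `w` of a constant-ratio single-pass solution, the optimal value lies in an interval `[w, c*w]` for a ratio constant `c`, and we cover that interval by a geometric series. The constants `c = 3` and `eps = 0.5` in the toy run are our own illustrative choices.

```python
def guesses(w, c, eps):
    """Values w*(1+eps)^i covering [w, c*w]; roughly log_{1+eps}(c) guesses."""
    out = []
    v = w
    while v < c * w * (1 + eps):
        out.append(v)
        v *= 1 + eps
    return out

vs = guesses(w=10.0, c=3.0, eps=0.5)
# Any target value in [w, c*w] is within a (1+eps) factor of some guess.
```

Running the main algorithm once per guess and keeping the best output multiplies the time and space by the number of guesses, which matches the accounting in the paragraph above.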
3.1 Simple Algorithm
We first claim that the algorithm Simple of Section 2 can be adapted to the knapsack-constrained problem (1) as below (Algorithm 3). At Line 6, we pick an item when its marginal return per unit size exceeds the threshold . We stop the repetition when . Clearly, the algorithm terminates.
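The adaptation can be sketched in the same style as before: the only change from the cardinality sketch is that an item is now picked when its marginal return *per unit size* clears the threshold, and feasibility is tracked against the size budget. As with the earlier sketch, the threshold rule `tau = (alpha * v - f(S)) / budget`, the parameter names, and the modular toy instance are our own assumptions, not the paper's Algorithm 3 verbatim.

```python
def simple_knapsack(stream, f, size, budget, v, alpha, max_rounds):
    """Density-threshold sketch for the knapsack constraint:
    v approximates f(OPT), alpha is the target ratio."""
    S, used = set(), 0
    for _ in range(max_rounds):
        tau = (alpha * v - f(S)) / budget  # density threshold (assumed rule)
        if tau <= 0:
            break
        changed = False
        for e in stream():                 # one pass over the data
            if e in S or used + size[e] > budget:
                continue
            if (f(S | {e}) - f(S)) / size[e] >= tau:
                S.add(e)
                used += size[e]
                changed = True
        if not changed:
            break
    return S

# Toy run with a modular (hence submodular) objective, our own instance.
vals = {0: 3, 1: 2, 2: 4}
sizes = {0: 2, 1: 1, 2: 3}
f = lambda S: sum(vals[e] for e in S)
S = simple_knapsack(lambda: [0, 1, 2], f, sizes, budget=4, v=6, alpha=0.5,
                    max_rounds=10)
```

Here item 2 has the largest value but no longer fits once the two denser items are taken, illustrating why the analysis has to treat large items separately.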
Lemma 3.1.
During the execution of Simple in each round (in Lines 3–8), the following hold:

The current set always satisfies , where .

If an item fails the condition at Line 6, where is the set just before arrives, then the final set in the round satisfies .

In the end of each round, we have
Furthermore, similarly to the proof of Lemma 2.3, we see that the output has size more than .
Lemma 3.2.
Suppose that we run Simple() with and . At the end of the algorithm, it holds that .
Proof.
Suppose to the contrary that in the end. Then, in the last round, each item in is discarded because the marginal return is not large, which implies that by Lemma 3.1(2). As and , where is the initial set in the last round, we have
Since , we obtain , which proves the lemma. ∎
Thus, we obtain the following approximation ratio, which depends on the size of the largest item.
Lemma 3.3.
Let be an instance of the problem (1). Suppose that and . The algorithm Simple() can find in passes and space a set such that and
(3) 
The total running time is .
Proof.
3.2 Approximation: Ignoring Large Items
Let us remark that Simple also works for finding a set that approximates any subset . More precisely, given an instance of the problem (1), consider finding a feasible set to that approximates
() a subset such that and .
This means that and are the approximate values of and , respectively. Let with . Note that is not necessarily feasible to , i.e., (and thus ) may be larger than , but we assume that for any . Then Simple() can find an approximation of .
Corollary 3.4.
Suppose that we are given an instance for the problem (1) and satisfying the above condition () for some subset . Then Simple() can find a set in passes and space such that and
The total running time is .
In particular, Corollary 3.4 can be applied to approximate , with estimates of and .
Corollary 3.5.
Suppose that we are given an instance for the problem (1) such that and . We further suppose that we are given with and with . Then we can find a set in passes and space such that and
In particular, when , we have
(4) 
Proof.
We may assume that , as otherwise, by taking a singleton with maximum return , we have , implying that satisfies the inequality as . Moreover, it holds that and , and thus . Using this fact, we perform Simple() to approximate . Since the largest size in is , by Corollary 3.4, we can find a set such that and
Thus the first part of the lemma holds.
The above corollary, together with Lemma 3.3, yields a approximation.
Corollary 3.6.
Suppose that we are given an instance for the problem (1) with . Then we can find a approximate solution in passes and space. The total running time is .
Proof.
First suppose that . Then Lemma 3.3 with implies that we can find a set such that
Thus we may suppose that . We guess with by a geometric series on the interval , i.e., we find such that , using space. We may also suppose that , as otherwise we can just take a singleton with maximum return from . By Corollary 3.5 with and , we can find a set such that
Since , we have
Therefore, it holds that
This completes the proof. ∎
4 Approximation Algorithm
In this section, we present a approximation algorithm for the knapsack-constrained problem. In our algorithm, we assume that we know in advance approximations of and . That is, we are given such that and for . Define for . We call items in