The multi-armed bandit (MAB) problem has been widely used for practical applications. Examples include interactive recommender systems, Internet advertising, portfolio selection, and clinical trials. In a typical MAB problem, the agent selects one arm in each round. However, in practice, it is often more convenient to select more than one arm in each round. Such a problem is called a combinatorial bandit problem [Chen et al., 2013]. For example, Yue and Guestrin [2011] and Radlinski et al. [2008] considered the problem in which the agent proposes multiple news articles or web documents to a user in each round.
When recommending multiple items to a user, agents should select well-diversified items to maximize coverage of the information the user finds interesting [Yue and Guestrin, 2011] or to reduce item similarity in the list [Ziegler et al., 2005]. Recommending redundant items leads to diminishing returns in terms of utility [Yu et al., 2016]. It is well known that properties such as diversity or diminishing returns are well captured by submodular set functions [Krause and Golovin, 2014]. To simultaneously address diversified retrieval and online learning in a recommender system, Yue and Guestrin [2011] proposed a combinatorial bandit problem (or more specifically a semi-bandit problem), called the linear submodular bandit problem, where in each round a sequence of rewards is generated by an unknown submodular function.
For a real-world application, recommendation lists should satisfy several constraints. We explain this by using a news article recommendation example. For a comfortable user experience while selecting news articles from a recommendation list, the list should not be excessively long, which implies that the list should satisfy a cardinality constraint. Furthermore, a user may not wish to spend more than a certain amount of time reading news articles. This can be modeled as a knapsack constraint. With only a knapsack constraint, a system can recommend a long list of short (or low-cost) news articles. However, due to the space constraint of the web site, such a list cannot be displayed. Therefore, it is necessary to consider a submodular bandit problem under the intersection of the knapsack and cardinality constraints.
Yue and Guestrin [2011] introduced a submodular bandit problem under a cardinality constraint and proposed an algorithm called LSBGreedy. Later, Yu et al. [2016] considered a submodular bandit problem under a knapsack constraint and proposed two greedy algorithms called MCSGreedy and CGreedy. However, such existing algorithms fail to properly optimize the objective function under complex constraints. In fact, we theoretically and empirically show that such simple greedy algorithms can perform poorly.
Under a simple constraint such as a cardinality or a knapsack constraint, there is a simple rule to select elements. This rule is called the upper confidence bound (UCB) rule if the constraint is a cardinality constraint and the modified UCB rule if it is a knapsack constraint [Yu et al., 2016]. For example, with the UCB rule, the algorithm sequentially selects the element with the largest UCB in each round. Because our problem generalizes both problems, we need to generalize both rules.
In this study, we solve the problem under a more general constraint, i.e., the intersection of knapsacks and k-system constraints. Here, the k-system constraints form a very general class of constraints, including cardinality constraints and the intersection of matroid constraints. For example, when recommending news articles, we can restrict the number of news articles from each topic with a k-system constraint. To solve the problem, we propose a non-greedy algorithm that adaptively focuses on the UCB and modified UCB rules. Since the submodular maximization problem is NP-hard, we theoretically evaluate our method by an α-approximation regret, where α is an approximation ratio. In this study, we provide an upper bound of the α-approximation regret in the case when , where is a parameter of the algorithm. We note that the approximation ratio matches that of an offline algorithm [Badanidiyuru and Vondrák, 2014]. To the best of our knowledge, no known offline algorithm simultaneously achieves a better approximation ratio than the above and better computational complexity than the offline algorithm. (Footnote 1: After we submitted this paper to the conference, Li and Shroff updated their preprint. They proposed an offline submodular maximization algorithm and improved the approximation ratio of [Badanidiyuru and Vondrák, 2014] to .) More precisely, our contributions are stated as follows:
We propose a submodular bandit problem with semi-bandit feedback under the intersection of knapsacks and k-system constraints (Section 4). This is the first attempt to solve the submodular bandit problem under such complex constraints. The problem is new even when the k-system constraint is a cardinality constraint.
We propose a novel algorithm called AFSM-UCB that Adaptively Focuses on a Standard or Modified Upper Confidence Bound (Section 5).
We provide a high-probability upper bound of an approximation regret for AFSM-UCB (Section 6). We prove that the α-approximation regret is given by with probability at least and the computational complexity in each round is given as , where , is a parameter of the algorithm, is the time horizon, is the cardinality of a maximal feasible solution, and is the ground set (e.g., the set of all news articles in the news recommendation example). We note that no known fast offline algorithm achieves a better approximation ratio than the above (see Section 2.1 for the meaning of “fast”, and the earlier footnote on the updated preprint of Li and Shroff).
We empirically demonstrate the effectiveness of our proposed method by comprehensively evaluating it on one synthetic and two real-world datasets. We show that our proposed method outperforms the existing greedy baselines such as LSBGreedy and CGreedy.
2 Related Work
2.1 Submodular Maximization
Although submodular maximization has been studied for over four decades, we introduce only recent results relevant to our work. Badanidiyuru and Vondrák [2014] provided a maximization algorithm for a non-negative, monotone submodular function with knapsack constraints and a k-system constraint that achieves a -approximation solution. Based on this work and Gupta et al., Mirzasoleiman et al. [2016] proposed a maximization algorithm called FANTOM under the same constraint in the case when the objective function is not necessarily monotone. Our proposed method is inspired by these two offline algorithms. However, because of uncertainty due to semi-bandit feedback, we need a nontrivial modification. A key feature of our method and the two aforementioned offline algorithms is that they filter out “bad” elements via a threshold. Such a threshold method is also used for other problem settings such as streaming submodular maximization under a cardinality constraint [Badanidiyuru et al., 2014]. Some algorithms [Sarpatwar et al., 2019, Chekuri et al., 2010, 2014] achieve better approximation ratios than that of [Badanidiyuru and Vondrák, 2014] under narrower classes of constraints (e.g., a matroid + knapsacks). However, these algorithms are not “fast” because their computational complexity is with a polynomial of high degree, while that of [Badanidiyuru and Vondrák, 2014] is . For example, the computational complexity of the algorithm provided in [Sarpatwar et al., 2019] is when . We refer to [Sarpatwar et al., 2019, Mirzasoleiman et al., 2016] for further comparison with respect to approximation ratio and computational complexity.
2.2 Submodular Bandit Problems
Yue and Guestrin [2011] introduced the linear submodular bandit problem to solve a diversification problem in a retrieval system and proposed a greedy algorithm called LSBGreedy. Later, Yu et al. [2016] considered a variant of the problem, that is, the linear submodular bandit problem with a knapsack constraint, and proposed two greedy algorithms called MCSGreedy and CGreedy. Chen et al. [2017] generalized the linear submodular bandit problem to an infinite-dimensional case, i.e., the case where the marginal gain of the score function belongs to a reproducing kernel Hilbert space (RKHS) and has a bounded norm in the space. They then proposed a greedy algorithm called SM-UCB. Recently, Hiranandani et al. [2019] studied a model combining linear submodular bandits with a cascading model [Craswell et al., 2008]. Strictly speaking, their objective function is not a submodular function. Table 1 shows a comparison with other submodular bandit problems with respect to constraints.
In this section, we provide definitions of terminology used in this paper. Throughout this paper, we fix a finite set called a ground set that represents the set of all news articles in the news article recommendation example.
3.1 Submodular Function
In this subsection, we define submodular functions. We refer to [Krause and Golovin, 2014] for an introduction to this subject.
We denote by the set of subsets of . For and , we write . Let be a set function. We call a submodular function if satisfies for any with and for any . Here, is the marginal gain when is added to and defined as . We note that a linear combination of submodular functions with non-negative coefficients is also submodular. A submodular function on is called monotone if for any with . A set function on is called non-negative if for any . Although non-monotone submodular functions have important applications [Mirzasoleiman et al., 2016], we consider only non-negative, monotone submodular functions in this study as in the preceding work [Yue and Guestrin, 2011, Yu et al., 2016, Chen et al., 2017].
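The definitions above can be checked by brute force on a small ground set. The following is a minimal sketch, not part of the original method: the toy data `TOPICS`, the coverage function, and all helper names are assumptions introduced only for illustration. It verifies that marginal gains are non-negative (monotonicity) and do not increase as the base set grows (submodularity).

```python
from itertools import combinations

# Hypothetical toy data: each item covers a set of topics.
TOPICS = {"a": {1, 2}, "b": {2, 3}, "c": {3}}


def coverage(items):
    """f(S): number of distinct topics covered by the items in S."""
    covered = set()
    for e in items:
        covered |= TOPICS[e]
    return len(covered)


def marginal_gain(f, e, items):
    """Gain of adding e to S, i.e. f(S + e) - f(S)."""
    base = set(items)
    return f(base | {e}) - f(base)


def subsets(ground):
    """Enumerate all subsets of the ground set (exponential; toy sizes only)."""
    ground = list(ground)
    for r in range(len(ground) + 1):
        for combo in combinations(ground, r):
            yield set(combo)


def is_monotone_submodular(f, ground):
    """Check that gains are non-negative and shrink as the base set grows."""
    all_sets = list(subsets(ground))
    for small in all_sets:
        for large in all_sets:
            if not small <= large:
                continue
            for e in set(ground) - large:
                gain_small = marginal_gain(f, e, small)
                gain_large = marginal_gain(f, e, large)
                if gain_large < 0 or gain_small < gain_large:
                    return False
    return True
```

For instance, `is_monotone_submodular(coverage, TOPICS)` holds for the coverage function, whereas the set function `lambda S: len(S) ** 2` fails the check because its marginal gains grow with the base set.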
3.2 Matroid, k-System, and Knapsack Constraints
For succinctness, we omit formal definitions of the matroid and k-system. Instead, we introduce examples of matroids and remark that the intersection of k matroids is a k-system. For definitions of these notions, we refer to [Calinescu et al., 2011].
First, we provide an important example of a matroid. Let () be a partition of , that is, is the disjoint union of these subsets. For , we fix a non-negative integer and let . Then, the pair is an example of a matroid and is called a partition matroid. Let be a non-negative integer and put . Then is a special case of partition matroids and is called a uniform matroid. Let for be matroids, where . The intersection of k matroids is not necessarily a matroid but a k-system (or more specifically it is a k-extendible system) [Calinescu et al., 2011, Mestre, 2006, 2015]. In particular, any matroid is a 1-system. For a k-system with and a subset , we say that satisfies the k-system constraint if and only if . Trivially, a uniform matroid constraint is equivalent to a cardinality constraint.
Next, we provide the definition of a knapsack constraint. Let be a function. For , we suppose represents the cost of . Let be a budget and a subset. We say that satisfies the knapsack constraint with the budget if . Without loss of generality, it is sufficient to consider the unit budget case, i.e., .
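To make the constraint classes concrete, the following minimal sketch checks feasibility under the intersection of a partition matroid, a cardinality (uniform matroid) constraint, and several unit-budget knapsack constraints. All function and parameter names (`part_of`, `limits`, `costs`, `max_size`) are hypothetical and chosen only for this illustration.

```python
from collections import Counter


def satisfies_partition_matroid(items, part_of, limits):
    """At most limits[p] items may be chosen from each part p of the partition."""
    counts = Counter(part_of[e] for e in items)
    return all(counts[p] <= limits[p] for p in counts)


def satisfies_knapsacks(items, costs, budget=1.0):
    """Every cost function in `costs` must stay within the (unit) budget."""
    return all(sum(c[e] for e in items) <= budget for c in costs)


def is_feasible(items, part_of, limits, costs, max_size):
    """Intersection of a cardinality, a partition matroid, and knapsack constraints."""
    return (
        len(items) <= max_size
        and satisfies_partition_matroid(items, part_of, limits)
        and satisfies_knapsacks(items, costs)
    )
```

In the news example, `part_of` would map each article to its topic and a single cost dictionary would hold normalized reading times, so a list is feasible only if it is short enough, respects the per-topic limits, and fits the time budget.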
4 Problem Formulation
Throughout this paper, we consider the following intersection of knapsacks and k-system constraints:
Here for , is a cost and is a k-system.
In this study, we consider the following sequential decision-making process for time steps .
(i) The algorithm selects a list satisfying the constraints (1).
(ii) The algorithm receives noisy rewards as follows:
Here is a submodular function unknown to the algorithm, and is a noise term. We regard , and as random variables. The objective of the algorithm is to maximize the sum of rewards.
Following [Yue and Guestrin, 2011], we explain this problem by using the news article recommendation example. In each round, the user scans the list of the recommended items one-by-one in top-down fashion, where is the cardinality of at round . We assume that the marginal gain represents the new information covered by and not covered by . The noisy rewards are binary random variables and the user likes with probability .
4.1 Assumptions on the Score Function
Following [Yue and Guestrin, 2011], we assume that there exist known submodular functions on that are linearly independent and the objective submodular function can be written as a linear combination , where the coefficients are non-negative and unknown to the algorithm. We fix a parameter and assume that . We also assume that for some , the -norm of the vector is bounded above by for any and .
We note that this can be generalized to an infinite-dimensional case as in [Chen et al., 2017]. We discuss this setting in more detail in the supplemental material and provide a theoretical result in this setting.
4.2 Assumptions on Noise Stochastic Process
We assume that there exists such that for all and consider the lexicographic order on the set , i.e., if and only if either or and . Then, we can identify the set with the set of natural numbers (as ordered sets) and can regard as a sequence. We assume that the stochastic process is conditionally -sub-Gaussian for a fixed constant , i.e., for any and any . Here, is the σ-algebra generated by and . This is a standard assumption on the noise sequence [Chowdhury and Gopalan, 2017, Abbasi-Yadkori et al., 2011]. For example, if is a martingale difference sequence and or is conditionally Gaussian with zero mean and variance , then the condition is satisfied [Lattimore and Szepesvári, 2019].
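For reference, the conditional sub-Gaussian assumption is usually written as below. This is a sketch in standard notation rather than a verbatim restoration: the symbols R (sub-Gaussian constant), ξ (free real parameter), ε_n (noise sequence), and F_{n-1} (the σ-algebra of the past) are assumed names and may differ from the original ones.

```latex
\mathbb{E}\left[ \exp(\xi\, \varepsilon_n) \,\middle|\, \mathcal{F}_{n-1} \right]
  \le \exp\!\left( \frac{\xi^{2} R^{2}}{2} \right)
  \qquad \text{for all } \xi \in \mathbb{R} \text{ and all } n \ge 1 .
```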
4.3 Approximation Regret
As usual in the combinatorial bandit problem, we evaluate bandit algorithms by a regret called α-approximation regret (or α-regret for short), where . The approximation regret is necessary for meaningful evaluation. Even if the submodular function is completely known, it has been proved that no algorithm can achieve the optimal solution by evaluating in polynomial time [Nemhauser and Wolsey, 1978].
We denote by the optimal solution, i.e., where runs over satisfying the constraint (1). We define the α-regret as follows:
This definition is slightly different from that given in [Yue and Guestrin, 2011] because our definition does not include noise as in [Chowdhury and Gopalan, 2017]. In either case, one can prove a similar upper bound. For the proof in the cardinality constraint case, we refer to Lemma 4 in the supplemental material of [Yue and Guestrin, 2011].
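A noise-free α-regret of the kind described above is commonly written as follows. This is a hedged sketch in assumed notation (Reg_α, OPT, S_t, and T are names chosen for this illustration), consistent with the statements that OPT maximizes the score function over lists satisfying the constraints (1) and that the definition does not include noise.

```latex
\mathrm{Reg}_{\alpha}(T)
  = \sum_{t=1}^{T} \bigl( \alpha\, f(\mathrm{OPT}) - f(S_t) \bigr),
\qquad
\mathrm{OPT} \in \operatorname*{arg\,max}_{S \text{ satisfying (1)}} f(S).
```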
In this study, we take the same approximation ratio as that of a fast algorithm in the offline setting [Badanidiyuru and Vondrák, 2014, Theorem 6.1]. As mentioned in Section 2, there exist offline algorithms that achieve better approximation ratios than above, but they have high computational complexity. Later, we remark that our proposed method is also “fast”.
In this section, following [Yue and Guestrin, 2011, Yu et al., 2016], we first define a UCB score of the marginal gain and introduce a modified UCB score. With a UCB score, one can balance the exploitation and exploration tradeoff with bandit feedback. Then, we propose a non-greedy algorithm (Algorithm 2) that adaptively focuses on the UCB score and modified UCB score.
5.1 UCB Scores
For and , we define a column vector by and put . Here, we use the same notation as in Section 4. We define and as follows:
Here, is a parameter of the model and for a column vector , we denote by the Kronecker product of and .
For and , we define and Then, we define a UCB score of the marginal gain by
and a modified UCB score by . Here,
and . It is well-known that is an upper confidence bound for . More precisely, we have the following result.
We assume there exists such that for all . We also assume that . Then, with probability at least , the following inequality holds:
for any and .
Proposition 1 follows from the proof of [Chowdhury and Gopalan, 2017, Theorem 2]. We note that this theorem is a more generalized result than the statement above (they do not assume that the objective function is linear but belongs to an RKHS). In the linear kernel case, an equivalent result to Proposition 1 was proved in [Abbasi-Yadkori et al., 2011].
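As a rough illustration of how such scores can be computed in the linear case, the following sketch uses regularized least squares in the style of [Abbasi-Yadkori et al., 2011]. It is an assumption-laden simplification, not the paper's exact definition: the feature vector `x` stands for the vector of marginal gains of the known submodular functions, `lam` and `beta` are hypothetical regularization and confidence-width parameters, and dividing the UCB score by a single total cost to obtain the modified score is an assumption.

```python
import math


def solve(A, b):
    """Solve the linear system A z = b by Gauss-Jordan elimination."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [M[i][n] for i in range(n)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def ucb_scores(history, x, total_cost, lam=1.0, beta=1.0):
    """Return (mean, width, ucb, modified_ucb) for a candidate feature vector x.

    history: list of (feature_vector, observed_reward) pairs seen so far.
    """
    d = len(x)
    # Regularized Gram matrix A = lam * I + sum of phi phi^T, and b = sum of y * phi.
    A = [[lam if i == j else 0.0 for j in range(d)] for i in range(d)]
    b = [0.0] * d
    for phi, y in history:
        for i in range(d):
            b[i] += phi[i] * y
            for j in range(d):
                A[i][j] += phi[i] * phi[j]
    theta_hat = solve(A, b)                            # ridge estimate of the weights
    mean = dot(theta_hat, x)                           # estimated marginal gain
    width = math.sqrt(max(dot(x, solve(A, x)), 0.0))   # uncertainty sqrt(x^T A^{-1} x)
    ucb = mean + beta * width
    return mean, width, ucb, ucb / total_cost
```

With an empty history the mean is zero and the width equals the Euclidean norm of `x` when `lam = 1`; every observation in the direction of `x` shrinks the width, which is how exploration fades as feedback accumulates.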
We also define the UCB score for a list by Here and are defined as and respectively. The factor in the definition of is due to a technical reason as clarified by the proof of Lemma in the supplemental material.
In this subsection, we propose a UCB-type algorithm for our problem. We call our proposed method AFSM-UCB; its pseudocode is outlined in Algorithm 2. Algorithm 2 calls a sub-algorithm called GM-UCB (an algorithm that Greedily selects elements with Modified UCB scores larger than a threshold, outlined in Algorithm 1). Algorithm 1 takes a threshold as a parameter and returns a list of elements satisfying the constraints (1). Algorithm 1 selects elements greedily from the elements whose modified UCB scores and are larger than or equal to the threshold . If the threshold is small, then this algorithm is almost the same as a greedy algorithm, such as LSBGreedy [Yue and Guestrin, 2011]. If the threshold is large, then the elements with large modified UCB scores will be selected. Thus, the threshold controls the relative importance of the standard and modified scores. The main algorithm (Algorithm 2) calls Algorithm 1 repeatedly while changing the threshold and returns a list with the largest UCB score. We prove that there exists a good list among these candidate lists.
As remarked before, Algorithm 2 is inspired by submodular maximization algorithms in the offline setting [Badanidiyuru and Vondrák, 2014, Mirzasoleiman et al., 2016]. However, we need a nontrivial modification since the diminishing return property does not hold for the UCB score , unlike the marginal gain . We note that can be large not only when the estimated value of is large but also if the uncertainty in adding to is high. Therefore, we need additional filter conditions to ensure that is a “good” element. Natural candidates for the condition are that for some indices . In Algorithm 2, we require in addition to .
In the algorithm, we introduce parameters and . The parameter (resp. ) is used for defining the initial (resp. terminal) value of the threshold . In the next section, for a theoretical guarantee, we assume that If the upper bound of the reward is known, then we can take as the known upper bound. In practice, it is plausible that most users are interested in at least one item in the entire item set , which implies is not very small. In addition, the number of iterations in the while loop in Algorithm 2 is given by . Therefore, taking a very small does not increase the number of iterations by much.
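The structure described in this subsection can be sketched as follows. This is a simplified illustration under stated assumptions, not a transcription of Algorithms 1 and 2: `score(e, chosen)` is a hypothetical callable standing in for the UCB score of adding `e` to the current list, a single knapsack with unit budget is used, the extra filter is evaluated on the empty list, and the geometric threshold schedule with ratio `1 + eps` between `rho_max` and `rho_min` is assumed.

```python
def gm_threshold_greedy(ground, score, cost, is_independent, rho, budget=1.0):
    """Greedy selection restricted to elements whose score density is >= rho."""
    chosen = []
    while True:
        spent = sum(cost(e) for e in chosen)
        candidates = [
            e for e in ground
            if e not in chosen
            and is_independent(chosen + [e])        # k-system constraint
            and spent + cost(e) <= budget           # knapsack constraint
            and score(e, chosen) >= rho * cost(e)   # density given the current list
            and score(e, []) >= rho * cost(e)       # extra filter (assumed form)
        ]
        if not candidates:
            return chosen
        chosen.append(max(candidates, key=lambda e: score(e, chosen)))


def adaptive_threshold_search(ground, score, list_score, cost, is_independent,
                              rho_max, rho_min, eps, budget=1.0):
    """Run the threshold greedy on a geometric grid and keep the best list."""
    best, best_value = [], float("-inf")
    rho = rho_max
    while rho >= rho_min:
        candidate = gm_threshold_greedy(ground, score, cost, is_independent,
                                        rho, budget)
        value = list_score(candidate)
        if value > best_value:
            best, best_value = candidate, value
        rho /= 1.0 + eps
    return best


# Hypothetical toy instance: a modular score stands in for the UCB score.
VALUE = {"a": 10.0, "b": 6.0, "c": 6.0}
COST = {"a": 0.9, "b": 0.5, "c": 0.5}
toy_score = lambda e, chosen: VALUE[e]
toy_list_score = lambda chosen: sum(VALUE[e] for e in chosen)
toy_cost = lambda e: COST[e]
at_most_two = lambda items: len(items) <= 2

result = adaptive_threshold_search(list(VALUE), toy_score, toy_list_score,
                                   toy_cost, at_most_two,
                                   rho_max=12.0, rho_min=1.0, eps=0.5)
```

On this toy instance a purely greedy run (threshold zero) picks the expensive item `a` first and then cannot add anything else, for a total value of 10, whereas the threshold grid also tries a high-density run that returns `b` and `c` with total value 12. This is the kind of situation in which focusing on the modified score pays off.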
5.3 Computational Complexity
We discuss the computational complexity of Algorithm 2 and that of existing methods. We consider a greedy algorithm obtained by applying LSBGreedy to our problem; i.e., we consider a greedy algorithm that selects the element with the largest UCB score until no further element can be added without violating the constraints. By abuse of terminology, we call this algorithm LSBGreedy. Similarly, when we apply CGreedy (resp. MCSGreedy) to our problem, we also call this algorithm CGreedy (resp. MCSGreedy). In each round, the expected number of times to compute in Algorithm 2 is given by , while that of LSBGreedy is given by . The computational complexity of MCSGreedy and CGreedy is given as and respectively. Therefore, ignoring unimportant parameters, our algorithm incurs an additional factor compared to that of LSBGreedy and CGreedy.
6 Main Results
The main challenge of this paper is to provide a strong theoretical result for AFSM-UCB. In this section, under the assumptions stated in the previous sections, we provide an upper bound for the approximation regret of AFSM-UCB and give a sketch of the proof. We also show that existing greedy methods incur linear approximation regret in the worst case for our problem.
6.1 Statement of the Main Results
Let the notation and assumptions be as previously mentioned. We also assume that . We let . Then, with probability at least , the proposed algorithm achieves the following α-regret bound:
In particular, ignoring , we have with probability at least .
There is a tradeoff between the approximation ratio and computational complexity. As discussed in Section 5.3, the computational complexity of the algorithm is given as in each round, while the approximation ratio of the algorithm is given as .
We assume the score function is a linear combination of known submodular functions. We can relax the assumption to the case when the function belongs to an RKHS and has a bounded norm in the space as in [Chen et al., 2017]. We discuss this setting in more detail and provide a generalized result in the supplemental material.
In the setting of [Yue and Guestrin, 2011, Yu et al., 2016], greedy methods have good theoretical properties. However, we show that for any , these greedy methods incur linear α-regret in the worst case for our problem. We denote by and the α-regret of MCSGreedy and that of LSBGreedy, respectively. Then the following proposition holds.
For any , there exists cost , -system , a submodular function , and a constant such that with probability at least ,
for any . Moreover, the same statement holds for .
We provide the proof in the supplemental material.
6.2 Sketch of the Proof of Theorem 1
We provide a sketch of the proof of Theorem 1 and provide a detailed and generalized proof in the supplemental material. Throughout the proof, we fix the event on which the inequality in Proposition 1 holds.
We evaluate the solution by AFSM-UCB in each round . The following is a key result for our proof of Theorem 1.
Let be any set satisfying the constraint (1). Let be a set returned by GM-UCB at time step . Then, on the event , we have
Sketch of proof.
This can be proved in a similar way to the proof of [Badanidiyuru and Vondrák, 2014, Theorem 6.1] or [Mirzasoleiman et al., 2016, Theorem 5.1]. However, because of the uncertainty and the lack of the diminishing return property of the UCB score, we need further analysis. We divide the proof into two cases.
Case One. This is the case when GM-UCB terminates because there exists an element such that and , but any element satisfying does not satisfy the knapsack constraints, i.e., for some . We fix an element satisfying . Because any element of has enough modified UCB score, by Proposition 1, we have By the definition of , we also have . Because and does not satisfy the knapsack constraint, we have
Case Two. This is the case when GM-UCB terminates because for any element satisfying , satisfies the knapsack constraints but does not satisfy the k-system constraint. We note that this case includes the case when there does not exist an element satisfying .
We define a set as
and . Let . Then on the event , by Proposition 1 and submodularity, we have
Next, we consider . Running the greedy algorithm (with respect to the UCB score) on under only the k-system constraint, we obtain by the assumption of this case. Then, it can be proved that We note that this is a variant of the result proved in [Calinescu et al., 2011, Appendix B]. By this inequality, inequality (2), and submodularity, we can derive the desired result. ∎
Using Proposition 3, we can bound the approximation regret above by the sum of uncertainty . Because the algorithm selects and obtains feedback for , the sum of uncertainty can be bounded above by a sub-linear function of .
7 Experimental Analysis
In this section, we empirically evaluate our method on a synthetic dataset that simulates an environment for news article recommendation and on two real-world datasets (MovieLens100K [Grouplens, 1998] and the Million Song Dataset [Bertin-Mahieux et al., 2011]).
We compare our proposed algorithm to the following baselines:
RANDOM. In each round, this algorithm selects elements uniformly at random until no element satisfies the constraints.
CGreedy. This is an algorithm for a submodular bandit problem under a knapsack constraint and was proposed in [Yu et al., 2016]. They also proposed an algorithm called MCSGreedy. However, because MCSGreedy is computationally expensive (in each round it calls functions for times) and their experimental results show that both algorithms have similar empirical performance, we do not add MCSGreedy to the baselines.
In Proposition 2, we showed that these greedy algorithms incur linear approximation regret in the worst case. However, even without a theoretical guarantee, it is empirically known that a greedy algorithm often achieves good experimental performance. In this section, we demonstrate that our algorithm outperforms these greedy algorithms under various combinations of constraints. As special cases, such constraints include the case when there is a sufficiently large budget for the knapsack constraints and the case when the k-system constraint is sufficiently mild. The greedy algorithms are designed for such cases. We also show that our proposed method performs no worse than the baselines even in these cases.
As in the preceding work [Yue and Guestrin, 2011], we assume the score function is a linear combination of known probabilistic coverage functions. We assume there exists a set of topics (or genres) with and for each item , there is a feature vector that represents the information coverage on different genres. For each genre , we define the probabilistic coverage function by and we assume with unknown linear coefficients . The vector represents user preference on genres. We assume that the noisy rewards are sampled by . Below, we define these feature vectors , , and constraints explicitly. We note that in the experiments, we use an un-normalized knapsack constraint . In the following experiments, using 100 users (100 vectors ), we compute cumulative average rewards for each algorithm. When taking the average, we repeated this experiment 10 times for each user.
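The experimental reward model can be sketched as follows. This is a minimal illustration under assumptions: the probabilistic coverage function is taken in its usual form from [Yue and Guestrin, 2011], rewards are drawn as Bernoulli variables whose success probability is the marginal gain (matching the binary feedback described in Section 4), and the names `features`, `weights`, and `rng` are hypothetical. The weights are assumed to be scaled so that every marginal gain lies in [0, 1].

```python
import random


def probabilistic_coverage(items, features, topic):
    """f_j(S) = 1 - prod over e in S of (1 - x_e[j])."""
    uncovered = 1.0
    for e in items:
        uncovered *= 1.0 - features[e][topic]
    return 1.0 - uncovered


def score(items, features, weights):
    """f(S) = sum_j w_j f_j(S), a non-negative combination of coverage functions."""
    return sum(w * probabilistic_coverage(items, features, j)
               for j, w in enumerate(weights))


def marginal_gain(e, items, features, weights):
    """New information contributed by e beyond the items already listed."""
    return score(list(items) + [e], features, weights) - score(items, features, weights)


def sample_rewards(ordered_list, features, weights, rng):
    """Semi-bandit feedback: one binary reward per position of the list."""
    rewards = []
    for i, e in enumerate(ordered_list):
        p = marginal_gain(e, ordered_list[:i], features, weights)
        rewards.append(1 if rng.random() < p else 0)
    return rewards
```

Because the marginal gain of an item shrinks when earlier items already cover its genres, a redundant item placed late in the list is liked with lower probability, which is the diversification effect the experiments rely on.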
7.1 News Article Recommendation
In this synthetic dataset, we assume and . We define and costs for a knapsack constraint in a manner similar to that of [Yu et al., 2016]. We sample each entry of from two types of uniform distributions. We assume that for each item, the number of genres that have high information coverage is limited to two. More precisely, we randomly select two indices of and sample entries from and sample other entries from . We generate 100 user preference vectors in a similar way to . We also sample the costs of items uniformly at random from . In this dataset, we consider the intersection of a cardinality constraint and a knapsack constraint. The result is shown in Figure 1.
7.2 Movie Recommendation
We perform an experiment similar to that in [Mirzasoleiman et al., 2016] but with semi-bandit feedback. In MovieLens100K, there are 943 users and 1682 movies. We take as the set of 1682 movies in the dataset. There are genres in this dataset. First, we fill in the ratings for all the user-item pairs using matrix factorization [Koren et al., 2009] and normalize the ratings so that . For each movie , we denote by the mean of the ratings of the movie for all users. We define if , otherwise we define . We normalize as previously mentioned, because if for all , then we have .
We define knapsack, cardinality, and matroid constraints similar to those of [Mirzasoleiman et al., 2016]. For , the cost is defined as , where is the cumulative distribution function of the beta distribution. For a budget , we consider a knapsack constraint . The beta distribution lets us differentiate the highly rated movies from those with lower ratings [Mirzasoleiman et al., 2016]. We generate 100 user preference vectors in a similar way to the news article recommendation example. In this dataset, we consider the following constraints on genres in addition to the knapsack and cardinality constraints. There are genres in MovieLens100K, where . For each genre , we fix a non-negative integer and consider the constraint for . This can be regarded as a partition matroid constraint. Therefore, the intersection of the constraints for all genres is a -system constraint. One can prove that the intersection of this -system constraint and a cardinality constraint is also a -system constraint. The results are displayed in Figure 2 in the case of the matroid limit .
7.3 Music Recommendation
From the Million Song Dataset, we select the 1000 most popular songs and the 30 most popular genres. Thus, we have and . For 100 active users, we compute and user preference vector in almost the same way as and in [Hiranandani et al., 2019] respectively. They assume that a user likes a song if the user listened to the song at least five times; however, we assume that a user likes the song if the user listened to the song at least two times. We consider the intersection of a cardinality and a knapsack constraint . We define a cost for the knapsack constraint by the length (in seconds) of the song in the dataset. The costs represent the length of time spent by users before they decide to listen to the song, and we assume that it is proportional to the length of the song (we can also assume that users listen to the song and give feedback later). The results are displayed in Figure 3. We do not show the performance of RANDOM in the figure since it achieves only very low rewards.
In Figures 1a, 2a, and 3a, we plot the cumulative average rewards for each algorithm up to time step . In Figures 1b, 2b, and 3b (resp. 1c, 2c, and 3c), we show the cumulative average rewards at the final round by changing the budget (resp. by changing the cardinality limit ) and fixing the cardinality limit (resp. fixing the budget ). These results show that overall our proposed method outperforms the baselines. We note that Figure 3 shows a different tendency compared to the other datasets since popular items in the Million Song Dataset have high information coverage for multiple genres and about 47 of the items have low information coverage (less than 0.01) for all genres. Figures 1b, 2b, and 3b also show the results for the case when the budget is sufficiently large. This is the case when LSBGreedy performs well, and our experimental results show that even in this case, our method has comparable performance to greedy algorithms. Moreover, Figures 1c, 2c, and 3c also show the results in the case when the cardinality constraints are sufficiently mild. In this case, CGreedy performs well since the constraints are almost the same as a knapsack constraint. The experimental results show that our method tends to have better performance than that of CGreedy even in this case.
In this study, motivated by diversified retrieval that takes the cost of items into account, we introduced the submodular bandit problem under the intersection of a k-system and knapsack constraints. Then, we proposed a non-greedy algorithm to solve the problem and provided a strong theoretical guarantee. We demonstrated that our proposed method outperforms the greedy baselines using one synthetic and two real-world datasets.
A possible generalization of this work is an extension to the full bandit setting. In this setting, a learner observes only a value in each round. Since deriving a theoretical guarantee requires substantial further work, we leave this setting for future work.
In this appendix, we generalize the reward model considered in the main article to the kernelized setting as in [Chen et al., 2017]. We also provide parameters used in the experiments.
9 Problem Formulation Under a Generalized Reward Model
In this appendix, we consider the same reward model as in [Chen et al., 2017], but subject to the intersection of knapsacks and k-system constraints as in the main article. SM-UCB [Chen et al., 2017] is based on CGP-UCB [Krause and Ong, 2011] or GP-UCB [Srinivas et al., 2010]. However, recently, [Chowdhury and Gopalan, 2017] improved the assumption and the regret analysis of GP-UCB [Srinivas et al., 2010]. We follow the setting of [Chowdhury and Gopalan, 2017].
Let be a compact subset of some Euclidean space, which represents the space of contexts. We consider the following sequential decision-making process for time steps .
The algorithm observes context and selects a list satisfying the constraints.
The algorithm receives noisy rewards as follows:
for . Here is a non-negative, monotone submodular function unknown to the algorithm, and is a noise.
Here, we regard , and as random variables.
9.1 Assumptions Regarding the Score Function
The linear model considered in the main article can be generalized to an infinite-dimensional case [Chen et al., 2017]. We let and define by . We assume that there exists an RKHS (reproducing kernel Hilbert space) on with a positive definite kernel and belongs to and the norm is bounded by . We also assume that for any . If , then our α-regret would increase by a factor of .
9.2 Assumptions Regarding Noise Stochastic Process
As for noises, we consider the same assumption as in the main article.
10 Definition of UCB Scores
We let and , where . We also define as . For a sequence , we define and . We also let and . Then, we define and as follows:
Here, is a parameter of the model. If , we also write and .
We define a UCB score of the marginal gain by
and a modified UCB score by . Here, is defined as and . Here is the maximum information gain [Srinivas et al., 2010, Chowdhury and Gopalan, 2017] after observing rewards. We refer to [Chowdhury and Gopalan, 2017] for the definition.
We assume there exists such that for all . We also assume that . Then, with probability at least , the following inequality holds:
for any and .
11 Statement of the Main Theorem
With generalized UCB scores, we consider the same algorithm as in the main article. Then, we provide a statement for the generalized version of Theorem 1.
Let the notation and assumptions be as previously mentioned. We also assume that . We let , and define the α-regret as
where is a feasible optimal solution at round . Then, with probability at least , the proposed algorithm achieves the following α-regret bound:
In particular, with probability at least , the α-regret is given as
The maximum information gain is and if the kernel is a -dimensional linear and Squared Exponential kernel, respectively [Srinivas et al., 2010]. They also showed a similar result for the Matérn kernel. Thus, if the kernel is a -dimensional linear kernel, up to a polylogarithmic factor, we obtain Theorem 1 in the main article as a corollary.
12 Proof of the Main Theorem
12.1 Greedy Algorithm Under a k-System Constraint
In this subsection, we fix time step and context and consider the greedy algorithm under only a k-system constraint as shown in Algorithm 3. Here, we drop from notation. We denote by the k-system.
Let be a set returned by Algorithm 3. Then for any feasible set , on the event , the following inequality holds: