1 Introduction
The multi-armed bandit (MAB) problem is one of the most elementary problems in decision making under uncertainty. Under this setting, the agent must choose an action (or arm) on each round, out of a fixed set of possible actions. It then observes the chosen arm's reward, which is generated from some fixed unknown distribution, and aims to maximize the average cumulative reward (robbins1952some). Equivalently, the agent minimizes its expected cumulative regret, i.e., the difference between the best achievable reward and the agent's cumulative reward. This framework enables us to understand and control the tradeoff between information gathering ('exploration') and reward maximization ('exploitation'), and many current reinforcement learning algorithms build on exploration concepts that originate from MAB
(jaksch2010near; bellemare2016unifying; osband2013more; gopalan2015thompson). One of the active research directions in MAB consists in extending the model to support more complicated feedback from the environment and more complex reward functions. An important extension in this direction is the combinatorial multi-armed bandit (CMAB) problem with semi-bandit feedback (chen2016combinatorialA). Instead of choosing a single arm, the agent selects a subset of the arms (a 'batch'), and observes feedback from each of the chosen arms ('semi-bandit feedback'). The reward can be a general function of the arms' expectations, with the linear function, i.e., the sum of the chosen arms' expectations, as the most common example.
Another common case that falls under the CMAB framework, and will benefit from the results of this paper, is the bandit version of the probabilistic maximum coverage (PMC) problem. Under this setting, each arm is a random set that may contain some subset of the possible items, according to a fixed probability distribution. On each round, the agent chooses a batch of sets and aims to maximize the number of items that appear in any of the sets (i.e., the size of the union of the sets). In the bandit setting, we assume that the probabilities that items appear in a set are unknown, and aim to maximize the item coverage while concurrently learning the probabilities. Variants of the PMC bandit problem have many practical applications, such as influence maximization (vaswani2015influence), ranked recommendations (kveton2015cascading), wireless channel monitoring (arora2011sequential), online advertisement placement (chen2016combinatorialA) and more. Although existing algorithms offer solutions to the CMAB framework under very general assumptions (chen2016combinatorialA; chen2016combinatorialB)
, there are several issues that still pose a major challenge in the design and analysis of practical algorithms. Notably, most of the existing algorithms quantify the nonlinearity of the reward function using a global Lipschitz constant. However, large gradients do not necessarily translate to a large regret; rather, the regret is governed by the combined influence of the gradient size and the local parameter uncertainty. More specifically, if there are regions in which tight parameter estimates can be derived, the regret will not be large even if the gradients are large. Conversely, in regions where the parameter uncertainty is large, smaller gradients can still cause a large regret. For example, consider a reward function whose gradient is large only near the edge of the parameter domain, and its translated version, whose gradient is large in the middle of the domain. Despite the fact that both functions have the same Lipschitz constant, the regret in the first problem can be much smaller, since parameters are easier to estimate when they are close to the edge of their domain. Another similar example is the PMC bandit problem, in which the reward is approximately linear when the coverage probabilities are small, but its marginal gains decline exponentially when they are large. Thus, a naïve Lipschitz bound cannot capture the interaction between the gradient size and the parameter uncertainty, which results in loose regret bounds. In this paper, we aim to utilize this principle for the CMAB framework. To this end, we introduce a new smoothness criterion called Gini-weighted smoothness
. This criterion takes into account both the nonlinearity of the reward function and the concentration properties of bounded random variables around the edges of their domain. We then suggest an upper confidence bound (UCB) based strategy
(auer2002finite), but replace the classical Hoeffding-type confidence intervals with ones that depend on the empirical variance of the arms, and are based on the empirical Bernstein inequality
(audibert2009exploration). We show that Bernstein-type bounds capture properties similar to the Gini-weighted smoothness, and thus allow us to derive tighter regret bounds. Notably, the linear dependence of the regret bound on the batch size is almost completely removed, except for a logarithmic factor, and the batch size affects the regret only through the Gini-smoothness parameter. In problems in which this parameter is batch-size independent, including the PMC bandit problem, our new bound is tighter by a square-root factor of the batch size. This is comparable to the best possible improvement due to an independence assumption in the linear CMAB problem (degenne2016combinatorial), but without any additional statistical assumptions.

Moreover, we demonstrate the tightness of our regret bounds by proving matching lower bounds for the PMC bandit problem, up to logarithmic factors in the batch size. To do so, we construct an instance of the PMC bandit problem that is equivalent to a classical MAB problem and then analyze the lower bounds of this problem. We also show that, in contrast to the linear CMAB problem, the lower bounds do not change even if different sets are independent. To the best of our knowledge, our algorithm is the first to achieve tight regret bounds for the PMC bandit problem.
2 Related Work
The literature on multi-armed bandits is vast. We thus cover only some aspects of the area, and refer the reader to (bubeck2012regret) and (lattimore2018bandit) for a comprehensive survey. We employ the Optimism in the Face of Uncertainty (OFU) principle (lai1985asymptotically), which is one of the most fundamental concepts in MAB and can be found in many well-known MAB algorithms (e.g., auer2002finite; garivier2011kl). While many algorithms rely on Hoeffding-type concentration bounds to derive an upper confidence bound (UCB) for an arm, a few previous works also apply Bernstein-type bounds and demonstrate superior performance, both in theory and in practice (audibert2009exploration; mukherjee2018efficient).
The general stochastic combinatorial multi-armed bandit framework was introduced in (chen2013combinatorial). They presented CUCB, an algorithm that maintains a UCB per arm (or 'base arm'), and then feeds the optimistic values of the arms into a maximization oracle. Many earlier works also fall under the CMAB framework, but mainly focus on a specific reward function (caro2007dynamic; gai2010learning; gai2012combinatorial; liu2012adaptive), or work in the adversarial setting (cesa2012combinatorial)
. While most algorithms for the CMAB setting follow the OFU principle, a few employ Thompson Sampling
(thompson1933likelihood), e.g., (wang2018thompson; huyuk2018thompson) for semi-bandit feedback and (gopalan2014thompson) for full-bandit feedback. In recent years, there have been extensive studies on deriving tighter bounds, but these works mostly address the linear CMAB problem (kveton2014matroid; kveton2015tight; combes2015combinatorial; degenne2016combinatorial). More recently, tighter regret bounds were derived for this framework in (wang2017improving), which also allows probabilistically triggered arms. Nevertheless, we show that in our setting, these bounds can be improved by a square-root factor of the batch size for many problems, e.g., the PMC bandit problem, and are comparable, up to logarithmic factors, otherwise.

The empirical Bernstein inequality was first used for the CMAB problem in (gisselbrecht2015whichstreams) for linear reward functions, and was later used in (perrault2018finding) for sequential search-and-stop problems. Both works focus on specific reward functions and utilize the Bernstein inequality to obtain variance-dependent regret bounds. In contrast, we analyze general reward functions and exploit the relation between the confidence interval and the reward function to derive tighter regret bounds.
The PMC bandit problem is the bandit version of the maximum coverage problem (hochbaum1996approximation), a well-studied subject in computer science with many variants and extensions. The bandit variant is closely related to the influence maximization problem (vaswani2015influence; carpentier2016revealing; wen2017online), in which the agent chooses a set of nodes in a graph that influence other nodes through random edges, and aims to maximize the number of influenced nodes. Another related setting is the cascading bandit problem (e.g., kveton2015cascading; kveton2015combinatorial; combes2015learning; lattimore2018toprank), in which a list of items is sequentially shown to a user until she finds one of them satisfactory. This is equivalent to a coverage problem with a single object, but with only partial feedback: the user will not give any feedback about items that appear after the one she liked. In both settings, the focus is very different from ours: in influence maximization, it is on the diffusion inside a graph and the graph structure, and in cascading bandits, on the partial feedback and the list ordering. They are thus complementary to our framework and could benefit from our results.
3 Preliminaries
We work under the stochastic combinatorial semi-bandit framework, where the reward is a weighted sum of smooth monotonic functions. Assume a fixed set of base arms, and let the action set be the collection of possible batches (actions) from which the agent can choose. Also assume that the size of any batch is bounded by a known constant. We write the reward of an action as a function of the arm parameters, and assume that this reward is a weighted sum of simpler functions of the parameters, for some fixed nonnegative weights. Without loss of generality, we also assume the weights are normalized.
The agent interacts with the environment as follows: on each round, the agent chooses an action. Then, for each arm in the chosen batch, it observes a feedback sample whose mean is the arm's parameter. We denote the empirical estimator of each parameter as the average of the observed samples, where the counter of an arm is the number of times it was chosen up to the current round. We similarly denote the empirical variance of each arm. The estimated parameters are concentrated around their means according to the empirical Bernstein inequality:
[Empirical Bernstein] (audibert2009exploration) Let $X_1,\dots,X_t$ be i.i.d. random variables taking their values in $[0,1]$, and let $\mu=\mathbb{E}[X_1]$ be their common expected value. Consider the empirical mean $\bar{X}_t$ and variance $V_t$, defined respectively by
$\bar{X}_t=\frac{1}{t}\sum_{i=1}^{t}X_i$ and $V_t=\frac{1}{t}\sum_{i=1}^{t}(X_i-\bar{X}_t)^2$.
Then, for any $t\in\mathbb{N}$ and $x>0$, with probability at least $1-3e^{-x}$, it holds that $|\bar{X}_t-\mu|\le\sqrt{\frac{2V_t x}{t}}+\frac{3x}{t}$.
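The lemma can be checked numerically. In the sketch below, the confidence-radius constants follow the common statement of the empirical Bernstein inequality (radius $\sqrt{2V_t x/t}+3x/t$ with $x=\log(3/\delta)$) and are an assumption about the statement abbreviated above, not notation taken verbatim from the paper.

```python
import numpy as np

# Monte-Carlo check of the empirical Bernstein inequality for Bernoulli(0.1)
# samples; the radius constants below are assumed, as noted in the lead-in.
rng = np.random.default_rng(1)
mu, n, delta = 0.1, 200, 0.05
trials = 2000
failures = 0
for _ in range(trials):
    x = rng.binomial(1, mu, size=n).astype(float)
    mean, var = x.mean(), x.var()  # biased empirical variance, as in the lemma
    radius = np.sqrt(2 * var * np.log(3 / delta) / n) + 3 * np.log(3 / delta) / n
    failures += abs(mean - mu) > radius
print(failures / trials)  # empirical failure rate, well below delta
```

Since the arm mean here is close to the edge of the domain, the variance term dominates and the interval is far tighter than the worst case, which is exactly the effect exploited in the rest of the paper.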
We require the functions to be monotonic Gini-smooth, with two smoothness parameters, which we define in the following: Let $g$ be a function differentiable in the open unit cube $(0,1)^d$ and continuous in the closed cube $[0,1]^d$. The function $g$ is said to be monotonic Gini-smooth, with a gradient-bound parameter and a Gini-weighted smoothness parameter, if:

For any coordinate $i$, the function $g$ is monotonically increasing in $p_i$ with a bounded gradient, i.e., for any $p\in(0,1)^d$, the partial derivative $\partial g(p)/\partial p_i$ is nonnegative and bounded by the gradient-bound parameter (one-sided derivatives are used at the boundary).

For any $p\in(0,1)^d$, the Gini-weighted norm of the gradient, $\sqrt{\sum_i p_i(1-p_i)\big(\partial g(p)/\partial p_i\big)^2}$, is bounded by the Gini-weighted smoothness parameter:
(1) Throughout the paper, we refer to this condition as the Gini-weighted smoothness^{1}^{1}1The name is motivated by the similarity of the weights $p_i(1-p_i)$ to the Gini impurity $2p(1-p)$. of $g$.
While the first condition is very intuitive, and is equivalent to the standard smoothness requirement (wang2017improving), the second condition demands further explanation. Notably, the Gini-smoothness parameter is less sensitive to changes in the reward function when the parameters are close to the edges of the domain, i.e., close to $0$ or $1$. Observe that in these regions, the variances of the parameters are small, which implies that they are more concentrated around their means. We will later show that this allows us to mitigate the effect of large gradients on the regret. For simplicity, we assume that all of the functions have the same smoothness parameters, but the extension is trivial. We note that if all of the component functions are Gini-smooth, then their weighted sum is also Gini-smooth. Nevertheless, explicitly decomposing the reward into a sum of monotonic Gini-smooth functions leads to a slightly tighter regret bound. We believe that this is due to a technical artefact, but nonetheless modify our analysis accordingly, since many important cases fall under this scenario.
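The connection between the smoothness weights and the concentration of the parameters can be seen directly: the weight $p(1-p)$ attached to each coordinate coincides with the variance of a Bernoulli($p$) observation, so coordinates near the edges of the domain both contribute little to the Gini-weighted smoothness and are estimated with small variance. A minimal numerical illustration:

```python
import numpy as np

# The Gini-type weight p*(1-p) equals the variance of a Bernoulli(p) sample:
# it is maximal at p = 0.5 and vanishes at the edges of [0, 1].
p = np.linspace(0.0, 1.0, 11)
weights = p * (1.0 - p)
print(np.round(weights, 3))
```

Large gradients at the edges of the domain are thus down-weighted exactly where the parameter estimates concentrate fastest.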
An important example that falls under our model is the probabilistic maximum coverage (PMC) problem. In this setting, each arm is a random set that may contain some subset of the possible items. The agent's goal is to choose a batch of sets such that as many items as possible appear in the sets (i.e., the size of the union of the sets is maximized). Formally, the component functions are the probabilities that each item is covered, and the reward is the (weighted) expected number of covered items. The smoothness parameters for this problem are constants, independent of the batch size. Another example is the logistic function, whose smoothness parameters are also batch-size independent, provided that its slope does not depend on the batch size. In the linear case, the Gini-smoothness parameter grows as the square root of the batch size. In general, the Gini-smoothness parameter can always be bounded in terms of the gradient bound and the batch size, and we will later see that in this case, our bounds are comparable to existing results that rely only on the gradient bound.
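As a concrete illustration of the PMC reward, the expected coverage of a batch has a closed form when items are included in each set independently. The array shapes and function name below are our own illustration, not notation from the paper.

```python
import numpy as np

def pmc_reward(p, batch):
    """Expected number of covered items for a batch of sets.

    p: array of shape (num_sets, num_items); p[i, j] is the probability
       that set i contains item j (assumed independent across sets).
    batch: iterable of set indices.
    """
    p_batch = p[list(batch)]
    miss = np.prod(1.0 - p_batch, axis=0)  # P(item j appears in no chosen set)
    return float(np.sum(1.0 - miss))       # expected size of the union

# Two sets that each contain the single item with probability 0.5:
p = np.array([[0.5], [0.5]])
print(pmc_reward(p, [0, 1]))  # 1 - 0.5 * 0.5 = 0.75
```

Note that each additional set covering an item multiplies the miss probability by another factor in $[0,1]$, which is the saturation effect discussed in the introduction.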
Similarly to previous work, the performance of an agent is measured by its regret, i.e., the difference between the reward of an oracle, which acts optimally according to the statistics of the arms, and the cumulative reward of the agent. However, in many problems, it is impractical to compute the optimal solution even when the problem's parameters are known. Thus, it is more sensible to compare the performance of an algorithm to the best approximate solution that an oracle can achieve. Denote the optimal action and its value accordingly. An oracle is called an (α, β)-approximation oracle if, for every parameter set, it outputs with probability at least β a solution whose value is at least an α-fraction of the optimal value. We define the expected αβ-approximation regret of an algorithm as:
(2) 
where the expectation is over the randomness of the environment and the agent's actions, through the oracle. As noted by (chen2016combinatorialA), in the linear case, and sometimes when arms are independent, the reward function equals the expectation of the empirical reward; unfortunately, this does not necessarily hold when arms are arbitrarily correlated.
We end this section with some notation. Denote the suboptimality gap of a batch as the difference between the (approximately) optimal value and the batch's expected reward. The minimal gap of a base arm is the smallest positive gap of a batch that contains the arm, and the maximal gap of an arm is defined analogously. We denote the Kullback-Leibler (KL) divergence between two random variables in the usual way, and also denote by $kl(p,q)$ the KL divergence between two Bernoulli random variables of means $p$ and $q$.
4 Algorithm
We suggest a combinatorial UCB-type algorithm with Bernstein-based confidence intervals, which we call BC-UCB (Bernstein Combinatorial Upper Confidence Bound). The UCB index is defined as
(3) 
A pseudocode can be found in Algorithm 1. On the first rounds, the agent samples batches so that each base arm is sampled at least once. Afterwards, on each round, it calculates the UCB index of every arm, and an (α, β)-approximation oracle chooses an action based on these optimistic values. The agent then plays this action and observes feedback for each arm in the chosen batch. It finally updates the empirical means and variances and continues to the next round.
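Since the UCB index in (3) is abbreviated above, the following sketch uses an empirical-Bernstein-style index with assumed constants; the `sample_arm` and `oracle` interfaces are hypothetical, and the sketch is not a reproduction of Algorithm 1.

```python
import numpy as np

def bc_ucb(sample_arm, oracle, num_arms, horizon):
    """A minimal sketch of a BC-UCB-style loop (constants are assumptions).

    sample_arm(i) -> reward in [0, 1] for base arm i (hypothetical interface).
    oracle(ucb)   -> batch (list of arm indices) chosen from optimistic values.
    """
    n = np.zeros(num_arms)     # sample counts
    mean = np.zeros(num_arms)  # empirical means
    m2 = np.zeros(num_arms)    # sum of squared deviations (Welford's method)
    for t in range(1, horizon + 1):
        if np.any(n == 0):     # initial sampling stage: cover every base arm
            batch = [int(np.argmin(n))]
        else:
            var = m2 / n
            x = np.log(t + 1)  # assumed confidence schedule
            ucb = np.minimum(1.0, mean + np.sqrt(2 * var * x / n) + 3 * x / n)
            batch = oracle(ucb)
        for i in batch:        # semi-bandit feedback: one sample per chosen arm
            reward = sample_arm(i)
            n[i] += 1
            delta = reward - mean[i]
            mean[i] += delta / n[i]
            m2[i] += delta * (reward - mean[i])
    return mean, n
```

As a sanity check, a two-armed instance with a top-1 oracle concentrates its samples on the better arm.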
An example of an approximation oracle that can be used when the reward function is monotonic and submodular^{2}^{2}2Let $\Omega$ be a finite set. A set function $f:2^{\Omega}\to\mathbb{R}$ is called submodular if for any $S\subseteq T\subseteq\Omega$ and $x\notin T$, $f(S\cup\{x\})-f(S)\ge f(T\cup\{x\})-f(T)$. is the greedy oracle, which enjoys an approximation factor of $1-1/e$ with probability $1$ (nemhauser1978analysis). The oracle initializes an empty batch, and then selects items sequentially in a greedy manner, each time adding the item with the largest marginal gain, until the batch reaches its maximal size.
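The greedy selection described above can be sketched generically; the function interface below is our own illustration of the oracle, not the paper's implementation.

```python
def greedy_oracle(f, ground_set, k):
    """Greedy (1 - 1/e)-approximation for maximizing a monotone submodular
    set function f over subsets of size at most k; f maps a set to a float.
    """
    chosen = set()
    for _ in range(k):
        candidates = [x for x in ground_set if x not in chosen]
        if not candidates:
            break
        # add the item with the largest marginal gain
        best = max(candidates, key=lambda x: f(chosen | {x}) - f(chosen))
        chosen.add(best)
    return chosen

# Coverage example: each "arm" deterministically covers a set of items.
covers = {0: {"a", "b"}, 1: {"b", "c"}, 2: {"c", "d"}}
coverage = lambda s: len(set().union(*(covers[i] for i in s)) if s else set())
print(greedy_oracle(coverage, covers.keys(), 2))
```

In the bandit algorithm, the same routine is applied to the reward estimate evaluated at the optimistic parameters rather than to the true coverage function.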
The first term of the confidence interval resembles the standard Hoeffding-based UCB term, and is in fact never larger, since the empirical variance of variables in $[0,1]$ is always bounded by $1/4$.^{3}^{3}3For variables in $[0,1]$, the empirical variance is at most the empirical mean times one minus the empirical mean, which is at most $1/4$. The second term does not appear in the standard UCB bound, and can slightly affect the regret; since suboptimal arms will asymptotically be sampled a logarithmic number of times, the two terms are comparable. Nevertheless, the first term is still dominant, and if the variance of an arm is drastically lower than $1/4$, the confidence bound is significantly tighter. This can happen, for example, when the arm's mean is close to $0$ or $1$.
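To make the comparison concrete, the sketch below contrasts a Hoeffding-style radius with an empirical-Bernstein-style radius; the constants are assumptions, chosen to match the common statements of the two inequalities rather than the abbreviated index (3).

```python
import math

def hoeffding_radius(n, delta):
    # variance-independent width for [0, 1] samples
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))

def bernstein_radius(var, n, delta):
    # empirical-Bernstein width; shrinks with the empirical variance
    x = math.log(3.0 / delta)
    return math.sqrt(2.0 * var * x / n) + 3.0 * x / n

n, delta = 10000, 0.05
print(hoeffding_radius(n, delta))        # fixed width
print(bernstein_radius(0.25, n, delta))  # worst-case variance: comparable
print(bernstein_radius(0.001, n, delta)) # mean near 0 or 1: much tighter
```

The worst-case widths differ only by constants, while a small empirical variance shrinks the Bernstein interval by an order of magnitude.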
We take advantage of this property through the Gini-smoothness parameter, which only accounts for the sensitivity of the reward function to parameter changes when the parameters are far from $0$ or $1$. This allows us to derive a drastically tighter regret bound, in comparison to existing algorithms, when the Gini-smoothness parameter is small relative to the gradient bound, as we establish in the following theorem:
Theorem
Let the reward components be monotonic Gini-smooth functions with common smoothness parameters, and let the reward function be their weighted sum. For any horizon, the expected approximation regret of BC-UCB with an (α, β)-approximation oracle is bounded by
(4) 
We can also exploit the problem dependent regret bound to derive a problem independent bound, that is, a bound that holds uniformly over all gaps:
Corollary
Let the reward components be monotonic Gini-smooth functions with common smoothness parameters, and let the reward function be their weighted sum. For any horizon, the expected approximation regret of BC-UCB with an (α, β)-approximation oracle can be bounded by
(5) 
The proof of Theorem 4 is presented in the following section, along with a proof sketch for Corollary 4. The full proof of the corollary can be found in Appendix E.
We start by noting that we could avoid decomposing the reward into a sum of functions, but the regret bound in this case is slightly looser: the logarithmic factors in (4) are replaced by larger ones. We believe that the extra logarithmic factor is due to a technical artefact, but leave its removal for future work. We also remark that the second term of the problem dependent regret bound is negligible for small gaps, and can always be bounded using the relation between the Gini-smoothness parameter and the gradient bound, which yields an alternative form of the regret bound.
To the best of our knowledge, the closest bound to ours appears in (wang2017improving). From their perspective, our bandit problem has the same base arms, batch size and gaps, and their smoothness parameter corresponds to our gradient bound. Substituting into their bound yields a regret bound that is looser than ours by up to a square-root factor of the batch size; our regret bound is tighter, up to logarithmic factors, whenever the Gini-smoothness parameter is nontrivial, that is, smaller than its worst-case value.
Alternatively to our approach, it is possible to analyze BC-UCB based only on the gradient-bound smoothness. In this case, the analysis would be very similar to that of (kveton2015tight). This would yield a dominant term that does not contain the extra logarithmic factor and declines with the variance of the arms. However, it can lead to dramatically worse bounds when the Gini-smoothness parameter is small, so we decided not to pursue this path. Nonetheless, this approach is still worth mentioning when comparing to other algorithms, since when its bounds are combined with ours, we can conclude that the regret of BC-UCB is always tighter than the regret obtained in (wang2017improving).
On a final note, we return to the PMC bandit problem. In this case, both smoothness parameters are constants, independent of the batch size, so our regret bound is tighter than existing results (wang2017improving) by a square-root factor of the batch size. We will later show that this bound is tight, up to logarithmic factors in the batch size.
5 Proving the Regret Upper Bounds
We start the proof by simplifying the first term of the UCB index. To this end, recall Bernstein’s inequality:
[Bernstein's inequality] Let $X_1,\dots,X_n$ be independent random variables in $[0,1]$, with means $\mu_i$ and variances $\sigma_i^2$. Then, with probability at least $1-\delta$:
(6) 
Next, consider the empirical variance of independent random variables with a common mean, and note that it is upper bounded by the empirical average of the squared deviations from the true mean. We can thus define these squared deviations as auxiliary independent random variables and bound their empirical average instead. These variables are clearly bounded, and their expectations are bounded by the variance of the original variables, which for variables in $[0,1]$ is at most $\mu(1-\mu)$.
The variance of the auxiliary variables can be similarly bounded. Applying Bernstein's inequality (6) to these variables now gives a high probability bound on the empirical variance: with probability at least $1-\delta$,
(7) 
where the last step utilizes the relation between the variance and the mean of variables in $[0,1]$. An important (informal) conclusion of inequality (7) is that if the event under which the inequality holds does occur, then the confidence interval around an arm's parameter can be bounded by:
(8) 
The confidence interval thus combines a variance-dependent term with a lower-order term. We should therefore analyze how this kind of parameter perturbation affects the reward function. To do so, we take advantage of the Gini-weighted smoothness, as stated in the following lemma (see Appendix A for the proof):
Lemma
Let $g$ be a monotonic Gini-smooth function with its two smoothness parameters, and let two constant parameter vectors be given such that their difference is bounded in the form of the confidence intervals above. Then the sensitivity of $g$ to the parameter change can be bounded by
(9) 
Next, define the low probability events under which some of the variables are not concentrated in their confidence intervals:
(10)  
(11) 
Also denote the union of these events. Intuitively, even though the regret may be large under this event, it cannot occur many times, and we can therefore analyze the regret under the assumption that it does not occur. We can then bound the regret similarly to inequality (8), combined with Lemma 5. Formally, we decompose the regret as follows:
Lemma
Let the reward components be Gini-smooth functions with common smoothness parameters, and define
(12) 
for appropriate constants. The regret of Algorithm 1, when used with an (α, β)-approximation oracle, can be bounded by
(13) 
The proof is in Appendix B. It is interesting to observe that this term is very similar in form to the confidence interval of the linear combinatorial problem when parameters are independent (combes2015combinatorial). We have achieved this form of confidence interval without any independence assumptions, solely on the basis of the properties of the reward function. We can therefore adapt the proofs of (kveton2015tight; degenne2016combinatorial) to derive a problem dependent regret bound. Since we are interested in bounding the regret due to the first term, and due to the initial sampling stage, we assume from this point onward that all of the arms were sampled at least once.
Define two positive decreasing sequences that converge to zero and will be determined later. Also define, for each round, the set of chosen arms that were not yet sampled enough times, relative to some function that will also be determined later. Denote the events in which this set contains at least a given number of elements at one level, but fewer at all previous levels, and consider their union. We show that when a suboptimal action is chosen, one of these events must occur, for the appropriate level. To do so, we first cite a variant of Lemmas 7 and 8 of (degenne2016combinatorial): under the aforementioned events, the following inequality holds:
(14) 
Using this lemma, we can now prove that must occur (see Appendix C for proof):
Lemma
If a suboptimal action is chosen while the concentration event holds, then one of the aforementioned events occurs.
A direct result of this lemma is that whenever a suboptimal action is chosen, at least one such event occurs. This allows us to further decompose the first term of (13), and obtain the final result of Theorem 4:
Lemma
The regret from the event can be bounded by
(15) 
and, under an additional condition on the sequences,
(16) 
The proof of the problem independent upper bound of Corollary 4 is a direct result of Lemma 5. Specifically, the bound can be achieved by decomposing the regret according to Lemma 5, and then dividing the regret into large gaps and small gaps, according to some fixed threshold. Large gaps are bounded according to Lemma 5, and small gaps are bounded trivially by the threshold times the horizon. The final bound is achieved by optimizing the threshold. The full proof can be found in Appendix E.
6 Lower Bounds
Although our algorithm enjoys improved upper bounds in comparison to CUCB, it is still interesting to see whether our results are tight in problems where previous bounds are loose. To demonstrate the tightness of our algorithm, we present an instance of the PMC bandit problem, on which our results are tight up to logarithmic factors. We assume throughout the rest of this section that the maximization oracle can output the optimal batch, i.e. has an approximation factor of , with probability . This assumption allows us to focus on the difficulty of the problem due to parameter uncertainty and the semibandit feedback. We formally state the results in the following proposition:
There exists an instance of the PMC bandit problem with a fixed minimal gap such that the expected regret of any consistent algorithm^{4}^{4}4An algorithm is called consistent if, for any problem and any $a>0$, its regret is $o(T^a)$ as $T\to\infty$. is lower bounded by
(17) 
Moreover, for any horizon and batch size, there exists an instance of the PMC bandit problem such that the expected regret of any algorithm is lower bounded by
(18) 
Consider the following PMC bandit problem: fix a subset of the arms to be empty sets, so that they yield no reward. For the rest of the arms, we force all of the items to be identically distributed. We also fix the action set so that every action contains all of the zero-reward arms plus a single additional nonempty arm. The expected reward when choosing an action is thus determined by its single nonempty arm, and the problem is equivalent to a classical multi-armed bandit problem over the nonempty arms. In order to prove the problem dependent regret bound, we choose the item probabilities so that all suboptimal arms share the same gap. For any consistent MAB algorithm, the expected regret can be lower bounded by (lai1985asymptotically):
(19) 
The KL divergence between the arm distributions can be directly bounded using the chi-square bound $kl(p,q)\le\frac{(p-q)^2}{q(1-q)}$, where $kl$ denotes the Bernoulli KL divergence (csiszar2006context). Substituting back into (19) yields the first part of the proposition:
Since all of the items have the same distribution, our problem is equivalent to an MAB problem whose rewards are scaled by a constant factor. Thus, the problem independent lower bound from (auer2002nonstochastic) can also be applied to this problem, scaled by the same factor:
We remark that throughout the proof, we assumed nothing about the correlation between different arms, and thus the bound cannot be improved by assuming this kind of independence. Nevertheless, if items are assumed to be independent, the lower bounds can be drastically improved. We will not tackle the problem of independent items in this paper, but leave it to future work.
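The chi-square upper bound on the Bernoulli KL divergence invoked in the proof above is a standard relation, and can be verified numerically over a grid of interior means (the function name below is ours):

```python
import math

def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), 0 < p, q < 1."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

# Verify kl(p, q) <= (p - q)^2 / (q * (1 - q)) on an interior grid; this is
# the chi-square bound on the KL divergence, tight when p is close to q.
for i in range(1, 100):
    for j in range(1, 100):
        p, q = i / 100.0, j / 100.0
        assert kl_bernoulli(p, q) <= (p - q) ** 2 / (q * (1.0 - q)) + 1e-12
print("chi-square bound verified on the grid")
```

The bound follows from the general inequality that the KL divergence is at most the chi-square divergence, using $\log x \le x - 1$.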
7 Summary
In this work, we introduced BC-UCB, a CMAB algorithm that utilizes Bernstein-type confidence intervals. We defined a new smoothness criterion, called Gini-weighted smoothness, and showed that it allows us to derive tighter regret bounds for many interesting problems. We also presented matching lower bounds for one such problem, thus demonstrating the tightness of our algorithm.
We believe that our concepts can be applied to derive tighter bounds in many interesting settings. Specifically, our analysis includes the PMC bandit problem, that has a central place in the areas of ranked recommendations and influence maximization. We also believe that our results could be extended to the frameworks of cascading bandits and probabilistically triggered arms, but leave this for future work.
Another possible direction involves analyzing specific arm distributions: in our framework, we assumed nothing about the arms' distributions except for their domain, and could thus only exploit very weak concentration properties, specifically the concentration of bounded variables near the edges of their domain. If additional information about the distribution of the arms is available, it should be possible to leverage it to design more refined smoothness criteria. Such criteria could take into account tighter concentration properties of the arms' distributions, and thus lead to tighter regret bounds.
Finally, we remark that the lower bounds were only possible to derive because we required the algorithm to support any arbitrary choice of action set. For the PMC problem, previous work shows that when the action set contains every subset of a fixed size, the regret bounds can be significantly improved (kveton2015cascading). It would be interesting to see whether our technique can be used in this setting to extend these results and derive tighter bounds for any Gini-smooth function.
The authors thank Asaf Cassel and Esther Derman for their helpful comments on the manuscript.
References
Appendix A Proof of Lemma 5
See 5
First, we define the functions
(20) 
We note that is well defined for , since the function is integrable near and is continuous, so the product is integrable near . Symmetrically, the function is also integrable near . can be explicitly written as
(21) 
The two functions are closely related: observe that with , and thus . In addition, , and therefore the function is strictly monotonically increasing, so its inverse is well defined. Finally, the relation between the derivatives also yields the property for any in .
Next, we bound , for any such that