This paper considers a variant of the best arm identification problem in the stochastic multi-armed bandit (MAB) setting, called cascading bandits. Consider a MAB with $n$ arms, each with unknown mean payoff in $[0,1]$. A sample of the $i$th arm is an independent realization of a Bernoulli random variable with mean $\mu_i$. However, instead of playing a single arm in each round, the player selects an ordered subsequence of the arms. The arms are then played in order until a reward of $1$ is obtained. The player observes the sequence of rewards, up to and including the first $1$.
Therefore, the player receives an unbiased estimate of the means of a random prefix of the selected subsequence. Note that there is an inherent trade-off in the size of the subsequence that should be selected. On the one hand, selecting a longer subsequence provides rewards for more arms on average. On the other hand, always playing the longest possible subsequence means that rewards for the later arms are received infrequently (since it is unlikely that the first non-zero reward arrives only after a long run of zeros).
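To make the observation model concrete, here is a minimal sketch in Python of a single cascading play. The helper name `cascading_play` and the `(arm, reward)` encoding are our own illustrative choices, not part of the paper's formal setup.

```python
import random

def cascading_play(mus, subseq, rng=random):
    """Play an ordered subsequence of arms and return the revealed
    prefix of (arm, reward) pairs, up to and including the first 1."""
    revealed = []
    for i in subseq:
        r = 1 if rng.random() < mus[i] else 0
        revealed.append((i, r))
        if r == 1:  # the cascade stops at the first positive reward
            break
    return revealed
```

Arms after the first positive reward are played conceptually but their rewards are never observed, which is exactly the partial-feedback structure described above.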
This setting is primarily motivated by the problem of active 3D sensing in computer vision using photon-counting sensors (Shin et al. (2015)). Photon detection is a stochastic process, and the sensor makes discrete measurements of photon arrivals across time. For each time bin, the sensor records a Bernoulli random variable with mean proportional to the light intensity in that bin. Due to certain physical constraints, the sensor can detect at most one photon in each round of measurement. The goal is to identify the time bin with the highest mean (since it corresponds to the time taken for light reflected from the scene to reach the sensor), and thus recover the depth. By actively gating the sensor at each time bin, we cast the problem as an instance of cascading bandits.
Another potential application is in search ranking (Radlinski et al. (2008)). Given $n$ documents, we want to identify the top-$k$ results to display to each user. When the user clicks on a result, every document preceding it gets a reward of $0$, while the clicked document gets a reward of $1$.
In this work, we propose a sample-efficient algorithm for the fixed confidence setting, and derive matching lower and upper bounds for the problem. Using a novel uniform sampling scheme, we extend the exponential gap elimination algorithm to our cascading bandits setting and obtain a method that is not only optimal, but also works for applications where the relative order of arms is fixed. In particular, this provides a practically implementable solution for the 3D sensing problem.
2 Related work
The standard version of the best arm problem has been extensively studied since the ’50s, and especially in the last decade. Two settings are commonly studied: fixed confidence and fixed budget, in which either the error probability or the number of arm pulls is fixed, and the goal is to minimize the other (see Gabillon et al. (2012)). In the fixed confidence setting, Bubeck et al. (2011) proposed the uniform allocation strategy, which was shown to achieve a sample complexity of order $\frac{n}{\Delta^2}\log\frac{n}{\delta}$, where $\Delta = \min_{i \neq i^*} \Delta_i$ is the smallest suboptimality gap. The exponential-gap elimination procedure of Karnin et al. (2013) guarantees best arm identification with high probability using order $\sum_{i \neq i^*} \Delta_i^{-2} \log(\delta^{-1} \log \Delta_i^{-2})$ samples. Jamieson et al. (2014) proposed a UCB-type algorithm, lil’ UCB, which achieves the same sample complexity, and proved a matching lower bound. While uniform allocation and exponential-gap elimination proceed by sampling the remaining arms uniformly in each round, UCB strategies sample the single arm with the largest upper confidence bound in each round.
Cascading bandits were proposed by Kveton et al. (2015) to model the problem of learning to rank, in the regret minimization setting. We extend the idea to the best arm identification setting, which, as we show later, is a natural fit for the cascading model. Many other variants of the bandit problem have also been proposed. Of particular relevance are combinatorial bandits (Chen et al. (2013)), which deal with arms that form certain combinatorial structures; in each round, a set of arms (a super arm) is played together. All these results can be considered steps toward a generalized combinatorial bandit problem where the action, the reward, and the output all come from certain combinations of the set of arms.
3 Problem setup
Consider a given sequence of $n$ arms, with means $\mu_1, \dots, \mu_n$. In each round, we can select a subsequence of the arms to “play”. Let $A$ denote the selected subsequence in a particular play. The environment generates Bernoulli rewards for each of the selected arms independently. However, only a prefix of the reward subsequence, up to and including the first reward of $1$, is revealed to the player. Instead of referring to each such event as an arm pull, we call it a “play”.
Let $S_i$ denote the number of times a reward of $1$ was obtained for the $i$th arm, and $T_i$ the number of times the $i$th arm’s reward was revealed. Note that an arm’s reward might not be revealed every time it is played, since another arm before it in the sequence might have already received a reward. $T_i$ only counts the plays where the $i$th arm explicitly received either a $0$ or a $1$ reward. Using these, we can define the mean estimates $\hat{\mu}_i = S_i / T_i$.
$\hat{\mu}_i$ is an unbiased estimator of $\mu_i$, and can be used to bound $|\hat{\mu}_i - \mu_i|$ via a concentration inequality. This is possible because each revealed reward of arm $i$ is conditionally independent of the other revealed rewards, given that it is revealed. Therefore, as long as $T_i$ is large enough, $\hat{\mu}_i$ concentrates around $\mu_i$.
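The counting scheme above can be sketched as follows; `update_counts` and `mean_estimates` are hypothetical helper names, and only revealed rewards contribute to the estimates:

```python
def update_counts(revealed, S, T):
    """Accumulate per-arm success counts S and reveal counts T from the
    revealed prefix of one play; unrevealed arms are not counted."""
    for i, r in revealed:
        T[i] += 1
        S[i] += r

def mean_estimates(S, T):
    """Empirical means: S_i / T_i when arm i has been revealed at least once."""
    return [s / t if t > 0 else 0.0 for s, t in zip(S, T)]
```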
In this paper, we only consider the fixed confidence setting. For a given confidence level $\delta$, our goal is to find the adequate number of plays $B$ needed such that $\Pr(\hat{i}_B \neq i^*) \leq \delta$, where $\hat{i}_B$ is the predicted optimal arm after $B$ plays and $i^*$ is the true optimal arm.
4 Allocation strategies
Before we discuss various allocation strategies, let us get an idea of the sample complexities that such strategies can be expected to achieve. Since each play gives us the reward of at least one arm, the complexity is upper bounded by that of standard MAB. The improvement over standard MAB depends on how many rewards we get per play on average. For low $\mu_i$’s, the revealed sequence of rewards will be longer on average, up to a maximum length of $n$. Therefore, we can expect at most an $n$-fold improvement. For high $\mu_i$’s, we do not expect much of an improvement.
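The average number of rewards revealed per play of the full sequence can be computed directly: arm $j$’s reward is revealed exactly when all earlier arms returned $0$. A small sketch (the function name is ours):

```python
def expected_revealed(mus):
    """Expected number of rewards revealed when the whole sequence is
    played: arm j is reached iff every earlier arm returned 0."""
    total, survive = 0.0, 1.0
    for mu in mus:
        total += survive        # probability of reaching this arm
        survive *= 1.0 - mu
    return total
```

For means close to $0$ this approaches the number of arms $n$; for means close to $1$ it approaches $1$, matching the intuition above.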
There are multiple differences between the standard bandit setting and our variant, which make it difficult to design optimal strategies. Unlike standard bandits, where we deterministically observe the reward of the pulled arm, here we stochastically observe the rewards of a prefix of the selected subsequence. This stochasticity makes it unclear how an allocation strategy will behave.
A naive strategy (or set of strategies) is to select a singleton subsequence in each round. The problem then reduces to standard MAB, and we can use any of the optimal algorithms to identify the best arm with the standard sample complexity. This strategy lies at one end of the spectrum, which we call "maximum wastage". It is possible to improve this strategy slightly by selecting a suffix instead of a singleton whenever a singleton play is desired. However, there will be a bias towards the earlier arms being played more often than the later ones, and we can design problem instances with decreasing means to show that the sample complexity will be similar to that of the naive strategy.
At the other end of the spectrum, we have the "minimum wastage" strategy (set of strategies), which always selects the whole sequence of remaining arms to play in each round. The following result shows that this strategy is also suboptimal (except in some special cases):
Theorem 1. With probability at least $1-\delta$, the “whole sequence” strategy finds the optimal arm using at most
The proof is discussed in the Appendix, and uses an analysis similar to that of the uniform allocation strategy for standard MAB. Notice that the dependence on the gaps $\Delta_i$ is now weighted by the reveal probabilities $p_i = \prod_{j<i}(1-\mu_j)$, i.e., the probability that arm $i$’s reward is revealed in a play of the whole sequence. A similar result can be shown for the strategy that combines exponential gap elimination with the minimum wastage principle (more explicitly, for each round of exponential gap elimination, we play the whole sequence of remaining arms together until each arm has been effectively sampled for the number of rounds required by the algorithm). Broadly speaking, the sample complexities of these strategies differ from those of the standard MAB algorithms by a divisive factor governed by the reveal probabilities. For small values of the $\mu_i$’s, $p_i \approx 1$ for every arm and the sample complexities are good. For large values of the $\mu_i$’s however, the reveal probability of the last arm can be exponentially small in the number of arms $n$, and the sample complexities are actually much worse than those of standard MAB. Again, the reason for the bad complexity is that the allocation strategy causes a bias against the later arms.
5 Uniform sampling for cascading bandits
The previous section motivated the design of an allocation strategy which is fair to all arms, while being least “wasteful”. In this section, we introduce a strategy to perform uniform sampling which is strictly better than the naive strategy of only selecting a single arm in each round. As we will see later, using this strategy as a subroutine leads to the design of algorithms that are optimal in terms of sample complexity.
The algorithm proceeds iteratively by playing the current subsequence until the leftmost arm has been sampled $t$ times, and then removing it from the subsequence. This ensures that each arm has been sampled at least $t$ times when the algorithm ends.
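A sketch of this subroutine, under our own naming (`uniformly_sample`): it plays the current suffix until the leftmost arm has been revealed `t` times, then drops that arm.

```python
import random

def uniformly_sample(mus, arms, t, rng=random):
    """Play suffixes of `arms` until every arm's reward has been
    revealed at least t times: repeatedly play the current suffix
    until its leftmost arm reaches t reveals, then drop that arm."""
    S = {a: 0 for a in arms}  # revealed 1-rewards per arm
    T = {a: 0 for a in arms}  # reveals per arm
    for start in range(len(arms)):
        while T[arms[start]] < t:
            for a in arms[start:]:
                r = 1 if rng.random() < mus[a] else 0
                T[a] += 1
                S[a] += r
                if r == 1:  # cascade stops at the first 1
                    break
    return S, T
```

Since the leftmost arm of the current suffix is always revealed, each stage terminates after finitely many plays; later arms may end up revealed more than `t` times, which only helps.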
The total number of rounds used by the algorithm is stochastic, but can be bounded with high confidence around its mean. Let $S_i$ denote the number of times a reward of $1$ was obtained for arm $i$. Then, the number of rounds is given by:
Conditioned on the number of reveals, each success count is binomially distributed. Using Hoeffding’s inequality, we get:
where $B$ is the total number of plays. Therefore, the algorithm “effectively” samples each arm $t$ times, using $B$ plays in total.
5.1 Staticity of the order of arms
The uniform sampling scheme of the previous subsection has the property that the order in which arms are played remains static throughout. Given any two arms $i$ and $j$ with $i < j$ that are played in the same round, arm $i$ is always played before arm $j$. This is desirable for applications where the order of arms is determined before the algorithm is run, and is constrained to remain the same as the algorithm proceeds. For the 3D sensing problem described earlier, the order of the arms follows the temporal order of the bins of the sensor. In contrast, the search ranking problem does not have this constraint, since the results can be placed in any arbitrary order. Even though we framed the problem of cascading bandits in the unconstrained setting, the algorithms we discuss are compatible with the constrained setting too. As we show in Section 7, this does not increase the sample complexity.
5.2 Uniform allocation: with uniform sampling
The subroutine introduced in the previous section naturally leads to the uniform allocation algorithm for best arm prediction. For all $i$, if we perform uniform sampling with a suitable choice of $t$, and then predict the arm with the highest empirical mean $\hat{\mu}_i$, we get:
Combining this with (2), we get:
This translates to the stated sample complexity, which is strictly less than that of standard MAB. The relative improvement over standard MAB increases from $1$-fold to $n$-fold as the $\mu_i$’s decrease, in agreement with our intuitive expectation.
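As an illustration, the uniform allocation algorithm can be sketched end-to-end. The choice of `t` below is a standard Hoeffding-style setting that assumes a known lower bound `gap` on the smallest suboptimality gap; this is an assumption of the sketch, not the paper's exact constant.

```python
import math
import random

def uniform_allocation_best_arm(mus, delta, gap, rng=random):
    """Uniform allocation via the uniform-sampling subroutine: reveal
    each arm at least t times, then predict the empirical best arm.
    The Hoeffding-style choice of t assumes a known lower bound `gap`
    on the smallest suboptimality gap (an illustrative assumption)."""
    n = len(mus)
    t = math.ceil((2.0 / gap ** 2) * math.log(2.0 * n / delta))
    S, T = [0] * n, [0] * n
    for start in range(n):
        while T[start] < t:
            for i in range(start, n):
                r = 1 if rng.random() < mus[i] else 0
                T[i] += 1
                S[i] += r
                if r == 1:
                    break
    return max(range(n), key=lambda i: S[i] / T[i])
```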
6 Exponential gap elimination: with uniform sampling
In this section, we use the uniform sampling subroutine described earlier to replace the singleton arm plays in the standard exponential gap elimination strategy. For clarity, we provide the original algorithm (in the context of cascading bandits) and the modified one side-by-side.
1: input confidence $\delta$
2: initialize $S \leftarrow \{1, \dots, n\}$, $r \leftarrow 1$
3: while $|S| > 1$ do
4: let $\epsilon_r = 2^{-r}/4$ and $\delta_r = \delta/(50 r^3)$
5: sample each arm $i \in S$ for $t_r = (2/\epsilon_r^2)\log(2/\delta_r)$ times, and let $\hat{\mu}_i^r$ be the average reward
6: invoke $i_r \leftarrow \mathrm{MedianElimination}(S, \epsilon_r/2, \delta_r)$
7: set $S \leftarrow S \setminus \{i \in S : \hat{\mu}_i^r < \hat{\mu}_{i_r}^r - \epsilon_r\}$
8: update $r \leftarrow r + 1$
9: end while
10: output arm in $S$
1: input confidence $\delta$
2: initialize $S \leftarrow \{1, \dots, n\}$, $r \leftarrow 1$
3: while $|S| > 1$ do
4: let $\epsilon_r = 2^{-r}/4$ and $\delta_r = \delta/(50 r^3)$
5: UniformlySample$(S, t_r)$ with $t_r = (2/\epsilon_r^2)\log(2/\delta_r)$, and let $\hat{\mu}_i^r$ be the average reward
6: invoke $i_r \leftarrow \mathrm{MedianElimination}(S, \epsilon_r/2, \delta_r)$, reusing the plays of line 5
7: set $S \leftarrow S \setminus \{i \in S : \hat{\mu}_i^r < \hat{\mu}_{i_r}^r - \epsilon_r\}$
8: update $r \leftarrow r + 1$
9: end while
10: output arm in $S$
Note that there are two places where arms are played in the algorithm, lines 5 and 6. In both places, the arms are to be played such that each of the remaining arms is effectively sampled uniformly, for the required number of times. Instead of playing singleton arms as in the original algorithm, we play suffix sequences according to the UniformlySample subroutine. We now analyze the sample complexity of the modified algorithm.
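A runnable sketch of the modified algorithm follows. The round constants mirror Karnin et al. (2013); for brevity we replace the MedianElimination invocation with the round's empirical best arm, which is a simplification for illustration only.

```python
import math
import random

def uniformly_sample_means(mus, arms, t, rng):
    """Reveal each arm in `arms` (order preserved) at least t times by
    playing suffixes, as in the subroutine of Section 5; return the
    empirical means."""
    S = {a: 0 for a in arms}
    T = {a: 0 for a in arms}
    for start in range(len(arms)):
        while T[arms[start]] < t:
            for a in arms[start:]:
                r = 1 if rng.random() < mus[a] else 0
                T[a] += 1
                S[a] += r
                if r == 1:
                    break
    return {a: S[a] / T[a] for a in arms}

def exp_gap_elimination(mus, delta, rng=random):
    """Exponential gap elimination with uniform sampling (sketch).
    Constants follow Karnin et al. (2013); the MedianElimination call
    is replaced by the round's empirical best arm for brevity."""
    arms = list(range(len(mus)))
    r = 1
    while len(arms) > 1:
        eps = 2.0 ** (-r) / 4.0
        dlt = delta / (50.0 * r ** 3)
        t = math.ceil((2.0 / eps ** 2) * math.log(2.0 / dlt))
        mu_hat = uniformly_sample_means(mus, arms, t, rng)
        ref = max(mu_hat.values())
        arms = [a for a in arms if mu_hat[a] >= ref - eps]  # order kept
        r += 1
    return arms[0]
```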
6.1 Sample Complexity Analysis
We borrow the outline of the proof from the original exponential gap elimination paper (Karnin et al. (2013)). Since the number of effective samples for each arm remains the same in the modified algorithm, Lemmas 3.3 and 3.4 still hold. We state them here for reference.
Lemma 3.3. With probability at least , we have for all .
For all , let , and , .
Lemma 3.4. Assume that the optimal arm is not eliminated by the algorithm. Then with probability at least , we have for all .
We next calculate how many plays occur in a run of the algorithm, ignoring (for now) the plays spent in invocations of MedianElimination (for the same reason as mentioned in the original paper, i.e., the plays issued by the algorithm in line 5 can be reused within the invocations of MedianElimination without any major change in the analysis). Let $B_r$ denote the number of plays in round $r$ of the algorithm. As before:
where the per-round success counts are distributed binomially, conditioned on the number of reveals. Moreover, the total number of plays is:
As before, using Hoeffding’s inequality, we get:
For , using and , we get . Combining with above, we get:
Setting the right-hand side of the probability expression to $\delta$, we get:
Combining this with above, we get the following result:
Theorem 2. With probability at least $1-\delta$, exponential gap elimination (with uniform sampling) finds the optimal arm using at most
Note that this bound has a pattern similar to the bound that we derived for the uniform allocation strategy. Again, for low $\mu_i$’s, this is an $n$-fold improvement over standard MAB.
7 Lower Bound
In this section, we derive a problem-dependent lower bound on the number of plays needed in the fixed confidence setting. Our analysis follows the same direction as Jamieson et al. (2014), relying on Farrell’s optimal test (Farrell (1964)).
Consider the best arm identification problem in cascading bandits with $n$ arms and mean rewards $\mu_1, \dots, \mu_n$. Let $\tau_i$ denote the expected number of trials of the $i$th arm, and $\Delta_i$ the gap between the best mean and $\mu_i$. Any procedure with error probability at most $\delta$ necessarily has
for all suboptimal arms $i$.
Proof. See Corollary 1 in Jamieson et al. (2014). ∎
Denoting the number of plays of the cascading bandit by $B$, the number of trials of the $i$th arm that returned reward $1$ by $S_i$, and the number of plays with no positive reward by $Z$, we have $B = \sum_i S_i + Z$, since each play ends with either exactly one positive reward, or with no positive reward. Taking expectations on both sides, we get $\mathbb{E}[B] = \sum_i \mathbb{E}[S_i] + \mathbb{E}[Z]$.
Together with the corollary, this implies that no best arm identification procedure can have error probability at most $\delta$ and use fewer than the corresponding number of plays in expectation. Other than the factor in the last term, this matches the upper bound derived for exponential gap elimination.
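The accounting identity used above (each play of the full sequence ends with exactly one positive reward, or with none) is easy to check empirically; the helper name below is illustrative.

```python
import random

def count_play_outcomes(mus, num_plays, rng):
    """Play the full sequence `num_plays` times; return per-arm counts
    of positive rewards S and the count Z of plays with no positive
    reward. Each play ends in exactly one 1, or in none."""
    S = [0] * len(mus)
    Z = 0
    for _ in range(num_plays):
        got_one = False
        for i, mu in enumerate(mus):
            if rng.random() < mu:
                S[i] += 1
                got_one = True
                break
        if not got_one:
            Z += 1
    return S, Z
```

By construction, `sum(S) + Z` equals the number of plays, which is exactly the identity $B = \sum_i S_i + Z$.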
8 Discussion and Future work
We looked at two algorithms for best arm identification in the standard MAB problem, and adapted them to the cascading MAB problem with good results. The tool that allowed us to do this was a method for performing uniform sampling efficiently. Indeed, it was this uniform sampling strategy that helped avoid the wastage associated with naive sampling, and reap the full benefit of cascading bandits compared to standard bandits. Since the uniform sampling subroutine is “optimal” in terms of sample complexity, combining it with an optimal standard MAB algorithm, namely exponential gap elimination, gives a close to optimal sample complexity for the best arm identification problem in cascading bandits. Furthermore, we hypothesize that any algorithm which proceeds by sampling all remaining arms an equal number of times can benefit from this sampling strategy. However, this leaves out a very important class of algorithms: UCB-type algorithms. Since UCB-type algorithms select the arm with the highest upper confidence bound to pull each time, our uniform sampling subroutine cannot be invoked. We would need some other way of selecting an optimal subsequence of arms to pull based on confidence bounds. This means that our algorithms cannot be readily extended to the regret minimization setting for cascading bandits, where UCB-type algorithms are usually dominant. Kveton et al. (2015) gave such an algorithm for regret minimization whose sample complexity is optimal with respect to the gaps, but not the means. In the best arm identification setting, the exponential gap algorithm lags behind lil’ UCB in practice, and substantial improvements can be expected if an optimal UCB-type algorithm for cascading bandits is designed. These remain open problems.
Finally, the discussion of cascading bandits can be extended to the problem of identifying the $k$ best arms. This would be especially useful for the search ranking problem, where we are interested in identifying the top-$k$ search results for a query.
- Shin et al.  Dongeek Shin, Ahmed Kirmani, Vivek K Goyal, and Jeffrey H Shapiro. Photon-efficient computational 3-d and reflectivity imaging with single-photon detectors. IEEE Transactions on Computational Imaging, 1(2):112–125, 2015.
- Radlinski et al.  Filip Radlinski, Robert Kleinberg, and Thorsten Joachims. Learning diverse rankings with multi-armed bandits. In Proceedings of the 25th International Conference on Machine Learning, pages 784–791. ACM, 2008.
- Gabillon et al.  Victor Gabillon, Mohammad Ghavamzadeh, and Alessandro Lazaric. Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence. Research report, October 2012. URL https://hal.inria.fr/hal-00747005.
- Bubeck et al.  Sébastien Bubeck, Rémi Munos, and Gilles Stoltz. Pure exploration in finitely-armed and continuous-armed bandits. Theoretical Computer Science, 412(19):1832 – 1852, 2011. ISSN 0304-3975. doi: https://doi.org/10.1016/j.tcs.2010.12.059. URL http://www.sciencedirect.com/science/article/pii/S030439751000767X. Algorithmic Learning Theory (ALT 2009).
- Karnin et al.  Zohar Karnin, Tomer Koren, and Oren Somekh. Almost optimal exploration in multi-armed bandits. In International Conference on Machine Learning, pages 1238–1246, 2013.
- Jamieson et al.  Kevin Jamieson, Matthew Malloy, Robert Nowak, and Sébastien Bubeck. lil’ucb: An optimal exploration algorithm for multi-armed bandits. In Conference on Learning Theory, pages 423–439, 2014.
- Kveton et al.  Branislav Kveton, Csaba Szepesvari, Zheng Wen, and Azin Ashkan. Cascading bandits: Learning to rank in the cascade model. In International Conference on Machine Learning, pages 767–776, 2015.
- Chen et al.  Wei Chen, Yajun Wang, and Yang Yuan. Combinatorial multi-armed bandit: General framework and applications. In International Conference on Machine Learning, pages 151–159, 2013.
- Farrell  R. H. Farrell. Asymptotic behavior of expected sample size in certain one sided tests. Ann. Math. Statist., 35(1):36–72, 03 1964. doi: 10.1214/aoms/1177703731. URL https://doi.org/10.1214/aoms/1177703731.
- Combes  Richard Combes. An extension of mcdiarmid’s inequality. arXiv preprint arXiv:1511.05240, 2015.
- Bickel et al.  P Bickel, P Diggle, S Feinberg, U Gather, I Olkin, and S Zeger. Springer Series in Statistics. Springer, 2009.
In this section, we’ll derive an error bound for the simplest “whole sequence” allocation strategy for cascading bandits.
Let one-hot vectors denote the reward vectors for the whole sequence of arms obtained in the $j$th play, extended with $0$s for the later arms. Let $\mu$ denote the vector of true means. We define the mean estimates as before; the form here is simply more amenable to the application of McDiarmid’s inequality. We want to bound the probability:
Define . It is easy to see that has bounded differences:
where the bound is governed by the smallest denominator amongst all the quotients. This denominator is binomially distributed, conditioned on the number of plays. Therefore, we can bound it w.h.p. around its mean using Hoeffding’s inequality:
Taking , where , we get:
where and .
Therefore, we can use McDiarmid’s inequality for differences bounded w.h.p. (Combes (2015)):
where , provided .
Let represent the event . We need to bound . Consider:
(We drop the subscript for clarity.)
Assuming (the other case can be analyzed similarly):
Let us look at the second term inside the outer expectation:
(Intuitively, is negatively correlated with . Therefore, if we are leaving a minimum , we’ll have a smaller than if we don’t have to leave any minimum . A formal proof is not provided due to lack of space.)
By sub-gaussianity of binomial random variables:
(since ). Combining the above with (5), we get:
By similar reasoning as before, we get:
Thus, these quantities are sub-gaussian. Using the inequality for the max of sub-gaussian random variables [Bickel et al., 2009, Chapter 2], we get:
Substituting in (4), we get:
For appropriate value of , and under some mild conditions on , we can show the sample complexity result provided in the text.