Let $\mathcal{L}$ be a large set of items to be ranked, for example a database of movies, news articles or search results. We consider a sequential version of the ranking problem where in each round the learner chooses an ordered list of $K$ distinct items from $\mathcal{L}$ to show the user. We assume the feedback comes in the form of clicks and the learner’s objective is to maximize the expected number of clicks over $T$ rounds. Our focus is on the case where $L = |\mathcal{L}|$ is large (perhaps millions) and $K$ is relatively small (fifty or so). There are two main challenges that arise in online ranking problems:
The number of rankings grows exponentially in $K$, which makes learning one parameter for each ranking a fruitless endeavour. Click models may be used to reduce the dimensionality of the learning problem, but balancing generality of the model with learnability is a serious challenge. The majority of previous works on online learning to rank have used unstructured models, which are not well suited to our setting where $L$ is large.
Most click models depend on an unknown attractiveness function that endows the item set with an order. This yields a model with at least $L$ parameters, which is prohibitively large in the applications we have in mind.
The first challenge is tackled by adapting the flexible click models introduced in [34, 23] to our setting. For the second we follow previous works on bandits with large action sets by assuming the attractiveness function can be written as a linear function of a relatively small number of features.
We make several contributions:
A new model for ranking problems with features is proposed that generalizes previous work [25, 35, 26] by relaxing the relatively restrictive assumptions on the probability that a user clicks on an item. The new model is strictly more robust than previous works focusing on regret analysis for large item sets.
We introduce a novel polynomial-time algorithm called RecurRank. The algorithm operates recursively over an increasingly fine set of partitions of $[K]$. Within each part the algorithm balances exploration and exploitation, subdividing the partition once it becomes sufficiently certain about the suboptimality of a subset of items.
A regret analysis shows that the cumulative regret of RecurRank is at most $R_T = O(K\sqrt{dT\log(LT)})$, where $K$ is the number of positions, $L$ is the number of items and $d$ is the dimension of the feature space. Even in the non-feature case where $L = d$ this improves on the state-of-the-art by a factor of $\sqrt{K}$.
Online learning to rank has seen an explosion of research in the last decade and there are multiple ways of measuring the performance of an algorithm. One view, which we take in this article, is that the clicks themselves should be maximized. An alternative is to assume an underlying relevance of all items in a ranking that is never directly observed, but can be inferred in some way from the observed clicks. In all generality this latter setting falls into the partial monitoring framework [28], but has been studied in specific ranking settings [7, and references therein]. See the article by Hofmann et al. [14] for more discussion on various objectives.
Maximizing clicks directly is a more straightforward objective because clicks are an observed quantity. Early work was empirically focused. For example, Li et al. [24] propose a modification of LinUCB for contextual ranking and Chen and Hofmann [8] modify the optimistic algorithms for linear bandits. These algorithms do not come with theoretical guarantees, however. There has recently been significant effort towards designing theoretically justified algorithms in settings of increasing complexity [20, 10, 35, 16, 21]. These works assume the user’s clicks follow a click model that connects properties of the shown ranking to the probability that a user clicks on an item placed in a given position. For example, in the document-based model it is assumed that the probability that the user clicks on a shown item only depends on the unknown attractiveness of that item and not its position in the ranking or the other items. Other simple models include the position-based, cascade and dependent click models. For a survey of click models see [9].
As usual, however, algorithms designed for specific models are brittle when the modeling assumptions are not met. Recent work has started to relax the strong assumptions by making the observation that in all of the above click models the probability of a user clicking on an item can be written as the product of the item’s inherent attractiveness and the probability that the user examines its position in the list. Zoghi et al. [34] use a click model where this decomposition is kept, but the assumption on how the examination probability of a position depends on the list is significantly relaxed. This is relaxed still further by Lattimore et al. [23], who avoid the factorization assumption by making assumptions directly on the click probabilities, but the existence of an attractiveness function remains.
The models mentioned in the last paragraph do not make assumptions on the attractiveness function, which means the regret depends badly on the size of $\mathcal{L}$. Certain simple click models have assumed the attractiveness function is a linear function of an item’s features and the resulting algorithms are suitable for large action sets. This has been done for the cascade model [25, 35] and the dependent-click model [26]. While these works are welcome, the strong assumptions leave a lingering doubt that the models may not be a good fit for practical problems.
We would be remiss not to mention that ranking has also been examined in an adversarial framework by Radlinski et al. [27]. These settings are most similar to the stochastic position-based and document-based models, but with the additional robustness bought by the adversarial framework. Another related setup is the rank-$1$ bandit problem in which the learner should choose just one of $L$ items to place in one of $K$ positions, for example the location of a billboard with the budget to place only one. These setups have a lot in common with the present one, but cannot be directly applied to ranking problems. For more details see [17, 18].
Finally, we note that some authors do not assume an ordering of the item set provided by an attractiveness function. The reader is referred to the work by Slivkins et al. [29] (which is a follow-up work to [27]) where the learner’s objective is to maximise the probability that a user clicks on any item, rather than rewarding multiple clicks. This model encourages diversity and provides an interesting alternative approach.
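Let $[n] = \{1, 2, \ldots, n\}$ denote the first $n$ natural numbers. Given a set $A$ the indicator function is $\mathbb{1}_A$. For vector $x \in \mathbb{R}^d$ and positive definite matrix $V \in \mathbb{R}^{d \times d}$ we let $\|x\|_V^2 = x^\top V x$. The Moore-Penrose pseudoinverse of a matrix $V$ is $V^\dagger$.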
Let $\mathcal{L} \subset \mathbb{R}^d$ be a finite set of items, $L = |\mathcal{L}|$, and $K > 0$ a natural number denoting the number of positions. A ranking is an injective function from $[K]$, the set of positions, to $\mathcal{L}$, and the set of all rankings is denoted by $\Sigma$. We use uppercase letters like $A$ to denote rankings in $\Sigma$ and lowercase letters $a, b$ to denote items in $\mathcal{L}$. The game proceeds over $T$ rounds. In each round $t \in [T]$ the learner chooses a ranking $A_t \in \Sigma$ and subsequently receives feedback in the form of a vector $C_t \in \{0,1\}^K$, where $C_{tk} = 1$ if the user clicked on the $k$th position. We assume that the conditional distribution of $C_t$ only depends on $A_t$, which means there exists an unknown function $v : \Sigma \times [K] \to [0,1]$ such that
$$\mathbb{P}(C_{tk} = 1 \mid A_t = A) = v(A, k)$$
for all $A \in \Sigma$ and $k \in [K]$.
We do not assume conditional independence of $(C_{tk})_{k=1}^{K}$.
In all generality the function $v$ has $K|\Sigma|$ parameters, which is usually impractically large to learn in any reasonable time-frame. A click model corresponds to making assumptions on $v$ that reduce the statistical complexity of the learning problem. We assume a factored model:
$$v(A, k) = \chi(A, k)\,\alpha(A(k)),$$
where $\chi : \Sigma \times [K] \to [0,1]$ is called the examination probability and $\alpha : \mathcal{L} \to [0,1]$ is the attractiveness function. We assume that attractiveness is linear in the action, which means there exists an unknown parameter $\theta_* \in \mathbb{R}^d$ such that
$$\alpha(a) = \langle a, \theta_* \rangle \quad \text{for all } a \in \mathcal{L}.$$
Let $a_k^*$ be the $k$-th best item sorted in order of decreasing attractiveness. Then let $A^* = (a_1^*, \ldots, a_K^*)$. In case of ties the choice of $A^*$ may not be unique. All of the results that follow hold for any choice.
The examination function satisfies three additional assumptions. The first says the examination probability of position $k$ only depends on the identity of the first $k-1$ items and not their order:
$$\chi(A, k) = \chi(A', k)$$
for any $A, A' \in \Sigma$ with $A([k-1]) = A'([k-1])$.
The second assumption is that the examination probability on any ranking is monotone decreasing in $k$:
$$\chi(A, k+1) \le \chi(A, k)$$
for all $A \in \Sigma$ and $k \in [K-1]$.
The third assumption is that the examination probability on ranking $A^*$ is minimal:
$$\chi(A, k) \ge \chi(A^*, k) =: \chi_k^*$$
for all $A \in \Sigma$ and $k \in [K]$.
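As an illustration of the factored model, the following minimal sketch simulates clicks when the examination probability depends on the position only (a position-based special case, which satisfies the three assumptions above whenever the position bias is non-increasing). The function names, feature vectors, parameter and bias values are all hypothetical, and clicks are drawn independently across positions purely for simplicity; the model in the text does not require that.

```python
import numpy as np

def click_probabilities(ranking, items, theta, pos_bias):
    """Click probability per position: examination times linear attractiveness."""
    attractiveness = items[ranking] @ theta      # <a, theta> for each shown item
    examination = pos_bias[: len(ranking)]       # depends on the position only
    return examination * attractiveness

def simulate_clicks(ranking, items, theta, pos_bias, rng):
    """Draw one 0/1 click vector (independent clicks: a simplifying assumption)."""
    probs = click_probabilities(ranking, items, theta, pos_bias)
    return (rng.random(len(probs)) < probs).astype(int)

rng = np.random.default_rng(0)
items = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])  # hypothetical item features
theta = np.array([0.9, 0.3])                            # hypothetical parameter
pos_bias = np.array([1.0, 0.6])                         # non-increasing position bias
ranking = np.array([0, 2])                              # item 0 first, item 2 second

probs = click_probabilities(ranking, items, theta, pos_bias)
clicks = simulate_clicks(ranking, items, theta, pos_bias, rng)
print(probs, clicks)
```

Swapping `pos_bias` for a function of the items shown earlier in the list would give other members of the same model family, for example a cascade-style examination probability.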
The learning objective
We measure the performance of our algorithm in terms of the cumulative regret, which is
Our assumptions do not imply that
In some articles  the assumptions are strengthened so that this holds while in others [20, 16, 21] it is simply assumed to hold directly. Here we take a more relaxed approach by proving a regret bound relative to any action that orders the items from most attractive to least, rather than relative to the optimal action.
Our algorithm makes use of an exploration ‘spanner’ that approximately minimises the covariance of the least-squares estimator. Given an arbitrary finite set of vectors $X = \{x_1, \ldots, x_n\} \subset \mathbb{R}^d$ and distribution $\pi : X \to [0,1]$ let $Q(\pi) = \sum_{x \in X} \pi(x)\, x x^\top$. By the Kiefer–Wolfowitz theorem [19] there exists a $\pi$ called the $G$-optimal design such that
$$\max_{x \in X} \|x\|^2_{Q(\pi)^\dagger} \le d.$$
As explained in [22, Chap. 21], John’s theorem implies that $\pi$ may be chosen so that $|\{x : \pi(x) > 0\}| \le d(d+1)/2$. Given a finite set of vectors $X$ we let $\operatorname{Gopt}(X)$ denote a $G$-optimal design distribution. Methods from experimental design have been used for pure exploration in linear bandits [30, 33] and also finite-armed linear bandits [22, Chap. 22] as well as adversarial linear bandits [6].
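One common way to approximate such a design numerically is a Frank–Wolfe iteration on the design weights. The sketch below is an illustration under stated assumptions, not the procedure used in this paper: the name `g_optimal_design`, the tolerance and the iteration cap are hypothetical, and the rows of `X` are assumed to span the whole space so that the design matrix is invertible from the uniform start.

```python
import numpy as np

def g_optimal_design(X, tol=0.01, max_iter=10_000):
    """Frank-Wolfe iterations for an approximate G-optimal design on the rows of X."""
    n, d = X.shape
    pi = np.full(n, 1.0 / n)                         # start from the uniform design
    for _ in range(max_iter):
        Q = X.T @ (pi[:, None] * X)                  # sum_x pi(x) x x^T
        Q_pinv = np.linalg.pinv(Q)
        g = np.einsum("ij,jk,ik->i", X, Q_pinv, X)   # squared norm of each row under Q^+
        i = int(np.argmax(g))
        if g[i] <= (1.0 + tol) * d:                  # within tolerance of the bound d
            break
        step = (g[i] / d - 1.0) / (g[i] - 1.0)       # exact line-search step
        pi *= 1.0 - step
        pi[i] += step
    return pi

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))                         # hypothetical feature vectors
pi = g_optimal_design(X)
Q = X.T @ (pi[:, None] * X)
worst = np.einsum("ij,jk,ik->i", X, np.linalg.pinv(Q), X).max()
print(worst)
```

The returned weights are only approximately optimal and may be supported on more points than the bound obtained from John’s theorem; a rounding or support-reduction step would be needed to recover that property.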
As the name suggests, RecurRank is a recursive algorithm. The full pseudocode is given in Algorithm 1. Here we provide a slightly informal description, which is followed by an illustration. Each instantiation is called with three arguments:
A phase number ;
An ordered tuple of items ;
A tuple of positions .
The algorithm is responsible for ranking the items in into positions . Note that in all instantiations the parameters satisfy . Furthermore, is only possible when . The algorithm operates in three steps, only actually choosing actions in the second step.
Step 1: Initialization
Before placing any items the algorithm finds a $G$-optimal design that is used for optimizing exploration. Then for each action let
where and .
Step 2: Ranking
The algorithm then acts deterministically, placing each item into the first position, , times. The remaining positions in are filled using the first items in . The results from the first position are stored and used to compute a least-squares estimator that estimates a multiple of the attractiveness for each of the items in up to accuracy . Precisely, the algorithm estimates where is the position examination probability of position . In fact, we will prove that with high probability.
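The estimation step can be pictured with an ordinary least-squares fit on the features of the items shown in the first position of the block. The sketch below is only an illustration: the function name, item features, show counts and ground-truth values are hypothetical, and noiseless mean responses replace real clicks so that the recovered vector can be compared with the scaled parameter directly.

```python
import numpy as np

def least_squares_estimate(features, clicks):
    """Least-squares fit: pseudoinverse of the Gram matrix times the feature-click sum."""
    V = features.T @ features
    return np.linalg.pinv(V) @ (features.T @ clicks)

# Hypothetical exploration log for the first position of one block.
items = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
counts = np.array([3, 3, 2])                     # times each item was shown
features = np.repeat(items, counts, axis=0)      # one row per round
theta, chi = np.array([0.8, 0.4]), 0.5           # hypothetical ground truth
mean_clicks = chi * (features @ theta)           # noiseless responses, for illustration
theta_hat = least_squares_estimate(features, mean_clicks)
print(theta_hat)
```

With real 0/1 clicks the estimate concentrates around the same scaled vector, which is why only a multiple of the attractiveness, and not the attractiveness itself, is identified from first-position data.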
Step 3: Recursion
Once the previous step completes the subroutine eliminates items that are unlikely to be part of the optimal ranking and creates disjoint tuples of denoted by each matched with a corresponding subset of denoted by . Note the elimination only occurs when is larger than . It then instantiates copies of itself with inputs with . The algorithm is initialized with arguments and and where the order of is arbitrary. The precise details about how the partitions are created are provided in Algorithm 1. Intuitively, the set of items is partitioned when the algorithm can be confident that items in lower partitions are less attractive than items in higher partitions. The order of each when the new partition is created is chosen according to the attractiveness estimates of the items in it.
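The intuition behind the partitioning can be sketched as follows: sort the items by their estimated attractiveness and start a new block wherever two consecutive estimates are separated by at least a confidence gap. This is a simplified stand-in rather than the rule of Algorithm 1; the function name `split_by_gap` and the threshold are hypothetical, elimination and the matching of blocks to positions are omitted, and the input is assumed to be non-empty.

```python
import numpy as np

def split_by_gap(estimates, gap):
    """Group item indices, best first, splitting where consecutive estimates differ by >= gap."""
    estimates = np.asarray(estimates, dtype=float)
    order = np.argsort(-estimates, kind="stable")    # decreasing estimated attractiveness
    blocks = [[int(order[0])]]
    for prev, cur in zip(order[:-1], order[1:]):
        if estimates[prev] - estimates[cur] >= gap:  # confident separation: new block
            blocks.append([])
        blocks[-1].append(int(cur))
    return blocks

blocks = split_by_gap([0.9, 0.85, 0.5, 0.45, 0.1], gap=0.2)
print(blocks)
```

Items inside one block remain in estimate order, which mirrors the remark that the order of each new tuple follows the attractiveness estimates.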
Suppose and the current partition on positions has blocks , , . The corresponding item set for each block is , , , respectively. Suppose these items are ordered by their estimates in the last phase. For each block, the algorithm RecurRank only uses the block’s first position to explore (denoted as a dashed line) and uses the other positions to exploit. For example, for the first block, RecurRank will select one of the partial lists with different budgets; for the last block, RecurRank will select one of the partial lists where is any item available for the last partition besides . RecurRank selects a partial list for each block and then pieces them together into a single list covering all positions.
After some time, RecurRank might finish exploring on the second block. Then it computes the estimates and constructs finer partitions with the phase number increased by $1$. Next it will run a new RecurRank on the newly formed blocks and continue running on the old blocks . Blocks active at the same time step may have different start and end times, and their phase numbers can differ.
The most expensive component is computing the $G$-optimal design. This is a convex optimization problem and has been studied extensively [5, §7.5] and [12, 31]. It is not necessary to solve the optimization problem exactly. Suppose instead we find a distribution on with support at most and for which
then our bounds continue to hold with replaced by . Such approximations are generally easy to find. For example, the distribution may be chosen to be uniform on a volumetric spanner of the underlying set of size . See Appendix B for a summary on volumetric spanners. Hazan and Karnin [13] provide a somewhat impractical algorithm for computing a spanner of size in time polynomial in and . They have also given a randomized algorithm that returns a volumetric spanner of size with an expected running time of . For the remaining parts of the algorithm, the least-squares estimation is at most . The elimination and partitioning run in . Note these computations happen only once for each instantiation. The update for each partition at each time is . The total running time is .
4 Regret Analysis
Our main theorem bounds the regret of Algorithm 1.
There exists a universal constant such that the regret bound for Algorithm 1 with satisfies
Let be the number of calls to RecurRank with phase number . Hence each corresponds to a call of RecurRank with phase number and the arguments are denoted by and . Abbreviate for the first position of , for the number of positions and . We also let and assume that the calls are ordered so that
The reader is reminded that is the examination probability of the th position under the optimal list. Let be the shorthand for the optimal examination probability of the first position in call . We let be the least-squares estimator computed in Eq. 6 in Algorithm 1. The maximum phase number during the entire operation of the algorithm is .
Let be the failure event that there exists an , and such that
or there exists an , and such that .
On the event it holds for any and and positions that .
Let hold. Since the result is trivial for . Suppose , the lemma holds for all and there exists a pair satisfying . Let be the parent of , which satisfies . Since it follows from Assumption 2 and the definition of that and hence
where we used the definition of . Given any with we have
The first and fifth inequalities are because does not hold. The third inequality is due to induction assumption on phase . Hence by the definition of the algorithm the items and will be split into different partitions by the end of call , which is a contradiction. ∎
On the event it holds for any and that .
We use the same idea as the previous lemma. Let hold. The result is trivial for . Suppose , the lemma holds for and there exists an satisfying . By the definition of the algorithm and does not hold, and hence
For any with it holds that
Hence there exist at least items for which . But if this was true then by the definition of the algorithm item would have been eliminated by the end of call , which is a contradiction. ∎
Suppose that in its th call RecurRank places item in position . Then, provided holds,
Suppose that in its th call RecurRank places item in position . Then provided holds, .
The result is immediate for . From now on assume that and let be the parent of . Since does not hold, . It cannot be that for all with , since this would mean that there are items that precede item and hence item would not be put in position by the algorithm. Hence there exists an with such that and
which completes the proof. ∎
Proof of Theorem 1.
The first step is to decompose the regret using the failure event:
From now on we assume that holds and bound the term inside the expectation. Given and let be the set of rounds when algorithm is active. Then
where is the regret incurred during call :
This quantity is further decomposed into the first position in , which is used for exploration, and the remaining positions:
Each of these terms is bounded separately. For the first term we have
where the first equality is the definition of , the second is the definition of . The third inequality is true because event ensures that
where the second inequality follows from Assumption 3 and the third inequality follows from Assumption 2 on ranking . The inequality in Eq. 9 follows from Lemma 5 and the one after it from the definition of . Putting things together,
where we used that . To bound note that, on the one hand, (this will be useful when is large), while on the other hand, by the definition of the algorithm and the fact that the $G$-optimal design is supported on at most points we have
We now split the sum in (10) into two. For to be chosen later,
The result is completed by optimising . ∎
We construct environments using the cascade click model (CM) and the position-based click model (PBM) with items in dimension to be displayed in positions. We first randomly draw item vectors and weight vector in dimension with each entry a standard Gaussian variable, then normalize, add one more dimension with constant , and divide by . The transformation is as follows:
This transformation on both the item vector and weight vector is to guarantee the attractiveness of each item lies in . The position bias for PBM is also randomly determined: first we randomly select numbers from , then rank them in decreasing order and divide them by their maximum. The evolution of the regret as a function of time is shown in Fig. 1(a)(b). The regret at the end of the rounds is given in the first two rows of Table 1, while total running times (wall-clock time) are shown in Table 2. The experiments are run on a Dell PowerEdge R920 with four Intel Xeon E7-4830 v2 CPUs (ten-core, 2.20GHz) and 512GB of memory.
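The construction of a synthetic environment can be sketched as below. The constants of the transformation did not survive in the text above, so the sketch assumes an appended constant of 1 and a division by $\sqrt{2}$, which makes every inner product equal to (cosine + 1)/2 and hence places it in $[0, 1]$. The helper name `lift`, the problem sizes and the uniform draw for the position bias are likewise assumptions.

```python
import numpy as np

def lift(x):
    """Normalise each row, append a constant 1, divide by sqrt(2) (assumed constants)."""
    x = np.atleast_2d(x)
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    ones = np.ones((unit.shape[0], 1))
    return np.hstack([unit, ones]) / np.sqrt(2.0)

rng = np.random.default_rng(0)
num_items, raw_dim, num_positions = 1000, 4, 5       # illustrative sizes only
items = lift(rng.normal(size=(num_items, raw_dim)))  # Gaussian entries, then lifted
theta = lift(rng.normal(size=(1, raw_dim)))[0]
attractiveness = items @ theta                        # equals (cosine + 1) / 2

raw_bias = rng.random(num_positions)                  # assumed uniform on [0, 1)
pos_bias = np.sort(raw_bias)[::-1] / raw_bias.max()   # decreasing, largest value is 1
print(attractiveness.min(), attractiveness.max(), pos_bias)
```

Because both vectors have norm one after the lift, the same transformation applied to any other random draw keeps the attractiveness a valid probability.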
CascadeLinUCB is best in CM but worst in PBM because of its modelling bias. TopRank takes much longer to converge than either CascadeLinUCB or RecurRank since it neither exploits the specifics of the click model nor uses the linear structure.
We use the MovieLens dataset (https://grouplens.org/datasets/movielens/20m/), which contains 20 million ratings for movies by users. We extract movies with the most ratings and users who rate the most, and randomly split the user set into two parts, and with and . We then use the rating matrix of users in to derive feature vectors with
for all movies using singular-value decomposition (SVD). The resulting feature vectors are also processed using (11). The remaining rating matrix by is used as the reward matrix. At each time , each algorithm selects a list of items and receives a reward for each item based on the rating of a randomly selected user . Performance is measured by the average reward, which is the ratio of the cumulative reward to the number of rounds. The result over time is shown in Fig. 1(c). As can be seen, RecurRank collects more reward and learns faster than the other two algorithms. Of these two algorithms, the performance of CascadeLinUCB saturates: this is due to its incorrect bias.
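Deriving item features from a rating matrix by truncated SVD can be sketched as follows. The exact recipe used in the experiments is not spelled out here, so scaling the right singular vectors by their singular values is an assumption, as are the function name and the small synthetic rating matrix standing in for the real data.

```python
import numpy as np

def svd_item_features(ratings, dim):
    """Return one dim-dimensional feature vector per column (item) of a users x items matrix."""
    _, singular_values, vt = np.linalg.svd(ratings, full_matrices=False)
    return vt[:dim].T * singular_values[:dim]       # scaled right singular vectors

rng = np.random.default_rng(0)
ratings = rng.integers(0, 6, size=(40, 12)).astype(float)  # synthetic users x movies
features = svd_item_features(ratings, dim=5)
full = svd_item_features(ratings, dim=12)           # no truncation, for a sanity check
print(features.shape)
```

Keeping all singular directions reproduces the item Gram matrix of the ratings exactly, so truncating to a smaller `dim` is a low-rank approximation of item similarity.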
We introduced a new setting for online learning-to-rank that is better adapted to practical problems when the number of items to be ranked is large. For this setting, we designed a new algorithm and analyzed its regret.
Our assumptions are most closely related to the work by Lattimore et al. [23] and Zoghi et al. [34]. The latter work also assumes a factored model where the probability of clicking on an item factors into an examination probability and an attractiveness function. Neither of these works makes use of features to model the attractiveness of items: they are a special case of our model when we set the features of items to be orthogonal to each other (in particular, ). Our assumptions on the examination probability function are weaker than those by Zoghi et al. [34]. Despite this, our regret upper bound is better by a factor of (when setting ) and the analysis is also simpler. The paper by Lattimore et al. [23] does not assume a factored model, but instead places assumptions directly on . Their assumptions are weaker in the sense that they do not assume the probability of clicking on position only depends on the identities of the items in positions and the attractiveness of the item in position . On the other hand, they do assume a specific behaviour of the function under pairwise exchanges that is not required by our analysis. It is unclear which set of these assumptions is preferable.
In the orthogonal case where the lower bound in [23] provides an example where the regret is at least . For , the standard techniques for proving lower bounds for linear bandits can be used to prove the regret is at least , which except for logarithmic terms means our upper bound is suboptimal by a factor of at most . We are not sure whether either the lower bound or the upper bound is tight.
The new algorithm only uses data from the first position in each partition for estimating the quality of the items. This seems suboptimal, but is hard to avoid without making additional assumptions. Nevertheless, we believe a small improvement should be possible here. Note the situation is not as bad as it may seem. As partitions are created RecurRank starts using more and more of the data available. Another natural question is how to deal with the situation when the set of available items is changing. In practice this happens in many applications, either because the features are changing or because new items really are being added or removed. Another interesting direction is to use weighted least-squares estimators to exploit the low variance when the examination probability and attractiveness are small. Additionally one can use a generalized linear model instead of the linear model to model the attractiveness function, which may be analyzed using techniques developed by Filippi et al. [11] and Jun et al. [15]. Finally, it could be interesting to generalize to the setting where item vectors are sparse (see [2] and [22, Chap. 23]).
[1] Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári. Improved algorithms for linear stochastic bandits. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, NIPS, pages 2312–2320. Curran Associates, Inc., 2011.
[2] Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári. Online-to-confidence-set conversions and application to sparse stochastic bandits. In N. D. Lawrence and M. Girolami, editors, Proceedings of the 15th International Conference on Artificial Intelligence and Statistics, volume 22 of Proceedings of Machine Learning Research, pages 1–9, La Palma, Canary Islands, 21–23 Apr 2012. PMLR.
[3] N. Abe and P. M. Long. Associative reinforcement learning using linear probabilistic concepts. In Proceedings of the 16th International Conference on Machine Learning, ICML, pages 3–11, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc.
[4] P. Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.
[5] S. Boyd and L. Vandenberghe. Convex optimization. Cambridge University Press, 2004.
[6] S. Bubeck, N. Cesa-Bianchi, and S. Kakade. Towards minimax policies for online linear optimization with bandit feedback. In Annual Conference on Learning Theory, volume 23, pages 41–1. Microtome, 2012.
[7] S. Chaudhuri. Learning to Rank: Online Learning, Statistical Theory and Applications. PhD thesis, 2016.
[8] Y. Chen and K. Hofmann. Online learning to rank: Absolute vs. relative. In Proceedings of the 24th International Conference on World Wide Web, pages 19–20. ACM, 2015.
[9] A. Chuklin, I. Markov, and M. de Rijke. Click Models for Web Search. Morgan & Claypool Publishers, 2015.
[10] R. Combes, S. Magureanu, A. Proutiere, and C. Laroche. Learning to rank: Regret lower bounds and efficient algorithms. In Proceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pages 231–244. ACM, 2015. ISBN 978-1-4503-3486-0.
[11] S. Filippi, O. Cappé, A. Garivier, and C. Szepesvári. Parametric bandits: The generalized linear case. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, NIPS, pages 586–594. Curran Associates, Inc., 2010.
[12] M. Grötschel, L. Lovász, and A. Schrijver. Geometric algorithms and combinatorial optimization, volume 2. Springer Science & Business Media, 2012.
[13] E. Hazan and Z. Karnin. Volumetric spanners: an efficient exploration basis for learning. The Journal of Machine Learning Research, 17(1):4062–4095, 2016.
[14] K. Hofmann, S. Whiteson, and M. de Rijke. A probabilistic method for inferring preferences from clicks. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, pages 249–258. ACM, 2011.
[15] K. Jun, A. Bhargava, R. Nowak, and R. Willett. Scalable generalized linear bandits: Online computation and hashing. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 99–109. Curran Associates, Inc., 2017.
[16] S. Katariya, B. Kveton, C. Szepesvári, and Z. Wen. DCM bandits: Learning to rank with multiple clicks. In Proceedings of the 33rd International Conference on Machine Learning, pages 1215–1224, 2016.
[17] S. Katariya, B. Kveton, C. Szepesvári, C. Vernade, and Z. Wen. Bernoulli rank-1 bandits for click feedback. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, 2017.
[18] S. Katariya, B. Kveton, C. Szepesvári, C. Vernade, and Z. Wen. Stochastic rank-1 bandits. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 2017.
[19] J. Kiefer and J. Wolfowitz. The equivalence of two extremum problems. Canadian Journal of Mathematics, 12(5):363–365, 1960.
[20] B. Kveton, C. Szepesvári, Z. Wen, and A. Ashkan. Cascading bandits: Learning to rank in the cascade model. In Proceedings of the 32nd International Conference on Machine Learning, volume 37, pages 767–776. JMLR.org, 2015.
[21] P. Lagree, C. Vernade, and O. Cappé. Multiple-play bandits in the position-based model. In Advances in Neural Information Processing Systems 29, NIPS, pages 1597–1605. Curran Associates Inc., 2016.
[22] T. Lattimore and C. Szepesvári. Bandit Algorithms. Preprint, 2018.
[23] T. Lattimore, B. Kveton, S. Li, and C. Szepesvári. TopRank: A practical algorithm for online stochastic ranking. In Proceedings of the 31st Conference on Neural Information Processing Systems, 2018.
[24] L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 661–670. ACM, 2010.
[25] S. Li, B. Wang, S. Zhang, and W. Chen. Contextual combinatorial cascading bandits. In Proceedings of the 33rd International Conference on Machine Learning, pages 1245–1253, 2016.
[26] W. Liu, S. Li, and S. Zhang. Contextual dependent click bandit algorithm for web recommendation. In International Computing and Combinatorics Conference, pages 39–50. Springer, 2018.
[27] F. Radlinski, R. Kleinberg, and T. Joachims. Learning diverse rankings with multi-armed bandits. In Proceedings of the 25th International Conference on Machine Learning, pages 784–791. ACM, 2008.
[28] A. Rustichini. Minimizing regret: The general case. Games and Economic Behavior, 29(1):224–243, 1999.
[29] A. Slivkins, F. Radlinski, and S. Gollapudi. Ranked bandits in metric spaces: learning diverse rankings over large document collections. Journal of Machine Learning Research, 14(Feb):399–436, 2013.
[30] M. Soare, A. Lazaric, and R. Munos. Best-arm identification in linear bandits. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, NIPS, pages 828–836. Curran Associates, Inc., 2014.
[31] M. J. Todd. Minimum-volume ellipsoids: Theory and algorithms. SIAM, 2016.
[32] M. Valko, R. Munos, B. Kveton, and T. Kocák. Spectral bandits for smooth graph functions. In E. P. Xing and T. Jebara, editors, Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 46–54, Beijing, China, 22–24 Jun 2014. PMLR.
[33] L. Xu, J. Honda, and M. Sugiyama. Fully adaptive algorithm for pure exploration in linear bandits. arXiv preprint arXiv:1710.05552, 2017.
[34] M. Zoghi, T. Tunys, M. Ghavamzadeh, B. Kveton, C. Szepesvári, and Z. Wen. Online learning to rank in stochastic click models. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of PMLR, pages 4199–4208, 2017.
[35] S. Zong, H. Ni, K. Sung, R. N. Ke, Z. Wen, and B. Kveton. Cascading bandits for large-scale recommendation problems. In Proceedings of the 32nd Conference on Uncertainty in Artificial Intelligence, UAI, 2016.
Appendix A Proof of Lemma 1
In what follows, we add the index to any symbol used in the algorithm to indicate the value that it takes in the call. For example, denotes the data multiset collected in the call, is the value computed in Eq. 5, etc.
Fix and let be the failure event that there exists an and such that
Let be the event that for any , the examination probability on the first position of the call is . For the argument that follows, let us assume that holds.
where is a conditionally -subgaussian sequence.