Approximate Ranking from Pairwise Comparisons

01/04/2018 ∙ by Reinhard Heckel, et al.

A common problem in machine learning is to rank a set of n items based on pairwise comparisons. Here ranking refers to partitioning the items into sets of pre-specified sizes according to their scores, which includes identification of the top-k items as the most prominent special case. The score of a given item is defined as the probability that it beats a randomly chosen other item. Finding an exact ranking typically requires a prohibitively large number of comparisons, but in practice, approximate rankings are often adequate. Accordingly, we study the problem of finding approximate rankings from pairwise comparisons. We analyze an active ranking algorithm that counts the number of comparisons won, and decides whether to stop or which pair of items to compare next, based on confidence intervals computed from the data collected in previous steps. We show that this algorithm succeeds in recovering approximate rankings using a number of comparisons that is close to optimal up to logarithmic factors. We also present numerical results, showing that in practice, approximation can drastically reduce the number of comparisons required to estimate a ranking.


1 Introduction

The problem of ranking a collection of items from noisy pairwise comparisons arises in a wide range of applications, including recommender systems for rating movies, books, or other consumer items [piech_tuned_2013, aggarwal_recommender_2016]; peer grading for ranking students in massive open online courses [shah2013case]; ranking players in tournaments; search engines; quantifying people’s perception of cities from pairwise comparison of street views of the cities [salesses_collaborative_2013]; and online sequential survey sampling for assessing the popularity of proposals in a population of voters [salganik_wiki_2015].

In each of these applications, the aim is to obtain a statistically sound ranking from as few comparisons as possible. In this work, we investigate the power of adaptively selecting which pairs to compare based on the outcomes of previous comparisons, a setting we call active or adaptive ranking. In contrast, passive or non-adaptive ranking approaches fix the comparisons to make before any data is collected. It is well understood that one can typically learn a ranking using fewer adaptively chosen comparisons than one would need when passively choosing comparisons [heckel_active_2016]. However, for moderately large or large collections of items–such as the ones that appear in most of the applications mentioned above–or for collections with many items of “similar quality” (to be made rigorous below), learning the exact ground-truth ranking may still require prohibitively many comparisons.

Motivated by these large-scale ranking problems, this work studies the problem of adaptively obtaining approximate rankings. We demonstrate that learning an approximate ranking may still be statistically tractable even when recovering the exact ranking is not. Formally, we consider a collection of $n$ items $[n] := \{1, \ldots, n\}$, and make comparison queries between pairs of items $i, j \in [n]$. We assume that the responses to those queries are stochastic, where the probability that item $i$ “beats” item $j$ is given by $M_{ij}$. We assume that the outcomes of all queries are statistically independent, and that either item $i$ or item $j$ “wins” the comparison with probability $1$, which means that $M_{ij} + M_{ji} = 1$ for all $i \neq j$. Our aim is to rank the items in terms of their Borda scores [de1781memoire], defined as the probability that item $i$ defeats an item chosen uniformly at random from $[n] \setminus \{i\}$:

$$\tau_i := \frac{1}{n-1} \sum_{j \in [n] \setminus \{i\}} M_{ij}. \qquad (1)$$
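As a small illustration of the score definition above, the following sketch computes the score of every item from a matrix of pairwise win probabilities and reads off a top-k set. The function names and the example matrix are hypothetical and chosen only for illustration; the only thing taken from the text is that the score of an item is its average probability of beating each of the other items.

```python
def borda_scores(M):
    """Score of item i: average probability that i beats another item.

    M is a square list of lists; M[i][j] is the probability that item i
    beats item j. Diagonal entries are ignored.
    """
    n = len(M)
    return [
        sum(M[i][j] for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]


def top_k_items(M, k):
    """Indices of the k items with the largest scores (ties broken by index)."""
    scores = borda_scores(M)
    order = sorted(range(len(M)), key=lambda i: (-scores[i], i))
    return order[:k]


# Hypothetical 3-item example; off-diagonal entries satisfy M[i][j] + M[j][i] = 1.
M_example = [
    [0.5, 0.8, 0.9],
    [0.2, 0.5, 0.6],
    [0.1, 0.4, 0.5],
]
scores_example = borda_scores(M_example)
```

Because every pair contributes a total probability of one, the scores of any valid comparison matrix sum to half the number of items, which gives a quick sanity check on an input matrix.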

Apart from their intuitive appeal, the Borda scores generalize the orderings considered in several popular comparison models, including the classical, parametric Bradley-Terry-Luce (BTL) [bradley_rank_1952, luce_individual_1959] and Thurstone [thurstone_law_1927] models, as well as the non-parametric Strong Stochastic Transitivity (SST) model [tversky_substitutability_1969]. In all of these models, the intrinsic model-defined ordering coincides with that given by the scores . Rather than learning the scores exactly, or ranking items according to their exact score, this paper considers the problem of approximately partitioning the items into sets of pre-specified sizes according to their respective scores. This includes finding a total ordering that is approximately correct, and the task of finding a set of items that is close to the top- items. For simplicity, we exclusively focus on the latter problem in this paper.

Contributions:

Our main contribution is to present and analyze a novel active ranking algorithm for estimating an approximate ranking of the items. The algorithm is based on adaptively estimating the scores to within sufficient resolution to deduce a ranking. We establish that with high probability, the algorithm returns a ranking which satisfies the desired approximation guarantee, and attains a distribution-dependent sample complexity which can be parameterized in terms of the scores . We then prove distribution-dependent lower bounds that match our upper bound up to logarithmic factors for many problem instances. Our analysis leverages the fact that ranking in terms of the scores is related to a particular class of multi-armed bandit problems [even-dar_action_2006, bubeck_multiple_2013, urvoy_generic_2013]; this same connection has been observed in the context of finding the top item [yuekarmed2012, jamieson_sparse_2015, urvoy_generic_2013]. Since to the best of our knowledge, the approximate subset selection problem has not been studied in the bandit literature, a version of our algorithm and results are also new when specialized to the multi-armed bandit problem. Finally, we examine pathological distributions for which the complexity of approximate ranking (or approximate subset selection in the multi-armed bandit setup) seems to diverge from what one would expect. In these cases, we show that careful randomized guessing strategies can yield significant improvements in sample complexity.

Motivation for Approximate Rankings:

In order to understand how approximation can drastically reduce the number of comparisons required, let us consider a motivating example. Suppose that we are interested in identifying the top- items, and suppose for simplicity that the items are ordered, i.e., (of course this ordering is not known a priori). The paper [heckel_active_2016] shows that in the active setting, the number of comparisons necessary and sufficient for finding the top items is of the order

(2)

up to a logarithmic factor. Thus, the sample complexity depends on the distribution of the scores; see Figure 1 for how these scores are distributed in some applications. In practice, the differences between the scores often obey the scaling on average (see Figure 1). To identify the top- items exactly, the aforementioned optimal active scheme would require on the order of comparisons, and a minimax-optimal passive ranking scheme would even require on the order of comparisons [shah_simple_2015].

Theorem 1 in this paper shows that if one does not need to extract the exact top- items, but is instead willing to tolerate a few–say, many–mistakes, then the number of comparisons shrinks drastically, specifically by a factor proportional to . In particular, if we want to find a set of of the items () such that all but of the elements of are among the true top of items (, ), then the overall number of comparisons required would be on the order of . Thus, relaxing to approximate ranking can yield speedups that are linear and quadratic in the number of items, compared to optimal exact active and exact passive schemes. Moreover, our algorithm (Algorithm 1 below) that obtains this factor-of- speedup does not require a priori information about the spacings of the , but instead learns a near-optimal measurement allocation for these scores adaptively.

Related works:

There is a vast literature on ranking and estimation from pairwise comparison data; however, most work focuses on finding exact rankings. There are a number of papers [hunter2004mm, negahban_iterative_2012, hajek2014minimax, shah_estimation_2015, shah_simple_2015] devoted to settings in which pairs to be compared are chosen a priori, whereas here we assume that the pairs may be chosen in an active manner. Moreover, several works impose restrictions on the pairwise comparison probabilities, e.g., by assuming the Bradley-Terry-Luce (BTL) parametric model (discussed below) [szorenyi_online_2015, hunter2004mm, negahban_iterative_2012, hajek2014minimax, shah_estimation_2015]. The work [eriksson_learning_2013] considers the problem of finding the very top items using graph-based techniques, whereas [busa-fekete_top-k_2013] considers the problem of finding the top-k items. The work [ailon_active_2011] considers the problem of linearly ordering the items so as to disagree in as few pairwise preference labels as possible. Our work is also related to the literature on multi-armed bandits, as discussed later in the paper.

Figure 1: Estimated scores from three different domains: (a) Scores of the Association of Tennis Professionals (ATP) world tour, computed from the games played within a 52 week interval as the fraction of games won by the total number of games played. (b) Comparisons of the proposals in the PlaNYC survey, as reported in the paper [salganik_wiki_2015] (only scores of items (proposals) that were rated at least times are depicted). (c) Scores from comparisons of Gif’s according to whether they display a certain emotion (see http://www.gif.gf/).

2 Problem formulation and background

In this section we formally state the approximate ranking problem considered in this paper.

2.1 Pairwise probabilities and scores

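Given a collection of items $[n] := \{1, \ldots, n\}$, let us denote by $M_{ij}$ the (unknown) probability that item $i$ wins a comparison with item $j$. We let $X_{ij}$ denote a Bernoulli random variable taking a value of $1$ if $i$ beats $j$ and $0$ otherwise, so that $\mathbb{P}[X_{ij} = 1] = M_{ij}$. Moreover, we require that any comparison results in a winner, so that $M_{ij} + M_{ji} = 1$. For each item $i \in [n]$, recall that the score $\tau_i$ defined in equation (1) corresponds to the probability that item $i$ wins a comparison with an item chosen uniformly at random from $[n] \setminus \{i\}$. We let $\pi$ denote any (possibly non-unique) permutation such that $\tau_{\pi(1)} \geq \tau_{\pi(2)} \geq \ldots \geq \tau_{\pi(n)}$. In words, $\pi(i)$ denotes the item with the $i$-th largest score. Ranking corresponds to partitioning the items into disjoint sets according to their scores. For simplicity, in this paper we focus on the ranking problem of splitting $[n]$ into the top-$k$ items $\mathcal{S}_1 = \{\pi(1), \ldots, \pi(k)\}$ and its complement $\mathcal{S}_2 = [n] \setminus \mathcal{S}_1$. In this work, our goal is to find an approximation to $\mathcal{S}_1$ and $\mathcal{S}_2$ in terms of the Hamming distance between two sets $A, B$, defined as $D_H(A, B) = |(A \cup B) \setminus (A \cap B)|$. Specifically, we say the ranking $\hat{\mathcal{S}}_1, \hat{\mathcal{S}}_2$ with $|\hat{\mathcal{S}}_1| = k$ is $h$-Hamming-accurate if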

For future reference, we define

corresponding to the set of pairwise comparison matrices with pairwise comparison probabilities lower bounded by .
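The Hamming distance between two item sets used in this section can be sketched as the size of their symmetric difference. The helper names below are hypothetical. Since an estimated top set and the true top set have the same size, every wrongly included item forces one wrongly excluded item, so the distance is twice the number of mistakes.

```python
def hamming_distance(set_a, set_b):
    """Number of items that belong to exactly one of the two sets."""
    return len(set(set_a) ^ set(set_b))


def num_mistakes(estimated_top, true_top):
    """Items placed in the estimated top set that are not truly in the top set."""
    return len(set(estimated_top) - set(true_top))


# Hypothetical example: one item of the estimated top-3 set is wrong.
true_top = {1, 2, 3}
estimated_top = {2, 3, 4}
distance = hamming_distance(estimated_top, true_top)
mistakes = num_mistakes(estimated_top, true_top)
```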

2.2 The active approximate ranking problem

An active ranking algorithm acts on a pairwise comparison model . The goal is to obtain an approximate partition of the items into disjoint sets from active comparisons. At each time instant, the algorithm can compare two arbitrary items, which the algorithm may select based on the outcomes of previous comparisons. When comparing and , the algorithm obtains an independent draw of the random variable in response. The algorithm terminates based on an associated stopping rule, and returns an approximate ranking . For a given tolerance parameter , we say a ranking algorithm is -accurate for a pairwise comparison matrix , if the ranking returned is -Hamming accurate with probability at least . Moreover, we say that is uniformly -accurate over a given set of pairwise comparison models if it is -accurate for each .

2.3 Relation to multi-armed bandits

The exact version of the ranking problem considered in this paper is related to the subset selection problem in the bandit literature [kalyanakrishnan_pac_2012]. Specifically, a multi-armed bandit model consists of arms, each a random variable with unknown distribution. The subset selection problem is concerned with identifying the top arms (according to the means) by taking independent draws of the random variables. Various works [yue_beat_2011, yuekarmed2012, urvoy_generic_2013, jamieson_sparse_2015] have observed that, by definition of the score , comparing item to an item chosen uniformly at random from can be modeled as drawing a Bernoulli random variable with mean . Our subsequent analysis relies on this relation.
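The reduction to a bandit arm described above can be sketched as follows: one "pull" of arm i compares item i with an opponent drawn uniformly at random from the other items, and the outcome is a Bernoulli draw whose mean is the score of item i. The function names and the example matrix are hypothetical, and the sketch assumes the comparison probabilities are available as a matrix purely so that outcomes can be simulated.

```python
import random


def compare_to_random_opponent(M, i, rng):
    """One bandit 'pull' of arm i: returns 1 if item i wins, else 0."""
    n = len(M)
    opponent = rng.choice([j for j in range(n) if j != i])
    return 1 if rng.random() < M[i][opponent] else 0


def estimate_score(M, i, num_draws, rng):
    """Empirical mean of repeated pulls; an unbiased estimate of the score."""
    wins = sum(compare_to_random_opponent(M, i, rng) for _ in range(num_draws))
    return wins / num_draws


# Hypothetical 3-item matrix; the true score of item 0 is (0.8 + 0.9) / 2 = 0.85.
M_example = [
    [0.5, 0.8, 0.9],
    [0.2, 0.5, 0.6],
    [0.1, 0.4, 0.5],
]
estimate = estimate_score(M_example, 0, 20000, random.Random(0))
```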

However, when viewing our problem as a multi-armed bandit problem with means , we are ignoring the fact that the means are coupled, as they must be realized by some pairwise comparison matrix . Due to , this matrix must satisfy certain constraints, such as and (e.g., see the papers [landau_dominance_1953, joe_majorization_1988]). Our algorithm turns out to be near-optimal, even though it does not take those constraints into account. This seems to corroborate the observation in [simchowitz_simulator_2017] that many types of constraints surprisingly do not improve the sample complexity of bandit problems.

Finally, at least to the best of our knowledge, the problem of approximate subset selection has not been studied in the bandit literature, meaning that our algorithm and results are also new when specialized to the multi-armed bandit problem. However, it should be noted that other versions of approximation have been considered in the literature; for instance, zhou_optimal_2014 studied the problem of selecting arms with low aggregate regret, defined as the gap between the average reward of the optimal solution and the solution given by the algorithm.

2.4 Parametric models

In this section, we introduce a family of parametric models that are popular in the pairwise comparison literature [szorenyi_online_2015, hunter2004mm, negahban_iterative_2012, hajek2014minimax, shah_estimation_2015]. We focus on these parametric models in Section 3.3, where we show that, perhaps surprisingly, if the pairwise comparison probabilities are bounded away from zero, then for most constellations of scores these assumptions yield at most small gains in sample complexity.

Any member of this family is defined by a strictly increasing and continuous function $\Phi \colon \mathbb{R} \to [0, 1]$ obeying $\Phi(t) = 1 - \Phi(-t)$, for all $t \in \mathbb{R}$. The function $\Phi$ is assumed to be known. A pairwise comparison matrix in this family is associated to an unknown vector $w \in \mathbb{R}^n$, where each entry of $w$ represents some quality or strength of the corresponding item. The parametric model associated with the function $\Phi$ is defined as:

$$M_{ij} = \Phi(w_i - w_j) \quad \text{for all } i, j \in [n].$$

Popular examples of models in this family are the Bradley-Terry-Luce (BTL) model, obtained by setting $\Phi$ equal to the sigmoid function $\Phi(t) = 1/(1 + e^{-t})$, and the Thurstone model, obtained by setting $\Phi$ equal to the Gaussian CDF. Since $w_i > w_j$ is equivalent to $\tau_i > \tau_j$, the ranking induced by the scores $\tau_i$ is equivalent to that induced by $w$.
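To make the parametric family concrete, the sketch below builds a BTL comparison matrix from a weight vector by applying the sigmoid to weight differences, and then checks numerically that ordering items by score agrees with ordering them by weight. The function names and the weight values are hypothetical; the check is an illustration on one example, not a proof.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def btl_matrix(w):
    """BTL model: probability that i beats j is sigmoid(w[i] - w[j])."""
    return [[sigmoid(wi - wj) for wj in w] for wi in w]


def scores_from_matrix(M):
    """Average probability of beating each of the other items."""
    n = len(M)
    return [sum(M[i][j] for j in range(n) if j != i) / (n - 1) for i in range(n)]


# Hypothetical weights for four items.
weights = [2.0, 0.5, -1.0, 0.0]
M_btl = btl_matrix(weights)
scores = scores_from_matrix(M_btl)

order_by_weight = sorted(range(len(weights)), key=lambda i: -weights[i])
order_by_score = sorted(range(len(weights)), key=lambda i: -scores[i])
```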

3 Hamming-LUCB: Algorithm and analysis

In this section, we present our approximate ranking algorithm, and an analysis proving that it is near optimal for many interesting and natural problem instances.

3.1 The Hamming-LUCB algorithm

Our algorithm is based on actively identifying sets and consisting of items and items, respectively, such that with high confidence the items in the first set have a larger score than the items in the second set. Once we have found such sets, we can arbitrarily distribute the remaining items to the sets and in order to obtain a Hamming-accurate ranking with high confidence.

Our algorithm identifies those sets based on adaptively estimating the scores $\tau_i$. We estimate the score of item $i$ by comparing item $i$ with items chosen uniformly at random from $[n] \setminus \{i\}$, which yields an unbiased estimate of $\tau_i$. The key idea is to only estimate the scores sufficiently well so we can obtain the two sets and from them. This strategy decides based on the current estimates of the scores and associated confidence intervals which estimate to “update”, by comparing it to a randomly chosen item. Our strategy to update the estimates of the scores is guided by the insight that the “easiest” items to distinguish are the top items, , and the bottom items, . Hence, our algorithm focuses on what it “thinks” are those top and bottom items.

We define a confidence bound based on a non-asymptotic version of the law of the iterated logarithm [kaufmann_complexity_2014, jamieson_lil_2014]; it is of the form , where is an integer corresponding to the number of comparisons, and with the constants involved explicitly chosen by setting

For each item , the algorithm stores a counter of the number of comparisons in which it has been involved, along with an empirical estimate of the associated score . For notational convenience, we adopt the shorthands and . Within each round, we also let denote a permutation of such that . We then define the indices

(3)

These indices are the analogues of the standard indices of the Lower-Upper Confidence Bound (LUCB) strategy from the bandit literature [kalyanakrishnan_pac_2012] for the top and bottom items. The LUCB strategy for exact top recovery would update the scores and (for ) at each round. As mentioned before, our strategy will go after what it “thinks” are the top items, , and what it “thinks” are the bottom items, . Moreover, the algorithm keeps all the other items in consideration for inclusion in these sets, by keeping their confidence intervals below the confidence intervals of the items in and (cf. equation (4) in the algorithm below). This is crucial to ensure that the algorithm does not get stuck trying to distinguish the middle items , which in general requires many comparisons, as their scores are typically closer. In Figure 2 we show an example run of the Hamming-LUCB algorithm, to illustrate the idea.

1 Input: Confidence parameter .
2 Initialization: For every item , compare to an item chosen uniformly at random from , and set , .
3 Do until termination:
4 Let denote a permutation of such that .
5 For and defined by equation (3), define the indices
(4)
6 For , increment , compare to an item chosen uniformly at random from , and update .
7 End Loop once the termination condition holds:
(5)
Return the estimates of the partitions and .
Algorithm 1 Hamming-LUCB
Figure 2: Visualization of a run of the Hamming-LUCB algorithm on a problem instance with scores evenly spaced in the interval , and parameters . The estimates of the scores of the top items , the middle items , and the bottom items , along with the confidence intervals are depicted in blue, brown, and red, respectively, after 30 and 200 comparisons, and at termination. Note that once the confidence intervals of the top and bottom items are separated, the algorithm terminates.
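The loop in Algorithm 1 can be sketched in code as follows. This is a rough illustration under explicit assumptions, not the authors' implementation. Equations (3) to (5) are not reproduced in this text, so the index rules below are one reading of the surrounding description: d1 is the item with the lowest lower confidence bound among the empirical top k-h items, d2 is the item with the highest upper confidence bound among the empirical bottom n-k-h items, the two items updated in each round are those with the widest intervals among d1 or d2 together with the adjacent h middle items, and the loop stops once the interval of d1 lies above that of d2. The confidence radius is a simple Hoeffding bound with a union bound over items and rounds, swapped in for the law-of-the-iterated-logarithm bound used in the paper. All names are hypothetical, and the sketch assumes h < k and k + h < n.

```python
import math
import random


def hamming_lucb_sketch(M, k, h, delta, rng, max_rounds=200_000):
    """Return (estimated top-k set, number of comparisons made)."""
    n = len(M)

    def duel(i):
        # Compare item i with a uniformly random other item; 1 if i wins.
        j = rng.choice([x for x in range(n) if x != i])
        return 1 if rng.random() < M[i][j] else 0

    def radius(u):
        # ASSUMPTION: Hoeffding-style radius, not the paper's LIL bound.
        return math.sqrt(math.log(4 * n * u * u / delta) / (2 * u))

    wins = [duel(i) for i in range(n)]   # initialization: one comparison each
    counts = [1] * n

    for _ in range(max_rounds):
        est = [wins[i] / counts[i] for i in range(n)]
        rad = [radius(counts[i]) for i in range(n)]
        order = sorted(range(n), key=lambda i: -est[i])
        top, mid_hi = order[:k - h], order[k - h:k]
        mid_lo, bottom = order[k:k + h], order[k + h:]
        # Assumed analogue of equation (3).
        d1 = min(top, key=lambda i: est[i] - rad[i])
        d2 = max(bottom, key=lambda i: est[i] + rad[i])
        # Assumed analogue of termination condition (5).
        if est[d1] - rad[d1] >= est[d2] + rad[d2]:
            break
        # Assumed analogue of equation (4): widest interval in each group.
        b1 = max([d1] + mid_hi, key=lambda i: rad[i])
        b2 = max([d2] + mid_lo, key=lambda i: rad[i])
        for b in (b1, b2):
            wins[b] += duel(b)
            counts[b] += 1

    return set(order[:k]), sum(counts)


# Hypothetical instance: 8 items, item 0 strongest, evenly spaced scores.
n_items, k_top, h_tol = 8, 4, 1
M_demo = [[0.5 + 0.05 * (j - i) for j in range(n_items)] for i in range(n_items)]
top_set, num_comparisons = hamming_lucb_sketch(
    M_demo, k_top, h_tol, 0.1, random.Random(0)
)
```

In this sketch the middle items are never used in the stopping test; they are sampled only to keep their intervals no wider than those of d1 and d2, which mirrors the role the text assigns to equation (4).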

3.2 Guarantees and optimality of the Hamming-LUCB algorithm

We next establish guarantees on the number of comparisons for the Hamming-LUCB algorithm to succeed. As we show below, the number of comparisons depends on the following gaps between the scores

Thus, as one might intuitively expect, the number of comparisons is typically smaller when is larger, as the corresponding gaps typically become larger.

Theorem 1.

For any , the Hamming-LUCB algorithm run with confidence parameter is -Hamming-accurate, and with probability at least , makes at most comparisons, where

(6)

The notation absorbs factors logarithmic in , and doubly logarithmic in the gaps.

Theorem 1 proves that the Hamming-LUCB algorithm is -accurate, and characterizes the number of comparisons that it requires as a function of the gaps between the scores.

Comparing to the number of comparisons necessary and sufficient for finding the top- items, we see that the Hamming-LUCB algorithm depends on the gaps and instead of the gaps and which appear in the sample complexity for finding the top items (cf. equation (2)). These gaps are typically significantly larger, resulting in a lower sample complexity. For example, in practice, the scores are often increasing in that is on average on the order of . Thus, for sufficiently large , several real world models belong to the class (see Figure 1 for plausible members of this class):

(7)

For this class, the complexity of finding the top- items with the Hamming-LUCB algorithm is on the order of , which is by a factor of smaller than the complexity for finding the exact top- items.

Moreover, Hamming LUCB provides a strict improvement over the optimal sample complexity in the passive setup, for which Shah and Wainwright [shah_simple_2015] establish upper bounds and minimax lower bounds which state that comparisons are necessary and sufficient to identify the top items up to a Hamming error with high probability.

As increases, the upper bound depends on gaps between items with increasingly disparate position in the ranking, and thus, the upper bound on the sample complexity decreases. The following lower bound shows that, up to logarithmic factors in , doubly logarithmic factors in the gaps, and a multiplicative scaling of , the Hamming-LUCB algorithm is optimal.

Theorem 2.

For any , let denote an algorithm which is uniformly -accurate over . Then, when is run on any comparison instance , must make at least comparisons in expectation, where

for some universal constant .

Note that the above lower bound does not depend on the gaps involving the items . However, we can still relate the lower bound to the upper bound by (see Section A for the simple proof)

(8)

so that we see that, up to rescaling our Hamming error tolerance , our upper and lower bounds ( and , respectively) match up to logarithmic factors. For many problem instances of interest—such as models in the class in equation (7)—the sample complexity bounds and degrade gracefully with the Hamming tolerance , so that typically we have .

Observe that if , we recover the exact top- recovery upper bound in equation (2), which is related to similar results for multi-armed bandits [kalyanakrishnan_pac_2012]. We believe that by modifying the confidence intervals in Hamming-LUCB as in the LUCB++ algorithm of Simchowitz et al. [simchowitz_simulator_2017], one can sharpen the upper bound on the sample complexity by replacing with on the terms corresponding to items , thereby matching known lower bounds for the top- subset selection problem in the bandit literature [simchowitz_simulator_2017, chen2017nearly, kalyanakrishnan_pac_2012]. In the interest of simplicity, we defer refining these logarithmic factors to later work.

3.3 Parametric models

Even though the lower bound of qualitatively matches the upper bound , it gives the misleading impression that an -approximate algorithm can get away without querying the items in . In the proof section, we use techniques from [simchowitz_simulator_2017] and [chen2017nearly] to establish a more refined technical lower bound showing that all items, including those with ranks close to must be compared an “adequate” number of times. For simplicity, we state a consequence of this lower bound applied to the parametric models described in Section 2.4. In addition to showing that each item has to be compared a certain number of times, this bound also establishes that even knowledge of the exact parametric form of the pairwise comparison probabilities cannot drastically improve the performance of an active ranking algorithm.

In more detail, we say that a model is parametric, if there exists a strictly increasing CDF such that for some weights . For any pair of constants , we say that a CDF is -bounded, if it is differentiable, and if its derivative satisfies the bounds

(9)

Note that for the popular BTL and Thurstone models, equation (9) holds with close to one, provided that is not too small. We say that an algorithm is symmetric if its distribution of comparisons commutes with permutations of the items. For any such algorithm, our main lower bound is as follows:

Theorem 3.

For a given , let be any symmetric algorithm that is uniformly -Hamming accurate over . Then, when is run on the instance , for any integer and any item , it must make at least

comparisons involving item on average.

In particular, by choosing , we see that the total sample complexity is lower bounded by

(10)

which is equivalent to the upper bound achieved by the Hamming-LUCB algorithm up to logarithmic factors. The lower bound from Theorem 3 is stronger than the lower bound from Theorem 2, in that it applies to the larger class of algorithms that are only -accurate over the smaller class of parametric models. In fact, the parametric subclass is significantly smaller than the full set of pairwise comparison models , in the sense that one can find matrices in that cannot be well-approximated by any parametric model [shah_stochastically_2015]. Therefore, Theorem 3 shows that, up to rescaling the Hamming error tolerance and logarithmic factors, the Hamming-LUCB algorithm is optimal, even if we restrict ourselves to algorithms that are uniformly -accurate only over a parametric subclass. Thus, in the regime where the pairwise comparison probabilities are bounded away from zero, parametric assumptions cannot substantially reduce the sample complexity of finding an approximate ranking, an observation that has been made previously in the paper [heckel_active_2016] for exact rankings.

A second and equally important consequence of Theorem 3 is that each item has to be sampled a certain number of times, an intuition not captured by Theorem 2. This conclusion continues to hold for general pairwise comparison matrices; see Theorem 4 in Section 5.3 for a formal statement.

3.4 Random guessing

Even though our upper and lower bounds essentially match whenever , there are pathological instances where , and where the Hamming-LUCB algorithm will make considerably more comparisons than a careful random guessing strategy.

As an example, consider a problem instance parameterized by , with scores given by

for some and . The upper bound (6) for the Hamming-LUCB strategy is at least on the order of , since the gap between the -th and the -th largest score is . However, the lower bound provided by Theorem 2 is , which is independent of . Thus, by making small, the ratio of upper and lower bounds becomes arbitrarily large. Intuitively, Hamming-LUCB is wasteful because it is attempting to identify the exact top arms with too much precision. However, for this particular problem instance, the following random guessing strategy will attain our lower bound. First, we obtain estimates of each score by comparing item to randomly chosen items. For each score, test whether there are items obeying and whether there are items obeying . If yes, assign these items the estimates and , respectively, and assign all remaining items uniformly at random to the sets and , and terminate.

4 Experimental results

In this section, we provide experimental evidence that corroborates our theoretical claim that the Hamming-LUCB algorithm can significantly reduce the number of comparisons if one is content with an approximate ranking. We show that these gains are attained on a real-world data set. Specifically, we generate a pairwise comparison model by choosing such that the Borda scores coincide with those found empirically in the PlaNYC survey [salganik_wiki_2015]; see panel (b) of Figure 1. We emphasize that, since Hamming-LUCB depends only on the Borda scores and not on the comparison probabilities , these simulations provide a faithful representation of how Hamming-LUCB performs on real-world data. In Figure 3, we plot the results of running the Hamming-LUCB algorithm on the PlaNYC pairwise comparison model in order to determine the top items, for different values of . We observed that the results for other values of are very similar. As suggested by our theory, the number of comparisons to find an approximate ranking decays in a manner inversely proportional in .

We compare the Hamming-LUCB algorithm to another sensible active ranking strategy for obtaining a Hamming-accurate ranking. Specifically, we consider a version of the successive elimination strategy proposed in [heckel_active_2016, Sec. 3.1] for finding an exact ranking. This strategy can be adapted to yield a Hamming-accurate ranking by changing its stopping criterion. Instead of stopping once all items have been eliminated, we stop when either items have been assigned to the top, or items have been assigned to the bottom. While this strategy yields a Hamming-accurate ranking, its sample complexity is, up to logarithmic factors, equal to , which is strictly larger than that of the Hamming-LUCB algorithm. As Figure 3 shows, this strategy requires significantly more comparisons for finding an approximate ranking, thereby validating the benefits of our approach.

Figure 3: Sample complexity of the Hamming-LUCB algorithm and an elimination strategy run on a pairwise comparison model resembling the PlaNYC online sequential survey. Both algorithms find the top proposals out of proposals, up to Hamming error . The error bars correspond to one standard deviation from the mean. The results show that the sample complexity of the Hamming-LUCB algorithm for finding an -accurate ranking drops by a factor of about . Moreover, the Hamming-LUCB algorithm requires significantly fewer samples than the elimination strategy.

5 Proofs

In this section, we provide the proofs of our theorems. In order to simplify notation, we assume without loss of generality (re-indexing as needed) that the underlying permutation is equal to the identity, so that .

5.1 Proof of Theorem 1

Our analysis uses an argument inspired by the proof of the performance guarantee of the original LUCB algorithm from the bandit literature, presented in [kalyanakrishnan_pac_2012]. We begin by showing that the estimate is guaranteed to be -close to , for all , with high probability.

Lemma 1 ([kaufmann_complexity_2014, Lem. 19]).

For any , with probability at least , the event

(11)

occurs. The statement continues to hold for any with , .

Lemma 1 is a non-asymptotic version of the law of the iterated logarithm from kaufmann_complexity_2014 and jamieson_lil_2014.

We first show that, on the event defined in equation (11), the Hamming-LUCB algorithm returns sets and obeying , as desired. Indeed, suppose that . This implies that and differ in at most values, which in turn implies that and differ by at most values. Therefore, . Next, suppose that . Then, at least one item in is in . Thus, on , the termination condition (5) implies that . Similarly as above, this in turn implies that .

We next show that on the event , Hamming-LUCB terminates after the desired number of comparisons. Let , and define the event that item is bad as

Lemma 2.

If occurs and the termination condition (5) is false, then either or occurs.

Given Lemma 2, we can complete the proof in the following way. For an item , define

and let be the largest integer satisfying the bound . A simple calculation (see Section 5.1.1 for the details) yields that

(12)

Let be the -th iteration of the steps in the Hamming-LUCB algorithm, and let and be the two items selected via equation (4) of the algorithm. Note that in each iteration only those two items are compared to other items. By Lemma 2, we can therefore bound the total number of comparisons by

(13)

For inequality (i), we used the fact (12), and inequality (ii) follows because can only be true for iterations .

We conclude the proof by noting that the definition of and some algebra yields (see [heckel_active_2016, Eq. (20)]) that for sufficiently large

Applying this inequality to the RHS of equation (13) above concludes the proof.

5.1.1 Proof of fact (12)

First, consider an item . We show that if , then is false. On the event ,

(14)

where inequality (i) follows from for , by definition of , and the last inequality follows from and . Thus, does not occur.

For an item , the argument that is false is analogous. For an item in the middle , the event is false by definition. This concludes the proof.

5.1.2 Proof of Lemma 2

We prove the lemma by considering all different values the indices and selected by the LUCB algorithm can take on, and showing that in each case and cannot occur simultaneously. For notational convenience, we define the indices

and note that

  1. Suppose that and , and that both and do not occur. First note that

    (15)

    In order to establish this claim, note that the inequality holds trivially with equality if . If , then it follows from and . Thus, we obtain

    (16)

    where the last inequality holds by the assumption that does not occur. An analogous argument yields that

    (17)

    Combining those inequalities yields , which contradicts that the termination condition (5) is false.

  2. Next, suppose that is an index in the middle and is in the very bottom, i.e., , and , and both and do not occur.

    First note that from and not occurring, we have that

    Here, inequality (i) holds by and , and inequality (ii) follows by the definition of . On the event , this implies

    (18)

    Inequality (18) can only be true for all if , which is equivalent to

    Again using that and not occurring, we have that

    (19)

    where inequality (i) holds since the termination condition (5) is false, and inequality (ii) follows from , where the last inequality holds since does not occur, by assumption.

    From for all , it follows that for ,

    (20)

    Below, we show that

    ,  for all . (21)

    It follows that