# PAC-Battling Bandits with Plackett-Luce: Tradeoff between Sample Complexity and Subset Size

We introduce the probably approximately correct (PAC) version of the problem of Battling Bandits with the Plackett-Luce (PL) model -- an online learning framework where, in each trial, the learner chooses a subset of k < n arms from a fixed pool of n arms, and subsequently observes stochastic feedback indicating preference information over the items in the chosen subset, e.g., the most preferred item, or a ranking of the top m most preferred items. The objective is to recover an `approximate-best' item of the underlying PL model with high probability. This framework is motivated by practical settings such as recommendation systems and information retrieval, where it is easier and more efficient to collect relative feedback for multiple arms at once. Our framework can be seen as a generalization of the well-studied PAC-Dueling-Bandit problem over a set of n arms. We propose two different feedback models: just the winner information (WI), and the ranking of the top m items (TR), for any 2 ≤ m ≤ k. We show that with just the winner information (WI), one cannot recover the `approximate-best' item with sample complexity lower than Ω((n/ϵ²) ln(1/δ)), which is independent of k and the same as that required in the standard dueling bandit setting (k = 2). However, with top-m ranking (TR) feedback, our lower-bound analysis proves an improved sample complexity guarantee of Ω((n/(m ϵ²)) ln(1/δ)), a relative improvement by a factor of 1/m over WI feedback, rightfully justifying the additional information gained from knowing the ranking of the topmost m items. We also provide algorithms for each of the above feedback models; our theoretical analyses prove the optimality of their sample complexities, which match the derived lower bounds (up to logarithmic factors).


## 1 Introduction

The dueling bandit problem has recently gained attention in the machine learning community (Yue et al., 2012; Ailon et al., 2014; Zoghi et al., 2014; Szörényi et al., 2015). This is a variant of the multi-armed bandit problem (Auer et al., 2002) in which the learner needs to identify a ‘best arm’ from pairwise comparisons between arms. In this work, we consider a natural generalization of the dueling bandit problem where the learner can adaptively select a subset of k ≥ 2 arms in each round, and observe relative preferences in the subset following a Plackett-Luce (PL) feedback model (Marden, 1996), with the objective of learning the ‘best arm’. We call this the battling bandit problem with the Plackett-Luce model.

The battling bandit decision framework (Saha and Gopalan, 2018; Chen et al., 2018) models several application domains where it is possible to elicit feedback about preferred options from among a general set of offered options, instead of being able to compare only two options at a time as in the dueling setup. Furthermore, the phenomenon of competition – that an option’s utility or attractiveness is often assessed relative to that of other items in the offering – is captured effectively by a subset-dependent stochastic choice model such as Plackett-Luce. Common examples of learning settings with such feedback include recommendation systems, search engines, medical interviews, and tutoring systems: any application where relative preferences from a chosen pool of options are revealed.

We consider a natural probably approximately correct (PAC) learning problem in the battling bandit setting: output an ϵ-approximate best item (with respect to its Plackett-Luce parameter) with probability at least 1 − δ, while keeping the total number of adaptive exploration rounds small. We term this the (ϵ, δ)-PAC objective of searching for an approximate winner or top-1 item.

Our primary interest lies in understanding how the subset size k influences the sample complexity of achieving the (ϵ, δ)-PAC objective in subset choice models for various feedback information structures, e.g., winner information (WI), which returns only a single winner of the chosen subset, or the more general top-m ranking (TR) information structure, where an ordered tuple of the m ‘most-preferred’ items is observed. More precisely, we ask: Does being able to play size-k subsets help learn optimal items faster than in the dueling setting (k = 2)? How does this depend on the subset size k, and on the feedback information structure? How much, if any, does rank-ordered feedback accelerate the rate of learning, compared to only observing winner feedback? This paper takes a step towards resolving such questions within the context of the Plackett-Luce choice model. Among the contributions of this paper are:

1. We frame a PAC version of Battling Bandits with n arms – a natural generalization of the PAC-Dueling-Bandits problem (Szörényi et al., 2015) – with the objective of finding an ϵ-approximate best item with probability at least 1 − δ with minimum possible sample complexity, termed the (ϵ, δ)-PAC objective (Section 3.2).

2. We consider learning with winner information (WI) feedback, where the learner can play a subset S_t of exactly k distinct elements at each round t, following which a winner of S_t is observed, drawn according to an underlying, unknown, Plackett-Luce model. We show an information-theoretic lower bound on the sample complexity for (ϵ, δ)-PAC of Ω((n/ϵ²) ln(1/δ)) rounds (Section 4.1), which is of the same order as that for the dueling bandit (k = 2) (Yue and Joachims, 2011). This implies that, despite the increased flexibility of playing sets of potentially large size k, with just winner information feedback one cannot hope for a faster rate of learning than in the case of pairwise selections. Intuitively, competition among a large number (k − 1) of elements vying for the top spot at each time exactly offsets the potential gain that being able to test more alternatives together brings. On the achievable side, we design two algorithms (Section 4.2) for the (ϵ, δ)-PAC objective, and derive sample complexity guarantees which are optimal within a logarithmic factor of the lower bound derived earlier. When the learner is allowed to play subsets of sizes up to k, which is a slightly more flexible setting than above, we design a median-elimination-based algorithm with order-optimal sample complexity which, when specialized to k = 2, improves upon existing sample complexity bounds for PAC-dueling-bandit algorithms, e.g. Yue and Joachims (2011); Szörényi et al. (2015), under the PL model (Section 4.3).

3. We next study the (ϵ, δ)-PAC problem in a more general top-m ranking (TR) feedback model, where the learner gets to observe a ranking of the top m items drawn from the Plackett-Luce distribution, for 2 ≤ m ≤ k (Section 3.1), departing from prior work. For m = 1, the setting simply boils down to the WI feedback model. In this case, we are able to prove a sample complexity lower bound of Ω((n/(m ϵ²)) ln(1/δ)) (Theorem 10), which suggests that with top-m ranking (TR) feedback, it may be possible to aggregate information m times faster than with just winner information feedback. We further present two algorithms (Section 5.2) for this problem which are shown to enjoy optimal (up to logarithmic factors) sample complexity guarantees. This formally shows that the m-fold increase in statistical efficiency from exploiting the richer information contained in top-m ranking feedback is, in fact, algorithmically achievable.

4. From an algorithmic point of view, we elucidate how the structure of the Plackett-Luce choice model, such as its independence of irrelevant alternatives (IIA) property, plays a crucial role in allowing the development of parameter estimates, together with tight confidence sets, which form the basis for our learning algorithms. It is indeed by leveraging this property (Lemma 1) that we can maintain consistent pairwise preference estimates of the items, by applying the concept of Rank Breaking to subset-wise preference data. This significantly alleviates the combinatorial explosion that could otherwise result if one were to keep more general subset-wise estimates.

Related Work: Statistical parameter estimation in Plackett-Luce models has been studied in detail in the offline batch (non-adaptive) setting (Chen and Suh, 2015; Khetan and Oh, 2016; Jang et al., 2017).

In the online setting, there is a fairly mature body of work concerned with PAC best-arm (or top set of arms) identification in the classical multi-armed bandit (Even-Dar et al., 2006; Audibert and Bubeck, 2010; Kalyanakrishnan et al., 2012; Karnin et al., 2013; Jamieson et al., 2014), where absolute utility information is assumed to be revealed upon playing a single arm or item. Though most work on dueling bandits has focused on the regret minimization goal (Zoghi et al., 2014; Ramamohan et al., 2016), there have been recent developments on the PAC objective for different pairwise preference models, such as those satisfying stochastic triangle inequalities and strong stochastic transitivity (Yue and Joachims, 2011), general utility-based preference models (Urvoy et al., 2013), the Plackett-Luce model (Szörényi et al., 2015), the Mallows model (Busa-Fekete et al., 2014a), etc. Recent work in the PAC setting focuses on learning objectives other than identifying the single (near) best arm, e.g. recovering a few of the top arms (Busa-Fekete et al., 2013; Mohajer et al., 2017; Chen et al., 2017), or the true ranking of the items (Busa-Fekete et al., 2014b; Falahatgar et al., 2017).

The work which is perhaps closest in spirit to ours is that of Chen et al. (2018), which addresses the problem of learning the top set of items in Plackett-Luce battling bandits. Even when specialized to finding a single best item (as we consider here), however, this work differs in several important aspects from what we attempt. Chen et al. (2018) develop algorithms for the probably exactly correct objective (recovering a near-optimal arm is not favored), and, consequently, show instance-dependent sample complexity bounds, whereas we allow a tolerance of ϵ in defining best arms, which is often natural in practice (Szörényi et al., 2015; Yue and Joachims, 2011). As a result, we bring out the dependence of the sample complexity on the specified tolerance level ϵ, rather than on purely instance-dependent measures of hardness. Also, their work considers only winner information (WI) feedback from the subsets chosen, whereas we consider, for the first time, general top-m ranking information feedback.

A related battling-type bandit setting has been studied as the MNL-bandits assortment optimization problem by Agrawal et al. (2016), although it takes prices of items into account when defining their utilities. As a result, their work optimizes for a subset with the highest expected revenue (price), whereas we search for a best item (Condorcet winner), and the two settings are in general incomparable.

## 2 Preliminaries

Notation. We denote by [n] the set {1, 2, …, n}. For any subset S ⊆ [n], let |S| denote the cardinality of S. When there is no confusion about the context, we often represent (an unordered) subset S as a vector, or ordered subset, of size |S| (according to, say, a fixed global ordering of all the items [n]). In this case, S(i) denotes the item (member) at the i-th position in subset S. Σ_S is the set of permutations over the items of S, where, for any permutation σ ∈ Σ_S, σ(i) denotes the element at the i-th position in σ. 1(φ) is generically used to denote an indicator variable that takes the value 1 if the predicate φ is true, and 0 otherwise. x ∨ y denotes the maximum of x and y, and Pr(A) is used to denote the probability of event A, in a probability space that is clear from the context.

### 2.1 Discrete Choice Models and Plackett-Luce (PL)

A discrete choice model specifies the relative preferences of two or more discrete alternatives in a given set. A widely studied class of discrete choice models is the class of Random Utility Models (RUMs), which assume a ground-truth utility score θ_i for each alternative i, and assign a conditional distribution D_i(· | θ_i) for scoring item i. To model a winning alternative given any set S, one first draws a random utility score X_i ∼ D_i(· | θ_i) for each alternative i ∈ S, and selects an item with the highest random score.

One widely used RUM is the Multinomial-Logit (MNL) or Plackett-Luce (PL) model, where the D_i's are taken to be independent Gumbel distributions with location parameters θ'_i and scale parameter 1 (Azari et al., 2012), which results in probability densities Pr(X_i = x) = e^{−(x − θ'_i)} e^{−e^{−(x − θ'_i)}}, θ'_i ∈ ℝ. Moreover, assuming θ_i = e^{θ'_i}, i ∈ [n], in this case the probability that an alternative i emerges as the winner in the set S ∋ i becomes proportional to its parameter value:

$$\Pr(i\,|\,S) = \frac{\theta_i}{\sum_{j \in S} \theta_j}. \qquad (1)$$

We will henceforth refer to the above choice model as a PL model with parameters θ = (θ_1, …, θ_n). Clearly, the above model induces a total ordering on the arm set [n]: if p_{ij} = Pr(i ≻ j) = θ_i/(θ_i + θ_j) denotes the pairwise probability of item i being preferred over item j, then p_{ij} ≥ 1/2 if and only if θ_i ≥ θ_j; in other words, if p_{ij} ≥ 1/2 and p_{jk} ≥ 1/2, then p_{ik} ≥ 1/2 (Ramamohan et al., 2016).
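As an illustration, the choice probabilities (1) and winner draws are straightforward to simulate; the helper names below (`pl_winner_probs`, `sample_pl_winner`) are ours, not the paper's:

```python
import random

def pl_winner_probs(theta, S):
    """Choice probabilities Pr(i|S) = theta[i] / sum_{j in S} theta[j], as in Eqn. (1)."""
    total = sum(theta[j] for j in S)
    return {i: theta[i] / total for i in S}

def sample_pl_winner(theta, S, rng=random):
    """Draw the winner of subset S under the PL model via inverse-CDF sampling."""
    r = rng.random() * sum(theta[j] for j in S)
    acc = 0.0
    for i in S:
        acc += theta[i]
        if r <= acc:
            return i
    return S[-1]  # numerical safety net
```

For instance, with θ = (0.5, 0.25, 0.25), item 0 wins the full set with probability 0.5, and beats item 1 in a pairwise duel with probability 0.5/0.75 = 2/3, illustrating that pairwise preferences are scale-invariant marginals of the same parameters.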

Other families of discrete choice models can be obtained by imposing different probability distributions over the utility scores X_i; e.g., if (X_1, …, X_n) are jointly normal with mean μ and covariance Σ, then the corresponding RUM-based choice model reduces to the Multinomial Probit (MNP). Unlike MNL, though, the choice probabilities for the MNP model do not admit a closed-form expression (Vojacek et al., 2010).

### 2.2 Independence of Irrelevant Alternatives

A choice model is said to possess the Independence of Irrelevant Alternatives (IIA) property if the ratio of probabilities of choosing any two items, say i and j, from within any choice set is independent of the other alternatives present in the set (Benson et al., 2016). More specifically, Pr(i | S₁)/Pr(j | S₁) = Pr(i | S₂)/Pr(j | S₂) for any two subsets S₁, S₂ ⊆ [n] that contain i and j. One example of such a choice model is Plackett-Luce.

###### Remark 1.

IIA turns out to be very valuable in estimating the parameters of a PL model, with high confidence, via Rank Breaking – the idea of extracting pairwise comparisons from (partial) rankings and applying estimators on the obtained pairs, treating each comparison independently. Although this technique has previously been used in batch (offline) PL estimation (Khetan and Oh, 2016), we show, for the first time, how it can be used in online problems. We crucially exploit this property of the PL model in the algorithms we design (Algorithms 1-3), and in establishing their correctness and sample complexity guarantees.
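A minimal sketch of the Rank-Breaking step as used here: a top-m ranking over a played subset is decomposed into pairwise comparisons, where each ranked item beats every item ranked after it and every unranked item of the subset, while unranked pairs contribute nothing. The function name and interface are illustrative, not the paper's:

```python
def rank_break(top_m, subset):
    """Decompose a top-m ranking of `subset` (most preferred first) into
    pairwise comparisons: each ranked item beats every item ranked after it
    and every unranked item of the subset; unranked pairs yield no comparison."""
    ranked = list(top_m)
    unranked = [x for x in subset if x not in ranked]
    pairs = []
    for pos, i in enumerate(ranked):
        for j in ranked[pos + 1:] + unranked:
            pairs.append((i, j))  # read as: i preferred over j
    return pairs
```

For example, `rank_break([2, 0], [0, 1, 2, 3])` yields the five comparisons (2,0), (2,1), (2,3), (0,1), (0,3); in general a top-m ranking of a size-k subset yields m(m−1)/2 + m(k−m) pairwise preferences.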

###### Lemma 1 (Deviations of pairwise win-probability estimates for PL model).

Consider a Plackett-Luce choice model with parameters θ = (θ_1, …, θ_n) (see Eqn. (1)), and fix two distinct items i, j ∈ [n]. Let S₁, S₂, …, S_T be a sequence of (possibly random) subsets of [n] of size at least 2, where T is a positive integer, and i₁, i₂, …, i_T a sequence of random items with each i_t ∈ S_t, 1 ≤ t ≤ T, such that for each t: (a) S_t depends only on S₁, i₁, …, S_{t−1}, i_{t−1}, (b) i_t is distributed as the Plackett-Luce winner of the subset S_t, given S₁, i₁, …, S_{t−1}, i_{t−1} and S_t, and (c) {i, j} ⊆ S_t with probability 1. Let n_i(T) = Σ_{t=1}^T 1(i_t = i) and n_{ij}(T) = Σ_{t=1}^T 1(i_t ∈ {i, j}). Then, for any positive integer v, and η > 0,

$$\Pr\left(\frac{n_i(T)}{n_{ij}(T)} - \frac{\theta_i}{\theta_i + \theta_j} \ge \eta,\; n_{ij}(T) \ge v\right) \vee \Pr\left(\frac{n_i(T)}{n_{ij}(T)} - \frac{\theta_i}{\theta_i + \theta_j} \le -\eta,\; n_{ij}(T) \ge v\right) \le e^{-2 v \eta^2}.$$
###### Proof.

(sketch). The proof uses a novel coupling argument to work in an equivalent probability space for the PL model with respect to the item pair (i, j), as follows. Let Z₁, Z₂, … be a sequence of iid Bernoulli random variables with success parameter θ_i/(θ_i + θ_j). A counter C is first initialized to 0. At each time t, given S₁, i₁, …, S_{t−1}, i_{t−1} and S_t, an independent coin is tossed with probability of heads (θ_i + θ_j)/Σ_{k ∈ S_t} θ_k. If the coin lands tails, then i_t is drawn as an independent sample from the Plackett-Luce distribution over S_t \ {i, j}; else, the counter C is incremented by 1, and i_t is returned as i if Z_C = 1, or j if Z_C = 0. This construction yields the correct joint distribution for the sequence i₁, …, i_T, because of the IIA property of the PL model:

$$\Pr(i_t = i \mid i_t \in \{i, j\}, S_t) = \frac{\Pr(i_t = i \mid S_t)}{\Pr(i_t \in \{i, j\} \mid S_t)} = \frac{\theta_i / \sum_{k \in S_t} \theta_k}{(\theta_i + \theta_j) / \sum_{k \in S_t} \theta_k} = \frac{\theta_i}{\theta_i + \theta_j}.$$

The proof now follows by applying Hoeffding’s inequality on prefixes of the sequence Z₁, Z₂, … .∎
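The conditional identity above is easy to check empirically: conditioned on the winner falling in {i, j}, item i wins with probability θ_i/(θ_i + θ_j), regardless of which other items are in the subset. A quick Monte Carlo sanity check (the parameter values are arbitrary):

```python
import random

def sample_pl_winner(theta, S, rng):
    """Draw the PL winner of S: Pr(i|S) = theta[i] / sum_{k in S} theta[k]."""
    r = rng.random() * sum(theta[k] for k in S)
    acc = 0.0
    for k in S:
        acc += theta[k]
        if r <= acc:
            return k
    return S[-1]

rng = random.Random(0)
theta = {0: 0.4, 1: 0.2, 2: 0.25, 3: 0.15}
i, j, S = 0, 1, [0, 1, 2, 3]
wins_i = wins_ij = 0
for _ in range(200_000):
    w = sample_pl_winner(theta, S, rng)
    if w in (i, j):
        wins_ij += 1
        wins_i += (w == i)
print(wins_i / wins_ij)  # ≈ 0.4 / (0.4 + 0.2) = 2/3, independent of items 2 and 3
```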

## 3 Problem Setup

We consider the PAC version of the sequential decision-making problem of finding the best item in a set of n items by making subset-wise comparisons. Formally, the learner is given a finite set [n] of n arms. At each decision round t = 1, 2, …, the learner selects a subset S_t ⊆ [n] of k distinct items, and receives (stochastic) feedback depending on (a) the chosen subset S_t, and (b) a Plackett-Luce (PL) choice model with parameters θ = (θ_1, …, θ_n), a priori unknown to the learner. The nature of the feedback can be of several types, as described in Section 3.1. Without loss of generality, we will henceforth assume θ_i ∈ (0, 1] for all i ∈ [n], since the PL choice probabilities are positive scale-invariant by (1). We also let θ_1 > θ_i for all i ∈ [n] \ {1}, for ease of exposition (we naturally assume that this identity is not known to the learning algorithm, and note that the extension to the case where several items have the same highest parameter value is easily accomplished). We call this decision-making model, parameterized by a PL instance θ and a playable subset size k, Battling Bandits (BB) with the Plackett-Luce (PL) model, or BB-PL in short. We define a best item to be one with the highest score parameter: i* ∈ argmax_{i ∈ [n]} θ_i. Under the assumptions above, i* = 1 uniquely. Note that here we have p_{1i} > 1/2, ∀i ∈ [n] \ {1}, so item 1 is the Condorcet Winner (Ramamohan et al., 2016) of the PL model.

### 3.1 Feedback models

By feedback model, we mean the information received (from the ‘environment’) once the learner plays a subset of items. We define three types of feedback in the PL battling model:

• Winner of the selected subset (WI): The environment returns a single item I ∈ S, drawn independently from the probability distribution Pr(I = i | S) = θ_i / Σ_{j ∈ S} θ_j, i ∈ S.

• Full ranking of the selected subset of items (FR): The environment returns a full ranking σ ∈ Σ_S, drawn from the probability distribution Pr(σ | S) = Π_{i=1}^{|S|} θ_{σ(i)} / Σ_{j=i}^{|S|} θ_{σ(j)}, σ ∈ Σ_S. In fact, this is equivalent to picking σ(1) according to the winner (WI) feedback from S, then picking σ(2) according to WI feedback from S \ {σ(1)}, and so on, until all elements from S are exhausted; or, in other words, successively sampling |S| winners from S according to the PL model, without replacement.

A feedback model that generalizes the types of feedback above is:

• Top-m ranking of items (TR-m or TR): The environment returns a ranking of only m items from among S; i.e., the environment first draws a full ranking σ over S according to Plackett-Luce as in FR above, and returns the first m ranked elements of σ, i.e., (σ(1), …, σ(m)). It can be seen that for each such ordered tuple σ̃ of m distinct items of S, we must have Pr(σ̃ | S) = Π_{i=1}^{m} θ_{σ̃(i)} / (Σ_{j ∈ S} θ_j − Σ_{ℓ=1}^{i−1} θ_{σ̃(ℓ)}). Generating such a σ̃ is also equivalent to successively sampling m winners from S according to the PL model, without replacement. It follows that TR reduces to FR when m = k and to WI when m = 1.
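The successive-winner-sampling view above makes all three feedback models easy to simulate with one routine: drawing PL winners without replacement m times yields TR-m, with m = 1 giving WI and m = |S| giving FR. A minimal sketch (the function name is ours):

```python
import random

def top_m_feedback(theta, S, m, rng=random):
    """TR-m feedback: successively sample PL winners from S without
    replacement, m times; m = 1 is WI and m = len(S) is FR."""
    pool = list(S)
    ranking = []
    for _ in range(min(m, len(pool))):
        r = rng.random() * sum(theta[i] for i in pool)
        acc = 0.0
        for i in pool:
            acc += theta[i]
            if r <= acc:
                ranking.append(i)  # next winner among the remaining items
                pool.remove(i)
                break
    return ranking
```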

### 3.2 Performance Objective: Correctness and Sample Complexity

Suppose θ = (θ_1, …, θ_n) and a playable subset size k ≤ n define a BB-PL instance with best arm 1, and ϵ, δ ∈ (0, 1) are given constants. An arm i is said to be ϵ-optimal (informally, a ‘near-best’ arm) if the probability that i beats 1 is over 1/2 − ϵ, i.e., if p_{i1} > 1/2 − ϵ. A sequential algorithm that operates in this BB-PL instance, using feedback from an appropriate subset-wise feedback model (e.g., WI, FR or TR), is said to be (ϵ, δ)-PAC if (a) it stops and outputs an arm I ∈ [n] after a finite number of decision rounds (subset plays) with probability 1, and (b) the probability that its output I is an ϵ-optimal arm is at least 1 − δ, i.e., Pr(p_{I1} > 1/2 − ϵ) ≥ 1 − δ. Furthermore, by the sample complexity of the algorithm, we mean the expected time (number of decision rounds) taken by the algorithm to stop.

Note that p_{i1} = θ_i/(θ_i + θ_1), so the score parameter of a near-best item must be at least (1 − 2ϵ)/(1 + 2ϵ) times θ_1.
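Unpacking the ϵ-optimality condition in terms of the PL parameters:

```latex
p_{i1} \;=\; \frac{\theta_i}{\theta_i + \theta_1} \;\ge\; \frac{1}{2} - \epsilon
\quad\Longleftrightarrow\quad
\theta_i \;\ge\; \frac{\tfrac{1}{2} - \epsilon}{\tfrac{1}{2} + \epsilon}\,\theta_1
\;=\; \frac{1 - 2\epsilon}{1 + 2\epsilon}\,\theta_1 .
```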

## 4 Analysis with Winner Information (WI) feedback

In this section we consider the (ϵ, δ)-PAC objective with the WI feedback information model in BB-PL instances of size n with playable subset size k. We start by showing that a sample complexity lower bound for any (ϵ, δ)-PAC algorithm with WI feedback is Ω((n/ϵ²) ln(1/δ)) (Theorem 2). This bound is independent of k, implying that playing a dueling game (k = 2) is as good as the battling game, as the extra flexibility of k-subset-wise feedback does not result in a faster learning rate. We next propose two algorithms for (ϵ, δ)-PAC with WI feedback, with optimal (up to a logarithmic factor) sample complexity (Section 4.2). We also analyze a slightly different setting allowing the learner to play subsets of any size up to k, rather than a fixed size k – this gives somewhat more flexibility to the learner, resulting in an algorithm with improved sample complexity guarantee of O((n/ϵ²) log(1/δ)), without the log n dependency as before (Section 4.3).

### 4.1 Lower Bound for Winner Information (WI) feedback

###### Theorem 2 (Lower bound on Sample Complexity with WI feedback).

Given ϵ, δ ∈ (0, 1), and an (ϵ, δ)-PAC algorithm A for BB-PL with feedback model WI, there exists a PL instance ν such that the sample complexity of A on ν is at least Ω((n/ϵ²) ln(1/δ)).

###### Proof.

(sketch). The argument is based on a change-of-measure argument (Lemma 1) of Kaufmann et al. (2016), restated below for convenience:

Consider a multi-armed bandit (MAB) problem with arms or actions 1, …, n. At round t, let A_t and X_t denote the arm played and the observation (reward) received, respectively. Let F_t = σ(A_1, X_1, …, A_t, X_t) be the sigma algebra generated by the trajectory of a sequential bandit algorithm up to round t.

###### Lemma 3 (Lemma 1, Kaufmann et al. (2016)).

Let ν and ν′ be two bandit models (assignments of reward distributions to arms), such that ν_i (resp. ν′_i) is the reward distribution of arm i under bandit model ν (resp. ν′), and such that for all arms i, ν_i and ν′_i are mutually absolutely continuous. Then, for any almost-surely finite stopping time τ with respect to (F_t)_t,

$$\sum_{i=1}^{n} \mathbb{E}_{\nu}[N_i(\tau)]\, KL(\nu_i, \nu'_i) \;\ge\; \sup_{E \in \mathcal{F}_\tau} kl\big(\Pr_{\nu}(E), \Pr_{\nu'}(E)\big),$$

where kl(x, y) := x log(x/y) + (1 − x) log((1 − x)/(1 − y)) is the binary relative entropy, N_i(τ) denotes the number of times arm i is played in τ rounds, and Pr_ν(E) and Pr_{ν′}(E) denote the probability of any event E ∈ F_τ under bandit models ν and ν′, respectively.

To employ this result, note that in our case each bandit instance corresponds to an instance of the BB-PL problem with the arm set containing all subsets of [n] of size k: A = {S ⊆ [n] : |S| = k}. The key part of our proof relies on carefully crafting a true instance ν¹, with optimal arm 1, and a family of slightly perturbed alternative instances {ν^a : a ∈ [n] \ {1}}, each with optimal arm a.

We choose the true problem instance ν¹ as the Plackett-Luce model with parameters

$$\theta_j = \theta\Big(\frac{1}{2} - \epsilon\Big),\ \forall j \in [n] \setminus \{1\}, \quad \text{and} \quad \theta_1 = \theta\Big(\frac{1}{2} + \epsilon\Big), \qquad \text{(true instance)}$$

for some θ > 0. Corresponding to each suboptimal item a ∈ [n] \ {1}, we now define an alternative problem instance ν^a as the Plackett-Luce model with parameters

$$\theta'_j = \theta\Big(\frac{1}{2} - \epsilon\Big)^2,\ \forall j \in [n] \setminus \{a, 1\}, \qquad \theta'_1 = \theta\Big(\frac{1}{4} - \epsilon^2\Big), \qquad \theta'_a = \theta\Big(\frac{1}{2} + \epsilon\Big)^2. \qquad \text{(alternative instance)}$$

The result of Theorem 2 is now obtained by applying Lemma 3 on pairs of problem instances (ν¹, ν^a), with suitable upper bounds on the KL-divergence terms, and the observation that, for the event E that the algorithm outputs arm 1, an (ϵ, δ)-PAC algorithm must satisfy Pr_{ν¹}(E) ≥ 1 − δ and Pr_{ν^a}(E) ≤ δ. The complete proof is given in Appendix B.1. ∎
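A quick numerical sanity check of the construction (the values of n, a, θ and ϵ below are illustrative): arm 1 is the unique maximizer in the true instance, arm a in the alternative instance, and in the alternative instance item 1 sits exactly at the ϵ-optimality boundary against a, since θ'_1/(θ'_1 + θ'_a) = 1/2 − ϵ:

```python
# Lower-bound construction of Theorem 2; eps, theta, n, a are illustrative.
eps, theta, n, a = 0.05, 1.0, 5, 3
true = {j: theta * (0.5 - eps) for j in range(1, n + 1)}
true[1] = theta * (0.5 + eps)
alt = {j: theta * (0.5 - eps) ** 2 for j in range(1, n + 1)}
alt[1] = theta * (0.25 - eps ** 2)
alt[a] = theta * (0.5 + eps) ** 2
assert max(true, key=true.get) == 1            # arm 1 optimal in the true instance
assert max(alt, key=alt.get) == a              # arm a optimal in the alternative
assert abs(alt[1] / (alt[1] + alt[a]) - (0.5 - eps)) < 1e-12  # p'_{1a} = 1/2 - eps
```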

###### Remark 2.

Theorem 2 shows, rather surprisingly, that the PAC sample complexity of identifying a near-optimal item with only winner feedback information from k-size subsets does not reduce with k, implying that there is no reduction in the hardness of learning compared to the pairwise comparison case (k = 2). On one hand, one may expect improved sample complexity as the number of items being simultaneously tested in each round is large (k > 2). On the other hand, the sample complexity could also worsen, since it is intuitively ‘harder’ for a good (near-optimal) item to win and show itself, in just a single winner draw, against a large population of (k − 1) other competitors. The result, in a sense, formally establishes that the former advantage is nullified by the latter drawback. A somewhat more formal, but heuristic, explanation for this phenomenon is that the number of bits of information that a single winner draw from a size-k subset provides is at most log₂ k, which is not significantly larger than the single bit revealed when k = 2; thus, an algorithm cannot accumulate significantly more information per round compared to the pairwise case.

### 4.2 Algorithms for Winner Information (WI) feedback model

This section describes our proposed algorithms for the (ϵ, δ)-PAC objective with winner information (WI) feedback.

Principles of algorithm design. The key idea on which all our learning algorithms are based is that of maintaining estimates of the pairwise win-loss probabilities p_{ij} in the Plackett-Luce model. This helps circumvent the combinatorial explosion that would otherwise result if we directly attempted to estimate probability distributions for each possible k-size subset. However, it is not obvious whether consistent and tight pairwise estimates can be constructed in a general subset-wise choice model; here the special form of the Plackett-Luce model again comes to our rescue. The IIA property that the PL model enjoys allows for accurate pairwise estimates via interpretation of partial preference feedback as a set of pairwise preferences, e.g., a winner i sampled from among S is interpreted as the set of pairwise preferences {i ≻ j : j ∈ S \ {i}}. Lemma 1 formalizes this property and allows us to use pairwise win/loss probability estimators with explicit confidence intervals for them.

Algorithm 1: (Trace-the-Best). Our first algorithm Trace-the-Best  is based on the simple idea of tracing the empirically best item – specifically, it maintains a running winner c_ℓ at every iteration ℓ, making it battle with a set of k − 1 arbitrarily chosen items. After battling long enough (precisely, for O((k/ϵ²) log(n/δ)) rounds), if the empirical winner w turns out to be more than ϵ/2-favorable over the running winner c_ℓ, in terms of its pairwise preference score, p̂_{w c_ℓ} > 1/2 + ϵ/2, then w replaces c_ℓ; or else c_ℓ retains its place and the status quo ensues.
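A simplified sketch of this successive-battling idea (not the authors' exact pseudocode; the batch size `t` and the ϵ/2 promotion margin are assumptions chosen to make the sketch self-contained):

```python
import math
import random

def sample_pl_winner(theta, S, rng):
    """Draw the PL winner of S: Pr(i|S) = theta[i] / sum_{k in S} theta[k]."""
    r = rng.random() * sum(theta[i] for i in S)
    acc = 0.0
    for i in S:
        acc += theta[i]
        if r <= acc:
            return i
    return S[-1]

def trace_the_best(theta, k, eps, delta, rng):
    """Keep a running winner, battle it against k-1 fresh items per iteration,
    and promote the batch's empirical winner only if its rank-broken pairwise
    estimate against the incumbent clears the 1/2 + eps/2 margin."""
    n = len(theta)
    items = list(range(n))
    rng.shuffle(items)
    champ, rest = items[0], items[1:]
    t = math.ceil(2 * k / eps ** 2 * math.log(2 * n / delta))  # assumed batch size
    while rest:
        group, rest = [champ] + rest[:k - 1], rest[k - 1:]
        wins = {i: 0 for i in group}
        for _ in range(t):
            wins[sample_pl_winner(theta, group, rng)] += 1
        w = max(wins, key=wins.get)
        # rank-broken estimate of the pairwise score p_{w, champ}
        if w != champ and wins[w] / max(wins[w] + wins[champ], 1) > 0.5 + eps / 2:
            champ = w
    return champ
```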

###### Theorem 4 (Trace-the-Best: Correctness and Sample Complexity with WI).

Trace-the-Best  (Algorithm 1) is (ϵ, δ)-PAC with sample complexity O((n/ϵ²) log(n/δ)).

###### Proof.

(sketch). The main idea is to retain an estimated best item as a ‘running winner’ c_ℓ, and compare it with the ‘empirical best item’ w of the current batch at every iteration ℓ. The crucial observation lies in noting that, at any iteration ℓ, the running winner gets updated as follows:

###### Lemma 5.

At any iteration ℓ, with probability at least 1 − δ/n, Algorithm 1 retains the running winner (c_{ℓ+1} = c_ℓ) if p_{w c_ℓ} < 1/2, and sets c_{ℓ+1} = w if p_{w c_ℓ} > 1/2 + ϵ.

This leads to the claim that, between any two successive iterations ℓ and ℓ + 1, we must have, with high probability, that p_{c_{ℓ+1} c_ℓ} ≥ 1/2, showing that the estimated ‘best’ item can only improve per iteration (with high probability at least 1 − δ/n). Repeating this argument for each iteration results in the desired correctness guarantee Pr(p_{I1} > 1/2 − ϵ) ≥ 1 − δ. The sample complexity bound follows easily by noting that the total number of possible iterations can be at most ⌈n/(k − 1)⌉, with the per-iteration sample complexity being O((k/ϵ²) log(n/δ)). ∎

###### Remark 3.

The sample complexity of Trace-the-Best  is order-wise optimal when δ ≤ 1/n, as follows from our derived lower bound guarantee (Theorem 2).

When δ > 1/n, the sample complexity guarantee of Trace-the-Best  is off by a factor of log n. We now propose another algorithm, Divide-and-Battle  (Algorithm 2), that enjoys an (ϵ, δ)-PAC sample complexity of O((n/ϵ²) log(1/δ)).

Algorithm 2: (Divide-and-Battle). Divide-and-Battle  first divides the set of n items into groups of size k, and plays each group long enough so that a good item in the group stands out as the empirical winner with high probability. It then retains the empirical winner per group and recurses on the retained set of winners, until it is left with only a single item, which is finally declared as the ϵ-optimal item. The pseudo-code of Divide-and-Battle  is given in Appendix B.3.
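A sketch of the divide-and-recurse structure (again not the paper's exact pseudocode; the per-phase budgets ϵ_ℓ, δ_ℓ below are assumed schedules, chosen only so that the tolerances sum to at most ϵ and the failure probabilities to at most δ):

```python
import math
import random

def sample_pl_winner(theta, S, rng):
    """Draw the PL winner of S: Pr(i|S) = theta[i] / sum_{k in S} theta[k]."""
    r = rng.random() * sum(theta[i] for i in S)
    acc = 0.0
    for i in S:
        acc += theta[i]
        if r <= acc:
            return i
    return S[-1]

def divide_and_battle(theta, k, eps, delta, rng):
    """Partition the surviving items into groups of at most k, play each group,
    keep only its empirical winner, and recurse until one item remains."""
    alive = list(range(len(theta)))
    l = 1
    while len(alive) > 1:
        eps_l, delta_l = eps / 2 ** (l + 1), delta / 2 ** l  # assumed schedules
        t = math.ceil(2.0 / eps_l ** 2 * math.log(2 * len(alive) / delta_l))
        survivors = []
        for g in range(0, len(alive), k):
            group = alive[g:g + k]
            if len(group) == 1:
                survivors.append(group[0])
                continue
            wins = {i: 0 for i in group}
            for _ in range(t):
                wins[sample_pl_winner(theta, group, rng)] += 1
            survivors.append(max(wins, key=wins.get))  # group winner advances
        alive, l = survivors, l + 1
    return alive[0]
```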

###### Theorem 6 (Divide-and-Battle: Correctness and Sample Complexity with WI).

Divide-and-Battle  (Algorithm 2) is (ϵ, δ)-PAC with sample complexity O((n/ϵ²) log(1/δ)).

###### Proof.

(sketch). The crucial observation here is that at any iteration ℓ, for any group G of at most k items, the item retained by the algorithm is likely to be not more than ϵ_ℓ-worse than the best item of the group G, with probability at least 1 − δ_ℓ, where ϵ_ℓ and δ_ℓ are iteration-dependent tolerance and confidence parameters. Precisely, we show that:

###### Lemma 7.

At any iteration ℓ, for any group G, if iᴳ denotes the best item of G (the one with the highest PL parameter) and cᴳ the item retained from G, then with probability at least 1 − δ_ℓ, p_{cᴳ iᴳ} ≥ 1/2 − ϵ_ℓ.

This guarantees that, between any two successive iterations ℓ and ℓ + 1, we do not lose out by more than an additive factor of ϵ_ℓ in terms of the highest score parameter of the remaining set of items. Aggregating this claim over all iterations, with the ϵ_ℓ chosen to sum to at most ϵ and the δ_ℓ to at most δ, can be made to show that Pr(p_{I1} > 1/2 − ϵ) ≥ 1 − δ, as desired. The sample complexity bound follows by carefully summing the total number of times (t_ℓ) a set is played per iteration ℓ, with the maximum number of possible iterations being ⌈log_k n⌉. ∎

###### Remark 4.

The sample complexity of Divide-and-Battle  is order-wise optimal in the ‘small-δ’ regime δ ≤ 1/n by the lower bound result (Theorem 2). However, for the ‘moderate-δ’ regime δ > 1/n, we conjecture that the lower bound is loose by an additive logarithmic factor, i.e., that an improved lower bound of Ω((n/ϵ²) ln(n/δ)) holds, at least for a natural class of algorithms. This is primarily because we believe that the error probability of any typical, label-invariant PAC algorithm ought to be distributed roughly uniformly across misidentification of all the n items, allowing us to use δ/n instead of δ on the right-hand side of the change-of-measure inequalities of Lemma 3, resulting in the improved quantity ln(n/δ). This is perhaps in line with recent work in multi-armed bandits (Simchowitz et al., 2017) that points to an increased difficulty of PAC identification in the moderate-confidence regime.

We now consider a variant of the BB-PL decision model which allows the learner to play sets of any size up to k, instead of a fixed size k. In this setting, we are indeed able to design an (ϵ, δ)-PAC algorithm that enjoys an order-optimal sample complexity.

### 4.3 BB-PL2: A slightly different battling bandit decision model

The new winner information feedback model BB-PL2 is formally defined as follows: at each round t, the learner is allowed to select a set S_t of size up to k. Upon receiving any set S_t, the environment returns the index of the winning item i_t ∈ S_t such that Pr(i_t = i | S_t) = θ_i / Σ_{j ∈ S_t} θ_j, i ∈ S_t.

On applying existing PAC-Dueling-Bandit strategies. Note that, given the flexibility of playing sets of any size, one might as well hope to apply the PAC-Dueling-Bandit algorithm PLPAC of Szörényi et al. (2015), which plays only pairs of items per round. However, their algorithm is shown to have a sample complexity guarantee of O((n/ϵ²) log(n/δ)), which is suboptimal by an additive O((n/ϵ²) log n) term, as our results will show. A similar observation holds for the Beat-the-Mean (BTM) algorithm of Yue and Joachims (2011), which in fact has an even worse sample complexity guarantee.

Algorithm 3: Halving-Battle. We here propose a median-elimination-based approach (Even-Dar et al., 2006) which is shown to run with optimal sample complexity O((n/ϵ²) log(1/δ)) rounds (Theorem 8). (Note that an Ω((n/ϵ²) ln(1/δ)) fundamental limit on PAC sample complexity for BB-PL2-WI can easily be derived using an argument along the lines of Theorem 2; we omit the explicit derivation.) The name Halving-Battle  reflects that the algorithm is based on the idea of dividing a set of items into two partitions with respect to the empirical median item and retaining the ‘better half’. Specifically, it first divides the entire item set into groups of size k, and plays each group for a fixed number of times. After this step, only the items that won more than the empirical median item of their group are retained and the rest are discarded. The algorithm recurses until it is left with a single item. The intuition here is that some ϵ-best item is always likely to beat the group median and can never get wiped off.
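A sketch of the median-pruning idea (the per-phase budgets are again assumptions, not the paper's exact constants; the paper's Algorithm 3 should be consulted for the precise schedule):

```python
import math
import random

def sample_pl_winner(theta, S, rng):
    """Draw the PL winner of S: Pr(i|S) = theta[i] / sum_{k in S} theta[k]."""
    r = rng.random() * sum(theta[i] for i in S)
    acc = 0.0
    for i in S:
        acc += theta[i]
        if r <= acc:
            return i
    return S[-1]

def halving_battle(theta, k, eps, delta, rng):
    """Play each group of at most k items a fixed number of times, then keep
    only the half of the group with the larger empirical win counts, so the
    surviving population roughly halves per phase."""
    alive = list(range(len(theta)))
    l = 1
    while len(alive) > 1:
        eps_l, delta_l = eps / 2 ** (l + 1), delta / 2 ** l  # assumed schedules
        t = math.ceil(2.0 / eps_l ** 2 * math.log(2 * len(alive) / delta_l))
        survivors = []
        for g in range(0, len(alive), k):
            group = alive[g:g + k]
            if len(group) == 1:
                survivors.append(group[0])
                continue
            wins = {i: 0 for i in group}
            for _ in range(t):
                wins[sample_pl_winner(theta, group, rng)] += 1
            ranked = sorted(group, key=lambda i: wins[i], reverse=True)
            survivors.extend(ranked[:(len(group) + 1) // 2])  # better half survives
        alive, l = survivors, l + 1
    return alive[0]
```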

###### Theorem 8 (Halving-Battle: Correctness and Sample Complexity with WI).

Halving-Battle  (Algorithm 3) is (ϵ, δ)-PAC with sample complexity O((n/ϵ²) log(1/δ)).

###### Proof.

(sketch). The sample complexity bound follows by carefully summing the total number of times the sets are played in each iteration $\ell$, the maximum possible number of iterations being $\lceil \log_2 n \rceil$ (this is because the set of remaining items is halved at each iteration, as it is pruned with respect to its median). The key step in proving the correctness property of Halving-Battle  lies in showing that at any iteration $\ell$, Halving-Battle  always carries forward at least one 'near-best' item to the next iteration $\ell+1$.

###### Lemma 9.

At any iteration $\ell$, for any set $\mathcal{G}$ under consideration, let $i^* := \arg\max_{i \in \mathcal{G}} \theta_i$, and consider any suboptimal item $j \in \mathcal{G}$ such that $\theta_{i^*} - \theta_j > \epsilon$. Then, with probability at least $1-\delta$, the empirical win count of $i^*$ lies above that of $j$, i.e. $w_{i^*} \ge w_j$.

Using the property of the median element along with Lemma 9 and Markov’s inequality, we show that between any two successive iterations $\ell$ and $\ell+1$, we do not lose more than a small additive amount $\epsilon_\ell$ (with $\sum_\ell \epsilon_\ell \le \epsilon$) in terms of the highest score of the remaining set of items. This finally leads to the desired $(\epsilon,\delta)$-PAC correctness of Halving-Battle. ∎
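The summation in the sketch above can be made concrete under one standard Median-Elimination-style parameterization (an illustrative choice of constants, not necessarily the paper's exact ones): set, for iteration $\ell = 1, 2, \ldots, \lceil \log_2 n \rceil$,

$$\epsilon_\ell = \frac{\epsilon}{4}\Big(\frac{3}{4}\Big)^{\ell-1}, \qquad \delta_\ell = \frac{\delta}{2^\ell},$$

so that $\sum_\ell \epsilon_\ell \le \epsilon$ and $\sum_\ell \delta_\ell \le \delta$, while the total number of plays is

$$\sum_{\ell=1}^{\lceil \log_2 n \rceil} O\Big(\frac{n}{2^{\ell-1}} \cdot \frac{1}{\epsilon_\ell^2}\ln\frac{1}{\delta_\ell}\Big) = O\Big(\frac{n}{\epsilon^2}\ln\frac{1}{\delta}\Big),$$

since the surviving-item count $n/2^{\ell-1}$ decays by a factor $\frac{1}{2}$ per iteration while $1/\epsilon_\ell^2$ grows only by $\frac{16}{9}$, so the per-iteration cost decays geometrically like $\big(\frac{8}{9}\big)^{\ell}$.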

###### Remark 5.

Theorem 8 shows that the sample complexity guarantee of Halving-Battle improves over that of the existing PLPAC algorithm for the same objective in the dueling bandit setup ($k=2$), which was shown to be $O\big(\frac{n}{\epsilon^2}\ln\frac{n}{\delta}\big)$ (Szörényi et al., 2015), and also over the complexity of the BTM algorithm (Yue and Joachims, 2011) for dueling feedback from any pairwise preference matrix satisfying relaxed stochastic transitivity and the stochastic triangle inequality (of which the PL model is a special case).

## 5 Analysis with Top Ranking (TR) feedback

We now proceed to analyze the BB-PL problem with Top-$m$ Ranking (TR) feedback (Section 3.1). We first show that, unlike WI feedback, the sample complexity lower bound here scales as $\Omega\big(\frac{n}{m\epsilon^2}\ln\frac{1}{\delta}\big)$ (Theorem 10), which is a factor $\frac{1}{m}$ smaller than that in Thm. 2 for the WI feedback model. At a high level, this is because TR reveals the preference information of $m$ items per feedback step (round of battle), as opposed to just a single (noisy) sample of the winning item (WI). Following this, we also present two algorithms for this setting, which are shown to enjoy an optimal (up to logarithmic factors) sample complexity guarantee of $O\big(\frac{n}{m\epsilon^2}\ln\frac{1}{\delta}\big)$ (Section 5.2).

### 5.1 Lower Bound for Top-m Ranking (TR) feedback

###### Theorem 10 (Sample Complexity Lower Bound for TR).

Given $\epsilon, \delta \in (0,1)$, and an $(\epsilon,\delta)$-PAC algorithm $\mathcal{A}$ with top-$m$ ranking (TR) feedback ($2 \le m \le k$), there exists a PL instance $\nu$ such that the expected sample complexity of $\mathcal{A}$ on $\nu$ is at least $\Omega\big(\frac{n}{m\epsilon^2}\ln\frac{1}{\delta}\big)$.

###### Remark 6.

The sample complexity lower bound for the $(\epsilon,\delta)$-PAC objective for BB-PL with the top-$m$ ranking (TR) feedback model is $\frac{1}{m}$-times that of the WI model (Thm. 2). Intuitively, revealing a ranking of the top $m$ items of a $k$-set provides about $m \ln k$ bits of information per round, which is about $m$ times as large as that of revealing a single winner ($\ln k$ bits), yielding an acceleration factor of $m$.
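One way to make this information count concrete (a heuristic calculation, valid when $m \ll k$): a top-$m$ ranking of a $k$-set takes one of $k(k-1)\cdots(k-m+1)$ possible values, so the information it carries per round is about

$$\ln\big(k(k-1)\cdots(k-m+1)\big) \;=\; \sum_{i=0}^{m-1} \ln(k-i) \;\approx\; m \ln k,$$

i.e. roughly $m$ times the $\ln k$ conveyed by a single winner draw, which takes one of only $k$ possible values.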

###### Corollary 11.

Given $\epsilon, \delta \in (0,1)$, and an $(\epsilon,\delta)$-PAC algorithm $\mathcal{A}$ with full ranking (FR) feedback ($m = k$), there exists a PL instance $\nu$ such that the expected sample complexity of $\mathcal{A}$ on $\nu$ is at least $\Omega\big(\frac{n}{k\epsilon^2}\ln\frac{1}{\delta}\big)$.

### 5.2 Algorithms for Top-m Ranking (TR) feedback model

This section presents two algorithms for the $(\epsilon,\delta)$-PAC objective for BB-PL with top-$m$ ranking feedback. We achieve this by generalizing our earlier two proposed algorithms (see Algorithms 1 and 2, Sec. 4.2, for WI feedback) to the top-$m$ ranking (TR) feedback mechanism.³ (³Our third algorithm, Halving-Battle, is not applicable to TR feedback, since it plays sets of sizes smaller than $m$, whereas TR feedback is defined only when the size of the subset played is at least $m$. The lower bound analysis of Theorem 10 also does not apply if sets of size less than $m$ are allowed.)

Rank-Breaking. The main trick we use in adapting the above algorithms to TR feedback is Rank-Breaking (Soufiani et al., 2014), which essentially extracts pairwise comparisons from multiwise (subsetwise) preference information. Formally, given any set $S$ of size $|S| = k$, if $\sigma$ denotes a possible top-$m$ ranking of $S$, the Rank-Breaking subroutine considers each item in $S$ to be beaten by all of its preceding items in $\sigma$ in a pairwise sense. For instance, given a full ranking of a set of $4$ elements $S = \{a, b, c, d\}$, say $b \succ a \succ c \succ d$, Rank-Breaking generates the set of $6$ pairwise comparisons $\{(b \succ a), (b \succ c), (b \succ d), (a \succ c), (a \succ d), (c \succ d)\}$. Similarly, given the ranking of only the $m = 2$ most preferred items, say $b \succ a$, it yields the pairwise comparisons $(b \succ a), (b \succ c), (b \succ d)$ and $(a \succ c), (a \succ d)$: the unranked items $c$ and $d$ are counted as beaten by both ranked items. See Algorithm 4 for a detailed description of the Rank-Breaking procedure.
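A minimal sketch of the Rank-Breaking step (item labels and the function name are illustrative; Algorithm 4 gives the exact procedure used by the algorithms):

```python
def rank_break(subset, top_m_ranking):
    """Rank-Breaking: extract pairwise comparisons from a top-m
    ranking over `subset`.  Each ranked item beats every item
    ranked after it, as well as every unranked item of the subset."""
    ranked = list(top_m_ranking)
    unranked = [i for i in subset if i not in ranked]
    pairs = []
    for pos, winner in enumerate(ranked):
        # `winner` beats all items ranked below it ...
        for loser in ranked[pos + 1:]:
            pairs.append((winner, loser))
        # ... and all items that did not make the top-m positions
        for loser in unranked:
            pairs.append((winner, loser))
    return pairs
```

A top-$m$ ranking of a $k$-set thus yields $\binom{m}{2} + m(k-m)$ pairwise comparisons per round, which is the raw material for the empirical pairwise estimates maintained by the algorithms.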

###### Lemma 12 (Rank-Breaking Update).

Consider any subset $S \subseteq [n]$ with $|S| = k$. Let $S$ be played for $t$ rounds of battle, and let $\sigma_\tau$ denote the TR feedback at each round $\tau \in \{1, \ldots, t\}$. For each item $i \in S$, let $q_i$ be the number of times $i$ appears in the top-$m$ ranked output over the $t$ rounds. Then the most frequent item(s) in the top-$m$ positions must appear at least $\frac{tm}{k}$ times, i.e. $\max_{i \in S} q_i \ge \frac{tm}{k}$.

Proposed algorithms for TR feedback. The formal descriptions of our two algorithms, Trace-the-Best  and Divide-and-Battle , generalized to the setting of TR feedback, are given as Algorithm 5 and Algorithm 6, respectively. They essentially maintain the empirical pairwise preferences for each pair of items by applying Rank-Breaking on the TR feedback after each round of battle. Of course, in general, Rank-Breaking may lead to arbitrarily inconsistent estimates of the underlying model parameters (Azari et al., 2012). However, owing to the IIA property of the Plackett-Luce model, we obtain clean concentration guarantees using Lemma 1. This, along with Lemma 12, is precisely the idea used for obtaining the $\frac{1}{m}$ factor improvement in the sample complexity guarantees of our proposed algorithms (see proofs of Theorems 13 and 14).

###### Theorem 13 (Trace-the-Best: Correctness and Sample Complexity with TR).

With the top-$m$ ranking (TR) feedback model, Trace-the-Best  (Algorithm 5) is $(\epsilon,\delta)$-PAC with sample complexity $O\big(\frac{n}{m\epsilon^2}\ln\frac{n}{\delta}\big)$.

###### Theorem 14 (Divide-and-Battle: Correctness and Sample Complexity with TR).

With the top-$m$ ranking (TR) feedback model, Divide-and-Battle  (Algorithm 6) is $(\epsilon,\delta)$-PAC with sample complexity $O\big(\frac{n}{m\epsilon^2}\ln\frac{1}{\delta}\big)$.

###### Remark 7.

The sample complexity bounds of the above two algorithms are a $\frac{1}{m}$ fraction smaller than their corresponding counterparts for WI feedback, as follows from comparing Theorem 4 vs. 13, or Theorem 6 vs. 14; i.e., they admit a faster learning rate with TR feedback. As in the case of WI feedback, the sample complexity of Divide-and-Battle  is still order-wise optimal for any $2 \le m \le k$, as follows from the lower bound guarantee (Theorem 10). However, we believe that the above lower bound can be tightened by a logarithmic factor for 'moderate' $k$, for reasons similar to those stated in Remark 4.