1 Introduction
The dueling bandit problem has recently gained attention in the machine learning community
(Yue et al., 2012; Ailon et al., 2014; Zoghi et al., 2014; Szörényi et al., 2015). This is a variant of the multi-armed bandit problem (Auer et al., 2002) in which the learner needs to identify a ‘best arm’ from pairwise comparisons between arms. In this work, we consider a natural generalization of the dueling bandit problem where the learner can adaptively select a subset of $k$ arms ($k \ge 2$) in each round, and observe relative preferences within the subset following a Plackett-Luce (PL) feedback model (Marden, 1996), with the objective of learning the ‘best arm’. We call this the battling bandit problem with the Plackett-Luce model.
The battling bandit decision framework (Saha and Gopalan, 2018; Chen et al., 2018) models several application domains where it is possible to elicit feedback about preferred options from among a general set of offered options, instead of being able to compare only two options at a time as in the dueling setup. Furthermore, the phenomenon of competition, namely that an option’s utility or attractiveness is often assessed relative to that of other items in the offering, is captured effectively by a subset-dependent stochastic choice model such as Plackett-Luce. Common examples of learning settings with such feedback include recommendation systems, search engines, medical interviews and tutoring systems: any application where relative preferences from a chosen pool of options are revealed.
We consider a natural probably approximately correct (PAC) learning problem in the battling bandit setting: output an $\epsilon$-approximate best item (with respect to its Plackett-Luce parameter) with probability at least $1 - \delta$, while keeping the total number of adaptive exploration rounds small. We term this the $(\epsilon, \delta)$-PAC objective of searching for an approximate winner or top item.
Our primary interest lies in understanding how the subset size $k$ influences the sample complexity of achieving the PAC objective in subset choice models for various feedback information structures, e.g., winner information (WI), which returns only a single winner of the chosen subset, or the more general top-$m$ ranking (TR) information structure, where an ordered tuple of the $m$ ‘most-preferred’ items is observed. More precisely, we ask: does being able to play size-$k$ subsets ($k > 2$) help learn optimal items faster than in the dueling setting ($k = 2$)? How does this depend on the subset size $k$, and on the feedback information structure? How much, if any, does rank-ordered feedback accelerate the rate of learning, compared to only observing winner feedback? This paper takes a step towards resolving such questions within the context of the Plackett-Luce choice model. Among the contributions of this paper are:

We frame a PAC version of Battling Bandits with $n$ arms, a natural generalization of the PAC-Dueling-Bandits problem (Szörényi et al., 2015), with the objective of finding an $\epsilon$-approximate best item with probability at least $1 - \delta$ with minimum possible sample complexity, termed the $(\epsilon, \delta)$-PAC objective (Section 3.2).

We consider learning with winner information (WI) feedback, where the learner can play a subset $S_t$ of exactly $k$ distinct elements at each round $t$, following which a winner of $S_t$ is observed according to an underlying, unknown, Plackett-Luce model. We show an information-theoretic lower bound on the sample complexity for $(\epsilon, \delta)$-PAC of $\Omega\big(\frac{n}{\epsilon^2} \ln \frac{1}{\delta}\big)$ rounds (Section 4.1), which is of the same order as that for the dueling bandit ($k = 2$) (Yue and Joachims, 2011). This implies that, despite the increased flexibility of playing sets of potentially large size $k$, with just winner information feedback one cannot hope for a faster rate of learning than in the case of pairwise selections. Intuitively, competition among a large number ($k$) of elements vying for the top spot at each time exactly offsets the potential gain from being able to test more alternatives together. On the achievable side, we design two algorithms (Section 4.2) for the $(\epsilon, \delta)$-PAC objective, and derive sample complexity guarantees which are optimal within a logarithmic factor of the lower bound derived earlier. When the learner is allowed to play subsets of sizes up to $k$, which is a slightly more flexible setting than the above, we design a median-elimination-based algorithm with order-optimal sample complexity which, when specialized to $k = 2$, improves upon existing sample complexity bounds for PAC-dueling-bandit algorithms, e.g., Yue and Joachims (2011); Szörényi et al. (2015), under the PL model (Section 4.3).

We next study the $(\epsilon, \delta)$-PAC problem in a more general top-$m$ ranking (TR) feedback model, where the learner gets to observe the ranking of the top $m$ items of the played subset, drawn from the Plackett-Luce distribution (Section 3.1), departing from prior work. For $m = 1$, the setting simply boils down to the WI feedback model. In this case, we are able to prove a sample complexity lower bound of $\Omega\big(\frac{n}{m \epsilon^2} \ln \frac{1}{\delta}\big)$ (Theorem 10), which suggests that with top-$m$ ranking (TR) feedback it may be possible to aggregate information $m$ times faster than with just winner information feedback. We further present two algorithms (Section 5.2) for this problem which are shown to enjoy optimal (up to logarithmic factors) sample complexity guarantees. This formally shows that the $m$-fold increase in statistical efficiency from exploiting the richer information contained in top-$m$ ranking feedback is, in fact, algorithmically achievable.

From an algorithmic point of view, we elucidate how the structure of the Plackett-Luce choice model, such as its independence of irrelevant alternatives (IIA) property, plays a crucial role in allowing the development of parameter estimates, together with tight confidence sets, which form the basis for our learning algorithms. It is indeed by leveraging this property (Lemma 1) that we can afford to maintain consistent pairwise preferences of the items by applying the concept of Rank Breaking to subset-wise preference data. This significantly alleviates the combinatorial explosion that could otherwise result if one were to keep more general subset-wise estimates.
Related Work: Statistical parameter estimation in Plackett-Luce models has been studied in detail in the offline, batch (non-adaptive) setting (Chen and Suh, 2015; Khetan and Oh, 2016; Jang et al., 2017).
In the online setting, there is a fairly mature body of work concerned with PAC best-arm (or top arms) identification in the classical multi-armed bandit (Even-Dar et al., 2006; Audibert and Bubeck, 2010; Kalyanakrishnan et al., 2012; Karnin et al., 2013; Jamieson et al., 2014), where absolute utility information is assumed to be revealed upon playing a single arm or item. Though most work on dueling bandits has focused on the regret minimization goal (Zoghi et al., 2014; Ramamohan et al., 2016), there have been recent developments on the PAC objective for different pairwise preference models, such as those satisfying stochastic triangle inequalities and strong stochastic transitivity (Yue and Joachims, 2011), general utility-based preference models (Urvoy et al., 2013), the Plackett-Luce model (Szörényi et al., 2015), the Mallows model (Busa-Fekete et al., 2014a), etc. Recent work in the PAC setting focuses on learning objectives other than identifying the single (near-) best arm, e.g., recovering a few of the top arms (Busa-Fekete et al., 2013; Mohajer et al., 2017; Chen et al., 2017), or the true ranking of the items (Busa-Fekete et al., 2014b; Falahatgar et al., 2017).
The work which is perhaps closest in spirit to ours is that of Chen et al. (2018), which addresses the problem of learning the top set of items in Plackett-Luce battling bandits. Even when specialized to the single best item (as we consider here), however, this work differs in several important aspects from what we attempt. Chen et al. (2018) develop algorithms for the probably exactly correct objective (where recovering a merely near-optimal arm is not favored), and consequently show instance-dependent sample complexity bounds, whereas we allow a tolerance of $\epsilon$ in defining best arms, which is often natural in practice (Szörényi et al., 2015; Yue and Joachims, 2011). As a result, we bring out the dependence of the sample complexity on the specified tolerance level $\epsilon$, rather than on purely instance-dependent measures of hardness. Also, their work considers only winner information (WI) feedback from the chosen subsets, whereas we consider, for the first time, general top-$m$ ranking information feedback.
A related battling-type bandit setting has been studied as the MNL-bandits assortment optimization problem by Agrawal et al. (2016), although it takes the prices of items into account when defining their utilities. As a result, their work optimizes for a subset with the highest expected revenue (price), whereas we search for a best item (Condorcet winner), and the two settings are in general incomparable.
2 Preliminaries
Notation. We denote by $[n]$ the set $\{1, 2, \ldots, n\}$. For any subset $S \subseteq [n]$, let $|S|$ denote the cardinality of $S$. When there is no confusion about the context, we often represent an (unordered) subset $S$ as a vector, or ordered subset, of size $|S|$ (according to, say, a fixed global ordering of all the items $[n]$). In this case, $S(i)$ denotes the item (member) at the $i$th position in subset $S$. $\Sigma_S$ denotes the set of permutations over the items of $S$, where, for any permutation $\sigma \in \Sigma_S$, $\sigma(i)$ denotes the element at the $i$th position in $\sigma$. $\mathbf{1}(\varphi)$ is generically used to denote an indicator variable that takes the value $1$ if the predicate $\varphi$ is true, and $0$ otherwise. $x \vee y$ denotes the maximum of $x$ and $y$, and $Pr(A)$ is used to denote the probability of event $A$, in a probability space that is clear from the context.

2.1 Discrete Choice Models and Plackett-Luce (PL)
A discrete choice model specifies the relative preferences of two or more discrete alternatives in a given set. A widely studied class of discrete choice models is the class of Random Utility Models (RUMs), which assume a ground-truth utility score $\theta_i \in \mathbb{R}$ for each alternative $i \in [n]$, and assign a conditional distribution $\mathcal{D}_i(\cdot \mid \theta_i)$ for scoring item $i$. To model a winning alternative given any set $S \subseteq [n]$, one first draws a random utility score $X_i \sim \mathcal{D}_i(\cdot \mid \theta_i)$ for each alternative $i \in S$, and selects an item with the highest random score.
One widely used RUM is the Multinomial-Logit (MNL) or Plackett-Luce (PL) model, where the $\mathcal{D}_i$s are taken to be independent Gumbel distributions with location parameters $\theta'_i$ and scale parameter $1$ (Azari et al., 2012), which correspond to the probability densities

$$Pr(X_i = x) = e^{-(x - \theta'_i)} e^{-e^{-(x - \theta'_i)}}, \quad i \in [n].$$

Moreover, assuming $\theta'_i = \ln \theta_i$, with $\theta_i > 0$ for all $i \in [n]$, in this case the probability that an alternative $i$ emerges as the winner in the set $S \subseteq [n]$ becomes proportional to its parameter value:

$$Pr(i \mid S) = \frac{\theta_i}{\sum_{j \in S} \theta_j}, \quad \forall i \in S. \qquad (1)$$
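The Gumbel-based RUM view gives a simple way to simulate PL winner draws: add i.i.d. Gumbel(0, 1) noise to the log-parameters and take the argmax, which reproduces the winner distribution in Eqn. (1). The sketch below is purely illustrative (the function names and dict-based parameterization are ours, not the paper's notation):

```python
import math
import random

def winner_probability(theta, S, i):
    """Closed-form winner probability from Eqn. (1):
    P(i wins | S) = theta_i / sum_{j in S} theta_j."""
    return theta[i] / sum(theta[j] for j in S)

def pl_winner(theta, S, rng=random):
    """Sample the Plackett-Luce winner of subset S via the Gumbel-max
    trick: perturb each log-score by independent Gumbel(0, 1) noise
    and return the item with the highest perturbed score."""
    best, best_score = None, -math.inf
    for i in S:
        gumbel = -math.log(-math.log(rng.random()))  # Gumbel(0, 1) draw
        score = math.log(theta[i]) + gumbel
        if score > best_score:
            best, best_score = i, score
    return best
```

Averaging many `pl_winner` draws recovers the closed-form probabilities, which is a quick sanity check that the two views of the model agree.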
We will henceforth refer to the above choice model as a PL model with parameters $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_n)$. Clearly, the above model induces a total ordering on the arm set $[n]$: if $p_{ij} = Pr(i \mid \{i, j\}) = \frac{\theta_i}{\theta_i + \theta_j}$ denotes the pairwise probability of item $i$ being preferred over item $j$, then $p_{ij} \ge \frac{1}{2}$ if and only if $\theta_i \ge \theta_j$; in other words, if $\theta_i \ge \theta_j$ and $\theta_j \ge \theta_k$, then $p_{ik} \ge \frac{1}{2}$ for any triple $i, j, k \in [n]$ (Ramamohan et al., 2016).
Other families of discrete choice models can be obtained by imposing different probability distributions over the utility scores $X_i$; e.g., if $(X_1, \ldots, X_n)$ are jointly normal with mean $\boldsymbol{\theta}$ and covariance $\Sigma$, then the corresponding RUM-based choice model reduces to the Multinomial Probit (MNP). Unlike MNL, though, the choice probabilities for the MNP model do not admit a closed-form expression (Vojacek et al., 2010).

2.2 Independence of Irrelevant Alternatives
A choice model is said to possess the Independence of Irrelevant Alternatives (IIA) property if the ratio of probabilities of choosing any two items, say $i_1$ and $i_2$, from within any choice set $S$ is independent of the other alternatives present in $S$ (Benson et al., 2016). More specifically,

$$\frac{Pr(i_1 \mid S)}{Pr(i_2 \mid S)} = \frac{Pr(i_1 \mid S')}{Pr(i_2 \mid S')}$$

for any two choice sets $S$ and $S'$ that both contain $i_1$ and $i_2$. One example of such a choice model is Plackett-Luce.
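The IIA property is easy to verify numerically from the closed form in Eqn. (1): the ratio of win probabilities of two fixed items is unchanged when the choice set is enlarged. A minimal check (parameter values and names are illustrative):

```python
def win_prob(theta, S, i):
    """P(i wins | S) under Plackett-Luce, per Eqn. (1)."""
    return theta[i] / sum(theta[j] for j in S)

# Illustrative PL parameters; any positive values exhibit IIA.
theta = {1: 0.7, 2: 0.2, 3: 0.1}

# Ratio P(1 wins)/P(2 wins) with and without a third alternative:
# IIA says these two ratios must coincide (here both equal 0.7/0.2).
ratio_pair = win_prob(theta, [1, 2], 1) / win_prob(theta, [1, 2], 2)
ratio_triple = win_prob(theta, [1, 2, 3], 1) / win_prob(theta, [1, 2, 3], 2)
```

The common denominator $\sum_{j \in S} \theta_j$ cancels in the ratio, which is exactly why PL satisfies IIA while, e.g., MNP does not.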
Remark 1.
IIA turns out to be very valuable in estimating the parameters of a PL model, with high confidence, via Rank-Breaking: the idea of extracting pairwise comparisons from (partial) rankings and applying estimators on the obtained pairs, treating each comparison independently. Although this technique has previously been used in batch (offline) PL estimation (Khetan and Oh, 2016), we show, for the first time, that it can also be used in online problems. We crucially exploit this property of the PL model in the algorithms we design (Algorithms 1-3), and in establishing their correctness and sample complexity guarantees.
Lemma 1 (Deviations of pairwise win-probability estimates for the PL model).
Consider a Plackett-Luce choice model with parameters $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_n)$ (see Eqn. (1)), and fix two distinct items $i, j \in [n]$. Let $S_1, \ldots, S_T$ be a sequence of (possibly random) subsets of $[n]$ of size at least $2$, where $T$ is a positive integer, and $i_1, \ldots, i_T$ a sequence of random items with each $i_t \in S_t$, $1 \le t \le T$, such that for each $1 \le t \le T$: (a) $S_t$ depends only on $S_1, i_1, \ldots, S_{t-1}, i_{t-1}$, (b) $i_t$ is distributed as the Plackett-Luce winner of the subset $S_t$, given $S_t$ and the history, and (c) $\{i, j\} \subseteq S_t$ with probability $1$. Let $n_i(T) = \sum_{t=1}^{T} \mathbf{1}(i_t = i)$ and $n_{ij}(T) = \sum_{t=1}^{T} \mathbf{1}(i_t \in \{i, j\})$. Then, for any positive integer $v$ and $\eta > 0$,

$$Pr\left(\frac{n_i(T)}{n_{ij}(T)} - \frac{\theta_i}{\theta_i + \theta_j} \ge \eta, \; n_{ij}(T) \ge v\right) \le e^{-2 v \eta^2},$$

and the symmetric bound holds for the deviation in the other direction.
Proof.
(sketch). The proof uses a novel coupling argument to work in an equivalent probability space for the PL model with respect to the item pair $(i, j)$, as follows. Let $Z_1, Z_2, \ldots$ be a sequence of iid Bernoulli random variables with success parameter $\frac{\theta_i}{\theta_i + \theta_j}$. A counter $C$ is first initialized to $0$. At each time $t$, given $S_t$, an independent coin is tossed with probability of heads $\frac{\theta_i + \theta_j}{\sum_{k \in S_t} \theta_k}$. If the coin lands tails, then $i_t$ is drawn as an independent sample from the Plackett-Luce distribution over $S_t \setminus \{i, j\}$; else, the counter is incremented by $1$, and $i_t$ is returned as $i$ if $Z_C = 1$, or as $j$ if $Z_C = 0$. By the IIA property of the PL model, this construction yields the correct joint distribution for the sequence $i_1, \ldots, i_T$. The proof now follows by applying Hoeffding’s inequality to prefixes of the sequence $Z_1, Z_2, \ldots$.∎
3 Problem Setup
We consider the PAC version of the sequential decision-making problem of finding the best item in a set of $n$ items by making subset-wise comparisons. Formally, the learner is given a finite set $[n]$ of $n > 2$ arms. At each decision round $t = 1, 2, \ldots$, the learner selects a subset $S_t \subseteq [n]$ of $k$ distinct items, and receives (stochastic) feedback depending on (a) the chosen subset $S_t$, and (b) a Plackett-Luce (PL) choice model with parameters $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_n)$, a priori unknown to the learner. The nature of the feedback can be of several types, as described in Section 3.1. Without loss of generality, we will henceforth assume $\theta_i \in (0, 1]$ for all $i \in [n]$, since the PL choice probabilities are positive scale-invariant by (1). We also let $\theta_1 > \theta_i$ for all $i \in \{2, \ldots, n\}$ for ease of exposition. (We naturally assume that this knowledge is not available to the learning algorithm, and note that the extension to the case where several items have the same highest parameter value is easily accomplished.) We call this decision-making model, parameterized by a PL instance $\boldsymbol{\theta}$ and a playable subset size $k$, Battling Bandits (BB) with Plackett-Luce (PL), or BB-PL in short. We define a best item to be one with the highest score parameter: $i^* \in \arg\max_{i \in [n]} \theta_i$. Under the assumptions above, $i^* = 1$, uniquely. Note that here we have $p_{1i} = \frac{\theta_1}{\theta_1 + \theta_i} > \frac{1}{2}$ for all $i > 1$, so item $1$ is the Condorcet winner (Ramamohan et al., 2016) of the PL model.
3.1 Feedback models
By feedback model, we mean the information received (from the ‘environment’) once the learner plays a subset of items. We define three types of feedback in the PL battling model:

Winner of the selected subset (WI): The environment returns a single item $I \in S$, drawn independently from the probability distribution

$$Pr(I = i \mid S) = \frac{\theta_i}{\sum_{j \in S} \theta_j}, \quad \forall i \in S.$$

Full ranking of the selected subset (FR): The environment returns a full ranking $\boldsymbol{\sigma} \in \Sigma_S$, drawn from the probability distribution

$$Pr(\boldsymbol{\sigma} \mid S) = \prod_{t=1}^{|S|} \frac{\theta_{\sigma(t)}}{\sum_{s=t}^{|S|} \theta_{\sigma(s)}}.$$

In fact, this is equivalent to picking $\sigma(1)$ according to winner (WI) feedback from $S$, then picking $\sigma(2)$ according to WI feedback from $S \setminus \{\sigma(1)\}$, and so on, until all elements of $S$ are exhausted; in other words, successively sampling $|S|$ winners from $S$ according to the PL model, without replacement.
A feedback model that generalizes the types of feedback above is:

Top-$m$ ranking of items (TR or TR-$m$): The environment returns a ranking of only $m$ items from among $S$; i.e., the environment first draws a full ranking $\boldsymbol{\sigma}$ over $S$ according to Plackett-Luce as in FR above, and returns the first $m$ ranked elements of $\boldsymbol{\sigma}$, i.e., $\sigma(1), \ldots, \sigma(m)$. It can be seen that, for each possible top-$m$ ranking $\boldsymbol{\sigma}$ of a subset $S$, we must have

$$Pr(\boldsymbol{\sigma} \mid S) = \prod_{t=1}^{m} \frac{\theta_{\sigma(t)}}{\sum_{j \in S \setminus \{\sigma(1), \ldots, \sigma(t-1)\}} \theta_j}.$$

Generating such a $\boldsymbol{\sigma}$ is also equivalent to successively sampling $m$ winners from $S$ according to the PL model, without replacement. It follows that TR reduces to FR when $m = k$ (a ranking of the top $k - 1$ items already determines the full ranking) and to WI when $m = 1$.
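The successive-sampling equivalence noted above translates directly into a simulator for TR-$m$ feedback: draw $m$ PL winners without replacement. The sketch below is illustrative (the function name and dict-based parameterization are our own):

```python
import random

def pl_top_m(theta, S, m, rng=random):
    """Draw top-m ranking (TR) feedback from a Plackett-Luce model by
    successively sampling winners from the remaining items, without
    replacement, m times.  m = 1 recovers WI feedback and m = |S|
    recovers the full ranking (FR)."""
    remaining = list(S)
    ranking = []
    for _ in range(m):
        weights = [theta[i] for i in remaining]
        winner = rng.choices(remaining, weights=weights, k=1)[0]
        ranking.append(winner)
        remaining.remove(winner)
    return ranking
```

Each successive draw renormalizes over the still-unranked items, which is exactly the product form of $Pr(\boldsymbol{\sigma} \mid S)$ above.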
3.2 Performance Objective: Correctness and Sample Complexity
Suppose $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_n)$ and $k \le n$ define a BB-PL instance with best arm $i^* = 1$, and $\epsilon, \delta \in (0, 1)$ are given constants. An arm $i$ is said to be $\epsilon$-optimal (informally, a ‘near-best’ arm) if the probability that $i$ beats $1$ is over $\frac{1}{2} - \epsilon$, i.e., if $p_{i1} > \frac{1}{2} - \epsilon$. A sequential algorithm that operates in this BB-PL instance, using feedback from an appropriate subset-wise feedback model (e.g., WI, FR or TR), is said to be $(\epsilon, \delta)$-PAC if (a) it stops and outputs an arm $I \in [n]$ after a finite number of decision rounds (subset plays) with probability $1$, and (b) the probability that its output $I$ is an $\epsilon$-optimal arm is at least $1 - \delta$, i.e., $Pr(p_{I1} > \frac{1}{2} - \epsilon) \ge 1 - \delta$. Furthermore, by the sample complexity of the algorithm, we mean the expected time (number of decision rounds) taken by the algorithm to stop.
Note that $p_{i1} > \frac{1}{2} - \epsilon \iff \frac{\theta_i}{\theta_i + \theta_1} > \frac{1}{2} - \epsilon \iff \theta_i > \frac{1 - 2\epsilon}{1 + 2\epsilon}\, \theta_1$, so the score parameter of an $\epsilon$-optimal (near-best) item must be at least $\frac{1 - 2\epsilon}{1 + 2\epsilon}$ times $\theta_1$.
4 Analysis with Winner Information (WI) feedback
In this section we consider the $(\epsilon, \delta)$-PAC-WI goal with the WI feedback information model in BB-PL instances of size $n$ with playable subset size $k$. We start by showing that a sample complexity lower bound for any $(\epsilon, \delta)$-PAC algorithm with WI feedback is $\Omega\big(\frac{n}{\epsilon^2} \ln \frac{1}{\delta}\big)$ (Theorem 2). This bound is independent of $k$, implying that playing a dueling game ($k = 2$) is as good as the battling game, as the extra flexibility of subset-wise feedback does not result in a faster learning rate. We next propose two algorithms for $(\epsilon, \delta)$-PAC with WI feedback, with optimal (up to a logarithmic factor) sample complexity of $O\big(\frac{n}{\epsilon^2} \ln \frac{n}{\delta}\big)$ (Section 4.2). We also analyze a slightly different setting allowing the learner to play subsets of any size $1 \le |S| \le k$, rather than a fixed size $k$; this gives somewhat more flexibility to the learner, resulting in algorithms with improved sample complexity guarantees of $O\big(\frac{n}{\epsilon^2} \ln \frac{1}{\delta}\big)$, without the $\ln n$ dependency as before (Section 4.3).
4.1 Lower Bound for Winner Information (WI) feedback
Theorem 2 (Lower bound on Sample Complexity with WI feedback).
Given $\epsilon, \delta \in (0, 1)$, and an $(\epsilon, \delta)$-PAC algorithm $A$ for BB-PL with feedback model WI, there exists a PL instance $\nu$ such that the sample complexity of $A$ on $\nu$ is at least $\Omega\big(\frac{n}{\epsilon^2} \ln \frac{1}{2.4\delta}\big)$.
Proof.
(sketch). The argument is based on a change-of-measure argument (Lemma 1) of Kaufmann et al. (2016), restated below for convenience:
Consider a multi-armed bandit (MAB) problem with arms or actions $\mathcal{A} = \{a_1, \ldots, a_K\}$. At round $t$, let $A_t$ and $Z_t$ denote the arm played and the observation (reward) received, respectively. Let $\mathcal{F}_t = \sigma(A_1, Z_1, \ldots, A_t, Z_t)$ be the sigma algebra generated by the trajectory of a sequential bandit algorithm up to round $t$.
Lemma 3 (Lemma 1, Kaufmann et al. (2016)).
Let $\nu$ and $\nu'$ be two bandit models (assignments of reward distributions to arms), such that $\nu_a$ (resp. $\nu'_a$) is the reward distribution of any arm $a$ under bandit model $\nu$ (resp. $\nu'$), and such that for all such arms $a$, $\nu_a$ and $\nu'_a$ are mutually absolutely continuous. Then for any almost-surely finite stopping time $\tau$ with respect to $(\mathcal{F}_t)_t$,

$$\sum_{a \in \mathcal{A}} \mathbb{E}_{\nu}[N_a(\tau)] \, KL(\nu_a, \nu'_a) \ge \sup_{\mathcal{E} \in \mathcal{F}_\tau} kl\big(Pr_{\nu}(\mathcal{E}), Pr_{\nu'}(\mathcal{E})\big),$$

where $kl(x, y) := x \ln \frac{x}{y} + (1 - x) \ln \frac{1 - x}{1 - y}$ is the binary relative entropy, $N_a(\tau)$ denotes the number of times arm $a$ is played in $\tau$ rounds, and $Pr_{\nu}$ and $Pr_{\nu'}$ denote the probability of any event under bandit models $\nu$ and $\nu'$, respectively.
To employ this result, note that in our case each bandit instance corresponds to an instance of the BB-PL problem with the arm set containing all subsets of $[n]$ of size $k$: $\mathcal{A} = \{S \subseteq [n] : |S| = k\}$. The key part of our proof relies on carefully crafting a true instance, with optimal arm $1$, and a family of slightly perturbed alternative instances $\{\nu^a : a \ne 1\}$, each with a different optimal arm $a$.
We choose the true problem instance $\nu^1$ as the Plackett-Luce model with parameters

$$\theta_j = \theta \;\; \forall j \ne 1, \qquad \theta_1 = \theta \, \frac{1/2 + \epsilon}{1/2 - \epsilon},$$

for some $\theta > 0$. Corresponding to each suboptimal item $a \ne 1$, we now define an alternative problem instance $\nu^a$ as the Plackett-Luce model with the same parameters, except that $\theta_a$ is raised just enough that $a$ becomes the unique best (and the only $\epsilon$-optimal) item in $\nu^a$.
Remark 2.
Theorem 2 shows, rather surprisingly, that the PAC sample complexity of identifying a near-optimal item with only winner feedback information from size-$k$ subsets does not reduce with $k$, implying that there is no reduction in the hardness of learning relative to the pairwise comparisons case ($k = 2$). On one hand, one may expect improved sample complexity, as the number of items being simultaneously tested in each round is large ($k > 2$). On the other hand, the sample complexity could also worsen, since it is intuitively ‘harder’ for a good (near-optimal) item to win and show itself, in just a single winner draw, against a large population of $k - 1$ other competitors. The result, in a sense, formally establishes that the former advantage is nullified by the latter drawback. A somewhat more formal, but heuristic, explanation for this phenomenon is that the number of bits of information that a single winner draw from a size-$k$ subset provides is at most $\log_2 k$, which is not significantly larger than the single bit revealed in a pairwise comparison; thus an algorithm cannot accumulate significantly more information per round compared to the pairwise case.

4.2 Algorithms for Winner Information (WI) feedback model
This section describes our proposed algorithms for the $(\epsilon, \delta)$-PAC objective with winner information (WI) feedback.
Principles of algorithm design. The key idea on which all our learning algorithms are based is that of maintaining estimates of the pairwise win-loss probabilities $p_{ij}$ in the Plackett-Luce model. This helps circumvent the combinatorial explosion that would otherwise result if we directly attempted to estimate probability distributions for each of the $O(n^k)$ possible size-$k$ subsets. However, it is not obvious whether consistent and tight pairwise estimates can be constructed in a general subset-wise choice model; here, the special form of the Plackett-Luce model again comes to our rescue. The IIA property that the PL model enjoys allows for accurate pairwise estimates via interpretation of partial preference feedback as a set of pairwise preferences, e.g., a winner $i$ sampled from among $S$ is interpreted as the pairwise preferences $i \succ j$ for all $j \in S \setminus \{i\}$. Lemma 1 formalizes this property and allows us to use pairwise win/loss probability estimators with explicit confidence intervals for them.
Algorithm 1: (Trace-the-Best). Our first algorithm Trace-the-Best is based on the simple idea of tracing the empirical best item. Specifically, it maintains a running winner $r_\ell$ at every iteration $\ell$, making it battle with a set of $k - 1$ arbitrarily chosen items. After battling long enough (precisely, for $O\big(\frac{k}{\epsilon^2} \ln \frac{n}{\delta}\big)$ rounds), if the empirical winner $c$ turns out to be more than $\frac{\epsilon}{2}$-favorable over the running winner $r_\ell$ in terms of its pairwise preference score, i.e., $\hat{p}_{c r_\ell} > \frac{1}{2} + \frac{\epsilon}{2}$, then $c$ replaces $r_\ell$; or else $r_\ell$ retains its place and status quo ensues.
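A simplified simulation-style sketch of the Trace-the-Best idea follows, assuming a hypothetical `winner_oracle(S)` that returns WI feedback for a played subset. The round budget `t` and the rank-broken pairwise estimate below are simplifications of the algorithm's actual confidence schedule, not the paper's exact constants:

```python
import math
import random

def trace_the_best(arms, k, eps, delta, winner_oracle, rng=random):
    """Sketch of the Trace-the-Best idea: keep a running winner r,
    battle it against k-1 fresh challengers for t rounds, and promote
    the empirical winner only if it beats r by margin eps/2 in the
    rank-broken pairwise win frequency."""
    arms = list(arms)
    r = arms.pop(rng.randrange(len(arms)))  # initial running winner
    while arms:
        challengers = [arms.pop() for _ in range(min(k - 1, len(arms)))]
        S = challengers + [r]
        # simplified per-iteration round budget (illustrative constants)
        t = math.ceil((2 / eps ** 2) * math.log(2 * len(S) / delta))
        wins = {i: 0 for i in S}
        for _ in range(t):
            wins[winner_oracle(S)] += 1
        c = max(wins, key=wins.get)  # empirical winner of the batch
        # rank-broken estimate of p_{c,r}: wins of c among {c, r} duels
        if c != r and wins[c] / (wins[c] + wins[r]) > 0.5 + eps / 2:
            r = c
    return r
```

With a gap between the best arm and the rest that is large relative to `eps`, the running winner locks onto the best arm with high probability.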
Theorem 4 (Trace-the-Best: Correctness and Sample Complexity with WI).
Trace-the-Best (Algorithm 1) is $(\epsilon, \delta)$-PAC with sample complexity $O\big(\frac{n}{\epsilon^2} \ln \frac{n}{\delta}\big)$.
Proof.
(sketch). The main idea is to retain an estimated best item as a ‘running winner’ $r_\ell$, and compare it with the ‘empirical best item’ $c$ of the current batch at every iteration $\ell$. The crucial observation lies in noting that, at any iteration $\ell$, the running winner gets updated as follows:
Lemma 5.
At any iteration $\ell$, with probability at least $1 - \frac{\delta}{n}$, Algorithm 1 retains $r_{\ell+1} = r_\ell$ if $p_{c r_\ell} \le \frac{1}{2}$, and sets $r_{\ell+1} = c$ if $p_{c r_\ell} \ge \frac{1}{2} + \epsilon$.
This leads to the claim that, between any two successive iterations $\ell$ and $\ell + 1$, we must have, with high probability, that $\theta_{r_{\ell+1}} \ge \theta_{r_\ell}$, showing that the estimated ‘best’ item can only improve per iteration (with high probability at least $1 - \frac{\delta}{n}$). Repeating this argument for each iteration results in the desired correctness guarantee of $Pr(p_{I1} > \frac{1}{2} - \epsilon) \ge 1 - \delta$ for the finally output item $I$. The sample complexity bound follows easily by noting that the total number of iterations can be at most $\lceil \frac{n}{k-1} \rceil$, with the per-iteration sample complexity being $O\big(\frac{k}{\epsilon^2} \ln \frac{n}{\delta}\big)$. ∎
Remark 3.
The sample complexity of Trace-the-Best is order-wise optimal when $\delta \le \frac{1}{n}$, as follows from our derived lower bound guarantee (Theorem 2).
When $\delta \ge \frac{1}{n}$, the sample complexity guarantee of Trace-the-Best is off by a factor of $\ln n$. We now propose another algorithm, Divide-and-Battle (Algorithm 2), that enjoys an $(\epsilon, \delta)$-PAC sample complexity of $O\big(\frac{n}{\epsilon^2} \ln \frac{1}{\delta}\big)$.
Algorithm 2: (Divide-and-Battle). Divide-and-Battle first divides the set of $n$ items into groups of size $k$, and plays each group long enough so that a good item in the group stands out as the empirical winner with high probability. It then retains the empirical winner of each group and recurses on the retained set of winners, until it is left with only a single item, which is finally declared as the $\epsilon$-optimal item. The pseudocode of Divide-and-Battle is given in Appendix B.3.
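The group-and-recurse structure can be sketched as below, again assuming a hypothetical `winner_oracle(S)` returning WI feedback. For brevity we use one global round budget `t`; the paper's algorithm instead splits the $\epsilon$ and $\delta$ budgets geometrically across recursion levels:

```python
import math
import random

def divide_and_battle(arms, k, eps, delta, winner_oracle):
    """Sketch of Divide-and-Battle: partition the current pool into
    groups of size (up to) k, play each group t times, keep each
    group's empirical winner, and recurse until one item remains."""
    arms = list(arms)
    # simplified global round budget per group (illustrative constants)
    t = math.ceil((2 / eps ** 2) * math.log(2 * len(arms) / delta))
    while len(arms) > 1:
        survivors = []
        for g in range(0, len(arms), k):
            group = arms[g:g + k]
            if len(group) == 1:       # singleton group advances for free
                survivors.extend(group)
                continue
            wins = {i: 0 for i in group}
            for _ in range(t):
                wins[winner_oracle(group)] += 1
            survivors.append(max(wins, key=wins.get))
        arms = survivors
    return arms[0]
```

Since the pool shrinks by a factor of roughly $k$ per level, the total number of plays is dominated by the first level over the full set of $n$ items.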
Theorem 6 (Divide-and-Battle: Correctness and Sample Complexity with WI).
Divide-and-Battle (Algorithm 2) is $(\epsilon, \delta)$-PAC with sample complexity $O\big(\frac{n}{\epsilon^2} \ln \frac{1}{\delta}\big)$.
Proof.
(sketch). The crucial observation here is that, at any iteration $\ell$, for any group $\mathcal{G}$ ($|\mathcal{G}| = k$), the item $c_{\mathcal{G}}$ retained by the algorithm is likely to be not more than $\epsilon_\ell$-worse than the best item of the group $\mathcal{G}$, with probability at least $1 - \delta_\ell$. Precisely, we show that:
Lemma 7.
At any iteration $\ell$, for any group $\mathcal{G}$, if $i_{\mathcal{G}} = \arg\max_{i \in \mathcal{G}} \theta_i$, then with probability at least $1 - \delta_\ell$, $p_{c_{\mathcal{G}} i_{\mathcal{G}}} > \frac{1}{2} - \epsilon_\ell$.
This guarantees that, between any two successive iterations $\ell$ and $\ell + 1$, we do not lose more than an additive factor of $\epsilon_\ell$ in terms of the highest score parameter of the remaining set of items. Choosing $\epsilon_\ell$ and $\delta_\ell$ so that $\sum_\ell \epsilon_\ell \le \epsilon$ and $\sum_\ell \delta_\ell \le \delta$, and aggregating this claim over all iterations, shows that $Pr(p_{I1} > \frac{1}{2} - \epsilon) \ge 1 - \delta$, as desired. The sample complexity bound follows by carefully summing the total number of times ($t_\ell$) a group is played per iteration $\ell$, with the maximum number of possible iterations being $\lceil \log_k n \rceil$. ∎
Remark 4.
The sample complexity of Divide-and-Battle is order-wise optimal in the ‘small-$\delta$’ regime $\delta \le \frac{1}{n}$ by the lower bound result (Theorem 2). However, for the ‘moderate-$\delta$’ regime $\delta \ge \frac{1}{n}$, we conjecture that the lower bound is loose by an additive factor of $\frac{n}{\epsilon^2} \ln n$, i.e., that an improved lower bound of $\Omega\big(\frac{n}{\epsilon^2} \ln \frac{n}{\delta}\big)$ holds. This is primarily because we believe that the error probability of any typical, label-invariant $(\epsilon, \delta)$-PAC algorithm ought to be distributed roughly uniformly across misidentification of all the items, allowing us to use $\frac{\delta}{n}$ instead of $\delta$ on the right hand side of the change-of-measure inequalities of Lemma 3, resulting in the improved quantity $\ln \frac{n}{\delta}$. This is perhaps in line with recent work in multi-armed bandits (Simchowitz et al., 2017) that points to an increased difficulty of PAC identification in the moderate-confidence regime.
We now consider a variant of the BB-PL decision model which allows the learner to play sets of any size $1 \le |S| \le k$, instead of a fixed size $k$. In this setting, we are indeed able to design an $(\epsilon, \delta)$-PAC algorithm that enjoys an order-optimal sample complexity.
4.3 BBPL2: A slightly different battling bandit decision model
The new winner information feedback model BB-PL2 is formally defined as follows: at each round $t$, the learner is allowed to select a set $S_t$ of size up to $k$. Upon receiving any set $S_t$, the environment returns the index of the winning item $I_t \in S_t$ such that

$$Pr(I_t = i \mid S_t) = \frac{\theta_i}{\sum_{j \in S_t} \theta_j}, \quad \forall i \in S_t.$$
On applying existing PAC-Dueling-Bandit strategies. Note that, given the flexibility of playing sets of any size, one might as well hope to apply the PAC-Dueling-Bandit algorithm PLPAC of Szörényi et al. (2015), which plays only pairs of items per round. However, their algorithm is shown to have a sample complexity guarantee of $O\big(\frac{n}{\epsilon^2} \log \frac{n}{\delta}\big)$, which is suboptimal by an additive $O\big(\frac{n}{\epsilon^2} \log n\big)$ term, as our results will show. A similar observation holds for the Beat-the-Mean (BTM) algorithm of Yue and Joachims (2011), which in fact has an even worse sample complexity guarantee.
Algorithm 3: Halving-Battle. We here propose a Median-Elimination-based approach (Even-Dar et al., 2006) which is shown to run with an optimal sample complexity of $O\big(\frac{n}{\epsilon^2} \ln \frac{1}{\delta}\big)$ rounds (Theorem 8). (Note that a fundamental $\Omega\big(\frac{n}{\epsilon^2} \ln \frac{1}{\delta}\big)$ limit on PAC sample complexity for BB-PL2-WI can easily be derived using an argument along the lines of Theorem 2; we omit the explicit derivation.) The algorithm is named Halving-Battle because it is based on the idea of dividing the set of items into two partitions with respect to the empirical median item and retaining the ‘better half’. Specifically, it first divides the entire item set into groups of size $k$, and plays each group for a fixed number of rounds. After this step, only the items that won more often than the empirical median item of their group are retained, and the rest are discarded. The algorithm recurses until it is left with a single item. The intuition here is that a near-best item is always likely to beat the group median, and hence can never get wiped out.
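The halving step can be sketched as follows, assuming for simplicity that the whole remaining pool fits in one playable set (i.e., $k$ is at least the pool size) and a hypothetical `winner_oracle(S)` returning WI feedback; the paper's grouping and per-level budget schedule are simplified away:

```python
import math
import random
import statistics

def halving_battle(arms, eps, delta, winner_oracle):
    """Sketch of the Halving-Battle idea: play the remaining pool,
    then keep only the 'better half' of the items as measured by
    empirical win counts relative to the median item."""
    arms = list(arms)
    while len(arms) > 1:
        # simplified round budget per halving level (illustrative)
        t = math.ceil((4 / eps ** 2) * math.log(2 / delta))
        wins = {i: 0 for i in arms}
        for _ in range(t):
            wins[winner_oracle(arms)] += 1
        med = statistics.median(wins.values())
        keep = [i for i in arms if wins[i] >= med]
        # break ties at the median so the pool provably shrinks
        keep.sort(key=wins.get, reverse=True)
        arms = keep[: max(1, (len(arms) + 1) // 2)]
    return arms[0]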
Theorem 8 (Halving-Battle: Correctness and Sample Complexity with WI).
Halving-Battle (Algorithm 3) is $(\epsilon, \delta)$-PAC with sample complexity $O\big(\frac{n}{\epsilon^2} \ln \frac{1}{\delta}\big)$.
Proof.
(sketch). The sample complexity bound follows by carefully summing the total number of times ($t_\ell$) a group is played per iteration $\ell$, with the maximum number of possible iterations being $\lceil \log_2 n \rceil$ (this is because the size of the set of remaining items gets halved at each iteration as it is pruned with respect to its median). The key intuition in proving the correctness property of Halving-Battle lies in showing that, at any iteration $\ell$, Halving-Battle always carries forward at least one ‘near-best’ item to the next iteration $\ell + 1$.
Lemma 9.
At any iteration $\ell$, for any group $\mathcal{G}$, let $i_{\mathcal{G}} = \arg\max_{i \in \mathcal{G}} \theta_i$, and consider any suboptimal item $b \in \mathcal{G}$ such that $p_{b i_{\mathcal{G}}} < \frac{1}{2} - \epsilon_\ell$. Then with probability at least $1 - \delta_\ell$, the empirical win count of $i_{\mathcal{G}}$ lies above that of $b$, i.e., $w_{i_{\mathcal{G}}} > w_b$ (equivalently, $\hat{p}_{i_{\mathcal{G}} b} > \frac{1}{2}$).
Using the property of the median element along with Lemma 9 and Markov’s inequality, we show that we do not lose more than an additive factor of $\epsilon_\ell$ in terms of the highest score of the remaining set of items between any two successive iterations $\ell$ and $\ell + 1$. This finally leads to the desired $(\epsilon, \delta)$-PAC correctness of Halving-Battle. ∎
Remark 5.
Theorem 8 shows that the sample complexity guarantee of Halving-Battle improves over that of the existing PLPAC algorithm for the same objective in the dueling bandit setup ($k = 2$), which was shown to be $O\big(\frac{n}{\epsilon^2} \log \frac{n}{\delta}\big)$ (Szörényi et al., 2015), and also over the complexity of the BTM algorithm (Yue and Joachims, 2011) for dueling feedback from any pairwise preference matrix with relaxed stochastic transitivity and stochastic triangle inequality (of which the PL model is a special case).
5 Analysis with Top Ranking (TR) feedback
We now proceed to analyze the BB-PL problem with top-$m$ ranking (TR) feedback (Section 3.1). We first show that, unlike with WI feedback, the sample complexity lower bound here scales as $\Omega\big(\frac{n}{m \epsilon^2} \ln \frac{1}{\delta}\big)$ (Theorem 10), which is a factor of $m$ smaller than that in Thm. 2 for the WI feedback model. At a high level, this is because TR reveals the preference information of $m$ items per feedback step (round of battle), as opposed to just a single (noisy) information sample of the winning item (WI). Following this, we also present two algorithms for this setting which are shown to enjoy optimal (up to logarithmic factors) sample complexity guarantees scaling as $O\big(\frac{n}{m \epsilon^2}\big)$ (Section 5.2).
5.1 Lower Bound for Top Ranking (TR) feedback
Theorem 10 (Sample Complexity Lower Bound for TR).
Given $\epsilon, \delta \in (0, 1)$, and an $(\epsilon, \delta)$-PAC algorithm $A$ with top-$m$ ranking (TR) feedback ($2 \le m \le k$), there exists a PL instance $\nu$ such that the expected sample complexity of $A$ on $\nu$ is at least $\Omega\big(\frac{n}{m \epsilon^2} \ln \frac{1}{2.4\delta}\big)$.
Remark 6.
The sample complexity lower bound for the $(\epsilon, \delta)$-PAC objective for BB-PL with the top-$m$ ranking (TR) feedback model is $\frac{1}{m}$ times that of the WI model (Thm. 2). Intuitively, revealing a ranking on $m$ items of a size-$k$ set provides about $\log_2 \frac{k!}{(k - m)!} \approx m \log_2 k$ bits of information per round, which is about $m$ times as large as that revealed by a single winner, yielding an acceleration of $m$.
Corollary 11.
Given $\epsilon, \delta \in (0, 1)$, and an $(\epsilon, \delta)$-PAC algorithm $A$ with full ranking (FR) feedback ($m = k$), there exists a PL instance $\nu$ such that the expected sample complexity of $A$ on $\nu$ is at least $\Omega\big(\frac{n}{k \epsilon^2} \ln \frac{1}{2.4\delta}\big)$.
5.2 Algorithms for Top Ranking (TR) feedback model
This section presents two algorithms for the $(\epsilon, \delta)$-PAC objective for BB-PL with top-$m$ ranking feedback. We achieve this by generalizing our two earlier proposed algorithms (see Algorithms 1 and 2, Sec. 4.2, for WI feedback) to the top-$m$ ranking (TR) feedback mechanism. (Our third algorithm, Halving-Battle, is not applicable to TR feedback, as it allows the learner to play sets of sizes $1 \le |S| \le k$, whereas TR feedback is defined only when the size of the subset played is at least $m$. The lower bound analysis of Theorem 10 also does not apply if sets of size less than $m$ are allowed.)
Rank-Breaking. The main trick we use in adapting the above algorithms to TR feedback is Rank Breaking (Soufiani et al., 2014), which essentially extracts pairwise comparisons from multi-wise (subset-wise) preference information. Formally, given any set $S$ of size $k$, if $\boldsymbol{\sigma}$ denotes a possible top-$m$ ranking of $S$, the Rank-Breaking subroutine considers each item in $S$ to be beaten by all the items preceding it in $\boldsymbol{\sigma}$, in a pairwise sense. For instance, given a full ranking $a \succ b \succ c$ of a set of $3$ elements $\{a, b, c\}$, Rank-Breaking generates the set of pairwise comparisons $(a \succ b)$, $(a \succ c)$ and $(b \succ c)$. Similarly, given the ranking of only the $2$ most preferred items of the $4$-set $\{a, b, c, d\}$, say $a \succ b$, it yields the pairwise comparisons $(a \succ b)$, $(a \succ c)$, $(a \succ d)$, $(b \succ c)$ and $(b \succ d)$. See Algorithm 4 for a detailed description of the Rank-Breaking procedure.
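The Rank-Breaking subroutine itself is just a pairwise decomposition of the observed (partial) ranking, and can be sketched directly (the function name is ours):

```python
def rank_break(top_m_ranking, S):
    """Rank Breaking: decompose a top-m ranking of subset S into
    pairwise comparisons.  Each ranked item beats every item ranked
    after it AND every unranked item of S; the unranked items yield
    no pairwise comparisons among themselves."""
    ranked = list(top_m_ranking)
    unranked = [i for i in S if i not in ranked]
    pairs = []
    for pos, winner in enumerate(ranked):
        for loser in ranked[pos + 1:] + unranked:
            pairs.append((winner, loser))  # winner preferred over loser
    return pairs
```

A top-$m$ ranking of a size-$k$ set thus yields $\binom{m}{2} + m(k - m)$ pairwise comparisons, which is the source of the $m$-fold speed-up exploited below.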
Lemma 12 (Rank-Breaking Update).
Consider any subset $S \subseteq [n]$ with $|S| = k$. Let $S$ be played for $t$ rounds of battle, and let $\boldsymbol{\sigma}_\tau$, $\tau \in [t]$, denote the TR (top-$m$) feedback at each round $\tau$. For each item $i \in S$, let $q_i$ be the number of times $i$ appears in the top-$m$ ranked output in the $t$ rounds. Then the most frequent item(s) in the top-$m$ positions must appear at least $\frac{t m}{k}$ times, i.e., $\max_{i \in S} q_i \ge \frac{t m}{k}$.
Proposed Algorithms for TR feedback. The formal descriptions of our two algorithms, Trace-the-Best and Divide-and-Battle, generalized to the setting of TR feedback, are given as Algorithm 5 and Algorithm 6, respectively. They essentially maintain the empirical pairwise preferences $\hat{p}_{ij}$ for each pair of items by applying Rank Breaking to the TR feedback after each round of battle. Of course, in general, Rank Breaking may lead to arbitrarily inconsistent estimates of the underlying model parameters (Azari et al., 2012). However, owing to the IIA property of the Plackett-Luce model, we get clean concentration guarantees on the estimates $\hat{p}_{ij}$ using Lemma 1. This, along with Lemma 12, is precisely the idea used to obtain the factor-$m$ improvement in the sample complexity guarantees of our proposed algorithms (see the proofs of Theorems 13 and 14).
Theorem 13 (Trace-the-Best: Correctness and Sample Complexity with TR).
With the top-$m$ ranking (TR) feedback model, Trace-the-Best (Algorithm 5) is $(\epsilon, \delta)$-PAC with sample complexity $O\big(\frac{n}{m \epsilon^2} \ln \frac{n}{\delta}\big)$.
Theorem 14 (Divide-and-Battle: Correctness and Sample Complexity with TR).
With the top-$m$ ranking (TR) feedback model, Divide-and-Battle (Algorithm 6) is $(\epsilon, \delta)$-PAC with sample complexity $O\big(\frac{n}{m \epsilon^2} \ln \frac{1}{\delta}\big)$.
Remark 7.
The sample complexity bounds of the above two algorithms are a factor of $m$ smaller than their corresponding counterparts for WI feedback, as follows from comparing Theorem 4 vs. 13, or Theorem 6 vs. 14; they thus admit a faster learning rate with TR feedback. As in the case of WI feedback, the sample complexity of Divide-and-Battle is still order-wise optimal for small $\delta$, as follows from the lower bound guarantee (Theorem 10). However, we believe that the above lower bound can be tightened by a factor of $\ln n$ for ‘moderate’ $\delta$, for reasons similar to those stated in Remark 4.