Ranking, or sorting, is a classic problem and a basic algorithmic primitive in computer science. Perhaps the simplest and most well-studied ranking problem is ranking using (noisy) pairwise comparisons, which dates back to the work of Feige et al. (1994) and has recently been studied in machine learning under the rubric of ranking in ‘dueling bandits’ (Busa-Fekete and Hüllermeier, 2014).
However, more general subset-wise preference feedback arises naturally in application domains where there is flexibility to learn by eliciting preference information from among a set of offerings, rather than by just asking for a pairwise comparison. For instance, web search and recommender systems applications typically involve users expressing preferences by clicking on one result (or a few results) from a presented set. Medical surveys, adaptive tutoring systems and multi-player sports/games are other domains where subsets of questions, problem set assignments and tournaments, respectively, can be carefully crafted to learn users’ relative preferences by subset-wise feedback.
In this paper, we explore active, probably approximately correct (PAC) ranking of items using subset-wise preference information. We assume that upon choosing a subset of items, the learner receives preference feedback about the subset according to the well-known Plackett-Luce (PL) probability model (Marden, 1996). The learner's goal is to return a near-correct ranking of all items, with respect to a tolerance parameter $\epsilon$ on the items' PL weights, with probability of correctness at least $1 - \delta$, after as few subset comparison rounds as possible. In this context, we make the following contributions:
We consider active ranking with winner information feedback, where the learner, upon playing a subset $S \subseteq [n]$ of exactly $k$ elements at each round, receives as feedback a single winner sampled from the Plackett-Luce probability distribution on the elements of $S$. We design two $(\epsilon,\delta)$-PAC algorithms for this problem (Section 5) with sample complexity $O\left(\frac{n}{\epsilon^2}\ln\frac{n}{\delta}\right)$ rounds for learning a near-correct ranking on the $n$ items.
We show a matching lower bound of $\Omega\left(\frac{n}{\epsilon^2}\ln\frac{n}{\delta}\right)$ rounds on the $(\epsilon,\delta)$-PAC sample complexity of ranking with winner information feedback (Section 6), which is also of the same order as that for the dueling bandit ($k = 2$) (Yue and Joachims, 2011). This implies that despite the increased flexibility of playing larger sets, with just winner information feedback, one cannot hope for a faster rate of learning than in the case of pairwise comparisons.
In the setting where it is possible to obtain ‘top-rank’ feedback – an ordered list of the $m$ most preferred items sampled from the Plackett-Luce distribution on the chosen subset – we show that natural generalizations of the winner-feedback algorithms above achieve $(\epsilon,\delta)$-PAC sample complexity of $O\left(\frac{n}{m\epsilon^2}\ln\frac{n}{\delta}\right)$ rounds (Section 7), a factor-$m$ improvement over the case of only winner information feedback. We show that this is order-wise tight by exhibiting a matching lower bound on the sample complexity across $(\epsilon,\delta)$-PAC algorithms.
We report numerical results to show the performance of the proposed algorithms on synthetic environments (Section 8).
By way of techniques, the PAC algorithms we develop leverage the property of independence of irrelevant alternatives (IIA) of the Plackett-Luce model, which allows for $n$-dimensional parameter estimation with tight confidence bounds, even in the face of a combinatorially large number of possible subsets of size $k$. We also devise a generic ‘pivoting’ idea in our algorithms to efficiently estimate a global ordering using only local comparisons with a pivot or probe element: split the entire pool into playable subsets all containing one common element, learn local orderings relative to this element and then merge. Here again, the IIA structure of the PL model helps to ensure consistency among preferences aggregated across disparate subsets but with a common reference pivot. Our sample complexity lower bounds are information-theoretic in nature and rely on a generic change-of-measure argument but with carefully crafted confusing instances.
Related Work. Over the years, ranking from pairwise preferences ($k = 2$) has been studied in both the batch or non-adaptive setting (Gleich and Lim, 2011; Rajkumar and Agarwal, 2016; Wauthier et al., 2013; Negahban et al., 2012) and the active or adaptive setting (Braverman and Mossel, 2008; Jamieson and Nowak, 2011; Ailon, 2012). In particular, prior work has addressed the problem of statistical parameter estimation given preference observations from the Plackett-Luce model in the offline setting (Rajkumar and Agarwal, 2014; Negahban et al., 2012; Chen and Suh, 2015; Khetan and Oh, 2016; Hajek et al., 2014). There have also been recent developments on the PAC objective for different pairwise preference models, such as those satisfying stochastic triangle inequalities and strong stochastic transitivity (Yue and Joachims, 2011), general utility-based preference models (Urvoy et al., 2013), the Plackett-Luce model (Szörényi et al., 2015) and the Mallows model (Busa-Fekete et al., 2014a). Recent work has studied PAC-learning objectives other than identifying the single (near) best arm, e.g. recovering a few of the top arms (Busa-Fekete et al., 2013; Mohajer et al., 2017; Chen et al., 2017), or the true ranking of the items (Busa-Fekete et al., 2014b; Falahatgar et al., 2017). There is also work on the problem of Plackett-Luce parameter estimation in the subset-wise feedback setting (Jang et al., 2017; Khetan and Oh, 2016), but for the batch (offline) setup where the sampling is not adaptive. Recent work by Chen et al. (2018)
analyzes an active learning problem in the Plackett-Luce model with subset-wise feedback; however, the objective there is to recover the set of top items (unordered) of the model, unlike the full-rank recovery considered in this work. Moreover, they give instance-dependent sample complexity bounds, whereas we allow a tolerance ($\epsilon$) in defining good rankings, which is natural in many settings (Szörényi et al., 2015; Yue and Joachims, 2011; Busa-Fekete et al., 2014a).
Notation. We denote by $[n]$ the set $\{1, 2, \ldots, n\}$. When there is no confusion about the context, we often represent an (unordered) subset $S \subseteq [n]$ as a vector, or ordered subset, of size $|S|$ (according to, say, the order induced by the natural global ordering of all the items). In this case, $S(i)$ denotes the item (member) at the $i$th position in subset $S$. $\boldsymbol{\sigma} \in \Sigma_S$ denotes a permutation over the items of $S$, where, for any permutation $\sigma$, $\sigma^{-1}(i)$ denotes the position of element $i$ in the ranking $\sigma$. $\mathbf{1}(\varphi)$ denotes an indicator variable that takes the value $1$ if the predicate $\varphi$ is true, and $0$ otherwise. $Pr(A)$ is used to denote the probability of event $A$, in a probability space that is clear from the context. $\text{Ber}(p)$ and $\text{Geo}(p)$ respectively denote a Bernoulli and a Geometric (in the ‘number of trials before success’ version) random variable with probability of success $p$ at each trial. Moreover, for any $s \in \mathbb{N}$ and $p \in [0, 1]$, $\text{Bin}(s, p)$ and $\text{NB}(s, p)$ respectively denote the Binomial and Negative Binomial distributions.
2.1 Discrete Choice Models and Plackett-Luce (PL)
A discrete choice model specifies the relative preferences of two or more discrete alternatives in a given set. A widely studied class of discrete choice models is the class of Random Utility Models (RUMs), which assume a ground-truth utility score $\theta_i \in \mathbb{R}$ for each alternative $i$, and assign a conditional distribution $\mathcal{D}_i(\cdot \mid \theta_i)$ for scoring item $i$. To model a winning alternative given any set $S$, one first draws a random utility score $X_i \sim \mathcal{D}_i(\cdot \mid \theta_i)$ for each alternative in $S$, and selects the item with the highest random score.
One widely used RUM is the Multinomial-Logit (MNL) or Plackett-Luce (PL) model, where the $\mathcal{D}_i$s are taken to be independent Gumbel distributions with location parameters $\theta'_i$ (Azari et al., 2012), i.e., with probability densities $f(x \mid \theta'_i) = e^{-(x - \theta'_i)} e^{-e^{-(x - \theta'_i)}}$, and $\theta'_i = \ln \theta_i$. In this case, the probability that alternative $i$ emerges as the winner in the set $S$ is simply proportional to its (exponentiated) parameter value:
$$Pr(i \mid S) = \frac{\theta_i}{\sum_{j \in S} \theta_j}, \quad \forall i \in S.$$
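As a concrete illustration, the winner distribution above can be simulated either directly or via the Gumbel-max characterization of the MNL/PL model. The following sketch (item labels and parameter values are illustrative, not from the paper) checks that adding independent standard Gumbel noise to the log-scores and taking the argmax reproduces the PL winner probabilities.

```python
import math
import random

def pl_winner_gumbel(theta, S, rng):
    # Gumbel-max trick: argmax_i { ln(theta_i) + G_i } over i in S, with iid
    # standard Gumbel noise G_i, has law Pr(i | S) = theta_i / sum_{j in S} theta_j.
    best, best_score = None, -math.inf
    for i in S:
        gumbel = -math.log(-math.log(rng.random()))  # standard Gumbel draw
        score = math.log(theta[i]) + gumbel
        if score > best_score:
            best, best_score = i, score
    return best

theta = {0: 0.9, 1: 0.5, 2: 0.3, 3: 0.1}  # illustrative PL scores
S = [0, 1, 2, 3]
rng = random.Random(0)
T = 200_000
counts = {i: 0 for i in S}
for _ in range(T):
    counts[pl_winner_gumbel(theta, S, rng)] += 1
freqs = {i: counts[i] / T for i in S}  # should approach theta_i / sum_j theta_j
```

With a fixed seed and a large number of draws, the empirical winning frequencies agree with the analytic probabilities to within sampling error.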
Other families of discrete choice models can be obtained by imposing different probability distributions over the utility scores $X_i$; e.g., if $(X_1, \ldots, X_n)$ are jointly normal with mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$, then the corresponding RUM-based choice model reduces to the Multinomial Probit (MNP).
Independence of Irrelevant Alternatives. A choice model is said to possess the Independence of Irrelevant Alternatives (IIA) property if the ratio of probabilities of choosing any two items, say $i_1$ and $i_2$, from within any choice set is independent of the other alternatives present in the set (Benson et al., 2016). Specifically, $\frac{Pr(i_1 \mid S_1)}{Pr(i_2 \mid S_1)} = \frac{Pr(i_1 \mid S_2)}{Pr(i_2 \mid S_2)}$ for any two distinct subsets $S_1, S_2 \subseteq [n]$ that contain $i_1$ and $i_2$. Plackett-Luce satisfies the IIA property.
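The IIA identity is a direct consequence of the PL winner-probability formula. The following snippet (parameter values illustrative) verifies that the choice-probability ratio of two items is the same across different subsets containing them, and equals the ratio of their PL scores.

```python
theta = {1: 0.8, 2: 0.4, 3: 0.2, 4: 0.1}  # illustrative PL scores

def choice_prob(i, S):
    # Plackett-Luce winner probability of item i within subset S
    return theta[i] / sum(theta[j] for j in S)

S1, S2 = [1, 2, 3], [1, 2, 4]  # two different subsets, both containing 1 and 2
ratio_S1 = choice_prob(1, S1) / choice_prob(2, S1)
ratio_S2 = choice_prob(1, S2) / choice_prob(2, S2)
# both ratios equal theta_1 / theta_2, regardless of the third alternative
```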
3 Problem Setup
We consider the PAC version of the sequential decision-making problem of finding the ranking of $n$ items by making subset-wise comparisons. Formally, the learner is given a finite set $[n]$ of $n$ arms. At each decision round $t = 1, 2, \ldots$, the learner selects a subset $S_t \subseteq [n]$ of $k$ items, and receives (stochastic) feedback about the winner (or most preferred) item of $S_t$, drawn from a Plackett-Luce (PL) model with parameters $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_n)$, a priori unknown to the learner. The nature of the feedback is described in Section 3.1. We assume henceforth that the items are numbered so that $\theta_1 \geq \theta_2 \geq \cdots \geq \theta_n$, and also, for ease of exposition, that the highest parameter value is unique (we naturally assume that this ordering is not known to the learning algorithm, and note that the extension to the case where several items have the same highest parameter value is easily accomplished).
Definition 1 ($\epsilon$-Best-Item).
For any $\epsilon > 0$, an item $i \in [n]$ is called an $\epsilon$-Best-Item if its PL score parameter is worse than that of the Best-Item by no more than $\epsilon$, i.e. if $\theta_i \geq \theta_1 - \epsilon$. A $0$-Best-Item is an item with the largest PL parameter, which is also a Condorcet winner (Ramamohan et al., 2016) in case it is unique.
Definition 2 ($\epsilon$-Best-Ranking).
We define a ranking $\sigma$ to be an $\epsilon$-Best-Ranking when no pair of items in $[n]$ is misranked by $\sigma$ unless their PL scores are $\epsilon$-close to each other. Formally, $\sigma$ is an $\epsilon$-Best-Ranking if, for all $i, j \in [n]$, $\theta_i > \theta_j + \epsilon \implies \sigma^{-1}(i) < \sigma^{-1}(j)$. A $0$-Best-Ranking will be called a Best-Ranking or optimal ranking of the PL model. With $\theta_1 > \theta_2 > \cdots > \theta_n$, the unique Best-Ranking is $\sigma^* = (1, 2, \ldots, n)$.
Definition 3 ($\epsilon$-Best-Ranking-Multiplicative).
We define a ranking $\sigma$ of $[n]$ to be $\epsilon$-Best-Ranking-Multiplicative if no pair of items is misranked by $\sigma$ unless their pairwise preference probability is $\epsilon$-close to $\frac{1}{2}$, i.e., for all $i, j \in [n]$, $\frac{\theta_i}{\theta_i + \theta_j} > \frac{1}{2} + \epsilon \implies \sigma^{-1}(i) < \sigma^{-1}(j)$.
Note: The term ‘multiplicative’ emphasizes the fact that the condition above equivalently imposes a multiplicative constraint on the PL score parameters, since $\frac{\theta_i}{\theta_i + \theta_j} > \frac{1}{2} + \epsilon$ is equivalent to $\frac{\theta_i}{\theta_j} > \frac{1/2 + \epsilon}{1/2 - \epsilon}$.
3.1 Feedback models
By feedback model, we mean the information received (from the ‘environment’) once the learner plays a subset of items. We consider the following feedback models in this work:
Winner of the selected subset (WI): The environment returns a single item $I \in S$, drawn independently from the probability distribution $Pr(I = i \mid S) = \frac{\theta_i}{\sum_{j \in S} \theta_j}$, $i \in S$.
Full ranking on the selected subset (FR): The environment returns a full ranking $\boldsymbol{\sigma} \in \Sigma_S$, drawn from the probability distribution $Pr(\boldsymbol{\sigma} \mid S) = \prod_{i=1}^{|S|} \frac{\theta_{\sigma(i)}}{\sum_{j=i}^{|S|} \theta_{\sigma(j)}}$. This is equivalent to picking item $\sigma(1)$ according to winner (WI) feedback from $S$, then picking $\sigma(2)$ according to WI feedback from $S \setminus \{\sigma(1)\}$, and so on, until all elements of $S$ are exhausted – in other words, successively sampling winners from $S$ according to the PL model, without replacement.
More generally, we define:
Top-$m$ ranking from the selected subset (TR-$m$ or TR): The environment successively samples (without replacement) only the first $m$ items from among $S$, according to the PL model over $S$, and returns the ordered list. It follows that TR reduces to FR when $m = k$ and to WI when $m = 1$.
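The three feedback models can thus share one sampling routine: draw PL winners successively without replacement and truncate at $m$. A minimal sketch (parameter values illustrative):

```python
import random

def pl_top_m(theta, S, m, rng):
    # TR-m feedback: successively sample PL winners from S without replacement.
    # m = 1 recovers WI feedback; m = len(S) recovers full-ranking (FR) feedback.
    remaining, ranking = list(S), []
    for _ in range(m):
        total = sum(theta[i] for i in remaining)
        u, acc = rng.random() * total, 0.0  # inverse-CDF draw of the winner
        for i in remaining:
            acc += theta[i]
            if u <= acc:
                winner = i
                break
        ranking.append(winner)
        remaining.remove(winner)
    return ranking
```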
3.2 Performance Objective: $(\epsilon,\delta)$-PAC-Rank – Correctness and Sample Complexity
Consider a problem instance with Plackett-Luce (PL) model parameters $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_n)$ and subset size $k$, with Best-Ranking $\sigma^*$, and let $\epsilon, \delta \in (0, 1)$ be two given constants. A sequential algorithm that operates on this problem instance, with the WI feedback model, is said to be $(\epsilon,\delta)$-PAC-Rank if (a) it stops and outputs a ranking $\sigma$ after a finite number of decision rounds (subset plays) with probability $1$, and (b) the probability that its output $\sigma$ is an $\epsilon$-Best-Ranking is at least $1 - \delta$, i.e., $Pr(\sigma \text{ is an } \epsilon\text{-Best-Ranking}) \geq 1 - \delta$. Furthermore, by the sample complexity of the algorithm, we mean the expected time (number of decision rounds) taken by the algorithm to stop.
In the context of our above problem objective, it is worth noting that Szörényi et al. (2015) addressed a similar problem in the dueling bandit setup ($k = 2$), but with the notion of $\epsilon$-Best-Ranking-Multiplicative – we term this objective $(\epsilon,\delta)$-PAC-Rank-Multiplicative, as referred to later when comparing results. The two objectives are, however, equivalent under a mild boundedness assumption, as follows:
Assume $0 < \theta_{\min} \leq \theta_i \leq \theta_{\max}$ for every $i \in [n]$. If an algorithm is $(\epsilon,\delta)$-PAC-Rank, then it is also $(\epsilon',\delta)$-PAC-Rank-Multiplicative for a suitably scaled tolerance $\epsilon'$ (depending only on $\epsilon$, $\theta_{\min}$ and $\theta_{\max}$). On the other hand, if an algorithm is $(\epsilon,\delta)$-PAC-Rank-Multiplicative, then it is also $(\epsilon'',\delta)$-PAC-Rank for a correspondingly scaled tolerance $\epsilon''$.
4 Parameter Estimation with PL based preference data
We develop in this section some useful parameter estimation techniques based on adaptively sampled preference data from the PL model, which will form the basis for our PAC algorithms later on, in Section 5.1.
4.1 Estimating Pairwise Preferences via Rank-Breaking.
Rank breaking is a well-understood idea involving the extraction of pairwise comparisons from (partial) ranking data, and then building pairwise estimators on the obtained pairs by treating each comparison independently (Khetan and Oh, 2016; Jang et al., 2017); e.g., a winner $i$ sampled from among $S$ is rank-broken into the pairwise preferences $(i \succ j)$ for every $j \in S \setminus \{i\}$. We use this idea to devise estimators for the pairwise win probabilities $p_{ij} = \frac{\theta_i}{\theta_i + \theta_j}$ in the active learning setting. The following result, used to design Algorithm 1 later, establishes explicit confidence intervals for pairwise win/loss probability estimates under adaptively sampled PL data.
Lemma 5 (Pairwise win-probability estimates for the PL model).
Consider a Plackett-Luce choice model with parameters $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_n)$, and fix two distinct items $i, j \in [n]$. Let $S_1, S_2, \ldots$ be a sequence of (possibly random) subsets of $[n]$ of size at least $2$, and $i_1, i_2, \ldots$ a sequence of random items with each $i_t \in S_t$, $t \geq 1$, such that for each $t$: (a) $S_t$ depends only on $S_1, i_1, \ldots, S_{t-1}, i_{t-1}$, (b) $i_t$ is distributed as the Plackett-Luce winner of the subset $S_t$, given $S_t$ and the past, and (c) $\{i, j\} \subseteq S_t$ with probability $1$. Let $n_{ij}(t) = \sum_{s=1}^{t} \mathbf{1}(i_s \in \{i, j\})$ and $\hat{p}_{ij}(t) = \frac{1}{n_{ij}(t)} \sum_{s=1}^{t} \mathbf{1}(i_s = i)$. Then, for any positive integer $v$ and $\eta > 0$,
$$Pr\left(\hat{p}_{ij}(t) - \frac{\theta_i}{\theta_i + \theta_j} \geq \eta, \; n_{ij}(t) \geq v\right) \leq e^{-2 v \eta^2},$$
and the symmetric bound holds for the lower deviation.
Notes: (a) The result gives an exponential deviation inequality for the estimate $\hat{p}_{ij}$ of $p_{ij} = \frac{\theta_i}{\theta_i + \theta_j}$. Although it is tempting to conclude that $\hat{p}_{ij}(t)$ is an unbiased estimate of $p_{ij}$, it is unclear if this holds at any finite time horizon, due to the denominator $n_{ij}(t)$ being a random quantity. In an asymptotic sense, as $t \to \infty$, the bias can indeed be seen to vanish by a renewal theory argument. The lemma exploits the IIA property of the PL model, together with a novel coupling argument in an $\{i, j\}$-specific probability space and Hoeffding’s inequality, to establish a large deviation bound for the estimate (proof in the appendix). (b) Jang et al. (see 2017, Proof of Thm. 3) also control deviations of pairwise probability estimators for PL, but in the offline (batch) setting where the denominator is nonrandom.
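To illustrate the estimator behind this lemma, the sketch below (illustrative parameters; a simple round-robin schedule stands in for an adaptive one) plays subsets that all contain a fixed pair $\{i, j\}$, keeps only the rounds whose winner lands in $\{i, j\}$, and checks that the fraction won by $i$ concentrates around $\theta_i/(\theta_i + \theta_j)$ regardless of which subsets were played.

```python
import random

def pl_winner(theta, S, rng):
    # inverse-CDF draw from the PL winner distribution on S
    total = sum(theta[i] for i in S)
    u, acc = rng.random() * total, 0.0
    for i in S:
        acc += theta[i]
        if u <= acc:
            return i

def rank_broken_estimate(theta, i, j, subsets, rounds, rng):
    # Rank breaking: restrict attention to rounds won by i or j; by IIA the
    # conditional chance that the winner is i equals theta_i/(theta_i+theta_j),
    # whichever subset (containing both i and j) was played that round.
    wins_i, wins_pair = 0, 0
    for t in range(rounds):
        w = pl_winner(theta, subsets[t % len(subsets)], rng)
        if w == i or w == j:
            wins_pair += 1
            if w == i:
                wins_i += 1
    return wins_i / wins_pair

theta = {1: 0.8, 2: 0.4, 3: 0.3, 4: 0.2, 5: 0.1}      # illustrative scores
subsets = [[1, 2, 3], [1, 2, 4], [1, 2, 5], [1, 2, 3, 4, 5]]
rng = random.Random(7)
p_hat = rank_broken_estimate(theta, 1, 2, subsets, 150_000, rng)
```

The estimate converges to $p_{12} = 0.8/(0.8+0.4) = 2/3$ even though the four played subsets induce four different winner distributions.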
4.2 Estimating relative PL scores ($\theta_i/\theta_j$) using Renewal Cycles
We detail another method to directly estimate (relative) score parameters of the PL model, using renewal cycles and the IIA property.
Lemma 6 (Renewal-cycle based estimates for the PL model).
Consider a Plackett-Luce choice model with parameters $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_n)$, a fixed subset $S \subseteq [n]$, and an item $b \in S$. Let $i_1, i_2, \ldots$ be a sequence of iid winner draws from the model restricted to $S$. Let $\tau$ be the first time at which $b$ appears, and, for each $j \in S \setminus \{b\}$, let $N_j$ be the number of times $j$ appears until time $\tau$. Then $\tau$ and $N_j$ are Geometric random variables with success parameters $\frac{\theta_b}{\sum_{l \in S} \theta_l}$ and $\frac{\theta_b}{\theta_b + \theta_j}$, respectively; in particular, $\mathbb{E}[N_j] = \frac{\theta_j}{\theta_b}$.
With this in hand, we now show how fast the empirical mean estimates over several renewal cycles (defined by the appearances of a distinguished pivot item $b$) converge to the true relative scores $\frac{\theta_j}{\theta_b}$, a result to be employed in the design of Algorithm 2 later.
Lemma 7 (Concentration of Geometric random variables via the Negative Binomial distribution).
Suppose $X_1, \ldots, X_s$ are iid $\text{Geo}(p)$ random variables and $\bar{X} = \frac{1}{s} \sum_{l=1}^{s} X_l$, so that $\sum_{l=1}^{s} X_l \sim \text{NB}(s, p)$ and $\mathbb{E}[\bar{X}] = \frac{1-p}{p}$. Then, for any $\eta > 0$,
$$Pr\left(\left|\bar{X} - \frac{1-p}{p}\right| \geq \eta\right) \leq 2 \exp\left(-\frac{2 s \eta^2 p^3}{1 + \eta p}\right).$$
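A small simulation (illustrative parameters) of the renewal-cycle estimator behind Lemmas 6 and 7: counting how often each item wins between consecutive wins of the pivot, and averaging over cycles, recovers the relative scores $\theta_j/\theta_b$.

```python
import random

def pl_winner(theta, S, rng):
    # inverse-CDF draw from the PL winner distribution on S
    total = sum(theta[i] for i in S)
    u, acc = rng.random() * total, 0.0
    for i in S:
        acc += theta[i]
        if u <= acc:
            return i

def renewal_estimates(theta, S, pivot, cycles, rng):
    # Between consecutive wins of the pivot b, the number of wins N_j of any
    # other item j is Geometric with success probability theta_b/(theta_b+theta_j)
    # (by IIA), so E[N_j] = theta_j / theta_b; average N_j over many cycles.
    counts = {j: 0 for j in S if j != pivot}
    done = 0
    while done < cycles:
        w = pl_winner(theta, S, rng)
        if w == pivot:
            done += 1
        else:
            counts[w] += 1
    return {j: c / cycles for j, c in counts.items()}

theta = {1: 0.6, 2: 0.3, 3: 0.1}  # illustrative scores; pivot is item 1
rng = random.Random(3)
est = renewal_estimates(theta, [1, 2, 3], pivot=1, cycles=50_000, rng=rng)
```

The estimates approach $\theta_2/\theta_1 = 0.5$ and $\theta_3/\theta_1 \approx 0.167$, and sorting items by these relative scores recovers the ranking.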
5 Algorithms for WI Feedback
This section describes the design of $(\epsilon,\delta)$-PAC-Rank algorithms which use winner information (WI) feedback.
A key idea behind our proposed algorithms is to estimate the relative strength of each item with respect to a fixed item, termed the pivot item $b$. This helps to compare every item on common terms (with respect to the pivot item), even if two items are never directly compared with each other. Our first algorithm, Beat-the-Pivot, maintains pairwise score estimates $\hat{p}_{ib}$ of the items with respect to the pivot element, based on the idea of Rank-Breaking and Lemma 5. The second algorithm, Score-and-Rank, directly estimates the relative scores $\frac{\theta_i}{\theta_b}$ for each item $i$, relying on Lemma 6 (Section 4.2). Once all item scores are estimated with enough confidence, the items are simply sorted with respect to their preference scores to obtain a ranking.
5.1 The Beat-the-Pivot algorithm
Beat-the-Pivot (Algorithm 1) first estimates an approximate Best-Item $b$ with high probability. We do this using the subroutine Find-the-Pivot (Algorithm 3), which, with high probability, outputs an $\epsilon$-Best-Item within a sample complexity of $O\left(\frac{n}{\epsilon^2}\ln\frac{1}{\delta}\right)$.
Once the pivot item is found, Beat-the-Pivot divides the rest of the $n - 1$ items into $\lceil \frac{n-1}{k-1} \rceil$ groups of size $k - 1$, and appends $b$ to each group. This way, the elements of every group get to compete (and hence be compared) against $b$, which aids in estimating the pairwise preference score $p_{ib}$ of each item $i$ with respect to the pivot item $b$, owing to the IIA property of the PL model and Lemma 5 (Sec. 4.1); sorting these estimates yields the final ranking. Theorem 8 shows that Beat-the-Pivot enjoys the optimal sample complexity guarantee of $O\left(\frac{n}{\epsilon^2}\ln\frac{n}{\delta}\right)$. The pseudo code of Beat-the-Pivot is given in Algorithm 1.
Theorem 8 (Beat-the-Pivot: Correctness and Sample Complexity).
Beat-the-Pivot (Algorithm 1) is $(\epsilon,\delta)$-PAC-Rank with sample complexity $O\left(\frac{n}{\epsilon^2}\ln\frac{n}{\delta}\right)$.
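The grouping step of Beat-the-Pivot can be sketched as follows; this is a simplified sketch, and the padding rule for a short last group is our assumption rather than necessarily the paper's exact choice.

```python
def pivot_groups(n, pivot, k):
    # Partition [1..n] \ {pivot} into batches of size k-1 and append the pivot
    # to each, so every played subset contains the common reference element.
    others = [i for i in range(1, n + 1) if i != pivot]
    groups = []
    for s in range(0, len(others), k - 1):
        batch = others[s:s + k - 1]
        # assumed padding rule: reuse earlier items so every subset has size k
        if len(batch) < k - 1:
            batch = batch + [i for i in others if i not in batch][:k - 1 - len(batch)]
        groups.append(batch + [pivot])
    return groups
```

Every group is then played repeatedly; the WI feedback from each group is rank-broken against the pivot to estimate $p_{ib}$ for all its members.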
5.2 The Score-and-Rank algorithm
Score-and-Rank (Algorithm 2) differs from Beat-the-Pivot in the score estimate it maintains for each item. Instead of maintaining pivot-preference scores $\hat{p}_{ib}$, Score-and-Rank aims to directly estimate the PL score of each item relative to the score of the pivot $b$. In other words, the algorithm maintains relative score estimates $\widehat{\theta_i/\theta_b}$ for every item $i$, borrowing the results of Lemmas 6 and 7, and finally returns the ranking obtained by sorting the items with respect to their relative pivotal scores. Score-and-Rank also runs within the optimal sample complexity of $O\left(\frac{n}{\epsilon^2}\ln\frac{n}{\delta}\right)$, as shown in Theorem 9. The complete algorithm is described in Algorithm 2.
Theorem 9 (Score-and-Rank: Correctness and Sample Complexity).
Score-and-Rank (Algorithm 2) is $(\epsilon,\delta)$-PAC-Rank with sample complexity $O\left(\frac{n}{\epsilon^2}\ln\frac{n}{\delta}\right)$.
In this section, we describe the pivot selection procedure Find-the-Pivot. The algorithm serves the purpose of finding an $\epsilon$-Best-Item with high probability, which is used as the pivot element by both Algorithm 1 (Sec. 5.1) and Algorithm 2 (Sec. 5.2).
Find-the-Pivot is based on the simple idea of tracking the empirical best item: it maintains a running winner $r$ at every iteration, which is made to compete with a set of $k - 1$ arbitrarily chosen other items for long enough. If, at the end, the empirical winner $c$ of the played subset turns out to be sufficiently more favorable than the running winner $r$ in terms of its pairwise preference score $\hat{p}_{cr}$, then $c$ replaces $r$; otherwise $r$ retains its place and the status quo ensues. The process continues until we are left with a single surviving element, which is returned as the pivot. The formal description of Find-the-Pivot is given in Algorithm 3.
Lemma 10 (Find-the-Pivot: Correctness and Sample Complexity with WI).
Find-the-Pivot (Algorithm 3) achieves the $(\epsilon,\delta)$-PAC objective (for best-item identification) with sample complexity $O\left(\frac{n}{\epsilon^2}\ln\frac{1}{\delta}\right)$.
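In skeleton form, the running-winner tournament of Find-the-Pivot looks as follows. Here `empirical_best` is a stand-in (our assumption, not the paper's exact subroutine) for playing a subset long enough and returning its empirical winner; the $\epsilon$-favorability test and confidence bookkeeping are elided.

```python
def find_pivot(items, k, empirical_best):
    # Maintain a running winner r; repeatedly pit r against k-1 fresh items
    # and keep whichever item the duel subroutine declares the subset winner.
    r, rest = items[0], list(items[1:])
    while rest:
        batch, rest = rest[:k - 1], rest[k - 1:]
        r = empirical_best([r] + batch)  # challenger replaces r only if it wins
    return r
```

With a noiseless oracle, the procedure returns the true best item after $\lceil (n-1)/(k-1) \rceil$ subset plays; the real algorithm replaces the oracle by repeated WI draws and a confidence threshold.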
6 Lower Bound
In this section we show that the minimum sample complexity required for any symmetric algorithm to be $(\epsilon,\delta)$-PAC-Rank is at least $\Omega\left(\frac{n}{\epsilon^2}\ln\frac{n}{\delta}\right)$ (Theorem 12). Note that this in fact matches the sample complexity bounds of our proposed algorithms (recall Theorems 8 and 9), showing the tightness of both our upper and lower bound guarantees. The key observation lies in noting that the result is independent of $k$, which shows that the learning problem with $k$-subsetwise WI feedback is as hard as that of the dueling bandit setup ($k = 2$) – the flexibility of playing a size-$k$ subset does not help in faster information aggregation. We first define the notion of a symmetric, or label-invariant, algorithm.
Definition 11 (Symmetric Algorithm).
A PAC algorithm $\mathcal{A}$ is said to be symmetric if its output is insensitive to the specific labelling of items, i.e., if for any PL model $\boldsymbol{\theta}$, bijection $\phi: [n] \to [n]$ and ranking $\sigma$, it holds that $Pr_{\mathcal{A}, \boldsymbol{\theta}}(\sigma) = Pr_{\mathcal{A}, \boldsymbol{\theta}^{\phi}}(\phi \circ \sigma)$, where $Pr_{\mathcal{A}, \boldsymbol{\theta}}$ denotes the probability distribution on the trajectory of $\mathcal{A}$ induced by the PL model $\boldsymbol{\theta}$, and $\boldsymbol{\theta}^{\phi}$ denotes the model with relabelled parameters $\theta^{\phi}_i = \theta_{\phi(i)}$.
Theorem 12 (Lower bound on Sample Complexity with WI feedback).
Given a fixed $\epsilon, \delta \in (0, 1)$, $k \geq 2$, and a symmetric $(\epsilon,\delta)$-PAC-Rank algorithm $\mathcal{A}$ for the WI feedback model, there exists a PL instance $\boldsymbol{\theta}$ such that the sample complexity of $\mathcal{A}$ on $\boldsymbol{\theta}$ is at least $\Omega\left(\frac{n}{\epsilon^2}\ln\frac{n}{\delta}\right)$.
Proof (sketch). The argument is based on the change-of-measure technique of Kaufmann et al. (2016), restated in Appendix D.1 as Lemma 25. To employ this result, note that in our case each bandit instance corresponds to a problem instance with the arm set containing all subsets of $[n]$ of size $k$. The key part of our proof relies on carefully crafting a true instance, with its optimal ranking, and a family of slightly perturbed alternative instances, each with a different optimal ranking.
Designing the problem instances. We first renumber the items so that the true instance has its PL parameters in the natural order. For any integer $g$, we define a class of problem instances where each instance is associated with a set $G \subseteq [n]$ of size $g$, and the PL parameters of the instance take one of two values depending on membership in $G$, separated by a gap proportional to $\epsilon$. We restrict ourselves to the class of instances of this form.
Corresponding to each such problem instance associated with a set $G$, we consider a slightly altered problem instance associated with a set $G'$, obtained by exchanging one element of $G$ with one outside it, with the PL parameters of the altered instance set up by the same construction as above.
It is easy to verify that, for each such problem instance, an $\epsilon$-Best-Ranking (Definition 2) has to order the perturbed items consistently with their scores; moreover, it is unique.
Theorem 12 is now obtained by applying Lemma 25 to pairs of problem instances as above, for the event that the algorithm outputs the unique $\epsilon$-Best-Ranking of the true instance. For this, however, we derive a tighter upper bound for the KL-divergence term on the right-hand side of Lemma 25. Since $\mathcal{A}$ is $(\epsilon,\delta)$-PAC-Rank, this event has probability at least $1 - \delta$ under the true instance and at most $\delta$ under the altered instance. But using the generic KL-divergence bound of Lemma 26 directly leads to a looser lower bound guarantee. Owing to the symmetry of $\mathcal{A}$, however, we prove a tighter guarantee by carefully applying both the symmetry and the $(\epsilon,\delta)$-PAC-Rank property of $\mathcal{A}$ across all possible choices of the problem instances. Formally, we show that:
For any symmetric $(\epsilon,\delta)$-PAC-Rank algorithm $\mathcal{A}$, any problem instance associated with the set $G$, and any item $i$, the stated probability bound holds, where $\sigma_{\mathcal{A}}$ denotes the ranking returned by algorithm $\mathcal{A}$, and $Pr_{\boldsymbol{\theta}}(\cdot)$ denotes the probability of an event under the underlying problem instance $\boldsymbol{\theta}$ and the internal randomness of the algorithm (if any).
We use the above result with suitable choices of the instance pair and the event, which leads to the desired tighter upper bound on the KL-divergence term; the last inequality follows from Lemma 26 (in the appendix).
Theorem 12 shows, rather surprisingly, that PAC ranking with winner information feedback from size-$k$ subsets does not become easier (in a worst-case sense) with $k$, implying that there is no reduction in hardness of learning relative to the pairwise comparison case ($k = 2$). While one may expect the sample complexity to improve as the number of items being simultaneously tested in each round ($k$) becomes larger, there is a counteracting effect: it is intuitively ‘harder’ for a high-value item to win in just a single winner draw against a (large) population of other competitors. A useful heuristic here is that the number of bits of information that a single winner draw from a size-$k$ subset provides is $O(\log k)$, which is not significantly larger than the single bit obtained when $k = 2$; thus, an algorithm cannot accumulate significantly more information per round compared to the pairwise case.
Given a fixed $\epsilon, \delta \in (0, 1)$, $k \geq 2$, and a symmetric $(\epsilon,\delta)$-PAC-Rank-Multiplicative algorithm $\mathcal{A}$ for the WI feedback model, there exists a PL instance $\boldsymbol{\theta}$ such that the sample complexity of $\mathcal{A}$ on $\boldsymbol{\theta}$ is at least $\Omega\left(\frac{n}{\epsilon^2}\ln\frac{n}{\delta}\right)$.
7 Analysis with Top-$m$ Ranking (TR) feedback
We now proceed to analyze the problem with Top-$m$ Ranking (TR) feedback (Sec. 3.1). We first show that, unlike WI feedback, the sample complexity lower bound here scales as $\Omega\left(\frac{n}{m\epsilon^2}\ln\frac{n}{\delta}\right)$ (Thm. 15), a factor $m$ smaller than the bound in Thm. 12 for the WI feedback model. At a high level, this is because TR reveals preference information for $m$ items per feedback round, as opposed to just a single (noisy) information sample of the winning item (WI). Following this, we also present two algorithms for this setting which are shown to enjoy an order-wise optimal sample complexity guarantee of $O\left(\frac{n}{m\epsilon^2}\ln\frac{n}{\delta}\right)$ (Sec. 7.2).
7.1 Lower Bound for Top-$m$ Ranking (TR) feedback
Theorem 15 (Sample Complexity Lower Bound for TR).
Given $\epsilon, \delta \in (0, 1)$ and $2 \leq m \leq k$, and a symmetric $(\epsilon,\delta)$-PAC-Rank algorithm $\mathcal{A}$ with top-$m$ ranking (TR) feedback, there exists a PL instance $\boldsymbol{\theta}$ such that the expected sample complexity of $\mathcal{A}$ on $\boldsymbol{\theta}$ is at least $\Omega\left(\frac{n}{m\epsilon^2}\ln\frac{n}{\delta}\right)$.
The sample complexity lower bound for $(\epsilon,\delta)$-PAC-Rank with the top-$m$ ranking (TR) feedback model is $\frac{1}{m}$ times that of the WI model (Thm. 12). Intuitively, revealing a ranking on $m$ items of a $k$-set provides about $\log_2 \frac{k!}{(k-m)!} \approx m \log_2 k$ bits of information per round, which is about $m$ times as large as that of revealing a single winner, yielding an acceleration by a factor of $m$.
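The heuristic can be made concrete with a quick count of outcomes (the values of $k$ and $m$ below are illustrative): a top-$m$ ranking of a $k$-set has $k!/(k-m)!$ possible values, i.e. about $m \log_2 k$ bits, versus $\log_2 k$ bits for a single winner draw.

```python
import math

k, m = 20, 5  # illustrative subset and top-list sizes
outcomes_top_m = math.factorial(k) // math.factorial(k - m)  # k!/(k-m)!
bits_top_m = math.log2(outcomes_top_m)
bits_winner = math.log2(k)
gain = bits_top_m / bits_winner  # roughly an m-fold information gain per round
```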
Given $\epsilon, \delta \in (0, 1)$, and a symmetric $(\epsilon,\delta)$-PAC-Rank algorithm $\mathcal{A}$ with full ranking (FR) feedback ($m = k$), there exists a PL instance $\boldsymbol{\theta}$ such that the expected sample complexity of $\mathcal{A}$ on $\boldsymbol{\theta}$ is at least $\Omega\left(\frac{n}{k\epsilon^2}\ln\frac{n}{\delta}\right)$.
7.2 Algorithms for the Top-$m$ Ranking (TR) feedback model
This section presents two algorithms that work with top-$m$ ranking feedback and are shown to satisfy the $(\epsilon,\delta)$-PAC-Rank property with the optimal sample complexity guarantee of $O\left(\frac{n}{m\epsilon^2}\ln\frac{n}{\delta}\right)$, matching the lower bound derived in the previous section (Theorem 15). This shows a factor-$m$ faster learning rate compared to the WI feedback model, achieved by generalizing our two earlier algorithms (Algorithms 1 and 2, Sec. 5, for WI feedback) to the top-$m$ ranking (TR) feedback. The two algorithms are presented below:
Algorithm 5: Generalizing Beat-the-Pivot for top-$m$ ranking (TR) feedback.
The first algorithm is based on our earlier Beat-the-Pivot algorithm (Algorithm 1); it maintains the empirical pivotal preferences $\hat{p}_{ib}$ for each item $i$ by applying the trick of Rank Breaking to the TR feedback (i.e., the top-$m$ ranking) received per round after each $k$-subsetwise play.
Rank-Breaking (Khetan and Oh, 2016; Soufiani et al., 2014). The concept of Rank Breaking is based upon the idea of extracting pairwise comparisons from subset-wise preference information. Formally, given any set $S$ of size $k$, if $\boldsymbol{\sigma}$ denotes a possible top-$m$ ranking of $S$, the Rank Breaking subroutine considers each item in $S$ to be beaten by all of its preceding items in $\boldsymbol{\sigma}$, in a pairwise sense. See Algorithm 4 for a detailed description of the procedure.
Of course, in general, Rank Breaking may lead to arbitrarily inconsistent estimates of the underlying model parameters (Azari et al., 2012). However, owing to the IIA property of the Plackett-Luce model, we get clean concentration guarantees on the resulting pairwise estimates using Lem. 5. This is precisely the idea used to obtain the factor-$m$ improvement in the sample complexity guarantee of Beat-the-Pivot, as analysed in Theorem 17. The formal description of Beat-the-Pivot, generalized to the setting of TR feedback, is given in Algorithm 5.
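A minimal sketch of the Rank-Breaking decomposition described above; the convention that every ranked item also beats every unranked item of $S$ is our reading of top-$m$ feedback.

```python
def rank_break(top_m, S):
    # Decompose a top-m ranking of S into pairwise outcomes (winner, loser):
    # each ranked item beats all items ranked below it and all unranked items.
    unranked = [i for i in S if i not in top_m]
    pairs = []
    for pos, winner in enumerate(top_m):
        for loser in top_m[pos + 1:]:
            pairs.append((winner, loser))
        for loser in unranked:
            pairs.append((winner, loser))
    return pairs
```

Each extracted pair is then fed to the pairwise estimator of Lemma 5 as if it were an independent duel; the IIA property is what makes the resulting estimates consistent.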
Theorem 17 (Beat-the-Pivot: Correctness and Sample Complexity for TR feedback).
With the top-$m$ ranking (TR) feedback model, Beat-the-Pivot (Algorithm 5) is $(\epsilon,\delta)$-PAC-Rank with sample complexity $O\left(\frac{n}{m\epsilon^2}\ln\frac{n}{\delta}\right)$.