Best-item Learning in Random Utility Models with Subset Choices

02/19/2020 ∙ by Aadirupa Saha, et al. ∙ Indian Institute of Science

We consider the problem of PAC learning the most valuable item from a pool of n items using sequential, adaptively chosen plays of subsets of k items, when, upon playing a subset, the learner receives relative feedback sampled according to a general Random Utility Model (RUM) with independent noise perturbations to the latent item utilities. We identify a new property of such a RUM, termed the minimum advantage, that helps characterize the complexity of separating pairs of items based on their relative win/loss empirical counts, and can be bounded as a function of the noise distribution alone. We give a learning algorithm for general RUMs, based on pairwise relative counts of items and hierarchical elimination, along with a new PAC sample complexity guarantee of O((n/c^2 ϵ^2) log(k/δ)) rounds to identify an ϵ-optimal item with confidence 1-δ, when the worst-case pairwise advantage in the RUM has sensitivity at least c to the parameter gaps of items. Fundamental lower bounds on PAC sample complexity show that this is near-optimal in terms of its dependence on n, k and c.




1 Introduction

Random utility models (RUMs) are a popular and well-established framework for studying behavioral choices by individuals and groups Thurstone (1927). In a RUM with finite alternatives or items, a distribution on the preferred alternative(s) is assumed to arise from a random utility drawn from a distribution for each item, followed by rank ordering the items according to their utilities.

Perhaps the most widely known RUM is the Plackett-Luce or multinomial logit model Plackett (1975); Luce (2012), which results when each item’s utility is sampled from an additive model with a Gumbel-distributed perturbation. It is unique among RUMs in enjoying the property of independence of irrelevant alternatives (IIA), which is often key in permitting efficient inference of Plackett-Luce models from data Khetan and Oh (2016). Other well-known RUMs include the probit model Bliss (1934), featuring random Gaussian perturbations to the intrinsic utilities, the mixed logit, the nested logit, etc.

A long line of work in statistics and machine learning focuses on estimating RUM properties from observed data Soufiani et al. (2014); Zhao et al. (2018); Soufiani et al. (2013). Online learning or adaptive testing, on the other hand, has yielded efficient ways of identifying the most attractive (i.e., highest-utility) items in RUMs by learning from relative feedback on item pairs or, more generally, subsets Szörényi et al. (2015); Saha and Gopalan (2019); Jang et al. (2017). However, almost all existing work in this vein exclusively employs the Plackett-Luce model, arguably due to its very useful IIA property, and our understanding of learning performance in other, more general RUMs has been lacking. We take a step in this direction by framing the problem of sequentially learning the best item(s) in general RUMs via adaptive testing of item subsets and observation of relative RUM feedback. In the process, we uncover new structural properties of RUMs, including models with exponential, uniform and Gaussian (probit) utility distributions, and give algorithmic principles to exploit this structure, permitting provably sample-efficient online learning beyond Plackett-Luce.

Our contributions: We introduce a new property of a RUM, called the (pairwise) advantage ratio, which essentially measures the worst-case relative win probabilities of an item pair across all possible contexts (subsets) in which they occur. We show that this ratio can be controlled (bounded below) as an affine function of the relative strengths of item pairs for RUMs based on several common centered utility distributions, e.g., exponential, Gumbel, uniform, Gamma, Weibull, normal, etc., even when the resulting RUM does not possess analytically favorable properties such as IIA.

We give an algorithm for sequentially and adaptively PAC (probably approximately correct) learning the best item from among a finite pool when, in each decision round, a subset of fixed size can be tested and top-m rank ordered feedback from the RUM can be observed. The algorithm is based on the idea of maintaining pairwise win/loss counts among items, hierarchically testing subsets and propagating the surviving winners – principles that have been shown to work optimally in the more structured Plackett-Luce RUM Szörényi et al. (2015); Saha and Gopalan (2019).

In terms of performance guarantees, we derive a PAC sample complexity bound for our algorithm: when working with a pool of n items in total and subsets of size k chosen in each decision round, the algorithm terminates in O((n/c^2 ϵ^2) log(k/δ)) rounds, where c is a lower bound on the advantage ratio’s sensitivity to intrinsic item utilities. This sensitivity can in turn be shown to be a property of only the RUM’s perturbation distribution, independent of the subset size k. A novel feature of the guarantee is that, unlike existing sample complexity results for sequential testing in the Plackett-Luce model, it does not rely on specific properties like IIA which are not present in general RUMs. We also extend the result to cover top-m rank ordered feedback, of which winner feedback (m = 1) is a special case. Finally, we show that the sample complexity of our algorithm is order-wise optimal across RUMs having a given advantage ratio sensitivity c, by arguing an information-theoretic lower bound on the sample complexity of any online learning algorithm.

Our results and techniques represent a conceptual advance in the problem of online learning in general RUMs, moving beyond the Plackett-Luce model for the first time to the best of our knowledge.

Related Work: For the classical multi-armed bandit setting, there is a well-studied literature on the PAC arm-identification problem Even-Dar et al. (2006); Audibert and Bubeck (2010); Kalyanakrishnan et al. (2012); Karnin et al. (2013); Jamieson et al. (2014), where the learner sees a noisy draw of absolute reward feedback upon playing a single arm per round. In contrast, learning to identify the best item(s) with only relative preference information (ordinal as opposed to cardinal feedback) has seen steady progress since the introduction of the dueling bandit framework Zoghi et al. (2013), in which pairs of items (size-2 subsets) can be played, and subsequent work generalising to broader models both in terms of distributional parameters Yue and Joachims (2009); Gajane et al. (2015); Ailon et al. (2014); Zoghi et al. (2015) and combinatorial subset-wise plays Mohajer et al. (2017); González et al. (2017); Saha and Gopalan (2018a); Sui et al. (2017). There have been several developments on the PAC objective for different pairwise preference models, such as those satisfying stochastic triangle inequalities and strong stochastic transitivity (Yue and Joachims, 2011), general utility-based preference models (Urvoy et al., 2013), the Plackett-Luce model (Szörényi et al., 2015) and the Mallows model (Busa-Fekete et al., 2014a). Recent work has studied PAC-learning objectives other than identifying the single (near-)best arm, e.g., recovering a few of the top arms (Busa-Fekete et al., 2013; Mohajer et al., 2017), or the true ranking of the items (Busa-Fekete et al., 2014b; Falahatgar et al., 2017). Several recent works have also extended the PAC-learning objective to relative subset-wise preferences Saha and Gopalan (2018b); Chen et al. (2017, 2018); Saha and Gopalan (2019); Ren et al. (2018).

However, none of the existing work considers strategies to learn efficiently in general RUMs with subset-wise preferences, and to the best of our knowledge we are the first to address this general problem setup. In a different direction, there has been work on batch (non-adaptive) estimation in general RUMs, e.g., Zhao et al. (2018); Soufiani et al. (2013); however, this does not consider the price of active learning and the associated exploration effort required, as we study here. A related body of literature lies in dynamic assortment selection, where the goal is to offer a subset of items to customers in order to maximise expected revenue. This has been studied under different choice models, e.g., multinomial logit (Talluri and Van Ryzin, 2004), Mallows and mixtures of Mallows (Désir et al., 2016a), Markov chain-based choice models (Désir et al., 2016b), the single transition model (Nip et al., 2017), etc. However, each of these works addresses a given, very specific kind of choice model, and their objective is better suited to a regret-minimization framework where playing every item comes with an associated cost.

Organization: We give the necessary preliminaries and our general RUM-based problem setup in Section 2. The formal description of our feedback models and the details of the ϵ-best arm identification problem are given in Section 3. In Section 4, we analyse the pairwise preferences of item pairs for our general RUM-based subset choice model and introduce the notion of the Advantage-Ratio connecting subset-wise scores to pairwise preferences. Our proposed algorithm, along with its performance guarantee and a matching lower bound analysis, is given in Section 5. We extend these results to the more general top-m ranking feedback model in Section 6. Section 7 concludes our work with certain future directions. All proofs are deferred to the appendix.

2 Preliminaries

Notation. We denote by [n] the set {1, 2, …, n}. For any subset S ⊆ [n], |S| denotes the cardinality of S. When there is no confusion about the context, we often represent an (unordered) subset S as a vector, or ordered subset, of size |S| (according to, say, a fixed global ordering of all the items [n]). In this case, S(i) denotes the item (member) at the i-th position in subset S. Σ_S is the set of permutations over the items of S, where for any permutation σ ∈ Σ_S, σ(i) denotes the element at the i-th position in σ. 1(φ) is generically used to denote an indicator variable that takes the value 1 if the predicate φ is true, and 0 otherwise. x ∨ y denotes the maximum of x and y, and Pr(A) is used to denote the probability of event A, in a probability space that is clear from the context.

2.1 Random Utility-based Discrete Choice Models

A discrete choice model specifies the relative preferences of two or more discrete alternatives in a given set. Random Utility Models (RUMs) are a widely-studied class of discrete choice models; they assume a (non-random) ground-truth utility score θ_i for each alternative i ∈ [n], and assign a distribution D_i for scoring item i, where the score X_i ∼ D_i is centered at θ_i. To model a winning alternative given any set S ⊆ [n], one first draws a random utility score X_i for each alternative in S, and selects the item with the highest random score. More formally, the probability that an item i ∈ S emerges as the winner in set S is given by:

Pr(i|S) = Pr(X_i > X_j for all j ∈ S \ {i}).
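To make the winner-sampling mechanism concrete, it can be simulated directly: draw one utility per item and take the argmax. A minimal Monte Carlo sketch in Python (the Gaussian noise choice and the utility values are illustrative assumptions, not fixed by the model):

```python
import random

def sample_winner(theta, noise=lambda: random.gauss(0.0, 1.0)):
    """Draw X_i = theta_i + zeta_i for each item of the subset and
    return the index of the item with the highest realized utility."""
    scores = [t + noise() for t in theta]
    return max(range(len(theta)), key=lambda i: scores[i])

def win_probabilities(theta, trials=200_000):
    """Monte Carlo estimate of Pr(i | S) for each item i of the subset."""
    wins = [0] * len(theta)
    for _ in range(trials):
        wins[sample_winner(theta)] += 1
    return [w / trials for w in wins]

random.seed(0)
probs = win_probabilities([1.0, 0.5, 0.0])  # illustrative utilities
```

Any other noise distribution can be swapped in through the noise argument, which is exactly how the different RUM families below arise.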
In this paper, we assume that for each item i, its random utility score is of the form X_i = θ_i + ζ_i, where the ζ_i are ‘noise’ random variables drawn independently from a common probability distribution D.
A widely used RUM is the Multinomial-Logit (MNL) or Plackett-Luce (PL) model, where the ζ_i are taken to be independent Gumbel distributions with location parameter 0 and scale parameter 1 (Azari et al., 2012), which results in score distributions X_i ∼ Gumbel(θ_i, 1), i ∈ [n]. Moreover, it can be shown that the probability that an alternative i emerges as the winner in any set S is simply proportional to the exponential of its score parameter: Pr(i|S) = e^{θ_i} / Σ_{j ∈ S} e^{θ_j}.
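This Gumbel-max property can be checked numerically: winner frequencies under Gumbel noise should match the softmax of the score parameters. A sketch with illustrative parameters:

```python
import math
import random

def gumbel():
    """Standard Gumbel sample via the inverse CDF F(x) = exp(-exp(-x))."""
    return -math.log(-math.log(random.random()))

def gumbel_winner(theta):
    return max(range(len(theta)), key=lambda i: theta[i] + gumbel())

random.seed(1)
theta = [1.2, 0.4, 0.0]  # illustrative score parameters
trials = 200_000
wins = [0] * len(theta)
for _ in range(trials):
    wins[gumbel_winner(theta)] += 1
empirical = [w / trials for w in wins]

# closed-form Plackett-Luce winner probabilities (softmax of theta)
z = sum(math.exp(t) for t in theta)
softmax = [math.exp(t) / z for t in theta]
```

Up to Monte Carlo error, the empirical frequencies agree with the softmax probabilities, which is precisely the structure that makes Plackett-Luce inference tractable.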

Other families of discrete choice models can be obtained by imposing different probability distributions on the iid noise ζ_i; e.g.,

  1. Exponential noise: D is the Exponential(λ) distribution.

  2. Noise from extreme value distributions: D is an extreme value distribution. Many well-known distributions fall in this class, e.g., Fréchet, Weibull, Gumbel; for a particular setting of its parameters, it reduces to the Gumbel distribution.

  3. Uniform noise: D is the (continuous) Uniform(a, b) distribution.

  4. Gaussian noise: D is the Gaussian distribution N(0, σ²).

  5. Gamma noise: D is the Gamma(α, λ) distribution.

Other distributions can alternatively be used for modelling the noise distribution , depending on desired tail properties, domain-specific information, etc.

Finally, we denote a RUM choice model, comprised of an instance θ = (θ_1, …, θ_n) (with its implicit dependence on the noise distribution D) along with a playable subset size k ≤ n, by RUM(k, θ).

3 Problem Setting

We consider the probably approximately correct (PAC) version of the sequential decision-making problem of finding the best item in a set of n items by making only subset-wise comparisons.

Formally, the learner is given a finite set [n] of items or ‘arms’ (terminology borrowed from multi-armed bandits) along with a playable subset size k ≤ n. At each decision round t = 1, 2, …, the learner selects a subset S_t ⊆ [n] of k distinct items, and receives (stochastic) feedback depending on (a) the chosen subset S_t, and (b) a RUM(k, θ) choice model with parameters θ a priori unknown to the learner. The nature of the feedback can be of several types, as described in Section 3.1. For the purposes of analysis, we assume, without loss of generality (under the assumption that the learner’s decision rule does not contain any bias towards a specific item index), that θ_1 > θ_2 ≥ … ≥ θ_n for ease of exposition (the extension to the case where several items have the same highest parameter value is easily accomplished). We define a best item to be one with the highest score parameter, which under the assumptions above is item 1.

Remark 1.

Under the assumptions above, it follows that item 1 is the Condorcet winner Zoghi et al. (2014) of the pairwise preference model induced by RUM(k, θ).

3.1 Feedback models

We mean by ‘feedback model’ the information received (from the ‘environment’) once the learner plays a subset of items. Similar to different types of feedback models introduced earlier in the context of the specific Plackett-Luce RUM Saha and Gopalan (2019), we consider the following feedback mechanisms:

  • Winner of the selected subset (WI): The environment returns a single item W_t ∈ S_t, drawn independently from the probability distribution Pr(i|S_t), i ∈ S_t.

  • Full ranking of the selected subset of items (FR): The environment returns a full ranking σ_t of S_t, drawn from the probability distribution over rankings induced by the RUM. In fact, this is equivalent to picking σ_t(1) according to the winner feedback from S_t, then picking σ_t(2) from S_t \ {σ_t(1)} following the same feedback model, and so on, until all elements of S_t are exhausted – in other words, successively sampling winners from S_t according to the RUM(k, θ) model, without replacement.
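A single-realization view makes the FR (and, later, top-m) feedback easy to simulate: draw the utilities once, then sort. A sketch with hypothetical helper names, using Gaussian noise as an illustrative stand-in:

```python
import random

def sample_full_ranking(theta, noise=lambda: random.gauss(0.0, 1.0)):
    """FR feedback: draw X_i = theta_i + zeta_i once for every item of
    the subset, then order the items by decreasing realized utility."""
    scores = [t + noise() for t in theta]
    return sorted(range(len(theta)), key=lambda i: scores[i], reverse=True)

def top_m(theta, m, **kw):
    """Top-m feedback: the first m entries of the sampled full ranking;
    m = len(theta) recovers FR, m = 1 recovers WI."""
    return sample_full_ranking(theta, **kw)[:m]

random.seed(2)
sigma = sample_full_ranking([1.0, 0.5, 0.0, -0.5])  # a permutation of the subset
```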

3.2 PAC Performance Objective: Correctness and Sample Complexity

For a RUM(k, θ) instance with n arms, an arm i is said to be ϵ-optimal if its score parameter is within ϵ of the highest, i.e., θ_i ≥ θ_1 − ϵ. A sequential learning algorithm (by which we essentially mean a causal algorithm that makes present decisions using only past observed information; we omit the technical details of a precise definition) that depends on feedback from an appropriate subset-wise feedback model is said to be (ϵ, δ)-PAC, for given constants ϵ > 0 and δ ∈ (0, 1), if the following properties hold when it is run on any instance RUM(k, θ): (a) it stops and outputs an arm I after a finite number of decision rounds (subset plays) with probability 1, and (b) the probability that its output I is an ϵ-optimal arm is at least 1 − δ, i.e., Pr(I is ϵ-optimal) ≥ 1 − δ. Furthermore, by the sample complexity of the algorithm, we mean the expected number of decision rounds taken by the algorithm to stop when run on the instance RUM(k, θ).

4 Connecting Subsetwise preferences to Pairwise Scores

In this section, we introduce the key concept of Advantage ratio as a means to systematically relate subsetwise preference observations to pairwise scores in general RUMs.

Consider any set S ⊆ [n], and recall that the probability of item i winning in S is Pr(i|S), for all i ∈ S. For any two items i, j ∈ [n], let us denote their score gap by θ_i − θ_j. Let us also denote by f, F and F̄ the probability density function (we assume by default that all noise distributions have a density; the extension to more general noise distributions is left to future work), cumulative distribution function and complementary cumulative distribution function of the noise distribution D, respectively; thus F(x) = Pr(ζ ≤ x) for any x ∈ Support(D), and F̄(x) = 1 − F(x) for any x ∈ Support(D).

We now introduce and analyse the Advantage-Ratio (Def. 1); we will see in Sec. 5.1 how this quantity helps us derive an improved sample complexity guarantee for our (ϵ, δ)-PAC item identification problem.

Definition 1 (Advantage ratio and Minimum advantage ratio).

Given any subsetwise preference model defined on n items, we define the advantage ratio of item i over item j within a subset S containing both, as

AR(i, j, S) := Pr(i|S) / Pr(j|S).

Moreover, given a playable subset size k, we define the minimum advantage ratio, Min-AR_k(i, j), of item i over item j, as the least advantage ratio of i over j across size-k subsets of [n] containing both, i.e.,

Min-AR_k(i, j) := min over {S ⊆ [n] : |S| = k, i, j ∈ S} of AR(i, j, S).
The key intuition here is that when AR(i, j, S) does not equal 1, it serves as a distinctive measure for separating items i and j, irrespective of the context S. We specifically build on this intuition later in Sec. 5.1 to propose a new algorithm (Alg. 1) which finds the (ϵ, δ)-PAC best item by relying on this distinctive property of the best item (as described in Sec. 3).
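The advantage ratio within a fixed subset can be estimated empirically from winner counts. A Monte Carlo sketch (illustrative utilities; Gaussian noise as an example stand-in):

```python
import random

def advantage_ratio(theta, i, j, trials=200_000,
                    noise=lambda: random.gauss(0.0, 1.0)):
    """Estimate AR(i, j, S) = Pr(i | S) / Pr(j | S) by Monte Carlo,
    where S is the whole subset whose utilities are listed in theta."""
    wins = [0] * len(theta)
    for _ in range(trials):
        scores = [t + noise() for t in theta]
        wins[max(range(len(theta)), key=lambda a: scores[a])] += 1
    return wins[i] / max(wins[j], 1)  # guard against a zero count

random.seed(3)
ratio = advantage_ratio([0.6, 0.0, 0.0], 0, 1)  # illustrative utilities
```

A strictly stronger item shows a ratio strictly above 1, and it is this separation that the elimination algorithm of Section 5 exploits.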

The following result shows a variational lower bound, in terms of the noise distribution, for the minimum advantage ratio in a RUM  model with independent and identically distributed (iid) noise variables, that is often amenable to explicit calculation/bounding.

Figure 1: The geometrical interpretation behind Min-AR for a fixed item pair (i, j): the green shaded region corresponds to realizations where i beats j, the red shaded region to realizations where j beats i, and the white rectangle to the remaining realizations. Note how the shapes of the green and red regions vary as the reference point (blue dot) moves along the real line (X-axis).
Lemma 2 (Variational lower bound for the advantage ratio).

For any RUM(k, θ)-based subsetwise preference model and any item pair i, j (a technical assumption applies to the right hand side of Eqn. 3),


Moreover, for RUM(k, θ) models one can show a multiplicative relation over any triplet of items, which further lower bounds Min-AR by:

The proof of the result appears in Appendix A.1. Fig. 1 shows a geometrical interpretation of Min-AR, under the joint realization of a pair of noise values.

Remark 2.

Suppose θ_i > θ_j. It is sufficient to consider the domain of the variable in the right hand side of (3) to be just the (shifted) support of the noise distribution, as the proof of Lemma 2 brings out. However, for simplicity we use a smaller lower bound in Eqn. 3 and take the infimum over the whole real line.

We next derive Min-AR values for certain specific noise distributions:

Lemma 3 (Analysing Min-AR  for specific noise models).

Given a fixed item pair i, j such that θ_i > θ_j, the following bounds hold under the respective noise models in an iid RUM.

  1. Exponential(): Min-AR for Exponential noise with .

  2. Extreme value distribution: For Gumbel () noise, Min-AR.

  3. Uniform: Min-AR for Uniform noise ( and ).

  4. Gamma: Min-AR for Gamma noise.

  5. Weibull: Min-AR for .

  6. Normal : For small enough (in a neighborhood of ), Min-AR.

Proof is given in Appendix A.2.

5 An optimal algorithm for the winner feedback model

In this section, we propose an algorithm (Sequential-Pairwise-Battle, Algorithm 1) for the (ϵ, δ)-PAC objective with winner feedback. We then analyse its correctness and sample complexity guarantee (Theorem 4) for any noise distribution (under the mild assumption that its Min-AR is bounded away from 1). Following this, we also prove a matching lower bound for the problem, which shows that the sample complexity of Algorithm Sequential-Pairwise-Battle is unimprovable up to a logarithmic factor.

5.1 The Sequential-Pairwise-Battle  algorithm

Our algorithm is based on the simple idea of dividing the set of n items into sub-groups of size k, querying each sub-group ‘sufficiently enough’, retaining thereafter only the empirically ‘strongest’ item of each sub-group, and recursing on the remaining set of items until only one item remains.

More specifically, it starts by partitioning the initial item pool into mutually exclusive and exhaustive sets of size at most k each. Each set is then queried for a prescribed number of rounds, and only the ‘empirical winner’ of each group is retained in a set of survivors; the rest are discarded. The algorithm then recurses on the remaining set of surviving items, until a single item is left, which is then declared the PAC-best item. Algorithm 1 presents the pseudocode in more detail.

Key idea: The primary novelty here lies in how the algorithm identifies the ‘strongest’ item in each sub-group G: it maintains the pairwise preferences of every item pair in the sub-group and simply chooses the item that beats the rest of the items in the sub-group with a positive advantage (alternatively, the item that wins the maximum number of subset-wise plays). Our idea of maintaining pairwise preferences is motivated by a similar algorithm proposed in Saha and Gopalan (2019); however, their performance guarantee applies only to the very specific class of Plackett-Luce feedback models, whereas our current analysis reveals the power of maintaining pairwise estimates for the far more general RUM(k, θ) subsetwise model (which includes the Plackett-Luce choice model as a special case).

1:  Input:
2:      Set of items: , Subset size:
3:      Error bias: , Confidence parameter:
4:      Noise model dependent constant
5:  Initialize:
6:      , , and
7:      Divide into sets such that and , where
8:      If , then set and
9:  while  do
10:     Set ,
11:     for  do
12:        Play the set for rounds
13:         Number of times won in plays of ,
14:        Set and
15:     end for
17:     if  then
18:        Break (go out of the while loop)
19:     else if  then
20:         Randomly sample items from , and , ,
21:     else
22:        Divide into sets , such that , and , where
23:        If , then set and
24:     end if
25:  end while
26:  Output: The unique item left in
Algorithm 1 Sequential-Pairwise-Battle(Seq-PB)
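The pseudocode above can be rendered as a simplified runnable sketch in Python. The per-group round budget t, the constant 4, and the winner-selection rule by maximum win count are illustrative assumptions here, not the paper's exact prescription:

```python
import math
import random

def play_subset(theta_sub, noise=lambda: random.gauss(0.0, 1.0)):
    """One round of winner (WI) feedback from a subset: return the local
    index of the item with the highest realized utility."""
    scores = [t + noise() for t in theta_sub]
    return max(range(len(theta_sub)), key=lambda i: scores[i])

def seq_pb(theta, k, eps, delta, c):
    """Simplified Sequential-Pairwise-Battle: partition the surviving items
    into groups of size at most k, play each group for t rounds, keep the
    item winning the most plays in each group, and recurse."""
    items = list(range(len(theta)))
    while len(items) > 1:
        random.shuffle(items)
        groups = [items[i:i + k] for i in range(0, len(items), k)]
        # illustrative round budget, mirroring the O((1/(c*eps)^2) log(k/delta)) rate
        t = int(math.ceil((4.0 / (c * eps) ** 2) * math.log(2 * k / delta)))
        survivors = []
        for g in groups:
            wins = {i: 0 for i in g}
            for _ in range(t):
                winner = g[play_subset([theta[i] for i in g])]
                wins[winner] += 1
            survivors.append(max(wins, key=wins.get))
        items = survivors
    return items[0]

random.seed(4)
best = seq_pb([1.0, 0.2, 0.1, 0.0, -0.3, -0.5], k=3, eps=0.3, delta=0.1, c=0.5)
```

Note that the total number of plays stays linear in n, since each elimination level shrinks the pool by roughly a factor of k while the per-group budget stays fixed.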

The following is our chief result; it proves correctness and a sample complexity bound for Algorithm 1.

Theorem 4 (Sequential-Pairwise-Battle: Correctness and Sample Complexity).

Consider any iid subsetwise preference model RUM(k, θ) based on a noise distribution D, and suppose that for any item pair i, j, we have Min-AR_k(i, j) ≥ 1 + c(θ_i − θ_j) for some D-dependent constant c > 0. Then, Algorithm 1, with input constant c, is an (ϵ, δ)-PAC algorithm with sample complexity O((n/c²ϵ²) log(k/δ)).

The proof of the result appears in Appendix B.1.

Remark 3.

The linear dependence on the total number of items, n, in effect indicates the price to pay for learning the unknown model parameters that decide the subsetwise winning probabilities of the items. Remarkably, however, the theorem shows that the PAC sample complexity of the ϵ-best item identification problem, with only winner feedback information from k-size subsets, is independent of k. One might expect improved sample complexity when the number of items being simultaneously tested in each round is large; but, on the other side, the sample complexity could also worsen, since it is harder for a good item to win and show itself in a few draws against a large population of competitors – these effects roughly balance each other out, and the final sample complexity depends only on the total number of items n and the accuracy parameters ϵ, δ.

Note that Lemma 3 gives specific values of the noise-model-dependent constant c, from which we can derive specific sample complexity bounds for certain noise models:

Corollary 5 (Model specific correctness and sample complexity guarantees).

For the following representative noise distributions: Exponential, Gumbel, Gamma, Uniform, Weibull, and Normal, Seq-PB (Alg. 1) finds an (ϵ, δ)-PAC item with sample complexity O((n/ϵ²) log(k/δ)).

Proof sketch.

The proof follows from the general performance guarantee of Seq-PB (Thm. 4) and Lem. 3. More specifically, from Lem. 3 it follows that the value of c for each of these specific distributions is a constant, which concludes the claim. For completeness, the distribution-specific values of c are given in Appendix B.2. ∎

5.2 Sample Complexity Lower Bound

In this section we derive a sample complexity lower bound for any (ϵ, δ)-PAC algorithm on any RUM(k, θ) model with Min-AR strictly bounded away from 1. Our formal claim goes as follows:

Theorem 6 (Sample Complexity Lower Bound for the RUM(k, θ) model).

Given ϵ, δ ∈ (0, 1), and an (ϵ, δ)-PAC algorithm A with winner item feedback, there exists a RUM(k, θ) instance ν with Min-AR_k(i, j) ≥ 1 + c(θ_i − θ_j) for all item pairs i, j, on which the expected sample complexity of A is at least Ω((n/c²ϵ²) log(1/δ)).

The proof is given in Appendix B.3. It essentially involves a change-of-measure argument demonstrating a family of Plackett-Luce models (iid Gumbel noise), with the appropriate value of c, that cannot easily be teased apart by any learning algorithm.

Comparing this result with the performance guarantee of our proposed algorithm (Theorem 4) shows that the sample complexity of the algorithm is order-wise optimal (up to a logarithmic factor). Moreover, this result also shows that the IIA (independence of irrelevant alternatives) property of the Plackett-Luce choice model is not essential for exploiting pairwise preferences via rank breaking, as was claimed in Saha and Gopalan (2019). Indeed, except for the case of Gumbel noise, none of the RUM-based models in Corollary 5 satisfies IIA, yet they all enjoy the (ϵ, δ)-PAC sample complexity guarantee.

Remark 4.

For constant c, the fundamental sample complexity bound of Theorem 6 resembles that of PAC best-arm identification in the standard multi-armed bandit (MAB) problem Even-Dar et al. (2006). Recall that our problem objective is exactly the same as in MAB; however, our feedback model is very different: in MAB, the learner gets to see noisy rewards/scores (i.e., the realized values X_i, which can be seen as noisy feedback on the true score θ_i of item i), whereas here the learner only sees k-wise relative preference feedback based on the underlying realized values, which is a more indirect way of revealing the item scores. Thus, intuitively, our problem is at least as hard as the MAB setup.

6 Results for Top-m Ranking (TR) feedback model

We now address our (ϵ, δ)-PAC item identification problem for the case of the more general top-m rank ordered feedback in the RUM(k, θ) model, which generalises both the winner-item (WI) and full-ranking (FR) feedback models.

Top-m ranking of items (TR-m): In this feedback setting, the environment is assumed to return a ranking of only m items from among S, i.e., the environment first draws a full ranking σ over S according to RUM(k, θ) as in FR above, and returns the first m ranked elements of σ, i.e., (σ(1), …, σ(m)). We denote by Σ_m(S) the set of all possible m-length rankings of items in S; it is easy to note that |Σ_m(S)| = k!/(k−m)!. Generating such a top-m ranking is equivalent to successively sampling m winners from S according to the RUM(k, θ) model, without replacement. It follows that TR reduces to FR when m = k and to WI when m = 1. Note that the idea of top-m ranking feedback was introduced by Saha and Gopalan (2018b), but only for the specific Plackett-Luce choice model.

6.1 Algorithm for top-m ranking feedback

In this section, we extend the algorithm proposed earlier (Alg. 1) to handle feedback from the general top-m ranking feedback model. Based on the performance analysis of our algorithm (Thm. 7), we show that top-m ranking feedback yields an m-factor improvement in the sample complexity rate. We also give a lower bound analysis under this general feedback model (Thm. 8), showing the fundamental performance limit of the problem; the derived lower bound shows optimality of our proposed algorithm mSeq-PB up to logarithmic factors.

Main idea: Like Seq-PB, the algorithm proposed in this section (Alg. 2) in principle follows the same sequential-elimination-based strategy to find the near-best item of the RUM(k, θ) model based on pairwise preferences. However, we use the idea of rank breaking (Soufiani et al., 2014; Saha and Gopalan, 2018b) to extract the pairwise preferences: formally, given any set S of size k, if σ denotes a possible top-m ranking of S, then the Rank-Breaking subroutine considers each item in σ to be beaten by all of its preceding items in σ, in a pairwise sense. For instance, given a full ranking of a set of three elements, say a ≻ b ≻ c, Rank-Breaking generates the set of pairwise comparisons (a ≻ b), (a ≻ c), (b ≻ c).

As a whole, our new algorithm again divides the set of items into small groups of size k and plays each sub-group for a prescribed number of rounds. Inside any fixed subgroup G, after each round of play, it applies Rank-Breaking to the top-m ranking feedback σ to extract multiple pairwise comparisons, which are then used to estimate the empirical pairwise preferences for each pair of items in G. Based on these pairwise estimates, it retains only the strongest item of G and recurses the same procedure on the set of surviving items, until just one item is left. The complete algorithm is given in Alg. 2 (Appendix C.1).
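The Rank-Breaking subroutine can be sketched as follows (the function name is hypothetical, and the convention that ranked items also beat the unranked remainder of the subset is one plausible reading of top-m feedback):

```python
def rank_break(top_ranking, subset):
    """Decompose top-m ranking feedback into pairwise outcomes: each ranked
    item beats every item ranked after it, and also beats every unranked
    item of the subset (which failed to enter the top m)."""
    ranked = list(top_ranking)
    unranked = [x for x in subset if x not in ranked]
    pairs = []
    for pos, winner in enumerate(ranked):
        for loser in ranked[pos + 1:] + unranked:
            pairs.append((winner, loser))
    return pairs

# full ranking a > b > c of the subset {a, b, c}
full = rank_break(['a', 'b', 'c'], ['a', 'b', 'c'])
# top-1 ranking of the same subset
top1 = rank_break(['a'], ['a', 'b', 'c'])
```

Each round of top-m feedback thus yields several pairwise comparisons instead of one, which is the source of the m-factor saving in sample complexity.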

Theorem 7 analyses the correctness and sample complexity of mSeq-PB. Note that the sample complexity bound of mSeq-PB with top-m ranking (TR) feedback is 1/m times that of the WI model (Thm. 4). This is justified since, intuitively, revealing a ranking on m items in a k-set provides about m WI feedback observations per round, which essentially leads to the m-factor improvement in the sample complexity.

Theorem 7 (mSeq-PB (Alg. 2): Correctness and Sample Complexity).

Consider any RUM(k, θ) subsetwise preference model based on a noise distribution D, and suppose that for any item pair i, j, we have Min-AR_k(i, j) ≥ 1 + c(θ_i − θ_j) for some D-dependent constant c > 0. Then mSeq-PB (Alg. 2) with input constant c, run on the top-m ranking feedback model, is an (ϵ, δ)-PAC algorithm with sample complexity O((n/mc²ϵ²) log(k/δ)).

Proof is given in Appendix C.2.

Similar to Cor. 5, for the top-m model we can again derive specific sample complexity bounds for different noise distributions, e.g., Exponential, Gumbel, Gaussian, Uniform, Gamma, etc.

6.2 Lower Bound: Top-m ranking feedback

In this section, we analyze the fundamental sample complexity limit for any (ϵ, δ)-PAC algorithm on the RUM(k, θ) model with top-m ranking feedback.

Theorem 8 (Sample Complexity Lower Bound for the RUM(k, θ) model with TR-m feedback).

Given ϵ, δ ∈ (0, 1), and an (ϵ, δ)-PAC algorithm A with top-m ranking feedback, there exists a RUM(k, θ) instance ν in which Min-AR_k(i, j) ≥ 1 + c(θ_i − θ_j) for every pair i, j, on which the expected sample complexity of A has to be at least Ω((n/mc²ϵ²) log(1/δ)) for A to be (ϵ, δ)-PAC.

The proof is given in Appendix C.3.

Similar to the case of winner feedback, comparing Theorem 7 with the above result shows that the sample complexity of mSeq-PB is order-wise optimal (up to logarithmic factors) for the general case of top-m ranking feedback as well.

7 Conclusion and Future Directions

We have identified a new principle for learning with general subset-wise preference feedback in general iid RUMs – rank breaking followed by pairwise comparisons. This has been made possible by extending the concept of pairwise advantage from the popular Plackett-Luce choice model to much more general RUMs, and by showing that the IIA property that Plackett-Luce models enjoy is not essential for obtaining optimal sample complexity.

Our results suggest several interesting directions for future investigation, namely the possibility of considering correlated noise models (making the RUM more general), explicitly modeling the dependence of samples on item features or attributes, other performance objectives like regret for online utility optimization, and extension to learning with relative preferences in time-correlated settings like Markov Decision Processes.


  • Ailon et al. [2014] Nir Ailon, Zohar Shay Karnin, and Thorsten Joachims. Reducing dueling bandits to cardinal bandits. In ICML, volume 32, pages 856–864, 2014.
  • Audibert and Bubeck [2010] Jean-Yves Audibert and Sébastien Bubeck. Best arm identification in multi-armed bandits. In COLT - 23rd Conference on Learning Theory, 2010.
  • Azari et al. [2012] Hossein Azari, David Parkes, and Lirong Xia. Random utility theory for social choice. In Advances in Neural Information Processing Systems, pages 126–134, 2012.
  • Bliss [1934] Chester I Bliss. The method of probits. Science, 1934.
  • Busa-Fekete et al. [2013] Róbert Busa-Fekete, Balazs Szorenyi, Weiwei Cheng, Paul Weng, and Eyke Hüllermeier. Top-k selection based on adaptive sampling of noisy preferences. In International Conference on Machine Learning, pages 1094–1102, 2013.
  • Busa-Fekete et al. [2014a] Róbert Busa-Fekete, Eyke Hüllermeier, and Balázs Szörényi. Preference-based rank elicitation using statistical models: The case of mallows. In Proceedings of The 31st International Conference on Machine Learning, volume 32, 2014a.
  • Busa-Fekete et al. [2014b] Róbert Busa-Fekete, Balázs Szörényi, and Eyke Hüllermeier. Pac rank elicitation through adaptive sampling of stochastic pairwise preferences. In AAAI, pages 1701–1707, 2014b.
  • Chen et al. [2017] Xi Chen, Sivakanth Gopi, Jieming Mao, and Jon Schneider. Competitive analysis of the top-k ranking problem. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1245–1264. SIAM, 2017.
  • Chen et al. [2018] Xi Chen, Yuanzhi Li, and Jieming Mao. A nearly instance optimal algorithm for top-k ranking under the multinomial logit model. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2504–2522. SIAM, 2018.
  • Désir et al. [2016a] Antoine Désir, Vineet Goyal, Srikanth Jagabathula, and Danny Segev. Assortment optimization under the mallows model. In Advances in Neural Information Processing Systems, pages 4700–4708, 2016a.
  • Désir et al. [2016b] Antoine Désir, Vineet Goyal, Danny Segev, and Chun Ye. Capacity constrained assortment optimization under the markov chain based choice model. Operations Research, 2016b.
  • Even-Dar et al. [2006] Eyal Even-Dar, Shie Mannor, and Yishay Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7(Jun):1079–1105, 2006.
  • Falahatgar et al. [2017] Moein Falahatgar, Yi Hao, Alon Orlitsky, Venkatadheeraj Pichapati, and Vaishakh Ravindrakumar. Maxing and ranking with few assumptions. In Advances in Neural Information Processing Systems, pages 7063–7073, 2017.
  • Gajane et al. [2015] Pratik Gajane, Tanguy Urvoy, and Fabrice Clérot. A relative exponential weighing algorithm for adversarial utility-based dueling bandits. In Proceedings of the 32nd International Conference on Machine Learning, pages 218–227, 2015.
  • González et al. [2017] Javier González, Zhenwen Dai, Andreas Damianou, and Neil D. Lawrence. Preferential Bayesian optimization. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1282–1291. JMLR. org, 2017.
  • Jamieson et al. [2014] Kevin Jamieson, Matthew Malloy, Robert Nowak, and Sebastien Bubeck. lil’ ucb : An optimal exploration algorithm for multi-armed bandits. In Maria Florina Balcan, Vitaly Feldman, and Csaba Szepesvari, editors, Proceedings of The 27th Conference on Learning Theory, volume 35 of Proceedings of Machine Learning Research, pages 423–439. PMLR, 2014.
  • Jang et al. [2017] Minje Jang, Sunghyun Kim, Changho Suh, and Sewoong Oh. Optimal sample complexity of m-wise data for top-k ranking. In Advances in Neural Information Processing Systems, pages 1685–1695, 2017.
  • Kalyanakrishnan et al. [2012] Shivaram Kalyanakrishnan, Ambuj Tewari, Peter Auer, and Peter Stone. Pac subset selection in stochastic multi-armed bandits. In ICML, volume 12, pages 655–662, 2012.
  • Karnin et al. [2013] Zohar Karnin, Tomer Koren, and Oren Somekh. Almost optimal exploration in multi-armed bandits. In International Conference on Machine Learning, pages 1238–1246, 2013.
  • Kaufmann et al. [2016] Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. On the complexity of best-arm identification in multi-armed bandit models. The Journal of Machine Learning Research, 17(1):1–42, 2016.
  • Khetan and Oh [2016] Ashish Khetan and Sewoong Oh. Data-driven rank breaking for efficient rank aggregation. Journal of Machine Learning Research, 17(193):1–54, 2016.
  • Luce [2012] R Duncan Luce. Individual choice behavior: A theoretical analysis. Courier Corporation, 2012.
  • Mohajer et al. [2017] Soheil Mohajer, Changho Suh, and Adel Elmahdy. Active learning for top-k rank aggregation from noisy comparisons. In International Conference on Machine Learning, pages 2488–2497, 2017.
  • Nip et al. [2017] Kameng Nip, Zhenbo Wang, and Zizhuo Wang. Assortment optimization under a single transition model. 2017.
  • Plackett [1975] Robin L Plackett. The analysis of permutations. Journal of the Royal Statistical Society: Series C (Applied Statistics), 24(2):193–202, 1975.
  • Popescu et al. [2016] Pantelimon G Popescu, Silvestru Dragomir, Emil I Slusanschi, and Octavian N Stanasila. Bounds for Kullback-Leibler divergence. Electronic Journal of Differential Equations, 2016.
  • Ren et al. [2018] Wenbo Ren, Jia Liu, and Ness B Shroff. Pac ranking from pairwise and listwise queries: Lower bounds and upper bounds. arXiv preprint arXiv:1806.02970, 2018.
  • Saha and Gopalan [2018a] Aadirupa Saha and Aditya Gopalan. Battle of bandits. In Uncertainty in Artificial Intelligence, 2018a.
  • Saha and Gopalan [2018b] Aadirupa Saha and Aditya Gopalan. Active ranking with subset-wise preferences. International Conference on Artificial Intelligence and Statistics (AISTATS), 2018b.
  • Saha and Gopalan [2019] Aadirupa Saha and Aditya Gopalan. PAC Battling Bandits in the Plackett-Luce Model. In Algorithmic Learning Theory, pages 700–737, 2019.
  • Soufiani et al. [2013] Hossein Azari Soufiani, Hansheng Diao, Zhenyu Lai, and David C Parkes. Generalized random utility models with multiple types. In Advances in Neural Information Processing Systems, pages 73–81, 2013.
  • Soufiani et al. [2014] Hossein Azari Soufiani, David C Parkes, and Lirong Xia. Computing parametric ranking models via rank-breaking. In ICML, pages 360–368, 2014.
  • Sui et al. [2017] Yanan Sui, Vincent Zhuang, Joel W Burdick, and Yisong Yue. Multi-dueling bandits with dependent arms. arXiv preprint arXiv:1705.00253, 2017.
  • Szörényi et al. [2015] Balázs Szörényi, Róbert Busa-Fekete, Adil Paul, and Eyke Hüllermeier. Online rank elicitation for plackett-luce: A dueling bandits approach. In Advances in Neural Information Processing Systems, pages 604–612, 2015.
  • Talluri and Van Ryzin [2004] Kalyan Talluri and Garrett Van Ryzin. Revenue management under a general discrete choice model of consumer behavior. Management Science, 50(1):15–33, 2004.
  • Thurstone [1927] Louis L Thurstone. A law of comparative judgment. Psychological review, 34(4):273, 1927.
  • Urvoy et al. [2013] Tanguy Urvoy, Fabrice Clerot, Raphael Féraud, and Sami Naamane. Generic exploration and k-armed voting bandits. In International Conference on Machine Learning, pages 91–99, 2013.
  • Yue and Joachims [2009] Yisong Yue and Thorsten Joachims. Interactively optimizing information retrieval systems as a dueling bandits problem. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1201–1208. ACM, 2009.
  • Yue and Joachims [2011] Yisong Yue and Thorsten Joachims. Beat the mean bandit. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 241–248, 2011.
  • Zhao et al. [2018] Zhibing Zhao, Tristan Villamil, and Lirong Xia. Learning mixtures of random utility models. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • Zoghi et al. [2013] Masrour Zoghi, Shimon Whiteson, Remi Munos, and Maarten de Rijke. Relative upper confidence bound for the k-armed dueling bandit problem. arXiv preprint arXiv:1312.3393, 2013.
  • Zoghi et al. [2014] Masrour Zoghi, Shimon Whiteson, Remi Munos, Maarten de Rijke, et al. Relative upper confidence bound for the k-armed dueling bandit problem. In JMLR Workshop and Conference Proceedings, number 32, pages 10–18. JMLR, 2014.
  • Zoghi et al. [2015] Masrour Zoghi, Shimon Whiteson, and Maarten de Rijke. Mergerucb: A method for large-scale online ranker evaluation. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, pages 17–26. ACM, 2015.

Supplementary for Best-item Learning in Random Utility Models with Subset Choices

Appendix A Appendix for Section 4

A.1 Proof of Lemma 2

(Restatement of Lemma 2.)


Let us fix any subset and consider two items in it such that the first has the higher utility parameter. Let us define a random variable that denotes the maximum score value taken by the rest of the items in the set, and note its support.

Let us also denote . We have:

Let us now introduce a random variable. Owing to the independent and identically distributed noise assumption of the RUM model, we can further show that:

which proves the claim. ∎

A.2 Proof of Lemma 3

(Restatement of Lemma 3.)


We can derive the Min-AR values for the following distributions by simply applying the lower bound formula stated in Thm. 2, along with the specific density function stated below for each distribution:

1. Exponential noise:

When the noise distribution is Exponential, i.e. Exponential note that: , , and support.

2. Gumbel noise:

When the noise distribution is Gumbel, i.e. Gumbel note that: , , and support.

3. Uniform noise:

When the noise distribution is Uniform, i.e. Uniform note that: , , and support.

4. Gamma noise:

When the noise distribution is Gamma, with and , i.e. Gamma note that: , , and support.

5. Weibull noise:

When the noise distribution is Weibull, with , i.e. Weibull note that: , , and support.

6. Gaussian noise:

Note that the Gaussian distribution does not have a closed-form CDF and is difficult to handle analytically in general, so we propose a different line of analysis specifically for the Gaussian noise case: take the noise distribution to be standard normal, i.e.,

, with density . When and with , we find a lower bound on

First, note that by translation, we can take and without loss of generality. Doing so allows us to write

and likewise (taking ),

With this notation, we wish to minimize the ratio
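Although the analysis above proceeds via a ratio minimization, the pairwise win probability itself does admit a closed form in the Gaussian case, which gives a quick numerical sanity check (a sketch in our own notation, not part of the proof): with standard normal noise, the difference of the two noise terms is N(0, 2), so item i beats item j with probability Φ(Δ/√2), where Δ is the utility gap.

```python
import math
import random

def gaussian_win_prob(delta):
    """P(theta_i + eps_i > theta_j + eps_j) for eps iid N(0, 1).

    The difference eps_j - eps_i is N(0, 2), so the probability is
    Phi(delta / sqrt(2)) = 0.5 * (1 + erf(delta / 2))."""
    return 0.5 * (1.0 + math.erf(delta / 2.0))

# Monte Carlo cross-check for an (arbitrary) utility gap of 0.8.
random.seed(1)
trials = 200_000
emp = sum(random.gauss(0.8, 1) > random.gauss(0.0, 1)
          for _ in range(trials)) / trials
print(gaussian_win_prob(0.8), emp)  # the two agree to about two decimals
```

The closed form makes the sensitivity of the Gaussian pairwise advantage to the parameter gap Δ directly computable, even though the CDF Φ itself has no elementary antiderivative.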