1 Introduction
Random utility models (RUMs) are a popular and well-established framework for studying behavioral choices by individuals and groups (Thurstone, 1927). In a RUM with finitely many alternatives or items, a distribution on the preferred alternative(s) is assumed to arise from a random utility drawn from a distribution for each item, followed by rank ordering the items according to their utilities.
Perhaps the most widely known RUM is the Plackett-Luce or multinomial logit model (Plackett, 1975; Luce, 2012), which results when each item's utility is sampled from an additive model with a Gumbel-distributed perturbation. It is unique in the sense of enjoying the property of independence of irrelevant attributes (IIA), which is often key in permitting efficient inference of Plackett-Luce models from data (Khetan and Oh, 2016). Other well-known RUMs include the probit model (Bliss, 1934) featuring random Gaussian perturbations to the intrinsic utilities, mixed logit, nested logit, etc.

A long line of work in statistics and machine learning focuses on estimating RUM properties from observed data (Soufiani et al., 2014; Zhao et al., 2018; Soufiani et al., 2013). Online learning or adaptive testing, on the other hand, has shown efficient ways of identifying the most attractive (i.e., highest-utility) items in RUMs by learning from relative feedback on item pairs or, more generally, subsets (Szörényi et al., 2015; Saha and Gopalan, 2019; Jang et al., 2017). However, almost all existing work in this vein exclusively employs the Plackett-Luce model, arguably due to its very useful IIA property, and our understanding of learning performance in other, more general RUMs has been lacking. We take a step in this direction by framing the problem of sequentially learning the best item(s) in general RUMs by adaptively testing item subsets and observing relative RUM feedback. In the process, we uncover new structural properties of RUMs, including models with exponential, uniform and Gaussian (probit) utility distributions, and give algorithmic principles to exploit this structure, which permit provably sample-efficient online learning and allow us to go beyond Plackett-Luce.

Our contributions: We introduce a new property of a RUM, called the (pairwise) advantage ratio
, which essentially measures the worst-case relative winning probabilities of an item pair across all possible contexts (subsets) in which they occur together. We show that this ratio can be controlled (bounded below) as an affine function of the relative strengths of item pairs for RUMs based on several common centered utility distributions, e.g., exponential, Gumbel, uniform, Gamma, Weibull, normal, etc., even when the resulting RUM does not possess analytically favorable properties such as IIA.
We give an algorithm for sequentially and adaptively PAC (probably approximately correct) learning the best item from among a finite pool when, in each decision round, a subset of fixed size can be tested and top-m rank ordered feedback from the RUM can be observed. The algorithm is based on the idea of maintaining pairwise win/loss counts among items, hierarchically testing subsets and propagating the surviving winners – principles that have been shown to work optimally in the more structured Plackett-Luce RUM (Szörényi et al., 2015; Saha and Gopalan, 2019).
In terms of performance guarantees, we derive a PAC sample complexity bound for our algorithm: when working with a pool of n items in total and subsets of size k chosen in each decision round, the algorithm terminates in O((n/(c²ε²)) log(k/δ)) rounds (ε, δ being the PAC accuracy parameters), where c is a lower bound on the advantage ratio's sensitivity to intrinsic item utilities. This can in turn be shown to be a property of only the RUM's perturbation distribution, independent of the subset size k. A novel feature of the guarantee is that, unlike existing sample complexity results for sequential testing in the Plackett-Luce model, it does not rely on specific properties like IIA which are not present in general RUMs. We also extend the result to cover top-m rank ordered feedback, of which winner feedback (m = 1) is a special case. Finally, we show that the sample complexity of our algorithm is order-wise optimal across RUMs having a given advantage ratio sensitivity c, by arguing an information-theoretic lower bound on the sample complexity of any online learning algorithm.
Our results and techniques represent a conceptual advance in the problem of online learning in general RUMs, moving beyond the Plackett-Luce model for the first time, to the best of our knowledge.
Related Work: For the classical multi-armed bandit setting, there is a well-studied literature on the PAC arm-identification problem (Even-Dar et al., 2006; Audibert and Bubeck, 2010; Kalyanakrishnan et al., 2012; Karnin et al., 2013; Jamieson et al., 2014), where the learner gets to see a noisy draw of absolute reward feedback upon playing a single arm per round. On the contrary, learning to identify the best item(s) with only relative preference information (ordinal as opposed to cardinal feedback) has seen steady progress since the introduction of the dueling bandit framework (Zoghi et al., 2013), where pairs of items (size-2 subsets) can be played, and subsequent work on generalisations to broader models, both in terms of distributional parameters (Yue and Joachims, 2009; Gajane et al., 2015; Ailon et al., 2014; Zoghi et al., 2015) as well as combinatorial subsetwise plays (Mohajer et al., 2017; González et al., 2017; Saha and Gopalan, 2018a; Sui et al., 2017). There have been several developments on the PAC objective for different pairwise preference models, such as those satisfying stochastic triangle inequalities and strong stochastic transitivity (Yue and Joachims, 2011), general utility-based preference models (Urvoy et al., 2013), the Plackett-Luce model (Szörényi et al., 2015) and the Mallows model (Busa-Fekete et al., 2014a). Recent work has studied PAC-learning objectives other than identifying the single (near-)best arm, e.g., recovering a few of the top arms (Busa-Fekete et al., 2013; Mohajer et al., 2017), or the true ranking of the items (Busa-Fekete et al., 2014b; Falahatgar et al., 2017). Some recent works have also extended the PAC-learning objective to relative subsetwise preferences (Saha and Gopalan, 2018b; Chen et al., 2017, 2018; Saha and Gopalan, 2019; Ren et al., 2018).
However, none of the existing work considers strategies to learn efficiently in general RUMs with subsetwise preferences, and to the best of our knowledge we are the first to address this general problem setup. In a different direction, there has been work on batch (non-adaptive) estimation in general RUMs, e.g., Zhao et al. (2018); Soufiani et al. (2013); however, this does not consider the price of active learning and the associated exploration effort required, as we study here. A related body of literature lies in dynamic assortment selection, where the goal is to offer a subset of items to customers in order to maximise expected revenue, which has been studied under different choice models, e.g., Multinomial-Logit (Talluri and Van Ryzin, 2004), Mallows and mixtures of Mallows (Désir et al., 2016a), Markov chain-based choice models (Désir et al., 2016b), the single transition model (Nip et al., 2017), etc. However, each of these works addresses a given and very specific kind of choice model, and their objective is more suited to a regret minimisation type framework where playing every item comes with an associated cost.

Organization: We give the necessary preliminaries and our general RUM based problem setup in Section 2. The formal description of our feedback models and the details of the best arm identification problem are given in Section 3. In Section 4, we analyse the pairwise preferences of item pairs for our general RUM based subset choice model and introduce the notion of Advantage-Ratio connecting subsetwise scores to pairwise preferences. Our proposed algorithm, along with its performance guarantee and a matching lower bound analysis, is given in Section 5. We further extend the above results to a more general top-m ranking feedback model in Section 6. Section 7 finally concludes our work with certain future directions. All proofs are deferred to the appendix.
2 Preliminaries
Notation. We denote by [n] the set {1, 2, …, n}. For any subset S ⊆ [n], let |S| denote the cardinality of S. When there is no confusion about the context, we often represent (an unordered) subset S as a vector, or ordered subset, of size |S| (according to, say, a fixed global ordering of all the items [n]). In this case, S(i) denotes the item (member) at the i-th position in subset S. Σ_S denotes the set of permutations over the items of S, where for any permutation σ ∈ Σ_S, σ(i) denotes the element at the i-th position in σ. 𝟏(φ) is generically used to denote an indicator variable that takes the value 1 if the predicate φ is true, and 0 otherwise. x ∨ y denotes the maximum of x and y, and Pr(A) is used to denote the probability of event A, in a probability space that is clear from the context.

2.1 Random Utility-based Discrete Choice Models
A discrete choice model specifies the relative preferences of two or more discrete alternatives in a given set. Random Utility Models (RUMs) are a widely-studied class of discrete choice models; they assume a (non-random) ground-truth utility score θ_i for each alternative i ∈ [n], and assign a distribution D_i for the random score X_i of item i, with E[X_i] = θ_i. To model a winning alternative given any set S ⊆ [n], one first draws a random utility score X_i ~ D_i for each alternative in S, and selects the item with the highest realized score. More formally, the probability that an item i emerges as the winner in set S is given by:

P(i | S) = Pr(X_i > X_j, ∀ j ∈ S \ {i}).   (1)
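As a concrete illustration of Eqn. (1), the winner distribution of a RUM can be simulated directly: perturb each ground-truth score by an independent noise draw and report the argmax. The utilities and the choice of Gumbel noise below are purely illustrative, not prescribed by the model.

```python
import numpy as np

rng = np.random.default_rng(0)

def rum_winner(theta, noise_sampler):
    """One WI-feedback draw: perturb each utility by iid noise, return the argmax."""
    scores = theta + noise_sampler(len(theta))
    return int(np.argmax(scores))

theta = np.array([0.5, 0.2, 0.0])               # hypothetical scores for a 3-item set
gumbel = lambda n: rng.gumbel(size=n)           # one possible noise distribution D
draws = [rum_winner(theta, gumbel) for _ in range(20000)]
print(np.bincount(draws, minlength=3) / 20000)  # empirical estimate of P(i | S)
```

Repeating the draw many times, as above, recovers the winning probabilities P(i | S) empirically.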
In this paper, we assume that for each item i, its random utility score is of the form X_i = θ_i + ζ_i, where the ζ_i are 'noise' random variables drawn independently from a common probability distribution D.

A widely used RUM is the Multinomial-Logit (MNL) or Plackett-Luce (PL) model, where the X_i are taken to be independent Gumbel distributions with location parameters θ_i and a common scale parameter (Azari et al., 2012), which results in score distributions X_i ~ Gumbel(θ_i, 1), i ∈ [n]. Moreover, it can be shown that the probability that an alternative emerges as the winner in any set S is simply proportional to its (exponentiated) score parameter: P(i | S) = e^{θ_i} / Σ_{j ∈ S} e^{θ_j}.
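A quick Monte Carlo check of this closed form (a sketch with made-up score parameters): under iid Gumbel noise, the empirical winner frequencies should match the softmax of the scores.

```python
import numpy as np

rng = np.random.default_rng(1)
theta = np.array([1.0, 0.0, -0.5])                   # hypothetical PL parameters
T = 200_000
winners = np.argmax(theta + rng.gumbel(size=(T, 3)), axis=1)
empirical = np.bincount(winners, minlength=3) / T
softmax = np.exp(theta) / np.exp(theta).sum()
print(np.round(empirical, 3), np.round(softmax, 3))  # the two should agree closely
```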
Other families of discrete choice models can be obtained by imposing different probability distributions over the iid noise terms ζ_i; e.g.,

Exponential noise: D is the Exponential(λ) distribution (λ > 0).

Noise from extreme value distributions: D is a generalized extreme-value distribution GEV(μ, σ, ξ). Many well-known distributions fall in this class, e.g., Fréchet, Weibull, Gumbel. For instance, when the shape parameter ξ = 0, this reduces to the Gumbel distribution.

Uniform noise: D is the (continuous) Uniform(a, b) distribution (a < b).

Gaussian noise: D is the Gaussian distribution N(μ, σ²).

Gamma noise: D is the Gamma(α, λ) distribution (α, λ > 0).
Other distributions can alternatively be used for modelling the noise distribution D, depending on desired tail properties, domain-specific information, etc.
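To make the list concrete, here is a sketch of samplers for the noise families above (the parameter values are illustrative choices, not prescribed by the model); any of them can be plugged into the winner-sampling recipe of Eqn. (1).

```python
import numpy as np

rng = np.random.default_rng(2)
noise_families = {                                   # illustrative parameterisations
    "exponential": lambda n: rng.exponential(1.0, n),
    "gumbel":      lambda n: rng.gumbel(0.0, 1.0, n),
    "uniform":     lambda n: rng.uniform(-1.0, 1.0, n),
    "gaussian":    lambda n: rng.normal(0.0, 1.0, n),
    "gamma":       lambda n: rng.gamma(2.0, 1.0, n),
}
theta = np.array([0.4, 0.1, 0.0])                    # hypothetical utilities
for name, sample in noise_families.items():
    noise = np.stack([sample(3) for _ in range(20000)])
    winners = np.argmax(theta + noise, axis=1)
    print(name, np.bincount(winners, minlength=3) / 20000)
```

Under every family, the highest-utility item wins most often; what changes across families is how sharply the winning probabilities separate.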
Finally, we denote a RUM choice model, comprising an instance θ = (θ_1, …, θ_n) (with its implicit dependence on the noise distribution D) along with a playable subset size k ≤ n, by RUM(k, θ).
3 Problem Setting
We consider the probably approximately correct (PAC) version of the sequential decision-making problem of finding the best item in a set of n items by making only subsetwise comparisons.
Formally, the learner is given a finite set [n] of n items or 'arms'^{1} (^{1}terminology borrowed from multi-armed bandits), along with a playable subset size k ≤ n. At each decision round t = 1, 2, …, the learner selects a subset S_t ⊆ [n] of k distinct items, and receives (stochastic) feedback depending on (a) the chosen subset S_t, and (b) a RUM(k, θ) choice model with parameters θ a priori unknown to the learner. The nature of the feedback can be of several types, as described in Section 3.1. For the purposes of analysis, we assume, without loss of generality^{2} (^{2}under the assumption that the learner's decision rule does not contain any bias towards a specific item index), that θ_1 > θ_i for all i ≠ 1, for ease of exposition^{3} (^{3}the extension to the case where several items have the same highest parameter value is easily accomplished). We define a best item to be one with the highest score parameter: i* := argmax_{i ∈ [n]} θ_i = 1, under the assumptions above.
Remark 1.
Under the assumptions above, it follows that item 1 is the Condorcet winner (Zoghi et al., 2014) of the underlying pairwise preference model induced by RUM(k, θ).
3.1 Feedback models
By ‘feedback model’ we mean the information received (from the ‘environment’) once the learner plays a subset S of items. Similar to the feedback models introduced earlier in the context of the specific Plackett-Luce RUM (Saha and Gopalan, 2019), we consider the following feedback mechanisms:

Winner of the selected subset (WI): The environment returns a single item I ∈ S, drawn independently from the probability distribution P(I = i | S) = Pr(X_i > X_j, ∀ j ∈ S \ {i}), i ∈ S, as in Eqn. (1).

Full ranking of the selected subset of items (FR): The environment returns a full ranking σ ∈ Σ_S, drawn from the probability distribution P(σ | S) = Pr(X_{σ(1)} > X_{σ(2)} > … > X_{σ(|S|)}), σ ∈ Σ_S. In fact, this is equivalent to picking σ(1) according to the winner feedback from S, then picking σ(2) according to the winner feedback from S \ {σ(1)}, and so on, until all elements of S are exhausted – in other words, successively sampling winners from S according to the RUM model, without replacement.
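FR feedback is straightforward to simulate under the utility formulation: draw all scores once and sort. A minimal sketch (the item utilities are made up):

```python
import numpy as np

rng = np.random.default_rng(3)

def rum_full_ranking(theta, noise_sampler):
    """FR feedback: rank the subset's items by their realized (perturbed) utilities."""
    scores = theta + noise_sampler(len(theta))
    return list(np.argsort(-scores))        # item indices, best first

theta = np.array([0.9, 0.4, 0.1, 0.0])      # hypothetical utilities
sigma = rum_full_ranking(theta, lambda n: rng.gumbel(size=n))
print(sigma)                                # one full ranking; a prefix of it gives TR feedback
```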
3.2 PAC Performance Objective: Correctness and Sample Complexity
For a RUM(k, θ) instance with n arms, an arm i ∈ [n] is said to be ε-optimal if it beats the best arm 1 in a pairwise duel with probability at least 1/2 − ε. A sequential^{4} (^{4}we essentially mean a causal algorithm that makes present decisions using only past observed information at each time; the technical details for defining this precisely are omitted) learning algorithm that depends on feedback from an appropriate subsetwise feedback model is said to be (ε, δ)-PAC, for given constants ε > 0 and δ ∈ (0, 1), if the following properties hold when it is run on any instance RUM(k, θ): (a) it stops and outputs an arm I ∈ [n] after a finite number of decision rounds (subset plays) with probability 1, and (b) the probability that its output I is an ε-optimal arm in RUM(k, θ) is at least 1 − δ. Furthermore, by the sample complexity of the algorithm, we mean the expected time (number of decision rounds) taken by the algorithm to stop when run on the instance RUM(k, θ).
4 Connecting Subsetwise Preferences to Pairwise Scores
In this section, we introduce the key concept of the advantage ratio as a means to systematically relate subsetwise preference observations to pairwise scores in general RUMs.
Consider any set S ⊆ [n], and recall that the probability of item i winning in S is P(i | S), for all i ∈ S. For any two items i, j ∈ [n], let us denote the pairwise preference p_{ij} := P(i | {i, j}). Let us also denote by f, F and F̄ the probability density function^{5} (^{5}we assume by default that all noise distributions have a density; the extension to more general noise distributions is left to future work), cumulative distribution function and complementary cumulative distribution function of the noise distribution D, respectively; thus, F(x) = Pr(ζ ≤ x) for any x ∈ Support(D) and F̄(x) = 1 − F(x) for any x ∈ Support(D).

We now introduce and analyse the Advantage-Ratio (Def. 1); we will see in Sec. 5.1 how this quantity helps us derive an improved sample complexity guarantee for our PAC item identification problem.
Definition 1 (Advantage ratio and Minimum advantage ratio).
Given any subsetwise preference model defined on n items, we define the advantage ratio of item i over item j within a subset S containing both, as

Δ_S(i, j) := P(i | S) / P(j | S).

Moreover, given a playable subset size k, we define the minimum advantage ratio, MinAR_k(i, j), of item i over j, as the least advantage ratio of i over j across size-k subsets of [n], i.e.,

MinAR_k(i, j) := min_{S ⊆ [n] : i, j ∈ S, |S| = k} Δ_S(i, j).   (2)
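Definition 1 can be estimated numerically by brute force: Monte-Carlo-estimate P(i | S) for every size-k subset containing both items and take the minimum ratio. The instance below is hypothetical; Gumbel noise is used so the answer has a known closed form (for Plackett-Luce, Δ_S(i, j) = e^{θ_i − θ_j} for every S).

```python
import itertools

import numpy as np

rng = np.random.default_rng(4)

def win_prob(theta_S, idx, noise_sampler, T=40_000):
    """Monte Carlo estimate of P(S(idx) | S) for a RUM with the given noise."""
    scores = theta_S + noise_sampler((T, len(theta_S)))
    return np.mean(np.argmax(scores, axis=1) == idx)

def min_advantage_ratio(theta, i, j, k, noise_sampler):
    """MinAR_k(i, j): least P(i|S)/P(j|S) over all size-k subsets S containing i and j."""
    others = [l for l in range(len(theta)) if l not in (i, j)]
    ratios = []
    for rest in itertools.combinations(others, k - 2):
        theta_S = theta[[i, j, *rest]]
        ratios.append(win_prob(theta_S, 0, noise_sampler) / win_prob(theta_S, 1, noise_sampler))
    return min(ratios)

theta = np.array([0.8, 0.5, 0.2, 0.0, -0.3])   # hypothetical 5-item instance
gumbel = lambda shape: rng.gumbel(size=shape)
print(min_advantage_ratio(theta, 0, 1, k=3, noise_sampler=gumbel))  # ≈ exp(0.8 - 0.5) ≈ 1.35
```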
The key intuition here is that when Δ_S(i, j) does not equal 1, it serves as a distinctive measure for telling items i and j apart, irrespective of the context S. We specifically build on this intuition later in Sec. 5.1 to propose a new algorithm (Alg. 1) which finds the PAC best item by relying on this distinctive property of the best item (as described in Sec. 3).
The following result shows a variational lower bound, in terms of the noise distribution, for the minimum advantage ratio in a RUM model with independent and identically distributed (iid) noise variables, that is often amenable to explicit calculation/bounding.
Lemma 2 (Variational lower bound for the advantage ratio).
For any RUM(k, θ) based subsetwise preference model and any item pair i, j ∈ [n],^{6} (^{6}we take 0/0 := 1 on the right hand side of Eqn. 3)

MinAR_k(i, j) ≥ inf_{x ∈ Support(D)} [f(x − θ_i) F(x − θ_j)] / [f(x − θ_j) F(x − θ_i)].   (3)

Moreover, for RUM models one can show that Δ_S(i, j) = Δ_S(i, l) Δ_S(l, j) for any triplet i, j, l ∈ S, which further lower bounds MinAR by:

MinAR_k(i, j) ≥ MinAR_k(i, l) · MinAR_k(l, j).
The proof of the result appears in Appendix A.1. Fig. 1 shows a geometrical interpretation behind MinAR, under the joint realization of the pair of utility values (X_i, X_j).
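The variational quantity is easy to sanity-check numerically. Assuming it takes the form inf_x f(x − θ_i)F(x − θ_j) / (f(x − θ_j)F(x − θ_i)) (our reconstruction of the bound), a pleasant special case is Gumbel noise: there the ratio is constant in x and equals e^{θ_i − θ_j}, exactly the Plackett-Luce advantage ratio.

```python
import numpy as np

# Standard Gumbel pdf and cdf in closed form.
f = lambda x: np.exp(-x - np.exp(-x))
F = lambda x: np.exp(-np.exp(-x))

theta_i, theta_j = 0.7, 0.2                     # hypothetical, theta_i > theta_j
x = np.linspace(-4.0, 8.0, 2001)
ratio = f(x - theta_i) * F(x - theta_j) / (f(x - theta_j) * F(x - theta_i))
print(ratio.min(), ratio.max(), np.exp(theta_i - theta_j))  # all ≈ 1.6487
```

Algebraically, the Gumbel pdf/cdf factors cancel so that the x-dependence vanishes, which is why the infimum is attained everywhere.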
Remark 2.
We next derive MinAR bounds for certain specific noise distributions:
Lemma 3 (Analysing MinAR for specific noise models).
Given a fixed item pair i, j such that θ_i > θ_j, the following bounds hold under the respective noise models in an iid RUM (throughout, c denotes a positive constant depending only on the noise distribution's parameters):

Exponential: MinAR_k(i, j) ≥ 1 + λ(θ_i − θ_j) for Exponential(λ) noise with λ > 0.

Extreme value distribution: For Gumbel(μ, 1) noise, MinAR_k(i, j) = e^{θ_i − θ_j} ≥ 1 + (θ_i − θ_j).

Uniform: MinAR_k(i, j) ≥ 1 + (θ_i − θ_j)/(b − a) for Uniform(a, b) noise (a < b and θ_i − θ_j < b − a).

Gamma: MinAR_k(i, j) ≥ 1 + c(θ_i − θ_j) for Gamma(α, λ) noise.

Weibull: MinAR_k(i, j) ≥ 1 + c(θ_i − θ_j) for Weibull(λ, κ) noise.

Normal N(μ, σ²): For small enough θ_i − θ_j (in a neighborhood of 0), MinAR_k(i, j) ≥ 1 + c(θ_i − θ_j).
Proof is given in Appendix A.2.
5 An optimal algorithm for the winner feedback model
In this section, we propose an algorithm (SequentialPairwiseBattle, Algorithm 1) for the PAC objective with winner feedback. We then analyse its correctness and sample complexity guarantee (Theorem 4) for any noise distribution (under the mild assumption that MinAR is bounded away from 1). Following this, we also prove a matching lower bound for the problem, which shows that the sample complexity of Algorithm SequentialPairwiseBattle is unimprovable (up to a logarithmic factor).
5.1 The SequentialPairwiseBattle algorithm
Our algorithm is based on the simple idea of dividing the set of n items into subgroups of size k, querying each subgroup sufficiently often, retaining only the empirically 'strongest' item of each subgroup, and recursing on the set of surviving items until only one item remains.

More specifically, it starts by partitioning the initial item pool [n] into ⌈n/k⌉ mutually exclusive and exhaustive subsets, each of size at most k. Each set is then queried for a prescribed number of rounds, and only the 'empirical winner' of each group is retained in the set of survivors; the rest are discarded. The algorithm then recurses the same procedure on the set of surviving items, until a single item is left, which is then declared the PAC-best item. Algorithm 1 presents the pseudocode in more detail.
Key idea: The primary novelty here is how the algorithm reasons about the 'strongest item' in each subgroup: it maintains the pairwise win/loss counts of every item pair in the subgroup, and simply chooses the item that beats all the other items of the subgroup by a positive margin in pairwise preference (alternatively, the item that wins the maximum number of subsetwise plays). Our idea of maintaining pairwise preferences is motivated by a similar algorithm proposed in Saha and Gopalan (2019); however, their performance guarantee applies only to the very specific class of Plackett-Luce feedback models, whereas the novelty of our current analysis reveals the power of maintaining pairwise estimates for the more general RUM subsetwise model (which includes the Plackett-Luce choice model as a special case). The pseudocode of SequentialPairwiseBattle is given in Alg. 1.
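A simplified sketch of this recursion, using the 'most subsetwise wins' variant of the winner rule; simulated Gumbel-noise WI feedback stands in for the environment, the utilities and per-group budget are illustrative, and the confidence bookkeeping of Alg. 1 is omitted.

```python
import numpy as np

def seq_pairwise_battle(theta, k, rounds_per_group=5000, seed=0):
    """Partition items into size-k groups, play each group repeatedly,
    keep each group's empirical winner, and recurse until one item survives."""
    rng = np.random.default_rng(seed)
    alive = list(range(len(theta)))
    while len(alive) > 1:
        survivors = []
        for g in range(0, len(alive), k):
            group = alive[g:g + k]
            if len(group) == 1:
                survivors.append(group[0])
                continue
            # Simulated WI feedback: winner = argmax of perturbed utilities.
            scores = theta[group] + rng.gumbel(size=(rounds_per_group, len(group)))
            wins = np.bincount(np.argmax(scores, axis=1), minlength=len(group))
            survivors.append(group[int(np.argmax(wins))])
        alive = survivors
    return alive[0]

theta = np.array([0.1, 0.9, 0.3, 0.0, 0.5, 0.2, -0.2])  # hypothetical; item 1 is best
print(seq_pairwise_battle(theta, k=3))                   # identifies item 1 w.h.p.
```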
The following is our chief result; it proves correctness and a sample complexity bound for Algorithm 1.
Theorem 4 (SequentialPairwiseBattle: Correctness and Sample Complexity).
Consider any iid subsetwise preference model RUM(k, θ) based on a noise distribution D, and suppose that for any item pair i, j with θ_i > θ_j we have MinAR_k(i, j) ≥ 1 + c(θ_i − θ_j) for some D-dependent constant c > 0. Then Algorithm 1, with the input constant c, is an (ε, δ)-PAC algorithm with sample complexity O((n/(c²ε²)) log(k/δ)).
The proof of the result appears in Appendix B.1.
Remark 3.
The linear dependence of the sample complexity on the total number of items, n, is, in effect, the price to pay for learning the unknown model parameters which decide the subsetwise winning probabilities of the items. Remarkably, however, the theorem shows that the PAC sample complexity of the best item identification problem, with only winner feedback from size-k subsets, is essentially independent of k. One may expect an improved sample complexity as the number of items simultaneously tested in each round grows; but, on the other hand, the sample complexity could also worsen, since it is harder for a good item to win and show itself in a few draws against a large population of competitors – these effects roughly balance each other out, and the final sample complexity depends only on the total number of items n and the accuracy parameters (ε, δ).
Note that Lemma 3 gives specific values of the noise-model dependent constant c, using which we can derive specific sample complexity bounds for particular noise models:
Corollary 5 (Model specific correctness and sample complexity guarantees).
For the following representative noise distributions: Exponential, Gumbel, Gamma, Uniform, Weibull and Normal (including standard normal), SeqPB (Alg. 1) finds an (ε, δ)-PAC item with sample complexity O((n/ε²) log(k/δ)).
Proof sketch.
The proof follows from the general performance guarantee of SeqPB (Thm. 4) and Lem. 3. More specifically, from Lem. 3 it follows that the value of c for each of these specific distributions is a constant, which concludes the claim. For completeness, the distribution-specific values of c are given in Appendix B.2. ∎
5.2 Sample Complexity Lower Bound
In this section, we derive a sample complexity lower bound for any (ε, δ)-PAC algorithm over the class of RUM models with MinAR strictly bounded away from 1 in terms of the constant c. Our formal claim is as follows:
Theorem 6 (Sample Complexity Lower Bound for RUM model).
Given ε > 0, δ ∈ (0, 1], and an (ε, δ)-PAC algorithm A with winner item feedback, there exists a RUM(k, θ) instance ν with MinAR_k(i, j) ≥ 1 + c(θ_i − θ_j) for all item pairs i, j, on which the expected sample complexity of A is at least Ω((n/(c²ε²)) log(1/δ)).
The proof is given in Appendix B.3. It essentially involves a change of measure argument demonstrating a family of Plackett-Luce models (iid Gumbel noise), with the appropriate value of c, that cannot easily be teased apart by any learning algorithm.

Comparing this result with the performance guarantee of our proposed algorithm (Theorem 4) shows that the sample complexity of the algorithm is order-wise optimal (up to a logarithmic factor). Moreover, this result also shows that the IIA (independence of irrelevant attributes) property of the Plackett-Luce choice model is not essential for exploiting pairwise preferences via rank breaking, as was done in Saha and Gopalan (2019). Indeed, except for the case of Gumbel noise, none of the RUM based models in Corollary 5 satisfies IIA, yet they all admit the same PAC sample complexity guarantee.
Remark 4.
For constant c, the fundamental sample complexity bound of Theorem 6 resembles that of PAC best arm identification in the standard multi-armed bandit (MAB) problem (Even-Dar et al., 2006). Recall that our problem objective is exactly the same as in MAB; however, our feedback model is very different, since in MAB the learner gets to see noisy rewards (i.e., the realized values X_i, which can be seen as noisy feedback on the true score θ_i of an item), whereas here the learner only sees k-wise relative preference feedback based on the underlying (unobserved) realizations X_i – a more indirect way of giving feedback on the item scores. Thus, intuitively, our problem is at least as hard as the MAB setup.
6 Results for the Top-m Ranking (TR) feedback model
We now address our PAC item identification problem for the case of more general top-m rank ordered feedback in the RUM model, which generalises both the winner item (WI) and full ranking (FR) feedback models.
Top-m ranking of items (TR): In this feedback setting, the environment is assumed to return a ranking of only m items from among S; i.e., the environment first draws a full ranking σ over S according to the RUM, as in FR above, and returns the first m ranked elements of σ, i.e., (σ(1), …, σ(m)). It can be seen that, for each possible top-m ranking σ_m of a subset S, we must have P(σ_m | S) = Σ_{σ ∈ Σ_S : σ extends σ_m} P(σ | S), where Σ_S^m denotes the set of all possible length-m rankings of items in S; it is easy to note that |Σ_S^m| = k!/(k − m)!. Thus, generating such a σ_m is also equivalent to successively sampling m winners from S according to the PL model, without replacement. It follows that TR reduces to FR when m = k and to WI when m = 1. Note that the idea of top-m ranking feedback was introduced by Saha and Gopalan (2018b), but only for the specific Plackett-Luce choice model.
6.1 Algorithm for top-m ranking feedback
In this section, we extend the algorithm proposed earlier (Alg. 1) to handle feedback from the general top-m ranking feedback model. Based on the performance analysis of our algorithm (Thm. 7), we show that top-m ranking feedback yields a 1/m factor improvement in the sample complexity. We finally also give a lower bound analysis under this general feedback model (Thm. 8), showing the fundamental performance limit of the problem of interest. Our derived lower bound shows optimality of our proposed algorithm mSeqPB up to logarithmic factors.
Main idea: As with SeqPB, the algorithm proposed in this section (Alg. 2) in principle follows the same sequential elimination based strategy to find the near-best item of the RUM model based on pairwise preferences. However, we use the idea of rank breaking (Soufiani et al., 2014; Saha and Gopalan, 2018b) to extract the pairwise preferences: formally, given any set S of size k, if σ_m denotes a possible top-m ranking of S, then the RankBreaking subroutine considers each item in S to be beaten by all the items preceding it in σ_m, in a pairwise sense. For instance, given a full ranking of a set of 4 elements S = {a, b, c, d}, say σ = (b ≻ a ≻ c ≻ d), RankBreaking generates the set of pairwise comparisons (b ≻ a), (b ≻ c), (b ≻ d), (a ≻ c), (a ≻ d), (c ≻ d).
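A minimal sketch of such a RankBreaking subroutine (the convention that unranked items of S lose to every ranked item is our assumption for the top-m case; for a full ranking, m = |S|, it reduces to all ordered pairs):

```python
from itertools import combinations

def rank_breaking(S, sigma_m):
    """Break top-m ranking feedback into pairwise outcomes (winner, loser):
    each ranked item beats everything ranked after it, and every ranked
    item beats the unranked items of S (assumed convention)."""
    pairs = list(combinations(sigma_m, 2))                  # ranked vs ranked
    unranked = [x for x in S if x not in sigma_m]
    pairs += [(w, l) for w in sigma_m for l in unranked]    # ranked vs unranked
    return pairs

# Full ranking (m = |S|) of S = {a, b, c, d}, say b > a > c > d:
print(rank_breaking(["a", "b", "c", "d"], ["b", "a", "c", "d"]))
# -> [('b', 'a'), ('b', 'c'), ('b', 'd'), ('a', 'c'), ('a', 'd'), ('c', 'd')]
```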
As a whole, our new algorithm again divides the set of items into small groups of size k, and plays each subgroup for a number of rounds. Inside any fixed subgroup G, after each round of play it applies RankBreaking to the top-m ranking feedback σ_m, extracting pairwise win/loss feedback which is further used to estimate the empirical pairwise preferences for each pair of items in G. Based on these pairwise estimates, it then retains only the strongest item of G and recurses the same procedure on the set of surviving items, until just one item is left. The complete algorithm is given in Alg. 2 (Appendix C.1).
Theorem 7 analyses the correctness and sample complexity of mSeqPB. Note that the sample complexity bound of mSeqPB with top-m ranking (TR) feedback is 1/m times that of the WI model (Thm. 4). This is justified since, intuitively, revealing a ranking on m items of a set provides about m winner (WI) feedbacks per round, which essentially leads to the 1/m factor improvement in the sample complexity.
Theorem 7 (mSeqPB (Alg. 2): Correctness and Sample Complexity).
Consider any RUM(k, θ) subsetwise preference model based on a noise distribution D, and suppose that for any item pair i, j with θ_i > θ_j we have MinAR_k(i, j) ≥ 1 + c(θ_i − θ_j) for some D-dependent constant c > 0. Then mSeqPB (Alg. 2), with the input constant c, run with the top-m ranking feedback model, is an (ε, δ)-PAC algorithm with sample complexity O((n/(m c²ε²)) log(k/δ)).
Proof is given in Appendix C.2.
Similar to Cor. 5, for the top-m feedback model we can again derive specific sample complexity bounds for different noise distributions, e.g., Exponential, Gumbel, Gaussian, Uniform, Gamma, etc.
6.2 Lower Bound: Top-m ranking feedback
In this section, we derive a sample complexity lower bound for any (ε, δ)-PAC algorithm for the RUM model under top-m ranking feedback.
Theorem 8 (Sample Complexity Lower Bound for RUM model with TR feedback).
Given ε > 0 and δ ∈ (0, 1], and an (ε, δ)-PAC algorithm A with top-m ranking feedback, there exists a RUM(k, θ) instance ν in which MinAR_k(i, j) ≥ 1 + c(θ_i − θ_j) for any pair i, j, where the expected sample complexity of A on ν has to be at least Ω((n/(m c²ε²)) log(1/δ)) for A to be (ε, δ)-PAC.
The proof is given in Appendix C.3.
Similar to the case of winner feedback, comparing Theorem 7 with the above result shows that the sample complexity of mSeqPB is order-wise optimal (up to logarithmic factors) for the general case of top-m ranking feedback as well.
7 Conclusion and Future Directions
We have identified a new principle for learning from general subsetwise preference feedback in general iid RUMs – rank breaking followed by pairwise comparisons. This has been made possible by extending the concept of pairwise advantage from the popular Plackett-Luce choice model to much more general RUMs, and by showing that the IIA property that Plackett-Luce models enjoy is not essential for obtaining optimal sample complexity.
Our results suggest several interesting directions for future investigation, namely the possibility of considering correlated noise models (making the RUM more general), explicitly modelling the dependence of scores on item features or attributes, other performance objectives such as regret for online utility optimisation, and extensions to learning with relative preferences in time-correlated settings like Markov Decision Processes.
References
 Ailon et al. [2014] Nir Ailon, Zohar Shay Karnin, and Thorsten Joachims. Reducing dueling bandits to cardinal bandits. In ICML, volume 32, pages 856–864, 2014.
 Audibert and Bubeck [2010] Jean-Yves Audibert and Sébastien Bubeck. Best arm identification in multi-armed bandits. In COLT - 23rd Conference on Learning Theory, 2010.
 Azari et al. [2012] Hossein Azari, David Parkes, and Lirong Xia. Random utility theory for social choice. In Advances in Neural Information Processing Systems, pages 126–134, 2012.
 Bliss [1934] Chester I Bliss. The method of probits. Science, 1934.
 Busa-Fekete et al. [2013] Róbert Busa-Fekete, Balazs Szorenyi, Weiwei Cheng, Paul Weng, and Eyke Hüllermeier. Top-k selection based on adaptive sampling of noisy preferences. In International Conference on Machine Learning, pages 1094–1102, 2013.
 Busa-Fekete et al. [2014a] Róbert Busa-Fekete, Eyke Hüllermeier, and Balázs Szörényi. Preference-based rank elicitation using statistical models: The case of Mallows. In Proceedings of The 31st International Conference on Machine Learning, volume 32, 2014a.
 Busa-Fekete et al. [2014b] Róbert Busa-Fekete, Balázs Szörényi, and Eyke Hüllermeier. PAC rank elicitation through adaptive sampling of stochastic pairwise preferences. In AAAI, pages 1701–1707, 2014b.
 Chen et al. [2017] Xi Chen, Sivakanth Gopi, Jieming Mao, and Jon Schneider. Competitive analysis of the top-k ranking problem. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1245–1264. SIAM, 2017.
 Chen et al. [2018] Xi Chen, Yuanzhi Li, and Jieming Mao. A nearly instance optimal algorithm for top-k ranking under the multinomial logit model. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2504–2522. SIAM, 2018.
 Désir et al. [2016a] Antoine Désir, Vineet Goyal, Srikanth Jagabathula, and Danny Segev. Assortment optimization under the mallows model. In Advances in Neural Information Processing Systems, pages 4700–4708, 2016a.
 Désir et al. [2016b] Antoine Désir, Vineet Goyal, Danny Segev, and Chun Ye. Capacity constrained assortment optimization under the markov chain based choice model. Operations Research, 2016b.

 Even-Dar et al. [2006] Eyal Even-Dar, Shie Mannor, and Yishay Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7(Jun):1079–1105, 2006.
 Falahatgar et al. [2017] Moein Falahatgar, Yi Hao, Alon Orlitsky, Venkatadheeraj Pichapati, and Vaishakh Ravindrakumar. Maxing and ranking with few assumptions. In Advances in Neural Information Processing Systems, pages 7063–7073, 2017.
 Gajane et al. [2015] Pratik Gajane, Tanguy Urvoy, and Fabrice Clérot. A relative exponential weighing algorithm for adversarial utilitybased dueling bandits. In Proceedings of the 32nd International Conference on Machine Learning, pages 218–227, 2015.
 González et al. [2017] Javier González, Zhenwen Dai, Andreas Damianou, and Neil D. Lawrence. Preferential Bayesian optimization. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1282–1291. JMLR.org, 2017.
 Jamieson et al. [2014] Kevin Jamieson, Matthew Malloy, Robert Nowak, and Sebastien Bubeck. lil' UCB: An optimal exploration algorithm for multi-armed bandits. In Maria Florina Balcan, Vitaly Feldman, and Csaba Szepesvari, editors, Proceedings of The 27th Conference on Learning Theory, volume 35 of Proceedings of Machine Learning Research, pages 423–439. PMLR, 2014.
 Jang et al. [2017] Minje Jang, Sunghyun Kim, Changho Suh, and Sewoong Oh. Optimal sample complexity of m-wise data for top-k ranking. In Advances in Neural Information Processing Systems, pages 1685–1695, 2017.
 Kalyanakrishnan et al. [2012] Shivaram Kalyanakrishnan, Ambuj Tewari, Peter Auer, and Peter Stone. PAC subset selection in stochastic multi-armed bandits. In ICML, volume 12, pages 655–662, 2012.
 Karnin et al. [2013] Zohar Karnin, Tomer Koren, and Oren Somekh. Almost optimal exploration in multi-armed bandits. In International Conference on Machine Learning, pages 1238–1246, 2013.
 Kaufmann et al. [2016] Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. On the complexity of best-arm identification in multi-armed bandit models. The Journal of Machine Learning Research, 17(1):1–42, 2016.
 Khetan and Oh [2016] Ashish Khetan and Sewoong Oh. Data-driven rank breaking for efficient rank aggregation. Journal of Machine Learning Research, 17(193):1–54, 2016.
 Luce [2012] R Duncan Luce. Individual choice behavior: A theoretical analysis. Courier Corporation, 2012.
 Mohajer et al. [2017] Soheil Mohajer, Changho Suh, and Adel Elmahdy. Active learning for top-k rank aggregation from noisy comparisons. In International Conference on Machine Learning, pages 2488–2497, 2017.
 Nip et al. [2017] Kameng Nip, Zhenbo Wang, and Zizhuo Wang. Assortment optimization under a single transition model. 2017.
 Plackett [1975] Robin L Plackett. The analysis of permutations. Journal of the Royal Statistical Society: Series C (Applied Statistics), 24(2):193–202, 1975.
 Popescu et al. [2016] Pantelimon G Popescu, Silvestru Dragomir, Emil I Slusanschi, and Octavian N Stanasila. Bounds for Kullback-Leibler divergence. Electronic Journal of Differential Equations, 2016, 2016.
 Ren et al. [2018] Wenbo Ren, Jia Liu, and Ness B Shroff. PAC ranking from pairwise and listwise queries: Lower bounds and upper bounds. arXiv preprint arXiv:1806.02970, 2018.
 Saha and Gopalan [2018a] Aadirupa Saha and Aditya Gopalan. Battle of bandits. In Uncertainty in Artificial Intelligence, 2018a.
 Saha and Gopalan [2018b] Aadirupa Saha and Aditya Gopalan. Active ranking with subset-wise preferences. International Conference on Artificial Intelligence and Statistics (AISTATS), 2018b.
 Saha and Gopalan [2019] Aadirupa Saha and Aditya Gopalan. PAC Battling Bandits in the Plackett-Luce Model. In Algorithmic Learning Theory, pages 700–737, 2019.
 Soufiani et al. [2013] Hossein Azari Soufiani, Hansheng Diao, Zhenyu Lai, and David C Parkes. Generalized random utility models with multiple types. In Advances in Neural Information Processing Systems, pages 73–81, 2013.
 Soufiani et al. [2014] Hossein Azari Soufiani, David C Parkes, and Lirong Xia. Computing parametric ranking models via rankbreaking. In ICML, pages 360–368, 2014.
 Sui et al. [2017] Yanan Sui, Vincent Zhuang, Joel W Burdick, and Yisong Yue. Multi-dueling bandits with dependent arms. arXiv preprint arXiv:1705.00253, 2017.
 Szörényi et al. [2015] Balázs Szörényi, Róbert Busa-Fekete, Adil Paul, and Eyke Hüllermeier. Online rank elicitation for Plackett-Luce: A dueling bandits approach. In Advances in Neural Information Processing Systems, pages 604–612, 2015.
 Talluri and Van Ryzin [2004] Kalyan Talluri and Garrett Van Ryzin. Revenue management under a general discrete choice model of consumer behavior. Management Science, 50(1):15–33, 2004.
 Thurstone [1927] Louis L Thurstone. A law of comparative judgment. Psychological review, 34(4):273, 1927.
 Urvoy et al. [2013] Tanguy Urvoy, Fabrice Clerot, Raphael Féraud, and Sami Naamane. Generic exploration and k-armed voting bandits. In International Conference on Machine Learning, pages 91–99, 2013.
 Yue and Joachims [2009] Yisong Yue and Thorsten Joachims. Interactively optimizing information retrieval systems as a dueling bandits problem. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1201–1208. ACM, 2009.
 Yue and Joachims [2011] Yisong Yue and Thorsten Joachims. Beat the mean bandit. In Proceedings of the 28th International Conference on Machine Learning (ICML11), pages 241–248, 2011.
 Zhao et al. [2018] Zhibing Zhao, Tristan Villamil, and Lirong Xia. Learning mixtures of random utility models. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
 Zoghi et al. [2013] Masrour Zoghi, Shimon Whiteson, Remi Munos, and Maarten de Rijke. Relative upper confidence bound for the k-armed dueling bandit problem. arXiv preprint arXiv:1312.3393, 2013.
 Zoghi et al. [2014] Masrour Zoghi, Shimon Whiteson, Remi Munos, Maarten de Rijke, et al. Relative upper confidence bound for the k-armed dueling bandit problem. In JMLR Workshop and Conference Proceedings, number 32, pages 10–18. JMLR, 2014.
 Zoghi et al. [2015] Masrour Zoghi, Shimon Whiteson, and Maarten de Rijke. MergeRUCB: A method for large-scale online ranker evaluation. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, pages 17–26. ACM, 2015.
Supplementary for Best-item Learning in Random Utility Models with Subset Choices
Appendix A Appendix for Section 4
A.1 Proof of Lemma 2
Lemma 2 (restated).
Proof.
Let us fix any subset and consider two items such that . Recall that we also denote by . Let us define a random variable that denotes the maximum score attained by the remaining items in the set . Note that the support of , say, is denoted by supp.
Let us also denote . We have:
Let us now introduce a random variable . Owing to the independent and identically distributed (i.i.d.) noise assumption of the RUM, we can further show that:
which proves the claim. ∎
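The subset-choice probabilities manipulated in the proof above can also be approximated numerically: sample a score for each item as its intrinsic utility plus i.i.d. noise, and count how often each item attains the maximum. The sketch below, assuming NumPy (the function name `choice_probs` and all parameter values are illustrative, not from the paper), also checks the classical fact that Gumbel noise recovers the Plackett-Luce (softmax) choice probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)

def choice_probs(theta, noise_sampler, n_samples=200_000):
    """Monte Carlo estimate of RUM choice probabilities over a subset.

    theta: intrinsic utilities of the items in the subset.
    noise_sampler: callable drawing i.i.d. noise of a given shape.
    Returns the empirical probability that each item attains the max score.
    """
    theta = np.asarray(theta, dtype=float)
    # Each row is one realization of all item scores.
    scores = theta + noise_sampler((n_samples, theta.size))
    winners = scores.argmax(axis=1)
    return np.bincount(winners, minlength=theta.size) / n_samples

# Gumbel noise should recover the Plackett-Luce (softmax) probabilities.
theta = np.array([1.0, 0.5, 0.0])
p_hat = choice_probs(theta, lambda shape: rng.gumbel(size=shape))
p_pl = np.exp(theta) / np.exp(theta).sum()
```

Swapping the `noise_sampler` for, e.g., `rng.normal` or `rng.exponential` gives the corresponding probit or exponential RUM choice probabilities, for which no such closed form is available in general.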
A.2 Proof of Lemma 3
Lemma 3 (restated).
Proof.
We can derive the MinAR values for the distributions below by applying the lower bound formula stated in Thm. 2, together with the specific density function of each distribution:
1. Exponential noise:
When the noise distribution is Exponential, i.e., Exponential, note that: , , and support.
2. Gumbel noise:
When the noise distribution is Gumbel, i.e., Gumbel, note that: , , and support.
3. Uniform noise:
When the noise distribution is Uniform, i.e., Uniform, note that: , , and support.
4. Gamma noise:
When the noise distribution is Gamma, with and , i.e., Gamma, note that: , , and support.
5. Weibull noise:
When the noise distribution is Weibull, with , i.e., Weibull, note that: , , and support.
6. Gaussian noise:
Since Gaussian distributions do not have closed-form CDFs and are difficult to handle in general, we propose a different line of analysis specifically for the Gaussian noise case. Take the noise distribution to be standard normal, i.e.,
, with density . When and with , we find a lower bound on . First, note that by translation we can take and without loss of generality. Doing so allows us to write
and likewise (taking ),
With this notation, we wish to minimize the ratio
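The pairwise advantage ratios whose minima are bounded above can be sanity-checked by simulation. The sketch below, assuming NumPy (the helper `advantage_ratio` and the chosen utilities/noise parameters are illustrative, not from the paper), estimates the ratio of two items' winning probabilities within a subset under several of the noise models listed above; under Gumbel noise the ratio should be close to the Plackett-Luce value exp(1)/exp(0) = e, by IIA.

```python
import numpy as np

rng = np.random.default_rng(1)

def advantage_ratio(theta, i, j, noise_sampler, n=300_000):
    """Estimate P(item i preferred in S) / P(item j preferred in S)
    under a RUM with i.i.d. noise from noise_sampler."""
    scores = np.asarray(theta, dtype=float) + noise_sampler((n, len(theta)))
    winners = scores.argmax(axis=1)
    p = np.bincount(winners, minlength=len(theta)) / n
    return p[i] / p[j]

# Item 0 has utility 1, item 1 has utility 0, two distractors at 0.5.
theta = [1.0, 0.0, 0.5, 0.5]
samplers = {
    "gumbel":      lambda s: rng.gumbel(size=s),
    "exponential": lambda s: rng.exponential(size=s),
    "uniform":     lambda s: rng.uniform(0.0, 2.0, size=s),
    "gaussian":    lambda s: rng.normal(size=s),
}
ratios = {name: advantage_ratio(theta, 0, 1, f) for name, f in samplers.items()}
```

The higher-utility item wins more often under every symmetric i.i.d. noise model (ratio above 1), but only the Gumbel ratio is invariant to the rest of the subset; for the other distributions the ratio depends on the distractors, which is exactly why the minimum advantage ratio must be bounded case by case as above.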