1 Introduction
Bandit optimisation with absolute or cardinal utility feedback is well-understood in terms of algorithms and fundamental limits; this includes statistically efficient algorithms for bandits with large, combinatorial subset action spaces (Chen et al., 2013a; Kveton et al., 2015). In many natural online learning settings, however, information obtained about the utilities of chosen alternatives is inherently relative or ordinal, e.g., in recommender systems (Hofmann et al., 2013; Radlinski et al., 2008), crowdsourcing (Chen et al., 2013b), multi-player game ranking (Graepel and Herbrich, 2006), market research and social surveys (Ben-Akiva et al., 1994; Alwin and Krosnick, 1985; Hensher, 1994), and in other systems where human beings express preferences.
The framework of dueling bandits (Yue and Joachims, 2009; Zoghi et al., 2013) represents a promising attempt to model online optimisation with pairwise preference feedback; however, our understanding of the more general, and often more realistic, setting of online learning with combinatorial subset choices and subset-level feedback is relatively less developed.
In this work, we consider a generalisation of the dueling bandit problem where the learner, instead of choosing only two arms, selects a subset of up to $k$ arms in each round. The learner subsequently observes as feedback a rank-ordered list of items from the subset, generated probabilistically according to an underlying subset-wise preference model – the Plackett-Luce distribution on rankings, based on the multinomial logit (MNL) choice model (Azari et al., 2012) – in which each arm has an unknown positive value. Simultaneously, the learner earns as reward the average value of the subset played in the round. The goal of the learner is to play subsets so as to minimise its cumulative regret with respect to the most rewarding subset.
The regret minimisation objective is relevant in settings where deviating from a good or optimal subset of alternatives at each time comes at a cost (instantaneous regret), often dictated by external considerations, but where the feedback is purely relative. For instance, consider a beverage company that experimentally devises several variants of a drink (the arms or alternatives), out of which it wants to put up the subset of variants that would sell best in the open market. It then elicits reliable (i.e., reasonably unbiased) relative preference feedback about a candidate subset of variants from a team of tasters or some other form of crowdsourcing. The inherent value or revenue of the subset, modelled as the average value of its items, is not directly observable since it is a function of the open market's response to the subset offering, and may also be costly or time-consuming to estimate. Nevertheless, there is enough information among the relative preferences revealed per round for the company to hope to optimise its selection over time.
Like the dueling bandit, this more general problem can be viewed as a stochastic partial monitoring problem (Bartók et al., 2011), in which the reward or loss of a subset play is not directly observed; instead, only stochastic feedback depending on the subset's parameters is observed. Moreover, under one of the regret structures we consider (Winner-regret, Sec. 3.2), playing the optimal subset (the singleton containing the item with the highest value) yields no useful information.
A distinctly challenging feature of this problem, with subset plays and rank-ordered feedback received after each play, lies in the combinatorially large action and feedback spaces, similar to those encountered in combinatorial bandit problems (Cesa-Bianchi and Lugosi, 2012; Combes et al., 2015). The key question here is whether (and if so, how) structure in the subset choice model, defined compactly by only a few parameters (as many as the number of arms), can be exploited to give algorithms whose regret does not reflect this combinatorial explosion.
Against this backdrop, the specific contributions of this paper are as follows:

We consider the problem of regret minimisation when subsets of items of size at most $k$ can be played, top-$m$ rank-ordered feedback is received according to the MNL model, and the reward from the subset play is the mean MNL-parameter value of the items in the subset. We propose an upper confidence bound (UCB)-based algorithm, with a novel max-min rule to build large subsets and a lightweight space requirement of tracking only pairwise item estimates, and show that it enjoys instance-dependent regret in $T$ rounds of about $O\big(\frac{n}{m\Delta}\ln T\big)$, where $n$ is the number of items and $\Delta$ an instance-dependent gap. This is shown to be order-optimal by exhibiting a lower bound of $\Omega\big(\frac{n}{m\Delta}\ln T\big)$ on the regret of any No-regret algorithm. A consequence of our results is that the optimal regret does not vary with the maximum subset size ($k$) that can be played, but improves multiplicatively with the length $m$ of the top-$m$ rank-ordered feedback received per round (Sec. 3).

We also consider a related regret minimisation setting in which subsets of size exactly $k$ must be played, after which a top-$m$ ranking of the items is received as feedback, and where the zero-regret subset consists of the $k$ items with the highest MNL-parameter values. In this case, our analysis reveals a fundamental lower bound on regret of $\Omega\big(\frac{n-k}{m\,\Delta_{(k)}}\ln T\big)$, where the problem complexity $\Delta_{(k)}$ now depends on the parameter difference between the $k$-th and $(k+1)$-th best items of the MNL model. We follow this up with a subset-playing algorithm (Alg. 3) for this problem – a recursive variant of the UCB-based algorithm above – with a matching, optimal regret guarantee of $O\big(\frac{n-k}{m\,\Delta_{(k)}}\ln T\big)$ (Sec. 4).
Related Work. Over the last decade, online learning from pairwise preferences has seen a widespread resurgence in the form of the Dueling Bandit problem, from the points of view of both pure-exploration (PAC) settings (Yue and Joachims, 2011; Szörényi et al., 2015; Busa-Fekete et al., 2014; Busa-Fekete and Hüllermeier, 2014), and regret minimisation (Yue et al., 2012; Urvoy et al., 2013; Zoghi et al., 2014; Ailon et al., 2014; Komiyama et al., 2015; Wu and Liu, 2016). In contrast, bandit learning with combinatorial, subset-wise preferences, though a natural and practical generalisation, has not received a commensurate treatment.
There have been a few attempts in the batch (i.e., non-adaptive) setting at parameter estimation in utility-based subset choice models, e.g. Plackett-Luce or Thurstonian models (Hajek et al., 2014; Chen and Suh, 2015; Khetan and Oh, 2016; Jang et al., 2017). In the online setup, a recent work by Brost et al. (2016) considers an extension of the dueling bandits framework where multiple arms are chosen in each round, but comparisons are received for each pair, and no regret guarantees are stated for their algorithm. Another similar work is DCM-bandits (Katariya et al., 2016), where a list of distinct items is offered at each round and the user chooses one or more of them, scanning the list from top to bottom. However, due to the cascading nature of its feedback model, this is also not strictly a relative subset-wise preference model, unlike ours, since the utility or attraction weight of an item is assumed to be drawn independently; consequently their learning objective differs substantially.
A related body of literature lies in dynamic assortment selection, where the goal is to offer a subset of items to customers in order to maximise expected revenue. A specific bandit (online) counterpart of this problem has been studied in recent work (Agrawal et al., 2016, 2017), although it takes items' prices into account, which makes their notion of the 'best subset' rather different from our benchmark subset; the two settings are incomparable in general.
Some recent work addresses the probably approximately correct (PAC) version of the best arm(s) identification problem from subset-wise preferences (Chen et al., 2018; Ren et al., 2018; Saha and Gopalan, 2018b), which is qualitatively different from the optimisation objective considered here. The work perhaps closest in spirit to ours is that of Saha and Gopalan (2018a), but they consider a much more elementary subset choice model based on pairwise preferences, unlike the standard MNL model rooted in choice theory. Sui et al. (2017) also address a similar problem; however, a key difference lies in the feedback, which consists of outcomes of one or more pairs from the played subset, as opposed to our winner or top-$m$ ranking feedback, which is often more practical.
2 Preliminaries and Problem Statement
Notation. We denote by $[n]$ the set $\{1, 2, \ldots, n\}$. For any subset $S \subseteq [n]$, we let $|S|$ denote the cardinality of $S$. When there is no confusion about the context, we often represent an (unordered) subset $S$ as a vector, or ordered subset, of size $|S|$, according to, say, a fixed global ordering of all the items $[n]$. In this case, $S(i)$ denotes the item (member) at the $i$-th position in subset $S$. For any ordered set $S$, $S(i{:}j)$ denotes the set of items from position $i$ to $j$, $i < j$, $i, j \in [|S|]$. $\Sigma_S$ denotes the set of permutations over the items of $S$, where for any permutation $\sigma \in \Sigma_S$, $\sigma(i)$ denotes the element at the $i$-th position in $\sigma$. We also denote by $\Sigma_S^m$ the set of permutations of any $m$-element subset of $S$, for any $m \in [|S|]$. $\mathbf{1}(\varphi)$ is generically used to denote an indicator variable that takes the value $1$ if the predicate $\varphi$ is true, and $0$ otherwise. $\Pr(A)$ is used to denote the probability of event $A$, in a probability space that is clear from the context.
Definition 1 (Multinomial logit probability model).
A multinomial logit (MNL) probability model MNL($n, \boldsymbol{\theta}$), specified by positive parameters $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_n)$, is a collection of probability distributions $\big\{\Pr(\cdot \mid S) : S \subseteq [n], S \neq \emptyset\big\}$, where for each non-empty subset $S \subseteq [n]$, $\Pr(i \mid S) = \theta_i / \sum_{j \in S} \theta_j$ for $i \in S$. The indices $1, \ldots, n$ are referred to as 'items' or 'arms'.
Best-Item: Given an MNL($n, \boldsymbol{\theta}$) instance, we define the Best-Item $i^*$ to be the item with the highest MNL parameter, if such a unique item exists, i.e. $i^* = \arg\max_{i \in [n]} \theta_i$.
Top-$k$ Best-Items: Given any instance of MNL($n, \boldsymbol{\theta}$), we define the Top-$k$ Best-Items $\mathcal{S}^*$ to be the set of $k$ distinct items with the highest MNL parameters, if such a unique set exists, i.e. for any pair of items $i \in \mathcal{S}^*$ and $j \in [n] \setminus \mathcal{S}^*$, $\theta_i > \theta_j$. So if $\theta_1 > \theta_2 > \cdots > \theta_n$, then $\mathcal{S}^* = [k]$.
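As a concrete numerical illustration of these definitions (a minimal sketch; the parameter values, and the names `mnl_choice_probs` and `top_k_best_items`, are our own, purely for exposition):

```python
def mnl_choice_probs(theta, S):
    """Pr(i | S) = theta[i] / sum_{j in S} theta[j] under MNL(theta)."""
    total = sum(theta[i] for i in S)
    return {i: theta[i] / total for i in S}

def top_k_best_items(theta, k):
    """The k items with the highest MNL parameters (assumed unique)."""
    return sorted(range(len(theta)), key=lambda i: theta[i], reverse=True)[:k]

theta = [0.9, 0.5, 0.3, 0.2]            # hypothetical parameters; item 0 is the Best-Item
probs = mnl_choice_probs(theta, [0, 1, 2])
print(round(probs[0], 4))               # 0.9 / 1.7 ≈ 0.5294
print(top_k_best_items(theta, 2))       # [0, 1]
```

Note that the choice probabilities depend only on the parameters of the offered subset, reflecting the IIA property of the MNL model used later.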
2.1 Feedback models
An online learning algorithm interacts with an MNL($n, \boldsymbol{\theta}$) probability model over $n$ items (the 'environment') as follows. At each round $t$, the algorithm plays a subset $S_t \subseteq [n]$ of (distinct) items, with $|S_t| \le k$, upon which it receives stochastic feedback whose distribution is governed by the probability distribution $\Pr(\cdot \mid S_t)$. We specifically consider the following structures for the feedback received upon playing a subset $S$, $|S| \le k$:
1. Winner Feedback: In this case, the environment returns a single item $I \in S$ drawn independently from the probability distribution $\Pr(\cdot \mid S)$, i.e., $\Pr(I = i \mid S) = \theta_i / \sum_{j \in S} \theta_j$, $i \in S$.
2. Top-$m$ Ranking Feedback: Here, the environment returns an ordered list of $m$ items sampled without replacement from the MNL($n, \boldsymbol{\theta}$) probability model on $S$. More formally, the environment returns a partial ranking $\sigma \in \Sigma_S^m$, drawn from the probability distribution $\Pr(\sigma \mid S) = \prod_{i=1}^{m} \theta_{\sigma(i)} \big/ \sum_{j \in S \setminus \sigma(1:i-1)} \theta_j$. This can also be seen as picking an item according to Winner Feedback from $S$, then picking the next item from $S \setminus \{\sigma(1)\}$, and so on, until $m$ items are drawn. To incorporate sets with $|S| \le m$, we set the effective ranking length to $\min(m, |S| - 1)$. Clearly this model reduces to Winner Feedback for $m = 1$, and to a full rank ordering of the set when $m = |S| - 1$.
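The sequential 'winner of the remainder' description above can be simulated directly (a minimal sketch, not the paper's code; parameter values are hypothetical):

```python
import random

def sample_top_m_ranking(theta, S, m, rng=random):
    """Draw a top-m ranking from subset S under MNL(theta):
    repeatedly sample a winner from the remaining items, without replacement."""
    remaining = list(S)
    ranking = []
    for _ in range(min(m, len(remaining))):
        weights = [theta[i] for i in remaining]
        winner = rng.choices(remaining, weights=weights)[0]
        ranking.append(winner)
        remaining.remove(winner)
    return ranking

rng = random.Random(0)
print(sample_top_m_ranking([1.0, 2.0, 3.0, 4.0], [0, 1, 2, 3], 2, rng))
```

With `m = len(S)` the call returns a full ranking of `S`, matching the reduction described above.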
2.2 Decisions (Subsets) and Regret
We define two different settings in terms of their decision spaces and associated notions of regret:

Winner-regret: This is motivated by learning to identify the Best-Item $i^*$. At any round $t$, the learner can play sets $S_t$ of size $1 \le |S_t| \le k$, but is penalised for playing any item other than $i^*$. Formally, we define the learner's instantaneous regret at round $t$ as $r_t = \theta_{i^*} - \frac{1}{|S_t|} \sum_{i \in S_t} \theta_i$, and its cumulative regret from $T$ rounds as $R_T = \sum_{t=1}^{T} r_t$.
The learner aims to play sets so as to keep the regret as low as possible, i.e., to play only the singleton set $\{i^*\}$ over time, as that is the only set with zero regret. The instantaneous Winner-regret can be interpreted as the shortfall in value of the played set with respect to $\{i^*\}$, where the value of a set is simply the mean parameter value of its items.
Remark 1.
Assuming $\theta_{i^*} = 1$ (we can do this without loss of generality since the MNL model is positive scale invariant, see Defn. 1), it is easy to note that for any item $i$, $\theta_{i^*} - \theta_i \le 4\big(\Pr(i^* \mid \{i^*, i\}) - \tfrac{1}{2}\big)$ (as $\Pr(i^* \mid \{i^*, i\}) = \tfrac{1}{1 + \theta_i}$). Consequently, the Winner-regret as defined above can be further bounded above (up to constant factors) as $\sum_{t=1}^{T} \frac{1}{|S_t|} \sum_{i \in S_t} \big(\Pr(i^* \mid \{i^*, i\}) - \tfrac{1}{2}\big)$, which, for $|S_t| = 2$, is the definition of regret in the dueling bandit problem (Yue et al., 2012; Zoghi et al., 2014; Wu and Liu, 2016).
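The Winner-regret of a sequence of plays can be computed directly from its definition as a mean-value shortfall (a small sketch with hypothetical parameters; the function name is ours):

```python
def winner_regret(theta, best, plays):
    """Cumulative Winner-regret of a sequence of subset plays:
    per round, the shortfall of the played set's mean parameter
    value relative to the Best-Item's parameter theta[best]."""
    return sum(theta[best] - sum(theta[i] for i in S) / len(S) for S in plays)

theta = [0.9, 0.5, 0.3]                   # hypothetical MNL parameters, item 0 optimal
print(winner_regret(theta, 0, [[0]]))     # playing {best} alone: zero regret
print(winner_regret(theta, 0, [[0, 1]]))  # mean value 0.7, so regret ≈ 0.2
```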

Top-$k$ regret: This setting is motivated by learning to identify the set of Top-$k$ Best-Items $\mathcal{S}^*$ of the MNL($n, \boldsymbol{\theta}$) model. Correspondingly, we assume that the learner can play sets $S_t$ of exactly $k$ distinct items at each round $t$. The instantaneous regret of the learner in the $t$-th round is defined to be $r_t^{(k)} = \frac{1}{k}\big(\sum_{i \in \mathcal{S}^*} \theta_i - \sum_{i \in S_t} \theta_i\big)$, where $\mathcal{S}^*$ is the set of Top-$k$ Best-Items. Consequently, the cumulative regret of the learner at the end of round $T$ becomes $R_T^{(k)} = \sum_{t=1}^{T} r_t^{(k)}$. As with the Winner-regret, the Top-$k$ regret also admits a natural interpretation as the shortfall in value of the set $S_t$ with respect to the set $\mathcal{S}^*$, with the value of a set being the mean parameter value of the arms it contains.
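Analogously, the Top-$k$ regret of a play sequence follows directly from the mean-value interpretation (again a hypothetical-parameter sketch, with a function name of our choosing):

```python
def top_k_regret(theta, k, plays):
    """Cumulative Top-k regret: per round, the mean-parameter-value shortfall
    of the played k-set relative to the Top-k Best-Items."""
    best_val = sum(sorted(theta, reverse=True)[:k]) / k
    return sum(best_val - sum(theta[i] for i in S) / k for S in plays)

theta = [0.9, 0.5, 0.3, 0.2]   # hypothetical MNL parameters
# playing the true top-2 set {0, 1} incurs zero regret;
# swapping item 1 for item 2 costs (0.5 - 0.3) / 2 per round
print(top_k_regret(theta, 2, [[0, 1], [0, 2]]))
```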
3 Minimising Winner-regret
This section considers the problem of minimising Winner-regret with the most general Top-$m$ Ranking Feedback. We start by deriving a fundamental lower bound on Winner-regret for any reasonable algorithm. The main finding is a regret lower bound that does not exhibit an improvement with larger playable subset sizes $k$ (Thm. 3), and, in fact, one that is order-wise the same, in terms of the total number of arms $n$ and time horizon $T$, as that of the corresponding dueling version ($k = 2$). We next analyse the hardness of the Winner-regret minimisation problem with Top-$m$ Ranking Feedback and show a lower bound reduced by a factor of $m$ over Winner Feedback (Thm. 6). Following this, Sec. 3.2 presents an algorithm with a matching regret guarantee.
3.1 Lower Bound for Winner-regret
Along the lines of Lai and Robbins (1985), we define the following consistency property of any reasonable online learning algorithm in order to state a fundamental lower bound on regret performance.
Definition 2 (No-regret algorithm).
An online learning algorithm $\mathcal{A}$ is defined to be a No-regret algorithm if, for all problem instances MNL($n, \boldsymbol{\theta}$), the number of times $\mathcal{A}$ plays (or queries) any suboptimal set $S$ is sublinear in $T$; in other words, $\mathbb{E}[N_S(T)] = o(T^{a})$ for some $a \in (0, 1)$, $N_S(T)$ being the number of times the set $S$ is played by $\mathcal{A}$ in $T$ rounds.
Theorem 3 (Regret Lower Bound: Winner-regret with Winner Feedback).
For any No-regret learning algorithm $\mathcal{A}$ for Winner-regret with Winner Feedback, there exists a problem instance MNL($n, \boldsymbol{\theta}$) such that the expected regret incurred by $\mathcal{A}$ on it satisfies $\mathbb{E}_{\mathcal{A}}[R_T] = \Omega\big(\frac{n}{\Delta} \ln T\big)$,
where $\mathbb{E}_{\mathcal{A}}$ denotes expectation under the algorithm and the MNL($n, \boldsymbol{\theta}$) model.
Note: This is a 'problem'- or 'gap'-dependent lower bound: $\Delta$ denotes an instance-dependent complexity term ('gap') for the regret performance limit.
Proof sketch. The argument is based on the following powerful standard changeofmeasure result:
Lemma 4 (Garivier et al. (2016)).
Let $\nu$ and $\nu'$ be two bandit instances over an arm set $\mathcal{A}$, with reward distributions $(\nu_a)_{a \in \mathcal{A}}$ and $(\nu'_a)_{a \in \mathcal{A}}$ respectively, such that some arm $a^*$ has the highest expected reward under $\nu$, a different arm $a' \neq a^*$ is the (unique) optimal arm under $\nu'$, and $\nu_a$ and $\nu'_a$ are mutually absolutely continuous for all $a$. At any round $t$, let $A_t$ and $X_t$ denote the arm played and the observation (reward) received, respectively, and let $\mathcal{F}_t$ be the sigma algebra generated by the trajectory of a sequential bandit algorithm up to round $t$. Then, for any $\mathcal{F}_T$-measurable random variable $Z$ with values in $[0, 1]$, it satisfies
$\sum_{a \in \mathcal{A}} \mathbb{E}_{\nu}[N_a(T)]\, KL(\nu_a, \nu'_a) \;\ge\; kl\big(\mathbb{E}_{\nu}[Z], \mathbb{E}_{\nu'}[Z]\big),$
where $N_a(T)$ denotes the number of pulls of arm $a$ in $T$ rounds, $KL$ is the Kullback-Leibler divergence between distributions, and $kl(p, q) := p \ln\frac{p}{q} + (1-p) \ln\frac{1-p}{1-q}$ is the Kullback-Leibler divergence between Bernoulli distributions with parameters $p$ and $q$.
In our case, each bandit instance corresponds to an instance of the MNL($n, \boldsymbol{\theta}$) problem, with the arm set containing all subsets of $[n]$ of size up to $k$: $\mathcal{A} = \{S \subseteq [n] : |S| \le k\}$. The key of the proof lies in carefully crafting a true instance $\boldsymbol{\theta}^{(1)}$, with optimal arm $\{1\}$, and a family of 'slightly perturbed' alternative instances $\{\boldsymbol{\theta}^{(a)}\}$, each with optimal arm $\{a\}$: for every suboptimal item $a \neq 1$, the altered problem instance MNL($n, \boldsymbol{\theta}^{(a)}$) agrees with $\boldsymbol{\theta}^{(1)}$ on every coordinate except that $\theta^{(a)}_a = \theta^{(1)}_1 + \epsilon$, for some $\epsilon > 0$. The result of Thm. 3 is now obtained by applying Lemma 4 to the pairs of problem instances $(\boldsymbol{\theta}^{(1)}, \boldsymbol{\theta}^{(a)})$, suitably upper bounding the KL-divergence terms on the right-hand side of the Lem. 4 inequality, and further lower bounding the left-hand side by setting $Z = N_{\{1\}}(T)/T$ together with the No-regret property of $\mathcal{A}$. Finally, rewriting the expected regret in terms of the expected play counts $\mathbb{E}[N_S(T)]$
and combining all the results leads to the desired bound. (Complete proof given in Appendix B.2.)
Remark 2.
Thm. 3 establishes that the regret rate with only Winner Feedback cannot improve with $k$, uniformly across all problem instances. Rather strikingly, there is no reduction in hardness (measured in terms of regret rate) in learning the Best-Item using Winner Feedback from large (size-$k$) subsets as compared to using pairwise (dueling) feedback ($k = 2$). It could be tempting to expect an improved learning rate with subset-wise feedback, since more items are tested per iteration, so information-theoretically one might expect to collect more data about the underlying model per subset query. On the contrary, it turns out that it is intuitively 'harder' for a good (i.e., near-optimal) item to prove its competitiveness in just a single winner draw against a large population of competitors, as compared to winning over just a single competitor in the $k = 2$ case. Our result establishes this formally: the advantage of investigating larger sets is nullified by the drawback of having to query any particular set a larger number of times.
Theorem 5 (Alternate version of Thm. 3 with pairwise-preference-based instance complexities).
For any No-regret algorithm $\mathcal{A}$ for Winner-regret with Winner Feedback, there exists a problem instance of the MNL($n, \boldsymbol{\theta}$) model such that the expected regret incurred by $\mathcal{A}$ on it satisfies $\mathbb{E}_{\mathcal{A}}[R_T] = \Omega\big(\frac{n}{\Delta'} \ln T\big)$, where $T$, $n$ and $\mathbb{E}_{\mathcal{A}}$ are the same as in Thm. 3. Thus the only difference lies in the instance-dependent complexity term ('gap'), which is now expressed in terms of the pairwise preference of the best item over the second-best item: $\Delta' = \Pr(1 \mid \{1, 2\}) - \frac{1}{2}$.
Improved regret lower bound with Top-$m$ Ranking Feedback. In contrast to the situation with only Winner Feedback, the following (more general) result shows a reduced lower bound when Top-$m$ Ranking Feedback is available on each play of a subset, opening up the possibility of improved learning (regret) performance when ranked-order feedback is available.
Theorem 6 (Regret Lower Bound: Winner-regret with Top-$m$ Ranking Feedback).
For any No-regret algorithm $\mathcal{A}$ for the Winner-regret problem with Top-$m$ Ranking Feedback, there exists a problem instance MNL($n, \boldsymbol{\theta}$) such that the expected Winner-regret incurred by $\mathcal{A}$ satisfies $\mathbb{E}_{\mathcal{A}}[R_T] = \Omega\big(\frac{n}{m\Delta} \ln T\big)$, where, as in Thm. 3, $\mathbb{E}_{\mathcal{A}}$ denotes expectation under the algorithm and the MNL($n, \boldsymbol{\theta}$) model, and $\Delta$ is the instance-dependent 'gap'.
Proof sketch. A crucial fact we establish in the course of the argument is that the KL divergences that appear when analysing the case of Top-$m$ Ranking Feedback are at most $m$ times those for the case of Winner Feedback. We show this by appealing to the chain rule for KL divergences (Cover and Thomas, 2012): $KL\big(P(X_{1:m}), Q(X_{1:m})\big) = \sum_{i=1}^{m} KL\big(P(X_i \mid X_{1:i-1}), Q(X_i \mid X_{1:i-1})\big)$, where we abbreviate $(X_1, \ldots, X_i)$ as $X_{1:i}$ and $KL\big(P(X_i \mid X_{1:i-1}), Q(X_i \mid X_{1:i-1})\big)$ denotes the conditional KL-divergence. Using this, along with the upper bound on the KL divergences for Winner Feedback (derived for Thm. 3), we get that the KL divergences in this case are at most $m$ times larger, which precisely gives the factor-$m$ reduction in the regret lower bound compared to the Winner Feedback case. The lower bound is now derived following a similar technique as described for Thm. 3.
Remark 3.
Thm. 6 shows a lower bound on regret containing the instance-dependent constant term $\Delta$, which exposes the hardness of the regret minimisation problem in terms of the 'gap' between the best and the second-best items. The factor-$m$ improvement in learning rate with Top-$m$ Ranking Feedback can be intuitively interpreted as follows: revealing a top-$m$ ranking of a size-$k$ set is worth about $m \log k$ bits of information, which is about $m$ times as large as revealing a single winner (about $\log k$ bits).
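The chain-rule argument can be checked numerically on a toy instance (hypothetical parameter values; brute-force enumeration, purely illustrative): the KL divergence between the top-$m$ ranking distributions of two MNL instances is the winner-feedback KL plus non-negative conditional terms, so richer feedback separates instances faster.

```python
import itertools, math

def mnl_probs(theta, S):
    """Winner distribution on S under MNL(theta)."""
    tot = sum(theta[i] for i in S)
    return {i: theta[i] / tot for i in S}

def topm_probs(theta, S, m):
    """Distribution over top-m rankings of S: product of successive winner draws."""
    out = {}
    for rank in itertools.permutations(S, m):
        p, rem = 1.0, list(S)
        for i in rank:
            p *= theta[i] / sum(theta[j] for j in rem)
            rem.remove(i)
        out[rank] = p
    return out

def kl(P, Q):
    """KL divergence between two distributions given as dicts over the same support."""
    return sum(p * math.log(p / Q[x]) for x, p in P.items())

theta = [1.0, 0.6, 0.4]   # 'true' instance (hypothetical values)
lam   = [1.0, 0.8, 0.4]   # slightly perturbed instance
S = [0, 1, 2]
kl_winner = kl(mnl_probs(theta, S), mnl_probs(lam, S))
kl_top2   = kl(topm_probs(theta, S, 2), topm_probs(lam, S, 2))
# By the chain rule, kl_top2 = kl_winner + a non-negative conditional term.
print(kl_winner, kl_top2)
```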
The next section shows that these fundamental lower bounds on Winnerregret are, in fact, achievable with carefully designed online learning algorithms.
3.2 An order-optimal algorithm for Winner-regret
We now give an upper confidence bound (UCB)-based algorithm for Winner-regret with the Top-$m$ Ranking Feedback model, based on the following key design ideas:

Playing sets of only two sizes: We show that it is enough for the algorithm to play subsets of size either $k$ (to fully exploit the Top-$m$ Ranking Feedback) or $1$ (singleton sets), and not to play a singleton unless there is a high degree of confidence about that single item being the best item (since playing a singleton does not yield any feedback information).

Parameter estimation from pairwise preferences: We show that it is possible to play the subset-wise game just by maintaining pairwise preference estimates of all items of the MNL($n, \boldsymbol{\theta}$) model using the idea of Rank-Breaking – extracting pairwise comparisons from (partial) rankings and applying estimators on the obtained pairs, treating each comparison independently (see Defn. 13) – over the received subset-wise feedback. This is possible owing to the independence of irrelevant alternatives (IIA) property of the MNL model (Defn. 12), or more precisely Lem. 14 (Appendix A). Playing the subset-wise game with only pairwise estimates helps sidestep the combinatorial nature of the underlying problem, namely maintaining estimates of up to $k!/(k-m)!$ possible ranking probabilities, which is what the learner observes as Top-$m$ Ranking Feedback from $k$ items.

A new UCB-based set-building rule for playing large sets (build_S): The main novelty of MaxMin-UCB lies in its underlying set-building subroutine (see Alg. 2), which constructs $S_t$ by applying a recursive max-min strategy on the UCB estimates of the empirical pairwise preferences.
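The Rank-Breaking idea of the second bullet can be sketched as follows (a minimal version of our own: each ranked item is credited with a win over every item placed after it and every unranked item of the played set):

```python
def rank_break(S, ranking):
    """Rank-Breaking: extract pairwise comparisons from a top-m ranking of
    subset S. Each ranked item beats every item ranked after it and every
    unranked item of S; each (winner, loser) pair is treated independently."""
    pairs = []
    remaining = set(S)
    for winner in ranking:
        remaining.discard(winner)
        pairs.extend((winner, loser) for loser in sorted(remaining))
    return pairs

# top-2 ranking of {0, 1, 2, 3}: item 2 first, then item 0
print(rank_break([0, 1, 2, 3], [2, 0]))
# -> [(2, 0), (2, 1), (2, 3), (0, 1), (0, 3)]
```

These extracted pairs are what drive the empirical pairwise-preference updates; the IIA property of MNL is what justifies treating them as unbiased pairwise samples.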
Algorithm description. MaxMin-UCB maintains an $n \times n$ pairwise preference matrix, whose $(i, j)$-th entry records the empirical probability of item $i$ having beaten item $j$ in a pairwise duel, and a corresponding upper confidence bound $u_{ij}$ for each pair $(i, j)$. At any round $t$, it plays a subset $S_t$ using the max-min set-building rule build_S (see Alg. 2), receives Top-$m$ Ranking Feedback from $S_t$, and updates the entries for pairs in $S_t$ by applying Rank-Breaking. The set-building rule build_S is at the heart of MaxMin-UCB: it builds the subset $S_t$ from a set of potential Condorcet winners of round $t$, by recursively picking the strongest opponents of the already selected items using a max-min selection strategy on the UCB estimates.
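One plausible reading of this max-min set-building strategy can be sketched as follows (a hedged illustration only: the paper's Alg. 2 may differ in its candidate filtering and tie-breaking, and `ucb[i][j]` stands for the UCB on the probability that item $i$ beats item $j$):

```python
def build_S(ucb, k):
    """Hedged sketch of a max-min set-building rule: seed with the item whose
    worst-case UCB pairwise preference is largest, then repeatedly add the
    strongest remaining opponent of the most recently added item."""
    n = len(ucb)
    # max-min seed: most promising item against its strongest competitor
    first = max(range(n), key=lambda i: min(ucb[i][j] for j in range(n) if j != i))
    S = [first]
    while len(S) < k:
        last = S[-1]
        # strongest not-yet-selected opponent of the last added item
        challenger = max((j for j in range(n) if j not in S),
                         key=lambda j: ucb[j][last])
        S.append(challenger)
    return S
```

The intent is that every selected item is accompanied by the competitor most likely (optimistically) to beat it, so the played set always contains the most informative duels.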
The following result establishes that MaxMin-UCB enjoys order-optimal regret with high probability.
Theorem 7 (MaxMin-UCB: High-probability regret bound).
Fix a time horizon $T$ and $m \le k - 1$. With probability at least $1 - O(1/T)$, the regret of MaxMin-UCB for Winner-regret with Top-$m$ Ranking Feedback satisfies $R_T = O\big(\frac{n}{m\Delta} \ln T\big) + C$, where $\Delta$ is the 'gap' between the best and second-best items and $C$ is an instance-dependent constant independent of $T$.
Proof sketch. The proof hinges on analysing the entire run of MaxMin-UCB by breaking it up into three phases: (1) Random-Exploration, (2) Progress, and (3) Saturation.

Random-Exploration: This phase runs from time $t = 1$ up to some round $t_0$ such that, for any $t > t_0$, the upper confidence bounds $u_{ij}(t)$ are guaranteed to upper bound the true pairwise preference values for all pairs $(i, j)$, with high probability. The formal claim is given in Lem. 15; its proof is adapted from a similar analysis used by Zoghi et al. (2014), which applies to our algorithm thanks to the Rank-Breaking update of the $u_{ij}$s, which exploits the IIA property of the MNL model, and Lem. 14.

Progress: After $t_0$, the algorithm can be viewed as starting to explore the 'confusing items', appearing in $S_t$, as potential candidates for the Best-Item $i^*$, and trying to capture $i^*$ in a holding set $B_t$. Note that at any time the set $B_t$ is either empty or a singleton by construction, and once $i^* \in B_t$ it stays there forever (with high probability) due to Lem. 15. The Progress phase just ensures that the algorithm explores fast enough so that, within a constant number of rounds (independent of $T$), $B_t$ captures $i^*$, and henceforth $B_t = \{i^*\}$ for all subsequent rounds (Lem. 20).

Saturation: This is the last phase, running up to time $T$. As the name suggests, MaxMin-UCB shows relatively stable behaviour in this phase, mostly playing $\{i^*\}$ and incurring almost no regret. The only times it plays an extra set of suboptimal elements along with $i^*$ are when these elements beat $i^*$ with sufficiently high confidence in terms of the UCB estimates. But the number of such suboptimal rounds of play is limited to $O(\ln T)$ per item (Lem. 22), which finally yields the desired regret rate of MaxMin-UCB.
The complete proof is given in Appendix B.5.
Although Thm. 7 shows a high-probability regret bound for MaxMin-UCB, it is important to note that the algorithm itself does not require the probability of failure as an input. As a consequence, by simply integrating the bound obtained in Thm. 7 over the entire range of the failure probability, we get an expected regret bound for MaxMin-UCB for Winner-regret with Top-$m$ Ranking Feedback:
Theorem 8 (MaxMin-UCB: Expected regret bound).
The expected regret incurred by MaxMin-UCB for Winner-regret with Top-$m$ Ranking Feedback in $T$ rounds is $\mathbb{E}[R_T] = O\big(C_1 + C_2 + \frac{n}{m\Delta} \ln T\big)$, where $C_1$ and $C_2$ are instance-dependent constants independent of $T$.
Remark 4.
This is an upper bound on expected regret of the same order as the lower bound of Thm. 6, which shows that the algorithm is essentially regret-optimal. From Thm. 8, note that the first two terms of the bound are essentially instance-specific constants; it is only the third term which makes the expected regret $O(\ln T)$, which is in fact optimal in terms of its dependencies on $n$, $m$ and $T$ (since it matches the lower bound of Thm. 6). Moreover, the problem-dependent complexity term also brings out the inverse dependency on the 'gap' term $\Delta$, as discussed in Rem. 3.
4 Minimising Top-$k$ regret
In this section, we study the problem of minimising Top-$k$ regret with Top-$m$ Ranking Feedback. As before, we first derive a regret lower bound for this learning setting, of the form $\Omega\big(\frac{n-k}{m\,\Delta_{(k)}} \ln T\big)$, with $\Delta_{(k)}$ being a problem-dependent complexity term that measures the 'gap' between the $k$-th and $(k+1)$-th best items. We next propose a UCB-based algorithm (Alg. 3) for this setting, along with a regret analysis giving a matching upper bound (Thms. 10, 11), which proves the optimality of our proposed algorithm.
4.1 Regret lower bound for Top-$k$ regret with Top-$m$ Ranking Feedback
For the analysis in this section, we assume that the underlying MNL($n, \boldsymbol{\theta}$) model is such that $\theta_{(k)} > \theta_{(k+1)}$, where $\theta_{(i)}$ denotes the $i$-th largest parameter, and denote $\Delta_{(k)} = \theta_{(k)} - \theta_{(k+1)}$.
Theorem 9 (Regret Lower Bound: Top-$k$ regret with Top-$m$ Ranking Feedback).
For any No-regret algorithm $\mathcal{A}$, there exists a problem instance MNL($n, \boldsymbol{\theta}$) such that the expected regret incurred by $\mathcal{A}$ for Top-$k$ regret with Top-$m$ Ranking Feedback on MNL($n, \boldsymbol{\theta}$) satisfies $\mathbb{E}_{\mathcal{A}}[R_T^{(k)}] = \Omega\big(\frac{n-k}{m\,\Delta_{(k)}} \ln T\big)$, where $\mathbb{E}_{\mathcal{A}}$ denotes expectation under the algorithm and the MNL($n, \boldsymbol{\theta}$) model.
Proof sketch. Similar to Thm. 6, the proof again relies on carefully constructing a true instance $\boldsymbol{\theta}^{(1)}$, whose optimal set of Top-$k$ Best-Items is $[k]$, and a family of slightly perturbed alternative instances $\{\boldsymbol{\theta}^{(a)}\}$, one for each suboptimal arm $a \notin [k]$: the altered instance MNL($n, \boldsymbol{\theta}^{(a)}$) agrees with $\boldsymbol{\theta}^{(1)}$ everywhere except that $\theta^{(a)}_a$ is raised by some $\epsilon > 0$, just enough to push item $a$ into the top-$k$ set. The result of Thm. 9 can now be obtained by following exactly the same procedure as described for the proof of Thm. 6. The complete details are given in Appendix C.1.
Remark 5.
The regret lower bound of Thm. 9 is $\Omega\big(\frac{n-k}{m\,\Delta_{(k)}} \ln T\big)$, with an instance-dependent term $\Delta_{(k)}$ which shows that, for recovering the Top-$k$ Best-Items, the problem complexity is governed by the 'gap' between the $k$-th and $(k+1)$-th best items, consistent with intuition.
4.2 An order-optimal algorithm with low Top-$k$ regret for Top-$m$ Ranking Feedback
In this section, we present an online learning algorithm for playing subsets with low Top-$k$ regret under Top-$m$ Ranking Feedback.
Main idea: a recursive set-building rule. As with the MaxMin-UCB algorithm (Alg. 1), we maintain pairwise UCB estimates $u_{ij}$ of the empirical pairwise preferences via Rank-Breaking. However, the chief difference here lies in the set-building rule, as sets of size exactly $k$ must now be played. The core idea is to recursively try to capture the set of Top-$k$ Best-Items in an ordered set $B_t$ and, once this set is believed to have been found with sufficient confidence, to keep playing it unless some other potentially good item emerges, which is then played in place of the weakest element of $B_t$. The algorithm is described in Alg. 3.
Theorem 10 (Rec-MaxMin-UCB: High-probability regret bound).
Given a fixed time horizon $T$ and $m \le k - 1$, with high probability the regret incurred by Rec-MaxMin-UCB for Top-$k$ regret admits the bound $R_T^{(k)} = O\big(\frac{n-k}{m\,\Delta_{(k)}} \ln T\big) + C$, where $C$ is an instance-dependent constant (see Lem. 26, Appendix).