Bandit optimisation with absolute or cardinal utility feedback is well-understood in terms of algorithms and fundamental limits; this includes statistically efficient algorithms for bandits with large, combinatorial subset action spaces (Chen et al., 2013a; Kveton et al., 2015). In many natural online learning settings, however, information obtained about the utilities of alternatives chosen is inherently relative or ordinal, e.g., recommender systems (Hofmann et al., 2013; Radlinski et al., 2008), crowdsourcing (Chen et al., 2013b), multi-player game ranking (Graepel and Herbrich, 2006), market research and social surveys (Ben-Akiva et al., 1994; Alwin and Krosnick, 1985; Hensher, 1994), and in other systems where human beings often express preferences.
The framework of dueling bandits (Yue and Joachims, 2009; Zoghi et al., 2013) represents a promising attempt to model online optimisation with pairwise preference feedback; however, our understanding of the more general, and often realistic, setting of online learning with combinatorial subset choices and subset-level feedback is relatively less developed.
In this work, we consider a generalisation of the dueling bandit problem where the learner, instead of choosing only two arms, selects a subset of (up to) $k$ arms in each round. The learner subsequently observes as feedback a rank-ordered list of items from the subset, generated probabilistically according to an underlying subset-wise preference model, the Plackett-Luce distribution on rankings based on the multinomial logit (MNL) choice model (Azari et al., 2012), in which each arm has an unknown positive value. Simultaneously, the learner earns as reward the average value of the subset played in the round. The goal of the learner is to play subsets to minimise its cumulative regret with respect to the most rewarding subset.
The regret minimisation objective is relevant in settings where deviating from choosing a good or optimal subset of alternatives at each time comes at a cost (instantaneous regret) often dictated by external considerations, but where the feedback information provides purely relative feedback. For instance, consider a beverage company that experimentally devises several variants of a drink (the arms or alternatives), out of which it wants to put up a subset of variants that would sell the best in the open market. It then elicits reliable (i.e., reasonably unbiased) relative preference feedback about a candidate subset of variants from a team of tasters or some other form of crowdsourcing. The inherent value or revenue of the subset, modelled as the average value of items in it, is not directly observable since it is a function of what the open market response to the subset offering is, and may also be costly or time-consuming to estimate. Nevertheless, there is enough information among the relative preferences revealed per round for the company to hope to optimise its selection over time.
Like the dueling bandit, this more general problem can be viewed as a stochastic partial monitoring problem (Bartók et al., 2011), in which the reward or loss of a subset play is not directly observed; instead, only stochastic feedback depending on the subset’s parameters is observed. Moreover, under one of the regret structures we consider (Winner-regret, Sec. 3.2), playing the optimal subset (the single item with the highest value) yields no useful information.
A distinctly challenging feature of this problem, with subset plays and ranked-order feedback received after each play, lies in the combinatorially large action and feedback spaces, similar to those encountered in combinatorial bandit problems (Cesa-Bianchi and Lugosi, 2012; Combes et al., 2015). The key question here is whether (and if so, how) structure in the subset choice model, defined compactly by only a few parameters (as many as the number of arms) can be exploited to give algorithms whose regret does not reflect this combinatorial explosion.
Against this backdrop, the specific contributions of this paper are as follows:
We consider the problem of regret minimisation when subsets of items of size at most $k$ can be played, top-$m$ rank-ordered feedback is received according to the MNL model, and the reward from the subset play is the mean MNL-parameter value of the items in the subset. We propose an upper confidence bound (UCB)-based algorithm, with a novel maximin rule to build large subsets and a lightweight space requirement of tracking pairwise item estimates, and show that it enjoys instance-dependent regret in $T$ rounds of about $O\left(\frac{n}{m} \ln T\right)$. This is shown to be order-optimal by exhibiting a lower bound of $\Omega\left(\frac{n}{m} \ln T\right)$ on the regret for any No-regret algorithm. A consequence of our results is that the optimal regret does not vary with the maximum subset size ($k$) that can be played, but improves multiplicatively with the length $m$ of the top-$m$ rank-ordered feedback received per round (Sec. 3).
We also consider a related regret minimisation setting in which subsets of size exactly $k$ must be played, after which a ranking of the $k$ items is received as feedback, and where the zero-regret subset consists of the $k$ items with the highest MNL-parameter values. In this case, our analysis reveals a fundamental lower bound on regret of $\Omega\left(\frac{n}{k \Delta_{(k)}} \ln T\right)$, where the problem complexity now depends on the parameter difference $\Delta_{(k)}$ between the $k$-th and $(k+1)$-th best items of the MNL model. We follow this up with a subset-playing algorithm (Alg. 3) for this problem, a recursive variant of the UCB-based algorithm above, with a matching, optimal regret guarantee of $O\left(\frac{n}{k \Delta_{(k)}} \ln T\right)$ (Sec. 4).
Related Work. Over the last decade, online learning from pairwise preferences has seen a widespread resurgence in the form of the Dueling Bandit problem, from the points of view of both pure-exploration (PAC) settings (Yue and Joachims, 2011; Szörényi et al., 2015; Busa-Fekete et al., 2014; Busa-Fekete and Hüllermeier, 2014), and regret minimisation (Yue et al., 2012; Urvoy et al., 2013; Zoghi et al., 2014; Ailon et al., 2014; Komiyama et al., 2015; Wu and Liu, 2016). In contrast, bandit learning with combinatorial, subset-wise preferences, though a natural and practical generalisation, has not received a commensurate treatment.
There have been a few attempts in the batch (i.e., non-adaptive) setting for parameter estimation in utility-based subset choice models, e.g. Plackett-Luce or Thurstonian models (Hajek et al., 2014; Chen and Suh, 2015; Khetan and Oh, 2016; Jang et al., 2017). In the online setup, a recent work by Brost et al. (2016) considers an extension of the dueling bandits framework where multiple arms are chosen in each round, but they receive comparisons for each pair, and there are no regret guarantees stated for their algorithm. Another similar work is DCM-bandits (Katariya et al., 2016), where a list of distinct items is offered at each round and the users choose one or more items from it, scanning the list from top to bottom. However, due to the cascading nature of their feedback model, this is also not strictly a relative subset-wise preference model like ours, since the utility or attraction weight of an item is assumed to be independently drawn, and so their learning objective differs substantially.
A related body of literature lies in dynamic assortment selection, where the goal is to offer a subset of items to customers in order to maximise expected revenue. A specific, bandit (online) counterpart of this problem has been studied in recent work (Agrawal et al., 2016, 2017), although it takes items’ prices into account due to which their notion of the ‘best subset’ is rather different from our ‘benchmark subset’, and the two settings are incomparable in general.
Some recent work addresses the probably approximately correct (PAC) version of the best arm(s) identification problem from subsetwise preferences (Chen et al., 2018; Ren et al., 2018; Saha and Gopalan, 2018b), which is qualitatively different from the optimisation objective considered here. The work which is perhaps closest in spirit to ours is that of Saha and Gopalan (2018a), but they consider a much more elementary subset choice model based on pairwise preferences, unlike the standard MNL model rooted in choice theory. Sui et al. (2017) also address a similar problem; however, a key difference lies in the feedback, which consists of outcomes of one or more pairs from the played subset, as opposed to our winner or Top-$m$-ranking Feedback, which is often practical.
2 Preliminaries and Problem Statement
Notation. We denote by $[n]$ the set $\{1, 2, \ldots, n\}$. For any subset $S \subseteq [n]$, we let $|S|$ denote the cardinality of $S$. When there is no confusion about the context, we often represent (an unordered) subset $S$ as a vector (or ordered subset) $S$ of size $|S|$ according to, say, a fixed global ordering of all the items $[n]$. In this case, $S(i)$ denotes the item (member) at the $i$-th position in subset $S$. For any ordered set $S$, $S(i:j)$ denotes the set of items from position $i$ to $j$, $i < j$, $\forall i, j \in [|S|]$. $\Sigma_S = \{\sigma \mid \sigma \text{ is a permutation over items of } S\}$, where for any permutation $\sigma \in \Sigma_S$, $\sigma(i)$ denotes the element at the $i$-th position in $\sigma$, $i \in [|S|]$. We also denote by $\Sigma_S^m$ the set of permutations of any $m$-subset of $S$, for any $m \in [k]$, i.e. $\Sigma_S^m := \bigcup_{S' \subseteq S,\, |S'| = m} \Sigma_{S'}$. $\mathbf{1}(\varphi)$ is generically used to denote an indicator variable that takes the value $1$ if the predicate $\varphi$ is true, and $0$ otherwise. $Pr(A)$ is used to denote the probability of event $A$, in a probability space that is clear from the context.
Definition 1 (Multinomial logit probability model).
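A Multinomial logit (MNL) probability model MNL($n, \theta$), specified by positive parameters $(\theta_1, \ldots, \theta_n)$, is a collection of probability distributions $\{Pr(\cdot \mid S) : S \subseteq [n], S \neq \emptyset\}$, where for each non-empty subset $S \subseteq [n]$, $Pr(i \mid S) = \frac{\theta_i \mathbf{1}(i \in S)}{\sum_{j \in S} \theta_j}$ for all $i \in [n]$. The indices $1, \ldots, n$ are referred to as ‘items’ or ‘arms’.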
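Best-Item: Given an MNL($n, \theta$) instance, we define the Best-Item $a^* \in [n]$ to be the item with highest MNL parameter if such a unique item exists, i.e. $a^* := \arg\max_{i \in [n]} \theta_i$.
Top-$k$ Best-Items: Given any instance of MNL($n, \theta$) we define the Top-$k$ Best-Items $S_{(k)} \subseteq [n]$ to be the set of $k$ distinct items with highest MNL parameters if such a unique set exists, i.e. for any pair of items $i \in S_{(k)}$ and $j \in [n] \setminus S_{(k)}$, $\theta_i > \theta_j$, such that $|S_{(k)}| = k$. So if $k = 1$, $S_{(1)} = \{a^*\}$.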
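As an illustration of Defn. 1 and of the Best-Item definitions, the following is a minimal Python sketch. The function names `mnl_choice_probs` and `top_k_best_items`, the dictionary representation of the parameters, and the example values are hypothetical and not part of the original text.

```python
from typing import Dict, Iterable, List


def mnl_choice_probs(theta: Dict[int, float], subset: Iterable[int]) -> Dict[int, float]:
    """Winner probabilities of each item in `subset` under an MNL model.

    Each item i in the subset wins with probability theta[i] / sum_{j in subset} theta[j].
    """
    items = list(subset)
    if not items:
        raise ValueError("subset must be non-empty")
    total = sum(theta[i] for i in items)
    return {i: theta[i] / total for i in items}


def top_k_best_items(theta: Dict[int, float], k: int) -> List[int]:
    """The k items with the largest MNL parameters (assumed to be a unique set)."""
    return sorted(theta, key=theta.get, reverse=True)[:k]


# Hypothetical instance with n = 4 items.
theta = {1: 1.0, 2: 0.5, 3: 0.25, 4: 0.25}
probs = mnl_choice_probs(theta, [1, 2, 3])  # distribution over the played subset
best_item = top_k_best_items(theta, 1)[0]   # the Best-Item
```

Because the probabilities depend only on ratios of parameters, rescaling all entries of `theta` by a positive constant leaves `probs` unchanged.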
2.1 Feedback models
An online learning algorithm interacts with a MNL($n, \theta$) probability model over $n$ items (the ‘environment’) as follows. At each round $t$, the algorithm plays a subset $S_t \subseteq [n]$ of (distinct) items, with $|S_t| \le k$, upon which it receives stochastic feedback whose distribution is governed by the probability distribution $Pr(\cdot \mid S_t)$. We specifically consider the following structures for feedback received upon playing a subset $S \subseteq [n]$, $|S| \le k$:
1. Winner Feedback: In this case, the environment returns a single item $I \in S$ drawn independently from probability distribution $Pr(\cdot \mid S)$, i.e., $Pr(I = i \mid S) = \frac{\theta_i}{\sum_{j \in S} \theta_j}$ for all $i \in S$.
2. Top-$m$-ranking Feedback: Here, the environment returns an ordered list of $m$ items sampled without replacement from the MNL($n, \theta$) probability model on $S$. More formally, the environment returns a partial ranking $\sigma \in \Sigma_S^m$, drawn from the probability distribution $Pr(\sigma \mid S) = \prod_{i=1}^{m} \frac{\theta_{\sigma(i)}}{\sum_{j \in S \setminus \sigma(1:i-1)} \theta_j}$. This can also be seen as picking an item $\sigma(1) \in S$ according to Winner Feedback from $S$, then picking $\sigma(2)$ from $S \setminus \{\sigma(1)\}$, and so on, until $m$ items have been picked. To incorporate sets with $|S| < m$, we set $m = \min(|S|, m)$. Clearly this model reduces to Winner Feedback for $m = 1$, and to a full rank ordering of the set $S$ when $m = |S| - 1$.
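The sequential description of the two feedback models can be sketched in Python as below. This is only an illustrative simulation of the environment, writing $m$ for the length of the returned ranking; the function names, the use of `random.Random`, and the example parameters are assumptions made here, not taken from the paper.

```python
import random
from typing import Dict, List, Sequence


def sample_winner(theta: Dict[int, float], subset: Sequence[int], rng: random.Random) -> int:
    """Winner Feedback: draw one item of `subset` with probability proportional to theta."""
    items = list(subset)
    weights = [theta[i] for i in items]
    return rng.choices(items, weights=weights, k=1)[0]


def sample_top_m_ranking(
    theta: Dict[int, float], subset: Sequence[int], m: int, rng: random.Random
) -> List[int]:
    """Top-m ranking: repeatedly draw a winner from the items not yet ranked."""
    remaining = list(subset)
    m = min(m, len(remaining))  # sets smaller than m yield a shorter ranking
    ranking: List[int] = []
    for _ in range(m):
        winner = sample_winner(theta, remaining, rng)
        ranking.append(winner)
        remaining.remove(winner)  # sampling without replacement
    return ranking


# Hypothetical instance and a single round of feedback.
rng = random.Random(0)
theta = {1: 1.0, 2: 0.5, 3: 0.25, 4: 0.25}
ranking = sample_top_m_ranking(theta, subset=[1, 2, 3, 4], m=2, rng=rng)
```

With `m=1` the returned list contains a single item, which is exactly one draw of Winner Feedback.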
2.2 Decisions (Subsets) and Regret
We define two different settings in terms of their decision spaces and associated notions of regret:
Winner-regret: This is motivated by learning to identify the Best-Item $a^*$. At any round $t$, the learner can play sets of size $1, \ldots, k$, but is penalised for playing any item other than $a^*$. Formally, we define the learner’s instantaneous regret at round $t$ as $r_t = \sum_{i \in S_t} \frac{\theta_{a^*} - \theta_i}{|S_t|}$, and its cumulative regret from $T$ rounds as $R_T = \sum_{t=1}^{T} r_t = \sum_{t=1}^{T} \sum_{i \in S_t} \frac{\theta_{a^*} - \theta_i}{|S_t|}$.
The learner aims to play sets $S_t$ to keep the regret as low as possible, i.e., to play only the singleton set $\{a^*\}$ over time, as that is the only set with zero regret. The instantaneous Winner-regret can be interpreted as the shortfall in value of the played set $S_t$ with respect to $\{a^*\}$, where the value of a set is simply the mean parameter value of its items.
Assuming (we can do this without loss of generality since the MNL model is positive scale invariant, see Defn. 1), it is easy to note that for any item (as ). Consequently, the Winner-regret as defined above, can be further bounded above (up to constant factors) as which, for , is the definition of regret in the dueling bandit problem (Yue et al., 2012; Zoghi et al., 2014; Wu and Liu, 2016).
Top-$k$-regret: This setting is motivated by learning to identify the set of Top-$k$ Best-Items $S_{(k)}$ of the MNL($n, \theta$) model. Correspondingly, we assume that the learner can play sets of $k$ distinct items at each round $t$. The instantaneous regret of the learner, in this case, in the $t$-th round is defined to be $r_t = \frac{\theta_{S_{(k)}} - \theta_{S_t}}{k}$, where $\theta_S := \sum_{i \in S} \theta_i$ for any $S \subseteq [n]$. Consequently, the cumulative regret of the learner at the end of round $T$ becomes $R_T = \sum_{t=1}^{T} r_t = \sum_{t=1}^{T} \frac{\theta_{S_{(k)}} - \theta_{S_t}}{k}$. As with the Winner-regret, the Top-$k$-regret also admits a natural interpretation as the shortfall in value of the set $S_t$ with respect to the set $S_{(k)}$, with the value of a set being the mean parameter value of the arms it contains.
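Both regret notions are described above as the shortfall in mean parameter value of the played set relative to the benchmark set. Under that reading, a small Python sketch of the two instantaneous regrets could look as follows; the function names and the example instance are hypothetical.

```python
from typing import Dict, Sequence


def set_value(theta: Dict[int, float], subset: Sequence[int]) -> float:
    """Value of a set: the mean MNL parameter of its items."""
    items = list(subset)
    return sum(theta[i] for i in items) / len(items)


def winner_regret(theta: Dict[int, float], played: Sequence[int]) -> float:
    """Instantaneous Winner-regret: value of the Best-Item minus value of the played set."""
    return max(theta.values()) - set_value(theta, played)


def top_k_regret(theta: Dict[int, float], played: Sequence[int], k: int) -> float:
    """Instantaneous Top-k-regret: value of the k best items minus value of the played set."""
    if len(set(played)) != k:
        raise ValueError("exactly k distinct items must be played")
    top_k = sorted(theta, key=theta.get, reverse=True)[:k]
    return set_value(theta, top_k) - set_value(theta, played)


# Hypothetical instance: item 1 is the Best-Item, {1, 2} the top-2 set.
theta = {1: 1.0, 2: 0.5, 3: 0.25}
r_winner = winner_regret(theta, [1, 2])   # 1.0 - 0.75
r_top2 = top_k_regret(theta, [1, 3], 2)   # 0.75 - 0.625
```

Summing these per-round quantities over the played sets gives the cumulative regret in each setting.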
3 Minimising Winner-regret
This section considers the problem of minimising Winner-regret with the most general Top-$m$-ranking Feedback. We start by deriving a fundamental lower bound on Winner-regret for any reasonable algorithm. The main finding is a regret lower bound that does not exhibit an improvement with larger playable subset sizes $k$ (Thm. 3), and, in fact, one that is order-wise the same, in terms of the total number of arms $n$ and time horizon $T$, as that of the corresponding dueling version ($k = 2$). We next analyse the hardness of the Winner-regret minimisation problem with Top-$m$-ranking Feedback and show a lower bound reduced by a factor of $m$ relative to Winner Feedback (Thm. 6). Following this, Sec. 3.2 presents an algorithm with a matching regret guarantee.
3.1 Lower Bound for Winner-regret
Along the lines of Lai and Robbins (1985), we define the following consistency property of any reasonable online learning algorithm in order to state a fundamental lower bound on regret performance.
Definition 2 (No-regret algorithm).
An online learning algorithm is defined to be a No-regret algorithm if for all problem instances MNL(), the number of times plays (or queries) any suboptimal set is sublinear in , or in other words , for some , being the number of times the set is played by in rounds.
Theorem 3 (Regret Lower Bound: Winner-regret with Winner Feedback).
For any No-regret learning algorithm for Winner-regret with Winner Feedback, there exists a problem instance MNL() such that the expected regret incurred by on it satisfies
where denotes expectation under the algorithm and MNL() model, .
Note: This is a ‘problem-’ or ‘gap’-dependent lower bound: denotes an instance-dependent complexity term (‘gap’) for the regret performance limit.
Proof sketch. The argument is based on the following powerful standard change-of-measure result:
Lemma 4 (Garivier et al. (2016)).
Given any bandit instance , with being the arm set of MAB, and being the set of reward distributions associated to with arm having the highest expected reward, for any suboptimal arm , consider an altered bandit instance with being the (unique) optimal arm (the one with highest expected reward) for , and let and be mutually absolutely continuous for all . At any round , let and denote the arm played and the observation (reward) received, respectively. Let be the sigma algebra generated by the trajectory of a sequential bandit algorithm up to round . Then, for any -measurable random variable with values in it satisfies:
In our case, each bandit instance corresponds to an instance of the MNL() problem with the arm set containing all subsets of $[n]$ of size up to $k$: . The proof relies on carefully crafting a true instance, with optimal arm , and a family of ‘slightly perturbed’ alternative instances , each with optimal arm , which we choose as: and for every suboptimal item , consider the altered problem instance MNL: for some . The result of Thm. 3 is now obtained by applying Lemma 4 on pairs of problem instances , suitably upper bounding the KL-divergence terms in the right hand side of the Lem. 4 inequality by and further lower bounding the left hand side by setting along with the No-regret property of as: Finally, rewriting and combining all the results leads to the desired bound. (The complete proof is given in Appendix B.2.)
Thm. 3 establishes that the regret rate with only Winner Feedback cannot improve with $k$, uniformly across all problem instances. Rather strikingly, there is no reduction in hardness (measured in terms of regret rate) in learning the Best-Item using Winner Feedback from large ($k$-size) subsets as compared to using pairwise (dueling) feedback ($k = 2$). It could be tempting to expect an improved learning rate with subset-wise feedback, as more items are tested per iteration ($k \ge 2$), so information-theoretically one may expect to collect more data about the underlying model per subset query. On the contrary, it turns out that it is intuitively ‘harder’ for a good (i.e., near-optimal) item to prove its competitiveness in just a single winner draw against a large population of its $k - 1$ other competitors, as compared to winning over just a single competitor in the $k = 2$ case. Our result establishes this formally: the advantage of investigating larger $k$-sized sets is nullified by the drawback of having to query any particular set a larger number of times.
Theorem 5 (Alternate version of Thm. 3 with pairwise preference based instance complexities).
For any No-regret algorithm for Winner-regret with Winner Feedback, there exists a problem instance of the MNL() model, such that the expected regret incurred by on it satisfies where , and , are the same as in Thm. 3. Thus the only difference lies in the instance-dependent complexity term (‘gap’), which is now expressed in terms of the pairwise preference of the best item over the second best item: .
Improved regret lower bound with Top-$m$-ranking Feedback. In contrast to the situation with only winner feedback, the following (more general) result shows a reduced lower bound when Top-$m$-ranking Feedback is available in each play of a subset, opening up the possibility of improved learning (regret) performance when ranked-order feedback is available.
Theorem 6 (Regret Lower Bound: Winner-regret with Top-$m$-ranking Feedback).
For any No-regret algorithm for the Winner-regret problem with Top--ranking Feedback, there exists a problem instance MNL() such that the expected Winner-regret incurred by satisfies where as in Thm. 3, denotes expectation under the algorithm and the MNL model MNL(), and recall .
Proof sketch. A crucial fact we establish in the course of the argument is that the KL divergences that appear when analysing the case of Top-$m$-ranking Feedback are $m$ times those for the case of Winner Feedback. We show this by appealing to the chain rule for KL divergences (Cover and Thomas, 2012): , where we abbreviate as and denotes the conditional KL-divergence. Using this, along with the upper bound on the KL divergences for Winner Feedback (derived for Thm. 3), we get that in this case , which precisely gives the $m$-factor reduction in the regret lower bound compared to the Winner Feedback case. The lower bound is now derived following a similar technique as described for Thm. 3.
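For reference, the chain rule invoked in the proof sketch can be stated generically as below. The symbols $P$, $Q$, $X$, $Y$ are generic placeholders introduced here, and the second identity is a sketch of how the rule unrolls over a length-$m$ ranking $\sigma = (\sigma_1, \ldots, \sigma_m)$; it is not a reproduction of the paper's exact display.

```latex
KL\left(P_{X,Y},\, Q_{X,Y}\right)
  = KL\left(P_X,\, Q_X\right)
  + \mathbb{E}_{x \sim P_X}\left[ KL\left(P_{Y \mid X = x},\, Q_{Y \mid X = x}\right) \right],
\qquad
KL\left(P_{\sigma},\, Q_{\sigma}\right)
  = \sum_{i=1}^{m} \mathbb{E}_{\sigma_{1:i-1} \sim P}
    \left[ KL\left(P_{\sigma_i \mid \sigma_{1:i-1}},\, Q_{\sigma_i \mid \sigma_{1:i-1}}\right) \right].
```

Each of the $m$ summands is the divergence between two winner distributions on the items not yet ranked, which is why a Winner Feedback bound applied term by term yields an $m$-fold total.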
Thm. 6 shows a lower bound on regret, containing the instance-dependent constant term which exposes the hardness of the regret minimisation problem in terms of the ‘gap’ between the best and the second best item : . The $m$-factor improvement in learning rate with Top-$m$-ranking Feedback can be intuitively interpreted as follows: revealing an $m$-ranking of a $k$-set is worth about $\ln\left(\frac{k!}{(k-m)!}\right) = O(m \ln k)$ bits of information, which is about $m$ times as large compared to revealing a single winner.
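The counting intuition behind this interpretation can be checked numerically. The sketch below compares the log-number of possible top-$m$ rankings of a $k$-set, $k!/(k-m)!$, with the log-number of possible single winners, $k$. The function name and the values of $k$ and $m$ are hypothetical, and this is only a counting heuristic, not the argument used in the proof.

```python
import math


def log_num_top_m_rankings(k: int, m: int) -> float:
    """Natural log of the number of distinct top-m rankings of a k-set, k!/(k-m)!."""
    return math.log(math.perm(k, m))


# Hypothetical sizes: subsets of k = 10 items, rankings of length m = 3.
k, m = 10, 3
ratio = log_num_top_m_rankings(k, m) / math.log(k)
print(f"top-{m} ranking carries about {ratio:.2f}x the log-count of a single winner")
```

Since $k!/(k-m)! \le k^m$, the ratio never exceeds $m$, and it approaches $m$ as $k$ grows with $m$ fixed.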
The next section shows that these fundamental lower bounds on Winner-regret are, in fact, achievable with carefully designed online learning algorithms.
3.2 An order-optimal algorithm for Winner-regret
We now give an upper-confidence bound (UCB)-based algorithm for Winner-regret under the Top-$m$-ranking Feedback model, based on the following key design ideas:
Playing sets of only two sizes: We show that it is enough for the algorithm to play subsets of size either $m + 1$ (to fully exploit the Top-$m$-ranking Feedback) or $1$ (singleton sets), and not play a singleton unless there is a high degree of confidence about the single item being the best item (since playing a singleton does not lead to any feedback information).
Parameter estimation from pairwise preferences: We show that it is possible to play the subset-wise game just by maintaining pairwise preference estimates of all items of the MNL() model using the idea of Rank-Breaking, i.e., extracting pairwise comparisons from (partial) rankings and applying estimators on the obtained pairs, treating each comparison independently (see Defn. 13), over the received subset-wise feedback. This is possible owing to the independence of irrelevant alternatives (IIA) property of the MNL model (Defn. 12), or more precisely Lem. 14 (Appendix A). This idea of playing the subsetwise game with only pairwise estimates helps sidestep the combinatorial nature of the underlying problem of maintaining estimates of up to possible ranking probabilities, which is what the learner observes as Top-$m$-ranking Feedback from items.
A new UCB-based set building rule for playing large sets (build_S): The main novelty of MaxMin-UCB lies in its underlying set building subroutine (see Alg. 2), that constructs by applying a recursive maximin strategy on the UCB estimates of empirical pairwise preferences.
Algorithm description. MaxMin-UCB maintains an $n \times n$ pairwise preference matrix , whose -th entry records the empirical probability of having beaten in a pairwise duel, and a corresponding upper confidence bound for each pair . At any round , it plays a subset using the Max-Min set building rule build_S (see Alg. 2), receives Top-$m$-ranking Feedback from , and updates the entries of pairs in by applying Rank-Breaking (Line ). The set building rule build_S is at the heart of MaxMin-UCB: it builds the subset from a set of potential Condorcet winners () of round by recursively picking the strongest opponents of the already selected items using a maximin selection strategy on .
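The Rank-Breaking update and the pairwise upper confidence bounds described above can be sketched as follows. This is not the paper's Alg. 1 or build_S: the function names, the `Counter` of pairwise wins, the RUCB-style index (empirical win rate plus $\sqrt{\alpha \ln t / n_{ij}}$), the default `alpha`, and the conventions for unseen pairs are all assumptions made here for illustration.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Pair = Tuple[int, int]


def rank_break(subset: Sequence[int], ranking: Sequence[int]) -> List[Pair]:
    """Extract (winner, loser) pairs from a top-m ranking of the played subset.

    Each ranked item beats every item of the subset that is ranked below it or unranked.
    """
    if len(set(ranking)) != len(ranking) or not set(ranking) <= set(subset):
        raise ValueError("ranking must list distinct items of the played subset")
    pairs: List[Pair] = []
    for pos, winner in enumerate(ranking):
        ranked_above = set(ranking[:pos])
        pairs.extend((winner, j) for j in subset if j != winner and j not in ranked_above)
    return pairs


def pairwise_ucb(wins: Counter, i: int, j: int, t: int, alpha: float = 0.51) -> float:
    """Assumed UCB on the probability that i beats j, after t >= 1 rounds."""
    if i == j:
        return 0.5
    n_ij = wins[(i, j)] + wins[(j, i)]
    if n_ij == 0:
        return 1.0  # optimistic value for a pair never compared (an assumption)
    return wins[(i, j)] / n_ij + math.sqrt(alpha * math.log(t) / n_ij)


# One hypothetical round: subset {1, 2, 3, 4} is played and the top-2 ranking (3, 1) is observed.
wins: Counter = Counter()
wins.update(rank_break(subset=[1, 2, 3, 4], ranking=[3, 1]))
```

Only pairwise win counts are stored, so the memory footprint is quadratic in the number of items rather than growing with the number of possible rankings.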
The following result establishes that MaxMin-UCB enjoys regret with high probability.
Theorem 7 (MaxMin-UCB: High Probability Regret bound).
Fix a time horizon and , . With probability at least , the regret of MaxMin-UCB for Winner-regret with Top--ranking Feedback satisfies where , , , , , .
Proof sketch. The proof hinges on analysing the entire run of MaxMin-UCB by breaking it up into phases: (1). Random-Exploration (2). Progress, and (3). Saturation.
Random-Exploration: This phase runs from time to , for any , such that for any , the upper confidence bounds are guaranteed to be correct for the true values for all pairs (i.e. ), with high probability . The formal claim is given in Lem. 15. Its proof is adapted from a similar analysis used by Zoghi et al. (2014), which can be applied to our algorithm because the Rank-Breaking update of the pairwise statistics exploits the IIA property of the MNL model and Lem. 14.
Progress: After , the algorithm can be viewed as starting to explore the ‘confusing items’, appearing in , as potential candidates for the Best-Item , and trying to capture in the holding set . Note that at any time, the set is either empty or a singleton by construction, and once the Best-Item is captured it stays there forever (with high probability) due to Lem. 15. The Progress phase just ensures that the algorithm explores fast enough so that within a constant number of rounds (independent of ), captures , and henceforth for all (Lem. 20).
Saturation: This is the last phase from time to . As the name suggests, MaxMin-UCB shows relatively stable behaviour in this phase, mostly playing and incurring almost no regret. The only times that it plays an extra set of suboptimal elements along with are when these elements beat with sufficiently high confidence in terms of . But the number of such suboptimal rounds of play is limited, to precisely per item (Lem. 22), which finally yields the desired regret rate of MaxMin-UCB.
The complete proof is given in Appendix B.5.
Although Thm. 7 shows a $(1 - \delta)$-high probability regret bound for MaxMin-UCB, it is important to note that the algorithm itself does not require the probability of failure $\delta$ as input. As a consequence, by simply integrating the bound obtained in Thm. 7 over the entire range of $\delta \in [0, 1]$, we get an expected regret bound of MaxMin-UCB for Winner-regret with Top-$m$-ranking Feedback:
Theorem 8 (MaxMin-UCB: Expected Regret bound).
The expected regret incurred by MaxMin-UCB for Winner-regret with Top--ranking Feedback is: , in rounds.
This is an upper bound on expected regret of the same order as that in the lower bound of Thm. 3, which shows that the algorithm is essentially regret-optimal. From Thm. 8, note that the first two terms of are essentially instance-specific constants; it is only the third term which makes the expected regret , which is in fact optimal in terms of its dependencies on and (since it matches the lower bound of Thm. 6). Moreover, the problem-dependent complexity terms , also bring out the inverse dependency on the ‘gap-term’ as discussed in Rem. 3.
4 Minimising Top-$k$-regret
In this section, we study the problem of minimising Top-$k$-regret with Top-$k$-ranking Feedback. As before, we first derive a regret lower bound, for this learning setting, of the form $\Omega\left(\frac{n}{k \Delta_{(k)}} \ln T\right)$, with $\Delta_{(k)}$ being a problem-dependent complexity term that measures the ‘gap’ between the $k$-th and $(k+1)$-th best items. We next propose a UCB-based algorithm (Alg. 3) for the same, along with a regret analysis giving a matching upper bound (Thms. 10, 11), which proves optimality of our proposed algorithm.
4.1 Regret lower bound for Top-$k$-regret with Top-$k$-ranking Feedback
For the analysis in this section, we assume that the underlying MNL() model is such that , and denote .
Theorem 9 (Regret Lower Bound: Top-$k$-regret with Top-$k$-ranking Feedback).
For any No-regret algorithm , there exists a problem instance MNL() such that the expected regret incurred by for Top--regret with Top--ranking Feedback on MNL() is at least where denotes expectation under the algorithm and MNL() model.
Proof sketch. Similar to Thm. 6, the proof again relies on carefully constructing a true instance, with optimal set of Top-$k$ Best-Items , and a family of slightly perturbed alternative instances , for each suboptimal arm , which we design as: for some and . Clearly the Top-$k$ Best-Items of MNL is , and for every suboptimal item , we consider the altered instance MNL: The result of Thm. 9 can now be obtained by following exactly the same procedure as described for the proof of Thm. 6. The complete details are given in Appendix C.1.
The regret lower bound of Thm. 9 is $\Omega\left(\frac{n}{k \Delta_{(k)}} \ln T\right)$, with an instance-dependent term $\Delta_{(k)}$ which shows that, for recovering the Top-$k$ Best-Items, the problem complexity is governed by the ‘gap’ between the $k$-th and $(k+1)$-th best items, consistent with intuition.
4.2 An order-optimal algorithm with low Top-$k$-regret for Top-$k$-ranking Feedback
In this section, we present an online learning algorithm for playing subsets with low Top-$k$-regret under Top-$k$-ranking Feedback.
Main idea: A recursive set-building rule: As with the MaxMin-UCB algorithm (Alg. 1), we maintain pairwise UCB estimates () of empirical pairwise preferences via Rank-Breaking. However the chief difference here lies in the set building rule, as here it is required to play sets of size exactly . The core idea here is to recursively try to capture the set of Top- Best-Items in an ordered set , and, once the set is assumed to be found with confidence (formally ), to keep playing unless some other potential good item emerges, which is then played replacing the weakest element of . The algorithm is described in Alg. 3.
Theorem 10 (Rec-MaxMin-UCB: High Probability Regret bound).
Given a fixed time horizon and , with high probability , the regret incurred by Rec-MaxMin-UCB for Top--regret admits the bound where is an instance dependent constant (see Lem. 26, Appendix), , and