Sequential decision-making problems with side information, in the form of features or attributes, have been popular in machine learning as contextual bandits Filippi et al. (2010); Chu et al. (2011); Li et al. (2017). A contextual bandit learner, at each round, observes a context before taking an action based on it. The resulting payoff is typically assumed to depend on the context and the action taken according to an unknown map, and the learner aims to play the best possible action for the current context at each time, and thus minimize its regret with respect to an oracle that knows the payoff function.
In many learning settings, however, it is more common to be able to only relatively compare actions, in a decision step, instead of being able to gauge their absolute utilities, e.g., information retrieval, search engine ranking, tournament ranking, etc. Hajek et al. (2014); Khetan and Oh (2016). Dueling bandits Komiyama et al. (2015); Ailon et al. (2014) explicitly model this relative preference information structure, often in the setting of finite action spaces and unstructured action utilities, and have seen great interest in the recent past. However, the more general, and pertinent, problem of online learning in structured, contextual bandits with large decision spaces and relative feedback information structures has largely remained unexplored.
This paper considers a natural and structured contextual dueling bandit setting, comprised of items that have intrinsic (absolute) scores depending on their features in an unknown way, e.g., linear with unknown weights. When a learner plays (compares) two items together, the result is a ‘winner’ of the pair of items with a probability distribution governed by a transformation of both items’ scores (we use here the sigmoid function of the score difference). We are primarily interested in the development of adaptive item pair-selection algorithms for which guarantees can be given with respect to a suitably defined measure of dueling regret. In this regard, our contributions are as follows.
To the best of our knowledge, we are the first to consider the problem of regret minimization for contextual dueling bandits for potentially infinitely large decision spaces. Some recent works González et al. (2017); Sui et al. (2017b) consider this problem, but their algorithms come with no finite-time regret bounds, and their optimality is not validated theoretically.
We propose two algorithms for the problem. Our first algorithm, Maximum-Informative-Pair (Alg 1), is based on the idea of selecting the most uncertain-looking pair from the set of promising candidates for the top item. We rigorously show an regret bound for this algorithm, which is seen to be off by a factor from an information-theoretic fundamental limit on performance (Thm. 3), despite performing well empirically.
Our second algorithm, Stagewise-Adaptive-Duel (Alg. 2), is developed on the idea of tracking, in a phased fashion, the best arm of the context set, which ensures a sharper concentration rate of the pairwise scores. This results in an optimal regret guarantee (the notation hides logarithmic dependencies), improving upon the regret bound of the previous algorithm by a factor.
Our theoretical results are supported by suitably designed, extensive empirical evaluations. Related work (Appendix A) and all the detailed proofs are deferred to the Appendix.
2 Preliminaries and Problem Formulation
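Notations. For any positive integer $n$, we denote by $[n]$ the set $\{1, 2, \ldots, n\}$. $\mathbf{1}(\varphi)$ is generically used to denote an indicator variable that takes the value $1$ if the predicate $\varphi$ is true, and $0$ otherwise. The decision space is denoted by $\mathcal{D} \subseteq \mathbb{R}^d$, where $d \in \mathbb{N}_+$. We use $\mathbf{1}_d$ to denote a $d$-dimensional vector of all $1$'s. For any matrix $M \in \mathbb{R}^{d \times d}$, we denote respectively by $\lambda_{\max}(M)$ and $\lambda_{\min}(M)$ the maximum and minimum eigenvalues of $M$. For any $x \in \mathbb{R}^d$, $\|x\|_M := \sqrt{x^\top M x}$ denotes the weighted $\ell_2$-norm associated with the matrix $M$ (assuming $M$ is positive-definite).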
2.1 Problem Setup
We consider the stochastic $K$-armed contextual dueling bandit problem for $T$ rounds, where at each round $t \in [T]$ the learner is presented with a context set $\mathcal{S}_t \subseteq \mathcal{D}$ of size $K$, drawn IID from some $d$-dimensional decision space $\mathcal{D} \subseteq \mathbb{R}^d$ (according to some unknown distribution on $\mathcal{D}$), and is required to play two arms $x_t, y_t \in \mathcal{S}_t$. The environment then provides a stochastic preference feedback $o_t$ indicating the better arm of the played pair $(x_t, y_t)$: for any $x, y \in \mathcal{D}$, the probability that $x$ is preferred over $y$, denoted by $Pr(x \succ y)$, is given by $\sigma(g(x, y))$, where $g: \mathcal{D} \times \mathcal{D} \to \mathbb{R}$ is a utility score function on decision pairs and $\sigma$ is the sigmoid transformation, i.e. $\sigma(z) = 1/(1 + e^{-z})$ for any $z \in \mathbb{R}$. One intuitive choice for the pairwise utility is the score difference $g(x, y) = f(x) - f(y)$, where $f: \mathcal{D} \to \mathbb{R}$ is a utility score function on each point of the decision space $\mathcal{D}$.
Analysis with linear scores. In this paper, we assume that $f(x) = \theta^{*\top} x$, where $\theta^* \in \mathbb{R}^d$ is some unknown fixed vector of bounded norm. This implies that for any pair $(x, y)$, we have $g(x, y) = \theta^{*\top}(x - y)$.
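The feedback model above can be illustrated with a minimal Python sketch, assuming linear scores and the sigmoid link as described in the text. The weight vector, the two arms, and the helper names (`preference_prob`, `duel`) are hypothetical choices made for illustration only.

```python
import math
import random

def sigmoid(z):
    """Logistic link: maps a score difference to a win probability."""
    return 1.0 / (1.0 + math.exp(-z))

def linear_score(theta, x):
    """Linear utility of arm x under the weight vector theta."""
    return sum(t * xi for t, xi in zip(theta, x))

def preference_prob(theta, x, y):
    """Probability that arm x beats arm y: sigmoid of the score difference."""
    return sigmoid(linear_score(theta, x) - linear_score(theta, y))

def duel(theta, x, y, rng):
    """Sample one binary outcome: 1 if x wins the duel, 0 otherwise."""
    return 1 if rng.random() < preference_prob(theta, x, y) else 0

theta = [0.6, -0.8]            # hypothetical "unknown" weights
x, y = [1.0, 0.0], [0.0, 1.0]  # two hypothetical arms
p_xy = preference_prob(theta, x, y)
p_yx = preference_prob(theta, y, x)
outcome = duel(theta, x, y, random.Random(0))
```

Since the link is applied to a score difference, the two win probabilities of any pair sum to one, and an arm dueled against itself wins with probability one half.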
Objective: Regret Minimization. We denote by the best arm (with highest score) of round . Then the goal of the learner is to minimize the -round cumulative regret with respect to the best arm of each round , such that the instantaneous regret of playing an arm-pair is measured in terms of the average score of the played duel with respect to that of the best arm , defined as:
The above notion of learner’s regret is motivated by the definition of the classical $K$-armed dueling bandit regret introduced by Yue et al. (2012), which was later adopted by the dueling bandit literature Zoghi et al. (2013); Komiyama et al. (2015); Ailon et al. (2014); Wu and Liu (2016); Zoghi et al. (2015); Sui et al. (2017b); Saha and Gopalan (2018a). Here the context set at any round is assumed to be a fixed set of arms , and at each round the instantaneous regret incurred by the learner for playing an arm-pair is given by , being the ‘best arm’ in hindsight (e.g. the Condorcet winner Zoghi et al. (2013) or the Copeland winner Komiyama et al. (2015); Urvoy et al. (2013)), depending on the underlying preference matrix .
Remark 1 (Equivalence with Dueling Bandit Regret).
It is easy to note that assuming the context set to be fixed and denoting , our regret definition (Eqn. (1)) is equivalent to dueling bandit regret (up to constant factors), as in our case the pairwise advantage of the best arm w.r.t and at the same time . Combining the above claims one can obtain (see Appendix B.1 for the derivation).
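As a small illustration of the regret notion used here, the sketch below computes the instantaneous regret of a played pair as the gap between the best arm's score and the average score of the two played arms, which is how the text describes it. The function names and the score values are hypothetical.

```python
def instantaneous_regret(scores, i, j):
    """Gap between the best score and the average score of the played pair."""
    best = max(scores)
    return best - 0.5 * (scores[i] + scores[j])

def cumulative_regret(rounds):
    """Sum of instantaneous regrets over (scores, i, j) triples, one per round."""
    return sum(instantaneous_regret(s, i, j) for s, i, j in rounds)

scores = [0.9, 0.5, 0.1]  # hypothetical arm scores for one context set
r_best = instantaneous_regret(scores, 0, 0)   # best arm in both slots
r_mixed = instantaneous_regret(scores, 0, 2)  # best arm against the worst
total = cumulative_regret([(scores, 0, 0), (scores, 0, 2), (scores, 1, 2)])
```

Under this definition the regret of a round vanishes only when both played arms attain the best score, so a learner cannot avoid regret by pairing the best arm with a weak challenger.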
3 Proposed Algorithms and Regret Analysis
In this section we present two algorithms for our regret objective defined in Eqn. (1).
3.1 Connection to GLM bandits
We start by observing the relation of our preference feedback model to that of generalized linear model (GLM) based bandits Filippi et al. (2010); Li et al. (2017), precisely in terms of the feedback mechanism. The setup of GLM bandits generalizes the stochastic linear bandits problem Dani et al. (2008); Abbasi-Yadkori et al. (2011), where at each round the learner is supposed to play a decision point from a fixed decision set , upon which a noisy reward feedback is revealed by the environment such that where is some unknown fixed direction, is a fixed strictly increasing link function, and is a zero mean sub-Gaussian noise for some universal constant , i.e. and (here denotes the sigma algebra generated by the history till time ).
The important connection now to make is that our structured dueling bandit feedback can be modeled as a GLM feedback model on the decision space of pairwise differences , since in this case the feedback received by the learner upon playing a duel can be seen as: where is a -mean -measurable random binary noise such that
where we denote , and it is easy to verify that is sub-Gaussian. Thus our dueling based preference feedback model can be seen as a special case of GLM bandit feedback on the decision space where the link function in our case is the sigmoid .
We use this connection for estimating the unknown parameter $\theta^*$ with high confidence (the estimate at round $t$ is denoted by $\hat{\theta}_t$), using maximum likelihood estimation on the observed pairwise preferences up to time $t$, following the same technique suggested by Filippi et al. (2010); Li et al. (2017).
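The estimation step can be sketched as logistic maximum likelihood on the pairwise difference vectors. The following is a minimal illustration using Newton's method in NumPy, not the exact procedure of the cited works: the function name `fit_theta_mle`, the tiny ridge term (added here only for numerical stability), and the synthetic data are all assumptions.

```python
import numpy as np

def fit_theta_mle(Z, o, n_iter=50, ridge=1e-6):
    """Newton's method for the logistic log-likelihood on difference vectors.

    Z: (n, d) array whose rows are x_t - y_t for each played duel.
    o: (n,) array of binary outcomes (1 if x_t won the duel).
    ridge: tiny regularizer, an assumption added for numerical stability.
    """
    n, d = Z.shape
    theta = np.zeros(d)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Z @ theta))           # predicted win probabilities
        grad = Z.T @ (o - p) - ridge * theta           # gradient of the log-likelihood
        hess = (Z * (p * (1.0 - p))[:, None]).T @ Z + ridge * np.eye(d)
        step = np.linalg.solve(hess, grad)             # Newton direction
        theta = theta + step
        if np.linalg.norm(step) < 1e-10:
            break
    return theta

# Synthetic check with a hypothetical ground-truth parameter.
rng = np.random.default_rng(0)
theta_star = np.array([0.5, -0.3, 0.2])
Z = rng.normal(size=(5000, 3)) - rng.normal(size=(5000, 3))
o = (rng.random(5000) < 1.0 / (1.0 + np.exp(-Z @ theta_star))).astype(float)
theta_hat = fit_theta_mle(Z, o)
```

Only the difference vectors and the binary outcomes enter the likelihood, which is what makes the dueling feedback a GLM on the space of pairwise differences.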
Having established the connection of our dueling feedback model to that of GLM bandits, we only use this to estimate the unknown parameter efficiently. At the same time our regret objective (Eqn. (1)) is very different from that of GLM bandits, and thus we need very different algorithm design techniques (i.e. arm-selection rules) for achieving optimal regret bounds. Towards this we propose the following two algorithms (Sec. 3.2, 3.3) and also establish their optimality guarantees (see Thm. 3 and 6).
3.2 Algorithm-1: Maximum-Informative-Pair
Our first algorithm is computationally more efficient and is shown to achieve an regret (Thm. 3); this is, however, slightly suboptimal by a factor of , as our lower bound analysis shows (Thm. 11, Sec. 4).
Main Idea: At any time , the algorithm simply maintains a UCB estimate on the pairwise scores for any pair of arms , where . It then collects the set of promising arms in the context set , namely those that beat the rest of the arms in terms of the optimistic pairwise score , and plays the pair , which has the highest pairwise score variance (i.e. which appears to be the most uncertain pair in ). The algorithm is described in Alg. 1.
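The selection rule described above can be sketched as follows. This is an illustrative reading of the main idea, not a transcription of Alg. 1: the confidence scale `eta`, the use of the weighted norm of the difference vector as the uncertainty width, and the zero threshold on the optimistic score gap are assumptions, and `select_pair` is a hypothetical name.

```python
import numpy as np

def select_pair(arms, theta_hat, V, eta):
    """One arm-pair selection step in the spirit of Maximum-Informative-Pair.

    arms: (K, d) context set; theta_hat: (d,) current estimate;
    V: (d, d) design matrix of past difference vectors;
    eta: confidence scale (a tuning parameter here, not the paper's choice).
    """
    K = arms.shape[0]
    V_inv = np.linalg.inv(V)
    diff = arms[:, None, :] - arms[None, :, :]             # diff[i, j] = x_i - x_j
    est = diff @ theta_hat                                 # estimated score gaps
    quad = np.einsum("ijd,de,ije->ij", diff, V_inv, diff)
    width = np.sqrt(np.maximum(quad, 0.0))                 # uncertainty of each pair
    ucb = est + eta * width                                # optimistic score gaps
    # Promising arms: optimistically not beaten by any other arm.
    promising = [i for i in range(K) if np.all(ucb[i] >= -1e-12)]
    pairs = [(i, j) for i in promising for j in promising if i < j]
    if not pairs:                                          # a single promising arm
        return (promising[0], promising[0]), promising
    # Among promising arms, play the most uncertain pair.
    best_pair = max(pairs, key=lambda ij: width[ij[0], ij[1]])
    return best_pair, promising

arms = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])    # hypothetical context set
theta_hat = np.array([1.0, 0.5])
pair, promising = select_pair(arms, theta_hat, 2.0 * np.eye(2), eta=1.0)
```

In this toy instance the third arm is beaten by the first even optimistically, so it is excluded, and the remaining two arms form the played pair.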
We next prove its regret guarantee (Thm. 3) based on the following concentration lemmas.
Lemma 1 (Self-Normalized Bound).
Suppose be a sequence of arm-pair played such that all arms belong to the ball of unit radius. Also suppose the initial exploration length be such that . Then ,
where recall .
Lemma 2 (Confidence Ellipsoid).
Suppose the initial exploration length be such that , and is as defined in Thm. 3. Then for any , with probability at least , for all ,
where recall .
Theorem 3 (Regret bound of Maximum-Informative-Pair (Alg. 1)).
Let , where is the minimum slope of the estimated sigmoid when is sufficiently close to ( being the first order derivative of the sigmoid function ). Then given any , with probability at least , the round cumulative regret of Maximum-Informative-Pair satisfies:
where we choose , (for some universal problem independent constants ).
(sketch) Our choice of ensures that with probability at least , is full rank, or more precisely
owing to some standard results from random matrix theory Vershynin (2010) (see Lem. 12, Appendix C for the formal statement). We next use the existing results from the GLM literature to derive the two key concentration lemmas (Lem. 1 and 2), which hold owing to the connection of our structured dueling bandits problem setup to that of GLM bandits Li et al. (2017) (see Sec. 3.1). The rest of the proof lies in expressing the regret bound in terms of the above concentration results, which is possible owing to our ‘most informative pair’ based arm selection strategy. The complete proof is given in Appendix C.1. ∎
3.3 Algorithm-2: Stagewise-Adaptive-Duel (StaD)
Our second algorithm runs with a provable optimal regret bound of , except with an additional factor. So as long as , the algorithm indeed yields an optimal regret guarantee.
Main Idea. This algorithm is built on the idea of sequentially examining the arms over stages and eliminating the weakly performing pairs by confidence-bounding the pairwise scores of the dueling arms. We term this algorithm Stagewise-Adaptive-Duel (Alg. 2). It borrows some ideas from Auer (2002); Chu et al. (2011); Li et al. (2017); however, due to the preferential nature of the feedback model, our strategy of maintaining the stagewise ‘good-performing’ arms, and consequently of selecting the arm-pair at any round, has to be very different and carefully designed.
More precisely, each round of this algorithm proceeds in multiple stages where we gradually try to track the set of ‘promising arms’ . Towards this, at each and stage , we first maintain a confidence interval on the pairwise scores of each index pair (owing to the dueling nature of the problem). If at any stage , the confidence score of any arm-pair is not estimated to sufficient accuracy, we examine (play) that pair and include it in the set of ‘informative pairs’ of stage , to be utilized in the following rounds (see line -). In the initial rounds the algorithm mostly hits this case and keeps exploring different arm-pairs; this may contribute to the learner’s regret, but it is an unavoidable cost we need to pay towards identifying the optimal arms in the later rounds.
Otherwise, we sequentially try eliminating the ‘weakly-performing’ arms, i.e. those which get defeated by some other arm even in terms of their optimistic pairwise score (see line -), and proceed to the next stage to examine the remaining item pairs under a stricter confidence interval. Finally, if the pairwise scores of every index pair in the set of ‘promising arms’ have been almost accurately estimated, we pick the first arm as the one which has the maximum estimated score, followed by choosing its strongest challenger which beats with the highest pairwise score (in an optimistic sense), play and proceed to the next round (see line -). The intuition is that once we have explored sufficiently, the algorithm reaches this last case more and more often, and consequently ends up playing only ‘good arm-pairs’ as desired. The complete description of the algorithm is given in Alg. 2.
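The three cases described above can be sketched for a single round as follows. This is a simplified illustration and not a transcription of Alg. 2: the stage thresholds $2^{-s}$ and $1/\sqrt{T}$ are assumptions borrowed from the phased-elimination style of Auer (2002) and Chu et al. (2011), and `stad_round`, `stage_stats` and `eta` are hypothetical names.

```python
import numpy as np

def stad_round(arms, stage_stats, T, eta=1.0):
    """Choose the pair to play in one round by scanning stages s = 1, 2, ...

    stage_stats[s] = (theta_s, V_inv_s) is a hypothetical container holding the
    estimate and inverse design matrix built from the pairs stored at stage s.
    Returns (i, j, s): s is the stage that should store the pair, or None when
    the pair is an exploitation play.
    """
    n_stages = max(1, int(np.ceil(np.log2(T))))
    alive = list(range(arms.shape[0]))
    diff = arms[:, None, :] - arms[None, :, :]            # diff[i, j] = x_i - x_j
    for s in range(1, n_stages + 1):
        theta_s, V_inv_s = stage_stats[s]
        est = diff @ theta_s                              # estimated score gaps
        quad = np.einsum("ijd,de,ije->ij", diff, V_inv_s, diff)
        width = eta * np.sqrt(np.maximum(quad, 0.0))      # confidence widths
        pairs = [(i, j) for i in alive for j in alive if i < j]
        # Case 1: every surviving pair is accurately estimated, so exploit.
        if all(width[i, j] <= 1.0 / np.sqrt(T) for i, j in pairs):
            first = max(alive, key=lambda i: arms[i] @ theta_s)
            challenger = max(alive, key=lambda j: est[j, first] + width[j, first])
            return first, challenger, None
        # Case 2: some pair is too uncertain for this stage, so explore it.
        for i, j in pairs:
            if width[i, j] > 2.0 ** (-s):
                return i, j, s
        # Case 3: drop arms that lose to a surviving arm even optimistically.
        alive = [i for i in alive
                 if all(est[i, j] + width[i, j] >= -1e-12 for j in alive)]
    first = max(alive, key=lambda i: arms[i] @ theta_s)
    return first, first, None

arms = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])   # hypothetical context set
theta = np.array([1.0, 0.5])
explored = {s: (theta, 0.01 * np.eye(2)) for s in range(1, 8)}
fresh = {s: (theta, np.eye(2)) for s in range(1, 8)}
play_explored = stad_round(arms, explored, T=100)
play_fresh = stad_round(arms, fresh, T=100)
```

With little data (`fresh`) the round is spent exploring an uncertain pair at the first stage, whereas with well-conditioned estimates (`explored`) the weak arms are eliminated and the best arm is exploited.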
Thm. 6 proves the optimal (see Thm. 11 for the lower bound analysis). Assuming to be constant this leads to optimal rate, or note even if Stagewise-Adaptive-Duel improves over the regret guarantee of our earlier algorithm Maximum-Informative-Pair. It is worth pointing out that the near-optimal regret analysis of Stagewise-Adaptive-Duel crucially relies on the stronger concentration guarantees of the pairwise scores (as shown in Lem. 5), which is possible with this algorithm due to its novel strategy of maintaining independent ‘stagewise informative samples’ . Achieving this independence criterion (see Lem. 4) is crucial for deriving a faster concentration rate, as pioneered in a few earlier works Auer et al. (2002); Chu et al. (2011) for the classical multi-armed bandit setup.
We now proceed to analyse the regret guarantee of Stagewise-Adaptive-Duel. Towards this we first make some key observations as described below:
Lemma 4 (Stagewise Sample Independence).
At any time , at any stage , and given a fixed realization of the played arm-pairs , the corresponding preference outcomes are independent random variables, each with mean given by the sigmoid of the score difference of the corresponding pair.
Owing to Lem. 4, one can derive the following sharper concentration bounds on the pairwise-arm scores:
Lemma 5 (Sharper Concentration of Pairwise Scores).
Consider any , and suppose we set the parameters of Stagewise-Adaptive-Duel (Alg. 2) as , where , and , where and (for some universal problem independent constants ). Then with probability at least , for all stages at all rounds and for all index pairs of round : .
Theorem 6 (Regret bound of Stagewise-Adaptive-Duel (Alg. 2)).
Consider we set , and as per Lem. 5. Then for any , with probability at least , the round cumulative regret of Stagewise-Adaptive-Duel is upper bounded as:
(sketch) Suppose we denote by the set of all good time intervals where all the index pairs are estimated within the confidence accuracy . The proof crucially relies on the concentration bound of Lem. 5, from which we first derive the following important result.
For any , suppose the pair is chosen at stage , and denotes the index of the best action of round , i.e. . Then with probability at least , for all : and for both , , for any .
And owing to Lem. 1 and the construction of our ‘stagewise-good item pairs’ we can also show:
Assume any . Then at any stage at round , with probability at least , .
where recall that . We consider the trivial bound of for the initial rounds. Note that here the inequality (a) follows from Lem. 7, (b) from Lem. 8 and since . Inequality (c) uses Cauchy-Schwarz along with the fact that . Finally the order of the regret bound follows by considering our particular choice of and rearranging the terms. ∎
4 Matching Lower Bound
In this section, we prove a fundamental performance limit of our contextual bandit problem by reducing an instance of linear bandits problem to the former, and consequently prove a regret lower bound of for our problem.
More precisely, let us denote any instance of our linear-score based -armed contextual dueling bandit problem (see Sec. 2.1) with problem parameter for iterations as . On the other hand define any instance of -armed contextual linear bandit problem Chu et al. (2011) with problem parameter for iterations as : Recall in this setup, at each iteration the learner is provided with a context set of size (such that for all , ), upon which the learner is supposed to choose any arm , and the environment provides a stochastic reward feedback , where is a zero mean random noise such that . The learner’s objective is to minimize the regret with respect to the best (expected highest-scored) action, , of each round , defined as:
Towards proving a lower bound for , we first show that under Gumbel noise Azari et al. (2012); Soufiani et al. (2013), any instance of contextual linear bandits can be reduced to an instance of as shown below:
Lemma 9 (Reducing with Gumbel noise to ).
There exists a reduction from the problem (under Gumbel noise, i.e. ) to which preserves the expected regret.
(sketch) Suppose we have a blackbox algorithm for the instance of problem, say . To prove the claim, our goal is to show that this can be used to solve the problem where the underlying stochastic noise, at round , is generated from a Gumbel distribution Tomczak (2016a); Azari et al. (2012): Precisely we can construct an algorithm for (say ) using :
If runs on a problem instance with Gumbel noise, then the internal world of the underlying blackbox runs on a problem instance of .
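The reduction rests on a standard fact: if two rewards are perturbed by independent standard Gumbel noise, the probability that the first exceeds the second is the sigmoid of the difference of their means, because the difference of two independent standard Gumbel variables follows a logistic distribution. The sketch below checks this by simulation; the means and sample size are hypothetical choices.

```python
import math
import random

def gumbel(rng):
    """Draw a standard Gumbel(0, 1) sample by inverse-CDF sampling."""
    u = rng.random()
    while u == 0.0:          # avoid log(0)
        u = rng.random()
    return -math.log(-math.log(u))

def empirical_win_rate(mu_x, mu_y, n, rng):
    """Fraction of trials in which the noisy reward of x exceeds that of y."""
    wins = sum(mu_x + gumbel(rng) > mu_y + gumbel(rng) for _ in range(n))
    return wins / n

rng = random.Random(0)
mu_x, mu_y = 0.7, 0.2        # hypothetical expected rewards
rate = empirical_win_rate(mu_x, mu_y, 200_000, rng)
target = 1.0 / (1.0 + math.exp(-(mu_x - mu_y)))   # sigmoid of the mean gap
```

Comparing two Gumbel-perturbed linear rewards therefore produces exactly the sigmoid preference feedback of the dueling model, which is what lets a dueling algorithm be run on top of a linear bandit instance.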
Given the above reduction, our lower bound result now immediately follows as an implication of Thm. 11 and of the existing lower bound result for the $K$-armed $d$-dimensional contextual linear bandits problem Chu et al. (2011).
Theorem 11 (Regret Lower Bound).
For any algorithm for the problem of stochastic -armed -dimensional contextual dueling bandit problem with linear utility scores for any rounds, there exists a sequence of -dimensional vectors and a constant such that the regret incurred by on rounds is at least , i.e.:
5 Experiments
In this section, we present the empirical performances of our two proposed algorithms (Alg. 1 and 2) and compare them with some existing dueling bandits algorithms. The details of the algorithms are given below:
1. MaxInP: Our proposed algorithm Maximum-Informative-Pair (Alg. 1 as described in Sec. 3.2).
2. StaD: Our proposed algorithm Stagewise-Adaptive-Duel (Alg. 2 as described in Sec. 3.3).
3. SS: (IND) Self-Sparring (independent beta priors on each arm) algorithm for multi-dueling bandits Sui et al. (2017a).
4. RUCB: The Relative Upper Confidence Bound algorithm for regret minimization in standard dueling bandits Zoghi et al. (2013).
5. DTS: Dueling-Thompson Sampling algorithm for the best arm identification problem in Bayesian dueling bandits González et al. (2017). For linear scores we specifically fit a linear function for DTS, instead of a GP as suggested in the original paper.
In every experiment, the performances of the algorithms are measured in terms of cumulative regret (Eqn. (1)), averaged across runs, reported with standard deviation.
Constructing Problem Instances. We first run the experiments for preference functions with linear scores (details in Sec. 2.1). Note that the difficulty of a problem instance relies on the difference between the scores of the best and second-best arms, which is governed by the ‘worst case slope’ of the sigmoid function in hindsight (see the dependency of in our derived regret bounds (Thm. 3 or Thm. 6)), which in turn is governed by the underlying problem parameter (given a fixed instance set).
So we simulated different linear-score based problem instances based on different characterizations of (with arms and dimension ):
1. : Refers to the easy instances where is small, of the order of . Here the scores of all the arms are fairly similar, so no matter which arm is played the learner does not incur much cost.
2. : This, on the other hand, refers to the hardest instances where is large, of the order of , which sufficiently spreads out the scores of the individual arms; in this case it is really important for the learner to detect the best arms quickly to attain a smaller regret.
3. : The intermediate problem instances where .
For any instance, we first choose an arbitrary in the unit ball of dimension and subsequently scale its coordinates suitably to adjust the norm to the desired range.
Also in all settings, the $d$-dimensional feature vectors (of the arm set) are generated as random linear combinations of the $d$-dimensional basis vectors (for scaling issues of the item scores, we limit each instance vector to be within a ball of radius , i.e. -norm upper bounded by ). The following sections describe our different experimental results.
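The instance construction described above can be sketched as follows. The Gaussian coefficients, the default radius of one, and the name `make_instance` are assumptions made for illustration; the text does not fix these details here.

```python
import numpy as np

def make_instance(K, d, theta_norm, radius=1.0, seed=0):
    """Generate a linear-score instance: a scaled parameter and K arms in a ball.

    theta_norm: target norm of the parameter, which controls instance hardness.
    radius: bound on each arm's l2-norm (an assumed default of 1).
    """
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=d)
    theta *= theta_norm / np.linalg.norm(theta)     # rescale to the target norm
    arms = rng.normal(size=(K, d))                  # random combinations of basis vectors
    norms = np.linalg.norm(arms, axis=1, keepdims=True)
    arms *= np.minimum(1.0, radius / norms)         # shrink arms lying outside the ball
    return theta, arms

theta, arms = make_instance(K=50, d=10, theta_norm=5.0)
scores = arms @ theta                                # linear score of every arm
```

Increasing `theta_norm` spreads out the arm scores, which corresponds to the harder instances in the discussion above.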
5.1 Regret vs time
We first analyse the (averaged) cumulative regret performance of different algorithms over time on three different linear score environments (i.e. problem instances). For this experiment we fix and . Fig. 2 shows that both our proposed algorithms MaxInP and StaD always outperform the rest, and their advantage grows with increasing hardness of the problem instances (see the discussion on the construction of our problem instances). As expected, RUCB performs the worst, as by construction it fails to exploit the structure of the underlying linear score based item preferences; for the same reason SS performs poorly as well (note that we implement the independent-armed version of the Self-Sparring algorithm Sui et al. (2017a) for this case, and later the kernelized version for the case of non-linear item scores in Sec. 5.4). In contrast, DTS performs reasonably well, as its algorithmic construction is designed to exploit the underlying utility structure in the pairwise preferences.
5.2 Regret vs Setsize(K)
Our next set of experiments compares the (averaged) final cumulative regret of each algorithm over varying context set size $K$ on two different problem instances. For this experiment we fix and . From Fig. 4, note that again our algorithms outperform the other baselines, with DTS performing competitively. SS and RUCB perform very badly for the same reason as explained in Sec. 5.1. An interesting observation is that the performance of both our algorithms MaxInP and StaD is almost independent of $K$, as also follows from their respective regret guarantees (see Thm. 3 and Thm. 6): as long as is fixed, our algorithms can clearly identify the best item irrespective of the size of the context set, owing to their ability to exploit the underlying preference structures, unlike SS or RUCB.
5.3 Regret vs Dimension(d)
We next analyse the tradeoff between the (averaged) final cumulative regret performances of different algorithms and the problem dimension $d$ on two different problem instances. For this experiment we fix and . Fig. 5 shows that in general the performance of every algorithm degrades with increasing $d$. However, the effect is much more severe for the DTS baseline than for ours. Since RUCB cannot exploit the underlying preference structure, its performance is mostly independent of $d$, and the same goes for SS for the same reason. Here the interesting observation is that with increasing $d$, and fixed and , our first algorithm MaxInP indeed performs worse than our second algorithm StaD, in line with their theoretical regret guarantees: Thm. 3 shows a multiplicatively worse regret bound for MaxInP compared to that for StaD (Thm. 6).
5.4 Non-Linear score based preferences
We finally run some experiments to analyse the comparative regret performances of our proposed algorithms for non-linear score based preferences, i.e. when the score function is not linear in (see Sec. 2.1 for details). We particularly use the following three different score functions to simulate three different problem instances for this case:
Environments. We use these functions as : 1. Quadratic, 2. Six-Hump Camel and 3. Goldstein. Quadratic is the reward function , where and are randomly generated. The Six-Hump Camel and Goldstein functions are as described in González et al. (2017). For all cases, we fix and .
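For concreteness, the sketch below writes out the standard two-dimensional forms of the Six-Hump Camel and Goldstein-Price benchmark functions. Whether the experiments use these exact forms, or a negated or rescaled variant as a score function, is not stated in the text and is an assumption here.

```python
def six_hump_camel(x1, x2):
    """Standard Six-Hump Camel function on R^2."""
    return ((4.0 - 2.1 * x1**2 + x1**4 / 3.0) * x1**2
            + x1 * x2
            + (-4.0 + 4.0 * x2**2) * x2**2)

def goldstein_price(x1, x2):
    """Standard Goldstein-Price function on R^2."""
    a = 1.0 + (x1 + x2 + 1.0)**2 * (
        19.0 - 14.0 * x1 + 3.0 * x1**2 - 14.0 * x2 + 6.0 * x1 * x2 + 3.0 * x2**2)
    b = 30.0 + (2.0 * x1 - 3.0 * x2)**2 * (
        18.0 - 32.0 * x1 + 12.0 * x1**2 + 48.0 * x2 - 36.0 * x1 * x2 + 27.0 * x2**2)
    return a * b

camel_min = six_hump_camel(0.0898, -0.7126)   # near one of its global minimizers
gp_min = goldstein_price(0.0, -1.0)           # at its global minimizer
```

Both functions are non-linear in the arm features, so a linear parameter estimate no longer captures the scores and a non-parametric model is needed instead.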
Algorithms. We use a slightly modified version of our two algorithms (MaxInP and StaD) for the non-linear scores, since the GLM based parameter estimation techniques would no longer work here. Unfortunately, without suitable assumptions, we do not have an efficient way to estimate the score functions for this general setup, so instead we fit a GP to the underlying unknown score function based on the Laplace approximation technique suggested in Rasmussen and Williams (2006) (see Chap ). For SS we now also use the kernelized self-sparring version of the algorithm Sui et al. (2017a), and for DTS we now fit a GP model (instead of a linear model as before).
Fig. 3 shows that both our algorithms still perform best in almost all instances, even for the non-linear score based preferences. This suggests that our algorithmic ideas apply beyond linear scores (and thus it is perhaps also worth understanding their theoretical guarantees for this general setup in follow-up work). Moreover, unlike in the previous scenarios, SS now starts to perform better, since it can now exploit the underlying preference structures owing to the implementation of kernelized self-sparring Sui et al. (2017a). The performance of RUCB is again the worst due to its inability to exploit the structured preference relations. DTS performs competitively for Goldstein but quite badly for the rest.
6 Conclusion and Future Scopes
We consider the problem of regret minimization for contextual dueling bandits for potentially infinitely large decision spaces and, to the best of our knowledge, give the first optimal (up to logarithmic factors) algorithm for the problem setup, with a matching lower bound analysis. While our work is the first to guarantee an optimal finite-time regret analysis, there are numerous interesting open threads to pursue along this direction, e.g. considering arm preferences based on other link functions (probit, nested logit etc.), analysing the best achievable regret bound for contextual dueling bandits with adversarial preferences, or even extending the dueling preferences to multiwise preferences Saha and Gopalan (2019) and other practical bandit setups, e.g. in the presence of side information Mannor and Shamir (2011); Kocak et al. (2014), or graph-structured feedback Alon et al. (2015, 2017) etc.
Aadirupa Saha thanks Branislav Kveton for all the useful initial discussions during her internship at Google, Mountain View, and Ofer Meshi, Craig Boutilier for hosting her internship.
- Abbasi-Yadkori et al.  Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.
- Agrawal and Goyal  Shipra Agrawal and Navin Goyal. Analysis of thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory, pages 39–1, 2012.
- Ailon et al.  Nir Ailon, Zohar Karnin, and Thorsten Joachims. Reducing dueling bandits to cardinal bandits. In International Conference on Machine Learning, pages 856–864, 2014.
- Alon et al.  Noga Alon, Nicolo Cesa-Bianchi, Ofer Dekel, and Tomer Koren. Online learning with feedback graphs: Beyond bandits. In JMLR WORKSHOP AND CONFERENCE PROCEEDINGS, volume 40. Microtome Publishing, 2015.
- Alon et al.  Noga Alon, Nicolo Cesa-Bianchi, Claudio Gentile, Shie Mannor, Yishay Mansour, and Ohad Shamir. Nonstochastic multi-armed bandits with graph-structured feedback. SIAM Journal on Computing, 46(6):1