In this paper, we propose a Double Thompson Sampling (D-TS) algorithm for dueling bandit problems. As indicated by its name, D-TS selects both the first and the second candidates according to Thompson Sampling. Specifically, D-TS maintains a posterior distribution for the preference matrix, and chooses the pair of arms for comparison by sampling twice from the posterior distribution. This simple algorithm applies to general Copeland dueling bandits, including Condorcet dueling bandits as its special case. For general Copeland dueling bandits, we show that D-TS achieves $O(K^2 \log T)$ regret. For Condorcet dueling bandits, we further simplify the D-TS algorithm and show that the simplified D-TS algorithm achieves $O(K \log T + K^2 \log \log T)$ regret. Simulation results based on both synthetic and real-world data demonstrate the efficiency of the proposed D-TS algorithm.
The dueling bandit problem Yue2012JCSS:DuelingBandits is a variant of the classical multi-armed bandit (MAB) problem, where the feedback comes in the form of pairwise comparison. This model has attracted much attention as it can be applied in many systems such as information retrieval (IR) Yue2009ICML:DuelingBandits ; Zoghi2014WSDM:RCS , where user preferences are easier to obtain and typically more stable. Most earlier work Yue2012JCSS:DuelingBandits ; Zoghi2014ICML:RUCB ; Komiyama2015COLT:DB focuses on Condorcet dueling bandits, where there exists an arm, referred to as the Condorcet winner, that beats all other arms. Recent work Zoghi2015NIPS:CDB ; Komiyama2016ICML:CWRMED turns to a more general and practical case of a Copeland winner(s), which is the arm (or arms) that beats the most other arms. Existing algorithms are mainly generalized from traditional MAB algorithms along two lines: 1) UCB (Upper Confidence Bound)-type algorithms, such as RUCB Zoghi2014ICML:RUCB and CCB Zoghi2015NIPS:CDB ; and, 2) MED (Minimum Empirical Divergence)-type algorithms, such as RMED Komiyama2015COLT:DB and CW-RMED/ECW-RMED Komiyama2016ICML:CWRMED .
In traditional MAB, an alternative effective solution is Thompson Sampling (TS) Thompson1933FirstMAB . Its principle is to choose the optimal action that maximizes the expected reward according to the randomly drawn belief. TS has been successfully applied in traditional MAB Chapelle2011NIPS:TS ; Agrawal2012COLT:TS ; Komiyama2015ICML:MP_TS ; Qin&Liu2015IJCAI:TS and other online learning problems Gopalan2014ICML:TS ; Gopalan2015COLT:TSMDP . In particular, empirical studies in Chapelle2011NIPS:TS show that TS not only achieves lower regret than other algorithms in practice, but is also more robust as a randomized algorithm.
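As a point of reference, the TS principle for a traditional Bernoulli MAB can be sketched as follows. This is a minimal illustration, not code from the cited works: the function name `thompson_sampling`, the uniform Beta(1, 1) prior, and the simulated reward means are all assumptions made for the example.

```python
import numpy as np

def thompson_sampling(means, horizon, seed=0):
    """Beta-Bernoulli Thompson Sampling on a simulated K-armed bandit.

    `means` are the (hypothetical) true reward probabilities, used only
    to simulate feedback; the learner never reads them directly.
    """
    rng = np.random.default_rng(seed)
    k = len(means)
    wins = np.zeros(k)      # number of observed rewards equal to 1, per arm
    losses = np.zeros(k)    # number of observed rewards equal to 0, per arm
    pulls = np.zeros(k, dtype=int)
    for _ in range(horizon):
        # Randomly drawn belief: one sample per arm from its Beta posterior.
        theta = rng.beta(wins + 1, losses + 1)
        # Choose the action that is optimal under the drawn belief.
        arm = int(np.argmax(theta))
        reward = int(rng.random() < means[arm])  # simulated Bernoulli reward
        wins[arm] += reward
        losses[arm] += 1 - reward
        pulls[arm] += 1
    return pulls

pulls = thompson_sampling([0.2, 0.5, 0.8], horizon=2000)
```

In this simulated instance the arm with the highest mean should receive the large majority of the pulls, which is the behavior that D-TS aims to carry over to pairwise comparisons.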
In the wake of the success of TS in these online learning problems, a natural question is whether and how TS can be applied to dueling bandits to further improve the performance. However, it is challenging to apply the standard TS framework to dueling bandits, because not all comparisons provide information about the system statistics. Specifically, a good learning algorithm for dueling bandits will eventually compare the winner against itself. However, comparing one arm against itself does not provide any statistical information, which is critical in TS to update the posterior distribution. Thus, TS needs to be adjusted so that 1) comparing the winners against themselves is allowed, but 2) getting trapped in comparing a non-winner arm against itself is avoided.
In this paper, we propose a Double Thompson Sampling (D-TS) algorithm for dueling bandits, including both Condorcet dueling bandits and general Copeland dueling bandits. As its name suggests, D-TS typically selects both the first and the second candidates according to samples independently drawn from the posterior distribution. D-TS also utilizes the idea of confidence bounds to eliminate the likely non-winner arms, and thus avoids getting trapped in suboptimal comparisons. Compared to prior studies on dueling bandits, D-TS has both practical and theoretical advantages.
First, the double sampling structure of D-TS better suits the nature of dueling bandits. Launching two independent rounds of sampling provides us the opportunity to select the same arm in both rounds and thus to compare the winners against themselves. This double sampling structure also leads to more extensive utilization of TS (e.g., compared to RCS Zoghi2014WSDM:RCS ), and significantly reduces the regret. In addition, this simple framework applies to general Copeland dueling bandits and achieves lower regret than existing algorithms such as CCB Zoghi2015NIPS:CDB . Moreover, as a randomized algorithm, D-TS is more robust in practice.
Second, this double sampling structure enables us to obtain theoretical bounds for the regret of D-TS. As noted in traditional MAB literature Agrawal2012COLT:TS ; Agrawal2013AISTATS:TS2 , theoretical analysis of TS is usually more difficult than that of UCB-type algorithms. The analysis in dueling bandits is even more challenging because the selection of arms involves more factors and the two selected arms may be correlated. To address this issue, our D-TS algorithm draws the two sets of samples independently. Because their distributions are fully captured by historic comparison results, when the first candidate is fixed, the comparison between it and all other arms is similar to traditional MAB and thus we can borrow ideas from traditional MAB. Using the properties of TS and confidence bounds, we show that D-TS achieves $O(K^2 \log T)$ regret for a general $K$-armed Copeland dueling bandit. More interestingly, the property that the sample distribution only depends on historic comparison results (but not on $t$) enables us to refine the regret using a back substitution argument, where we show that D-TS achieves $O(K \log T + K^2 \log \log T)$ regret in Condorcet dueling bandits and many practical Copeland dueling bandits.
Based on the analysis, we further refine the tie-breaking criterion in D-TS and propose its enhancement called D-TS+. D-TS+ achieves the same theoretical regret bound as D-TS, but performs better in practice, especially when there are multiple winners.
In summary, the main contributions of this paper are as follows:
We propose a D-TS algorithm and its enhancement D-TS+ for general Copeland dueling bandits. The double sampling structure suits the nature of dueling bandits and leads to more extensive usage of TS, which significantly reduces the regret.
We obtain theoretical regret bounds for D-TS and D-TS+. For general Copeland dueling bandits, we show that D-TS and D-TS+ achieve $O(K^2 \log T)$ regret. In Condorcet dueling bandits and most practical Copeland dueling bandits, we further refine the regret bound to $O(K \log T + K^2 \log \log T)$ using a back substitution argument.
We evaluate the D-TS and D-TS+ algorithms through experiments based on both synthetic and real-world data. The results show that D-TS and D-TS+ significantly improve the overall performance, in terms of regret and robustness, compared to existing algorithms.
Early dueling bandit algorithms study finite-horizon settings, using “explore-then-exploit” approaches, such as IF Yue2012JCSS:DuelingBandits , BTM Yue2011ICML:BTM , and SAVAGE Urvoy2013ICML:SAVAGE . For infinite-horizon settings, recent work has generalized the traditional MAB algorithms to dueling bandits along two lines. First, RUCB Zoghi2014ICML:RUCB and CCB Zoghi2015NIPS:CDB are generalizations of UCB for Condorcet and general Copeland dueling bandits, respectively. In addition, Ailon2014ICML:UBDB reduces dueling bandits to traditional MAB, which is then solved by UCB-type algorithms, called MultiSBM and Sparring. Second, Komiyama2015COLT:DB and Komiyama2016ICML:CWRMED extend the MED algorithm to dueling bandits, where they present the lower bound on the regret and propose the corresponding optimal algorithms, including RMED for Condorcet dueling bandits Komiyama2015COLT:DB , and CW-RMED and its computationally efficient version ECW-RMED for general Copeland dueling bandits Komiyama2016ICML:CWRMED . Different from such existing work, we study algorithms for dueling bandits from the perspective of TS, which typically achieves lower regret and is more robust in practice.
Dating back to 1933, TS Thompson1933FirstMAB is one of the earliest algorithms for the exploration/exploitation tradeoff. Nowadays, it has been applied in many variants of MAB Komiyama2015ICML:MP_TS ; Qin&Liu2015IJCAI:TS ; Gopalan2014ICML:TS and other more complex problems, e.g., Gopalan2015COLT:TSMDP , due to its simplicity, good performance, and robustness Chapelle2011NIPS:TS . Theoretical analysis of TS is much more difficult. Only recently, Agrawal2012COLT:TS proposed a logarithmic bound for the standard frequentist expected regret, whose constant factor is further improved in Agrawal2013AISTATS:TS2 . Moreover, Russo2014TR:TS ; Russo2014MOR:TS derive bounds for its Bayesian expected regret through information-theoretic analysis.
TS has been preliminarily considered for dueling bandits Zoghi2014WSDM:RCS ; Welsh2012LSOLDM:TS . In particular, recent work Zoghi2014WSDM:RCS proposes a Relative Confidence Sampling (RCS) algorithm that combines TS with RUCB Zoghi2014ICML:RUCB for Condorcet dueling bandits. Under RCS, the first arm is selected by TS while the second arm is selected according to their RUCB. Empirical studies demonstrate the performance improvement of using RCS in practice, but no theoretical bounds on the regret are provided.
We consider a dueling bandit problem with $K$ ($K \geq 2$) arms, denoted by $\mathcal{A} = \{1, 2, \ldots, K\}$. At each time-slot $t$, a pair of arms $(a_t^{(1)}, a_t^{(2)})$ is displayed to a user and a noisy comparison outcome $w_t$ is obtained, where $w_t = 1$ if the user prefers $a_t^{(1)}$ to $a_t^{(2)}$, and $w_t = 2$ otherwise. We assume the user preference is stationary over time and the distribution of comparison outcomes is characterized by the preference matrix $P = [p_{ij}]_{K \times K}$, where $p_{ij}$ is the probability that the user prefers arm $i$ to arm $j$, i.e., $p_{ij} = \mathbb{P}\{i \succ j\}$. We assume that the displaying order does not affect the preference, and hence, $p_{ij} + p_{ji} = 1$ and $p_{ii} = 1/2$. We say that arm $i$ beats arm $j$ if $p_{ij} > 1/2$.
We study the general Copeland dueling bandits, where the Copeland winner is defined as the arm (or arms) that maximizes the number of other arms it beats Zoghi2015NIPS:CDB ; Komiyama2016ICML:CWRMED . Specifically, the Copeland score of arm $i$ is defined as $\sum_{j \neq i} \mathbb{1}(p_{ij} > 1/2)$, and the normalized Copeland score is defined as $\zeta_i = \frac{1}{K-1} \sum_{j \neq i} \mathbb{1}(p_{ij} > 1/2)$, where $\mathbb{1}(\cdot)$ is the indicator function. Let $\zeta^*$ be the highest normalized Copeland score, i.e., $\zeta^* = \max_{1 \leq i \leq K} \zeta_i$. Then the Copeland winner is defined as the arm (or arms) with the highest normalized Copeland score, i.e., $\mathcal{C}^* = \{i : 1 \leq i \leq K, \zeta_i = \zeta^*\}$. Note that the Condorcet winner is a special case of Copeland winner with $\zeta^* = 1$.
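The definitions above translate directly into a few lines of code. The following is a minimal sketch; the function names and the two 3-armed preference matrices are hypothetical examples, not data from the paper.

```python
import numpy as np

def normalized_copeland_scores(P):
    """Fraction of the other arms that each arm beats (p_ij > 1/2)."""
    P = np.asarray(P, dtype=float)
    k = P.shape[0]
    beats = P > 0.5                  # beats[i, j]: arm i beats arm j
    np.fill_diagonal(beats, False)   # an arm never counts as beating itself
    return beats.sum(axis=1) / (k - 1)

def copeland_winners(P):
    """Indices of all arms attaining the highest normalized Copeland score."""
    zeta = normalized_copeland_scores(P)
    return np.flatnonzero(zeta == zeta.max())

# Hypothetical Condorcet instance: arm 0 beats every other arm.
P_condorcet = [[0.5, 0.6, 0.7],
               [0.4, 0.5, 0.8],
               [0.3, 0.2, 0.5]]
# Hypothetical cyclic instance: no Condorcet winner, three Copeland winners.
P_cycle = [[0.5, 0.6, 0.4],
           [0.4, 0.5, 0.6],
           [0.6, 0.4, 0.5]]

zeta_condorcet = normalized_copeland_scores(P_condorcet)
winners_condorcet = copeland_winners(P_condorcet)
winners_cycle = copeland_winners(P_cycle)
```

In the first matrix the highest normalized score equals 1, so the single Copeland winner is also the Condorcet winner; in the cyclic matrix every arm beats exactly one other arm and all three arms are Copeland winners.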
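A dueling bandit algorithm decides which pair of arms to compare depending on the historic observations. Specifically, define a filtration $\mathcal{H}_{t-1}$ as the history before $t$, i.e., $\mathcal{H}_{t-1} = \{a_\tau^{(1)}, a_\tau^{(2)}, w_\tau, \tau = 1, 2, \ldots, t-1\}$. Then a dueling bandit algorithm $\Gamma$ is a function that maps $\mathcal{H}_{t-1}$ to $(a_t^{(1)}, a_t^{(2)})$, i.e., $(a_t^{(1)}, a_t^{(2)}) = \Gamma(\mathcal{H}_{t-1})$. The performance of a dueling bandit algorithm $\Gamma$ is measured by its expected cumulative regret, which is defined as
$$R_\Gamma(T) = \zeta^* T - \frac{1}{2} \sum_{t=1}^{T} \mathbb{E}\left[\zeta_{a_t^{(1)}} + \zeta_{a_t^{(2)}}\right]. \qquad (1)$$
The objective of $\Gamma$ is then to minimize $R_\Gamma(T)$. As pointed out in Zoghi2015NIPS:CDB , the results can be adapted to other regret definitions because the above definition bounds the number of suboptimal comparisons.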
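To make the regret measure concrete, the sketch below computes the cumulative regret of a given sequence of compared pairs. It assumes the Copeland regret used in Zoghi2015NIPS:CDB , where comparing arms $i$ and $j$ costs the highest normalized Copeland score minus the average normalized score of the two arms; the function name and the example matrix are hypothetical.

```python
import numpy as np

def copeland_regret(P, pairs):
    """Cumulative Copeland regret of a sequence of compared pairs.

    Assumed per-step regret: zeta* - (zeta_a1 + zeta_a2) / 2.
    """
    P = np.asarray(P, dtype=float)
    k = P.shape[0]
    # Normalized Copeland scores; the diagonal is 1/2 and is never counted.
    zeta = (P > 0.5).sum(axis=1) / (k - 1)
    zeta_star = zeta.max()
    per_step = [zeta_star - 0.5 * (zeta[a1] + zeta[a2]) for a1, a2 in pairs]
    return np.cumsum(per_step)

# Hypothetical Condorcet instance with scores (1, 0.5, 0).
P = [[0.5, 0.6, 0.7],
     [0.4, 0.5, 0.8],
     [0.3, 0.2, 0.5]]
regret = copeland_regret(P, [(0, 0), (0, 2), (1, 2)])
```

Under this definition, comparing a winner against itself incurs no regret, which is why a good algorithm should eventually play such pairs.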
We present the D-TS algorithm for Copeland dueling bandits, as described in Algorithm 1 (the time index $t$ is omitted in the pseudocode for brevity). As its name suggests, the basic idea of D-TS is to select both the first and the second candidates by TS. For each pair $(i, j)$ with $i \neq j$, we assume a beta prior distribution for its preference probability $p_{ij}$. These distributions are updated according to the comparison results $B_{ij}(t-1)$ and $B_{ji}(t-1)$, where $B_{ij}(t-1)$ (resp. $B_{ji}(t-1)$) is the number of time-slots when arm $i$ (resp. $j$) beats arm $j$ (resp. $i$) before $t$. D-TS selects the two candidates by sampling from the posterior distributions.
Specifically, at each time-slot $t$, the D-TS algorithm consists of two phases that select the first and the second candidates, respectively. When choosing the first candidate $a_t^{(1)}$, we first use the RUCB Zoghi2014ICML:RUCB of $p_{ij}$, denoted by $u_{ij}(t)$, to eliminate the arms that are unlikely to be the Copeland winner, resulting in a candidate set $\mathcal{C}_t$ (Lines 4 to 6). The algorithm then samples $\theta_{ij}^{(1)}(t)$ from the posterior beta distribution, and the first candidate $a_t^{(1)}$ is chosen by “majority voting”, i.e., the arm within $\mathcal{C}_t$ that beats the most arms according to $\theta_{ij}^{(1)}(t)$ will be selected (Lines 7 to 11). The ties are broken randomly here for simplicity and will be refined later in Section 4.3. A similar idea is applied to select the second candidate $a_t^{(2)}$, where new samples $\theta_{i a_t^{(1)}}^{(2)}(t)$ are generated and the arm with the largest $\theta_{i a_t^{(1)}}^{(2)}(t)$ among all arms with RLCB $l_{i a_t^{(1)}}(t) \leq 1/2$ is selected as the second candidate (Lines 13 to 14).
The double sampling structure of D-TS is designed based on the nature of dueling bandits, i.e., at each time-slot, two arms are needed for comparison. Unlike RCS Zoghi2014WSDM:RCS , D-TS selects both candidates using TS. This leads to more extensive utilization of TS and thus achieves much lower regret. Moreover, the two sets of samples are independently distributed, following the same posterior that is only determined by the comparison statistics $B_{ij}(t-1)$ and $B_{ji}(t-1)$. This property enables us to obtain an $O(K^2 \log T)$ regret bound and further refine it by a back substitution argument, as discussed later.
We also note that RUCB-based elimination (Lines 4 to 6) and RLCB (Relative Lower Confidence Bound)-based elimination (Line 14) are essential in D-TS. Without these eliminations, the algorithm may get trapped in suboptimal comparisons. Consider one extreme case in Condorcet dueling bandits (a Borda winner may be more appropriate in this special case Jamieson2015AISTAT:SDB ; we mainly use it to illustrate the dilemma): assume arm 1 is the Condorcet winner but beats every other arm only by a slim margin, and arm 2 is not the Condorcet winner but beats all arms other than arm 1 by a large margin. Then for a larger $K$, without RUCB-based elimination, the algorithm may get trapped in choosing arm 2 as the first candidate for a long time, because arm 2 is likely to receive a higher score than arm 1. This issue can be addressed by RUCB-based elimination as follows: when chosen as the first candidate, arm 2 has a high probability of being compared with arm 1; after sufficient comparisons with arm 1, the RUCB of arm 2 against arm 1 will fall below 1/2 with high probability; then arm 2 is likely to be eliminated because arm 1 attains a higher upper bound on its normalized Copeland score with high probability. Similarly, RLCB-based elimination (Line 14, where we restrict to the arms whose RLCB against the first candidate is at most 1/2) is important especially for non-Condorcet dueling bandits. Specifically, an RLCB above 1/2 indicates that arm $i$ beats the first candidate $a_t^{(1)}$ with high probability. Thus, comparing $a_t^{(1)}$ and arm $i$ brings little information gain, and arm $i$ should be eliminated to minimize the regret.
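The two phases and the two eliminations described above can be sketched in code as follows. This is a minimal simulation sketch, not the authors' implementation: the function name `d_ts`, the default `alpha=0.51`, the uniform Beta(1, 1) prior, and the specific confidence radius $\sqrt{\alpha \log t / n_{ij}}$ (with trivial bounds for pairs not yet compared) are assumptions, and the preference matrix `P` is used only to simulate user feedback.

```python
import numpy as np

def d_ts(P, horizon, alpha=0.51, seed=0):
    """Simulate a D-TS-style learner; returns the list of compared pairs."""
    rng = np.random.default_rng(seed)
    P = np.asarray(P, dtype=float)
    k = P.shape[0]
    B = np.zeros((k, k))          # B[i, j]: number of times arm i beat arm j
    upper = np.triu_indices(k, 1)
    history = []
    for t in range(1, horizon + 1):
        # Assumed RUCB/RLCB form: empirical mean +/- sqrt(alpha * log t / n).
        n = B + B.T
        safe_n = np.maximum(n, 1)
        mean = np.where(n > 0, B / safe_n, 0.5)
        radius = np.where(n > 0, np.sqrt(alpha * np.log(t) / safe_n), 1.0)
        U, L = mean + radius, mean - radius
        np.fill_diagonal(U, 0.5)
        np.fill_diagonal(L, 0.5)

        # Phase 1: RUCB-based elimination, then majority voting on samples.
        ucb_score = (U > 0.5).sum(axis=1)
        candidates = np.flatnonzero(ucb_score == ucb_score.max())
        draw = rng.beta(B + 1, B.T + 1)
        theta1 = np.full((k, k), 0.5)
        theta1[upper] = draw[upper]
        theta1[(upper[1], upper[0])] = 1 - draw[upper]  # theta_ji = 1 - theta_ij
        votes = (theta1 > 0.5).sum(axis=1)
        best = candidates[votes[candidates] == votes[candidates].max()]
        a1 = int(rng.choice(best))                      # ties broken randomly

        # Phase 2: fresh samples against a1, restricted by RLCB <= 1/2.
        theta2 = rng.beta(B[:, a1] + 1, B[a1, :] + 1)
        theta2[a1] = 0.5
        theta2[L[:, a1] > 0.5] = -np.inf   # drop arms that surely beat a1
        a2 = int(np.argmax(theta2))

        # Compare and update; a self-comparison carries no information.
        if a1 != a2:
            if rng.random() < P[a1, a2]:
                B[a1, a2] += 1
            else:
                B[a2, a1] += 1
        history.append((a1, a2))
    return history

# Hypothetical 3-armed Condorcet instance in which arm 0 is the winner.
P = [[0.5, 0.7, 0.8],
     [0.3, 0.5, 0.7],
     [0.2, 0.3, 0.5]]
history = d_ts(P, horizon=3000)
tail_self_plays = sum(pair == (0, 0) for pair in history[-500:])
```

On this toy instance the learner should spend most late rounds comparing arm 0 against itself, which corresponds to zero regret. Because the second phase fixes the sample of the first candidate against itself at 1/2, the winner can be selected twice once the posteriors of the other arms concentrate below 1/2.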
Before conducting the regret analysis, we first introduce certain notations that will be used later.
Gap to 1/2: In dueling bandits, an important benchmark for $p_{ij}$ is 1/2, and thus we let $\Delta_{ij}$ be the gap between $p_{ij}$ and 1/2, i.e., $\Delta_{ij} = |p_{ij} - 1/2|$.
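Number of Comparisons: Under D-TS, the pair $(i, j)$ can be compared in the form of $(a_t^{(1)}, a_t^{(2)}) = (i, j)$ and $(a_t^{(1)}, a_t^{(2)}) = (j, i)$. We consider these two cases separately and define the following counters: $N_{ij}^{(1)}(T) = \sum_{t=1}^{T} \mathbb{1}(a_t^{(1)} = i, a_t^{(2)} = j)$ and $N_{ij}^{(2)}(T) = \sum_{t=1}^{T} \mathbb{1}(a_t^{(1)} = j, a_t^{(2)} = i)$. Then the total number of comparisons is $N_{ij}(T) = N_{ij}^{(1)}(T) + N_{ij}^{(2)}(T)$ for $i \neq j$, and $N_{ii}(T) = N_{ii}^{(1)}(T) = N_{ii}^{(2)}(T)$ for $i = j$.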
To obtain theoretical bounds for the regret of D-TS, we make the following assumption:
Assumption 1: The preference probability $p_{ij} \neq 1/2$ for any $i \neq j$.
Under Assumption 1, we present the first result for D-TS in general Copeland dueling bandits:
When applying D-TS with in a Copeland dueling bandit with a preference matrix satisfying Assumption 1, its regret is bounded as:
where is an arbitrary constant, and is the KL divergence.
The summation operation in Eq. (2) is conducted over all pairs $(i, j)$ with $p_{ij} < 1/2$. Thus, Proposition 1 states that D-TS achieves $O(K^2 \log T)$ regret in Copeland dueling bandits. To the best of our knowledge, this is the first theoretical bound for TS in dueling bandits. The scaling behavior of this bound with respect to $T$ is order optimal, since a lower bound of $\Omega(\log T)$ has been shown in Komiyama2016ICML:CWRMED . The refinement of the scaling behavior with respect to $K$ will be discussed later.
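For readers who want to evaluate the quantities appearing in such bounds numerically, the sketch below computes the gap to 1/2 and the KL divergence to 1/2. It assumes that the KL divergence is taken between two Bernoulli distributions, as is standard in the bandit literature; the function names are hypothetical.

```python
import math

def gap_to_half(p):
    """Gap between a preference probability and 1/2."""
    return abs(p - 0.5)

def kl_bernoulli(p, q):
    """KL divergence D(p || q) between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-12                      # guard against log(0) at the boundary
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

d_far = kl_bernoulli(0.9, 0.5)       # an arm that wins clearly
d_near = kl_bernoulli(0.51, 0.5)     # an arm that wins barely
```

Pairs whose preference probability is close to 1/2 have a small divergence and therefore need many more comparisons before they can be distinguished from a tie. By Pinsker's inequality the divergence to 1/2 is at least twice the squared gap, so a bound stated in terms of the gap is never tighter than one stated in terms of the divergence.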
Proving Proposition 1 needs to bound the number of comparisons for all pairs with or . When fixing the first candidate as , the selection of the second candidate is similar to a traditional -armed bandit problem with expected utilities (). However, the analysis is more complex here since different arms are eliminated differently depending on the value of . We prove Proposition 1 through Lemmas 1 to 3, which bound the number of comparisons for all suboptimal pairs under different scenarios, i.e., , , and (), respectively.
Under D-TS, for an arbitrary constant and one pair with , we have
We can prove this lemma by viewing the comparison between the first candidate arm and its inferiors as a traditional MAB. In fact, it may be even simpler than that in Agrawal2013AISTATS:TS2 because under D-TS, arm with is competing with arm with , which is known and fixed. Then we can bound using the techniques in Agrawal2013AISTATS:TS2 . Details can be found in Appendix B.1. ∎
Under D-TS with , for one pair with , we have
Under D-TS, for any arm , we have
Before proving Lemma 3, we present an important property for . Recall that is the maximum normalized Copeland score. Using the concentration property of RUCB (Lemma 6 in Appendix A), the following lemma shows that is indeed a UCB of .
For any and , .
Return to the proof of Lemma 3. To prove Lemma 3, we consider the cases of and . The former case can be bounded by Lemma 4. For the latter case, we note that when , the event occurs only if: a) there exists at least one with , such that ; and b) for all with . In this case, we can bound the probability of by that of , for with but , where the coefficient decays exponentially. Then we can bound by similar to Agrawal2013AISTATS:TS2 . Details of proof can be found in Appendix B.4.
In this section, we refine the regret bound for D-TS and reduce its scaling factor with respect to the number of arms $K$.
We sort the arms for each in the descending order of , and let be a permutation of , such that . In addition, for a Copeland winner , let be the number of arms that beat arm . To refine the regret, we introduce an additional no-tie assumption:
Assumption 2: For each arm , for all .
We present a refined regret bound for D-TS as follows:
When applying D-TS with in a Copeland dueling bandit with a preference matrix satisfying Assumptions 1 and 2, its regret is bounded as:
where and are constants, and is the KL-divergence.
In (6), the first term corresponds to the regret when the first candidate is a winner, and is . The second term corresponds to the comparisons between a non-winner arm and its first superiors, which is bounded by . The remaining terms correspond to the comparisons between a non-winner arm and the remaining arms, and is bounded by . As demonstrated in Zoghi2015NIPS:CDB , is relatively small compared to , and can be viewed as a constant. Thus, the total regret is bounded as . In particular, this asymptotic trend can be easily seen for Condorcet dueling bandits where .
Comparing Eq. (6) with Eq. (2), we can see the difference is the third and fourth terms in (6), which refine the regret of comparing a suboptimal arm and its last inferiors into . Thus, to prove Theorem 1, it suffices to show the following additional lemma:
Under Assumptions 1 and 2, for any suboptimal arm and , we have
where and are constants.
We prove this lemma using a back substitution argument. The intuition is that when fixing the first candidate as , the comparison between and the other arms is similar to a traditional MAB with expected utilities (). Let be the number of time-slots when this type of MAB is played. Using the fact that the distribution of the samples only depends on the historic comparison results (but not ), we can show , which holds for any . We have shown that for any when proving Proposition 1. Then, substituting the bound of back and using the concavity of the function, we have . Details can be found in Appendix C.1 ∎
D-TS is a TS framework for dueling bandits, and its performance can be improved by refining certain components of it. In this section, we propose an enhanced version of D-TS, referred to as D-TS+, that carefully breaks the ties to reduce the regret.
Note that by randomly breaking the ties (Line 11 in Algorithm 1), D-TS tends to explore all potential winners. This may be desirable in certain applications such as restaurant recommendation, where users may not want to stick to a single winner. However, because of this, the regret of D-TS scales with the number of winners as shown in Theorem 1. To further reduce the regret, we can break the ties according to estimated regret.
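Specifically, with samples $\theta_{ij}^{(1)}(t)$, the normalized Copeland score for each arm $i$ can be estimated as $\tilde{\zeta}_i(t) = \frac{1}{K-1} \sum_{j \neq i} \mathbb{1}(\theta_{ij}^{(1)}(t) > 1/2)$. Then the maximum normalized Copeland score is $\tilde{\zeta}^*(t) = \max_i \tilde{\zeta}_i(t)$, and the loss of comparing arm $i$ and arm $j$ is $\tilde{r}_{ij}(t) = \tilde{\zeta}^*(t) - \frac{1}{2}\left[\tilde{\zeta}_i(t) + \tilde{\zeta}_j(t)\right]$. For $p_{ij} \neq 1/2$, we need about $\Theta\left(\frac{\log T}{D(p_{ij} \| 1/2)}\right)$ time-slots to distinguish it from 1/2 Komiyama2015COLT:DB . Thus, when choosing $i$ as the first candidate, the regret of comparing it with all other arms can be estimated by $\tilde{R}_i^{(1)}(t) = \sum_{j: \theta_{ij}^{(1)}(t) \neq 1/2} \tilde{r}_{ij}(t) / D(\theta_{ij}^{(1)}(t) \| 1/2)$. We propose the following D-TS+ algorithm that breaks the ties to minimize $\tilde{R}_i^{(1)}(t)$.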
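D-TS+: Implement the same operations as D-TS, except that the selection of the first candidate (Line 11 in Algorithm 1) is replaced by the following two steps: first, find the set of arms in the candidate set that beat the most arms according to the samples; second, among these arms, select as the first candidate the one that minimizes the estimated regret.
D-TS+ only changes the tie-breaking criterion in selecting the first candidate. Thus, the regret bound of D-TS directly applies to D-TS+:
Corollary 1. The regret of D-TS+ satisfies inequality (6) under Assumptions 1 and 2.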
Corollary 1 provides an upper bound for the regret of D-TS+. In practice, however, D-TS+ performs better than D-TS in the scenarios with multiple winners, as we can see in Section 5 and Appendix D. Our conjecture is that with this regret-minimization criterion, the D-TS+ algorithm tends to focus on one of the winners (if there is no tie in terms of expected regret), and thus reduces the first term in (6). The proof of this conjecture requires properties for the evolution of the statistics for all arms and the majority voting results based on the Thompson samples, and is complex. This is left as part of our future work.
In the above D-TS+ algorithm, we only consider the regret of choosing arm $i$ as the first candidate. From Theorem 1, we know that comparing other arms with their superiors will also result in regret. Thus, although the current D-TS+ algorithm performs well in most practical scenarios, one may further improve its performance by taking these additional comparisons into account in the estimated regret.
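The regret-minimizing tie-breaking rule of this section can be sketched as follows. This is an illustrative reading of the description above rather than the authors' code: it assumes that the estimated regret of a first candidate is the sum, over the other arms, of the estimated comparison loss divided by the Bernoulli KL divergence between the sample and 1/2 (the common $\log T$ factor is dropped), and the function names and the sample matrix are hypothetical.

```python
import numpy as np

def kl_to_half(p):
    """Bernoulli KL divergence D(p || 1/2)."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return p * np.log(2 * p) + (1 - p) * np.log(2 * (1 - p))

def first_candidate_min_regret(theta, candidates):
    """Pick the first candidate from `candidates` given samples `theta`.

    theta[i, j] is the sampled probability that arm i beats arm j,
    with theta[j, i] = 1 - theta[i, j] and theta[i, i] = 1/2.
    """
    theta = np.asarray(theta, dtype=float)
    candidates = np.asarray(candidates)
    k = theta.shape[0]
    zeta = (theta > 0.5).sum(axis=1) / (k - 1)  # estimated normalized scores
    # Step 1: arms in the candidate set that win the majority vote.
    votes = zeta[candidates]
    tied = candidates[votes == votes.max()]
    # Estimated loss of comparing arm i with arm j.
    loss = zeta.max() - 0.5 * (zeta[:, None] + zeta[None, :])
    # Step 2: among the tied arms, minimize the estimated regret.
    estimates = []
    for i in tied:
        total = 0.0
        for j in range(k):
            if j != i and theta[i, j] != 0.5:
                total += loss[i, j] / kl_to_half(theta[i, j])
        estimates.append(total)
    return int(tied[int(np.argmin(estimates))])

# Hypothetical samples: arms 0 and 1 both beat two arms, but arm 0 has a
# sample close to 1/2 against arm 3, which is expensive to resolve.
theta = np.array([[0.5, 0.6, 0.9, 0.4],
                  [0.4, 0.5, 0.9, 0.9],
                  [0.1, 0.1, 0.5, 0.6],
                  [0.6, 0.1, 0.4, 0.5]])
choice_all = first_candidate_min_regret(theta, np.arange(4))
choice_restricted = first_candidate_min_regret(theta, [0, 2])
```

With all four arms as candidates, arms 0 and 1 tie in the vote and the rule prefers arm 1, whose comparisons are all far from 1/2. When the candidate set is restricted to arms 0 and 2, arm 0 wins the vote outright and no tie-breaking is needed.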
To evaluate the proposed D-TS and D-TS+ algorithms, we run experiments based on synthetic and real-world data. Here we present the results for experiments based on the Microsoft Learning to Rank (MSLR) dataset Microsoft2010MSLR , which provides the relevance for queries and ranked documents. Based on this dataset, Zoghi2015NIPS:CDB derives a preference matrix for 136 rankers, where each ranker is a function that maps a user’s query to a document ranking and can be viewed as one arm in dueling bandits. We use the two 5-armed submatrices in Zoghi2015NIPS:CDB , one for Condorcet dueling bandits and the other for non-Condorcet dueling bandits. More experiments and discussions can be found in Appendix D. Source code is available at https://github.com/HuasenWu/DuelingBandits.
We compare D-TS and D-TS+ with the following algorithms: BTM Yue2011ICML:BTM , SAVAGE Urvoy2013ICML:SAVAGE , Sparring Ailon2014ICML:UBDB , RUCB Zoghi2014ICML:RUCB , RCS Zoghi2014WSDM:RCS , CCB Zoghi2015NIPS:CDB , SCB Zoghi2015NIPS:CDB , RMED1 Komiyama2015COLT:DB , and ECW-RMED Komiyama2016ICML:CWRMED . For BTM, we set the relaxed factor as in Yue2011ICML:BTM . For algorithms using RUCB and RLCB, including D-TS and D-TS+, we set the scale factor . For RMED1, we use the same settings as Komiyama2015COLT:DB , and for ECW-RMED, we use the same settings as Komiyama2016ICML:CWRMED . For the “explore-then-exploit” algorithms, BTM and SAVAGE, each point is obtained by resetting the time horizon to the corresponding value. The results are averaged over 500 independent experiments, where in each experiment, the arms are randomly shuffled to prevent algorithms from exploiting special structures of the preference matrix.
In Condorcet dueling bandits, our D-TS and D-TS+ algorithms achieve almost the same performance and both perform much better than existing algorithms, as shown in Fig. 1(a). In particular, compared with RCS, we can see that the full utilization of TS in D-TS and D-TS+ significantly reduces the regret. Compared with RMED1 and ECW-RMED, our D-TS and D-TS+ algorithms also perform better. Komiyama2015COLT:DB has shown that RMED1 is optimal in Condorcet dueling bandits, not only in the sense of asymptotic order, but also in the coefficients of the regret bound. The simulation results show that D-TS and D-TS+ not only achieve a similar slope to RMED1/ECW-RMED, but also converge faster to the asymptotic regime and thus achieve much lower regret. This inspires us to further refine the regret bounds for D-TS and D-TS+ in the future.
In non-Condorcet dueling bandits, as shown in Fig. 1(b), D-TS and D-TS+ significantly reduce the regret compared to the UCB-type algorithm, CCB (e.g., the regret of D-TS is less than 10% of that of CCB). Compared with ECW-RMED, D-TS achieves higher regret, mainly because it randomly explores all Copeland winners due to the random tie-breaking rule. With a regret-minimization tie-breaking rule, D-TS+ further reduces the regret, and outperforms ECW-RMED in this dataset. Moreover, as randomized algorithms, D-TS and D-TS+ are more robust to the preference probabilities. As shown in Fig. 2, D-TS and D-TS+ have a much smaller regret standard deviation (STD) than ECW-RMED in the non-Condorcet dataset, where certain preference probabilities (for different arms) are close to 1/2. In particular, the STD of regret for ECW-RMED is almost 200% of its mean value, while it is only 13.16% for D-TS. In addition, as shown in Appendix D.2.3, D-TS and D-TS+ are also robust to delayed feedback, which is typically batched and provided periodically in practice.
Overall, D-TS and D-TS+ significantly outperform all existing algorithms, with the exception of ECW-RMED. Compared to ECW-RMED, D-TS+ achieves much lower regret in the Condorcet case, lower or comparable regret in the non-Condorcet case, and much more robustness in terms of regret STD and delayed feedback. Thus, the simplicity, good performance, and robustness of D-TS and D-TS+ make them good algorithms in practice.
In this paper, we study TS algorithms for dueling bandits. We propose a D-TS algorithm and its enhanced version D-TS+ for general Copeland dueling bandits, including Condorcet dueling bandits as a special case. Our study reveals desirable properties of D-TS and D-TS+ from both theoretical and practical perspectives. Theoretically, we show that the regret of D-TS and D-TS+ is bounded by $O(K^2 \log T)$ in general Copeland dueling bandits, and can be refined to $O(K \log T + K^2 \log \log T)$ in Condorcet dueling bandits and most practical Copeland dueling bandits. Practically, experimental results demonstrate that these simple algorithms achieve significantly better overall performance than existing algorithms, i.e., D-TS and D-TS+ typically achieve much lower regret in practice and are robust to many practical factors, such as the preference matrix and feedback delay.
Although logarithmic regret bounds have been obtained for D-TS and D-TS+, our analysis relies heavily on the properties of RUCB/RLCB and the regret bounds are likely loose. In fact, we see from experiments that RUCB-based elimination seldom occurs under most practical settings. We will further refine the regret bounds by investigating the properties of TS-based majority voting. Moreover, results from recent work such as Komiyama2016ICML:CWRMED may be leveraged to improve TS algorithms. Last, it is also an interesting future direction to study D-TS type algorithms for dueling bandits with other definitions of winners.
Acknowledgements: This research was supported in part by NSF Grants CCF-1423542, CNS-1457060, and CNS-1547461. The authors would like to thank Prof. R. Srikant (UIUC), Prof. Shipra Agrawal (Columbia University), Masrour Zoghi (University of Amsterdam), and Dr. Junpei Komiyama (University of Tokyo) for their helpful discussions and suggestions.
International Conference on Machine Learning (ICML), pages 1201–1208, 2009.
International Joint Conference on Artificial Intelligence, 2015.
Thompson sampling for learning parameterized Markov decision processes. In Proceedings of Conference on Learning Theory, pages 861–898, 2015.
We first present the concentration properties of RUCB/RLCB. By relating RUCB/RLCB to UCB/LCB in traditional MAB, we can adjust the results in Bubeck2010PhD:bandits for RUCB/RLCB as follows.
1) When , for any and ,
2) For any ,
We prove this lemma using the techniques in the proof of Theorem 2.2 in Bubeck2010PhD:bandits .
In fact, RUCB (resp., RLCB) in dueling bandits are essentially the same as UCB (resp., LCB) in traditional MAB. Thus, Part 1) of this lemma can be proved using the peeling argument in Bubeck2010PhD:bandits .
For Part 2), the sum can be bounded by the integration as in Bubeck2010PhD:bandits . ∎
For a pair with , let be a number satisfying . Let be the empirical estimation for the probability that arm beats arm . Define the following events:
For an event , we let be the event of “not ”. Then
The first term is zero, because for all , due to the fact that when .
The second and third terms can be bounded similarly to the analysis of TS in traditional MABs Agrawal2013AISTATS:TS2 . To see this, we note that when fixing the first candidate as , the comparison between and other arms is similar to a traditional MAB problem with expected reward (). For the case of , we only need to care about two differences: first, is fixed and known; second, in addition to , arm and arm could also be compared when . By capturing the second difference with , we can leverage the techniques in Agrawal2013AISTATS:TS2 to prove our results.
Specifically, the second term can be bounded by using the concentration property of the Thompson samples. Letting , similar to the proof of Lemma 4 in Agrawal2013AISTATS:TS2 , we have
The third term can be bounded similarly to Lemma 3 in Agrawal2013AISTATS:TS2 . Specifically, let be the slot index when and are compared for the -th time, including both cases and . Let . Then, is fixed between and , and (it is 0 if the -th comparison is implemented in the form of ). Then