The stochastic multi-armed bandit (MAB) problem is a sequential decision-making problem in which an agent repeatedly chooses one option from a set of alternatives, often called arms. At each round, the agent receives a random reward that depends on the selected arm, and the goal is to maximize the cumulative reward. This problem has been studied extensively for many years from both theoretical and practical perspectives. Numerous algorithms have been proposed for the problem (Thompson; Auer) and applied to various fields, including the design of clinical trials (Villar et al.), economics (Rothschild), and crowdsourcing (Zhou et al.).
The dueling bandit (DB) problem (Yue et al.) is a variant of the MAB problem in which an agent observes only the result of a “duel”, a noisy comparison between two selected arms. While the MAB problem assumes that the feedback is numeric, the DB problem only assumes that the arms are comparable based on the feedback. It is therefore useful when numeric feedback is not available, as in information retrieval and clinical trials, where the feedback is qualitative by nature.
Even in the case where the numeric feedback is not available, we may still have access to qualitative feedback. For example, in information retrieval, users might report the relevance of a document returned by a system on a scale of “Irrelevant”—“Partially Relevant”—“Relevant”. In such a situation, we can consider a special kind of the DB problem first introduced by Busa-Fekete et al. , which we call the qualitative DB (QDB) problem.
In the QDB problem, an agent pulls one arm at each round and observes qualitative feedback. Although no duel is conducted explicitly in the QDB problem, an algorithm is evaluated by the same criterion as in the DB problem. Here, the probability that an arm wins a duel against another arm corresponds to the probability that it receives higher qualitative feedback than the other. Therefore, we can adapt any algorithm for the DB problem to the QDB problem by converting the feedback from every two rounds into the result of one duel.
However, this reduction significantly worsens performance because it discards information: in the QDB problem, the winning probability can be calculated directly from the estimated feedback distributions. Busa-Fekete et al. also partially considered this problem, and they succeeded in improving the performance of the classic DB algorithms by constructing a tight confidence bound. However, they still use the same exploration strategy as the classic DB algorithms. In this paper, we show that we can further improve the performance by designing a special exploration strategy for the QDB problem.
Several definitions of the “best arm” have been proposed for the DB problem. In this paper, we consider two types of winners, the Condorcet winner and the Borda winner, both of which are defined in Section 3, and we propose algorithms for each winner. The proposed algorithms are inspired by algorithms for the MAB problem, namely Thompson sampling (Thompson) and the upper confidence bound (UCB) algorithm (Auer). Interestingly, the algorithm based on Thompson sampling, one of the most popular algorithms for the MAB problem, only works under the Condorcet winner criterion and suffers polynomial regret on a specific instance under the Borda winner criterion.
The paper is structured as follows. After discussing the related work in Section 2, we formulate the QDB problem in detail in Section 3. We introduce the two formulations of the QDB problem and propose algorithms for these problems in Sections 4 and 5. Lastly, we show the empirical results for the information retrieval setting in Section 6.
2 Related Work
Two lines of research are related to the QDB problem. The first is the DB problem (Yue et al.), which is the MAB problem with feedback given in the form of a noisy comparison between two arms. Much research has been conducted on this problem, and some of it discusses specific comparison models. For example, Hofmann et al. discussed the case where the duel is carried out by interleaved comparison with some user model, and Yue et al. introduced the Bradley-Terry model. Several of these models involve random variables corresponding to the utilities associated with arms, and the result of the duel is determined by the order of these variables. For example, the Gaussian model (Yue et al.) is the case where the random variables follow a Gaussian distribution, and Busa-Fekete et al. considered the case where the random variables take values in a partially ordered set, as in the QDB problem.
In the DB problem, the definition of the “best arm” is no longer straightforward because cyclic preferences may exist. Although early work on the DB problem assumed a total order on the arms to ensure the existence of a maximal element, recent work has mainly sought to design algorithms for finding the Condorcet winner (Urvoy et al.), which is the arm that beats every other arm with probability at least 1/2. This definition can be regarded as a natural generalization of the maximal element, since the Condorcet winner reduces to the maximal element when a total order exists. A number of algorithms have been proposed for the Condorcet winner (Urvoy et al.; Komiyama et al.; Wu and Liu).
A drawback of this formulation is that the Condorcet winner does not always exist. In such cases, we may introduce other notions of winners, such as the Borda winner (Urvoy et al.) and the Copeland set (Zoghi et al.). Ramamohan et al. introduced numerous notions of winners other than the Condorcet winner.
The other line of related work is the qualitative multi-armed bandit (QMAB) problem (Szorenyi et al.), in which an agent also receives qualitative feedback according to the chosen arm. The difference is that the QDB problem handles the winners defined in the classic DB problem, while the QMAB problem introduces its own definition of a “winner”: the arm with the highest τ-quantile of the feedback distribution for a prespecified quantile level τ.
This definition is, however, sometimes problematic since it ignores differences in the feedback distribution below the τ-quantile. Consider the case where we have two medicines, A and B, and want to determine which has fewer side effects. We can then perform clinical trials and obtain feedback from patients about the severity of the side effects.
Assume that the feedback is reported on the three-level scale “No side effect”, “Moderate”, “Severe” and that the true probabilities of each feedback are as shown in Table 1. Then, we can clearly conclude that medicine A is preferable since it has a lower probability of causing a severe side effect, and in fact, medicine A is the winner in the formulation of the QDB problem. However, the QMAB problem regards these medicines as equally good unless since the τ-quantile feedback is the same. Nevertheless, setting is almost impossible in practice since we do not have access to the true probabilities beforehand.
On the other hand, the definitions of winners considered in the QDB problem are well studied in the context of voting theory (see Charon and Hudry for a survey), and they do not have any hyper-parameter to define the problem itself. This makes our algorithms more applicable to real-world problems.
|No side effect||Moderate||Severe|
3 Problem Formulation
We formulate the QDB problem in this section. As in the MAB problem, we consider arms associated with feedback distributions , and at each round , the agent chooses one arm and receives feedback sampled from distribution . While the MAB problem assumes to be distributions on real values, the QDB considers qualitative feedback which corresponds to the case where are the distributions on the totally ordered set , where is the set of possible feedback and denotes a total order between feedback. For simplicity, we assume that and total order corresponds to order relation , which means . Thus, distributions are all categorical, supports of which are . Note that even though the rewards are nominal for notational simplicity, the sum of the feedback has no meaning in the QDB setting.
The QDB problem aims to minimize the same regret as the classic DB problem, which is defined based on pairwise comparison. Following early work Busa-Fekete et al. , we characterize , the probability of arm winning over arm , as
where and are mutually independent random variables following distributions and , respectively.
We consider two types of winners in this paper. The first is the Condorcet winner, which is the arm that beats every other arm with probability at least 1/2. Formally, arm is the Condorcet winner if for all . We denote the Condorcet winner as , and the goal of the QDB problem when employing the Condorcet winner is to minimize the following regret:
The second winner is the Borda winner, which is the arm with the largest Borda score, the average of the winning probabilities against other arms. Formally, the Borda score for arm is defined as
and thus the Borda winner is . The regret to minimize in this case is formulated as
The QDB problem can be solved by any algorithm for the classic DB since the same regret is used between them. Algorithms for the DB problem specify two arms to compare at each round and receive a result of the noisy comparison generated from , where
is the Bernoulli distribution with success probability. This comparison can be simulated in the QDB problem as follows: We observe and by pulling both arms and return which or occurred with the ties broken at random.
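The simulation just described can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names `sample_feedback` and `simulate_duel`, the encoding of the feedback levels as integers 1, ..., L, and the two example distributions are assumptions introduced here.

```python
import random

def sample_feedback(dist, rng):
    """Draw one ordinal feedback value in {1, ..., L} from a categorical distribution."""
    levels = list(range(1, len(dist) + 1))
    return rng.choices(levels, weights=dist, k=1)[0]

def simulate_duel(dist_i, dist_j, rng):
    """Return True if arm i wins a duel simulated from one pull of each arm."""
    x_i = sample_feedback(dist_i, rng)
    x_j = sample_feedback(dist_j, rng)
    if x_i != x_j:
        return x_i > x_j
    return rng.random() < 0.5  # tie broken uniformly at random

# Hypothetical example distributions over three feedback levels.
rng = random.Random(0)
nu_1 = [0.2, 0.3, 0.5]
nu_2 = [0.5, 0.3, 0.2]
n_duels = 20000
wins = sum(simulate_duel(nu_1, nu_2, rng) for _ in range(n_duels))
empirical_win_rate = wins / n_duels  # close to 0.55 + 0.5 * 0.29 = 0.695
```

Note that each simulated duel consumes two pulls but yields only one bit of information, which is the inefficiency the following discussion addresses.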
However, in the QDB problem, we can directly estimate from the feedback distribution of each arm, which significantly enhances exploration. Considering that are all categorical distributions on , we have another representation for given by
where . Let be the probability simplex , and we define function as
Hence, for .
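As a sketch of this direct computation, the winning probability of one categorical distribution over another can be written as the probability of strictly higher feedback plus half the probability of a tie, consistent with the random tie-breaking described above. The code below then derives the Condorcet winner and the Borda scores from the resulting preference matrix. The function names, the convention that index 0 is the lowest feedback level, and the example distributions are assumptions made for illustration.

```python
import numpy as np

def win_prob(nu_i, nu_j):
    """P(X_i > X_j) + 0.5 * P(X_i = X_j) for independent ordinal feedback."""
    nu_i = np.asarray(nu_i, dtype=float)
    nu_j = np.asarray(nu_j, dtype=float)
    # below_j[x] = P(X_j is strictly below level x)
    below_j = np.concatenate(([0.0], np.cumsum(nu_j)[:-1]))
    return float(np.sum(nu_i * (below_j + 0.5 * nu_j)))

def preference_matrix(dists):
    """Matrix whose (i, j) entry is the probability that arm i beats arm j."""
    K = len(dists)
    return np.array([[win_prob(dists[i], dists[j]) for j in range(K)]
                     for i in range(K)])

def condorcet_winner(P):
    """Index of an arm that beats every other arm with probability >= 1/2, or None."""
    K = P.shape[0]
    for i in range(K):
        if all(P[i, j] >= 0.5 for j in range(K) if j != i):
            return i
    return None

def borda_scores(P):
    """Average winning probability of each arm against the other arms."""
    K = P.shape[0]
    return (P.sum(axis=1) - np.diag(P)) / (K - 1)

# Hypothetical instance with three arms and three feedback levels.
dists = [[0.2, 0.3, 0.5], [0.5, 0.3, 0.2], [0.3, 0.4, 0.3]]
P = preference_matrix(dists)
print(condorcet_winner(P), borda_scores(P))
```

In this instance arm 0 is both the Condorcet winner and the Borda winner; the two notions need not coincide in general, and `condorcet_winner` returns `None` when preferences are cyclic.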
4 Qualitative Dueling Bandit with the Condorcet Winner
In this section, we propose an algorithm for the QDB problem with the Condorcet winner. The algorithm is called Thompson Condorcet sampling, which is based on Thompson sampling Thompson , an algorithm famous for its good performance in the standard MAB problem and wide applicability to many other problems.
This algorithm maintains Bayesian posterior distributions of defined in Section 3. We employ the Dirichlet distribution
as the prior distribution, the probability density function of which is
where is the gamma function.
Having Dirichlet distributions as priors is a convenient choice when observations are sampled from a categorical distribution. Let
be the vector representing the observation until the-th round, where represents the number of times that the feedback is observed when arm is pulled. If we employ the prior distribution as , then the posterior distribution given observations is . For notational simplicity, we sometimes denote as when the round is obvious from the context.
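The conjugacy used here amounts to adding observed counts to the prior parameters. The snippet below sketches this update for a single arm; the uniform prior (all parameters equal to one), the number of feedback levels, and the observation sequence are assumptions chosen only for illustration.

```python
import numpy as np

L = 3                          # number of feedback levels (assumed)
prior = np.ones(L)             # assumed uniform Dirichlet prior
counts = np.zeros(L)

# Hypothetical feedback observed when pulling one arm (levels 1..L).
for x in [3, 1, 3, 2, 3]:
    counts[x - 1] += 1

# Dirichlet prior + categorical observations -> Dirichlet posterior.
posterior = prior + counts
posterior_mean = posterior / posterior.sum()

# One posterior sample of the arm's feedback distribution.
rng = np.random.default_rng(0)
theta = rng.dirichlet(posterior)
```

The posterior mean is a smoothed version of the empirical frequencies, and a draw such as `theta` is what a Thompson-sampling style algorithm would use in place of the unknown feedback distribution.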
The entire algorithm is shown in Algorithm 1. At each round , the algorithm samples from posterior distributions of , and pulls the Condorcet winner in . If the Condorcet winner does not exist, the algorithm samples again.
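The loop described above can be sketched as follows. This is a simplified illustration and not a transcription of Algorithm 1: the uniform Dirichlet prior, the cap `max_resample` on the number of resampling attempts, the uniform fallback when the cap is reached, and the example instance are all assumptions introduced here.

```python
import numpy as np

def win_prob(nu_i, nu_j):
    """P(X_i > X_j) + 0.5 * P(X_i = X_j); index 0 is the lowest level."""
    below_j = np.concatenate(([0.0], np.cumsum(nu_j)[:-1]))
    return float(np.sum(nu_i * (below_j + 0.5 * nu_j)))

def sampled_condorcet_winner(thetas):
    """Condorcet winner among sampled distributions, or None if there is none."""
    K = len(thetas)
    for i in range(K):
        if all(win_prob(thetas[i], thetas[j]) >= 0.5
               for j in range(K) if j != i):
            return i
    return None

def thompson_condorcet(dists, horizon, rng, max_resample=100):
    dists = np.asarray(dists, dtype=float)
    K, L = dists.shape
    alpha = np.ones((K, L))            # assumed uniform Dirichlet prior
    pulls = np.zeros(K, dtype=int)
    for _round in range(horizon):
        arm = None
        for _attempt in range(max_resample):
            thetas = [rng.dirichlet(alpha[i]) for i in range(K)]
            arm = sampled_condorcet_winner(thetas)
            if arm is not None:
                break                  # a sampled Condorcet winner exists
        if arm is None:                # assumed fallback, not from the paper
            arm = int(rng.integers(K))
        x = rng.choice(L, p=dists[arm])  # observe feedback (0-based level)
        alpha[arm, x] += 1               # posterior update
        pulls[arm] += 1
    return pulls

# Hypothetical instance in which arm 0 is the Condorcet winner.
dists = [[0.2, 0.3, 0.5], [0.5, 0.3, 0.2], [0.3, 0.4, 0.3]]
pulls = thompson_condorcet(dists, horizon=3000, rng=np.random.default_rng(0))
```

Because each round pulls a single arm and updates only that arm's posterior, exploration is driven entirely by the randomness of the posterior samples.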
for Kullback-Leibler (KL) divergence . Then, the regret of Thompson Condorcet sampling is bounded as follows.
The proof is given in Appendix B, where the detailed condition on and the precise form of the bound are also provided. From the precise form of (2), given as (23) in Appendix B, one can see that this regret bound grows exponentially with the number of arms . However, this is not an inherent limitation of Thompson Condorcet sampling but an artifact of pursuing the optimal asymptotic dependence on . As we will show in Section 6, this exponential increase in the regret does not occur in practice, and the algorithm works well for relatively large .
The regret bound has a form similar to the information-theoretic lower bound in MAB problems for multi-parameter models (Burnetas and Katehakis). Note that considering distributions is essential in this case, whereas they are replaced with the distribution of the optimal arm in the regret bound of Thompson sampling in the MAB problem with the Bernoulli model given by Agrawal and Goyal. For example, when and , we have as .
Theorem 1 suggests the possibility of Thompson Condorcet sampling performing drastically better than the case when we apply classic DB algorithms for the QDB problem in the way discussed in Section 3. The regret lower bound of such direct applications immediately follows from the lower bound for the classic DB problem given by Komiyama et al. .
Proposition 1 (Adapted from Komiyama et al., 2015).
When we apply any consistent algorithm for the DB problem to the QDB problem, we have
From the upper bound given in Theorem 1, we have
which can be arbitrarily smaller than (3) as stated in the next lemma.
Assume that . For any fixed , there exist such that
The proof can be found in Appendix B. Lemma 1 shows that there exist cases where Thompson Condorcet sampling performs arbitrarily better than the direct application of any algorithm for the DB problem. This implies that the algorithm successfully incorporates the qualitative information to reduce the regret in the DB problem.
5 Qualitative Dueling Bandit with the Borda Winner
In this section, we study two algorithms for the QDB problem with the Borda winner: one based on Thompson sampling, called Thompson Borda sampling, and the other based on the UCB algorithm (Auer), called Borda-UCB. Despite the success of Thompson Condorcet sampling, our theoretical analysis reveals that Thompson Borda sampling can have polynomial regret in some settings. On the other hand, Borda-UCB achieves logarithmic regret, which matches the regret lower bound of the classic DB problem.
Thompson Borda sampling, given in Algorithm 2, is similar to Thompson Condorcet sampling. The only difference is that Thompson Borda sampling pulls the Borda winner in samples . Since a Borda winner always exists for any samples , we do not need resampling. Although it works surprisingly well empirically, as we will see in Section 6, we prove that it suffers from polynomial regret in the worst case.
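A compact sketch of this procedure is given below. It is an illustration under assumptions rather than a transcription of Algorithm 2: the uniform Dirichlet prior, the helper names, and the example instance are introduced here.

```python
import numpy as np

def win_prob(nu_i, nu_j):
    """P(X_i > X_j) + 0.5 * P(X_i = X_j); index 0 is the lowest level."""
    below_j = np.concatenate(([0.0], np.cumsum(nu_j)[:-1]))
    return float(np.sum(nu_i * (below_j + 0.5 * nu_j)))

def borda_scores(thetas):
    """Average winning probability of each arm against all other arms."""
    K = len(thetas)
    return np.array([
        np.mean([win_prob(thetas[i], thetas[j]) for j in range(K) if j != i])
        for i in range(K)
    ])

def thompson_borda(dists, horizon, rng):
    dists = np.asarray(dists, dtype=float)
    K, L = dists.shape
    alpha = np.ones((K, L))              # assumed uniform Dirichlet prior
    pulls = np.zeros(K, dtype=int)
    for _round in range(horizon):
        thetas = [rng.dirichlet(alpha[i]) for i in range(K)]
        arm = int(np.argmax(borda_scores(thetas)))  # always well defined
        x = rng.choice(L, p=dists[arm])  # observe feedback (0-based level)
        alpha[arm, x] += 1
        pulls[arm] += 1
    return pulls

# Hypothetical instance in which arm 0 has the largest Borda score.
dists = [[0.2, 0.3, 0.5], [0.5, 0.3, 0.2], [0.3, 0.4, 0.3]]
pulls = thompson_borda(dists, horizon=3000, rng=np.random.default_rng(0))
```

Unlike the Condorcet case, the arg max over sampled Borda scores always exists, so no resampling loop is needed. The sketch also makes visible why under-exploration can occur: an arm's score depends on the estimated distributions of all other arms, yet only the pulled arm's posterior is refined.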
Assume that there are arms such that arm is the Borda winner. Then, there exists such that under Thompson Borda sampling with , , and , the statement
holds for some constants .
The proof can be found in Appendix C. The situation considered in Theorem 2 may be somewhat unrealistic since we assume that and are known beforehand. However, we will show by an experiment in Section 6 that Thompson Borda sampling actually suffers from polynomial regret without such an assumption.
The other proposed algorithm, Borda-UCB, shown in Algorithm 3, is based on the UCB algorithm (Auer). As in the original UCB algorithm, we consider the upper confidence bound for each arm , where is an estimated Borda score, and
is the width of the confidence interval controlled by a positive parameter. Let be the arm with the largest upper confidence bound. While the original UCB algorithm always pulls the arm with the largest upper confidence bound, Borda-UCB pulls all arms that do not belong to , the set of arms that were pulled the most, if does not belong to . This exploration strategy reflects the fact that we have to estimate all feedback distributions accurately in order to have the precise estimation of the Borda score.
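The exploration rule described above can be sketched as follows. This is a rough illustration under several assumptions that are not taken from Algorithm 3: the confidence width `sqrt(alpha * log(t) / n)` with a hypothetical parameter `alpha`, the initialization that pulls every arm once, and the choice to pull the arm with the largest upper confidence bound when it already belongs to the most-pulled set.

```python
import numpy as np

def win_prob(nu_i, nu_j):
    """P(X_i > X_j) + 0.5 * P(X_i = X_j); index 0 is the lowest level."""
    below_j = np.concatenate(([0.0], np.cumsum(nu_j)[:-1]))
    return float(np.sum(nu_i * (below_j + 0.5 * nu_j)))

def borda_scores(freqs):
    K = len(freqs)
    return np.array([
        np.mean([win_prob(freqs[i], freqs[j]) for j in range(K) if j != i])
        for i in range(K)
    ])

def borda_ucb(dists, horizon, rng, alpha=0.5):
    dists = np.asarray(dists, dtype=float)
    K, L = dists.shape
    counts = np.zeros((K, L))

    def pull(arm):
        counts[arm, rng.choice(L, p=dists[arm])] += 1

    for arm in range(K):                 # assumed initialization
        pull(arm)
    t = K
    while t < horizon:
        n = counts.sum(axis=1)
        freqs = counts / n[:, None]      # empirical feedback distributions
        width = np.sqrt(alpha * np.log(t) / n)   # assumed form of the width
        i_hat = int(np.argmax(borda_scores(freqs) + width))
        most = set(np.flatnonzero(n == n.max()).tolist())
        if i_hat in most:
            to_pull = [i_hat]
        else:                            # pull every arm outside the most-pulled set
            to_pull = [a for a in range(K) if a not in most]
        for arm in to_pull:
            if t >= horizon:
                break
            pull(arm)
            t += 1
    return counts.sum(axis=1).astype(int)

# Hypothetical instance in which arm 0 has the largest Borda score.
dists = [[0.2, 0.3, 0.5], [0.5, 0.3, 0.2], [0.3, 0.4, 0.3]]
pulls = borda_ucb(dists, horizon=3000, rng=np.random.default_rng(0))
```

Pulling all arms outside the most-pulled set, rather than only the arm with the largest bound, keeps the estimates of every feedback distribution improving together, which is what an accurate Borda score requires.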
The regret of Borda-UCB is bounded as follows.
Assume that is set as
for arbitrarily taken . Then, for any , the regret of Borda-UCB is bounded as
for some constants , where and .
The proof is presented in Appendix D, where the explicit forms of and are also provided. The regret bound in Theorem 3 is simplified to when for all , while the regret of the original UCB algorithm is Auer , which is smaller by . However, this difference is inevitable, as proved in the following theorem.
Consider two instances of the QDB problem with , in which the feedback distributions of the arms are represented as and . Let and be the regret in each instance. Then, there exists a pair of instances such that all algorithms that achieve
for all constant satisfy
where defined on .
The proof is presented in Appendix E. This theorem states that if the algorithm achieves sub-polynomial regret for all instances of the QDB problem with the Borda winner, there exists a case where it suffers from regret. Therefore, we can conclude that the difference in the regret upper-bound between the original UCB and Borda-UCB comes from the characteristic of the QDB problem.
The upper bound in Theorem 3 matches the regret lower bound in the classic DB problem, which is considered in the context of the -PAC DB problem (Jamieson et al.). An algorithm is called -PAC if it finds the Borda winner with failure probability less than . We have the following bound on the minimum number of samples required by such -PAC algorithms.
Proposition 2 (Theorem 1; Jamieson et al., 2015).
Let be the total number of pulls. If and for all , then any -PAC DB algorithm with has
Existing algorithms for the Borda winner (Busa-Fekete et al.; Jamieson et al.) use a -PAC DB algorithm as a subroutine. They first run such an algorithm with and then pull the estimated Borda winner in the remaining rounds. Therefore, the regret of such algorithms is at least from Proposition 2, and hence the regret upper bound of Borda-UCB is no worse than this lower bound.
Although we were not able to prove that the regret of Borda-UCB is smaller than that of the direct application of classic DB algorithms, Borda-UCB performs better than them empirically, as we will see in Section 6. Furthermore, Borda-UCB has another advantage: it does not require to be specified. Since existing algorithms run a -PAC algorithm, they require the number of rounds to be known beforehand. However, it is often difficult to guess beforehand, and thus our algorithms are more useful in practice.
6 Experiments
We test the empirical performance of the proposed algorithms through experiments on both real-world data and a synthetic setting. We first conduct experiments based on real-world web search datasets that were also used in previous work. In these experiments, our methods significantly outperform the direct application of the existing algorithms for the classic DB problem. Then, we show the results of experiments in a synthetic setting in which Thompson Borda sampling has polynomial regret.
Experiments on a Real-World Dataset
We apply the proposed methods to the problem of ranker evaluation from the field of information retrieval, which is used for evaluating algorithms for the classic DB problem in Jamieson et al. The task is to identify the best ranker, where a ranker takes a user’s search query as input and ranks the documents according to their relevance to that query.
We used two web search datasets. The first is the MSLR-WEB10K dataset Qin et al. , which consists of 10,000 search queries over the documents from search results. The data also contains the values of 136 features and a corresponding user-labeled relevance factor on a scale of one to five with respect to each query-document pair. The other is the MQ2008 dataset Qin and Liu  that contains 46 features and a relevance factor labelled from one to three for each query-document pair. As in Jamieson et al. , we only consider rankers that use one feature to rank documents. Therefore, the aim of the task is to determine which feature is the most capable of predicting the relevance of query-document pairs.
Although Jamieson et al. set up the classic DB problem from these datasets, we can naturally formulate the QDB problem as well since we have access to the relevance factors. The qualitative feedback is generated as follows. At each round, the algorithm selects one ranker, which ranks the documents for a randomly chosen query. The relevance factor of the top-ranked document is revealed to the algorithm as the qualitative feedback. Therefore, there are five feedback levels in the MSLR-WEB10K dataset and three in the MQ2008 dataset. We compare the regret of the proposed algorithms with that of the direct application of the classic DB algorithms, which corresponds to the experiments conducted in Jamieson et al. We repeat 100 runs for each instance and report the mean regret.
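The feedback-generation procedure can be illustrated with a small self-contained sketch. The data below are synthetic stand-ins and not the MSLR-WEB10K or MQ2008 datasets; the array shapes, the function name `pull_ranker`, and the convention that a larger feature value ranks a document higher are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: relevance labels on a scale of one to five and
# per-document feature values for every query (hypothetical sizes).
n_queries, n_docs, n_features = 50, 10, 4
relevance = rng.integers(1, 6, size=(n_queries, n_docs))
features = rng.normal(size=(n_queries, n_docs, n_features))
features[:, :, 0] += relevance       # make feature 0 informative by construction

def pull_ranker(k, rng):
    """Pull arm k: rank a random query's documents by feature k and
    return the relevance label of the top-ranked document."""
    q = rng.integers(n_queries)
    top_doc = int(np.argmax(features[q, :, k]))
    return int(relevance[q, top_doc])

# Mean feedback of an informative ranker versus an uninformative one.
fb_informative = [pull_ranker(0, rng) for _ in range(2000)]
fb_noise = [pull_ranker(3, rng) for _ in range(2000)]
print(np.mean(fb_informative), np.mean(fb_noise))
```

Each single-feature ranker is one arm, and the relevance label it returns plays the role of the ordinal feedback on which the QDB algorithms operate.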
Experiments for Condorcet Winner
We first show the experimental results for the QDB problem with the Condorcet winner. We compare Thompson Condorcet sampling with RUCB (Zoghi et al.) and RMED1, RMED2, and RMED2F (Komiyama et al.), which are all promising algorithms proposed for the classic DB problem with the Condorcet winner. We set , and Figure 1 shows the experimental result when the number of rankers is .
Figure 1 shows the superiority of Thompson Condorcet sampling. Furthermore, we can observe that all existing algorithms incur large regret in early rounds while Thompson Condorcet sampling does not. This is because most algorithms for the DB problem construct a set of candidates for the Condorcet winner and explore it in the first part of the rounds, whereas Thompson Condorcet sampling conducts exploration and exploitation at the same time and does not require such a set. In this sense, Thompson Condorcet sampling performs more stably than the existing methods.
To see how the performance of Thompson Condorcet sampling depends on the number of arms, we tried a setting with a relatively large number of arms. The result is shown in Figure 2, in which Thompson Condorcet sampling still outperforms the classic DB algorithms even though the regret upper bound proved in Theorem 1 grows exponentially with . This result supports the argument that the exponential dependency on is just an artifact of pursuing the best regret bound in the asymptotic case and that Thompson Condorcet sampling empirically performs much better than the theoretical analysis suggests.
Experiments for Borda Winner
For the Borda setting, we compare our proposed methods, Thompson Borda sampling and Borda-UCB, with the existing classic DB algorithm SSSE (Busa-Fekete et al.). Furthermore, we also conduct a comparison with an extension of SSSE, which we call QSSSE, proposed in Busa-Fekete et al. to utilize the qualitative feedback explicitly.
The result is shown in Figure 3, which shows the superiority of the proposed methods. As in the Condorcet case, SSSE and QSSSE suffer from large regret in the early stage, while the regret of the proposed algorithms always increases logarithmically. This is because the existing methods first only explore, while the proposed methods always balance exploration and exploitation. Although the existing methods achieve zero regret after the exploration, this does not mean that they perform better than Borda-UCB in since they require a longer exploration phase.
Surprisingly, Thompson Borda sampling works quite well in this setting, even though Theorem 2 states that it has polynomial regret in the worst case. We suspect that such a worst case is rarely encountered in practice, but the condition for sub-polynomial regret is unknown and left to future work.
Experiments on a Synthetic Setting
Theorem 2 proves that Thompson Borda sampling can incur polynomial regret for some instances, which we confirm through the following experiments. We set up the instance with and , in which each feedback distribution is represented as , , and . We run Thompson Borda sampling and Borda-UCB on this instance 10 times, and the mean regret is shown in Figure 4.
From Figure 4, we can clearly see that Thompson Borda sampling suffers from polynomial regret, while Borda-UCB still has sub-polynomial regret. However, it takes many rounds for Borda-UCB to have less regret than Thompson Borda sampling. This is because Thompson Borda sampling explores less than necessary. In early rounds, Borda-UCB pulls arm 3 many times, which is necessary for identifying the Borda winner but incurs large regret. On the other hand, Thompson Borda sampling exploits arms 1 and 2 more, which leads to its superior performance in early rounds.
7 Conclusion
In this paper, we formulated and studied a novel type of the dueling bandit, called a qualitative dueling bandit. In this problem, an agent receives qualitative feedback at each round and aims to minimize the same regret as the classic DB when the duel is carried out based on that feedback.
We considered two notions of winners, the Condorcet winner and the Borda winner. For the Condorcet winner, we proposed an algorithm called Thompson Condorcet sampling, and we showed that its regret can be arbitrarily smaller than that of the direct application of classic DB algorithms. Thompson Condorcet sampling also exhibited superior performance in the experiments based on real-world web search datasets.
For the Borda winner, we studied two algorithms, Thompson Borda sampling and Borda-UCB. Although the theoretical analysis reveals that Thompson Borda sampling can have polynomial regret in some instances, the experiments showed that it performs surprisingly well empirically, especially when the number of rounds is not very large. On the other hand, we proved a logarithmic regret upper bound for Borda-UCB, which is no worse than the regret lower bound in the classic DB problem.
As future work, it is important to derive general algorithms that can handle various notions of winners as in Ramamohan et al. . Another promising direction is to improve the algorithms for the Borda winner and achieve regret significantly smaller than the classic DB as Thompson Condorcet sampling does in the Condorcet winner case.
LX utilized the facility provided by Masason Foundation. JH acknowledges support by KAKENHI 18K17998, and MS acknowledges support by KAKENHI 17H00757.
- Agrawal and Goyal  S. Agrawal and N. Goyal. Further optimal regret bounds for Thompson sampling. In Proceedings of the 16th International Conference on Artificial Intelligence and Statistics, pages 99–107, 2013.
- Auer  P. Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3:397–422, March 2003. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=944919.944941.
- Burnetas and Katehakis  A. N. Burnetas and M. N. Katehakis. Optimal adaptive policies for sequential allocation problems. Advances in Applied Mathematics, 17(2):122 – 142, 1996.
- Busa-Fekete et al.  R. Busa-Fekete, B. Szörényi, P. Weng, W. Cheng, and E. Hüllermeier. Top-k selection based on adaptive sampling of noisy preferences. In Proceedings of the 30th International Conference on Machine Learning, pages 1094–1102, 2013.
- Charon and Hudry  I. Charon and O. Hudry. An updated survey on the linear ordering problem for weighted or unweighted tournaments. Annals of Operations Research, 175(1):107–158, March 2010.
- Hofmann et al.  K. Hofmann, S. Whiteson, and M. de Rijke. A probabilistic method for inferring preferences from clicks. In Proceedings of the 20th International Conference on Information and Knowledge Management, pages 249–258, 2011.
- Honda and Takemura  J. Honda and A. Takemura. Optimality of Thompson sampling for Gaussian bandits depends on priors. In Proceedings of the 17th International Conference on Artificial Intelligence and Statistics, pages 375–383, 2014.
- Jamieson et al.  K. Jamieson, S. Katariya, A. Deshpande, and R. Nowak. Sparse dueling bandits. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, pages 416–424, 2015.
- Komiyama et al.  J. Komiyama, J. Honda, H. Kashima, and H. Nakagawa. Regret lower bound and optimal algorithm in dueling bandit problem. In Proceedings of The 28th Conference on Learning Theory, pages 1141–1154, 2015.
- Lai and Robbins  T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4–22, 1985. ISSN 10902074. doi: 10.1016/0196-8858(85)90002-8.
- Massart  P. Massart. The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality. The Annals of Probability, 18(3):1269–1283, July 1990.
- Olver et al.  F. W. Olver, D. W. Lozier, R. F. Boisvert, and C. W. Clark. NIST Handbook of Mathematical Functions. Cambridge University Press, New York, NY, USA, 1st edition, 2010. ISBN 0521140633, 9780521140638.
- Qin and Liu  T. Qin and T. Liu. Introducing LETOR 4.0 datasets. CoRR, abs/1306.2597, 2013.
- Qin et al.  T. Qin, T.-Y. Liu, J. Xu, and H. Li. LETOR: A benchmark collection for research on learning to rank for information retrieval. Information Retrieval, 13(4):346–374, Aug 2010. ISSN 1386-4564. doi: 10.1007/s10791-009-9123-y. URL http://dx.doi.org/10.1007/s10791-009-9123-y.
- Ramamohan et al.  S. Y. Ramamohan, A. Rajkumar, and S. Agarwal. Dueling bandits: Beyond Condorcet winners to general tournament solutions. In Advances in Neural Information Processing Systems 29, pages 1253–1261, 2016.
- Rothschild  M. Rothschild. A two-armed bandit theory of market pricing. Journal of Economic Theory, 9:185 – 202, 1974. ISSN 0022-0531.
- Szorenyi et al.  B. Szorenyi, R. Busa-Fekete, P. Weng, and E. Hüllermeier. Qualitative multi-armed bandits: A quantile-based approach. In Proceedings of the 32nd International Conference on Machine Learning, pages 1660–1668, 2015.
- Thompson  W. R. Thompson. On the likelihood that one unknown probability exceeds another in the view of the evidence of two samples. Biometrika, 25(3-4):285–294, 1933.
- Urvoy et al.  T. Urvoy, F. Clerot, R. Féraud, and S. Naamane. Generic exploration and K-armed voting bandits. In Proceedings of the 30th International Conference on Machine Learning, pages 1191–1199, 2013.
- van der Vaart and Wellner  A. van der Vaart and J. Wellner. Weak Convergence and Empirical Processes: With Applications to Statistics (Springer Series in Statistics). Springer, 2000. ISBN 0387946403.
- Villar et al.  S. S. Villar, J. Bowden, and J. Wason. Multi-armed bandit models for the optimal design of clinical trials: Benefits and challenges. Statistical Science, 30:199–215, May 2015. doi: 10.1214/14-STS504.
- Wu and Liu  H. Wu and X. Liu. Double Thompson sampling for dueling bandits. In Advances in Neural Information Processing Systems 30, pages 649–657, 2016.
- Yue et al.  Y. Yue, J. Broder, R. Kleinberg, and T. Joachims. The k-armed dueling bandits problem. Journal of Computer and System Sciences, 78(5):1538 – 1556, 2012.
- Zhou et al.  Y. Zhou, X. Chen, and J. Li. Optimal PAC multiple arm identification with applications to crowdsourcing. In Proceedings of the 31st International Conference on Machine Learning, pages 217–225, 2014.
- Zoghi et al.  M. Zoghi, S. Whiteson, R. Munos, and M. de Rijke. Relative upper confidence bound for the K-armed dueling bandit problem. In Proceedings of the 31st International Conference on Machine Learning, pages 10–18, 2014.
- Zoghi et al.  M. Zoghi, Z. Karnin, S. Whiteson, and M. de Rijke. Copeland dueling bandits. In Advances in Neural Information Processing Systems 28, pages 307–315, 2015.
Appendix A Preliminaries
In this section, we introduce concentration inequalities for multinomial distributions, which bound how far a random variable deviates from its expected value. The first inequality measures the deviation in terms of the KL divergence as follows.
Let us consider the random variable sampled from multinomial distribution . If we denote the true probability as and the empirical probability as , we have
for any and .
Proof of Lemma 2.
We also use the following inequality to handle the deviation measured by the norm.
Lemma 3 (Bretagnolle-Huber-Carol Inequality van der Vaart and Wellner ).
For defined in Lemma 2, we have
for any , where is the -norm of vector .
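For reference, the Bretagnolle-Huber-Carol inequality is commonly stated in the following form. The symbols are assumptions introduced here ($n$ for the number of samples, $L$ for the number of categories, $p$ and $\hat{p}_n$ for the true and empirical probability vectors), and the statement used in the analysis may differ in notation or constants.

```latex
\Pr\left[ \lVert \hat{p}_n - p \rVert_1 \ge \varepsilon \right]
  \le 2^{L} \exp\!\left( -\frac{n \varepsilon^2}{2} \right),
  \qquad \varepsilon > 0 .
```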
The last inequality is for the error in the cumulative distribution.
Lemma 4 (Dvoretzky-Kiefer-Wolfowitz inequality (Massart, 1990)).
For defined in Lemma 2, we have
for any .
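For reference, the Dvoretzky-Kiefer-Wolfowitz inequality with Massart's tight constant is usually written as below. The notation is assumed here: $F$ is the true cumulative distribution function, $\hat{F}_n$ is the empirical one built from $n$ independent samples, and the supremum runs over the feedback levels.

```latex
\Pr\left[ \sup_{x} \left\lvert \hat{F}_n(x) - F(x) \right\rvert > \varepsilon \right]
  \le 2 \exp\!\left( -2 n \varepsilon^2 \right),
  \qquad \varepsilon > 0 .
```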
Next, we introduce the concentration inequality for the Dirichlet distribution.
Let be a sample drawn from Dirichlet distribution for and . For all and , we have
where for defined in (7).
Using Pinsker’s inequality, we can derive the concentration inequality for -norm.
For and defined in Lemma 5, we have
for any .
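The step from a KL-divergence bound to an $\ell_1$ bound relies on Pinsker's inequality, recalled here in its standard form with the KL divergence measured in nats. The symbols $p$ and $q$ are generic probability vectors introduced for illustration; since the $\ell_1$ distance is symmetric, the same bound holds with the arguments of the divergence exchanged.

```latex
\lVert p - q \rVert_1 \le \sqrt{ 2\, D_{\mathrm{KL}}(p \,\Vert\, q) }
\quad\Longrightarrow\quad
\left\{ \lVert p - q \rVert_1 \ge \varepsilon \right\}
  \subseteq
\left\{ D_{\mathrm{KL}}(p \,\Vert\, q) \ge \frac{\varepsilon^2}{2} \right\} .
```

Consequently, any tail bound on the event that the divergence is at least $\varepsilon^2 / 2$ immediately yields a tail bound on the event that the $\ell_1$ distance is at least $\varepsilon$.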
Lastly, we state two simple lemmas, which are useful for analysis. The first is about the characteristic of function .
For and function defined in (1), we have
Here, is the -norm of defined as .
We can confirm it by simple calculation. The second is used for bounding the confidence bound.
Then, for all , we have
If we set as
Appendix B Proof of Theorem 1
We first introduce several events that are used in the proof. Let be the event