1 Introduction
The stochastic multi-armed bandit (MAB) problem is a sequential decision-making problem in which an agent repeatedly chooses one option from a set of alternatives, often called arms. At each round, the agent receives a random reward that depends on the selected arm, and the goal is to maximize the cumulative reward. This problem has been studied extensively for many years, from both theoretical and practical perspectives. Numerous algorithms have been proposed for the problem Thompson [1933], Auer [2003] and applied to various fields, including the design of clinical trials Villar et al. [2015], economics Rothschild [1974], and crowdsourcing Zhou et al. [2014].
The dueling bandit (DB) problem Yue et al. [2012] is a variant of the MAB problem in which an agent only observes the result of a “duel”, a noisy comparison between two selected arms. While the MAB problem assumes that the feedback is numeric, the DB problem only assumes that the arms are comparable based on the feedback. Therefore, it is useful in cases where numeric feedback is not available, such as information retrieval and clinical trials, in which the feedback is qualitative by nature.
Even in cases where numeric feedback is not available, we may still have access to qualitative feedback. For example, in information retrieval, users might report the relevance of a document returned by a system on a scale of “Irrelevant”, “Partially Relevant”, and “Relevant”. In such a situation, we can consider a special kind of the DB problem first introduced by Busa-Fekete et al. [2013], which we call the qualitative DB (QDB) problem.
In the QDB problem, an agent pulls one arm at each round and observes qualitative feedback. Although the duel is not conducted explicitly in the QDB problem, the algorithm is evaluated based on the same criterion as in the DB problem. Here, the probability of an arm winning a duel with another arm corresponds to the probability of the arm receiving higher qualitative feedback than the other. Therefore, we can adapt any algorithm for the DB problem to the QDB problem by converting the feedback of every two rounds into the result of one duel.
However, this reduction discards information and significantly worsens the performance, because in the QDB problem the winning probability can be calculated directly from the estimated feedback distributions. Busa-Fekete et al. [2013] also partially considered this problem, and they succeeded in improving the performance of the classic DB algorithms by constructing a tight confidence bound. However, they still use the same exploration strategy as the classic DB algorithms. In this paper, we show that we can further improve the performance by designing a special exploration strategy for the QDB problem.
Several definitions of the “best arm” have been proposed for the DB problem. In this paper, we consider two types of winners, the Condorcet winner and the Borda winner, both of which are defined in Section 3, and we propose algorithms for each winner. The proposed algorithms are inspired by algorithms for the MAB problem, namely Thompson sampling Thompson [1933] and the upper confidence bound (UCB) algorithm Auer [2003]. Interestingly, the algorithm based on Thompson sampling, one of the most popular algorithms for the MAB problem, only works under the criterion of the Condorcet winner and suffers polynomial regret in a specific instance under the criterion of the Borda winner.
The paper is structured as follows. After discussing the related work in Section 2, we formulate the QDB problem in detail in Section 3. We introduce the two formulations of the QDB problem and propose algorithms for these problems in Sections 4 and 5. Lastly, we show the empirical results for the information retrieval setting in Section 6.
2 Related Work
There are two lines of research related to the QDB problem. The first is the DB problem Yue et al. [2012], which is the MAB problem with feedback given in the form of a noisy comparison between two arms. Much research has been conducted on this problem, and some of it discusses specific comparison models. For example, Hofmann et al. [2011] discussed the case where the duel is carried out by interleaved comparison with some user model, and Yue et al. [2012] introduced the Bradley-Terry model. Among them, several models involve random variables corresponding to the utilities associated with arms, and the result of the duel is determined by the order of these variables. For example, the Gaussian model Yue et al. [2012] is the case where the random variables follow a Gaussian distribution, and Busa-Fekete et al. [2013] considered the case where the random variables take values on a partially ordered set, as in the QDB problem.
In the DB problem, the definition of the “best arm” is no longer straightforward because there may exist cyclic preferences. Although early work on the DB problem assumes a total order on arms to ensure the existence of the maximal element, recent work has mainly sought to design algorithms for finding the Condorcet winner Urvoy et al. [2013], which is the arm that wins over all the other arms with probability larger than or equal to $1/2$. This definition can be regarded as a natural generalization of the maximal element, since the Condorcet winner reduces to the maximal element when the total order exists. A number of algorithms have been proposed for the Condorcet winner Urvoy et al. [2013], Komiyama et al. [2015], Wu and Liu [2016].
A drawback of this formulation is that the Condorcet winner does not always exist. In such cases, we may introduce other notions of the winners, such as the Borda winner Urvoy et al. [2013] and the Copeland set Zoghi et al. [2015]. Ramamohan et al. [2016] introduced numerous notions of the winners other than the Condorcet winner.
The other line of related work is the qualitative multi-armed bandit (QMAB) problem Szorenyi et al. [2015], in which an agent also receives qualitative feedback according to the chosen arm. The difference between the QDB problem and the QMAB problem is that the QDB problem handles the winners defined in the classic DB problem, while the QMAB problem introduces its own definition of a “winner”, which is defined as the arm with the highest $\tau$-quantile of the feedback distribution for a given $\tau \in [0, 1]$.
This definition is, however, sometimes problematic since it ignores the difference in the feedback distribution below the quantile. Consider the case where we have two medicines, A and B, and want to figure out which has fewer side effects. Then, we can perform clinical trials and obtain feedback from patients about the severity of side effects. Assume that the feedback is reported on the scale of “No side effect”, “Moderate”, and “Severe”, and that the true probabilities of getting each feedback are as shown in Table 1. Then, we can clearly conclude that medicine A is preferable since it has a lower probability of causing a severe side effect, and in fact, medicine A becomes the winner in the formulation of the QDB problem. However, the QMAB problem regards these medicines as equally good unless $\tau$ falls in a very narrow range (of width 0.001, located at the tail of the distributions), since the $\tau$-quantile of the feedback is otherwise the same. Setting $\tau$ in such a range is almost impossible in practice since we do not have access to the true probabilities beforehand.
On the other hand, the definitions of winners considered in the QDB problem are well studied in the context of voting theory (see Charon and Hudry [2010] for a survey), and they do not have any hyperparameter to define the problem itself. This makes our algorithms more applicable to real-world problems.
Table 1: True probabilities of each feedback for medicines A and B.
Medicine    No side effect  Moderate  Severe
Medicine A  0.995           0.003     0.002
Medicine B  0.995           0.002     0.003
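The comparison in Table 1 can be checked numerically. The following is a minimal sketch, assuming that the winning probability is the probability of receiving strictly better feedback plus half the probability of a tie, and that the levels are indexed from least to most preferable; the function names `win_probability` and `quantile_level` are illustrative and do not come from the paper.

```python
from itertools import accumulate

# Feedback levels indexed from least to most preferable (an assumption of this sketch):
# index 0 = "Severe", 1 = "Moderate", 2 = "No side effect".
medicine_a = [0.002, 0.003, 0.995]
medicine_b = [0.003, 0.002, 0.995]


def win_probability(p, q):
    """Return P(X > Y) + 0.5 * P(X = Y) for independent X ~ p and Y ~ q."""
    total, below = 0.0, 0.0
    for p_l, q_l in zip(p, q):
        total += p_l * (below + 0.5 * q_l)
        below += q_l  # below now holds P(Y <= current level)
    return total


def quantile_level(p, tau):
    """Return the smallest level index whose cumulative probability reaches tau."""
    for level, cdf in enumerate(accumulate(p)):
        if cdf >= tau - 1e-12:  # small tolerance for floating-point accumulation
            return level
    return len(p) - 1


mu_ab = win_probability(medicine_a, medicine_b)
print(f"P(A wins over B) = {mu_ab:.7f}")
for tau in (0.001, 0.0025, 0.5):
    print(tau, quantile_level(medicine_a, tau), quantile_level(medicine_b, tau))
```

Under these assumptions, medicine A wins with probability slightly above 1/2, while the quantiles of the two medicines coincide for most choices of `tau` and differ only in a narrow band near the lower tail.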
3 Problem Formulation
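We formulate the QDB problem in this section. As in the MAB problem, we consider $K$ arms associated with feedback distributions $\nu_1, \dots, \nu_K$, and at each round $t = 1, 2, \dots, T$, the agent chooses one arm $a(t) \in \{1, \dots, K\}$ and receives feedback $r(t)$ sampled from distribution $\nu_{a(t)}$. While the MAB problem assumes $\nu_i$ to be distributions on real values, the QDB problem considers qualitative feedback, which corresponds to the case where $\nu_i$ are distributions on a totally ordered set $(S, \succeq)$, where $S$ is the set of possible feedback and $\succeq$ denotes a total order between feedback. For simplicity, we assume that $S = \{1, \dots, L\}$ and that the total order $\succeq$ corresponds to the order relation $\ge$, which means $L \succeq L-1 \succeq \dots \succeq 1$. Thus, the distributions $\nu_i$ are all categorical with support $\{1, \dots, L\}$. Note that even though the feedback is labeled by integers for notational simplicity, the sum of the feedback has no meaning in the QDB setting.
The QDB problem aims to minimize the same regret as the classic DB problem, which is defined based on pairwise comparison. Following early work Busa-Fekete et al. [2013], we characterize $\mu_{ij}$, the probability of arm $i$ winning over arm $j$, as
$$\mu_{ij} = \mathbb{P}[X_i > X_j] + \frac{1}{2}\,\mathbb{P}[X_i = X_j],$$
where $X_i$ and $X_j$ are mutually independent random variables following distributions $\nu_i$ and $\nu_j$, respectively.
We consider two types of winners in this paper. The first one is the Condorcet winner, which is the arm that wins over all the other arms with probability larger than or equal to $1/2$. Formally, arm $i$ is the Condorcet winner if $\mu_{ij} \ge 1/2$ for all $j \ne i$. We denote the Condorcet winner as $a^*_{\mathrm{C}}$, and the goal of the QDB problem when employing the Condorcet winner is to minimize the following regret:
$$R^{\mathrm{C}}(T) = \sum_{t=1}^{T} \Delta^{\mathrm{C}}_{a(t)},$$
where $\Delta^{\mathrm{C}}_{i} = \mu_{a^*_{\mathrm{C}} i} - 1/2$.
The second winner is the Borda winner, which is the arm with the largest Borda score, the average of the winning probabilities against other arms. Formally, the Borda score for arm $i$ is defined as
$$B_i = \frac{1}{K-1} \sum_{j \ne i} \mu_{ij},$$
and thus the Borda winner is $a^*_{\mathrm{B}} = \arg\max_{i} B_i$. The regret to minimize in this case is formulated as
$$R^{\mathrm{B}}(T) = \sum_{t=1}^{T} \Delta^{\mathrm{B}}_{a(t)},$$
where $\Delta^{\mathrm{B}}_{i} = B_{a^*_{\mathrm{B}}} - B_i$.
The QDB problem can be solved by any algorithm for the classic DB since the same regret is used in both problems. Algorithms for the DB problem specify two arms $(i, j)$ to compare at each round and receive a result of the noisy comparison generated from $\mathrm{Ber}(\mu_{ij})$, where $\mathrm{Ber}(\mu)$ is the Bernoulli distribution with success probability $\mu$. This comparison can be simulated in the QDB problem as follows: we observe $X_i$ and $X_j$ by pulling both arms and return which of $X_i > X_j$ or $X_i < X_j$ occurred, with ties broken at random.
However, in the QDB problem, we can directly estimate $\mu_{ij}$ from the feedback distribution of each arm, which significantly enhances exploration. Considering that $\nu_i$ are all categorical distributions on $\{1, \dots, L\}$, we have another representation for $\mu_{ij}$ given by
$$\mu_{ij} = \sum_{l=1}^{L} p_i^{(l)} \left( \sum_{l'=1}^{l-1} p_j^{(l')} + \frac{1}{2}\, p_j^{(l)} \right),$$
where $p_i^{(l)} = \mathbb{P}[X_i = l]$. Let $\mathcal{P}_L$ be the probability simplex $\{ p \in [0,1]^L : \sum_{l=1}^{L} p^{(l)} = 1 \}$, and we define function $f : \mathcal{P}_L \times \mathcal{P}_L \to [0, 1]$ as
$$f(p, q) = \sum_{l=1}^{L} p^{(l)} \left( \sum_{l'=1}^{l-1} q^{(l')} + \frac{1}{2}\, q^{(l)} \right). \qquad (1)$$
Hence, $\mu_{ij} = f(p_i, p_j)$ for $p_i = (p_i^{(1)}, \dots, p_i^{(L)})$.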
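The quantities of this section can be sketched in a few lines. The block below is only an illustration under stated assumptions: the winning probability is computed as the probability of strictly higher feedback plus half the probability of a tie, the three example distributions are hypothetical, and the helper names (`win_probability`, `simulate_duel`, `win_matrix`, `condorcet_winner`, `borda_scores`) are not taken from the paper.

```python
import random


def win_probability(p, q):
    """Return P(X > Y) + 0.5 * P(X = Y) for independent X ~ p and Y ~ q."""
    total, below = 0.0, 0.0
    for p_l, q_l in zip(p, q):
        total += p_l * (below + 0.5 * q_l)
        below += q_l  # below now holds P(Y <= current level)
    return total


def simulate_duel(p, q, rng):
    """Turn two rounds of qualitative feedback into one duel outcome.

    Returns True if the arm with distribution p wins; ties are broken by a fair coin.
    """
    levels = range(len(p))
    x = rng.choices(levels, weights=p)[0]
    y = rng.choices(levels, weights=q)[0]
    if x == y:
        return rng.random() < 0.5
    return x > y


def win_matrix(dists):
    """Pairwise winning probabilities mu[i][j] of arm i over arm j."""
    return [[win_probability(p, q) for q in dists] for p in dists]


def condorcet_winner(mu):
    """Return an arm that wins over every other arm with probability >= 1/2, or None."""
    for i, row in enumerate(mu):
        if all(row[j] >= 0.5 for j in range(len(mu)) if j != i):
            return i
    return None


def borda_scores(mu):
    """Average winning probability of each arm against the other arms."""
    k = len(mu)
    return [sum(mu[i][j] for j in range(k) if j != i) / (k - 1) for i in range(k)]


# Hypothetical instance with three arms and three feedback levels.
arms = [[0.2, 0.3, 0.5], [0.3, 0.4, 0.3], [0.5, 0.3, 0.2]]
mu = win_matrix(arms)
print("Condorcet winner:", condorcet_winner(mu))
print("Borda scores:", borda_scores(mu))
```

A duel simulated with `simulate_duel` yields one bit from two pulls, whereas `win_probability` applied to estimated distributions uses every observed feedback value, which is the motivation for estimating the feedback distributions directly.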
4 Qualitative Dueling Bandit with the Condorcet Winner
In this section, we propose an algorithm for the QDB problem with the Condorcet winner. The algorithm is called Thompson Condorcet sampling, which is based on Thompson sampling Thompson [1933], an algorithm famous for its good performance in the standard MAB problem and wide applicability to many other problems.
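This algorithm maintains Bayesian posterior distributions of the categorical parameters $p_i$ of each arm $i$ defined in Section 3. We employ the Dirichlet distribution $\mathrm{Dir}(\alpha)$ with parameter $\alpha = (\alpha^{(1)}, \dots, \alpha^{(L)})$ as the prior distribution, the probability density function of which is
$$\pi(p; \alpha) = \frac{\Gamma\left(\sum_{l=1}^{L} \alpha^{(l)}\right)}{\prod_{l=1}^{L} \Gamma(\alpha^{(l)})} \prod_{l=1}^{L} \left(p^{(l)}\right)^{\alpha^{(l)} - 1},$$
where $\Gamma(\cdot)$ is the gamma function.
Having Dirichlet distributions as priors is a convenient choice when observations are sampled from a categorical distribution. Let $N_i(t) = (N_i^{(1)}(t), \dots, N_i^{(L)}(t))$ be the vector representing the observations until the $t$-th round, where $N_i^{(l)}(t)$ represents the number of times that feedback $l$ is observed when arm $i$ is pulled. If we employ the prior distribution $\mathrm{Dir}(\alpha)$, then the posterior distribution given observations $N_i(t)$ is $\mathrm{Dir}(\alpha + N_i(t))$. For notational simplicity, we sometimes denote $N_i(t)$ as $N_i$ when the round is obvious from the context.
The entire algorithm is shown in Algorithm 1. At each round $t$, the algorithm samples $\tilde{p}_i$ from the posterior distribution of $p_i$ for each arm $i$, and pulls the Condorcet winner in $\{\tilde{p}_i\}$, that is, the arm that would be the Condorcet winner if the samples were the true parameters. If the Condorcet winner does not exist, the algorithm samples again.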
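The sampling loop described above can be sketched as follows. This is a minimal illustration rather than the authors' Algorithm 1: the symmetric prior value `prior=1.0`, the cap `max_resample`, the random fallback when no Condorcet winner is found, and all function names are assumptions of this sketch.

```python
import random


def win_probability(p, q):
    """Return P(X > Y) + 0.5 * P(X = Y) for independent X ~ p and Y ~ q."""
    total, below = 0.0, 0.0
    for p_l, q_l in zip(p, q):
        total += p_l * (below + 0.5 * q_l)
        below += q_l  # below now holds P(Y <= current level)
    return total


def sample_dirichlet(alpha, rng):
    """Draw one sample from Dir(alpha) by normalizing independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def condorcet_winner(dists):
    """Return an arm that wins over every other arm with probability >= 1/2, or None."""
    for i, p in enumerate(dists):
        if all(win_probability(p, q) >= 0.5 for j, q in enumerate(dists) if j != i):
            return i
    return None


def thompson_condorcet_sampling(true_dists, horizon, rng, prior=1.0, max_resample=1000):
    """Run the sampling loop and return the sequence of pulled arms."""
    n_arms, n_levels = len(true_dists), len(true_dists[0])
    counts = [[0] * n_levels for _ in range(n_arms)]  # feedback counts per arm
    pulls = []
    for _ in range(horizon):
        arm = None
        for _ in range(max_resample):
            # Posterior of each arm is Dirichlet(prior + observed counts).
            samples = [sample_dirichlet([prior + c for c in counts[i]], rng)
                       for i in range(n_arms)]
            arm = condorcet_winner(samples)
            if arm is not None:
                break  # otherwise resample, as described in the text
        if arm is None:
            # Safeguard added for this sketch only: fall back to a random arm.
            arm = rng.randrange(n_arms)
        feedback = rng.choices(range(n_levels), weights=true_dists[arm])[0]
        counts[arm][feedback] += 1
        pulls.append(arm)
    return pulls


# Hypothetical instance in which arm 0 is the Condorcet winner.
arms = [[0.2, 0.3, 0.5], [0.3, 0.4, 0.3], [0.5, 0.3, 0.2]]
history = thompson_condorcet_sampling(arms, 2000, random.Random(0))
print([history.count(i) for i in range(len(arms))])
```

Sampling from the Dirichlet posterior through normalized gamma draws keeps the sketch within the standard library; a numerical library would typically provide this directly.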
Let be
for Kullback-Leibler (KL) divergence . Then, the regret of Thompson Condorcet sampling is bounded as follows.
Theorem 1.
The proof is given in Appendix B, where the detailed condition on and the precise form of the bound are also provided. From the precise form of (2) that can be found in (23) in Appendix B, one can see that this regret bound grows exponentially with the number of arms $K$. However, this is not an inherent limitation of Thompson Condorcet sampling but an artifact of pursuing the optimal asymptotic dependence on the number of rounds $T$. As we will show in Section 6, this exponential increase in the regret does not occur in practice, and the algorithm works well for relatively large $K$.
The regret bound has a similar form to the information-theoretic lower bound in the MAB problems for multi-parameter models Burnetas and Katehakis [1996]. Note that considering distributions is essential in this case, whereas they are replaced with the distribution of the optimal arm in the regret bound of Thompson sampling in the MAB problem with the Bernoulli model given by Agrawal and Goyal [2013]. For example, when and , we have as .
Theorem 1 suggests the possibility of Thompson Condorcet sampling performing drastically better than the case when we apply classic DB algorithms for the QDB problem in the way discussed in Section 3. The regret lower bound of such direct applications immediately follows from the lower bound for the classic DB problem given by Komiyama et al. [2015].
Proposition 1 (Adapted from Komiyama et al., 2015).
When we apply any consistent algorithm for the DB problem to the QDB problem, we have
(3) 
where .
From the upper bound given in Theorem 1, we have
which can be arbitrarily smaller than (3) as stated in the next lemma.
Lemma 1.
Assume that . For any fixed , there exist such that
(4) 
The proof can be found in Appendix B. From Lemma 1, we can say that there exist cases where Thompson Condorcet sampling performs arbitrarily better than the direct application of any algorithm for the DB problem. This implies that the algorithm successfully incorporates the qualitative information to reduce the regret in the DB.
5 Qualitative Dueling Bandit with the Borda Winner
In this section, we study two algorithms for the QDB problem with the Borda winner: one based on Thompson sampling, called Thompson Borda sampling, and the other based on the UCB algorithm Auer [2003], called BordaUCB. In spite of the success of Thompson Condorcet sampling, our theoretical analysis reveals that Thompson Borda sampling can have polynomial regret in some settings. On the other hand, BordaUCB achieves logarithmic regret, which matches the regret lower bound of the classic DB problems.
Thompson Borda sampling given in Algorithm 2 is similar to Thompson Condorcet sampling. The only difference is that Thompson Borda sampling pulls the Borda winner with respect to the sampled parameters. Since the Borda winner always exists for any samples, we do not need resampling. Although it works surprisingly well empirically, as we will see in Section 6, we prove that it suffers from polynomial regret in the worst case.
Theorem 2.
Assume that there are arms such that arm is the Borda winner. Then, there exists such that under Thompson Borda sampling with , , and , the statement
holds for some constants .
The proof can be found in Appendix C. The situation considered in Theorem 2 may be somewhat unrealistic since we assume that and are known beforehand. However, we will show by an experiment that Thompson Borda sampling actually suffers from the polynomial regret without such an assumption in Section 6.
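For concreteness, the Thompson Borda sampling loop described before Theorem 2 can be sketched as follows. This is an illustration under assumptions, not the authors' Algorithm 2: the symmetric prior value `prior=1.0`, the example instance, and the function names are hypothetical.

```python
import random


def win_probability(p, q):
    """Return P(X > Y) + 0.5 * P(X = Y) for independent X ~ p and Y ~ q."""
    total, below = 0.0, 0.0
    for p_l, q_l in zip(p, q):
        total += p_l * (below + 0.5 * q_l)
        below += q_l  # below now holds P(Y <= current level)
    return total


def sample_dirichlet(alpha, rng):
    """Draw one sample from Dir(alpha) by normalizing independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def borda_winner(dists):
    """Return the arm with the largest average winning probability against the others."""
    k = len(dists)
    scores = [
        sum(win_probability(dists[i], dists[j]) for j in range(k) if j != i) / (k - 1)
        for i in range(k)
    ]
    return max(range(k), key=scores.__getitem__)


def thompson_borda_sampling(true_dists, horizon, rng, prior=1.0):
    """Run the sampling loop and return the sequence of pulled arms."""
    n_arms, n_levels = len(true_dists), len(true_dists[0])
    counts = [[0] * n_levels for _ in range(n_arms)]  # feedback counts per arm
    pulls = []
    for _ in range(horizon):
        samples = [sample_dirichlet([prior + c for c in counts[i]], rng)
                   for i in range(n_arms)]
        arm = borda_winner(samples)  # always exists, so no resampling is needed
        feedback = rng.choices(range(n_levels), weights=true_dists[arm])[0]
        counts[arm][feedback] += 1
        pulls.append(arm)
    return pulls


# Hypothetical benign instance in which arm 0 is the Borda winner.
arms = [[0.2, 0.3, 0.5], [0.3, 0.4, 0.3], [0.5, 0.3, 0.2]]
history = thompson_borda_sampling(arms, 2000, random.Random(0))
print([history.count(i) for i in range(len(arms))])
```

The instance used here is a benign one; it is not the worst-case instance of Theorem 2 and does not reproduce the polynomial regret discussed above.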
Another proposed algorithm, BordaUCB, is based on the UCB algorithm Auer [2003] and is shown in Algorithm 3. As in the original UCB algorithm, we consider the upper confidence bound $\hat{B}_i + \beta_i$ for each arm $i$, where $\hat{B}_i$ is an estimated Borda score and $\beta_i$ is the width of the confidence interval controlled by a positive parameter. Let $\hat{a}$ be the arm with the largest upper confidence bound. While the original UCB algorithm always pulls the arm with the largest upper confidence bound, BordaUCB pulls all arms that do not belong to $\mathcal{M}$, the set of arms that have been pulled the most, if $\hat{a}$ does not belong to $\mathcal{M}$. This exploration strategy reflects the fact that we have to estimate all feedback distributions accurately in order to estimate the Borda score precisely.
The regret of BordaUCB is bounded as follows.
Theorem 3.
Assume that is set as
for arbitrarily taken . Then, for any , the regret of BordaUCB is bounded as
for some constants , where and .
The proof is presented in Appendix D, where the explicit forms of and are also provided. The regret bound in Theorem 3 is simplified to when for all , while the regret of the original UCB algorithm is Auer [2003], which is smaller by . However, this difference is inevitable, as proved in the following theorem.
Theorem 4.
Consider two instances of the QDB problem with , in which the feedback distributions of the arms are represented as and . Let and be the regret in each instance. Then, there exists a pair of instances that all algorithms which achieve
for all constant satisfy
where defined on .
The proof is presented in Appendix E. This theorem states that if the algorithm achieves subpolynomial regret for all instances of the QDB problem with the Borda winner, there exists a case where it suffers from regret. Therefore, we can conclude that the difference in the regret upper bound between the original UCB and BordaUCB comes from the characteristics of the QDB problem.
The upper bound in Theorem 3 matches the regret lower bound in the classic DB problem, which is considered in the context of the PAC DB problem Jamieson et al. [2015]. The algorithm is called PAC if it finds the Borda winner with failure probability less than . We have the following bound of the minimum number of samples required in such PAC algorithms.
Proposition 2 (Theorem 1; Jamieson et al., 2015).
Let be the total number of pulls. If and for all , then any PAC DB algorithm with has
Existing algorithms for the Borda winner Busa-Fekete et al. [2013], Jamieson et al. [2015] use a PAC DB algorithm as a subroutine. They first run such an algorithm with and then pull the estimated Borda winner in the remaining rounds. Therefore, the regret of such algorithms is at least from Proposition 2, and hence the regret upper bound of BordaUCB is no worse than this lower bound.
Although we were not able to prove that the regret of BordaUCB is smaller than that of the direct application of classic DB algorithms, BordaUCB performs better than them empirically, as we will see in Section 6. Furthermore, BordaUCB has another advantage in that it does not require the number of rounds $T$ to be specified. Since existing algorithms run a PAC algorithm, they require the number of rounds to be known beforehand. However, it is often difficult to guess the number of rounds beforehand, and thus our algorithms are more useful in practice.
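The pulling rule of BordaUCB described earlier in this section can be sketched as follows. The sketch rests on several assumptions that are not stated in the text above: the confidence width is taken to be $\sqrt{\alpha \log t / N_i}$ with a hypothetical parameter `alpha`, the estimated Borda score is computed from empirical feedback frequencies, and every arm is pulled once for initialization. It is an illustration of the exploration rule, not the authors' Algorithm 3.

```python
import math
import random


def win_probability(p, q):
    """Return P(X > Y) + 0.5 * P(X = Y) for independent X ~ p and Y ~ q."""
    total, below = 0.0, 0.0
    for p_l, q_l in zip(p, q):
        total += p_l * (below + 0.5 * q_l)
        below += q_l  # below now holds P(Y <= current level)
    return total


def borda_scores(dists):
    """Average winning probability of each arm against the other arms."""
    k = len(dists)
    return [
        sum(win_probability(dists[i], dists[j]) for j in range(k) if j != i) / (k - 1)
        for i in range(k)
    ]


def borda_ucb(true_dists, horizon, rng, alpha=1.0):
    """Run the UCB-style loop and return the sequence of pulled arms."""
    n_arms, n_levels = len(true_dists), len(true_dists[0])
    counts = [[0] * n_levels for _ in range(n_arms)]  # feedback counts per arm
    pulls = []

    def pull(arm):
        feedback = rng.choices(range(n_levels), weights=true_dists[arm])[0]
        counts[arm][feedback] += 1
        pulls.append(arm)

    for arm in range(n_arms):  # assumed initialization: pull every arm once
        pull(arm)
    while len(pulls) < horizon:
        n_pulls = [sum(c) for c in counts]
        empirical = [[c / n for c in counts[i]] for i, n in enumerate(n_pulls)]
        scores = borda_scores(empirical)
        t = len(pulls)
        # Assumed form of the confidence width; the paper's exact form may differ.
        ucb = [scores[i] + math.sqrt(alpha * math.log(t) / n_pulls[i])
               for i in range(n_arms)]
        best = max(range(n_arms), key=ucb.__getitem__)
        most = max(n_pulls)
        most_pulled = {i for i in range(n_arms) if n_pulls[i] == most}
        if best in most_pulled:
            pull(best)
        else:
            # Pull every arm outside the most-pulled set once.
            for arm in range(n_arms):
                if arm not in most_pulled and len(pulls) < horizon:
                    pull(arm)
    return pulls


# Hypothetical instance in which arm 0 is the Borda winner.
arms = [[0.2, 0.3, 0.5], [0.3, 0.4, 0.3], [0.5, 0.3, 0.2]]
history = borda_ucb(arms, 5000, random.Random(0))
print([history.count(i) for i in range(len(arms))])
```

Note that the loop never uses the horizon except as a stopping condition, which reflects the remark above that the number of rounds does not need to be known to the algorithm.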
6 Experiments
We test the empirical performance of the proposed algorithms through experiments based on both a synthetic setting and real-world data. We first conduct experiments based on the real-world web search datasets that were also used in previous work. In these experiments, our methods significantly outperform the direct application of the existing algorithms for the classic DB. Then, we show the results of experiments in a synthetic setting in which Thompson Borda sampling has polynomial regret.
Experiments on a Real-World Dataset
We apply the proposed methods to the problem of ranker evaluation from the field of information retrieval, which is used for evaluating the algorithms for the classic DB problem in Jamieson et al. [2015]. The task is to identify the best ranker, which takes a user’s search query as input and ranks the documents according to their relevance to that query.
We used two web search datasets. The first is the MSLR-WEB10K dataset Qin et al. [2010], which consists of 10,000 search queries over the documents from search results. The data also contains the values of 136 features and a corresponding user-labeled relevance factor on a scale of one to five with respect to each query-document pair. The other is the MQ2008 dataset Qin and Liu [2013], which contains 46 features and a relevance factor labeled from one to three for each query-document pair. As in Jamieson et al. [2015], we only consider rankers that use one feature to rank documents. Therefore, the aim of the task is to determine which feature is the most capable of predicting the relevance of query-document pairs.
Although Jamieson et al. [2015] set up the classic DB problem from these datasets, we can naturally formulate the QDB problem as well since we have access to the relevance factors. The qualitative feedback is generated in the following way. At each round, the algorithm selects one ranker, and it ranks the documents for a randomly chosen query. The relevance factor for the top-ranked document is revealed to the algorithm as the qualitative feedback. Therefore, we have five feedback levels ($L = 5$) in the MSLR-WEB10K dataset and three feedback levels ($L = 3$) in the MQ2008 dataset. We compare the regret of the proposed algorithms to that of the direct application of the classic DB algorithms, which corresponds to the experiments conducted in Jamieson et al. [2015]. We repeat 100 runs for each instance and report the mean of the regret.
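The feedback-generation procedure above can be sketched on toy data. Everything in this block is hypothetical: the data layout, the two queries, the zero-based relevance labels, and the assumption that a single-feature ranker puts the document with the largest feature value on top. It does not read the actual MSLR-WEB10K or MQ2008 files.

```python
import random

# Hypothetical toy data: each query is a list of (feature_vector, relevance) pairs,
# with relevance labels in {0, 1, 2} (zero-based for indexing convenience).
toy_queries = [
    [((0.9, 0.1), 2), ((0.2, 0.8), 0), ((0.5, 0.5), 1)],
    [((0.3, 0.7), 1), ((0.6, 0.2), 2)],
]


def top_relevance(feature_index, docs):
    """Relevance of the document ranked first by a single-feature ranker."""
    _, relevance = max(docs, key=lambda doc: doc[0][feature_index])
    return relevance


def pull_ranker(feature_index, queries, rng):
    """One round of qualitative feedback: rank a random query, reveal the top relevance."""
    return top_relevance(feature_index, rng.choice(queries))


def feedback_distribution(feature_index, queries, n_levels):
    """Exact feedback distribution of a ranker when queries are drawn uniformly."""
    counts = [0] * n_levels
    for docs in queries:
        counts[top_relevance(feature_index, docs)] += 1
    return [c / len(queries) for c in counts]


rng = random.Random(0)
print([pull_ranker(0, toy_queries, rng) for _ in range(5)])
print(feedback_distribution(0, toy_queries, 3))
print(feedback_distribution(1, toy_queries, 3))
```

Each ranker (feature) plays the role of one arm, and `feedback_distribution` gives the categorical distribution from which the true winning probabilities and regret can be computed when evaluating an algorithm.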
Experiments for Condorcet Winner
We first show the experimental result of the QDB problem with the Condorcet winner. We compare Thompson Condorcet sampling with RUCB Zoghi et al. [2014] and RMED1, RMED2, and RMED2F Komiyama et al. [2015], which are all promising algorithms proposed for the classic DB problem with the Condorcet winner. We set , and Figure 1 shows the experimental result when the number of rankers is .
Figure 1 shows the superiority of Thompson Condorcet sampling. Furthermore, we can observe that all existing algorithms incur large regret in early rounds while Thompson Condorcet sampling does not. This is because most algorithms for the DB problem construct a set of candidates for the Condorcet winner and explore it in the first part of the rounds, whereas Thompson Condorcet sampling conducts exploration and exploitation at the same time and does not require such a set. In this sense, Thompson Condorcet sampling performs more stably than the existing methods.
To see the dependency of the performance of Thompson Condorcet sampling on the number of arms, we tried a setting with a relatively large number of arms. The result is shown in Figure 2, in which Thompson Condorcet sampling still outperforms the classic DB algorithms even though the regret upper bound proved in Theorem 1 grows exponentially with the number of arms $K$. This result supports the argument that the exponential dependency on $K$ is just an artifact of pursuing the best regret bound in the asymptotic case and that Thompson Condorcet sampling empirically performs much better than the theoretical analysis suggests.
Experiments for Borda Winner
For the Borda setting, we compare our proposed methods, Thompson Borda sampling and BordaUCB, with the existing classic DB algorithm SSSE Busa-Fekete et al. [2013]. Furthermore, we also conduct a comparison with an extension of SSSE, which we call QSSSE, proposed in Busa-Fekete et al. [2013] to utilize the qualitative feedback explicitly.
The result is shown in Figure 3, which shows the superiority of the proposed methods. As in the Condorcet case, SSSE and QSSSE suffer from large regret in the early stage, while regret always increases logarithmically in the proposed algorithms. This is because the existing methods first only explore, while the proposed methods always balance exploration and exploitation. Although the existing methods achieve zero regret after the exploration, this does not mean that they perform better than BordaUCB, since they require a longer exploration phase.
Surprisingly, Thompson Borda sampling works quite well in this setting, even though Theorem 2 states that it has the polynomial regret in the worst case. We suspect it is rare to encounter such a worst case in practice, but the condition for subpolynomial regret is unknown and left to future work.
Experiments on a Synthetic Setting
Theorem 2 proves that Thompson Borda sampling can incur polynomial regret for some instances, which we confirm through the following experiments. We set up the instance with and , in which each feedback distribution is represented as , , and . We run Thompson Borda sampling and BordaUCB on this instance 10 times, and the mean of the regret is shown in Figure 4.
From Figure 4, we can clearly see that Thompson Borda sampling suffers from polynomial regret, while BordaUCB still has subpolynomial regret. However, it takes many rounds for BordaUCB to have less regret than Thompson Borda sampling. This is because Thompson Borda sampling explores less than necessary. In early rounds, BordaUCB pulls arm 3 many times, which is necessary for identifying the Borda winner but incurs large regret. On the other hand, Thompson Borda sampling exploits arms 1 and 2 more, which leads to its superior performance in early rounds.
7 Conclusions
In this paper, we formulated and studied a novel type of the dueling bandit, called a qualitative dueling bandit. In this problem, an agent receives qualitative feedback at each round and aims to minimize the same regret as the classic DB when the duel is carried out based on that feedback.
We considered two notions of winners, the Condorcet winner and the Borda winner. For the Condorcet winner, we proposed an algorithm called Thompson Condorcet sampling, and we showed that its regret can be arbitrarily smaller than that of the direct application of the algorithms for the classic DB. Thompson Condorcet sampling also exhibited superior performance in the experiments based on the real-world web search datasets.
For the Borda winner, we studied two algorithms, Thompson Borda sampling and BordaUCB. Although the theoretical analysis reveals that Thompson Borda sampling can have polynomial regret in some instances, the experiments showed that it performs surprisingly well empirically, especially when the number of rounds is not very large. On the other hand, we proved a logarithmic regret upper bound for BordaUCB, which is no worse than the regret lower bound in the classic DB.
As future work, it is important to derive general algorithms that can handle various notions of winners as in Ramamohan et al. [2016]. Another promising direction is to improve the algorithms for the Borda winner and achieve regret significantly smaller than the classic DB as Thompson Condorcet sampling does in the Condorcet winner case.
8 Acknowledgements
LX utilized the facility provided by Masason Foundation. JH acknowledges support by KAKENHI 18K17998, and MS acknowledges support by KAKENHI 17H00757.
References

Agrawal and Goyal [2013] S. Agrawal and N. Goyal. Further optimal regret bounds for Thompson sampling. In Proceedings of the 16th International Conference on Artificial Intelligence and Statistics, pages 99–107, 2013.
Auer [2003] P. Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3:397–422, March 2003. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=944919.944941.
Burnetas and Katehakis [1996] A. N. Burnetas and M. N. Katehakis. Optimal adaptive policies for sequential allocation problems. Advances in Applied Mathematics, 17(2):122–142, 1996.
Busa-Fekete et al. [2013] R. Busa-Fekete, B. Szörényi, P. Weng, W. Cheng, and E. Hüllermeier. Top-k selection based on adaptive sampling of noisy preferences. In Proceedings of the 30th International Conference on Machine Learning, pages 1094–1102, 2013.
Charon and Hudry [2010] I. Charon and O. Hudry. An updated survey on the linear ordering problem for weighted or unweighted tournaments. Annals of Operations Research, 175(1):107–158, March 2010.
Hofmann et al. [2011] K. Hofmann, S. Whiteson, and M. de Rijke. A probabilistic method for inferring preferences from clicks. In Proceedings of the 20th International Conference on Information and Knowledge Management, pages 249–258, 2011.
Honda and Takemura [2014] J. Honda and A. Takemura. Optimality of Thompson sampling for Gaussian bandits depends on priors. In Proceedings of the 17th International Conference on Artificial Intelligence and Statistics, pages 375–383, 2014.
Jamieson et al. [2015] K. Jamieson, S. Katariya, A. Deshpande, and R. Nowak. Sparse dueling bandits. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, pages 416–424, 2015.
Komiyama et al. [2015] J. Komiyama, J. Honda, H. Kashima, and H. Nakagawa. Regret lower bound and optimal algorithm in dueling bandit problem. In Proceedings of The 28th Conference on Learning Theory, pages 1141–1154, 2015.
Lai and Robbins [1985] T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4–22, 1985. ISSN 1090-2074. doi: 10.1016/0196-8858(85)90002-8.
Massart [1990] P. Massart. The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality. The Annals of Probability, 18(3):1269–1283, July 1990.
Olver et al. [2010] F. W. Olver, D. W. Lozier, R. F. Boisvert, and C. W. Clark. NIST Handbook of Mathematical Functions. Cambridge University Press, New York, NY, USA, 1st edition, 2010. ISBN 0521140633, 9780521140638.
Qin and Liu [2013] T. Qin and T. Liu. Introducing LETOR 4.0 datasets. CoRR, abs/1306.2597, 2013.
Qin et al. [2010] T. Qin, T.-Y. Liu, J. Xu, and H. Li. LETOR: A benchmark collection for research on learning to rank for information retrieval. Information Retrieval, 13(4):346–374, Aug 2010. ISSN 1386-4564. doi: 10.1007/s10791-009-9123-y. URL http://dx.doi.org/10.1007/s10791-009-9123-y.
Ramamohan et al. [2016] S. Y. Ramamohan, A. Rajkumar, and S. Agarwal. Dueling bandits: Beyond Condorcet winners to general tournament solutions. In Advances in Neural Information Processing Systems 29, pages 1253–1261, 2016.
Rothschild [1974] M. Rothschild. A two-armed bandit theory of market pricing. Journal of Economic Theory, 9:185–202, 1974. ISSN 0022-0531.
Szorenyi et al. [2015] B. Szorenyi, R. Busa-Fekete, P. Weng, and E. Hüllermeier. Qualitative multi-armed bandits: A quantile-based approach. In Proceedings of the 32nd International Conference on Machine Learning, pages 1660–1668, 2015.
Thompson [1933] W. R. Thompson. On the likelihood that one unknown probability exceeds another in the view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
Urvoy et al. [2013] T. Urvoy, F. Clerot, R. Féraud, and S. Naamane. Generic exploration and K-armed voting bandits. In Proceedings of the 30th International Conference on Machine Learning, pages 1191–1199, 2013.
van der Vaart and Wellner [2000] A. van der Vaart and J. Wellner. Weak Convergence and Empirical Processes: With Applications to Statistics (Springer Series in Statistics). Springer, 2000. ISBN 0387946403.
Villar et al. [2015] S. S. Villar, J. Bowden, and J. Wason. Multi-armed bandit models for the optimal design of clinical trials: Benefits and challenges. Statistical Science, 30:199–215, May 2015. doi: 10.1214/14-STS504.
Wu and Liu [2016] H. Wu and X. Liu. Double Thompson sampling for dueling bandits. In Advances in Neural Information Processing Systems 29, pages 649–657, 2016.
Yue et al. [2012] Y. Yue, J. Broder, R. Kleinberg, and T. Joachims. The K-armed dueling bandits problem. Journal of Computer and System Sciences, 78(5):1538–1556, 2012.
Zhou et al. [2014] Y. Zhou, X. Chen, and J. Li. Optimal PAC multiple arm identification with applications to crowdsourcing. In Proceedings of the 31st International Conference on Machine Learning, pages 217–225, 2014.
Zoghi et al. [2014] M. Zoghi, S. Whiteson, R. Munos, and M. de Rijke. Relative upper confidence bound for the K-armed dueling bandit problem. In Proceedings of the 31st International Conference on Machine Learning, pages 10–18, 2014.
Zoghi et al. [2015] M. Zoghi, Z. Karnin, S. Whiteson, and M. de Rijke. Copeland dueling bandits. In Advances in Neural Information Processing Systems 28, pages 307–315, 2015.
Appendix A Preliminaries
In this section, we introduce concentration inequalities for multinomial distributions, which bound how far a random variable deviates from its expected value. The first inequality measures the deviation in terms of the KL divergence as follows.
Lemma 2.
Let us consider the random variable sampled from multinomial distribution . If we denote the true probability as and the empirical probability as , we have
for any and .
Proof of Lemma 2.
We also use the following inequality to handle the deviation measured by the $L_1$ norm.
Lemma 3 (Bretagnolle-Huber-Carol inequality van der Vaart and Wellner [2000]).
The last inequality is for the error in the cumulative distribution.
Lemma 4 (Dvoretzky-Kiefer-Wolfowitz inequality (Massart, 1990)).
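As a numerical illustration, the sketch below checks the standard textbook form of the Dvoretzky-Kiefer-Wolfowitz inequality with Massart's constant, namely that the largest deviation between the empirical and true cumulative distributions exceeds $\varepsilon$ with probability at most $2\exp(-2n\varepsilon^2)$ after $n$ samples. That the lemma here takes exactly this form is an assumption of the sketch, and the example distribution and function names are hypothetical.

```python
import math
import random
from itertools import accumulate


def dkw_bound(n, eps):
    """Standard DKW bound on P(sup |F_n - F| > eps) with Massart's constant."""
    return min(1.0, 2.0 * math.exp(-2.0 * n * eps ** 2))


def max_cdf_deviation(p, n, rng):
    """Largest gap between empirical and true CDF after n categorical samples."""
    counts = [0] * len(p)
    for level in rng.choices(range(len(p)), weights=p, k=n):
        counts[level] += 1
    true_cdf = list(accumulate(p))
    empirical_cdf = [c / n for c in accumulate(counts)]
    return max(abs(e - t) for e, t in zip(empirical_cdf, true_cdf))


def exceedance_frequency(p, n, eps, trials, rng):
    """Fraction of simulated trials in which the deviation exceeds eps."""
    hits = sum(max_cdf_deviation(p, n, rng) > eps for _ in range(trials))
    return hits / trials


rng = random.Random(0)
p = [0.2, 0.3, 0.5]  # hypothetical feedback distribution
freq = exceedance_frequency(p, n=200, eps=0.1, trials=2000, rng=rng)
print(f"empirical frequency = {freq:.4f}, DKW bound = {dkw_bound(200, 0.1):.4f}")
```

The simulated frequency is expected to fall below the bound; the gap between the two indicates how conservative the inequality is for this particular distribution.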
Next, we introduce the concentration inequality for the Dirichlet distribution.
Lemma 5.
Let be a sample drawn from Dirichlet distribution for and . For all and , we have
where for defined in (7).
Using Pinsker’s inequality, we can derive the concentration inequality for the $L_1$ norm.
Corollary 1.
Lastly, we state two simple lemmas, which are useful for analysis. The first is about the characteristic of function .
Lemma 6.
We can confirm it by simple calculation. The second is used for bounding the confidence bound.
Lemma 7.
Let be
Then, for all , we have
Proof.
Let be
If we set as
we have
∎
Appendix B Proof of Theorem 1
We first introduce several events that are used in the proof. Let be the event