## 1 Introduction

Dueling Bandits, first proposed in yue2009interactively, is an important variation on the Multi-Armed Bandit (MAB), a well-known online machine learning problem that has been studied extensively in previous work, such as auerfinite2002, cesa2006prediction, and bubeck2012regret. Dueling Bandits differs from MAB in that it provides binary feedback at each time: the win/lose outcome of a duel between two actions. This corresponds well to comparisons between two system states that receive better/worse type responses from users, patients, raters, and so on. Previous work on this topic has proposed various algorithms that generally allow for regret bounds that depend on $\Delta$ to be proven, where $\Delta$ represents the preference gap between two different states (or actions). See sui2018advancements for a reference. Such algorithms include Beat the Mean yue2011beat, Interleaved Filter yue2012k, SAVAGE urvoy2013generic, RUCB zoghi2013relative and RCS zoghi2014relative, MultiSBM and Sparring ailon2014reducing, Sparse Borda jamieson2015sparse, RMED komiyama2015regret, CCB zoghi2015copeland, and (E)CW-RMED komiyama2016copeland.

Thompson Sampling, first proposed in thompson1933likelihood, is a powerful method for learning true parameter values by sampling from a posterior distribution obtained using Bayes' theorem. See russo2014learning and russo2018tutorial for reference. It has been implemented in algorithms for multi-armed bandits, such as in chapelle2011empirical, agrawal2012analysis, kaufmann2012thompson, agrawal2013further, komiyama2015optimal, and xia2015thompson.

The two current state-of-the-art algorithms for Dueling Bandits, Independent Self-Sparring (ISS) sui2017multi and Double Thompson Sampling (DTS) wu2016double, both utilize Thompson Sampling methods. The ISS method is relatively simple, has strong empirical performance, and has been proven to converge asymptotically to a Condorcet winner, if one exists. However, its non-asymptotic regret has not been analyzed. The DTS algorithm is a relatively complex algorithm with a highly complex proof. It achieves a regret bound that depends on $\Delta$. However, worst-case values of $\Delta$ lead to regret bounds that are actually of a larger order.

We address these issues in this paper, with our main contributions: (1) we present four simple algorithms for Dueling Bandits, each of which allows provable upper bounds on regret that do not depend on any preference gap between actions, (2) we compare and contrast the algorithm complexity and theoretical results of the presented simple algorithms against the current state-of-the-art algorithms for Dueling Bandits, and (3) we evaluate the algorithms on multiple scenarios using synthetically generated data, demonstrating their performance for multiple definitions of optimality, which in some cases exceeds the state-of-the-art.

## 2 Background

### 2.1 Dueling Bandits

The dueling bandits problem is described in Problem 1. The random matrices of duel outcomes are independent and identically distributed. The element in row $i$ and column $j$ is Bernoulli distributed, with mean equal to the probability of action $i$ winning a duel with action $j$.

For Thompson Sampling algorithms, we will assume that the win probabilities depend on an unobserved random parameter. The parameter can be used to encode correlations between the actions and other structural assumptions.

For algorithms based on Exp3.P and partial monitoring, we assume that the win probabilities are given by a fixed but unknown matrix.

We assume that when and that or , depending on the problem setup.

Two random variables represent the actions selected to duel at each time, and the available *history* helps guide the selections. Note that the assumptions about the outcome matrices imply that if the outcome of one action dueling another is observed, then the outcome of the reverse duel is also known.
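As a concrete illustration of this setup, the following minimal sketch simulates a few duels. The matrix `P`, the helper `duel`, the uniform placeholder selection rule, and the assumption that `P[i][j] + P[j][i] = 1` for distinct actions are all hypothetical choices made here for illustration, not part of the problem statement above.

```python
import random

def duel(P, i, j, rng):
    """Return 1 if action i beats action j, else 0.

    P[i][j] is the (assumed) probability that i wins a duel against j.
    """
    return 1 if rng.random() < P[i][j] else 0

# Hypothetical 3-action win-probability matrix with P[i][j] + P[j][i] = 1.
P = [
    [0.5, 0.7, 0.8],
    [0.3, 0.5, 0.6],
    [0.2, 0.4, 0.5],
]

rng = random.Random(0)
history = []  # (i, j, outcome) tuples observed so far
for t in range(5):
    # Placeholder selection rule: a real algorithm would use the history here.
    i, j = rng.randrange(3), rng.randrange(3)
    w_ij = duel(P, i, j, rng)
    # For i != j, observing w_ij also reveals the reverse outcome 1 - w_ij.
    history.append((i, j, w_ij))
```

Only the single outcome for the selected pair is appended to the history at each time, which is the binary feedback that the algorithms in Section 3 must learn from.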

### 2.2 Optimal Actions

It is assumed that there is a subset of optimal actions within the set of available actions, and that we wish to find an optimal action as efficiently as possible. There are several optimality notions used for dueling bandits. We discuss some of these below, and note that Section 4.1 of sui2018advancements provides additional definitions.

#### 2.2.1 Copeland and Condorcet Winners

The standard definitions of optimal actions in the dueling bandits literature are Copeland and Condorcet winners. These rely on counting the number of other actions that a particular action is likely to beat in a duel (in the sense of winning with probability greater than one half). Copeland winners are the actions that are likely to beat the largest number of other actions.

If there is a single action that is likely to beat all other actions, this is known as a Condorcet winner. Copeland winners always exist, even if a Condorcet winner does not exist.
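In symbols, one standard way to write these definitions is given below. The notation is an assumption made here for illustration: $P_{i,j}$ denotes the probability that action $i$ beats action $j$, and $\mathcal{A}$ denotes the set of actions.

```latex
\[
  \mathrm{Cop}(i) = \sum_{j \in \mathcal{A},\, j \neq i}
      \mathbb{1}\!\left[ P_{i,j} > \tfrac{1}{2} \right],
  \qquad
  \mathcal{A}^{\mathrm{Cop}} = \arg\max_{i \in \mathcal{A}} \mathrm{Cop}(i).
\]
% Action i is a Condorcet winner when Cop(i) = |A| - 1,
% that is, when P_{i,j} > 1/2 for every j != i.
```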

#### 2.2.2 Maximin and Borda Winners

In this paper, we focus on two alternatives to Copeland and Condorcet winners for defining optimal actions: Maximin winners and Borda winners. Both rely on simpler measures of the win probabilities to determine the optimal actions. Maximin winners use row minimum values of the win-probability matrix, and Borda winners use its row average values. A Maximin winner is an action whose row minimum is largest, and a Borda winner is an action whose row average is largest.

Maximin and Borda winners both always exist, even if a Condorcet winner does not exist. Also, Copeland winners are not guaranteed to align with either Maximin or Borda winners. Condorcet winners are guaranteed to align with Maximin winners, but not with Borda winners. For these reasons, we find these to be compelling alternative definitions for optimal actions.
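The relationships between these winner types can be checked on a small example. The sketch below is illustrative only: the matrix `P` is hypothetical, and excluding the diagonal from the row minimum and row average is an assumption made here.

```python
def copeland_scores(P):
    """Number of opponents each action beats with probability above 1/2."""
    n = len(P)
    return [sum(1 for j in range(n) if j != i and P[i][j] > 0.5) for i in range(n)]

def maximin_scores(P):
    """Row minimum over opponents (diagonal excluded, an assumption)."""
    n = len(P)
    return [min(P[i][j] for j in range(n) if j != i) for i in range(n)]

def borda_scores(P):
    """Row average over opponents (diagonal excluded, an assumption)."""
    n = len(P)
    return [sum(P[i][j] for j in range(n) if j != i) / (n - 1) for i in range(n)]

def winners(scores):
    """All actions that attain the maximum score."""
    best = max(scores)
    return [i for i, s in enumerate(scores) if s == best]

# Hypothetical matrix: action 0 narrowly beats everyone (Condorcet winner),
# while action 1 loses narrowly to action 0 but beats action 2 by a wide margin.
P = [
    [0.50, 0.55, 0.55],
    [0.45, 0.50, 0.95],
    [0.45, 0.05, 0.50],
]

copeland = winners(copeland_scores(P))
maximin = winners(maximin_scores(P))
borda = winners(borda_scores(P))
```

In this example action 0 is the Copeland, Condorcet, and Maximin winner, while action 1 is the Borda winner, which illustrates that a Condorcet winner aligns with the Maximin winner but not necessarily with the Borda winner.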

### 2.3 Regret

To characterize the performance of the selected actions over time horizon , we can compare them against ideal selections that could have been made over that time period. This is known as regret. While it may be intuitive that an ideal sequence of selections would be any which maximizes , for a given sequence of selections (and vice versa, minimizes it for ideal selections), this is unreasonable and not possible. Selections are unknown prior to a duel, and adaptations to selection strategies are made after a duel, meaning the original given selection sequence would no longer be valid. Instead, a reasonable ideal sequence of selections that could have been made is for both and to have been optimal actions, at all times. Therefore, if the regret incurred over time horizon is minimized, then the selected actions have converged to optimal actions as efficiently as possible in that time period.
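A minimal sketch of how such a regret could be accumulated in the Borda sense is shown below. The per-round regret used here, the best Borda score minus the average Borda score of the two selected actions, is an assumption made for illustration; the formal definitions accompany the theorems in Section 4 and may differ in normalization. The matrix `P` and the function names are hypothetical.

```python
def borda_scores(P):
    """Row average over opponents (diagonal excluded, an assumption)."""
    n = len(P)
    return [sum(P[i][j] for j in range(n) if j != i) / (n - 1) for i in range(n)]

def cumulative_borda_regret(P, picks):
    """Running total of an assumed per-round Borda regret.

    picks is a list of (i, j) pairs, one selected pair per time step.
    """
    scores = borda_scores(P)
    best = max(scores)
    total, curve = 0.0, []
    for i, j in picks:
        # Zero only when both selected actions are Borda winners.
        total += best - 0.5 * (scores[i] + scores[j])
        curve.append(total)
    return curve

# Hypothetical matrix in which action 1 is the Borda winner.
P = [
    [0.50, 0.55, 0.55],
    [0.45, 0.50, 0.95],
    [0.45, 0.05, 0.50],
]
curve = cumulative_borda_regret(P, [(1, 1), (0, 1), (2, 2)])
```

With this choice, the regret stops growing exactly when both selected actions are optimal, matching the ideal sequence of selections described above.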

## 3 Algorithms

### 3.1 Thompson Sampling for Dueling Bandits

We describe Thompson Sampling in generality, in order to highlight its flexibility. It learns true parameter values, which can represent the win probabilities directly or some other latent values for each action, by sampling the posterior distribution conditioned on the history. The samples become more accurate as the information in the history increases, and are used to form an estimate of the win-probability matrix, which can be used with any optimal action definition. We present algorithms for both Maximin winners (Alg. 1) and Borda winners (Alg. 2).

An appropriate prior distribution over the parameters must be chosen so that the posterior distribution can either be determined analytically or sampled from by using computational means (such as Markov chain Monte Carlo). The prior can be used to model correlations between actions, for example by using a Gaussian Process.
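The general recipe can be sketched for Maximin winners as follows. This is a minimal illustration under stated assumptions, not a reproduction of Alg. 1: it assumes the parameters are the win probabilities themselves, places an independent uniform Beta(1, 1) prior on each pair, and draws one independent posterior sample for each of the two selected actions. All function names and the matrix `P` are hypothetical.

```python
import random

def maximin_winner(P_hat):
    """Action with the largest row minimum (diagonal excluded)."""
    n = len(P_hat)
    scores = [min(P_hat[i][j] for j in range(n) if j != i) for i in range(n)]
    return max(range(n), key=lambda i: scores[i])

def sample_matrix(wins, rng):
    """Draw one posterior sample of the win-probability matrix.

    wins[i][j] counts observed wins of i over j. Each pair i < j gets an
    independent Beta(1 + wins[i][j], 1 + wins[j][i]) posterior.
    """
    n = len(wins)
    P_hat = [[0.5] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            p = rng.betavariate(1 + wins[i][j], 1 + wins[j][i])
            P_hat[i][j], P_hat[j][i] = p, 1.0 - p
    return P_hat

def thompson_maximin(P, horizon, rng):
    """Run the sketch against a true matrix P and return picks and win counts."""
    n = len(P)
    wins = [[0] * n for _ in range(n)]
    picks = []
    for _ in range(horizon):
        # Two independent posterior samples, one per selected action.
        i = maximin_winner(sample_matrix(wins, rng))
        j = maximin_winner(sample_matrix(wins, rng))
        if i != j:  # a duel of an action with itself carries no information
            if rng.random() < P[i][j]:
                wins[i][j] += 1
            else:
                wins[j][i] += 1
        picks.append((i, j))
    return picks, wins

# Hypothetical 3-action problem in which action 0 is the Maximin winner.
P = [[0.5, 0.7, 0.8], [0.3, 0.5, 0.6], [0.2, 0.4, 0.5]]
picks, wins = thompson_maximin(P, horizon=500, rng=random.Random(0))
```

Swapping `maximin_winner` for a function that returns the action with the largest row average would give the corresponding sketch for Borda winners, since the sampled matrix can be used with any optimal action definition.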

### 3.2 SparringExp3.P for Dueling Bandits

SparringExp3.P is implemented for dueling bandits in Algorithm 3, and is inspired by the methods in ailon2014reducing and dudik2015contextual. It learns from the previous duel outcomes and accordingly adjusts the two sparring strategies using two hyperparameters. For all times and all actions, the update equations are,

(1)

(2)

Since only one duel outcome is revealed at each time, the other outcomes in the corresponding rows of the outcome matrix must be estimated. These estimates are made using the observed outcome and a third hyperparameter as follows,

(3)

for all . These estimates satisfy and for all and all times .
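The mechanics of such an update can be sketched with the standard Exp3.P learner described in bubeck2012regret, run as two sparring players. This is an illustration under assumptions and may differ from equations (1) to (3): the class `Exp3P`, the function `sparring_round`, and the hyperparameter names `eta`, `gamma`, and `beta` are hypothetical labels chosen here.

```python
import math
import random

class Exp3P:
    """Standard Exp3.P learner over a finite set of actions (gains in [0, 1])."""

    def __init__(self, n_actions, eta, gamma, beta):
        self.n = n_actions
        self.eta, self.gamma, self.beta = eta, gamma, beta
        self.gains = [0.0] * n_actions  # cumulative estimated gains

    def probabilities(self):
        m = max(self.gains)  # subtract the maximum for numerical stability
        weights = [math.exp(self.eta * (g - m)) for g in self.gains]
        total = sum(weights)
        # Exponential weights mixed with uniform exploration.
        return [(1 - self.gamma) * w / total + self.gamma / self.n for w in weights]

    def update(self, chosen, gain, probs):
        # Biased importance-weighted estimate: (gain * 1[i == chosen] + beta) / p_i.
        for i in range(self.n):
            observed = gain if i == chosen else 0.0
            self.gains[i] += (observed + self.beta) / probs[i]

def sparring_round(row, col, P, rng):
    """One duel between the actions drawn by the row and column players."""
    p, q = row.probabilities(), col.probabilities()
    i = rng.choices(range(row.n), weights=p)[0]
    j = rng.choices(range(col.n), weights=q)[0]
    w = 1 if rng.random() < P[i][j] else 0
    row.update(i, w, p)      # the row player gains when i wins
    col.update(j, 1 - w, q)  # the column player gains when i loses
    return i, j, w
```

Each player only sees the outcome of the pair that was actually dueled, so the importance weighting by `probs[i]` is what fills in the unobserved entries of the corresponding row.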

### 3.3 Partial Monitoring Forecaster for Dueling Bandits

The Partial Monitoring forecaster cesa2006prediction is implemented for dueling bandits in Algorithm 4. The forecaster learns from the previous duel outcomes and accordingly adjusts its strategy using two hyperparameters. For all times and all actions, the update equations are,

(4)

(5)

Since only one duel outcome is revealed at each time, the Borda score for each action must be estimated using the observed outcome as follows,

(6)

for all . These estimates satisfy for all and all times .
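One simple way to build an unbiased Borda-score estimate from a single duel is sketched below. It is assumed here purely for illustration and is not claimed to match equation (6): the first action is drawn from the current strategy, its opponent is drawn uniformly from the remaining actions, and the outcome is importance-weighted. The function names and the example values are hypothetical.

```python
def borda_estimate(n_actions, chosen, outcome, probs):
    """Importance-weighted estimate: outcome / p_chosen for the chosen action, 0 elsewhere."""
    return [outcome / probs[i] if i == chosen else 0.0 for i in range(n_actions)]

def expected_estimates(P, probs):
    """Exact expectation of the estimates, enumerating every possible duel."""
    n = len(P)
    expect = [0.0] * n
    for chosen in range(n):
        for opponent in range(n):
            if opponent == chosen:
                continue
            win = P[chosen][opponent]
            for outcome, p_out in ((1, win), (0, 1.0 - win)):
                # Probability of this (chosen, opponent, outcome) triple.
                weight = probs[chosen] * p_out / (n - 1)
                for i, est in enumerate(borda_estimate(n, chosen, outcome, probs)):
                    expect[i] += weight * est
    return expect

# Hypothetical matrix and strategy.
P = [
    [0.50, 0.55, 0.55],
    [0.45, 0.50, 0.95],
    [0.45, 0.05, 0.50],
]
probs = [0.5, 0.3, 0.2]
expected = expected_estimates(P, probs)
```

The enumeration confirms that, under these assumptions, the expectation of each estimate equals the row average of the win-probability matrix over the other actions, regardless of the strategy `probs`.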

### 3.4 Comparison to State-of-the-Art

Both state-of-the-art dueling bandits algorithms, ISS sui2017multi and DTS wu2016double, use variations of specific Thompson Sampling implementations. They both use Beta prior distributions for each independent, true value they attempt to learn. Since Beta distributions are conjugate priors for Bernoulli likelihoods, the independent posterior distributions can be determined analytically and are themselves Beta distributions.

While the ISS algorithm is very simple, it does not learn an estimate of the win-probability matrix. Instead, it learns the more basic overall probability of each action winning a duel with a Condorcet winner. It therefore learns one independent value for each action. Since it does not learn the win-probability matrix, it cannot learn to track a Borda winner unless it is also the Condorcet winner.

The DTS algorithm does learn an estimate of the win-probability matrix. It thus learns one independent value for each pair of actions. However, it is a complex and specialized algorithm that tracks the Copeland winner, so it cannot learn to track a Borda winner unless it is also the Copeland winner.
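The conjugacy that both algorithms rely on amounts to a simple counting update. The following sketch shows it for a single value; the Beta(1, 1) prior and the outcome sequence are hypothetical choices made here for illustration.

```python
import random

def beta_update(alpha, beta, outcomes):
    """Posterior Beta parameters after observing Bernoulli outcomes (1 = win)."""
    wins = sum(outcomes)
    return alpha + wins, beta + len(outcomes) - wins

# Hypothetical Beta(1, 1) prior updated with three wins and one loss.
alpha, beta = beta_update(1, 1, [1, 1, 0, 1])
posterior_mean = alpha / (alpha + beta)
sample = random.Random(0).betavariate(alpha, beta)  # one Thompson sample
```

ISS would keep one such pair of counts per action and DTS one per pair of actions, which is the difference in the number of learned values described above.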

## 4 Theoretical Results

In this section, we will present theorems that upper bound the regret for each of the algorithms described in the previous section, and also compare the bounds to those for the current state-of-the-art. Each of the regret upper bounds is of the order with , and this bound holds regardless of the size of any preference gaps between any two actions . All definitions of regret are normalized, such that the regret incurred at any time satisfies , and therefore . Detailed proofs are provided in the appendix.

##### Theorem 4.1

Let us define regret over time horizon in the sense of Maximin winner ,

Then, if actions are selected at each time using Thompson Sampling for Dueling Bandits with Maximin winners (Alg. 1), the expected regret is upper bounded as,

The proof method is a variation on the worst-case bound from russo2016information.

##### Theorem 4.2

Let us define regret over time horizon in the sense of Borda winner ,

Then, if actions are selected at each time using Thompson Sampling for Dueling Bandits with Borda winners (Alg. 2), using for , the expected regret is upper bounded as,

The proof method uses the same concepts from russo2016information as the proof of Theorem 4.1.

##### Theorem 4.3

Let us define regret over time horizon in the sense of Maximin winner ,

Then, if actions are selected at each time using SparringExp3.P for Dueling Bandits (Alg. 3), with hyperparameter values of,

and satisfying,

the expected regret is upper bounded as,

The proof method follows those used for Lemma 3.1 and Theorems 3.2 and 3.3 in bubeck2012regret.

##### Theorem 4.4

Let us define regret over time horizon in the sense of Borda winner ,

Then, if actions are selected at each time using the Partial Monitoring Forecaster for Dueling Bandits (Alg. 4), with hyperparameter values of,

and satisfying,

the expected regret is upper bounded as,

The proof method follows those used for Theorem 6.5 in cesa2006prediction.

### 4.1 Comparison to State-of-the-Art

Many works on dueling bandits assume that a Condorcet winner exists. In this case, the Condorcet winner beats every other action with probability greater than one half, and we let $\Delta$ be the *preference gap* between the Condorcet winner and the next best action. This commonly allows regret bounds that are logarithmic in the time horizon $T$ and inversely proportional to $\Delta$ to be proven. These bounds appear to be superior to the bounds derived in this paper. However, as discussed in bubeck2012regret (and others), when $\Delta$ is small, such a bound becomes larger than the regret for selecting the sub-optimal action each time, which is $\Delta T$. Therefore, taking a worst-case value over $\Delta$ leads to an actual regret bound that is no longer logarithmic in $T$, which is not superior to the bounds we show.
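To make the worst-case argument concrete, suppose, as an assumption for illustration, that the gap-dependent bound has the form $C \log(T)/\Delta$ for a constant $C > 0$, and that always selecting a sub-optimal action with gap $\Delta$ costs regret $\Delta T$. The regret $R(T)$ is then bounded by the smaller of the two quantities:

```latex
\[
  R(T) \;\le\; \min\!\left\{ \Delta T,\; \frac{C \log T}{\Delta} \right\}
       \;\le\; \sqrt{C\, T \log T}.
\]
% The second inequality uses min{a, b} <= sqrt(a b) for a, b >= 0.
% The two terms are equal at Delta = sqrt(C log(T) / T), which is the worst case.
```

Under this assumed form, the worst case over $\Delta$ grows like $\sqrt{T \log T}$ rather than $\log T$.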

This is the case for both state-of-the-art methods, ISS sui2017multi and DTS wu2016double. Furthermore, we note that the proof for ISS demonstrates only asymptotic convergence to a Condorcet winner, while the proof for DTS is highly complex (owing to the relatively complex nature of the algorithm). In comparison, the proofs available in Appendix A are relatively simple (though presented in a detailed manner).

## 5 Experimental Results

### 5.1 Methods

We simulate each of the proposed algorithms, along with the two state-of-the-art algorithms ISS sui2017multi and DTS wu2016double, on two different scenarios using synthetic data. For the Thompson Sampling methods, we use independent priors for the values we attempt to learn, and we let these values represent the win probabilities directly.

In the Condorcet scenario, a matrix of win probabilities is synthetically generated by linking a latent value for each action (called “utility”) to the duel winning probability for each pair of actions. The utility of each action is uniformly distributed over an interval chosen to give a larger spread of probabilities over the actions. One action has a maximum utility that is significantly better than all other actions, and so it is the lone Borda winner and Condorcet winner, and thus also the lone Maximin winner. Linking the utility of each pair of actions to the corresponding duel winning probability is accomplished by applying the logistic function to the gap between the utilities of the actions.

In the Borda scenario, we modify the previous matrix such that the action with the second largest utility becomes the lone Borda winner, even though the same Condorcet and Maximin winner still exists. This is done by changing its win probabilities against all actions other than the Condorcet winner. This aptly represents why the Borda winner is a reasonable definition for optimality: even though it is not likely to beat every action, it is the most likely to beat an action drawn at random. Each algorithm is run for a fixed time horizon, over multiple separate runs, on each scenario.
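The construction of the Condorcet scenario can be sketched as follows. The number of actions, the utility interval, and the unit scale inside the logistic function are hypothetical values chosen here for illustration, and the sketch does not enforce that the best action is significantly better than the others.

```python
import math
import random

def logistic_preference_matrix(utilities):
    """Win probabilities from the logistic function of the utility gap."""
    n = len(utilities)
    return [
        [1.0 / (1.0 + math.exp(-(utilities[i] - utilities[j]))) for j in range(n)]
        for i in range(n)
    ]

rng = random.Random(0)
# Hypothetical setup: 5 actions with utilities uniform on [0, 3].
utilities = [rng.uniform(0.0, 3.0) for _ in range(5)]
P = logistic_preference_matrix(utilities)
best = max(range(5), key=lambda i: utilities[i])  # action with the largest utility
```

Because the logistic function is increasing in the utility gap, the action with the largest utility beats every other action with probability above one half and also has the largest row minimum and row average, so it is simultaneously the Condorcet, Maximin, and Borda winner.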

### 5.2 Results

The results of the Condorcet scenario are shown in Figure 1, and the results of the Borda scenario are shown in Figure 2. In subfigure (c) of both figures, a shaded area, plotted above the mean, shows the standard deviation over the runs. Additional detailed plots of each algorithm, for each scenario, are available in Appendix B.

In the Condorcet scenario, the regret for each algorithm is as prescribed in the respective theorem, and the regret for ISS and DTS uses the Maximin winner (Theorem 4.1). All formulations for regret are comparable, because the scenario has the same winning action in all cases. Both state-of-the-art methods show very strong regret performance. However, the Thompson Sampling with Borda winners method shows comparably strong performance, with other methods also performing well. All methods beat the regret upper bounds proposed in their respective theorems.

In the Borda scenario, the regret for all algorithms (including ISS and DTS) uses the Borda winner. This is to highlight the fact that some of the methods are not capable of performing well in this type of scenario. Both state-of-the-art methods struggle with Borda winners, and so their Borda regret grows linearly. A similar behavior ultimately happens to SparringExp3.P (more details are available in the appendix). Thompson Sampling shines in this case. Both methods that focus on Borda winners are able to beat their respective regret upper bounds.

## 6 Conclusion

In this paper, we have presented four simple algorithms for Dueling Bandits, each of which is able to efficiently find an optimal action within a finite set of available actions. We proved an upper bound on regret for each, over a variety of different optimal action types, such as the Borda Winner. The proven regret bounds were all of the order with , and did not depend on any preference gap between any two actions . The algorithms were all evaluated and compared against the current state-of-the-art for Dueling Bandits, the ISS and DTS algorithms. While they did not meet or exceed the performance of ISS and DTS in certain scenarios, in others they demonstrated superior ability to find different types of optimal actions. Overall, their simplicity, regret bounds, and ability do merit inclusion with the current state-of-the-art.

## References

- (1) Shipra Agrawal and Navin Goyal. Analysis of thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory, pages 39–1, 2012.
- (2) Shipra Agrawal and Navin Goyal. Further optimal regret bounds for thompson sampling. In Artificial intelligence and statistics, pages 99–107, 2013.
- (3) Nir Ailon, Zohar Karnin, and Thorsten Joachims. Reducing dueling bandits to cardinal bandits. In International Conference on Machine Learning, pages 856–864, 2014.
- (4) Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2-3):235–256, 2002.
- (5) Sébastien Bubeck, Nicolo Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.
- (6) Nicolo Cesa-Bianchi and Gábor Lugosi. Prediction, learning, and games. Cambridge university press, 2006.
- (7) Olivier Chapelle and Lihong Li. An empirical evaluation of thompson sampling. In Advances in neural information processing systems, pages 2249–2257, 2011.
- (8) Miroslav Dudík, Katja Hofmann, Robert E Schapire, Aleksandrs Slivkins, and Masrour Zoghi. Contextual dueling bandits. arXiv preprint arXiv:1502.06362, 2015.
- (9) Kevin G Jamieson, Sumeet Katariya, Atul Deshpande, and Robert D Nowak. Sparse dueling bandits. In AISTATS, 2015.
- (10) Emilie Kaufmann, Nathaniel Korda, and Rémi Munos. Thompson sampling: An asymptotically optimal finite-time analysis. In International Conference on Algorithmic Learning Theory, pages 199–213. Springer, 2012.
- (11) Junpei Komiyama, Junya Honda, Hisashi Kashima, and Hiroshi Nakagawa. Regret lower bound and optimal algorithm in dueling bandit problem. In Conference on Learning Theory, pages 1141–1154, 2015.
- (12) Junpei Komiyama, Junya Honda, and Hiroshi Nakagawa. Optimal regret analysis of thompson sampling in stochastic multi-armed bandit problem with multiple plays. arXiv preprint arXiv:1506.00779, 2015.
- (13) Junpei Komiyama, Junya Honda, and Hiroshi Nakagawa. Copeland dueling bandit problem: Regret lower bound, optimal algorithm, and computationally efficient algorithm. arXiv preprint arXiv:1605.01677, 2016.
- (14) Daniel Russo and Benjamin Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221–1243, 2014.
- (15) Daniel Russo and Benjamin Van Roy. An information-theoretic analysis of thompson sampling. The Journal of Machine Learning Research, 17(1):2442–2471, 2016.
- (16) Daniel J Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, Zheng Wen, et al. A tutorial on thompson sampling. Foundations and Trends® in Machine Learning, 11(1):1–96, 2018.
- (17) Yanan Sui, Vincent Zhuang, Joel W Burdick, and Yisong Yue. Multi-dueling bandits with dependent arms. arXiv preprint arXiv:1705.00253, 2017.
- (18) Yanan Sui, Masrour Zoghi, Katja Hofmann, and Yisong Yue. Advancements in dueling bandits. In IJCAI, pages 5502–5510, 2018.
- (19) William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
- (20) Tanguy Urvoy, Fabrice Clerot, Raphael Féraud, and Sami Naamane. Generic exploration and k-armed voting bandits. In International Conference on Machine Learning, pages 91–99, 2013.
- (21) Huasen Wu and Xin Liu. Double thompson sampling for dueling bandits. In Advances in Neural Information Processing Systems, pages 649–657, 2016.
- (22) Yingce Xia, Haifang Li, Tao Qin, Nenghai Yu, and Tie-Yan Liu. Thompson sampling for budgeted multi-armed bandits. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
- (23) Yisong Yue, Josef Broder, Robert Kleinberg, and Thorsten Joachims. The k-armed dueling bandits problem. Journal of Computer and System Sciences, 78(5):1538–1556, 2012.
- (24) Yisong Yue and Thorsten Joachims. Interactively optimizing information retrieval systems as a dueling bandits problem. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1201–1208. ACM, 2009.
- (25) Yisong Yue and Thorsten Joachims. Beat the mean bandit. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 241–248, 2011.
- (26) Masrour Zoghi, Zohar S Karnin, Shimon Whiteson, and Maarten De Rijke. Copeland dueling bandits. In Advances in Neural Information Processing Systems, pages 307–315, 2015.
- (27) Masrour Zoghi, Shimon Whiteson, Remi Munos, and Maarten De Rijke. Relative upper confidence bound for the k-armed dueling bandit problem. arXiv preprint arXiv:1312.3393, 2013.
- (28) Masrour Zoghi, Shimon A Whiteson, Maarten De Rijke, and Remi Munos. Relative confidence sampling for efficient on-line ranker evaluation. In Proceedings of the 7th ACM international conference on Web search and data mining, pages 73–82. ACM, 2014.

## Appendix A Theoretical Results

In this section, we provide formal proofs for all theorems presented in the paper. All random variables and probability distributions use bold font.

### A.1 Proof of Theorem 4.1

The proof method is a variation on the worst case bound from [15].

First, we make the following definitions: is the expectation, is the probability measure, is the probability density, and is mutual information, all conditioned on the history , at time . Furthermore, is the Kullback-Leibler divergence and is entropy.

Then we note that Thompson Sampling selects both and using independent samples from the same posterior distribution conditioned on . Therefore, and are independent and identically distributed, and the terms and are identically distributed.

Let be the instantaneous regret at time , such that .

We claim the following,

(7)

(8)

To begin proving (7), we show,

(9)

where the second equality follows because is independent of , when conditioned on .

Furthermore,

(10)

where the second equality follows because of the assumption . Combining (9) and (10), gives (7).

Next we prove (A.1).

Here the first equality is the chain rule for mutual information, while the second follows from conditional independence of , , and , given . The third equality follows because of conditional independence of and given . The final equality is a standard identity for mutual information. Thus, (A.1) holds.

Then we bound in terms of the mutual information.

The first inequality is from Pinsker’s inequality. The second is from the Cauchy-Schwarz inequality. The third is because adding more non-negative terms cannot decrease the sum. The final inequality is because .
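For reference, Pinsker's inequality in the form commonly used in information-theoretic regret analyses such as [15] can be stated as follows. The symbols are generic and chosen here for illustration: $\mu$ and $\nu$ are probability distributions, $g$ is a function, and $D(\mu \,\|\, \nu)$ is the Kullback-Leibler divergence.

```latex
\[
  \mathbb{E}_{X \sim \mu}[g(X)] - \mathbb{E}_{X \sim \nu}[g(X)]
  \;\le\; \sqrt{\tfrac{1}{2}\, D(\mu \,\|\, \nu)}
  \qquad \text{whenever } \sup_x g(x) - \inf_x g(x) \le 1 .
\]
% This follows from bounding the difference of expectations by the total
% variation distance, which is at most sqrt(D / 2).
```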

Next we cite the following,
(see Section 5 of [15]) and therefore
(Cauchy-Schwarz inequality),

Finally, we have since there are actions, and so the desired bound is achieved.
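The final steps follow the general pattern of [15]. As a sketch under assumed notation, let $r_t$ be the instantaneous regret, $I_t$ the per-round mutual information, $\Gamma$ a constant such that the conditional expected regret is at most $\sqrt{\Gamma I_t}$, $H$ the entropy of the optimal action, and $K$ the number of actions. The constants in the theorem may differ from this generic chain:

```latex
\[
  \mathbb{E}\!\left[\sum_{t=1}^{T} r_t\right]
  \le \sum_{t=1}^{T} \mathbb{E}\!\left[\sqrt{\Gamma\, I_t}\right]
  \le \sqrt{\Gamma\, T \sum_{t=1}^{T} \mathbb{E}[I_t]}
  \le \sqrt{\Gamma\, T\, H}
  \le \sqrt{\Gamma\, T \log K}.
\]
% Second step: Jensen's inequality followed by the Cauchy-Schwarz inequality.
% Third step: the expected mutual information summed over time is at most H.
% Fourth step: the entropy of a variable taking K values is at most log K.
```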

### A.2 Proof of Theorem 4.2

The proof method uses the same concepts from [15] as the proof of Theorem 4.1.

First, we make the following definitions: is the expectation, is the probability measure, is the probability density, and is mutual information, all conditioned on the history , at time . Furthermore, is the Kullback-Leibler divergence and is entropy.

Then we note that Thompson Sampling selects both and using independent samples from the same posterior distribution conditioned on . Therefore, and are independent and identically distributed, and the terms and are identically distributed.

Let be the instantaneous regret at time , such that .

By construction,

(11)

Now we bound in terms of mutual information.

(12) through (21)

Here (12) is derived analogously to (7), and the inequality (14) follows because and
