# Preference-based Online Learning with Dueling Bandits: A Survey

In machine learning, the notion of multi-armed bandits refers to a class of online learning problems, in which an agent is supposed to simultaneously explore and exploit a given set of choice alternatives in the course of a sequential decision process. In the standard setting, the agent learns from stochastic feedback in the form of real-valued rewards. In many applications, however, numerical reward signals are not readily available -- instead, only weaker information is provided, in particular relative preferences in the form of qualitative comparisons between pairs of alternatives. This observation has motivated the study of variants of the multi-armed bandit problem, in which more general representations are used both for the type of feedback to learn from and the target of prediction. The aim of this paper is to provide a survey of the state of the art in this field, referred to as preference-based multi-armed bandits or dueling bandits. To this end, we provide an overview of problems that have been considered in the literature as well as methods for tackling them. Our taxonomy is mainly based on the assumptions made by these methods about the data-generating process and, related to this, the properties of the preference-based feedback.

## 1 Introduction

Multi-armed bandit (MAB) algorithms have received considerable attention and have been studied quite intensely in machine learning in the recent past. The great interest in this topic is hardly surprising, given that the MAB setting is not only theoretically challenging but also practically useful, as can be seen from its use in a wide range of applications. For example, MAB algorithms turned out to offer effective solutions for problems in medical treatment design (Lai and Robbins, 1985; Kuleshov and Precup, 2014), online advertisement (Chakrabarti et al., 2008), and recommendation systems (Kohli et al., 2013), just to mention a few.

The multi-armed bandit problem, or bandit problem for short, is one of the simplest instances of the sequential decision making problem, in which a learner (also called decision maker or agent) needs to select options from a given set of alternatives repeatedly in an online manner—referring to the metaphor of the eponymous gambling machine in casinos, these options are also associated with “arms” that can be “pulled”. More specifically, the agent selects one option at a time and observes a numerical (and typically noisy) reward signal providing information on the quality of that option. The goal of the learner is to optimize an evaluation criterion such as the error rate (the expected percentage of playing a suboptimal arm) or the cumulative regret (the expected difference between the sum of the rewards actually obtained and the sum of rewards that could have been obtained by playing the best arm in each round). To achieve the desired goal, the online learner has to cope with the famous exploration/exploitation dilemma (Auer et al., 2002a; Cesa-Bianchi and Lugosi, 2006; Lai and Robbins, 1985): It has to find a reasonable compromise between playing the arms that produced high rewards in the past (exploitation) and trying other, possibly even better arms the (expected) reward of which is not precisely known so far (exploration).

The assumption of a numerical reward signal is a potential limitation of the MAB setting. In fact, there are many practical applications in which it is hard or even impossible to quantify the quality of an option on a numerical scale. More generally, the lack of precise feedback or exact supervision has been observed in other branches of machine learning, too, and has led to the emergence of fields such as weakly supervised learning and preference learning (Fürnkranz and Hüllermeier, 2011). In the latter, feedback is typically represented in a purely qualitative way, namely in terms of pairwise comparisons or rankings. Feedback of this kind can be useful in online learning, too, as has been shown in online information retrieval (Hofmann, 2013; Radlinski et al., 2008). As another example, think of crowd-sourcing services like the Amazon Mechanical Turk, where simple questions such as pairwise comparisons between decision alternatives are asked to a group of annotators. The task is to approximate an underlying target ranking on the basis of these pairwise comparisons, which are possibly noisy and partially noncoherent (Chen et al., 2013). Another application worth mentioning is the ranking of XBox gamers based on their pairwise online duels; the ranking system of XBox is called TrueSkill (Guo et al., 2012).

Extending the multi-armed bandit setting to the case of preference-based feedback, i.e., the case in which the online learner is allowed to compare arms in a qualitative way, is therefore a promising idea. And indeed, extensions of that kind have received increasing attention in recent years. The aim of this paper is to provide a survey of the state of the art in the field of preference-based multi-armed bandits (PB-MAB). After recalling the basic setting of the problem in Section 2, we provide an overview of methods that have been proposed to tackle PB-MAB problems in Sections 3 and 4. Our taxonomy is mainly based on the assumptions made by these methods about the data-generating process or, more specifically, the properties of the pairwise comparisons between arms. Our survey is focused on the stochastic MAB setup, in which feedback is generated according to an underlying (unknown but stationary) probabilistic process; we do not cover the case of an adversarial data-generating process, except briefly in Section 5, although this setting has recently received a lot of attention, too (Ailon et al., 2014a; Cesa-Bianchi and Lugosi, 2012, 2006).

## 2 The Preference-based Multi-Armed Bandit Problem

The stochastic MAB problem with pairwise comparisons as actions has been studied under the notion of “dueling bandits” in several papers (Yue and Joachims, 2009; Yue et al., 2012). Although this term has been introduced for a concrete setting with specific modeling assumptions (Sui et al., 2018), it is meanwhile used more broadly for variants of that setting, too. Throughout this paper, we shall use the terms “dueling bandits” and “preference-based bandits” synonymously.

Consider a fixed set of arms (options) A = {a_1, …, a_K}. As actions, the learning algorithm (or simply the learner or agent) can perform a comparison between any pair of arms a_i and a_j, i.e., the action space can be identified with the set of index pairs (i, j) such that 1 ≤ i < j ≤ K. We assume the feedback observable by the learner to be generated by an underlying (unknown) probabilistic process characterized by a preference relation

 Q = [q_{i,j}]_{1 ≤ i,j ≤ K} ∈ [0,1]^{K×K}.

More specifically, for each pair of actions (a_i, a_j), this relation specifies the probability

 P(a_i ≻ a_j) = q_{i,j} (1)

of observing a preference for a_i in a direct comparison with a_j. Thus, each q_{i,j} specifies a Bernoulli distribution. These distributions are assumed to be stationary and independent, both across actions and iterations. Thus, whenever the learner takes action (i, j), the outcome is distributed according to (1), regardless of the outcomes in previous iterations.

The relation Q is reciprocal in the sense that q_{i,j} = 1 − q_{j,i} for all i and j. We note that, instead of only observing strict preferences, one may also allow a comparison to result in a tie (indifference). In that case, the outcome is a trinomial instead of a binomial event. Since this generalization makes the problem technically more complicated, though without changing it conceptually, we shall not consider it further. Busa-Fekete et al. (2013, 2014b) handle indifference by giving “half a point” to both arms, which, in expectation, is equivalent to deciding the winner by tossing a coin. Thus, the problem is essentially reduced to the case of binomial outcomes.

We say arm a_i beats arm a_j if q_{i,j} > 1/2, i.e., if the probability of winning in a pairwise comparison is larger for a_i than it is for a_j. Clearly, the closer q_{i,j} is to 1/2, the harder it becomes to distinguish the arms a_i and a_j based on a finite sample set. In the worst case, when q_{i,j} = 1/2, one cannot decide which arm is better based on a finite number of pairwise comparisons. Therefore,

 Δ_{i,j} = q_{i,j} − 1/2

appears to be a reasonable quantity to characterize the hardness of a PB-MAB task (whatever goal the learner wants to achieve). Note that Δ_{i,j} can also be negative (unlike the value-based setting, in which the quantity used for characterizing the complexity of a multi-armed bandit task is always positive and depends on the gap between the means of the best arm and the suboptimal arms).

### 2.1 Pairwise probability estimation

The decision making process iterates in discrete steps, either through a finite time horizon T or an infinite horizon (T = ∞). As mentioned above, the learner is allowed to compare two actions in each iteration t. Thus, in each iteration t, it selects an index pair 1 ≤ i(t) < j(t) ≤ K and observes

 a_{i(t)} ≻ a_{j(t)} with probability q_{i(t),j(t)},
 a_{j(t)} ≻ a_{i(t)} with probability q_{j(t),i(t)}.

The pairwise probabilities q_{i,j} can be estimated on the basis of finite sample sets. Consider the set of time steps among the first t iterations in which the learner decides to compare arms a_i and a_j, and denote the size of this set by n^t_{i,j}. Moreover, denoting by w^t_{i,j} and w^t_{j,i} the frequency of “wins” of a_i and a_j, respectively, the proportion of wins of a_i against a_j up to iteration t is then given by

 q̂^t_{i,j} = w^t_{i,j} / n^t_{i,j} = w^t_{i,j} / (w^t_{i,j} + w^t_{j,i}).

Since our samples are assumed to be independent and identically distributed (i.i.d.), q̂^t_{i,j} is a plausible estimate of the pairwise probability (1). Yet, this estimate might be biased, since n^t_{i,j} depends on the choice of the learner, which in turn depends on the data; therefore, n^t_{i,j} itself is a random quantity. A high-probability confidence interval for q_{i,j} can be obtained based on the Hoeffding bound (Hoeffding, 1963), which is commonly used in the bandit literature. Although the specific computation of the confidence intervals may differ from case to case, they are generally of the form [q̂^t_{i,j} − c^t_{i,j}, q̂^t_{i,j} + c^t_{i,j}]. Accordingly, if q̂^t_{i,j} − c^t_{i,j} > 1/2, arm a_i beats arm a_j with high probability; analogously, a_i is beaten by arm a_j with high probability if q̂^t_{i,j} + c^t_{i,j} < 1/2.
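To make this estimation step concrete, here is a minimal Python sketch. The function names and the simulated win probability are illustrative assumptions, not part of the survey; the confidence radius follows the standard two-sided Hoeffding bound for a Bernoulli mean.

```python
import math
import random

def hoeffding_radius(n, delta):
    """Two-sided Hoeffding confidence radius c for a Bernoulli mean
    estimated from n i.i.d. samples, valid with probability 1 - delta."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))

def estimate_preference(duel, n, delta):
    """Duel arms a_i and a_j n times; duel() returns True iff a_i wins.
    Returns the estimate q_hat and its confidence interval [lo, hi]."""
    wins = sum(1 for _ in range(n) if duel())
    q_hat = wins / n
    c = hoeffding_radius(n, delta)
    return q_hat, max(0.0, q_hat - c), min(1.0, q_hat + c)

random.seed(0)
q_ij = 0.7  # hypothetical true probability that a_i beats a_j
q_hat, lo, hi = estimate_preference(lambda: random.random() < q_ij,
                                    n=2000, delta=0.05)
# Here lo > 1/2, so a_i beats a_j with high probability.
```

Once the lower interval endpoint exceeds 1/2 (as in this run), the learner may conclude that a_i beats a_j with high probability, exactly as described above.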

### 2.2 Evaluation criteria

The goal of the online learner is usually stated as minimizing some kind of cumulative regret. Alternatively, in the “pure exploration” scenario, the goal is to identify the best arm (or the top-k arms, or a ranking of all arms) both quickly and reliably. As an important difference between these two types of targets, note that the regret of a comparison of arms depends on the concrete arms being chosen, whereas the sample complexity penalizes each comparison equally.

It is also worth mentioning that the notion of optimality of an arm is far less obvious in the preference-based setting than it is in the value-based (numerical) setting. In the latter, the optimal arm is simply the one with the highest expected reward—more generally, the expected reward induces a natural total order on the set of actions A. In the preference-based case, the connection between the pairwise preferences Q and the order induced by this relation on A is less trivial; in particular, the latter may contain preferential cycles. We shall postpone a more detailed discussion of these issues to subsequent sections, and for the time being simply assume the existence of an arm a_{i*} that is considered optimal.

### 2.3 Cumulative regret

In a preference-based setting, defining a reasonable notion of regret is not as straightforward as in the value-based setting, where the sub-optimality of an action can be expressed easily on a numerical scale. In particular, since the learner selects two arms to be compared in an iteration, the sub-optimality of both of these arms should be taken into account. A commonly used definition of regret is the following (Yue and Joachims, 2009, 2011; Urvoy et al., 2013; Zoghi et al., 2014a): Suppose the learner selects arms a_{i(t)} and a_{j(t)} in time step t. Then, the cumulative regret incurred by the learner A up to time T is

 R^T_A = ∑_{t=1}^{T} r^t = ∑_{t=1}^{T} (Δ_{i*,i(t)} + Δ_{i*,j(t)}) / 2. (2)

This regret takes into account the optimality of both arms, meaning that the learner has to select two nearly optimal arms to incur small regret. Note that this regret is zero if the optimal arm a_{i*} is compared to itself, i.e., if the learner effectively abstains from gathering further information and instead fully commits to the arm a_{i*}.
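As an illustration, the regret in (2) can be computed directly from the preference matrix once the optimal arm is known; the small matrix below is a hypothetical example, not taken from the survey.

```python
def cumulative_regret(Q, i_star, duels):
    """Cumulative regret per Eq. (2): Q is the K x K preference matrix,
    i_star the index of the optimal arm, duels a list of index pairs."""
    gap = lambda j: Q[i_star][j] - 0.5          # Delta_{i*, j}
    return sum((gap(i) + gap(j)) / 2.0 for i, j in duels)

# Hypothetical 3-arm preference matrix; arm 0 is the Condorcet winner.
Q = [[0.5, 0.6, 0.9],
     [0.4, 0.5, 0.7],
     [0.1, 0.3, 0.5]]
zero = cumulative_regret(Q, 0, [(0, 0)])        # comparing a_{i*} to itself
r = cumulative_regret(Q, 0, [(1, 2), (0, 1)])   # (0.1+0.4)/2 + (0.0+0.1)/2
```

The first call confirms the remark above: dueling the optimal arm against itself incurs zero regret.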

### 2.4 Regret bounds

In a theoretical analysis of a MAB algorithm, one is typically interested in providing a bound on the (cumulative) regret produced by that algorithm. We are going to distinguish two types of regret bound. The first one is the expected regret bound, which is of the form

 E[R^T] ≤ B(Q, K, T), (3)

where E is the expected value operator, R^T is the regret accumulated till time step T, and B is a positive real-valued function with the following arguments: the pairwise probabilities Q, the number of arms K, and the iteration number T. This function may additionally depend on parameters of the learner; however, we neglect this dependence here. The expectation is taken with respect to the stochastic nature of the data-generating process and the (possible) internal randomization of the online learner. The regret bound (3) is technically akin to the expected regret bound of value-based multi-armed bandit algorithms like the one that is calculated for UCB (Auer et al., 2002a), although the parameters used for characterizing the complexity of the learning task are different.

The bound in (3) does not inform about how the regret achieved by the learner is concentrated around its expectation. Therefore, we consider a second type of regret bound, namely one that holds with high probability. This bound can be written in the form

 P(R^T < B(Q, K, T, δ)) ≥ 1 − δ,

where the bound B now additionally depends on a confidence parameter δ. For simplicity, we also say that the regret achieved by the online learner is B(Q, K, T, δ) with high probability.

### 2.5 Sample complexity

The sample complexity analysis is considered in a “pure exploration” setup where the learner, in each iteration, must either select a pair of arms to be compared or terminate and return its recommendation. The sample complexity of the learner is then the number of pairwise comparisons it queries prior to termination. Here, 1 − δ specifies a lower bound on the probability that the learner terminates and returns the correct solution. (We consider the pure exploration setup with fixed confidence; alternatively, one can fix the horizon and control the error of the recommendation (Audibert et al., 2010; Bubeck et al., 2011, 2013).) Note that only the number of the pairwise comparisons is taken into account, which means that pairwise comparisons are equally penalized, independently of the suboptimality of the arms chosen.

The recommendation of the learner depends on the task to be solved. In the simplest case, it consists of the best arm. However, as will be discussed in Section 4, more complex predictions are conceivable, such as a complete ranking of all arms.

The above sample complexity bound is valid most of the time (in more than a (1 − δ) fraction of the runs). However, in case an error occurs and the correct recommendation is not found by the algorithm, the bound does not guarantee anything. Therefore, it cannot be directly linked to the expected sample complexity. In order to define the expected sample complexity, the learning algorithm needs to terminate in a finite number of steps with probability 1. Under this condition, running a learning algorithm on the same bandit instance results in a finite sample complexity, which is a random number distributed according to an unknown law. This distribution has finite support, since the algorithm terminates in a finite number of steps in every case. By definition, the expected sample complexity of the learning algorithm is the (finite) mean of this distribution, and the worst-case sample complexity is the upper bound of its support.

### 2.6 PAC algorithms

In many applications, one is willing to gain efficiency at the cost of optimality: The algorithm is allowed to return a solution that is only approximately optimal, though it is supposed to do so more quickly. For standard bandit problems, for example, this could mean returning an arm the expected reward of which deviates by at most some ε > 0 from the expected reward of the optimal arm.

In the preference-based setup, approximation errors are less straightforward to define. Nevertheless, the sample complexity can also be analyzed in a PAC framework as originally introduced by Even-Dar et al. (2002) for value-based MABs. A preference-based MAB algorithm is called an (ε, δ)-PAC preference-based MAB algorithm with a sample complexity N(ε, δ), if it terminates and returns an ε-optimal arm with probability at least 1 − δ, and the number of comparisons taken by the algorithm is at most N(ε, δ). If the problem is to select a single arm, ε-optimality could mean, for example, that Δ_{i*,i} < ε for the returned arm a_i, although other notions of approximation can be used as well.

### 2.7 Explore-then-exploit algorithms

Most PB-MAB algorithms for optimizing regret are based on the idea of decoupling the exploration and exploitation phases: First, the algorithm tries to identify the best arm with high probability, and then fully commits to the arm found to be best for the rest of the time (i.e., repeatedly compares this arm to itself). Algorithms implementing this principle are called “explore-then-exploit” algorithms.

Such algorithms need to know the time horizon T in advance, since being aware of the horizon, the learning algorithm is able to control the regret incurred in case it fails to identify the best arm. More specifically, assume a so-called exploratory algorithm A to be given, which is able to identify the best arm a_{i*} with probability at least 1 − δ. By setting δ to 1/T, algorithm A guarantees that the returned arm index î satisfies P(î = i*) ≥ 1 − 1/T if it terminates before iteration step T. If A terminates and commits a mistake, i.e., î ≠ i*, then the regret incurred in the exploitation phase can be as large as O(T), since the per-round regret is upper-bounded by 1 and the exploitation phase consists of at most T steps; however, this event happens with probability at most 1/T. Consequently, the expected regret of an explore-then-exploit algorithm is

 E[R^T] ≤ (1 − 1/T) E[R^T_A] + (1/T) O(T) = O(E[R^T_A] + 1).

Note that the inequality is trivially valid if A does not terminate before T.

The same argument as given above for the case of expected regret also holds for high probability regret bounds in the explore-then-exploit framework. In summary, the performance of an explore-then-exploit algorithm is bounded by the performance of the exploration algorithm. More importantly, since the per-round regret is at most 1, the sample complexity of the exploration algorithm readily upper-bounds the expected regret; this fact was pointed out by Yue and Joachims (2011) and Yue et al. (2012). Therefore, as in the case of value-based MABs, explore-then-exploit algorithms blur the distinction between the “pure exploration” and regret optimization settings.
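The explore-then-exploit scheme itself is easy to state in code. The sketch below is a generic wrapper, not a specific algorithm from the literature, and the toy explorer passed to it at the end is purely hypothetical.

```python
def explore_then_exploit(explore, T):
    """explore(delta) -> (guess, duels): an exploratory algorithm that
    identifies the best arm with probability >= 1 - delta and records the
    comparisons it made. Setting delta = 1/T caps the expected regret of
    committing to a wrong arm at O(1)."""
    guess, duels = explore(1.0 / T)
    while len(duels) < T:             # exploitation phase: compare the guess
        duels.append((guess, guess))  # to itself (zero regret if correct)
    return guess, duels

# Hypothetical toy explorer that returns arm 0 after 10 scripted duels:
guess, duels = explore_then_exploit(lambda delta: (0, [(0, 1)] * 10), T=100)
```

The wrapper makes explicit why the horizon T must be known in advance: it is needed both to set the failure probability to 1/T and to know how long to exploit.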

However, in a recent study (Zoghi et al., 2014a), a novel preference-based MAB algorithm is proposed that optimizes the cumulative regret without decoupling the exploration from the exploitation phase (for more details see Section 3.1). Without decoupling, there is no need to know the horizon in advance, which allows one to provide a horizonless regret bound that holds for any time step t.

The regret defined in (2) reflects the average quality of the decision made by the learner. Obviously, one can define a more strict or less strict regret by taking the maximum or minimum, respectively, instead of the average. Formally, the strong and weak regret in time step t are defined, respectively, as

 r^t_max = max{Δ_{i*,i(t)}, Δ_{i*,j(t)}},
 r^t_min = min{Δ_{i*,i(t)}, Δ_{i*,j(t)}}.

From a theoretical point of view, when the number of pairwise comparisons is bounded by a known horizon, these regret definitions do not lead to a fundamentally different problem. Roughly speaking, this is because most of the methods designed for optimizing regret seek to identify the best arm with high probability in the exploration phase, based on as few samples as possible.

## 3 Learning from Coherent Pairwise Comparisons

As explained in Section 2.1, learning in the PB-MAB setting essentially means estimating the pairwise preference matrix Q, i.e., the pairwise probabilities q_{i,j}. The target of the agent’s prediction, however, is not the relation Q itself, but the best arm or, more generally, a ranking ≻ of all arms a_1, …, a_K. Consequently, the least assumption to be made is a connection between Q and the target, so that information about the former is indicative of the latter. Or, stated differently, the pairwise probabilities q_{i,j} should be sufficiently coherent, so as to allow the learner to approximate and eventually identify the target (at least in the limit when the sample size grows to infinity). For example, if the target is a ranking ≻ on A, then the q_{i,j} should be somehow coherent with that ranking, e.g., in the sense that a_i ≻ a_j implies q_{i,j} > 1/2.

While this is only an example of a consistency property that might be required, different consistency or regularity assumptions on the pairwise probabilities have been proposed in the literature—needless to say, these assumptions have a major impact on how PB-MAB problems are tackled algorithmically. In this section and the next one, we provide an overview of approaches to such problems, categorized according to these assumptions (see Figure 1).

### 3.1 Axiomatic approaches

We begin this section by collecting various assumptions on pairwise preferences that can be found in the literature. As will be seen later on, by exploiting the (preference) structure imposed by these assumptions, the development of efficient algorithms will become possible.

1. Total order over arms: There is a total order ≻ on A, such that a_i ≻ a_j implies Δ_{i,j} > 0.

2. Strong stochastic transitivity: For any triplet of arms such that a_i ≻ a_j ≻ a_k, the pairwise probabilities satisfy q_{i,k} ≥ max{q_{i,j}, q_{j,k}}.

3. Relaxed stochastic transitivity: There is a γ ≥ 1 such that, for any triplet of arms such that a_i ≻ a_j ≻ a_k, the pairwise probabilities satisfy γ Δ_{i,k} ≥ max{Δ_{i,j}, Δ_{j,k}}.

4. Stochastic triangle inequality: For any triplet of arms such that a_i ≻ a_j ≻ a_k, the pairwise probabilities satisfy Δ_{i,k} ≤ Δ_{i,j} + Δ_{j,k}.

5. Existence of a Condorcet winner: An arm a_i is considered a Condorcet winner if q_{i,j} > 1/2 for all j ≠ i, i.e., if it beats all other arms in a pairwise comparison.

6. Specific structural constraints on the preference matrix Q: We will see an example of such a constraint in Section 3.1.6.

Note that the first assumption of a total order with arms separated by positive margins ensures the existence of a unique best arm, which in this case coincides with the Condorcet winner. Also note that strong stochastic transitivity is recovered from relaxed stochastic transitivity for γ = 1. Prior to describing the methods, we summarize the assumptions, targets, and goals they consider in Table 1.

#### 3.1.1 Interleaved filtering

Assuming a total order over arms, strong stochastic transitivity, and the stochastic triangle inequality, Yue et al. (2012) propose an explore-then-exploit algorithm. The exploration step consists of a simple sequential elimination strategy, called Interleaved Filtering (IF), which identifies the best arm with probability at least 1 − δ. The IF algorithm successively selects an arm a_i which is compared to other arms in a one-versus-all manner. More specifically, the currently selected arm a_i is compared to the rest of the active (not yet eliminated) arms. If an arm a_j beats a_i, that is, the lower confidence bound of q̂_{j,i} exceeds 1/2, then a_i is eliminated, and a_j is compared to the rest of the (active) arms, again in a one-versus-all manner. In addition, a simple pruning technique can be applied: if the upper confidence bound of q̂_{j,i} drops below 1/2 for an arm a_j at any time, then a_j can be eliminated, as it cannot be the best arm anymore (with high probability). After the exploration step, the exploitation step simply takes the best arm found by IF and repeatedly compares it to itself.

The authors analyze the expected regret achieved by IF. Assuming the horizon T to be finite and known in advance, they show that IF incurs an expected regret

 E[R^T_IF] = O( (K / min_{j≠i*} Δ_{i*,j}) log T ).
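A simplified simulation of the IF elimination logic might look as follows. This is a sketch only: the real algorithm's confidence radii and stopping rule are more refined, and the Bradley–Terry utilities used to generate the comparisons are an assumption of this demo.

```python
import math
import random

def interleaved_filtering(K, duel, delta, budget=4000):
    """IF-style elimination sketch: an incumbent arm is dueled against all
    active arms; confidently losing challengers are pruned, and a challenger
    that confidently beats the incumbent replaces it (the old incumbent is
    eliminated)."""
    incumbent, active = 0, set(range(1, K))
    stats = {j: [0, 0] for j in active}          # [wins over incumbent, duels]
    t = 0
    while active and t < budget:
        for j in list(active):
            t += 1
            stats[j][0] += duel(j, incumbent)
            stats[j][1] += 1
            q = stats[j][0] / stats[j][1]
            c = math.sqrt(math.log(4 * K * budget / delta) / (2 * stats[j][1]))
            if q + c < 0.5:                      # j loses w.h.p. -> prune it
                active.discard(j)
            elif q - c > 0.5:                    # j beats incumbent w.h.p.
                active.discard(j)
                incumbent = j                    # j becomes the new incumbent
                stats = {m: [0, 0] for m in active}
                break
    return incumbent

random.seed(1)
s = [1.0, 1.0, 1.0, 9.0]                 # hypothetical utilities (arm 3 best)
duel = lambda i, j: int(random.random() < s[i] / (s[i] + s[j]))
best = interleaved_filtering(4, duel, delta=0.05)
```

On this toy instance the incumbent quickly switches to arm 3, after which the remaining arms are pruned.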

#### 3.1.2 Beat the mean

In a subsequent work, Yue and Joachims (2011) relax the strong stochastic transitivity property and only require relaxed stochastic transitivity for the pairwise probabilities. Further, both the relaxed stochastic transitivity and the stochastic triangle inequality are required to hold only relative to the best arm, i.e., only for triplets a_{i*} ≻ a_j ≻ a_k, where i* is the index of the best arm a_{i*}.

With these relaxed properties, Yue and Joachims (2011) propose a preference-based online learning algorithm called Beat-The-Mean (BTM), which is an elimination strategy resembling IF. However, while IF compares a single arm to the rest of the (active) arms in a one-versus-all manner, BTM selects the arm with the fewest comparisons so far and pairs it with a randomly chosen arm from the set of active arms (using the uniform distribution). Based on the outcomes of the pairwise comparisons, a score is assigned to each active arm a_i, which is an empirical estimate of the probability that a_i wins a pairwise comparison (not taking into account which arm it was compared to). The idea is that comparing an arm to the “mean” arm, which beats half of the arms, is equivalent to comparing it to an arm randomly selected from the active set. One can deduce a confidence interval for the scores, which allows for deciding whether the scores of two arms are significantly different. An arm is then eliminated as soon as there is another arm with a significantly higher score.
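The following is a rough sketch of the BTM scoring idea. The uniform choice of opponents, the shared confidence radius, and the Bradley–Terry comparison model are all simplifying assumptions of this demo, not the algorithm's exact mechanics.

```python
import math
import random

def beat_the_mean(K, duel, delta, budget=6000):
    """BTM-style sketch: each active arm's score is its empirical win rate
    against randomly chosen active opponents; an arm is eliminated once its
    upper confidence bound falls below another arm's lower bound."""
    active = set(range(K))
    wins, n = [0] * K, [0] * K

    def bounds(a):
        q = wins[a] / n[a]
        c = math.sqrt(math.log(2 * K * budget / delta) / (2 * n[a]))
        return q - c, q + c

    for _ in range(budget):
        if len(active) == 1:
            break
        i = min(active, key=lambda a: n[a])       # fewest comparisons so far
        j = random.choice([a for a in active if a != i])
        out = duel(i, j)
        wins[i] += out; n[i] += 1
        wins[j] += 1 - out; n[j] += 1
        if all(n[a] > 0 for a in active):
            best_lower = max(bounds(a)[0] for a in active)
            for a in list(active):
                if len(active) > 1 and bounds(a)[1] < best_lower:
                    active.discard(a)
    return max(active, key=lambda a: wins[a] / max(1, n[a]))

random.seed(2)
s = [9.0, 1.0, 1.0, 1.0]                  # hypothetical utilities (arm 0 best)
duel = lambda i, j: int(random.random() < s[i] / (s[i] + s[j]))
best = beat_the_mean(4, duel, delta=0.05)
```

Note how an arm's opponents are drawn uniformly from the active set, so its score approximates its win rate against the "mean" active arm.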

In the regret analysis of BTM, a high probability bound is provided for a finite time horizon. More precisely, the regret accumulated by BTM is

 O( (γ^7 K / min_{j≠i*} Δ_{i*,j}) log T )

with high probability. This result is stronger than the one proven for IF, in which only the expected regret is upper bounded. Moreover, this high probability regret bound matches the expected regret bound in the case γ = 1 (strong stochastic transitivity). The authors also analyze the BTM algorithm in a PAC setting, and find that BTM is an (ε, δ)-PAC preference-based learner (by setting its input parameters appropriately); after simplification, its sample complexity is

 O( (γ^6 K / ε²) log( K γ log(K/δ) / (δ ε) ) ).

#### 3.1.3 Knockout tournaments

Falahatgar et al. (2017b) assume strong stochastic transitivity and the stochastic triangle inequality and consider the goals of finding the best arm as well as the best ranking in the PAC setting. More specifically, for any given ε, δ > 0, the algorithm for the best arm must output an arm a_i such that, with probability at least 1 − δ, Δ_{j,i} ≤ ε for all arms a_j, and the algorithm for the best ranking must output, with probability at least 1 − δ, a ranking in which Δ_{j,i} ≤ ε whenever a_i is ranked above a_j.

For the best arm problem they propose the KNOCKOUT algorithm, which is based on knockout tournaments and has a sample complexity that is linear in the number of arms, namely O( (K/ε²) (1 + log(1/δ)) ). The KNOCKOUT algorithm takes as input the set of all arms and runs in rounds, in which arms are randomly paired. At the end of each round, the size of the input is halved while ensuring that the maximum arm in the output set is comparable to the maximum arm in the input set, i.e., the Δ-value between them is kept small.
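A bare-bones version of the knockout idea looks as follows, with each pairwise winner decided by a majority vote over a fixed number of duels; the fixed repetition count and the Bradley–Terry model are demo assumptions (the actual algorithm adapts the per-round accuracy).

```python
import random

def knockout(arms, duel, reps=101):
    """Knockout-tournament sketch: randomly pair up the arms and keep the
    majority winner of each pair, halving the candidate set every round."""
    arms = list(arms)
    random.shuffle(arms)
    while len(arms) > 1:
        survivors = []
        if len(arms) % 2 == 1:          # odd one out advances directly
            survivors.append(arms.pop())
        for i, j in zip(arms[::2], arms[1::2]):
            wins_i = sum(duel(i, j) for _ in range(reps))
            survivors.append(i if 2 * wins_i > reps else j)
        arms = survivors
    return arms[0]

random.seed(3)
s = [1.0, 2.0, 1.0, 1.0, 24.0]          # hypothetical utilities (arm 4 best)
duel = lambda i, j: int(random.random() < s[i] / (s[i] + s[j]))
best = knockout(range(5), duel)
```

Since the candidate set halves each round, the total number of duels is dominated by the first round, which is what makes the overall sample complexity linear in K.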

For the best ranking problem, the authors propose the Binary-Search-Ranking algorithm, which uses O( (K log K (log log K)³) / ε² ) comparisons. This algorithm comprises three major steps. In the first step, it randomly selects a small set of arms, called anchors, and ranks them using a procedure called Rank-x, a PAC ranking algorithm, while at the same time creating bins between each two successive anchors. Then, in the second step, a random walk on a binary search tree is used to assign each arm to a bin. Finally, in the last step, the output ranking is produced. To this end, the arms that are close to an anchor are ranked close to it, while arms that are distant from two successive anchors are ranked using Rank-x.

#### 3.1.4 Sequential elimination

Seeking the same goals as Falahatgar et al. (2017b), but this time only requiring the property of strong stochastic transitivity, Falahatgar et al. (2017a) present the Seq-Eliminate algorithm for the best arm problem, which uses O( (K/ε²) log(K/δ) ) comparisons. The algorithm adopts a sequential elimination technique to find the best arm. More specifically, it starts by selecting a running arm at random, and keeps comparing it to another random competing arm until the better of the two is determined. It then proceeds to the next competition stage, after setting the winner from the last stage as the new running arm and eliminating the loser. The algorithm stops as soon as only a single arm remains.
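In code, the sequential elimination idea reduces to a single pass with a running arm. The fixed majority-vote budget per stage and the Bradley–Terry model are demo assumptions; the actual algorithm chooses the number of duels per stage adaptively.

```python
import random

def seq_eliminate(arms, duel, reps=101):
    """Seq-Eliminate sketch: duel a running arm against one challenger at a
    time; the majority winner of each stage survives, the loser is dropped."""
    arms = list(arms)
    running = arms[0]
    for challenger in arms[1:]:
        wins = sum(duel(challenger, running) for _ in range(reps))
        if 2 * wins > reps:             # challenger beats the running arm
            running = challenger
    return running

random.seed(4)
s = [1.0, 24.0, 1.0, 2.0]               # hypothetical utilities (arm 1 best)
duel = lambda i, j: int(random.random() < s[i] / (s[i] + s[j]))
best = seq_eliminate(range(4), duel)
```

Strong stochastic transitivity is what makes this single pass sound: once the best arm becomes the running arm, no later challenger can beat it except by noise.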

For the best ranking problem, however, the authors show that any algorithm needs Ω(K²) comparisons: they construct a model for which finding an ε-ranking reduces to finding a single coin with bias among otherwise fair coins, and show that any algorithm for the latter problem requires quadratically many comparisons.

The authors also consider the Borda-score metric without any assumptions. The Borda score of an arm a_i is s_i = (1/(K−1)) ∑_{j≠i} q_{i,j}, which gives its probability of winning against a randomly selected arm from the rest of the arms. An arm a_i such that s_i = max_j s_j is called Borda maximal (or a Borda winner). An arm a_i such that s_i ≥ max_j s_j − ε is called ε-Borda maximal. A permutation σ such that s_{σ(1)} ≥ s_{σ(2)} ≥ … ≥ s_{σ(K)} is called a Borda ranking. A permutation σ such that s_{σ(i)} ≥ s_{σ(i+1)} − ε for all i ∈ {1, …, K−1} is called an ε-Borda ranking.

They show that the problem of finding an ε-Borda maximal arm can be solved using linearly many comparisons, since PAC optimal algorithms for the standard MAB setting can be applied via the so-called Borda reduction of the dueling bandits problem to the standard MAB problem, in which drawing a sample from an arm is simulated by dueling it with another, randomly selected arm.

For the problem of finding an ε-Borda ranking, they present an algorithm that requires O( (K/ε²) log(K/δ) ) comparisons. The algorithm first approximates the Borda scores of all arms up to a small additive error (of order ε), and then ranks the arms based on these approximate scores.
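This two-step scheme (estimate Borda scores, then sort) is straightforward to sketch; the sample count and the Bradley–Terry comparison model below are illustrative assumptions.

```python
import random

def approx_borda_ranking(K, duel, samples=3000):
    """Estimate each arm's Borda score by dueling it against uniformly
    random opponents, then rank the arms by the estimated scores."""
    scores = []
    for i in range(K):
        others = [j for j in range(K) if j != i]
        w = sum(duel(i, random.choice(others)) for _ in range(samples))
        scores.append(w / samples)      # estimates (1/(K-1)) * sum_j q_{i,j}
    return sorted(range(K), key=lambda i: -scores[i]), scores

random.seed(5)
s = [16.0, 8.0, 4.0, 2.0, 1.0]          # hypothetical utilities (arm 0 best)
duel = lambda i, j: int(random.random() < s[i] / (s[i] + s[j]))
ranking, scores = approx_borda_ranking(5, duel)
```

Dueling against a uniformly random opponent is exactly the Borda reduction mentioned above: each duel is a Bernoulli sample with mean equal to the arm's Borda score.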

#### 3.1.5 Single elimination tournament

Under the assumption of the existence of a total order over the arms, Mohajer et al. (2017) study the top-k ranking problem, in which the goal is to find the top k out of the K arms in order, and the top-k partitioning problem, where only the set of the top-k arms is of interest, using the error rate as performance metric.

They first characterize an upper bound on the sample size required for both problems, and demonstrate the benefit in sample complexity of active over passive ranking.

Then, they present the Select algorithm for identifying the top arm, which can be seen as a customized single-elimination tournament consisting of multiple layers: in each layer, pairs of arms are built at random, and on the basis of pairwise comparisons, one arm of each pair is retained and the other is eliminated. This process is repeated until the top arm is identified. They subsequently show that Select finds the top arm with high probability, and they bound its sample complexity.
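A sketch of a Select-style single-elimination tournament follows; the fixed number of duels per pair `m` abstracts the layer-dependent repetition counts of the actual algorithm, and `p` is again a simulated preference matrix.

```python
import random

def select(p, m, rng):
    """Single-elimination sketch: in each layer, pair up the surviving arms
    at random, play m duels per pair, and keep the majority winner."""
    arms = list(range(len(p)))
    while len(arms) > 1:
        rng.shuffle(arms)
        survivors = []
        if len(arms) % 2 == 1:
            survivors.append(arms.pop())       # odd arm out gets a bye
        for a, b in zip(arms[0::2], arms[1::2]):
            wins = sum(rng.random() < p[a][b] for _ in range(m))
            survivors.append(a if 2 * wins >= m else b)
        arms = survivors
    return arms[0]
```

Since the number of layers is logarithmic in K and only half the arms survive each layer, most comparisons are spent in early layers on arms that are eliminated quickly.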

Lastly, they generalize Select to the Top algorithm, which works for both top-k ranking and partitioning, by first splitting the arms into sub-groups, then identifying the top arm in each sub-group using Select, and finally forming a short list that includes all winners from the sub-groups. For this list, they build a heap data structure, from which the top-k arms are extracted one after another; whenever a top arm is extracted, the second-best arm of its sub-group is identified and reinserted into the short list. They bound the sample complexity of the Top algorithm, with slightly different expressions for the ranking and the partitioning case.

#### 3.1.6 Successive elimination

Jamieson et al. (2015) focus on the pure exploration problem of finding the best arm according to the Borda criterion and consider a specific type of structural constraint on the preference matrix: a sparsity model in which there is a small set of top candidates that are similar to each other, and a large set of irrelevant candidates, each of which would always lose in a pairwise comparison against one of the top candidates.

They first show that, in such a situation, the Borda reduction, in which the number of samples required depends only on the Borda scores and not on the individual entries of the preference matrix, may result in very poor performance. Subsequently, they propose the Successive Elimination with Comparison Sparsity (SECS) algorithm, which automatically exploits this kind of structure by determining which of two arms is better on the basis of their performance with respect to a sparse set of comparison arms, leading to significant sample complexity improvements compared to the Borda reduction scheme. Basically, SECS implements the successive elimination strategy of Even-Dar et al. (2006) together with the Borda reduction and an additional elimination criterion that exploits sparsity. More specifically, SECS maintains an active set of potential Borda winners, and in each step it chooses an arm uniformly at random and compares it with all the arms in the active set. The algorithm terminates when only one arm remains.

#### 3.1.7 Preference-based UCB

In a work by Zoghi et al. (2014a), the well-known UCB algorithm (Auer et al., 2002a) is adapted from the value-based to the preference-based MAB setup. One of the main advantages of the proposed algorithm, called RUCB (for Relative UCB), is that only the existence of a Condorcet winner is required. The RUCB algorithm is based on the "optimism in the face of uncertainty" principle: the arms to be compared next are selected based on optimistic estimates of the pairwise probabilities, that is, based on the upper bounds of the confidence intervals. In each iteration, RUCB selects the set of potential Condorcet winners, namely those arms whose optimistic pairwise estimates against all other arms are at least 1/2, and picks an arm from this set uniformly at random. This arm is then compared to the opponent that may lead to the smallest regret, taking the optimistic estimates into account.
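The candidate-and-challenger selection of one RUCB iteration can be sketched as follows, with `wins[i][j]` counting how often arm `i` has beaten arm `j` so far; the value of α and the tie-handling are illustrative choices.

```python
import math, random

def rucb_step(wins, t, rng, alpha=0.51):
    """One RUCB iteration sketch: pick a potential Condorcet winner via
    optimistic estimates, then the challenger most likely to beat it."""
    K = len(wins)
    def ucb(i, j):
        n = wins[i][j] + wins[j][i]
        if n == 0:
            return 1.0                 # fully optimistic when no data yet
        return wins[i][j] / n + math.sqrt(alpha * math.log(t) / n)
    cands = [i for i in range(K)
             if all(ucb(i, j) >= 0.5 for j in range(K) if j != i)]
    c = rng.choice(cands) if cands else rng.randrange(K)
    d = max((j for j in range(K) if j != c), key=lambda j: ucb(j, c))
    return c, d
```

Arms that are clearly beaten drop out of the candidate set as their optimistic estimates fall below 1/2, so comparisons concentrate on the Condorcet winner and its strongest challengers.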

In the analysis of the RUCB algorithm, horizonless bounds are provided, both for the expected and the high-probability regret; thus, unlike the bounds for IF and BTM, these bounds are valid at each time step. Both the expected and the high-probability regret bound of RUCB grow logarithmically with time. However, while the regret bounds of IF and BTM depend only on the gaps involving the best arm, the constants in the RUCB bounds are of a different nature, although still calculated from the pairwise probabilities. Therefore, the regret bounds for RUCB are not directly comparable with those given for IF and BTM. Moreover, the regret bounds for IF and BTM are derived based on the so-called explore-then-exploit technique, which requires knowledge of the horizon in advance, whereas the regret bounds for RUCB, both in high probability and in expectation, are finite-time bounds and thus hold at any time step.

#### 3.1.8 Relative confidence sampling

Zoghi et al. (2014b) consider the cumulative regret minimization setting assuming the existence of a Condorcet winner. They propose the relative confidence sampling (RCS) algorithm, whose goal is to reduce cumulative regret by being less conservative than existing methods when eliminating arms from comparison. More specifically, RCS proceeds in two phases. First, the results of the comparisons conducted so far are used to simulate a round-robin tournament among the arms, in which posterior distributions over the pairwise probabilities are maintained and sampled from in order to determine a champion. In a second phase, the champion is compared against the challenger deemed to have the best chance of beating it. As more comparisons are conducted, it becomes more likely that the best arm is selected as both champion and challenger, causing regret to fall over time. The authors present experimental results on learning-to-rank datasets but no theoretical guarantees for the algorithm.
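One round of the champion/challenger selection can be sketched with Beta posteriors over the pairwise probabilities; the exact tournament simulation and tie-breaking below are simplifying assumptions.

```python
import random

def rcs_round(wins, rng):
    """RCS sketch: sample each pairwise probability from its Beta posterior,
    crown the arm winning the sampled round-robin as champion, then pick the
    challenger with the highest posterior mean of beating it."""
    K = len(wins)
    sampled_wins = [0] * K
    for i in range(K):
        for j in range(i + 1, K):
            theta = rng.betavariate(wins[i][j] + 1, wins[j][i] + 1)
            sampled_wins[i if theta > 0.5 else j] += 1
    champ = max(range(K), key=lambda i: sampled_wins[i])
    beats_champ = lambda j: (wins[j][champ] + 1) / (wins[j][champ] + wins[champ][j] + 2)
    chal = max((j for j in range(K) if j != champ), key=beats_champ)
    return champ, chal
```

Because the posteriors concentrate as data accumulates, the sampled tournament crowns the true Condorcet winner more and more often, which is what drives the regret down.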

#### 3.1.9 MergeRUCB

Zoghi et al. (2015b) consider the problem of finding the Condorcet winner while minimizing the total regret accumulated, when the number of arms is large. To reduce the overall number of comparisons carried out, they propose the MergeRUCB algorithm, which follows a divide-and-conquer strategy similar to merge sort: it first groups the arms in small batches and processes them separately before merging the batches together. Further, because of the stochasticity of the feedback, the local comparisons between two arms within each batch are run multiple times before losing arms are eliminated based on upper confidence bounds of the preference probabilities. When the procedure encounters similar arms, the best arm in the batch is used to eliminate them, and if a batch contains only similar arms or becomes too small, it is merged with another batch with more variety. The process ends when only a single arm remains, which is guaranteed, with high probability, to be the Condorcet winner.

Under the assumptions that arms are not repeated unless they are uninformative, i.e., provide no useful information and lose against all others, and that at most a third of the arms are uninformative, they provide theoretical performance guarantees in terms of high-probability bounds on the total regret accumulated by the algorithm. The bound is logarithmic in the number of time steps and linear in the number of arms, improving upon the regret bound of RUCB by removing its quadratic dependence on the number of arms.

#### 3.1.10 Relative minimum empirical divergence

Komiyama et al. (2015) assume that the preference matrix has a Condorcet winner, and propose the Relative Minimum Empirical Divergence (RMED) algorithm, inspired by the Deterministic Minimum Empirical Divergence (DMED) algorithm (Honda and Takemura, 2010). RMED is based on the empirical Kullback-Leibler (KL) divergence between Bernoulli distributions with parameters corresponding to the probability of one arm being preferred to another, and draws arms that are likely to be the Condorcet winner with high probability. Based on the information divergence, they show a regret bound for RMED that is logarithmic in time and optimal in the sense that its constant factor matches the asymptotic lower bound under the Condorcet and also under the total order assumption.

They also provide the RMED2 algorithm, which shares its main routine with RMED but differs in how it selects the comparison target of the first selected arm. More specifically, RMED2 tries to select the arm that is most likely the Condorcet winner in most rounds, and explores from time to time in order to reduce the regret increase when it fails to estimate the true Condorcet winner correctly. For ease of analysis, they further propose the algorithm RMED2 Fixed Horizon (RMED2FH), a static version of RMED2, and show that it is asymptotically optimal under the Condorcet assumption.

#### 3.1.11 Verification based solution

Karnin (2016) considers the problem of finding the best arm in stochastic MABs in the pure exploration setting with the goal of minimizing the sample complexity, focusing on the scenario where the failure probability δ is very small, and presents the Explore-Verify framework for improving the performance of the task in multiple generalizations of the MAB setting, including dueling bandits with the Condorcet assumption. The framework is based on the fact that in MAB problems with structure, the task of verifying the optimality of a candidate is easier than discovering the best arm. This leads to a design in which the arms are first explored and a candidate best arm is obtained with constant success probability, and it is then verified, with confidence 1−δ, whether the found arm is indeed the best. If the exploration procedure was correct, the sample complexity is the sum of that of the exploration algorithm, which is independent of δ, and that of the easier verification task, which depends on δ. Thus, for small values of δ, the savings are large, regardless of whether the sample complexity is dominated by the verification task or by the original task with constant failure probability.

In concrete terms, for the setting of dueling bandits with the Condorcet assumption in the high-confidence regime, the exploration procedure queries all pairs until it finds, for each suboptimal arm, another arm that beats it; the exploration algorithm outputs the identity of the optimal arm together with, for each suboptimal arm, the identity of the arm that beats it by the largest gap. The verification procedure then proceeds from this advice by making sure that, for each allegedly suboptimal arm, the designated arm indeed beats it. This explore-verify algorithm leads to a sample complexity that improves on the one from (Komiyama et al., 2015) in the regime of large numbers of arms and small failure probabilities.

#### 3.1.12 Winner stays

Chen and Frazier (2017) study the dueling bandit problem in the Condorcet winner setting and consider two notions of regret: strong regret, which is zero only if both arms pulled are the Condorcet winner, and weak regret, which is zero if either arm pulled is the Condorcet winner. They propose the Winner Stays (WS) algorithm with a variation for each kind of regret. WS for weak regret (WS-W) runs in a sequence of rounds, in each of which pairs of arms play each other in a sequence of iterations; the winner of an iteration plays in the next iteration against an arm randomly selected from those that have not yet played in the round. At the end of a round, the winner is considered first in the next round. WS for strong regret (WS-S) uses WS-W as a subroutine; each of its rounds consists of an exploration phase, whose length increases exponentially with the number of phases, followed by an exploitation phase.

The authors prove that WS-W has expected cumulative weak regret that is constant in time, both under the Condorcet winner setting and under the total order setting (with a better dependence on the number of arms in the latter), and that WS-S has expected cumulative strong regret that grows logarithmically in time under both settings, which is the optimal dependence on the horizon. Further, they also consider utility-based extensions of weak and strong regret, and show that their bounds also apply there, with a small modification. It is worth mentioning that even though the regret bounds of these algorithms are not optimal, they are unique in the sense that the gambler's ruin problem is used to upper bound the number of pulls of suboptimal arms, whereas all other regret optimization algorithms reviewed in this study make use of the Chernoff bound in some way.

### 3.2 Regularity through latent utility functions

The representation of preferences in terms of utility functions has a long history in decision theory (Fishburn, 1970). The idea is that the absolute preference for each choice alternative can be reflected by a real-valued utility degree. Obviously, such degrees immediately impose a total order on the set of alternatives. Typically, however, the utility degrees are assumed to be latent and not directly observable.

In (Yue and Joachims, 2009), a preference-based stochastic MAB setting is introduced in which the pairwise probabilities are directly derived from the (latent) utilities of the arms. More specifically, the authors assume a space of arms, which is not necessarily finite (this space corresponds to our set of arms; since we assume a finite set, we use a different notation here). The probability of an arm a beating an arm a′ is given by

 P(a ≻ a′) = 1/2 + δ(a, a′)

where δ(a, a′) ∈ (−1/2, 1/2). Obviously, the closer δ(a, a′) is to 0, the harder it becomes to compare the corresponding pair of arms. The authors furthermore assume the pairwise δ-values to be connected to an underlying (differentiable and strictly concave) utility function u:

 1/2 + δ(a, a′) = σ(u(a) − u(a′)),

where σ is called a link function, as it establishes a connection between the pairwise probabilities and the utilities. This function is assumed to satisfy σ(0) = 1/2 as well as lim_{x→∞} σ(x) = 1 and lim_{x→−∞} σ(x) = 0. An example of such a function is the logistic function σ(x) = 1/(1 + exp(−x)), which was used by Yue and Joachims (2009).

The problem of finding the optimal arm can be viewed as a noisy optimization task (Finck et al., 2011): the function values (utilities) cannot be observed directly; instead, only noisy pairwise comparisons of function values are available. In this framework, it is hard to obtain a reasonable estimate of the gradient. Therefore, the authors opt for an online convex optimization method (Flaxman et al., 2005), which does not require the gradient to be calculated explicitly and instead relies on an unbiased gradient approximation. This optimization algorithm is an iterative search method that proceeds in discrete time steps: in each step, it draws a random direction uniformly from the unit ball and compares the current point with an exploratory point obtained by moving a small step in that direction (projected back into the search space); the current point is then updated in the direction of the winner, with an exploratory step-size parameter controlling the update.
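The update rule can be sketched as follows; the quadratic utility, the noise-free comparison oracle in the usage example, and the specific step sizes are assumptions for illustration only.

```python
import math, random

def project(v, radius=1.0):
    """Project a point back into the ball of the given radius."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x * radius / norm for x in v] if norm > radius else list(v)

def dbgd(compare, w0, delta, gamma, T, rng):
    """DBGD sketch: probe a uniformly random direction each step; if the
    probe wins the duel against the current point, step toward it."""
    w = list(w0)
    d = len(w)
    for _ in range(T):
        u = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(x * x for x in u)) or 1.0
        u = [x / norm for x in u]                   # uniform unit direction
        probe = project([wi + delta * ui for wi, ui in zip(w, u)])
        if compare(probe, w):                       # probe won the duel
            w = project([wi + gamma * ui for wi, ui in zip(w, u)])
    return w
```

With a concave utility, accepted steps move the iterate toward the optimum, so the sequence of duels implicitly performs gradient-free ascent.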

In the theoretical analysis of the proposed method, called Dueling Bandit Gradient Descent (DBGD), the regret definition is similar to the one in (2), and can be written as

 R_T = ∑_{t=1}^{T} [δ(a*, a_t) + δ(a*, a′_t)].

Here, however, the reference arm a* is the best one known only in hindsight; that is, a* is the best arm among those evaluated during the search process.

Under a strong convexity assumption on the utility function, an expected regret bound for the proposed algorithm is derived. More specifically, assuming the search space to be the d-dimensional ball of radius R, the expected regret is

 E[R_T] ≤ 2T^{3/4} √(10RdL),

where L is a Lipschitz constant.

In an attempt to improve the performance of the DBGD algorithm (Yue and Joachims, 2009), which may suffer from large variance due to the fact that exploration is based on a single exploratory parameter, obtained as the sum of the current parameter and a real multiple of a stochastic uniform vector, Zhao and King (2016) propose two extensions of DBGD: a Dual-Point Dueling Bandit Gradient Descent (DP-DBGD) method and a Multi-Point Deterministic Gradient Descent (MP-DGD) method, which construct gradient exploration from multiple directions within one time step.

More specifically, DP-DBGD extends the exploration in DBGD to two exploratory parameters, constructed from two opposite stochastic directions, instead of only one exploratory parameter, in order to reduce the variance of the gradient approximation.

MP-DGD constructs a set of deterministic standard unit basis vectors for exploration, and updates the parameter by walking along a combination of the winning exploratory vectors, i.e., those basis vectors that perform better than the current parameter.

#### 3.2.3 Stochastic mirror descent

Kumagai (2017) studies the utility-based dueling bandit problem under convexity and smoothness assumptions on the utility function, which are stronger than those in (Yue and Joachims, 2009) and guarantee the existence of a unique optimizer, and under assumptions on the link function, which are weaker than those in (Ailon et al., 2014b) and are satisfied by common functions, including the logistic function used in (Yue and Joachims, 2009), the linear function used in (Ailon et al., 2014b), and the Gaussian distribution function.

Motivated by the fact that Yue and Joachims (2009) use a stochastic gradient descent algorithm for the dueling bandit problem, the authors propose to use a stochastic mirror descent algorithm instead, which achieves near-optimal order in convex optimization. They first reduce the dueling bandit problem to a locally-convex optimization problem and then show that regret minimization in dueling bandits and function optimization under noisy comparisons are essentially equivalent.

The proposed algorithm, called Noisy Comparison-based Stochastic Mirror Descent (NC-SMD), achieves an expected regret bound that is optimal up to a logarithmic factor.

#### 3.2.4 Reduction to value-based MAB

Ailon et al. (2014b) propose various methodologies for reducing the utility-based PB-MAB problem to the standard value-based MAB problem. In their setup, the utility u(a) of an arm a is assumed to lie in [0, 1], and the link function is linear, σ(x) = (1 + x)/2. Therefore, the probability of an arm a beating another arm a′ is

 P(a ≻ a′) = (1 + u(a) − u(a′))/2,

which is again in [0, 1]. The regret considered is the one defined in (2), where the reference arm is the globally best arm, i.e., the one with maximal utility.

In (Ailon et al., 2014b), two reduction techniques are proposed, one for a finite and one for an infinite set of arms. In both techniques, value-based MAB algorithms such as UCB (Auer et al., 2002a) are used as a black box for driving the search in the space of arms. For a finite number of arms, a value-based bandit instance is assigned to each arm, and these bandit algorithms are run in parallel. More specifically, assume that some arm is selected in the current iteration. Then, the bandit instance belonging to this arm suggests an opponent; the two arms are compared in the next iteration, and the binary reward (the outcome of the duel) is assigned to the bandit instance that made the suggestion. The suggested arm, in turn, becomes the selected arm of the next iteration, whose own bandit instance proposes the subsequent opponent. What is nice about this reduction technique is that, under some mild conditions on the performance of the bandit algorithm, the preference-based expected regret defined in (2) is asymptotically identical to the one achieved by the value-based algorithm for the standard value-based MAB task.
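For the finite-arm case, the bookkeeping can be sketched as follows, with UCB1 as the black-box value-based algorithm; the simulation matrix `p` is an assumption for testing, not part of the reduction.

```python
import math, random

class UCB1:
    """Standard UCB1 (Auer et al., 2002a), used here as a black box."""
    def __init__(self, K):
        self.n, self.s, self.t = [0] * K, [0.0] * K, 0
    def select(self):
        self.t += 1
        for i, ni in enumerate(self.n):
            if ni == 0:
                return i
        return max(range(len(self.n)), key=lambda i:
                   self.s[i] / self.n[i] + math.sqrt(2 * math.log(self.t) / self.n[i]))
    def update(self, i, r):
        self.n[i] += 1
        self.s[i] += r

def dueling_via_ucb(p, T, rng):
    """Each arm owns a UCB1 instance. The instance of the current arm
    proposes the opponent; the duel's binary outcome is its reward, and the
    proposed arm becomes the current arm of the next iteration."""
    K = len(p)
    bandits = [UCB1(K) for _ in range(K)]
    visits = [0] * K
    a = rng.randrange(K)
    for _ in range(T):
        b = bandits[a].select()
        reward = 1.0 if rng.random() < p[b][a] else 0.0
        bandits[a].update(b, reward)
        a = b
        visits[a] += 1
    return visits
```

Each bandit instance learns that proposing the best arm (or, for the best arm's own instance, proposing itself) yields the highest reward, so the chain of selected arms concentrates on the best arm.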

For infinitely many arms, the reduction technique can be viewed as a two-player game. A run is divided into epochs, and in each epoch the players start a new game. During an epoch, the second player plays adaptively according to a strategy provided by the value-based bandit instance, which is able to handle infinitely many arms, such as the ConfidenceBall algorithm by Dani et al. (2008). The first player obeys some stochastic strategy based on the strategy of the second player from the previous epoch: it always draws a random arm from the multi-set of arms containing the arms selected by the second player in the previous epoch. This reduction technique incurs an extra multiplicative factor in the expected regret of the value-based bandit instance.

#### 3.2.5 Multisort

Maystre and Grossglauser (2017) address the ranking problem when comparison outcomes are generated from the Bradley-Terry (BT) model (Bradley and Terry, 1952) with parameters θ₁, …, θ_K, which represent the utilities of the arms. Under the BT model, the probability that an arm a_i is preferred to a_j is given by

 P(a_i ≻ a_j) = 1/(1 + exp[−(θ_i − θ_j)]).

Thus, they end up with a utility-based MAB instance with the logistic link function as in (Yue and Joachims, 2009).

Under the assumption that the distance between adjacent parameters is stochastically uniform across the ranking, they first show that the output of a single call of the QuickSort algorithm (Hoare, 1962) is a good approximation to the ground-truth ranking, where the quality of a ranking estimate is measured by its displacement with respect to the ground truth in terms of Spearman's footrule distance ∑_i |r_i − r̂_i|, with r_i and r̂_i the ranks of item i in the two rankings. They then show that aggregating multiple independent runs of QuickSort using Copeland's method (Copeland, 1951), in which the arms are sorted by the total number of their pairwise wins, can recover the ground truth everywhere except at a vanishing fraction of the items, i.e., all but a vanishing fraction of the arms are correctly ranked. Based on this, they propose an active-learning strategy that consists of repeatedly sorting the items: for a given budget of pairwise comparisons, they run QuickSort repeatedly until the budget is exhausted, collecting the comparison pairs and their outcomes while ignoring the produced rankings themselves, and then induce the final ranking estimate from the maximum-likelihood estimate over the set of all pairwise comparison outcomes.
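The repeated-QuickSort-plus-Copeland scheme can be sketched as follows; the BT comparator and the fixed number of runs are illustrative stand-ins for the budget-based loop and the final ML estimation.

```python
import math, random

def bt_duel(theta, i, j, rng):
    """Bradley-Terry comparison: True iff item i beats item j."""
    return rng.random() < 1.0 / (1.0 + math.exp(-(theta[i] - theta[j])))

def noisy_quicksort(items, theta, rng, winners):
    """QuickSort with a noisy BT comparator; records every duel's winner."""
    if len(items) <= 1:
        return list(items)
    pivot = items[rng.randrange(len(items))]
    left, right = [], []
    for x in items:
        if x == pivot:
            continue
        if bt_duel(theta, x, pivot, rng):
            winners.append(x)          # x preferred to the pivot
            left.append(x)
        else:
            winners.append(pivot)
            right.append(x)
    return (noisy_quicksort(left, theta, rng, winners) + [pivot]
            + noisy_quicksort(right, theta, rng, winners))

def multisort(theta, runs, rng):
    """Aggregate the duels of several QuickSort runs by Copeland's method:
    sort items by their total number of pairwise wins."""
    winners = []
    for _ in range(runs):
        noisy_quicksort(list(range(len(theta))), theta, rng, winners)
    wins = [winners.count(i) for i in range(len(theta))]
    return sorted(range(len(theta)), key=lambda i: -wins[i])
```

Each individual run may misplace items, but the aggregated win counts separate the items whose parameters are well apart.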

### 3.3 Regularity through statistical models

Since the most general task in the realm of preference-based bandits is to elicit a ranking of the complete set of arms based on noisy (probabilistic) feedback, it is quite natural to establish a connection to statistical models of rank data (Marden, 1995).

The idea of relating preference-based bandits to rank data models has been put forward by Busa-Fekete et al. (2014a), who assume the underlying data-generating process to be given in the form of a probability distribution P over S_K, the set of all permutations of the K arms (the symmetric group of order K) or, via a natural bijection, the set of all rankings (total orders) of the arms.

The probabilities for pairwise comparisons are then obtained as marginals of P. More specifically, with P(r) the probability of observing the ranking r, the probability that a_i is preferred to a_j is obtained by summing P(r) over all rankings r in which a_i precedes a_j:

 q_{i,j} = P(a_i ≻ a_j) = ∑_{r ∈ L(r_j > r_i)} P(r)   (4)

where L(r_j > r_i) denotes the subset of permutations for which the rank r_j of a_j is higher than the rank r_i of a_i (smaller ranks indicate higher preference). In this setting, the learning problem essentially comes down to making inference about P based on samples in the form of pairwise comparisons.
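For small K, the marginals in (4) can be computed directly from an explicit distribution over rankings; here a ranking is encoded as a tuple listing the arms from most to least preferred.

```python
from itertools import permutations

def pairwise_marginals(P):
    """q[i][j] = sum of P(r) over all rankings r that place arm i before
    arm j; P maps permutation tuples (best arm first) to probabilities."""
    K = len(next(iter(P)))
    q = [[0.0] * K for _ in range(K)]
    for r, prob in P.items():
        rank = {arm: pos for pos, arm in enumerate(r)}
        for i in range(K):
            for j in range(K):
                if i != j and rank[i] < rank[j]:
                    q[i][j] += prob
    return q
```

For the uniform distribution over rankings, every marginal equals 1/2, and for a point mass on a single ranking, the marginals are 0/1 indicators of precedence.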

#### 3.3.1 Mallows

Busa-Fekete et al. (2014a) assume the underlying probability distribution to be a Mallows model (Mallows, 1957), one of the most well-known and widely used statistical models of rank data (Marden, 1995). The Mallows model or, more specifically, the Mallows ϕ-distribution, is a parameterized, distance-based probability distribution that belongs to the family of exponential distributions:

 P(r | ϕ, r̃) = (1/Z(ϕ)) ϕ^{d(r, r̃)}   (5)

where r̃ ∈ S_K and ϕ ∈ (0, 1] are the parameters of the model: r̃ is the location parameter (center ranking) and ϕ the spread parameter. Moreover, d(·, ·) is the Kendall distance on rankings, that is, the number of discordant pairs:

 d(r, r̃) = ∑_{1 ≤ i < j ≤ K} I{(r_i − r_j)(r̃_i − r̃_j) < 0}

where I{·} denotes the indicator function. The normalization factor in (5) can be written as

 Z(ϕ) = ∑_{r ∈ S_K} ϕ^{d(r, r̃)} = ∏_{i=1}^{K−1} ∑_{j=0}^{i} ϕ^j

and thus depends only on the spread parameter (Fligner and Verducci, 1986). Note that, since ϕ^{d(r, r̃)} is decreasing in the distance d(r, r̃) for ϕ < 1, the center ranking r̃ is the mode of the distribution, that is, the most probable ranking according to the Mallows model.
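The product form of the normalization constant can be checked against the brute-force sum over all K! rankings for small K:

```python
from itertools import permutations

def kendall(r, s):
    """Kendall distance: number of discordant pairs of two rank vectors."""
    K = len(r)
    return sum(1 for i in range(K) for j in range(i + 1, K)
               if (r[i] - r[j]) * (s[i] - s[j]) < 0)

def mallows_Z_product(phi, K):
    """Closed form: Z(phi) = prod_{i=1}^{K-1} sum_{j=0}^{i} phi^j."""
    Z = 1.0
    for i in range(1, K):
        Z *= sum(phi ** j for j in range(i + 1))
    return Z

def mallows_Z_bruteforce(phi, K):
    """Direct sum of phi^d(r, r~) over all K! rankings (r~ = identity)."""
    center = tuple(range(K))
    return sum(phi ** kendall(r, center) for r in permutations(range(K)))
```

The closed form avoids the K!-term sum, which is what makes likelihood computations under the Mallows model tractable.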

In (Busa-Fekete et al., 2014a), three different goals of the learner, all meant to be achieved with probability at least 1 − δ, are considered, depending on whether the application calls for the prediction of a single arm, a full ranking of all arms, or the entire probability distribution:

1. The MPI problem consists of finding the most preferred item i*, namely the item whose probability of being top-ranked is maximal:

 i* = argmax_{1 ≤ i ≤ K} E_{r∼P} I{r_i = 1} = argmax_{1 ≤ i ≤ K} ∑_{r ∈ L(r_i = 1)} P(r)
2. The MPR problem consists of finding the most probable ranking r*:

 r* = argmax_{r ∈ S_K} P(r)
3. The KLD problem calls for producing a good estimate P̂ of the distribution P, that is, an estimate with small KL divergence:

 KL(P, P̂) = ∑_{r ∈ S_K} P(r) log(P(r)/P̂(r)) < ε

In the case of Mallows, it is easy to see that r̃_i < r̃_j implies q_{i,j} > 1/2 for any pair of items a_i and a_j. That is, the center ranking defines a total order on the set of arms: if an arm a_i precedes an arm a_j in the (center) ranking, then a_i beats a_j in a pairwise comparison. (Recall that this property is an axiomatic assumption underlying the IF and BTM algorithms. Interestingly, the stochastic triangle inequality, which is also assumed by Yue et al. (2012), is not satisfied by the Mallows ϕ-model.) Moreover, as shown by Mallows (1957), the pairwise probabilities can be calculated analytically as functions of the model parameters ϕ and r̃: for any pair of items a_i and a_j such that r̃_i = i < j = r̃_j, the pairwise probability is given by q_{i,j} = g(i, j, ϕ), where

 g(i,j,ϕ)=h(j−i+1,ϕ)−h(j−i,ϕ)

with h(k, ϕ) = k/(1 − ϕ^k). Based on this result, one can show that the "margin"

 min_{i ≠ j} |1/2 − q_{i,j}|

around 1/2 is relatively wide. Moreover, the result also implies that q_{i,j} moves further away from 1/2 the further apart the items a_i and a_j lie in the center ranking. Therefore, deciding whether an arm has higher or lower rank than another (with respect to r̃) is easier than selecting the preferred option from two candidates that are adjacent in the center ranking.

Based on these observations, one can devise an efficient algorithm for identifying the most preferred arm when the underlying distribution is Mallows. The algorithm proposed in (Busa-Fekete et al., 2014a) for the MPI problem, called MallowsMPI, is similar to the one used for finding the largest element in an array. However, since a stochastic environment is assumed in which the outcomes of pairwise comparisons are random variables, a single comparison of two arms a_i and a_j is not enough; instead, they are compared until

 1/2 ∉ [q̂_{i,j} − c_{i,j}, q̂_{i,j} + c_{i,j}].   (6)

This simple strategy finds the most preferred arm with probability at least 1 − δ, with a sample complexity whose leading factor is linear in K and which depends on the spread of the Mallows model through the quantity ρ = (1 − ϕ)/(1 + ϕ).
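The max-search with the stopping rule (6) can be sketched as follows; the anytime confidence-interval width is one standard Hoeffding-style choice, not necessarily the paper's exact constants, and `p` is a simulated preference matrix.

```python
import math, random

def confident_duel(p_ij, delta, rng, max_n=1000000):
    """Duel two arms until 1/2 falls outside the confidence interval around
    the empirical win rate; True iff the first arm is declared the winner."""
    wins = 0
    for n in range(1, max_n + 1):
        wins += rng.random() < p_ij
        q_hat = wins / n
        c = math.sqrt(math.log(4.0 * n * n / delta) / (2.0 * n))
        if q_hat - c > 0.5 or q_hat + c < 0.5:
            return q_hat > 0.5
    return q_hat > 0.5

def mallows_mpi(p, delta, rng):
    """Find the most preferred arm like the maximum of an array, replacing
    single comparisons by confident duels."""
    best = 0
    for i in range(1, len(p)):
        if not confident_duel(p[best][i], delta, rng):
            best = i
    return best
```

The wide margin around 1/2 under the Mallows model is exactly what makes each confident duel terminate quickly.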

For the MPR problem, a sampling strategy called MallowsMerge is proposed, which is based on the merge sort algorithm for selecting the arms to be compared. However, as in the case of MallowsMPI, two arms a_i and a_j are not compared only once, but until condition (6) holds. The MallowsMerge algorithm finds the most probable ranking, which coincides with the center ranking of the Mallows model, with a sample complexity of

 O((K log₂K / ρ²) · log(K log₂K / (δρ))),

where ρ = (1 − ϕ)/(1 + ϕ). The leading factor of the sample complexity of MallowsMerge differs from that of MallowsMPI by a logarithmic factor. This was to be expected, and simply reflects the difference in worst-case complexity between finding the largest element in an array and sorting an array with merge sort.

The KLD problem turns out to be very hard in the case of Mallows: even for small K, the sample complexity required for a good approximation of the underlying Mallows model is extremely high. In (Busa-Fekete et al., 2014a), the existence of a polynomial algorithm for this problem (under the assumption of the Mallows model) was left as an open question.

#### 3.3.2 Plackett-Luce

Szörényi et al. (2015) assume that the underlying probability distribution is a Plackett-Luce (PL) model (Plackett, 1975; Luce, 1959). The PL model is parametrized by a vector θ = (θ₁, …, θ_K) of positive weights, where θ_i can be interpreted as the weight or "strength" of the option a_i. The probability assigned by the PL model to a ranking represented by a permutation π is given by

 P_θ(π) = ∏_{i=1}^{K} θ_{π⁻¹(i)} / (θ_{π⁻¹(i)} + θ_{π⁻¹(i+1)} + … + θ_{π⁻¹(K)}).   (7)

The product on the right-hand side of (7) is the probability of producing the ranking in a stagewise process: First, the item on the first position is selected, then the item on the second position, and so forth. In each step, the probability of an item to be chosen next is proportional to its weight. Consequently, items with a higher weight tend to occupy higher positions. In particular, the most probable ranking (i.e., the mode of the PL distribution) is simply obtained by sorting the items in decreasing order of their weight:

 τ = argmax_{π ∈ S_K} P_θ(π) = argsort_{k ∈ [K]} {θ₁, …, θ_K}   (8)
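The stagewise form (7) and the sorting characterization (8) of the mode are easy to verify directly for small K:

```python
from itertools import permutations

def pl_prob(theta, pi):
    """Plackett-Luce probability of the ranking pi (tuple of items, best
    first): pick items stagewise with probability proportional to weight."""
    prob, rest = 1.0, list(pi)
    for item in pi:
        prob *= theta[item] / sum(theta[k] for k in rest)
        rest.remove(item)
    return prob

def pl_mode(theta):
    """Most probable ranking: items sorted by decreasing weight."""
    return tuple(sorted(range(len(theta)), key=lambda i: -theta[i]))
```

The probabilities over all K! rankings sum to one, and the most probable ranking is obtained by sorting the weights in decreasing order.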

The authors consider two different goals of the learner, both meant to be achieved with high probability. In the first problem, called PACI (for PAC-item), the goal is to find an item that is almost as good as the Condorcet winner, i.e., an item a_j that the Condorcet winner a_{i*} beats with probability at most 1/2 + ε; for this problem, they devise the PLPAC algorithm and bound its sample complexity.

The second goal, called AMPR (approximate most probable ranking), is to find a ranking that coincides with the most probable one up to ε, in the sense that it does not invert any pair of items whose pairwise probability deviates from 1/2 by more than ε; for this problem, they propose the PLPAC-AMPR algorithm, whose sample complexity they bound as well.

Both algorithms are based on a budgeted version of the QuickSort algorithm (Hoare, 1962), which reduces its quadratic worst-case complexity to the order O(K log K), and which provably preserves the pairwise stability property (the pairwise marginals obtained from the distribution defined by the QuickSort algorithm coincide with the marginals of the PL distribution).

## 4 Learning from Noncoherent Pairwise Comparisons

The methods presented in the previous section essentially proceed from a given target, for example a ranking of all arms, which is considered as a "ground truth". The preference feedback in the form of (stochastic) pairwise comparisons provides information about this target and, consequently, should obey certain consistency or regularity properties. This is perhaps most explicitly expressed in Section 3.3, in which the pairwise probabilities q_{i,j} are derived as marginals of a probability distribution on the set of all rankings, which can be seen as modeling a noisy observation of the ground truth given in the form of the center ranking.

Another way to look at the problem is to start from the pairwise preferences themselves, that is, to consider the pairwise probabilities q_{i,j} as the ground truth. In tournaments in sports, for example, q_{i,j} may express the probability of one team beating another. In this case, there is no underlying ground-truth ranking from which these probabilities are derived; instead, it is just the other way around: a ranking is derived from the pairwise comparisons. Moreover, there is no reason why the q_{i,j} should be coherent in a specific sense. In particular, preferential cycles and violations of transitivity are commonly observed in many applications.

This is exactly the challenge faced by ranking procedures, which have been studied quite intensely in operations research and decision theory (Moulin, 1988; Chevaleyre et al., 2007). A ranking procedure turns the matrix of pairwise probabilities into a complete preorder of the alternatives under consideration. Thus, another way to pose the preference-based MAB problem is to take this preorder as the target for prediction; the connection between the pairwise probabilities and the target is then established by the ranking procedure, which of course needs to be given as part of the problem specification.

Formally, a ranking procedure is a map from preference matrices to the set of complete preorders on the set of alternatives. We denote the complete preorder produced by the ranking procedure on the basis of the matrix Q = (q_{i,j}) by ⪰, leaving the dependence on Q implicit when it is clear from the context. Below we present some of the most common instantiations of the ranking procedure:

1. Copeland’s ranking ($\succ_{CO}$) is defined as follows (Moulin, 1988): $i \succ_{CO} j$ if and only if $d_i > d_j$, where $d_i = \#\{k : q_{i,k} > 1/2\}$ is the Copeland score of option $i$. The interpretation of this relation is very simple: an option $i$ is preferred to $j$ whenever $i$ “beats” more options than $j$ does.

2. The sum of expectations ($\succ_{SE}$) (or Borda) ranking is a “soft” version of $\succ_{CO}$: $i \succ_{SE} j$ if and only if

$$q_i = \frac{1}{K-1}\sum_{k \neq i} q_{i,k} \;>\; \frac{1}{K-1}\sum_{k \neq j} q_{j,k} = q_j. \tag{9}$$
3. The idea of the random walk (RW) ranking ($\succ_{RW}$) is to handle the matrix $\mathbf{Q} = [q_{i,j}]$ as a transition matrix of a Markov chain and order the options based on its stationary distribution. More precisely, RW first transforms $\mathbf{Q}$ into a stochastic matrix $\mathbf{S}$ by normalizing its entries appropriately. Then, it determines the stationary distribution $\mathbf{v} = (v_1, \ldots, v_K)$ for this matrix (i.e., the eigenvector corresponding to the largest eigenvalue 1). Finally, the options are sorted according to these probabilities: $i \succ_{RW} j$ iff $v_i > v_j$. The $\succ_{RW}$ ranking is directly motivated by the PageRank algorithm (Brin and Page, 1998), which has been well studied in social choice theory (Cohen et al., 1999; Brandt and Fischer, 2007) and rank aggregation (Negahban et al., 2012), and which is widely used in many application fields (Brin and Page, 1998; Kocsor et al., 2008).
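As an illustration, the three ranking procedures above can be sketched in a few lines of NumPy. The example matrix is hypothetical, and the particular normalization used to make the random-walk transition matrix stochastic is an assumption, since the text leaves the exact transformation open:

```python
import numpy as np

# A hypothetical preference matrix with a cycle (option 0 beats 1,
# 1 beats 2, 2 beats 0), so no coherent ground-truth ranking exists.
Q = np.array([[0.5, 0.8, 0.4],
              [0.2, 0.5, 0.9],
              [0.6, 0.1, 0.5]])

def copeland_scores(Q):
    """d_i = number of options k with q_{i,k} > 1/2."""
    beats = Q > 0.5
    np.fill_diagonal(beats, False)
    return beats.sum(axis=1)

def borda_scores(Q):
    """q_i = average of q_{i,k} over all k != i."""
    K = Q.shape[0]
    return (Q.sum(axis=1) - np.diag(Q)) / (K - 1)

def random_walk_scores(Q):
    """Stationary distribution of a Markov chain derived from Q.

    Here the chain moves from option i to option j with probability
    proportional to q_{j,i} (j beating i), so probability mass
    accumulates on strong options, as in PageRank-style aggregation.
    """
    S = Q.T / Q.T.sum(axis=1, keepdims=True)  # row-stochastic matrix
    vals, vecs = np.linalg.eig(S.T)
    v = np.abs(np.real(vecs[:, np.argmax(np.real(vals))]))
    return v / v.sum()

def ranking(scores):
    """Option indices sorted from best to worst."""
    return list(np.argsort(-scores))
```

On this cyclic matrix, all Copeland scores coincide (every option beats exactly one other), whereas the Borda scores still break the tie, which already hints at how different procedures can disagree on noncoherent data.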

In Table 2, we summarize the assumptions, targets, and goals considered in the approaches for the dueling bandits problem with noncoherent pairwise comparisons, prior to elaborating on the methods.

### 4.1 Preference-based racing

The learning problem considered by Busa-Fekete et al. (2013) is to find, for some $k < K$, the top-$k$ arms with respect to the $\succ_{CO}$, $\succ_{SE}$, and $\succ_{RW}$ ranking procedures with high probability. To this end, three different learning algorithms are proposed in the finite horizon case, with the horizon given in advance. In principle, these learning problems are very similar to the value-based racing task (Maron and Moore, 1994, 1997), where the goal is to select the $k$ arms with the highest means. However, in the preference-based case, the ranking over the arms is determined by the ranking procedure instead of the means. Accordingly, the algorithms proposed by Busa-Fekete et al. (2013) consist of a successive selection and rejection strategy. The sample complexity bounds of all algorithms scale quadratically with the number of arms $K$. Thus, they are not as tight in the number of arms as those considered in Section 3. This is mainly due to the lack of any assumptions on the structure of $\mathbf{Q}$. Since there are no regularities, and hence no redundancies in $\mathbf{Q}$ that could be exploited, a sufficiently good estimation of the entire relation is needed to guarantee a good approximation of the target ranking in the worst case.

### 4.2 PAC rank elicitation

In a subsequent work by Busa-Fekete et al. (2014b), an extended version of the top-$k$ selection problem is considered. In the PAC rank elicitation problem, the goal is to find a ranking that is “close” to the ranking produced by the ranking procedure $\mathcal{A}$ with high probability. To make this problem feasible, more practical ranking procedures are considered. In fact, the problem of ranking procedures like Copeland is that a minimal change of a value $q_{i,j}$ may strongly influence the induced order relation $\succ_{CO}$. Consequently, the number of samples needed to assure (with high probability) a certain approximation quality may become arbitrarily large. A similar problem arises for $\succ_{SE}$ as a target order if some of the individual scores $q_i$ are very close or equal to each other.

As a practical (yet meaningful) solution to this problem, the relations $\succ_{CO}$ and $\succ_{SE}$ are made a bit more “partial” by imposing stronger requirements on the order. To this end, let $d_i^{\epsilon} = \#\{k : q_{i,k} > 1/2 + \epsilon\}$ denote the number of options that are beaten by option $i$ with a margin $\epsilon > 0$, and let $s_i^{\epsilon} = \#\{k : |q_{i,k} - 1/2| \le \epsilon\}$. Then, the $\epsilon$-insensitive Copeland relation is defined as follows: $i \succ_{CO}^{\epsilon} j$ if and only if $d_i^{\epsilon} > d_j^{\epsilon} + s_j^{\epsilon}$. Likewise, in the case of $\succ_{SE}$, small differences of the $q_i$ are neglected and the $\epsilon$-insensitive sum of expectations relation is defined as follows: $i \succ_{SE}^{\epsilon} j$ if and only if $q_i > q_j + 2\epsilon$.

These $\epsilon$-insensitive extensions are interval (and hence partial) orders, that is, they are obtained by characterizing each option $i$ by an interval ($[d_i^{\epsilon}, d_i^{\epsilon} + s_i^{\epsilon}]$ in the case of $\succ_{CO}^{\epsilon}$, and $[q_i - \epsilon, q_i + \epsilon]$ in the case of $\succ_{SE}^{\epsilon}$) and sorting intervals according to $[a, b] \succ [a', b']$ iff $a > b'$. It is readily shown that $\succ_{CO}^{0} \,\subseteq\, \succ_{CO}$, with equality if $q_{i,k} \neq 1/2$ for all $i \neq k$ (and similarly for $\succ_{SE}^{0}$ and $\succ_{SE}$). The parameter $\epsilon$ controls the strictness of the order relations, and thereby the difficulty of the rank elicitation task.
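For concreteness, the interval construction underlying the $\epsilon$-insensitive Copeland relation can be sketched as follows; the preference matrix is hypothetical:

```python
import numpy as np

def eps_copeland_intervals(Q, eps):
    """Interval [d_i^eps, d_i^eps + s_i^eps] for each option i."""
    off = ~np.eye(Q.shape[0], dtype=bool)          # exclude self-comparisons
    d = ((Q > 0.5 + eps) & off).sum(axis=1)        # beaten with margin eps
    s = ((np.abs(Q - 0.5) <= eps) & off).sum(axis=1)  # "too close to call"
    return d, d + s

def prefers(lo, hi, i, j):
    """i beats j in the interval order iff i's whole interval lies above j's."""
    return lo[i] > hi[j]

# Hypothetical matrix: option 0 clearly beats 1 and 2, while the
# comparison between 1 and 2 is close to 1/2.
Q = np.array([[0.50, 0.90, 0.80],
              [0.10, 0.50, 0.55],
              [0.20, 0.45, 0.50]])

lo, hi = eps_copeland_intervals(Q, eps=0.1)
```

With $\epsilon = 0.1$, option 0 gets the interval $[2, 2]$ and options 1 and 2 both get $[0, 1]$: option 0 is preferred to both others, while 1 and 2 remain incomparable, illustrating the partiality of $\succ_{CO}^{\epsilon}$.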

As mentioned above, the task in PAC rank elicitation is to approximate $\succ_{\mathcal{A}}$ without knowing the $q_{i,j}$. Instead, relevant information can only be obtained through sampling pairwise comparisons from the underlying distribution. Thus, the options can be compared in a pairwise manner, and a single sample essentially informs about a pairwise preference between two options $i$ and $j$. The goal is to devise a sampling strategy that keeps the size of the sample (the sample complexity) as small as possible while producing an estimation that is “good” in a PAC sense: