1 Introduction
Let $g: \mathcal{X} \to \mathbb{R}$ be a well-behaved black-box function defined on a bounded subset $\mathcal{X} \subseteq \mathbb{R}^q$. We are interested in solving the global optimization problem of finding
$$x_{\min} = \arg\min_{x \in \mathcal{X}} g(x). \qquad (1)$$
We assume that $g$ is not directly accessible and that queries to $g$ can only be made in pairs of points, or duels, $[x, x'] \in \mathcal{X} \times \mathcal{X}$, from which we obtain binary feedback in $\{0, 1\}$ that represents whether or not $x$ is preferred over $x'$ (has lower value). Here $[x, x']$ denotes the vector resulting from concatenating both elements involved in the duel. We will consider that $x$ is the winner of the duel if the output is $1$ and that $x'$ wins the duel if the output is $0$. The goal is to find $x_{\min}$ with as few queried duels as possible.
Our setup differs from the one typically used in Bayesian optimization (bo), where direct feedback from $g$ in the domain is available (Jones, 2001; Snoek et al., 2012). In our context, the objective is a latent object that is only accessible via indirect observations. Although the scenario described in this work has not received wide attention, there exists a variety of real-world scenarios in which the objective function needs to be optimized via preferential returns. Most cases involve modeling latent human preferences, such as web design via A/B testing or the use of recommender systems (Brusilovsky et al., 2007). In prospect theory, the models used are based on comparisons with some reference point, as it has been demonstrated that humans are better at evaluating differences than absolute magnitudes (Kahneman and Tversky, 1979).
Optimization methods for pairwise preferences have been studied in the multi-armed bandits context (Yue et al., 2012). Zoghi et al. (2014) propose a new method for the K-armed dueling bandit problem based on the Upper Confidence Bound algorithm. Jamieson et al. (2015) study the problem by allowing noisy comparisons between the duels. Zoghi et al. (2015b) choose actions using contextual information. Dudík et al. (2015) study Copeland dueling bandits, a case in which a Condorcet winner, or an arm that uniformly wins the duels with all the other arms, may not exist. Szörényi et al. (2015) study the Online Rank Elicitation problem in the dueling bandits setting. An analysis of Thompson sampling in dueling bandits is given by Wu et al. (2016). Yue and Joachims (2011) propose a method that requires neither transitivity nor comparison outcomes with independent and time-stationary distributions.
Preference learning has also been studied (Chu and Ghahramani, 2005) in the context of Gaussian process (gp) models by using a likelihood able to model preferential returns. Similarly, Brochu (2010) used a probabilistic model to actively learn preferences in the context of discovering optimal parameters for simple graphics and animation engines. To select new duels, standard acquisition functions like the Expected Improvement (ei) (Mockus, 1977) are extended on top of a gp model with a likelihood for duels. Although this approach is simple and effective in some cases, new duels are selected greedily, which may lead to over-exploitation.
In this work we propose a new approach that aims to combine the good properties of the armed-bandit methods with the advantages of having a probabilistic model able to capture correlations across the points in the domain $\mathcal{X}$. Following the above-mentioned literature in the bandits setting, the key idea is to learn a preference function in the space of the duels by using a Gaussian process. This allows us to select the most relevant comparisons non-greedily and improve the state of the art in this domain.
This paper is organized as follows. In Section 2 we introduce the point of view followed in this work to model latent preferences. We define concepts such as the Copeland score function and the Condorcet winner, which form the basis of our approach. Also in Section 2, we show how to learn these objects from data. In Section 3 we generalize the most commonly used acquisition functions to the dueling case. In Section 4 we illustrate the benefits of the proposed framework compared to state-of-the-art methods in the literature. We conclude in Section 5 with a discussion and some future lines of research.
2 Learning latent preferences
The approach followed in this work is inspired by the work of Ailon et al. (2014), in which cardinal bandits are reduced to ordinal ones. Similarly, here we focus on the idea of reducing the problem of finding the optimum of a latent function $g$ defined on $\mathcal{X}$ to determining a sequence of duels on $\mathcal{X} \times \mathcal{X}$.
We assume that each duel $[x, x']$ produces a joint reward $f([x, x'])$ that is never directly observed. Instead, after each pair is proposed, the obtained feedback is a binary return $y \in \{0, 1\}$ representing which of the two locations is preferred. In this work, we assume that $f([x, x']) = g(x') - g(x)$, but other alternatives are possible. Note that the more $x$ is preferred over $x'$, the larger the reward.
Since the preferences of humans are often unclear and may conflict, we model preferences as a stochastic process. In particular, the model of preference is a Bernoulli probability function
$$p(y = 1 \mid [x, x']) = \pi_f([x, x'])$$
and
$$p(y = 0 \mid [x, x']) = 1 - \pi_f([x, x']),$$
where $\pi_f([x, x']) = \pi(f([x, x']))$ and $\pi: \mathbb{R} \to [0, 1]$ is an inverse link function. Via the latent loss, $f$ maps each query $[x, x']$ to the probability of having a preference on the left input $x$ over the right input $x'$. The inverse link function has the property that $\pi_f([x', x]) = 1 - \pi_f([x, x'])$. A natural choice for $\pi_f$ is the logistic function
$$\pi_f([x, x']) = \sigma(f([x, x'])) = \frac{1}{1 + e^{-f([x, x'])}}, \qquad (2)$$
but others are possible. Note that for any duel $[x, x']$ in which $g(x) \le g(x')$ it holds that $\pi_f([x, x']) \ge 0.5$. $\pi_f$ is therefore a preference function that fully specifies the problem.
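As an illustration of the preference model above, the following minimal sketch simulates duels under the logistic link of eq. (2) with reward $g(x') - g(x)$. The function names (`sigmoid`, `preference_probability`, `simulate_duel`) are hypothetical, and the Forrester function is used only as a stand-in latent objective; none of this is taken from the authors' implementation.

```python
import math
import random


def sigmoid(z):
    """Logistic inverse link of eq. (2)."""
    return 1.0 / (1.0 + math.exp(-z))


def forrester(x):
    """Forrester test function on [0, 1], used as a stand-in latent objective g."""
    return (6.0 * x - 2.0) ** 2 * math.sin(12.0 * x - 4.0)


def preference_probability(g, x, x_prime):
    """Probability that x beats x', assuming the reward g(x') - g(x)."""
    return sigmoid(g(x_prime) - g(x))


def simulate_duel(g, x, x_prime, rng):
    """Bernoulli draw: 1 if x wins the duel, 0 if x' wins."""
    return 1 if rng.random() < preference_probability(g, x, x_prime) else 0


rng = random.Random(0)  # fixed seed so the sketch is reproducible
p = preference_probability(forrester, 0.75, 0.5)
wins = sum(simulate_duel(forrester, 0.75, 0.5, rng) for _ in range(2000))
print(f"P(0.75 beats 0.5) = {p:.4f}, empirical win rate = {wins / 2000:.4f}")
```

Swapping the two arguments of `preference_probability` returns one minus the original probability, which is the antisymmetry property stated for the inverse link.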
We introduce here the concept of normalized Copeland score, already used in the literature of ranking methods (Zoghi et al., 2015a), as
$$S(x) = \mathrm{Vol}(\mathcal{X})^{-1} \int_{\mathcal{X}} \mathbb{I}_{\{\pi_f([x, x']) \ge 0.5\}} \, dx',$$
where $\mathrm{Vol}(\mathcal{X}) = \int_{\mathcal{X}} dx'$ is a normalizing constant that bounds $S(x)$ in the interval $[0, 1]$. If $\mathcal{X}$ is a finite set, the Copeland score is simply the proportion of duels that a certain element $x$ will win with probability larger than 0.5. Instead of the Copeland score, in this work we use a soft version of it, in which the probability function $\pi_f$ is integrated over $\mathcal{X}$ without further truncation. Formally, we define the soft-Copeland score as
$$C(x) = \mathrm{Vol}(\mathcal{X})^{-1} \int_{\mathcal{X}} \pi_f([x, x']) \, dx', \qquad (3)$$
which aims to capture the ‘averaged’ probability of $x$ being the winner of a duel.
Following the armed-bandits literature, we say that $x_c$ is a Condorcet winner if it is the point with maximal soft-Copeland score. It is straightforward to see that if $x_c$ is a Condorcet winner with respect to the soft-Copeland score, it is a global minimum of $g$ in $\mathcal{X}$: the integral in (3) takes its maximum value for points $x \in \mathcal{X}$ such that $f([x, x']) = g(x') - g(x) \ge 0$ for all $x'$, which only occurs if $x_c$ is a minimum of $g$. This implies that if, by observing the results of a set of duels, we can learn the preference function $\pi_f$, the optimization problem of finding the minimum of $g$ can be addressed by finding the Condorcet winner of the Copeland score. See Figure 1 for an illustration of this property.
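The link between the Condorcet winner and the minimum of the latent function can be checked numerically. The sketch below is an illustration under stated assumptions, not the authors' code: it replaces the integral in (3) by an average over a uniform grid on $[0, 1]$, assumes the logistic link with reward $g(x') - g(x)$, and uses the Forrester function as an example objective. The helper names are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic inverse link."""
    return 1.0 / (1.0 + math.exp(-z))


def forrester(x):
    """Forrester test function on [0, 1], a stand-in for the latent g."""
    return (6.0 * x - 2.0) ** 2 * math.sin(12.0 * x - 4.0)


def copeland(g, x, grid):
    """Fraction of grid opponents that x beats with probability >= 0.5."""
    return sum(1 for xp in grid if sigmoid(g(xp) - g(x)) >= 0.5) / len(grid)


def soft_copeland(g, x, grid):
    """Grid average of the win probability of x (discretized eq. (3))."""
    return sum(sigmoid(g(xp) - g(x)) for xp in grid) / len(grid)


grid = [i / 200 for i in range(201)]
condorcet = max(grid, key=lambda x: soft_copeland(forrester, x, grid))
minimizer = min(grid, key=forrester)
print(condorcet, minimizer)  # both are the same grid point
```

Because the soft-Copeland score is a decreasing function of $g(x)$ when the true preference function is known, maximizing it over the grid recovers the grid minimizer of $g$.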
2.1 Learning the preference function with Gaussian processes
Assume that $N$ duels have been performed so far, resulting in a dataset $\mathcal{D} = \{[x_i, x'_i], y_i\}_{i=1}^{N}$. Given $\mathcal{D}$, inference over the latent function $f$ and its warped version $\pi_f$ can be carried out by using Gaussian processes (gp) for classification (Rasmussen and Williams, 2005).
In a nutshell, a gp is a probability measure over functions $f$ such that any linear restriction is multivariate Gaussian. Any gp is fully determined by a positive definite covariance operator. In standard regression cases with Gaussian likelihoods, closed forms for the posterior mean and variance are available. In preference learning, as in the setting we face here, the basic idea behind Gaussian process modeling is to place a gp prior over some latent function $f$ that captures the membership of the data to the two classes and to squash it through the logistic function to obtain some prior probability $\pi_f$. In other words, the model for a gp for classification looks similar to eq. (2), with the difference that $f$ is a stochastic process, as is $\pi_f$. The stochastic latent function $f$ is a nuisance function, as we are not directly interested in its values but instead in particular values of $\pi_f$ at test locations $[x_\star, x'_\star]$.
Inference is divided in two steps. First we need to compute the distribution of the latent variable corresponding to a test case, $p(f_\star \mid \mathcal{D}, [x_\star, x'_\star], \theta)$, and later use this distribution over the latent $f_\star$ to produce a prediction
$$\pi_f([x_\star, x'_\star]; \mathcal{D}, \theta) = p(y_\star = 1 \mid \mathcal{D}, [x_\star, x'_\star], \theta) = \int \sigma(f_\star) \, p(f_\star \mid \mathcal{D}, [x_\star, x'_\star], \theta) \, df_\star, \qquad (4)$$
where the vector $\theta$ contains the hyper-parameters of the model, which can also be marginalized out. In this scenario, gp predictions are not straightforward (in contrast to the regression case), since the posterior distribution is analytically intractable and approximations are required (see Rasmussen and Williams (2005) for details). The important message here is, however, that given data on the locations and results of the duels we can learn the preference function $\pi_f$ by taking into account the correlations across the duels, which makes the approach very data efficient compared to bandit scenarios where correlations are ignored.
2.2 Computing the soft-Copeland score and the Condorcet winner
The soft-Copeland function can be obtained by integrating $\pi_f([x, x'])$ over $\mathcal{X}$, so it is possible to learn the soft-Copeland function from data by integrating $\pi_f([x, x']; \mathcal{D}, \theta)$. Unfortunately, a closed-form solution for
$$\mathrm{Vol}(\mathcal{X})^{-1} \int_{\mathcal{X}} \pi_f([x, x']; \mathcal{D}, \theta) \, dx'$$
does not necessarily exist. In this work we use Monte Carlo integration to approximate the Copeland score at any $x \in \mathcal{X}$ via
$$C(x; \mathcal{D}, \theta) \approx \frac{1}{M} \sum_{k=1}^{M} \pi_f([x, x_k]; \mathcal{D}, \theta), \qquad (5)$$
where $x_1, \dots, x_M$ are a set of landmark points used to perform the integration. For simplicity, in this work we select the landmark points uniformly, although more sophisticated probabilistic approaches can be applied (Briol et al., 2015).
The Condorcet winner can be computed by taking
$$x_c = \arg\max_{x \in \mathcal{X}} C(x; \mathcal{D}, \theta),$$
which can be done using a standard numerical optimizer. $x_c$ is the point that has, on average, the maximum probability of winning most of the duels (given the data set $\mathcal{D}$) and therefore it is the most likely point to be the optimum of $g$.
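The Monte Carlo approximation with uniformly drawn landmark points can be sketched as follows. This is a toy illustration rather than the authors' implementation: the learned preference function is replaced by a known one, built from the assumed objective $g(x) = x$ on $[0, 1]$ with a logistic link, so that the estimate can be compared with the exact integral $\log(1 + e^{1 - x}) - \log(1 + e^{-x})$. All function names are hypothetical.

```python
import math
import random


def sigmoid(z):
    """Logistic inverse link."""
    return 1.0 / (1.0 + math.exp(-z))


def preference(x, x_prime):
    """Toy preference function for the assumed objective g(x) = x."""
    return sigmoid(x_prime - x)


def soft_copeland_mc(pref, x, landmarks):
    """Monte Carlo estimate of the soft-Copeland score, as in eq. (5)."""
    return sum(pref(x, xk) for xk in landmarks) / len(landmarks)


def soft_copeland_exact(x):
    """Exact integral of sigmoid(x' - x) over x' in [0, 1], toy case only."""
    return math.log1p(math.exp(1.0 - x)) - math.log1p(math.exp(-x))


rng = random.Random(0)
landmarks = [rng.random() for _ in range(5000)]  # uniform landmark points
estimate = soft_copeland_mc(preference, 0.25, landmarks)
exact = soft_copeland_exact(0.25)

# Condorcet winner over a candidate grid, reusing the same landmarks.
candidates = [i / 20 for i in range(21)]
winner = max(candidates, key=lambda x: soft_copeland_mc(preference, x, landmarks))
print(f"estimate={estimate:.4f} exact={exact:.4f} winner={winner}")
```

Reusing one set of landmarks for every candidate keeps the estimated scores comparable across candidates, so the maximizer is not affected by landmark resampling noise. In this sketch a grid search stands in for the numerical optimizer.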
3 Sequential Learning of the Condorcet winner
In this section we analyze the case in which extra duels can be carried out to augment the dataset $\mathcal{D}$ before we have to report a solution to (1). This is similar to the setup in Brochu (2010), where interactive Bayesian optimization is proposed by allowing a human user to sequentially decide the result of a number of duels.
In the sequel, we will denote by $\mathcal{D}_j$ the data set resulting from augmenting $\mathcal{D}$ with $j$ new pairwise comparisons. Our goal in this section is to define a sequential policy for querying duels: $\alpha([x, x']; \mathcal{D}_j, \theta)$. This policy will enable us to identify as soon as possible the minimum of the latent function $g$. Note that here, unlike in standard Bayesian optimization, the search space of the acquisition, $\mathcal{X} \times \mathcal{X}$, is not the same as the domain $\mathcal{X}$ of the latent function that we are optimizing. Our best guess about its optimum, however, is the location of the Condorcet winner.
We approach the problem by proposing three dueling acquisition functions: (i) pure exploration (pe), (ii) the Copeland Expected Improvement (cei) and (iii) dueling-Thompson sampling (dts), which makes explicit use of the generative capabilities of our model. We analyze the three approaches in terms of what the exploration-exploitation balance means in our context. To simplify the notation, in the sequel we drop the dependency of all quantities on the parameters $\theta$ of the model.
3.1 Pure Exploration
The first question that arises when defining a new acquisition for duels is what exploration means in this context. Given a model as described in Section 2.1, the output variables $y_\star$ follow a Bernoulli distribution with probability given by the preference function $\pi_f$. A straightforward interpretation of pure exploration would be to search for the duel whose outcome is most uncertain (has the highest variance of $y_\star$). The variance of $y_\star$ is given by
$$\mathbb{V}[y_\star \mid [x_\star, x'_\star], \mathcal{D}_j] = \pi_f([x_\star, x'_\star]; \mathcal{D}_j) \left(1 - \pi_f([x_\star, x'_\star]; \mathcal{D}_j)\right).$$
However, as preferences are modeled with a Bernoulli model, the variance of $y_\star$ does not necessarily reduce with sufficient observations. For example, according to eq. (2), for any two values $x_\star$ and $x'_\star$ such that $g(x_\star) \approx g(x'_\star)$, $\pi_f([x_\star, x'_\star])$ will tend to be close to $0.5$, and therefore it will have maximal variance even if we have already collected several observations in that region of the duels space.
Alternatively, exploration can be carried out by searching for the duel where the gp is most uncertain about the probability of the outcome (has the highest variance of $\sigma(f_\star)$), which is the result of transforming our epistemic uncertainty about $f$, modeled by a gp, through the logistic function. The first-order moment of this distribution coincides with the expectation of $y_\star$, but its variance is
$$\mathbb{V}[\sigma(f_\star)] = \int \left(\sigma(f_\star) - \mathbb{E}[\sigma(f_\star)]\right)^2 p(f_\star \mid \mathcal{D}_j, [x_\star, x'_\star]) \, df_\star = \int \sigma(f_\star)^2 \, p(f_\star \mid \mathcal{D}_j, [x_\star, x'_\star]) \, df_\star - \mathbb{E}[\sigma(f_\star)]^2,$$
which explicitly takes into account the uncertainty over $f$. Hence, pure exploration of the duels space can be carried out by maximizing
$$\alpha_{\mathrm{PE}}([x_\star, x'_\star] \mid \mathcal{D}_j) = \mathbb{V}[\sigma(f_\star) \mid [x_\star, x'_\star], \mathcal{D}_j].$$
Note that in this case, duels that have already been visited will have a lower chance of being visited again, even in cases in which the objective takes similar values for both players. See Figure 2 for an illustration of this property.
In practice, this acquisition requires computing an intractable integral, which we approximate using Monte Carlo.
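The difference between the two notions of uncertainty can be illustrated with a small Monte Carlo sketch. It assumes, purely for illustration, that the posterior over the latent value of a duel is Gaussian with a given mean and standard deviation; the function name `moments_of_sigma` and the numbers used are hypothetical.

```python
import math
import random


def sigmoid(z):
    """Logistic inverse link."""
    return 1.0 / (1.0 + math.exp(-z))


def moments_of_sigma(mean, std, n_samples=20000, seed=0):
    """Monte Carlo mean and variance of sigmoid(f) for f ~ N(mean, std^2)."""
    rng = random.Random(seed)
    values = [sigmoid(rng.gauss(mean, std)) for _ in range(n_samples)]
    m = sum(values) / n_samples
    v = sum((s - m) ** 2 for s in values) / n_samples
    return m, v


# Two duels between equally good points (latent mean 0):
# one already well explored (small std), one unexplored (large std).
m_known, v_known = moments_of_sigma(0.0, 0.1)
m_unknown, v_unknown = moments_of_sigma(0.0, 3.0)

# The Bernoulli variance of the outcome is close to 0.25 in both cases...
bernoulli_known = m_known * (1.0 - m_known)
bernoulli_unknown = m_unknown * (1.0 - m_unknown)
# ...while the variance of sigmoid(f) separates the two duels.
print(bernoulli_known, bernoulli_unknown, v_known, v_unknown)
```

Both duels have an outcome variance near its maximum of 0.25, so the Bernoulli variance cannot distinguish them, whereas the variance of the squashed latent value is much larger for the unexplored duel.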
3.2 Copeland Expected Improvement
An alternative way to define an acquisition function is by generalizing the idea of the Expected Improvement (Mockus, 1977). The idea of the ei is to compute, in expectation, the marginal gain with respect to the current best observed output. In our context, as we do not have direct access to the objective, our only way of evaluating the quality of a single point is by computing its Copeland score. To generalize the idea to our context we need to find the duel able to maximally improve the expected score of the Condorcet winner.
Denote by $c^\star_j$ the value of the Condorcet winner when $j$ duels have already been run. For any new proposed duel $[x, x']$, two outcomes $\{0, 1\}$ are possible, which correspond to whether $x$ or $x'$ wins the duel. We denote by $c^\star_{x,j}$ the value of the estimated Condorcet winner resulting from augmenting $\mathcal{D}_j$ with $\{[x, x'], 1\}$ and by $c^\star_{x',j}$ the equivalent value but augmenting the dataset with $\{[x, x'], 0\}$. We define the one-lookahead Copeland Expected Improvement at iteration $j$ as:
$$\alpha_{\mathrm{CEI}}([x, x'] \mid \mathcal{D}_j) = \mathbb{E}\left[(c - c^\star_j)_+ \mid \mathcal{D}_j\right], \qquad (6)$$
where $(\cdot)_+ = \max(0, \cdot)$ and the expectation is taken over $c$, the value at the Condorcet winner given the result of the duel. The next duel is selected as the pair that maximizes the cei. Intuitively, the cei evaluated at $[x, x']$ is a weighted sum of the total increase of the best possible value of the Copeland score in the two possible outcomes of the duel. The weights are chosen to be the probabilities of the two outcomes, which are given by $\pi_f$. The cei can be computed in closed form as
$$\alpha_{\mathrm{CEI}}([x, x'] \mid \mathcal{D}_j) = \pi_{f,j}([x, x']) \, (c^\star_{x,j} - c^\star_j)_+ + \pi_{f,j}([x', x]) \, (c^\star_{x',j} - c^\star_j)_+.$$
This acquisition is computationally demanding, as it requires updating the gp classification model for every fantasized output at any point in the domain. As we show in the experimental section, and similarly to what is observed in the literature about the ei, this acquisition tends to be greedy in certain scenarios, leading to over-exploitation (Hernández-Lobato et al., 2014; Hennig and Schuler, 2012). Although non-myopic generalizations of this acquisition are possible to address this issue in the same fashion as in González et al. (2016b), these are intractable.
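The weighted-sum reading of the cei can be written down in a few lines. The sketch below only combines quantities that are assumed to be already available (the predicted win probability and the Condorcet values after each fantasized outcome); refitting the gp for each fantasized outcome, which is the expensive part, is not shown. The function and argument names are hypothetical.

```python
def copeland_expected_improvement(p_x_wins, c_current, c_if_x_wins, c_if_x_prime_wins):
    """Weighted sum of the positive gains in the Condorcet value.

    p_x_wins          -- predicted probability that x wins the duel
    c_current         -- Condorcet value with the data collected so far
    c_if_x_wins       -- Condorcet value after fantasizing that x wins
    c_if_x_prime_wins -- Condorcet value after fantasizing that x' wins
    """
    gain_x = max(0.0, c_if_x_wins - c_current)
    gain_x_prime = max(0.0, c_if_x_prime_wins - c_current)
    return p_x_wins * gain_x + (1.0 - p_x_wins) * gain_x_prime


# Illustrative numbers only: x is likely to win and would raise the score.
value = copeland_expected_improvement(0.7, 0.60, 0.65, 0.55)
print(round(value, 4))
```

Only outcomes that increase the Condorcet value contribute, so a duel whose two fantasized outcomes both leave the value unchanged or lower has zero cei.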
3.3 Dueling-Thompson sampling
As we have previously detailed, pure exploration does not exploit the available knowledge about the current best location, and cei is expensive to compute and tends to over-exploit. In this section we propose an alternative acquisition function that is fast to compute and explicitly balances exploration and exploitation. We call this acquisition dueling-Thompson sampling and it is inspired by Thompson sampling approaches. It follows a two-step policy for duels:

Step 1, selecting $x_{\mathrm{next}}$: First, generate a sample $\tilde{f}$ from the model using continuous Thompson sampling and compute the associated soft-Copeland score by integrating over $\mathcal{X}$. (Approximate continuous samples from a gp with shift-invariant kernel can be obtained by using Bochner’s theorem (Bochner et al., 1959). In a nutshell, the idea is to approximate the kernel by means of the inner product of a finite number of Fourier features that are sampled according to the measure associated with the kernel (Rahimi and Recht, 2008; Hernández-Lobato et al., 2014).) The first element of the new duel, $x_{\mathrm{next}}$, is selected as:
$$x_{\mathrm{next}} = \arg\max_{x \in \mathcal{X}} \int_{\mathcal{X}} \pi_{\tilde{f}}([x, x']; \mathcal{D}_j) \, dx'.$$
The term $\mathrm{Vol}(\mathcal{X})^{-1}$ in the Copeland score has been dropped here as it does not change the location of the optimum. The goal of this step is to balance exploration and exploitation in the selection of the Condorcet winner, in the same fashion as Thompson sampling does: it is likely to select a point close to the current Condorcet winner, but the policy also allows exploration of other locations as we base our decision on a stochastic $\tilde{f}$. Also, the more evaluations are collected, the more greedy the selection will be towards the Condorcet winner. See Figure 3 for an illustration of this effect.

Step 2, selecting $x'_{\mathrm{next}}$: Given $x_{\mathrm{next}}$, the second element of the duel is selected as the location that maximizes the variance of $\sigma(f_\star)$ in the direction of $x_{\mathrm{next}}$. More formally, $x'_{\mathrm{next}}$ is selected as
$$x'_{\mathrm{next}} = \arg\max_{x'_\star \in \mathcal{X}} \mathbb{V}[\sigma(f_\star) \mid [x_\star, x'_\star], \mathcal{D}_j, x_\star = x_{\mathrm{next}}].$$
This second step is purely explorative in the direction of $x_{\mathrm{next}}$ and its goal is to find informative comparisons to run with current good locations identified in the previous step.
In summary, the dueling-Thompson sampling approach selects the next duel as:
$$[x_{\mathrm{next}}, x'_{\mathrm{next}}],$$
where $x_{\mathrm{next}}$ and $x'_{\mathrm{next}}$ are defined above. This policy combines the selection of a point with high chances of being the optimum with a point whose duel against the former has an uncertain result. See Figure 4 for a visual illustration of the two steps in a toy example. See Algorithm 1 for a full description of the pbo approach.
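The two-step policy can be sketched on a finite grid as follows. This is a structural illustration only: the gp posterior is abstracted away behind two hypothetical callables, `sample_latent` (which returns one sampled latent function over duels) and `variance_of_sigma` (which returns the posterior variance of the squashed latent value for a duel), and the toy stand-ins used to exercise the function are not a fitted model.

```python
import math


def sigmoid(z):
    """Logistic inverse link."""
    return 1.0 / (1.0 + math.exp(-z))


def dueling_thompson_step(grid, sample_latent, variance_of_sigma):
    """Select the next duel [x_next, x_prime_next] on a finite grid."""
    f_tilde = sample_latent()  # one draw of the latent function over duels

    # Step 1: maximize the unnormalized soft-Copeland score of the sample.
    def sampled_score(x):
        return sum(sigmoid(f_tilde(x, xp)) for xp in grid)

    x_next = max(grid, key=sampled_score)

    # Step 2: pick the opponent whose duel against x_next is most uncertain.
    x_prime_next = max(grid, key=lambda xp: variance_of_sigma(x_next, xp))
    return x_next, x_prime_next


# Toy stand-ins (assumptions): a noise-free "sample" built from a quadratic
# objective with minimum at 0.3, and an uncertainty that grows with distance.
def toy_sample_latent():
    return lambda x, xp: (xp - 0.3) ** 2 - (x - 0.3) ** 2


def toy_variance(x, xp):
    return abs(x - xp)


grid = [i / 10 for i in range(11)]
duel = dueling_thompson_step(grid, toy_sample_latent, toy_variance)
print(duel)
```

With a real model, `sample_latent` would return a different function at every call, which is what makes the first step explore locations other than the current Condorcet winner.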
3.4 Generalizations to multiple returns scenarios
A natural extension of the pbo setup detailed above is the case in which multiple comparisons of inputs are simultaneously allowed. This is equivalent to providing a ranking over a set of points $\{x_1, \dots, x_k\}$. Rankings are trivial to map to pairwise preferences by using the pairwise ordering to obtain the result of the duels. The problem is, therefore, equivalent from a modeling perspective. However, from the point of view of selecting the locations to rank in each iteration, generalizations of the above-mentioned acquisitions are required. Although this is not the goal of this work, it is interesting to remark that this problem has strong connections with that of computing batches in bo (González et al., 2016a).
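The mapping from a ranking to pairwise preferences can be sketched in a few lines. The convention assumed here is that the ranking lists points from most to least preferred and that an outcome of 1 means the first element of the duel wins; the function name is hypothetical.

```python
from itertools import combinations


def ranking_to_duels(ranking):
    """Expand a best-first ranking into ((x, x_prime), outcome) duels.

    Every pair keeps the ranking order, so the first element is always
    the preferred one and the outcome is always 1.
    """
    return [((better, worse), 1) for better, worse in combinations(ranking, 2)]


duels = ranking_to_duels([0.7, 0.2, 0.9])
print(duels)
```

A ranking of $k$ points yields $k(k-1)/2$ duels, all of which can be appended to the dataset used to fit the preference model.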
4 Experiments
In this section we present three experiments which validate our approach in terms of performance and illustrate its key properties. The setup is as follows: we have a non-linear black-box function $g(x)$ whose minimum we seek, as described in equation (1). However, we can only query this function through pairwise comparisons. The outcome of a pairwise comparison is generated as described in Section 2, i.e., the outcome is drawn from a Bernoulli distribution whose success probability is computed according to equation (2).
We have considered the Forrester, ‘six-hump camel’, ‘Goldstein-Price’ and ‘Levy’ functions as latent objective functions to optimize. The Forrester function is 1-dimensional, whereas the rest are defined in domains of dimension 2. The explicit formulation of these objectives and the domains in which they are optimized are available as part of standard optimization benchmarks (https://www.sfu.ca/ssurjano/optimisation.html). The pbo framework is applicable in the continuous setting. In this section, however, the search for the optimum of the objectives is performed on a finite grid (with the same number of points per dimension in all cases), which has practical advantages: the integral in eq. (5) can easily be treated as a sum and, more importantly, we can compare pbo with bandit methods that are only defined in discrete domains. Each comparison starts with 5 initial (randomly selected) duels and a total budget of 200 duels is run, after which the best location of the optimum should be reported. Further, each algorithm is run 20 times (trials) with different initial duels (the same for all methods). As exceptions, random is run for 100 trials to give a reliable curve, and pbo-cei is run for only 5 trials on Forrester as it is very slow. We report the average performance across all trials, which is defined as the value of $g$ (known in all cases) evaluated at the current Condorcet winner, $x_c$, considered to be the best by each algorithm at each one of the 200 steps taken until the end of the budget. Note that each algorithm chooses $x_c$ differently, which leads to different performance at step 0. Therefore, we present plots of the number of iterations versus $g(x_c)$.
We compare 6 methods. Firstly, the three variants within the pbo framework: pbo with pure exploration (pbo-pe, see Section 3.1), pbo with the Copeland Expected Improvement (pbo-cei, see Section 3.2) and pbo with dueling-Thompson sampling (pbo-dts, see Section 3.3). We also compare against a random policy where duels are drawn from a uniform distribution (random; here $x_c$ is chosen as the location that wins most frequently) and against the interactive Bayesian optimization (ibo) method of Brochu (2010). ibo selects duels by using an extension of Expected Improvement on a gp model that encodes the information of the preferential returns in the likelihood. Finally, we compared against all three cardinal bandit algorithms proposed by Ailon et al. (2014), namely Doubler, MultiSBM and Sparring. Ailon et al. (2014) observe that Sparring has the best performance, and it also outperforms the other two bandit-based algorithms in our experiments. Therefore, we only report the performance of Sparring here, to keep the plots clean. In a nutshell, Sparring considers two bandit players (agents), one for each element of the duel, which use the Upper Confidence Bound criterion and where the input grid acts as the set of arms. The pair of arms to query is decided according to the strategies and beliefs of each agent. Notice that, in this case, correlations in $\mathcal{X}$ are not captured.
Figure 5 shows the performance of the compared methods, which is consistent across the four plots: ibo performs poorly, due to the combination of the greedy nature of the acquisitions and the poor calibration of the model. The random policy converges to a sub-optimal result and the pbo approaches tend to be the superior ones. In particular, we observe that pbo-dts is consistently the best policy, and it is able to balance exploration and exploitation correctly in the duels space. In contrast, pbo-cei, which is only used in the first experiment due to its excessive computational cost, tends to over-exploit. pbo-pe obtains reasonable results but tends to work worse in larger dimensions, where it is harder to cover the space.
Regarding the bandit methods, they need a much larger number of evaluations to converge compared to methods that model correlations across the arms. They are also heavily affected by an increase in the dimension (number of arms). The results of the Sparring method are shown for the Forrester function but are omitted in the rest of the plots (the number of evaluations used is smaller than the number of arms and therefore no real learning can happen within the budget). However, in Figure 6 we show the comparison between Sparring and pbo-dts for a horizon in which the bandit method has almost converged. The gain obtained by modeling correlations among the duels is evident.
5 Conclusions
In this work we have explored a new framework, pbo, for optimizing black-box functions in which only preferential returns are available. The fundamental idea is to model comparisons of pairs of points with a Gaussian process, which leads to the definition of new policies for augmenting the available dataset. We have proposed three acquisitions for duels, pe, cei and dts, and explored their connections with existing policies in standard bo. Via simulation, we have demonstrated the superior performance of dts, both because it finds a good balance between exploration and exploitation in the duels space and because it is computationally tractable. In comparison with alternatives outside the pbo framework, such as ibo or other bandit methods, dts shows state-of-the-art performance.
We envisage some interesting future extensions of our current approach. The first one is to tackle the existing limitation on the dimension of the input space, which is doubled with respect to the original dimensionality of the problem in order to capture correlations among duels. Further theoretical analysis of the proposed acquisitions will also be carried out.
References

Ailon et al. [2014] Nir Ailon, Zohar Shay Karnin, and Thorsten Joachims. Reducing dueling bandits to cardinal bandits. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21–26 June 2014, pages 856–864, 2014.
Bochner et al. [1959] Salomon Bochner. Lectures on Fourier Integrals, with an author's supplement on Monotonic Functions, Stieltjes Integrals, and Harmonic Analysis (AM-42). Translated by Morris Tenenbaum and Harry Pollard. Princeton University Press, 1959. ISBN 9780691079943.
Briol et al. [2015] François-Xavier Briol, Chris J. Oates, Mark Girolami, and Michael A. Osborne. Frank-Wolfe Bayesian quadrature: Probabilistic integration with theoretical guarantees. In Proceedings of the 28th International Conference on Neural Information Processing Systems, NIPS’15, pages 1162–1170, Cambridge, MA, USA, 2015. MIT Press.
Brochu [2010] Eric Brochu. Interactive Bayesian Optimization: Learning Parameters for Graphics and Animation. PhD thesis, University of British Columbia, Vancouver, Canada, December 2010.
Brusilovsky et al. [2007] Peter Brusilovsky, Alfred Kobsa, and Wolfgang Nejdl, editors. The Adaptive Web: Methods and Strategies of Web Personalization. Springer-Verlag, Berlin, Heidelberg, 2007. ISBN 9783540720782.
Chu and Ghahramani [2005] Wei Chu and Zoubin Ghahramani. Preference learning with Gaussian processes. In Proceedings of the 22nd International Conference on Machine Learning, ICML ’05, pages 137–144, New York, NY, USA, 2005. ACM. ISBN 1595931805.
Dudík et al. [2015] Miroslav Dudík, Katja Hofmann, Robert E. Schapire, Aleksandrs Slivkins, and Masrour Zoghi. Contextual dueling bandits. In Proceedings of The 28th Conference on Learning Theory, COLT 2015, Paris, France, July 3–6, 2015, pages 563–587, 2015.
González et al. [2016a] Javier González, Zhenwen Dai, Philipp Hennig, and Neil D. Lawrence. Batch Bayesian optimization via local penalization. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS 2016), volume 51 of JMLR Workshop and Conference Proceedings, pages 648–657, 2016a.
González et al. [2016b] Javier González, Michael A. Osborne, and Neil D. Lawrence. GLASSES: relieving the myopia of Bayesian optimisation. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, AISTATS 2016, Cadiz, Spain, May 9–11, 2016, pages 790–799, 2016b.
Hennig and Schuler [2012] Philipp Hennig and Christian J. Schuler. Entropy search for information-efficient global optimization. Journal of Machine Learning Research, 13, 2012.
Hernández-Lobato et al. [2014] José Miguel Hernández-Lobato, Matthew W. Hoffman, and Zoubin Ghahramani. Predictive entropy search for efficient global optimization of black-box functions. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 918–926. Curran Associates, Inc., 2014.
Jamieson et al. [2015] Kevin G. Jamieson, Sumeet Katariya, Atul Deshpande, and Robert D. Nowak. Sparse dueling bandits. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2015, San Diego, California, USA, May 9–12, 2015, 2015.
Jones [2001] Donald R. Jones. A taxonomy of global optimization methods based on response surfaces. Journal of Global Optimization, 21(4):345–383, 2001.
Kahneman and Tversky [1979] Daniel Kahneman and Amos Tversky. Prospect theory: An analysis of decision under risk. Econometrica, 47(2):263–91, 1979.
Mockus [1977] Jonas Mockus. On Bayesian methods for seeking the extremum and their application. In IFIP Congress, pages 195–200, 1977.
Rahimi and Recht [2008] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 1177–1184. Curran Associates, Inc., 2008.
Rasmussen and Williams [2005] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2005.
Snoek et al. [2012] Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems 25, pages 2951–2959, 2012.
Szörényi et al. [2015] Balázs Szörényi, Róbert Busa-Fekete, Adil Paul, and Eyke Hüllermeier. Online rank elicitation for Plackett-Luce: A dueling bandits approach. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7–12, 2015, Montreal, Quebec, Canada, pages 604–612, 2015.
Wu et al. [2016] Huasen Wu, Xin Liu, and R. Srikant. Double Thompson sampling for dueling bandits. CoRR, abs/1604.07101, 2016.
Yue and Joachims [2011] Yisong Yue and Thorsten Joachims. Beat the mean bandit. In ICML, pages 241–248, 2011.
Yue et al. [2012] Yisong Yue, Josef Broder, Robert Kleinberg, and Thorsten Joachims. The K-armed dueling bandits problem. Journal of Computer and System Sciences, 78(5):1538–1556, 2012. ISSN 0022-0000. JCSS Special Issue: Cloud Computing 2011.
Zoghi et al. [2014] Masrour Zoghi, Shimon Whiteson, Remi Munos, and Maarten de Rijke. Relative upper confidence bound for the K-armed dueling bandit problem. In ICML 2014: Proceedings of the Thirty-First International Conference on Machine Learning, pages 10–18, June 2014.
Zoghi et al. [2015a] Masrour Zoghi, Zohar S. Karnin, Shimon Whiteson, and Maarten de Rijke. Copeland dueling bandits. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 307–315. Curran Associates, Inc., 2015a.
Zoghi et al. [2015b] Masrour Zoghi, Zohar S. Karnin, Shimon Whiteson, and Maarten de Rijke. Copeland dueling bandits. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7–12, 2015, Montreal, Quebec, Canada, pages 307–315, 2015b.