Making an optimal decision is the process of selecting the values of certain variables so as to produce the “best” result. When using mathematical programming to solve this problem, “best” means that the chosen decision minimizes a certain cost function or, equivalently, maximizes a certain utility function. However, in many problems an objective function is not quantifiable, either because it is of a qualitative nature or because it involves several goals. Moreover, sometimes the “goodness” of a certain combination of decision variables can only be assessed by a human decision maker.
This situation arises in many practical cases. When calibrating the parameters of a deep neural network whose goal is to generate a synthetic painting or artificial music, artistic “beauty” is hardly captured by a numerical function, and a human decision maker is required to assess whether a certain combination of parameters produces “beautiful” results. For example, the authors of propose a tool to help digital artists calibrate the parameters of an image generator so that the synthetic image “resembles” a given one. Another example arises in industrial automation when calibrating the tuning knobs of a control system: based on engineering insight and rules of thumb, the task is usually carried out manually by trying a series of combinations until the calibrator is satisfied with the observed closed-loop performance. Another frequent situation in which it is hard to formulate an objective function is multi-objective optimization. Here, selecting a priori the correct weighted sum of the objectives to minimize in order to choose an optimal decision vector can be very difficult, and it is often a human operator who needs to assess whether a certain Pareto optimal solution is better than another one, based on his or her (sometimes unquantifiable) feelings.
It is well known in neuroscience that humans are better at choosing between two options (“this is better than that”) than among multiple ones [5, 6]. In consumer psychology, the “choice overload” effect shows that a human, when presented with an abundance of options, has more difficulty making a decision than when only a few options are given. On the other hand, having a large number of possibilities to choose from creates very positive feelings in the decision maker. In economics, the difficulty of behaving rationally when choosing the best option was also recognized in , due to the complexity of the decision problem exceeding the cognitive resources of the decision maker. Indeed, choosing the best option implies ranking choices by absolute value and therefore quantifying a clear objective function to optimize, a process that might be difficult due to the complexity and fuzziness of the many criteria involved. For the above reasons, the importance of focusing on discrete choices in psychology dates back at least to the 1920s.
Looking for a value of the decision variables that is “best”, in that the human operator always prefers it compared to all other tested combinations, may involve a lot of trial and error. For example, in parameter calibration the operator has to try many combinations before being satisfied with the winning one. The goal of active preference learning is to drive the trials by automatically proposing decision vectors to the operator for testing, so as to converge to the best choice within the fewest possible experiments.
In the derivative-free black-box global optimization literature there exist some methods for minimizing an objective function f that can also be used for preference-based learning. Since, given two decision vectors x1, x2, we can say that x2 is not “preferred” to x1 if f(x1) ≤ f(x2), finding a global optimizer x* of f can be reinterpreted as the problem of looking for the vector x* that is preferred to any other vector x. Therefore, optimization methods that only observe the outcome of the comparison f(x1) ≤ f(x2) (and not the values f(x1), f(x2) themselves, not even the difference f(x1) − f(x2)) can be used for preference-based optimization. For example, particle swarm optimization (PSO) algorithms [18, 32] drive the evolution of the particles only based on the outcome of comparisons between function values and could in principle be used for preference-based optimization. However, although very effective in solving many complex global optimization problems, PSO is not conceived for keeping the number of evaluated preferences small, as it relies on randomness (of changing magnitude) in moving the particles, and would therefore be especially inadequate in solving problems where a human decision maker is involved in the loop to express preferences.
Different methods have been proposed in the global optimization literature for finding a global minimum of functions that are expensive to evaluate. Some of the most successful ones rely on computing a simpler-to-evaluate surrogate of the objective function and using it to drive the search for new candidate optimizers to sample. The surrogate is refined iteratively as new values of the actual objective function are collected at those points. Rather than minimizing the surrogate, which may easily lead to missing the global optimum of the actual objective function, an acquisition function is minimized instead to generate new candidates. The latter function consists of a combination of the surrogate and of an extra term that promotes exploring areas of the decision space that have not yet been sampled.
Bayesian Optimization (BO) is a very popular method exploiting surrogates to globally optimize functions that are expensive to evaluate. In BO, the surrogate of the underlying objective function is modeled as a Gaussian process (GP), so that model uncertainty can be characterized using probability theory and used to drive the search. BO is used in several methods, such as Kriging, Design and Analysis of Computer Experiments (DACE), and the Efficient Global Optimization (EGO) algorithm, and is nowadays heavily used in machine learning for hyper-parameter tuning.
Bayesian optimization has also been proposed for minimizing (unknown) black-box functions based only on preferences [4, 10, 1]. The surrogate function describing the observed set of preferences is described in terms of a GP, using a probit model for the observed pairwise preferences and the Laplace approximation of the posterior distribution of the latent function to minimize. The GP provides a probabilistic prediction of the preference that is used to define an acquisition function (like expected improvement), which is maximized in order to select the next query point. The acquisition function used in preferential Bayesian optimization automatically balances exploration (selecting queries with high uncertainty on the preference) and exploitation (selecting queries that are expected to lead to improvements in the objective function).
In this paper we propose a new approach to active preference learning optimization that models the surrogate by using general radial basis functions (RBFs) rather than a GP, and inverse distance weighting (IDW) functions for exploration of the space of decision variables. A related approach was recently proposed by one of the authors in  for global optimization of known, but difficult to evaluate, functions. Here, we instead use RBFs to construct a surrogate function that only needs to satisfy, if possible, the preferences already expressed by the decision maker at the sampled points. The training dataset of the surrogate function is actively augmented in an incremental way by the proposed algorithm according to two alternative criteria. The first criterion, similarly to , is based on a trade-off between minimizing the surrogate and maximizing the distance from existing samples using IDW functions. At each iteration, the RBF weights are computed by solving a linear or quadratic programming problem aiming at satisfying the available training set of pairwise preferences. The second criterion relies on a maximum-likelihood interpretation of the RBF weight-selection problem, which allows quantifying the probability that a new sample improves, according to the surrogate, on the current best one. Based on one of the above criteria, the proposed algorithm constructs an acquisition function that is very cheap to evaluate and is minimized to generate a new sample and query a new preference.
Compared to preferential Bayesian optimization (PBO), the proposed approach is computationally lighter, because computing the surrogate simply requires solving a convex quadratic or linear programming problem. In PBO, instead, one has to first compute the Laplace approximation of the posterior distribution of the preference function, which requires calculating (via a Newton-Raphson numerical optimization algorithm) the mode of the posterior distribution. Then, a system of linear equations, with size equal to the number of observations, has to be solved. Moreover, the IDW term used by our approach to promote exploration does not depend on the surrogate, which guarantees that the space of optimization variables is well explored even if the surrogate poorly approximates the underlying preference function. Finally, the performance of our method in approaching the optimizer within the allocated number of preference queries is similar to, and sometimes better than, that of preferential Bayesian optimization, as we will show on a set of benchmarks used in global optimization and on a multi-objective optimization problem.
The paper is organized as follows. In Section 2 we formulate the preference-based optimization problem we want to solve. Section 3 proposes the way to construct the surrogate function using linear or quadratic programming, and Section 4 the acquisition functions that are used for generating new samples. The active preference learning algorithm is stated in Section 5 and its possible application to solving multi-objective optimization problems in Section 6. Section 7 presents numerical results obtained in applying the preference learning algorithm to solving a set of benchmark global optimization problems, a multi-objective optimization problem, and to the optimal tuning of a cost-sensitive neural network classifier for object recognition from images. Finally, some conclusions are drawn in Section 8.
A MATLAB and a Python implementation of the proposed approach are available for download at http://cse.lab.imtlucca.it/~bemporad/idwgopt.
2 Problem statement
Let ℝⁿ be the space of decision variables. Given two possible decision vectors x1, x2 ∈ ℝⁿ, consider the preference function π : ℝⁿ × ℝⁿ → {−1, 0, 1} defined as

π(x1, x2) = −1 if x1 is “better” than x2, 0 if x1 is “as good as” x2, and 1 if x2 is “better” than x1,

where for all x1, x2, x3 ∈ ℝⁿ it holds π(x1, x1) = 0, π(x1, x2) = −π(x2, x1), and the transitive property

π(x1, x2) = −1 and π(x2, x3) = −1 imply π(x1, x3) = −1.
The objective of this paper is to solve the following constrained global optimization problem: find x* such that

π(x*, x) ≤ 0 for all x with ℓ ≤ x ≤ u, g(x) ≤ 0,

that is, to find the vector x* of decision variables that is “better” (or “no worse”) than any other vector x according to the preference function π.
Vectors ℓ, u ∈ ℝⁿ in (3) define lower and upper bounds on the decision vector, and g : ℝⁿ → ℝ^q imposes further constraints on x, such as g(x) ≤ 0, where q ≥ 0 and q = 0 when no inequality constraint is enforced. We assume that the condition g(x) ≤ 0 is easy to evaluate; for example, in the case of linear inequality constraints we have g(x) = Ax − b, with A ∈ ℝ^(q×n) and b ∈ ℝ^q. When formulating (3) we have excluded equality constraints, as they can be eliminated by reducing the number of optimization variables.
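As a concrete illustration, with box bounds and linear inequality constraints, checking the condition g(x) ≤ 0 reduces to a few vector operations; the following sketch (function name and tolerance are our own, assuming g(x) = Ax − b) shows why this feasibility test is "easy to evaluate":

```python
import numpy as np

def feasible(x, lb, ub, A=None, b=None, tol=1e-9):
    """Check the box constraints lb <= x <= ub and, if given,
    the linear inequality constraints A x <= b (i.e., g(x) <= 0)."""
    if np.any(x < lb - tol) or np.any(x > ub + tol):
        return False
    if A is not None and np.any(A @ x - b > tol):
        return False
    return True
```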
The problem of minimizing an objective function f : ℝⁿ → ℝ under constraints,

min f(x) subject to ℓ ≤ x ≤ u, g(x) ≤ 0,

can be written as in (3) by defining

π(x1, x2) = −1 if f(x1) < f(x2), 0 if f(x1) = f(x2), and 1 if f(x1) > f(x2).
In this paper we assume that we do not have a way to evaluate the objective function f. The only assumption we make is that, for each given pair of decision vectors x1, x2, only the value π(x1, x2) is observed. The rationale of our problem formulation is that one often encounters practical decision problems in which a function f is impossible to quantify, but it is nevertheless possible to express a preference, for example by a human operator, for any given pair x1, x2. The goal of the active preference learning algorithm proposed in this paper is to suggest iteratively a sequence of samples x1, …, xN to test and compare such that xN approaches x* as N grows.
In what follows we implicitly assume that a function f actually exists but is completely unknown, and we attempt to synthesize a surrogate f̂ of f such that the preference function π̂ associated with f̂ as in (6) coincides with π on the finite set of sampled pairs of decision vectors.
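As a concrete (toy) stand-in for the decision maker, the preference function induced by a hidden objective f as in (6) can be simulated as follows; the helper name is ours:

```python
def preference(f, x1, x2):
    """Preference induced by an objective f as in (6):
    -1 if x1 is 'better', +1 if x2 is 'better', 0 if equivalent."""
    f1, f2 = f(x1), f(x2)
    return -1 if f1 < f2 else (1 if f1 > f2 else 0)
```

In the algorithm itself only these ternary outcomes are observed; the values f(x1), f(x2) remain hidden.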
3 Surrogate function
Assume that we have generated N ≥ 2 samples X = {x1, …, xN} of the decision vector, with xi ≠ xj for all i ≠ j, i, j = 1, …, N, and have evaluated a preference vector

B = [b1 … bM]', with bh = π(x_i(h), x_j(h)),

where M is the number of expressed preferences, 1 ≤ M ≤ N(N − 1)/2, bh ∈ {−1, 0, 1}, i(h), j(h) ∈ {1, …, N}, i(h) ≠ j(h), h = 1, …, M.
In order to find a surrogate function f̂ such that

π̂(x_i(h), x_j(h)) = bh for all h = 1, …, M,

where π̂ is defined from f̂ as in (6), we consider a surrogate function f̂ : ℝⁿ → ℝ defined as the following radial basis function (RBF) interpolant [11, 23]

f̂(x) = Σ_{k=1,…,N} βk φ(ϵ d(x, xk)).

In (9), d is the Euclidean distance d(x1, x2) = ‖x1 − x2‖2, ϵ > 0 is a scalar hyper-parameter, φ : ℝ → ℝ is a radial basis function, and β = [β1 … βN]' is a vector of coefficients to be determined.
In accordance with (8), we impose the following preference conditions:

f̂(x_i(h)) ≤ f̂(x_j(h)) − σ + εh for all h such that bh = −1,
f̂(x_i(h)) ≥ f̂(x_j(h)) + σ − εh for all h such that bh = 1,
|f̂(x_i(h)) − f̂(x_j(h))| ≤ σ + εh for all h such that bh = 0,

where σ > 0 is a given tolerance and εh are slack variables, εh ≥ 0, h = 1, …, M.
Accordingly, the coefficient vector β is obtained by solving the following convex optimization problem

min over β, ε of Σ_{h=1,…,M} ch εh + (λ/2) Σ_{k=1,…,N} βk², subject to the preference conditions (11),

where ch are positive weights, for example ch = 1, h = 1, …, M. The scalar λ ≥ 0 is a regularization parameter. When λ > 0, problem (12) is a quadratic programming (QP) problem that, since xi ≠ xj for all i ≠ j, admits a unique solution. If λ = 0, problem (12) becomes a linear program (LP), whose solution may not be unique.
Note that the use of slack variables εh in (12) allows one to relax the constraints imposed by the specified preference vector B. Constraint infeasibility might be due to an inappropriate selection of the RBF and/or to outliers in the acquired components bh of vector B. The latter condition may easily happen when preferences are expressed by a human decision maker in an inconsistent way.
For a given set X of N samples, setting up (12) requires computing the N × N symmetric matrix Φ whose (i, j)-entry is

Φij = φ(ϵ d(xi, xj)),

with Φii = 1 for the inverse quadratic and Gaussian RBFs, while Φii = 0 for the thin plate spline RBF. Note that if a new sample xN+1 is collected, updating matrix Φ only requires computing φ(ϵ d(xi, xN+1)) for all i = 1, …, N.
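To make the construction concrete, here is a minimal sketch of fitting β by the LP version of (12) (λ = 0) with the inverse quadratic RBF, using scipy's `linprog`; names, defaults, and the toy setup are our own, not the authors' reference code:

```python
import numpy as np
from scipy.optimize import linprog

def rbf(eps_shape, d):
    # inverse quadratic RBF, one of the choices mentioned in the text
    return 1.0 / (1.0 + (eps_shape * d) ** 2)

def fit_surrogate(X, prefs, eps_shape=1.0, sigma=1e-2):
    """Fit RBF coefficients beta via the LP version of problem (12).

    X     : (N, n) array of sampled decision vectors
    prefs : list of (i, j, b) with b = pi(x_i, x_j) in {-1, 0, 1}
    """
    N, M = X.shape[0], len(prefs)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    Phi = rbf(eps_shape, D)              # symmetric matrix of Section 3

    A_ub, b_ub = [], []
    def add_row(coef_beta, h, rhs):      # coef_beta @ beta - eps_h <= rhs
        r = np.zeros(N + M)
        r[:N], r[N + h] = coef_beta, -1.0
        A_ub.append(r); b_ub.append(rhs)

    for h, (i, j, b) in enumerate(prefs):
        diff = Phi[i] - Phi[j]           # fhat(x_i) - fhat(x_j) = diff @ beta
        if b == -1:                      # fhat(x_i) <= fhat(x_j) - sigma + eps_h
            add_row(diff, h, -sigma)
        elif b == 1:                     # fhat(x_i) >= fhat(x_j) + sigma - eps_h
            add_row(-diff, h, -sigma)
        else:                            # |fhat(x_i) - fhat(x_j)| <= sigma + eps_h
            add_row(diff, h, sigma); add_row(-diff, h, sigma)

    c = np.concatenate([np.zeros(N), np.ones(M)])   # minimize sum of slacks
    res = linprog(c, A_ub=np.array(A_ub), b_ub=b_ub,
                  bounds=[(None, None)] * N + [(0, None)] * M)
    assert res.success
    beta = res.x[:N]
    fhat = lambda x: rbf(eps_shape, np.linalg.norm(X - x, axis=1)) @ beta
    return beta, fhat
```

On consistent preference data the optimal slacks are zero, so the fitted surrogate reproduces every expressed preference with margin σ.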
3.1 Self-calibration of RBF
Computing the surrogate f̂ requires choosing the hyper-parameter ϵ defining the shape of the RBF (Eq. (9)). This parameter can be tuned through K-fold cross-validation, by splitting the M available pairwise comparisons into K (nearly equally sized) disjoint subsets. To this end, let us define index sets I1, …, IK that partition {1, …, M}. For a given ϵ and for each k = 1, …, K, the preferences indexed by {1, …, M} \ Ik are used to fit the surrogate function f̂ by solving (12), while the performance of f̂ in predicting the comparisons indexed by Ik is quantified in terms of the number of correctly classified preferences, i.e., the number of indices h ∈ Ik such that π̂(x_i(h), x_j(h)) = bh, where π̂ is the preference function induced by f̂ as in (6). Since the hyper-parameter ϵ is scalar, a fine grid search can be used to find the value of ϵ maximizing the total number of correctly classified preferences.
Since in active preference learning the number M of observed pairwise preferences is usually small, we use K = M, namely leave-one-out cross-validation, to better exploit the available comparisons.
Let x*N be the best vector of decision variables in the finite set X = {x1, …, xN}, that is, π(x*N, xi) ≤ 0 for all i = 1, …, N.
Since in active preference learning one is mostly interested in correctly predicting the preference with respect to the best point x*N, the solution of problem (12) and the corresponding score are not computed for the indices h whose held-out comparison involves x*N; that is, the preferences involving x*N are only used for training and not for testing.
The K-fold cross-validation procedure for self-calibration requires formulating and solving problem (12) K times for each tested value of ϵ (M times in the case of leave-one-out cross-validation, or fewer when the comparisons involving x*N are only used for training). In order to reduce computations, self-calibration can be executed only at a subset of iterations.
4 Acquisition function
A naive approach would be to (i) fit the surrogate f̂, with β obtained by solving the LP (12), (ii) minimize f̂ to get a new sample xN+1, (iii) evaluate the corresponding preference, (iv) update the sets of samples and preferences, and iterate over N. Such a procedure may easily miss the global minimum of (3), a phenomenon that is well known in global optimization based on surrogate functions: purely minimizing the surrogate function may lead to convergence to a point that is not the global minimum of the original function [16, 2]. Therefore, exploitation of the surrogate function is not enough when looking for a new sample xN+1: an exploration objective must also be taken into account to probe other areas of the feasible space.
In the next paragraphs we propose two different acquisition functions that can be used to define the new sample xN+1 to be compared with the current best sample x*N.
4.1 Acquisition based on inverse distance weighting
Following the approach suggested in , we construct an exploration function using ideas from inverse distance weighting (IDW). Consider the IDW exploration function z : ℝⁿ → ℝ defined by

z(x) = 0 if x ∈ X, and z(x) = (2/π) tan⁻¹(1 / Σ_{i=1,…,N} wi(x)) otherwise,

where the IDW weight wi is defined by wi(x) = 1/d²(x, xi). Clearly z(x) = 0 for all x ∈ X, and z(x) > 0 in ℝⁿ \ X. The arc tangent function in (16) prevents z(x) from getting excessively large far away from all sampled points. Figure 1 shows the IDW exploration function z obtained from (16) for the example generated from (14).
Given an exploration parameter δ ≥ 0, the acquisition function a : ℝⁿ → ℝ is defined as

a(x) = f̂(x)/ΔF − δ z(x),

where ΔF = max_i f̂(xi) − min_i f̂(xi) is the range of the surrogate function on the samples in X, replaced by ΔF = 1 when the range is zero, so that a is always well defined. Clearly ΔF > 0 if at least one comparison gives bh ≠ 0. The scaling factor ΔF is used to simplify the choice of the exploration parameter δ.
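A minimal sketch of the IDW exploration term (16) and of the acquisition function (18), with the ΔF fallback described above (helper names are ours):

```python
import numpy as np

def idw_exploration(x, X):
    """IDW exploration term z(x) from (16): zero at sampled points,
    strictly positive elsewhere, kept bounded by the arc tangent."""
    d2 = np.sum((np.asarray(X) - x) ** 2, axis=1)
    if np.any(d2 == 0.0):
        return 0.0                       # x coincides with a sample: z(x) = 0
    return (2.0 / np.pi) * np.arctan(1.0 / np.sum(1.0 / d2))

def acquisition(x, X, fhat, delta):
    """a(x) = fhat(x)/dF - delta * z(x), with dF the range of the
    surrogate on the samples (replaced by 1 when it is zero)."""
    fvals = np.array([fhat(xi) for xi in X])
    dF = fvals.max() - fvals.min()
    if dF == 0.0:
        dF = 1.0
    return fhat(x) / dF - delta * idw_exploration(x, X)
```

Larger δ weighs the IDW term more, pushing new queries away from the already sampled points.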
The following lemma derives immediately from [2, Lemma 2]:

Lemma 1. Function z is differentiable everywhere on ℝⁿ.
As we will detail below, given a set X of N samples and a vector B of preferences defined by (7), the next sample xN+1 is defined by solving the global optimization problem

min a(x) subject to ℓ ≤ x ≤ u, g(x) ≤ 0.
Problem (20) can be solved very efficiently using various global optimization techniques, either derivative-free or, if g and the acquisition function a are also differentiable, derivative-based. In case some components of vector x are restricted to be integer, (20) can be solved by mixed-integer programming.
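Since high accuracy is not required, almost any global solver can be plugged in here. As a lightweight stand-in for PSO or DIRECT, a random multistart of a box-constrained local solver is often sufficient (a sketch, with names and defaults of our own choosing):

```python
import numpy as np
from scipy.optimize import minimize

def minimize_acquisition(a, lb, ub, n_starts=20, seed=0):
    """Approximate global minimization of a(x) over the box [lb, ub]
    by multistart L-BFGS-B from random initial points."""
    rng = np.random.default_rng(seed)
    best_x, best_v = None, np.inf
    for _ in range(n_starts):
        x0 = lb + rng.random(lb.size) * (ub - lb)
        res = minimize(a, x0, method="L-BFGS-B", bounds=list(zip(lb, ub)))
        if res.fun < best_v:
            best_x, best_v = res.x, res.fun
    return best_x
```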
4.2 Acquisition based on maximum likelihood of improvement
We show how the surrogate function f̂ derived by solving problem (12) can be seen as a maximum-likelihood estimate of an appropriate probabilistic model. The analyses described in the following are inspired by the probabilistic interpretation of support vector machines described in .
Let and let be the -dimensional vector obtained by collecting the terms , with , .
Let us rewrite the QP problem (12) without the slack variables as
where the loss terms replacing the slacks are piecewise linear convex functions of β, for all h = 1, …, M.
For a given hyper-parameter , let be the minimizer of problem (21) and let . Then vector is the minimizer of the following problem
Proof See Appendix.
In order to avoid heavy notation, we restrict the coefficients ch in (12) to be equal whenever the preference is the same, that is, ch depends only on bh, with the corresponding values being given positive weights.
Let us now focus on problem (23) and consider the joint p.d.f.
defined for and , and parametrized by , a strictly positive scalar , and a generic unit vector .
The distribution (24) consists of three terms. The first term is a normalization constant; we will show next that it does not depend on the parameters under the restriction introduced below. The second term depends on all the parameters and is related to the objective function minimized in (23). The last term ensures integrability of the distribution and that the normalization constant does not depend on the parameters, as discussed next. A possible choice for is .
The normalization constant is given by
where for the term is the integral defined as
The following theorem shows that the integral does not depend on the parameters, and so the normalization constant is also independent of them.
Let in (24) be . For any ,
Proof See Appendix.
Because of Theorem 2, from now on, when we restrict , we will drop the dependence on of and simply write .
Let us assume that the samples of the training sequence
are i.i.d. and generated from the joint distribution defined in (24). The negative log of the probability of the dataset given is
Thus, for fixed values of and , by Theorem 1 the minimizer of
is . In other words, for any fixed , the solution of the QP problem (12) can be reinterpreted as times the maximizer of the joint likelihood with respect to , , when .
It is interesting to note that the marginal p.d.f. derived from the probabilistic model (24) is equal to
and therefore the corresponding preference posterior probability is
The preference posterior probability given by (30) can now be used to explore the space of decision vectors, as we describe next.
A criterion to choose xN+1 is to maximize the preference posterior probability (30) of obtaining a “better” sample compared to the current “best” sample x*N, or equivalently of getting an improvement. This can be achieved by the following acquisition function
Different components xi of x may have different upper and lower bounds ℓi, ui. Rather than using weighted distances as in stochastic process model approaches such as Kriging methods [26, 15], we simply rescale the variables in optimization problem (3) to range in [−1, 1]. As described in , we first tighten the given range defined by ℓ, u by computing the bounding box of the feasible set and replacing the original bounds with it. The bounding box is obtained by solving the following optimization problems
where ei is the i-th column of the identity matrix, i = 1, …, n. Note that problem (32) is a linear programming (LP) problem in the case of linear inequality constraints g(x) = Ax − b. Then, we operate with new scaled variables s ∈ [−1, 1]ⁿ and replace the original preference learning problem (3) with
where the scaling mapping is defined as
where clearly , , and is the set
When is the polyhedron , (35) corresponds to defining the new polyhedron
and is the diagonal matrix whose diagonal elements are the components of .
Note that in case the preference function π is related to an underlying function f as in (6), applying scaling is equivalent to formulating the following scaled preference function
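The affine scaling between s ∈ [−1, 1]ⁿ and x ∈ [ℓ, u] can be sketched as follows (helper names are ours):

```python
import numpy as np

def make_scaling(lb, ub):
    """Return the map x(s) = x_mid + x_range * s and its inverse,
    so that s in [-1, 1]^n corresponds to x in [lb, ub]."""
    x_range = (ub - lb) / 2.0
    x_mid = (ub + lb) / 2.0
    to_x = lambda s: x_mid + x_range * s
    to_s = lambda x: (x - x_mid) / x_range
    return to_x, to_s
```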
5 Preference learning algorithm
When generating the samples, two situations may occur: (i) the comparison π(x1, x2) can be done even if x1 and/or x2 are infeasible; (ii) π(x1, x2) can only be evaluated if both x1 and x2 are feasible.
In the first case, the initial comparisons are still useful to define the surrogate function. In the second case, a possible approach is to generate a number of initial samples larger than required and discard the infeasible ones. An approach for performing this is suggested in [2, Algorithm 2].
Step 5(a)iv requires solving a global optimization problem. In this paper we use Particle Swarm Optimization (PSO) [18, 32] to solve problem (20). Alternative global optimization methods, such as DIRECT or other methods [12, 25], could be used to solve (20). Note that penalty functions can be used to take the inequality constraints (4) into account, for example by replacing (20) with
where in (39).
Algorithm 1 consists of two phases: initialization and active learning. During initialization, the next sample is simply retrieved from the initial set. Instead, in the active learning phase, the next sample xN+1 is obtained in Steps 5(a)i–5(a)iv by solving the optimization problem (20). Note that the construction of the acquisition function a is rather heuristic; therefore, finding global solutions of (20) to very high accuracy is not required.
When using the acquisition function (18), the exploration parameter δ promotes sampling areas of the feasible space that have not been explored yet. While setting δ very large makes Algorithm 1 explore the entire feasible region regardless of the results of the comparisons, setting δ = 0 can make Algorithm 1 rely only on the surrogate function and miss the global optimizer. Note that using the acquisition function (31) does not require specifying the hyper-parameter δ. On the other hand, the presence of the IDW function z in the acquisition (18) promotes an exploration that is independent of the surrogate, and therefore δ might be a useful tuning knob to have. Clearly, the acquisition function (31) can also be augmented by the term −δ z(x) as in (18) to recover such exploration flexibility.
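Putting the pieces together, the following compact sketch illustrates the overall active preference learning loop: the surrogate is fitted by the LP version of (12), the IDW acquisition (18) is minimized by plain random search, and a synthetic oracle plays the decision maker. All names, defaults, and simplifications are ours; this is not the authors' reference implementation:

```python
import numpy as np
from scipy.optimize import linprog

def pref(f, x1, x2):
    # synthetic decision maker: compares hidden objective values as in (6)
    f1, f2 = f(x1), f(x2)
    return -1 if f1 < f2 else (1 if f1 > f2 else 0)

def fit(X, prefs, eps_shape, sigma):
    # LP version of (12): inverse quadratic RBF, lambda = 0
    N, M = len(X), len(prefs)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    Phi = 1.0 / (1.0 + (eps_shape * D) ** 2)
    A_ub, b_ub = [], []
    for h, (i, j, b) in enumerate(prefs):
        diff = Phi[i] - Phi[j]
        rows = [(diff, -sigma)] if b == -1 else \
               [(-diff, -sigma)] if b == 1 else \
               [(diff, sigma), (-diff, sigma)]
        for coef, rhs in rows:
            r = np.zeros(N + M); r[:N] = coef; r[N + h] = -1.0
            A_ub.append(r); b_ub.append(rhs)
    res = linprog(np.concatenate([np.zeros(N), np.ones(M)]),
                  A_ub=np.array(A_ub), b_ub=b_ub,
                  bounds=[(None, None)] * N + [(0, None)] * M)
    return res.x[:N], Phi

def active_pref_learning(f, lb, ub, n_init=4, n_max=15,
                         delta=1.0, eps_shape=1.0, sigma=1e-2, seed=1):
    rng = np.random.default_rng(seed)
    X = lb + rng.random((n_init, lb.size)) * (ub - lb)   # initial samples
    prefs, best = [], 0
    for i in range(1, n_init):     # compare each initial sample to incumbent
        b = pref(f, X[i], X[best]); prefs.append((i, best, b))
        if b == -1: best = i
    while len(X) < n_max:          # active learning phase
        beta, Phi = fit(X, prefs, eps_shape, sigma)
        fX = Phi @ beta            # surrogate values at the samples
        dF = fX.max() - fX.min()
        if dF <= 0.0: dF = 1.0
        cand = lb + rng.random((2000, lb.size)) * (ub - lb)
        D = np.linalg.norm(cand[:, None, :] - X[None, :, :], axis=2)
        fhat = (1.0 / (1.0 + (eps_shape * D) ** 2)) @ beta
        z = (2 / np.pi) * np.arctan(1.0 / np.sum(1.0 / np.maximum(D ** 2, 1e-12), axis=1))
        k = np.argmin(fhat / dF - delta * z)    # acquisition (18)
        b = pref(f, cand[k], X[best])
        X = np.vstack([X, cand[k]]); prefs.append((len(X) - 1, best, b))
        if b == -1: best = len(X) - 1
    return X[best], X
```

Since every new sample is compared directly against the incumbent, the returned point is, by construction, never ranked worse than any other tested sample by the (synthetic) decision maker.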