In many practical situations a user is able to select the best options among a finite set of choices; however, they are unable to state explicitly the motivations for their choices. A notable example arises in industrial applications where the manufactured product has to satisfy several qualitative requirements that are known to trained staff but have never been expressed explicitly. In such cases, the definition of quantitative objectives would allow for an explicit multi-objective optimization, which would lead to better options. However, measuring the objectives quantitatively is often technically difficult and costly. In this context, we would like to improve the quality of the manufactured product by directly using the feedback provided by the user’s choices (“these products are better than those”), i.e. we would like to learn the “choice function” of the user and find the inputs that optimize this function. In this paper we propose a Bayesian framework to learn choice functions from a dataset of observed choices. Our framework learns a latent mapping of objectives consistent with the given choices; we are therefore also able to optimize them with a multi-objective Bayesian optimization algorithm.
The main contributions of this paper build on four topics: (1) Bayesian Optimisation (BO); (2) preferential BO; (3) multi-objective BO; (4) choice function learning. In this section we briefly review the state of the art of each topic.
2.1 Bayesian Optimisation (BO)
BO [jones1998efficient] aims to find the global maximum of an unknown function which is expensive to evaluate. For a scalar real-valued function on a domain , the goal is to find a global maximiser . BO formulates this as a sequential decision problem – a trade-off between learning about the underlying function (exploration) and capitalizing on this information in order to find the optimum (exploitation). BO relies on a probabilistic surrogate model, usually a Gaussian Process (GP) [rasmussen2006gaussian], to provide a posterior distribution over given a dataset of previous evaluations of . It then employs an acquisition function (e.g. Expected Improvement [jones1998efficient, mockus1978application], Upper Credible Bound [srinivas2009gaussian]) to select the next candidate option (solution) . While the true function is expensive-to-evaluate, the surrogate-based acquisition function is not, and it can thus be efficiently optimized to compute an optimal candidate to be evaluated on . This process is repeated sequentially until some stopping criterion is achieved.
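The loop just described — fit a GP surrogate, maximize an acquisition function, evaluate, repeat — can be sketched in a few lines. This is a generic illustration (RBF kernel, Expected Improvement, grid search on a 1D toy objective), not the implementation used in this paper:

```python
import numpy as np
from scipy.stats import norm

def rbf(A, B, ls=0.3):
    """Squared-exponential kernel with unit variance (illustrative choice)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls ** 2)

def gp_posterior(Xtr, ytr, Xte, jitter=1e-6):
    """GP posterior mean and standard deviation at test inputs Xte."""
    K = rbf(Xtr, Xtr) + jitter * np.eye(len(Xtr))
    Ks = rbf(Xte, Xtr)
    mu = Ks @ np.linalg.solve(K, ytr)
    v = np.linalg.solve(K, Ks.T)
    var = np.clip(1.0 - (Ks * v.T).sum(axis=1), 1e-12, None)
    return mu, np.sqrt(var)

def expected_improvement(mu, sd, best):
    """EI for maximization: E[max(f - best, 0)] under the GP posterior."""
    z = (mu - best) / sd
    return (mu - best) * norm.cdf(z) + sd * norm.pdf(z)

f = lambda x: -(x - 0.6) ** 2                 # stand-in "expensive" objective
X = np.array([[0.1], [0.9]]); y = f(X).ravel()
grid = np.linspace(0.0, 1.0, 200)[:, None]
for _ in range(10):                           # sequential design loop
    mu, sd = gp_posterior(X, y, grid)
    x_next = grid[[np.argmax(expected_improvement(mu, sd, y.max()))]]
    X = np.vstack([X, x_next]); y = np.append(y, f(x_next).ravel())
```

After a handful of iterations the best observed input concentrates near the true maximizer (0.6 in this toy example).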
2.2 Preferential Bayesian Optimisation (PBO)
In many applications, evaluating can be either too costly or not always possible. In these cases, the objective function may only be accessed via preference judgments, such as “this is better than that” between two candidate options, as in A/B tests or recommender systems (pairwise comparisons are usually called duels in the BO and bandits literature). In such situations, PBO [shahriari2015taking] can be used. This approach only requires the agent to compare the final outcomes of two different candidate options and indicate which they prefer; that is, the evaluation is binary: either “better than” or “worse than”.
In the PBO context, the state-of-the-art surrogate model is based on a method for preference learning developed in [ChuGhahramani_preference2005]. This method assumes that there is an unobservable latent function value associated with each training sample, and that the function values preserve the preference relations observed in the dataset, that is, whenever is “better than” . As in the BO setting, by starting with a GP prior on and using the likelihood defined in [ChuGhahramani_preference2005], we obtain a posterior distribution over given a dataset of preferences. This posterior distribution is not a GP, and several approximations [ChuGhahramani_preference2005, houlsby2011bayesian] have been proposed. In [houlsby2011bayesian], the authors showed that GP preference learning is equivalent to GP classification with a transformed kernel function. Using this reformulation, the authors easily derive two approximations for the posterior based on (i) the Laplace Approximation (LP) [mackay1996bayesian, williams1998bayesian]; (ii) Expectation Propagation (EP) [minka2001family]. The LP approximation was then used to develop a framework for PBO [shahriari2015taking], and a new acquisition function, inspired by Thompson sampling, was proposed in [gonzalez2017preferential]. More recently, [benavoli2020preferential]
showed that the posterior of GP preference learning is a Skew GP [Benavoli_etal2020, benavoli2021]. Based on this exact model, the authors derived a PBO framework which outperformed both LP- and EP-based PBO. Although in this work we focus on GPs as surrogate models, it is worth mentioning alternative approaches for PBO developed in [BDG08, pmlr-v32-zoghi14, sadigh2017active, bemporad2019active].
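For concreteness, the pairwise likelihood at the core of GP preference learning [ChuGhahramani_preference2005] can be written in a few lines; the noise level `sigma` below is an illustrative choice:

```python
import numpy as np
from scipy.stats import norm

def pref_loglik(f_i, f_j, sigma=0.1):
    """Log-probability that option i is preferred to option j under the
    Chu & Ghahramani (2005) model: latent utilities corrupted by Gaussian
    noise, so P(i better than j | f) = Phi((f_i - f_j) / (sqrt(2) * sigma))."""
    return norm.logcdf((f_i - f_j) / (np.sqrt(2.0) * sigma))

# a clearly larger latent value yields a preference probability close to 1
p = np.exp(pref_loglik(1.0, 0.0))
```

When the two latent values are equal, the model assigns probability 0.5 to either outcome, as expected.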
PBO was recently extended [siivola2020preferential] to the batch case by allowing agents to express preferences for a batch of options. However, as depicted in Figure 1, where an agent expresses preferences among 5 options, in batch PBO there can be only one batch winner. In fact, PBO assumes that two options are always comparable (more precisely, the underlying GP-based model implies a total order, so two options may also be equivalent). When PBO is applied to the multi-objective case, it is therefore assumed that the agent’s preferences are determined by a weighted combination of the objectives.
2.3 Multi-objective (MO) optimization
The goal of MO optimization is to identify the set of Pareto optimal options (solutions), i.e., those for which any improvement in one objective means deteriorating another. Without loss of generality, we assume the goal is to maximize all objectives. Let be a vector-valued objective function with , where is the number of objectives. We recall the notions of Pareto dominated options and the non-dominated set.
Definition 1 (Pareto dominance).
Consider a set of options . An option $\mathbf{x}$ is said to Pareto dominate another option $\mathbf{x}'$, denoted $\mathbf{x} \succ \mathbf{x}'$, if both the following conditions are true:
$f_i(\mathbf{x}) \geq f_i(\mathbf{x}')$ for all $i \in \{1,\dots,m\}$;
$\exists j \in \{1,\dots,m\}$ such that $f_j(\mathbf{x}) > f_j(\mathbf{x}')$.
Definition 2 (Non-dominated set).
Among a set of options , the non-dominated set consists of those options that are not dominated by any member of , i.e.
Given the set of options , MO aims to find the non-dominated set of options , called the Pareto set. The set of evaluations is called Pareto front.
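A direct (quadratic-time) computation of the non-dominated set in Definitions 1–2 can be sketched as follows; the example options are ours:

```python
import numpy as np

def non_dominated(F):
    """Indices of the Pareto non-dominated rows of F (rows = options,
    columns = objectives, all to be maximized), per Definitions 1-2."""
    n = F.shape[0]
    keep = []
    for i in range(n):
        dominated = any(
            np.all(F[j] >= F[i]) and np.any(F[j] > F[i])
            for j in range(n) if j != i
        )
        if not dominated:
            keep.append(i)
    return keep

F = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.2, 0.2]])
# the first three rows are mutually incomparable; the last is dominated
```

For large option sets, faster divide-and-conquer algorithms exist, but the quadratic scan makes the definition transparent.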
MO BO has so far been developed only for standard (non-preferential) BO, where the multiple objectives can be evaluated directly. Many approaches rely on scalarisation to transform the MO problem into a single-objective one, like ParEGO [knowles2006parego] and TS-TCH [paria2020flexible]
(which randomly scalarize the objectives and use Expected Improvement and Thompson Sampling, respectively). [keane2006statistical] derived an expected improvement criterion with respect to multiple objectives. [ponweiser2008multiobjective] proposed a hypervolume-based infill criterion, where the improvements are measured in terms of the hypervolume increase of the Pareto front. Other acquisition functions have been proposed in [emmerich2011hypervolume, picheny2015multiobjective, hernandez2016predictive, wada2019bayesian, belakaria2019max]. The most widely used acquisition function for MO BO is expected hypervolume improvement. In fact, maximizing the hypervolume has been shown to produce very accurate estimates [zitzler2003performance, couckuyt2014fast, hupkens2015faster, emmerich2016multicriteria, yang2017computing, yang2019multi] of the Pareto front.
2.4 Choice function
Individuals are often confronted with the situation of choosing between several options (alternatives). These alternatives can be goods that are going to be purchased, candidates in elections, food etc.
We model the options among which an agent has to choose as real-valued vectors, and identify sets of options with finite subsets of . Let denote the set of all such finite subsets of .
A choice function is a set-valued operator on sets of options. More precisely, it is a map such that, for any set of options , the corresponding value of is a subset of (see for instance [aleskerov2007utility]).
The interpretation of a choice function is as follows. For a given option set , the statement that an option is rejected from (that is, ) means that there is at least one option in that the agent strictly prefers over . The set of rejected options is denoted by and is equal to . Choice functions therefore represent non-binary choice models and are thus more general than preferences.
It is important to stress again that the statement implies there is at least one option that an agent strictly prefers over . However, the agent is not required to tell us which option(s) in they strictly prefer to . This makes choice functions a very easy-to-use tool to express choices. As depicted in Figure 2, the agent needs to tell us only the options they selected (in green) without providing any justification for their choices (we do not know which option in the green set dominates ).
By following this interpretation, the set can also be seen as the non-dominated set in the Pareto sense for some latent function. In other words, let us assume that there is a latent vector function , for some dimension , which embeds the options into a space . The choice set can then be represented through a Pareto set of non-dominated options. For example, in Fig. 2, . This approach was proposed in [pfannschmidt2020learning]
to learn choice functions. In particular, to learn the latent vector function, the authors devise a differentiable loss function based on a hinge loss. Furthermore, they add two additional terms to the loss function: (i) a regularization term; (ii) a multidimensional scaling (MDS) loss to ensure that options close to each other in the input space are also close in the embedding space. This loss function is then used to learn a (deep) multi-layer perceptron representing the embedding.
In this work, we devise a novel multi-objective PBO based on choice functions. We follow the interpretation of choice functions as set functions that select non-dominated sets for an unknown latent function. First, we derive a Bayesian framework to learn the function from a dataset of observed choices. This framework is based on a Gaussian Process prior on the unknown latent function vector (compared to the approach proposed in [pfannschmidt2020learning], the GP-based model is more sound, as no multidimensional scaling is necessary, and it is a generative model). We then build an acquisition function to select the best next options to evaluate. We compare this method against an oracle that knows the true values of the latent functions and show that, by working only with choice function evaluations, we converge to the same results.
3 Bayesian learning of Choice functions
In this work we consider options and, for , we model each latent function in the vector as an independent GP [rasmussen2006gaussian]:
Each GP is fully specified by its kernel function , which gives the covariance of the latent function between any two inputs. In all experiments in this paper, the GP kernel is Matern 3/2 [rasmussen2006gaussian].
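As a reference, a minimal implementation of the Matern 3/2 kernel used for the GP priors might look as follows (unit variance and lengthscale are illustrative defaults):

```python
import numpy as np

def matern32(X, Y, lengthscale=1.0, variance=1.0):
    """Matern 3/2 kernel (Rasmussen & Williams, 2006):
    k(r) = s^2 * (1 + sqrt(3) r / l) * exp(-sqrt(3) r / l),
    where r is the Euclidean distance between the inputs."""
    r = np.sqrt(((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1))
    s = np.sqrt(3.0) * r / lengthscale
    return variance * (1.0 + s) * np.exp(-s)
```

The resulting Gram matrix is symmetric positive semi-definite with unit diagonal, and the covariance decays with the distance between inputs.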
3.1 Likelihood for general Choice functions
Having defined the prior on , we can now focus on the likelihood. We propose a new likelihood to model the observed choices of the agent. Given a set of observed choices , we are interested in learning a Pareto-embedding coherent with this data in the sense that , where denotes the Pareto non-dominated options in .
Assume that and let be the subset of indices of the options in , let be equal to and let the vector of dimensions of the latent space. Based on Definition 1, the choice of the agent expressed via implies that:
These equations express the conditions in Definition 1. Condition (2) means that, for each rejected option, it is not true ( stands for logical negation) that all chosen options are worse than it, i.e., there is at least one chosen option which is better. Condition (3) means that, for each option in the choice set, there is no better option in the query set. This requires the latent function values of the options to be consistent with the relations implied by the choice function. Given , the likelihood function is one when (2)-(3) hold and zero otherwise.
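In the noise-free case, conditions (2)–(3) amount to a simple indicator that can be checked directly. The sketch below follows our reading of the two conditions; the names for the embedding `G` and the chosen indices are hypothetical:

```python
import numpy as np

def dominates(u, v):
    """Pareto dominance of latent vectors (Definition 1)."""
    return np.all(u >= v) and np.any(u > v)

def choice_consistent(G, chosen):
    """Noise-free likelihood indicator for one observed choice.
    G: (n, d) latent embedding of the n options in the query set;
    chosen: indices of the selected options. Returns True iff
    (2) every rejected option is dominated by some chosen option, and
    (3) no option in the set dominates a chosen option."""
    n = G.shape[0]
    rejected = [i for i in range(n) if i not in chosen]
    cond2 = all(any(dominates(G[c], G[r]) for c in chosen) for r in rejected)
    cond3 = all(not dominates(G[j], G[c])
                for c in chosen for j in range(n) if j != c)
    return cond2 and cond3
```

For instance, with latent values [[1, 0.3], [0.3, 1], [0.2, 0.2]], choosing the first two options is consistent, while choosing the first and the third is not (the third is dominated by the first).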
In practice, not all choices may be coherent. We can handle this case by assuming that
the latent vector function is corrupted by Gaussian noise with zero mean vector and covariance (we assume the noise variance is the same in each dimension, but this can easily be relaxed). We then require conditions (2) and (3) to hold only probabilistically. This leads to the following likelihood function for the pair :
where denotes the -th component of the vector and is the indicator function of the set . We now provide two results which allow us to simplify (4). We first compute the integral in the third product in (4).
All proofs are in the supplementary material. We now focus on the first integral in (4), which can be simplified as follows.
Note that eq. (6) is an expectation (with respect to ) of a product of Gaussian CDFs whose argument only depends on . We can thus write the above multidimensional integral as a sum of products of univariate integrals which can be computed efficiently, for instance by using a Gaussian quadrature rule.
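The Gaussian quadrature computation mentioned above can be sketched with Gauss–Hermite nodes; the parametrisation of the CDF arguments as a_k z + b_k is our illustrative choice:

```python
import numpy as np
from scipy.stats import norm

def expect_prod_cdf(a, b, n_nodes=32):
    """E_{z ~ N(0,1)}[ prod_k Phi(a_k * z + b_k) ] via Gauss-Hermite
    quadrature: with physicists' nodes x_i and weights w_i,
    E[g(z)] ~= (1/sqrt(pi)) * sum_i w_i * g(sqrt(2) * x_i)."""
    x, w = np.polynomial.hermite.hermgauss(n_nodes)
    z = np.sqrt(2.0) * x
    vals = np.prod(norm.cdf(np.outer(z, a) + b), axis=1)
    return (w * vals).sum() / np.sqrt(np.pi)
```

A quick sanity check: for a single factor the identity E[Phi(a z + b)] = Phi(b / sqrt(1 + a^2)) holds, and the quadrature reproduces it to high accuracy.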
Therefore, the likelihood of the choices given the latent vector function can then be written as follows.
The likelihood is
3.1.1 Likelihood for batch preference learning
In case (the latent dimension is one), we have that . This means the agent always selects a single best option. In this case, the likelihood (8) simplifies to
The above likelihood is equal to the batch likelihood derived in [siivola2020preferential, Eq.3] and reduces to the likelihood derived in [ChuGhahramani_preference2005] when (that is the batch 2 case, i.e., ). This shows that the likelihood in (8) encompasses batch preference-based models.
3.2 Posterior approximation
The posterior probability of the latent vector function is
and the probability of the evidence is the normalising constant. The posterior is intractable because it is neither Gaussian nor Skew Gaussian Process (SkewGP) distributed. In this paper we propose an approximation scheme for the posterior similar to the one proposed in [benavoli2020preferential, benavoli2021]. In [benavoli2020preferential], an analytical formulation of the posterior is available, the marginal likelihood is approximated with a lower bound, and inferences are computed with an efficient rejection-free slice sampler [gessner2019integrals]. In [benavoli2020preferential, benavoli2021]
this approximation scheme showed better performance in active learning and BO tasks than LP and EP. Here we do not have an analytical formulation of the posterior; therefore we use a variational (ADVI) approximation [kucukelbir2015automatic]
of the posterior to learn the hyperparameters of the kernel and then, for fixed hyperparameters, we compute the posterior of via elliptical slice sampling (ESS) [pmlrv9murray10a]. (We implemented our model in PyMC3 [salvatier2016probabilistic], which provides implementations of ADVI and ESS. Details about the number of iterations and tuning are reported in the supplementary material.)
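A single transition of elliptical slice sampling [pmlrv9murray10a] is short enough to sketch; this is a generic implementation for a zero-mean GP prior, not the PyMC3 code we actually use:

```python
import numpy as np

def ess_step(f, log_lik, L, rng):
    """One elliptical slice sampling transition (Murray et al., 2010)
    for a target proportional to N(f; 0, Sigma) * exp(log_lik(f)),
    where L is the Cholesky factor of Sigma."""
    nu = L @ rng.standard_normal(f.shape)        # auxiliary prior draw
    log_u = log_lik(f) + np.log(rng.uniform())   # slice threshold
    theta = rng.uniform(0.0, 2.0 * np.pi)        # initial angle
    lo, hi = theta - 2.0 * np.pi, theta
    while True:                                  # shrink bracket until accepted
        fp = f * np.cos(theta) + nu * np.sin(theta)
        if log_lik(fp) > log_u:
            return fp
        if theta < 0.0:
            lo = theta
        else:
            hi = theta
        theta = rng.uniform(lo, hi)

# demo: with a flat likelihood the chain samples the prior N(0, 1)
rng = np.random.default_rng(0)
L = np.array([[1.0]])
f = np.zeros(1)
draws = []
for _ in range(4000):
    f = ess_step(f, lambda v: 0.0, L, rng)
    draws.append(f[0])
```

ESS is rejection-free and requires no step-size tuning, which is why it is a natural fit for GP-based latent models.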
3.3 Prediction and Inferences
Let be a set including test points and . The conditional predictive distribution is Gaussian and, therefore,
can be easily approximated as a sum using the samples from . In choice function learning, we are interested in the inference:
which returns the posterior probability that the agent chooses the options from the set of options . Given a finite set (that is ), we can use (12) to compute the probability that a subset of is the set of non-dominated options. This provides an estimate of the Pareto-set based on the learned GP-based latent model.
4 Latent dimension selection
In the previous section, we provided a Bayesian model for learning a choice function. The model is conditional on the pre-defined latent dimension (that is, the dimension of the vector of latent functions ). Although it is sometimes reasonable to assume that the number of criteria defining the choice function is known (and so the dimension), it is crucial to develop a statistical method to select it. We propose a forward selection method: we start by learning the model with a small latent dimension and increase the dimension in a stepwise manner until some model selection criterion is optimised. Criteria like AIC and BIC are inappropriate for the proposed GP-based choice function model, since its nonparametric nature implies that the number of parameters also increases with the size of the data. We propose instead the Pareto Smoothed Importance sampling Leave-One-Out cross-validation (PSIS-LOO) [vehtari2017practical]. Exact cross-validation requires re-fitting the model with different training sets; PSIS-LOO, instead, can be computed easily using the samples from the posterior.
We define the Bayesian LOO estimate of out-of-sample predictive fit for the model in (10):
where , ,
As derived in [gelfand1992model], we can evaluate (14) using the samples from the full posterior, that is, for (we compute these samples using elliptical slice sampling). We first define the importance weights:
and then approximate (14) as:
It can be noticed that (15) is a function of the posterior samples only, and can therefore easily be computed. Unfortunately, a direct use of (15) induces instability because the importance weights can have high variance. To address this issue, [vehtari2017practical] apply a simple smoothing procedure to the importance weights using a Pareto distribution (see [vehtari2017practical] for details). In Section 6.3, we show that the proposed PSIS-LOO-based forward procedure works well in practice.
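Ignoring the Pareto smoothing step, the raw importance-sampling LOO estimate in (14)–(15) can be sketched as follows (the array layout is our assumption):

```python
import numpy as np
from scipy.special import logsumexp

def loo_elpd(loglik):
    """Importance-sampling estimate of the Bayesian LOO predictive fit.
    loglik[s, i] = log p(y_i | f^s) for posterior sample s and datum i.
    The raw weights are w_i^s = 1 / p(y_i | f^s); PSIS additionally
    smooths the largest weights with a fitted Pareto tail (omitted here)."""
    S = loglik.shape[0]
    # log p(y_i | y_-i) ~= -log( (1/S) * sum_s exp(-loglik[s, i]) )
    log_p_loo = -(logsumexp(-loglik, axis=0) - np.log(S))
    return log_p_loo.sum()
```

The raw weights are exactly what [vehtari2017practical] stabilise via Pareto smoothing; without smoothing the estimate can be dominated by a few extreme weights.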
5 Choice-based Bayesian Optimisation
In the previous sections, we have introduced a GP-based model to learn latent choice functions from choice data. We now focus on the acquisition component of Bayesian optimization. In choice-based BO, we never observe the actual values of the objective functions. The data is , where is the set of training inputs (options), is a subset of and is the choice set for the given options . We denote the Pareto set estimated using the GP-based model by . In choice-based BO, the objective is to seek a new input point . Since can only be queried via a choice function, this is obtained by optimizing an acquisition function , where is the current (estimated) Pareto set. We define the acquisition function with the aim of finding a point that dominates the points in . That is, given the set of options , we aim to find such that .
The acquisition function must also consider the trade-off between exploration and exploitation. Therefore, we propose an acquisition function which is equal to the % (in the experiments we use ) Upper Credible Bound (UCB) of with , and .
Note that the requirement underlying our acquisition function is strong. We could also define it with different objectives in mind; for example, we could seek a point which allows us to reject at least one option in . We opted for the UCB of the dominance probability because it leads to a fast-to-evaluate acquisition function: we only need to compute one probability for each new function evaluation. In future work we will study alternative approaches and the trade-off between more costly acquisition function evaluations and faster convergence.
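As an illustration, the dominance probability underlying the acquisition function can be estimated by simple Monte Carlo over posterior samples of the latent utilities; the array shapes and names below are our assumptions:

```python
import numpy as np

def prob_dominates(samples_x, samples_P):
    """Monte Carlo estimate of P(x Pareto-dominates every option in the
    current Pareto set), from joint posterior samples of the latent
    utilities. samples_x: (S, d) latent values of the candidate x;
    samples_P: (S, n, d) latent values of the n Pareto-set options."""
    ge_all = np.all(samples_x[:, None, :] >= samples_P, axis=2)  # (S, n)
    gt_any = np.any(samples_x[:, None, :] > samples_P, axis=2)   # (S, n)
    return np.mean(np.all(ge_all & gt_any, axis=1))
```

An upper credible bound of this probability can then be obtained from the sampling variability of the estimate, favouring candidates whose dominance is uncertain but plausible.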
After computing the maximum of the acquisition function, denoted by , we should, consistently with the definition of the acquisition function, query the agent to express their choice among the set of options . However, can be a very large set, and human cognitive abilities cannot efficiently compare more than five options. Therefore, using the GP-based latent model, we select the four options in which have the highest probability of being dominated by , and query the agent on a five-option set (details about the procedure used to select these four options are reported in the supplementary material).
6 Numerical experiments
First, we assume the latent dimension is known and evaluate the performance of our algorithm on (1) the task of learning choice functions; (2) the use of choice functions in multi-objective BO. Second, we evaluate the latent dimension selection procedure discussed in Section 4 on simulated and real datasets.
6.1 Choice functions learning
We develop a simple toy experiment as a controlled setting for our initial assessment. We consider the two-dimensional vector function with .
We use to define a choice function. For instance, consider the set of options , given that
we have that and . In fact, one can notice that dominates on both the objectives, and and are incomparable. We sample inputs at random in and, using the above approach, we generate
random subsets of the 200 points each one of size (respectively ) and computed the corresponding choice pairs based on ;
random subsets each one of size (respectively ) and computed the corresponding choice pairs based on ;
for a total of four different datasets. Fixing the latent dimension , we then compute the posterior means and credible intervals of the latent functions learned using the model introduced in Section 3.2. The four posterior plots are shown in Figure 3. By comparing the 1st with the 3rd plot and the 2nd with the 4th plot, it can be noticed how the posterior means become more accurate (and the credible intervals smaller) as the dataset size increases (from N=50 to N=150 choice sets). By comparing the 1st with the 2nd plot and the 3rd with the 4th plot, it is evident that estimating the latent functions becomes harder as the size of the option sets increases. The reason is easy to understand: the rejected options are those dominated by (at least) one of the chosen options, but we do not know which one(s). This uncertainty grows with the number of rejected options, which makes the estimation problem more difficult.
Considering the same inputs generated at random in , we add Gaussian noise to with and generate two new training datasets with and (i) ; (ii) . We aim to compute the predictive accuracy of the GP-based model for inferring from the set of options on additional unseen pairs . We denote the models learned using the and datasets by Choice-GP100 and Choice-GP300, respectively. We compare their accuracy with that of an independent GP regression model which has direct access to : that is, we model each dimension of as an independent GP and compute the posterior of the components of by GP regression from the data pairs and , respectively. We refer to this model as Oracle-GP: “Oracle” because it can directly query . The accuracy is:
averaged over repetitions of the above data generation process. It can be noticed that, as the dataset size increases, the accuracy of Choice-GP gets closer to that of Oracle-GP. This confirms the soundness of the learning framework developed in this paper.
We now focus on four benchmark datasets for multi-output regression problems. Table 1 displays the characteristics of the considered datasets.
More details on the datasets are in the supplementary material. Using 5-fold cross-validation, we divide each dataset into training and testing pairs. The target values in the training set are used to generate choice-function-based pairs with and (i) ; (ii) . From the test dataset, we generated pairs. As before, we denote the models learned using the and datasets by Choice-GP1 and Choice-GP3, respectively, and compare their accuracy against that of Oracle-GP (learned on the training dataset by independent GP regression). The accuracy is:
As before, it can be noticed that, as the dataset size increases, the accuracy of Choice-GP gets closer to that of Oracle-GP.
6.2 Bayesian Optimisation
We have considered six standard multi-objective benchmark functions: Branin-Currin (, ), ZDT1 (, ), ZDT2 (, ), DTLZ1 (, ), Kursawe (, ) and Vehicle-Safety (, ) (the problem of determining the thickness of five reinforced components of a vehicle’s frontal frame [yang2005metamodeling], previously used as a benchmark in [daulton2020differentiable]). These are minimization problems, which we converted into maximizations
so that the acquisition function in Section 5 is well-defined. We compare the Choice-GP BO (with ) approach proposed in this paper against ParEGO (we use the BoTorch implementation [balandat2020botorch]).
For ParEGO, we assume the algorithm can query directly and, therefore, we refer to it as Oracle-ParEGO (the most recent MO BO approaches mentioned in Section 1 outperform ParEGO; we use ParEGO only as an oracle reference).
Conversely, Choice-GP BO can only query via choice functions. We select and use UCB as the acquisition function for Choice-GP BO. We also consider a quasi-random baseline, denoted ‘Sobol’, that selects candidates from a Sobol sequence. We evaluate optimization performance on the six benchmark problems in
terms of log-hypervolume difference, which is defined as the difference between the hypervolume of
the true Pareto front (this is known for the six benchmarks) and the hypervolume of the approximate Pareto front based on the observed data .
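For two objectives, the hypervolume entering the log-hypervolume difference can be computed by a standard sweep; this is a textbook sketch, not the implementation used in the experiments:

```python
import numpy as np

def hypervolume_2d(front, ref):
    """Hypervolume dominated by a 2-objective (maximization) Pareto front
    with respect to a reference point ref, computed by sorting the front
    and summing rectangular slices. Assumes every front point dominates
    ref; dominated front points simply contribute nothing."""
    F = front[np.argsort(-front[:, 0])]  # sort by first objective, descending
    hv, prev_y = 0.0, ref[1]
    for x, y in F:
        if y > prev_y:
            hv += (x - ref[0]) * (y - prev_y)
            prev_y = y
    return hv

front = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
hv = hypervolume_2d(front, ref=np.array([-1.0, -1.0]))
```

The log-hypervolume difference is then log(HV(true front) - HV(estimated front)); for more than two objectives, dedicated algorithms [couckuyt2014fast, yang2017computing] are needed.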
Each experiment starts with initial (randomly selected) input points which are used to initialise Oracle-ParEGO. We generate pairs of size by randomly selecting subsets of these points. These choices are used to initialise
Choice-GP BO. A total budget of iterations is run for both algorithms. Further, each experiment is repeated 15 times with different initializations. In these experiments we optimize the kernel hyperparameters
by maximising the marginal likelihood for Oracle-ParEGO and its variational approximation for Choice-GP.
Figure 4 reports the performance of the
three methods. Focusing on Branin-Currin, DTLZ1, Kursawe, and Vehicle-Safety, it can be noticed how Choice-GP BO converges to the performance of Oracle-ParEGO as the number of iterations increases. The convergence is clearly slower because Choice-GP BO uses qualitative data (choice functions) while Oracle-ParEGO uses quantitative data (it has direct access to ). However, the overall performance shows that the proposed approach is very effective. In ZDT1 and ZDT2, Choice-GP BO outperforms Oracle-ParEGO. The poor performance of Oracle-ParEGO is due to its acquisition function, which does not correctly balance exploration and exploitation in these two benchmarks. Instead, the UCB acquisition function of Choice-GP BO works well in all the benchmarks.
Computational complexity: the simulations were performed on a standard laptop. On average, the time needed to learn the surrogate model and optimise the acquisition function ranges from 30s () to 180s ().
6.3 Unknown latent dimension
We assume that is unknown and evaluate the latent dimension selection procedure proposed in Section 4. First, we consider the one-dimensional and so . We generate training datasets with and sizes and, respectively, and 10 test datasets with size . The following table reports the PSIS-LOO (averaged over the five repetitions) and the average accuracy on the test set for four Choice-GP models with latent dimension .
[Table: PSIS-LOO (training set) and accuracy (test set) for the four Choice-GP models.]
By using PSIS-LOO (computed on the training dataset) as the latent dimension selection criterion, we were able to correctly select the true latent dimension in all the repetitions (10 out of 10). The selected model also has the highest accuracy on the test set. We now focus on the two-dimensional vector function and consider three different sizes for the training dataset .
[Table: PSIS-LOO (training set) and accuracy (test set) for the three training-set sizes.]
For , has the highest PSIS-LOO in 4/10 cases, in 4/10 cases and in 2/10 cases. For , has the highest PSIS-LOO in 6/10 cases and in 4/10 cases. For , PSIS-LOO selects in 10/10 cases. Note that is never selected. Considering that the models are nested, this shows that the selection procedure works well even with small datasets, never selecting a latent dimension smaller than the actual one. Moreover, PSIS-LOO selects the correct dimension as the dataset size increases. In the supplementary material, we report a similar analysis for the datasets in Table 1, confirming that the procedure also works on real datasets.
We have developed a Bayesian method to learn choice functions from data and applied it to choice-function-based Bayesian Optimisation (BO). As future work, we plan to develop strategies to speed up the learning process by exploring more efficient ways to express the likelihood. We also intend to explore different acquisition functions for choice-function BO.