Projective Preferential Bayesian Optimization
Bayesian optimization is an effective method for finding extrema of a black-box function. We propose a new type of Bayesian optimization for learning user preferences in high-dimensional spaces. The central assumption is that the underlying objective function cannot be evaluated directly, but instead a minimizer along a projection can be queried, which we call a projective preferential query. The form of the query allows for feedback that is natural for a human to give, and which enables interaction. This is demonstrated in a user experiment in which the user feedback comes in the form of optimal position and orientation of a molecule adsorbing to a surface. We demonstrate that our framework is able to find a global minimum of a high-dimensional black-box function, which is an infeasible task for existing preferential Bayesian optimization frameworks that are based on pairwise comparisons.
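Let $f: \mathcal{X} \to \mathbb{R}$ be a black-box function defined on a hypercube $\mathcal{X} \subset \mathbb{R}^D$. Without loss of generality we assume that $\mathcal{X} = [0, 1]^D$. The objective is to find a global minimizer
$$x^\star = \arg\min_{x \in \mathcal{X}} f(x). \tag{1}$$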
We assume, as in Preferential Bayesian Optimization (PBO, PBO), that $f$ is not directly accessible. In PBO, queries to $f$ can be done in pairs of points $(x, x')$, and the binary feedback indicates which of the two points attains the lower value of $f$. In contrast, in our work we assume that queries to $f$ can be done over the projection onto a projection vector $\xi$. The feedback is the optimal scalar projection, that is, the value or coordinate along the projection in the direction $\xi$. We assume that there are zero coordinates in $\xi$, and these coordinates are set to fixed values described by a reference vector $x$. Formally, given a query $(\xi, x)$, the feedback is obtained as a minimizer over the possible scalar projections,
$$\alpha^\star = \arg\min_{\alpha \,:\, \alpha\xi + x \in \mathcal{X}} f(\alpha\xi + x). \tag{2}$$
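A projective preferential query can be simulated for a known test function by a one-dimensional search along the projection. The following is a minimal sketch under stated assumptions: the domain is the unit hypercube, the reference vector is zero wherever the projection vector is non-zero, and the oracle is a plain grid search. The names `projective_query`, `n_grid`, and `toy` are illustrative and do not come from the paper.

```python
import numpy as np

def projective_query(f, xi, x, n_grid=2001):
    """Return (alpha_star, point) minimizing f(alpha * xi + x) over feasible alpha.

    Assumptions (illustrative): the domain is [0, 1]^D and the reference
    vector x is zero in every coordinate where xi is non-zero.
    """
    xi = np.asarray(xi, dtype=float)
    x = np.asarray(x, dtype=float)
    active = xi != 0
    # alpha * xi_d must stay in [0, 1] for every active coordinate d.
    bounds = 1.0 / xi[active]
    lo = np.max(np.minimum(bounds, 0.0))
    hi = np.min(np.maximum(bounds, 0.0))
    alphas = np.linspace(lo, hi, n_grid)
    points = x[None, :] + alphas[:, None] * xi[None, :]
    values = np.array([f(p) for p in points])
    best = int(np.argmin(values))
    return float(alphas[best]), points[best]

# Toy objective with its minimum at (0.3, 0.3, 0.3).
toy = lambda p: float(np.sum((p - 0.3) ** 2))
alpha_star, point = projective_query(toy, xi=[0.0, 1.0, 0.0], x=[0.5, 0.0, 0.9])
print(alpha_star, point)
```

Here the query fixes the first and third coordinates at 0.5 and 0.9 and asks only for the best value of the second coordinate, which is the kind of one-dimensional answer a human could give with a slider.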
What are then natural use cases for such projective preferential queries? The main motivation comes from humans serving as the oracles. The form of the query enables efficient learning of the user preferences over choice sets in which each choice has multiple attributes, and in particular over continuous choice sets. An important application is knowledge elicitation from trained professionals (doctors, physicists, etc.). For example, we may learn a chemist’s preferences, that is, insight based on prior knowledge and experience, over molecular translations and orientations as a molecule adsorbs to a surface. In this case, a projective preferential query could correspond to finding an optimal rotation (see Figure 1), which the chemist can easily give by rotating the molecule in a visual interface.
Probabilistic preference learning is a relatively new topic in machine learning research but has a longer history in econometrics and psychometrics (mcfadden1981; mcfadden2001; Stern1990; Thurstone1927). A wide range of applications of these models exists, for instance in computer graphics (Brochu_2010), expert knowledge elicitation (AzariSoufiani_2013), revenue management systems of airlines (Carrier_2015), rating systems, and almost any application that involves modeling users' preferences. Perhaps the most studied probabilistic model is the Thurstone-Mosteller (TM) model, which describes a process of pairwise comparisons (Thurstone1927; Mosteller1951). In the preference learning context, models based on the TM model can be applied to learning preferences from pairwise comparison feedback (e.g. Chu_2005). An extension of this research into the interactive learning setting is studied by, among others, Brochu_2008 and PBO. All these approaches rely on pairwise feedback.
A drawback of most preference learning frameworks is their inability to handle high-dimensional input spaces. The underlying reason is a combinatorial explosion in the number of possible comparisons with respect to the dimension (given a fixed number of grid points per dimension). This means that a single pairwise comparison has low information content in high-dimensional spaces. This problem was mitigated by PBO by capturing the correlations among pairs of queries (duels). However, it is still extremely difficult to use this method in high-dimensional spaces, say in more than two dimensions (see Section 4). Furthermore, the numerical computations become infeasible in a high-dimensional setting, especially the optimization of an acquisition function.
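The combinatorial explosion can be made concrete with a small count. This is a sketch under the assumption stated in the text, a fixed number of grid points per dimension; the function name and the choice of 10 grid points are illustrative.

```python
from math import comb

def n_pairwise_comparisons(grid_points_per_dim: int, dim: int) -> int:
    """Number of distinct unordered pairs on a regular grid in `dim` dimensions."""
    n_points = grid_points_per_dim ** dim
    return comb(n_points, 2)

for dim in (2, 6, 10, 20):
    print(dim, n_pairwise_comparisons(10, dim))
```

With 10 grid points per dimension there are 4950 possible duels in two dimensions, but already about 5 x 10^11 in six dimensions, so any single duel covers a vanishing fraction of the comparisons.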
In this paper, we introduce a Bayesian framework, which we call Projective Preferential Bayesian Optimization (PPBO), that scales to high-dimensional input spaces. A main reason is that the information content of a projective preferential query is much higher than that of a pairwise preferential query: a projective preferential query is equivalent to infinitely many pairwise comparisons along a projection. An important consequence is that with projective preferential queries, the user's workload in answering the queries is considerably reduced. Furthermore, PPBO avoids numerical computations of high-dimensional integrals, since the feedback is one-dimensional regardless of the dimension of the input space.
The PPBO framework can also be seen as a global optimization algorithm. When the input space of a black-box function is high-dimensional, it may be easier to do coordinate descent to solve the one-dimensional optimization problem (2) and then apply that information to find (1) than to aim for (1) directly by optimizing over the full input space. A similar approach was taken by lineBO, who essentially reduced a high-dimensional optimization problem to Bayesian optimizations in one-dimensional subspaces in their LINEBO algorithm.
In this section we introduce a Bayesian framework capable of dealing with projective preferential data. A central idea is to model the user’s utility function, that is , as a Gaussian process as first proposed by Chu_2005. We extend this line of study to allow projective preferential queries, by deriving a tractable likelihood, proposing a method to approximate it, and introducing four acquisition criteria for enabling interactive learning in this setting.
In this paper, for convenience, we will formulate the method for maximization instead of minimization as in (2), without loss of generality.
Our probabilistic model of the user preferences is built upon Thurstone's law of comparative judgement (Thurstone1927). A straightforward way to formalize this would be to assume that pairwise comparisons are corrupted by Gaussian noise: $x \succ x'$ if and only if $f(x) + \varepsilon > f(x') + \varepsilon'$, where the latent function $f$ is a utility function that characterizes the user preferences described by the preference relation $\succ$. The standard assumption is that $\varepsilon$ and $\varepsilon'$ are independent and identically distributed Gaussians. Here, we deviate slightly from this assumption: given two alternatives $x, x' \in \mathcal{X}$, we assume that $x \succ x'$ if and only if $f(x) + W(x) > f(x') + W(x')$, where $W$ is a Gaussian white noise process with zero mean and autocorrelation $\mathbb{E}[W(x) W(x')] = \sigma^2$ if $x = x'$, and zero otherwise. The likelihood of an observation can be formally written as
The event is an uncountable intersection of pairwise comparisons, so care must be taken when interpreting this probability. Conditioning ongives
where we denote the event
By the conditional independence, the conditional probability can be written as a formal product
is the cumulative distribution of the standard normal distribution. By combining this with (3) and changing the order of product and integration, the likelihood becomes
where is a convolution operator, and is the density function of the standard normal distribution. We interpret this as a Volterra (product) integral
The joint log-likelihood of a dataset , denoted as , takes a form
First, we introduce notation. Assume that projective preferential queries have been performed and gathered into a dataset . For every data instance , we also consider a sequence of pseudo-observations . Technically, the pseudo-observations are Monte-Carlo samples needed for integrating the likelihood. The latent function values evaluated on those points are gathered into a vector,
The latent function vector over all points is formed by concatenating over .
The user's utility function is modelled as a Gaussian process (Williams_Rasmussen). A GP model is well suited to this objective, since it is flexible (non-parametric) and can conveniently handle uncertainty (predictive distributions can be derived analytically). In particular, it gives us insight into those regions of the space in which we are uncertain about the user preferences, either due to lack of data or because the user gives inconsistent feedback. A possible reason for the latter is that one of the preference axioms, transitivity or completeness, is violated in those regions. A weak preference relation $\succeq$ is complete if for all $x, x' \in \mathcal{X}$, either $x \succeq x'$ or $x' \succeq x$ holds. That is, a user is able to reveal their preferences over all possible pairwise comparisons. Similarly, $\succeq$ is transitive if for any $x, x', x'' \in \mathcal{X}$ the following holds: $x \succeq x'$ and $x' \succeq x''$ implies that $x \succeq x''$. That is, a user has consistent preferences.
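The two axioms can be checked mechanically on a finite choice set. The sketch below is only an illustration of the definitions: the helper names, the toy utility table, and the rock-paper-scissors relation are all hypothetical and not part of the paper.

```python
from itertools import product

def is_complete(items, weakly_prefers):
    """Every pair is comparable in at least one direction."""
    return all(weakly_prefers(a, b) or weakly_prefers(b, a)
               for a, b in product(items, repeat=2))

def is_transitive(items, weakly_prefers):
    """a >= b and b >= c must imply a >= c."""
    return all(weakly_prefers(a, c)
               for a, b, c in product(items, repeat=3)
               if weakly_prefers(a, b) and weakly_prefers(b, c))

# A relation induced by a utility function satisfies both axioms.
utility = {"a": 3.0, "b": 2.0, "c": 1.0}
from_utility = lambda p, q: utility[p] >= utility[q]

# A cyclic relation is complete but not transitive.
beats = {("rock", "scissors"), ("scissors", "paper"), ("paper", "rock")}
cyclic = lambda p, q: p == q or (p, q) in beats

print(is_complete(utility, from_utility), is_transitive(utility, from_utility))
print(is_complete(["rock", "paper", "scissors"], cyclic),
      is_transitive(["rock", "paper", "scissors"], cyclic))
```

A cyclic relation of this kind cannot be represented by any single utility function, which is why inconsistent feedback shows up as uncertainty in the GP model rather than as a change in its mean.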
Thus, we assume as Chu_2005, that the prior of the utility function follows a zero-mean Gaussian process,
where the -element of the covariance matrix is determined by a kernel as . Throughout the paper, we assume the squared exponential kernel , where the and
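For reference, a squared exponential kernel and the covariance matrix it induces can be written in a few lines. This is a sketch of the standard isotropic form; the parameter names `signal_var` and `lengthscale` and their default values are assumptions, since the paper's exact parametrization is not recoverable from the text above.

```python
import numpy as np

def squared_exponential(X1, X2, signal_var=1.0, lengthscale=0.2):
    """k(x, x') = signal_var * exp(-||x - x'||^2 / (2 * lengthscale^2))."""
    X1 = np.atleast_2d(np.asarray(X1, dtype=float))
    X2 = np.atleast_2d(np.asarray(X2, dtype=float))
    sq_dist = (np.sum(X1 ** 2, axis=1)[:, None]
               + np.sum(X2 ** 2, axis=1)[None, :]
               - 2.0 * X1 @ X2.T)
    # Clip tiny negative values caused by floating-point cancellation.
    return signal_var * np.exp(-0.5 * np.maximum(sq_dist, 0.0) / lengthscale ** 2)

rng = np.random.default_rng(0)
X = rng.uniform(size=(5, 3))      # five points in [0, 1]^3
K = squared_exponential(X, X)     # 5 x 5 prior covariance matrix
print(np.round(K, 3))
```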
For the sake of simplicity, we use the Laplace approximation for the posterior distribution. A maximum a posteriori (MAP) estimate is needed for that,
where we denote the functional (log-scaled posterior)
The convolution can be efficiently approximated by Gauss-Hermite quadrature. The outer integral is approximated as a Monte-Carlo integral
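Gauss-Hermite quadrature for a convolution of this kind can be sketched as follows. The example assumes the simplest case, the convolution of the standard normal cumulative distribution with the standard normal density, for which a closed form is known and can serve as a check; the exact arguments and scaling used in the paper's likelihood are not reproduced here, and the function names are illustrative.

```python
import math
import numpy as np

def std_normal_cdf(t):
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def conv_cdf_pdf(t, n_nodes=20):
    """Approximate (Phi * phi)(t) = int Phi(t - s) phi(s) ds by Gauss-Hermite."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    # Substituting s = sqrt(2) * u turns the Gaussian density into exp(-u^2),
    # the weight function of Gauss-Hermite quadrature.
    vals = [std_normal_cdf(t - math.sqrt(2.0) * u) for u in nodes]
    return float(np.dot(weights, vals) / math.sqrt(math.pi))

for t in (-1.0, 0.0, 0.7, 2.5):
    exact = std_normal_cdf(t / math.sqrt(2.0))   # closed form of the convolution
    print(t, conv_cdf_pdf(t), exact)
```

The closed form follows from the fact that the difference of two independent standard normal variables has variance 2, so a modest number of nodes already matches it to many digits.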
where the pseudo-observations for
are sampled from a suitable distribution. Our choice is to use a family of truncated generalized normal (TGN) distributions, since it provides a continuous transformation from the uniform distribution to the truncated normal distribution, such that the locations of distributions can be specified. The idea is to concentrate pseudo-observations more densely around the optimal value as the number of queries increases. For more details, see the Supplementary material.
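One simple way to draw from a truncated generalized normal distribution is rejection sampling from a uniform proposal. The sketch below assumes the common parametrization with density proportional to exp(-(|x - loc| / scale)^shape) on an interval; the paper's own parametrization and its schedule for sharpening the distribution are given in its supplementary material and may differ. The function and parameter names are illustrative.

```python
import numpy as np

def sample_tgn(rng, n, loc, scale, shape, low=0.0, high=1.0):
    """Rejection-sample n points from a truncated generalized normal on [low, high].

    Unnormalized density: exp(-(|x - loc| / scale) ** shape).
    shape = 2 gives a truncated normal; a large shape with a wide scale
    approaches the uniform distribution on the interval.
    """
    out = np.empty(0)
    while out.size < n:
        cand = rng.uniform(low, high, size=2 * n)
        accept_prob = np.exp(-(np.abs(cand - loc) / scale) ** shape)
        keep = rng.uniform(size=cand.size) < accept_prob
        out = np.concatenate([out, cand[keep]])
    return out[:n]

rng = np.random.default_rng(1)
near_uniform = sample_tgn(rng, 20000, loc=0.5, scale=1.0, shape=50.0)
concentrated = sample_tgn(rng, 5000, loc=0.3, scale=0.05 * np.sqrt(2.0), shape=2.0)
print(near_uniform.std(), concentrated.mean(), concentrated.std())
```

Rejection sampling becomes slow when the scale is very small relative to the interval; an inverse-CDF sampler would be preferable in that regime.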
For notational convenience, define
If the domain is normalized to , and the projections are normalized to , then . Hence, under this normalization, the functional can be approximated as
The MAP estimate can be efficiently solved by a second-order iterative optimization algorithm, since the gradient and the Hessian can be easily derived for .
The Laplace approximation of the posterior amounts to the second-order Taylor approximation of the log posterior around the MAP estimate. In the ordinary (non-log) scale, this reads
where the matrix H is the negative Hessian of the log-posterior at the MAP estimate, . (Footnote 1: We denote the partial derivatives matrix evaluated at the MAP estimate as )
Based on the well-known properties of the multivariate Gaussian distribution, the predictive distribution of $f$ is also Gaussian. Given test locations , consider the by matrix . The predictive mean and the predictive covariance at test locations are (for more details, see Williams_Rasmussen or Chu_2005)
where is the covariance matrix of the test locations.
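The predictive equations can be sketched generically. The code below is not a transcription of the paper's expressions, which are not recoverable from the text; it uses the standard Gaussian conditioning result for a posterior approximated by a Gaussian with mean `f_map` and covariance `Sigma` at the training inputs, where under the Laplace approximation `Sigma` would be the inverse of H. All argument names are illustrative.

```python
import numpy as np

def gp_predict(K, K_star, K_star_star, f_map, Sigma, jitter=1e-8):
    """Predictive mean and covariance at test locations.

    K:            (n, n) prior covariance at the training inputs
    K_star:       (n, m) prior cross-covariance between training and test inputs
    K_star_star:  (m, m) prior covariance at the test inputs
    f_map, Sigma: mean (n,) and covariance (n, n) of the Gaussian
                  approximation to the posterior of the latent values
    """
    n = K.shape[0]
    K_reg = K + jitter * np.eye(n)          # small jitter for numerical stability
    A = np.linalg.solve(K_reg, K_star)      # K^{-1} K_*
    mean = A.T @ np.asarray(f_map, dtype=float)
    cov = K_star_star - A.T @ (K_reg - Sigma) @ A
    return mean, cov

# Sanity check on three 1D training points with a squared exponential kernel.
X = np.array([0.0, 0.5, 1.0])
K = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2 / 0.2 ** 2)
f_map = np.array([1.0, -1.0, 0.5])
mean, cov = gp_predict(K, K, K, f_map, Sigma=np.zeros((3, 3)))
print(mean, np.diag(cov))
```

Two limiting cases make the formula easy to trust: a zero `Sigma` reproduces `f_map` exactly at the training inputs, and `Sigma` equal to the prior covariance returns the prior.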
In this section, we discuss how to select the next projective preferential query . We will choose the next query as a maximizer of an acquisition function, for instance, we will consider a modified version of the expected improvement acquisition function (Jones1998). When the oracle is a human, this allows us to learn the user preferences in an iterative loop, thus making the PPBO framework interactive. However, this interesting special case, where is a utility function of a human, also brings forth issues due to bounded rationality. We apply here the following narrow definition of this more general concept (see simon1990reason): Bounded rationality is the idea that the user gives feedback that reflects their preferences, but within the limits of the information available to them and their mental capabilities.
When the oracle is a human, it is important to realize that the optimal next query is not solely the one that optimally balances the exploration-exploitation trade-off, as is the case for a perfect oracle; it must also take into account human cognitive capabilities and limitations. For instance, the more non-zero coordinates there are in the projection vector, the greater the "cognitive burden" on a human user, and the harder it becomes to give useful feedback. Thus, when there is a human in the loop, the choice of query should take into account both the optimization needs and what types of queries are convenient for the user.
The projective preferential feedback (2) may not be single-valued or even well-defined for all queries if the oracle is a human. For instance, the user may not be able to explicate their preferences with respect to the $d$-th attribute, that is, the preferences do not satisfy the completeness axiom. Formally, this means that if the projection vector is $\xi = e_d$ (the $d$-th standard unit vector), then for some reference vector $x$ the feedback is multi-valued, a random variable, or not well-defined, depending on how we interpret the scenario in which the user should say "I do not know" but instead gives arbitrary feedback. Fortunately, this incompleteness can be easily handled when we use GPs to model $f$; it just implies that the posterior variance is high along the $d$-th dimension.
An even better solution would be to allow the answer "I do not know", and to design an acquisition function that is capable of discovering and avoiding those regions of the space where the user gives inconsistent feedback due to any source of bounded rationality. This challenge is left for future research. It is noteworthy that the acquisition function we introduce next performed well in the user experiment covered in Section 5.
We define the expected improvement by projective preferential query at the -iteration by
where denotes the highest value of the predictive posterior mean, and the expectation is conditioned on the data up to the -iteration. The maximum over models the anticipated feedback.
This can be evaluated as a Monte-Carlo integral (up to a multiplicative constant that does not depend on ),
where is approximated by using discrete Thompson sampling as described in (HenrandezLobato2014). The discrete Thompson sampling draws a finite sample from the GP posterior distribution, and then returns the maximum over the sample. The steps needed to approximate are summarized in Algorithm 1. (Footnote 2: Another alternative is to consider continuous Thompson sampling to draw a continuous sample path from the GP model, and then maximize this. The method is based on Bochner's theorem and the equivalence between a Bayesian linear model with random features and a Gaussian process. For more details see (HenrandezLobato2014).)
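The core of discrete Thompson sampling, one joint draw from a Gaussian on a finite grid followed by a maximum, can be sketched as below. This is an illustration of that single step only, not a reproduction of Algorithm 1; the function name, the toy posterior, and the jitter value are assumptions.

```python
import numpy as np

def discrete_thompson_max(rng, mean, cov, jitter=1e-10):
    """Draw one joint sample on a finite grid; return (max value, argmax index)."""
    mean = np.asarray(mean, dtype=float)
    # Cholesky factor of the (jittered) predictive covariance on the grid.
    L = np.linalg.cholesky(np.asarray(cov, dtype=float) + jitter * np.eye(mean.size))
    sample = mean + L @ rng.standard_normal(mean.size)
    idx = int(np.argmax(sample))
    return float(sample[idx]), idx

rng = np.random.default_rng(2)
grid = np.linspace(0.0, 1.0, 50)
post_mean = np.sin(2.0 * np.pi * grid)     # toy predictive mean on the grid
post_cov = 0.05 * np.exp(-0.5 * (grid[:, None] - grid[None, :]) ** 2 / 0.1 ** 2)
value, idx = discrete_thompson_max(rng, post_mean, post_cov, jitter=1e-8)
print(value, grid[idx])
```

The Cholesky factorization is the cubic-cost step in the number of grid points, consistent with the observation that computing and sampling from the predictive covariance dominates the running time.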
The bottlenecks are the first and the third steps. In the third step, a predictive covariance matrix of size needs to be computed, and then a sample from the multivariate normal distribution needs to be drawn. Hence, the time complexity of Algorithm 1 is , where the terms come from a matrix inversion (the first step), two matrix multiplications, and a Cholesky decomposition, respectively. Recall that and refer to the number of observations, pseudo-observations, Monte-Carlo samples, and grid points, respectively.
In the experiments we use pure exploration and exploitation as baselines. A natural interpretation of pure exploitation in our context is to select the next query such that , where is the posterior mean of the GP model at location , given all the data so far .
We interpret pure exploration as maximization of the GP variance on a query given the anticipated feedback. That is, the pure explorative acquisition strategy maximizes the following acquisition function
The fourth acquisition strategy corresponds to the interesting special case where $\xi = e_d$ (the $d$-th standard unit vector), and the coordinates are rotated in a cyclical order for each query. The reference vector $x$ can be chosen in several ways, but it is natural to consider an exploitative strategy in which $x$ is set to the maximizer of the posterior mean, except for the $d$-th coordinate, which is set to zero. We call this acquisition strategy Preferential Coordinate Descent (PCD), since PPBO with PCD acquisition is closely related to a coordinate descent algorithm that successively minimizes an objective function along coordinate directions. The PPBO framework with PCD acquisition (PPBO-PCD) differs from classical coordinate descent in two ways. First, PPBO-PCD assumes that direct function evaluations are not possible but that projective preferential queries are. Second, it models the black-box function (as a GP), whereas classical coordinate descent does not. This allows PPBO-PCD to take advantage of past queries from every one-dimensional optimization.
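For comparison, classical cyclic coordinate descent driven by a line-minimizing oracle can be sketched as follows. This is the classical algorithm, not PPBO-PCD: there is no GP model, and the reference point is simply the current iterate rather than a posterior-mean maximizer. The oracle is a grid search standing in for a projective preferential query on the unit hypercube; all names and the test objective are illustrative.

```python
import numpy as np

def line_minimizer(f, x, d, n_grid=401):
    """Stand-in for a projective preferential query along coordinate d."""
    grid = np.linspace(0.0, 1.0, n_grid)
    trial = np.tile(x, (n_grid, 1))
    trial[:, d] = grid
    return grid[int(np.argmin([f(p) for p in trial]))]

def cyclic_coordinate_descent(f, x0, n_queries):
    x = np.array(x0, dtype=float)
    for t in range(n_queries):
        d = t % x.size                 # rotate the coordinates in cyclic order
        x[d] = line_minimizer(f, x, d)
    return x

def objective(p):
    """Convex quadratic with a coupling term; minimum at (0.2, 0.7, 0.5)."""
    a, b, c = p[0] - 0.2, p[1] - 0.7, p[2] - 0.5
    return a * a + b * b + 0.5 * a * b + c * c

x_hat = cyclic_coordinate_descent(objective, x0=[0.9, 0.1, 0.1], n_queries=9)
print(x_hat)
```

Each query here discards everything learned from earlier line searches except the current iterate, which is the information a surrogate model would retain.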
Compared with the other acquisition strategies, PCD performs well in numerical experiments (when the objective is not a utility function but a numerical test function). This agrees with results in the optimization literature; for instance, if the objective is pseudoconvex with a continuous gradient, and the domain is compact and convex with a "nice boundary", then the coordinate descent algorithm converges to a global minimum (Spall2012, Corollary 3.1). However, PCD may not perform as well in high-dimensional spaces (say, higher than 20D), since it cannot query in between the dimensions. For instance, the expected improvement by projective preferential query outperformed PCD on a 20D test function (see Section 4), since it allows querying along arbitrary projections.
In this section we demonstrate the efficiency of the PPBO framework in high-dimensional spaces, and we experiment with various acquisition strategies in numerical experiments on simulated functions.
The goal is to find a global minimum of a black-box function by querying it either through (i) pairwise comparisons or (ii) projective preferential queries. For (i) we use the PBO method of PBO, which is state-of-the-art among Gaussian process preference learning frameworks that are based on pairwise comparisons. For (ii) we use the PPBO method as introduced in this paper. The four different acquisition strategies introduced in Section 3 are compared against the baseline that samples a random . For the PBO method, we consider a random acquisition strategy, because the optimization of non-trivial acquisition functions, such as Copeland Expected Improvement, is numerically infeasible in high-dimensional spaces. In total six different methods are compared: the expected improvement by projective preferential query (ppbo-ei), pure exploitation (ppbo-ext), pure exploration (ppbo-exr), preferential coordinate descent (ppbo-pcd), and random strategies for PBO (pbo-rand) and PPBO (ppbo-rand).
For $f$ we consider four different test functions: Six-hump-camel2D, Hartmann6D, Levy10D and Ackley20D (Footnote 3: https://www.sfu.ca/ssurjano/optimization.html). We add a small Gaussian error term to the test function outputs. There are as many initial queries as there are dimensions in a test function. The $i$-th initial query corresponds to $\xi = e_i$, that is, to the $i$-th coordinate projection, and the reference vector is uniformly random. We consider a total budget of 100 queries for Six-hump-camel2D, and 30 queries for the other test functions. The results are depicted in Figure 2. (Footnote 4: All experiments of each test function were run on a computing infrastructure of 2x20 Xeon Gold 6148 2.40GHz cores and 60GB RAM.)
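As a concrete reference for the lowest-dimensional benchmark, the six-hump camel function can be written out directly. The formula is the standard one from the test-function library cited above, usually evaluated on x1 in [-3, 3] and x2 in [-2, 2]; the noise wrapper and its standard deviation are assumptions, since the size of the added error term is not stated.

```python
import numpy as np

def six_hump_camel(x1, x2):
    """Six-hump camel function; two global minima with value about -1.0316."""
    return ((4.0 - 2.1 * x1 ** 2 + x1 ** 4 / 3.0) * x1 ** 2
            + x1 * x2
            + (-4.0 + 4.0 * x2 ** 2) * x2 ** 2)

def with_noise(f, rng, noise_std=0.01):
    """Wrap f so that every evaluation carries a small Gaussian error (assumed level)."""
    return lambda *args: f(*args) + rng.normal(0.0, noise_std)

print(six_hump_camel(0.0898, -0.7126), six_hump_camel(-0.0898, 0.7126))
noisy_f = with_noise(six_hump_camel, np.random.default_rng(3))
print(noisy_f(0.0898, -0.7126))
```

The two symmetric global minimizers are relevant when reading optimization traces on this function, because a method can move from the neighbourhood of one to the other during a run.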
ppbo-pcd obtained the best performance on three of the four test functions. On the high-dimensional test function Ackley20D, ppbo-ei performed best. The increase in ppbo-ei on Six-hump-camel2D at around iteration 55 is due to the optimization switching towards the second of the two minima of the function. Unsurprisingly, all PPBO variants clearly outperformed the PBO method with a random acquisition strategy (pbo-rand). Since the performance gap between ppbo-rand and pbo-rand is so large, we conclude that whenever a projective preferential query is possible, a PPBO type of approach should be preferred to an approach that is based on pairwise comparisons.
To illustrate the low information content of pairwise comparisons, we ran a test on the Six-hump-camel2D function. We trained a GP classifier with 2000 random queries (duels), and found a Condorcet winner (see PBO) by maximizing the soft-Copeland score ( MC-samples used for the integration) by using Bayesian optimization (500 iterations with 10 optimization restarts). This took 41 minutes on the -gen Intel i5-CPU, and the distance to a true global minimizer was , and the corresponding function value was compared to a true global minimum value . In contrast, ppbo-rand reached this level of accuracy at the second query, and became more certain around the -query, as seen in Figure 2.
In this section we demonstrate the capability of the PPBO framework to correctly and efficiently encode the user preferences from projective preferential feedback.
We consider a material science problem of a single organic molecule adsorbing to an inorganic surface. This is a key step in understanding the structure at the interface between organic and inorganic films inside electronic devices, coatings, solar cells and other materials of technological relevance. The molecule can bind in different adsorption configurations, altering the electronic properties at the interface and affecting device performance. Exploring the structure and property phase space of materials with accurate but costly computer simulations is a difficult task. Our objective is to find the most stable surface adsorption configuration through human intuition and subsequent computer simulations. The optimal configuration is the one that minimises the computed adsorption energy.
Our test case is the adsorption of a non-symmetric, bulky molecule camphor on the flat surface of (111)-plane terminated Cu slab. Some understanding of chemical bonding is required to infer correct adsorption configurations. The user is asked to consider the adsorption structure as a function of molecular orientation and translation near the surface. These are represented with 6 physical variables: angles , , of molecular rotation around the X, Y, Z Cartesian axes (in the range [0, 360] deg.), and distances x, y, z of translation above the surface (with lattice vectors following the translational symmetry of the surface). The internal structures of the molecule and surface were kept fixed since little structural deformation is expected with adsorption. A similar organic/inorganic model system and experiment scenario was previously employed to detect the most stable surface structures with autonomous BO, given the energies of sampled configurations (BOSS).
In this interactive experiment, the user effectively encodes their preferred adsorption geometry as a location in the 6-dimensional phase space. We employ the quantum-mechanical atomistic simulation code FHI-aims (FHI-AIMS) to (i) compute the adsorption energy E of this preferred choice, and (ii) optimise the structure from this initial position to find the nearest local energy minimum in phase space, E*. We also consider the number of optimization steps N needed to reach the nearest minimum as a measure of the quality of the initial location.
There are four different test users: two material science experts (human: a PhD student and an experienced researcher, both of whom know the optimal solution), a non-expert (human), and a random bot (computer). The hypothesis is the following: if the framework is capable of encoding the user preferences, then the most preferred configurations (those with the highest posterior mean of the user's GP utility function) of the material science experts should attain lower adsorption energy than the most preferred configurations of the non-expert and the computer bot. We consider only coordinate projections, that is . In other words, we let the user choose the optimal value for one dimension at a time.
The total number of queries was 24, of which 6 were initial queries. The -initial query corresponded to , that is, to the -coordinate projection. The initial values for the reference coordinate vector were fixed to the same value across all user sessions. For acquisition, we used the expected improvement by projective preferential query. Since we allowed only coordinate projections for , we first selected , and then, either (ei-ext), or the next was drawn uniformly at random (ei-rand). The computer bot gave random values to the queries; to provide some consistency to the bot, was selected by maximizing a standard expected improvement function (ei-ei). The results are summarized in Table 1.
|User||Acq. of||E (eV)||E* (eV)||N|
Our first observation is that PPBO can distinguish between the choices made by a human and a computer bot. Human choices pinpoint atomic arrangements that are close to nearby local minima (small N), while the random bot’s choices are far less reasonable and require much subsequent computation to optimise structures. For all human users, the preferred molecular structures were placed somewhat high above the surface, which led to relatively high E values. With this query arrangement, it appears the z variable was the most difficult one to estimate visually. Human-preferred molecular orientations were favourable, so the structures were optimised quickly (few N steps).
The quality of user preference is best judged by the depth of the nearest energy basin, denoted by E*. Here, there is a marked divide by expertise. The structures refined from the choices of the bot and the non-expert are local minima of adsorption, characterised by weak dispersive interactions. The experts' choices led to two low-energy structure types that compete for the global minimum and feature strong chemical bonding of the O atom to the Cu surface.
The findings above demonstrate that the PPBO framework was able to encode the expert knowledge described via preferences. However, since there are only 10 samples, further work will be needed to validate the results.
In this paper we have introduced a new Bayesian framework, PPBO, for learning user preferences from a special kind of feedback, which we call projective preferential feedback. The feedback is equivalent to a minimizer along a projection. Its form is especially applicable in a human-in-the-loop context. We demonstrated this in a user experiment in which the user gives the feedback as optimal position or orientation of a molecule adsorbing to a surface. PPBO was capable of encoding the user preferences in this case.
We demonstrated that PPBO can deal with high-dimensional spaces where existing preferential Bayesian optimization frameworks that are based on pairwise comparisons, such as IBO (Brochu_2010_thesis) or PBO (PBO), struggle to operate. In the numerical experiments, the performance gap between PPBO and PBO was so large that we conclude: whenever a projective preferential query is possible, a PPBO type of approach should be preferred.
In summary, if it is possible to pose a projective preferential query, then PPBO provides an efficient way to learn preferences in high-dimensional problems. In particular, PPBO can be used for efficient expert knowledge elicitation in high-dimensional settings, which is of significant importance in many fields.
This work was supported by the Academy of Finland (Flagship programme: Finnish Center for Artificial Intelligence FCAI; grants 320181, 319264, 313195, 292334). Computational resources were provided by the CSC IT Center for Science, and the Aalto Science-IT Project.