BEBO
Bayesian embeddings for Bayesian optimization
view repo
We describe a method for Bayesian optimization by which one may incorporate data from multiple systems whose quantitative interrelationships are unknown a priori. All general (nonreal-valued) features of the systems are associated with continuous latent variables that enter as inputs into a single metamodel that simultaneously learns the response surfaces of all of the systems. Bayesian inference is used to determine appropriate beliefs regarding the latent variables. We explain how the resulting probabilistic metamodel may be used for Bayesian optimization tasks and demonstrate its implementation on a variety of synthetic and real-world examples, comparing its performance under zero-, one-, and few-shot settings against traditional Bayesian optimization, which usually requires substantially more data from the system of interest.
READ FULL TEXT VIEW PDFBayesian embeddings for Bayesian optimization
Many engineering design problems can be cast as optimization problems of the form
(1) |
where is a
-dimensional vector of design and operation parameters, and
is some quantity of interest, such as the strength of a material or the efficiency of an engine. Solving optimization problems that fit under the scope of Eq. (1) can be challenging if the dimension of the search space is high and/or the function is nonconvex and highly rugged, resulting in an intractably large number of potential solutions that may be considered. Furthermore, if it is expensive to evaluate (corresponding to performing a laboratory experiment or running a complex simulation code), one may be limited by their experimental budget and be forced to try and determine a good solution to Eq. (1) with few opportunities to gain information about the true response surface for the system of interest.The most elementary approaches for solving Eq. (1) involve specifying some space-filling design and querying
at all of the design points. While rudimentary designs such as factorial designs suffer from the curse of dimensionality and are intractable for all but the simplest problems, more advanced space-filling designs such as Latin hypercube designs still may require far too many evaluations than is feasible. Improving on these by orders of magnitude, in Bayesian optimization (BO), one defines a Bayesian metamodel
[such as a Gaussian process (GP)] to approximate and uses it to adaptively select promising designs until the optimum is found or an experimental budget is exhausted. Still, such an approach commonly takes tens to hundreds of evaluations to converge to a satisfactory degree in practice. In this work, we are interested in improving on this by another order of magnitude, finding plausible solutions to Eq. (1) with a handful–or no—evaluations of .Humans intuitively solve optimization problems in their daily lives in novel settings by leveraging knowledge about previous related experiences. By contrast, it is common to start quantitative engineering design problems from scratch since it is unclear how to re-use data from legacy systems with different response surfaces. There is a growing literature around methods that seek to devise suitable models to bridge the gap between these disparate sources of information. Through this, we hope to leverage so-called “legacy” data about previously seen tasks in a manner that endows our current effort with strong, but reasonable inductive biases that guide it towards effective solutions.
The main technical challenge is to devise a means of reasoning in a quantitative way about features of data that are not numerical in nature and therefore not suitable for standard modeling approaches^{6}^{6}6Some literature refer to these as “qualitative” features, though this seems somewhat of a misnomer since certain types of attributes in question can be numerical in nature, such as zip codes, yet are clearly unsuitable to treat as numbers in a model; others may not be quantitative, but are nonetheless precise (e.g. the name of an operator), yet “qualitative” does not convey this preciseness. These “general” features are typically found when describing the difference between tasks [1] and could include, for example, the serial number of a machine, the identity of an operator, or a chemical compound involved in some process of interest.
The key to our approach is to learn a probabilistic embedding associated with the general features associated with a system such that notions of similarity can be quantified and utilized by a downstream data-driven model. Minding our ultimate goal of solving optimization problems, we focus on Gaussian process metamodels and call the composition of our probabilistic embedding with the Gaussian process metamodel “Bayesian embedding GP” (BEGP). The contributions of this work are the following:
We define the structure of the BEGP metamodel designed to fuse information from systems with differing general features. We use a variational inference scheme to learn to infer reasonable probabilistic embeddings of the general features that capture uncertainty due to limited data while showing that the compositional model can be recognized as a deep Gaussian process [2] with a particular choice of kernel function.
We explain and demonstrate the application of this model to the task of Bayesian optimization, showing how the BEGP can be used to satisfy the usual requirements of the algorithm.
We conduct a series of computational experiments on a variety of synthetic and real-world systems to illustrate the usage of our approach and compare its performance to existing methods and evaluating the contribution of various components of the metamodel.
The scope of our work here is optimization problems of the form in Eq. (1). However, because our approach can be used as a drop-in replacement for other Bayesian metamodels, it is straightforward to extend our work to cases including multi-objective optimization and problems involving complex or unknown constraints. We also consider regression tasks as a stepping stone to our ultimate application of interest since satisfactory predictive power is a desired preliminary skill.
The remainder of this paper is organized as follows: In section 2, we explain the formulation of our metamodel and its usage within Bayesian optimization In section 3, we review related approaches and results from the literature. In section 4, we demonstrate our approach on a variety of synthetic and real-world examples. In section 5, we offer concluding remarks.
In this section, we describe the methodology including our model definition, training objective, predictive distribution, and application to Bayesian optimization.
We begin the discussion of our methodology by defining our model, explaining how inference may be done to determine its posterior, and how predictions are computed.
We begin by reviewing Gaussian processes for regression; the interested reader may consult [3] for more details. A Gaussian process (GP) with mean function and kernel is a distribution over functions such that the marginal distribution of the random function over a finite index set is a multivariate Gaussian. Concretely, let be described by a GP where is an input space that indexes the random process; traditionally, this will be -dimensional Euclidean space or some subset therein. However, any (infinite) set of inputs might be used, and we are mindful of this potential generality in the review in this section. Given inputs with corresponding outputs
, we write the joint probability density of
as(2) |
where , and . The quantity is recognized as the latent output of the GP model. Next, we define a Gaussian likelihood
(3) |
where denotes the observed output values. Integrating out results in the familiar marginal likelihood
(4) |
where
. The negative logarithm of this quantity is conventionally used to determine appropriate parameters for the mean and kernel functions as well as the likelihood through gradient-based minimization. Alternatively, approximate Bayesian inference of the model parameters may be done as well using Markov chain Monte Carlo or variational inference once suitable priors are defined
[4].Given a training set , predictions are made by using Bayes’ rule to condition the GP on the training set. The resulting predictive distribution over the latent outputs at some test inputs is
(5) |
where
(6) | ||||
(7) |
, and . Applying the likelihood and marginalizing out gives the posterior in output space
(8) |
Remark: Our use of a likelihood term in our model definition generalizes the concept of a “nugget” or “jitter” that is sometimes used within the GP metamodeling community [5, 4], usually with the stated purpose of either improving the conditioning of the kernel matrix or representing the noisiness in the observations being modeled.
The main source of the rich behavior in GP models is their kernel function, which encodes strong inductive biases about the statistics of the functions being modeled. It is most common to consider parametric kernels of the form . For example, the popular exponentiated quadratic kernel is defined as
(9) |
and can be evaluated easily for any pair of -dimensional vectors and .
While this is suitable for real-valued inputs commonly found in engineering problems, it is not suitable for more general inputs such as those mentioned above. One workaround is to embed elements from general input spaces as -dimensional latent variables; one can then form the complete input vector by concatenating these latents with the real-valued inputs to form a -dimensional input vector amenable to computation with traditional parametric kernels such as Eq. (9).
More general nonparametric positive-definite kernels exist that are valid for Gaussian processes. For example, the white noise kernel function,
(10) |
does not require Euclidean inputs; instead, it merely requires that we are able to distinguish between elements in its index set.
Therefore, to accomplish the embedding mentioned above, we map our general inputs through a -variate Gaussian process with white noise kernel:
(11) |
Sampling this GP at some general input returns potential embeddings of the input as a -dimensional vector in Euclidean space. These latent variables may then be concatenated with the numerical feature space and fed through a second Gaussian process that serves as the familiar regression model employed in Bayesian metamodeling. The Bayesian embedding GP (BEGP) model that combines the embedding Gaussian process with the GP regression model is shown in Fig. 1. This model is composed of two GPs, where the outputs of the first (embedding) GP are fed as inputs to the second (regression) GP, making it a deep Gaussian process with serial topology similar to those originally discussed in [2].
Our modeling approach may be thought of as a type of multi-task learning in that one doing regression traditionally segregates datasets along differences in general features, building a separate model for each dataset using the remaining real-valued features; by contrast, here we build in correlations across general features within a single model, simultaneously improving the predictive power across all of the input space.
Remark: for systems with multi-dimensional general inputs (e.g. operator IDs and machine serial numbers), it is straightforward to extend our approach to embed each general input dimension separately to its own latent space; these latents are then all concatenated with each other to define the total latent variable which is then concatenated with as before.
Having defined our model, we now discuss how to do inference on it. Inference for the Bayesian EGP model presents two challenges. First, inferring the posterior of the embedding layer is challenging due to its nonparameteric nature stemming from the white noise kernel combined with the fact that its posterior will generally be non-Gaussian due to the nonlinearity of the downstream model that operates on it. Second, unlike a traditional Gaussian process regression model, it is analytically intractable to marginalize out the latent variables . In this section, we discuss simple strategies for overcoming these two challenges so that we may devise an objective function for training our model.
First, we consider the challenges associated with the embedding layer. Let denote the number of distinct general inputs that we observe from (i.e. the number of legacy datasets that we would like to incorporate in our model). Due to the kernel, the joint prior density over latents associated with general inputs is
(12) |
where is the -th dimension associated with the latent for input . For reasons analogous to those mentioned in previous literature [6, 2, 7], the posterior over these inputs is generally non-Gaussian. We define a mean-field variational posterior to approximate the true posterior with the form
(13) |
where are variational parameters. Fortunately, one is usually only interested in a relatively small number of inputs in , so it is not a challenge to track these variational parameters. In fact, computational challenges associated with data scalability of Gaussian process models are likely to become problematic well before challenges associated with working with a high number of tasks, even though recent advances have make significant progress in lift traditional barriers to Gaussian process modeling in large-data regimes [8, 9, 10, 11, 7, 12].
Lastly, note that if no data have been observed for some held-out general input , then, by the nature of the white noise kernel, the posterior over the associated latent is equal to the prior. this observation is critical for enabling our model to make credible predictions in a zero-shot setting, where no data about some task of interest is available at the outset of optimization. Indeed, we will see that this probabilistic approach enables our model to make surprisingly good inferences to guide the first iteration of Bayesian optimization in such settings.
Having resolved the challenge of representing the posterior of the embedding layer, we now turn our attention to the second challenge regarding the intractability of the model’s marginal log-likelihood (or evidence), which usually serves as the objective function for training probabilistic models. Again, following a variational strategy, we derive a tractable evidence lower bound (ELBO) that can be used in place of the exact marginal log-likelihood. While this approach was first demonstrated for Gaussian processes using a sparse approach based on inducing variables, it turns out that later advances in variational inference [13] allow us to avoid relying on a sparse GP formulation. That said, we recognize that such an approach may realistically be helpful in the plausible cases where an abundance of legacy data takes the problem into a large-data setting. This is in contrast to the usual assumptions of data-scarcity in engineering metamodeling, which are usually myopically focused only on solving a single task in isolation. For the sake of completeness, we now provide the derivation of the ELBO for the BEGP model. Based on the probabilistic graphical model shown in Fig. 1, the joint probability of our model with observations is
(14) |
where p is given by Eq. (12), is analogous to Eq. (2), and is the likelihood of Eq. (3). and the marginal log-likelihood is obtained by simply integrating over the latent variables and taking the logarithm of the result:
(15) |
From this point onwards, we will omit the conditioning on the inputs for brevity. While we can integrate out due to the conjugacy of the likelihood, cannot be integrated out analytically. We multiply and divide the integrand by to obtain
(16) |
Applying Jensen’s equality, we move the logarithm inside the integrand to get
(17) |
Rearranging and using gives
(18) |
Additional progress may be made at this point by restricting our choice of kernel function for the second GP layer to either an exponentiated quadratic or a linear kernel since the kernel expectations that form the crux of the challenge may then be evaluated analytically. Furthermore, under the special case where the same real-valued input points are sampled for each general input, Atkinson and Zabaras [14, 15]
showed that the kernel expectations can be further broken down into a Kronecker product between a deterministic kernel matrix and a kernel expectation enabling extensions to modeling millions of data. In this work, we instead opt to estimate Eq. (
18) through sampling so as to maintain the ability to leverage arbitrary kernel functions.The model is trained by maximizing Eq. (18
) with respect to the parameters of the variational posteriors as well as the kernel hyperparameters and likelihood variance
through gradient-based methods. We find that it is typically sufficient to approximate the integral with a single sample from at each iteration of the optimization.Note that there is a redundancy in defining the scale of the latent space in that our model possesses both a parameter for the prior variance of the embedding layer as well as length scales for the parametric kernel of the GP regression layer. Given this freedom, one can make the ELBO arbitrarily high by bringing towards zero while scaling the variational parameters of and the length scales of the regression kernel correspondingly without changing the predictions of the model. to avoid this pathology and establish the scale of the latent space, we fix .
We choose to learn point estimates of the other model parameters since they are informed by all of the tasks’ data. Therefore, the uncertainty contributed to the predictions due to uncertainty in these parameters will generally be small. This is again only because we are incorporating a large amount of data into our model by leveraging legacy data; in the typical single-task engineering setting, we might expect to have far fewer data, and Bayesian inference over the parameters may provide value to the model.
A reasonable alternative to our variational inference approach is to instead perform inference using Monte Carlo. We favor VI because we need to update the model posteriors repeatedly during Bayesian optimization as new data are added to the model. This is simple to do with VI since we can continue our gradient-based optimization from the previous variational posterior. By contrast, Monte Carlo requires us to run the chain for almost as long as the initial chain every time that we need to update the model. Thus, while there is not such a clear advantage for one method compared to the other when initially training the model on legacy data, VI quickly becomes more attractive in the following steps. Furthermore, we note that updating the model parameters is rather important since we are focusing on extreme low-data regimes (fewer than 10 data from the task of interest), where each new datum will generally have considerable effects on the posterior.
Another reason for using MC inference is because highly-factorized variational posterior distributions are known to underestimate the entropy of the true posterior when there is mismatch between the variational form and the true posterior [16]. One might reasonably worry that we are neglecting important uncertainty information by using a mean-field posterior for . We find in practice that this approximation successfully captures much of the uncertainty and is sufficient to provide competitive or superior predictions compared to the baselines considered in our examples. However, our method is not incompatible with more powerful VI methods such as multivariate Gaussians with full covariance matrices [15] or normalizing flow-based approaches [17].
Given a trained model, we are now interested in making predictions at held-out inputs in . Consider a set of test inputs with distinct general inputs and distinct real-valued inputs, combined to form test inputs in all. We first apply the embedding layer to obtain a sample of the latents as well as the training data’s latents . These latents are expanded to form the full training and test inputs for the regression model, , where . Given these samples we can compute the conditional predictive distributions over the latent and observed outputs, given by Eq. (5) and (8), respectively.
Due to the nonlinearity of the GP model, marginalizing over latent variables’ variational posterior
generally induces a non-Gaussian predictive distribution. However, given a Gaussian variational posterior, the moments of the marginal predictive distribution admit analytic expressions; previous works
[18, 6] have approximated the predictive distribution as Gaussian with these statistics. Alternatively, one may sample the predictive distribution in order to better resolve its details. We utilize the former approach in our examples.Having discussed the formulation of our model, including training and predictions, we now turn our attention to using it in the context of optimization. The main algorithm for Bayesian optimization is given in Algorithm 1. We restrict ourselves to the task of optimizing subject to a fixed general input , though our method permits searching over multiple general inputs with trivial modification.
One ingredient that must be specified for BO is the acquisition function , which is used to steer the selection of points at which one conducts experiments to evaluate . In this work, for systems in which we can evaluate at any location in for the current task , we consider the expected improvement
(19) |
where is the Heaviside theta function and is the value of the current best design. When has a known form such as a Gaussian, evaluation of Eq. (19) can be done cheaply. Notice as a matter of convention that we have defined in Eq. (19) such that lower values are better.
Finally, the traditional approach to maximizing the acquisition function has been to simple select a large random sampling of , evaluate
on all of the samples (which is generally cheap), then select the best sample. However, following the widespread adoption of computational frameworks supporting automatic differentiation such as TensorFlow
[19]and PyTorch
[20], it has become customary to accelerate the inner loop optimization overusing gradient-based methods by simply backpropagating the acquisition function back through the model to obtain its gradient with respect to the inputs
[21]. Other recent work has investigated this matter more broadly [22]. Here, we use gradient-based search with restarts to find the next point to select similarly to [21]. For the case where one must estimate through sampling (e.g. due to estimating the predictive posterior by sampling latents from , stochastic backpropagation [23] can be used to deal with the variance in the estimates of due to sampling. We also note that this setup may be applied immediately to solve robust optimization problems with no modification by additionally sampling from any uncontrolled inputs.In our real-world examples, we do not have access to the real-world data-generating process and must instead work with a static, pre-computed dataset. In this case, our design space is practically the finite set at which experiments have already been carried out. In this case, we can instead directly estimate the negative^{7}^{7}7We take the negative probability to keep with the convention that lower values of are better. probability that a given available point will be the best design:
(20) |
We approximate Eq. (20) by repeatedly sampling the joint predictive distribution over all available design points and counting the frequency with which each design point is predicted to have the best output. We find that this crude approximation is not overly computationally burdensome and results in good performance in practice as shown by our examples.
McMillan et al. [24] study Gaussian process models based on “qualitative” and “quantitative” variables. This approach would be analogous to a deterministic version of our Bayesian embedding layer. As we show in our examples, the incorporation of Bayesian uncertainty estimates provides highly valuable information that is essential to making the predictive distribution of the model credible.
The Gaussian process latent variable model [25] is a seminal work for inferring latent variables by using the marginal log-likelihood of a Gaussian process as training signal. The later Bayesian extension by Titsias and Lawrence [6] was found to significantly improve the quality of the model and gives compelling evidence in favor of modeling latent variables in a Bayesian manner.
The task of inferring input-output relationships where the inputs are only partially-specified was first identified by Damianou and Lawrence as “semi-described learning” [26]. Our problem statement may be regarded as similar in that we choose to associate the general features in our data with latent variables that are unspecified a priori.
Using related systems to improve knowledge about a related system of interest has a long history in multi-fidelity modeling [27, 28], where one traditionally considers a sequence of systems that constitute successively more accurate (and expensive) representations of a true physical process. However, it is unclear how to define a “sequence” of legacy datasets that are all “on equal footing” (such as different operators or machines performing the same task), particularly as the number of such instances becomes large.
Our model is similar to the multi-task Gaussian process introduced in [29]. Our approaches are similar in that we both learn a kernel over tasks; however, [29] require that the kernel matrix decompose as a Kronecker product between a task-wise covariance matrix and the input kernel matrix. The Kronecker product kernel matrix can be equivalent to a kernel matrix evaluated on our augmented inputs in the special cases of if one uses either a linear or a Gaussian kernel, but more general kernels cannot be composed in this way. By finding representations of tasks in a Euclidean input space, we lift this restriction, allowing us to use arbitrary kernels instead, wherein the Kronecker product decomposition of [29] is a special case. Additionally, by posing a prior over the latent input space, our method also allows us to do principled zero-shot predictions on unseen inputs; it is significantly more challenging to formulate a reasonable prior on the kernel matrix itself. Finally, we can inject additional modeling bias into our model by selecting the dimensionality of our latent space. While a low-rank approximation to the task covariance matrix might be derived to obtain similar savings, it again seems more natural in our opinion to exercise this ability in a latent space.
The task of utilizing “legacy” data in order to improve the predictive capability on a related system of interest was explored in [30]. However, the approach used in that work can only utilize legacy models through linear combinations, making extrapolation on the system of interest difficult, particularly when few (or no) data are available from the task of interest. Additionally, their method requires a cross-validation scheme for model selection; by contrast, the probabilistic representation of tasks in the current method enables the use of Bayesian inference to guide model selection in the sense of identifying relevance between tasks.
A recent work by Zhang et al. [31] also studies Bayesian optimization in the context of qualitative and quantitative inputs. However, they learn point embeddings via maximum likelihood estimation rather than inferring a posterior through Bayesian inference; this makes the overall model prone to overfitting in practice and generally unsuitable for providing credible uncertainty estimates. This also makes their model difficult to apply in few-shot settings, and impossible to apply in a principled way for zero-shot learning. A related work by Iyer et al. [32] identifies a pathology associated with the point estimate inference strategy and proposes an ad hoc workaround; a natural outcome of our Bayesian embedding approach is that this challenge vanishes.
The BEGP model can be thought of as a latent variable model [33, 25, 6] and has similarities with previous works with latent variable GPs [34, 14, 15], though those works additionally impose structure over latent variables as well as (potentially) the input training data. However, neither works were applied within the context of Bayesian optimization, nor did they identify that multiple general feature sets could be decomposed as we have chosen to do.
In this section, we demonstrate our method on a variety of synthetic and real-world examples. The embedding GP model is implemented using gptorch^{8}^{8}8https://github.com/cics-nd/gptorch, a Gaussian process framework built on PyTorch [20]. Source code for implementing the model and synthetic experiments described below can be found on GitHub^{9}^{9}9https://github.com/sdatkinson/BEBO. For each system, we consider both a regression and optimization problem using our embedding GP model with both probabilistic and deterministic embedding layers; the latter is achieved by replacing the posterior of Eq. (13) with a delta function at its mode. We use an RBF kernel as in Eq. (9) and constant mean function whose value is determined via an MLE point estimate.
We compare our results against a vanilla Gaussian process metamodel that is unable to directly leverage the legacy data as a baseline. Since GP models typically fare very poorly the the extreme low-data regime that we are interested in, we conduct full Bayesian inference over the mean and kernel function parameters, using as prior a factorized Gaussian over the parameters^{10}^{10}10Or the logarithm of the parameters that are constrained to be positive. whose mean and variance are estimated by fitting GPs to each legacy task and using the statistics of the point estimates of the model parameters found therein. Inference for the Bayesian GP is carried out using Hamiltonian Monte Carlo [35] with the NUTS sampler [36] as implemented in Pyro [37].
We begin by describing the systems that we consider. Each system contains a set of tasks that we will jointly learn.
We first consider a system with one-dimensional input and . The set of response surfaces are given by
(21) |
where we define . Different response surfaces are selected by sampling . We consider a problem setting where legacy tasks are available, each with data with inputs sampled uniformly from . In the regression setting, we randomly sample a number of data from one response surface as the current task and split it into a training and test set. We quantify our prediction accuracy in terms of mean negative log probability (MNLP), mean absolute error (MAE), and root mean square error (RMSE). In the optimization setting, we perform Bayesian optimization over the current task and track the best design point as a function of the number of evaluations on the current task. We repeat both experiments with different random seeds to quantify the performance statistics of each metamodeling approach.
We also consider a generalization of the Forrester function [28] to a many-task setting:
(22) |
where
(23) |
Under the parameterization of Eq. (22), the original “high-fidelity” function considered in [28] is obtained when , and the low-fidelity function corresponds to . We always include the low-fidelity function in the legacy tasks along with other tasks generated by sampling . Note that this places the high-fidelity task at an extremum of the system parameter space, meaning that multi-task models will not necessarily benefit by assuming a priori that the held-out task lives at the center of latent space (assuming that their learned latent representation resembles that of the parameterization we have chosen). We consider a design space . Each of the 5 legacy tasks includes 5 data where the input was sampled uniformly in . We consider a regression and an optimization with identical setup to the toy system in Sec. 4.1.1.
Our first real-world dataset comprises of data in which the performance of a pump is quantified in terms of seven design variables. Our data includes data from 6 pump families, each with of which has between and data. We consider each pump family in turn as the “current” task while using the other families as legacy task data. We repeat our experiment 6 times, each time using a different pump family as the current task and the other 5 datasets as legacy task data. In the regression setting, we split the current task data into a train and test set, training our metamodel on all of the legacy data as well as whatever data is available from the current task, then quantify our predictions on the held-out test set of the current task in terms of MNLP, MAE, and RMSE. Each experiment is repeated 10 times with different random initializations and train-test splits. In the optimization setting, we begin with no evaluations from the current task and carry out Bayesian optimization using the available data as a (finite) design set. Data are selected at each iteration according the acquisition function of Eq. (20). We track the best evaluation (relative to the per-task best design) as a function of the number of evaluations on the current task.
We consider a dataset in which the performance of an additive manufacturing process is quantified in terms of four design variables. We have data from different chemistries, each of which has data. We conduct experiments on each of the chemistries in an analogous manner to the experiments on the pump data.
Figure 2 shows regression results for the synthetic systems described in Sections 4.1.1 and 4.1.2. We notice that the embedding GPs typically outperform the Bayesian GP by a margin, though there is overlap in the statistics of their performance over repeats. Curiously, we notice that the deterministic embedding GP occasionally outperforms the Bayesian embedding in terms of RMSE and MAE, but fails catastrophically in terms of MNLP, where the overconfidence of its predictions is severely penalized.
Figure 3 shows our results for the pump dataset. For this problem, we observe mixed results with the embedding GPs outperforming the baseline on some of the tasks, but underperforming on others. We hypothesize that this is because certain tasks are substantially different from the others, causing the prior information from legacy tasks to be unhelpful. However, on certain tasks (3 and 4), we see that the Bayesian embedding GP outperforms the baseline even in a zero-shot setting. We also notice that the deterministic embedding GP consistently gives rather poor uncertainty estimates as evidenced by its MNLP scores. Thus, we see a clear benefit to the Bayesian embedding approach for improving the credibility of the predictions.
Figure 4 shows our results for the additive dataset for the first few held-out tasks. In contrast to the pump problem, we often see considerable improvements by using the embedding models. For many of the held-out tasks, the zero-shot predictions from the embedding models are, on average, superior to those of the Bayesian GP even after it has seen 10 examples from the current task. On others (e.g. RMSE, MAE for Task 2 in Fig. 4), the Bayesian GP does somewhat outperform the embedding models, though the embedding models continue to improve. This may be because the prior knowledge built up from the legacy tasks is unhelpful for the held-out task; the embedding model must learn to “overturn” this prior knowledge by finding a way to separate the embedding of the held-out task from the unrelated legacy tasks. However, this requires observations to gradually overturn the belief via Bayesian updating.
Additionally, we see again that the deterministic embedding consistently results in very poor performance in terms of MNLP, frequently performing so much worse that it is impractical to show on the same axes as the Bayesian GP and Bayesian EGP models. Figure 5 shows the results when zoomed out to view the performance of the deterministic EGP for one task. We see a similar deficiency in all of the other held-out tasks, firmly underscoring that it is critical to account for uncertainty in the embedding in order to obtain credible predictions. This makes sense, given the high number of tasks we must embed relative to the number of data available to elucidate their input-output characteristics. This is not uncommon in engineering design problems, where one typically has a limited computational budget to explore any given design problem, but may potentially have access to a large archive of legacy tasks.
Figure 6 shows the best design discovered for the synthetic systems as a function of the number of evaluations on the current task. We report the median performance over splits as well as the 80% coverage interval (-th to -th percentile). We see that the legacy task data enables the embedding GPs to consistently find near-optimal designs with a either a single evaluation (Forrester) or without any data from the current task (synthetic). By contrast, the GP usually requires a handful of evaluations before it begins to find good solutions.
Figures 7 and 8 show the zero- and one-shot posteriors of all three methods on the toy and Forrester systems, respectively. We see that the legacy task data equip the embedding GPs with helpful priors that aid them in identifying the response surface. By contrast, the Bayesian GP guesses randomly in the zero-shot setting and has minimal insight for its one-shot predictions. We also notice that while the deterministic embedding model seems to fortunately pick good designs, its model of the response surface is highly overconfident. By contrast, we see that the Bayesian embedding endows our models with well-calibrated uncertainty estimates that simultaneously allows is to find good designs while retaining a credible metamodel. Thus, while the EGP performs comparably to the BEGP, we must be cautious about generalizing these results. We also point out that while it seems like the BGP has picked a very good first point in Fig. 7, this selection is by pure luck since BGP’s predictive distribution with zero training data (i.e. the prior) is flat in its inputs. Indeed, we see that it picks the same first point for the Forrester function in Fig. 8, which is not nearly so close to optimal.
Figure 9 shows the running best design for the pump and additive systems as a function of the number of current task evaluations. Since each legacy system has a different optimum, results show the performance relative to the best design for the current task. Again, we see that the embedding GPs vastly outperform the vanilla GP approach, usually finding good designs on their first attempts. However, here we see that the Bayesian embedding gives better results over a deterministic embedding; this can be directly attributed to our earlier observation that the deterministic embedding tends to overfit, particularly when there are very few data to inform the embedding. By contrast, the Bayesian embedding safely guards against overfitting and provides robust performance.
We have introduced a method for Bayesian optimization applicable to settings where one has access to a variety of related datasets but does not know how they relate to each other a priori. The Bayesian embedding GP automatically learns latent representations that enable a single metamodel to share information across multiple tasks, thereby improving the predictive performance in all cases, particularly in novel settings where limited data is available. We observe that our method enables Bayesian optimization to more quickly and reliably find good solutions compared to traditional methods.
We find that variational inference with rudimentary variational approximations (factorized Gaussians) over the latent variables provides sufficient uncertainty information that credible estimates may be obtained in practice and is robust with respect to the number of latent dimensions. Our experiments show that this Bayesian embedding is critical for obtaining credible uncertainty quantification.
There are a number of ways in which our work might be extended. For one, our method might be applied to optimization settings with unknown constraints. In such cases, one might simultaneously model the probability of violating a constraint with a multi-task model. Modeling binary outputs such as constraint violation with GPs requires non-conjugate inference, which can be enabled using techniques from [38]. Second, our approach is straightforward to apply to problems of robust design. In fact, we have already shown how uncertainty in our metamodel can be propagated efficiently during design selection through stochastic backpropagation, eliminating the need for expensive double-loop optimization. Other approaches [39, 40] exploit analytic properties to solve the inner-loop optimization, though our work suggests that it might be approximately solved in general using stochastic optimization. Finally, our model learns a single embedding for each general input (task) as a latent vector. We would like to explore the effect of a hierarchical latent representation in which the embedding of each task is allowed to vary with . Such a conditional multi-task model might enable one to exploit partial similarities between tasks in certain regions of input space while still providing for the possibility that their trends might differ elsewhere. While this was studied in [30], we expect that our Bayesian latent representation might improve performance in few-shot settings such as those considered in this work.
Funding for this work was provided internally by GE Research.
Gaussian processes for machine learning
, Vol. 2, MIT Press Cambridge, MA, 2006.Wilson, A., and Nickisch, H., “Kernel interpolation for scalable structured Gaussian processes (KISS-GP),”
International Conference on Machine Learning, 2015, pp. 1775–1784.Lawrence, N. D., “Gaussian process latent variable models for visualisation of high dimensional data,”
Advances in Neural Information Processing Systems, 2004, pp. 329–336.Dai, Z., Alvarez, M., and Lawrence, N., “Efficient modeling of latent information in supervised learning using gaussian processes,”
Advances in Neural Information Processing Systems, 2017, pp. 5131–5139.