1 Introduction
Many engineering design problems can be cast as optimization problems of the form

(1) $\mathbf{x}^* = \operatorname{arg\,min}_{\mathbf{x} \in \mathcal{X}} f(\mathbf{x}),$

where $\mathbf{x} \in \mathcal{X} \subseteq \mathbb{R}^d$ is a $d$-dimensional vector of design and operation parameters, and $f: \mathcal{X} \rightarrow \mathbb{R}$ is some quantity of interest, such as the strength of a material or the efficiency of an engine. Solving optimization problems that fit under the scope of Eq. (1) can be challenging if the dimension of the search space is high and/or the function is nonconvex and highly rugged, resulting in an intractably large number of potential solutions that may be considered. Furthermore, if $f$ is expensive to evaluate (corresponding to performing a laboratory experiment or running a complex simulation code), one may be limited by an experimental budget and forced to determine a good solution to Eq. (1) with few opportunities to gain information about the true response surface of the system of interest.

The most elementary approaches for solving Eq. (1) involve specifying some space-filling design and querying $f$ at all of the design points. While rudimentary designs such as factorial designs suffer from the curse of dimensionality and are intractable for all but the simplest problems, even more advanced space-filling designs such as Latin hypercube designs may require far more evaluations than is feasible. Improving on these by orders of magnitude, Bayesian optimization (BO) defines a Bayesian metamodel [such as a Gaussian process (GP)] to approximate $f$ and uses it to adaptively select promising designs until the optimum is found or an experimental budget is exhausted. Still, such an approach commonly takes tens to hundreds of evaluations to converge to a satisfactory degree in practice. In this work, we are interested in improving on this by another order of magnitude, finding plausible solutions to Eq. (1) with a handful of evaluations of $f$, or none at all.

Humans intuitively solve optimization problems in their daily lives in novel settings by leveraging knowledge about previous related experiences. By contrast, it is common to start quantitative engineering design problems from scratch, since it is unclear how to reuse data from legacy systems with different response surfaces. There is a growing literature around methods that seek to devise suitable models to bridge the gap between these disparate sources of information. Through this, we hope to leverage so-called "legacy" data about previously seen tasks in a manner that endows our current effort with strong, but reasonable, inductive biases that guide it towards effective solutions.
The main technical challenge is to devise a means of reasoning in a quantitative way about features of data that are not numerical in nature and therefore not suitable for standard modeling approaches. (Some literature refers to these as "qualitative" features, though this seems somewhat of a misnomer: certain attributes in question can be numerical in nature, such as zip codes, yet are clearly unsuitable to treat as numbers in a model; others may not be quantitative but are nonetheless precise, e.g. the name of an operator, and "qualitative" does not convey this preciseness.) These "general" features typically arise when describing the differences between tasks [1] and could include, for example, the serial number of a machine, the identity of an operator, or a chemical compound involved in some process of interest.
The key to our approach is to learn a probabilistic embedding of the general features associated with a system such that notions of similarity can be quantified and utilized by a downstream data-driven model. Mindful of our ultimate goal of solving optimization problems, we focus on Gaussian process metamodels and call the composition of our probabilistic embedding with the Gaussian process metamodel the "Bayesian embedding GP" (BEGP). The contributions of this work are the following:

We define the structure of the BEGP metamodel, designed to fuse information from systems with differing general features. We use a variational inference scheme to infer reasonable probabilistic embeddings of the general features that capture uncertainty due to limited data, and we show that the compositional model can be recognized as a deep Gaussian process [2] with a particular choice of kernel function.

We explain and demonstrate the application of this model to the task of Bayesian optimization, showing how the BEGP can be used to satisfy the usual requirements of the algorithm.

We conduct a series of computational experiments on a variety of synthetic and real-world systems to illustrate the usage of our approach, compare its performance to existing methods, and evaluate the contribution of various components of the metamodel.
The scope of our work here is optimization problems of the form in Eq. (1). However, because our approach can be used as a drop-in replacement for other Bayesian metamodels, it is straightforward to extend our work to cases including multi-objective optimization and problems involving complex or unknown constraints. We also consider regression tasks as a stepping stone to our ultimate application of interest, since satisfactory predictive power is a desired preliminary skill.
The remainder of this paper is organized as follows: In section 2, we explain the formulation of our metamodel and its usage within Bayesian optimization. In section 3, we review related approaches and results from the literature. In section 4, we demonstrate our approach on a variety of synthetic and real-world examples. In section 5, we offer concluding remarks.
2 Methodology
In this section, we describe the methodology including our model definition, training objective, predictive distribution, and application to Bayesian optimization.
2.1 Model definition
We begin the discussion of our methodology by defining our model, explaining how inference may be done to determine its posterior, and how predictions are computed.
2.1.1 Gaussian process regression
We begin by reviewing Gaussian processes for regression; the interested reader may consult [3] for more details. A Gaussian process (GP) with mean function $m(\cdot)$ and kernel $k(\cdot, \cdot)$ is a distribution over functions such that the marginal distribution of the random function over any finite index set is a multivariate Gaussian. Concretely, let $f: \mathcal{X} \rightarrow \mathbb{R}$ be described by a GP, where $\mathcal{X}$ is an input space that indexes the random process; traditionally, this will be $d$-dimensional Euclidean space or some subset thereof. However, any (infinite) set of inputs might be used, and we are mindful of this potential generality in the review in this section. Given $n$ inputs $X = (\mathbf{x}_1, \dots, \mathbf{x}_n)$ with corresponding outputs $\mathbf{f} = (f(\mathbf{x}_1), \dots, f(\mathbf{x}_n))^\top$, we write the joint probability density of $\mathbf{f}$ as

(2) $p(\mathbf{f} \mid X) = \mathcal{N}\left(\mathbf{f} \mid \mathbf{m}, K_{XX}\right),$

where $\mathbf{m} = (m(\mathbf{x}_1), \dots, m(\mathbf{x}_n))^\top$ and $(K_{XX})_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$. The quantity $\mathbf{f}$ is recognized as the latent output of the GP model. Next, we define a Gaussian likelihood
(3) $p(\mathbf{y} \mid \mathbf{f}) = \mathcal{N}\left(\mathbf{y} \mid \mathbf{f}, \sigma_y^2 I\right),$

where $\mathbf{y}$ denotes the observed output values. Integrating out $\mathbf{f}$ results in the familiar marginal likelihood

(4) $p(\mathbf{y} \mid X) = \mathcal{N}\left(\mathbf{y} \mid \mathbf{m}, K_y\right),$

where $K_y = K_{XX} + \sigma_y^2 I$. The negative logarithm of this quantity is conventionally used to determine appropriate parameters for the mean and kernel functions as well as the likelihood through gradient-based minimization. Alternatively, approximate Bayesian inference of the model parameters may be carried out using Markov chain Monte Carlo or variational inference once suitable priors are defined [4].

Given a training set $\mathcal{D} = (X, \mathbf{y})$, predictions are made by using Bayes' rule to condition the GP on the training set. The resulting predictive distribution over the latent outputs $\mathbf{f}^*$ at some test inputs $X^*$ is
(5) $p(\mathbf{f}^* \mid X^*, \mathcal{D}) = \mathcal{N}\left(\mathbf{f}^* \mid \mathbf{m}^*, S^*\right),$

where

(6) $\mathbf{m}^* = \mathbf{m}(X^*) + K_{X^* X} K_y^{-1} \left(\mathbf{y} - \mathbf{m}\right),$

(7) $S^* = K_{X^* X^*} - K_{X^* X} K_y^{-1} K_{X X^*},$

$(K_{X^* X})_{ij} = k(\mathbf{x}^*_i, \mathbf{x}_j)$, and $(K_{X^* X^*})_{ij} = k(\mathbf{x}^*_i, \mathbf{x}^*_j)$. Applying the likelihood and marginalizing out $\mathbf{f}^*$ gives the posterior in output space

(8) $p(\mathbf{y}^* \mid X^*, \mathcal{D}) = \mathcal{N}\left(\mathbf{y}^* \mid \mathbf{m}^*, S^* + \sigma_y^2 I\right).$
Remark: Our use of a likelihood term in our model definition generalizes the concept of a “nugget” or “jitter” that is sometimes used within the GP metamodeling community [5, 4], usually with the stated purpose of either improving the conditioning of the kernel matrix or representing the noisiness in the observations being modeled.
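For concreteness, the predictive equations above can be sketched in a few lines of NumPy (an illustrative sketch assuming a zero mean function; the function and variable names are ours, not those of any particular library):

```python
import numpy as np

def gp_posterior(K, K_star, K_star_star, y, noise_var):
    """GP predictive distribution, cf. Eqs. (5)-(8), assuming a zero mean.

    K           : (n, n) kernel matrix over the training inputs
    K_star      : (n, m) cross-kernel between training and test inputs
    K_star_star : (m, m) kernel matrix over the test inputs
    y           : (n,) observed outputs
    noise_var   : likelihood (noise) variance
    """
    n, m = K.shape[0], K_star_star.shape[0]
    # Cholesky factor of K + sigma^2 I for numerically stable solves
    L = np.linalg.cholesky(K + noise_var * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = K_star.T @ alpha                   # Eq. (6) with m(x) = 0
    v = np.linalg.solve(L, K_star)
    cov_f = K_star_star - v.T @ v             # Eq. (7)
    cov_y = cov_f + noise_var * np.eye(m)     # Eq. (8): add likelihood noise
    return mean, cov_f, cov_y
```

A quick sanity check on such an implementation is that, with a near-zero noise variance, the posterior mean interpolates the training targets.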
2.1.2 Incorporating general input features as inputs to Gaussian process regression models
The main source of the rich behavior in GP models is their kernel function, which encodes strong inductive biases about the statistics of the functions being modeled. It is most common to consider parametric kernels of the form $k(\mathbf{x}, \mathbf{x}'; \boldsymbol{\theta})$. For example, the popular exponentiated quadratic kernel is defined as

(9) $k(\mathbf{x}, \mathbf{x}') = \sigma_k^2 \exp\left( -\sum_{j=1}^{d} \frac{(x_j - x'_j)^2}{2 l_j^2} \right)$

and can be evaluated easily for any pair of $d$-dimensional real vectors $\mathbf{x}$ and $\mathbf{x}'$.
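A direct translation of Eq. (9) with per-dimension (ARD) length scales might look as follows (a sketch; names are illustrative):

```python
import numpy as np

def exponentiated_quadratic(X1, X2, variance=1.0, lengthscales=None):
    """ARD exponentiated quadratic kernel, cf. Eq. (9).

    X1: (n, d) and X2: (m, d) arrays of real-valued inputs.
    Returns the (n, m) kernel matrix.
    """
    X1, X2 = np.atleast_2d(X1), np.atleast_2d(X2)
    if lengthscales is None:
        lengthscales = np.ones(X1.shape[1])
    # Scaled pairwise differences, squared and summed over input dimensions
    diff = (X1[:, None, :] - X2[None, :, :]) / lengthscales
    return variance * np.exp(-0.5 * np.sum(diff ** 2, axis=-1))
```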
While this is suitable for the real-valued inputs commonly found in engineering problems, it is not suitable for more general inputs such as those mentioned above. One workaround is to embed elements from general input spaces as $d_z$-dimensional latent variables; one can then form the complete input vector by concatenating these latents with the real-valued inputs, yielding a $(d + d_z)$-dimensional input vector amenable to computation with traditional parametric kernels such as Eq. (9).
More general nonparametric positive-definite kernels exist that are valid for Gaussian processes. For example, the white noise kernel function,

(10) $k(s, s') = \sigma_w^2\, \delta_{s, s'},$

where $\delta_{s, s'}$ is the Kronecker delta, does not require Euclidean inputs; instead, it merely requires that we are able to distinguish between elements of its index set.
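Because the white noise kernel only tests identity of its arguments, it can be evaluated over arbitrary labels; a minimal sketch:

```python
import numpy as np

def white_noise_gram(labels_a, labels_b, variance=1.0):
    """Gram matrix of the white noise kernel, cf. Eq. (10). The inputs need
    not be numeric: any comparable labels (serial numbers, operator IDs,
    chemistries) work, since the kernel only tests equality."""
    return variance * np.array(
        [[1.0 if a == b else 0.0 for b in labels_b] for a in labels_a]
    )
```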
Therefore, to accomplish the embedding mentioned above, we map our general inputs through a $d_z$-variate Gaussian process with white noise kernel:

(11) $z_j(s) \sim \mathcal{GP}\left(0, \sigma_w^2\, \delta_{s, s'}\right), \quad j = 1, \dots, d_z.$

Sampling this GP at some general input $s$ returns a potential embedding of the input as a $d_z$-dimensional vector in Euclidean space. These latent variables may then be concatenated with the numerical features and fed through a second Gaussian process that serves as the familiar regression model employed in Bayesian metamodeling. The Bayesian embedding GP (BEGP) model that combines the embedding Gaussian process with the GP regression model is shown in Fig. 1. This model is composed of two GPs, where the outputs of the first (embedding) GP are fed as inputs to the second (regression) GP, making it a deep Gaussian process with serial topology similar to those originally discussed in [2].
Our modeling approach may be thought of as a type of multi-task learning in that one doing regression traditionally segregates datasets along differences in general features, building a separate model for each dataset using the remaining real-valued features; by contrast, here we build in correlations across general features within a single model, simultaneously improving the predictive power across all of the input space.
Remark: for systems with multi-dimensional general inputs (e.g. operator IDs and machine serial numbers), it is straightforward to extend our approach to embed each general input dimension separately into its own latent space; these latents are then all concatenated with each other to define the total latent variable, which is then concatenated with the real-valued inputs as before.
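The embed-and-concatenate construction can be sketched as follows (a hypothetical sketch of the BEGP input layer; `q_means` and `q_stds` stand in for the per-task variational parameters introduced in Sec. 2.1.3, and labels that were never observed fall back to the prior, which is what enables zero-shot prediction):

```python
import numpy as np

rng = np.random.default_rng(0)

def embed_and_concat(general_inputs, x_real, q_means, q_stds,
                     prior_std=1.0, latent_dim=2):
    """Sample one latent embedding per general input and concatenate it with
    the real-valued inputs (hypothetical sketch of the BEGP input layer).

    general_inputs : length-n list of hashable task labels
    x_real         : (n, d) array of real-valued inputs
    q_means, q_stds: dicts mapping seen labels to (latent_dim,) arrays of
                     variational parameters; labels absent from the dicts
                     fall back to the N(0, prior_std^2) prior.
    """
    latents = []
    for s in general_inputs:
        mu = q_means.get(s, np.zeros(latent_dim))
        sd = q_stds.get(s, prior_std * np.ones(latent_dim))
        latents.append(mu + sd * rng.standard_normal(latent_dim))
    # (n, latent_dim + d) inputs for the downstream regression GP
    return np.concatenate([np.asarray(latents), np.atleast_2d(x_real)], axis=1)
```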
2.1.3 Inference for the Bayesian embedding GP model
Having defined our model, we now discuss how to perform inference on it. Inference for the Bayesian EGP model presents two challenges. First, inferring the posterior of the embedding layer is difficult due to its nonparametric nature stemming from the white noise kernel, combined with the fact that its posterior will generally be non-Gaussian due to the nonlinearity of the downstream model that operates on it. Second, unlike in a traditional Gaussian process regression model, it is analytically intractable to marginalize out the latent variables. In this section, we discuss simple strategies for overcoming these two challenges so that we may devise an objective function for training our model.
First, we consider the challenges associated with the embedding layer. Let $S$ denote the number of distinct general inputs that we observe (i.e. the number of legacy datasets that we would like to incorporate in our model). Due to the white noise kernel, the joint prior density over the latents $Z$ associated with the $S$ general inputs is

(12) $p(Z) = \prod_{s=1}^{S} \prod_{j=1}^{d_z} \mathcal{N}\left(z_{s,j} \mid 0, \sigma_w^2\right),$

where $z_{s,j}$ is the $j$th dimension associated with the latent for input $s$. For reasons analogous to those mentioned in previous literature [6, 2, 7], the posterior over these latents is generally non-Gaussian. We define a mean-field variational posterior to approximate the true posterior with the form

(13) $q(Z) = \prod_{s=1}^{S} \prod_{j=1}^{d_z} \mathcal{N}\left(z_{s,j} \mid \mu_{s,j}, \sigma_{s,j}^2\right),$

where $\{\mu_{s,j}, \sigma_{s,j}^2\}$ are variational parameters. Fortunately, one is usually only interested in a relatively small number of general inputs, so it is not a challenge to track these variational parameters. In fact, the computational challenges associated with the data scalability of Gaussian process models are likely to become problematic well before challenges associated with working with a high number of tasks, even though recent advances have made significant progress in lifting traditional barriers to Gaussian process modeling in large-data regimes [8, 9, 10, 11, 7, 12].
Lastly, note that if no data have been observed for some held-out general input, then, by the nature of the white noise kernel, the posterior over the associated latent is equal to the prior. This observation is critical for enabling our model to make credible predictions in a zero-shot setting, where no data about the task of interest are available at the outset of optimization. Indeed, we will see that this probabilistic approach enables our model to make surprisingly good inferences to guide the first iteration of Bayesian optimization in such settings.
Having resolved the challenge of representing the posterior of the embedding layer, we now turn our attention to the second challenge: the intractability of the model's marginal log-likelihood (or evidence), which usually serves as the objective function for training probabilistic models. Again following a variational strategy, we derive a tractable evidence lower bound (ELBO) that can be used in place of the exact marginal log-likelihood. While this approach was first demonstrated for Gaussian processes using a sparse approach based on inducing variables, later advances in variational inference [13] allow us to avoid relying on a sparse GP formulation. That said, we recognize that such an approach may realistically be helpful in the plausible case where an abundance of legacy data takes the problem into a large-data setting; this is in contrast to the usual assumption of data scarcity in engineering metamodeling, which tends to focus myopically on solving a single task in isolation. For the sake of completeness, we now provide the derivation of the ELBO for the BEGP model. Based on the probabilistic graphical model shown in Fig. 1, the joint probability of our model with observations $\mathbf{y}$ is
(14) $p(\mathbf{y}, \mathbf{f}, Z \mid X) = p(\mathbf{y} \mid \mathbf{f})\, p(\mathbf{f} \mid Z, X)\, p(Z),$

where $p(Z)$ is given by Eq. (12), $p(\mathbf{f} \mid Z, X)$ is analogous to Eq. (2), and $p(\mathbf{y} \mid \mathbf{f})$ is the likelihood of Eq. (3). The marginal log-likelihood is obtained by simply integrating over the latent variables and taking the logarithm of the result:

(15) $\log p(\mathbf{y} \mid X) = \log \int p(\mathbf{y} \mid \mathbf{f})\, p(\mathbf{f} \mid Z, X)\, p(Z)\, \mathrm{d}\mathbf{f}\, \mathrm{d}Z.$
From this point onwards, we will omit the conditioning on the inputs for brevity. While we can integrate out $\mathbf{f}$ due to the conjugacy of the likelihood, $Z$ cannot be integrated out analytically. We multiply and divide the integrand by $q(Z)$ to obtain

(16) $\log p(\mathbf{y}) = \log \int q(Z)\, \frac{p(\mathbf{y} \mid Z)\, p(Z)}{q(Z)}\, \mathrm{d}Z.$

Applying Jensen's inequality, we move the logarithm inside the integral to get

(17) $\log p(\mathbf{y}) \geq \int q(Z) \log \frac{p(\mathbf{y} \mid Z)\, p(Z)}{q(Z)}\, \mathrm{d}Z.$

Rearranging and using the definition of the Kullback-Leibler (KL) divergence gives

(18) $\log p(\mathbf{y}) \geq \mathbb{E}_{q(Z)}\left[\log p(\mathbf{y} \mid Z)\right] - \mathrm{KL}\left(q(Z)\, \|\, p(Z)\right).$
Additional progress may be made at this point by restricting our choice of kernel function for the second GP layer to either an exponentiated quadratic or a linear kernel, since the kernel expectations that form the crux of the challenge may then be evaluated analytically. Furthermore, under the special case where the same real-valued input points are sampled for each general input, Atkinson and Zabaras [14, 15] showed that the kernel expectations can be further broken down into a Kronecker product between a deterministic kernel matrix and a kernel expectation, enabling extensions to modeling millions of data. In this work, we instead opt to estimate Eq. (18) through sampling so as to maintain the ability to leverage arbitrary kernel functions.

The model is trained by maximizing Eq. (18) with respect to the parameters of the variational posterior as well as the kernel hyperparameters and likelihood variance through gradient-based methods. We find that it is typically sufficient to approximate the integral with a single sample from $q(Z)$ at each iteration of the optimization.

Note that there is a redundancy in defining the scale of the latent space: our model possesses both a parameter $\sigma_w^2$ for the prior variance of the embedding layer and length scales for the parametric kernel of the GP regression layer. Given this freedom, one can make the ELBO arbitrarily high by bringing $\sigma_w$ towards zero while scaling the variational parameters of $q(Z)$ and the length scales of the regression kernel correspondingly, without changing the predictions of the model. To avoid this pathology and establish the scale of the latent space, we fix the prior variance of the embedding layer to a constant.
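To make the training objective concrete, a single-sample estimate of the bound in Eq. (18) for one draw $Z \sim q(Z)$ can be sketched as follows (illustrative only, assuming a zero mean function; in practice the kernel matrix is a function of the sampled latents and gradients flow through it):

```python
import numpy as np

def kl_diag_gaussians(mu_q, sd_q, prior_std=1.0):
    """KL( N(mu_q, diag(sd_q^2)) || N(0, prior_std^2 I) ), summed over dims."""
    var_q, var_p = sd_q ** 2, prior_std ** 2
    return 0.5 * np.sum(var_q / var_p + mu_q ** 2 / var_p - 1.0
                        + np.log(var_p / var_q))

def elbo_single_sample(y, K_sample, noise_var, mu_q, sd_q, prior_std=1.0):
    """One-sample estimate of the bound in Eq. (18).

    K_sample is the kernel matrix evaluated on the concatenated inputs
    [z, x] for ONE sample z ~ q(Z); the likelihood term is then the usual
    GP log marginal likelihood, cf. Eq. (4), at that sample.
    """
    n = len(y)
    L = np.linalg.cholesky(K_sample + noise_var * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    log_lik = (-0.5 * y @ alpha - np.sum(np.log(np.diag(L)))
               - 0.5 * n * np.log(2.0 * np.pi))
    return log_lik - kl_diag_gaussians(mu_q, sd_q, prior_std)
```

Maximizing this quantity with a stochastic gradient optimizer, one fresh sample per step, corresponds to the single-sample training scheme described above.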
We choose to learn point estimates of the other model parameters since they are informed by all of the tasks’ data. Therefore, the uncertainty contributed to the predictions due to uncertainty in these parameters will generally be small. This is again only because we are incorporating a large amount of data into our model by leveraging legacy data; in the typical singletask engineering setting, we might expect to have far fewer data, and Bayesian inference over the parameters may provide value to the model.
A remark on variational inference for EGP
A reasonable alternative to our variational inference approach is to perform inference using Monte Carlo. We favor VI because we need to update the model posteriors repeatedly during Bayesian optimization as new data are added to the model. This is simple to do with VI, since we can continue our gradient-based optimization from the previous variational posterior. By contrast, Monte Carlo requires us to run a chain for almost as long as the initial chain every time we need to update the model. Thus, while there is no clear advantage of one method over the other when initially training the model on legacy data, VI quickly becomes more attractive in subsequent steps. Furthermore, we note that updating the model parameters is rather important since we focus on extreme low-data regimes (fewer than 10 data from the task of interest), where each new datum will generally have a considerable effect on the posterior.
A potential reason to instead favor MC inference is that highly-factorized variational posterior distributions are known to underestimate the entropy of the true posterior when there is a mismatch between the variational form and the true posterior [16]. One might reasonably worry that we are neglecting important uncertainty information by using a mean-field posterior for $q(Z)$. We find in practice that this approximation successfully captures much of the uncertainty and is sufficient to provide competitive or superior predictions compared to the baselines considered in our examples. Moreover, our method is compatible with more powerful VI methods such as multivariate Gaussians with full covariance matrices [15] or normalizing flow-based approaches [17].
2.1.4 Predictions with the model
Given a trained model, we are now interested in making predictions at held-out inputs. Consider a set of test inputs comprising distinct general inputs and distinct real-valued inputs. We first apply the embedding layer to obtain a sample of the test latents as well as the training data's latents; these latents are then concatenated with the corresponding real-valued inputs to form the full training and test inputs for the regression model. Given these samples, we can compute the conditional predictive distributions over the latent and observed outputs, given by Eqs. (5) and (8), respectively.
Due to the nonlinearity of the GP model, marginalizing the latent variables over the variational posterior $q(Z)$ generally induces a non-Gaussian predictive distribution. However, given a Gaussian variational posterior, the moments of the marginal predictive distribution admit analytic expressions; previous works [18, 6] have approximated the predictive distribution as a Gaussian with these statistics. Alternatively, one may sample the predictive distribution in order to better resolve its details. We utilize the former approach in our examples.

2.2 Bayesian optimization
Having discussed the formulation of our model, including training and predictions, we now turn our attention to using it in the context of optimization. The main algorithm for Bayesian optimization is given in Algorithm 1. We restrict ourselves to the task of optimizing over the real-valued inputs subject to a fixed general input, though our method permits searching over multiple general inputs with trivial modification.
2.2.1 Acquisition functions
One ingredient that must be specified for BO is the acquisition function $a(\mathbf{x})$, which is used to steer the selection of points at which one conducts experiments to evaluate $f$. In this work, for systems in which we can evaluate $f$ at any location in $\mathcal{X}$ for the current task, we consider the (negative) expected improvement

(19) $a_{\mathrm{EI}}(\mathbf{x}) = -\int \left(y^* - y\right) \Theta\left(y^* - y\right) p(y \mid \mathbf{x}, \mathcal{D})\, \mathrm{d}y,$

where $\Theta(\cdot)$ is the Heaviside theta function and $y^*$ is the value of the current best design. When $p(y \mid \mathbf{x}, \mathcal{D})$ has a known form such as a Gaussian, evaluation of Eq. (19) can be done cheaply. Notice that, as a matter of convention, we have defined $a_{\mathrm{EI}}$ in Eq. (19) such that lower values are better.
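Under a Gaussian predictive distribution, Eq. (19) has the familiar closed form, which we sketch here for a scalar prediction (a sketch under the paper's minimization convention; names are ours):

```python
from math import erf, exp, pi, sqrt

def neg_expected_improvement(mu, sigma, y_best):
    """Negative expected improvement, cf. Eq. (19), for a Gaussian predictive
    p(y|x) = N(mu, sigma^2) under a minimization convention: the improvement
    is max(y_best - y, 0), and lower acquisition values are better."""
    sigma = max(sigma, 1e-12)  # guard against a degenerate predictive std
    z = (y_best - mu) / sigma
    pdf = exp(-0.5 * z * z) / sqrt(2.0 * pi)        # standard normal pdf
    cdf = 0.5 * (1.0 + erf(z / sqrt(2.0)))          # standard normal cdf
    return -((y_best - mu) * cdf + sigma * pdf)
```

As a sanity check, a candidate whose predictive mean sits far below the incumbent gets an acquisition value near the (negative) expected gap, while a clearly worse candidate gets a value near zero.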
Finally, the traditional approach to maximizing the acquisition function has been to simply select a large random sampling of $\mathcal{X}$, evaluate $a$ on all of the samples (which is generally cheap), and then select the best sample. However, following the widespread adoption of computational frameworks supporting automatic differentiation such as TensorFlow [19] and PyTorch [20], it has become customary to accelerate the inner-loop optimization using gradient-based methods by simply backpropagating the acquisition function through the model to obtain its gradient with respect to the inputs [21]. Other recent work has investigated this matter more broadly [22]. Here, we use gradient-based search with restarts to find the next point to select, similarly to [21]. For the case where one must estimate $a$ through sampling (e.g. due to estimating the predictive posterior by sampling latents from $q(Z)$), stochastic backpropagation [23] can be used to deal with the variance in the estimates of $a$ due to sampling. We also note that this setup may be applied immediately to solve robust optimization problems with no modification by additionally sampling from any uncontrolled inputs.

In our real-world examples, we do not have access to the real-world data-generating process and must instead work with a static, precomputed dataset. In this case, our design space is practically the finite set of points at which experiments have already been carried out, and we can instead directly estimate the negative probability (we take the negative probability to keep with the convention that lower values of $a$ are better) that a given available point will be the best design:
(20) $a(\mathbf{x}_i) = -\Pr\left(y_i \leq y_j\ \forall j\right).$
We approximate Eq. (20) by repeatedly sampling the joint predictive distribution over all available design points and counting the frequency with which each design point is predicted to have the best output. We find that this crude approximation is not overly computationally burdensome and results in good performance in practice as shown by our examples.
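The counting estimator just described can be sketched as follows (an illustrative sketch; `mean` and `cov` are the joint Gaussian predictive over the finite design set):

```python
import numpy as np

def neg_prob_best(mean, cov, n_samples=4000, rng=None):
    """Monte Carlo estimate of the negative probability, cf. Eq. (20), that
    each candidate in a finite design set yields the lowest output: sample
    the joint predictive N(mean, cov) and count arg-min frequencies."""
    if rng is None:
        rng = np.random.default_rng(0)
    samples = rng.multivariate_normal(mean, cov, size=n_samples)
    winners = np.argmin(samples, axis=1)
    counts = np.bincount(winners, minlength=len(mean))
    return -counts / n_samples
```

Because the joint predictive is sampled, correlations between design points (induced here by the shared GP posterior) are accounted for, unlike a pointwise criterion.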
3 Related work
McMillan et al. [24] study Gaussian process models based on “qualitative” and “quantitative” variables. This approach would be analogous to a deterministic version of our Bayesian embedding layer. As we show in our examples, the incorporation of Bayesian uncertainty estimates provides highly valuable information that is essential to making the predictive distribution of the model credible.
The Gaussian process latent variable model [25] is a seminal work for inferring latent variables by using the marginal loglikelihood of a Gaussian process as training signal. The later Bayesian extension by Titsias and Lawrence [6] was found to significantly improve the quality of the model and gives compelling evidence in favor of modeling latent variables in a Bayesian manner.
The task of inferring inputoutput relationships where the inputs are only partiallyspecified was first identified by Damianou and Lawrence as “semidescribed learning” [26]. Our problem statement may be regarded as similar in that we choose to associate the general features in our data with latent variables that are unspecified a priori.
Using related systems to improve knowledge about a related system of interest has a long history in multifidelity modeling [27, 28], where one traditionally considers a sequence of systems that constitute successively more accurate (and expensive) representations of a true physical process. However, it is unclear how to define a “sequence” of legacy datasets that are all “on equal footing” (such as different operators or machines performing the same task), particularly as the number of such instances becomes large.
Our model is similar to the multi-task Gaussian process introduced in [29]. The approaches are similar in that both learn a kernel over tasks; however, [29] require that the kernel matrix decompose as a Kronecker product between a task-wise covariance matrix and the input kernel matrix. The Kronecker product kernel matrix can be equivalent to a kernel matrix evaluated on our augmented inputs in the special case that one uses either a linear or a Gaussian kernel, but more general kernels cannot be composed in this way. By finding representations of tasks in a Euclidean input space, we lift this restriction, allowing us to use arbitrary kernels, with the Kronecker product decomposition of [29] as a special case. Additionally, by posing a prior over the latent input space, our method allows us to make principled zero-shot predictions on unseen inputs; it is significantly more challenging to formulate a reasonable prior on the kernel matrix itself. Finally, we can inject additional modeling bias into our model by selecting the dimensionality of our latent space. While a low-rank approximation to the task covariance matrix might be derived to obtain similar savings, it again seems more natural in our opinion to exercise this ability in a latent space.
The task of utilizing “legacy” data in order to improve the predictive capability on a related system of interest was explored in [30]. However, the approach used in that work can only utilize legacy models through linear combinations, making extrapolation on the system of interest difficult, particularly when few (or no) data are available from the task of interest. Additionally, their method requires a crossvalidation scheme for model selection; by contrast, the probabilistic representation of tasks in the current method enables the use of Bayesian inference to guide model selection in the sense of identifying relevance between tasks.
A recent work by Zhang et al. [31] also studies Bayesian optimization in the context of qualitative and quantitative inputs. However, they learn point embeddings via maximum likelihood estimation rather than inferring a posterior through Bayesian inference; this makes the overall model prone to overfitting in practice and generally unsuitable for providing credible uncertainty estimates. This also makes their model difficult to apply in fewshot settings, and impossible to apply in a principled way for zeroshot learning. A related work by Iyer et al. [32] identifies a pathology associated with the point estimate inference strategy and proposes an ad hoc workaround; a natural outcome of our Bayesian embedding approach is that this challenge vanishes.
The BEGP model can be thought of as a latent variable model [33, 25, 6] and has similarities with previous latent variable GPs [34, 14, 15], though those works additionally impose structure over the latent variables as well as (potentially) the input training data. However, neither of those works was applied within the context of Bayesian optimization, nor did they identify that multiple general feature sets could be decomposed as we have chosen to do.
4 Examples
In this section, we demonstrate our method on a variety of synthetic and real-world examples. The embedding GP model is implemented using gptorch (https://github.com/cicsnd/gptorch), a Gaussian process framework built on PyTorch [20]. Source code implementing the model and the synthetic experiments described below can be found on GitHub (https://github.com/sdatkinson/BEBO). For each system, we consider both a regression and an optimization problem using our embedding GP model with both probabilistic and deterministic embedding layers; the latter is achieved by replacing the posterior of Eq. (13) with a delta function at its mode. We use an RBF kernel as in Eq. (9) and a constant mean function whose value is determined via an MLE point estimate.
We compare our results against a vanilla Gaussian process metamodel that is unable to directly leverage the legacy data as a baseline. Since GP models typically fare very poorly in the extreme low-data regime that we are interested in, we conduct full Bayesian inference over the mean and kernel function parameters, using as prior a factorized Gaussian over the parameters (or over the logarithm of parameters constrained to be positive) whose mean and variance are estimated by fitting GPs to each legacy task and using the statistics of the point estimates of the model parameters found therein. Inference for the Bayesian GP is carried out using Hamiltonian Monte Carlo [35] with the NUTS sampler [36] as implemented in Pyro [37].
4.1 Systems
We begin by describing the systems that we consider. Each system contains a set of tasks that we will jointly learn.
4.1.1 Toy system
We first consider a system with a one-dimensional input. The set of response surfaces is given by
(21) 
Different response surfaces are selected by sampling the task parameters. We consider a problem setting where a number of legacy tasks are available, each with data whose inputs are sampled uniformly over the input domain. In the regression setting, we randomly sample a number of data from one response surface as the current task and split it into a training and a test set. We quantify our prediction accuracy in terms of mean negative log probability (MNLP), mean absolute error (MAE), and root mean square error (RMSE). In the optimization setting, we perform Bayesian optimization over the current task and track the best design point as a function of the number of evaluations on the current task. We repeat both experiments with different random seeds to quantify the performance statistics of each metamodeling approach.
4.1.2 Forrester function
We also consider a generalization of the Forrester function [28] to a many-task setting:

(22) $f(x; a, b, c) = a\, f_{\mathrm{hi}}(x) + b\left(x - \tfrac{1}{2}\right) + c,$

where

(23) $f_{\mathrm{hi}}(x) = (6x - 2)^2 \sin(12x - 4).$

Under the parameterization of Eq. (22), the original "high-fidelity" function considered in [28] is obtained when $(a, b, c) = (1, 0, 0)$, and the low-fidelity function corresponds to $(a, b, c) = (0.5, 10, -5)$. We always include the low-fidelity function in the legacy tasks along with other tasks generated by sampling $(a, b, c)$. Note that this places the high-fidelity task at an extremum of the system parameter space, meaning that multi-task models will not necessarily benefit by assuming a priori that the held-out task lives at the center of latent space (assuming that their learned latent representation resembles that of the parameterization we have chosen). We consider the design space $x \in [0, 1]$. Each of the 5 legacy tasks includes 5 data where the input was sampled uniformly in $[0, 1]$. We consider a regression and an optimization experiment with setups identical to those for the toy system in Sec. 4.1.1.
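For reference, the task family can be coded as follows (a sketch; the `(a, b, c)` parameterization is our reading of the standard multi-fidelity Forrester construction and may differ in detail from the paper's Eq. (22)):

```python
import numpy as np

def forrester_family(x, a=1.0, b=0.0, c=0.0):
    """Forrester-style task family:
        f(x; a, b, c) = a * (6x - 2)^2 * sin(12x - 4) + b * (x - 0.5) + c
    (a, b, c) = (1, 0, 0) gives the standard high-fidelity function and
    (a, b, c) = (0.5, 10, -5) the standard low-fidelity variant."""
    x = np.asarray(x, dtype=float)
    return a * (6.0 * x - 2.0) ** 2 * np.sin(12.0 * x - 4.0) \
        + b * (x - 0.5) + c
```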
4.1.3 Pump data
Our first real-world dataset comprises data in which the performance of a pump is quantified in terms of seven design variables. The data span 6 pump families, each of which contributes a limited number of data. We repeat our experiment 6 times, each time treating a different pump family as the "current" task and the other 5 families as legacy task data. In the regression setting, we split the current-task data into a train and test set, training our metamodel on all of the legacy data as well as whatever data are available from the current task, then quantify our predictions on the held-out test set of the current task in terms of MNLP, MAE, and RMSE. Each experiment is repeated 10 times with different random initializations and train-test splits. In the optimization setting, we begin with no evaluations from the current task and carry out Bayesian optimization using the available data as a (finite) design set. Data are selected at each iteration according to the acquisition function of Eq. (20). We track the best evaluation (relative to the per-task best design) as a function of the number of evaluations on the current task.
4.1.4 Additive manufacturing data
We consider a dataset in which the performance of an additive manufacturing process is quantified in terms of four design variables. We have data from different chemistries, each of which has data points. We conduct experiments on each of the chemistries in a manner analogous to the experiments on the pump data.
4.2 Regression results
Figure 2 shows regression results for the synthetic systems described in Sections 4.1.1 and 4.1.2. We notice that the embedding GPs typically outperform the Bayesian GP, though there is overlap in their performance statistics over repeats. Curiously, we notice that the deterministic embedding GP occasionally outperforms the Bayesian embedding in terms of RMSE and MAE, but fails catastrophically in terms of MNLP, where the overconfidence of its predictions is severely penalized.
Figure 3 shows our results for the pump dataset. For this problem, we observe mixed results, with the embedding GPs outperforming the baseline on some of the tasks but underperforming on others. We hypothesize that this is because certain tasks are substantially different from the others, causing the prior information from legacy tasks to be unhelpful. However, on certain tasks (3 and 4), we see that the Bayesian embedding GP outperforms the baseline even in a zero-shot setting. We also notice that the deterministic embedding GP consistently gives rather poor uncertainty estimates, as evidenced by its MNLP scores. Thus, we see a clear benefit to the Bayesian embedding approach for improving the credibility of the predictions.
Figure 4 shows our results for the additive dataset for the first few held-out tasks. In contrast to the pump problem, we often see considerable improvements from using the embedding models. For many of the held-out tasks, the zero-shot predictions from the embedding models are, on average, superior to those of the Bayesian GP even after it has seen 10 examples from the current task. On others (e.g. RMSE, MAE for Task 2 in Fig. 4), the Bayesian GP does somewhat outperform the embedding models, though the embedding models continue to improve. This may be because the prior knowledge built up from the legacy tasks is unhelpful for the held-out task; the embedding model must learn to "overturn" this prior knowledge by separating the embedding of the held-out task from the unrelated legacy tasks, which requires observations to gradually revise the belief via Bayesian updating.
Additionally, we see again that the deterministic embedding consistently results in very poor performance in terms of MNLP, frequently performing so much worse that it is impractical to show it on the same axes as the Bayesian GP and Bayesian EGP models. Figure 5 shows the results when zoomed out to view the performance of the deterministic EGP for one task. We see a similar deficiency in all of the other held-out tasks, firmly underscoring that it is critical to account for uncertainty in the embedding in order to obtain credible predictions. This makes sense, given the high number of tasks we must embed relative to the number of data available to elucidate their input-output characteristics. This situation is not uncommon in engineering design, where one typically has a limited computational budget to explore any given design problem, but may have access to a large archive of legacy tasks.
4.3 Optimization results
Figure 6 shows the best design discovered for the synthetic systems as a function of the number of evaluations on the current task. We report the median performance over splits as well as the 80% coverage interval (10th to 90th percentile). We see that the legacy task data enable the embedding GPs to consistently find near-optimal designs with either a single evaluation (Forrester) or no data at all from the current task (synthetic). By contrast, the GP usually requires a handful of evaluations before it begins to find good solutions.
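The best-so-far curves and coverage intervals in Fig. 6 can be summarized as in the following sketch (a hypothetical `running_best_summary` helper assuming a minimization convention):

```python
import numpy as np

def running_best_summary(histories):
    """Summarize repeated BO runs. `histories` is an (n_repeats, n_evals)
    array of objective values in evaluation order (minimization assumed).
    Returns the median running best across repeats along with its
    10th/90th percentiles, i.e. an 80% coverage interval."""
    running_best = np.minimum.accumulate(histories, axis=1)
    median = np.median(running_best, axis=0)
    lo, hi = np.percentile(running_best, [10.0, 90.0], axis=0)
    return median, lo, hi
```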
Figures 7 and 8 show the zero- and one-shot posteriors of all three methods on the toy and Forrester systems, respectively. We see that the legacy task data equip the embedding GPs with helpful priors that aid them in identifying the response surface. By contrast, the Bayesian GP guesses randomly in the zero-shot setting and has minimal insight for its one-shot predictions. We also notice that while the deterministic embedding model happens to pick good designs, its model of the response surface is highly overconfident. By contrast, the Bayesian embedding endows our models with well-calibrated uncertainty estimates that allow us to find good designs while retaining a credible metamodel. Thus, while the EGP performs comparably to the BEGP, we must be cautious about generalizing these results. We also point out that while the BGP seems to have picked a very good first point in Fig. 7, this selection is pure luck, since the BGP's predictive distribution with zero training data (i.e., the prior) is flat in its inputs. Indeed, it picks the same first point for the Forrester function in Fig. 8, which is not nearly so close to optimal.
Figure 9 shows the running best design for the pump and additive systems as a function of the number of current task evaluations. Since each legacy system has a different optimum, results show the performance relative to the best design for the current task. Again, we see that the embedding GPs vastly outperform the vanilla GP approach, usually finding good designs on their first attempts. However, here we see that the Bayesian embedding gives better results than a deterministic embedding; this can be directly attributed to our earlier observation that the deterministic embedding tends to overfit, particularly when there are very few data to inform the embedding. By contrast, the Bayesian embedding safely guards against overfitting and provides robust performance.
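The optimization loop over a finite design pool can be sketched as follows; `fit_model` and `acquisition` are placeholders standing in for the paper's embedding-GP metamodel and the acquisition function of Eq. (20), not their actual implementations:

```python
import numpy as np

def discrete_bayes_opt(candidates, objective, fit_model, acquisition, n_iter):
    """Bayesian optimization over a finite design set: fit the metamodel
    to the data gathered so far, score every untried candidate with the
    acquisition function, and evaluate the top-scoring design.
    In the zero-shot setting (no data yet), `model` is None and the
    acquisition must rely on prior/legacy information alone."""
    X, y = [], []
    remaining = list(range(len(candidates)))
    for _ in range(n_iter):
        model = fit_model(X, y) if X else None  # no model before any data
        scores = [acquisition(model, candidates[i]) for i in remaining]
        pick = remaining.pop(int(np.argmax(scores)))
        X.append(candidates[pick])
        y.append(objective(candidates[pick]))
    return X, y
```

For example, with a trivial acquisition that ignores the model and prefers small inputs, the loop simply evaluates the candidates in ascending order.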
5 Conclusion
We have introduced a method for Bayesian optimization applicable to settings where one has access to a variety of related datasets but does not know how they relate to each other a priori. The Bayesian embedding GP automatically learns latent representations that enable a single metamodel to share information across multiple tasks, thereby improving predictive performance, particularly in novel settings where limited data are available. We observe that our method enables Bayesian optimization to find good solutions more quickly and reliably than traditional methods.
We find that variational inference with rudimentary variational approximations (factorized Gaussians) over the latent variables provides uncertainty information sufficient to obtain credible estimates in practice, and is robust with respect to the number of latent dimensions. Our experiments show that this Bayesian embedding is critical for obtaining credible uncertainty quantification.
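A factorized Gaussian over a task embedding is simple to implement; the minimal sketch below (a hypothetical `FactorizedGaussian` class, not the paper's code) shows the two ingredients used in stochastic variational inference: reparameterized sampling, so that gradients can flow through draws, and the closed-form KL regularizer that appears in the ELBO:

```python
import numpy as np

rng = np.random.default_rng(0)

class FactorizedGaussian:
    """Mean-field posterior q(z) = N(mu, diag(sigma^2)) over a latent
    task embedding of dimension `dim`."""
    def __init__(self, dim):
        self.mu = np.zeros(dim)
        self.log_sigma = np.zeros(dim)  # parameterize sigma > 0 via its log

    def sample(self):
        # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I).
        eps = rng.standard_normal(self.mu.shape)
        return self.mu + np.exp(self.log_sigma) * eps

    def kl_to_standard_normal(self):
        # Closed-form KL(q || N(0, I)), the regularizer in the ELBO.
        var = np.exp(2.0 * self.log_sigma)
        return 0.5 * np.sum(var + self.mu ** 2 - 1.0 - 2.0 * self.log_sigma)
```

At initialization (mu = 0, sigma = 1) the KL term is exactly zero; training trades it off against the expected log-likelihood.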
There are a number of ways in which our work might be extended. For one, our method might be applied to optimization settings with unknown constraints. In such cases, one might simultaneously model the probability of violating a constraint with a multi-task model. Modeling binary outputs such as constraint violation with GPs requires non-conjugate inference, which can be enabled using techniques from [38]. Second, our approach is straightforward to apply to problems of robust design. In fact, we have already shown how uncertainty in our metamodel can be propagated efficiently during design selection through stochastic backpropagation, eliminating the need for expensive double-loop optimization. Other approaches [39, 40] exploit analytic properties to solve the inner-loop optimization, though our work suggests that it might be approximately solved in general using stochastic optimization. Finally, our model learns a single embedding for each general input (task) as a latent vector. We would like to explore the effect of a hierarchical latent representation in which the embedding of each task is allowed to vary with . Such a conditional multi-task model might enable one to exploit partial similarities between tasks in certain regions of input space while still providing for the possibility that their trends might differ elsewhere. While this was studied in [30], we expect that our Bayesian latent representation might improve performance in few-shot settings such as those considered in this work.
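The single-loop treatment of robust design mentioned above amounts to replacing the inner expectation with a Monte Carlo estimate. A minimal sketch, assuming Gaussian input perturbations and a hypothetical `robust_objective` helper (not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def robust_objective(f, x, noise_std, n_samples=256):
    """Monte Carlo estimate of E[f(x + xi)] with xi ~ N(0, noise_std^2 I):
    a single-loop surrogate for the inner expectation in robust design,
    avoiding a nested (double-loop) optimization. With a differentiable
    surrogate f, the same draws support stochastic backpropagation."""
    xi = noise_std * rng.standard_normal((n_samples,) + np.shape(x))
    return float(np.mean([f(x + e) for e in xi]))
```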
Funding Sources
Funding for this work was provided internally by GE Research.
References
 Argyriou et al. [2007] Argyriou, A., Evgeniou, T., and Pontil, M., "Multi-task feature learning," Advances in Neural Information Processing Systems, 2007, pp. 41–48.
 Damianou and Lawrence [2013] Damianou, A., and Lawrence, N., "Deep Gaussian processes," Artificial Intelligence and Statistics, 2013, pp. 207–215.
 Rasmussen and Williams [2006] Rasmussen, C. E., and Williams, C. K., Gaussian Processes for Machine Learning, Vol. 2, MIT Press, Cambridge, MA, 2006.
 Bilionis et al. [2013] Bilionis, I., Zabaras, N., Konomi, B. A., and Lin, G., "Multi-output separable Gaussian process: Towards an efficient, fully Bayesian paradigm for uncertainty quantification," Journal of Computational Physics, Vol. 241, 2013, pp. 212–239.
 Bilionis and Zabaras [2012] Bilionis, I., and Zabaras, N., "Multi-output local Gaussian process regression: Applications to uncertainty quantification," Journal of Computational Physics, Vol. 231, No. 17, 2012, pp. 5718–5746.
 Titsias and Lawrence [2010] Titsias, M., and Lawrence, N. D., “Bayesian Gaussian process latent variable model,” Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010, pp. 844–851.
 Salimbeni and Deisenroth [2017] Salimbeni, H., and Deisenroth, M., “Doubly stochastic variational inference for deep Gaussian processes,” Advances in Neural Information Processing Systems, 2017, pp. 4588–4599.
 Snelson and Ghahramani [2006] Snelson, E., and Ghahramani, Z., "Sparse Gaussian processes using pseudo-inputs," Advances in Neural Information Processing Systems, 2006, pp. 1257–1264.
 Titsias [2009] Titsias, M., “Variational learning of inducing variables in sparse Gaussian processes,” Artificial Intelligence and Statistics, 2009, pp. 567–574.
 Hensman et al. [2013] Hensman, J., Fusi, N., and Lawrence, N. D., "Gaussian processes for big data," arXiv preprint arXiv:1309.6835, 2013.
 Wilson and Nickisch [2015] Wilson, A., and Nickisch, H., "Kernel interpolation for scalable structured Gaussian processes (KISS-GP)," International Conference on Machine Learning, 2015, pp. 1775–1784.
 Wang et al. [2019] Wang, K. A., Pleiss, G., Gardner, J. R., Tyree, S., Weinberger, K. Q., and Wilson, A. G., "Exact Gaussian processes on a million data points," arXiv preprint arXiv:1903.08114, 2019.
 Ranganath et al. [2014] Ranganath, R., Gerrish, S., and Blei, D., “Black box variational inference,” Artificial Intelligence and Statistics, 2014, pp. 814–822.
 Atkinson and Zabaras [2018] Atkinson, S., and Zabaras, N., “Structured Bayesian Gaussian process latent variable model,” arXiv preprint arXiv:1805.08665, 2018.
 Atkinson and Zabaras [2019] Atkinson, S., and Zabaras, N., “Structured Bayesian Gaussian process latent variable model: Applications to datadriven dimensionality reduction and highdimensional inversion,” Journal of Computational Physics, Vol. 383, 2019, pp. 166–195.
 Bishop [2006] Bishop, C. M., Pattern Recognition and Machine Learning, Springer Science+ Business Media, 2006.
 Kingma et al. [2016] Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., and Welling, M., “Improved variational inference with inverse autoregressive flow,” Advances in Neural Information Processing Systems, 2016, pp. 4743–4751.
 Girard et al. [2003] Girard, A., Rasmussen, C. E., Candela, J. Q., and Murray-Smith, R., "Gaussian process priors with uncertain inputs: Application to multiple-step ahead time series forecasting," Advances in Neural Information Processing Systems, 2003, pp. 545–552.
 Abadi et al. [2016] Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al., “Tensorflow: A system for largescale machine learning,” 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016, pp. 265–283.
 Paszke et al. [2017] Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A., “Automatic differentiation in pytorch,” 2017.
 Snoek et al. [2012] Snoek, J., Larochelle, H., and Adams, R. P., “Practical Bayesian optimization of machine learning algorithms,” Advances in Neural Information Processing Systems, 2012, pp. 2951–2959.
 Zhang et al. [2019a] Zhang, Y., Kristensen, J., Ghosh, S., Vandeputte, T., Tallman, J., and Wang, L., “Finding Maximum Expected Improvement for HighDimensional Design Optimization,” AIAA Aviation 2019 Forum, 2019a, p. 2985.
 Rezende et al. [2014] Rezende, D. J., Mohamed, S., and Wierstra, D., “Stochastic backpropagation and approximate inference in deep generative models,” arXiv preprint arXiv:1401.4082, 2014.
 McMillan et al. [1999] McMillan, N. J., Sacks, J., Welch, W. J., and Gao, F., "Analysis of protein activity data by Gaussian stochastic process models," Journal of Biopharmaceutical Statistics, Vol. 9, No. 1, 1999, pp. 145–160.
 Lawrence [2004] Lawrence, N. D., "Gaussian process latent variable models for visualisation of high dimensional data," Advances in Neural Information Processing Systems, 2004, pp. 329–336.
 Damianou and Lawrence [2015] Damianou, A., and Lawrence, N. D., "Semi-described and semi-supervised learning with Gaussian processes," arXiv preprint arXiv:1509.01168, 2015.
 Kennedy and O’Hagan [2000] Kennedy, M. C., and O’Hagan, A., “Predicting the output from a complex computer code when fast approximations are available,” Biometrika, Vol. 87, No. 1, 2000, pp. 1–13.
 Forrester et al. [2007] Forrester, A. I., Sóbester, A., and Keane, A. J., "Multi-fidelity optimization via surrogate modelling," Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, Vol. 463, No. 2088, 2007, pp. 3251–3269.
 Swersky et al. [2013] Swersky, K., Snoek, J., and Adams, R. P., "Multi-task Bayesian optimization," Advances in Neural Information Processing Systems, 2013, pp. 2004–2012.
 Ghosh et al. [2018] Ghosh, S., Asher, I., Kristensen, J., Ling, Y., Ryan, K., and Wang, L., "Bayesian Multi-Source Modeling with Legacy Data," 2018 AIAA Non-Deterministic Approaches Conference, 2018, p. 1663.
 Zhang et al. [2019b] Zhang, Y., Apley, D., and Chen, W., “Bayesian Optimization for Materials Design with Mixed Quantitative and Qualitative Variables,” arXiv preprint arXiv:1910.01688, 2019b.
 Iyer et al. [2019] Iyer, A., Zhang, Y., Prasad, A., Tao, S., Wang, Y., Schadler, L., Brinson, L. C., and Chen, W., "Data-Centric Mixed-Variable Bayesian Optimization for Materials Design," arXiv preprint arXiv:1907.02577, 2019.
 Kingma and Welling [2013] Kingma, D. P., and Welling, M., "Auto-encoding variational Bayes," arXiv preprint arXiv:1312.6114, 2013.
 Dai et al. [2017] Dai, Z., Alvarez, M., and Lawrence, N., "Efficient modeling of latent information in supervised learning using Gaussian processes," Advances in Neural Information Processing Systems, 2017, pp. 5131–5139.
 Duane et al. [1987] Duane, S., Kennedy, A. D., Pendleton, B. J., and Roweth, D., "Hybrid Monte Carlo," Physics Letters B, Vol. 195, No. 2, 1987, pp. 216–222.
 Hoffman and Gelman [2014] Hoffman, M. D., and Gelman, A., "The No-U-Turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo," Journal of Machine Learning Research, Vol. 15, No. 1, 2014, pp. 1593–1623.
 Bingham et al. [2018] Bingham, E., Chen, J. P., Jankowiak, M., Obermeyer, F., Pradhan, N., Karaletsos, T., Singh, R., Szerlip, P., Horsfall, P., and Goodman, N. D., “Pyro: Deep Universal Probabilistic Programming,” Journal of Machine Learning Research, 2018.
 Hensman et al. [2015] Hensman, J., Matthews, A., and Ghahramani, Z., "Scalable variational Gaussian process classification," Artificial Intelligence and Statistics, 2015.
 Ryan et al. [2018] Ryan, K. M., Kristensen, J., Ling, Y., Ghosh, S., Asher, I., and Wang, L., “A Gaussian Process Modeling Approach for Fast Robust Design With Uncertain Inputs,” ASME Turbo Expo 2018: Turbomachinery Technical Conference and Exposition, American Society of Mechanical Engineers, 2018, pp. V07AT32A012–V07AT32A012.
 Ling et al. [2018] Ling, Y., Ryan, K., Asher, I., Kristensen, J., Ghosh, S., and Wang, L., “Efficient robust design optimization using Gaussian process and intelligent sampling,” 2018 Multidisciplinary Analysis and Optimization Conference, 2018, p. 4175.