ProBO: a Framework for Using Probabilistic Programming in Bayesian Optimization

01/31/2019 ∙ by Willie Neiswanger, et al. ∙ Carnegie Mellon University

Optimizing an expensive-to-query function is a common task in science and engineering, where it is beneficial to keep the number of queries to a minimum. A popular strategy is Bayesian optimization (BO), which leverages probabilistic models for this task. Most BO today uses Gaussian processes (GPs), or a few other surrogate models. However, there is a broad set of Bayesian modeling techniques that we may want to use to capture complex systems and reduce the number of queries. Probabilistic programs (PPs) are modern tools that allow for flexible model composition, incorporation of prior information, and automatic inference. In this paper, we develop ProBO, a framework for BO using only standard operations common to most PPs. This allows a user to drop in an arbitrary PP implementation and use it directly in BO. To do this, we describe black box versions of popular acquisition functions that can be used in our framework automatically, without model-specific derivation, and show how to optimize these functions. We also introduce a model, which we term the Bayesian Product of Experts, that integrates into ProBO and can be used to combine information from multiple models implemented with different PPs. We show empirical results using multiple PP implementations, and compare against standard BO methods.

1 Introduction

The task of optimization is widespread in science and engineering. For example, one may want to find a material composition that most expresses some desired property, or choose control settings for a process that best achieve certain outcomes. Often, each iteration of an optimization procedure, referred to as a query or experiment, is expensive and costs a great deal of resources including money, time, or human effort. To reduce the cost of expensive optimization, it is paramount to develop methods that are data efficient and yield optimal designs in a minimal number of iterations.

One popular method for efficient optimization given expensive queries is known as Bayesian optimization (BO) Shahriari et al. [2016], Snoek et al. [2012]. In BO, a Bayesian model of the experiment is fit to the observed data and then leveraged to choose subsequent designs to query. Specifically, one optimizes an acquisition function, defined with respect to the model, to choose each subsequent query.

The most common model used in BO today is the Gaussian process (GP). This nonparametric Bayesian model makes few assumptions about the outcomes, and is particularly useful for black box optimization. Many popular acquisition functions have been derived for GPs. There has also been some work deriving BO procedures for other flexible models, including random forests Hutter et al. [2011] and neural networks Snoek et al. [2015].

However, to accurately model complex systems, we may want to choose from a broader library of Bayesian models and techniques. We may, for example, want to compose models (e.g. GPs, latent factor models such as mixtures, deep Bayesian networks, hierarchical regression models) in various ways, and use them in BO.

Recently developed probabilistic programming languages (PPLs) provide a general way to model structure, problem intuition, and prior knowledge in a Bayesian model (or, more generally, in a probabilistic program (PP)). They allow for easy model specification and composition, quick deployment, and automatic inference, often in the form of samples from an approximate posterior distribution.

PPLs give us a language and framework for building complex models and performing automatic inference, but it is not immediately clear how to make them compatible with standard acquisition functions for use in a typical BO algorithm. In this paper, we aim to bridge this gap, and provide a way for any PP or Bayesian model (and inference) implementation to be immediately plugged into a BO loop. Similar to how PPs run inference automatically, we aim to perform BO automatically, given any user-specified PP and choice of acquisition strategy.

Specifically, we develop a framework (ProBO) for using an arbitrary probabilistic program within a Bayesian optimization procedure. We aim to achieve the following criteria:

  1. One can plug in any PP that returns posterior samples, including those with differentiable inference (e.g. Hamiltonian Monte Carlo (HMC), black-box variational inference (BBVI)), exact inference (message passing, GPs), deep PPs (Bayesian neural networks), and universal PPs (simulator models).

  2. One can use a variety of acquisition strategies, including Thompson sampling (TS), probability of improvement (PI), expected improvement (EI), upper confidence bound (UCB), and others.

A BO procedure that works automatically with any PP can provide various modeling advantages that benefit BO. For instance, we often have prior knowledge, expert insight, contextual information, or additional observations, all of which may be modeled. These can bias or constrain the optimization routine and improve query predictions, which can increase sample efficiency and greatly reduce the number of iterations needed for BO. A few examples are when:


  • Inputs yield experimental outcomes according to a known family of functions (e.g. Li et al. [2018], and our phase shift model in Fig. 2).

  • There is an expected feature in the optimization landscape (e.g. Andersen et al. [2017], Neiswanger and Xing [2017], and our basin model in Sec. 4.2).

  • There is some latent structure, and inferences about it can yield better decisions on which points to query (e.g. corrupt BO and our denoising GP in Sec. 4.1).

  • There exists a simulator that can approximate the experimental process (e.g. discussion in Sec. 5).

In summary, this paper provides the following contributions: We develop a BO framework that is compatible with arbitrary PPs. We describe how a variety of acquisition functions can be implemented approximately using basic PP operators, and how to efficiently optimize these acquisition functions. We give strategies for using PPs within BO in practice, including ensembling models and making the procedure robust to misspecification. Finally, we provide an empirical comparison with standard BO methods, showing that our framework can reduce the number of iterations in BO. Our Python implementation of ProBO is available at https://github.com/willieneis/ProBO.

2 Framework

In this section, we first describe a general abstraction for PPs for use within our framework, then use this abstraction to develop black box versions of common acquisition functions that are compatible with arbitrary PPs, and finally show how to efficiently optimize these black box acquisition functions with a multi-fidelity optimization strategy.

2.1 Formalism for Probabilistic Programs

We describe a general formalism for discriminative probabilistic programs for use in BO. Suppose we are modeling a system which, given an input x ∈ X, yields observations y ∈ Y, written y ~ s(x). We assume Y ⊆ R^{d_y}, where the first dimension y_1 ∈ R is the value to be minimized. Observing the system n times at different inputs yields a dataset D_n = {(x_i, y_i)}_{i=1}^{n}. Let there exist a Bayesian model for the data, with likelihood p(y | x, z), where z ∈ Z are latent variables. Let p(z) denote the PDF of the prior on z. We define the joint model PDF to be p(y, z | x) = p(z) p(y | x, z). The posterior (conditional) PDF can then be written p(z | D_n).

In our formalism, we assume only two basic PP operations:

  1. Sample from the data conditional: given an input x and latent variable z, draw y from the generative distribution over data, y ~ p(y | x, z). We write this

    y ← pp.gen(x, z).    (1)
  2. Sample from the posterior conditional: given a dataset D_n, draw samples z ~ p(z | D_n) from the posterior distribution over latent variables. We write this

    z ← pp.inf(D_n).    (2)

Using PPs assumed to have only these two operations, we will carry out BO tasks such as constructing and optimizing acquisition functions.
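The two operations above can be captured by a tiny Python interface. The sketch below uses a conjugate Gaussian-mean model so that pp.inf can draw exact posterior samples; the class and method names are our own illustration, not from the ProBO codebase.

```python
import random

class GaussianMeanPP:
    """Toy PP exposing the two ProBO operations (illustrative names) for
    the model y ~ N(z, s2) with prior z ~ N(0, t2)."""

    def __init__(self, s2=1.0, t2=1.0):
        self.s2, self.t2 = s2, t2

    def gen(self, x, z):
        # Operation (1): draw y ~ p(y | x, z); the input x is unused here.
        return random.gauss(z, self.s2 ** 0.5)

    def inf(self, data):
        # Operation (2): draw z ~ p(z | D) via the exact conjugate posterior.
        ys = [y for (_, y) in data]
        post_var = 1.0 / (1.0 / self.t2 + len(ys) / self.s2)
        post_mean = post_var * sum(ys) / self.s2
        return random.gauss(post_mean, post_var ** 0.5)

random.seed(0)
pp = GaussianMeanPP()
data = [(0.0, 1.2), (0.0, 0.8)]
z = pp.inf(data)    # one posterior sample via pp.inf
y = pp.gen(0.0, z)  # one predictive sample via pp.gen
```

Any model offering these two methods, exact or approximate, fits the formalism.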

Scope.

This formalism encompasses a broad array of PPs, including those making use of Markov chain Monte Carlo (MCMC), most variational or exact inference methods (e.g. GPs and deep PPs Tran et al. [2017]), forward simulation methods (e.g. universal PPs Le et al. [2016]), amortized inference Ritchie et al. [2016], and ABC methods Csilléry et al. [2010].

For example, our formalism is compatible with popular PPL frameworks such as Stan Carpenter et al. [2015], Edward Tran et al. [2016], PyMC3 Salvatier et al. [2016], Pyro Bingham et al. [2018], GPy GPy [2012], Infer.NET Minka [2012], Venture Mansinghka et al. [2014], Anglican Wood et al. [2014], TensorFlow Probability Dillon et al. [2017], ProbTorch Siddharth et al. [2017], and others.

Figure 1: Visualizations of the black box acquisition functions given in Algs. 2-5 for use in ProBO. In each plot, the posterior predictive distribution is shown in gray, and the acquisition estimate is shown for two fidelities (solid color line and dashed black line).

2.2 ProBO Framework

We give the ProBO framework in Alg. 1. Each iteration consists of four steps: draw posterior samples via pp.inf, select an input x_{n+1} by optimizing a black box acquisition function using pp.gen, observe the system at x_{n+1}, and add the new data to the dataset.

1: for n = 0, 1, 2, … do
2:      z_{1:M} ← pp.inf(D_n) ▷ Posterior samples
3:      x_{n+1} ← argmin_x a(x; z_{1:M}) ▷ Optimize acquisition
4:      y_{n+1} ~ s(x_{n+1}) ▷ Observe system at x_{n+1}
5:      D_{n+1} ← D_n ∪ {(x_{n+1}, y_{n+1})} ▷ Add data
6: Return D_N.
Algorithm 1 ProBO

Note that a special case of this framework is BO using GP models, where inference can be done exactly (step 1), there is a closed-form expression for the acquisition function (step 2), and observing means querying a function at x_{n+1} and getting a scalar output (step 3).
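As a rough illustration of the loop in Alg. 1, the sketch below optimizes the acquisition over a finite candidate grid in place of a global optimizer; the names and simplifications are our own, not the reference implementation.

```python
import random

def probo(pp, system, x_grid, n_iters, acq, M=50):
    """ProBO loop sketch (illustrative). `acq` is a black box acquisition
    a(x; z_1..z_M); `x_grid` is a finite candidate set standing in for a
    global optimization routine."""
    data = []
    for _ in range(n_iters):
        zs = [pp.inf(data) for _ in range(M)]               # step 1: pp.inf
        x_next = min(x_grid, key=lambda x: acq(pp, x, zs))  # step 2: optimize
        y_next = system(x_next)                             # step 3: observe
        data.append((x_next, y_next))                       # step 4: add data
    return data

# Tiny demo on a toy quadratic system with a Thompson-style acquisition.
class _ToyPP:
    def inf(self, data):
        return random.gauss(0.0, 1.0)
    def gen(self, x, z):
        return (x - 1.0) ** 2 + 0.1 * z

def _bb_ts(pp, x, zs):
    # Average of pp.gen draws, as in the BB-TS strategy described below.
    return sum(pp.gen(x, z) for z in zs) / len(zs)

random.seed(0)
history = probo(_ToyPP(), lambda x: (x - 1.0) ** 2,
                [i / 10 for i in range(21)], 5, _bb_ts)
```

In the demo, the acquisition is minimized at the true optimum x = 1.0, so the loop repeatedly queries near it.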

However, for many Bayesian models and PPs, inference cannot be done exactly, and there does not exist an exact expression for the acquisition function. Furthermore, many systems may have complex inputs and extra observations (in addition to the value to be optimized) Wu et al. [2017], Astudillo and Frazier [2017], Swersky et al. [2013]. We'd like to take any PP that models the available information, run any applicable approximate inference method, and use this automatically in BO.

In the following sections, using the two general PP operations defined in Sec. 2.1, we develop approximate acquisition functions that can be applied broadly, and show how to efficiently optimize these functions. We refer to these as black box acquisition functions, since they are computed automatically via these two operations, without any model-specific derivation.

2.3 Black Box Acquisition Functions

In the ProBO framework (Alg. 1), we denote the black box acquisition function with a(x; z_{1:M}), where z_{1:M} = {z_1, …, z_M} are latent variable posterior samples given by pp.inf. We also include a parameter M, which represents the approximation quality of a. We will describe an adaptive method for choosing M during acquisition optimization in Sec. 2.4. We will make frequent use of the posterior predictive distribution, which is defined to be p(y | x, D_n) = ∫ p(y | x, z) p(z | D_n) dz.

There are a number of acquisition functions used commonly in Bayesian optimization, such as expected improvement (EI) Močkus [1975], probability of improvement (PI) Kushner [1964], GP upper confidence bound (UCB) Srinivas et al. [2009], and Thompson sampling (TS) Thompson [1933]. Special cases of sample-based (Monte Carlo) estimates of certain acquisition functions have been described in prior work Snoek et al. [2012], Wilson et al. [2018], Hernández-Lobato et al. [2015]. However, these focus on GP models, where Gaussian assumptions are made on the posterior predictive distribution.

Here, we propose a few simple acquisition estimates for arbitrary PPs that can be computed with pp.inf and pp.gen. Specifically, we give algorithms below for the EI, PI, UCB, and TS acquisition strategies, though similar algorithms could be used for other acquisitions involving expectations or statistics of the posterior or posterior predictive distributions.

The popular EI acquisition function returns the expected improvement that querying at x will have over the minimal value y_min observed so far. This can be written

a_EI(x) = E[ max(y_min − y, 0) ],  where y ~ p(y | x, D_n).

We construct a nonparametric estimate of this in the black box EI (BB-EI) acquisition in Alg. 2, and illustrate BB-EI in Fig. 1(a).

The PI acquisition function is similar to EI, but returns the probability that querying at x will improve upon the minimal value y_min observed so far. PI can be written

a_PI(x) = P( y < y_min ),  where y ~ p(y | x, D_n).

We give the black box PI (BB-PI) acquisition function in Alg. 3, and visualize it in Fig. 1(b).

1: for m = 1, …, M do
2:      y_m ← pp.gen(x, z_m)
3:      s_m ← max(y_min − y_m, 0)
4: â_EI ← (1/M) Σ_{m=1}^{M} s_m
5: Return â_EI
Algorithm 2 BB-EI acquisition
1: for m = 1, …, M do
2:      y_m ← pp.gen(x, z_m)
3:      s_m ← 1[y_m < y_min]
4: Return â_PI ← (1/M) Σ_{m=1}^{M} s_m
Algorithm 3 BB-PI acquisition
1: for m = 1, …, M do
2:      y_m ← pp.gen(x, z_m)
3:      collect the draws y_{1:M}
4: Return â_UCB ← LCB(y_{1:M}). See text for details
Algorithm 4 BB-UCB acquisition
1: for m = 1, …, M do
2:      y_m ← pp.gen(x, z_M) ▷ Use final sample only
3: Return â_TS ← (1/M) Σ_{m=1}^{M} y_m
Algorithm 5 BB-TS acquisition

The UCB acquisition function at a point x returns a lower confidence bound on the posterior predictive distribution at x (note that we use a lower confidence bound, rather than an upper one, since we are performing minimization). We give the black box UCB (BB-UCB) acquisition in Alg. 4 and visualize it in Fig. 1(c).

Alg. 4 involves an estimate of the lower confidence bound of the posterior predictive distribution. Two simple strategies for estimating this LCB are:

  1. Empirical quantiles:

    Order the draws y_{1:M} into y_{(1)} ≤ … ≤ y_{(M)}, and return the empirical β-quantile y_{(⌈βM⌉)}, where β ∈ (0, 1) is a trade-off parameter.

  2. Parametric assumption: As an example, if we model the draws as Gaussian, we can compute empirical estimates μ̂ and σ̂, and return μ̂ − βσ̂, where β > 0 is a trade-off parameter.
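Both LCB strategies can be sketched in a few lines (hypothetical helper names; the quantile index handling is one simple choice among several):

```python
import math

def lcb_quantile(ys, beta=0.1):
    """Strategy 1 sketch: the empirical beta-quantile of the draws,
    with beta as the exploration/exploitation trade-off parameter."""
    ys = sorted(ys)
    k = min(len(ys) - 1, int(beta * len(ys)))
    return ys[k]

def lcb_gaussian(ys, beta=2.0):
    """Strategy 2 sketch: Gaussian assumption, returning mean - beta * std."""
    n = len(ys)
    mu = sum(ys) / n
    var = sum((y - mu) ** 2 for y in ys) / n
    return mu - beta * math.sqrt(var)
```

The quantile version makes no distributional assumption but needs more draws for stable tails; the Gaussian version is cheap but can be miscalibrated for skewed predictives.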

In Thompson sampling (TS), one draws a single latent variable sample z from the posterior, and then optimizes the mean of p(y | x, z) with respect to x. In our case, we do not know this mean in closed form, but can instead optimize an average of draws from pp.gen. We give the black box TS (BB-TS) acquisition in Alg. 5 and visualize it in Fig. 1(d).

2.4 Multi-fidelity Optimization of Black Box Acquisition Functions

In the ProBO framework, we must optimize over the acquisition estimates defined in the previous section, i.e. compute

x_{n+1} = argmin_x â_M(x; z_{1:M}),    (3)

where M denotes the number of times pp.gen is called in an evaluation of â_M. However, in some probabilistic programs the pp.gen operation can be costly, and we'd like to minimize the number of times it is called. As seen in Fig. 1, a small M will return a noisy estimate of the acquisition function, while a large M will return a more accurate estimate.

This is a special case of a multi-fidelity optimization problem Forrester et al. [2007], with fidelity parameter M. Unlike typical multi-fidelity settings, our goal is to reduce the number of calls to pp.gen for a single x only, by modifying the black box acquisition function â. This way, the new â can be dropped into any arbitrary global optimization routine which operates by making calls to â. Concretely, suppose we have K fidelities ranging from a small number of samples M_1 to a large number M_K, i.e. M_1 < M_2 < … < M_K.

Intuitively, when calling â on a given x, we'd like to use a small M if â(x) is far from the acquisition function's minimal value, and a larger M if â(x) is close to it.

We propose the following procedure: Suppose â* is the minimum value of â seen so far during optimization (at any fidelity). For a given fidelity M_k (starting with k = 1), we compute a lower confidence bound (LCB) for the sampling distribution of the estimate â_{M_k}(x). We can do this via the bootstrap method Efron [1992], along with the LCB estimates described in Sec. 2.3. If this LCB is below â*, it remains plausible that the acquisition function minimum is at x, and we repeat these steps at fidelity M_{k+1}. After reaching a fidelity where the LCB is above â* (or upon reaching the highest fidelity M_K), we return the estimate â_{M_k}(x). We give this procedure in detail in Alg. 6.

1: â* ← min value of â seen so far
2: k ← 1
3: while k ≤ K do
4:      y_{1:M_k} ← pp.gen(x, ·), called M_k times
5:      for b = 1, …, B do ▷ B bootstrap samples
6:           ỹ_{1:M_k} ← resample y_{1:M_k} with replacement
7:           â_b ← â(ỹ_{1:M_k}) ▷ Defined in a given Alg. 2-5
8:      LCB ← lower confidence bound of â_{1:B}
9:      if LCB > â* or k = K then break, else k ← k + 1
10: Return â(y_{1:M_k})
Algorithm 6 Multi-fidelity acquisition optimization

As a simple case, we could run a two-fidelity algorithm with fidelities (M_1, M_2), where M_1 < M_2. For a given x, the multi-fidelity acquisition method would first draw M_1 samples from pp.gen and compute the LCB with the bootstrap. If the LCB is greater than â*, the method would return â_{M_1}(x); if not, it would return â_{M_2}(x). Near optima, this calls pp.gen M_1 + M_2 times, and elsewhere it calls pp.gen only M_1 times. Hence, this method reduces calls to pp.gen when M_1 ≪ M_2 (dependent on the choice of LCB estimate).

One can apply any derivative-free (query-based) global optimization procedure that iteratively calls â. In general, we can replace the optimization step in ProBO with this multi-fidelity procedure, for each of the black box acquisition strategies described in Sec. 2.3. In Sec. 4.3, we provide experimental results for this method, showing favorable performance relative to high-fidelity black box acquisition functions, as well as reduced calls to pp.gen.

Figure 2: Visualization of the Bayesian product of experts (BPoE) ensemble model (column three) of a phase shift (PS) model (column one), defined in Sec. 3, and a GP (column two). In the first row, when n is small, the BPoE ensemble more closely resembles the PS model. In the second row, when n is larger, the BPoE ensemble more closely resembles the GP model, and both accurately reflect the true landscape (red dashed line). In all figures, the posterior predictive is shown in gray.

3 Ensembles of PP Models with the Bayesian Product of Experts

We may have multiple models that capture different aspects of a system, or we may want to incorporate information given by, for instance, a parametric PP (e.g. a model with a specific trend, shape, or specialty for a subset of the data) into a nonparametric PP (e.g. a GP, which is highly flexible but makes fewer assumptions).

To incorporate multiple sources of information or bring in side information, we want a valid way to create ensembles of multiple PP models. Here, we develop a method to combine the posterior predictive densities of multiple PP models, using only our two PP operations. Our procedure constructs a model similar to a product of experts model Hinton [2002], and we call our strategy a Bayesian product of experts (BPoE). This model can then be used in our ProBO framework.

As an example, we show an ensemble of two PP models, M1 and M2, though this could be extended to an arbitrarily large group. Let M1 have likelihood p_1(y | x, z_1), where z_1 are latent variables with prior p_1(z_1), and let M2 have likelihood p_2(y | x, z_2), where z_2 are latent variables with prior p_2(z_2). Note that z_1 and z_2 need not be in the same space.

Given M1 and M2, we propose an ensemble model M_e with latent variables z = (z_1, z_2), prior p(z) = p_1(z_1) p_2(z_2), and likelihood p(y | x, z) ∝ p_1(y | x, z_1) p_2(y | x, z_2).

Note that this treats the priors on z_1 and z_2 as independent, and uses the product of experts assumption Hinton [2002] on the likelihood, which intuitively means that p(y | x, z) is high where both M1 and M2 agree (i.e. an "and" operation).

We can prove (appendix Sec. A) that the posterior predictive PDF for M_e is proportional to the product of the posterior predictive PDFs for M1 and M2, i.e.

p(y | x, D_n) ∝ p_1(y | x, D_n) p_2(y | x, D_n).    (4)

Given this property, we need an algorithm for computing and using the posterior predictive of M_e within the ProBO framework. In our black box acquisition algorithms, we use pp.gen to generate samples from predictive distributions. We can integrate these with combination algorithms from the embarrassingly parallel MCMC literature Neiswanger et al. [2014], Wang et al. [2015] to develop an algorithm that generates samples from the posterior predictive of the ensemble model and uses these in a black box acquisition function. We give this procedure in Alg. 7, where we've used Combine to denote a combination algorithm, which we detail in appendix Sec. B.

1: for m = 1, …, M do
2:      z_{1,m} ← pp.inf_1(D_n)
3:      z_{2,m} ← pp.inf_2(D_n)
4:      y_{1,m} ← pp.gen_1(x, z_{1,m}), y_{2,m} ← pp.gen_2(x, z_{2,m})
5: y_{1:M} ← Combine(y_{1,1:M}, y_{2,1:M})
6: Return y_{1:M}.
Algorithm 7 pp.ens: PP ensemble with BPoE

We can then swap in the pp.ens operation for the pp.gen operation in Algs. 2-5.
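As one concrete (and deliberately simple) instance of the combination step, the sketch below fits a Gaussian to each model's predictive draws and samples from the product of the two Gaussians, which has a closed form. This parametric choice is our own assumption for illustration; the embarrassingly parallel MCMC papers cited above give nonparametric alternatives.

```python
import random

def pp_ens_gaussian(ys1, ys2, n_out):
    """Combine predictive draws from two models by a parametric product:
    fit N(mu_i, v_i) to each sample set, then sample from the product
    N(mu, v), where 1/v = 1/v1 + 1/v2 and mu = v * (mu1/v1 + mu2/v2)."""
    def fit(ys):
        mu = sum(ys) / len(ys)
        v = sum((y - mu) ** 2 for y in ys) / len(ys)
        return mu, v
    mu1, v1 = fit(ys1)
    mu2, v2 = fit(ys2)
    v = 1.0 / (1.0 / v1 + 1.0 / v2)
    mu = v * (mu1 / v1 + mu2 / v2)
    return [random.gauss(mu, v ** 0.5) for _ in range(n_out)]

random.seed(0)
# Two sample sets centered at 0 and 2 combine to a predictive centered at 1.
samples = pp_ens_gaussian([-1.0, 1.0], [1.0, 3.0], 4000)
```

Note the product density concentrates where both models agree, matching the "and" intuition of the BPoE likelihood.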

Note that the BPoE allows us to easily ensemble PPs written in different PPLs. For example, a hierarchical regression model written in Stan Carpenter et al. [2015] using Hamiltonian Monte Carlo for inference could be combined with a deep Bayesian neural network written in Pyro Bingham et al. [2018] using variational inference and with a GP written in GPy GPy [2012] using exact inference.

Example: Combining Phase-Shift and GP Models.

We describe an example and illustrate it in Fig. 2. Suppose we expect a few phase shifts in our input space, which partition it into regions with roughly uniform output. We can model this system with a phase shift (PS) model, whose latent variables (the locations of the shifts and the output level within each region) are assigned appropriate priors. This model may accurately describe general trends in the system, but it may ultimately be misspecified, and underfit as the number of observations grows.

Alternatively, we could model this system as a black box using a Gaussian process. The GP posterior predictive may converge to the correct landscape given enough data, but it is nonparametric, and does not encode our assumptions.

We can use the BPoE model to combine both the phase shift and GP models. We see in Fig. 2 that when n is small (first row), the BPoE model resembles the phase shift model, but when n is larger (second row), it more closely resembles the true landscape modeled by the GP.

4 Empirical Results

The main goal of our empirical work is to show that we can directly plug models implemented with various PPLs into ProBO, and that these PPs can improve BO performance (i.e. increase the data efficiency and reduce the number of iterations) when compared with standard methods and models. We also aim to verify that our black box acquisition functions and extensions (e.g. multi-fidelity optimization and the BPoE ensemble) perform well in practice.

We show experimental results on two tasks. The first is the task of corrupt BO, where we assume experimental observations are corrupted (i.e. drawn from some corruption distribution) according to some probability. To perform accurate BO in this setting, models need to infer when to ignore or use subsets of the data.

The second experiment is a model selection task in neural networks that involves finding the optimal number of hidden units at each layer in a network (i.e. a basic type of neural architecture search Zoph and Le [2016], Kandasamy et al. [2018a]). To reduce the number of iterations needed in BO, we design a basin model that captures the relationship of validation error (i.e. classification error on a validation data set) with model complexity, and combine it with a GP in a BPoE ensemble (Sec. 3).

Implementation Details.

For all GP implementations in the following experiments, we use the GPy library GPy [2012]. We implement other PPs in Edward Tran et al. [2016], where we use black box variational inference algorithms, and in Stan Carpenter et al. [2015], where we use the No U-Turn Sampler Hoffman and Gelman [2011], a form of Hamiltonian Monte Carlo.

4.1 Corrupt BO

Figure 3: Visualization of the denoising model (Sec. 4.1) applied to GPs, for use in corrupt BO. When n is small, the GP (a) is negatively affected by corrupt data (orange points), while the denoising GP (b) more accurately reflects the system data (blue points). When n is large, the GP (c) remains biased by the corrupt data, while the denoising GP (d) converges to the true system (red dashed line). In all figures, the corruption probability is held fixed, and the model posterior predictive is shown in gray.

Consider the setting where, each time we conduct an experiment, we observe a "corrupt" output with some probability p, drawn according to a corruption distribution. We call this task corrupt BO. The corruption distribution may depend on the input x (e.g. corruption may be more likely in a window around the optimum).

Figure 4: Results on corrupt BO experiments (Sec. 4.1). For low corruption, ProBO with black box acquisition functions on denoising GPs is competitive with standard BO using exact acquisition functions on GP models (a-b). For higher corruption, ProBO with denoising GPs converges to the optimal value, while standard BO with GPs does not, even as n grows large (c-d). Under high corruption, standard BO with GPs can fare worse than RAND (blue dotted line) (c-d). All methods are averaged over 10 runs, and error bars represent one standard error.

We develop a denoising PP model for this setting. Given a PP system model and a corruption model, we write our model as a two-component mixture of the system model and the corruption model, where the mixture weights can also depend on the input x.
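The generative side of such a mixture can be sketched as follows (hypothetical names; the system and corruption generators stand in for the pp.gen calls of the two component models):

```python
import random

def denoising_gen(x, z, sys_gen, corrupt_gen, p_corrupt):
    """Generative draw from a two-component denoising mixture sketch:
    with probability p_corrupt emit a corruption-model draw, otherwise a
    system-model draw. In general p_corrupt could itself depend on x."""
    if random.random() < p_corrupt:
        return corrupt_gen(x)
    return sys_gen(x, z)
```

Inference in the full model must also attribute each observation to a component, which is what lets the denoising GP discount corrupt points.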

Figure 5: Results on neural architecture tuning experiments (Sec. 4.2). We visualize the basin model (a) vs a GP (b) on data from a four layer MLP (fixing all layers to the same width to plot in 1D). For small n, the basin model more accurately captures the shape of the function (red dashed line). In (c) and (d) we show results of ProBO using a BPoE basin ensemble vs standard BO with GPs. Here, the ProBO methods outperform standard BO. All methods are averaged over 10 runs, and error bars represent one standard error.

Figure 6: Results on the multi-fidelity black box acquisition function experiments (Sec. 4.3), showing ProBO using the multi-fidelity procedure (Alg. 6) vs a fixed high fidelity and a fixed low fidelity. Results for BB-EI are in (a), and for BB-UCB in (b). Here, the multi-fidelity method performs competitively with the high-fidelity method, while the low-fidelity method performs worse. All methods are averaged over 10 runs, and error bars represent one standard error. In (c) we show the average number of pp.gen calls per acquisition evaluation, and see that the multi-fidelity method maintains similar performance to the high-fidelity method while reducing the number of calls.

As a special case of this, we implement a denoising GP model, which we illustrate in Fig. 3. Here, we let the system model be a GP, and the corruption model be the uniform distribution.

We show experimental results for ProBO using denoising GPs on the following synthetic corrupt BO task. For an input x, with probability 1 − p we query the underlying function, and with probability p we receive a corrupt value drawn from a corruption distribution whose outputs are greater than or equal to the function's minimum value. Note that we use this corruption distribution so that we are consistent with our stated goal of minimizing the observed value of the system, instead of operating in a more general design of experiments setting Kandasamy et al. [2018b].

For this task we compare ProBO using denoising GPs and black box acquisition functions against standard BO using GPs and exact acquisition functions. We first show results for a low corruption setting in Fig. 4(a-b), where we plot the minimal found value versus iteration n. All methods are averaged over 10 runs, and error bars represent one standard error. Here, both models converge to a near-optimal value and perform similarly.

We then show results for a high corruption setting in Fig. 4(c-d). Here, ProBO with denoising GPs converges to a near-optimal value, while standard BO with GPs does not. For large n, BO with GPs fares worse than random sampling (RAND).

4.2 Tuning Neural Architectures

A popular application of BO is hyperparameter tuning of machine learning models. A recently popular task in this domain is known as neural architecture search (NAS). In this experiment we show that domain knowledge about features of a function landscape can be incorporated in ProBO via a BPoE ensemble (Sec. 3) to provide better sample efficiency (reduced iterations) for BO, in a basic NAS task.

When modeling datasets of moderate size, there are often two distinct phases as model complexity grows: a phase where the model underfits, in which increasing model complexity reduces error on a held-out validation set; and a phase where the model overfits, in which validation error increases with model complexity. We design a model for this trend, which we refer to as a basin model:

(5)

where the parameters are assigned priors. This model captures the inflection point with one latent variable, and uses two further latent variables to model the slope of the optimization landscape above and below (respectively) the inflection point. We show a one dimensional view of validation error data from this system, and illustrate the basin model, in Fig. 5(a-b).
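For illustration only, a basin-shaped mean function with one inflection point and two slopes might be parameterized as below; this is a hypothetical form of our own, and the exact model in Eq. (5) may differ.

```python
def basin_mean(x, b, a, c, d=0.0):
    """Hypothetical basin-shaped mean function: linear slopes a (below)
    and c (above) meeting at the inflection point b, plus an offset d.
    Illustrative only; not necessarily the paper's Eq. (5)."""
    return d + (a * (b - x) if x < b else c * (x - b))
```

Placing priors on b, a, c, and d, and fitting with any pp.inf, yields a model whose posterior over b localizes the underfit-to-overfit transition.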

In this experiment, we optimize over the number of units (i.e. layer width) of the hidden layers in a four layer multi-layer perceptron (MLP) neural network Rosenblatt [1961], trained on the Wisconsin Breast Cancer Diagnosis dataset Blake and Merz [1998]. We compare ProBO using a BPoE ensemble (Sec. 3) of a basin model and a GP against standard BO using a GP. We see in Fig. 5(c-d) that ProBO with the BPoE ensemble can significantly outperform standard BO with GPs. In this optimization task, the landscape around the inflection point (from underfitting to overfitting) can be very steep, which may hurt the performance of standard BO with GPs. In contrast, the basin model can capture this shape and quickly identify the inflection point via inferences about its latent variables.

4.3 Multi-fidelity Acquisition Optimization

Here, we empirically assess our multi-fidelity acquisition function optimization algorithm (Sec. 2.4). Our goal is to demonstrate that increasing the fidelity in black box acquisitions can yield better performance in ProBO, and that our multi-fidelity method (Alg. 6) maintains the performance of the high-fidelity acquisitions while reducing the number of calls to pp.gen.

We perform an experiment in a two-fidelity setting, applying our multi-fidelity method to BB-EI and BB-UCB, using a GP model and the synthetic BO task described in Sec. 4.1. Results are shown in Fig. 6, where we compare the high-fidelity, low-fidelity, and multi-fidelity methods, for BB-EI (a) and BB-UCB (b). For both black box acquisition functions, the high-fidelity and multi-fidelity methods show comparable performance, while the low-fidelity method performs worse. We also see in (c) that the multi-fidelity method reduces the number of calls to pp.gen by a factor of 3, on average, relative to the high-fidelity method. It is worth noting that the low-fidelity method still performs reasonably well given its low cost, and may be the most applicable method depending on the PP noise, the cost of calling pp.gen, and the desired accuracy.

5 Conclusion

In this paper we presented ProBO, a framework for performing Bayesian optimization automatically using an arbitrary probabilistic program as the model. We developed black box acquisition functions, which do not require model-specific derivations, and showed how to efficiently optimize these functions. We also developed a new model, the Bayesian product of experts (BPoE), which integrates nicely with our framework and allows for combining information from multiple PPs and performing BO with PP ensembles. Finally, we demonstrated promising empirical results on a corrupt BO task and a neural network architecture tuning task, where we were able to drop in and use various existing PP implementations.

While not the focus of this paper, universal PPs Le et al. [2016], Mansinghka et al. [2014], Wood et al. [2014] allow for models defined by arbitrary forward simulators, and aim to provide automatic inference for these models. They comprise a very broad class of models, and have the potential to incorporate sophisticated custom-built simulations of a broad array of systems. We think there may be the potential for running ProBO with models involving complex simulators of real world phenomena, and that this simulation-guided BO is an interesting avenue for future work.

References

Appendix A The Bayesian Product of Experts (BPoE) Posterior Predictive Distribution

Here we characterize the posterior predictive PDF for the Bayesian product of experts model defined in Sec. 3. For convenience, we provide our derivation for an ensemble of two models; however, these results extend to an ensemble of an arbitrary number of models.

Suppose we are modeling a system which, given an input x ∈ X, yields observations y ∈ Y, written y ~ s(x), where Y ⊆ R^{d_y}. Observing the system at n inputs yields a dataset D_n = {(x_i, y_i)}_{i=1}^{n}.

Consider two Bayesian models, $\mathcal{M}_1$ and $\mathcal{M}_2$, both for data $\mathcal{D}$.

Let $\mathcal{M}_1$ have likelihood $p_1(y \mid x, z_1)$, where $z_1$ are latent variables with prior PDF $p_1(z_1)$. We define the joint model PDF for $\mathcal{M}_1$ to be $p_1(y, z_1 \mid x) = p_1(z_1)\, p_1(y \mid x, z_1)$. The posterior (conditional) PDF for $\mathcal{M}_1$ can then be written $p_1(z_1 \mid \mathcal{D}) \propto p_1(z_1) \prod_{i=1}^{n} p_1(y_i \mid x_i, z_1)$. We can write the posterior predictive PDF for $\mathcal{M}_1$ as

$$p_1(y \mid x, \mathcal{D}) = \int p_1(y \mid x, z_1)\, p_1(z_1 \mid \mathcal{D})\, dz_1. \quad (6)$$

Similarly, let $\mathcal{M}_2$ have likelihood $p_2(y \mid x, z_2)$, where $z_2$ are latent variables with prior PDF $p_2(z_2)$. We define the joint model PDF for $\mathcal{M}_2$ to be $p_2(y, z_2 \mid x) = p_2(z_2)\, p_2(y \mid x, z_2)$, the posterior (conditional) PDF to be $p_2(z_2 \mid \mathcal{D}) \propto p_2(z_2) \prod_{i=1}^{n} p_2(y_i \mid x_i, z_2)$, and the posterior predictive PDF to be

$$p_2(y \mid x, \mathcal{D}) = \int p_2(y \mid x, z_2)\, p_2(z_2 \mid \mathcal{D})\, dz_2. \quad (7)$$

Note that $z_1$ and $z_2$ need not lie in the same space nor be related.
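Posterior predictive integrals such as (6) and (7) are typically estimated by Monte Carlo, averaging the likelihood over posterior samples of the latent variables. A minimal sketch; the Gaussian toy likelihood and function names are assumptions for illustration:

```python
import math
import numpy as np

def posterior_predictive_pdf(y, x, z_samples, likelihood_pdf):
    """Monte Carlo estimate of a posterior predictive density:
    p(y | x, D) ~= (1/S) * sum_s p(y | x, z_s), with z_s ~ p(z | D)."""
    return float(np.mean([likelihood_pdf(y, x, z) for z in z_samples]))

def gaussian_likelihood(y, x, z):
    """Toy likelihood for illustration: y | x, z ~ N(z * x, 1)."""
    return math.exp(-0.5 * (y - z * x) ** 2) / math.sqrt(2.0 * math.pi)
```

Any PP that exposes posterior samples of its latent variables and a pointwise likelihood supports this estimator.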

Given models $\mathcal{M}_1$ and $\mathcal{M}_2$, we define the following Bayesian Product of Experts (BPoE) ensemble model, $\mathcal{M}_e$, with latent variables $z = (z_1, z_2)$:

$$z_1 \sim p_1(z_1) \quad (8)$$
$$z_2 \sim p_2(z_2) \quad (9)$$
$$y \mid x, z_1, z_2 \sim p_e(y \mid x, z_1, z_2), \quad \text{where} \quad (10)$$
$$p_e(y \mid x, z_1, z_2) \propto p_1(y \mid x, z_1)\, p_2(y \mid x, z_2), \quad (11)$$

so that the joint model PDF is

$$p_e(y, z_1, z_2 \mid x) = p_1(z_1)\, p_2(z_2)\, p_e(y \mid x, z_1, z_2). \quad (12)$$

We can write the posterior PDF for the BPoE ensemble model as

$$p_e(z_1, z_2 \mid \mathcal{D}) \propto p_1(z_1)\, p_2(z_2) \prod_{i=1}^{n} p_e(y_i \mid x_i, z_1, z_2) \quad (13)$$
$$\propto p_1(z_1)\, p_2(z_2) \prod_{i=1}^{n} p_1(y_i \mid x_i, z_1)\, p_2(y_i \mid x_i, z_2) \quad (14)$$
$$= \left[ p_1(z_1) \prod_{i=1}^{n} p_1(y_i \mid x_i, z_1) \right] \left[ p_2(z_2) \prod_{i=1}^{n} p_2(y_i \mid x_i, z_2) \right] \quad (15)$$
$$\propto p_1(z_1 \mid \mathcal{D})\, p_2(z_2 \mid \mathcal{D}), \quad (16)$$

where in (14) the normalizing constant of the likelihood in (11) is absorbed into the proportionality.

We can thus write the posterior predictive PDF for the BPoE ensemble model as

$$p_e(y \mid x, \mathcal{D}) = \int p_e(y \mid x, z)\, p_e(z \mid \mathcal{D})\, dz \quad (17)$$
$$= \iint p_e(y \mid x, z_1, z_2)\, p_e(z_1, z_2 \mid \mathcal{D})\, dz_1\, dz_2 \quad (18)$$
$$\propto \iint p_1(y \mid x, z_1)\, p_2(y \mid x, z_2)\, p_e(z_1, z_2 \mid \mathcal{D})\, dz_1\, dz_2 \quad (19)$$
$$\propto \iint p_1(y \mid x, z_1)\, p_2(y \mid x, z_2)\, p_1(z_1 \mid \mathcal{D})\, p_2(z_2 \mid \mathcal{D})\, dz_1\, dz_2 \quad (20)$$
$$= \left[ \int p_1(y \mid x, z_1)\, p_1(z_1 \mid \mathcal{D})\, dz_1 \right] \left[ \int p_2(y \mid x, z_2)\, p_2(z_2 \mid \mathcal{D})\, dz_2 \right] \quad (21)$$
$$= p_1(y \mid x, \mathcal{D}) \left[ \int p_2(y \mid x, z_2)\, p_2(z_2 \mid \mathcal{D})\, dz_2 \right] \quad (22)$$
$$= p_1(y \mid x, \mathcal{D})\, p_2(y \mid x, \mathcal{D}), \quad (23)$$

where (22) and (23) follow from (6) and (7), respectively.

Therefore, we have the result, which we use in Sec. 3, that the posterior predictive PDF for the BPoE ensemble model is proportional to the product of the posterior predictive PDFs of the constituent models $\mathcal{M}_1$ and $\mathcal{M}_2$, i.e.

$$p_e(y \mid x, \mathcal{D}) \propto p_1(y \mid x, \mathcal{D})\, p_2(y \mid x, \mathcal{D}). \quad (24)$$

Via a similar argument, for a BPoE ensemble model $\mathcal{M}_e$ consisting of models $\mathcal{M}_1, \ldots, \mathcal{M}_K$, the posterior predictive PDF $p_e(y \mid x, \mathcal{D})$ has the property

$$p_e(y \mid x, \mathcal{D}) \propto \prod_{k=1}^{K} p_k(y \mid x, \mathcal{D}), \quad (25)$$

where $p_k(y \mid x, \mathcal{D})$ is the posterior predictive PDF for constituent model $\mathcal{M}_k$ in the ensemble.
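The product form above is easy to evaluate up to a normalizing constant whenever each constituent model exposes its posterior predictive PDF. A minimal one-dimensional sketch with numerical normalization on a grid (function names are illustrative assumptions):

```python
import numpy as np

def bpoe_predictive_unnorm(y, predictive_pdfs):
    """Unnormalized BPoE posterior predictive: the product of the
    constituent models' posterior predictive PDFs evaluated at y."""
    return float(np.prod([p(y) for p in predictive_pdfs]))

def bpoe_predictive_on_grid(grid, predictive_pdfs):
    """Normalize the product numerically on a uniform 1-D grid of y values."""
    vals = np.array([bpoe_predictive_unnorm(y, predictive_pdfs) for y in grid])
    dy = grid[1] - grid[0]
    return vals / (vals.sum() * dy)
```

For instance, the product of two standard-normal predictive densities concentrates around their shared mode, consistent with the PoE intuition that the ensemble is confident only where all experts agree.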

Appendix B Combination Algorithms for the pp.ens Operation (Alg. 7)

In Sec. 3, we make use of combination algorithms from the embarrassingly parallel MCMC literature Neiswanger et al. [2014], Wang et al. [2015], to define the pp.ens operation (Alg. 7) for use in applying the ProBO framework to a BPoE model. We describe these combination algorithms here in more detail.

For convenience, we describe these methods for two Bayesian models, $\mathcal{M}_1$ and $\mathcal{M}_2$, though they apply similarly to an arbitrarily large set of models.

The goal of these combination methods is to combine a set of samples

$$\{y_m^1\}_{m=1}^{M} \overset{\text{iid}}{\sim} p_1(y \mid x, \mathcal{D}) \quad (26)$$

from the posterior predictive distribution of a model $\mathcal{M}_1$, with a disjoint set of samples

$$\{y_m^2\}_{m=1}^{M} \overset{\text{iid}}{\sim} p_2(y \mid x, \mathcal{D}) \quad (27)$$

from the posterior predictive distribution of a model $\mathcal{M}_2$, to produce samples

$$\{y_m^e\}_{m=1}^{M} = \text{Combine}\left( \{y_m^1\}_{m=1}^{M},\, \{y_m^2\}_{m=1}^{M} \right) \quad (28)$$
$$\overset{\text{approx}}{\sim} p_e(y \mid x, \mathcal{D}), \quad (29)$$

where $p_e(y \mid x, \mathcal{D})$ denotes the posterior predictive distribution of a BPoE ensemble model $\mathcal{M}_e$ with constituent models $\mathcal{M}_1$ and $\mathcal{M}_2$.

We use the notation $\text{Combine}(\cdot, \cdot)$ to denote a combination algorithm. We give a combination algorithm in Alg. 8 for our setting, inspired by a combination algorithm presented in Neiswanger et al. [2014].

1: Draw initial indices $c = (c_1, c_2)$ uniformly from $\{1, \ldots, M\}^2$
2: for $m = 1, \ldots, M$ do
3:     Set $c' \leftarrow c$, draw $k \sim \text{Unif}(\{1, 2\})$, and redraw $c'_k \sim \text{Unif}(\{1, \ldots, M\})$
4:     Draw $u \sim \text{Unif}(0, 1)$
5:     if $u < w_{c'} / w_{c}$ then
6:         Set $c \leftarrow c'$
7:     end if
8:     Draw $y_m^e \sim \mathcal{N}\left( \bar{y}_{c},\, h^2 / 2 \right)$
9: Return $\{y_m^e\}_{m=1}^{M}$.
Algorithm 8 Combine sample sets

We must define two terms used in Alg. 8. The mean output $\bar{y}_c$, for indices $c = (c_1, c_2)$, is defined to be

$$\bar{y}_c = \frac{1}{2}\left( y_{c_1}^1 + y_{c_2}^2 \right), \quad (30)$$

and the weights $w_c$ (alternatively, $w_{c'}$), for indices $c = (c_1, c_2)$, are defined to be

$$w_c = \mathcal{N}\left( y_{c_1}^1 \mid \bar{y}_c, h^2 \right) \mathcal{N}\left( y_{c_2}^2 \mid \bar{y}_c, h^2 \right), \quad (31)$$

where $h > 0$ is a kernel bandwidth parameter.

Note that this algorithm (Alg. 8) holds for sample sets drawn from two arbitrary posterior predictive distributions $p_1(y \mid x, \mathcal{D})$ and $p_2(y \mid x, \mathcal{D})$, without any parametric assumptions such as Gaussianity.
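A combination step of this flavor can be sketched as a Metropolis-within-Gibbs sampler over index pairs with a Gaussian kernel, in the spirit of Neiswanger et al. [2014]. The function name, bandwidth handling, and acceptance-rule details below are illustrative assumptions, not the exact Alg. 8:

```python
import numpy as np

def combine(samples1, samples2, h=0.3, seed=0):
    """Sketch of a sample-set combination: Metropolis-within-Gibbs over
    index pairs (c1, c2), then a draw from a Gaussian kernel centered at
    the pair mean. Produces approximate samples from the (unnormalized)
    product of the two sample sets' kernel density estimates."""
    rng = np.random.default_rng(seed)
    s1, s2 = np.asarray(samples1), np.asarray(samples2)
    M = len(s1)

    def weight(c1, c2):
        # product of Gaussian kernels N(y | ybar, h^2), one per sample
        ybar = 0.5 * (s1[c1] + s2[c2])
        return np.exp(-0.5 * ((s1[c1] - ybar) ** 2 + (s2[c2] - ybar) ** 2) / h**2)

    c1, c2 = rng.integers(M), rng.integers(M)
    out = np.empty(M)
    for m in range(M):
        # propose a new index for one randomly chosen component
        p1, p2 = c1, c2
        if rng.random() < 0.5:
            p1 = rng.integers(M)
        else:
            p2 = rng.integers(M)
        # accept/reject using the ratio of product-kernel weights
        if rng.random() < weight(p1, p2) / max(weight(c1, c2), 1e-300):
            c1, c2 = p1, p2
        # emit a sample from a kernel centered at the current pair mean
        out[m] = rng.normal(0.5 * (s1[c1] + s2[c2]), h / np.sqrt(2.0))
    return out
```

Because acceptance depends only on a weight ratio, the (intractable) normalizer of the product density never needs to be computed.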
