Bayesian Experimental Design (BED) is a probabilistic framework for selecting experiments to learn about one or more uncertain variables. Within BED, the most popular criterion for ranking experiments is Expected Information Gain (EIG), which estimates, from current knowledge encoded in a prior distribution, how informative a particular experiment is likely to be. This framework is used in diverse applications across many disciplines [ryan2016fully], including the natural sciences [huan2010accelerated], the social sciences [embretson2013item], and machine learning and data analysis [foster2020unified].
The sensitivity of experimental design to misspecification of the prior distribution has been described in [dasgupta1991robust, ryan2016review], even in settings where the information gain is computable in closed form. For more complex models, EIG can only be estimated numerically, which may further affect the reliability of the computed rankings of experiments (or, in the case of continuously parameterized experiments, of the gradient of EIG). EIG is by definition an expectation of an expectation, so general-purpose estimates, such as Nested Monte Carlo (NMC) estimation [ryan2003estimating], can be expensive, slow to converge, and sensitive to underconverged sample estimates.
To address the issues above that affect the reliability of EIG estimates in BED, we introduce a quantity we call robust expected information gain
(REIG) as a probability-theoretic way of ranking experiments by their expected information gain for some worst-case small perturbation of the prior. We also show through convex analysis that the estimation of REIG is a simple post-processing of the samples generated by an NMC-like estimator. As a result, our methodology is applicable with many existing methods, which we demonstrate in section 7 by applying REIG to samples generated by three recent popular EIG estimators [foster2020unified, NEURIPS2019_d55cbf21, kleinegesse2020bayesian].
2 Background and Notation
We use $\theta$ to indicate a choice of parameters for a model from a set $\Theta$ of possible parameters, and we assume a reference prior probability distribution $\pi_0$, which has a measurable density function $p(\theta)$, so that we can write $\pi_0(d\theta) = p(\theta)\,d\theta$. We let $\xi$ be a potential experiment from a class of experiments $\Xi$, which has an outcome variable $y$. The experiment is modeled by the likelihood function $p(y \mid \theta, \xi)$, which for each choice of $(\theta, \xi)$ defines a measurable probability density function of $y$. Our interest is in models where the densities $p(\theta)$ and $p(y \mid \theta, \xi)$ can be efficiently computed, and where samples can be drawn from $\pi_0$ and from $p(y \mid \theta, \xi)$ for each $(\theta, \xi)$, so that the joint prior distribution $p(\theta)\,p(y \mid \theta, \xi)$ also has a computable density function and can be sampled. We use the notation $p(y \mid \xi)$ for the marginal distribution of the outcome $y$.
2.1 Prior Uncertainty
While it may seem recursive to consider uncertainty in the prior distribution used in Bayesian inference, in many settings the prior distribution is not determined from first principles or from an existing population of data. In these cases the choice of prior is often dictated by what is required to make a computation tractable or simple, by invariance principles, or by an attempt to be noninformative [stark2015constraints]. But notions of noninformative priors do not scale to high dimensions [yang1996catalog], and even in low dimensions priors that are close under a weak topology like total variation can have diverging posteriors for the same observations [owhadi2015brittleness].
So in this work we will consider sets of prior distributions $q$ other than the reference $\pi_0$, but we only consider $q$ that are absolutely continuous with respect to $\pi_0$. Although methods similar to ours are used to handle model uncertainty [shapiro2021bayesian], we treat the likelihood $p(y \mid \theta, \xi)$ as certain: the only uncertainty we consider is in the prior distribution of $\theta$. When we extend the notation of derived distributions from $\pi_0$ to another $q$, then, it is always with the same likelihood: the joint prior $q(\theta)\,p(y \mid \theta, \xi)$, the marginal $q(y \mid \xi)$, etc.
2.2 Expected Information Gain
We use $D(q \,\|\, p)$ to denote the Kullback-Leibler divergence, $D(q \,\|\, p) = \mathbb{E}_q[\log (dq/dp)]$. The expected information gain of experiment $\xi$ is defined to be the expectation, over the marginal distribution of outcomes, of the KL-divergence from the Bayesian posterior distribution $p(\theta \mid y, \xi)$ to the prior. Because the prior is not fixed in this work, we consider EIG to be a function of both the prior $\pi$ and the experiment $\xi$,
Bayesian optimal experimental design seeks the experiment that maximizes this quantity,
Although the form of EIG in (1) is the most intuitive for Bayesian experimental design, other equivalent definitions map more directly onto the robust variant we introduce in section 5 and the sampling-based estimators in section 6. The EIG of an experiment $\xi$ is also the mutual information between $\theta$ and $y$,
The last form in (4) is the preferred form for many methods that estimate EIG when the likelihood and prior can be evaluated directly. We discuss methods for estimating EIG and other related quantities in section 6.
3 A Simple Example with Two Experiments
Suppose a doctor has two blood tests for Condition X: test A has a chance of a false negative but a chance of a false positive, and test B has an
chance of a false negative and the same chance of a false positive. If the doctor estimates the prior probability that a patient has Condition X is , it turns out that both tests have the same expected information gain of nats. If that prior probability could be mistaken, however, the two tests have different EIGs for prior probabilities in the vicinity of , shown in fig. 1. If the prior probability the patient has Condition X is actually , then test A has a greater EIG than test B, and vice versa if it is .
We interpret these results as follows: For test A, if the prior probability of Condition X is , then a negative test result is surprising because there are essentially no false negatives, but if the prior probability is , then a positive result is less surprising because it has a high probability of being a false positive. So in comparison to test B, the EIG of test A is more sensitive to the choice of prior. In any neighborhood of , there are priors where test A is expected to be less informative than test B. So one could argue that a risk-averse doctor, who would maximize how informative the test would be in the worst case, should select test B.
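For a binary test, the EIG has a closed form as the mutual information between the condition and the test outcome, which makes its sensitivity to the prior easy to explore numerically. The sketch below is our own illustration in Python; the false-negative and false-positive rates used are assumptions, since the exact rates of tests A and B are not reproduced here.

```python
import numpy as np

def h(p):
    """Binary entropy in nats, with h(0) = h(1) = 0."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -p * np.log(p) - (1 - p) * np.log(1 - p)

def eig_binary_test(prior, fn, fp):
    """EIG (mutual information, in nats) of a yes/no test with
    false-negative rate fn and false-positive rate fp, when the prior
    probability of the condition is `prior`."""
    p_pos = prior * (1 - fn) + (1 - prior) * fp   # marginal P(positive)
    return h(p_pos) - (prior * h(fn) + (1 - prior) * h(fp))

# Illustrative rates only (the rates of tests A and B are assumptions):
# one test has almost no false negatives, the other trades a higher
# false-negative rate for a lower false-positive rate.
for prior in (0.3, 0.5, 0.7):
    print(prior, eig_binary_test(prior, 0.01, 0.30), eig_binary_test(prior, 0.15, 0.10))
```

Sweeping `prior` over a neighborhood of the doctor's estimate reproduces the kind of crossing of EIG curves shown in fig. 1.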
4 Ambiguity Sets
In the simple example above, we used a range of prior probabilities for the model parameter to argue that some tests are less locally sensitive to perturbations of the prior distribution. To generalize this idea from a simple discrete example to other probability distributions, we rely on the notion of an ambiguity set [bayraksan2015data], which is a set of distributions that are not far from a reference prior in some statistical distance. We use KL-divergence as distance, so our ambiguity set with radius $r$ centered at the reference distribution $\pi_0$ is
$$\mathcal{B}_r(\pi_0) = \{\, q : D(q \,\|\, \pi_0) \le r \,\}.$$
KL-divergence as a distance works well with Bayesian optimal experimental design, because the KL-divergence appears in the definition of EIG, and because the set $\mathcal{B}_r(\pi_0)$ is defined as a convex subset of positive measurable functions (densities $q$ with respect to $\pi_0$) with just two conditions: $\mathbb{E}_{\pi_0}[q] = 1$ and $\mathbb{E}_{\pi_0}[q \log q] \le r$. Thus the minimization over $\mathcal{B}_r(\pi_0)$ of a well-behaved convex function, which appears difficult because of the infinite dimensionality of the space of measurable functions, transforms into an equivalent dual convex program with only two variables.
We direct the interested reader to [shapiro2017distributionally] for additional details: here we summarize the results that are important for this work. If the objective function of interest is the expectation under $q$ of a measurable quantity of interest $\Psi(\theta)$, then it is an affine function of $q$ and we may use duality to simplify the maximization or minimization of $\mathbb{E}_q[\Psi]$ over $\mathcal{B}_r(\pi_0)$ into a dual problem with only one variable $\lambda$. The maximization problem
$$\sup_{q \in \mathcal{B}_r(\pi_0)} \mathbb{E}_q[\Psi] \quad (5)$$
can be solved in dual form as
$$\inf_{\lambda > 0} \; \lambda r + \lambda \log \mathbb{E}_{\pi_0}\!\left[e^{\Psi/\lambda}\right], \quad (6)$$
and the minimization problem
$$\inf_{q \in \mathcal{B}_r(\pi_0)} \mathbb{E}_q[\Psi] \quad (7)$$
can be solved in dual form as
$$\sup_{\lambda > 0} \; -\lambda r - \lambda \log \mathbb{E}_{\pi_0}\!\left[e^{-\Psi/\lambda}\right]. \quad (8)$$
It is important to note that these are non-parametric results. Given a parameterized family of priors, the gradient of the objective with respect to the parameters is sufficient to compute the optimizer of (5) or (7) within the parametric family because the objective is affine. But (6) and (8) allow us to compute the optimal objective value over the entire ambiguity set without explicitly computing an optimal distribution in the set.
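The dual problems (6) and (8) can be solved numerically from samples of $\Psi(\theta)$ under $\pi_0$. The following sketch is our own illustration (the helper name `dro_bounds` and the test distribution are assumptions, not code from the paper); it uses a bounded one-dimensional search over $\lambda$ with log-sum-exp for numerical stability.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import logsumexp

def dro_bounds(psi, r):
    """Worst/best-case E_q[Psi] over {q : KL(q || pi0) <= r}, estimated
    from i.i.d. samples psi_i = Psi(theta_i) with theta_i ~ pi0, by
    solving the one-variable duals (6) and (8)."""
    n = len(psi)

    # Dual of the maximization (6): inf_lam lam*r + lam*log mean(exp(psi/lam))
    def upper(lam):
        return lam * r + lam * (logsumexp(psi / lam) - np.log(n))

    # Dual of the minimization (8): sup_lam -lam*r - lam*log mean(exp(-psi/lam));
    # we minimize its negation.
    def lower_neg(lam):
        return lam * r + lam * (logsumexp(-psi / lam) - np.log(n))

    up = minimize_scalar(upper, bounds=(1e-6, 1e6), method="bounded").fun
    lo = -minimize_scalar(lower_neg, bounds=(1e-6, 1e6), method="bounded").fun
    return lo, up

rng = np.random.default_rng(0)
psi = rng.normal(1.0, 0.5, size=10_000)
lo, up = dro_bounds(psi, r=0.05)
# The sample mean always lies between the two bounds, and the bounds
# tighten as the radius r shrinks.
```

Because the duals are one-dimensional convex problems in $\lambda$, this post-processing is cheap compared with generating the samples themselves.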
5 Robust Bayesian Experimental Design
The insight of the example of section 3 was that a risk-averse approach to experimental design that allows for some uncertainty in the prior distribution would select the experiment that maximizes the worst-case EIG in the vicinity of the reference prior $\pi_0$. Using the ambiguity set from section 4 to define the vicinity of $\pi_0$, we first formalize the worst-case EIG as the true robust expected information gain with radius $r$,
$$\mathrm{TREIG}_r(\xi) = \min_{q \in \mathcal{B}_r(\pi_0)} \mathrm{EIG}(q, \xi). \quad (9)$$
The experiment that maximizes this quantity is
The optimization problem in this definition has a clear meaning, but we note that $\mathrm{EIG}(q, \xi)$ is concave in $q$, so it can have multiple local minima in the convex ambiguity set and the duality framework of section 4 cannot be applied directly.
5.1 Affine Expected Information Gain Approximation
To define a relaxation to a tractable problem, we split $\mathrm{EIG}(q, \xi)$ into two contributions: one in which the marginal distribution of $y$ is fixed by the reference prior, $p(y \mid \xi)$, and a correction that is the divergence between $q(y \mid \xi)$ and $p(y \mid \xi)$,
$$\mathrm{EIG}(q, \xi) = \mathbb{E}_{q(\theta)}\!\left[ D\big(p(y \mid \theta, \xi) \,\|\, p(y \mid \xi)\big) \right] - D\big(q(y \mid \xi) \,\|\, p(y \mid \xi)\big). \quad (10)$$
We denote the first term in this difference
$$\mathrm{AEIG}(q, \xi) = \mathbb{E}_{q(\theta)}\!\left[ D\big(p(y \mid \theta, \xi) \,\|\, p(y \mid \xi)\big) \right], \quad (11)$$
because it is an approximation to $\mathrm{EIG}(q, \xi)$ that is affine in $q$ and exact when $q = \pi_0$. By the concavity of EIG with respect to its first argument $q$,
$$\mathrm{EIG}(q, \xi) \le \mathrm{AEIG}(q, \xi).$$
By the data processing inequality, which states that the mutual information between two random variables cannot increase under a deterministic or random transformation of the arguments, the error in $\mathrm{AEIG}$ is bounded by
$$0 \le \mathrm{AEIG}(q, \xi) - \mathrm{EIG}(q, \xi) = D\big(q(y \mid \xi) \,\|\, p(y \mid \xi)\big) \le D(q \,\|\, \pi_0).$$
In fact $\mathrm{AEIG}$ is the affine approximation to $\mathrm{EIG}$ that is tangent at $\pi_0$: the function $\mathrm{AEIG}(\cdot, \xi)$ from (11) is tangent to $\mathrm{EIG}(\cdot, \xi)$ at $q = \pi_0$ for every design $\xi$.
It is sufficient to show that the difference between the two functions, which by (10) is $D\big(q(y \mid \xi) \,\|\, p(y \mid \xi)\big)$, has zero gradient at $q = \pi_0$.
We first calculate the gradient with respect to $q$: by the chain rule applied to (4), the derivative in the direction $\delta q$ is
For general distributions $\mu$ and $\nu$, the KL divergence satisfies $D(\mu \,\|\, \nu) \ge 0$ with equality exactly when $\mu = \nu$, so the first term vanishes when $q = \pi_0$. In the second term, when $q = \pi_0$ the denominator cancels with the measure and we have
where we use the fact that $\int p(y \mid \theta, \xi)\,dy = 1$ for all $(\theta, \xi)$.
Finally, we can see that the gradient with respect to $q$ in the direction $\delta q$ is
which vanishes at $q = \pi_0$ for the same reasons as above. ∎
5.2 Robust Expected Information Gain (REIG)
Having shown that $\mathrm{AEIG}$ is a good approximation to $\mathrm{EIG}$ near the reference prior $\pi_0$, we now use it to define a robust quantity that approximates the worst-case criterion (9), which we refer to simply as $\mathrm{REIG}_r$,
$$\mathrm{REIG}_r(\xi) = \min_{q \in \mathcal{B}_r(\pi_0)} \mathrm{AEIG}(q, \xi).$$
These facts suggest that an experiment that maximizes $\mathrm{REIG}_r$,
$$\xi^* = \arg\max_{\xi \in \Xi} \mathrm{REIG}_r(\xi),$$
has similar robustness over perturbations of the prior $\pi_0$, as long as the radius $r$ of the ambiguity set is not too large.
5.3 Computation of via duality
Because $\mathrm{AEIG}(q, \xi)$ is affine in $q$, the minimization defining $\mathrm{REIG}_r$ fits the framework of section 4 with $\Psi(\theta) = D\big(p(y \mid \theta, \xi) \,\|\, p(y \mid \xi)\big)$, so by (8),
$$\mathrm{REIG}_r(\xi) = \sup_{\lambda > 0} \; -\lambda r - \lambda \log \mathbb{E}_{\pi_0}\!\left[ e^{-\Psi(\theta)/\lambda} \right].$$
We will show in section 6 that this 1D convex optimization problem can be solved efficiently by a small adaptation of existing EIG estimators.
5.4 Related Design Criteria
Our definition of $\mathrm{REIG}_r$ was motivated by a risk-aversion argument in favor of the design with the best worst-case EIG in a neighborhood. Because the approximation $\mathrm{AEIG}$ is affine, however, maximization over the ambiguity set can also be solved by duality. This means that the same methodology can be used to define a risk-loving strategy for experimental design, which selects the experiment that has the highest EIG for some prior in the ambiguity set. We call this criterion
$$\mathrm{REIG}^+_r(\xi) = \max_{q \in \mathcal{B}_r(\pi_0)} \mathrm{AEIG}(q, \xi).$$
This criterion is used in section 7 to counteract biased underestimation of EIG by some estimators.
Last, we note that our decision to limit the uncertainty in the models of the experiments to just the prior, and not the likelihood, is arbitrary, at least from the perspective of the methods we have developed. An ambiguity set can be centered around the joint prior $p(\theta)\,p(y \mid \theta, \xi)$ of the model, and an affine approximation can be taken that would allow for optimization over that ambiguity set via duality. The result would be an even more conservative quantity,
We will not explore this criterion more in this work.
6 REIG Estimation via Sampling
The design of efficient estimators for EIG has been the subject of much research, in part because their use in experimental design is computationally demanding. The combination of nested iterations to estimate the densities of implicitly defined distributions, to evaluate the expectation of the EIG, and finally to optimize that quantity leads to many passes over the problem data as well as many model evaluations.
When introducing an implicitly defined quantity like $\mathrm{REIG}_r$, we should be leery of adding another nested loop to the calculation. This is why we immediately discounted the design criterion from (9), which would require optimization in the original variables parameterizing $q$, which could be numerous.
6.1 Constructing a REIG Estimator
When both the prior and the likelihood can be sampled directly, sampling-based approaches to estimating EIG often have a two-level structure: an inner estimator is defined for a fixed $y$ and/or $\theta$ in the integrated quantity, either in (1), in (3), or in (4), and an outer Monte Carlo estimator over $y$ or $(\theta, y)$ calls the inner estimator for each generated sample.
1. Draw $N$ i.i.d. samples $\theta_i$ from $\pi_0$.
2. For each $i$, use an inner estimator to compute an estimate $K_i$ of $D\big(p(y \mid \theta_i, \xi) \,\|\, p(y \mid \xi)\big)$.
3. Solve the 1D convex optimization problem
$$\max_{\lambda > 0} \; -\lambda r - \lambda \log \frac{1}{N} \sum_{i=1}^N e^{-K_i/\lambda}$$
and return the optimal value as the estimate of $\mathrm{REIG}_r(\xi)$.

In this approach the inner estimator is called $N$ times in step 2, which is the same number of times it would have been called to compute the EIG estimator $\frac{1}{N}\sum_i K_i$, but those estimates are saved and treated as an empirical distribution, so that the optimization problem in step 3 solves (18) by sample average approximation (SAA) instead of stochastic approximation (SA). The assumption is that this 1D convex problem will be solved quickly and the dominant cost in algorithm 1 is the cost of computing the KL-divergence estimates $K_i$.
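A minimal sketch of this post-processing step might look as follows. The linear-Gaussian toy model is our own illustration, not a model from the paper; its inner KL divergence has a closed form, so it stands in for a nested estimator such as VNMC or ACE in step 2.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import logsumexp

def reig_from_samples(K, r):
    """Step 3 of algorithm 1: given per-sample estimates K_i of
    KL(p(y|theta_i, xi) || p(y|xi)), solve the 1D dual
    max_{lam > 0} -lam*r - lam*log((1/N) sum_i exp(-K_i/lam))
    by minimizing its negation over a bounded interval."""
    n = len(K)
    dual_neg = lambda lam: lam * r + lam * (logsumexp(-K / lam) - np.log(n))
    res = minimize_scalar(dual_neg, bounds=(1e-6, 1e6), method="bounded")
    return -res.fun

# Toy linear-Gaussian model (an assumption for illustration):
# theta ~ N(0, 1), y | theta ~ N(theta, sigma2), so that
# K(theta) = KL(N(theta, sigma2) || N(0, 1 + sigma2)) is closed-form.
sigma2 = 0.5
rng = np.random.default_rng(1)
theta = rng.normal(size=20_000)
K = (0.5 * np.log((1 + sigma2) / sigma2)
     + (sigma2 + theta**2) / (2 * (1 + sigma2)) - 0.5)

eig = K.mean()                       # plain sample-average EIG estimate
reig = reig_from_samples(K, r=0.1)   # robust value; never exceeds eig
```

As $r \to 0$ the robust value recovers the sample-mean EIG estimate, consistent with $\mathrm{AEIG}$ being exact at $\pi_0$.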
Although the inner optimization in step 3 is solved by SAA, we note that algorithm 1 can be used within either SAA or SA for the optimization over $\xi$, depending on whether the samples in step 1 are reused or not. The derivative can be computed by the chain rule, where $K$ is the vector of estimates from step 2. The partial derivatives $\partial K_i / \partial \xi$ are also present in computing the gradients of EIG estimators, so existing methods for this term can be reused: the derivatives with respect to the $K_i$ are the only additional derivatives needed for REIG. Letting $\lambda^*$ be the optimal value of $\lambda$, if $\lambda^* > 0$ the weights on the $K_i$ are smooth. If $\lambda^* = 0$ there is some $i$ attaining $\min_i K_i$, so that either the gradient concentrates on that sample or it is not unique if the minimizer is not unique.
6.2 EIG Estimators
There are many possible choices for the inner estimator of $K_i$. In fact, any EIG estimator that accepts general prior distributions can be used by running a separate instance for each $\theta_i$ with the corresponding prior distribution. In practice, this approach would have poor performance because sequestering the samples into separate estimator instances would not allow for vectorization across samples. Vectorization and batching are best exploited in a nested EIG estimator if all instances of the inner estimator have the same hyperparameters to maximize throughput.
In the experiments in section 7, we have primarily used the estimators developed in [NEURIPS2019_d55cbf21, foster2020unified, kleinegesse2020bayesian]. The first is the Variational Nested Monte Carlo (VNMC) estimator, which is based on the form of EIG in (4). It draws samples $(\theta_i, y_i)$ from the joint prior, evaluates $p(y_i \mid \theta_i, \xi)$ directly, and then uses an inner estimator for the marginal $p(y_i \mid \xi)$. That estimator is based on the identity
$$p(y \mid \xi) = \mathbb{E}_{q(\theta \mid y)}\!\left[ \frac{p(\theta)\,p(y \mid \theta, \xi)}{q(\theta \mid y)} \right],$$
where $q(\theta \mid y)$ can be any distribution whose support covers that of the posterior, but the variance of that expectation is lower the closer $q(\theta \mid y)$ is to the posterior distribution $p(\theta \mid y, \xi)$. $M$ samples are drawn from $q(\theta \mid y_i)$ for each $y_i$, resulting in the EIG estimator. This estimator is consistent in the limit as $M \to \infty$, but for finite $M$ it is in expectation an upper bound for EIG. For details see [NEURIPS2019_d55cbf21].
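The structure of such a nested estimator can be illustrated on a toy linear-Gaussian model (our own example, not one from the paper). For simplicity the inner proposal is taken to be the prior itself, which reduces VNMC to plain nested Monte Carlo; a learned proposal closer to the posterior would lower the inner variance, as the paper describes.

```python
import numpy as np
from scipy.stats import norm

def nested_mc_eig(xi_sigma, N=2000, M=200, seed=0):
    """Nested Monte Carlo EIG estimate for theta ~ N(0, 1) and
    y | theta ~ N(theta, xi_sigma^2)."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=N)                       # theta_i ~ pi0
    y = theta + xi_sigma * rng.normal(size=N)        # y_i ~ p(y | theta_i)
    log_lik = norm.logpdf(y, loc=theta, scale=xi_sigma)
    inner = rng.normal(size=(N, M))                  # theta_im ~ q = pi0
    # With q equal to the prior, the importance weight
    # p(theta) p(y | theta) / q(theta) collapses to p(y | theta_im).
    log_w = norm.logpdf(y[:, None], loc=inner, scale=xi_sigma)
    log_marg = np.logaddexp.reduce(log_w, axis=1) - np.log(M)
    return np.mean(log_lik - log_marg)

est = nested_mc_eig(xi_sigma=1.0)
exact = 0.5 * np.log(1 + 1 / 1.0**2)   # closed-form EIG for this model
# For finite M the nested estimator is, in expectation, an upper bound.
```

The per-sample terms `log_lik - log_marg` are exactly the $K_i$ estimates that algorithm 1 consumes in step 2.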
The second estimator that we use is the Adaptive Contrastive Estimator (ACE) from [foster2020unified]. In description it is almost identical to VNMC, except that in estimating the marginal $p(y_i \mid \xi)$ we add to the $M$ samples the original sample $\theta_i$ drawn from $\pi_0$ that generated $y_i$, so the inner sum runs over $M + 1$ samples. This is also a consistent estimator of EIG, but the addition of the prior sample to the estimator for the marginal makes it in expectation a lower bound for EIG for finite $M$: see [foster2020unified] for more details.
The last estimator that we use is Mutual Information Neural Estimation (MINE), which trains a critic $T_\psi(\theta, y)$ to approximate the density ratio from samples and estimates the EIG with the SAA method [kleinegesse2020bayesian],
$$\widehat{\mathrm{EIG}} = \frac{1}{N} \sum_{i=1}^N T_\psi(\theta_i, y_i) - \log \frac{1}{N} \sum_{i=1}^N e^{T_\psi(\theta_i, \bar{y}_i)},$$
where $\bar{y}$ represents the shuffled outcomes (samples of $y$ decoupled from $\theta$) and $T_\psi$ is a neural network with parameters $\psi$.
The outer loops of the ACE, VNMC, and MINE methods draw from $\pi_0$ or from the joint prior, and in the implementations provided by the authors one sample of $y$ is drawn for each of the $N$ samples drawn from $\pi_0$. To adapt these methods to the needs of our algorithm 1, we instead draw several samples of $y$ from $p(y \mid \theta_i, \xi)$ for each of the $N$ samples $\theta_i$ drawn from $\pi_0$. We then take the mean over the $y$ samples for our estimate of $K_i$ on line 2 of our algorithm.
6.3 REIG as Log-Sum-Exp Stabilization
Step 3 of algorithm 1 shows that the convex dual objective function for the $\mathrm{REIG}_r$ design criterion manifests as a $\lambda$-biased and $\lambda$-weighted log-sum-exp combination of the estimates $K_i$.
From the bounds for log-sum-exp operators we have
$$\min_i K_i \;\le\; -\lambda \log \frac{1}{N} \sum_{i=1}^N e^{-K_i/\lambda} \;\le\; \frac{1}{N} \sum_{i=1}^N K_i.$$
When the ambiguity set radius $r$ is smaller, the optimal $\lambda^*$ is larger, and the combination becomes more like the sample mean; when $r$ is larger, the optimal $\lambda^*$ is smaller until eventually the constraint $\lambda \ge 0$ becomes binding. In that case the limit as $\lambda \to 0^+$ is achieved and the value is squeezed to become $\min_i K_i$.
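The interpolation between the sample mean and the sample minimum can be checked directly; the $K_i$ values below are made up for illustration.

```python
import numpy as np
from scipy.special import logsumexp

# Toy KL estimates K_i (made-up values); 4.0 mimics a sample whose
# marginal likelihood was underestimated, inflating its contribution.
K = np.array([0.2, 0.5, 0.9, 4.0])

def soft_min(K, lam):
    """-lam * log((1/N) sum_i exp(-K_i / lam)): the lam-weighted
    log-sum-exp combination from step 3 of algorithm 1."""
    return -lam * (logsumexp(-K / lam) - np.log(len(K)))

print(soft_min(K, 100.0))   # large lam (small r): close to mean(K) = 1.4
print(soft_min(K, 0.01))    # small lam (large r): close to min(K) = 0.2
```

Decreasing $\lambda$ smoothly discounts the inflated sample, which is the stabilization effect discussed below.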
We interpret these facts in the following way: $\mathrm{REIG}_r$ biases the estimate towards the smaller values of $K_i$, the more so the larger $r$ is. A well-recognized problem in EIG estimation [foster2020unified] is the presence of samples where the marginal likelihood is underestimated, leading to artificially inflated EIG estimates.
We have until this point considered $\mathrm{REIG}_r$ as a design criterion in its own right, but this biasing behavior suggests that the use of algorithm 1 with an appropriate choice of $r$ can also be useful as an estimator for the original EIG criterion, whose bias protects the estimate from sampling error. This may be a viable approach for stabilizing the computed EIG value when there are insufficient samples to converge the inner estimates well in the nested sampling approaches.
Since VNMC is in expectation an upper bound while ACE and MINE are lower bounds for EIG, we define an estimator that uses VNMC as the estimator for the minimization (18), and estimators based on the risk-loving maximization that use ACE and MINE, respectively. The minimization / maximization in each case acts counter to the bias of the sampling process, which we measure in the following section.
7 Numerical Experiments

We perform two types of numerical experiments. In the first type, we test how well $\mathrm{REIG}_r$ performs as a "worst-case" EIG estimate, as discussed in section 5. In the second type, we compare the previously developed EIG estimators from section 6.2 with a large number of samples to the robust estimators with a small number of samples, to see whether the bias against large summands described in section 6.3 is as effective as additional samples in stabilizing the computation.
7.1 REIG as a Worst-Case Estimate
This test uses the Preference model test case from [NEURIPS2019_d55cbf21]. The parameter $\theta$ is a location parameter with a reference prior $\pi_0$ that is normally distributed, and the experiments are indexed by locations relative to the prior mean.
To test how well $\mathrm{REIG}_r$ serves as a worst-case estimate, we select a second prior $\pi_1$ such that the EIG curves under $\pi_0$ and $\pi_1$ look visibly distinct over the range of possible experiments in fig. 2, and then measure the KL-divergence between $\pi_1$ and $\pi_0$ and take this number to be the radius $r$. We then compute the $\mathrm{REIG}_r$ criterion for all experiments as well, and plot them in fig. 2. (In all of these calculations we use a sampling estimator with a large number of samples, so we can be reasonably sure that the values are converged.) What we see is that $\mathrm{REIG}_r$ succeeds at being a lower bound for both EIG curves in this case. We also see that this is not the result of a uniform or log-uniform scale reduction: $\mathrm{REIG}_r$ has more drastically reduced the gain of some experiments than others, meaning that it determines a different optimal experiment than the reference EIG.
7.2 EIG Stabilization via REIG
We now test the effectiveness of ’s log-sum-exp stabilization at producing a stabilized estimation of . We use the benchmarks designed by [NEURIPS2019_d55cbf21, kleinegesse2020bayesian], especially three experimental designs which have explicit models for the likelihood distribution: A/B test, Preference, and Pharmacokinetic model.
7.2.1 A/B test
An A/B test [kohavi2009controlled, box1978statistics] can be used to determine which experiment in a set of two or more results in the largest information gain. [NEURIPS2019_d55cbf21] introduces the group size selection problem between groups A and B: given a fixed number of experiment participants, we can select how many participants are assigned to group A and how many to group B. Each participant is represented as two random variables: the first is measured for participants in group A, and the second for group B. This A/B test models each participant's random variables using a Bayesian linear model; the prior and likelihood distributions are Gaussian.
Each of the estimators under consideration uses a neural network that must be trained to generate samples. The proposal distribution used by the VNMC and ACE estimators is trained with the same settings as in [NEURIPS2019_d55cbf21]: more details about the proposal distribution can be found there. Likewise, the parameters of the neural network in MINE are trained for each design with samples from the given distributions.
In our work, we start from a different reference prior distribution with a shifted mean. We find that the VNMC estimator shows considerably more error for some designs with a small number of samples (fig. 3, top left), and requires more samples to converge. MINE, on the other hand, shows little difference once enough samples are used (fig. 3, bottom left).
In contrast to that larger number of samples, we compute the robust estimators using only a small number of posterior samples, with an increasing series of ambiguity set radii (fig. 3, right). The estimate is almost unaffected by the ambiguity set when $r$ is small, but increasing $r$ seems to improve the estimates for the designs that were highly overestimated, without negatively affecting the other experiments that were already properly estimated.
7.2.2 Preference

Another benchmark that we use to test the effectiveness of the log-sum-exp stabilization is the Preference test. The Preference experiment is designed to understand consumer behavior through a utility function [samuelson1948consumption, NEURIPS2019_d55cbf21]. The experiment presents a proposal to subjects and records their preference as an output to use in the utility function.
We use the experiment from [NEURIPS2019_d55cbf21]. We start from a different reference prior distribution with a shifted mean, and measure the KL divergence from the original prior distribution. The VNMC estimator in our test shows more error with a small number of samples M = 30 (fig. 4, top left), while the ACE estimator shows accurate estimates of the EIG with a small number of marginalization samples (fig. 4, middle left). The MINE estimator, on the other hand, estimates the EIG to be lower than the true EIG (fig. 4, bottom left), but the optimal experimental set-up is the same as with the true EIG.
We further compute the robust estimators using only M = 30 samples and MINE, with an increasing series of ambiguity set radii (fig. 4, right). By applying an ambiguity set of radius 0.01, the estimation bounds become tight. If we apply a larger ambiguity set radius, the lower bound (ACE) becomes higher than the true EIG while the upper bound (VNMC) becomes lower than the true EIG (fig. 4, right); the two are then no longer an upper and lower bound. The robust maximization applied to MINE increases the EIG estimate and changes the optimal experimental set-up (fig. 4, bottom right).
7.2.3 Pharmacokinetic model
The last benchmark that we use to test the effectiveness of REIG is a pharmacokinetic study. A pharmacokinetic study is designed to understand how a medicine is absorbed, distributed, and eliminated in a subject's body, which can be described with a compartmental model. We can estimate the compartmental model's parameters from blood samples, and we want to calculate the optimal blood sampling times [ryan2014towards, kleinegesse2020bayesian].
We start from a different reference prior distribution with a shifted mean, whose KL divergence from the original prior distribution is 0.1. The accuracy of the VNMC and ACE estimators in our tests increases with M (fig. 5, top left, middle left), while MINE, on the other hand, converges to a substantial underestimate (fig. 5, bottom left), yet identifies the same optimal experiment.
We further compute the robust estimators using only M = 30 samples, with an increasing series of ambiguity set radii (fig. 5, right). By applying an ambiguity set of radius 0.1, the lower bound (ACE) becomes higher than the true EIG while the upper bound (VNMC) becomes lower than the true EIG (fig. 5, right); the two are then no longer an upper and lower bound. The MINE estimator with the robust maximization, on the other hand, improves the EIG estimation so that the estimate is close to the true EIG (fig. 5, bottom right).
8 Conclusion

This work presents an introduction and initial numerical experiments testing the use of a robust modification of expected information gain as a criterion for Bayesian experimental design. The REIG estimator is designed both to have rigorously defined approximation properties (as the minimization of a tangent approximation to the EIG over a convex ambiguity set) and to yield a practical algorithm (algorithm 1, which post-processes the samples of previously defined estimators by the solution of a 1D convex optimization problem).
While our initial results are promising, our understanding of how to apply this method is not yet complete. Most of the remaining questions relate to the choice of the radius $r$ of the ambiguity set over which the approximated EIG is taken to be robust. A definite a priori estimate for $r$ seems unlikely in most cases.
In an approach analogous to Morozov's discrepancy principle, a logical choice of $r$ would be one such that the implied radius of the ambiguity set is on the same order as the error that the sampling-based estimator (such as VNMC or ACE) has introduced into the problem. The fact that these two estimators provide (in expectation) an upper and a lower bound for the true EIG of a design suggests that there may be a way to combine them into an a posteriori estimate of the proper choice of $r$. We leave this topic to future research.
Acknowledgements. We would like to thank Alexander Shapiro for several stimulating discussions on ambiguity sets.
Appendix A Experiments Details
a.1 A/B test
The reference prior and likelihood for the A/B test were as follows:
a.2 Preference

We use the utility function from [NEURIPS2019_d55cbf21].
a.3 Pharmacokinetic model
The prior distribution and noise distributions for the pharmacokinetic model are defined as follows,
where , and .