The motivation for this research comes from the construction of optimal Bayesian experimental designs, for which the so-called expected information gain has often been employed as a quality criterion of experimental designs, see for instance [3, 16, 12, 13, 1]. Let $\theta$ be a (possibly multi-dimensional) random variable which represents the uncertain quantity of interest. By collecting relevant data $Y$ (which is again possibly multi-dimensional) through carrying out some experiments under an experimental setup $\xi$, one can expect that the uncertainty of $\theta$, measured by the information entropy, can be reduced. The aim of Bayesian experimental designs is to find an optimal experimental setup $\xi$ such that the expected amount of the information entropy reduction about $\theta$ is maximized.
In what follows, we give a formal definition of the expected information gain for a particular experimental setup $\xi$. The information entropy of $\theta$ before collecting data is given by
\[ H(\theta) = -\mathbb{E}_{p(\theta)}\left[\log p(\theta)\right], \]
where $p(\theta)$ denotes the prior probability density function of $\theta$, and the expectation is taken with respect to the same $p(\theta)$. On the other hand, after collecting data $Y$, the conditional information entropy of $\theta$ is
\[ H(\theta\mid Y) = -\mathbb{E}_{p(\theta\mid Y)}\left[\log p(\theta\mid Y)\right], \]
where $p(\theta\mid Y)$ denotes the posterior probability density function of $\theta$ given $Y$, and the expectation here is taken with respect to $p(\theta\mid Y)$ instead of $p(\theta)$. Thus the expected conditional information entropy of $\theta$ by collecting data $Y$ is
\[ \mathbb{E}_{p(Y)}\left[H(\theta\mid Y)\right] = -\mathbb{E}_{p(Y)}\mathbb{E}_{p(\theta\mid Y)}\left[\log p(\theta\mid Y)\right]. \]
Now the expected information gain $U$ measures the average amount of the reduction of the information entropy about $\theta$ by collecting data $Y$, and is defined by the difference
\[ U = H(\theta) - \mathbb{E}_{p(Y)}\left[H(\theta\mid Y)\right] = \mathbb{E}_{p(Y)}\mathbb{E}_{p(\theta\mid Y)}\left[\log\frac{p(\theta\mid Y)}{p(\theta)}\right]. \tag{1} \]
Here the inner expectation appearing in the right-most side is nothing but the Kullback–Leibler divergence between $p(\theta\mid Y)$ and $p(\theta)$. In the context of Bayesian experimental designs, it is considered that the larger the value of $U$, the more informative the data $Y$ is about $\theta$, and thus the better the experimental design $\xi$ is. This is why the expected information gain is used as a quality criterion of experimental designs.
Let us consider the following data model:
\[ Y_\xi = f_\xi(\theta) + \varepsilon_\xi, \tag{2} \]
where the function $f_\xi$ represents the deterministic part of the model response, which depends on $\theta$ and $\xi$, and $\varepsilon_\xi$ denotes the stochastic part of the model response, i.e., the measurement error. It is often the case that $\varepsilon_\xi$ is assumed to be zero-mean Gaussian with covariance matrix $\Sigma$. Throughout this paper we assume that the prior probability density function of $\theta$, the function $f_\xi$, and the probability distribution of $\varepsilon_\xi$ are given. As considered in [12, 13, 1], this data model can be extended to allow the repetition of experiments as
\[ Y_\xi = \left(f_\xi(\theta) + \varepsilon_{\xi,1}, \ldots, f_\xi(\theta) + \varepsilon_{\xi,M}\right), \]
where $M$ is the number of repetitive experiments and $\varepsilon_{\xi,1}, \ldots, \varepsilon_{\xi,M}$ are independent and identically distributed (i.i.d.) measurement errors. This extended model, however, can easily be rewritten in the form of (2) by concatenation with a suitable choice of $f_\xi$ and $\varepsilon_\xi$. Thus we stick to the original model (2) in this paper. Moreover, since we shall consider a fixed experimental setup $\xi$, we omit the subscript $\xi$ and simply write $Y$, $f$, $\varepsilon$, etc., when distinguishing different $\xi$'s is not important.
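To make the setting concrete, the following sketch draws joint samples $(\theta, Y)$ from a hypothetical instance of the data model (2); the scalar prior, the response function and the noise level are illustrative assumptions made only for this example, not part of the model above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical instance of the data model Y = f(theta) + eps:
# the prior, the response f, and the noise level are illustrative choices.
sigma_eps = 0.1                      # noise standard deviation (assumed)

def f(theta):
    """Deterministic part of the model response (toy choice)."""
    return np.square(theta)

def sample_prior(n):
    """Draw n samples of theta from an assumed standard normal prior."""
    return rng.standard_normal(n)

def sample_data(theta):
    """Draw Y | theta by adding zero-mean Gaussian measurement error."""
    return f(theta) + sigma_eps * rng.standard_normal(np.shape(theta))

theta = sample_prior(5)
y = sample_data(theta)
```

Any forward-sampling routine of this shape is all that the estimators below require: samples of $\theta$ from the prior and samples of $Y$ conditionally on $\theta$.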
The aim of this paper is to develop an efficient Monte Carlo algorithm for estimating the expected information gain $U$. In the next section, we introduce the standard, nested Monte Carlo method as a classical algorithm to estimate $U$, and give a brief review of the relevant literature. Then in Section 3, after introducing the concept of a multilevel Monte Carlo (MLMC) method, we construct an MLMC estimator for $U$ as an alternative, more efficient algorithm. We prove under a sufficient condition on the data model that the MLMC estimator can estimate $U$ with a root-mean-square accuracy $\varepsilon$ at the optimal computational complexity $O(\varepsilon^{-2})$. (Here and in what follows, the measurement noise $\varepsilon$ and the accuracy $\varepsilon$ should not be confused.) Moreover, we discuss how to incorporate importance sampling techniques within the MLMC estimator, which might be quite useful in practical applications. Numerical experiments in Section 4 confirm the considerable computational savings compared to the nested Monte Carlo method, not only for a simple test case but also for a more realistic pharmacokinetic model.
2 Nested Monte Carlo
The nested Monte Carlo (NMC) method is the most standard approach to estimate the expected information gain [16, 12, 1, 14]. Given the data model (2), it is straightforward to generate i.i.d. random samples of $Y$ given a particular value of $\theta$, and also those of $\theta$ itself. Besides, since $Y - f(\theta)$ follows the probability distribution of $\varepsilon$, it is easy to compute the likelihood $p(Y\mid\theta)$ for given $Y$ and $\theta$. On the other hand, it is usually hard to generate i.i.d. random samples of $\theta$ given a particular value of $Y$, and to compute $p(\theta\mid Y)$ and $p(Y)$ for given $\theta$ and $Y$.
Based on this fact, we use Bayes’ theorem
\[ p(\theta\mid Y) = \frac{p(Y\mid\theta)\,p(\theta)}{p(Y)} \]
to rewrite the expected information gain (1) into
\[ U = \mathbb{E}_{p(Y)}\mathbb{E}_{p(\theta\mid Y)}\left[\log\frac{p(Y\mid\theta)}{p(Y)}\right] = \mathbb{E}_{p(\theta)}\mathbb{E}_{p(Y\mid\theta)}\left[\log\frac{p(Y\mid\theta)}{p(Y)}\right]. \tag{3} \]
With this form of $U$, the NMC estimator for the expected information gain is given by
\[ \widehat{U}_{\mathrm{NMC}} = \frac{1}{N}\sum_{n=1}^{N}\left[\log p\left(Y^{(n)}\mid\theta^{(n)}\right) - \log\left(\frac{1}{M}\sum_{m=1}^{M} p\left(Y^{(n)}\mid\theta^{(n,m)}\right)\right)\right] \tag{4} \]
for $N, M \in \mathbb{N}$, where $\theta^{(n)}$ and $\theta^{(n,m)}$ denote i.i.d. random samples of $\theta$ drawn from the prior, and $Y^{(n)}$ denotes a random sample of $Y$ generated conditionally on $\theta^{(n)}$.
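As a sanity check of an estimator of this nested form, the following sketch runs it on a toy conjugate Gaussian model (an assumption made only for this example), for which the expected information gain is available in closed form:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy conjugate model (assumed for illustration only):
# theta ~ N(0, 1) and Y | theta ~ N(theta, sigma^2).
sigma = 0.5

def log_lik(y, theta):
    """log p(y | theta) for the Gaussian toy model."""
    return -0.5 * np.log(2.0 * np.pi * sigma**2) - (y - theta) ** 2 / (2.0 * sigma**2)

def nmc_eig(n_outer, m_inner):
    """Nested Monte Carlo estimate of the expected information gain:
    outer samples (theta_n, y_n) from the joint distribution, and
    fresh inner prior samples to estimate the evidence p(y_n)."""
    total = 0.0
    for _ in range(n_outer):
        theta_n = rng.standard_normal()
        y_n = theta_n + sigma * rng.standard_normal()
        theta_inner = rng.standard_normal(m_inner)
        # Monte Carlo estimate of the evidence p(y_n) = E_theta[p(y_n | theta)]
        evidence = np.mean(np.exp(log_lik(y_n, theta_inner)))
        total += log_lik(y_n, theta_n) - np.log(evidence)
    return total / n_outer

estimate = nmc_eig(n_outer=2000, m_inner=200)
# For this conjugate model the exact gain is 0.5 * log(1 + 1/sigma^2).
exact = 0.5 * np.log(1.0 + 1.0 / sigma**2)
```

The estimate should land close to the closed-form value, with the residual discrepancy reflecting the $O(1/M)$ bias and $O(1/N)$ variance discussed below.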
Ryan showed that the bias and the variance of the NMC estimator are of $O(1/M)$ and of $O(1/N)$, respectively. Since the mean square error of the NMC estimator is given by the sum of the variance and the squared bias, $U$ can be estimated with a root-mean-square accuracy $\varepsilon$ by using $N = O(\varepsilon^{-2})$ and $M = O(\varepsilon^{-1})$ samples. Assuming that each computation of $f$, which is necessary for calculating $p(Y\mid\theta)$, can be performed with unit cost, the total computational complexity is $O(NM) = O(\varepsilon^{-3})$.
Much more recently, Beck et al. provided a thorough error analysis of the NMC estimator and derived the optimal allocation of $N$ and $M$ for a given accuracy $\varepsilon$. In fact, they consider the situation where $f$ cannot be computed exactly and only its discretized approximation $f_h$ with a mesh discretization parameter $h$ is available. Here it is assumed that, as $h$ gets smaller, $f_h$ approaches $f$, but at the same time, the computational cost of evaluating $f_h$ increases. Therefore, their optimization deals not only with the numbers of samples $N$ and $M$ but also with the parameter $h$. In this paper, we assume that $f$ can be computed exactly, so that dealing with such situations is left open for future work, see Section 5.
More importantly, Beck et al. incorporated importance sampling based on the Laplace approximation within the NMC estimator. This approach is quite useful in reducing the number of inner samples $M$ substantially, and also in mitigating the risk of so-called arithmetic underflow: when $p(Y\mid\theta)$ (as a function of $\theta$ for a fixed $Y$) is highly concentrated around a certain value of $\theta$, the Monte Carlo estimate of the inner expectation
\[ \frac{1}{M}\sum_{m=1}^{M} p\left(Y^{(n)}\mid\theta^{(n,m)}\right) \approx p\left(Y^{(n)}\right) \]
appearing in (4) can be numerically zero. Taking the logarithm of $0$ of course returns an error. This can happen in practice, especially for small measurement noise $\varepsilon$. Therefore, it is desirable to apply a change of measure such that most of the samples of $\theta$ are concentrated properly depending on $Y^{(n)}$, which is exactly what the Laplace-based importance sampling aims to do. We note, however, that using importance sampling does not improve the order of computational complexity, so that the necessary cost of $O(\varepsilon^{-3})$ remains unchanged.
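Independently of the Laplace-based importance sampling, the underflow itself can also be avoided at the purely numerical level by evaluating the inner average in log space with the standard log-sum-exp trick; the concrete log-likelihood values below are hypothetical:

```python
import numpy as np

# Hypothetical log-likelihood values log p(y | theta_m) for a highly
# concentrated likelihood: exp() underflows to 0, so the naive inner
# average yields log(0) = -inf.
log_liks = np.array([-800.0, -805.0, -810.0])

with np.errstate(divide="ignore"):
    naive = np.log(np.mean(np.exp(log_liks)))  # exp underflows -> -inf

def log_mean_exp(a):
    """Numerically stable log of the mean of exp(a) (log-sum-exp trick)."""
    a_max = np.max(a)
    return a_max + np.log(np.mean(np.exp(a - a_max)))

stable = log_mean_exp(log_liks)  # finite value close to -801.09
```

This only fixes the floating-point failure mode; unlike the change of measure, it does not reduce the number of inner samples needed.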
3 Multilevel Monte Carlo
3.1 Basic theory of MLMC
In order to reduce the total computational complexity from $O(\varepsilon^{-3})$ to $O(\varepsilon^{-2})$, we consider applying a multilevel Monte Carlo (MLMC) method [4, 5]. The MLMC method has already been applied to estimate nested expectations of the form
\[ \mathbb{E}_{X}\left[g\left(\mathbb{E}_{W}\left[h(X, W)\right]\right)\right] \]
for independent random variables $X$ and $W$, where the outer expectation is taken with respect to $X$ and the inner one is taken with respect to $W$, see [2, 5, 6, 7]. In particular, the case where $g$ is smooth has been briefly discussed in [5, Section 9]. In this paper we make a rigorous argument when $g$ is a logarithmic function.
Before introducing an MLMC estimator for the expected information gain, we give an overview of the MLMC method. Let $P$ be a random output variable which cannot be sampled exactly, and let $P_0, P_1, \ldots$ be a sequence of random variables which approximate $P$ with increasing accuracy but also with increasing cost. The problem here is to estimate $\mathbb{E}[P]$ efficiently.
For $L \in \mathbb{N}$ we have the following telescoping sum
\[ \mathbb{E}[P_L] = \mathbb{E}[P_0] + \sum_{\ell=1}^{L}\mathbb{E}[P_\ell - P_{\ell-1}]. \tag{5} \]
The standard Monte Carlo method estimates the left-hand side directly by
\[ \frac{1}{N}\sum_{n=1}^{N} P_L^{(n)}, \tag{6} \]
where $P_L^{(1)}, \ldots, P_L^{(N)}$ denote i.i.d. random samples of $P_L$. The mean square error of (6) is given by the sum of the variance and the squared bias:
\[ \mathbb{E}\left[\left(\frac{1}{N}\sum_{n=1}^{N} P_L^{(n)} - \mathbb{E}[P]\right)^{2}\right] = \frac{\mathbb{V}[P_L]}{N} + \left(\mathbb{E}[P_L - P]\right)^{2}. \tag{7} \]
The MLMC method, on the other hand, independently estimates each term on the right-hand side of (5). In general, if we have a sequence of random variables $Z_0, Z_1, \ldots$ which satisfy $\mathbb{E}[Z_0] = \mathbb{E}[P_0]$ and $\mathbb{E}[Z_\ell] = \mathbb{E}[P_\ell - P_{\ell-1}]$ for $\ell \geq 1$, the MLMC estimator is given by
\[ \widehat{P}_{\mathrm{MLMC}} = \sum_{\ell=0}^{L}\frac{1}{N_\ell}\sum_{n=1}^{N_\ell} Z_\ell^{(n)}, \tag{8} \]
where the $Z_\ell^{(n)}$ denote i.i.d. random samples of $Z_\ell$, drawn independently across different levels. The mean square error of $\widehat{P}_{\mathrm{MLMC}}$ is
\[ \mathbb{E}\left[\left(\widehat{P}_{\mathrm{MLMC}} - \mathbb{E}[P]\right)^{2}\right] = \sum_{\ell=0}^{L}\frac{\mathbb{V}[Z_\ell]}{N_\ell} + \left(\mathbb{E}[P_L - P]\right)^{2}. \tag{9} \]
For the same underlying stochastic sample, $P_\ell$ and $P_{\ell-1}$ can be well correlated, and thus the variance $\mathbb{V}[Z_\ell]$ is expected to get smaller as the level $\ell$ increases. This means that, in order to estimate $\mathbb{E}[P_\ell - P_{\ell-1}]$ with a fixed root-mean-square accuracy, the necessary number of samples $N_\ell$ decreases as $\ell$ increases, and, as a consequence, most of the samples are allocated on the smaller levels for estimating $\mathbb{E}[P_0]$. Since each computation of $Z_\ell$ is assumed to be cheaper for smaller $\ell$, the overall computational cost can be significantly reduced compared to the standard Monte Carlo method.
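This mechanism can be sketched with a generic MLMC driver and an artificial coupled hierarchy (the toy hierarchy below is an assumption for illustration only, not the estimator constructed in Section 3.2):

```python
import numpy as np

rng = np.random.default_rng(2)

def mlmc_estimate(sample_Z, n_samples):
    """Generic MLMC driver: sample_Z(l, n) returns n i.i.d. samples of the
    level-l correction Z_l, and n_samples[l] = N_l is the number of samples
    on level l (typically decreasing in l). Returns the sum over levels of
    the per-level sample means, cf. the telescoping sum (5)."""
    return sum(np.mean(sample_Z(l, n)) for l, n in enumerate(n_samples))

# Artificial coupled hierarchy (assumed for this illustration):
# P_l = X + 2^{-l} W with shared X ~ N(1, 1) and W ~ N(0, 1), so that
# E[P_l] = 1 for every l, and the coupled correction
# Z_l = P_l - P_{l-1} = -2^{-l} W has variance 4^{-l}, decaying with l.
def sample_Z(l, n):
    if l == 0:
        return (rng.standard_normal(n) + 1.0) + rng.standard_normal(n)  # Z_0 = P_0
    return -(2.0 ** -l) * rng.standard_normal(n)                        # Z_l

estimate = mlmc_estimate(sample_Z, n_samples=[4000, 1000, 250, 60])
```

Note how the decreasing sample sizes mirror the decaying level variances: almost all of the work is spent on level $0$, while the higher levels only correct the bias.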
Theorem (see, e.g., [4, 5]). Let $P$ be a random variable and let $P_\ell$ denote the corresponding level-$\ell$ approximation of $P$. If there exist independent random variables $Z_\ell$ with expected cost $C_\ell$ and variance $V_\ell$, and positive constants $\alpha, \beta, \gamma, c_1, c_2, c_3$ such that $\alpha \geq \frac{1}{2}\min(\beta, \gamma)$ and
\[ \left|\mathbb{E}[P_\ell - P]\right| \leq c_1 2^{-\alpha\ell}, \qquad \mathbb{E}[Z_\ell] = \begin{cases} \mathbb{E}[P_0], & \ell = 0,\\ \mathbb{E}[P_\ell - P_{\ell-1}], & \ell \geq 1,\end{cases} \qquad V_\ell \leq c_2 2^{-\beta\ell}, \qquad C_\ell \leq c_3 2^{\gamma\ell}, \]
then there exists a positive constant $c_4$ such that for any $\varepsilon < e^{-1}$ there are $L$ and $N_0, \ldots, N_L$ for which the MLMC estimator (8) has a mean square error less than $\varepsilon^2$ with a computational complexity $C$ with bound
\[ \mathbb{E}[C] \leq \begin{cases} c_4\,\varepsilon^{-2}, & \beta > \gamma,\\ c_4\,\varepsilon^{-2}\left(\log \varepsilon\right)^{2}, & \beta = \gamma,\\ c_4\,\varepsilon^{-2-(\gamma-\beta)/\alpha}, & \beta < \gamma. \end{cases} \]
As discussed for instance in [6, Section 2.1], the computational complexity for the standard Monte Carlo estimator to have a mean square error less than $\varepsilon^2$ is of $O(\varepsilon^{-2-\gamma/\alpha})$. Thus, regardless of the values of $\beta$ and $\gamma$, the MLMC estimator has an asymptotically better complexity bound than the standard Monte Carlo estimator.
3.2 MLMC estimator for expected information gains
Here we introduce an MLMC estimator for the expected information gain. First let us define a random output variable
\[ P = \log p(Y\mid\theta) - \log p(Y), \]
where $Y$ is distributed conditionally on the random variable $\theta$ appearing in the first term. It is obvious that $P$ cannot be computed exactly because of the expectation $p(Y) = \mathbb{E}_{p(\theta')}\left[p(Y\mid\theta')\right]$ appearing in the second term. We can introduce a sequence of approximations of $P$ with increasing accuracy but also with increasing cost as follows:
\[ P_\ell = \log p(Y\mid\theta) - \log\left(\frac{1}{M_\ell}\sum_{m=1}^{M_\ell} p\left(Y\mid\theta^{(m)}\right)\right) \]
for an increasing sequence $(M_\ell)_{\ell \geq 0}$ such that $M_\ell \to \infty$ as $\ell \to \infty$, where $\theta^{(1)}, \ldots, \theta^{(M_\ell)}$ are i.i.d. random samples of $\theta$ drawn from the prior, independently of $(\theta, Y)$. That is, the average inside the second logarithm is the standard Monte Carlo estimator of $p(Y)$ using $M_\ell$ random samples of $\theta$. Thus we have $\mathbb{E}[P] = U$. Note that the standard, nested Monte Carlo estimator (4) is essentially the same as (6) with $P_L$ given as above for a fixed $L$.
In what follows, let $M_\ell = 2^{\ell} M_0$ for some $M_0 \in \mathbb{N}$ for all $\ell \geq 0$, i.e., we consider a geometric sequence for $(M_\ell)_{\ell \geq 0}$. Then a sequence of corrections $Z_0, Z_1, \ldots$ is defined as follows: $Z_0$ is the same as $P_0$, given by
\[ Z_0 = \log p(Y\mid\theta) - \log\left(\frac{1}{M_0}\sum_{m=1}^{M_0} p\left(Y\mid\theta^{(m)}\right)\right). \]
For $\ell \geq 1$, the simplest choice is
\[ Z_\ell = P_\ell - P_{\ell-1}, \]
where the first $M_{\ell-1}$ random samples of $\theta$ used in the first term are also used in the second term. However, according to [8, 2, 5, 6], we can consider a better “tight coupling” of $P_\ell$ and $P_{\ell-1}$. Namely, the set of $M_\ell$ random samples of $\theta$ used to compute $P_\ell$ is divided into two disjoint sets of $M_{\ell-1}$ samples each, which are used to compute two independent realizations of $P_{\ell-1}$, denoted by $P_{\ell-1}^{(a)}$ and $P_{\ell-1}^{(b)}$, respectively. In this way $Z_\ell$ is given by
\[ Z_\ell = P_\ell - \frac{1}{2}\left(P_{\ell-1}^{(a)} + P_{\ell-1}^{(b)}\right) = -\log \bar{p}_\ell + \frac{1}{2}\left(\log \bar{p}_{\ell-1}^{(a)} + \log \bar{p}_{\ell-1}^{(b)}\right), \tag{10} \]
where
$\bar{p}_\ell$ denotes the average of $p(Y\mid\theta^{(m)})$ over the $M_\ell$ random samples of $\theta$ (note that we omit the superscript $\ell$ on $\theta^{(m)}$ since it is clear from the level of $\bar{p}_\ell$);
$\bar{p}_{\ell-1}^{(a)}$ denotes the average of $p(Y\mid\theta^{(m)})$ over the first $M_{\ell-1}$ random samples of $\theta$ used in $\bar{p}_\ell$;
$\bar{p}_{\ell-1}^{(b)}$ denotes the average of $p(Y\mid\theta^{(m)})$ over the second $M_{\ell-1}$ random samples of $\theta$ used in $\bar{p}_\ell$,
for a randomly generated $(\theta, Y)$. Because of the independence of $P_{\ell-1}^{(a)}$ and $P_{\ell-1}^{(b)}$, each of which has the same distribution as $P_{\ell-1}$, we see that $\mathbb{E}[Z_\ell] = \mathbb{E}[P_\ell - P_{\ell-1}]$. Moreover, it is important that the following “antithetic” property of $\bar{p}_\ell$ holds:
\[ \bar{p}_\ell = \frac{1}{2}\left(\bar{p}_{\ell-1}^{(a)} + \bar{p}_{\ell-1}^{(b)}\right). \tag{11} \]
Due to the concavity of the logarithm, $Z_\ell$ is always non-positive when $\bar{p}_{\ell-1}^{(a)}$ and $\bar{p}_{\ell-1}^{(b)}$ are positive.
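Both the cancellation of the $\log p(Y\mid\theta)$ terms in $Z_\ell$ and its non-positivity can be checked numerically; in the sketch below the likelihood values $p(Y\mid\theta^{(m)})$ are simply drawn from a uniform distribution for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

def antithetic_Z(lik):
    """Antithetic level-l correction built from M_l likelihood values
    p(y | theta^(m)): the log p(y | theta) terms cancel in
    P_l - (P_{l-1}^(a) + P_{l-1}^(b)) / 2, leaving only the evidence
    averages over all, first-half, and second-half samples."""
    m = len(lik) // 2
    p_bar = np.mean(lik)       # average over all M_l samples
    p_a = np.mean(lik[:m])     # first  M_{l-1} = M_l / 2 samples
    p_b = np.mean(lik[m:])     # second M_{l-1} samples
    return -np.log(p_bar) + 0.5 * (np.log(p_a) + np.log(p_b))

lik = rng.uniform(0.1, 2.0, size=64)  # hypothetical positive likelihood values
z = antithetic_Z(lik)                 # non-positive by concavity of log
```

Since `p_bar` is exactly the midpoint of `p_a` and `p_b`, concavity of the logarithm forces the returned value to be at most zero, matching the antithetic property above.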
3.3 MLMC variance analysis
In order to prove the main theorem below, we need the following result.
Lemma 1. Let $X$ be a random variable with zero mean, and let $\bar{X}_M$ be an average of $M$ i.i.d. samples of $X$. If $\mathbb{E}\left[|X|^{s}\right]$ is finite for $s \geq 2$, there exists a constant $C_s$ depending only on $s$ such that
\[ \mathbb{E}\left[\left|\bar{X}_M\right|^{s}\right] \leq C_s\,\frac{\mathbb{E}\left[|X|^{s}\right]}{M^{s/2}}. \]
Proof. See [6, Lemma 1]. ∎
Now we prove:
Theorem. If there exists $q \geq 4$ such that
\[ \mathbb{E}_{p(Y)}\mathbb{E}_{p(\theta)}\left[\left(\frac{p(Y\mid\theta)}{p(Y)}\right)^{q} + \left(\frac{p(Y\mid\theta)}{p(Y)}\right)^{-q}\right] < \infty, \]
then, for all $\ell \geq 1$, we have
\[ \mathbb{V}[Z_\ell] \leq \frac{C}{M_\ell^{\min(2,\,(q-2)/2)}} \]
for a constant $C > 0$ independent of $\ell$. In particular, for $q \geq 6$ this gives $\beta = 2$ in the complexity theorem of Section 3.1, while the expected cost of $Z_\ell$ grows like $M_\ell = 2^{\ell} M_0$, i.e., $\gamma = 1$, so that the optimal complexity $O(\varepsilon^{-2})$ follows.
The proof follows a similar argument to those of [8, Theorem 5.2] and [6, Theorem 3], in conjunction with the Taylor series argument of [5, Section 9.1]. For a particular value of $Y$, we define
\[ \rho_\ell := \frac{\bar{p}_\ell}{p(Y)}, \qquad \rho_{\ell-1}^{(a)} := \frac{\bar{p}_{\ell-1}^{(a)}}{p(Y)}, \qquad \rho_{\ell-1}^{(b)} := \frac{\bar{p}_{\ell-1}^{(b)}}{p(Y)}, \]
and consider the event
\[ A := \left\{\left|\rho_\ell - 1\right| \leq \frac{1}{2},\ \left|\rho_{\ell-1}^{(a)} - 1\right| \leq \frac{1}{2},\ \left|\rho_{\ell-1}^{(b)} - 1\right| \leq \frac{1}{2}\right\}. \]
In what follows, we denote the indicator function for the event $A$ by $\mathbf{1}_A$ and the complement of $A$ by $A^{c}$. Using Hölder’s inequality, it holds that
\[ \mathbb{E}\left[Z_\ell^{2}\,\mathbf{1}_{A^{c}}\right] \leq \left(\mathbb{E}\left[|Z_\ell|^{2u}\right]\right)^{1/u}\left(\mathbb{P}\left[A^{c}\right]\right)^{1/v} \]
for any $u, v > 1$ such that $1/u + 1/v = 1$.
Since $\bar{p}_\ell$ is nothing but the Monte Carlo estimate of $p(Y)$, the difference $\rho_\ell - 1$ is an average of $M_\ell$ i.i.d. samples of a zero-mean random variable. Using Lemma 1, for a particular value of $Y$, we have
\[ \mathbb{P}\left[\left|\rho_\ell - 1\right| > \frac{1}{2}\,\middle|\,Y\right] \leq 2^{s}\,\mathbb{E}\left[\left|\rho_\ell - 1\right|^{s}\,\middle|\,Y\right] \leq \frac{2^{s} C_s}{M_\ell^{s/2}}\,\mathbb{E}_{p(\theta)}\left[\left|\frac{p(Y\mid\theta)}{p(Y)} - 1\right|^{s}\right] \]
for any $2 \leq s \leq q$. By taking an outer expectation with respect to $Y$, the tower property gives
\[ \mathbb{P}\left[\left|\rho_\ell - 1\right| > \frac{1}{2}\right] \leq \frac{2^{s} C_s}{M_\ell^{s/2}}\,\mathbb{E}_{p(Y)}\mathbb{E}_{p(\theta)}\left[\left|\frac{p(Y\mid\theta)}{p(Y)} - 1\right|^{s}\right]. \]
Since similar bounds exist for the probabilities $\mathbb{P}[|\rho_{\ell-1}^{(a)} - 1| > 1/2]$ and $\mathbb{P}[|\rho_{\ell-1}^{(b)} - 1| > 1/2]$ (with $M_{\ell-1} = M_\ell/2$ in place of $M_\ell$), taking $s = q$ we have
\[ \mathbb{P}\left[A^{c}\right] \leq \frac{c_q}{M_\ell^{q/2}} \]
for a constant $c_q$ independent of $\ell$, which leads to
\[ \mathbb{V}[Z_\ell] \leq \mathbb{E}\left[Z_\ell^{2}\right] \leq \left(\mathbb{E}\left[|Z_\ell|^{2u}\right]\right)^{1/u}\left(\frac{c_q}{M_\ell^{q/2}}\right)^{1/v} + \mathbb{E}\left[Z_\ell^{2}\,\mathbf{1}_{A}\right]. \tag{12} \]
Let us consider the term $\mathbb{E}[|Z_\ell|^{2u}]$. Applying Jensen’s inequality and then the inequality $|a + b|^{2u} \leq 2^{2u-1}\left(|a|^{2u} + |b|^{2u}\right)$, which holds for any $a, b \in \mathbb{R}$, we have
\[ \mathbb{E}\left[|Z_\ell|^{2u}\right] \leq 2^{2u-1}\left(\mathbb{E}\left[\left|\log \rho_\ell\right|^{2u}\right] + \frac{1}{2}\,\mathbb{E}\left[\left|\log \rho_{\ell-1}^{(a)}\right|^{2u}\right] + \frac{1}{2}\,\mathbb{E}\left[\left|\log \rho_{\ell-1}^{(b)}\right|^{2u}\right]\right). \]
As long as $2u \leq q$, it follows from the elementary bound $|\log x| \leq x + x^{-1}$ for $x > 0$, Jensen’s inequality and the moment condition of the theorem that
\[ \mathbb{E}\left[\left|\log \rho_\ell\right|^{2u}\right] \leq 2^{2u-1}\,\mathbb{E}_{p(Y)}\mathbb{E}_{p(\theta)}\left[\left(\frac{p(Y\mid\theta)}{p(Y)}\right)^{2u} + \left(\frac{p(Y\mid\theta)}{p(Y)}\right)^{-2u}\right] < \infty, \]
and similarly for $\rho_{\ell-1}^{(a)}$ and $\rho_{\ell-1}^{(b)}$, which leads to
\[ \left(\mathbb{E}\left[|Z_\ell|^{2u}\right]\right)^{1/u} \leq C_u \]
for a constant $C_u$ independent of $\ell$.
This gives a bound on the first term of (12):
\[ \left(\mathbb{E}\left[|Z_\ell|^{2u}\right]\right)^{1/u}\left(\frac{c_q}{M_\ell^{q/2}}\right)^{1/v} \leq \frac{C_u\,c_q^{1/v}}{M_\ell^{q/(2v)}}, \]
with $2u \leq q$ and $1/u + 1/v = 1$. For a particular value of $q$, the exponent $q/(2v)$ is maximized by
\[ u = \frac{q}{2}, \qquad v = \frac{q}{q-2}, \]
which gives a bound of order $M_\ell^{-(q-2)/2}$ for the first term of (12). Let us move on to the second term $\mathbb{E}[Z_\ell^{2}\,\mathbf{1}_{A}]$. On the event $A$, each of $\rho_\ell$, $\rho_{\ell-1}^{(a)}$ and $\rho_{\ell-1}^{(b)}$ lies in the interval $[1/2, 3/2]$, so that Taylor’s theorem can be applied to the logarithm around $1$.
That is, we have
\[ \log \rho_\ell = \left(\rho_\ell - 1\right) - \frac{\left(\rho_\ell - 1\right)^{2}}{2\eta_\ell^{2}}, \qquad \log \rho_{\ell-1}^{(a)} = \left(\rho_{\ell-1}^{(a)} - 1\right) - \frac{\left(\rho_{\ell-1}^{(a)} - 1\right)^{2}}{2\left(\eta_{\ell-1}^{(a)}\right)^{2}}, \qquad \log \rho_{\ell-1}^{(b)} = \left(\rho_{\ell-1}^{(b)} - 1\right) - \frac{\left(\rho_{\ell-1}^{(b)} - 1\right)^{2}}{2\left(\eta_{\ell-1}^{(b)}\right)^{2}}, \]
respectively, where $\eta_\ell$ lies between $1$ and $\rho_\ell$, $\eta_{\ell-1}^{(a)}$ lies between $1$ and $\rho_{\ell-1}^{(a)}$, and $\eta_{\ell-1}^{(b)}$ lies between $1$ and $\rho_{\ell-1}^{(b)}$. Using these expansions and the antithetic property (11) of $\bar{p}_\ell$, the linear terms cancel and we have
\[ Z_\ell\,\mathbf{1}_{A} = \left(\frac{\left(\rho_\ell - 1\right)^{2}}{2\eta_\ell^{2}} - \frac{\left(\rho_{\ell-1}^{(a)} - 1\right)^{2}}{4\left(\eta_{\ell-1}^{(a)}\right)^{2}} - \frac{\left(\rho_{\ell-1}^{(b)} - 1\right)^{2}}{4\left(\eta_{\ell-1}^{(b)}\right)^{2}}\right)\mathbf{1}_{A}. \]
Applying Jensen’s inequality twice and considering the respective ranges of $\eta_\ell$, $\eta_{\ell-1}^{(a)}$ and $\eta_{\ell-1}^{(b)}$ on the event $A$, we obtain
\[ \mathbb{E}\left[Z_\ell^{2}\,\mathbf{1}_{A}\right] \leq 12\,\mathbb{E}\left[\left(\rho_\ell - 1\right)^{4}\right] + 3\,\mathbb{E}\left[\left(\rho_{\ell-1}^{(a)} - 1\right)^{4}\right] + 3\,\mathbb{E}\left[\left(\rho_{\ell-1}^{(b)} - 1\right)^{4}\right] \leq \frac{C}{M_\ell^{2}}, \]
where the last inequality follows from Lemma 1 together with the moment condition of the theorem.