Multilevel Monte Carlo estimation of expected information gains

11/19/2018 ∙ by Takashi Goda, et al. ∙ The University of Tokyo

In this paper we develop an efficient Monte Carlo algorithm for estimating the expected information gain, which measures how much the information entropy about an uncertain quantity of interest θ is reduced on average by collecting relevant data Y. The expected information gain is expressed as a nested expectation, with an outer expectation with respect to Y and an inner expectation with respect to θ. The standard, nested Monte Carlo method requires a total computational cost of O(ε^-3) to achieve a root-mean-square accuracy of ε. In this paper we reduce this to the optimal O(ε^-2) by applying a multilevel Monte Carlo (MLMC) method. More precisely, we introduce an antithetic MLMC estimator for the expected information gain and provide a sufficient condition on the data model under which the antithetic property of the MLMC estimator is well exploited, so that the optimal complexity of O(ε^-2) is achieved. Furthermore, we discuss how to incorporate importance sampling techniques within the MLMC estimator to avoid so-called arithmetic underflow. Numerical experiments show considerable computational savings compared to the nested Monte Carlo method for a simple test case and a more realistic pharmacokinetic model.


1 Introduction

The motivation for this research comes from the construction of optimal Bayesian experimental designs, for which the so-called expected information gain has often been employed as a quality criterion of experimental designs; see, for instance, [3, 16, 12, 13, 1]. Let θ be a (possibly multi-dimensional) random variable which represents the uncertain quantity of interest. By collecting relevant data Y_ξ (which is again possibly multi-dimensional) through carrying out some experiments under an experimental setup ξ, one can expect that the uncertainty of θ, measured by the information entropy, can be reduced. The aim of Bayesian experimental designs is to find an optimal experimental setup ξ such that the expected amount of the information entropy reduction about θ is maximized.

In what follows, we give a formal definition of the expected information gain for a particular experimental setup ξ. The information entropy of θ before collecting data is given by

H(θ) = −E_θ[log p(θ)] = −∫ p(θ) log p(θ) dθ,

where p(θ) denotes the prior probability density function of θ, and the expectation is taken with respect to the same p(θ). On the other hand, after collecting data Y_ξ, the conditional information entropy of θ is

H(θ | Y_ξ) = −E_{θ|Y_ξ}[log p(θ|Y_ξ)],

where p(θ|Y_ξ) denotes the posterior probability density function of θ given Y_ξ, and the expectation here is taken with respect to p(θ|Y_ξ) instead of p(θ). Thus the expected conditional information entropy of θ by collecting data Y_ξ is

E_{Y_ξ}[H(θ | Y_ξ)] = −E_{Y_ξ}[ E_{θ|Y_ξ}[log p(θ|Y_ξ)] ].

Now the expected information gain U_ξ measures the average amount of the reduction of the information entropy about θ by collecting data Y_ξ, and is defined by the difference

U_ξ = H(θ) − E_{Y_ξ}[H(θ | Y_ξ)] = E_{Y_ξ}[ E_{θ|Y_ξ}[ log( p(θ|Y_ξ) / p(θ) ) ] ].   (1)

Here the inner expectation appearing on the right-most side is nothing but the Kullback–Leibler divergence between p(θ|Y_ξ) and p(θ). In the context of Bayesian experimental designs, it is considered that the larger the value of U_ξ, the more informative the data Y_ξ is about θ, and thus the better the experimental design ξ is. This is why the expected information gain is used as a quality criterion of experimental designs.

Let us consider the following data model:

Y_ξ = f(θ, ξ) + ϵ,   (2)

where the function f represents the deterministic part of the model response, which depends on θ and ξ, and ϵ denotes the stochastic part of the model response, i.e., the measurement error. It is often the case that ϵ is assumed to be zero-mean Gaussian with covariance matrix Σ_ϵ. Throughout this paper we assume that the prior probability density function of θ, the function f, and the probability distribution of ϵ are given. As considered in [12, 13, 1], this data model can be extended to allow the repetition of experiments as

Y_ξ = ( f(θ, ξ) + ϵ_1, …, f(θ, ξ) + ϵ_K ),

where K is the number of repetitive experiments and ϵ_1, …, ϵ_K are independent and identically distributed (i.i.d.) measurement errors. This extended model, however, can easily be rewritten in the form of (2) by concatenating the repeated responses with a suitable choice of f and ϵ. Thus we stick to the original model (2) in this paper. Moreover, since we shall consider a fixed experimental setup ξ, we omit the subscript ξ and simply write Y, f(θ), etc. when distinguishing different ξ's is not important.
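To make the setting concrete, the data model (2) can be simulated directly once the prior, the response function f, and the noise distribution are fixed. The following sketch is purely illustrative and not from the paper: it assumes a scalar standard-normal prior, a linear toy response f(θ) = θ, and Gaussian noise with standard deviation 0.5.

```python
import numpy as np

rng = np.random.default_rng(0)
SIGMA_EPS = 0.5  # noise standard deviation (illustrative choice)

def f(theta):
    # deterministic part of the model response; a linear toy choice
    return theta

def sample_data(n):
    # draw theta from the prior, then Y = f(theta) + eps as in (2)
    theta = rng.standard_normal(n)
    eps = SIGMA_EPS * rng.standard_normal(n)
    return theta, f(theta) + eps

theta, y = sample_data(5)
print(theta.shape, y.shape)  # (5,) (5,)
```

Repeated experiments would simply stack K such noisy responses per θ, which is the concatenation argument above.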

The aim of this paper is to develop an efficient Monte Carlo algorithm for estimating the expected information gain U. In the next section, we introduce the standard, nested Monte Carlo method as a classical algorithm to estimate U, and give a brief review of the relevant literature. Then in Section 3, after introducing the concept of a multilevel Monte Carlo (MLMC) method, we construct an MLMC estimator for U as an alternative, more efficient algorithm. We prove, under a sufficient condition on the data model, that the MLMC estimator can estimate U with a root-mean-square accuracy ε at the optimal computational complexity O(ε^-2). (Here and in what follows, the measurement noise ϵ and the accuracy ε should not be confused.) Moreover, we discuss how to incorporate importance sampling techniques within the MLMC estimator, which might be quite useful in practical applications. Numerical experiments in Section 4 confirm the considerable computational savings compared to the nested Monte Carlo method not only for a simple test case but also for a more realistic pharmacokinetic model adapted from [15].

2 Nested Monte Carlo

The nested Monte Carlo (NMC) method is the most standard approach to estimate the expected information gain [16, 12, 1, 14]. Given the data model (2), it is straightforward to generate i.i.d. random samples of Y given a particular value of θ, and also those of θ itself. Besides, since Y − f(θ) follows the probability distribution of ϵ, it is easy to compute the likelihood p(Y|θ) for given Y and θ. On the other hand, it is usually hard to generate i.i.d. random samples of θ given a particular value of Y, and to compute p(θ|Y) and p(Y) for given θ and Y.

Based on this fact, we use Bayes' theorem

p(θ|Y) = p(Y|θ) p(θ) / p(Y),   with   p(Y) = E_θ[p(Y|θ)],

to rewrite the expected information gain (1) as

U = E_{θ,Y}[ log p(Y|θ) − log E_θ'[p(Y|θ')] ].   (3)

With this form of U, the NMC estimator for the expected information gain is given by

U_NMC = (1/N) Σ_{n=1}^{N} [ log p(Y^(n)|θ^(n)) − log( (1/M) Σ_{m=1}^{M} p(Y^(n)|θ^(n,m)) ) ]   (4)

for N, M ∈ ℕ, where θ^(n) and θ^(n,m) denote i.i.d. random samples of θ, and Y^(n) denotes a random sample of Y generated conditionally on θ = θ^(n).

In [16], Ryan showed that the bias and the variance of the NMC estimator are of O(M^-1) and of O(N^-1), respectively. Since the mean square error of the NMC estimator is given by the sum of the variance and the squared bias, U can be estimated with a root-mean-square accuracy ε by using N = O(ε^-2) and M = O(ε^-1) samples. Assuming that each computation of f(θ), which is necessary for calculating p(Y|θ), can be performed with unit cost, the total computational complexity is O(N M) = O(ε^-3).
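The estimator (4) and its O(N·M) cost can be sketched in a few lines for a toy linear Gaussian model (standard-normal prior, f(θ) = θ, noise standard deviation 0.5 — illustrative choices, not from the paper). For this particular model the expected information gain is known in closed form, 0.5·log(1 + 1/σ_ϵ^2), which makes a convenient sanity check.

```python
import numpy as np

rng = np.random.default_rng(1)
SIGMA_EPS = 0.5  # noise standard deviation (illustrative)

def log_likelihood(y, theta):
    # log p(y | theta) for the toy model Y = theta + eps, eps ~ N(0, SIGMA_EPS^2)
    r = y - theta
    return -0.5 * (r / SIGMA_EPS) ** 2 - np.log(SIGMA_EPS * np.sqrt(2.0 * np.pi))

def nmc_eig(N, M):
    # outer samples: theta^(n) from the prior and Y^(n) conditionally on theta^(n)
    theta = rng.standard_normal(N)
    y = theta + SIGMA_EPS * rng.standard_normal(N)
    # inner samples: fresh prior draws theta^(n,m), M per outer sample
    theta_in = rng.standard_normal((N, M))
    inner = np.mean(np.exp(log_likelihood(y[:, None], theta_in)), axis=1)
    # estimator (4): average of log-likelihood minus log of the inner average
    return np.mean(log_likelihood(y, theta) - np.log(inner))

exact = 0.5 * np.log(1.0 + 1.0 / SIGMA_EPS**2)  # = 0.5 log 5, about 0.805
print(nmc_eig(N=2000, M=200), exact)
```

Halving the target accuracy requires doubling M (to reduce the O(M^-1) bias) and quadrupling N (to reduce the O(N^-1) variance), which is exactly the O(ε^-3) scaling discussed above.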

Much more recently, in [1], Beck et al. provided a thorough error analysis of the NMC estimator and derived the optimal allocation of N and M for a given ε. In fact, they consider the situation where f cannot be computed exactly and only a discretized approximation with a mesh discretization parameter h is available. Here it is assumed that, as h gets smaller, the approximation approaches f, but at the same time, its computational cost increases. Therefore, their optimization deals not only with the numbers of samples N and M but also with the parameter h. In this paper, we assume that f can be computed exactly, so that dealing with such situations is left open for future work; see Section 5.

More importantly, Beck et al. incorporated importance sampling based on the Laplace approximation from [13] within the NMC estimator. This approach is quite useful in reducing the number of inner samples M substantially and also in mitigating the risk of so-called arithmetic underflow: when p(Y|θ) (as a function of θ for a fixed Y) is highly concentrated around a certain value of θ, the Monte Carlo estimate of the inner expectation E_θ[p(Y|θ)] appearing in (4) can be numerically zero. Taking the logarithm of 0 of course results in an error. This can happen in practice, especially for small Σ_ϵ. Therefore, it is desirable to apply a change of measure such that most of the samples of θ are concentrated properly depending on Y, which is exactly what the Laplace-based importance sampling aims to do. We note, however, that using importance sampling does not improve the order of computational complexity, so that the necessary cost of O(ε^-3) remains unchanged.
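The Laplace-based change of measure is the remedy used in [1]; a complementary, purely numerical safeguard (our illustration, not from the paper) is to evaluate the inner average in log space with a log-sum-exp reduction, so that the logarithm of the inner mean stays finite even when every individual likelihood value underflows:

```python
import numpy as np

def log_mean_exp(logw):
    # log( (1/M) * sum_m exp(logw_m) ), computed without underflow
    m = np.max(logw)
    return m + np.log(np.mean(np.exp(logw - m)))

# three extremely small likelihood values, stored as log-likelihoods
loglik = np.array([-800.0, -805.0, -810.0])

print(np.mean(np.exp(loglik)))  # 0.0 -> taking the log would give -inf
print(log_mean_exp(loglik))     # finite, about -801.09
```

This removes the log-of-zero failure mode, but unlike importance sampling it does nothing about the O(ε^-3) complexity or the large relative error of the inner estimate.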

3 Multilevel Monte Carlo

3.1 Basic theory of MLMC

In order to reduce the total computational complexity from O(ε^-3) to O(ε^-2), we consider applying a multilevel Monte Carlo (MLMC) method [4, 5]. The MLMC method has already been applied to estimate nested expectations of the form

E_X[ g( E_Z[ h(X, Z) ] ) ]

for independent random variables X and Z, where the outer expectation is taken with respect to X and the inner one is taken with respect to Z; see [2, 5, 6, 7]. In particular, the case where the outer function g is smooth has been briefly discussed in [5, Section 9]. In this paper we make a rigorous argument when g is the logarithmic function.

Before introducing an MLMC estimator for the expected information gain, we give an overview of the MLMC method. Let P be a random output variable which cannot be sampled exactly, and let P_0, P_1, … be a sequence of random variables which approximate P with increasing accuracy but also with increasing cost. The problem here is to estimate E[P] efficiently.

For L ∈ ℕ we have the following telescoping sum:

E[P_L] = E[P_0] + Σ_{ℓ=1}^{L} E[P_ℓ − P_{ℓ-1}].   (5)

The standard Monte Carlo method estimates the left-hand side directly by

P̂_MC = (1/N) Σ_{n=1}^{N} P_L^(n).   (6)

The mean square error of P̂_MC is given by the sum of variance and squared bias:

E[(P̂_MC − E[P])^2] = V[P_L]/N + (E[P_L − P])^2.   (7)

The MLMC method, on the other hand, independently estimates each term on the right-hand side of (5). In general, if we have a sequence of random variables ΔP_0, ΔP_1, … which satisfy E[ΔP_0] = E[P_0] and E[ΔP_ℓ] = E[P_ℓ − P_{ℓ-1}] for ℓ ≥ 1, the MLMC estimator is given by

P̂_ML = Σ_{ℓ=0}^{L} (1/N_ℓ) Σ_{n=1}^{N_ℓ} ΔP_ℓ^(n).   (8)

The mean square error of P̂_ML is

E[(P̂_ML − E[P])^2] = Σ_{ℓ=0}^{L} V[ΔP_ℓ]/N_ℓ + (E[P_L − P])^2.   (9)

For the same underlying stochastic sample, P_ℓ and P_{ℓ-1} can be well correlated, and thus V[ΔP_ℓ] is expected to get smaller as the level ℓ increases. This means that, in order to estimate E[P_ℓ − P_{ℓ-1}] with a fixed root-mean-square accuracy, the necessary number of samples N_ℓ decreases as ℓ increases; as a consequence, most of the samples are allocated to the lower levels for estimating E[P_0] and the first few corrections. Since each computation of ΔP_ℓ is assumed to be cheaper for smaller ℓ, the overall computational cost can be significantly reduced compared to the standard Monte Carlo method.
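Equations (5)–(9) translate almost line by line into code. The sketch below is illustrative (the function names and the artificial level sampler are ours, not the paper's): it implements the generic estimator (8) for a user-supplied routine delta_sampler(l, n) returning n i.i.d. samples of ΔP_ℓ, and checks it on an artificial problem where E[ΔP_ℓ] = 2^{−(ℓ+1)}, so that E[P_L] → 1.

```python
import numpy as np

def mlmc_estimate(delta_sampler, L, N):
    # MLMC estimator (8): sum over levels of independent sample means of Delta P_l
    return sum(np.mean(delta_sampler(l, N[l])) for l in range(L + 1))

rng = np.random.default_rng(2)

def toy_delta(l, n):
    # artificial corrections: mean 2^-(l+1), noise shrinking with level (beta = 2)
    return 2.0 ** -(l + 1) + 0.1 * 2.0 ** -l * rng.standard_normal(n)

L = 6
N = [max(4000 // 2 ** l, 50) for l in range(L + 1)]  # decreasing allocation
print(mlmc_estimate(toy_delta, L, N))  # close to 1 (residual bias 2^-(L+1))
```

In a full MLMC implementation the N_ℓ are chosen adaptively from estimated variances V_ℓ and costs C_ℓ (roughly N_ℓ proportional to sqrt(V_ℓ/C_ℓ)); the fixed allocation above is only for illustration.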

In his seminal work [4], Giles made this observation explicit as follows, see also a recent review [5]:

Theorem 1.

Let P be a random variable and let P_ℓ denote the corresponding level-ℓ approximation of P. If there exist independent random variables ΔP_ℓ with expected cost C_ℓ and variance V_ℓ, satisfying E[ΔP_0] = E[P_0] and E[ΔP_ℓ] = E[P_ℓ − P_{ℓ-1}] for ℓ ≥ 1, and positive constants α, β, γ, c_1, c_2, c_3 such that α ≥ min(β, γ)/2 and

  1. |E[P_ℓ − P]| ≤ c_1 2^{−αℓ},

  2. V_ℓ ≤ c_2 2^{−βℓ},

  3. C_ℓ ≤ c_3 2^{γℓ},

then there exists a positive constant c_4 such that, for any ε < 1/e, there are L and N_0, …, N_L for which the MLMC estimator (8) has a mean square error less than ε^2 with a computational complexity C bounded by

C ≤ c_4 ε^{−2}               if β > γ,
C ≤ c_4 ε^{−2} (log ε)^2     if β = γ,
C ≤ c_4 ε^{−2−(γ−β)/α}       if β < γ.

Remark 1.

As discussed for instance in [6, Section 2.1], the computational complexity required for the standard Monte Carlo estimator to have a mean square error less than ε^2 is of O(ε^{−2−γ/α}). Thus, regardless of the values of β and γ, the MLMC estimator has an asymptotically better complexity bound than the standard Monte Carlo estimator.

3.2 MLMC estimator for expected information gains

Here we introduce an MLMC estimator for the expected information gain. First let us define a random output variable

P = log p(Y|θ) − log E_θ'[p(Y|θ')],

where Y is distributed conditionally on the random variable θ appearing in the first term. It is obvious that P cannot be computed exactly because of the expectation appearing in the second term. We can introduce a sequence of approximations P_ℓ of P with increasing accuracy but also with increasing cost as follows:

P_ℓ = log p(Y|θ) − log( (1/M_ℓ) Σ_{m=1}^{M_ℓ} p(Y|θ^(m)) ),

for an increasing sequence (M_ℓ)_{ℓ≥0} of inner sample sizes such that M_ℓ → ∞ as ℓ → ∞, where θ^(1), θ^(2), … are i.i.d. random samples from the prior, independent of (θ, Y). That is, the inner average is the standard Monte Carlo estimator of p(Y) = E_θ'[p(Y|θ')] using M_ℓ random samples of θ. Thus we have E[P] = U. Note that the standard, nested Monte Carlo estimator (4) is essentially the same as (6) with P_L given as above for a fixed level L.

In what follows, let M_ℓ = M_0 2^ℓ for some M_0 ∈ ℕ and all ℓ ≥ 0, i.e., we consider a geometric sequence for the M_ℓ. Then a sequence of corrections ΔP_ℓ is defined as follows: ΔP_0 is the same as P_0, given by

ΔP_0 = log p(Y|θ) − log( (1/M_0) Σ_{m=1}^{M_0} p(Y|θ^(m)) ).

For ℓ ≥ 1, the simplest choice is

ΔP_ℓ = P_ℓ − P_{ℓ-1},

where the first M_{ℓ-1} random samples of θ used in the second term of P_ℓ are also used in P_{ℓ-1}. However, according to [8, 2, 5, 6], we can consider a better "tight coupling" of P_ℓ and P_{ℓ-1}. Namely, the set of M_ℓ random samples of θ used to compute P_ℓ is divided into two disjoint sets of M_{ℓ-1} samples to compute two realizations of P_{ℓ-1}, denoted by P_{ℓ-1}^(a) and P_{ℓ-1}^(b), respectively. In this way ΔP_ℓ = P_ℓ − (P_{ℓ-1}^(a) + P_{ℓ-1}^(b))/2 is given by

ΔP_ℓ = −log p̄_ℓ + (1/2)( log p̄_{ℓ-1}^(a) + log p̄_{ℓ-1}^(b) ),   (10)

where

  • p̄_ℓ denotes an average of p(Y|θ) over the M_ℓ random samples of θ (note that we omit a superscript since it is clear from the level ℓ);

  • p̄_{ℓ-1}^(a) denotes an average of p(Y|θ) over the first M_{ℓ-1} random samples of θ used in p̄_ℓ;

  • p̄_{ℓ-1}^(b) denotes an average of p(Y|θ) over the second M_{ℓ-1} random samples of θ used in p̄_ℓ,

for a randomly generated Y; note that the terms log p(Y|θ) cancel out in (10). Because of the independence of the two disjoint sets of samples, P_{ℓ-1}^(a) and P_{ℓ-1}^(b) are each distributed as P_{ℓ-1}, and we see that E[ΔP_ℓ] = E[P_ℓ − P_{ℓ-1}]. Moreover, it is important that the following "antithetic" property of p̄_ℓ holds:

p̄_ℓ = (1/2)( p̄_{ℓ-1}^(a) + p̄_{ℓ-1}^(b) ).   (11)

Due to the concavity of the logarithm, ΔP_ℓ is always non-positive when (11) holds.

In this paper, we always consider the latter definition (10) of ΔP_ℓ for ℓ ≥ 1. Our MLMC estimator for the expected information gain is given by (8) into which the above ΔP_ℓ is substituted. It is already clear from the construction of P_ℓ that the parameter γ in Theorem 1 should be γ = 1.
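The antithetic correction (10) can be sketched concretely for a toy linear Gaussian model (standard-normal prior for θ, f(θ) = θ, Gaussian noise with standard deviation 0.5 — our illustrative choices, not from the paper). Note how the log p(Y|θ) terms cancel for ℓ ≥ 1 and how the two half-averages are recombined:

```python
import numpy as np

rng = np.random.default_rng(3)
SIGMA_EPS = 0.5
M0 = 4  # inner samples on level 0; M_l = M0 * 2^l

def likelihood(y, theta):
    # p(y | theta) for the toy model Y = theta + eps, eps ~ N(0, SIGMA_EPS^2)
    r = y - theta
    return np.exp(-0.5 * (r / SIGMA_EPS) ** 2) / (SIGMA_EPS * np.sqrt(2.0 * np.pi))

def delta_p(level, n):
    # n i.i.d. samples of Delta P_level as in (10)
    Ml = M0 * 2 ** level
    theta = rng.standard_normal(n)                  # outer prior draws
    y = theta + SIGMA_EPS * rng.standard_normal(n)  # data conditional on theta
    lik = likelihood(y[:, None], rng.standard_normal((n, Ml)))  # inner draws
    if level == 0:
        # Delta P_0 = P_0 = log p(Y|theta) - log of the M0-sample average
        return np.log(likelihood(y, theta)) - np.log(np.mean(lik, axis=1))
    half = Ml // 2
    a = np.mean(lik[:, :half], axis=1)  # first M_{l-1} inner samples
    b = np.mean(lik[:, half:], axis=1)  # second M_{l-1} inner samples
    # antithetic correction (10); the log p(Y|theta) terms have cancelled
    return -np.log(0.5 * (a + b)) + 0.5 * (np.log(a) + np.log(b))

d = delta_p(3, 10000)
print(d.mean(), bool((d <= 0).all()))  # small negative mean; True by (11)
```

Every realization is non-positive, exactly as the concavity argument around (11) predicts.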

3.3 MLMC variance analysis

In this subsection we prove V_ℓ = V[ΔP_ℓ] = O(2^{−2ℓ}) for ΔP_ℓ defined in (10), i.e., β = 2, meaning that our MLMC estimator is in the first regime β > γ of Theorem 1, so that the total computational complexity is O(ε^{−2}).
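Before the proof, the claimed decay V_ℓ = O(2^{−2ℓ}) can be observed empirically. For a toy linear Gaussian model (standard-normal prior for θ, f(θ) = θ, noise standard deviation 0.5 — illustrative choices, not from the paper), the sample variance of the antithetic correction (10) drops by roughly a factor of 4 per level:

```python
import numpy as np

rng = np.random.default_rng(4)
SIGMA_EPS = 0.5
M0 = 4

def likelihood(y, theta):
    r = y - theta
    return np.exp(-0.5 * (r / SIGMA_EPS) ** 2) / (SIGMA_EPS * np.sqrt(2.0 * np.pi))

def delta_p(level, n):
    # antithetic correction (10) for the toy model, levels >= 1
    Ml = M0 * 2 ** level
    theta = rng.standard_normal(n)
    y = theta + SIGMA_EPS * rng.standard_normal(n)
    lik = likelihood(y[:, None], rng.standard_normal((n, Ml)))
    half = Ml // 2
    a = np.mean(lik[:, :half], axis=1)
    b = np.mean(lik[:, half:], axis=1)
    return -np.log(0.5 * (a + b)) + 0.5 * (np.log(a) + np.log(b))

variances = [float(np.var(delta_p(l, 50000))) for l in range(1, 5)]
print([v1 / v2 for v1, v2 in zip(variances, variances[1:])])  # ratios approaching 4
```

A ratio of about 4 between consecutive levels corresponds to β ≈ 2, consistent with the theorem below.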

In order to prove the main theorem below, we need the following result.

Lemma 1.

Let X be a random variable with zero mean, and let X̄_M denote an average of M i.i.d. samples of X. If E[|X|^q] is finite for some q ≥ 2, there exists a constant c_q depending only on q such that

E[|X̄_M|^q] ≤ c_q E[|X|^q] M^{−q/2}.

Proof.

See [6, Lemma 1]. ∎

Now we prove:

Theorem 2.

If there exists q > 2 such that

E_Y[ E_θ[ ( p(Y|θ) / p(Y) )^q ] ] < ∞,

then for all ℓ ≥ 1 we have

V[ΔP_ℓ] = O(2^{−2ℓ}).

Proof.

The proof follows a similar argument to those of [8, Theorem 5.2] and [6, Theorem 3], in conjunction with the Taylor series argument of [5, Section 9.1]. For a particular value of Y, we write p(Y) = E_θ[p(Y|θ)] and consider the event A that the averages p̄_ℓ, p̄_{ℓ-1}^(a) and p̄_{ℓ-1}^(b) all lie between p(Y)/2 and 3p(Y)/2.

In what follows, we denote the indicator function for the event A by 1_A and the complement of A by A^c. Using Hölder's inequality, it holds that

V[ΔP_ℓ] ≤ E[|ΔP_ℓ|^2] ≤ (E[|ΔP_ℓ|^{2s}])^{1/s} (P(A^c))^{1/r} + E[|ΔP_ℓ|^2 1_A]   (12)

for any r, s > 1 such that 1/r + 1/s = 1.

Since p̄_ℓ is nothing but the Monte Carlo estimate of p(Y) using M_ℓ samples, the difference p̄_ℓ − p(Y) is an average of M_ℓ i.i.d. samples of a zero-mean random variable. Using Lemma 1 together with Markov's inequality, for a particular value of Y, we have

P( |p̄_ℓ − p(Y)| > p(Y)/2 | Y ) ≤ c_q 2^q E_θ[ |p(Y|θ)/p(Y) − 1|^q | Y ] M_ℓ^{−q/2}

for any q ≥ 2 with finite moment. By taking an outer expectation with respect to Y, the tower property gives

P( |p̄_ℓ − p(Y)| > p(Y)/2 ) ≤ c_q 2^q E_Y[ E_θ[ |p(Y|θ)/p(Y) − 1|^q ] ] M_ℓ^{−q/2}.

Since similar bounds hold for the probabilities concerning p̄_{ℓ-1}^(a) and p̄_{ℓ-1}^(b), we have

P(A^c) ≤ C M_ℓ^{−q/2}

for some constant C > 0 independent of ℓ, which leads to

(P(A^c))^{1/r} ≤ C^{1/r} M_ℓ^{−q/(2r)}.

Let us consider the remaining moment term E[|ΔP_ℓ|^{2s}]. Applying Jensen's inequality and then the inequality |log x| ≤ |x − 1| + |1/x − 1|, which holds for any x > 0, we have

As long as the moments required in Lemma 1 are finite, it follows from Lemma 1 that

which leads to

This gives a bound on the first term of (12):

with and . For a particular value of , the exponent is maximized by

That is, we have

(13)

Let us move on to the second term of (12). For a particular value of Y, the second-order Taylor series expansions around p(Y) of the three logarithmic terms in (10) are

log p̄_ℓ = log p(Y) + (p̄_ℓ − p(Y))/p(Y) − (p̄_ℓ − p(Y))^2 / (2 η_ℓ^2),
log p̄_{ℓ-1}^(a) = log p(Y) + (p̄_{ℓ-1}^(a) − p(Y))/p(Y) − (p̄_{ℓ-1}^(a) − p(Y))^2 / (2 (η_{ℓ-1}^(a))^2),
log p̄_{ℓ-1}^(b) = log p(Y) + (p̄_{ℓ-1}^(b) − p(Y))/p(Y) − (p̄_{ℓ-1}^(b) − p(Y))^2 / (2 (η_{ℓ-1}^(b))^2),

respectively, where η_ℓ lies between p̄_ℓ and p(Y), η_{ℓ-1}^(a) lies between p̄_{ℓ-1}^(a) and p(Y), and η_{ℓ-1}^(b) lies between p̄_{ℓ-1}^(b) and p(Y). Using these expansions and the antithetic property (11) of p̄_ℓ, the first-order terms cancel exactly and we have

Applying Jensen's inequality twice and considering the respective ranges of η_ℓ, η_{ℓ-1}^(a) and η_{ℓ-1}^(b) on the event A, we obtain