# Multilevel Monte Carlo estimation of the expected value of sample information

We study Monte Carlo estimation of the expected value of sample information (EVSI), which measures the expected benefit of gaining additional information for decision making under uncertainty. EVSI is defined as a nested expectation in which an outer expectation is taken with respect to one random variable $Y$ and an inner conditional expectation with respect to the other random variable $\theta$. Although the nested (Markov chain) Monte Carlo estimator has often been used in this context, it is known to require a cost of $O(\varepsilon^{-2-1/\alpha})$ to achieve a root-mean-square accuracy of $\varepsilon$, where $\alpha$ denotes the order of convergence of the bias and is typically between $1/2$ and $1$. In this article we propose a novel efficient Monte Carlo estimator of EVSI by applying a multilevel Monte Carlo (MLMC) method. Instead of fixing the number of inner samples for $\theta$ as done in the nested Monte Carlo estimator, we consider a geometric progression on the number of inner samples, which yields a hierarchy of estimators of the inner conditional expectation with increasing approximation levels. Based on an elementary telescoping sum, our MLMC estimator is given by a sum of the Monte Carlo estimates of the differences between successive approximation levels on the inner conditional expectation. We show, under a set of assumptions on decision and information models, that successive approximation levels are tightly coupled, which directly proves that our MLMC estimator improves the necessary computational cost to the optimal $O(\varepsilon^{-2})$. Numerical experiments confirm the considerable computational savings as compared to the nested Monte Carlo estimator.

## 1 Introduction

Motivated by applications to medical decision making under uncertainty, we study Monte Carlo estimation of the expected value of sample information (EVSI). Let $\theta$ be a vector of random variables representing the uncertainty in the effectiveness of different medical treatments. Let $D$ be a finite set of possible medical treatments, and for each treatment $d\in D$, let $f_d$ denote a function of $\theta$ representing some measure of the patient outcome, with larger values being better; quality-adjusted life years (QALY) are typically employed in the context of medical decision making [1, 2, 19, 12]. Without any knowledge about $\theta$, the best treatment is the one which maximizes the expectation of $f_d(\theta)$, giving the average outcome:

$$\max_{d\in D}\mathbb{E}_{\theta}[f_d(\theta)], \tag{1}$$

where $\mathbb{E}_{\theta}$ denotes the expectation taken with respect to the prior probability density function of $\theta$. On the other hand, if perfect information on $\theta$ is available, the best treatment after knowing the value of $\theta$ is simply the one which maximizes $f_d(\theta)$, so that, on average, the outcome will be

$$\mathbb{E}_{\theta}\left[\max_{d\in D}f_d(\theta)\right].$$

The difference between these two values is called the expected value of perfect information (EVPI):

$$\mathrm{EVPI}:=\mathbb{E}_{\theta}\left[\max_{d\in D}f_d(\theta)\right]-\max_{d\in D}\mathbb{E}_{\theta}[f_d(\theta)].$$

However, it is rare to have access to perfect information on $\theta$. In practice, what we obtain, for instance by carrying out new medical research, is either partial perfect information or sample information on $\theta$.

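Partial perfect information is nothing but perfect information only on a subset of random variables $\theta_1$ for a partition $\theta=(\theta_1,\theta_2)$, where $\theta_1$ and $\theta_2$ are assumed independent. After knowing the value of $\theta_1$, the best treatment is the one which maximizes the partial expectation $\mathbb{E}_{\theta_2}[f_d(\theta_1,\theta_2)]$. Therefore, the average outcome with partial perfect information on $\theta_1$ will be

$$\mathbb{E}_{\theta_1}\left[\max_{d\in D}\mathbb{E}_{\theta_2}[f_d(\theta_1,\theta_2)]\right],$$

and the increment from (1) is called the expected value of partial perfect information (EVPPI):

$$\mathrm{EVPPI}:=\mathbb{E}_{\theta_1}\left[\max_{d\in D}\mathbb{E}_{\theta_2}[f_d(\theta_1,\theta_2)]\right]-\max_{d\in D}\mathbb{E}_{\theta}[f_d(\theta)].$$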

Sample information on $\theta$, which is the focus of this article, is a single realization drawn from some probability distribution. To be more precise, we consider that information $Y$ is stochastically generated according to the forward information model:

$$Y=h(\theta)+\epsilon, \tag{2}$$

where $h$ is a known deterministic function of $\theta$, possibly with multiple outputs, and $\epsilon$ is a zero-mean random variable with density $\rho$. Note that $h$ and $\epsilon$ are called the observation operator and the observation noise, respectively [20, Section 2]. It is widely known that Bayes' theorem provides an update of the probability density of $\theta$ after observing $Y$:

$$\pi^{Y}(\theta)=\frac{\rho(Y|\theta)\,\pi_0(\theta)}{\mathbb{E}_{\theta}[\rho(Y|\theta)]}, \tag{3}$$

where $\pi_0$ denotes the prior probability density of $\theta$, and $\rho(Y|\theta)$ the conditional probability density of $Y$ given $\theta$. Here $\rho(Y|\theta)$ is also called the likelihood of the information $Y$, and it follows from the model (2) that $\rho(Y|\theta)=\rho(Y-h(\theta))$.

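Now, if such sample information $Y$ is available, by choosing the best treatment which maximizes the conditional expectation $\mathbb{E}_{\theta|Y}[f_d(\theta)]$ depending on $Y$, where $\mathbb{E}_{\theta|Y}$ denotes the expectation taken with respect to the conditional probability density $\pi^{Y}$, the overall average outcome becomes

$$\mathbb{E}_{Y}\left[\max_{d\in D}\mathbb{E}_{\theta|Y}[f_d(\theta)]\right].$$

Then EVSI represents the expected benefit of gaining the information $Y$ and is defined by the difference

$$\mathrm{EVSI}:=\mathbb{E}_{Y}\left[\max_{d\in D}\mathbb{E}_{\theta|Y}[f_d(\theta)]\right]-\max_{d\in D}\mathbb{E}_{\theta}[f_d(\theta)].$$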

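In this article we are concerned with Monte Carlo estimation of EVSI. Given that EVPI can be estimated with root-mean-square accuracy $\varepsilon$ by using $N=O(\varepsilon^{-2})$ i.i.d. samples of $\theta$, denoted by $\theta^{(1)},\ldots,\theta^{(N)}$, as

$$\frac{1}{N}\sum_{n=1}^{N}\max_{d\in D}f_d(\theta^{(n)})-\max_{d\in D}\frac{1}{N}\sum_{n=1}^{N}f_d(\theta^{(n)}),$$

it suffices to efficiently estimate the difference between EVPI and EVSI:

$$\mathrm{EVPI}-\mathrm{EVSI}=\mathbb{E}_{\theta}\left[\max_{d\in D}f_d(\theta)\right]-\mathbb{E}_{Y}\left[\max_{d\in D}\mathbb{E}_{\theta|Y}[f_d(\theta)]\right]. \tag{4}$$

Because the operators $\mathbb{E}$ and $\max$ do not commute, this estimation is inherently a nested expectation problem, and it is far from trivial whether we can construct a good Monte Carlo estimator which achieves a root-mean-square accuracy $\varepsilon$ at a cost of $O(\varepsilon^{-2})$.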
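The plain Monte Carlo estimator of EVPI described above can be sketched in Python as follows. This is a minimal illustration under assumptions that are not from the article: a hypothetical toy problem with $\theta\sim N(0,1)$ and two treatments with payoffs $f_1(\theta)=\theta$ and $f_2(\theta)=0$, for which EVPI equals $1/\sqrt{2\pi}\approx 0.399$. The names `evpi_estimate` and `payoffs` are illustrative only.

```python
import random

def evpi_estimate(sample_theta, payoffs, n, rng):
    """Plain Monte Carlo estimate of EVPI from n prior samples of theta."""
    thetas = [sample_theta(rng) for _ in range(n)]
    # First term: average over samples of the best payoff for each sample.
    mean_of_max = sum(max(f(t) for f in payoffs) for t in thetas) / n
    # Second term: best of the per-treatment average payoffs.
    max_of_mean = max(sum(f(t) for t in thetas) / n for f in payoffs)
    return mean_of_max - max_of_mean

# Hypothetical toy problem: theta ~ N(0, 1), f_1(theta) = theta, f_2(theta) = 0.
payoffs = [lambda t: t, lambda t: 0.0]
rng = random.Random(0)
estimate = evpi_estimate(lambda r: r.gauss(0.0, 1.0), payoffs, 20000, rng)
```

Because both terms reuse the same samples, the estimate is never negative: the average of per-sample maxima is at least the maximum of the averages.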

Classically, the most standard approach is to apply nested (Markov chain) Monte Carlo methods. For positive integers $N$ and $M$, let $Y^{(1)},\ldots,Y^{(N)}$ be outer i.i.d. samples of $Y$, and for each $n$, let $\theta^{(n,1)},\ldots,\theta^{(n,M)}$ be inner i.i.d. samples of $\theta$ conditional on $Y=Y^{(n)}$. Then the nested Monte Carlo estimator of $\mathrm{EVPI}-\mathrm{EVSI}$ is given by

$$\frac{1}{N}\sum_{n=1}^{N}\left[\frac{1}{M}\sum_{m=1}^{M}\max_{d\in D}f_d(\theta^{(n,m)})-\max_{d\in D}\frac{1}{M}\sum_{m=1}^{M}f_d(\theta^{(n,m)})\right]. \tag{5}$$

Here it is often hard to generate inner i.i.d. samples of $\theta$ conditional on some value of $Y$ directly (although, conversely, it is quite easy to generate i.i.d. samples of $Y$ conditional on some value of $\theta$ according to (2)). This is a major difference from estimating EVPPI.
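As an illustration of the nested estimator (5), the sketch below uses a conjugate Gaussian toy model in which the inner conditional samples can, exceptionally, be drawn exactly. All modelling choices are assumptions made for the example and are not taken from the article: $\theta\sim N(0,1)$, $h(\theta)=\theta$, $\epsilon\sim N(0,\sigma^2)$, $f_1(\theta)=\theta$ and $f_2(\theta)=0$. In this model the posterior is $N\!\left(Y/(1+\sigma^2),\,\sigma^2/(1+\sigma^2)\right)$ and $\mathrm{EVPI}-\mathrm{EVSI}=(1-1/\sqrt{1+\sigma^2})/\sqrt{2\pi}$.

```python
import math
import random

def nested_mc(n_outer, n_inner, sigma, rng):
    """Nested Monte Carlo estimate of EVPI - EVSI for the Gaussian toy model."""
    post_sd = math.sqrt(sigma**2 / (1.0 + sigma**2))
    total = 0.0
    for _ in range(n_outer):
        # Outer sample of Y: draw theta from the prior, then add noise.
        y = rng.gauss(0.0, 1.0) + rng.gauss(0.0, sigma)
        # Inner samples of theta given Y (exact posterior, toy model only).
        post_mean = y / (1.0 + sigma**2)
        inner = [rng.gauss(post_mean, post_sd) for _ in range(n_inner)]
        mean_of_max = sum(max(t, 0.0) for t in inner) / n_inner
        max_of_mean = max(sum(inner) / n_inner, 0.0)
        total += mean_of_max - max_of_mean
    return total / n_outer

sigma = 1.0
estimate = nested_mc(2000, 64, sigma, random.Random(0))
exact = (1.0 - 1.0 / math.sqrt(1.0 + sigma**2)) / math.sqrt(2.0 * math.pi)
```

The closed-form posterior is what makes this sketch possible; it is a property of the toy model, not of the general setting.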

To work around this difficulty, although the resulting samples are no longer i.i.d., one relies on Markov chain Monte Carlo (MCMC) sampling techniques such as Metropolis-Hastings sampling and Gibbs sampling, see [16, 18]. Under certain conditions, it follows from [14, 15] that one can establish a non-asymptotic error bound of $O(M^{-1/2})$ for MCMC estimation of the inner conditional expectation. Still, as inferred from a recent work of Giles & Goda on EVPPI estimation, we need $N=O(\varepsilon^{-2})$ and $M=O(\varepsilon^{-1/\alpha})$ samples for the outer and inner expectations, respectively, to estimate $\mathrm{EVPI}-\mathrm{EVSI}$ with root-mean-square accuracy $\varepsilon$. Here $\alpha$ denotes the order of convergence of the bias and is typically between $1/2$ and $1$. The necessary total computational cost is therefore $O(\varepsilon^{-2-1/\alpha})$.
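The nested Monte Carlo cost $O(\varepsilon^{-2-1/\alpha})$ follows from a standard bias-variance balance. As a brief sketch, assume that the estimator has variance of order $1/N$ and bias of order $M^{-\alpha}$, with hypothetical constants $c_v$ and $c_b$ introduced only for this illustration:

$$\mathrm{MSE}\approx\frac{c_v}{N}+c_b^2M^{-2\alpha}\le\varepsilon^2\quad\Longrightarrow\quad N=O(\varepsilon^{-2}),\quad M=O(\varepsilon^{-1/\alpha}),\quad \text{cost}=NM=O(\varepsilon^{-2-1/\alpha}).$$

For instance, $\alpha=1$ gives a cost of $O(\varepsilon^{-3})$, while $\alpha=1/2$ gives $O(\varepsilon^{-4})$.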

In this article, building upon the earlier work by Giles & Goda, we develop a novel efficient Monte Carlo estimator of $\mathrm{EVPI}-\mathrm{EVSI}$ by using a multilevel Monte Carlo (MLMC) method [6, 7]. Although there has been extensive recent research on efficient approximations of EVSI in the medical decision making context [19, 12, 17, 13], our proposal avoids function approximations on the inner conditional expectation and any reliance on assumptions of multilinearity of $f_d$ or weak correlation between random variables in $\theta$. Recently MLMC estimators have been studied intensively for nested expectations of different forms, for instance, by [4, 10, 11]. Importantly, the approach developed in this article does not require MCMC sampling for generating inner conditional samples of $\theta$ and can achieve a root-mean-square accuracy $\varepsilon$ at the optimal cost of $O(\varepsilon^{-2})$. Moreover, it is straightforward to incorporate importance sampling techniques within our estimator, which may sometimes reduce the variance of the estimator significantly.

## 2 Multilevel Monte Carlo

### 2.1 Basic theory

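Before introducing our estimator of $\mathrm{EVPI}-\mathrm{EVSI}$, we give an overview of the MLMC method. Let $P$ be a real-valued random variable which cannot be sampled exactly, and let $P_0,P_1,\ldots$ be a sequence of real-valued random variables which approximate $P$ with increasing accuracy but also with increasing cost. In order to estimate $\mathbb{E}[P]$, we first approximate $\mathbb{E}[P]$ by $\mathbb{E}[P_L]$ for some level $L$, and then the standard Monte Carlo method estimates $\mathbb{E}[P_L]$ by using i.i.d. samples $P_L^{(1)},\ldots,P_L^{(N)}$ of $P_L$ as

$$\mathbb{E}[P]\approx\mathbb{E}[P_L]\approx\overline{P_L}^{N}:=\frac{1}{N}\sum_{n=1}^{N}P_L^{(n)}.$$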

On the other hand, the MLMC method exploits the following telescoping sum representation:

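$$\mathbb{E}[P_L]=\mathbb{E}[P_0]+\sum_{\ell=1}^{L}\mathbb{E}[P_\ell-P_{\ell-1}].$$

More generally, given a sequence of random variables $\Delta P_0,\Delta P_1,\ldots$ which satisfy

$$\mathbb{E}[\Delta P_0]=\mathbb{E}[P_0]\quad\text{and}\quad\mathbb{E}[\Delta P_\ell]=\mathbb{E}[P_\ell-P_{\ell-1}]\quad\text{for }\ell\ge 1,$$

we have

$$\mathbb{E}[P_L]=\sum_{\ell=0}^{L}\mathbb{E}[\Delta P_\ell].$$

Then the MLMC estimator is given by a sum of independent Monte Carlo estimates of $\mathbb{E}[\Delta P_\ell]$, i.e.,

$$Z_{\mathrm{MLMC}}=\sum_{\ell=0}^{L}\overline{\Delta P_\ell}^{N_\ell}=\sum_{\ell=0}^{L}\frac{1}{N_\ell}\sum_{n=1}^{N_\ell}\Delta P_\ell^{(n)}. \tag{6}$$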

Since $P_0,P_1,\ldots$ approximate $P$ with increasing accuracy, through a tight coupling of $P_\ell$ and $P_{\ell-1}$, the variance of the correction variable $\Delta P_\ell$ is expected to get smaller as the level $\ell$ increases. This implies that the number of samples $N_\ell$ needed to estimate each quantity $\mathbb{E}[\Delta P_\ell]$ accurately can also get smaller as the level $\ell$ increases. If this is the case, the total computational cost can be reduced significantly as compared to the standard Monte Carlo method.
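The estimator (6) can be sketched generically in Python. The toy problem below is an assumption made for illustration and is not from the article: $P_\ell$ is the square of the average of $2^\ell$ i.i.d. $N(1,1)$ draws, so that $\mathbb{E}[P_\ell]=1+2^{-\ell}$, and the coarse level reuses the first half of the fine-level draws (a plain, non-antithetic coupling). The function names are hypothetical.

```python
import random

def mlmc_estimate(sample_correction, n_samples, rng):
    """MLMC estimator: sum over levels of sample averages of Delta P_level.

    sample_correction(level, rng) returns one sample of Delta P_level,
    and n_samples[level] is the number of samples N_level.
    """
    total = 0.0
    for level, n in enumerate(n_samples):
        total += sum(sample_correction(level, rng) for _ in range(n)) / n
    return total

def toy_correction(level, rng):
    """One sample of P_level - P_(level-1) for the toy problem (P_0 at level 0)."""
    m = 2 ** level
    xs = [rng.gauss(1.0, 1.0) for _ in range(m)]
    fine = (sum(xs) / m) ** 2
    if level == 0:
        return fine
    half = m // 2
    coarse = (sum(xs[:half]) / half) ** 2  # coupled: reuses the fine draws
    return fine - coarse

# Fewer samples on the more expensive, lower-variance fine levels.
n_samples = [4000, 2000, 1000, 500, 250, 125]
estimate = mlmc_estimate(toy_correction, n_samples, random.Random(0))
target = 1.0 + 2.0 ** -(len(n_samples) - 1)  # E[P_L] for L = 5
```

The sample counts halve as the per-sample cost doubles, so every level contributes the same amount of work in this sketch.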

The following basic theorem from [6, 5, 7] makes the above observation explicit.

###### Theorem 1.

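Let $P$ be a random variable, and for $\ell\ge 0$, let $P_\ell$ be the level $\ell$ approximation of $P$. Assume that there exist independent correction random variables $\Delta P_\ell$ with expected cost $C_\ell$ and variance $V_\ell$, and positive constants $\alpha,\beta,\gamma,c_1,c_2,c_3$ such that $\alpha\ge\frac{1}{2}\min(\beta,\gamma)$ and

1. $|\mathbb{E}[P_\ell-P]|\le c_1 2^{-\alpha\ell}$,

2. $V_\ell\le c_2 2^{-\beta\ell}$,

3. $C_\ell\le c_3 2^{\gamma\ell}$.

Then there exists a positive constant $c_4$ such that, for any root-mean-square accuracy $\varepsilon<e^{-1}$, there are $L$ and $N_0,\ldots,N_L$ for which the MLMC estimator (6) achieves a mean-square error less than $\varepsilon^2$, i.e.,

$$\mathbb{E}\left[(Z_{\mathrm{MLMC}}-\mathbb{E}[P])^2\right]\le\varepsilon^2$$

with a computational cost bounded above by

$$\mathbb{E}[C]\le\begin{cases}c_4\,\varepsilon^{-2}, & \beta>\gamma,\\ c_4\,\varepsilon^{-2}(\log\varepsilon^{-1})^2, & \beta=\gamma,\\ c_4\,\varepsilon^{-2-(\gamma-\beta)/\alpha}, & \beta<\gamma.\end{cases}$$
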
###### Remark 1.

As discussed in [5, Section 2.1] and [9, Section 2.1], under similar assumptions to those in Theorem 1, the standard Monte Carlo estimator achieves a root-mean-square accuracy $\varepsilon$ at a cost of $O(\varepsilon^{-2-\gamma/\alpha})$. Therefore, regardless of the values of $\beta$ and $\gamma$, the MLMC estimator always has an asymptotically lower complexity bound than the standard Monte Carlo estimator.
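To make the comparison concrete, the following sketch computes the exponent $r$ in a cost of $O(\varepsilon^{-r})$ for the three regimes of Theorem 1 and for the standard Monte Carlo cost $O(\varepsilon^{-2-\gamma/\alpha})$. The function names are hypothetical, and the logarithmic factor in the case $\beta=\gamma$ is deliberately ignored.

```python
def mlmc_cost_exponent(alpha, beta, gamma):
    """Exponent r with MLMC cost O(eps**-r), up to log factors when beta == gamma."""
    if beta >= gamma:
        return 2.0
    return 2.0 + (gamma - beta) / alpha

def standard_mc_cost_exponent(alpha, gamma):
    """Exponent r with standard Monte Carlo cost O(eps**-r)."""
    return 2.0 + gamma / alpha

# Illustrative parameter values, not results from the article.
for alpha, beta, gamma in [(1.0, 1.5, 1.0), (0.5, 1.0, 1.0), (0.5, 0.5, 1.0)]:
    print(alpha, beta, gamma,
          mlmc_cost_exponent(alpha, beta, gamma),
          standard_mc_cost_exponent(alpha, gamma))
```

Since $\beta>0$, the MLMC exponent is strictly smaller than the standard Monte Carlo exponent in every regime.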

### 2.2 MLMC estimator

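Here we construct an MLMC estimator of the difference $\mathrm{EVPI}-\mathrm{EVSI}$. Our starting point is to plug (3) into (4), which results in

$$\mathrm{EVPI}-\mathrm{EVSI}=\mathbb{E}_{Y}\left[\frac{\mathbb{E}_{\theta}\left[\max_{d\in D}f_d(\theta)\rho(Y|\theta)\right]}{\mathbb{E}_{\theta}[\rho(Y|\theta)]}-\max_{d\in D}\frac{\mathbb{E}_{\theta}[f_d(\theta)\rho(Y|\theta)]}{\mathbb{E}_{\theta}[\rho(Y|\theta)]}\right].$$

Then, within the framework of the MLMC method, let us consider a real-valued random variable

$$P=\frac{\mathbb{E}_{\theta}\left[\max_{d\in D}f_d(\theta)\rho(Y|\theta)\right]}{\mathbb{E}_{\theta}[\rho(Y|\theta)]}-\max_{d\in D}\frac{\mathbb{E}_{\theta}[f_d(\theta)\rho(Y|\theta)]}{\mathbb{E}_{\theta}[\rho(Y|\theta)]}$$

with $Y$ being the underlying random variable. We see that $\mathbb{E}_{Y}[P]=\mathrm{EVPI}-\mathrm{EVSI}$, but $P$ cannot be sampled exactly because of the inner expectations.

It is important, however, that all these inner expectations are taken with respect to the prior probability density of $\theta$, so that they can be approximated by the standard Monte Carlo method without requiring any MCMC sampling. Thus, as a sequence of random variables $P_0,P_1,\ldots$ which approximate $P$ with increasing accuracy, we consider the standard Monte Carlo estimation of the inner expectations in $P$:

$$P_\ell:=\frac{\overline{\max_{d\in D}f_d(\cdot)\rho(Y|\cdot)}^{M_\ell}}{\overline{\rho(Y|\cdot)}^{M_\ell}}-\max_{d\in D}\frac{\overline{f_d(\cdot)\rho(Y|\cdot)}^{M_\ell}}{\overline{\rho(Y|\cdot)}^{M_\ell}}$$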

for an increasing sequence $M_0<M_1<\cdots$ with $M_\ell\to\infty$ as $\ell\to\infty$. Here, in the definition of $P_\ell$, we can use either the same $M_\ell$ samples of $\theta$ for all of the averages, or independent samples of $\theta$ for each average. In this article we focus on the former approach and consider a geometric progression for $M_\ell$, i.e., let $M_\ell=M_0 2^{\ell}$ for some positive integer $M_0$.

Regarding a sequence of the correction variables $\Delta P_0,\Delta P_1,\ldots$, following the ideas of [4, 9, 10], we consider an antithetic coupling of $P_\ell$ and $P_{\ell-1}$. That is, the set of $M_\ell$ samples of $\theta$ used to compute $P_\ell$ is split into two disjoint sets of $M_{\ell-1}$ samples to compute two independent realizations of $P_{\ell-1}$, denoted by $P^{(a)}_{\ell-1}$ and $P^{(b)}_{\ell-1}$. Then $\Delta P_\ell$ is defined by $\Delta P_0=P_0$ and

$$\begin{aligned}
\Delta P_\ell &:= P_\ell-\frac{P^{(a)}_{\ell-1}+P^{(b)}_{\ell-1}}{2}\\
&= \frac{\overline{\max_{d\in D}f_d(\cdot)\rho(Y|\cdot)}}{\overline{\rho(Y|\cdot)}}-\max_{d\in D}\frac{\overline{f_d(\cdot)\rho(Y|\cdot)}}{\overline{\rho(Y|\cdot)}}\\
&\quad-\frac{1}{2}\left[\frac{\overline{\max_{d\in D}f_d(\cdot)\rho(Y|\cdot)}^{(a)}}{\overline{\rho(Y|\cdot)}^{(a)}}+\frac{\overline{\max_{d\in D}f_d(\cdot)\rho(Y|\cdot)}^{(b)}}{\overline{\rho(Y|\cdot)}^{(b)}}\right]\\
&\quad+\frac{1}{2}\left[\max_{d\in D}\frac{\overline{f_d(\cdot)\rho(Y|\cdot)}^{(a)}}{\overline{\rho(Y|\cdot)}^{(a)}}+\max_{d\in D}\frac{\overline{f_d(\cdot)\rho(Y|\cdot)}^{(b)}}{\overline{\rho(Y|\cdot)}^{(b)}}\right],
\end{aligned}$$

for $\ell\ge 1$, where we have omitted the superscripts $M_\ell$ from the first and second terms, and the averages with the superscripts $(a)$ and $(b)$ are taken by using the first and second $M_{\ell-1}$ samples of $\theta$ used to compute the first two terms, respectively. Assuming that each computation of $f_d$, $h$ and $\rho$ can be performed with unit cost, it is clear that $\gamma=1$ in Theorem 1 and $\mathbb{E}[\Delta P_\ell]=\mathbb{E}[P_\ell-P_{\ell-1}]$ for $\ell\ge 1$ because of the independence of the samples.

We mean by the word “antithetic” that the following properties hold:

$$\overline{\rho(Y|\cdot)}=\frac{\overline{\rho(Y|\cdot)}^{(a)}+\overline{\rho(Y|\cdot)}^{(b)}}{2}, \tag{7}$$

$$\overline{f_d(\cdot)\rho(Y|\cdot)}=\frac{\overline{f_d(\cdot)\rho(Y|\cdot)}^{(a)}+\overline{f_d(\cdot)\rho(Y|\cdot)}^{(b)}}{2}\quad\text{for all }d\in D, \tag{8}$$

$$\overline{\max_{d\in D}f_d(\cdot)\rho(Y|\cdot)}=\frac{\overline{\max_{d\in D}f_d(\cdot)\rho(Y|\cdot)}^{(a)}+\overline{\max_{d\in D}f_d(\cdot)\rho(Y|\cdot)}^{(b)}}{2}. \tag{9}$$

This is the key advantage of the antithetic correction as compared to the standard correction $P_\ell-P_{\ell-1}$.
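The construction of $P_\ell$ and the antithetic correction $\Delta P_\ell$ can be sketched in Python as follows. The toy model is an assumption made for illustration, not the article's test case: $\theta\sim N(0,1)$, $h(\theta)=\theta$, $\epsilon\sim N(0,\sigma^2)$ with $\sigma=1$, $f_1(\theta)=\theta$ and $f_2(\theta)=0$, for which $\mathrm{EVPI}-\mathrm{EVSI}=(1-1/\sqrt{1+\sigma^2})/\sqrt{2\pi}$. All inner samples are drawn from the prior, and the function names and sample sizes are hypothetical.

```python
import math
import random

SIGMA = 1.0  # assumed noise standard deviation of the toy model
PAYOFFS = [lambda t: t, lambda t: 0.0]  # hypothetical f_1 and f_2

def likelihood(y, theta):
    """rho(Y | theta) for Y = theta + eps with eps ~ N(0, SIGMA^2)."""
    z = (y - theta) / SIGMA
    return math.exp(-0.5 * z * z) / (SIGMA * math.sqrt(2.0 * math.pi))

def p_level(thetas, y):
    """P_l built from one set of prior samples (ratios of sample averages)."""
    w = [likelihood(y, t) for t in thetas]
    denom = sum(w)
    g_max = sum(wi * max(f(t) for f in PAYOFFS) for t, wi in zip(thetas, w)) / denom
    max_g = max(sum(wi * f(t) for t, wi in zip(thetas, w)) / denom for f in PAYOFFS)
    return g_max - max_g

def delta_p(level, m0, rng):
    """One sample of the antithetic correction Delta P_level."""
    y = rng.gauss(0.0, 1.0) + rng.gauss(0.0, SIGMA)   # outer sample of Y
    m = m0 * 2 ** level
    thetas = [rng.gauss(0.0, 1.0) for _ in range(m)]  # inner prior samples
    fine = p_level(thetas, y)
    if level == 0:
        return fine
    half = m // 2
    # Two coarse estimates from the two disjoint halves of the fine samples.
    coarse = 0.5 * (p_level(thetas[:half], y) + p_level(thetas[half:], y))
    return fine - coarse

def mlmc(n_samples, m0, rng):
    """MLMC estimate of EVPI - EVSI with N_level = n_samples[level]."""
    return sum(
        sum(delta_p(level, m0, rng) for _ in range(n)) / n
        for level, n in enumerate(n_samples)
    )

estimate = mlmc([2000, 1000, 500, 250, 125], m0=4, rng=random.Random(1))
exact = (1.0 - 1.0 / math.sqrt(1.0 + SIGMA**2)) / math.sqrt(2.0 * math.pi)
```

Because the two halves together make up the fine-level sample set, the sums defining the numerators and denominators satisfy (7) to (9) automatically; only the ratios and the maxima differ between the fine and coarse terms.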

### 2.3 Combination with importance sampling

In practical applications, it is often the case that the likelihood $\rho(Y|\theta)$ (as a function of $\theta$ for a fixed $Y$) is highly concentrated around some values of $\theta$. If one uses i.i.d. samples of $\theta$ drawn from the prior density $\pi_0$ for estimating $\mathbb{E}_{\theta}[f_d(\theta)\rho(Y|\theta)]$ and $\mathbb{E}_{\theta}[\rho(Y|\theta)]$, most of the samples can fall outside such concentrated regions. As a result, these quantities are estimated as almost zero, which makes the estimates numerically unstable.

As discussed in [11, Section 3.4] for MLMC applied to another nested expectation problem, one can combine importance sampling techniques with our MLMC estimator to address this issue. Let $q^{Y}$ be an importance distribution of $\theta$ conditional on $Y$, which needs to satisfy $q^{Y}(\theta)>0$ for any $\theta$ with $\rho(Y|\theta)\pi_0(\theta)>0$. Since we have

$$\mathbb{E}_{\theta}[f_d(\theta)\rho(Y|\theta)]=\mathbb{E}_{\theta\sim q^{Y}}\left[\frac{f_d(\theta)\rho(Y|\theta)\pi_0(\theta)}{q^{Y}(\theta)}\right]$$

and

$$\mathbb{E}_{\theta}[\rho(Y|\theta)]=\mathbb{E}_{\theta\sim q^{Y}}\left[\frac{\rho(Y|\theta)\pi_0(\theta)}{q^{Y}(\theta)}\right],$$

the random variables $P_\ell$ and $\Delta P_\ell$ can be replaced, respectively, by

$$P_\ell=\frac{\overline{\max_{d\in D}f_d(\cdot)\rho(Y|\cdot)\pi_0(\cdot)/q^{Y}(\cdot)}^{M_\ell}}{\overline{\rho(Y|\cdot)\pi_0(\cdot)/q^{Y}(\cdot)}^{M_\ell}}-\max_{d\in D}\frac{\overline{f_d(\cdot)\rho(Y|\cdot)\pi_0(\cdot)/q^{Y}(\cdot)}^{M_\ell}}{\overline{\rho(Y|\cdot)\pi_0(\cdot)/q^{Y}(\cdot)}^{M_\ell}},$$

$$\begin{aligned}
\Delta P_\ell &= \frac{\overline{\max_{d\in D}f_d(\cdot)\rho(Y|\cdot)\pi_0(\cdot)/q^{Y}(\cdot)}}{\overline{\rho(Y|\cdot)\pi_0(\cdot)/q^{Y}(\cdot)}}-\max_{d\in D}\frac{\overline{f_d(\cdot)\rho(Y|\cdot)\pi_0(\cdot)/q^{Y}(\cdot)}}{\overline{\rho(Y|\cdot)\pi_0(\cdot)/q^{Y}(\cdot)}}\\
&\quad-\frac{1}{2}\left[\frac{\overline{\max_{d\in D}f_d(\cdot)\rho(Y|\cdot)\pi_0(\cdot)/q^{Y}(\cdot)}^{(a)}}{\overline{\rho(Y|\cdot)\pi_0(\cdot)/q^{Y}(\cdot)}^{(a)}}+\frac{\overline{\max_{d\in D}f_d(\cdot)\rho(Y|\cdot)\pi_0(\cdot)/q^{Y}(\cdot)}^{(b)}}{\overline{\rho(Y|\cdot)\pi_0(\cdot)/q^{Y}(\cdot)}^{(b)}}\right]\\
&\quad+\frac{1}{2}\left[\max_{d\in D}\frac{\overline{f_d(\cdot)\rho(Y|\cdot)\pi_0(\cdot)/q^{Y}(\cdot)}^{(a)}}{\overline{\rho(Y|\cdot)\pi_0(\cdot)/q^{Y}(\cdot)}^{(a)}}+\max_{d\in D}\frac{\overline{f_d(\cdot)\rho(Y|\cdot)\pi_0(\cdot)/q^{Y}(\cdot)}^{(b)}}{\overline{\rho(Y|\cdot)\pi_0(\cdot)/q^{Y}(\cdot)}^{(b)}}\right],
\end{aligned}$$

where all of the averages are taken with respect to i.i.d. samples of $\theta\sim q^{Y}$ for a randomly chosen $Y$.

We suggest choosing $q^{Y}$ as a good approximation of the posterior distribution $\pi^{Y}$. If one can do so, it follows from Bayes' theorem (3) that

$$\frac{\rho(Y|\theta)\pi_0(\theta)}{q^{Y}(\theta)}\approx\frac{\rho(Y|\theta)\pi_0(\theta)}{\pi^{Y}(\theta)}=\mathbb{E}_{\theta}[\rho(Y|\theta)].$$

Since the right-most side does not depend on $\theta$, the integrand appearing in the denominator of each term for $P_\ell$ and $\Delta P_\ell$ becomes close to a constant function, so that its variance is extremely small. In this way we can avoid the numerical instability of our original MLMC estimator.
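The effect can be checked numerically in a conjugate toy model where the posterior is known in closed form. The model is an assumption made for this illustration only: $\theta\sim N(0,1)$, $h(\theta)=\theta$ and $\epsilon\sim N(0,\sigma^2)$ with a small $\sigma$, so that the likelihood is sharply peaked. Taking the importance density equal to the exact posterior $N\!\left(Y/(1+\sigma^2),\,\sigma^2/(1+\sigma^2)\right)$ makes every weight equal to $\mathbb{E}_{\theta}[\rho(Y|\theta)]$, whereas prior sampling produces weights spanning many orders of magnitude.

```python
import math
import random

def normal_pdf(x, mean, var):
    """Density of N(mean, var) at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def importance_weights(y, sigma, n, rng):
    """Weights rho(y|theta) * pi_0(theta) / q(theta), with q the exact posterior."""
    post_mean = y / (1.0 + sigma**2)
    post_var = sigma**2 / (1.0 + sigma**2)
    thetas = [rng.gauss(post_mean, math.sqrt(post_var)) for _ in range(n)]
    return [
        normal_pdf(y, t, sigma**2) * normal_pdf(t, 0.0, 1.0)
        / normal_pdf(t, post_mean, post_var)
        for t in thetas
    ]

def prior_weights(y, sigma, n, rng):
    """Likelihood values rho(y|theta) at samples drawn from the prior."""
    return [normal_pdf(y, rng.gauss(0.0, 1.0), sigma**2) for _ in range(n)]

y, sigma = 2.0, 0.1
rng = random.Random(0)
w_is = importance_weights(y, sigma, 1000, rng)
w_prior = prior_weights(y, sigma, 1000, rng)
evidence = normal_pdf(y, 0.0, 1.0 + sigma**2)  # E_theta[rho(y|theta)] in this model
```

In practice the exact posterior is unavailable, so $q^{Y}$ would only approximate it and the weights would be nearly, rather than exactly, constant.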

## 3 Theoretical results

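In this section, we establish the orders of the variance and the mean of $\Delta P_\ell$ under a set of assumptions on the decision and information models. Together with Theorem 1, this implies that our MLMC estimator of $\mathrm{EVPI}-\mathrm{EVSI}$ achieves a root-mean-square accuracy $\varepsilon$ at the optimal cost of $O(\varepsilon^{-2})$.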

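In what follows, for simplicity of notation, we write $\rho(Y)=\mathbb{E}_{\theta}[\rho(Y|\theta)]$ and $F_d(Y)=\mathbb{E}_{\theta}[f_d(\theta)\rho(Y|\theta)]$. Note that we have

$$\mathbb{E}\left[\frac{\overline{\rho(Y|\cdot)}}{\rho(Y)}\,\middle|\,Y\right]=\mathbb{E}_{\theta}\left[\frac{\rho(Y|\theta)}{\rho(Y)}\,\middle|\,Y\right]=1.$$

Moreover we write

$$\overline{g_d}^{(a)}=\frac{\overline{f_d(\cdot)\rho(Y|\cdot)}^{(a)}}{\overline{\rho(Y|\cdot)}^{(a)}},\quad\overline{g_d}^{(b)}=\frac{\overline{f_d(\cdot)\rho(Y|\cdot)}^{(b)}}{\overline{\rho(Y|\cdot)}^{(b)}},\quad\overline{g_d}=\frac{\overline{f_d(\cdot)\rho(Y|\cdot)}}{\overline{\rho(Y|\cdot)}},$$

for $d\in D$, and

$$\overline{g}^{(a)}_{\max}=\frac{\overline{\max_{d\in D}f_d(\cdot)\rho(Y|\cdot)}^{(a)}}{\overline{\rho(Y|\cdot)}^{(a)}},\quad\overline{g}^{(b)}_{\max}=\frac{\overline{\max_{d\in D}f_d(\cdot)\rho(Y|\cdot)}^{(b)}}{\overline{\rho(Y|\cdot)}^{(b)}},\quad\overline{g}_{\max}=\frac{\overline{\max_{d\in D}f_d(\cdot)\rho(Y|\cdot)}}{\overline{\rho(Y|\cdot)}}.$$

Then $\Delta P_\ell$ is given by $\Delta P_\ell=\Delta P_{\ell,1}+\Delta P_{\ell,2}$ with

$$\Delta P_{\ell,1}=\overline{g}_{\max}-\frac{1}{2}\left(\overline{g}^{(a)}_{\max}+\overline{g}^{(b)}_{\max}\right)\quad\text{and}\quad\Delta P_{\ell,2}=\frac{1}{2}\left(\max_{d\in D}\overline{g_d}^{(a)}+\max_{d\in D}\overline{g_d}^{(b)}\right)-\max_{d\in D}\overline{g_d}.$$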

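For a given $Y$, we define

$$G_d(Y):=\mathbb{E}_{\theta|Y}[f_d(\theta)]=\frac{\mathbb{E}_{\theta}[f_d(\theta)\rho(Y|\theta)]}{\mathbb{E}_{\theta}[\rho(Y|\theta)]}=\frac{F_d(Y)}{\rho(Y)}$$

for each $d\in D$, and

$$d_{\mathrm{opt}}(Y):=\arg\max_{d\in D}F_d(Y)=\arg\max_{d\in D}G_d(Y).$$

The second equality is trivial since the denominator $\rho(Y)$ of $G_d(Y)$ does not affect the choice $d_{\mathrm{opt}}(Y)$. The domain for $Y$ is divided into a number of sub-domains in which the optimal decision $d_{\mathrm{opt}}(Y)$ is unique. We denote by $K$ the dividing decision manifold on which $d_{\mathrm{opt}}(Y)$ is not uniquely defined, and we assume that $K$ is a lower-dimensional subspace of the domain for $Y$.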

Let us give some assumptions on the decision and information models:

###### Assumption 1.

There exists a constant $f_{\max}<\infty$ such that $|f_d(\theta)|\le f_{\max}$ for any $d\in D$ and $\theta$.

###### Assumption 2.

There exists a constant $c_0>0$ such that for all $\epsilon>0$

$$\mathbb{P}_{Y}\left[\min_{y\in K}\|Y-y\|\le\epsilon\right]\le c_0\epsilon.$$

###### Assumption 3.

There exist constants $c_1,c_2>0$ such that if $Y\notin K$, the following holds:

$$\max_{d\in D}G_d(Y)-\max_{\substack{d\in D\\ d\ne d_{\mathrm{opt}}(Y)}}G_d(Y)>\min\left(c_1,\;c_2\min_{y\in K}\|Y-y\|\right).$$

###### Assumption 4.

There exists a constant $p$ such that

$$\mathbb{E}_{Y}\left[\mathbb{E}_{\theta}\left[\left(\frac{\rho(Y|\theta)}{\rho(Y)}\right)^{p}\right]\right]<\infty.$$

###### Remark 2.

Assumptions 1-3 are similar to those considered in earlier work. In particular, Assumption 2 is introduced to ensure a bound on the probability that $Y$ is close to the decision manifold $K$, while Assumption 3 ensures a linear separation of different decisions as $Y$ moves away from $K$. Assumption 1, which is stronger than [9, Assumption 1], together with Assumption 4 enables us to bound the difference between $\overline{g_d}$ and $G_d$.
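As a sanity check on what Assumptions 2 and 3 ask for, consider a conjugate toy model that is not taken from the article: $\theta\sim N(0,1)$, $Y=\theta+\epsilon$ with $\epsilon\sim N(0,\sigma^2)$, $f_1(\theta)=\theta$ and $f_2(\theta)=0$. Then $G_1(Y)=Y/(1+\sigma^2)$, $G_2(Y)=0$, the decision manifold is $K=\{0\}$, and $Y\sim N(0,1+\sigma^2)$. The sketch below verifies both inequalities on a few points, with constants `C0`, `C1` and `C2` chosen by hand for this toy model.

```python
import math

SIGMA = 1.0  # assumed noise standard deviation of the toy model

def decision_gap(y):
    """Difference between the best and second-best G_d(y) in the toy model."""
    g = sorted([y / (1.0 + SIGMA**2), 0.0], reverse=True)
    return g[0] - g[1]

def prob_near_manifold(eps):
    """P[|Y| <= eps] for Y ~ N(0, 1 + SIGMA^2); here K = {0}."""
    return math.erf(eps / math.sqrt(2.0 * (1.0 + SIGMA**2)))

# Hand-picked constants for this toy model (assumptions, not from the article).
C0 = math.sqrt(2.0 / (math.pi * (1.0 + SIGMA**2)))  # twice the peak density of Y
C1, C2 = 1.0, 0.5 / (1.0 + SIGMA**2)

assumption2_ok = all(prob_near_manifold(e) <= C0 * e for e in [0.01, 0.1, 1.0, 5.0])
assumption3_ok = all(
    decision_gap(y) > min(C1, C2 * abs(y)) for y in [-3.0, -0.1, 0.2, 4.0]
)
```

The gap between the two decisions grows linearly with the distance $|Y|$ from $K$, which is the kind of linear separation that Assumption 3 formalizes.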

###### Theorem 2.

If Assumptions 1-4 are satisfied, we have

where $p$ is as given in Assumption 4.

###### Remark 3.

When importance sampling is used within the MLMC estimator, the same orders of the variance and the mean of $\Delta P_\ell$ can be shown by replacing Assumption 4 with the existence of a constant $p$ such that

$$\mathbb{E}_{Y}\left[\mathbb{E}_{\theta}\left[\left(\frac{\rho(Y|\theta)\pi_0(\theta)}{\rho(Y)q^{Y}(\theta)}\right)^{p}\right]\right]<\infty.$$

Since the result can be proven in the same manner as for the original MLMC estimator, we shall give a proof of Theorem 2 only for the original estimator.

This theorem determines the parameters $\alpha$ and $\beta$ in Theorem 1 in terms of $p$. When the resulting $\beta$ exceeds $\gamma=1$, our MLMC estimator of $\mathrm{EVPI}-\mathrm{EVSI}$ is in the first regime of Theorem 1, so that the total computational complexity to achieve a root-mean-square accuracy $\varepsilon$ is of order $O(\varepsilon^{-2})$. When instead the equality $\beta=\gamma$ holds, our MLMC estimator is in the second regime. In the next subsection, we give a proof of this theorem by using several lemmas which are shown later in Subsection 3.2.

### 3.1 Proof of the main result

We follow a similar argument to that used in [9, Theorem 3] in conjunction with novel results shown later. Recalling that $\Delta P_\ell=\Delta P_{\ell,1}+\Delta P_{\ell,2}$, we have

$$\mathbb{V}[\Delta P_\ell]\le\mathbb{E}[|\Delta P_\ell|^2]\le 2\,\mathbb{E}[|\Delta P_{\ell,1}|^2]+2\,\mathbb{E}[|\Delta P_{\ell,2}|^2].$$

We shall see later in Remark 4 that the first term on the right-hand side decays no slower than the desired order for any $p$. Thus it suffices to prove that the second term is of the desired order.

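For $p$ as given in Assumption 4, let us define a threshold $\epsilon>0$ in terms of $p$ and $\ell$ and consider the events

$$A\equiv\left\{\min_{y\in K}\|Y-y\|\le\epsilon\right\},\qquad B\equiv\bigcup_{d\in D}\left\{\max\left(\left|\overline{g_d}^{(a)}-G_d\right|,\left|\overline{g_d}^{(b)}-G_d\right|,\left|\overline{g_d}-G_d\right|\right)\ge\frac{1}{2}c_2\epsilon\right\},$$

where $c_2$ is as defined in Assumption 3.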

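For an event $E$, let $\mathbb{1}_{E}$ denote the indicator function which is 1 if $E$ occurs and zero otherwise, and let $E^{c}$ denote the complement of $E$. By using Hölder's inequality, we have

$$\begin{aligned}
\mathbb{E}[|\Delta P_{\ell,2}|^2] &= \mathbb{E}[|\Delta P_{\ell,2}|^2\mathbb{1}_{A\cup B}]+\mathbb{E}[|\Delta P_{\ell,2}|^2\mathbb{1}_{A^{c}\cap B^{c}}]\\
&\le \left(\mathbb{E}[|\Delta P_{\ell,2}|^{p}]\right)^{2/p}\left(\mathbb{E}\left[\mathbb{1}_{A\cup B}^{p/(p-2)}\right]\right)^{(p-2)/p}+\mathbb{E}[|\Delta P_{\ell,2}|^2\mathbb{1}_{A^{c}\cap B^{c}}]\\
&\le \left(\mathbb{E}[|\Delta P_{\ell,2}|^{p}]\right)^{2/p}\left(\mathbb{P}[A]+\mathbb{P}[B]\right)^{(p-2)/p}+\mathbb{E}[|\Delta P_{\ell,2}|^2\mathbb{1}_{A^{c}\cap B^{c}}].
\end{aligned}$$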
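To spell out the two inequalities: the first applies Hölder's inequality with the conjugate exponents $p/2$ and $p/(p-2)$, which requires $p>2$, and the second uses that an indicator raised to any positive power is unchanged, followed by a union bound. Written out as a sketch with the shorthand $X=|\Delta P_{\ell,2}|^2$ (introduced here only for illustration):

$$\mathbb{E}[X\,\mathbb{1}_{A\cup B}]\le\left(\mathbb{E}[X^{p/2}]\right)^{2/p}\left(\mathbb{E}\left[\mathbb{1}_{A\cup B}^{p/(p-2)}\right]\right)^{(p-2)/p},\qquad\frac{2}{p}+\frac{p-2}{p}=1,$$

$$\mathbb{E}\left[\mathbb{1}_{A\cup B}^{p/(p-2)}\right]=\mathbb{P}[A\cup B]\le\mathbb{P}[A]+\mathbb{P}[B].$$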

In the following we bound the quantities appearing on the right-hand side in turn.

Bound on . Noting that the inequality