
# An efficient estimation of nested expectations without conditional sampling

Estimating nested expectations is an important task in computational mathematics and statistics. In this paper we propose a new Monte Carlo method based on post-stratification to estimate nested expectations efficiently without taking samples of the inner random variable from the conditional distribution given the outer random variable. This property gives our method an advantage over many existing methods: it enables us to estimate nested expectations using only a dataset of pairs of the inner and outer variables drawn from their joint distribution. We show an upper bound on the mean squared error of the proposed method under some assumptions. Numerical experiments are conducted to compare our proposed method with several existing methods (the nested Monte Carlo method, the multilevel Monte Carlo method, and a regression-based method), and we see that our proposed method is superior to the compared methods in terms of both efficiency and applicability.


## 1 Introduction

Motivated by several different applications, we study the estimation of nested expectations, which are defined as follows. Let $X$ and $Y$ be possibly dependent random variables following the joint probability density $\rho(X,Y)$. For a function $f$, the nested expectation is defined by

$$I = \mathbb{E}_{\rho(Y)} f\left( \mathbb{E}_{\rho(X\mid Y)} X \right). \qquad (1)$$

Here we emphasize that the outer expectation is taken with respect to the marginal distribution of $Y$, while the inner expectation is with respect to the conditional distribution of $X$ given $Y$. Throughout this paper, we simply write $\rho(X)$ (resp. $\rho(Y)$) to denote the marginal probability density of $X$ (resp. $Y$), and also write $\rho(X\mid Y)$ (resp. $\rho(Y\mid X)$) to denote the conditional probability density of $X$ given $Y$ (resp. of $Y$ given $X$).

The motivating examples are as follows:

###### Example 1 (expected information gain).

The concept of Bayesian experimental design aims to construct an optimal experimental design under which making the observation $Y$ maximizes the expected information gain (EIG) on the input random variable $\theta$ Lindley (1956); Chaloner and Verdinelli (1995). Here the EIG denotes the expected amount of reduction in the Shannon information entropy and is given by

$$\begin{aligned}
& \mathbb{E}_{\rho(Y)} \mathbb{E}_{\rho(\theta\mid Y)} \log \rho(\theta\mid Y) - \mathbb{E}_{\rho(\theta)} \log \rho(\theta) && (2) \\
&= \mathbb{E}_{\rho(\theta)} \mathbb{E}_{\rho(Y\mid \theta)} \log \rho(Y\mid \theta) - \mathbb{E}_{\rho(Y)} \log\left( \mathbb{E}_{\rho(\theta)} \rho(Y\mid \theta) \right), && (3)
\end{aligned}$$

where the equality follows from Bayes' theorem. A nested expectation appears in the second term on the right-hand side.

###### Example 2 (expected value of sample information).

Let $D$ be a finite set of possible medical treatments. As the outcome and cost of each treatment $d \in D$ are uncertain, we model its net benefit as a function of the input random variable $\theta$, denoted by $NB_d(\theta)$, where $\theta$ includes, for instance, the probability of side effects and the cost of treatment.

In the context of medical decision making, we want to know whether it is worth conducting a clinical trial or medical research to reduce the uncertainty of $\theta$ Welton et al. (2012). Denoting the observation from a clinical trial or medical research by $Y$, the expected value of sample information (EVSI) measures the average gain in the net benefit from making the observation and is given by

$$\mathbb{E}_{\rho(Y)} \max_{d\in D} \mathbb{E}_{\rho(\theta\mid Y)} NB_d(\theta) - \max_{d\in D} \mathbb{E}_{\rho(\theta)} NB_d(\theta), \qquad (4)$$

where the first term represents the average net benefit when choosing the optimal treatment depending on the observation $Y$, and the second term represents the net benefit without making the observation. Here the first term is exactly a nested expectation as given in (1).

The nested Monte Carlo (NMC) method is probably the most straightforward approach to estimating nested expectations. The idea is quite simple: approximate the inner and outer expectations by standard Monte Carlo averages, respectively. To be more precise, for positive integers $N_p$ and $N_q$, the NMC estimator is given by

$$\frac{1}{N_p} \sum_{p=1}^{N_p} f\left( \frac{1}{N_q} \sum_{q=1}^{N_q} X^{(p,q)} \right), \qquad (5)$$

where $Y^{(1)}, \ldots, Y^{(N_p)}$ denote i.i.d. samples drawn from $\rho(Y)$, $X^{(p,1)}, \ldots, X^{(p,N_q)}$ denote i.i.d. samples drawn from $\rho(X\mid Y^{(p)})$ for each $p$, and the inner sum over $q$ is taken element-wise. However, it is known that a large computational cost is necessary for the NMC method to estimate nested expectations with high accuracy Rainforth et al. (2018). Moreover, the NMC method has a disadvantage in terms of applicability, since it requires generating i.i.d. samples from $\rho(X\mid Y^{(p)})$, which is often quite hard in applications Strong et al. (2015); Goda et al. (2020); Hironaka et al. (2020).
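For illustration, the NMC estimator (5) can be sketched as follows. The toy Gaussian model, in which conditional sampling happens to be easy, and the choice $f(x) = x^2$ are our own assumptions, not from the paper:

```python
import numpy as np

def nested_monte_carlo(f, sample_outer, sample_inner, n_outer, n_inner, rng):
    """NMC estimator (5): outer average of f applied to inner-sample means.

    sample_outer(rng, n) draws n i.i.d. outer samples Y ~ rho(Y);
    sample_inner(rng, y, n) draws n i.i.d. inner samples X ~ rho(X | Y = y).
    """
    ys = sample_outer(rng, n_outer)
    inner_means = np.array([sample_inner(rng, y, n_inner).mean() for y in ys])
    return f(inner_means).mean()

# Toy model (our assumption): Y ~ N(0,1), X | Y ~ N(Y,1), f(x) = x^2,
# so E[X|Y] = Y and the true nested expectation is I = E[Y^2] = 1.
rng = np.random.default_rng(0)
estimate = nested_monte_carlo(
    f=lambda x: x**2,
    sample_outer=lambda rng, n: rng.standard_normal(n),
    sample_inner=lambda rng, y, n: y + rng.standard_normal(n),
    n_outer=2000,
    n_inner=2000,
    rng=rng,
)
```

Note that the total cost is $N_p N_q$ inner draws, which is exactly the cost blow-up discussed above.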

A typical situation in estimating nested expectations is, instead, that we can generate i.i.d. samples from $\rho(X)$ and $\rho(Y\mid X)$, or those from the joint density $\rho(X,Y)$. One way to tackle this issue is to use a Markov chain sampler targeting $\rho(X\mid Y^{(p)})$ directly for each $Y^{(p)}$ in (5). Although the resulting estimator might be consistent, it is quite hard to obtain a non-asymptotic upper bound on the mean squared error and to choose an optimal allocation for $N_p$ and $N_q$ with the total cost fixed. Another way is to rewrite (1) into

$$\mathbb{E}_{\rho(Y)} f\left( \frac{\mathbb{E}_{\rho(X)}\, X\, \rho(Y\mid X)}{\mathbb{E}_{\rho(X)}\, \rho(Y\mid X)} \right),$$

by using Bayes' theorem. The problem is then that, as a function of $X$, the likelihood $\rho(Y\mid X)$ is typically concentrated in a small region of the support of $\rho(X)$, so that we need sophisticated techniques such as importance sampling to reliably estimate the ratio of the two expectations inside $f$.
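As a sketch of this rewritten form of (1) via Bayes' theorem (under an assumed Gaussian toy model of our own), the inner ratio of expectations can be estimated by a self-normalized weighted average over prior samples:

```python
import numpy as np

def inner_ratio(y, xs, sigma):
    """Self-normalized estimate of E_rho(X)[X rho(y|X)] / E_rho(X)[rho(y|X)]
    from i.i.d. samples xs ~ rho(X), assuming a N(x, sigma^2) observation
    model so that rho(y|x) is proportional to exp(-(y-x)^2 / (2 sigma^2))."""
    w = np.exp(-0.5 * ((y - xs) / sigma) ** 2)  # unnormalized likelihood weights
    return (xs * w).sum() / w.sum()

# Toy model (our assumption): X ~ N(0,1) and Y | X ~ N(X, sigma^2), for
# which the exact inner conditional mean is E[X | Y = y] = y / (1 + sigma^2).
rng = np.random.default_rng(1)
sigma = 0.5
xs = rng.standard_normal(100_000)
post_mean = inner_ratio(1.0, xs, sigma)  # exact value: 1.0 / 1.25 = 0.8
```

When the likelihood concentrates (small `sigma` relative to the prior spread), most weights are near zero and the effective sample size collapses, which is precisely the reliability issue mentioned above.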

There are some recent works that overcome these problems of the NMC estimator. We refer to Goda et al. (2020); Beck et al. (2018) for an efficient estimation of the EIG, and to Strong et al. (2015); Hironaka et al. (2020); Menzies (2016); Jalal and Alarid-Escudero (2018); Heath et al. (2018, 2020) for an efficient estimation of the EVSI. Estimating nested expectations is also an important task in portfolio risk measurement Duffie (2010); Gordy and Juneja (2010), and the relevant literature can be found in Broadie et al. (2015); Hong et al. (2017); Giles and Haji-Ali (2019).

In this paper, we propose a new Monte Carlo estimator for nested expectations based on post-stratification, which does not require sampling from the conditional distribution $\rho(X\mid Y)$. In fact, our proposed estimator only requires a dataset on the pair $(X,Y)$ drawn from the joint distribution $\rho(X,Y)$, and avoids any need for a Markov chain sampler or importance sampling. Moreover, our proposed estimator is experimentally more efficient than several existing methods on numerical examples with small $K$ (the dimension of $Y$). The rest of this paper is organized as follows. In Section 2 we introduce our proposed estimator, and then give a theoretical analysis to derive an upper bound on its mean squared error in Section 3. Numerical experiments in Section 4 compare our proposed method with several existing methods (the NMC method, the multilevel Monte Carlo method Goda et al. (2020); Hironaka et al. (2020), and a regression-based method Strong et al. (2015)). We conclude this paper with a discussion in Section 5.

## 2 Method

In this section, we first provide a new Monte Carlo estimator for nested expectations using post-stratification. Subsequently we introduce a variant of our proposed method using a linear regression.

###### Algorithm 1 (proposed method).

For $N = m^{2K}$ with $m \in \mathbb{N}$, let $\{(X^{(n)}, Y^{(n)})\}_{1 \le n \le N}$ be a set of i.i.d. samples from the joint density $\rho(X,Y)$, and write $Y^{(n)} = (Y^{(n)}_1, \ldots, Y^{(n)}_K)$ for the coordinates of the outer variable. For $k = 1, \ldots, K+1$, $1 \le i \le m^{k-1}$, and $1 \le j \le m^{2K-k+1}$, we write $Y^{(i,j,k)}$ for the $j$-th sample of the $i$-th group at stage $k$, where the single group at stage $k = 1$ is the original sample set, i.e., $Y^{(1,j,1)} = Y^{(j)}$. First we construct $\{Y^{(p,q)}\}_{1 \le p, q \le \sqrt{N}}$ as follows:

1. Sort each sample set

$$\begin{aligned}
& \{Y^{(1,1,k)}, Y^{(1,2,k)}, \ldots, Y^{(1,\,m^{2K-k+1},\,k)}\} && (6) \\
& \{Y^{(2,1,k)}, Y^{(2,2,k)}, \ldots, Y^{(2,\,m^{2K-k+1},\,k)}\} && (7) \\
& \qquad\qquad\qquad \vdots && (8) \\
& \{Y^{(m^{k-1},1,k)}, Y^{(m^{k-1},2,k)}, \ldots, Y^{(m^{k-1},\,m^{2K-k+1},\,k)}\} && (9)
\end{aligned}$$

in ascending order of the $k$-th coordinate $Y_k$, and divide each sorted set into $m$ consecutive subsets of equal size $m^{2K-k}$, which form the groups at stage $k+1$. This process is repeated for $k = 1, \ldots, K$.

2. Define $Y^{(p,q)} \coloneqq Y^{(p,q,K+1)}$ for $p, q = 1, \ldots, \sqrt{N} = m^K$.

For $p, q = 1, \ldots, \sqrt{N}$, let $X^{(p,q)}$ denote the sample of the inner variable corresponding to $Y^{(p,q)}$. Then our proposed estimator is given by

$$\hat{I} \coloneqq \frac{1}{\sqrt{N}} \sum_{p=1}^{\sqrt{N}} f\left( \frac{1}{\sqrt{N}} \sum_{q=1}^{\sqrt{N}} X^{(p,q)} \right), \qquad (10)$$

where the inner sum over $q$ is taken element-wise.

The form of our proposed estimator apparently looks similar to the NMC estimator (5) with $N_p = N_q = \sqrt{N}$. However, as we have mentioned already, the crucial difference is that our proposed estimator is free of inner conditional sampling from the distribution $\rho(X\mid Y)$. We approximate the inner expectation by the average

$$\tilde{X}^{(p)} = \frac{1}{\sqrt{N}} \sum_{q=1}^{\sqrt{N}} X^{(p,q)},$$

based only on the set of i.i.d. samples from the joint distribution $\rho(X,Y)$. Applying post-stratification to $\{Y^{(n)}\}$ is important in that the samples $X^{(p,1)}, \ldots, X^{(p,\sqrt{N})}$ are used as a substitute for samples from the exact conditional distribution. We note that, although each $X^{(p,q)}$ can be regarded as a sample from $\rho(X\mid Y^{(p,q)})$, using only one inner sample is not enough to approximate the exact inner expectation. This is why we divide the set of $N$ samples into $\sqrt{N}$ subsets of $\sqrt{N}$ samples using post-stratification and use each subset to compute $\tilde{X}^{(p)}$.
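As an illustrative sketch (not the paper's code), in the simplest case $K = 1$ Algorithm 1 reduces to sorting the joint samples by $Y$ and splitting them into $\sqrt{N}$ consecutive strata; the toy model and the choice of $f$ below are our own assumptions:

```python
import numpy as np

def post_stratified_estimator(xs, ys, f):
    """Algorithm 1 sketch for K = 1 (scalar Y): sort the N joint samples
    by the outer variable Y, split them into sqrt(N) consecutive strata of
    size sqrt(N), and average f over the within-stratum means of X."""
    n = len(xs)
    root_n = int(round(np.sqrt(n)))
    assert root_n * root_n == n, "N must be a perfect square"
    order = np.argsort(ys)                      # sorting step of Algorithm 1
    strata = xs[order].reshape(root_n, root_n)  # sqrt(N) strata, sqrt(N) each
    return f(strata.mean(axis=1)).mean()        # estimator (10)

# Toy model (our assumption): Y ~ N(0,1), X | Y ~ N(Y,1), f(x) = x^2,
# so the exact value is I = E[(E[X|Y])^2] = E[Y^2] = 1.
rng = np.random.default_rng(2)
n = 256**2
ys = rng.standard_normal(n)
xs = ys + rng.standard_normal(n)
estimate = post_stratified_estimator(xs, ys, lambda x: x**2)
```

For $K \ge 2$ the recursive coordinate-wise sorting of Algorithm 1 replaces the single `argsort` above; only the joint samples are needed, never draws from the conditional distribution.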

Generating i.i.d. samples from the conditional distribution $\rho(X\mid Y)$ is often quite hard in applications Strong et al. (2015); Goda et al. (2020); Hironaka et al. (2020). Therefore, by construction, our proposed estimator has an advantage over many existing methods, including the NMC estimator, in terms of applicability. In the literature, there exist some other methods to estimate a special class of nested expectations which do not rely on sampling from the conditional distribution; see, for instance, Strong et al. (2015); Hong et al. (2017). The regression-based method from Strong et al. (2015) is compared with our proposed estimator in Section 4.

We end this section by introducing a variant of Algorithm 1.

###### Algorithm 2 (proposed method with regression).

For $N = m^{2K}$ with $m \in \mathbb{N}$, let $\{(X^{(n)}, Y^{(n)})\}_{1 \le n \le N}$ be a set of i.i.d. samples from the joint density $\rho(X,Y)$. We construct $\{Y^{(p,q)}\}$ by following Algorithm 1. Then our proposed estimator with a linear regression is given by

$$\hat{I}_{\mathrm{rg}} \coloneqq \frac{1}{\sqrt{N}} \sum_{p=1}^{\sqrt{N}} \frac{1}{\sqrt{N}} \sum_{q=1}^{\sqrt{N}} f\left( \tilde{X}^{(p,q)} \right), \qquad (11)$$

where each coordinate $\tilde{X}^{(p,q)}_j$ of $\tilde{X}^{(p,q)}$ is computed by

$$\tilde{X}^{(p,q)}_j = \hat{c}_0 + \hat{c}_1 Y^{(p,q)}_1 + \cdots + \hat{c}_K Y^{(p,q)}_K \qquad (12)$$

for each $q$ with

$$\hat{c} = (\hat{c}_0, \hat{c}_1, \ldots, \hat{c}_K)^{\top} \qquad (13)$$

being a vector of coefficients which satisfies the normal equations

$$M^{\top} M \hat{c} = M^{\top} v \qquad (14)$$

with

$$M = \begin{pmatrix}
1 & Y^{(p,1)}_1 & \cdots & Y^{(p,1)}_k & \cdots & Y^{(p,1)}_K \\
\vdots & \vdots & & \ddots & & \vdots \\
1 & Y^{(p,q)}_1 & \cdots & Y^{(p,q)}_k & \cdots & Y^{(p,q)}_K \\
\vdots & \vdots & & \ddots & & \vdots \\
1 & Y^{(p,\sqrt{N})}_1 & \cdots & Y^{(p,\sqrt{N})}_k & \cdots & Y^{(p,\sqrt{N})}_K
\end{pmatrix}, \qquad
v = \begin{pmatrix}
X^{(p,1)}_j \\ X^{(p,2)}_j \\ \vdots \\ X^{(p,\sqrt{N})}_j
\end{pmatrix}. \qquad (15)$$

Our motivation behind introducing this variant is that, instead of using the samples $X^{(p,q)}$ directly, we perturb them so as to approximate the inner conditional expectation simply by one (perturbed) sample $\tilde{X}^{(p,q)}$. The resulting estimator is then given in a non-nested form.
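A minimal sketch of Algorithm 2 for $K = 1$ and scalar $X$ (the toy model is our own assumption): within each stratum, the coefficient vector solving the normal equations (14) is obtained here with an ordinary least-squares routine:

```python
import numpy as np

def regression_variant(xs, ys, f):
    """Algorithm 2 sketch for K = 1 and scalar X: within each stratum,
    replace every inner sample X by its fitted value from a least-squares
    regression of X on (1, Y), then average f over all fitted values."""
    n = len(xs)
    root_n = int(round(np.sqrt(n)))
    order = np.argsort(ys)
    x_s = xs[order].reshape(root_n, root_n)
    y_s = ys[order].reshape(root_n, root_n)
    total = 0.0
    for p in range(root_n):
        M = np.column_stack([np.ones(root_n), y_s[p]])  # design matrix (15)
        c, *_ = np.linalg.lstsq(M, x_s[p], rcond=None)  # solves (14)
        fitted = M @ c                                  # perturbed samples (12)
        total += f(fitted).mean()
    return total / root_n                               # estimator (11)

# Toy model (our assumption): Y ~ N(0,1), X | Y ~ N(Y,1), f(x) = x^2, I = 1.
rng = np.random.default_rng(3)
n = 128**2
ys = rng.standard_normal(n)
xs = ys + rng.standard_normal(n)
estimate_rg = regression_variant(xs, ys, lambda x: x**2)
```

The fitted values preserve the within-stratum mean of $X$ while smoothing out part of the conditional noise, which is the intuition behind the non-nested form (11).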

## 3 Theoretical Analysis

In this section we show an upper bound on the mean squared error (MSE) of our proposed method. We show the result only for Algorithm 1; whether a similar bound on the MSE also holds for Algorithm 2 is left open for future research.

###### Theorem 1 (A bound of the MSE).

Let $N$, $m$, $K$, and $\{(X^{(n)}, Y^{(n)})\}_{1 \le n \le N}$ be given as in Algorithm 1. Let $F_Y$ be the cumulative distribution function of $Y$, and let $F_{Y_k}$ be the marginal cumulative distribution function of the $k$-th coordinate $Y_k$ for $k = 1, \ldots, K$. Assume that every $F_{Y_k}$ is continuous, that $Y_k$ and $Y_{k'}$ are mutually independent for all $k \neq k'$, that there exist constants $\alpha, \beta > 0$ such that

$$\left| f(X_1) - f(X_2) \right| \le \alpha \left\| X_1 - X_2 \right\|_2 \qquad (16)$$

and

$$\left\| \mathbb{E}_{\rho(X\mid Y_1)} X - \mathbb{E}_{\rho(X\mid Y_2)} X \right\|_2 \le \beta \left| F_Y(Y_1) - F_Y(Y_2) \right| \qquad (17)$$

hold for any $X_1, X_2$ and any $Y_1, Y_2$, and that

$$\mathbb{E}_{\rho(Y)} \mathbb{V}_{\rho(X\mid Y)} X < \infty. \qquad (18)$$

Then we have

$$\mathbb{E}\left( I - \hat{I} \right)^2 \le \frac{2\alpha^2\beta^2 K^2}{N^{1/K}} + 2\alpha^2 \left( \frac{\beta^2 K}{3} + \mathbb{E}_{\rho(Y)} \mathbb{V}_{\rho(X\mid Y)} X \right) \frac{1}{N^{1/2}}. \qquad (19)$$

Before proving Theorem 1, we prove Lemma 1.

###### Lemma 1.

Let $N$, $m$, $K$, and $\{(X^{(n)}, Y^{(n)})\}_{1 \le n \le N}$ be given as in Algorithm 1, and let $F_Y$ denote the cumulative distribution function of $Y$. Under the same assumptions as in Theorem 1, we have

$$\frac{1}{\sqrt{N}} \sum_{p=1}^{\sqrt{N}} \frac{1}{N} \sum_{q,q'=1}^{\sqrt{N}} \left( \mathbb{E}\, F_Y\!\left(Y^{(p,q)}\right) - \mathbb{E}\, F_Y\!\left(Y^{(p,q')}\right) \right)^2 \le \frac{K^2}{N^{1/K}}, \qquad (20)$$

and

$$\frac{1}{\sqrt{N}} \sum_{p=1}^{\sqrt{N}} \frac{1}{\sqrt{N}} \sum_{q=1}^{\sqrt{N}} \mathbb{V}\, F_Y\!\left(Y^{(p,q)}\right) \le \frac{K}{6 N^{1/2}}. \qquad (21)$$
###### Proof.

Let us recall the following two well-known facts as preparation. The first fact concerns order statistics. For a positive integer $n$, let $U_1, \ldots, U_n$ be i.i.d. random variables following the uniform distribution on $[0,1]$, and let $U_{(1,n)} \le \cdots \le U_{(n,n)}$ be the corresponding order statistics. Then, for any $1 \le r \le n$, the $r$-th order statistic $U_{(r,n)}$ follows the beta distribution $\mathrm{Beta}(r, n-r+1)$, which ensures

$$\mathbb{V}\, U_{(r,n)} = \frac{r(n-r+1)}{(n+1)^2(n+2)}. \qquad (22)$$

Moreover, for any $1 \le r \le s \le n$, the difference $U_{(s,n)} - U_{(r,n)}$ follows the beta distribution $\mathrm{Beta}(s-r, n-s+r+1)$, and so it holds that

$$\mathbb{E}\left( U_{(s,n)} - U_{(r,n)} \right)^2 = \frac{(s-r)(s-r+1)}{(n+1)(n+2)}. \qquad (23)$$
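Both facts can be spot-checked by simulation; the following sketch compares empirical moments of uniform order statistics against (22) and (23):

```python
import numpy as np

# Monte Carlo spot-check of the order-statistics facts (22) and (23):
# for U_1,...,U_n i.i.d. uniform on [0,1], U_(r,n) ~ Beta(r, n-r+1), so
#   V[U_(r,n)] = r(n-r+1) / ((n+1)^2 (n+2)),
#   E[(U_(s,n) - U_(r,n))^2] = (s-r)(s-r+1) / ((n+1)(n+2)).
rng = np.random.default_rng(4)
n, r, s, reps = 10, 3, 7, 200_000
u = np.sort(rng.random((reps, n)), axis=1)   # each row: ordered uniforms
var_r = u[:, r - 1].var()                    # empirical V[U_(r,n)]
msq_diff = ((u[:, s - 1] - u[:, r - 1]) ** 2).mean()

var_exact = r * (n - r + 1) / ((n + 1) ** 2 * (n + 2))   # (22)
msq_exact = (s - r) * (s - r + 1) / ((n + 1) * (n + 2))  # (23)
```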

The second fact is an elementary inequality. For a positive integer $n$ and any $a_1, \ldots, a_n, b_1, \ldots, b_n \in [0,1]$, it holds that

$$\left| \prod_{i=1}^{n} a_i - \prod_{i=1}^{n} b_i \right| \le \sum_{i=1}^{n} |a_i - b_i|. \qquad (24)$$

This inequality can be proven by induction on $n$. As the case $n = 1$ trivially holds, let us assume that the inequality holds for $n = m$ and prove the case $n = m+1$. It follows from the triangle inequality, the bounds $0 \le a_i, b_i \le 1$, and the induction hypothesis that

$$\begin{aligned}
\left| \prod_{i=1}^{m+1} a_i - \prod_{i=1}^{m+1} b_i \right| &= \left| \prod_{i=1}^{m+1} a_i - b_{m+1} \prod_{i=1}^{m} a_i + b_{m+1} \prod_{i=1}^{m} a_i - \prod_{i=1}^{m+1} b_i \right| && (25) \\
&= \left| \left( a_{m+1} - b_{m+1} \right) \prod_{i=1}^{m} a_i + b_{m+1} \left( \prod_{i=1}^{m} a_i - \prod_{i=1}^{m} b_i \right) \right| && (26) \\
&\le \left| a_{m+1} - b_{m+1} \right| \prod_{i=1}^{m} a_i + b_{m+1} \left| \prod_{i=1}^{m} a_i - \prod_{i=1}^{m} b_i \right| && (27) \\
&\le \left| a_{m+1} - b_{m+1} \right| + \sum_{i=1}^{m} |a_i - b_i| = \sum_{i=1}^{m+1} |a_i - b_i|, && (28)
\end{aligned}$$

which completes the proof of (24).
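The inequality (24) is also easy to sanity-check numerically on random inputs in $[0,1]$:

```python
import numpy as np

# Numerical sanity check of inequality (24):
# |prod(a) - prod(b)| <= sum_i |a_i - b_i| for a_i, b_i in [0,1].
rng = np.random.default_rng(5)
ok = True
for _ in range(1000):
    a, b = rng.random(6), rng.random(6)
    ok = ok and abs(np.prod(a) - np.prod(b)) <= np.abs(a - b).sum() + 1e-12
```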

Now let us prove (20). To do so, let us consider a single coordinate $k \in \{1, \ldots, K\}$ first. In what follows, we denote by $F_{Y_k}$ the marginal cumulative distribution function of $Y_k$ as stated in Theorem 1, we write $Y^{(p,q)}$ for the $q$-th sample of the outer variable in the $p$-th subset after the sorting steps of Algorithm 1 have been applied, and we use the same convention for $X^{(p,q)}$. For each $p_1 \in \{1, \ldots, m^{k-1}\}$, $p_2 \in \{1, \ldots, m\}$, and $p_3 \in \{1, \ldots, m^{K-k}\}$, we can see that the sample set

$$\left\{ Y_k^{\left((p_1-1)m^{K-k+1} + (p_2-1)m^{K-k} + p_3,\; 1\right)}, \ldots, Y_k^{\left((p_1-1)m^{K-k+1} + (p_2-1)m^{K-k} + p_3,\; \sqrt{N}\right)} \right\}$$

is a set of points drawn from the parent sample set

$$\left\{ Y_k^{\left((p_1-1)m + p_2,\; 1,\; k+1\right)}, \ldots, Y_k^{\left((p_1-1)m + p_2,\; m^{2K-k},\; k+1\right)} \right\}$$

consisting of $m^{2K-k}$ points, which coincides exactly with the set

$$\left\{ Y_k^{\left(p_1,\; (p_2-1)m^{2K-k} + 1,\; k\right)}, \ldots, Y_k^{\left(p_1,\; (p_2-1)m^{2K-k} + m^{2K-k},\; k\right)} \right\}.$$

Here, since each coordinate of the outer variable is assumed independent of all the other coordinates, we can see that, for any $p_1$, the sorted random variables $F_{Y_k}\bigl(Y_k^{(p_1,1,k)}\bigr), \ldots, F_{Y_k}\bigl(Y_k^{(p_1,\,m^{2K-k+1},\,k)}\bigr)$ correspond to the order statistics $U_{(1,\,m^{2K-k+1})}, \ldots, U_{(m^{2K-k+1},\,m^{2K-k+1})}$, respectively.

Therefore, for each $k \in \{1, \ldots, K\}$, by using the equality (23), we have

$$\begin{aligned}
& \frac{1}{\sqrt{N}} \sum_{p=1}^{\sqrt{N}} \frac{1}{N} \sum_{q,q'=1}^{\sqrt{N}} \mathbb{E}\left( F_{Y_k}\!\left(Y_k^{(p,q)}\right) - F_{Y_k}\!\left(Y_k^{(p,q')}\right) \right)^2 && (29) \\
&= \frac{1}{m^{k-1}} \sum_{p_1=1}^{m^{k-1}} \frac{1}{m} \sum_{p_2=1}^{m} \frac{1}{m^{K-k}} \sum_{p_3=1}^{m^{K-k}} \frac{1}{N} \sum_{q,q'=1}^{\sqrt{N}} \mathbb{E}\left( F_{Y_k}\!\left(Y_k^{\left((p_1-1)m^{K-k+1}+(p_2-1)m^{K-k}+p_3,\,q\right)}\right) - F_{Y_k}\!\left(Y_k^{\left((p_1-1)m^{K-k+1}+(p_2-1)m^{K-k}+p_3,\,q'\right)}\right) \right)^2 && (30\text{--}31) \\
&= \frac{1}{m^{k-1}} \sum_{p_1=1}^{m^{k-1}} \frac{1}{m} \sum_{p_2=1}^{m} \frac{1}{m^{2(2K-k)}} \sum_{q,q'=1}^{m^{2K-k}} \mathbb{E}\left( F_{Y_k}\!\left(Y_k^{\left(p_1,(p_2-1)m^{2K-k}+q,\,k\right)}\right) - F_{Y_k}\!\left(Y_k^{\left(p_1,(p_2-1)m^{2K-k}+q',\,k\right)}\right) \right)^2 && (32) \\
&= \frac{1}{m^{k-1}} \sum_{p_1=1}^{m^{k-1}} \frac{1}{m} \sum_{p_2=1}^{m} \frac{1}{m^{2(2K-k)}} \sum_{q,q'=1}^{m^{2K-k}} \mathbb{E}\left( U_{\left((p_2-1)m^{2K-k}+q,\; m^{2K-k+1}\right)} - U_{\left((p_2-1)m^{2K-k}+q',\; m^{2K-k+1}\right)} \right)^2 && (33) \\
&= \frac{1}{m^{k-1}} \sum_{p_1=1}^{m^{k-1}} \frac{1}{m} \sum_{p_2=1}^{m} \frac{1}{m^{2(2K-k)}} \sum_{q,q'=1}^{m^{2K-k}} \frac{|q-q'|\left(|q-q'|+1\right)}{\left(m^{2K-k+1}+1\right)\left(m^{2K-k+1}+2\right)} && (34) \\
&= \frac{1}{m^{2(2K-k)}\left(m^{2K-k+1}+1\right)\left(m^{2K-k+1}+2\right)} \sum_{q,q'=1}^{m^{2K-k}} |q-q'|\left(|q-q'|+1\right) && (35) \\
&= \frac{\left(m^{2K-k}-1\right) m^{2K-k} \left(m^{2K-k}+1\right) \left(m^{2K-k}+2\right)}{6\, m^{2(2K-k)}\left(m^{2K-k+1}+1\right)\left(m^{2K-k+1}+2\right)} && (36) \\
&\le \frac{1}{m^2} = \frac{1}{N^{1/K}}. && (37)
\end{aligned}$$

As we have assumed that the coordinates $Y_1, \ldots, Y_K$ are mutually independent, applying the inequality (24) and Jensen's inequality leads to

$$\begin{aligned}
& \frac{1}{\sqrt{N}} \sum_{p=1}^{\sqrt{N}} \frac{1}{N} \sum_{q,q'=1}^{\sqrt{N}} \left( \mathbb{E}\, F_Y\!\left(Y^{(p,q)}\right) - \mathbb{E}\, F_Y\!\left(Y^{(p,q')}\right) \right)^2 && (38) \\
&= \frac{1}{\sqrt{N}} \sum_{p=1}^{\sqrt{N}} \frac{1}{N} \sum_{q,q'=1}^{\sqrt{N}} \left( \prod_{k=1}^{K} \mathbb{E}\, F_{Y_k}\!\left(Y_k^{(p,q)}\right) - \prod_{k=1}^{K} \mathbb{E}\, F_{Y_k}\!\left(Y_k^{(p,q')}\right) \right)^2 && (39) \\
&\le \frac{1}{\sqrt{N}} \sum_{p=1}^{\sqrt{N}} \frac{1}{N} \sum_{q,q'=1}^{\sqrt{N}} \left( \sum_{k=1}^{K} \left| \mathbb{E}\, F_{Y_k}\!\left(Y_k^{(p,q)}\right) - \mathbb{E}\, F_{Y_k}\!\left(Y_k^{(p,q')}\right) \right| \right)^2 && (40) \\
&\le K \sum_{k=1}^{K} \frac{1}{\sqrt{N}} \sum_{p=1}^{\sqrt{N}} \frac{1}{N} \sum_{q,q'=1}^{\sqrt{N}} \left| \mathbb{E}\, F_{Y_k}\!\left(Y_k^{(p,q)}\right) - \mathbb{E}\, F_{Y_k}\!\left(Y_k^{(p,q')}\right) \right|^2 && (41) \\
&\le \frac{K^2}{N^{1/K}}, && (42)
\end{aligned}$$

which completes the proof of (20).

Let us move on to proving (21). By a reasoning similar to that used to prove (20), and using the equality (22), we have

$$\begin{aligned}
& \frac{1}{\sqrt{N}} \sum_{p=1}^{\sqrt{N}} \frac{1}{\sqrt{N}} \sum_{q=1}^{\sqrt{N}} \mathbb{V}\, F_{Y_k}\!\left(Y_k^{(p,q)}\right) && (43) \\
&= \frac{1}{m^{k-1}} \sum_{p_1=1}^{m^{k-1}} \frac{1}{m} \sum_{p_2=1}^{m} \frac{1}{m^{2K-k}} \sum_{q=1}^{m^{2K-k}} \mathbb{V}\, F_{Y_k}\!\left(Y_k^{\left(p_1,(p_2-1)m^{2K-k}+q,\,k\right)}\right) && (44) \\
&= \frac{1}{m^{k-1}} \sum_{p_1=1}^{m^{k-1}} \frac{1}{m} \sum_{p_2=1}^{m} \frac{1}{m^{2K-k}} \sum_{q=1}^{m^{2K-k}} \mathbb{V}\, U_{\left((p_2-1)m^{2K-k}+q,\; m^{2K-k+1}\right)} && (45) \\
&= \frac{1}{m^{2K-k+1}} \sum_{q=1}^{m^{2K-k+1}} \mathbb{V}\, U_{\left(q,\, m^{2K-k+1}\right)} && (46) \\
&= \frac{1}{m^{2K-k+1}} \sum_{q=1}^{m^{2K-k+1}} \frac{q\left(m^{2K-k+1}-q+1\right)}{\left(m^{2K-k+1}+1\right)^2\left(m^{2K-k+1}+2\right)} && (47) \\
&= \frac{1}{6\left(m^{2K-k+1}+1\right)}. && (48)
\end{aligned}$$

Therefore, under the assumption that the coordinates $Y_1, \ldots, Y_K$ are mutually independent, by using the inequality (24) and Jensen's inequality, it holds that

$$\begin{aligned}
& \frac{1}{\sqrt{N}} \sum_{p=1}^{\sqrt{N}} \frac{1}{\sqrt{N}} \sum_{q=1}^{\sqrt{N}} \mathbb{V}\, F_Y\!\left(Y^{(p,q)}\right) && (49) \\
&= \frac{1}{\sqrt{N}} \sum_{p=1}^{\sqrt{N}} \frac{1}{\sqrt{N}} \sum_{q=1}^{\sqrt{N}} \mathbb{E}\left( \prod_{k=1}^{K} F_{Y_k}\!\left(Y_k^{(p,q)}\right) - \prod_{k=1}^{K} \mathbb{E}\, F_{Y_k}\!\left(Y_k^{(p,q)}\right) \right)^2 && (50) \\
&\le \frac{1}{\sqrt{N}} \sum_{p=1}^{\sqrt{N}} \frac{1}{\sqrt{N}} \sum_{q=1}^{\sqrt{N}} \mathbb{E}\left( \sum_{k=1}^{K} \left| F_{Y_k}\!\left(Y_k^{(p,q)}\right) - \mathbb{E}\, F_{Y_k}\!\left(Y_k^{(p,q)}\right) \right| \right)^2 && (51) \\
&\le K \sum_{k=1}^{K} \frac{1}{\sqrt{N}} \sum_{p=1}^{\sqrt{N}} \frac{1}{\sqrt{N}} \sum_{q=1}^{\sqrt{N}} \mathbb{E}\left| F_{Y_k}\!\left(Y_k^{(p,q)}\right) - \mathbb{E}\, F_{Y_k}\!\left(Y_k^{(p,q)}\right) \right|^2 && (52) \\
&= K \sum_{k=1}^{K} \frac{1}{\sqrt{N}} \sum_{p=1}^{\sqrt{N}} \frac{1}{\sqrt{N}} \sum_{q=1}^{\sqrt{N}} \mathbb{V}\, F_{Y_k}\!\left(Y_k^{(p,q)}\right) && (53) \\
&\le K \sum_{k=1}^{K} \frac{1}{6\left(m^{2K-k+1}+1\right)} \le \frac{K\left(m^{K}-1\right)}{6\, m^{2K}\left(m-1\right)} \le \frac{K}{6\, m^{K}} = \frac{K}{6\, N^{1/2}}, && (54)
\end{aligned}$$

which completes the proof. ∎

Now we are ready to prove Theorem 1.

###### Proof of Theorem 1.

Throughout this proof, we simply denote by $\mathbb{E}_{Z}$ the expectation taken with respect to the underlying probability measure of the corresponding random variable $Z$. By using Jensen's inequality, we obtain

$$\begin{aligned}
\mathbb{E}\left( I - \hat{I} \right)^2 &= \mathbb{E}\left( \mathbb{E}_{Y}\, f\left( \mathbb{E}_{X\mid Y} X \right) - \frac{1}{\sqrt{N}} \sum_{p=1}^{\sqrt{N}} f\left( \frac{1}{\sqrt{N}} \sum_{q=1}^{\sqrt{N}} \mathbb{E}_{X\mid Y^{(p,q)}} X \right) \right. \\
&\qquad \left. + \frac{1}{\sqrt{N}} \sum_{p=1}^{\sqrt{N}} f\left( \frac{1}{\sqrt{N}} \sum_{q=1}^{\sqrt{N}} \mathbb{E}_{X\mid Y^{(p,q)}} X \right) - \frac{1}{\sqrt{N}} \sum_{p=1}^{\sqrt{N}} f\left( \frac{1}{\sqrt{N}} \sum_{q=1}^{\sqrt{N}} X^{(p,q)} \right) \right)^2 \\
&\le 2\, \mathbb{E}_{\{Y^{(p,q)}\}}\left( \mathbb{E}_{Y}\, f\left( \mathbb{E}_{X\mid Y} X \right) - \frac{1}{\sqrt{N}} \sum_{p=1}^{\sqrt{N}} f\left( \frac{1}{\sqrt{N}} \sum_{q=1}^{\sqrt{N}} \mathbb{E}_{X\mid Y^{(p,q)}} X \right) \right)^2 && (55) \\
&\quad + 2\, \mathbb{E}\left( \frac{1}{\sqrt{N}} \sum_{p=1}^{\sqrt{N}} f\left( \frac{1}{\sqrt{N}} \sum_{q=1}^{\sqrt{N}} \mathbb{E}_{X\mid Y^{(p,q)}} X \right) - \frac{1}{\sqrt{N}} \sum_{p=1}^{\sqrt{N}} f\left( \frac{1}{\sqrt{N}} \sum_{q=1}^{\sqrt{N}} X^{(p,q)} \right) \right)^2. && (56)
\end{aligned}$$

In what follows, we show an upper bound on each term on the right-most side of (56). Let us consider the first term. Denoting an i.i.d. copy of the post-stratified outer random variables by $\{Y'^{(p,q)}\}$, using Jensen's inequality together with the inequalities (16) and (17), and applying the results shown in Lemma 1, we have

$$\begin{aligned}
& \mathbb{E}_{\{Y^{(p,q)}\}}\left( \mathbb{E}_{Y}\, f\left( \mathbb{E}_{X\mid Y} X \right) - \frac{1}{\sqrt{N}} \sum_{p=1}^{\sqrt{N}} f\left( \frac{1}{\sqrt{N}} \sum_{q=1}^{\sqrt{N}} \mathbb{E}_{X\mid Y^{(p,q)}} X \right) \right)^2 && (57) \\
&= \mathbb{E}_{\{Y^{(p,q)}\}}\left( \mathbb{E}_{\{Y'^{(p,q)}\}} \frac{1}{\sqrt{N}} \sum_{p=1}^{\sqrt{N}} \frac{1}{\sqrt{N}} \sum_{q=1}^{\sqrt{N}} f\left( \mathbb{E}_{X\mid Y'^{(p,q)}} X \right) - \frac{1}{\sqrt{N}} \sum_{p=1}^{\sqrt{N}} f\left( \frac{1}{\sqrt{N}} \sum_{q=1}^{\sqrt{N}} \mathbb{E}_{X\mid Y^{(p,q)}} X \right) \right)^2 && (58) \\
&\le \mathbb{E}_{\{Y^{(p,q)}\},\{Y'^{(p,q)}\}} \frac{1}{\sqrt{N}} \sum_{p=1}^{\sqrt{N}} \frac{1}{\sqrt{N}} \sum_{q=1}^{\sqrt{N}} \left( f\left( \mathbb{E}_{X\mid Y'^{(p,q)}} X \right) - f\left( \frac{1}{\sqrt{N}} \sum_{q'=1}^{\sqrt{N}} \mathbb{E}_{X\mid Y^{(p,q')}} X \right) \right)^2 && (59) \\
&\le \alpha^2\, \mathbb{E}_{\{Y^{(p,q)}\},\{Y'^{(p,q)}\}} \frac{1}{\sqrt{N}} \sum_{p=1}^{\sqrt{N}} \frac{1}{N} \sum_{q,q'=1}^{\sqrt{N}} \left\| \mathbb{E}_{X\mid Y'^{(p,q)}} X - \mathbb{E}_{X\mid Y^{(p,q')}} X \right\|_2^2 && (60\text{--}61)
\end{aligned}$$

 ≤α2β2E{Y(p,q)},{Y′