1 Introduction
Motivated by several different applications, we study the estimation of nested expectations, which are defined as follows. Let $X$ and $Y$ be possibly dependent random variables following the joint probability density $p(x,y)$. For an outer function $f$ and a (possibly vector-valued) inner function $g$, the nested expectation is defined by

(1) $I = \mathbb{E}_{Y}\left[ f\left( \mathbb{E}_{X\mid Y}\left[ g(X,Y) \right] \right) \right].$

Here we emphasize that the outer expectation is taken with respect to the marginal distribution of $Y$, while the inner expectation is with respect to the conditional distribution of $X$ given $Y$. Throughout this paper, we simply write $p(x)$ (resp. $p(y)$) to denote the marginal probability density of $X$ (resp. $Y$), and also write $p(x\mid y)$ (resp. $p(y\mid x)$) to denote the conditional probability density of $X$ given $Y$ (resp. of $Y$ given $X$).
The motivating examples are as follows:
Example 1 (expected information gain).
The concept of Bayesian experimental design aims to construct an optimal experimental design under which making the observation $Y$ maximizes the expected information gain (EIG) on the input random variable $X$; see Lindley (1956); Chaloner and Verdinelli (1995). Here the EIG denotes the expected amount of reduction in the Shannon information entropy and is given by

(2) $\mathrm{EIG} = \mathbb{E}_{X,Y}\left[ \log \frac{p(X\mid Y)}{p(X)} \right]$

(3) $\phantom{\mathrm{EIG}} = \mathbb{E}_{X,Y}\left[ \log p(Y\mid X) \right] - \mathbb{E}_{Y}\left[ \log p(Y) \right],$

where the equality follows from Bayes' theorem. A nested expectation appears in the second term on the right-hand side, since the marginal density is itself an expectation, $p(Y) = \mathbb{E}_{X}\left[ p(Y\mid X) \right]$.
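To make this concrete, the following self-contained sketch estimates the EIG via the decomposition (3) for a toy conjugate Gaussian model of our own choosing (not an example from this paper): $X \sim N(0, \sigma_0^2)$ and $Y \mid X \sim N(X, \sigma^2)$, for which the EIG equals the mutual information $\tfrac{1}{2}\log(1 + \sigma_0^2/\sigma^2)$ in closed form. The nested second term is approximated by an inner Monte Carlo average over fresh prior samples.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma0, sigma = 1.0, 0.5   # prior and observation-noise standard deviations (toy choices)
N, M = 2000, 2000          # outer / inner sample sizes

def log_lik(y, x):
    # log p(y | x) for the toy model Y = X + noise, noise ~ N(0, sigma^2)
    return -0.5 * np.log(2 * np.pi * sigma**2) - (y - x) ** 2 / (2 * sigma**2)

x = rng.normal(0.0, sigma0, size=N)        # X ~ prior p(x)
y = x + rng.normal(0.0, sigma, size=N)     # Y | X ~ p(y | x)

# First term of the decomposition: E_{X,Y}[log p(Y | X)], a plain Monte Carlo average
term1 = np.mean(log_lik(y, x))

# Second (nested) term: E_Y[log p(Y)], with p(y) approximated by (1/M) sum_m p(y | x_m)
x_inner = rng.normal(0.0, sigma0, size=M)  # fresh samples from the prior
log_py = np.log(np.mean(np.exp(log_lik(y[:, None], x_inner[None, :])), axis=1))
term2 = np.mean(log_py)

eig_est = term1 - term2
eig_true = 0.5 * np.log(1.0 + sigma0**2 / sigma**2)  # closed-form reference value
print(eig_est, eig_true)
```

With the inner sample size $M$ fixed, the second term carries an $O(1/M)$ bias from taking the logarithm of a Monte Carlo average, which is the basic difficulty with naive estimation of nested expectations.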
Example 2 (expected value of sample information).
Let $D$ be a finite set of possible medical treatments. As the outcome and cost of each treatment $d \in D$ are uncertain, we model its net benefit as a function of the input random variable $X$, denoted by $\mathrm{NB}_d(X)$, where $X$ includes, for instance, the probability of side effects and the cost of treatment.
In the context of medical decision making, we want to know whether it is worth conducting a clinical trial or medical research to reduce the uncertainty of $X$; see Welton et al. (2012). Denoting the observation from a clinical trial or medical research by $Y$, the expected value of sample information (EVSI) measures the average gain in the net benefit from making the observation $Y$ and is given by

(4) $\mathrm{EVSI} = \mathbb{E}_{Y}\left[ \max_{d \in D} \mathbb{E}_{X\mid Y}\left[ \mathrm{NB}_d(X) \right] \right] - \max_{d \in D} \mathbb{E}_{X}\left[ \mathrm{NB}_d(X) \right],$

where the first term represents the average net benefit when choosing the optimal treatment depending on the observation $Y$, and the second term represents the net benefit of choosing the optimal treatment without making the observation. Here the first term is exactly a nested expectation as given in (1).
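As a self-contained illustration, the following sketch evaluates the EVSI for a hypothetical two-treatment model of our own choosing, with conjugate Gaussian uncertainty so that the inner conditional expectation is available in closed form; only the outer expectation is sampled, which isolates the structure of the definition, and the result is compared against an analytic reference value.

```python
import math
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical two-treatment problem: net benefits NB_0 = 0 and NB_1 = X,
# with X ~ N(mu0, s0^2) and study data Y | X ~ N(X, s^2).
mu0, s0, s = 0.2, 1.0, 0.8
N = 4000

x = rng.normal(mu0, s0, size=N)
y = x + rng.normal(0.0, s, size=N)

# Conjugacy gives the inner expectation E[X | Y] in closed form, so only the
# outer expectation over Y is estimated by Monte Carlo here.
post_mean = (s**2 * mu0 + s0**2 * y) / (s0**2 + s**2)
evsi_est = np.mean(np.maximum(0.0, post_mean)) - max(0.0, mu0)

# Reference value: post_mean ~ N(mu0, tau^2) with tau = s0^2 / sqrt(s0^2 + s^2),
# and E[max(0, Z)] = mu*Phi(mu/tau) + tau*phi(mu/tau) for Z ~ N(mu, tau^2).
tau = s0**2 / math.sqrt(s0**2 + s**2)
Phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
phi = lambda z: math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)
evsi_true = mu0 * Phi(mu0 / tau) + tau * phi(mu0 / tau) - max(0.0, mu0)
print(evsi_est, evsi_true)
```

In realistic models the inner conditional expectation has no closed form, and it is precisely its estimation that makes the EVSI a nontrivial nested expectation.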
The nested Monte Carlo (NMC) method is probably the most straightforward approach to estimating nested expectations. The idea is quite simple: approximate the inner and outer expectations by standard Monte Carlo averages, respectively. To be more precise, for positive integers $N$ and $M$, the NMC estimator is given by

(5) $\widehat{I}_{\mathrm{NMC}} = \frac{1}{N} \sum_{n=1}^{N} f\left( \frac{1}{M} \sum_{m=1}^{M} g\left( X^{(n,m)}, Y^{(n)} \right) \right),$

where $Y^{(1)},\ldots,Y^{(N)}$ denote the i.i.d. samples drawn from $p(y)$, $X^{(n,1)},\ldots,X^{(n,M)}$ denote the i.i.d. samples drawn from $p(x\mid Y^{(n)})$ for each $n$, and the inner sum over $m$ is taken elementwise. However, it has been known that a large computational cost is necessary for the NMC method to estimate nested expectations with high accuracy Rainforth et al. (2018). Moreover, the NMC method has a disadvantage in terms of applicability, since it requires generating the i.i.d. samples from $p(x\mid y)$, which is often quite hard in applications Strong et al. (2015); Goda et al. (2020); Hironaka et al. (2020).
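A minimal sketch of the NMC estimator, for a toy Gaussian model of our own choosing in which the conditional distribution $p(x\mid y)$ happens to be tractable and the true nested expectation equals $0.5$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy model: X ~ N(0,1), Y | X ~ N(X,1), so Y ~ N(0,2) and X | Y=y ~ N(y/2, 1/2).
# Nested expectation with f(z) = z^2 and g(x, y) = x:
#   I = E_Y[(E[X|Y])^2] = E[(Y/2)^2] = Var(Y)/4 = 0.5.
f = lambda z: z ** 2
g = lambda x, y: x

N, M = 4000, 64                                       # outer / inner sample sizes
y = rng.normal(0.0, np.sqrt(2.0), size=N)             # Y ~ marginal p(y)
est = 0.0
for yi in y:
    x_cond = rng.normal(yi / 2.0, np.sqrt(0.5), size=M)   # X | Y = yi, the hard step
    est += f(np.mean(g(x_cond, yi)))
est /= N
print(est)   # close to 0.5, up to O(1/M) bias and O(1/sqrt(N)) noise
```

Each outer sample costs $M$ inner conditional samples, so the total cost is $NM$, and the estimator carries an $O(1/M)$ bias on top of the usual statistical error; this is the cost issue referred to above.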
A typical situation in estimating nested expectations is, instead, that we can generate i.i.d. samples from $p(x)$ and $p(y\mid x)$, or equivalently, those from the joint density $p(x,y)$. One way to tackle this issue is to use a Markov chain sampler targeting $p(x\mid Y^{(n)})$ directly for each outer sample $Y^{(n)}$ in (5). Although the resulting estimator might be consistent, it is quite hard to obtain a non-asymptotic upper bound on the mean squared error and to choose an optimal allocation for $N$ and $M$ with the total cost fixed. Another way is to rewrite (1) into

$I = \mathbb{E}_{Y}\left[ f\left( \frac{\mathbb{E}_{X'}\left[ g(X',Y)\, p(Y\mid X') \right]}{\mathbb{E}_{X'}\left[ p(Y\mid X') \right]} \right) \right],$

where $X'$ denotes an independent copy of $X$ following the marginal density $p(x)$, by using Bayes' theorem. The problem is then that, as a function of $x$, the likelihood $p(Y\mid x)$ is typically concentrated around a small region of the support of $p(x)$, so that we need some sophisticated techniques like importance sampling to reliably estimate the ratio of the two expectations inside.
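To illustrate this ratio rewriting with a toy Gaussian model of our own choosing, the sketch below estimates the inner conditional expectation $\mathbb{E}[X \mid Y = y] = y/2$ by self-normalized importance sampling from the prior. In this benign example the likelihood overlaps the prior well; when it is sharply concentrated, the weights degenerate and the estimate becomes unreliable.

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy model: X ~ N(0,1), Y | X ~ N(X,1).  For fixed y, Bayes' theorem gives
#   E[X | Y=y] = E[X p(y|X)] / E[p(y|X)]   with X ~ p(x),
# which we estimate by self-normalized importance sampling from the prior.
y_fixed = 1.5
M = 100000
x = rng.normal(size=M)                       # prior samples
w = np.exp(-0.5 * (y_fixed - x) ** 2)        # unnormalized likelihood p(y_fixed | x)
est = np.sum(x * w) / np.sum(w)              # g(x, y) = x
print(est, y_fixed / 2)
```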
There are some recent works aiming to overcome these problems of the NMC estimator. We refer to Goda et al. (2020); Beck et al. (2018) for an efficient estimation of the EIG, and to Strong et al. (2015); Hironaka et al. (2020); Menzies (2016); Jalal and Alarid-Escudero (2018); Heath et al. (2018, 2020) for an efficient estimation of the EVSI. Estimating nested expectations is also an important task in portfolio risk measurement Duffie (2010); Gordy and Juneja (2010), and the relevant literature can be found in Broadie et al. (2015); Hong et al. (2017); Giles and Haji-Ali (2019).
In this paper, we propose a new Monte Carlo estimator for nested expectations based on post-stratification, which does not require sampling from the conditional distribution $p(x\mid y)$. In fact, our proposed estimator only requires a dataset on the pair $(X,Y)$ drawn from the joint distribution $p(x,y)$, and avoids any need for a Markov chain sampler or importance sampling. Moreover, our proposed estimator is experimentally more efficient than several existing methods for numerical examples with small $d_Y$ (the dimension of $Y$). The rest of this paper is organized as follows. In Section 2 we introduce our proposed estimator, and then give a theoretical analysis to derive an upper bound on the mean squared error of our proposed estimator in Section 3. Numerical experiments in Section 4 compare our proposed method with several existing methods (the NMC method, the multilevel Monte Carlo method Goda et al. (2020); Hironaka et al. (2020), and a regression-based method Strong et al. (2015)). We conclude this paper with a discussion in Section 5.
2 Method
In this section, we first provide a new Monte Carlo estimator for nested expectations using post-stratification. Subsequently, we introduce a variant of our proposed method using linear regression.
Algorithm 1 (proposed method).
For positive integers $N$ and $M$ such that $N/M$ is an integer, let $\{(X^{(n)}, Y^{(n)})\}_{n=1}^{N}$ be a set of i.i.d. samples from the joint density $p(x,y)$. For $n = 1,\ldots,N$ and $j = 1,\ldots,d_Y$, we write $Y^{(n)} = (Y^{(n)}_1,\ldots,Y^{(n)}_{d_Y})$. First we construct the strata as follows:

Sort the whole sample set in the ascending order of the first coordinate of the outer variable and partition it into consecutive subsets of equal size; within each subset, sort the samples in the ascending order of the next coordinate and partition again. This process is repeated for each coordinate $j = 1,\ldots,d_Y$, so that we end up with $N/M$ subsets (strata), each consisting of $M$ samples.

Define $I_1,\ldots,I_{N/M}$ to be the index sets of the resulting strata.
For $i = 1,\ldots,N/M$ and $n \in I_i$, let $X^{(n)}$ denote the sample of the inner variable corresponding to $Y^{(n)}$. Then our proposed estimator is given by

(10) $\widehat{I} = \frac{M}{N} \sum_{i=1}^{N/M} f\left( \frac{1}{M} \sum_{n \in I_i} g\left( X^{(n)}, Y^{(n)} \right) \right),$

where the inner sum over $n$ is taken elementwise.
The form of our proposed estimator (10) apparently looks similar to that of the NMC estimator (5). However, as we have mentioned already, the crucial difference is that our proposed estimator is free of the inner conditional sampling from the distribution $p(x\mid y)$. We approximate the inner expectation by the average of $g$ over each stratum,
based only on the set of i.i.d. samples from the joint distribution $p(x,y)$; applying post-stratification to the outer variable is important in that the samples within a stratum are used as a substitute for samples from the exact conditional distribution $p(x\mid y)$. We note that, although each $X^{(n)}$ can be regarded as a sample from $p(x\mid Y^{(n)})$, using only one inner sample is not enough to approximate the exact inner expectation. This is why we divide the set of $N$ samples into $N/M$ subsets of $M$ samples each using post-stratification, and use each subset to compute the inner average.
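The following is a minimal sketch of this construction under our own illustrative conventions (a one-dimensional outer variable, a single sort of the joint sample by $Y$, equal-size strata, and each pair's own $Y$ used inside $g$), for a toy Gaussian model with $X \sim N(0,1)$ and $Y \mid X \sim N(X,1)$, where the nested expectation with $f(z) = z^2$ and $g(x,y) = x$ equals $0.5$. Note that no sample from $p(x\mid y)$ is ever drawn.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy model: X ~ N(0,1), Y | X ~ N(X,1); the true value is
#   I = E_Y[(E[X|Y])^2] = 0.5   for f(z) = z^2 and g(x, y) = x.
f = lambda z: z ** 2
g = lambda x, y: x

N, M = 20000, 50                          # total sample size, stratum size (N % M == 0)
x = rng.normal(size=N)
y = x + rng.normal(size=N)                # (X, Y) ~ joint density only

order = np.argsort(y)                     # post-stratify on the outer variable Y
x, y = x[order], y[order]

est = 0.0
for i in range(N // M):
    xs = x[i * M:(i + 1) * M]             # approximate conditional samples of X
    ys = y[i * M:(i + 1) * M]             # nearly constant within a stratum
    est += f(np.mean(g(xs, ys)))
est *= M / N
print(est)   # close to the true value 0.5 for narrow strata
```

The stratum size $M$ trades bias for variance: a larger $M$ reduces the noise of the inner average but widens the strata, so the $X$ samples within a stratum follow conditional distributions at increasingly different values of $y$.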
Generating i.i.d. samples from $p(x\mid y)$ is often quite hard in applications Strong et al. (2015); Goda et al. (2020); Hironaka et al. (2020). Therefore, by construction, our proposed estimator has an advantage over many existing methods, including the NMC estimator, in terms of applicability. In the literature, there exist some other methods to estimate special classes of nested expectations which do not rely on sampling from the conditional distribution; see, for instance, Strong et al. (2015); Hong et al. (2017). The regression-based method from Strong et al. (2015) is compared with our proposed estimator in Section 4.
We end this section by introducing a variant of Algorithm 1.
Algorithm 2 (proposed method with regression).
Our motivation for introducing this variant is that, instead of using the samples within each stratum directly, we perturb them so as to approximate the inner conditional expectation simply by one (perturbed) sample. The resulting estimator is then given in a non-nested form.
3 Theoretical Analysis
In this section we show an upper bound on the mean squared error (MSE) of our proposed method. Here we show the result only for Algorithm 1. It is left open for future research whether a similar result on the MSE also holds for Algorithm 2.
Theorem 1 (A bound on the MSE).
Let $N$, $M$, $f$, and $g$ be given as in Algorithm 1. Let $F$ be the cumulative distribution function of $Y$, and let $F_j$ be the marginal cumulative distribution function of the $j$-th coordinate $Y_j$ for $j = 1,\ldots,d_Y$. Assume that every $F_j$ is continuous, that the coordinates $Y_1,\ldots,Y_{d_Y}$ are mutually independent, that there exist constants such that

(16) 

and

(17) 

hold for all admissible arguments, and that

(18) 

holds. Then we have

(19) 
Lemma 1.
Proof.
Let us recall the following two well-known facts as preparation. The first fact is on order statistics. For a positive integer $n$, let $U_1,\ldots,U_n$ be i.i.d. random variables following the uniform distribution on $[0,1]$, and let $U_{(1)} \le \cdots \le U_{(n)}$ be the corresponding order statistics. Then, for any $1 \le k \le n$, the $k$-th order statistic $U_{(k)}$ follows the beta distribution $\mathrm{Beta}(k, n+1-k)$, which ensures

(22) $\mathbb{E}\left[ U_{(k)} \right] = \frac{k}{n+1} \quad \text{and} \quad \mathbb{E}\left[ \left( U_{(k)} - \frac{k}{n+1} \right)^2 \right] = \frac{k(n+1-k)}{(n+1)^2 (n+2)}.$

Moreover, for any $1 \le k < \ell \le n$, the difference $U_{(\ell)} - U_{(k)}$ follows the beta distribution $\mathrm{Beta}(\ell-k,\, n+1-(\ell-k))$ and so it holds that

(23) $\mathbb{E}\left[ \left( U_{(\ell)} - U_{(k)} \right)^2 \right] = \frac{(\ell-k)(\ell-k+1)}{(n+1)(n+2)}.$
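These standard facts about uniform order statistics are easy to verify numerically; the following quick Monte Carlo check (illustrative only) compares empirical moments against the beta-distribution formulas, namely the mean $k/(n+1)$ of $U_{(k)}$ and the second moment $(\ell-k)(\ell-k+1)/((n+1)(n+2))$ of the gap $U_{(\ell)} - U_{(k)}$.

```python
import numpy as np

rng = np.random.default_rng(3)

n, k, l, reps = 10, 3, 7, 200000
u = np.sort(rng.uniform(size=(reps, n)), axis=1)   # rows of sorted uniforms

# U_(k) ~ Beta(k, n+1-k): mean k/(n+1)
mean_k = u[:, k - 1].mean()
print(mean_k, k / (n + 1))

# U_(l) - U_(k) ~ Beta(l-k, n+1-(l-k)): second moment (l-k)(l-k+1)/((n+1)(n+2))
d = l - k
m2 = ((u[:, l - 1] - u[:, k - 1]) ** 2).mean()
print(m2, d * (d + 1) / ((n + 1) * (n + 2)))
```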
The other fact is an elementary inequality. For a positive integer $m$ and any real numbers $a_1,\ldots,a_m$, it holds that

(24) $\left( \sum_{j=1}^{m} a_j \right)^2 \le m \sum_{j=1}^{m} a_j^2.$
This inequality can be proven by induction on $m$. As the case $m = 1$ trivially holds, let us assume that the inequality holds for $m-1$ and prove the case $m$. In fact, it follows from the triangle inequality and the induction assumption that

(25) $\left( \sum_{j=1}^{m} a_j \right)^2 \le \left( \left| \sum_{j=1}^{m-1} a_j \right| + |a_m| \right)^2$

(26) $= \left( \sum_{j=1}^{m-1} a_j \right)^2 + 2 \left| \sum_{j=1}^{m-1} a_j \right| |a_m| + a_m^2$

(27) $\le \frac{m}{m-1} \left( \sum_{j=1}^{m-1} a_j \right)^2 + m\, a_m^2$

(28) $\le m \sum_{j=1}^{m-1} a_j^2 + m\, a_m^2 = m \sum_{j=1}^{m} a_j^2,$

where the third step uses $2bc \le b^2/(m-1) + (m-1)c^2$ with $b = \bigl|\sum_{j=1}^{m-1} a_j\bigr|$ and $c = |a_m|$, which completes the proof of (24).
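The inequality $(\sum_{j} a_j)^2 \le m \sum_{j} a_j^2$ is also a special case of the Cauchy-Schwarz inequality; a throwaway numerical check (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(4)
for _ in range(1000):
    m = int(rng.integers(1, 20))
    a = rng.normal(size=m)
    # (sum a_j)^2 <= m * sum a_j^2, with a small slack for floating-point error
    assert np.sum(a) ** 2 <= m * np.sum(a ** 2) + 1e-9
print("inequality verified on random trials")
```

Equality holds exactly when all the $a_j$ are equal.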
Now let us prove (20). To do so, let us consider a single coordinate $j$ of the outer variable first. In what follows, we denote by $F_j$ the marginal cumulative distribution function of $Y_j$ as stated in Theorem 1, and we mean by $Y^{(i)}$ the $i$-th sample of the outer variable after the sorting step in Algorithm 1 is applied; we use the same convention for $X^{(i)}$. For each stratum, we can see that the corresponding sample set is a set of $M$ points drawn from a parent sample set consisting of $N$ points, which coincides exactly with the set of original i.i.d. samples.
Here, since every coordinate of the outer variable is assumed independent of all the other coordinates, we can see that, for any $j$, the transformed and sorted random variables $F_j\bigl(Y_j^{(i)}\bigr)$ correspond to the order statistics of i.i.d. uniform random variables on $[0,1]$, respectively.
Therefore, by using the equality (23), we have
(29)  
(30)  
(31)  
(32)  
(33)  
(34)  
(35)  
(36)  
(37) 
As we have assumed that the coordinates of the outer variable are mutually independent, applying the inequality (24) and Jensen's inequality leads to
(38)  
(39)  
(40)  
(41)  
(42) 
which completes the proof of (20).
Let us move on to proving (21). By a reasoning similar to what is used to prove (20) and using the equality (22), we have
(43)  
(44)  
(45)  
(46)  
(47)  
(48) 
Therefore, under the assumption that the coordinates of the outer variable are mutually independent, by using the inequality (24) and Jensen's inequality, it holds that
(49)  
(50)  
(51)  
(52)  
(53)  
(54) 
which completes the proof. ∎
Now we are ready to show a proof of Theorem 1.
Proof of Theorem 1.
Throughout this proof, we simply write $\mathbb{E}$ for the expectation with respect to the underlying probability measure of the corresponding random variables. By using Jensen's inequality, we obtain
(55)  
(56) 
In what follows, we show an upper bound on each term on the rightmost side of (56). Let us consider the first term. By introducing an i.i.d. copy of the outer random variables, using Jensen's inequality together with the inequalities (16) and (17), and applying the results shown in Lemma 1, we have
(57)  
(58)  
(59)  
(60)  
(61)  