Over the past few years, unbiased Monte Carlo estimation methods have
received significant attention, due to both theoretical interest and
practical applications; see, for example, Rhee and Glynn (2015); Blanchet et al. (2015); Blanchet and Glynn (2015); Jacob and Thiery (2015); McLeish (2011). Efficient unbiased estimation of non-linear
functions of expectations of random variables is challenging and has
several applications; see, for example, Blanchet et al. (2015); Jacob and Thiery (2015). An
important “canonical” case is the unbiased estimation of $1/\mathbb{E}[X]$
for a non-negative random variable $X$. Applications include
regenerative simulation, estimating parameters involving densities
with unknown normalizing constants, and Bayesian inference.
Motivated by these applications, we study properties of an unbiased estimator of $1/\mathbb{E}[X]$ proposed by Blanchet et al. (2015) (which is in turn based on the ideas proposed by Rhee and Glynn (2015) in the context of stochastic differential equations). The estimator is obtained as follows. Let $\mu = \mathbb{E}[X] \in (0, \infty)$ and write, for $w \in (0, 2/\mu)$,
$$\frac{1}{\mu} = w \sum_{n=0}^{\infty} (1 - w\mu)^n;$$
here the condition $w \in (0, 2/\mu)$ guarantees the convergence of the geometric series, since $|1 - w\mu| < 1$. Further, let $X_1, X_2, \dots$ be a sequence of iid copies of $X$, and let $N$ be a non-negative integer-valued random variable, independent of this sequence, with $\mathbb{P}(N \ge n) > 0$ for all $n \ge 0$. Then
$$Z(w) = w \left(1 + \sum_{n=1}^{N} \frac{\prod_{i=1}^{n} (1 - w X_i)}{\mathbb{P}(N \ge n)}\right). \tag{1}$$
Clearly, $\mathbb{E}[Z(w)] = 1/\mu$, and thus $Z(w)$ is an unbiased estimator of $1/\mu$. Note that if $X \le c$ almost surely for a constant $c > 0$, then with the choice $w \le 1/c$, $Z(w)$ becomes non-negative. In this paper, the goal is to study optimal choices for $w$ and $N$ that make $Z(w)$ efficient. In particular, a brief description of our contributions is as follows:
When $N$ has the variance-minimizing distribution for a fixed $w$, we show that as $w \downarrow 0$, the expected cost to construct $Z(w)$ increases to $\infty$, while both the variance and the expected time variance product of $Z(w)$ decrease.
As a consequence, we argue that for any $w$, instead of approximating $1/\mu$ with a sample mean of $m$ iid copies of $Z(w)$, it is optimal to approximate it by just one outcome of $Z(w')$, where $w' \le w$ is such that the expected cost of constructing $Z(w')$ is the same as that of the sample mean.
We study the asymptotic distribution of $Z(w)$ as $w \downarrow 0$ (i.e., as the expected computational cost of the estimator goes to $\infty$). We establish a central limit theorem type convergence result that is useful for finding asymptotically valid confidence intervals.
We compare the asymptotic performance of the unbiased estimator with that of the maximum likelihood (biased) ratio estimator, where $1/\mu$ is estimated using the reciprocal of a sample mean of iid copies of $X$.
The above results are studied under the assumption that $N$ has the variance-minimizing distribution. Generating samples from this distribution exactly is impossible, as it involves the unknown quantities $\mu$ and $\mathbb{E}[X^2]$. Since $Z(w)$ is unbiased even for a different distribution of $N$, we develop a method to implement the estimator by proposing a distribution for $N$ (constructed using samples of $X$) that closely resembles the variance-minimizing distribution.
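Before proceeding, we illustrate the construction concretely. The following Python sketch (ours, not from the cited works) draws one realization of the estimator in (1) with a geometric truncation level $N$; the sampler for $X$ and the values of $w$ and the success probability $p$ are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)

def Z(w, p, sample_x):
    """One draw of the unbiased estimator (1) of 1/E[X].

    w        : series parameter, assumed to satisfy 0 < w < 2/E[X]
    p        : success probability of the geometric truncation level N,
               so that P(N >= n) = (1 - p)**n
    sample_x : function returning one sample of the non-negative X
    """
    N = rng.geometric(p) - 1            # shift: numpy's geometric starts at 1
    total, prod = 1.0, 1.0
    for n in range(1, N + 1):
        prod *= 1.0 - w * sample_x()    # running product of (1 - w X_i)
        total += prod / (1.0 - p) ** n  # weight by 1 / P(N >= n)
    return w * total

# Illustrative check with X ~ Uniform(0, 2), so E[X] = 1 and 1/E[X] = 1.
draws = [Z(w=0.3, p=0.25, sample_x=lambda: rng.uniform(0.0, 2.0))
         for _ in range(100_000)]
print(np.mean(draws))  # should be close to 1.0
```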
Background: Several applications of Monte Carlo
simulation involve the estimation of $1/\mathbb{E}[X]$ for a
non-negative random variable $X$. In some applications it is a
desirable property to have an unbiased estimator of $1/\mathbb{E}[X]$ when the
magnitudes of the biases of the available biased estimators are unknown a
priori. Examples include the estimation of a steady-state parameter
of the form $\mathbb{E}[R]/\mathbb{E}[\tau]$ for a regenerative stochastic process,
where $\tau$ denotes the length of a regenerative cycle and $R$
denotes the cumulative reward obtained over the regenerative
cycle; see, e.g., Glynn (2006); Asmussen and Glynn (2007); Moka and Juneja (2015). It is evident that we have an unbiased estimator of $\mathbb{E}[R]/\mathbb{E}[\tau]$ if we have an unbiased estimator of $1/\mathbb{E}[\tau]$, independent of an unbiased estimator of $\mathbb{E}[R]$. A similar case is where parameters can be expressed as
$$\mathbb{E}_{\pi}[h(Y)] = \int h(y)\,\pi(y)\,\mathrm{d}y$$
for some real-valued function $h$ and probability density $\pi$, where $\pi$ is known only up to its normalizing constant. Such densities occur, for example, in Gibbs point processes (Møller and Waagepetersen (2003)), and a standard method to estimate such parameters is by using Markov Chain Monte Carlo (MCMC) methods; see Asmussen and Glynn (2007); Rubinstein and Kroese (2017). However, in many situations it is difficult to bound the bias of the MCMC estimator, as the mixing time of the Markov chain is unknown. An alternative approach is to use a ratio estimator, where the parameter is approximated by the ratio of the sample means of the numerator and the denominator. However, this still returns a biased estimator, and the bias decreases at a rate that is inversely proportional to the sample size; see Remark 2 and also Asmussen and Glynn (2007). Therefore, it is desirable to have an unbiased estimator of the reciprocal of the normalizing constant (and equally of the constant itself) that has the same order of complexity as that of the ratio estimator.
Most importantly, in some applications, having an unbiased estimator
of $1/\mathbb{E}[X]$ is essential. For example, in the study of doubly intractable
models in Bayesian inference, it is assumed that the observations $y$
follow a distribution with a density of the form
$$f(y \mid \theta) = \frac{\tilde f(y \mid \theta)}{\kappa(\theta)},$$
where $\tilde f(y \mid \theta)$ can be evaluated point-wise up to the normalizing
constant $\kappa(\theta)$; see, for example,
Lyne et al. (2015); Walker (2011); Jacob and Thiery (2015). Standard Metropolis–Hastings algorithms to obtain posterior estimates are not applicable due to the intractability of the normalizing constant. However, an exact inference method called pseudo-marginal Metropolis–Hastings, proposed by Andrieu and Roberts (2009), can be implemented if a non-negative unbiased estimator of $1/\kappa(\theta)$ is available; see also Beaumont (2003); Jacob and Thiery (2015); Walker (2011). In particular, Jacob and Thiery (2015) highlight the importance of estimators of the form (1).
A standard method called Russian roulette truncation can be used for unbiased estimation of $1/\mathbb{E}[X]$. This method was first proposed in the physics literature (Carter and Cashwell (1975); Lux and Koblinger (1991)) and further studied by McLeish (2011); Glynn and Rhee (2014); Lyne et al. (2015); Wei and Murray (2016). The key drawback of these estimators is that they can take negative values with positive probability, even when $X$ is bounded.
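For comparison, here is a generic sketch (in our notation, not that of the cited papers) of Russian roulette truncation for a series $\sum_{n \ge 0} a_n$; the reweighting by survival probabilities keeps the estimator unbiased, but nothing constrains a realized value to be non-negative when some terms are negative.

```python
import numpy as np

rng = np.random.default_rng(2)

def roulette(term, q):
    """Russian roulette truncation of sum_{n>=0} term(n).

    After each term, continue with probability q in (0, 1); the n-th term,
    which survives with probability q**n, is reweighted by 1/q**n. The
    estimator is unbiased, but it can be negative whenever some terms are.
    """
    total, weight, n = 0.0, 1.0, 0
    while True:
        total += term(n) / weight
        if rng.random() > q:   # killed: stop the series here
            return total
        weight *= q
        n += 1

# Illustrative check: sum_{n>=0} 0.5**n = 2.
print(np.mean([roulette(lambda n: 0.5**n, q=0.6) for _ in range(100_000)]))
```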
Organization of the paper: In Section 2, we study the properties of the estimator $Z(w)$ as a function of $w$, under the assumption that $N$ has the variance-minimizing distribution. An implementable method is proposed in Section 3. Conclusions are given in Section 4. All the results are proved in Appendix A.
2 Properties of the Estimator
Without loss of generality, assume that $X$ is non-degenerate. As the estimator $Z(w)$ in (1) is unbiased, a sample mean of independent copies of $Z(w)$
is an unbiased estimator of $1/\mu$ as well. It is well
known that the sample mean has square-root convergence rate if
$\mathrm{Var}(Z(w)) < \infty$; see, e.g., Asmussen and Glynn (2007). Thus, a
simple strategy is to seek values of $w$ and distributions of $N$
that minimize $\mathrm{Var}(Z(w))$. Using the Cauchy–Schwarz
inequality, Blanchet et al. (2015) show that for any $w \in \left(0, 2\mu/\mathbb{E}[X^2]\right)$,
$\mathrm{Var}(Z(w))$ is finite and is minimal if $N$ has a geometric
distribution on the non-negative integers with success probability
$p(w) = 1 - \sqrt{\mathbb{E}\left[(1 - wX)^2\right]}$;
that is, if
$$\mathbb{P}(N = n) = p(w)\,\big(1 - p(w)\big)^n, \quad n = 0, 1, 2, \dots, \tag{2}$$
where the assumption $w \in \left(0, 2\mu/\mathbb{E}[X^2]\right)$ guarantees that $p(w) \in (0, 1)$. Unfortunately, $p(w)$ depends on $\mu$ and $\mathbb{E}[X^2]$, which are unknown.
However, $Z(w)$ is unbiased even when $N$ has a
different distribution. Therefore, in the implementation of
$Z(w)$, we can replace $p(w)$ with an estimate of it; see Section 3. In this
section, we study the properties of $Z(w)$ under the assumption
that $N$ has the distribution (2),
because it offers an understanding of the best that can be expected from the estimator.
Note that the variance of $Z(w)$ is given by
$$\mathrm{Var}(Z(w)) = \frac{w^2}{p(w)^2} - \frac{1}{\mu^2}, \tag{3}$$
for all $w \in \left(0, 2\mu/\mathbb{E}[X^2]\right)$. Further, observe that $\mathbb{E}[N] + 1 = 1/p(w)$. Now we can ask what is the value of $w$ that minimizes (3). This question is not addressed in the existing literature. In addition to the variance, it is often important to include the running time to construct the estimator to determine its efficiency; see Glynn and Whitt (1992). In that case, we need to select $w$ for which the expected time variance product $\mathbb{E}[T(w)]\,\mathrm{Var}(Z(w))$ is minimal, where $T(w)$ is the time required to construct $Z(w)$. From (1) (since the $X_i$ are iid), it is reasonable to assume that $T(w)$ is proportional to the number of $X_i$'s used for constructing $Z(w)$. Since $N + 1$ terms (using $N$ samples of $X$) appear in the construction of $Z(w)$, we assume that the expected time variance product is $(\mathbb{E}[N] + 1)\,\mathrm{Var}(Z(w)) = \mathrm{Var}(Z(w))/p(w)$.
Theorem 1. Suppose that $N$ has the geometric distribution given in (2). Then the following hold true.
(i) The success probability $p(w)$ is a strictly concave function of $w$, with a maximum value of $1 - \sqrt{1 - \mu^2/\mathbb{E}[X^2]}$ attained at $w = \mu/\mathbb{E}[X^2]$, and $\lim_{w \downarrow 0} p(w) = 0$.
(ii) The variance $\mathrm{Var}(Z(w))$ is a strictly increasing convex function of $w$, with $\lim_{w \downarrow 0} \mathrm{Var}(Z(w)) = 0$.
(iii) The expected time variance product $\mathrm{Var}(Z(w))/p(w)$ is a strictly increasing function of $w$, with $\lim_{w \downarrow 0} \mathrm{Var}(Z(w))/p(w) = \sigma^2/\mu^4$, where $\sigma^2 = \mathrm{Var}(X)$.
Remark 1. To understand the implications of Theorem 1, suppose we select a $w_0 \in \left(0, 2\mu/\mathbb{E}[X^2]\right)$, giving an expected computational cost of $1/p(w_0)$ to obtain $Z(w_0)$. Further, let $\bar Z_m$ be the sample mean of $m$ iid copies of $Z(w_0)$. Then the expected time variance product for $\bar Z_m$ is
$$\frac{m}{p(w_0)} \cdot \frac{\mathrm{Var}(Z(w_0))}{m} = \frac{\mathrm{Var}(Z(w_0))}{p(w_0)}.$$
Now suppose that $w$ is selected so that the average cost to generate one outcome of $Z(w)$ is equal to the average cost to construct $\bar Z_m$; that is, $1/p(w) = m/p(w_0)$. Then, from Theorem 1 (iii), for the same computational effort, $Z(w)$ has a smaller variance than $\bar Z_m$ for any feasible choice of $w_0$ and $m$ as above, and is therefore a better estimator.
To illustrate the results of Theorem 1, consider an example where $X = \mathbb{1}(A)$ for an event $A$ with probability $\mu$. Since $X^2 = X$, the relative variance is $\sigma^2/\mu^2 = (1 - \mu)/\mu$. By substituting the values of $\mu$ and $\mathbb{E}[X^2] = \mu$, we can calculate $p(w)$, $\mathrm{Var}(Z(w))$ and $\mathrm{Var}(Z(w))/p(w)$ for any $w \in (0, 2)$. Figure 1 illustrates the effect of $w$ on the efficacy of the estimator $Z(w)$. As expected, both $\mathrm{Var}(Z(w))$ and $\mathrm{Var}(Z(w))/p(w)$ decrease as $w \downarrow 0$, with the limits $0$ and $\sigma^2/\mu^4$, respectively. ∎
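The quantities behind Figure 1 are straightforward to tabulate. The snippet below does so for this Bernoulli example with $\mu = 0.2$ (a value chosen arbitrarily for illustration), using the expressions for $p(w)$, the variance (3), and the time variance product given above.

```python
import numpy as np

mu = 0.2            # P(A); then E[X^2] = mu and sigma^2 = mu - mu^2
m2, sigma2 = mu, mu - mu**2

for w in [1.0, 0.5, 0.1, 0.01, 0.001]:                  # feasible range is (0, 2)
    p = 1.0 - np.sqrt(1.0 - 2.0 * w * mu + w**2 * m2)   # success probability (2)
    var = (w / p) ** 2 - 1.0 / mu**2                    # variance (3)
    tvp = var / p                                       # expected time variance product
    print(f"w={w:6.3f}  p={p:.5f}  var={var:8.4f}  tvp={tvp:9.4f}")

print("limit of tvp as w -> 0:", sigma2 / mu**4)        # equals 100.0 here
```

Running this shows both the variance and the time variance product decreasing monotonically as $w$ decreases, with the latter approaching $\sigma^2/\mu^4 = 100$.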
Remark 1 motivates us to study the asymptotic distributional properties of $Z(w)$ as $w \downarrow 0$, when $N$ has the geometric distribution given in (2). Theorem 2 below is crucial for establishing confidence intervals that are asymptotically valid as $w \downarrow 0$.
Theorem 2. Suppose that $N$ has the geometric distribution given in (2). Then, as $w \downarrow 0$: (i) $Z(w) \to 1/\mu$ in probability; and (ii)
$$\sqrt{t(w)}\left(Z(w) - \frac{1}{\mu}\right) \Longrightarrow \frac{\sigma}{\mu^2}\,\mathcal{L},$$
where $t(w) = 1/p(w)$ and $\mathcal{L}$ is a mean zero random variable specified below. The above expression has more resemblance to the standard central limit theorem, since $t(w)$ is the expected computational cost of the estimator.
It is not difficult to show that $\mathcal{L}$ is a random variable with density
$\tfrac{1}{\sqrt{2}}\,e^{-\sqrt{2}\,|x|}$, $x \in \mathbb{R}$, which is the density of a mean zero Laplace
(or double exponential) distribution with scale $1/\sqrt{2}$ (and hence unit variance). These observations are useful for constructing asymptotically valid confidence intervals as follows. For any given $\alpha \in (0, 1)$, by solving $\mathbb{P}(|\mathcal{L}| > z_\alpha) = \alpha$ for $z_\alpha$, we get $z_\alpha = \ln(1/\alpha)/\sqrt{2}$. Then using Theorem 2, we can say that the interval
$$\left(Z(w) - z_\alpha\,\frac{\sigma/\mu^2}{\sqrt{t(w)}},\;\; Z(w) + z_\alpha\,\frac{\sigma/\mu^2}{\sqrt{t(w)}}\right)$$
is an asymptotic $(1 - \alpha)$-confidence interval for $1/\mu$.
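In code, the interval takes the following form; the plug-in values for $\sigma$ and $\mu$ (or consistent estimates of them) must be supplied by the user, which Theorem 2 itself does not provide.

```python
import numpy as np

def laplace_ci(z_w, t_w, sigma_hat, mu_hat, alpha=0.05):
    """Asymptotic (1 - alpha) confidence interval for 1/E[X], based on one
    outcome z_w of Z(w) with expected computational cost t_w = 1/p(w).

    Uses P(|L| > z) = exp(-sqrt(2) z) for the unit-variance Laplace limit,
    giving the critical value z_alpha = log(1/alpha) / sqrt(2).
    """
    z_alpha = np.log(1.0 / alpha) / np.sqrt(2.0)
    half_width = z_alpha * (sigma_hat / mu_hat**2) / np.sqrt(t_w)
    return z_w - half_width, z_w + half_width
```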
Remark 2 (Comparison with the ratio estimator).
A standard (biased) estimator of $1/\mu$ is $1/\bar X_m$, where $\bar X_m$ is the sample mean of $m$ iid copies of $X$. Using Taylor's theorem for the function $x \mapsto 1/x$ about $\mu$, we can easily show that the bias of $1/\bar X_m$ is approximately $\sigma^2/(\mu^3 m)$ for large $m$, while, on the other hand, $Z(w)$ has zero bias. Furthermore, using the same Taylor expansion, we can show that the asymptotic time variance product of $1/\bar X_m$ is $\sigma^2/\mu^4$ as $m \to \infty$. From Theorem 1 (iii), the unbiased estimator has the same asymptotic expected time variance product. However, the unbiasedness of $Z(w)$ comes at a cost. As an application of the delta method, we can show that the ratio estimator satisfies the central limit theorem
$$\sqrt{m}\left(\frac{1}{\bar X_m} - \frac{1}{\mu}\right) \Longrightarrow \mathsf{N}\!\left(0, \frac{\sigma^2}{\mu^4}\right);$$
that is, the ratio estimator is asymptotically normal. On the other hand, the asymptotic distribution of the unbiased estimator is Laplace, which has more slowly decaying tails than a normal distribution. In conclusion, the ratio estimator can have narrower confidence intervals than the unbiased estimator.
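The difference is visible in the critical values alone: since both limiting laws here have the same asymptotic variance $\sigma^2/\mu^4$, the interval widths differ only through the quantiles of the limits.

```python
import numpy as np

alpha = 0.05
z_normal = 1.959964                               # standard normal, 97.5% point
z_laplace = np.log(1.0 / alpha) / np.sqrt(2.0)    # unit-variance Laplace
print(z_normal, z_laplace)   # 1.96 versus about 2.12: Laplace intervals are
                             # roughly 8% wider at the 95% level
```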
Remark 3 (Importance sampling).
Just like in the case of the ratio estimator, from Theorems 1 and 2, the relative variance $\sigma^2/\mu^2$ of $X$ is the key factor influencing the asymptotic properties of the unbiased estimator: the smaller the relative variance of $X$, the more reliable the unbiased estimator. One of the most effective variance reduction techniques is importance sampling. We can improve the performance of the estimator if we can implement an importance sampling scheme for the random variable $X$.
Remark 4 (The time variance product minimizing distribution for $N$).
We have assumed that, for a given $w$, the random variable $N$ has the variance-minimizing distribution given by (2). However, when the criterion for the optimality of $N$ is the minimization of the expected time variance product, we need to seek a distribution that minimizes $(\mathbb{E}[N] + 1)\,\mathrm{Var}(Z(w))$. It is shown in Blanchet et al. (2015) that the distribution of $N$ that minimizes the expected time variance product is given by the distribution (4), whose parameter is the unique positive number satisfying an equation that contains an infinite sum. When compared to the distribution (2), drawing samples from (4) has the extra difficulty of finding this parameter numerically. Even if we overcome this difficulty, the reduction in the expected time variance product is typically insignificant, because for small values of $w$ the two products nearly coincide.
3 An Implementation
Recall that the success probability $p(w)$ of $N$ is a function of the unknown quantities $\mu$ and $\mathbb{E}[X^2]$. Fortunately, however, $Z(w)$ in (1) is still an unbiased estimator of $1/\mu$ for any distribution of $N$. In particular, instead of taking $N$ as in (2), we can take $N \sim \mathsf{Geo}(\hat p_b)$, where $\hat p_b$ is defined below. Under the proposed implementation, when the given budget $b$ is sufficiently large, half of the budget is used for estimating $p(w)$, and then $w$ is chosen such that the remaining half of the budget is used for generating a sample of the unbiased estimator.
To simplify the discussion, assume that a known constant $\bar w \in \left(0, 2\mu/\mathbb{E}[X^2]\right)$ is available; for example, if $X \le c$ almost surely for a constant $c > 0$, we can take $\bar w = 1/c$. Let $X'_1, X'_2, \dots$ be a sequence of iid copies of $X$, independent of the sequence $X_1, X_2, \dots$ used in the construction of the unbiased estimator in (1). Define the first two sample moments
$$\hat\mu_b = \frac{1}{\lceil b/2 \rceil} \sum_{i=1}^{\lceil b/2 \rceil} X'_i \quad\text{and}\quad \hat m_b = \frac{1}{\lceil b/2 \rceil} \sum_{i=1}^{\lceil b/2 \rceil} (X'_i)^2.$$
If $\hat\mu_b > 0$, define $\hat w_b$ and $\hat p_b$ as the plug-in analogues of $w$ and $p(w)$, obtained by replacing $\mu$ and $\mathbb{E}[X^2]$ with $\hat\mu_b$ and $\hat m_b$, with $\hat w_b \le \bar w$ chosen so that the expected cost $1/\hat p_b$ of the second stage is approximately $b/2$. Otherwise, take $\hat w_b = \bar w$ and $\hat p_b = 2/b$. This construction guarantees that $\hat p_b \in (0, 1)$. Further, $\mathbb{P}(N \ge n) > 0$ for all $n$ whether $\hat\mu_b > 0$ or not, and hence it guarantees that the estimator (defined by (1)) with $N \sim \mathsf{Geo}(\hat p_b)$ is an unbiased estimator of $1/\mu$. Note that given $\hat w_b$ and $\hat p_b$, the expected cost to construct $Z(\hat w_b)$ is $\lceil b/2 \rceil + 1/\hat p_b$ (including the cost to construct $\hat p_b$), since $\mathbb{E}[N] + 1 = 1/\hat p_b$. Theorem 3 states that for large values of $b$, this total expected cost is approximately $b$, and, conditioned on $\hat w_b$ and $\hat p_b$, the expected time variance product converges to a random variable with mean $\sigma^2/\mu^4$. See Section A.4 for a proof of Theorem 3.
Theorem 3. Under the above construction, $\mathbb{E}[Z(\hat w_b)] = 1/\mu$, and as $b \to \infty$, $\hat w_b \to 0$ and $\hat p_b \to 0$ almost surely, and
$$\frac{\mathrm{Var}\left(Z(\hat w_b) \mid \hat w_b, \hat p_b\right)}{\hat p_b} \Longrightarrow \frac{\sigma^2}{\mu^4}\, Q,$$
where $Q$ is the square of a standard normal random variable.
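The two-stage procedure can be sketched as follows. The exact plug-in definitions of $\hat w_b$ and $\hat p_b$ are paraphrased above, so the rule below (solving the plug-in equation $\hat p_b = 2/b$ for $\hat w_b$ and capping at $\bar w$) should be read as one natural reconstruction rather than the verbatim construction.

```python
import numpy as np

rng = np.random.default_rng(1)

def two_stage_estimate(b, sample_x, w_bar):
    """Hypothetical two-stage version of estimator (1) with total budget b.

    Stage 1 spends about b/2 samples on moment estimates; stage 2 draws one
    outcome of Z(w_hat) whose expected cost 1/p_hat is about b/2.
    """
    m = b // 2
    pilot = np.array([sample_x() for _ in range(m)])
    mu_hat, m2_hat = pilot.mean(), (pilot**2).mean()
    if mu_hat <= 0:                       # degenerate pilot: fall back on w_bar
        w_hat, p_hat = w_bar, 2.0 / b
    else:
        # Choose w_hat so the plug-in success probability equals 2/b, i.e.
        # solve 1 - 2 w mu_hat + w^2 m2_hat = (1 - 2/b)^2 for its smaller
        # root, then keep it inside the feasible range (0, w_bar].
        delta = 1.0 - (1.0 - 2.0 / b) ** 2
        disc = max(mu_hat**2 - m2_hat * delta, 0.0)
        w_hat = min(w_bar, (mu_hat - np.sqrt(disc)) / m2_hat)
        inner = 1.0 - 2.0 * w_hat * mu_hat + w_hat**2 * m2_hat
        p_hat = max(1.0 - np.sqrt(max(inner, 0.0)), 2.0 / b)
    # Stage 2: one outcome of the unbiased estimator with plug-in inputs.
    N = rng.geometric(p_hat) - 1
    total, prod = 1.0, 1.0
    for n in range(1, N + 1):
        prod *= 1.0 - w_hat * sample_x()
        total += prod / (1.0 - p_hat) ** n
    return w_hat * total

# X ~ Uniform(0, 2): E[X] = 1 (so 1/E[X] = 1) and X <= 2, so w_bar = 1/2 works.
print(two_stage_estimate(10_000, lambda: rng.uniform(0.0, 2.0), w_bar=0.5))
```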
4 Conclusion
We investigated the theoretical properties of a parametrized family of unbiased estimators $Z(w)$ of $1/\mathbb{E}[X]$ for a non-negative random variable $X$. We studied the variance and the expected time variance product as functions of $w$ and established several asymptotic results. We showed that, with an optimal choice of $w$, the asymptotic performance of the unbiased estimator is comparable to that of the maximum likelihood (biased) ratio estimator. We further proposed an implementable unbiased estimation method based on our results. Similar to Theorem 2, our ongoing research establishes a central limit theorem type convergence result for the estimator defined in Section 3, by taking the budget parameter $b \to \infty$.
Appendix A
To simplify the notation in this section, we write $p \equiv p(w)$ and $V \equiv \mathrm{Var}(Z(w))$. We also write $p'$ for the derivative and $p''$ for the second derivative of $p$ with respect to $w$. $\mathsf{Exp}(\lambda)$ and $\mathsf{Geo}(q)$ denote, respectively, the exponential distribution with mean $\lambda$ and the geometric distribution on the non-negative integers with success probability $q$.
A.1 Proof of Theorem 1
From the definition, $p(w) = 1 - \sqrt{1 - 2w\mu + w^2\,\mathbb{E}[X^2]}$. It follows that
$$p''(w) = \frac{\mu^2 - \mathbb{E}[X^2]}{\left(1 - 2w\mu + w^2\,\mathbb{E}[X^2]\right)^{3/2}} < 0,$$
where the strict inequality holds because $\mathbb{E}[X^2] > \mu^2$, which follows from the assumption that $X$ is non-degenerate. Therefore, $p$ is strictly concave over $\left(0, 2\mu/\mathbb{E}[X^2]\right)$, and it achieves its maximum value at $w = \mu/\mathbb{E}[X^2]$, where $p'(w) = (\mu - w\,\mathbb{E}[X^2])/\sqrt{1 - 2w\mu + w^2\,\mathbb{E}[X^2]}$ vanishes. From the definition of $p$, it is evident that $\lim_{w \downarrow 0} p(w) = 0$.
Recall from (3) that the variance of the estimator is equal to $V(w) = w^2/p(w)^2 - 1/\mu^2$. Its derivative can be written as
$$V'(w) = \frac{2w}{p(w)^3}\left(p(w) - w\,p'(w)\right),$$
and the second derivative as
$$V''(w) = \frac{2}{p(w)^2} - \frac{8w\,p'(w)}{p(w)^3} - \frac{2w^2\,p''(w)}{p(w)^3} + \frac{6w^2\,p'(w)^2}{p(w)^4}.$$
Since $p$ is strictly concave with $p(0) = 0$ (again because $X$ is non-degenerate), we have $p(w) - w\,p'(w) > 0$, and hence $V'(w) > 0$; that is, $V$ is strictly increasing. Combining $p''(w) < 0$ (established above) with bounds obtained from Jensen's and Bernoulli's inequalities, one further verifies that $V''(w) > 0$ for all $w$,
which establishes the convexity of $V$ over $\left(0, 2\mu/\mathbb{E}[X^2]\right)$.
We now prove that the expected time variance product $V(w)/p(w)$ is strictly increasing over $\left(0, 2\mu/\mathbb{E}[X^2]\right)$. Its derivative is
$$\left(\frac{V(w)}{p(w)}\right)' = \frac{V'(w)\,p(w) - V(w)\,p'(w)}{p(w)^2},$$
and hence it suffices to show that $V'(w)\,p(w) > V(w)\,p'(w)$. For $w \ge \mu/\mathbb{E}[X^2]$ this is immediate, because $V'(w) > 0$, $V(w) \ge 0$ and $p'(w) \le 0$ there; for $w < \mu/\mathbb{E}[X^2]$, the inequality can be verified using the properties of $p$ established above (in particular, $p'' < 0$ and $p(w) > w\,p'(w)$). Therefore, $V/p$ is strictly increasing over $\left(0, 2\mu/\mathbb{E}[X^2]\right)$.
The claims that the infima of $V$ and of $V/p$ are attained in the limit $w \downarrow 0$ hold trivially because both functions are strictly increasing. To complete the proof of the theorem, we can write, by Taylor's theorem, for any $w \in \left(0, 2\mu/\mathbb{E}[X^2]\right)$:
$$p(w) = 1 - \sqrt{1 - 2w\mu + w^2\,\mathbb{E}[X^2]} = w\mu - \frac{w^2}{2}\left(\mathbb{E}[X^2] - \mu^2\right) + r(w),$$
for some remainder $r(w) = o(w^2)$. Consequently,
$$\frac{w}{p(w)} = \frac{1}{\mu}\left(1 + \frac{w\,\sigma^2}{2\mu} + o(w)\right),$$
where $\sigma^2 = \mathbb{E}[X^2] - \mu^2$. Since $p(w)/w \to \mu$ as $w \downarrow 0$, further simplification yields
$$V(w) = \frac{w^2}{p(w)^2} - \frac{1}{\mu^2} = \frac{w\,\sigma^2}{\mu^3} + o(w), \quad\text{and thus}\quad \frac{V(w)}{p(w)} = \frac{\sigma^2}{\mu^4} + o(1).$$
Since $\lim_{w \downarrow 0} V(w) = 0$ and $\lim_{w \downarrow 0} V(w)/p(w) = \sigma^2/\mu^4$,
we conclude that both $V$ and $V/p$ go to their respective minima as $w \downarrow 0$.
A.2 Proof of Theorem 2
Statement (i) follows directly from Theorem 1 and Chebyshev's inequality:
$$\mathbb{P}\left(\left|Z(w) - \frac{1}{\mu}\right| > \epsilon\right) \le \frac{V(w)}{\epsilon^2} \longrightarrow 0 \quad \text{as } w \downarrow 0,$$
for every $\epsilon > 0$. To prove (ii), consider a decreasing sequence $\{w_k\}$ such that $w_k \downarrow 0$ as $k \to \infty$. We construct an almost surely increasing sequence $\{N_k\}$ such that $N_k \sim \mathsf{Geo}(p(w_k))$ for every $k$. To do so, we invoke Theorem 3.1 of Moka and Juneja (2015). Let $p_k = p(w_k)$ and $\lambda_k = -\ln(1 - p_k)$. Then, Theorem 3.1 of Moka and Juneja (2015) says that for each $k$, there exists a random variable $E_k \sim \mathsf{Exp}(1/\lambda_k)$
such that $\lfloor E_k \rfloor$ is independent of $X_1, X_2, \dots$, and has the same distribution as $N_k$. Therefore, without loss of generality, we assume that there is a sequence of such random variables, defined on a common probability space and independent of $X_1, X_2, \dots$,
such that $N_k = \lfloor E_k \rfloor$ for all $k$.
Consider the natural filtration $\mathcal{F}_n = \sigma(X_1, \dots, X_n)$, and let $M_n = \prod_{i=1}^{n} |1 - w X_i|$. Then
$$\mathbb{E}[M_{n+1} \mid \mathcal{F}_n] = M_n\,\mathbb{E}\,|1 - wX| \le M_n \sqrt{\mathbb{E}\left[(1 - wX)^2\right]} \le M_n,$$
where the last inequality holds because $\mathbb{E}[(1 - wX)^2] \le 1$ on the feasible range. We have $\mathbb{E}[M_{n+1} \mid \mathcal{F}_n, N_k] = \mathbb{E}[M_{n+1} \mid \mathcal{F}_n]$ since $N_k$ is independent of $X_1, X_2, \dots$.
Thus, $\{M_n\}$ is a supermartingale (with respect to $\{\mathcal{F}_n\}$).
In fact, the sequence is bounded in $L^2$, because $\mathbb{E}[M_n^2] = \left(\mathbb{E}\left[(1 - wX)^2\right]\right)^n \le 1$, making it a uniformly integrable supermartingale. Thus, $\lim_{n \to \infty} M_n$ exists almost surely (see Theorem 2 in Section 4 of Chapter VII of Shiryaev (1996)).
This implies that the almost sure limit defining $\mathcal{L}$ along the sequence $\{w_k\}$ exists, which completes the proof.