# Unbiased Estimation of the Reciprocal Mean for Non-negative Random Variables

Many simulation problems require the estimation of a ratio of two expectations. In recent years Monte Carlo estimators have been proposed that can estimate such ratios without bias. We investigate the theoretical properties of such estimators for the estimation of β = 1/E Z, where Z ≥ 0. The estimator, β(w), is of the form w/f_w(N) ∏_i=1^N (1 - w Z_i), where w < 2β and N is any random variable with probability mass function f_w on the positive integers. For a fixed w, the optimal choice for f_w is well understood, but less so the choice of w. We study the properties of β(w) as a function of w and show that its expected time variance product decreases as w decreases, even though the cost of constructing the estimator increases with w. We also show that the estimator is asymptotically equivalent to the maximum likelihood (biased) ratio estimator and establish practical confidence intervals.

## Authors

• 2 publications
• 2 publications
• 8 publications
• ### Uncertainty on the Reproduction Ratio in the SIR Model

The aim of this paper is to understand the extreme variability on the es...
12/21/2020 ∙ by Sean Elliott, et al. ∙ 0

• ### Simulation study of estimating between-study variance and overall effect in meta-analysis of odds-ratios

Random-effects meta-analysis requires an estimate of the between-study v...
02/19/2019 ∙ by Ilyas Bakbergenuly, et al. ∙ 0

• ### Best estimator for bivariate Poisson regression

INTRODUCTION: Wald's, the likelihood ratio (LR) and Rao's score tests an...
03/17/2021 ∙ by André Gillibert, et al. ∙ 0

• ### Quantifying and Reducing Bias in Maximum Likelihood Estimation of Structured Anomalies

Anomaly estimation, or the problem of finding a subset of a dataset that...
07/15/2020 ∙ by Uthsav Chitra, et al. ∙ 25

• ### Estimating a probability of failure with the convex order in computer experiments

This paper deals with the estimation of a failure probability of an indu...
07/03/2019 ∙ by Lucie Bernard, et al. ∙ 0

• ### Noise contrastive estimation: asymptotics, comparison with MC-MLE

A statistical model is said to be un-normalised when its likelihood func...
01/31/2018 ∙ by Lionel Riou-Durand, et al. ∙ 0

• ### Prior-free Data Acquisition for Accurate Statistical Estimation

We study a data analyst's problem of acquiring data from self-interested...
11/30/2018 ∙ by Yiling Chen, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Over the past few years, unbiased Monte Carlo estimation methods have received significant attention, due to both theoretical interest and practical applications; see, for example, Rhee and Glynn (2015); Blanchet et al. (2015); Blanchet and Glynn (2015); Jacob and Thiery (2015); McLeish (2011). Efficient unbiased estimation of non-linear functions of expectations of random variables is challenging and has several applications; see, for example, Blanchet et al. (2015); Jacob and Thiery (2015). An important “canonical” case is the unbiased estimation of for a non-negative random variable

. Applications include regenerative simulation, estimating parameters involving densities with unknown normalizing constants, and Bayesian inference.

Motivated by these applications, we study properties of an unbiased estimator of proposed by Blanchet et al. (2015) (which is in turn based on the ideas proposed by Rhee and Glynn (2015) in the context of stochastic differential equations). The estimator is obtained as follows. Write for ; here the condition guarantees the convergence of the geometric series . Further, let be a sequence of iid copies of , and let be a non-negative integer-valued random variable with , for all . Then

 1EZ =w∞∑n=0qn(1−wEZ)nqn=w∞∑n=0qnE∏ni=1(1−wZi)qn=wE[1qNN∏i=1(1−wZi)].

Define,

 ˆβ(w):=wqNN∏i=1(1−wZi). (1)

Clearly, and thus is an unbiased estimator of . Note that if almost surely for a constant , then with the choice , becomes non-negative. In this paper, the goal is to study optimal choices for and that make efficient. In particular, a brief description of our contributions is as follows:

• When is the variance-minimizing distribution for a fixed , we show that as , the expected cost to construct increases to , while both the variance and the expected time variance product of decease.

• As a consequence, we argue that for any , instead of approximating with a sample mean of iid copies of , it is optimal to approximate it by just one outcome of , where is such that and the expected cost of constructing is the same as that of the sample mean.

• We study the asymptotic distribution of as (i.e., as the expected computational cost for the estimator goes to

). We establish a central limit theorem type convergence result that is useful for finding asymptotically valid confidence intervals.

• We compare the asymptotic performance of the unbiased estimator with that of the maximum likelihood (biased) ratio estimator, where is estimated using the reciprocal of a sample mean of iid copies of .

• The above results are studied under the assumption that has the variance-minimizing distribution. Generating samples from this distribution is impossible as it involves unknown parameters. Since is unbiased even for a different distribution of , we develop a method to implement the estimator by proposing a distribution for (using samples of ) that closely resembles to the variance-minimizing distribution .

Background: Several applications of Monte Carlo simulation involve the estimation of for a non-negative random variable . In some applications it is a desirable property to have an unbiased estimator of when the magnitudes of the available biased estimators are unknown a priori. Examples include the estimation of a steady-state parameter for a regenerative stochastic process, where denotes the length of a regenerative cycle and denotes the cumulative reward obtained over the regenerative cycle; see, e.g, Glynn (2006); Asmussen and Glynn (2007); Moka and Juneja (2015). It is evident that we have an unbiased estimator of if we have an unbiased estimator of . A similar case is where parameters can be expressed as for some real-valued function and probability density , where is known up to the normalizing constant . Such densities occur, for example, in Gibbs point processes (Møller and Waagepetersen (2003)

); and a standard method to estimate such parameters is by using Markov Chain Monte Carlo (MCMC) methods, see

Asmussen and Glynn (2007); Rubinstein and Kroese (2017). However, in many situations it is difficult to bound the bias of the MCMC estimator, as the mixing time of the Markov chain is unknown. An alternative approach is to use a ratio estimator, where is approximated by ratio of the sample means of the numerator and the denominator. However, this still returns a biased estimator and the bias decreases at a rate that is inversely proportional to the sample size; see Remark 2 and also Asmussen and Glynn (2007). Therefore, it is desirable to have an unbiased estimator for (and equally for ) that has the same order of complexity as that of the ratio estimator.

Most importantly, in some applications, having an unbiased estimator of is essential. For example, in the study of doubly intractable models in Bayesian inference, it is assumed that the observations follow a distribution with a density of the form where can be evaluated point-wise up to the normalizing constant ; see, for example, Lyne et al. (2015); Walker (2011); Jacob and Thiery (2015). Standard Metropolis–Hastings algorithms to obtain posterior estimates are not applicable due to the intractability of the normalizing constant. However, an exact inference method called pseudo-marginal Metropolis–Hastings proposed by Andrieu and Roberts (2009) can be implemented if a non-negative unbiased estimator of is available; also see Beaumont (2003); Jacob and Thiery (2015); Walker (2011). In particular, Jacob and Thiery (2015) highlight the importance of the estimators of the form (1).

A standard method called Russian roulette truncation can be used for unbiased estimation of . This method is first proposed in the physics literature Carter and Cashwell (1975); Lux and Koblinger (1991) and further studied by McLeish (2011); Glynn and Rhee (2014); Lyne et al. (2015); Wei and Murray (2016). The key drawback of these estimators is that they can take negative values with positive probability even when is bounded.

Organization of the paper: In Section 2, we study the properties of the estimator as a function of , under the assumption that has the variance minimizing estimator. An implementable method is proposed in Section 3. A conclusion of the paper is given in Section 4. All the results are proved in Appendix A.

## 2 Properties of the Estimator

Without loss of generality, assume that is non-degenerate. As the estimator in (1) is unbiased, a sample mean of independent copies of is an unbiased estimator of  as well. It is well known that the sample mean has square-root convergence rate if ; see, e.g., Asmussen and Glynn (2007). Thus, a simple strategy is to seek values of and that minimize . Using the Cauchy–Schwarz inequality, Blanchet et al. (2015) show that for any , is finite and is minimal if

has a geometric distribution on the non-negative integers with success probability

 pw=1−√E(1−wZ)2=1−√1−2wEZ+w2EZ2;

that is, if

 qn=(1−pw)npw,n≥0, (2)

where the assumption guarantees that . Unfortunately, depends on and , which are unknown. However, is unbiased even when has a different distribution. Therefore, in the implementation of , we can replace with an estimate of ; see Section 3. In this section, we study the properties of under the assumption that has the distribution (2), because it offers an understanding of what is the best that can be expected from the estimator.

Note that the variance of is given by,

 Varˆβ(w) (3)

for all . Further, observe that . Now we can ask what is the value of that minimizes (3). This question is not addressed in the existing literature. In addition to the variance, it is often important to include the running time to construct the estimator to determine its efficiency; see Glynn and Whitt (1992). In that case, we need to select for which the expected time variance product, is minimal, where is the time required to construct . From (1) (since are iid), it is reasonable to assume that is proportional to the number of ’s used for constructing . Since samples of are used in the construction of , we assume that the expected time variance product is .

###### Theorem 1.

Suppose that has the geometric distribution given in (2). Then the following hold true.

• The success probability is a strictly concave function of with a maximum value of attained at , and

 limw↘0pw=limw↗2EZ/EZ2pw=0.
• The variance is a strictly increasing convex function of , with

 limw↘0Varˆβ(w) = 0 and limw↗2EZ/EZ2Varˆβ(w) = ∞.
• The expected time variance product is a strictly increasing function of , with

 limw↘0EN(w)Varˆβ(w)=VarZ(EZ)4, % and limw↗2EZ/EZ2EN(w)Varˆβ(w)=∞.
###### Remark 1.

To understand the implications of Theorem 1, suppose we select a , giving an expected computational cost of to obtain . Further, let be the sample mean of iid copies of . Then the expected time variance product for is

 kEN(w)Var¯¯¯βk(w)=EN(w)Varˆβ(w).

Now suppose that is selected so that the average cost to generate one outcome of is equal to the average cost to construct ; that is, . Then, from Theorem 1 (iii), for the same computational effort, has a smaller variance than for any feasible selected as above, and is therefore a better estimator.

To illustrate the results of Theorem 1, consider an example where for an event with probability . Since , the relative variance . By substituting the values of and , we can calculate , and for any . Figure 1 illustrate the effect of on the efficacy of the estimator . As expected, both and are decreasing as with the limits  and , respectively. ∎

Remark 1 motivates us to study the asymptotic distributional properties of as , when has the geometric distribution given in (2). Theorem 2 is crucial for establishing confidence intervals that are asymptotically valid as .

###### Theorem 2.

Suppose that has the distribution given by (2) and is bounded. Then, as ,

 (i)ˆβ(w)d⟶β, and (ii)(ˆβ(w)−β)√wEZd⟶(√VarZ(EZ)4E(1))N(0,1),

where denotes convergence in distribution, and and

are independent random variables from respectively the standard (mean 1) exponential and standard normal distributions.

We show later in Section A.1 that as (see (8)). Since and , as . That means, an alternative expression for Theorem 2 (ii) is

 √EN(ˆβ(w)−β)d⟶σ√E(1)N(0,1), asw↘0,

where . The above expression has more resemblance to the standard central limit theorem, since is the computational cost of the estimator. It is not difficult to show that is a random variable with density , which is the density of a mean zero Laplace

(or double exponential) distribution with scale

. These observations are useful for constructing asymptotically valid confidence intervals as follows. For any given , by solving for , we get . Then using Theorem 2, we can say that the interval

 (ˆβ(w)+log(α)√2σ√wEZ,ˆβ(w)−log(α)√2σ√wEZ)

is an asymptotic confidence interval for .

###### Remark 2 (Comparison with the ratio estimator).

A standard (biased) estimator of is , where is the sample mean of iid copies of . Using Taylor’s theorem for the function about , we can easily show that the bias of is approximately for large , while, on the other hand, has zero bias. Furthermore, using the same Taylor’s theorem, we can show that the asymptotic time variance product of is as . From Theorem 1 (iii), the unbiased estimator has the same asymptotic expected time variance product. However, unbiasedness of comes at cost. As an application of the delta method, we can show that the ratio estimator satisfies the central limit theorem: That is, the ratio estimator is asymptotically normal. On the other hand, the asymptotic distribution of the unbiased estimator is Laplace, which has more slowly decaying tails than a normal distribution. In conclusion, the ratio estimator can have narrower confidence intervals than the unbiased estimator.

###### Remark 3 (Importance sampling).

Just like in the case of the ratio estimator, from Theorems 1 and 2, the relative variance of is the key factor influencing the asymptotic properties of the unbiased estimator. The smaller the value is of the relative variance of , the better is the reliability of the unbiased estimator. One of the most effective technique of variance reduction is importance sampling. We can improve the performance of the estimator if we can implement an importance sampling technique for the random variable .

###### Remark 4 (The time variance product minimizing distribution for N(w)).

We have assumed that for a given the random variable has the variance minimizing distribution given by (2). However, when the criteria for the optimality of is the minimization of the expected time variance product, we need to seek a distribution that minimizes . It is shown in Blanchet et al. (2015) that the distribution that minimizes the expected time variance is given by

 ˜qn=w(1−pw)n√β2+dwn,n≥0, (4)

where is the unique (positive) number satisfying When compared to the distribution (2), drawing samples from (4) has an extra difficulty of finding by solving an equation that contains an infinite sum. Even if we overcome this difficulty, the reduction in the expected time variance product is typically insignificant, because for small values of ,

 E˜N(w)Var˜β(w)=EN(w)Varˆβ(w)(1+O(w)), (5)

when has the variance minimizing distribution (2); see Section A.3 for a proof of (5). ∎

## 3 An Implementation

Recall that the success probability of is a function of unknown quantities and . However, fortunately, in (1) is still an unbiased estimator of for any distribution of . In particular, instead of taking as in (2), we can take , where is defined below. Under the proposed implementation, when the given budget is sufficiently large, half of the budget is used for estimating and then is chosen such that the remaining half the budget is used for generating a sample of the unbiased estimator.

To simplify the discussion, assume that there is a known constant ; for example, if for a constant , we can take . Let be a sequence of iid copies of , independent of the sequence , which is used in the construction of the unbiased estimator in (1

). Define the first two sample moments:

and . If , define,

 Pk=1− ⎷1kk∑i=1(1−wk˜Zi)2withwk=min(1kM1(k),M1(k)M2(k),ε).

Otherwise, take and . The condition guarantees that . Further, whether or not, and hence it guarantees that the estmator (defined by (1)) with is an unbiased estimator of . Note that given and , the expected cost to construct to is (including the cost to construct ), since . Theorem 3 states that for large values of , this total expected cost is approximately , and conditioned on and , the expected time variance product goes to a random variable with mean . See Section A.4 for a proof Theorem 3.

###### Theorem 3.

Under the above construction, , and as , , and

 (k+1Pk)Var(ˆβ(wk)∣∣M1(k),M2(k))⟶2VarZ(EZ)4[1+χ21],a.s.,

where is the square of a standard normal random variable.

To understand Theorem 3, consider the example given in Remark 1. We estimated the expected total cost and using samples of and , respectivley, with . Our simulation results show that the estimated expected time relative variance product is , which is approximately equal to , as expected.

## 4 Conclusion

We investigated the theoretical properties of a parametrized family of unbiased estimators of for a non-negative random variable . We studied the variance and the expected time variance product as functions of and established several asymptotic results. We showed that with an optimal choice of , the asymptotic performance of the unbiased estimator is comparable to the maximum likelihood (biased) ratio estimator. We further proposed an implementable unbiased estimation based on our results. Similar to Theorem 2, our ongoing research establishes a central limit theorem type convergence result for defined in Section 3, by taking the budget parameter .

## Appendix A Appendix

To simplify the notation in this section, we use and . We also write for the derivative and for the second derivative. and denote respectively the mean exponential distribution and the geometric distribution on non-negative integers and the success probability .

### a.1 Proof of Theorem 1

From the definition, It follows that where the strict inequality holds because , which follows from the assumption that is non-degenerative. Therefore, is strictly concave over and it achieves its maximum value at . From the definition of , it is evident that .

Recall from (3) that the variance of the estimator is equal to . Its derivative can be written as

 dVarˆβ(w)dw=2wwz1−pwp3w(1−pw) (6)

and the second derivative as

 d2Varˆβ(w)dw2=2(wz1−pw)p3w(1−pw)3(3(wz1−pw)+2pw(pw−w2z2)). (7)

Using Jensen’s inequality, where the strict inequality holds again because is non-degenerative. On the other hand, by Bernoulli’s inequality,

 √E(1−wZ)2=√1+(−2wz1+w2z2)

is maximized by Thus,

 z1−wz2/2≤pww

Using (8), we have and , and hence for all , and which establishes the convexity of over .

We now prove that the expected time variance product is a strictly increasing over . Let , and Then,

 dg1dw(w)=wz2−z1p2w(1−pw), and dg2dw(w)=2wz21[wz1−pwp3w(1−pw)],

and hence, we write

 dgdw(w) =2wz21(1−pw)(wz1−pw)+(w2z21−p2w)(wz2−z1)p4w(1−pw) =wz1−pwp4w(1−pw)[wpw(z2−z21)+z1(wz1−pw)+wz1(wz2−z1pw)]>0,

where the inequality holds because , and . Therefore, is strictly increasing over .

The claims that and hold trivially because . To complete the proof of the theorem, we can write, by Taylor’s theorem, for any : for some Consequently,

 √E(1−wZ)2=1+E(1−wZ)2−12−(E(1−wZ)2−1)28+R(w),

where for some . Since as and , we have . Further simplification yields that and thus, Since ,

 1−pwpw=1wz1(1+O(w)), and w2z21p2w−1=w(z2−z21)z1(1+O(w)). (9)

We conclude that both and go to their respective minima as .

### a.2 Proof of Theorem 2

Statement (i) follows directly from Theorem 1 and Chebyshev’s inequality:

 P(|ˆβ(w)−β|>ϵ)≤Varˆβ(w)/ϵ2→0, as w↘0,

for every . To prove (ii), consider a decreasing sequence such that and . We construct an almost surely increasing sequence such that and

 limk→∞[wkNk]=X∞/z1,a.s., (10)

for a random variable . To do so we invoke Theorem 3.1 of Moka and Juneja (2015). Let and . Then, Moka and Juneja (2015) says that for each , there exist a random variable

 Gk(x)=1−(1−λk+1λk)exp(−λk+1x),x≥0,

such that is independent of , and has the same distribution as . Therefore, without loss of generality we assume that there is sequence of independent random variables such that for all .

Consider the natural filtration . Since ,

 λk+1Ek+1−1 =λk+1[Ek+Yk−1/λk−EYk]=λk+1λkλk[Ek−1/λk]+λk+1[Yk−EYk] ≤λkEk−1+λk+1[Yk−EYk],

where the last inequality holds because . We have since is independent of . Thus, is a supermartingale (with respect to ). In fact the sequence is bounded in , because , making it uniformly integrable submartingale. Thus, exists (see Theorem 2 in Section 4 of Chapter VII of Shiryaev (1996)). Since This implies that .

Let . Then for all we have and . From (8), and hence . From the convergence of the sequence , we have . Therefore, (10) holds.

Now define

 ˆβk:=wk(1−pwk)NkpwkNk∏i=1(1−wkZi). (11)

From the definitions, is identical to in distribution. We now conclude the proof Theorem 2 by establishing lower and upper bounds on separately. Let be an upper bound on . From the construction of given by (11), for all such that , we have using (8) that

 ˆβ