 # The square root rule for adaptive importance sampling

In adaptive importance sampling, and other contexts, we have unbiased and uncorrelated estimates of a common quantity μ and the variance of the k'th estimate is thought to decay like k^-y for an unknown rate parameter y∈ [0,1]. If we combine the estimates as though y=1/2, then the resulting estimate attains the optimal variance rate with a constant that is too large by at most 9/8 for any 0< y< 1 and any number K of estimates.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

A useful rule for weighting uncorrelated estimates appears in an unpublished technical report (Owen and Zhou, 1999). This paper states and proves that result and discusses some mild generalizations.

The motivating context for the problem is adaptive importance sampling. There is an unknown quantity and we have unbiased and uncorrelated measurements of it, denoted for . In adaptive importance sampling, could be the result of an importance sampler chosen on the basis of the data that produced . The estimates are uncorrelated by construction but are not independent. The variance of is affected by the prior sample values. We have in mind a setting where each is based on the same number of function evaluations but we do not need to use that in our analysis. Our setup could also be reasonable in settings where evaluations are used to construct .

In an adaptive method with one pilot estimate and one final estimate, and is usually large. In other settings, each could be based on just one evaluation of the integrand (e.g., Ryu and Boyd (2014)) and then would typically be very large. In intermediate settings we might have something like estimates based on perhaps thousands of evaluations each. For instance, the cross-entropy method (De Boer et al., 2005) might be used this way.

We estimate by where . Then and to minimize we should take . The problem is that we do not know

. We might have unbiased estimates

of , but taking will yield an estimate that is no longer unbiased. The bias is potentially serious in rare event problems. When the ’th sample fails to include the rare event much or at all, then and are both likely to be small, whereas obtaining more than the expected number of rare events makes it more likely that both are large. That is, we anticipate a negative correlation between and . Then the small values of will be upweighted while the large ones will be downweighted resulting in a downward bias for . In estimating the probabilty of a rare event, we might even obtain making sample variances completely unusable. For background on importance sampling for rare events, see L’Ecuyer et al. (2009). The AMIS algorithm of Cornuet et al. (2012) uses a weighted combination of estimates with weights that are correlated with those estimates and that complicates even the task of proving consistency.

Suppose that for some unknown with . The value is a model for importance sampling where adaptation brings no benefit. The value is a model for a setting where importance sampling is working very well, roughly as well as quasi-Monte Carlo sampling (Dick and Pillichshammer, 2010). The optimal choice is . Not knowing the true we might take for some . The square root rule from the appendix of Owen and Zhou (1999) takes . It never has variance more than times that of the unknown best weighting rule when for some .

This paper is organized as follows. Section 2 presents our notation and some of the context from adaptive importance sampling. Section 3 states and proves our main result using two lemmas. Lemma 2 there has an inequality that we had originally described as ‘straightforward algebra’. Revisiting the problem now we find the inequality to be quite delicate. Section 4 includes some remarks on generalizations to settings where even better convergence rates, including exponential convergence, apply.

## 2 Notation

Step of our adaptive importance sampler generates data . We let denote the data from all prior steps, with being empty. We assume that our estimates satisfy

 E(^μk∣Zk)=μ,andVar(^μk∣Zk)=σ2k<∞,k=1,…,K. (1)

In Owen and Zhou (1999),

was an importance sampled estimate of an integral over the unit cube, sampling from a mixture of products of beta distributions whose parameters were tuned to the prior data in

.

We combine the estimates via

 ^μ=K∑k=1ωk^μk

where . Then so . For ,

 Cov(^μk,^μℓ)=E(E((^μk−μ)(^μℓ−μ)∣Zℓ))=E((^μk−μ)E(^μℓ−μ∣Zℓ))=0.

Therefore

 Var(^μ)=K∑k=1ω2kVar(^μk)

The conditional variances are random. If stage of importance sampling has or more observations in it, then we can ordinarily construct conditionally unbiased variance estimates . That is,

 E(^σ2k∣Zk)=σ2k.

Then

 E(K∑k=1ω2k^σ2k)=K∑k=1ω2kE(^σ2k)=K∑k=1ω2kVar(^μk)=Var(^μ).

We can thus combine results from the stages of the AIS algorithm and get an unbiased estimate of the variance. The underlying idea here is that is a martingale in (Williams, 1991). We will not make formal use of martingale arguments.

Now we introduce a model

 Var(^μk)=τ2k−y (2)

for and a constant of proportionality . Our estimate is

 ^μ=^μ(x)=K∑i=1ix^μi/K∑i=1ix.

The best choice is but is unknown. Our variance using is

 Var(^μ(x))=τ2K∑i=1i2x−y/(K∑i=1ix)2

and we measure the inefficiency of our choice by

 ρK(x∣y)≡Var(^μ(x))Var(^μ(y))=(∑Ki=1i2x−y)(∑Ki=1iy)(∑Ki=1ix)2. (3)

If the variance of decays as and, knowing that, we use , then . For the pessimistic value , the variance decays at the usual Monte Carlo rate in the number of steps. For an optimistic value , the variance decays at the rate , slightly better than (any ) which holds for randomly shifted lattice rules applied to functions of bounded variation in the sense of Hardy and Krause. See L’Ecuyer and Lemieux (2000) for background on randomized lattice rules.

## 3 Square root rule

Owen and Zhou (1999) proposed to split the difference between the optimistic estimate and the pessimistic one by using . The result is an unbiased estimate that attains the same convergence rate as the unknown best estimate with only a modest penalty in the constant factor.

###### Theorem 1.

For given by equation (3),

 sup1⩽K<∞sup0⩽y⩽1ρK(12∣∣y)⩽98. (4)

If and then

 sup0⩽y⩽1ρK(x∣y)>sup0⩽y⩽1ρK(12∣∣y). (5)

Equation (4) shows that the square root rule has at most more variance than the unknown best rule. Equation (5) shows that the choice is the unique minimax optimal one, when . When then does not depend on and in that case, . Before proving the theorem, we establish two lemmas.

###### Lemma 1.

For given by equation (3), with and ,

 sup0⩽y⩽1ρK(x∣y)={ρK(x∣1),x⩽1/2ρK(x∣0),x⩾1/2.
###### Proof.

The result holds trivially for because . Now suppose that . Then

 ∂2∂y2ρK(x∣y)=C−2×K∑i=1K∑j=1i2x−yjy(log(i)−log(j))2

for . Thus is strictly convex in for any when , and so .

By symmetry, . So if , then . The last step follows because and is a convex function of with its minimum at . This establishes the result for and a similar argument holds for . For , . ∎

We will use the following integral bounds, for integers . If , then

 K∑i=1ix >∫K+1/21/2vxdv=(K+1/2)x+1−(1/2)x+1x+1,and (6) K∑i=1ix <∫K+11vxdv=(K+1)x+1−1x+1. (7)

Equation (6) uses concavity of and is much sharper than the bound one gets by integrating over .

###### Lemma 2.

For given by equation (3), holds for any integer .

###### Proof.

Let . Then

 ρK+1ρK =(K+1)(K+2)S2KK2(SK+√K+1)2=(K+1)(K+2)K2(1+√K+1SK)2 >(K+1)(K+2)K2(1+√K+1((K+1/2)3/2−2−3/2)/(3/2))2 =(K+1)(K+2)(K+32Kf(K)−2−3/2/√K+1)2,

for

 f(K)=(K+1/2)3/2/√K+1.

Now has and . Therefore, if ,

 ρK+1ρK>(K+1)(K+2)(K+32KK+1/4−2−3/2/√8)2=(K+1)(K+2)(K+32KK+1/8)2. (8)

The numerator in (8) is while the denominator is

 K2+3KKK+1/8+94K2(K+1/8)2.

The numerator is larger than the denominator, establishing the theorem for . For we compute that for . ∎

###### Proof of Theorem 1.

Applying Lemmas 1 and 2,

 sup1⩽K<∞sup0⩽y⩽1ρK(12∣∣y)=limK→∞ρK(12∣∣1)=limK→∞K2(K+1)/2(∑Ki=1i1/2)2.

Using the integral bounds (6) and (7) we can show that the above limit is , establishing equation (4). Next, if , then

 sup0⩽y⩽1ρK(x∣y)=ρK(x∣1)=(∑Ki=1i2x−1)(∑Ki=1i)(∑Ki=1ix)2=K(K−1)2(∑Ki=1i2x−1)(∑Ki=1ix)2.

Ignoring the factor and taking the derivative yields

 ∑Ki=1∑Kj=1ixj2x−1(log(j)−log(i))(K∑i=1i2x−1)(K∑i=1ix). (9)

The numerator in (9) is of the form where . Furthermore is a decreasing function of while is increasing and so the expression in (9) is negative. Therefore is a decreasing function of making for . This establishes (5) for and the case of is similar. ∎

## 4 Generalization

We can generalize some aspects of the square root rule to other power laws. Suppose that the true rate parameter is known to satisfy for . We can then work with for . First we recall the definition of from (3):

 ρK(x∣y)=(∑Ki=1i2x−y)(∑Ki=1iy)(∑Ki=1ix)2.

Then and

 (10)
###### Lemma 3.

For given by equation (3), with and ,

 supL⩽y⩽UρK(x∣y)={ρK(x∣U),x⩽MρK(x∣L),x⩾M,

for .

###### Proof.

The proof is the same as for Lemma 1 because is convex in for any and any and . ∎

The inequalities in Lemma 2 are rather delicate and we have not extended them to the more general setting. Equation (10) is easy to evaluate for integer values of , and . For more general values, some sharper tools than the integral bounds in this paper are given by Burrows and Talbot (1984) who make a detailed study of sums of powers of the first natural numbers. We do see numerically that is nondecreasing with in every instance we have inspected. We can easily find the asymptotic inefficiency

 limK→∞ρK(M∣U)=limK→∞KLL+1KUU+1(KMM+1)2=(M+1)2(L+1)(U+1).

In cases with and hence we get . For instance, an upper bound at the rate corresponding roughly to asymptotic accuracy of scrambled net integration (Owen, 1997) leads to and an asymptotic inefficiency of at most

 (1+1)2(0+1)(2+1)=43.

There are problems in which the adaptive importance sampling error converges exponentially to zero. See for instance Kollman et al. (1999) as well as Kong and Spanier (2011). These examples involve particle transport problems through complicated media. It is reasonable to expect that each estimate will require a large number of observations and that will then be not too large.

Suppose that for some . Then the desired combination is

 ^μ(y)=∑Ki=1exp(iy)^μi∑Ki=1exp(iy).

Not knowing we use

 ^μ(x)=∑Ki=1exp(ix)^μi∑Ki=1exp(ix),

for some . If , then our inefficiency is

 γK(x∣y) ≡∑Ki=1exp((2x−y)i)∑Ki=1exp(yi)(∑Ki=1exp(xi))2=(e2x−y(eK(2x−y)−1)e2x−y−1)(ey(eKy−1)ey−1)(ex(eKx−1)ex−1)2.

If then the first factor in the numerator is . It can be disastrously inefficient to use . Some safety is obtained in the limit where . That is, in that limit, one simply uses the final and presumably best estimate. Then

 limx→∞γK(x∣y)=ey(1−e−Ky)ey−1⩽eyey−1.

For instance, if the variance is halving at each iteration, then and then taking is inefficient by at most a factor of . Repeated ten-fold variance reductions correspond to and a limiting inefficiency of at most . The greatest inefficiency from using only the final iteration arises in the limit where the factor is . In this setting, the user is not getting a meaningful exponential convergence and even there the loss factor is at most and, as remarked above, that is not likely to be large.

## Acknowledgments

This work was supported by the US NSF under grants DMS-1521145 and IIS-1837931.

## References

• Burrows and Talbot (1984) Burrows, B. L. and Talbot, R. F. (1984). Sums of powers of integers. The American Mathematical Monthly, 91(7):394–403.
• Cornuet et al. (2012) Cornuet, J., Marin, J.-M., Mira, A., and Robert, C. P. (2012). Adaptive multiple importance sampling. Scandinavian Journal of Statistics, 39(4):798–812.
• De Boer et al. (2005) De Boer, P.-T., Kroese, D. P., Mannor, S., and Rubinstein, R. Y. (2005). A tutorial on the cross-entropy method. Annals of operations research, 134(1):19–67.
• Dick and Pillichshammer (2010) Dick, J. and Pillichshammer, F. (2010). Digital sequences, discrepancy and quasi-Monte Carlo integration. Cambridge University Press, Cambridge.
• Kollman et al. (1999) Kollman, C., Baggerly, K., Cox, D., and Picard, R. (1999).

Adaptive importance sampling on discrete Markov chains.

Annals of Applied Probability

, pages 391–412.
• Kong and Spanier (2011) Kong, R. and Spanier, J. (2011). Geometric convergence of adaptive Monte Carlo algorithms for radiative transport problems based on importance sampling methods. Nuclear Science and Engineering, 168(3):197–225.
• L’Ecuyer and Lemieux (2000) L’Ecuyer, P. and Lemieux, C. (2000). Variance reduction via lattice rules. Management Science, 46(9):1214–1235.
• L’Ecuyer et al. (2009) L’Ecuyer, P., Mandjes, M., and Tuffin, B. (2009). Importance sampling and rare event simulation. In Rubino, G. and Tuffin, B., editors, Rare event simulation using Monte Carlo methods, pages 17–38. John Wiley & Sons, Chichester, UK.
• Owen (1997) Owen, A. B. (1997). Scrambled net variance for integrals of smooth functions. Annals of Statistics, 25(4):1541–1562.
• Owen and Zhou (1999) Owen, A. B. and Zhou, Y. (1999). Adaptive importance sampling by mixtures of products of beta distributions. Technical report, Stanford University.
• Ryu and Boyd (2014) Ryu, E. K. and Boyd, S. P. (2014). Adaptive importance sampling via stochastic convex programming. Technical report, arXiv:1412.4845.
• Williams (1991) Williams, D. (1991). Probability with martingales. Cambridge University Press, Cambridge.