 # Finite-sample Guarantees for Winsorized Importance Sampling

Importance sampling is a widely used technique to estimate the properties of a distribution. The resulting estimator is always unbiased, but may sometimes incur huge variance. This paper investigates trading-off some bias for variance by winsorizing the importance sampling estimator. The threshold level at which to winsorize is determined by a concrete version of the Balancing Principle, also known as Lepski's Method, which may be of independent interest. The procedure adaptively chooses a threshold level among a pre-defined set by roughly balancing the bias and variance of the estimator when winsorized at different levels. As a consequence, it provides a principled way to perform winsorization, with finite-sample optimality guarantees. The empirical performance of the winsorized estimator is considered in various examples, both real and synthetic. The estimator outperforms the usual importance sampling estimator in high-variance settings, and remains competitive when the variance of the importance sampling weights is low.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Let and

denote two probability measures on a set

with some

-algebra, and suppose they admit probability density functions

and . Let be a measurable function, integrable with respect to . The goal is to estimate

 θ=Ep[f(X)]=∫Xf(x)p(x)dx.

Let be a sequence of

-valued random variables with law

. The importance sampling estimator for is defined as

 ^θn=1nn∑i=1f(Xi)p(Xi)q(Xi), (1)

with being the target distribution, and the proposal or sampling distribution. As long as whenever , this estimator is unbiased:

While importance sampling has many applications, from rare-event simulations to Bayesian computation, it can fail spectacularly with a poor choice of sampling distribution . Indeed, because is a ratio of random variables, it can exhibit enormous or even infinite variance, leading to possibly terrible estimates.

Consider the following simple example: let with , and , where . The variance of the importance sampling estimator is given by

 V[^θn]=E[^θ2n]−E2[^θn]=1n∫f2(x)p2(x)q2(x)q(x)dx=1nσ∫x2e−x2(1−12σ2)√2πdx,

which is infinite if . Thus, while the estimator is unbiased, one would be hard-pressed to trust it. In fact, with and as many as observations over different simulations, the maximum estimate obtained was for the true (the mean estimate over all simulations for was

, with standard deviation

).

In more complex examples, this issue can become even more pernicious. For instance, in many cases finding a good proposal distribution might be too hard, or sampling from it might be too computationally intensive. Section 2 considers an important class of examples where different but reasonable choices for the sampling distribution lead the method astray.

A relevant question in this context is the extent to which the variance of the terms

 Yi=f(Xi)p(Xi)q(Xi)

can be controlled. A straightforward possibility is to winsorize them. That is, define the random variables censored at levels and by

 YMi=max(−M,min(Yi,M)).

With this notation, the usual importance sampling can be rewritten , while the winsorized importance sampling estimator at level is

 ^θMn=1nn∑i=1YMi.

Note indexes a bias-variance trade-off. As the threshold level is decreased for the winsorized estimator , the variance lessens as the bias grows. Indeed, the sample mean has zero bias with potentially enormous variance, while winsorizing with produces the constant estimator . This trade-off will be considered more carefully in Section 3, but clearly a good choice of threshold is crucial for winsorized importance sampling to work at all.

To pick an optimal threshold level, the Balancing Principle (also known as Lepski’s Method) is adopted. This method adaptively selects a threshold level, among a predefined set, that automatically winsorizes more when the variance of the estimator is high compared to bias, and less (or not at all) when the variance is comparatively lower. It also yields finite-sample optimality guarantees for the winsorized estimator.

The main result has the following form:

###### Theorem 1.

Let be random variables distributed iid with mean . Consider winsorizing at different threshold levels in a pre-chosen set to obtain winsorized samples , . Pick the threshold level according to the rule

 M∗=min{M∈Λ : ∀M′,M′′≥M,|¯¯¯¯¯¯¯¯¯¯YM′−¯¯¯¯¯¯¯¯¯¯¯YM′′|≤c⋅t√n−t⋅(^σM′+^σM′′2)}, (2)

where are chosen constants, , and . Let be such that for all . Then, with probability

 1−2|Λ|⎛⎝1+50K√n−Φ⎛⎝t√n(√n−t)2+t2⎞⎠⎞⎠,

it holds

 |¯¯¯¯¯¯¯¯¯¯YM∗−θ|≤CminM∈Λ{|E[YMi]−θ|+t√n−t^σM},

where can be made less than .

Intuitively, the theorem says that, given a set of pre-chosen threshold values for winsorization, picking one according to the decision rule (2) guarantees, with high probability, that the error of the procedure is bounded by a constant multiple of the optimal sum of bias and sample standard deviation, among all threshold levels considered. This ensures the error incurred by winsorizing is not too large in terms of either unobserved bias or observed variance. The user is free to choose parameters and , and while is often unknown, it is not expected to be large in a setting where the variance of is assumed to be unwieldy to begin with. The theorem is discussed in more detail in Section 5. Section 6 considers its performance over many examples.

Review of the literature. Proposals for winsorizing importance sampling weights have been considered before. [Ionides, 2008] analyzes an optimal way to pick the threshold value under asymptotic considerations, and also suggests an adaptive thresholding procedure relying on an unbiased risk estimate. Similarly, [Sen et al., 2017] proposes adaptively capping the importance sampling weights according to an estimate of the variance. [Northrop et al., 2017] suggests using cross-validation to pick a threshold value for winsorizing a heavy-tailed sample, an approach related to the used in Section 6 as a baseline.

More generally, [Liu and Lee, 2016] considers modifying the importance weights by minimizing a kernelized Stein discrepancy, while [Delyon et al., 2016] investigates the improvement on rates of convergence when changing the weights using a leave-one-out kernel estimator. Several semi-parametric approaches also exist, estimating the tail distribution using exponential tilting ([Fithian and Wager, 2014]), Pareto distributions ([Johansson, 2003], [Vehtari et al., 2015]

) and Bayesian parametric modeling (

The main theoretical tool in picking a threshold value is based on the Balancing Principle, also known as Lepski’s Method. For an overview, see [Mathé, 2006] and [Hämarik and Raus, 2009]. This method has also been applied to different statistical procedures, for instance, the Lasso ([Chichignoud et al., 2016]), and to other ill-posed problems in numerical analysis ([Lazarov et al., 2007], [Li and Werner, 2017]).

To empirically verify the performance of the method, several examples, proposed elsewhere in the literature, are studied. Notably, the complete self-avoiding walk problem was first suggested in [Knuth, 1976]. A theoretical analysis in the case of monotone paths is given in [Bassetti and Diaconis, 2006], and analyses of other particular cases are in [Bousquet-Mélou et al., 2005]. [Bousquet-Mélou, 2014] gives further theoretical and empirical results, in particular laying out efficient ways of sampling paths that never get trapped. We also test the procedure on the set of importance sampling examples proposed in [Vehtari et al., 2015], with slight modifications.

Paper organization. The paper is structured as follows. Section 2 presents a motivating but common example where obtaining a good sampling distribution is hard, and winsorizing becomes a promising alternative. Section 3 investigates how winsorizing benefits the importance sampling estimation problem. Section 4 states and proves the Balancing Theorem, which is the main theoretical result powering the finite-sample results. Because the assumptions of the theorem only hold probabilistically, Section 5 uses a

-statistic Central Limit Theorem to bound the required probabilities, thus providing a full proof of Theorem

1 above. Finally, Section 6 applies the outlined procedure to real and simulated datasets, and compares it with both the traditional importance sampling estimator, and a winsorized estimator thresholded at a level chosen via cross-validation. Further research directions appear in Section 7.

## 2 Motivating Example: Self-avoiding Walks

In 1976, Don Knuth [Knuth, 1976] considered the following question: how many complete self-avoiding walks are there on a grid? A complete self-avoiding walk on an grid is a path that does not intersect itself, starting at point and ending at point (see Figure 1 for the case ). While it is theoretically possible to answer this question by counting every path, this quickly becomes intractable for large . Instead of counting, Knuth proposed using sequential importance sampling to estimate this quantity.

Let denote the number of complete self-avoiding walks (CSAW) on an grid, and let be the probability density function of sampling a complete self-avoiding walk uniformly at random, with an indicator function for whether the path is a complete self-avoiding walk. Since sampling from is hard, Knuth considered another distribution that is easy to sample from, and such that whenever . Since

 Z=∫Z p(x)dx=∫{x:q(x)>0}Z p(x)q(x)q(x)dx=∫{x:q(x)>0}I[x is CSAW]q(x)q(x)dx,

can be estimated by generating sample paths and using the importance sampling estimator

 ^Z=1nn∑i=11q(xi)I[xi is% CSAW]. (3) Figure 1: Two complete self-avoiding walks on a 10×10 grid (blue), and a walk that is forced to intersect with itself before reaching (10,10) (red).

For a distribution that is easy to sample from and whose density is known, Knuth suggested a sequential procedure. Start at point ; since it has two neighbor points available, jump to each with probability (see Figure 2). The new point also has two available neighbors, so again pick one uniformly among them. Keep advancing to new points by picking among available neighbors until the path gets to the point . Here, “available neighbors” are points that have not been visited before, and which don’t lead to a path that is forced to intersect with itself to get to , as the red walk in Figure 1 (the only available neighbor to the yellow point in the path is the one above it). Let denote the number of available neighbors on step , each neighbor is picked with uniform probability , and define the probability of the entire path as , with the length of path . Figure 2 provides an example. It is not hard to see this constitutes a probability measure on the self-avoiding walks. Thus, to estimate , generate many independent paths with this sequential procedure, calculate the number of available neighbors at each step and path , and set

 ^Z=1nn∑i=11q(xi)I[xi is% CSAW]=1nn∑i=1(li∏j=1dj,i)I[xi is CSAW]. (4)

This estimate is just the average of the product of the available neighbors at each step. Figure 2: Counting the number of available neighbors dj at each point j: d1=2, d14=1, and the product is d=∏30j=1dj=14×210×316; assign probability 1/d=1−4×2−10×3−16 to the path.

Using such a procedure and a few thousand simulations, Knuth estimated the number of paths on a grid to be . It was remarkably close: the actual answer was later found to be (see [Knuth, 1995], page 57). With similar ideas, he also calculated the average path length to be , and the number of paths through to be .

An appropriate choice of distribution is integral to this successful story. Consider a very similar distribution which allows the walk to get trapped: when picking available neighbors it only avoids those that have already been visited, and stops once hitting or no more such neighbors are available (the red path in Figure 1, for instance, could be sampled from ). This is a computationally convenient choice, in that any neighbor that has not been visited is an available neighbor for , without worrying about moves that lead to trapped walks. Since implies , the importance sampling estimator is still valid and unbiased for . The only difference from is that when a walk does get trapped, it contributes with zero weight to the estimator.

Despite being easier to sample from and retaining unbiasedness, this slight modification to the proposal distribution is enough to drastically upset the performance. Now, with as many as simulated paths, the mean estimate is with standard deviation . Estimating zero paths is well within one standard deviation of the mean!

The reason for such extreme disparity is the distribution of the importance sampling weights . Figure 3

shows how much more uneven the distribution of weights become when traps are allowed. Because walks routinely get trapped, many weights are just zero, and as the procedure is still unbiased, this is compensated by the appearance of enormous, albeit rare, weights. While these outliers exist in both cases, the variance of the estimator with traps is much bigger. Figure 3: Distribution of the importance sampling weights when walks that might get trapped are allowed or not; a jitter was added to the points. The red line denotes the true underlying mean; there are bigger weights in both cases, not shown here.

Hence, the trapped estimator becomes close to useless due to the enormous increase in variance. Generally, this underscores the importance of good proposals . In many cases, however, it is not possible to find a better proposal distribution, or sampling from it could be too computationally inefficient. For instance, for complete self-avoiding walks, detecting which moves lead to a trapped walk may not be obvious at first (see [Bousquet-Mélou, 2014] for how to do it in the current two-dimensional case; in higher dimensions this is an open research problem). In such instances, the high variance of the traditional importance sampling estimator is too high a price to pay.

This suggests considering biased estimators, unlike (3), at the expense of drastically reduced variance. This paper investigates the simplest of such estimators: a winsorized, or censored, version of the usual importance sampling weights in (3) at levels and .

The Balancing Principle suggests picking among a pre-chosen set by using the decision rule (2), with some probabilistic guarantees. For instance, suppose . The maximum thresholding value allowed, , is big enough so no random variable is censored, while the minimum, , censors enough to seriously bias the result (recall in this setting ). The Balancing Principle considers both bias and variance when settling on one of these levels.

The improvements this method offers are significant. With a bad proposal distribution such as , the usual importance sampling estimator has a mean-squared error (MSE) of about , while the ones winsorized with the Balancing Principle recorded only , an improvement of an order of magnitude. With a good proposal distribution such as , both methods are comparable, with the importance sampling MSE of and the winsorized importance sampling MSE of . Section 6 investigates this example further, among many others.

## 3 Winsorized Importance Sampling

Recall from Section 1 we denote , so the traditional importance sampling estimator and the -winsorized version are respectively given by

 ^θn=1nn∑i=1Yi,^θMn=1nn∑i=1YMi.

Whenever the meaning is clear from the context, the in and might be omitted.

The effectiveness of an estimator under squared error loss can be understood via the usual bias-variance decomposition

 E[(~θn−θ)2]=(E[~θn]−θ)2+V[~θn]. (5)

Winsorizing is a straightforward way to trade-off an increase in bias for a reduction in variance. As the threshold level is decreased for the winsorized estimator

, the variance (and other central moments of higher order) lessens at the expense of a higher bias. This is proved as a lemma below.

###### Lemma 2.

Let be a collection of independent and identically distributed random variables with mean . Denote the -winsorized version of each by , for . If and , then, in terms of bias,

 |E[^θMn]−θ|≥|E[^θn]−θ|=0,

while in terms of variance

 V[^θMn]≤V[^θn].

That is, the winsorized estimator exhibits higher bias and lower variance than . In fact the winsorized estimator has all centered moments at least as big as the non-winsorized one.

###### Proof.

For the bias, it is straightforward to see that , while

 E[^θMn]=E[YM1]=θ+E[YM1−Y1]=θ+E[(M−Y1)I[Y1>M]+(−M−Y1)I[Y1<−M]],

and so if does not have a symmetric distribution centered around and is sufficiently small, it holds . This in turn means the estimator has a higher bias.

For the variance, note and , so it remains to be shown that . Let be -winsorization applied to . By the convexity of , , for all ,

 (y−E[Y1])p−(yM−E[YM1])p≥p(yM−E[YM1])p−1(y−E[Y1]−yM+E[YM1]). (6)

Note is a non-decreasing function of , ranging from to . Let be the smallest such that .

If , . Also, , so . Thus,

 (y−E[Y1])p−(yM−E[YM1])p≥p(yM0−E[YM1])p−1(y−E[Y1]−yM+E[YM1]). (7)

If , , but also , so again (7) holds.

Thus, as (7) holds for all , take expectations over on both sides to obtain

 E[(Y−E[Y1])p]−E[(YM−E[YM1])p]≥0.

With , this is just

 V[Y1]=E[(Y1−E[Y1])2]≥E[(YM1−E[YM1])2]=V[YM1],

and higher values of give the inequality for other centered moments. This argument can be generalized further to non-symmetric forms of winsorization and conditional expectations, see [Chow and Studden, 1969]. ∎

As a straightforward corollary, winsorizing a symmetric random variable around the actual mean is always be beneficial, as the variance is reduced without an increase in bias. Since the mean is assumed unknown, however, this cannot be turned to a practical recommendation.

How to select an appropriate level ? If the actual bias and variance incurred by the importance sampling estimator, and , were known, one could pick the thresholding level to exactly balance out the increasing effects winsorizing has on the bias and the decreasing effects it has on the variance. Similarly, one could consider using the sample versions of these quantities: to choose between two thresholding levels , if the increase in bias

is small enough relative to the standard error, then it might be worth winsorizing the sample at the lower level

.

While this is a seemingly reasonable approach, it is by no means guaranteed to work. In particular, such a method would require estimating the standard deviation in a setting where estimating the mean is hard enough already. To make this idea useful, an actual procedure to determine the optimal threshold level is needed, as well as theoretical guarantees that the error of such a procedure do not grow too large. This will be developed next.

## 4 Balancing Theorem

This section states and proves the Balancing Theorem, the main theoretical tool delivering finite-sample optimality guarantees when winsorizing importance sampling. The idea, related to oracle inequalities, is to pick one threshold level among a finite, pre-chosen set of them, while ensuring an error of at most a constant away from the error roughly incurred by the optimal choice . This general scheme is known by many names, including the Balancing Principle and Lepski’s Method.

Consider estimators of , indexed by a parameter in a finite set , and assume the bound for all , where is an unobserved but non-increasing function in , and is observed but non-decreasing. The goal is a selection rule to pick such that , with a small constant . In the case of interest, will denote the importance sampling estimator when winsorizing the sample at , while and will be taken as proportionals to its bias and sample standard deviation.

###### Theorem 3 (Balancing Theorem).

Suppose is an unknown parameter, is a sequence of estimators of indexed by , with a non-empty finite set. Additionally, suppose that for each , where is an unknown but non-increasing function in , while is observed and non-decreasing in . Choose , and take

 M∗=min{M∈Λ:∀M′,M′′ with M′,M′′≥M, |^EM′−^EM′′|≤c(^s(M′)+^s(M′′)2)}. (8)

Then,

 |^EM∗−θ|≤CminM∈Λ{b(M)+^s(M)}, (9)

where can be made less than 4.25.

###### Proof.

First, note is always a candidate for , so it is well defined. The goal is to show there exists such that, for all , . For this consider two cases.

(i) First, consider any fixed such that . Then, by definition of , and since is non-decreasing in ,

 |^EM∗−^EM|≤c⋅(^s(M)+^s(M∗)2)≤c^s(M).

Also, by assumption , so, from the triangle inequality,

 |^EM∗−θ|≤|^EM∗−^EM|+|^EM−θ|≤c^s(M)+b(M)+^s(M)≤b(M)+(1+c)^s(M).

This proves the case .

(ii) Now suppose . In that case, since is finite, consider the sequence of elements in in between, namely . Then, there exist such that , where either or (otherwise, this contradicts the minimality of ). Without loss of generality, assume . Note this implies , otherwise gives a contradiction: . Hence,

 |^EM′−^EMj−1|=|^EM′−^EM′′|>c(^s(M′)+^s(M′′)2). (10)

Also, by the triangle inequality,

 |^EMj−1−^EM′|≤|^EMj−1−θ|+|^EM′−θ|≤(b(Mj−1)+b(M′))+(^s(Mj−1)+^s(M′)), (11)

so, combining (10) and (11) yields

 c(^s(M′)+^s(M′′)2) <|^EMj−1−^EM′|≤(b(Mj−1)+b(M′))+(^s(Mj−1)+^s(M′)). (12)

Recall that , so

 b(M′)≤b(M∗) ≤b(Mj−1)≤b(M) (13) ^s(M′)≥^s(M∗) ≥^s(Mj−1)≥^s(M), (14)

so from (12)

 (c2−1)(^s(M′)+^s(Mj−1))≤b(Mj−1)+b(M′)(???)≤2 b(M),

and, because and ,

 ^s(M′)≤4c−2b(M). (15)

Finally,

 |^EM∗−θ|≤b(M∗)+^s(M∗)(???)≤b(M)+^s(M′)(???)≤(1+4c−2)b(M),

which implies

 |^EM∗−θ|≤(1+4c−2)(b(M)+^s(M)),

so the second case is proved. Note . If , then . ∎

Some remarks about the theorem are collected below.

First, with datapoints and possible truncation values, the number of operations is . To see this, note can be obtained by first ordering the threshold values, initializing , and decreasing while the condition in (8) is satisfied. See Algorithm 1. Sorting the values takes , calculating sample means and variances takes and there are comparisons to be made. One can adapt the procedure to be of order by only considering the condition in line 4 of Algorithm 1 for , but this increases by a multiplicative factor of .

Also, while the size of is not relevant to the guarantees of the Balancing Theorem, it is clear from the previous paragraph that finding becomes harder.

The condition in (8) for decreasing the threshold value from to can be written as , with . The proof also holds for other functionals . For example, take . In fact, this gives an even better constant, namely instead of . However, this comes at the expense of making the procedure winsorize more aggressively. Empirically, seems a more robust alternative.

Finally, consider the choices for the functions and in the importance sampling setting. Let denote the usual importance sampling weights, and take to be the usual estimator when winsorizing the sample at , that is,

 ^EM=¯¯¯¯¯¯¯¯¯YM=1nn∑i=1YMi.

The goal is to choose to balance both the bias and the variance of the estimator, and obtain an estimate . Thus, it makes sense to take

 b(M) =bias(M)=|E[YM1]−E[Y1]|, (16) ^s(M) =t√n−t^σM, (17)

with a constant, , and . The reason for the factor is discussed in the next section. For now, note the theorem gives a natural bound on , controlling it by both bias and a multiple of the sample standard deviation. Recall the sum of bias and standard deviation is connected to the squared error risk via (5).

With the choices and , picking a scale amounts to implicitly setting how much to weight the importance of bias versus variance. For example, if then , and the procedure always selects the threshold , so it censors the data as little as possible. This is a consequence of weighting bias infinitely more than variance. The scale is also important in a more subtle way. The Balancing Theorem assumes

 |^EM−θ|≤b(M)+^s(M),∀M∈Λ, (18)

and since and are random, this can only hold probabilistically. Higher values of make these hypotheses more likely, but the conclusion of the theorem less useful.

Showing requires some knowledge about the underlying distribution of . To obtain tight bounds on , a Central Limit Theorem-type argument can be used, as shown in the next section.

## 5 Probabilistic bounds

The main goal of this section is to bound the probability that holds for all , as a function of the sample size and the number of threshold values considered . This provides a full proof of Theorem 1 and establishes the conditions under which the guarantees of the Balancing Theorem hold in the importance sampling setting. The strengths and weaknesses of the proposed method are then discussed.

In the interest of generality, no assumptions on the distribution of the weights are made, other than the existence of a mean, . However, a large sample size may be needed for the Balancing Theorem to hold with high probability under such a lax moment hypothesis.

In terms of the truncation levels, assume is chosen in a data-independent manner. This is not crucial, as long as there is a probabilistic bound for the data-dependent choice of . For instance, the case of no truncation, , can be considered if there are bounds on the distribution of . In particular, the existence of a second moment gives a bound via Chebyshev’s inequality, and higher-order moments give even finer control. In terms of data-independent values, , as suggested in [Ionides, 2008] following asymptotic considerations, can always be included.

Also, let be such that , so the variance of the winsorized sample can be bounded in terms of the third absolute moment. In a setting where the variance of is so large, or even infinite, that winsorizing the terms are worth considering, is not expected to be too large. In particular, the bound is implied by the condition , for , since

 E[|YMi−E[YM1]|3]≤22(E[|YMi|3]+(E[|Y−1M|])3)≤8M3=8α3/2α3/2M3≤8α3/2(V[YMi])3/2.

To upper bound , the argument proceeds by first showing via a Central Limit Theorem-type argument that, for all ,

 P[|¯¯¯¯¯¯¯¯¯YM−E[Y1]|≤bias(M)+t^σM/(√n−t)]≥1−2(1−Φ(t))+error(t,n,M).

and then using a Berry-Esseen bound on the -statistic to give precise bounds on the error.

### 5.1 Using the Central Limit Theorem

Since are iid with finite mean and variance, by the usual Central Limit Theorem,

 P⎡⎢ ⎢⎣√n|1n∑ni=1YMi−E[YM1]|√E[(YM1−E[YM1])2]≥t⎤⎥ ⎥⎦n→∞⟶2(1−Φ(t)).

Using the Strong Law of Large Numbers and the Continuous Mapping Theorem,

 √E[(YM1−E[YM1])2]√1n∑ni=1(YMi−E[YM1])2as,p⟶1.

Hence, by Slutsky’s Theorem and the Central Limit Theorem,

 limn→∞ P⎡⎣∣∣ ∣∣1nn∑i=1YM