    # Notes on optimal approximations for importance sampling

In this manuscript, we derive optimal conditions for building function approximations that minimize variance when used as importance sampling estimators for Monte Carlo integration problems. Particularly, we study the problem of finding the optimal projection g of an integrand f onto certain classes of piecewise constant functions, in order to minimize the variance of the unbiased importance sampling estimator E_g[f/g], as well as the related problem of finding optimal mixture weights to approximate and importance sample a target mixture distribution f = ∑_i α_i f_i with components f_i in a family F, through a corresponding mixture of importance sampling densities g_i that are only approximately proportional to f_i. We further show that in both cases the optimal projection is different from the commonly used ℓ_1 projection, and provide an intuitive explanation for the difference.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Importance sampling is one of the most important variance reduction tools in Monte Carlo integration. Given a measurable space , and a function , suppose we want to estimate an integral of the form:

 I=∫Ωf(x)dμ(x) (1)

Basic Monte Carlo integration tells us that we can approximate

with the unbiased estimator:

 I≈1NN∑j=1f(Xj) (2)

where

are realizations of a uniform random variable. The most basic form of importance sampling states that if

is a random variable with density , the same integral can be equivalently calculated as:

 I=Eg(f(X)) (3)

and that one can hence construct the following Monte Carlo estimator:

 I≈FN=1NN∑j=1f(Xj)g(Xj). (4)

The reason why importance sampling is interesting is that the variance of the above formula depends on how well approximates :

 V[FN]=1N⋅V[f;g] (5)

with:

 V[f;g]=∫Ωf2(x)g(x)dμ(x)−I2. (6)

In particular, if it was possible to simulate random variables from a distribution exactly proportional to , i.e. with , one would get a zero variance estimator:

 V[f;g] = ∫ΩI⋅f2(x)f(x)dμ(x)−I2 (7) = I⋅∫Ωf(x)dμ(x)−I2=0

Unfortunately, building such a zero variance estimator is most of the times not practical, as it would require knowledge of the integral we want to estimate in order to normalize the distribution . Hence, many algorithms rely on building useful approximations to .

As a first application of this manuscript, we seek optimality conditions for such approximations, with a particular focus on the class of piecewise constant functions.

Since many such approximations are built as combinations of a finite set of basis functions , we formulate the problem as that of finding a projection that is optimal in the sense of minimizing variance from equation (6). As a result, we will show that approximations resulting from the rather customary projection of onto does not lead to optimal estimators.

As a second application, we look at the related problem of estimating the integral of a mixture when the individual terms can only be approximately sampled, that is to say when we can only draw samples from densities . As a matter of fact, this latter application includes the first as a special case: it’s sufficient to identify the basis functions with the densities , and consider as the restrictions of to the support of - yet we thought it is interesting to consider both points of view separately, particularly as they lead to alternative solutions and different generalizations.

## 2 Optimal Piecewise Constant Approximations

Given a function , and a finite set of basis functions , we seek to find a projection :

 g=∑iαibi(x) (8)

with the normalization constraint:

 ∑iαi=1 (9)

which minimizes the variance of the importance sampling estimator:

 μ=E[f/g]≈1NN∑j=1f(Xj)g(Xj). (10)

That is, we want to minimize the following quantity:

 V[f;g]=∫Ωf2(x)g(x)dx−μ2. (11)

Particularly, since we want to focus our analysis on piecewise constant functions , we will look at elementary basis functions with non-overlapping compact supports :

 bi(x)=|Ci|−1⋅XCi(x)

with for .
Regular projection would set the coefficients as:

 αi=1μ∫Ωf(x)bi(x)dx=1μ∫Cif(x)dx; (12)

while this is commonly used, we here prove it is not the optimal choice.

### 2.1 Solution

Since have non overlapping supports, we can expand equation (11) into:

 V[f;g]=n∑i=1∫Cif2(x)αibi(x)dx−μ2 (13)

and if we define the constants:

 m(2)i=∫Cif2(x)bi(x)dx

we can further rewrite equation (13) as:

 V[f;g]=n∑i=1m(2)iαi−μ2. (14)

In order to find the minimum of , we use the method of Lagrange multipliers:

 ∇α,λ⎡⎣n∑i=1m(2)iαi−μ2+λ(n∑i=1αi−1)⎤⎦=0 (15)

which in turn translates into the set of equations:

 ∂∂αi⎡⎣m(2)iαi−μ2+λαi⎤⎦=0 (16) ⇔ −m(2)iα2i+λ=0 ⇔ α2i=m(2)iλ ⇔ αi=√m(2)iλ

subject to the constraint:

 ∑i√m(2)iλ=1 (17) ⇔ λ=(∑i√m(2)i)2

 αi=√m(2)i∑j√m(2)j. (18)

Notice how is essentially a normalized -norm projection, and may differ significantly from the more commonly used normalized projection. While this result may be surprising at first, we believe it has a quite intuitive explanation: the optimal projection simply puts more weight on high values of the integrand, and does so proportionately to its square in order to minimize the quadratic variance functional.

### 2.2 Generalization to arbitrary basis functions

In the previous section we only considered basis functions with non-overlapping compact supports. If we left this restriction out the original optimization problem would look like this instead:

 ∂∂αi[∫Ωf2(x)g(x)dx−μ2+λαi]=0 (19) ⇔ ∫Ωbi(x)f2(x)g2(x)dx=λ.

This tells us that for the optimal , the projection of on the basis functions must be a constant, i.e:

 =∀(i,j) (20)

which can also be expressed as the equivalent condition:

 =0∀(i,j) (21)

however, it doesn’t yet provide a closed formula for finding proper coefficients to realize it. Notice that one could also equivalently express (20) and (21) as:

 Ebi[f2/g2]=Ebj[f2/g2]∀(i,j) (22) ⇔ Ebi−bj[f2/g2]=0∀(i,j)

To gather further insights we reformulate the problem in a slightly different way. We can treat our function as a multiple importance sampling combination [Veach and Guibas 1995], where represents the density of the -th technique and the coefficients represent the relative mixture weights, or sampling frequencies. Under this light, the problem is essentially the same as that analysed by He and Owen [He and Owen 2014]. Unfortunately, while He and Owen demonstrated that variance is jointly convex in the mixture weights, thus allowing to use a convex optimization algorithm to find an optimum, a closed form solution is yet to be found.

## 3 Optimal approximation of a mixture

Now, let’s go back to the second problem mentioned in the introduction. Suppose we have a finite family of functions , and an equally sized family of importance sampling densities , each of which can be used to approximately sample from the corresponding term in , i.e. such that . And let’s further consider the problem of estimating the integral of a finite mixture:

 f(x)=n∑i=1αifi(x) (23)

sampling from the densities with frequency , and using the importance sampling estimator:

 E[f]=n∑i=1~αiEgi[αi~αi⋅figi]. (24)

which can be realized sampling random variables , with , and applying the summation rule:

 E[f]≈1Nn∑i=1Ni∑j=1wi(Xi,j) (25)

with proper importance sampling weights:

 wi(x)=αi~αi⋅fi(x)gi(x). (26)

Normally, one could consider setting , but we can again ask ourselves whether this is indeed an optimal choice. If we consider the resulting sample weights as realizations of a random variable , we can now write the variance of the estimator as:

 V=E[W2]−E[W]2. (27)

with the first term being equal to:

 E[W2] = ∑i~αi∫Ωw2i(x)gi(x)dx (28) = ∑iα2i~αi∫Ωf2i(x)g2i(x)gi(x)dx = ∑iα2i~αi∫Ωf2i(x)gi(x)dx

and the second term equal to the squared integral of :

 E[W]2 = (∑i~αi∫Ωwi(x)gi(x)dx)2 (29) = (∑iαi∫Ωfi(x)dx)2 = μ2

Hence, we obtain:

 V=∑iα2i~αi∫Ωf2i(x)gi(x)dx−μ2 (30)

Now, posing:

 m(2)i=α2i∫Ωf2i(x)gi(x)dx

and applying again the method of Lagrange multipliers to solve for optimal values of with the normalization constraint , we obtain:

 ~αi=√m(2)i∑j√m(2)j. (31)

Notice the similarity with equation (18) from the previous section. As mentioned in the introduction, this similarity is not a coincidence, as the result of equation (18) can be seen as a particular case of this derivation, by setting , and considering each as the restriction of to the non-overlapping supports , i.e. . However, as we will shortly see, it is important to make some careful considerations.

### 3.1 Generalization to multiple importance sampling estimators

In this section, we have assumed a partitioning of into a weighted sum of functions that can be each individually sampled by corresponding densities . However, such a strict partitioning might not always be desirable, or lead to an optimal estimator. In fact, if the functions have overlapping supports, and the densities are only locally good approximations for some of the features of , a better estimator might be obtained considering a multiple importance sampling combination. For example, we could use the unbiased estimator:

 E[f]=n∑i=1~αiEgi[f∑i~αigi] (32)

which is obtained applying the balance heuristic

[Veach and Guibas 1995], and resulting in the alternative sample weights:

 w(x)=f(x)∑j~αjgj(x) (33)

Optimizing the variance of such an estimator would now become exactly the same problem as finding the optimal projection of onto a set of arbitrary basis functions, as seen in section 2.2 and as analysed by He and Owen He:2014.

## 4 Discussion

In this work we have provided simple closed form formulas to build optimal importance sampling approximations within certain classes of mixture distributions, and showed how the resulting projection coefficients differ from the commonly used projections. We believe our results to be potentially useful for the solution of many Monte Carlo integration problems. Particularly, there could be numerous applications for path sampling in the field of light transport simulation, ranging from the construction of better radiance or importance field approximations for path guiding [Vorba et al. 2014, Dahm and Keller 2017], to improved importance caching distributions [Georgiev et al. 2012], to new methods for sampling direct illumination.

#### Acknowledgements:

We thank Peter Shirley for suggesting the use of Lagrange multipliers and providing useful feedback on the early drafts of this paper.