 # A weighted Discrepancy Bound of quasi-Monte Carlo Importance Sampling

Importance sampling Monte-Carlo methods are widely used for the approximation of expectations with respect to partially known probability measures. In this paper we study a deterministic version of such an estimator based on quasi-Monte Carlo. We obtain an explicit error bound in terms of the star-discrepancy for this method.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

In statistical physics and Bayesian statistics it is desirable to compute expected values

 Eπ(f)=∫Rdf(x)dπ(x) (1)

with and a partially known probability measure on . Here denotes the Borel -algebra and partially known means that there is an unnormalized density (with respect to the Lebesgue measure) and , such that

 π(A)=∫Au(x)dx∫Rdu(y)dy,A∈B(Rd). (2)

Probability measures of this type are met in numerous applications. For example, for the density of a Boltzmann distribution one has

 u(x)=exp(−βH(x)),x∈Rd,

with inverse temperature and Hamiltonian . The density of a posterior distribution is also of this form. Given observations , likelihood function

and prior probability density

, with respect to the Lebesgue measure on ,

 u(x)=ℓ(y∣x)p(x),x∈Rd.

In this setting is considered as parameter- and as observable-space. In both examples, the normalizing constant is in general unknown.

In the present work we only consider unnormalized densities which are zero outside of the unit cube . Hence we restrict ourself to , i.e., is a probability measure on , and . To stress the dependence on the unnormalized density in (1), define

for and belonging to some class of functions. It is desirable to have algorithms which approximately compute by only having access to function values of and without knowing the normalizing constant a priori. A straightforward strategy to do so provides an importance sampling Monte Carlo approach. It works as follows.

###### Algorithm 1.

Monte Carlo importance sampling:

1. Generate a sample of an i.i.d. sequence of random variables

with 111By

we denote the uniform distribution on

.
and call the result .

2. Compute

 Mn(f,u):=∑nj=1f(xj)u(xj)∑nj=1u(xj).

Under the minimal assumption that

is finite, a strong law of large numbers argument guarantees that the importance sampling estimator

is well-defined, cf. [16, Chapter 9, Theorem 9.2]. For uniformly bounded and finite an explicit error bound of the mean square error is proven in [14, Theorem 2].

Surprisingly, there is not much known about a deterministic version of this method. The idea is to substitute the uniformly in distributed i.i.d. sequence by a carefully chosen deterministic point set. Carefully chosen in the sense that the point set has “small” star-discrepancy, that is,

 Dλd(Pn):=supx∈[0,1]d∣∣ ∣∣1nn∑j=11[0,x)(xj)−λd([0,x))∣∣ ∣∣

is “small”. Here, the set denotes an anchored box in with and is the -dimensional Lebesgue measure of . This leads to a quasi-Monte Carlo importance sampling method.

###### Algorithm 2.

Quasi-Monte Carlo importance sampling:

1. Generate a point set with “small” star discrepancy .

2. Compute

 Qn(f,u)=∑nj=1f(xj)u(xj)n∑j=1u(xj). (3)

Our main result, stated in Theorem 3, is an explicit error bound for the estimator of the form

 |S(f,u)−Qn(f,u)|≤4∥f∥H1∥u∥D∫[0,1]du(x)dxDλd(Pn). (4)

Here must be differentiable, such that , defined in (7) below, is finite. As a regularity assumption on it is assumed that , defined in (9) below, is also finite.

The estimate of (4) is proven by two results which might be interesting on its own. The first is a Koksma-Hlawka inequality in terms of a weighted star-discrepancy, see Theorem 1. The second is a relation between this quantity and the classical star-discrepancy, see Theorem 2. To illustrate the quasi-Monte Carlo importance sampling procedure and the error bound we provide an example in Section 3 where (4) is applicable.

Related Literature. The Monte Carlo importance sampling procedure from Algorithm 1 is well studied. In , Novak and Mathé prove that it is optimal on a certain class of tuples . However, recently this Monte Carlo approach attracted considerable attention, let us mention here [1, 4]. In particular, in  upper error bounds not only for bounded functions are provided and the relevance of the method for inverse problems is presented.

Another standard approach the approximation of

are Markov chain Monte Carlo methods. For details concerning error bounds we refer to

[11, 12, 13, 17, 19, 20, 21] and the references therein. Combinations of importance sampling and Markov chain Monte Carlo are for example analyzed in [18, 24, 22].

The quasi-Monte Carlo importance sampling procedure of Algorithm 2 is, to our knowledge, less well studied. An asymptotic convergence result is stated in [9, Theorem 1] and promising numerical experiments are conducted in . A related method, a randomized deterministic sampling procedure according to the unnormalized distribution , is studied in . Recently, 

explore the efficiency of using QMC inputs in importance sampling for Archimedean copulas where significant variance reduction is obtained for a case study.

A quasi-Monte Carlo approach to Bayesian inversion was used in  and in 

The latter paper uses a combination of quasi-Monte Carlo and the multi-level method. The computation of the likelihood function involves solving a partial differential equation, but otherwise the problem is of the same form as described in the introduction.

## 2 Weighted Star-discrepancy and error bound

Recall that for are boxes anchored at . As a measure of “closeness” between the empirical distribution of a point set to we consider the star-discrepancy . A straightforward extension of this quantity taking the probability measure on into account is the following weighted discrepancy.

###### Definition 1 (Weighted Star-discrepancy).

For a given point set

and weight vector

, which might depend on and satisfies , define the weighted star-discrepancy by

 Dπ(w,Pn)=supx∈[0,1]d∣∣ ∣∣n∑i=1wi1[0,x)(xi)−π([0,x))∣∣ ∣∣.
###### Remark 1.

If is the Lebesgue measure on and the weight vector is , then for any point set . For general with unnormalized density , allowing the representation (2), we focus on the weight vector

 wui:=wi(u,Pn):=u(xi)∑nj=1u(xj),i=1,…,n. (5)

Here let us emphasize that depends on and .

### 2.1 Integration Error and weighted Star-discrepancy

With standard techniques one can prove a Koksma-Hlawka inequality according to . For details we refer to , [8, Section 2.3] and [15, Chapter 9]. A similar inequality of a quasi-Monte Carlo importance sampler can be found in [2, Corollary 1].

Let and be the space of square integrable functions with respect to the Lebesgue measure. Define the reproducing kernel by By we denote the corresponding reproducing kernel Hilbert space, which consists of differentiable functions with respect to all variables with first partial derivatives being in . For the inner product is given by

 ⟨f,g⟩=∑v⊆[d]∫[0,1]|v|∂|v|∂xvf(xv;1)∂|v|∂xvg(xv;1)dxv,

where for and we write and with if and if . Thus, consists of functions which are differentiable according to all variables with first partial derivatives being in . Note that, for holds

 ∂|v|∂xvK((xv;1),y)=(−1)|v|1[yv,1](xv),

where with . Thus, the reproducing property of the reproducing kernel Hilbert space can be rewritten as

 f(y)=∑v⊆[d]∫[yv,1](−1)|v|∂|v|∂xvf(xv;1)dxv. (6)

Further, we define the space of differentiable functions with finite norm

 ∥f∥H1:=∑v⊆[d]∫[0,1]|v|∣∣∣∂|v|∂xvf(xv;1)∣∣∣dxv, (7)

where for we have . We also define the semi-norm

 ∥f∥˜H1:=∑∅≠v⊆[d]∫[0,1]|v|∣∣∣∂|v|∂xvf(xv;1)∣∣∣dxv. (8)

It is obvious that .

We have the following relation between the integration error in and the weighted discrepancy.

###### Theorem 1 (Koskma-Hlawka inequality).

Let be a probability measure of the form (2) with unnormalized density . Then, for , arbitrary weight vector with , and for all we have

 ∣∣ ∣∣S(f,u)−n∑i=1wif(xi)∣∣ ∣∣≤∥f∥˜H1Dπ(w,Pn).
###### Proof.

Define the quadrature error of the approximation of by . Define the function . Then , and .

For

 h(x):=∫[0,1]dK(x,y)dπ(y)−n∑i=1wiK(x,xi),

and we have A straightforward calculation, see also for instance [7, formula (3)], shows by using (6) that

 e(˜f,Pn) =∑v⊆[d]∫[0,1]|v|∂|v|∂xv˜f(zv;1)(−1)|v|(π([0,(zv;1)))−n∑i=1wi1[0,zv](xi,v))dz =⟨˜f,h⟩.

Finally, by we have

 |e(f,Pn)|=|e(˜f,Pn)|≤∥˜f∥H1Dπ(w,Pn)=∥f∥˜H1Dπ(w,Pn),

which finishes the proof. ∎

An immediate consequence of the theorem with from (5) and from (3) is the error bound

 |S(f,u)−Qn(f,u)|≤∥f∥H1Dπ(wu,Pn).

Here the dependence on on the right-hand side is hidden in through and . The intuition is, that under suitable assumptions on the weighted star-discrepancy can be bounded by the classical star-discrepancy of .

### 2.2 Weighted and classical Star-discrepancy

In this section we provide a relation between the classical star-discrepancy and the weighted star-discrepancy .

###### Theorem 2.

Let be a probability measure of the form (2) with unnormalized density function . Then, for any point set in , we have

 Dπ(wu,Pn)≤4Dλd(Pn)∥u∥D∫[0,1]du(x)dx,

where

 (9)

with and for .

###### Proof.

For the given point set and unnormalized density recall that is defined in (5). To shorten the notation define . Then, for we have

 ∣∣ ∣∣n∑j=1wuj1[0,z)(xj)−π([0,z))∣∣ ∣∣=∣∣ ∣∣∑nj=1u(xj)1[0,z)(xj)∑ni=1u(xi)−∫[0,z)u(x)dx∥u∥1∣∣ ∣∣ ≤∑nj=1u(xj)1[0,z)(xj)∥u∥1∑ni=1u(xi)∣∣ ∣∣∥u∥1−1nn∑i=1u(xi)∣∣ ∣∣ +1∥u∥1∣∣ ∣∣1nn∑i=1u(xi)1[0,z)(xi)−∫[0,z)u(x)dx∣∣ ∣∣ ≤2∥u∥1supz∈[0,1]d∣∣ ∣∣1nn∑i=1u(xi)1[0,z)(xi)−∫[0,z)u(x)dx∣∣ ∣∣.

For denote and let be the cardinality of . Define

 I1(z) :=∫[0,z)u(x)dxλd([0,z))∣∣∣|Pz|n−λd([0,z))∣∣∣, I2(z) :=|Pz|n∣∣ ∣∣1|Pz|∑x∈Pzu(x)−∫[0,z)u(x)dxλd([0,z))∣∣ ∣∣,

and note that

 ∣∣ ∣∣1nn∑i=1u(xi)1[0,z)(xi)−∫[0,z)u(x)dx∣∣ ∣∣ =|Pz|n∣∣ ∣∣1|Pz|∑x∈Pzu(x)−n|Pz|∫[0,z)u(x)dx∣∣ ∣∣≤I1(z)+I2(z).

Estimation of : An immediate consequence of the definition of is

 I1(z)≤∫[0,z]u(x)dxλd([0,z])Dλd(Pn)≤Dλd(Pn)supx∈[0,z]u(x). (10)

Estimation of : With the transformation defined in the theorem one has Let

 Q:=T−1zPz={(z−11x1,…,z−1dxd)∣x∈Pz}⊂[0,1]d

and observe that . Then

where the last inequality follows from Theorem 1 with and constant unnormalized density. Further,

 |Pz|nDλd(Q) =|Pz|nsupy∈[0,1]d∣∣ ∣∣1|Q|∑x∈Q1[0,y)(x)−λd([0,y))∣∣ ∣∣ =supy∈[0,1]d∣∣ ∣∣1n∑x∈Q1[0,y)(x)−|Q|nλd([0,y))∣∣ ∣∣ =supy∈[0,1]d∣∣ ∣∣1n∑x∈Pn1Tz([0,y))(x)−|Q|nλd([0,y))∣∣ ∣∣ ≤supy∈[0,1]d∣∣ ∣∣1n∑x∈Pn1Tz([0,y))(x)−λd(Tz([0,y)))∣∣ ∣∣ +supy∈[0,1]d∣∣∣λd(Tz([0,y)))−|Q|nλd([0,y))∣∣∣.

By the fact that is again a box anchored at and

 supy∈[0,1]d∣∣∣λd(Tz([0,y)))−|Q|nλd([0,y))∣∣∣= supy∈[0,1]dλd([0,y))∣∣∣λd([0,z))−|Pz|n∣∣∣ ≤ ∣∣∣λd([0,z))−|Pz|n∣∣∣,

we have

Hence we have

 supz∈[0,1]d∣∣ ∣∣n∑j=1wuj1[0,z)(xj)−π([0,z))∣∣ ∣∣≤2supz∈[0,1]dI1(z)+I2(z) ≤ 2Dλd(Pn)supz∈[0,1]d(supx∈[0,z]u(x)+2∥u(Tz⋅)∥˜H1),

which implies the result. ∎

In particular, the theorem implies that whenever is finite and goes to zero as goes to infinity, also goes to zero for increasing with the same rate of convergence.

### 2.3 Explicit error bound

An immediate consequence of the results of the previous two sections is the following explicit error bound of the quasi-Monte Carlo importance sampling method of Algorithm 2.

###### Theorem 3.

Let be a probability measure of the form (2) with unnormalized density . Then, for any point set in , and from (3) we obtain

 |S(f,u)−Qn(f,u)|≤4∥f∥H1∥u∥D∫[0,1]du(x)dxDλd(Pn),

with from Theorem 2.

Under the regularity assumption that is finite, the error bound tells us that the classical star-discrepancy determines the rate of convergence on how fast goes to .

## 3 Illustrating Example

Define the -simplex by and consider the (slightly differently formulated) unnormalized density of the Dirichlet distribution with parameter vector given by

 u(x;α)={(1−∑di=1xi)αd+1−1∏di=1xαi−1i,x∈Δd,0,x∉Δd. (11)

The Dirichlet distribution is the conjugate prior of the multinomial distribution: Assume that we observed some data

, which we model as a realization of a multinomial distributed random variable with unknown parameter vector . With this leads to a likelihood function For a prior distribution with unnormalized density and we obtain a posterior measure with unnormalized density .

The normalizing constant of can be computed explicitly, it is known that

 ∫[0,1]du(x,α)dx=∏d+1i=1Γ(αi)Γ(∑d+1i=1αi). (12)

To have a feasible setting for the application of Theorem 1 and Theorem 2 we need to show that is finite. This is not immediately clear, since in we take the supremum over . The following lemma is useful.

###### Lemma 1.

Let and recall that we write . Define with if , if and if . Assume that for and . Then

 ∂|v|∂xvu(x,α)=∑kv∈{0,1}|v|kd+1=|v|−∑i∈vkicv,kv,kd+1u(x,α−(kv;0;kd+1)) (13)

with

###### Proof.

The statement follows by induction over the cardinality of . For , i.e., both sides of (13) are equal to .

Assume , i.e., for some we have . Then

 ∂∂xsu(x,α)=(αs−1)u(x,α−es)−(αd+1−1)u(x,α−ed+1),

with where the th entry is “1”. On the other hand

 ∑ks∈{0,1}kd+1=1−ks c{s},ks,kd+1u(x,α−(ks;0;kd+1)) =c{s},0,1u(x,α−ed+1)+c{s},1,0u(x,α−es).

By the fact that and the claim is proven for .

Now assume that (13) is true for any with . Let with be an arbitrary subset and let with . Then we prove that the result also holds for . We have

 ∂|˜v|∂x˜vu(x,α) =∂∂xr∂|v|∂xvu(x,α) =∑kv∈{0,1}|v|kd+1=|v|−∑i∈vkicv,kv,kd+1∂∂xru(x,α−(kv;0;kd+1)).

Observe that

 ∂∂xru(x,α−(kv;0;kd+1)) =(αr−1)u(x,α−(kv;0;kd+1)−er)−(αd+1−kd+1−1)u(x,α−(kv;0;kd+1+1)) =∑kr∈{0,1}(αr−1)kr(−1)1−kr(αd+1−kd+1−1)1−kru(x,α