    # Convergence rates for optimised adaptive importance samplers

Adaptive importance samplers are adaptive Monte Carlo algorithms to estimate expectations with respect to some target distribution which adapt themselves to obtain better estimators over iterations. Although it is straightforward to show that they have the same theoretical guarantees with importance sampling with respect to the sample size, their behaviour over the number of iterations has been relatively left unexplored despite these adaptive algorithms aim at improving the proposal quality over time. In this work, we explore an adaptation strategy based on convex optimisation which leads to a class of adaptive importance samplers, termed optimised adaptive importance samplers (OAIS). These samplers rely on an adaptation idea based on minimizing the χ^2-divergence between an exponential family proposal and the target. The analysed algorithms are closely related to the adaptive importance samplers which minimise the variance of the weight function. We first prove non-asymptotic error bounds for the mean squared errors (MSEs) of these algorithms, which explicitly depend on the number of iterations and the number of particles together. The non-asymptotic bounds derived in this paper imply that when the target is from the exponential family, the L_2 errors of the optimised samplers converge to the perfect Monte Carlo sampling error O(1/√(N)). We also show that when the target is not from the exponential family, the asymptotic error rate is O(√(ρ^/N)) where ρ^ is the minimum χ^2-divergence between the target and an exponential family proposal.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Adaptive importance sampling (AIS) methods are one of the key Monte Carlo algorithms to estimate integrals which are not possible to obtain in closed form [Robert and Casella, 2004]

. This problem arises in many settings, such as Bayesian signal processing and machine learning

[Bugallo et al., 2015, 2017] or optimal control, [Kappen and Ruiz, 2016] where the quantities of interest are usually defined as intractable expectations. The adaptive importance samplers are adaptive versions of the importance samplers (IS), which iteratively improve the proposals to generate samples better suited to the estimation problem at hand. Its variants include population Monte Carlo methods [Cappé et al., 2004], adaptive mixture importance sampling [Cappé et al., 2008]. There has been an explosion of papers about AIS methods. A comprehensive review is beyond the scope of this article, see e.g. Bugallo et al.  for a recent review.

The convergence of these methods is also explored in the literature heavily. First of all, since the AIS is an IS method, it has the classical convergence rate of the error as the IS, where is the number of Monte Carlo samples used in the approximations, see e.g. Robert and Casella  and Agapiou et al. . However, since an adaptation is performed over the iterations and the goal of this adaptation is to improve the proposal quality, an insightful convergence result would provide a bound which explicitly depends on the number of iterations, , (which sometimes we refer to as time) and the number of samples, . Although there are convergence results of adaptive methods (see Douc et al. 

for a convergence theory for population Monte Carlo based on minimizing Kullback-Leibler divergence), none of the available results are able to explicitly bound the error in terms of the number of iterations and the number of particles at the same time.

The partial difficulty of the proving such a result for the adaptive mixture samplers is that the adaptive mixtures form an interacting particle system and it is unclear what kind of adaptation they perform or whether the adapted proposals actually get closer to the target in some metric. An alternative to adaptation using mixtures is the idea of minimizing a cost function in order to adapt the proposal. This idea has been popular in the literature, in particular, minimizing the variance of the weight function has received significant attention, see, e.g., Arouna [2004a, b], Kawai , Lapeyre and Lelong , Ryu and Boyd , Kawai [2017, 2018]. Relevant to us, in particular, Ryu and Boyd  have proposed an algorithm called Convex Adaptive Monte Carlo (Convex AdaMC), which is based on minimizing the variance of the IS estimator, which is a quantity related to the divergence between the target and the proposal. Ryu and Boyd  have shown that the variance the IS estimator is a convex function of the parameters of the proposal when the proposal family is chosen as the exponential family. Based on this observation, Ryu and Boyd 

have formulated Convex AdaMC, which draws one sample at each iteration and construct the IS estimator and they have proved a central limit theorem (CLT) for the resulting sampler. The idea has been further extended for self-normalised importance samplers by

Ryu , who considered minimising the -divergence between the target and an exponential family. Similarly, Ryu  proved a CLT for the resulting sampler. Similar ideas were also considered by Kawai [2017, 2018], which also aimed at minimizing the variance expression. Similarly, Kawai  showed that the variance of the weight function is convex when the proposal family is suitably chosen and provided general conditions for such proposals. Kawai  has also developed an adaptation technique based on the stochastic approximation, which is a similar scheme we analyse in this paper.

In this work, we develop and analyse a family of adaptive importance samplers, generally termed as optimised adaptive importance samplers (OAIS), which is based on a particular adaptation strategy based on convex optimisation. We adapt the proposal with respect to a quantity (which is essentially -divergence between the target and the proposal) which also happens to be the constant in error bounds of the IS, see, e.g., [Agapiou et al., 2017]. Assuming an exponential family proposal which converts the adaptation of the proposal to a convex optimisation problem, we develop a procedure which essentially optimises the error bound of the algorithm. By using results from convex optimisation , we obtain error rates depending on the number of iterations, denoted as , and the number of Monte Carlo samples, denoted as , together explicitly, which displays the trade-off between these two essential quantities. To the best of our knowledge, none of the papers on the topic provides convergence rates depending explicitly on the number of iterations and the number of particles together, as we do in this paper.

The paper is organised as follows. In Sec. 2, we introduce the problem definition, the IS and the AIS algorithms. In Sec. 3, we introduce the OAIS algorithms. In Sec. 4, we provide the theoretical results regarding optimised AIS and show its convergence using results from convex optimisation. Finally, we conclude in Sec. 5.

### Notation

For , we denote . We denote the state space as and assume

. The space of bounded real valued functions and the set of probability measures on space

are denoted as and , respectively. Given and , the expectation of wrt is written as or . The variance of with respect to is defined as . If , then . The unnormalised density associated to is denoted with . We denote the proposal with with an explicit dependence to the parameter . The parameter space is assumed to be a subset of dimensional Euclidean space, i.e. . Whenever necessary we denote both the probability measures and their densities and with same notation, therefore we implicitly assume that both and are absolutely continuous with respect to the Lebesgue measure.

## 2 Background

In this section, we review importance and adaptive importance samplers.

### 2.1 Importance sampling

Consider a target density and a bounded function . Oftentimes, the main interest is to compute an integral of form,

 (φ,π)=∫Xφ(x)π(x)dx. (1)

While perfect Monte Carlo can be used to estimate this expectation when it is possible to sample exactly from , this is often not possible. In the following, we consider the cases when the target can be evaluated exactly and up to a normalising constant, respectively.

Importance sampling (IS) uses a proposal distribution which is easy to sample and evaluate. Then, the IS consists of weighting these samples in order to correct the discrepancy between the target and proposal and finally constructing an estimator of the integral. To be precise, let

be the proposal which is parameterized by the vector

. The unnormalized target density is denoted as . Therefore, we have

 π(x)=Π(x)Zπ,

where . Next, we define functions as,

 wθ(x)=π(x)qθ(x)andWθ(x)=Π(x)qθ(x).

For a chosen proposal , the IS proceeds as follows. First a set of samples are generated i.i.d from the proposal . When can be evaluated, one constructs the empirical approximation of the , denoted , as

 πNθ(dx)=1NN∑i=1wθ(x(i))δx(i)(dx).

For this case, the IS estimate of the integral in (1) can be given as

 (φ,πNθ)=1NN∑i=1wθ(x(i))δx(i)(dx). (2)

However, in most practical cases, the target density cannot be evaluated. In this case, we construct the empirical measure as

 πNθ(dx)=N∑i=1w(i)θδx(i)(dx),

where,

 w(i)θ=Wθ(x(i))∑Nj=1Wθ(x(j)).

Finally this construction leads to the so called self-normalizing importance sampling (SNIS) estimator:

 (φ,πNθ)=N∑i=1w(i)θφ(x(i)). (3)

Although the IS estimator (2) is unbiased, the SNIS estimator (3) in general biased due to the fact that the target is only possible to evaluate up to a normalization constant. However, the bias and the MSE vanish with a rate , therefore, providing guarantees of convergence as . Crucially for us, the MSE of both estimators can be bounded as made precise in the following theorem which is adapted from Agapiou et al. .

###### Theorem 1.

Assume that . Then for any , we have

 E[((φ,π)−(φ,πNθ))2]≤cφρ(θ)N,

where and the function is defined as

 ρ(θ)=Eqθ[π2(X)q2θ(X)],

where .

###### Proof.

See Appendix A.1 for a self-contained proof.

###### Remark 1.

For the IS estimator (2), this bound can be improved so that . However, this improvement does not effect our results in this paper, hence we present a single bound for the estimators (2) and (3) for conciseness.

###### Remark 2.

As pointed out by Agapiou et al. , the function is essentially the divergence between and , i.e.,

 ρ(θ):=χ2(π||qθ)+1.

Note that can also be expressed in terms of the variance of the weight function , which coincides with divergence, i.e.,

 ρ(θ)=varqθ(wθ(X))+1.

Therefore, minimizing is equivalent to minimizing -divergence and the variance of the weight function , i.e., .

###### Remark 3.

Remark 2 implies that if is also from exponential family, then

 ρ⋆:=infθ∈Θρ(θ)=1.

###### Remark 4.

For the IS estimator (2), the bound in Theorem 1 can be modified so that it holds for unbounded test functions as well, see, e.g. Ryu and Boyd . Therefore, a similar quantity to , which includes whilst still retaining convexity, can be optimised for this case. Unfortunately, obtaining such a bound is not straightforward for the SNIS estimator (3) as shown by Agapiou et al. . In order to significantly simplify the presentation, we restrict ourselves to the class of bounded test functions, i.e., we assume .

### 2.2 Parametric adaptive importance samplers

Standard importance sampling may be inefficient in practice when the proposal is poorly calibrated with respect to the target. In particular, as implied by the error bound provided in Theorem 1, the error made by the IS can be huge if the -divergence between the target and the proposal is large. Therefore, it is more common to employ an iterative version of importance sampling, also called as adaptive importance sampling (AIS). The AIS algorithms are importance sampling methods which aim at iteratively improving the proposal distributions. More specifically, the AIS methods specify a sequence of proposals and perform importance sampling at each iteration. The aim is to improve the proposal so that the samples are better matched with the target, which would result in less variance and more accuracy in estimators. There are several variants, e.g. the most popular one being population Monte Carlo methods [Cappé et al., 2004] which uses previous samples in the proposal.

In this section, we review one particular AIS, which we refer to as parametric AIS. In this variant, the proposal distribution is a parametric distribution, denoted . Over time, this parameter is updated (or optimised) with respect to a predefined criterion resulting in a sequence . This gives us a sequence of proposal distributions denoted as .

One iteration of the algorithm goes as follows. Assume at time , we are given a proposal distribution . At time , we first update the parameter of this proposal,

 θt=Tt(θt−1),

where , is a sequence of (deterministic or stochastic) maps, e.g. gradient mappings, constructed so that it minimises a certain cost criterion. Then, in the same way we have done in the IS, we sample

 x(i)t∼qθt(dx),% for i=1,…,N,

and compute weights

 w(i)θt=Wθt(x(i)t)∑Ni=1Wθt(x(i)t),

and finally construct the empirical measure

 πNθt(dx)=N∑i=1w(i)θtδx(i)t(dx).

The estimator of the integral (1) is then constructed as

 (φ,πNθt)=N∑i=1w(i)θtφ(x(i)t).

The full procedure of the parametric AIS is summarized in 1.

Since this is a valid IS scheme, this algorithm enjoys the same guarantee provided in Theorem 1. In particular, we have the following theorem.

###### Theorem 2.

Assume that, given a sequence proposals , we have . Then for any , we have

 E[∣∣(φ,π)−(φ,πNθt)∣∣2]≤cφρ(θt)N,

where and the function is defined as

 ρ(θt)=Eqθt[π2(X)q2θt(X)].
###### Proof.

The proof is identical to the proof of Theorem 1.

However, this theorem does not give an insight about what happens as the number of iterations increases, i.e. as , with the bound. Ideally, the adaptation of the AIS should improve this bound with time. In other words, in the ideal case, the error should decrease as grows. Fortunately, Theorem 2 suggests that the mappings can be chosen so that the function is minimised over time. More specifically, the sequence can be chosen so that it leads to a decreasing sequence (at least in expectation) . In the following sections, we will summarize the deterministic and stochastic strategies to achieve this aim.

###### Remark 5.

We define the unnormalized version of and denote it as and it is characterised as follows

 ρ(θ)=R(θ)Z2πwhereZπ=∫XΠ(x)dx<∞.

Note that can also be expressed as

 R(θ)=Eqθ[Π2(X)q2θ(X)]. (4)

### 2.3 AIS with exponential family proposals

In this section, following Ryu and Boyd , we note that when is chosen as an exponential family density, the function is convex. In particular, we define

 qθ(x)=exp(θ⊤T(x)−A(θ))h(x), (5)

where is given by

 A(θ)=log∫exp(θ⊤T(x))h(x)dx,

and and . Then we have the following lemma adapted from Ryu and Boyd .

###### Lemma 1.

Let be chosen as in (5). Then is convex, i.e., for any and , the following holds

 ρ(λθ1+(1−λ)θ2)≤λρ(θ1)+(1−λ)ρ(θ2),

and

 ∇ρ(θ)=Eqθ[(∇A(θ)−T(X))π2(X)q2θ(X)]. (6)
###### Proof.

See Appendix A.2 for a self-contained proof.

###### Remark 6.

Note that Eqs. (4) and (6) together imply

 ∇R(θ)=Eqθ[(∇A(θ)−T(X))Π2(X)q2θ(X)]. (7)

We also note

 ∇R(θ)=Z2π∇ρ(θ). (8)

In the following sections, we assume that is a convex function. Thus Lemma 1 constitutes an important motivation for our approach. We leave handling general proposals which lead to nonconvex to future work.

## 3 The algorithms

In this section, we describe adaptation strategies based on minimizing . In particular, we design maps , for , for scenarios where,

• the gradient of can be exactly computed,

• an unbiased estimate of the gradient of

can be obtained,

• an unbiased estimate of the gradient of can be obtained.

The scenario (i) is unrealistic in practice but gives us a guideline in order to further develop the idea. The scenario (ii) can be realized in cases where it is possible to evaluate , in which case the IS leads to unbiased estimators. The scenario (iii) is what a practitioner would encounter in practice: The target is only possible to evaluate up to the normalizing constant, i.e., can be evaluated but cannot be.

We first introduce the gradient OAIS (GOAIS) where we assume that the exact gradients of are available. Since is defined as an expectation (or as an integral), this assumption is unrealistic. However, the results we can prove for this procedure will shed light onto the results that will be proven for practical cases in the following sections.

In particular, in this scheme, given , we specify as,

 θt=Tt(θt−1)=θt−1−γ∇ρ(θt−1), (9)

where is a step-size parameter of the map. This is a classical gradient descent scheme on . In other words, updating the proposal corresponds to a simple gradient descent procedure. In the next section, we provide non-asymptotic results for this scheme. However, as we have noted, this idea does not lead to a practical scheme and cannot be used in most cases in practice as the gradients of in exact form are rarely available. The full procedure can be seen from Algorithm 2.

Although it has a nice and simple form, GOAIS is often intractable to implement since is an integral. For this reason, the gradient needs to be estimated. In this section, we will first look at the case where can be evaluated, which means that an unbiased estimate of can be obtained. Then we will consider the general case, where one can only evaluate and can obtain an unbiased estimate of . To ease the analysis, the variant of SG-OAIS we will introduce here will use the iterate averaged SGD [Schmidt et al., 2017]. However, this results can be extended to the vanilla SGD by using the results from Nemirovski et al. .

#### 3.2.1 Normalised case

We assume first that the density can be evaluated. We are given the sequence of parameter estimates . First, in order to perform the adaptive importance sampling, we set

 ¯θt=1tt−1∑k=0θk, (10)

and sample for . Following the standard parametric AIS procedure (i.e. Algorithm 1), we obtain the estimate of as,

 (φ,πN¯θt)=1NN∑i=1w¯θt(¯x(i)t)φ(¯x(i)t).

Next, we update our parameter using the projected stochastic gradient step

 θt=Tt(θt−1)=ProjΘ(θt−1−γtgt) (11)

where and denotes the projection onto the set . Note that in order to estimate this gradient using (6), we sample for and estimate the expectation in (6). It is worth noting that these samples are different than the samples used to estimate .

###### Remark 7.

We consider the projected setup, since there might be natural constraint sets for parameter space . This need arises, for example, when where pair. In this case, the covariance matrix is constrained to be positive definite but the gradient step alone would not ensure that. At every iteration, needs to be projected onto the cone of positive-definite symmetric matrices.

#### 3.2.2 Self-normalised case

In general, however, cannot be evaluated exactly, hence a stochastic unbiased estimate of cannot be obtained. When the target can only be evaluated up to a normalisation constant, i.e., can be evaluated, we can use the SNIS procedure as explained in Section 2. Therefore, we introduce here the general version of the stochastic method, termed as the stochastic gradient OAIS (SG-OAIS).

To run this method, given a parameter as obtained in (10), we first generate a set of samples from the proposal . Then the integral estimate given by the SNIS can be written as,

 (φ,πN¯θt)=N∑i=1w¯θt(¯x(i)t)φ(¯x(i)t),

where,

 w(i)¯θt=W¯θt(¯x(i))∑Nj=1W¯θt(¯x(j)).

Finally, for the adaptation step, we obtain the unbiased estimate of the gradient and adapt the parameter as

 θt=ProjΘ(θt−1−γt~gt) (12)

where . Note that, as in the normalised case, this gradient is estimated using samples for using the formula (7). These samples are different, again, from the ones used to estimate . The full procedure is outlined in Algorithm 3.

In the following section, we analyse the proposed schemes.

## 4 Analysis

Theorem 1 tells an intuitive result about the error of the IS in terms of the distance between target and proposal . We now consider ideas from convex optimisation literature to obtain finite-time, finite-sample convergence rates.

### 4.1 Convergence rate with exact gradients

To begin, we will assume that we can compute the gradient of exactly. In particular, we consider the adaptation strategy given in (9). As it is often the case, we first need some assumptions on the function in order to guarantee convergence. Crucially, we need the following assumption.

###### Assumption 1.

Assume is convex and differentiable and

 ∥∇ρ(θ)−∇ρ(θ′)∥2≤L∥θ−θ′∥2for anyθ,θ′∈Θ.

We have already noted that convexity can be attained by choosing the proposal family as the exponential family, see, e.g., Ryu and Boyd . The next lemma provides the convergence rate of the gradient descent.

###### Lemma 2.

Assume that Assumption 1 holds. With , it is possible to show,

 ρ(θt)−ρ⋆≤∥θ0−θ⋆∥22γt.
###### Proof.

See, e.g., Nesterov .

This rate is one of the most basic results of convex optimisation. Next, we have the following result for the MSE of the AIS estimator adapted using exact gradient descent as summarised in Algorithm 2.

###### Theorem 3.

Assume that the sequence is chosen as in (9) and let be the sequence of proposal distributions. Then we have the following result

 E[((φ,π)−(φ,πNθt))2] ≤cφ∥θ0−θ⋆∥222γtN+cφρ⋆N,

where and can be interpreted as a measure of the ultimate quality of the proposal.

###### Proof.

See Appendix A.3.

###### Remark 8.

Theorem 3 sheds lights onto several facts, which are also true for the following sections. First, we recall that when is also from the exponential family. For this special case, Theorem 3 implies that

 limt→∞∥∥(φ,π)−(φ,πNθt)∥∥2=O(1√N).

In other words, when the target and the proposal are both from the exponential family, this adaptation strategy is leading to an asymptotically perfect Monte Carlo sampler. On the other hand, when is not from an exponential family, we obtain

 limt→∞∥∥(φ,π)−(φ,πNθt)∥∥2=O(√ρ⋆N),

i.e., the error is again asymptotically perfect, with a worse constant multiplied by .

This bound shows that as , what we are left with is essentially optimal IS error for a given parametric family . Intuitively, when the proposal is from a different parametric family than , the gradient OAIS optimises the error bound in order to obtain the best possible proposal. In particular, the MSE has two components: First component which can be made vanished over time by improving the proposal and the second component which is related to . The quantity is related to minimum -divergence between the target and proposal. This means that the discrepancy between the target and optimal proposal (with respect to minimum divergence) can only be tackled by increasing . This kind of intuition will be same for the schemes we analyse in the next sections, although the rate with respect to the number of iterations will necessarily worsen because of the noise on gradients.

###### Remark 9.

Note that the one can use different maps for optimisation. For example, one can use Nesterov’s accelerated gradient descent (which has more parameters than just a step size), in which case, one could prove a following type of result,

 E[((φ,π)−(φ,πNθt))2] ≤C1t2N+C2ρ⋆N,

where and are finite constants with respect to and . Note that this is an improved convergence rate going from to in the first term of the bound.

### 4.2 Convergence rate with stochastic gradients

While it is convenient to assume the minimisation of can be done deterministically, this is rarely the case. Especially since is defined as an integral itself, the best case is that we have a stochastic unbiased gradient rather than a deterministic one.

#### 4.2.1 Normalised case

First, we assume that we can evaluate , which means that at iteration , we can obtain , which is an unbiased estimate of . We use the optimisation algorithms called stochastic gradient

methods, which use stochastic and unbiased estimates of the gradients to optimise a given cost function. Particularly, we will focus on optimised samplers using stochastic gradient descent (SGD) as an adaptation strategy.

In order to prove convergence for this case, we first recall classical result for the SGD.

###### Lemma 3.

Assume at time , we obtain where where

 E[∥gt∥22]≤σ2ρ

for . Choose the step-size sequence for where . Then,

 E[ρ(¯θt)−ρ⋆]≤E∥θ0−θ⋆∥222α√t+ασ2ρ√t, (13)

where .

###### Proof.

See Appendix A.4 for a self-contained proof.

###### Remark 10.

There are various results similar to (13) in stochastic optimisation literature. Depending on the proof technique, the constant in the bound can change and can be optimised. However, all results regarding SGD for non-strongly convex functions arrive at the convergence rate is . Moreover, it is also possible to prove the same rate for iterates rather than averages [Nemirovski et al., 2009]. We note that a result with the same rate can be proven if the MSE of the stochastic gradients are bounded as well.

We can now state our first main result, which is the convergence rate for AIS using a SGD adaptation.

###### Theorem 4.

Assume that the sequence is chosen as in (11) and . Let be the sequence of proposal distributions. Then we have the following result

 E[((φ,π)−(φ,πN¯θt))2] ≤cφE∥θ0−θ⋆∥222α√tN+cφασ2ρ√tN+cφρ⋆N,

where is a constant independent of and .

###### Proof.

See Appendix A.5.

This theorem can be interpreted similarly to Theorem 3. One can see that the overall rate of the MSE bound is . This means that, as , we are only left with a rate that is optimal for the AIS for a given parametric proposal family. In particular, again, is related to minimal -divergence between the target and the proposal . When the proposal and the target is from the same family, we are back to the case , thus the adaptation leads to the standard, perfect Monte Carlo rate for the error within this setting as well.

#### 4.2.2 Self-normalised case

We have noted that it is possible to obtain an unbiased estimate of when the normalized target can be evaluated. However, if we can only evaluate the unnormalized density instead of and use the self-normalized IS estimator, this estimate is no longer unbiased. We refer to Sec. 5 of Tadić and Doucet  for stochastic optimisation with biased gradients for adaptive Monte Carlo where the discussion is over minimizing Kullback-Leibler divergence rather than -divergence. The results presented in Tadić and Doucet , however, are asymptotic in nature. We are interested in finite-time bounds in this paper.

Due to the special structure of our case, it is possible to avoid having to deal with a biased gradient. We note that although it is not possible to obtain an unbiased estimate of , we can straightforwardly obtain an unbiased estimate of . Since optimising the unnormalised function will lead to same minima as the normalised function , we can only deal with for the adaptation in the self-normalised case. In order to start, we first state the version of Lemma 4 for .

###### Lemma 4.

Assume that and

 E[∥~gt∥2]≤σ2R.

Choose the step-size for where . Then we have

 E[R(¯θt)−R⋆]≤E∥θ0−θ⋆∥222β√t+βσ2R√t (14)

where . This in turn implies that

 E[ρ(¯θt)−ρ⋆]≤E∥θ0−θ⋆∥222βZ2π√t+βσ2RZ2π√t. (15)
###### Proof.

The proof of the rate in (14) is identical to the proof of Lemma 3. The rate in (15) follows by observing for every .

Finally, using Lemma 4, we can state our main result, that is the error rate for the MSE of Algorithm 3.

###### Theorem 5.

Assume that the sequence is chosen as in (12) and . Let be the sequence of proposal distributions. Then we have the following result

 E[((φ,π)−(φ,πN¯θt))2] ≤cφE∥θ0−θ⋆∥222βZ2π√tN+cφβσ2RZ2π√tN+cφρ⋆N,

where is a constant independent of and .

###### Proof.

The proof follows from Lemma 4 and mimicking the exact same steps as in the proof of Theorem 4.

###### Remark 11.

Theorem 4, as in Remark 8, reveals important facts about the behaviour of the SG-OAIS. In particular, for a general target , we obtain

 limt→∞∥∥(φ,π)−(φ,πN¯θt)∥∥2=O(√ρ⋆N).

This result shows that the error is asymptotically perfect. As in previous cases, if the target is in the exponential family, then the asymptotic convergence rate is as .

###### Remark 12.

We note that despite the rate (15) looks faster compared to the rate provided in Theorem 4, this is usually not the case since the variance of the noise induced in unnormalized gradients are typically higher. In general, it can be conjectured that . Therefore, one must choose in order to obtain the same convergence rate with the normalized case. This implies, in order to obtain the same rate, one must choose much smaller step-sizes compared to the normalized case. However, this direction requires further analysis.

## 5 Conclusions

We have presented and analysed optimised parametric adaptive importance samplers and provided non-asymptotic convergence bounds for the MSE of these samplers. Our results display the exact interplay between the number of iterations and the number of samples . In particular, we have shown that the optimised samplers converge to an optimal sampling regime as leading to an asymptotic rate of . This intuitively shows that, the number of samples should be set in proportion to minimum -divergence between the target and the exponential family proposal, as we have shown that the adaptation (in the sense of minimising -divergence or, equivalently, the variance of the weight function) cannot improve the error rate beyond . The error rates in this regime may be dominated by how close the target is to the exponential family.

Note that, the algorithms we have analysed require constant computational load at each iteration and the computational load does not increase with as we do not re-use the samples in past iterations. Such schemes, however, can also be considered and analysed in the same manner. More specifically, the computational cost of each iteration depends on the cost of evaluating .

Our work implies several future directions. One direction is to analyse the methods with more advanced optimisation algorithms. A challenging direction is to consider more general proposals than the natural exponential family, which may lead to non-convex optimisation problems of adaptation. Analysing and providing guarantees for this general case would provide foundational insights for general adaptive importance sampling procedures. Also, as shown by Ryu , above theorems can also be proved for -divergences.

Another related piece of work arises from variational inference [Wainwright and Jordan, 2008]. In particular, Dieng et al.  have recently considered performing variational inference by minimising the -divergence, which is close to the setting in this paper. In particular, the variational approximation of the target distribution in the variational setting coincides with the proposal distribution we consider within the importance sampling context in this paper. This also implies that our results may be used to obtain finite-time guarantees for the expectations estimated using the variational approximations of target distributions.

Finally, the adaptation procedure can be converted into a sampling problem as well. In particular, instead of optimisation methods, one can use Markov chain Monte Carlo (MCMC) algorithms in a similar vein to

Martino et al.  in order to adapt the proposals and the results analogous to the ones in this paper might be possibly proven within the MCMC-adaptation setting.

## Acknowledgements

Ö. D. A. thanks to Melanie F. Pradier for fruitful discussions on -divergence minimization.

Ö. D. A. is funded by the Lloyds Register Foundation programme on Data Centric Engineering through the London Air Quality project. This work was supported by The Alan Turing Institute for Data Science and AI under EPSRC grant EP/N510129/1.

J. M. acknowledges the support of the Spanish Agencia Estatal de Investigación (award TEC2015-69868-C2-1-R ADVENTURE) and the Office of Naval Research (award no. N00014-19-1-2226).

## Appendix A Appendix

### a.1 Proof of Theorem 1

We first note the following inequalities,

 |(φ,π)−(φ,πNθ)| =∣∣ ∣∣(φWθ,qθ)(Wθ,qθ)−(φWθ,qNθ)(Wθ,qNθ)∣∣ ∣∣ ≤∣∣(φWθ,qθ)−(φWθ,qNθ)∣∣|(Wθ,qθ)|+|(φWθ,qNθ)|∣∣ ∣∣1(Wθ,qθ)−1(Wθ,qNθ)∣∣ ∣∣ =∣∣(φWθ,qθ)−(φWθ,qNθ)∣∣|(Wθ,qθ)|+∥φ∥∞|(Wθ,qNθ)|∣∣ ∣∣(Wθ,qNθ)−(Wθ,qθ)(Wθ,qθ)(Wθ,qNθ)∣∣ ∣∣ =∣∣(φWθ,qθ)−(φWθ,qNθ)∣∣+∥φ∥∞|(Wθ,qNθ)−(Wθ,qθ)|(Wθ,qθ).

We take squares of both sides and apply the inequality to further bound the rhs,

 |(φ,π)−(φ,πNθ)|2 ≤2∣∣(φWθ,qθ)−(φWθ,qNθ)∣∣2+∥φ∥2∞|(Wθ,qNθ)−(Wθ,qθ)|2(Wθ,qθ)2.

We now take the expectation of both sides,

 E[((φ,π)−(φ,πNθ))2]≤ 2(Wθ,qθ)2(E[((φWθ,qθ)−(φWθ,qNθ))2]+ ∥φ∥2∞E[((Wθ,qNθ)−(Wθ,qθ))2]).

Note that, both terms in the right hand side are perfect Monte Carlo estimates of the integrals. Bounding the MSE of these integrals yields

 E[((φ,π)−(φ,πNθ))2]≤ 2N(φ2W2θ,qθ)−(φWθ,qθ)2(Wθ,qθ)2+2∥φ∥2∞N(W2θ,qθ)−(Wθ,qθ)2(Wθ,qθ)2, ≤