Optimality in Noisy Importance Sampling

01/07/2022
by Fernando Llorente, et al.

In this work, we analyze noisy importance sampling (IS), i.e., IS working with noisy evaluations of the target density. We present the general framework and derive optimal proposal densities for noisy IS estimators. The optimal proposals incorporate information about the variance of the noisy realizations, proposing points in regions where the noise power is higher. We also compare the use of the optimal proposals with previous optimality approaches considered in the noisy IS framework.



1 Introduction

A wide range of modern applications, especially in the Bayesian inference framework [1], require the study of probability density functions (pdfs) that can only be evaluated stochastically, i.e., only noisy evaluations can be obtained [2, 3, 4, 5]. For instance, this is the case in pseudo-marginal approaches and doubly intractable posteriors [6, 7], and in approximate Bayesian computation (ABC) and likelihood-free schemes [8, 9], where the target density cannot be computed in closed form. The noisy scenario also appears naturally when mini-batches of data are employed instead of the complete likelihood of huge amounts of data [10, 11]. More recently, the analysis of noisy functions of densities has become necessary in reinforcement learning (RL), especially in direct policy search, an important branch of RL with applications in robotics [12, 13]. The topic of inference in noisy settings (or where a function is known only with a certain degree of uncertainty) is also of interest in the inverse problem literature, such as in the calibration of expensive computer codes [14, 15]. This is also the case when an emulator is constructed as a surrogate model [4, 16, 17].

In this work, we study the importance sampling (IS) scheme under noisy evaluations of the target pdf. The noisy IS scenario has already been analyzed in the literature [2, 3, 18]. In the context of optimization, some theoretical results can be found in [19]. In the sequential framework, IS schemes with random weights have been studied in different works [2, 20, 21, 22]. We provide the optimal proposal densities for different noisy scenarios, including the case of integrals involving vector-valued functions. Moreover, we discuss the convergence and variance of the estimators in a general setting. We take a different approach with respect to other studies in the literature [3, 23]. In those works, the authors analyzed the trade-off between decreasing the noise power (by increasing the number of auxiliary samples) and increasing the total number of samples in the IS estimators. Here, this information is encompassed within the optimal proposal density, which plays a role similar to that of an acquisition function in active learning [17, 16]. This information is especially relevant when the noisy evaluations are also costly to obtain.

2 Background

2.1 Bayesian inference

In many applications, we aim at inferring a variable of interest given a set of observations or measurements. Let us denote the variable of interest by $\mathbf{x} \in \mathcal{X} \subseteq \mathbb{R}^{d}$, and let $\mathbf{y} \in \mathbb{R}^{d_y}$ be the observed data. The posterior pdf is then

$$\bar{\pi}(\mathbf{x}|\mathbf{y}) = \frac{\ell(\mathbf{y}|\mathbf{x})\, g(\mathbf{x})}{Z(\mathbf{y})}, \qquad (1)$$

where $\ell(\mathbf{y}|\mathbf{x})$ is the likelihood function, $g(\mathbf{x})$ is the prior pdf, and $Z(\mathbf{y})$ is the model evidence (a.k.a. marginal likelihood), which is a useful quantity in model selection problems [24]. For simplicity, in the following, we skip the dependence on $\mathbf{y}$, writing $\bar{\pi}(\mathbf{x})$ and $Z$. Generally, $Z$ is unknown, so we are only able to evaluate the unnormalized target function, i.e., the numerator of the right-hand side of Eq. (1),

$$\pi(\mathbf{x}) = \ell(\mathbf{y}|\mathbf{x})\, g(\mathbf{x}). \qquad (2)$$

The analytical study of the posterior density is generally infeasible, so numerical approximations are required [25, 1].

2.2 Noisy framework

Generally, we desire to approximate the unnormalized density $\pi(\mathbf{x})$, and the corresponding normalizing constant $Z$, using Monte Carlo methods. The unnormalized density $\pi(\mathbf{x})$ can represent a posterior density in a Bayesian inference problem, as described above. We assume that, for any $\mathbf{x}$, we cannot evaluate $\pi(\mathbf{x})$ exactly, but only have access to a related noisy realization. Moreover, in many applications, obtaining such a noisy realization may be expensive. Hence, determining at which inputs $\mathbf{x}$ we should request a noisy realization of $\pi(\mathbf{x})$ is an important problem, related to the concept of optimality that we consider below.
In the following, we introduce a concise mathematical formalization of the noisy scenario. This simple framework contains real application scenarios, such as latent variable models [3] (see Example 4 in Sect. 4.1), the likelihood-free inference setting [8], doubly intractable posteriors [7], and mini-batch-based inference [10]. More specifically, we assume to have access to a noisy realization $\widetilde{\pi}(\mathbf{x})$ related to $\pi(\mathbf{x})$, i.e.,

$$\widetilde{\pi}(\mathbf{x}) = T\big(\pi(\mathbf{x}), \epsilon\big), \qquad (3)$$

where $T$ is a non-linear transformation involving $\pi(\mathbf{x})$ and $\epsilon$, that is some noise perturbation. Thus, for a fixed value $\mathbf{x}$, $\widetilde{\pi}(\mathbf{x})$ is a random variable with

$$\mathbb{E}\big[\widetilde{\pi}(\mathbf{x})\big] = m(\mathbf{x}), \qquad \operatorname{var}\big[\widetilde{\pi}(\mathbf{x})\big] = s^2(\mathbf{x}), \qquad (4)$$

for some mean function, $m(\mathbf{x})$, and variance function, $s^2(\mathbf{x})$. The assumption that $\widetilde{\pi}(\mathbf{x})$ be strictly positive is important in practice [2, 23].

Noise power. In some applications, it is also possible to control the noise power $s^2(\mathbf{x})$, for instance by adding/removing data in the mini-batches (e.g., in the context of big data) [10], increasing the number of auxiliary samples in latent variable models [6], or interacting with an environment over longer/shorter periods of time (e.g., in reinforcement learning) [12].
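As a toy illustration of this controllable noise power, the sketch below (Python; the Gaussian noise model and target are our assumptions, not from the paper) shows how averaging $M$ unbiased auxiliary realizations reduces $s^2(\mathbf{x})$ roughly as $1/M$:

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_eval(x, M, sigma=0.5):
    """Noisy evaluation of a hypothetical target pi(x) = exp(-x^2/2):
    the average of M unbiased realizations, so the noise power s^2(x)
    scales roughly as sigma^2 / M."""
    true_val = np.exp(-x**2 / 2)
    return np.mean(true_val + sigma * rng.standard_normal(M))

# Empirical noise power of the realization for increasing M
vars_by_M = {M: np.var([noisy_eval(0.0, M) for _ in range(5000)])
             for M in (1, 10, 100)}
print(vars_by_M)  # decreases roughly as 0.25 / M
```

Increasing $M$ trades computation for accuracy, the trade-off discussed in Sect. 4.4.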

Unbiased scenario and related cases. The scenario where $m(\mathbf{x}) = \pi(\mathbf{x})$ appears naturally in some applications (such as in the estimation of latent variable and stochastic volatility models in statistics [3, 26], or in the context of optimal filtering of partially observed stochastic processes [18]), or it is often assumed as a pre-established condition by the authors [3, 2]. In some other scenarios, the noisy realizations are known to be unbiased estimates of some transformation of $\pi(\mathbf{x})$, e.g., of $\log \pi(\mathbf{x})$ [27, 28]. This situation can be encompassed by the following special case. If we consider an additive perturbation,

$$\widetilde{\pi}(\mathbf{x}) = g\big(\pi(\mathbf{x})\big) + \epsilon, \qquad \mathbb{E}[\epsilon] = 0, \qquad (5)$$

we have $m(\mathbf{x}) = g(\pi(\mathbf{x}))$, where $g$ is some transformation. If $g$ is known and invertible, we have $\pi(\mathbf{x}) = g^{-1}(m(\mathbf{x}))$.
Generally, we can state that $m(\mathbf{x})$ always contains statistical information related to $\pi(\mathbf{x})$. The subsequent use of $m(\mathbf{x})$ depends on the specific application. Thus, we study the mean function $m(\mathbf{x})$. Hence, our goal is to approximate efficiently integrals involving $m(\mathbf{x})$, i.e.,

$$\mathbf{I} = \frac{1}{Z} \int_{\mathcal{X}} \mathbf{f}(\mathbf{x})\, m(\mathbf{x}) \, d\mathbf{x}, \qquad Z = \int_{\mathcal{X}} m(\mathbf{x}) \, d\mathbf{x}, \qquad (6)$$

where $\mathbf{f}: \mathcal{X} \to \mathbb{R}^{K}$ and $\mathbf{I}$ denotes the vector of integrals of interest. Note that, in the unbiased case $m(\mathbf{x}) = \pi(\mathbf{x})$, we have $\mathbf{I} = \mathbb{E}_{\bar{\pi}}[\mathbf{f}(\mathbf{x})]$. An integral involving $m(\mathbf{x})$ can be approximated with a cloud of random samples, using the noisy realizations, via Monte Carlo methods.

3 Noisy Importance Sampling

In a non-noisy IS scheme, a set of samples $\{\mathbf{x}_n\}_{n=1}^{N}$ is drawn from a proposal density $q(\mathbf{x})$. Then each sample is weighted according to the ratio $\frac{\pi(\mathbf{x}_n)}{q(\mathbf{x}_n)}$. A noisy version of importance sampling is obtained when we substitute the evaluations of $\pi(\mathbf{x})$ with noisy realizations $\widetilde{\pi}(\mathbf{x})$. See Table 1, and note that the importance weights in Eq. (9) are computed using the noisy realizations. Below, we show that

$$\widehat{Z} = \frac{1}{N} \sum_{n=1}^{N} w_n \qquad (7)$$

is an unbiased estimator of $Z$, and

$$\widehat{\mathbf{I}} = \frac{1}{N Z} \sum_{n=1}^{N} w_n \mathbf{f}(\mathbf{x}_n), \qquad \widetilde{\mathbf{I}} = \frac{1}{N \widehat{Z}} \sum_{n=1}^{N} w_n \mathbf{f}(\mathbf{x}_n), \qquad (8)$$

are consistent estimators of $\mathbf{I}$. The estimator $\widehat{\mathbf{I}}$ requires the knowledge of $Z$, which is not needed in the so-called self-normalized estimator, $\widetilde{\mathbf{I}}$.

  • Inputs: Proposal distribution $q(\mathbf{x})$.

  • For $n = 1, \ldots, N$:

    1. Sample $\mathbf{x}_n \sim q(\mathbf{x})$ and obtain one realization $\widetilde{\pi}(\mathbf{x}_n)$.

    2. Compute

      $$w_n = \frac{\widetilde{\pi}(\mathbf{x}_n)}{q(\mathbf{x}_n)}. \qquad (9)$$
  • Outputs: the weighted samples $\{\mathbf{x}_n, w_n\}_{n=1}^{N}$.

Table 1: Noisy importance sampling algorithm
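The algorithm in Table 1 is only a few lines of code. The sketch below implements the noisy weights of Eq. (9) and the estimators of Eqs. (7)-(8), assuming for illustration an unnormalized Gaussian target observed under unbiased additive noise and a uniform proposal (these choices and all names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed setup: unnormalized target pi(x) = exp(-x^2/2) (so Z = sqrt(2*pi)),
# observed through unbiased additive noise; uniform proposal q on [-6, 6].
def noisy_pi(x, sigma=0.2):
    return np.exp(-x**2 / 2) + sigma * rng.standard_normal(x.shape)

def noisy_is(N, f):
    x = rng.uniform(-6.0, 6.0, N)           # 1. sample from the proposal q
    w = noisy_pi(x) / (1.0 / 12.0)          # 2. noisy weights, Eq. (9)
    Z_hat = np.mean(w)                      # Eq. (7): unbiased estimator of Z
    I_tilde = np.sum(w * f(x)) / np.sum(w)  # self-normalized estimator, Eq. (8)
    return Z_hat, I_tilde

Z_hat, I_tilde = noisy_is(N=200_000, f=lambda x: x**2)
print(Z_hat, I_tilde)  # close to sqrt(2*pi) ≈ 2.507 and E[x^2] = 1
```

The only change with respect to standard IS is that `noisy_pi` returns a random realization instead of an exact evaluation.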
Theorem.

The estimators above, constructed from the output of noisy IS, converge to expectations under $m(\mathbf{x})$. More specifically, $\widehat{Z}$ and $\widehat{\mathbf{I}}$ are unbiased estimators of $Z$ and $\mathbf{I}$, respectively, and $\widetilde{\mathbf{I}}$ is a consistent estimator of $\mathbf{I}$. Moreover, these estimators have higher variance than their non-noisy counterparts.

Proof.

Here, we provide a simple proof of convergence by applying iterated conditional expectations. Equivalently, the correctness of the approach can be proved by using an extended-space view (see, e.g., [18, 3]).
Let $\mathbf{x}_{1:N}$ denote the samples from $q(\mathbf{x})$. By the law of total expectation, we have $\mathbb{E}[\widehat{Z}] = \mathbb{E}\big[\mathbb{E}[\widehat{Z} \mid \mathbf{x}_{1:N}]\big]$. In the inner expectation, we use the fact that the $\widetilde{\pi}(\mathbf{x}_n)$'s are independent given $\mathbf{x}_{1:N}$, hence

$$\mathbb{E}\big[\widehat{Z} \mid \mathbf{x}_{1:N}\big] = \frac{1}{N} \sum_{n=1}^{N} \frac{m(\mathbf{x}_n)}{q(\mathbf{x}_n)},$$

which is the non-noisy IS estimator of $Z$ and is also unbiased, i.e.,

$$\mathbb{E}\left[\frac{1}{N} \sum_{n=1}^{N} \frac{m(\mathbf{x}_n)}{q(\mathbf{x}_n)}\right] = \int_{\mathcal{X}} m(\mathbf{x}) \, d\mathbf{x} = Z.$$

Therefore, $\widehat{Z}$ is an unbiased estimator of $Z$, i.e., $\mathbb{E}[\widehat{Z}] = Z$. Moreover, we show below that $\operatorname{var}[\widehat{Z}]$ decreases to zero as $N \to \infty$. Hence, $\widehat{Z}$ is a consistent estimator of $Z$. Now, with the same arguments, we can prove that the estimator $\frac{1}{N}\sum_{n=1}^{N} w_n \mathbf{f}(\mathbf{x}_n)$ is also unbiased and converges to $Z\mathbf{I}$. Thus, both the estimator $\widehat{\mathbf{I}}$, and the ratio

$$\widetilde{\mathbf{I}} = \frac{\frac{1}{N}\sum_{n=1}^{N} w_n \mathbf{f}(\mathbf{x}_n)}{\widehat{Z}},$$

which is the noisy self-normalized IS estimator in Eq. (8), are consistent estimators of $\mathbf{I}$ given in Eq. (6).

Variance of $\widehat{Z}$. By the law of total variance, we have that

$$\operatorname{var}\big[\widehat{Z}\big] = \mathbb{E}\Big[\operatorname{var}\big[\widehat{Z} \mid \mathbf{x}_{1:N}\big]\Big] + \operatorname{var}\Big[\mathbb{E}\big[\widehat{Z} \mid \mathbf{x}_{1:N}\big]\Big].$$

In a non-noisy IS setting, the first term is null. Using the fact that $\mathbb{E}[\widehat{Z} \mid \mathbf{x}_{1:N}]$ is the unbiased non-noisy IS estimator, the second term is

$$\operatorname{var}\left[\frac{1}{N}\sum_{n=1}^{N} \frac{m(\mathbf{x}_n)}{q(\mathbf{x}_n)}\right] = \frac{1}{N}\left(\int_{\mathcal{X}} \frac{m(\mathbf{x})^2}{q(\mathbf{x})} \, d\mathbf{x} - Z^2\right).$$

Regarding the first term, assuming $q(\mathbf{x}) > 0$ for all $\mathbf{x} \in \mathcal{X}$, we have

$$\mathbb{E}\Big[\operatorname{var}\big[\widehat{Z} \mid \mathbf{x}_{1:N}\big]\Big] = \mathbb{E}\left[\frac{1}{N^2}\sum_{n=1}^{N} \frac{s^2(\mathbf{x}_n)}{q(\mathbf{x}_n)^2}\right] = \frac{1}{N} \int_{\mathcal{X}} \frac{s^2(\mathbf{x})}{q(\mathbf{x})} \, d\mathbf{x}.$$

Hence, we finally have that

$$\operatorname{var}\big[\widehat{Z}\big] = \frac{1}{N}\left(\int_{\mathcal{X}} \frac{m(\mathbf{x})^2 + s^2(\mathbf{x})}{q(\mathbf{x})} \, d\mathbf{x} - Z^2\right). \qquad (10)$$

Therefore, $\widehat{Z}$ has a greater variance than its non-noisy counterpart, but the same convergence speed, i.e., its variance decreases at rate $\mathcal{O}(1/N)$. Proving that $\widehat{\mathbf{I}}$ has greater variance than its non-noisy version is straightforward.
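The theorem can be checked numerically. In the sketch below (a toy unnormalized Gaussian target with an assumed additive noise model; all choices are ours), the noisy estimator of $Z$ remains unbiased while its variance exceeds that of its non-noisy counterpart, matching the decomposition above:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy setup (ours): pi(x) = exp(-x^2/2), uniform proposal q on [-6, 6],
# additive noise with constant standard deviation s = 0.3.
pi = lambda x: np.exp(-x**2 / 2)
a, b, s = -6.0, 6.0, 0.3

def Z_hat(N, noisy):
    x = rng.uniform(a, b, N)
    p = pi(x) + (s * rng.standard_normal(N) if noisy else 0.0)
    return np.mean(p * (b - a))  # w_n = pi_tilde(x_n)/q(x_n), with q = 1/(b-a)

runs = 2000
noisy_est = np.array([Z_hat(500, True) for _ in range(runs)])
clean_est = np.array([Z_hat(500, False) for _ in range(runs)])
print(noisy_est.mean())                   # still centered near Z = sqrt(2*pi)
print(noisy_est.var() > clean_est.var())  # True: noise inflates the variance
```

Both empirical variances shrink at the same $\mathcal{O}(1/N)$ rate; only the constant differs, as Eq. (10) predicts.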

4 Optimal Proposal Density in Noisy IS

In this section, we derive the optimal proposals for the noisy IS estimators $\widehat{Z}$, $\widehat{\mathbf{I}}$, and $\widetilde{\mathbf{I}}$.

4.1 Optimal proposal for $\widehat{Z}$

We can rewrite the variance of $\widehat{Z}$ in Eq. (10) as

$$\operatorname{var}\big[\widehat{Z}\big] = \frac{1}{N}\int_{\mathcal{X}} \frac{m(\mathbf{x})^2 + s^2(\mathbf{x})}{q(\mathbf{x})} \, d\mathbf{x} - \frac{Z^2}{N}.$$

By Jensen's inequality, the first term is bounded below by

$$\int_{\mathcal{X}} \frac{m(\mathbf{x})^2 + s^2(\mathbf{x})}{q(\mathbf{x})} \, d\mathbf{x} \geq \left(\int_{\mathcal{X}} \sqrt{m(\mathbf{x})^2 + s^2(\mathbf{x})} \, d\mathbf{x}\right)^2.$$

The minimum variance is thus attained at

$$q^*(\mathbf{x}) \propto \sqrt{m(\mathbf{x})^2 + s^2(\mathbf{x})}. \qquad (11)$$

Note that, for finite $N$, $\operatorname{var}[\widehat{Z}]$ is always greater than 0, specifically,

$$\operatorname{var}\big[\widehat{Z} \mid q^*\big] = \frac{1}{N}\left[\left(\int_{\mathcal{X}} \sqrt{m(\mathbf{x})^2 + s^2(\mathbf{x})} \, d\mathbf{x}\right)^2 - Z^2\right] > 0. \qquad (12)$$

Hence, differently from the non-noisy setting, in noisy IS the optimal proposal does not provide an estimator with null variance. If $s^2(\mathbf{x}) = 0$ for all $\mathbf{x}$, then we come back to the non-noisy scenario and $q^*(\mathbf{x}) \propto m(\mathbf{x})$. Note that the variance of $\widehat{Z}$ using the non-noisy optimal proposal $\bar{q}(\mathbf{x}) = \frac{m(\mathbf{x})}{Z}$ is

$$\operatorname{var}\big[\widehat{Z} \mid \bar{q}\big] = \frac{Z}{N}\int_{\mathcal{X}} \frac{s^2(\mathbf{x})}{m(\mathbf{x})} \, d\mathbf{x}. \qquad (13)$$
In the following, we show several examples of noise models and their corresponding optimal proposal densities.
Example 1. Let us consider a Bernoulli-type noise where $\widetilde{\pi}(\mathbf{x}) = \frac{\epsilon}{p(\mathbf{x})}\, \pi(\mathbf{x})$, where $\epsilon \sim \text{Bernoulli}(p(\mathbf{x}))$, and $p(\mathbf{x}) \in (0, 1]$. Then, we have

$$m(\mathbf{x}) = \pi(\mathbf{x}), \qquad s^2(\mathbf{x}) = \pi(\mathbf{x})^2\, \frac{1 - p(\mathbf{x})}{p(\mathbf{x})}.$$

Replacing in Eq. (11), the optimal proposal density in this case is

$$q^*(\mathbf{x}) \propto \frac{\pi(\mathbf{x})}{\sqrt{p(\mathbf{x})}}. \qquad (14)$$

Example 2. Let us consider $\widetilde{\pi}(\mathbf{x}) = |\pi(\mathbf{x}) + \epsilon|$, with $\epsilon \sim \mathcal{N}(0, \sigma^2(\mathbf{x}))$. In this scenario, the random variable $\widetilde{\pi}(\mathbf{x})$ corresponds to a folded Gaussian random variable. We have

$$m(\mathbf{x}) = \sigma(\mathbf{x})\sqrt{\tfrac{2}{\pi}}\, e^{-\frac{\pi(\mathbf{x})^2}{2\sigma^2(\mathbf{x})}} + \pi(\mathbf{x})\left(1 - 2\Phi\!\left(-\tfrac{\pi(\mathbf{x})}{\sigma(\mathbf{x})}\right)\right), \qquad m(\mathbf{x})^2 + s^2(\mathbf{x}) = \pi(\mathbf{x})^2 + \sigma^2(\mathbf{x}),$$

where $\Phi$ is the cumulative distribution function of the standard Gaussian distribution. Then,

$$q^*(\mathbf{x}) \propto \sqrt{\pi(\mathbf{x})^2 + \sigma^2(\mathbf{x})}. \qquad (15)$$

Example 3. Let us consider a multiplicative noise, $\widetilde{\pi}(\mathbf{x}) = \epsilon\, \pi(\mathbf{x})$ with $\epsilon > 0$, hence

$$m(\mathbf{x}) = \mathbb{E}[\epsilon]\, \pi(\mathbf{x}), \qquad s^2(\mathbf{x}) = \operatorname{var}[\epsilon]\, \pi(\mathbf{x})^2.$$

If we denote $\mu_\epsilon = \mathbb{E}[\epsilon]$ and $\sigma_\epsilon^2 = \operatorname{var}[\epsilon]$, then $m(\mathbf{x}) = \mu_\epsilon \pi(\mathbf{x})$ and $s^2(\mathbf{x}) = \sigma_\epsilon^2 \pi(\mathbf{x})^2$. In this case, the optimal proposal coincides with the optimal one in the non-noisy setting, since

$$q^*(\mathbf{x}) \propto \sqrt{\mu_\epsilon^2 + \sigma_\epsilon^2}\; \pi(\mathbf{x}) \propto \pi(\mathbf{x}). \qquad (16)$$

Example 4. In latent variable models, the noisy realization corresponds to the product of independent IS estimators, each built from $M$ auxiliary samples. With $M$ large enough, the distribution of this realization is approximately lognormal, i.e.,

$$\log \widetilde{\pi}(\mathbf{x}) \sim \mathcal{N}\!\left(\log \pi(\mathbf{x}) - \tfrac{\gamma^2(\mathbf{x})}{2},\; \gamma^2(\mathbf{x})\right),$$

where $\gamma^2(\mathbf{x}) = \frac{\sigma^2(\mathbf{x})}{M}$, for some function $\sigma^2(\mathbf{x})$ [3, 23]. Equivalently, they write $\widetilde{\pi}(\mathbf{x}) = \pi(\mathbf{x})\, e^{e(\mathbf{x})}$, where $e(\mathbf{x}) \sim \mathcal{N}\big(-\tfrac{\gamma^2(\mathbf{x})}{2}, \gamma^2(\mathbf{x})\big)$. Hence,

$$m(\mathbf{x}) = \pi(\mathbf{x}), \qquad s^2(\mathbf{x}) = \pi(\mathbf{x})^2\left(e^{\gamma^2(\mathbf{x})} - 1\right),$$

and the optimal proposal is

$$q^*(\mathbf{x}) \propto \pi(\mathbf{x})\, e^{\frac{\gamma^2(\mathbf{x})}{2}}. \qquad (17)$$

This example is related to the cases studied in [3, 23].
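The gain delivered by Eq. (11) can be verified numerically on a grid. The sketch below uses an assumed mean function and a hypothetical variance function $s^2(x)$ (both choices are ours, not from the paper) and compares the variance term of Eq. (10) under $q^*$ against the non-noisy optimal proposal $\bar{q} \propto m$:

```python
import numpy as np

# Grid comparison of the noisy-optimal proposal q*(x), Eq. (11), against
# the non-noisy optimal q_bar(x) ∝ m(x), for an assumed noise power s^2(x).
x = np.linspace(-6, 6, 4001)
dx = x[1] - x[0]
m = np.exp(-x**2 / 2)          # assumed mean function m(x)
s2 = 0.5 * x**2 * m            # hypothetical variance function s^2(x)
Z = m.sum() * dx

def n_var(q):
    """N * var(Z_hat) = ∫ (m^2 + s^2)/q dx - Z^2, with q normalized on the grid."""
    q = q / (q.sum() * dx)
    return ((m**2 + s2) / q).sum() * dx - Z**2

q_star = np.sqrt(m**2 + s2)    # Eq. (11)
q_bar = m.copy()               # non-noisy optimal proposal
print(n_var(q_star) < n_var(q_bar))  # True: q* attains strictly lower variance
```

The gap is large here because $s^2(x)/m(x)^2$ is far from constant; whenever that ratio is constant (as in Example 3), the two proposals coincide.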

4.2 Optimal proposal for $\widehat{\mathbf{I}}$

We have already seen that the optimal proposal minimizing the variance of $\widehat{Z}$ is $q^*(\mathbf{x}) \propto \sqrt{m(\mathbf{x})^2 + s^2(\mathbf{x})}$. Let us now consider the estimator $\widehat{\mathbf{I}}$. Note that this estimator assumes we can evaluate $Z$. Since we are considering a vector-valued function $\mathbf{f}(\mathbf{x}) = [f_1(\mathbf{x}), \ldots, f_K(\mathbf{x})]^\top$, the estimator has components $\widehat{I}_1, \ldots, \widehat{I}_K$, and $\operatorname{var}[\widehat{\mathbf{I}}]$ corresponds to a covariance matrix. We aim to find the proposal that minimizes the sum of the diagonal variances. From the results of the previous section, it is straightforward to show that the variance of the $k$-th component is

$$\operatorname{var}\big[\widehat{I}_k\big] = \frac{1}{N Z^2}\int_{\mathcal{X}} \frac{f_k(\mathbf{x})^2\left(m(\mathbf{x})^2 + s^2(\mathbf{x})\right)}{q(\mathbf{x})} \, d\mathbf{x} - \frac{I_k^2}{N},$$

where $f_k$ and $I_k$ are respectively the $k$-th components of $\mathbf{f}$ and $\mathbf{I}$ (the same decomposition holds for the non-noisy estimator, i.e., using $m(\mathbf{x})$ instead of $\widetilde{\pi}(\mathbf{x})$, with $s^2 \equiv 0$). Thus,

$$\sum_{k=1}^{K}\operatorname{var}\big[\widehat{I}_k\big] = \frac{1}{N Z^2}\int_{\mathcal{X}} \frac{\|\mathbf{f}(\mathbf{x})\|^2\left(m(\mathbf{x})^2 + s^2(\mathbf{x})\right)}{q(\mathbf{x})} \, d\mathbf{x} - \frac{\|\mathbf{I}\|^2}{N}.$$

By Jensen's inequality, we have

$$\int_{\mathcal{X}} \frac{\|\mathbf{f}(\mathbf{x})\|^2\left(m(\mathbf{x})^2 + s^2(\mathbf{x})\right)}{q(\mathbf{x})} \, d\mathbf{x} \geq \left(\int_{\mathcal{X}} \|\mathbf{f}(\mathbf{x})\|\sqrt{m(\mathbf{x})^2 + s^2(\mathbf{x})} \, d\mathbf{x}\right)^2,$$

where $\|\cdot\|$ denotes the Euclidean norm. The equality holds if and only if $\frac{\|\mathbf{f}(\mathbf{x})\|^2\left(m(\mathbf{x})^2 + s^2(\mathbf{x})\right)}{q(\mathbf{x})^2}$ is constant. Hence, the optimal proposal is

$$q^*_{\widehat{\mathbf{I}}}(\mathbf{x}) \propto \|\mathbf{f}(\mathbf{x})\|\sqrt{m(\mathbf{x})^2 + s^2(\mathbf{x})}. \qquad (18)$$

4.3 Optimal proposal for $\widetilde{\mathbf{I}}$

Let us consider the case of the self-normalized estimator $\widetilde{\mathbf{I}}$. Recall that $\widetilde{\mathbf{I}} = \frac{\widehat{\mathbf{J}}}{\widehat{Z}}$, where $\widehat{\mathbf{J}} = \frac{1}{N}\sum_{n=1}^{N} w_n \mathbf{f}(\mathbf{x}_n)$ denotes the noisy estimator of $Z\mathbf{I}$, so that we are considering ratios of estimators. Again, we aim to find the proposal that minimizes the variance of the vector-valued estimator $\widetilde{\mathbf{I}}$. When $N$ is large enough, the variance of the $k$-th ratio is approximated as [29]

$$\operatorname{var}\big[\widetilde{I}_k\big] \approx \frac{1}{Z^2}\left(\operatorname{var}\big[\widehat{J}_k\big] + I_k^2 \operatorname{var}\big[\widehat{Z}\big] - 2 I_k \operatorname{cov}\big[\widehat{J}_k, \widehat{Z}\big]\right),$$

where $\widehat{J}_k$ is the $k$-th component of $\widehat{\mathbf{J}}$, and

$$\operatorname{cov}\big[\widehat{J}_k, \widehat{Z}\big] = \frac{1}{N}\left(\int_{\mathcal{X}} \frac{f_k(\mathbf{x})\left(m(\mathbf{x})^2 + s^2(\mathbf{x})\right)}{q(\mathbf{x})} \, d\mathbf{x} - Z^2 I_k\right).$$

The first two results have already been obtained in the previous sections. The third result is given in Appendix A. The sum of the variances is thus

$$\sum_{k=1}^{K}\operatorname{var}\big[\widetilde{I}_k\big] \approx \frac{1}{N Z^2}\int_{\mathcal{X}} \frac{\|\mathbf{f}(\mathbf{x}) - \mathbf{I}\|^2\left(m(\mathbf{x})^2 + s^2(\mathbf{x})\right)}{q(\mathbf{x})} \, d\mathbf{x}.$$

By Jensen's inequality, we can derive that the optimal proposal is

$$q^*_{\widetilde{\mathbf{I}}}(\mathbf{x}) \propto \|\mathbf{f}(\mathbf{x}) - \mathbf{I}\|\sqrt{m(\mathbf{x})^2 + s^2(\mathbf{x})}. \qquad (19)$$

Relationship with active learning. The optimal density $q^*_{\widetilde{\mathbf{I}}}(\mathbf{x})$ can be interpreted as an acquisition density, suggesting the regions of the space that require more acquisitions of the realizations $\widetilde{\pi}(\mathbf{x})$. Namely, $q^*_{\widetilde{\mathbf{I}}}(\mathbf{x})$ plays a role similar to that of an acquisition function in active learning. This information is especially relevant when the noisy evaluations are also costly to obtain.

4.4 Connection with other types of optimality

Here, we discuss another approach to optimality in noisy IS and connect it with our work. Other related works, in the Monte Carlo and noisy optimization literature, focus on the trade-off between accuracy/noisiness and computational cost [3, 23, 30]. In those settings, it is assumed that one can control the variance $s^2(\mathbf{x})$ of the noisy realizations. Clearly, taking samples with higher accuracy, i.e., small variance $s^2(\mathbf{x})$, is beneficial since it decreases the magnitude of the terms involving $s^2(\mathbf{x})$, which are responsible for the efficiency loss in the estimators due to the presence of noise. However, taking accurate estimates implies an increased computational cost, hence one must reduce the number of samples $N$, which affects the overall Monte Carlo variance. This trade-off has been investigated in both MCMC and IS frameworks [3, 23].

Let $M$ denote the number of auxiliary samples employed to reduce the variance of the noisy realizations. Namely, greater $M$ implies greater accuracy but also greater cost. Moreover, this number could depend on $\mathbf{x}$, i.e., $M = M(\mathbf{x})$. Then, the goal is to obtain the optimal function $M(\mathbf{x})$ by balancing the decrease in variance against the extra computational cost (see, e.g., Sections 3.3, 3.4 and 5 of [3]). Namely, in this alternative approach, one tries to reduce $s^2(\mathbf{x})$ at the cost of increasing $M(\mathbf{x})$, instead of using an optimal proposal pdf for the noisy scenario. On the contrary, in this work we have considered the use of optimal proposal pdfs with an $s^2(\mathbf{x})$ that is not tuned by the user, i.e., $M$ is arbitrary and set constant for all $\mathbf{x}$.

5 Numerical experiments

In this section, we consider two illustrative numerical examples where we clearly show the performance of the optimal proposal pdf in the noisy IS setting (showing the variance gains in estimation with respect to the use of the optimal proposal density from the non-noisy setting). For simplicity, we consider one-dimensional scenarios and test the optimal proposal pdf with different target densities (uniform and Gaussian) and different types of variance behavior $s^2(x)$.

First experiment. Let $\pi(x)$ be a uniform density on an interval $[a, b]$. We consider a noise variance $s^2(x)$, controlled by a parameter $\delta$, that is zero at the center of the interval and grows toward its extremes.
We consider the estimation of $Z$ using the optimal proposal pdf $q^*(x)$ in Eq. (11), and the optimal proposal pdf in the non-noisy setting, i.e., $\bar{q}(x) \propto \pi(x)$. Clearly, by changing $\delta$, we change the form of both $s^2(x)$ and $q^*(x)$. For instance, increasing $\delta$ also increases the magnitude of $s^2(x)$ and hence the mismatch between $q^*(x)$ and $\bar{q}(x)$, as depicted in Figure 1. Indeed, for small $\delta$, $q^*(x)$ is almost identical to $\bar{q}(x)$, since the magnitude of $s^2(x)$ is small with respect to the values of $\pi(x)$. As $\delta$ increases, $q^*(x)$ deviates from $\bar{q}(x)$, lying between $\bar{q}(x)$ and the shape of $s(x)$, to which it would eventually converge as $\delta \to \infty$. It is also interesting to note that, for large $\delta$, $q^*(x)$ has very little probability mass around the center of the interval, where the noise is zero, since it needs to concentrate probability mass at the extremes of the interval, where the noise power is huge.
Let us denote by $\sigma^{*2}$ the variance obtained using $q^*(x)$, given in Eq. (12), and by $\bar{\sigma}^2$ the variance obtained using $\bar{q}(x)$, given in Eq. (13). In Figure 3(a), we show the ratio of variances $\bar{\sigma}^2/\sigma^{*2}$, both theoretically and empirically, as a function of $\delta$. We can observe the clear advantage of using the optimal proposal density in Eq. (11).
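A hedged reconstruction of this comparison is sketched below. The exact target and noise curves of the experiment are not reproduced here; we assume a uniform target on $[-1, 1]$ (so $Z = 2$) with noise standard deviation $s(x) = \delta|x|$, zero at the center and maximal at the extremes, and compare the empirical variances under $q^*$ and under the uniform $\bar{q}$:

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumed setup: pi(x) = 1 on [-1, 1] (so Z = 2) and s(x) = delta * |x|.
delta, N, runs = 3.0, 100, 5000
# Normalizing constant of q*: ∫_{-1}^{1} sqrt(1 + delta^2 x^2) dx
C = np.sqrt(1 + delta**2) + np.arcsinh(delta) / delta

def sample_q_star(n):
    """Rejection sampler for q*(x) ∝ sqrt(1 + delta^2 x^2) on [-1, 1]."""
    out = np.empty(0)
    bound = np.sqrt(1 + delta**2)
    while out.size < n:
        x = rng.uniform(-1, 1, 2 * n)
        keep = rng.uniform(0, bound, 2 * n) < np.sqrt(1 + (delta * x) ** 2)
        out = np.concatenate([out, x[keep]])
    return out[:n]

def estimates(x, q_pdf):
    """One noisy IS estimate of Z per row of x."""
    p = 1.0 + delta * np.abs(x) * rng.standard_normal(x.shape)
    return np.mean(p / q_pdf(x), axis=1)

x_star = sample_q_star(runs * N).reshape(runs, N)
x_bar = rng.uniform(-1, 1, (runs, N))
est_star = estimates(x_star, lambda x: np.sqrt(1 + (delta * x) ** 2) / C)
est_bar = estimates(x_bar, lambda x: np.full(x.shape, 0.5))
print(est_star.var() < est_bar.var())  # True: q* beats the non-noisy optimal
```

Under these assumptions, Eqs. (12)-(13) give $N\sigma^{*2} = C^2 - 4$ versus $N\bar{\sigma}^2 = 4\delta^2/3$, and the empirical variances reproduce that gap.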

Second experiment. Let us now consider a Gaussian pdf $\pi(x)$, and the same error model as in the previous example, but with a different choice of the noise variance $s^2(x)$.
Figure 2 depicts $q^*(x)$ and $s(x)$, as functions of $x$, for several values of $\delta$. Note that, in this example, increasing $\delta$ makes $q^*(x)$ become bimodal. As in the previous example, as $\delta$ increases, the optimal proposal converges to the shape of $s(x)$. The theoretical and empirical curves of the ratio of variances, $\bar{\sigma}^2/\sigma^{*2}$, in estimating $Z$ are shown in Figure 3(b).

Figure 1: Uniform example. (a) Optimal proposals $q^*(x)$ for different values of $\delta$, with $\bar{q}(x)$ shown as a dashed line; (b) the standard deviation $s(x)$ for different values of $\delta$, in log-scale.
Figure 2: Gaussian example. (a) Optimal proposals $q^*(x)$ for different values of $\delta$, with $\bar{q}(x)$ shown as a dashed line; (b) the standard deviation $s(x)$ for different values of $\delta$.
Figure 3: Theoretical and empirical ratio of variances in the estimation of $Z$ for both experiments: (a) uniform example; (b) Gaussian example. The horizontal axis denotes the noise level $\delta$ (larger $\delta$ means greater noise).

6 Conclusions

Working with noisy evaluations of the target density has become common in the Monte Carlo literature, especially in recent years. In this work, we have analyzed the use of optimal proposal densities in a noisy IS framework. Previous works have focused on the trade-off between accuracy in the evaluation and computational cost in order to form optimal estimators. Here, we have considered a general setting and derived the optimal proposals for the noisy IS estimators. These optimal proposals incorporate the variance function of the noisy evaluations in order to propose samples in regions that are more affected by noise. In this sense, we can informally state that the optimal proposal densities play the role of an acquisition function that also takes into account the noise power.

References

  • [1] D. Luengo, L. Martino, M. Bugallo, V. Elvira, and S. Särkkä, “A survey of Monte Carlo methods for parameter estimation,” EURASIP Journal on Advances in Signal Processing, vol. 2020, pp. 1–62, 2020.
  • [2] P. Fearnhead, O. Papaspiliopoulos, G. O. Roberts, and A. Stuart, “Random-weight particle filtering of continuous time processes,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 72, no. 4, pp. 497–512, 2010.
  • [3] M.-N. Tran, M. Scharth, M. K. Pitt, and R. Kohn, “Importance sampling squared for Bayesian inference in latent variable models,” arXiv preprint arXiv:1309.3339, 2013.
  • [4] L. Acerbi, “Variational Bayesian Monte Carlo with noisy likelihoods,” arXiv:2006.08655, 2020.
  • [5] F. Llorente, L. Martino, J. Read, and D. Delgado, “A survey of Monte Carlo methods for noisy and costly densities with application to reinforcement learning,” arXiv:2108.00490, 2021.
  • [6] C. Andrieu and G. O. Roberts, “The pseudo-marginal approach for efficient Monte Carlo computations,” The Annals of Statistics, vol. 37, no. 2, pp. 697–725, 2009.
  • [7] J. Park and M. Haran, “Bayesian inference in the presence of intractable normalizing functions,” Journal of the American Statistical Association, vol. 113, no. 523, pp. 1372–1390, 2018.
  • [8] L. F. Price, C. C. Drovandi, A. Lee, and D. J. Nott, “Bayesian synthetic likelihood,” Journal of Computational and Graphical Statistics, vol. 27, no. 1, pp. 1–11, 2018.
  • [9] M. U. Gutmann and J. Corander, “Bayesian optimization for likelihood–free inference of simulator–based statistical models,” Journal of Machine Learning Research, 2016.
  • [10] R. Bardenet, A. Doucet, and C. Holmes, “On Markov chain Monte Carlo methods for tall data,” The Journal of Machine Learning Research, vol. 18, no. 1, pp. 1515–1557, 2017.
  • [11] M. Quiroz, R. Kohn, M. Villani, and M.-N. Tran, “Speeding up MCMC by efficient data subsampling,” Journal of the American Statistical Association, 2018.
  • [12] M. P. Deisenroth, G. Neumann, and J. Peters, “A survey on policy search for robotics,” Foundations and trends in Robotics, vol. 2, no. 1-2, pp. 388–403, 2013.
  • [13] K. Chatzilygeroudis, V. Vassiliades, F. Stulp, S. Calinon, and J.-B. Mouret, “A survey on policy search algorithms for learning robot controllers in a handful of trials,” IEEE Transactions on Robotics, vol. 36, no. 2, pp. 328–347, 2019.
  • [14] A. B. Duncan, A. M. Stuart, and M.-T. Wolfram, “Ensemble inference methods for models with noisy and expensive likelihoods,” arXiv:2104.03384, 2021.
  • [15] N. Bliznyuk, D. Ruppert, C. Shoemaker, R. Regis, S. Wild, and P. Mugunthan, “Bayesian calibration and uncertainty analysis for computationally expensive models using optimization and radial basis function approximation,” Journal of Computational and Graphical Statistics, vol. 17, no. 2, pp. 270–294, 2008.
  • [16] D. H. Svendsen, L. Martino, and G. Camps-Valls, “Active emulation of computer codes with Gaussian processes–Application to remote sensing,” Pattern Recognition, vol. 100, pp. 107103, 2020.
  • [17] F. Llorente, L. Martino, V. Elvira, D. Delgado, and J. Lopez-Santiago, “Adaptive quadrature schemes for Bayesian inference via active learning,” arXiv:2006.00535, 2020.
  • [18] P. Fearnhead, O. Papaspiliopoulos, and G. O. Roberts, “Particle filters for partially observed diffusions,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 70, no. 4, pp. 755–777, 2008.
  • [19] O. D. Akyildiz, I. P. Marino, and J. Míguez, “Adaptive noisy importance sampling for stochastic optimization,” in 2017 IEEE 7th International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP). IEEE, 2017, pp. 1–5.
  • [20] D. Crisan and J. Miguez, “Nested particle filters for online parameter estimation in discrete-time state-space Markov models,” Bernoulli, vol. 24, no. 4A, pp. 3039–3086, 2018.
  • [21] N. Chopin, P.E. Jacob, and O. Papaspiliopoulos, “SMC2: an efficient algorithm for sequential analysis of state space models,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 75, no. 3, pp. 397–426, 2013.
  • [22] L. Martino, J. Read, V. Elvira, and F. Louzada, “Cooperative parallel particle filters for on-line model selection and applications to urban mobility,” Digital Signal Processing, vol. 60, pp. 172–185, 2017.
  • [23] A. Doucet, M. K Pitt, G. Deligiannidis, and R. Kohn, “Efficient implementation of Markov chain Monte Carlo when using an unbiased likelihood estimator,” Biometrika, vol. 102, no. 2, pp. 295–313, 2015.
  • [24] F. Llorente, L. Martino, D. Delgado, and J. Lopez-Santiago, “Marginal likelihood computation for model selection and hypothesis testing: an extensive review,” arXiv:2005.08334, 2020.
  • [25] C. P. Robert and G. Casella, Monte Carlo Statistical Methods, Springer, 2004.
  • [26] C. Andrieu, A. Doucet, and R. Holenstein, “Particle Markov chain Monte Carlo methods,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 72, no. 3, pp. 269–342, 2010.
  • [27] M. Järvenpää, M. U. Gutmann, A. Vehtari, and P. Marttinen, “Parallel Gaussian process surrogate Bayesian inference with noisy likelihood evaluations,” Bayesian Analysis, vol. 16, no. 1, pp. 147–178, 2021.
  • [28] C. C. Drovandi, M. T. Moores, and R. J. Boys, “Accelerating pseudo-marginal MCMC using Gaussian processes,” Computational Statistics & Data Analysis, vol. 118, pp. 1–17, 2018.
  • [29] A. Stuart and K. Ord, Kendall’s Advanced Theory of Statistics, Arnold, London, 1998.
  • [30] D. V. Arnold, Noisy optimization with evolution strategies, vol. 8, Springer Science & Business Media, 2012.

Appendix A Covariance between $\widehat{J}_k$ and $\widehat{Z}$

We show that, for $\widehat{J}_k = \frac{1}{N}\sum_{n=1}^{N} w_n f_k(\mathbf{x}_n)$,

$$\operatorname{cov}\big[\widehat{J}_k, \widehat{Z}\big] = \frac{1}{N}\left(\int_{\mathcal{X}} \frac{f_k(\mathbf{x})\left(m(\mathbf{x})^2 + s^2(\mathbf{x})\right)}{q(\mathbf{x})} \, d\mathbf{x} - Z^2 I_k\right).$$

First, recall that $\operatorname{cov}[\widehat{J}_k, \widehat{Z}] = \mathbb{E}[\widehat{J}_k \widehat{Z}] - \mathbb{E}[\widehat{J}_k]\,\mathbb{E}[\widehat{Z}]$, with $\mathbb{E}[\widehat{J}_k] = Z I_k$ and $\mathbb{E}[\widehat{Z}] = Z$. By the law of iterated expectations,

$$\mathbb{E}\big[\widehat{J}_k \widehat{Z}\big] = \mathbb{E}\Big[\mathbb{E}\big[\widehat{J}_k \widehat{Z} \mid \mathbf{x}_{1:N}\big]\Big].$$

The inner expectation is

$$\mathbb{E}\big[\widehat{J}_k \widehat{Z} \mid \mathbf{x}_{1:N}\big] = \frac{1}{N^2}\sum_{n=1}^{N} \frac{f_k(\mathbf{x}_n)\left(m(\mathbf{x}_n)^2 + s^2(\mathbf{x}_n)\right)}{q(\mathbf{x}_n)^2} + \frac{1}{N^2}\sum_{n \neq n'} \frac{f_k(\mathbf{x}_n)\, m(\mathbf{x}_n)\, m(\mathbf{x}_{n'})}{q(\mathbf{x}_n)\, q(\mathbf{x}_{n'})}.$$

Hence, we obtain

$$\mathbb{E}\big[\widehat{J}_k \widehat{Z}\big] = \frac{1}{N}\int_{\mathcal{X}} \frac{f_k(\mathbf{x})\left(m(\mathbf{x})^2 + s^2(\mathbf{x})\right)}{q(\mathbf{x})} \, d\mathbf{x} + \frac{N-1}{N}\, Z^2 I_k.$$

Combining the results, we obtain the desired expression.