1 Introduction
A wide range of modern applications, especially in the Bayesian inference framework [1], require the study of probability density functions (pdfs) which can only be evaluated stochastically, i.e., only noisy evaluations can be obtained [2, 3, 4, 5]. For instance, this is the case of pseudo-marginal approaches and doubly intractable posteriors [6, 7], and of approximate Bayesian computation (ABC) and likelihood-free schemes [8, 9], where the target density cannot be computed in closed form. The noisy scenario also appears naturally when mini-batches of data are employed instead of considering the complete likelihood of huge amounts of data [10, 11]
. More recently, the analysis of noisy functions of densities is also required in reinforcement learning (RL), especially in direct policy search, which is an important branch of RL with applications in robotics [12, 13]. The topic of inference in noisy settings (or where a function is known with a certain degree of uncertainty) is also of interest in the inverse problem literature, such as in the calibration of expensive computer codes [14, 15]. This is also the case when the construction of an emulator, used as a surrogate model, is considered [4, 16, 17].

In this work, we study the importance sampling (IS) scheme under noisy evaluations of the target pdf. The noisy IS scenario has already been analyzed in the literature [2, 3, 18]. In the context of optimization, some theoretical results can be found in [19]. In the sequential framework, IS schemes with random weights have been proposed and studied in different works [2, 20, 21, 22]
. We provide the optimal proposal densities for different noisy scenarios, including also the case of integrals involving vector-valued functions. Moreover, we discuss the convergence and variance of the estimators in a general setting. We consider a different approach with respect to other studies in the literature
[3, 23]. In those works, the authors analyzed the trade-off between decreasing the noise power (by increasing the number of auxiliary samples) and increasing the total number of samples in the IS estimators. Here, this information is encompassed within the optimal proposal density, which plays a similar role to an acquisition function in active learning
[17, 16]. This information is relevant, especially if the noisy evaluations are also costly to obtain.

2 Background
2.1 Bayesian inference
In many applications, we aim to infer a variable of interest given a set of observations or measurements. Let us denote the variable of interest by $\theta \in \Theta$, and let $y$ be the observed data. The posterior pdf is then
$$\bar{\pi}(\theta|y) = \frac{\ell(y|\theta)\, g(\theta)}{Z(y)}, \qquad (1)$$
where $\ell(y|\theta)$ is the likelihood function, $g(\theta)$ is the prior pdf, and $Z(y)$ is the model evidence (a.k.a. marginal likelihood), which is a useful quantity in model selection problems [24]. For simplicity, in the following, we skip the dependence on $y$ in $\bar{\pi}(\theta)$ and $Z$. Generally, $Z$ is unknown, so we are only able to evaluate the unnormalized target function, i.e., the numerator of the right hand side of Eq. (1),
$$\pi(\theta) = \ell(y|\theta)\, g(\theta). \qquad (2)$$
The analytical study of the posterior density is often unfeasible, so that numerical approximations are required [25, 1].
2.2 Noisy framework
Generally, we desire to approximate the unnormalized density $\pi(\theta)$, and the corresponding normalizing constant $Z$, using Monte Carlo methods.
The unnormalized density $\pi(\theta)$ can represent a posterior density in a Bayesian inference problem, as described above.
We assume that, for any $\theta$, we cannot evaluate $\pi(\theta)$ exactly, but we only have access to a related noisy realization. Moreover, in many applications, obtaining such a noisy realization may be expensive.
Hence, analyzing at which inputs $\theta$ we require a noisy realization of $\pi(\theta)$ is an important problem, which is related to the concept of optimality that we consider below.
In the following, we introduce a concise mathematical formalization of the noisy scenario. This simple framework covers real application scenarios, such as latent variable models [3] (see Example 4 in Sect. 4.1), the likelihood-free inference setting [8], doubly intractable posteriors [7], and mini-batch-based inference [10].
More specifically, we assume to have access to a noisy realization related to $\pi(\theta)$, i.e.,
$$\widetilde{\pi}(\theta) = T(\pi(\theta), \epsilon), \qquad (3)$$
where $T$ is a nonlinear transformation involving $\pi(\theta)$ and $\epsilon$, which is some noise perturbation. Thus, for a fixed value $\theta$, $\widetilde{\pi}(\theta)$ is a random variable with
$$\mathbb{E}[\widetilde{\pi}(\theta)] = m(\theta), \qquad \text{var}[\widetilde{\pi}(\theta)] = s^2(\theta), \qquad (4)$$
for some mean function $m(\theta)$ and variance function $s^2(\theta)$.
The assumption that the realizations $\widetilde{\pi}(\theta)$ are strictly positive is important in practice [2, 23].
Noise power. In some applications, it is also possible to control the noise power $s^2(\theta)$, for instance by adding/removing data to the mini-batches (e.g., in the context of Big Data) [10], increasing the number of auxiliary samples in latent variable models [6], or interacting with an environment over longer/shorter periods of time (e.g., in reinforcement learning) [12].
Unbiased scenario and related cases. The scenario where $m(\theta) = \pi(\theta)$ appears naturally in some applications (such as in the estimation of latent variable and stochastic volatility models in statistics [3, 26], or in the context of optimal filtering of partially observed stochastic processes [18]), or it is often assumed as a pre-established condition by the authors [3, 2]. In some other scenarios, the noisy realizations are known to be unbiased estimates of some transformation of $\pi(\theta)$, e.g., of $\log \pi(\theta)$ [27, 28]. This situation can be encompassed by the following special case. If we consider an additive perturbation,
$$\widetilde{\pi}(\theta) = h(\pi(\theta)) + \epsilon, \qquad \mathbb{E}[\epsilon] = 0, \qquad (5)$$
we have $m(\theta) = h(\pi(\theta))$, where $h$ is some transformation of $\pi(\theta)$. If $h$ is known and invertible, we have $\pi(\theta) = h^{-1}(m(\theta))$.
Generally, we can state that $\widetilde{\pi}(\theta)$ always contains statistical information related to $\pi(\theta)$. The subsequent use of $\widetilde{\pi}(\theta)$ depends on the specific application. Thus, we focus our study on the mean function $m(\theta)$.
Hence, our goal is to approximate efficiently integrals involving $m(\theta)$, i.e.,
$$I = \frac{1}{Z_m}\int f(\theta)\, m(\theta)\, d\theta, \qquad Z_m = \int m(\theta)\, d\theta, \qquad (6)$$
where $f(\theta): \Theta \to \mathbb{R}^s$ and $I$ denotes the vector of integrals of interest. Note that, in the unbiased case $m(\theta) = \pi(\theta)$, we have $I = \mathbb{E}_{\bar{\pi}}[f(\theta)]$ and $Z_m = Z$. An integral involving $m(\theta)$ can be approximated employing a cloud of random samples using the noisy realizations $\widetilde{\pi}(\theta)$ via Monte Carlo methods.
3 Noisy Importance Sampling
In a non-noisy IS scheme, a set of samples $\{\theta_n\}_{n=1}^N$ is drawn from a proposal density $q(\theta)$. Then, each sample is weighted according to the ratio $w_n = \frac{\pi(\theta_n)}{q(\theta_n)}$. A noisy version of importance sampling is obtained when we substitute the evaluations of $\pi(\theta)$ with noisy realizations of $\widetilde{\pi}(\theta)$. See Table 1, and note that the importance weights in Eq. (9) are computed using the noisy realizations, i.e., $w_n = \frac{\widetilde{\pi}(\theta_n)}{q(\theta_n)}$. Below, we show that
$$\widehat{Z} = \frac{1}{N}\sum_{n=1}^N w_n \qquad (7)$$
is an unbiased estimator of $Z_m$, and
$$\widetilde{I} = \frac{1}{N Z_m}\sum_{n=1}^N w_n f(\theta_n), \qquad \widetilde{I}_{\text{SN}} = \frac{\sum_{n=1}^N w_n f(\theta_n)}{\sum_{n=1}^N w_n} \qquad (8)$$
are consistent estimators of $I$. The estimator $\widetilde{I}$ requires the knowledge of $Z_m$, which is not needed in the so-called self-normalized estimator $\widetilde{I}_{\text{SN}}$.
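As an illustration, the noisy IS estimators can be sketched in a few lines of Python. The target, noise model, and proposal below are toy choices made for this sketch (they are not taken from the paper); the weights simply use the noisy realizations $\widetilde{\pi}(\theta_n)$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (illustrative, not from the paper): unnormalized Gaussian target
# with unbiased additive noise of power s2(theta), so that m(theta) = pi(theta).
def target(th):
    return np.exp(-0.5 * th**2)                # pi(theta), with Z_m = sqrt(2*pi)

def s2(th):
    return (0.2 * target(th))**2               # noise variance s^2(theta)

def noisy_target(th):
    return target(th) + rng.normal(0.0, np.sqrt(s2(th)))   # realization pi~(theta)

def noisy_is(f, N=100_000):
    """Noisy IS with Gaussian proposal q = N(0, 2^2); weights use pi~."""
    theta = rng.normal(0.0, 2.0, N)
    q = np.exp(-theta**2 / 8.0) / np.sqrt(8.0 * np.pi)
    w = noisy_target(theta) / q                # noisy importance weights, Eq. (9)
    Z_hat = w.mean()                           # estimator of Z_m, Eq. (7)
    I_sn = (w * f(theta)).sum() / w.sum()      # self-normalized estimator, Eq. (8)
    return Z_hat, I_sn

Z_hat, I_sn = noisy_is(f=lambda th: th**2)
# Since m = pi here: Z_m = sqrt(2*pi) and I = E[theta^2] = 1.
```

Despite the noise in the weights, both estimates concentrate around the true values as $N$ grows, consistently with the theorem below.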

Theorem.
The estimators above, constructed from the output of noisy IS, converge to expectations under $\bar{m}(\theta) = m(\theta)/Z_m$. More specifically, we have that $\widehat{Z}$ and $\widetilde{I}$ are unbiased estimators of $Z_m$ and $I$, respectively, and $\widetilde{I}_{\text{SN}}$ is a consistent estimator of $I$. Moreover, these estimators have higher variance than their non-noisy counterparts.
Proof.
Here, we provide a simple proof of convergence by applying iterated conditional expectations.
Equivalently, the correctness of the approach can be proved by using an extended space view (see, e.g., [18, 3]).
Let $\theta_{1:N} = \{\theta_1, \dots, \theta_N\}$ denote the samples from $q(\theta)$.
By the law of total expectation, we have that $\mathbb{E}[\widehat{Z}] = \mathbb{E}\big[\mathbb{E}[\widehat{Z} \mid \theta_{1:N}]\big]$. In the inner expectation, we use the fact that the $\widetilde{\pi}(\theta_n)$'s are conditionally independent, hence
$$\mathbb{E}[\widehat{Z} \mid \theta_{1:N}] = \frac{1}{N}\sum_{n=1}^N \frac{m(\theta_n)}{q(\theta_n)} = \widehat{Z}_{\text{nn}},$$
where $\widehat{Z}_{\text{nn}}$ is the non-noisy IS estimator of $Z_m$, which is also unbiased, i.e.,
$$\mathbb{E}[\widehat{Z}_{\text{nn}}] = \int \frac{m(\theta)}{q(\theta)}\, q(\theta)\, d\theta = Z_m.$$
Therefore, $\widehat{Z}$ is an unbiased estimator of $Z_m$, i.e., $\mathbb{E}[\widehat{Z}] = Z_m$. Moreover, we show below that $\text{var}[\widehat{Z}]$ decreases to zero as $N \to \infty$. Hence, $\widehat{Z}$ is a consistent estimator of $Z_m$. Now, with the same arguments, we can prove that the estimator $\frac{1}{N}\sum_{n=1}^N w_n f(\theta_n)$ is also unbiased and converges to $Z_m I$. Thus, both the estimator $\widetilde{I}$, and the ratio
$$\frac{\sum_{n=1}^N w_n f(\theta_n)}{\sum_{n=1}^N w_n},$$
which is the noisy self-normalized IS estimator in Eq. (8), are consistent estimators of $I$ given in Eq. (6).
∎
Variance of $\widehat{Z}$. By the law of total variance, we have that
$$\text{var}[\widehat{Z}] = \mathbb{E}\big[\text{var}[\widehat{Z} \mid \theta_{1:N}]\big] + \text{var}\big[\mathbb{E}[\widehat{Z} \mid \theta_{1:N}]\big].$$
In a non-noisy scenario, i.e., in a non-noisy IS setting, the first term is null. Using the fact that $\mathbb{E}[\widehat{Z} \mid \theta_{1:N}] = \widehat{Z}_{\text{nn}}$ is unbiased, we have that the second term is
$$\text{var}[\widehat{Z}_{\text{nn}}] = \frac{1}{N}\left(\int \frac{m(\theta)^2}{q(\theta)}\, d\theta - Z_m^2\right).$$
Regarding the first term, we have
$$\text{var}[\widehat{Z} \mid \theta_{1:N}] = \frac{1}{N^2}\sum_{n=1}^N \frac{s^2(\theta_n)}{q(\theta_n)^2}.$$
Assuming that $s^2(\theta) < \infty$ for all $\theta$, we have that
$$\mathbb{E}\big[\text{var}[\widehat{Z} \mid \theta_{1:N}]\big] = \frac{1}{N}\int \frac{s^2(\theta)}{q(\theta)}\, d\theta.$$
Hence, we finally have that
$$\text{var}[\widehat{Z}] = \frac{1}{N}\left(\int \frac{m(\theta)^2 + s^2(\theta)}{q(\theta)}\, d\theta - Z_m^2\right). \qquad (10)$$
Therefore, $\widehat{Z}$ has a greater variance than $\widehat{Z}_{\text{nn}}$, but the same convergence speed, i.e., its variance decreases at rate $\mathcal{O}(1/N)$. Proving that $\widetilde{I}$ has greater variance than its non-noisy version is straightforward.
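The variance decomposition above can be checked numerically. The following sketch (a toy model chosen for this check, not from the paper) simulates many independent replicates of $\widehat{Z}$ and compares their empirical variance with the closed-form expression in Eq. (10).

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy model on [0,1]: m(theta) = 2*theta (so Z_m = 1), noise power s2 = theta/2,
# and uniform proposal q(theta) = 1. Eq. (10) then gives
#   var[Z_hat] = (1/N) * ( int_0^1 (4*th^2 + th/2) dth - 1 ) = (7/12) / N.
N, R = 200, 20_000                        # samples per estimator, replicates
theta = rng.random((R, N))                # draws from q = U(0,1)
noisy = 2.0 * theta + rng.standard_normal((R, N)) * np.sqrt(theta / 2.0)
Z_hat = noisy.mean(axis=1)                # R independent copies of Eq. (7)

theory = (4.0 / 3.0 + 1.0 / 4.0 - 1.0) / N    # = (7/12)/N
emp = Z_hat.var()
```

The empirical mean of the replicates also approaches $Z_m = 1$, matching the unbiasedness proved above.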
4 Optimal Proposal Density in Noisy IS
In this section, we derive the optimal proposals for the noisy IS estimators $\widehat{Z}$, $\widetilde{I}$, and $\widetilde{I}_{\text{SN}}$.
4.1 Optimal proposal for $\widehat{Z}$
We can rewrite the variance of $\widehat{Z}$ in Eq. (10) as
$$\text{var}[\widehat{Z}] = \frac{1}{N}\left(\int \frac{\big(\sqrt{m(\theta)^2 + s^2(\theta)}\big)^2}{q(\theta)}\, d\theta - Z_m^2\right).$$
By Jensen's inequality, the first term is bounded below by
$$\int \frac{m(\theta)^2 + s^2(\theta)}{q(\theta)}\, d\theta \geq \left(\int \sqrt{m(\theta)^2 + s^2(\theta)}\, d\theta\right)^2,$$
with equality if and only if $\sqrt{m(\theta)^2 + s^2(\theta)}/q(\theta)$ is constant. The minimum variance is thus attained at
$$q^*(\theta) \propto \sqrt{m(\theta)^2 + s^2(\theta)}. \qquad (11)$$
Note that, for finite $N$, $\text{var}_{q^*}[\widehat{Z}]$ is always greater than 0; specifically,
$$\text{var}_{q^*}[\widehat{Z}] = \frac{1}{N}\left(\left(\int \sqrt{m(\theta)^2 + s^2(\theta)}\, d\theta\right)^2 - Z_m^2\right). \qquad (12)$$
Hence, differently from the non-noisy setting, in noisy IS the optimal proposal does not provide an estimator with null variance. If $s^2(\theta) = 0$ for all $\theta$, then we come back to the non-noisy scenario and $q^*(\theta) \propto m(\theta)$. Note that the variance of $\widehat{Z}$ using $q_{\text{nn}}(\theta) = m(\theta)/Z_m$ is
$$\text{var}_{q_{\text{nn}}}[\widehat{Z}] = \frac{Z_m}{N}\int \frac{s^2(\theta)}{m(\theta)}\, d\theta. \qquad (13)$$
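Equations (11)-(13) can be verified by simple quadrature. The sketch below uses a flat toy target $m(\theta) = 1$ on $[0,1]$ with noise power $s^2(\theta) = \theta$ (illustrative choices, not from the paper), and compares the per-sample variance term $N \cdot \text{var}[\widehat{Z}]$ under the optimal proposal $q^*$ and under the non-noisy optimal proposal $q_{\text{nn}} \propto m$.

```python
import numpy as np

# Flat toy target on [0,1]: m(theta) = 1 (so Z_m = 1), noise power s2(theta) = theta.
grid = np.linspace(0.0, 1.0, 100_001)
m2s2 = 1.0 + grid                                # m(theta)^2 + s^2(theta)
Zm = 1.0

def trapz(y):
    """Trapezoidal rule on the fixed grid."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(grid)))

def per_sample_var(q_unnorm):
    """N * var[Z_hat] for a proposal given up to a constant, cf. Eq. (10)."""
    q = q_unnorm / trapz(q_unnorm)
    return trapz(m2s2 / q) - Zm**2

var_opt = per_sample_var(np.sqrt(m2s2))          # optimal q* of Eq. (11)
var_nn = per_sample_var(np.ones_like(grid))      # q_nn proportional to m
jensen = trapz(np.sqrt(m2s2))**2 - Zm**2         # lower bound of Eq. (12)
# Analytically: jensen = ((2/3)*(2*sqrt(2)-1))**2 - 1 ~ 0.48584, var_nn = 0.5.
```

The variance under $q^*$ matches the Jensen lower bound of Eq. (12) and, as expected, remains strictly positive while being smaller than the variance under $q_{\text{nn}}$.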
In the following, we show several examples of noise models and their corresponding optimal proposal densities.
Example 1. Let us consider a Bernoulli-type noise where $\widetilde{\pi}(\theta) = \frac{1}{\alpha(\theta)}\pi(\theta)$ with probability $\alpha(\theta)$, and $\widetilde{\pi}(\theta) = 0$ otherwise, where $0 < \alpha(\theta) \leq 1$. Then, we have
$$m(\theta) = \pi(\theta), \qquad s^2(\theta) = \pi(\theta)^2\, \frac{1 - \alpha(\theta)}{\alpha(\theta)},$$
so that $m(\theta)^2 + s^2(\theta) = \pi(\theta)^2/\alpha(\theta)$. Replacing in Eq. (11), the optimal proposal density in this case is
$$q^*(\theta) \propto \frac{\pi(\theta)}{\sqrt{\alpha(\theta)}}. \qquad (14)$$
Example 2. Let us consider $\widetilde{\pi}(\theta) = |\pi(\theta) + \epsilon|$, with $\epsilon \sim \mathcal{N}(0, \sigma^2(\theta))$. In this scenario, the random variable $\widetilde{\pi}(\theta)$ corresponds to a folded Gaussian random variable. We have
$$m(\theta) = \sigma(\theta)\sqrt{\tfrac{2}{\pi}}\, e^{-\frac{\pi(\theta)^2}{2\sigma^2(\theta)}} + \pi(\theta)\left(1 - 2\Phi\!\left(-\frac{\pi(\theta)}{\sigma(\theta)}\right)\right),$$
where $\Phi$ is the cumulative function of the standard Gaussian distribution. Then, since $m(\theta)^2 + s^2(\theta) = \mathbb{E}[\widetilde{\pi}(\theta)^2] = \pi(\theta)^2 + \sigma^2(\theta)$,
$$q^*(\theta) \propto \sqrt{\pi(\theta)^2 + \sigma^2(\theta)}. \qquad (15)$$
Example 3. Let us consider a multiplicative noise $\widetilde{\pi}(\theta) = \epsilon\, \pi(\theta)$ with $\epsilon \geq 0$, hence
$$m(\theta) = \mathbb{E}[\epsilon]\, \pi(\theta), \qquad s^2(\theta) = \text{var}[\epsilon]\, \pi(\theta)^2.$$
If we denote $\mu_\epsilon = \mathbb{E}[\epsilon]$ and $\sigma_\epsilon^2 = \text{var}[\epsilon]$, then $m(\theta) = \mu_\epsilon \pi(\theta)$ and $s^2(\theta) = \sigma_\epsilon^2 \pi(\theta)^2$. In this case, the optimal proposal coincides with the optimal one in the non-noisy setting, since
$$q^*(\theta) \propto \sqrt{\mu_\epsilon^2 + \sigma_\epsilon^2}\; \pi(\theta) \propto \pi(\theta). \qquad (16)$$
Example 4. In latent variable models, the noisy realization $\widetilde{\pi}(\theta)$ corresponds to the product of independent IS estimators, each built from $M$ auxiliary samples. With $M$ large enough, the distribution of this realization is approximately log-normal, i.e.,
$$\log \widetilde{\pi}(\theta) \sim \mathcal{N}\!\left(\log \pi(\theta) - \tfrac{\sigma^2(\theta)}{2},\; \sigma^2(\theta)\right),$$
where $\sigma^2(\theta) = \gamma^2(\theta)/M$, for some function $\gamma^2(\theta)$ [3, 23]. Equivalently, they write $\widetilde{\pi}(\theta) = \pi(\theta)\, e^{\epsilon(\theta)}$, where $\epsilon(\theta) \sim \mathcal{N}(-\sigma^2(\theta)/2, \sigma^2(\theta))$. Hence,
$$m(\theta) = \pi(\theta), \qquad s^2(\theta) = \pi(\theta)^2\big(e^{\sigma^2(\theta)} - 1\big),$$
and the optimal proposal is
$$q^*(\theta) \propto \pi(\theta)\, e^{\sigma^2(\theta)/2}. \qquad (17)$$
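The log-normal model of Example 4 is easy to check by simulation. The sketch below verifies, at a single test point $\theta$ (the values of $\pi(\theta)$ and $\sigma^2(\theta)$ are arbitrary illustrative choices), that $\mathbb{E}[\widetilde{\pi}(\theta)] = \pi(\theta)$ and that $m(\theta)^2 + s^2(\theta) = \pi(\theta)^2 e^{\sigma^2(\theta)}$, which is the quantity entering Eq. (17).

```python
import numpy as np

rng = np.random.default_rng(2)

pi_th = 1.7        # pi(theta) at a fixed test point (arbitrary value)
sigma2 = 0.25      # sigma^2(theta) at that point (arbitrary value)

# pi~(theta) = pi(theta) * exp(eps), with eps ~ N(-sigma2/2, sigma2),
# so that the realization is unbiased: E[pi~(theta)] = pi(theta).
eps = rng.normal(-sigma2 / 2.0, np.sqrt(sigma2), 1_000_000)
noisy = pi_th * np.exp(eps)

m_hat = noisy.mean()             # should approach pi(theta)
m2s2_hat = (noisy**2).mean()     # should approach pi(theta)^2 * exp(sigma2)
# Hence sqrt(m^2 + s^2) = pi(theta) * exp(sigma2 / 2), as in Eq. (17).
```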
4.2 Optimal proposal for $\widetilde{I}$
We have already seen that the optimal proposal that minimizes the variance of $\widehat{Z}$ is $q^*(\theta) \propto \sqrt{m(\theta)^2 + s^2(\theta)}$. Let us consider now the estimator $\widetilde{I}$. Note that this estimator assumes we can evaluate $Z_m$. Since we are considering a vector-valued function $f(\theta) = [f_1(\theta), \dots, f_s(\theta)]^\top$, the estimator has components $\widetilde{I}_i$, and $\text{var}[\widetilde{I}]$ corresponds to a covariance matrix. We aim to find the proposal that minimizes the sum of diagonal variances. From the results of the previous section, it is straightforward to show that the variance of the $i$-th component is
$$\text{var}[\widetilde{I}_i] = \frac{1}{N Z_m^2}\int \frac{f_i(\theta)^2\, s^2(\theta)}{q(\theta)}\, d\theta + \text{var}[\widetilde{I}_{i,\text{nn}}],$$
where $f_i$ and $I_i$ are respectively the $i$-th components of $f$ and $I$, and $\widetilde{I}_{i,\text{nn}}$ denotes the non-noisy estimator (i.e., using $m(\theta_n)$ instead of $\widetilde{\pi}(\theta_n)$). Thus,
$$\sum_{i=1}^s \text{var}[\widetilde{I}_i] = \frac{1}{N Z_m^2}\left(\int \frac{\|f(\theta)\|^2\big(m(\theta)^2 + s^2(\theta)\big)}{q(\theta)}\, d\theta - Z_m^2 \|I\|^2\right).$$
By Jensen's inequality, we have
$$\int \frac{\|f(\theta)\|^2\big(m(\theta)^2 + s^2(\theta)\big)}{q(\theta)}\, d\theta \geq \left(\int \|f(\theta)\|\sqrt{m(\theta)^2 + s^2(\theta)}\, d\theta\right)^2,$$
where $\|\cdot\|$ denotes the Euclidean norm. The equality holds if and only if $\|f(\theta)\|\sqrt{m(\theta)^2 + s^2(\theta)}/q(\theta)$ is constant. Hence, the optimal proposal is
$$q^*(\theta) \propto \|f(\theta)\|\sqrt{m(\theta)^2 + s^2(\theta)}. \qquad (18)$$
4.3 Optimal proposal for $\widetilde{I}_{\text{SN}}$
Let us consider the case of the self-normalized estimator $\widetilde{I}_{\text{SN}}$. Recall that $\widetilde{I}_{\text{SN}} = \widehat{N}/\widehat{Z}$, where $\widehat{N} = \frac{1}{N}\sum_{n=1}^N w_n f(\theta_n)$ denotes the noisy estimator of $Z_m I$, so that we are considering ratios of estimators. Again, we aim to find the proposal that minimizes the variance of the vector-valued estimator $\widetilde{I}_{\text{SN}}$. When $N$ is large enough, the variance of the $i$-th ratio is approximated as [29]
$$\text{var}[\widetilde{I}_{\text{SN},i}] \approx \frac{1}{Z_m^2}\Big(\text{var}[\widehat{N}_i] - 2 I_i\, \text{cov}[\widehat{N}_i, \widehat{Z}] + I_i^2\, \text{var}[\widehat{Z}]\Big),$$
where $\widehat{N}_i$ is the $i$-th component of $\widehat{N}$, and
$$\text{cov}[\widehat{N}_i, \widehat{Z}] = \frac{1}{N}\left(\int f_i(\theta)\, \frac{m(\theta)^2 + s^2(\theta)}{q(\theta)}\, d\theta - Z_m^2 I_i\right).$$
The first two results have been already obtained in the previous sections. The third result is given in Appendix A. The sum of the variances is thus
$$\sum_{i=1}^s \text{var}[\widetilde{I}_{\text{SN},i}] \approx \frac{1}{N Z_m^2}\int \frac{\|f(\theta) - I\|^2\big(m(\theta)^2 + s^2(\theta)\big)}{q(\theta)}\, d\theta.$$
By Jensen's inequality, we can derive that the optimal proposal is
$$q^*(\theta) \propto \|f(\theta) - I\|\sqrt{m(\theta)^2 + s^2(\theta)}. \qquad (19)$$
Relationship with active learning. The optimal density $q^*(\theta)$ can be interpreted as an acquisition density, suggesting the regions of the space which require more acquisitions of the noisy realizations $\widetilde{\pi}(\theta)$. Namely, $q^*(\theta)$ plays a role similar to an acquisition function in active learning. This information is relevant, especially if the noisy evaluations are also costly to obtain.
4.4 Connection with other types of optimality
Here, we discuss another approach to optimality in noisy IS and connect it with our work. Other related works, in the Monte Carlo and noisy optimization literature, focus on the trade-off between accuracy/noisiness and computational cost [3, 23, 30]. In those settings, it is assumed that one can control the variance of the noisy realizations $\widetilde{\pi}(\theta)$. Clearly, taking samples with higher accuracy, i.e., small variance $s^2(\theta)$, is beneficial, since it decreases the magnitude of the noise-related terms which are responsible for the efficiency loss of the estimators. However, taking accurate estimates implies increased computational cost, hence one must reduce the number of samples $N$, which affects the overall Monte Carlo variance. This trade-off has been investigated in both MCMC and IS frameworks [3, 23].
Let $M$ denote the number of auxiliary samples employed to reduce the variance of the noisy realizations. Namely, greater $M$ implies greater accuracy but also greater cost. Moreover, this number could depend on $\theta$, i.e., $M = M(\theta)$. Then, the goal is to obtain the optimal function $M(\theta)$ by balancing the decrease in variance with the extra computational cost (see, e.g., Sections 3.3, 3.4 and 5 of [3]). Namely, in this different approach, one tries to reduce $s^2(\theta)$ at the cost of increasing the value of $M(\theta)$, instead of using an optimal proposal pdf for the noisy scenario. On the contrary, in this work we have considered the use of optimal proposal pdfs, and $s^2(\theta)$ is not tuned by the user, which means that $M$ is arbitrary and set constant for all $\theta$.
5 Numerical experiments
In this section, we consider two illustrative numerical examples where we clearly show the performance of the optimal proposal pdf in the noisy IS setting (showing the variance gains in estimation with respect to the use of the optimal proposal density from the non-noisy setting). For simplicity, we consider one-dimensional scenarios, and test the optimal proposal pdf with different target densities (uniform and Gaussian) and different types of variance behavior $s^2(\theta)$.
First experiment. Let $\pi(\theta) = \frac{1}{b-a}$ for $\theta \in [a,b]$, i.e., a uniform density on an interval $[a,b]$. We set $\widetilde{\pi}(\theta) = \pi(\theta) + \epsilon(\theta)$ with $\mathbb{E}[\epsilon(\theta)] = 0$, so that $m(\theta) = \pi(\theta)$, and we have $Z_m = 1$.
We consider the estimation of $Z_m$ using the optimal proposal pdf $q^*(\theta)$ in Eq. (11), and the optimal proposal pdf in the non-noisy setting, i.e., $q_{\text{nn}}(\theta) \propto \pi(\theta)$.
More specifically, we consider a variance function $s^2(\theta)$ controlled by a noise parameter, which determines the optimal proposal $q^*(\theta) \propto \sqrt{\pi(\theta)^2 + s^2(\theta)}$ through Eq. (11).
Clearly, by changing the noise parameter, we change the form of both $s^2(\theta)$ and $q^*(\theta)$.
For instance, increasing it also increases the magnitude of $s^2(\theta)$, and hence the mismatch between $q^*(\theta)$ and $\pi(\theta)$, as depicted in Figure 1.
Indeed, for small noise, $q^*(\theta)$ is almost identical to $\pi(\theta)$, since the magnitude of $s^2(\theta)$ is small w.r.t. the values of $\pi(\theta)^2$. As the noise parameter increases, $q^*(\theta)$ deviates from $\pi(\theta)$, lying in the middle between $\pi(\theta)$ and $s(\theta)$, and would eventually converge to a density proportional to $s(\theta)$.
It is also interesting to note that $q^*(\theta)$ has very little probability mass around the point where the noise is zero, since it needs to concentrate probability mass at the extremes of the interval, where the noise power is huge.
Let us also denote as $V^*$ the variance obtained using $q^*(\theta)$, given in Eq. (12), and as $V_{\text{nn}}$ the variance obtained using $q_{\text{nn}}(\theta)$, given in Eq. (13).
In Figure 3(a), we show the ratio of variances $V_{\text{nn}}/V^*$, both theoretically and empirically, as a function of the noise parameter. We can observe the clear advantage of using the optimal proposal density in Eq. (11).
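The spirit of this experiment can be reproduced with the sketch below. Since the exact settings (interval, noise amplitude) are not reproduced here, the values used are hypothetical: a uniform target on $[0,1]$ with additive Gaussian noise whose power vanishes at the center and grows towards the extremes of the interval. The empirical variance ratio approaches the theoretical one obtained from Eqs. (12)-(13).

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical settings (the paper's exact values are not shown here):
# uniform target pi(theta) = 1 on [0,1] (unbiased case, m = pi, Z_m = 1), and
# additive Gaussian noise, zero at theta = 0.5 and large at the extremes.
def s2(th):
    return 50.0 * (th - 0.5)**2

# Build q*(theta) ~ sqrt(1 + s2(theta)) (Eq. (11)) on a grid, with its inverse CDF.
grid = np.linspace(0.0, 1.0, 4001)
q_unnorm = np.sqrt(1.0 + s2(grid))
seg = 0.5 * (q_unnorm[1:] + q_unnorm[:-1]) * np.diff(grid)
cdf = np.concatenate(([0.0], np.cumsum(seg)))
Zq, cdf = cdf[-1], cdf / cdf[-1]

R, N = 5_000, 500                         # replicates x samples per estimator

# Z_hat under the optimal proposal q* (inverse-CDF sampling on the grid).
th = np.interp(rng.random((R, N)), cdf, grid)
noisy = 1.0 + rng.standard_normal((R, N)) * np.sqrt(s2(th))
Z_opt = (noisy / (np.interp(th, grid, q_unnorm) / Zq)).mean(axis=1)

# Z_hat under the non-noisy optimal proposal q_nn = pi = U(0,1).
th = rng.random((R, N))
noisy = 1.0 + rng.standard_normal((R, N)) * np.sqrt(s2(th))
Z_nn = noisy.mean(axis=1)

# Realizations may be negative here; this toy check only targets the variances.
ratio = Z_nn.var() / Z_opt.var()          # theory: (50/12) / 3.4795 ~ 1.20
```

Both estimators remain unbiased for $Z_m = 1$, while $q^*$ yields the smaller variance, mirroring the behavior shown in Figure 3(a).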
Second experiment.
Let us consider now a Gaussian pdf $\pi(\theta)$, and the same error model as in the previous example, but considering a different variance function $s^2(\theta)$.
Figure 2 depicts $\pi(\theta)$ and $q^*(\theta)$, as a function of $\theta$, for several values of the noise parameter. Note that, in this example, increasing the noise makes $q^*(\theta)$ become bimodal. As in the previous example, as the noise grows, the optimal proposal converges to a density proportional to $s(\theta)$. The theoretical and empirical curves of the ratio of variances, $V_{\text{nn}}/V^*$, in estimating $Z_m$ are shown in Figure 3(b).
6 Conclusions
Working with noisy evaluations of the target density has become usual in Monte Carlo, especially in recent years. In this work, we have analyzed the use of optimal proposal densities in a noisy IS framework. Previous works have focused on the trade-off between accuracy in the evaluations and computational cost in order to build efficient estimators. In this work, we have considered a general setting and derived the optimal proposals for the noisy IS estimators. These optimal proposals incorporate the variance function of the noisy evaluations in order to propose samples in the regions that are more affected by noise. In this sense, we can informally state that the optimal proposal densities play the role of an acquisition function that also takes into account the noise power.
References
 [1] D. Luengo, L. Martino, M. Bugallo, V. Elvira, and S. Särkkä, “A survey of Monte Carlo methods for parameter estimation,” EURASIP Journal on Advances in Signal Processing, vol. 2020, pp. 1–62, 2020.
 [2] P. Fearnhead, O. Papaspiliopoulos, G. O. Roberts, and A. Stuart, “Random-weight particle filtering of continuous time processes,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 72, no. 4, pp. 497–512, 2010.
 [3] M.-N. Tran, M. Scharth, M. K. Pitt, and R. Kohn, “Importance sampling squared for Bayesian inference in latent variable models,” arXiv preprint arXiv:1309.3339, 2013.
 [4] L. Acerbi, “Variational Bayesian Monte Carlo with noisy likelihoods,” arXiv:2006.08655, 2020.
 [5] F. Llorente, L. Martino, J. Read, and D. Delgado, “A survey of Monte Carlo methods for noisy and costly densities with application to reinforcement learning,” arXiv:2108.00490, 2021.
 [6] C. Andrieu and G. O. Roberts, “The pseudo-marginal approach for efficient Monte Carlo computations,” The Annals of Statistics, vol. 37, no. 2, pp. 697–725, 2009.
 [7] J. Park and M. Haran, “Bayesian inference in the presence of intractable normalizing functions,” Journal of the American Statistical Association, vol. 113, no. 523, pp. 1372–1390, 2018.
 [8] L. F. Price, C. C. Drovandi, A. Lee, and D. J. Nott, “Bayesian synthetic likelihood,” Journal of Computational and Graphical Statistics, vol. 27, no. 1, pp. 1–11, 2018.
 [9] M. U. Gutmann and J. Corander, “Bayesian optimization for likelihood-free inference of simulator-based statistical models,” Journal of Machine Learning Research, 2016.

 [10] R. Bardenet, A. Doucet, and C. Holmes, “On Markov chain Monte Carlo methods for tall data,” The Journal of Machine Learning Research, vol. 18, no. 1, pp. 1515–1557, 2017.
 [11] M. Quiroz, R. Kohn, M. Villani, and M.-N. Tran, “Speeding up MCMC by efficient data subsampling,” Journal of the American Statistical Association, 2018.
 [12] M. P. Deisenroth, G. Neumann, and J. Peters, “A survey on policy search for robotics,” Foundations and Trends in Robotics, vol. 2, no. 1–2, pp. 388–403, 2013.
 [13] K. Chatzilygeroudis, V. Vassiliades, F. Stulp, S. Calinon, and J.-B. Mouret, “A survey on policy search algorithms for learning robot controllers in a handful of trials,” IEEE Transactions on Robotics, vol. 36, no. 2, pp. 328–347, 2019.
 [14] A. B. Duncan, A. M. Stuart, and M.-T. Wolfram, “Ensemble inference methods for models with noisy and expensive likelihoods,” arXiv:2104.03384, 2021.

 [15] N. Bliznyuk, D. Ruppert, C. Shoemaker, R. Regis, S. Wild, and P. Mugunthan, “Bayesian calibration and uncertainty analysis for computationally expensive models using optimization and radial basis function approximation,” Journal of Computational and Graphical Statistics, vol. 17, no. 2, pp. 270–294, 2008.
 [16] D. H. Svendsen, L. Martino, and G. Camps-Valls, “Active emulation of computer codes with Gaussian processes – Application to remote sensing,” Pattern Recognition, vol. 100, pp. 107103, 2020.
 [17] F. Llorente, L. Martino, V. Elvira, D. Delgado, and J. Lopez-Santiago, “Adaptive quadrature schemes for Bayesian inference via active learning,” arXiv:2006.00535, 2020.
 [18] P. Fearnhead, O. Papaspiliopoulos, and G. O. Roberts, “Particle filters for partially observed diffusions,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 70, no. 4, pp. 755–777, 2008.
 [19] O. D. Akyildiz, I. P. Marino, and J. Míguez, “Adaptive noisy importance sampling for stochastic optimization,” in 2017 IEEE 7th International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP). IEEE, 2017, pp. 1–5.

 [20] D. Crisan and J. Miguez, “Nested particle filters for online parameter estimation in discrete-time state-space Markov models,” Bernoulli, vol. 24, no. 4A, pp. 3039–3086, 2018.
 [21] N. Chopin, P.-E. Jacob, and O. Papaspiliopoulos, “SMC²: an efficient algorithm for sequential analysis of state space models,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 75, no. 3, pp. 397–426, 2013.
 [22] L. Martino, J. Read, V. Elvira, and F. Louzada, “Cooperative parallel particle filters for online model selection and applications to urban mobility,” Digital Signal Processing, vol. 60, pp. 172–185, 2017.
 [23] A. Doucet, M. K. Pitt, G. Deligiannidis, and R. Kohn, “Efficient implementation of Markov chain Monte Carlo when using an unbiased likelihood estimator,” Biometrika, vol. 102, no. 2, pp. 295–313, 2015.
 [24] F. Llorente, L. Martino, D. Delgado, and J. Lopez-Santiago, “Marginal likelihood computation for model selection and hypothesis testing: an extensive review,” arXiv:2005.08334, 2020.
 [25] C. P. Robert and G. Casella, Monte Carlo Statistical Methods, Springer, 2004.
 [26] C. Andrieu, A. Doucet, and R. Holenstein, “Particle Markov chain Monte Carlo methods,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 72, no. 3, pp. 269–342, 2010.
 [27] M. Järvenpää, M. U. Gutmann, A. Vehtari, and P. Marttinen, “Parallel Gaussian process surrogate Bayesian inference with noisy likelihood evaluations,” Bayesian Analysis, vol. 16, no. 1, pp. 147–178, 2021.
 [28] C. C. Drovandi, M. T. Moores, and R. J. Boys, “Accelerating pseudo-marginal MCMC using Gaussian processes,” Computational Statistics & Data Analysis, vol. 118, pp. 1–17, 2018.
 [29] A. Stuart and K. Ord, Kendall’s Advanced Theory of Statistics, Arnold, London, 1998.
 [30] D. V. Arnold, Noisy optimization with evolution strategies, vol. 8, Springer Science & Business Media, 2012.
Appendix A Covariance between $\widehat{N}_i$ and $\widehat{Z}$
We show that
$$\text{cov}[\widehat{N}_i, \widehat{Z}] = \frac{1}{N}\left(\int f_i(\theta)\, \frac{m(\theta)^2 + s^2(\theta)}{q(\theta)}\, d\theta - Z_m^2 I_i\right).$$
First, recall that $\widehat{N}_i = \frac{1}{N}\sum_{n=1}^N w_n f_i(\theta_n)$, with $\mathbb{E}[\widehat{N}_i] = Z_m I_i$ and $\mathbb{E}[\widehat{Z}] = Z_m$. By the law of iterated expectations,
$$\mathbb{E}[\widehat{N}_i \widehat{Z}] = \frac{1}{N^2}\sum_{n=1}^N \sum_{k=1}^N \mathbb{E}\Big[f_i(\theta_n)\, \mathbb{E}[w_n w_k \mid \theta_{1:N}]\Big].$$
The inner expectation is
$$\mathbb{E}[w_n w_k \mid \theta_{1:N}] = \frac{m(\theta_n)\, m(\theta_k)}{q(\theta_n)\, q(\theta_k)} \;\; \text{for } n \neq k, \qquad \mathbb{E}[w_n^2 \mid \theta_{1:N}] = \frac{m(\theta_n)^2 + s^2(\theta_n)}{q(\theta_n)^2}.$$
Hence, we obtain
$$\mathbb{E}[\widehat{N}_i \widehat{Z}] = \frac{N-1}{N}\, Z_m^2 I_i + \frac{1}{N}\int f_i(\theta)\, \frac{m(\theta)^2 + s^2(\theta)}{q(\theta)}\, d\theta.$$
Combining the results, $\text{cov}[\widehat{N}_i, \widehat{Z}] = \mathbb{E}[\widehat{N}_i \widehat{Z}] - Z_m^2 I_i$ yields the desired expression.
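As a sanity check, the covariance formula above can be verified by simulation under an illustrative toy model (the choices below are not from the paper).

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy model: q = U(0,1), m(theta) = 2*theta (Z_m = 1), s^2(theta) = 0.5,
# scalar f(theta) = theta, so that I = int_0^1 theta * 2*theta dtheta = 2/3, and
#   cov[N_hat, Z_hat] = (1/N) * ( int_0^1 theta*(4*theta^2 + 0.5) dtheta - I )
#                     = (1/N) * (1.25 - 2/3).
N, R = 100, 40_000
theta = rng.random((R, N))
w = 2.0 * theta + rng.standard_normal((R, N)) * np.sqrt(0.5)   # noisy weights (q = 1)
Z_hat = w.mean(axis=1)
N_hat = (w * theta).mean(axis=1)

emp_cov = np.mean((N_hat - N_hat.mean()) * (Z_hat - Z_hat.mean()))
theory = (1.25 - 2.0 / 3.0) / N
```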