1 Introduction
The accurate estimation of the probability of a rare event with standard Monte Carlo typically requires the evaluation of a score function at a very large set of points: the number of points is of the order of the inverse of the sought rare event probability. Evaluating the score function over such a large set becomes infeasible if each evaluation involves a computationally expensive model.
IS is a variance reduction strategy for Monte Carlo estimation. The idea is to sample from a biasing distribution such that fewer samples are needed to reach the same accuracy on the rare event probability as with standard Monte Carlo. The bias introduced by sampling from the biasing distribution is corrected by reweighting the samples in the IS estimator.
The key to the performance of IS lies in the choice of the problem-dependent, so-called biasing or proposal distribution. Although the optimal biasing distribution leading to a zero-variance IS estimator is available in closed form, computing this zero-variance density is not straightforward since it requires the sought probability of the rare event. The CE method [1, 2, 3, 4] provides an efficient way to approximate this zero-variance density within a parametric family of probability distributions. The CE method searches for the optimal biasing distribution, in the sense that it minimises the Kullback-Leibler divergence from the zero-variance density among the feasible distributions. Even so, computing the biasing distribution with the CE method may still be prohibitive if the model involved in the score function is computationally demanding.
This paper is concerned with the following question: can we save computational power by using score function approximations in the CE method? Several works recommend score function approximations in order to speed up rare event probability estimation. In particular, score function approximations have already been put forward in IS [5, 6, 7, 8, 9].¹ (¹We note that such approximations have also been studied in the context of multilevel splitting [10, 11]. However, an exhaustive discussion of all possible combinations of alternative estimation techniques with score function approximations is out of the scope of this work.) Nevertheless, to the best of our knowledge, the methods proposed in [12] and [13] are the only works directly addressing the raised question and providing partial answers. The authors in [12] define a sequence of auxiliary “low-fidelity CE methods” using a set of score function approximations. The auxiliary low-fidelity CE methods are then run sequentially in order to precondition the standard CE method defined with the original score function. Under certain conditions on the score function approximations, the preconditioning guarantees that the number of iterations is lowered at each of the auxiliary CE method levels, or at worst remains identical. Nevertheless, using this preconditioner does not necessarily guarantee a reduction of the global computational cost of computing the biasing distribution. The preconditioner is in fact an initialization strategy rather than a way of integrating and adapting the score function approximations within the CE method. Computational power may still be wasted when an accurate estimate is not needed at some levels of the preconditioning sequence. In [13], the authors propose an IS estimator based on a hybrid “low-fidelity / high-fidelity CE method”. The idea is to approximate the score function by dividing the probability space into two subdomains.
Regions that are close, in some sense, to the rare event are evaluated with the high-fidelity model. The remaining part of the probability space is evaluated using a low-cost surrogate model. However, the parameters of the proposed algorithms are exposed to arbitrariness, leading to a non-certified subdomain discrimination, which in turn implies the computation of a suboptimal biasing distribution.
In this work, we propose an adaptive CE algorithm which converges almost surely to the optimal biasing distribution. The score function approximation is adapted at each iteration of the CE method, depending on the accuracy needed. As for the certified reduced-basis evaluation of failure probability in [14], the method adapts score function approximations by exploiting upper bounds on the error norm of the reduced model. We focus on reduced basis (RB) approximations, for which there exists an extensive literature on the computation of a priori or a posteriori error estimates for the RB approximation of the solution of a PDE, see e.g., [15]. Besides, we provide an asymptotic analysis showing that, under mild conditions, i) at each iteration of the algorithm, the squared coefficient of variation (SCV) of the IS estimator is the minimal achievable with the current score function approximation, and ii) convergence towards the optimal biasing distribution is guaranteed in at most the same number of iterations as the standard CE method. This proves that the adaptive CE method lowers the bound on the global computational cost, and moreover that the adaptive strategy is in some sense optimal.
The paper is organized as follows. Section 2 recalls the basics of IS for rare event estimation with the CE method. It then introduces score function approximations and reduced models. In Section 3, we present state-of-the-art approaches, in particular the method preconditioning the CE method. We then propose our adaptive algorithm in Section 4. A theoretical result attesting to the asymptotic performance of our adaptive method is given in Section 5. We provide details on the proof of the proposed theorem in the appendices. Section 6 presents a numerical evaluation on a rare event probability estimation problem related to a pollution alert. We finally provide concluding remarks in the last section.
2 Efficient Rare Event Estimation by Importance Sampling
We assume that is a random element taking its values in and denote by its probability distribution. We denote by the set of rare events of interest, and we assume that
for some real number and for a score function . The probability of the rare event is defined as where denotes an integration with respect to the probability measure . We look for an estimator of where is large so that . We assume that we know how to draw independent and identically distributed (i.i.d.) samples from .
2.1 First Ingredient: the Cross-Entropy Method
The naive Monte Carlo (MC) estimator of the rare event probability is
(1) 
This estimator is unbiased. Its relative error is measured by the SCV of the estimator, which is
so that we need for a relative error smaller than one.
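To make this sample-size requirement concrete, the following is a minimal, self-contained Python sketch; the one-dimensional standard normal setting, the score function and all names are purely illustrative and not part of the paper's setting.

```python
import math
import random

def naive_mc(score, sample, gamma, n, rng):
    """Naive MC estimate of p = P(score(X) > gamma): a simple hit ratio."""
    hits = sum(1 for _ in range(n) if score(sample(rng)) > gamma)
    return hits / n

# Toy setting: X ~ N(0,1), score S(x) = x, gamma = 3, so p ~ 1.35e-3.
rng = random.Random(0)
p_true = 0.5 * math.erfc(3.0 / math.sqrt(2.0))  # exact P(X > 3)

# The SCV is (1 - p) / (n * p): a relative error below one already requires
# n > (1 - p) / p, i.e. roughly 1/p ~ 740 samples here, and far more for an
# accurate estimate; we use 200,000.
p_hat = naive_mc(lambda x: x, lambda r: r.gauss(0.0, 1.0), 3.0, 200_000, rng)
```

With roughly 270 expected hits, the estimate carries a relative error of a few percent; dividing the sample size by 100 would leave only a handful of hits and a relative error near one.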
Let be the support of the distribution . For a biasing distribution with , the IS estimator of is
(2) 
with i.i.d. samples from . It is easy to see that
is an unbiased estimator of
. Its SCV is(3) 
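As a hedged illustration of the IS estimator (2) and its reweighting, here is a toy sketch in the same one-dimensional normal setting as above; the choice of a mean-shifted Gaussian as biasing distribution is an assumption made for illustration only.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def is_estimate(gamma, n, mu_bias, rng):
    """IS estimate of p = P(X > gamma) for X ~ N(0,1): sample from the
    biasing distribution N(mu_bias, 1) and reweight each hit by the
    likelihood ratio (nominal density / biasing density), as in (2)."""
    total = 0.0
    for _ in range(n):
        x = rng.gauss(mu_bias, 1.0)
        if x > gamma:
            total += normal_pdf(x, 0.0, 1.0) / normal_pdf(x, mu_bias, 1.0)
    return total / n

rng = random.Random(1)
p_true = 0.5 * math.erfc(3.0 / math.sqrt(2.0))
# Centering the bias at the threshold makes roughly half the draws hit the
# rare event, so a few thousand samples suffice instead of ~1/p.
p_hat = is_estimate(3.0, 5_000, 3.0, rng)
```

The likelihood-ratio weights are what keep the estimator unbiased despite sampling from the "wrong" distribution; a poorly chosen biasing mean would instead inflate the SCV (3).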
The optimal biasing distribution yielding a zerovariance estimator (i.e., a zero SCV) is
(4) 
Unfortunately, it depends on the rare event probability we want to estimate. Now, consider a set of parametrized densities . The CE method [1] optimizes for a parameter such that the corresponding distribution minimizes the Kullback-Leibler divergence from the zero-variance density :
(5) 
In the desirable situation where , a minimizer of (5) is such that . Using (4), the optimization problem (5) can be rewritten as
(6) 
Solving the stochastic counterpart of (6)
(7) 
with i.i.d. samples from , typically fails because (7) is affected by the rareness of the event , just as for the naive MC estimator (1). To circumvent this effect, starting from the initial distribution and for some parameter , the CE method estimates sequentially a sequence of nested events
(8) 
such that with the
and jointly updates the biasing distribution according to
(9) 
In words, the occurrence of the event where
is a random variable of distribution
tends on the one hand to decrease, since we have . On the other hand, it tends to increase since, according to (8)-(9), is nearer (in terms of cross-entropy) to than . Typically, a proper setting of this trade-off will yield a stable occurrence of the event over the algorithm iterations. Then, if the set and the initial distribution are chosen so that the event is not rare, we can typically expect that the solution of the stochastic counterpart of problem (9) will be a “good” approximation of , and that the CE method will yield a “good” approximation of .
2.2 Second Ingredient: Score Function Approximations
We consider a high-fidelity model parametrized by . Let be the parameter space. Furthermore, we assume that we have at our disposal a set of model evaluations , so-called snapshots, corresponding to representative samples of the parameter space . We focus in this work on reduced modeling strategies, which use the set of snapshots to construct an approximation subspace for the set . Typical examples of such strategies are the common greedy RB methods or proper orthogonal decomposition (POD). These standard techniques usually generate the approximation subspace together with a sequence of nested RB spaces of increasing dimension . A set of low-dimensional approximations of the snapshots, also called reduced models or surrogates, is then computable using the sequence of nested subspaces. For instance, can be a finite element discretization of a PDE and some Petrov-Galerkin or least-squares RB approximation [15]. We will adopt the following general notations: we assume available a hierarchy of models defined over and taking their values in , the high-fidelity model being the last element of the set. The hierarchy of models is such that for , we have , with and denoting a prescribed set of ordered positive integers lower than or equal to .
From the computational standpoint, in the case where is a finite element discretization of an elliptic PDE with affine dependences in the bilinear form parametrized by , computing a Petrov-Galerkin reduced-basis approximation at point requires offline operations and, more importantly, online operations. Computing the high-fidelity solution requires instead operations, which suggests a significant computational gain for the RB approximation if . We remark that the complexity is slightly higher for a least-squares reduced-basis approximation. The possibility to devise an offline/online decomposition relies on the assumption of affine parametric dependence. Nevertheless, similar low online complexities are achievable in the case of non-affine parametric dependence using approximations by means of the empirical interpolation method. We refer to [15] for details and further extensions to the case of nonlinear PDEs with non-affine parametric dependences.
Finally, we assume a score function of the form , where . Using the reduced models, we define score function approximations as . Therefore, the hierarchy is related to a sequence of approximations where the th element is the original score function .
2.3 Third Ingredient: Error Bounds
For any , we further assume that we can compute online, for , an error bound of the form
(10) 
with a complexity independent of the high-fidelity model dimension. Although the evaluation of error bounds can in some cases be expensive, error estimators for RB approximations generally involve low-complexity computations and are widely employed to steer adaptive schemes, see e.g., [15] and references therein. Indeed, taking advantage of the previously introduced offline/online decomposition, it is typically possible to compute efficiently, for any , a local error bound for each subspace approximation. A posteriori error estimates [15] are computable online with a quadratic complexity in and . More precisely, the error norm of the RB approximation is bounded by the ratio of the dual norm of the residual and the stability factor. The dual norm of the residual can typically be evaluated online with operations, exploiting affine [15, Chapter 3.7] or approximately affine [15, Chapter 10.4] parametric dependence. On the other hand, provided an offline characterization by interpolatory radial basis functions, the stability factor is evaluated online with a complexity linear in the number of interpolation points and independent of .
Now, in the case where is a bounded linear function, by definition of an operator norm, we have . In the case where is the sup norm, using the triangle inequality and the fact that with finite, we obtain . Using the error estimate (10), we can then bound the error on any subset of :
(11) 
where or , respectively, for linear operators or for the sup norm. We mention that the goal-oriented adaptive strategy proposed in [14] refines the error bound estimate (11). The idea of such a strategy is to rely directly on a posteriori estimates of the score function approximation error in order to infer a sequence of subspaces yielding fine approximations of the score function close to the set of rare events of interest and coarse approximations far away from it.
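As a minimal numerical illustration of propagating the reduced-model bound (10) to the score error as in (11), here is a sketch for the bounded-linear case; the operator norm and the per-sample bounds are hypothetical numbers, not tied to any specific PDE.

```python
def score_error_bound(f_norm, deltas):
    """Bound on max_i |S(x_i) - S_m(x_i)| over a sample set: for a bounded
    linear f, |f(u) - f(u_m)| <= ||f|| * ||u - u_m|| <= ||f|| * delta_m(x),
    so the worst case over the set is ||f|| times the largest local bound."""
    return f_norm * max(deltas)

# Hypothetical per-sample RB error bounds delta_m(x_i) and ||f|| = 2.
deltas = [0.02, 0.05, 0.01]
bound = score_error_bound(2.0, deltas)  # 2 * 0.05 = 0.1
```

The same one-liner applies for the sup-norm case with the constant replaced accordingly.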
3 State-of-the-Art Recipes
Stirring the first two ingredients into IS one after the other, we obtain the standard and the preconditioned CE algorithms. Note that we review hereafter only algorithms which are certified to converge to the optimal biasing distribution. Consequently, we do not detail the hybrid method suggested in [13], although it relies on the same ingredients and is related to some extent.
3.1 Standard CE Algorithm
The standard CE method is presented in Algorithm 1. At the th iteration, given , a sample size and a proposal , the algorithm builds the empirical quantile
(12) 
where are i.i.d. samples from . The initial proposal is simply . The next proposals are obtained by solving (14), which is an MC approximation of the cross-entropy minimization problem (9). The sequence of ’s is then used to build a sequence of stochastic events satisfying, as for the sets (8), a nesting property, where the ’s are defined as
(13) 
On the other hand, the algorithm imposes at each iteration the condition (15). This condition specifies that the quantiles will increase by at least at each iteration, i.e., that the algorithm will produce a finite number of nested sets ’s. This condition is used to prove the convergence of the algorithm, as shown in Section 5. To guarantee the validity of (15) at each iteration, we use an upgraded version of the CE algorithm first proposed in [2, Algorithm 5.2]. The difference with the basic CE method [4, Algorithm 2.1] is that the algorithm adapts at the th iteration the parameter and the sample size defining the sequence of nested events (13). This adaptation is presented in Algorithm 2. It consists in checking whether can be decreased to meet condition (15). Otherwise, the sample size is increased (and thus new samples are drawn) by a factor, say , until the condition is satisfied. The success of this adaptation step is guaranteed by the asymptotic analysis exposed in Section 5.
At each iteration, the evaluation of the ’s in Algorithm 1 or in Algorithm 2 requires operations. This is in general the most computationally demanding step of the CE algorithm. In the favorable situation where problem (14) boils down to the resolution of a linear system (e.g., for being the family of dimensional Gaussians, see Section 6.3), an update of the proposal density requires operations, yielding an overall complexity of .
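As a concrete toy instance of the scheme just described (not Algorithm 1 verbatim: the sample-size adaptation of Algorithm 2 and the quantile-increase condition (15) are omitted, and the setting is a one-dimensional Gaussian family, all assumptions made for illustration), the following sketch alternates the empirical-quantile step (12) with the proposal update, which for Gaussians reduces to likelihood-ratio-weighted moment matching.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def ce_method(score, threshold, n=2000, rho=0.1, seed=0, max_iter=50):
    """Toy CE loop for X ~ N(0,1) with a Gaussian biasing family N(mu, sigma^2).
    Each iteration draws n samples from the current proposal, sets gamma to the
    empirical (1 - rho)-quantile of the scores (capped at the target threshold),
    and refits (mu, sigma) on the elite samples, weighted by the likelihood
    ratio of the nominal density to the current proposal."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0
    gamma = -math.inf
    for _ in range(max_iter):
        if gamma >= threshold:
            break
        xs = [rng.gauss(mu, sigma) for _ in range(n)]
        vals = sorted((score(x) for x in xs), reverse=True)
        gamma = min(vals[int(rho * n)], threshold)
        elite = [x for x in xs if score(x) >= gamma]
        # importance weights: nominal density over current proposal density
        w = [normal_pdf(x, 0.0, 1.0) / normal_pdf(x, mu, sigma) for x in elite]
        mu = sum(wi * x for wi, x in zip(w, elite)) / sum(w)
        sigma = max(math.sqrt(sum(wi * (x - mu) ** 2
                                  for wi, x in zip(w, elite)) / sum(w)), 1e-3)
    # IS estimate of P(score(X) > threshold) with the final proposal
    xs = [rng.gauss(mu, sigma) for _ in range(n)]
    p_hat = sum(normal_pdf(x, 0.0, 1.0) / normal_pdf(x, mu, sigma)
                for x in xs if score(x) > threshold) / n
    return mu, sigma, p_hat

mu, sigma, p_hat = ce_method(lambda x: x, 3.0)
p_true = 0.5 * math.erfc(3.0 / math.sqrt(2.0))  # exact P(X > 3) for N(0,1)
```

Each intermediate threshold is never rare under the current proposal, which is exactly what lets the moderate sample size succeed where the naive stochastic program (7) fails.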
3.2 Preconditioned CE Algorithm
In the multifidelity method proposed in [12] and detailed in Algorithm 3, the biasing densities are refined by solving a sequence of “low-fidelity CE methods”. The th optimization problem of this sequence is the CE method defined with the score function approximation instead of , where may in principle gather any type of surrogates. The strategy proposed by the authors is to use the distribution obtained as the solution of the th optimization problem of this sequence as an initialization for the problem at level . After running the “low-fidelity CE methods” sequentially, the solution is obtained by running the standard CE method initialized with the th optimization problem’s solution.
(14) 
We remark that at each iteration, the preconditioned method invokes Algorithm 1, which itself resorts to the adaptation step performed by Algorithm 2. This occurs in particular in the first iterations of Algorithm 3. We notice that if one of the score function approximations is upper bounded by a value lower than , a straightforward implementation of Algorithm 3 will yield a nonconvergent estimator. Indeed, in such a situation, there will exist an index such that for any the condition will never hold, even for an infinite sample size. To avoid this pathological case, the adaptation of the sample size is avoided in the first iterations of Algorithm 3 and replaced by an increment of , i.e., a refinement from to .
The main advantage of the preconditioned CE algorithm is that, thanks to this sound initialization, Algorithm 1 (where substitutes for ) typically converges at level in only a few iterations. As shown in [12] and mentioned in the following, some guarantees on the number of iterations saved at each level of Algorithm 3 can be obtained under mild conditions. Since this saving occurs in particular at the last level (i.e., at ), it partially alleviates the computational bottleneck induced by calling the high-fidelity model too many times. Nevertheless, the algorithm complexity remains , as for the standard CE method. Moreover, to obtain the initialization of Algorithm 3 at the last level, optimization problems first need to be solved, each of them targeting a solution of the form (6) where substitutes for . In fact, the proposed preconditioning does not necessarily guarantee a decrease of the global computational cost of the CE method.
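To make the warm-start mechanism concrete, here is a self-contained toy sketch of the sequential low-fidelity solves; the one-dimensional Gaussian family, the biased surrogates and every constant are invented for illustration and only mimic the structure of Algorithm 3, not its exact conditions.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def ce_level(score, threshold, mu, sigma, n=2000, rho=0.1, rng=None, max_iter=50):
    """One CE solve warm-started at (mu, sigma) for a one-dimensional Gaussian
    family; returns the updated parameters and the number of iterations used."""
    rng = rng or random.Random(0)
    gamma, iters = -math.inf, 0
    while gamma < threshold and iters < max_iter:
        xs = [rng.gauss(mu, sigma) for _ in range(n)]
        vals = sorted((score(x) for x in xs), reverse=True)
        gamma = min(vals[int(rho * n)], threshold)
        elite = [x for x in xs if score(x) >= gamma]
        w = [normal_pdf(x, 0.0, 1.0) / normal_pdf(x, mu, sigma) for x in elite]
        mu = sum(wi * x for wi, x in zip(w, elite)) / sum(w)
        sigma = max(math.sqrt(sum(wi * (x - mu) ** 2
                                  for wi, x in zip(w, elite)) / sum(w)), 1e-3)
        iters += 1
    return mu, sigma, iters

def preconditioned_ce(hierarchy, threshold):
    """Run the low-fidelity CE solves sequentially: each level's solution
    initializes the next, and only the last level uses the original score."""
    mu, sigma, total = 0.0, 1.0, 0
    rng = random.Random(42)
    for s in hierarchy:
        mu, sigma, it = ce_level(s, threshold, mu, sigma, rng=rng)
        total += it
    return mu, sigma, total

# Toy hierarchy: biased surrogates of the score S(x) = x, exact model last.
hierarchy = [lambda x: x - 0.3, lambda x: x - 0.1, lambda x: x]
mu, sigma, iters = preconditioned_ce(hierarchy, 3.0)
```

In this toy run, only the first (cheapest) level pays the full iteration count; the warm-started levels, including the final high-fidelity one, typically need one or two iterations each, which is exactly the saving the preconditioner targets.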
4 Recipe for an Adaptive CE Method
To overcome this limitation, we propose in the following to incorporate the third ingredient and adapt the score function approximation at each level of the CE optimization process.
4.1 The Proposed Algorithm
We present the proposed CE adaptive method in Algorithm 4. It relies on a relaxed set of events defined for as
(16) 
The sequences of ’s, ’s and ’s are determined by the adaptation step of the algorithm, and we assume that we can easily compute the error bound
(17) 
where is the constant in (11) and are i.i.d. samples from . The idea behind Algorithm 4 is to compute, as for the standard CE method, a sequence of ’s satisfying (15), or equivalently a sequence of nested events including the target set of rare events . This computation should rely as much as possible on the score function approximations and be controlled by their error estimates. In other words, the philosophy is to exploit the approximations and error bounds to adapt the sequences of ’s, ’s and ’s so that the set is the smallest set satisfying the inclusion .
The general structure of Algorithm 4 is very similar to the standard CE scheme. Indeed, on the one hand, the sequence of proposals is optimized by solving (18) in place of (14), i.e., by solving an MC approximation of the cross-entropy minimization problem (9) using the in place of the original score function . On the other hand, the adaptation step is done by Algorithm 5 in place of Algorithm 2. In a very analogous manner to the latter algorithm, Algorithm 5 checks whether can be decreased so as to satisfy a nesting property and agree with an error tolerance, taking the form of conditions (19) and (20).
(18) 
As proved in Section 5, the existence of a feasible at the approximation level is guaranteed for some specific by the positivity of
(21) 
where and with is any continuous and locally Lipschitz function approximating , identical to on , e.g., a spline approximation (the usefulness of the continuity and Lipschitz properties will be clarified in the asymptotic analysis of Section 5). Otherwise, if , the score function approximation is refined until the condition is satisfied. If the condition still does not hold at the finest level , i.e., for , the sample size is increased (and thus new samples are drawn) by a factor, say . The success of this adaptation step is guaranteed by the asymptotic analysis exposed in Section 5. Algorithm 4 exits the loop alternating between the optimization of the proposal and the adaptation step when the condition is fulfilled. We will see in Section 5 that this condition guarantees that the standard CE condition holds.
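Since the exact form of conditions (19)-(20) involves the error bounds, a simplified, purely illustrative sketch of the adaptation logic may help: scan the hierarchy from coarse to fine and keep the first level whose certified error bound is compatible with the required quantile increase. The acceptance criterion below is a hypothetical stand-in for the paper's conditions, and all names are invented.

```python
def adapt_level(xs, surrogates, bounds, gamma_prev, delta, rho):
    """Pick the coarsest approximation level m whose certified score-error
    bound b_m allows a relaxed quantile gamma = q_{1-rho} - b_m that still
    increases by at least delta; fall back to the finest model otherwise."""
    n = len(xs)
    for m, (s, b) in enumerate(zip(surrogates, bounds)):
        vals = sorted((s(x) for x in xs), reverse=True)
        gamma = vals[int(rho * n)] - b  # relaxed empirical quantile
        if b <= delta and gamma >= gamma_prev + delta:
            return m, gamma
    # finest model: accept its quantile, relaxed by its own (smallest) bound
    s, b = surrogates[-1], bounds[-1]
    vals = sorted((s(x) for x in xs), reverse=True)
    return len(surrogates) - 1, vals[int(rho * n)] - b

# Toy data: identical surrogates with shrinking certified bounds; the
# adaptation settles on the first level whose bound meets the tolerance.
xs = [i / 10.0 for i in range(100)]
surrogates = [lambda x: x, lambda x: x, lambda x: x]
bounds = [0.5, 0.05, 0.0]
level, gamma = adapt_level(xs, surrogates, bounds, gamma_prev=0.0, delta=0.1, rho=0.1)
```

The relaxation by the bound is what enforces the inclusion of the true event in the relaxed one: any point whose approximate score clears the relaxed threshold could have a true score clearing the unrelaxed threshold.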
Finally, let us point out that problem (18) may be ill-conditioned, i.e., the solution might not be unique. In certain circumstances, the non-uniqueness is only the consequence of an insufficient sample size, the circumstances being determined by the parametric distribution family . This issue is discussed in the case of the dimensional Gaussian family in Section 6.4.
4.2 Computational Saving
In comparison to the state of the art, the main innovation of Algorithm 4 is that it adapts the score function approximation at each of its iterations to the current accuracy need. The proposed algorithm

substitutes score function approximations directly within the core of the CE method;

uses score function approximations parsimoniously, guided by error estimates; in particular, not all score function approximations are systematically used.
As for the standard CE algorithm, the computation of the score function or of its approximations represents the most computationally intensive step. Nevertheless, adapting surrogates is intended to lower the computational burden significantly. In the final step, the evaluation of the high-fidelity score function for a set of samples remains unavoidable for the computation of the rare event probability estimate.
The algorithm complexity remains , as for state-of-the-art CE methods. In theory, we note that if the bounds on the maximum number of iterations of Algorithm 4 and Algorithm 1 are identical (and we will see in the next section that this is the case), then the worst-case computational load involved by the former is guaranteed to be lower than or equal to that of the latter. In practice, we expect a large number of iterations to be performed with the reduced models, i.e., requiring only operations, in the usual setting where with the number of affine parameters.
Let us point out that the error bounds are computed in operations. Therefore, they do not increase the computational cost significantly as long as . This regime is in general the only one suitable for importance sampling, which becomes very challenging in high dimensions, as pointed out in [16]. Besides, we remark that the evaluation of at the core of the adaptation step involves a simple dimensional minimization. Once the score function approximations have been computed, it can be solved with a negligible computational load using standard optimization techniques.
Finally, we remark that an online adaptation of the directions (in addition to the dimension) of the RB spaces is prohibitive if we want to maintain an online cubic complexity in , independent of the ambient dimension . Indeed, updating a direction in the reduced model would require operations, see Section 2.2. Nevertheless, reconsidering the offline/online decomposition paradigm, an adaptive enrichment of the RB spaces as proposed in [14] could enhance the algorithm’s efficiency, as long as the global cost of the adaptive construction remains comparable to that of the standard offline RB construction. This possible enhancement is however beyond the scope of the present work.
5 Asymptotic Analysis
We provide theoretical guarantees on the computational savings obtained with the proposed algorithm for a sufficiently large number of samples. We begin by introducing assumptions and reviewing state-of-the-art results.
5.1 Assumptions
We define the following assumptions.
Assumption A:
For any , .
In words, Assumption A ensures that the domination relation satisfied by the zero-variance density given by (6) is also satisfied by any feasible . This assumption is trivially satisfied if the densities in have infinite support.
We remark that Assumption A implies that for any as long as , i.e., as long as the sought rare-event probability is nonzero.
We mention that this assumption could be relaxed as done in [2].
Assumption B: We assume that

the set is compact;

for almost every , the function is continuous on ;

there exists a function such that and for all and ;

for any , the function is continuous in .
Properties i), ii) and iii) are satisfied by numerous families of distributions. To illustrate this, consider the Gaussian family where is a compact set which does not include zero. In this case, we can verify that Assumption B holds; in particular, there exists a quadratic function satisfying iii).
Concerning iv), we know by definition that for any set the function is continuous in . Consequently, a sufficient (but not necessary) condition for the latter assumption is that the score function be continuous in .
5.2 Previous Results
We hereafter state the convergence result for the standard CE method [2].
[Homem de Mello and Rubinstein, 2002] Suppose that Assumptions A and B hold. Then, as ,
The almost sure convergence in at most iterations is proven in [2]. The result on the estimator SCV is straightforward, as detailed in Appendix C. We see that the smaller the set , the smaller the variance. Therefore, it is tempting to build a sequence tending quickly to , which would yield a drop in the estimator variance. However, the drawback of such a construction is that it can be expensive, since it generally requires large values of the ’s. Indeed, the value of may then need to be increased significantly until it fulfils condition (15). In fact, there is a trade-off between the values of the ’s and the parameter , which imposes a minimal speed of convergence of the sequence of ’s towards .
We mention that the authors in [12, Proposition 1] slightly strengthen the first statement of Theorem 5.2. They prove that, under mild conditions, the bound on the maximum number of iterations becomes lower than or equal to when their multifidelity approach is used to initialize the CE method. The assumption guaranteeing that the bound on the maximum number of iterations is lowered at level (and in particular at the last level, using the high-fidelity model) is that . In practice, this assumption holds if we can verify conditions on the regularity of the score function approximation and on its cumulative distribution with respect to the biasing densities in , see [12, Assumptions 1-2]. Nevertheless, there is no theoretical guarantee that the preconditioned method is more efficient than the standard method presented in Algorithm 1.
5.3 Our Result
We now present the theoretical guarantees obtained with the adaptive method in the asymptotic regime.
Suppose that Assumptions A and B hold. Then, as ,
In the theorem, we call the tuple , or equivalently the set given in (16), feasible if and only if satisfies (15). The first statement of this theorem proves the almost sure convergence of our algorithm towards the optimal biasing distribution in at most iterations, as for the standard CE algorithm. In this worst-case scenario of iterations, we deduce that Algorithm 4 will use less computational power than the standard CE method, since it adapts reduced models at each iteration.
The second statement of this theorem shows that the squared coefficients of variation of the estimator (2) obtained with the biasing distributions at the th iteration of Algorithm 1 and of Algorithm 4 are identical up to an additional term proportional to the probability of the set . Furthermore, the statement claims that this probability is the minimal achievable among all feasible . In other words, given the score function approximations and the worst-case error , the choice of the set at the th iteration is optimal in the sense that no other feasible set can yield a lower variance of the IS estimator.
5.4 Proof of Theorem 5.3
5.4.1 First Statement: Convergence
This statement is a straightforward consequence of the two following propositions.
Under Assumptions A and B, Algorithm 5 selects , and such that the sequence produced by Algorithm 4 satisfies almost surely, as , the condition
(23) 
The proof of Proposition 5.4.1 is detailed in Appendix A.2. It relies on the two following arguments.
On the one hand, based on consistency results for MC approximations, we show that, given some , and , the approximation level and the variable chosen by Algorithm 5 satisfy, for large enough, the deterministic counterparts of conditions (19) and (20). On the other hand, we show that the latter conditions imply that satisfies (23).
If the sequence satisfies (23), then under Assumption B the biasing distribution given by Algorithm 4 converges almost surely, as , to the optimal one (6) in iterations.
To show this result we introduce the deterministic counterpart of problem (18), that is
(24) 
where
(25) 
For ease of notation, we use in (25) (and from now on) the simplified notation instead of as defined in (11).
We show in Appendix A.1.1 that under Assumption B, the distribution in Algorithm 4 (the solution of (18)) converges almost surely, as , to the solution of (24). This result corresponds to statement iv) of Proposition A.1.1. Moreover, because the sequence satisfies (23) according to Proposition 5.4.1, we have for . At the th iteration we have . Thus, at the penultimate step of Algorithm 4, we reach the optimal distribution