Over-Conservativeness of Variance-Based Efficiency Criteria and Probabilistic Efficiency in Rare-Event Simulation

In rare-event simulation, an importance sampling (IS) estimator is regarded as efficient if its relative error, namely the ratio between its standard deviation and mean, is sufficiently controlled. It is widely known that when a rare-event set contains multiple "important regions" encoded by the so-called dominating points, IS needs to account for all of them via mixing to achieve efficiency. We argue that in typical experiments, missing less significant dominating points may not necessarily cause inefficiency, and the traditional analysis recipe could suffer from intrinsic looseness by using relative error, or in turn estimation variance, as an efficiency criterion. We propose a new efficiency notion, which we call probabilistic efficiency, to tighten this gap. The new notion is especially relevant in high-dimensional settings where the computational effort to locate all dominating points is enormous.


1 Introduction

We study the problem of estimating the probabilities of rare events with Monte Carlo simulation, which falls in the domain of rare-event simulation (Bucklew 2004, Juneja and Shahabuddin 2006, Rubino and Tuffin 2009). Traditionally, rare-event simulation is of wide interest to a variety of areas such as queueing systems (Dupuis et al. 2007, Dupuis and Wang 2009, Blanchet and Mandjes 2007, Blanchet et al. 2009, Blanchet and Lam 2014, Kroese and Nicola 1999, Ridder 2009, Sadowsky 1991, Szechtman and Glynn 2002), communication networks (Kesidis et al. 1993), finance (Glasserman 2003, Glasserman and Li 2005, Glasserman et al. 2008), and insurance (Asmussen 1985, Asmussen and Albrecher 2010). More recently, with the rapid development of intelligent physical systems such as autonomous vehicles and personal assistive robots (Ding et al. 2021, Arief et al. 2021), rare-event simulation is also applied to assess their risks before deployment in public, where the risks are often quantified by the probabilities of violating certain safety metrics such as crash or injury rates (Huang et al. 2018a, O'Kelly et al. 2018, Zhao et al. 2016, 2018). The latter problems typically involve complex AI-driven algorithms that render the rare-event structures rough or difficult to analyze. The current work is motivated by the importance of handling this type of rare-event problem (e.g., the U.S. National Artificial Intelligence Research and Development Strategic Plan (Kratsios 2019) lists "developing effective evaluation methods for AI" as a top priority) and provides a step towards rigorously grounded procedures in this direction.

The starting challenge in rare-event simulation is that, by its very nature, the target rare event seldom occurs in the simulation experiment when using crude Monte Carlo. In other words, to achieve an acceptable estimation accuracy relative to the target probability, the required simulation size could be huge in order to obtain sufficient hits on the target event. Statistically, this issue manifests as a large ratio of the standard deviation (per run) to the mean, known as the relative error, which determines the order of the required sample size. In the large deviations regime, where the target probability can decay exponentially in the rarity parameter, this means in particular that the required sample size is exponentially large.

To address the inefficiency of crude Monte Carlo, a range of variance reduction techniques have been developed. Among them, importance sampling (IS) (Siegmund 1976) has been broadly applied. IS uses an alternative probability measure to generate the simulation samples, and then reweights the outputs via the likelihood ratio to guarantee unbiasedness. The goal is that by using this alternative estimator, rather than simply counting the frequency of hits as in crude Monte Carlo, one can achieve a small relative error with a much smaller sample size.
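To make the mechanics concrete, here is a minimal sketch (our own toy example, not one from the paper) for the Gaussian tail probability $P(X \ge \gamma)$ with $X \sim N(0,1)$: sample from the mean-shifted measure $N(\gamma, 1)$ and reweight by the likelihood ratio $L(y) = e^{-\gamma y + \gamma^2/2}$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
gamma, n = 4.0, 100_000                    # estimate p = P(X >= gamma) ~ 3.2e-5

# Crude Monte Carlo: only a handful of the n samples hit the event.
x = rng.standard_normal(n)
cmc = np.mean(x >= gamma)

# Importance sampling: sample from N(gamma, 1), reweight by dP/dQ.
y = rng.standard_normal(n) + gamma
z = (y >= gamma) * np.exp(-gamma * y + gamma**2 / 2)

print(f"truth={norm.sf(gamma):.3e}  crude MC={cmc:.3e}  IS={z.mean():.3e}")
```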

To this end, it is also widely known that IS is a "delicate" technique, in the sense that the IS probability measure needs to be carefully chosen in order to achieve a small relative error. In the typical large deviations setting, the suggestion is to tilt the probability measure towards the "important region". The delicacy arises when there is more than one important region, in which case all of them need to be accounted for. More specifically, in the light-tailed regime, these important regions are guided precisely by the so-called dominating points, which capture the most likely scenario in a local region of the rare event. Despite the temptation to simply shift the distribution center to the globally most likely scenario, it is well established that if not all the dominating points are included in the IS mixture distribution, then the resulting estimator may no longer be efficient in terms of the relative error (Glasserman and Wang 1997, Juneja and Shahabuddin 2006).

Our main goal in this paper is to argue that, in potentially many problems, the inclusion of all the dominating points in an IS could be unnecessary. Our study is motivated by high-dimensional settings where finding all dominating points could be computationally expensive or even prohibitive, yet such problems arise in recent safety-critical applications (e.g., Arief et al. (2021), and the example in Section 3.3).

To explain intuitively why including all dominating points can be unnecessary, let us first drill into why they are arguably all needed in the literature in the first place. Imagine that a rare-event set comprises two disjoint "important" regions, say $A_1$ and $A_2$, whose dominating points are correspondingly $a_1$ and $a_2$, which are sufficiently "far away" from each other, with $a_1$ having, say, a higher density than $a_2$. Roughly speaking, if an IS scheme only focuses on $A_1$ and tilts the distribution center towards $a_1$, then there is a small chance that a sample from this IS distribution hits $A_2$; the contribution from such a sample to the ultimate estimator is non-zero and, moreover, may constitute a "blow-up" in the likelihood ratio and consequently a large variance. This unfortunate event of falling into a secondary important region is the source of inefficiency according to the relative error criterion.

Now let us take a step back and ask: how likely is the "unfortunate" event above to occur? In a typical Monte Carlo experiment, we argue (as we will see in experiments momentarily) that it could be very unlikely, to the extent that we need not worry about it at all with a reasonable simulation size. Yet, according to the relative error criterion, it seems necessary to worry, because this event contributes to the variance of the estimator. This suggests that using variance to measure efficiency in rare-event simulation could be too loose to begin with. The variance measure, in turn, comes from the Markov inequality that converts relative error into a sufficient condition for relative closeness between the estimate and the target probability with high confidence. In other words, this Markov inequality itself could be the source of looseness.

This motivates us to propose what we call probabilistic efficiency. Different from the efficiency criteria in the literature, including asymptotic efficiency (also known as asymptotic optimality or logarithmic efficiency) and bounded relative error (Juneja and Shahabuddin 2006, L'Ecuyer et al. 2010), probabilistic efficiency does not use the relative error. Instead, it is a criterion on the achievement of the high-confidence bound directly. The key element in probabilistic efficiency is the control on the simulation size itself, in that we only allow it to grow moderately with the underlying rarity parameter. This moderate simulation size, which is often the only feasible option in experiments, suppresses the occurrence of the unfortunate event of falling into a secondary important region. This way, while the variance could blow up, the high-confidence closeness between the estimate and the target probability can still be retained.

We close this introduction by cautioning that probabilistic efficiency is not meant to replace existing variance-based efficiency criteria, but rather to complement them, especially in situations where identifying all dominating points is infeasible due to problem complexity. In problems where the latter is not an issue, it remains "safer" to use existing criteria, as probabilistic efficiency relies on more subtle conditions whose impacts on finite-sample performance are less transparent. To this end, we provide both general and more specialized sufficient conditions for guaranteeing probabilistic efficiency. Though there is still much work to be done (as we discuss at the end of this paper), we view our study as a first step towards designing simpler IS schemes for complex problems that are not amenable to classical efficiency criteria.

In the following, we first introduce in more detail the background of rare-event simulation and the established efficiency criteria in the literature, all of which involve the estimation variance or relative error (Section 2). Then we show several motivating numerical examples to illustrate how excluding some dominating points in IS appears to give similar, and sometimes even better, performance than including all of them, the latter being the predominant suggestion in the literature (Section 3). This motivates our new notion of probabilistic efficiency to explain the observed numerical phenomena (Section 4) and the analysis of efficiency guarantees using this new notion (Section 5). Finally, we give some cautionary notes about probabilistic efficiency, which involve the risk of under-estimation, and suggest some future directions (Section 6).

2 Problem Setting and Existing Efficiency Criteria

As is customary in rare-event simulation, we introduce an indexed family of rare events $\{A_\gamma\}$, where $\gamma > 0$ denotes a "rarity parameter" such that as $\gamma \to \infty$ the event becomes rarer, i.e., $p_\gamma := P(A_\gamma) \to 0$. Our goal is to estimate $p_\gamma$. We would like an estimator $\hat p_\gamma$ such that, for a given tolerance level $\varepsilon > 0$ and confidence level $1 - \delta$, we have

$$P\left(|\hat p_\gamma - p_\gamma| \le \varepsilon p_\gamma\right) \ge 1 - \delta \quad (1)$$

for a certain simulation size $n$. Note that the error between $\hat p_\gamma$ and $p_\gamma$ in (1) is measured relative to the magnitude of $p_\gamma$ itself, since $p_\gamma$ is tiny and the estimation is only meaningful if the error is small enough relative to this tiny quantity.

Suppose we use crude Monte Carlo, which utilizes the unbiased estimator $Z = \mathbf{1}(A_\gamma)$ for $p_\gamma$, where $\mathbf{1}(A_\gamma)$ denotes the indicator variable of the event $A_\gamma$. Let $\bar Z_n$ be the sample mean of $n$ independent replications of $Z$. Then by Chebyshev's inequality, we get that for any $\varepsilon > 0$,

$$P\left(|\bar Z_n - p_\gamma| > \varepsilon p_\gamma\right) \le \frac{Var(Z)}{n\varepsilon^2 p_\gamma^2}.$$

Hence, $n \ge Var(Z)/(\delta\varepsilon^2 p_\gamma^2)$ implies that $P(|\bar Z_n - p_\gamma| > \varepsilon p_\gamma) \le \delta$. This means that $n = Var(Z)/(\delta\varepsilon^2 p_\gamma^2)$ is a sufficient size for $\bar Z_n$ to achieve (1). This quantity depends on the ratio between the standard deviation $\sqrt{Var(Z)}$ and the mean $p_\gamma$, which is known as the relative error. Here, the relative error is $\sqrt{Var(Z)}/p_\gamma = \sqrt{(1-p_\gamma)/p_\gamma} \approx 1/\sqrt{p_\gamma}$. If $p_\gamma$ decays exponentially in $\gamma$ as in the typical large deviations setting, then this quantity blows up exponentially, which in turn means $n$ needs to grow exponentially in $\gamma$ to achieve (1).
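For a concrete sense of scale (our arithmetic, not a figure from the paper): with $p_\gamma = 10^{-6}$, $\varepsilon = 0.1$, and $\delta = 0.05$, the sufficient crude Monte Carlo size is $n = (1-p_\gamma)/(\delta\varepsilon^2 p_\gamma) \approx 2 \times 10^9$ samples.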

With IS, we generate samples from an alternative distribution $\tilde P$ where $\tilde P$ satisfies $P \ll \tilde P$ on $A_\gamma$ (i.e., $P$ is absolutely continuous with respect to $\tilde P$ there), and use $Z = \mathbf{1}(A_\gamma) L$ as an unbiased output for $p_\gamma$, where $L = dP/d\tilde P$ is the Radon-Nikodym derivative, or the so-called likelihood ratio, between $P$ and $\tilde P$. Though this output is always unbiased thanks to the likelihood ratio adjustment, the performance of the IS estimator in terms of variability depends heavily on the choice of the IS probability measure $\tilde P$. In the literature, many efficiency criteria for IS estimators have been developed. As a common example, we consider the criterion of asymptotic efficiency (Asmussen and Glynn 2007, Heidelberger 1995, Juneja and Shahabuddin 2006): [Asymptotic efficiency] The IS estimator $Z$ under $\tilde P$ is said to achieve asymptotic efficiency if the relative error $\sqrt{Var_{\tilde P}(Z)}/p_\gamma$ grows at most subexponentially in $\gamma$, i.e., $\frac{1}{\gamma}\log\left(Var_{\tilde P}(Z)/p_\gamma^2\right) \to 0$, where $Var_{\tilde P}$ denotes the variance under the IS distribution $\tilde P$. An equivalent definition is $\lim_{\gamma\to\infty} \log E_{\tilde P}[Z^2]/(2\log p_\gamma) = 1$, where $E_{\tilde P}$ denotes the expectation under $\tilde P$.

In Definition 2, the second equivalent definition is more commonly used, but the first one is more convenient for our development. From our earlier discussion, asymptotic efficiency implies that the required simulation size to attain a prefixed relative error grows only subexponentially in $\gamma$. As a stronger requirement, $Z$ is said to have a bounded relative error if $\limsup_{\gamma\to\infty} \sqrt{Var_{\tilde P}(Z)}/p_\gamma < \infty$, which implies that the required simulation size remains bounded no matter how small $p_\gamma$ is. The criterion of bounded relative error is sometimes too strict to achieve, so we place more attention on asymptotic efficiency in this paper. More efficiency criteria can be found in Juneja and Shahabuddin (2006), L'Ecuyer et al. (2010), Blanchet and Lam (2012).
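The following sketch (again our own $N(0,1)$ tail example, not the paper's) illustrates what this criterion measures: the empirical relative error of the mean-shift IS stays moderate as $\gamma$ grows, while crude Monte Carlo's relative error $\sqrt{(1-p_\gamma)/p_\gamma}$ explodes.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n = 200_000
for gamma in [2.0, 3.0, 4.0]:
    p = norm.sf(gamma)
    y = rng.standard_normal(n) + gamma                    # tilt to the dominating point
    z = (y >= gamma) * np.exp(-gamma * y + gamma**2 / 2)  # IS outputs
    print(f"gamma={gamma}: RE(crude MC)={np.sqrt((1 - p) / p):.1f}  "
          f"RE(IS)={z.std() / z.mean():.2f}")
```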

In the large deviations setting, the classical notion of dominating points is used to guarantee asymptotic efficiency (Sadowsky and Bucklew 1990). Intuitively, we compute a so-called rate function $I(x)$ which measures the likelihood of hitting the point $x$ on a logarithmic scale, and then for a rare-event set $A$, $\arg\min_{x \in A} I(x)$ represents the most likely point to hit in the set $A$. Dominating points can be viewed as the local minima of $I$ in the set $A$. Definition 2 is an intuitive definition of dominating points to facilitate understanding, and we will give a rigorous definition in Section 5.

[Dominating set (informal)] Consider a closed rare-event set $A$. Suppose that $I$ is the rate function associated with the input distribution. We call $D \subseteq A$ a dominating set if

  1. For any $x \in A$, there exists at least one $a \in D$ such that $x \in H_a$, where $H_a := \{x : \langle \nabla I(a), x - a \rangle \ge 0\}$ is the half-space at $a$ with normal $\nabla I(a)$;

  2. For any $a \in D$, $D \setminus \{a\}$ does not satisfy the above condition.

We call any point in $D$ a dominating point. For two dominating points $a_1$ and $a_2$, we say $a_1$ is more significant than $a_2$ if $I(a_1) < I(a_2)$.

Figure 1: Illustration of rare-event set and dominating points.

Figure 1 is an illustration of the rare-event set $A$ and the dominating set $D = \{a_1, a_2\}$. Here, $a_1$ is the global minimizer of the rate function in $A$, but part of $A$ is not covered by the half-space $H_{a_1}$, so we need to include $a_2$ in the dominating set as well.

After finding all the dominating points, the classical suggestion for a good IS is to use a mixture of exponentially tilted distributions, where each exponential tilting is with respect to one dominating point. This IS is guaranteed to be asymptotically efficient under certain conditions. We will define this mixture distribution and establish its properties more rigorously later in this paper.

For now, we take a more specific setting as an example. Suppose that $p_\gamma = P(X \in A_\gamma)$ where $X \sim N(0, I_d)$ under $P$ and $A_\gamma$ is a closed rare-event set. In this example, the rate function is $I(x) = \|x\|^2/2$, so minimizing the rate function is equivalent to maximizing the density function. Following the definition, we call $D_\gamma = \{a_1, \ldots, a_r\}$ the dominating set if every $x \in A_\gamma$ lies in some half-space $H_{a_j} = \{x : \langle a_j, x - a_j \rangle \ge 0\}$ and this condition no longer holds after the removal of any $a_j$. Indeed, we could write $A_\gamma = \bigcup_{j=1}^r A_{\gamma,j}$ where the $A_{\gamma,j}$'s are disjoint and $A_{\gamma,j} \subseteq H_{a_j}$. Then we have that $a_j = \arg\max_{x \in A_{\gamma,j}} \phi(x)$, where $\phi$ denotes the density function of $N(0, I_d)$. That is, the dominating points correspond to the highest-density scenarios in a local region should the rare event happen. Under proper conditions, the IS distribution with density $\tilde\phi(x) = \sum_{j=1}^r \alpha_j \phi(x - a_j)$, where $\alpha_j > 0$ and $\sum_{j=1}^r \alpha_j = 1$, achieves asymptotic efficiency.
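A minimal sketch of this mixture construction follows; the event, the dominating points $a_1 = (4, 0)$ and $a_2 = (0, 5)$, and the equal weights are all our illustrative choices rather than the paper's.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
# Hypothetical rare-event set: union of two half-spaces in R^2 with dominating
# points a1 = (4, 0) (more significant) and a2 = (0, 5).
a = np.array([[4.0, 0.0], [0.0, 5.0]])
alpha = np.array([0.5, 0.5])                       # mixture weights

in_event = lambda x: (x[:, 0] >= 4.0) | (x[:, 1] >= 5.0)

# Sample from the mixture sum_j alpha_j N(a_j, I) and weight by
# L(x) = phi(x) / sum_j alpha_j phi(x - a_j); normalizing constants cancel.
idx = rng.choice(2, size=n, p=alpha)
x = rng.standard_normal((n, 2)) + a[idx]
log_phi = lambda z: -0.5 * (z**2).sum(axis=1)
den = sum(w * np.exp(log_phi(x - aj)) for w, aj in zip(alpha, a))
z = in_event(x) * np.exp(log_phi(x)) / den

print("mixture-IS estimate:", z.mean())
```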

On the other hand, if we miss some dominating points in the construction of the mixture IS, then asymptotic efficiency may fail to be attained. Below we give a simple example to demonstrate this.

[Missed dominating point leads to violation of asymptotic efficiency] Suppose that we want to estimate $p_\gamma = P(X \ge \gamma \text{ or } X \le -c\gamma)$ for a constant $c \in (1, 3)$, where $X \sim N(0, 1)$ under $P$. If the IS distribution is chosen as $\tilde P = N(\gamma, 1)$, then $Var_{\tilde P}(Z)/p_\gamma^2$ grows exponentially in $\gamma$, and hence $Z$ is not asymptotically efficient by definition.

Proof.

Proof of Lemma 1. There are two dominating points, $\gamma$ and $-c\gamma$, and $\gamma$ is the more significant one, i.e., $\phi(\gamma) > \phi(-c\gamma)$, where $\phi$ denotes the density function of $N(0,1)$. With the IS distribution $N(\gamma, 1)$, the likelihood ratio function is $L(x) = e^{-\gamma x + \gamma^2/2}$, and hence $x \le -c\gamma$ implies that $L(x) \ge e^{(c + 1/2)\gamma^2}$, which is large enough to cause a blow-up. More rigorously, we have that

$$E_{\tilde P}[Z^2] = E_P[L(X)\mathbf{1}(X \in A_\gamma)] \ge \int_{-\infty}^{-c\gamma} e^{-\gamma x + \gamma^2/2}\phi(x)\,dx = e^{\gamma^2}\bar\Phi((c-1)\gamma),$$

where $\bar\Phi$ denotes the tail distribution function of the standard normal distribution. It is known that $\bar\Phi(t) = e^{-t^2(1+o(1))/2}$ as $t \to \infty$, and hence $E_{\tilde P}[Z^2] \ge e^{(1-(c-1)^2/2)\gamma^2(1+o(1))}$. Besides, $p_\gamma^2 = (\bar\Phi(\gamma) + \bar\Phi(c\gamma))^2 = e^{-\gamma^2(1+o(1))}$. As a result, $Var_{\tilde P}(Z)/p_\gamma^2 \ge e^{(2-(c-1)^2/2)\gamma^2(1+o(1))} - 1$ grows exponentially in $\gamma$ since $c < 3$. ∎

In the example in Lemma 1, with the missed dominating point in the constructed IS, samples that fall into the rare-event region associated with this missed point could contribute a large likelihood ratio and hence a blow-up of the overall variance, leading to the violation of asymptotic efficiency.
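A quick numerical companion to this point (using our illustrative constant $c = 2$ from the lemma above): the likelihood ratio on the missed region is astronomically large, yet the chance that a single IS sample ever lands there is astronomically small, foreshadowing the discussion in Section 4.

```python
import numpy as np
from scipy.stats import norm

for gamma in [2.0, 3.0, 4.0]:
    p = norm.sf(gamma) + norm.cdf(-2 * gamma)        # two-region target probability
    lr_missed = np.exp(2 * gamma**2 + gamma**2 / 2)  # L(x) at x = -2*gamma
    hit_missed = norm.cdf(-3 * gamma)                # P(N(gamma,1) <= -2*gamma)
    print(f"gamma={gamma}: p={p:.2e}  L(-2*gamma)={lr_missed:.2e}  "
          f"P(sample hits missed region)={hit_missed:.2e}")
```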

3 Numerical Experiments

Before going to the theory, we first present some numerical results to shed light on how much difference it makes to use different numbers of dominating points in the IS mixture. In particular, we demonstrate that, contrary to the guidance from asymptotic efficiency, missing some dominating points may not hurt empirical performance, and can in fact sometimes achieve even better performance in terms of stability (measured by confidence interval (CI) width). Besides, the example in Section 3.3 also illustrates our motivation for reducing the number of dominating points used in the IS mixture.

3.1 Large Deviations of an I.I.D. Sum

We consider the problem of estimating the tail probability involving a sum of random variables, where $X_1, \ldots, X_n$ are i.i.d. and we are interested in

$$p_n = P\left(\frac{S_n}{n} \ge a \text{ or } \frac{S_n}{n} \le b\right),$$

where $S_n = X_1 + \cdots + X_n$ and $b < E[X_1] < a$. We consider $n$ as the rarity parameter presented in Section 2. Denote the rate function of $S_n/n$ by $I$. By large deviations theory, if $a$ and $b$ satisfy $0 < I(a) < I(b) < \infty$, then we have

$$P\left(\frac{S_n}{n} \ge a\right) = e^{-nI(a)(1+o(1))} \quad (2)$$

and

$$p_n = e^{-nI(a)(1+o(1))}. \quad (3)$$

For this problem, Glasserman and Wang (1997) Section 3 provides two estimators, $\hat p_n^{(1)}$ and $\hat p_n^{(2)}$. Specifically, $\hat p_n^{(1)}$ exponentially tilts the distribution towards $a$ only, whereas $\hat p_n^{(2)}$ uses a mixture of exponential tiltings towards both $a$ and $b$. That is, $\hat p_n^{(1)}$ only uses one dominating point whereas $\hat p_n^{(2)}$ uses all the dominating points. In our experiment, we follow Glasserman and Wang (1997) in the choices of the input distribution, $a$, and $b$ (in this case, $a$ is the more significant dominating point).

We run numerical experiments for several values of $n$. The results are shown in Table 1, and a small code sketch of the two estimators is given after the table. By comparing the numbers in the second and third rows, we observe that $\hat p_n^{(1)}$ and $\hat p_n^{(2)}$ have very similar empirical performances. However, note that $\hat p_n^{(2)}$ is asymptotically efficient (Glasserman and Wang (1997) Proposition 1) while $\hat p_n^{(1)}$ is arguably a very poor estimator in terms of variance; in fact, its relative error under the exponential change of measure at $a$ grows exponentially in $n$ (Glasserman and Wang (1997) Theorem 1).

$n$                10          30          50          100
$\hat p_n^{(1)}$   8.22(0.26)  1.60(0.07)  3.77(0.18)  1.34(0.08)
$\hat p_n^{(2)}$   8.29(0.26)  1.60(0.07)  3.77(0.18)  1.34(0.08)
Table 1: Point estimates (and 95% CIs) using the IS estimators $\hat p_n^{(1)}$ (one dominating point) and $\hat p_n^{(2)}$ (all dominating points) for the tail probability with different $n$.
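The sketch below is our illustrative stand-in for this experiment; the input distribution ($N(0,1)$) and the thresholds $a = 1$, $b = -1.5$ are our own choices and differ from Glasserman and Wang's exact parameters.

```python
import numpy as np

rng = np.random.default_rng(4)
n_rv, reps, a, b = 30, 100_000, 1.0, -1.5  # event {S_n/n >= a} union {S_n/n <= b}

def is_estimate(means, weights):
    """Mixture IS: tilt all X_i's mean to a randomly chosen theta per replication."""
    means, weights = np.asarray(means), np.asarray(weights)
    theta = means[rng.choice(len(means), size=reps, p=weights)]
    s = rng.standard_normal((reps, n_rv)).sum(axis=1) + n_rv * theta
    # The likelihood ratio of N(0,1)^n against the tilt mixture depends on the
    # sample only through s = sum_i x_i.
    den = sum(w * np.exp(m * s - n_rv * m**2 / 2) for m, w in zip(means, weights))
    hit = (s >= n_rv * a) | (s <= n_rv * b)
    return np.mean(hit / den)

print("one dominating point  :", is_estimate([a], [1.0]))
print("both dominating points:", is_estimate([a, b], [0.5, 0.5]))
```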

3.2 Overshoot Probability of a Random Walk

We consider the problem of estimating the overshoot probability of the finite-horizon maximum of a random walk. We define the probability of interest as

$$p = P\left(\max_{1 \le k \le T} S_k \ge b\right),$$

where $S_k = X_1 + \cdots + X_k$ and the $X_i$'s follow a Gaussian distribution with mean 0 and standard deviation $\sigma$. Suppose that the rarity parameter is $1/\sigma$. We note that the rare-event set of this problem can be expressed as a union of half-spaces, i.e., $\bigcup_{k=1}^T H_k$ where $H_k = \{x : a_k^\top x \ge b\}$ and $a_k = (1, \ldots, 1, 0, \ldots, 0)^\top$ with the first $k$ elements equal to 1. This decomposition allows us to construct IS estimators using the dominating points corresponding to each half-space $H_k$. More specifically, in this example, the dominating point of $H_k$ is $x_k^* = (b/k)a_k$ with rate function value $b^2/(2k\sigma^2)$, and the dominating points, ranked from the most to the least significant (i.e., increasing rate function value), are $x_T^*, x_{T-1}^*, \ldots, x_1^*$.

In our experiments, we fix $b$ and vary $\sigma$ for different rarity levels. In addition, we take a relatively small horizon, $T = 10$, for illustration convenience. We generate samples from IS distributions using a varying, partial list of dominating points. That is, we choose the IS distribution as a mixture of mean shifts to the $m$ most significant dominating points, for $m = 1, \ldots, 10$; a code sketch of this construction is given below. The performance of the IS estimators is shown in Table 2. We observe that the estimators with different numbers of dominating points perform similarly. In particular, we present two representative cases in Figure 2. We observe that in both cases the performance of the IS estimators is almost independent of the number of dominating points used: the probability estimates are all comparable, while using more dominating points slightly increases the CI width.
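The following is a minimal sketch of the partial-mixture construction for this random walk; the equal mixture weights and the values of $b$, $\sigma$, and $m$ are our own choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
T, sigma, b, n, m = 10, 0.25, 1.0, 100_000, 3   # m = number of dominating points used

# Half-space H_k = {a_k . x >= b} with a_k = (1,...,1,0,...,0) (k ones); its
# dominating point under N(0, sigma^2 I) is (b/k) * a_k, more significant for larger k.
A = np.tril(np.ones((T, T)))
doms = (b / np.arange(1, T + 1)[:, None]) * A
mus = doms[::-1][:m]                            # most significant first: k = T, T-1, ...

idx = rng.integers(m, size=n)
x = sigma * rng.standard_normal((n, T)) + mus[idx]

log_phi = lambda z: -0.5 * (z**2).sum(axis=1) / sigma**2  # log density up to a constant
den = np.mean([np.exp(log_phi(x - mu)) for mu in mus], axis=0)
lr = np.exp(log_phi(x)) / den

hit = (x.cumsum(axis=1) >= b).any(axis=1)       # max_k S_k >= b
print(f"IS estimate with {m} dominating points:", np.mean(hit * lr))
```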


Figure 2: Point estimates and CI widths for IS estimators using different numbers of dominating points.
$\sigma$   0.2         0.22        0.24        0.26        0.28        0.3
#          prob (CI)   prob (CI)   prob (CI)   prob (CI)   prob (CI)   prob (CI)
1          1.15(0.02)  9.11(0.16)  4.48(0.08)  1.56(0.03)  4.23(0.07)  9.57(0.15)
2          1.16(0.02)  9.12(0.14)  4.46(0.07)  1.56(0.02)  4.25(0.06)  9.64(0.13)
3          1.16(0.02)  9.17(0.16)  4.48(0.07)  1.57(0.02)  4.26(0.06)  9.64(0.13)
4          1.15(0.03)  9.11(0.18)  4.46(0.08)  1.56(0.03)  4.24(0.06)  9.63(0.14)
5          1.15(0.03)  9.11(0.20)  4.46(0.09)  1.56(0.03)  4.24(0.07)  9.63(0.15)
6          1.15(0.03)  9.11(0.22)  4.46(0.10)  1.56(0.03)  4.24(0.08)  9.62(0.17)
7          1.14(0.03)  9.00(0.24)  4.41(0.11)  1.54(0.03)  4.21(0.08)  9.56(0.18)
8          1.14(0.04)  9.05(0.25)  4.43(0.11)  1.55(0.04)  4.22(0.09)  9.58(0.19)
9          1.13(0.04)  8.96(0.27)  4.39(0.12)  1.54(0.04)  4.19(0.10)  9.53(0.20)
10         1.14(0.04)  8.99(0.28)  4.41(0.13)  1.54(0.04)  4.20(0.10)  9.55(0.22)
Table 2: Point estimates (and 95% CIs) for IS estimators using different numbers of dominating points for the overshoot probability, for each value of $\sigma$. # denotes the number of dominating points used in the IS estimator.

3.3 Robustness Assessment for an MNIST Classification Model

We consider a rare-event probability estimation problem from an image classification task. Our goal is to estimate the probability of misclassification when the input of a prediction model is perturbed by tiny noise. This probability is of interest as a robustness measure of the prediction model (Webb et al. 2018). More specifically, suppose that the prediction model $h$ correctly predicts the label of an input $x_0$, i.e., $h(x_0) = y_0$ where $y_0$ is the true label of $x_0$. Then $P(h(x_0 + W) \ne y_0)$, where $W$ is a random perturbation, can be used to measure the robustness of $h$ around $x_0$.

In particular, we consider the classification problem on the MNIST dataset, which contains 70,000 images of handwritten digits, each consisting of $28 \times 28$ pixels. We train a 2-ReLU-layer neural network with 20 neurons in each layer using the 60,000 training images, which achieves approximately 95% accuracy on the testing data in predicting the digits. We perturb a fixed input (that is correctly predicted) with Gaussian noise with mean 0 and standard deviation $\sigma$ on each of the 784 dimensions to assess the robustness of the prediction. Note that the rarity of this problem is determined by the value of $\sigma$, and we let the rarity parameter be $1/\sigma$.

In order to design an efficient estimator for this problem, we apply the dominating point scheme introduced in Huang et al. (2018b) and Bai et al. (2020) to construct an importance sampler. The scheme sequentially searches for the dominating point with the highest density among those not yet covered, using an optimization formulation solved with a cutting-plane algorithm.

Due to the high dimensionality of the input space and the complexity of the neural network predictor, the number of dominating points in this problem is huge. We implement the sequential search algorithm in Bai et al. (2020), and it took a week to find the first 100 dominating points. Since we stopped the algorithm prematurely, the actual number of dominating points can be much larger. We construct IS distributions with different numbers of dominating points (ranging from 1 to 42) and different magnitudes of $\sigma$, and report the estimated probabilities and CIs, using fixed sample sizes for the IS estimators and for the crude Monte Carlo estimators. Figure 3 shows the results.


Figure 3: Simulation results for the MNIST experiment. (a) Point estimates and CI widths from IS estimators using different numbers of dominating points. (b) Point estimates from IS estimators using different numbers of dominating points (IS with 1, 20, and 40) and crude Monte Carlo (CMC), with vertical error bars representing their 95% CIs (the CIs for the IS estimates are extremely narrow).

By analyzing the performance of the considered estimators, we conclude that missing less significant dominating points does not make a big difference in this problem. As shown in Figure 3(a), when we fix the rarity of the problem, the estimate is not sensitive to the number of dominating points. The CI width shows an increasing trend as the number of dominating points grows, indicating that additional dominating points can in fact potentially cause inefficiency.

In Figure 3(b), we vary the rarity of the problem and compare the performance of different IS estimators and crude Monte Carlo. Note that the crude Monte Carlo estimates are unavailable for the rarer settings due to its inefficiency. We observe that the estimates from different IS estimators overlap visually in all considered cases, which indicates that the differences among them are negligible. We also note that these estimates are consistent with the crude Monte Carlo estimates (when available), which supports their correctness.

4 Probabilistic Efficiency

Section 3 shows that IS estimators that miss some dominating points can perform competitively with estimators that include all of them and thus achieve asymptotic efficiency. In light of this, we propose the concept of probabilistic efficiency as a relaxation of asymptotic efficiency. The key idea of probabilistic efficiency is to consider directly the high-probability relative discrepancy of the estimator from the ground truth, instead of using the relative error or, equivalently, the estimation variance. The latter, as can be seen from the arguments in Section 2, provides a sufficient, but not necessary, condition on the required sample size. In other words, there is an intrinsic looseness brought by the Markov or Chebyshev inequality that converts the relative error into the required sample size.

To proceed, we first define the following: [Minimal relative discrepancy] For any estimator $\hat p_\gamma$ of $p_\gamma$ and any $0 < \delta < 1$, the minimal relative discrepancy of $\hat p_\gamma$ is given by

$$\varepsilon_\gamma(\delta) := \inf\left\{\varepsilon \ge 0 : P\left(|\hat p_\gamma - p_\gamma| \le \varepsilon p_\gamma\right) \ge 1 - \delta\right\}. \quad (4)$$

The minimal relative discrepancy measures the relative accuracy of the estimator $\hat p_\gamma$, in that it gives the smallest relative discrepancy of $\hat p_\gamma$ from $p_\gamma$ that can be achieved with probability at least $1 - \delta$. Thus the smaller $\varepsilon_\gamma(\delta)$ is, the more accurate $\hat p_\gamma$ is.

We say that $\hat p_\gamma$ is probabilistically efficient if $\varepsilon_\gamma(\delta)$ is small in a suitable sense. More precisely, we propose the following notions: [Probabilistic efficiency] Suppose that $\{A_\gamma\}$ is an indexed family of rare events and $p_\gamma = P(A_\gamma) \to 0$ as $\gamma \to \infty$. We consider the IS estimator $Z$ under the IS distribution $\tilde P$, and let $\hat p_\gamma$ be the sample mean of $n$ independent replications of $Z$. For any $0 < \delta < 1$, we define $\varepsilon_\gamma(\delta)$ as in (4). Then

  1. We call $\hat p_\gamma$ weakly probabilistically efficient if we can choose $n$ subexponential in $\gamma$ (i.e., $\log n = o(\gamma)$) such that for any $0 < \delta < 1$, $\limsup_{\gamma\to\infty} \varepsilon_\gamma(\delta) < \infty$;

  2. We call $\hat p_\gamma$ strongly probabilistically efficient if we can choose $n$ subexponential in $\gamma$ such that for any $0 < \delta < 1$, $\varepsilon_\gamma(\delta) \to 0$ as $\gamma \to \infty$.

Note that strong probabilistic efficiency matches the usual notion of consistency in statistical estimation: the estimator approaches the target parameter as $\gamma \to \infty$. In contrast, weak probabilistic efficiency only cares about obtaining the correct magnitude, which is more flexible for rare-event simulation. We also contrast our notion of probabilistic efficiency with the notion of probabilistic bounded relative error proposed in Tuffin and Ridder (2012), where the IS measure is randomly chosen and efficiency is achieved if the resulting random relative error of the IS estimator is bounded by some constant with high probability; this is conceptually different from our notion.

The following shows that probabilistic efficiency is indeed a relaxation of asymptotic efficiency: if $Z$ is asymptotically efficient, then $\hat p_\gamma$ is strongly probabilistically efficient.

Proof.

Proof of Proposition 4. For any unbiased estimator $\hat p_\gamma$, we have that for any $\varepsilon > 0$,

$$P\left(|\hat p_\gamma - p_\gamma| > \varepsilon p_\gamma\right) \le \frac{Var_{\tilde P}(Z)}{n\varepsilon^2 p_\gamma^2},$$

and hence by definition,

$$\varepsilon_\gamma(\delta) \le \sqrt{\frac{Var_{\tilde P}(Z)}{n\delta p_\gamma^2}}.$$

If $Z$ is asymptotically efficient, $Var_{\tilde P}(Z)/p_\gamma^2$ grows at most subexponentially in $\gamma$, so we could choose $n$ growing subexponentially in $\gamma$ such that $\varepsilon_\gamma(\delta) \to 0$ for any $0 < \delta < 1$. By definition, $\hat p_\gamma$ is strongly probabilistically efficient. ∎

Now we explain how probabilistic efficiency helps us understand the influence of missing some dominating points. Recall the example where $p_\gamma = P(X \in A_\gamma)$, $X \sim N(0, I_d)$ under $P$, and $A_\gamma$ comprises two disjoint and faraway pieces $A_{\gamma,1}$ and $A_{\gamma,2}$. The dominating points are respectively $a_1$ and $a_2$ (recall Figure 1). Denote $p_{\gamma,j} = P(X \in A_{\gamma,j})$, and assume that $p_{\gamma,2}$ is exponentially smaller than $p_{\gamma,1}$. If we focus on $A_{\gamma,1}$ and simply use $N(a_1, I_d)$ as the IS distribution, then we face the risk of getting a non-asymptotically-efficient estimator. However, experimentally, if we run the simulation with a moderate sample size, then most likely none of the samples falls into $A_{\gamma,2}$. Conditional on not hitting $A_{\gamma,2}$, we actually get an estimate close to $p_{\gamma,1}$, which is in turn close to $p_\gamma$. In other words, even if the resulting IS estimator is not asymptotically efficient, it could still be a good estimate in terms of its distance to $p_\gamma$, as long as the sample size is not overly big. The latter is precisely the paradigm of probabilistic efficiency.
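This paradigm is easy to observe numerically. Reusing our illustrative two-region event from Lemma 1 (with $c = 2$): the single-point IS below is not asymptotically efficient, yet at a moderate sample size its relative discrepancy from the truth is tiny.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
gamma, n = 4.0, 100_000                      # moderate simulation size
p = norm.sf(gamma) + norm.cdf(-2 * gamma)    # two-region target, c = 2 as before

# Single-point IS centered at the more significant dominating point gamma only.
y = rng.standard_normal(n) + gamma
z = ((y >= gamma) | (y <= -2 * gamma)) * np.exp(-gamma * y + gamma**2 / 2)

print("relative discrepancy:", abs(z.mean() - p) / p)
# With n = 1e5, essentially no sample reaches y <= -8 (probability ~ Phi(-12)
# per sample), so the estimator behaves like an efficient one-region IS here.
```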

More concretely, we have the following theorem: [Achieving strong probabilistic efficiency] Suppose that $\{A_\gamma\}$ is an indexed family of rare events and $p_\gamma = P(A_\gamma) \to 0$ as $\gamma \to \infty$. We write $A_\gamma = A_{\gamma,1} \cup A_{\gamma,2}$ where $A_{\gamma,1}$ and $A_{\gamma,2}$ are two disjoint events. Denote $p_{\gamma,j} = P(A_{\gamma,j})$ for $j = 1, 2$. Assume that

  1. $p_{\gamma,2}/p_{\gamma,1} \to 0$ as $\gamma \to \infty$;

  2. We have an asymptotically efficient IS estimator for $p_{\gamma,1}$ given by $Z_1 = \mathbf{1}(A_{\gamma,1})L$ under $\tilde P$, i.e., $Var_{\tilde P}(Z_1)/(n p_{\gamma,1}^2) \to 0$ as $\gamma \to \infty$ for some $n$ growing subexponentially in $\gamma$;

  3. $\tilde P$ satisfies that $n\tilde P(A_{\gamma,2}) \to 0$ as $\gamma \to \infty$.

Let $\hat p_\gamma$ be the sample mean of $n$ independent replications of $Z = \mathbf{1}(A_\gamma)L$ under $\tilde P$. For any $0 < \delta < 1$, define $\varepsilon_\gamma(\delta)$ as in (4). Then we have

$$\varepsilon_\gamma(\delta) \to 0 \text{ as } \gamma \to \infty.$$

Hence $\hat p_\gamma$ is strongly probabilistically efficient.

Proof.

Proof of Theorem 4. Suppose that we sample $X^{(1)}, \ldots, X^{(n)}$ under $\tilde P$. Let

$$\hat p_{\gamma,1} = \frac{1}{n}\sum_{i=1}^n \mathbf{1}\left(X^{(i)} \in A_{\gamma,1}\right)L\left(X^{(i)}\right)$$

and

$$\hat p_{\gamma,2} = \frac{1}{n}\sum_{i=1}^n \mathbf{1}\left(X^{(i)} \in A_{\gamma,2}\right)L\left(X^{(i)}\right).$$

Clearly $\hat p_\gamma = \hat p_{\gamma,1} + \hat p_{\gamma,2}$. For simplicity, denote

$$\varepsilon_1 = \sqrt{\frac{2Var_{\tilde P}(Z_1)}{n\delta p_{\gamma,1}^2}}, \quad (5)$$
$$\varepsilon_2 = \frac{p_{\gamma,2}}{p_{\gamma,1}}. \quad (6)$$

Then we have that

$$P\left(|\hat p_\gamma - p_\gamma| > (\varepsilon_1 + \varepsilon_2)p_\gamma\right) \le P\left(|\hat p_{\gamma,1} - p_{\gamma,1}| > \varepsilon_1 p_{\gamma,1}\right) + P\left(\hat p_{\gamma,2} > 0\right) \le \frac{Var_{\tilde P}(Z_1)}{n\varepsilon_1^2 p_{\gamma,1}^2} + n\tilde P(A_{\gamma,2}) = \frac{\delta}{2} + n\tilde P(A_{\gamma,2}),$$

where the first inequality follows from a union bound together with the facts that $p_\gamma \ge p_{\gamma,1}$ and that $\hat p_{\gamma,2} = 0$ implies $|\hat p_{\gamma,2} - p_{\gamma,2}| = \varepsilon_2 p_{\gamma,1}$, the second inequality follows from Chebyshev's inequality in the first term and a union bound in the second term, and the last equality follows from the definition in (5). Thus, once $\gamma$ is large enough that $n\tilde P(A_{\gamma,2}) \le \delta/2$, we have $\varepsilon_\gamma(\delta) \le \varepsilon_1 + \varepsilon_2$. Finally, $\varepsilon_1 + \varepsilon_2 \to 0$ by a direct use of the assumptions. ∎

Similarly, if we relax the assumption that $p_{\gamma,2}/p_{\gamma,1} \to 0$ as $\gamma \to \infty$, we get sufficient conditions for weak probabilistic efficiency:

[Achieving weak probabilistic efficiency] Suppose that $\{A_\gamma\}$ is an indexed family of rare events and $p_\gamma = P(A_\gamma) \to 0$ as $\gamma \to \infty$. We write $A_\gamma = A_{\gamma,1} \cup A_{\gamma,2}$ where $A_{\gamma,1}$ and $A_{\gamma,2}$ are two disjoint events. Denote $p_{\gamma,j} = P(A_{\gamma,j})$ for $j = 1, 2$. Assume that

  1. $\limsup_{\gamma\to\infty} p_{\gamma,2}/p_{\gamma,1} \le \rho$ for some constant $\rho < \infty$;

  2. We have an asymptotically efficient IS estimator for $p_{\gamma,1}$ given by $Z_1 = \mathbf{1}(A_{\gamma,1})L$ under $\tilde P$, i.e., $Var_{\tilde P}(Z_1)/(n p_{\gamma,1}^2) \to 0$ as $\gamma \to \infty$ for some $n$ growing subexponentially in $\gamma$;

  3. $\tilde P$ satisfies that $n\tilde P(A_{\gamma,2}) \to 0$ as $\gamma \to \infty$.

Let $\hat p_\gamma$ be the sample mean of $n$ independent replications of $Z = \mathbf{1}(A_\gamma)L$ under $\tilde P$. For any $0 < \delta < 1$, define $\varepsilon_\gamma(\delta)$ as in (4). Then we have

$$\limsup_{\gamma\to\infty} \varepsilon_\gamma(\delta) \le \rho.$$

Hence $\hat p_\gamma$ is weakly probabilistically efficient.

Proof.

Proof of Theorem 4. Following the proof of Theorem 4, we still get $\varepsilon_\gamma(\delta) \le \varepsilon_1 + \varepsilon_2$ for all sufficiently large $\gamma$. Under the conditions of Theorem 4, we now have $\limsup_{\gamma\to\infty}(\varepsilon_1 + \varepsilon_2) \le \rho$. By the definition, $\hat p_\gamma$ is weakly probabilistically efficient. ∎

Here, we note that in Theorems 4 and 4, $A_{\gamma,1}$ and $A_{\gamma,2}$ are not necessarily governed by only one dominating point each. They can be general events with one or more dominating points as long as the assumptions hold. According to the theorems, supposing that we have found some dominating points while the remaining ones are known to be less significant and "far from" the current ones, we could simply use the current mixture IS distribution instead of continuing to search. The remaining question is how we know or believe that the remaining dominating points are negligible, i.e., that the assumptions of the theorems are satisfied. Besides, probabilistic efficiency only implies that the point estimate is reliable in some sense. This raises questions on inference, such as the construction of asymptotically valid CIs using the sample variance. We investigate all of these in the next section.

5 Theoretical Guarantees

In this section, we focus on the setting that $p_\gamma = P(X \in A_\gamma)$ where $X$ has density function $f$ under $P$ and the rare-event set takes the form $A_\gamma = \{x : g(x) \ge \gamma\}$, where $g$ is a function and $A_\gamma$ is a closed set. We focus on this setting as it arises as a generic representation of recent problems in intelligent system safety testing. There, $g$ could be highly complicated and lead to a gigantic number of dominating points, which in turn motivates the consideration of dropping most of them and our notion of probabilistic efficiency. We note that technically this setting is slightly different from the classical Gartner-Ellis regime in terms of the position of the scaling parameter (see, e.g., Sadowsky and Bucklew 1990), but it is conceptually very similar.

In Section 5.1, we recap some background notions in large deviations and the precise definition of dominating points. Then we state the assumptions under which we could build on Theorem 4 to obtain reliable point estimates and CIs. In Section 5.2, we consider the special (but important) case where $X$ follows a Gaussian distribution, and in particular propose a simple stopping strategy to determine whether it is safe to stop searching for the remaining dominating points. Throughout this section, we write $h(\gamma) = e^{o(\gamma)}$ if $h(\gamma)$ grows or decays at most subexponentially in $\gamma$.

5.1 Guarantees for General Input Distribution

We define $\psi(\theta) := \log E\left[e^{\langle\theta, X\rangle}\right]$ as the cumulant generating function of $X$ and $I(x) := \sup_\theta\left\{\langle\theta, x\rangle - \psi(\theta)\right\}$ as its Legendre transform. First, we make some basic assumptions on $\psi$ such that $I$ satisfies some useful properties. We note that these assumptions are commonly made in the large deviations literature, and they are satisfied by many widely used light-tailed parametric distributions. We assume that $\psi$ satisfies the following conditions:

  1. $\psi$ is a closed proper convex function;

  2. The domain $\mathcal D_\psi := \{\theta : \psi(\theta) < \infty\}$ has non-empty interior including 0;

  3. $\psi$ is strictly convex and differentiable on the interior of $\mathcal D_\psi$;

  4. $\|\nabla\psi(\theta_k)\| \to \infty$ for any sequence $\{\theta_k\}$ in the interior of $\mathcal D_\psi$ converging to a boundary point of $\mathcal D_\psi$.

Under Assumption 5.1, $I$ has the following properties:

  1. The domain $\mathcal D_I := \{x : I(x) < \infty\}$ has non-empty interior;

  2. $I$ is strictly convex and differentiable on the interior of $\mathcal D_I$;

  3. $I(x) \ge 0$, with $I(x) = 0$ if and only if $x = E[X]$;

  4. For any $x$ in the interior of $\mathcal D_I$, there exists a unique $\theta_x$ such that $\nabla\psi(\theta_x) = x$ and $I(x) = \langle\theta_x, x\rangle - \psi(\theta_x)$.
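For orientation (a standard fact stated in our notation, not quoted from the paper), the parameter $\theta_x$ in property 4 is exactly what defines the exponential tilting used at a dominating point $a$:

$$\frac{dP_{\theta_a}}{dP}(x) = e^{\langle\theta_a, x\rangle - \psi(\theta_a)}, \qquad E_{P_{\theta_a}}[X] = \nabla\psi(\theta_a) = a,$$

so tilting by $\theta_a$ recenters the input distribution at $a$.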

Then we present the concept of dominating set and dominating points: [Dominating set] Consider the rare-event set $A_\gamma$. Suppose that Assumption 5.1 holds and $E[X] \notin A_\gamma$. We call $D_\gamma$ a dominating set for $A_\gamma$ associated with the distribution $P$ if

  1. For any $x \in A_\gamma$, there exists at least one $a \in D_\gamma$ such that $\langle\theta_a, x - a\rangle \ge 0$, where $\theta_a$ is as defined in Lemma 5.1;

  2. For any $a \in D_\gamma$, $D_\gamma \setminus \{a\}$ does not satisfy the above condition.

We call any point in $D_\gamma$ a dominating point. For two dominating points $a_1$ and $a_2$, we say $a_1$ is more significant than $a_2$ if $I(a_1) < I(a_2)$.

Suppose that $D_\gamma = \{a_1, \ldots, a_r\}$ is a dominating set. Then the corresponding mixture IS distribution is given by $\tilde P = \sum_{j=1}^r \alpha_j P_{\theta_{a_j}}$ with $\alpha_j > 0$ and $\sum_{j=1}^r \alpha_j = 1$, where $P_\theta$ denotes the exponentially tilted distribution with $dP_\theta/dP = e^{\langle\theta, x\rangle - \psi(\theta)}$. We recall the standard argument for why this IS is asymptotically efficient. Like the discussion in Section 2, we could split the rare-event set as $A_\gamma = \bigcup_{j=1}^r A_{\gamma,j}$ where the $A_{\gamma,j}$'s are disjoint and $A_{\gamma,j} \subseteq \{x : \langle\theta_{a_j}, x - a_j\rangle \ge 0\}$. Note that $x \in A_{\gamma,j}$ implies that $\langle\theta_{a_j}, x\rangle \ge \langle\theta_{a_j}, a_j\rangle$. Now, the likelihood ratio is given by

$$L(x) = \left(\sum_{j=1}^r \alpha_j e^{\langle\theta_{a_j}, x\rangle - \psi(\theta_{a_j})}\right)^{-1},$$

and it satisfies that for any $x \in A_{\gamma,j}$,

$$L(x) \le \alpha_j^{-1} e^{-\langle\theta_{a_j}, x\rangle + \psi(\theta_{a_j})} \le \alpha_j^{-1} e^{-\langle\theta_{a_j}, a_j\rangle + \psi(\theta_{a_j})} = \alpha_j^{-1} e^{-I(a_j)}.$$

Hence we have that $\sup_{x \in A_\gamma} L(x) \le \max_j \alpha_j^{-1} e^{-I(a_j)}$. Then

$$E_{\tilde P}[Z^2] = E_P[L(X)\mathbf{1}(X \in A_\gamma)] \le \max_j \alpha_j^{-1} \cdot e^{-\min_j I(a_j)} p_\gamma. \quad (7)$$

Thus, supposing we have a large deviations asymptotic given by $p_\gamma = e^{-\min_j I(a_j)(1+o(1))}$, then combining with (7) gives us that the IS estimator is asymptotically efficient.

Our interest is in IS schemes that use a partial list of dominating points instead of the full list. In particular, suppose we sequentially fill in the dominating set $D_\gamma = \{a_1, a_2, \ldots\}$ in decreasing order of significance (Algorithm 1 in Appendix 7 shows how to do so), and we have a stopping strategy that stops after $s$ points, possibly before locating all the dominating points. We choose the mixture IS distribution given by

$$\tilde P = \sum_{j=1}^{s} \alpha_j P_{\theta_{a_j}}, \quad \alpha_j > 0, \ \sum_{j=1}^{s} \alpha_j = 1. \quad (8)$$

Our goal in this section is to discuss under what conditions this IS estimator is probabilistically efficient.
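To tie (8) to code, here is a small reusable sketch for the Gaussian-input case, where the tilting at a dominating point reduces to a mean shift; the function, its equal-weight default, and the example event are our own constructions, not the paper's.

```python
import numpy as np

def partial_mixture_is(doms, s, in_event, n, rng, alpha=None):
    """IS estimate of P(X in A), X ~ N(0, I), using only the first s dominating
    points in `doms` (assumed sorted from most to least significant), as in (8)."""
    mus = np.asarray(doms, dtype=float)[:s]
    alpha = np.full(s, 1.0 / s) if alpha is None else np.asarray(alpha)
    x = rng.standard_normal((n, mus.shape[1])) + mus[rng.choice(s, size=n, p=alpha)]
    log_phi = lambda z: -0.5 * (z**2).sum(axis=1)          # up to a constant
    den = sum(a * np.exp(log_phi(x - mu)) for a, mu in zip(alpha, mus))
    return np.mean(in_event(x) * np.exp(log_phi(x)) / den)

# Example: keep only s = 1 of the two dominating points from the Section 2 example.
rng = np.random.default_rng(8)
est = partial_mixture_is([[4.0, 0.0], [0.0, 5.0]], 1,
                         lambda x: (x[:, 0] >= 4) | (x[:, 1] >= 5), 100_000, rng)
print(est)
```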

First of all, we summarize the basic assumptions on the problem setting to ensure that there exists a dominating set of moderate size: Consider the problem of estimating $p_\gamma = P(X \in A_\gamma)$ where $X$ has density function