## 1 Introduction

The unprecedented deployment of intelligent physical systems on many real-world applications comes with the need for safety validation and certification [koopman2017autonomous, uesato2018rigorous]. For systems that interact with humans and are potentially safety-critical - which can range from medical systems to self-driving cars and personal assistive robots - it is imperative to rigorously assess their risks before their full-scale deployments. The challenge, however, is that these risks are often associated precisely to how AI reacts in rare and catastrophic scenarios which, by their own nature, are not sufficiently observed.

The challenge of validating the safety of intelligent systems described above is, unfortunately, insusceptible to traditional test methods. In the self-driving context, for instance, the goal of validation is to ensure the AI-enabled system reduces human-level accident rate (in the order of 1.5 per miles of driving), thus delivering enhanced safety promise to the public [Evan2016FatalPerfect, kalra2016driving, PreliminaryHWY16FH018]. Formal verification, which mathematically analyzes and verifies autonomous design, faces challenges when applied to black-box or complex models due to the lack of analytic tractability to formulate failure cases or consider all execution trajectories [clarke2018handbook]. Automated scenario selection approaches generate test cases based on domain knowledge [wegener2004evaluation] or adaptive stress testing [koren2018adaptive], which is more implementable but falls short of rigor. Test matrix approaches, such as Euro NCAP [national2007new] use prepopulated test cases extracted from crash databases, but they only contain historical human-driver information. The closest analog to the latter for self-driving vehicles is “naturalistic tests”, which means placing them in real-world environments and gathering observations. This method, however, is economically prohibitive because of the rarity of the target conflict events [zhao2017accelerated, arief2018accelerated, claybrook2018autonomous, o2018scalable]. Because of all these limitations, simulation-based tests surface as a powerful, if not unique, approach to validate black-box autonomy designs [corso2020survey]. This approach operates by integrating the target intelligent algorithm into an interacting virtual simulation platform that models the surrounding environment. By running enough Monte Carlo sampling of this (stochastic) environment, one hopes to observe catastrophic conflict events and subsequently conduct statistical analyses. This approach is flexible and scalable, as it hinges on building a virtual environment instead of physical systems, and provides a probabilistic assessment on the occurrences and behaviors of safety-critical events [koopman2018toward].

Nonetheless, similar to the challenge encountered by naturalistic tests, because of their rarity, safety-critical events are seldom observed in the simulation experiments. In other words, it could take an enormous amount of Monte Carlo simulation runs to observe one “hit”, and this in turn manifests statistically as a large estimation variance per simulation run relative to the target probability of interest (i.e., the so-called

*relative error*; [asmussen2007stochastic]). This problem, which is called rare-event simulation [bucklew2013introduction], is addressed conventionally under the umbrella of variance reduction, which includes a range of techniques from importance sampling (IS) [juneja2006rare, blanchet2012state] to multi-level splitting [glasserman1999multilevel, villen1994restart]. The main idea across all these techniques is to ensure the relative error is dramatically reduced (hence the name variance reduction), by analyzing the underlying model structures to gain understanding of the rare-event behaviors, and leveraging this knowledge to design good Monte Carlo schemes [blanchet2011rare, dean2009splitting]. For convenience, we call such relative error reduction guarantee an

*efficiency certificate*.

Our main focus of this paper is on rare-event problems where the underlying model is unknown or too complicated to support analytical tractability. In this case, traditional variance reduction approaches fail to provide an efficiency certificate. Yet, this scenario is precisely the prominent case in intelligent system testing. In fact, we argue that the existing “black-box” variance reduction techniques (including for instance the cross-entropy method [de2005tutorial, rubinstein2013cross] and particle approaches [au2001estimation, cerou2007adaptive]), not only do not provide correctness guarantees, but also could lead to dangerous *under-estimation* of the rare-event probability without noticed diagnostically. Our goal in this paper is to introduce an alternate framework, both theoretically justified and implementable, that allows the design of certifiable IS in the black-box setting. More precisely, it consists of three ingredients:

Relaxed efficiency certificate: We shift the estimation of target rare-event probability to a tight upper (and lower) bound, in a way that supports the integration of model errors into variance reduction without giving up estimation correctness.

Set-learning with one-sided error: We design learning algorithms to create outer (or inner) approximations of rare-event sets as special classification tasks that exhibit zero false negative rates, under a general geometric property called *orthogonal monotonicity*.

Deep-learning-based IS: We use a two-stage procedure, first to run deep neural network classifiers for set approximation, and second to assimilate the search of so-called *dominant points* in rare-event analysis, to create IS to that attains superior sampling efficiency with the relaxed certificate.

We call our framework consisting of the three ingredients above *Deep Probabilistic Accelerated Evaluation (D-PrAE)*, where “Accelerated Evaluation” follows terminologies in recent approaches for the safety-testing of autonomous vehicles [zhao2016accelerated, 8116682]. To our best knowledge, this is the first approach for guaranteed rare-event simulation efficiency in settings beyond the analytical tractability heavily relied on in the traditional variance reduction literature. We envision our approach to open the door to further generalizations in the emerging applications of probabilistic AI-system risk evaluation.

## 2 Statistical Challenges in Black-Box Rare-Event Simulation

Our evaluation goal is the probabilistic assessment of a complex physical system invoking rare but catastrophic events in a stochastic environment. For concreteness, we write this rare-event probability . Here

is a random vector in

that denotes the environment, and is distributed according to . denotes a safety-critical set on the interaction between the physical system and the environment. The “rarity” parameter is considered a large number, with the property that as , (Think of, e.g., for some risk function and exceedance threshold ). We will work with Gaussian for the ease of analysis, but our framework is more general (i.e., apply to Gaussian mixtures and other light-tailed distributions). In the following, we explain intuitively the main concepts and challenges in black-box rare-event simulation, leaving the details to Appendix A where we also explain how we have generalized existing results.Monte Carlo Efficiency. Suppose we use a Monte Carlo estimator to estimate , by running simulation runs in total. Since is tiny, the error of a meaningful estimation must be measured in relative term, i.e., we would like

(1) |

where is some confidence level (e.g., ) and . Suppose that is unbiased and is an average of i.i.d. simulation runs, i.e., for some random unbiased output . We define the *relative error* as the ratio of variance (per-run) and squared mean. Importantly, to attain (1), a sufficient condition is . So, when is large, the required Monte Carlo size is also large.

Challenges in Naive Monte Carlo. Let where denotes the indicator function, and is an i.i.d. copy of . Since

follows a Bernoulli distribution,

. Thus, the required scales linearly in (when is tiny). This demanding condition is a manifestation of the difficulty in hitting . In the standard large deviations regime [dembo2010large, dupuis2011weak] where is exponentially small in , the required Monte Carlo size would grow*exponentially*in .

Variance Reduction. The severe burden when using naive Monte Carlo motivates techniques to drive down . We use the term *efficiency certificate* to denote an estimator that achieves (1) with , which can be attained with (here means polynomial growth). In the large deviations regime, this means is reduced from exponential in the naive Monte Carlo to *polynomial* in .

To this end, *importance sampling (IS)* is the most prominent technique to achieve efficiency certificate [glynn1989importance]. IS generates from another distribution (called IS distribution), and output where is the likelihood ratio, or the Radon-Nikodym derivative, between and . Via a change of measure, it is easy to see that is unbiased for . The key is to control its by selecting a good . This requires analyzing the behavior of the likelihood ratio under the rare event, and in turn understanding the rare-event sample path dynamics [juneja2006rare].

Perils of Black-Box Variance Reduction Algorithms.
Unfortunately, in black-box settings where complete model knowledge and analytical tractability are unavailable, the classical IS methodology faces severe challenges. To explain this, we first need to understand how efficiency certificate can be obtained based on the concept of *dominant points*. From now on, we consider input

from a Gaussian distribution

where is positive definite.###### Definition 1.

A set is a *dominant set* for the set associated with the distribution if for any , there exists at least one such that . Moreover, this set is minimal in the sense that if any point in is removed, then the remaining set no longer satisfies the above condition. We call any point in a dominant point.

The dominant set comprises the “corner” cases where the rare event occurs [sadowsky1990large]. In other words, each dominant point

encodes, in a local region, the most likely scenario should the rare event happen, and this typically corresponds to the highest-density point in this region. Locality here refers to the portion of the rare-event set that is on one side of the hyperplane cutting through

(see Figure 1(a)).Intuitively, to increase the frequency of hitting the rare-event set (and subsequently to reduce variance), an IS would translate the distributional mean from to the global highest-density point in the rare-event set. The delicacy, however, is that this is *insufficient* to control the variance, due to the “overshoots” arising from sampling randomness. In order to properly control the overall variance, one needs to divide the rare-event set into local regions governed by dominant points, and using a mixture IS distribution that accounts for *all* of them:

###### Theorem 1 (Certifiable IS).

Suppose , where each is a “local” region corresponding to a dominant point associated with the distribution , with conditions stated precisely in Theorem 5 in the Appendix. Then the IS distribution achieves an efficiency certificate in estimating .

On the contrary, if the Gaussian (mixture) IS distribution misses any of the dominant points, then the resulting estimate may be utterly unreliable for two reasons. First, not only that efficiency certificate may fail to hold, but its RE can be arbitrarily large. Second, even more dangerously, this poor performance can be empirically hidden and leads to a systematic *under-estimation* of the rare-event probability without detection. In other words, in a given experiment, we may observe a reasonable empirical relative error (i.e., sample variance over squared sample mean), yet the estimate is much lower than the correct value. These are revealed in the following example:

###### Theorem 2 (Perils of under-estimation).

Suppose we estimate where and . We choose as the IS distribution to obtain . Then 1) The relative error of grows exponentially in . 2) If is polynomial in , we have for any where , and the empirical relative error with probability higher than .

We now explain why using black-box variance reduction algorithms can be dangerous - both in the sense of not having an efficiency certificate and, relatedly, the risk of an unnoticed systematic under-estimation. In the literature, there are two lines of techniques that apply to black-box problems. The first line is to use optimization to search for a good parametrization over a parametric class of IS. The objective criteria include the cross-entropy (with an oracle-best zero-variance IS distribution; [rubinstein2013cross, de2005tutorial]) and estimation variance [arouna2004adaptative]. Without closed-form expressions, and also to combat the rare-event issue, one typically solves a sequence of empirical optimization problems, starting from a “less rare” problem (i.e., smaller ) and gradually increasing the rarity with updated empirical objectives using better IS samples. Obtaining an efficiency certificate requires both a sufficiently expressive parametric IS class (enough mixtures) and parameter convergence (to account for all dominant points). In light of Theorem 2, one can construct examples where the converging IS distribution suffers from systematic under-estimation (see Appendix E). The second line of methods is the multi-level splitting or subsimulation [au2001estimation, cerou2007adaptive], a particle method in lieu of IS, which relies on enough mixing of descendant particles that can encounter similar issues as the cross-entropy method (see Appendix E). Thus, while both cross-entropy and multi-level splitting are powerful techniques, their use in rare-event problems with undefined structures should be properly cautioned.

## 3 The Deep Probabilistic Accelerated Evaluation Framework

We propose the D-PrAE framework to overcome the challenges faced by existing black-box variance reduction algorithms. This framework comprises two stages: First is to learn the rare-event set from a first-stage sample batch, by viewing set learning as a classification task. Second is to apply an efficiency-certified IS on the rare-event probability over the learned set. Algorithm 1 shows our main procedure. The key to achieving an ultimate efficiency certificate lies in how we learn the rare-event set in Stage 1, which requires two properties:

Small one-sided generalization error: “One-sided” generalization error here means the learned set is either an outer or an inner approximation of the unknown true rare-event set, with probability 1. Converting this into a classification, this means the false negative (or positive) rate is exactly 0. “Small” here then refers to the other type of error being controlled.

Decomposability: The learned set is decomposable according to dominant points in the form of Theorem 1, so that an efficient mixture IS can apply.

The first property ensures that, even though the learned set can contain errors, the learned rare-event probability is either an upper or lower bound of the truth. This requirement is important as it is extremely difficult to translate the impact of generalization errors into rare-event estimation errors. By Theorem 2, we know that any non-zero error implies the risk of missing out important regions of the rare-event set, undetectably. The one-sided generalization error allows a shift of our target to valid upper and lower bounds that can be correctly estimated, which is the core novelty of D-PrAE. To this end, we introduce a new efficiency notion:

###### Definition 2.

We say an estimator satisfies an upper-bound *relaxed efficiency certificate* to estimate if
with , for given .

Compared with the efficiency certificate in (1), Definition 2 is relaxed to only requiring to be an upper bound of , up to an error of . An analogous lower-bound relaxed efficiency certificate can be seen in Appendix D. From a risk quantification viewpoint, the upper bound for is more crucial, and the lower bound serves to assess an estimation gap. The following provides a handy certification:

###### Proposition 1 (Achieving relaxed efficiency certificate).

Suppose is upward biased, i.e., . Moreover, suppose takes the form of an average of i.i.d. simulation runs , with . Then possesses the upper-bound relaxed efficiency certificate.

Proposition 1 stipulates that a relaxed efficiency certificate can be attained by an upward biased estimator that has a logarithmic relative error with respect to the biased mean (and hence also the original target probability). Appendix C shows an extension of Proposition 1 to two-stage procedures, where the first stage determines the upward biased mean. This upward biased mean, in turn, can be obtained by learning an outer approximation for the rare-event set, giving:

###### Corollary 1 (Set-learning + IS).

Consider estimating . Suppose we can learn a set with any number of i.i.d. samples (drawn from some distribution) such that with probability 1. Also suppose that there is an efficiency certificate for an IS estimator for . Then a two-stage estimator where a constant number of samples are first used to construct , and samples are used for the IS in the second stage, achieves the upper-bound relaxed efficiency certificate.

To execute the procedure in Corollary 1, we need to learn an outer approximation of the rare-event set. To this end, consider set learning as a classification problem: We draw from some sampling distribution that has sufficient presence throughout the space (e.g., uniform over a large box centered at the mean of ). For each , we evaluate , i.e., if is in the rare-event set , and 0 otherwise. We consider the pairs where is regarded as the feature and as the binary label, and construct a classifier, say , from some hypothesis class that (nominally) signifies . The learned rare-event set is taken to be for some threshold .

The outer approximation requirement means that all true positive (i.e., 1) labels must be correctly classified, or in other words, the false negative (i.e., 0) rate is zero, i.e.,

(2) |

Typically, achieving such a zero “Type I” misclassification rate is impossible for any finite sample except in degenerate cases. However, this is achievable under a mild geometric premise on the rare-event set that we call *orthogonal monotonicity*. To facilitate discussion, suppose from now on that the rare-event set is known to lie entirely in the positive quadrant , so in learning the set, we only consider sampling points in (analogous development can be extended to the entire space).

###### Definition 3.

We call a set *orthogonally monotone* if for any two points , we have (where the inequality is defined coordinate-wise) and implies too.

Definition 3 means that any point that is more “extreme” than a point in the rare-event set must also lie inside the same set. This is an intuitive assumption that commonly holds in rare-event settings. Note that, even with such a monotonicity property, the boundary of the rare-event set can be very complex. The key is that, with orthogonal monotonicity, we can now produce a classification procedure that satisfies (2). In fact, the simplest approach is to use what we call an orthogonally monotone hull:

###### Definition 4.

For a set of points , we define the *orthogonally monotone hull* of (with respect to the origin) as , where is the rectangle that contains both and the origin as two of its corners.

In other words, the orthogonally monotone hull consists of the union of all the rectangles each wrapping each point and the origin . Now, denote as the non-rare-event sampled points. Evidently, if is orthogonally monotone, then (where complement is with respect to ), or equivalently, , i.e., is an outer approximation of the rare-event set . Figure 1(b) shows this outer approximation (and also the inner counterpart). Moreover, is the smallest region (in terms of set volume) such that (2) holds, because any smaller region could exclude a point that has label 1 with positive probability.

Lazy-Learner IS. We now consider an estimator for where in Stage 1, we sample a constant i.i.d. random points from some density, say . Then, we use the mixture IS depicted in Theorem 1 to estimate in Stage 2. Since takes the form , it has a finite number of dominant points, which can be found by a sequential algorithm (similar to the one that we will discuss momentarily). We call this the “lazy-learner” approach. Its problem, however, is that tends to have a very rough boundary. This generates a large number of dominant points, many of which are unnecessary in that they do not correspond to any “true” dominant points in the original rare-event set (see the middle of Figure 1(d)). This in turn leads to a large number of mixture components that degrades the IS efficiency, as the RE bound in Theorem 1 scales linearly with the the number of mixture components.

Deep-Learning-Based IS. Our main approach is a deep-learning alternative that resolves the statistical degradation of the lazy learner. We train a neural network classifier, say , using all the Stage 1 samples , and obtain an approximate non-rare-event region , where is say . Then we adjust minimally away from , say to , so that , i.e., . Then , so that is an outer approximation for (see Figure 1(c), where ). Stage 1 in Algorithm 1 shows this procedure. With this, we can run mixture IS to estimate in Stage 2.

The execution of this algorithm requires the set to be in a form susceptible to Theorem 1 and the search of all its dominant points. When

is a ReLU-activated neural net, the boundary of

is piecewise linear and is a union of polytopes, and Theorem 1 applies. Finding all dominant points is done by a sequential “cutting-plane” method that iteratively locates the next dominant point by minimizing over the remaining portion of not covered by the local region of any previously found points . These optimization sequences can be solved via mixed integer program (MIP) formulations for ReLU networks ([tjeng2017evaluating, huang2018designing]; see Appendix B). Note that a user can control the size of these MIPs via the neural net architecture. Regardless of the expressiveness of these networks, Algorithm 1 enjoys the following:###### Theorem 3 (Relaxed efficiency certificate for deep-learning-based mixture IS).

Figure 1(d) shows how our deep-learning-based IS achieves superior efficiency compared to other alternatives. The cross-entropy method can miss dominant points (1) and result in systematic under-estimation. The lazy-learner IS, on the other hand, generates too many dominant points (64) and degrades efficiency. Algorithm 1 finds the right number (2) and approximate locations of the dominant points.

Moreover, whereas the upper-bound certificate is guaranteed in our design, in practice, the deep-learning-based IS also works extremely well in controlling the conservativeness of the bound, as dictated by the false positive rate (see our experiments next). We close this section with a finite-sample bound on the false positive rate. Here we use empirical risk minimization (ERM) to train , i.e., where

is a loss function and

is the considered hypothesis class. Correspondingly, let be the true risk function and its minimizer. Also let be the true threshold associated with in obtaining the smallest outer rare-event set approximation.###### Theorem 4 (Conservativeness).

Consider obtained in Algorithm 1 where is trained from an ERM. Suppose the density has bounded support and for any . Also suppose there exists a function such that for any , implies . (e.g., if is the squared loss, then could be chosen as ). Then, with probability at least ,

Here, is the Lipschitz parameter of , and .

Theorem 4 reveals a tradeoff between overfitting (measured by and ) and underfitting (measured by ). Appendix C discusses related results on the sharp estimates of these quantities for deep neural networks, a more sophisticated version of Theorem 4 that applies to the cross-entropy loss, a corresponding bound for the lazy learner, as well as results to interpret Theorem 4 under the original distribution .

## 4 Numerical Experiments

We implement and compare the estimated probabilities and the RE’s of deep-learning-based IS for the upper bound (D-PrAE UB) and lazy-learner IS (LL UB). We also show the corresponding lower-bound estimator (D-PrAE LB and LL LB), the cross entropy method (CE), and naive Monte Carlo (NMC), in a 2-dimensional example and the safety-testing of a self-driving algorithm in a car-following scenario. These two experiments are representative as the former is low-dimensional (visualizable) yet with extremely rare events while the latter is high-dimensional, challenging for most of the existing methods.

#### 2D Example.

We estimate where , and ranging from 1.8 to 2.6. We use (10,000 for Stage 1 and 20,000 for Stage 2). Figure 1 illustrates the shape of , which has two dominant points. This probability is microscopically small (e.g., gives ) and serves to investigate our performance in ultra-extreme situations.

Figure 2 compares all approaches to the benchmark, which we compute via a proper mixture IS with 50,000 samples assuming full knowledge of . It shows several observations. First, D-PrAE and LL (both UB and LB) always provide valid bounds that contain the truth. Second, the UB for LL is consistently 2 order of magnitude more conservative than D-PrAE, attributed to the overly many dominant points (e.g., 34 vs 2 when ). Correspondingly, the RE of LL UB blows up to , compared to 3% for D-PrAE UB. Third, CE, which ends up using only one dominant point, under-estimates the truth by , yet it gives an over-confident RE, e.g., when , showing a systematic undetected under-estimation. Lastly, NMC fails to give a single hit in all cases (thus not shown on the graphs). Among all approaches, D-PrAE stands out as giving reliable and tight bounds with low RE.

#### Self-Driving Example.

We consider simulating the crash probability in a car-following scenario involving a human-driven lead vehicle (LV) followed by an autonomous vehicle (AV). The AV is controlled by the Intelligent Driver Model (IDM) to maintain safety distance while ensuring smooth ride and maximum efficiency, which is widely used for autonomy evaluation and microscopic transportation simulations [Treiber_2000, dlforcf, rssidm]. The state at time is given by 6 states consisting of the position, velocity, and acceleration of both LV and AV. The dynamic system has a stochastic input related to the acceleration of the LV and subject to uncertain human behavior. We consider an evaluation horizon

seconds and draw a sequence of 15 Gaussian random actions at a 4-second epoch, leading to a 15-dimensional LV action space. A (rare-event) crash occurs at time

if the longitudinal distance between the two vehicles is negative, with parameterizing the AV maximum throttle and brake pedals. This rare-event set is analytically challenging (see [zhao2017accelerated] for a similar setting). More details are in Appendix F.Figure 3 shows the performances of competing approaches, using . D-PrAE and LL (UB and LB) appear consistent in giving upper and lower bounds for the target probability. Once again, D-PrAE produces tighter bounds than LL ( vs in general). LL UB has dominant points when vs 42 in D-PrAE, and needs 4 times more computational time to search for them than D-PrAE. Moreover, the RE of D-PrAE is around 3 times lower than LL across the range (in both UB and LB). Thus, D-PrAE outperforms LL in both tightness, computation speed, and RE. CE seems to give a stable estimation. However, we have evidence to believe it is under-estimated. We run a modification of D-PrAE that replaces by in the last step of Algorithm 1 (D-PrAE IS) and find a consistent positive gap over CE. D-PrAE IS lacks an efficiency certificate and thus could be under-estimated, and the fact that its estimate is higher than CE suggests CE is under-estimated.

#### Summary of Practical Benefits.

Our investigation shows strong evidence of the practical benefits of using D-PrAE for rare-event estimation. It generates tight bounds for the target, with low RE, and requires reasonable computation effort in training deep-learning classifiers and IS. For example, if one were to assess whether the AV crash rate is below for , only 1000 simulation runs would be needed to get around RE, taking about seconds in total. This is in contrast to 3.7 months for naive Monte Carlo, and 1.1 hours for other methods that have no efficiency certificate. Results suggest that D-PrAE could be a game-changer once embedded in simulation software for safety evaluation of black-box autonomy.

## Acknowledgements

We gratefully acknowledge support from the National Science Foundation under grants CAREER CMMI-1834710, IIS-1849280, and IIS-1849304.

## References

## Appendix A Further Details for Section 2

This section expands the discussions in Section 2, by explaining in more detail the notion of relative error, challenges in naive Monte Carlo, the concept of dominant points, and the perils of black-box variance reduction algorithms.

### a.1 Explanation of the Role of Relative Error

As described in Section 2, to estimate a tiny using , we want to ensure a high accuracy in relative term, namely, (1). Suppose that is unbiased and is an average of i.i.d. simulation runs, i.e., for some random unbiased output . The Markov inequality gives that

so that

ensures (1). Equivalently,

is a sufficient condition to achieve (1), where is the relative error defined as the ratio of variance (per-run) and squared mean.

### a.2 Further Explanation on the Challenges in Naive Monte Carlo

We have seen in Section 2 that for the naive Monte Carlo estimator, where , the relative error is . Thus, when is tiny, the sufficient condition for to attain (1) scales at least linearly in . In fact, this result can be seen to be tight by analyzing as a binomial variable. To be more specific, we know that and that takes values in . Therefore, if , then , and hence (1) does not hold.

Moreover, the following provides a concrete general statement that an that grows only polynomially in would fail to estimate that decays exponentially in with enough relative accuracy, of which (1) fails to hold is an implication.

###### Proposition 2.

Suppose that is exponentially decaying in and is polynomially growing in . Define . Then for any ,

We have used the term efficiency certificate to denote an estimator that achieves (1) with . In the rare-event literature, such an estimator is known as “logarithmically efficient" or “weakly efficient" [juneja2006rare, blanchet2012state].

### a.3 Further Explanations of Dominant Points

We have mentioned that a certifiable IS should account for all dominant points, defined in Definition 1. We provide more detailed explanations here. Roughly speaking, for and a rare-event set , the Laplace approximation gives (see the proof of Theorem 5).

Thus, to obtain an efficiency certificate, IS estimator given by , where and , needs to have (where and denote the variance and expectation under , and is up to some factor polynomial in ; note that the last equality relation cannot be improved, as otherwise it would imply that ).

Now consider an IS that translates the mean of the distribution from to , an intuitive choice since contributes the highest density among all points in (this mean translation also bears the natural interpretation as an exponential change of measure; [bucklew2013introduction]). The likelihood ratio is , giving

(3) |

If the “overshoot” , i.e., the remaining term in the exponent of after moving out , satisfies for all , then the expectation in the right hand side of (3) is bounded by 1, and an efficiency certificate is achieved. This, however, is not true for all set , which motivates the following definition of the dominant set and points in Definition 1.

For instance, if is convex, then, noting that is precisely the gradient of the function , we get that gives a singleton dominant set since for all is precisely the first order optimality condition of the involved quadratic optimization. In general, if we can decompose where for a dominant point , then each can be viewed as a “local” region where the dominant point is the highest-density, or the most likely point such that the rare event occurs.

###### Theorem 5 (Certifiable IS).

Suppose that is the dominant set for associated with the distribution . Then we can decompose where ’s are disjoint, and for . Denote . Assume that each component of is of polynomial growth in

. Moreover, assume that there exist invertible matrix

and positive constant such that . Then the IS distribution achieves an efficiency certificate in estimating , i.e., if we let where is the corresponding likelihood ratio, then is at most polynomially growing in . This applies in particular to where is a piecewise linear function.We contrast Theorem 5 with existing works on dominant points. The latter machinery has been studied in [sadowsky1990large, dieker2006fast]. These papers, however, consider regimes where the Gärtner-Ellis Theorem [Gartner1977, Ellis1984] can be applied, which requires the considered rare-event set to scale proportionately with the rarity parameter. This is in contrast to the general conditions on the dominant points used in Theorem 5.

### a.4 Further Explanation of the Example in Theorem 2

In the theorem, there are two dominant points and but the IS design only considers the first one. As a result, there could exist “unlucky” scenario where the sample falls into the rare-event set, so that , while the likelihood ratio explodes, which leads to a tremendous estimation variance. Part 2 of the theorem further shows how this issue is undetected empirically, as the empirical RE appears small (polynomially in and hence by our choice of ) while the estimation concentrates at a value that can be severely under the correct one (especially when ). This is because the samples all land on the neighborhood of the solely considered dominant point. If the missed dominant point is a significant contributor to the rare-event probability, then the empirical performance would look as if the rare-event set is *smaller*, leading to a systematic under-estimation. Note that this phenomenon occurs even if the estimator is unbiased, which is guaranteed by IS by default.

## Appendix B Further Details on Implementing Algorithm 1

We provide further details on implementing Algorithm 1. In particular, we present how to solve the optimization problem

(4) |

to obtain the next dominant point in the sequential cutting-plane approach in Stage 2. Moreover, we also present how to tune

(5) |

in Stage 1.

#### MIP formulations for ReLU-activated neural net classifier.

The problem (4) can be reformulated into a mixed integer program (MIP), in the case where is trained via a ReLU-activated neural net classifier, which is used in our deep-learning-based IS. Since the objective is convex quadratic and second set of constraints is linear in (4), we focus on the first constraint . The neural net structure in our approach (say with layers) can be represented as , where each

denotes a ReLU-activated layer with linear transformation, i.e.

, where denotes a certain linear transformation in the input. In order to convert into an MIP constraint, we introduce as a practical upper bound for such that . The key step is to reformulate the ReLU function intoFor simple ReLU networks, the size of the resulting MIP formulation depends linearly on the number of neurons in the neural network. In particular, the number of binary decision variables is linearly dependent on the number of ReLU neurons, and the number of constraints is linearly dependent the total number of all neurons (here we consider the linear transformations as independent neurons).

The MIP reformulation we discussed can be generalized to many other popular piecewise linear structures in deep learning. For instance, linear operation layers, such as normalization and convolutional layers, can be directly used as constraints; some non-linear layers, such as ReLU and max-pooling layers, introduce non-linearity by the “max” functions. A general reformulation for the max functions can be used to convert these non-linear layers to mixed integer constraints.

Consider the following equality defined by a max operation . Then the equality is equivalent to

#### Tuning .

We illustrate how to tune to achieve (5). This requires checking, for a given , whether . Then, by discretizing the range of or using a bisection algorithm, we can leverage this check to obtain (5). We use an MIP to check . Recall that . We want to check if for a given lies completely inside the hull, where is trained with a ReLU-activated neural net. This can be done by solving an optimization problem as follows. First, we rewrite as , where and refer to the -th components of and respectively. Then we solve

(6) |

If the optimal value is greater than 0, this means is not completely inside , and vice versa. Now, we rewrite (6) as

(7) |

We then rewrite (7) as an MIP by introducing a large real number as a practical upper bound for all coordinates of :

(8) |

Note that the set of points to be considered in constructing can be reduced to its “extreme points”. More concretely, we call a point an extreme point if there does not exist any other point such that . We can eliminate all points such that for another , and the resulting orthogonal monotone hull would remain the same. If we carry out this elimination, then in (7) we need only consider that are extreme points in , which can reduce the number of integer variables needed to add. In practice, we can also randomly remove points in to further reduce the number of integer variables. This would not affect the correctness of our approach, but would increase the conservativeness of the final estimate.

## Appendix C Further Results for Section 3

Here we present and discuss several additional results for Section 3 regarding estimation efficiency and conservativeness. The latter includes further theorems on the lazy-learner classifier and classifiers constructed using the difference of two functions, translation of the false positive rate under the Stage 1 sampling distribution to under the original distribution, and interpretations and refinements of the conservativeness results.

### c.1 Extending Upper-Bound Relaxed Efficiency Certificate to Two-Stage Procedures

We present an extension of Proposition 1 to two-stage procedures, which is needed to set up Corollary 1.

###### Proposition 3 (Extended relaxed efficiency certificate).

Suppose constructing consists of two stages, with : First we sample , where are i.i.d. (following some sampling distribution), and given , we construct where are i.i.d. conditional on (following some distribution). Suppose is conditionally upward biased almost surely, i.e., , and the conditional relative error given in the second stage satisfies

Comments

There are no comments yet.