 # Efficient computation of extreme excursion probabilities for dynamical systems

We develop a novel computational method for evaluating the extreme excursion probabilities arising for random initialization of nonlinear dynamical systems. The method uses a Markov chain Monte Carlo or a Laplace approximation approach to construct a biasing distribution that in turn is used in an importance sampling procedure to estimate the extreme excursion probabilities. The prior and likelihood of the biasing distribution are obtained by using Rice's formula from excursion probability theory. We use Gaussian mixture biasing distributions and approximate the non-Gaussian initial excitation by the method of moments to circumvent the linearity and Gaussianity assumptions needed by excursion probability theory. We demonstrate the effectiveness of this computational framework for nonlinear dynamical systems of up to 100 dimensions.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Motivation

Computing the probability of extreme events is of central importance for dynamical systems that arise in natural phenomena such as climate, weather, oceanography [Easterling_2000, Easterling_2000A], and engineering systems such as structures [Cornell_1968, Vrouwenvelder_2000] or power grids [Lesieutre_2008]. Examples of consequential extreme events are rogue waves in the ocean [Dysthe_2008], hurricanes, tornadoes [Ross_2003], and power outages [Atputharajah_2009]. In this work we are motivated by the increased concern of transient security in the presence of uncertain inertia, as identified by the North American Electric Reliability Corporation in its most recent long-term reliability assessment [nerc2017ltra]. The mathematical formulation is of a dynamical system with parametric uncertainty, which is equivalent to initial condition uncertainty by adding the equations

to the ordinary differential equation, where

are the parameters. The aim of the calculation in that case is to compute an extreme excursion probability: the odds that the transient due to a sudden malfunction exceeds prescribed safety limits. Since the reliability goal is that an average customer should experience only minutes of electricity interruption per year

[fischer2016german], the target safe limit exceedance probabilities may be in the range of .

Quantifying extreme excursion probabilities is of great importance because of their socioeconomic impact. Outcomes of interest reside in the tails of the probability distribution of the associated event space because of their low likelihood. To resolve the tails of these events, one has to evaluate multivariable integrals over complex domains. Because of the tiny mass and complex shape of the relevant likelihood level sets, standard quadrature, cubature, or sparse grid methods cannot be applied directly to evaluate these integrals. The most commonly used method is Monte Carlo simulation (MCS), which requires repeated samples of the underlying outcome. For such small probabilities, however, MCS exhibits a large variance relative to the probability to be computed, and thus it needs a large number of samples to produce results of acceptable accuracy. For example, estimating the odds of an extreme event, whose probability ends up being

for an underlying process that requires 10 minutes per numerical simulation, requires two years of serial computation for producing an estimate with a standard deviation of less than 10% of the target value via MCS. Hence, alternative methods must be developed that are computationally efficient.

In the rest of this section, we review the literature, provide an overview of our approach, and discuss its limitations and possible extensions. In §2 we describe the rare-event problem in detail and revisit MCS and importance sampling (IS) methods. In §3 we formulate the problem of estimating rare-event probability as a sequence of Bayesian inverse problems, in §4 we discuss two well-known approaches to solve the Bayesian inverse problems, and in §5 we use the solutions of these Bayesian inverse problems to construct an importance biasing distribution (IBD). In §6 we demonstrate the algorithm on two nonlinear dynamical systems of different sizes. In §7 we give concluding remarks .

### 1.1 Literature review

Most methods to compute the probabilities of rare events are a combination of MCS and IS methods. The key difference between such approaches lies in the proposal distribution for importance sampling. In what follows, we briefly discuss existing methods and the key ideas underpinning them.

#### 1.1.1 Monte Carlo simulation

The MCS approach is one of the most robust methods for simulating rare events and estimating their probabilities. It was originally developed for solving problems in mathematical physics [Metropolis_1949]. Since then, the method has been used in a wide variety of applications, and it currently lies at the heart of all random sampling-based techniques [Liu_2008B, Robert_2005B]. The main strength of MCS is that its rate of convergence does not depend on the likelihood level set or its dimension. When evaluating excursion probabilities, the method primarily counts how many of the random samples exceed the given excursion level. Thus, in order to estimate a probability , MCS needs a number of samples exceeding , which for small probabilities makes its direct application impractical.

#### 1.1.2 Importance sampling

IS methods belong to the class of variance reduction techniques that aim to estimate the quantity of interest by constructing estimators that have smaller variance than does MCS. This technique was proposed in the 1950s [Kahn_1953]. The major cause for the inefficiency in computing rare-event probabilities using MCS is that most of the random samples generated do not belong to the extreme excursion region (or the region of interest). The basic idea of IS is to use the information available about the rare event to generate samples that more frequently visit the region of interest. This is achieved by constructing an IBD, which can be used to generate samples. If successful, unlike in the case of MCS, an appreciable fraction of these samples contribute to the probability estimate. When designing an IBD, the aim is for its probability mass to be concentrated in the region of interest. Based on this consideration, several techniques for constructing IBDs have been developed, such as variance scaling and mean shifting [Bucklew_2013B]. A more detailed treatment of importance sampling and the relevant literature can be found in standard stochastic simulation textbooks [Dunn_2011B, Asmussen_2007B]. One of the major challenges involved with importance sampling is the construction of an IBD that results in a low-variance estimator. We note that the approach may sometimes be inefficient in high-dimensional problems [Katafygiotis_2008].

#### 1.1.3 Nested subset methods

The underlying idea of this class of methods is to consider a sequence of nested subsets of the probability space of interest (for example, starting with the entire space and shrinking to the target rare event) and use the notion of conditional probability to factorize the target event as a product of conditional events. Two main methods that fall into this class are subset simulation (SS) [Au_2001] and splitting methods [Kahn_1951]. In SS, the nested subsets are generated by choosing appropriate intermediate thresholds. Splitting methods are based on the idea of restarting the associated Markov process from certain system states in order to generate more occurrences of the rare event of interest. Several modifications have been proposed to both SS [Ching_2005, Ching_2005_2, Katafygiotis_2005, Zuev_2012, Bect_2017] and splitting methods [Botev_2012, Beck_2016]. Evaluating the conditional probabilities forms a major portion of the computational load. Computing the conditional probabilities for different nested subsets concurrently is nontrivial. Additionally, it is not clear how many model evaluations are required at the intermediate level sets in order to achieve a good probability estimate.

#### 1.1.4 Methods based on large deviation theory

Recent work by Dematteis et al. used large deviations theory (LDT) to estimate the probabilities of rogue waves of a certain height [Dematteis_2018]. The same authors used LDT to estimate probabilities of extreme events in dynamical systems with random components [Dematteis_2019]. LDT is an efficient approach for estimating rare events when the event space of interest is dominated by a few elements. The aforementioned papers solve an optimization problem to estimate the rare-event probability. In contrast, our approach uses a Bayesian inverse problem framework to determine an IBD, which will then be used to estimate the rare-event probability. In §5 we contrast the approach based on LDT with our approach.

#### 1.1.5 Multifidelity and surrogate-based methods

Multifidelity methods are used for estimating rare-event probabilities is situations when multiple evaluations of the forward model is prohibitively expensive. This approach leverages a hierarchy of low-cost reduced-order models, such as projection-based reduced-order models, data fit interpolation models, and support vector machines, to reduce the cost of constructing the IBD

[Peherstorfer_2017]. The main idea behind the surrogate-based method is to start with a deterministic sample of the system and then construct a surrogate that approximates the system based on these samples [Bucher_1990, Faravelli_1989, Wong_1985]. We remark that multifidelity and surrogate methods can be readily augmented with the framework developed in this paper to obtain additional computational savings.

### 1.2 Overview of our methodology

Our methodology uses ideas from excursion probability theory to characterize the tails of the probability distribution associated with the event [adler2010geometry]. Specifically, we use Rice’s formula [Rice_1944], which was developed to compute the expected number of upcrossings for stochastic processes:

 E{N+u(0,T)}=∫T0∫∞0yφt(u,y)dydt. (1)

The left-hand side denotes the number of upcrossings of level , is the derivative of the stochastic process (in a mean square sense), and represents the joint probability distribution of the process and its derivative . Clearly the expression in the integral is analytically tractable only for special types of stochastic processes. Specifically, for Gaussian processes, this term can be resolved analytically. Moreover—the critical feature we will use here— for smooth Gaussian processes , Rice’s formula is a faster-than-exponentially-accurate approximation of the excursion probability. That is [adler2009random, Equation (14.0.2)]:

 ∣∣ ∣∣P{supt∈[0,T]g(t)≥u}−E{N+u(0,T)}∣∣ ∣∣≤O(e−βu2), (2)

where is a parameter depending on the process and interval , but not on the target level , and the asymptotics in the notation refers to . If we use the number of upcrossings in (1) as our estimate of the excursion probability, we can interpret large values of as defining the times and values of the process velocity for which the crossing is most likely to occur. This, in turn, is the key in efficiently determining the points in the input space that represent the highest contribution to the excursion probability.

The setup in this article involves a nonlinear dynamical system that is excited by a Gaussian or a non-Gaussian initial state that results in a non-Gaussian stochastic process. To address this problem, we linearize the nonlinear dynamical system variation around the trajectories starting at the mean of the initial state. We thus obtain a Gaussian approximation to the system trajectory distribution. Furthermore, we use Rice’s formula and solve a sequence of Bayesian inverse problems (1) to determine the uncertainty sets in the input space that are most likely to cause the extreme events; these sets, in turn, are used to construct the biasing distribution. The main advantages of our approach are the following:

• Constructing the biasing distribution is the most expensive component of the computational method. Since our method does not use nested subsets to evaluate the target probability, it is amenable to parallelization (see the discussion in §3).

• As we will demonstrate in §6, a moderate number of evaluations of the model () are required in order to achieve acceptable levels of relative accuracy (). The method can capture probabilities on the order of accurately.

• Although we use nonlinear dynamical systems with random initial states as a basis to demonstrate our method, the algorithm can be seamlessly extended to stochastic dynamical systems with random parameters.

• In applications that require repeated evaluations of the rare-event probability and where the distribution of the random parameter does not change significantly between these evaluations, the IBD can be reused to obtain accurate estimates of the rare-event probability. We demonstrate this for a small problem in §6.

### 1.3 Limitations and possible extensions

One of the major limitations of our approach is that as the dimensionality of the random variable grows, the construction of the biasing distribution becomes expensive. Currently, this method is practical for problems where the size of the random variable is

. However, we aim to solve problems up to and beyond

. The current approach requires that the random variable be normally distributed. For many practical problems, however, this might not be the case. In such scenarios we use the method of moments to approximate the non normal random variable by a Gaussian distribution. Another possible approach to handling non-Gaussian random parameters is to use a Gaussian mixture model (GMM) to approximate it. In such a scenario, challenges may arise regarding controlling the variance of the GMM components such that the errors due to linearization do not grow too much. An obvious extension to the current work is to develop a strong theoretical foundation that justifies the algorithm in this paper. Another potential research direction is related to constructing the likelihood function that is necessary for the MCMC step of the algorithm. Currently, we use ad hoc methods to choose the likelihood for the MCMC step (more details are in §

3), and this approach.can be significantly improved by using a design of experiments approach (similar to [Mohamad_2018]).

## 2 The rare-event problem

Consider an input-output system with inputs and outputs, which is modeled by a continuous function

. We represent the uncertainties in the input by a probability distribution with probability density function (PDF)

. Let be a random -dimensional vector. The inputs to the system are the components of a realization of . The outputs are the realization of the random variable . We are interested in the failure probability of the system described by the model . Let be the limit-state function, which we assume to be continuous. We say that the system fails for an input if . This leads us to the failure domain

 F={z∈Rd:g(s(z))≥U}

and the indicator function of the failure domain

 IF={1,z∈F0z∉F. (3)

We define the failure probability of the system as

 PF=Ep[IF(Z)]=∫RdIF(z)p(z)dz. (4)

The variance of with respect to the PDF is

 varp[IF(Z)]=∫Rd(IF−Ep[IF(Z)])2p(z)dz=PF−P2F. (5)

The MCS method is often used to estimate expectation such as (4). It draws independent and identically distributed (i.i.d.) samples from the distribution of , that is, realizations of the random variable , and computes the Monte Carlo estimate

 PMCF(z1,…,zM)=1MM∑i=1IF(zi). (6)

Note that we distinguish between the estimate , which is a scalar value, and the Monte Carlo estimator , which is a random variable. The relative root mean square error (RMSE) of is

 e(PMCF)≈1Ep[IF(Z)]√varp[IF(Z)]M= ⎷PF−P2FP2FM≈√1PFM, (7)

when . Hence, for a threshold parameter , a relative error is achieved with

 M=⌈1PFϵ2⌉ (8)

samples, where denotes the ceil function. If , then is large. Hence, estimating small probability events with MC is difficult.

### 2.1 Importance sampling

Variance reduction methods aim to reduce the RMSE (7) by changing (4) to a new integral with the same value but with an integrand and/or a distribution that, combined, result in a lower variance than the original function has with respect to the distribution . Importance sampling is one such variance reduction method that has been used successfully to estimate failure probabilities[Srinivasan_2013B, Robert_2005B]. Importance sampling introduces a random vector with PDF , which is used as the biasing distribution. A realization of is denoted by . In the following, the distribution of is the nominal distribution, and the corresponding PDF is the nominal PDF. The distribution of is the biasing distribution, and is the biasing PDF. The biasing PDF is constructed such that the , where

 supp(p)={z∈Rd:p(z)>0},

denotes the support of the PDF . Let be the weight function . The weight is the importance weight of a realization . Because holds, the failure probability (4) equals the expectation of the random variable weighted with the random variable . That is, we have

 PF=Eq[IF(Z′)w(Z′)]. (9)

The expectation (9) is approximated with the Monte Carlo method, with samples drawn from the biasing distribution. Thus the importance sampling estimate of with samples is

 PISF(z′1,…,z′M)=1MM∑i=1IF(z′i)w(z′i). (10)

Therefore, the Monte Carlo method with importance sampling consists of two steps. In step one, the biasing distribution is generated. In step two, the importance sampling estimator

is an unbiased estimator of

because [Robert_2005B].

If , then the relative RMSE of the importance sampling estimator is

 e(PISF)=1PF√varq[IF(Z′)w(Z′)]M. (11)

If the variance is smaller than then the relative RMSE of the importance sampling estimator is smaller than the relative RMSE of the Monte Carlo estimator for the same number of samples .

## 3 Construction of IBD via Bayesian inference

Consider the following dynamical system,

 x′ =f(t,x),t=[0,T] (12) x(0) =x0,x0∼p,x∈Ω,

where the initial state of the system is uncertain and has a probability distribution , with being the corresponding probability measure. The problem of interest to us is to estimate the probability that exceeds the level for . That is, we seek to estimate the excursion probability

 PT(u):=P(sup0≤t≤Tc⊤x(t,x0)≥u,  t∈[0,T]), (13)

where represents the solution of the dynamical system (12) for a given initial condition . Let represent the set of all initial conditions for which the solution of the dynamical system exceeds the excursion level . That is,

 Ω(u):={x0:sup0≤t≤Tc⊤x(t,x0)≥u}. (14)

Notice that depends on implicitly through the solution of the dynamical system. Since we can write

 PT(u)=μ(Ω(u)), (15)

estimating is related to determining . In general, one cannot determine analytically. We use Rice’s formula, (1), to gain insight about , which in turn will be used to construct an approximation to . Let us revisit Rice’s formula:

 E{N+u(0,T)}=∫T0∫∞0yφt(u,y)dydt. (16)

Recall that represents the joint probability density of and its derivative for an excursion level . The right-hand side of equation (16) integrates the joint density over all values of derivatives and times at which there is an excursion. The key insight for our method is that values of the time and slope at which is large contribute the most to this integral. We use this idea to construct an approximation to .

Using (16), we can interpret as an unnormalized PDF and thus sample from it to compute using Monte Carlo approximation. By sampling from the unnormalized distribution , we obtain a slope-time pair, at which the sample paths of the stochastic process exceed the excursion level . Consider the forward map , which evaluates the vector based on the dynamics (12), given an initial state and a time . We call a preimage of a sample if

 G(xi,ti)=[uyi]. (17)

Note that the problem of finding a preimage of a sample is ill-posed. Multiple ’s map to at time via the operator . Therefore, we define the set

 Xi:={xi∈Ω:G(xi,ti)=[uyi]}, (18)

and we construct our approximation as

 ˆΩ(u):=N⋃i=1Xi. (19)

Our intuition is that approximates better as we increase . The underlying computational framework to approximate consists of the following stages:

• Draw samples from unnormalized

• Find the preimages of these samples to approximate .

We use MCMC to draw samples from unnormalized . We note that irrespective of the size of the dynamical system, represents an unnormalized density in two dimensions; hence, using MCMC is an effective means to draw samples from it. Drawing samples from requires evaluating it repeatedly, and in the following section we discuss the means to do so.

### 3.1 Evaluating yφt(u,y)

In this section, we describe the process of evaluating given , , and . We note that , can be evaluated analytically only for special cases. Specifically, when is a Gaussian process, the joint density function is analytically computable. Consider the dynamical system described by (12). When is Gaussian and is linear, we have

 x′=Ax(t)+b,x(t0)=x0,x0∼N(¯¯¯x0,Σ). (20)

Assuming is invertible, we can write as

 x(t)=exp(A(t−t0))x0−(I−exp(A(t−t0)))A−1b, (21)

where

represents an identity matrix of the appropriate size. Given that

is normally distributed, it follows that is a Gaussian process:

 x(t)∼GP(¯¯¯x,covx), where (22) ¯¯¯x=exp(A(t−t0))¯¯¯x0−(I−exp(A(t−t0)))A−1b and covx=exp(A(t−t0))Σ(exp(A(t−t0)))⊤.

The joint PDF of a stochastic process and its derivative, , the joint PDF of and , is given by [3569, equation 9.1]

 (23)

where

 ¯¯¯xφ:=[c⊤¯¯¯xc⊤(A¯¯¯x+b)]

and

 Φ:=exp(A(t−t0))Σ(exp(A(t−t0)))⊤.

We can now evaluate for arbitrary values of , , and as

 yiφti(ui,yi)=yi2π∣Υ∣exp(−12∥∥∥[uiyi]−¯¯¯xφ∥∥∥2Υ−1), (24)

where and denotes the determinant of . Note that the right-hand side in (24) is dependent on via .

#### 3.1.1 Notes for nonlinear f

When is nonlinear, one cannot compute analytically—a key ingredient for our computational procedure. We approximate the nonlinear dynamics by linearizing around the mean of the initial distribution. Assuming that the initial state of the system is normally distributed as described by equation (20), linearizing around the mean of the initial state gives

 x′≈F⋅(x−¯¯¯x0)+f(¯¯¯x0,0), (25)

where represents the Jacobian of at , . This reduces the nonlinear dynamical system to a form that is similar to equation (20). Thus, we can now use equations (22), (23), and (24) to approximate for nonlinear .

We now describe a systematic computational framework to determine for a given sample . This allows us to determine the elements of set .

### 3.2 Determining the preimages for a given sample

A sample from the unnormalized joint distribution

gives a slope, , and time, , at which the stochastic process exceeds the level . Hence . Constructing requires finding all the preimages . This amounts to finding all the solutions of the following equation,

 G(x,ti)=yi, (26)

where . Another formulation of the problem (26) is

 xi:= arg minx 12∥yi−G(x,ti)∥22. (27)

Since is a mapping from to , problem (27) is an ill-posed and underdetermined inverse problem. To address the ill-posedness, we use the Bayesian formulation of the inverse problem by placing a prior on and identifying the term as a negative log-likelihood. Suppose, in the process of finding preimages , we encounter elements that map to a value higher than . These should not be discarded because these elements still cause an excursion and hence are elements of the set . The Bayesian treatment allows for such flexibility because of the covariance associated with the log-likelihood term. We note, however, that the nonlinear equation (26) does not allow this flexibility.

In equation (12), we stated that has a probability distribution and that we use as a prior PDF for .

 πpr(xi)∝p (28)

Treating as a random variable with covariance , we can write the following likelihood:

 πlike(yi∣xi)∝exp(−12∥yi−G(xi,ti)∥2Γ−1i). (29)

Using Bayes’ rule, we can write the posterior PDF of as

 πipost:=πpost(xi∣yi)∝pπlike(yi∣xi), (30)

which is

 πipost(xi∣yi)∝pexp(−12∥yi−G(xi,ti)∥2Γ−1i). (31)

When is Gaussian, the kernel of the likelihood distribution can be represented in closed form. Hence the Bayesian inverse problem in (31) can be solved either by finding a maximum a posteriori point (MAP) and using the Laplace approximation around the MAP to describe the uncertainty around the solution or by drawing samples from the approximate posterior distribution using MCMC. In scenarios when is non-Gaussian, however, the challenges are twofold:

• The kernel of the posterior cannot be represented in a closed analytical form.

• We cannot evaluate analytically—which is central to our method.

We tackle non-Gaussianity by using the method of moments to approximate by a Gaussian distribution. Using Gaussian mixtures might lead to a better approximation to

than using the method of moments, and we can reuse the technique here about the center of each component of the mixture. However, this approach suffers from the curse of dimensionality when

represents a PDF in large dimensions, and hence we do not pursue using Gaussian mixtures to approximate in this paper, with the expectation that our approach will create an acceptable IBD (which need not be exact). Assuming is Gaussian or can be approximated by the method of moments, we can write as .

 πipost(xi∣yi)∝exp(−12∥∥xi−¯¯¯x∥∥2Σ−1)exp(−12∥yi−G(xi,ti)∥2Γ−1i). (32)

The covariance information is necessary in order to evaluate the posterior PDF given . We discuss the choice of the covariance in the next subsection.

### 3.3 Choice of covariance

For our specific problem, defining is an important step in solving the Bayesian inverse problem (32). We use the value of as a guide to choose the covariance of the likelihood term in (32). Recall that a sample from unnormalized distribution gives us a pair. To choose the covariance of , we look at the unnormalized distribution of at time . That is, we model the joint distribution of based on the values of evaluated at . Specifically we evaluate for . These values give us a range of slopes at time for which the state is close to the excursion level. We then use the values of to construct a Laplace approximation to obtain an approximation for the joint distribution of . This gives us an approximate covariance () for the likelihood PDF. This is illustrated for the Lotka-Volterra system in Figure 1. We evaluate at for a range of values of and and fit a two-dimensional Gaussian distribution to approximate the covariance for the likelihood. Figure 1: Contours of yφt(u,y) evaluated for different values of y and u at a fixed ti for the Lotka-Volterra system.

## 4 Solution to the Bayesian inverse problem

In §3 we formulated the process of approximating the as solving a sequence of Bayesian inverse problems. We also defined the necessary ingredients to define a Bayesian inverse problem—the prior and the likelihood function. Ideally, the solution to the Bayesian inverse problem in (32) should yield the posterior distribution . Except under special circumstances, however, one cannot obtain a closed-form expression for the posterior distribution (32). Let denote the maximum a posteriori point (MAP point), that is, the point that maximizes the posterior PDF (equation (32)). A standard approach to solving the Bayesian inverse problem (31) is to first find and then approximate the forward map by its linearization around . This results in a Gaussian approximation to the posterior distribution , which is known as the Laplace approximation [Tanbui_2013, Petra_2014].

Alternatively, one can use MCMC methods to sample from the posterior PDF. In the following paragraphs, we describe both these approaches for solving the Bayesian inverse problem.

### 4.1 Laplace approximation at the MAP point

The problem of finding the point at which the posterior PDF (32) is maximized can be formulated as a deterministic inverse problem. The negative log-likelihood is treated as the data misfit term, and the negative log prior is used as a regularizer to avoid overfitting. The resulting inverse problem can be written as

 xMAPi:= arg minx 12∥yi−G(x,ti)∥2Γ−1idata misfit+τ2∥∥x−¯¯¯x∥∥2Σ−1regularization, (33)

where is the regularization parameter. The solution for the optimization problem in (33) is the MAP point for the Bayesian inverse problem in (32). To solve the minimization problem in (33), we use gradient-based optimization methods (for example, L-BFGS); the necessary gradient information can be evaluated by using adjoints. In Appendix A we describe the computational procedure to evaluate the gradient information.

Assuming the forward map to be Fréchet differentiable, we can approximately express an observation as

 (34)

where and is the Fréchet derivative of evaluated at . Hence the Laplace approximation of the posterior can be written as

 πposti(xi∣yi)∼N(xMAPi,Γposti), (35)

where .

### 4.2 Markov chain Monte Carlo

The Metropolis-Hastings (M-H) algorithm [Metropolis_1953, Hastings_1970] is an MCMC method that employs a proposal density function () at each sample point in to generate a proposed sample point. This sample point is then rejected or accepted based on the M-H criterion ( in Algorithm 1). The M-H algorithm is described in Algorithm 1 [Kaipio_2006B, Section 3.6.2]. The performance of MCMC algorithms depends heavily on how close the proposal distribution is to the target distribution. A number of different MCMC algorithms exist, the distinguishing feature being the manner in which the sample points are proposed and accepted (or rejected). See, for example, [Liu_2008B, Gilks_1995B, Gelfand_1990, Geman_1987]. In this paper, we use the delayed rejection adaptive Metropolis (DRAM) MCMC algorithm [Haario_2006].

#### 4.2.1 Dram Mcmc

DRAM combines two ideas: delayed rejection (DR) and adaptive Metropolis (AM) algorithms. Here, we describe DR, AM, and their combination.

#### 4.2.2 Delayed rejection

DR is a strategy that is employed to improve the performance of the M-H algorithm. Unlike the M-H algorithm, which employs a single proposal density, DR uses a hierarchy of proposal densities. Suppose the current position of the Markov chain is and that a candidate move is generated from the proposal distribution . This proposal is accepted with probability

 α1(ξ,λ1)=min(1,π(λ1)q1(λ1,ξ)π(λ1)q1(ξ,λ1)).

In the case of a rejection, the M-H algorithm retains the same position . On the other hand, the DR algorithm instead proposes a second move . The second-stage proposal, , depends on the current position and on the recently proposed and rejected move. The second-stage proposal is accepted with probability

 α2(ξ,λ1,λ2)=min(1,π(λ2)q1(λ2,λ1)q2(λ2,λ1,ξ)(1−α1(λ2,λ1))π(ξ)q1(ξ,λ1)q2(ξ,λ1,λ2)(1−α1(ξ,λ1))). (36)

This process of delaying rejection can be iterated over a fixed number of stages. Alternatively, one can use a biased coin to guide whether to move to a higher-stage proposal or not. We refer interested readers to [Haario_2006, Tierney_1999] for more details about the DR algorithm.

The AM MCMC algorithm constructs a proposal distribution adaptively by using the existing elements of the Markov chain. The basic idea is to use the sample path of the Markov chain to “adapt” the covariance matrix for a Gaussian proposal distribution. For example, after an initial period of nonadaptation, one can set the Gaussian proposal to be centered at the current position of the Markov chain, . That is, the covariance is set to , where is a parameter that depends only on the dimension of the state space on which the target probability distribution is defined. The quantity is typically chosen to be a small constant, and is an identity matrix of appropriate dimensions. Before the start of the adaptation period, a strictly positive definite covariance is chosen according to a priori knowledge. Let index define the length of the nonadaption period. Then

 Cn={C0,n≤n0,sdCov(Ξ0,⋯,Ξn−1)+sdϵId,n>n0. (37)

A recursive procedure allows us to update the covariance of the proposal distribution efficiently. For more details see [Haario_2001, Haario_2006].

#### 4.2.4 Combining DR and AM

The success of the DR algorithm depends on a proposal in at least one of the stages being calibrated close to the target distribution. The AM algorithm attempts to calibrate the proposal distribution as the sample path of the Markov chain grows. The DRAM algorithm [Haario_2006] combines these two strategies. The DRAM version deployed in this paper combines stages of DR with adaptation. The process can be summarized as follows:

• The proposal ( in Algorithm 1) at the first of the stages is adapted as described in equation (37). The covariance of the proposal distribution is computed by using the sample path of the Markov chain.

• The covariance of the proposal for stage , is computed as a scaled version of the first-stage proposal  .

Both and can be freely chosen. For our purposes, we use a MATLAB implementation of DRAM that is available online [Haario_2006].

## 5 Constructing the importance biasing distribution

We explained in §3 that solving Bayesian inverse problems is an effective method for constructing ’s (preimages to observations ). The Bayesian inverse problem that we wish to solve is described in 32. We also explained how to choose the covariance for the likelihood in question. In §4, we described two approaches, Laplace approximation at MAP and DRAM MCMC, to solve the Bayesian inverse problem. One can use either of these approaches to draw samples from the unnormalized distribution (DRAM) or an approximation of it (MAP). These samples are used to approximate the preimages . In §3 (equations (15) and (19)) we mentioned that can be computed by approximating the set and using the corresponding probability measure . A more practical means to estimate is to use the preimages to construct an IBD and use the IBD to estimate the probability using importance sampling.

Using the Laplace approximation of the posterior, one can draw samples from the approximate posterior by sampling from the distribution in (35), and these samples can be used to estimate using IS. DRAM MCMC, on the other hand, yields a Markov chain, and we denote the elements of the Markov chain drawn from the unnormalized distribution by

 ˆXi:={x1i,x2i,⋯}. (38)

Assuming there are samples in the chain, the elements of set can be thought of as samples from the Gaussian distribution with empirical mean

 ¯¯¯xi=1ℓℓ∑k=1xki (39)

and empirical covariance

 ¯¯¯¯¯Xi=1ℓ−1ℓ∑k=1(xki−¯¯¯xi)(xki−¯¯¯xi)⊤. (40)

If we use observations (see the discussion around (19)), then we can approximate the IBD as the following Gaussian mixture:

 pIBD:= N∑i=1wiN(¯¯¯xi,¯¯¯¯¯Xi), (41) N∑i=1wi= 1.

One of the obvious ways to choose is to assign equal weights to each component of the mixture. This is effective if is small, because the observations mostly correspond to high-density regions. If is large, however, the samples could potentially be from low-density regions, too. In such a scenario, it would be prudent to set

 wi∝yiφt(u,yi)∣∣t=ti. (42)

### 5.1 Estimating PT(u)

We now have all the pieces necessary to estimate . Following the discussion from §2.1, the importance sampling estimate of can be written as

 PIST(u)(ˆx10,…,ˆxM0)=1MM∑i=1I(ˆxi0)ψ(ˆxi0), (43)

where are sampled from the biasing distribution and represents the indicator function given by

 I(ˆxi0)=⎧⎪ ⎪⎨⎪ ⎪⎩1,sup0≤t≤Tc⊤x(t,ˆxi0)≥u,  t∈[0,T],0,sup0≤t≤Tc⊤x(t,ˆxi0)

Also, represents the importance weights. The importance weight for an arbitrary is given by

 ψ(ˆxi0)=p(ˆxi0)pIBD(ˆxi0). (45)

The overall procedure to compute an estimate of is summarized in Algorithm 2.

We call the approach that uses the MAP point and Laplace approximation around the MAP point to construct the IBD as MAP-based IS and the approach that uses the MCMC chains to construct the IBD as MCMC-based IS.

### 5.2 Connection to approach based on LDT

We mentioned earlier that LDT has been used in [Dematteis_2018, Dematteis_2019] to estimate rare-event probabilities using large deviations as a tool. For a detailed treatment of large deviation theory, we refer the interested readers to [Touchette_2011]. Loosely speaking, one can use large deviations to estimate when as . According to LDT,

 PT(u)≍exp(−I(x)), (46)

where indicates that the ratio of the logarithm ’s right-hand side and the logarithm’s left-hand side tends to one asymptotically, where

 I(x):=12minx∈Ω(u)∥x−¯¯¯x∥2Σ−1. (47)

Intuitively this approach, introduced by Dematteis et al. in [Dematteis_2018, Dematteis_2019], approximates the rare-event probability by determining the dominating point in , and the relative precision of this estimate improves as increases. On the other hand, for given , which is the case we discuss here, even determining an error estimate for the large deviation approach is problematic in practice. While inspired by large deviation ideas [adler2009random], our approach goes further by approximating the distribution around the dominating point the distribution in an importance sampling approach that produces an unbiased estimate of the sought-after probability. The empirical variance of the importance sampling approach gives an estimate of the error we make in our approach, something that is not accessible in a classical large deviation approach.

## 6 Numerical results

We demonstrate the application of procedure described in §3 and §4 for nonlinear dynamical systems excited by a Gaussian distribution. We use the Lotka-Volterra equations and the Lorenz-96 system as test problems.

### 6.1 Lotka-Volterra system

The Lotka-Volterra equations, which are also known as the predator-prey equations, are a pair of first-order nonlinear differential equations and are used to describe the dynamics of biological systems in which two species interact, one as predator and the other as prey. The populations change through time according to the following pair of equations,

 dx1dt=αx1−βx1x2, (48) dx2dt=δx1x2−γx2,

where is the number of prey, is the number of predators, and and represent the instantaneous growth rates of the two populations. We assume that the initial state of the system at time is a random variable that is normally distributed:

 x(0)∼N(,0.8×I2)

. We are interested in estimating the probability of the event , where , , and . The first step of our solution procedure involves sampling from to generate observations . We linearize the dynamical system about the mean of the distribution of equation (25) and express as a function of and as described by equation (23). We can compute as shown in equation (24). We use the DRAM MCMC method to generate samples from ; to minimize the effect of the initial guess on the posterior inference, we use a burn-in of 1,000 samples. Fig. 2 shows the contours of and samples drawn from it by using DRAM MCMC. Fig. 3 shows the autocorrelation between the samples drawn by using DRAM MCMC from , and we see that the autocorrelation dies down to zero for a lag of ; choosing every eleventh sample gives us independent samples that are in turn used to form . The next step in our solution procedure is to construct that approximate the preimages of . We use the procedure described in §3 to form an unnormalized posterior distribution that uses . Subsequently, either MAP-based IS or MCMC-based IS can be used to estimate . For the MAP-based IS, we first solve the optimization problem in (33) using a gradient-based optimization algorithm (for example, LBFGS). We use the inverse of the Hessian at the MAP point to approximate the covariance of the posterior as in equation (35). For the MCMC-based IS, we use DRAM MCMC as described in §4.2.1 to sample from the posterior distribution. To minimize the effect of initial guess on the posterior samples, we use a burn-in of 500 samples. We use these samples to construct the IBD as described in §5. We test our algorithm by constructing using different numbers of observations. Fig. 4 shows samples drawn from , , and the corresponding marginal densities. We see that the samples generated from the IBD are predominantly from the tails of . Fig. 5 compares the relative accuracies of conventional MCS and MCMC-based IS algorithms. We use the Monte Carlo estimate obtained using 10 million samples as a proxy for the truth. We test the accuracy of the IBD constructed with and observations (see the discussion around (19) for the definition of the number of observations). Constructing an IBD with observations involves more work because the MCMC DRAM has to be run with different unnormalized posterior distributions, involving about 5,000 model evaluations just to construct and a further 800 model runs to estimate . We note that executing the MCMC DRAM with 5 different observations completely independent of one another can be run in parallel. On the other hand, constructing with a single observation requires 1,000 model runs and a further 800 model runs to estimate . As Fig. 5 indicates, we get a more accurate estimate (one order of magnitude) for extra work performed with