1 Motivation
Computing the probability of extreme events is of central importance for dynamical systems that arise in natural phenomena such as climate, weather, oceanography [Easterling_2000, Easterling_2000A], and engineering systems such as structures [Cornell_1968, Vrouwenvelder_2000] or power grids [Lesieutre_2008]. Examples of consequential extreme events are rogue waves in the ocean [Dysthe_2008], hurricanes, tornadoes [Ross_2003], and power outages [Atputharajah_2009]. In this work we are motivated by the increased concern of transient security in the presence of uncertain inertia, as identified by the North American Electric Reliability Corporation in its most recent longterm reliability assessment [nerc2017ltra]. The mathematical formulation is of a dynamical system with parametric uncertainty, which is equivalent to initial condition uncertainty by adding the equations
to the ordinary differential equation, where
are the parameters. The aim of the calculation in that case is to compute an extreme excursion probability: the odds that the transient due to a sudden malfunction exceeds prescribed safety limits. Since the reliability goal is that an average customer should experience only minutes of electricity interruption per year
[fischer2016german], the target safe limit exceedance probabilities may be in the range of –.Quantifying extreme excursion probabilities is of great importance because of their socioeconomic impact. Outcomes of interest reside in the tails of the probability distribution of the associated event space because of their low likelihood. To resolve the tails of these events, one has to evaluate multivariable integrals over complex domains. Because of the tiny mass and complex shape of the relevant likelihood level sets, standard quadrature, cubature, or sparse grid methods cannot be applied directly to evaluate these integrals. The most commonly used method is Monte Carlo simulation (MCS), which requires repeated samples of the underlying outcome. For such small probabilities, however, MCS exhibits a large variance relative to the probability to be computed, and thus it needs a large number of samples to produce results of acceptable accuracy. For example, estimating the odds of an extreme event, whose probability ends up being
for an underlying process that requires 10 minutes per numerical simulation, requires two years of serial computation for producing an estimate with a standard deviation of less than 10% of the target value via MCS. Hence, alternative methods must be developed that are computationally efficient.
In the rest of this section, we review the literature, provide an overview of our approach, and discuss its limitations and possible extensions. In §2 we describe the rareevent problem in detail and revisit MCS and importance sampling (IS) methods. In §3 we formulate the problem of estimating rareevent probability as a sequence of Bayesian inverse problems, in §4 we discuss two wellknown approaches to solve the Bayesian inverse problems, and in §5 we use the solutions of these Bayesian inverse problems to construct an importance biasing distribution (IBD). In §6 we demonstrate the algorithm on two nonlinear dynamical systems of different sizes. In §7 we give concluding remarks .
1.1 Literature review
Most methods to compute the probabilities of rare events are a combination of MCS and IS methods. The key difference between such approaches lies in the proposal distribution for importance sampling. In what follows, we briefly discuss existing methods and the key ideas underpinning them.
1.1.1 Monte Carlo simulation
The MCS approach is one of the most robust methods for simulating rare events and estimating their probabilities. It was originally developed for solving problems in mathematical physics [Metropolis_1949]. Since then, the method has been used in a wide variety of applications, and it currently lies at the heart of all random samplingbased techniques [Liu_2008B, Robert_2005B]. The main strength of MCS is that its rate of convergence does not depend on the likelihood level set or its dimension. When evaluating excursion probabilities, the method primarily counts how many of the random samples exceed the given excursion level. Thus, in order to estimate a probability , MCS needs a number of samples exceeding , which for small probabilities makes its direct application impractical.
1.1.2 Importance sampling
IS methods belong to the class of variance reduction techniques that aim to estimate the quantity of interest by constructing estimators that have smaller variance than does MCS. This technique was proposed in the 1950s [Kahn_1953]. The major cause for the inefficiency in computing rareevent probabilities using MCS is that most of the random samples generated do not belong to the extreme excursion region (or the region of interest). The basic idea of IS is to use the information available about the rare event to generate samples that more frequently visit the region of interest. This is achieved by constructing an IBD, which can be used to generate samples. If successful, unlike in the case of MCS, an appreciable fraction of these samples contribute to the probability estimate. When designing an IBD, the aim is for its probability mass to be concentrated in the region of interest. Based on this consideration, several techniques for constructing IBDs have been developed, such as variance scaling and mean shifting [Bucklew_2013B]. A more detailed treatment of importance sampling and the relevant literature can be found in standard stochastic simulation textbooks [Dunn_2011B, Asmussen_2007B]. One of the major challenges involved with importance sampling is the construction of an IBD that results in a lowvariance estimator. We note that the approach may sometimes be inefficient in highdimensional problems [Katafygiotis_2008].
1.1.3 Nested subset methods
The underlying idea of this class of methods is to consider a sequence of nested subsets of the probability space of interest (for example, starting with the entire space and shrinking to the target rare event) and use the notion of conditional probability to factorize the target event as a product of conditional events. Two main methods that fall into this class are subset simulation (SS) [Au_2001] and splitting methods [Kahn_1951]. In SS, the nested subsets are generated by choosing appropriate intermediate thresholds. Splitting methods are based on the idea of restarting the associated Markov process from certain system states in order to generate more occurrences of the rare event of interest. Several modifications have been proposed to both SS [Ching_2005, Ching_2005_2, Katafygiotis_2005, Zuev_2012, Bect_2017] and splitting methods [Botev_2012, Beck_2016]. Evaluating the conditional probabilities forms a major portion of the computational load. Computing the conditional probabilities for different nested subsets concurrently is nontrivial. Additionally, it is not clear how many model evaluations are required at the intermediate level sets in order to achieve a good probability estimate.
1.1.4 Methods based on large deviation theory
Recent work by Dematteis et al. used large deviations theory (LDT) to estimate the probabilities of rogue waves of a certain height [Dematteis_2018]. The same authors used LDT to estimate probabilities of extreme events in dynamical systems with random components [Dematteis_2019]. LDT is an efficient approach for estimating rare events when the event space of interest is dominated by a few elements. The aforementioned papers solve an optimization problem to estimate the rareevent probability. In contrast, our approach uses a Bayesian inverse problem framework to determine an IBD, which will then be used to estimate the rareevent probability. In §5 we contrast the approach based on LDT with our approach.
1.1.5 Multifidelity and surrogatebased methods
Multifidelity methods are used for estimating rareevent probabilities is situations when multiple evaluations of the forward model is prohibitively expensive. This approach leverages a hierarchy of lowcost reducedorder models, such as projectionbased reducedorder models, data fit interpolation models, and support vector machines, to reduce the cost of constructing the IBD
[Peherstorfer_2017]. The main idea behind the surrogatebased method is to start with a deterministic sample of the system and then construct a surrogate that approximates the system based on these samples [Bucher_1990, Faravelli_1989, Wong_1985]. We remark that multifidelity and surrogate methods can be readily augmented with the framework developed in this paper to obtain additional computational savings.1.2 Overview of our methodology
Our methodology uses ideas from excursion probability theory to characterize the tails of the probability distribution associated with the event [adler2010geometry]. Specifically, we use Rice’s formula [Rice_1944], which was developed to compute the expected number of upcrossings for stochastic processes:
(1) 
The lefthand side denotes the number of upcrossings of level , is the derivative of the stochastic process (in a mean square sense), and represents the joint probability distribution of the process and its derivative . Clearly the expression in the integral is analytically tractable only for special types of stochastic processes. Specifically, for Gaussian processes, this term can be resolved analytically. Moreover—the critical feature we will use here— for smooth Gaussian processes , Rice’s formula is a fasterthanexponentiallyaccurate approximation of the excursion probability. That is [adler2009random, Equation (14.0.2)]:
(2) 
where is a parameter depending on the process and interval , but not on the target level , and the asymptotics in the notation refers to . If we use the number of upcrossings in (1) as our estimate of the excursion probability, we can interpret large values of as defining the times and values of the process velocity for which the crossing is most likely to occur. This, in turn, is the key in efficiently determining the points in the input space that represent the highest contribution to the excursion probability.
The setup in this article involves a nonlinear dynamical system that is excited by a Gaussian or a nonGaussian initial state that results in a nonGaussian stochastic process. To address this problem, we linearize the nonlinear dynamical system variation around the trajectories starting at the mean of the initial state. We thus obtain a Gaussian approximation to the system trajectory distribution. Furthermore, we use Rice’s formula and solve a sequence of Bayesian inverse problems (1) to determine the uncertainty sets in the input space that are most likely to cause the extreme events; these sets, in turn, are used to construct the biasing distribution. The main advantages of our approach are the following:

Constructing the biasing distribution is the most expensive component of the computational method. Since our method does not use nested subsets to evaluate the target probability, it is amenable to parallelization (see the discussion in §3).

As we will demonstrate in §6, a moderate number of evaluations of the model () are required in order to achieve acceptable levels of relative accuracy (). The method can capture probabilities on the order of accurately.

Although we use nonlinear dynamical systems with random initial states as a basis to demonstrate our method, the algorithm can be seamlessly extended to stochastic dynamical systems with random parameters.

In applications that require repeated evaluations of the rareevent probability and where the distribution of the random parameter does not change significantly between these evaluations, the IBD can be reused to obtain accurate estimates of the rareevent probability. We demonstrate this for a small problem in §6.
1.3 Limitations and possible extensions
One of the major limitations of our approach is that as the dimensionality of the random variable grows, the construction of the biasing distribution becomes expensive. Currently, this method is practical for problems where the size of the random variable is
. However, we aim to solve problems up to and beyond. The current approach requires that the random variable be normally distributed. For many practical problems, however, this might not be the case. In such scenarios we use the method of moments to approximate the non normal random variable by a Gaussian distribution. Another possible approach to handling nonGaussian random parameters is to use a Gaussian mixture model (GMM) to approximate it. In such a scenario, challenges may arise regarding controlling the variance of the GMM components such that the errors due to linearization do not grow too much. An obvious extension to the current work is to develop a strong theoretical foundation that justifies the algorithm in this paper. Another potential research direction is related to constructing the likelihood function that is necessary for the MCMC step of the algorithm. Currently, we use ad hoc methods to choose the likelihood for the MCMC step (more details are in §
3), and this approach.can be significantly improved by using a design of experiments approach (similar to [Mohamad_2018]).2 The rareevent problem
Consider an inputoutput system with inputs and outputs, which is modeled by a continuous function
. We represent the uncertainties in the input by a probability distribution with probability density function (PDF)
. Let be a random dimensional vector. The inputs to the system are the components of a realization of . The outputs are the realization of the random variable . We are interested in the failure probability of the system described by the model . Let be the limitstate function, which we assume to be continuous. We say that the system fails for an input if . This leads us to the failure domainand the indicator function of the failure domain
(3) 
We define the failure probability of the system as
(4) 
The variance of with respect to the PDF is
(5) 
The MCS method is often used to estimate expectation such as (4). It draws independent and identically distributed (i.i.d.) samples from the distribution of , that is, realizations of the random variable , and computes the Monte Carlo estimate
(6) 
Note that we distinguish between the estimate , which is a scalar value, and the Monte Carlo estimator , which is a random variable. The relative root mean square error (RMSE) of is
(7) 
when . Hence, for a threshold parameter , a relative error is achieved with
(8) 
samples, where denotes the ceil function. If , then is large. Hence, estimating small probability events with MC is difficult.
2.1 Importance sampling
Variance reduction methods aim to reduce the RMSE (7) by changing (4) to a new integral with the same value but with an integrand and/or a distribution that, combined, result in a lower variance than the original function has with respect to the distribution . Importance sampling is one such variance reduction method that has been used successfully to estimate failure probabilities[Srinivasan_2013B, Robert_2005B]. Importance sampling introduces a random vector with PDF , which is used as the biasing distribution. A realization of is denoted by . In the following, the distribution of is the nominal distribution, and the corresponding PDF is the nominal PDF. The distribution of is the biasing distribution, and is the biasing PDF. The biasing PDF is constructed such that the , where
denotes the support of the PDF . Let be the weight function . The weight is the importance weight of a realization . Because holds, the failure probability (4) equals the expectation of the random variable weighted with the random variable . That is, we have
(9) 
The expectation (9) is approximated with the Monte Carlo method, with samples drawn from the biasing distribution. Thus the importance sampling estimate of with samples is
(10) 
Therefore, the Monte Carlo method with importance sampling consists of two steps. In step one, the biasing distribution is generated. In step two, the importance sampling estimator
is an unbiased estimator of
because [Robert_2005B].If , then the relative RMSE of the importance sampling estimator is
(11) 
If the variance is smaller than then the relative RMSE of the importance sampling estimator is smaller than the relative RMSE of the Monte Carlo estimator for the same number of samples .
3 Construction of IBD via Bayesian inference
Consider the following dynamical system,
(12)  
where the initial state of the system is uncertain and has a probability distribution , with being the corresponding probability measure. The problem of interest to us is to estimate the probability that exceeds the level for . That is, we seek to estimate the excursion probability
(13) 
where represents the solution of the dynamical system (12) for a given initial condition . Let represent the set of all initial conditions for which the solution of the dynamical system exceeds the excursion level . That is,
(14) 
Notice that depends on implicitly through the solution of the dynamical system. Since we can write
(15) 
estimating is related to determining . In general, one cannot determine analytically. We use Rice’s formula, (1), to gain insight about , which in turn will be used to construct an approximation to . Let us revisit Rice’s formula:
(16) 
Recall that represents the joint probability density of and its derivative for an excursion level . The righthand side of equation (16) integrates the joint density over all values of derivatives and times at which there is an excursion. The key insight for our method is that values of the time and slope at which is large contribute the most to this integral. We use this idea to construct an approximation to .
Using (16), we can interpret as an unnormalized PDF and thus sample from it to compute using Monte Carlo approximation. By sampling from the unnormalized distribution , we obtain a slopetime pair, at which the sample paths of the stochastic process exceed the excursion level . Consider the forward map , which evaluates the vector based on the dynamics (12), given an initial state and a time . We call a preimage of a sample if
(17) 
Note that the problem of finding a preimage of a sample is illposed. Multiple ’s map to at time via the operator . Therefore, we define the set
(18) 
and we construct our approximation as
(19) 
Our intuition is that approximates better as we increase . The underlying computational framework to approximate consists of the following stages:

Draw samples from unnormalized

Find the preimages of these samples to approximate .
We use MCMC to draw samples from unnormalized . We note that irrespective of the size of the dynamical system, represents an unnormalized density in two dimensions; hence, using MCMC is an effective means to draw samples from it. Drawing samples from requires evaluating it repeatedly, and in the following section we discuss the means to do so.
3.1 Evaluating
In this section, we describe the process of evaluating given , , and . We note that , can be evaluated analytically only for special cases. Specifically, when is a Gaussian process, the joint density function is analytically computable. Consider the dynamical system described by (12). When is Gaussian and is linear, we have
(20) 
Assuming is invertible, we can write as
(21) 
where
represents an identity matrix of the appropriate size. Given that
is normally distributed, it follows that is a Gaussian process:(22)  
The joint PDF of a stochastic process and its derivative, , the joint PDF of and , is given by [3569, equation 9.1]
(23) 
where
and
We can now evaluate for arbitrary values of , , and as
(24) 
where and denotes the determinant of . Note that the righthand side in (24) is dependent on via .
3.1.1 Notes for nonlinear
When is nonlinear, one cannot compute analytically—a key ingredient for our computational procedure. We approximate the nonlinear dynamics by linearizing around the mean of the initial distribution. Assuming that the initial state of the system is normally distributed as described by equation (20), linearizing around the mean of the initial state gives
(25) 
where represents the Jacobian of at , . This reduces the nonlinear dynamical system to a form that is similar to equation (20). Thus, we can now use equations (22), (23), and (24) to approximate for nonlinear .
We now describe a systematic computational framework to determine for a given sample . This allows us to determine the elements of set .
3.2 Determining the preimages for a given sample
A sample from the unnormalized joint distribution
gives a slope, , and time, , at which the stochastic process exceeds the level . Hence . Constructing requires finding all the preimages . This amounts to finding all the solutions of the following equation,(26) 
where . Another formulation of the problem (26) is
(27) 
Since is a mapping from to , problem (27) is an illposed and underdetermined inverse problem. To address the illposedness, we use the Bayesian formulation of the inverse problem by placing a prior on and identifying the term as a negative loglikelihood. Suppose, in the process of finding preimages , we encounter elements that map to a value higher than . These should not be discarded because these elements still cause an excursion and hence are elements of the set . The Bayesian treatment allows for such flexibility because of the covariance associated with the loglikelihood term. We note, however, that the nonlinear equation (26) does not allow this flexibility.
In equation (12), we stated that has a probability distribution and that we use as a prior PDF for .
(28) 
Treating as a random variable with covariance , we can write the following likelihood:
(29) 
Using Bayes’ rule, we can write the posterior PDF of as
(30) 
which is
(31) 
When is Gaussian, the kernel of the likelihood distribution can be represented in closed form. Hence the Bayesian inverse problem in (31) can be solved either by finding a maximum a posteriori point (MAP) and using the Laplace approximation around the MAP to describe the uncertainty around the solution or by drawing samples from the approximate posterior distribution using MCMC. In scenarios when is nonGaussian, however, the challenges are twofold:

The kernel of the posterior cannot be represented in a closed analytical form.

We cannot evaluate analytically—which is central to our method.
We tackle nonGaussianity by using the method of moments to approximate by a Gaussian distribution. Using Gaussian mixtures might lead to a better approximation to
than using the method of moments, and we can reuse the technique here about the center of each component of the mixture. However, this approach suffers from the curse of dimensionality when
represents a PDF in large dimensions, and hence we do not pursue using Gaussian mixtures to approximate in this paper, with the expectation that our approach will create an acceptable IBD (which need not be exact). Assuming is Gaussian or can be approximated by the method of moments, we can write as .(32) 
The covariance information is necessary in order to evaluate the posterior PDF given . We discuss the choice of the covariance in the next subsection.
3.3 Choice of covariance
For our specific problem, defining is an important step in solving the Bayesian inverse problem (32). We use the value of as a guide to choose the covariance of the likelihood term in (32). Recall that a sample from unnormalized distribution gives us a pair. To choose the covariance of , we look at the unnormalized distribution of at time . That is, we model the joint distribution of based on the values of evaluated at . Specifically we evaluate for . These values give us a range of slopes at time for which the state is close to the excursion level. We then use the values of to construct a Laplace approximation to obtain an approximation for the joint distribution of . This gives us an approximate covariance () for the likelihood PDF. This is illustrated for the LotkaVolterra system in Figure 1. We evaluate at for a range of values of and and fit a twodimensional Gaussian distribution to approximate the covariance for the likelihood.
4 Solution to the Bayesian inverse problem
In §3 we formulated the process of approximating the as solving a sequence of Bayesian inverse problems. We also defined the necessary ingredients to define a Bayesian inverse problem—the prior and the likelihood function. Ideally, the solution to the Bayesian inverse problem in (32) should yield the posterior distribution . Except under special circumstances, however, one cannot obtain a closedform expression for the posterior distribution (32). Let denote the maximum a posteriori point (MAP point), that is, the point that maximizes the posterior PDF (equation (32)). A standard approach to solving the Bayesian inverse problem (31) is to first find and then approximate the forward map by its linearization around . This results in a Gaussian approximation to the posterior distribution , which is known as the Laplace approximation [Tanbui_2013, Petra_2014].
Alternatively, one can use MCMC methods to sample from the posterior PDF. In the following paragraphs, we describe both these approaches for solving the Bayesian inverse problem.
4.1 Laplace approximation at the MAP point
The problem of finding the point at which the posterior PDF (32) is maximized can be formulated as a deterministic inverse problem. The negative loglikelihood is treated as the data misfit term, and the negative log prior is used as a regularizer to avoid overfitting. The resulting inverse problem can be written as
(33) 
where is the regularization parameter. The solution for the optimization problem in (33) is the MAP point for the Bayesian inverse problem in (32). To solve the minimization problem in (33), we use gradientbased optimization methods (for example, LBFGS); the necessary gradient information can be evaluated by using adjoints. In Appendix A we describe the computational procedure to evaluate the gradient information.
Assuming the forward map to be Fréchet differentiable, we can approximately express an observation as
(34) 
where and is the Fréchet derivative of evaluated at . Hence the Laplace approximation of the posterior can be written as
(35) 
where .
4.2 Markov chain Monte Carlo
The MetropolisHastings (MH) algorithm [Metropolis_1953, Hastings_1970] is an MCMC method that employs a proposal density function () at each sample point in to generate a proposed sample point. This sample point is then rejected or accepted based on the MH criterion ( in Algorithm 1). The MH algorithm is described in Algorithm 1 [Kaipio_2006B, Section 3.6.2]. The performance of MCMC algorithms depends heavily on how close the proposal distribution is to the target distribution. A number of different MCMC algorithms exist, the distinguishing feature being the manner in which the sample points are proposed and accepted (or rejected). See, for example, [Liu_2008B, Gilks_1995B, Gelfand_1990, Geman_1987]. In this paper, we use the delayed rejection adaptive Metropolis (DRAM) MCMC algorithm [Haario_2006].
4.2.1 Dram Mcmc
DRAM combines two ideas: delayed rejection (DR) and adaptive Metropolis (AM) algorithms. Here, we describe DR, AM, and their combination.
4.2.2 Delayed rejection
DR is a strategy that is employed to improve the performance of the MH algorithm. Unlike the MH algorithm, which employs a single proposal density, DR uses a hierarchy of proposal densities. Suppose the current position of the Markov chain is and that a candidate move is generated from the proposal distribution . This proposal is accepted with probability
In the case of a rejection, the MH algorithm retains the same position . On the other hand, the DR algorithm instead proposes a second move . The secondstage proposal, , depends on the current position and on the recently proposed and rejected move. The secondstage proposal is accepted with probability
(36) 
This process of delaying rejection can be iterated over a fixed number of stages. Alternatively, one can use a biased coin to guide whether to move to a higherstage proposal or not. We refer interested readers to [Haario_2006, Tierney_1999] for more details about the DR algorithm.
4.2.3 Adaptive Metropolis
The AM MCMC algorithm constructs a proposal distribution adaptively by using the existing elements of the Markov chain. The basic idea is to use the sample path of the Markov chain to “adapt” the covariance matrix for a Gaussian proposal distribution. For example, after an initial period of nonadaptation, one can set the Gaussian proposal to be centered at the current position of the Markov chain, . That is, the covariance is set to , where is a parameter that depends only on the dimension of the state space on which the target probability distribution is defined. The quantity is typically chosen to be a small constant, and is an identity matrix of appropriate dimensions. Before the start of the adaptation period, a strictly positive definite covariance is chosen according to a priori knowledge. Let index define the length of the nonadaption period. Then
(37) 
A recursive procedure allows us to update the covariance of the proposal distribution efficiently. For more details see [Haario_2001, Haario_2006].
4.2.4 Combining DR and AM
The success of the DR algorithm depends on a proposal in at least one of the stages being calibrated close to the target distribution. The AM algorithm attempts to calibrate the proposal distribution as the sample path of the Markov chain grows. The DRAM algorithm [Haario_2006] combines these two strategies. The DRAM version deployed in this paper combines stages of DR with adaptation. The process can be summarized as follows:

The covariance of the proposal for stage , is computed as a scaled version of the firststage proposal .
Both and can be freely chosen. For our purposes, we use a MATLAB implementation of DRAM that is available online [Haario_2006].
5 Constructing the importance biasing distribution
We explained in §3 that solving Bayesian inverse problems is an effective method for constructing ’s (preimages to observations ). The Bayesian inverse problem that we wish to solve is described in 32. We also explained how to choose the covariance for the likelihood in question. In §4, we described two approaches, Laplace approximation at MAP and DRAM MCMC, to solve the Bayesian inverse problem. One can use either of these approaches to draw samples from the unnormalized distribution (DRAM) or an approximation of it (MAP). These samples are used to approximate the preimages . In §3 (equations (15) and (19)) we mentioned that can be computed by approximating the set and using the corresponding probability measure . A more practical means to estimate is to use the preimages to construct an IBD and use the IBD to estimate the probability using importance sampling.
Using the Laplace approximation of the posterior, one can draw samples from the approximate posterior by sampling from the distribution in (35), and these samples can be used to estimate using IS. DRAM MCMC, on the other hand, yields a Markov chain, and we denote the elements of the Markov chain drawn from the unnormalized distribution by
(38) 
Assuming there are samples in the chain, the elements of set can be thought of as samples from the Gaussian distribution with empirical mean
(39) 
and empirical covariance
(40) 
If we use observations (see the discussion around (19)), then we can approximate the IBD as the following Gaussian mixture:
(41)  
One of the obvious ways to choose is to assign equal weights to each component of the mixture. This is effective if is small, because the observations mostly correspond to highdensity regions. If is large, however, the samples could potentially be from lowdensity regions, too. In such a scenario, it would be prudent to set
(42) 
5.1 Estimating
We now have all the pieces necessary to estimate . Following the discussion from §2.1, the importance sampling estimate of can be written as
(43) 
where are sampled from the biasing distribution and represents the indicator function given by
(44) 
Also, represents the importance weights. The importance weight for an arbitrary is given by
(45) 
The overall procedure to compute an estimate of is summarized in Algorithm 2.
We call the approach that uses the MAP point and Laplace approximation around the MAP point to construct the IBD as MAPbased IS and the approach that uses the MCMC chains to construct the IBD as MCMCbased IS.
5.2 Connection to approach based on LDT
We mentioned earlier that LDT has been used in [Dematteis_2018, Dematteis_2019] to estimate rareevent probabilities using large deviations as a tool. For a detailed treatment of large deviation theory, we refer the interested readers to [Touchette_2011]. Loosely speaking, one can use large deviations to estimate when as . According to LDT,
(46) 
where indicates that the ratio of the logarithm ’s righthand side and the logarithm’s lefthand side tends to one asymptotically, where
(47) 
Intuitively this approach, introduced by Dematteis et al. in [Dematteis_2018, Dematteis_2019], approximates the rareevent probability by determining the dominating point in , and the relative precision of this estimate improves as increases. On the other hand, for given , which is the case we discuss here, even determining an error estimate for the large deviation approach is problematic in practice. While inspired by large deviation ideas [adler2009random], our approach goes further by approximating the distribution around the dominating point the distribution in an importance sampling approach that produces an unbiased estimate of the soughtafter probability. The empirical variance of the importance sampling approach gives an estimate of the error we make in our approach, something that is not accessible in a classical large deviation approach.
6 Numerical results
We demonstrate the application of procedure described in §3 and §4 for nonlinear dynamical systems excited by a Gaussian distribution. We use the LotkaVolterra equations and the Lorenz96 system as test problems.
6.1 LotkaVolterra system
The LotkaVolterra equations, which are also known as the predatorprey equations, are a pair of firstorder nonlinear differential equations and are used to describe the dynamics of biological systems in which two species interact, one as predator and the other as prey. The populations change through time according to the following pair of equations,
(48)  
where is the number of prey, is the number of predators, and and represent the instantaneous growth rates of the two populations. We assume that the initial state of the system at time is a random variable that is normally distributed:
. We are interested in estimating the probability of the event , where , , and . The first step of our solution procedure involves sampling from to generate observations . We linearize the dynamical system about the mean of the distribution of equation (25) and express as a function of and as described by equation (23). We can compute as shown in equation (24). We use the DRAM MCMC method to generate samples from ; to minimize the effect of the initial guess on the posterior inference, we use a burnin of 1,000 samples. Fig. 2 shows the contours of and samples drawn from it by using DRAM MCMC. Fig. 3 shows the autocorrelation between the samples drawn by using DRAM MCMC from , and we see that the autocorrelation dies down to zero for a lag of ; choosing every eleventh sample gives us independent samples that are in turn used to form . The next step in our solution procedure is to construct that approximate the preimages of . We use the procedure described in §3 to form an unnormalized posterior distribution that uses . Subsequently, either MAPbased IS or MCMCbased IS can be used to estimate . For the MAPbased IS, we first solve the optimization problem in (33) using a gradientbased optimization algorithm (for example, LBFGS). We use the inverse of the Hessian at the MAP point to approximate the covariance of the posterior as in equation (35). For the MCMCbased IS, we use DRAM MCMC as described in §4.2.1 to sample from the posterior distribution. To minimize the effect of initial guess on the posterior samples, we use a burnin of 500 samples. We use these samples to construct the IBD as described in §5. We test our algorithm by constructing using different numbers of observations. Fig. 4 shows samples drawn from , , and the corresponding marginal densities. We see that the samples generated from the IBD are predominantly from the tails of . Fig. 5 compares the relative accuracies of conventional MCS and MCMCbased IS algorithms. We use the Monte Carlo estimate obtained using 10 million samples as a proxy for the truth. We test the accuracy of the IBD constructed with and observations (see the discussion around (19) for the definition of the number of observations). Constructing an IBD with observations involves more work because the MCMC DRAM has to be run with different unnormalized posterior distributions, involving about 5,000 model evaluations just to construct and a further 800 model runs to estimate . We note that executing the MCMC DRAM with 5 different observations completely independent of one another can be run in parallel. On the other hand, constructing with a single observation requires 1,000 model runs and a further 800 model runs to estimate . As Fig. 5 indicates, we get a more accurate estimate (one order of magnitude) for extra work performed with
Comments
There are no comments yet.