# A sequential sampling strategy for extreme event statistics in nonlinear dynamical systems

We develop a method for the evaluation of extreme event statistics associated with nonlinear dynamical systems, using a small number of samples. From an initial dataset of design points, we formulate a sequential strategy that provides the 'next-best' data point (set of parameters) that when evaluated results in improved estimates of the probability density function (pdf) for a scalar quantity of interest. The approach utilizes Gaussian process regression to perform Bayesian inference on the parameter-to-observation map describing the quantity of interest. We then approximate the desired pdf along with uncertainty bounds utilizing the posterior distribution of the inferred map. The 'next-best' design point is sequentially determined through an optimization procedure that selects the point in parameter space that maximally reduces uncertainty between the estimated bounds of the pdf prediction. Since the optimization process utilizes only information from the inferred map it has minimal computational cost. Moreover, the special form of the metric emphasizes the tails of the pdf. The method is practical for systems where the dimensionality of the parameter space is of moderate size, i.e. order O(10). We apply the method to estimate the extreme event statistics for a very high-dimensional system with millions of degrees of freedom: an offshore platform subjected to three-dimensional irregular waves. It is demonstrated that the developed approach can accurately determine the extreme event statistics using limited number of samples.

## Authors

• 2 publications
• 8 publications
• ### Sequential Bayesian experimental design for estimation of extreme-event probability in stochastic dynamical systems

We consider a dynamical system with two sources of uncertainties: (1) pa...
02/22/2021 ∙ by Xianliang Gong, et al. ∙ 0

• ### Efficient computation of extreme excursion probabilities for dynamical systems

We develop a novel computational method for evaluating the extreme excur...
01/31/2020 ∙ by Vishwas Rao, et al. ∙ 0

• ### Fast Gaussian Process Based Gradient Matching for Parameter Identification in Systems of Nonlinear ODEs

Parameter identification and comparison of dynamical systems is a challe...
04/12/2018 ∙ by Philippe Wenk, et al. ∙ 0

• ### Output-weighted optimal sampling for Bayesian regression and rare event statistics using few samples

For many important problems the quantity of interest (or output) is an u...
07/17/2019 ∙ by Themistoklis P. Sapsis, et al. ∙ 2

• ### Probably approximate Bayesian computation: nonasymptotic convergence of ABC under misspecification

Approximate Bayesian computation (ABC) is a widely used inference method...
07/19/2017 ∙ by James Ridgway, et al. ∙ 0

• ### Numerical Method for Parameter Inference of Nonlinear ODEs with Partial Observations

Parameter inference of dynamical systems is a challenging task faced by ...
12/30/2019 ∙ by Yu Chen, et al. ∙ 0

• ### Extreme events and their optimal mitigation in nonlinear structural systems excited by stochastic loads: Application to ocean engineering systems

We develop an efficient numerical method for the probabilistic quantific...
06/01/2017 ∙ by Han Kyul Joo, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

For many natural and engineering systems, extreme events, corresponding to large excursions, have significant consequences and are important to predict. Examples include extreme economic events, such as credit shocks [8], rogue waves in the ocean [12], and extreme climate events [11]

. Extreme events ‘live’ in the tails of a probability distribution function (pdf), thus it is critical to quantify the pdf many standard deviations away from the mean. For most real-world problems, the underlying processes are far too complex to enable estimation of the tails through direct simulations or repeated experiments. This is a result of the low probabilities of extreme events, which necessitates a large number of experiments or ensembles to resolve their statistics. For random dynamical systems with inherently nonlinear dynamics (expressed through intermittent events, nonlinear energy transfers, broad energy spectrum, and large intrinsic dimensionality) we are usually limited to a few ensemble realizations.

The setup in this article involves a stochastic dynamical system that depends on a set of random parameters with known probability distribution. We assume that the dimensionality of the random parameters is or can be reduced to a moderate size . Because of the inherent stochastic and transient character of extreme responses, it is not sufficient to consider the dynamical properties of the system independently from the statistical characteristics of solutions. A statistical approach to this problem has important limitations, such as requiring various extrapolation schemes due to insufficient sample numbers (see extreme value theorems [18]). Another strategy is large deviations theory [24, 7], a method for the probabilistic quantification of large fluctuations in systems, which involves identifying a large deviations principle that explains the least unlikely rare event. While applied to many problems, for complex systems estimating the rate function can be very costly and the principle does not characterize the full probability distribution. The resulting distributions via such approaches cannot always capture the non-trivial shape of the tail, dictated by physical laws in addition to statistical characteristics. On the other hand, in a dynamical systems approach there are no sufficiently generic efficient methods to infer statistical information from dynamics. For example, the Fokker-Planck equation [22] is challenging to solve even in moderate to low dimensions [14]. To this end, it is essential to consider blended strategies. The utilization of combined dynamic-stochastic models for the prediction of extreme events have also been advocated and employed in climate science and meteorology by others [9, 13, 2]. In [15, 16] a probabilistic decomposition of extreme events was utilized to efficiently characterize the probability distribution of complex systems, which considered both the statistical characteristics of trajectories and the mechanism triggering the instabilities (extreme events). While effective, the proposed decomposition of intermittent regimes requires explicit knowledge of the dynamics triggering the extremes, which may not be available or easily determined for arbitrary dynamical systems.

We formulate a sequential method for capturing the statistics of an observable that is, for example, a functional of the state of a dynamical system or a physical experiment. The response of the observable is modeled using a machine learning method that infers the functional form of the quantity of interest by utilizing only a few strategically sampled numerical simulations or experiments. Combining the predictions from the machine learning model, using the Gaussian process regression framework, with available statistical information on the random input parameters, we formulate an optimization problem that provides the next-best or most informative experiment that should be performed to maximally reduce uncertainty in the pdf prediction of extreme events (tails of the distribution) according to a proposed ‘distance‘ metric. To account for tail features the metric utilize a logarithmic transformation of the pdfs, which is similar to the style of working on the rate function in the large deviation principle. The optimization process relies exclusively on the inferred properties on the parameter-to-observation map and no additional simulations are required in the optimization. For the optimization problem to be practically solvable we require the parameter space to be of moderate size, on the order

. The proposed method allows us to sequentially sample the parameter space in order to rapidly capture the pdf and, in particular, the tails of the distribution of the observable of interest.

## 2 Problem setup and method overview

Consider a dynamical system with state variable ,

 dudt=g(t,u;θ(ω)),ω∈Ω, (1)

where

is the sample space in an appropriate probability space (we denote the density of the random variable

by and its cumulative density by ). The random variable parameterizes sources of uncertainty, such as stochastic forcing terms or system parameters with a priori known distribution . For fixed , the response is a deterministic function in time. We are interested in estimating the pdf of a scalar quantity of interest or observable given by

 q=^T(θ)≜F(u)+ε (2)

where is a continuous parameter-to-observation map, is an arbitrary functional of , and is some observational or numerical noise term, which we take as zero, without loss of generality. In our setup the unknown parameter-to-observation-map is expensive to evaluate, representing, for example, a large scale numerical simulation or a costly physical experiment, and so we desire to minimize the number of evaluations of this mapping. Note that the true statistics of the random variable induced by is,

 ^f(s)=dds^F(s)=ddsP(^T(θ)≤s)=dds∫A(s)fθ(θ)dθ, (3)

where and is the cumulative density function of .

Our objective is to estimate the statistics, especially non-Gaussian features, of the observable , i.e. the pdf of the random variable induced by the mapping which we denote by :

Consider the quantity of interest with pdf induced by the unknown mapping , where is a random valued parameter with known pdf . Given a dataset of size , so that the estimated distribution of using a learning algorithm from this dataset is , determine the next input parameter such that when the map is evaluated at this new sample the error between and is minimized, placing special emphasis on the tails of the distribution where is large.

If we consider the augmented -parameterized dataset for all , and denote the learned density by pdf , the next parameter is obtained by minimizing a distance metric between two probability distributions,

Determining the next-best experiment should not involve the expensive-to-evaluate map nor . The main computational savings of our proposed method involves (1) using a inexpensive-to-compute surrogate model to replace appearing in the -parametrized dataset and using a version of the distance metric without explicit dependence on . Here we utilize Gaussian process regression (GPR), as the learning algorithm to construct the surrogate for

. Using the posterior distribution of the inferred map through the GPR scheme, we estimate the pdf for the quantity of interest as well as the pdfs that correspond to the confidence intervals on the map. The distance metric

, is then based on minimization of the logarithmic transform of the pdfs that correspond to the map upper and lower confidence intervals from the posterior variance, which does not involve

. This optimization problem provides the ‘next-best’ point in a sequential fashion.

The overall aim is to accurately capture the tail statistics of through a minimum number of observations of . Note that the resulting method should not need to densely sample all regions in , since not all regions have significant probability ( may be negligible) or importance ( may be small). The formulated sampling strategy should accurately predict the tail region of the pdf taking into account both the magnitude of the map and the value of the probability of the sample , see Fig. 1.

## 3 Method description

An important component of our proposed method is the construction of an inexpensive surrogate for the map . We utilize Gaussian process regression (GPR) as our learning method. GPR considers the function as a Gaussian process, in terms of a distribution in function space (see appendix A and numerous references, such as [21]). An important property of GPR is that the posterior distribution is a Gaussian process with an explicit mean and kernel function. The variance of the posterior can be used as a proxy for the error or uncertainty of the prediction, which we utilize along with to guide selection of the next sample point.

We learn the parameter-to-observation map using GPR from the dataset . The GPR method then provides analytic expressions for the mean and kernel for the posterior distribution of the Gaussian random function (see appendix A for the expressions). We can then construct the following estimate for the pdf using the posterior mean of the learned surrogate model :

 fn−1(s)=dFn−1ds=dds∫An−1(s)fθ(θ)dθ, (4)

where and

We now formulate the optimization problem for the next sample point . Consider the augmented -parameterized dataset , which approximates by using the GPR mean instead of . Then let denote the random function trained on the augmented dataset . The pdf of the random variable , where , is now replaced by a -parameterized random probability measure induced by the Gaussian random function , where and is a sample point. Note, that the mean of in the -parameterized dataset is identical to for all , since the prediction of the value of the map at the sample point is given by the posterior mean at iteration .

The proposed criterion is then based on minimization of a distance metric between the pdfs of the confidence bounds of . Specifically, let denote the pdfs of the two random variables , where , which are the upper and lower bounds of the confidence interval based on the -scaled standard deviation of the posterior distribution. The pdfs corresponding to the confidence bounds are explicitly given by

 ~f±n(s;θ∗)=dds˜F±n(s;θ∗)=dds∫A±n(s;θ∗)fθ(θ)dθ, (5)

where . We employ the 95% interval bounds, so that the standard deviation is scaled by a factor . The cdf corresponding to the upper confidence bound , is a lower bound for , and we have the relation , for all . Note, although the map mean of the Gaussian random function based on the -parameterized dataset is identical to the posterior mean at iteration , the value of the variance now vanishes at the sample point: .

The distance metric we propose for the selection of the next sample point is given by

 Q(~f+n(⋅;θ),~f−n(⋅;θ))≜12∫∣∣log(~f−n(s;θ))−log(~f+n(s;θ))∣∣ds, (6)

where the integral is computed over the intersection of the two domains that the pdfs are defined over. The next sample point is then determined by solving the optimization problem:

 θn=argminθQ(~f+(⋅;θ),~f−(⋅;θ)). (7)

This is a based metric of the logarithmic transform of the pdfs. The logarithmic transform of the densities in the criterion effectively emphasizes extreme and rare events, as we explicitly show in Theorem 1. The computational savings comes from the construction of a criterion that avoids and instead uses , which involves evaluating the GPR emulator (inexpensive) and an additional GPR prediction.

We make a few comments on the optimization problem and the sequential algorithm. The sequential strategy is summarized in pseudocode in appendix C

For the optimization problem, we utilize a derivative free method, specifically a particle swarm optimizer. The integrations for the pdfs are computed explicitly from the definition of the Lebesgue integral, by partitioning the co-domain of map

. This is much more efficient compared with a Monte-Carlo approach that would result in a very expensive computational task, as long as the dimensionality of the parameter space is low. For high-dimensional parameter spaces, the computational cost of the integration can become prohibitive and in such cases an order reduction in the parameter space should be first attempted, or alternatively an importance sampling algorithm could be used to compute the integral. In the numerical problems, the optimization problem is practically solvable for low dimensional parameter spaces . The starting design plan size , if not already provided, should be small; a Latin hypercube based sampling plan should be utilized for this purpose. We also recommend as a pre-training period to process a small number of iterations using a metric without the logarithmic transform of the densities, such as the following based metric

 Q′(~f+n(⋅;θ),~f−n(⋅;θ))≜12∫∣∣~f−n(s;θ)−~f+n(s;θ)∣∣2ds, (8)

in order to capture the main probability mass (low order moments) before utilizing the proposed metric that emphasizes extreme and rare events. In addition, it is not necessary to retrain the GPR hyperparameters after every iteration, which can remain fixed after being calibrated from a few iterations. Updating the Gaussian process emulator after the addition of new data points can be done in

if the hyperparameters are fixed, otherwise the GPR emulator must be performed anew in (see appendix B)111For low-dimensional , since we presume the dataset size is small, the cost difference may be negligible..

### 3.1 Asymptotic behavior

The first theoretical result relates to the convergence of the proposed method to the true pdf as the number of samples goes to infinity (see appendix D for the proof). The second result shows that the asymptotic form of the criterion is given by (see appendix D for the proof):

###### Theorem 1.

Let and from the GPR scheme be sufficiently smooth functions of . The asymptotic behavior of for large (ensuring small ) and small is given by

 ˜Qn ≜12∫∣∣logf+n(s)−logf−n(s)∣∣ds (9) ≈∫∣∣∣ddsE(σn(θ)⋅1Tn(θ)=s)fn(s)∣∣∣ds, (10)

where denotes the expectation over the probability measure .

Note that the pdf in the denominator under the integral in 30 is a direct implication of our choice to consider the difference between the logarithms of the pdfs in the optimization criterion. The pdf of the parameter-to-observation map in the denominator of the integrand guarantees that even values of the parameter , where the probability is low (rare events) are sampled. This point is clearly demonstrated by the following corollary (proof in appendix D).

###### Corollary 1.

Let be a one-dimensional random variable and in addition to the assumptions of Theorem 1 we assume that is monotonically increasing function. Then, the asymptotic value of for large has the following property:

 ˜Qn≳∣∣∣∫Uσn(θ)d(logfθ(θ))+∫Uσn(θ)d(logT′(θ))+σn(u2)−σn(u1)∣∣∣. (11)

Therefore for large value of , bounds (within higher order error) a quantity that consists of boundary terms and two integral terms involving sampling the function over the contours of and . Consider the first term on the right, which can be discretized as:

 ∫Uσ(θ)d(logfθ(θ))=limN→∞N∑i=1Δz∑{θ:logfθ(θ)=zi}σ(θ), (12)

where is an equipartition of the range of . This summation is summarized in Fig. 2 (left). The way the integral is computed guarantees that the integrand will be sampled even in locations where the pdf has small value (rare events) or else the criterion would not converge.

On the other hand, if we had instead chosen a criterion that focused directly on the convergence of low order statistics for , such as the same criterion, but without the logarithmic transformation of the densities, we would have

 ˜Q′n

where is the cumulative distribution function of . The corresponding integration is shown in Fig. 2 (right). In such case, sampling the function in regions of high probability would be sufficient for the criterion to converge. However, such a strategy would most likely lead to large errors in regions associated with rare events since the sampling will be sparse.

## 4 Applications

We illustrate the proposed algorithm to two problems. The first example consists of a nonlinear oscillator stochastically forced by a colored noise process; this application, serves to illustrate the main ideas of the proposed method. The second application, involving three-dimensional hydrodynamic wave impacts on an offshore platform (a system with millions of degrees of freedom) showcases the applicability of the proposed method to real world setups where computation of extreme event statistics using traditional approaches are prohibitively expensive, since the simulation times of experimental runs are on the order of several hours.

### 4.1 Nonlinear oscillator driven by correlated stochastic noise

Consider the nonlinear oscillator,

 ¨x+δ˙x+F(x)=ζ(t), (13)

forced by a stationary, colored noise with correlation function and the nonlinear restoring term given by

 F(x)=⎧⎨⎩αx,0≤|x|≤x1αx1,x1<|x|≤x2αx1+β(x−x2)3,x2≤|x|. (14)

Since the system is stochastically forced it is necessary to use an expansion to obtain a parameterization in terms of a finite number of random variables. We use a Karhunen-Loève expansion (see appendix E) to obtain

 ¨x(t)+δ˙x(t)+F(x(t))=m∑i=1θi(ω)ei(t),t∈[0,T], (15)

which is truncated to a suitable number . For illustration, we take our quantity of interest as the average value of the response, so that the parameter-to-observation map is defined by

We consider a three term truncation . The system parameters are given by , , , , and the forcing parameters are and , with . For comparisons the exact pdf is obtained by sampling the true map from points on a grid. In Fig. 3 we illustrate the sampling as determined by the proposed algorithm in addition to the log error between the exact pdf and the GPR mean prediction. In these simulations we start from a dataset of points selected according to a Latin-Hypercube (LH) design. In order to capture the main mass of the pdf, before focusing on the tails of the distribution, we perform iterations using the error metric before moving on to the criterion using the error of the logarithms of the pdfs. Observe the unique shape that the sampling algorithm has identified in space, which spans regions in associated with finite probability and large values of . Fig. 4 demonstrates the progression of the estimated pdf as a function of the iteration count. Even after only samples we have already captured the qualitative features of the exact pdf and have very good quantitative agreement.

We have explored the convergence properties of the algorithm and in Fig. 5 we compare the proposed sampling method to space-filling Latin Hypercube sampling. The LH strategy is not iterative and thus must be started anew, which puts the LH sampling at a large disadvantage. Nonetheless, this serves as a benchmark to a widely used reference method for the design of experiments due to its simplicity. In the figure, the purple curve represents the mean LH design error and the shaded region represents the standard deviation about the mean, which are computed by evaluating 250 number of random LH designs per fixed dataset size. Even considering the variance of the LH curve, the proposed algorithm under various parameters (initial dataset size or number of ‘core’ iterations where the criterion uses the metric) is observed to outperform the LH strategy by nearly an order of magnitude in the error of the logarithm of the pdfs. This demonstrates the favorable properties of the proposed sampling strategy for accurately estimating the tail of target distribution.

### 4.2 Hydrodynamic forces and moments on an offshore platform

Here we apply the sampling algorithm to compute the probability distributions describing the loads on an offshore platform in irregular seas. The response of the platform is quantified through direct, three-dimensional numerical simulations of Navier-Stokes utilizing the smoothed particle hydrodynamics (SPH) method [5] (Fig. 6). Our numerical setup parallels that of a physical wave tank experiment and consists of a wave maker on one end and a sloping ‘beach’ on the other end of the tank to quickly dissipate the energy of incident waves and avoid wave reflections. Further details regarding the simulations are provided in appendix F.

Wind generated ocean waves are empirically described by their energy spectrum. Here, we consider irregular seas with JONSWAP spectral density (see appendix F for details and parameters). While realizations of the random waves have the form of time series, an alternative description can be obtained by considering a sequence of primary wave groups, each characterized by a random group length scale and height (see e.g.  [3]). This formulation allows us to describe the input space through just two random variables (much fewer than what we would need with a Karhunen-Loeve expansion). Following [3] we describe these primary wavegroups by the representation which is an explicit parameterization in terms of and . Thus, and correspond to and in the notation of Eq. section 2. The statistical characteristics of the wave groups associated with a random wave field (such as the one given by the JONSWAP spectrum) can be obtained by applying the scale-selection algorithm described in [4]. Specifically, by generating many realizations consistent with the employed spectrum we use a group detection algorithm to identify coherent group structures in the field along with their lengthscale and amplitude ( and ). This procedure provides us with the empirical probability distribution of the wave field and thus a nonlinear parametrization of the randomness in the input process.

The quantities of interest in this problem are the forces and moments acting on the platform. The incident wave propagates in the direction and as such we consider the pdf of the force in the direction and the moment about the bottom-center of the platform:

 qf=maxt∈[0,T]|Fx(t)|andqm=maxt∈[0,T]|My(t)|. (16)

In Fig. 6 we show the results of the progression of density prediction for the force variable. In these experiments we begin by arbitrary selecting initial sample points from a Latin Hypercube sampling strategy. Next, we perform iterations using the distance metric to quickly capture the main mass of the distribution before focusing on the distribution away from the mean that utilizes the metric of the logarithmic of the pdf. The lightly shaded red region in the pdf plots is a visualization of the uncertainty in the pdf, obtained by sampling the GPR prediction and computing the pdf for realizations and then computing the upper (lower) locus of the maximum (minimum) value of the pdf at each value. The figures demonstrate that with (i.e total sample points) iterations (together with the samples in the initial configuration) we are able to approximate the pdf to good agreement with the ‘exact’ pdf, which was computed from a densely sampled grid. In appendix F we also present the sampled map where it can be seen that the algorithm selects points associated with large forces and non-negligible probability of occurrence. In the same figures results for the momentum and an additional spectrum are included. Note, for this problem the GP regression operates on the logarithm of the observable because the underlying function is always positive.

## 5 Conclusions

We developed and analyzed a computational algorithm for the evaluation of extreme event statistics associated with nonlinear dynamical systems that depend on a set of random parameters. The algorithm is practical even for very high dimensional systems but with parameter spaces of moderate dimensionality and it provides a sequence of points that lead to improved estimates of the probability distribution for a scalar quantity of interest. The criterion for the selection of the next design point emphasizes the tail statistics. We have proven asymptotic convergence of the algorithm and provided analysis for its asymptotic behavior. We have also demonstrated its applicability through two problems, one of them involving a demanding system with millions degrees of freedom.

## Acknowledgments

This work has been supported through the ONR grant N00014-15-1-2381, the AFOSR grant FA9550-16- 1-0231, and the ARO grant W911NF-17-1-0306. We are grateful to the referees for providing numerous suggestions that led to important improvements and corrections. We also thank A. Crespo for helpful comments regarding the SPH simulations.

## References

• [1] Corrado Altomare, Alejandro J. C. Crespo, Jose M. Domnguez, Moncho Gómez-Gesteira, Tomohiro Suzuki, and Toon Verwaest. Applicability of smoothed particle hydrodynamics for estimation of sea wave impact on coastal structures. Coastal Engineering, 96:1–12, 02 2015.
• [2] Nan Chen and Andrew J. Majda. Simple stochastic dynamical models capturing the statistical diversity of el niño southern oscillation. Proceedings of the National Academy of Sciences, 114(7):1468–1473, 2017.
• [3] Will Cousins and Themistoklis P. Sapsis. Unsteady evolution of localized unidirectional deep-water wave groups. Physical Review E, 91:063204, 2015 06.
• [4] Will Cousins and Themistoklis P. Sapsis. Reduced-order precursors of rare events in unidirectional nonlinear water waves. Journal of Fluid Mechanics, 790:368–388, 02 2016.
• [5] A. J. C. Crespo, J. M. Domnguez, B. D. Rogers, M. Gómez-Gesteira, S. Longshaw, R. Canelas, R. Vacondio, A. Barreiro, and O. Garca-Feal. DualSPHysics: Open-source parallel CFD solver based on smoothed particle hydrodynamics (sph). Computer Physics Communications, 187:204–216, 02 2015.
• [6] Timothy A. Davis and William W. Hager. Row modifications of a sparse Cholesky factorization. SIAM Journal on Matrix Analysis and Applications, 26(3):621–639, 2005.
• [7] Giovanni Dematteis, Tobias Grafke, and Eric Vanden-Eijnden. Rogue waves and large deviations in deep sea. Proceedings of the National Academy of Sciences, 2018.
• [8] Jean-Pierre Fouque, George Papanicolaou, Ronnie Sircar, and Knut Sølna. Multiscale Stochastic Volatility for Equity, Interest-Rate and Credit Derivatives. Cambridge University Press, 09 2011.
• [9] Christian L. E. Franzke. Extremes in dynamic-stochastic systems. Chaos, 27(1):012101, 2017.
• [10] Peter Frigaard, Michael Høgedal, and Morten Christensen. Wave generation theory, 06 1993.
• [11] Andreas Hense and Petra Friederichs. Wind and precipitation extremes in the Earth’s atmosphere, pages 169–187. Springer Berlin Heidelberg, 2006.
• [12] Paul C. Liu. A chronology of freaque wave encounters. Geofizika, 24(1):57–70, 2007.
• [13] Andrew J. Majda. Introduction to Turbulent Dynamical Systems in Complex Systems. Springer International Publishing, 2016.
• [14] A. Masud and L. A. Bergman. Solution of the four dimensional Fokker-Planck equation: still a challenge. In ICOSSAR, volume 2005, pages 1911–1916, 2005.
• [15] Mustafa A. Mohamad, Will Cousins, and Themistoklis P. Sapsis. A probabilistic decomposition-synthesis method for the quantification of rare events due to internal instabilities. Journal of Computational Physics, 322:288–308, 10 2016.
• [16] Mustafa A. Mohamad and Themistoklis P. Sapsis. Probabilistic description of extreme events in intermittently unstable dynamical systems excited by correlated stochastic processes. SIAM/ASA Journal on Uncertainty Quantification, 3(1):709–736, 01 2015.
• [17] Arvid Naess and Torgeir Moan. Stochastic Dynamics of Marine Structures. Cambridge University Press, 2013.
• [18] Mario Nicodemi. Extreme value statistics, pages 1066–1072. Springer New York, 2012.
• [19] Athanasios Papoulis and S. Unnikrishna Pillai. Probability, Random Variables and Stochastic Processes. McGraw-Hill Education, 4 edition, 2002.
• [20] Grigorios A. Pavliotis. Stochastic Processes and Applications, volume 60 of Texts in Applied Mathematics. Springer-Verlag New York, 2014.
• [21] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. MIT Press, 2005.
• [22] Kazimierz Sobczyk. Stochastic Differential Equations. Mathematics and Its Applications (East European Series). Springer Netherlands, 1991.
• [23] Andrew M. Stuart and Aretha L. Teckentrup. Posterior consistency for gaussian process approximations of bayesian posterior distributions. ArXiv e-prints, 2016.
• [24] S. R. Srinivasa Varadhan. Large Deviations and Applications. CBMS-NSF Regional Conference Series in Applied Mathematics. Society for Industrial and Applied Mathematics, 1984.
• [25] Holger Wendland. Scattered Data Approximation, volume 17 of Cambridge Monographs on Applied and Computational Mathematics. Cambridge University Press, 2004.

## Appendix A Overview of Gaussian process regression

An important component of our algorithm is the construction of a surrogate for the map . We utilize the Gaussian process regression (GPR) method for this purpose. A feature of critical importance is that GPR specifies the posterior distribution as a Gaussian random function, with explicit formulas for the posterior mean and kernel function. The variance of the posterior can be used as an error or uncertainty estimate of the current prediction, which can in turn be used to guide optimization and explore parameter space. We briefly provide an overview of GPR since it’s a crucial component of our proposed method, but refer to the book [21] and numerous other references in the literature for further details.

We estimate the parameter-to-observation map, with , via a GPR scheme from an observed dataset , using design points. These are the points that we have already sampled. Specifically, to estimate we place a Gaussian process prior over and consider the function values as a realization of the GP. In particular, with , we have the following posterior mean and covariance :

 Tn(θ) =¯¯¯T(θ)+k0(θ,Θ)Tk0(Θ,Θ)−1(^T(Θ)−¯¯¯T(Θ)) (17) kn(θ,θ′) =k0(θ,θ′)−k0(θ,Θ)Tk0(Θ,Θ)−1k0(Θ,θ′) (18)

where,

• is an arbitrary regression mean function, often chosen to be a constant or zero,

• is the regression covariance function, here the exponential squared kernel, with and being positive hyperparameters,

• is the covariance matrix, with the entry given by , and , are

-dimensional vectors with the

entries given by , , respectively, and

• is the variance at .

There are several important properties to emphasize in the GPR scheme related to the posterior [23]

. Firstly, for any choice of the regression function the GPR mean estimate is an interpolant of the exact map at the design points, that is

. Another property to note in the sequential framework, is that since is positive definite:

 σn(θ)≤σn−1(θ)≤…≤σ0(θ),θ∈U, (19)

in other words, additional data points lead to non-increasing local variance. Moreover, at each of the design points the estimated variance vanishes, i.e. , for .

We recall an important result concerning the convergence of the posterior mean estimate of  [23, 25]. This result is used in appendix D to prove asymptotic convergence of the proposed sampling method.

###### Proposition 1.

Let be a bounded Lipschitz domain that satisfies an interior cone condition. Also, let be the mean given by the GPR method. Then there exists constants and , independent of , , or such that

 ∥Tn−^T∥L2≤ChpΘ∥^T∥L2, (20)

where is the fill distance defined as .

As shown in [23], convergence of the mean estimate implies convergence of the estimated variance to zero, i.e.

 σn(θ)→0asn→∞andhΘ→0. (21)

For uniform tensor grids

the fill distance is of the order and thus the convergence of the map to the truth, in this case, is at least of order . The precise rate of convergence may differ, depending on the exact criterion used.

## Appendix B Iterative updates to a Gaussian process emulator with fixed hyperparameters

At every iteration a new data point is added to the design matrix . This necessitates refactorization of the covariance matrix . Since the Cholesky decomposition in general costs , we would like to update the factorization without recomputing the Cholesky decomposition from scratch at each iteration. We can update the Cholesky factorization for a new data point at cost if the covariance matrix has fixed parameters (efficient formula’s for the removal of a data point and similarly for block updates can be developed following the same approach, see e.g. [6]).

Denote the current covariance matrix with data points by with Cholesky factorization and the covariance matrix after addition of a new design point by . We can write,

 (22)

so that

 C =[A1AT100T0]=[C1100T0], (23) ¯¯¯¯C =[A1AT1A1¯a2¯aT2A1¯aT2¯a2]=[C11c21cT21c22], (24)

where is the covariance of the new data point with the dataset and is variance of the data point. The matrix can be decomposed as

 (25)

and thus

 L11LT11 =C11, (26) L11ℓ21 =c21⟹ℓ21=L11∖c21, (27) ℓT21ℓ21+ℓ222 =c22⟹ℓ22=√c22−ℓT21ℓ21, (28)

which we solve for and at costs due to the linear system solve in the second equation above ( is already known from the previous iteration).

## Appendix C Algorithm pseudocode

Below we summarize the main loop of the sequential algorithm in pseudocode:

input initial dataset
repeat
predict Gaussian process mean and variance
integration using , , and

Append to dataset
until desired error level
return

This is an iterative procedure that leads to a series of pdfs for the quantity of interest which, as we show in appendix D, converges to the true pdf under appropriate conditions. The function that computes the distance criterion at a test point is summarized below:

function (, , )
Append to dataset
predict Gaussian process variance
integration using , , and
return

Above is a user defined distance metric in the stopping condition and .

## Appendix D Convergence and asymptotic behavior of the optimization criterion

Here we provide details on the proposed sampling algorithm’s convergence and the asymptotic behavior of the proposed criterion, which elucidate its favorable extreme event sampling properties. We first prove the convergence of the estimated probability distribution to the true distribution in the large sample size limit. This is under the assumption that the sampling criterion does not lead to resampling, i.e. we have as . Under this condition the result in appendix A guarantees convergence of the map. We show that convergence of the map implies convergence of the sequence of the transformed pdfs .

###### Theorem 2.

Let the sequence of random variables be denoted by with cumulative distribution function . Moreover, assume that the sampling criterion does not lead to resampling of existing points. Then the convergence of the sequence of maps to by proposition 1 implies converges in probability to , and this then implies converges in distribution to .

###### Proof.

The assumption that the criterion does not lead to resampling of existing samples guarantees as and thus the sequence converges in to by proposition 1.

We show that convergence implies convergence in measure. For every , by Chebyshev inequality, we have

 ∥Tn−^T∥2L2≥E(|Tn−^T|2)≥ϵ2P(|Tn−^T|≥ϵ). (29)

Therefore since the left hand side tends to zero as , for every ,

 limn→∞P(|Tn−^T|≥ϵ)=0,

thus converges in probability to . This also implies converges in distribution to : , for all that are points of continuity of (see [19]). ∎

The next result is on the asymptotic behavior of the criterion. Specifically, we show that for large , so that the distance between the upper and lower map bounds decrease and become small, the proposed criterion measures and attempts to minimize the variance of the surrogate map. We show that the metric appropriately weights regions with large and small probability with more importance.

###### Theorem 3.

Let and from the GPR scheme be sufficiently smooth functions of . The asymptotic behavior of for large (ensuring small ) and small is given by

 ˜Qn ≜12∫∣∣logf+n(s)−logf−n(s)∣∣ds (30) ≈∫∣∣∣ddsE(σn(θ)⋅1Tn(θ)=s)fn(s)∣∣∣ds, (31)

where denotes the expectation over the probability measure .

###### Proof.

To simplify notation we drop the index in what follows. We want to determine the asymptotic behavior of the integral for . In this case, we can assume that is uniformly bounded. We focus on the integrand and determine the behavior of the difference appearing in the integrand. First focus on the cumulative distribution function difference

 F+(s)−F−(s)=−∫A−(s)∖A+(s)fθ(θ)dθ, (32)

where denotes the difference between the sets . Now since is equivalent to ,

 F+(s)−F−(s) =−∫T(θ)∈(s−σ(θ),s+σ(θ))fθ(θ)dθ, (33) (34)

where denotes surface integration and is on the surface . The functions satisfy

 (35)

where is the unit normal to the surface . Note by Taylor theorem we have

 (36) σ(¯θ+nt) (37)

for , and therefore from (35) we find,

 (38)

We then have

 h−−h+ =2σ∥∇T∥(1−(∇σ⋅∇T)2∥∇T∥3)+O(σ2) (39) =2σ∥∇T∥+O(∥∇σ∥∥∇T∥,σ2), (40)

where for the second equality we use the Cauchy–Schwarz inequality to bound the error in the last line. From this result and (34) we obtain,

 f+−f− (41) (42) (43) ≈−2ddsE(σ(θ)⋅1T(θ)=s) (44)

Next, note the following approximation

 logf+−logf−=log(1+Δf/f−)≈Δf/f−≈Δf/f, (45)

where , under the small assumption. We therefore have the following result on the behavior of the difference between and

 logf+(s)−logf−(s)≈−2ddsE(σ(θ)⋅1T(θ)=s)f(s), (46)

which implies (30). ∎

Note that the pdf in the denominator in the integrand (10) is a direct implication of our choice to consider the difference between the logarithmic transform of the pdfs. This guarantees that values of the parameter where the probability is small (rare events) are sampled sufficiently to capture tail behavior of the map’s pdf. This is seen clearly in the following corollary, where we consider the special case of a one-dimensional parameter space and a family of maps which are monotonically increasing.

###### Corollary 2.

Let be a one-dimensional random variable and in addition to the assumptions of Theorem 1 we assume that is monotonically increasing function. Then, the asymptotic value of for large has the following property:

 ˜Qn≳|∫Uσn(θ)d(logfθ(θ))+∫Uσn(θ)d(logT′(θ))+σn(u2)−σn(u1)|. (47)
###### Proof.

For simplicity, we drop the index in the following. From the proof of Theorem 1 and using standard inequalities we obtain

 ˜Q≈∫∣∣ ∣∣dds∫T(θ)=sσ(θ)fθ(θ)dθf(s)∣∣ ∣∣ds≥∣∣ ∣∣∫dds∫T(θ)=sσ(θ)fθ(θ)dθf(s)ds∣∣ ∣∣=∣∣ ∣ ∣∣∫d2ds2∫T(θ)≤sσ(θ)fθ(θ)dθdds∫T(θ)≤sfθ(θ)dθds∣∣ ∣ ∣∣ (48)

Since we assume by applying the transformation to the numerator of the right hand side we obtain

 d2ds2∫T(θ)≤sfθ(θ)σ(θ)dθ =d2ds2∫sfθ(T−1(s))σ(T−1(s))T′(T−1(s))ds (49) =dds(fθ(T−1(s))σ(T−1(s))T′(T−1(s))), (50)

which is equivalent to

 f′θ(T−1(s))σ(T−1(s))(T′(T−1(s)))2+fθ(T−1(s))T′(T−1(s))(σ(T−1(s))T′(T−1(s)))′. (51)

Similarly, we have the following for the denominator term:

 dds∫T(θ)≤sfθ(θ)dθ=dds∫sfθ(T−1(s))T′(T−1(s))ds=fθ(T−1(s))T′(T−1(s)). (52)

Therefore we obtain,

 ˜Q ≳∣∣∣∫T(u2)T(u1)[f′θ(θ)fθ(θ)σ(θ)T′(θ)+(σ(θ)T′(θ))′]θ=T−1(s)ds∣∣∣=∣∣∣∫u2u1[f′θ(θ)fθ(θ)σ(θ)−T′′(θ)T′