Using coupling methods to estimate sample quality for stochastic differential equations

12/21/2019 ∙ by Matthew Dobson, et al. ∙ University of Massachusetts Amherst 0

A probabilistic approach for estimating sample qualities for stochastic differential equations is introduced in this paper. The aim is to provide a quantitative upper bound of the distance between the invariant probability measure of a stochastic differential equation and that of its numerical approximation. In order to extend estimates of finite time truncation error to infinite time, it is crucial to know the rate of contraction of the transition kernel of the SDE. We find that suitable numerical coupling methods can effectively estimate such rate of contraction, which gives the distance between two invariant probability measures. Our algorithms are tested with several low and high dimensional numerical examples.



There are no comments yet.


page 18

page 25

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1. Introduction

Stochastic differential equations (SDEs) are widely used in many scientific fields. Under mild assumptions, an SDE would admit a unique invariant probability measure, denoted by

. In many applications including but not limited to Markov chain Monte Carlo and molecular dynamics, it is important to sample from

[1, 28]. This is usually done by either numerically integrating an SDE over a very long trajectories or integrating many trajectories of the SDE over a finite time [37]. However, a numerical integrator of the SDE typically has a different invariant probability measure, denoted by , that depends on the time discretization [43, 44, 41]. A natural question is that, how is different from ? In other words, what is the quality of data sampled from a numerical trajectory of the SDE? This is very different from the classical truncation error analysis, which is only applicable for finite time intervals except some special cases [27, 10].

Theoretically, it is well known that the distance between and can be controlled if we have good estimate of (i) the finite time truncation error and (ii) the rate of geometric ergodicity of the SDE. Estimates of this type can be made by various different approaches [34, 35, 8, 5, 6]. Roughly speaking, if the truncation error over a finite time interval is , and the rate of geometric ergodicity is (i.e. speed of convergence to is for ), then the difference between and is . (See our discussion in Section 3.1 for details.) However, these approaches can not give a quantitative estimate in general, as the rate of geometric ergodicity estimated by rigorous approaches are usually very far from being sharp. Many approaches such as the Lyapunov function method can only rigorously show that the speed of convergence is for some [36, 16, 17]. Looking into the proof more carefully, one can easily find that this has to be extremely close to to make the proof work. This gives a very large and makes rigorous estimates difficult to use in practice. To the best of our knowledge, quantitative estimates of convergence rate can only be proved for a few special cases like stochastic gradient flow and Langevin dynamics [14, 7, 2].

The aim of this paper is to provide some algorithms to numerically estimate the distance between and . The finite time truncation error over a time interval

is estimated by using extrapolations, which is a common practice in numerical analysis. The main novel part is the estimation of the rate of contraction of the transition kernel. Traditional approaches for computing the rate of geometric ergodicity are either computing principal eigenvalue of the discretized generator or estimating the decay rate of correlation. The eigenvalue method works well in low dimension but faces significant challenge if the SDE is in dimension

. The correlation decay is difficult to estimate as well, because a correlation has exponentially small expectation and large variance. One needs a huge amount of samples to estimate it effectively. In addition, exponential decay of correlation with respect to an ad-hoc observable is usually not very convincing. In this paper, we propose to estimate the rate of contraction of the transition kernel by using a coupling technique.

Coupling methods have been used in rigorous proofs for decades [31, 32, 13, 39]. The idea is to run two trajectories of a random process

, such that one is from a given initial distribution and the other is stationary. A suitable joint distribution, called a coupling, is constructed in the product space, such that two marginals of this joint process are the original two trajectories. If after some time, the two processes stay together with high probability, then the law of

must be very close to its invariant probability measure. It is well known that the coupling lemma gives bounds of both total variation norm and some 1-Wasserstein-type distances. In this paper, we use the coupling method numerically. If two numerical trajectories meet each other, they are coupled and evolve together after coupling. By the coupling lemma, the contraction rate of the transition kernel can be estimated numerically by computing the probability of successful coupling, which follows from running a Monte Carlo simulation. Together with the finite time error, we can estimate the distance between and . The main advantage of coupling method is that it is relatively dimension-free, and we demonstrate our technique on an SDE system in in Section 4.5.

We provide two sets of algorithms, one for a quantitative upper bound and the other for a rough but quick estimate. To get the quantitative upper bound, one needs an upper bound of the contraction rate of 1-Wasserstein distance for all pairs of initial values starting from a certain compact set . This is done by applying extreme value theory. More precisely, we uniformly sample initial values from and compute the contraction rate by using the coupling method. Then the upper bound of such contraction rate can be obtained by numerically fitting a generalized Pareto distribution (GPD) [9, 3]. In practice, one may want a low cost estimate for the quality of samples. Hence we provide a “rough estimate” that only uses the exponential tail of the coupling probability as the rate of contraction of the generator after a given time . This rough estimate differs from the true upper bound by an unknown constant, but it is more efficient and works well empirically.

Our coupling method can be applied to SDEs with degenerate random terms after suitable modifications. This is done by comparing the overlap of the probability density functions after two or more steps of the numerical scheme. Our approach is demonstrated on a Langevin dynamics example in Section 

4.3. It is known from [14, 7] that a suitable mixture of reflection coupling and synchronous coupling can be used for Langevin equation. We find that this approach can be successfully combined with the “maximal coupling” for the numerical scheme. However, for SDEs with very degenerate noise, using the coupling method remains to be a great challenge.

We test our algorithm with a few different examples, from simple to complicated. The sharpness of our algorithm is checked by using a “ring density example” whose invariant probability density function can be explicitly given. Then we demonstrate the use of coupling method under degenerate noise by working with a 4D Langevin equation. Next we show two examples whose numerical invariant probability differs significantly from true invariant probability measure . One is an asymmetric double well potential whose transition kernel has a slow rate of convergence. The other example is the Lorenz 96 model whose finite time truncation error is very difficult to control due to intensive chaos. Finally, we study a coupled FizHugh-Nagumo oscillator model proposed in [11, 24] to demonstrate that our algorithm works reasonably well in high dimensional problems.

The organization of this paper is as follows. Section 2 serves as the probability preliminary, in which we review some necessary background about the coupling method, stochastic differential equations, and numerical SDE schemes. The main algorithm is developed in Section 3. All numerical examples are demonstrated in Section 4. Section 5 is the conclusion.

2. Probability preliminary

In this section, we provide some necessary probability preliminaries for this paper, which are about the coupling method, stochastic differential equations, numerical stochastic differential equations, and convergence analysis.

2.1. Coupling

This subsection provides the definition of coupling of random variables and Markov processes.

Definition 2.1 (Coupling of probability measures).

Let and be two probability measures on a probability space . A probability measure on is called a coupling of and , if two marginals of coincide with and respectively.

The definition of coupling can be extended to any two random variables that take value in the same state space.

Definition 2.2 (Markov Coupling).

A Markov coupling of two Markov processes and with transition kernel is a Markov process on the product state space such that

  • The marginal processes and are Markov processes with transition kernel , and

  • If , we have for all .

Markov coupling can be defined in many different ways. For example, let be the transition kernel of a Markov chain on a countable state space , the following transition kernel for on such that

is called the independent coupling. Paths of the two marginal processes are independent until they first meet. In the rest of this paper, unless otherwise specified, we only consider Markov couplings.

2.2. Wasserstein distance and total variation distance

In order to give an estimate for the coupling of Markov processes, we need the following to metrics.

Definition 2.3 (Wasserstein distance).

Let be a metric on the state space . For probability measures and on , the Wasserstein distance between and for is given by

For the discussion in our paper, unless otherwise specified, we will use the Wasserstein distance with the distance

Definition 2.4 (Total variation distance).

Let and be probability measures on . The total variation distance of and is given by

2.3. Coupling Lemma

In this subsection, we provide the coupling inequalities for the approximation of a coupling of Markov processes.

Definition 2.5 (Coupling time).

The coupling time of two stochastic processes and is a random variable given by

Definition 2.6 (Successful coupling).

A coupling of and is said to be successful if

or equivalently,

For all Markov couplings, we have the following two coupling inequalities:

Lemma 2.7 (Coupling inequality w.r.t. the total variation distance).

For the coupling given above, we have


For any ,

where the second equality follows from cancelling the probability

By the arbitrariness of , the lemma is proved. ∎

Lemma 2.8 (Coupling inequality w.r.t. the Wasserstein distance).

For the coupling given above and the Wasserstein distance induced by the distance given in (2.1), we have


By the definition of the Wasserstein distance,

where is the specific distance given in (2.1). ∎

2.4. Stochastic differential equations (SDE)

We consider the following stochastic differential equation (SDE) with initial condition that is measurable with respect to



is a continuous vector field in

, is an matrix-valued function, and

is the white noise in

. The following theorem is well known for the existence and uniqueness of the solution of equation (2.3) [33].

Theorem 2.9.

Assume that there are two positive constants and such that the two functions and in (2.3) satisfy

  1. (Lipschitz condition) for all and

  2. (Linear growth condition) for all and


Then there exists a unique solution to equation (2.3) in .

In addition, we assume that admits a unique invariant probability measure . The existence and uniqueness of usually follows from some drift condition plus some suitable irreducibility conditions [25, 19].

2.5. Numerical SDE

In this subsection, we talk about the numerical scheme we use for sampling stochastic differential equations (2.3). The Euler–Maruyama approximation of the solution of (2.3) is given by


where , , , and and are mutually independent Gaussian random vectors.

We have the following convergence rate for the Euler–Maruyama approximation (see [33])

Theorem 2.10.

Assume that the Lipschitz condition (2.4) and the linear growth condition (2.5) hold. Let be the unique solution of equation (2.3), and be the Euler–Maruyama approximation for . Then


where is a constant depending only on and .

Namely, the Euler–Maruyama approximation provides a convergence rate of order .

A commonly used improvement of the Euler–Maruyama scheme is called the Milstein scheme, which reads


where , and is an matrix with its -th component being the double Itô integral

and is a vector of operators with th component

Under suitable assumptions of Lipschitz continuity and linear growth conditions for some functions of the coefficients and , the Milstein scheme [26] is an order strong approximation.

Theorem 2.11 ([26]).

Under suitable assumptions, we have the following estimate for the Milstein approximation ,

where is a constant independent of .

It is easy to see that when is a constant matrix, the Euler–Maruyama scheme and the Milstein scheme coincide. In other words the Euler–Maruyama scheme for constant also has a convergence rate of order . There are also strong approximations of order or that are much more complicated to implement. We refer interested readers to [26].

2.6. Extreme value theory

This subsection introduces some extreme value theory that is relevant to materials in this paper.

Definition 2.12 (Generalized Pareto distribution).

A random variable is said to follow a generalized Pareto distribution

, if its cumulative distribution function is given by

The generalized Pareto distribution is used to model the so-called peaks over threshold distribution, that is, the part of a random variable over a chosen threshold , or the tail of a distribution. Specifically, for a random variable , consider the random variable conditioning on that the threshold is exceeded. Its conditional distribution function is called the conditional excess distribution function and is denoted by

The extreme value theory proves the following theorem.

Theorem 2.13 ([4, 40]).

For a large class of distributions (e.g., uniform, normal, log-normal, ,

, gamma, beta distributions), there is a function

such that

where is the rightmost point of the distribution.

This theorem shows that the conditional distribution of peaks over threshold can be approximated by the generalized Pareto distribution. In particular, if the parameter fitting of gives , we can statistically conclude that the random variable is has a bounded distribution with an upper bound . In this paper, we will employ this method to estimate the upper bound of quantities of interest.

3. Description of algorithm

3.1. Decomposition of error terms

Let and be the stochastic process given by equation (2.3) and its numerical scheme, respectively. Let and be two corresponding transition kernels. The time step size of is unless otherwise specified. Hence in , only takes values . Let be a fixed constant. Let be the 1-Wasserstein distance of probability measures induced by distance

unless otherwise specified.

Denote the invariant probability measures of and by and respectively. The following decomposition follows easily by the triangle inequality and the invariance. (This is motivated by [22]. Similar approaches are also reported in [42, 38].)


If the transition kernel has enough contraction such that

for some , we have

In other words the distance can be estimated by computing the finite time error and the speed of contraction of . Theoretically, the finite time error can be given by the strong approximation of the truncation error of the numerical scheme of equation (2.3). The second term comes from the geometric ergodicity of the Markov process . As discussed in the introduction, except some special cases, the rate of geometric ergodicity of can not be estimated sharply. As a result one can only have an that is extremely close to . This makes decomposition (3.1) less interesting in practice. Therefore, we need to look for suitable numerical estimators of the two terms in equation (3.1).

The other difficulty comes from the fact that both and are defined on an unbounded domain. However, large deviations theory guarantees that the mass of both and should concentrate near the global attractor [25, 15]. Similar concentration estimates can be made by many different approaches [30, 20]. Therefore, we assume that there exists a compact set and a constant , such that


In practice, can be chosen to be the set that contains all samples of a very long trajectory of , and is the reciprocal of the length of this trajectory. This is usually significantly smaller than all other error terms.

This allows us to estimate the contraction rate for initial values in a compact set. Let be the optimal coupling plan such that

Let be the transition kernel on that gives a Markov coupling of two trajectories of . By the assumption of , we have


where is the minimum contracting rate of on such that

where and are two margins of .

Combine equations (3.1) and (3.3), we have


3.2. Estimator of error terms

From equation (3.4), we need to numerically estimate the finite time error and the contraction rate . We propose the following approach to estimate these two quantities.

Extrapolation for finite time error. By the definition of 1-Wasserstein distance, we have

for any coupling measure . A suitable choice of that can be sampled easily is , where is the coupling measure of on the “diagonal” of

that is supported by the hyperplane

Theoretically, we can sample an initial value , run and up to time , and calculate . However, we do not have exact expressions for and . Hence in the estimator, we use to replace and use extrapolation to estimate . The idea is that

can be approximated by

as this only induces a higher order error . Then the integral above can be estimated by sampling such that . The distance can be obtained by extrapolating , where is the random process of the same numerical scheme with the same noise term but time step size. This gives Algorithm 1.

Input: Initial value
Output: An estimator of
for i = 1 to N do
     Using the same noise, simulate and with initial value up to
     Let , where is a constant depending on the order of accuracy
end for
Algorithm 1 Estimate finite time error

When is sufficiently large, in Algorithm 1 are from a long trajectory of the time- skeleton of . Hence are approximately sampled from . The error term for estimates . Therefore,

estimates the integral

which is an upper bound of . The constant in Algorithm 1 depends on the order of accuracy. For example, if the numerical scheme is a strong approximation with order error , then because when .

One advantage of Algorithm 1 is that it can run together with the Monte Carlo sampler. The trajectory of can not be recycled. But the trajectory of can be used to estimate either the invariant density or the expectation of an observable.

Coupling for contraction rate. The idea of estimating is to use coupling. We can construct a Markov process such that is a Markov coupling of and . Then as introduced in Section 2, the first passage time to the “diagonal” hyperplane is the coupling time, which is denoted by . It then follows from Lemma 2.8 that

Then we can use extreme value theory to estimate . The idea is to uniformly sample initial values from , and define . Then is actually a random variable whose sample can be easily computed. We use extreme value theory to estimate an upper bound for , and denote it by . See Algorithm 2 for details. The threshold in Algorithm 2 is usually chosen such that approximately samples are greater than this threshold.

Input: A compact set
Output: An estimator of the minimal contraction rate .
for i = 1 to N do
     Sample pairs uniformly from
     Set ,
     for j = 1 to M do
         Run with initial value until
         if  then
         end if
     end for
end for
if  then
     The estimator fails. Choose better coupling algorithm or larger
     Let for all
     Choose a threshold
     Use generalized Pareto distribution to fit to get two parameters .
     if  then
         The estimator fails. Choose better coupling algorithm or larger
         Let .
     end if
end if
Return as the estimator of .
Algorithm 2 Estimate contraction rate

If running successfully, Algorithms 1 and 2 give us an upper bound of according to equation (3.4), which can be used to check the quality of samples.

Construction of . It remains to construct a coupling scheme that is suitable for the numerical trajectory . In this paper we use the follow two types of couplings.

Denote two margins of by and respectively. The Reflection coupling means two noise terms of and in an update are always symmetric. For the case of Euler-Maruyama scheme and constant coefficient matrix as an example. A reflection coupling gives the following update from to :


where is a normal random variable with mean and variance , and

is a unit vector. It is known that reflection coupling is the optimal coupling for Brownian motion in [32, 18]. Empirically it gives fast coupling rates for many stochastic differential equations with non-degenerate noise.

The maximal coupling looks for the maximal coupling probability for the next step (or next several steps) of the numerical scheme. Assume and are both known. Then it is easy to explicitly calculate the probability density function of and , denoted by and respectively. The update of and is described in Algorithm 3. This update maximizes the probability of coupling at the next step. This is similar to the maximal coupling method implemented in [21, 23].

Input: and
Output: , , and if coupled.
Sample and according to the numerical scheme using independent noise
Compute probability density functions and .
Draw from uniform 0-1 distribution.
if u ¡ r then
     Use and sampled before
      is undetermined.
end if
Algorithm 3 Maximal coupling

In practice, we use reflection coupling when and are far away from each other, and maximal coupling when they are sufficiently close. This method fits discrete-time numerical schemes, as using reflection coupling only will easily let two processes miss each other. In our simulation code, the threshold of changing coupling method is . When the distance between and is smaller than this, we use maximal coupling.

3.3. A fast estimator

In practice, Algorithm 1 can be done together with the Monte Carlo sampler to compute either an observable or the invariant probability density function. The extra cost comes from simulating trajectories of , which takes time of running trajectories of . The main overhead of the above mentioned methods is algorithm 2. Because we want a quantitative upper bound of , the contraction rate of in 1-Wasserstein space with respect to all pairs of initial points in must be estimated. In practice, this takes a long time because one needs to run many () independent trajectories from each initial point to estimate the coupling probability.

In practice, if one only needs a rough estimate about the sample quality instead of a definite upper bound of the 1-Wasserstein distance, Algorithm 2 can be done in a much easier way by estimating the exponential rate of convergence of . It is usually safe to assume that the rate of exponential contraction is same for all “reasonable” initial distributions. In addition, the contraction rate is bounded from below by the exponential tail of the coupling time distribution

. Therefore, we only need to sample some initial points uniformly distributed in

and estimate the exponential tail of the coupling time distribution. This gives an estimate

where denotes the uniform distribution on , and can be obtained by a linear fit of versus in a log-linear plot. Then we have a rough estimate of given by


where is estimated by Algorithm 1.

Equation (3.6) usually differs from the output of Algorithm 1, Algorithm 2, and equation (3.4) by an unknown multiplicative constant. However, in practice, this is usually sufficient for us to predict the quality of Monte Carlo sampler with relatively low computational cost. In numerical examples we will show that, it is usually sufficient to estimate the exponential tail of by running examples.

4. Numerical examples

4.1. Ring density

The first example is the “ring density” example that has a known invariant probability measure. Consider the following stochastic differential equation:


where and are independent Wiener processes, and is the strength of the noise. The drift part of equation (4.1) is a gradient flow of potential function plus an orthogonal rotation term that does not change the invariant probability density function. Hence the invariant probability measure of (4.1) has a probability density function

where is a normalizer. We will compare the invariant probability measure of the Euler-Maruyama scheme and that of equation (4.1).

In our simulation, we choose and . The first simulation runs independent long trajectories up to time . Hence Algorithm 1 compares the distance between and for samples. Constant equals here because the Euler-Maruyama scheme is a strong approximation with accuracy when is a constant. Algorithm 1 gives an upper bound

In addition, all eight trajectories are contained in the box . Hence we choose and .

Then we run Algorithm 2 to get coupling probabilities up to . The number of initial values is . Then we run pairs of trajectories from each initial point to estimate the coupling probability. The probability that coupling has not happened before is then divided by , which estimates an upper bound of the contraction rate

Then we use generalized Pareto distribution (GPD) to fit . The threshold is chosen to be . The fitting algorithm gives parameters and . This gives . A comparison of cumulative distribution functions of empirical data and that of the GPD fitting is demonstrated in Figure 1.

Figure 1. A comparison of cumulative distribution functions of empirical data and that of the generalized Pareto distribution.

Combining all estimates above, we obtain a bound


Since the invariant probability measure of equation 4.1 is known, we can check the sharpness of the bound given in equation (4.2). The approach we take is Monte Carlo simulation with extrapolation to infinite sample size. On a grid, we use long trajectories to estimate the invariant probability density function of 4.1. The sample sizes of these trajectories are . Then we compute the total variation distance between

and the empirical probability density function at those grid points. The error is linearly dependent on the power of the sample size. Linear extrapolation shows that the total variation distance at the infinite sample limit is . The linear extrapolation is demonstrated in Figure 2. Since is smaller than the total variation distance, the 1-Wasserstein distance should be no greater than . Therefore, our estimation given in equation (4.2) is larger than the true distance between and , but is reasonably sharp.

Figure 2. Linear extrapolation for the total variance distance at the infinite sample limit.

It remains to comment on the fast estimator mentioned in Section 3.3. In Figure 3 we draw the exponential tail of and its linear fit. The slope of the exponential tail is . When , we have . Equation (