 # Symmetrized importance samplers for stochastic differential equations

We study a class of importance sampling methods for stochastic differential equations (SDEs). A small-noise analysis is performed, and the results suggest that a simple symmetrization procedure can significantly improve the performance of our importance sampling schemes when the noise is not too large. We demonstrate that this is indeed the case for a number of linear and nonlinear examples. Potential applications, e.g., data assimilation, are discussed.

## Code Repositories

### SDE_Importance_Sampling

None

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Consider a stochastic differential equation (SDE)

 dXt=f(Xt) dt+σ dBt ,  Xt∈RD, (1.1)

where and is -dimensional Brownian motion. Suppose we make noisy observations of the system at times (, fixed), obtaining a sequence of measurements where () is the quantity being measured (the “observable”),

are independent identically-distributed (IID) random variables modeling measurement errors and

. What is the conditional distribution of for given

? This problem of “nonlinear filtering” or “data assimilation” arises in many applications; see, e.g.,

[7, 8, 5, 27]. A variety of algorithms have been developed to address it, but efficient data assimilation, especially in high-dimensional non-gaussian problems, remains a challenge .

This paper concerns an approach to data assimilation known as “particle filtering” (see, e.g.,  for more details) based on sampling

the conditional distributions. We present an asymptotic analysis of certain sampling algorithms designed to improve the efficiency of particle filtering, and, based on this analysis, we propose a general way to improve their performance. The analysis relies on taking a small-noise limit, but the algorithms do not require a small noise to operate (but may not be as efficient when the noise is not small). We focus on

one step of the filtering problem, i.e., we set in the above, as this is sufficient to capture the computational difficulty we wish to address. For simplicity, we assume , where is a scalar and is the identity matrix; we also assume is a scalar. These assumptions can be relaxed if needed.

To take one step of particle filtering, one begins by discretizing Eq. (1.1) using, e.g., the Euler scheme, to obtain

 Xn+1=Xn+Δt f(Xn)+√Δt σ⋅ξn ,   X0=x0∈RD, n=0,⋯,N−1, (1.2)

where , the are IID standard normal random variables. A straightforward application of Bayes’s Theorem tells us that the conditional distribution of interest satisfies

 p(x1,⋯,xN|y)∝exp(12σ2ΔtN−1∑n=0∥∥xn+1−xn−f(xn)Δt∥∥2+∥m(xN)−y∥22r) . (1.3)

One then tries to design a Monte Carlo algorithm to generate discrete-time sample paths from Eq. (1.3), conditioned on the observation . We refer to the distribution in Eq. (1.3) as the target distribution. They are the discrete-time analogs of the conditional distributions introduced above, with observation.

Without the last term in the exponent in Eq. (1.3), the target distribution is just the distribution of the discretized SDE, and one can generate sample paths by carrying out the recursion in Eq. (1.2). When the last term is included, however, it is generally not feasible to sample directly from the target distribution. A solution to this problem is importance sampling: instead of drawing samples from the target distribution, we draw sample paths from an approximation , usually called the “proposal distribution”. Any statistics we compute based on sample paths from will be biased. We compensate for this bias by associating a weight to the th sample path , with , so that the weighted sample paths again have the correct statistics (in a sense we make precise later).

Weare and Vanden-Eijnden [28, 29] proposed an algorithm for sampling distributions like Eq. (1.3

). They showed that their algorithm is efficient in the sense that in the limit of small dynamical and observation noise, the relative variance of the weights vanishes (see

 for precise definitions and statements). The basic idea of the sampler is to look for the most likely sample path of the target distribution (1.3) and use this information to modify the dynamics so that samples from the proposal remain close to the target distribution. In this paper, by a combination of formal asymptotic analysis and numerical examples, we show that a symmetrization procedure proposed in  can be applied to SDEs to improve the efficiency of importance samplers. The symmetrization and “small noise analysis” has also been discussed in the context of implicit sampling [6, 23], see .

While our primary motivation here is data assimilation for SDEs, our symmetrization procedure may be effective for sequential Monte Carlo sampling of more general types of systems. As well, the class of importance sampling algorithms studied here are closely related to algorithms proposed in [12, 13, 10, 9, 11] and in  for sampling “rare events” in SDEs, though there are some significant differences between the two applications. We plan to explore some of these connections in future work.

#### Paper organization.

The remainder of this paper is organized as follows. We state our main results in Section 2. Section 3 briefly reviews the linear map method and its symmetrization, as well as the small noise theory (see ). We explain a new sampling method, the dynamic linear map, in Section 4. We study its efficiency in the small noise regime and show how to use symmetrization to improve its efficiency in small noise problems. Several numerical examples are provided in Section 5 that illustrate our asymptotic results as well as the efficiency of our dynamic approach in multimodal problems. The continuous time limit of the dynamic linear map is discussed in Section 6 and we present conclusions in Section 7.

## 2 Problem statement and summary of results

We now formulate the problem more precisely and summarize our key findings. We consider a discretized SDE in the small noise regime

 Xn+1=Xn+Δt ˜f(Xn,Δt)+√Δt √ε σ⋅ξn ,   X0=x0∈RD, (2.1)

where corresponds to a numerical discretization of (for most of this paper, we assume the Euler discretization ), and is the “small noise parameter”. Throughout this paper we assume that the

-dimensional vector field

is smooth, and that the process starts at a given initial position and proceeds for time steps of size each. The transitions are made with independent gaussian samples . We denote the path as , a sequence of positions , and its likelihood in the process with the path distribution .

The observation of the state at time gives rise to the likelihood

 θ(xN):=exp(−1εg(xN)), (2.2)

where is assumed to be a smooth, nonnegative function. For example, for observations , , we have . Hereafter we will sometimes refer to as the “log-likelihood,” in a slight abuse of standard terminology. By Bayes’s Theorem, the target distribution then has the form

 p(x1:N|x0)∝ρ(x1:N|x0)⋅θ(xN) . (2.3)

Importance sampling methods generate samples using a proposal distribution , and attach weights

 W(k)=w(X(k)1:N|x0)=p(X(k)1:N|x0)/q(X(k)1:N|x0) (2.4)

to each sample, so that the weighted samples can be used to compute unbiased statistical estimates with respect to the target distribution. To measure the efficiency of the sampling methods, we evaluate the relative variance of the weights

 Q:=Var[W]E[W]2. (2.5)

Here the expected values are computed with respect to the proposal distribution . This relative variance

is connected to a standard heuristic called the “effective sample size,” defined by

 Neff:=Ne1+Q, (2.6)

where is the number of weighted samples (see, e.g., [4, 21, 8]). The effective sample size is meant to measure the size of an unweighted ensemble that is equivalent to the weighted ensemble of size . All else being equal, the smaller the , the more efficient the importance sampling algorithm, and if all the samples were independent, we would have and . The quantity is convenient because it is not tied to any specific observable; recent work (see ) has also given it a more precise meaning. Other quantities that can assess effective sample sizes are discussed in . We note that in practice, and are only known up to a constant. The algorithms we describe do not require knowing the normalization constants. Likewise, is invariant under rescaling of or by a constant.

We study two types of importance sampling methods in this paper. The first method, called the “linear map” (LM), uses a gaussian proposal distribution centered at the most likely path. The second method, called dynamic linear map (DLM), re-applies the linear map after each time step between and given the previous moves. Note that the linear map can be viewed as a version of implicit sampling [6, 23] applied to the path distribution of an SDE. The dynamic linear map applies this implicit sampling step repeatedly to transition densities and is also closely linked to the continuous time control method of Weare and Vanden-Eijnden [28, 29] (see also Section 6). For each method, we perform a symmetrization and exploit symmetries of the proposal distributions to increase sampling efficiency. Symmetrization was previously studied for the LM in a more general context in . Here we adapt this procedure to problems involving SDE and to the dynamic linear map. Following the approach taken in , we show that under suitable assumptions (see Section 4), the relative variances of the various methods are as follows:

Method scaling
Linear Map (LM)
Symmetrized LM
Dynamic LM (DLM)
Symmetrized DLM

We also present examples showing that the leading coefficient of the DLM can be smaller than that of LM, suggesting that DLM may be more effective in some situations (see Section 5). We discuss the continuous time limit of LM and DLM for scalar SDE, and calculate the leading coefficient of in an asymptotic expansion in . In doing so, we show that, under additional assumptions, the sampling method discussed in  is recovered in the limit of the DLM (see Section 6).

#### Notes.

1. The -expansions we will consider are formally justified as the relevant quantities, e.g., relative weight variance, are gaussian integrals.

2. The insertion of the small noise parameter into the problem is mainly to enable asymptotic analysis. In specific problems, there is not always an identifiable small parameter, and in any case our methods do not require a small parameter to operate.

## 3 Background

We simplify notation and write , and , and consider the small noise target distribution defined in (2.3) which can be written as , where

 F(x)=Δt2σ2N−1∑n=0∥∥xn+1−xnΔt−˜f(xn,Δt)∥∥2+g(xN) , (3.1)

for , a scalar function as in (2.2). If we assume that has a unique, nondegenerate minimum, and let

 φ=argminx∈RD⋅NF(x), (3.2)

i.e., is the optimal path with prescribed initial condition , we can employ Laplace asymptotics to expand the target distribution around . (See, e.g.,  for a general formulation of Laplace asymptotics.) After a change of variables

 z=ε−1/2⋅(x−φ) (3.3)

the expansion is

 F(z)=F(φ)+zTHz/2+ε1/2C3(z)+εC4(z)+O(ε3/2), (3.4)

where is the Hessian evaluated at , are the higher order terms in the Taylor series. Here and below, we use the shorthand , and similarly write for etc. Note that while we will continue to refer to as a “path” after the change of coordinates, is the actual solution of Eq. (2.1).

The small noise analysis of LM, and other methods to follow will make frequent use of this expansion, as well as the “variance lemma” (see ).

###### Lemma 1.

(Variance Lemma) For a function that can be expanded in at least to the terms

 u(z)=1+εru1(z)+ε2ru2(z)+O(ε3r) (3.5)

the relative variance of

with respect to a probability density

is

 Q=ε2rVarq[u1(z)]+O(ε3r) (3.6)

### 3.1 Linear map

The proposal distribution of the linear map (LM) sampling method, summarized in Algorithm 1, is gaussian and proportional to

 q(z)∝exp(−zTHz/2). (3.7)

The weights are the ratio of target and proposal distribution, and can be expanded as

 w(z)=1−ε1/2C3(z)+O(ε). (3.8)

Using the variance lemma we thus find that

 Q=εVarq[C3(z)]+O(ε3/2), (3.9)

i.e., the relative variance of the weights is linear in (see  for more details).

### 3.2 Symmetrized linear map

It is shown in  that the linear map can be “symmetrized" to improve the scaling of from linear to quadratic in

. This stems from the observation that the leading order term in the weight is an odd function with respect to the random variable

, whose probability distribution function is even. The symmetrized sampler uses a proposal distribution which reweights equally likely samples from the gaussian distribution of the linear map such that the resulting weights have even symmetry. The odd leading order terms in the weight expansions then cancel, which results in a quadratic scaling of

in .

Specifically, the symmetrized linear map draws a sample from the proposal distribution . It returns with probability , and with probability , where

 w+=p(−z)q(z)  and  w−=p(z)q(z). (3.10)

Samples generated in this way have a non-symmetric distribution, but even weights:

 qs(z)=q(z)2w+w−+w+,ws(z)=w−+w+2. (3.11)

The Taylor expansion of the symmetrized weight is

 ws(z)=1+ε(12C3(z)2−C4(z))+O(ε2), (3.12)

which, together with the variance lemma shows that

 Qs=ε2Varq[12C3(z)2−C4(z)]+O(ε4). (3.13)

The symmetrization therefore improves the linear scaling of in of LM, to a quadratic scaling of for SLM (see  for more details).

## 4 Dynamic linear map and its symmetrization

### 4.1 A multimodal example

The linear map can be efficient when the hypotheses underlying its derivation are satisfied, i.e., when the pathspace distribution is unimodal and a gaussian approximation is appropriate. However, when there are multiple modes, LM can become inefficient. To see how this might happen, consider the simple random walk

 Xn+1=Xn+√Δt √ε ξn (4.1)

i.e., where is standard Wiener process. Suppose we have a bimodal likelihood function whose graph is as shown in Figure 1; this type of situation can arise when multiple states can give the same measurement, so that observations may have ambiguous interpretation. In this case, the high probability paths will be those that reach the vicinity of at ; effectively, the high probability paths are sample paths of Brownian motion, conditioned to be near at . The probability of this occurring by chance is exponentially small as , and direct sampling is unlikely to ever produce such a path. Figure 1: Brownian motion with bimodal likelihood. Here, the initial condition is X0=0.01, and we use ε=0.1 . Shown are a sample path X and the optimal path φ starting from X0.

A straightforward calculation shows that the optimal path approaches a straight line in the -plane as , going to the right bump if , to the left if (and undefined if ). With a bimodal likelihood function, the target distribution is bimodal as well. If the initial condition is sufficiently to the right of , one of the two modes will dominate, and LM can be expected to be effective. As moves closer to , however, the other mode will begin to make a greater contribution; at , the two modes carry exactly the same weight. But LM will always pick the mode on the right when , no matter how close is to . So LM will produce essentially no sample paths going to the left, leading to a large weight variance. See Section 5 for detailed numerical results.

This is a well-known problem with importance sampling algorithms. Similar issues arise in rare event simulation, and a standard solution is to dynamically recompute the optimal path. See, e.g., the discussion of Siegmund’s algorithm in . In our context, this leads to an algorithm we call the dynamic linear map, which is similar to the algorithms proposed in [28, 13]. We will also discuss symmetrization in this context.

### 4.2 Dynamic linear map

Roughly speaking, the dynamic linear map (DLM) consists of computing the optimal path starting from the current state , taking one step (so that ), then repeating. See Algorithm 2 for details. The DLM thus requires redoing LM at every step, and is therefore more expensive.111Suppose each cost function evaluation requires CPU time , the number of steps, and each optimization requires function evaluations. Then all else being equal, LM has running time and DLM . However, it can avoid some of the issues arising from multi-modal target distributions. One can see this heuristically in the above example (Section 4.1): suppose we start with slightly to the right of , so that the optimal path goes to the right bump. After a few steps, we may end up in a state closer to the left bump. At this point, the DLM would start steering the sample path towards the left bump. Unlike LM, repeated sampling using DLM would yield sample paths that end at both the left and the right bumps (see Section 5.1).

To make use of DLM, we need an expression for the associated weights. This, in turn, requires an expression for the proposal distribution associated with DLM, which one can derive by first noting that in general, transition densities are marginals of the pathspace distribution:

 ρ(xn+1|xn) =∫ρ(xn+1:N|xn) dxn+2:N.

(Here we abuse notation slightly and use and to denote both pathspace distributions as well as their marginals.) The DLM transition density arises from making a gaussian approximation of the target distribution at each step, then taking its marginal. This leads to

 q(xn+1|xn) =∫q(xn+1:N|xn) dxn+2:N (4.2) ∝exp(−(x−φ)T\tinyn+1Σ−1n+1(x−φ)\tinyn+1/(2Δt)).

Here is the optimal path from to and we omit its dependence on for readability of the equations; we also remind the reader that is a path. We denote the Hessian of evaluated at the optimal path by . We view a path from to as a point in , arranged in blocks of entries. Accordingly, the matrix can be viewed as an element of and can be subdivided into blocks of dimension each. The matrix in Eq. (4.2) is , the first block of the inverse of the Hessian (after rescaling).

In Algorithm 2, going from step to requires optimizing over the remaining variables in the path. This is done independently at every step and for every sample path. The weights for the proposal distribution of DLM can be calculated as described in Algorithm 2, or as the product of the incremental weights

 w=N−1∏n=0wn,wn∝p(xn+1|xn)q(xn+1|xn). (4.3)

Relation to Hamilton-Jacobi equation and regularity of “value functions”. In the definitions above, it is assumed that is well-defined for all . This is actually not always the case. To see this, consider again the example from Section 4.1. If at some , there are two optimal paths pointing in opposite directions. At this point, because there is not a single optimal path,

is undefined. This behavior is actually rather common, and not at all confined to the Brownian motion example. It is closely connected with regularity of solutions of a partial differential equation of Hamilton-Jacobi (HJ) type. As we do not make use of the theory of HJ equations in this paper, we do not go into details here. Instead, we provide a brief summary below, and refer interested readers to, e.g.,

In the DLM method, the optimal path minimizes a version of the function in Eq. (3.1), but starting with state at time rather than always at time 0. In the limit as , the value function achieved with initial condition at step solves a HJ equation of the form , with Hamiltonian ; this is the Legendre transformation of the Freidlin-Wentzell Lagrangian  . For the HJ equation to be well-posed, one prescribes the final condition that where is the likelihood in Eq. (2.2) and . The HJ equation is then solved backwards in time. The time derivative of the optimal path starting at position and time is given by the gradient of where it is differentiable. At locations where there are multiple optimal paths, the value function is generally continuous but not differentiable. At such singular points , has jump disconinuities (as varies) and is therefore undefined.

Though very much relevant to the efficacy of the type of methods discussed in this paper, the analysis of singularities of HJ equations can be highly nontrivial. As our main goal is to assess whether some version of the symmetrization procedure proposed in  can be extended to SDEs, we have opted to focus on the simplest possible setting, leaving more general analysis to future work. For the remainder of the paper, we make the following standing assumption:

 q(xn+1|xn) is defined everywhere, and is as smooth as needed.

The analytical results described below should therefore be interpreted as a best-case scenario. We also note that while the numerical algorithm is unlikely to produce an exactly in the set of singular points in actual practice, the presence of singularities does mean that the performance of the algorithm may be worse than predicted by our analysis. We have therefore designed our numerical examples to test the extent to which the algorithms behave as predicted even when is not differentiable everywhere.

### 4.3 Small-noise analysis

To find the scaling of the relative variance of the weights of DLM with the small noise parameter , we apply the same change of variables as in Eq. (3.3) to each transition density and expand the incremental weights as

 wn  =  w(zn+1|zn)  =  1+ε1/2⋅w1,n(zn+1|zn)+ε⋅w2,n(zn+1|zn)+O(ε3/2), (4.4)

where

 w1,n(zn+1|zn) =∫C3(z)exp(−zTHz/2) dzn+2:N∫exp(−zTHz/2) dzn+2:N (4.5) w2,n(zn+1|zn) =∫(C3(z)2/2−C4(z))exp(−zTHz/2) dzn+2:N∫exp(−zTHz/2) dzn+2:N (4.6)

noting that Eq. (4.4) relies strongly on our standing assumption that is differentiable. Since the weight of a sample is the product of the incremental weights, we have

 w(z) =1+ε1/2⋅w1+ε⋅w2+O(ε3/2),

where

 w1=N−1∑n=0w1,n,w2=N−1∑n=0w2,n+N−1∑n=0N−1∑m=0w1,n⋅w1,m. (4.7)

The scaling of in now follows from the variance lemma:

 Qε=ε⋅Varq[w1]+O(ε2). (4.8)

Thus, the relative variance of DLM scales linearly in , the same asymptotic scaling as LM. However, we will show in numerical examples below that the dynamic approach can be more effective in practice than LM, especially when the target distribution has multiple modes.

### 4.4 Symmetrization

The leading order term in the weight for DLM has an odd symmetry, just like the LM, and a symmetrization procedure can be applied to DLM to improve the scaling of in . The reason is that, at each time step, is generated by a composition of the previous state and a new gaussian sample . While this procedure leads to a proposal distribution that is not necessarily even, the paths are constructed incrementally from gaussian samples which are even.

More specifically, the recursive composition forms a map from the dimensional gaussian to the path , and for every sampled path , there is a path which is equally likely. Following the algorithm described in Algorithm 3, we sample with probability , and with probability , the resulting proposal is a “symmetrized” distribution with even weights (see Eq. (3.11)).

The symmetrized weights can be written in terms of the map as

 ws(h(ε1/2ξ))=w(h(ε1/2ξ))+w(h(−ε1/2ξ))2. (4.9)

Recall the expansion of the weights in (4.4), and note that

 z =ε−1/2(h(ε1/2ξ)−h(0)),

since the most likely path can be written in terms of the map as .

If is unique (at each time step), can be expanded around the most likely path as

 h(ε1/2ξ) =φ+ε1/2(Dh)(0)⋅ξ+O(ε), (4.10) h(−ε1/2ξ) =φ−ε1/2(Dh)(0)⋅ξ+O(ε). (4.11)

We thus have that

 w(h(ε1/2ξ)) =1+ε1/2w1(ε1/2(Dh)(0)⋅ξ,φ)+εw2(ε1/2(Dh)(0)⋅ξ,φ)+O(ε3/2) (4.12) w(h(−ε1/2ξ)) =1−ε1/2w1(ε1/2(Dh)(0)⋅ξ,φ)+εw2(ε1/2(Dh)(0)⋅ξ,φ)+O(ε3/2) (4.13)

which results in the cancellation of the leading order term in of the symmetrized weight

 ws(h(ε1/2ξ))=1+εw2(ε1/2(Dh)(0)⋅ξ,φ)+O(ε3/2) (4.14)

Applying the variance lemma completes the proof for the quadratic scaling of in

 Qs=ε2⋅Varqs[w2]+O(ε4). (4.15)

## 5 Numerical examples

We now examine a number of concrete examples, both to illustrate the scaling of the proposed algorithms and to test their limitations. The source code for all examples in this section can be found at https://github.com/AndrewLeach/SDE_Importance_Sampling .

### 5.1 Examples with linear SDE

We begin with the Brownian motion example from Section 4.1:

 Xn+1=Xn+√Δt √ε ξn , (5.1)

with initial condition and with likelihood for two different choices for . We first consider the case of a unimodal target distribution for which the assumptions made during the small noise analysis are satisfied. We then violate the assumption of a unique optimal path to indicate limitations of DLM and our small noise analysis. For the examples below, the time step is . The observation is collected at step (i.e., ). Computing the optimal paths is straightforward to do analytically and we use the analytic formulas in our implementation of the various samplers.

#### Brownian motion with unimodal likelihood.

We first consider a likelihood defined by

 g(x)=x424+x36+x22.

The likelihood is asymmetric in and leads to a non-gaussian and unimodal target distribution. In this example, the assumptions made in our small noise analysis are satisfied.

We apply LM, SLM, DLM, and SDLM to sample the target distribution over a wide range of , and compute the relative variance for each of these methods. For each and method (LM, SLM, DLM and SDLM), we draw samples . The results are shown in Figure 2. Figure 2: Brownian motion with asymmetric unimodal likelihood. The scaling of Q in ε for LM, SLM, DLM and SDLM are plotted.

As can be seen, the results show the predicted scalings for for a wide range of for all four methods: both LM and DLM are , while SLM and SDLM are both . Perhaps this is no surprise, as all assumptions that lead to the small noise theory are valid in this example. We also see that the dynamic methods (DLM and SDLM) have smaller relative variance at each value of , though they also cost more per sample.

#### Brownian motion with bimodal likelihood.

Next, we examine

 g(x)=100⋅(x44−x22).

As explained in Section 4.1, this leads to a bimodal target distribution. We fix , and leave all other parameters as above. We apply LM and DLM to compute the final-time distribution , using (weighted) samples. The results are shown in Figure 3, along with the target distribution . Figure 3: Final-time distribution for Brownian motion with bimodal likelihood. In (a), we plot the marginal distribution p(xN|x0) estimated by weighted histograms of 12000 samples generated using LM. Also shown is the target distribution. In (b), we plot the same information for DLM.

As expected, LM essentially ignores one of the two modes, while DLM captures both modes. As explained before, even though both samplers should reproduce the target distribution in the large-sample-size limit, in practice LM produces almost no sample paths that go to the left bump. In contrast, DLM readily generates sample paths ending at both bumps, leading to a more effective sampling of the target distribution. We have experimented with increasing the sample size for LM, but even the largest sample sizes we consider did not lead to weighted samples that represent both modes.

Finally, note that empirical estimates of are insufficient to detect this problem: even though the true value of for LM should be quite large in this case, empirical estimates of for LM are actually quite small because none of the sample paths go to the left bump. Indeed, for Figure 3, the empirical for LM is , while that of DLM is . The example thus shows that for non-gaussian and possibly multimodal distributions, DLM can be more reliable despite the same scaling of .

#### Overdamped Langevin equation with bimodal likelihood.

The scaling arguments for DLM and its symmetrized version rely on the assumption that the most likely path is unique at every time step. We now consider an example for the DLM in which we deliberately violate this assumption. The model is

 Xn+1=Xn−Δt α⋅Xn+√Δt √ε ξn . (5.2)

This is the Euler discretization of the overdamped Langevin equation We use the log-likelihood

 g(x)=10⋅(x44−x22).

As in the previous example, the optimal path goes to the right bump when and to the left when . At there is no unique optimal path.

The linear drift makes it likely that DLM sample paths encounter the line and the small noise results may not hold in this case. To illustrate the behavior and efficiency of the methods in this situation we perform experiments with varying values of and . Specifically, for a fixed , we take time steps with DLM, starting from initial conditions ranging from to . We compute the averaging number of crossings for each experiment. Figure 4 shows the results as well as the computed values of . Figure 4: DLM applied to the overdamped Langevin equation with bimodal likelihood. Panel (a) shows the scaling of Q vs. ε for x0 approaching x=0. In (b), we plot the average number of x=0 crossings against ε.

As can be seen in Figure 4(a), the predicted asymptotic scaling of only emerges for small ; the critical value of at which the curve crosses over into the asymptotic regime decreases as approaches , making crossings more likely. Comparing Figures 4(a) and 4(b), we see that the asymptotic regime corresponds to values of small enough that the average number of crossings per sample is near zero. Closer examination of the data suggests that this critical scales roughly linearly with distance of the initial condition to . The example thus suggests that the efficiency of DLM may suffer if one encounters non-unique optimal paths while constructing the proposal distribution sequentially, but the predicted scaling again holds if is small enough.

Finally, we note that even in the pre-asymptotic regime, the value of are , meaning the effective number of samples is , which is still a significant improvement over direct sampling.

### 5.2 Example with a nonlinear SDE

Our second example is a stochastic version of an idealized geomagnetic pole reversal model due to Gissinger :

 ˙x1=0.119x1−x2x3+√ε˙B1˙x2=−0.1x2+x1x3+√ε˙B2˙x3=0.9−x3+x1x2+√ε˙B3 . (5.3)

(In this section, refers to the th component of a vector .) The

system of ordinary differential equations has 3 unstable fixed points:

and . It has a chaotic attractor on which trajectories circulate around either or many times before making a quick transition to the other fixed point. See Figure 5. Following , we refer to these transitions as “pole reversals,” since the second component can be thought of as a proxy for the geomagnetic dipole field, and it changes signs at these transitions.

Here, we consider Eq. (5.3) with . We start with an initial condition near , and after steps make an observation with log-likelihood , where . We view as the outcome of a “measurement” made at step .

We consider two cases:

• The measured value is near , i.e., on the opposite “lobe” from the initial condition;

• is near , i.e., on the same “lobe” as the initial condition.

Figure 5 illustrates the initial conditions, data, and optimal paths for the two cases. Shown are trajectories of the deterministic model (light gray), representing the chaotic attractor. The dashed line is the most likely path with initial condition marked by “” and with measured state at time marked by “”; this trajectory undergoes a “pole reversal” (Case (a)). The solid blue line represents the most likely path with initial condition “” and observation “,” and does not exhibit a pole reversal (Case (b)). Figure 5: The Gissinger model and its phase space geometry. Shown are trajectories of the deterministic model (light gray) projected to the x2-x3 plane. The dashed line is the most likely path with initial condition marked by “∙” and measured state at time t=10 marked by “+”; this trajectory undergoes a “pole reversal” (Case (a)). The solid blue line represents the most likely path with initial condition “∘” and observation “×” at t=10, and does not exhibit a pole reversal (Case (b)). The symbols □ and ⋄ are the times at which we computed the histograms in Figure 6.

To see how the two cases differ, we fix and apply the LM and DLM to generate sample paths in each case and plot marginals of the proposal distributions at two different times. In Case (a), we plot histograms of the marginal distributions at time as marked by in Figure 5; in Case (b), we plot histograms of the marginal distributions at time as marked by . For each method, the resulting “triangle plot” consists of histograms of the one-dimensional marginals, for , and the two-dimensional marginals, , of the proposal distributions. The triangle plots are shown in Figure 6. In each panel, the diagonal plots are the one-dimensional marginal distributions. The lower-triangular parts of each panel are the two-dimensional marginal distributions generated by LM, while the upper-triangular parts show marginals generated by DLM. Figure 6: Final-time marginal distributions for the Gissinger model. In each panel, the diagonal plots are histograms for the final-time marginal proposal distributions for of x1, x2, and x3 (solid = LM, dashed = DLM). The times at which the marginals are computed are marked by ⋄ in Figure 5 for Case (a), and □ for Case (b). Plots on the lower-triangular submatrix are two-dimensional marginal proposal distributions computed by LM, while two-dimensional marginal proposal distributions computed by DLM form the upper-triangle (see text for details).

In Case (a), the marginal distributions of the DLM proposal are multimodal, possibly related to the underlying geometry of the strange attractor. In contrast, the LM proposal distribution misses this complexity altogether (as one might expect). Moving now to Case (b), which involves starting and end points on the same lobe connected by a shorter optimal path, the marginals are unimodal, and LM and DLM give more similar answers (though there is still significant deviation from gaussianity in the DLM proposal distribution).

Finally, we vary in Cases (a) and (b) and apply LM, SLM, DLM and SDLM. For each value of , we estimate for each of the 4 methods. The results are shown in Figure 7. Not surprisingly, LM breaks down for Case (a), in which the target distribution is likely multimodal. In contrast, both DLM and SDLM exhibit the predicted scaling. For Case (b), because the target distribution is unimodal, all four methods behave as predicted by the small noise theory. Figure 7: Relative variance Q as a function of ε for the Gissinger model. Case (a) involves a pole reversal, whereas Case (b) does not.

#### Numerical details.

The Gissinger model requires attention to numerical implementation when we compute its statistics.We describe our numerical implementation in detail.

1. Time-stepping. The Euler scheme for the Gissinger model requires small time steps because of numerical instabilities. To improve stability, we discretize the drift part of Eq. (5.3) using a standard 4th-order Runge-Kutta (RK4) method, then adding IID normal random vectors at each step. This yields a model of the form (2.1), where now represents one step of the RK4 scheme. In all the examples shown above, the time step is .

2. Estimation of . In Figure 7, because of their different variances, we use sample paths to estimate for DLM and for SDLM, and paths for LM and for SLM.

3. Computing optimal paths. Our methods requires computing optimal paths. For the Gissinger model, we use Newton’s method. Since explicit analytical expressions for the gradient and the Hessian are available, this is relatively straightforward to program. To reduce the (fairly significant) computational cost of computing at each time step, we “guess” a good initialization for the optimization procedure using the solution from the previous time step using the linearized dynamics. See  for details.

## 6 Continuous-time limit of dynamic linear map

So far, we have focused on time discretizations of SDEs. A natural question is what happens to the proposed algorithms in the limit . In this section, we sketch some analytical arguments aimed at addressing these questions for scalar SDE. Though restrictive, we believe these results yield useful insights. A more complete and rigorous analysis is left for future work, as it is expected to be more involved.

### 6.1 Dynamic linear map

For scalar SDE, the DLM can be defined through the recursion

 X\tinyΔtn+1=φ\tinyΔtn+1(X\tinyΔtn,n)+√Δt √ε √Σ\tinyΔtn+1(X\tinyΔtn,n) ξn, (6.1)

where , , is the optimal path (3.2) with prescribed initial condition , is the th entry of the Hessian of (see Eqs. (3.1) and (4.2)), and the are independent standard normal random variables. Keeping in mind that for all , the above can be written as

 X\tinyΔtn+1=X\tinyΔtn+Δt φ\tinyΔtn+1(X\tinyΔtn,n)−φ\tinyΔtn(X\tinyΔtn,n)Δt+√Δt √ε √Σ\tinyΔtn+1(X\tinyΔtn,n) ξn . (6.2)

Our goal in this subsection is to sketch an argument suggesting that as , solutions of (6.2) converge weakly  to those of

 dXt=˙φt(Xt,t) dt+√ε σ⋅dBt (6.3)

with . Since we consider “continuous time” and “discrete time” cases, we mark the discrete time case by a superscript (i.e., in this section, the function in Eq. (3.1) is called ). In Eq. (6.3), “” denotes and the path () minimizes the action functional 

 F(xt:T|xt=x0)=12σ2∫Tt(˙xs−f(xs))2 ds+g(xT) ,  φt(x0,t)=x0 . (6.4)

This is the continuous-time analog of Eq. (3.1).

Eq. (6.3) was derived in  as the proposal for an importance sampling algorithm. This was later used in  for data assimilation in the small-noise regime. We assume minimizers of the action functional are twice-differentiable in the time parameter and satisfy the Euler-Lagrange equations; this can be justified via standard results from the calculus of variations (see, e.g., Section 3.1 of ). In what follows, we also assume that the action functional has a single global minimum for all initial positions and initial time . This unique optimal paths assumption (the continuous-time analog of the unimodality of ) implies that is defined everywhere. Without unique optimal paths, any analysis will require more care; see, e.g.,  and references therein for a discussion of these and related issues. The assumption is natural for linear systems with unimodal likelihood functions , and may hold (approximately) in nonlinear systems when is small.

We now sketch our argument. We begin by recalling that a numerical approximation of an SDE converges weakly with weak order if for all test functions with at most polynomial growth,

 (6.5)

as . By standard results in the numerical analysis of SDEs, weak convergence is implied by “weak consistency” plus some mild polynomial growth conditions; see, e.g., Section 14.5 in  for details.

In the present context, consistency means that the factors and in Eq. (6.2) approximate the corresponding factors in Eq. (6.3) ( and , respectively). These we now prove.

###### Proposition.

Under the unique optimal path assumption, we have

•  φ\tinyΔtn+1(x,n)−φ\tinyΔtn(x,n)Δt=˙φn\tinyΔt(x,nΔt)+O(Δt) (6.6)

for all and , and

•  Σ\tinyΔtn+1(x,n)=σ2+O(Δt). (6.7)
###### Proof of (a).

We begin by proving that and satisfy the first variational equations for and , respectively (see Eqs. (6.4) and (3.1)). Without loss of generality, set and , and write for a given . Then the first variational equation of is the boundary value problem

 −¨φ(s)+f′(φ(s))f(φ(s)) =0 (6.8) φ(0)−x(0) =0 (6.9) ˙φ(T)−f(φ(T))+σ g′(φ(T)) =0 (6.10)

and the first variational equation for is

 −φ\tinyΔtk−1−2φ% \tinyΔtk+φ\tinyΔtk+1Δt% 2+