Stochastic Normalizing Flows

We introduce stochastic normalizing flows, an extension of continuous normalizing flows for maximum likelihood estimation and variational inference (VI) using stochastic differential equations (SDEs). Using the theory of rough paths, the underlying Brownian motion is treated as a latent variable and approximated, enabling efficient training of neural SDEs as random neural ordinary differential equations. These SDEs can be used for constructing efficient Markov chains to sample from the underlying distribution of a given dataset. Furthermore, by considering families of targeted SDEs with prescribed stationary distribution, we can apply VI to the optimization of hyperparameters in stochastic MCMC.

04/30/2021


1 Introduction

Normalizing flows (Rezende & Mohamed, 2015) are probabilistic models constructed as a sequence of successive transformations applied to some initial distribution. A key strength of normalizing flows is their expressive power as generative models, while enjoying an explicitly computable form of the likelihood function evaluated on the transformed space. This makes them especially well-equipped for variational inference (VI). Neural networks are often used as inspiration for finding effective transformations (Dinh et al., 2015; van den Berg et al., 2018).

Continuous normalizing flows were later developed in Chen et al. (2018) as a means to perform maximum likelihood estimation and VI for large-scale probabilistic models derived from ordinary differential equations (ODEs). The framework stems from the computation of the evolving density of an ODE with random initial value, as the solution to another ODE. The jump to continuous-time dynamics affords a few computational benefits over its discrete-time counterpart, namely the presence of a trace in place of a determinant in the evolution formulae for the density, as well as the adjoint method for memory-efficient backpropagation. Motivated by deep learning, a family of ODEs, called neural ordinary differential equations, was constructed, whose Euler discretizations resemble layer-wise transformations of residual neural networks. Further algorithmic improvements to the framework were presented by Grathwohl et al. (2018), enabling virtually arbitrary choices of parameterized classes of ODEs. Doing all this involves some technical subtlety, and effective neural ODE architectures remain the subject of ongoing research — see for example (Dupont et al., 2019; Gholami et al., 2019; Zhang et al., 2019).

There has also been recent interest in extending these frameworks to a stochastic scenario, that is, training probabilistic models derived from stochastic differential equations (SDEs). For physical models, where the evolution of a dynamical system is no longer deterministic, or microscopic fluctuations are dependent on components changing too rapidly to quantify, an SDE can be more appropriate. Stochastic extensions of neural ODEs have been considered in (Tzen & Raginsky, 2019; Liu et al., 2019; Jia & Benson, 2019; Peluchetti & Favaro, 2019) as limits of deep latent Gaussian models, where they have been suggested to show increased robustness to noisy / adversarial data. Furthermore, unlike deterministic flows, there is a foolproof recipe for constructing a family of SDEs that are ergodic with respect to some target distribution (Ma et al., 2015). This particular property guarantees the convergence of the solution of an SDE to a prescribed target distribution. Such SDEs are prime candidates for the construction of stochastic MCMC algorithms, by generating sample paths via approximate stochastic integration methods.

However, developing an analogue of the continuous normalizing flows framework for flows constructed from SDEs—in particular, one that comes with simple and rigorous mathematical theory and that does not rely on ad hoc or problem-specific assumptions—is far from trivial. A common approach for conducting VI with SDEs is to rely on Girsanov's theorem. This allows one to estimate the Kullback-Leibler divergence between densities of solutions to two SDEs (for the prior and posterior distributions) with differing drift coefficients (Beskos et al., 2006; Tzen & Raginsky, 2019). Following this approach, Li et al. (2020) developed a stochastic adjoint method which scales well to high dimensions, and enables SDEs as latent models in variational autoencoders. Theoretical justification of the method proved challenging, as stochastic calculus is ill-suited for analyzing backward (approximate) solutions to SDEs. Notable deficiencies with these previous approaches include difficulties with non-diagonal diffusion, incompatibility with higher-order adaptive SDE solvers, and a complex means of reconstructing Brownian motion paths from random number generator seeds. Furthermore, the method cannot be justifiably combined with existing approaches of density estimation for SDEs (see Hurn et al. (2007)).

On the other hand, recent efforts have made significant strides in applying variational and MCMC methods for idealized Bayesian computation. One of the most significant contributions in this direction is Salimans et al. (2015), who performed VI with respect to distributions formed from steps of a reversible Markov chain. For example, the setting of Hamiltonian Monte Carlo was examined in Wolf et al. (2016). More recently, Liu & Feng (2016) considered optimizing step size in stochastic gradient Langevin dynamics using methods derived from kernelized Stein discrepancy. Langevin flows (Rezende & Mohamed, 2015) have been discussed as a potential VI framework that takes inspiration from the SDEs underlying stochastic MCMC (Ma et al., 2015). Once again, implementation of Langevin flows relies on the approximation of the log-likelihood for a general class of SDEs.

Contributions

We provide a general theoretical framework (which we refer to as stochastic normalizing flows) for approximating generative models constructed from SDEs using continuous normalizing flows. These approximations can then be trained using existing techniques. By this process, we find that theoretical and practical developments concerning continuous normalizing flows and neural ODEs extend readily to the stochastic setting, without the need for an independent framework. The key theoretical enabler underlying our strong results and simple analysis is the theory of rough paths (Friz & Hairer, 2014), an alternative stochastic calculus that enables approximation and pathwise treatment of SDEs. Our approach

1. enables (i) density estimation, (ii) maximum likelihood estimation, and (iii) variational approximations beyond autoencoders, for arbitrary SDE models; and

2. is easily implemented using any general continuous normalizing flows implementation, such as that of Grathwohl et al. (2018).

Our framework recovers the stochastic adjoint method of Li et al. (2020), but our approach is sufficiently flexible to overcome its deficiencies. Moreover, using our approach, any existing neural ODE framework (such as Zhang et al. (2019)) can be extended to SDEs, simply by the addition of a few extra terms.

Following a review of background material in §2, the stochastic normalizing flows framework is introduced and discussed in §3, with our main approximation result presented in Theorem 2. Some numerical investigations are conducted in §4, including an application to hyperparameter optimization in stochastic MCMC.

2 Background Review

2.1 Continuous Normalizing Flows

We shall begin by reviewing the continuous normalizing flow framework for training ODE models, as our development of random and stochastic normalizing flows will build upon it. Consider a parameterized class of models of the following form: for $t \in [0,T]$, let $Z(t)$ satisfy the ODE with random initial condition (often called a random ordinary differential equation)

$$\frac{\mathrm{d}}{\mathrm{d}t}Z(t)=f(Z(t),t,\theta),\qquad Z(0)\sim p_0(\theta). \tag{1}$$

In a general machine learning context, one might choose $f$ such that the Euler discretization of (1) resembles layer-wise updates of a residual neural network (Lu et al., 2017; Chen et al., 2018), or one may parameterize $f$ as a neural network itself (Grathwohl et al., 2018). The resulting ODEs constitute the class of so-called neural ordinary differential equations. The following theorem is a consequence of the Liouville equation (equivalently, the Fokker–Planck equation) applied to the solution of the random ODE (1), and it yields an ODE for the log-density of $Z(t)$ evaluated along the trajectory.

Theorem 1 (Chen et al. (2018)).

Suppose that $Z(t)$ satisfies (1). The distribution of $Z(t)$ is absolutely continuous with respect to Lebesgue measure, with probability density $p_t$ satisfying

$$\frac{\mathrm{d}}{\mathrm{d}t}\log p_t(Z(t))=-\nabla_z\cdot f(Z(t),t,\theta). \tag{2}$$

Naively computing the divergence in (2) with automatic differentiation is of quadratic complexity in the dimension $d$. As pointed out by Grathwohl et al. (2018), this can be improved to linear complexity using a trace estimator (Roosta & Ascher, 2015):

$$\nabla_z\cdot f(z)=\operatorname{tr}\left(\frac{\partial f}{\partial z}\right)\approx\frac{1}{n}\sum_{k=1}^{n}\epsilon_k^\top\frac{\partial f}{\partial z}\epsilon_k, \tag{3}$$

where each $\epsilon_k$ is an independent and identically distributed copy of a random vector $\epsilon$ with zero mean and $\mathbb{E}[\epsilon\epsilon^\top]=I$. Common choices for $\epsilon$ include standard normal and Rademacher random vectors.
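The trace estimator in (3) is straightforward to check numerically. The following sketch (our own toy setup, using a fixed matrix in place of the Jacobian $\partial f/\partial z$) compares the Rademacher-probe estimate against the exact trace:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 10, 20000  # dimension, number of probe vectors

# A fixed matrix standing in for the Jacobian df/dz at one point.
J = rng.standard_normal((d, d))

# Hutchinson-style estimator: (1/n) sum_k eps_k^T J eps_k, Rademacher probes.
eps = rng.choice([-1.0, 1.0], size=(n, d))
est = np.mean(np.einsum("ki,ij,kj->k", eps, J, eps))

exact = np.trace(J)
print(est, exact)
```

In practice the probes are applied to vector-Jacobian products rather than an explicit Jacobian, which is what yields the linear cost in $d$.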

Training continuous normalizing flows often involves minimizing a scalar loss function involving $Z(T)$ and/or the log-density computed via Theorem 1, with respect to the parameters $\theta$. For this, we require gradients of the loss with respect to $\theta$. The most obvious approach is to directly backpropagate through a numerical integration scheme, as in Ryder et al. (2018), but this does not scale well in $d$. The superior alternative is the adjoint method, which computes derivatives of a scalar loss function by solving another differential equation in reverse time. Letting $L$ denote a scalar loss depending on $Z(T)$, the adjoint $a(t)=\nabla_{Z(t)}L$, as well as the gradient of $L$ in $\theta$, satisfy (Pontryagin, 2018, §12)

$$\frac{\mathrm{d}}{\mathrm{d}t}a(t)=-\nabla_z f(Z(t),t,\theta)\,a(t), \tag{4a}$$

$$\nabla_\theta L=\int_0^T\nabla_\theta f(Z(t),t,\theta)\,a(t)\,\mathrm{d}t. \tag{4b}$$

Together with (1), the equations (4) are solved in reverse time, starting from the terminal values $Z(T)$ and $a(T)=\nabla_{Z(T)}L$. By augmenting $Z(t)$ together with (2), this method also allows for loss functions depending on the log-density.

Solving (1), (2), and (4) can be achieved using off-the-shelf numerical integrators. Adaptive solvers prove particularly effective, although, as pointed out in Gholami et al. (2019), the backward solve (4) can often run into stability issues, suggesting a Rosenbrock or other implicit approach (Hairer & Wanner, 1996). We point out that the same is also true in the stochastic setting; see Hodgkinson et al. (2019), for example. For further implementation details concerning continuous normalizing flows, we refer to Grathwohl et al. (2018).
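To make the augmented system (1)–(2) concrete, here is a minimal Euler solve for a toy one-dimensional flow $f(z,t)=-z$ with $p_0=\mathcal{N}(0,1)$ (the function name and toy drift are our own choices, not from the paper). Since $\nabla_z\cdot f=-1$, Theorem 1 gives $\frac{\mathrm{d}}{\mathrm{d}t}\log p_t(Z(t))=+1$, so the log-density along the trajectory should increase linearly in $t$:

```python
import numpy as np

def cnf_euler(z0, T=1.0, dt=1e-3):
    """Euler integration of the state z together with log p_t(Z(t))."""
    logp = -0.5 * z0**2 - 0.5 * np.log(2 * np.pi)  # log p0(z0), p0 = N(0,1)
    z = z0
    n = int(round(T / dt))
    for _ in range(n):
        # f(z) = -z has divergence -1, so (2) gives d/dt log p_t(Z(t)) = +1.
        z += -z * dt
        logp += dt
    return z, logp

z_T, logp_T = cnf_euler(0.7)
# Analytically, Z(T) = Z(0) e^{-T} and log p_T(Z(T)) = log p0(Z(0)) + T.
print(z_T, logp_T)
```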

2.3 Rough Path Theory

The theory of rough paths was first introduced in (Lyons, 1998) to provide a supporting pathwise theory for SDEs. It has since flourished into a coherent pathwise alternative to stochastic calculus, facilitating direct stochastic generalizations of results from the theory of ODEs — we refer to Friz & Hairer (2014) for a gentle introduction, and Friz & Victoir (2010) for a thorough treatment of the topic. Suppose that we would like to prescribe meaning to the infinitesimal limit of the sequence of iterates

$$Z_{t+h}=Z_t+f(Z_t)(X_{t+h}-X_t),\qquad\text{as } h\to0^+. \tag{5}$$

In the case of SDEs, $X$ is a sample path of Brownian motion, so that each increment $X_{t+h}-X_t$ is a realization of a normal random vector with zero mean and covariance $hI$. Unfortunately, a strong limit of (5) fails to exist if $X$ is too "rough". In particular, suppose that $X$ is $\alpha$-Hölder continuous for some $\alpha\in(0,1]$, that is, there exists some $C>0$ such that $\|X_t-X_s\|\le C|t-s|^\alpha$ for any $s,t\in[0,T]$. Since the limit (5) is only well-defined if $\alpha>\tfrac12$ (Young, 1936), a function on $[0,T]$ is rough if it is Hölder-continuous only for $\alpha\le\tfrac12$. Sample paths of Brownian motion constitute rough paths under this definition. The problem is that the discretization (5) invokes the zeroth-order approximation $f(Z_{t+s})\approx f(Z_t)$ for $0\le s\le h$, which proves too poor. By instead taking a first-order approximation

$$f(Z_{t+s})\approx f(Z_t)+\nabla_z f(Z_t)(Z_{t+s}-Z_t)\approx f(Z_t)+\nabla_z f(Z_t)f(Z_t)(X_{t+s}-X_t),$$

we arrive at the Davie scheme (Davie, 2008)

$$Z_{t+h}=Z_t+f(Z_t)(X_{t+h}-X_t)+\nabla_z f(Z_t)f(Z_t)\,\mathbb{X}_{t,t+h}, \tag{6}$$

where $\mathbb{X}_{s,t}$ represents the "integral" $\int_s^t(X_r-X_s)\otimes\mathrm{d}X_r$. Once again, we cannot uniquely define $\mathbb{X}$ from the path $X$ itself, so instead we prescribe it. In fact, each choice of $\mathbb{X}$ satisfying Chen's relations

$$\mathbb{X}_{s,t}-\mathbb{X}_{s,u}-\mathbb{X}_{u,t}=(X_u-X_s)(X_t-X_u)^\top,$$

for any $0\le s\le u\le t\le T$, will reveal a different limit for (6) as $h\to0^+$, provided $\alpha>\tfrac13$ (for smaller $\alpha$, higher-order approximations are necessary). The pair $\mathbf{X}=(X,\mathbb{X})$ is referred to as a rough path, and the limit of (6) as $h\to0^+$ is the solution to the rough differential equation (RDE)

$$\mathrm{d}Z_t=f(Z_t)\,\mathrm{d}\mathbf{X}_t. \tag{7}$$
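As a concrete numerical sketch of the Davie scheme (6), consider the scalar equation $\mathrm{d}Z=Z\,\mathrm{d}X$ with $X$ a Brownian path and the Itô prescription $\mathbb{X}_{t,t+h}=((\Delta B)^2-h)/2$, under which (6) coincides with the Milstein method discussed in §3 (the step sizes and seed here are our own choices):

```python
import numpy as np

rng = np.random.default_rng(1)
T, n = 1.0, 1024
h = T / n
dB = np.sqrt(h) * rng.standard_normal(n)

# Davie scheme (6) for dZ = f(Z) dX with f(z) = z, so grad f(z) f(z) = z,
# using the Ito iterated integral X_{t,t+h} = ((dB)^2 - h) / 2.
z = 1.0
for db in dB:
    z = z + z * db + z * (db**2 - h) / 2.0

exact = np.exp(dB.sum() - T / 2.0)  # strong Ito solution of dZ = Z dB
print(z, exact)
```

The scheme tracks the exact strong solution along the simulated path, with first-order strong accuracy in $h$.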

Hölder continuity is critical to rough path theory — in the sequel, we equip the space of $\alpha$-Hölder functions with the $\alpha$-Hölder norm, defined by

$$\|X\|_\alpha\coloneqq\sup_{t\in[0,T]}\|X_t\|+\sup_{\substack{s,t\in[0,T]\\ s\neq t}}\frac{\|X_t-X_s\|}{|t-s|^\alpha}.$$

This definition extends to the iterated integral by replacing $X_t-X_s$ and $\alpha$ with $\mathbb{X}_{s,t}$ and $2\alpha$, respectively.

It is useful to identify a calculus which satisfies the usual chain and product rules. This occurs when the rough path is geometric, that is,

$$\operatorname{Sym}(\mathbb{X}_{s,t})=\tfrac12(X_t-X_s)(X_t-X_s)^\top,\qquad\forall s,t\ge0. \tag{8}$$

Every continuous and piecewise differentiable function $X$ is canonically lifted to a geometric rough path by taking $\mathbb{X}_{s,t}=\int_s^t(X_r-X_s)\otimes\dot{X}_r\,\mathrm{d}r$, where the derivative is interpreted in the weak sense. In these cases, (7) equates to the ODE $\dot{Z}_t=f(Z_t)\dot{X}_t$.

Geometric rough paths have two key properties of interest:

I. The canonical lifts of any sequence of smooth approximations $X^{(n)}$ which converge to $X$ as $n\to\infty$ in the $\alpha$-Hölder norm also converge in the $\alpha$-Hölder rough path metric

$$\varrho_\alpha\bigl((X,\mathbb{X}),(Y,\mathbb{Y})\bigr)=\|X-Y\|_\alpha+\|\mathbb{X}-\mathbb{Y}\|_{2\alpha},$$

to a geometric rough path $(X,\mathbb{X})$. Conversely, any geometric rough path can be approximated by some sequence of smooth paths (Friz & Hairer, 2014, Proposition 2.5).

II. The reverse-time process of a solution to any rough differential equation (7) with Lipschitz $f$ itself satisfies the corresponding reversed rough differential equation if and only if the driving rough path is geometric.

By property I, any solution to RDEs driven by a geometric rough path can be approximated by solutions to ODEs. Property II, which follows readily from the definition (8) in the limit (6), enables the adjoint method for rough differential equations driven by a geometric rough path.

3 Stochastic Normalizing Flows

Let $Z_t$ satisfy the Itô SDE

$$\mathrm{d}Z_t=\mu_t(Z_t,\theta)\,\mathrm{d}t+\sigma_t(Z_t,\theta)\,\mathrm{d}B_t,\qquad Z_0\sim p_0(\theta), \tag{9}$$

where $B_t$ is a multidimensional Brownian motion, and $\mu_t$, $\sigma_t$ are the drift and diffusion coefficients, respectively. Analogous to neural ODEs, neural SDEs choose $\mu_t$ to resemble a single layer of a neural network (Tzen & Raginsky, 2019). The dropout-inspired construction of Liu et al. (2019) suggests one particular choice of $\sigma_t$. Alternatively, one can parameterize both $\mu_t$ and $\sigma_t$ by multi-layer neural networks.

The reliance of stochastic calculus on non-anticipating processes, as well as the lack of continuity of solution maps of Itô SDEs, necessitates complicated and delicate arguments for extending each piece of the continuous normalizing flow framework from §2 to SDEs. We bypass the intricacies of existing theoretical treatments of neural SDEs by an approximation argument: for a smooth approximation $B^{(n)}$ of Brownian motion $B$, we estimate solutions of an SDE by a random ODE involving $B^{(n)}$. One must take great care with such approximations. For example, geometric Brownian motion, that is, the solution to $\mathrm{d}Z_t=Z_t\,\mathrm{d}B_t$, has the explicit expression $Z_t=Z_0\exp(B_t-t/2)$, which is not well-approximated by the solution to the random ODE $\dot{Z}_t=Z_t\dot{B}^{(n)}_t$. Theoretical verification of this approach is challenging using traditional stochastic calculus due to the irregularity of solution maps. Instead, we rely on rough path theory — particularly properties I and II of geometric rough paths.
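The geometric Brownian motion caveat above can be seen numerically: integrating the random ODE driven by a piecewise-linear approximation of a Brownian path lands near the Stratonovich solution $\exp(B_T)$, not the Itô solution $\exp(B_T-T/2)$. The grid sizes and fine Euler sub-stepping below are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)
T, n_seg, sub = 1.0, 256, 20
h = T / n_seg

# Piecewise-linear approximation B^(n) of one Brownian path on [0, T].
B = np.concatenate([[0.0], np.cumsum(np.sqrt(h) * rng.standard_normal(n_seg))])

# Solve the random ODE dz/dt = z * dB^(n)/dt with fine Euler sub-steps.
z = 1.0
dt = h / sub
for k in range(n_seg):
    slope = (B[k + 1] - B[k]) / h
    for _ in range(sub):
        z += z * slope * dt

strat = np.exp(B[-1])            # Stratonovich solution exp(B_T)
ito = np.exp(B[-1] - T / 2.0)    # Ito solution exp(B_T - T/2)
print(z, strat, ito)
```

This is the classical Wong–Zakai phenomenon motivating the Stratonovich conversion in §3.1.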

In the rough path framework, one can reconstruct the Itô stochastic calculus via the rough path $\mathbf{B}^{\text{Itô}}=(B,\mathbb{B}^{\text{Itô}})$, where $\mathbb{B}^{\text{Itô}}_{s,t}=\int_s^t(B_r-B_s)\otimes\mathrm{d}B_r$ is defined by Itô integration. Indeed, by Friz & Hairer (2014, Theorem 9.1), letting $\mathbf{B}^{\text{Itô}}(\omega)$ denote a realization of the Itô Brownian motion rough path, the solution to the rough differential equation

$$\mathrm{d}Z_t=\mu_t(Z_t,\theta)\,\mathrm{d}t+\sigma_t(Z_t,\theta)\,\mathrm{d}\mathbf{B}^{\text{Itô}}_t(\omega) \tag{10}$$

is a realization of the strong solution to (9). Likewise, the Davie scheme (6) corresponds to the Milstein integrator for SDEs (Kloeden & Platen, 2013, §10.3).

Unfortunately, $\mathbf{B}^{\text{Itô}}$ is not a geometric rough path, and so Theorem 2 cannot be directly applied. Instead, we shall proceed according to the following steps:

1. Convert the Itô SDE (9) to a Stratonovich SDE (§3.1).

2. Interpret the Stratonovich SDE pathwise as an RDE driven by a geometric rough path (12).

3. Approximate the pathwise Stratonovich RDE by a random ODE (§3.2).

4. Train the random ODE as a continuous normalizing flow with added latent variables (§3.4).

Consequently, the RDE (10) is estimated by the ODE $\dot{Z}_t=F_\omega(Z_t,t,\theta)$, where

$$F_\omega(z,t,\theta)=\underbrace{\tilde{\mu}_t(z,\theta)}_{\text{Stratonovich drift}}+\sigma_t(z,\theta)\underbrace{\frac{\mathrm{d}B^{(n)}_t(\omega)}{\mathrm{d}t}}_{\text{approximation}}.$$

3.1 Stratonovich calculus

The unique geometric rough path formed from Brownian motion yields the Stratonovich calculus: . A Stratonovich differential equation is commonly written in the form , where denotes Stratonovich integration: for a process adapted to the filtration generated by ,

 ∫tsYt∘dBt=lim|P|→0N∑k=112(Ytk+Ytk−1)(Btk−Btk−1),

where is a partition with mesh size , and the limit is in . This is to be compared with Itô integration which is defined instead by

 ∫tsYtdBt=lim|P|→0N∑k=1Ytk−1(Btk−Btk−1).

Stratonovich differential equations were recognized in Li et al. (2020) to be the correct setting for extending the adjoint method to SDEs. However, the adherence to classical stochastic calculus, which relies on adaptedness, somewhat complicates the argument. In our setting, the advantages of Stratonovich differential equations are clear: because they can be arbitrarily well-approximated by random ODEs, all methods of training continuous normalizing flows extend to them, including the adjoint method. Any Itô SDE can be converted into a Stratonovich SDE by adjusting the drift (Evans, 2012, p. 123), a fact readily seen by comparing limits of (6) with the Itô and Stratonovich prescriptions of $\mathbb{X}$. The following formula is particularly amenable to implementation with automatic differentiation: the Itô SDE (9) is equivalent to the Stratonovich SDE with drift $\tilde{\mu}_t$ provided that for each $i$,

$$\tilde{\mu}^i_t(x)=\mu^i_t(x)-\tfrac12\nabla_x\cdot\bigl(\sigma_t(x)\sigma_t^\top(x^*)\bigr)_i\Big|_{x^*=x}, \tag{11}$$

where $x^*$ is an independent copy of $x$ (held fixed under the differentiation), and the subscript $i$ denotes the $i$-th row. Once again, we can make use of the trace estimator (3) to increase performance in higher dimensions. In the rough path theory, Stratonovich SDEs are interpreted pathwise according to the RDE

$$\mathrm{d}Z_t=\tilde{\mu}_t(Z_t,\theta)\,\mathrm{d}t+\sigma_t(Z_t,\theta)\,\mathrm{d}\mathbf{B}^{\text{Strat}}_t(\omega), \tag{12}$$

which is equivalent to (10).
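In one dimension, the correction (11) reduces to $\tilde{\mu}(z)=\mu(z)-\tfrac12\sigma(z)\sigma'(z)$. A small sketch (function names ours; the derivative is taken by finite differences in place of automatic differentiation) checks this against the known answer for geometric Brownian motion:

```python
import numpy as np

# Ito-to-Stratonovich drift correction (11) in one dimension:
#   mu_tilde(z) = mu(z) - (1/2) * sigma(z) * sigma'(z).
def strat_drift(mu, sigma, z, eps=1e-6):
    dsig = (sigma(z + eps) - sigma(z - eps)) / (2 * eps)  # sigma'(z)
    return mu(z) - 0.5 * sigma(z) * dsig

# Geometric Brownian motion dZ = a Z dt + b Z dB has Stratonovich drift
# (a - b^2 / 2) Z.
a, b = 0.3, 0.5
mu = lambda z: a * z
sigma = lambda z: b * z

z = 2.0
print(strat_drift(mu, sigma, z), (a - 0.5 * b**2) * z)
```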

3.2 Wong–Zakai approximations

A random ODE estimating a Stratonovich SDE is commonly referred to as a Wong–Zakai approximation (Twardowska, 1996), after the authors of the seminal paper (Wong & Zakai, 1965), who first illustrated this concept for one-dimensional Brownian motion. We shall consider two types of Wong–Zakai approximation: a Karhunen-Loève expansion, and a piecewise linear function. These approximations are compared in Figure 1. In practice, we have found that the Karhunen-Loève expansion with a modest number of terms works well for training, while the piecewise linear approximation is preferable for testing.

3.2.1 Piecewise linear

Easily the most common approximation of Brownian motion involves exact simulation on a discrete set of times $0=t_0<t_1<\dots<t_n=T$, followed by linear interpolation. More precisely, letting $\Delta t_k=t_{k+1}-t_k$, for each $k=0,\dots,n-1$, we let

$$B^{(n)}_{t_{k+1}}=B^{(n)}_{t_k}+\sqrt{\Delta t_k}\,\omega_k,\qquad\omega_k\sim\mathcal{N}(0,1),$$

and consider the approximation

$$B^{(n)}_t=B^{(n)}_{t_k}+\frac{t-t_k}{t_{k+1}-t_k}\bigl(B^{(n)}_{t_{k+1}}-B^{(n)}_{t_k}\bigr),\qquad t\in[t_k,t_{k+1}].$$

Integrating the resulting Wong–Zakai approximation using Euler's method on the same set of time points is equivalent to performing the Euler–Maruyama method for solving the Stratonovich SDE. By Friz & Victoir (2010, Theorem 15.45), as the mesh size $\max_k\Delta t_k\to0$, the piecewise linear approximation converges almost surely to Brownian motion in the $\alpha$-Hölder norm, for any $\alpha<\tfrac12$.
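The construction above amounts to a cumulative sum of Gaussian increments plus linear interpolation; a minimal sketch (helper name ours) is:

```python
import numpy as np

def piecewise_linear_bm(T, n, rng):
    """Exact simulation of Brownian motion on an even grid of n steps,
    returning the grid, its values, and a linear interpolant."""
    t = np.linspace(0.0, T, n + 1)
    dt = T / n
    B = np.concatenate([[0.0], np.cumsum(np.sqrt(dt) * rng.standard_normal(n))])
    def B_n(s):
        return np.interp(s, t, B)  # linear interpolation between grid points
    return t, B, B_n

rng = np.random.default_rng(3)
t, B, B_n = piecewise_linear_bm(1.0, 100, rng)
print(B_n(0.0), B_n(0.505))
```

The interpolant matches the exact simulation at grid points and is piecewise linear in between, which is exactly what makes its derivative available to a random ODE solver segment by segment.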

3.2.2 Karhunen-Loève expansion

For any zero-mean Gaussian process $X$ on $[0,T]$, the covariance function $K(s,t)=\mathbb{E}[X_sX_t]$ is a positive-definite kernel. If $K$ is also continuous, Mercer's theorem guarantees the existence of an orthonormal basis $\{e_k\}_{k=1}^\infty$ of $L^2([0,T])$ with corresponding positive eigenvalues $\{\lambda_k\}_{k=1}^\infty$ such that $K(s,t)=\sum_{k=1}^\infty\lambda_k e_k(s)e_k(t)$. The process can be expanded in terms of these eigenfunctions as

$$X_t=\sum_{k=1}^\infty\sqrt{\lambda_k}\,\omega_k\,e_k(t),\qquad\omega_k\sim\mathcal{N}(0,1),$$

where each $\omega_k$ is independent. This is called the Karhunen-Loève expansion of $X$. Truncating the series after $n$ terms yields the $n$-th order Karhunen-Loève approximation, which has the smallest mean squared error over all expansions with $n$ orthogonal basis elements. Recalling that we are primarily interested in the endpoints of the solution, instead of expanding Brownian motion itself, we consider an approximation derived from the Karhunen-Loève expansion of the Brownian bridge $B_t-\tfrac{t}{T}B_T$:

$$B^{(n)}_t=\omega_0\frac{t}{\sqrt{T}}+\sum_{k=1}^{n-1}\omega_k\frac{\sqrt{2T}\,\sin(k\pi t/T)}{k\pi},\qquad n=1,2,\dots.$$

Using this approximation ensures that the terminal density for SDEs with constant drift and diffusion coefficients is computed exactly. By Friz & Victoir (2010, Theorem 15.51), $B^{(n)}$ converges almost surely as $n\to\infty$ to Brownian motion in the $\alpha$-Hölder norm, for any $\alpha<\tfrac12$. Furthermore, since each $B^{(n)}$ is smooth, Wong–Zakai approximations involving $B^{(n)}$ may be readily solved using adaptive ODE solvers.
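The bridge-based expansion above is a short sum of sines plus a linear term matching the endpoint; a direct sketch (function name ours) illustrates that the path starts at zero and terminates at $\omega_0\sqrt{T}$, as required for the terminal density to come out exactly in the constant-coefficient case:

```python
import numpy as np

def kl_brownian(t, omega, T=1.0):
    """Karhunen-Loeve style approximation of Brownian motion on [0, T]:
    a linear term matching B_T plus a sine expansion of the Brownian bridge."""
    n = len(omega)
    out = omega[0] * t / np.sqrt(T)
    for k in range(1, n):
        out = out + omega[k] * np.sqrt(2 * T) * np.sin(k * np.pi * t / T) / (k * np.pi)
    return out

rng = np.random.default_rng(4)
omega = rng.standard_normal(64)
t = np.linspace(0.0, 1.0, 201)
path = kl_brownian(t, omega)
print(path[0], path[-1])  # endpoints: 0 and (up to rounding) omega_0 * sqrt(T)
```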

3.3 Main result

Using Wong–Zakai approximations, a Stratonovich SDE can be uniformly approximated in Hölder norm by random ODEs. In Theorem 2, we show that the log-densities and loss function gradients for these random ODEs also converge appropriately. More generally, geometric rough paths (including the Stratonovich paths (12)) with random initial conditions can be approximately trained as random ODEs.

Theorem 2.

Let $\mathbf{X}=(X,\mathbb{X})$ be an $\alpha$-Hölder geometric rough path, and let $X^{(n)}$ be a sequence of piecewise differentiable functions on $[0,T]$ that approximate $X$ under the $\beta$-Hölder norm for $\tfrac13<\beta<\alpha$, that is, $\|X^{(n)}-X\|_\beta\to0$ as $n\to\infty$. Let $Z$ and $Z^{(n)}$ be solutions to the differential equations

$$\mathrm{d}Z_t=f(Z_t,t,\theta)\,\mathrm{d}\mathbf{X}_t,\qquad Z_0\sim p_0, \tag{13a}$$

$$\frac{\mathrm{d}Z^{(n)}_t}{\mathrm{d}t}=f(Z^{(n)}_t,t,\theta)\,\frac{\mathrm{d}X^{(n)}_t}{\mathrm{d}t},\qquad Z^{(n)}_0=Z_0, \tag{13b}$$

where $f$ is sufficiently smooth and $p_0$ is a density such that $\log p_0$ is continuous. Let $p^{(n)}_t$ denote the probability density of $Z^{(n)}_t$ at time $t$, given by (2). The distribution of $Z_t$ is absolutely continuous with respect to Lebesgue measure with corresponding continuous density $p_t$ satisfying:

1. For any $x$, $p^{(n)}_t(x)\to p_t(x)$ as $n\to\infty$.

2. The path $t\mapsto\log p_t(Z_t)$ is the unique solution to the rough differential equation

$$\mathrm{d}\log p_t(Z_t)=-\nabla_z\cdot\bigl(f(Z_t,t,\theta)\,\mathrm{d}\mathbf{X}_t\bigr). \tag{14}$$

3. For any smooth loss function $L$ and $t\in[0,T]$, as $n\to\infty$,

$$\nabla_\theta L\bigl(Z^{(n)}_t,\log p^{(n)}_t(Z^{(n)}_t)\bigr)\to\nabla_\theta L\bigl(Z_t,\log p_t(Z_t)\bigr). \tag{15}$$
Proof of Theorem 2.

Recall that each $X^{(n)}$ can be lifted canonically to a rough path $\mathbf{X}^{(n)}$ such that $\varrho_\beta(\mathbf{X}^{(n)},\mathbf{X})\to0$ as $n\to\infty$. For an arbitrary rough path $\mathbf{Y}$, we let $\Phi(\mathbf{Y},\xi)$ and $\Psi(\mathbf{Y},\ell)$ denote the solution maps for the rough differential equations (13a) with initial condition $\xi$, and (14) with initial condition $\ell$, respectively. By Friz & Hairer (2014, Theorem 8.10), $\Phi$ is a $C^1$-diffeomorphism in its initial condition, and hence, for each $t$ and $n$, $Z^{(n)}_t$ is an absolutely continuous random variable, whose corresponding density we denote by $p^{(n)}_t$. In fact, denoting by $\Phi_{-t}$ the inverse of $\Phi$,

$$p^{(n)}_t(x)=p_0\bigl(\Phi_{-t}(\mathbf{X}^{(n)},x)\bigr)\left|\det\frac{\partial\Phi_{-t}(\mathbf{X}^{(n)},x)}{\partial x}\right|, \tag{16}$$

$$p_t(x)=p_0\bigl(\Phi_{-t}(\mathbf{X},x)\bigr)\left|\det\frac{\partial\Phi_{-t}(\mathbf{X},x)}{\partial x}\right|, \tag{17}$$

and so both $p^{(n)}_t$ and $p_t$ are continuous.

and so both and are continuous. Furthermore, by Friz & Hairer (2014, Theorem 8.5), for any , there exist constants and such that for any -Hölder continuous rough paths , and , ,

 ∥Φ(X,ξ)−Φ(Y,~ξ)∥γ ≤CΦγ(∥ξ−~ξ∥+ϱβ(X,Y)) (18) ∥Ψ(X,ℓ)−Ψ(Y,~ℓ)∥γ ≤CΨγ(|ℓ−~ℓ|+ϱβ(X,Y)). (19)

We deduce the following for any and : (i) by (18); (ii) using (i) and continuity of , ; (iii) as a consequence of (16), (17), (18), and Friz & Hairer (2014, Theorem 8.10), ; (iv) combining (ii) and (iii), . Since by Theorem 1, (iv) and (19) imply and hence (14). Let be arbitrary. To show that converges uniformly in , observe that

$$\log p_t(x)=\Psi\bigl(\mathbf{X},\log p_0(\Phi_{-t}(\mathbf{X},x))\bigr),$$

and similarly for $\log p^{(n)}_t(x)$. Together with property II of geometric rough paths, inequality (18) reveals that $\Phi_{-t}(\mathbf{X}^{(n)},x)$ and $\Phi_{-t}(\mathbf{X},x)$ are uniformly bounded in $t$ and $n$. Since $\log p_0$ is continuous, $\log p_0(\Phi_{-t}(\mathbf{X}^{(n)},x))$ converges to $\log p_0(\Phi_{-t}(\mathbf{X},x))$ uniformly in $t$. Applying (19),

$$\sup_{t\in[0,T]}\bigl|\log p^{(n)}_t(x)-\log p_t(x)\bigr|\le C^\Psi_\gamma\Bigl(\varrho_\beta(\mathbf{X}^{(n)},\mathbf{X})+\bigl|\log p_0(\Phi_{-t}(\mathbf{X}^{(n)},x))-\log p_0(\Phi_{-t}(\mathbf{X},x))\bigr|\Bigr)\to0.$$

Finally, to prove (15), by Friz & Hairer (2014, Proposition 5.6), we can write $L(Z_t,\log p_t(Z_t))$ as the solution to the rough differential equation

$$\mathrm{d}\theta=0, \tag{20a}$$

$$\mathrm{d}L(Z_t,\log p_t(Z_t))=\nabla_z L(Z_t,\log p_t(Z_t))\cdot\mathrm{d}Z_t+\nabla_\ell L(Z_t,\log p_t(Z_t))\,\mathrm{d}\log p_t(Z_t), \tag{20b}$$

and similarly for $Z^{(n)}_t$ and $\log p^{(n)}_t(Z^{(n)}_t)$, where $\nabla_z$ and $\nabla_\ell$ denote the gradients with respect to the first and second arguments, respectively. The derivative of $L$ with respect to $\theta$ is a derivative of (20) with respect to its initial condition, and hence (15) follows from Friz & Hairer (2014, Theorem 8.10). ∎

3.4 Random continuous normalizing flows

By a conditioning argument, any random ODE, such as a Wong–Zakai approximation, may be treated as a continuous normalizing flow. Let $Z_t$ be the solution to a random ODE of the form

$$\frac{\mathrm{d}}{\mathrm{d}t}Z_t=f(Z_t,\omega,t,\theta), \tag{21}$$

where $\omega\sim q$ is a random vector independent of $Z_0$, $t$, and $\theta$. The reduction of a random ODE to this form is in keeping with the reparameterization trick (Xu et al., 2019). In particular, for the piecewise linear and Karhunen-Loève approximations, each $\omega_k$ is standard normal. After conditioning on $\omega$, Theorem 1 applied to (21) provides a means of computing the conditional density $p^\theta_t(Z_t\,|\,\omega)$, after sampling $\omega\sim q$. The density can be computed using a naive Monte Carlo estimator

$$p^\theta_t(Z_t)=\int p^\theta_t(Z_t\,|\,\omega)\,q(\omega)\,\mathrm{d}\omega\approx\frac{1}{N}\sum_{i=1}^N p^\theta_t(Z_t\,|\,\omega_i), \tag{22}$$

where the dependence on $\theta$ has been made explicit, and can be optimized over using the adjoint method. Analogously to Chen et al. (2018); Grathwohl et al. (2018), the density of data may be estimated along a single sample path in the following way: letting $\ell_t$ denote the log-density along the path, we see that $\ell_t$ also satisfies (2). By solving (21) and the corresponding (2) in reverse time from terminal conditions given by the data point, we recover the initial state and the accumulated change in log-density, and hence the conditional log-density of the data point. This is shown in Algorithm 1, which depends on an ODE solver odesolve, and yields a density estimation procedure for stochastic normalizing flows when paired with (22). Note that by comparison to Grathwohl et al. (2018, Algorithm 1), which encompasses steps 6–12 of our Algorithm 1, much of the density estimation procedure can be accomplished using an existing continuous normalizing flow implementation. In variational settings where the log-density of the flow is required, the same procedure applies, with the relevant sample generated by the SDE as well.

A number of techniques exist for debiasing the logarithm of (22) — see Rhee & Glynn (2015) and Rischard et al. (2018), for example. Alternatively, we are in the setting of semi-implicit variational inference, as seen in Yin & Zhou (2018) and Titsias & Ruiz (2019), and those techniques directly extend to our case as well. Naturally, it would be easiest to instead optimize the bound

$$-\log p^\theta_t(Z_t)\le-\mathbb{E}_\omega\log p^\theta_t(Z_t\,|\,\omega),$$

and in many cases we have found this to be effective. Observing that

$$\log p^\theta_t(Z_t)-D_{\mathrm{KL}}\bigl(q\,\big\|\,p^\theta_{\omega|Z_t}\bigr)=\mathbb{E}_\omega\log p^\theta_t(Z_t\,|\,\omega), \tag{23}$$

minimizing $-\mathbb{E}_\omega\log p^\theta_t(Z_t\,|\,\omega)$ maximizes the true log-likelihood regularized by the KL-divergence between the prior and posterior distributions for $\omega$, which reduces the effect of noise on the model. At the same time, parameterizations of the diffusion coefficient that allow it to shrink to zero will often do so, and should be avoided if the model is to remain distinct from a continuous normalizing flow.

4 Numerical experiments

4.1 Samplers and density estimation from data

For our first experiments, we train a stochastic normalizing flow (9) — using Algorithm 1 with the upper bound derived from (23) — to data generated from a specified target distribution. For our drift function, we adopt the same architecture used in the toy examples of Grathwohl et al. (2018): a four-layer fully-connected neural network with 64 hidden units in each layer. Dependence on time is removed to ensure a time-homogeneous, and hence potentially ergodic, SDE after training. All networks were trained using Adagrad (Duchi et al., 2011) with a batch size of 1000 samples.

4.1.1 A two-dimensional toy example

In our first example, our data is generated from the banana-shaped distribution

$$p(x,y)\propto\exp\left(-\tfrac12\left(x^2+\tfrac12\left(x^2+y\right)^2\right)\right).$$

Two choices of diffusion coefficient are considered: the first, a constant diagonal diffusion coefficient, yields a neural SDE that can be trained using the techniques of Li et al. (2020). For the second, we choose

$$\sigma(x)=\lambda\begin{pmatrix}1&\sigma_1(x)\\\sigma_2(x)&1\end{pmatrix}, \tag{24}$$

with $\lambda>0$, and parameterize $\sigma_1,\sigma_2$ by a two-layered neural network with 64 hidden units. This SDE can only be trained using our method. After training, to emulate the application of these SDEs as approximate samplers, a single sample path with 10,000 steps was simulated for each model using the Euler–Maruyama method. The resulting paths are compared in Figure 2. From data alone, both models constructed recurrent processes. The addition of a trainable diffusion coefficient led to improved adaptation of the sampler to the underlying curvature.

4.1.2 Visualizing regularization

As discussed in Liu et al. (2019), the stochastic noise injection in SDEs is a natural form of regularization that can potentially improve robustness to noisy or adversarial data. We visualize this effect by considering the same stochastic normalizing flows treated in §4.1.1 with diffusion coefficient (24), and adjusting the parameter $\lambda$. Our data is generated in polar coordinates from a ten-pointed star-shaped distribution by

$$\theta\sim\mathrm{Unif}(-\pi,\pi),\qquad r\,|\,\theta\sim\mathcal{N}\left(2\sqrt{1+\tfrac12\sin(10\theta)},\,\tfrac{9}{400}\right).$$

In Figure 3, we plot the densities computed using Algorithm 1 for several values of $\lambda$, noting that the case $\lambda=0$ corresponds to a continuous normalizing flow. Increasing $\lambda$ reveals generative models with expectedly higher variance, but with improved capacity to smooth out minor (and potentially unwanted) details.

4.2 Optimizing stochastic MCMC

An interesting class of SDE models for approximating a target distribution $p$ are targeted diffusions: solutions to SDEs that are $p$-ergodic. A convenient representation of such diffusions is known (Ma et al., 2015). Because these diffusions are frequently used in MCMC algorithms, conducting VI with respect to targeted diffusions is, in a sense, analogous to optimizing the convergence rate of stochastic MCMC algorithms.

To illustrate the potential applications of stochastic normalizing flows for finding and examining optimal stochastic MCMC algorithms for a particular target distribution, we consider a basic setup, where the target $p$ is the one-dimensional Cauchy distribution $p(z)\propto(1+z^2)^{-1}$. All $p$-ergodic SDEs are of the form

$$\mathrm{d}Z_t=\left(-\frac{\sigma(Z_t)^2\,Z_t}{1+Z_t^2}+\tfrac12\bigl(\sigma^2\bigr)'(Z_t)\right)\mathrm{d}t+\sigma(Z_t)\,\mathrm{d}B_t, \tag{25}$$

and we may choose $\sigma$ arbitrarily. A priori, an optimal choice of $\sigma$ (up to constants) to ensure rapid mixing of (25) is not obvious. The present rule of thumb from second-order methods chooses $\sigma$ according to the local curvature of the target (Girolami & Calderhead, 2011). We train a stochastic normalizing flow for (25) with $\sigma$ parameterized by a four-layer neural network with 32 hidden units in each layer. The corresponding loss function is taken to be the Kullback-Leibler divergence between the law of the flow and the target, estimated using Algorithm 1, with an $\ell^2$ penalty term over the weights of the neural network to prevent the diffusion coefficient from growing without bound. The results are presented in Figure 4.

Curiously, after training, we found that the apparent "optimal" choice of $\sigma$ is closely approximated by an expression involving the density of the standard normal distribution.

5 Conclusion

We have extended the continuous normalizing flows framework to generative models involving SDEs. Justified by rough path theory, our framework enables practitioners of neural ODEs to apply their existing implementation for training neural SDEs. This is advantageous, as neural SDEs have been suggested to be more robust than neural ODEs in high-dimensional real-world examples (Liu et al., 2019; Li et al., 2020). Stochastic normalizing flows can be implemented as a device for investigating “optimal” hyperparameters in stochastic MCMC, which could prove useful for informing future research, but they may require implementational improvements for high-dimensional cases, e.g., variance reduction techniques and improved loss estimators.

Acknowledgements.

We would like to acknowledge DARPA, NSF, and ONR for providing partial support of this work.