1 Introduction and informal summary of results
Recently there has been much interest in using continuous-time processes to analyze discrete-time algorithms and probabilistic models (Wibisono et al., 2016; Li et al., 2017; Mandt et al., 2017; Chen et al., 2018; Yang et al., 2018). In particular, diffusion processes have been examined as a way towards a better understanding of first- and second-order optimization methods, as they afford an analysis of behavior over non-convex landscapes using a rich array of techniques from the mathematical physics literature (Li et al., 2017; Raginsky et al., 2017; Zhang et al., 2017). Gradient flows and diffusions have also found a role in the analysis of deep neural nets, where they are interpreted as describing the limiting case of infinitely many layers, with each layer being ‘infinitesimally thin’ (e.g., Chen et al. (2018); Li et al. (2018)). As in the case of optimization, continuous-time frameworks enable the use of a different set of tools for studying standard questions of relevance, such as sampling and inference, i.e., forward and backward passes through the network.
In this work, we consider a class of generative models where the latent object is a -dimensional diffusion and the observable object is a random element of some space :
where (1.1a) is a -dimensional Itô diffusion process whose drift is a member of some parametric function class, such as multilayer feedforward neural nets, and (1.1b) prescribes an observation model for generating conditionally on . To the best of our knowledge, generative models of this form were first considered by Movellan et al. (2002) as a noisy continuous-time counterpart of recurrent neural nets. More recently, Hashimoto et al. (2016) and Ryder et al. (2018) investigated the use of discrete-time recurrent neural nets to approximate the population dynamics of biological systems that are classically modeled by diffusions. It is natural to view (1.1) as a continuum limit of deep generative models introduced by Rezende et al. (2014) — in fact, as we explain in Section 4, one can simulate a model of the above form using a deep generative model with a random number of layers. Alternatively, one can think of (1.1) as a neural stochastic differential equation, in analogy to the neural ODE framework of Chen et al. (2018).
There are three main questions that are natural to ask concerning the usefulness of such models: How expressive can they be? How might one sample from such a diffusion process? How might one perform inference on it? As our first contribution, we provide a unified view of sampling and inference through the lens of stochastic control. In particular, by adding a control to the drift of some reference diffusion, one can obtain a desired distribution at , and the minimal-cost control that yields exact sampling is given by the so-called Föllmer drift (Föllmer, 1985; Dai Pra, 1991; Lehec, 2013; Eldan and Lee, 2018). Complementarily, we show that any control added to the drift in (1.1a) leads to a variational upper bound on the negative log-likelihood of a given tuple of observations . Variational inference then reduces to minimizing the expected control cost over a tractable class of controls. While we provide a unifying viewpoint that captures both sampling and inference, we emphasize that this is a synthesis of a number of existing results, and serves as a conceptual underpinning and motivation for our subsequent analysis. Specifically, after establishing that diffusion-based generative models can be effectively worked with, we explore their expressive power vis-à-vis neural nets: We show that, if the target density of can be efficiently approximated using a neural net, then the corresponding Föllmer drift can also be efficiently approximated by a neural net, such that the terminal law of the diffusion with this approximate drift is -close to the target density in Kullback–Leibler divergence. Finally, we investigate unbiased simulation methods for generative models with underlying diffusion processes and provide bounds on the variance of the resulting estimators.
1.1 Method of analysis: an overview
To arrive at the unified perspective of sampling and inference, we begin by formulating a stochastic control problem that captures all of our desiderata: sampling from a target probability law at terminal time ; a set of tractable controls that might be used to take it there; and an appropriate notion of cost with that captures both the ‘control effort’ and the terminal cost that quantifies the discrepancy between the final probability law and the target measure .
Our first result, stated in Theorem 2.1, is an explicit characterization of the value function of this control problem, which has a free-energy interpretation and can be understood from an information-theoretic viewpoint: the Kullback–Leibler divergence between the law of the path of the uncontrolled diffusion and that of the path of the controlled diffusion is the expected total work done by the control. The negative free energy with respect to the uncontrolled process is a lower bound on that of the controlled process after accounting for the work done, and equality is achieved by the optimal control. As pointed out above, this result is a synthesis of a number of existing results, and its main purpose is to motivate the use of controlled diffusions in probabilistic generative modeling.
We next examine the expressiveness of these generative models, which refers to their ability to generate samples from a given target distribution for when the observation model in (1.1b) is fixed. In Theorem 3.1, we provide quantitative guarantees for obtaining approximate samples from a given target distribution for when the drift in (1.1a) is restricted to be a multilayer feedforward neural net. Specifically, we show that, if the density of with respect to the standard Gaussian measure on can be efficiently approximated by a feedforward neural net, then the corresponding Föllmer drift can also be approximated efficiently by a neural net. Moreover, this approximate Föllmer drift yields a diffusion , such that satisfies for a given accuracy . Under some assumptions on the smoothness of and and on their uniform approximability by neural nets, the proof proceeds as follows: First, we show that the Föllmer drift can be approximated by a neural net uniformly over a given compact subset of and for all . Then, to show that the terminal distribution resulting from this approximation is -close to in KL-divergence, we use Girsanov’s theorem to relate to the expected squared error between the Föllmer drift and its neural-net approximation.
Finally, we discuss the issue of unbiased simulation with the goal of estimating expected values of functions of . The standard Euler–Maruyama scheme (Graham and Talay, 2013, Chap. 7) is straightforward, but produces a biased estimator. One way to obtain an unbiased estimator is to employ a random discretization of the time interval , where the sampling times are generated by a point process on the real line. Unbiased simulation schemes of this type have been proposed and analyzed by Bally and Kohatsu-Higa (2015), Andersson and Kohatsu-Higa (2017), and Henry-Labordère et al. (2017). Our final result, Theorem 4.1, builds on the latter work and presents an unbiased, finite-variance simulation scheme. Conceptually, the simulation scheme can be thought of as a deep latent Gaussian model in the sense of Rezende et al. (2014), but with a random number of layers. Unfortunately, the variance of the resulting estimator can exhibit exponential dependence on dimension. We show why this is the case via an analysis of the moment-generating function of the point process used to generate the random mesh and propose alternatives to reduce the variance.
The Euclidean norm of a vector will be denoted by ; the transpose of a vector or a matrix will be indicated by . The -dimensional Euclidean ball of radius centered at the origin will be denoted by . The standard Gaussian measure on will be denoted by . The Euclidean heat semigroup , , acts on measurable functions as follows:
A function is of class if it is twice continuously differentiable in the space variable and once continuously differentiable in the time variable .
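As an illustration of the heat semigroup, the following minimal sketch estimates it by Monte Carlo. It assumes the standard definition (Q_t f)(x) = E[f(x + sqrt(t) Z)] with Z a standard Gaussian vector; the function names, sample size, and test function are hypothetical choices made only for this example.

```python
import math
import random


def heat_semigroup(f, x, t, n_samples=50_000, seed=0):
    """Monte Carlo estimate of (Q_t f)(x) = E[f(x + sqrt(t) * Z)], Z ~ N(0, I_d).

    The definition above is the standard one and is assumed here.
    """
    rng = random.Random(seed)
    scale = math.sqrt(t)
    total = 0.0
    for _ in range(n_samples):
        # Perturb every coordinate of x by independent Gaussian noise.
        y = [xi + scale * rng.gauss(0.0, 1.0) for xi in x]
        total += f(y)
    return total / n_samples


def sq_norm(y):
    """Squared Euclidean norm, used as an illustrative test function."""
    return sum(yi * yi for yi in y)


# For f(y) = |y|^2 the semigroup is known in closed form: |x|^2 + t * d.
x, t = [1.0, -2.0], 0.5
estimate = heat_semigroup(sq_norm, x, t)
exact = sq_norm(x) + t * len(x)
print(f"Monte Carlo: {estimate:.3f}, closed form: {exact:.3f}")
```

The closed-form comparison is only a sanity check; for a general function the Monte Carlo average is the practical route.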
2 Exact sampling and variational inference: a unified stochastic control viewpoint
Before addressing the specific questions posed in the Introduction, we aim to demonstrate that both sampling and variational inference in generative models of the form (1.1) can be viewed through the lens of stochastic control. We give a brief description of the relevant ideas in Appendix A; the book by Fleming and Rishel (1975) is an excellent and readable reference.
2.1 A stochastic control problem
Let be a probability space with a complete and right-continuous filtration , and let be a standard -dimensional Brownian motion adapted to . Consider the Itô diffusion process
where the drift is sufficiently well-behaved (say, bounded and Lipschitz). Then the process admits a transition density, i.e., a family of functions for all , such that, for all points and all Borel sets ,
(see, e.g., Protter (2005, Chap. V)).
Consider the following stochastic control problem: Let be the set of controls, i.e., measurable functions . Any defines a diffusion process by
We say that is a diffusion controlled by . Let a function be given. For each , we define the family of cost-to-go functions
where is shorthand for . The value functions are defined by
Theorem 2.1. Consider the control problem (2.4). The value function is given by
where the conditional expectation is with respect to the uncontrolled diffusion process (2.1). Moreover, the optimal control is given by , where the gradient is taken with respect to the space variable , and the corresponding controlled diffusion has the transition density
where is the transition density (2.2) of the uncontrolled process.
This result, proved in Appendix A, also admits an information-theoretic interpretation. Let denote the probability law of the path of the uncontrolled diffusion process (2.1) and let denote the corresponding object for the controlled diffusion (2.3). Since and differ from each other by a change of drift, the probability measures and are mutually absolutely continuous, and the Radon–Nikodym derivative is given by the Girsanov formula (Protter, 2005)
where , with and denoting the th coordinates of and respectively. From (2.8), we can calculate the Kullback–Leibler divergence between and :
Therefore, by Theorem 2.1, for any control , we can write
with equality if and only if . An inequality of this form holds more generally for real-valued measurable functions of the entire path (Boué and Dupuis, 1998).
We will now demonstrate how both the problem of sampling and the problem of variational inference can be addressed via the above theorem.
2.2 Exact sampling: the Föllmer drift
Recall that, in the context of exact sampling, the objective is to construct a diffusion process , such that has a given target distribution . We will consider the case when is absolutely continuous with respect to the standard Gaussian measure and let denote the Radon–Nikodym derivative . This problem goes back to a paper of Schrödinger (1931); for rigorous treatments, see, e.g., Jamison (1975), Föllmer (1985), Dai Pra (1991), Lehec (2013), Eldan and Lee (2018). The derivation we give below is not new (see, e.g., Dai Pra (1991, Thm. 3.1)), but the route we take is somewhat different in that we make the stochastic control aspect more explicit.
We take and in (2.1). Then the diffusion process is simply the standard -dimensional Brownian motion , which has the Gaussian transition density
where denotes the Euclidean heat semigroup (1.2). Hence, , and the optimal diffusion process has the drift . Following Lehec (2013) and Eldan and Lee (2018), we will refer to as the Föllmer drift in the sequel.
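To make the Föllmer drift concrete, here is a minimal one-dimensional sketch. It assumes the drift is the gradient of the logarithm of the heat semigroup applied to the density ratio at time 1 - t, and that the process starts at the origin. The target (an equal mixture of N(M, 1) and N(-M, 1)) is a hypothetical choice for which the drift works out to M * tanh(M * x); all names are illustrative.

```python
import math
import random

M = 1.0  # hypothetical mode location of the bimodal target


def f(y):
    """Density of the target mixture with respect to the standard Gaussian."""
    return math.exp(-0.5 * M * M) * math.cosh(M * y)


def grad_f(y):
    """Derivative of f."""
    return math.exp(-0.5 * M * M) * M * math.sinh(M * y)


def follmer_drift_mc(x, t, n_samples=20_000, seed=0):
    """Estimate d/dx log Q_{1-t} f(x) as E[f'(x + sZ)] / E[f(x + sZ)], s = sqrt(1 - t)."""
    rng = random.Random(seed)
    s = math.sqrt(1.0 - t)
    num = den = 0.0
    for _ in range(n_samples):
        y = x + s * rng.gauss(0.0, 1.0)
        num += grad_f(y)
        den += f(y)
    return num / den


def exact_drift(x):
    """Closed-form drift for this particular target (time-independent)."""
    return M * math.tanh(M * x)


def sample_terminal(rng, n_steps=50):
    """Euler-Maruyama simulation of dX = drift dt + dW on [0, 1], started at 0."""
    h = 1.0 / n_steps
    x = 0.0
    for _ in range(n_steps):
        x += exact_drift(x) * h + math.sqrt(h) * rng.gauss(0.0, 1.0)
    return x


drift_estimate = follmer_drift_mc(x=0.7, t=0.5)
rng = random.Random(1)
samples = [sample_terminal(rng) for _ in range(5000)]
second_moment = sum(x * x for x in samples) / len(samples)
print(f"drift: {drift_estimate:.3f} vs {exact_drift(0.7):.3f}")
print(f"E[X_1^2]: {second_moment:.3f} vs target {1.0 + M * M:.3f}")
```

The simulation uses a fixed-step discretization, so the terminal samples match the target only up to discretization and Monte Carlo error.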
It remains to show that . Using the formula (2.7) for the transition density of together with the fact that and , we see that . Then, for any Borel set ,
Moreover, using the entropy inequality (2.10), we can show that the Föllmer drift is optimal in the following strong sense: Consider any control with and with the property that . For any such control,
while clearly . Therefore, it follows from (2.10) that, for any such control ,
with equality if and only if . Thus, the Föllmer drift has the minimal ‘energy’ among all admissible controls that induce the distribution at , and this energy is precisely the Kullback–Leibler divergence between and the standard Gaussian measure (Dai Pra, 1991; Lehec, 2013; Eldan and Lee, 2018).
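A simple worked instance of this optimality statement is the Gaussian shift. The notation below ($\gamma_d$ for the standard Gaussian measure, $Q_t$ for the heat semigroup, $f$ for the density ratio, $m$ for a hypothetical mean vector) is assumed for this example.

```latex
% Target: \mu = N(m, I_d), so that f = d\mu / d\gamma_d is
\[
  f(x) = \exp\!\Big( \langle m, x \rangle - \tfrac{1}{2}|m|^2 \Big).
\]
% Heat semigroup applied to f, using E[e^{\langle a, Z \rangle}] = e^{|a|^2/2}:
\[
  Q_{1-t} f(x)
  = \mathbf{E}\, f\big(x + \sqrt{1-t}\, Z\big)
  = \exp\!\Big( \langle m, x \rangle - \tfrac{t}{2}|m|^2 \Big),
  \qquad Z \sim \gamma_d .
\]
% The Foellmer drift is therefore constant:
\[
  \nabla \log Q_{1-t} f(x) = m,
  \qquad\text{so}\qquad
  X_1 = m + W_1 \sim N(m, I_d) = \mu .
\]
% Its energy equals the Kullback-Leibler divergence, as claimed:
\[
  \frac{1}{2}\, \mathbf{E} \int_0^1 |m|^2 \, dt
  = \frac{|m|^2}{2}
  = D(\mu \,\|\, \gamma_d).
\]
```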
2.3 Variational inference
We now turn to the problem of variational inference. We are given an -tuple of observations , and wish to upper-bound the negative log-likelihood
where and is the diffusion process (1.1).
where the quantity on the right-hand side can be thought of as the variational free energy that depends on the choice of the control , and equality is achieved when . While the structure of the optimal control is described in Theorem 2.1, it may not be possible to derive it in closed form. However, we can fix a class of tractable suboptimal controls and upper-bound by . For example, we can take to consist of all controls of the form for some . In that case, is the sum of the Brownian motion and the affine drift , and consequently
where the expectation is taken with respect to the standard Brownian motion . Another possibility is to consider controls of the form , for some . The corresponding controlled diffusion is the Ornstein–Uhlenbeck process , and the variational free energy can be minimized over .
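The variational bound can be checked in closed form on a toy case. The sketch below makes several assumptions not taken from the text: a single scalar observation, a Brownian reference process started at 0, constant controls u = v, and a hypothetical Gaussian observation model q(y | x) = N(y; x, sigma^2). Under these assumptions the free energy is v^2/2 plus the expected negative log-likelihood of y under X_1 = v + W_1.

```python
import math


def free_energy(v, y, sigma):
    """Variational free energy for the constant control u = v.

    Control cost v^2 / 2 plus E[-log q(y | v + W_1)], where
    E[(y - v - W_1)^2] = (y - v)^2 + 1.
    """
    expected_nll = (0.5 * math.log(2.0 * math.pi * sigma**2)
                    + ((y - v) ** 2 + 1.0) / (2.0 * sigma**2))
    return 0.5 * v * v + expected_nll


def exact_nll(y, sigma):
    """Exact -log p(y): here y = W_1 + sigma * noise, so p = N(0, 1 + sigma^2)."""
    var = 1.0 + sigma**2
    return 0.5 * math.log(2.0 * math.pi * var) + y * y / (2.0 * var)


y, sigma = 2.0, 1.0
v_star = y / (1.0 + sigma**2)   # minimizer of the free energy over constants
bound = free_energy(v_star, y, sigma)
nll = exact_nll(y, sigma)
print(f"best constant control: {v_star:.3f}")
print(f"free energy {bound:.4f} >= exact NLL {nll:.4f}")
```

The gap between the two numbers reflects the restriction to constant controls; a richer control class could only shrink it.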
Now that we have shown that generative models of the form (1.1) allow for both sampling and variational inference, we turn to the analysis of their expressiveness. Specifically, our objective is to show that, by working with a suitable structured class of drifts , we can achieve approximate sampling from a rich class of distributions at the terminal time .
Let be the target probability measure for . We assume that is absolutely continuous with respect to and let denote the Radon–Nikodym derivative . From Section 2.2 we know that the diffusion process governed by the Itô SDE
with the Föllmer drift has the property that , and, moreover, it is optimal in the sense that it minimizes the ‘energy’ among all adapted drifts that result in distribution at time . The main result of this section is as follows: If the Radon–Nikodym derivative can be approximated efficiently by multilayer feedforward neural nets, then, for any , there exists a drift that can be implemented exactly by a neural net whose parameters do not depend on time or space, and the terminal law of the diffusion process
is an -approximation to in the KL-divergence: . Moreover, the size of the neural net that implements the approximate Föllmer drift can be estimated explicitly in terms of the size of a suitable approximating neural net for .
We begin by imposing some assumptions on . The first assumption is needed to guarantee enough regularity for the Föllmer drift:
Assumption 3.1. The function is differentiable, both and are -Lipschitz, and there exists a constant , such that everywhere.
This assumption is satisfied, for example, by Gibbs measures of the form with a differentiable potential , such that both and are Lipschitz, and is bounded from above; see Appendix B for details.
Next, we introduce the assumptions pertaining to the approximability of by neural nets. Let be a fixed nonlinearity. Given a vector and scalars , define the function
For , we define the class of -layer feedforward neural nets with activation function recursively as follows: consists of all functions of the form for all , , , and, for each ,
Thus, each element of is a function that represents computation by a directed acyclic graph, where each node receives inputs , performs a computation of the form , and communicates the outcome of the computation to all the nodes in the next layer. We refer to as the depth of the neural net, and define the size of the neural net as the total number of nodes in its computation graph. We will denote by the collection of all neural nets with depth and size . All these definitions extend straightforwardly to the case of neural nets with vector-valued output and to the case where each node may have a different activation function.
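The layered structure just described can be sketched as follows. This is an illustration under assumptions: each node is taken to compute activation(<w, inputs> + b), size counts only computing nodes (not inputs), and the class name, weights, and softplus activation are hypothetical.

```python
import math


def softplus(u):
    """A differentiable activation, used here only as an example."""
    return math.log1p(math.exp(u))


class FeedforwardNet:
    """A layered net: every node feeds all nodes of the next layer."""

    def __init__(self, layers, activation):
        # layers: list of layers; each layer is a list of (weights, bias) nodes.
        self.layers = layers
        self.activation = activation

    @property
    def depth(self):
        """Number of layers."""
        return len(self.layers)

    @property
    def size(self):
        """Total number of computing nodes in the graph."""
        return sum(len(layer) for layer in self.layers)

    def __call__(self, x):
        values = list(x)
        for layer in self.layers:
            # Each node applies the activation to an affine function of its inputs.
            values = [
                self.activation(sum(w * v for w, v in zip(weights, values)) + bias)
                for weights, bias in layer
            ]
        return values


# Two inputs -> three hidden nodes -> one output node.
hidden = [([1.0, -1.0], 0.0), ([0.5, 0.5], 0.1), ([-1.0, 2.0], -0.2)]
output_layer = [([1.0, 1.0, 1.0], 0.0)]
net = FeedforwardNet([hidden, output_layer], softplus)
output = net([0.3, -0.7])
print(net.depth, net.size, output)
```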
We assume that the activation function is differentiable and universal, in the sense that any univariate Lipschitz function which is nonconstant on a bounded interval can be approximated arbitrarily well by an element of :
Assumption 3.2. The activation function is differentiable. Moreover, there exists a constant depending only on , such that the following holds: For any -Lipschitz function which is constant outside the interval and for any , there exist real numbers , where , such that the function
Apart from differentiability, this is the same assumption made by Eldan and Shamir (2016). For example, it holds for differentiable sigmoidal activation functions, i.e., monotonic functions that satisfy and for some . The ReLU activation function is universal in the above sense but not differentiable. However, we can replace it by the differentiable softplus function , where increasing the value of results in finer approximations to the ReLU. Also, note that the function differs from the elements of by the presence of the constant term . However, the constant function can be implemented by , for any such that . Thus, we will refer to functions of the form (3.3) as -layer neural networks of size .
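The softplus approximation of the ReLU can be checked numerically. The sketch below assumes the common parameterization softplus(u) = log(1 + exp(beta * u)) / beta, for which the largest gap to the ReLU is log(2) / beta, attained at u = 0; the grid and the values of beta are arbitrary.

```python
import math


def relu(u):
    return max(u, 0.0)


def softplus(u, beta):
    """log(1 + exp(beta * u)) / beta, written in a numerically stable form."""
    return max(u, 0.0) + math.log1p(math.exp(-beta * abs(u))) / beta


def max_gap(beta, grid):
    """Largest difference between softplus and ReLU over the grid."""
    return max(softplus(u, beta) - relu(u) for u in grid)


grid = [i / 100 for i in range(-500, 501)]  # includes u = 0
gaps = {beta: max_gap(beta, grid) for beta in (1.0, 10.0, 100.0)}
for beta, gap in gaps.items():
    print(f"beta = {beta:6.1f}: max gap = {gap:.6f}, log(2)/beta = {math.log(2) / beta:.6f}")
```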
We also make the following assumption regarding approximability of by neural nets:
Assumption 3.3. For any and , there exists a neural net with , such that
Typical results on neural net approximation are concerned with approximating a given function uniformly on a given compact set. By contrast, Assumption 3.3 requires uniform approximability of both and its gradient on a compact set by some neural net and its gradient . Such simultaneous approximation guarantees can also be found in the literature, see, e.g., Hornik et al. (1990); Yukich et al. (1995); Li (1996). See Safran and Shamir (2017) for a discussion of various trade-offs between depth and width (maximum number of neurons per layer) in neural net approximation.
We are now in a position to state the main result of this section:
Theorem 3.1. Suppose Assumptions 3.1–3.3 are in force. Let denote the maximum of the Lipschitz constants of and . Then, for any , there exists a neural net with size polynomial in , such that the activation function of each neuron is an element of the set , and the following holds: If is the diffusion process governed by the Itô SDE
with the drift , then satisfies .
3.1 The proof of Theorem 3.1
The proof relies on three key steps: First, we show that the heat semigroup can be approximated by a finite sum of the form uniformly for all and all , where lie in a ball of radius . This result is stated in Appendix C and proved using empirical process methods. Next, replacing with a suitable neural net approximation , we build on this result to show that the Föllmer drift can be approximated by a neural net using , , and ReLU as activation functions. This is the content of Theorem 3.2 below (the proof is given in Appendix D). The third step uses Girsanov theory to upper-bound the approximation error that results from replacing the Föllmer drift by this neural net.
Theorem 3.2. Let and be given. Then there exists a neural net of size polynomial in , such that the activation function of each neuron is an element of the set , and the following holds:
Let and . The Girsanov formula gives
(Bubeck et al., 2018, Lemma 3.8). Therefore,
Choosing large enough to guarantee and putting everything together, we obtain . Therefore, by the data processing inequality.
4 Unbiased simulation
Now that we have shown that generative models with latent diffusions are capable of expressing a rich class of probability distributions, we turn to the problem of unbiased simulation. Specifically, given a function, we wish to estimate the expectation , where with is a diffusion process of the form (1.1). The simplest approach is to use the Euler–Maruyama scheme: Fix a partition of and define the Itô process by and
In particular, for each ,
We can then estimate the expectation by , but this estimate is biased: if is, say, bounded, then
where is some constant that depends on and on the starting point (Graham and Talay, 2013). Recently, several authors (Bally and Kohatsu-Higa, 2015; Andersson and Kohatsu-Higa, 2017; Henry-Labordère et al., 2017) have studied unbiased simulation of SDEs using Euler–Maruyama schemes with random partitions, where the partition breakpoints are generated by a Poisson point process on the real line. In this section, we build on this line of work and present a scheme for unbiased simulation in the context of generative models of the form (1.1) that uses random partitions generated by arbitrary renewal processes (Kallenberg, 2002, Chap. 9) with sufficiently well-behaved densities of interrenewal times. Our analysis closely follows that of Henry-Labordère et al. (2017), but we provide a more refined analysis of the variance of the resulting estimators.
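The bias of the fixed-step Euler–Maruyama scheme is easy to see on a linear example. The sketch below uses a hypothetical drift b(x, t) = -x with unit diffusion coefficient and the unbounded test function g(x) = x, so it sits outside the boundedness assumptions used in the text and is meant only as an illustration. For this drift the scheme's mean is x0 * (1 - 1/n)^n, whereas the exact mean is x0 * exp(-1).

```python
import math
import random


def euler_maruyama(drift, x0, n_steps, rng):
    """Simulate X_1 for dX = drift(X, t) dt + dW on a uniform grid of [0, 1]."""
    h = 1.0 / n_steps
    x = x0
    for k in range(n_steps):
        x += drift(x, k * h) * h + math.sqrt(h) * rng.gauss(0.0, 1.0)
    return x


def drift(x, t):
    """Hypothetical linear drift, chosen because the means are known exactly."""
    return -x


x0, n_steps, n_paths = 2.0, 4, 40_000
rng = random.Random(0)
mc_mean = sum(euler_maruyama(drift, x0, n_steps, rng) for _ in range(n_paths)) / n_paths

scheme_mean = x0 * (1.0 - 1.0 / n_steps) ** n_steps  # mean of the discretized process
exact_mean = x0 * math.exp(-1.0)                     # mean of the true diffusion
print(f"Monte Carlo mean of the scheme: {mc_mean:.4f}")
print(f"scheme mean {scheme_mean:.4f} vs exact mean {exact_mean:.4f}")
```

Adding more sample paths only sharpens the estimate of the scheme's mean; the gap to the exact mean shrinks only when the grid is refined.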
We first describe the simulation procedure. In what follows, we will drop the index from the drift to keep the notation clean. Let be i.i.d. nonnegative random variables with an absolutely continuous distribution whose support contains the interval for some . Let and denote the cdf and the pdf of . Let and
Define a process with as the Euler–Maruyama scheme (4.1) on the random partition of , and let
This process can be interpreted as a deep generative model in the sense of Rezende et al. (2014), but with a random number of layers. Specifically, let be independent of , and define recursively by taking and
where denotes equality of probability distributions. We are now ready to state our main result on unbiased simulation (see Appendix E for the proof):
Theorem 4.1. Suppose that the drift is uniformly bounded, Lipschitz in , and -Hölder in , i.e., for some constants and ,
for all and all . Suppose also that
for some constants and . Then, for any Lipschitz-continuous with Lipschitz constant , is an unbiased estimator of with
where , , and is the moment-generating function of .
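The random-partition construction that precedes the theorem can be sketched as follows. This is a partial illustration under assumptions: interrenewal times are taken to be exponential with a hypothetical rate, the drift is a hypothetical bounded Lipschitz function, and the code only runs the Euler–Maruyama chain on the random mesh. It does not implement the correction terms of the estimator (4.2), so on its own it is not an unbiased estimator.

```python
import math
import random


def random_partition(rate, rng, horizon=1.0):
    """Renewal times with Exp(rate) gaps, truncated at the horizon."""
    times = [0.0]
    while True:
        t = times[-1] + rng.expovariate(rate)
        if t >= horizon:
            times.append(horizon)
            return times
        times.append(t)


def euler_on_partition(drift, x0, times, rng):
    """One Gaussian 'layer' per mesh interval: x <- x + b(x, t) h + sqrt(h) * noise."""
    x = x0
    for t_prev, t_next in zip(times, times[1:]):
        h = t_next - t_prev
        x += drift(x, t_prev) * h + math.sqrt(h) * rng.gauss(0.0, 1.0)
    return x


def drift(x, t):
    """Hypothetical bounded, Lipschitz drift."""
    return math.tanh(x)


rate = 3.0
rng = random.Random(0)
partitions = [random_partition(rate, rng) for _ in range(5000)]
# Interior mesh points; with exponential gaps their count is Poisson(rate).
mean_interior = sum(len(p) - 2 for p in partitions) / len(partitions)
sample = euler_on_partition(drift, 0.0, partitions[0], rng)
print(f"layers in first path: {len(partitions[0]) - 1}, terminal value: {sample:.3f}")
print(f"average number of interior points: {mean_interior:.3f}")
```

Each path has its own number of layers, which is the sense in which the model has random depth.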
For example, the type of drift used in the construction of Section 3 has the property (4.3). The key implication of Theorem 4.1 is that the variance of the estimator is controlled by the moment-generating function of , and is therefore related to the tail behavior of the sums . In some cases, one can calculate in closed form. For instance, if we take for some , then the estimator (4.2) reduces to the one introduced by Henry-Labordère et al. (2017). Since and for