
# Score-Based Generative Models Detect Manifolds

Score-based generative models (SGMs) need to approximate the scores ∇log p_t of the intermediate distributions as well as the final distribution p_T of the forward process. The theoretical underpinnings of the effects of these approximations are still lacking. We find precise conditions under which SGMs are able to produce samples from an underlying (low-dimensional) data manifold ℳ. This assures us that SGMs are able to generate the "right kind of samples". For example, taking ℳ to be the subset of images of faces, we find conditions under which the SGM robustly produces an image of a face, even though the relative frequencies of these images might not accurately represent the true data generating distribution. Moreover, this analysis is a first step towards understanding the generalization properties of SGMs: Taking ℳ to be the set of all training samples, our results provide a precise description of when the SGM memorizes its training data.


## 1 Introduction

Score-based generative models, also called diffusion models ([Sohl-Dickstein et al., 2015, Song and Ermon, 2019, Song et al., 2021, Vahdat et al., 2021]) and the related models ([Bordes et al., 2017, Ho et al., 2020, Kingma et al., 2021]) have shown great empirical success in many areas, such as image generation ([Jolicoeur-Martineau et al., 2021, Nichol and Dhariwal, 2021, Dhariwal and Nichol, 2021, Ho et al., 2022]), audio generation ([Chen et al., 2021, Kong et al., 2021, Jeong et al., 2021, Popov et al., 2021]) as well as in other applications ([Batzolis et al., 2021, De Bortoli et al., 2021, Zhou et al., 2021, Cai et al., 2020, Luo and Hu, 2021, Meng et al., 2021, Saharia et al., 2021, Li et al., 2022, Sasaki et al., 2021]). Recently some progress has been made to bridge the gap between the different approaches ([Song et al., 2021, Huang et al., 2021]) through the framework of SDEs and reverse SDEs.

Score-based generative models (SGMs) are set up to sample from a measure of interest on ℝ^d. Henceforth we will refer to this target measure as μ_data, denoting its support by ℳ. In practical situations ℳ is often a low-dimensional substructure of the full space, in accordance with the manifold hypothesis ([Bengio et al., 2013, Pope et al., 2021]). When implementing SGMs one needs to make numerical approximations, and therefore the SGM will generate samples from μ_data only approximately. Instead, the samples will stem from a proxy measure μ_sample. The performance of SGMs can then be measured by comparing μ_sample to μ_data. We will see under which conditions μ_sample and μ_data are equivalent (see Appendix A.1 for a definition), and in particular are supported on the same set ℳ. Therefore, even if μ_sample might not perfectly reproduce μ_data, it will still produce the "right kind of samples". We also find necessary conditions for SGMs to be able to generalize and produce novel samples instead of memorizing the training data. Strikingly, numerical error in the approximation of the score function turns out to be indispensable for any generalization beyond the training set to occur.

SGMs first diffuse each data sample by gradually adding noise, leading to a path (X_t)_{t∈[0,T]}. This process corresponds to transforming the initial distribution μ_data, from which the training data stems, into a noisy (final) distribution p_T through a sequence of intermediate distributions (p_t)_{t∈[0,T]}. The final distribution has had so much noise injected that not much of the original signal is left. We study the case in which the forward process can be expressed in terms of a Stochastic Differential Equation (SDE), referred to as the forward SDE:

 dX_t = β(X_t) dt + σ dW_t,   X_0 ∼ μ_data. (1)

We denote by p_t the density of X_t. Corresponding to (1) we define the reverse SDE:

 dY_t = −β(Y_t) dt + σσ^T ∇log p_{T−t}(Y_t) dt + σ dB_t,   Y_0 ∼ q_0. (2)

We refer to the density of Y_t as q_t. This reverse SDE has the property that if q_0 is chosen to be p_T (the final distribution induced by (1)), then it transforms the distributions backwards, q_t = p_{T−t}, in particular implying q_T = μ_data. This provides us with a means to sample from μ_data: we first sample from p_T and then run (2) on those samples, transforming them into samples from μ_data. Note that if ℳ is a low-dimensional substructure of the space, the density p_0 does not exist, and ∇log p_{T−t} explodes as t → T. Therefore the above SDE only makes sense on [0, T) and it is not clear in which sense the limit Y_T exists a priori. For the reverse SDE to be well defined on [0, T) we have to make some assumptions, see Assumption 1. We will prove in Section 2.2 that these assumptions hold for all popular SDEs used in SGMs.
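To make the forward/reverse pair concrete, here is a minimal Euler–Maruyama sketch (our own toy setup, not from the paper) for the simplest case β = 0, σ = 1 with one-dimensional Gaussian data, where the score ∇log p_t is available in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma0_sq, T = 1.0, 5.0          # data variance and time horizon (both assumed here)
n_steps, n_paths = 2000, 20000
dt = T / n_steps

def score(x, t):
    # Exact score of p_t = N(0, sigma0_sq + t) under Brownian forward noising.
    return -x / (sigma0_sq + t)

# Reverse SDE (2) with beta = 0, sigma = 1:
#   dY_t = grad log p_{T-t}(Y_t) dt + dB_t,   Y_0 ~ p_T = N(0, sigma0_sq + T).
y = rng.normal(0.0, np.sqrt(sigma0_sq + T), size=n_paths)
for k in range(n_steps):
    t = T - k * dt               # forward time at which the score is evaluated
    y += score(y, t) * dt + np.sqrt(dt) * rng.normal(size=n_paths)

print(y.var())                   # close to sigma0_sq: the reverse SDE recovers mu_data
```

Because μ_data here has a density, the score stays bounded as t → 0; the degenerate (manifold) case discussed below is exactly where this fails.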

There are two necessary approximations when implementing an SGM. First, we have no access to p_T and therefore need to replace it by a proxy distribution μ_prior. Second, the scores ∇log p_t are not available and need to be replaced by a neural network s_θ trained via score-matching techniques ([Vincent, 2011, Hyvärinen and Dayan, 2005]). Therefore the sample distribution μ_sample will in general differ from μ_data. We study the distance of μ_sample to μ_data. In particular, we show under which circumstances μ_sample will be supported on ℳ.

We first study how the error made while approximating p_T by μ_prior propagates to μ_sample.

###### Theorem 1

Assume that Assumption 1 (see below) holds and that μ_prior is absolutely continuous with respect to p_T. Then the following hold.

• Let Y be a solution to (2) on [0, T) with Y_0 ∼ μ_prior. The limit Y_T = lim_{t→T} Y_t exists almost surely. We refer to its distribution as μ_sample. The distribution μ_sample is absolutely continuous with respect to μ_data. If μ_prior and p_T are equivalent, then so are μ_sample and μ_data.

• Furthermore, for any f-divergence D_f,

 D_f( μ_sample | μ_data ) ≤ D_f( μ_prior | p_T )   and   D_f( μ_data | μ_sample ) ≤ D_f( p_T | μ_prior ).

Some choices for D_f are, for example, the KL-divergence or the total variation distance. In Section 3 we will explicitly bound the KL-divergence for some common SGM implementations. Next we investigate how the approximation of the score ∇log p_t impacts μ_sample.

###### Theorem 2

Denote by s(x, t) an approximation to ∇log p_t(x) and define the approximation error e(x, t) = s(x, t) − ∇log p_t(x). Assume that Assumption 2 holds (see below); a sufficient condition is that e is bounded.

Let Ỹ be a solution to

 dỸ_t = −β(Ỹ_t) dt + σσ^T s(Ỹ_t, t) dt + σ dB_t,   Ỹ_0 ∼ q_0 (3)

on [0, T). Then Ỹ_T = lim_{t→T} Ỹ_t is well defined. Moreover, its distribution is equivalent to the distribution of Y_T.

Putting the theorems together, we see that even if we start in the wrong initial distribution μ_prior and then make an error while approximating the drift ∇log p_t, the sample measure μ_sample will still be equivalent to μ_data. Loosely speaking, if ℳ is the set of all high resolution pictures of faces, then the SGM will also generate samples from that set, just maybe not in the same relative frequency as μ_data.

In practice however we can only access μ_data through samples x_1, …, x_N. Taking μ_data to be the empirical measure Unif{x_1, …, x_N} (see Section 4.2), we see that for an SGM to produce novel samples, one of the assumptions in Theorems 1 and 2 has to be violated. The assumptions of Theorem 1 will usually be fulfilled, see Section 2.2.

Therefore, we can conclude that Assumption 2 has to be violated for generalization, and indeed this is expected to be the case in realistic scenarios. If the data is supported on a low-dimensional substructure, ∇log p_t will explode almost everywhere as t → 0. For that reason the approximation error e will grow to infinity as t → 0 (see also Section 4.1). Interestingly, this approximation error is necessary for the SGM to generalize and produce novel samples outside the training set. The theorems are illustrated in Figure 1.

We now proceed as follows. In Section 2 we will introduce some of the most commonly used SDEs in SGMs. We prove that they fulfil the requirements for the theorems in Section 2.2. We saw in Theorem 1 that, in the case of no error in the drift approximation, the divergence from μ_prior to p_T is an upper bound on the divergence of μ_sample from μ_data. We therefore study the KL-divergence of μ_prior to p_T in Section 3. In Section 4 we discuss in which way the approximation of the score has to carry an error, and how that relates to the generalization capabilities of SGMs. Section 5 contains the proofs of the theorems.

## 2 Popular SDEs satisfy the assumptions

In this section we will study a few SDEs used in practice and show that they satisfy the following Assumption.

###### Assumption 1

Let K and C be constants that may change from appearance to appearance.

1. β is globally Lipschitz, i.e. ‖β(x) − β(y)‖ ≤ K‖x − y‖ for all x, y.

2. β grows at most linearly, i.e. ‖β(x)‖ ≤ C(1 + ‖x‖).

3. X_t has a density p_t for every t > 0, and p_t and ∇p_t are integrable over [t_0, T] × A for every compact A ⊂ ℝ^d and every t_0 > 0.

Furthermore, for each δ > 0,

4. (x, t) ↦ ∇log p_t(x) is locally Lipschitz on ℝ^d × [δ, T].

Conditions 1–3 are technical conditions on the forward SDE. They ensure that if we run (1) backwards in time, the time-reversed process will be a solution to the reverse SDE (2) on [0, T). The last condition then ensures that the solutions to the reverse SDE are unique, therefore we will be able to transmit the properties of this particular solution to any other solution of (2).

### 2.1 Popular Methods and their Respective SDEs

The first works on SGMs studied discrete forward and backward processes. Nevertheless, the transition kernels and algorithms proposed in those works can be seen as discretisations of some well-known SDEs. More recent works have studied this connection and state the algorithms in terms of SDEs ([Song et al., 2021, Huang et al., 2021]).

#### Brownian Motion:

The works [Song and Ermon, 2019, 2020] can be seen as a discretization of the SDE

 dX_t = σ(t) dW_t.

Denoting h(t) = ∫_0^t σ(s)² ds, the solution to the above process can be explicitly stated as a time-changed Brownian motion, X_t = X_0 + W_{h(t)}. The time-change can help in the implementation but does not alter the qualitative behaviour of the reverse SDE. In our following analysis we therefore set σ(t) = 1. Nevertheless, our results still hold for any positive σ.

#### Ornstein-Uhlenbeck Process:

The works [Sohl-Dickstein et al., 2015, Ho et al., 2020] can be seen as a discretization of

 dX_t = −(1/2) α(t) X_t dt + √α(t) dW_t,

which is an Ornstein-Uhlenbeck process. Again, the parameter α(t) amounts to a time-change and does not influence the properties that we are investigating in this paper. Therefore, to simplify notation, we again set α(t) = 1.

#### Critically Damped Langevin Dynamics (CLD):

The work [Dockhorn et al., 2021] studies a second order SDE. Here artificial velocity coordinates are introduced and the system under consideration is

 dX_t = V_t dt,   dV_t = −X_t dt − 2V_t dt + 2 dW_t,

where X_0 ∼ μ_data and V_0 ∼ N(0, Id). For generation one runs the reverse SDE in X and V but discards the V coordinate at the end. The work [Dockhorn et al., 2021] also includes two additional parameters. We set both to 1, as they do in their numerical experiments.
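As a quick sanity check on the CLD dynamics above, the following Euler–Maruyama sketch (our own toy setup; the uniform data distribution is an assumption) simulates the forward system and verifies that (X_t, V_t) approaches its stationary distribution, a standard normal in both coordinates:

```python
import numpy as np

rng = np.random.default_rng(1)
n, dt, T = 20000, 0.01, 10.0
x = rng.uniform(-1.0, 1.0, n)    # stand-in for data samples X_0 ~ mu_data
v = rng.normal(size=n)           # velocities V_0 ~ N(0, Id)

# Euler-Maruyama for  dX = V dt,  dV = (-X - 2V) dt + 2 dW
for _ in range(int(T / dt)):
    dw = np.sqrt(dt) * rng.normal(size=n)
    x, v = x + v * dt, v + (-x - 2.0 * v) * dt + 2.0 * dw

print(x.var(), v.var())          # both close to 1: the noise has washed out the signal
```

This is the usual underdamped Langevin dynamics for the potential U(x) = x²/2 with friction 2, whose stationary density is proportional to exp(−x²/2 − v²/2).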

### 2.2 Proving that the assumptions hold

We will now prove that for all SDEs considered in Section 2.1, Assumption 1 holds. To simplify the calculations we assume that the data manifold ℳ is contained in a ball of radius R. This is a natural assumption for many data sets. Nevertheless, we note that this assumption could be weakened with additional technical effort.

First, we note that all SDEs in Section 2.1 are linear SDEs. Therefore, their transition kernels are Gaussian ([Pavliotis, 2014, Section 3.7]):

 π_{t|0}( Z_t = z | Z_0 = z_0 ) = N( z; m_t(z_0), Σ_t ).

The explicit forms of m_t(z_0) and Σ_t differ for each of the SDEs and can be found in Appendix C.1. We remark that Σ_t does not depend on the initial condition z_0. The transition kernel above gives us the distribution of the SDE started in a single point z_0. Since we start the SDE in μ_data, we need to average over μ_data to get the marginal p_t at time t:

 p_t(z) = ∫_{ℝ^d} N( z; m_t(z_0), Σ_t ) μ_data(dz_0). (4)

We can also compute the additional drift in the reverse SDE (see Appendix C.2),

 ∇log p_t(z) = ∇p_t(z) / p_t(z) = −Σ_t^{−1} ( z − E[ m_t(Z_0) | Z_t = z ] ). (5)

The drift is intuitive, especially for the Brownian motion, where m_t(z_0) = z_0 and Σ_t = t·Id. In that case, E[Z_0 | Z_t = z] is the expected position from which the SDE was started on the manifold, given that it is at position z at time t. For a point z which has distance d(z, ℳ) to ℳ, ‖z − E[Z_0 | Z_t = z]‖ is of the order of d(z, ℳ), since Z_0 lies in ℳ almost surely. Therefore, we can infer that ∇log p_t(z) explodes like d(z, ℳ)/t as t → 0. A similar analysis can be conducted for the other two SDEs using the formulas for the transition kernels in Appendix C.1. We are now ready to prove the following

###### Lemma 1

Assume that the data manifold ℳ is contained in a ball of radius R. Then all the methods introduced in Section 2.1 fulfil Assumption 1.

Proof  The forward drifts are β(x) = 0, β(x) = −x/2 and β(x, v) = (v, −x − 2v) for the Brownian motion, the OU-process and the Critically Damped Langevin Dynamics respectively. In particular, these are all linear maps and therefore fulfil conditions 1 and 2 of Assumption 1.

We show in Appendix C.1 that p_t is in C^∞ and p_t > 0 for t > 0. Therefore we can integrate p_t and its derivative over compact sets, implying that condition 3 holds. Furthermore, the Hessian of log p_t w.r.t. z is continuous and attains its maximum and minimum on the compact set [δ, T] × B̄_r, where B̄_r is the closed ball of radius r around the origin. Therefore the gradient ∇log p_t is Lipschitz on [δ, T] × B̄_r, which proves condition 4.
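The intuition behind the exploding score can be made concrete. For the Brownian motion forward SDE and an empirical μ_data supported on finitely many points, p_t is an explicit Gaussian mixture and (5) can be evaluated exactly. The following sketch (our own illustration; the circle data and all names are assumptions) checks that t·‖∇log p_t(z)‖ approaches the distance d(z, ℳ) as t → 0:

```python
import numpy as np

rng = np.random.default_rng(2)
# empirical mu_data: 500 points on the unit circle (a 1-D "manifold" in R^2)
angles = rng.uniform(0.0, 2.0 * np.pi, 500)
data = np.stack([np.cos(angles), np.sin(angles)], axis=1)

def score_bm(z, t):
    # For the Brownian forward SDE, p_t(z) = (1/N) sum_i N(z; x_i, t*Id), and (5)
    # reads  grad log p_t(z) = (E[Z_0 | Z_t = z] - z) / t  with softmax posterior weights.
    logw = -((z - data) ** 2).sum(axis=1) / (2.0 * t)
    w = np.exp(logw - logw.max())
    w /= w.sum()
    posterior_mean = w @ data
    return (posterior_mean - z) / t

z = np.array([1.5, 0.0])         # distance d(z, M) = 0.5 to the circle
for t in [1.0, 0.1, 0.01, 0.001]:
    print(t, t * np.linalg.norm(score_bm(z, t)))   # approaches d(z, M) = 0.5 as t -> 0
```

As t shrinks, the posterior mean concentrates on the point of the circle nearest to z, so the score points toward the manifold with magnitude growing like d(z, ℳ)/t.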

## 3 Distance from p_T to μ_prior

We have seen in Theorem 1 that the distance between μ_sample and μ_data is directly related to the distance between μ_prior and p_T. The OU-process and the CLD are well-known SDEs in the Markov Chain Monte Carlo community. They form building blocks of algorithms like Langevin MCMC ([Roberts and Tweedie, 1996, Welling and Teh, 2011]) or Hamiltonian Monte Carlo ([Neal et al., 2011]), and the speed at which they converge to their stationary distribution is well studied ([Roberts and Tweedie, 1996, Markowich and Villani, 2000, Bou-Rabee et al., 2010, Eberle et al., 2019, Cao et al., 2019]). It is known that their intermediate distributions p_t converge to the stationary distribution at an exponential rate.

The Brownian motion however does not converge to a stationary distribution, and therefore one has to choose a different μ_prior for each T to approximate p_T. We now study the distance of μ_prior to p_T. If we transfer the suggestions of [Song et al., 2021, Appendix C] to our setting, they propose to set μ_prior to N(0, T·Id). However, p_T has the same mean as μ_data and has covariance Cov(μ_data) + T·Id. Loosely speaking, in contrast to the other two SDEs, the Brownian motion does not forget its initial mean and covariance. As we see in the following lemma, we can decrease the distance between μ_prior and p_T by choosing the mean and covariance of μ_prior equal to the mean and covariance of p_T.

###### Lemma 2

Let p_T be the time-T marginal of the Brownian motion process of Section 2.1. The minimization problem

 min_{m,C} KL( p_T | N(m, C) )

is solved by m = E[p_T] and C = Cov(p_T). If we restrict the covariance to be a multiple of the identity matrix, i.e.

 min_{m,c} KL( p_T | N(m, c·Id) ),

the problem is solved by choosing m as above and c = tr(Cov(p_T))/d.

This result is a slight variation on the well-known fact that the KL-projection onto Gaussians in the second argument matches the moments. We prove it in Appendix E. Equivalently to choosing the mean of μ_prior equal to the mean of p_T, one can normalize the training data to have mean 0 and then also centre μ_prior. However, the covariance matrix of p_T will in general not be a multiple of the identity after common normalization methods.

The next lemma bounds the distance between p_T and μ_prior in the Brownian motion case.

###### Lemma 3

Let be the time -marginal of a Brownian motion with initial condition . Denote by

the eigenvalues of

. Let

be any normal distribution. Then

 KL(pT|μTprior)≤12log(∏di=1(ci+T)Td).

This lemma is proven by explicitly bounding the distance of to the optimal normal distribution from Lemma 2, see Appendix E.
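As a numerical sanity check of Lemmas 2 and 3 (a sketch under the extra assumption that μ_data is itself a centred Gaussian, so every KL term is available in closed form; the eigenvalues below are arbitrary choices of ours):

```python
import numpy as np

c = np.array([4.0, 1.0, 0.25, 0.0])  # eigenvalues of Cov(mu_data); one degenerate direction
T = 5.0
d = len(c)

def kl_to_isotropic(s):
    # KL( N(0, diag(c) + T*Id) | N(0, s*Id) ) in closed form
    lam = c + T                      # eigenvalues of Cov(p_T)
    return 0.5 * (lam.sum() / s - d + d * np.log(s) - np.log(lam).sum())

kl_song = kl_to_isotropic(T)              # prior N(0, T*Id), as suggested in [Song et al., 2021]
kl_opt = kl_to_isotropic((c + T).mean())  # Lemma 2: c = tr(Cov(p_T)) / d
bound = 0.5 * np.log(np.prod(c + T) / T**d)  # right-hand side of Lemma 3

print(kl_song, kl_opt, bound)
```

On this toy example the moment-matched isotropic prior strictly improves on N(0, T·Id), and both values sit below the Lemma 3 bound.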

## 4 Approximation of the score ∇log p_t

### 4.1 Violation of the assumption

First, we state the assumption that the approximation s has to fulfil in order for Theorem 2 to be valid.

###### Assumption 2

Let e be defined as in Theorem 2. We assume that the reverse SDE (2) has a solution Y on [0, T). For t < T, we define the weights

 Z_t = exp( ∫_0^t σ^T e(s, Y_s) · dB_s − (1/2) ∫_0^t ‖σ^T e(s, Y_s)‖² ds )

and assume that (Z_t)_{t<T} is a uniformly integrable martingale (see Appendix A.2).

Furthermore, we assume that the approximate drift s also fulfils condition 4 of Assumption 1.

We again need condition 4 to hold to have uniqueness in law of the reverse SDE, see the discussion after Assumption 1. However, the local Lipschitzness is already implied if s is C¹, which we can expect in practice.

We already stated in Theorem 2 that a sufficient condition for Assumption 2 to hold is for e to be bounded. From Section 2.2 we know that for each fixed z outside of ℳ, ∇log p_t(z) will explode as t → 0. Since s is a numerical approximation to ∇log p_t, it will remain finite. Therefore e = s − ∇log p_t will also grow to infinity. Consequently, if ℳ is a lower-dimensional substructure, we cannot expect e to be bounded.

Note however that a global bound on e is stronger than what we need for Assumption 2. In particular, the approximation error is only evaluated along the paths of Y. If we follow the intuition of Section 2.2 in the Brownian motion case, e would approximately have magnitude d(Y_t, ℳ)/(T − t) along the path. If d(Y_t, ℳ) goes to 0 fast enough as t → T, then e may stay bounded along the typical paths of Y.

Let us briefly investigate this. Continuing with the Brownian motion case, if we neglect the approximation errors, we can think of Y_{T−t} as X_0 + √t Z. Here X_0 is distributed according to μ_data and lies on ℳ, and Z is standard normal. In particular, √t Z will be of magnitude √t. Therefore, we can expect the distance of Y_{T−t} to ℳ to be of order √t, and hence e to have magnitude √t/t = 1/√t along a typical path of Y. This is unbounded, and the approximation error will tend to infinity as t → T.
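This 1/√t heuristic can be checked numerically. The sketch below (our own illustration, with training data on a unit circle) draws forward samples X_t = X_0 + √t Z and evaluates the exact mixture score of the empirical marginal at X_t; the product ‖score‖·√t stays of order one as t shrinks:

```python
import numpy as np

rng = np.random.default_rng(3)
angles = rng.uniform(0.0, 2.0 * np.pi, 400)
data = np.stack([np.cos(angles), np.sin(angles)], axis=1)   # training points on the unit circle

def mean_score_norm(t, n_paths=2000):
    # Forward samples X_t = X_0 + sqrt(t) Z with X_0 drawn from the training set,
    # then the exact mixture score of the empirical p_t, evaluated at X_t.
    x0 = data[rng.integers(0, len(data), n_paths)]
    xt = x0 + np.sqrt(t) * rng.normal(size=(n_paths, 2))
    logw = -((xt[:, None, :] - data[None, :, :]) ** 2).sum(-1) / (2.0 * t)
    w = np.exp(logw - logw.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    scores = (w @ data - xt) / t
    return np.linalg.norm(scores, axis=1).mean()

for t in [0.1, 0.01, 0.001]:
    print(t, np.sqrt(t) * mean_score_norm(t))   # roughly constant: the score grows like 1/sqrt(t)
```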

### 4.2 Score approximation with a finite number of samples

The score ∇log p_t is approximated by minimizing a variant of

 L(θ, t) = E_{p_t(x)}[ ‖∇log p_t(x) − s_θ(x, t)‖² ]. (6)

for t ∈ [0, T]. This is done via score-matching techniques (see [Hyvärinen and Dayan, 2005, Vincent, 2011, Song et al., 2020]). The above expectation is taken over p_t, which itself depends on μ_data,

 p_t(x) = E_{μ_data(x_0)}[ π_{t|0}(x | x_0) ].

In practice we do not have access to the full data distribution μ_data. Therefore we replace μ_data by its empirical measure μ̂_data, which is supported on the training set,

 μ̂_data = Unif{ x_1, …, x_N }. (7)

This leads to a Monte-Carlo estimate p̂_t of p_t,

 p̂_t(x) = E_{μ̂_data(x_0)}[ π_{t|0}(x | x_0) ] = (1/N) ∑_{i=1}^N π_{t|0}(x | x_i) ≈ p_t(x).

The surrogate loss

 L̂(θ, t) = E_{p̂_t(x)}[ ‖∇log p̂_t(x) − s_θ(x, t)‖² ], (8)

can be evaluated and used for training. Instead of interpreting this loss as an approximation to (6), we can also interpret it as the exact loss for the data distribution μ̂_data. In this case, however, ℳ = {x_1, …, x_N}. If Theorem 2 applied and μ_sample were equivalent to μ̂_data, then the SGM could only draw samples from the training set. In other words, the SGM would have memorized its training data. This is also visualized in Figures 2 and 3.
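This memorization effect is easy to reproduce when the exact empirical score ∇log p̂_t is plugged into the reverse SDE. A minimal sketch (our own toy setup with N = 2 training points in one dimension; all names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
train = np.array([-2.0, 2.0])    # a "training set" of N = 2 points in R
T, n_steps, n_paths = 5.0, 4000, 5000
dt = T / n_steps

def empirical_score(x, t):
    # exact score of p-hat_t = (1/N) sum_i N(x; x_i, t) for the Brownian forward SDE
    logw = -((x[:, None] - train[None, :]) ** 2) / (2.0 * t)
    w = np.exp(logw - logw.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return (w @ train - x) / t

# start in the exact prior p-hat_T and run the reverse SDE with the exact empirical score
y = train[rng.integers(0, 2, n_paths)] + np.sqrt(T) * rng.normal(size=n_paths)
for k in range(n_steps):
    t = T - k * dt
    y += empirical_score(y, t) * dt + np.sqrt(dt) * rng.normal(size=n_paths)

dist = np.abs(y[:, None] - train[None, :]).min(axis=1)
print(dist.mean())               # tiny: every sample sits (numerically) on a training point
```

The reverse samples collapse onto the two training points up to the discretization noise of the final step, illustrating that a perfect fit of the empirical score implies memorization.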

However, empirically it has been shown that SGMs are able to create novel samples (see, for example, [Dhariwal and Nichol, 2021]). Furthermore, in the case of image generation, the generated images are sharp. In particular, they probably come from the same low-dimensional substructure ℳ that the training samples are drawn from. Therefore the approximation errors made by s_θ are non-Gaussian and probably tangent to ℳ. The form and properties of these errors are closely related to the ability of the SGM to generalize.

## 5 Proofs of the theorems

We now give proofs of our main results and briefly summarize the key steps in an intuitive way. In our study we would like to include the case when μ_data is degenerate and supported on a low-dimensional substructure ℳ. As we have seen in Section 2.2, this can lead to an exploding drift in the reverse SDE as t → T. Nevertheless, in order to understand the properties of μ_sample, it is crucial to study the properties of solutions to the reverse SDE at time T. This is where the main mathematical difficulties come from. The proofs are mostly independent of the specific form of the forward SDE and hold for more general forward/backward SDEs than those stated in Section 1, see Appendix D.

### 5.1 Theorem 1

We now proceed with proving Theorem 1.

Proof  Let P be the measure on path space induced by the forward SDE (1). P has marginals p_t. Denote by π_t the canonical projections π_t(ω) = ω(t) for t ∈ [0, T]. We define Q through

 (dQ/dP)(ω) = (dμ_prior/dp_T)(ω(T)).

By the data processing inequality we obtain (see [Liese and Vajda, 2006, Theorem 14]),

 KL( q_T | μ_data ) ≤ KL( Q | P ) = KL( μ_prior | p_T ). (9)

It remains to prove that by running Q backwards we obtain a solution to (2) started in μ_prior. We denote the generator of the reverse SDE (2) by L. Denote by Q^R and P^R the time reversals of Q and P. Our assumptions are such that P^R is a Markov process solving the martingale problem for L (see [Haussmann and Pardoux, 1986, Theorem 2.1]). A short calculation shows that Q^R is still Markov (see, for example, [Léonard, 2011, Proposition 4.2]). Furthermore, for s < t < T,

 E_{Q^R}[ f(X_t) − f(X_s) − ∫_s^t Lf(X_r) dr | X_s ]
   = E_{P^R}[ ( f(X_t) − f(X_s) − ∫_s^t Lf(X_r) dr ) (dμ_prior/dp_T)(X_0) | X_s ] / E_{P^R}[ (dμ_prior/dp_T)(X_0) | X_s ]
   = E_{P^R}[ f(X_t) − f(X_s) − ∫_s^t Lf(X_r) dr | X_s ] · E_{P^R}[ (dμ_prior/dp_T)(X_0) | X_s ] / E_{P^R}[ (dμ_prior/dp_T)(X_0) | X_s ]
   = E_{P^R}[ f(X_t) − f(X_s) − ∫_s^t Lf(X_r) dr | X_s ] = 0.

In the second equality we used the Markov property of P^R. In the last one we used that P^R solves the martingale problem for L. Therefore Q^R also solves the martingale problem for L. Denote by Y a solution to (2) on [0, T). Since solutions to (2) are unique in law on [0, t] for t < T (see [Karatzas and Shreve, 2012, Section 5.2]) and the solutions are continuous, the law of Y equals Q^R on [0, t]. But the paths of Q^R are continuous on all of [0, T]. Therefore, Y can be extended to T, i.e. the limit Y_T = lim_{t→T} Y_t exists almost surely and its distribution is equal to the time-T marginal of Q^R, which is the time-0 marginal of Q. Denote the time-T marginals of Q^R and P^R by q_T and μ_data respectively. Since Q is absolutely continuous with respect to P, q_T is absolutely continuous with respect to μ_data. Analogously, if μ_prior and p_T are equivalent, then so are Q and P, and therefore q_T and μ_data. This proves the first claim.

The second claim is a consequence of the data processing inequality for f-divergences ([Liese and Vajda, 2006, Theorem 14]), analogous to (9).

The main idea of this proof is that we look at the forward SDE started in μ_data first. It induces a distribution P over all continuous paths. If we reverse the time direction of P, we get a solution to the reverse SDE, started in p_T. This reverse solution is well behaved as t → T, since its time-T marginal is μ_data. The solution for a different initial condition μ_prior is obtained by reweighting P. This does not change the qualitative behaviour of the limit Y_T, which still exists and is well defined. We then use a uniqueness result to see that any other solution of the reverse SDE inherits these properties.

### 5.2 Theorem 2

Proof  Denote the space C([0, t]; ℝ^d) by Ω_t and let (F_t)_{t≤T} be the natural filtration. We denote the distribution of the solution Y of (2) on Ω_t by Q_t. We define Q̃_t by reweighting Q_t with Z_t on Ω_t. By the Girsanov theorem (see [Karatzas and Shreve, 2012, Section 3.5]) we know that the canonical process under Q̃_t is a solution to (3) on [0, t]. Since (Z_t)_{t<T} is uniformly integrable, its limit Z_T exists in L¹. Furthermore, Z_T Q_T is absolutely continuous with respect to Q_T on Ω_T with density Z_T. We define Q̃ := Z_T Q_T. Then the event

 A = { x(T) := lim_{t→T} x(t) exists and x(T) ∈ ℳ }

has probability 1 under Q_T (see Theorem 1) and therefore also under Q̃. Furthermore, A is measurable with respect to F_T. Therefore the distributions of x(T) under Q_T and Q̃ are equivalent. The canonical process under Q̃ is therefore a solution of (3), with the property that its time-T marginal is well defined and equivalent to the time-T marginal of (2). We can use uniqueness in law on [0, t] for any t < T and extend it to T as in the proof of Theorem 1. This shows that every solution to (3) has the desired properties.

Finally we show that if e is bounded, it fulfils Assumption 2. We define H_t := ∫_0^t ‖σ^T e(s, Y_s)‖² ds. Then there is a Brownian motion W such that we can write Z_t as

 Z_t = exp( W_{H_t} − (1/2) H_t ).

Since e is bounded, H_t is bounded by some constant C. In particular, one can view Z_t = E[ exp(W_C − C/2) | F_{H_t} ]. Therefore (Z_t)_{t<T} is uniformly integrable, since it can be viewed as a family of conditional expectations of a single integrable random variable.

Here we essentially applied the Girsanov theorem on [0, t] for t < T. Using the uniform integrability of the Girsanov weights Z_t, we were able to extend it to t = T. Therefore, we can infer that the distributions of Y and Ỹ are actually equivalent on the whole path space. In particular, their time-T marginals are equivalent too, which is the claim of the theorem.

## 6 Societal impact

The results deepen the understanding of score-based generative models. As such, they can be seen as a step towards improving the quality of generative models. Therefore, the possible negative societal impacts are the same ones that apply to generative modelling in general. First, generative models can be used to create synthetic data that is hard to distinguish from real data (for example images or videos), see [Mirsky and Lee, 2021]. Second, generative models can learn and reproduce biases that are prevalent in the training data ([Esser et al., 2020]). Last, depending on the application, generative models might be used to do creative work that was previously done by humans.

## 7 Previous studies

The work [De Bortoli et al., 2022] treats a variant of SGMs that is defined on a manifold. In this variant, the full algorithm is defined on a Riemannian manifold ℳ. We study the common SGM implementation under the assumption that the initial distribution μ_data is supported on a substructure ℳ of ℝ^d. As a matter of fact, we do not require ℳ to carry any manifold structure in a mathematical sense, since we define it as ℳ = supp(μ_data). Theorem 1 in [De Bortoli et al., 2021] studies the distance of the sample measure to the data generating distribution. An explicit rate for the convergence in the total variation distance is shown. However, it is assumed that the initial distribution has an everywhere positive density, which excludes degenerate μ_data. Furthermore, it seems to be challenging to infer equivalence of the measures at any finite time.

## 8 Conclusion

We conducted a theoretical study of some properties of SGMs. We found explicit conditions under which the sample measure μ_sample is equivalent to the true data generating distribution μ_data. Under these conditions we can guarantee that the SGM generates samples that could also be samples from μ_data. Furthermore, each sample that can be generated by μ_data also has positive probability under μ_sample, meaning that the full support ℳ is covered.

Since one cannot actually access the full distribution μ_data, but only a finite number of training examples x_1, …, x_N, our results can be applied to find conditions under which the SGM memorizes its training data. We believe that this observation provides a first step towards understanding the generalization capabilities of SGMs.

## References

• Batzolis et al. [2021] G. Batzolis, J. Stanczuk, C. Schönlieb, and C. Etmann. Conditional image generation with score-based diffusion models. CoRR, abs/2111.13606, 2021. URL https://arxiv.org/abs/2111.13606.
• Bengio et al. [2013] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013.
• Bordes et al. [2017] F. Bordes, S. Honari, and P. Vincent. Learning to generate samples from noise through infusion training. arXiv preprint arXiv:1703.06975, 2017.
• Bou-Rabee et al. [2010] N. Bou-Rabee, M. Hairer, and E. Vanden-Eijnden. Non-asymptotic mixing of the MALA algorithm. IMA Journal of Numerical Analysis, 33, 08 2010.
• Cai et al. [2020] R. Cai, G. Yang, H. Averbuch-Elor, Z. Hao, S. J. Belongie, N. Snavely, and B. Hariharan. Learning gradient fields for shape generation. In A. Vedaldi, H. Bischof, T. Brox, and J. Frahm, editors, Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part III, volume 12348 of Lecture Notes in Computer Science, pages 364–381. Springer, 2020.
• Cao et al. [2019] Y. Cao, J. Lu, and L. Wang. On explicit L²-convergence rate estimate for underdamped Langevin dynamics. arXiv preprint arXiv:1908.04746, 2019.
• Chen et al. [2021] N. Chen, Y. Zhang, H. Zen, R. J. Weiss, M. Norouzi, and W. Chan. Wavegrad: Estimating gradients for waveform generation. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021.
• Costa and Cover [1984] M. Costa and T. Cover. On the similarity of the entropy power inequality and the Brunn-Minkowski inequality (corresp.). IEEE Transactions on Information Theory, 30(6):837–839, 1984.
• De Bortoli et al. [2021] V. De Bortoli, J. Thornton, J. Heng, and A. Doucet. Diffusion schrödinger bridge with applications to score-based generative modeling. Advances in Neural Information Processing Systems, 34, 2021.
• De Bortoli et al. [2022] V. De Bortoli, E. Mathieu, M. Hutchinson, J. Thornton, Y. W. Teh, and A. Doucet. Riemannian score-based generative modeling. arXiv preprint arXiv:2202.02763, 2022.
• Dhariwal and Nichol [2021] P. Dhariwal and A. Nichol. Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34, 2021.
• Dockhorn et al. [2021] T. Dockhorn, A. Vahdat, and K. Kreis. Score-based generative modeling with critically-damped Langevin diffusion. CoRR, abs/2112.07068, 2021. URL https://arxiv.org/abs/2112.07068.
• Eberle et al. [2019] A. Eberle, A. Guillin, and R. Zimmer. Couplings and quantitative contraction rates for langevin dynamics. The Annals of Probability, 47(4):1982–2010, 2019.
• Esser et al. [2020] P. Esser, R. Rombach, and B. Ommer. A note on data biases in generative models. In NeurIPS 2020 Workshop on Machine Learning for Creativity and Design, 2020. URL https://arxiv.org/abs/2012.02516.
• Haussmann and Pardoux [1986] U. G. Haussmann and E. Pardoux. Time reversal of diffusions. The Annals of Probability, pages 1188–1205, 1986.
• Ho et al. [2020] J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
• Ho et al. [2022] J. Ho, C. Saharia, W. Chan, D. J. Fleet, M. Norouzi, and T. Salimans. Cascaded diffusion models for high fidelity image generation. Journal of Machine Learning Research, 23(47):1–33, 2022.
• Huang et al. [2021] C.-W. Huang, J. H. Lim, and A. C. Courville. A variational perspective on diffusion-based generative models and score matching. Advances in Neural Information Processing Systems, 34, 2021.
• Hyvärinen and Dayan [2005] A. Hyvärinen and P. Dayan. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(4), 2005.
• Jeong et al. [2021] M. Jeong, H. Kim, S. J. Cheon, B. J. Choi, and N. S. Kim. Diff-tts: A denoising diffusion model for text-to-speech. In H. Hermansky, H. Cernocký, L. Burget, L. Lamel, O. Scharenborg, and P. Motlícek, editors, Interspeech 2021, 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, 30 August - 3 September 2021, pages 3605–3609. ISCA, 2021.
• Jolicoeur-Martineau et al. [2021] A. Jolicoeur-Martineau, R. Piché-Taillefer, I. Mitliagkas, and R. T. des Combes. Adversarial score matching and improved sampling for image generation. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021.
• Karatzas and Shreve [2012] I. Karatzas and S. Shreve. Brownian motion and stochastic calculus, volume 113. Springer Science & Business Media, 2012.
• Kingma et al. [2021] D. P. Kingma, T. Salimans, B. Poole, and J. Ho. Variational diffusion models. arXiv preprint arXiv:2107.00630, 2021.
• Klenke [2013] A. Klenke. Probability theory: a comprehensive course. Springer Science & Business Media, 2013.
• Kong et al. [2021] Z. Kong, W. Ping, J. Huang, K. Zhao, and B. Catanzaro. Diffwave: A versatile diffusion model for audio synthesis. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021.
• Léonard [2011] C. Léonard. Stochastic derivatives and generalized h-transforms of Markov processes. arXiv preprint arXiv:1102.3172, 2011.
• Li et al. [2022] H. Li, Y. Yang, M. Chang, S. Chen, H. Feng, Z. Xu, Q. Li, and Y. Chen. Srdiff: Single image super-resolution with diffusion probabilistic models. Neurocomputing, 479:47–59, 2022.
• Liese and Vajda [2006] F. Liese and I. Vajda. On divergences and informations in statistics and information theory. IEEE Transactions on Information Theory, 52(10):4394–4412, 2006.
• Luo and Hu [2021] S. Luo and W. Hu. Diffusion probabilistic models for 3d point cloud generation. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021, pages 2837–2845. Computer Vision Foundation / IEEE, 2021.
• Markowich and Villani [2000] P. A. Markowich and C. Villani. On the trend to equilibrium for the Fokker-Planck equation: an interplay between physics and functional analysis. Mat. Contemp, 19:1–29, 2000.
• Meng et al. [2021] C. Meng, Y. Song, J. Song, J. Wu, J. Zhu, and S. Ermon. Sdedit: Image synthesis and editing with stochastic differential equations. CoRR, abs/2108.01073, 2021. URL https://arxiv.org/abs/2108.01073.
• Mirsky and Lee [2021] Y. Mirsky and W. Lee. The creation and detection of deepfakes: A survey. ACM Computing Surveys (CSUR), 54(1):1–41, 2021.
• Neal et al. [2011] R. M. Neal et al. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11):2, 2011.
• Nichol and Dhariwal [2021] A. Q. Nichol and P. Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pages 8162–8171. PMLR, 2021.
• Pavliotis [2014] G. A. Pavliotis. Stochastic processes and applications: diffusion processes, the Fokker-Planck and Langevin equations, volume 60. Springer, 2014.
• Pope et al. [2021] P. Pope, C. Zhu, A. Abdelkader, M. Goldblum, and T. Goldstein. The intrinsic dimension of images and its impact on learning. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021.
• Popov et al. [2021] V. Popov, I. Vovk, V. Gogoryan, T. Sadekova, and M. A. Kudinov. Grad-tts: A diffusion probabilistic model for text-to-speech. In M. Meila and T. Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 8599–8608. PMLR, 2021.
• Rioul [2010] O. Rioul. Information theoretic proofs of entropy power inequalities. IEEE Transactions on Information Theory, 57(1):33–55, 2010.
• Roberts and Tweedie [1996] G. O. Roberts and R. L. Tweedie. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, pages 341–363, 1996.
• Saharia et al. [2021] C. Saharia, J. Ho, W. Chan, T. Salimans, D. J. Fleet, and M. Norouzi. Image super-resolution via iterative refinement. CoRR, abs/2104.07636, 2021. URL https://arxiv.org/abs/2104.07636.
• Sasaki et al. [2021] H. Sasaki, C. G. Willcocks, and T. P. Breckon. UNIT-DDPM: UNpaired image translation with denoising diffusion probabilistic models. CoRR, abs/2104.05358, 2021. URL https://arxiv.org/abs/2104.05358.
• Sohl-Dickstein et al. [2015] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015.
• Song and Ermon [2019] Y. Song and S. Ermon. Generative modeling by estimating gradients of the data distribution. In H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 11895–11907, 2019.
• Song and Ermon [2020] Y. Song and S. Ermon. Improved techniques for training score-based generative models. Advances in neural information processing systems, 33:12438–12448, 2020.
• Song et al. [2020] Y. Song, S. Garg, J. Shi, and S. Ermon. Sliced score matching: A scalable approach to density and score estimation. In Uncertainty in Artificial Intelligence, pages 574–584. PMLR, 2020.
• Song et al. [2021] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole. Score-based generative modeling through stochastic differential equations. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021.
• Vahdat et al. [2021] A. Vahdat, K. Kreis, and J. Kautz. Score-based generative modeling in latent space. Advances in Neural Information Processing Systems, 34, 2021.
• Vincent [2011] P. Vincent. A connection between score matching and denoising autoencoders. Neural computation, 23(7):1661–1674, 2011.
• Welling and Teh [2011] M. Welling and Y. W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688. Citeseer, 2011.
• Zhou et al. [2021] L. Zhou, Y. Du, and J. Wu. 3d shape generation and completion through point-voxel diffusion. In 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021, pages 5806–5815. IEEE, 2021.

## Appendix A Stochastic prerequisites

In this section we give a formal introduction to some of the concepts used in this work. For a more rigorous treatment, see for example [Klenke, 2013] or [Karatzas and Shreve, 2012].

### a.1 Equivalence of measures / Girsanov Theorem

First we define absolute continuity of measures. Let μ and ν be two measures on (Ω, 𝒜), where 𝒜 is a σ-algebra.

###### Definition 1

We say that μ is absolutely continuous with respect to ν if μ(A) = 0 for any A ∈ 𝒜 such that ν(A) = 0. We also denote this by μ ≪ ν.

Two measures μ and ν are equivalent if μ ≪ ν and ν ≪ μ. Loosely speaking, we can say that μ ≪ ν if the support of μ is contained in the support of ν, and the two measures are equivalent if they share the same support.

The Radon-Nikodym theorem tells us that if μ ≪ ν, then under mild conditions there exists a density Z such that dμ = Z dν. Therefore, we can obtain μ through a reweighting of ν. One specific instance of this is the Girsanov Theorem. Assume we are given the solutions to two SDEs in ℝ^d,

 dY_t = b(t, Y_t) dt + σ(t, Y_t) dW_t (10)

and

 d~Y_t = b(t, ~Y_t) dt + σ(t, ~Y_t) e(t, ~Y_t) dt + σ(t, ~Y_t) dB_t. (11)

Both of these induce a measure on the space of continuous functions C([0, T], ℝ^d). We denote them by P and ~P respectively. Then the Girsanov Theorem equips us with conditions under which the measures P and ~P are equivalent. Furthermore, in case of equivalence we get a formula for the density of ~P with respect to P. The relative density is given as

 Z_T = exp( ∫_0^T e(s, Y_s) dW_s − (1/2) ∫_0^T ‖e(s, Y_s)‖² ds ).

For a full statement of the Girsanov Theorem and under which conditions it holds, see [Karatzas and Shreve, 2012, Section 3.5].
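To make the reweighting concrete, the following minimal numpy sketch simulates the undisturbed SDE (10) with b = 0 and σ = 1, accumulates the Girsanov density Z_T for a constant disturbance e ≡ c, and checks that weighting paths by Z_T recovers expectations under the disturbed law, under which Y_T ∼ N(cT, T). All parameter choices here (c, step counts, path counts) are ours, for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: undisturbed SDE dY = dW (b = 0, sigma = 1)
# and a constant disturbance e(t, y) = c.
T, n_steps, n_paths = 1.0, 200, 200_000
dt = T / n_steps
c = 0.5

# Simulate P-paths and accumulate the Girsanov density
# Z_T = exp( int_0^T c dW_s - (1/2) int_0^T c^2 ds ).
Y = np.zeros(n_paths)
log_Z = np.zeros(n_paths)
for _ in range(n_steps):
    dW = np.sqrt(dt) * rng.standard_normal(n_paths)
    log_Z += c * dW - 0.5 * c**2 * dt
    Y += dW
Z = np.exp(log_Z)

# Reweighting by Z_T turns expectations under P into expectations
# under ~P, where Y_T ~ N(c*T, T), so E_~P[Y_T] = c*T.
mean_tilted = np.mean(Y * Z)
mean_Z = np.mean(Z)
print(mean_tilted, mean_Z)
```

Since (Z_t) is a martingale with Z_0 = 1, the empirical mean of Z_T should also be close to 1.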

### a.2 Uniform integrability

Since we are treating the case where the drift explodes as t → T, we end up with densities

 Z_t = exp( ∫_0^t e(s, Y_s) dW_s − (1/2) ∫_0^t ‖e(s, Y_s)‖² ds ) (12)

on F_t for every t < T, but not with a density on F_T. Uniform integrability is exactly the condition one needs to extend these local densities.

###### Definition 2

A family (X_α)_α of random variables is called uniformly integrable if

 sup_α E[ |X_α| 1{|X_α| > s} ] → 0

as s → ∞.

In the proof of Theorem 2 we implicitly use the following two results, which we state here as a lemma. The filtration (F_t) is defined as in the proof of Theorem 2.

###### Lemma 4

Assume the Z_t in (12) form a uniformly integrable martingale on [0, T). Then,

• the limit lim_{t→T} Z_t exists in L¹. We denote this limit by Z.

• Furthermore, ~P is absolutely continuous with respect to P on F_T with density Z.

Proof  Both of these results are standard. The first one can for example be found in [Karatzas and Shreve, 2012, Section 1.3.B]. For the second one we compute that for any A ∈ F_s with s < T,

 E_P[1_A Z] = E_P[1_A lim_{t→T} Z_t] = lim_{t→T} E_P[1_A Z_t] = E_P[1_A Z_s] = E_~P[1_A],

where we used L¹ convergence in the second equality and the martingale property of (Z_t) in the third equality. Therefore Z is a density of ~P with respect to P on each F_s for s < T, and hence Z is also a density of ~P with respect to P on F_T, which concludes the proof.

## Appendix B Numerics

All numerical experiments can be run on a consumer-grade computer within a few minutes.

### b.1 Figure 1

We first discuss the top left plot of Figure 1. We set p_0 to a mixture of two Gaussians N(m_1, s_1²) and N(m_2, s_2²) with weights w_1 and w_2 respectively. Then, we draw N samples from p_0, denoted by X^n_0, n = 1, …, N. An Euler-Maruyama discretization of the Brownian motion propagates these samples from time t = 0 to t = 1 by

 X^n_{i+1} = X^n_i + √dt Z^n_i,

where Z^n_i ∼ N(0, 1) are i.i.d. random variables, independent of X^m_j for j ≤ i and all m. The time index i runs from 0 to I and dt is set to 1/I. The initial samples X^n_0 are used to create the left line plot of p_0 and the final samples X^n_I are used to create the right line plot of p_1 using kernel density estimation. The X^n_i are approximate samples from p_{i/I}. Therefore, we create histograms using the X^n_i to approximate p_{i/I}. The square root of the height of the histogram bars determines the colour intensity in the heat map. The horizontal axis in the heat map stands for the time t, whereas the vertical axis stands for the position x. At location (t, x) we plot an estimate of √(p_t(x)). We apply the square root since it improves the contrast in areas where p_t is close to 0 and makes it more visible to the observer where p_t is positive.

For the bottom left figure we show the same plots, just for the reverse SDE (3) instead of the forward SDE. Since the initial distribution p_0 is a Gaussian mixture, we can exactly calculate p_t using

 p_t(x) = w_1 N(x; m_1, s_1² + t) + w_2 N(x; m_2, s_2² + t), (13)

where we use N(x; m, s²) for the probability density function of a normal distribution with mean m and variance s², evaluated at x. With the above expression for p_t one could compute an analytical representation of ∇log p_t. We use automatic differentiation instead. The reverse SDE (3) is simulated with the constant disturbance 1 (cf. (14)) and initial condition Y^n_0 ∼ p_1. The Euler-Maruyama method is run with the same step size dt. More precisely, the one-step transition kernel of the discretized reverse SDE is

 Y^n_{i+1} = Y^n_i + dt ( ∇log p_{1−i/I}(Y^n_i) + 1 ) + √dt ~Z^n_i, (14)

where ~Z^n_i ∼ N(0, 1) are i.i.d. random variables, independent of Y^m_j for j ≤ i and all m. The plots are created in the same way as for the upper left plot, except that we reverse the time axis so that the forward and reverse evolutions are plotted directly underneath each other.

On the right side we plot the same kernel density estimates already plotted on the left side, for the forward and the reverse process, into the same plot for comparison.
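A minimal sketch of the discretized reverse SDE (14): the mixture parameters are again hypothetical, we substitute the closed-form score of (13) for the paper's automatic differentiation, and we initialize from a rough Gaussian stand-in for p_1 rather than the exact distribution:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical mixture parameters (not the paper's values).
w = np.array([0.3, 0.7])
m = np.array([-2.0, 2.0])
s2 = np.array([0.25, 0.25])

def score(x, t):
    # Closed-form grad log p_t of the mixture in (13); the paper
    # evaluates this score by automatic differentiation instead.
    var = s2 + t                     # component variances at time t
    d = x[:, None] - m[None, :]      # (n, 2) deviations from the means
    comp = w * np.exp(-0.5 * d**2 / var) / np.sqrt(2 * np.pi * var)
    return (comp * (-d / var)).sum(axis=1) / comp.sum(axis=1)

# Discretized reverse SDE as in (14), with the constant disturbance +1:
# Y_{i+1} = Y_i + dt * (score(Y_i, 1 - i/I) + 1) + sqrt(dt) * Z_i.
N, I = 5000, 200
dt = 1.0 / I
Y = np.sqrt(1.0 + s2.mean()) * rng.standard_normal(N)  # rough stand-in for p_1
for i in range(I):
    Y = Y + dt * (score(Y, 1.0 - i / I) + 1.0) + np.sqrt(dt) * rng.standard_normal(N)
```

The final `Y` plays the role of the reverse-time samples whose kernel density estimate is compared against the forward samples.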

### b.2 Figures 2 and 3

Figure 3 is created by setting p_0 to be the uniform distribution on M equally spaced samples x_1, …, x_M on the unit sphere. This can also be viewed as a Gaussian mixture with M components, each component having mean x_i and variance 0. Therefore, we can again explicitly calculate p_t for t > 0 as in (13),

 p_t(y) = (1/M) ∑_{i=1}^M N(y; x_i, t).

The score ∇log p_t is evaluated using automatic differentiation. The reverse SDE (3) is simulated with
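The point-cloud construction and its smoothed density can be sketched directly. M = 8, the 2D circle, and the closed-form score (in place of the paper's automatic differentiation) are our illustrative choices:

```python
import numpy as np

# M equally spaced points on the unit circle, viewed as a Gaussian
# mixture whose components have variance 0 at time t = 0.
M = 8
angles = 2.0 * np.pi * np.arange(M) / M
pts = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # (M, 2)

def p_t(y, t):
    # Smoothed density p_t(y) = (1/M) sum_i N(y; x_i, t I_2) for t > 0.
    d2 = ((y[:, None, :] - pts[None, :, :]) ** 2).sum(axis=-1)  # (n, M)
    return np.exp(-0.5 * d2 / t).sum(axis=1) / (M * 2.0 * np.pi * t)

def grad_log_p_t(y, t):
    # Closed-form score; the paper uses automatic differentiation.
    diff = y[:, None, :] - pts[None, :, :]                      # (n, M, 2)
    wgt = np.exp(-0.5 * (diff ** 2).sum(axis=-1) / t)           # (n, M)
    num = (wgt[:, :, None] * (-diff / t)).sum(axis=1)           # (n, 2)
    return num / wgt.sum(axis=1, keepdims=True)

# Sanity check: p_t integrates to approximately 1 over a grid that
# captures essentially all of the mass.
g = np.linspace(-4.0, 4.0, 161)
grid = np.stack(np.meshgrid(g, g), axis=-1).reshape(-1, 2)
h = g[1] - g[0]
total_mass = p_t(grid, 0.5).sum() * h * h
print(total_mass)
```

By the symmetry of the point configuration, the score vanishes exactly at the origin, which gives a second easy check of the implementation.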