Denoising Diffusion Gamma Models

by Eliya Nachmani et al.
Tel Aviv University

Generative diffusion processes are an emerging and effective tool for image and speech generation. In the existing methods, the underlying noise distribution of the diffusion process is Gaussian noise. However, fitting distributions with more degrees of freedom could improve the performance of such generative models. In this work, we investigate other types of noise distribution for the diffusion process. Specifically, we introduce the Denoising Diffusion Gamma Model (DDGM) and show that noise from the Gamma distribution provides improved results for image and speech generation. Our approach preserves the ability to efficiently sample arbitrary states of the diffusion process during training while using Gamma noise.



1 Introduction

Deep generative neural networks have shown significant progress over the last years. The main architectures for generation are: (i) VAE-based (Kingma and Welling, 2013), for example, NVAE (Vahdat and Kautz, 2020) and VQ-VAE (Razavi et al., 2019); (ii) GAN-based (Goodfellow et al., 2014), for example, StyleGAN (Karras et al., 2020) for vision applications and WaveGAN (Donahue et al., 2018) for speech; (iii) flow-based, for example, Glow (Kingma and Dhariwal, 2018); (iv) autoregressive, for example, WaveNet for speech (Oord et al., 2016); and (v) Diffusion Probabilistic Models (Sohl-Dickstein et al., 2015), for example, Denoising Diffusion Probabilistic Models (DDPM) (Ho et al., 2020) and their implicit version, DDIM (Song et al., 2020a).

Models from this last family have shown significant progress in generation capabilities in recent years, e.g., (Chen et al., 2020; Kong et al., 2020a), and have achieved results comparable to state-of-the-art generation architectures for both images and speech.

A DDPM is a Markov chain of latent variables. Two processes are modeled: (i) a diffusion process and (ii) a denoising process. During training, the diffusion process transforms data samples into Gaussian noise. Denoising is the reverse process, and it is used during inference for generating data samples, starting from Gaussian noise. The denoising process can be conditioned on attributes to control the generated sample. To obtain high-quality synthesis, a large number of denoising steps is used (on the order of 1000 steps). A notable property of the diffusion process is a closed-form formulation of the noise that arises from accumulating diffusion steps. This allows sampling arbitrary states in the Markov chain of the diffusion process without calculating the previous steps.

In the Gaussian case, this property stems from the fact that the sum of independent Gaussian random variables is also Gaussian. Other distributions have similar additivity properties. For example, the sum of two independent Gamma random variables that share the same scale parameter is itself Gamma distributed with that scale. The Poisson distribution has a similar property; however, its discrete nature makes it less suitable for DDPM.
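This additivity property is easy to verify numerically. Below is a small sketch using NumPy; the shape and scale values are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two independent Gamma variables sharing the scale theta.
k1, k2, theta = 1.5, 2.5, 0.7
n = 200_000
s = rng.gamma(k1, theta, n) + rng.gamma(k2, theta, n)

# If the additivity property holds, s ~ Gamma(k1 + k2, theta):
# mean = (k1 + k2) * theta, variance = (k1 + k2) * theta**2.
expected_mean = (k1 + k2) * theta
expected_var = (k1 + k2) * theta ** 2

print(abs(s.mean() - expected_mean))   # close to 0
print(abs(s.var() - expected_var))     # close to 0
```

The same check fails for Gamma variables with different scales, whose sum is in general no longer Gamma distributed.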

In DDPM, the mean of the noise distribution is set to zero. The Gamma distribution, with its two parameters (shape and scale), is better suited to fit the data than a zero-mean Gaussian distribution, which has a single degree of freedom (scale). Furthermore, the Gamma distribution generalizes other distributions, and many other distributions can be derived from it (Leemis and McQueston, 2008).

The added modeling capacity of the Gamma distribution can help speed up the convergence of the DDPM model. Consider, for example, a conventional DDPM model that was trained with Gaussian noise on the CelebA dataset (Liu et al., 2015).

The noise distribution throughout the diffusion process can be visualized by computing the histogram of the estimated residual noise in the generation process. The estimated residual noise $\hat\epsilon$ is given by $\hat\epsilon = (x_t - \sqrt{\bar\alpha_t}\, x_0)/\sqrt{1-\bar\alpha_t}$, where $\bar\alpha_t$ is the noise schedule, $x_0$ is the data point and $x_t$ is the estimated state at timestep $t$, as can be derived from Eq. 4 of (Song et al., 2020a). Both a Gaussian distribution and a Gamma distribution can then be fitted to this histogram, as shown in Fig. 1(a,b). As can be seen, the Gamma distribution provides a better fit to the estimated residual noise $\hat\epsilon$. Moreover, Fig. 1(c) presents the mean fitting error between the histogram and the fitted probability density function. Evidently, the Gamma distribution is a better fit than the Gaussian distribution.
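This fitting procedure can be sketched as follows. Since the residuals of a trained model are not available here, a centered Gamma sample stands in for the estimated residual noise; the sample size, bin count, and Gamma parameters are arbitrary illustrative choices:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Stand-in for the estimated residual noise: a skewed, zero-mean sample.
# (The real histogram would come from a trained DDPM's generation error.)
k, theta = 2.0, 1.0
noise = rng.gamma(k, theta, 100_000) - k * theta  # centered Gamma draw

hist, edges = np.histogram(noise, bins=80, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])

# Fit both families and measure the fit error against the histogram.
mu, sigma = stats.norm.fit(noise)
gauss_pdf = stats.norm.pdf(centers, mu, sigma)

a, loc, scale = stats.gamma.fit(noise)
gamma_pdf = stats.gamma.pdf(centers, a, loc=loc, scale=scale)

mse_gauss = np.mean((hist - gauss_pdf) ** 2)
mse_gamma = np.mean((hist - gamma_pdf) ** 2)
print(mse_gauss, mse_gamma)  # the Gamma fit error should be smaller
```

With a skewed residual histogram, the Gamma family attains a lower MSE, mirroring the comparison in Fig. 1(c).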

In this paper, we investigate the non-Gaussian Gamma noise distribution. The proposed models maintain the property of the diffusion process of sampling arbitrary states without calculating the previous steps. Our results are demonstrated in two major domains: vision and audio. In the first domain, the proposed method is shown to provide a better FID score for generated images. For speech data, we show that the proposed method improves various measures, such as Perceptual Evaluation of Speech Quality (PESQ) and short-time objective intelligibility (STOI).

Figure 1: Fitting a distribution to the histogram of the generation error, which is given by the scaled difference between $x_0$ and the image after a number of DDPM steps. The model is a pretrained DDPM (Gaussian) CelebA (64x64) model. (a) Fitting a Gaussian distribution to the histogram of a typical image. (b) Fitting a Gamma distribution. (c) The fitting error for the Gaussian and the Gamma distributions, measured as the MSE between the histogram and the fitted probability density function. Each point is the average value over the generated images; the vertical error bars denote the standard deviation.

2 Related Work

In their seminal work, Sohl-Dickstein et al. (2015) introduce the Diffusion Probabilistic Model. This model is applied to various domains, such as time series and images. The main drawback of the proposed model is that it needs up to thousands of iterative steps to generate a valid data sample. Song and Ermon (2019) proposed a diffusion generative model based on Langevin dynamics and the score matching method (Hyvärinen and Dayan, 2005). The model estimates the Stein score function (Liu et al., 2016), which is the gradient of the logarithm of the data density. Given the Stein score function, the model can generate data points.

Denoising Diffusion Probabilistic Models (DDPM) (Ho et al., 2020) combine generative models based on score matching and neural Diffusion Probabilistic Models into a single model. Similarly, in Chen et al. (2020); Kong et al. (2020b), a generative neural diffusion process based on score matching was applied to speech generation. These models achieve state-of-the-art results for speech generation and show superior results over well-established methods, such as WaveRNN (Kalchbrenner et al., 2018), WaveNet (Oord et al., 2016), and GAN-TTS (Bińkowski et al., 2019).

Denoising Diffusion Implicit Models (DDIM) offer a way to accelerate the denoising process (Song et al., 2020a). The model employs a non-Markovian diffusion process to generate higher-quality samples and helps reduce the number of diffusion steps, e.g., from a thousand steps to a few hundred.

Dhariwal and Nichol (2021) find a better diffusion architecture through a series of exploratory experiments, leading to the Ablated Diffusion Model (ADM). This model outperforms the previous state of the art in image synthesis, which was held by GAN-based models such as BigGAN-deep (Brock et al., 2018) and StyleGAN2 (Karras et al., 2020). ADM is further improved using a novel Cascaded Diffusion Model (CDM). Our contribution is fundamental and can be incorporated into the proposed ADM and CDM architectures.

Watson et al. (2021) propose an efficient method for sampling from diffusion probabilistic models by a dynamic programming algorithm that finds the optimal discrete time schedules. Choi et al. (2021) introduce the Iterative Latent Variable Refinement (ILVR) method for guiding the generative process in DDPM. Moreover, Kong and Ping (2021) systematically investigate fast sampling methods for diffusion denoising models. Lam et al. (2021) propose bilateral denoising diffusion models (BDDMs), which take significantly fewer steps to generate high-quality samples.

Huang et al. (2021) derive a variational framework for estimating the marginal likelihood of continuous-time diffusion models. Moreover, Kingma et al. (2021) show the equivalence between various diffusion processes by using a simplification of the variational lower bound (VLB).

Song et al. (2020b) show that score-based generative models can be considered a solution to a stochastic differential equation. Gao et al. (2020) provide an alternative approach for training an energy-based generative model using a diffusion process.

Another line of work in audio is that of neural vocoders based on a denoising diffusion process. WaveGrad (Chen et al., 2020) and DiffWave (Kong et al., 2020b) are conditioned on the mel-spectrogram and produce high-fidelity audio samples, using as few as six steps of the diffusion process. These models outperform adversarial non-autoregressive baselines. Popov et al. (2021) propose a diffusion-based text-to-speech model, which allows generating speech while flexibly controlling the trade-off between sound quality and inference speed.

Diffusion models were also applied to natural language processing tasks. Hoogeboom et al. (2021) proposed a multinomial diffusion process for categorical data and applied it to language modeling. Austin et al. (2021) generalize the multinomial diffusion process with Discrete Denoising Diffusion Probabilistic Models (D3PMs) and improve the generated results for the text8 and One Billion Word (LM1B) datasets.

1:  Input: dataset $d$, diffusion process length $T$, noise schedule $\beta_1, \dots, \beta_T$
2:  repeat
3:     $x_0 \sim d$
4:     $t \sim \mathrm{Uniform}(\{1, \dots, T\})$
5:     $\epsilon \sim \mathcal{N}(0, I)$
6:     $x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon$
7:     Take gradient descent step on: $\nabla_\theta \left\| \epsilon - \epsilon_\theta(x_t, t) \right\|^2$
8:  until converged
Algorithm 1 DDPM training procedure.
1:  $x_T \sim \mathcal{N}(0, I)$
2:  for t = T, …, 1 do
3:     $z \sim \mathcal{N}(0, I)$
4:     $x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \bar\alpha_t}}\, \epsilon_\theta(x_t, t) \right)$
5:     
6:     if $t > 1$ then
7:        $x_{t-1} = x_{t-1} + \sigma_t z$
8:     end if
9:  end for
10:  return $x_0$
Algorithm 2 DDPM sampling algorithm

3 Diffusion models for Gamma Distribution

We start by recapitulating the Gaussian case, after which we derive diffusion models for the Gamma distribution.

3.1 Background - Gaussian DDPM

Diffusion networks learn the gradients of the data log density:

$s(x) = \nabla_x \log p(x)$

By using Langevin dynamics and the gradients of the data log density $s(x)$, a sampling procedure for the probability $p(x)$ can be written as:

$x_{i+1} = x_i + \frac{\delta}{2} \nabla_x \log p(x_i) + \sqrt{\delta}\, z_i$

where $z_i \sim \mathcal{N}(0, I)$ and $\delta$ is the step size.
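As a toy illustration of this sampling procedure, consider a one-dimensional target density whose score is known in closed form; the target $\mathcal{N}(3, 1)$, the step size, and the number of steps below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(2)

# Target density: N(3, 1), so grad log p(x) = -(x - 3).
def score(x):
    return -(x - 3.0)

delta = 0.01                    # step size
x = rng.standard_normal(5000)   # arbitrary initialization

# Langevin dynamics: drift along the score plus injected Gaussian noise.
for _ in range(2000):
    z = rng.standard_normal(x.shape)
    x = x + 0.5 * delta * score(x) + np.sqrt(delta) * z

print(x.mean(), x.std())  # approaches the target's mean 3 and std 1
```

After enough steps, the chain of samples is approximately distributed according to the target density, up to a small discretization bias that shrinks with the step size.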

The diffusion process in DDPM (Ho et al., 2020) is defined by a Markov chain that gradually adds Gaussian noise to the data according to a noise schedule. The diffusion process is defined by:

$q(x_1, \dots, x_T \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1})$

where $T$ is the length of the diffusion process, and $x_1, \dots, x_T$ is a sequence of latent variables with the same size as the clean sample $x_0$.

The diffusion process is parameterized by a set of parameters called the noise schedule ($\beta_1, \dots, \beta_T$), which defines the variance of the noise added at each step:

$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\, \sqrt{1-\beta_t}\, x_{t-1},\, \beta_t I\right)$  (4)
Since a Gaussian noise random variable is used at each step, the diffusion process can be simulated for any number of steps with the closed formula:

$x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\, \epsilon$

where $\epsilon \sim \mathcal{N}(0, I)$, $\alpha_t = 1 - \beta_t$, and $\bar\alpha_t = \prod_{s=1}^{t} \alpha_s$.

Diffusion models are a class of generative neural networks of the form $p_\theta(x_{t-1} \mid x_t)$ that learn to reverse the diffusion process. One can write:

$p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)$
As described in (Ho et al., 2020), one can learn to predict the noise present in the data with a network $\epsilon_\theta(x_t, t)$ and sample $x_{t-1}$ from $x_t$ using the following formula:

$x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\, \epsilon_\theta(x_t, t) \right) + \sigma_t z$  (7)

where $z \sim \mathcal{N}(0, I)$ is white noise and $\sigma_t$ is the standard deviation of the added noise. (Song et al., 2020a) use $\sigma_t^2 = \beta_t\, (1-\bar\alpha_{t-1})/(1-\bar\alpha_t)$.

The training procedure of $\epsilon_\theta$ is defined in Alg. 1. Given the input dataset $d$, the algorithm samples $x_0 \sim d$, $t \sim \mathrm{Uniform}(\{1, \dots, T\})$ and $\epsilon \sim \mathcal{N}(0, I)$. The noisy latent state $x_t$ is calculated and fed to the DDPM neural network $\epsilon_\theta$. A gradient descent step is taken in order to estimate the noise $\epsilon$ with the DDPM network $\epsilon_\theta$.

The complete inference algorithm is presented in Alg. 2. It starts from Gaussian noise $x_T \sim \mathcal{N}(0, I)$ and then reverses the diffusion process step by step, by iteratively employing the update rule of Eq. 7.
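The equivalence between simulating the Gaussian chain step by step and sampling $x_t$ directly with the closed formula can be checked numerically on toy data; the noise schedule below is an arbitrary illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(3)

T = 50
beta = np.linspace(1e-4, 0.2, T)
alpha_bar = np.cumprod(1.0 - beta)

x0 = rng.standard_normal(100_000)  # toy "data"

# Step-by-step diffusion: x_t = sqrt(1 - beta_t) x_{t-1} + sqrt(beta_t) eps_t
x = x0.copy()
for t in range(T):
    x = np.sqrt(1.0 - beta[t]) * x + np.sqrt(beta[t]) * rng.standard_normal(x0.shape)

# Closed form: x_T = sqrt(abar_T) x0 + sqrt(1 - abar_T) eps
x_closed = np.sqrt(alpha_bar[-1]) * x0 \
    + np.sqrt(1.0 - alpha_bar[-1]) * rng.standard_normal(x0.shape)

# The two routes agree in distribution (compare the first two moments).
print(x.mean(), x_closed.mean())
print(x.var(), x_closed.var())
```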

3.2 Denoising Diffusion Gamma Models (DDGM)

We expand the framework of diffusion generative processes by incorporating a new noise distribution, namely the Gamma Distribution. We call this new type of models Denoising Diffusion Gamma Models. First, we define the Gamma diffusion process, then we present a way to sample from this process, and finally we show how to train those models by computing the variational lower bound and deriving a novel loss function from it.

3.2.1 The Gamma Model

In the Gaussian case, the diffusion equation (Eq. 4) can be written as:

$x_t = \sqrt{1-\beta_t}\, x_{t-1} + \sqrt{\beta_t}\, \epsilon_t$  (8)

where $\epsilon_t \sim \mathcal{N}(0, I)$ is the Gaussian noise of step $t$. One can denote by $\Gamma(k, \theta)$ the Gamma distribution, where $k$ and $\theta$ are the shape and the scale, respectively. We modify Eq. 8 by adding, during the diffusion process, noise that follows a Gamma distribution:

$x_t = \sqrt{1-\beta_t}\, x_{t-1} + (g_t - \mathbb{E}[g_t])$  (9)

where $g_t \sim \Gamma(k_t, \theta_t)$, $\theta_t = \sqrt{\bar\alpha_t}\, \theta_0$, and $k_t = \beta_t / (\bar\alpha_t \theta_0^2)$. Note that $k_0$ and $\theta_0$ are hyperparameters.

Since the sum of Gamma random variables (with the same scale parameter) is Gamma distributed, one can derive a closed form for $x_t$, i.e., an equation to calculate $x_t$ directly from $x_0$:

$x_t = \sqrt{\bar\alpha_t}\, x_0 + (\bar{g}_t - \bar{k}_t \theta_t)$  (10)

where $\bar{g}_t \sim \Gamma(\bar{k}_t, \theta_t)$ and $\bar{k}_t = \sum_{i=1}^{t} k_i$.

Lemma 1.

Let $x_t = \sqrt{1-\beta_t}\, x_{t-1} + (g_t - \mathbb{E}[g_t])$. Assume $g_t \sim \Gamma(k_t, \theta_t)$, $\theta_t = \sqrt{\bar\alpha_t}\, \theta_0$, $k_t = \beta_t/(\bar\alpha_t \theta_0^2)$, $\mathbb{E}[x_0] = 0$, and $\mathrm{Var}(x_0) = 1$. Then the following hold:

$\mathbb{E}[x_t] = 0, \qquad \mathrm{Var}(x_t) = 1$  (11)

$x_t = \sqrt{\bar\alpha_t}\, x_0 + (\bar{g}_t - \bar{k}_t \theta_t)$  (12)

where $\bar{g}_t \sim \Gamma(\bar{k}_t, \theta_t)$ and $\bar{k}_t = \sum_{i=1}^{t} k_i$.
The complete proof for Lemma 1 is given in Appendix A.1.
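Lemma 1 can be sanity-checked numerically by comparing the step-by-step Gamma diffusion with the closed form of Eq. 10; the noise schedule and the initial scale $\theta_0$ below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(4)

T, theta0 = 20, 0.5
beta = np.linspace(1e-3, 0.1, T)
alpha_bar = np.cumprod(1.0 - beta)

# Per-step Gamma parameters and their accumulation (Lemma 1 notation).
theta = np.sqrt(alpha_bar) * theta0
k = beta / (alpha_bar * theta0 ** 2)
k_bar = np.cumsum(k)

n = 200_000
x0 = np.zeros(n)  # toy data point

# Step-by-step Gamma diffusion: x_t = sqrt(1-beta_t) x_{t-1} + (g_t - E[g_t]).
x = x0.copy()
for t in range(T):
    g = rng.gamma(k[t], theta[t], n)
    x = np.sqrt(1.0 - beta[t]) * x + (g - k[t] * theta[t])

# Closed form: x_T = sqrt(abar_T) x0 + (gbar - kbar_T * theta_T),
# with gbar ~ Gamma(kbar_T, theta_T).
gbar = rng.gamma(k_bar[-1], theta[-1], n)
x_closed = np.sqrt(alpha_bar[-1]) * x0 + (gbar - k_bar[-1] * theta[-1])

# Agreement in distribution: compare the first two moments.
print(x.mean(), x_closed.mean())   # both near 0
print(x.var(), x_closed.var())     # both near kbar_T * theta_T**2
```

The key point is that rescaling a Gamma variable only rescales its scale parameter, so the accumulated noise terms all share the scale $\theta_t$ and their shapes add up to $\bar{k}_t$.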

Similarly to Eq. 7, by using Langevin dynamics, the inference step is given by:

$x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\, \epsilon_\theta(x_t, t) \right) + \sigma_t \tilde{z}_t$  (13)

where $\tilde{z}_t = (g - \bar{k}_t \theta_t)/\sqrt{1-\bar\alpha_t}$ is centered, normalized Gamma noise with $g \sim \Gamma(\bar{k}_t, \theta_t)$.
In Algorithm 3 we describe the training procedure. As input we have: (i) the initial scale $\theta_0$, (ii) the dataset $d$, (iii) the maximum number of steps $T$ in the diffusion process and (iv) the noise schedule $\beta_1, \dots, \beta_T$. The training algorithm samples: (i) an example $x_0$, (ii) a step number $t$ and (iii) noise $\bar{g}_t \sim \Gamma(\bar{k}_t, \theta_t)$. It then calculates $x_t$ from $x_0$ by using Eq. 10. The neural network takes $x_t$ as input and is conditioned on the time step $t$. Next, it takes a gradient descent step to approximate the normalized noise $(\bar{g}_t - \bar{k}_t \theta_t)/\sqrt{1-\bar\alpha_t}$ with the neural network $\epsilon_\theta$. The main changes between Algorithm 3 and the Gaussian case (i.e., Alg. 1) are the following: (i) the calculation of the Gamma parameters, (ii) the update equation and (iii) the gradient update equation.

The inference procedure is given in Algorithm 4. It starts from zero-mean noise sampled from a centered Gamma distribution. Next, for $T$ steps the algorithm estimates $x_{t-1}$ from $x_t$ by using Eq. 13. Note that $\sigma_t$ is set as in (Song et al., 2020a). Algorithm 4 replaces the Gaussian version (i.e., Alg. 2) with the following changes: (i) the starting sampling point $x_T$, (ii) the sampling noise and (iii) the update equation.

1:  Input: initial scale $\theta_0$, dataset $d$, diffusion process length $T$, noise schedule $\beta_1, \dots, \beta_T$
2:  repeat
3:     $x_0 \sim d$
4:     $t \sim \mathrm{Uniform}(\{1, \dots, T\})$
5:     $\theta_t = \sqrt{\bar\alpha_t}\, \theta_0$, $\bar{k}_t = \sum_{i=1}^{t} \beta_i / (\bar\alpha_i \theta_0^2)$
6:     $\bar{g}_t \sim \Gamma(\bar{k}_t, \theta_t)$, $x_t = \sqrt{\bar\alpha_t}\, x_0 + (\bar{g}_t - \bar{k}_t \theta_t)$
7:     Take a gradient descent step on: $\nabla_\theta \left\| \frac{\bar{g}_t - \bar{k}_t \theta_t}{\sqrt{1-\bar\alpha_t}} - \epsilon_\theta(x_t, t) \right\|_1$
8:  until converged
Algorithm 3 Gamma Training Algorithm
1:  $\bar{g}_T \sim \Gamma(\bar{k}_T, \theta_T)$
2:  $x_T = \bar{g}_T - \bar{k}_T \theta_T$
3:  for t = T, …, 1 do
4:     $x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\, \epsilon_\theta(x_t, t) \right)$
5:     if t > 1 then
6:        $g \sim \Gamma(\bar{k}_{t-1}, \theta_{t-1})$
7:        $\tilde{z} = (g - \bar{k}_{t-1} \theta_{t-1}) / \sqrt{1-\bar\alpha_{t-1}}$
8:        $x_{t-1} = x_{t-1} + \sigma_t \tilde{z}$
9:     end if
10:  end for
11:  return $x_0$
Algorithm 4 Gamma Inference Algorithm

3.2.2 The Reverse Process for DDGM

The reverse process defines the underlying generation process. Therefore, in this section, we will obtain the reverse process for the Gamma denoising diffusion model. Furthermore, we will use the reverse process to obtain the variational lower bound and the appropriate loss function for the Gamma distribution denoising diffusion model.

The reverse process is given by Bayes' rule:

$q(x_{t-1} \mid x_t, x_0) = \frac{q(x_t \mid x_{t-1}, x_0)\, q(x_{t-1} \mid x_0)}{q(x_t \mid x_0)}$  (14)

Next, one can calculate each of the three main components of the reverse process, i.e., (i) $q(x_t \mid x_{t-1}, x_0)$, (ii) $q(x_{t-1} \mid x_0)$ and (iii) $q(x_t \mid x_0)$.

Since the diffusion process is Markovian, $q(x_t \mid x_{t-1}, x_0) = q(x_t \mid x_{t-1})$. Therefore, the first component (i) of Eq. 14 is the forward process. The forward process is given by:

$q(x_t \mid x_{t-1}) = f_{\Gamma}\!\left(x_t - \sqrt{1-\beta_t}\, x_{t-1} + k_t \theta_t;\, k_t, \theta_t\right)$

where $f_{\Gamma}(\cdot;\, k, \theta)$ denotes the Gamma probability density function.
The second component of Eq. 14 is given by:

$q(x_{t-1} \mid x_0) = f_{\Gamma}\!\left(x_{t-1} - \sqrt{\bar\alpha_{t-1}}\, x_0 + \bar{k}_{t-1} \theta_{t-1};\, \bar{k}_{t-1}, \theta_{t-1}\right)$

Similarly, the third component of Eq. 14 is given by:

$q(x_t \mid x_0) = f_{\Gamma}\!\left(x_t - \sqrt{\bar\alpha_t}\, x_0 + \bar{k}_t \theta_t;\, \bar{k}_t, \theta_t\right)$

Overall, the reverse process is given by:

$q(x_{t-1} \mid x_t, x_0) = \frac{f_{\Gamma}\!\left(x_t - \sqrt{1-\beta_t}\, x_{t-1} + k_t \theta_t;\, k_t, \theta_t\right)\, f_{\Gamma}\!\left(x_{t-1} - \sqrt{\bar\alpha_{t-1}}\, x_0 + \bar{k}_{t-1} \theta_{t-1};\, \bar{k}_{t-1}, \theta_{t-1}\right)}{f_{\Gamma}\!\left(x_t - \sqrt{\bar\alpha_t}\, x_0 + \bar{k}_t \theta_t;\, \bar{k}_t, \theta_t\right)}$
One can denote:

$A = x_t - \sqrt{1-\beta_t}\, x_{t-1} + k_t \theta_t, \qquad B = x_{t-1} - \sqrt{\bar\alpha_{t-1}}\, x_0 + \bar{k}_{t-1} \theta_{t-1}$

Thus, dropping the terms that do not depend on $x_{t-1}$, the reverse process is proportional to:

$q(x_{t-1} \mid x_t, x_0) \propto A^{k_t - 1} e^{-A/\theta_t}\, B^{\bar{k}_{t-1} - 1} e^{-B/\theta_{t-1}}$  (20)
3.2.3 Variational Lower Bound for DDGM

Denoising diffusion models (Ho et al., 2020) are trained by optimizing the usual variational bound on the negative log likelihood:

$\mathbb{E}[-\log p_\theta(x_0)] \le \mathbb{E}_q\!\left[-\log \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\right] =: L$

To get the variational lower bound for the proposed Gamma denoising diffusion model, one can use Eq. 5 from Ho et al. (2020):

$L = \mathbb{E}_q\!\left[L_T + \sum_{t>1} L_{t-1} + L_0\right]$

where $L_T$, $L_{t-1}$ and $L_0$ are defined by:

$L_T = D_{KL}\!\left(q(x_T \mid x_0)\, \|\, p(x_T)\right), \quad L_{t-1} = D_{KL}\!\left(q(x_{t-1} \mid x_t, x_0)\, \|\, p_\theta(x_{t-1} \mid x_t)\right), \quad L_0 = -\log p_\theta(x_0 \mid x_1)$

$L_T$ is constant and is ignored during training since it does not have learnable parameters. Moreover, $L_0$ is modeled in (Ho et al., 2020) with a discrete decoder; however, in our proposed model we empirically found that its impact is negligible and it can be removed.

Therefore, to calculate the variational lower bound, one needs to obtain $L_{t-1}$ for $1 < t \le T$:

$L_{t-1} = D_{KL}\!\left(q(x_{t-1} \mid x_t, x_0)\, \|\, p_\theta(x_{t-1} \mid x_t)\right)$
We can calculate the KL divergence in its exact form:

$D_{KL}\!\left(q(x_{t-1} \mid x_t, x_0)\, \|\, p_\theta(x_{t-1} \mid x_t)\right) = \mathbb{E}_q\!\left[\log \frac{q(x_{t-1} \mid x_t, x_0)}{p_\theta(x_{t-1} \mid x_t)}\right]$  (25)

Using Eq. 20, the RHS of Eq. 25 becomes:

One can show that the four terms present in the previous equation can be upper bounded by the L1 distance between the predicted $\hat{x}_0$ and the ground truth $x_0$.

The complete form of the upper bound can be expressed as follows:


As can be seen, the variational lower bound is bounded by constant terms multiplied by the L1 norm between the data point $x_0$ and its estimate $\hat{x}_0$. The constant terms, which depend only on the noise schedule, are known values during training.

3.2.4 Loss Function for DDGM

Denoising diffusion probabilistic models use the variational lower bound to minimize the negative log likelihood. As described in Sec. 3.2.3, one can minimize the variational lower bound by minimizing $L_{t-1}$ for $1 < t \le T$. To do so, one can minimize the L1 norm from Eq. 27. Our model optimizes the L1 norm between the sampled noise and the estimated noise $\epsilon_\theta(x_t, t)$. This is verified in the following lemma.

Lemma 2.

Minimizing the variational lower bound for DDGM (i.e., $L_{t-1}$ for $1 < t \le T$) is equivalent to minimizing the L1 norm between the sampled noise and the estimated noise:

$L(\theta) = \left\| \frac{\bar{g}_t - \bar{k}_t \theta_t}{\sqrt{1-\bar\alpha_t}} - \epsilon_\theta(x_t, t) \right\|_1$

The complete proof of Lemma 2 is given in Sec. A.2 in the appendix. Thus, the loss used in Alg. 3 is $L(\theta)$.
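A minimal sketch of this training objective, assuming the normalized-noise target described in Sec. 3.2.1; the function name, the dummy predictor, and all parameter values are our own illustrative choices, not from a released implementation:

```python
import numpy as np

rng = np.random.default_rng(5)

def ddgm_loss(eps_model, x0, t, beta, alpha_bar, theta0):
    """L1 training loss for a Gamma diffusion model (sketch, Eq. 10 notation)."""
    theta_t = np.sqrt(alpha_bar[t]) * theta0
    k_bar_t = np.sum(beta[: t + 1] / (alpha_bar[: t + 1] * theta0 ** 2))

    g_bar = rng.gamma(k_bar_t, theta_t, size=x0.shape)
    noise = g_bar - k_bar_t * theta_t                 # centered accumulated noise
    x_t = np.sqrt(alpha_bar[t]) * x0 + noise          # closed-form latent state
    target = noise / np.sqrt(1.0 - alpha_bar[t])      # normalized noise target
    return np.mean(np.abs(target - eps_model(x_t, t)))

# Usage with a dummy predictor that always outputs zero:
beta = np.linspace(1e-4, 0.02, 100)
alpha_bar = np.cumprod(1.0 - beta)
loss = ddgm_loss(lambda x, t: np.zeros_like(x),
                 np.zeros(10_000), 50, beta, alpha_bar, 0.001)
print(loss)
```

With a zero predictor, the loss equals the mean absolute value of the normalized noise, which has unit variance by construction.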

4 Experiments

4.1 Speech Generation

For our speech experiments, we used a version of WaveGrad (Chen et al., 2020) based on the implementation of Vovk (2020) (under the BSD-3-Clause License). We evaluate our model with two perceptual speech-quality measures, PESQ (Rix et al., 2001) and STOI (Taal et al., 2011). We use the standard WaveGrad method with the Gaussian diffusion process as a baseline. We use two Nvidia Volta V100 GPUs to train our models.

For all the experiments, the inference noise schedules were defined as described in the WaveGrad paper (Chen et al., 2020): depending on the number of iterations, the noise schedule is linear, derived from the Fibonacci sequence, or found by a model-dependent grid search over the noise schedule parameters. For the other hyperparameters (e.g., learning rate, batch size, etc.) we use the same values as in WaveGrad (Chen et al., 2020). Training was performed using the Gamma diffusion process of Eq. 9; our best results were obtained with a tuned value of the initial scale $\theta_0$.
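The linear and Fibonacci-style schedules can be sketched as follows; the start, end, and base values (and the exact recurrence used by WaveGrad) are assumptions for illustration only:

```python
import numpy as np

def linear_schedule(n, start=1e-4, end=0.05):
    """Linear noise schedule; the start/end values are illustrative."""
    return np.linspace(start, end, n)

def fibonacci_schedule(n, base=1e-6):
    """Fibonacci-style schedule: each beta is the sum of the previous two."""
    betas = [base, 2 * base]
    while len(betas) < n:
        betas.append(betas[-1] + betas[-2])
    return np.array(betas[:n])

print(linear_schedule(6))
print(fibonacci_schedule(6))
```

Both constructions yield a monotonically increasing schedule, which is the property the grid search for the smallest iteration counts also preserves.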

Results  Tab. 1 presents the PESQ and STOI measurements for the LJ dataset (Ito and Johnson, 2017). As can be seen, the proposed Gamma denoising diffusion model outperforms the WaveGrad baseline for all numbers of iterations, in both PESQ and STOI.

Table 1: PESQ and STOI metrics for the LJ dataset for various WaveGrad-like models.
                                      PESQ                          STOI
Model \ Iterations            6      25     100    1000     6      25     100    1000
WaveGrad (Chen et al., 2020)  2.78   3.194  3.211  3.290    0.924  0.957  0.958  0.959
DDGM (ours)                   3.07   3.208  3.214  3.308    0.948  0.972  0.969  0.969

Table 2: FID score comparison for the CelebA (64x64) dataset. Lower is better.
Model \ Iterations                        10      20      50      100     1000
DDPM (Ho et al., 2020)                    299.71  183.83  71.71   45.2    3.26
DDGM - Gamma Distribution DDPM (ours)     35.59   28.24   20.24   14.22   4.09
DDIM (Song et al., 2020a)                 17.33   13.73   9.17    6.53    3.51
DDGM - Gamma Distribution DDIM (ours)     11.64   6.83    4.28    3.17    2.92

Table 3: FID score comparison for the LSUN Church (256x256) dataset. Lower is better.
Model \ Iterations                        10      20      50      100
DDPM (Ho et al., 2020)                    51.56   23.37   11.16   8.27
DDGM - Gamma Distribution DDPM (ours)     28.56   19.68   10.53   7.87
DDIM (Song et al., 2020a)                 19.45   12.47   10.84   10.58
DDGM - Gamma Distribution DDIM (ours)     18.11   11.32   10.31   8.75

4.2 Image Generation

Our model is based on the DDIM implementation available in (Jiaming Song and Ermon, 2020) (under the MIT license). We trained our model on two image datasets: (i) CelebA 64x64 (Liu et al., 2015) and (ii) LSUN Church 256x256 (Yu et al., 2015). The Fréchet Inception Distance (FID) (Heusel et al., 2017) is used as the benchmark metric. For all experiments, similarly to previous work (Song et al., 2020a), we compute the FID score over the generated images using the torch-fidelity implementation (Obukhov et al., 2020). Similarly to (Song et al., 2020a), the training noise schedule is linear, with values ranging from $10^{-4}$ to $0.02$. For the other hyperparameters (e.g., learning rate, batch size, etc.) we use the same values that appear in DDPM (Ho et al., 2020). We use eight Nvidia Volta V100 GPUs to train our models. The scale parameter $\theta_0$ of the Gamma distribution is set to a fixed value.

Results  We test our models with the inference procedures from DDPM (Ho et al., 2020) and DDIM (Song et al., 2020a). In Tab. 2 we provide the FID score for the CelebA (64x64) dataset (Liu et al., 2015) (under a non-commercial research purposes license). As can be seen, for the DDPM inference procedure with few steps, the best results are obtained by the Gamma model, which improves the FID by 264.12 points for ten iterations and by 155.59 points for twenty iterations. For 1000 iterations, the best result is obtained by the DDPM model; nevertheless, our Gamma model obtains a result within 0.83 FID of the DDPM. For the DDIM procedure, the best results are obtained with the Gamma model for all numbers of iterations. Fig. 2 presents samples generated by the models. Our models provide better quality images when compared to the DDPM and DDIM methods.

In Tab. 3 we provide the FID score for the LSUN Church dataset (Yu et al., 2015). As can be seen, the Gamma model improves results over the baselines for all numbers of iterations.

Figure 2: Typical examples of images generated with the same number of iterations, for models trained with different noise distributions: (i) first row, Gaussian noise; (ii) second row, Gamma noise. All models start from the same noise instance.

5 Conclusions

We present a novel Gamma diffusion model. The model employs a Gamma noise distribution. A key enabler for using this distribution is a closed-form formulation (Eq. 10) of the multi-step noising process, which allows for efficient training. We also present the reverse process and the variational lower bound for the Gamma diffusion model. The proposed model improves the quality of the generated images and audio, as well as the speed of generation, in comparison to conventional, Gaussian-based diffusion processes.


This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant ERC CoG 725974). The contribution of Eliya Nachmani is part of a Ph.D. thesis research conducted at Tel Aviv University.


  • J. Austin, D. Johnson, J. Ho, D. Tarlow, and R. v. d. Berg (2021) Structured denoising diffusion models in discrete state-spaces. arXiv preprint arXiv:2107.03006. Cited by: §2.
  • M. Bińkowski, J. Donahue, S. Dieleman, A. Clark, E. Elsen, N. Casagrande, L. C. Cobo, and K. Simonyan (2019) High fidelity speech synthesis with adversarial networks. arXiv preprint arXiv:1909.11646. Cited by: §2.
  • A. Brock, J. Donahue, and K. Simonyan (2018) Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096. Cited by: §2.
  • N. Chen, Y. Zhang, H. Zen, R. J. Weiss, M. Norouzi, and W. Chan (2020) WaveGrad: estimating gradients for waveform generation. arXiv preprint arXiv:2009.00713. Cited by: §1, §2, §2, §4.1, §4.1, Table 3.
  • J. Choi, S. Kim, Y. Jeong, Y. Gwon, and S. Yoon (2021) ILVR: conditioning method for denoising diffusion probabilistic models. arXiv preprint arXiv:2108.02938. Cited by: §2.
  • P. Dhariwal and A. Nichol (2021) Diffusion models beat gans on image synthesis. arXiv preprint arXiv:2105.05233. Cited by: §2.
  • C. Donahue, J. McAuley, and M. Puckette (2018) Adversarial audio synthesis. arXiv preprint arXiv:1802.04208. Cited by: §1.
  • R. Gao, Y. Song, B. Poole, Y. N. Wu, and D. P. Kingma (2020) Learning energy-based models by diffusion recovery likelihood. arXiv preprint arXiv:2012.08125. Cited by: §2.
  • I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial networks. arXiv preprint arXiv:1406.2661. Cited by: §1.
  • M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) Gans trained by a two time-scale update rule converge to a local nash equilibrium. arXiv preprint arXiv:1706.08500. Cited by: §4.2.
  • J. Ho, A. Jain, and P. Abbeel (2020) Denoising diffusion probabilistic models. arXiv preprint arXiv:2006.11239. Cited by: §1, §2, §3.1, §3.1, §3.2.3, §3.2.3, §3.2.3, §4.2, §4.2, Table 3.
  • E. Hoogeboom, D. Nielsen, P. Jaini, P. Forré, and M. Welling (2021) Argmax flows and multinomial diffusion: towards non-autoregressive language models. arXiv preprint arXiv:2102.05379. Cited by: §2.
  • C. Huang, J. H. Lim, and A. Courville (2021) A variational perspective on diffusion-based generative models and score matching. arXiv preprint arXiv:2106.02808. Cited by: §2.
  • A. Hyvärinen and P. Dayan (2005) Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research 6 (4). Cited by: §2.
  • K. Ito and L. Johnson (2017) The lj speech dataset. Note: Cited by: §4.1.
  • C. M. Jiaming Song and S. Ermon (2020) Denoising diffusion implicit models. GitHub. Note: Cited by: §4.2.
  • N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. Oord, S. Dieleman, and K. Kavukcuoglu (2018) Efficient neural audio synthesis. In International Conference on Machine Learning, pp. 2410–2419. Cited by: §2.
  • T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila (2020) Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8110–8119. Cited by: §1, §2.
  • D. P. Kingma and P. Dhariwal (2018) Glow: generative flow with invertible 1x1 convolutions. arXiv preprint arXiv:1807.03039. Cited by: §1.
  • D. P. Kingma, T. Salimans, B. Poole, and J. Ho (2021) Variational diffusion models. arXiv preprint arXiv:2107.00630. Cited by: §2.
  • D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §1.
  • Z. Kong, W. Ping, J. Huang, K. Zhao, and B. Catanzaro (2020a) Diffwave: a versatile diffusion model for audio synthesis. arXiv preprint arXiv:2009.09761. Cited by: §1.
  • Z. Kong, W. Ping, J. Huang, K. Zhao, and B. Catanzaro (2020b) Diffwave: a versatile diffusion model for audio synthesis. arXiv preprint arXiv:2009.09761. Cited by: §2, §2.
  • Z. Kong and W. Ping (2021) On fast sampling of diffusion probabilistic models. arXiv preprint arXiv:2106.00132. Cited by: §2.
  • M. W. Lam, J. Wang, R. Huang, D. Su, and D. Yu (2021) Bilateral denoising diffusion models. arXiv preprint arXiv:2108.11514. Cited by: §2.
  • L. M. Leemis and J. T. McQueston (2008) Univariate distribution relationships. The American Statistician 62 (1), pp. 45–53. Cited by: §1.
  • Q. Liu, J. Lee, and M. Jordan (2016) A kernelized stein discrepancy for goodness-of-fit tests. In International conference on machine learning, pp. 276–284. Cited by: §2.
  • Z. Liu, P. Luo, X. Wang, and X. Tang (2015) Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), Cited by: §1, §4.2, §4.2.
  • A. Obukhov, M. Seitzer, P. Wu, S. Zhydenko, J. Kyl, and E. Y. Lin (2020) High-fidelity performance metrics for generative models in pytorch. Zenodo. Note: Version 0.2.0, DOI: 10.5281/zenodo.3786540. Cited by: §4.2.
  • A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu (2016) Wavenet: a generative model for raw audio. arXiv preprint arXiv:1609.03499. Cited by: §1, §2.
  • V. Popov, I. Vovk, V. Gogoryan, T. Sadekova, and M. Kudinov (2021) Grad-tts: a diffusion probabilistic model for text-to-speech. arXiv preprint arXiv:2105.06337. Cited by: §2.
  • A. Razavi, A. v. d. Oord, and O. Vinyals (2019) Generating diverse high-fidelity images with vq-vae-2. arXiv preprint arXiv:1906.00446. Cited by: §1.
  • A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra (2001) Perceptual evaluation of speech quality (pesq)-a new method for speech quality assessment of telephone networks and codecs. In 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221), Vol. 2, pp. 749–752 vol.2. External Links: Document Cited by: §4.1.
  • J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli (2015) Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pp. 2256–2265. Cited by: §1, §2.
  • J. Song, C. Meng, and S. Ermon (2020a) Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502. Cited by: §1, §1, §2, §3.1, §3.2.1, §4.2, §4.2, Table 3.
  • Y. Song and S. Ermon (2019) Generative modeling by estimating gradients of the data distribution. arXiv preprint arXiv:1907.05600. Cited by: §2.
  • Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole (2020b) Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456. Cited by: §2.
  • C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen (2011) An algorithm for intelligibility prediction of time–frequency weighted noisy speech. IEEE Transactions on Audio, Speech, and Language Processing 19 (7), pp. 2125–2136. External Links: Document Cited by: §4.1.
  • A. Vahdat and J. Kautz (2020) NVAE: a deep hierarchical variational autoencoder. arXiv preprint arXiv:2007.03898. Cited by: §1.
  • I. Vovk (2020) WaveGrad. GitHub. Note: Cited by: §4.1.
  • D. Watson, J. Ho, M. Norouzi, and W. Chan (2021) Learning to efficiently sample from diffusion probabilistic models. arXiv preprint arXiv:2106.03802. Cited by: §2.
  • F. Yu, Y. Zhang, S. Song, A. Seff, and J. Xiao (2015) LSUN: construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365. Cited by: §4.2, §4.2.

Appendix A Proofs

A.1 Proof of Lemma 1

Proof.


The first part of Eq. 11 is immediate:

$\mathbb{E}[x_t] = \sqrt{1-\beta_t}\, \mathbb{E}[x_{t-1}] + \mathbb{E}\!\left[g_t - \mathbb{E}[g_t]\right] = 0$

The variance part is also straightforward:

$\mathrm{Var}(x_t) = (1-\beta_t)\, \mathrm{Var}(x_{t-1}) + k_t \theta_t^2 = (1-\beta_t) + \beta_t = 1$

Eq. 12 is proved by induction on $t$. For $t = 1$: since $\bar\alpha_1 = \alpha_1$, we have $k_1 = \beta_1/(\bar\alpha_1 \theta_0^2) = \bar{k}_1$. We also have that $\theta_1 = \sqrt{\bar\alpha_1}\, \theta_0$. Thus we have:

$x_1 = \sqrt{1-\beta_1}\, x_0 + (g_1 - k_1 \theta_1) = \sqrt{\bar\alpha_1}\, x_0 + (\bar{g}_1 - \bar{k}_1 \theta_1)$

Assume Eq. 12 holds for some $t$. The next iteration is obtained as:

$x_{t+1} = \sqrt{1-\beta_{t+1}}\, x_t + (g_{t+1} - k_{t+1} \theta_{t+1}) = \sqrt{\bar\alpha_{t+1}}\, x_0 + \sqrt{1-\beta_{t+1}}\, (\bar{g}_t - \bar{k}_t \theta_t) + (g_{t+1} - k_{t+1} \theta_{t+1})$

It remains to be proven that (i) $\sqrt{1-\beta_{t+1}}\, \bar{g}_t + g_{t+1} \sim \Gamma(\bar{k}_{t+1}, \theta_{t+1})$ and (ii) $\sqrt{1-\beta_{t+1}}\, \bar{k}_t \theta_t + k_{t+1} \theta_{t+1} = \bar{k}_{t+1} \theta_{t+1}$. Since $\sqrt{1-\beta_{t+1}}\, \theta_t = \sqrt{\alpha_{t+1}} \sqrt{\bar\alpha_t}\, \theta_0 = \theta_{t+1}$ holds, then:

$\sqrt{1-\beta_{t+1}}\, \bar{g}_t \sim \Gamma(\bar{k}_t, \theta_{t+1})$

Therefore, we prove (i):

$\sqrt{1-\beta_{t+1}}\, \bar{g}_t + g_{t+1} \sim \Gamma(\bar{k}_t + k_{t+1}, \theta_{t+1}) = \Gamma(\bar{k}_{t+1}, \theta_{t+1})$

which implies that $\sqrt{1-\beta_{t+1}}\, \bar{g}_t + g_{t+1}$ and $\bar{g}_{t+1}$ have the same probability distribution.

Furthermore, by the linearity of the expectation, one can obtain (ii):

$\sqrt{1-\beta_{t+1}}\, \bar{k}_t \theta_t + k_{t+1} \theta_{t+1} = \bar{k}_t \theta_{t+1} + k_{t+1} \theta_{t+1} = \bar{k}_{t+1} \theta_{t+1}$

Thus, we have:

$x_{t+1} = \sqrt{\bar\alpha_{t+1}}\, x_0 + (\bar{g}_{t+1} - \bar{k}_{t+1} \theta_{t+1})$

which ends the proof by induction. ∎

A.2 Proof of Lemma 2

Proof.


From Eq. 27, the variational lower bound of DDGM is given by $L_{t-1}$. Substituting Eq. 24 and Eq. 10 into the variational lower bound, we have: