1 Introduction
Deep generative neural networks have shown significant progress over the last years. The main architectures for generation are: (i) VAE-based (Kingma and Welling, 2013), for example, NVAE (Vahdat and Kautz, 2020) and VQ-VAE (Razavi et al., 2019); (ii) GAN-based (Goodfellow et al., 2014), for example, StyleGAN (Karras et al., 2020) for vision applications and WaveGAN (Donahue et al., 2018) for speech; (iii) flow-based, for example, Glow (Kingma and Dhariwal, 2018); (iv) autoregressive, for example, WaveNet for speech (Oord et al., 2016); and (v) Diffusion Probabilistic Models (Sohl-Dickstein et al., 2015), for example, Denoising Diffusion Probabilistic Models (DDPM) (Ho et al., 2020) and their implicit version, DDIM (Song et al., 2020a). Models from this last family have shown significant progress in generation capabilities in recent years, e.g., (Chen et al., 2020; Kong et al., 2020a), and have achieved results comparable to state-of-the-art generation architectures for both images and speech.
A DDPM is a Markov chain of latent variables. Two processes are modeled: (i) a diffusion process and (ii) a denoising process. During training, the diffusion process learns to transform data samples into Gaussian noise. Denoising is the reverse process, and it is used during inference for generating data samples, starting from Gaussian noise. The second process can be conditioned on attributes to control the generated sample. To obtain high-quality synthesis, a large number of denoising steps is used (e.g., 1000 steps). A notable property of the diffusion process is a closed-form formulation of the noise that arises from accumulating diffusion steps. This allows sampling arbitrary states in the Markov chain of the diffusion process without calculating the previous steps. In the Gaussian case, this property stems from the fact that the sum of independent Gaussian random variables is again Gaussian. Other distributions have similar properties. For example, for the Gamma distribution, the sum of two independent Gamma random variables that share the same scale parameter follows a Gamma distribution with that scale. The Poisson distribution has a similar property. However, its discrete nature makes it less suitable for DDPM.
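The Gamma additivity property that enables the closed form can be checked numerically. A minimal NumPy sketch (the specific shape and scale values are illustrative assumptions):

```python
import numpy as np

# Numerical check of the additivity property used by the diffusion process:
# if g1 ~ Gamma(k1, theta) and g2 ~ Gamma(k2, theta) are independent and
# share the scale theta, then g1 + g2 ~ Gamma(k1 + k2, theta).
rng = np.random.default_rng(0)
k1, k2, theta = 2.0, 3.0, 0.5
n = 200_000

g_sum = rng.gamma(k1, theta, n) + rng.gamma(k2, theta, n)
g_direct = rng.gamma(k1 + k2, theta, n)

# Gamma(k, theta) has mean k * theta and variance k * theta^2, so the two
# samples should agree in their first two moments.
mean_err = abs(g_sum.mean() - g_direct.mean())
var_err = abs(g_sum.var() - g_direct.var())
```

Matching the first two moments does not prove equality in distribution by itself, but for the Gamma family the additivity result is exact; the moment check is simply a cheap sanity test.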
In DDPM, the mean of the distribution is set at zero. The Gamma distribution, with its two parameters (shape and scale), is better suited to fit the data than a Gaussian distribution with one degree of freedom (scale). Furthermore, the Gamma distribution generalizes other distributions, and many other distributions can be derived from it (Leemis and McQueston, 2008).
The added modeling capacity of the Gamma distribution can help speed up the convergence of the DDPM model. Consider, for example, a conventional DDPM model that was trained with Gaussian noise on the CelebA dataset (Liu et al., 2015).
The noise distribution throughout the diffusion process can be visualized by computing the histogram of the estimated residual noise in the generation process. The estimated residual noise is given by $\hat{\epsilon}_t = (x_t - \sqrt{\bar{\alpha}_t}\, x_0)/\sqrt{1-\bar{\alpha}_t}$, where $\bar{\alpha}_t$ is the noise schedule, $x_0$ is the data point, and $x_t$ is the estimated state at timestep $t$, as can be derived from Eq. 4 of (Song et al., 2020a). Both a Gaussian distribution and a Gamma distribution can then be fitted to this histogram, as shown in Fig. 1(a,b). As can be seen, the Gamma distribution provides a better fit to the estimated residual noise $\hat{\epsilon}_t$. Moreover, Fig. 1(c) presents the mean fitting error between the histogram and the fitted probability density function. Evidently, the Gamma distribution is a better fit than the Gaussian distribution.
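The fitting comparison can be sketched as follows. The skewed "residual noise" is simulated here as a zero-mean Gamma sample, and both fits use moment matching; these choices are assumptions for illustration, not the paper's exact protocol:

```python
import math
import numpy as np

# Compare how well a Gaussian and a Gamma density fit a skewed,
# zero-mean noise histogram (MSE between histogram and fitted pdf).
rng = np.random.default_rng(0)
k_true, theta_true = 4.0, 0.5
noise = rng.gamma(k_true, theta_true, 100_000) - k_true * theta_true  # zero mean

hist, edges = np.histogram(noise, bins=100, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])

# Gaussian fit: match mean and variance.
mu, sigma = noise.mean(), noise.std()
gauss_pdf = np.exp(-0.5 * ((centers - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Gamma fit by moment matching: skewness of Gamma(k, theta) is 2 / sqrt(k),
# variance is k * theta^2, and the zero-mean shift is k * theta.
var = noise.var()
skew = ((noise - mu) ** 3).mean() / var ** 1.5
k = 4.0 / skew ** 2
theta = math.sqrt(var / k)
shift = k * theta
x = centers + shift            # evaluate the Gamma pdf on the unshifted axis
log_pdf = (k - 1) * np.log(np.maximum(x, 1e-12)) - x / theta \
          - k * math.log(theta) - math.lgamma(k)
gamma_pdf = np.where(x > 0, np.exp(log_pdf), 0.0)

mse_gauss = ((hist - gauss_pdf) ** 2).mean()
mse_gamma = ((hist - gamma_pdf) ** 2).mean()
```

On this synthetic skewed sample, the two-parameter Gamma fit tracks the histogram more closely than the Gaussian fit, mirroring the qualitative claim of Fig. 1.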
In this paper, we investigate the non-Gaussian Gamma noise distribution. The proposed models maintain the property of the diffusion process of sampling arbitrary states without calculating the previous steps. Our results are demonstrated in two major domains: vision and audio. In the first domain, the proposed method is shown to provide a better FID score for generated images. For speech data, we show that the proposed method improves various measures, such as Perceptual Evaluation of Speech Quality (PESQ) and short-time objective intelligibility (STOI).
Figure 1: (a) Fitting a Gaussian distribution to the histogram of the estimated residual noise. (b) Fitting a Gamma distribution. (c) The fitting error to the Gaussian and Gamma distributions, measured as the MSE between the histogram and the fitted probability density function. Each point is the average value over the generated images; the vertical error bars denote the standard deviation.
2 Related Work
In their seminal work, Sohl-Dickstein et al. (2015) introduce the Diffusion Probabilistic Model. This model is applied to various domains, such as time series and images. The main drawback of the proposed model is that it needs up to thousands of iterative steps to generate a valid data sample. Song and Ermon (2019) proposed a diffusion generative model based on Langevin dynamics and the score matching method (Hyvärinen and Dayan, 2005). The model estimates the Stein score function (Liu et al., 2016), which is the gradient of the logarithm of the data density. Given the Stein score function, the model can generate data points.
Denoising Diffusion Probabilistic Models (DDPM) (Ho et al., 2020) combine generative models based on score matching and neural Diffusion Probabilistic Models into a single model. Similarly, in Chen et al. (2020); Kong et al. (2020b), a generative neural diffusion process based on score matching was applied to speech generation. These models achieve state-of-the-art results for speech generation and show superior results over well-established methods, such as WaveRNN (Kalchbrenner et al., 2018), WaveNet (Oord et al., 2016), and GAN-TTS (Bińkowski et al., 2019).
Denoising Diffusion Implicit Models (DDIM) offer a way to accelerate the denoising process (Song et al., 2020a). The model employs a non-Markovian diffusion process to generate higher quality samples, and helps reduce the number of diffusion steps, e.g., from a thousand steps to a few hundred.
Dhariwal and Nichol (2021) find a better diffusion architecture through a series of exploratory experiments, leading to the Ablated Diffusion Model (ADM). This model outperforms the state of the art in image synthesis, which was previously held by GAN-based models, such as BigGAN-deep (Brock et al., 2018) and StyleGAN2 (Karras et al., 2020). ADM is further improved using a novel Cascaded Diffusion Model (CDM). Our contribution is fundamental and can be incorporated into the proposed ADM and CDM architectures.
Watson et al. (2021) proposed an efficient method for sampling from diffusion probabilistic models using a dynamic programming algorithm that finds the optimal discrete time schedules. Choi et al. (2021) introduce the Iterative Latent Variable Refinement (ILVR) method for guiding the generative process in DDPM. Moreover, Kong and Ping (2021) systematically investigate fast sampling methods for denoising diffusion models. Lam et al. (2021) propose bilateral denoising diffusion models (BDDM), which take significantly fewer steps to generate high-quality samples.
Huang et al. (2021) derive a variational framework for estimating the marginal likelihood of continuous-time diffusion models. Moreover, Kingma et al. (2021) show the equivalence between various diffusion processes by using a simplification of the variational lower bound (VLB).
Song et al. (2020b) show that score-based generative models can be considered a solution to a stochastic differential equation. Gao et al. (2020) provide an alternative approach for training an energy-based generative model using a diffusion process.
Another line of work in audio is that of neural vocoders based on a denoising diffusion process. WaveGrad (Chen et al., 2020) and DiffWave (Kong et al., 2020b) are conditioned on the mel-spectrogram and produce high-fidelity audio samples, using as few as six steps of the diffusion process. These models outperform adversarial non-autoregressive baselines. Popov et al. (2021) propose a text-to-speech diffusion-based model, which allows generating speech with the flexibility of controlling the trade-off between sound quality and inference speed.
Diffusion models were also applied to natural language processing tasks.
Hoogeboom et al. (2021) proposed a multinomial diffusion process for categorical data and applied it to language modeling. Austin et al. (2021) generalize the multinomial diffusion process with Discrete Denoising Diffusion Probabilistic Models (D3PMs) and improve the generated results for the text8 and One Billion Word (LM1B) datasets.
3 Diffusion models for Gamma Distribution
We start by recapitulating the Gaussian case, after which we derive diffusion models for the Gamma distribution.
3.1 Background: Gaussian DDPM
Diffusion networks learn the gradients of the data log density:

$s_\theta(x) \approx \nabla_x \log p(x)$   (1)
By using Langevin dynamics and the gradients of the data log density $\nabla_x \log p(x)$, a sampling procedure from the probability $p(x)$ is given by:

$x_{i+1} = x_i + \frac{\eta}{2}\nabla_x \log p(x_i) + \sqrt{\eta}\, z_i$   (2)

where $z_i \sim \mathcal{N}(0, I)$ and $\eta$ is the step size.
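The Langevin update of Eq. 2 can be sketched for a case where the score is known in closed form. With the target density $p = \mathcal{N}(0, 1)$, the score is simply $-x$ (the step size and iteration count below are illustrative assumptions):

```python
import numpy as np

# Langevin dynamics sketch: with the known score of a standard Gaussian,
# s(x) = -x, iterating
#   x <- x + (eta / 2) * s(x) + sqrt(eta) * z,   z ~ N(0, I)
# drives arbitrary starting points toward samples from N(0, 1).
rng = np.random.default_rng(0)
eta = 0.01                      # step size
x = np.full(50_000, 5.0)        # 50k independent chains, started far from the mode

for _ in range(2000):
    score = -x                  # grad_x log p(x) for p = N(0, 1)
    x = x + 0.5 * eta * score + np.sqrt(eta) * rng.standard_normal(x.shape)
```

After enough iterations, the empirical mean and variance of the chains approach 0 and 1; in a diffusion model, the analytic score is replaced by a learned network.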
The diffusion process in DDPM (Ho et al., 2020) is defined by a Markov chain that gradually adds Gaussian noise to the data according to a noise schedule. The diffusion process is defined by:
$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1})$   (3)

where $T$ is the length of the diffusion process, and $x_1, \dots, x_T$ is a sequence of latent variables with the same size as the clean sample $x_0$.
The diffusion process is parameterized by a set of parameters called the noise schedule ($\beta_1, \dots, \beta_T$), which defines the variance of the noise added at each step:

$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right)$   (4)
Since a Gaussian noise random variable is used at each step, the diffusion process can be simulated for any number of steps with the closed-form formula:

$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$   (5)

where $\epsilon \sim \mathcal{N}(0, I)$, $\alpha_t = 1-\beta_t$, and $\bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s$.
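The equivalence between simulating the chain step by step (Eq. 4) and jumping directly to step $t$ (Eq. 5) can be checked numerically (the linear schedule and its endpoints are standard illustrative choices):

```python
import numpy as np

# Compare iterative diffusion against the closed-form jump to step t.
rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)          # linear noise schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

x0 = np.ones(100_000)                        # a batch of scalar "data points"
t = 500

# Iterative: x_t = sqrt(1 - beta_t) x_{t-1} + sqrt(beta_t) eps_t
x = x0.copy()
for s in range(t):
    x = np.sqrt(1 - betas[s]) * x + np.sqrt(betas[s]) * rng.standard_normal(x.shape)

# Closed form: x_t = sqrt(alpha_bar_t) x0 + sqrt(1 - alpha_bar_t) eps
x_direct = np.sqrt(alpha_bar[t - 1]) * x0 \
           + np.sqrt(1 - alpha_bar[t - 1]) * rng.standard_normal(x0.shape)
```

The two samples agree in mean and variance, which is exactly what makes training efficient: any timestep can be reached in one shot.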
Diffusion models are a class of generative neural networks of the form $p_\theta(x_{0:T})$ that learn to reverse the diffusion process. One can write:

$p_\theta(x_{0:T}) = p(x_T)\prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)$   (6)
As described in (Ho et al., 2020), one can learn to predict the noise present in the data with a network $\epsilon_\theta(x_t, t)$ and sample $x_{t-1}$ from $p_\theta(x_{t-1} \mid x_t)$ using the following formula:

$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right) + \sigma_t z$   (7)

where $z \sim \mathcal{N}(0, I)$ and $\sigma_t^2 = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\beta_t$.
The training procedure of $\epsilon_\theta$ is defined in Alg. 1. Given the input dataset $d$, the algorithm samples $x_0 \sim d$, $t \sim \mathrm{Uniform}(\{1, \dots, T\})$, and $\epsilon \sim \mathcal{N}(0, I)$. The noisy latent state $x_t$ is calculated and fed to the DDPM neural network $\epsilon_\theta$. A gradient descent step is taken in order to estimate the noise $\epsilon$ with the DDPM network $\epsilon_\theta(x_t, t)$.
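The training loop above can be sketched end to end. The "network" here is a deliberately tiny linear predictor with a hand-derived gradient, which is an assumption for illustration only; a real implementation would use a neural network and automatic differentiation:

```python
import numpy as np

# One Gaussian DDPM training step (sample t, sample noise, form x_t,
# take a gradient step on the noise-prediction objective).
rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def eps_model(x_t, t, w):
    # toy linear predictor: eps_hat = w * x_t (t is unused by this toy model)
    return w * x_t

def training_step(x0, w, lr=1e-3):
    t = rng.integers(1, T + 1)                     # sample a timestep
    eps = rng.standard_normal(x0.shape)            # sample Gaussian noise
    x_t = np.sqrt(alpha_bar[t - 1]) * x0 + np.sqrt(1 - alpha_bar[t - 1]) * eps
    eps_hat = eps_model(x_t, t, w)
    loss = np.mean((eps - eps_hat) ** 2)           # simplified DDPM objective
    grad_w = np.mean(2 * (eps_hat - eps) * x_t)    # d loss / d w for the toy model
    return w - lr * grad_w, loss

w, losses = 0.0, []
x0 = rng.standard_normal(64)                       # a fixed toy "dataset" batch
for _ in range(500):
    w, loss = training_step(w=w, x0=x0)
    losses.append(loss)
```

The structure (sample $t$, sample noise, form $x_t$ in closed form, regress the noise) is the essential content of Alg. 1, independent of the model class.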
3.2 Denoising Diffusion Gamma Models (DDGM)
We expand the framework of diffusion generative processes by incorporating a new noise distribution, namely the Gamma distribution. We call this new type of model Denoising Diffusion Gamma Models. First, we define the Gamma diffusion process; then we present a way to sample from this process; and finally we show how to train these models by computing the variational lower bound and deriving a novel loss function from it.
3.2.1 The Gamma Model
In the Gaussian case, the diffusion equation (Eq. 4) can be written as:

$x_t = \sqrt{1-\beta_t}\, x_{t-1} + \sqrt{\beta_t}\,\epsilon_t$   (8)

where $\epsilon_t \sim \mathcal{N}(0, I)$ is the Gaussian noise of step $t$. One can denote by $\Gamma(k, \theta)$ the Gamma distribution, where $k$ and $\theta$ are the shape and the scale, respectively. We modify Eq. 8 by adding, during the diffusion process, noise that follows a Gamma distribution:
$x_t = \sqrt{1-\beta_t}\, x_{t-1} + (g_t - k_t\theta_t)$   (9)

where $g_t \sim \Gamma(k_t, \theta_t)$, $\theta_t = \sqrt{\bar{\alpha}_t}\,\theta_0$, and $k_t = \frac{\beta_t}{\bar{\alpha}_t\theta_0^2}$. Note that the noise schedule $\beta_t$ and the initial scale $\theta_0$ are hyperparameters.
Since the sum of independent Gamma random variables with the same scale parameter is Gamma-distributed, one can derive a closed form for $x_t$, i.e., an equation to calculate $x_t$ directly from $x_0$:

$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + (\bar{g}_t - \bar{k}_t\theta_t)$   (10)

where $\bar{g}_t \sim \Gamma(\bar{k}_t, \theta_t)$ and $\bar{k}_t = \sum_{i=1}^{t} k_i$.
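The closed form of Eq. 10 can be checked numerically. The sketch below assumes the parameterization $\theta_t = \sqrt{\bar{\alpha}_t}\,\theta_0$ and $k_t = \beta_t / (\bar{\alpha}_t \theta_0^2)$, chosen so that the accumulated variance matches the Gaussian case, $1 - \bar{\alpha}_t$ (the schedule and $\theta_0$ value are illustrative assumptions):

```python
import numpy as np

# Sample x_t in one shot with Gamma noise and verify that the first two
# moments match the Gaussian diffusion: mean sqrt(alpha_bar_t) * x0 and
# variance 1 - alpha_bar_t.
rng = np.random.default_rng(0)
T = 1000
theta0 = 0.001
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

k = betas / (alpha_bar * theta0 ** 2)        # per-step shapes k_t
k_bar = np.cumsum(k)                          # accumulated shape k_bar_t
theta = np.sqrt(alpha_bar) * theta0           # scale theta_t

t = 500
x0 = np.ones(100_000)
g = rng.gamma(k_bar[t - 1], theta[t - 1], x0.shape)
x_t = np.sqrt(alpha_bar[t - 1]) * x0 + (g - k_bar[t - 1] * theta[t - 1])

mean_err = abs(x_t.mean() - np.sqrt(alpha_bar[t - 1]))
var_err = abs(x_t.var() - (1 - alpha_bar[t - 1]))
```

Note that $\bar{k}_t\theta_t^2 = 1 - \bar{\alpha}_t$ under this parameterization, so the Gamma chain reproduces the Gaussian chain's mean and variance while keeping a skewed, heavier-tailed shape controlled by $\theta_0$.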
Lemma 1.
Let $x_t = \sqrt{1-\beta_t}\, x_{t-1} + (g_t - k_t\theta_t)$ with $g_t \sim \Gamma(k_t, \theta_t)$. Assume $\theta_t = \sqrt{\bar{\alpha}_t}\,\theta_0$, $k_t = \frac{\beta_t}{\bar{\alpha}_t\theta_0^2}$, $\bar{k}_t = \sum_{i=1}^{t} k_i$, and $\bar{g}_t \sim \Gamma(\bar{k}_t, \theta_t)$. Then the following hold:

$\mathbb{E}[x_t] = \sqrt{\bar{\alpha}_t}\, x_0, \qquad \mathrm{Var}(x_t) = 1-\bar{\alpha}_t$   (11)

$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + (\bar{g}_t - \bar{k}_t\theta_t)$   (12)
Similarly to Eq. 7, by using Langevin dynamics, the inference is given by:

$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right) + \sigma_t \frac{\bar{g}_t - \bar{k}_t\theta_t}{\sqrt{1-\bar{\alpha}_t}}$   (13)

where $\bar{g}_t \sim \Gamma(\bar{k}_t, \theta_t)$.
In Algorithm 3 we describe the training procedure. As input we have: (i) the initial scale $\theta_0$, (ii) the dataset $d$, (iii) the maximum number of steps $T$ in the diffusion process, and (iv) the noise schedule $\beta_t$. The training algorithm samples: (i) an example $x_0 \sim d$, (ii) a step number $t$, and (iii) Gamma noise $\bar{g}_t$. Then it calculates $x_t$ from $x_0$ by using Eq. 10. The neural network takes $x_t$ as input and is conditioned on the timestep $t$. Next, a gradient descent step is taken to approximate the normalized noise with the neural network $\epsilon_\theta(x_t, t)$. The main changes between Algorithm 3 and the Gaussian case (i.e., Alg. 1) are the following: (i) the calculation of the Gamma parameters, (ii) the update equation, and (iii) the gradient update equation.
The inference procedure is given in Algorithm 4. It starts from zero-mean normalized noise sampled from the Gamma distribution. Next, for $T$ steps, the algorithm estimates $x_{t-1}$ from $x_t$ by using Eq. 13. Note that, as in (Song et al., 2020a), $\sigma_t^2 = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\beta_t$. Algorithm 4 replaces the Gaussian version (i.e., Alg. 2) with the following changes: (i) the starting sampling point $x_T$, (ii) the sampling noise, and (iii) the update equation.
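The inference loop can be sketched as follows. The noise predictor below is a placeholder returning zeros, and the Gamma parameterization ($\theta_t = \sqrt{\bar{\alpha}_t}\,\theta_0$, $k_t = \beta_t/(\bar{\alpha}_t\theta_0^2)$), schedule, and constants are assumptions for illustration; a real sampler would plug in the trained network:

```python
import numpy as np

# Gamma-noise sampling loop: start from normalized zero-mean Gamma noise,
# then repeatedly denoise, re-injecting zero-mean unit-variance Gamma noise.
rng = np.random.default_rng(0)
T = 50
theta0 = 0.001
betas = np.linspace(1e-4, 0.05, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)
k_bar = np.cumsum(betas / (alpha_bar * theta0 ** 2))
theta = np.sqrt(alpha_bar) * theta0

def eps_model(x_t, t):
    return np.zeros_like(x_t)     # placeholder: a trained network goes here

# Start from normalized zero-mean Gamma noise instead of N(0, I).
g_T = rng.gamma(k_bar[-1], theta[-1], size=16)
x = (g_T - k_bar[-1] * theta[-1]) / np.sqrt(1 - alpha_bar[-1])

for t in range(T - 1, -1, -1):
    coef = betas[t] / np.sqrt(1 - alpha_bar[t])
    mean = (x - coef * eps_model(x, t)) / np.sqrt(alphas[t])
    if t > 0:
        sigma = np.sqrt(betas[t] * (1 - alpha_bar[t - 1]) / (1 - alpha_bar[t]))
        g = rng.gamma(k_bar[t - 1], theta[t - 1], size=x.shape)
        z = (g - k_bar[t - 1] * theta[t - 1]) / np.sqrt(1 - alpha_bar[t - 1])
        x = mean + sigma * z      # Gamma noise replaces the Gaussian z
    else:
        x = mean                  # no noise is added at the final step
```

Structurally, this is the Gaussian sampler of Alg. 2 with both the starting point and the per-step noise swapped for normalized Gamma variables.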
3.2.2 The Reverse Process for DDGM
The reverse process defines the underlying generation process. Therefore, in this section, we will obtain the reverse process for the Gamma denoising diffusion model. Furthermore, we will use the reverse process to obtain the variational lower bound and the appropriate loss function for the Gamma distribution denoising diffusion model.
The reverse process is given by Bayes' rule:

$q(x_{t-1} \mid x_t, x_0) = \frac{q(x_t \mid x_{t-1})\, q(x_{t-1} \mid x_0)}{q(x_t \mid x_0)}$   (14)

Next, one can calculate each one of the three main components of the reverse process, i.e., (i) $q(x_t \mid x_{t-1})$, (ii) $q(x_{t-1} \mid x_0)$, and (iii) $q(x_t \mid x_0)$.
Since the diffusion chain is memoryless, $q(x_t \mid x_{t-1}, x_0) = q(x_t \mid x_{t-1})$. Therefore, the first component (i) of Eq. 14 is the forward process. Writing $f_{\Gamma(k,\theta)}$ for the Gamma density, the forward process is given by:

$q(x_t \mid x_{t-1}) = f_{\Gamma(k_t, \theta_t)}\!\left(x_t - \sqrt{1-\beta_t}\, x_{t-1} + k_t\theta_t\right)$   (15)

$= \frac{\left(x_t - \sqrt{1-\beta_t}\, x_{t-1} + k_t\theta_t\right)^{k_t - 1}}{\Gamma(k_t)\,\theta_t^{k_t}} \exp\!\left(-\frac{x_t - \sqrt{1-\beta_t}\, x_{t-1} + k_t\theta_t}{\theta_t}\right)$   (16)
The second component of Eq. 14 follows from the closed form of Eq. 10 at step $t-1$:

$q(x_{t-1} \mid x_0) = f_{\Gamma(\bar{k}_{t-1}, \theta_{t-1})}\!\left(x_{t-1} - \sqrt{\bar{\alpha}_{t-1}}\, x_0 + \bar{k}_{t-1}\theta_{t-1}\right)$   (17)
Similarly, the third component of Eq. 14 is given by:

$q(x_t \mid x_0) = f_{\Gamma(\bar{k}_t, \theta_t)}\!\left(x_t - \sqrt{\bar{\alpha}_t}\, x_0 + \bar{k}_t\theta_t\right)$   (18)
Overall, the reverse process is given by:

$q(x_{t-1} \mid x_t, x_0) = \frac{f_{\Gamma(k_t, \theta_t)}(y_1)\, f_{\Gamma(\bar{k}_{t-1}, \theta_{t-1})}(y_2)}{f_{\Gamma(\bar{k}_t, \theta_t)}(y_3)}$   (19)

One can denote: $y_1 = x_t - \sqrt{1-\beta_t}\, x_{t-1} + k_t\theta_t$, $y_2 = x_{t-1} - \sqrt{\bar{\alpha}_{t-1}}\, x_0 + \bar{k}_{t-1}\theta_{t-1}$, and $y_3 = x_t - \sqrt{\bar{\alpha}_t}\, x_0 + \bar{k}_t\theta_t$. Thus, viewed as a function of $x_{t-1}$, the reverse process is proportional to:

$q(x_{t-1} \mid x_t, x_0) \propto y_1^{\,k_t - 1}\, y_2^{\,\bar{k}_{t-1} - 1}\, \exp\!\left(-\frac{y_1}{\theta_t} - \frac{y_2}{\theta_{t-1}}\right)$   (20)
3.2.3 Variational Lower Bound for DDGM
Denoising diffusion models (Ho et al., 2020) are trained by optimizing the usual variational bound on the negative log likelihood:

$\mathbb{E}[-\log p_\theta(x_0)] \le \mathbb{E}_q\!\left[-\log \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\right] =: L$   (21)
To get the variational lower bound for the proposed Gamma denoising diffusion model, one can use Eq. 5 from Ho et al. (2020):

$L = \mathbb{E}_q\!\left[ D_{KL}\!\left(q(x_T \mid x_0)\,\|\,p(x_T)\right) + \sum_{t>1} D_{KL}\!\left(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\right) - \log p_\theta(x_0 \mid x_1)\right]$   (22)

where the three terms define $L_T$, $L_{t-1}$, and $L_0$, respectively. $L_T$ is constant and ignored during training since it does not have learnable parameters. Moreover, in (Ho et al., 2020), $L_0$ is modeled with a discrete decoder; however, in our proposed model we empirically found that its impact is negligible and it can be removed.
Therefore, to calculate the variational lower bound one needs to obtain $L_{t-1}$ for $2 \le t \le T$:

$L_{t-1} = D_{KL}\!\left(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\right)$   (23)

where:

$p_\theta(x_{t-1} \mid x_t) = q\!\left(x_{t-1} \mid x_t, \hat{x}_0(x_t, t)\right)$   (24)

and $\hat{x}_0(x_t, t)$ is the network's estimate of the data point.
We can calculate the KL divergence in exact form:

$D_{KL}\!\left(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\right) = \mathbb{E}_q\!\left[\log \frac{q(x_{t-1} \mid x_t, x_0)}{p_\theta(x_{t-1} \mid x_t)}\right]$   (25)
Using Eq. 20, the RHS of Eq. 25 becomes:
(26) 
One can show that the four terms present in the previous equation can be upper bounded by the L1 distance between the predicted $\hat{x}_0$ and the ground truth $x_0$:
The complete form of the upper bound can be expressed as follows:
(27) 
As can be seen, the variational lower bound is bounded by constant factors multiplied by the L1 norm between the data point $x_0$ and its estimation $\hat{x}_0$. The constant terms, which depend only on the noise schedule and the Gamma parameters, are known values during training.
3.2.4 Loss Function for DDGM
Denoising diffusion probabilistic models use the variational lower bound to minimize the negative log likelihood. As described in Sec. 3.2.3, one can minimize the variational lower bound by minimizing $L_{t-1}$ for $2 \le t \le T$. To do so, one can minimize the L1 norm from Eq. 27. Our model optimizes the L1 norm between the sampled noise and the estimated noise. This is verified in the following lemma.
Lemma 2.
Minimizing the variational lower bound for DDGM (i.e., $L_{t-1}$ for $2 \le t \le T$) is equivalent to minimizing the L1 norm between the sampled noise and the estimated noise:

$\mathbb{E}\!\left[\left\|\frac{\bar{g}_t - \bar{k}_t\theta_t}{\sqrt{1-\bar{\alpha}_t}} - \epsilon_\theta(x_t, t)\right\|_1\right]$   (28)
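The L1 objective of Lemma 2 can be sketched concretely. The noise predictor below is a placeholder returning zeros, and the Gamma parameterization ($\theta_t = \sqrt{\bar{\alpha}_t}\,\theta_0$, $k_t = \beta_t/(\bar{\alpha}_t\theta_0^2)$) and constants are assumptions for illustration:

```python
import numpy as np

# Compute the DDGM training loss for one random timestep: sample Gamma
# noise via the closed form, normalize it to zero mean and unit variance,
# and take the L1 distance to the network's prediction.
rng = np.random.default_rng(0)
T = 1000
theta0 = 0.001
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)
k_bar = np.cumsum(betas / (alpha_bar * theta0 ** 2))
theta = np.sqrt(alpha_bar) * theta0

def eps_model(x_t, t):
    return np.zeros_like(x_t)         # placeholder network

x0 = rng.standard_normal(256)
t = int(rng.integers(1, T + 1))
g = rng.gamma(k_bar[t - 1], theta[t - 1], x0.shape)
noise = (g - k_bar[t - 1] * theta[t - 1]) / np.sqrt(1 - alpha_bar[t - 1])
x_t = np.sqrt(alpha_bar[t - 1]) * x0 + np.sqrt(1 - alpha_bar[t - 1]) * noise
loss = np.abs(noise - eps_model(x_t, t)).mean()
```

Compared to the Gaussian objective, only two things change: the noise is drawn from the (normalized) Gamma closed form, and the penalty is L1 rather than L2.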
4 Experiments
4.1 Speech Generation
For our speech experiments, we used a version of WaveGrad (Chen et al., 2020) based on the implementation of Vovk (2020) (under a BSD-3-Clause license). We evaluate our model with high-level perceptual quality of speech measures, PESQ (Rix et al., 2001) and STOI (Taal et al., 2011). We used the standard WaveGrad method with the Gaussian diffusion process as a baseline. We use two Nvidia Volta V100 GPUs to train our models.
For all the experiments, the inference noise schedules ($\beta_t$) were defined as described in the WaveGrad paper (Chen et al., 2020): depending on the number of iterations, the noise schedule is linear, follows the Fibonacci sequence, or is found by a model-dependent grid search over the noise schedule parameters. For other hyperparameters (e.g., learning rate, batch size) we use the same values as in WaveGrad (Chen et al., 2020). Training was performed using the Gamma formulation of Eq. 9.
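The two explicit schedule families mentioned above can be sketched as follows; the endpoint constants are illustrative assumptions, not the values used in the experiments:

```python
import numpy as np

# Two inference noise schedules: a linear schedule for larger iteration
# counts and a Fibonacci-style schedule for the few-step setting, in which
# each beta is the sum of the two preceding ones.
def linear_schedule(n_steps, beta_start=1e-4, beta_end=0.05):
    return np.linspace(beta_start, beta_end, n_steps)

def fibonacci_schedule(n_steps, beta_start=1e-6):
    betas = [beta_start, 2 * beta_start]
    while len(betas) < n_steps:
        betas.append(betas[-1] + betas[-2])
    return np.array(betas[:n_steps])

lin = linear_schedule(100)
fib = fibonacci_schedule(6)
```

The Fibonacci schedule grows rapidly, which concentrates most of the noise removal in the last few of the six steps; the linear schedule spreads it evenly.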
Results Tab. 3 presents the PESQ and STOI measurements for the LJ dataset (Ito and Johnson, 2017). As can be seen, for the proposed Gamma denoising diffusion model, our results are better than the WaveGrad baseline for all numbers of iterations in both PESQ and STOI.
Table: PESQ and STOI (higher is better) on the LJ speech dataset, for different numbers of inference iterations.

                                 PESQ                           STOI
Model \ Iterations               6     25     100    1000       6      25     100    1000
WaveGrad (Chen et al., 2020)     2.78  3.194  3.211  3.290      0.924  0.957  0.958  0.959
DDGM (ours)                      3.07  3.208  3.214  3.308      0.948  0.972  0.969  0.969
Table: FID (lower is better) on CelebA 64x64 for different numbers of inference iterations.

Model \ Iterations                       10      20      50     100    1000
DDPM (Ho et al., 2020)                   299.71  183.83  71.71  45.2   3.26
DDGM - Gamma Distribution DDPM (ours)    35.59   28.24   20.24  14.22  4.09
DDIM (Song et al., 2020a)                17.33   13.73   9.17   6.53   3.51
DDGM - Gamma Distribution DDIM (ours)    11.64   6.83    4.28   3.17   2.92
Table: FID (lower is better) on LSUN Church 256x256 for different numbers of inference iterations.

Model \ Iterations                       10     20     50     100
DDPM (Ho et al., 2020)                   51.56  23.37  11.16  8.27
DDGM - Gamma Distribution DDPM (ours)    28.56  19.68  10.53  7.87
DDIM (Song et al., 2020a)                19.45  12.47  10.84  10.58
DDGM - Gamma Distribution DDIM (ours)    18.11  11.32  10.31  8.75
4.2 Image Generation
Our model is based on the DDIM implementation available in (Jiaming Song and Ermon, 2020) (under the MIT license). We trained our model on two image datasets: (i) CelebA 64x64 (Liu et al., 2015) and (ii) LSUN Church 256x256 (Yu et al., 2015). The Fréchet Inception Distance (FID) (Heusel et al., 2017) is used as the benchmark metric. For all experiments, similarly to previous work (Song et al., 2020a), we compute the FID score on generated images using the torch-fidelity implementation (Obukhov et al., 2020). Similar to (Song et al., 2020a), the training noise schedule is linear. For other hyperparameters (e.g., learning rate, batch size) we use the same parameters as in DDPM (Ho et al., 2020). We use eight Nvidia Volta V100 GPUs to train our models. The initial scale $\theta_0$ of the Gamma distribution was tuned as a hyperparameter.

Results We test our models with the inference procedures from DDPM (Ho et al., 2020) and DDIM (Song et al., 2020a). In Tab. 3 we provide the FID scores for the CelebA (64x64) dataset (Liu et al., 2015) (under a non-commercial research purposes license). As can be seen, for the DDPM inference procedure, the best results for up to 100 steps were obtained by the Gamma model, which improves results by a gap of 264.12 FID for ten iterations and 155.59 FID for 20 iterations. For 1000 iterations, the best results were obtained by the DDPM model; nevertheless, our Gamma model obtains results that are close, with a gap of 0.83 FID. For the DDIM procedure, the best results were obtained with the Gamma model for all numbers of iterations. Fig. 2 presents samples generated by the three models. Our models provide better quality images when compared to the DDPM and DDIM methods.
In Tab. 3 we provide the FID scores for the LSUN Church dataset (Yu et al., 2015). As can be seen, the Gamma model improves results over the baseline for all reported numbers of iterations.
5 Conclusions
We present a novel Gamma diffusion model. The model employs a Gamma noise distribution. A key enabler for using this distribution is a closed-form formulation (Eq. 10) of the multi-step noising process, which allows for efficient training. We also present the reverse process and the variational lower bound for the Gamma diffusion model. The proposed model improves the quality of generated images and audio, as well as the speed of generation, in comparison to conventional, Gaussian-based diffusion processes.
Acknowledgments
This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant ERC CoG 725974). The contribution of Eliya Nachmani is part of a Ph.D. thesis research conducted at Tel Aviv University.
References
Austin, J., Johnson, D. D., Ho, J., Tarlow, D., and van den Berg, R. (2021). Structured denoising diffusion models in discrete state-spaces. arXiv preprint arXiv:2107.03006.
Bińkowski, M., Donahue, J., Dieleman, S., Clark, A., Elsen, E., Casagrande, N., Cobo, L. C., and Simonyan, K. (2019). High fidelity speech synthesis with adversarial networks. arXiv preprint arXiv:1909.11646.
Brock, A., Donahue, J., and Simonyan, K. (2018). Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096.
Chen, N., Zhang, Y., Zen, H., Weiss, R. J., Norouzi, M., and Chan, W. (2020). WaveGrad: estimating gradients for waveform generation. arXiv preprint arXiv:2009.00713.
Choi, J., Kim, S., Jeong, Y., Gwon, Y., and Yoon, S. (2021). ILVR: conditioning method for denoising diffusion probabilistic models. arXiv preprint arXiv:2108.02938.
Dhariwal, P. and Nichol, A. (2021). Diffusion models beat GANs on image synthesis. arXiv preprint arXiv:2105.05233.
Donahue, C., McAuley, J., and Puckette, M. (2018). Adversarial audio synthesis. arXiv preprint arXiv:1802.04208.
Gao, R., Song, Y., Poole, B., Wu, Y. N., and Kingma, D. P. (2020). Learning energy-based models by diffusion recovery likelihood. arXiv preprint arXiv:2012.08125.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial networks. arXiv preprint arXiv:1406.2661.
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. (2017). GANs trained by a two time-scale update rule converge to a local Nash equilibrium. arXiv preprint arXiv:1706.08500.
Ho, J., Jain, A., and Abbeel, P. (2020). Denoising diffusion probabilistic models. arXiv preprint arXiv:2006.11239.
Hoogeboom, E., Nielsen, D., Jaini, P., Forré, P., and Welling, M. (2021). Argmax flows and multinomial diffusion: towards non-autoregressive language models. arXiv preprint arXiv:2102.05379.
Huang, C.-W., Lim, J. H., and Courville, A. (2021). A variational perspective on diffusion-based generative models and score matching. arXiv preprint arXiv:2106.02808.
Hyvärinen, A. (2005). Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research 6(4).
Ito, K. and Johnson, L. (2017). The LJ Speech dataset. https://keithito.com/LJ-Speech-Dataset/.
Song, J., Meng, C., and Ermon, S. (2020). Denoising diffusion implicit models (code). GitHub: https://github.com/ermongroup/ddim.
Kalchbrenner, N., Elsen, E., Simonyan, K., Noury, S., Casagrande, N., Lockhart, E., Stimberg, F., van den Oord, A., Dieleman, S., and Kavukcuoglu, K. (2018). Efficient neural audio synthesis. In International Conference on Machine Learning, pp. 2410-2419.
Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., and Aila, T. (2020). Analyzing and improving the image quality of StyleGAN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8110-8119.
Kingma, D. P. and Dhariwal, P. (2018). Glow: generative flow with invertible 1x1 convolutions. arXiv preprint arXiv:1807.03039.
Kingma, D. P., Salimans, T., Poole, B., and Ho, J. (2021). Variational diffusion models. arXiv preprint arXiv:2107.00630.
Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
Kong, Z., Ping, W., Huang, J., Zhao, K., and Catanzaro, B. (2020a). DiffWave: a versatile diffusion model for audio synthesis. arXiv preprint arXiv:2009.09761.
Kong, Z., Ping, W., Huang, J., Zhao, K., and Catanzaro, B. (2020b). DiffWave: a versatile diffusion model for audio synthesis. arXiv preprint arXiv:2009.09761.
Kong, Z. and Ping, W. (2021). On fast sampling of diffusion probabilistic models. arXiv preprint arXiv:2106.00132.
Lam, M. W. Y., Wang, J., Su, D., and Yu, D. (2021). Bilateral denoising diffusion models. arXiv preprint arXiv:2108.11514.
Leemis, L. M. and McQueston, J. T. (2008). Univariate distribution relationships. The American Statistician 62(1), pp. 45-53.
Liu, Q., Lee, J., and Jordan, M. (2016). A kernelized Stein discrepancy for goodness-of-fit tests. In International Conference on Machine Learning, pp. 276-284.
Liu, Z., Luo, P., Wang, X., and Tang, X. (2015). Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV).
Obukhov, A., Seitzer, M., Wu, P.-W., Zhydenko, S., Kyl, J., and Lin, E. Y.-J. (2020). High-fidelity performance metrics for generative models in PyTorch. Zenodo, version 0.2.0, DOI: 10.5281/zenodo.3786540.
van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: a generative model for raw audio. arXiv preprint arXiv:1609.03499.
Popov, V., Vovk, I., Gogoryan, V., Sadekova, T., and Kudinov, M. (2021). Grad-TTS: a diffusion probabilistic model for text-to-speech. arXiv preprint arXiv:2105.06337.
Razavi, A., van den Oord, A., and Vinyals, O. (2019). Generating diverse high-fidelity images with VQ-VAE-2. arXiv preprint arXiv:1906.00446.
Rix, A. W., Beerends, J. G., Hollier, M. P., and Hekstra, A. P. (2001). Perceptual evaluation of speech quality (PESQ): a new method for speech quality assessment of telephone networks and codecs. In IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, pp. 749-752.
Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. (2015). Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pp. 2256-2265.
Song, J., Meng, C., and Ermon, S. (2020a). Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502.
Song, Y. and Ermon, S. (2019). Generative modeling by estimating gradients of the data distribution. arXiv preprint arXiv:1907.05600.
Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. (2020b). Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456.
Taal, C. H., Hendriks, R. C., Heusdens, R., and Jensen, J. (2011). An algorithm for intelligibility prediction of time-frequency weighted noisy speech. IEEE Transactions on Audio, Speech, and Language Processing 19(7), pp. 2125-2136.
Vahdat, A. and Kautz, J. (2020). NVAE: a deep hierarchical variational autoencoder. arXiv preprint arXiv:2007.03898.
Vovk, I. (2020). WaveGrad implementation. GitHub: https://github.com/ivanvovk/WaveGrad.
Watson, D., Ho, J., Norouzi, M., and Chan, W. (2021). Learning to efficiently sample from diffusion probabilistic models. arXiv preprint arXiv:2106.03802.
Yu, F., Seff, A., Zhang, Y., Song, S., Funkhouser, T., and Xiao, J. (2015). LSUN: construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365.
Appendix A Proofs
A.1 Proof of Lemma 1
Proof.
The first part of Eq. 11 is immediate. The variance part is also straightforward: $\mathrm{Var}(x_t) = \mathrm{Var}(\bar{g}_t) = \bar{k}_t\theta_t^2 = \frac{1-\bar{\alpha}_t}{\bar{\alpha}_t\theta_0^2}\,\bar{\alpha}_t\theta_0^2 = 1-\bar{\alpha}_t$.
Eq. 12 is proved by induction on $t$. For $t = 1$: since $\bar{\alpha}_1 = \alpha_1$, we have $\theta_1 = \sqrt{\bar{\alpha}_1}\,\theta_0$. We also have that $\bar{k}_1 = k_1$ and $\bar{g}_1 = g_1$. Thus we have: $x_1 = \sqrt{1-\beta_1}\, x_0 + (g_1 - k_1\theta_1) = \sqrt{\bar{\alpha}_1}\, x_0 + (\bar{g}_1 - \bar{k}_1\theta_1)$.
Assume Eq. 12 holds for some $t$. The next iteration is obtained as:

$x_{t+1} = \sqrt{1-\beta_{t+1}}\, x_t + (g_{t+1} - k_{t+1}\theta_{t+1})$   (29)

$= \sqrt{1-\beta_{t+1}}\left(\sqrt{\bar{\alpha}_t}\, x_0 + (\bar{g}_t - \bar{k}_t\theta_t)\right) + (g_{t+1} - k_{t+1}\theta_{t+1})$   (30)

$= \sqrt{\bar{\alpha}_{t+1}}\, x_0 + \left(\sqrt{\alpha_{t+1}}\,\bar{g}_t + g_{t+1}\right) - \left(\sqrt{\alpha_{t+1}}\,\bar{k}_t\theta_t + k_{t+1}\theta_{t+1}\right)$   (31)
It remains to be proven that (i) $\sqrt{\alpha_{t+1}}\,\bar{g}_t + g_{t+1} \sim \Gamma(\bar{k}_{t+1}, \theta_{t+1})$ and (ii) $\sqrt{\alpha_{t+1}}\,\bar{k}_t\theta_t + k_{t+1}\theta_{t+1} = \bar{k}_{t+1}\theta_{t+1}$. Since $\sqrt{\alpha_{t+1}}\,\theta_t = \sqrt{\alpha_{t+1}}\sqrt{\bar{\alpha}_t}\,\theta_0 = \sqrt{\bar{\alpha}_{t+1}}\,\theta_0 = \theta_{t+1}$ holds, scaling $\bar{g}_t \sim \Gamma(\bar{k}_t, \theta_t)$ by $\sqrt{\alpha_{t+1}}$ gives $\sqrt{\alpha_{t+1}}\,\bar{g}_t \sim \Gamma(\bar{k}_t, \theta_{t+1})$.
Therefore, we prove (i): since $g_{t+1} \sim \Gamma(k_{t+1}, \theta_{t+1})$ shares the scale $\theta_{t+1}$, the sum satisfies $\sqrt{\alpha_{t+1}}\,\bar{g}_t + g_{t+1} \sim \Gamma(\bar{k}_t + k_{t+1}, \theta_{t+1}) = \Gamma(\bar{k}_{t+1}, \theta_{t+1})$,
which implies that $\sqrt{\alpha_{t+1}}\,\bar{g}_t + g_{t+1}$ and $\bar{g}_{t+1}$ have the same probability distribution.
Furthermore, by the linearity of the expectation, one can obtain (ii): $\sqrt{\alpha_{t+1}}\,\bar{k}_t\theta_t + k_{t+1}\theta_{t+1} = \bar{k}_t\theta_{t+1} + k_{t+1}\theta_{t+1} = \bar{k}_{t+1}\theta_{t+1}$.
Thus, we have: $x_{t+1} = \sqrt{\bar{\alpha}_{t+1}}\, x_0 + (\bar{g}_{t+1} - \bar{k}_{t+1}\theta_{t+1})$,
which ends the proof by induction. ∎
A.2 Proof of Lemma 2