Dynamic Dual-Output Diffusion Models

by Yaniv Benny, et al.
Tel Aviv University

Iterative denoising-based generation, also known as denoising diffusion models, has recently been shown to match, and even surpass, other classes of generative models in quality. These include, in particular, Generative Adversarial Networks, which are currently the state of the art in many sub-tasks of image generation. However, a major drawback of this method is that it requires hundreds of iterations to produce a competitive result. Recent works have proposed solutions that allow for faster generation with fewer iterations, but the image quality gradually deteriorates as fewer iterations are applied during generation. In this paper, we reveal some of the causes that affect the generation quality of diffusion models, especially when sampling with few iterations, and come up with a simple, yet effective, solution to mitigate them. We consider two opposite equations for the iterative denoising: the first predicts the applied noise, and the second predicts the image directly. Our solution takes the two options and learns to dynamically alternate between them through the denoising process. Our proposed solution is general and can be applied to any existing diffusion model. As we show, when applied to various SOTA architectures, our solution immediately improves their generation quality, with negligible added complexity and parameters. We experiment on multiple datasets and configurations and run an extensive ablation study to support these findings.



1 Introduction

Over the past few years, deep generative models have reached the ability to generate high-quality samples in various domains, including images [brock2018large], speech [oord2016wavenet], and natural language [brown2020language]. For image generation, generative models can be divided into two main branches: approaches based on generative adversarial networks (GAN) [goodfellow2020generative] and log-likelihood-based methods, such as variational autoencoders (VAE), autoregressive models [oord2016conditional], and normalizing flows [rezende2015variational, kingma2018glow]. Log-likelihood models have the advantage of possessing a straightforward objective, which makes them easier to optimize, while GANs are known to be unstable during training [heusel2017gans, salimans2016improved]. However, until recently, well optimized GAN models outperformed their log-likelihood counterparts in generation quality [brock2018large, karras2020analyzing, karras2019style, karras2017progressive].

This changed when Ho et al. [ho2020denoising] introduced a new type of log-likelihood model called the Denoising Diffusion Probabilistic Model (DDPM). With this model, image quality surpasses GANs [dhariwal2021diffusion], while it is also very stable and easy to train. DDPMs follow the concept of iterative denoising: given a noisy image $x_t$, it is gradually denoised by predicting a less noisy image $x_{t-1}$. This process, when done over hundreds (or thousands) of iterations, is able to generate images with very high quality and diversity, even when starting from random noise. DDPMs have many computer vision applications, such as super-resolution [saharia2021image, li2021srdiff] and image translation [sasaki2021unit], and are also extremely effective in non-visual domains [chen2020wavegrad, luo2021diffusion, rasul2021autoregressive].

DDPM incorporates a probabilistic denoising process that is dependent on the estimation of the mean component $\mu$. This is done by a neural network parameterized over $\theta$ and denoted as $\mu_\theta$. However, it was found that through the forward and backward equations this process is better formalized by predicting either the noise $\epsilon$ or the original image $x_0$ [ho2020denoising]. Their experiments found the former to be empirically superior, and, as far as we can ascertain, no further comparisons between the two options (noise or original image) have been performed as yet.

In this work, we revisit the original implementation of DDPM, and find that the preference of $\epsilon_\theta$ over $x_\theta$ is circumstantial and depends on the hyperparameters and datasets. In addition, in certain timesteps, the denoising process has less error when predicting the noise component $\epsilon$, while in others it predicts the original image $x_0$ better. This realization motivated us to design a model capable of predicting both values and adaptively selecting the more reliable output at each sampling iteration. The modified model has a negligible number of added parameters and complexity. We apply this method to various DDPM models and show a marked improvement in terms of image quality (measured by FID) for many benchmarks. This addition to the framework is orthogonal to existing advancements (that we know of), and is able to improve sampling quality, especially with the restriction of few iterations.

2 Related work

Diffusion probabilistic models were introduced by Sohl-Dickstein et al. [sohl2015deep], who proposed a model that can learn to reverse a gradual noising schedule. This framework is part of a long line of research on generative models that are based on Markov chains [bengio2014deep, salimans2015markov], which has led to the development of Noise Conditional Score Networks (NCSN) [song2019generative, song2020improved] and Denoising Diffusion Probabilistic Models (DDPM) [ho2020denoising]. Although very similar, DDPMs optimize a bound on the log-likelihood, while NCSNs optimize the score matching objective [hyvarinen2005estimation].

The success of DDPMs has sparked a lot of interest in improving upon the original design. Song et al. proposed an implicit sampling (DDIM) [song2020denoising] that reduces the number of iterations while maintaining high image quality. Nichol and Dhariwal [nichol2021improved] proposed a cosine noising schedule and a learned denoising variance factor, and in a second work [dhariwal2021diffusion] proposed architectural improvements and classifier guidance. Watson et al. [watson2021learning] proposed a dynamic programming algorithm to find an efficient denoising schedule. Nachmani et al. [nachmani2021denoising] applied a Gamma distribution instead of Gaussian. Luhman and Luhman [luhman2021knowledge] applied knowledge distillation with DDPMs.

The solution proposed in this work is orthogonal to the contribution of these methods. It is, therefore, possible to apply our method to the above advancements and increase the performance of all these networks. This is demonstrated in our experiments for some of the existing approaches.

3 Setup

Figure 1: Loss comparison between $\epsilon_\theta$ and $x_\theta$. Panels (a), (b), and (c) each show the loss on a different prediction target.
Figure 2: Pixel mean and variance of the predicted $x_0$ over timesteps, for both the subtractive ($\epsilon_\theta$) and the additive ($x_\theta$) paths, for a model trained on CIFAR10. Real data statistics are in black.

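Diffusion models operate as the reversal of a gradual noising process. Given a sample $x_0$, we consider the samples $x_t$ for $t \in \{1, \dots, T\}$ obtained by gradually adding noise, starting from $x_0$. Noise is applied in such a way that each instance is noisier than the previous, and the final instance $x_T$ is completely destroyed and can be seen as a sample from a predefined noise distribution. Ho et al. [ho2020denoising] proposed a Gaussian noise that is applied iteratively as:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right) \tag{1}$$

Here, $\beta_t$ for $t \in \{1, \dots, T\}$ are a group of scalars selected so that $x_T \sim \mathcal{N}(0, I)$ (a multivariate i.i.d. normal distribution). Due to the choice of applying Gaussian noise, a simpler transition can be applied directly from $x_0$ to any $x_t$, which makes training much more efficient. Using $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$, we get:

$$q(x_t \mid x_0) = \mathcal{N}\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1-\bar{\alpha}_t) I\right) \tag{2}$$

This formulation also reveals a simpler constraint, which is that $\bar{\alpha}_T \approx 0$, and multiple such schedules have been proposed [ho2020denoising, nichol2021improved] and tested.

Through this equation, any intermediate step $x_t$ can be sampled, given a noise sample $\epsilon \sim \mathcal{N}(0, I)$:

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon \tag{3}$$

Notice that $x_0$ can easily be backtraced through

$$x_0 = \frac{x_t - \sqrt{1-\bar{\alpha}_t}\, \epsilon}{\sqrt{\bar{\alpha}_t}} \tag{4}$$

which is an important property for the denoising process.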
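As a minimal sketch of the closed-form noising step and its inversion, the following NumPy snippet samples $x_t$ from $x_0$ and then backtraces $x_0$ when the applied noise is known. The function names, the toy image shape, and the 0-based timestep indexing are illustrative assumptions; the linear schedule values are the ones quoted in Appendix A.

```python
import numpy as np

def linear_beta_schedule(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Linear beta_t schedule (hypothetical helper; defaults follow Appendix A)."""
    return np.linspace(beta_start, beta_end, num_steps)

def forward_sample(x0, t, alpha_bar, eps):
    """Jump from x_0 directly to x_t (Eq. 3). `t` is a 0-based index here."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

def backtrace_x0(xt, t, alpha_bar, eps):
    """Recover x_0 from x_t when the applied noise is known (Eq. 4)."""
    return (xt - np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alpha_bar[t])

betas = linear_beta_schedule()
alpha_bar = np.cumprod(1.0 - betas)           # cumulative product of alpha_t

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(3, 8, 8))   # toy "image" (assumption)
eps = rng.standard_normal(x0.shape)

xt = forward_sample(x0, 500, alpha_bar, eps)
x0_rec = backtrace_x0(xt, 500, alpha_bar, eps)
print("max reconstruction error:", np.max(np.abs(x0_rec - x0)))
print("alpha_bar at the last step:", alpha_bar[-1])
```

With the true noise the reconstruction is exact up to floating-point error; during sampling the noise is only predicted, which is what the rest of this section analyzes.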

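The “reversed” denoising process is then a Markovian process, parameterized with a neural network over $\theta$ as:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right) \tag{5}$$

Training this model is done by sampling a random $t$ and minimizing the loss $L_t$ with stochastic gradient descent. $L_t$ is the KL-divergence:

$$L_t = D_{KL}\left(q(x_{t-1} \mid x_t)\ \|\ p_\theta(x_{t-1} \mid x_t)\right) \tag{6}$$

Some critical modifications are responsible for the stabilization of this objective. The high variance distribution $q(x_{t-1} \mid x_t)$ is replaced with the more stable $q(x_{t-1} \mid x_t, x_0)$, which is practically the combination of the posterior and the “forward” process.

$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\left(x_{t-1};\ \tilde{\mu}_t(x_t, x_0),\ \tilde{\beta}_t I\right) \tag{7}$$

$$\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1-\bar{\alpha}_t}\, x_0 + \frac{\sqrt{\alpha_t}\, (1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\, x_t \tag{8}$$

$$\tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\, \beta_t \tag{9}$$

As for $\Sigma_\theta$, Ho et al. [ho2020denoising] found that fixing $\Sigma_\theta$ to a constant $\sigma_t^2 I$ makes it easier to optimize an objective that is reduced to predicting the mean vector $\mu_\theta$:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 I\right) \tag{10}$$

As a corollary, the loss function becomes:

$$L_t = \frac{1}{2\sigma_t^2} \left\| \tilde{\mu}_t(x_t, x_0) - \mu_\theta(x_t, t) \right\|^2 \tag{11}$$

The constant $\sigma_t^2$ was selected to be $\beta_t$, even though there is also a theoretical explanation for choosing $\tilde{\beta}_t$ instead.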

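In addition, the prediction of $\mu_\theta$ directly was replaced by either ① the prediction of $x_0$, denoted as $x_\theta$, and computing $\mu_\theta$ (denoted as $\mu_x$) through Eq. 8, or ② predicting $\epsilon$, denoted as $\epsilon_\theta$, and using Eq. 4,8 ($\mu_\epsilon$).

① $$\mu_x = \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1-\bar{\alpha}_t}\, x_\theta + \frac{\sqrt{\alpha_t}\, (1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\, x_t \tag{12}$$

② $$\mu_\epsilon = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon_\theta \right) \tag{13}$$

The latter was chosen based on empirical evidence, and by further developing the equations this choice resulted in a new formalization of $L_t$ as:

$$L_t = \lambda_t \left\| \epsilon - \epsilon_\theta(x_t, t) \right\|^2 \tag{14}$$

where $\lambda_t$ is a weight that should be equal to $\frac{\beta_t^2}{2\sigma_t^2 \alpha_t (1-\bar{\alpha}_t)}$ for consistency with Eq. 11, but was set to $1$ for simplicity.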
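The two routes to the mean vector can be compared numerically. The sketch below computes the posterior mean once from a predicted image (Eq. 12) and once from a predicted noise (Eq. 13), and checks that they coincide when both predictions are exact. The function names, the toy data, and the 0-based indexing are assumptions made for illustration; a trained network would supply imperfect predictions instead of the oracle values used here.

```python
import numpy as np

def mean_from_x0(xt, x0_pred, t, betas, alpha_bar):
    """Additive path: plug a predicted x_0 into the posterior mean (requires t >= 1)."""
    a_bar_t, a_bar_prev = alpha_bar[t], alpha_bar[t - 1]
    coef_x0 = np.sqrt(a_bar_prev) * betas[t] / (1.0 - a_bar_t)
    coef_xt = np.sqrt(1.0 - betas[t]) * (1.0 - a_bar_prev) / (1.0 - a_bar_t)
    return coef_x0 * x0_pred + coef_xt * xt

def mean_from_eps(xt, eps_pred, t, betas, alpha_bar):
    """Subtractive path: remove the scaled predicted noise from x_t."""
    scaled_noise = betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_pred
    return (xt - scaled_noise) / np.sqrt(1.0 - betas[t])

betas = np.linspace(1e-4, 2e-2, 1000)         # linear schedule, as in Appendix A
alpha_bar = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(3, 8, 8))   # toy image (assumption)
eps = rng.standard_normal(x0.shape)
t = 400                                        # 0-based index
xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

# With oracle predictions, both routes give the same posterior mean.
mu_x = mean_from_x0(xt, x0, t, betas, alpha_bar)
mu_eps = mean_from_eps(xt, eps, t, betas, alpha_bar)
print("max difference:", np.max(np.abs(mu_x - mu_eps)))
```

The agreement holds only for exact predictions; with a learned model the two means differ, which is why choosing between them matters.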

3.1 Pros and cons for predicting $\epsilon$

There are multiple justifications for why the backward process should be driven by predicting the noise $\epsilon$. The first is that the noise always has zero mean and unit variance, and the model can learn these statistics quite easily. A second is that it gives a residual-like equation, where the image $x_0$ is predicted by subtracting the output of the model from the input (Eq. 4). This provides the model with the option to preserve the information in the input, by predicting zero noise or multiplying it by a small coefficient. This approach becomes increasingly beneficial towards the end of the denoising process, where the amount of noise becomes small, and only minor modifications are needed.

The main disadvantage of this approach is that after the subtraction of noise from $x_t$, the result is divided by $\sqrt{\bar{\alpha}_t}$ (Eq. 4), which can be a very small value for some steps (large $t$). This can lead to a very large error in $x_0$ even for a small error in $\epsilon_\theta$. This error propagates, since the model is limited to modifying the intermediate states with something that resembles noise, and if previous iterations produced a state that is not viable, it becomes difficult for the model to correct this path. In such cases, multiple iterations may be required just to revert a previous bad prediction.
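To get a feeling for the size of this effect, the following sketch tabulates how strongly an error in the predicted noise is amplified when the image is recovered through Eq. 4. It assumes the linear schedule quoted in Appendix A (1e-4 to 2e-2 over 1000 steps); other schedules would give different numbers.

```python
import numpy as np

betas = np.linspace(1e-4, 2e-2, 1000)    # linear schedule (values from Appendix A)
alpha_bar = np.cumprod(1.0 - betas)

# From Eq. 4: an error d in the predicted noise shifts the recovered x_0 by
# sqrt(1 - alpha_bar_t) / sqrt(alpha_bar_t) * d.
amplification = np.sqrt(1.0 - alpha_bar) / np.sqrt(alpha_bar)

for step in (1, 250, 500, 750, 1000):
    print(f"t={step:4d}  error amplification = {amplification[step - 1]:.3g}")
```

The factor grows monotonically with $t$, so the same absolute error in the noise prediction is far more damaging at the start of sampling than at the end.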

This problem is demonstrated in Fig. 1,2. Fig. 1 shows that the loss when using $\epsilon_\theta$ is significantly larger at high $t$ than when using a direct approach. Note that Fig. 1 is in log-scale, and the error of $\epsilon_\theta$ is larger by orders of magnitude for hundreds of steps. Fig. 2 shows the mean and variance of the predicted $x_0$ through the denoising process. As can be seen, the prediction of $x_0$ using $\epsilon_\theta$ starts with very high bias and variance, and it takes multiple steps to correct this. In contrast, the direct prediction using $x_\theta$ immediately starts with very low bias, and its variance increases monotonically with the real data variance.

Figure 3: The dual output diffusion model. A noisy image $x_T$ is gradually denoised into $x_0$. In each iteration, an intermediate state $x_t$ is inserted into a model that predicts $\epsilon_\theta$, $x_\theta$, and $r_\theta$. These outputs are combined into a mean vector $\mu_\theta$, which is subsequently used to sample the next state $x_{t-1}$.

3.2 Pros and cons for predicting $x_0$

An advantage and also a disadvantage of predicting $x_0$ directly is that the model needs to produce the entire image, not just subtract some noise from its input. This can be an advantage during the first stages when the input image is very noisy. As can be seen in Fig. 2, it is easier to predict an unbiased estimate of the image directly than to do so by subtracting noise, and the loss during these steps is substantially lower than that of $\epsilon_\theta$ (Fig. 1). It becomes a disadvantage during later steps, when a substantial structure has been formed in $x_t$, and the direct prediction of $x_0$ needs to rebuild it in each step, instead of simply subtracting small noise artifacts. Fig. 1 shows that all predictions are less accurate for small $t$ when using $x_\theta$.

At first glance, it appears that the backward process using $x_\theta$ loses the residual-like property of $\epsilon_\theta$, since the image is estimated directly and not subtracted from $x_t$. However, the objective of step $t$ is not to predict $x_0$, but to obtain $x_{t-1}$. According to Eq. 8, the residual property is still present in this alternative backward process, which relies on $x_t$.

Note that while $\epsilon_\theta$ is subtracted from $x_t$ (Eq. 13), $x_\theta$ is added to it (Eq. 12). Therefore, we distinguish between the processes by calling the $\epsilon_\theta$ process the “subtractive” backward process, and calling the $x_\theta$ process the “additive” backward process.

4 Method

To leverage the advantages of both flows, we can consider two models. The first, $\epsilon_\theta$, predicts $\epsilon$ (and $x_0$ through Eq. 4), and the second, $x_\theta$, predicts $x_0$. Each of them can estimate its own $\mu$ (Eq. 12,13), but in order to control how much we want to rely on each model’s output, we can interpolate between their estimates with an additional parameter $r \in [0, 1]$, as $\mu = r \mu_x + (1 - r) \mu_\epsilon$. By selecting a different value for $r$ for each step $t$, we can control how much influence we want each path to have on each step.

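To simplify and generalize this solution, we fuse $\epsilon_\theta$ and $x_\theta$ into one model $f_\theta$, and make the interpolation parameter learned as well ($r_\theta$). The generalized model computes:

$$(\epsilon_\theta,\ x_\theta,\ r_\theta) = f_\theta(x_t, t), \qquad \mu_\theta = r_\theta\, \mu_x + (1 - r_\theta)\, \mu_\epsilon$$

An illustration can be seen in Fig. 3. The modifications required to go from a model that only predicts $\epsilon_\theta$ to our new model might seem complex, but they are in fact very simple. The only change to the model is in the number of output channels in the last layer. For example, for an RGB image, the output of $f_\theta$ grows from the channels of $\epsilon_\theta$ alone to also include the channels of $x_\theta$ and $r_\theta$. This means that the number of added parameters should be negligible. Complexity and runtime are also unaffected since the computation of $\mu_\theta$ is negligible.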
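One way such a fused output could be unpacked is sketched below. The channel layout (noise channels first, then image channels, then a single interpolation channel), the per-pixel shape of the interpolation parameter, and the sigmoid that keeps it in $(0, 1)$ are all assumptions made for illustration; the text above only states that the last layer gains output channels.

```python
import numpy as np

def split_dual_output(raw, image_channels=3):
    """Split a (2C+1, H, W) array into a noise estimate, an image estimate,
    and an interpolation map in (0, 1). Layout and sigmoid are assumptions."""
    eps_pred = raw[:image_channels]
    x0_pred = raw[image_channels:2 * image_channels]
    r = 1.0 / (1.0 + np.exp(-raw[2 * image_channels:]))   # shape (1, H, W)
    return eps_pred, x0_pred, r

def interpolate_mean(mu_x, mu_eps, r):
    """Blend the additive and subtractive means; r = 1 selects the additive path."""
    return r * mu_x + (1.0 - r) * mu_eps

rng = np.random.default_rng(0)
raw = rng.standard_normal((7, 4, 4))          # stand-in for a network output
eps_pred, x0_pred, r = split_dual_output(raw)

mu_x = rng.standard_normal((3, 4, 4))         # stand-in means for the two paths
mu_eps = rng.standard_normal((3, 4, 4))
mu = interpolate_mean(mu_x, mu_eps, r)        # r broadcasts over the channels
print(eps_pred.shape, x0_pred.shape, r.shape, mu.shape)
```

Under this layout an RGB model would go from 3 to 7 output channels, which is consistent with the claim that the added parameter count is negligible.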

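This new model requires a new loss function $L$, which optimizes $\epsilon_\theta$, $x_\theta$, and $r_\theta$. We separate this into three components:

$$\begin{aligned}
L_\epsilon &= \left\| \epsilon - \epsilon_\theta \right\|^2 \\
L_x &= \left\| x_0 - x_\theta \right\|^2 \\
L_\mu &= \left\| \tilde{\mu}_t - \left( r_\theta\, [\mu_x]_{sg} + (1 - r_\theta)\, [\mu_\epsilon]_{sg} \right) \right\|^2 \\
L &= \lambda_\epsilon L_\epsilon + \lambda_x L_x + \lambda_\mu L_\mu
\end{aligned} \tag{19}$$

where $[\cdot]_{sg}$ denotes “stop-grad”, which means that inner values are detached and no gradient is propagated back from them. The $\lambda$’s are weights that can be applied to each loss, which we kept fixed throughout our experiments. We found that optimizing with these stop-grads results in a much more stable training regime than the alternative of allowing gradients to propagate through $\mu_x$ and $\mu_\epsilon$, because the gradients of $x_\theta$ and $\epsilon_\theta$ are subject to intense rescaling (see Eq. 12,13).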
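The structure of the three-part loss can be sketched as follows. NumPy has no automatic differentiation, so the stop-grad appears only as a comment: in an autodiff framework the two path means would be detached before being blended. The function name, the use of a mean squared error instead of a summed norm, and the unit default weights are assumptions of this sketch.

```python
import numpy as np

def dual_output_loss(eps, x0, mu_target, eps_pred, x0_pred, r, mu_x, mu_eps,
                     weights=(1.0, 1.0, 1.0)):
    """Return (total, loss_eps, loss_x, loss_mu).

    mu_x and mu_eps are treated as constants (stop-grad), so in a framework
    with gradients loss_mu would only train the interpolation parameter r.
    """
    loss_eps = np.mean((eps - eps_pred) ** 2)          # trains the noise output
    loss_x = np.mean((x0 - x0_pred) ** 2)              # trains the image output
    mu_pred = r * mu_x + (1.0 - r) * mu_eps            # blended mean
    loss_mu = np.mean((mu_target - mu_pred) ** 2)      # trains r only
    w_eps, w_x, w_mu = weights
    total = w_eps * loss_eps + w_x * loss_x + w_mu * loss_mu
    return total, loss_eps, loss_x, loss_mu

rng = np.random.default_rng(0)
shape = (3, 4, 4)
eps, x0, mu_target = (rng.standard_normal(shape) for _ in range(3))

# Toy case: the additive mean is exact, the subtractive mean is off by 1.
good_mu, bad_mu = mu_target, mu_target + 1.0
_, _, _, loss_mu_r1 = dual_output_loss(eps, x0, mu_target, eps, x0, 1.0, good_mu, bad_mu)
_, _, _, loss_mu_r0 = dual_output_loss(eps, x0, mu_target, eps, x0, 0.0, good_mu, bad_mu)
print("loss_mu with r=1:", loss_mu_r1, " with r=0:", loss_mu_r0)
```

In this toy case the mean loss is lower when $r$ selects the path whose mean is accurate, which is the signal that drives the interpolation parameter toward the more reliable output at each step.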

4.1 Implicit sampling

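Song et al. [song2020denoising] proposed an implicit sampling method (DDIM) that is deterministic after generating the first seed $x_T$. Since our method only changes how $x_0$ is estimated, it does not affect the ability to perform implicit sampling. Their generalized formula is as follows:

$$x_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\, \hat{x}_0 + \sqrt{1 - \bar{\alpha}_{t-1} - \sigma_t^2}\, \hat{\epsilon} + \sigma_t z, \qquad z \sim \mathcal{N}(0, I)$$

In [song2020denoising], $\hat{x}_0$ and $\hat{\epsilon}$ were estimated with $\hat{x}_0 = \frac{x_t - \sqrt{1-\bar{\alpha}_t}\, \epsilon_\theta}{\sqrt{\bar{\alpha}_t}}$ (Eq. 4) and $\hat{\epsilon} = \epsilon_\theta$. Our method uses the interpolated estimation of:

$$\hat{x}_0 = r_\theta\, x_\theta + (1 - r_\theta)\, \frac{x_t - \sqrt{1-\bar{\alpha}_t}\, \epsilon_\theta}{\sqrt{\bar{\alpha}_t}}, \qquad \hat{\epsilon} = \frac{x_t - \sqrt{\bar{\alpha}_t}\, \hat{x}_0}{\sqrt{1-\bar{\alpha}_t}}$$

When $\sigma_t = 0$, deterministic behavior is obtained, and the model is not sensitive to the value of $\sigma_t$ of the probabilistic approach that would have been selected in Eq. 10.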
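A single implicit sampling step with an interpolated image estimate might look like the sketch below. It follows the standard DDIM update and blends the direct image prediction with the one recovered from the predicted noise through Eq. 4; the function signature, the way the noise estimate is re-derived from the blended image, and the toy check are assumptions of this sketch rather than a reproduction of the authors' code.

```python
import numpy as np

def ddim_step(xt, eps_pred, x0_pred, r, a_bar_t, a_bar_prev, sigma=0.0, noise=None):
    """One DDIM-style update from x_t to the previous state.

    r = 1 uses only the direct image prediction, r = 0 only the noise prediction.
    sigma = 0 gives the deterministic (implicit) sampler.
    """
    x0_from_eps = (xt - np.sqrt(1.0 - a_bar_t) * eps_pred) / np.sqrt(a_bar_t)
    x0_hat = r * x0_pred + (1.0 - r) * x0_from_eps
    # Noise estimate that is consistent with x_t and the blended image estimate.
    eps_hat = (xt - np.sqrt(a_bar_t) * x0_hat) / np.sqrt(1.0 - a_bar_t)
    if noise is None:
        noise = np.zeros_like(xt)
    direction = np.sqrt(1.0 - a_bar_prev - sigma ** 2) * eps_hat
    return np.sqrt(a_bar_prev) * x0_hat + direction + sigma * noise

betas = np.linspace(1e-4, 2e-2, 1000)         # linear schedule, as in Appendix A
alpha_bar = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(3, 8, 8))   # toy image (assumption)
eps = rng.standard_normal(x0.shape)
t, t_prev = 600, 500                           # 0-based indices of a respaced pair
xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

# With oracle predictions the step lands on the same noising trajectory at t_prev,
# whatever the value of r.
x_prev = ddim_step(xt, eps, x0, 0.3, alpha_bar[t], alpha_bar[t_prev])
expected = np.sqrt(alpha_bar[t_prev]) * x0 + np.sqrt(1.0 - alpha_bar[t_prev]) * eps
print("max deviation:", np.max(np.abs(x_prev - expected)))
```

With exact predictions the result does not depend on the interpolation value; the choice only matters once the two outputs disagree, which is the regime the experiments study.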

An additional advantage of DDIM is that it was shown to be able to generate exceptionally well with far fewer iterations than the probabilistic sampling of [ho2020denoising]. For these reasons, our experiments will follow mostly this approach.

5 Experiments

We start by showcasing our method’s generation results from each independent output, $\epsilon_\theta$ and $x_\theta$, and show how the interpolation parameter $r_\theta$ affects the generation process. We then evaluate our model against existing state-of-the-art methods on multiple datasets, including CIFAR10 [krizhevsky2009cifar], CelebA [liu2015faceattributes], and ImageNet [deng2009imagenet], and perform ablation evaluations with each experiment.

5.1 Dual-Output Denoising

Figure 4: Progressive generation in 5 steps. From top to bottom: ① prediction using $\epsilon_\theta$ (subtractive), ② prediction using $x_\theta$ (additive), ③ our dual-output. The images generated with the dual-output method are overall cleaner and sharper.

Figure 5: Progressive generation in 10 steps. Each row is a different generation sequence from the same initial noise, with the intermediate results visualized. Steps marked in red and in blue are produced with different outputs. The sequences in the middle rows start with $x_\theta$ and switch to $\epsilon_\theta$ at some point. The sequence at the bottom represents our dynamic dual-output technique.
Figure 6: Mean value for the interpolation parameter $r_\theta$ over the generation steps, for our model trained on CIFAR10 (top) and CelebA (bottom).

As presented in Sec. 4, our proposed solution consists of a dual output model. One head predicts the noise $\epsilon_\theta$ while the second predicts the image $x_\theta$, and a third head, $r_\theta$, efficiently balances the two options. To understand how each method affects the iterative process, we visualize their intermediate results during each denoising process. For $x_\theta$, the intermediate result is simply the predicted image. For $\epsilon_\theta$, it is $x_0$ following Eq. 4.

These results can be seen in Fig. 4. Evidently, the two outputs produce two very different iterative processes, which to some extent act as opposites. The denoising process that uses $\epsilon_\theta$ produces a very noisy start and gradually removes noise from its previous estimates. In contrast, the process that relies on $x_\theta$ starts with a very blurry image, which resembles an average of many images, and iteratively adds content to it. These two different sequences inevitably result in two different final images, but it appears that the initial seed is a strong enough condition to guide them both in a similar direction in the image space.

In the bottom row of each grid in Fig. 4, we visualize the dual-output denoising process, driven by the interpolation parameter $r_\theta$.

Evidently, the interpolation magnitude in each step is dataset dependent. For example, the dual-output process in CIFAR10 starts very similarly to $x_\theta$, while that of CelebA is mixed. It can also be observed that the dual-output result is different from either of the two options. In CIFAR10, it can be observed that the dual-output produces less noisy images than $\epsilon_\theta$, and sharper images than $x_\theta$. For CelebA, we noticed that the dual-output image quality is higher. For example, pieces of hair are more refined and glasses are noticeable.

To better understand the denoising process with each output, we perform an additional experiment, where the method is switched between additive and subtractive at some point in the middle of the process. Fig. 5 depicts multiple sequences, where the model starts with $x_\theta$ and at some point continues the task with $\epsilon_\theta$ (this order is more natural than starting with $\epsilon_\theta$ and switching to $x_\theta$, see Sec. 3.1). In this figure, the steps surrounded by red boxes and the steps marked in blue are the intermediate results of progressing with each of the two outputs. The top and bottom rows each use a single output throughout. We again add the sequence produced by the interpolated results. In this example, generation using only one of the outputs failed to produce a pleasing image, and the other option was superior. However, it seems that some mixture of the two yields the best result. This is the motivation behind our adaptive interpolation, which allows the model to choose dynamically how to proceed.

A valid question would be “how much better is a dynamic interpolation parameter than learning a constant $r_t$ for each step $t$?”. To answer this, we measure $r_\theta$ at each step for multiple denoising processes. In Fig. 6, we show the average value of $r_\theta$ at each step $t$. The two plots show the average value in black, with the grey region marking the dynamic range of the parameter. We also show the interpolation value for 16 different trajectories in blue.

This visualization shows that there is a large variability in the trajectory that the interpolation values take. Moreover, when it comes to a particular generation process, our method usually prefers a different value than the overall average. We also evaluate image quality using a fixed $r_t$ in Sec. 5.2, and compare it to the dynamic $r_\theta$.

Interestingly, $r_\theta$ seems to behave differently for each dataset. In CIFAR10, interpolation is more clear-cut. It starts with a very high preference towards $x_\theta$, and somewhere around the middle of the process starts to drop fast towards $\epsilon_\theta$. On CelebA, $r_\theta$ begins at around 0.5 and maintains a relatively narrow dynamic range. Nevertheless, it also drops fast towards $\epsilon_\theta$ in the second half of the process. In both datasets, the model finishes with a very high preference for the subtractive process.

5.2 Image Quality

Figure 7: Generation on CIFAR10. Comparison between similar images for (a) 5, (b) 10, and (c) 20 steps. Top: DDIM, Bottom: Ours.

We conduct image quality evaluations on multiple datasets and baselines. In all evaluations, we use the implicit sampling formula proposed by Song et al. [song2020denoising], in order to maintain high quality with low iteration count. In each evaluation, we specify the number of denoising iterations. Timesteps were respaced uniformly, following [song2020denoising].
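Uniform respacing can be illustrated with a short sketch that picks an evenly spaced subset of the training timesteps to visit during sampling. The exact rounding rule used by the referenced implementations may differ; the function name and the choice to include both endpoints are assumptions made here.

```python
import numpy as np

def uniform_respacing(num_train_steps, num_sample_steps):
    """Return roughly evenly spaced 0-based timesteps, from noisiest to cleanest."""
    steps = np.linspace(0, num_train_steps - 1, num_sample_steps)
    return np.unique(np.round(steps).astype(int))[::-1]

for n in (5, 10, 20):
    print(n, uniform_respacing(1000, n))
```

Each consecutive pair in the returned sequence would then serve as the current and previous timestep of one implicit sampling step.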

We evaluate by generating 50K images for each dataset, which are then compared with the full training set for CIFAR10 and CelebA and the validation set for ImageNet. We evaluate the models on the basis of image quality measured with FID [heusel2017gans]. FID is known to be sensitive to even slightly different preprocessing and methodology [parmar2021cleanfid]. For reproducibility and reliability, we use the torch-fidelity library. For ImageNet, we also measure “improved precision and recall” [kynkaanniemi2019improved] over VGG feature manifolds between 10K real images and 50K generated images with k=3.

We compare our model to official pretrained models of DDPM [ho2020denoising], DDIM [song2020denoising], IDDPM [nichol2021improved], and ADM [dhariwal2021diffusion]. We also included IDDPM with implicit sampling (IDDIM), which uses the same official pretrained model, but applies the implicit backward process. Since DDPM and IDDPM enforce a different noising schedule (linear and cosine, respectively), we separate them and compare the models that were trained under equal conditions. DDIM used the pretrained model of DDPM for CIFAR10, but trained a new model for CelebA.

Each comparison to a baseline involves modifying the architecture and loss function to suit our method, and then training the model using the same hyperparameters. Training of each model was performed on 4 NVIDIA RTX 2080 Ti GPUs in a distributed fashion. CIFAR10 and CelebA were trained from scratch. On ImageNet, this was not feasible, since the baseline (ADM) took 4.36 million iterations on extremely high-end devices. Instead, we loaded the encoder with the pretrained weights of their model and trained the rest of the model (decoder and residual block) for 80 thousand iterations.

CIFAR10 and CelebA

In Tab. 1, we show the results of evaluation on CIFAR10 and CelebA. We perform the evaluations with 5, 10, 20, 50, and 100 iterations. As can be seen, our method outperforms the baselines on all metrics and under all respacing conditions, except for IDDPM with 100 iterations. The image quality inevitably declines with the reduction of denoising iterations, but our method maintains a significantly lower FID than the equivalent baselines.

When comparing to the ablation experiments, it can be observed that using a fixed $r_t$, which was taken to be the mean value as in Fig. 6, yields worse performance. The results for the additive and the subtractive paths reveal that no single path is always better than the other. CIFAR10 with the linear schedule measured a lower FID with $\epsilon_\theta$, but the cosine schedule and CelebA did better with $x_\theta$.

For a visual comparison, we show generated images of our method alongside DDIM, for 5, 10, and 20 steps, see Fig. 7. The images were not cherry-picked, but we did manually select samples in DDIM and our method, that looked relatively similar. For each image in our method, we show the most similar image from 100 generated images in DDIM. It can be seen that our method generated better-looking, more detailed and sharper images. It is also evident that more steps produce higher quality results.

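| Dataset (schedule) | Method | 5 | 10 | 20 | 50 | 100 |
|---|---|---|---|---|---|---|
| CIFAR10 32×32 (linear) | DDPM [ho2020denoising] | 196.54 | 160.18 | 145.45 | 65.43 | 32.65 |
| CIFAR10 32×32 (linear) | DDIM [song2020denoising] | 49.70 | 18.57 | 10.87 | 7.03 | 5.57 |
| CIFAR10 32×32 (linear) | ours | 35.12 | 11.68 | 8.62 | 6.68 | 5.54 |
| CIFAR10 32×32 (linear) | - fixed $r_t$ | 38.50 | 12.08 | 8.71 | 6.89 | 5.57 |
| CIFAR10 32×32 (linear) | - $\epsilon_\theta$ only | 41.99 | 12.30 | 8.74 | 7.11 | 6.01 |
| CIFAR10 32×32 (linear) | - $x_\theta$ only | 45.53 | 24.27 | 16.93 | 12.47 | 7.39 |
| CIFAR10 32×32 (cosine) | IDDPM [nichol2021improved] | ✗ | 29.10 | 13.33 | 5.73 | 4.58 |
| CIFAR10 32×32 (cosine) | IDDIM [nichol2021improved] | ✗ | 38.14 | 19.68 | 8.98 | 6.29 |
| CIFAR10 32×32 (cosine) | ours | ✗ | 18.25 | 12.54 | 5.59 | 5.10 |
| CIFAR10 32×32 (cosine) | - fixed $r_t$ | ✗ | 19.60 | 13.93 | 7.24 | 6.17 |
| CIFAR10 32×32 (cosine) | - $\epsilon_\theta$ only | ✗ | 36.78 | 17.08 | 8.85 | 6.72 |
| CIFAR10 32×32 (cosine) | - $x_\theta$ only | 45.75 | 19.75 | 13.21 | 7.13 | 5.93 |
| CelebA 64×64 (linear) | DDPM [ho2020denoising] | 304.89 | 278.31 | 160.67 | 88.74 | 43.90 |
| CelebA 64×64 (linear) | DDIM [song2020denoising] | 56.16 | 16.90 | 13.38 | 8.80 | 6.15 |
| CelebA 64×64 (linear) | ours | 26.22 | 14.96 | 8.74 | 5.54 | 4.07 |
| CelebA 64×64 (linear) | - fixed $r_t$ | 32.64 | 16.19 | 8.85 | 6.20 | 4.44 |
| CelebA 64×64 (linear) | - $\epsilon_\theta$ only | 64.82 | 27.53 | 12.64 | 9.03 | 8.68 |
| CelebA 64×64 (linear) | - $x_\theta$ only | 29.79 | 16.03 | 9.18 | 6.57 | 4.23 |

Baselines use the official pretrained models by [ho2020denoising], [nichol2021improved], and [song2020denoising].
Table 1: FID on CIFAR10 and CelebA for 5 to 100 denoising iterations. Results are separated by the applied noising schedule (linear or cosine). ✗ marks unstable conditions that produced NaNs, due to dividing by a very low $\sqrt{\bar{\alpha}_t}$.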
Figure 8: Image generation on ImageNet. Comparison of generated results for different paths on the same initial noise $x_T$.

Fig. 8 shows generated results with our model on ImageNet, for different images generated with the same initial noise $x_T$, but different denoising paths (the two single-output paths and “dual”). To qualitatively compare the images, we performed a user study, where the subjects were asked to select the most visually convincing image from the three options. Among 25 participants, our images were selected 78% of the time, with the two single-output paths selected 17% and 5% of the time.

Tab. 2 shows our evaluation on conditional ImageNet with 128×128 resolution. Since we did not train our model for nearly as long as the baseline, we do not compare our results to theirs, but only add them as reference. Our evaluation is focused on comparing the subtractive ($\epsilon_\theta$) and the additive ($x_\theta$) paths to the dual-output solution. Here, $\epsilon_\theta$ can act as a representative for the baseline, as that is the baseline’s method of choice.

The evaluation was performed on 25 and 50 denoising iterations, with and without the classifier guidance proposed by the baseline [dhariwal2021diffusion]. All evaluations were performed by generating 50 images per class (50K images in total), and comparing them to 50K validation images for FID and 10K images for precision and recall. The results show that the dual-output matches or outperforms each of the alternative paths on all three metrics. Also, again we observed that the results of $x_\theta$ were superior to $\epsilon_\theta$, which shows that the advantage of the subtractive path is circumstantial. Finally, while there is a considerable gap between our results and the ADM baseline, this evaluation solidifies our speculation about the dual-output process, and suggests that with enough training resources, it could surpass the baseline.

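| Guidance | Method (train steps) | FID (25 it.) | PR (25 it.) | RC (25 it.) | FID (50 it.) | PR (50 it.) | RC (50 it.) |
|---|---|---|---|---|---|---|---|
| none | ADM [dhariwal2021diffusion] (4.36M) | 11.7 | 0.92 | 0.14 | 7.6 | 0.92 | 0.21 |
| none | dual (80K) | 27.7 | 0.90 | 0.11 | 25.3 | 0.89 | 0.15 |
| none | - $\epsilon_\theta$ only (80K) | 51.3 | 0.89 | 0.08 | 49.1 | 0.86 | 0.09 |
| none | - $x_\theta$ only (80K) | 29.5 | 0.90 | 0.08 | 27.4 | 0.88 | 0.12 |
| classifier scale 1.0 | ADM [dhariwal2021diffusion] (4.36M) | 10.2 | 0.95 | 0.09 | 7.1 | 0.96 | 0.16 |
| classifier scale 1.0 | dual (80K) | 24.5 | 0.94 | 0.08 | 22.1 | 0.92 | 0.12 |
| classifier scale 1.0 | - $\epsilon_\theta$ only (80K) | 44.1 | 0.93 | 0.07 | 36.8 | 0.89 | 0.07 |
| classifier scale 1.0 | - $x_\theta$ only (80K) | 26.0 | 0.93 | 0.07 | 24.7 | 0.90 | 0.10 |

Table 2: Generation evaluation on ImageNet 128×128 with 25 and 50 denoising iterations, with and without classifier guidance, measuring FID, precision (PR), and recall (RC). Our models use the pretrained encoder from ADM.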

6 Discussion and limitations

While we are able to select an effective value for $r_\theta$ by considering the next-step measure derived from the loss in Eq. 19, this does not necessarily lead to optimal image quality at the end of the generation process. While one can intuitively expect such a greedy approach to be close to optimal, this requires validation. If the greedy approach turns out to be significantly suboptimal, a beam search approach may be able to improve image quality further.

From the societal perspective, the study of diffusion models has two immediate negative outcomes: environmental and harmful use. The environmental footprint of training high-resource neural networks is becoming a major concern. Our work enables the reduction of the number of iterations required to achieve a certain level of visual quality, thus lowering their computation cost. In addition, our experiments are done at a relatively modest energy cost, especially since we opted to train the ImageNet models only in part. The second concern is the ability to generate realistic fake media with generative methods. Our hope is that open academic study of generative models will raise public awareness of the associated risks and enable the development of methods for identifying fake images and audio.

7 Conclusions

When applying diffusion models, one can choose to transition to the next step by estimating either the slightly improved image after applying the current step or the target image. As we show, the accuracy of each of the two depends on the exact stage of the inference process. Moreover, the ideal trajectory varies depending on the specific sample, and for most of the process, mixing the two estimates for the next step provides better results.


Appendix A Model specifications and hyperparameters

The hyperparameters used across our experiments are the same as those of the compared baselines, in order to keep the evaluation consistent. Still, for clarity and completeness of this work, we indicate the hyperparameters used for each model version.

A.1 CIFAR10

In CIFAR10, we used the DDPM baseline [ho2020denoising]. The model is a UNet architecture, with the following hyperparameters. The UNet had a depth of 4 downsampling (and upsampling) blocks, with a base channel size of 128 and a channel multiplier of [1,2,2,2]. Each block contained a residual block with 2 residual layers, and an attention block at the 16x16 resolution. The model was subject to dropout of 0.1. Following the baseline, the linear noise schedule was from 1e-4 to 2e-2 in 1000 steps. Training was done on 4 GPUs, with a batch size of 128x4, for 1M iterations. The Adam optimizer was used with a learning rate of 2e-4, and an EMA decay of 0.9999.

For the cosine noise schedule, we used IDDPM [nichol2021improved]. The improved UNet model included a scale-shift GroupNorm instead of the standard GroupNorm, three residual layers in each block, attention on both the 16x16 and 8x8 resolutions, 4 attention heads instead of 1, and a cosine noise schedule.

A.2 CelebA

DDPM was also selected for the CelebA evaluation at 64x64 image resolution. The difference from its CIFAR10 counterpart is the addition of a fifth downsampling layer, with the same base channel size of 64x4, and a channel multiplier of 4. The model was trained for 500K iterations, using a batch size of 32, and the Adam optimizer with a learning rate of 1e-5.

A.3 ImageNet

The evaluated model on ImageNet is based on ADM [dhariwal2021diffusion]. This architecture has some major differences from the previous ones. The model had a class condition, which was added to the time condition. Instead of pooled downsampling and interpolated upsampling, a learned up/downsampling was applied through the residual block. In addition, each block has a base channel size of 256, with channel multipliers of [1,1,2,3,4]. Attention with 4 heads was applied at the 32x32, 16x16, and 8x8 resolutions. A dropout of 0.1 and scale-shift GroupNorm were used. The noise schedule was the default linear schedule. Finally, we trained only the decoder weights, while using the pretrained weights of the baseline for the rest of the model. The model was trained for 80K iterations, with a batch size of 32x4 (with 8 mini-batches of 4), using the Adam optimizer with a learning rate of 1e-5.

Appendix B Generated images

In addition to the images in the paper, we provide additional generated images for the various datasets, for further inspection.

Appendix C Progressive generation

Fig. 9 shows progressive generation results for CIFAR10 and CelebA. All grids show the intermediate results of the three paths, with the dual path in the middle row. It can be seen how in all cases, $\epsilon_\theta$ starts very noisy, while $x_\theta$ is blurry. All paths end with a similar image, but the dual method provides a sharper and less noisy result. In CelebA we noted more difference between the images. The additive path often produced darker images, and the noise in the final result of $\epsilon_\theta$ is very noticeable.

Appendix D Effect of iteration count

Fig. 10 shows generation for both CIFAR10 and CelebA with a different number of denoising iterations. The number of iterations increases from left to right (5, 10, 20, 50, 100). The effect of the number of iterations is very clear, as the image becomes more detailed and sharp when more denoising iterations are applied. Sometimes there is also a change in appearance, but an improvement in quality is always present. However, it can be seen that the change in quality is relatively small, and an already good image is achieved with few iterations.

Appendix E High quality ImageNet result (128×128)

Fig. 11 shows generated images from ImageNet, using 50 denoising iterations. There is a high variety in the images, and the class condition successfully represents the chosen category. Considering that the model was only finetuned for 80K steps, and the denoising is done with only 50 iterations, the image quality is quite good.

Figure 9: Progressive generation on (a) CIFAR10 and (b) CelebA. Each grid shows the two single-output paths and the dual path, with the dual path in the middle row.
Figure 10: Generation on (a) CIFAR10 and (b) CelebA. From left to right: 5, 10, 20, 50, 100 iterations.
Figure 11: Generation on ImageNet with 50 iterations.