1 Introduction
Diffusion models [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11] form a category of deep generative models that has recently become one of the hottest topics in computer vision (see Figure 1), showcasing impressive generative capabilities, ranging from the high level of detail to the diversity of the generated examples. We can even go as far as stating that these generative models raised the bar to a new level in the area of generative modeling, particularly referring to models such as Imagen [12] and Latent Diffusion Models (LDMs) [10]. To date, diffusion models have been applied to a wide variety of generative modeling tasks, such as image generation [1, 3, 2, 13, 4, 5, 14, 7, 15, 16, 17, 12, 6, 18, 10, 19, 20, 11, 21, 22]
, image super-resolution [12, 23, 24, 10, 25, 26], image inpainting [1, 3, 4, 27, 23, 28, 10, 25, 29], image editing [30, 31, 32], and image-to-image translation [33, 31, 34, 35, 36, 37], among others. Moreover, the latent representations learned by diffusion models were also found to be useful in discriminative tasks, e.g. image segmentation [38, 39, 40, 41], classification [42], and anomaly detection [43, 44, 45]. This confirms the broad applicability of denoising diffusion models, indicating that further applications are yet to be discovered. Additionally, the ability to learn strong latent representations creates a connection to representation learning [46, 47], a comprehensive domain that studies ways to learn powerful data representations, covering multiple approaches ranging from the design of novel neural architectures [48, 49, 50, 51] to the development of learning strategies [52, 53, 54, 55, 56, 57].

According to the graph shown in Figure 1, the number of papers on diffusion models is growing at a very fast pace. To outline the past and current achievements of this rapidly developing topic, we present a comprehensive review of articles on denoising diffusion models in computer vision. More precisely, we survey articles that fall in the category of generative models defined below. Diffusion models represent a category of deep generative models based on a forward diffusion stage, in which the input data is gradually perturbed over several steps by adding Gaussian noise, and a reverse (backward) diffusion stage, in which a generative model is tasked with recovering the original input data from the diffused (noisy) data by learning to gradually reverse the diffusion process, step by step.
We underline that there are at least three subcategories of diffusion models that comply with the above definition. The first subcategory comprises denoising diffusion probabilistic models (DDPMs) [2, 1], which are inspired by non-equilibrium thermodynamics. DDPMs are latent variable models that employ latent variables to estimate the probability distribution. From this point of view, DDPMs can be viewed as a special kind of variational auto-encoder (VAE) [49], where the forward diffusion stage corresponds to the encoding process inside the VAE, while the reverse diffusion stage corresponds to the decoding process. The second subcategory is represented by noise conditioned score networks (NCSNs) [3], which are based on training a shared neural network via score matching to estimate the score function (defined as the gradient of the log density) of the perturbed data distribution at different noise levels. Stochastic differential equations (SDEs) [4] represent an alternative way to model diffusion, forming the third subcategory of diffusion models. Modeling diffusion via forward and reverse SDEs leads to efficient generation strategies as well as strong theoretical results [58]. This latter formulation (based on SDEs) can be viewed as a generalization over DDPMs and NCSNs.

We identify several defining design choices and synthesize them into three generic diffusion modeling frameworks corresponding to the three subcategories introduced above. To put the generic diffusion modeling framework into context, we further discuss the relations between diffusion models and other deep generative models. More specifically, we describe the relations to variational auto-encoders (VAEs) [49], generative adversarial networks (GANs) [51], energy-based models (EBMs) [59, 60], autoregressive models [61] and normalizing flows [62, 63]. Then, we introduce a multi-perspective categorization of diffusion models applied in computer vision, classifying the existing models based on several criteria, such as the underlying framework, the target task, or the denoising condition. Finally, we illustrate the current limitations of diffusion models and envision some interesting directions for future research. For example, perhaps one of the most problematic limitations is the poor time efficiency during inference, caused by the very high number of evaluation steps, e.g. thousands, required to generate a sample [2]. Naturally, overcoming this limitation without compromising the quality of the generated samples represents an important direction for future research.

In summary, our contribution is twofold:

Since many contributions based on diffusion models have recently emerged in computer vision, we provide a comprehensive and timely literature review of denoising diffusion models applied in this area, aiming to offer our readers a fast understanding of the generic diffusion modeling framework.

We devise a multi-perspective categorization of diffusion models, aiming to help researchers working on diffusion models in a specific domain quickly find related works in that domain.
2 Generic Framework
Diffusion models are a class of probabilistic generative models that learn to reverse a process that gradually degrades the training data structure by adding noise at different scales. In the following three subsections, we present three formulations of diffusion models, namely denoising diffusion probabilistic models, noise conditioned score networks, and the approach based on stochastic differential equations that generalizes over the first two methods. For each formulation, we describe the process of adding noise to the data, the method which learns to reverse this process, and how new samples are generated at inference time. In Figure 2, all three formulations are illustrated as a generic framework. We dedicate the last subsection to discussing connections to other deep generative models.
2.1 Denoising Diffusion Probabilistic Models (DDPMs)
DDPMs [1, 2] slowly corrupt the training data using Gaussian noise. Let $p(x_0)$ be the data density, where the index $0$ denotes the fact that the data is uncorrupted (original). Given an uncorrupted training sample $x_0 \sim p(x_0)$, the noised versions $x_1, x_2, \dots, x_T$ are obtained according to the following Markovian process:

$$p(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t; \sqrt{1 - \beta_t} \cdot x_{t-1}, \beta_t \cdot I\right), \; \forall t \in \{1, \dots, T\}, \quad (1)$$

where $T$ is the number of diffusion steps, $\beta_1, \dots, \beta_T \in [0, 1)$ are hyperparameters representing the variance schedule across diffusion steps, $I$ is the identity matrix having the same dimensions as the input image $x_0$, and $\mathcal{N}(x; \mu, \sigma)$ represents the normal distribution of mean $\mu$ and covariance $\sigma$ that produces $x$. An important property of this recursive formulation is that it also allows the direct sampling of $x_t$, when $t$ is drawn from a uniform distribution, i.e. $t \sim \mathcal{U}(\{1, \dots, T\})$:

$$x_t = \sqrt{\hat{\beta}_t} \cdot x_0 + \sqrt{1 - \hat{\beta}_t} \cdot z_t, \quad (2)$$

where $\alpha_t = 1 - \beta_t$, $\hat{\beta}_t = \prod_{i=1}^{t} \alpha_i$, and $z_t \sim \mathcal{N}(0, I)$.
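To make the closed-form sampling in Eq. (2) concrete, the snippet below draws a noised version of the data directly from the clean sample for an arbitrary step: with $\alpha_t = 1 - \beta_t$ and the cumulative product $\hat{\beta}_t = \prod_{i \le t} \alpha_i$, one computes $x_t = \sqrt{\hat{\beta}_t} x_0 + \sqrt{1 - \hat{\beta}_t} z$. This is a minimal NumPy sketch under our own illustrative choices (a linear variance schedule and random data standing in for an image), not the implementation used in [2].

```python
import numpy as np

def forward_diffuse(x0, t, betas, rng):
    """Draw x_t directly from x_0 (Eq. 2): x_t = sqrt(b_hat)*x0 + sqrt(1-b_hat)*z,
    where b_hat is the cumulative product of alpha_i = 1 - beta_i up to step t."""
    alphas = 1.0 - betas
    beta_hat = np.cumprod(alphas)
    z = rng.standard_normal(x0.shape)          # z_t ~ N(0, I)
    xt = np.sqrt(beta_hat[t]) * x0 + np.sqrt(1.0 - beta_hat[t]) * z
    return xt, z

T = 1000
betas = np.linspace(1e-4, 0.02, T)             # linear schedule (illustrative choice)
rng = np.random.default_rng(0)
x0 = rng.standard_normal((3, 32, 32))          # random stand-in for an image
xT, _ = forward_diffuse(x0, T - 1, betas, rng)
```

Because $\hat{\beta}_T$ becomes vanishingly small after $T = 1000$ steps, $x_T$ is essentially pure Gaussian noise, which is exactly the property that lets generation start from $\mathcal{N}(0, I)$.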
After applying the recursive process in Eq. (1) for $T$ iterations, the distribution $p(x_T)$ should be well approximated by the standard Gaussian distribution $\pi(x_T) = \mathcal{N}(0, I)$, if the total noise added to the data is sufficiently large. Hence, we can generate new samples from $p(x_0)$ if we start from a sample $x_T \sim \pi(x_T)$ and follow the reverse steps $p(x_{t-1} \mid x_t)$. We can train a neural network $p_\theta(x_{t-1} \mid x_t)$ to approximate these steps. Moreover, Sohl-Dickstein et al. [1] observe that, if $\beta_t$ is small enough, then $p(x_{t-1} \mid x_t)$ can be considered as being a Gaussian distribution, meaning that the neural network only needs to estimate the mean $\mu_\theta(x_t, t)$ and the covariance $\Sigma_\theta(x_t, t)$.

The objective for training the neural network is the variational lower-bound of the density assigned to the data by the model, $p_\theta(x_0)$:

$$\mathcal{L}_{vlb} = -\log p_\theta(x_0 \mid x_1) + \mathrm{KL}\left(p(x_T \mid x_0) \,\|\, \pi(x_T)\right) + \sum_{t>1} \mathrm{KL}\left(p(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t)\right), \quad (3)$$

where KL denotes the Kullback-Leibler divergence between two probability distributions.

We present below an alternative objective employed in [2], which seems to increase the quality of the generated samples. Essentially, this objective trains a neural network $z_\theta(x_t, t)$ to estimate the noise from arbitrary examples $x_t$ drawn according to Eq. (2), as follows:

$$\mathcal{L}_{simple} = \mathbb{E}_{t \sim \mathcal{U}(\{1, \dots, T\}),\, x_0 \sim p(x_0),\, z_t \sim \mathcal{N}(0, I)} \left[ \left\| z_t - z_\theta(x_t, t) \right\|^2 \right], \quad (4)$$

where $\mathbb{E}$ is the expected value, and $z_\theta(x_t, t)$ is a network predicting the noise $z_t$ in $x_t$. In the latter case, Ho et al. [2] propose to fix the covariance $\Sigma_\theta(x_t, t)$ to a constant value and rewrite the mean $\mu_\theta(x_t, t)$ as a function of the noise, as follows:

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \hat{\beta}_t}} \cdot z_\theta(x_t, t) \right). \quad (5)$$
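The simplified objective in Eq. (4) and the noise-based reverse mean in Eq. (5) can be sketched in a few lines. The snippet below is our own illustration (a toy variance schedule and no actual network), assuming the standard notation $\alpha_t = 1 - \beta_t$ and $\hat{\beta}_t = \prod_{i \le t} \alpha_i$.

```python
import numpy as np

def simple_loss(z_pred, z_true):
    """Simplified DDPM objective (Eq. 4): mean squared error between the true
    noise z_t and the network prediction z_theta(x_t, t)."""
    return np.mean((z_pred - z_true) ** 2)

def reverse_mean(xt, z_pred, t, betas):
    """Mean of the reverse step (Eq. 5), written in terms of the predicted noise."""
    alphas = 1.0 - betas
    beta_hat = np.cumprod(alphas)
    return (xt - betas[t] / np.sqrt(1.0 - beta_hat[t]) * z_pred) / np.sqrt(alphas[t])

# Sanity check at the first step: if the noise is predicted perfectly,
# Eq. (5) recovers x_0 exactly from x_1.
betas = np.linspace(1e-4, 0.02, 10)
x0 = np.array([1.0, -2.0, 0.5])
z = np.array([0.3, -0.1, 0.7])
x1 = np.sqrt(1.0 - betas[0]) * x0 + np.sqrt(betas[0]) * z   # Eq. (2) at the first step
mu = reverse_mean(x1, z, 0, betas)
```

Here `mu` matches `x0` because, at the first step, subtracting the exactly-predicted noise fully undoes the corruption; at later steps, the reverse mean only partially denoises the sample, which is why generation iterates over all $T$ steps.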
2.2 Noise Conditioned Score Networks (NCSNs)
The score function of some data density $p(x)$ is defined as the gradient of the log density with respect to the input, $\nabla_x \log p(x)$. This score function is required by the Langevin dynamics method to generate samples from $p(x)$, as follows:

$$x_i = x_{i-1} + \frac{\gamma}{2} \cdot \nabla_x \log p(x_{i-1}) + \sqrt{\gamma} \cdot \omega_i, \quad (6)$$

where $i \in \{1, \dots, N\}$, $\omega_i \sim \mathcal{N}(0, I)$, $\gamma$ controls the magnitude of the update in the direction of the score, $x_0$ is sampled from a prior distribution, and the method is applied recursively for $N$ steps. Therefore, a generative model can employ the above method to sample from $p(x)$ after estimating the score with a neural network $s_\theta(x) \approx \nabla_x \log p(x)$. This network can be trained via score matching, a method that requires the optimization of the following objective:

$$\mathcal{L}_{sm} = \mathbb{E}_{x \sim p(x)} \left\| s_\theta(x) - \nabla_x \log p(x) \right\|_2^2. \quad (7)$$

In practice, it is impossible to minimize this objective directly because $\nabla_x \log p(x)$ is unknown. However, there are other methods, such as denoising score matching [64] and sliced score matching [65], that overcome this problem.
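Langevin dynamics itself is easy to demonstrate when the score is available analytically. The sketch below (our own toy example, not taken from the surveyed papers) samples from the 1-D Gaussian $\mathcal{N}(3, 1)$, whose score is simply $-(x - 3)$, by iterating Eq. (6).

```python
import numpy as np

def langevin_sample(score, x, gamma, n_steps, rng):
    """Langevin dynamics (Eq. 6):
    x_i = x_{i-1} + gamma/2 * score(x_{i-1}) + sqrt(gamma) * omega_i."""
    for _ in range(n_steps):
        omega = rng.standard_normal(x.shape)   # omega_i ~ N(0, I)
        x = x + 0.5 * gamma * score(x) + np.sqrt(gamma) * omega
    return x

# Target N(3, 1), so grad_x log p(x) = -(x - 3). Run 5000 independent chains.
rng = np.random.default_rng(0)
x_init = rng.standard_normal(5000)             # prior: standard Gaussian
samples = langevin_sample(lambda x: -(x - 3.0), x_init, gamma=0.1, n_steps=500, rng=rng)
```

After enough steps, the chains forget the prior and concentrate around the target. With a finite step size $\gamma$, the stationary distribution is only an approximation of the target, which is one motivation for the annealed variant described later in this subsection.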
Although the described approach can be used for data generation, Song et al. [3] emphasize several issues when applying this method on real data. Most of the problems are linked to the manifold hypothesis. For example, the score estimation is inconsistent when the data resides on a low-dimensional manifold and, among other implications, this could cause the Langevin dynamics to never converge to the high-density regions. In the same work [3], the authors demonstrate that these problems can be addressed by perturbing the data with Gaussian noise at different scales. Furthermore, they propose to learn score estimations for the resulting noisy distributions via a single noise conditioned score network (NCSN).

Regarding the sampling, they adapt the strategy in Eq. (6) and use the score estimations associated with each noise scale. Formally, given a sequence of Gaussian noise scales $\sigma_1 < \sigma_2 < \dots < \sigma_T$ such that $p_{\sigma_1}(x) \approx p(x)$ and $p_{\sigma_T}(x) \approx \mathcal{N}(0, I)$, we can train an NCSN $s_\theta(x, \sigma_t)$ with denoising score matching so that $s_\theta(x, \sigma_t) \approx \nabla_x \log p_{\sigma_t}(x)$, $\forall t \in \{1, \dots, T\}$. We can derive $\nabla_{x_t} \log p_{\sigma_t}(x_t \mid x)$ as follows:

$$\nabla_{x_t} \log p_{\sigma_t}(x_t \mid x) = -\frac{x_t - x}{\sigma_t^2}, \quad (8)$$

given that:

$$p_{\sigma_t}(x_t \mid x) = \mathcal{N}\left(x_t; x, \sigma_t^2 \cdot I\right) \propto \exp\left( -\frac{\|x_t - x\|_2^2}{2 \sigma_t^2} \right), \quad (9)$$

where $x_t$ is a noised version of $x$, and exp is the exponential function. Consequently, generalizing Eq. (7) for all $t \in \{1, \dots, T\}$ and replacing the gradient with the form in Eq. (8) leads to training the NCSN by minimizing the following objective, $\mathcal{L}_{dsm}$:

$$\mathcal{L}_{dsm} = \frac{1}{T} \sum_{t=1}^{T} \lambda(\sigma_t) \, \mathbb{E}_{p(x)} \, \mathbb{E}_{x_t \sim p_{\sigma_t}(x_t \mid x)} \left\| s_\theta(x_t, \sigma_t) + \frac{x_t - x}{\sigma_t^2} \right\|_2^2, \quad (10)$$

where $\lambda(\sigma_t)$ is a weighting function.
At inference time, the sampling is performed via the annealed Langevin dynamics algorithm. Essentially, Eq. (6) is adapted to use a different step size $\gamma_t$ for each noise scale $\sigma_t$, while transferring the output from one scale to the next.
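The annealed procedure can be illustrated on a toy density where every perturbed score is analytic: if the data follows $\mathcal{N}(0, 1)$, the $\sigma$-perturbed density is $\mathcal{N}(0, 1 + \sigma^2)$ with score $-x / (1 + \sigma^2)$. The step-size rule $\gamma_t \propto \sigma_t^2$ below follows the spirit of annealed Langevin dynamics, but the specific constants and schedule are our own illustrative choices.

```python
import numpy as np

def annealed_langevin(score, x, sigmas, eps, n_steps, rng):
    """Annealed Langevin dynamics: run Eq. (6) at each noise scale, from the
    largest sigma to the smallest, passing the output of one scale to the next."""
    for sigma in sigmas:
        gamma = eps * sigma ** 2 / sigmas[-1] ** 2   # step size shrinks with the noise level
        for _ in range(n_steps):
            omega = rng.standard_normal(x.shape)
            x = x + 0.5 * gamma * score(x, sigma) + np.sqrt(gamma) * omega
    return x

# Data N(0, 1); the perturbed density at scale sigma is N(0, 1 + sigma^2).
rng = np.random.default_rng(0)
sigmas = np.geomspace(10.0, 0.5, 15)           # largest scale first
x_init = 10.0 * rng.standard_normal(5000)      # initialize near the largest-scale density
samples = annealed_langevin(lambda x, s: -x / (1.0 + s ** 2),
                            x_init, sigmas, eps=0.01, n_steps=100, rng=rng)
```

The final samples approximate the smallest-scale density $\mathcal{N}(0, 1 + 0.5^2)$; shrinking the last scale $\sigma_T$ further would bring them arbitrarily close to the data distribution.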
2.3 Stochastic Differential Equations (SDEs)
Similar to the previous two methods, the approach presented in [4] gradually transforms the data distribution $p(x_0)$ into noise. However, it generalizes over the previous two methods because, in its case, the diffusion process is considered to be continuous, thus becoming the solution of a stochastic differential equation (SDE). As shown in [66], the reverse process of this diffusion can be modeled with a reverse-time SDE which requires the score function of the density at each time step. Therefore, the generative model of Song et al. [4] employs a neural network to estimate the score functions, and generates samples from $p(x_0)$ by employing numerical SDE solvers.

The SDE of the forward diffusion process $x(t)$, $t \in [0, T]$, has the following form:

$$\partial x = f(x, t) \cdot \partial t + g(t) \cdot \partial w, \quad (11)$$

where $w$ is Gaussian noise (the standard Wiener process), $f(x, t)$ is the drift coefficient, and $g(t)$ is the diffusion coefficient. The associated reverse-time SDE is defined as follows:

$$\partial x = \left[ f(x, t) - g(t)^2 \cdot \nabla_x \log p_t(x) \right] \partial t + g(t) \cdot \partial \bar{w}, \quad (12)$$

where $\bar{w}$ represents the Brownian motion when the time is reversed, from $T$ to $0$.

We can perform the training of the neural network $s_\theta(x, t)$ by optimizing the same objective as in Eq. (10), but adapted for the continuous case, as follows:

$$\mathcal{L}^*_{dsm} = \mathbb{E}_{t \sim \mathcal{U}([0, T])} \, \lambda(t) \, \mathbb{E}_{p(x_0)} \, \mathbb{E}_{p(x_t \mid x_0)} \left\| s_\theta(x_t, t) - \nabla_{x_t} \log p(x_t \mid x_0) \right\|_2^2, \quad (13)$$

where $\lambda(t)$ is a weighting function, and $t \sim \mathcal{U}([0, T])$. We underline that, when the drift coefficient $f(\cdot, t)$ is affine, $p(x_t \mid x_0)$ is a Gaussian distribution. When $f(\cdot, t)$ does not conform to this property, we can fall back to sliced score matching [65].
The sampling for this approach can be performed with any numerical method applied on the SDE from Eq. (12). Notably, in [4], the authors introduce the Predictor-Corrector sampler, which generates better examples. This algorithm first employs a numerical method to sample from the reverse-time SDE (the predictor), and then uses a score-based method as a corrector, for example the annealed Langevin dynamics described in the previous subsection. Furthermore, Song et al. [4] show that ordinary differential equations (ODEs) can also be used to model the reverse process. Hence, another sampling strategy unlocked by the SDE interpretation is based on numerical methods applied to ODEs. The main advantage of this latter strategy is its efficiency.
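As a concrete sketch of reverse-time sampling, the toy example below integrates the reverse-time SDE of Eq. (12) with the Euler-Maruyama method for a variance-preserving forward SDE with constant $\beta$ (our own choice, not the schedule used in [4]). The data distribution is $\mathcal{N}(2, 0.25)$, so every perturbed marginal stays Gaussian and the exact score is available in closed form, letting us bypass the score network entirely.

```python
import numpy as np

def reverse_sde_sample(score, x, beta, n_steps, rng):
    """Euler-Maruyama integration of the reverse-time SDE (Eq. 12) for the
    forward SDE dx = -0.5*beta*x dt + sqrt(beta) dw, from t = 1 down to t = 0."""
    dt = 1.0 / n_steps
    for i in range(n_steps, 0, -1):
        t = i * dt
        drift = -0.5 * beta * x - beta * score(x, t)   # f(x, t) - g(t)^2 * score
        x = x - drift * dt + np.sqrt(beta * dt) * rng.standard_normal(x.shape)
    return x

beta = 5.0

def score(x, t):
    # Under this forward SDE, the marginal of N(2, 0.25) data stays Gaussian:
    # mean 2*exp(-beta*t/2), variance 1 - 0.75*exp(-beta*t).
    mu_t = 2.0 * np.exp(-0.5 * beta * t)
    var_t = 1.0 - 0.75 * np.exp(-beta * t)
    return (mu_t - x) / var_t

rng = np.random.default_rng(0)
samples = reverse_sde_sample(score, rng.standard_normal(5000), beta, n_steps=1000, rng=rng)
```

Starting from the standard Gaussian prior at $t = 1$ and integrating backwards, the samples end up approximately distributed as the data density $\mathcal{N}(2, 0.25)$; replacing the analytic score with a trained network $s_\theta(x, t)$ yields the actual sampler of [4].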
2.4 Relation to Other Generative Models
We discuss below the connections between diffusion models and other types of generative models. We start with likelihood-based methods and finish with generative adversarial networks.
Diffusion models share several aspects with VAEs [49]. For instance, in both cases, the data is mapped to a latent space, and the generative process learns to transform the latent representations back into data. Moreover, in both situations, the objective function can be derived as a lower-bound of the data likelihood, as shown in Eq. (3). Nevertheless, there are essential differences between the two approaches, some of which we mention next. The latent representation of a VAE contains compressed information about the original image, while diffusion models destroy the data entirely after the last step of the forward process. The latent representations of diffusion models have the same dimensions as the original data, while VAEs work better when the dimensions are reduced. Finally, the mapping to the latent space of a VAE is trainable, which is not true for the forward process of diffusion models. These similarities and differences can be key to future developments of the two methods. For example, there is already some work that builds more efficient diffusion models by applying them in the latent space of a VAE [18, 17].
Autoregressive models [61, 67] represent images as sequences of pixels. Their generative process produces new samples by generating an image pixel by pixel, conditioned on the previously generated pixels. This approach implies a unidirectional bias that clearly represents a limitation of this class of generative models. Esser et al. [27]
see diffusion and autoregressive models as complementary and solve the above issue. Their method learns to reverse a multinomial diffusion process via a Markov chain where each transition is implemented as an autoregressive model. The global information provided to the autoregressive model is given by the previous step of the Markov chain.
Similar to diffusion models, normalizing flows [62, 63] map the data distribution to Gaussian noise. However, the similarities between the two methods end here, because normalizing flows perform the mapping in a deterministic fashion by learning an invertible and differentiable function. In contrast to diffusion models, these properties imply additional constraints on the network architecture, as well as a learnable forward process. A method that connects the two generative algorithms is DiffFlow [68], which extends both diffusion models and normalizing flows such that the forward and reverse processes are both trainable and stochastic.
Energy-based models (EBMs) [59, 60, 69, 70] focus on providing estimations of unnormalized versions of density functions, called energy functions. One popular strategy for training this type of model is score matching [69, 70]. Regarding the sampling, among other strategies, there is the Markov Chain Monte Carlo (MCMC) method, which is based on the score function. Therefore, the diffusion model formulation from Subsection 2.2 can be considered a particular case of the energy-based framework, namely the case in which training and sampling only require the score function.
GANs [51] were considered by many as the state-of-the-art generative models in terms of the quality of the generated samples, before the recent rise of diffusion models [5]. GANs are also known to be difficult to train due to their adversarial objective [71], and often suffer from mode collapse. In contrast, diffusion models have a stable training process and provide more diversity because they are likelihood-based. Despite these advantages, diffusion models are still inefficient compared to GANs, requiring multiple network evaluations during inference.
3 A Categorization of Diffusion Models
We categorize diffusion models into a multi-perspective taxonomy considering different criteria of separation. Perhaps the most important criteria to separate the models are the task they are applied to and the input signals they require. Furthermore, as there are multiple approaches to formulating a diffusion model, the underlying architecture is another key factor for classifying diffusion models. Finally, the data sets used during training and evaluation are also of high importance, because they facilitate comparing different baselines on the same task. Our categorization of diffusion models according to the criteria enumerated above is presented in Table I.
In the remainder of this section, we present several contributions on diffusion models, choosing the target task as the primary criterion to separate the methods. We opted for this classification criterion as it is fairly well-balanced and representative of research on diffusion models, facilitating a quick grasp of related works by readers working on specific tasks. Although the main task is usually image generation, a considerable amount of work has been conducted to match and even surpass the performance of GANs on other tasks, such as super-resolution, inpainting, image editing, image-to-image translation, and segmentation.
Paper  Task  Denoising Condition  Architecture  Data Sets 
Austin et al. [72]  image generation  unconditional  D3PM  CIFAR10 
Bao et al. [19]  image generation  unconditional  DDIM, Improved DDPM  CelebA, ImageNet, LSUN Bedroom, CIFAR10 
Benny et al. [73]  image generation  unconditional  DDPM, DDIM  CIFAR10, ImageNet, CelebA 
Bond et al. [74]  image generation  unconditional  DDPM  LSUN Bedroom, LSUN Church, FFHQ 
Choi et al. [75]  image generation  unconditional  DDPM  FFHQ, AFHQDog, CUB, MetFaces 
De et al. [76]  image generation  unconditional  DSB  MNIST, CelebA 
Deasy et al. [77]  image generation  unconditional  NCSN  MNIST, FashionMNIST, CIFAR10, CelebA 
Deja et al. [78]  image generation  unconditional  Improved DDPM  FashionMNIST, CIFAR10, CelebA 
Dockhorn et al. [20]  image generation  unconditional  NCSN++, DDPM++  CIFAR10 
Ho et al. [2]  image generation  unconditional  DDPM  CIFAR10, CelebAHQ, LSUN 
Huang et al. [58]  image generation  unconditional  DDPM  CIFAR10, MNIST 
Jing et al. [29]  image generation  unconditional  NCSN++, DDPM++  CIFAR10, CelebA256HQ, LSUN Church 
Jolicoeur et al. [79]  image generation  unconditional  NCSN  CIFAR10, LSUN Church, StackedMNIST 
Jolicoeur et al. [80]  image generation  unconditional  DDPM++, NCSN++  CIFAR10, LSUN Church, FFHQ 
Kim et al. [81]  image generation  unconditional  NCSN++, DDPM++  CIFAR10, CelebA, MNIST 
Kingma et al. [82]  image generation  unconditional  DDPM  CIFAR10, ImageNet 
Kong et al. [83]  image generation  unconditional  DDIM, DDPM  LSUN Bedroom, CelebA, CIFAR10 
Lam et al. [84]  image generation  unconditional  BDDM  CIFAR10, CelebA 
Liu et al. [85]  image generation  unconditional  DNPM  CIFAR10, CelebA 
Ma et al. [86]  image generation  unconditional  NCSN, NCSN++  CIFAR10, CelebA, LSUN Bedroom, LSUN Church, FFHQ 
Nachmani et al. [87]  image generation  unconditional  DDIM, DDPM  CelebA, LSUN Church 
Nichol et al. [6]  image generation  unconditional  DDPM  CIFAR10, ImageNet 
Pandey et al. [18]  image generation  unconditional  DDPM  CelebAHQ, CIFAR10 
San et al. [88]  image generation  unconditional  DDPM  CelebA, LSUN Bedroom, LSUN Church 
Sehwag et al. [89]  image generation  unconditional  ADM  CIFAR10, ImageNet 
Sohl et al. [1]  image generation  unconditional  DDPM  MNIST, CIFAR10, Dead Leaf Images 
Song et al. [13]  image generation  unconditional  NCSN  FFHQ, CelebA, LSUN Bedroom, LSUN Tower, LSUN Church Outdoor 
Song et al. [15]  image generation  unconditional  DDPM++  CIFAR10, ImageNet 3232 
Song et al. [7]  image generation  unconditional  DDIM  CIFAR10, CelebA, LSUN 
Vahdat et al. [17]  image generation  unconditional  NCSN++  CIFAR10, CelebAHQ, MNIST 
Wang et al. [90]  image generation  unconditional  DDIM  CIFAR10, CelebA 
Wang et al. [91]  image generation  unconditional  StyleGAN2, ProjectedGAN  CIFAR10, STL10, LSUN Bedroom, LSUN Church, AFHQ, FFHQ 
Watson et al. [92]  image generation  unconditional  DDPM  CIFAR10, ImageNet 
Watson et al. [8]  image generation  unconditional  Improved DDPM  CIFAR10, ImageNet 6464 
Xiao et al. [93]  image generation  unconditional  NCSN++  CIFAR10 
Zhang et al. [68]  image generation  unconditional  DDPM  CIFAR10, MNIST 
Zheng et al. [94]  image generation  unconditional  DDPM  CIFAR10, CelebA, CelebAHQ, LSUN Bedroom, LSUN Church 
Bordes et al. [95]  conditional image generation  conditioned on latent representations  Improved DDPM  ImageNet 
Campbell et al. [96]  conditional image generation  unconditional, conditioned on sound  DDPM  CIFAR10, Lakh Pianoroll 
Chao et al. [97]  conditional image generation  conditioned on class  Score SDE, Improved DDPM  CIFAR10, CIFAR100 
Dhariwal et al. [5]  conditional image generation  unconditional, classifier guidance  ADM  LSUN Bedroom, LSUN Horse, LSUN Cat 
Ho et al. [98]  conditional image generation  conditioned on label  DDPM  LSUN, ImageNet 
Ho et al. [99]  conditional image generation  unconditional, classifierfree guidance  ADM  ImageNet 6464, ImageNet 128128 
Karras et al. [100]  conditional image generation  unconditional, conditioned on class  DDPM++, NCSN++, DDPM, DDIM  CIFAR10, ImageNet 6464 
Liu et al. [101]  conditional image generation  conditioned on text, image, style guidance  DDPM  FFHQ, LSUN Cat, LSUN Horse, LSUN Bedroom 
Liu et al. [21]  conditional image generation  conditioned on text, 2D positions, relational descriptions between items, human facial attributes  Improved DDPM  CLEVR, Relational CLEVR, FFHQ 
Lu et al. [102]  conditional image generation  unconditional, conditioned on class  DDIM  CIFAR10, CelebA, ImageNet, LSUN Bedroom 
Salimans et al. [103]  conditional image generation  unconditional, conditioned on class  DDIM  CIFAR10, ImageNet, LSUN 
Singh et al. [104]  conditional image generation  conditioned on noise  DDIM  ImageNet 
Sinha et al. [16]  conditional image generation  unconditional, conditioned on label  D2C  CIFAR10, CIFAR100, fMoW, CelebA64, CelebAHQ256, FFHQ256 
Gu et al. [105]  texttoimage generation  conditioned on text  VQDiffusion  CUB200, Oxford 102 Flowers, MSCOCO 
Jiang et al. [22]  texttoimage generation  conditioned on text  Transformerbased encoderdecoder  DeepFashionMultiModal 
Ramesh et al. [106]  texttoimage generation  conditioned on text  ADM  MSCOCO, AVA 
Rombach et al. [11]  texttoimage generation  conditioned on text  LDM  OpenImages, WikiArt, LAION2Ben, ArtBench 
Saharia et al. [107]  texttoimage generation  conditioned on text  Imagen  MSCOCO, DrawBench 
Shi et al. [9]  texttoimage generation  unconditional, conditioned on text  Improved DDPM  Conceptual Captions, MSCOCO 
Zhang et al. [108]  texttoimage generation  unconditional, conditioned on text  DDIM  CIFAR10, CelebA, ImageNet 
Daniels et al. [24]  superresolution  conditioned on image  NCSN  CIFAR10, CelebA 
Saharia et al. [12]  superresolution  conditioned on image  DDPM++  FFHQ, CelebAHQ, ImageNet1K 
Avrahami et al. [109]  image editing  conditioned on image and mask  DDPM, ADM  ImageNet, CUB, LSUN Bedroom, MSCOCO 
Avrahami et al. [30]  region image editing  text guidance  DDPM  PaintByWord 
Meng et al. [32]  image editing  conditioned on image  Score SDE, DDPM, Improved DDPM  LSUN, CelebAHQ 
Lugmayr et al. [28]  inpainting  unconditional  DDPM  CelebAHQ, ImageNet 
Nichol et al. [14]  inpainting  conditioned on image, text guidance  ADM  MSCOCO 
Ho et al. [33]  imagetoimage translation  conditioned on image  Improved DDPM  ctest10k, places10k 
Li et al. [36]  imagetoimage translation  conditioned on image  DDPM  Face2Comic, Edges2Shoes, Edges2Handbags 
Sasaki et al. [110]  imagetoimage translation  conditioned on image  DDPM  CMP Facades, KAIST Multispectral Pedestrian 
Wang et al. [35]  imagetoimage translation  conditioned on image  DDIM  ADE20K, COCOStuff, DIODE 
Wolleb et al. [37]  imagetoimage translation  conditioned on image  Improved DDPM  BRATS 
Zhao et al. [34]  imagetoimage translation  conditioned on image  DDPM  CelebAHQ, AFHQ 
Amit et al. [41]  image segmentation  conditioned on image  Improved DDPM  Cityscapes, Vaihingen, MoNuSeg 
Baranchuk et al. [38]  image segmentation  conditioned on image  Improved DDPM  LSUN, FFHQ256, ADEBedroom30, CelebA19 
Batzolis et al. [23]  multitask (inpainting, superresolution, edgetoimage)  conditioned on image  DDPM  CelebA, Edges2Shoes 
Batzolis et al. [111]  multitask (image generation, superresolution, inpainting, imagetoimage translation)  unconditional  DDIM  ImageNet, CelebAHQ, CelebA, Edges2Shoes 
Blattmann et al. [112]  multitask (image generation)  unconditional, conditioned on text, class  LDM  ImageNet 
Choi et al. [31]  multitask (image generation, imagetoimage translation, image editing)  conditioned on image  DDPM  FFHQ, MetFaces 
Chung et al. [25]  multitask (inpainting, superresolution, MRI reconstruction)  conditioned on image  CCDF  FFHQ, AFHQ, fastMRI knee 
Esser et al. [27]  multitask (image generation, inpainting)  unconditional, conditioned on class, image and text  ImageBART  ImageNet, Conceptual Captions, FFHQ, LSUN 
Gao et al. [113]  multitask (image generation, inpainting)  unconditional, conditioned on image  DDPM  CIFAR10, LSUN, CelebA 
Graikos et al. [39]  multitask (image generation, image segmentation)  conditioned on class  DDIM  FFHQ256, CelebA 
Hu et al. [114]  multitask (image generation, inpainting)  unconditional, conditioned on image  VQDDM  CelebAHQ, LSUN Church 
Khrulkov et al. [115]  multitask (image generation, imagetoimage translation)  conditioned on class  Improved DDPM  AFHQ, FFHQ, MetFaces, ImageNet 
Kim et al. [116]  multitask (image translation, multiattribute transfer)  conditioned on image, portrait, stroke  DDIM  ImageNet, CelebAHQ, AFHQDog, LSUN Bedroom, Church 
Luo et al. [117]  multitask (point cloud generation, autoencoding, unsupervised representation learning)  conditioned on shape latent  DDPM  ShapeNet 
Lyu et al. [118]  multitask (image generation, image editing)  unconditional, conditioned on class  DDPM  CIFAR10, CelebA, ImageNet, LSUN Bedroom, LSUN Cat 
Preechakul et al. [119]  multitask (latent interpolation, attribute manipulation)  conditioned on latent representation  CelebAHQ 
Rombach et al. [10]  multitask (superresolution, image generation, inpainting)  unconditional, conditioned on image  VQDDM  ImageNet, CelebAHQ, FFHQ, LSUN 
Shi et al. [120]  multitask (superresolution, inpainting)  conditioned on image  Improved DDPM  MNIST, CelebA 
Song et al. [3]  multitask (image generation, inpainting)  unconditional, conditioned on image  NCSN  MNIST, CIFAR10, CelebA 
Song et al. [4]  multitask (image generation, inpainting, colorization)  unconditional, conditioned on image, class  NCSN++, DDPM++  CelebAHQ, CIFAR10, LSUN 
Hu et al. [121]  medical imagetoimage translation  conditioned on image  DDPM  ONH 
Chung et al. [122]  medical image generation  conditioned on measurements  NCSN++  fastMRI knee 
Özbey et al. [123]  medical image generation  conditioned on image  Improved DDPM  IXI, Gold Atlas - Male Pelvis 
Song et al. [124]  medical image generation  conditioned on measurements  NCSN++  LIDC, LDCT Image and Projection, BRATS 
Wolleb et al. [40]  medical image segmentation  conditioned on image  Improved DDPM  BRATS 
Pinaya et al. [43]  medical image segmentation and anomaly detection  conditioned on image  DDPM  MedNIST, UK Biobank Images, WMH, BRATS, MSLUB 
Wolleb et al. [44]  medical image anomaly detection  conditioned on image  DDIM  CheXpert, BRATS 
Wyatt et al. [45]  medical image anomaly detection  conditioned on image  ADM  NFBS, 22 MRI scans 
Harvey et al. [125]  video generation  conditioned on frames  FDM  GQNMazes, MineRL Navigate, CARLA Town01 
Ho et al. [126]  video generation  unconditional, conditioned on text  DDPM  101 Human Actions 
Yang et al. [127]  video generation  conditioned on video representation  RVD  BAIR, KTH Actions, Simulation, Cityscapes 
Höppe et al. [128]  video generation and infilling  conditioned on frames  RaMViD  BAIR, Kinetics600, UCF101 
Giannone et al. [129]  fewshot image generation  conditioned on image  Improved DDPM  CIFARFS, miniImageNet, CelebA 
Jeanneret et al. [130]  counterfactual explanations  unconditional  DDPM  CelebA 
Kawar et al. [26]  image restoration  conditioned on image  DDIM  FFHQ, ImageNet 
Özdenizci et al. [131]  image restoration  conditioned on image  DDPM  Snow100K, OutdoorRain, RainDrop 
Kim et al. [132]  image registration  conditioned on image  DDPM  Radboud Faces, OASIS3 
Nie et al. [133]  adversarial purification  conditioned on image  Score SDE, Improved DDPM, DDIM  CIFAR10, ImageNet, CelebAHQ 
Wang et al. [134]  semantic image generation  conditioned on semantic map  DDPM  Cityscapes, ADE20K, CelebAMaskHQ 
Zhou et al. [135]  shape generation and completion  unconditional, conditional shape completion  DDPM  ShapeNet, PartNet 
Zimmermann et al. [42]  classification  conditioned on label  DDPM++  CIFAR10 
3.1 Unconditional Image Generation
The diffusion models presented below are used to generate samples in an unconditional setting. Such models do not require supervision signals, being completely unsupervised. We consider this the most basic and generic setting for image generation.
The work of Sohl et al. [1] formalizes diffusion models as generative models that learn to reverse a Markov chain which transforms the data into white Gaussian noise. Their algorithm trains a neural network to predict the mean and covariance of each Gaussian distribution required to reverse the Markov chain. The neural network is based on a convolutional architecture containing multi-scale convolutions.
Ho et al. [2] extend the work presented in [1], proposing to learn the reverse process by estimating the noise in the image at each step. This change leads to an objective that resembles the denoising score matching applied in [3]. To predict the noise in an image, the authors use the PixelCNN++ architecture, which was introduced in [67].
Starting from a related work [3], Song et al. [13] present several improvements which are based on theoretical and empirical analysis. They address both training and sampling phases. For training, the authors show new strategies to choose the noise scales and how to incorporate the noise conditioning into NCSNs [3]. For sampling, they propose to apply exponential moving average to the parameters and select the hyperparameters for the Langevin dynamics such that the step size verifies a certain equation. The proposed changes unlock the application of NCSNs on highresolution images.
Jolicoeur-Martineau et al. [79] introduce an adversarial objective alongside denoising score matching to train score-based models. Furthermore, they propose a new sampling procedure called Consistent Annealed Sampling and prove that it is more stable than the annealed Langevin method. Their image generation experiments show that the new objective returns higher-quality examples without impacting diversity. The suggested changes are tested on the architectures proposed in [3, 13, 2].
Song et al. [15] improve the likelihood of score-based diffusion models. They achieve this through a new weighting function for the combination of the score matching losses. For their image generation experiments, they use the DDPM++ architecture introduced in [4].
The work of Sinha et al. [16] presents the diffusion-decoding model with contrastive representations (D2C), a generative method which trains a diffusion model on latent representations produced by an encoder. The framework, which is based on the DDPM architecture presented in [2], produces images by mapping the latent representations to images.
DiffFlow is introduced in [68] as a new generative modeling approach that combines normalizing flows and diffusion probabilistic models. From the perspective of diffusion models, the method has a sampling procedure that is up to 20 times more efficient, thanks to a learnable forward process which skips unneeded noise regions. The authors perform experiments using the same architecture as in [2].
In [76], the authors present a score-based generative model as an implementation of Iterative Proportional Fitting (IPF), a technique used to solve the Schrödinger bridge problem. This novel approach is tested on image generation, as well as data set interpolation, which is possible because the prior can be any distribution.
Austin et al. [72] extend previous work [1] on discrete diffusion models, studying different choices for the transition matrices used in the forward process. Their results on the image generation task are competitive with previous continuous diffusion models.
Vahdat et al. [17] train diffusion models on latent representations. They use a VAE to encode to and decode from the latent space. This work achieves up to 56 times faster sampling. For the image generation experiments, the authors employ the NCSN++ architecture introduced in [4].
On top of the work proposed in [2], Nichol et al. [6] introduce several improvements, observing that the linear noise schedule is suboptimal for low-resolution images. They propose a new schedule that avoids the fast destruction of information towards the end of the forward process. Further, they show that learning the variance is required to improve the performance of diffusion models in terms of log-likelihood. This last change also allows faster sampling, requiring only around 50 steps.
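The improved schedule can be sketched as follows; the cumulative signal level decays slowly near the end of the forward process, unlike the linear schedule, which destroys information too quickly.

```python
import numpy as np

def cosine_alpha_bar(T, s=0.008):
    # cosine cumulative signal level \bar{alpha}_t proposed in [6];
    # the small offset s avoids a degenerate first step
    t = np.arange(T + 1)
    f = np.cos((t / T + s) / (1 + s) * np.pi / 2) ** 2
    return f / f[0]

abar = cosine_alpha_bar(1000)
# recover the per-step noise levels, clipped as in [6]
betas = np.clip(1.0 - abar[1:] / abar[:-1], 0.0, 0.999)
```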
Song et al. [7] replace the Markov forward process used in [2] with a non-Markovian one. The generative process changes such that the model first predicts the clean sample, which is then used to estimate the next step in the chain. The change leads to a faster sampling procedure with a small impact on the quality of the generated samples. The resulting framework is known as the denoising diffusion implicit model (DDIM).
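A minimal sketch of one deterministic DDIM update illustrates the two stages: predict the clean sample, then re-noise it to the previous (less noisy) level.

```python
import numpy as np

def ddim_step(x_t, eps_pred, abar_t, abar_prev):
    # 1) predict the clean sample x0 from the current noisy sample
    x0_pred = (x_t - np.sqrt(1.0 - abar_t) * eps_pred) / np.sqrt(abar_t)
    # 2) move x0 deterministically to the previous noise level
    return np.sqrt(abar_prev) * x0_pred + np.sqrt(1.0 - abar_prev) * eps_pred
```

With a perfect noise prediction and `abar_prev = 1`, the step recovers the clean sample exactly.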
The work of Nachmani et al. [87] replaces the Gaussian noise distributions of the diffusion process with two other distributions, a mixture of two Gaussians and the Gamma distribution. The results show better FID values and faster convergence thanks to the higher modeling capacity of the Gamma distribution.
Lam et al. [84] learn a noise schedule for sampling, while the noise schedule for training remains linear as before. After training the score network, they assume it is close to the optimal one and use it to train the sampling noise schedule. Inference is composed of two steps: first, the schedule is determined by fixing two initial hyperparameters; second, the usual reverse process is run with the determined schedule.
Bond-Taylor et al. [74] present a two-stage process: they apply vector quantization to images to obtain discrete representations, then use a transformer [136] to reverse a discrete diffusion process, in which elements are randomly masked at each step. The sampling process is fast because the diffusion is applied to a highly compressed representation, which allows fewer denoising steps (50-256).

In [88], the authors present a method to estimate the noise parameters given the current input at inference time. Their change improves the FID measure, while requiring fewer steps. The authors employ VGG-11 to estimate the noise parameters, and DDPM [2] to generate images.
Jolicoeur-Martineau et al. [80] introduce a new SDE solver that is several times faster than Euler-Maruyama and does not affect the quality of the generated images. The solver is evaluated in a set of image generation experiments with pretrained models from [3].
Watson et al. [92] propose a dynamic programming algorithm that efficiently finds the optimal inference schedule for a given number of steps. They conduct their image generation experiments on CIFAR-10 and ImageNet, using the DDPM architecture.
Wang et al. [90] present a new deep generative model based on the Schrödinger bridge. This is a two-stage method: the first stage learns a smoothed version of the target distribution, and the second stage derives the actual target.
Kingma et al. [82] introduce a class of diffusion models that obtain state-of-the-art likelihoods on image density estimation. They add Fourier features to the input of the noise prediction network, and investigate if the observed improvement is specific to this class of models. Their results confirm the hypothesis, i.e. previous state-of-the-art models did not benefit from this change. As a theoretical contribution, they show that the diffusion loss is impacted by the signal-to-noise ratio function only through its extremes.
Focusing on score-based models, Dockhorn et al. [20] utilize a critically-damped Langevin diffusion process by adding another variable (velocity) to the data, which is the only source of noise in the process. Given the new diffusion space, the resulting score function is demonstrated to be easier to learn. The authors extend their work by developing a more suitable score objective called hybrid score matching, as well as a sampling method, by solving the SDE through integration. The authors adapt the NCSN++ and DDPM++ architectures to accept both data and velocity, evaluating them on unconditional image generation and outperforming similar score-based diffusion models.
Xiao et al. [93] try to improve the sampling speed, while also maintaining the quality, coverage and diversity. Their approach integrates a GAN into the denoising process to discriminate between real samples (from the forward process) and fake ones (denoised samples from the generator), with the objective of minimizing the softened reverse KL divergence [137]. This setup is modified by directly generating the clean (fully denoised) sample and conditioning the fake example on it. Using the NCSN++ architecture with adaptive group normalization layers for the GAN generator, they achieve similar Fréchet Inception Distance (FID) values in both image synthesis and stroke-based image generation, at sampling rates about 20 to 2000 times faster than other diffusion models.
Motivated by the limitations that the Gaussian noise distribution imposes on high-dimensional score-based diffusion models, Deasy et al. [77] extend denoising score matching to the generalized normal noising distribution. By employing a heavier-tailed distribution, their experiments on several data sets show promising results, as the generative performance improves in certain cases (depending on the shape of the distribution). An important scenario in which the method excels is on data sets with imbalanced classes.
Following the work in [113], Bao et al. [19] propose a training-free inference framework based on non-Markovian diffusion processes. By first deriving an analytical estimate of the optimal mean and variance with respect to a score function, and using a pretrained score-based model to obtain score values, they show better results, while being 20 to 40 times more time-efficient. The score is approximated by Monte Carlo sampling. However, it is clipped within some precomputed bounds in order to diminish any bias of the pretrained model.
Watson et al. [8] begin by presenting how a reparameterization trick can be integrated into the backward process of diffusion models in order to optimize a family of fast samplers. Using the Kernel Inception Distance as loss function, they show how the optimization can be performed via stochastic gradient descent. Next, they propose a parameterized family of samplers which, using the same procedure, can achieve competitive results with fewer sampling steps. Using FID and Inception Score (IS) as metrics, the method outperforms several diffusion model baselines.
Zheng et al. [94] suggest truncating the forward process at an arbitrary step, and propose a method to invert the diffusion from this distribution by relaxing the constraint of having Gaussian random noise as the final output of the forward diffusion. To solve the issue of starting the reverse process from an intractable distribution, an implicit generative distribution is used to match the distribution of the diffused data. The proxy distribution is fit either through a GAN or through conditional transport. We note that the generator utilizes the same UNet model as the sampler of the diffusion model, thus not adding extra parameters to be trained.
Jing et al. [29] try to shorten the sampling process of diffusion models by reducing the space in which the diffusion is performed, i.e. the larger the time step in the diffusion process, the smaller the subspace. The data is projected onto a finite set of subspaces, at specific times, each being associated with a score model. This reduces the computational costs, while increasing the performance. The work is limited to natural image synthesis. Evaluating the method on unconditional image generation, the authors achieve similar or better performance compared with state-of-the-art models, while having a lower inference time. The method is demonstrated to work for the inpainting task as well.
Kim et al. [81] propose to change the diffusion process into a nonlinear one. This is achieved by using a trainable normalizing flow model which encodes the image in the latent space, where it can now be linearly diffused to the noise distribution. A similar logic is then applied to the denoising process. This method is applied on the NCSN++ and DDPM++ frameworks, while the normalizing flow model is based on ResNet.
Deja et al. [78] begin by analyzing the backward process of a diffusion model and postulate that it comprises two parts, a generator and a denoiser. Thus, they propose to explicitly split the process into two components: a denoiser implemented via an autoencoder, and a generator implemented via a diffusion model. Both models use the same UNet architecture.
Wang et al. [91] start from the idea presented by Arjovsky et al. [138] and Sønderby et al. [139] to augment the input data of the discriminator by adding noise. This is achieved in [91] by injecting noise from a Gaussian mixture distribution composed of weighted diffused samples from the clean image at various time steps. The noise injection mechanism is applied to both real and fake images. The experiments are conducted on a wide range of data sets to cover multiple resolutions and higher diversity.
Ma et al. [86] aim to make the backward diffusion process more time-efficient, while maintaining the synthesis performance. Within the family of score-based diffusion models, they analyze the reverse diffusion in the frequency domain, subsequently applying a space-frequency filter to the sampling process, which aims to integrate information about the target distribution into the initial noise sample. The authors conduct experiments with NCSN [3] and NCSN++ [4], where the proposed method clearly shows speed improvements in image synthesis (requiring up to 20 times fewer sampling steps), while keeping the same satisfactory generation quality for both low-resolution and high-resolution images.

3.2 Conditional Image Generation
We next showcase diffusion models applied to conditional image synthesis. The condition is commonly based on various source signals, class labels being used in most cases. Methods performing both unconditional and conditional generation are also discussed here.
Dhariwal et al. [5] introduce a few architectural changes to improve the FID of diffusion models. They also propose classifier guidance, a strategy that uses the gradients of a classifier to guide the diffusion during sampling. They conduct both unconditional and conditional image generation experiments.
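At the level of a single reverse step, classifier guidance amounts to shifting the mean of the Gaussian transition by the classifier gradient; a minimal sketch (with the gradient passed in as an argument) is:

```python
import numpy as np

def guided_mean(mu, sigma2, grad_log_p, scale=1.0):
    # classifier guidance: shift the mean of the reverse Gaussian step
    # by the step variance times the (scaled) gradient of the
    # classifier's log-probability for the target class
    return mu + scale * sigma2 * grad_log_p

mu_new = guided_mean(np.zeros(3), 2.0, np.full(3, 0.5), scale=1.0)
```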
Ho et al. [99] introduce a guidance method that does not require a classifier. It needs only a conditional diffusion model and an unconditional one, but the same network is used to learn both. The unconditional model is trained with the class identifier set to zero. The idea is based on the implicit classifier derived from the Bayes rule.
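At sampling time, the two predictions are combined by extrapolation; a sketch of one common parameterization of this combination is:

```python
def classifier_free_guidance(eps_cond, eps_uncond, w):
    # extrapolate from the unconditional noise prediction toward the
    # conditional one; w = 0 recovers the unconditional model, and
    # larger w strengthens the conditioning signal
    return eps_uncond + w * (eps_cond - eps_uncond)
```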
Kong et al. [83] define a bijection between the continuous diffusion steps and the noise levels. With the defined bijection, they are able to construct an approximate diffusion process which requires fewer steps. The method is tested with the previous DDIM [7] and DDPM [2] architectures on image generation.
Pandey et al. [18] build a generatorrefiner framework, where the generator is a VAE and the refiner is a DDPM conditioned by the output of the VAE. The latent space of the VAE can be used to control the content of the generated image because the DDPM only adds the details. After training the framework, the resulting DDPM is able to generalize to different noise types. More specifically, if the reverse process is not conditioned on the VAE’s output, but on different noise types, the DDPM is able to reconstruct the initial image.
Liu et al. [85] investigate the usage of conventional numerical methods to solve the ODE formulation of the reverse process. They find that these methods return lower quality samples compared with previous approaches. Therefore, they introduce pseudo-numerical methods for diffusion models. Their idea splits the numerical methods into two parts, the gradient part and the transfer part. The transfer part (which is linear in standard methods) is replaced such that the result is as close as possible to the target manifold. As a last step, they show how this change solves the problems discovered when using conventional approaches.
Ho et al. [98] introduce Cascaded Diffusion Models (CDM), an approach for generating high-resolution images conditioned on ImageNet classes. Their framework contains multiple diffusion models, where the first model in the pipeline generates low-resolution images conditioned on the image class. The subsequent models are responsible for generating images of increasingly higher resolutions, being conditioned on both the class and the low-resolution image.
Bordes et al. [95] examine representations resulting from self-supervised tasks by visualizing them and comparing them to the original image. They also compare representations generated from different sources. To this end, a diffusion model is used to generate samples conditioned on these representations. The authors implement several modifications to the UNet architecture presented by Dhariwal et al. [5], such as adding conditional batch normalization layers, and mapping the vector representation through a fully connected layer.
Tachibana et al. [140] address the slow sampling of DDPMs. They propose to decrease the number of sampling steps by increasing the order (from one to two) of the stochastic differential equation solver used for denoising. While preserving the network architecture and the score matching function, they adopt the Itô-Taylor expansion scheme for the sampler, and substitute some derivative terms to simplify the computation. They reduce the number of backward steps while retaining the performance. An additional contribution is a new noise schedule.
Benny et al. [73] study the advantages and disadvantages of predicting the image instead of the noise during the reverse process. They conclude that some of the discovered problems could be addressed by interpolating the two types of output. They modify previous architectures to return both the noise and the image, as well as a value that controls the importance of the noise when performing the interpolation. The strategy is evaluated on top of the DDPM and DDIM architectures.
The method presented in [89] allows diffusion models to produce images from low-density regions of the data manifold. Two new losses guide the reverse process: the first steers the diffusion towards low-density regions, while the second enforces the diffusion to stay on the manifold. Moreover, the authors demonstrate that the diffusion model does not memorize the examples from the low-density neighborhoods, generating novel images instead. They employ an architecture similar to that of Dhariwal et al. [5].
Choi et al. [75] investigate the impact of the noise levels on the visual concepts learned by diffusion models. They modify the conventional weighting scheme of the objective function to a new one that enforces the diffusion models to learn rich visual concepts. The method groups the noise levels into three categories (coarse, content and clean-up) according to the signal-to-noise ratio (SNR), i.e. small SNR corresponds to coarse, medium SNR to content, and large SNR to clean-up. The weighting function assigns lower weights to the last group.
Karras et al. [100] try to separate score-based diffusion models into individual components that are independent of each other. This separation allows modifying a single part without affecting the other units, thus facilitating the improvement of diffusion models. Using this framework, the authors first present a sampling process that uses Heun's method as the ODE solver, which reduces the number of neural function evaluations while maintaining the FID score. They further show that a stochastic sampling process brings great performance benefits. The second contribution is related to training the score-based model by preconditioning the neural network on its input and the corresponding targets, as well as using image augmentation.
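For reference, one step of Heun's second-order method (here on a generic derivative function) consists of an Euler predictor followed by a trapezoidal corrector, at the cost of two derivative evaluations:

```python
def heun_step(x, t, t_next, dxdt):
    # Euler predictor
    d1 = dxdt(x, t)
    x_euler = x + (t_next - t) * d1
    # trapezoidal corrector using the derivative at the predicted point
    d2 = dxdt(x_euler, t_next)
    return x + (t_next - t) * 0.5 * (d1 + d2)
```

On the toy ODE dx/dt = -x, a single Heun step is noticeably closer to the exact solution than plain Euler.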
Within the context of both unconditional and class-conditional image generation, Salimans et al. [103] propose a technique for reducing the number of sampling steps. They distill the knowledge of a trained teacher model, represented by a deterministic DDIM, into a student model that has the same architecture, but half the number of sampling steps. In other words, the target of the student is to reproduce two consecutive steps of the teacher in a single step. This process can be repeated until the desired number of sampling steps is reached, while maintaining the same image synthesis quality. Finally, three versions of the model and two loss functions are explored in order to facilitate the distillation process and reduce the number of sampling steps (from 8192 to 4).
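The construction of one distillation target can be sketched as follows, with `teacher_step` standing in for a deterministic teacher sampler:

```python
def student_target(teacher_step, x_t, t):
    # the student learns to reproduce, in a single step, the result of
    # two consecutive deterministic teacher steps (t -> t-1 -> t-2)
    x_mid = teacher_step(x_t, t)
    return teacher_step(x_mid, t - 1)
```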
The works of Song et al. [4] and Dhariwal et al. [5] on score-based conditional diffusion models with classifier guidance inspired Chao et al. [97] to develop a new training objective which reduces the potential discrepancy between the score model and the true score. The loss of the classifier is modified into a scaled cross-entropy added to a modified score matching loss.
Singh et al. [104] propose a novel method for conditional image generation. Instead of conditioning the model throughout the sampling process, they condition the initial noise signal (from which the sampling starts). Using Inverting Gradients [141], the noise is injected with information about the localization and orientation of the conditioned class, while maintaining the same random Gaussian distribution.
Campbell et al. [96] present a continuous-time formulation of denoising diffusion models that is capable of working with discrete data. The work models the forward continuous-time Markov chain diffusion process via a transition rate matrix, and the backward denoising process via a parametric approximation of the inverse transition rate matrix. Further contributions are related to the training objective, the matrix construction, and an optimized sampler.
The interpretation of diffusion models as ODEs proposed by Song et al. [4] is reformulated by Lu et al. [102] into a form that can be solved with an exponential integrator. Other contributions of this work are an ODE solver that approximates the integral term of the new formulation using Taylor expansion (from first to third order), and an algorithm that adapts the time step schedule, being 4 to 16 times faster.
Describing the similar functionality of diffusion models and energy-based models, and leveraging the compositional structure of the latter, Liu et al. [21] propose to combine multiple diffusion models for conditional image synthesis. In the reverse process, the composition of multiple diffusion models, each associated with a different condition, can be achieved either through conjunction or negation.
3.3 Text-to-Image Synthesis
A considerable number of diffusion models conditioned on text have been developed so far, demonstrating how interesting the task of text-to-image synthesis is. Perhaps the most impressive results of diffusion models are attained on text-to-image synthesis, where the capability of combining unrelated concepts, such as objects, shapes and textures, to generate unusual examples comes to light. To confirm this statement, we used Imagen [107] to generate an image of “a stone rabbit statue sitting on the moon”, and the result is shown in Figure 3.
Gu et al. [105] introduce the VQ-Diffusion model, a method for text-to-image synthesis that does not have the unidirectional bias of previous approaches. With its masking mechanism, the proposed method avoids the accumulation of errors during inference. The model has two stages: the first stage is a VQ-VAE that learns to represent an image via discrete tokens, and the second stage is a discrete diffusion model that operates on the discrete latent space of the VQ-VAE. The training of the diffusion model is conditioned on the embedding of a caption. Inspired by masked language modeling, some tokens are replaced with a [mask] token.
Avrahami et al. [30] present a text-conditional diffusion model conditioned on CLIP [142] image embeddings generated from CLIP text embeddings. This is a two-stage approach, where the first stage generates the image embedding, and the second stage (decoder) produces the final image conditioned on the image embedding and the text caption. To generate image embeddings, the authors use a diffusion model in the latent space. They perform a subjective human assessment to evaluate their generative results.
Imagen is introduced in [107] as an approach for text-to-image synthesis. It consists of one encoder for the text sequence and a cascade of diffusion models for generating high-resolution images. These models are also conditioned on the text embeddings returned by the encoder. Moreover, the authors introduce a new set of captions (DrawBench) for text-to-image evaluations. Regarding the architecture, the authors develop Efficient UNet to improve efficiency, and apply this architecture in their text-to-image generation experiments.
Addressing the slow sampling of diffusion models, Zhang et al. [108] focus on a new discretization scheme that reduces the error and allows a greater step size, i.e. a lower number of sampling steps. By using high-order polynomial extrapolations of the score function and an exponential integrator for solving the reverse SDE, the number of network evaluations is drastically reduced, while maintaining the generation capabilities.
Shi et al. [9] combine a VQ-VAE [143] and a diffusion model to generate images. Starting from the VQ-VAE, the encoding functionality is preserved, while the decoder is replaced by a diffusion model. The authors use the UNet architecture from [6], injecting the image tokens into the middle block.
Building on top of the work presented in [112], Rombach et al. [11] introduce a modification to create artistic images using the same procedure: extract the k-nearest neighbors of the image in the CLIP [142] latent space from a database, then generate a new image by guiding the reverse denoising process with these embeddings. As the CLIP latent space is shared by text and images, the diffusion can be guided by text prompts as well. However, at inference time, the database is replaced with another one that contains artistic images. Thus, the model generates images within the style of the new database.
Jiang et al. [22] present a framework for generating images of full-body humans with rich clothing, given three inputs: a human pose, a text description of the clothes’ shape, and another text describing the clothing texture. The first stage of the method encodes the former text prompt into an embedding vector and infuses it into the (encoder-decoder based) module that generates a map of forms. In the second stage, a diffusion-based transformer samples an embedded representation of the latter text prompt from multiple multi-level codebooks (each specific to a texture), a mechanism suggested in VQ-VAE [143]. Initially, the codebook indices at coarser levels are sampled, and then, using a feed-forward network, the finer level indices are predicted. The text is encoded using Sentence-BERT [144].
3.4 Image Super-Resolution
Saharia et al. [12] apply diffusion models to super-resolution. Their reverse process learns to generate high-quality images conditioned on low-resolution versions. This work employs the architectures presented in [2, 6] and the following data sets: CelebA-HQ, FFHQ and ImageNet.
Daniels et al. [24] use score-based models to sample from the Sinkhorn coupling of two distributions. Their method models the dual variables with neural networks, then solves the optimal transport problem. After the neural networks are trained, the sampling can be performed via Langevin dynamics and a score-based model. They run experiments on image super-resolution using a UNet architecture.
3.5 Image Editing
Meng et al. [32] utilize diffusion models in various guided image generation tasks, i.e. stroke painting, stroke-based editing and image composition. Starting from an image that contains some form of guidance, its properties (such as shapes and colors) are preserved, while the deformations are smoothed out by progressively adding noise (the forward process of the diffusion model). Then, the result is denoised (the reverse process) to create a realistic image according to the guidance. Images are synthesized with a generic diffusion model by solving the reverse SDE, without requiring any custom data set or training modifications.
One of the first approaches for editing specific regions of images based on natural language descriptions is introduced in [30]. The regions to be modified are specified by the user via a mask. The method relies on CLIP guidance to generate an image according to the text input, but the authors observe that combining the output with the original image at the end does not produce globally coherent images. Hence, they modify the denoising process to fix the issue. More precisely, after each step, the authors apply the mask on the latent image, while also adding the noisy version of the original image.
Extending the work presented in [10], Avrahami et al. [109] apply latent diffusion models for editing images locally, using text. A VAE encodes the image and the adaptive-to-time mask (the region to edit) into the latent space, where the diffusion process occurs. Each sample is iteratively denoised, while being guided by the text within the region of interest. Inspired by Blended Diffusion [30], the image is combined with the masked region in the latent space, noised at the current time step. Finally, the sample is decoded with the VAE to generate the new image. The method demonstrates superior performance, while also being faster.
3.6 Image Inpainting
Nichol et al. [14] train a diffusion model conditioned on text descriptions and study the effectiveness of classifier-free and CLIP-based guidance. They obtain better results with the first option. Moreover, they fine-tune the model for image inpainting, unlocking image modifications based on text input.
Lugmayr et al. [28] present an inpainting method that is agnostic to the mask form. They use an unconditional diffusion model, but modify its reverse process. They produce the image at step t-1 by sampling the known region from the diffused masked image, and the unknown region by applying denoising to the image obtained at step t. With this procedure, the authors observe that the unknown region has the right structure, while still being semantically incorrect. They solve this issue by repeating the proposed step a number of times: at each iteration, they replace the image at step t with a new sample obtained by diffusing the denoised version generated at step t-1.
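The combination of the known and unknown regions at each step can be sketched as follows (a simplified version; mask handling and resampling details follow [28]):

```python
import numpy as np

def combine(x_known_t, x_denoised_t, mask):
    # known pixels (mask == 0) come from the diffused original image;
    # missing pixels (mask == 1) come from the model's denoised sample
    return (1 - mask) * x_known_t + mask * x_denoised_t

x = combine(np.array([5.0, 5.0]), np.array([7.0, 7.0]), np.array([0.0, 1.0]))
```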
3.7 Image-to-Image Translation
Saharia et al. [33] propose a framework for image-to-image translation with diffusion models, focusing on four tasks: colorization, inpainting, uncropping and JPEG restoration. The framework is the same across all four tasks, meaning that it does not require custom changes for each task. The authors begin by comparing the L1 and L2 losses, suggesting that L2 is preferred, as it leads to a higher sample diversity. Finally, they reconfirm the importance of self-attention layers in conditional image synthesis.
To translate between unpaired sets of images, Sasaki et al. [110] propose a method involving two jointly trained diffusion models. During the reverse denoising process, at every step, each model is also conditioned on the other’s intermediate sample. Furthermore, the loss function of the diffusion models is regularized using the cycle-consistency loss [145].
The aim of Zhao et al. [34] is to improve current image-to-image translation score-based diffusion models by utilizing data from a source domain with equal significance. An energy-based function trained on both source and target domains is employed to guide the SDE solver. This leads to generating images that preserve the domain-agnostic features, while translating characteristics specific to the source domain to the target domain. The energy function is based on two feature extractors, each specific to a domain.
Leveraging the power of pretraining, Wang et al. [35] employ the GLIDE model [14] and train it to obtain a rich semantic latent space. Starting from the pretrained version and replacing the head to adapt to any conditional input, the model is fine-tuned on a specific downstream image generation task. This is done in two steps: first, the decoder is frozen and only the new encoder is trained; second, the two are trained simultaneously. Finally, the authors employ adversarial training and normalize the classifier-free guidance to enhance the generation quality.
Li et al. [36] introduce a diffusion model for image-to-image translation that is based on Brownian bridges, as well as GANs. The process begins by encoding the image with a VQ-GAN [146]. Within this quantized latent space, the diffusion process, formulated as a Brownian bridge, maps between the latent representations of the source and target domains. Finally, another VQ-GAN decodes the quantized vectors to synthesize the image in the new domain. The two GAN models are independently trained on their specific domains.
Continuing their previous work [44], Wolleb et al. [37] extend the diffusion model by replacing the classifier with another model specific to the task. Thus, at every step of the sampling process, the gradient of the task-specific network is infused. The method is demonstrated with a regressor (based on an encoder) or with a segmentation model (using the UNet architecture), whereas the diffusion model is based on existing frameworks [2, 6]. This setting has the advantage of eliminating the need to retrain the whole diffusion model; only the task-specific model needs training.
3.8 Image Segmentation
Baranchuk et al. [38] demonstrate how diffusion models can be used in semantic segmentation. The feature maps produced at different scales by the decoder of the UNet (used in the denoising process) are upsampled to a common resolution and concatenated; each pixel is then classified by an ensemble of multi-layer perceptrons attached on top of the concatenated features. The authors show that these feature maps, extracted at later steps of the denoising process, contain rich representations. The experiments show that segmentation based on diffusion models outperforms most baselines.
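To make the pipeline concrete, we give a minimal sketch below. The feature shapes, the nearest-neighbor upsampling, and the toy activations are illustrative assumptions, not the exact configuration of [38]:

```python
import numpy as np

def upsample_nearest(feat, size):
    """Nearest-neighbor upsampling of a (C, h, w) feature map to (C, size, size)."""
    c, h, w = feat.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return feat[:, rows][:, :, cols]

def pixel_features(decoder_feats, size):
    """Upsample each UNet decoder feature map to the input resolution and
    concatenate along the channel axis, giving one feature vector per pixel."""
    return np.concatenate([upsample_nearest(f, size) for f in decoder_feats], axis=0)

# toy activations at three decoder scales, as if taken from the denoising UNet
feats = [np.random.randn(8, 4, 4), np.random.randn(4, 8, 8), np.random.randn(2, 16, 16)]
per_pixel = pixel_features(feats, 16)  # shape (14, 16, 16)
# an MLP ensemble would then classify each of the 16x16 pixels from its 14-dim vector
```

In the actual method, these per-pixel vectors are extracted at a fixed diffusion time step and fed to an ensemble of independently trained MLP classifiers.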
Amit et al. [41] propose the use of diffusion probabilistic models for image segmentation by extending the UNet encoder. The input image and the current segmentation estimate are passed through two different encoders and combined through summation. The result is then supplied to the encoder-decoder of the UNet. Due to the stochastic noise infused at every time step, multiple samples are generated for a single input image and used to compute the mean segmentation map. The UNet architecture is based on a previous work [6], while the input image encoder is built with Residual Dense Blocks [147]. The denoised sample encoder is a simple 2D convolutional layer.
3.9 Multi-Task Approaches
A series of diffusion models have been applied to multiple tasks, demonstrating a good generalization capacity across tasks. We discuss such contributions below.
Song et al. [3] present the noise conditional score network (NCSN), an approach which estimates the score function at different noise scales. For sampling, they introduce an annealed version of Langevin dynamics and use it to report results in image generation and inpainting. The NCSN architecture is mainly based on the work presented in [148], with small changes such as replacing batch normalization with instance normalization.
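A toy sketch of annealed Langevin dynamics follows, using the closed-form score of a one-dimensional Gaussian in place of a learned NCSN; the step-size rescaling per noise level follows the annealing rule of the paper, while the target distribution and schedule values are illustrative assumptions:

```python
import numpy as np

def annealed_langevin(score, x, sigmas, steps=50, eps=2e-3, seed=0):
    """Annealed Langevin dynamics: run a few Langevin updates at every noise
    scale, from the largest sigma to the smallest, rescaling the step size."""
    rng = np.random.default_rng(seed)
    for sigma in sigmas:                          # sigmas sorted in decreasing order
        alpha = eps * (sigma / sigmas[-1]) ** 2   # step size for this noise level
        for _ in range(steps):
            z = rng.standard_normal(x.shape)
            x = x + 0.5 * alpha * score(x, sigma) + np.sqrt(alpha) * z
    return x

# toy target: N(mu, 1); the sigma-perturbed score is (mu - x) / (1 + sigma^2)
mu = 3.0
score = lambda x, sigma: (mu - x) / (1.0 + sigma ** 2)
sigmas = np.geomspace(1.0, 0.1, 10)
samples = annealed_langevin(score, np.zeros(2000), sigmas)
# samples now approximately follow N(3, 1)
```

Running the chain at large noise scales first lets the sampler traverse low-density regions; the final, small-sigma level refines the samples toward the data distribution.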
The SDE formulation of diffusion models introduced in [4] generalizes over several previous methods [2, 1, 3]. Song et al. [4] present the forward and reverse diffusion processes as solutions of SDEs. This technique unlocks new sampling methods, such as the Predictor-Corrector sampler, or the deterministic sampler based on ODEs. The authors carried out experiments on image generation, inpainting and colorization.
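For reference, in the standard notation of this framework, the forward SDE and its reverse can be written as:

```latex
\begin{align}
  \mathrm{d}x &= f(x, t)\,\mathrm{d}t + g(t)\,\mathrm{d}w, \\
  \mathrm{d}x &= \left[ f(x, t) - g(t)^2 \, \nabla_x \log p_t(x) \right] \mathrm{d}t + g(t)\,\mathrm{d}\bar{w},
\end{align}
```

where $f(x, t)$ is the drift coefficient, $g(t)$ is the diffusion coefficient, $w$ and $\bar{w}$ are the forward and reverse-time Wiener processes, and $\nabla_x \log p_t(x)$ is the score that the neural network estimates. Dropping the stochastic term and halving the score coefficient yields the deterministic probability flow ODE used for exact likelihood computation.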
Esser et al. [27] propose ImageBART, a generative model which learns to revert a multinomial diffusion process on compact image representations. A transformer is used to model the reverse steps autoregressively, where the encoder’s representation is obtained using the output at the previous step. ImageBART is evaluated on unconditional, class-conditional and text-conditional image generation, as well as local editing.
Gao et al. [113] introduce diffusion recovery likelihood, a new training procedure for energy-based models. They learn a sequence of energy-based models for the marginal distributions of the diffusion process. Thus, instead of approximating the reverse process with normal distributions, they derive the conditional distributions from the marginal energy-based models. The authors run experiments on both image generation and inpainting.
Batzolis et al. [23] analyze previous score-based diffusion models for conditional image generation. Moreover, they present a new method for conditional image generation called the conditional multi-speed diffusive estimator (CMDE). This method is based on the observation that diffusing the target image and the condition image at the same rate might be suboptimal. Therefore, they propose an SDE that diffuses the two images with the same drift but different diffusion rates. The approach is evaluated on inpainting, super-resolution and edge-to-image synthesis.
Liu et al. [101] introduce a framework which allows text, content and style guidance from a reference image. The core idea is to use the direction that maximizes the similarity between the representations learned for image and text. The image and text embeddings are produced by the CLIP model [142]. To address the need of training CLIP on noisy images, the authors present a self-supervised procedure that does not require text captions. The procedure uses pairs of clean and noised images to maximize the similarity between positive pairs and minimize it for negative ones (a contrastive objective).
Choi et al. [31] propose a novel method for conditional image synthesis using unconditional diffusion models, which does not require further training. Given a reference image, i.e. the condition, each sample is drawn closer to it by eliminating its low-frequency content and replacing it with content from the reference image. The low-pass filter is implemented as a downsampling operation followed by an upsampling operation with the same factor. The authors show how this method can be applied to various image-to-image translation tasks, e.g. paint-to-image translation and editing with scribbles.
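The low-frequency replacement applied at each reverse step can be sketched as follows; here the linear down/upsampling filter is approximated by average pooling plus nearest-neighbor upsampling, and in the actual method the reference is first noised to the matching time step:

```python
import numpy as np

def low_pass(img, factor):
    """Low-pass filter: average-pool a (H, W) image by `factor`, then upsample it back."""
    h, w = img.shape
    pooled = img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))
    return np.repeat(np.repeat(pooled, factor, axis=0), factor, axis=1)

def ilvr_refine(x, y_t, factor=4):
    """Keep the sample's high frequencies, but copy the low frequencies
    from the (noised) reference image y_t."""
    return x - low_pass(x, factor) + low_pass(y_t, factor)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))   # current sample from the reverse process
y_t = np.ones((8, 8))             # reference condition at the same time step
x_new = ilvr_refine(x, y_t)       # low frequencies of x_new now match y_t
```

Because the filter is linear and idempotent, the refined sample provably shares the reference's low-frequency content while its high-frequency details remain free for the model to synthesize.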
Hu et al. [114] propose to apply diffusion models on discrete representations given by a discrete VAE. They evaluate the idea in image generation and inpainting experiments, considering the CelebA-HQ and LSUN Church data sets.
Rombach et al. [10] introduce latent diffusion models, where the forward and reverse processes operate in the latent space learned by an autoencoder. They also include cross-attention in the architecture, which brings further improvements on conditional image synthesis. The method is tested on super-resolution, image generation and inpainting.
The method introduced by Preechakul et al. [119] contains a semantic encoder that learns a descriptive latent space. The output of this encoder is used to condition an instance of DDIM. The proposed method allows DDPMs to perform well on tasks such as interpolation or attribute manipulation.
Chung et al. [25] introduce a sampling algorithm which reduces the number of steps required in the conditional case. Compared to the standard case, where the reverse process starts from Gaussian noise, their approach first forward-diffuses an initial estimate to obtain an intermediate noised image, and resumes the sampling from this point on. The approach is tested on inpainting, super-resolution, and MRI reconstruction.
In [116], the authors fine-tune a pretrained DDIM to generate images according to a text description. They propose a local directional CLIP loss that enforces the direction between the generated image and the original image to be as close as possible to the direction between the reference text (original domain) and the target text (target domain). The tasks considered in the evaluation are image translation between unseen domains and multi-attribute transfer.
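In spirit, the loss compares edit directions in the shared embedding space. A toy version with stand-in vectors is given below; the real method obtains these embeddings from the CLIP image and text encoders, which are assumed away here:

```python
import numpy as np

def directional_loss(e_gen, e_orig, e_tgt_text, e_src_text):
    """Directional CLIP-style loss: align the image-space edit direction with
    the text-space direction between source and target descriptions."""
    d_img = e_gen - e_orig
    d_txt = e_tgt_text - e_src_text
    cos = d_img @ d_txt / (np.linalg.norm(d_img) * np.linalg.norm(d_txt))
    return 1.0 - cos

# toy embeddings (stand-ins for CLIP image/text encoder outputs)
e_orig = np.array([1.0, 0.0])
e_src = np.array([0.0, 1.0])
e_tgt = np.array([1.0, 1.0])                 # text direction: (1, 0)
perfect = e_orig + (e_tgt - e_src)           # image moved exactly along it
loss = directional_loss(perfect, e_orig, e_tgt, e_src)  # -> 0.0
```

The loss is zero when the image edit moves exactly along the text direction and grows as the two directions diverge, which is what drives the fine-tuning of the DDIM.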
Starting from the formulation of diffusion models as SDEs of Meng et al. [32], Khrulkov et al. [115] investigate the latent space and the resulting encoder maps. Following the Monge formulation, it is shown that these encoder maps are optimal transport maps, but this is demonstrated only for multivariate normal distributions. The authors further support this claim with numerical as well as practical experiments, using the model implementation of Dhariwal et al. [5].
Shi et al. [120] start by observing that an unconditional score-based diffusion model can be formulated as a Schrödinger bridge, which can be solved using a modified version of Iterative Proportional Fitting. The method is then reformulated to accept a condition, thus making conditional synthesis possible. Further adjustments are made to the iterative algorithm in order to reduce the time required to converge. The method is first validated on synthetic data from Kovachki et al. [149], showing improved capabilities in estimating the ground truth. The authors also conduct experiments on super-resolution, inpainting, and biochemical oxygen demand estimation, the latter task being inspired by Marzouk et al. [150].
Inspired by the Retrieval Transformer [151], Blattmann et al. [112] propose a new method for training diffusion models. First, a set of similar images is fetched from a database using a nearest neighbor algorithm. The images are further encoded by an encoder with fixed parameters and projected into CLIP [142] feature space. Finally, the reverse process of the diffusion model is conditioned on this latent space. The method can be further extended to use other conditional signals, e.g. text, by simply enhancing the latent space with the encoded representation of the signal.
Lyu et al. [118] introduce a new technique that reduces the number of sampling steps of diffusion models, while boosting performance at the same time. The idea is to stop the diffusion process at an earlier step. As the sampling can no longer start from random Gaussian noise, a GAN or VAE model is used to encode the last diffused image into a Gaussian latent space. The result is then decoded into an image which can be diffused into the starting point of the backward process.
The aim of Graikos et al. [39] is to separate diffusion models into two independent parts, a prior (the base part) and a constraint (the condition). This enables models to be applied to various tasks without further training. Changing the DDPM equation from [2] leads to training the model independently and using it in a conditional setting, given that the constraint is differentiable. The authors conduct experiments on conditional image synthesis and image segmentation.
Batzolis et al. [111] introduce a new forward process for diffusion models, called non-uniform diffusion, in which each pixel is diffused with a different SDE. Multiple networks are employed in this process, each corresponding to a different diffusion scale. The paper further demonstrates a novel conditional sampler that interpolates between two denoising score-based sampling methods. The model, whose architecture is based on [2] and [4], is evaluated on unconditional synthesis, super-resolution, inpainting and edge-to-image translation.
3.10 Medical Image Generation and Translation
Wolleb et al. [40] introduce a method based on diffusion models for image segmentation in the context of brain tumor segmentation. The training consists of diffusing the segmentation map, then denoising it to recover the original map. During the backward process, the brain MR image is concatenated to the intermediate denoising estimates before being passed through the UNet model, thus conditioning the denoising process on it. Furthermore, for each input, the authors propose to generate multiple samples, which differ due to stochasticity. Thus, the ensemble can provide a mean segmentation map and its variance (associated with the uncertainty of the map).
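A heavily simplified sketch of this ensembling scheme follows. The denoiser below is a hypothetical stand-in that merely pulls the segmentation channel toward the image channel; the real method uses a trained, conditioned UNet, and only the channel-concatenation conditioning and the mean/variance ensembling are faithful to the description above:

```python
import numpy as np

def toy_denoiser(x_t, mri):
    """Hypothetical stand-in for the conditioned UNet: it receives the MR
    image channel-concatenated with the noisy segmentation estimate."""
    inp = np.concatenate([x_t, mri], axis=0)   # (2, H, W) conditioned input
    seg, img = inp[:1], inp[1:]
    return seg + 0.1 * (img - seg)             # pull the estimate toward the mask

def sample_segmentation(mri, seed, steps=50):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(mri.shape)         # start from pure noise
    for _ in range(steps):                     # stochastic "reverse" process
        x = toy_denoiser(x, mri) + 0.01 * rng.standard_normal(x.shape)
    return x

mri = np.zeros((1, 8, 8)); mri[0, 2:5, 2:5] = 1.0   # toy scan with a bright region
samples = [sample_segmentation(mri, seed) for seed in range(5)]
mean_map = np.mean(samples, axis=0)                 # ensemble segmentation map
uncertainty = np.var(samples, axis=0)               # per-pixel uncertainty
```

The per-pixel variance is what the authors interpret as segmentation uncertainty: pixels on which the stochastic samples disagree receive high values.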
Song et al. [124] introduce a method for score-based models that is able to solve inverse problems in medical imaging, i.e. reconstructing images from measurements. First, an unconditional score model is trained. Then, a stochastic process of the measurement is derived, which can be used to infuse conditional information into the model via a proximal optimization step. Finally, the matrix that maps the signal to the measurement is decomposed to allow sampling in closed form. The authors carry out multiple experiments on different medical image types, including CT, low-dose CT and MRI.
Within the area of medical imaging, but focusing on reconstructing images from accelerated MRI scans, Chung et al. [122] propose to solve the inverse problem using a score-based diffusion model. A score model is pretrained only on magnitude images in an unconditional setting. Then, a variance-exploding SDE solver [4] is employed in the sampling process. By adopting a Predictor-Corrector algorithm [4] interleaved with a data consistency mapping, the split image (real and imaginary parts) is fed through the model, enabling conditioning on the measurement. Furthermore, the authors present an extension of the method which enables conditioning on multiple coil-varying measurements.
Özbey et al. [123] propose a diffusion model with adversarial inference. In order to increase the size of each diffusion step, and thus take fewer steps, the authors, inspired by [93], employ a GAN in the reverse process to estimate the denoised image at every step. Using a method similar to [145], they introduce a cycle-consistent architecture that allows training on unpaired data sets.
The aim of Hu et al. [121] is to remove the speckle noise in optical coherence tomography (OCT) B-scans. The first stage is a method called self-fusion, as described in [152], in which additional B-scans close to the given 2D slice of the input OCT volume are selected. The second stage consists of a diffusion model whose starting point is the weighted average of the original B-scan and its neighbors. Thus, the noise can be removed by sampling a clean scan.
3.11 Anomaly Detection in Medical Images
Autoencoders are widely used for anomaly detection [153]. Since diffusion models can be seen as a particular type of VAE, it seems natural to also employ diffusion models for the same task. So far, diffusion models have shown promising results in detecting anomalies in medical images, as further discussed below.
Wyatt et al. [45] train a DDPM on healthy medical images. The anomalies are detected at inference time by subtracting the generated image from the original image. The work also shows that using simplex noise instead of Gaussian noise yields better results for this type of task.
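The detection step reduces to a pixelwise comparison, sketched below. The reconstruction and the threshold value are placeholders; in the actual method the "healthy" image is produced by noising the input with simplex noise and denoising it with the DDPM trained on healthy data:

```python
import numpy as np

def anomaly_map(original, healthy_reconstruction, threshold=0.5):
    """Pixelwise difference between the input and its diffusion-based
    'healthy' reconstruction, followed by a simple threshold."""
    diff = np.abs(original - healthy_reconstruction)
    return diff, diff > threshold

orig = np.zeros((4, 4)); orig[1, 1] = 2.0   # a bright anomalous pixel
recon = np.zeros((4, 4))                    # model trained only on healthy data
diff, mask = anomaly_map(orig, recon)       # mask flags the anomalous pixel
```

Since the model never learned to reproduce anomalies, they survive as large residuals in the difference map.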
Wolleb et al. [44] propose a weakly-supervised method based on diffusion models for anomaly detection in medical images. Training relies on two unpaired sets of images, healthy ones and ones with lesions. At inference time, an image containing lesions is diffused by the model. Then, the denoising process is guided with the gradient of a binary classifier in order to generate a healthy image. Finally, the sampled healthy image is subtracted from the one containing lesions to obtain the anomaly map.
Pinaya et al. [43] propose a diffusion-based method for detecting anomalies in brain scans, as well as segmenting the anomalous regions. The images are encoded by a VQ-VAE [143], and the quantized latent representation is obtained from a codebook. The diffusion model operates in this latent space. By averaging the intermediate samples from the middle steps of the backward process and applying a precomputed threshold map, a binary mask indicating the anomaly location is created. Restarting the backward process from the middle, the binary mask is used to denoise the anomalous regions, while keeping the rest intact. Finally, the sample at the final step is decoded, resulting in a healthy image. The segmentation map of the anomaly is obtained by subtracting the synthesized image from the input image.
3.12 Video Generation
The recent progress towards making diffusion models more efficient has enabled their application in the video domain. We next present works applying diffusion models to video generation.
Ho et al. [126] introduce diffusion models for the task of video generation. Compared to the 2D image case, the changes are applied only to the architecture. The authors adopt the 3D UNet from [154], presenting results on unconditional and text-conditional video generation. Longer videos are generated in an autoregressive manner, where later video chunks are conditioned on the previous ones.
Yang et al. [127] generate videos frame by frame, using diffusion models. The reverse process is entirely conditioned on a context vector provided by a convolutional recurrent neural network. The authors perform an ablation study to determine whether predicting the residual of the next frame yields better results than predicting the actual frame, concluding that the former option works better.
Höppe et al. [128] present random mask video diffusion (RaMViD), a method which can be used for video generation and infilling. The main contribution of their work is a novel strategy for training, which randomly splits the frames into masked and unmasked frames. The unmasked frames are used to condition the diffusion, while the masked ones are diffused by the forward process.
The work of Harvey et al. [125] introduces flexible diffusion models, a type of diffusion models that can be used with multiple sampling schemes for long video generation. As in [128], the authors train a diffusion model by randomly choosing the frames used in the diffusion and those used for conditioning the process. After training the model, they investigate the effectiveness of multiple sampling schemes, concluding that the sampling choice depends on the data set.
3.13 Other Tasks
There are some pioneering works applying diffusion models to new tasks, which have been scarcely explored via diffusion modeling. We gather and discuss such contributions below.
Luo et al. [117] apply diffusion models on 3D point cloud generation, autoencoding, and unsupervised representation learning. They derive an objective function from the variational lower bound of the likelihood of point clouds conditioned on a shape latent. The experiments are conducted using PointNet [155] as the underlying architecture.
Zhou et al. [135] introduce Point-Voxel Diffusion (PVD), a novel method for shape generation which applies diffusion on point-voxel representations. The approach addresses the tasks of shape generation and completion on the ShapeNet and PartNet data sets.
Zimmermann et al. [42] present a strategy to apply score-based models to classification. They add the image label as a conditioning variable to the score function and, thanks to the ODE formulation, the conditional likelihood can be computed at inference time. Thus, the prediction is the label with the maximum likelihood. Further, they study the behavior of this type of classifier in out-of-distribution scenarios, considering common image corruptions and adversarial perturbations.
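The decision rule itself is simple and can be sketched with a toy likelihood. Here we replace the ODE-based likelihood of the conditional score model with a closed-form Gaussian, which is purely illustrative; only the argmax-over-labels rule matches the method:

```python
import numpy as np

def log_likelihood(x, mu):
    """Toy class-conditional log-likelihood log p(x | y) under N(mu_y, I);
    in [42] this quantity comes from the ODE formulation of the score model."""
    return -0.5 * np.sum((x - mu) ** 2)

def classify(x, class_means):
    """Predict the label under which the conditional likelihood of x is highest."""
    scores = [log_likelihood(x, mu) for mu in class_means]
    return int(np.argmax(scores))

means = [np.array([-2.0, 0.0]), np.array([2.0, 0.0])]
pred = classify(np.array([1.5, 0.3]), means)  # -> 1 (closer to the second mean)
```

Under a uniform label prior, maximizing log p(x | y) over y is equivalent to maximizing the posterior p(y | x), which is exactly the generative-classifier view adopted in the paper.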
Kim et al. [132] propose to solve the image registration task using diffusion models. This is achieved via two networks: a diffusion network, as per [2], and a deformation network based on the UNet described in [156]. Given two images (one static, one moving), the former network assesses the deformation between the two images and feeds the result to the latter network, which predicts the deformation fields, enabling sample generation. The method is also able to synthesize the deformations along the whole transition. The authors carried out experiments on two tasks, one on 2D facial expressions and one on 3D brain images. The results confirm that the model is capable of producing high-quality and accurate registration fields.
Jeanneret et al. [130] apply diffusion models to counterfactual explanations. The method starts from a noised query image and generates a sample with an unconditional DDPM. The gradients required for guidance are computed with respect to the generated sample. Then, one step of the guided reverse process is applied, and its output is further used in the next reverse steps.
Nie et al. [133] demonstrate how a diffusion model can be used as a defense mechanism against adversarial attacks. Given an adversarial image, it is diffused up to an optimally computed time step. The result is then reversed by the model, producing a purified sample at the end. To reduce the cost of solving the reverse-time SDE, the adjoint sensitivity method of Li et al. [157] is used for the score gradient computations.
In the context of few-shot learning, an image generator based on diffusion models is proposed by Giannone et al. [129]. Given a small set of images that condition the synthesis, a vision transformer encodes them, and the resulting context representation is integrated (via two different techniques) into the UNet model employed in the denoising process.
Wang et al. [134] present a framework based on diffusion models for semantic image synthesis. Leveraging the UNet architecture of diffusion models, the input noise is supplied to the encoder, while the semantic label map is passed to the decoder using multi-layer spatially-adaptive normalization operators [158]. To further strengthen the conditioning on the semantic label map and improve sampling quality, an empty map is also supplied to the sampling method to estimate the unconditional noise. The final noise estimate then combines both predictions.
Concerning the task of restoring images negatively affected by various weather conditions (e.g. snow, rain), Özdenizci et al. [131] demonstrate how diffusion models can be used. They condition the denoising process on the degraded image by concatenating it channel-wise to the denoised sample at every time step. To deal with varying image sizes, at every step, the sample is divided into overlapping patches, which are processed in parallel by the model and combined back by averaging the overlapping pixels. The employed diffusion model is based on the UNet architecture presented in [2, 4], modified to accept two concatenated images as input.
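The patch-splitting and overlap-averaging step can be sketched as follows; the patch size, stride, and identity "model" are illustrative assumptions, with the real per-patch function being one denoising pass of the conditioned UNet:

```python
import numpy as np

def patch_process(img, patch, stride, fn):
    """Apply fn to overlapping patches of a (H, W) image and recombine the
    outputs by averaging the pixels where patches overlap."""
    h, w = img.shape
    out = np.zeros_like(img, dtype=float)
    count = np.zeros_like(img, dtype=float)
    for i in range(0, h - patch + 1, stride):
        for j in range(0, w - patch + 1, stride):
            out[i:i + patch, j:j + patch] += fn(img[i:i + patch, j:j + patch])
            count[i:i + patch, j:j + patch] += 1.0
    return out / count  # average over the number of patches covering each pixel

img = np.arange(64, dtype=float).reshape(8, 8)
# with the identity per-patch function, recombination reproduces the input exactly
restored = patch_process(img, patch=4, stride=2, fn=lambda p: p)
```

Averaging the overlaps smooths out the seams between independently processed patches, which is why the scheme handles arbitrary image sizes without visible tiling artifacts.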
Formulating the task of image restoration as a linear inverse problem, Kawar et al. [26] propose the use of diffusion models. Inspired by Kawar et al. [159], the linear degradation matrix is decomposed using singular value decomposition, such that both the input and the output can be mapped onto the spectral space of the matrix, where the diffusion process is carried out. Leveraging the pretrained diffusion models from [2] and [5], the evaluation is conducted on various tasks: super-resolution, deblurring, colorization and inpainting.
3.14 Theoretical Contributions
4 Closing Remarks and Future Directions
In this paper, we reviewed the advancements made by the research community in developing and applying diffusion models to various computer vision tasks. We identified three primary formulations of diffusion modeling based on: DDPMs, NCSNs, and SDEs. Each formulation obtains remarkable results in image generation, surpassing GANs while increasing the diversity of the generated samples. The outstanding results of diffusion models are achieved while the research is still in its early phase. Although we observed that the main focus is on conditional and unconditional image generation, there are still many tasks to be explored and further improvements to be realized.
Limitations. The most significant disadvantage of diffusion models remains the need to perform multiple steps at inference time to generate a single sample. Despite the significant amount of research conducted in this direction, GANs are still faster at producing images. Other issues of diffusion models can be linked to the commonly used strategy of employing CLIP embeddings for text-to-image generation. For example, Ramesh et al. [106] highlight that their model struggles to generate readable text in an image, and attribute this behavior to the fact that CLIP embeddings do not contain information about spelling. Therefore, when such embeddings are used to condition the denoising process, the model can inherit such issues.
Future directions. Aside from the current tendency of researching more efficient diffusion models, future work can study diffusion models applied in other computer vision tasks, such as image dehazing, video anomaly detection, or visual question answering. Even if we found some works studying anomaly detection in medical images [45, 44, 43], this task could also be explored in other domains, such as video surveillance or industrial inspection.
An interesting research direction is to assess the quality and utility of the representation space learned by diffusion models in discriminative tasks. This could be carried out in at least two distinct ways: in a direct way, by learning a discriminative model on top of the latent representations provided by a denoising model to address some classification or regression task, or in an indirect way, by augmenting training sets with realistic samples generated by diffusion models. The latter direction might be more suitable for tasks such as object detection, where inpainting diffusion models could do a good job at blending new objects into images.
Another future work direction is to employ conditional diffusion models to simulate possible futures in video. The generated videos could further be given as input to reinforcement learning models.
Acknowledgments
This work was supported by a grant of the Romanian Ministry of Education and Research, CNCS - UEFISCDI, project no. PN-III-P2-2.1-PED-2021-0195, contract no. 690/2022, within PNCDI III.
References
[1] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli, “Deep unsupervised learning using nonequilibrium thermodynamics,” in Proceedings of ICML, pp. 2256–2265, 2015.
 [2] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” in Proceedings of NeurIPS, vol. 33, pp. 6840–6851, 2020.
 [3] Y. Song and S. Ermon, “Generative modeling by estimating gradients of the data distribution,” in Proceedings of NeurIPS, vol. 32, pp. 11918–11930, 2019.
 [4] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, “Score-Based Generative Modeling through Stochastic Differential Equations,” in Proceedings of ICLR, 2021.
 [5] P. Dhariwal and A. Nichol, “Diffusion models beat GANs on image synthesis,” in Proceedings of NeurIPS, vol. 34, pp. 8780–8794, 2021.
 [6] A. Q. Nichol and P. Dhariwal, “Improved denoising diffusion probabilistic models,” in Proceedings of ICML, pp. 8162–8171, 2021.
 [7] J. Song, C. Meng, and S. Ermon, “Denoising Diffusion Implicit Models,” in Proceedings of ICLR, 2021.
 [8] D. Watson, W. Chan, J. Ho, and M. Norouzi, “Learning fast samplers for diffusion models by differentiating through sample quality,” in Proceedings of ICLR, 2021.
 [9] J. Shi, C. Wu, J. Liang, X. Liu, and N. Duan, “DiVAE: Photorealistic Images Synthesis with Denoising Diffusion Decoder,” arXiv preprint arXiv:2206.00386, 2022.
 [10] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-Resolution Image Synthesis with Latent Diffusion Models,” in Proceedings of CVPR, pp. 10684–10695, 2022.
 [11] R. Rombach, A. Blattmann, and B. Ommer, “Text-Guided Synthesis of Artistic Images with Retrieval-Augmented Diffusion Models,” arXiv preprint arXiv:2207.13038, 2022.
 [12] C. Saharia, J. Ho, W. Chan, T. Salimans, D. J. Fleet, and M. Norouzi, “Image super-resolution via iterative refinement,” arXiv preprint arXiv:2104.07636, 2021.
 [13] Y. Song and S. Ermon, “Improved techniques for training score-based generative models,” in Proceedings of NeurIPS, vol. 33, pp. 12438–12448, 2020.
 [14] A. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. McGrew, I. Sutskever, and M. Chen, “GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models,” in Proceedings of ICML, pp. 16784–16804, 2021.
 [15] Y. Song, C. Durkan, I. Murray, and S. Ermon, “Maximum likelihood training of score-based diffusion models,” in Proceedings of NeurIPS, vol. 34, pp. 1415–1428, 2021.
 [16] A. Sinha, J. Song, C. Meng, and S. Ermon, “D2C: Diffusion-decoding models for few-shot conditional generation,” in Proceedings of NeurIPS, vol. 34, pp. 12533–12548, 2021.
 [17] A. Vahdat, K. Kreis, and J. Kautz, “Score-based generative modeling in latent space,” in Proceedings of NeurIPS, vol. 34, pp. 11287–11302, 2021.
 [18] K. Pandey, A. Mukherjee, P. Rai, and A. Kumar, “VAEs meet diffusion models: Efficient and highfidelity generation,” in Proceedings of NeurIPS Workshop on DGMs and Applications, 2021.
 [19] F. Bao, C. Li, J. Zhu, and B. Zhang, “Analytic-DPM: an Analytic Estimate of the Optimal Reverse Variance in Diffusion Probabilistic Models,” in Proceedings of ICLR, 2022.
 [20] T. Dockhorn, A. Vahdat, and K. Kreis, “Score-based generative modeling with critically-damped Langevin diffusion,” in Proceedings of ICLR, 2022.
 [21] N. Liu, S. Li, Y. Du, A. Torralba, and J. B. Tenenbaum, “Compositional Visual Generation with Composable Diffusion Models,” in Proceedings of ECCV, 2022.
 [22] Y. Jiang, S. Yang, H. Qiu, W. Wu, C. C. Loy, and Z. Liu, “Text2Human: Text-Driven Controllable Human Image Generation,” ACM Transactions on Graphics, vol. 41, no. 4, pp. 1–11, 2022.
 [23] G. Batzolis, J. Stanczuk, C.-B. Schönlieb, and C. Etmann, “Conditional image generation with score-based diffusion models,” arXiv preprint arXiv:2111.13606, 2021.
 [24] M. Daniels, T. Maunu, and P. Hand, “Score-based generative neural networks for large-scale optimal transport,” in Proceedings of NeurIPS, pp. 12955–12965, 2021.
 [25] H. Chung, B. Sim, and J. C. Ye, “Come-Closer-Diffuse-Faster: Accelerating Conditional Diffusion Models for Inverse Problems through Stochastic Contraction,” in Proceedings of CVPR, pp. 12413–12422, 2022.
 [26] B. Kawar, M. Elad, S. Ermon, and J. Song, “Denoising diffusion restoration models,” in Proceedings of DGM4HSD, 2022.
 [27] P. Esser, R. Rombach, A. Blattmann, and B. Ommer, “ImageBART: Bidirectional Context with Multinomial Diffusion for Autoregressive Image Synthesis,” in Proceedings of NeurIPS, vol. 34, pp. 3518–3532, 2021.
 [28] A. Lugmayr, M. Danelljan, A. Romero, F. Yu, R. Timofte, and L. Van Gool, “RePaint: Inpainting using Denoising Diffusion Probabilistic Models,” in Proceedings of CVPR, pp. 11461–11471, 2022.
 [29] B. Jing, G. Corso, R. Berlinghieri, and T. Jaakkola, “Subspace diffusion generative models,” arXiv preprint arXiv:2205.01490, 2022.
 [30] O. Avrahami, D. Lischinski, and O. Fried, “Blended diffusion for text-driven editing of natural images,” in Proceedings of CVPR, pp. 18208–18218, 2022.
 [31] J. Choi, S. Kim, Y. Jeong, Y. Gwon, and S. Yoon, “ILVR: Conditioning Method for Denoising Diffusion Probabilistic Models,” in Proceedings of ICCV, pp. 14347–14356, 2021.
 [32] C. Meng, Y. Song, J. Song, J. Wu, J.-Y. Zhu, and S. Ermon, “SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations,” in Proceedings of ICLR, 2021.
 [33] C. Saharia, W. Chan, H. Chang, C. Lee, J. Ho, T. Salimans, D. Fleet, and M. Norouzi, “Palette: Image-to-image diffusion models,” in Proceedings of SIGGRAPH, pp. 1–10, 2022.
 [34] M. Zhao, F. Bao, C. Li, and J. Zhu, “EGSDE: Unpaired Image-to-Image Translation via Energy-Guided Stochastic Differential Equations,” arXiv preprint arXiv:2207.06635, 2022.
 [35] T. Wang, T. Zhang, B. Zhang, H. Ouyang, D. Chen, Q. Chen, and F. Wen, “Pretraining is All You Need for Image-to-Image Translation,” arXiv preprint arXiv:2205.12952, 2022.
 [36] B. Li, K. Xue, B. Liu, and Y.-K. Lai, “VQBB: Image-to-image Translation with Vector Quantized Brownian Bridge,” arXiv preprint arXiv:2205.07680, 2022.
 [37] J. Wolleb, R. Sandkühler, F. Bieder, and P. C. Cattin, “The Swiss Army Knife for Image-to-Image Translation: Multi-Task Diffusion Models,” arXiv preprint arXiv:2204.02641, 2022.
 [38] D. Baranchuk, I. Rubachev, A. Voynov, V. Khrulkov, and A. Babenko, “Label-Efficient Semantic Segmentation with Diffusion Models,” in Proceedings of ICLR, 2022.
 [39] A. Graikos, N. Malkin, N. Jojic, and D. Samaras, “Diffusion models as plug-and-play priors,” arXiv preprint arXiv:2206.09012, 2022.
 [40] J. Wolleb, R. Sandkühler, F. Bieder, P. Valmaggia, and P. C. Cattin, “Diffusion Models for Implicit Image Segmentation Ensembles,” in Proceedings of MIDL, 2022.
 [41] T. Amit, E. Nachmani, T. Shaharbany, and L. Wolf, “SegDiff: Image Segmentation with Diffusion Probabilistic Models,” arXiv preprint arXiv:2112.00390, 2021.
 [42] R. S. Zimmermann, L. Schott, Y. Song, B. A. Dunn, and D. A. Klindt, “Score-based generative classifiers,” in Proceedings of NeurIPS Workshop on DGMs and Applications, 2021.
 [43] W. H. Pinaya, M. S. Graham, R. Gray, P. F. Da Costa, P.-D. Tudosiu, P. Wright, Y. H. Mah, A. D. MacKinnon, J. T. Teo, R. Jager, et al., “Fast Unsupervised Brain Anomaly Detection and Segmentation with Diffusion Models,” arXiv preprint arXiv:2206.03461, 2022.
 [44] J. Wolleb, F. Bieder, R. Sandkühler, and P. C. Cattin, “Diffusion Models for Medical Anomaly Detection,” arXiv preprint arXiv:2203.04306, 2022.
 [45] J. Wyatt, A. Leach, S. M. Schmon, and C. G. Willcocks, “AnoDDPM: Anomaly Detection With Denoising Diffusion Probabilistic Models Using Simplex Noise,” in Proceedings of CVPRW, pp. 650–656, 2022.
 [46] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, 2013.
 [47] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
 [48] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006.
 [49] D. P. Kingma and M. Welling, “Auto-Encoding Variational Bayes,” in Proceedings of ICLR, 2014.
 [50] I. Higgins, L. Matthey, A. Pal, C. P. Burgess, X. Glorot, M. M. Botvinick, S. Mohamed, and A. Lerchner, “beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework,” in Proceedings of ICLR, 2017.
 [51] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Proceedings of NIPS, pp. 2672–2680, 2014.
 [52] M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, and A. Joulin, “Unsupervised learning of visual features by contrasting cluster assignments,” in Proceedings of NeurIPS, vol. 33, pp. 9912–9924, 2020.
 [53] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in Proceedings of ICML, vol. 119, pp. 1597–1607, 2020.
 [54] F.-A. Croitoru, D.-N. Grigore, and R. T. Ionescu, “Discriminability-enforcing loss to improve representation learning,” in Proceedings of CVPRW, pp. 2598–2602, 2022.
 [55] A. van den Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive predictive coding,” arXiv preprint arXiv:1807.03748, 2018.
 [56] S. Laine and T. Aila, “Temporal ensembling for semi-supervised learning,” in Proceedings of ICLR, 2017.
 [57] A. Tarvainen and H. Valpola, “Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results,” in Proceedings of NIPS, vol. 30, pp. 1195–1204, 2017.
 [58] C.-W. Huang, J. H. Lim, and A. C. Courville, “A variational perspective on diffusion-based generative models and score matching,” in Proceedings of NeurIPS, vol. 34, pp. 22863–22876, 2021.
 [59] Y. LeCun, S. Chopra, R. Hadsell, M. Ranzato, and F. J. Huang, “A tutorial on energybased learning,” in Predicting Structured Data, MIT Press, 2006.
 [60] J. Ngiam, Z. Chen, P. W. Koh, and A. Y. Ng, “Learning Deep Energy Models,” in Proceedings of ICML, pp. 1105–1112, 2011.
 [61] A. van den Oord, N. Kalchbrenner, L. Espeholt, K. Kavukcuoglu, O. Vinyals, and A. Graves, “Conditional Image Generation with PixelCNN Decoders,” in Proceedings of NeurIPS, vol. 29, pp. 4797–4805, 2016.
 [62] L. Dinh, D. Krueger, and Y. Bengio, “NICE: Non-linear Independent Components Estimation,” in Proceedings of ICLR, 2015.
 [63] L. Dinh, J. Sohl-Dickstein, and S. Bengio, “Density estimation using Real NVP,” in Proceedings of ICLR, 2017.
 [64] P. Vincent, “A Connection Between Score Matching and Denoising Autoencoders,” Neural Computation, vol. 23, pp. 1661–1674, 2011.
 [65] Y. Song, S. Garg, J. Shi, and S. Ermon, “Sliced Score Matching: A Scalable Approach to Density and Score Estimation,” in Proceedings of UAI, p. 204, 2019.
 [66] B. D. Anderson, “Reverse-time diffusion equation models,” Stochastic Processes and their Applications, vol. 12, no. 3, pp. 313–326, 1982.
 [67] T. Salimans, A. Karpathy, X. Chen, and D. P. Kingma, “PixelCNN++: Improving the PixelCNN with Discretized Logistic Mixture Likelihood and Other Modifications,” in Proceedings of ICLR, 2017.
 [68] Q. Zhang and Y. Chen, “Diffusion normalizing flow,” in Proceedings of NeurIPS, vol. 34, pp. 16280–16291, 2021.
 [69] K. Swersky, M. Ranzato, D. Buchman, B. M. Marlin, and N. Freitas, “On Autoencoders and Score Matching for Energy Based Models,” in Proceedings of ICML, pp. 1201–1208, 2011.
 [70] F. Bao, K. Xu, C. Li, L. Hong, J. Zhu, and B. Zhang, “Variational (gradient) estimate of the score function in energy-based latent variable models,” in Proceedings of ICML, pp. 651–661, 2021.
 [71] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, “Improved techniques for training GANs,” in Proceedings of NeurIPS, pp. 2234–2242, 2016.
 [72] J. Austin, D. D. Johnson, J. Ho, D. Tarlow, and R. van den Berg, “Structured denoising diffusion models in discrete state-spaces,” in Proceedings of NeurIPS, vol. 34, pp. 17981–17993, 2021.
 [73] Y. Benny and L. Wolf, “Dynamic Dual-Output Diffusion Models,” in Proceedings of CVPR, pp. 11482–11491, 2022.
 [74] S. Bond-Taylor, P. Hessey, H. Sasaki, T. P. Breckon, and C. G. Willcocks, “Unleashing Transformers: Parallel Token Prediction with Discrete Absorbing Diffusion for Fast High-Resolution Image Generation from Vector-Quantized Codes,” in Proceedings of ECCV, 2022.
 [75] J. Choi, J. Lee, C. Shin, S. Kim, H. Kim, and S. Yoon, “Perception Prioritized Training of Diffusion Models,” in Proceedings of CVPR, pp. 11472–11481, 2022.
 [76] V. De Bortoli, J. Thornton, J. Heng, and A. Doucet, “Diffusion Schrödinger bridge with applications to scorebased generative modeling,” in Proceedings of NeurIPS, vol. 34, pp. 17695–17709, 2021.
 [77] J. Deasy, N. Simidjievski, and P. Liò, “Heavy-tailed denoising score matching,” arXiv preprint arXiv:2112.09788, 2021.
 [78] K. Deja, A. Kuzina, T. Trzciński, and J. M. Tomczak, “On Analyzing Generative and Denoising Capabilities of Diffusion-Based Deep Generative Models,” arXiv preprint arXiv:2206.00070, 2022.
 [79] A. Jolicoeur-Martineau, R. Piché-Taillefer, I. Mitliagkas, and R. T. des Combes, “Adversarial score matching and improved sampling for image generation,” in Proceedings of ICLR, 2021.
 [80] A. Jolicoeur-Martineau, K. Li, R. Piché-Taillefer, T. Kachman, and I. Mitliagkas, “Gotta go fast when generating data with score-based models,” arXiv preprint arXiv:2105.14080, 2021.
 [81] D. Kim, B. Na, S. J. Kwon, D. Lee, W. Kang, and I.-C. Moon, “Maximum Likelihood Training of Implicit Nonlinear Diffusion Models,” arXiv preprint arXiv:2205.13699, 2022.
 [82] D. Kingma, T. Salimans, B. Poole, and J. Ho, “Variational diffusion models,” in Proceedings of NeurIPS, vol. 34, pp. 21696–21707, 2021.
 [83] Z. Kong and W. Ping, “On Fast Sampling of Diffusion Probabilistic Models,” in Proceedings of INNF+, 2021.
 [84] M. W. Lam, J. Wang, R. Huang, D. Su, and D. Yu, “Bilateral denoising diffusion models,” arXiv preprint arXiv:2108.11514, 2021.
 [85] L. Liu, Y. Ren, Z. Lin, and Z. Zhao, “Pseudo Numerical Methods for Diffusion Models on Manifolds,” in Proceedings of ICLR, 2022.
 [86] H. Ma, L. Zhang, X. Zhu, J. Zhang, and J. Feng, “Accelerating Score-Based Generative Models for High-Resolution Image Synthesis,” arXiv preprint arXiv:2206.04029, 2022.
 [87] E. Nachmani, R. S. Roman, and L. Wolf, “Non-Gaussian denoising diffusion models,” arXiv preprint arXiv:2106.07582, 2021.
 [88] R. San-Roman, E. Nachmani, and L. Wolf, “Noise estimation for generative diffusion models,” arXiv preprint arXiv:2104.02600, 2021.
 [89] V. Sehwag, C. Hazirbas, A. Gordo, F. Ozgenel, and C. Canton, “Generating High Fidelity Data from Low-density Regions using Diffusion Models,” in Proceedings of CVPR, pp. 11492–11501, 2022.
 [90] G. Wang, Y. Jiao, Q. Xu, Y. Wang, and C. Yang, “Deep generative learning via Schrödinger bridge,” in Proceedings of ICML, pp. 10794–10804, 2021.
 [91] Z. Wang, H. Zheng, P. He, W. Chen, and M. Zhou, “Diffusion-GAN: Training GANs with Diffusion,” arXiv preprint arXiv:2206.02262, 2022.
 [92] D. Watson, J. Ho, M. Norouzi, and W. Chan, “Learning to efficiently sample from diffusion probabilistic models,” arXiv preprint arXiv:2106.03802, 2021.
 [93] Z. Xiao, K. Kreis, and A. Vahdat, “Tackling the generative learning trilemma with denoising diffusion GANs,” in Proceedings of ICLR, 2022.
 [94] H. Zheng, P. He, W. Chen, and M. Zhou, “Truncated diffusion probabilistic models,” arXiv preprint arXiv:2202.09671, 2022.
 [95] F. Bordes, R. Balestriero, and P. Vincent, “High fidelity visualization of what your self-supervised representation knows about,” Transactions on Machine Learning Research, 2022.
 [96] A. Campbell, J. Benton, V. De Bortoli, T. Rainforth, G. Deligiannidis, and A. Doucet, “A Continuous Time Framework for Discrete Denoising Models,” arXiv preprint arXiv:2205.14987, 2022.
 [97] C.-H. Chao, W.-F. Sun, B.-W. Cheng, Y.-C. Lo, C.-C. Chang, Y.-L. Liu, Y.-L. Chang, C.-P. Chen, and C.-Y. Lee, “Denoising Likelihood Score Matching for Conditional Score-Based Data Generation,” in Proceedings of ICLR, 2022.
 [98] J. Ho, C. Saharia, W. Chan, D. J. Fleet, M. Norouzi, and T. Salimans, “Cascaded Diffusion Models for High Fidelity Image Generation,” Journal of Machine Learning Research, vol. 23, no. 47, pp. 1–33, 2022.
 [99] J. Ho and T. Salimans, “Classifier-Free Diffusion Guidance,” in Proceedings of NeurIPS Workshop on DGMs and Applications, 2021.
 [100] T. Karras, M. Aittala, T. Aila, and S. Laine, “Elucidating the Design Space of Diffusion-Based Generative Models,” arXiv preprint arXiv:2206.00364, 2022.
 [101] X. Liu, D. H. Park, S. Azadi, G. Zhang, A. Chopikyan, Y. Hu, H. Shi, A. Rohrbach, and T. Darrell, “More control for free! Image synthesis with semantic diffusion guidance,” arXiv preprint arXiv:2112.05744, 2021.
 [102] C. Lu, Y. Zhou, F. Bao, J. Chen, C. Li, and J. Zhu, “DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps,” arXiv preprint arXiv:2206.00927, 2022.
 [103] T. Salimans and J. Ho, “Progressive distillation for fast sampling of diffusion models,” in Proceedings of ICLR, 2022.
 [104] V. Singh, S. Jandial, A. Chopra, S. Ramesh, B. Krishnamurthy, and V. N. Balasubramanian, “On Conditioning the Input Noise for Controlled Image Generation with Diffusion Models,” arXiv preprint arXiv:2205.03859, 2022.
 [105] S. Gu, D. Chen, J. Bao, F. Wen, B. Zhang, D. Chen, L. Yuan, and B. Guo, “Vector quantized diffusion model for text-to-image synthesis,” in Proceedings of CVPR, pp. 10696–10706, 2022.
 [106] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, “Hierarchical text-conditional image generation with CLIP latents,” arXiv preprint arXiv:2204.06125, 2022.
 [107] C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. Denton, S. K. S. Ghasemipour, B. K. Ayan, S. S. Mahdavi, R. G. Lopes, et al., “Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding,” arXiv preprint arXiv:2205.11487, 2022.
 [108] Q. Zhang and Y. Chen, “Fast Sampling of Diffusion Models with Exponential Integrator,” arXiv preprint arXiv:2204.13902, 2022.
 [109] O. Avrahami, O. Fried, and D. Lischinski, “Blended latent diffusion,” arXiv preprint arXiv:2206.02779, 2022.
 [110] H. Sasaki, C. G. Willcocks, and T. P. Breckon, “UNIT-DDPM: UNpaired Image Translation with Denoising Diffusion Probabilistic Models,” arXiv preprint arXiv:2104.05358, 2021.
 [111] G. Batzolis, J. Stanczuk, C.-B. Schönlieb, and C. Etmann, “Non-Uniform Diffusion Models,” arXiv preprint arXiv:2207.09786, 2022.
 [112] A. Blattmann, R. Rombach, K. Oktay, and B. Ommer, “Retrieval-Augmented Diffusion Models,” arXiv preprint arXiv:2204.11824, 2022.
 [113] R. Gao, Y. Song, B. Poole, Y. N. Wu, and D. P. Kingma, “Learning Energy-Based Models by Diffusion Recovery Likelihood,” in Proceedings of ICLR, 2021.
 [114] M. Hu, Y. Wang, T.-J. Cham, J. Yang, and P. N. Suganthan, “Global Context with Discrete Diffusion in Vector Quantised Modelling for Image Generation,” in Proceedings of CVPR, pp. 11502–11511, 2022.
 [115] V. Khrulkov and I. Oseledets, “Understanding DDPM Latent Codes Through Optimal Transport,” arXiv preprint arXiv:2202.07477, 2022.
 [116] G. Kim, T. Kwon, and J. C. Ye, “DiffusionCLIP: Text-Guided Diffusion Models for Robust Image Manipulation,” in Proceedings of CVPR, pp. 2426–2435, 2022.
 [117] S. Luo and W. Hu, “Diffusion probabilistic models for 3D point cloud generation,” in Proceedings of CVPR, pp. 2837–2845, 2021.
 [118] Z. Lyu, X. Xu, C. Yang, D. Lin, and B. Dai, “Accelerating Diffusion Models via Early Stop of the Diffusion Process,” arXiv preprint arXiv:2205.12524, 2022.
 [119] K. Preechakul, N. Chatthee, S. Wizadwongsa, and S. Suwajanakorn, “Diffusion Autoencoders: Toward a Meaningful and Decodable Representation,” in Proceedings of CVPR, pp. 10619–10629, 2022.
 [120] Y. Shi, V. De Bortoli, G. Deligiannidis, and A. Doucet, “Conditional Simulation Using Diffusion Schrödinger Bridges,” in Proceedings of UAI, 2022.
 [121] D. Hu, Y. K. Tao, and I. Oguz, “Unsupervised denoising of retinal OCT with diffusion probabilistic model,” in Proceedings of SPIE Medical Imaging, vol. 12032, pp. 25–34, 2022.
 [122] H. Chung and J. C. Ye, “Score-based diffusion models for accelerated MRI,” Medical Image Analysis, vol. 80, p. 102479, 2022.
 [123] M. Özbey, S. U. Dar, H. A. Bedel, O. Dalmaz, Ş. Öztürk, A. Güngör, and T. Çukur, “Unsupervised Medical Image Translation with Adversarial Diffusion Models,” arXiv preprint arXiv:2207.08208, 2022.
 [124] Y. Song, L. Shen, L. Xing, and S. Ermon, “Solving inverse problems in medical imaging with score-based generative models,” in Proceedings of ICLR, 2022.
 [125] W. Harvey, S. Naderiparizi, V. Masrani, C. Weilbach, and F. Wood, “Flexible Diffusion Modeling of Long Videos,” arXiv preprint arXiv:2205.11495, 2022.
 [126] J. Ho, T. Salimans, A. A. Gritsenko, W. Chan, M. Norouzi, and D. J. Fleet, “Video diffusion models,” in Proceedings of DGM4HSD, 2022.
 [127] R. Yang, P. Srivastava, and S. Mandt, “Diffusion Probabilistic Modeling for Video Generation,” arXiv preprint arXiv:2203.09481, 2022.
 [128] T. Höppe, A. Mehrjou, S. Bauer, D. Nielsen, and A. Dittadi, “Diffusion Models for Video Prediction and Infilling,” arXiv preprint arXiv:2206.07696, 2022.
 [129] G. Giannone, D. Nielsen, and O. Winther, “FewShot Diffusion Models,” arXiv preprint arXiv:2205.15463, 2022.
 [130] G. Jeanneret, L. Simon, and F. Jurie, “Diffusion Models for Counterfactual Explanations,” arXiv preprint arXiv:2203.15636, 2022.
 [131] O. Özdenizci and R. Legenstein, “Restoring Vision in Adverse Weather Conditions with Patch-Based Denoising Diffusion Models,” arXiv preprint arXiv:2207.14626, 2022.
 [132] B. Kim, I. Han, and J. C. Ye, “DiffuseMorph: Unsupervised Deformable Image Registration Along Continuous Trajectory Using Diffusion Models,” arXiv preprint arXiv:2112.05149, 2021.
 [133] W. Nie, B. Guo, Y. Huang, C. Xiao, A. Vahdat, and A. Anandkumar, “Diffusion models for adversarial purification,” in Proceedings of ICML, 2022.
 [134] W. Wang, J. Bao, W. Zhou, D. Chen, D. Chen, L. Yuan, and H. Li, “Semantic image synthesis via diffusion models,” arXiv preprint arXiv:2207.00050, 2022.
 [135] L. Zhou, Y. Du, and J. Wu, “3D shape generation and completion through point-voxel diffusion,” in Proceedings of ICCV, pp. 5826–5835, 2021.
 [136] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proceedings of NIPS, vol. 30, pp. 6000–6010, 2017.
 [137] M. Shannon, B. Poole, S. Mariooryad, T. Bagby, E. Battenberg, D. Kao, D. Stanton, and R. Skerry-Ryan, “Non-saturating GAN training as divergence minimization,” arXiv preprint arXiv:2010.08029, 2020.
 [138] M. Arjovsky and L. Bottou, “Towards principled methods for training generative adversarial networks,” in Proceedings of ICLR, 2017.
 [139] C. K. Sønderby, J. Caballero, L. Theis, W. Shi, and F. Huszár, “Amortised MAP inference for image super-resolution,” in Proceedings of ICLR, 2017.
 [140] H. Tachibana, M. Go, M. Inahara, Y. Katayama, and Y. Watanabe, “Itô-Taylor Sampling Scheme for Denoising Diffusion Probabilistic Models using Ideal Derivatives,” arXiv preprint arXiv:2112.13339, 2021.
 [141] J. Geiping, H. Bauermeister, H. Dröge, and M. Moeller, “Inverting gradients – How easy is it to break privacy in federated learning?,” in Proceedings of NeurIPS, vol. 33, pp. 16937–16947, 2020.
 [142] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” in Proceedings of ICML, vol. 139, pp. 8748–8763, 2021.
 [143] A. Van Den Oord, O. Vinyals, and K. Kavukcuoglu, “Neural discrete representation learning,” Proceedings of NIPS, vol. 30, pp. 6309–6318, 2017.
 [144] N. Reimers and I. Gurevych, “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks,” in Proceedings of EMNLP, pp. 3982–3992, 2019.
 [145] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proceedings of ICCV, pp. 2223–2232, 2017.
 [146] P. Esser, R. Rombach, and B. Ommer, “Taming transformers for high-resolution image synthesis,” in Proceedings of CVPR, pp. 12873–12883, 2021.
 [147] X. Wang, K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, Y. Qiao, and C. Change Loy, “ESRGAN: Enhanced Super-Resolution Generative Adversarial Networks,” in Proceedings of ECCVW, pp. 63–79, 2018.
 [148] G. Lin, A. Milan, C. Shen, and I. Reid, “RefineNet: Multi-Path Refinement Networks for High-Resolution Semantic Segmentation,” in Proceedings of CVPR, pp. 5168–5177, 2017.
 [149] N. Kovachki, R. Baptista, B. Hosseini, and Y. Marzouk, “Conditional sampling with monotone GANs,” arXiv preprint arXiv:2006.06755, 2020.
 [150] Y. Marzouk, T. Moselhy, M. Parno, and A. Spantini, “Sampling via Measure Transport: An Introduction,” in Handbook of Uncertainty Quantification, (Cham), pp. 1–41, Springer, 2016.
 [151] S. Borgeaud, A. Mensch, J. Hoffmann, T. Cai, E. Rutherford, K. Millican, G. B. Van Den Driessche, J.-B. Lespiau, B. Damoc, A. Clark, D. De Las Casas, A. Guy, J. Menick, R. Ring, T. Hennigan, S. Huang, L. Maggiore, C. Jones, A. Cassirer, A. Brock, M. Paganini, G. Irving, O. Vinyals, S. Osindero, K. Simonyan, J. Rae, E. Elsen, and L. Sifre, “Improving Language Models by Retrieving from Trillions of Tokens,” in Proceedings of ICML, vol. 162, pp. 2206–2240, 2022.
 [152] I. Oguz, J. D. Malone, Y. Atay, and Y. K. Tao, “Self-fusion for OCT noise reduction,” in Proceedings of SPIE Medical Imaging, vol. 11313, p. 113130C, SPIE, 2020.
 [153] R. T. Ionescu, F. S. Khan, M.-I. Georgescu, and L. Shao, “Object-Centric Auto-Encoders and Dummy Anomalies for Abnormal Event Detection in Video,” in Proceedings of CVPR, pp. 7842–7851, 2019.
 [154] Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, and O. Ronneberger, “3D U-Net: Learning Dense Volumetric Segmentation from Sparse Annotation,” in Proceedings of MICCAI, pp. 424–432, 2016.
 [155] R. Q. Charles, H. Su, M. Kaichun, and L. J. Guibas, “PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation,” in Proceedings of CVPR, pp. 77–85, 2017.
 [156] G. Balakrishnan, A. Zhao, M. R. Sabuncu, J. Guttag, and A. V. Dalca, “An unsupervised learning model for deformable medical image registration,” in Proceedings of CVPR, pp. 9252–9260, 2018.
 [157] X. Li, T.-K. L. Wong, R. T. Chen, and D. Duvenaud, “Scalable gradients for stochastic differential equations,” in Proceedings of AISTATS, pp. 3870–3882, 2020.
 [158] T. Park, M.-Y. Liu, T.-C. Wang, and J.-Y. Zhu, “Semantic image synthesis with spatially-adaptive normalization,” in Proceedings of CVPR, pp. 2337–2346, 2019.
 [159] B. Kawar, G. Vaksman, and M. Elad, “SNIPS: Solving noisy inverse problems stochastically,” in Proceedings of NeurIPS, vol. 34, pp. 21757–21769, 2021.