Diffusion models as plug-and-play priors

We consider the problem of inferring high-dimensional data 𝐱 in a model that consists of a prior p(𝐱) and an auxiliary constraint c(𝐱,𝐲). In this paper, the prior is an independently trained denoising diffusion generative model. The auxiliary constraint is expected to have a differentiable form, but can come from diverse sources. The possibility of such inference turns diffusion models into plug-and-play modules, thereby allowing a range of potential applications in adapting models to new domains and tasks, such as conditional generation or image segmentation. The structure of diffusion models allows us to perform approximate inference by iterating differentiation through the fixed denoising network enriched with different amounts of noise at each step. Considering many noised versions of 𝐱 in evaluation of its fitness is a novel search mechanism that may lead to new algorithms for solving combinatorial optimization problems.



1 Introduction

Deep generative models, such as denoising diffusion probabilistic models (DDPMs; Sohl-Dickstein et al., 2015; Ho et al., 2020) can capture the details of very complex distributions over high-dimensional continuous data Nichol and Dhariwal (2021); Dhariwal and Nichol (2021); Amit et al. (2021); Sinha et al. (2021); Vahdat et al. (2021); Hoogeboom et al. (2022). The immense effective depth of DDPMs, sometimes with thousands of deep network evaluations in the generation process, is an apparent limitation on their use as off-the-shelf modules in hierarchical generative models, where models can be mixed and one model may serve as a prior for another conditional model. In this paper, we show that DDPMs trained on image data can be directly used as priors in systems that involve other differentiable constraints.

In our main problem setting, we assume that we have a prior p(𝐱) over high-dimensional data 𝐱 and wish to do inference in a model that involves this prior and a constraint c(𝐱, 𝐲) on 𝐱 given some additional information 𝐲. That is, we want to find an approximation of the posterior distribution p(𝐱|𝐲) ∝ p(𝐱)c(𝐱, 𝐲). In this paper, p(𝐱) is provided in the form of an independently trained DDPM over 𝐱 (§2.2), making the DDPM a ‘plug-and-play’ prior.

Although the recent community interest in DDPMs has spurred progress in training algorithms and fast generation schedules Nichol and Dhariwal (2021); Salimans and Ho (2022); Xiao et al. (2022), the possibility of their use as plug-and-play modules has not been explored. Furthermore, as opposed to existing work on plug-and-play models (starting from Nguyen et al. (2017)), the algorithms we propose do not require additional training or finetuning of model components or inference networks.

One obvious application of plug-and-play priors is conditional image generation (§3.1, §3.2). For example, a denoising diffusion model trained on MNIST digit images might define p(𝐱), while the constraint c(𝐱, 𝐲) may be the probability of digit class 𝐲 under an off-the-shelf classifier. However, by changing the semantics of 𝐱, we can also use such models for inference tasks where neural networks struggle with domain adaptation, such as image segmentation: c(𝐱, 𝐲) constrains the segmentation 𝐱 to match an appearance or a weak labeling 𝐲 (§3.3). Finally, we describe a path towards using DDPM priors to solve continuous relaxations of combinatorial search problems by treating the data 𝐱 as a deterministic encoding of a latent variable with combinatorial structure (§3.4).

1.1 Related work

Conditioning DDPMs.

DDPMs have previously been used for conditional generation and image segmentation Saharia et al. (2021); Tashiro et al. (2021); Amit et al. (2021). With few exceptions – such as Baranchuk et al. (2022), which uses a pretrained DDPM as a feature extractor – these algorithms assume access to paired data and conditioning information during training of the DDPM model. In Dhariwal and Nichol (2021), a classifier that guides the denoising model towards the desired subset of images with a given attribute is trained in parallel with the denoiser. In Choi et al. (2021), generation is conditioned on an auxiliary image by guiding the denoising process through correction steps that match the low-frequency components of the generated and conditioning images. In contrast, we aim to build models that combine an independently trained DDPM with an auxiliary constraint.

Our approach is also related to work on adversarial examples. Adversarial samples are produced by optimizing an image to satisfy a desired constraint – a classifier – without reference to the prior over data. As supervised learning algorithms can ignore structure in the data, focusing only on the conditional distribution, it is possible to optimize for input that provides the desired classification in various surprising ways Szegedy et al. (2014). In Nie et al. (2022), a diffusion model is used to defend against adversarial samples by making images more likely under a DDPM. We are instead interested in inference, where we seek samples that satisfy both the classifier and the prior. (Our work may, however, have consequences for adversarial generation.)

Conditional generation from unconditional models.

Works that preceded the recent popularity of DDPMs Nguyen et al. (2017); Engel et al. (2018) show how an unconditional generative model, such as a generative adversarial network (GAN; Goodfellow et al., 2014) or variational autoencoder (VAE; Kingma and Welling, 2014), can be combined with a constraint model to generate conditional samples.

Latent vectors in DDPMs.

Modeling the latent prior distribution in VAE-like models using a DDPM has been studied in Sinha et al. (2021); Vahdat et al. (2021). On the other hand, in §3.4, we perform inference in a low-dimensional latent space under a DDPM pretrained on a high-dimensional data space. Our work is also related to Rolf et al. (2022), where a prior over latents is used to tune a posterior network. There, the priors are of relatively simple structure and are sample-specific, rather than global diffusion priors as in this paper. Still, some practical problems can be tackled with both approaches (§3.3).

2 Methodology

2.1 Problem setting

Recall that we want to find an approximation to the posterior distribution p(𝐱|𝐲) ∝ p(𝐱)c(𝐱, 𝐲), where p(𝐱) is a fixed prior distribution. Fixing 𝐲 and introducing an approximate variational posterior q(𝐱), the free energy

  F(q) = E_{q(𝐱)}[log q(𝐱) − log p(𝐱) − log c(𝐱, 𝐲)]    (1)

is minimized when q(𝐱) is closest to the true posterior, i.e., when KL(q(𝐱) ‖ p(𝐱|𝐲)) is minimized. When q is expressive enough to capture the true posterior, this minimization yields the exact posterior q(𝐱) = p(𝐱|𝐲). Otherwise, the posterior will capture a ‘mode-seeking’ approximation to the true posterior Minka (2005); in particular, if q is Dirac, it is optimal to concentrate q at the mode of p(𝐱)c(𝐱, 𝐲). When the prior involves latent variables 𝐱_{1:T} (i.e., p(𝐱) = ∫ p(𝐱, 𝐱_{1:T}) d𝐱_{1:T}), the free energy is

  F(q) = E_{q(𝐱, 𝐱_{1:T})}[log q(𝐱, 𝐱_{1:T}) − log p(𝐱, 𝐱_{1:T}) − log c(𝐱, 𝐲)].    (2)

We are, in particular, interested in a general procedure for minimizing F with respect to an approximate posterior q for any differentiable constraint c when p(𝐱) is a DDPM (§2.2).

A free energy of the same structure was also studied in Vahdat et al. (2021), where a DDPM over a latent space is hybridized as a parent to a decoder, with an additional inference model trained jointly with both of these models. On the other hand, we aim to work with independently trained components that operate directly in the pixel space, e.g., an off-the-shelf diffusion model trained on images of faces and an off-the-shelf face classifier, without training or finetuning them jointly (§3.2).

2.2 Denoising diffusion probabilistic models as priors

Denoising diffusion probabilistic models (DDPMs) Sohl-Dickstein et al. (2015); Ho et al. (2020) generate samples 𝐱₀ by reversing a (Gaussian) noising process. DDPMs are deep directed stochastic networks:

  p(𝐱_{0:T}) = p(𝐱_T) ∏_{t=1}^{T} p(𝐱_{t−1} | 𝐱_t),    (3)
  p(𝐱_{t−1} | 𝐱_t) = N(𝐱_{t−1}; μ_θ(𝐱_t, t), σ_t² I),    (4)

where μ_θ and σ_t are neural networks with learned parameters θ (often, as in this paper, σ_t is fixed to a constant depending only on t). The model starts with a sample 𝐱_T from a unit Gaussian and successively transforms it with the nonlinear network μ_θ, adding a small Gaussian innovation signal at each step according to a noise schedule. After T steps, the sample 𝐱₀ is obtained.

In general, using such a model as a prior over 𝐱₀ would require an intractable integration over the latent variables 𝐱_{1:T}:

  p(𝐱₀) = ∫ p(𝐱_{0:T}) d𝐱_{1:T}.    (5)
However, DDPMs are trained under the assumption that the posterior q(𝐱_{1:T} | 𝐱₀) is a simple diffusion process that successively adds Gaussian noise according to a pre-defined schedule β_t:

  q(𝐱_t | 𝐱_{t−1}) = N(𝐱_t; √(1 − β_t) 𝐱_{t−1}, β_t I).    (6)
Therefore, if p(𝐱₀) is the likelihood (5) of 𝐱₀ under a DDPM, then in the first expectation of (2) we should use q(𝐱₀, 𝐱_{1:T}) = q(𝐱₀) q(𝐱_{1:T} | 𝐱₀). A computationally and notationally convenient form for the approximate posterior over 𝐱₀ is a Gaussian:

  q(𝐱₀) = N(𝐱₀; μ, σ² I).    (7)
Thus we can sample 𝐱_t at any arbitrary time step t in closed form as

  𝐱_t = √ᾱ_t 𝐱₀ + √(1 − ᾱ_t) ε,    (8)
where α_t = 1 − β_t and ᾱ_t = ∏_{s=1}^{t} α_s, with ε ∼ N(0, I). Analogously to Ho et al. (2020), we can also extract a conditional Gaussian q(𝐱_{t−1} | 𝐱_t, 𝐱₀) and express the first expectation in (2) as

  E_q[log q(𝐱_{0:T}) − log p(𝐱_{0:T})] = ∑_{t=1}^{T} E_q[ KL( q(𝐱_{t−1} | 𝐱_t, 𝐱₀) ‖ p(𝐱_{t−1} | 𝐱_t) ) ] + const,    (9)
which after reparametrization Ho et al. (2020) leads to

  ∑_{t=1}^{T} E_{ε∼N(0,I)}[ w_t ‖ε − ε_θ(√ᾱ_t 𝐱₀ + √(1 − ᾱ_t) ε, t)‖² ] + const,    (10)
where the link between the stage-t noise reconstruction ε_θ and the model’s expectation μ_θ is

  μ_θ(𝐱_t, t) = (1/√α_t) ( 𝐱_t − (β_t / √(1 − ᾱ_t)) ε_θ(𝐱_t, t) ).    (11)
The weighting w_t is generally a function of the noise schedule, but in most pretrained diffusion models it is set to 1. Thus, the free energy in (2) reduces to

  F(q) = ∑_{t=1}^{T} E_{q(𝐱₀), ε}[ ‖ε − ε_θ(√ᾱ_t 𝐱₀ + √(1 − ᾱ_t) ε, t)‖² ] − E_{q(𝐱₀)}[ log c(𝐱₀, 𝐲) ] + const.    (12)
To deal with the second term, one could use the reparametrization trick 𝐱₀ = μ + σε. However, in our experiments, we were mostly interested in the mode, so we set σ → 0, leading to

  F(μ) = ∑_{t=1}^{T} E_{ε∼N(0,I)}[ ‖ε − ε_θ(√ᾱ_t μ + √(1 − ᾱ_t) ε, t)‖² ] − log c(μ, 𝐲).    (13)

The first term is the cost usually used to learn the parameters θ of the diffusion model. To perform inference under an already trained model, we instead minimize F with respect to μ (and σ if desired) by sampling ε in the summands over t.
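The closed-form noising used throughout this derivation, 𝐱_t = √ᾱ_t 𝐱₀ + √(1 − ᾱ_t) ε, is easy to check numerically. The linear β schedule below is an illustrative assumption, not necessarily the schedule of any particular pretrained model:

```python
import numpy as np

# Illustrative linear beta schedule; alpha_bar[t] = prod_{s<=t} (1 - beta[s]).
T = 1000
beta = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - beta)

# Noised version of a sample x0 at an arbitrary step t, in one shot.
x0 = np.ones(4)
rng = np.random.default_rng(0)
eps = rng.standard_normal(4)
t = 500
x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
```

As t grows, ᾱ_t shrinks monotonically toward 0 and 𝐱_t approaches pure Gaussian noise, which is why annealing t from high to low values (as done below) moves the objective from coarse to fine.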

We summarize the algorithm for a point estimate μ as Algorithm 1. Variations on this algorithm are possible. Depending on how close to a good mode we can initialize μ, the optimization may involve summing only over small t; different time step schedules can be considered depending on the desired diversity in the estimated μ. Note that the optimization is stochastic, and each time it is run it can produce different point estimates of 𝐱 which are both likely under the diffusion prior and satisfy the constraint as much as possible.

We observed that optimizing simultaneously for all t makes it difficult to guide the sample towards a mode; therefore, we anneal t from high to low values. Intuitively, the first few iterations of gradient descent should coarsely explore the search space, while later iterations gradually reduce the temperature to steadily reach a nearby local maximum of p(𝐱)c(𝐱, 𝐲).

Another interesting case is when 𝐱 is parametrized through a latent variable 𝐡 (this can be seen as a case of a hard, non-differentiable constraint: if 𝐱 is a deterministic function of 𝐡, 𝐱 = f(𝐡), then c is Dirac on the corresponding manifold). Then the procedure in Algorithm 1 can be performed with gradient descent steps on F with respect to 𝐡 in place of steps 4,5.

0:  pretrained DDPM ε_θ, time schedule t_1 ≥ t_2 ≥ … ≥ t_N, learning rate λ, auxiliary data 𝐲
1:  Initialize μ.
2:  for i = 1, …, N do
3:     Sample ε ∼ N(0, I)
4:     g ← ∇_μ [ ‖ε − ε_θ(√ᾱ_{t_i} μ + √(1 − ᾱ_{t_i}) ε, t_i)‖² − log c(μ, 𝐲) ]
5:     μ ← μ − λ g
6:  end for
Algorithm 1 Inferring a point estimate of 𝐱 under a DDPM prior and a constraint c.
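To make the loop concrete, here is a minimal NumPy sketch of this inference procedure on a toy problem. Everything in it is a hypothetical stand-in: the denoiser eps_theta is the closed-form optimal denoiser for a prior concentrated at a single point m (so the gradient through it is available analytically, whereas the paper backpropagates through a U-Net), the constraint is a Gaussian pulling towards y, and the schedule avoids very small t for step-size stability:

```python
import numpy as np

m = np.array([0.0, 0.0])   # toy prior mode
y = np.array([1.0, 1.0])   # toy conditioning information
T = 100
beta = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - beta)

def eps_theta(x_t, t):
    # Exact denoiser when the data distribution is a point mass at m.
    return (x_t - np.sqrt(alpha_bar[t]) * m) / np.sqrt(1.0 - alpha_bar[t])

def grad_step(mu, t, lr, rng):
    a = alpha_bar[t]
    eps = rng.standard_normal(mu.shape)
    x_t = np.sqrt(a) * mu + np.sqrt(1.0 - a) * eps
    resid = eps - eps_theta(x_t, t)
    # d||resid||^2 / d mu, using d x_t / d mu = sqrt(a):
    grad_prior = -2.0 * np.sqrt(a / (1.0 - a)) * resid
    grad_log_c = -(mu - y)               # gradient of log c(mu, y) = -||mu - y||^2 / 2
    return mu - lr * (grad_prior - grad_log_c)

rng = np.random.default_rng(0)
mu = rng.standard_normal(2)
for t in np.linspace(T - 1, 50, 200).astype(int):  # anneal t from high to low
    mu = grad_step(mu, t, lr=0.05, rng=rng)
```

The estimate μ settles between the prior mode m and the constraint target y, with the prior term dominating more strongly as t decreases.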

3 Experiments

3.1 Conditional generation: Simple illustration on MNIST

Figure 1: Inferred MNIST samples under different constraints c: (a) thin, (b) thick, (c) asymmetry, (d) horizontal asymmetry, (e) class 3, (f) class 3 + symmetry.

We first explore the idea of generating conditional samples from an unconditional diffusion model on MNIST. We train the DDPM model of Dhariwal and Nichol (2021) on MNIST digits and experiment with different sets of constraints to generate samples with specific attributes. The examples in Fig. 1 showcase such generated samples. For the digit in (a), we set the constraint to be the unnormalized score of ‘thin’ digits, computed as the negative of the average image intensity, whereas in (b) we invert that and generate a ‘thick’ digit with high mean intensity. Similarly, in (c) and (d) we hand-craft a score that penalizes vertical and horizontal symmetry, respectively, by computing the distance between the two folds (vertical/horizontal) of the inferred digit, which leads to the generation of skewed, non-symmetric samples.

We also showcase how the auxiliary constraint can be modeled by a different, independently trained network. The digit in Fig. 1 (e) is generated by constraining the DDPM with a classifier network that is separately trained to distinguish between one digit class and all others. The auxiliary constraint in this case is the class likelihood of the inferred digit, as estimated by the classifier. Finally, for (f) we multiply the horizontal symmetry and digit classifier constraints, prompting the inference procedure to generate a perfectly centered and symmetric digit.
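The hand-crafted scores above are simple differentiable functions of the image. A sketch of plausible versions (the exact definitions are given in §A; these function names and forms are illustrative) might look like:

```python
import numpy as np

def log_c_thin(x):
    # 'thin' digits: reward low average intensity
    return -float(x.mean())

def log_c_thick(x):
    # 'thick' digits: reward high average intensity
    return float(x.mean())

def symmetry_penalty_horizontal(x):
    # distance between the image and its left-right mirror; zero iff symmetric
    return float(np.abs(x - x[:, ::-1]).mean())
```

Any such score can serve as log c(𝐱, 𝐲) in Algorithm 1, as long as it is differentiable in the image.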

Details of model training and inference can be found in §A.

3.2 Using off-the-shelf components for conditional generation on FFHQ

We examine the generation of natural images with a pretrained DDPM prior and a learned constraint. We utilize the DDPM network pretrained on FFHQ-256 Karras et al. (2019) from Baranchuk et al. (2022) and a ResNet18 face attribute classifier pretrained on CelebA Liu et al. (2015). The attribute classifier computes the likelihood of the presence of various facial features in a given image, as they are defined by the CelebA dataset. Examples of such features are no beard, smiling, blonde, and male. To generate a conditional sample from the unconditional DDPM network, we select a subset of these attributes and enforce their presence or absence using the classifier-predicted likelihoods as our constraint: the log-constraint log c(𝐱, 𝐲) sums the classifier log-probabilities of the attributes we wish to be present (and the log-probabilities of absence for those we wish to suppress). We only strictly enforce a small subset of facial attributes, and the sample is therefore allowed to converge towards different modes that exhibit, in varying levels, the desired features.
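In log space the attribute constraint is a simple sum. A sketch, assuming a hypothetical classifier that outputs one logit per CelebA attribute (the function and argument names are ours):

```python
import numpy as np

def log_attribute_constraint(logits, present, absent=()):
    """log c(x, y): sum of log-probabilities for attributes forced present/absent."""
    p = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))  # per-attribute sigmoid
    lp = sum(np.log(p[i] + 1e-12) for i in present)
    la = sum(np.log(1.0 - p[i] + 1e-12) for i in absent)
    return float(lp + la)
```

In the actual procedure this quantity is computed on the classifier's outputs for the current estimate μ and differentiated through the classifier network.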

In Fig. 2 we demonstrate our ability to infer conditional samples with desired attributes, using only the unconditional diffusion model and the classifier. In the first row, we show the results of the optimization procedure of Algorithm 1 for various attributes. The classifier objective manipulates the image with the goal of making the classifier network produce the desired attribute predictions, whereas the diffusion objective attempts to pull the sample towards the learned distribution. If we ignored the denoising loss, the result would be adversarial noise that fools the classifier network. The DDPM prior, however, is strong enough to guide the process towards realistic-looking images that simultaneously satisfy the classifier constraint set.

We notice that the generated samples, although having converged towards a correct mode, still exhibit a noticeable amount of noise related to the optimization of the classifier objective. To address this, inspired by Nie et al. (2022), we simply denoise the image using the DDPM model directly, starting from a low noise level so as to retain the overall structure. The results of this denoising are shown in the second row of Fig. 2. (Model and inference details are in §B.)

Figure 2: First row: conditional FFHQ samples for constraints with various attribute sets (blonde; 5 o'clock shadow; oval face; high cheekbones; eyeglasses; goatee + big nose). Second row: denoising as in Nie et al. (2022) to remove artifacts that appear when optimizing with the classifier constraint.

In Fig. 3 we showcase the intermediate steps of the optimization process for inference with the conditions blond+smiling+not male, thus solving a problem like that studied in Du et al. (2020) using only independently trained attribute classifiers and an unconditional generative model of faces. The sample is initialized with Gaussian noise, and as we perform gradient steps with decreasing t values, we observe facial features being added in a coarse-to-fine manner. This aligns with the observations made in Baranchuk et al. (2022), where the intermediate representations of the denoising U-Net for different t values were used to predict attributes of varying granularity.

Figure 3: FFHQ conditional generation for blonde,smiling,female. The last step performs denoising as in Nie et al. (2022) to remove artifacts that appear when optimizing with the classifier constraint.

3.3 Semantic image segmentation: Land cover from aerial imagery

We test the applicability of diffusion priors in discrete tasks, such as inferring semantic segmentations from images. For this purpose, we use the EnviroAtlas dataset Pickard et al. (2015), which is composed of 5-class, 1m-resolution land cover labels from four geographically diverse cities across the US: Pittsburgh, PA; Durham, NC; Austin, TX; and Phoenix, AZ. We only have access to the high-resolution labels from Pittsburgh, and the task is to infer the land cover labels in the other three cities, given only probabilistic weak labels derived from coarse auxiliary data Rolf et al. (2022). We use Algorithm 1 to perform an inference procedure that does not directly take imagery as input, but uses constraints derived from unsupervised color clustering. We use only cluster indices in inference, making the algorithm dependent on image structure, but not color. Local cluster indices as a representation hold the promise of extreme domain transferability, but they require a form of combinatorial search that matches local cluster indices to semantic labels so that the created shapes resemble previously observed land cover, as captured by a denoising diffusion model of semantic segmentations.

DDPM on semantic pixel labels.

We train a DDPM model on one-hot representations of the land cover labels, using the U-Net diffusion model architecture from Dhariwal and Nichol (2021). To convert the one-hot diffusion samples to probabilities, we follow Hoogeboom et al. (2022) and assume that for any pixel in the inferred sample 𝐱, the distribution over the label is obtained by normalizing the inferred channel values, with a user-defined parameter controlling the sharpness. We chose this approach for its simplicity and ease of application in our inference setting of Algorithm 1. Alternatively, we could use diffusion models for categorical data Hoogeboom et al. (2021) with the appropriate modifications to our inference procedure. Samples drawn from the learned distribution are presented in Fig. 4.
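One simple way to realize this conversion, sketched here under the assumption that the normalization is a temperature-scaled softmax over the label channels (the parameter name tau is ours), is:

```python
import numpy as np

def label_probs(x, tau=1.0):
    """Per-pixel label distribution from inferred one-hot channels (last axis)."""
    z = x / tau
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)
```

Small tau concentrates the distribution on the channel with the largest inferred value; large tau flattens it.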

  Water   Impervious Surface   Soil and Barren   Trees and Forest   Grass and Herbaceous

Figure 4: Unconditional samples from the DDPM trained on land cover segmentations (cf. Fig. 5).

Inferring semantic segmentations.

In order to infer the segmentation 𝐱 of a single image under the diffusion prior, we directly apply Algorithm 1 with a hand-crafted constraint c(𝐱, 𝐲) which provides structural and label guidance. To construct c, we first compute a local color clustering of the input image (see §C). In addition, we utilize the available weak labels Rolf et al. (2022) and force the predicted segments’ distribution to match the weak label distribution when averaged in non-overlapping blocks. We combine the two objectives in a single constraint by (i) computing the mutual information between the color clustering and the predicted labels 𝐱 (transformed into a valid probability distribution from the inferred one-hot vectors) in overlapping image patches, and (ii) computing the negative KL divergence between the average predicted distribution and the distribution given by the weak labels in non-overlapping blocks; log c(𝐱, 𝐲) is the sum of these two terms.
Empirically, we find that we can reduce the number of optimization steps needed to perform inference by initializing the sample with the weak labels instead of random noise, allowing us to start from a smaller . Examples of images and their inferred segmentations are shown in Fig. 5.
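The mutual-information term of the constraint can be sketched as follows, computed from an empirical joint distribution of (cluster index, predicted label) within a patch; the function name and input convention are ours:

```python
import numpy as np

def mutual_information(joint):
    """MI of a joint distribution table p(cluster, label), given as a 2-D array."""
    joint = joint / joint.sum()                 # normalize to a distribution
    pc = joint.sum(axis=1, keepdims=True)       # marginal over clusters
    pl = joint.sum(axis=0, keepdims=True)       # marginal over labels
    nz = joint > 0                              # avoid log(0) on empty cells
    return float((joint[nz] * np.log(joint[nz] / (pc @ pl)[nz])).sum())
```

MI is zero when cluster indices carry no information about labels and maximal when each cluster maps to a single label, which is exactly the structure-matching behavior the constraint rewards.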

Figure 5: Segmentation inference results (columns: image, clustering, weak labels, inferred, ground truth). The inferred segmentation is initialized with the weak labels to reduce the number of steps needed. The samples are chosen from (top to bottom) Durham, NC, Austin, TX, and Phoenix, AZ. Although AZ has a vastly different joint distribution of colors and labels, the inferred segmentation still captures the overall structure. Note that the inference algorithm does not use the pixel intensities in the input image, only an unsupervised color clustering.

Domain transfer with inferred samples.

By design, the above inference procedure has a greater ability to perform in new areas than the approach in Rolf et al. (2022), which finetunes networks that take raw images as input. We also investigate a domain transfer approach in which inferred semantic segmentation patches are used to train neural networks for fast inference. We pretrain a standard U-Net inference network solely on 20k batches of 16 randomly sampled image patches in PA. We then randomly sample 640 images in each of the other geographies, generate semantic segmentations using our inference procedure, and finetune the inference network on these segmentations. This network is then evaluated on the entire target geography.

The results in Table 1 demonstrate that this approach to domain transfer is comparable with the state-of-the-art work of Rolf et al. (2022) for weakly-supervised training. The naive approach of training a U-Net only on the available high-resolution PA data (PA supervised) fails to generalize to the geographically different location of Phoenix, AZ. Similarly, the model of Robinson et al. (2019), a US-wide high-resolution land cover model trained on imagery, labels, and multi-resolution auxiliary data over the entire contiguous US, also suffers. When the weak labels are provided as input (PA supervised + weak), the results improve significantly.

Durham, NC Austin, TX Phoenix, AZ
Algorithm Acc % IoU % Acc % IoU % Acc % IoU %
PA supervised 74.2 35.9 71.9 36.8 6.7 13.4
PA supervised + weak 78.9 47.9 77.2 50.5 62.8 24.2
Implicit posterior Rolf et al. (2022) 79.0 48.4 76.6 49.5 76.2 46.0
Ours (from scratch) 76.0 39.9 74.8 39.4 69.5 31.6
Ours (finetuned) 79.8 46.4 79.5 45.4 69.6 32.4
Full US supervised Robinson et al. (2019) 77.0 49.6 76.5 51.8 24.7 23.6
Table 1: Accuracies and class mean intersection-over-union scores on the EnviroAtlas dataset in various geographic domains. The model in the second-to-last row was pretrained in a supervised way on labels in the Pittsburgh, PA, region.

3.4 Traveling salesman problems: Towards combinatorial reasoning with DDPMs

So far, we have considered inference under a DDPM prior p(𝐱) and a differentiable constraint c(𝐱, 𝐲). We now consider the case of a ‘hard’ constraint, where a latent vector 𝐡 is deterministically encoded in an image 𝐱 = f(𝐡) and we have a DDPM prior over images 𝐱. We will use the variation of Algorithm 1 described in §2.2 to obtain a point estimate of 𝐡.

We illustrate this in the setting of a well-known combinatorial problem, the traveling salesman problem (TSP). Recall that a Euclidean traveling salesman problem on the plane is described by n points x_1, …, x_n, which form the vertex set of a complete weighted graph in which the weight of the edge from x_i to x_j is the Euclidean distance ‖x_i − x_j‖. A tour is a connected subgraph in which every vertex has degree 2. The TSP is the optimization problem of finding the tour with minimal total weight of the edges, or, equivalently, a permutation π of {1, …, n} that minimizes ∑_{i=1}^{n} ‖x_{π(i)} − x_{π(i+1)}‖, with indices taken modulo n.
Although the general form of the TSP is NP-hard, a polynomial-time approximation scheme is known to exist in the Euclidean case Arora (1998); Mitchell (1999) and can yield proofs of tour optimality for small problems.
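The tour-length objective above is straightforward to express in code (the function name is ours):

```python
import numpy as np

def tour_length(points, perm):
    """Total Euclidean length of the closed tour visiting points in order perm."""
    p = points[np.asarray(perm)]
    # distance from each point to its successor, wrapping around to the start
    return float(np.linalg.norm(p - np.roll(p, -1, axis=0), axis=1).sum())
```

This is the quantity reported as the objective ("Obj") in the evaluation below.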

Figure 6: Two unconditional samples from the diffusion model trained on images of solved TSPs.

Humans have been shown to have a natural propensity for solving the Euclidean TSP (see MacGregor and Chu (2011) for a survey). Humans construct a tour by processing an image representation of the points through their visual system. However, the optimization algorithms in common use for solving the TSP do not use a vision inductive bias, instead falling into two broad categories:


  • Discrete combinatorial optimization algorithms and efficient integer programming solvers, studied for decades in the optimization literature Lin and Kernighan (1973); Helsgaun (2000); et al. (1997);

  • More recently, neural networks, trained by reinforcement learning or imitation learning, that build tours sequentially or learn heuristics for their (discrete) iterative refinement. Successful recent approaches Deudon et al. (2018); Kool et al. (2019); Joshi et al. (2019, 2021); Bresson and Laurent (2021) have used Transformer Vaswani et al. (2017) and graph neural network Kipf and Welling (2017) architectures.

The algorithm we propose using DDPMs is a hybrid of these categories: it reasons over a continuous relaxation of the problem, but exploits the learning of generalizable structure in example solutions by a neural model. In addition, ours is the first TSP algorithm to mimic the convolutional inductive bias of the visual system.

Encoding function.

Fix a set of n points x_1, …, x_n in the unit square. Let A be a symmetric n×n matrix with 0 diagonal. We encode A as a greyscale image f(A) by superimposing raster images of line segments from x_i to x_j with intensity value A_{ij} for every pair (i, j), and raster images of small black dots placed at x_i for each i. For example, if A is the adjacency matrix of a tour, then f(A) is a visualization of this tour as a greyscale image.
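A crude sketch of the segment-drawing part of this encoding (the resolution 64 is an arbitrary placeholder, the vertex dots are omitted, and a practical version would use differentiable anti-aliased rasterization rather than point sampling):

```python
import numpy as np

def encode(points, A, res=64):
    """Rasterize f(A): draw segment (i, j) with intensity A[i, j], points in [0,1]^2."""
    img = np.zeros((res, res))
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):
            if A[i, j] <= 0:
                continue
            # sample the segment densely and mark the covered pixels
            for s in np.linspace(0.0, 1.0, 2 * res):
                x, y = (1 - s) * points[i] + s * points[j]
                r, c = int(y * (res - 1)), int(x * (res - 1))
                img[r, c] = max(img[r, c], A[i, j])
    return img
```

The key property is that f is (in the differentiable version) smooth in the entries of A, so gradients of the denoising loss on the image can flow back into the adjacency matrix.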

Diffusion model training.

We use a dataset of Euclidean TSPs, with ground-truth tours obtained by a state-of-the-art TSP solver (Concorde) et al. (1997), from Kool et al. (2019) (we consider two variants of the dataset, each with 1.5m training graphs: one with 50 vertices in each graph and one with a varying number, from 20 to 50, of vertices in each graph). Each training tour is represented via its adjacency matrix A and encoded as an image f(A). We then train a DDPM with the U-Net architecture from Dhariwal and Nichol (2021) on all such encoded images. Model and training details can be found in §D. Some unconditional samples from the trained DDPM are shown in Fig. 6; most samples indeed resemble image representations of tours.

Solving new TSPs.

Suppose we are given a new set of points x_1, …, x_n. Solving the TSP requires finding the adjacency matrix A of a tour of minimal length. As a differentiable relaxation, we set A = ½(B + Bᵀ), where B is a stochastic matrix with zero diagonal (parametrized via softmax of a matrix of parameters over rows). We run the inference procedure using the trained DDPM as a prior on f(A) to estimate A. The hyperparameters and noise schedule are described in the Appendix. Examples of the optimization are shown in Fig. 7.
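The relaxed parametrization of the latent adjacency matrix can be sketched as follows (function and parameter names are ours):

```python
import numpy as np

def relaxed_adjacency(theta):
    """A = (B + B^T)/2 where B = row-softmax(theta) with a zero diagonal."""
    theta = theta.astype(float).copy()
    np.fill_diagonal(theta, -np.inf)           # forbid self-loops
    theta -= theta.max(axis=1, keepdims=True)  # numerical stability
    B = np.exp(theta)
    B /= B.sum(axis=1, keepdims=True)          # each row is a distribution
    return 0.5 * (B + B.T)
```

By construction A is symmetric with zero diagonal and total mass n, so gradient steps on theta move A continuously over relaxed adjacency structures; a tour corresponds to A with exactly two unit entries per row.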


Although the inferred A is usually sharp (i.e., all entries close to 0 or 1), rounding its entries to 0 or 1 does not always give the adjacency matrix of a tour (see, for example, the top row of Fig. 7; other common incorrect outputs include two disjoint tours). To extract a tour from the inferred A, we greedily insert edges to form an initial proposal, then refine it using a standard and lightweight combinatorial procedure, the 2-opt heuristic Lin and Kernighan (1973) (amounting to iteratively uncrossing pairs of edges that intersect). The entire procedure is shown in Fig. 7, and full details can be found in §D.
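The 2-opt refinement step admits a compact implementation; this is a generic textbook version, not necessarily the exact variant used in the experiments:

```python
import numpy as np

def two_opt(points, tour):
    """2-opt: repeatedly reverse tour segments while doing so shortens the tour."""
    tour = list(tour)
    n = len(tour)
    improved = True
    while improved:
        improved = False
        for i in range(n - 1):
            for j in range(i + 2, n - (i == 0)):  # skip pairs sharing a vertex
                a, b = points[tour[i]], points[tour[(i + 1) % n]]
                c, d = points[tour[j]], points[tour[(j + 1) % n]]
                # uncross edges (a,b) and (c,d) if the reconnection is shorter
                if (np.linalg.norm(a - c) + np.linalg.norm(b - d)
                        < np.linalg.norm(a - b) + np.linalg.norm(c - d) - 1e-12):
                    tour[i + 1:j + 1] = tour[i + 1:j + 1][::-1]
                    improved = True
    return tour
```

Each accepted swap strictly shortens the tour, so the loop terminates at a 2-opt local minimum; the evaluation below counts how many such swaps are needed after DDPM inference.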

Figure 7: The procedure for solving the Euclidean TSP with a DDPM (panels, left to right: input points; the latent adjacency matrix optimized w.r.t. the denoising model at steps 256, 192, 128, 64, 0; the extracted tour after 2-opt; the oracle solution). Gradient descent is performed on a latent adjacency matrix to minimize a stochastic denoising loss on an image representation with steadily decreasing amounts of noise (here, 256 steps). In the process, pieces of the tour are ‘burned in’ and later recombined in creative ways. Finally, a tour is extracted from the inferred adjacency matrix and refined by uncrossing moves. For both problems shown, the length of the inferred tour is within 1% of the optimum.


We evaluate the trained models on test sets of 1280 graphs each, with 50 and 100 vertices. (We also show results with 200 vertices in §D.) We report the average length of the inferred tour and the gap (discrepancy from the length of the ground-truth tour) in Table 2 (left), from which we make several observations.


  • The right side of Table 2 shows the number of 2-opt (edge uncrossing) steps performed in the refinement step of the algorithm when the inference algorithm is run for varying numbers of steps. Running the inference with more steps results in extracted tours that are closer to local minima with respect to the 2-opt neighbourhood, indicating that the DDPM encodes meaningful information about the shape of tours.

  • The DDPM inference is competitive with recent baseline algorithms that do not use beam search in generation of the tour (those shown in the table). These baseline algorithms improve when beam search decoding with very large beam size is used, but encounter diminishing returns as the computation cost grows. Our performance on the 100-vertex problems is similar to Kool et al. (2019) with the largest beam size they report (5000), which has 2.18% gap, while having similar computation time.

  • The model trained on problems with 50 nodes performs almost identically to the model trained on problems with 50 or fewer nodes, and both models generalize better than baseline methods from 50-node problems to the out-of-distribution 100-node problems.

We reemphasize a unique feature of our algorithm: all ‘reasoning’ in our inference procedure happens via the image space. This property also leads to sublinear scaling of computation cost with increasing size of the graph – as long as the graph can reasonably be represented in an image – since most of the computation cost of inference is borne by running the denoiser on images of a fixed size.

Algorithm 50 vertices: Obj Gap % 100 vertices: Obj Gap %
Oracle (Concorde et al. (1997)) 5.69 0.00 7.759 0.00
2-opt Lin and Kernighan (1973) 5.86 2.95 8.03 3.54
Transformer Kool et al. (2019) 5.80 1.76 8.12 4.53
GNN Joshi et al. (2019) 5.87 3.10 8.41 8.38
Transformer Bresson and Laurent (2021) 5.71 0.31 7.88 1.42
Diffusion 20–50 5.76 1.23 7.92 2.11
Diffusion 50 5.76 1.28 7.93 2.19
Diff. steps 50 vertices: Obj Gap % 2-opt steps 100 vertices: Obj Gap % 2-opt steps
256 5.763 1.28 11.6 7.930 2.19 50.6
64 5.780 2.60 14.3 7.942 2.35 45.7
16 5.858 2.98 25.9 8.052 3.78 58.6
4 5.851 2.86 23.9 8.031 3.50 52.8
2-opt 5.856 2.95 24.4 8.034 3.54 53.0
Table 2: Left: Mean tour length and optimality gap on Euclidean TSP test sets. The baseline results from Kool et al. (2019); Joshi et al. (2019); Bresson and Laurent (2021) are taken from the respective papers. The two DDPMs were trained on 1.5m images of solved TSP instances (with different numbers of vertices) and used to infer latent adjacency matrices in the test set. Right: Performance of the DDPM trained on images of 50-vertex TSP instances with different numbers of inference steps (see §D for time schedule details). We also show the mean number of 2-opt (uncrossing) steps per instance, suggesting that the DDPM prior assigns high likelihood to adjacency matrices that are in less need of refinement.

4 Conclusion

We have shown how inference in denoising diffusion models can be performed under constraints in a variety of settings. Imposing constraints that arise from pretrained classifiers enables conditional generation, while common-sense conditions, such as mutual information with a clustering or divergence from weak labels, can lead to models that are less sensitive to domain shift in the distribution of conditioning data.

A notable limitation of DDPMs, which is inherited by our algorithms, is the high cost of inference, requiring a large number of passes through the denoising network to generate a sample. We expect that with further research on DDPMs for which inference procedures converge in fewer steps Salimans and Ho (2022); Xiao et al. (2022), plug-and-play use of DDPMs will become more appealing in various applications.

Finally, our results on the traveling salesman problem illustrate the ability of DDPMs to reason over uncertain hypotheses in a manner that can mimic human ‘puzzle-solving’ behavior. These results open the door to future research on using DDPMs to efficiently generate candidates in combinatorial search problems.


  • T. Amit, E. Nachmani, T. Shaharabany, and L. Wolf (2021) SegDiff: image segmentation with diffusion probabilistic models. arXiv preprint 2112.00390. Cited by: §1.1, §1.
  • S. Arora (1998) Polynomial time approximation schemes for euclidean traveling salesman and other geometric problems. Journal of the Association for Computing Machinery 45 (5), pp. 753–782. Cited by: §3.4.
  • D. Baranchuk, A. Voynov, I. Rubachev, V. Khrulkov, and A. Babenko (2022) Label-efficient semantic segmentation with diffusion models. International Conference on Learning Representations (ICLR). Cited by: Appendix B, Appendix B, §1.1, §3.2, §3.2.
  • X. Bresson and T. Laurent (2021) The transformer network for the traveling salesman problem. arXiv preprint 2103.03012. Cited by: 2nd item, Table 2.
  • J. Choi, S. Kim, Y. Jeong, Y. Gwon, and S. Yoon (2021) ILVR: conditioning method for denoising diffusion probabilistic models. International Conference on Computer Vision (ICCV). Cited by: §1.1.
  • M. Deudon, P. Cournut, A. Lacoste, Y. Adulyasak, and L. Rousseau (2018) Learning heuristics for the TSP by policy gradient. In Integration of Constraint Programming, Artificial Intelligence, and Operations Research, pp. 170–181. Cited by: 2nd item.
  • P. Dhariwal and A. Q. Nichol (2021) Diffusion models beat GANs on image synthesis. Neural Information Processing Systems (NeurIPS). Cited by: Appendix A, Appendix C, Appendix D, §1.1, §1, §3.1, §3.3, §3.4.
  • Y. Du, S. Li, and I. Mordatch (2020) Compositional visual generation with energy based models. Neural Information Processing Systems (NeurIPS). Cited by: §3.2.
  • J. H. Engel, M. D. Hoffman, and A. Roberts (2018) Latent constraints: learning to generate conditionally from unconditional generative models. International Conference on Learning Representations (ICLR). Cited by: §1.1.
  • D. Applegate et al. (1997) Concorde TSP solver. Note: http://www.math.uwaterloo.ca/tsp/concorde Cited by: 1st item, §3.4, Table 2.
  • I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. Neural Information Processing Systems (NeurIPS). Cited by: §1.1.
  • K. Helsgaun (2000) An effective implementation of the Lin–Kernighan traveling salesman heuristic. European Journal of Operational Research 126 (1), pp. 106–130. Cited by: 1st item.
  • J. Ho, A. Jain, and P. Abbeel (2020) Denoising diffusion probabilistic models. Neural Information Processing Systems (NeurIPS). Cited by: §1, §2.2, §2.2.
  • E. Hoogeboom, D. Nielsen, P. Jaini, P. Forré, and M. Welling (2021) Argmax flows and multinomial diffusion: learning categorical distributions. Neural Information Processing Systems (NeurIPS). Cited by: §3.3.
  • E. Hoogeboom, V. G. Satorras, C. Vignac, and M. Welling (2022) Equivariant diffusion for molecule generation in 3D. arXiv preprint 2203.17003. Cited by: §1, §3.3.
  • C. K. Joshi, Q. Cappart, L. Rousseau, and T. Laurent (2021) Learning TSP requires rethinking generalization. International Conference on Principles and Practice of Constraint Programming. Cited by: 2nd item.
  • C. K. Joshi, T. Laurent, and X. Bresson (2019) An efficient graph convolutional network technique for the travelling salesman problem. arXiv preprint 1906.01227. Cited by: 2nd item, Table 2.
  • T. Karras, S. Laine, and T. Aila (2019) A style-based generator architecture for generative adversarial networks. Computer Vision and Pattern Recognition (CVPR). Cited by: §3.2.
  • D. P. Kingma and M. Welling (2014) Auto-encoding variational Bayes. International Conference on Learning Representations (ICLR). Cited by: §1.1.
  • T. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. International Conference on Learning Representations (ICLR). Cited by: 2nd item.
  • W. Kool, H. van Hoof, and M. Welling (2019) Attention, learn to solve routing problems!. International Conference on Learning Representations (ICLR). Cited by: 2nd item, 2nd item, §3.4, Table 2.
  • S. Lin and B. Kernighan (1973) An effective heuristic algorithm for the traveling-salesman problem. Operations Research 21 (2), pp. 498–516. Cited by: 1st item, §3.4, Table 2.
  • Z. Liu, P. Luo, X. Wang, and X. Tang (2015) Deep learning face attributes in the wild. International Conference on Computer Vision (ICCV). Cited by: Appendix B, §3.2.
  • J. MacGregor and Y. Chu (2011) Human performance on the traveling salesman and related problems: a review. The Journal of Problem Solving 3. Cited by: §3.4.
  • T. Minka (2005) Divergence measures and message passing. Microsoft Research Technical Report. Cited by: §2.1.
  • J. S. B. Mitchell (1999) Guillotine subdivisions approximate polygonal subdivisions: a simple polynomial-time approximation scheme for geometric TSP, k-MST, and related problems. SIAM Journal on Computing 28 (4), pp. 1298–1309. Cited by: §3.4.
  • A. Nguyen, J. Clune, Y. Bengio, A. Dosovitskiy, and J. Yosinski (2017) Plug & play generative networks: conditional iterative generation of images in latent space. Computer Vision and Pattern Recognition (CVPR). Cited by: §1.1, §1.
  • A. Nichol and P. Dhariwal (2021) Improved denoising diffusion probabilistic models. International Conference on Machine Learning (ICML). Cited by: §1, §1.
  • W. Nie, B. Guo, Y. Huang, C. Xiao, A. Vahdat, and A. Anandkumar (2022) Diffusion models for adversarial purification. International Conference on Machine Learning (ICML). Note: To appear; arXiv preprint 2205.07460 Cited by: Figure B.1, Figure B.2, §1.1, Figure 2, Figure 3, §3.2.
  • B. R. Pickard, J. Daniel, M. Mehaffey, L. E. Jackson, and A. Neale (2015) EnviroAtlas: a new geospatial tool to foster ecosystem services science and resource management. Ecosystem Services 14 (C), pp. 45–55. External Links: Link Cited by: Appendix C, §3.3.
  • C. Robinson, L. Hou, N. Malkin, R. Soobitsky, J. Czawlytko, B. Dilkina, and N. Jojic (2019) Large scale high-resolution land cover mapping with multi-resolution data. Computer Vision and Pattern Recognition (CVPR). Cited by: §3.3, Table 1.
  • E. Rolf, N. Malkin, A. Graikos, A. Jojic, C. Robinson, and N. Jojic (2022) Resolving label uncertainty with implicit posterior models. Uncertainty in Artificial Intelligence (UAI). Note: To appear; arXiv preprint 2202.14000 Cited by: Appendix C, Appendix C, §1.1, §3.3, §3.3, §3.3, §3.3, Table 1.
  • O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. Medical Image Computing and Computer-Assisted Intervention (MICCAI). Cited by: Appendix C.
  • C. Saharia, J. Ho, W. Chan, T. Salimans, D. J. Fleet, and M. Norouzi (2021) Image super-resolution via iterative refinement. arXiv preprint 2104.07636. Cited by: §1.1.
  • T. Salimans and J. Ho (2022) Progressive distillation for fast sampling of diffusion models. International Conference on Learning Representations (ICLR). Cited by: §1, §4.
  • A. Sinha, J. Song, C. Meng, and S. Ermon (2021) D2C: diffusion-decoding models for few-shot conditional generation. Neural Information Processing Systems (NeurIPS). Cited by: §1.1, §1.
  • J. Sohl-Dickstein, E. A. Weiss, N. Maheswaranathan, and S. Ganguli (2015) Deep unsupervised learning using nonequilibrium thermodynamics. International Conference on Machine Learning (ICML). Cited by: §1, §2.2.
  • C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2014) Intriguing properties of neural networks. International Conference on Learning Representations (ICLR). Cited by: §1.1.
  • Y. Tashiro, J. Song, Y. Song, and S. Ermon (2021) CSDI: conditional score-based diffusion models for probabilistic time series imputation. Neural Information Processing Systems (NeurIPS). Cited by: §1.1.
  • A. Vahdat, K. Kreis, and J. Kautz (2021) Score-based generative modeling in latent space. Neural Information Processing Systems (NeurIPS). Cited by: §1.1, §1, §2.1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. Neural Information Processing Systems (NIPS). Cited by: 2nd item.
  • Z. Xiao, K. Kreis, and A. Vahdat (2022) Tackling the generative learning trilemma with denoising diffusion GANs. International Conference on Learning Representations (ICLR). Cited by: §1, §4.

Appendix A MNIST

Training the DDPM.

To train the diffusion model we used the U-Net architecture of Dhariwal and Nichol [2021] with a linear schedule and diffusion steps. We trained the network for 10 epochs, with a batch size of 128 samples, using the Adam optimizer and a learning rate of .

Performing inference.

For all inference examples, we performed 1000 optimization steps with the Adam optimizer and a learning rate of . We employed a cosine-modulated, linearly decreasing annealing schedule, as shown in Fig. A.1 (a). We designed this annealing process empirically, following the observation that the linearly decreasing values guide the inference procedure in a coarse-to-fine manner: the overall structure of the inferred sample is decided first, and details are then added in. We added the oscillating component to allow revisions of the coarser structures after specific details have been inferred.

When performing the optimization step of Algorithm 1, we observed that it was important to gradually reduce the effect of the condition in order to obtain good sample quality. In practice, we linearly decreased the weight of the conditional component of the loss from to over the course of the optimization steps. This can be attributed to the fact that the conditions we used provide guidance for the steps made at larger values of , where the shape and orientation of a digit are decided. When combining two or more conditions, the weighting is applied to all of them.
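The optimization loop with an annealed condition weight can be sketched as follows. This is our own minimal illustration, not the paper's implementation: the denoising and condition losses are replaced by toy quadratic gradients, and all names and hyperparameter values are placeholders.

```python
import numpy as np

def anneal_weight(step, n_steps, w_start=1.0, w_end=0.0):
    """Linearly decreasing weight for the conditional term of the loss."""
    return w_start + (w_end - w_start) * step / max(n_steps - 1, 1)

def constrained_inference(x0, grad_prior, grad_cond, n_steps=500, lr=0.05):
    """Gradient-based sketch of Algorithm 1: descend the sum of the
    denoising loss and an annealed conditional loss. The closed-form
    gradients below stand in for backprop through the denoiser."""
    x = x0.copy()
    for step in range(n_steps):
        w = anneal_weight(step, n_steps)
        x -= lr * (grad_prior(x, step) + w * grad_cond(x))
    return x

# Toy stand-ins: the "prior" pulls x toward 0, the "condition" toward 1.
grad_prior = lambda x, t: 2.0 * x        # gradient of ||x||^2
grad_cond = lambda x: 2.0 * (x - 1.0)    # gradient of ||x - 1||^2
x = constrained_inference(np.full((4, 4), 0.5), grad_prior, grad_cond)
```

As the conditional weight decays, the iterate tracks a compromise between the two objectives that drifts toward the prior's optimum, mirroring how the condition mainly guides the coarse, early steps.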

Figure A.1: Inference annealing schedules for the (a) MNIST (b) Land Cover experiments. The TSP and FFHQ experiments use similarly defined schedules.

Appendix B FFHQ

Performing Inference.

For the conditional generation experiments on the FFHQ dataset, we utilized the pre-trained DDPM provided by Baranchuk et al. [2022]. The face attribute classifier was a ResNet-18 trained on the face attributes given in the CelebA dataset Liu et al. [2015]. To run our inference algorithm, we performed 200 optimization steps with the Adamax optimizer, choosing and a learning rate linearly decreasing from to . The annealing schedule was similar to the one used for the land cover segmentation experiments (Fig. A.1 (b)), but with values now ranging from 1000 to 200. Additionally, in this experiment we found it difficult to balance the diffusion and auxiliary losses with a carefully chosen weighting term. We therefore opted for a different approach, in which we clipped the gradient norm of the auxiliary objective to of the gradient norm of the diffusion denoising loss.
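The relative gradient-norm clipping described above can be sketched as below. The clipping fraction is elided in the text, so `ratio` is a placeholder value we introduce for illustration.

```python
import numpy as np

def clip_auxiliary_gradient(g_aux, g_diff, ratio=0.1):
    """Rescale the auxiliary objective's gradient so that its norm is at
    most a fraction of the denoising loss gradient's norm. `ratio` is a
    placeholder; the fraction used in the paper is elided in the text."""
    n_aux = float(np.linalg.norm(g_aux))
    n_max = ratio * float(np.linalg.norm(g_diff))
    if n_aux <= n_max:
        return g_aux
    return g_aux * (n_max / (n_aux + 1e-12))
```

Clipping relative to the denoising gradient, rather than with a fixed weight, keeps the two terms balanced even as their magnitudes change over the course of optimization.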

Further Discussion on Samples.

In Figs. B.1 and B.2 we show additional conditionally generated samples from the unconditional DDPM and the attribute classifier. In the first set of examples, we show that although we may find modes of that satisfy, to a degree, the condition set by the classifier, the sample quality is not always on par with unconditionally generated samples, like those presented in Baranchuk et al. [2022]. We can attribute this to the fact that for natural images, in contrast to segmentation labels, the mode may not always be a good-looking sample from the distribution. To mitigate this, along with the classifier noise artifacts left by the optimization process, we run the diffusion denoising procedure starting from a low temperature . Although this may improve the visual quality of the result, in some cases our choice of is not large enough to move the sample sufficiently far from the inferred . If we choose a larger value, however, we risk erasing the attributes we aimed to generate in the first place.

In the second set of samples, we first show how conflicting attributes are resolved. When the constraint is set to satisfy two attributes that contradict each other, we observed that the inferred sample tends to gravitate towards a single randomly chosen direction. This is evident in the first two examples, where we set the not male attribute along with a male-correlated attribute. In each of them, only a single condition, either the not male or the male-related one, is satisfied. In the blonde+black hair example, one could argue that a mix of the two attributes is present in the inferred sample. However, the classifier predictions for that specific image tell us that the person shown is exclusively blonde.

We also show a set of failure cases, where the classifier ‘painted’ the features related to the desired attribute but the diffusion prior did not complete the sample in a correct way. For instance, in the eyeglasses example, the classifier has drawn an outline of the eyeglass edges on the generated face, but the diffusion model has failed to pick up the cue. Similarly, when asking for wavy hair, we see curves that can fool the classifier into thinking that the person has curly hair, and when the attributes set are smiling+mustache, we observe a comically drawn mustache on the generated face. Since the conditioning depends both on the diffusion prior and on the robustness of the classifier, we believe that better classifier training could improve the results in such cases.

Figure B.1: First row: additional conditional samples for constraints with various attribute sets (not male; male; not young; young; male + rosy cheeks; male + smiling + mustache). Second row: denoising as in Nie et al. [2022] to remove artifacts that appear due to optimizing the classifier constraint.
Figure B.2: First row: failure cases of conditional generation for constraints with various attribute sets (including not male + bald, not male + beard, blonde + black hair, eyeglasses, wavy hair, and smiling + mustache). Second row: denoising as in Nie et al. [2022] to remove artifacts that appear due to optimizing the classifier constraint.

Appendix C Land Cover

Training the DDPM.

The land cover DDPM was trained on -resolution patches of land cover labels, randomly sampled from the Pittsburgh, PA tiles of the EnviroAtlas dataset Pickard et al. [2015]. For the diffusion network, we used the U-Net architecture of Dhariwal and Nichol [2021], a linear schedule, and diffusion steps. We trained with batches of size 32, using the Adam optimizer and a learning rate of . Additional samples from the unconditional diffusion model are shown in Fig. C.1. We observe that the model has learned both structures that are independent of the geography, such as the continuity of roads and suburban building layouts, and PA-specific ones, such as buildings nested in forested areas, which may not be as common in AZ, for instance.

  Water   Impervious Surface   Soil and Barren   Trees and Forest   Grass and Herbaceous

Figure C.1: Unconditional samples from the DDPM trained on land cover segmentations.

Performing inference.

Since we initialize the inference procedure with the weak labels, we require fewer optimization steps and do not have to start the search from . Thus, to infer the land cover segmentations we perform only 200 optimization steps using the Adam optimizer, with a learning rate linearly decreasing from to and . The annealing schedule we designed for this task reflects the need for fewer overall steps and is shown in Fig. A.1 (b). We also decrease the weights of both conditional components of the loss from to as we perform the optimization steps, to reduce their influence on the final inferred sample. In addition, we linearly decrease the parameter that is used to convert the one-hot representations learned by the DDPM to probabilities, from to , to mimic the uncertainty of this conversion process. Further examples of land cover segmentation inference are shown in Fig. C.2. Although the DDPM was trained only on PA land cover labels, the weak label guidance allows us to perform inference in completely new geographies, such as that of AZ (last two rows), where the most prominent label is now Soil and Barren. We can still observe a few artifacts of the PA-related biases the model has learned, like the tendency to add uninterrupted forested areas, but the transferability of the semantic model is still far superior to that of an image-based one.


  Water   Impervious Surface   Soil and Barren   Trees and Forest   Grass and Herbaceous

Figure C.2: Segmentation inference results (columns: image, clustering, labels, inferred, ground truth).

Our hand-crafted constraint for land cover segmentation inference is split between two objectives: (i) matching the structure of the target image using a local color clustering, and (ii) forcing the predicted segments’ distribution to match the weak label distribution when averaged in non-overlapping blocks of the image.

The local color clustering is computed as a local Gaussian mixture with a fixed number of components. To match the structure of the predicted labels to the pre-computed clustering, we compute the mutual information between the two distributions in overlapping patches of pixels. This choice of constraint pushes the inferred land cover segments to locally match the color clustering segments. Although this allows us to infer the labels of large structures like roads and buildings, it also tends to add noisy labels in areas where the clustering has high entropy. By gradually reducing the weight of the auxiliary objective, however, we allow the inference procedure to ‘fill in’ these details as dictated by the diffusion prior.
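A minimal sketch of the mutual information term for a single patch, computed from two per-pixel categorical distributions (the inferred labels and the local clustering responsibilities). The shapes and names here are our own illustration, not the paper's code.

```python
import numpy as np

def patch_mutual_information(q_labels, q_clusters, eps=1e-8):
    """Mutual information between two per-pixel categorical distributions
    inside one patch. q_labels: (P, L), q_clusters: (P, K), each row a
    distribution over label classes / mixture components."""
    P = q_labels.shape[0]
    joint = q_labels.T @ q_clusters / P         # (L, K) empirical joint
    p_y = joint.sum(axis=1, keepdims=True)      # marginal over labels
    p_k = joint.sum(axis=0, keepdims=True)      # marginal over clusters
    return float(np.sum(joint * (np.log(joint + eps)
                                 - np.log(p_y @ p_k + eps))))
```

Maximizing this quantity over patches rewards label maps whose segment boundaries agree with the color clustering, without dictating which semantic class each cluster should map to.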

The label guidance during inference is provided by probabilistic weak labels derived from coarse auxiliary data. These data are composed of the 30m-resolution National Land Cover Database (NLCD) labels, augmented with building footprints, road networks, and waterways/waterbodies Rolf et al. [2022]. The corresponding weak label constraint is computed as the KL divergence between the average predicted and weak label distributions in non-overlapping blocks of pixels. In the absence of such guidance, the inference procedure can easily confuse semantic classes while still producing segmentations that are likely under . We showcase this in Fig. C.3, where we infer the land cover labels of an image, starting from a random initialization, with and without the weak label guidance.
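The weak label term can be sketched similarly, with block-averaged distributions. The block size and the direction of the KL (we show KL(weak || predicted)) are assumptions on our part; the exact values are not given in the text.

```python
import numpy as np

def block_kl(pred, weak, block=8, eps=1e-8):
    """KL divergence between block-averaged per-pixel label distributions.
    pred, weak: (H, W, C) arrays of per-pixel class probabilities; the
    block size and KL direction are illustrative choices, not the paper's."""
    H, W, C = pred.shape
    # Average distributions inside each non-overlapping block x block window.
    p = pred.reshape(H // block, block, W // block, block, C).mean(axis=(1, 3))
    q = weak.reshape(H // block, block, W // block, block, C).mean(axis=(1, 3))
    return float(np.sum(q * (np.log(q + eps) - np.log(p + eps))))
```

Averaging before taking the divergence is what lets a 30m-resolution weak label constrain much finer predictions: only the class proportions within each block are penalized, not individual pixels.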


Figure C.3: Inference with and without label guidance (columns: image, clustering, weak labels, inferred without guidance, inferred with guidance, ground truth).

Domain transfer.

For the domain transfer experiments, we initially pre-trained the standard inference U-Net Ronneberger et al. [2015] on batches of 16 randomly sampled image patches in Pittsburgh, PA, using the Adam optimizer with a learning rate of . We then inferred the land cover segmentations of 640 randomly sampled patches in each of the other geographic regions (NC, TX, AZ) using the inference procedure described above. With these generated labels, we fine-tuned the original network, using a validation set of 5 tiles to determine the optimal fine-tuning parameters. For Durham, NC and Austin, TX we only fine-tune the last layer of the network for a single epoch, using a batch size of 16 patches and a learning rate of . For Phoenix, AZ, where the domain shift is larger, we require 5 epochs of fine-tuning the entire network with a learning rate of . Additionally, for all regions, following the experiments of Rolf et al. [2022], we multiply the predicted probabilities by the weak labels and renormalize.

Finally, in Table 1, we also present the results when the inference network is trained from scratch, to show that the resulting performance is not only an artifact of the pre-training. The U-Net was trained for 20 epochs on all 640 generated samples, with a batch size of 16 and a learning rate of .

Appendix D TSP

DDPM training.

The DDPM was trained on ground truth TSP solutions encoded as images. The architecture was the same U-Net used in the other experiments, from Dhariwal and Nichol [2021], with diffusion steps in training. We trained each model for 8 epochs with batch size 16, which took about two days on one Tesla K80 GPU.

Performing inference.

At inference time, we performed varying numbers of inference steps (see Table 2 in the main text), using the Adam optimizer with and a learning rate linearly decaying from 1 to 0.1. The noise schedule was the same as that used in the MNIST experiment (Fig. A.1), with the time interval from 0 to 1000 linearly resampled to the number of inference steps used.
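The linear resampling of the time interval can be sketched as follows; the cosine modulation of Fig. A.1 is omitted here for brevity, and the endpoint value 1000 is only the training-time interval bound mentioned above.

```python
def resample_schedule(n_inference_steps, t_max=1000):
    """Linearly resample the training-time interval [0, t_max] down to
    the requested number of inference steps (a sketch; the full schedule
    also applies the cosine modulation of Fig. A.1)."""
    return [round(t_max * (1 - i / max(n_inference_steps - 1, 1)))
            for i in range(n_inference_steps)]
```

For example, 5 inference steps visit noise levels 1000, 750, 500, 250, and 0.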

To extract a tour from the inferred adjacency matrix , we used the following greedy edge insertion procedure.

  • Initialize extracted tour with an empty graph with vertices.

  • Sort all the possible edges in decreasing order of (i.e., the inverse edge weight, multiplied by inferred likelihood). Call the resulting edge list .

  • For each edge in the list:

    • If inserting into the graph results in a complete tour, insert and terminate.

    • If inserting results in a graph with cycles (of length ), continue.

    • Otherwise, insert into the tour.

It is easy to see that this algorithm terminates before the entire edge list has been traversed. The tour is refined by a naïve implementation of 2-opt, in which, on each step, all pairs of edges in the tour are enumerated and a 2-opt move is performed if the edges cross. For the ‘2-opt’ baseline, the same procedure is performed using a uniform adjacency matrix.
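The greedy edge insertion procedure above can be sketched as below, using a union-find structure for the premature-cycle check. `adj` denotes the inferred latent adjacency matrix; the variable names are ours.

```python
import numpy as np

def extract_tour(adj, coords):
    """Greedy edge insertion: sort edges by inferred likelihood divided by
    Euclidean length, then insert each edge unless it would give a vertex
    degree above 2 or close a cycle shorter than the full tour."""
    n = len(coords)
    dist = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
    edges = sorted(((adj[i, j] / (dist[i, j] + 1e-8), i, j)
                    for i in range(n) for j in range(i + 1, n)),
                   reverse=True)
    degree = [0] * n
    parent = list(range(n))  # union-find roots for cycle detection

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    tour, inserted = [], 0
    for _, i, j in edges:
        if degree[i] == 2 or degree[j] == 2:
            continue  # vertex already has two tour edges
        ri, rj = find(i), find(j)
        if ri == rj and inserted < n - 1:
            continue  # would close a premature cycle
        parent[ri] = rj
        degree[i] += 1
        degree[j] += 1
        tour.append((i, j))
        inserted += 1
        if inserted == n:
            break  # the nth inserted edge completes the tour
    return tour
```

On a uniform adjacency matrix this reduces to nearest-edge insertion, which is what the ‘2-opt’ baseline uses before refinement.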

Results on larger problems.

Extending the results in Table 2 of the main text, we evaluate the model trained on TSP instances with 20 to 50 nodes on problems with 200 nodes. We find an optimality gap of 3.77% (average number of uncrossing moves 219), compared to 3.81% for 2-opt (average number of uncrossing moves 115), suggesting that the generalization potential is near-saturated at this problem size. As shown in Fig. D.3, the vertices fill the image with such high density that it is difficult to see the (light grey) tour; many edges are invisible (compare to Fig. 7 in the main text).

Figure D.1: Latent adjacency matrix inference in a 200-vertex TSP, using a model trained on images but images at inference time. Columns: input; optimization of the latent adjacency matrix w.r.t. the denoising model at steps 256, 192, 128, 64, 0; extracted tour + 2-opt; oracle. The discovered tour is 2.12% longer than the optimal one.
Figure D.2: Unconditional samples from the DDPM trained on image representations of 50-vertex TSPs.

We suggest three directions for addressing this problem that should be explored in future work:

  1. Encoding: The size of the encoding image can be increased (for example, to ) as the number of vertices increases, without changing the model (trained on images), which can make denoising predictions on images of any size. We may expect better out-of-domain generalization of the denoising model in this setting, as the density of nodes (mean number of black pixels) would match that of the training set. Figs. D.2 and D.1 show the potential of DDPMs to generalize to image sizes larger than those on which they were trained. Inference using images gives an optimality gap of 2.59% (average number of uncrossings 81), much lower than that obtained with the in-domain image size.

    In addition, encoding graphs with smaller dots and thinner lines can be explored, although the generalization difficulties due to image ‘crowding’ would still appear at a larger value of .

  2. Fractal behaviour and coarse-to-fine: Taking advantage of the fractal structure of Euclidean TSP solutions, a denoising objective could be used to locally refine the tour by minimizing the objective on a crop of the image representation (a form of DDPM-guided local search). This could be done in a coarse-to-fine manner by application of the same model at different scales, with a representation of a problem with 200 vertices being first optimized with respect to the denoising objective globally, then on crops.

  3. Improved extraction: The 2-opt search can be improved by inexpensive heuristics, such as choosing the 2-opt move that most improves the cost on every step, rather than iterating through the edges of the candidate tour in order.
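The best-improvement 2-opt variant suggested in point 3 can be sketched as follows; this is our own minimal implementation, not the paper's code.

```python
import numpy as np

def two_opt_best_improvement(tour, coords):
    """Best-improvement 2-opt: on each pass, apply the single segment
    reversal that most reduces the tour length, rather than taking moves
    in edge order. Modifies `tour` (a list of vertex indices) in place."""
    d = lambda a, b: float(np.linalg.norm(coords[a] - coords[b]))
    n = len(tour)
    improved = True
    while improved:
        improved, best_delta, best_move = False, 1e-9, None
        for i in range(n - 1):
            # When i == 0, skip j == n - 1: those edges share a vertex.
            for j in range(i + 2, n if i > 0 else n - 1):
                a, b = tour[i], tour[i + 1]
                c, e = tour[j], tour[(j + 1) % n]
                delta = d(a, b) + d(c, e) - d(a, c) - d(b, e)
                if delta > best_delta:
                    best_delta, best_move, improved = delta, (i, j), True
        if improved:
            i, j = best_move
            tour[i + 1:j + 1] = reversed(tour[i + 1:j + 1])
    return tour
```

Each pass costs the same O(n^2) edge enumeration as the in-order variant, but applying the steepest move first typically uncrosses the tour in fewer passes.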

Figure D.3: A TSP instance (input) and the ground truth solution (target), with vertices encoded in a image.