Using pre-trained Diffusion models as priors for inference tasks
We consider the problem of inferring high-dimensional data 𝐱 in a model that consists of a prior p(𝐱) and an auxiliary constraint c(𝐱,𝐲). In this paper, the prior is an independently trained denoising diffusion generative model. The auxiliary constraint is expected to have a differentiable form, but can come from diverse sources. The possibility of such inference turns diffusion models into plug-and-play modules, thereby allowing a range of potential applications in adapting models to new domains and tasks, such as conditional generation or image segmentation. The structure of diffusion models allows us to perform approximate inference by iterating differentiation through the fixed denoising network enriched with different amounts of noise at each step. Considering many noised versions of 𝐱 in evaluation of its fitness is a novel search mechanism that may lead to new algorithms for solving combinatorial optimization problems.
Deep generative models, such as denoising diffusion probabilistic models (DDPMs; Sohl-Dickstein et al., 2015; Ho et al., 2020), can capture the details of very complex distributions over high-dimensional continuous data Nichol and Dhariwal (2021); Dhariwal and Nichol (2021); Amit et al. (2021); Sinha et al. (2021); Vahdat et al. (2021); Hoogeboom et al. (2022). The immense effective depth of DDPMs, sometimes with thousands of deep network evaluations in the generation process, is an apparent limitation on their use as off-the-shelf modules in hierarchical generative models, where models can be mixed and one model may serve as a prior for another conditional model. In this paper, we show that DDPMs trained on image data can be directly used as priors in systems that involve other differentiable constraints.
In our main problem setting, we assume that we have a prior p(𝐱) over high-dimensional data 𝐱 and wish to do inference in a model that involves this prior and a constraint c(𝐱,𝐲) on 𝐱 given some additional information 𝐲. That is, we want to find an approximation of the posterior distribution p(𝐱|𝐲) ∝ p(𝐱)c(𝐱,𝐲). In this paper, p(𝐱) is provided in the form of an independently trained DDPM over 𝐱 (§2.2), making the DDPM a ‘plug-and-play’ prior.
Although the recent community interest in DDPMs has spurred progress in training algorithms and fast generation schedules Nichol and Dhariwal (2021); Salimans and Ho (2022); Xiao et al. (2022), the possibility of their use as plug-and-play modules has not been explored. Furthermore, as opposed to existing work on plug-and-play models (starting from Nguyen et al. (2017)), the algorithms we propose do not require additional training or finetuning of model components or inference networks.
The constraint c(𝐱,𝐲) can come from diverse sources. For example, a denoising diffusion model trained on MNIST digit images might define p(𝐱), while the constraint c(𝐱,𝐲) may be the probability of digit class 𝐲 under an off-the-shelf classifier. However, by changing the semantics of 𝐲, we can also use such models for inference tasks where neural networks struggle with domain adaptation, such as image segmentation: c(𝐱,𝐲) constrains the segmentation 𝐱 to match an appearance or a weak labeling 𝐲 (§3.3). Finally, we describe a path towards using DDPM priors to solve continuous relaxations of combinatorial search problems by treating 𝐲 as a latent variable with combinatorial structure that is deterministically encoded in 𝐱 (§3.4).
DDPMs have previously been used for conditional generation and image segmentation Saharia et al. (2021); Tashiro et al. (2021); Amit et al. (2021). With few exceptions – such as Baranchuk et al. (2022), which uses a pretrained DDPM as a feature extractor – these algorithms assume access to paired data and conditioning information during training of the DDPM model. In Dhariwal and Nichol (2021), a classifier that guides the denoising model towards the desired subset of images with a given attribute is trained in parallel with the denoiser. In Choi et al. (2021), generation is conditioned on an auxiliary image by guiding the denoising process through correction steps that match the low-frequency components of the generated and conditioning images. In contrast, we aim to build models that combine an independently trained DDPM with an auxiliary constraint.
Our approach is also related to work on adversarial examples. Adversarial samples are produced by optimizing an image 𝐱 to satisfy a desired constraint – e.g., the output of a classifier – without reference to the prior over data. As supervised learning algorithms can ignore the structure in data, focusing only on the conditional distribution, it is possible to optimize for input that provides the desired classification in various surprising ways Szegedy et al. (2014). In Nie et al. (2022), a diffusion model is used to defend against adversarial samples by making images more likely under a DDPM prior. We are instead interested in inference, where we seek samples that satisfy both the classifier and the prior. (Our work may, however, have consequences for adversarial generation.)
Nguyen et al. (2017) show how an unconditional generative model, such as a generative adversarial network (GAN; Goodfellow et al., 2014) or variational autoencoder (VAE; Kingma and Welling, 2014), can be combined with a constraint model to generate conditional samples.
Modeling the latent prior distribution in VAE-like models using a DDPM has been studied in Sinha et al. (2021); Vahdat et al. (2021). On the other hand, in §3.4, we perform inference in the low-dimensional latent space under a pretrained DDPM on a high-dimensional data space. Our work is also related to Rolf et al. (2022), where a prior over latents is used to tune a posterior network. There, the priors are of relatively simple structure and are sample-specific, rather than global diffusion priors as in this paper. Still, some practical problems can be tackled with both approaches (§3.3).
Recall that we want to find an approximation to the posterior distribution p(𝐱|𝐲) ∝ p(𝐱)c(𝐱,𝐲), where p(𝐱) is a fixed prior distribution. Fixing 𝐲 and introducing an approximate variational posterior q(𝐱), the free energy

F(q) = −H(q) − E_{q(𝐱)}[log p(𝐱) + log c(𝐱,𝐲)]  (1)

is minimized when q is closest to the true posterior, i.e., when KL(q(𝐱) ∥ p(𝐱|𝐲)) is minimized. When q is expressive enough to capture the true posterior, this minimization yields the exact posterior p(𝐱|𝐲). Otherwise, the posterior will capture a ‘mode-seeking’ approximation to the true posterior Minka (2005); in particular, if q is Dirac, it is optimal to concentrate q at the mode of p(𝐱)c(𝐱,𝐲). When the prior involves latent variables 𝐳 (i.e., p(𝐱) = ∫ p(𝐱,𝐳) d𝐳), the free energy is

F(q) = −H(q) − E_{q(𝐱,𝐳)}[log p(𝐱,𝐳) + log c(𝐱,𝐲)].  (2)
We are, in particular, interested in a general procedure for minimizing F with respect to an approximate posterior q for any differentiable constraint c(𝐱,𝐲) when the prior p(𝐱) is a DDPM (§2.2).
A free energy of the same structure was also studied in Vahdat et al. (2021), where a DDPM over a latent space is hybridized as a parent to a decoder, with an additional inference model trained jointly with both of these models. On the other hand, we aim to work with independently trained components that operate directly in the pixel space, e.g., an off-the-shelf diffusion model p(𝐱) trained on images of faces and an off-the-shelf face classifier c(𝐱,𝐲), without training or finetuning them jointly (§3.2).
Denoising diffusion probabilistic models (DDPMs) Sohl-Dickstein et al. (2015); Ho et al. (2020) generate samples by reversing a (Gaussian) noising process. DDPMs are deep directed stochastic networks:

p(𝐱_{0:T}) = p(𝐱_T) ∏_{t=1}^{T} p(𝐱_{t−1} | 𝐱_t),  (3)
p(𝐱_{t−1} | 𝐱_t) = N(𝐱_{t−1}; μ_θ(𝐱_t, t), σ_t² I),  (4)

where μ_θ is a neural network with learned parameters θ (often, as in this paper, σ_t is fixed to a constant depending only on t). The model starts with a sample 𝐱_T from a unit Gaussian and successively transforms it with the nonlinear network μ_θ, adding a small Gaussian innovation signal at each step according to a noise schedule. After T steps, the sample 𝐱_0 is obtained.
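The generation process just described can be sketched in a few lines of NumPy. The denoiser below is a toy stand-in for the learned network (our own illustrative assumption, not a trained model):

```python
import numpy as np

def ddpm_sample(mu_theta, T, shape, sigmas, seed=0):
    """Ancestral sampling sketch: start from unit Gaussian noise x_T and
    repeatedly apply the learned mean mu_theta(x_t, t) plus a small
    Gaussian innovation sigma_t * z, ending with the sample x_0."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)                        # x_T ~ N(0, I)
    for t in range(T, 0, -1):
        z = rng.standard_normal(shape) if t > 1 else 0.0  # no noise at the last step
        x = mu_theta(x, t) + sigmas[t] * z
    return x                                              # x_0

# Toy stand-in for the trained network: shrink towards zero.
toy_mu = lambda x, t: 0.9 * x
sigmas = {t: 0.01 for t in range(1, 101)}
x0 = ddpm_sample(toy_mu, T=100, shape=(4,), sigmas=sigmas)
```

With a real model, `mu_theta` would be the trained U-Net and `sigmas` the schedule-dependent constants.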
In general, using such a model as a prior over 𝐱_0 would require an intractable integration over the latent variables 𝐱_{1:T}:

p(𝐱_0) = ∫ p(𝐱_{0:T}) d𝐱_{1:T}.  (5)
However, DDPMs are trained under the assumption that the posterior q(𝐱_{1:T} | 𝐱_0) is a simple diffusion process that successively adds Gaussian noise according to a pre-defined schedule β_t:

q(𝐱_t | 𝐱_{t−1}) = N(𝐱_t; √(1−β_t) 𝐱_{t−1}, β_t I).  (6)
Therefore, if p(𝐱_0) is the likelihood (5) of 𝐱_0 under a DDPM, then in the first expectation of (2) we should use q(𝐱_{1:T} | 𝐱_0) as the approximate posterior over the latents. A computationally and notationally convenient form for the approximate posterior over 𝐱_0 is a Gaussian with mean η:

q(𝐱_0) = N(𝐱_0; η, ρ² I).
Thus we can sample 𝐱_t at any arbitrary time step as

𝐱_t = √ᾱ_t 𝐱_0 + √(1−ᾱ_t) ε,  ε ∼ N(0, I),  with ᾱ_t = ∏_{s=1}^{t} (1−β_s),  (7)

which after reparametrization Ho et al. (2020) turns the first expectation of (2) into a sum of per-step denoising losses

∑_{t=1}^{T} w_t E_ε ‖ε − ε_θ(√ᾱ_t 𝐱_0 + √(1−ᾱ_t) ε, t)‖²  (up to an additive constant),

where the link between the stage-t noise reconstruction ε_θ and the model’s expectation μ_θ is

μ_θ(𝐱_t, t) = (1/√(1−β_t)) (𝐱_t − (β_t/√(1−ᾱ_t)) ε_θ(𝐱_t, t)).
The weighting w_t is generally a function of the noise schedule, but in most pretrained diffusion models it is set to 1. Thus, the free energy in (2) reduces to

F(q) = ∑_{t=1}^{T} w_t E_{𝐱_0∼q, ε} ‖ε − ε_θ(√ᾱ_t 𝐱_0 + √(1−ᾱ_t) ε, t)‖² − E_{q(𝐱_0)}[log c(𝐱_0, 𝐲)] + const.
To deal with the second term, one could use the reparametrization trick 𝐱_0 = η + ρε′ with ε′ ∼ N(0, I). However, in our experiments, we were mostly interested in the mode η, so we set ρ = 0, leading to

F(η) = ∑_{t=1}^{T} w_t E_ε ‖ε − ε_θ(√ᾱ_t η + √(1−ᾱ_t) ε, t)‖² − log c(η, 𝐲) + const.
The first term is the cost usually used to learn the parameters θ of the diffusion model. To perform inference under an already trained model, we instead minimize F with respect to η (and ρ, if desired) through sampling ε in the summands over t.
We summarize the algorithm for a point estimate η as Algorithm 1. Variations on this algorithm are possible. Depending on how close to a good mode we can initialize η, this optimization may involve summing only over a subset of the time steps t; different time step schedules can be considered depending on the desired diversity in the estimated η. Note that optimization is stochastic, and each time it is run it can produce different point estimates of 𝐱 which are both likely under the diffusion prior and satisfy the constraint as much as possible.
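As a concrete toy illustration of Algorithm 1, the sketch below performs gradient descent on the point estimate through a frozen "denoiser" at annealed noise levels. The closed-form denoiser corresponds to a Gaussian prior centered at `m` and is purely our own illustrative assumption, not the paper's trained network:

```python
import numpy as np

# Toy Algorithm 1: gradient descent on eta through a frozen denoiser, with a
# differentiable quadratic constraint pulling eta towards a target.

def infer_eta(m, target, abar_schedule, lam=1.0, lr=0.05, steps_per_level=20):
    eta = np.zeros_like(m)                       # initialize the point estimate
    rng = np.random.default_rng(0)
    for abar in abar_schedule:                   # abar near 0: noisy/coarse; near 1: clean/fine
        for _ in range(steps_per_level):
            eps = rng.standard_normal(m.shape)   # fresh noise each step
            x_t = np.sqrt(abar) * eta + np.sqrt(1 - abar) * eps
            # optimal denoiser for a prior concentrated at m (analytic stand-in)
            eps_hat = (x_t - np.sqrt(abar) * m) / np.sqrt(1 - abar)
            # gradient of ||eps_hat - eps||^2 w.r.t. eta (chain rule through x_t)
            g_prior = 2 * np.sqrt(abar / (1 - abar)) * (eps_hat - eps)
            # constraint c(x, y) = exp(-lam * ||x - target||^2); gradient of -log c
            g_constr = 2 * lam * (eta - target)
            eta = eta - lr * (g_prior + g_constr)
    return eta

m, target = np.array([0.0, 0.0]), np.array([1.0, 1.0])
eta = infer_eta(m, target, abar_schedule=[0.1, 0.5, 0.9])
```

The result settles between the prior mean and the constraint target, with the prior dominating at the cleaner (high ᾱ) levels; with a real DDPM, the analytic `eps_hat` would be replaced by the network's noise prediction and the gradient obtained by automatic differentiation.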
We observed that optimizing simultaneously for all t makes it difficult to guide the sample towards a mode; therefore, we anneal t from high to low values. Intuitively, the first few iterations of gradient descent should coarsely explore the search space, while later iterations gradually reduce the temperature to steadily reach a nearby local maximum of p(𝐱)c(𝐱,𝐲).
Another interesting case is when 𝐱 is parametrized through a latent variable 𝐲 (this can be seen as a case of a hard, non-differentiable constraint: if 𝐱 is a deterministic function of 𝐲, 𝐱 = f(𝐲), then c(𝐱,𝐲) is Dirac on the corresponding manifold). Then the procedure in Algorithm 1 can be performed with gradient descent steps on F with respect to 𝐲 instead of steps 4,5.
We first explore the idea of generating conditional samples from an unconditional diffusion model on MNIST. We train the DDPM model of Dhariwal and Nichol (2021) on MNIST digits and experiment with different sets of constraints to generate samples with specific attributes. The examples in Fig. 1 showcase such generated samples. For the digit in (a) we set the constraint to be the unnormalized score of ‘thin’ digits, computed as the negative of the average image intensity, whereas in (b) we invert that and generate a ‘thick’ digit with high mean intensity. Similarly, in (c) and (d) we hand-craft a score that penalizes vertical and horizontal symmetry, respectively, by computing the distance between the two folds (vertical/horizontal) of the inferred digit, which leads to the generation of skewed, non-symmetric samples.
We also showcase how the auxiliary constraint can be modeled by a different, independently trained network. The digit in Fig. 1 (e) is generated by constraining the DDPM with a classifier network that is separately trained to distinguish between one particular digit class and all other digits. The auxiliary constraint in this case is the likelihood of the inferred digit, as estimated by the classifier. Finally, for (f) we multiply the horizontal symmetry and digit classifier constraints, prompting the inference procedure to generate a perfectly centered and symmetric digit.
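The hand-crafted scores described above are simple image functionals; a minimal sketch (function names are ours, operating on a [0, 1]-valued image array) might look like:

```python
import numpy as np

# Illustrative hand-crafted scores for §3.1-style constraints.

def thin_score(img):
    return -img.mean()              # 'thin': reward low average intensity

def thick_score(img):
    return img.mean()               # 'thick': reward high average intensity

def horiz_asymmetry_score(img):
    # penalize horizontal symmetry: distance between the image and its
    # left-right mirror
    return np.abs(img - img[:, ::-1]).mean()

def vert_asymmetry_score(img):
    # penalize vertical symmetry: distance to the top-bottom mirror
    return np.abs(img - img[::-1, :]).mean()

img = np.zeros((28, 28)); img[:, :14] = 1.0   # toy image: bright left half
```

Any such differentiable score can serve as log c(𝐱,𝐲) in the inference procedure.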
Details of model training and inference can be found in §A.
We examine the generation of natural images with a pretrained DDPM prior and a learned constraint. We utilize the pretrained DDPM network on FFHQ-256 Karras et al. (2019) from Baranchuk et al. (2022) and a pretrained ResNet18 face attribute classifier on CelebA Liu et al. (2015). The attribute classifier computes the likelihood of the presence of various facial features in a given image 𝐱, as they are defined by the CelebA dataset. Examples of such features are no beard, smiling, blonde and male. To generate a conditional sample from the unconditional DDPM network, we select a subset of these and enforce their presence or absence using the classifier-predicted likelihoods as our constraint c(𝐱,𝐲). If A is a set of attributes we wish to be present, the constraint can be expressed as

c(𝐱, 𝐲) = ∏_{a ∈ A} p_cls(a | 𝐱),

where p_cls(a | 𝐱) is the classifier’s predicted probability that attribute a is present in 𝐱.
We only strictly enforce a small subset of facial attributes, and therefore the optimization is allowed to converge towards different modes that correspond to samples that exhibit, in varying degrees, the desired features.
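A typical implementation would work with the product of per-attribute likelihoods in log-space, where it becomes a sum; the attribute names and probabilities below are illustrative, not classifier outputs:

```python
import math

def log_constraint(attr_probs, present, absent=()):
    """log c(x, y) = sum over desired attributes a of log p(a|x),
    plus sum over undesired attributes b of log(1 - p(b|x))."""
    lp = sum(math.log(attr_probs[a]) for a in present)
    lp += sum(math.log(1.0 - attr_probs[b]) for b in absent)
    return lp

# Hypothetical classifier outputs for one image.
probs = {"blonde": 0.9, "smiling": 0.8, "male": 0.3}
score = log_constraint(probs, present=("blonde", "smiling"), absent=("male",))
```

Maximizing this score together with the denoising objective enforces the chosen presence/absence pattern.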
In Fig. 2 we demonstrate our ability to infer conditional samples with desired attributes, using only the unconditional diffusion model p(𝐱) and the classifier. In the first row, we show the results of the optimization procedure of Algorithm 1 for various attributes. The classifier objective manipulates the image with the goal of making the classifier network produce the desired attribute predictions, whereas the diffusion objective attempts to pull the sample towards the learned distribution p(𝐱). If we ignored the denoising loss, the result would be adversarial noise that fools the classifier network. The DDPM prior, however, is strong enough to guide the process towards realistic-looking images that simultaneously satisfy the classifier constraint set.
We notice that the generated samples, although having converged towards a correct mode of p(𝐱|𝐲), still exhibit a noticeable amount of noise related to the optimization of the classifier objective. To address that, inspired by Nie et al. (2022), we simply denoise the image directly with the DDPM model, starting from a low noise level so as to retain the overall structure. The results of this denoising are shown in the second row of Fig. 2. (Model and inference details are in §B.)
[Fig. 2 attributes, left to right: blonde; 5 o’clock shadow; oval face; high cheekbones; eyeglasses; goatee + big nose]
In Fig. 3 we showcase the intermediate steps of the optimization process for inference with the conditions blond+smiling+not male, thus solving a problem like that studied in Du et al. (2020) using only independently trained attribute classifiers and an unconditional generative model of faces. The sample is initialized with Gaussian noise, and as we perform gradient steps with decreasing t values, we observe facial features being added in a coarse-to-fine manner. This aligns with the observations made in Baranchuk et al. (2022), where the intermediate representations of the denoising U-Net for different t values were used to predict attributes of varying granularity.
We test the applicability of diffusion priors in discrete tasks, such as inferring semantic segmentations from images. For this purpose, we use the EnviroAtlas dataset Pickard et al. (2015), which is composed of 5-class, 1m-resolution land cover labels from four geographically diverse cities across the US: Pittsburgh, PA; Durham, NC; Austin, TX; and Phoenix, AZ. We only have access to the high-resolution labels from Pittsburgh, and the task is to infer the land cover labels in the other three cities, given only probabilistic weak labels derived from coarse auxiliary data Rolf et al. (2022). We use Algorithm 1 to perform an inference procedure that does not directly take imagery as input, but uses constraints derived from unsupervised color clustering. We use only cluster indices in inference, making the algorithm dependent on image structure, but not color. Local cluster indices as a representation hold the promise of extreme domain transferability, but they require a form of combinatorial search that matches local cluster indices to semantic labels so that the created shapes resemble previously observed land cover, as captured by a denoising diffusion model of semantic segmentations.
We train a DDPM model on one-hot representations of the land cover labels, using the U-Net diffusion model architecture from Dhariwal and Nichol (2021). To convert the one-hot diffusion samples to probabilities, we follow Hoogeboom et al. (2022) and assume that for any pixel in the inferred sample 𝐱, the distribution over the label is obtained from the inferred continuous values with a user-defined temperature parameter. We chose this approach for its simplicity and ease of application in our inference setting of Algorithm 1. Alternatively, we could use diffusion models for categorical data Hoogeboom et al. (2021) with the appropriate modifications to our inference procedure. Samples drawn from the learned distribution are presented in Fig. 4.
[Fig. 4 legend: Water; Impervious Surface; Soil and Barren; Trees and Forest; Grass and Herbaceous]
In order to infer the segmentation of a single image under the diffusion prior, we directly apply Algorithm 1 with a hand-crafted constraint that provides structural and label guidance. To construct it, we first compute a local color clustering of the input image (see §C). In addition, we utilize the available weak labels Rolf et al. (2022) and force the predicted segments’ distribution to match the weak label distribution when averaged in non-overlapping blocks. We combine the two objectives in a single constraint by (i) computing, in overlapping image patches, the mutual information between the color clustering and the predicted labels, transformed into a valid probability distribution from the inferred one-hot vectors, and (ii) computing the negative KL divergence between the average predicted distribution and the distribution given by the weak labels in non-overlapping blocks.
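Minimal forms of the two constraint terms, under our own simplifying assumptions (a single patch/block, dense probability inputs with strictly positive averages), could look like:

```python
import numpy as np

def mutual_info(clusters, label_probs):
    """Mutual information between cluster indices (shape (N,), ints) and
    predicted label distributions (shape (N, K), rows summing to 1)."""
    n_c = clusters.max() + 1
    joint = np.zeros((n_c, label_probs.shape[1]))
    for c, p in zip(clusters, label_probs):
        joint[c] += p                         # accumulate soft co-occurrences
    joint /= joint.sum()
    pc = joint.sum(1, keepdims=True)          # marginal over clusters
    pk = joint.sum(0, keepdims=True)          # marginal over labels
    mask = joint > 0                          # skip zero cells of the joint
    return float((joint[mask] * np.log(joint[mask] / (pc @ pk)[mask])).sum())

def neg_kl_to_weak(pred_probs, weak):
    """-KL(block-averaged prediction || weak label distribution);
    assumes the averaged prediction is strictly positive."""
    avg = pred_probs.mean(0)
    return float(-(avg * np.log(avg / weak)).sum())

# Toy check values: perfectly cluster-aligned labels, and a matching average.
mi = mutual_info(np.array([0, 0, 1, 1]), np.eye(2)[[0, 0, 1, 1]])
nkl = neg_kl_to_weak(np.array([[0.6, 0.4], [0.2, 0.8]]), np.array([0.4, 0.6]))
```

Maximizing the sum of these two terms rewards segmentations that respect the color-cluster structure while matching the weak label statistics.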
Empirically, we find that we can reduce the number of optimization steps needed to perform inference by initializing the sample with the weak labels instead of random noise, allowing us to start from a smaller t. Examples of images and their inferred segmentations are shown in Fig. 5.
In Fig. 5, the sample is initialized with the weak labels to reduce the number of steps needed. The samples are chosen from (top to bottom) Durham, NC; Austin, TX; and Phoenix, AZ. Although AZ has a vastly different joint distribution of colors and labels, the inferred segmentation still captures the overall structure. Note that the inference algorithm does not use the pixel intensities in the input image, only an unsupervised color clustering.
By design, the above inference procedure has a greater ability to perform in new areas than the approach in Rolf et al. (2022), which still finetunes networks that take raw images as input. We also investigate a domain transfer approach in which inferred semantic segmentation patches are used to train neural networks for fast inference. We pretrain a standard U-Net inference network solely on 20k batches of 16 randomly sampled image patches in PA. We then randomly sample 640 images in each of the other geographies and generate semantic segmentations using our inference procedure. We then finetune the inference network on these segmentations. This network is then evaluated on the entire target geography.
The results in Table 1 demonstrate that this approach to domain transfer is comparable with the state-of-the-art work of Rolf et al. (2022) for weakly-supervised training. The naive approach of training a U-Net only on the available high-resolution PA data (PA supervised) fails to generalize to the geographically different location of Phoenix, AZ. Similarly, the model of Robinson et al. (2019) – a US-wide high-resolution land cover model trained on imagery, labels, and multi-resolution auxiliary data over the entire contiguous US – also suffers. When the weak labels are provided as input (PA supervised + weak), the results improve significantly.
| Algorithm | Durham, NC Acc % | Durham, NC IoU % | Austin, TX Acc % | Austin, TX IoU % | Phoenix, AZ Acc % | Phoenix, AZ IoU % |
|---|---|---|---|---|---|---|
| PA supervised + weak | 78.9 | 47.9 | 77.2 | 50.5 | 62.8 | 24.2 |
| Implicit posterior Rolf et al. (2022) | 79.0 | 48.4 | 76.6 | 49.5 | 76.2 | 46.0 |
| Ours (from scratch) | 76.0 | 39.9 | 74.8 | 39.4 | 69.5 | 31.6 |
| Full US supervised Robinson et al. (2019) | 77.0 | 49.6 | 76.5 | 51.8 | 24.7 | 23.6 |
So far, we have considered inference under a DDPM prior and a differentiable constraint c(𝐱,𝐲). We now consider the case of a ‘hard’ constraint, where a latent vector 𝐲 is deterministically encoded in an image 𝐱 = f(𝐲), and we have a DDPM prior over images p(𝐱). We will use the variation of Algorithm 1 described in §2.2 to obtain a point estimate of 𝐲.
We illustrate this in the setting of a well-known combinatorial problem, the traveling salesman problem (TSP). Recall that a Euclidean traveling salesman problem on the plane is described by points v_1, …, v_n, which form the vertex set of a complete weighted graph G, where the weight of the edge from v_i to v_j is the Euclidean distance ‖v_i − v_j‖. A tour of G is a connected subgraph in which every vertex has degree 2. The TSP is the optimization problem of finding the tour with minimal total weight of the edges, or, equivalently, a permutation σ of {1, …, n} that minimizes

∑_{i=1}^{n} ‖v_{σ(i)} − v_{σ(i+1)}‖,  with σ(n+1) := σ(1).
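The tour-length objective is straightforward to compute; a minimal sketch:

```python
import numpy as np

def tour_length(points, sigma):
    """Total length of the closed tour visiting points in the order sigma."""
    p = points[list(sigma)]
    # distance from each point to its successor, wrapping around to the start
    return float(np.linalg.norm(p - np.roll(p, -1, axis=0), axis=1).sum())

# Unit square: the perimeter tour has length 4; the crossing tour is longer.
square = np.array([[0., 0.], [1., 0.], [1., 1.], [0., 1.]])
```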
Although the general form of the TSP is NP-hard, a polynomial-time approximation scheme is known to exist in the Euclidean case Arora (1998); Mitchell (1999) and can yield proofs of tour optimality for small problems.
Humans have been shown to have a natural propensity for solving the Euclidean TSP (see MacGregor and Chu (2011) for a survey). Humans construct a tour by processing an image representation of the points through their visual system. However, the optimization algorithms in common use for solving the TSP do not use a vision inductive bias, instead falling into two broad categories: exact combinatorial solvers, and local search heuristics such as 2-opt Lin and Kernighan (1973).
More recently, there has been work on neural nets, trained by reinforcement learning or imitation learning, that build tours sequentially or learn heuristics for their (discrete) iterative refinement. Successful recent approaches Deudon et al. (2018); Kool et al. (2019); Joshi et al. (2019, 2021); Bresson and Laurent (2021) have used Transformer Vaswani et al. (2017) and graph neural network Kipf and Welling (2017) architectures.
The algorithm we propose using DDPMs is a hybrid of these categories: it reasons over a continuous relaxation of the problem, but exploits the learning of generalizable structure in example solutions by a neural model. In addition, ours is the first TSP algorithm to mimic the convolutional inductive bias of the visual system.
Fix a set of points v_1, …, v_n in the plane. Let 𝐀 be a symmetric n×n matrix with 0 diagonal. We encode the pair (v, 𝐀) as a greyscale image by superimposing:
- raster images of line segments from v_i to v_j with intensity value A_{ij} for every pair (i, j), and
- raster images of small black dots placed at v_i for each i.
For example, if 𝐀 is the adjacency matrix of a tour, then the encoding is a visualization of this tour as an image.
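The encoding above can be sketched with a crude rasterizer; this is a stand-in for the paper's renderer, with arbitrary resolution and sampling-density choices of our own:

```python
import numpy as np

def encode(points, A, res=64, samples=200):
    """Superimpose line segments with intensity A[i, j] and dots at the
    vertices on a greyscale canvas; coordinates assumed in [0, 1]^2."""
    img = np.zeros((res, res))
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):
            if A[i, j] <= 0:
                continue
            for s in np.linspace(0.0, 1.0, samples):   # walk along the segment
                x, y = (1 - s) * points[i] + s * points[j]
                r, c = int(y * (res - 1)), int(x * (res - 1))
                img[r, c] = max(img[r, c], A[i, j])
    for x, y in points:                                 # vertex dots
        img[int(y * (res - 1)), int(x * (res - 1))] = 1.0
    return img

# Toy tour on four points.
points = np.array([[0.1, 0.1], [0.9, 0.1], [0.9, 0.9], [0.1, 0.9]])
A = np.zeros((4, 4))
for i, j in [(0, 1), (1, 2), (2, 3), (3, 0)]:
    A[i, j] = A[j, i] = 1.0
img = encode(points, A)
```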
We use a dataset of Euclidean TSPs, with ground-truth tours obtained by the state-of-the-art Concorde TSP solver, from Kool et al. (2019) (we consider two variants of the dataset, each with 1.5m training graphs: one with 50 vertices in each graph and one with a varying number, from 20 to 50, of vertices in each graph). Each training tour is represented via its adjacency matrix and encoded as an image. We then train a DDPM with the U-Net architecture from Dhariwal and Nichol (2021) on all such encoded images. Model and training details can be found in §D. Some unconditional samples from the trained DDPM are shown in Fig. 6; most samples indeed resemble image representations of tours.
Suppose we are given a new set of points v_1, …, v_n. Solving the TSP requires finding the adjacency matrix of a tour of minimal length. As a differentiable relaxation, we set 𝐀 = ½(X + Xᵀ), where X is a stochastic matrix with zero diagonal (parametrized via a softmax over the rows of a matrix of unconstrained parameters). We run the inference procedure using the trained DDPM as a prior to estimate 𝐀. The hyperparameters and noise schedule are described in the Appendix. Examples of the optimization are shown in Fig. 7.
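The row-softmax parametrization of the relaxed adjacency matrix can be sketched as follows (our own minimal version):

```python
import numpy as np

def relaxed_adjacency(theta):
    """Unconstrained parameters theta -> row-stochastic X with zero diagonal
    (via -inf logits on the diagonal) -> symmetrized A = (X + X.T) / 2."""
    logits = theta.copy().astype(float)
    np.fill_diagonal(logits, -np.inf)          # forbid self-edges
    logits -= logits.max(axis=1, keepdims=True)  # stabilize the softmax
    e = np.exp(logits)
    X = e / e.sum(axis=1, keepdims=True)       # softmax over rows
    return 0.5 * (X + X.T)

A = relaxed_adjacency(np.zeros((4, 4)))        # uniform off-diagonal entries
```

During inference, gradients of the free energy flow through this map back to the parameters.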
Although the inferred 𝐀 is usually sharp (i.e., all entries close to 0 or 1), rounding its entries to 0 or 1 does not always give the adjacency matrix of a tour (see, for example, the top row of Fig. 7; other common incorrect outputs include two disjoint tours). To extract a tour from the inferred 𝐀, we greedily insert edges to form an initial proposal, then refine it using a standard and lightweight combinatorial procedure, the 2-opt heuristic Lin and Kernighan (1973) (amounting to iteratively uncrossing pairs of edges that intersect). The entire procedure is shown in Fig. 7, and full details can be found in §D.
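The edge-uncrossing refinement can be sketched with a naive quadratic 2-opt pass; this is the textbook heuristic, not the paper's exact implementation:

```python
import numpy as np

def tour_len(points, tour):
    p = points[np.array(tour)]
    return float(np.linalg.norm(p - np.roll(p, -1, axis=0), axis=1).sum())

def two_opt(points, tour):
    """Repeatedly reverse a segment of the tour whenever doing so shortens
    it, which uncrosses pairs of intersecting edges."""
    tour = list(tour)
    improved = True
    while improved:
        improved = False
        for i in range(1, len(tour) - 1):
            for j in range(i + 1, len(tour)):
                cand = tour[:i] + tour[i:j][::-1] + tour[j:]
                if tour_len(points, cand) < tour_len(points, tour) - 1e-12:
                    tour, improved = cand, True
    return tour

square = np.array([[0., 0.], [1., 0.], [1., 1.], [0., 1.]])
fixed = two_opt(square, [0, 2, 1, 3])   # a self-crossing tour gets uncrossed
```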
[Fig. 7: optimize the latent adjacency matrix w.r.t. the denoising model, then recover the tour]
We evaluate the trained models on test sets of 1280 graphs each, with 50 and 100 vertices. (We also show results with 200 vertices in §D.) We report the average length of the inferred tour and the gap (discrepancy from the length of the ground truth tour) in Table 2 (left), from which we make several observations.
The right side of Table 2 shows the number of 2-opt (edge uncrossing) steps performed in the refinement step of the algorithm when the inference algorithm is run for varying numbers of steps. Running the inference with more steps results in extracted tours that are closer to local minima with respect to the 2-opt neighbourhood, indicating that the DDPM encodes meaningful information about the shape of tours.
The DDPM inference is competitive with recent baseline algorithms that do not use beam search in generation of the tour (those shown in the table). These baseline algorithms improve when beam search decoding with very large beam size is used, but encounter diminishing returns as the computation cost grows. Our performance on the 100-vertex problems is similar to Kool et al. (2019) with the largest beam size they report (5000), which has 2.18% gap, while having similar computation time.
The model trained on problems with 50 nodes performs almost identically to the model trained on problems with 50 or fewer nodes, and both models generalize better than baseline methods from 50-node problems to the out-of-distribution 100-node problems.
We reemphasize a unique feature of our algorithm: all ‘reasoning’ in our inference procedure happens via the image space. This property also leads to sublinear scaling of computation cost with increasing size of the graph – as long as the graph can reasonably be represented in an image of fixed resolution – since most of the computation cost of inference is borne by running the denoiser on images of a fixed size.
| Algorithm | Obj (50 vertices) | Gap % | Obj (100 vertices) | Gap % |
|---|---|---|---|---|
| Oracle (Concorde) | 5.69 | 0.00 | 7.759 | 0.00 |
| 2-opt Lin and Kernighan (1973) | 5.86 | 2.95 | 8.03 | 3.54 |
| Transformer Kool et al. (2019) | 5.80 | 1.76 | 8.12 | 4.53 |
| GNN Joshi et al. (2019) | 5.87 | 3.10 | 8.41 | 8.38 |
| Transformer Bresson and Laurent (2021) | 5.71 | 0.31 | 7.88 | 1.42 |

[Table 2, right: Obj, Gap %, and number of 2-opt steps for varying numbers of diffusion steps]
We have shown how inference in denoising diffusion models can be performed under constraints in a variety of settings. Imposing constraints that arise from pretrained classifiers enables conditional generation, while common-sense conditions, such as mutual information with a clustering or divergence from weak labels, can lead to models that are less sensitive to domain shift in the distribution of conditioning data.
A notable limitation of DDPMs, which is inherited by our algorithms, is the high cost of inference, requiring a large number of passes through the denoising network to generate a sample. We expect that with further research on DDPMs for which inference procedures converge in fewer steps Salimans and Ho (2022); Xiao et al. (2022), plug-and-play use of DDPMs will become more appealing in various applications.
Finally, our results on the traveling salesman problem illustrate the ability of DDPMs to reason over uncertain hypotheses in a manner that can mimic human ‘puzzle-solving’ behavior. These results open the door to future research on using DDPMs to efficiently generate candidates in combinatorial search problems.
Bresson, X. and Laurent, T. (2021). The transformer network for the traveling salesman problem. arXiv preprint arXiv:2103.03012.
Du, Y. et al. (2020). Compositional visual generation with energy based models. Neural Information Processing Systems (NeurIPS).
Saharia, C. et al. (2021). Image super-resolution via iterative refinement. arXiv preprint arXiv:2104.07636.
Sohl-Dickstein, J. et al. (2015). Deep unsupervised learning using nonequilibrium thermodynamics. International Conference on Machine Learning (ICML).
Tashiro, Y. et al. (2021). CSDI: Conditional score-based diffusion models for probabilistic time series imputation. Neural Information Processing Systems (NeurIPS).
For all inference examples, we performed 1000 optimization steps with the Adam optimizer and a fixed learning rate. We employed a cosine-modulated, linearly decreasing annealing schedule for t, as shown in Fig. A.1 (a). We empirically designed this annealing process following the observation that linearly decreasing t values guide the inference procedure in a coarse-to-fine manner that starts by deciding the overall structure of the inferred sample and then adds in details. We also added the oscillating component to allow for revisions of the coarser structures after specific details have been inferred.
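Our reading of this schedule can be sketched as follows; the exact amplitude and frequency are not given above, so the numbers below are placeholders:

```python
import numpy as np

def annealed_t_schedule(n_steps=1000, t_max=999, t_min=0, osc_amp=50.0, osc_freq=10):
    """Linearly decreasing baseline for t with a cosine oscillation on top,
    clipped to valid diffusion time steps."""
    s = np.arange(n_steps) / max(n_steps - 1, 1)       # progress in [0, 1]
    base = t_max - s * (t_max - t_min)                 # linear decrease
    osc = osc_amp * np.cos(2 * np.pi * osc_freq * s)   # periodic revisions
    return np.clip(np.round(base + osc), t_min, t_max).astype(int)

ts = annealed_t_schedule()
```

At optimization step k, the denoising loss is evaluated at noise level `ts[k]`.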
When performing the optimization step of Algorithm 1, we observed that it was important to gradually reduce the effect of the condition in order to obtain good sample quality. In practice, we linearly decreased the weight of the conditional component of the loss over the course of the optimization steps. This can be attributed to the fact that the conditions we used provide guidance for the steps made at larger values of t, where the shape and orientation of a digit are decided. When combining two or more conditions, the weighting is applied to all of them.
For the conditional generation experiments on the FFHQ dataset, we utilized the pre-trained DDPM model provided by Baranchuk et al. (2022). The face attribute classifier network was a ResNet-18 network trained on the face attributes given in the CelebA dataset Liu et al. (2015). To run our inference algorithm, we performed 200 optimization steps with the Adamax optimizer and a linearly decreasing learning rate. The annealing schedule was similar to the one used for the land cover segmentation experiments (Fig. A.1 (b)), but with t values now ranging from 1000 to 200. Additionally, in this experiment we found that balancing the diffusion and auxiliary losses with a carefully chosen weighting term was difficult. Thus, we opted for a different approach, where we clipped the gradient norm of the auxiliary objective to a fixed fraction of the gradient norm of the diffusion denoising loss.
In Fig. B.1 and B.2, we demonstrate additional conditionally-generated samples from the unconditional DDPM and the attribute classifier. In the first set of examples, we show that although we may find modes of p(𝐱|𝐲) that satisfy, to a degree, the condition set by the classifier, the sample quality is not always on par with unconditionally generated samples, like those presented in Baranchuk et al. (2022). We can attribute that to the fact that, for natural images, in contrast to segmentation labels, the mode may not always be a good-looking sample from the distribution. Our method to mitigate that, along with the classifier noise artifacts left from the optimization process, is to run the diffusion denoising procedure starting from a low noise level t. Although this may improve the visual quality of the result, in some cases our choice of t is not large enough to move the sample sufficiently far from the inferred mode. If we choose a larger t, however, we risk erasing the attributes we aimed to generate in the first place.
In the second set of samples, we first show how conflicting attributes are resolved. When the constraint is set to satisfy two attributes that contradict each other we observed that the inferred sample tends to gravitate towards a single randomly-chosen direction. This is evident in the first two examples where we set the not male attribute along with a male-correlated attribute. In each of them only a single condition, either the not male or the male-related, is satisfied. In the blonde+black hair example we could argue that a mix of the two attributes is present in the inferred sample. However, the classifier predictions for that specific image tell us that the person shown is exclusively blonde.
We also show a set of failure cases where the classifier 'painted' the features related to the desired attribute but the diffusion prior did not complete the sample in a correct way. For instance, in the eyeglasses example we see that the classifier has drawn an outline of the eyeglass edges on the generated face, but the diffusion model has failed to pick up the cue. Similarly, when asking for wavy hair we see curves that can fool the classifier into thinking that the person has curly hair, and when the attributes set are smiling+mustache we observe a comically drawn mustache on the generated face. Since the conditioning depends both on the diffusion prior and the robustness of the classifier, we believe that better classifier training could improve the result in such cases.
(Attribute labels shown: not male, male, not young, young, rosy cheeks, mustache)
(Attribute labels shown: not male+bald, not male+beard, black hair, eyeglasses, wavy hair, mustache)
The land cover DDPM was trained on patches of land cover labels, randomly sampled from the Pittsburgh, PA tiles of the EnviroAtlas dataset (Pickard et al.). For the diffusion network, we used the U-Net architecture of Dhariwal and Nichol with a linear noise schedule. We trained with batches of size 32, using the Adam optimizer. Additional samples from the unconditional diffusion model are shown in Fig. C.1. We observe that the model has learned both structures that are independent of the geography, such as the continuity of roads and suburban building planning, and PA-specific ones, such as buildings nested in forested areas, which may not be as common in AZ, for instance.
(Fig. C.1 legend: Water, Impervious Surface, Soil and Barren, Trees and Forest, Grass and Herbaceous)
Since we initialize the inference procedure with the weak labels, we require fewer optimization steps and do not have to start the search from a random initialization. Thus, to infer the land cover segmentations we only perform 200 optimization steps using the Adam optimizer with a linearly decreasing learning rate. The annealing schedule we designed for this task reflects the need for fewer overall steps and is shown in Fig. A.1 (b). We also linearly decrease the weights of both conditional components of the loss over the optimization steps, to reduce their influence on the final inferred sample. In addition, we linearly decrease the parameter that is used to convert the one-hot representations learned by the DDPM model to probabilities, to mimic the uncertainty of this conversion process. Further examples of land cover segmentation inference are shown in Fig. C.2. Despite the fact that the DDPM was trained only on PA land cover labels, we show how the weak label guidance allows us to perform inference in completely new geographies, such as that of AZ (last two rows), where the most prominent label is now Soil and Barren. We can still observe a few artifacts of the PA-related biases the model has learned, such as the tendency to add uninterrupted forested areas, but the transferability of the semantic model is still far superior to that of an image-based one.
(Fig. C.2 legend: Water, Impervious Surface, Soil and Barren, Trees and Forest, Grass and Herbaceous)
Our hand-crafted constraint for land cover segmentation inference is split between two objectives: (i) matching the structure of the target image using a local color clustering, and (ii) forcing the predicted segments' distribution to match the weak label distribution when averaged in non-overlapping blocks of the image.
The local color clustering is computed as a local Gaussian mixture with a fixed number of components. To match the structure between the predicted labels and the pre-computed clustering, we compute the mutual information between the two distributions in overlapping patches of pixels. This constraint pushes the inferred land cover segments to locally match the color clustering segments. Although this allows us to infer the labels of large structures like roads and buildings, it also tends to add noisy labels in areas where the clustering has high entropy. By gradually reducing the weight of the auxiliary objective, however, we allow the inference procedure to 'fill in' these details as dictated by the diffusion prior.
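As a minimal sketch of this structure-matching term, the mutual information of a discrete joint distribution over (predicted label, color cluster) pairs, e.g. accumulated over a single patch, can be computed as below; this is a simplified stand-in for the exact patch-wise computation, not our full implementation:

```python
import numpy as np

def mutual_information(joint, eps=1e-12):
    """Mutual information of a discrete joint distribution, e.g. over
    (predicted label, color cluster) pairs accumulated in one patch.
    Higher MI means the label segments align with the color clusters."""
    px = joint.sum(axis=1, keepdims=True)   # marginal over clusters
    py = joint.sum(axis=0, keepdims=True)   # marginal over labels
    return float(np.sum(joint * (np.log(joint + eps) - np.log(px * py + eps))))
```

Maximizing this quantity per patch encourages a one-to-one correspondence between inferred segments and color clusters without fixing which label maps to which cluster.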
The label guidance during inference is provided by probabilistic weak labels, which are derived from coarse auxiliary data. These data are composed of the 30m-resolution National Land Cover Database (NLCD) labels, augmented with building footprints, road networks and waterways/waterbodies (Rolf et al.). The corresponding weak label constraint is computed as the KL-divergence between the average predicted and weak label distributions in non-overlapping blocks of pixels. In the absence of such guidance, the inference procedure can easily confuse semantic classes while still producing segmentations that are likely under the diffusion prior. We showcase this in Fig. C.3, where we infer the land cover labels of an image, starting from a random initialization, with and without the weak label guidance.
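A sketch of this block-wise KL constraint follows; the block size is illustrative, not the value used in the experiments:

```python
import numpy as np

def weak_label_kl(pred, weak, block=8, eps=1e-8):
    """KL divergence between block-averaged weak label and predicted
    class distributions, summed over non-overlapping blocks.
    pred, weak: (H, W, C) arrays of per-pixel class probabilities."""
    H, W, C = pred.shape
    kl = 0.0
    for i in range(0, H, block):
        for j in range(0, W, block):
            p = pred[i:i+block, j:j+block].reshape(-1, C).mean(axis=0)
            q = weak[i:i+block, j:j+block].reshape(-1, C).mean(axis=0)
            kl += float(np.sum(q * (np.log(q + eps) - np.log(p + eps))))
    return kl
```

The loop form is for clarity; in practice the block averages can be computed with a single pooling operation.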
(Fig. C.3 panels: Inferred (no guidance), Inferred (with guidance), Ground Truth)
Regarding the domain transfer experiments, we initially pre-trained the standard inference U-Net (Ronneberger et al.) on batches of 16 randomly sampled image patches in Pittsburgh, PA, using the Adam optimizer. We then inferred the land cover segmentations of 640 randomly-sampled patches in each of the other geographic regions (NC, TX, AZ) using the inference procedure described above. With these generated labels, we first fine-tuned the original network on a validation set of 5 tiles to determine the optimal fine-tuning parameters. For Durham, NC and Austin, TX we only fine-tune the last layer of the network for a single epoch, using a batch size of 16 patches. For Phoenix, AZ we require 5 epochs of fine-tuning the entire network, since the domain shift is larger. Additionally, for all regions, following the experiments of Rolf et al., we multiply the predicted probabilities by the weak labels and renormalize.
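The final reweighting step can be sketched as follows (array shapes are illustrative):

```python
import numpy as np

def reweight(pred, weak, eps=1e-8):
    """Multiply predicted class probabilities by the weak labels and
    renormalize along the class axis, zeroing out classes the weak
    labels rule out."""
    out = pred * weak
    return out / (out.sum(axis=-1, keepdims=True) + eps)
```

This suppresses classes that the coarse auxiliary data deem impossible at a given location while preserving the relative confidence among the remaining classes.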
Finally, in Table 1, we also present the results when the inference network is trained from scratch, to show that the resulting performance is not merely an artifact of the pre-training. The U-Net was trained for 20 epochs on all 640 generated samples with a batch size of 16.
The DDPM was trained on ground truth TSP solutions encoded as images. The architecture was the same U-Net as used in the other experiments, following Dhariwal and Nichol. We trained each model for 8 epochs with batch size 16, which took about two days on one Tesla K80 GPU.
At inference time, we performed varying numbers of inference steps (see Table 2 in the main text), using the Adam optimizer with a learning rate linearly decaying from 1 to 0.1. The noise schedule was the same as that used in the MNIST experiment (Fig. A.1), with the time interval from 0 to 1000 linearly resampled to the number of inference steps used.
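The resampling of the time interval can be sketched as below (the default step count is illustrative):

```python
import numpy as np

def resample_schedule(T=1000, steps=256):
    """Linearly resample the diffusion time interval [0, T] down to the
    number of inference steps, from high noise to low noise."""
    return np.linspace(T, 0, steps).round().astype(int)

ts = resample_schedule(1000, 5)   # array([1000, 750, 500, 250, 0])
```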
To extract a tour from the inferred adjacency matrix, we used the following greedy edge insertion procedure.
Initialize the extracted tour with an empty graph on the problem's vertices.
Sort all possible edges in decreasing order of their score, i.e., the inverse edge weight multiplied by the inferred likelihood, and call the result the candidate edge list.
For each edge in the list:
If inserting the edge into the graph results in a complete tour, insert it and terminate.
If inserting it results in a graph with a cycle shorter than a full tour, skip it and continue.
Otherwise, insert the edge into the tour.
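A minimal sketch of this procedure, assuming symmetric NumPy arrays `adj` (inferred edge likelihoods) and `dist` (edge lengths), with a union-find structure for cycle detection:

```python
import numpy as np

def extract_tour(adj, dist):
    """Greedy edge insertion: sort edges by adj/dist (inferred likelihood
    over edge length) and insert unless a vertex would exceed degree 2
    or a sub-tour (cycle shorter than n) would be created."""
    n = len(adj)
    degree = [0] * n
    parent = list(range(n))             # union-find roots

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    edges = [(i, j) for i in range(n) for j in range(i + 1, n)]
    edges.sort(key=lambda e: adj[e] / dist[e], reverse=True)

    tour, inserted = [], 0
    for i, j in edges:
        if degree[i] >= 2 or degree[j] >= 2:
            continue
        ri, rj = find(i), find(j)
        if ri == rj and inserted < n - 1:
            continue                    # would close a short cycle
        parent[ri] = rj
        degree[i] += 1
        degree[j] += 1
        tour.append((i, j))
        inserted += 1
        if inserted == n:               # the tour is complete
            break
    return tour
```

The closing edge is only accepted once n-1 edges are in place, which is exactly the "complete tour" termination case of the pseudocode above.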
It is easy to see that this algorithm terminates before the entire edge list has been traversed. The tour is refined by a naïve implementation of 2-opt, in which, on each step, all pairs of edges in the tour are enumerated and a 2-opt move is performed if the edges cross. For the ‘2-opt’ baseline, the same procedure is performed using a uniform adjacency matrix.
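The naive 2-opt refinement can be sketched as follows; `pts` is an array of city coordinates and `order` a candidate tour, and reversing the segment between two crossing edges uncrosses them:

```python
import numpy as np

def two_opt(order, pts):
    """Naive 2-opt: repeatedly reverse a tour segment whenever the two
    edges it replaces cross (i.e., the reversal shortens the tour)."""
    d = lambda a, b: np.linalg.norm(pts[a] - pts[b])
    n = len(order)
    improved = True
    while improved:
        improved = False
        for i in range(n - 1):
            # skip the pair (first edge, closing edge): they share a vertex
            for j in range(i + 2, n if i > 0 else n - 1):
                a, b = order[i], order[i + 1]
                c, e = order[j], order[(j + 1) % n]
                if d(a, b) + d(c, e) > d(a, c) + d(b, e) + 1e-12:
                    order[i + 1:j + 1] = order[i + 1:j + 1][::-1]
                    improved = True
    return order
```

Each accepted move strictly shortens the tour, so the loop terminates.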
Extending the results in Table 2 of the main text, we evaluate the model trained on TSP instances with 20 to 50 nodes on problems with 200 nodes. We find an optimality gap of 3.77% (average number of uncrossing moves 219), compared to 3.81% for 2-opt (average number of uncrossing moves 115), suggesting that the generalization potential is near-saturated at this problem size. As shown in Fig. D.3, the vertices fill the image with such high density that it is difficult to see the (light grey) tour; many edges are invisible (compare to Fig. 7 in the main text).
(Pipeline: optimize latent adj. matrix w.r.t. denoising model, then recover tour)
We suggest three directions for solving this problem that should be explored in later work:
Encoding: The size of the encoding image can be increased when the number of vertices increases, without changing the model, which can make denoising predictions on images of any size. We may expect to see better out-of-domain generalization of the denoising model in this setting, as the density of nodes (mean number of black pixels) would match that in the training set. Figs. D.1 and D.2 show the potential of DDPMs to generalize to image sizes larger than those on which they were trained. Inference using larger images gives an optimality gap of 2.59% (average number of uncrossings 81), much lower than that obtained with the in-domain image size.
In addition, encoding graphs with smaller dots and thinner lines can be explored, although the generalization difficulties due to image 'crowding' would still appear at a larger number of vertices.
Fractal behaviour and coarse-to-fine: Taking advantage of the fractal structure of Euclidean TSP solutions, a denoising objective could be used to locally refine the tour by minimizing the objective on a crop of the image representation (a form of DDPM-guided local search). This could be done in a coarse-to-fine manner by application of the same model at different scales, with a representation of a problem with 200 vertices being first optimized with respect to the denoising objective globally, then on crops.
Improved extraction: The 2-opt search can be improved by inexpensive heuristics, such as choosing the 2-opt move that most improves the cost on every step, rather than iterating through the edges of the candidate tour in order.
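This best-improvement selection rule can be sketched as follows (assuming `pts` is an array of city coordinates and `order` a candidate tour):

```python
import numpy as np

def best_two_opt_move(order, pts):
    """Return the 2-opt move (i, j) with the largest cost improvement,
    or None if no move improves the tour (best-improvement rule)."""
    d = lambda a, b: np.linalg.norm(pts[a] - pts[b])
    n = len(order)
    best, best_gain = None, 1e-12
    for i in range(n - 1):
        for j in range(i + 2, n if i > 0 else n - 1):
            a, b = order[i], order[i + 1]
            c, e = order[j], order[(j + 1) % n]
            gain = d(a, b) + d(c, e) - d(a, c) - d(b, e)
            if gain > best_gain:
                best, best_gain = (i, j), gain
    return best
```

Applying the returned move (reversing the segment between the two edges) and repeating until `None` is returned yields the best-improvement variant of the search.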