AdversarialConsistentScoreMatching
Code for paper "Adversarial score matching and improved sampling for image generation"
Denoising score matching with Annealed Langevin Sampling (DSM-ALS) is a recent approach to generative modeling. Despite the convincing visual quality of samples, this method appears to perform worse than Generative Adversarial Networks (GANs) under the Fréchet Inception Distance, a popular metric for generative models. We show that this apparent gap vanishes when denoising the final Langevin samples using the score network. In addition, we propose two improvements to DSM-ALS: 1) Consistent Annealed Sampling as a more stable alternative to Annealed Langevin Sampling, and 2) a hybrid training formulation, composed of both denoising score matching and adversarial objectives. By combining both of these techniques and exploring different network architectures, we elevate score matching methods and obtain results competitive with state-of-the-art image generation on CIFAR-10.
Song and Ermon (2019) recently proposed a novel method of generating samples from a target distribution through a combination of Denoising Score Matching (DSM) (Hyvärinen, 2005; Vincent, 2011; Raphan and Simoncelli, 2011) and Annealed Langevin Sampling (ALS) (Welling and Teh, 2011; Roberts et al., 1996). The main benefits of their approach (DSM-ALS) are that it produces relatively high-quality samples and guarantees high diversity due to ALS. The main drawback of DSM-ALS is that generating samples requires many iterations rather than being done in one shot.
Song and Ermon (2020) further improved their approach by increasing the stability of score matching training and proposing theoretically sound choices of hyperparameters. They also scaled their approach to higher-resolution images and showed that DSM-ALS is competitive with other generative models.
Song and Ermon (2020) observed that the images produced by their improved model were more visually appealing than the ones from their original work; however, the reported Fréchet Inception Distance (FID) (Heusel et al., 2017) did not correlate with this improvement. In Section 5, we show that taking the expected denoised sample (EDS) (Robbins, 1955; Miyasawa, 1961; Raphan and Simoncelli, 2011) corrects this mismatch and gives some insight as to its origin.
In Section 3, we show that the scaling of the additive noise in the ALS proposed by Song and Ermon (2019, 2020) does not ensure that the levels of noise in the samples will decrease geometrically at inference time, and we hypothesize that this incongruity is harmful to the quality of samples. We show how to scale the noise for consistent sampling such that the noise level follows the desired schedule as closely as possible.
Although DSM-ALS is gaining traction, Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) remain the leading approach to generative modeling. GANs are a very popular class of generative models; they have been successfully applied to image generation (Brock et al., 2018; Karras et al., 2017, 2019, 2020) and have subsequently spawned a wealth of variants (Radford et al., 2015b; Miyato et al., 2018; Jolicoeur-Martineau, 2018; Zhang et al., 2019). The idea behind this method is to train a Discriminator ($D$) to correctly distinguish real samples from fake samples generated by a second agent, known as the Generator ($G$). GANs excel at generating high-quality samples as the discriminator captures which features make an image realistic, while the generator learns to emulate them.
Still, GANs often have trouble generating data from all possible modes, which impedes the diversity of the generated samples. A wide variety of tricks has been developed to address this issue in GANs (Kodali et al., 2017; Gulrajani et al., 2017; Arjovsky et al., 2017; Miyato et al., 2018; Jolicoeur-Martineau and Mitliagkas, 2019). On the other hand, DSM-ALS does not suffer from such an issue, since ALS allows for sampling from the full distribution captured by the score network. Nevertheless, the perceptual quality of its higher-resolution images is still inferior to that of images generated by GANs. In Section 4, we propose to take the best of both worlds through a formulation combining DSM with a Discriminator during training. In Section 6, we show that this hybrid method improves the quality of samples without sacrificing diversity.
Generative modeling has also seen some incredible work from Ho et al. (2020), who achieved exceptionally low FID on image generation tasks. Their approach showcased a new diffusion-based method that shares close ties with denoising score matching. We show that the network architecture they used, a variant of Salimans et al. (2017), significantly improves quality over the RefineNet (Lin et al., 2017a) architecture used by Song and Ermon (2020). In doing so, we close the gap between DSM-ALS and the diffusion-based approach.
In Section 6.1, we study how these two training improvements (hybrid objective, and Ho et al. (2020) score network architecture) and two sampling improvements (consistent sampling, expected denoised sample) interact by performing an ablation study on the CIFAR-10 and LSUN-Churches generation tasks. The code to replicate our experiments is publicly available on https://github.com/AlexiaJM/AdversarialConsistentScoreMatching.
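Denoising Score Matching (DSM) (Hyvärinen, 2005) consists of training a score network to approximate the gradient of the log density of a certain distribution ($\nabla_x \log p(x)$), referred to as the score function. This is achieved by training the network to approximate a noisy surrogate of $p$ at multiple levels of Gaussian noise corruption (Vincent, 2011). The score network $s_\theta$, parametrized by $\theta$ and conditioned on the noise level $\sigma$, is then trained by minimizing the following loss:
$$\mathcal{L}[s_\theta] = \frac{1}{2}\, \mathbb{E}_{p(\tilde{x}, x, \sigma)} \left[ \left\| \sigma\, s_\theta(\tilde{x}, \sigma) + \frac{\tilde{x} - x}{\sigma} \right\|_2^2 \right] \tag{1}$$
where $p(\tilde{x}, x, \sigma) = q_\sigma(\tilde{x} \mid x)\, p(x)\, p(\sigma)$. We further define $q_\sigma(\tilde{x} \mid x) = \mathcal{N}(\tilde{x} \mid x, \sigma^2 I)$ as the corrupted data distribution, $p(x)$ as the training data distribution, and $p(\sigma)$ as the uniform distribution over a set $\{\sigma_i\}$ corresponding to different levels of noise. In practice, this set is defined as a geometric progression between $\sigma_1$ and $\sigma_L$ (with $L$ chosen according to some computational budget):
$$\{\sigma_i\}_{i=1}^{L} = \left\{ \gamma^{i} \sigma_1 \;\middle|\; i \in \{0, \dots, L-1\},\ \gamma \triangleq \frac{\sigma_2}{\sigma_1} = \dots = \left(\frac{\sigma_L}{\sigma_1}\right)^{\frac{1}{L-1}} < 1 \right\} \tag{2}$$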
Rather than having to learn a different score function for every $\sigma_i$, one can train an unconditional score network by defining $s_\theta(\tilde{x}, \sigma) = s_\theta(\tilde{x}) / \sigma$, and then minimizing Eq. 1. While unconditional networks are computationally lighter, it remains an open question whether conditioning helps performance. Li et al. (2019) and Song and Ermon (2020) found that the unconditional network produced better samples, while Ho et al. (2020) obtained better results than both of them using a conditional network. Additionally, the conditional score function described in Lim et al. (2020) gives evidence supporting the benefits of conditioning when the noise becomes small. While our experiments are conducted with unconditional networks, we believe our techniques can be straightforwardly applied to conditional networks; we leave that extension for future work.
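As a rough illustration of the objective and the geometric noise schedule, the NumPy sketch below computes a Monte Carlo estimate of the loss, assuming it takes the usual form $\frac{1}{2}\mathbb{E}\|\sigma s(\tilde{x}, \sigma) + (\tilde{x} - x)/\sigma\|^2$. The function names, the toy Dirac dataset, the closed-form score used for the check, and the numeric values are assumptions made for this example; this is not the training code from the repository.

```python
import numpy as np

def geometric_sigmas(sigma_1, sigma_L, L):
    """Noise levels in geometric progression from sigma_1 down to sigma_L."""
    return np.geomspace(sigma_1, sigma_L, L)

def dsm_loss(score_fn, x, sigmas, rng):
    """Monte Carlo estimate of 0.5 * E|| sigma * s(x_tilde, sigma) + (x_tilde - x) / sigma ||^2."""
    n = x.shape[0]
    sigma = rng.choice(sigmas, size=(n, 1))      # sigma drawn uniformly from the set
    z = rng.standard_normal(x.shape)
    x_tilde = x + sigma * z                      # x_tilde ~ N(x, sigma^2 I)
    residual = sigma * score_fn(x_tilde, sigma) + (x_tilde - x) / sigma
    return 0.5 * np.mean(np.sum(residual ** 2, axis=1))

rng = np.random.default_rng(0)
sigmas = geometric_sigmas(50.0, 0.01, 232)       # illustrative values only
x0 = np.array([1.0, -2.0])
data = np.tile(x0, (512, 1))                     # toy dataset: a Dirac at x0

# For Dirac data the exact score of the corrupted distribution is (x0 - x_tilde) / sigma^2.
optimal_score = lambda xt, s: (x0 - xt) / s**2
zero_score = lambda xt, s: np.zeros_like(xt)

loss_opt = dsm_loss(optimal_score, data, sigmas, rng)
loss_zero = dsm_loss(zero_score, data, sigmas, rng)
```

In this toy setting the exact score drives the loss to numerical zero, while a network that outputs zero everywhere pays roughly half the data dimension, since the residual is then just the standard normal draw.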
Given a score function, one can use Langevin dynamics (i.e., Langevin sampling) (Welling and Teh, 2011) to sample from the corresponding probability distribution. In practice, the score function is generally unknown and estimated through a score network trained to minimize Eq. 1. Song and Ermon (2019) showed that Langevin sampling has trouble exploring the full support of the distribution when the modes are too far apart and proposed Annealed Langevin Sampling (ALS) as a solution. ALS starts sampling with a large noise level and progressively anneals it down to a value close to $0$, ensuring both proper mode coverage and convergence to the data distribution. It is shown in Algorithm 1.
A little-known fact from the Bayesian literature is that one can recover a denoised sample from the score function using the Empirical Bayes mean (Robbins, 1955; Miyasawa, 1961; Raphan and Simoncelli, 2011):
$$s^*(\tilde{x}, \sigma) = \frac{H^*(\tilde{x}, \sigma) - \tilde{x}}{\sigma^2} \tag{3}$$
where $s^*$ is the optimal score function and $H^*(\tilde{x}, \sigma) \triangleq \mathbb{E}_{x \sim q_\sigma(x \mid \tilde{x})}[x]$ is the expected denoised sample given a noisy sample (or Empirical Bayes mean), conditioned on the noise level. A different way of reaching the same result is through the closed form of the optimal score function, as presented in Appendix D. Note that the corresponding result for the unconditional score function is presented in Appendix E. The EDS corresponds to the expected real image given a corrupted image; it can be thought of as what the score network believes to be the true image concealed within the noisy input. In Section 4, we show how to take advantage of this fact to build a hybrid training objective.
It has also been suggested that denoising the samples (i.e., taking the EDS) at the end of the Langevin sampling improves their quality (Saremi and Hyvarinen, 2019; Li et al., 2019; Kadkhodaie and Simoncelli, 2020). In Section 5, we provide further evidence that denoising the final Langevin sample brings it closer to the assumed data manifold. In particular, we show that the Fréchet Inception Distance (FID) consistently decreases after denoising.
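The sampling-then-denoising procedure can be sketched in a few lines. The code below is a minimal illustration, assuming the step-size convention $\alpha_i = \epsilon \sigma_i^2 / \sigma_L^2$ and the update $x \leftarrow x + \alpha_i s(x, \sigma_i) + \sqrt{2\alpha_i} z$, followed by one EDS step $x + \sigma_L^2 s(x, \sigma_L)$. The function name, the Gaussian toy data, its closed-form score, and all numeric values are assumptions for this example rather than settings from the experiments.

```python
import numpy as np

def annealed_langevin_sampling(score_fn, x, sigmas, eps, n_sigma, rng, denoise=True):
    """ALS sketch: n_sigma Langevin steps per noise level, optional final EDS step."""
    sigma_L = sigmas[-1]
    for sigma in sigmas:
        alpha = eps * sigma**2 / sigma_L**2            # step size for this noise level
        for _ in range(n_sigma):
            z = rng.standard_normal(x.shape)
            x = x + alpha * score_fn(x, sigma) + np.sqrt(2.0 * alpha) * z
    if denoise:
        # Expected denoised sample: H(x, sigma) = x + sigma^2 * s(x, sigma)
        x = x + sigma_L**2 * score_fn(x, sigma_L)
    return x

# Toy data (assumption): x ~ N(mu, tau^2 I). Its noisy marginal at level sigma is
# N(mu, (tau^2 + sigma^2) I), so the exact score is (mu - x) / (tau^2 + sigma^2).
mu, tau = np.array([3.0, -1.0]), 0.5
score = lambda x, sigma: (mu - x) / (tau**2 + sigma**2)

rng = np.random.default_rng(0)
sigmas = np.geomspace(10.0, 0.05, 30)                  # illustrative schedule
x_init = sigmas[0] * rng.standard_normal((2000, 2))
samples = annealed_langevin_sampling(score, x_init, sigmas, eps=1e-3, n_sigma=20, rng=rng)
```

With an exact score the samples land close to the toy data distribution; the final denoising step only shrinks the residual noise left at the last level.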
In this section, we present Consistent Annealed Sampling (CAS) as an alternative to ALS, demonstrate how it can enforce any sampling noise schedule and explain how this improves on ALS.
One can think of the ALS algorithm as a sequential series of Langevin Dynamics (inner loop) for decreasing levels of noise (outer loop). If allowed an infinite number of steps $n_\sigma$, the sampling process will properly produce samples from the data distribution.
In ALS, the score network is conditioned on geometrically decreasing noise ($\sigma_i$). In the unconditional case, this corresponds to dividing the score network by the standard deviation (i.e., $s_\theta(x)/\sigma_i$). Thus, in both conditional and unconditional cases, we make the assumption that the noise of the sample at step $i$ will be of variance $\sigma_i^2$. While choosing to use a geometric progression of noise levels seems like a reasonable (though arbitrary) schedule to follow, we show that ALS does not ensure such a schedule.
Assume we have the true score function $s^*$ and begin sampling using a real image with some added zero-centered Gaussian noise of standard deviation $\sigma_0$. In Figure 1(a), we illustrate how the intensity of the noise in the sample evolves through ALS and CAS, our proposed sampling, for a given sampling step size $\epsilon$ and a geometric schedule in this idealized scenario. We note that, although a large $n_\sigma$ approaches the real geometric curve, it will only reach it at the limit ($n_\sigma \to \infty$ and $\epsilon \to 0$). Most importantly, Figure 1(b) highlights how even when the annealing process does converge, the progression of the noise is never truly geometric; we prove this formally in Proposition 1.
The proof is in Appendix F. In particular, for $n_\sigma < \infty$, sampling has not fully converged and the remaining noise is carried over to the next iteration of Langevin Sampling. It also follows that for any $s_\theta$ different from the optimal $s^*$, the actual noise at every iteration is expected to be even higher than for the best possible score function $s^*$.
We propose Consistent Annealed Sampling (CAS) as a sampling method that ensures the noise level will follow a prescribed schedule for any sampling step size $\epsilon$ and number of steps $n_\sigma$. Algorithm 2 illustrates the process for a geometric schedule. Note that for a different schedule, $\beta$ will instead depend on the step $t$, as in the general case, $\gamma_t$ is defined as $\sigma_{t+1}/\sigma_t$.
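A possible reading of CAS for a geometric schedule is sketched below. It assumes the update $x \leftarrow x + \alpha_t s(x, \sigma_t) + \beta \sigma_{t+1} z$ with $\alpha_t = \eta \sigma_t^2$, $\eta = \epsilon / \sigma_L^2$ and $\beta = \sqrt{1 - (1-\eta)^2/\gamma^2}$, where $\gamma$ is the common ratio of the schedule. The function name, the choice $\gamma \sigma_L$ for the noise level targeted after the last update, the Dirac toy data, and the numeric values are assumptions of this sketch, not details confirmed by the text.

```python
import numpy as np

def consistent_annealed_sampling(score_fn, x, sigmas, eps, rng, denoise=True):
    """CAS sketch for a geometric schedule `sigmas` (one update per entry)."""
    sigma_L = sigmas[-1]
    gamma = sigmas[1] / sigmas[0]                  # common ratio of the schedule
    eta = eps / sigma_L**2
    beta_sq = 1.0 - (1.0 - eta)**2 / gamma**2
    if beta_sq < 0:
        raise ValueError("step size incompatible with this schedule: beta would be imaginary")
    beta = np.sqrt(beta_sq)
    for sigma in sigmas:
        alpha = eta * sigma**2
        sigma_next = gamma * sigma                 # noise level targeted after this update
        z = rng.standard_normal(x.shape)
        x = x + alpha * score_fn(x, sigma) + beta * sigma_next * z
    if denoise:
        x = x + sigma_L**2 * score_fn(x, sigma_L)  # final expected denoised sample
    return x

# Toy check (assumption): Dirac data at x0, whose exact score is (x0 - x) / sigma^2.
x0 = np.array([1.0, -2.0])
score = lambda x, sigma: (x0 - x) / sigma**2
rng = np.random.default_rng(0)
sigmas = np.geomspace(50.0, 0.01, 232)             # illustrative schedule
x_init = x0 + sigmas[0] * rng.standard_normal((20000, 2))
noisy = consistent_annealed_sampling(score, x_init, sigmas, 5.6e-6, rng, denoise=False)
clean = consistent_annealed_sampling(score, x_init, sigmas, 5.6e-6, rng, denoise=True)
```

In this idealized setting the residual noise in `noisy` has a standard deviation close to $\gamma \sigma_L$, the value the schedule prescribes, and the denoising step recovers `x0`.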
Let us first define $\beta$ to be equal to $\sqrt{1 - (1-\eta)^2/\gamma^2}$ with $\eta = \epsilon/\sigma_L^2$ and $\gamma$ defined as in Eq. 2. At the start of Algorithm 2, assume that $x_0$ is comprised of an image component $\bar{x}$ and a noise component denoted $e_0$, where $\bar{x} \sim p(x)$ and $e_0 \sim \mathcal{N}(0, \sigma_0^2 I)$. We will proceed by induction to show that the noise component at step $t$ will be a Gaussian of variance $\sigma_t^2$ for every $t$.
The first induction step is trivial. Assume the noise component of $x_t$ to be $e_t$, where $e_t \sim \mathcal{N}(0, \sigma_t^2 I)$. Following Algorithm 2, the update step will be:
$$x_{t+1} = x_t + \eta\, \sigma_t^2\, s^*(x_t, \sigma_t) + \beta\, \sigma_{t+1}\, z$$
with $z \sim \mathcal{N}(0, I)$. From Eq. 3, we get
$$x_{t+1} = (1-\eta)\, x_t + \eta\, H^*(x_t, \sigma_t) + \beta\, \sigma_{t+1}\, z$$
Treating the EDS as noise-free, the noise component from $x_t$ and $z$ can then be summed as:
$$e_{t+1} = (1-\eta)\, e_t + \beta\, \sigma_{t+1}\, z$$
Making use of the fact that both sources of noise follow independent normal distributions, the sum will be normally distributed, centered and of variance:
$$(1-\eta)^2 \sigma_t^2 + \beta^2 \sigma_{t+1}^2 = \left[ \frac{(1-\eta)^2}{\gamma^2} + \beta^2 \right] \sigma_{t+1}^2 = \sigma_{t+1}^2$$
By induction, the noise component of the sample $x_t$ will follow a Gaussian distribution of variance $\sigma_t^2$. In the particular case where $\{\sigma_t\}$ corresponds to a geometrically decreasing series, it means that given an optimal score function, the standard deviation of the noise component will follow its prescribed schedule. ∎
Importantly, Proposition 2 holds no matter how many steps $n_\sigma$ we take to decrease the noise geometrically. For ALS, $n_\sigma$ corresponds to the number of steps per level of noise; it plays a rather similar role in CAS: we simply dilate the geometric series of noise levels used during training by a factor of $n_\sigma$. Note that the proposition only holds when the initial sample is a corrupted image ($x_0 = \bar{x} + e_0$). However, by defining $\sigma_1$ as the maximum Euclidean distance between all pairs of training data points (Song and Ermon, 2020), the noise becomes in practice much greater than the true image; sampling with pure noise initialization ($x_0 = e_0$) becomes indistinguishable from sampling with data initialization.
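The difference between the two samplers can also be checked numerically without drawing any samples, by iterating the variance of the noise component. The sketch below assumes an optimal score, an expected denoised sample that carries no noise, ALS noise injection $\sqrt{2\alpha_t} z$ with $\alpha_t = \eta \sigma_t^2$, and CAS noise injection $\beta \sigma_{t+1} z$; the function name and the numeric values are illustrative assumptions.

```python
import numpy as np

def noise_std_trajectory(sigmas, eps, sigma_init, consistent):
    """Std of the noise component after each update (one update per noise level)."""
    eta = eps / sigmas[-1]**2
    gamma = sigmas[1] / sigmas[0]
    beta_sq = 1.0 - (1.0 - eta)**2 / gamma**2
    var, stds = sigma_init**2, []
    for sigma in sigmas:
        if consistent:   # CAS: injected noise beta * sigma_{t+1} * z
            var = (1.0 - eta)**2 * var + beta_sq * (gamma * sigma)**2
        else:            # ALS: injected noise sqrt(2 * alpha_t) * z
            var = (1.0 - eta)**2 * var + 2.0 * eta * sigma**2
        stds.append(np.sqrt(var))
    return np.array(stds)

sigmas = np.geomspace(50.0, 0.01, 232)    # illustrative schedule
eps = 5.6e-6                              # illustrative step size
cas = noise_std_trajectory(sigmas, eps, sigmas[0], consistent=True)
als = noise_std_trajectory(sigmas, eps, sigmas[0], consistent=False)
```

Under these assumptions the CAS trajectory coincides with the shifted schedule $\gamma \sigma_t$, whereas the ALS trajectory stays strictly above the noise level it is currently run at.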
The score network is trained to recover an uncorrupted image from a noisy input by minimizing the $\ell_2$ distance between the two. However, it is well known from the image restoration literature that $\ell_2$ does not correlate well with human perception of image quality (Zhang et al., 2012; Zhao et al., 2016). One way to take advantage of the EDS would be to encourage the score network to produce an EDS that is more realistic from the perspective of a discriminator. Intuitively, this would incentivize the score network to produce more discernible features at inference time.
We propose to do so by training the score network to simultaneously minimize the score-matching loss function and maximize the probability of denoised samples being perceived as real by a discriminator. We use alternating gradient descent to sequentially train a discriminator for a determined number of steps at every score function update.
In our experiments, we selected the Least Squares GAN (LSGAN) (Mao et al., 2017) formulation as it performed best (see Appendix B for details). For an unconditional score network, the objective functions are as follows:
(4) |
(5) |
where $H_\theta(\tilde{x}, \sigma) = \tilde{x} + \sigma^2 s_\theta(\tilde{x}, \sigma)$ is the EDS derived from the score network. Eq. 4 is the objective function of the LSGAN discriminator, while Eq. 5 is the adversarial objective function of the score network derived from Eq. 1 and from the LSGAN objective function.
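To make the structure of the hybrid objective concrete, the sketch below combines a least-squares GAN term on the denoised samples with a denoising score matching term. The target labels (1 for real, 0 for fake), the placement of the weight `lam` on the score matching term, and all function names are assumptions of this example; the exact form used in the experiments is the one given by Eq. 4 and Eq. 5.

```python
import numpy as np

def eds(x_tilde, score, sigma):
    """Expected denoised sample implied by a score estimate: x_tilde + sigma^2 * score."""
    return x_tilde + sigma**2 * score

def lsgan_discriminator_loss(d_real, d_fake, real_label=1.0, fake_label=0.0):
    """Least-squares discriminator loss on real images and on denoised samples."""
    return np.mean((d_real - real_label)**2) + np.mean((d_fake - fake_label)**2)

def hybrid_score_loss(d_fake, dsm_loss_value, lam=1.0, real_label=1.0):
    """Adversarial term on the denoised samples plus a weighted score matching term."""
    return np.mean((d_fake - real_label)**2) + lam * dsm_loss_value

# Toy check: with a perfect score estimate, the EDS recovers the clean input.
x = np.array([[0.5, -0.5]])
sigma = 2.0
x_tilde = x + sigma * np.array([[1.0, 1.0]])
perfect_score = (x - x_tilde) / sigma**2
denoised = eds(x_tilde, perfect_score, sigma)
```

In an alternating scheme, `d_fake` would be the discriminator output on `eds(...)` computed from the current score network, with the discriminator updated for a fixed number of steps between score network updates.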
We note the similarities between these objective functions and those of an LSGAN adversarial autoencoder (Makhzani et al., 2015; Tolstikhin et al., 2017; Tran et al., 2018), with the distinction of using a denoising autoencoder as opposed to a standard autoencoder.
As GANs favor quality over diversity, there is a concern that this hybrid objective function might decrease the diversity of samples produced by the ALS. In Section 6.1, we first study image generation improvements brought by this method and then address the diversity concerns with experiments on the 3-StackedMNIST (Metz et al., 2016) dataset in Section 6.2.
As mentioned in Section 2.3, it has been suggested that one can obtain better samples (closer to the assumed data manifold) by taking the EDS of the last Langevin sample. We provide further evidence of this with synthetic data and images.
It can first be observed that the sampling steps correspond to an interpolation between the previous point and the EDS, followed by the addition of noise.
The proof is in Appendix G. This result is equally true for an unconditional score network, with the distinction that $\eta$ would no longer be independent of $\sigma_i$ but rather linearly proportional to it.
Intuitively, this implies that the sampling steps slowly move the current sample toward a moving target (the EDS). If the sampling behaves appropriately, we expect the final sample $x$ to be very close to the EDS, i.e., $x \approx H(x, \sigma_L)$. However, if the sampling step size is too high or too low, or if the EDS does not stabilize to a fixed point near the end of the sampling, these two quantities may be arbitrarily far from one another. As we will show, the FIDs from Song and Ermon (2020) suffer from such a distance.
The equivalence shown in Proposition 3 suggests instead taking the expected denoised sample at the end of the Langevin sampling as the final sample; this would be equivalent to the update rule at the last step.
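The interpolation reading can be verified in one line. The derivation below assumes a Langevin step of the form $x \leftarrow x + \alpha_i s(x, \sigma_i) + \text{noise}$ with $\alpha_i = \epsilon \sigma_i^2 / \sigma_L^2$, and writes $\eta = \epsilon / \sigma_L^2$ and $H(x, \sigma_i) = x + \sigma_i^2 s(x, \sigma_i)$ for the EDS; these symbols are assumptions of this note wherever the surrounding text does not fix them.

```latex
\begin{aligned}
x + \alpha_i\, s(x, \sigma_i)
  &= x + \eta\, \sigma_i^2 \cdot \frac{H(x, \sigma_i) - x}{\sigma_i^2} \\
  &= (1 - \eta)\, x + \eta\, H(x, \sigma_i).
\end{aligned}
```

For $0 < \eta < 1$, the deterministic part of each step is therefore a convex combination of the current point and its EDS, and the noise term is added afterwards.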
Synthetic 2D examples are shown in Figure 2. It can be observed that taking the EDS brings the samples much closer to the real data manifolds.
We trained a score network on CIFAR-10 (50k 32x32 images) (Krizhevsky et al., 2009) for both ALS and CAS and report the Fréchet Inception Distance (FID) (Heusel et al., 2017) as a function of the sampling step size when denoised and when not denoised in Figure 3.
The first observation to be made is just how critical denoising is to the FID score for ALS, even though its effect cannot be perceived by the human eye. For CAS, we note that the score remains small for a much wider range of sampling step sizes when denoising. Without denoising, the sampling step size must be very carefully tuned to obtain results close to the optimum.
Figure 3 also shows that, with CAS, the FID of the final sample is approximately equal to the FID of the denoised samples for small sampling step sizes. Furthermore, for larger sampling step sizes, we see a smaller gap in FID between denoised and non-denoised samples than with ALS. This suggests that consistent sampling results in the final sample being closer to the assumed data manifold (i.e., $x \approx H(x, \sigma_L)$).
Interestingly, when Song and Ermon (2020) improved their score matching method, they could not explain why the FID of their new model did not improve even though the generated images looked better visually. To resolve that matter, they proposed the use of a new metric (Zhou et al., 2019) that did not have this issue. As shown in Figure 3, denoising resolves this mismatch.
We ran experiments on CIFAR-10 (50k 32x32 images) (Krizhevsky et al., 2009) and LSUN-churches (126k 64x64 images) (Yu et al., 2015) with the score network architecture used by Song and Ermon (2020). We also ran similar experiments with an unconditional version of the network architecture by Ho et al. (2020), given that their approach is similar to Song and Ermon (2019) and they obtain very small FIDs. For the hybrid adversarial score matching approach, we used an unconditional BigGAN discriminator (Brock et al., 2018). We compared three factors in an ablation study: adversarial training, Consistent Annealed Sampling and denoising.
Details on how the experiments were conducted are found in Appendix B. Unsuccessful experiments with large images are also discussed in Appendix C. See also Appendix H for a discussion pertaining to the use of the Inception Score (Heusel et al., 2017), a popular metric for generative models.
Results for CIFAR-10 and LSUN-churches with Song and Ermon (2019) score network architecture are shown in Tables 1 and 2, respectively. Results for CIFAR-10 with Ho et al. (2020) score network architecture are shown in Table 3.
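Table 1: FID (non-denoised / denoised) on CIFAR-10 with the Song and Ermon score network architecture.
Sampling | $n_\sigma$ | Non-adversarial | Adversarial |
---|---|---|---|
non-consistent | 1 | 36.3 / 13.3 | 33.9 / 12.5 |
non-consistent | 5 | 33.7 / 10.9 | 31.5 / 10.2 |
consistent | 1 | 14.7 / 12.3 | 13.4 / 11.6 |
consistent | 5 | 12.7 / 11.2 | 11.4 / 10.2 |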
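Table 2: FID (non-denoised / denoised) on LSUN-churches with the Song and Ermon score network architecture.
Sampling | $n_\sigma$ | Non-adversarial | Adversarial |
---|---|---|---|
non-consistent | 1 | 43.2 / 40.3 | 42.4 / 38.6 |
non-consistent | 5 | 42.0 / 39.2 | 41.6 / 37.9 |
consistent | 1 | 41.5 / 40.7 | 40.2 / 39.1 |
consistent | 5 | 39.5 / 39.1 | 38.2 / 37.6 |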
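Table 3: FID (non-denoised / denoised) on CIFAR-10 with the Ho et al. (2020) score network architecture.
Sampling | $n_\sigma$ | Non-adversarial | Adversarial |
---|---|---|---|
non-consistent | 1 | 25.29 / 7.47 | 26.75 / 7.98 |
non-consistent | 5 | 20.02 / 5.63 | 20.98 / 6.02 |
consistent | 1 | 7.78 / 7.11 | 8.08 / 7.49 |
consistent | 5 | 6.19 / 6.11 | 6.43 / 6.40 |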
We always observe an improvement in FID from denoising and by increasing $n_\sigma$ from 1 to 5. We observe an improvement from using the adversarial approach with Song and Ermon (2019) network architecture, but not with the Ho et al. (2020) network architecture. We hypothesize that this is a limitation of the architecture of the discriminator since, as far as we know, no variant of BigGAN achieves an FID smaller than 6. Nevertheless, it remains advantageous for simpler architectures, as shown in Tables 1 and 2. We observe that consistent sampling outperforms non-consistent sampling on the CIFAR-10 task at $n_\sigma = 1$, the quickest way to sample.
We calculated the FID of the non-consistent non-adversarial denoised models from 50k samples in order to compare our method with the recent work from Ho et al. (2020). We obtained a score of 3.65 on the CIFAR-10 task when sharing their architecture; a score similar to their reported 3.17. Although not explicitly mentioned in their paper, Ho et al. (2020) denoised their final sample. This suggests that taking the EDS and using an architecture akin to theirs were the two main reasons for outperforming Song and Ermon (2020). Of note, our method only trains the score network for 300k iterations, while Ho et al. (2020) trained their networks for more than 1 million iterations to achieve similar results.
To assess the diversity of generated samples, we evaluate our models on the 3-Stacked MNIST generation task (Metz et al., 2016) (128k images of 28x28), consisting of numbers from the MNIST dataset (LeCun et al., 1998) superimposed on 3 different channels. We trained non-adversarial and adversarial score networks in the same way as the other models. The results are shown in Table 4.
We see that each of the 1000 modes is covered, though the KL divergence is still inferior to PACGAN (Lin et al., 2018), meaning that the mode proportions are not perfectly uniform. Nevertheless, these results confirm a full mode coverage on a task where most GANs fail and, most importantly, that using a hybrid objective does not hurt the diversity of the generated samples.
Table 4: Mode coverage and KL divergence on 3-Stacked MNIST. We generated 26k samples and evaluated the mode coverage and KL divergence based on the predicted modes from a pre-trained MNIST classifier.
Model | Modes (Max 1000) | KL |
---|---|---|
DCGAN (Radford et al., 2015a) | 99.0 | 3.40 |
ALI (Dumoulin et al., 2016) | 16.0 | 5.40 |
Unrolled GAN (Metz et al., 2016) | 48.7 | 4.32 |
VEEGAN (Srivastava et al., 2017) | 150.0 | 2.95 |
PacDCGAN2 (Lin et al., 2017b) | 1000.0 | 0.06 |
WGAN-GP (Kumar et al., 2019; Gulrajani et al., 2017) | 959.0 | 0.73 |
PresGAN (Dieng et al., 2019) | 999.6 | 0.115 |
MEG (Kumar et al., 2019) | 1000.0 | 0.03 |
Non-adversarial DSM (ours) | 1000.0 | 1.36 |
Adversarial DSM (ours) | 1000.0 | 1.36 |
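The two quantities reported for this task can be computed from classifier predictions along the following lines. This is a minimal sketch: it assumes each generated image has already been mapped to three predicted digits (one per channel), encodes a mode as the three-digit number they form, and measures the KL divergence from the empirical mode distribution to the uniform distribution over 1000 modes. The function names and the direction of the KL divergence are assumptions of this example.

```python
import numpy as np

def digits_to_mode_id(digits):
    """Map per-channel digit predictions of shape (N, 3) to mode ids in [0, 999]."""
    return digits @ np.array([100, 10, 1])

def mode_coverage_and_kl(mode_ids, n_modes=1000):
    """Number of distinct modes hit and KL(empirical || uniform) over n_modes."""
    counts = np.bincount(mode_ids, minlength=n_modes)
    p = counts / counts.sum()
    covered = int(np.count_nonzero(counts))
    nz = p > 0                                   # 0 * log(0) is treated as 0
    kl = float(np.sum(p[nz] * np.log(p[nz] * n_modes)))
    return covered, kl

# Perfectly balanced predictions: 26 samples for each of the 1000 modes.
uniform_ids = np.arange(1000).repeat(26)
covered, kl = mode_coverage_and_kl(uniform_ids)
```

A sampler that covers all modes but with uneven proportions yields `covered == 1000` together with a strictly positive KL value, which is the situation described for the score matching models above.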
We proposed Consistent Annealed Sampling as an alternative to Annealed Langevin Sampling, which ensures the expected geometric progression of the noise and brings the final samples closer to the data manifold. We showed how to extract the expected denoised sample and how to use it to bring the final Langevin samples even closer to the data manifold. We proposed a hybrid approach between GAN and score matching. With experiments on synthetic and standard image datasets, we showed that these approaches generally improved the quality/diversity of the generated samples.
We found equal diversity (coverage of all 1000 modes) for the adversarial and non-adversarial variants on the difficult StackedMNIST problem. Since we also observed better performance (from lower FIDs) in our other adversarial models trained on images, we conclude that making score matching adversarial increases the quality of the samples without decreasing diversity.
These findings imply that score matching performs better than most GANs and on par with state-of-the-art GANs. Furthermore, our results suggest that hybrid methods, combining multiple generative techniques together, are a very promising direction to pursue.
As future work, these models should be scaled to larger batch sizes on high-resolution images, since GANs have been shown to produce outstanding high-resolution images at very large batch sizes (2048 or more). Contrastive learning (Hadsell et al., 2006; Chen et al., 2020) is a promising technique that has been shown to improve the quality of images in GANs (Zhao et al., 2020); this technique could be used with adversarial score matching to capture more realistic features with the score network.
We would like to thank Yang Song, Linda Petrini, Florian Bordes, Vikram Voleti, Amartya Mitra, Isabela Albuquerque, and Reyhane Askari for their useful discussions and feedback. We thank Yang Song for suggesting to use the diffusion network architecture. Alexia would like to thank her wife Emy Gervais for her support.
We would also like to thank Compute Canada and Calcul Québec for the GPUs which were used in this work. This work was also partially supported by the NSERC ES-D grant (ESD3-546493-2020), the FRQNT new researcher program (2019-NC-257943), the NSERC Discovery grant (RGPIN-2019-06512), a startup grant by IVADO, a grant by Microsoft Research and a Canada CIFAR AI chair.
Arjovsky et al. (2017). Wasserstein generative adversarial networks. International Conference on Machine Learning, pp. 214–223.
Gidel et al. (2019). Negative momentum for improved game dynamics. The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1802–1811.
Hadsell et al. (2006). Dimensionality reduction by learning an invariant mapping. 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), Vol. 2, pp. 1735–1742.
Jolicoeur-Martineau and Mitliagkas (2019). Connections between support vector machines, Wasserstein distance and gradient-penalty GANs. arXiv preprint arXiv:1910.06922.
Kumar et al. (2019). Maximum entropy generators for energy-based models. arXiv preprint arXiv:1901.08508.
Yu et al. (2015). LSUN: construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365.
Zhao et al. (2016). Loss functions for image restoration with neural networks. IEEE Transactions on Computational Imaging 3 (1), pp. 47–57.
Unfortunately, these improvements in image generation come at a very high computational cost, meaning that the ability to generate high-resolution images is constrained by the availability of large computing resources (TPUs or clusters of 8+ GPUs). This is mainly due to the architectures used in this paper, while adding a discriminator further adds to the training computational load.
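Table 5: Sampling step size $\epsilon$ used for each setting.
Network architecture | Dataset | Consistent | $n_\sigma$ | $\epsilon$ |
---|---|---|---|---|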
Song and Ermon [2020] | CIFAR-10 | No | 1 | 1.8e-5 |
Song and Ermon [2020] | CIFAR-10 | No | 5 | 3.6e-6 |
Song and Ermon [2020] | CIFAR-10 | Yes | 1 | 5.6e-6 |
Song and Ermon [2020] | CIFAR-10 | Yes | 5 | 1.1e-6 |
Song and Ermon [2020] | LSUN-Churches | No | 1 | 4.85e-6 |
Song and Ermon [2020] | LSUN-Churches | No | 5 | 9.7e-7 |
Song and Ermon [2020] | LSUN-Churches | Yes | 1 | 2.8e-6 |
Song and Ermon [2020] | LSUN-Churches | Yes | 5 | 4.5e-7 |
Ho et al. [2020] | CIFAR-10 | No | 1 | 1.6e-5 |
Ho et al. [2020] | CIFAR-10 | No | 5 | 4.0e-6 |
Ho et al. [2020] | CIFAR-10 | Yes | 1 | 5.45e-6 |
Ho et al. [2020] | CIFAR-10 | Yes | 5 | 1.05e-6 |
Song and Ermon [2020] | 3-StackedMNIST | Yes | 1 | 5.0e-6 |
Following the recommendations from Song and Ermon [2020], we chose and on CIFAR-10, and , on LSUN-Churches, both with . We used a batch size of 128 in all models. We first coarsely swept the training checkpoint (saved every 2.5k iterations) and the Exponential Moving Average (EMA) coefficient from {.999, .9999}, and then swept over the sampling step size with a precision of approximately two significant digits. The values reported in Table 1 correspond to the sampling step size that minimized the denoised FID for every $n_\sigma$ (see Table 5). We used the same sampling step sizes for adversarial and non-adversarial models. Empirically, the optimal sampling step size is found for a certain $n_\sigma$ and is extrapolated to other precision levels by solving for the consistent algorithm. In the non-consistent algorithm, we found the best sampling step size at $n_\sigma = 1$ and divided it by 5 to obtain a starting point to find the optimal value at $n_\sigma = 5$. The best EMA values were found to be .9999 on CIFAR-10 and .999 on LSUN-churches. The number of score network training iterations was 300k on CIFAR-10 and 200k on LSUN-Churches.
Of note, Song and Ermon [2020] used non-denoised non-consistent sampling with and for CIFAR-10 and LSUN-Churches respectively. However, they did not use the same learning rates (we tuned ours more precisely) and they used bigger images for LSUN-Churches.
Regarding the adversarial approach, we swept for the GAN loss function, the number of discriminator steps per score network steps (), the Adam optimizer [Kingma and Ba, 2014] parameters, and the hyperparameter (see Eq. 5) based on quick experiments on CIFAR-10. LSGAN [Mao et al., 2017] yielded the best FID scores among all other loss functions considered, namely the original GAN loss function [Goodfellow et al., 2014], HingeGAN [Lim and Ye, 2017], as well as their relativistic counterparts [Jolicoeur-Martineau, 2018, 2019]. Note that the saturating variant (see Goodfellow et al. [2014]) on LSGAN worked as well as its non-saturating version; we did not use it for simplicity. Following the trend towards zero or negative momentum [Gidel et al., 2019], we used the Adam optimizer with hyperparameters for the discriminator and for the score network. These values were found by sweeping over and . We found the simple setting to perform comparatively better than more complex weighting schemes. We used on the experiments with Song and Ermon [2020] network architecture and on the experiments with Ho et al. [2020] network architecture.
The 3-Stacked MNIST experiment was conducted with an arbitrary EMA of .999. The sampling step size was broadly swept upon. Following [Song and Ermon, 2020] hyperparameter recommendations, we obtained , .
For the synthetic 2d experiments, we used Langevin sampling with , , and .
Due to limited computing resources (4 V100 GPUs), the training of models on FFHQ (70k images) in 256x256 [Karras et al., 2019], with the same setting as previously done by Song and Ermon [2020], was impossible. Using a reduced model yielded very poor results. The adversarial version performed worse than the other; we suspect this was the case due to the mini-batch of size 32, our computational limit, while the BigGAN architecture normally assumes very large batch sizes of 2048 when working with images of that size (256x256 or higher).
making use of the fact that , and that
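Just as the explicit optimal score function is obtained in Appendix D for the conditional case, a similar result can be obtained for the unconditional case. Recall the loss from Eq. 1, in which we substitute the unconditional parametrization $s_\theta(\tilde{x}, \sigma) = s_\theta(\tilde{x})/\sigma$:
$$\mathcal{L}[s_\theta] = \frac{1}{2}\, \mathbb{E}_{p(\tilde{x}, x, \sigma)} \left[ \left\| \sigma\, s_\theta(\tilde{x}, \sigma) + \frac{\tilde{x} - x}{\sigma} \right\|_2^2 \right] \tag{6}$$
$$\phantom{\mathcal{L}[s_\theta]} = \frac{1}{2}\, \mathbb{E}_{p(\tilde{x}, x, \sigma)} \left[ \left\| s_\theta(\tilde{x}) + \frac{\tilde{x} - x}{\sigma} \right\|_2^2 \right] \tag{7}$$
where $p(\sigma)$ is chosen to be a discrete uniform distribution over a specific set of values. We use this expectation formulation over $\sigma$ to obtain a more general result; the choice of $p(\sigma)$ is not important for this derivation.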
Solving with calculus of variations, we get:
Specifically, we have that the critical point is achieved when
Let us now consider the scenario where the data distribution is simply a Dirac in : . In that case, the EDS is trivial: . This gives an interesting insight into the unconditional score network. Letting , we get:
This should be compared to the true value of the conditional network . We see in particular that for far from its mean, the approximation will get inaccurate:
For large noise values, the unconditional score network will overestimate the true score function, leading to a larger effective step size during sampling.
For small noise values, the unconditional score network will underestimate the true score function, leading to samples not diffusing as much as they should.
Assume at the start of an iteration of Langevin Sampling that the point is comprised of an image component and of a noise component denoted for . We assume from the proposition statement that the Langevin Sampling is performed at the level of variance , meaning the update rule is as follows:
for . From Eq. 3, we get: | ||||
The noise component of and can then be summed as: | ||||
Making use of the fact that both sources of noise follow independent normal distributions, the sum will be normally distributed, centered and of variance , with | ||||
Applying the same steps allows us to examine the variance after multiple Langevin Sampling iterations | ||||
From there, the two following statements can be obtained: | ||||
From these observations, we understand that is monotonically decreasing (under conditions generally respected in practice) but converges to a point superior to after an infinite number of Langevin Sampling steps. We then conclude that for all , the variance of the noise component in the sample will always exceed . We also note that this will be true across the full sampling, at every step .
In the particular case where corresponds to a geometrically decreasing series, it means that even given an optimal score function, the standard deviation of the noise component cannot follow its prescribed schedule.
∎
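The argument above can be summarized by a scalar recursion. The sketch below is not a transcription of the original equations; it assumes an optimal score, an expected denoised sample that carries no noise, the step-size convention $\alpha_i = \eta \sigma_i^2$ with $\eta = \epsilon/\sigma_L^2$, and injected noise $\sqrt{2\alpha_i}\, z$. The symbol $v_n$, denoting the variance of the noise component after $n$ Langevin steps at level $\sigma_i$, is introduced here for illustration.

```latex
\begin{aligned}
v_{n+1} &= (1-\eta)^2\, v_n + 2\eta\, \sigma_i^2, \\
v_n &= (1-\eta)^{2n}\, v_0 + 2\eta\, \sigma_i^2\, \frac{1 - (1-\eta)^{2n}}{1 - (1-\eta)^2}, \\
\lim_{n \to \infty} v_n &= \frac{2\eta\, \sigma_i^2}{2\eta - \eta^2} = \frac{2\, \sigma_i^2}{2 - \eta} \;>\; \sigma_i^2
\qquad (0 < \eta < 2).
\end{aligned}
```

If $v_0$ starts above this limit, $v_n$ decreases monotonically toward it without crossing it, so under these assumptions the noise variance stays strictly above $\sigma_i^2$ for any finite number of steps.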
While the FID is improved by applying the EDS to image samples, the Inception Score (IS) is not. Convolutional neural networks suffer from texture bias [Geirhos et al., 2018]. Since the IS is built upon convolution layers, this flaw is also strongly present in the metric. Designed to answer the question of how easy it is to recover the class of an image, it tends to be biased towards within-class texture similarity [Barratt and Sharma, 2018].
Since we denoise the final image, we are evaluating the expected lower level of details across all classes. Therefore, the denoiser will confound the textures used by the IS to distinguish between classes, invariably worsening the score. Since the FID has already been shown to be more consistent with the level of noise than the IS [Heusel et al., 2017], and since ALS methods are particularly prone to inject class-specific imperceptible noise, we would recommend against using the IS to compare within and between score matching models.