Autoregressive (AR) models are a class of likelihood models that attempt to model the data distribution by estimating the data density. They approach the maximum likelihood objective by factorizing the density over the dimensions of the data, and this decomposition helps them achieve negative log-likelihood (NLL) scores superior to other methods such as VAEs Kingma and Welling (2013) or flow models Dinh et al. (2016); Bach and Tewfik (2007).
PixelCNN Courville et al. (2016) was the first to introduce a convolutional AR architecture, achieving an NLL of 3.00 bits/dim on CIFAR-10. Over the past few years, a flurry of further modifications Oord et al. (2016); Parmar et al. (2018); Salimans et al. (2017); Gray et al. (Technical report); Chen et al. (2017) has pushed the score down to 2.85 bits/dim, the best reported NLL on CIFAR-10 to date. However, despite these advancements, uses of AR models outside of compression have not been well explored. Samples from these models are considerably worse than those of state-of-the-art GANs Brock et al. (2018), and the models do not provide a compact latent space representation, an important component for downstream tasks.
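The bits/dim figures quoted above are a rescaling of the NLL. As a minimal sketch, assuming the NLL is measured in nats and summed over all dimensions of one sample (the function names are illustrative and not taken from any PixelCNN codebase), the autoregressive factorization and the unit conversion can be written as:

```python
import math

def ar_nll_nats(conditional_probs):
    """NLL of one sample under an autoregressive factorization.

    conditional_probs[i] is assumed to hold p(x_i | x_<i), so the joint
    probability is their product and the NLL is the sum of -log terms.
    """
    return -sum(math.log(p) for p in conditional_probs)

def bits_per_dim(nll_nats, num_dims):
    """Convert a per-sample NLL in nats to bits per dimension."""
    return nll_nats / (num_dims * math.log(2.0))

# Toy sample with four dimensions, each predicted with probability 1/8,
# which costs about 3 bits per dimension.
nll = ar_nll_nats([0.125] * 4)
print(bits_per_dim(nll, 4))
```

For a CIFAR-10 image, `num_dims` would be 32 × 32 × 3 = 3072.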
Prior work Theis et al. (2015) has shown that AR metrics such as log-likelihood and Parzen window estimates are poor indicators of the AR model’s performance on specific tasks. We investigate this empirically in two scenarios: using the log-density as a learning signal for image translation, and for outlier detection. Our results show that density estimates neither correlate with perceptual quality, nor are useful for downstream tasks.
2 Log-Likelihood Estimates as Learning Signal
We first explore using NLL scores for image-to-image translation. CycleGAN Zhu et al. (2017), a popular unpaired image translation method, learns mappings between two image domains using GANs Goodfellow et al. (2014). However, GANs are known to be unstable during training due to their adversarial framework, and they also lack an evaluation metric for the perceptual quality of generated images, so there is no way to evaluate the learned mapping other than visually inspecting the samples.
Replacing the CycleGAN discriminator with NLL estimates from an AR model seems like a natural solution to these problems. Powerful AR models provide log-likelihood estimates of our samples throughout training that we can compare across methods, and optimization could be easier since there is no longer an adversarial minimax problem, which could help learn more general cross-domain mappings.
2.1 ARCycle Formulation
For the two mapping functions between the domains, we borrow the same cycle consistency loss from Zhu et al. (2017).
Instead of an adversarial loss, we use a generative loss with the negative log-likelihood of the generated image under our autoregressive model. This autoregressive model is trained purely on real images from its domain. For a mapping function into a target domain and the density model trained on that domain, the objective is the expected negative log-likelihood of the translated images under that density model.
Thus, the overall ARCycle objective that we minimize is a weighted sum of the cycle consistency loss and the generative loss.
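For reference, the three objectives described in this subsection can be written out as follows. This is a reconstruction in the notation of Zhu et al. (2017), not necessarily the symbols used by the authors: $G: X \to Y$ and $F: Y \to X$ denote the two mappings, $p_X$ and $p_Y$ the autoregressive density models trained on each domain, and $\lambda$ a weighting coefficient. Placing $\lambda$ on the generative term, and including a generative term for both directions, are assumptions.

```latex
\begin{align}
\mathcal{L}_{\text{cyc}}(G, F) &= \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\lVert F(G(x)) - x \rVert_1\big]
  + \mathbb{E}_{y \sim p_{\text{data}}(y)}\big[\lVert G(F(y)) - y \rVert_1\big] \\
\mathcal{L}_{\text{AR}}(G, p_Y) &= \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[-\log p_Y(G(x))\big] \\
\mathcal{L}(G, F) &= \mathcal{L}_{\text{cyc}}(G, F)
  + \lambda \big(\mathcal{L}_{\text{AR}}(G, p_Y) + \mathcal{L}_{\text{AR}}(F, p_X)\big)
\end{align}
```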
2.2 Experiments and Discussion
We trained ARCycle with the loss in Equation 3, as well as several ablations. We used a pre-trained PixelCNN++ to compute the generative loss and set the loss weight so that the generative term is on the same order as the reconstruction loss. After training to convergence, we observed the results in Figure 1.
The reconstructions in (a) are perfect, but the translated images in the MNIST domain have collapsed to a degenerate solution. The network encodes enough information into the translation to perfectly reconstruct the original image, but it hits a local optimum for the generative loss and does not learn to produce realistic MNIST digits. The negative log-probability of the translations decreases steadily over training, yet remains far from the bits/dim that PixelCNN++ achieves on the MNIST test set. Even if we remove competing losses and only optimize the generative loss as in (b), or blur the transformed images to remove high-frequency patterns as in (d), our mappings still fail to produce realistic MNIST digits. Most interestingly, even if we initialize training with both mappings pretrained using CycleGAN, the ARCycle training procedure manages to corrupt the mappings by producing faint lines in the background.
To analyze how the network learns to arrive at the degenerate solution, we plot reconstructions over iterations in Figure 2. Here, we see an interesting phenomenon: with more and more updates, the translated images look like a set of lines, while at the same time the reconstructions get more accurate. A hypothesis is that the network is learning to embed information in high-frequency signals that aren’t apparent to the human eye, a phenomenon also mentioned in Zhu et al. (2017).
The generative loss proved difficult to optimize in a variety of settings, and these results suggest that using AR density estimates as a learning signal for optimization may be flawed, which we investigate in the next section.
3 Optimization With AR Models
In this section, we investigate the optimization process and explain why an AR log-likelihood loss can be difficult to optimize in general.
3.1 Directly Maximizing Image Log-Likelihood
One key observation is that the gradient of the ARCycle loss with respect to our generator’s parameters depends entirely, by the chain rule, on the gradient of the log-likelihood with respect to the generated image. If that pixel-level gradient vanishes, gradient descent cannot help our generator produce better images. Thus, we want to see if it is possible to directly optimize the image pixels themselves using gradient descent on the negative log-likelihood output of PixelCNN++.
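The dependence can be made explicit with the chain rule. In the sketch below the symbols are assumed for illustration: $G_\theta$ is the generator with parameters $\theta$, $\hat{y} = G_\theta(x)$ is the generated image, and $p$ is the fixed AR density.

```latex
\nabla_\theta \big[-\log p(G_\theta(x))\big]
  = \left(\frac{\partial G_\theta(x)}{\partial \theta}\right)^{\!\top}
    \nabla_{\hat{y}} \big[-\log p(\hat{y})\big] \Big|_{\hat{y} = G_\theta(x)}
```

Whatever the Jacobian of the generator is, the product is zero whenever the pixel-level factor $\nabla_{\hat{y}}[-\log p(\hat{y})]$ is zero.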
Figure 3 shows the result of applying gradient descent on 3 MNIST digits from the test set, 3 images of random noise, and black, gray, and white images. Every image begins to accumulate noise, even the true MNIST digits, which should have already been in a local minimum. Both the noisy and gray images form lines, which indicates that PixelCNN++ gradients lead us to images that locally resemble digits, rather than images that gradually form digits as training progresses. Not only does the model encourage local texture over global structure, but the optimization problem seems ill-conditioned. The true MNIST digits start with 0.6 bits/dim, but jump up to 7 bits/dim after a few gradient steps. Since the digits remain visually identical, log-likelihood has no bearing on the quality of a sample. Overall, minimizing log-likelihood is difficult to do and is not guaranteed to produce good results.
3.2 Harder Optimization Problem
Training PixelCNN++ on CIFAR-10 or ImageNet achieves good performance for a wide range of hyperparameter choices. Why, then, is it so hard to maximize the log-likelihood of samples under a trained PixelCNN++ model?
The PixelCNN++ objective is to maximize the expected log-likelihood of the training data with respect to the model parameters. As long as the model density is initialized with support almost everywhere and the parameters are closely tied to the density estimate, the gradient of this objective is nonzero and should yield improvements to the AR model’s ability to model the distribution.
In contrast, when we directly optimize our samples, the objective is to maximize the log-likelihood with respect to the samples themselves. Instead of increasing the probability of data points by adapting our model, we have to find samples that have high likelihood under a fixed AR model. In certain scenarios, this proves to be an impossible task. Assuming that the data lies on a lower-dimensional manifold within the pixel space, a powerful enough AR model will put all of its probability mass on the manifold, leaving none elsewhere. Thus, as the model has probability zero almost everywhere, the gradients are also zero almost everywhere, so gradient descent cannot improve the log-likelihood of points.
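The contrast between the two problems can be written side by side. The notation is assumed for illustration: $p_\theta$ is the AR model with parameters $\theta$, and $p_{\text{data}}$ is the data distribution.

```latex
\begin{align}
\text{training the AR model:} \quad & \max_{\theta}\; \mathbb{E}_{x \sim p_{\text{data}}}\big[\log p_\theta(x)\big] \\
\text{optimizing samples:} \quad & \max_{x}\; \log p_\theta(x) \quad \text{with } \theta \text{ fixed}
\end{align}
```

In the first problem the gradient is taken with respect to $\theta$ at points where data exists; in the second it is taken with respect to $x$, which may start anywhere in pixel space.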
We visualize this with a 2-dimensional toy problem in Figure 4. The true data distribution, shown in (a), always fixes one coordinate to a single value, while the other coordinate varies. We use a MADE AR model Germain et al. (2015) to learn a discretized Gaussian log-likelihood, an analogue of the discretized mixture of logistics used by PixelCNN++. As seen in (b), the trained model is able to fit the true distribution fairly closely. Internally, the Gaussian distribution along the fixed coordinate becomes narrower, as the model tries to place as much mass as possible in the bin containing the fixed value. This leads to the gradient heatmap in (c), which is taken with respect to the two input coordinates. The gradient is nonzero in only 2 thin strips, which will converge to two infinitely thin lines as the AR model fits the true distribution better and better. Optimizing a set of samples on this landscape would be impossible, as only the points on those thin, bright gradient strips would have a nonzero gradient. Thus, the ability of powerful autoregressive models to represent any distribution hurts downstream optimization by creating vanishing gradients almost everywhere.
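The effect can be reproduced in one dimension. The following is a minimal sketch, not the authors' MADE experiment: it assumes a discretized Gaussian whose bin probability is clamped at a small floor before the log is taken (a hypothetical numerical safeguard; the bin width, floor, and function names are all illustrative). As the Gaussian narrows, the bin mass underflows away from the mode, the clamp takes over, and the input gradient becomes exactly zero.

```python
import math

def gaussian_cdf(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def discretized_log_prob(x, mu, sigma, half_bin=0.05, floor=1e-12):
    """Log-probability of the bin of width 2*half_bin centred on x.

    The floor is an assumed safeguard against log(0).
    """
    upper = gaussian_cdf((x + half_bin - mu) / sigma)
    lower = gaussian_cdf((x - half_bin - mu) / sigma)
    return math.log(max(upper - lower, floor))

def input_gradient(x, mu, sigma, eps=1e-4):
    """Central finite-difference gradient of the log-probability w.r.t. x."""
    hi = discretized_log_prob(x + eps, mu, sigma)
    lo = discretized_log_prob(x - eps, mu, sigma)
    return (hi - lo) / (2.0 * eps)

# Compare a broad model (sigma=1.0) with a sharply fitted one (sigma=0.01).
for sigma in (1.0, 0.01):
    near = input_gradient(0.06, 0.0, sigma)
    far = input_gradient(1.0, 0.0, sigma)
    print(f"sigma={sigma}: near mode {near:.3f}, far from mode {far:.3f}")
```

With the broad model the gradient far from the mode still points back toward it, while with the narrow model only points in a thin band around the mode receive any signal.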
4 Correlation between Log-Likelihood and Perceptual Features
4.1 Do Realistic Images Have High Log-Likelihood?
Do images that look like the training dataset have high log probability under an AR model for that dataset? We propose the following experiment: Train a WGAN-GP Gulrajani et al. (2017) model to produce highly realistic CIFAR-10 images, compute the bits/dim of those samples under a PixelCNN++, and compare that quantity to the bits/dim of the CIFAR-10 test set under the same PixelCNN++.
Our WGAN-GP samples, which are visually indistinguishable from real images, obtained 6.52 bits/dim, significantly higher than the CIFAR-10 test set, which had 2.92 bits/dim. This suggests that perceptual similarity may not be correlated well with the AR log probability. To further investigate, we compare the AR log probability to the inception score, a metric purported to correlate well with human perception Salimans et al. (2016). Figure 5 shows that the log probability remains relatively constant, even though the inception score and sample perceptual quality progressively increase. This further implies that log probability is not a good metric to evaluate whether a given image is similar to a certain distribution of images.
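To get a sense of the size of the gap between the WGAN-GP samples and the test set, recall that a CIFAR-10 image has 32 × 32 × 3 = 3072 dimensions. A back-of-the-envelope calculation using the two averages reported above gives the difference in total code length per image:

```latex
(6.52 - 2.92)\ \tfrac{\text{bits}}{\text{dim}} \times 3072\ \text{dims} = 3.60 \times 3072 = 11059.2\ \text{bits}
```

On average, then, the model assigns a WGAN-GP sample a probability that is smaller than that of a test image by a factor of roughly $2^{11059}$.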
4.2 Do Images with High Log-Likelihood Look Realistic?
We seek to better understand whether images that have high log probability under an AR model trained on a dataset look like images from that dataset. To test this, we evaluated how well AR models fare against other methods of outlier detection for high-dimensional data, where simple strategies such as z-scores are not particularly useful. All the methods we test assume knowledge of only the CIFAR-10 training set and attempt to classify new images as either CIFAR-10 or outliers.
|WGAN-GP CIFAR-10 Samples||0%||44.9%||0%||69.7%|
We tried three strategies for outlier detection with AR models. We computed the mean and standard deviation of the bits/dim over the CIFAR-10 training set using the PixelCNN++’s NLL. We then constructed 3 intervals around that mean: a 2 Standard Deviation Interval (AR-2SD), a 1 Standard Deviation Interval (AR-1SD), and a One-Sided Interval (AR-One-sided). We classify an image as an outlier if its NLL in bits/dim lies outside the interval.
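The interval rules can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the bounds of the one-sided interval are not recoverable from the text, so the sketch assumes it rejects only unusually high NLL, and the training values are made up.

```python
import statistics

def fit_intervals(train_bpd):
    """Build the three acceptance intervals from training-set bits/dim values."""
    mu = statistics.mean(train_bpd)
    sd = statistics.pstdev(train_bpd)
    return {
        "AR-2SD": (mu - 2 * sd, mu + 2 * sd),
        "AR-1SD": (mu - sd, mu + sd),
        # Assumption: the one-sided rule has no lower bound.
        "AR-One-sided": (float("-inf"), mu + sd),
    }

def is_outlier(bpd, interval):
    """An image is an outlier if its bits/dim falls outside the interval."""
    low, high = interval
    return not (low <= bpd <= high)

train_bpd = [2.8, 2.9, 3.0, 3.1, 3.2]  # made-up values for illustration
intervals = fit_intervals(train_bpd)
for name, interval in intervals.items():
    flags = [is_outlier(v, interval) for v in (2.0, 3.05, 4.0)]
    print(name, flags)
```

Under this assumed one-sided rule, an image with unusually low bits/dim is always accepted, whereas the two-sided rules reject it.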
We compare against Class-Conditional Gaussians (CCG), a method for high-dimensional outlier detection using deep classifiers Lee et al. (2018). CCG works as follows: train a classifier on the dataset (Inception-v1 Szegedy et al. (2015) on CIFAR-10), strip off the output layer, and fit class-conditional Gaussians to the feature vectors obtained by evaluating the classifier on the dataset. For a new test image, compute its feature vector and evaluate the feature vector’s probability under each class-conditional Gaussian; if the probability is less than some threshold, classify the image as an outlier.
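The CCG procedure can be sketched in a few lines. This is a simplified illustration, not the method of Lee et al. (2018) as published: it assumes diagonal covariances, uses the best-matching class to decide, and runs on made-up 2-D vectors standing in for the classifier's feature vectors. All names and the threshold value are hypothetical.

```python
import math
from collections import defaultdict

def fit_class_gaussians(features, labels, min_var=1e-6):
    """Fit one diagonal Gaussian per class to the feature vectors."""
    by_class = defaultdict(list)
    for vec, label in zip(features, labels):
        by_class[label].append(vec)
    params = {}
    for label, vecs in by_class.items():
        dims = range(len(vecs[0]))
        mean = [sum(v[d] for v in vecs) / len(vecs) for d in dims]
        var = [
            max(sum((v[d] - mean[d]) ** 2 for v in vecs) / len(vecs), min_var)
            for d in dims
        ]
        params[label] = (mean, var)
    return params

def log_density(vec, mean, var):
    """Log-density of vec under a diagonal Gaussian."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(vec, mean, var)
    )

def is_outlier(vec, params, threshold):
    """Outlier if even the best-matching class assigns low log-density."""
    best = max(log_density(vec, mean, var) for mean, var in params.values())
    return best < threshold

# Made-up features: two well-separated classes.
features = [[0.0, 0.1], [0.1, -0.1], [-0.1, 0.0],
            [5.0, 5.1], [5.1, 4.9], [4.9, 5.0]]
labels = [0, 0, 0, 1, 1, 1]
params = fit_class_gaussians(features, labels)
print(is_outlier([0.05, 0.0], params, threshold=-10.0))  # close to class 0
print(is_outlier([2.5, 2.5], params, threshold=-10.0))   # between the classes
```

In practice the threshold would be chosen on held-out data, and full or tied covariance matrices could replace the diagonal ones assumed here.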
We test the effectiveness of the outlier detection methods on six different datasets: the CIFAR-10 test set, samples from a WGAN-GP trained on CIFAR-10, the SVHN test set, random noise images, completely black images, and completely white images. Intuitively, we would expect a good outlier detector to classify CIFAR-10 test set images correctly as CIFAR, WGAN-GP samples as CIFAR most of the time, and SVHN, noise, all-black, and all-white images as outliers.
Table 1 shows that CCG largely does what a good outlier detector should do, but the AR model does not. In fact, the AR model classifies SVHN images as CIFAR just as often as it does actual CIFAR-10 images. As found earlier by Nalisnick et al. (2018), PixelCNN++ actually assigns higher log probability to images from SVHN than to actual CIFAR-10 images: SVHN achieves 2.1 bits/dim, while CIFAR-10 only gets 2.92 bits/dim. Additionally, the model assigns high log probability to all-black and all-white images. These failures on outlier detection indicate that log probability under an AR model is not usefully correlated with our notion of perceptual quality.
5 Conclusions and Future Work
We investigate the usefulness of the density estimates of autoregressive models. We apply them to two tasks, as a learning signal for unsupervised image translation with ARCycle and for outlier detection, and find that the density estimates are not informative in either setting. We also analyze the underlying optimization issues, finding that optimizing using AR models leads to degenerate solutions due to vanishing gradients. Their log-likelihood estimates also do not correlate with perceptual quality, which explains their poor performance at outlier detection. Given that these models achieve superior log-likelihood, our findings call into question the overall utility of likelihood-based learning. In this work, we examined PixelCNN++ in detail; one interesting avenue of future work is investigating whether our results hold for other autoregressive models as well.
References
- Glow: Generative Flow with Invertible 1x1 Convolutions. In NIPS, Vol. 86, pp. 301–3.
- Large Scale GAN Training for High Fidelity Natural Image Synthesis.
- PixelSNAIL: An Improved Autoregressive Generative Model. Proceedings of Machine Learning Research.
- Proceedings of The 33rd International Conference on Machine Learning 48, pp. 1747–1756.
- Density estimation using Real NVP.
- MADE: Masked Autoencoder for Distribution Estimation.
- Generative Adversarial Networks.
- GPU Kernels for Block-Sparse Weights. Technical report.
- Improved Training of Wasserstein GANs.
- Image-to-Image Translation with Conditional Adversarial Networks.
- Auto-Encoding Variational Bayes.
- A Simple Unified Framework for Detecting Out-of-Distribution Samples and Adversarial Attacks.
- Do Deep Generative Models Know What They Don’t Know?. ICLR 2019.
- Conditional Image Generation with PixelCNN Decoders.
- Image Transformer. In ICML Submission.
- Improved DCGAN. pp. 1–10.
- PixelCNN++: Improving the PixelCNN with Discretized Logistic Mixture Likelihood and Other Modifications.
- Going deeper with convolutions. Vol. 07-12-June-2015, pp. 1–9.
- A note on the evaluation of generative models.
- Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks.