
Autoregressive Models: What Are They Good For?

by Murtaza Dalal, et al.

Autoregressive (AR) models have become a popular tool for unsupervised learning, achieving state-of-the-art log likelihood estimates. We investigate the use of AR models as density estimators in two settings – as a learning signal for image translation, and as an outlier detector – and find that these density estimates are much less reliable than previously thought. We examine the underlying optimization issues from both an empirical and theoretical perspective, and provide a toy example that illustrates the problem. Overwhelmingly, we find that density estimates do not correlate with perceptual quality and are unhelpful for downstream tasks.





1 Introduction

Autoregressive (AR) models are a class of likelihood models that attempt to model the data distribution by estimating the data density. They approach the maximum likelihood objective by factorizing over the dimensions of $x$ via the chain rule, $p_\theta(x) = \prod_{i=1}^{d} p_\theta(x_i \mid x_{<i})$, learning each conditional $p_\theta(x_i \mid x_{<i})$ directly. This decomposition helps them achieve negative log-likelihood (NLL) scores superior to other methods such as VAEs Kingma and Welling (2013) or flow models Dinh et al. (2016); Kingma and Dhariwal (2018).
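To make the factorization concrete, here is a minimal sketch of the chain-rule decomposition over a hand-specified two-variable binary model (the conditional tables are illustrative, not from any trained network):

```python
import numpy as np

def ar_log_likelihood(x, conditionals):
    """Chain-rule factorization: log p(x) = sum_i log p(x_i | x_<i).

    `conditionals` is a list of functions; conditionals[i](prefix)
    returns p(x_i = 1 | x_<i) for a binary sequence.
    """
    log_p = 0.0
    for i, xi in enumerate(x):
        p1 = conditionals[i](x[:i])  # p(x_i = 1 | x_<i)
        log_p += np.log(p1 if xi == 1 else 1.0 - p1)
    return log_p

# A toy 2-dimensional model:
#   p(x_1 = 1) = 0.5,  p(x_2 = 1 | x_1) = 0.9 if x_1 == 1 else 0.1
conds = [lambda prefix: 0.5,
         lambda prefix: 0.9 if prefix[0] == 1 else 0.1]

# The factorized probabilities of all four sequences sum to 1.
total = sum(np.exp(ar_log_likelihood(list(x), conds))
            for x in [(0, 0), (0, 1), (1, 0), (1, 1)])
```

Because each conditional is a valid distribution, the product is automatically normalized, which is what lets AR models report exact likelihoods.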

PixelCNN van den Oord et al. (2016) introduced the first convolutional AR architecture, achieving an NLL of 3.00 bits/dim on CIFAR-10. Over the past few years, a flurry of further modifications Oord et al. (2016); Parmar et al. (2018); Salimans et al. (2017); Gray et al. (Technical report); Chen et al. (2017) has pushed the score down to 2.85 bits/dim, the best reported NLL on CIFAR-10 to date. However, despite these advancements, uses of AR models outside of compression have not been well explored. Samples from these models are considerably worse than those from state-of-the-art GANs Brock et al. (2018), and they do not provide a compact latent-space representation, an important piece for use in downstream tasks.
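Since bits/dim is the unit used for all likelihood numbers in this paper, it may help to make the conversion from a raw NLL explicit (a small helper, not from the paper):

```python
import numpy as np

def bits_per_dim(nll_nats, num_dims):
    """Convert a total negative log-likelihood in nats to bits/dim,
    the unit used for CIFAR-10 scores (32 * 32 * 3 = 3072 dims)."""
    return nll_nats / (num_dims * np.log(2.0))

# Example: an image whose total NLL is 3072 * ln(2) * 3.00 nats
# scores exactly 3.00 bits/dim.
nll = 3072 * np.log(2.0) * 3.00
```

Lower bits/dim means the model assigns the image higher probability, so 2.85 bits/dim is a better density fit than 3.00.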

Prior work Theis et al. (2015) has shown that metrics such as log-likelihood and Parzen window estimates are poor indicators of an AR model's performance on specific tasks. We investigate this empirically in two scenarios: using the log-density as a learning signal for image translation, and for outlier detection. Our results show that density estimates neither correlate with perceptual quality, nor are useful for downstream tasks.

2 Log-Likelihood Estimates as Learning Signal

We first explore using NLL scores for image-to-image translation. CycleGAN Zhu et al. (2017), a popular unpaired image translation method, learns mappings between domains $X$ and $Y$ using GANs Goodfellow et al. (2014). However, GANs are known to be unstable during training due to their adversarial framework, and they lack an evaluation metric for the perceptual quality of generated images, so there is no way to evaluate the learned mapping other than visually inspecting the samples.

Replacing the CycleGAN discriminator with NLL estimates from an AR model seems like a natural solution to these problems. Powerful AR models provide log-likelihood estimates of our samples throughout training that we can compare across methods, and optimization could be easier since there is no longer an adversarial minimax problem, which could help learn more general cross-domain mappings.

2.1 ARCycle Formulation

For the mapping functions $G: X \to Y$ and $F: Y \to X$, we borrow the same cycle consistency loss from Zhu et al. (2017):

$$\mathcal{L}_{cyc}(G, F) = \mathbb{E}_{x \sim p_X}\big[\lVert F(G(x)) - x \rVert_1\big] + \mathbb{E}_{y \sim p_Y}\big[\lVert G(F(y)) - y \rVert_1\big]$$
Instead of an adversarial loss, we use a generative loss: the negative log-likelihood of the generated image under our autoregressive model. This autoregressive model is trained purely on real images from its domain. For the mapping function $G$ and the density model $p_Y$ on domain $Y$, we express the objective as:

$$\mathcal{L}_{gen}(G, p_Y) = \mathbb{E}_{x \sim p_X}\big[-\log p_Y(G(x))\big]$$
Thus, our overall ARCycle objective that we minimize is

$$\mathcal{L}(G, F) = \mathcal{L}_{gen}(G, p_Y) + \mathcal{L}_{gen}(F, p_X) + \lambda\,\mathcal{L}_{cyc}(G, F)$$
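The combined objective can be sketched in code as follows (`G`, `F`, `log_p_X`, `log_p_Y`, and `lam` are illustrative stand-ins, not the paper's implementation):

```python
import numpy as np

def arcycle_loss(G, F, log_p_X, log_p_Y, xs, ys, lam):
    """Schematic ARCycle objective: generative NLL terms for both
    mappings plus the L1 cycle-consistency term, weighted by `lam`.
    G, F are the cross-domain mappings; log_p_X, log_p_Y are the
    (fixed, pretrained) AR density estimates on each domain."""
    gen = np.mean([-log_p_Y(G(x)) for x in xs]) + \
          np.mean([-log_p_X(F(y)) for y in ys])
    cyc = np.mean([np.abs(F(G(x)) - x).sum() for x in xs]) + \
          np.mean([np.abs(G(F(y)) - y).sum() for y in ys])
    return gen + lam * cyc

# Identity mappings on tiny 1-D "images" give zero cycle loss, so only
# the generative terms remain.
ident = lambda v: v
log_p = lambda v: 0.0  # stand-in density: log p = 0 everywhere
xs = [np.zeros(4)]; ys = [np.ones(4)]
loss = arcycle_loss(ident, ident, log_p, log_p, xs, ys, lam=10.0)
```

Note that, unlike CycleGAN, nothing here is adversarial: the density models are frozen, and only the mappings receive gradients.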
2.2 Experiments and Discussion

Figure 1: ARCycle trained in different settings: (a) the full ARCycle loss, (b) and (c) single loss terms in isolation, (d) the full loss with Gaussian blur on the generator output, and (e) ARCycle starting from pretrained generators. The left columns are real images from the colored-MNIST dataset, the middle columns contain mappings of the images to the MNIST domain, and the right columns show reconstructions with the reverse mapping back to the colored-MNIST domain.
Figure 2: ARCycle quickly learns to output lines in the MNIST space when optimizing the generative log-likelihood loss. Panels (a)–(e) show iterations 0, 25, 50, 75, and 100.

We trained ARCycle with the loss in Equation 3, as well as several ablations. We used a pre-trained PixelCNN++ to compute the generative loss, and set the weight $\lambda$ so that the generative loss is on the same order as the reconstruction loss. After training to convergence, we observed the results in Figure 1.

The reconstructions in (a) are perfect, but the translated images in the MNIST domain have collapsed to a degenerate solution. The network encodes enough information into the translation to perfectly reconstruct the original image, but hits a local optimum and does not learn to produce realistic MNIST digits. The negative log-likelihood of the translations steadily decreases over training, but remains far from the value PixelCNN++ achieves on the MNIST test set. Even if we remove competing losses and only optimize the generative loss as in (b), or blur the transformed images to remove high-frequency patterns as in (d), our mappings still fail to produce realistic MNIST digits. Most interestingly, even if we initialize training with mappings pretrained using CycleGAN, the ARCycle training procedure manages to corrupt them, producing faint lines in the background.

To analyze how the network arrives at the degenerate solution, we plot reconstructions over iterations in Figure 2. Here we see an interesting phenomenon: with more and more updates, the translated images come to look like a set of lines, while at the same time the reconstructions grow more accurate. One hypothesis is that the network learns to embed information in high-frequency signals that are not apparent to the human eye, a phenomenon also noted in Zhu et al. (2017).

The ARCycle objective proved difficult to optimize in a variety of settings, and these results suggest that using AR density estimates as a learning signal for optimization may be fundamentally flawed, which we investigate in the next section.

3 Optimization With AR Models

(a) Optimized samples at iteration 0, 5000, and 10000.
(b) Loss curve over 10k iterations.
Figure 3: Result of directly maximizing images’ PixelCNN++ log-likelihood estimate by applying gradients to the image pixels themselves.

In this section, we investigate the optimization process and explain why an AR log-likelihood loss can be difficult to optimize in general.

3.1 Directly Maximizing Image Log-Likelihood

One key observation is that the gradient of the ARCycle loss with respect to the generator's parameters depends entirely, via the chain rule, on the gradient of the log-likelihood with respect to the generated image. If that pixel-level gradient vanishes, gradient descent cannot help the generator produce better images. Thus, we test whether it is possible to optimize the image pixels directly using gradient descent on the negative log-likelihood output of PixelCNN++.

Figure 3 shows the result of applying gradient descent on 3 MNIST digits from the test set, 3 images of random noise, and black, gray, and white images. Every image begins to accumulate noise, even the true MNIST digits, which should already have been at a local optimum. Both the noisy and gray images form lines, which indicates that PixelCNN++ gradients lead us to images that locally resemble digits, rather than images that gradually form digits as training progresses. Not only does the model encourage local texture over global structure, but the optimization problem seems ill-conditioned: the true MNIST digits start at 0.6 bits/dim, but jump up to 7 bits/dim after a few gradient steps. Since the digits remain visually identical, log-likelihood has no bearing on the quality of a sample. Overall, directly maximizing log-likelihood under a trained AR model is difficult and is not guaranteed to produce good results.
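The pixel-space procedure above can be sketched as follows. For illustration we substitute an analytic standard Gaussian, whose log-density gradient is known in closed form, as a stand-in for the PixelCNN++ NLL (which would be supplied by the trained model):

```python
import numpy as np

def grad_ascent_on_input(x0, grad_log_p, lr=0.1, steps=100):
    """Gradient ascent on the input itself (not on model weights),
    mirroring the pixel-space optimization described above."""
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        x += lr * grad_log_p(x)
    return x

# For a fixed standard Gaussian, grad log p(x) = -x, so ascent pulls
# the input toward the mode at 0. A sharp, multimodal AR density
# behaves far less benignly under the same procedure.
x_final = grad_ascent_on_input(np.array([3.0, -2.0]), lambda x: -x)
```

The point of the contrast: with a smooth unimodal density this loop converges cleanly, whereas the PixelCNN++ landscape produces the noisy, line-filled images in Figure 3.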

3.2 Harder Optimization Problem

(a) True data distribution.
(b) Learned probability with discretized Gaussian.
(c) Gradient norm heatmap.
Figure 4: 2D toy example illustrating the optimization problem that arises when the data lies on a lower-dimensional manifold.

Training PixelCNN++ on CIFAR-10 or ImageNet achieves good performance for a wide range of hyperparameter choices. Why, then, is it so hard to maximize the log-likelihood of samples under a trained PixelCNN++ model?

The PixelCNN++ training objective is $\max_\theta \mathbb{E}_{x \sim p_{data}}[\log p_\theta(x)]$. As long as $p_\theta$ is initialized with support almost everywhere and the parameters are closely tied to the density estimate, the gradient of this objective is nonzero and yields improvements to the AR model's ability to model the distribution.

In contrast, when directly optimizing samples, the objective is $\max_x \log p_\theta(x)$ for a fixed, trained $p_\theta$. Instead of increasing the probability of data points by adapting the model, we must find samples that have high likelihood under a fixed AR model. In certain scenarios, this proves to be an impossible task. If the data lies on a lower-dimensional manifold within pixel space, a sufficiently powerful AR model will put all of its probability mass on the manifold, leaving none elsewhere. Since the model then has probability zero almost everywhere, its gradients are also zero almost everywhere, and gradient descent cannot improve the log-likelihood of off-manifold points.

We visualize this with a 2-dimensional toy problem in Figure 4. The true data distribution, shown in (a), lies on a one-dimensional manifold: one coordinate is held fixed while the other varies uniformly. We use a MADE AR model Germain et al. (2015) to learn a discretized Gaussian likelihood, an analogue of the discretized mixture of logistics used by PixelCNN++. As seen in (b), the trained model fits the true distribution fairly closely. Internally, the Gaussian along the fixed coordinate becomes ever narrower, as the model tries to place as much mass as possible in that coordinate's bin. This leads to the gradient-norm heatmap in (c), taken with respect to the two input coordinates. The gradient is nonzero in only two thin strips, which converge to infinitely thin lines as the AR model fits the true distribution better and better. Optimizing a set of samples on this landscape would be impossible: only points on those thin, bright strips have a nonzero gradient. Thus, the very ability of powerful autoregressive models to represent any distribution hurts downstream optimization by creating vanishing gradients almost everywhere.

4 Correlation between Log-Likelihood and Perceptual Features

4.1 Do Realistic Images Have High Log-Likelihood?

Figure 5: Left: samples from WGAN-GP trained on CIFAR-10. Right: plot of the zero-centered inception score and PixelCNN++ negative log probability of WGAN-GP samples over the WGAN training process.

Do images that look like the training dataset have high log probability under an AR model for that dataset? We propose the following experiment: Train a WGAN-GP Gulrajani et al. (2017) model to produce highly realistic CIFAR-10 images, compute the bits/dim of those samples under a PixelCNN++, and compare that quantity to the bits/dim of the CIFAR-10 test set under the same PixelCNN++.

Our WGAN-GP samples, which are visually indistinguishable from real images, obtained 6.52 bits/dim, significantly higher than the CIFAR-10 test set at 2.92 bits/dim. This suggests that perceptual similarity may not correlate well with AR log probability. To investigate further, we compare the AR log probability to the inception score, a metric purported to correlate well with human perception Salimans et al. (2016). Figure 5 shows that the log probability remains relatively constant, even though the inception score and sample perceptual quality progressively increase. This further implies that log probability is not a good metric for evaluating whether a given image is similar to a certain distribution of images.

4.2 Do Images with High Log-Likelihood Look Realistic?

We seek to better understand whether images that have high log probability under an AR model trained on a dataset look like images from that dataset. To test this, we evaluated how well AR models fare against other methods of outlier detection for high-dimensional data, where simple strategies such as z-scores are not particularly useful. All the methods we test assume knowledge of only the CIFAR-10 training set and attempt to classify whether new images are from CIFAR-10.

Dataset AR-2SD AR-1SD AR-One-sided CCG
CIFAR-10 Test 92.5% 68% 94.7% 92.3%
WGAN-GP CIFAR-10 Samples 0% 44.9% 0% 69.7%
SVHN Test 90.9% 44.9% 100% 37.3%
Noise 0% 0% 0% 1.1%
All Black 0% 0% 100% 0%
All White 0% 0% 100% 0%
Table 1: The percent of the dataset classified to be in CIFAR-10, i.e. classified as not an outlier.

We tried three strategies for outlier detection with AR models. We computed the mean $\mu$ and standard deviation $\sigma$ of bits/dim over the CIFAR-10 training set using the PixelCNN++ NLL. We then constructed three intervals: a 2-standard-deviation interval (AR-2SD), $[\mu - 2\sigma, \mu + 2\sigma]$; a 1-standard-deviation interval (AR-1SD), $[\mu - \sigma, \mu + \sigma]$; and a one-sided interval (AR-One-sided) with an upper bound only. We classify an image as an outlier if its NLL in bits/dim lies outside the interval.
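The interval strategies can be sketched as follows (the training bits/dim here are synthetic stand-ins for actual PixelCNN++ scores, and the one-sided variant is modeled as an upper bound only):

```python
import numpy as np

def fit_interval(train_bpd, k=2.0, one_sided=False):
    """Build an NLL acceptance interval from training bits/dim, as in
    AR-2SD / AR-1SD (k standard deviations around the mean) or
    AR-One-sided (upper bound only, so arbitrarily low NLL passes)."""
    mu, sd = train_bpd.mean(), train_bpd.std()
    lo = -np.inf if one_sided else mu - k * sd
    return lo, mu + k * sd

def is_outlier(bpd, interval):
    lo, hi = interval
    return not (lo <= bpd <= hi)

rng = np.random.default_rng(0)
train = rng.normal(2.92, 0.3, size=1000)  # stand-in training bits/dim
iv = fit_interval(train, k=2.0)
```

With the one-sided interval, images with unusually low NLL (such as all-black or all-white images) are never flagged, matching the 100% rows in Table 1.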

We compare against Class-Conditional Gaussians (CCG), a method for high-dimensional outlier detection using deep classifiers Lee et al. (2018). CCG works as follows: train a classifier on the dataset (Inception-v1 Szegedy et al. (2015) on CIFAR-10), strip off the output layer, and fit class-conditional Gaussians to the feature vectors obtained by evaluating the classifier on the dataset. For a new test image, compute its feature vector and evaluate that vector's probability under each class-conditional Gaussian; if the probability is below some threshold, classify the image as an outlier.
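A minimal sketch of the CCG scoring procedure, using synthetic feature vectors in place of Inception-v1 features (diagonal covariances are a simplification for illustration):

```python
import numpy as np

def fit_ccg(features, labels):
    """Fit one diagonal Gaussian per class to penultimate-layer
    feature vectors, keyed by class label."""
    params = {}
    for c in np.unique(labels):
        f = features[labels == c]
        params[c] = (f.mean(axis=0), f.var(axis=0) + 1e-6)
    return params

def ccg_score(feature, params):
    """Best class-conditional log density; threshold this to flag outliers."""
    def log_gauss(f, mu, var):
        return -0.5 * np.sum((f - mu) ** 2 / var + np.log(2 * np.pi * var))
    return max(log_gauss(feature, mu, var) for mu, var in params.values())

# Two synthetic "classes" of 4-D features, well separated.
rng = np.random.default_rng(1)
feats = np.vstack([rng.normal(0, 1, (50, 4)), rng.normal(5, 1, (50, 4))])
labels = np.array([0] * 50 + [1] * 50)
params = fit_ccg(feats, labels)
inlier = ccg_score(np.zeros(4), params)    # near class-0 mean
outlier = ccg_score(np.full(4, 20.0), params)  # far from both classes
```

A point near a class mean scores a much higher log density than one far from every class, which is the signal CCG thresholds.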

We test the effectiveness of the outlier detection methods on six datasets: the CIFAR-10 test set, samples from WGAN-GP trained on CIFAR-10, the SVHN test set, random noise images, completely black images, and completely white images. Intuitively, a good outlier detector should classify CIFAR-10 test images as CIFAR, classify WGAN-GP samples as CIFAR most of the time, and classify SVHN, noise, all-black, and all-white images as outliers.

Table 1 shows that CCG largely does what a good outlier detector should, but the AR model does not. In fact, the AR model classifies SVHN images as CIFAR nearly as often as it does actual CIFAR-10 images. As found earlier by Nalisnick et al. (2018), PixelCNN++ actually assigns higher log probability to images from SVHN than to actual CIFAR-10 images: SVHN achieves 2.1 bits/dim, while CIFAR-10 only gets 2.92 bits/dim. The model assigns even higher log probability to all-black and all-white images. These failures at outlier detection indicate that log probability under an AR model is not usefully correlated with our notion of perceptual quality.

5 Conclusions and Future Work

We investigate the usefulness of the density estimates of autoregressive models. We apply them to two tasks, as a learning signal for unsupervised image translation with ARCycle and as an outlier detector, and find that the density estimates are not informative in either setting. We also analyze the underlying optimization issues, finding that optimizing against AR models leads to degenerate solutions due to vanishing gradients. Their log-likelihood estimates also do not correlate with perceptual quality, which explains their poor performance at outlier detection. Given that these models achieve superior log-likelihoods, our findings call into question the overall utility of likelihood-based learning. In this work we examined PixelCNN++ in detail; one interesting avenue for future work is investigating whether our results hold for other autoregressive models as well.


  • D. P. Kingma and P. Dhariwal (2018) Glow: Generative Flow with Invertible 1x1 Convolutions. In NeurIPS. External Links: Link Cited by: §1.
  • A. Brock, J. Donahue, and K. Simonyan (2018) Large Scale GAN Training for High Fidelity Natural Image Synthesis. External Links: Link Cited by: §1.
  • X. Chen, N. Mishra, M. Rohaninejad, and P. Abbeel (2017) PixelSNAIL: An Improved Autoregressive Generative Model. Proceedings of Machine Learning Research. External Links: Link Cited by: §1.
  • A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu (2016) Pixel Recurrent Neural Networks. Proceedings of The 33rd International Conference on Machine Learning 48, pp. 1747–1756. External Links: Link, ISBN 9781510829008 Cited by: §1.
  • L. Dinh, J. Sohl-Dickstein, and S. Bengio (2016) Density estimation using Real NVP. External Links: Link Cited by: §1.
  • M. Germain, K. Gregor, I. Murray, and H. Larochelle (2015) MADE: Masked Autoencoder for Distribution Estimation. External Links: Link Cited by: §3.2.
  • I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative Adversarial Networks. External Links: Link Cited by: §2.
  • S. Gray, A. Radford, and D. P. Kingma (Technical report) GPU Kernels for Block-Sparse Weights. Technical report External Links: Link Cited by: §1.
  • I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville (2017) Improved Training of Wasserstein GANs. External Links: Link Cited by: §4.1.
  • P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2016) Image-to-Image Translation with Conditional Adversarial Networks. External Links: Link
  • D. P. Kingma and M. Welling (2013) Auto-Encoding Variational Bayes. External Links: Link Cited by: §1.
  • K. Lee, K. Lee, H. Lee, and J. Shin (2018) A Simple Unified Framework for Detecting Out-of-Distribution Samples and Adversarial Attacks. External Links: Link Cited by: §4.2.
  • E. Nalisnick, A. Matsukawa, Y. W. Teh, D. Gorur, and B. Lakshminarayanan (2018) Do Deep Generative Models Know What They Don’t Know?. ICLR 2019. External Links: Link Cited by: §4.2.
  • A. v. d. Oord, N. Kalchbrenner, O. Vinyals, L. Espeholt, A. Graves, and K. Kavukcuoglu (2016) Conditional Image Generation with PixelCNN Decoders. External Links: Link Cited by: §1.
  • N. Parmar, A. Vaswani, J. Uszkoreit, Ł. Kaiser, N. Shazeer, and A. Ku (2018) Image Transformer. In ICML. External Links: Link Cited by: §1.
  • T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen (2016) Improved Techniques for Training GANs. pp. 1–10. External Links: Link Cited by: §4.1.
  • T. Salimans, A. Karpathy, X. Chen, and D. P. Kingma (2017) PixelCNN++: Improving the PixelCNN with Discretized Logistic Mixture Likelihood and Other Modifications. External Links: Link Cited by: §1.
  • C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 07-12-June-2015, pp. 1–9. External Links: ISBN 9781467369640, Document, ISSN 10636919 Cited by: §4.2.
  • L. Theis, A. v. d. Oord, and M. Bethge (2015) A note on the evaluation of generative models. External Links: Link Cited by: §1.
  • J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. External Links: Link Cited by: §2.1, §2.2, §2.