1 Introduction
Autoregressive (AR) models are a class of likelihood-based models that fit the data distribution by estimating its density. They approach the maximum likelihood objective by factorizing the density over the dimensions of $x$ via the chain rule, learning each conditional $p(x_i \mid x_{<i})$ to form $p(x) = \prod_i p(x_i \mid x_{<i})$. This decomposition helps them achieve negative log-likelihood (NLL) scores superior to those of other methods such as VAEs Kingma and Welling (2013) or flow models Dinh et al. (2016); Kingma and Dhariwal (2018). PixelCNN Oord et al. (2016) was the first convolutional AR architecture, achieving an NLL of 3.00 bits/dim on CIFAR-10. Over the past few years, a flurry of further modifications Oord et al. (2016); Parmar et al. (2018); Salimans et al. (2017); Gray et al. (2017); Chen et al. (2017) has pushed the score down to 2.85 bits/dim, the best reported NLL on CIFAR-10 to date. Despite these advances, however, their uses outside of compression have not been well explored. Samples from these models are considerably worse than those of state-of-the-art GANs Brock et al. (2018), and they do not provide a compact latent-space representation, an important component for downstream tasks.
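Throughout, model quality is quoted in bits/dim: the total NLL normalized by the number of dimensions and converted from nats to bits. A minimal helper illustrating the conversion (the function name and the nats-valued input are our own, not from the paper):

```python
import numpy as np

def bits_per_dim(total_nll_nats, num_dims):
    """Convert a total negative log-likelihood in nats into bits per dimension."""
    return total_nll_nats / (num_dims * np.log(2.0))

# A 32x32 RGB CIFAR-10 image has 32 * 32 * 3 = 3072 dimensions,
# so a total NLL of ~6215 nats works out to ~2.92 bits/dim.
print(bits_per_dim(6215.0, 32 * 32 * 3))
```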
Prior work Theis et al. (2015) has shown that metrics such as log-likelihood and Parzen window estimates are poor indicators of an AR model's performance on specific tasks. We investigate this empirically in two scenarios: using the log-density as a learning signal for image translation, and using it for outlier detection. Our results show that density estimates neither correlate with perceptual quality nor are useful for these downstream tasks.
2 Log-Likelihood Estimates as Learning Signal
We first explore using NLL scores for image-to-image translation. CycleGAN Zhu et al. (2017), a popular unpaired image translation method, learns mappings between domains $X$ and $Y$ using GANs Goodfellow et al. (2014). However, GANs are known to be unstable during training due to their adversarial framework, and they lack an evaluation metric for the perceptual quality of generated images, so there is no way to evaluate the learned mapping other than visually inspecting samples.
Replacing the CycleGAN discriminator with NLL estimates from an AR model seems like a natural solution to these problems. Powerful AR models provide log-likelihood estimates of our samples throughout training that can be compared across methods, and optimization could be easier since there is no longer an adversarial minimax problem, which could help learn more general cross-domain mappings.
2.1 ARCycle Formulation
For the mapping functions $G: X \to Y$ and $F: Y \to X$, we borrow the cycle-consistency loss from Zhu et al. (2017):

$$\mathcal{L}_{\text{cyc}}(G, F) = \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\lVert F(G(x)) - x \rVert_1\big] + \mathbb{E}_{y \sim p_{\text{data}}(y)}\big[\lVert G(F(y)) - y \rVert_1\big] \tag{1}$$
Instead of an adversarial loss, we use a generative loss: the negative log-likelihood of the generated image under our autoregressive model, which is trained purely on real images from its domain. For the mapping function $G: X \to Y$ and the density model $p_Y$ on $Y$, we express the objective as:

$$\mathcal{L}_{\text{gen}}(G, X, Y) = \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[-\log p_Y(G(x))\big] \tag{2}$$
Thus, the overall ARCycle objective that we minimize is

$$\mathcal{L}(G, F) = \mathcal{L}_{\text{gen}}(G, X, Y) + \mathcal{L}_{\text{gen}}(F, Y, X) + \lambda\, \mathcal{L}_{\text{cyc}}(G, F) \tag{3}$$
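For concreteness, the three-term objective in Equation 3 can be sketched as follows. This is a schematic only: the callables and names are illustrative placeholders, with a unit-Gaussian NLL standing in for the pretrained PixelCNN++ models and identity functions standing in for the CycleGAN generators.

```python
import numpy as np

def l1(a, b):
    """Mean absolute error, standing in for the L1 cycle-consistency term."""
    return np.abs(a - b).mean()

def arcycle_loss(G, F, nll_Y, nll_X, x, y, lam):
    """ARCycle objective: generative NLL terms plus weighted cycle consistency.

    G, F         -- mappings X -> Y and Y -> X (callables on arrays)
    nll_Y, nll_X -- NLL of an image under the fixed AR models trained on
                    real images from Y and X respectively (callables)
    lam          -- weight on the cycle-consistency term
    """
    l_gen = nll_Y(G(x)) + nll_X(F(y))
    l_cyc = l1(F(G(x)), x) + l1(G(F(y)), y)
    return l_gen + lam * l_cyc

# Toy check: identity mappings give zero cycle loss, so only the
# generative terms remain.
rng = np.random.default_rng(0)
x, y = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))
gauss_nll = lambda im: 0.5 * (im ** 2).mean()
loss = arcycle_loss(lambda a: a, lambda a: a, gauss_nll, gauss_nll, x, y, lam=10.0)
```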
2.2 Experiments and Discussion
We trained ARCycle with the loss in Equation 3, along with several ablations. We used a pretrained PixelCNN++ to compute the generative loss, and set the weight $\lambda$ so that the generative and reconstruction terms are on the same order. After training to convergence, we observed the results in Figure 1.
The reconstructions in (a) are perfect, but the translated images in the MNIST domain have collapsed to a degenerate solution. The network encodes enough information into the translation to perfectly reconstruct the original image, but hits a local optimum and does not learn to produce realistic MNIST digits. The negative log-probability of the translations decreases steadily over training, but remains far above what PixelCNN++ achieves on the MNIST test set. Even if we remove competing losses and only optimize the generative loss as in (b), or blur the transformed images to remove high-frequency patterns as in (d), our mappings still fail to produce realistic MNIST digits. Most interestingly, even if we initialize training with mappings pretrained using CycleGAN, the ARCycle training procedure corrupts them, producing faint lines in the background.
To analyze how the network arrives at the degenerate solution, we plot reconstructions over iterations in Figure 2. Here we see an interesting phenomenon: with more and more updates, the translated images come to look like a set of lines, while at the same time the reconstructions get more accurate. One hypothesis is that the network learns to embed information in high-frequency signals that are not apparent to the human eye, a phenomenon also noted in Zhu et al. (2017).
ARCycle proved difficult to optimize in a variety of settings, and these results suggest that using AR density estimates as a learning signal for optimization may be fundamentally flawed. We investigate why in the next section.
3 Optimization With AR Models

In this section, we investigate the optimization process and explain why an AR loglikelihood loss can be difficult to optimize in general.
3.1 Directly Maximizing Image Log-Likelihood
One key observation is that the gradient of the ARCycle loss with respect to our generator $G$'s parameters depends entirely, by the chain rule, on the gradient of the log-likelihood with respect to the generated image. If that pixel-level gradient vanishes, gradient descent cannot help our generator produce better images. Thus, we ask whether it is possible to optimize the image pixels directly using gradient descent on the negative log-likelihood output of PixelCNN++.
Figure 3 shows the result of applying gradient descent on 3 MNIST digits from the test set, 3 images of random noise, and all-black, all-gray, and all-white images. Every image begins to accumulate noise, even the true MNIST digits, which should already have been at a local minimum. Both the noisy and gray images form lines, which indicates that PixelCNN++ gradients lead to images that locally resemble digits, rather than images that gradually form digits as optimization progresses. Not only does the model encourage local texture over global structure, but the optimization problem also appears ill-conditioned: the true MNIST digits start at 0.6 bits/dim, but jump up to 7 bits/dim after a few gradient steps even though they remain visually identical, so log-likelihood has little bearing on the quality of a sample. Overall, directly minimizing the negative log-likelihood of an image is difficult and is not guaranteed to produce good results.
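The pixel-space loop itself is short to write. The sketch below swaps PixelCNN++ for a smooth stand-in density (an independent Gaussian with a known mode; all names and constants are illustrative), since the point is the procedure. On this well-conditioned toy density the loop converges cleanly to the mode, in contrast to the behavior just described.

```python
import numpy as np

def nll(img, mean, var=0.1):
    """NLL of pixels under an independent Gaussian stand-in density
    (constant terms dropped)."""
    return 0.5 * ((img - mean) ** 2 / var).sum()

def nll_grad(img, mean, var=0.1):
    """Gradient of the NLL with respect to the pixels themselves."""
    return (img - mean) / var

# Start from random noise and run gradient descent directly on the pixels.
rng = np.random.default_rng(0)
mode = 0.5 * np.ones((8, 8))  # the stand-in density's mode
img = rng.uniform(size=(8, 8))
for _ in range(500):
    img -= 0.01 * nll_grad(img, mode)
```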
3.2 Harder Optimization Problem
Training PixelCNN++ on CIFAR-10 or ImageNet achieves good performance for a wide range of hyperparameter choices. Why, then, is it so hard to maximize the log-likelihood of samples under a trained PixelCNN++ model?
The PixelCNN++ training objective is $\max_\theta \mathbb{E}_{x \sim p_{\text{data}}}[\log p_\theta(x)]$. As long as $p_\theta$ is initialized with support almost everywhere, and the parameters are closely tied to the density estimate, the gradient of this objective is nonzero and yields steady improvements to the AR model's fit of the data distribution.
In contrast, when directly optimizing our samples, the objective is $\max_x \log p_\theta(x)$ with $\theta$ held fixed. Instead of increasing the probability of data points by adapting our model, we must find samples that have high likelihood under a fixed AR model. In certain scenarios this proves to be an impossible task: assuming the data lies on a lower-dimensional manifold within pixel space, a powerful enough AR model will put all of its probability mass on the manifold, leaving none elsewhere. Since the model then assigns probability zero almost everywhere, the gradients are also zero almost everywhere, and gradient descent cannot improve the log-likelihood of off-manifold points.
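This vanishing-gradient effect is easy to reproduce numerically with a one-dimensional discretized Gaussian, the binned kind of likelihood that PixelCNN-style models use. With a near-degenerate scale, the bin probability underflows to exactly zero away from the mass, and any usable gradient with it (the specific mean, scale, and bin width below are illustrative choices, not the paper's):

```python
import math

def bin_prob(x, mu=0.5, sigma=1e-3, width=1.0 / 256):
    """Probability mass a discretized Gaussian assigns to the bin around x."""
    cdf = lambda t: 0.5 * (1.0 + math.erf((t - mu) / (sigma * math.sqrt(2.0))))
    return cdf(x + width / 2) - cdf(x - width / 2)

def grad_log_prob(x, eps=1e-6):
    """Finite-difference gradient of the log bin probability at x."""
    if bin_prob(x) == 0.0:
        return 0.0  # log-prob has underflowed to -inf: flat, no usable gradient
    return (math.log(bin_prob(x + eps)) - math.log(bin_prob(x - eps))) / (2 * eps)
```

Near the mass (x = 0.5) the bin captures almost all the probability, while at x = 0.6 the bin probability underflows to exactly zero and the gradient vanishes; only points whose bin edge straddles the mass see a large gradient.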
We visualize this with a 2-dimensional toy problem in Figure 4. The true data distribution, shown in (a), fixes one coordinate at a single value while spreading the other across its range, so the data lies on a one-dimensional manifold. We use a MADE AR model Germain et al. (2015) with a discretized Gaussian likelihood, an analogue of the discretized mixture of logistics used by PixelCNN++. As seen in (b), the trained model fits the true distribution fairly closely. Internally, the Gaussian along the fixed coordinate becomes ever narrower as the model tries to place as much mass as possible in a single bin. This produces the gradient heatmap in (c), taken with respect to both coordinates: the gradient is nonzero in only two thin strips, which converge to two infinitely thin lines as the AR model fits the true distribution better and better. Optimizing a set of samples on this landscape would be impossible, as only the points on those thin, bright gradient strips have a nonzero gradient. Thus, the very ability of powerful autoregressive models to represent nearly any distribution hurts downstream optimization by creating vanishing gradients almost everywhere.

4 Correlation between Log-Likelihood and Perceptual Features
4.1 Do Realistic Images Have High Log-Likelihood?
Do images that look like the training dataset have high log probability under an AR model trained on that dataset? We propose the following experiment: train a WGAN-GP Gulrajani et al. (2017) model to produce highly realistic CIFAR-10 images, compute the bits/dim of those samples under a PixelCNN++, and compare that quantity to the bits/dim of the CIFAR-10 test set under the same PixelCNN++.
Our WGAN-GP samples, which look highly realistic, obtained 6.52 bits/dim, significantly worse than the CIFAR-10 test set's 2.92 bits/dim. This suggests that perceptual similarity is not well correlated with AR log probability. To investigate further, we compare the AR log probability to the Inception score, a metric purported to correlate well with human perception Salimans et al. (2016). Figure 5 shows that the log probability remains roughly constant over training, even as the Inception score and the samples' perceptual quality progressively increase. This further implies that log probability is not a good metric for evaluating whether a given image resembles a certain distribution of images.
4.2 Do Images with High Log-Likelihood Look Realistic?
We next ask whether images that have high log probability under an AR model trained on a dataset look like images from that dataset. To test this, we evaluated how well AR models fare against other methods of outlier detection for high-dimensional data, where simple strategies such as z-scores are not particularly useful. All the methods we test assume knowledge only of the CIFAR-10 training set and attempt to classify new images as CIFAR-10 or not.
| Dataset | AR-2SD | AR-1SD | AR-One-sided | CCG |
| --- | --- | --- | --- | --- |
| CIFAR-10 test | 92.5% | 68% | 94.7% | 92.3% |
| WGAN-GP CIFAR-10 samples | 0% | 44.9% | 0% | 69.7% |
| SVHN test | 90.9% | 44.9% | 100% | 37.3% |
| Noise | 0% | 0% | 0% | 1.1% |
| All black | 0% | 0% | 100% | 0% |
| All white | 0% | 0% | 100% | 0% |

Table 1: Percentage of images in each dataset classified as CIFAR-10 by each method.
We tried three strategies for outlier detection with AR models. We computed the mean $\mu$ and standard deviation $\sigma$ of the bits/dim over the CIFAR-10 training set using the PixelCNN++'s NLL. We then constructed three intervals: a 2-standard-deviation interval (AR-2SD), $[\mu - 2\sigma, \mu + 2\sigma]$; a 1-standard-deviation interval (AR-1SD), $[\mu - \sigma, \mu + \sigma]$; and a one-sided interval (AR-One-sided) that keeps only the upper bound. We classify an image as an outlier if its NLL in bits/dim lies outside the interval. We compare against Class-Conditional Gaussians (CCG), a method for high-dimensional outlier detection using deep classifiers Lee et al. (2018). CCG works as follows: train a classifier on the dataset (Inception-v1 Szegedy et al. (2015) on CIFAR-10), strip off the output layer, and fit class-conditional Gaussians to the feature vectors obtained by evaluating the classifier on the dataset. For a new test image, compute its feature vector and evaluate that vector's probability under each class-conditional Gaussian; if the probability is below some threshold, classify the image as an outlier.
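Given per-image bits/dim scores, the interval tests reduce to a few lines. The sketch below uses synthetic scores in place of real PixelCNN++ outputs, and the cutoff multiplier for the one-sided variant is our assumption, since the text fixes only its one-sided form:

```python
import numpy as np

def fit_interval(train_bpd, k=2.0, one_sided=False):
    """Build an inlier interval from training-set bits/dim scores."""
    mu, sd = train_bpd.mean(), train_bpd.std()
    lo = -np.inf if one_sided else mu - k * sd
    return lo, mu + k * sd

def is_inlier(bpd, interval):
    """An image is an outlier iff its bits/dim falls outside the interval."""
    lo, hi = interval
    return lo <= bpd <= hi

# Synthetic stand-in for CIFAR-10 training-set scores (mean ~2.92 bits/dim).
rng = np.random.default_rng(0)
train_bpd = rng.normal(2.92, 0.3, size=10_000)
two_sd = fit_interval(train_bpd, k=2.0)              # AR-2SD
one_sd_upper = fit_interval(train_bpd, one_sided=True)  # AR-One-sided
```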
We test the effectiveness of the outlier detection methods on six datasets: the CIFAR-10 test set, samples from a WGAN-GP trained on CIFAR-10, the SVHN test set, uniformly random noise images, completely black images, and completely white images. Intuitively, a good outlier detector should classify CIFAR-10 test images as CIFAR-10, classify WGAN-GP samples as CIFAR-10 most of the time, and reject SVHN, noise, all-black, and all-white images as outliers.
Table 1 shows that CCG largely does what a good outlier detector should, but the AR model does not. In fact, the AR model classifies SVHN images as CIFAR-10 about as often as it does actual CIFAR-10 images. As found earlier by Nalisnick et al. (2018), PixelCNN++ assigns higher log probability to images from SVHN than to actual CIFAR-10 images: SVHN achieves 2.1 bits/dim, while CIFAR-10 only gets 2.92 bits/dim. The model assigns even higher log probability to all-black and all-white images. These failures at outlier detection indicate that log probability under an AR model is not usefully correlated with our notion of perceptual quality.
5 Conclusions and Future Work
We investigated the usefulness of the density estimates of autoregressive models, applying them to two tasks: as a learning signal for unsupervised image translation with ARCycle, and for outlier detection. We find that the density estimates are uninformative in both settings. We also analyzed the underlying optimization issues, finding that optimizing samples under AR models leads to degenerate solutions due to vanishing gradients. The models' log-likelihood estimates also do not correlate with perceptual quality, which explains their poor performance at outlier detection. Given that these models achieve superior log-likelihoods, our findings call into question the overall utility of likelihood-based learning. In this work we examined PixelCNN++ in detail; an interesting avenue for future work is investigating whether our results hold for other autoregressive models as well.
References
D. P. Kingma and P. Dhariwal. Glow: Generative Flow with Invertible 1x1 Convolutions. In NeurIPS, 2018. Cited by: §1.
A. Brock, J. Donahue, and K. Simonyan. Large Scale GAN Training for High Fidelity Natural Image Synthesis. arXiv preprint, 2018. Cited by: §1.
X. Chen, N. Mishra, M. Rohaninejad, and P. Abbeel. PixelSNAIL: An Improved Autoregressive Generative Model. In Proceedings of Machine Learning Research (ICML), 2018. Cited by: §1.
A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel Recurrent Neural Networks. In Proceedings of The 33rd International Conference on Machine Learning, pp. 1747–1756, 2016. Cited by: §1.
L. Dinh, J. Sohl-Dickstein, and S. Bengio. Density Estimation Using Real NVP. arXiv preprint, 2016. Cited by: §1.
M. Germain, K. Gregor, I. Murray, and H. Larochelle. MADE: Masked Autoencoder for Distribution Estimation. In ICML, 2015. Cited by: §3.2.
I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative Adversarial Networks. In NeurIPS, 2014. Cited by: §2.
S. Gray, A. Radford, and D. P. Kingma. GPU Kernels for Block-Sparse Weights. Technical report, 2017. Cited by: §1.
I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville. Improved Training of Wasserstein GANs. In NeurIPS, 2017. Cited by: §4.1.
P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-Image Translation with Conditional Adversarial Networks. In CVPR, 2017.
D. P. Kingma and M. Welling. Auto-Encoding Variational Bayes. arXiv preprint, 2013. Cited by: §1.
K. Lee, K. Lee, H. Lee, and J. Shin. A Simple Unified Framework for Detecting Out-of-Distribution Samples and Adversarial Attacks. In NeurIPS, 2018. Cited by: §4.2.
E. Nalisnick, A. Matsukawa, Y. W. Teh, D. Gorur, and B. Lakshminarayanan. Do Deep Generative Models Know What They Don't Know? In ICLR, 2019. Cited by: §4.2.
A. van den Oord, N. Kalchbrenner, O. Vinyals, L. Espeholt, A. Graves, and K. Kavukcuoglu. Conditional Image Generation with PixelCNN Decoders. In NeurIPS, 2016. Cited by: §1.
N. Parmar, A. Vaswani, J. Uszkoreit, Ł. Kaiser, N. Shazeer, A. Ku, and D. Tran. Image Transformer. In ICML, 2018. Cited by: §1.
T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved Techniques for Training GANs. In NeurIPS, 2016. Cited by: §4.1.
T. Salimans, A. Karpathy, X. Chen, and D. P. Kingma. PixelCNN++: Improving the PixelCNN with Discretized Logistic Mixture Likelihood and Other Modifications. In ICLR, 2017. Cited by: §1.
C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going Deeper with Convolutions. In CVPR, pp. 1–9, 2015. Cited by: §4.2.
L. Theis, A. van den Oord, and M. Bethge. A Note on the Evaluation of Generative Models. arXiv preprint, 2015. Cited by: §1.
J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks. In ICCV, 2017. Cited by: §2, §2.1, §2.2.