Parallel Multiscale Autoregressive Density Estimation

by Scott Reed, et al.

PixelCNN achieves state-of-the-art results in density estimation for natural images. Although training is fast, inference is costly, requiring one network evaluation per pixel; O(N) for N pixels. This can be sped up by caching activations, but still involves generating each pixel sequentially. In this work, we propose a parallelized PixelCNN that allows more efficient inference by modeling certain pixel groups as conditionally independent. Our new PixelCNN model achieves competitive density estimation and orders of magnitude speedup - O(log N) sampling instead of O(N) - enabling the practical generation of 512x512 images. We evaluate the model on class-conditional image generation, text-to-image synthesis, and action-conditional video generation, showing that our model achieves the best results among non-pixel-autoregressive density models that allow efficient sampling.



1 Introduction

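Many autoregressive image models factorize the joint distribution of images into per-pixel factors:

$$p(x_{1:N}) = \prod_{i=1}^{N} p(x_i \mid x_{1:i-1}) \qquad (1)$$

where $x_i$ is the $i$-th pixel and N is the number of pixels.
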
For example PixelCNN (van den Oord et al., 2016b) uses a deep convolutional network with carefully designed filter masking to preserve causal structure, so that all factors in equation 1 can be learned in parallel for a given image. However, a remaining difficulty is that due to the learned causal structure, inference proceeds sequentially pixel-by-pixel in raster order.

In the naive case, this requires a full network evaluation per pixel. Caching hidden unit activations can be used to reduce the amount of computation per pixel, as in the 1D case for WaveNet (Oord et al., 2016; Ramachandran et al., 2017). However, even with this optimization, generation is still in serial order by pixel.
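The cost of naive sampling can be illustrated with a minimal sketch. The `predict` argument is a hypothetical stand-in for a trained network (not part of the original work) that returns a probability for each pixel intensity given the pixels generated so far; the loop simply makes the one-evaluation-per-pixel structure explicit.

```python
import random


def sample_raster(height, width, predict, rng):
    """Sample an image pixel by pixel in raster order.

    `predict(image, y, x)` is a hypothetical stand-in for the network: it
    returns a list of probabilities over intensities for pixel (y, x), given
    the partially generated image (unfilled pixels are None).
    """
    image = [[None] * width for _ in range(height)]
    evaluations = 0
    for y in range(height):
        for x in range(width):
            probs = predict(image, y, x)  # one network evaluation per pixel
            evaluations += 1
            image[y][x] = rng.choices(range(len(probs)), weights=probs)[0]
    return image, evaluations
```

The number of evaluations equals the number of pixels; caching activations reduces the work done inside each call to `predict` but does not remove the serial loop.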

Ideally we would generate multiple pixels in parallel, which could greatly accelerate sampling. In the autoregressive framework this only works if the pixels are modeled as independent. Thus we need a way to judiciously break weak dependencies among pixels; for example immediately neighboring pixels should not be modeled as independent since they tend to be highly correlated.

Figure 1: Samples from our model at resolutions from to , conditioned on text and bird part locations in the CUB data set. See Fig. 5 and the supplement for more examples.

Multiscale image generation provides one such way to break weak dependencies. In particular, we can model certain groups of pixels as conditionally independent given a lower resolution image and various types of context information, such as preceding frames in a video. The basic idea is obvious, but nontrivial design problems stand between the idea and a workable implementation.

First, what is the right way to transmit global information from a low-resolution image to each generated pixel of the high-resolution image? Second, which pixels can we generate in parallel? And given that choice, how can we avoid border artifacts when merging sets of pixels that were generated in parallel, blind to one another?

In this work we show how a very substantial portion of the spatial dependencies in PixelCNN can be cut, with only modest degradation in performance. Our formulation allows sampling in O(log N) time for N pixels, instead of O(N) as in the original PixelCNN, resulting in orders of magnitude speedup in practice. In the case of video, in which we have access to high-resolution previous frames, we can even sample in O(1) time, with much better performance than comparably-fast baselines.

At a high level, the proposed approach can be viewed as a way to merge per-pixel factors in equation 1. If we merge the factors for, e.g., $x_i$ and $x_j$, then that dependency is “cut”, so the model becomes slightly less expressive. However, we get the benefit of now being able to sample $x_i$ and $x_j$ in parallel. If we divide the N pixels into G groups of T pixels each, the joint distribution can be written as a product of the corresponding G factors:

$$p(x_{1:T}^{(1:G)}) = \prod_{g=1}^{G} p(x_{1:T}^{(g)} \mid x_{1:T}^{(1:g-1)}) \qquad (2)$$

Above we assumed that each of the G groups contains exactly T pixels, but in practice the number can vary. In this work, we form pixel groups from successively higher-resolution views of an image, arranged into a sub-sampling pyramid, such that G ∈ O(log N).

In section 3 we describe this group structure implemented as a deep convolutional network. In section 4 we show that the model excels in density estimation and can produce quality high-resolution samples at high speed.

2 Related work

Deep neural autoregressive models have been applied to image generation for many years, showing promise as a tractable yet expressive density model (Larochelle & Murray, 2011; Uria et al., 2013). Autoregressive LSTMs have been shown to produce state-of-the-art performance in density estimation on large-scale datasets such as ImageNet (Theis & Bethge, 2015; van den Oord et al., 2016a).

Causally-structured convolutional networks such as PixelCNN (van den Oord et al., 2016b) and WaveNet (Oord et al., 2016) improved the speed and scalability of training. These led to improved autoregressive models for video generation (Kalchbrenner et al., 2016b) and machine translation (Kalchbrenner et al., 2016a).

Non-autoregressive convolutional generator networks have been successful and widely adopted for image generation as well. Instead of maximizing likelihood, Generative Adversarial Networks (GANs) train a generator network to fool a discriminator network adversary (Goodfellow et al., 2014). These networks have been used in a wide variety of conditional image generation schemes such as text and spatial structure to image (Mansimov et al., 2015; Reed et al., 2016b, a; Wang & Gupta, 2016).

The addition of multiscale structure has also been shown to be useful in adversarial networks.  Denton et al. (2015) used a Laplacian pyramid to generate images in a coarse-to-fine manner.  Zhang et al. (2016) composed a low-resolution and high-resolution text-conditional GAN, yielding higher quality bird and flower images.

Generator networks can be combined with a trained model, such as an image classifier or captioning network, to generate high-resolution images via optimization and sampling procedures (Nguyen et al., 2016). Wu et al. (2017) state that it is difficult to quantify GAN performance, and propose Monte Carlo methods to approximate the log-likelihood of GANs on MNIST images.

Both autoregressive and non-autoregressive deep networks have recently been applied successfully to image super-resolution. Shi et al. (2016) developed a sub-pixel convolutional network well-suited to this problem. Dahl et al. (2017) use a PixelCNN as a prior for image super-resolution with a convolutional neural network. Johnson et al. (2016) developed a perceptual loss function useful for both style transfer and super-resolution. GAN variants have also been successful in this domain (Ledig et al., 2016; Sønderby et al., 2017).

Several other deep, tractable density models have recently been developed. Real NVP (Dinh et al., 2016) learns a mapping from images to a simple noise distribution, which is by construction trivially invertible. It is built from smaller invertible blocks called coupling layers whose Jacobian is lower-triangular, and also has a multiscale structure. Inverse Autoregressive Flows (Kingma & Salimans, 2016) use autoregressive structures in the latent space to learn more flexible posteriors for variational auto-encoders. Autoregressive models have also been combined with VAEs as decoder models (Gulrajani et al., 2016).

The original PixelRNN paper (van den Oord et al., 2016a) actually included a multiscale autoregressive version, in which PixelRNNs or PixelCNNs were trained at multiple resolutions. The network producing a given resolution image was conditioned on the image at the next lower resolution. This work is similarly motivated by the usefulness of multiscale image structure (and the very long history of coarse-to-fine modeling).

Our novel contributions in this work are (1) asymptotically and empirically faster inference by modeling conditional independence structure, (2) scaling to much higher resolution, (3) evaluating the model on a diverse set of challenging benchmarks including class-, text- and structure-conditional image generation and video generation.

Figure 2: Example pixel grouping and ordering for an image. The upper-left corners form group 1, the upper-right group 2, and so on. For clarity we only use arrows to indicate immediately-neighboring dependencies, but note that all pixels in preceding groups can be used to predict all pixels in a given group. For example, all pixels in one group can be used to predict pixels in any later group, not only the next one. In our image experiments pixels in group 1 originate from a lower-resolution image. For video, they are generated given the previous frames.
Figure 3: A simple form of causal upscaling network, mapping from a K × K image to K × 2K. The same procedure can be applied in the vertical direction to produce a 2K × 2K image. In reference to figure 2, the leftmost images could be considered “group 1” pixels; i.e. the upper-left corners. The network shown here produces “group 2” pixels; i.e. the upper-right corners, completing the top-corners half of the image. (A) In the simplest version, a deep convolutional network (in our case ResNet) directly produces the right image from the left image, and merges column-wise. (B) A more sophisticated version extracts features from a convolutional net, splits the feature map into spatially contiguous blocks, and feeds these in parallel through a shallow PixelCNN. The result is then merged as in (A).

3 Model

The main design principle that we follow in building the model is a coarse-to-fine ordering of pixels. Successively higher-resolution frames are generated conditioned on the previous resolution (see, for example, Figure 1). Pixels are grouped so as to exploit spatial locality at each resolution, which we describe in detail below.

The training objective is to maximize the log-likelihood log P(x; θ), where θ denotes the model parameters. Since the joint distribution factorizes over pixel groups and scales, the training can be trivially parallelized.

3.1 Network architecture

Figure 2 shows how we divide an image into disjoint groups of pixels, with autoregressive structure among the groups. The key property to notice is that no two adjacent pixels of the high-resolution image are in the same group. Also, pixels can depend on other pixels below and to the right, which would have been inaccessible in the standard PixelCNN. Each group of pixels corresponds to a factor in the joint distribution of equation 2.

Concretely, to create groups we tile the image with 2 × 2 blocks. The corners of these blocks form the four pixel groups at a given scale; i.e. upper-left, upper-right, lower-left, lower-right. Note that some pairs of pixels both within each block and also across blocks can still be dependent. These additional dependencies are important for capturing local textures and avoiding border artifacts.
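The grouping can be made concrete with a short sketch. It assumes an image stored as a list of rows with even height and width, and it assumes the groups are numbered 1 to 4 in the order the corners are listed above (upper-left, upper-right, lower-left, lower-right); the function names are illustrative only.

```python
# Offsets (row, column) of each group inside a 2 x 2 block. The numbering of
# groups 3 and 4 follows the listing order in the text and is an assumption.
GROUP_OFFSETS = {1: (0, 0), 2: (0, 1), 3: (1, 0), 4: (1, 1)}


def split_into_groups(image):
    """Split a 2D image (list of rows) into its four corner groups."""
    height, width = len(image), len(image[0])
    if height % 2 or width % 2:
        raise ValueError("height and width must be even")
    return {
        group: [row[dx::2] for row in image[dy::2]]
        for group, (dy, dx) in GROUP_OFFSETS.items()
    }


def group_of(y, x):
    """Return the group index (1 to 4) of the pixel at row y, column x."""
    return 1 + 2 * (y % 2) + (x % 2)
```

Because the group index changes whenever the row or column parity changes, any two horizontally, vertically or diagonally adjacent pixels fall into different groups, which is the property noted above.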

Figure 3 shows an instantiation of one of these factors as a neural network. Similar to the case of PixelCNN, at training time losses and gradients for all of the pixels within a group can be computed in parallel. At test time, inference proceeds sequentially over pixel groups, in parallel within each group. Also as in PixelCNN, we model the color channel dependencies - i.e. green sees red, blue sees red and green - using channel masking.
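The test-time procedure (sequential over groups, parallel within a group) can be sketched as follows. The `predict_group` callable is a hypothetical stand-in for an upscaling network: given the partially filled image and a group index, it returns values for all pixels of that group in one call. The group numbering and the square input are assumptions of this sketch.

```python
# Offsets (row, column) of each group inside a 2 x 2 block (assumed numbering).
OFFSETS = {1: (0, 0), 2: (0, 1), 3: (1, 0), 4: (1, 1)}


def upscale_once(low_res, predict_group):
    """Double the resolution of a square image, one pixel group at a time.

    `predict_group(image, group)` is hypothetical: it returns a K x K grid of
    values for every pixel of `group`, conditioned on the pixels filled so far.
    """
    k = len(low_res)
    image = [[None] * (2 * k) for _ in range(2 * k)]
    # Group 1 comes directly from the lower-resolution image.
    for y in range(k):
        for x in range(k):
            image[2 * y][2 * x] = low_res[y][x]
    calls = 0
    for group in (2, 3, 4):  # sequential over groups
        dy, dx = OFFSETS[group]
        values = predict_group(image, group)  # all pixels of the group at once
        calls += 1
        for y in range(k):
            for x in range(k):
                image[2 * y + dy][2 * x + dx] = values[y][x]
    return image, calls
```

Each doubling of resolution costs three calls regardless of K, in contrast to the per-pixel loop of a standard PixelCNN.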

In the case of type-A upscaling networks (see Figure 3A), sampling each pixel group thus requires 3 network evaluations, one per color channel. (One could also use a discretized mixture of logistics as output instead of a softmax, as in Salimans et al. (2017), in which case only one network evaluation is needed.) In the case of type-B upscaling, the spatial feature map for predicting a group of pixels is divided into contiguous patches for input to a shallow PixelCNN (see Figure 3B). This entails very small network evaluations, for each color channel. We used , and the shallow PixelCNN weights are shared across patches.

The division into non-overlapping patches may appear to risk border artifacts when merging. However, this does not occur for several reasons. First, each predicted pixel is directly adjacent to several context pixels fed into the upscaling network. Second, the generated patches are not directly adjacent in the output image; there is always a row or column of pixels on the border of any pair.

Note that the only learnable portions of the upscaling module are (1) the ResNet encoder of context pixels, and (2) the shallow PixelCNN weights in the case of type-B upscaling. The “merge” and “split” operations shown in figure 3 only marshal data and are not associated with parameters.
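As an illustration of why these operations carry no parameters, the following sketch implements a column-wise merge and a split into contiguous patches on plain nested lists. The function names and the patch-size argument `k` are illustrative assumptions, not the authors' code.

```python
def merge_columns(left, right):
    """Interleave the columns of two equally sized grids.

    With `left` holding upper-left corners and `right` holding upper-right
    corners, the result is twice as wide and alternates between them.
    """
    merged = []
    for left_row, right_row in zip(left, right):
        row = []
        for a, b in zip(left_row, right_row):
            row.extend([a, b])
        merged.append(row)
    return merged


def split_into_patches(feature_map, k):
    """Cut a 2D grid into non-overlapping, spatially contiguous k x k patches."""
    height, width = len(feature_map), len(feature_map[0])
    if height % k or width % k:
        raise ValueError("grid size must be a multiple of k")
    return [
        [row[x:x + k] for row in feature_map[y:y + k]]
        for y in range(0, height, k)
        for x in range(0, width, k)
    ]
```

Both functions only rearrange values, so all learning happens in the networks that sit between them.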

Given the first group of pixels, the rest of the groups at a given scale can be generated autoregressively. The first group of pixels can be modeled using the same approach as detailed above, recursively, down to a base resolution at which we use a standard PixelCNN. At each scale, the number of evaluations is O(1), and the resolution doubles after each upscaling, so the overall complexity is O(log N) to produce images with N pixels.
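A rough count of sequential sampling steps makes the scaling argument tangible. The sketch below assumes square images, a base resolution that divides the target by a power of two, one step per pixel for the base PixelCNN, and three group-sampling steps per doubling; the sizes used in the usage note are hypothetical and chosen only for illustration.

```python
def multiscale_steps(target_size, base_size, groups_per_scale=3):
    """Count sequential sampling steps for coarse-to-fine generation."""
    if target_size % base_size:
        raise ValueError("target_size must be a multiple of base_size")
    ratio = target_size // base_size
    if ratio & (ratio - 1):
        raise ValueError("target_size / base_size must be a power of two")
    num_upscalings = ratio.bit_length() - 1  # log2 of the ratio
    base_steps = base_size * base_size  # standard PixelCNN at the base scale
    return base_steps + groups_per_scale * num_upscalings


def pixelcnn_steps(target_size):
    """Count sequential steps for a standard PixelCNN: one per pixel."""
    return target_size * target_size
```

For a hypothetical 4 × 4 base and a 256 × 256 target, `multiscale_steps(256, 4)` gives 16 + 3 · 6 = 34 steps, against `pixelcnn_steps(256)` = 65536. The count ignores per-channel evaluations and the differing cost of each step, so it indicates the trend rather than wall-clock time.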

Figure 4: Text-to-image bird synthesis. The leftmost column shows the entire sampling process starting by generating images, followed by six upscaling steps, to produce a image. The right column shows the final sampled images for several other queries. For each query the associated part keypoints and caption are shown to the left of the samples.
Figure 5: Text-to-image human synthesis. The leftmost column again shows the sampling process, and the right column shows the final frame for several more examples. We find that the samples are diverse and usually match the color and position constraints.

3.2 Conditional image modeling

Given some context information c, such as a text description, a segmentation, or previous video frames, we maximize the conditional log-likelihood log P(x | c). Each factor in equation 2 simply adds c as an additional conditioning variable. The upscaling neural network corresponding to each factor takes c as an additional input.

For encoding text we used a character-CNN-GRU as in Reed et al. (2016a). For spatially structured data such as segmentation masks we used a standard convolutional network. For encoding previous frames in a video we used a ConvLSTM as in Kalchbrenner et al. (2016b).

4 Experiments

4.1 Datasets

We evaluate our model on ImageNet, Caltech-UCSD Birds (CUB), the MPII Human Pose dataset (MPII), the Microsoft Common Objects in Context dataset (MS-COCO), and the Google Robot Pushing dataset.

  • For ImageNet (Deng et al., 2009), we trained a class-conditional model using the 1000 leaf node classes.

  • CUB (Wah et al., 2011) contains images across bird species, with captions per image. As conditioning information we used a spatial encoding of the 15 annotated bird part locations.

  • MPII (Andriluka et al., 2014) has around images of human activities, with captions per image. We kept only the images depicting a single person, and cropped the image centered around the person, leaving us about images. We used a encoding of the 17 annotated human part locations.

  • MS-COCO (Lin et al., 2014) has training images with captions per image. As conditioning we used the -class segmentation scaled to .

  • Robot Pushing (Finn et al., 2016) contains sequences of frames of size showing a robotic arm pushing objects in a basket. There are training sequences and a validation set with the same objects but different arm trajectories. One test set involves a subset of the objects seen during training and another involving novel objects, both captured on an arm and camera viewpoint not seen during training.

All models for ImageNet, CUB, MPII and MS-COCO were trained using RMSprop with hyperparameter , with batch size for steps. The learning rate was set initially to and decayed to .

For all of the samples we show, the queries are drawn from the validation split of the corresponding data set. That is, the captions, key points, segmentation masks, and low-resolution images for super-resolution have not been seen by the model during training.

When we evaluate negative log-likelihood, we only quantize pixel values to at the target resolution, not separately at each scale. The lower resolution images are then created by sub-sampling this quantized image.
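The construction of lower-resolution views from a single quantized image can be sketched as below. The sketch assumes the image is already quantized to integers, is square with a power-of-two side, and that sub-sampling keeps the upper-left pixel of every 2 × 2 block (consistent with group 1 being the upper-left corners); these are assumptions of the illustration.

```python
def subsample(image):
    """Keep the upper-left pixel of every 2 x 2 block, halving each side."""
    if len(image) % 2 or len(image[0]) % 2:
        raise ValueError("height and width must be even")
    return [row[0::2] for row in image[0::2]]


def build_pyramid(image, base_size):
    """Return views from the target resolution down to base_size, finest first.

    No re-quantization happens at the coarser levels: every coarse pixel is
    copied from the already-quantized target image.
    """
    levels = [image]
    while len(levels[-1]) > base_size:
        levels.append(subsample(levels[-1]))
    return levels
```

Since each level is a subset of the pixels of the level above it, the likelihood of the target image decomposes over levels without counting any pixel twice.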

4.2 Text and location-conditional generation

Figure 6: Text and segmentation-to-image synthesis. The left column shows the full sampling trajectory from to . The caption queries are shown beneath the samples. Beneath each image we show the image masked with the largest object in each scene; i.e. only the foreground pixels in the sample are shown. More samples with all categories masked are included in the supplement.

In this section we show results for CUB, MPII and MS-COCO. For each dataset we trained type-B upscaling networks with 12 ResNet layers and 4 PixelCNN layers, with 128 hidden units per layer. The base resolution at which we train a standard PixelCNN was set to .

To encode the captions we padded to characters, then fed into a character-level CNN with three convolutional layers, followed by a GRU and average pooling over time. Upscaling networks to , and shared a single text encoder. For higher-resolution upscaling networks we trained separate text encoders. In principle all upscalers could share an encoder, but we trained separately to save memory and time.

For CUB and MPII, we have body part keypoints for birds and humans, respectively. We encode these into a binary feature map with P channels, where P is the number of parts: P = 17 for MPII and P = 15 for CUB. A 1 indicates the part is visible, and a 0 indicates the part is not visible. For MS-COCO, we resize the class segmentation mask to .
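One plausible reading of this keypoint encoding is sketched below: a grid with one binary channel per part, set at the part's location when the part is visible. The grid size, the coordinate convention and the function name are assumptions made for illustration; the exact layout used in the experiments is not recoverable from the text.

```python
def encode_keypoints(keypoints, num_parts, grid_size):
    """Encode part locations as a grid_size x grid_size x num_parts binary map.

    `keypoints` maps a part index to a (row, col) grid location, or to None
    when the part is not visible. The layout is an illustrative assumption.
    """
    feature_map = [
        [[0] * num_parts for _ in range(grid_size)] for _ in range(grid_size)
    ]
    for part, location in keypoints.items():
        if location is None:
            continue  # invisible parts leave their channel all zeros
        row, col = location
        feature_map[row][col][part] = 1
    return feature_map
```

For CUB, with its 15 annotated parts, a call might look like `encode_keypoints({0: (2, 3), 1: None}, num_parts=15, grid_size=8)`, where the grid size of 8 is hypothetical.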

For all datasets, we then encode these spatial features using a -layer ResNet. These features are then depth-concatenated with the text encoding and resized with bilinear interpolation to the spatial size of the image. If the target resolution for an upscaler network is higher than , these conditioning features are randomly cropped along with the target image to a patch. Because the network is fully convolutional, the network can still generate the full resolution at test time, but we can massively save on memory and computation during training.
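The joint cropping step can be sketched as follows, assuming the conditioning features have already been resized to the spatial size of the target image. The key point is that one crop window is drawn and applied to both grids, so the features stay aligned with the pixels they describe. The function name and the crop size are illustrative.

```python
import random


def random_joint_crop(image, features, crop_size, rng=random):
    """Crop the same random window from the image and its conditioning features."""
    height, width = len(image), len(image[0])
    if len(features) != height or len(features[0]) != width:
        raise ValueError("features must match the spatial size of the image")
    if crop_size > height or crop_size > width:
        raise ValueError("crop_size exceeds the image size")
    top = rng.randrange(height - crop_size + 1)
    left = rng.randrange(width - crop_size + 1)

    def crop(grid):
        return [row[left:left + crop_size] for row in grid[top:top + crop_size]]

    return crop(image), crop(features)
```

At test time no cropping is applied; a fully convolutional network can consume the full-size image and features directly.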

Figure 4 shows examples of text- and keypoint-to-bird image synthesis. Figure 5 shows examples of text- and keypoint-to-human image synthesis. Figure 6 shows examples of text- and segmentation-to-image synthesis.

Dataset   Model                 Train   Val    Test
CUB       PixelCNN              2.91    2.93   2.92
CUB       Multiscale PixelCNN   2.98    2.99   2.98
MPII      PixelCNN              2.90    2.92   2.92
MPII      Multiscale PixelCNN   2.91    3.03   3.03
MS-COCO   PixelCNN              3.07    3.08   -
MS-COCO   Multiscale PixelCNN   3.14    3.16   -
Table 1: Text- and structure-to-image negative conditional log-likelihood in nats per sub-pixel.
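Table 1 is reported in nats per sub-pixel, whereas the ImageNet results in Table 3 are in bits per sub-pixel. The two units differ by a constant factor of ln 2, as the following small helper (an illustrative addition, not part of the original evaluation code) shows.

```python
import math


def nats_to_bits(nats):
    """Convert a negative log-likelihood from nats to bits."""
    return nats / math.log(2)


def bits_to_nats(bits):
    """Convert a negative log-likelihood from bits to nats."""
    return bits * math.log(2)
```

For example, the Multiscale PixelCNN validation score of 2.99 nats per sub-pixel on CUB corresponds to roughly 4.31 bits per sub-pixel. Scores on different datasets are still not directly comparable after conversion.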

Quantitatively, the Multiscale PixelCNN results are not far from those obtained using the original PixelCNN (Reed et al., 2016c), as shown in Table 1. In addition, we increased the sample resolution by . Qualitatively, the sample quality appears to be on par, but with much greater realism due to the higher resolution.

Figure 7: Upscaling low-resolution images to and . In each group of images, the left column is made of real images, and the right columns of samples from the model.
Figure 8: Class-conditional samples from a model trained on ImageNet.

4.3 Action-conditional video generation

In this section we present results on Robot Pushing videos. All models were trained to perform future frame prediction conditioned on starting frames and also on the robot arm actions and state, which are each -dimensional vectors.

We trained two versions of the model, both versions using type-A upscaling networks (see Fig. 3). The first is designed to sample in O(T) time, for T video frames. That is, the number of network evaluations per frame is constant with respect to the number of pixels.

The motivation for training the O(T) model is that previous frames in a video provide very detailed cues for predicting the next frame, so that our pixel groups could be conditionally independent even without access to a low-resolution image. Without the need to upscale from a low-resolution image, we can produce “group 1” pixels - i.e. the upper-left corner group - directly by conditioning on previous frames. Then a constant number of network evaluations are needed to sample the next three pixel groups at the final scale.

The second version is our multi-step upscaler used in previous experiments, conditioned on both previous frames and robot arm state and actions. The complexity of sampling from this model is O(T log N), because at every time step the upscaling procedure must be run, taking O(log N) time.

The models were trained for steps with batch size , using the RMSprop optimizer with centering and . The learning rate was initialized to and decayed by factor after steps and after steps. For the O(T) model we used a mixture of discretized logistic outputs (Salimans et al., 2017) and for the O(T log N) model we used a softmax output.

Table 2 compares two variants of our model with the original VPN. Compared to the O(T) baseline (a convolutional LSTM model without spatial dependencies), our O(T) model performs dramatically better. On the validation set, in which the model needs to generalize to novel combinations of objects and arm trajectories, the O(T log N) model does much better than our O(T) model, although not as well as the original O(TN) model.

On the testing sets, we observed that the O(T) model performed as well as on the validation set, but the O(T log N) model showed a drop in performance. However, this drop does not occur due to the presence of novel objects (in fact this setting actually yields better results), but due to the novel arm and camera configuration used during testing (from communication with the Robot Pushing dataset author). It appears that the O(T log N) model may have overfit to the background details and camera position of the training arms, but not necessarily to the actual arm and object motions. It should be possible to overcome this effect with better regularization and perhaps data augmentation such as mirroring and jittering frames, or simply training on data with more diverse camera positions.

The supplement contains example videos generated on the validation set arm trajectories from our model. We also trained and upscalers conditioned on low-resolution and a previous high-resolution frame, so that we can produce videos.

4.4 Class-conditional generation

To compare against other image density models, we trained our Multiscale PixelCNN on ImageNet. We used type-B upscaling networks (see figure 3) with 12 ResNet (He et al., 2016) layers and 4 PixelCNN layers, with 256 hidden units per layer. For all PixelCNNs in the model, we used the same architecture as in van den Oord et al. (2016b). We generated images with a base resolution of and trained four upscaling networks to produce up to samples. At scales and above, during training we randomly cropped the image to . This accelerates training but does not pose a problem at test time because all of the networks are fully convolutional.

Model            Tr     Val    Ts-seen   Ts-novel
O(T) baseline    -      2.06   2.08      2.07
O(TN) VPN        -      0.62   0.64      0.64
O(T) VPN         1.03   1.04   1.04      1.04
O(T log N) VPN   0.74   0.74   1.06      0.97
Table 2: Robot videos negative log-likelihood in nats per sub-pixel. “Tr” is the training set, “Ts-seen” is the test set with novel arm and camera configuration and previously seen objects, and “Ts-novel” is the same as “Ts-seen” but with novel objects.

Table 3 shows the results. On both 32 × 32 and 64 × 64 ImageNet it achieves significantly better likelihood scores than have been reported for any non-pixel-autoregressive density models, such as ConvDRAW and Real NVP, that also allow efficient sampling.

Of course, performance of these approaches varies considerably depending on the implementation details, especially in the design and capacity of deep neural networks used. But it is notable that the very simple and direct approach developed here can surpass the state-of-the-art among fast-sampling density models.

Model        32 × 32       64 × 64       128 × 128
PixelRNN     3.86 (3.83)   3.64 (3.57)   -
PixelCNN     3.83 (3.77)   3.57 (3.48)   -
Real NVP     4.28 (4.26)   3.98 (3.75)   -
Conv. DRAW   4.40 (4.35)   4.10 (4.04)   -
Ours         3.95 (3.92)   3.70 (3.67)   3.55 (3.42)
Table 3: ImageNet negative log-likelihood in bits per sub-pixel at 32 × 32, 64 × 64 and 128 × 128 resolution.

In Figure 8 we show examples of diverse class conditional image generation.

Interestingly, the model often produced quite realistic bird images from scratch when trained on CUB, and these samples looked more realistic than any animal image generated by our ImageNet models. One plausible explanation for this difference is a lack of model capacity; a single network modeling the very diverse ImageNet categories can devote only very limited capacity to each one, compared to a network that only needs to model birds. This suggests that finding ways to increase capacity without slowing down training or sampling could be a promising direction.

Figure 7 shows upscaling starting from ground-truth images of size , and . We observe the largest diversity of samples in terms of global structure when starting from , but less realistic results due to the more challenging nature of the problem. Upscaling starting from results in much more realistic images. Here the diversity is apparent in the samples (as in the data, conditioned on low-resolution) in the local details such as the dog’s fur patterns or the frog’s eye contours.

4.5 Sampling time comparison

Model scale time speedup
PixelCNN, in-graph
VPN, in-graph
VPN, in-graph
Table 4: Sampling speed of several models in seconds per frame on an Nvidia Quadro M4000 GPU. The top three rows were measured on ImageNet, with batch size of 30. The bottom five rows were measured on generating videos of frames each, averaged over videos.

As expected, we observe a very large speedup of our model compared to sampling from a standard PixelCNN at the same resolution (see Table 4). Even at we observe two orders of magnitude speedup, and the speedup is greater for higher resolution.

Since our model only requires O(log N) network evaluations to sample, we can fit the entire computation graph for sampling into memory, for reasonable batch sizes. In-graph computation in TensorFlow can further improve the speed of both image and video generation, due to reduced overhead by avoiding repeated calls to

Since our model has a PixelCNN at the lowest resolution, it can also be accelerated by caching PixelCNN hidden unit activations, recently implemented by Ramachandran et al. (2017). This could allow one to use higher-resolution base PixelCNNs without sacrificing speed.

5 Conclusions

In this paper, we developed a parallelized, multiscale version of PixelCNN. It achieves competitive density estimation results on CUB, MPII, MS-COCO, ImageNet, and Robot Pushing videos, surpassing all other density models that admit fast sampling. Qualitatively, it can achieve compelling results in text-to-image synthesis and video generation, as well as diverse super-resolution from very small images all the way to .

Many more samples from all of our models can be found in the appendix and supplementary material.


6 Appendix

Below we show additional samples.

Figure 9: Additional CUB samples randomly chosen from the validation set.
Figure 10: Additional MPII samples randomly chosen from the validation set.
Figure 11: Additional MS-COCO samples randomly chosen from the validation set.
Figure 12: Robot pushing videos at , and .
Figure 13: Label-conditional ImageNet samples.
Figure 14: Additional upscaling samples.