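Many autoregressive image models factorize the joint distribution of images into per-pixel factors:
$$p(x_{1:N}) = \prod_{i=1}^{N} p(x_i \mid x_{1:i-1}) \qquad (1)$$
where $x_i$ denotes the $i$-th of the $N$ pixels in the chosen ordering.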
For example, PixelCNN (van den Oord et al., 2016b) uses a deep convolutional network with carefully designed filter masking to preserve causal structure, so that all factors in equation 1 can be learned in parallel for a given image. However, a remaining difficulty is that due to the learned causal structure, inference proceeds sequentially pixel-by-pixel in raster order.
In the naive case, this requires a full network evaluation per pixel. Caching hidden unit activations can be used to reduce the amount of computation per pixel, as in the 1D case for WaveNet (Oord et al., 2016; Ramachandran et al., 2017). However, even with this optimization, generation is still in serial order by pixel.
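To make the per-pixel cost concrete, the following is a minimal sketch of raster-order sampling on a toy binary image. The function `toy_conditional` is a hypothetical stand-in for one full network evaluation and is not PixelCNN; the only point is that the number of sequential model calls equals the number of pixels.

```python
import random


def toy_conditional(context):
    """Hypothetical stand-in for one network evaluation.

    Returns P(pixel = 1 | previously generated pixels) for a binary image.
    """
    if not context:
        return 0.5
    # Favor repeating the most recent pixel, mimicking local correlation.
    return 0.9 if context[-1] == 1 else 0.1


def sample_raster(height, width, rng):
    """Sample pixels one at a time in raster order: one model call per pixel."""
    pixels = []  # flattened image, filled in raster order
    num_evaluations = 0
    for _ in range(height * width):
        p_one = toy_conditional(pixels)
        num_evaluations += 1
        pixels.append(1 if rng.random() < p_one else 0)
    # Reshape the flat list into rows.
    image = [pixels[r * width:(r + 1) * width] for r in range(height)]
    return image, num_evaluations


image, calls = sample_raster(4, 4, random.Random(0))
print(calls)  # 16: one evaluation for each of the 4 x 4 pixels
```

Caching activations lowers the cost of each call, but the loop itself stays sequential in the number of pixels.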
Ideally we would generate multiple pixels in parallel, which could greatly accelerate sampling. In the autoregressive framework this only works if the pixels are modeled as independent. Thus we need a way to judiciously break weak dependencies among pixels; for example immediately neighboring pixels should not be modeled as independent since they tend to be highly correlated.
Multiscale image generation provides one such way to break weak dependencies. In particular, we can model certain groups of pixels as conditionally independent given a lower resolution image and various types of context information, such as preceding frames in a video. The basic idea is obvious, but nontrivial design problems stand between the idea and a workable implementation.
First, what is the right way to transmit global information from a low-resolution image to each generated pixel of the high-resolution image? Second, which pixels can we generate in parallel? And given that choice, how can we avoid border artifacts when merging sets of pixels that were generated in parallel, blind to one another?
In this work we show how a very substantial portion of the spatial dependencies in PixelCNN can be cut, with only modest degradation in performance. Our formulation allows sampling in $O(\log N)$ time for $N$ pixels, instead of $O(N)$ as in the original PixelCNN, resulting in orders of magnitude speedup in practice. In the case of video, in which we have access to high-resolution previous frames, the sampling time per frame can even be made constant with respect to the number of pixels, with much better performance than comparably fast baselines.
At a high level, the proposed approach can be viewed as a way to merge per-pixel factors in equation 1. If we merge the factors for two pixels, e.g. $x_i$ and $x_j$, then the dependency between them is “cut”, so the model becomes slightly less expressive. However, we get the benefit of now being able to sample $x_i$ and $x_j$ in parallel. If we divide the $N$ pixels into $G$ groups of $T$ pixels each, the joint distribution can be written as a product of the corresponding $G$ factors:
$$p(x_{1:G}^{1:T}) = \prod_{g=1}^{G} p(x_g^{1:T} \mid x_{1:g-1}^{1:T}) \qquad (2)$$
where $x_g^{1:T}$ denotes the $T$ pixels of group $g$.
Above we assumed that each of the $G$ groups contains exactly $T$ pixels, but in practice the number can vary. In this work, we form pixel groups from successively higher-resolution views of an image, arranged into a sub-sampling pyramid, such that $G \in O(\log N)$.
2 Related work
Deep neural autoregressive models have been applied to image generation for many years, showing promise as a tractable yet expressive density model (Larochelle & Murray, 2011; Uria et al., 2013). Autoregressive LSTMs have been shown to produce state-of-the-art performance in density estimation on large-scale datasets such as ImageNet (Theis & Bethge, 2015; van den Oord et al., 2016a).
Causally-structured convolutional networks such as PixelCNN (van den Oord et al., 2016b) and WaveNet (Oord et al., 2016) improved the speed and scalability of training. These led to improved autoregressive models for video generation (Kalchbrenner et al., 2016b) and machine translation (Kalchbrenner et al., 2016a).
Non-autoregressive convolutional generator networks have been successful and widely adopted for image generation as well. Instead of maximizing likelihood, Generative Adversarial Networks (GANs) train a generator network to fool a discriminator network adversary (Goodfellow et al., 2014). These networks have been used in a wide variety of conditional image generation schemes such as text and spatial structure to image (Mansimov et al., 2015; Reed et al., 2016b, a; Wang & Gupta, 2016).
The addition of multiscale structure has also been shown to be useful in adversarial networks. Denton et al. (2015) used a Laplacian pyramid to generate images in a coarse-to-fine manner. Zhang et al. (2016) composed a low-resolution and high-resolution text-conditional GAN, yielding higher quality bird and flower images.
Generator networks can be combined with a trained model, such as an image classifier or captioning network, to generate high-resolution images via optimization and sampling procedures (Nguyen et al., 2016). Wu et al. (2017) state that it is difficult to quantify GAN performance, and propose Monte Carlo methods to approximate the log-likelihood of GANs on MNIST images.
Both autoregressive and non-autoregressive deep networks have recently been applied successfully to image super-resolution. Shi et al. (2016) developed a sub-pixel convolutional network well-suited to this problem. Dahl et al. (2017) use a PixelCNN as a prior for image super-resolution with a convolutional neural network. Johnson et al. (2016) developed a perceptual loss function useful for both style transfer and super-resolution. GAN variants have also been successful in this domain (Ledig et al., 2016; Sønderby et al., 2017).
Several other deep, tractable density models have recently been developed. Real NVP (Dinh et al., 2016) learns a mapping from images to a simple noise distribution, which is by construction trivially invertible. It is built from smaller invertible blocks called coupling layers whose Jacobian is lower-triangular, and also has a multiscale structure. Inverse Autoregressive Flows (Kingma & Salimans, 2016) use autoregressive structures in the latent space to learn more flexible posteriors for variational auto-encoders. Autoregressive models have also been combined with VAEs as decoder models (Gulrajani et al., 2016).
The original PixelRNN paper (van den Oord et al., 2016a) actually included a multiscale autoregressive version, in which PixelRNNs or PixelCNNs were trained at multiple resolutions. The network producing a given resolution image was conditioned on the image at the next lower resolution. This work is similarly motivated by the usefulness of multiscale image structure (and the very long history of coarse-to-fine modeling).
Our novel contributions in this work are (1) asymptotically and empirically faster inference by modeling conditional independence structure, (2) scaling to much higher resolution, and (3) evaluating the model on a diverse set of challenging benchmarks including class-, text- and structure-conditional image generation and video generation.
The main design principle that we follow in building the model is a coarse-to-fine ordering of pixels. Successively higher-resolution frames are generated conditioned on the previous resolution (see, for example, Figure 1). Pixels are grouped so as to exploit spatial locality at each resolution, which we describe in detail below.
The training objective is to maximize the log-likelihood $\log p(x; \theta)$ of the training images under the model parameters $\theta$. Since the joint distribution factorizes over pixel groups and scales, the training can be trivially parallelized.
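As a worked form of this objective, and assuming the grouping of equation 2 with $x_g$ used here as shorthand for all pixels in group $g$ (a notation introduced only for this illustration), the log-likelihood splits into one term per group. During training every term is conditioned on ground-truth pixels, so the terms can be evaluated independently of one another:

```latex
\log p(x) = \sum_{g=1}^{G} \log p\left(x_g \mid x_{1:g-1}\right)
```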
3.1 Network architecture
Figure 3 shows how we divide an image into disjoint groups of pixels, with autoregressive structure among the groups. The key property to notice is that no two adjacent pixels of the high-resolution image are in the same group. Also, pixels can depend on other pixels below and to the right, which would have been inaccessible in the standard PixelCNN. Each group of pixels corresponds to a factor in the joint distribution of equation 2.
Concretely, to create groups we tile the image with $2 \times 2$ blocks. The corners of these blocks form the four pixel groups at a given scale; i.e. upper-left, upper-right, lower-left, lower-right. Note that some pairs of pixels both within each block and also across blocks can still be dependent. These additional dependencies are important for capturing local textures and avoiding border artifacts.
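The grouping can be sketched with strided slicing. This is an illustrative sketch rather than the authors' code: it assumes $2 \times 2$ tiles (as implied by the four corner groups), a single-channel image stored as a list of rows with even height and width, and the hypothetical helper names `split_into_groups` and `merge_groups`.

```python
def split_into_groups(image):
    """Split a 2D image (list of rows) into the four corner groups of 2x2 tiles."""
    return {
        "upper_left": [row[0::2] for row in image[0::2]],
        "upper_right": [row[1::2] for row in image[0::2]],
        "lower_left": [row[0::2] for row in image[1::2]],
        "lower_right": [row[1::2] for row in image[1::2]],
    }


def merge_groups(groups):
    """Inverse of split_into_groups: interleave the four groups into one image."""
    half_h = len(groups["upper_left"])
    half_w = len(groups["upper_left"][0])
    image = [[None] * (2 * half_w) for _ in range(2 * half_h)]
    offsets = {
        "upper_left": (0, 0),
        "upper_right": (0, 1),
        "lower_left": (1, 0),
        "lower_right": (1, 1),
    }
    for name, (dr, dc) in offsets.items():
        for r, row in enumerate(groups[name]):
            for c, value in enumerate(row):
                image[2 * r + dr][2 * c + dc] = value
    return image


image = [[4 * r + c for c in range(4)] for r in range(4)]
groups = split_into_groups(image)
print(groups["upper_left"])            # [[0, 2], [8, 10]]
print(merge_groups(groups) == image)   # True
```

Within any one group, neighboring entries are two pixels apart in the full image, so no two adjacent high-resolution pixels share a group.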
Figure 3 shows an instantiation of one of these factors as a neural network. Similar to the case of PixelCNN, at training time losses and gradients for all of the pixels within a group can be computed in parallel. At test time, inference proceeds sequentially over pixel groups, in parallel within each group. Also as in PixelCNN, we model the color channel dependencies - i.e. green sees red, blue sees red and green - using channel masking.
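The test-time procedure described above (sequential over groups, parallel within a group) can be sketched as follows. `toy_group_model` is a hypothetical placeholder for one upscaling-network evaluation; it is not the architecture in Figure 3 and ignores color channels entirely.

```python
import random


def toy_group_model(context_pixels, num_pixels, rng):
    """Hypothetical stand-in for one network evaluation.

    Samples every pixel of a group at once, conditioned only on pixels from
    earlier groups (summarized here by their mean).
    """
    if context_pixels:
        mean = sum(context_pixels) / len(context_pixels)
    else:
        mean = 0.5
    p_one = 0.25 + 0.5 * mean  # keep the probability strictly inside (0, 1)
    return [1 if rng.random() < p_one else 0 for _ in range(num_pixels)]


def sample_by_groups(group_sizes, rng):
    """Sample groups one after another; pixels inside a group share one call."""
    generated = []  # all pixels sampled so far, across groups
    groups = []
    num_evaluations = 0
    for size in group_sizes:
        group = toy_group_model(generated, size, rng)
        num_evaluations += 1
        groups.append(group)
        generated.extend(group)
    return groups, num_evaluations


groups, calls = sample_by_groups([4, 4, 4, 4], random.Random(0))
print(calls)  # 4 evaluations for 16 pixels, one per group
```

The number of sequential steps now scales with the number of groups rather than the number of pixels.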
In the case of type-A upscaling networks (see Figure 3A), sampling each pixel group thus requires three network evaluations, one per color channel. (Footnote 1: One could also use a discretized mixture of logistics as output instead of a softmax, as in Salimans et al. (2017), in which case only one network evaluation is needed.) In the case of type-B upscaling, the spatial feature map for predicting a group of pixels is divided into contiguous patches for input to a shallow PixelCNN (see Figure 3B). This entails very small network evaluations, for each color channel. We used , and the shallow PixelCNN weights are shared across patches.
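The division into contiguous patches for type-B upscaling can be illustrated with a small helper. This is a sketch under assumptions: the feature map is a single-channel list of rows, `split_into_patches` is a hypothetical name, and the patch size of 2 in the example is arbitrary rather than the value used in the experiments.

```python
def split_into_patches(feature_map, patch_size):
    """Cut a 2D map into non-overlapping patch_size x patch_size patches.

    Returns the patches in raster order; each patch is a list of rows.
    """
    height, width = len(feature_map), len(feature_map[0])
    if height % patch_size or width % patch_size:
        raise ValueError("map size must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patches.append(
                [row[left:left + patch_size]
                 for row in feature_map[top:top + patch_size]]
            )
    return patches


feature_map = [[4 * r + c for c in range(4)] for r in range(4)]
patches = split_into_patches(feature_map, patch_size=2)
print(len(patches))   # 4
print(patches[0])     # [[0, 1], [4, 5]]
```

Because the shallow PixelCNN weights are shared, the resulting patches can be stacked along the batch dimension and processed together.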
The division into non-overlapping patches may appear to risk border artifacts when merging. However, this does not occur for several reasons. First, each predicted pixel is directly adjacent to several context pixels fed into the upscaling network. Second, the generated patches are not directly adjacent in the output image; there is always a row or column of pixels on the border of any pair.
Note that the only learnable portions of the upscaling module are (1) the ResNet encoder of context pixels, and (2) the shallow PixelCNN weights in the case of type-B upscaling. The “merge” and “split” operations shown in figure 3 only marshal data and are not associated with parameters.
Given the first group of pixels, the rest of the groups at a given scale can be generated autoregressively. The first group of pixels can be modeled using the same approach as detailed above, recursively, down to a base resolution at which we use a standard PixelCNN. At each scale, the number of evaluations is $O(1)$, and the resolution doubles after each upscaling, so the overall complexity is $O(\log N)$ to produce images with $N$ pixels.
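A back-of-the-envelope count makes the scaling argument concrete. The sketch below rests on simplifying assumptions that are not taken from the paper's implementation: the base image is sampled with one evaluation per pixel, each upscaling doubles the side length and adds three new pixel groups, and each group costs `evals_per_group` evaluations (a hypothetical parameter).

```python
def count_evaluations(target_size, base_size, evals_per_group=1):
    """Count sequential evaluations to sample a target_size x target_size image."""
    ratio, remainder = divmod(target_size, base_size)
    if remainder or ratio < 1 or ratio & (ratio - 1):
        raise ValueError("target_size must be base_size times a power of two")
    num_upscalings = ratio.bit_length() - 1       # log2(ratio), exactly
    base_evals = base_size ** 2                   # standard PixelCNN at the base
    upscaling_evals = 3 * evals_per_group * num_upscalings
    return base_evals + upscaling_evals


for size in (8, 64, 512):
    print(size, count_evaluations(size, base_size=4), size * size)
# count_evaluations(512, 4) is 37, versus 262144 pixel-by-pixel steps
```

With the base size held fixed, the count grows by a constant each time the resolution doubles, which is the logarithmic behavior described above.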
3.2 Conditional image modeling
Given some context information $c$, such as a text description, a segmentation, or previous video frames, we maximize the conditional likelihood $p(x \mid c)$. Each factor in equation 2 simply adds $c$ as an additional conditioning variable. The upscaling neural network corresponding to each factor takes $c$ as an additional input.
We evaluate our model on ImageNet, Caltech-UCSD Birds (CUB), the MPII Human Pose dataset (MPII), the Microsoft Common Objects in Context dataset (MS-COCO), and the Google Robot Pushing dataset.
For ImageNet (Deng et al., 2009), we trained a class-conditional model using the 1000 leaf node classes.
CUB (Wah et al., 2011) contains images across bird species, with captions per image. As conditioning information we used a spatial encoding of the 15 annotated bird part locations.
MPII (Andriluka et al., 2014) has around images of human activities, with captions per image. We kept only the images depicting a single person, and cropped the image centered around the person, leaving us about images. We used a encoding of the 17 annotated human part locations.
MS-COCO (Lin et al., 2014) has training images with captions per image. As conditioning we used the -class segmentation scaled to .
Robot Pushing (Finn et al., 2016) contains sequences of frames of size showing a robotic arm pushing objects in a basket. There are training sequences and a validation set with the same objects but different arm trajectories. One test set involves a subset of the objects seen during training and another involves novel objects, both captured on an arm and camera viewpoint not seen during training.
For all of the samples we show, the queries are drawn from the validation split of the corresponding data set. That is, the captions, key points, segmentation masks, and low-resolution images for super-resolution have not been seen by the model during training.
When we evaluate negative log-likelihood, we only quantize pixel values to at the target resolution, not separately at each scale. The lower resolution images are then created by sub-sampling this quantized image.
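The order of operations matters here: quantize once at the target resolution, then sub-sample the quantized image. A minimal sketch follows, assuming float pixels in [0, 1], a hypothetical `levels` parameter for the number of quantization bins, and sub-sampling that keeps the upper-left pixel of each $2 \times 2$ block.

```python
def quantize(image, levels=256):
    """Map float pixels in [0, 1] to integers in [0, levels - 1]."""
    return [[min(levels - 1, int(v * levels)) for v in row] for row in image]


def subsample_pyramid(image, base_size):
    """Build lower-resolution views by repeatedly keeping every second pixel."""
    pyramid = [image]
    while len(pyramid[-1]) > base_size:
        prev = pyramid[-1]
        pyramid.append([row[0::2] for row in prev[0::2]])
    return pyramid[::-1]  # coarsest view first


full = quantize([[(4 * r + c) / 16 for c in range(4)] for r in range(4)])
pyramid = subsample_pyramid(full, base_size=1)
print([len(level) for level in pyramid])  # [1, 2, 4]
print(pyramid[1])                         # [[0, 32], [128, 160]]
```

Every low-resolution pixel is then an exact copy of a quantized high-resolution pixel, so no scale introduces its own rounding.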
4.2 Text and location-conditional generation
In this section we show results for CUB, MPII and MS-COCO. For each dataset we trained type-B upscaling networks with 12 ResNet layers and 4 PixelCNN layers, with 128 hidden units per layer. The base resolution at which we train a standard PixelCNN was set to .
To encode the captions we padded them to a fixed number of characters, then fed them into a character-level CNN with three convolutional layers, followed by a GRU and average pooling over time. Upscaling networks to , and shared a single text encoder. For higher-resolution upscaling networks we trained separate text encoders. In principle all upscalers could share an encoder, but we trained them separately to save memory and time.
For CUB and MPII, we have body part keypoints for birds and humans, respectively. We encode these into a binary spatial feature map with $P$ channels, where $P$ is the number of parts; $P = 17$ for MPII and $P = 15$ for CUB. A 1 indicates the part is visible, and a 0 indicates the part is not visible. For MS-COCO, we resize the class segmentation mask to .
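One way to build such a binary keypoint map is sketched below. The channels-first layout, the `map_size` argument, and the convention that an invisible part is passed as `None` are all assumptions made for illustration; the spatial size in the example is arbitrary.

```python
def encode_keypoints(keypoints, num_parts, map_size):
    """Build a num_parts x map_size x map_size binary map of part locations.

    keypoints maps a part index to (row, col) in map coordinates, or to None
    when the part is not visible.
    """
    feature_map = [[[0] * map_size for _ in range(map_size)]
                   for _ in range(num_parts)]
    for part, location in keypoints.items():
        if location is None:
            continue  # invisible part: its channel stays all zeros
        row, col = location
        feature_map[part][row][col] = 1
    return feature_map


# Two annotated parts out of 15 (the CUB part count); part 3 is not visible.
fmap = encode_keypoints({0: (1, 2), 3: None}, num_parts=15, map_size=8)
print(fmap[0][1][2])                    # 1
print(sum(sum(row) for row in fmap[3])) # 0
```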
For all datasets, we then encode these spatial features using a -layer ResNet. These features are then depth-concatenated with the text encoding and resized with bilinear interpolation to the spatial size of the image. If the target resolution for an upscaler network is higher than, these conditioning features are randomly cropped along with the target image to a patch. Because the network is fully convolutional, the network can still generate the full resolution at test time, but we can massively save on memory and computation during training.
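The joint cropping step can be sketched as below: the same window is cut from the target image and from the conditioning features, so the two stay spatially aligned. The function name `random_joint_crop` and the single-channel list-of-rows representation are assumptions for illustration.

```python
import random


def random_joint_crop(image, features, crop_size, rng):
    """Cut the same crop_size x crop_size window from image and features."""
    height, width = len(image), len(image[0])
    # randint is inclusive at both ends, so the window always fits.
    top = rng.randint(0, height - crop_size)
    left = rng.randint(0, width - crop_size)

    def crop(grid):
        return [row[left:left + crop_size]
                for row in grid[top:top + crop_size]]

    return crop(image), crop(features)


image = [[8 * r + c for c in range(8)] for r in range(8)]
features = [[-(8 * r + c) for c in range(8)] for r in range(8)]
img_patch, feat_patch = random_joint_crop(image, features, 4, random.Random(0))
print(len(img_patch), len(img_patch[0]))  # 4 4
```

Since each feature value here is the negated pixel index, checking that the two patches are negatives of each other confirms they came from the same window.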
Figure 5 shows examples of text- and keypoint-to-bird image synthesis. Figure 5 shows examples of text- and keypoint-to-human image synthesis. Figure 6 shows examples of text- and segmentation-to-image synthesis.
Quantitatively, the Multiscale PixelCNN results are not far from those obtained using the original PixelCNN (Reed et al., 2016c), as shown in Table 1. In addition, we increased the sample resolution by . Qualitatively, the sample quality appears to be on par, but with much greater realism due to the higher resolution.
4.3 Action-conditional video generation
In this section we present results on Robot Pushing videos. All models were trained to perform future frame prediction conditioned on starting frames and also on the robot arm actions and state, which are each
We trained two versions of the model, both versions using type-A upscaling networks (see Fig. 3). The first is designed to sample in $O(T)$ time, for $T$ video frames. That is, the number of network evaluations per frame is constant with respect to the number of pixels.
The motivation for training the $O(T)$ model is that previous frames in a video provide very detailed cues for predicting the next frame, so that our pixel groups could be conditionally independent even without access to a low-resolution image. Without the need to upscale from a low-resolution image, we can produce “group 1” pixels - i.e. the upper-left corner group - directly by conditioning on previous frames. Then a constant number of network evaluations is needed to sample the next three pixel groups at the final scale.
The second version is our multi-step upscaler used in previous experiments, conditioned on both previous frames and robot arm state and actions. The complexity of sampling from this model is $O(T \log N)$, because at every time step the upscaling procedure must be run, taking $O(\log N)$ time.
The models were trained for steps with batch size , using the RMSprop optimizer with centering and . The learning rate was initialized to and decayed by factor after steps and after steps. For the model we used a mixture of discretized logistic outputs (Salimans et al., 2017) and for the model we used a softmax output.
Table 2 compares two variants of our model with the original VPN. Compared to the baseline - a convolutional LSTM model without spatial dependencies - our model performs dramatically better. On the validation set, in which the model needs to generalize to novel combinations of objects and arm trajectories, the $O(T \log N)$ model does much better than our $O(T)$ model, although not as well as the original VPN.
On the testing sets, we observed that the $O(T)$ model performed as well as on the validation set, but the $O(T \log N)$ model showed a drop in performance. However, this drop does not occur due to the presence of novel objects (in fact this setting actually yields better results), but due to the novel arm and camera configuration used during testing (footnote 2: from communication with the Robot Pushing dataset author). It appears that the $O(T \log N)$ model may have overfit to the background details and camera position of the training arms, but not necessarily to the actual arm and object motions. It should be possible to overcome this effect with better regularization and perhaps data augmentation such as mirroring and jittering frames, or simply training on data with more diverse camera positions.
The supplement contains example videos generated on the validation set arm trajectories from our model. We also trained and upscalers conditioned on low-resolution and a previous high-resolution frame, so that we can produce videos.
4.4 Class-conditional generation
To compare against other image density models, we trained our Multiscale PixelCNN on ImageNet. We used type-B upscaling networks (see Figure 3) with 12 ResNet (He et al., 2016) layers and 4 PixelCNN layers, with 256 hidden units per layer. For all PixelCNNs in the model, we used the same architecture as in van den Oord et al. (2016b). We generated images with a base resolution of and trained four upscaling networks to produce up to samples. At scales and above, during training we randomly cropped the image to . This accelerates training but does not pose a problem at test time because all of the networks are fully convolutional.
Table 2 (row for the O(T log N) VPN): | O(T log N) VPN | 0.74 | 0.74 | 1.06 | 0.97 |
Table 3 shows the results. On both and ImageNet it achieves significantly better likelihood scores than have been reported for any non-pixel-autoregressive density models, such as ConvDRAW and Real NVP, that also allow efficient sampling.
Of course, performance of these approaches varies considerably depending on the implementation details, especially in the design and capacity of deep neural networks used. But it is notable that the very simple and direct approach developed here can surpass the state-of-the-art among fast-sampling density models.
In Figure 8 we show examples of diverse class conditional image generation.
Interestingly, the model often produced quite realistic bird images from scratch when trained on CUB, and these samples looked more realistic than any animal image generated by our ImageNet models. One plausible explanation for this difference is a lack of model capacity; a single network modeling the very diverse ImageNet categories can devote only very limited capacity to each one, compared to a network that only needs to model birds. This suggests that finding ways to increase capacity without slowing down training or sampling could be a promising direction.
Figure 8 shows upscaling starting from ground-truth images of size , and . We observe the largest diversity of samples in terms of global structure when starting from , but less realistic results due to the more challenging nature of the problem. Upscaling starting from results in much more realistic images. Here the diversity is apparent in the samples (as in the data, conditioned on low-resolution) in the local details such as the dog’s fur patterns or the frog’s eye contours.
4.5 Sampling time comparison
As expected, we observe a very large speedup of our model compared to sampling from a standard PixelCNN at the same resolution (see Table 4). Even at we observe two orders of magnitude speedup, and the speedup is greater for higher resolution.
Since our model only requires $O(\log N)$ network evaluations to sample, we can fit the entire computation graph for sampling into memory, for reasonable batch sizes. In-graph computation in TensorFlow can further improve the speed of both image and video generation, due to reduced overhead by avoiding repeated calls to sess.run.
Since our model has a PixelCNN at the lowest resolution, it can also be accelerated by caching PixelCNN hidden unit activations, as recently implemented by Ramachandran et al. (2017). This could allow one to use higher-resolution base PixelCNNs without sacrificing speed.
In this paper, we developed a parallelized, multiscale version of PixelCNN. It achieves competitive density estimation results on CUB, MPII, MS-COCO, ImageNet, and Robot Pushing videos, surpassing all other density models that admit fast sampling. Qualitatively, it can achieve compelling results in text-to-image synthesis and video generation, as well as diverse super-resolution from very small images all the way to .
Many more samples from all of our models can be found in the appendix and supplementary material.
- Andriluka et al. (2014) Andriluka, Mykhaylo, Pishchulin, Leonid, Gehler, Peter, and Schiele, Bernt. 2d human pose estimation: New benchmark and state of the art analysis. In CVPR, pp. 3686–3693, 2014.
- Dahl et al. (2017) Dahl, Ryan, Norouzi, Mohammad, and Shlens, Jonathon. Pixel recursive super resolution. arXiv preprint arXiv:1702.00783, 2017.
- Deng et al. (2009) Deng, Jia, Dong, Wei, Socher, Richard, Li, Li-Jia, Li, Kai, and Fei-Fei, Li. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
- Denton et al. (2015) Denton, Emily L, Chintala, Soumith, Szlam, Arthur, and Fergus, Rob. Deep generative image models using a Laplacian pyramid of adversarial networks. In NIPS, pp. 1486–1494, 2015.
- Dinh et al. (2016) Dinh, Laurent, Sohl-Dickstein, Jascha, and Bengio, Samy. Density estimation using Real NVP. In NIPS, 2016.
- Finn et al. (2016) Finn, Chelsea, Goodfellow, Ian, and Levine, Sergey. Unsupervised learning for physical interaction through video prediction. In NIPS, 2016.
- Goodfellow et al. (2014) Goodfellow, Ian J., Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron C., and Bengio, Yoshua. Generative adversarial nets. In NIPS, 2014.
- Gulrajani et al. (2016) Gulrajani, Ishaan, Kumar, Kundan, Ahmed, Faruk, Taiga, Adrien Ali, Visin, Francesco, Vazquez, David, and Courville, Aaron. PixelVAE: A latent variable model for natural images. arXiv preprint arXiv:1611.05013, 2016.
- He et al. (2016) He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Identity mappings in deep residual networks. In ECCV, pp. 630–645, 2016.
- Johnson et al. (2016) Johnson, Justin, Alahi, Alexandre, and Fei-Fei, Li. Perceptual losses for real-time style transfer and super-resolution. In ECCV, 2016.
- Kalchbrenner et al. (2016a) Kalchbrenner, Nal, Espeholt, Lasse, Simonyan, Karen, Oord, Aaron van den, Graves, Alex, and Kavukcuoglu, Koray. Neural machine translation in linear time. arXiv preprint arXiv:1610.10099, 2016a.
- Kalchbrenner et al. (2016b) Kalchbrenner, Nal, Oord, Aaron van den, Simonyan, Karen, Danihelka, Ivo, Vinyals, Oriol, Graves, Alex, and Kavukcuoglu, Koray. Video pixel networks. Preprint arXiv:1610.00527, 2016b.
- Kingma & Salimans (2016) Kingma, Diederik P and Salimans, Tim. Improving variational inference with inverse autoregressive flow. In NIPS, 2016.
- Larochelle & Murray (2011) Larochelle, Hugo and Murray, Iain. The neural autoregressive distribution estimator. In AISTATS, 2011.
- Ledig et al. (2016) Ledig, Christian, Theis, Lucas, Huszar, Ferenc, Caballero, Jose, Cunningham, Andrew, Acosta, Alejandro, Aitken, Andrew, Tejani, Alykhan, Totz, Johannes, Wang, Zehan, and Shi, Wenzhe. Photo-realistic single image super-resolution using a generative adversarial network. 2016.
- Lin et al. (2014) Lin, Tsung-Yi, Maire, Michael, Belongie, Serge, Hays, James, Perona, Pietro, Ramanan, Deva, Dollár, Piotr, and Zitnick, C Lawrence. Microsoft COCO: Common objects in context. In ECCV, pp. 740–755, 2014.
- Mansimov et al. (2015) Mansimov, Elman, Parisotto, Emilio, Ba, Jimmy Lei, and Salakhutdinov, Ruslan. Generating images from captions with attention. In ICLR, 2015.
- Nguyen et al. (2016) Nguyen, Anh, Yosinski, Jason, Bengio, Yoshua, Dosovitskiy, Alexey, and Clune, Jeff. Plug & play generative networks: Conditional iterative generation of images in latent space. arXiv preprint arXiv:1612.00005, 2016.
- Oord et al. (2016) Oord, Aaron van den, Dieleman, Sander, Zen, Heiga, Simonyan, Karen, Vinyals, Oriol, Graves, Alex, Kalchbrenner, Nal, Senior, Andrew, and Kavukcuoglu, Koray. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
- Ramachandran et al. (2017) Ramachandran, Prajit, Paine, Tom Le, Khorrami, Pooya, Babaeizadeh, Mohammad, Chang, Shiyu, Zhang, Yang, Hasegawa-Johnson, Mark, Campbell, Roy, and Huang, Thomas. Fast generation for convolutional autoregressive models. 2017.
- Reed et al. (2016a) Reed, Scott, Akata, Zeynep, Mohan, Santosh, Tenka, Samuel, Schiele, Bernt, and Lee, Honglak. Learning what and where to draw. In NIPS, 2016a.
- Reed et al. (2016b) Reed, Scott, Akata, Zeynep, Yan, Xinchen, Logeswaran, Lajanugen, Schiele, Bernt, and Lee, Honglak. Generative adversarial text-to-image synthesis. In ICML, 2016b.
- Reed et al. (2016c) Reed, Scott, van den Oord, Aäron, Kalchbrenner, Nal, Bapst, Victor, Botvinick, Matt, and de Freitas, Nando. Generating interpretable images with controllable structure. Technical report, 2016c.
- Salimans et al. (2017) Salimans, Tim, Karpathy, Andrej, Chen, Xi, and Kingma, Diederik P. PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. arXiv preprint arXiv:1701.05517, 2017.
- Shi et al. (2016) Shi, Wenzhe, Caballero, Jose, Huszár, Ferenc, Totz, Johannes, Aitken, Andrew P, Bishop, Rob, Rueckert, Daniel, and Wang, Zehan. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In CVPR, 2016.
- Sønderby et al. (2017) Sønderby, Casper Kaae, Caballero, Jose, Theis, Lucas, Shi, Wenzhe, and Huszár, Ferenc. Amortised MAP inference for image super-resolution. 2017.
- Theis & Bethge (2015) Theis, L. and Bethge, M. Generative image modeling using spatial LSTMs. In NIPS, 2015.
- Uria et al. (2013) Uria, Benigno, Murray, Iain, and Larochelle, Hugo. RNADE: The real-valued neural autoregressive density-estimator. In NIPS, 2013.
- van den Oord et al. (2016a) van den Oord, Aäron, Kalchbrenner, Nal, and Kavukcuoglu, Koray. Pixel recurrent neural networks. In ICML, pp. 1747–1756, 2016a.
- van den Oord et al. (2016b) van den Oord, Aäron, Kalchbrenner, Nal, Vinyals, Oriol, Espeholt, Lasse, Graves, Alex, and Kavukcuoglu, Koray. Conditional image generation with PixelCNN decoders. In NIPS, 2016b.
- Wah et al. (2011) Wah, Catherine, Branson, Steve, Welinder, Peter, Perona, Pietro, and Belongie, Serge. The Caltech-UCSD birds-200-2011 dataset. 2011.
- Wang & Gupta (2016) Wang, Xiaolong and Gupta, Abhinav. Generative image modeling using style and structure adversarial networks. In ECCV, pp. 318–335, 2016.
- Wu et al. (2017) Wu, Yuhuai, Burda, Yuri, Salakhutdinov, Ruslan, and Grosse, Roger. On the quantitative analysis of decoder-based generative models. 2017.
- Zhang et al. (2016) Zhang, Han, Xu, Tao, Li, Hongsheng, Zhang, Shaoting, Huang, Xiaolei, Wang, Xiaogang, and Metaxas, Dimitris. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. arXiv preprint arXiv:1612.03242, 2016.
Below we show additional samples.