Scaling Autoregressive Video Models

by Dirk Weissenborn, et al.

Due to the statistical complexity of video, the high degree of inherent stochasticity, and the sheer amount of data, generating natural video remains a challenging task. State-of-the-art video generation models attempt to address these issues by combining sometimes complex, often video-specific neural network architectures, latent variable models, adversarial training and a range of other methods. Despite their often high complexity, these approaches still fall short of generating high quality video continuations outside of narrow domains and often struggle with fidelity. In contrast, we show that conceptually simple, autoregressive video generation models based on a three-dimensional self-attention mechanism achieve highly competitive results across multiple metrics on popular benchmark datasets for which they produce continuations of high fidelity and realism. Furthermore, we find that our models are capable of producing diverse and surprisingly realistic continuations on a subset of videos from Kinetics, a large scale action recognition dataset comprised of YouTube videos exhibiting phenomena such as camera movement, complex object interactions and diverse human movement. To our knowledge, this is the first promising application of video-generation models to videos of this complexity.



1 Introduction

Generative modeling of video holds promise for applications such as content creation, forecasting, transfer learning and model-based reinforcement learning Srivastava et al. (2015); Carl Vondrick (2016); Oh et al. (2015); Kaiser et al. (2019). While there has been much recent work on generative models for text, audio and images, video generation remains challenging, to a large extent because of the large amount of data, ultimately pixels, that needs to be produced. This is especially true for autoregressive models that generate videos pixel by pixel. Yet autoregressive models have a number of desirable attributes, such as their conceptual simplicity and tractable likelihood, which enables straightforward evaluation of their ability to model the entire data distribution.

Recent results on image generation by Menick and Kalchbrenner (2019) show that pixel-level autoregressive models are capable of generating images with high fidelity. These findings motivate the question of how far we can push autoregressive models for the more general problem of video generation when scaling recent neural architectural advances to modern hardware accelerators.

In this work, we introduce a generalization of the Transformer architecture of Vaswani et al. (2017) using three-dimensional, block-local self-attention. It allows us to efficiently model video as a 3D volume, as opposed to a sequence of still image frames, with direct interactions between representations of pixels across the spatial and temporal dimensions. To maintain a manageable memory footprint, we combine this with a three-dimensional generalization of methods from Menick and Kalchbrenner (2019), who generate images as sequences of smaller, sub-scaled image slices. The overall architecture is well suited for Tensor Processing Units, or TPUs Jouppi et al. (2017), enabling us to scale our models substantially.

We obtain strong results on popular benchmarks (Section 4.2, Appendix A) and produce high fidelity video continuations on the BAIR robot pushing dataset exhibiting plausible object interactions.

Finally, we apply our models to down-sampled videos from the Kinetics-600 dataset Carreira et al. (2018) (Section 4.3). While the full range of Kinetics-600 videos still poses a major challenge, we see highly encouraging video continuations for a more limited subset, namely cooking videos. These feature camera movement, complex object interactions and still cover diverse subjects. This dataset marks a departure from the often very narrow domains discussed in the video generation literature to date, such as artificially generated videos of moving digits or shapes, or videos depicting natural, yet highly constrained environments, such as robot arms interacting with a small number of different objects with a fixed camera angle and background.

We hope that these initial results will encourage future video generation work to evaluate models on more challenging datasets such as Kinetics.

2 Related Work

Our setup is closely related to that of Kalchbrenner et al. (2016), who extend work on pixel-level autoregressive image generation van den Oord et al. (2016b, a) to videos. However, whereas they model the temporal and spatial dimensions separately with dilated convolutions and convolutional LSTMs, respectively, our model is conceptually simpler in that we do not make any distinction between temporal and spatial dimensions and instead rely almost entirely on multi-head self-attention Vaswani et al. (2017) within the 3D video volume. For comparability, we provide results on Moving MNIST and an older Robotic Pushing dataset for which we achieve an almost 50% reduction in perplexity (see Appendix A).

To keep memory requirements feasible, we use block-local self-attention, generalizing the image generation approaches of Parmar et al. (2018) and Chen et al. (2018) to 3D volumes. The concurrent work of Child et al. (2019) instead uses sparse attention, linearizing images to a sequence of pixels. However, this approach would fail to capture local 3D neighborhoods directly, and they only apply their model to images. To further reduce memory requirements, we exploit ideas of sub-scaling which were recently introduced in Menick and Kalchbrenner (2019). Another approach is (hierarchical) multi-scale generation, which has recently been explored for both image Reed et al. (2017); De Fauw et al. (2019) and video Mathieu et al. (2016) generation.

Earlier work on video generation mostly focused on deterministic approaches Srivastava et al. (2015); Carl Vondrick (2016); Xingjian et al. (2015); Liu et al. (2017); Jia et al. (2016), which fail to capture the high degree of stochasticity inherent in video. In response, a popular research direction has been that of generative latent-variable video models. In contrast to pixel-level autoregressive models, these posit an underlying latent process in tandem with the observed pixel values. Work in this category includes variants of variational autoencoders Babaeizadeh et al. (2018); Denton and Fergus (2018). To tackle the issues inherent in these models, most notably the tendency to generate blurry outputs due to restricted modeling power, inadequate prior distributions, or optimization of a lower bound in place of the true likelihood, these have been combined with adversarial objectives Mathieu et al. (2016); Vondrick et al. (2016); Lee et al. (2018), hierarchical latent-variables Castrejón et al. (2019), or flow-based modeling Kumar et al. (2019). These approaches have the benefit that they admit fast generation, but are restricted in that they tend to only focus on a subset of the modes in the empirical distribution in the adversarial case, or that they struggle with limited modeling power even when using a large number of layers in the case of flow-based models.

The conceptual simplicity of our model is in line with recent approaches to video classification that model videos by means of 3D convolutions Carreira and Zisserman (2017); Xie et al. (2018) or spatiotemporal self-attention Girdhar et al. (2018). In contrast, much earlier work on video generation has encoded specific intuitions about videos, such as explicit modeling of motion Finn et al. (2016b); Denton and Fergus (2018) or generation of optical flow Pătrăucean et al. (2016).

3 Video Transformer


Figure 1: Top: Illustration of the subscale video transformer architecture and process flow. We incrementally generate video slices. The video slices and their respective generation order are derived from subscaling. In each iteration, we first process the partially padded video (illustrated for one slice index; black means padding and gray means already generated or visible) by an encoder, the output of which is used as conditioning for decoding the current video slice. After generating a slice we replace the respective padding in the video with the generated output and repeat the process for the next slice.
Bottom: Subscaling in 3D (best viewed in color). The 3D volume is evenly divided by a given subscale factor and the respective slices are extracted. The whole volume is generated by incrementally predicting the individual, much smaller slices, starting at the first slice (yellow), followed by the second (green), the third (red), etc., in raster-scan order.

We generalize the one-dimensional Transformer Vaswani et al. (2017) to explicitly model videos represented as three-dimensional spatiotemporal volumes, without resorting to sequential linearization of the positions in the volume Child et al. (2019). This allows for maintaining spatial neighborhoods around positions, which is important as the large number of individual positions to be predicted in a video requires limiting the receptive field of the self-attention mechanism to a neighborhood around every position to avoid the quadratic blow-up in memory consumption of naive fully-connected attention.

We model the distribution over videos, with time, height, width and channel dimensions, by means of a pixel-channel level autoregressive factorization. (Footnote 1: In the following, we denote general tensors by boldface lowercase letters and matrices by capital letters.)

That is, the joint distribution over pixels is factorized into a product of channel intensities for all channels, for each of the pixels, with respect to an ordering over pixels.


The ordering is given by a combination of a subscale- and raster-scan ordering, as detailed in 3.2.
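The factorization described above can be written compactly as follows. The symbol names are assumptions made for this sketch: N_p is the number of pixels, N_c the number of channels, \pi the ordering over pixels, and x^k_{\pi(i)} the intensity of channel k at the i-th pixel in that ordering.

```latex
p(\mathbf{x}) = \prod_{i=0}^{N_p - 1} \prod_{k=0}^{N_c - 1}
  p\left( x^{k}_{\pi(i)} \,\middle|\, \mathbf{x}_{\pi(<i)},\; x^{<k}_{\pi(i)} \right)
```

Each factor conditions on all pixels that precede pixel \pi(i) in the ordering, and on the channels of that same pixel that have already been predicted.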

3.1 Block-local Self-Attention

The attention mechanism of the original Transformer lets each element in a set of elements connect to every other element via a fully-connected weighted adjacency (attention) matrix, whose entries are the attention weights from one element to another. Because this matrix grows quadratically with the number of elements, it becomes prohibitively large for objects such as videos, which contain thousands or millions of pixels. Therefore, similar in spirit to Parmar et al. (2018), we propose to use local self-attention by dividing a video into much smaller non-overlapping sub-volumes, or 3D blocks. We then apply self-attention separately within each block. This approach is conceptually simple and amenable to highly efficient implementation on TPUs, which enables us to scale our models substantially while retaining fast training time with only a modest sacrifice in modeling power.

The Video Transformer consists of multiple stacked self-attention layers. Each layer divides the overall video volume into smaller blocks of a fixed shape and performs attention within each block independently. Given a (flattened) block representation as input, this amounts to the following steps.


The input is first projected to query, key and value representations (Eq. 2). The attention matrix is then formed as the scaled dot-product between all query-key pairs, adding a relative position bias Parikh et al. (2016) (Eq. 3). The bias is defined as the sum of per-dimension relative distance biases between each pair of elements, along each of the time and spatial dimensions. Finally, the values are aggregated with respect to the attention weights (Eq. 4).
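The three steps above can be sketched in NumPy. This is a minimal single-head illustration under stated assumptions, not the authors' implementation: the function names, the zero-initialised bias tables and the toy sizes are hypothetical, and layer normalization, multiple heads and decoder masking are omitted.

```python
import numpy as np

def split_into_blocks(x, block):
    """Reshape (T, H, W, D) into (num_blocks, block_len, D) non-overlapping 3D blocks."""
    T, H, W, D = x.shape
    t, h, w = block
    assert T % t == 0 and H % h == 0 and W % w == 0
    x = x.reshape(T // t, t, H // h, h, W // w, w, D)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)  # block indices first, in-block offsets after
    return x.reshape(-1, t * h * w, D)

def relative_bias(block, bt, bh, bw):
    """Sum of per-dimension relative-distance biases for all position pairs in a block.

    bt, bh, bw are lookup tables of length 2*t-1, 2*h-1 and 2*w-1 (assumed layout).
    """
    t, h, w = block
    grid = np.meshgrid(np.arange(t), np.arange(h), np.arange(w), indexing="ij")
    pos = np.stack(grid, axis=-1).reshape(-1, 3)
    rel = pos[:, None, :] - pos[None, :, :]          # (n, n, 3) signed distances
    return bt[rel[..., 0] + t - 1] + bh[rel[..., 1] + h - 1] + bw[rel[..., 2] + w - 1]

def block_local_attention(z, w_q, w_k, w_v, bias):
    """Scaled dot-product attention applied independently within each block."""
    q, k, v = z @ w_q, z @ w_k, z @ w_v              # query, key, value projections
    logits = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1]) + bias
    logits -= logits.max(axis=-1, keepdims=True)     # numerically stable softmax
    a = np.exp(logits)
    a /= a.sum(axis=-1, keepdims=True)
    return a @ v                                     # aggregate values

rng = np.random.default_rng(0)
video = rng.normal(size=(4, 32, 32, 8))              # toy (T, H, W, D) representation
block = (4, 8, 4)
z = split_into_blocks(video, block)                  # shape (32, 128, 8)
bias = relative_bias(block, np.zeros(7), np.zeros(15), np.zeros(7))
```

Each block only ever forms a 128 x 128 attention matrix here, rather than one 4096 x 4096 matrix over the whole slice.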

Following Vaswani et al. (2017), we concatenate the output of the parallel attention heads in each layer and project the result by a linear transformation before applying a residual connection. Finally, the output of the multi-head self-attention layer is passed through another dense layer with ReLU activation, followed by a final linear transformation and a residual connection. Here, overloading notation, attention applied to a whole layer input denotes the blockwise application of self-attention. Similar to Baevski and Auli (2019), we found that applying layer normalization before each block, rather than after each block as proposed by Vaswani et al. (2017), improves training.

Connectivity. Operating on 3D sub-volumes (blocks) of videos means that there is no direct information exchange between blocks. However, this can be addressed by varying the block sizes between each layer. To achieve this, we define blocks that stretch over the entire extent of at least a single dimension in each layer. Following this procedure, we can effectively connect all pixel positions in the encoder, but due to masking some dependencies are missed in the decoder. However, in our experiments these did not produce any visible, systematic artifacts. We discuss missing dependencies and potential remedies in Appendix C.

Efficiency. Running block-local self-attention is very efficient in practice as the cost of splitting videos into blocks is negligible. The approach of Parmar et al. (2018) uses overlapping 2D image blocks in each layer. We found this prohibitive as the required data copying is comparatively expensive. To avoid the need for overlaps to connect pixels across blocks, we simply vary block sizes between layers, which is highly efficient and, as our results show, works well in practice.

3.2 Spatiotemporal Subscaling

Menick and Kalchbrenner (2019) recently proposed generating images as a sequence of subscaled image slices. We similarly define a subscale factor, with one component per dimension, which divides a video into sub-sampled videos (slices), each with the original resolution divided by the subscale factor along every dimension, as depicted in the bottom part of Figure 1. The slices are generated in order according to their respective offsets, from the slice at the zero offset up to the slice at the largest offset. Generating all slices one at a time in this way drastically reduces the number of pixels in memory, by a factor equal to the number of slices, which enables scaling our architectures by the same factor. Each slice is internally generated according to the raster-scan order. In the following we explain how slices are generated and how they are conditioned on already decoded slices. An overview is illustrated in the upper part of Figure 1.
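As an illustration of the slicing itself, strided indexing extracts the slices described above. This is a sketch under assumptions: the function name is hypothetical, and the example uses a 16x64x64 volume with factors (4, 2, 2).

```python
import itertools
import numpy as np

def subscale_slices(video, factors):
    """Yield (offset, slice) pairs in raster-scan order of the offsets."""
    s_t, s_h, s_w = factors
    for a, b, c in itertools.product(range(s_t), range(s_h), range(s_w)):
        # Every s-th position along each axis, starting at the offset.
        yield (a, b, c), video[a::s_t, b::s_h, c::s_w]

video = np.arange(16 * 64 * 64).reshape(16, 64, 64)
slices = list(subscale_slices(video, (4, 2, 2)))
```

Each of the 16 slices has shape 4x32x32, so only one sixteenth of the pixels has to be decoded at any one time.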

Slice Encoder. The current slice is generated conditioned on the encoded pixels from preceding slices as follows. First, we create a partially masked video, where only the pixels of preceding slices are visible. The partially masked video is then embedded by concatenating the one-hot encoding of the discretized pixel intensities of each channel. Subsequently, a 3D convolution with a given kernel size and with stride equal to the sub-scaling factor results in an encoded video at the resolution of a single slice. We apply convolution padding depending on the current slice index. In particular, we choose the padding such that it “centers” the convolution kernel on the pixels of the current slice. Finally, we add positional embeddings for each axis, as well as embeddings for the current slice index, to the output of this strided convolution. The result is an initial encoder representation whose last dimension is the embedding size. We can optionally condition on auxiliary information, such as per-frame action values of a robot arm, by concatenating this information to the initial encoder representation.
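The first encoder step, building the partially masked video, can be sketched as follows. The function names and the use of zero as the padding value are assumptions for illustration; the text does not specify how padding is encoded.

```python
import itertools
import numpy as np

def visible_mask(shape, factors, current):
    """True only at pixels that belong to slices generated before `current`."""
    s_t, s_h, s_w = factors
    mask = np.zeros(shape, dtype=bool)
    for offset in itertools.product(range(s_t), range(s_h), range(s_w)):
        if offset >= current:   # tuples compare lexicographically (raster-scan order)
            break
        a, b, c = offset
        mask[a::s_t, b::s_h, c::s_w] = True
    return mask

def partially_masked_video(video, factors, current, pad_value=0):
    """Keep pixels of preceding slices, replace everything else by padding."""
    return np.where(visible_mask(video.shape, factors, current), video, pad_value)

video = np.ones((16, 64, 64), dtype=np.int64)
masked = partially_masked_video(video, (4, 2, 2), current=(1, 0, 1))
```

For the slice at offset (1, 0, 1), five slices precede it in generation order, so 5/16 of the pixels remain visible.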

This representation is further transformed by a linear projection to the hidden size before being fed as input to a stack of block-local self-attention layers as described in §3.1. Each layer is parameterized by a different block size and number of attention heads. The resulting output is used as conditional input to the subscale slice decoder, which generates the pixels of the current slice.

Slice Decoder. The pixel values of the current slice are predicted conditioned on the encoder representation. The decoder is almost identical to the encoder in structure, except for the use of masking in the decoder as defined by the generation order. First, we embed the current slice by summing channel embeddings at every pixel, before applying a 3x3x3 masked convolution van den Oord et al. (2016a) on the embedded pixels, effectively representing each pixel by its already generated, immediate neighbors. Similar to the encoder, we add positional embeddings for the space and time dimensions to the output of this masked convolution. As in the encoder, this results in an initial decoder representation.
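A mask for the 3x3x3 masked convolution can be sketched as below. It assumes raster-scan order over (time, height, width) and excludes the centre tap, in line with each pixel being represented only by its already generated neighbors; the function name is hypothetical.

```python
import numpy as np

def causal_mask_3d(kernel=(3, 3, 3)):
    """Keep only kernel taps that strictly precede the centre in raster-scan order."""
    centre = tuple(k // 2 for k in kernel)
    mask = np.zeros(kernel, dtype=np.float32)
    for idx in np.ndindex(*kernel):
        if idx < centre:        # lexicographic tuple comparison = raster-scan order
            mask[idx] = 1.0
    return mask

mask = causal_mask_3d()
```

Multiplying the convolution weights elementwise by this mask leaves 13 of the 27 taps active: the whole previous frame, the previous row of the current frame, and the pixel to the left.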

To condition on the encoder state, a linear projection of the encoder output is added to the initial decoder representation, and the resulting representation is fed through a stack of block-local self-attention layers, with masking, to produce a state on which the final channel predictions are conditioned.

3.3 Channel Prediction & Loss Function

The per-pixel channel intensities (we omit the slice index in the following) for each channel are predicted by MLPs with a single hidden layer (Eq. 8), conditioned on the flattened final decoder state, which is itself conditioned on the encoder representation and hence on prior slices, as well as on the preceding channels of each pixel, encoded as one-hot vectors. Finally, the per video slice loss is defined as the negative log-likelihood.



We found that splitting the color channel values of the videos into coarse and fine bits helps slightly in terms of performance. Specifically, we split each RGB channel into a coarse-bit and a fine-bit channel, such that the coarse bits of all three channels are predicted before the fine bits. Furthermore, splitting channels this way at the input level considerably lowers memory footprint when encoding videos as vectors on TPUs.
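The coarse/fine split can be illustrated with plain bit operations. The bit widths below (8-bit inputs split into two 4-bit halves) are an assumption for this sketch, as are the function names.

```python
def split_coarse_fine(value, total_bits=8, coarse_bits=4):
    """Split an intensity into its high-order (coarse) and low-order (fine) bits."""
    fine_bits = total_bits - coarse_bits
    return value >> fine_bits, value & ((1 << fine_bits) - 1)

def merge_coarse_fine(coarse, fine, total_bits=8, coarse_bits=4):
    """Inverse of split_coarse_fine."""
    return (coarse << (total_bits - coarse_bits)) | fine

def channel_order(rgb):
    """Prediction order: coarse parts of R, G, B first, then the fine parts."""
    parts = [split_coarse_fine(v) for v in rgb]
    return [c for c, _ in parts] + [f for _, f in parts]
```

Under these assumed widths, each predicted sub-channel is a 16-way rather than a 256-way choice, which is one way to see why the one-hot input encoding becomes smaller.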

4 Experiments

Below, we provide details on the model variants considered, training setup and evaluation metrics used. We focus our evaluation on the BAIR Robot Pushing and Kinetics datasets. Additional results on Moving MNIST and an earlier version of Robotic Pushing are provided in Appendix A for reference. Sample video strips of each model and dataset can be found in Appendix F and sample videos at

4.1 Models & Setup

Unless specified otherwise, we model video slices of 4 frames with a spatial resolution of 32x32. Both the encoder and decoder consist of 8 layers and have a nearly identical structure, except for the use of masking in the decoder, as described in Section 3.2. We apply block-local self-attention with the following block sizes . Layers 1-4: (4, 8, 4); (4, 4, 8); (1, 32, 4); and (1, 4, 32). Intuitively, layers 1 and 2 are responsible for gathering temporal information whereas layers 3 and 4 gather spatial information of the entire frame. Layer 3 has access to the entire height and layer 4 to the entire width of a frame. The remaining 4 layers have the same block sizes, but in reverse order. However, as discussed in Appendix B, this particular choice of block size ordering is not crucial. There are attention heads, each with hidden size . Our base models are trained with embedding size and hidden size of (46M parameters). Based on ablations in Appendix B, we observed that increasing the hidden dimension is preferable to using deeper networks. Hence, we increase the hidden size to and use instead of heads for the last encoder/decoder layers in our large models (373M parameters).
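A quick check of the block sizes listed above, as a sketch with hypothetical helper names: every block divides the 4x32x32 slice evenly, contains the same number of positions, and spans the full extent of at least one axis.

```python
from math import prod

SLICE_SHAPE = (4, 32, 32)                      # (frames, height, width) of one slice
BLOCKS = [(4, 8, 4), (4, 4, 8), (1, 32, 4), (1, 4, 32)]

def block_stats(slice_shape, block):
    """Return (positions per block, number of blocks, axes the block fully spans)."""
    assert all(s % b == 0 for s, b in zip(slice_shape, block))
    block_len = prod(block)
    num_blocks = prod(slice_shape) // block_len
    full_axes = [axis for axis, (s, b) in enumerate(zip(slice_shape, block)) if s == b]
    return block_len, num_blocks, full_axes

stats = [block_stats(SLICE_SHAPE, b) for b in BLOCKS]
```

All four layers attend over 128 positions per block. The first two blocks cover the full time axis, the third the full height and the fourth the full width, matching the intuition given in the text.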

Models. To assess the effect of subscaling, we explore the following variants. These differ mainly in the subscaling factor as well as the context kernel size , defaulting to :

Spatiotemporal Subscaling. The subscale video transformer with full spatiotemporal subscaling applies subscaling in every dimension. For instance, a 16x64x64 video is subscaled by factors (4, 2, 2) to 16 slices of 4x32x32.

Spatial Subscaling. This baseline models shorter videos with frames subscaled to a resolution of 32x32. For instance, a 4x64x64 video is subscaled by factors (1, 2, 2) to 4 slices of 4x32x32.

Single Frame. This baseline models a single frame at a time conditioned on the previous three, without applying any spatial subscaling. For instance, a 16x64x64 video is subscaled by factors (16, 1, 1) to 16 slices of 1x64x64 frames. The context kernel size is chosen such that we merely condition on a context of 3 past frames, as the current and future frames are always masked when the temporal subscaling factor equals the full video length. Self-attention blocks are adapted as follows: Layers 1-4: (1, 8, 16); (1, 16, 8); (1, 2, 64); (1, 64, 2). For the remaining 4 layers we use the same blocks, again in reverse order.
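The slice counts and shapes quoted for the three variants follow from simple arithmetic. In this sketch the function name is hypothetical, and the subscale factors are those implied by the stated video and slice shapes.

```python
from math import prod

def slice_layout(video_shape, factors):
    """Return (number of slices, shape of each slice) for given subscale factors."""
    assert all(v % f == 0 for v, f in zip(video_shape, factors))
    return prod(factors), tuple(v // f for v, f in zip(video_shape, factors))

spatiotemporal = slice_layout((16, 64, 64), (4, 2, 2))
spatial = slice_layout((4, 64, 64), (1, 2, 2))
single_frame = slice_layout((16, 64, 64), (16, 1, 1))
```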


All models are trained with RMSProp Tieleman and Hinton (2012) with a fixed learning rate of , decay of and momentum of . We use a batch size of 64 video slices, if not stated otherwise, and shuffle the slices to avoid having all slices in a batch correspond to the same video. The smaller models are trained for 300K steps and the larger ones for 1M steps. No explicit regularization is applied as we could not observe any form of over-fitting. Videos longer than the training length are cropped randomly in time to that length. If not stated otherwise, models are conditioned on the first frame during training, which is achieved by masking the loss corresponding to this frame. In preliminary experiments, this gave a minor improvement over computing the loss across all frames.

Intrinsic Evaluation. Most results are reported as bits per dimension (bits/dim), the average negative base-2 log-probability assigned by the model per (RGB) channel, averaged across all pixels in the video. This corresponds directly to the loss optimized by the model. In all experiments, we condition (prime) on a specified number of initial frames, which we simply exclude from the average.
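As a worked example of the metric, the sketch below converts per-channel log-probabilities into bits/dim while excluding primed frames. It assumes the model reports natural-log probabilities grouped by frame; the function name and input layout are hypothetical.

```python
import math

def bits_per_dim(log_probs_nats, num_primed=0):
    """Average negative log2-probability per channel value, skipping primed frames.

    log_probs_nats: one list per frame, each holding the natural-log probability
    that the model assigned to every channel value in that frame.
    """
    kept = [lp for frame in log_probs_nats[num_primed:] for lp in frame]
    return -sum(kept) / (len(kept) * math.log(2))

# A model that is uniform over 256 intensities costs exactly 8 bits per dimension.
uniform = [[math.log(1 / 256)] * 12 for _ in range(16)]
uniform_bpd = bits_per_dim(uniform, num_primed=1)
```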

Extrinsic Evaluation. Prior work mainly reported results on the peak signal-to-noise ratio (PSNR) and mean structural similarity (SSIM) metrics Wang et al. (2004b). However, these metrics were developed for images and have serious flaws when applied to videos Wang et al. (2004a); Wang and Li (2007); Zhang et al. (2018); Lee et al. (2018). In particular, PSNR has a strong preference for blurry videos as it is based on pixel-level mean squared error, so we report SSIM for comparability instead. Primarily, we use the Fréchet Video Distance (FVD), which was recently proposed by Unterthiner et al. (2018) as a qualitative metric sensitive to visual quality, temporal coherence and diversity of samples. This is the spatiotemporal counterpart to the Fréchet Inception Distance (FID) Heusel et al. (2017), replacing the ImageNet-trained Inception network of the latter with an I3D Network trained on Kinetics. Despite sharing the known drawbacks of FID Bińkowski et al. (2018), FVD has been shown to correlate much more strongly with human raters than both PSNR and SSIM Unterthiner et al. (2018). We report the FVD of the first 16 frames, as well as the “unrolled” average FVD across all contiguous subsequences of 16 frames. In each case, we report the mean and standard deviation of 20 trials.
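For orientation, the Fréchet distance between two Gaussians fitted to feature statistics, which underlies both FID and FVD, can be sketched as below. This is not the reference FVD implementation: the I3D feature extraction is omitted, the function names are hypothetical, and the trace of the matrix square root is computed from eigenvalues, which assumes positive semi-definite covariances.

```python
import numpy as np

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^(1/2)) for Gaussian statistics."""
    diff = mu1 - mu2
    # Eigenvalues of S1 @ S2 are real and non-negative for PSD inputs;
    # clip small negative values that arise from round-off.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * tr_sqrt)

def window_starts(num_frames, window=16):
    """Start indices of all contiguous windows used for an unrolled average."""
    return list(range(num_frames - window + 1))
```

An unrolled average would evaluate the distance once per start index returned by `window_starts` and average the results.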

Models                           Bits/dim   FVD    FVD (Avg)
Single Frame                     1.49       1044   1992
Spatial Sub.                     1.57       1114   1081
Spatiotemp. Sub.                 1.53       1063   1062
Spatiotemp. Sub. (L)             1.35       1942   1962
SV2P Babaeizadeh et al. (2018)   -          263    -
SAVP Lee et al. (2018)           -          116    -
VideoFlow Kumar et al. (2019)    1.87       -      -
BAIR Robot Pushing. Bits/dim averaged across 15 subsequent frames when priming with 1 initial frame, FVD and unrolled average FVD scores with temperature 0.9. FVD scores for prior work are from Unterthiner et al. (2018); results are not strictly comparable to prior work (see text for details).
Models                 Bits/dim   FVD       FVD (Avg)
Single Frame           1.40       243 ± 6   413 ± 11
Spatial Sub.           1.47       263 ± 6   450 ± 15
Spatiotemp. Sub.       1.49       195 ± 7   375 ± 11
Single Frame (L)       1.14       207 ± 8   353 ± 13
Spatiotemp. Sub. (L)   1.19       170 ± 5   316 ± 12
Kinetics. Bits/dim averaged across 15 subsequent frames when priming with 1 initial frame, FVD and unrolled average FVD scores with temperature 0.9 and priming with 5 frames.
Table 1: Quantitative results on BAIR Robot Pushing (left) and Kinetics (right).

4.2 BAIR Robot Pushing

BAIR Robot Pushing Ebert et al. (2017) shows a robotic arm pushing and grasping objects in a box. It consists of roughly 40K training and 256 test videos. We prime on the first frame for training and evaluation.

Empirical Results. All variants of the Video Transformer achieve strong results compared to prior work in terms of both intrinsic and extrinsic metrics. From Table 1, we see that the small models already reduce the perplexity in terms of bits/dim by almost 20% compared to the recently proposed VideoFlow model Kumar et al. (2019) with our large model (L) reducing perplexity even further to a 25% improvement. Similar to Menick and Kalchbrenner (2019), we find that subscaling can have a slightly negative effect on bits/dim. In terms of perceptual quality, every incarnation of our model obtains a lower (better) FVD score compared to all models evaluated by Unterthiner et al. (2018), which notably includes adversarial networks with no guarantees of covering the full empirical distribution. The same holds in terms of SSIM (Figure 2, numbers for other models from Kumar et al. (2019)). These results are not strictly comparable, since prior work has used longer priming sequences of two Babaeizadeh et al. (2018); Lee et al. (2018) or three Kumar et al. (2019) frames, whereas our models see a single prime frame (to our disadvantage). Note that we sample with temperature 0.9 for the extrinsic metrics as we observed improved qualitative results at this temperature on the validation set. This corresponds to a mild form of mode dropping. Further results on an earlier version of robot pushing Finn et al. (2016a) and Moving MNIST Srivastava et al. (2015) can be found in Appendix A for brevity. In summary, like Kalchbrenner et al. (2016), we match the lower bound on Moving MNIST while obtaining an almost 50% reduction in bits/dim on robotic pushing.
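Sampling with a temperature below 1.0 sharpens the predicted distribution before drawing a value. The sketch below shows the general technique on a list of logits; the function names are hypothetical, and how temperature is wired into the actual model is not specified in the text.

```python
import math
import random

def apply_temperature(logits, temperature=0.9):
    """Softmax of logits / temperature; temperature < 1 concentrates mass on likely values."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                                  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample(logits, temperature=0.9, rng=random):
    """Draw one index from the temperature-adjusted distribution."""
    probs = apply_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Because low-probability values become even less likely, this trades some sample diversity for fidelity, which is the mild form of mode dropping mentioned above.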

Figure 2: Unrolled FVD and SSIM metrics on BAIR Robot Pushing (left) and Kinetics (right).

Qualitative Observations. All variants of our model reach similar quantitative results on these benchmarks and we observe no immediate differences in fidelity. However, there are some notable differences. First, whereas the spatiotemporal subscaling model is able to capture temporal dependencies across up to 16 frames (given subscaling in time by a factor of four), the remaining models can only capture dependencies across four frames. This results, for example, in deformation of occluded objects as shown in Figure 3. However, due to the simplicity of the benchmark datasets, this is not appropriately reflected in the metrics, which even include better unrolled FVD curves for the single frame base model. Second, we observe that lowering the sampling temperature from 1.0 to 0.9 consistently improves results. Notably, spatiotemporal subscaling seems more robust to sampling errors, as its performance decays less when sampling with temperature 1.0 (1224 Avg. FVD) compared to the spatial subscaling (1344) and single frame models (1537). We attribute this finding to the difference in generation order when spatiotemporal subscaling is employed, as it predicts pixels over the entire extent of the 3D video volume early and thereby effectively anchors future predictions around these pixels. Finally, considering that our results on BAIR Robot Pushing in terms of FVD are on par with those between two ground-truth subsamples (Figure 4 of Unterthiner et al. (2018)), we may be approaching the limit of this benchmark. On the other hand, it could be that FVD suffers out-of-domain and is not sufficiently sensitive to long-range temporal dynamics, since its underlying network is trained to perform human action recognition, which is known to predominantly rely on local features Carreira and Zisserman (2017); Xie et al. (2018).

Figure 3: Samples (showing every 5th frame horizontally) illustrating occlusion effects on BAIR Robot Pushing. Models without temporal subscaling (rows 3-4) fail on occlusions, whereas the model with temporal subscaling (row 2) correctly maintains objects from the ground truth video (row 1). Notice the green ball deformation on rows 2 and 3 and the hallucinated green ball on the right edge of row 3, which are caused by missing temporal dependencies across the duration of occlusion.

4.3 Kinetics

Moving from a constrained to a more realistic setting, we next apply our models to the Kinetics dataset Kay et al. (2017), a large scale action-recognition dataset consisting of YouTube videos. Specifically, we use Kinetics-600, which contains roughly 400K training videos ranging over 600 action classes Carreira et al. (2018). We center-crop and down-sample each frame to 64x64 with a width-3 Lanczos filter and anti-aliasing.

We introduce a slight change to our setup by using a separate decoder for the first slice. This decoder can be twice as deep (16 instead of 8 layers) as the original subscale decoder, because it does not rely on any encoder. For all other slices we train a regular subscale model (8 layers in both encoder and decoder) as before. Using a separate first-slice decoder means that there is no wasted encoder computation on the first slice and that there are additional parameters. Furthermore, for our large models we scale the batch size to 256 by training in parallel on 128 TPU v3 instances.

Empirical Results. Results for our base models are shown in the upper part of Table 1. In line with results on BAIR pushing, we find that the single frame model obtains better performance in terms of bits/dim. In contrast, we observe that the spatiotemporal subscaling model generates better and more robust video continuations, which is reflected by its superior FVD scores. Our large models (L) show much stronger performance across the board (see lower half of Table 1 and Figure 2), lowering the perplexity to 1.14 bits/dim for the single frame model. While the spatiotemporal subscaling model obtains a slightly worse perplexity of 1.19 bits/dim, it improves FVD to 170. Despite its good performance on bits/dim, even with a temperature of 0.9, samples from the large single frame model are prone to instability, and in many cases we observe color “explosions” (Figure 12 in the Appendix shows an example), which is reflected in its significantly higher FVD score. Although much less pronounced, we observed such instability already when sampling with temperature 1.0 on BAIR pushing, which clearly indicates the benefits of temporal subscaling for video generation.

Qualitative Observations. Figure 4 shows samples from a cooking subset of Kinetics that we describe in Appendix E. These are selected to showcase different aspects of real-world videos learned by the large spatiotemporal subscaling model. Figures 4(a) and 4(c) demonstrate the model’s ability to handle camera movement. We find that camera movement is learned early in training since it is a major source of uncertainty. This requires transforming pixels correctly while hallucinating new pixels at the edges. Similarly, object movement resulting, for instance, in a change of perspective is predicted quite well (Figure 4(b)). Highly stochastic motion such as fire (Figure 4(d)) or steam is modeled surprisingly well. Videos in Kinetics sometimes contain scene changes and therefore our model occasionally generates completely new scenes (Figure 4(e)). Lastly, we find that human motion of fingers and faces is difficult to model. Nevertheless, for a fair number of samples the model is able to generate surprisingly good continuations, as can be seen in Figures 4(g), 4(h) or 4(i).

These selected examples show only a small subset of interesting phenomena handled by the model and illustrate the sheer complexity involved in modeling this dataset. In Appendix F, we provide multiple samples primed on the same initial frames to show the diversity of the generated samples.

Panels: (a) Zoom; (b) Perspective; (c) Camera movement; (d) Fire; (e) Scene change; (f) Object interaction; (g) Fingers; (h) Mouth closing; (i) Yawning.
Figure 4: Selected Kinetics continuations from a set of 128 videos and 16 samples which showcase a variety of natural, video-specific phenomena our model learns to generate. We used our large spatiotemporal subscaling model and primed generation with 5 frames (0-4) to include the first two frames in subscale order (0, 4). Samples are generated with a temperature of 0.9. The examples depict frames 0, 5, 10 and 15.
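
The remark about priming with frames 0-4 can be made concrete with a small sketch. It assumes a temporal subscale factor of 4 over 16 frames; the factor is not stated in this section and is inferred here only from the caption's "(0, 4)". The function name is hypothetical and spatial subscaling is ignored.

```python
def temporal_subscale_order(num_frames, factor):
    """Group frame indices into temporal subscale slices.

    Slice k holds frames k, k + factor, k + 2 * factor, ...
    Slices are generated one after another.
    Assumption: factor=4 is illustrative, not taken from the paper.
    """
    return [list(range(offset, num_frames, factor)) for offset in range(factor)]


slices = temporal_subscale_order(num_frames=16, factor=4)

# Flatten the slices to get the order in which frames are first visited.
generation_order = [frame for s in slices for frame in s]
first_two = generation_order[:2]

# Priming must cover every frame up to the latest of these two.
prime_frames_needed = max(first_two) + 1
print(slices[0], first_two, prime_frames_needed)
```

Under these assumptions the first slice starts with frames 0 and 4, so a contiguous prime has to span five frames to contain both.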

Limitations. While we see encouraging qualitative results for a subset of videos, we would like to point out that the full range of Kinetics still poses a major challenge. Failure modes range from freezing movement or object distortions to continuations that completely break after a few frames. Hence, while we provide a strong baseline, there is much room for improvement in future work.

5 Conclusion

We presented an autoregressive model of videos based almost entirely on efficient block-local self-attention. Combined with spatiotemporal subscaling, our model can be scaled up substantially while retaining longer range spatiotemporal dependencies. Empirically, we obtain state-of-the-art results across a range of popular video generation benchmarks, while the scalability of our approach enables us to make a first attempt at modeling real-world videos of an unprecedented complexity.


Acknowledgments

This work benefited from numerous conversations with Nal Kalchbrenner, as well as discussions with Jacob Menick, Mohammad Taghi Saffar and Niki Parmar. Manoj Kumar graciously provided the data for the baseline SSIM evaluations for Figure 2. We would also like to thank Chelsea Finn and Tom Kwiatkowski for thoughtful comments on an earlier draft.


References

  • Babaeizadeh et al. [2018] M. Babaeizadeh, C. Finn, D. Erhan, R. H. Campbell, and S. Levine. Stochastic variational video prediction. In ICLR, 2018.
  • Baevski and Auli [2019] A. Baevski and M. Auli. Adaptive input representations for neural language modeling. In ICLR, 2019.
  • Bińkowski et al. [2018] M. Bińkowski, D. J. Sutherland, M. Arbel, and A. Gretton. Demystifying MMD GANs. In ICLR, 2018.
  • Carl Vondrick [2016] C. Vondrick, H. Pirsiavash, and A. Torralba. Anticipating visual representations from unlabeled video. In CVPR, 2016.
  • Carreira and Zisserman [2017] J. Carreira and A. Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, 2017.
  • Carreira et al. [2018] J. Carreira, E. Noland, A. Banki-Horvath, C. Hillier, and A. Zisserman. A short note about kinetics-600. arXiv, abs/1808.01340, 2018.
  • Castrejón et al. [2019] L. Castrejón, N. Ballas, and A. Courville. Improved conditional vrnns for video prediction. arXiv, abs/1904.12165, 2019.
  • Chen et al. [2018] X. Chen, N. Mishra, M. Rohaninejad, and P. Abbeel. PixelSNAIL: An improved autoregressive generative model. In ICML, 2018.
  • Child et al. [2019] R. Child, S. Gray, A. Radford, and I. Sutskever. Generating long sequences with sparse transformers. OpenAI Preprint, 2019.
  • De Fauw et al. [2019] J. De Fauw, S. Dieleman, and K. Simonyan. Hierarchical autoregressive image models with auxiliary decoders. In CVPR, 2019.
  • Denton and Fergus [2018] E. Denton and R. Fergus. Stochastic video generation with a learned prior. In ICML, 2018.
  • Ebert et al. [2017] F. Ebert, C. Finn, A. Lee, and S. Levine. Self-supervised visual planning with temporal skip connections. In Conference on Robot Learning (CoRL), 2017.
  • Finn et al. [2016a] C. Finn, I. Goodfellow, and S. Levine. Unsupervised learning for physical interaction through video prediction. In NIPS, 2016a.
  • Finn et al. [2016b] C. Finn, I. Goodfellow, and S. Levine. Unsupervised learning for physical interaction through video prediction. In NIPS, 2016b.
  • Girdhar et al. [2018] R. Girdhar, J. Carreira, C. Doersch, and A. Zisserman. Video action transformer network. arXiv, abs/1812.02707, 2018.
  • Heusel et al. [2017] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NIPS, 2017.
  • Jia et al. [2016] X. Jia, B. De Brabandere, T. Tuytelaars, and L. V. Gool. Dynamic filter networks. In Advances in Neural Information Processing Systems, 2016.
  • Jouppi et al. [2017] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P.-l. Cantin, C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb, T. V. Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, C. R. Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski, A. Kaplan, H. Khaitan, D. Killebrew, A. Koch, N. Kumar, S. Lacy, J. Laudon, J. Law, D. Le, C. Leary, Z. Liu, K. Lucke, A. Lundin, G. MacKean, A. Maggiore, M. Mahony, K. Miller, R. Nagarajan, R. Narayanaswami, R. Ni, K. Nix, T. Norrie, M. Omernick, N. Penukonda, A. Phelps, J. Ross, M. Ross, A. Salek, E. Samadiani, C. Severn, G. Sizikov, M. Snelham, J. Souter, D. Steinberg, A. Swing, M. Tan, G. Thorson, B. Tian, H. Toma, E. Tuttle, V. Vasudevan, R. Walter, W. Wang, E. Wilcox, and D. H. Yoon. In-datacenter performance analysis of a tensor processing unit. In ISCA, 2017.
  • Kaiser et al. [2019] L. Kaiser, M. Babaeizadeh, P. Milos, B. Osinski, R. H. Campbell, K. Czechowski, D. Erhan, C. Finn, P. Kozakowski, S. Levine, R. Sepassi, G. Tucker, and H. Michalewski. Model-based reinforcement learning for atari. arXiv, abs/1903.00374, 2019.
  • Kalchbrenner et al. [2016] N. Kalchbrenner, A. van den Oord, K. Simonyan, I. Danihelka, O. Vinyals, A. Graves, and K. Kavukcuoglu. Video pixel networks. In ICML, 2016.
  • Kay et al. [2017] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, A. Natsev, M. Suleyman, and A. Zisserman. The kinetics human action video dataset. arXiv, abs/1705.06950, 2017.
  • Kumar et al. [2019] M. Kumar, M. Babaeizadeh, D. Erhan, C. Finn, S. Levine, L. Dinh, and D. Kingma. Videoflow: A flow-based generative model for video. arXiv, abs/1903.01434, 2019.
  • Lee et al. [2018] A. X. Lee, R. Zhang, F. Ebert, P. Abbeel, C. Finn, and S. Levine. Stochastic adversarial video prediction. arXiv, abs/1804.01523, 2018.
  • Liu et al. [2017] Z. Liu, R. A. Yeh, X. Tang, Y. Liu, and A. Agarwala. Video frame synthesis using deep voxel flow. In ICCV, 2017.
  • Lu et al. [2017] J. Lu, C. Xiong, D. Parikh, and R. Socher. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In CVPR, 2017.
  • Mathieu et al. [2016] M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error. In ICLR, 2016.
  • Menick and Kalchbrenner [2019] J. Menick and N. Kalchbrenner. Generating high fidelity images with subscale pixel networks and multidimensional upscaling. In ICLR, 2019.
  • Oh et al. [2015] J. Oh, X. Guo, H. Lee, R. L. Lewis, and S. Singh. Action-conditional video prediction using deep networks in atari games. In NIPS, 2015.
  • Parikh et al. [2016] A. P. Parikh, O. Täckström, D. Das, and J. Uszkoreit. A decomposable attention model for natural language inference. In EMNLP, 2016.
  • Parmar et al. [2018] N. Parmar, A. Vaswani, J. Uszkoreit, L. Kaiser, N. Shazeer, and A. Ku. Image transformer. In ICML, 2018.
  • Pătrăucean et al. [2016] V. Pătrăucean, A. Handa, and R. Cipolla. Spatio-temporal video autoencoder with differentiable memory. In ICLR (Workshop track), 2016.
  • Reed et al. [2017] S. Reed, A. van den Oord, N. Kalchbrenner, S. G. Colmenarejo, Z. Wang, Y. Chen, D. Belov, and N. de Freitas. Parallel multiscale autoregressive density estimation. In ICML, 2017.
  • Srivastava et al. [2015] N. Srivastava, E. Mansimov, and R. Salakhudinov. Unsupervised learning of video representations using lstms. In ICML, 2015.
  • Tieleman and Hinton [2012] T. Tieleman and G. Hinton. Lecture 6.5-rmsprop, coursera: Neural networks for machine learning. University of Toronto, Technical Report, 2012.
  • Unterthiner et al. [2018] T. Unterthiner, S. van Steenkiste, K. Kurach, R. Marinier, M. Michalski, and S. Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv, abs/1812.01717, 2018.
  • van den Oord et al. [2016a] A. van den Oord, N. Kalchbrenner, L. Espeholt, K. Kavukcuoglu, O. Vinyals, and A. Graves. Conditional image generation with pixelcnn decoders. In NIPS, 2016a.
  • van den Oord et al. [2016b] A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural networks. In ICML, 2016b.
  • Vaswani et al. [2017] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In NIPS, 2017.
  • Vondrick et al. [2016] C. Vondrick, H. Pirsiavash, and A. Torralba. Generating videos with scene dynamics. In NIPS, 2016.
  • Wang and Li [2007] J. Wang and Q. Li. Video quality assessment using a statistical model of human visual speed perception. Journal of the Optical Society of America. A, Optics, image science, and vision, 24 12:B61–9, 2007.
  • Wang et al. [2004a] J. Wang, L. Lu, and A. C. Bovik. Video quality assessment based on structural distortion measurement. Sig. Proc.: Image Comm., 19, 2004a.
  • Wang et al. [2004b] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004b.
  • Xie et al. [2018] S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In ECCV, 2018.
  • Xingjian et al. [2015] S. Xingjian, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-c. Woo. Convolutional lstm network: A machine learning approach for precipitation nowcasting. In NIPS, 2015.
  • Zhang et al. [2018] R. Zhang, P. Isola, A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.

Appendix A Further benchmarks

Models                            Nats/Frame (lower is better)
Single Frame                      86.2
Spatial Subscaling                91.8
Spatiotemporal Subscaling         90.0
VPN Kalchbrenner et al. [2016]    87.6
Lower bound                       85.1 (86.3)
Table 2: Moving MNIST. Nats per frame averaged across 10 subsequent frames when priming with 10 initial frames. Best results in bold. The lower bound reported in Kalchbrenner et al. [2016] (in parentheses) is slightly higher than ours.

A.1 Moving MNIST

Moving MNIST Srivastava et al. [2015] consists of 100K training- and 10K validation/test videos of two handwritten digits from the MNIST benchmark that move deterministically across the frame, crossing each other and bouncing off the borders. The partial occlusion of crossing digits makes this dataset challenging. To be comparable with Kalchbrenner et al. [2016], we use the first ten frames as priming and predict the subsequent ten frames.

To allow direct comparison with Kalchbrenner et al. [2016], we change our loss to a “deterministic” loss (and derived nats-per-frame metric), which is defined as $H(z, y) = -\sum_i \left[ z_i \log y_i + (1 - z_i) \log (1 - y_i) \right]$, where $z_i$ are the gray-scale targets between 0.0 and 1.0, and $y_i$ are the predicted scalar intensities.
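
As a rough illustration of such a deterministic loss, the sketch below assumes it is the per-pixel binary cross-entropy between gray-scale targets and predicted intensities, summed over one frame and measured in nats. The function names and the clipping constant `eps` are illustrative choices, not taken from the paper.

```python
import math


def deterministic_nats_per_frame(targets, predictions, eps=1e-12):
    """Binary cross-entropy in nats, summed over the pixels of one frame.

    targets:     gray-scale values in [0, 1], treated as probabilities.
    predictions: predicted scalar intensities in [0, 1].
    eps:         clipping constant to avoid log(0) (an assumption).
    """
    total = 0.0
    for z, y in zip(targets, predictions):
        y = min(max(y, eps), 1.0 - eps)
        total -= z * math.log(y) + (1.0 - z) * math.log(1.0 - y)
    return total


def loss_lower_bound(targets):
    """The cross-entropy is smallest when the prediction equals the target."""
    return deterministic_nats_per_frame(targets, targets)


frame = [0.0, 1.0, 0.5]  # toy three-pixel "frame"
bound = loss_lower_bound(frame)
imperfect = deterministic_nats_per_frame(frame, [0.1, 0.9, 0.5])
print(bound, imperfect)
```

Because the targets are not strictly binary, even a perfect prediction incurs a non-zero loss, which is presumably why a lower bound is reported alongside the model results in Table 2.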

From Table 2, we find that like Kalchbrenner et al. [2016] our single frame prediction model (i.e., no subscaling) virtually solves the task in the sense that it almost matches the lower bound of the loss. However, this is not true for our subscaling models. Employing spatial subscaling on this task gives aliasing artifacts that make it harder to predict future frames. Although this finding is limited to Moving MNIST, it suggests that spatial subscaling can potentially hurt generation.

A.2 Robotic Pushing

Robotic Pushing Finn et al. [2016a] was used in prior work on autoregressive video generation Kalchbrenner et al. [2016]. The videos show a robotic arm pushing and grasping objects in a box and there are roughly 50K training videos and 1500 test videos with seen and novel objects, respectively. Following prior work, we use the initial two frames for priming and condition on the robot arm action for each frame as described in Section 3.2. We use the same setup as Kalchbrenner et al. [2016] with videos of twenty frames down-sampled to 64x64 with a Lanczos filter and anti-aliasing.

We report results to compare with prior work on autoregressive video generation by Kalchbrenner et al. [2016], who achieve 0.92 bits/dim (0.64 nats/dim) with 2 frames of priming on each of the test splits (one with objects seen during training and one with novel objects). We trained a large (2048 dimensional) spatiotemporal subscaling model which achieves 0.51 bits/dim on the subset with seen objects and 0.47 bits/dim on the subset with new objects, which corresponds to a reduction in bits/dim of roughly 45 and 49 percent, respectively.
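
The unit conversion and the relative improvement quoted above can be checked with a few lines of arithmetic. The helper names are illustrative; the numbers are the ones reported in the text.

```python
import math


def bits_to_nats(bits):
    """Convert a log-likelihood from bits to nats (1 bit = ln 2 nats)."""
    return bits * math.log(2)


def relative_reduction(baseline, ours):
    """Fraction by which `ours` is lower than `baseline`."""
    return 1.0 - ours / baseline


vpn_bits = 0.92        # Kalchbrenner et al. [2016], both test splits
seen_bits = 0.51       # our large model, seen objects
novel_bits = 0.47      # our large model, novel objects

vpn_nats = bits_to_nats(vpn_bits)
seen_reduction = relative_reduction(vpn_bits, seen_bits)
novel_reduction = relative_reduction(vpn_bits, novel_bits)
print(round(vpn_nats, 2), round(seen_reduction, 3), round(novel_reduction, 3))
```

Note that the reduction is computed on bits/dim directly, not on the exponentiated perplexity.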

Appendix B Hyper-Parameter Sweeps

Layers   bits/dim     Heads   bits/dim     Hidden size   bits/dim
4        1.63         4       1.59         256           1.65
8        1.55         8       1.55         512           1.55
16       1.48         16      1.51         1024          1.47
24       1.45         24      1.47         2048          1.40
Table 3: Ablation of hyper-parameter settings in terms of bits per dimension for models on 256 BAIR Robot Pushing validation videos. All models were primed on 1 frame and trained for 300K steps with a batch size of 64.

Table 3 shows the impact of different architectural settings. We see that the hidden size has the biggest impact, followed by the number of layers and heads. This is an interesting and important finding because increasing the hidden size (wider networks) requires more parallel compute, which modern deep learning hardware excels at. With wider networks, computation time grows sub-linearly, memory linearly and parameters partially quadratically. In contrast, all of these aspects grow linearly with deeper networks. For scaling up architectures, depth is therefore not the preferred option, as we suffer much more in terms of computation time while having fewer parameters.
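
To illustrate why parameters grow faster with width than with depth, the following sketch counts weights for a generic Transformer-style layer. This is an assumption for illustration only: it uses four square projection matrices for attention and a feed-forward block with a 4x inner dimension, ignores biases and embeddings, and is not the exact layer used in our models.

```python
def approx_layer_params(hidden_size, ffn_multiplier=4):
    """Approximate weight count of one generic Transformer layer.

    Assumptions (not from the paper): query, key, value and output
    projections are hidden_size x hidden_size, and the feed-forward
    block has an inner size of ffn_multiplier * hidden_size.
    """
    attention = 4 * hidden_size ** 2
    feed_forward = 2 * hidden_size * (ffn_multiplier * hidden_size)
    return attention + feed_forward


def approx_model_params(num_layers, hidden_size):
    """Layers are identical, so depth scales the count linearly."""
    return num_layers * approx_layer_params(hidden_size)


base = approx_model_params(num_layers=8, hidden_size=512)
wider = approx_model_params(num_layers=8, hidden_size=1024)
deeper = approx_model_params(num_layers=16, hidden_size=512)
print(base, wider / base, deeper / base)
```

Under these assumptions, doubling the hidden size quadruples the weight count, while doubling the number of layers only doubles it.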

In another experiment, we shuffled the arrangement of block sizes between layers and found that it made little difference: all results were within 0.01 bits/dim. However, our setup had the best overall performance.

Finally, we tried sampling temperature 0.9 and 1.0 only on the BAIR Robot Pushing validation set and found that temperature 0.9 consistently gave more robust predictions and better results on all extrinsic metrics.
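
For readers unfamiliar with sampling temperature, the sketch below shows the usual formulation: logits are divided by the temperature before the softmax, so a temperature below 1.0 sharpens the distribution. The function names and toy logits are illustrative and not taken from our implementation.

```python
import math
import random


def apply_temperature(logits, temperature):
    """Softmax over logits / temperature (numerically stabilised)."""
    scaled = [value / temperature for value in logits]
    largest = max(scaled)
    exps = [math.exp(value - largest) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def sample_with_temperature(logits, temperature, rng):
    """Draw one category index from the tempered distribution."""
    probs = apply_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.0]  # toy logits for three intensity values
probs_10 = apply_temperature(logits, 1.0)
probs_09 = apply_temperature(logits, 0.9)

rng = random.Random(0)
draw = sample_with_temperature(logits, 0.9, rng)
print(probs_10, probs_09, draw)
```

With temperature 0.9 the most likely value receives more probability mass than with 1.0, which reduces the chance of sampling unlikely values that could derail later predictions.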

Appendix C Connectivity in Block-local Self-Attention

Blind Spots.

Varying block sizes between layers in block-local self-attention can efficiently connect every pixel with every other pixel when no masking is employed. If masking is employed to respect the generation order (as in our slice decoder), block-local self-attention produces “blind spots” which lead to independence assumptions. To exemplify these special cases, consider position $(1, 0, 0)$, the top-left pixel of the second frame, and its direct predecessor in generation order $(0, H-1, W-1)$, the bottom-right pixel of the first frame. The only way to establish a connection between these two positions is through a direct connection, because masking prevents any indirect connection. Thus, there has to be one layer in which both of these pixels are in the same block. This block must at least stretch over the entire extent of both width and height (i.e., the full frame) as well as at least 2 time steps. Running full self-attention in such blocks can easily become prohibitive for large $H$ and $W$.
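
The blind-spot example can be checked with a small sketch. It assumes non-overlapping blocks aligned to the origin of a (time, height, width) grid and zero-based indexing; the video shape and block sizes below are hypothetical, and masking itself is not modeled, only block membership.

```python
def block_index(position, block_size):
    """Index of the block containing `position` along each dimension."""
    return tuple(p // b for p, b in zip(position, block_size))


def same_block(a, b, block_size):
    """True if both positions fall into the same attention block."""
    return block_index(a, block_size) == block_index(b, block_size)


def directly_connected(a, b, layer_block_sizes):
    """True if at least one layer places both positions in one block."""
    return any(same_block(a, b, size) for size in layer_block_sizes)


# Hypothetical video of 4 frames with 8x8 pixels.
T, H, W = 4, 8, 8
target = (1, 0, 0)               # top-left pixel of the second frame
predecessor = (0, H - 1, W - 1)  # bottom-right pixel of the first frame

partial_blocks = [(4, 4, 4), (2, 8, 4), (1, 8, 8)]
full_frame_blocks = partial_blocks + [(2, 8, 8)]

blind = not directly_connected(target, predecessor, partial_blocks)
covered = directly_connected(target, predecessor, full_frame_blocks)
print(blind, covered)
```

Only the block that spans the full frame and two time steps places both pixels together, which matches the requirement stated above.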


There seems to be no simple solution that solves the problem of blind spots completely. However, we can make sure that local dependencies up to a certain distance are all covered by increasing the kernel size of the initial, masked convolution in the decoder. It is also possible to combine block-local self-attention with its dual form, dilated self-attention in dimensions which connects all pixels at the same relative position within their respective block with each other. Finally, we find that it is important to avoid blocks of small sizes in any dimension (e.g., 1). That means, even if we stretch a block to the full extent of one dimension it is important to define sizes at least larger than 1 on all other dimensions to limit the number of unconnected pixels.

On the other hand, the independence assumptions due to masking do not seem to produce any systematic, visible artifacts in our samples. We believe this to be an interesting finding by itself as it shows that there is potential for parallelizing autoregressive video generation by systematically exploring further independence assumptions.

Appendix D Additional Findings

Below, we summarize some additional findings that may be of interest to some readers:

  • We found that using blocks stretching across a single time, row or column dimension is substantially worse than using blocks that stretch at least to some extent in all directions. This is likely due to the fact that future masking in the decoder imposes strong independence assumptions in this case, as discussed in Appendix C.

  • We found that RMSProp with momentum converges significantly faster than ADAM, which we tried with different learning rates and settings for $\beta_1$ and $\beta_2$.

  • We tried using continuous, rather than discretized one-hot, input channel representations, but this had an overall negative impact on both performance and sample quality.

  • We experimented with a gating mechanism in Eq. 3, such that the attention matrix is masked elementwise with to allow for not attending to any element, similar to sentinel attention Lu et al. [2017]. However, this had no effect on generation quality.

Appendix E Kinetics Cooking

We found that for many video-prefixes in Kinetics it is very hard for our model to predict continuations. For instance, main objects in the videos are too small or movement is too fast which results in very blurry frames or there is little to no movement at all. Figure 13 shows some examples. Therefore, we created a subset of cooking videos that we found to exhibit these problems to a lesser degree.

In particular we filtered videos whose label matched the following regular expression:


Note that we still train on the full Kinetics training set and only use the cooking set to showcase samples in some cases.

Appendix F Samples

Figures 5-8 show samples from our spatiotemporal subscaling and large spatiotemporal subscaling models on BAIR Robot Pushing. Figures 5 and 6 illustrate the fidelity and realism of the generated samples, whereas Figures 7 and 8 illustrate the diversity of samples.

Figures 9-11 show samples from our spatiotemporal subscaling model on cooking videos for Kinetics-600, while Figure 12 depicts samples from the single frame model. In each case, we prime on 5 frames and sample the next 11 frames. Each figure shows 16 different samples from the same model. As can be seen, the model is able to generate diverse continuations while retaining fidelity. For the single frame model, we observe strange color artifacts (exploding colors) which we attribute to the standard, raster-scan generation order of this model.

Figure 5: Samples of 30 future frames (showing every 4th frame) for 12 test videos with the spatiotemporal subscaling model, using 1 prime frame and temperature 0.9 on BAIR Robot Pushing.
Figure 6: Samples of 30 future frames (showing every 4th frame) for 12 test videos with the large spatiotemporal subscaling model, using 1 prime frame and temperature 0.9 on BAIR Robot Pushing.
Figure 7: 11 samples of 30 future frames (showing every 4th frame) for 1 test video (top row) with the spatiotemporal subscaling model, using 1 prime frame and temperature 0.9 on BAIR Robot Pushing.
Figure 8: 11 samples of 30 future frames (showing every 4th frame) for 1 test video (top row) with the large spatiotemporal subscaling model, using 1 prime frame and temperature 0.9 on BAIR Robot Pushing.
Figure 9: Samples of 11 future frames from the spatiotemporal subscaling model with 5 prime frames on 64x64 Kinetics.
Figure 10: Samples of 11 future frames from the spatiotemporal subscaling model with 5 prime frames on 64x64 Kinetics.
Figure 11: Samples of 11 future frames from the spatiotemporal subscaling model with 5 prime frames on 64x64 Kinetics.
Figure 12: Samples of 11 future frames from the single frame model with 5 prime frames on 64x64 Kinetics exhibiting strange color artifacts.
Panels: (a) Blurry and fast camera movement; (b) Blurry and fast camera movement; (c) Very little movement; (d) Very little movement and small objects.
Figure 13: Ground-truth (top) and 2 samples of 30 future frames (showing every 4th frame) demonstrating that random Kinetics videos do not always lend themselves as good prefixes for generating continuations.