
Temporally Consistent Video Transformer for Long-Term Video Prediction

10/05/2022
by Wilson Yan, et al.
UC Berkeley

Generating long, temporally consistent video remains an open challenge in video generation. Primarily due to computational limitations, most prior methods limit themselves to training on a small subset of frames that are then extended to generate longer videos in a sliding window fashion. Although these techniques may produce sharp videos, they have difficulty retaining long-term temporal consistency due to their limited context length. In this work, we present the Temporally Consistent Video Transformer (TECO), a vector-quantized latent dynamics video prediction model that learns compressed representations to efficiently condition on long videos of hundreds of frames during both training and generation. We use a MaskGit prior for dynamics prediction, which enables both sharper and faster generations compared to prior work. Our experiments show that TECO outperforms SOTA baselines on a variety of video prediction benchmarks, ranging from simple mazes in DMLab and large 3D worlds in Minecraft to complex real-world videos from Kinetics-600. In addition, to better understand the capabilities of video prediction models in modeling temporal consistency, we introduce several challenging video prediction tasks consisting of agents randomly traversing 3D scenes of varying difficulty. This presents a challenging benchmark for video prediction in partially observable environments, where a model must understand which parts of a scene to re-create versus invent depending on its past observations or generations. Generated videos are available at https://wilson1yan.github.io/teco

1 Introduction

Recent work in video prediction has seen tremendous progress (ho2022video; clark2019adversarial; yan2021videogpt; le2021ccvs; ge2022long; tian2021good; luc2020transformation) in producing high-fidelity and diverse samples on complex video data. This can largely be attributed to a combination of increased computational resources and more compute efficient high-capacity neural architectures. However, much of this progress has focused on generating short videos, where models can perform well by basing their predictions on only a handful of previous frames.

Video prediction models with short context windows can generate long videos in a sliding window fashion. While the resulting videos can look impressive at first sight, they lack temporal consistency. We would like models to predict temporally consistent videos — where the same content is generated if a camera pans back to a previously observed location. On the other hand, the model should imagine a new part of the scene for locations that have not yet been observed, and future predictions should remain consistent to this newly imagined part of the scene.

Prior work has investigated techniques for modeling long-term dependencies, such as temporal hierarchies (saxena2021clockwork) and strided sampling with frame-wise interpolation (ge2022long; hong2022cogvideo). Other methods train on sparse sets of frames selected out of long videos (harvey2022flexible; skorokhodov2021stylegan; clark2019adversarial; saito2018tganv2; yu2022generating), or model videos via compressed representations (yan2021videogpt; rakhimov2020latent; le2021ccvs; seo2022harp; gupta2022maskvit; walker2021predicting). Refer to Appendix H for a more detailed discussion of related work.

Despite this progress, many methods still have difficulty scaling to datasets with many long-range dependencies. While Clockwork-VAE (saxena2021clockwork) trains on long sequences, it is limited by training time (due to a recurrent architecture) and difficult to scale to more complex data. On the other hand, transformer-based methods over latent spaces (yan2021videogpt) scale poorly to long videos due to the quadratic complexity of attention, with long videos containing tens of thousands of tokens. Methods that train on subsets of tokens are limited by truncated backpropagation through time (hutchins2022block; rae2019compressive; dai2019transformer) or naive temporal operations (hawthorne2022general).

In this paper, we introduce Temporally Consistent Video Transformer (TECO), a vector-quantized latent dynamics model that effectively models long-term dependencies in a compact representation space using efficient transformers. The key contributions are summarized as follows:

  • We introduce TECO, an efficient and scalable video prediction model that learns a set of compressed VQ-latents to allow for efficient training and generation.

  • We propose several long-length video prediction datasets centered around 3D scenes in DMLab (beattie2016deepmind), Minecraft (guss2019minerl), and Habitat (szot2021habitat; habitat19iccv) to help better evaluate temporal consistency in video predictions.

  • We show that TECO has strong performance on a variety of difficult video prediction tasks, and is able to leverage long-term temporal context to generate high quality videos with consistency.

  • We provide several ablations that offer intuition for why TECO is able to generate more temporally consistent predictions, and how these insights can extend to future work in long-term video prediction.

2 Preliminaries

2.1 VQ-GAN

VQ-GAN (esser2021taming; van2017neural) is an autoencoder that learns to compress data into a set of discrete latents, consisting of an encoder $E$, decoder $G$, codebook $\mathcal{Z} = \{e_k\}_{k=1}^{K}$, and discriminator $D$. Given an image $x$, the encoder maps $x$ to its latent representation $\hat{z} = E(x)$, which is quantized by nearest neighbors lookup in the codebook of embeddings to produce $z_q$. The discretized latent is fed through the decoder $G$ to reconstruct $\hat{x} = G(z_q)$. A straight-through estimator (bengio2013estimating) is used to maintain gradient flow through the quantization step. The codebook optimizes the following loss:

$\mathcal{L}_{\mathrm{VQ}} = \lVert \mathrm{sg}[E(x)] - e \rVert_2^2 + \beta \lVert \mathrm{sg}[e] - E(x) \rVert_2^2$   (1)

where $\beta$ is a hyperparameter, $\mathrm{sg}[\cdot]$ denotes stop-gradient, and $e$ is the corresponding nearest-neighbors embedding from codebook $\mathcal{Z}$. For reconstruction, VQ-GAN replaces the original $\ell_2$ loss with a perceptual loss (zhang2018unreasonable), $\mathcal{L}_{\mathrm{LPIPS}}$. Finally, in order to encourage higher-fidelity samples, a patch-level discriminator $D$ is trained to classify between real and reconstructed images, with

$\mathcal{L}_{\mathrm{GAN}} = \log D(x) + \log\left(1 - D(\hat{x})\right)$   (2)

Overall, VQ-GAN optimizes the following combination of losses:

$\min_{E, G, \mathcal{Z}} \max_{D} \; \mathcal{L}_{\mathrm{LPIPS}} + \mathcal{L}_{\mathrm{VQ}} + \lambda \mathcal{L}_{\mathrm{GAN}}$   (3)

where $\lambda = \nabla_{G_L}[\mathcal{L}_{\mathrm{LPIPS}}] \,/\, \left(\nabla_{G_L}[\mathcal{L}_{\mathrm{GAN}}] + \delta\right)$ is an adaptive weight, $G_L$ is the last decoder layer, and $\delta$ is a small constant for numerical stability.
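
To make the quantization step and codebook loss concrete, the following NumPy sketch shows nearest-neighbor lookup, the two terms of Equation 1, and the straight-through trick; the shapes, the function name, and the simplified mean-reduction are illustrative assumptions, not the released implementation.

import numpy as np

def quantize_vq(z_e, codebook, beta=0.25):
    # z_e: (H, W, D) continuous encoder output E(x); codebook: (K, D) embeddings
    flat = z_e.reshape(-1, z_e.shape[-1])                                # (H*W, D)
    dists = ((flat[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)     # (H*W, K)
    idx = dists.argmin(axis=1)                                           # nearest code per position
    z_q = codebook[idx].reshape(z_e.shape)

    # Equation 1: codebook term plus beta-weighted commitment term (sg[.] = stop-gradient)
    codebook_loss = ((z_q - z_e) ** 2).mean() + beta * ((z_e - z_q) ** 2).mean()

    # straight-through estimator: forward pass uses z_q, gradients flow into z_e
    z_q_st = z_e + (z_q - z_e)   # in an autodiff framework: z_e + stop_gradient(z_q - z_e)
    return z_q_st, idx.reshape(z_e.shape[:-1]), codebook_loss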

2.2 MaskGit

MaskGit (chang2022maskgit) is a generative model over discrete tokens, such as those produced by a VQ-GAN. Instead of modelling the sequence of tokens autoregressively, MaskGit generates images with competitive sample quality at a fraction of the sampling cost by using a masked token prediction objective during training. Formally, we denote $z = [z_i]_{i=1}^{N}$ as the discrete latent tokens representing an image. For each training step, we uniformly sample a mask ratio $r \in (0, 1]$ and randomly generate a mask $M$ with $\lceil \gamma(r) \cdot N \rceil$ masked values, where $\gamma$ is the mask schedule. Then, MaskGit learns to predict the masked tokens with the following objective:

$\mathcal{L}_{\mathrm{mask}} = -\,\mathbb{E}\Big[\textstyle\sum_{i \in [1, N],\, m_i = 1} \log p\big(z_i \mid z_{\bar{M}}\big)\Big]$   (4)

During inference, because MaskGit has been trained to model any set of unconditional and conditional probabilities, we can sample any subset of tokens per sampling iteration, from the extreme case of sampling all tokens at once (fully independent) to sampling one token at a time (fully autoregressive). chang2022maskgit introduces a confidence-based sampling mechanism, whereas other work (lee2022draft) proposes iterative sample-and-revise approaches.
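
As an illustration of confidence-based iterative decoding, the sketch below fills in the most confident sampled tokens at each step and re-masks the rest; the cosine schedule, the use of -1 as the mask marker, and the predict_probs stand-in for the trained network are assumptions made for exposition.

import numpy as np

def maskgit_decode(predict_probs, num_tokens, vocab_size, steps=8, seed=0):
    # predict_probs: callable mapping a (num_tokens,) array with -1 at masked
    # positions to per-position probabilities of shape (num_tokens, vocab_size)
    rng = np.random.default_rng(seed)
    tokens = np.full(num_tokens, -1, dtype=np.int64)        # start fully masked
    for step in range(steps):
        probs = predict_probs(tokens)                        # (N, V)
        sampled = np.array([rng.choice(vocab_size, p=p) for p in probs])
        filled = np.where(tokens == -1, sampled, tokens)     # keep already-committed tokens
        conf = probs[np.arange(num_tokens), filled]
        conf[tokens != -1] = np.inf                          # never re-mask committed tokens
        # cosine schedule: number of tokens that remain masked after this step
        n_mask = int(np.floor(num_tokens * np.cos(np.pi / 2 * (step + 1) / steps)))
        remask = np.argsort(conf)[:n_mask]                   # lowest-confidence positions
        tokens = filled
        tokens[remask] = -1
    return tokens                                            # fully decoded after the last step

# example of the interface with a uniform dummy predictor:
# dummy = lambda toks: np.full((toks.size, 512), 1.0 / 512)
# codes = maskgit_decode(dummy, num_tokens=256, vocab_size=512)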

(a) VQ-GAN
(b) TECO
Figure 2: The architectural design of TECO. Our proposed method models sequences of videos encoded with a pretrained VQ-GAN. We achieve efficient and scalable training and generation on long sequences through several key design choices that maximally compress our representations. We leverage temporal redundancy by encoding each frame conditioned on the previous one, and model temporal dependencies in a downsampled latent space. For fast sampling, we learn a MaskGit dynamics prior.

3 TECO

Generating temporally consistent videos requires training on long videos to correctly learn long-term temporal dependencies between frames. However, computational and memory requirements remain the primary bottleneck preventing models from doing so. We present the Temporally Consistent Video Transformer (TECO), a video generation model that scales more efficiently to training on longer-horizon videos.

First, we train a VQ-GAN to spatially compress our video data. As shown in prior work (seo2022harp), this is an important step for performing video prediction in a more efficient and scalable manner. However, even in latent space, existing methods are still limited to modeling short sequences of 16-24 frames, which can be attributed to the quadratic cost of transformer layers as sequence length grows. With 256 tokens per frame, 16-frame videos already consist of 4096 tokens, and scaling to longer videos of hundreds of frames is prohibitively expensive, since the resulting videos contain tens of thousands of tokens. Therefore, in the following sections, we propose several key design choices for building a more efficient video prediction model.

3.1 Vector-Quantized Latent Dynamics

Our proposed framework, shown in Figure 2, follows prior work in latent dynamics models (hafner2019learning; hafner2020mastering; saxena2021clockwork), with several key differences in architectural and latent variable design. Let $x_{1:T}$ denote a sequence of video frames encoded using a pretrained VQ-GAN. In the following sections, we motivate each component of our model, with several specific design choices to ensure efficiency and scalability. TECO consists of four components:

encoder (posterior): $z_t = E(x_t, x_{t-1})$,  temporal transformer: $h_t = H(z_{<t})$,  decoder (likelihood): $p(x_t \mid z_t, h_t)$,  dynamics prior: $p(z_t \mid h_t)$   (5)

Encoder Although VQ-GAN exploits spatial redundancies, we can achieve more compressed representations by leveraging temporal redundancy in video data. To do this, we learn a CNN encoder $E$ which encodes the current frame $x_t$ conditioned on the previous frame by channel-wise concatenating $x_{t-1}$, and then quantizes the output using codebook $\mathcal{C}$ to produce $z_t$. We apply the VQ loss defined in Equation 1 per timestep. In addition, we $\ell_2$-normalize the codebook and embeddings to encourage higher codebook usage (yu2021vector). Conditionally encoding nearby frames lets the model learn smaller latents, and provides a general way to take advantage of temporal redundancy. The first frame is concatenated with zeros and is not quantized, to prevent information loss. As we focus on video prediction, there is always at least one frame to condition on, so we do not need to predict the un-quantized representation of the first frame when computing the decoding and dynamics losses. Intuitively, this also does not burden the dynamics model with learning an unconditional prior.
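
A minimal sketch of how this conditional encoding could be wired is given below; encode and quantize are stand-ins for the CNN encoder and the codebook lookup, and the channel-last tensor layout is an assumption.

import numpy as np

def encode_video(frames, encode, quantize):
    # frames: (T, H, W, C) per-frame VQ-GAN codes/embeddings
    latents = []
    for t in range(frames.shape[0]):
        # the first frame is paired with zeros; later frames see their predecessor
        prev = frames[t - 1] if t > 0 else np.zeros_like(frames[0])
        z_e = encode(np.concatenate([frames[t], prev], axis=-1))  # channel-wise concat
        # the first frame is left un-quantized to avoid information loss
        latents.append(z_e if t == 0 else quantize(z_e))
    return np.stack(latents)                                      # z_{1:T}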

Temporal Transformer Compressed, discrete latents are more lossy and tend to require higher spatial resolutions compared to continuous latents. Therefore, before modeling temporal information, we apply a single strided convolution to downsample each discrete latent $z_t$, where visually simpler datasets allow for more downsampling and visually complex datasets require less downsampling. Afterwards, we learn a large transformer to model temporal dependencies, and then apply a transposed convolution to upsample the representation back to the original resolution of $z_t$. In summary, we use the following architecture:

$h_t = \mathrm{ConvT}\big(\mathrm{Transformer}\big(\mathrm{Conv}(z_{<t})\big)\big)$   (6)
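
The downsample, transform, upsample structure of Equation 6 can be sketched as follows, with downsample, transformer, and upsample standing in for the strided convolution, the causal temporal transformer, and the transposed convolution; the exact token layout inside the transformer is left abstract, as it is an implementation detail not specified here.

import numpy as np

def temporal_backbone(z, downsample, transformer, upsample):
    # z: (T, H, W, D) per-timestep latents from the encoder
    small = np.stack([downsample(z_t) for z_t in z])     # (T, H/k, W/k, D')
    hidden = transformer(small)                          # causal attention over timesteps
    return np.stack([upsample(h_t) for h_t in hidden])   # h_{1:T}, back at (H, W) resolution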

Decoder The decoder $D$ is an upsampling CNN that reconstructs $\hat{x}_t = D(z_t, h_t)$, where $z_t$ can be interpreted as the posterior of timestep $t$, and $h_t$ is the output of the temporal transformer, which summarizes information from previous timesteps. $z_t$ and $h_t$ are concatenated channel-wise before being fed into the decoder. Together with the encoder, the decoder optimizes the following cross-entropy reconstruction loss:

$\mathcal{L}_{\mathrm{recon}} = -\textstyle\sum_{t=1}^{T} \log p(x_t \mid z_t, h_t)$   (7)

which encourages the features $z_t$ to encode relative information between frames, since the temporal transformer can aggregate information over time.
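
The cross-entropy objective of Equation 7 amounts to the usual negative log-likelihood over the VQ-GAN code indices; a short NumPy sketch with assumed shapes:

import numpy as np

def reconstruction_loss(logits, codes):
    # logits: (T, H, W, K) decoder outputs over the VQ-GAN vocabulary
    # codes:  (T, H, W)    integer VQ-GAN indices of the ground-truth frames
    logits = logits - logits.max(axis=-1, keepdims=True)               # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
    nll = -np.take_along_axis(log_probs, codes[..., None], axis=-1)    # (T, H, W, 1)
    return nll.mean()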

Dynamics Prior Lastly, we use a MaskGit (chang2022maskgit) to model the dynamics prior, $p(z_t \mid h_t)$. In our experiments, we show that using a MaskGit prior allows for not just faster but also higher quality sampling compared to an autoregressive prior. During every training iteration, we use the same process as prior work to sample a random mask $M$ and optimize

$\mathcal{L}_{\mathrm{prior}} = -\textstyle\sum_{t=1}^{T} \mathbb{E}_{M}\Big[\sum_{i:\, m_i = 1} \log p\big(z_t^i \mid z_t^{\bar{M}}, h_t\big)\Big]$   (8)

where $h_t$ is concatenated channel-wise with the masked $z_t$ to predict the masked tokens. During generation, we follow lee2022draft, where we initially generate each frame in chunks of 8 at a time and then go through 2 revise rounds of re-generating half the tokens each time.
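
A sketch of one training step of the MaskGit prior for a single timestep (Equation 8); the cosine mask schedule, the -1 mask marker, and prior_logits as a stand-in for the network conditioned on $h_t$ are assumptions.

import numpy as np

def prior_loss_step(z_codes, h_t, prior_logits, rng):
    # z_codes: (N,) integer latent codes z_t; h_t: conditioning from the temporal transformer
    N = z_codes.shape[0]
    r = rng.uniform()                                      # mask ratio ~ U(0, 1)
    n_mask = max(1, int(np.ceil(np.cos(np.pi / 2 * r) * N)))
    mask = np.zeros(N, dtype=bool)
    mask[rng.choice(N, size=n_mask, replace=False)] = True

    masked = np.where(mask, -1, z_codes)                   # -1 marks [MASK]
    logits = prior_logits(masked, h_t)                     # (N, K), conditioned on h_t
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
    nll = -log_probs[np.arange(N), z_codes]
    return nll[mask].mean()                                # loss only on masked positions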

Training Objective The final objective is the sum of these losses:

$\mathcal{L}_{\mathrm{TECO}} = \mathcal{L}_{\mathrm{recon}} + \mathcal{L}_{\mathrm{prior}} + \mathcal{L}_{\mathrm{VQ}}$   (9)

3.2 DropLoss

To train the model efficiently on long videos, we propose DropLoss, a simple trick that allows for more scalable and efficient training. Due to its architectural design, TECO can be separated into two components: (1) learning temporal representations, consisting of the encoder and the temporal transformer, and (2) predicting future frames, consisting of the dynamics prior and decoder. We can increase training efficiency by dropping out random timesteps that are not decoded and are thus omitted from the reconstruction loss. For example, given a video of $T$ frames, we compute $z_t$ and $h_t$ for all $t \in \{1, \dots, T\}$, and then compute the losses $\mathcal{L}_{\mathrm{recon}}$ and $\mathcal{L}_{\mathrm{prior}}$ for only 10% of the indices. Because random indices are selected each iteration, the model still needs to learn to accurately predict all timesteps. This reduces training costs significantly because the decoder and dynamics prior require non-trivial computation. DropLoss is applicable both to a wide class of architectures and to tasks beyond video prediction.
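
A sketch of how DropLoss could slot into a training step is given below; the function names and loss callables are stand-ins, and only the 0.9 drop rate comes from Appendix J.

import numpy as np

def droploss_step(z, h, codes, recon_loss, prior_loss, drop_rate=0.9, rng=None):
    # z, h: (T, ...) latents and transformer outputs, computed for *all* timesteps
    # codes: (T, H, W) ground-truth VQ-GAN indices
    rng = rng if rng is not None else np.random.default_rng()
    T = z.shape[0]
    keep = max(1, int(round(T * (1.0 - drop_rate))))
    idx = np.sort(rng.choice(T, size=keep, replace=False))   # random ~10% of timesteps
    # the decoder and dynamics prior are only evaluated on the kept indices
    return recon_loss(z[idx], h[idx], codes[idx]) + prior_loss(z[idx], h[idx])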

Figure 3: Quantitative comparisons between TECO and baseline methods in long-horizon temporal consistency (left) and sampling speed (right). Our method is able to remain temporally consistent while still generating sharp samples with fast sampling speed.

4 Experiments

4.1 Datasets

We introduce three challenging video datasets to better measure long-range consistency in video prediction. We design these benchmarks around 3D environments in DMLab (beattie2016deepmind), Minecraft (guss2019minerl), and Habitat (habitat19iccv), with videos of agents randomly traversing different scenes of varying difficulty. These datasets require video prediction models to reproduce observed parts of a scene and to newly generate unobserved parts. In contrast, many existing video benchmarks do not have strong long-range dependencies, so a model with limited context is sufficient. Refer to Appendix I for further details on each dataset.

DMLab DeepMind Lab is a simulator that procedurally generates random 3D mazes with random floor and wall textures. We generate 40k action-conditioned videos of 300 frames of an agent randomly traversing mazes by choosing random points in the maze and navigating to them via the shortest path. We train all models for both action-conditioned and unconditional prediction (by periodically masking out actions) to enable both types of generation. We use both modes for evaluation since a video model may generate new parts of a scene that do not correlate with the action (e.g., running into a wall), which results in out-of-distribution errors. However, action-conditioning is useful with enough conditioned past context, and substantially lowers variance on PSNR, SSIM, and LPIPS evaluations.

Minecraft This popular game features procedurally generated 3D worlds that contain complex terrain such as hills, forests, rivers, and lakes. We collect 200k action-conditioned videos of 300 frames at 128 x 128 resolution in Minecraft's marsh biome. The player alternates between walking forward for a random number of steps and randomly rotating left or right, causing parts of the scene to go out of view and come back into view later. We train all models action-conditioned for ease of interpretation and evaluation, though it is generally easy for video models to learn these discrete actions unconditionally.

Habitat Habitat is a simulator for rendering trajectories through scans of real 3D scenes. We compile 1400 indoor scans from HM3D (ramakrishnan2021habitat), MatterPort3D (chang2017matterport3d), and Gibson (xia2018gibson) to generate 200k action-conditioned videos of 300 frames at a resolution of 128 x 128 pixels. We use the built-in path traversal algorithm provided in Habitat to construct action trajectories that move our agent between randomly sampled locations in each 3D scene. Similar to DMLab, we train all video models to perform both unconditional and action-conditioned prediction.

Kinetics-600 Kinetics-600 (carreira2017quo) is a highly complex real-world video dataset, originally proposed for action recognition. The dataset contains 400k videos of varying lengths up to 300 frames. We evaluate our method on video prediction without actions (as they do not exist), generating 80 future frames conditioned on 20. In addition, we filter out videos shorter than 100 frames, leaving 392k videos that are split for training and evaluation. We use a resolution of 128 x 128 pixels. Although Kinetics-600 does not have many long-range dependencies, we evaluate our method on this dataset to show that it can scale to complex, natural video.

DMLab Minecraft
Method FVD PSNR SSIM LPIPS FVD PSNR SSIM LPIPS
FitVid
CW-VAE
Perceiver AR
Latent FDM
TECO (ours)
Habitat Kinetics-600
Method FVD PSNR SSIM LPIPS FVD PSNR SSIM LPIPS
Perceiver AR
Latent FDM
TECO (ours)
Table 1: Quantitative evaluation on all four datasets. Detailed results in Appendix F.

4.2 Baselines

We compare against SOTA baselines selected from several different families of models: latent-variable-based variational models, autoregressive likelihood models, and diffusion models. In addition, for fairer comparisons, we train all models on VQ codes using the same VQ-GAN as our method. For our diffusion baseline, we follow rombach2022high and use a pretrained VAE instead of a VQ-GAN. Note that we do not include any GAN baselines since, to the best of our knowledge, there is no GAN that trains on a latent space instead of raw pixels, an important aspect for properly scaling to long video sequences.

FitVid FitVid (babaeizadeh2021fitvid) is a state-of-the-art variational video prediction model based on CNNs and LSTMs that scales to complex video by leveraging efficient architectural design choices in its encoder and decoder.

Clockwork VAE CW-VAE (saxena2021clockwork) is also a variational video prediction model, designed to better learn long-range dependencies through a hierarchy of latent variables with exponentially slower tick speeds at each level.

Perceiver AR We use Perceiver AR (hawthorne2022general) as our AR baseline over VQ-GAN discrete latents, which has been shown to be an effective generative model that can efficiently incorporate long-range sequential dependencies. Conceptually, this baseline is similar to HARP (seo2022harp) with a Perceiver AR as the prior instead of a sparse transformer (child2019generating). We choose Perceiver AR over other autoregressive baselines such as VideoGPT (yan2021videogpt) or TATS (ge2022long) primarily due to the prohibitive cost of transformers when applied to tens of thousands of tokens.

Latent FDM For our diffusion baseline, we train a Latent FDM model with frame-wise autoregressive sampling. Although FDM (harvey2022flexible) is originally trained on pixel observations, we also train it in latent space for a fairer comparison with our method and other baselines, as training on long sequences in pixel space is too expensive. We follow LDM (rombach2022high) and separately train an autoencoder to encode each frame into a set of continuous latents.

4.3 Experimental Setup

Training All of our models are trained for 1 million iterations under a fixed compute budget (measured in TPU-v3 days) allocated for each dataset. Models are trained on TPU-v3 instances, ranging from v3-8 to v3-128 pods (roughly comparable to 4 to 64 V100s), with training times of roughly 3-5 days. For DMLab, Minecraft, and Habitat we train all models on full 300-frame videos, and on 100-frame videos for Kinetics-600. Our VQ-GANs are trained on 8 A5000 GPUs, taking about 2-4 days for each dataset, and downsample all videos to 16 x 16 grids of discrete latents per frame regardless of the original video resolution. More details on exact hyperparameters and compute budgets for each dataset can be found in Appendix J.

Evaluation We evaluate our models using a combination of standard video prediction metrics: PSNR (huynh2008scope), SSIM (wang2004image), LPIPS (zhang2018unreasonable), and FVD (unterthiner2019fvd). For DMLab, Minecraft, and Habitat, we measure FVD on 300-frame videos, conditioned on 36 frames (264 predicted frames). For Kinetics-600, we evaluate FVD on 100-frame videos, conditioned on 20 frames (80 predicted frames). To evaluate temporal consistency, we measure PSNR, SSIM, and LPIPS on video predictions conditioned on 144 frames (156 predicted frames), and action-condition all models. Conditioning on a large portion of the video ensures that the model can observe a large part of the scene; combined with action-conditioning, a temporally consistent model should then generate future frames close to the ground truth. Due to this reduced stochasticity, we only sample one prediction for computing PSNR, SSIM, and LPIPS. We compute all metrics over batches of examples, averaged over 4 runs.
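
As a reference for how the per-frame consistency metrics are computed over the predicted horizon, a PSNR sketch is given below, assuming frames normalized to [0, 1]; SSIM and LPIPS follow the cited implementations.

import numpy as np

def per_frame_psnr(pred, target, max_val=1.0):
    # pred, target: (T_pred, H, W, C) predicted and ground-truth frames
    mse = ((pred - target) ** 2).mean(axis=(1, 2, 3))
    return 10.0 * np.log10(max_val ** 2 / np.maximum(mse, 1e-10))

# protocol sketch: condition on the first 144 frames, score the 156 predictions
# scores = per_frame_psnr(prediction[144:300], ground_truth[144:300])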

4.4 Benchmark Results

DMLab & Minecraft Table 1 shows quantitative results on the DMLab and Minecraft datasets. TECO performs the best across all metrics for both datasets when training on the full 300-frame videos. Figure 4 shows sample trajectories and 3D visualizations of the generated DMLab mazes, where TECO is able to generate more stable and consistent 3D mazes. For both datasets, CW-VAE, FitVid, and Perceiver AR can produce sharp predictions, but do not model long-horizon context well, with per-frame metrics dropping sharply as the prediction horizon increases, as seen in Figure B.1. Latent FDM makes consistent predictions but has high FVD, most likely because FVD is sensitive to high-frequency errors.

In order to better investigate the scaling properties of our models, Figure C.1 and Figure C.2 compare TECO and Latent FDM across different training sequence lengths. Intuitively, under a fixed computation budget and batch size, models that train on shorter sequences can be scaled larger, with more FLOPs allocated per frame. In model architectures, this is generally reflected through computation at higher spatial resolutions (e.g., less downsampling). For DMLab, we see that in terms of per-frame metrics, models generally benefit from training on longer videos, where more computation per image has less of an effect due to saturation in image quality given the relatively simple visual complexity. For Minecraft, we observe that models generally perform best when training with 100 frames of context, which gives better per-image sample quality compared to training on 300 frames due to the higher downsampling required for longer sequences. Models trained on 300 frames generally show more distortion in their predictions than those trained on 100 frames. Theoretically, as the compute budget is increased, training on 300 frames would eventually outperform training on 100 frames.

Figure 4: 3D visualization of predicted trajectories in DMLab for each model, generating 264 frames conditioned on 36. Note that predictions do not necessarily match ground truth since new parts of a maze can be generated given limited context. Video predictions use only the first-person RGB frames. Refer to Section I.1 for more details on 3D evaluation. A video corresponding to this figure is available at: https://wilson1yan.github.io/teco.

Habitat Table 1 shows results for our Habitat dataset. We only evaluate our strongest baselines, Perceiver AR and Latent FDM, due to the need to implement model-parallel variants of all models. Because of the high complexity of Habitat videos, all models perform roughly equally poorly on per-frame metrics. However, TECO has significantly better FVD. Qualitatively, Latent FDM quickly collapses to blurred predictions with poor sample quality, and although Perceiver AR can generate high-quality frames, they are less temporally consistent than TECO's: agents in Habitat videos navigate to far points in the scene and back, whereas Perceiver AR tends to generate samples where the agent is constantly turning. TECO generates traversals of a scene that match the data distribution more closely.

Kinetics-600 Table 1 shows FVD for predicting 80 frames conditioned on 20 for Kinetics-600. Although Kinetics-600 does not have many long-range dependencies, we found that TECO produces more stable generations that degrade more slowly by incorporating longer contexts. In contrast, Perceiver AR tends to degrade quickly, with Latent FDM performing in between. Figure F.1 and Table F.4 include further investigations using top-k sampling for Perceiver AR and TECO. Table 1 does not use top-k sampling for a fair comparison against Latent FDM. With top-k sampling, Perceiver AR outperforms our method at certain values of k. However, the resulting videos tend to be uninteresting, with little to no movement in the dynamics.

Sampling Speed Figure 3 compares sampling speed for all models. We report sampling speed on Minecraft and observed similar results for the model sizes used on other datasets. FitVid and CW-VAE are both significantly faster than the other methods, but have poor sample quality. On the other end, Perceiver AR and Latent FDM can produce high-quality samples, but are 20-60x slower than TECO, which has comparably fast sampling speed while retaining high sample quality.

4.5 Ablations

In this section, we perform ablations on various architectural decisions of our model. For simplicity, we evaluate our methods on short sequences of 16 frames from Something-Something-v2 (SSv2). We choose SSv2 as it provides insight into scaling our method on complex real-world data more similar to Kinetics-600 while being computationally cheaper to run.

Table E.1 shows several ablations comparing posterior, prior, and various architectural design choices. We demonstrate that using VQ-latent dynamics with a MaskGit prior proves better than alternative formulations for latent dynamics models, such as popular variational methods. In addition, we show that conditional encodings learn better representations for video prediction. We also ablate the codebook size, showing that although there exists an optimal codebook size, it does not matter much as long as there are not too many codes, which can make it harder for the prior to learn. Lastly, we show the benefits of DropLoss, with up to 60% faster training and a minimal increase in FVD. The benefits are greater for longer sequences, allowing video models to better account for long-horizon context at little cost in performance.

Table E.2 shows ablations on scaling different parts of our model: the encoder, decoder, temporal transformer, and prior. In general, it is more beneficial to have an imbalanced encoder-decoder architecture, with more parameters in the decoder. For the temporal transformer, larger-resolution features are more beneficial, especially for more complex data like SSv2, and less useful for visually simpler datasets such as DMLab or Minecraft. Similarly, a larger width is more beneficial than more layers, due to the increased capacity to represent each frame. Lastly, for scaling the MaskGit prior, more layers are better than a larger width.

5 Discussion

We introduced TECO, an efficient video prediction model that leverages hundreds of frames of temporal context. Our evaluation demonstrated that TECO accurately incorporates long-range context, outperforming SOTA baselines across a wide range of datasets. In addition, we introduce several difficult video datasets, which we hope will make it easier to evaluate temporal consistency in future video prediction models. We identify several limitations as directions for future work:

  • Although we show that PSNR, SSIM, and LPIPS can be reliable metrics for measuring consistency when video models are properly conditioned, there remains room for better evaluation metrics that provide a reliable signal as the prediction horizon grows, since newly generated parts of a scene are unlikely to correlate with the ground truth.

  • Our focus was on learning compressed tokens and an expressive prior, which we combined with a simple full-attention transformer as the sequence model. Leveraging prior work on efficient sequence models (choromanski2020rethinking; wang2020linformer; zhai2021attention; gu2021efficiently; hawthorne2022general) would likely allow for further scaling.

  • We trained all models on top of pretrained VQ-GAN codes to reduce data dimensionality. This compression step lets us train on longer sequences at the cost of reconstruction error, which causes noticeable artifacts in Kinetics-600, such as corrupted text and incoherent faces. Although TECO can train directly on pixels, an $\ell_2$ loss results in slightly blurry predictions. Training directly on pixels with diffusion or GAN losses would be promising.

6 Acknowledgements

This work was in part supported by Panasonic through BAIR Commons, DARPA RACER, the Hong Kong Centre for Logistics Robotics, and BMW. In addition, we thank the TRC program (https://sites.research.google/trc/about/) and Cirrascale Cloud Services (https://cirrascale.com/) for providing compute resources.

References

Appendix A Samples

A.1 DMLab

Figure A.1: 156 frames generated conditioned on 144 (action-conditioned)
Figure A.2: 264 frames generated conditioned on 36 (no action-conditioning)
Figure A.3: 3D visualizations of the resulting generated DMLab mazes

A.2 Minecraft

Figure A.4: 156 frames generated conditioned on 144 (action-conditioned)
Figure A.5: 264 frames generated conditioned on 36 (action-conditioned)

A.3 Habitat

Figure A.6: 156 frames generated conditioned on 144 (action-conditioned)
Figure A.7: 264 frames generated conditioned on 36 (no action-conditioning)

A.4 Kinetics-600

Figure A.8: 80 frames generated conditioned on 20 (no top-k sampling)
Figure A.9: 80 frames generated conditioned on 20 (with top-k sampling)

Appendix B Performance versus Horizon

(a) DMLab

(b) Minecraft

(c) Habitat
Figure B.1: All plots show PSNR, SSIM, and LPIPS on 150 predicted frames conditioned on 144 frames. The 144 conditioned frames are not shown in the graphs, and timestep 0 corresponds to the first predicted frame.

Figure B.1 shows PSNR, SSIM, and LPIPS as a function of prediction horizon for each dataset. Generally, each plot reflects the corresponding aggregated metrics in Table 1. For DMLab, TECO shows much better temporal consistency over the full trajectory, with Latent FDM coming in second. CW-VAE is able to retain some consistency but drops off fairly quickly. Lastly, FitVid and Perceiver AR lose consistency very quickly. We see a similar trend in Minecraft, with Latent FDM coming closer to matching TECO. For Habitat, all methods generally have trouble producing consistent predictions, primarily due to the difficulty of the environment.

Appendix C Performance versus Training Sequence Length

Figure C.1: DMLab

Figure C.2: Minecraft

Figure C.1 and Figure C.2 compare performance when training models on different sequence lengths. Under a fixed compute budget and batch size, training on shorter videos enables us to scale to larger models; this can also be interpreted as more model capacity or FLOPs allocated per image. In general, training on shorter videos yields higher-quality frames (per image), but at the cost of worse temporal consistency due to the reduced context length. We can see a very clear trend in DMLab: TECO scales better to longer sequences and correspondingly benefits from them. Latent FDM has trouble when training on full sequences. We hypothesize that this may be because diffusion models are less amenable to downsampling, since they need to model and predict noise. In Minecraft, we see the best performance at around 50-100 training frames, where a model has higher-fidelity image predictions and still has sufficient context.

Appendix D Sampling

Sampling Time per Frame (ms)
TECO (ours)
Latent FDM
Perceiver-AR
CW-VAE
FitVid

Appendix E Ablations

DropLoss Rate FVD Train Step (ms)
0.8
0.6
0.4
0.2
0.0
(a) DropLoss Rates
Posteriors FVD
VQ (+ MaskGit prior) (ours)
OneHot (+ MaskGit prior)
OneHot (+ Block AR prior)
OneHot (+ Independent prior)
Argmax (+ MaskGit prior)
(b) Posteriors
Dynamics Prior FVD
MaskGit (ours)
Independent
Autoregressive
(c) Prior Networks
Conditional Encoding FVD
Yes (ours) 189
No 208
(d) Conditional Encoding
Number of Codes FVD
64 191
256 195
1024 186
4096 200
(e) VQ Codebook Size
Table E.1: Ablations comparing alternative prior, posterior, and codebook designs
FVD
Size
Base 204 189
Small Enc 214 191
Small Dec 232 198
(a) Encoder and Decoder
FVD
Layers Width
8 768 204 189
8 384 260 196
2 768 216 202
(b) Temporal Transformer
FVD
Layers Width
8 768 204 189
8 384 228 193
2 768 228 201
(c) MaskGit Prior
Table E.2: Ablations on scaling different parts of TECO.
FVD (↓) PSNR (↑) SSIM (↑) LPIPS (↓) Train Step Time (ms)
TECO (ours) 27.5
MaskGit
Autoregressive
Table E.3: DMLab comparisons against models similar to TECO but without latent dynamics: a MaskGit or AR model trained directly on VQ-GAN tokens.

Table E.3 shows comparisons between TECO and alternative architectures that do not use latent dynamics. Architecturally, MaskGit and Autoregressive are very similar to TECO, with a few small changes: (1) there is no CNN decoder, and (2) MaskGit and AR directly predict the VQ-GAN latents (as opposed to the learned VQ latents in TECO). In terms of training time, MaskGit and AR are a little slower since they operate on the larger VQ-GAN latents instead of TECO's compressed latents. In addition, conditioning for the AR model is done using cross attention, as channel-wise concatenation does not work well with unidirectional masking. Both models without latent dynamics have worse temporal consistency, as well as worse overall sample quality. We hypothesize that TECO has better temporal consistency because, without a latent bottleneck, much of a model's capacity is spent modeling the likelihood of imperceptible image and video statistics.

Appendix F Full Experimental Results

TPU-v3 Days Params FVD PSNR SSIM LPIPS
TECO (ours) 169M
Latent FDM 31M
Perceiver-AR 30M
CW-VAE 111M
FitVid 165M
Table F.1: DMLab
TPU-v3 Days Params FVD PSNR SSIM LPIPS
TECO (ours) 274M
Latent FDM 33M
Perceiver-AR 166M
CW-VAE 140M
FitVid 176M
Table F.2: Minecraft
TPU-v3 Days Params FVD PSNR SSIM LPIPS
TECO (ours) 386M
Latent FDM 87M
Perceiver-AR 200M
Table F.3: Habitat
TPU-v3 Days Params FVD
TECO (ours) 1.09B
Latent FDM 831M
Perceiver-AR 1.06B
(a) Using top-k sampling for Perceiver AR and TECO
TPU-v3 Days Params FVD
TECO (ours) 1.09B
Latent FDM 831M
Perceiver-AR 1.06B
(b) No top-k sampling
Table F.4: Kinetics
Figure F.1: FVD on Kinetics-600 with different top-k values for Perceiver-AR and TECO

Appendix G Scaling Results

TPU-v3 Days   Train Seq Len   Params   FVD   PSNR   SSIM   LPIPS
TECO (ours) 300 169M
200 169M
100 86M
50 195M
Latent FDM 300 31M
200 62M
100 80M
50 110M
Table G.1: DM Lab scaling
TPU-v3 Days   Train Seq Len   Params   FVD   PSNR   SSIM   LPIPS
TECO (ours) 300 274M
200 261M
100 257M
50 140M
Latent FDM 300 33M
200 80M
100 69M
50 186M
Table G.2: Minecraft scaling

Appendix H Related Work

Video Generation Prior video generation methods can be divided into a few classes of models: variational models, exact likelihood models, and GANs. SV2P (babaeizadeh2017stochastic), SVP (denton2018stochastic), SVG (villegas2019high), and FitVid (babaeizadeh2021fitvid) are variational video generation methods that model videos through stochastic latent dynamics, optimized using the ELBO (kingma2013auto) objective extended in time. SAVP (lee2018stochastic) adds an adversarial (goodfellow2014generative) loss to encourage more realistic and high-fidelity generation quality. Diffusion models (ho2020denoising; sohldickstein2014thermodynamics) have recently emerged as a powerful class of variational generative models which learn to iteratively denoise an initial noise sample to generate high-quality images. Several recent works extend diffusion models to video through temporal attention (ho2022video; harvey2022flexible), 3D convolutions (hoppe2022diffusion), or channel stacking (voleti2022masked). Unlike variational models, autoregressive (AR) models and flows (kumar2019videoflow) model videos by optimizing exact likelihood. Video Pixel Networks (kalchbrenner2017video) and Subscale Video Transformers (weissenborn2019scaling) autoregressively model each pixel. For more compute-efficient training, some prior methods (yan2021videogpt; le2021ccvs; seo2022harp; rakhimov2020latent; walker2021predicting) learn an AR model in the spatio-temporally compressed latent space of a discrete autoencoder, which has been shown to be orders of magnitude more efficient than pixel-based methods. Instead of a VQ-GAN, le2021ccvs learns a frame-conditional autoencoder through a flow mechanism. Lastly, GANs (goodfellow2014generative) offer an alternative approach to training video models. MoCoGAN (tulyakov2018mocogan) generates videos by disentangling style and motion. MoCoGAN-HD (tian2021good) extends efficiently to larger resolutions by learning to navigate the latent space of a pretrained image generator. TGANv2 (saito2018tganv2), DVD-GAN (clark2019adversarial), StyleGAN-V (skorokhodov2021stylegan), and TrIVD-GAN (luc2020transformation) introduce various methods to scale to complex video, such as sparse training or more efficient discriminator designs.

The main focus of this work is video prediction, a specific form of conditional video generation. Most prior methods are trained autoregressively in time, so they can be easily extended to video prediction. Video Diffusion, although trained unconditionally, proposes reconstruction guidance for prediction. GANs generally require training a separate model for video prediction. However, some methods such as MoCoGAN-HD and DI-GAN can approximate frame conditioning by inverting the generator to compute a corresponding latent for a frame.

Long-Horizon Video Generation CW-VAE (saxena2021clockwork) learns a hierarchy of stochastic latents to better model long-term temporal dynamics, and is able to generate videos with long-term consistency for hundreds of frames. TATS (ge2022long) extends VideoGPT to allow sampling of arbitrarily long videos using a sliding window. In addition, TATS and CogVideo (hong2022cogvideo) propose strided sampling as a simple method to incorporate longer-horizon context. StyleGAN-V (skorokhodov2021stylegan) and DI-GAN (yu2022generating) learn continuous-time representations for videos, which also allow sampling of arbitrarily long videos. brooks2022generating proposes an efficient video GAN architecture that can generate high-resolution videos of 128 frames on complex video data of dynamic scenes and horseback riding. FDM (harvey2022flexible) proposes a diffusion model that is trained to flexibly condition on a wide range of sampled frames to better incorporate context from arbitrarily long videos. lee2021revisiting leverages a hierarchical prediction framework using semantic segmentations to generate long videos.

Appendix I Dataset Details

I.1 DMLab

We generate random mazes split into four quadrants, with each quadrant containing a random combination of wall and floor textures. We generate 40k trajectories of 300 frames, each at 64 x 64 resolution. Actions in this environment consist of turning left, turning right, and walking forward. In order to maximally traverse the maze, we script an agent that navigates to the furthest unvisited point in the maze, with some added noise for stochasticity. Since the maze is a grid, we can easily hard-code a navigation policy to move to any specified point in the maze.

For 3D visualizations, we also collect depth, camera intrinsics, and camera extrinsics (pose) for each timestep. Given this information, we can project RGB points into a 3D coordinate space and reconstruct the maze as a 3D point cloud. Note that since videos are generated using only RGB as input, they do not have ground-truth depth and pose. Therefore, we train depth and pose estimators that are used during evaluation. Specifically, we train a depth estimator to map from an RGB frame to depth, and a pose estimator that takes in two adjacent RGB frames and predicts the relative change in orientation. During evaluation, we are given an initial ground-truth orientation, and we apply the predicted relative poses sequentially to the predicted frames.
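
The 3D visualization boils down to the standard pinhole unprojection of each frame's RGB-D pixels using the (estimated) intrinsics and pose. A sketch under those standard assumptions is given below; it is not the authors' exact evaluation code, and the argument names are illustrative.

import numpy as np

def unproject_frame(rgb, depth, K, cam_to_world):
    # rgb: (H, W, 3); depth: (H, W); K: (3, 3) intrinsics; cam_to_world: (4, 4) pose
    H, W = depth.shape
    v, u = np.mgrid[0:H, 0:W]
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T                  # pixel -> camera-space direction
    pts_cam = rays * depth.reshape(-1, 1)            # scale by (predicted) depth
    pts_hom = np.concatenate([pts_cam, np.ones((pts_cam.shape[0], 1))], axis=1)
    pts_world = (pts_hom @ cam_to_world.T)[:, :3]    # apply estimated camera pose
    return pts_world, rgb.reshape(-1, 3)             # colored point cloud for this frame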

Although GQN Mazes (eslami2018neural) already exists as a video prediction dataset, it is difficult to properly measure temporal consistency on it. The 3D scenes are relatively simple, and it does not have actions to help reduce stochasticity when using metrics such as PSNR, SSIM, and LPIPS. As a result, FVD is the only reliable metric for GQN Mazes, but it tends to be sensitive to noise in video predictions. In addition, we can perform 3D visualizations using our dataset that are not possible with GQN Mazes.

I.2 Minecraft

We generate 200k trajectories (each from a different Minecraft world) of 300 frames in the Minecraft marsh biome. We hard-code an agent to randomly traverse its surroundings by taking left, right, and forward actions with different probabilities. In addition, we let the agent constantly jump, which we found to help it traverse small hills and prevent it from drowning. We specifically chose the marsh biome, as it contains hilly terrain with sparse collections of trees that act as clear landmarks for consistent generation. Forest and jungle biomes tend to be too dense for any meaningfully clear consistency, as all surroundings look nearly identical. On the other hand, plains biomes have the opposite issue, where the surroundings are completely flat. Mountain biomes are too hilly and difficult to traverse.

We opt to introduce an alternative to MineRL Navigate (guss2019minerl), since that dataset primarily consists of human demonstrations of people navigating to specific points. This means that trajectories usually follow a relatively straight line, so there are not many long-term dependencies in the dataset, as only a few past frames of context are necessary for prediction.

I.3 Habitat

Habitat is a 3D simulator that can render realistic trajectories in scans of 3D scenes. We compile roughly 1400 3D scans from HM3D (ramakrishnan2021habitat), MatterPort3D (chang2017matterport3d), and Gibson (xia2018gibson), and generate a total of 200k trajectories of 300 frames. We use the built-in path traversal algorithm provided in Habitat to construct action trajectories that move our agent between randomly sampled locations in each 3D scene. Similar to Minecraft and DMLab, the agent action space consists of turning left, turning right, and moving forward.

Appendix J Hyperparameters

J.1 VQ-GAN & VAE

DMLab / Minecraft Habitat / Kinetics-600
GPU Days 16 32
Resolution 64 / 128 128
Batch Size 64 64
LR
Num Res Blocks 2 2
Attention Resolutions 16 16
Channel Mult 1,2,2,2 1,2,3,4
Base Channels 128 128
Latent Size (VQ-GAN)
Embedding Dim (VQ-GAN) 256 256
Codebook Size (VQ-GAN) 1024 8192
Latent Size (VAE)

J.2 TECO

Hyperparameters DMLab Minecraft Habitat Kinetics-600
TPU-v3 Days 32 80 275 640
Params 169M 274M 386M 1.09B
Resolution 64 128 128 128
Batch Size 32 32 32 32
Sequence Length 300 300 300 100
LR
LR Schedule cosine cosine cosine cosine
Warmup Steps 10k 10k 10k 10k
Total Training Steps 1M 1M 1M 1M
DropLoss Rate 0.9 0.9 0.9 0.9
Encoder Depths 256, 512 256, 512 256, 512 256, 512
Blocks 2 4 4 8
Codebook Size 1024 1024 1024 1024
Embedding Dim 32 32 32 32
Decoder Depths 256, 512 256, 512 256, 512 256, 512
Blocks 4 8 8 10
Temporal
Transformer
Downsample Factor 8 8 4 2
Hidden Dim 1024 1024 1024 1536
Feedforward Dim 4096 4096 4096 6144
Heads 16 16 16 24
Layers 8 12 8 24
Dropout 0 0 0 0
MaskGit Mask Schedule cosine cosine cosine cosine
Hidden Dim 512 768 1024 1024
Feedforward Dim 2048 3072 4096 4096
Heads 8 12 16 16
Layers 8 6 16 24
Dropout 0 0 0 0
Train Sequence Length
(Fewer FLOPs per Frame)
Hyperparameters 300 200 100 50
TPU-v3 Days 32 32 32 32
Params 169M 169M 86M 195M
Resolution 64 64 64 64
Batch Size 32 32 32 32
LR
LR Schedule cosine cosine cosine cosine
Warmup Steps 10k 10k 10k 10k
Total Training Steps 1M 1M 1M 1M
DropLoss Rate 0.9 0.85 0.85 0.85
Encoder Depths 256, 512 256, 512 256, 512 256, 512
Blocks 2 2 2 2
Codebook Size 1024 1024 1024 1024
Embedding Dim 32 32 32 32
Decoder Depths 256, 512 256, 512 256, 512 256, 512
Blocks 4 4 4 4
Temporal
Transformer
Downsample Factor 8 8 2 2
Hidden Dim 1024 1024 512 1024
Feedforward Dim 4096 4096 2048 4096
Heads 16 16 8 16
Layers 8 8 8 8
Dropout 0 0 0 0
MaskGit Mask Schedule cosine cosine cosine cosine
Hidden Dim 512 512 512 768
Feedforward Dim 2048 2048 2048 3072
Heads 8 8 8 12
Layers 8 8 8 8
Dropout 0 0 0 0
Table J.1: Hyperparameters for scaling TECO on DMLab
Train Sequence Length
(Fewer FLOPs per Frame)
Hyperparameters 300 200 100 50
TPU-v3 Days 80 80 80 80
Params 274M 261M 257M 140M
Resolution 128 128 128 128
Batch Size 32 32 32 32
LR
LR Schedule cosine cosine cosine cosine
Warmup Steps 10k 10k 10k 10k
Total Training Steps 1M 1M 1M 1M
DropLoss Rate 0.9 0.85 0.25 0.25
Encoder Depths 256, 512 256, 512 256, 512 256, 512
Blocks 4 4 4 4
Codebook Size 1024 1024 1024 1024
Embedding Dim 32 32 32 32
Decoder Depths 256, 512 256, 512 256, 512 256, 512
Blocks 8 8 8 8
Temporal
Transformer
Downsample Factor 8 4 2 2
Hidden Dim 1024 1024 1024 512
Feedforward Dim 4096 4096 4096 2048
Heads 16 16 16 8
Layers 12 12 12 12
Dropout 0 0 0 0
MaskGit Mask Schedule cosine cosine cosine cosine
Hidden Dim 768 768 768 768
Feedforward Dim 3072 3072 3072 3072
Heads 12 12 12 12
Layers 6 6 6 8
Dropout 0 0 0 0
Table J.2: Hyperparameters for scaling TECO on Minecraft

J.3 Latent FDM

Hyperparameters DMLab Minecraft Habitat Kinetics-600
TPU-v3 Days 32 80 275 640
Params 31M 33M 87M 831M
Resolution 64 128 128 128
Batch Size 32 32 32 32
LR
LR Schedule cosine cosine cosine cosine
Optimizer Adam Adam Adam Adam
Warmup Steps 10k 10k 10k 10k
Total Training Steps 1M 1M 1M 1M
Base Channels 128 128 128 256
Num Res Blocks 1,1,1,2 1,1,2,2 1,2,2,4 2,2,2,2
Head Dim 64 64 64 64
Attention Resolutions 4,2 4,2 4,2 8,4,2
Dropout 0 0 0 0
Channel Mult 1,1,1,2 1,2,2,2 1,2,2,4 1,2,3,8
Table J.3: Hyperparameters for Latent FDM
Train Sequence Length
(Fewer FLOPs per Frame)
Hyperparameters 300 200 100 50
TPU-v3 Days 32 32 32 32
Params 31M 62M 80M 110M
Resolution 64 64 64 64
Batch Size 32 32 32 32
LR
LR Schedule cosine cosine cosine cosine
Optimizer Adam Adam Adam Adam
Warmup Steps 10k 10k 10k 10k
Total Training Steps 1M 1M 1M 1M
Base Channels 128 128 128 192
Num Res Blocks 1,1,1,2 1,1,2,2,4 2,2,2,2 3,3,3,3
Head Dim 64 64 64 64
Attention Resolutions 4,2 4,1 4,2 8,4,2
Dropout 0 0 0 0
Channel Mult 1,1,1,2 1,1,2,2,4 1,2,3,4 1,2,3,4
Table J.4: Hyperparameters for scaling Latent FDM on DMLab
Train Sequence Length
(Fewer FLOPs per Frame)
Hyperparameters 300 200 100 50
TPU-v3 Days 80 80 80 80
Params 33M 80M 69M 186M
Resolution 128 128 128 128
Batch Size 32 32 32 32
LR
LR Schedule cosine cosine cosine cosine
Optimizer Adam Adam Adam Adam
Warmup Steps 10k 10k 10k 10k
Total Training Steps 1M 1M 1M 1M
Base Channels 128 128 128 192
Num Res Blocks 1,1,2,2 2,2,2,2 3,3,3,3 2,2,2,2
Head Dim 64 64 64 64
Attention Resolutions 4,2 4,2 8,4,2 8,4,2
Dropout 0 0 0 0
Channel Mult 1,2,2,2 1,2,3,4 1,2,2,3 1,2,3,4
Table J.5: Hyperparameters for scaling Latent FDM on Minecraft

J.4 CW-VAE

Hyperparameters DMLab Minecraft
TPU-v3 Days 32 80
Params 111M 140M
Resolution 64 128
Batch Size 32 32
LR
LR Schedule cosine cosine
Optimizer Adam Adam
Warmup Steps 10k 10k
Total Training Steps 1M 1M
Encoder Kernels 4,4,4 4,4,4
Filters 256,512,1024 256,512,1024
Decoder Depths 256,512 256,512
Blocks 4 8
Dynamics Levels 3 3
Abs Factor 6 6
Enc Dense Layers 3 3
Enc Dense Embed 1024 1024
Cell Stoch Size 128 256
Cell Deter Size 1024 1024
Cell Embed Size 1024 1024
Cell Min Stddev 0.001 0.001
Table J.6: Hyperparameters for CW-VAE

J.5 FitVid

Hyperparameters DMLab Minecraft
TPU-v3 Days 32 80
Params 165M 176M
Resolution 64 128
Batch Size 32 32
LR
LR Schedule cosine cosine
Optimizer Adam Adam
Warmup Steps 10k 10k
Total Training Steps 1M 1M
g Dim 256 256
RNN Size 512 768
z Dim 64 128
Filters 128,128,256,512 128,128,256,512
Table J.7: Hyperparameters for FitVid