Temporally Consistent Video Transformer for Long-Term Video Prediction

by Wilson Yan, et al.

Generating long, temporally consistent video remains an open challenge in video generation. Primarily due to computational limitations, most prior methods train on a small subset of frames and then extend to longer videos in a sliding-window fashion. Although these techniques may produce sharp videos, they have difficulty retaining long-term temporal consistency due to their limited context length. In this work, we present the Temporally Consistent Video Transformer (TECO), a vector-quantized latent dynamics video prediction model that learns compressed representations to efficiently condition on long videos of hundreds of frames during both training and generation. We use a MaskGit prior for dynamics prediction, which enables sharper and faster generation than prior work. Our experiments show that TECO outperforms SOTA baselines on a variety of video prediction benchmarks, ranging from simple mazes in DMLab and large 3D worlds in Minecraft to complex real-world videos from Kinetics-600. In addition, to better understand the ability of video prediction models to maintain temporal consistency, we introduce several challenging video prediction tasks consisting of agents randomly traversing 3D scenes of varying difficulty. These tasks form a challenging benchmark for video prediction in partially observable environments, where a model must understand which parts of a scene to re-create versus invent depending on its past observations or generations. Generated videos are available at https://wilson1yan.github.io/teco
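The abstract mentions a MaskGit prior for dynamics prediction. As background, MaskGit-style decoding fills in a fully masked sequence of discrete tokens over a few iterations, unmasking the most confident positions under a cosine schedule. The sketch below illustrates only that decoding loop; `toy_prior` is a hypothetical stand-in for the learned prior, which in TECO would condition on the compressed video history, and the function names and schedule are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

MASK = -1  # sentinel for masked token positions

def toy_prior(tokens, vocab_size, rng):
    """Hypothetical stand-in for a learned MaskGit prior: returns
    random logits per position. A real model would condition on the
    currently unmasked tokens (and, in TECO, the compressed history)."""
    return rng.standard_normal((len(tokens), vocab_size))

def maskgit_decode(seq_len, vocab_size, steps=8, seed=0):
    """Iterative confidence-based decoding in the MaskGit style:
    start fully masked, then unmask the most confident positions
    under a cosine schedule until every token is filled in."""
    rng = np.random.default_rng(seed)
    tokens = np.full(seq_len, MASK)
    for t in range(steps):
        logits = toy_prior(tokens, vocab_size, rng)
        probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
        sampled = probs.argmax(-1)          # greedy pick per position
        conf = probs.max(-1)                # confidence of each pick
        conf[tokens != MASK] = np.inf       # decoded positions stay fixed
        # cosine schedule: fraction of positions still masked after this step
        frac_masked = np.cos(np.pi / 2 * (t + 1) / steps)
        n_keep_masked = int(np.floor(frac_masked * seq_len))
        # keep the n_keep_masked least-confident positions masked
        order = np.argsort(-conf)           # most confident first
        for i in order[: seq_len - n_keep_masked]:
            if tokens[i] == MASK:
                tokens[i] = sampled[i]
    return tokens
```

Because several tokens are committed per step, generation takes a handful of passes rather than one pass per token, which is the source of the speedup over autoregressive decoding referenced in the abstract.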




