Temporally Consistent Video Transformer for Long-Term Video Prediction

10/05/2022
by   Wilson Yan, et al.

Generating long, temporally consistent video remains an open challenge in video generation. Primarily due to computational limitations, most prior methods train on a small subset of frames and then extend to longer videos in a sliding-window fashion. Although these techniques can produce sharp videos, the limited context length makes it difficult to retain long-term temporal consistency. In this work, we present the Temporally Consistent Video Transformer (TECO), a vector-quantized latent dynamics video prediction model that learns compressed representations to efficiently condition on long videos of hundreds of frames during both training and generation. We use a MaskGit prior for dynamics prediction, which enables sharper and faster generation than prior work. Our experiments show that TECO outperforms SOTA baselines on a range of video prediction benchmarks, from simple mazes in DMLab to large 3D worlds in Minecraft to complex real-world videos from Kinetics-600. In addition, to better understand how well video prediction models capture temporal consistency, we introduce several challenging video prediction tasks in which agents randomly traverse 3D scenes of varying difficulty. This provides a challenging benchmark for video prediction in partially observable environments, where a model must decide which parts of a scene to re-create versus invent depending on its past observations or generations. Generated videos are available at https://wilson1yan.github.io/teco
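The MaskGit-style prior mentioned above decodes a frame's discrete latent codes in a few parallel refinement steps rather than one token at a time: all positions start masked, the model predicts every position at once, the most confident predictions are fixed, and the rest are re-masked for the next step. Below is a minimal illustrative sketch of that decoding loop. The transformer is replaced by a stand-in that returns random logits, and all names, sizes, and the cosine schedule constants are our own assumptions for illustration, not details taken from the paper.

```python
import numpy as np

MASK = -1          # sentinel for a not-yet-decoded position
VOCAB = 512        # assumed codebook size, for illustration
NUM_TOKENS = 64    # assumed tokens per frame (e.g. an 8x8 latent grid)
STEPS = 8          # number of parallel decoding iterations

rng = np.random.default_rng(0)

def dummy_logits(tokens):
    """Stand-in for the dynamics transformer: per-position logits.
    In TECO this would be conditioned on compressed past-frame latents."""
    return rng.normal(size=(NUM_TOKENS, VOCAB))

def maskgit_decode(steps=STEPS):
    tokens = np.full(NUM_TOKENS, MASK)
    for t in range(steps):
        logits = dummy_logits(tokens)
        # softmax over the codebook dimension
        probs = np.exp(logits - logits.max(-1, keepdims=True))
        probs /= probs.sum(-1, keepdims=True)
        sampled = np.array([rng.choice(VOCAB, p=p) for p in probs])
        conf = probs[np.arange(NUM_TOKENS), sampled]
        conf[tokens != MASK] = np.inf      # already-fixed tokens stay fixed
        # cosine schedule: fraction of positions still masked after this step
        keep_masked = int(np.floor(NUM_TOKENS * np.cos(np.pi / 2 * (t + 1) / steps)))
        # keep the highest-confidence predictions, re-mask the rest
        order = np.argsort(-conf)
        new_tokens = np.full(NUM_TOKENS, MASK)
        fixed = order[: NUM_TOKENS - keep_masked]
        new_tokens[fixed] = np.where(
            tokens[fixed] != MASK, tokens[fixed], sampled[fixed]
        )
        tokens = new_tokens
    return tokens

frame_codes = maskgit_decode()
```

Because the schedule reaches zero masked positions at the final step, the whole frame is decoded in `STEPS` forward passes instead of `NUM_TOKENS` autoregressive ones, which is the source of the speedup the abstract refers to.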

Related research

- Generating Long Videos of Dynamic Scenes (06/07/2022)
- Clockwork Variational Autoencoders (02/18/2021)
- Flexible Diffusion Modeling of Long Videos (05/23/2022)
- Long-horizon video prediction using a dynamic latent hierarchy (12/29/2022)
- Long-Term Temporally Consistent Unpaired Video Translation from Simulated Surgical 3D Data (03/31/2021)
- Understanding Road Layout from Videos as a Whole (07/02/2020)
- Video-ReTime: Learning Temporally Varying Speediness for Time Remapping (05/11/2022)
