Scaling Autoregressive Video Models

06/06/2019
by Dirk Weissenborn, et al.

Due to the statistical complexity of video, the high degree of inherent stochasticity, and the sheer amount of data, generating natural video remains a challenging task. State-of-the-art video generation models attempt to address these issues by combining sometimes complex, often video-specific neural network architectures, latent variable models, adversarial training and a range of other methods. Despite their often high complexity, these approaches still fall short of generating high-quality video continuations outside of narrow domains and often struggle with fidelity. In contrast, we show that conceptually simple, autoregressive video generation models based on a three-dimensional self-attention mechanism achieve highly competitive results across multiple metrics on popular benchmark datasets, for which they produce continuations of high fidelity and realism. Furthermore, we find that our models are capable of producing diverse and surprisingly realistic continuations on a subset of videos from Kinetics, a large-scale action recognition dataset composed of YouTube videos exhibiting phenomena such as camera movement, complex object interactions and diverse human movement. To our knowledge, this is the first promising application of video generation models to videos of this complexity.
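
For readers unfamiliar with the term, the following is a generic sketch of what "autoregressive" means for video, not a formula taken from the paper. The symbols (T frames, H x W pixels, C channels) and the idea of flattening the video into a single ordered sequence are assumptions made for illustration; the abstract does not state the ordering the models actually use.

```latex
% Chain-rule factorization of a video x flattened into N scalar values.
% T, H, W, C and the ordering of the x_i are illustrative assumptions.
p(x) = \prod_{i=1}^{N} p\left(x_i \mid x_1, \dots, x_{i-1}\right),
\qquad N = T \cdot H \cdot W \cdot C
```

Under this factorization, a continuation is produced by fixing the values of the given frames and sampling the remaining values one at a time, each conditioned on everything generated so far.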


