Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation

05/19/2022
by   Vikram Voleti, et al.

Video prediction is a challenging task. Frames produced by current state-of-the-art (SOTA) generative models tend to be of poor quality, and generalization beyond the training data is difficult. Furthermore, existing prediction frameworks typically cannot also handle other video-related tasks such as unconditional generation or interpolation. In this work, we devise a general-purpose framework called Masked Conditional Video Diffusion (MCVD) for all of these video synthesis tasks, using a probabilistic conditional score-based denoising diffusion model conditioned on past and/or future frames. During training, we randomly and independently mask all the past frames or all the future frames. This novel but straightforward setup lets us train a single model that can execute a broad range of video tasks: future/past prediction, when only future/past frames are masked; unconditional generation, when both past and future frames are masked; and interpolation, when neither past nor future frames are masked. Our experiments show that this approach can generate high-quality frames for diverse types of videos. Our MCVD models are built from simple non-recurrent 2D-convolutional architectures, conditioning on blocks of frames and generating blocks of frames. We generate videos of arbitrary length autoregressively in a block-wise manner. Our approach yields SOTA results across standard video prediction and interpolation benchmarks, with training times of 1-12 days using ≤ 4 GPUs. Project page: https://mask-cond-video-diffusion.github.io
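The masking scheme and the block-wise rollout described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names (`mask_conditioning`, `task_for`, `rollout`), the masking probability `p_mask=0.5`, the use of `None` to stand for a masked block, and the `predict_block` callable (a stand-in for a trained diffusion sampler) are all assumptions made for illustration.

```python
import random


def task_for(mask_past, mask_future):
    """Map a masking pattern to the task named in the abstract."""
    if mask_past and mask_future:
        return "unconditional generation"
    if mask_future:
        return "future prediction"   # only future masked: condition on past
    if mask_past:
        return "past prediction"     # only past masked: condition on future
    return "interpolation"           # nothing masked: condition on both


def mask_conditioning(past, future, p_mask=0.5, rng=random):
    """Independently mask the whole past block and/or the whole future block.

    p_mask is an assumed value. A masked block is returned as None here;
    a real model would need some tensor-valued placeholder instead.
    """
    mask_past = rng.random() < p_mask
    mask_future = rng.random() < p_mask
    cond_past = None if mask_past else past
    cond_future = None if mask_future else future
    return cond_past, cond_future, task_for(mask_past, mask_future)


def rollout(predict_block, context, n_frames, k_cond=2):
    """Generate n_frames autoregressively, one block at a time.

    predict_block is a hypothetical callable that takes the last k_cond
    frames and returns a list with the next block of frames.
    """
    frames = list(context)
    generated = []
    while len(generated) < n_frames:
        block = predict_block(frames[-k_cond:])
        generated.extend(block)
        frames.extend(block)          # new frames become future context
    return generated[:n_frames]


if __name__ == "__main__":
    # Toy "frames" are integers; the toy predictor just counts upward.
    toy_predictor = lambda cond: [cond[-1] + 1, cond[-1] + 2, cond[-1] + 3]
    print(rollout(toy_predictor, context=[0, 1], n_frames=7))
    print(mask_conditioning(["p1", "p2"], ["f1", "f2"], rng=random.Random(0)))
```

Because the two masks are drawn independently, a single training loop sees all four conditioning patterns, which is what lets one model serve prediction, unconditional generation, and interpolation. At sampling time the caller fixes the pattern by choosing which blocks to supply.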
