VideoGPT: Video Generation using VQ-VAE and Transformers

04/20/2021
by   Wilson Yan, et al.

We present VideoGPT: a conceptually simple architecture for scaling likelihood-based generative modeling to natural videos. VideoGPT uses a VQ-VAE that learns downsampled discrete latent representations of a raw video by employing 3D convolutions and axial self-attention. A simple GPT-like architecture is then used to autoregressively model the discrete latents using spatio-temporal position encodings. Despite the simplicity in formulation and ease of training, our architecture is able to generate samples competitive with state-of-the-art GAN models for video generation on the BAIR Robot dataset, and to generate high-fidelity natural videos from UCF-101 and the Tumblr GIF Dataset (TGIF). We hope our proposed architecture serves as a reproducible reference for a minimalistic implementation of transformer-based video generation models. Samples and code are available at https://wilson1yan.github.io/videogpt/index.html
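To make the two-stage idea in the abstract concrete, here is a minimal sketch of how a grid of encoder outputs could be turned into a token sequence for an autoregressive prior. It is an illustration under assumptions, not the authors' implementation: the function names `quantize` and `encode_to_tokens`, the toy codebook, the squared-L2 nearest-neighbour lookup, and the raster (time, row, column) ordering are all hypothetical choices, and the encoder with 3D convolutions and axial attention is replaced by hand-written vectors.

```python
# Illustrative sketch only: names, sizes, and ordering are assumptions,
# not taken from the VideoGPT code base.
from itertools import product


def quantize(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(vector, codebook[k]))


def encode_to_tokens(latents, codebook):
    """Map a T x H x W grid of encoder vectors to a flat token sequence.

    Tokens are emitted in raster order (time, then row, then column). Each
    token is paired with its (t, h, w) index so that a sequence model could
    attach spatio-temporal position information to it.
    """
    T, H, W = len(latents), len(latents[0]), len(latents[0][0])
    tokens, positions = [], []
    for t, h, w in product(range(T), range(H), range(W)):
        tokens.append(quantize(latents[t][h][w], codebook))
        positions.append((t, h, w))
    return tokens, positions


# Toy data: a codebook of 3 two-dimensional codes and a 2 x 1 x 2 latent grid.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
latents = [
    [[(0.1, -0.1), (0.9, 0.2)]],  # t = 0
    [[(0.2, 0.8), (1.1, 0.1)]],   # t = 1
]
tokens, positions = encode_to_tokens(latents, codebook)
print(tokens)     # [0, 1, 2, 1]
print(positions)  # [(0, 0, 0), (0, 0, 1), (1, 0, 0), (1, 0, 1)]
```

In this sketch, a GPT-like prior would be trained to predict each entry of `tokens` from the entries before it, and the VQ-VAE decoder would map sampled tokens back to video frames.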
