
VideoGPT: Video Generation using VQ-VAE and Transformers

by Wilson Yan, et al.

We present VideoGPT: a conceptually simple architecture for scaling likelihood-based generative modeling to natural videos. VideoGPT uses a VQ-VAE that learns downsampled discrete latent representations of a raw video by employing 3D convolutions and axial self-attention. A simple GPT-like architecture is then used to autoregressively model the discrete latents using spatio-temporal position encodings. Despite the simplicity in formulation and ease of training, our architecture is able to generate samples competitive with state-of-the-art GAN models for video generation on the BAIR Robot dataset, and generate high-fidelity natural videos from UCF-101 and the Tumblr GIF Dataset (TGIF). We hope our proposed architecture serves as a reproducible reference for a minimalistic implementation of transformer-based video generation models. Samples and code are available at
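
The quantization step that turns continuous encoder outputs into discrete latents can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `quantize` and `to_sequence`, the toy sizes, and the raster-order flattening are assumptions made for illustration, and the 3D-convolutional encoder, axial self-attention, decoder, and GPT-like prior are all omitted. The sketch only shows the nearest-codebook lookup commonly used in VQ-VAE and how the resulting grid of code indices could be flattened into a token sequence for an autoregressive model.

```python
import numpy as np


def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry.

    latents:  (T, H, W, D) array of continuous encoder outputs (hypothetical shape)
    codebook: (K, D) array of code vectors
    returns:  (T, H, W) integer array of code indices
    """
    flat = latents.reshape(-1, latents.shape[-1])  # (N, D)
    # Squared Euclidean distance between every latent and every code: (N, K).
    dists = ((flat[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1).reshape(latents.shape[:-1])


def to_sequence(indices):
    """Flatten the (T, H, W) index grid into a 1D token sequence.

    Raster order (time, then height, then width) is an assumption here.
    """
    return indices.reshape(-1)


rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))       # K=8 codes of dimension D=4 (toy sizes)
latents = rng.normal(size=(2, 3, 3, 4))  # T=2, H=3, W=3 downsampled grid
tokens = to_sequence(quantize(latents, codebook))
print(tokens.shape)  # (18,)
```

An autoregressive prior would then be trained to predict each token from the tokens that precede it, with position encodings indicating where each token sits in the time, height, and width grid.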



Video Diffusion Models

Generating temporally coherent high fidelity video is an important miles...

A Simple Generative Network

Generative neural networks are able to mimic intricate probability distr...

Scaling Autoregressive Video Models

Due to the statistical complexity of video, the high degree of inherent ...

T2M-GPT: Generating Human Motion from Textual Descriptions with Discrete Representations

In this work, we investigate a simple and must-known conditional generat...

Efficient Video Generation on Complex Datasets

Generative models of natural images have progressed towards high fidelit...

Generating Annotated High-Fidelity Images Containing Multiple Coherent Objects

Recent developments related to generative models have made it possible t...

Diagnosing and Enhancing VAE Models

Although variational autoencoders (VAEs) represent a widely influential ...

Code Repositories