HARP: Autoregressive Latent Video Prediction with High-Fidelity Image Generator

by   Younggyo Seo, et al.

Video prediction is an important yet challenging problem; burdened with the tasks of generating future frames and learning environment dynamics. Recently, autoregressive latent video models have proved to be a powerful video prediction tool, by separating the video prediction into two sub-problems: pre-training an image generator model, followed by learning an autoregressive prediction model in the latent space of the image generator. However, successfully generating high-fidelity and high-resolution videos has yet to be seen. In this work, we investigate how to train an autoregressive latent video prediction model capable of predicting high-fidelity future frames with minimal modification to existing models, and produce high-resolution (256x256) videos. Specifically, we scale up prior models by employing a high-fidelity image generator (VQ-GAN) with a causal transformer model, and introduce additional techniques of top-k sampling and data augmentation to further improve video prediction quality. Despite the simplicity, the proposed method achieves competitive performance to state-of-the-art approaches on standard video prediction benchmarks with fewer parameters, and enables high-resolution video prediction on complex and large-scale datasets. Videos are available at https://sites.google.com/view/harp-videos/home.


page 1

page 5

page 8


Scaling Autoregressive Video Models

Due to the statistical complexity of video, the high degree of inherent ...

Predicting Video with VQVAE

In recent years, the task of video prediction-forecasting future video g...

Greedy Hierarchical Variational Autoencoders for Large-Scale Video Prediction

A video prediction model that generalizes to diverse scenes would enable...

High-fidelity 3D Human Digitization from Single 2K Resolution Images

High-quality 3D human body reconstruction requires high-fidelity and lar...

StyleGAN-V: A Continuous Video Generator with the Price, Image Quality and Perks of StyleGAN2

Videos show continuous events, yet most - if not all - video synthesis f...

High Fidelity Video Prediction with Large Stochastic Recurrent Neural Networks

Predicting future video frames is extremely challenging, as there are ma...

Please sign up or login with your details

Forgot password? Click here to reset