Revisiting Hierarchical Approach for Persistent Long-Term Video Prediction

by   Wonkwang Lee, et al.

Learning to predict the long-term future of video frames is notoriously challenging due to inherent ambiguities in the distant future and dramatic amplifications of prediction error through time. Despite the recent advances in the literature, existing approaches are limited to moderately short-term prediction (less than a few seconds), while extrapolating it to a longer future quickly leads to destruction in structure and content. In this work, we revisit hierarchical models in video prediction. Our method predicts future frames by first estimating a sequence of semantic structures and subsequently translating the structures to pixels by video-to-video translation. Despite the simplicity, we show that modeling structures and their dynamics in the discrete semantic structure space with a stochastic recurrent estimator leads to surprisingly successful long-term prediction. We evaluate our method on three challenging datasets involving car driving and human dancing, and demonstrate that it can generate complicated scene structures and motions over a very long time horizon (i.e., thousands frames), setting a new standard of video prediction with orders of magnitude longer prediction time than existing approaches. Full videos and codes are available at


page 3

page 4

page 10

page 11

page 13

page 14

page 22

page 25


Hierarchical Model for Long-term Video Prediction

Video prediction has been an active topic of research in the past few ye...

Memory Warps for Learning Long-Term Online Video Representations

This paper proposes a novel memory-based online video representation tha...

Taylor saves for later: disentanglement for video prediction using Taylor representation

Video prediction is a challenging task with wide application prospects i...

Hierarchical Long-term Video Prediction without Supervision

Much of recent research has been devoted to video prediction and generat...

Long-Term Video Interpolation with Bidirectional Predictive Network

This paper considers the challenging task of long-term video interpolati...

From Single to Multiple: Leveraging Multi-level Prediction Spaces for Video Forecasting

Despite video forecasting has been a widely explored topic in recent yea...

Video Prediction at Multiple Scales with Hierarchical Recurrent Networks

Autonomous systems not only need to understand their current environment...