Video Prediction at Multiple Scales with Hierarchical Recurrent Networks

03/17/2022
by   Ani Karapetyan, et al.

Autonomous systems need not only to understand their current environment, but also to predict future actions conditioned on past states, for instance based on captured camera frames. Some tasks require detailed predictions, such as future video frames, for the near future, whereas others also benefit from predicting more abstract representations over longer time horizons. However, existing video prediction models mainly focus on forecasting detailed possible outcomes over short time horizons, which limits their use for robot perception and spatial reasoning. We propose Multi-Scale Hierarchical Prediction (MSPred), a novel video prediction model that simultaneously forecasts possible future outcomes at different levels of granularity and at different time scales. By combining spatial and temporal downsampling, MSPred efficiently predicts abstract representations such as human poses or object locations over long time horizons, while maintaining competitive performance for video frame prediction. In our experiments, we demonstrate that the proposed model accurately predicts future video frames as well as other representations (e.g., keypoints or positions) in various scenarios, including bin-picking scenes and action recognition datasets, consistently outperforming popular approaches for video frame prediction. Furthermore, we conduct an ablation study to investigate the importance of the different modules and design choices in MSPred. In the spirit of reproducible research, we open-source VP-Suite, a general framework for deep-learning-based video prediction, as well as pretrained models to reproduce our results.
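The idea of combining spatial and temporal downsampling can be illustrated with a toy sketch. This is not the MSPred architecture and includes no recurrent cells or learned parameters; it only shows, under assumed settings, how higher levels of a hierarchy might see coarser frames at a lower rate. The function names, the 2x2 average pooling, and the update periods (1, 2, 4) are all hypothetical choices made for illustration.

```python
from typing import List

Frame = List[List[float]]  # a single-channel image as nested lists


def avg_pool2x(frame: Frame) -> Frame:
    """Halve spatial resolution by averaging non-overlapping 2x2 blocks."""
    h, w = len(frame), len(frame[0])
    return [
        [
            (frame[i][j] + frame[i][j + 1]
             + frame[i + 1][j] + frame[i + 1][j + 1]) / 4.0
            for j in range(0, w - 1, 2)
        ]
        for i in range(0, h - 1, 2)
    ]


def update_schedule(num_steps: int, periods: List[int]) -> List[List[int]]:
    """For each level, list the time steps at which that level updates."""
    return [[t for t in range(num_steps) if t % p == 0] for p in periods]


def multiscale_inputs(frames: List[Frame], periods: List[int]) -> List[List[Frame]]:
    """Level k receives frames pooled k times, only at steps divisible by periods[k].

    Assumption (illustrative only): level 0 is the finest and fastest level.
    """
    levels = []
    for level, period in enumerate(periods):
        level_frames = []
        for t, frame in enumerate(frames):
            if t % period != 0:
                continue  # temporal downsampling: skip steps for slower levels
            pooled = frame
            for _ in range(level):
                pooled = avg_pool2x(pooled)  # spatial downsampling per level
            level_frames.append(pooled)
        levels.append(level_frames)
    return levels


if __name__ == "__main__":
    # Eight constant 4x4 frames; frame t is filled with the value t.
    frames = [[[float(t)] * 4 for _ in range(4)] for t in range(8)]
    for k, lv in enumerate(multiscale_inputs(frames, [1, 2, 4])):
        print(f"level {k}: {len(lv)} frames of size {len(lv[0])}x{len(lv[0][0])}")
```

With these assumed settings, each higher level processes half as many time steps at half the resolution per side, which is the kind of saving that makes long-horizon prediction of coarse quantities cheaper than predicting every full-resolution frame.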

research
10/07/2019

Action-conditioned Benchmarking of Robotic Video Prediction Models: a Comparative Study

A defining characteristic of intelligent systems is the ability to make ...
research
02/23/2023

Object-Centric Video Prediction via Decoupling of Object Dynamics and Interactions

We propose a novel framework for the task of object-centric video predic...
research
08/20/2020

Learning to Abstract and Predict Human Actions

Human activities are naturally structured as hierarchies unrolled over t...
research
10/12/2021

Fourier-based Video Prediction through Relational Object Motion

The ability to predict future outcomes conditioned on observed video fra...
research
05/10/2021

Local Frequency Domain Transformer Networks for Video Prediction

Video prediction is commonly referred to as forecasting future frames of...
research
02/16/2020

AOL: Adaptive Online Learning for Human Trajectory Prediction in Dynamic Video Scenes

We present a novel adaptive online learning (AOL) framework to predict h...
research
12/09/2022

MIMO Is All You Need: A Strong Multi-In-Multi-Out Baseline for Video Prediction

The mainstream of the existing approaches for video prediction builds up...
