KeyIn: Discovering Subgoal Structure with Keyframe-based Video Prediction

by Karl Pertsch, et al.

Real-world image sequences can often be naturally decomposed into a small number of frames depicting interesting, highly stochastic moments (the keyframes) and the low-variance frames in between them. In image sequences depicting trajectories to a goal, keyframes can be seen as capturing the subgoals of the sequence, as they depict the high-variance moments of interest that ultimately lead to the goal. In this paper, we introduce a video prediction model that discovers the keyframe structure of image sequences in an unsupervised fashion. We do so using a hierarchical Keyframe-Intermediate model (KeyIn) that stochastically predicts keyframes and their offsets in time and then uses these predictions to deterministically predict the intermediate frames. We propose a differentiable formulation of this problem that allows us to train the full hierarchical model using a sequence reconstruction loss. We show that our model is able to find meaningful keyframe structure in a simulated dataset of robotic demonstrations and that these keyframes can serve as subgoals for planning. Our model outperforms other hierarchical prediction approaches for planning on a simulated pushing task.
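To make the idea of a differentiable keyframe placement more concrete, the following is a minimal sketch of one way such a formulation could look. It is not taken from the paper: the function names, the per-keyframe offset distributions, and the expected squared-error loss are all assumptions chosen for illustration. Each keyframe carries a probability distribution over its offset from the previous keyframe; chaining these gives a distribution over each keyframe's absolute timestep, and the reconstruction loss is the error against every frame weighted by that distribution. NumPy is used only to show the arithmetic; an autodiff framework would be needed to actually train anything.

```python
import numpy as np

def keyframe_time_distributions(offset_probs, horizon):
    """Hypothetical helper: turn per-keyframe offset distributions into
    distributions over absolute timesteps.

    offset_probs: (K, J) array, row k is a distribution over offsets 1..J
                  of keyframe k relative to keyframe k-1.
    horizon:      number of frames T in the sequence (timesteps 1..T).
    Returns a (K, T) array; probability mass beyond the horizon is dropped.
    """
    K, J = offset_probs.shape
    dists = np.zeros((K, horizon))
    prev = np.zeros(horizon + 1)
    prev[0] = 1.0  # position 0 = just before the first frame
    for k in range(K):
        cur = np.zeros(horizon + 1)
        for j in range(J):
            d = j + 1  # offset in frames
            if d > horizon:
                break
            # shift the previous keyframe's distribution by d frames
            cur[d:] += prev[: horizon + 1 - d] * offset_probs[k, j]
        dists[k] = cur[1:]
        prev = cur
    return dists

def soft_keyframe_loss(keyframes, time_dists, sequence):
    """Expected squared error of each predicted keyframe against the
    ground-truth frames, weighted by where the keyframe is believed to be.

    keyframes:  (K, D) predicted keyframes (flattened frames)
    time_dists: (K, T) output of keyframe_time_distributions
    sequence:   (T, D) ground-truth frames
    """
    sq_err = ((keyframes[:, None, :] - sequence[None, :, :]) ** 2).mean(axis=-1)
    return float((time_dists * sq_err).sum())

# Usage: two keyframes placed with certainty at timesteps 2 and 5.
rng = np.random.default_rng(0)
sequence = rng.normal(size=(6, 4))
offset_probs = np.array([[0.0, 1.0, 0.0],
                         [0.0, 0.0, 1.0]])
time_dists = keyframe_time_distributions(offset_probs, horizon=6)
loss = soft_keyframe_loss(sequence[[1, 4]], time_dists, sequence)
```

Because the loss is a smooth function of both the keyframe contents and the offset probabilities, gradients can in principle flow to both, which is the property the abstract's "differentiable formulation" refers to. How the paper itself achieves this may differ from this sketch.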