Learning Long-term Visual Dynamics with Region Proposal Interaction Networks

08/05/2020
by   Haozhi Qi, et al.
2

Learning long-term dynamics models is the key to understanding physical common sense. Most existing approaches on learning dynamics from visual input sidestep long-term predictions by resorting to rapid re-planning with short-term models. This not only requires such models to be super accurate but also limits them only to tasks where an agent can continuously obtain feedback and take action at each step until completion. In this paper, we aim to leverage the ideas from success stories in visual recognition tasks to build object representations that can capture inter-object and object-environment interactions over a long range. To this end, we propose Region Proposal Interaction Networks (RPIN), which reason about each object's trajectory in a latent region-proposal feature space. Thanks to the simple yet effective object representation, our approach outperforms prior methods by a significant margin both in terms of prediction quality and their ability to plan for downstream tasks, and also generalize well to novel environments. Our code is available at https://github.com/HaozhiQi/RPIN.

READ FULL TEXT

page 2

page 6

page 15

page 16

page 18

research
07/14/2021

Is Someone Speaking? Exploring Long-term Temporal Features for Audio-visual Active Speaker Detection

Active speaker detection (ASD) seeks to detect who is speaking in a visu...
research
03/31/2020

Long Short-Term Relation Networks for Video Action Detection

It has been well recognized that modeling human-object or object-object ...
research
02/23/2023

FTM: A Frame-level Timeline Modeling Method for Temporal Graph Representation Learning

Learning representations for graph-structured data is essential for grap...
research
12/18/2021

Exploiting Long-Term Dependencies for Generating Dynamic Scene Graphs

Structured video representation in the form of dynamic scene graphs is a...
research
01/05/2023

HierVL: Learning Hierarchical Video-Language Embeddings

Video-language embeddings are a promising avenue for injecting semantics...
research
03/15/2023

Relax, it doesn't matter how you get there: A new self-supervised approach for multi-timescale behavior analysis

Natural behavior consists of dynamics that are complex and unpredictable...
research
01/11/2022

Optimization Planning for 3D ConvNets

It is not trivial to optimally learn a 3D Convolutional Neural Networks ...

Please sign up or login with your details

Forgot password? Click here to reset