Learning Spatial Common Sense with Geometry-Aware Recurrent Networks

by   Hsiao-Yu Fish Tung, et al.
Carnegie Mellon University

We integrate two powerful ideas, geometry and deep visual representation learning, into recurrent network architectures for mobile visual scene understanding. The proposed networks learn to "lift" 2D visual features and integrate them over time into latent 3D feature maps of the scene. They are equipped with differentiable geometric operations, such as projection, unprojection, egomotion estimation and stabilization, in order to compute a geometrically-consistent mapping between the world scene and their 3D latent feature space. We train the proposed architectures to predict novel image views given short frame sequences as input. Their predictions strongly generalize to scenes with a novel number of objects, appearances and configurations, and greatly outperform predictions of previous works that do not consider egomotion stabilization or a space-aware latent feature space. We train the proposed architectures to detect and segment objects in 3D, using the latent 3D feature map as input--as opposed to any input 2D video frame. The resulting detections are permanent: they continue to exist even when an object gets occluded or leaves the field of view. Our experiments suggest the proposed space-aware latent feature arrangement and egomotion-stabilized convolutions are essential architectural choices for spatial common sense to emerge in artificial embodied visual agents.






1 Introduction

Figure 1: Internet vision versus robotic vision. Pictures taken by humans (top row) (and uploaded on the web) are the output of visual perception of a well-trained agent, the human photographer. The content is skillfully framed and the objects appear in canonical scales and poses. Pictures taken by mobile agents, such as a NAO robot during a robot soccer game (bottom row), are the input to such visual perception. The objects are often partially occluded and appear in a wide variety of locations, scales and poses. We present recurrent neural architectures for the latter, that integrate visual information over time to piece together the visual story of the scene.

Current state-of-the-art visual systems [11] accurately detect object categories that are rare and unfamiliar to many of us, such as gyromitra, a particular genus of mushroom (Figure 1, top left). Yet they neglect the basic principles of object permanence and spatial awareness that a one-year-old child has already developed: once the camera turns away, or a person walks in front of the gyromitra, its detection disappears and is replaced by the objects detected in the new visual frame. We believe that the ability of current visual systems to detect rare and exquisite object categories, alongside their inability to carry out elementary spatial reasoning, stems from the fact that they are trained to label object categories in static Internet photos (contained in the ImageNet and COCO datasets) using a single frame as input.

Our overexposure to Internet photos makes us forget how pictures captured by mobile agents look. Consider Figure 1. Internet photos are skillfully captured by human photographers; they are well framed and show objects unoccluded, in canonical locations, scales and poses (top row). In contrast, photos captured by NAO robots during a soccer game show objects in a wide variety of scales, poses, locations and occlusion configurations (bottom row). Often it would not even make sense to label objects in such images, as most objects appear only half-visible. In the case of Internet vision, the picture is the output of visual perception of a well-trained visual agent, the human photographer. In the case of mobile robotic vision, the picture is the input to such visual perception. Different architectures may therefore be needed for each.

Figure 2: Geometry-aware Recurrent Neural Networks (GRNNs) integrate visual information over time in a 3D geometrically-consistent GRU memory of the visual scene. At each frame, RGB images are unprojected into corresponding 3D feature tensors, which are oriented to the coordinate frame of the memory map built thus far (2nd row). A 3D convolutional GRU memory is then updated using the egomotion-stabilized features as input.

We present Geometry-aware Recurrent Neural Network architectures, which we call GRNNs, that learn to "lift" 2D image features into 3D feature maps of the scene while stabilizing against the egomotion of the agent. They are equipped with a 3-dimensional latent feature state: the latent feature vectors are arranged in a 3D grid, where every location of the grid encodes a 3D physical location in the scene. The latent state is updated with each new input frame using egomotion-stabilized convolutions, as shown in Figure 2. GRNNs learn to map 2D input visual features to a 3D latent feature map, and back, in a differentiable manner. To achieve such a differentiable and geometrically-consistent mapping between the world scene and the 3D latent feature space, they are equipped with differentiable geometric operations, such as egomotion estimation and feature stabilization, 3D-to-2D projection, and 2D-to-3D unprojection, as shown in Figure 2. Beyond being space-aware, we do not impose any other constraints on the learned representations: they are free to encode whatever is relevant for the downstream task.

We train GRNNs in a self-supervised manner to predict image views from novel camera viewpoints, given short frame sequences as inputs. We empirically show that GRNNs learn to predict novel views and strongly generalize to novel scenes with a different number, appearance and configuration of objects. They greatly outperform geometry-unaware networks of previous works that are trained under the exact same view-prediction loss but do not use egomotion-stabilized convolutions or a 3D latent space. We argue that strong generalization is a necessary condition for claiming the ability to spatially reason. Furthermore, the resulting representations support scene arithmetic: adding/subtracting latent scene representations, and decoding the resulting representation from a particular viewpoint, matches the result of adding/subtracting the 3D world scenes directly.

We train GRNNs in a supervised manner to localize objects in the scene, given short frame sequences as inputs. We use the latent 3D feature map as input to a region proposal and segmentation network that predicts 3D bounding boxes and binary voxel object occupancies, as shown in Figure 2. For this we adapted the well-known 2D object detector and segmentor MaskRCNN [11] to operate in 3D. The resulting object detections and segmentations persist in time: they do not suffer from instantaneous occlusions, dis-occlusions, or changes of viewpoint; an object that is not visible from the current viewpoint is still present in the latent feature map. By projecting the detected 3D objects into 2D views we obtain amodal [19] object boxes and segments, even under severe occlusions. Our code and datasets will be made publicly available.

2 Related Work

Deep geometry

Simultaneous Localization and Mapping (SLAM) [22, 15] methods are purely geometric: they build a 3D pointcloud map of the scene while estimating the motion of the observer. Our method instead builds deep feature maps, which capture both the geometry and the semantics of the scene. Recently, there has been great interest in integrating learning and geometry for single-view 3D object reconstruction [24, 27], 3D object reconstruction from videos [17], depth and egomotion estimation from pairs of frames [26, 30], depth estimation from stereo images [8], and estimation of 3D human keypoints from 2D keypoint heatmaps [28, 25]. Many of these works use neural network architectures equipped with some form of differentiable camera projection, so that the desired 3D estimates can be supervised directly using 2D quantities. They do not, however, accumulate information across multiple camera viewpoints. For example, Tulsiani et al. [24], Wu et al. [27] and Zhou et al. [30] use a single image frame as input to predict a 3D reconstruction for a single object, or a 2D depth map for the entire scene, though they use multiple views to obtain supervision in the form of a depth re-projection error. Learnt stereo machines (LSM) [14] integrate RGB information along sequences of random camera viewpoints into a latent 3D feature memory tensor, in an egomotion-stabilized way similar to ours. Their goal, however, is to 3D-reconstruct a single object, as opposed to detecting and segmenting (3D-reconstructing) multiple objects in the scene, as our model does. They also assume egomotion is given, while we estimate it. Moreover, LSM can only be trained with supervision for the object 3D reconstruction task, while GRNNs can be trained self-supervised for view prediction. The work on LSM has, though, inspired the models proposed in this paper.

MapNet [12], Cognitive mapping and planning [10], IQA [9] and Neural Map [18] construct 2D overhead maps of the scene by taking into account the egomotion of the observer, similar to our method. MapNet further estimates the egomotion, as opposed to assuming it is known, as the other methods do. In IQA, objects are detected in each frame and the detections are aggregated in the birdview map, whereas we detect objects using the 3D feature map as input.

The closest work to ours is arguably that of Cheng et al. [3], which considers egomotion-stabilized convolutions and a 3D latent map, like ours, for segmenting objects in 3D. However, they assume egomotion is known, while we learn to estimate it, and their object detection pipeline uses heuristics to determine the number of objects in the scene, by discretizing continuous voxel segmentation embeddings obtained with metric learning; we instead train 3D region proposal networks. Most importantly, they do not consider self-supervised learning via view prediction, which is one of the central contributions of this work; rather, they focus exclusively on supervised voxel labelling using groundtruth 3D voxel occupancies provided by a simulator.

Self-supervised visual feature learning

Researchers have considered many self-supervised tasks to train visual representations without human labels. For example, the works of [13, 1] train visual representations by predicting egomotion between consecutive frames, and the works of [6, 23] predict novel views of a scene. In particular, the authors of the generative query network (GQN) [6] argue that GQN learns to disentangle color, lighting, shapes and spatial arrangement without any human labels. We compare against their model in Section 4 and show that GRNNs can strongly generalize beyond the training set, while GQN cannot. Such strong generalization suggests that a 3D latent space and egomotion stabilization are necessary architectural choices for spatial reasoning to emerge.

3D object detection

When LiDAR input is available, many recent works attempt to detect objects directly in 3D [31, 16, 29] from LiDAR and RGB streams captured from a self-driving car. They mostly use a single frame as input, while the proposed GRNNs integrate visual information over time. Furthermore, GRNNs are trained to recover spatial common sense in self-supervised ways, which is not the case for the supervised 3D object detectors of [31, 16, 29]. Extending GRNNs to scenes with independently moving objects is a clear avenue for future work.

3 Geometry-aware recurrent networks

The proposed GRNNs are recurrent neural networks whose latent state learns a 3D deep feature map of the visual scene. We use the terms 4D tensor and 3D feature map interchangeably, to denote a set of feature channels, each being 3-dimensional. The memory map is updated with each new camera view in a geometrically-consistent manner, so that information from 2D pixel projections that correspond to the same 3D physical point ends up nearby in the memory tensor, as illustrated in Figure 3. This permits later convolutional operations to see corresponding input across frames, rather than input that varies with the motion of the observer. We believe this is key to generalization. The main components of GRNNs are illustrated in Figure 3 and detailed below.

Figure 3: Neural components of GRNNs. RGB images are fed into a 2D U-net, the resulting deep features are unprojected to 4D tensors, fed into a 3D U-net, oriented to cancel the camera motion with respect to the 3D GRU memory state built thus far, and then used to update the 3D GRU memory state. The memory map is projected to specific viewpoints and decoded into a corresponding RGB image, or fed into a 3D detector/segmentor which we call 3D MaskRCNN, to predict 3D object bounding boxes and per object voxel occupancies.


At each timestep, we feed the input RGB image to a 2D convolutional encoder-decoder network with skip-connections (2D U-net [21]) to obtain a set of 2D feature maps. We then unproject all feature maps to create a 4D feature tensor as follows: for each "cell" in the 3D feature grid, with center at (X, Y, Z) in the current camera's coordinate frame, we compute the 2D pixel location (u, v) which the center of the cell projects onto, from the current camera viewpoint:

(u, v) = (f X / Z, f Y / Z),

where f is the focal length of the camera. The cell is then filled with the bilinearly interpolated 2D feature vector at that pixel location. All voxels lying along the same ray cast from the camera center are filled with nearly the same image feature vector. We further unproject the input 2D depth map into a binary voxel occupancy grid that contains the thin shell of voxels directly visible from the current camera view. We compute this by filling all voxels whose unprojected depth value equals the grid depth value. When a depth sensor is not available, we learn to estimate the depth map using a 2D U-net that takes the RGB image as input.

We multiply the feature tensor with the binary occupancy grid to get a final 4D feature tensor. The unprojected tensor enters a 3D encoder-decoder network with skip connections (3D U-net) to produce a resulting feature tensor.

In the case of view prediction, we did not use depth as input, nor a 2D U-net to estimate it; we use only the RGB input, for a fair comparison with prior art.
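As a rough illustration, the unprojection step can be sketched in NumPy as below. The helper name `unproject`, the flat `(N, 3)` list of cell centers, and the explicit principal point `(cx, cy)` are our own simplifying assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def unproject(feat2d, grid_pts, f, cx, cy):
    """Fill each 3D grid cell with the bilinearly interpolated 2D feature
    at the pixel its center projects onto (pinhole camera model).

    feat2d:    (H, W, C) 2D feature map
    grid_pts:  (N, 3) cell centers (X, Y, Z) in camera coordinates, Z > 0
    f, cx, cy: focal length and principal point (assumed intrinsics)
    Returns:   (N, C) features; zero for cells projecting outside the image.
    """
    H, W, C = feat2d.shape
    X, Y, Z = grid_pts[:, 0], grid_pts[:, 1], grid_pts[:, 2]
    u = f * X / Z + cx          # pixel column
    v = f * Y / Z + cy          # pixel row
    out = np.zeros((len(grid_pts), C))
    # bilinear interpolation at (u, v)
    u0, v0 = np.floor(u).astype(int), np.floor(v).astype(int)
    valid = (u0 >= 0) & (u0 + 1 < W) & (v0 >= 0) & (v0 + 1 < H)
    u0v, v0v = u0[valid], v0[valid]
    du, dv = (u - u0)[valid, None], (v - v0)[valid, None]
    out[valid] = (feat2d[v0v, u0v] * (1 - du) * (1 - dv)
                  + feat2d[v0v, u0v + 1] * du * (1 - dv)
                  + feat2d[v0v + 1, u0v] * (1 - du) * dv
                  + feat2d[v0v + 1, u0v + 1] * du * dv)
    return out
```

Cells whose projection falls outside the image are left at zero, mirroring the fact that only rays that hit the image plane receive features.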

Egomotion estimation and stabilization

Our model predicts the absolute elevation angle of the first camera view using a 2D convnet, and orients the 3D feature memory to have zero elevation; this essentially keeps the memory parallel to the ground plane. The azimuth of the 3D feature memory is chosen to be the azimuth of the first view in the input frame sequence. At each time step, we use a 3D cross-convolution operator to predict the relative elevation and azimuth between the current frame (viewpoint) and the feature memory map. Note that we could alternatively predict the absolute elevation directly from each input view, without cross-correlating with the memory built thus far. For the azimuth, since we need to estimate the relative azimuth to the first view, such cross-view comparison is necessary.

Our 3D cross-correlation network adapts the method of Henriques et al. [12] to operate on 3D maps, as opposed to 2D (overhead) maps. Specifically, each feature tensor V is rotated by the different candidate azimuth and elevation angles, resulting in a stack of rotated feature tensors V_1, ..., V_R, where R is the total number of discretized (azimuth, elevation) pairs considered. Similar to the bilinear interpolation used during unprojection, to fill in each feature voxel of a rotated tensor V_r we compute the 3D location it is rotated from and insert the bilinearly interpolated feature value from the original tensor V. We then compare each of the rotated feature tensors with our current 3D feature memory M using matrix inner products, to produce a probability distribution over (azimuth, elevation) pairs:

p_r = exp(<M, V_r>) / sum_{r'} exp(<M, V_{r'}>),   r = 1, ..., R,

where <., .> denotes the matrix inner product. The estimated rotation is obtained as the weighted average of the candidate azimuth and elevation angles, with weights p. Finally, we orient the tensor to cancel the estimated relative rotation with respect to our 3D memory map M; we denote the oriented tensor by V-bar.
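A minimal sketch of this soft rotation estimate, assuming a flat list of candidate angles and omitting the separate azimuth/elevation factorization:

```python
import numpy as np

def estimate_rotation(memory, rotated_stack, angles):
    """Soft egomotion estimate: compare each rotated feature tensor with the
    memory via a matrix inner product, softmax the scores into a distribution,
    and return the probability-weighted average of the candidate angles.

    memory:        (D, H, W, C) 3D feature memory
    rotated_stack: (R, D, H, W, C) input tensor rotated by R candidate angles
    angles:        (R,) candidate angles (e.g. azimuths, in degrees)
    """
    logits = np.tensordot(rotated_stack.reshape(len(angles), -1),
                          memory.ravel(), axes=1)   # <M, V_r> per candidate
    p = np.exp(logits - logits.max())
    p /= p.sum()                                    # softmax over candidates
    return float(np.dot(p, angles))                 # expected angle
```

The softmax makes the whole estimate differentiable, so the egomotion module can be trained end-to-end with the rest of the network.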

Recurrent map update

Once the feature tensor has been properly oriented, we feed it as input to a 3D convolutional Gated Recurrent Unit [4] layer, whose hidden state is the memory map, as shown in Figure 3. The hidden state is initialized to zero at the beginning of the frame sequence. For our view prediction experiments (Section 4) we found that, when the number of views is fixed, average pooling of the oriented tensors works equally well as the GRU update equations, while being much faster.
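The gating logic of the 3D convolutional GRU update can be sketched as follows; for brevity we use 1x1x1 convolutions (a per-voxel channel-mixing matmul) where a real implementation would use wider 3D kernels, and we omit bias terms:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def conv_gru_3d_step(h, x, Wz, Wr, Wh):
    """One update of a 3D convolutional GRU, sketched with 1x1x1 convolutions;
    the gating logic is the standard GRU recurrence applied voxel-wise.

    h, x:       (D, H, W, C) hidden memory map and oriented input tensor
    Wz, Wr, Wh: (2C, C) weights applied to the channel-concatenated input
    """
    def conv1(inp, W):   # 1x1x1 convolution == channel-wise matmul
        return np.einsum('dhwc,co->dhwo', inp, W)
    hx = np.concatenate([h, x], axis=-1)
    z = sigmoid(conv1(hx, Wz))                   # update gate
    r = sigmoid(conv1(hx, Wr))                   # reset gate
    h_tilde = np.tanh(conv1(np.concatenate([r * h, x], axis=-1), Wh))
    return (1 - z) * h + z * h_tilde             # new memory map
```

Because the input tensor is already oriented to the memory's coordinate frame, the same kernel weights see corresponding scene content at every time step.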

Projection and decoding

Given a 3D feature state and a desired viewpoint, we first rotate the 3D feature map so that its depth axis is aligned with the query camera axis. We then generate, for each depth value, a corresponding projected feature map. Specifically, for each depth value, the projected feature vector at a pixel location is computed by first obtaining the 3D location it is projected from, and then inserting the bilinearly interpolated value from the corresponding slice of the 4D tensor. In this way, we obtain one projected map per depth value. Our depth values are equally spaced and span a range centered at the distance from the camera to the center of the feature map.

Note that we do not attempt to determine visibility of features at this projection stage. The stack of projected maps is processed by 2D convolutional operations and is decoded using a residual convLSTM decoder, similar to the one proposed in [6], to an RGB image. We do not supervise visibility directly. The network implicitly learns to determine visibility and choose appropriate depth slices from the stack of projected feature maps.
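A simplified sketch of the depth-sliced projection, using nearest-neighbor rather than bilinear sampling and a hypothetical `voxel_to_idx` mapping from camera coordinates to grid indices:

```python
import numpy as np

def project_slices(feat3d, depths, f, cx, cy, out_hw, voxel_to_idx):
    """Project a 3D feature map into one 2D map per depth value, by
    back-projecting each pixel at each depth to a 3D point and sampling the
    nearest voxel (nearest-neighbor for brevity; the paper's version uses
    bilinear interpolation).

    feat3d:       (D, H, W, C) 3D feature tensor, oriented to the query view
    depths:       list of depth values to slice at
    f, cx, cy:    assumed pinhole intrinsics
    out_hw:       (h, w) of each projected map
    voxel_to_idx: maps a 3D point (x, y, z) to fractional grid indices
    """
    D, H, W, C = feat3d.shape
    h, w = out_hw
    maps = np.zeros((len(depths), h, w, C))
    for n, d in enumerate(depths):
        for v in range(h):
            for u in range(w):
                x, y, z = d * (u - cx) / f, d * (v - cy) / f, d
                i, j, k = (int(round(t)) for t in voxel_to_idx((x, y, z)))
                if 0 <= i < D and 0 <= j < H and 0 <= k < W:
                    maps[n, v, u] = feat3d[i, j, k]
    return maps
```

The stack of per-depth maps is what the subsequent 2D convolutions and the convLSTM decoder consume; as the text notes, visibility reasoning is left to those learned layers.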

We train GRNNs for view prediction in a self-supervised manner, and for 3D object detection in a supervised manner.

3.1 View prediction

Mobile agents have access to their egomotion, and can observe the sensory outcomes of their motions and interactions. Training sensory representations to predict such outcomes is a useful form of supervision: it is free of human annotations and is often termed self-supervision, since the "labels" are provided by the embodied agent herself. Can spatial common sense, the notion of objects and scenes, geometry, visibility and occlusion relationships, emerge in a self-supervised way in a mobile agent that moves around and observes the world? How do we evaluate such spatial common sense acquisition? We show below and in Section 4 that training GRNNs for view prediction results in high-precision predictions that greatly outperform alternative geometry-unaware RNN architectures and strongly generalize beyond the training set, to novel scenes with a different number, appearance and arrangement of objects. Our agent can accurately predict how a novel scene looks from a different viewpoint, and the predictions show the agent has learnt that objects have 3D extent, that objects occlude each other when one is placed closer to the camera than the other, that object shape and appearance have spatial regularities, and so on.

We train GRNNs to predict the image the agent would see from a novel viewpoint, given a short view sequence as input. Given the 3D feature memory map and a query viewpoint, we orient the map to the query viewpoint, project it to 2D as described above, and decode it to a pixel image via a residual convLSTM decoder, also used in [6]. We train for view prediction using a standard cross-entropy pixel-matching loss, where the pixel intensities have been squashed into the range [0, 1]. Our model is end-to-end differentiable. Training and implementation details are included in the supplementary file.

3.2 3D object detection and segmentation

We train GRNNs in a supervised manner to predict 3D object bounding boxes and segmentation masks, using groundtruth 3D object boxes and 3D voxel segmentations from a simulator. We adapt MaskRCNN [11], a state-of-the-art object detector/segmentor, to operate with 3D input and output instead of 2D. Specifically, we consider every grid location in our 3D map to be a candidate 3D box centroid. At each time step, the 3D feature memory map is fed to a 3D region proposal network that predicts positive anchor centroids, as well as the corresponding adjustments to the box center location and to the box dimensions (width, height and depth). Our 3D bounding box encoding is similar to the one proposed in VoxelNet [31]. We filter the proposed boxes using non-max suppression to reject highly overlapping ones. We train with a combination of classification and regression losses, following well-established detector training schemes [20, 11]. Proposed 3D bounding boxes whose Intersection over Union (IoU) with a corresponding groundtruth object box is above a specific threshold are denoted Regions of Interest (ROIs), and are used to pool segmentation features from their interior to predict 3D object segmentation, i.e., 3D voxel occupancy, as well as a second refinement of the predicted 3D box location and dimensions.
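The box filtering step can be illustrated with axis-aligned 3D IoU and greedy non-max suppression; the `(cx, cy, cz, w, h, d)` box encoding below is our own simplification of the VoxelNet-style encoding mentioned above:

```python
import numpy as np

def iou_3d(a, b):
    """IoU of two axis-aligned 3D boxes given as (cx, cy, cz, w, h, d)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    lo = np.maximum(a[:3] - a[3:] / 2, b[:3] - b[3:] / 2)   # overlap min corner
    hi = np.minimum(a[:3] + a[3:] / 2, b[:3] + b[3:] / 2)   # overlap max corner
    inter = np.prod(np.clip(hi - lo, 0, None))
    union = np.prod(a[3:]) + np.prod(b[3:]) - inter
    return inter / union

def nms_3d(boxes, scores, iou_thresh=0.5):
    """Greedy non-max suppression: keep boxes in decreasing score order,
    rejecting any proposal overlapping a kept box above iou_thresh."""
    order = np.argsort(scores)[::-1]
    keep = []
    for i in order:
        if all(iou_3d(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep
```

The same IoU routine, applied against groundtruth boxes, decides which proposals become ROIs for the segmentation head.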

Object permanence

Even when an object is not visible from the current camera viewpoint, its features are present in the 3D feature memory, and our detector detects and segments it, as we show in Figure 6. In other words, object detections persist despite occlusions caused by camera motion, or an object lying outside the field of view. Applying the detector to the latent 3D model of the scene, as opposed to the 2D visual observation space, is beneficial: the latent model follows the physical laws of 3D non-intersection and object permanence, while 2D visual observations do not.

4 Experiments

The term “spatial common sense” is broad and concerns the ability to perceive and understand properties and regularities regarding spatial arrangements and motion that are shared by (“common to”) nearly all people. Such common sense includes the fact that objects have 3D shape as opposed to being floating 2D surfaces, the fact that scenes are comprised of objects, the 3D non-intersection principle, the fact that objects do not spontaneously disappear, and many others [7]. The model we propose in this work targets understanding of static scenes, that is, scenes that do not contain any independently moving objects, and that are viewed under a potentially moving observer. Thus, we restrict the term spatial common sense to refer to rules and regularities that can be perceived in static worlds. Our experiments aim to answer the following questions:

  1. Do GRNNs learn spatial common sense?

  2. Are geometric structural biases necessary for spatial common sense to emerge?

  3. How well do GRNNs perform on egomotion estimation and 3D object detection?

To answer those questions, we train and test our model, along with baselines from recent previous works, on the tasks of view prediction and 3D object detection.

Figure 4: View prediction results for the proposed GRNNs and the tower model of Eslami et al. [6]. Columns from left to right show the three input views, the ground-truth image from the query viewpoint, the view predictions for GRNNs and for the tower baseline. The first two rows are from the ShapeNet arrangement test set of [3], and the next two rows are from the Shepard-Metzler test set of [6]. The last four rows show generalization to scenes with four objects from the ShapeNet arrangement dataset, while both models were trained only on scenes with two objects. GRNNs outperform the baseline by a large margin and strongly generalize under a varying number of objects.
Figure 5: Scene arithmetic from the proposed GRNNs and the model of Eslami et al. [6] (tower). Each row is a separate "equation". We start with the representation of the scene in the leftmost column, then subtract (the representation of) the scene in the second column, and add (the representation of) the scene in the third column. We decode the resulting representation into an image view. The groundtruth image, shown in the fourth column, is much more visually similar to the prediction of GRNNs than to that of the tower baseline.

4.1 View prediction

We consider the following simulation datasets: i) ShapeNet arrangement from [3], which consists of scenes with synthetic 3D object models from ShapeNet [2] arranged on a table surface, made available to us by the authors of [3]. The objects in this dataset belong to four object categories, namely cups, bowls, helmets and cameras. We follow the same train/test split of ShapeNet [2], so that object instances which appear in the training scenes do not appear in the test scenes. Each scene contains two objects, and each image is rendered from a viewing sphere with 54 possible views (3 camera elevations times 18 azimuths). There are 300 different scenes in the training set and 32 scenes with novel objects in the test set. ii) The Shepard-Metzler shapes dataset from [6], which contains scenes consisting of seven colored cubes stuck together in random arrangements. We use the train and test split of [6]. iii) The rooms-ring-camera dataset from [6], a random rooms environment with random floor and wall colors and textures, and a variable number of shapes of different geometries and colors in each room. Results on this dataset are shown in the supplementary material to save space.

We compare the proposed GRNNs against the recent "tower" architecture of Eslami et al. [6], a 2D network trained under a similar view prediction loss that has a 2D instead of a 3D feature space, and no egomotion-stabilized convolutions. We use our own implementation, since no implementation was provided by the authors. The tower architecture takes as input each 2D image and performs a series of convolutions on it. The camera pose from which the image was taken is tiled along the width and height axes and concatenated into the feature map after the third convolution. Finally, the feature maps from all views are combined via average pooling. Both our model and the baseline use the same autoregressive decoder network. For fairness of comparison, we use groundtruth egomotion rather than estimated egomotion in all view prediction experiments, and only RGB input (no depth input or depth estimation) for both our model and the tower baseline. In both the baseline and our model, we did not use any stochastic units, for simplicity and speed of training. Adding stochasticity is part of our future work.

Test results from our model and the baseline on test images from the ShapeNet arrangement and Shepard-Metzler datasets are shown in Figure 4. Reconstruction test error for the ShapeNet arrangement dataset is shown in Table 1. GRNNs have a much lower reconstruction test error than the tower baseline. In Figure 4, in the first four rows, the distribution of the test scenes matches the training scene distribution; our model outperforms the baseline in visual fidelity. In the last four rows, the test scene distribution does not match the training one: we test our model and the baseline on scenes with four objects, while both models were trained on scenes with exactly two objects. In this case, our model shows strong generalization and outperforms the geometry-unaware baseline of [6] by a large margin. Indeed, the ability to reason spatially should not be affected by the number of objects present in the scene. These results suggest that geometry-unaware models may be merely memorizing views, with small interpolation capabilities, as opposed to learning to spatially reason. We attribute this to their inability to represent space efficiently in their latent vectors, a problem the proposed architectures correct for.

Scene arithmetics

The learnt representations of GRNNs under view prediction are capable of scene arithmetic, as we show in Figure 5. The ability to add and subtract individual objects from 3D scenes just by adding and subtracting their corresponding latent representations demonstrates that our model disentangles what from where. In other words, our model learns to store object-specific information in the regions of the memory which correspond to the spatial location of the corresponding object in the scene. Therefore, it is relatively straightforward to carry out scene arithmetic with our model. Implementation details and more qualitative view prediction results are included in the supplementary material.

Table 1: View prediction loss and its standard deviation on the ShapeNet arrangement test set for two-object test scenes, for the tower baseline and GRNNs. Our model and the baseline were trained on scenes that also contain two objects, with different object instances.
Figure 6: 3D object detection and segmentation with GRNNs. Blue voxels denote groundtruth objects. Predicted 3D boxes and their corresponding predicted masks are shown in red and green to aid visualization. Best seen in color.

4.2 Egomotion estimation

In this section, we quantify the error of our egomotion estimation component. We train our egomotion estimation module using groundtruth egomotion from a simulator, using the ShapeNet arrangement dataset. In Table 2, we show egomotion estimation error in elevation and azimuth angles. Our model improves its egomotion estimates with more views, since then a more complete feature map is compared against each input unprojected tensor.

Table 2: Egomotion estimation error of GRNNs in elevation and azimuth angles on the ShapeNet arrangement test set, using one, two and three input views (and their average). The error decreases as more views are integrated into the memory.

4.3 3D object detection and segmentation

We again use the ShapeNet arrangement dataset and the train/test scene split of [3]. We use mean Average Precision (mAP) to score the performance of our model and baselines on 3D object detection and 3D segmentation; mean average precision measures the area under the precision-recall curve. To score our model in both a loose and a tight regime, we vary the cutoff threshold on the Intersection over Union (IoU) between our predictions and the groundtruth 3D boxes and masks over 0.33, 0.5 and 0.75. We consider four ablations of our model: predicted egomotion (pego) versus groundtruth egomotion (gtego), and predicted depth (pd) versus groundtruth depth (gtd) used as input. We use suffixes to indicate the model variant.

Table 3: Mean Average Precision (mAP) for 3D object detection and 3D segmentation at three Intersection over Union (IoU) thresholds (0.75, 0.5, 0.33) on the ShapeNet arrangement [3] test set, for the models 2DRNN-gtego-gtd, GRNN-gtego-pd, GRNN-gtego-gtd and GRNN-pego-gtd.
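For reference, the area under the precision-recall curve can be computed from ranked detections as sketched below (a plain rectangle-rule version; benchmark implementations often additionally interpolate precision):

```python
import numpy as np

def average_precision(scores, is_true_positive, num_gt):
    """Area under the precision-recall curve for one class: sort detections
    by confidence, accumulate true positives, and sum precision times each
    recall increment."""
    order = np.argsort(scores)[::-1]
    tp = np.asarray(is_true_positive, float)[order]
    cum_tp = np.cumsum(tp)
    precision = cum_tp / (np.arange(len(tp)) + 1)   # precision after each detection
    recall = cum_tp / num_gt                        # recall after each detection
    steps = np.diff(np.concatenate([[0.0], recall]))
    return float(np.sum(precision * steps))         # rectangle rule
```

Whether a detection counts as a true positive is decided by thresholding its IoU with the best-matching groundtruth box or mask, which is exactly the 0.33/0.5/0.75 knob varied in Table 3.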

We compare against the following 2D baseline, which we call 2D-RNN: we remove the unprojection, egomotion estimation and stabilization, and projection operations from our model. The baseline takes as input an image and the corresponding depth map and feeds them to a 2D encoder-decoder network with skip connections to obtain a 2D feature tensor. The camera parameters for the view are concatenated as additional channels to this tensor, and the result is fed to another 2D encoder-decoder network whose output drives a 2D GRU memory update. We then feed the 2D memory feature tensor to an additional 2D encoder-decoder network and reshape the channel dimension of its output into a depth axis and a feature axis, forming a 4D tensor as the prediction.
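The baseline's final reshape, from a 2D feature map to a 4D tensor, can be sketched as follows (the function name `reshape_to_4d` and the axis ordering are our assumptions; the point is only that the depth axis is carved out of the channel dimension rather than built by geometric unprojection):

```python
import numpy as np

def reshape_to_4d(feat2d, depth_bins):
    """Fold the channel dimension of a 2D feature map (H, W, C) into a
    depth axis, producing a 4D tensor (H, W, depth_bins, C // depth_bins),
    as in the 2D-RNN baseline's final prediction step."""
    h, w, c = feat2d.shape
    assert c % depth_bins == 0, "channels must divide evenly into depth bins"
    return feat2d.reshape(h, w, depth_bins, c // depth_bins)
```

Because the depth axis here is purely a reshape of learned channels, nothing forces it to align with scene geometry, which is the key difference from the GRNN latent space.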

We report mean average precision for 3D object detection and 3D segmentation for our model and the baseline in Table 3, and visualize predicted 3D bounding boxes and segmentations from GRNNs (GRNN-gtego-gtd) in Figure 6. GRNNs significantly outperform the 2D-RNN. Groundtruth depth input significantly helps 3D segmentation. This suggests that inferring depth using a cost volume, as in [14], would potentially improve depth inference, as opposed to relying on a per-frame depth network [5] that has no access to multiple views to refine its predictions. Please see the supplementary material for implementation details and more qualitative results.

5 Conclusion/Discussion

We presented GRNNs, recurrent neural networks equipped with differentiable geometric operations that estimate egomotion and build 3D deep feature maps for visual scene understanding on mobile visual agents. Our models add a new dimension to the latent space of previous recurrent models and ensure a geometrically-consistent mapping between the latent state and the 3D world scene. We showed that spatial common sense emerges in GRNNs trained in a self-supervised manner for novel view prediction: our model can predict object arrangements, visibility, and occlusion relationships in scenes with novel numbers, appearances, and configurations of objects. We also showed that a view prediction loss alone does not suffice for spatial common sense to emerge, since the 2-dimensional models of previous works fail to strongly generalize.

Thus far, our model has been trained and tested on simulated scenes. Deploying it on real mobile agents, or on static robots with wrist-mounted cameras, and enabling them to spatially reason and detect objects in the real world is a clear avenue for future work. We expect pretraining in simulated environments to help performance in the real world. Another limitation of the current model is that it operates only on static scenes, i.e., scenes without moving objects; extending the proposed architectures to dynamic scenes is another very useful direction. Finally, exploiting the sparsity of our 4D tensors to save GPU memory, potentially together with learned memory allocation policies, is an important direction for scaling up our model to large scenes.

GRNNs pave the way for embodied agents that learn visual representations and mental models by observing and moving in the world: these agents learn autonomously and develop the reasoning capabilities of young toddlers as opposed to merely mapping pixels to labels using human supervision.


Appendix A GRNNs implementation details

The input images, output images, and predictions (for view prediction) all have size 64×64. Our pre-unprojection 2D encoder-decoder network has encoder layers with 32, 64, 128, and 256 channels, respectively; the decoder layers are symmetric to the encoder layers. The spatial sizes of these feature maps are 32×32, 16×16, 8×8, and 4×4, respectively, since each convolution has stride 2. For depth prediction, which is used for object detection, we use the same 2D encoder-decoder network as in this pre-unprojection step.
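The spatial sizes above follow directly from stride-2 downsampling, which the small helper below computes (the helper name and the 'same'-padding assumption are ours; the paper does not state its padding scheme):

```python
def encoder_sizes(input_size, num_layers, stride=2):
    """Spatial size after each stride-s convolution, assuming 'same'
    padding so each layer divides the size by the stride (rounding up).
    E.g. 64 -> [32, 16, 8, 4] for four stride-2 encoder layers."""
    sizes, s = [], input_size
    for _ in range(num_layers):
        s = (s + stride - 1) // stride  # ceiling division
        sizes.append(s)
    return sizes
```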

View prediction

We feed our feature tensor through a 3D encoder-decoder with skip connections, where the encoder has 64, 128, 256, 512, and 1024 channels, respectively; the decoder is symmetric to the encoder. In our convLSTM decoder, we do not share weights across time steps; allowing each step its own weights was noted by [6] to improve performance. We use six generation steps, as we saw no noticeable improvement in image reconstruction quality with more, and 256 channels at each step. Both our model and the baseline are deterministic networks; we did not include stochastic units in either, to accelerate training. We trained each model for 24 hours, using a batch size of 8 and a learning rate of . This resulted in roughly steps of backpropagation for the tower (baseline) architecture and steps for the GRNN architecture. We used the Adam optimizer.
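The structure of an iterative decoder with unshared per-step weights can be sketched abstractly as follows (a toy dense-layer stand-in for the actual convLSTM; the function name `generate` and the additive-canvas update are our assumptions):

```python
import numpy as np

def generate(canvas, hidden, step_weights):
    """Iterative generation with unshared weights: each step has its own
    weight matrix (not shared across time), updates the hidden state, and
    adds a residual to the canvas, mirroring a multi-step convLSTM decoder."""
    for W in step_weights:  # one independent weight matrix per generation step
        hidden = np.tanh(hidden @ W)
        canvas = canvas + hidden
    return canvas, hidden
```

With six entries in `step_weights`, this corresponds to the six generation steps used above; sharing a single `W` across iterations would recover the weight-tied variant.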

Object detection/segmentation

For object detection and egomotion prediction, we feed the unprojected feature tensor through a 3D encoder-decoder with skip connections, where the encoder has 16, 32, 64, and 128 channels, respectively; the decoder is again symmetric to the encoder. For egomotion prediction, when we compare the current memory with the rotated feature tensors generated from the new view, we use outputs from all layers of the decoder (feature tensors of size 4×4×4×128, 8×8×8×64, 16×16×16×32, and 32×32×32×16) to compute cross-convolutions. For object detection, we take only the last feature tensor from the 3D encoder-decoder and pass it to another 3D encoder-decoder with skip connections to predict positive anchor centroids and their corresponding adjustments for the box centers. The channels of this second 3D encoder-decoder are set to 16, 32, 64, and 128, and its final output is a 32×32×32×16 feature tensor. We then pass this tensor to one more 3D convolutional layer with kernel size 3, stride 1, and 7 output channels. The final prediction is a 32×32×32×7 tensor whose first channel indicates positive anchor centroids and whose last 6 channels indicate 3D box adjustments at each centroid. The model is trained with the Adam optimizer (learning rate set to ) without further parameter tuning. We train for roughly 25K iterations with batch size 2. We first train for egomotion prediction alone, and then jointly train with the egomotion prediction, object detection, and segmentation losses.
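Decoding such a prediction tensor into detections can be sketched as below (the function name `decode_detections` and the score threshold are our assumptions; the paper does not specify its post-processing):

```python
import numpy as np

def decode_detections(pred, score_thresh=0.5):
    """Decode a (D, D, D, 7) prediction tensor: channel 0 scores positive
    anchor centroids; channels 1:7 hold the per-centroid 3D box adjustments.
    Returns a list of (centroid_index, adjustments) for centroids whose
    score clears the threshold."""
    scores = pred[..., 0]
    adjustments = pred[..., 1:]
    idx = np.argwhere(scores > score_thresh)  # voxel coordinates of positives
    return [(tuple(i), adjustments[tuple(i)]) for i in idx]
```

Each surviving centroid, combined with its 6-dimensional adjustment (e.g., center offsets and box sizes), yields one 3D box proposal in the memory grid.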

Appendix B Additional results

In Figures 7 and 8, we show view prediction results on the rooms_ring_camera and Shepard-Metzler datasets introduced in [6]. Since the camera intrinsics were not given, we used an estimated vertical and horizontal field of view of 60 degrees. Our model outperforms the baseline by a clear margin. In Figures 9 and 10, we show more view prediction results on the ShapeNet arrangement dataset of [3]. In Figure 11, we show qualitative 3D object detection and segmentation results.

Input V1,V2,V3 query gt GRNNs Tower
Figure 7: View prediction results for the room scenes from  [6]
Input V1,V2,V3 query gt GRNNs Tower
Figure 8: View prediction results for the 7-segment Shepard-Metzler dataset from [6]
Input V1,V2,V3 query gt GRNNs Tower
Figure 9: View prediction results for ShapeNet arrangement test scenes from  [3]
Input V1,V2,V3 query gt GRNNs Tower
Figure 10: View prediction results for 4-object scenes from  [3]
Figure 11: Object detection and segmentation results. Blue and light blue grids in the last three columns show groundtruth voxel occupancy for the two objects present in the scene. 3D bounding boxes in different colors (red, green, and magenta) are predicted by the proposed 3D Mask R-CNN.