Embodied Language Grounding with Implicit 3D Visual Feature Representations

10/02/2019
by Mihir Prabhudesai, et al.

Consider the utterance "the tomato is to the left of the pot." Humans can answer numerous questions about the situation described, as well as reason through counterfactuals and alternatives, such as "is the pot larger than the tomato?", "can we move to a viewpoint from which the tomato is completely hidden behind the pot?", "can we have an object that is both to the left of the tomato and to the right of the pot?", "would the tomato fit inside the pot?", and so on. Such reasoning remains elusive for current computational models of language understanding. To link language processing with spatial reasoning, we propose associating natural language utterances with a mental workspace of their meaning, encoded as 3-dimensional visual feature representations of the world scenes they describe. We learn these 3-dimensional visual representations, which we call visual imaginations, by predicting the images a mobile agent sees while moving around in the 3D world. The image streams the agent collects are unprojected into egomotion-stable 3D feature maps of the scene, and projected from novel viewpoints to match the observed RGB views, in an end-to-end differentiable manner. We then train modular neural models to generate such 3D feature representations from language utterances, to localize the objects an utterance mentions in the 3D feature representation inferred from an image, and to predict the desired 3D object locations given a manipulation instruction. We empirically show that the proposed models outperform existing 2D models by a large margin on spatial reasoning, referential object detection, and instruction following, and that they generalize better across camera viewpoints and object arrangements.
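The unprojection/projection pipeline can be made concrete with a short sketch. Below is a minimal PyTorch sketch (not the authors' released code) of the two differentiable operations the abstract names: lifting 2D image features into an egomotion-stable 3D feature grid, and resampling that grid under a rigid camera motion before reducing it to a 2D feature map for a novel view. The function names, grid resolution, depth range, pinhole-camera model, and mean-over-depth reduction are all illustrative assumptions, not details taken from the paper.

# A minimal sketch, assuming a pinhole camera and PyTorch >= 1.10.
# `unproject` and `project` are hypothetical names for the two ops.
import torch
import torch.nn.functional as F

def unproject(feat2d, K, grid_size=32, world_extent=2.0):
    """Backward-warp 2D features (B,C,H,W) into a 3D grid (B,C,D,Hg,Wg):
    each voxel center is projected through intrinsics K (B,3,3) and
    bilinearly samples the image feature at the resulting pixel."""
    B, C, H, W = feat2d.shape
    D = Hg = Wg = grid_size
    device = feat2d.device
    # Voxel centers in camera coordinates, spanning a cube in front of the camera.
    lin = torch.linspace(-world_extent / 2, world_extent / 2, grid_size, device=device)
    z = torch.linspace(0.5, 0.5 + world_extent, grid_size, device=device)
    zz, yy, xx = torch.meshgrid(z, lin, lin, indexing="ij")      # each (D,Hg,Wg)
    pts = torch.stack([xx, yy, zz], dim=-1).reshape(-1, 3)       # (N,3)
    # Pinhole projection: u = fx*x/z + cx, v = fy*y/z + cy.
    uv = torch.einsum("bij,nj->bni", K, pts)                     # (B,N,3)
    uv = uv[..., :2] / uv[..., 2:3].clamp(min=1e-6)
    # Normalize pixel coordinates to [-1,1] for grid_sample.
    uv_norm = torch.stack([2 * uv[..., 0] / (W - 1) - 1,
                           2 * uv[..., 1] / (H - 1) - 1], dim=-1)
    grid = uv_norm.view(B, D * Hg, Wg, 2)
    vol = F.grid_sample(feat2d, grid, align_corners=True)        # (B,C,D*Hg,Wg)
    return vol.view(B, C, D, Hg, Wg)

def project(vol, R, t):
    """Resample the 3D grid (B,C,D,Hg,Wg) under a rigid egomotion
    (R: B,3,3; t: B,3, in normalized [-1,1] grid units) and average over
    depth to get a 2D feature map for an RGB decoder."""
    B, C, D, Hg, Wg = vol.shape
    device = vol.device
    lin = torch.linspace(-1, 1, D, device=device)
    zz, yy, xx = torch.meshgrid(lin, lin, lin, indexing="ij")
    pts = torch.stack([xx, yy, zz], dim=-1).reshape(-1, 3)       # (N,3)
    # Move each target voxel back into the source grid's frame.
    pts = torch.einsum("bij,nj->bni", R, pts) + t[:, None, :]    # (B,N,3)
    grid = pts.view(B, D, Hg, Wg, 3)                             # (x,y,z) order
    warped = F.grid_sample(vol, grid, align_corners=True)        # (B,C,D,Hg,Wg)
    # A crude reduction; a learned, occlusion-aware decoder would replace this.
    return warped.mean(dim=2)                                    # (B,C,Hg,Wg)

In training, the projected features would be decoded to RGB and compared against the views the agent actually observed, so gradients flow through both operations, matching the end-to-end differentiable setup the abstract describes.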


Related research

10/30/2020 · 3D Object Recognition By Corresponding and Quantizing Neural 3D Scene Representations
We propose a system that learns to detect objects and infer their 3D pos...

11/06/2020 · Disentangling 3D Prototypical Networks For Few-Shot Concept Learning
We present neural architectures that disentangle RGB-D images into objec...

06/10/2019 · Embodied View-Contrastive 3D Feature Learning
Humans can effortlessly imagine the occluded side of objects in a photog...

04/08/2021 · CoCoNets: Continuous Contrastive 3D Scene Representations
This paper explores self-supervised learning of amodal 3D feature repres...

12/31/2018 · Learning Spatial Common Sense with Geometry-Aware Recurrent Networks
We integrate two powerful ideas, geometry and deep visual representation...

07/04/2018 · Encoding Spatial Relations from Natural Language
Natural language processing has made significant inroads into learning t...

11/21/2018 · Early Fusion for Goal Directed Robotic Vision
Increasingly, perceptual systems are being codified as strict pipelines ...
