Language Grounding with 3D Objects

07/26/2021
by Jesse Thomason, et al.

Seemingly simple natural language requests to a robot are generally underspecified; for example, "Can you bring me the wireless mouse?" When viewing mice on a shelf, the number of buttons or the presence of a wire may not be visible from certain angles or positions, and flat images of candidate mice may not provide the discriminative information needed to resolve "wireless". The world, and the objects in it, are not flat images but complex 3D shapes. If a human requests an object based on any of its basic properties, such as color, shape, or texture, robots should perform the exploration necessary to accomplish the task. In particular, while substantial effort and progress has been made on understanding explicitly visual attributes like color and category, comparatively little progress has been made on understanding language about shapes and contours. In this work, we introduce a novel reasoning task that targets both visual and non-visual language about 3D objects. Our new benchmark, ShapeNet Annotated with Referring Expressions (SNARE), requires a model to choose which of two objects is being referenced by a natural language description. We introduce several CLIP-based models for distinguishing objects and demonstrate that, while recent advances in jointly modeling vision and language are useful for robotic language understanding, these models remain weaker at understanding the 3D nature of objects, properties which play a key role in manipulation. In particular, we find that adding view estimation to language grounding models improves accuracy both on SNARE and when identifying objects referred to in language on a robot platform.
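As a rough illustration of the task setup (not the authors' exact models), a minimal CLIP-based scorer for SNARE-style comparisons might embed the referring expression and several rendered views of each candidate object, then pick the candidate whose image embeddings best match the text embedding. The sketch below assumes OpenAI's open-source clip package, the ViT-B/32 checkpoint, and simple averaging over views; the file names and multi-view pooling are illustrative assumptions, not the paper's architecture.

# Sketch of a CLIP-based two-object referring-expression scorer.
# Assumptions: each candidate is a list of rendered view images on disk,
# and we compare cosine similarity between the CLIP text embedding and
# the (renormalized) mean of per-view image embeddings.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def embed_views(view_paths):
    """Encode multiple rendered views and pool them into one unit vector."""
    images = torch.stack(
        [preprocess(Image.open(p).convert("RGB")) for p in view_paths]
    ).to(device)
    with torch.no_grad():
        feats = model.encode_image(images)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    pooled = feats.mean(dim=0)
    return pooled / pooled.norm()

def pick_referent(description, candidate_a_views, candidate_b_views):
    """Return 0 if the description better matches candidate A, else 1."""
    tokens = clip.tokenize([description]).to(device)
    with torch.no_grad():
        text_feat = model.encode_text(tokens)[0]
    text_feat = text_feat / text_feat.norm()
    scores = [embed_views(v) @ text_feat
              for v in (candidate_a_views, candidate_b_views)]
    return int(scores[1] > scores[0])

# Hypothetical usage: distinguish a wireless mouse from a wired one.
# choice = pick_referent("the wireless mouse",
#                        ["mouse_a_view0.png", "mouse_a_view1.png"],
#                        ["mouse_b_view0.png", "mouse_b_view1.png"])

Averaging 2D views is exactly where such models can fall short of true 3D understanding, which is the gap the paper's view-estimation addition targets.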


