Interactive Visual Grounding of Referring Expressions for Human-Robot Interaction

06/11/2018
by Mohit Shridhar, et al.

This paper presents INGRESS, a robot system that follows human natural language instructions to pick and place everyday objects. The core issue is the grounding of referring expressions: inferring objects and their relationships from input images and language expressions. INGRESS allows for unconstrained object categories and unconstrained language expressions. Further, it asks questions to disambiguate referring expressions interactively. To achieve this, we take the approach of grounding by generation and propose a two-stage neural network model for grounding. The first stage uses a neural network to generate visual descriptions of objects, compares them with the input language expression, and identifies a set of candidate objects. The second stage uses another neural network to examine all pairwise relations between the candidates and infers the most likely referred object. The same neural networks are used for both grounding and question generation for disambiguation. Experiments show that INGRESS outperformed a state-of-the-art method on the RefCOCO dataset and in robot experiments with humans.

research
07/18/2017

Grounding Spatio-Semantic Referring Expressions for Human-Robot Interaction

The human language is one of the most natural interfaces for humans to i...
research
08/25/2021

INVIGORATE: Interactive Visual Grounding and Grasping in Clutter

This paper presents INVIGORATE, a robot system that interacts with human...
research
01/06/2022

Incremental Object Grounding Using Scene Graphs

Object grounding tasks aim to locate the target object in an image throu...
research
11/30/2016

Modeling Relationships in Referential Expressions with Compositional Modular Networks

People often refer to entities in an image in terms of their relationshi...
research
02/16/2021

Composing Pick-and-Place Tasks By Grounding Language

Controlling robots to perform tasks via natural language is one of the m...
research
03/17/2021

Few-Shot Visual Grounding for Natural Human-Robot Interaction

Natural Human-Robot Interaction (HRI) is one of the key components for s...
research
05/26/2018

Using Syntax to Ground Referring Expressions in Natural Images

We introduce GroundNet, a neural network for referring expression recogn...
