Look Wide and Interpret Twice: Improving Performance on Interactive Instruction-following Tasks

06/01/2021
by Van-Quang Nguyen, et al.

There is growing interest in building embodied AI agents that perform complicated tasks by interacting with an environment while following natural language directives. Recent studies have tackled the problem using ALFRED, a well-designed dataset for the task, but have achieved only low accuracy. This paper proposes a new method that outperforms the previous methods by a large margin. It is based on a combination of several new ideas. One is a two-stage interpretation of the provided instructions. The method first selects and interprets an instruction without using visual information, yielding a tentative prediction of the action sequence. It then integrates this prediction with the visual information and other inputs, yielding the final prediction of an action and an object. Because the class of the object to interact with is identified in the first stage, the method can accurately select the correct object from the input image. Moreover, our method considers multiple egocentric views of the environment and extracts essential information from them by applying hierarchical attention conditioned on the current instruction, which contributes to accurate action prediction for navigation. A preliminary version of the method won the ALFRED Challenge 2020. The current version achieves a success rate of 4.45% in unseen environments with a single view, which is further improved by using multiple views.
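The abstract describes the architecture only at a high level. As a concrete illustration, below is a minimal PyTorch sketch of how the two ideas could fit together: a language-only first stage that yields a tentative action and a target object class, followed by a second stage that fuses this tentative prediction with visual features attended over multiple egocentric views. Everything here is an assumption made for illustration: the class name TwoStageAgent, all module names, dimensions, and the fusion scheme are hypothetical, and a single soft attention over pooled view features stands in for the paper's hierarchical attention.

import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoStageAgent(nn.Module):
    """Illustrative sketch: language-only first pass, then visual fusion."""

    def __init__(self, d_model=256, n_actions=12, n_classes=80):
        super().__init__()
        self.instr_enc = nn.LSTM(d_model, d_model, batch_first=True)
        # Stage 1 heads: tentative action and target object class,
        # predicted from the instruction alone (no visual input).
        self.action_head_lang = nn.Linear(d_model, n_actions)
        self.class_head_lang = nn.Linear(d_model, n_classes)
        # Stage 2: fuse the tentative prediction with attended visual context.
        self.view_key = nn.Linear(d_model, d_model)
        self.action_embed = nn.Linear(n_actions, d_model)
        self.fuse = nn.Linear(3 * d_model, d_model)
        self.action_head = nn.Linear(d_model, n_actions)

    def attend_views(self, view_feats, query):
        # view_feats: (B, V, D), one pooled feature per egocentric view;
        # query: (B, D), the instruction summary conditioning the attention.
        scores = torch.einsum('bvd,bd->bv', self.view_key(view_feats), query)
        weights = F.softmax(scores, dim=-1)  # which view matters right now
        return torch.einsum('bv,bvd->bd', weights, view_feats)

    def forward(self, instr_tokens, view_feats):
        enc, _ = self.instr_enc(instr_tokens)   # (B, T, D)
        instr = enc.mean(dim=1)                 # (B, D) instruction summary

        # Stage 1: language-only tentative predictions.
        tentative_action = self.action_head_lang(instr)
        target_class = self.class_head_lang(instr)

        # Stage 2: integrate the tentative prediction with visual context
        # gathered from all egocentric views.
        visual = self.attend_views(view_feats, instr)
        h = torch.relu(self.fuse(torch.cat(
            [instr, visual, self.action_embed(tentative_action)], dim=-1)))
        final_action = self.action_head(h)
        return final_action, target_class


# Tiny smoke test with random features (B=2, T=8 tokens, V=5 views, D=256).
agent = TwoStageAgent()
actions, classes = agent(torch.randn(2, 8, 256), torch.randn(2, 5, 256))
print(actions.shape, classes.shape)  # torch.Size([2, 12]) torch.Size([2, 80])

Conditioning the view attention on the instruction summary is what lets such an agent "look wide" across views yet focus on the one relevant to the current step, and carrying the stage-one object class forward is what lets it pick the right object out of the image in stage two.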


