Episodic Transformer for Vision-and-Language Navigation

05/13/2021
by Alexander Pashevich, et al.

Interaction and navigation defined by natural language instructions in dynamic environments pose significant challenges for neural agents. This paper focuses on two such challenges: handling long sequences of subtasks, and understanding complex human instructions. We propose Episodic Transformer (E.T.), a multimodal transformer that encodes language inputs and the full episode history of visual observations and actions. To improve training, we leverage synthetic instructions as an intermediate representation that decouples understanding the visual appearance of an environment from the variations of natural language instructions. We demonstrate that encoding the history with a transformer is critical for solving compositional tasks, and that pretraining and joint training with synthetic instructions further improve performance. Our approach sets a new state of the art on the challenging ALFRED benchmark, achieving 38.4% and 8.5% task success rates on seen and unseen test splits.
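To make the architecture concrete, below is a minimal PyTorch sketch of an E.T.-style encoder. It is not the authors' released code: the class name, the 512-dimensional visual frame features, the action vocabulary size, and all other dimensions are illustrative assumptions. The idea it captures is the one stated in the abstract: language tokens, per-step visual observations, and past actions are embedded into a shared space and processed jointly by a single transformer, so the full episode history is available at every step. For brevity it omits the causal attention mask a real agent would need so that step t cannot attend to future observations.

import torch
import torch.nn as nn

class EpisodicTransformerSketch(nn.Module):
    """Illustrative E.T.-style multimodal encoder (hypothetical names/dims)."""

    def __init__(self, vocab_size=1000, num_actions=12, d_model=256,
                 nhead=8, num_layers=2, max_len=512):
        super().__init__()
        self.lang_emb = nn.Embedding(vocab_size, d_model)     # instruction tokens
        self.act_emb = nn.Embedding(num_actions, d_model)     # past actions
        self.vis_proj = nn.Linear(512, d_model)               # assumed 512-d CNN frame features
        self.pos_emb = nn.Embedding(max_len, d_model)         # position over the flat sequence
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.action_head = nn.Linear(d_model, num_actions)    # next-action logits

    def forward(self, lang_tokens, frame_feats, past_actions):
        # lang_tokens: (B, L) ids; frame_feats: (B, T, 512); past_actions: (B, T) ids
        lang = self.lang_emb(lang_tokens)
        vis = self.vis_proj(frame_feats)
        act = self.act_emb(past_actions)
        # Concatenate all modalities so attention spans the whole episode history.
        x = torch.cat([lang, vis, act], dim=1)
        pos = self.pos_emb(torch.arange(x.size(1), device=x.device))
        h = self.encoder(x + pos)
        # Read action predictions off the visual timesteps of the episode.
        L, T = lang.size(1), vis.size(1)
        return self.action_head(h[:, L:L + T])

# Example: batch of 2 episodes, a 20-token instruction, 5 timesteps each.
lang = torch.randint(0, 1000, (2, 20))
vis = torch.randn(2, 5, 512)
acts = torch.randint(0, 12, (2, 5))
logits = EpisodicTransformerSketch()(lang, vis, acts)  # shape (2, 5, 12)

Concatenating modalities into one sequence, rather than summarizing history into a recurrent state, is what lets the model attend directly to any earlier observation or action when resolving a long chain of subtasks.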


Related research

History Aware Multimodal Transformer for Vision-and-Language Navigation (10/25/2021)
Vision-and-language navigation (VLN) aims to build autonomous visual age...

CrossMap Transformer: A Crossmodal Masked Path Transformer Using Double Back-Translation for Vision-and-Language Navigation (03/01/2021)
Navigation guided by natural language instructions is particularly suita...

Instruction-driven history-aware policies for robotic manipulations (09/11/2022)
In human environments, robots are expected to accomplish a variety of ma...

SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation (10/27/2021)
Natural language instructions for visual navigation often use scene desc...

Hierarchical Decision Making by Generating and Following Natural Language Instructions (06/03/2019)
We explore using latent natural language instructions as an expressive a...

Embodied BERT: A Transformer Model for Embodied, Language-guided Visual Task Completion (08/10/2021)
Language-guided robots performing home and office tasks must navigate in...

Instruction-Following Agents with Jointly Pre-Trained Vision-Language Models (10/24/2022)
Humans are excellent at understanding language and vision to accomplish ...
