Perceive, Transform, and Act: Multi-Modal Attention Networks for Vision-and-Language Navigation

11/27/2019
by Federico Landi, et al.

Vision-and-Language Navigation (VLN) is a challenging task in which an agent must follow a language-specified path to reach a target destination. In this paper, we strive to create an agent able to tackle three key issues: multi-modality, long-term dependencies, and adaptability to different locomotive settings. To that end, we devise "Perceive, Transform, and Act" (PTA): a fully-attentive VLN architecture that leaves the recurrent approach behind and is the first Transformer-like architecture to incorporate three different modalities - natural language, images, and discrete actions for agent control. In particular, we adopt an early fusion strategy to merge lingual and visual information efficiently in our encoder. We then refine the decoding phase with a late fusion extension between the agent's history of actions and the perceptual modalities. We experimentally validate our model on two datasets and two different action settings. PTA surpasses previous state-of-the-art architectures for low-level VLN on R2R and achieves first place in both setups on the recently proposed R4R benchmark. Our code is publicly available at https://github.com/aimagelab/perceive-transform-and-act.
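The early/late fusion idea from the abstract can be sketched with plain scaled dot-product attention: language tokens attend to visual features in the encoder (early fusion), and the agent's action-history state then queries the fused perception features before action prediction (late fusion). This is a minimal NumPy illustration, not the authors' implementation; all shapes, names, and the single-head attention are illustrative assumptions.

```python
import numpy as np

def attention(q, k, v):
    # scaled dot-product attention with a numerically stable softmax
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
d = 16                                # illustrative feature dimension
lang = rng.standard_normal((8, d))    # language token features
img = rng.standard_normal((5, d))     # visual region features

# Early fusion: language tokens attend to image regions in the encoder,
# producing a jointly lingual-visual representation (residual added).
fused = lang + attention(lang, img, img)

# Late fusion: the current action-history state queries the fused
# perception features to build the context for the next action.
action_state = rng.standard_normal((1, d))
context = attention(action_state, fused, fused)
print(context.shape)  # (1, 16)
```

In the real architecture this would be multi-head attention stacked in Transformer layers; the sketch only shows where the two fusion steps sit relative to each other.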


Related research

- NestedFormer: Nested Modality-Aware Transformer for Brain Tumor Segmentation (08/31/2022)
  Multi-modal MR imaging is routinely used in clinical practice to diagnos...

- Target-Driven Structured Transformer Planner for Vision-Language Navigation (07/19/2022)
  Vision-language navigation is the task of directing an embodied agent to...

- Embodied Vision-and-Language Navigation with Dynamic Convolutional Filters (07/05/2019)
  In Vision-and-Language Navigation (VLN), an embodied agent needs to reac...

- Attention Based Natural Language Grounding by Navigating Virtual Environment (04/23/2018)
  In this work, we focus on the problem of grounding language by training ...

- Diagnosing the Environment Bias in Vision-and-Language Navigation (05/06/2020)
  Vision-and-Language Navigation (VLN) requires an agent to follow natural...

- Target-Grounded Graph-Aware Transformer for Aerial Vision-and-Dialog Navigation (08/22/2023)
  This report details the methods of the winning entry of the AVDN Challen...

- Memory Based Attentive Fusion (07/16/2020)
  The use of multi-modal data for deep machine learning has shown promise ...
