Cross-modal Map Learning for Vision and Language Navigation

03/10/2022
by Georgios Georgakis et al.

We consider the problem of Vision-and-Language Navigation (VLN). Most current methods for VLN are trained end-to-end, relying either on unstructured memory such as an LSTM or on cross-modal attention over the agent's egocentric observations. In contrast to other works, our key insight is that the association between language and vision is stronger when it occurs in explicit spatial representations. In this work, we propose a cross-modal map learning model for vision-and-language navigation that first learns to predict top-down semantics on an egocentric map for both observed and unobserved regions, and then predicts a path towards the goal as a set of waypoints. In both cases, the prediction is informed by the language through cross-modal attention mechanisms. We experimentally test the basic hypothesis that language-driven navigation can be solved given a map, and then show competitive results on the full VLN-CE benchmark.
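The core mechanism described above is cross-modal attention in which map cells attend to instruction tokens, yielding language-conditioned map features from which semantics and waypoints can be decoded. Below is a minimal NumPy sketch of that fusion step under simplifying assumptions (single attention head, random toy features, no learned projections); the function name and shapes are illustrative, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(map_feats, lang_feats):
    """Each map cell (query) attends over instruction tokens (keys/values),
    returning a language-conditioned feature per cell."""
    d = map_feats.shape[-1]
    scores = map_feats @ lang_feats.T / np.sqrt(d)   # (H*W, T) attention logits
    attn = softmax(scores, axis=-1)                  # rows sum to 1 over tokens
    return attn @ lang_feats                         # (H*W, d) fused features

# Toy example: a 4x4 egocentric map grid, 5 instruction tokens, 8-dim features
rng = np.random.default_rng(0)
H, W, T, d = 4, 4, 5, 8
map_feats = rng.standard_normal((H * W, d))   # one feature vector per map cell
lang_feats = rng.standard_normal((T, d))      # one embedding per instruction token
fused = cross_modal_attention(map_feats, lang_feats)
assert fused.shape == (H * W, d)
```

In the model described in the abstract, features like `fused` would feed both the semantic-map decoder (labels for observed and unobserved regions) and the waypoint predictor; here the sketch only illustrates how language conditions the spatial representation.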

