Weakly-Supervised Multi-Granularity Map Learning for Vision-and-Language Navigation

10/14/2022
by Peihao Chen, et al.

We address a practical yet challenging problem: training robot agents to navigate an environment by following a path described in natural-language instructions. The instructions often refer to objects in the environment. To navigate accurately and efficiently, it is critical to build a map that represents both the spatial locations and the semantic information of environment objects. However, enabling a robot to build such a map is extremely challenging, as environments often contain diverse objects with varied attributes. In this paper, we propose a multi-granularity map, which encodes both fine-grained object details (e.g., color, texture) and semantic classes, to represent objects more comprehensively. Moreover, we propose a weakly-supervised auxiliary task that requires the agent to localize instruction-relevant objects on the map. Through this task, the agent not only learns to localize instruction-relevant objects for navigation but is also encouraged to learn a map representation that better reveals object information. We then feed the learned map and the instruction to a waypoint predictor to determine the next navigation goal. Experimental results show our method outperforms the state-of-the-art on both seen and unseen environments of the VLN-CE dataset (e.g., by 4.0% in success rate). Code is available at https://github.com/PeihaoChen/WS-MGMap.
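The pipeline described in the abstract can be sketched at a very high level: a per-cell map carrying two granularities of object information, an auxiliary heatmap that localizes instruction-relevant objects, and a waypoint chosen from that heatmap. The sketch below is a toy illustration with made-up names, shapes, and a random instruction embedding; it is not the paper's implementation (see the linked repository for that).

```python
import numpy as np

# Toy resolution and feature sizes (illustrative assumptions, not the paper's).
H, W = 8, 8           # top-down map grid
D_FINE, D_SEM = 4, 3  # fine-grained feature dim, number of semantic classes

rng = np.random.default_rng(0)
fine_map = rng.random((H, W, D_FINE))  # low-level visual features (color, texture)
sem_map = rng.random((H, W, D_SEM))    # per-class semantic scores

# Multi-granularity map: both granularities stacked per map cell.
mg_map = np.concatenate([fine_map, sem_map], axis=-1)  # (H, W, D_FINE + D_SEM)

# Auxiliary localization task (toy): score each cell against a hypothetical
# instruction embedding and normalize into a heatmap over the map.
instr = rng.random(D_FINE + D_SEM)     # stand-in for an encoded instruction
relevance = mg_map @ instr             # (H, W) instruction-map similarity
heatmap = np.exp(relevance) / np.exp(relevance).sum()

# Waypoint predictor (toy): take the most relevant cell as the next goal.
goal = np.unravel_index(heatmap.argmax(), (H, W))
print(goal)
```

In the paper this role is played by learned networks trained with weak supervision; the sketch only shows how the two map granularities and the localization heatmap fit together structurally.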

Related research:

08/15/2023 · A^2Nav: Action-Aware Zero-Shot Robot Navigation by Exploiting Vision-and-Language Ability of Foundation Models
We study the task of zero-shot vision-and-language navigation (ZS-VLN), ...

07/24/2023 · GridMM: Grid Memory Map for Vision-and-Language Navigation
Vision-and-language navigation (VLN) enables the agent to navigate to a ...

09/16/2019 · Where are the Keys? -- Learning Object-Centric Navigation Policies on Semantic Maps with Graph Convolutional Networks
Emerging object-based SLAM algorithms can build a graph representation o...

05/21/2023 · Instance-Level Semantic Maps for Vision Language Navigation
Humans have a natural ability to perform semantic associations with the ...

11/15/2020 · ArraMon: A Joint Navigation-Assembly Instruction Interpretation Task in Dynamic Environments
For embodied agents, navigation is an important ability but not an isola...

07/03/2019 · Chasing Ghosts: Instruction Following as Bayesian State Tracking
A visually-grounded navigation instruction can be interpreted as a seque...

07/23/2021 · Adversarial Reinforced Instruction Attacker for Robust Vision-Language Navigation
Language instruction plays an essential role in the natural language gro...
