HOP: History-and-Order Aware Pre-training for Vision-and-Language Navigation

03/22/2022
by   Yanyuan Qiao, et al.
0

Pre-training has been adopted in a few of recent works for Vision-and-Language Navigation (VLN). However, previous pre-training methods for VLN either lack the ability to predict future actions or ignore the trajectory contexts, which are essential for a greedy navigation process. In this work, to promote the learning of spatio-temporal visual-textual correspondence as well as the agent's capability of decision making, we propose a novel history-and-order aware pre-training paradigm (HOP) with VLN-specific objectives that exploit the past observations and support future action prediction. Specifically, in addition to the commonly used Masked Language Modeling (MLM) and Trajectory-Instruction Matching (TIM), we design two proxy tasks to model temporal order information: Trajectory Order Modeling (TOM) and Group Order Modeling (GOM). Moreover, our navigation action prediction is also enhanced by introducing the task of Action Prediction with History (APH), which takes into account the history visual perceptions. Extensive experimental results on four downstream VLN tasks (R2R, REVERIE, NDH, RxR) demonstrate the effectiveness of our proposed method compared against several state-of-the-art agents.

READ FULL TEXT
research
04/11/2023

Improving Vision-and-Language Navigation by Generating Future-View Image Semantics

Vision-and-Language Navigation (VLN) is the task that requires an agent ...
research
04/05/2023

ENTL: Embodied Navigation Trajectory Learner

We propose Embodied Navigation Trajectory Learner (ENTL), a method for e...
research
07/15/2021

Neighbor-view Enhanced Model for Vision and Language Navigation

Vision and Language Navigation (VLN) requires an agent to navigate to a ...
research
10/25/2021

History Aware Multimodal Transformer for Vision-and-Language Navigation

Vision-and-language navigation (VLN) aims to build autonomous visual age...
research
03/28/2023

KERM: Knowledge Enhanced Reasoning for Vision-and-Language Navigation

Vision-and-language navigation (VLN) is the task to enable an embodied a...
research
11/26/2020

A Recurrent Vision-and-Language BERT for Navigation

Accuracy of many visiolinguistic tasks has benefited significantly from ...
research
06/20/2023

GenPlot: Increasing the Scale and Diversity of Chart Derendering Data

Vertical bars, horizontal bars, dot, scatter, and line plots provide a d...

Please sign up or login with your details

Forgot password? Click here to reset