A Recurrent Vision-and-Language BERT for Navigation

11/26/2020
by Yicong Hong, et al.

The accuracy of many visiolinguistic tasks has benefited significantly from the application of vision-and-language (V&L) BERT. However, its application to the task of vision-and-language navigation (VLN) remains limited. One reason for this is the difficulty of adapting the BERT architecture to the partially observable Markov decision process present in VLN, which requires history-dependent attention and decision making. In this paper we propose a recurrent BERT model that is time-aware for use in VLN. Specifically, we equip the BERT model with a recurrent function that maintains cross-modal state information for the agent. Through extensive experiments on R2R and REVERIE we demonstrate that our model can replace more complex encoder-decoder models to achieve state-of-the-art results. Moreover, our approach can be generalised to other transformer-based architectures, supports pre-training, and is capable of multi-task learning, suggesting the potential to merge a wide range of BERT-like models for other vision and language tasks.
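
The idea of a recurrent state carried through an attention model can be illustrated with a toy sketch. This is not the authors' architecture: it assumes a single attention head with no learned weights, and the names `attend`, `recurrent_step`, and `run_episode` are hypothetical. At each time step a state vector attends over itself, the instruction tokens, and the current visual tokens, and the attended output becomes the state for the next step.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, tokens):
    """Scaled dot-product attention of one query over a list of token vectors.

    Toy assumption: keys and values are the raw token vectors (no projections).
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, tok)) / scale for tok in tokens]
    weights = softmax(scores)
    return [sum(w * tok[i] for w, tok in zip(weights, tokens))
            for i in range(len(query))]

def recurrent_step(state, language_tokens, visual_tokens):
    """One time step: the state attends over [state, language, vision].

    The attended output is returned as the next state, so information from
    earlier observations is carried forward through the state vector.
    """
    tokens = [state] + language_tokens + visual_tokens
    return attend(state, tokens)

def run_episode(initial_state, language_tokens, observations):
    """Roll the state through a sequence of per-step visual observations."""
    state = initial_state
    history = [state]
    for visual_tokens in observations:
        state = recurrent_step(state, language_tokens, visual_tokens)
        history.append(state)
    return history

if __name__ == "__main__":
    # Hypothetical 2-dimensional features, purely for illustration.
    language = [[1.0, 0.0], [0.0, 1.0]]
    observations = [[[0.5, 0.5]], [[1.0, 1.0]], [[0.0, 0.2]]]
    for t, s in enumerate(run_episode([0.1, 0.1], language, observations)):
        print(t, s)
```

In this sketch the instruction tokens stay fixed across steps while the visual tokens change, so the state vector is the only channel through which history reaches later steps. A real model would add learned projections, multiple layers, and an action-prediction head.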

research
12/14/2021

CoCo-BERT: Improving Video-Language Pre-training with Contrastive Cross-modal Matching and Denoising

BERT-type structure has led to the revolution of vision-language pre-tra...
research
11/28/2021

Explore the Potential Performance of Vision-and-Language Navigation Model: a Snapshot Ensemble Method

Vision-and-Language Navigation (VLN) is a challenging task in the field ...
research
03/30/2021

Kaleido-BERT: Vision-Language Pre-training on Fashion Domain

We present a new vision-language (VL) pre-training model dubbed Kaleido-...
research
07/16/2023

Breaking Down the Task: A Unit-Grained Hybrid Training Framework for Vision and Language Decision Making

Vision language decision making (VLDM) is a challenging multimodal task....
research
10/31/2019

DiaNet: BERT and Hierarchical Attention Multi-Task Learning of Fine-Grained Dialect

Prediction of language varieties and dialects is an important language p...
research
03/22/2022

HOP: History-and-Order Aware Pre-training for Vision-and-Language Navigation

Pre-training has been adopted in a few of recent works for Vision-and-La...
research
11/03/2022

Circling Back to Recurrent Models of Language

Just because some purely recurrent models suffer from being hard to opti...