A Recurrent Vision-and-Language BERT for Navigation

Accuracy on many visiolinguistic tasks has benefited significantly from the application of vision-and-language (V&L) BERT. However, its application to the task of vision-and-language navigation (VLN) remains limited. One reason is the difficulty of adapting the BERT architecture to the partially observable Markov decision process underlying VLN, which requires history-dependent attention and decision making. In this paper we propose a time-aware recurrent BERT model for VLN. Specifically, we equip the BERT model with a recurrent function that maintains cross-modal state information for the agent. Through extensive experiments on R2R and REVERIE we demonstrate that our model can replace more complex encoder-decoder models and achieve state-of-the-art results. Moreover, our approach generalises to other transformer-based architectures, supports pre-training, and is capable of multi-task learning, suggesting the potential to merge a wide range of BERT-like models for other vision and language tasks.





1 Introduction

Asking a robot to navigate in complex environments following human instructions has been a long-term goal in AI research. Recently, a great variety of vision-and-language navigation (VLN) setups [anderson2018vision, qi2020reverie, thomason2020vision] have been introduced for relevant studies and a large number of works explore different methods to leverage visual and language clues to assist navigation. For example, in the popular R2R navigation task [anderson2018vision], enhancing the learning of visual-textual correspondence is essential for the agent to correctly interpret the instruction and perceive the environment.

On the other hand, recent studies on vision-and-language pre-training have achieved significant improvements over a wide range of visiolinguistic problems. Instead of designing complex and monolithic models for different tasks, these methods pre-train a multi-layer Transformer [vaswani2017attention] on a large number of image-text pairs to learn generic cross-modal representations [chen2020uniter, li2020unicoder, li2019visualbert, li2020oscar, lu2019vilbert, su2019vl, tan2019lxmert], known as V&L BERT (Bidirectional Encoder Representations from Transformers [devlin2019bert]). These advances have inspired us to employ V&L BERT for VLN, replacing the complicated modules for modelling cross-modal relationships and allowing the learning of navigation to adequately benefit from the pre-trained visual-textual knowledge. Unlike recent works on VLN which apply a pre-trained V&L BERT only for encoding language [hao2020towards, li2019robust] or for measuring the instruction-path compatibility [majumdar2020improving], we propose to use existing V&L BERT models themselves for learning to navigate.

Figure 1: Recurrent multi-layer Transformer for addressing partially observable inputs. A state token is defined along with the input sequence. At each time step, a new state representation is generated based on the new observation. Meanwhile, the past information helps in inferring a new decision.

However, an essential difference between VLN and other vision-and-language tasks is that VLN can be considered a partially observable Markov decision process, in which future observations depend on the agent's current state and action. Meanwhile, at each navigational step, the visual observation corresponds to only part of the instruction; this requires the agent to keep track of the navigation progress and correctly localise the relevant sub-instruction to gain useful information for decision making. Another difficulty of applying V&L BERT to VLN is the high demand on computational power; since the navigational episode can be very long, performing self-attention on a long visual and textual sequence at each time step costs an excessive amount of memory during training.

To address the aforementioned problems, we propose a Recurrent Vision-and-Language BERT for Navigation, or simply VLN BERT. Instead of employing large-scale datasets for pre-training, which usually requires thousands of GPU hours, the aim of this work is to allow the learning of VLN to adequately benefit from pre-trained V&L BERT. Based on previously proposed V&L BERT models, we implement a recurrent function in their original architecture (Fig. 1) to model and leverage the history-dependent state representations, without explicitly defining a memory buffer [zhu2020babywalk] or applying any external recurrent modules such as an LSTM [hochreiter1997long]. To reduce memory consumption, we restrict the self-attention so that the language tokens serve only as keys and values, not as queries, during navigation, similar to the cross-modality encoder in LXMERT [tan2019lxmert]. This design greatly reduces the memory usage so that the entire model can be trained on a single GPU without performance degradation. Furthermore, as with the original V&L BERT, our proposed model has the potential for multi-task learning: it is able to address other vision and language problems along with the navigation task.

We employ two datasets to evaluate the performance of our VLN BERT: R2R [anderson2018vision] and REVERIE [qi2020reverie]. The chosen datasets differ in terms of the provided visual clues, the instructions and the goal. Our agent, initialised from a pre-trained V&L BERT and fine-tuned on the two datasets, achieves state-of-the-art results. We also initialise our model with PREVALENT [hao2020towards], an LXMERT-like model pre-trained for VLN. On the test split of R2R [anderson2018vision], it improves the Success Rate by an absolute 8% and achieves 57% Success weighted by Path Length (SPL). For the remote referring expression (REF) task in REVERIE [qi2020reverie], our agent obtains 18.25% navigation SPL and 9.55% Remote Grounding SPL. These results indicate the strong generalisation ability of our proposed VLN BERT as well as the potential of using it for merging the learning of VLN with other vision and language tasks, as in the work of 12-in-1 [lu202012].

Figure 2: Schematics of the Recurrent Vision-and-Language BERT. At the initialisation stage, the entire instruction is encoded by a multi-layer Transformer, where the output feature of the [CLS] token serves as the initial state representation of the agent. During navigation, the concatenated sequence of state, encoded language and new visual observation is fed to the same Transformer to obtain the updated state and decision probabilities. The updated state and the language encoding from initialisation are fused and applied as input at the next time step. The green star indicates the visual-textual matching (Eq. 12) and the past decision encoding (Eq. 13).

2 Related Work

Vision-and-Language Navigation

Learning navigation with visual-linguistic clues has drawn significant research interest. The recent R2R [anderson2018vision] and Touchdown [chen2019touchdown] datasets introduce human natural language as guidance and apply photo-realistic environments for navigation. Following these works, dialog-based navigation such as CVDN [thomason2020vision], VNLA [nguyen2019vision] and HANNA [nguyen2019help], navigation for localising a remote object such as REVERIE [qi2020reverie], VLN in continuous environments [krantz2020navgraph], and multilingual navigation with spatial-temporal grounding such as RxR [anderson2020rxr] have been proposed for further research.

One crucial challenge in VLN is to understand the visual-textual correspondence. To achieve this, Self-Monitoring [ma2019self] and RCM [wang2019reinforced] adopt cross-modal attention to highlight the relevant observations and instruction at each step. Speaker-Follower [fried2018speaker] and EnvDrop [tan2019learning] learn on augmented training data in a self-supervised manner. FAST [ke2019tactical] resorts to self-correcting navigation, while APS [fu2020counter] samples adversarial paths for training to enhance the model's ability to generalise. AuxRN [zhu2020vision] applies several auxiliary losses to learn generic representations, and Qi et al. [qi2020object] and Wang et al. [wang2020soft] also design loss functions to encourage the agent to follow the instructions and take the shortest paths. More recently, Hong et al. [hong2020graph] propose a graph network to model the intra- and inter-modal relationships among the contextual and visual clues. The great improvements achieved by these methods encourage researchers to explore simpler and more powerful visiolinguistic learning networks for VLN.

Visual BERT Pre-Training

After the success of pre-trained BERT on a wide range of natural language processing tasks [devlin2019bert], the model has been extended to process visual tokens and to pre-train on large-scale image/video-text pairs for learning generic visual-linguistic representations. Previous research introduces two-stream BERT models which encode texts and images separately and fuse the two modalities at a later stage [lu2019vilbert, tan2019lxmert], as well as one-stream BERT models which directly perform inter-modal grounding [chen2020uniter, li2020unicoder, li2019visualbert, li2020oscar, su2019vl]. Although video BERT approaches have been proposed to learn the correspondence between texts and video frames [li2020hero, luo2020univilm, sun2019videobert, yang2020bert], we are the first to integrate recurrence into BERT to learn partially observable and temporally dependent inputs. In terms of VLN pre-training, PRESS adopts an off-the-shelf pre-trained language BERT for encoding instructions [li2019robust], PREVALENT trains a V&L BERT on a large amount of image-text-action triplets from scratch to learn navigation-oriented textual representations [hao2020towards], and VLN-BERT [majumdar2020improving] fine-tunes a ViLBERT [lu2019vilbert] on instruction-trajectory pairs for measuring their compatibility in a beam search setting. Unlike all previous work, our VLN BERT can be adapted to various V&L BERT models with an additional recurrent function; it is a navigator network by itself which can be directly trained for navigation.

V&L Multi-Task Learning

Instead of building a monolithic model for each V&L task, a number of previous works explore multi-task learning with a unified model, utilising the common and complementary knowledge to reduce the domain gap [li2018visual, nguyen2019multi, pramanik2019omninet, shuster2019dialogue]. Very recently, 12-in-1 [lu202012] trains a single ViLBERT [lu2019vilbert] on 12 different datasets across four categories of V&L tasks, including visual question answering, referring expressions, multi-modal verification and caption-based image retrieval. In VLN, Wang et al. [wang2020environment] propose a multitask navigation model to address the R2R [anderson2018vision] and the Navigation from Dialog History (NDH) [thomason2020vision] tasks seamlessly. Compared to previous methods, our VLN BERT is the first to close the gap between VLN and other V&L problems: we apply a single model to train on both the R2R [anderson2018vision] and the REF [qi2020reverie] tasks.

3 Proposed Model

In this section, we first define the vision-and-language navigation task, then we revisit the BERT model [devlin2019bert] and present the architecture of our proposed VLN BERT.

3.1 VLN Background

The problem of VLN can be formulated as follows: given a natural language instruction $X$ consisting of a sequence of words, at each time step $t$ the agent observes the environment and infers an action $a_t$ which transfers the agent from state $s_t$ to a new state $s_{t+1}$. The state consists of the navigational history and the current spatial position, defined by a triplet $\langle v_t, \theta_t, \phi_t \rangle$, where $v_t$ is a viewpoint on the pre-defined connectivity graph of the environment [anderson2018vision], and $\theta_t$ and $\phi_t$ are the angles of heading and elevation, respectively. The agent needs to execute a sequence of actions to navigate on the connectivity graph and eventually decides to stop at the target position to complete the task.
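The interaction pattern described above can be sketched as a simple loop; the toy connectivity graph, `ToyEnv` class and random policy below are hypothetical stand-ins for the R2R simulator and a learned policy, illustrating only the partially observable structure (the agent sees only the candidates at its current viewpoint).

```python
import random

# A minimal sketch of the VLN loop as a partially observable decision process.
class ToyEnv:
    def __init__(self):
        # tiny connectivity graph: viewpoint -> navigable neighbours
        self.graph = {"a": ["b", "c"], "b": ["a", "goal"], "c": ["a"], "goal": []}
        self.viewpoint = "a"

    def observe(self):
        # the agent only observes the candidates at its current viewpoint
        return self.graph[self.viewpoint]

    def step(self, action):
        self.viewpoint = action

def navigate(env, max_steps=10):
    trajectory = [env.viewpoint]
    for _ in range(max_steps):
        candidates = env.observe()
        if not candidates:                    # no outgoing edges: stop
            break
        action = random.choice(candidates)    # a learned policy would score candidates here
        env.step(action)
        trajectory.append(env.viewpoint)
        if env.viewpoint == "goal":           # agent decides to stop at the target
            break
    return trajectory

traj = navigate(ToyEnv())
```

In the actual task the stopping decision is itself an action inferred by the policy rather than a hard-coded check.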

3.2 Revisit BERT

Bidirectional Encoder Representations from Transformers (BERT) [devlin2019bert] is a multi-layer Transformer architecture [vaswani2017attention] designed to pre-train deep bidirectional language representations. Each layer of the Transformer encodes the language features from the previous layer with multi-head self-attention to capture the dependencies among the words in the sentence, and applies a residual feed-forward network to process the output features.

Formally, the $h$-th attention head at the $l$-th layer performs self-attention over the token features $X^l$ as

$$A^l_h = \operatorname{Softmax}\!\left(\frac{(X^l W^Q_h)(X^l W^K_h)^{\top}}{\sqrt{d}}\right) X^l W^V_h \qquad (1)$$

where $W^Q_h$, $W^K_h$ and $W^V_h$ are learnable linear projections specifically for queries, keys and values (all $W$ in this section denote learnable linear projections) and $d$ is the hidden dimension of the network. The outputs from all the attention heads are concatenated and projected onto the same dimension as the input as

$$\tilde{X}^{l} = \left[A^l_1 ; \dots ; A^l_{N_h}\right] W^O \qquad (2)$$

where $N_h$ is the total number of heads, $[\,\cdot\,;\,\cdot\,]$ denotes concatenation and $W^O$ is a learned linear projection. Finally, the output of layer $l$ is formulated by

$$X^{l+1} = \operatorname{LayerNorm}\!\left(W_2\,\operatorname{ReLU}(W_1 \tilde{X}^{l}) + \tilde{X}^{l}\right) \qquad (3)$$

where ReLU is the Rectified Linear Unit activation function and LayerNorm is layer normalisation


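As a concrete illustration, the three steps above (scaled dot-product attention per head, concatenation with an output projection, and a residual feed-forward block with layer normalisation) can be sketched in NumPy; the parameter names and shapes are our own choices, and real BERT implementations additionally use dropout and a residual connection around the attention block itself.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def transformer_layer(X, params, n_heads=4):
    """One encoder layer: multi-head self-attention plus a residual FFN."""
    n, d = X.shape
    d_h = d // n_heads
    heads = []
    for h in range(n_heads):
        Q = X @ params["Wq"][h]                 # queries, (n, d_h)
        K = X @ params["Wk"][h]                 # keys
        V = X @ params["Wv"][h]                 # values
        A = softmax(Q @ K.T / np.sqrt(d_h))     # scaled dot-product attention
        heads.append(A @ V)
    H = np.concatenate(heads, axis=-1) @ params["Wo"]   # concat heads + projection
    # residual feed-forward network with ReLU, then layer normalisation
    F = np.maximum(H @ params["W1"], 0) @ params["W2"]
    return layer_norm(F + H)

d, n_heads = 16, 4
params = {
    "Wq": [rng.standard_normal((d, d // n_heads)) for _ in range(n_heads)],
    "Wk": [rng.standard_normal((d, d // n_heads)) for _ in range(n_heads)],
    "Wv": [rng.standard_normal((d, d // n_heads)) for _ in range(n_heads)],
    "Wo": rng.standard_normal((d, d)),
    "W1": rng.standard_normal((d, 4 * d)),
    "W2": rng.standard_normal((4 * d, d)),
}
out = transformer_layer(rng.standard_normal((5, d)), params)
```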
Based on this architecture, BERT has been extended to V&L BERT [chen2020uniter, li2020unicoder, li2019visualbert, li2020oscar, lu2019vilbert, su2019vl, tan2019lxmert], which takes the concatenation of language tokens and visual tokens as input, and pre-trains on image-text corpus to learn generic visiolinguistic representations.

3.3 Recurrent VLN BERT

The idea of our VLN BERT can be adapted to a wide range of Transformer-based networks; in this section, we apply the recently proposed one-stream V&L BERT model OSCAR [li2020oscar] for demonstration. We modify the model to enable the learning of navigation and the associated referring expression task [qi2020reverie]. As shown in Fig. 2, at each time step, the network takes three sets of tokens as input: the previous state token $s_{t-1}$, the language tokens $X$ and the visual tokens $V_t$. Then, it performs self-attention over these cross-modal tokens to capture the textual-visual correspondence for inferring the action logits:

$$s_t,\; p^{a}_t = \operatorname{VLN BERT}\!\left(s_{t-1},\, X,\, V_t\right) \qquad (4)$$

Language Processing

At initialisation ($t=0$), a sequence of words consisting of the classification token [CLS], the language tokens of the instruction and the separation token [SEP] is fed into VLN BERT. In previous V&L BERT, the [CLS] token is trained to gather relevant visiolinguistic clues from the input sequence, and it is usually applied to downstream classification tasks. We argue that the function of [CLS] can be adapted to represent the agent's state in VLN. We use the embedded [CLS] token as the initial state representation $s_0$, and use the [SEP] token to separate the textual and visual inputs.


During navigation steps ($t \geq 1$), unlike the state token or the visual tokens, which perform self-attention with respect to the entire input sequence, the language tokens only serve as the keys and values in the Transformer. We consider the language tokens produced by the model at the initialisation step as a deep representation of the instruction which does not need to be further encoded in later steps. Not updating the language features also saves a large amount of computational resources, since the instruction and the trajectory can be long in VLN problems.
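A minimal single-head sketch of this restricted attention (names and shapes are illustrative, not the OSCAR implementation): the cached language encoding enters only as keys and values, so the attention produces no output rows for the instruction tokens, which is why they need no re-encoding at later steps.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_step_attention(state, language, visual, Wq, Wk, Wv):
    """Only the state and visual tokens act as queries; the frozen language
    encoding contributes keys/values but is never re-encoded."""
    queries = np.vstack([state, visual])             # (1 + n_views, d)
    context = np.vstack([state, language, visual])   # full sequence as keys/values
    A = softmax((queries @ Wq) @ (context @ Wk).T / np.sqrt(Wq.shape[1]))
    return A @ (context @ Wv)                        # one output row per query only

d = 8
state = rng.standard_normal((1, d))
language = rng.standard_normal((10, d))   # encoded once at t = 0, then cached
visual = rng.standard_normal((4, d))
W = [rng.standard_normal((d, d)) for _ in range(3)]
out = cross_step_attention(state, language, visual, *W)
```

The saving comes from the quadratic cost of attention: with a 10-token instruction, the score matrix here is 5×15 rather than 15×15, and the gap grows with instruction length.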

Vision Processing

At each navigation step, the agent makes a new visual observation of the environment and uses the visual clues to assist navigation. To process the visual clues, the network first projects the image features of the views at the navigable directions onto the same space as the BERT tokens, giving the visual tokens $V_t$. Then, the visual tokens are concatenated with the state token and the language tokens, and fed into the model.

In terms of the remote REF task [qi2020reverie], we simply consider the object features as additional visual tokens in the input sequence. Similarly, the features are projected onto the token space as object tokens $O_t$ and fed into the model. The object clues can provide valuable information about important landmarks on the path, which can be very helpful for navigation with high-level instructions [qi2020reverie].

State Representation

We formulate the agent's state at each time step as the summary of all the textual and visual clues that the agent has collected, as well as all the decisions that the agent has made up to the current viewpoint. Instead of explicitly defining a memory buffer [zhu2020babywalk] or implementing an additional recurrent network [hochreiter1997long] to store past experiences, our model relies on BERT's original architecture to recognise time-dependent inputs, and uses its pre-defined classification token [CLS] to represent the state. At each navigation step, the state representation is used as the leading input token of the entire textual-visual sequence. It then performs inter-modal self-attention in VLN BERT with the other tokens to update its content, and becomes the leading token of the input at the next step, in an autoregressive manner.

However, unlike most V&L BERT models, which apply the output feature of the [CLS] token for classification, our state is not directly used for inferring a decision (see the following Decision Making subsection), which means the vanilla state representation is not explicitly enforced to capture the most important language and visual features. To address this issue, our model matches the raw textual and visual tokens and feeds the output to the state representation. Formally, let $q^h_s$ and $K^h_X$ be the state query and the textual keys at head $h$ of the final ($L$-th) layer of VLN BERT; the attention scores over the textual tokens can be expressed as:

$$a^h_X = \frac{q^h_s (K^h_X)^{\top}}{\sqrt{d}} \qquad (8)$$

Then, we average the scores over all the attention heads ($N_h$) and apply a Softmax function to get the overall state-language attention weights as:

$$\bar{a}_X = \operatorname{Softmax}\!\left(\frac{1}{N_h}\sum_{h=1}^{N_h} a^h_X\right) \qquad (9)$$

Similarly, the visual attention scores $a^h_V$ and weights $\bar{a}_V$ can be obtained (Eq. 10). Now, we perform a weighted sum over the input textual tokens $X$ and visual tokens $V_t$ respectively to obtain the weighted raw features as:

$$\hat{X} = \bar{a}_X X, \qquad \hat{V}_t = \bar{a}_V V_t \qquad (11)$$

We then enforce a cross-modal Matching between the raw textual and visual features via an element-wise product and send this information to the agent's state as:

$$\tilde{s}_t = \left[\, s^L_t \,;\, \hat{X} \odot \hat{V}_t \,\right] W^M \qquad (12)$$

where $s^L_t$ is the output state feature at the final layer.

Finally, past decisions are important for the agent to keep track of the navigation progress, so our network records the new decision by feeding the directional features of the selected action $\hat{v}^{a}_t$ into the state token as:

$$s_t = \left[\, \tilde{s}_t \,;\, \hat{v}^{a}_t \,\right] W^A \qquad (13)$$

where $\tilde{s}_t$ is the matching-enhanced state from Eq. 12 and $s_t$ is the new representation of the agent's state at time step $t$.
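The state refinement above can be sketched end-to-end; the per-head scores, raw token features and fusion matrices below are randomly generated placeholders, and the exact fusion parameterisation (concatenation followed by a learned projection) is our assumption for illustration rather than the released implementation.

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def refine_state(scores_lang, scores_vis, X_raw, V_raw, s_out, v_action, W_m, W_a):
    # average per-head state->token scores, then Softmax (state-language /
    # state-vision attention weights; the visual weights double as action probs)
    w_lang = softmax(scores_lang.mean(axis=0))
    w_vis = softmax(scores_vis.mean(axis=0))
    # weighted sums over the *raw* input tokens
    x_hat = w_lang @ X_raw
    v_hat = w_vis @ V_raw
    # cross-modal matching by element-wise product, fused into the state
    s_tilde = np.concatenate([s_out, x_hat * v_hat]) @ W_m
    # record the selected direction's features in the state
    s_new = np.concatenate([s_tilde, v_action]) @ W_a
    return s_new, w_vis

d, n_heads, n_words, n_views = 8, 4, 10, 5
scores_lang = rng.standard_normal((n_heads, n_words))  # per-head state->word scores
scores_vis = rng.standard_normal((n_heads, n_views))   # per-head state->view scores
X_raw = rng.standard_normal((n_words, d))              # raw textual tokens
V_raw = rng.standard_normal((n_views, d))              # raw visual tokens
s_out = rng.standard_normal(d)                         # final-layer state feature
W_m = rng.standard_normal((2 * d, d))
W_a = rng.standard_normal((2 * d, d))

s_new, p_action = refine_state(scores_lang, scores_vis, X_raw, V_raw,
                               s_out, rng.standard_normal(d), W_m, W_a)
```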

Decision Making

Many previous VLN agents apply an inner product between the state representation and the visual features at the candidate directions to evaluate the state-vision correspondence, and choose the direction with the highest matching score to navigate [ma2019self, tan2019learning]. We find that the BERT network can nicely perform such matching because it is fully built upon inner-product-based soft-attention. As a result, we directly apply the mean attention weights of the visual tokens over all the attention heads in the last layer, with respect to the state, as the action probabilities: simply $p^{a}_t = \bar{a}_V$ (as defined in Eq. 10).

As for the remote referring expression task [qi2020reverie], our agent uses the same method to select an object. The selection probabilities can be expressed as $p^{o}_t = \bar{a}_O$, where $\bar{a}_O$ is the mean attention weight over all candidate objects.

3.4 Training

We train our network with a mixture of reinforcement learning (RL) and imitation learning (IL) objectives. We apply A2C [mnih2016asynchronous] for RL, in which the agent samples an action according to $p^{a}_t$ and measures the advantage $A_t$ at each step. In IL, our agent navigates on the ground-truth trajectory by following teacher actions and calculates a cross-entropy loss for each decision. Formally, we minimise the navigation loss function, expressed for each given sample, as

$$L = -\sum_{t} a^{s}_t \log\!\left(p^{a}_t\right) A_t \;-\; \lambda \sum_{t} a^{*}_t \log\!\left(p^{a}_t\right) \qquad (14)$$

where $a^{s}_t$ is the sampled action and $a^{*}_t$ is the teacher action. Here $\lambda$ is a coefficient for weighting the IL loss. In REVERIE [qi2020reverie], we apply an additional cross-entropy term to learn object grounding.
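A sketch of this mixed objective with made-up per-step numbers: `probs` stands in for the attention-derived action distribution at each step, the advantage values would come from the A2C critic, and the λ value here is arbitrary.

```python
import numpy as np

def navigation_loss(probs, sampled, teacher, advantages, lam=0.2):
    """RL policy-gradient term over sampled actions weighted by the advantage,
    plus a lambda-weighted cross-entropy term over teacher actions."""
    rl = -sum(a * np.log(p[i]) for p, i, a in zip(probs, sampled, advantages))
    il = -sum(np.log(p[j]) for p, j in zip(probs, teacher))
    return rl + lam * il

probs = [np.array([0.7, 0.2, 0.1]), np.array([0.1, 0.8, 0.1])]  # per-step action dists
sampled = [0, 1]          # actions drawn from the policy (RL branch)
teacher = [0, 1]          # ground-truth shortest-path actions (IL branch)
advantages = [0.5, -0.2]  # from the critic; made-up values
loss = navigation_loss(probs, sampled, teacher, advantages)
```

In practice the two branches run on separate rollouts of the same batch (a sampled rollout for RL and a teacher-forced rollout for IL) rather than sharing one trajectory as in this toy call.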

Reward Shaping

In addition to the progress rewards defined in EnvDrop [tan2019learning], we apply the normalised dynamic time warping [ilharco2019general] as a part of the reward to encourage the agent to follow the instruction while navigating. Moreover, we introduce a negative reward to penalise the agent if it misses the target. We refer the reader to the Appendix for more details.
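The nDTW signal can be sketched directly from its definition, $\mathrm{nDTW} = \exp(-\mathrm{DTW}(P, R)/(|R| \cdot d_{th}))$ [ilharco2019general]; the 2-D coordinates and the 3 m threshold below mirror the R2R setting, while the combination with progress rewards and the miss penalty is omitted.

```python
import math

def ndtw(path, ref, d_th=3.0):
    """Normalised dynamic time warping between an agent path and a reference
    path, each a list of 2-D points; d_th is the success threshold in metres."""
    n, m = len(path), len(ref)
    INF = float("inf")
    dtw = [[INF] * (m + 1) for _ in range(n + 1)]
    dtw[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = math.dist(path[i - 1], ref[j - 1])
            dtw[i][j] = c + min(dtw[i - 1][j], dtw[i][j - 1], dtw[i - 1][j - 1])
    return math.exp(-dtw[n][m] / (m * d_th))

ref = [(0, 0), (1, 0), (2, 0)]
print(ndtw(ref, ref))                        # identical paths -> 1.0
print(ndtw([(0, 2), (1, 2), (2, 2)], ref))   # parallel but offset -> lower
```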

3.5 Adaptation

We initialise the parameters of VLN BERT from OSCAR [li2020oscar] pre-trained without object tags. Although OSCAR is trained on regional features, we find that it is also compatible with grid features of the entire scene. When adapting to the LXMERT-like [tan2019lxmert] model in PREVALENT [hao2020towards], we remove the language branch in the cross-modality encoder and concatenate the state token with the visual tokens for self-attention. We also remove the entire downstream network of EnvDrop [tan2019learning], including the Speaker and the environmental dropout, and directly fine-tune the model pre-trained by PREVALENT for navigation.

4 Experiments

Methods R2R Validation Seen (TL NE SR SPL) R2R Validation Unseen (TL NE SR SPL) R2R Test Unseen (TL NE SR SPL)
Random 9.58 9.45 16 - 9.77 9.23 16 - 9.89 9.79 13 12
Human - - - - - - - - 11.85 1.61 86 76
Seq2Seq-SF [anderson2018vision] 11.33 6.01 39 - 8.39 7.81 22 - 8.13 7.85 20 18
Speaker-Follower [fried2018speaker] - 3.36 66 - - 6.62 35 - 14.82 6.62 35 28
SMNA [ma2019self] - 3.22 67 58 - 5.52 45 32 18.04 5.67 48 35
RCM+SIL (train) [wang2019reinforced] 10.65 3.53 67 - 11.46 6.09 43 - 11.97 6.12 43 38
PRESS [li2019robust] 10.57 4.39 58 55 10.36 5.28 49 45 10.77 5.49 49 45
FAST-Short [ke2019tactical] - - - - 21.17 4.97 56 43 22.08 5.14 54 41
EnvDrop [tan2019learning] 11.00 3.99 62 59 10.70 5.22 52 48 11.66 5.23 51 47
AuxRN [zhu2020vision] - 3.33 70 67 - 5.28 55 50 - 5.15 55 51
PREVALENT [hao2020towards] 10.32 3.67 69 65 10.19 4.71 58 53 10.51 5.30 54 51
RelGraph [hong2020graph] 10.13 3.47 67 65 9.99 4.73 57 53 10.29 4.75 55 52
Ours (no init. OSCAR) 9.78 3.92 62 59 10.31 5.10 50 46 11.15 5.45 51 47
Ours (init. OSCAR) 10.79 3.11 71 67 11.86 4.29 59 53 12.34 4.59 57 53
Ours (init. PREVALENT) 11.13 2.90 72 68 12.01 3.93 63 57 12.35 4.09 63 57
Table 1: Comparison of agent performance on R2R in the single-run setting. †: work that applies a pre-trained BERT for language encoding.
Methods REVERIE Validation Seen REVERIE Validation Unseen REVERIE Test Unseen
Navigation (SR OSR SPL TL) RGS RGSPL Navigation (SR OSR SPL TL) RGS RGSPL Navigation (SR OSR SPL TL) RGS RGSPL
Random 2.74 8.92 1.91 11.99 1.97 1.31 1.76 11.93 1.01 10.76 0.96 0.56 2.30 8.88 1.44 10.34 1.18 0.78
Human 81.51 86.83 53.66 21.18 77.84 51.44
Seq2Seq-SF [anderson2018vision] 29.59 35.70 24.01 12.88 18.97 14.96 4.20 8.07 2.84 11.07 2.16 1.63 3.99 6.88 3.09 10.89 2.00 1.58
RCM [wang2019reinforced] 23.33 29.44 21.82 10.70 16.23 15.36 9.29 14.23 6.97 11.98 4.89 3.89 7.84 11.68 6.67 10.60 3.67 3.14
SMNA [ma2019self] 41.25 43.29 39.61 7.54 30.07 28.98 8.15 11.28 6.44 9.07 4.54 3.61 5.80 8.39 4.53 9.23 3.10 2.39
FAST-Short [ke2019tactical] 45.12 49.68 40.18 13.22 31.41 28.11 10.08 20.48 6.17 29.70 6.24 3.97 14.18 23.36 8.74 30.69 7.07 4.52
FAST-MATTN [qi2020reverie] 50.53 55.17 45.50 16.35 31.97 29.66 14.40 28.20 7.19 45.28 7.84 4.67 19.88 30.63 11.61 39.05 11.28 6.08
Ours (init. OSCAR) 38.09 39.99 32.95 14.18 24.88 21.53 23.74 27.66 19.35 16.26 13.58 11.19 22.14 24.54 18.25 16.35 11.51 9.55
Table 2: Comparison of agent performance of navigation and remote referring expression on REVERIE.

Navigation Tasks

We evaluate our proposed model on two distinct datasets for VLN:

  • Room-to-Room (R2R) [anderson2018vision]: The agent is required to navigate in photo-realistic environments (Matterport3D [chang2017matterport3d]) to reach a target following low-level natural language instructions. Most of the previous works apply a panoramic action space for navigation [fried2018speaker], where the agent jumps among viewpoints pre-defined on the connectivity graph of the environment. The dataset contains 61 scenes for training; 11 and 18 scenes for validation and testing, respectively, in unseen environments.

  • REVERIE [qi2020reverie]: The agent needs to first navigate to a point where the target object is visible; then, it needs to identify the target object from a list of given candidates. In REVERIE, the navigational instructions are high-level while the instructions for object grounding are very specific. The dataset has in total 4,140 target objects in 489 categories, and each target viewpoint has 7 objects with 50 bounding boxes on average.

Implementation Details

All experiments are conducted on a single NVIDIA 2080Ti GPU; the learning rate is fixed throughout training and the AdamW optimiser [loshchilov2018decoupled] is applied. For R2R, we train the agent directly on the mixture of the original training data and the augmented data from PREVALENT [hao2020towards]; the batch size (half for RL and half for IL in each iteration, corresponding to the first and the second terms in Eq. 14, respectively) is set to 16 and the network is trained for 300,000 iterations. For REVERIE, we use batch size 8 and train the agent for 200,000 iterations. Images in the environments are encoded by a ResNet-152 [he2016deep] pre-trained on Places365 [zhou2017places], and objects are encoded by a Faster R-CNN [ren2015faster] pre-trained on Visual Genome [krishna2017visual]. Early stopping is applied when training saturates; the model which achieves the highest SPL on the validation unseen split is adopted for testing.

Evaluation Metrics

We apply the standard metrics employed by previous works to evaluate the performance:

R2R [anderson2018vision] considers Trajectory Length (TL): the average path length in meters, Navigation Error (NE): the average distance between agent’s final position and the target in meters, Success Rate (SR): the ratio of stopping within 3 meters to the target, and Success weighted by the normalised inverse of the Path Length (SPL) [anderson2018evaluation].

REVERIE [qi2020reverie] defines Success Rate (SR) as the ratio of stopping at a viewpoint where the target object is visible (in the panorama), and considers the corresponding SPL. It also employs Oracle Success Rate (OSR): the ratio of trajectories containing a viewpoint where the target object is visible, Remote Grounding Success Rate (RGS): the ratio of grounding to the correct objects when stopped, and RGSPL, which weights RGS by the trajectory length.
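The success-based metrics above can be sketched as follows; the helper names are ours, and since TL and NE are simple averages over episodes, only SR and SPL are shown. SPL weights a success by the ratio of the shortest-path length to the longer of the taken and shortest paths.

```python
def success(final_dist, threshold=3.0):
    """An episode succeeds if the agent stops within the threshold (3 m in R2R)."""
    return final_dist <= threshold

def spl(final_dist, shortest_len, taken_len, threshold=3.0):
    """Success weighted by the normalised inverse of the path length."""
    if not success(final_dist, threshold) or taken_len <= 0:
        return 0.0
    return shortest_len / max(taken_len, shortest_len)

# stopped 1 m from the goal after walking 12 m on a 10 m shortest path
print(spl(final_dist=1.0, shortest_len=10.0, taken_len=12.0))
```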

4.1 Main Results

Comparison with SoTA

Results in Table 1 compare the single-run (greedy search, no pre-exploration [wang2019reinforced]) performance of different agents on the R2R benchmark. Our proposed VLN BERT initialised from OSCAR [li2020oscar] (init. OSCAR) performs better than previous methods across all the dataset splits. Compared to a randomly initialised network (no init. OSCAR), the large performance drop without initialisation suggests that the pre-trained general visual-linguistic knowledge significantly benefits the learning of navigation. The model initialised from PREVALENT [hao2020towards], pre-trained especially for VLN, further improves the agent's performance, achieving 63% SR (+8%) and 57% SPL (+5%) on the test unseen split (R2R Leaderboard: https://evalai.cloudcv.org/web/challenges/challenge-page/97/overview). Compared to PRESS [li2019robust] and PREVALENT [hao2020towards], which only fine-tune a pre-trained BERT for extracting language features, adding recurrence into V&L BERT and using the model directly as the navigator network allows the VLN learning to adequately benefit from the pre-trained knowledge. Such a performance gain cannot be achieved by using a pre-trained V&L BERT only as a feature extractor, as will be shown in §4.2 Ablation Study. Moreover, the large gain in SR with only a slight increase in TL suggests that the agent is able to navigate both accurately and efficiently. Compared to previous methods, the performance gap between the validation unseen and the test unseen splits is greatly reduced, which means our agent has a stronger generalisation ability to novel instructions and environments.

In REVERIE [qi2020reverie] (Table 2), compared to the previous best, our method generalises much better to unseen data. On the validation unseen split, the SR of navigation and object grounding is absolutely improved by 9.34% and 5.74%, respectively. On the test unseen split (REVERIE Leaderboard: https://eval.ai/web/challenges/challenge-page/606/overview), our method obtains 22.14% SR and 18.25% SPL for navigation, as well as 11.51% RGS and 9.55% RGSPL for REF, achieving better performance than the previous best method [qi2020reverie], which applies the SoTA navigator FAST [ke2019tactical] for navigation and the pointer MATTN [yu2018mattnet] for object grounding. Although the previous method has a higher OSR, this is likely due to longer searching (long TL); the lower SR suggests that the agent does not know where to stop correctly. Our results also indicate that it is possible to apply a BERT-based model for VLN and REF multi-task learning. There is clearly substantial room for improvement, and we argue that learning common sense (the structure of the environment) is crucial for navigation in REVERIE due to the high-level instructions. Although not investigated in this paper, we suggest that future work can pre-train VLN BERT for common sense knowledge to improve the performance.

Visualisation of Language Attention

To demonstrate that our VLN BERT (init. OSCAR) is able to trace the navigation progress, we visualise the changes of the language attention weights at the final Transformer layer over all instructions during navigation (Fig. 3). As the agent moves forward, the attention weights with respect to the state shift from the beginning of the instructions to the end. Since the sub-instructions and sub-paths of each sample in R2R are monotonically aligned [hong-etal-2020-sub], our results indicate that the state nicely records the partial instruction that has been completed. The attention weights with respect to the visual token at the selected direction follow a similar pattern, meaning that the most relevant part of the instruction is used for guiding the action selection.

Figure 3: Averaged attention weights over all instructions in validation unseen split during navigation. State: Attention weights with respect to the state representation. Selected Action: Attention weights with respect to the visual token at the selected direction.

4.2 Ablation Study

Model                      | V&L BERT (init. OSCAR):            | R2R Validation Seen    | R2R Validation Unseen
                           | Lang  Vis  State  Dec  Match Train | TL    NE   SR    SPL   | TL    NE   SR    SPL
Baseline [tan2019learning] |  -     -     -     -     -     -   | 11.84 4.44 57.79 52.85 | 12.50 5.26 49.81 43.45
1                          |  X     -     -     -     -     -   | 10.81 4.98 49.95 46.19 | 11.34 5.68 44.15 39.64
2                          |  X     -     -     -     -     X   | 11.73 4.18 59.26 54.12 | 12.59 5.00 52.11 45.75
3                          |  X     X     -     -     -     -   | 9.26  6.85 34.77 33.33 | 8.92  7.43 30.74 29.05
4                          |  X     X     -     -     -     X   | 11.37 3.50 67.97 63.94 | 12.98 4.73 54.75 48.31
5                          |  X     X     X     -     -     X   | 11.10 3.81 65.52 61.24 | 12.20 4.62 55.21 49.72
6                          |  X     X     X     X     -     X   | 10.70 3.21 70.32 66.45 | 11.46 4.48 57.22 52.57
Full model                 |  X     X     X     X     X     X   | 10.79 3.11 71.11 67.23 | 11.86 4.29 58.71 53.41
Table 3: Ablation experiments on the effect of applying V&L BERT for learning navigation. An X indicates using V&L BERT to replace or to add the corresponding function in the baseline model. Matching indicates the visual-textual matching (Eq. 12), and an X in Train means the V&L BERT is fine-tuned for navigation.
Models    | R2R Validation Seen    | R2R Validation Unseen  | Batch | Memory
          | TL    NE   SR    SPL   | TL    NE   SR    SPL   |       |
Emb-Attn  | 10.84 3.40 67.19 63.19 | 11.31 4.64 55.60 51.00 | 6     | 11.0GB
Init-Attn | 10.21 3.93 61.80 58.06 | 10.39 4.59 53.47 49.05 | 6     | 11.0GB
Re-Attn   | 8.99  5.81 40.84 39.44 | 8.85  6.22 37.76 35.78 | 6     | 11.0GB
Ours      | 10.79 3.11 71.11 67.23 | 11.86 4.29 58.71 53.41 | 16    | 9.2GB
Table 4: Comparison of performing language self-attention: on the raw word embeddings at each step (Emb-Attn), on the initialised language features at each step (Init-Attn), on the output language features from the previous step (Re-Attn), or only at initialisation (Ours). Memory is the training-time GPU memory cost.

Network Components

Table 3 shows comprehensive ablation experiments on the influence of using V&L BERT (init. OSCAR) to replace or add the corresponding functions in the baseline model (EnvDrop [tan2019learning] trained with the same data and training strategy as in the other experiments). As the results suggest, our proposed method is a multi-functional framework: the more functions it serves, the larger the performance gain. Comparing the baseline with Models #1 and #2, we can see that employing a pre-trained BERT as language encoder improves performance only if the BERT is fine-tuned for navigation. This finding is also supported by using the V&L BERT to encode both the textual and visual signals (Models #3 and #4). However, simply using the pre-trained V&L BERT as text and image encoders does not fully utilise its power; Model #5 indicates that relying on the original architecture of BERT to learn recurrence is feasible and achieves better results. Moreover, using the averaged visual attention weights of the final Transformer layer as the action probabilities (Model #6) and enhancing the state representation with visual-textual matching as defined in Eq. 12 (full model) further improve the agent's performance.

Self-Attended Language Features

Due to the long instructions and episodes, high memory cost during training is one of the key issues that prevents previous research from applying BERT for self-attention at every time step. To demonstrate the influence of self-attending textual features during navigation, we compare the agent's performance and the training-time GPU memory consumption (constrained to a single GPU with 11GB of memory) of re-attending the language at each step. As shown in Table 4, Emb-Attn, Init-Attn and Re-Attn consume much more memory per sample during training than performing language self-attention only at initialisation (Ours), and their performance is worse. The results of Re-Attn degenerate significantly because the output language features aggregate the most relevant visual-textual clues at a certain viewpoint, which suppresses valuable information in other parts of the instruction needed at future steps. These results suggest that running self-attention on language at each step is unnecessary.
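This encode-once design can be sketched as a plain decoding loop in which the language features are self-attended a single time and then cached, while only the state token and the new visual tokens are re-processed each step. The sketch below is illustrative only; `bert`, `init_language` and `step` are hypothetical stand-ins for the actual model interface:

```python
def navigate(bert, instruction_tokens, get_observation, max_steps=15):
    """Recurrent-decoding sketch: the instruction is self-attended ONCE
    at initialisation; each step re-processes only the state token and
    the new visual tokens against the cached language features.
    `bert.init_language` / `bert.step` are hypothetical interfaces."""
    state, lang_feats = bert.init_language(instruction_tokens)
    trajectory = []
    for _ in range(max_steps):
        vis_feats = get_observation()            # new panorama features
        state, action = bert.step(state, lang_feats, vis_feats)
        trajectory.append(action)
        if action == 'STOP':
            break
    return trajectory
```

Because `lang_feats` is computed once, the per-step sequence length (and hence the attention cost) is dominated by the visual tokens, which is what permits the larger batch size in Table 4.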

Learning Curves

As shown in Fig. 4, we compare the learning curves of VLN BERT initialised from different models. The training loss of our method initialised from pre-trained OSCAR [li2020oscar] converges faster than that of a randomly initialised model, and it reaches a much higher SPL in both validation seen and unseen environments. Moreover, our model initialised from PREVALENT [hao2020towards] learns significantly faster than the other two variants and achieves much better performance within far fewer iterations. These results suggest that pre-trained generic visiolinguistic knowledge is beneficial for learning VLN, and that pre-training specifically for navigation skills allows the agent to learn better during fine-tuning.

Figure 4: Comparison of the learning curves. no init. means randomly initialised network parameters.

5 Conclusion

In this paper, we introduce recurrence into Vision-and-Language BERT and rely on its original architecture to recognise time-dependent inputs. This innovation allows V&L BERT to address problems formulated as a partially observable Markov decision process, and allows the learning of downstream tasks to adequately benefit from pre-trained generic V&L knowledge. For VLN, our proposed VLN BERT applies BERT itself as the navigator network, achieving SoTA performance on R2R [anderson2018vision] and REVERIE [qi2020reverie]. Moreover, results suggest that V&L BERT with recurrence is capable of VLN and REF multi-task learning.

Future Work

Given the great improvement on R2R [anderson2018vision] achieved by our VLN BERT, we expect the model to also improve performance in other navigation settings such as street navigation [chen2019touchdown] and navigation in continuous environments [krantz2020navgraph]. In this paper, we only apply our recurrent BERT to VLN. However, we believe it has huge potential for other tasks that require sequential interactions/decisions, such as language-only dialog [byrne2019taskmaster, lee2019multi, Rastogi2020TowardsSM], visual dialog [das2017visual, kottur2019clevr], dialog navigation [de2018talk, nguyen2019help, nguyen2019vision] and action anticipation for reactive robot response [aliakbarian2018viena, furnari2019would, koppula2015anticipating].



Appendix A Implementation Details

We provide the implementation details of preparing visual features (§3.3 in the Main Paper), decision making (§3.3), adaptation to PREVALENT [hao2020towards] (§3.3 & 3.5), and the critic function and reward shaping in reinforcement learning (§3.4).

a.1 Visual Features (3.3 Vision Processing)

Navigation in R2R [anderson2018vision] and REVERIE [qi2020reverie] is conducted in the Matterport3D Simulator [chang2017matterport3d]. At each navigable viewpoint in the environment, the agent observes a panorama consisting of 36 single-view images at 12 headings (30° separation) and 3 elevation angles (−30°, 0°, 30°).
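This panorama discretisation can be enumerated directly; a small sketch (headings and elevations in degrees):

```python
def panorama_views():
    """Enumerate the 36 single-view images of a Matterport3D panorama:
    12 headings at 30-degree separation x 3 elevation angles (-30, 0, +30)."""
    return [(heading * 30, elevation)
            for elevation in (-30, 0, 30)
            for heading in range(12)]
```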

Scene Features

In our experiments, we only consider the scene features (grid features of the single-view images produced by a ResNet-152 [he2016deep] pre-trained on Places365 [zhou2017places]) at the navigable directions as visual features for VLN BERT. Each visual feature is direction-aware, formulated as the concatenation of the convolutional image feature and the directional encoding:

v_i = [ Conv(I_i) ; E_i ]

The directional encoding E_i is formed by replicating the vector

[ cos(θ_i), sin(θ_i), cos(φ_i), sin(φ_i) ]

32 times, where θ_i and φ_i represent the heading and elevation angles of the image with respect to the agent's orientation [fried2018speaker, tan2019learning]. Moreover, the action feature which is fed to the state representation is exactly the directional encoding at the selected direction.
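The directional encoding and feature concatenation can be sketched as follows; the 2048-d ResNet feature size and the function names are illustrative assumptions:

```python
import numpy as np

def directional_encoding(heading, elevation, repeat=32):
    """Replicate [cos(h), sin(h), cos(e), sin(e)] 32 times -> 128-d vector.
    Angles are relative to the agent's current orientation (radians)."""
    base = np.array([np.cos(heading), np.sin(heading),
                     np.cos(elevation), np.sin(elevation)], dtype=np.float32)
    return np.tile(base, repeat)

def visual_feature(conv_feat, heading, elevation):
    """Direction-aware visual feature: [ResNet feature ; directional encoding]."""
    return np.concatenate([conv_feat, directional_encoding(heading, elevation)])
```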

Object Features

In REVERIE [qi2020reverie], the target objects can appear in any single view of the panorama. We extract the object features (regional features encoded by a Faster R-CNN [ren2015faster] pre-trained on Visual Genome [krishna2017visual] by Anderson et al. [anderson2018bottom]) according to the positions of objects provided in REVERIE [qi2020reverie]. The object features are position- and direction-aware, formulated as

o_j = W_o [ Conv(O_j) ; E_j ; P_j ]

where Conv(O_j) is the convolutional object feature, E_j is the directional encoding of the single view which contains the object, and P_j represents the spatial position of the object within the image. As in MAttNet [yu2018mattnet], we apply

P_j = [ x_tl/W, y_tl/H, x_br/W, y_br/H, (w·h)/(W·H) ]

where (x_tl, y_tl) and (x_br, y_br) are the top-left and bottom-right coordinates of the object, w and h are the width and height of the object, W and H are the width and height of the image, and W_o is a learnable linear projection.
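The MAttNet-style 5-dimensional spatial feature can be sketched as:

```python
def object_position_feature(x_tl, y_tl, x_br, y_br, img_w, img_h):
    """MAttNet-style 5-d spatial feature of an object bounding box:
    normalised top-left and bottom-right corners plus relative area."""
    w, h = x_br - x_tl, y_br - y_tl
    return [x_tl / img_w, y_tl / img_h,
            x_br / img_w, y_br / img_h,
            (w * h) / (img_w * img_h)]
```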

Figure 5: Adaptation to recurrent PREVALENT. At the initialisation stage, the entire instruction is encoded by a language transformer (TRM-Lang1), where the output feature of the [CLS] token serves as the initial state representation of the agent. During navigation, the concatenated sequence of state, encoded language and new visual observation is fed to the cross-modality and the single-modality encoders to obtain the updated state and decision logits. The updated state and the language encoding from initialisation are fused and applied as input at the next time step. The green star indicates the visual-textual matching and the past decision encoding (§3.3).

a.2 Decision Making (3.3 Decision Making)

In R2R [anderson2018vision], there are two types of decisions an agent makes during navigation: it either selects a navigable direction to move to, or it decides to stop at the current viewpoint. As in most previous work, stopping in VLN BERT is implemented by adding a zero vector to the list of visual features at navigable directions [fried2018speaker, ma2019self, tan2019learning, hong2020graph]:

V_t = [ v_1, v_2, ..., v_K, v_stop ],  v_stop = 0

Our VLN BERT determines to stop when the stop representation receives the largest attention score at the final transformer layer. Otherwise, the agent moves to the navigable direction with the largest score.
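A minimal sketch of this decision rule, assuming the final-layer attention scores are already computed (the `STOP` sentinel and function name are illustrative):

```python
import numpy as np

STOP = -1  # sentinel for the stop action

def select_action(direction_scores, stop_score):
    """Pick the action with the largest final-layer attention score.
    The stop token is the appended zero-vector representation; stopping
    is chosen when its score beats every navigable direction."""
    scores = np.append(np.asarray(direction_scores, dtype=float), stop_score)
    best = int(np.argmax(scores))
    return STOP if best == len(direction_scores) else best
```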

However, in REVERIE [qi2020reverie], we directly apply the attention scores over the candidate objects for stopping. Specifically, the visual tokens in REVERIE consist of a sequence of scene features and object features:

V_t = [ v_1, ..., v_K ; o_1, ..., o_M ]

When the model predicts a larger attention score for at least one object token than for all of the scene tokens, the agent stops and selects the object with the largest score as the grounded object for REF. This formulation has two advantages. First, it relates the object-searching process to navigation, i.e., the agent should not stop if it has low confidence in localising the target object. Second, it allows the reinforcement learning to benefit the object grounding, since the action logit for stopping is the greatest attention score over objects.
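A sketch of this REVERIE stopping rule, assuming scores for the scene and object tokens are given as plain lists:

```python
def reverie_action(scene_scores, object_scores):
    """REVERIE decision rule sketch: stop (and ground the best-scored
    object) iff some object token out-scores every scene token;
    otherwise move towards the best-scored navigable direction."""
    best_obj = max(object_scores)
    if best_obj > max(scene_scores):
        return ('stop', object_scores.index(best_obj))
    return ('move', scene_scores.index(max(scene_scores)))
```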

a.3 Adaptation to PREVALENT (3.3 & 3.5)

As shown in Fig. 5, we adapt our VLN BERT to the LXMERT-like [tan2019lxmert] architecture of PREVALENT [hao2020towards]. At the initialisation step, the transformer TRM-Lang1 encodes the instruction, and the output feature of the [CLS] token represents the agent's initial state. During navigation, the concatenated sequence of the previous state, the encoded language from initialisation and the new visual observation is fed to the cross-modality encoder to obtain the language-aware state feature and the language-aware visual features. Finally, TRM-Vis2 processes these features to produce a new state and a decision.

Pre-training of PREVALENT [hao2020towards] applies the outputs from TRM-Lang3 for attended masked language modelling and action prediction, but fine-tuning on R2R [anderson2018vision] only uses the output language features from TRM-Lang1 and relies on a downstream network, EnvDrop [tan2019learning], for navigation. In contrast, our method does not require any downstream network; we leverage the visual transformers TRM-Vis1 and TRM-Vis2 to learn the state-language-vision relationship for better decision making.

Visual-Textual Matching

The visual-textual matching for state (see Eq. 12 in §3.3) is applied to enhance the state representation of the PREVALENT-based VLN BERT. Since the final transformer TRM-Vis2 processes only the state and visual features, we apply the averaged attention scores over the visual features with respect to the state to weight the input visual tokens, and the averaged attention scores over the textual features with respect to the state to weight the input language tokens. Results in Table 5 show that visual-textual matching for state improves the agent's performance in unseen environments.
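The weighting step can be sketched as follows; this is a simplified illustration of averaging attention over heads and pooling tokens, not the exact Eq. 12:

```python
import numpy as np

def weighted_tokens(attn_heads, tokens):
    """Weight input tokens by the state's attention scores averaged over
    heads (a simplified stand-in for the matching in Eq. 12).
    attn_heads: (num_heads, num_tokens); tokens: (num_tokens, hidden)."""
    weights = attn_heads.mean(axis=0)   # average over heads -> (num_tokens,)
    return weights @ tokens             # weighted sum -> (hidden,)
```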

Models        | R2R Validation Seen    | R2R Validation Unseen
              | TL    NE   SR    SPL   | TL    NE   SR    SPL
w/o Matching  | 10.87 2.44 76.79 73.13 | 11.72 4.08 62.07 56.15
with Matching | 11.13 2.90 72.18 67.72 | 12.01 3.93 62.75 56.84
Table 5: Performance of PREVALENT-based VLN BERT with and without visual-textual matching for the agent's state.

a.4 Critic Function (3.4 Training)

We apply A2C [mnih2016asynchronous] for reinforcement learning. At each time step, the critic, a multi-layer perceptron, predicts an expected value from the updated state representation s_t as:

value_t = W_2 ReLU( W_1 s_t )

where ReLU is the Rectified Linear Unit function, and W_1 and W_2 are learnable linear projections.
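The critic can be sketched as (bias terms omitted, matching the equation above):

```python
import numpy as np

def critic_value(state, w1, w2):
    """Expected return predicted from the updated state:
    value = w2 . ReLU(w1 . state), with w1: (hidden, d), w2: (hidden,)."""
    hidden = np.maximum(w1 @ state, 0.0)  # ReLU
    return float(w2 @ hidden)
```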

a.5 Reward Shaping (3.4 Reward Shaping)

We apply three different rewards for learning to navigate.

Progress Reward

As formulated in EnvDrop [tan2019learning], we apply the progress reward as a strong supervision signal directing the agent to approach the target. To be specific, let D_t be the distance from the agent to the target at time step t, and ΔD_t = D_{t−1} − D_t be the change of distance caused by action a_t. The progress reward at each step t < T is defined as:

r^P_t = ΔD_t

When the agent decides to stop (t = T), a final reward is assigned depending on whether the agent successfully completes the task:

r^P_T = +r_s if D_T < 3m,  −r_s otherwise

where r_s is a positive constant. Overall, the agent receives a positive reward if it approaches the target and completes the task (by stopping within 3 meters of the target viewpoint), while it is penalised with a negative reward if it moves away from the target or stops at a wrong viewpoint.
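A sketch of the progress reward; the magnitude of the terminal reward (`final=2.0`) is an assumption in the spirit of EnvDrop-style training, not a value taken from the paper:

```python
def progress_reward(prev_dist, dist, stopped, success_radius=3.0, final=2.0):
    """Progress reward sketch: the reduction in distance to target at
    intermediate steps, plus a terminal bonus/penalty at the stop step.
    `final=2.0` is an assumed constant for illustration."""
    if stopped:
        return final if dist < success_radius else -final
    return prev_dist - dist  # positive when the agent moves closer
```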

Path Fidelity Rewards

The progress reward encourages the agent to approach the target, but it does not constrain the agent to take the shortest path. As there could be multiple routes to the target, the agent could learn to take a longer or even a cyclic path to maximise the total reward. To address this problem, we apply the normalised dynamic time warping (nDTW) reward [ilharco2019general], a measurement of the similarity between the ground-truth path and the predicted path, to urge the agent to follow the instruction accurately. Let N_t be the nDTW score [ilharco2019general] at time t, and ΔN_t = N_t − N_{t−1} be the change of nDTW caused by action a_t. The reward for t < T is defined as:

r^F_t = ΔN_t

and the reward for t = T:

r^F_T = N_T

Moreover, as suggested by many previous works, there exists a large discrepancy between the agent's oracle success rate and success rate, indicating that it does not learn to stop accurately. To address this issue, we introduce a negative stopping reward which is triggered whenever the agent first approaches the target but then departs from it. To be precise, if D_{t−1} < 3m and D_t ≥ 3m:

r^F_t = ΔN_t − r_s

The nDTW reward and the stopping reward together form the path fidelity reward r^F, which encourages the agent to navigate efficiently. In summary, the overall reward at each step during navigation can be expressed as:

r_t = r^P_t + r^F_t
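The nDTW score itself can be sketched from its definition in Ilharco et al. [ilharco2019general], exp(−DTW(path, ref) / (|ref| · d_th)), where d_th is the 3m success threshold:

```python
import math

def ndtw(path, ref, dist, threshold=3.0):
    """Normalised dynamic time warping (Ilharco et al., 2019):
    exp(-DTW(path, ref) / (|ref| * threshold)). `dist` is any pairwise
    distance between positions; `threshold` is the 3 m success radius."""
    n, m = len(path), len(ref)
    dtw = [[math.inf] * (m + 1) for _ in range(n + 1)]
    dtw[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(path[i - 1], ref[j - 1])
            dtw[i][j] = cost + min(dtw[i - 1][j], dtw[i][j - 1], dtw[i - 1][j - 1])
    return math.exp(-dtw[n][m] / (m * threshold))
```

A path identical to the reference has DTW cost 0 and hence an nDTW of 1; deviating paths score strictly lower.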
Ablation Study

We perform ablation experiments on training the OSCAR-based and the PREVALENT-based VLN BERT with and without the path fidelity reward. As shown in Table 6, the models trained with the path fidelity reward achieve higher Success Rate (SR) and lower Trajectory Length (TL), leading to higher Success weighted by Path Length (SPL). Despite the improvements in SR, the gap between the Oracle Success Rate (OSR) and SR stays roughly the same for the PREVALENT-based model and is reduced by about 1.45% for the OSCAR-based model. These results suggest that the path fidelity reward helps the agent navigate more accurately and efficiently. Note that, compared to the results shown in Table 1 and Table 3 of our Main Paper, such reward shaping contributes only a slight gain, whereas the structure of the recurrent BERT is much more influential.

Models                     | R2R Validation Seen     | R2R Validation Unseen
                           | TL    OSR   SR    SPL   | TL    OSR   SR    SPL
OSCAR-based, r^P only      | 11.15 77.86 70.71 66.64 | 12.62 66.92 58.32 52.48
OSCAR-based, r^P + r^F     | 10.79 76.79 71.11 67.23 | 11.86 65.86 58.71 53.41
PREVALENT-based, r^P only  | 12.24 77.28 69.93 64.46 | 12.89 69.52 62.15 55.65
PREVALENT-based, r^P + r^F | 11.13 78.16 72.18 67.72 | 12.01 70.24 62.75 56.84
Table 6: Performance of OSCAR-based and PREVALENT-based VLN BERT trained with and without the path fidelity reward.

Appendix B Visualisation (4.1 Main Results)

As shown in Figures 6, 7 and 8, we visualise the language-to-language and the state/vision-to-state/vision/language attention weights of a sample in the validation unseen split. As shown by the panoramas in Fig. 7 and the given instruction "Exit the bedroom. Walk the opposite way of the picture hanging on the wall through the kitchen. Turn right at the long white countertop. Stop when you get past the two chairs.", the agent needs to understand complex contextual clues, including scene clues (bedroom, kitchen), object clues (picture, wall, countertop, chair) and various directional clues (forward, opposite, left/right, stop) to complete the task.

Language Self-Attention

Fig. 6 shows the language self-attention weights of some selected heads at initialisation (t=0); different heads serve different functions, and the attention at different layers behaves very differently. Plots (1) and (4) show the general pattern of attention at shallow (Layer 0) and deep (Layer 8) layers; words in shallow layers tend to collect information from the entire sentence, while words at deep layers correspond more strongly to adjacent words, since these are semantically more relevant. It is very interesting to see that the attention head in Plot (2) learns to attend to the adjectives and the action-related terms, which describe the objects and scenes, for the initialised state representation ([CLS]). This head also learns about the co-occurrence of different entities; for example, picture corresponds strongly to wall, and countertop corresponds strongly to kitchen. In contrast, the attention head in Plot (3) learns to extract the important landmarks such as bedroom, picture, kitchen and chairs. The attention head in Plot (5) attends to the [SEP] token, which indicates the end of the instruction. The attention weights in Plot (6) show the most frequent pattern of the attention heads at the final layer; these heads appear to aggregate information from the punctuation marks in the instruction, implying that they could have learnt to break the sentence into multiple sub-sentences. Referring to the idea of sub-instructions proposed by Hong et al. [hong-etal-2020-sub], such an attention pattern could be beneficial for matching the current observation to the most relevant part of the instruction.

State/Vision Step-wise Attention

Fig. 7 shows the trajectory of the agent starting from a bedroom, taking a series of actions and eventually stopping at the target location. It also displays the averaged attention weights at the final layer for the state and visual tokens. Note that the averaged attention for state/vision tokens is representative, since we apply the averaged attention weights for visual-textual matching to the state and use them as the action probabilities (see Eq. 12 and Decision Making in §3.3, respectively). As shown in Plots (1a-6a), the attention shifts from the beginning of the instruction to the end, which agrees with the agent's navigation progress. It is also interesting to see that at the final layer, the state token is more influential on the predicted action in the first two steps, while the language tokens are more influential at later steps. The state/vision self-attention in Plots (1b-6b) reflects the action prediction at each time step. The first row in each plot (attention of the state with respect to candidate actions) shows the prediction result; we can see that the agent is very confident in each decision. Moreover, starting from Step 3, the information from the state and from different views is aggregated to support choosing the correct direction.

State/Vision Layer-wise Attention

To better understand how the visual and language features are aggregated to support action prediction, we visualise the layer-wise attention at Step 4 of the trajectory (t=4). As shown in Fig. 8, Plots (1-12), at the first two layers, candidate views collect information from the entire instruction. But as the signals propagate to deeper layers (Layers 3-6), the visual tokens attend more to the middle part of the instruction, which should be more relevant to the current observations. Interestingly, starting from Layer 6, the visual features tend to dominate the attention, and information aggregates toward the visual token at the predicted direction (which is the correct direction). We can see that, at Layers 7-9, the network still has some doubts about the candidate directions that are spatially close to the correct direction. But after implicit reasoning in the deeper layers, the network becomes very confident about choosing the correct direction.

Figure 6: Language self-attention weights of some selected heads at initialisation. The attention weights are normalised for each row.
Figure 7: Visualisation of a trajectory and the averaged attention weights at the final layer. The centre of each panorama is roughly the agent's heading direction at the corresponding time step. Distance is the agent's distance to the target in meters. The attention weights are normalised for each row. Text in red indicates the predicted action (corresponding to a candidate view) at each time step.
Figure 8: Averaged state/vision-to-state/vision/language attention weights at each layer at Step 4 of the trajectory. The attention weights are normalised for each row. Text in red indicates the predicted action (corresponding to a candidate view) at the current time step (t=4).