
Multimodal Transformer with Variable-length Memory for Vision-and-Language Navigation

by Chuang Lin, et al.

Vision-and-Language Navigation (VLN) is a task in which an agent must follow a language instruction to navigate to a goal position, relying on ongoing interactions with the environment as it moves. Recent Transformer-based VLN methods have made great progress by directly connecting visual observations and the language instruction via a multimodal cross-attention mechanism. However, these methods usually represent temporal context as a fixed-length vector, either by using an LSTM decoder or by using manually designed hidden states to build a recurrent Transformer. Since a single fixed-length vector is often insufficient to capture long-term temporal context, in this paper we introduce the Multimodal Transformer with Variable-length Memory (MTVM) for visually-grounded natural language navigation, which models the temporal context explicitly. Specifically, MTVM enables the agent to keep track of the navigation trajectory by directly storing previous activations in a memory bank. To further boost performance, we propose a memory-aware consistency loss that helps learn a better joint representation of temporal context with randomly masked instructions. We evaluate MTVM on the popular R2R and CVDN datasets. Our model improves Success Rate on the R2R unseen validation and test sets by 2% each, and improves Goal Progress by 1.6m on the CVDN test set.





1 Introduction

Figure 1: In contrast to most existing methods, which use a fixed-length vector to represent temporal context, we equip the agent with the capability to model long-term dependency. At each step $t$, MTVM takes all the tokens stored in the memory bank as the temporal context input. After making a decision, it adds a memory token by simply reusing the output activation corresponding to the action at step $t$.

Enabling robots to assist humans in the real world has long been a goal of deep learning researchers [chen2011learning, tellex2011understanding, guadarrama2013grounding]. One particularly important capability for robots is following human instructions to navigate. Vision-and-Language Navigation (VLN) is such a task: an embodied agent is required to follow a language instruction to navigate to a goal position. Specifically, the agent is first given a detailed instruction, like “Head a bit ahead and towards the double doors on the left towards the kitchen. Stop upon reaching the counter.” At each step, the agent observes the panorama of its surroundings and makes a decision for the next step until it reaches the end point.

In the past few years, many methods [chattopadhyay2021robustnav, ke2019tactical, lin2021adversarial, nguyen2019vision, ma2019self, fried2018speaker, liu2021vision, hong2020language, tan2019learning, zhu2020vision, wang2019reinforced, zhao2021evaluation] have been proposed for the VLN task. Most of them adopt an encoder-decoder framework, which first encodes the instruction and visual observations and then decodes the action sequence. Recent VLN studies [qi2021know, li2019robust, majumdar2020improving, hao2020towards, guhur2021airbert] have shown strong performance by directly modeling cross-modal relationships with a Transformer. Unlike other vision-and-language tasks such as VQA and image captioning, which learn relationships between a single image and its corresponding language, VLN aims to learn the joint representation between each instruction and a series of observations by interacting with the environment. Thus, taking the temporal context into account is the key to grounding the instruction in the observations: figuring out what has been completed, what is next, and where to go. Most existing VLN methods use a fixed-length latent vector to represent the temporal context. For example, [hong2021vln] employs a recurrent hidden state to inject temporal information into the Transformer, and [qi2021know, hao2020towards] inherit the encoder-decoder structure with an additional LSTM to encode the temporal context. However, a single hidden state vector is not expressive enough to encode the whole history of interactions with the environment in a Transformer. It remains a major challenge to align such a hidden state at time $t$ with the corresponding sub-instruction for decision making.

To address this problem, we propose a Multimodal Transformer with Variable-length Memory (MTVM) framework for VLN. Instead of using hidden states or an LSTM to encode temporal context, we find that it is simple and effective to directly reuse the cross-modal Transformer activations obtained in previous steps. Keeping past activations in an explicit memory bank allows history information to be modeled explicitly, regardless of how far back in the path each activation lies, and the Transformer architecture naturally accommodates variable-length memory token inputs. In this way, the agent can easily update the temporal context by adding the current output activation corresponding to the action at step $t$ to the memory bank, as shown in Figure 1.

Thanks to the explicit memory bank, we further design a memory-aware consistency loss to boost navigation performance. The consistency loss aims to help cross-modal alignment by learning the relations between the previous activations and the language instruction. Specifically, we randomly mask out some instruction words and force the model's output distribution to be consistent with the unmasked result. In this way, the model avoids overfitting to the language modality with the help of the explicit memory bank.

Our contributions can be summarized as follows:

  1. We propose MTVM, which allows the agent to capture temporal context without distance dependency by simply reusing the previous cross-modal activations corresponding to the actions.

  2. We design a memory-aware consistency loss to learn strong relations between instruction and temporal context which can further boost the navigation performance.

  3. We conduct extensive experiments on the R2R and CVDN datasets. Compared with the best baseline method, MTVM improves the success rate by 2% on the R2R test set and Goal Progress by 1.6m on the CVDN test set.

2 Related Work

Vision-and-Language Navigation. VLN [anderson2018vision] is a task that requires an agent to follow a natural-language instruction to navigate to a goal location in a photo-realistic environment. The given instruction describes the trajectory in detail, and the embodied agent needs to move through the scene with first-person views as observations. Following [anderson2018vision], several navigation tasks [chen2019touchdown, thomason2020vision, nguyen2019help, nguyen2019vision, qi2020reverie] have been proposed for interactions with surrounding environments. In particular, whereas [anderson2018vision] collects data from indoor environments, [chen2019touchdown] extends the navigation environment to real-life urban streets. [thomason2020vision] introduces navigation guided by several question-answer pairs in a dialog history. [nguyen2019vision] and [nguyen2019help] consider object-finding tasks in which the agent requests help from simulated human assistants and interprets their responses. [qi2020reverie] requires the agent to navigate to an appropriate location and identify the target object.

As a practical task with real-world applications, VLN has made substantial progress in recent years. [lin2021adversarial] uses adversarial attacks to capture key information from long instructions for robust navigation. [ma2019self] grounds the instruction with a progress monitor so that the agent is constantly aware of what has been completed, what is next, and where to go. RCM [wang2019reinforced] enforces cross-modal grounding both locally and globally via a matching critic that provides rewards for reinforcement learning.

In the vision-and-language navigation setting, it is difficult to collect enough annotated data due to the large navigation space. [fried2018speaker] uses a speaker model to synthesize new instructions, providing additional route-instruction pairs that expand the limited training data. To make further advances, [zhao2021evaluation] proposes an instruction-trajectory compatibility model to improve instruction evaluation. [tan2019learning] proposes an environmental dropout method based on view consistency to mimic novel and diverse environments. From a different perspective, REM [liu2021vision] reconnects the seen scenes to generate augmented data by mixing up environments. To further understand the relations between instructions and scenes, [hong2020language] and [qi2021know] take the objects in scenes and the corresponding words in instructions as the minimal units of encoding. AuxRN [zhu2020vision] introduces additional training signals, including explaining actions and predicting the next orientation, to help acquire semantic knowledge. In contrast, our method focuses on modeling the temporal context to help the alignment between language and observations.

Multi-Modal Transformers. The Transformer [vaswani2017attention] architecture has shown great effectiveness in vision-and-language tasks [tan2019lxmert, lu2019vilbert, chen2019uniter, li2020unicoder, huang2020pixel, zhou2020unified, gan2020large]. Most vision-and-language tasks focus on joint embedding learning with individual pairs of an image and its corresponding language, such as VQA, image captioning, and text-to-image retrieval. Unlike these tasks, VLN is a Markov Decision Process, which learns the joint representation between the instruction and a series of observations along the corresponding trajectory. Inspired by the success of BERT [devlin2018bert], PRESS [li2019robust] first introduces a large-scale pre-trained language model to VLN for text representations. As cross-modal joint learning is the key to the VLN task, VLN-BERT [majumdar2020improving] and PREVALENT [hao2020towards] develop Transformer-based models trained in a self-supervised manner on image-text pairs from the web and on image-text-action triplets from the R2R dataset [anderson2018vision], respectively.

Figure 2: The general framework of our proposed MTVM. The left blue part illustrates MTVM at each step. We concatenate the temporal context in the memory bank with the visual features and language features as input. After making a decision, we update the memory bank by storing the output activation corresponding to the action. The right yellow part is the memory-aware consistency loss, where we randomly mask out some words to help the alignment between language and temporal context, preventing the model from overfitting to the language modality.

[hong2021vln] and [qi2021know] adapt a pre-trained V&L BERT to the VLN task by leveraging hidden state representations with a learned linear projection or an LSTM.

Recently, HAMT [chen2021history] and Episodic Transformer [pashevich2021episodic] have also proposed to model the history information explicitly by directly encoding all past observations and actions, which is, however, fairly complex. For example, HAMT requires a hierarchical vision Transformer to encode intra-panorama and inter-panorama visual information for temporal context, and its end-to-end training is extremely costly and time-consuming. In contrast, our MTVM simply copies the past activations into a memory bank as the history information, and we further design a memory-aware consistency loss to help the alignment between history information and the language instruction.

3 Methods

3.1 Overview

Formally, at the beginning of each episode, the agent is given a natural language instruction $I = \{w_1, w_2, \dots, w_L\}$, where $L$ is the length of the instruction and $w_i$ denotes a word. The VLN task requires the agent to follow the instruction to navigate from a start position to the goal location. At each step $t$, the agent observes the surrounding environment as a panoramic view composed of 36 single-view images. Figure 2 gives an overview of our proposed MTVM. At each step, MTVM directly interacts with visual information, language information, and history information to make the action decision. After that, we update the memory bank by reusing the activation of the Transformer output that corresponds to the action decision. Moreover, a consistency loss is introduced to measure the distance between the output distributions of the full instruction and a randomly masked instruction to help the cross-modal alignment. Note that instruction masking is used only in training, not in inference.

3.2 Memory-based Multimodal Transformer

As VLN is a Markov decision process [anderson2018vision], an embodied agent needs to pay attention to temporal context during navigation. A plain Transformer is not sufficient to model the instruction and the observations because it lacks this temporal context. At each navigation step, the agent needs to ground the instruction, determining which part has been completed and which part describes the next action. MTVM learns the cross-modal alignment so that the completed part of the instruction is matched with the past trajectory. Our memory bank makes the agent aware of the navigation progress by directly interacting with the previous actions, so that it can ground the sub-instructions as guidance. In this way, it is easier for the agent to locate the relevant sub-instruction and use it to select a candidate direction from the current-step observation. We construct our model following the vision-and-language pre-training works [tan2019lxmert, hao2020towards]; it consists of a language encoder, a vision encoder, and a cross-modality encoder.

Language Encoder. The language encoder is a standard multi-layer Transformer with self-attention. At the beginning of an episode, we feed the instruction to the language encoder to get the language representation $X$.

Vision Encoder. The vision encoder is a convolutional network that encodes each single-view image into a 2048-dimensional visual feature $v_t^i$. A 128-dimensional directional feature $d_t^i$, obtained by repeating the trigonometric function representation [fried2018speaker], is concatenated with the visual feature to represent the orientation of each single view $i$. For each step, we have $V_t = \{[v_t^i; d_t^i]\}_{i=1}^{K}$ as the vision representation, where $K$ is the number of candidate directions.
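The construction of the per-view input can be illustrated with a minimal sketch. The text only states that a trigonometric representation is repeated to 128 dimensions, so the choice of heading and elevation angles, the order of the four sine and cosine terms, and the function names below are assumptions made for illustration.

```python
import math


def directional_feature(heading, elevation, dim=128):
    """Repeat a 4-value trigonometric encoding of the view angles to `dim` values.

    The (heading, elevation) pair and the sin/cos ordering are assumptions.
    """
    base = [math.sin(heading), math.cos(heading),
            math.sin(elevation), math.cos(elevation)]
    if dim % len(base) != 0:
        raise ValueError("dim must be a multiple of 4")
    return base * (dim // len(base))


def view_representation(visual_feature, heading, elevation):
    """Concatenate a visual feature with its directional feature."""
    return list(visual_feature) + directional_feature(heading, elevation)


# A dummy 2048-dimensional visual feature for one candidate view.
feat = view_representation([0.0] * 2048, heading=math.pi / 2, elevation=0.0)
print(len(feat))  # 2176 = 2048 visual + 128 directional
```

Repeating the four values gives the low-dimensional orientation signal a comparable share of the input next to the much larger visual feature.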

Cross-modality Encoder. To learn cross-modality representations, the cross-modality encoder is composed of self-attention layers and cross-attention layers, where the cross-attention layers treat one modality as the query and the other as the key and value to exchange information and align the entities between the two modalities. In particular, we feed the language representation $X$, the vision representation $V_t$, and the previous activations $M_t$ to the cross-modality encoder as

$$\hat{X}_t, \hat{V}_t, \hat{M}_t = \mathrm{CrossEnc}(X, [V_t; M_t]), \tag{1}$$

where $[\cdot\,;\cdot]$ denotes concatenation. Then, the action prediction head takes the output $\hat{V}_t$ to make the action decision for this step: $p_t = \mathrm{ActionHead}(\hat{V}_t)$.
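To make the query/key/value roles concrete, here is a minimal single-head cross-attention sketch in plain Python. It omits learned projections, multiple heads, and layer stacking, and it assumes that the memory tokens are concatenated with the vision tokens on one side of the cross-attention. The toy token values and all function names are illustrative, not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Each query token attends over all key/value tokens (scaled dot-product)."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in keys_values]
        weights = softmax(scores)
        outputs.append([sum(w * kv[j] for w, kv in zip(weights, keys_values))
                        for j in range(dim)])
    return outputs


language = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 toy word tokens
vision = [[0.5, 0.5], [1.0, 0.0]]                # 2 toy candidate views
memory = [[0.0, 1.0]]                            # 1 token from a previous step

vision_and_memory = vision + memory              # concatenation of the two sets
lang_out = cross_attention(language, vision_and_memory)     # language as query
vis_mem_out = cross_attention(vision_and_memory, language)  # vision+memory as query
print(len(lang_out), len(vis_mem_out))  # 3 3
```

Because attention operates on a set of tokens, adding one more memory token per step changes only the length of `vision_and_memory`, not the layer itself.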

At the end of each step, we update the memory bank by appending the activation of the current action decision to the existing memory bank as

$$M_{t+1} = [M_t; [\hat{v}_t^k; d_t^k]], \tag{2}$$

where $k$ is the index of the selected vision output and $d_t^k$ is the corresponding directional feature of the action at step $t$.
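The memory update can be sketched as a simple append-only store. This is a minimal illustration, not the authors' implementation: the `MemoryBank` class, its method names, and the choice to join the selected activation and its directional feature by list concatenation are assumptions, and the encoder outputs are replaced by stand-in values.

```python
class MemoryBank:
    """Variable-length store holding one token per completed navigation step."""

    def __init__(self):
        self.tokens = []

    def reset(self):
        """Clear the bank at the start of a new episode."""
        self.tokens = []

    def append(self, vision_outputs, directional_features, k):
        """Store the output activation of the chosen direction k together with
        its directional feature (joined by concatenation, an assumption)."""
        token = list(vision_outputs[k]) + list(directional_features[k])
        self.tokens.append(token)

    def __len__(self):
        return len(self.tokens)


bank = MemoryBank()
for step in range(3):
    # Stand-ins for the cross-modality encoder outputs of 2 candidate views.
    vision_outputs = [[0.1 * step, 0.2], [0.3, 0.4 * step]]
    directional_features = [[1.0, 0.0], [0.0, 1.0]]
    k = step % 2  # hypothetical action choice for this step
    bank.append(vision_outputs, directional_features, k)

print(len(bank))  # 3: one memory token per step taken
```

No extra network is needed to summarize the history: the bank simply grows by one reused activation per step.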

3.3 Memory-aware Consistency Loss

As mentioned above, the key challenge in VLN is that an embodied agent needs to be aware of its progress along the navigation trajectory by learning the cross-modal representation. However, existing studies [hu2019you, anderson2018vision] show that the agent tends to overfit the instructions, which could be due to large variations in the visual modality. To prevent the model from overfitting a single modality, we design a memory-aware consistency loss. By randomly dropping some words in the instruction, we force the model to learn strong representations among language, vision, and temporal context from the cross-modality encoder.

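Specifically, given an instruction $I$, we randomly drop some words with a fixed probability and obtain

$$I' = \mathrm{RandomDrop}(I). \tag{3}$$

Both $I$ and $I'$ are then encoded by the language encoder to produce the instruction representations $X$ and $X'$, respectively. Like the instruction feature $X$, $X'$ is also fed through the cross-modality encoder with the same history representation $M_t$ and vision representation $V_t$ as in Eq. (1):

$$\hat{X}'_t, \hat{V}'_t, \hat{M}'_t = \mathrm{CrossEnc}(X', [V_t; M_t]). \tag{4}$$

Although some words are discarded, we expect the similarities between the instruction features $X$ and $X'$, and between their corresponding outputs, to be preserved. Concretely, we minimize the bidirectional Kullback-Leibler (KL) divergence between the outputs of the full instruction and the randomly dropped instruction from the language encoder and the cross-modality encoder, generating the probability vectors with a Softmax layer. The consistency loss is defined as

$$\mathcal{L}_{c} = \alpha\, D_{\mathrm{KL}}^{\mathrm{bi}}\big(\sigma(X), \sigma(X')\big) + \beta\, D_{\mathrm{KL}}^{\mathrm{bi}}\big(\sigma(\hat{X}_t), \sigma(\hat{X}'_t)\big), \tag{5}$$

where $\sigma$ denotes the Softmax layer, $D_{\mathrm{KL}}^{\mathrm{bi}}(P, Q) = D_{\mathrm{KL}}(P \,\|\, Q) + D_{\mathrm{KL}}(Q \,\|\, P)$ is the bidirectional KL divergence, and $\alpha$ and $\beta$ are the weights that balance the distance losses. The first term in Eq. (5) aims to prevent the agent from overfitting to special words (such as route words), while the second term aims to avoid overfitting to the language input.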
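The two ingredients of the consistency loss, random word dropping and a weighted bidirectional KL divergence over softmax outputs, can be sketched as follows. This is an illustration under stated assumptions: each encoder output is reduced to a single logit vector, the bidirectional divergence is taken as the plain sum of both KL directions, and the function names and the rule of always keeping at least one word are hypothetical. The default weights 0.6 and 0.2 and the dropping probability 0.5 are the values reported in the implementation details.

```python
import math
import random


def drop_words(words, p, rng):
    """Drop each word independently with probability p, keeping at least one."""
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL(p || q) for two discrete distributions with positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def bidirectional_kl(logits_a, logits_b):
    """Sum of KL in both directions between the two softmax distributions."""
    p, q = softmax(logits_a), softmax(logits_b)
    return kl(p, q) + kl(q, p)


def consistency_loss(lang_full, lang_drop, out_full, out_drop,
                     alpha=0.6, beta=0.2):
    """Weighted sum of the language-encoder and cross-modality-encoder terms."""
    return (alpha * bidirectional_kl(lang_full, lang_drop)
            + beta * bidirectional_kl(out_full, out_drop))


rng = random.Random(0)
instruction = "walk past the sofa and stop at the stairs".split()
dropped = drop_words(instruction, p=0.5, rng=rng)

# Toy logit vectors standing in for encoder outputs on the two instructions.
loss = consistency_loss([1.0, 2.0, 0.5], [0.8, 2.1, 0.7],
                        [0.2, 1.5], [0.4, 1.2])
```

The loss is zero when the full and dropped instructions produce identical distributions and grows as they diverge, which is the behavior the consistency objective relies on.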

| Methods | Val Seen NE(m) | Val Seen SR(%) | Val Seen SPL(%) | Val Unseen NE(m) | Val Unseen SR(%) | Val Unseen SPL(%) | Test NE(m) | Test SR(%) | Test SPL(%) |
|---|---|---|---|---|---|---|---|---|---|
| Random | 9.45 | 16 | - | 9.23 | 16 | - | 9.79 | 13 | 12 |
| Human | - | - | - | - | - | - | 1.61 | 86 | 76 |
| Speaker-Follower [fried2018speaker] | 3.36 | 66 | - | 6.62 | 35 | - | 6.62 | 35 | 28 |
| Self-monitoring [ma2019self] | 3.22 | 67 | 58 | 5.52 | 45 | 32 | 5.67 | 48 | 35 |
| RCM [wang2019reinforced] | 3.53 | 67 | - | 6.09 | 43 | - | 6.12 | 43 | 38 |
| FAST-Short [ke2019tactical] | - | - | - | 4.97 | 56 | 43 | 5.14 | 54 | 41 |
| EnvDrop [tan2019learning] | 3.99 | 62 | 59 | 5.22 | 52 | 48 | 5.23 | 51 | 47 |
| DR-Attacker [lin2021adversarial] | 3.52 | 70 | 67 | 4.99 | 53 | 48 | 5.53 | 52 | 49 |
| AuxRN [zhu2020vision] | 3.33 | 70 | 67 | 5.28 | 55 | 50 | 5.15 | 55 | 51 |
| RelGraph [hong2020language] | 3.47 | 67 | 65 | 4.73 | 57 | 53 | 4.75 | 55 | 52 |
| PRESS [li2019robust] | 4.39 | 58 | 55 | 5.28 | 49 | 45 | 5.49 | 49 | 45 |
| PREVALENT [hao2020towards] | 3.67 | 69 | 65 | 4.71 | 58 | 53 | 5.30 | 54 | 51 |
| ORIST [qi2021know] | - | - | - | 4.72 | 57 | 51 | 5.10 | 57 | 52 |
| VLNBERT [hong2021vln] | 2.90 | 72 | 68 | 3.93 | 63 | 57 | 4.09 | 63 | 57 |
| HAMT [chen2021history] | - | 69 | 65 | - | 64 | 58 | - | - | - |
| Ours | **2.67** | **74** | **69** | **3.73** | **66** | **59** | **3.85** | **65** | **59** |

Table 1: Comparisons of VLN performance on the R2R dataset in a single-run setting. The best results are in bold font. The methods at the bottom (PRESS through Ours) are Transformer-based solutions whose model parameters are initialized from a pre-trained vision-and-language BERT. The methods in the middle (Speaker-Follower through RelGraph) are non-Transformer-based solutions. Note that for the very recent concurrent method HAMT [chen2021history], we report its results with ResNet-152 as the vision encoder for fair comparison.

3.4 Training

Following existing VLN works, we apply a mixture of Imitation Learning (IL) and Reinforcement Learning (RL) training strategies [wang2019reinforced, tan2019learning]. In IL, the agent learns to follow the teacher action $a_t^*$ of the ground-truth path at each step by minimizing the negative log-probability loss. In RL, the agent learns from rewards using the A2C algorithm [mnih2016asynchronous]: after sampling the action $a_t^s$ from the agent's action prediction $p_t$, the agent receives a reward if it successfully arrives within 3m of the target or reduces its distance to the target after taking the action. In addition, we use the similarity between the agent path and the ground-truth path as a reward to encourage the agent to follow the instruction while moving closer to the target. The overall loss function can be written as:

$$\mathcal{L} = -\sum_{t=1}^{T} \log p_t(a_t^s)\, A_t \;-\; \lambda \sum_{t=1}^{T} \log p_t(a_t^*), \tag{6}$$

where $\lambda$ is a trade-off weight for the IL loss, $T$ is the length of the navigation path, and $A_t$ is the advantage calculated by the A2C algorithm [mnih2016asynchronous]. We alternately train the agent with the IL and RL strategies while applying the consistency loss in both.
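As a rough illustration of how the two training signals combine, the sketch below computes an IL term (negative log-probability of teacher actions) and an RL term (negative log-probability of sampled actions weighted by an advantage), then mixes them with a trade-off weight. The advantages are supplied as plain numbers rather than computed by an A2C critic, the consistency loss is left out, and all function names are illustrative. The default weight of 0.2 is the value reported in the implementation details.

```python
import math


def il_loss(step_probs, teacher_actions):
    """Negative log-probability of the teacher action at every step."""
    return -sum(math.log(p[a]) for p, a in zip(step_probs, teacher_actions))


def rl_loss(step_probs, sampled_actions, advantages):
    """Policy-gradient term: negative log-probability of each sampled action,
    weighted by its advantage (given here, not estimated by a critic)."""
    return -sum(math.log(p[a]) * adv
                for p, a, adv in zip(step_probs, sampled_actions, advantages))


def overall_loss(rl_probs, sampled_actions, advantages,
                 il_probs, teacher_actions, lam=0.2):
    """RL term plus the IL term scaled by the trade-off weight lam."""
    return (rl_loss(rl_probs, sampled_actions, advantages)
            + lam * il_loss(il_probs, teacher_actions))


# Toy two-step rollouts with two candidate actions per step.
il_probs = [[0.7, 0.3], [0.2, 0.8]]
rl_probs = [[0.6, 0.4], [0.5, 0.5]]
total = overall_loss(rl_probs, sampled_actions=[1, 0], advantages=[0.5, -0.2],
                     il_probs=il_probs, teacher_actions=[0, 1])
```

A positive advantage pushes the probability of the sampled action up and a negative one pushes it down, while the IL term always pulls toward the teacher action.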

4 Experiments

4.1 Setup

Datasets: We evaluate our method on the Room-to-Room dataset (R2R) [anderson2018vision] and the Cooperative Vision-and-Dialog Navigation dataset (CVDN) [thomason2020vision], both set in 3D environments based on the Matterport3D Simulator [chang2017matterport3d]. The simulated environments include 90 different housing scenes. The R2R dataset provides fully specified instructions describing the steps necessary to reach the goal, while the CVDN dataset provides an ambiguous and underspecified goal location together with human-human dialogs to guide the agent. R2R is split into a training set of 61 environments with 14,025 instructions, a seen validation set of the same 61 environments with 1,020 instructions, an unseen validation set of another 11 environments with 2,349 instructions, and a test set of the remaining 18 environments with 4,173 instructions. CVDN contains 4,742 training, 382 seen validation, 907 unseen validation, and 1,384 unseen test instances.

Evaluation Metrics: For R2R, we use its three standard metrics: Navigation Error (NE) defined as the distance (in meters) from the stop viewpoint to the goal position, Success Rate (SR), and Success rate weighted by Path Length (SPL), where SPL is regarded as the primary metric.
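The three R2R metrics can be computed from per-episode distances as in the sketch below. It assumes the commonly used definitions: an episode counts as a success when the agent stops less than 3m from the goal, and SPL weights each success by the ratio of the shortest-path length to the larger of the agent's path length and the shortest-path length. The dictionary keys and the function name are illustrative.

```python
def r2r_metrics(episodes, success_radius=3.0):
    """Compute NE (metres), SR (%) and SPL (%) over a list of episodes.

    Each episode is a dict with:
      final_dist   - distance from the stop viewpoint to the goal
      path_len     - length of the path the agent actually took
      shortest_len - length of the shortest start-to-goal path (must be > 0)
    """
    n = len(episodes)
    ne = sum(ep["final_dist"] for ep in episodes) / n
    sr = 0.0
    spl = 0.0
    for ep in episodes:
        success = 1.0 if ep["final_dist"] < success_radius else 0.0
        sr += success
        spl += success * ep["shortest_len"] / max(ep["path_len"],
                                                  ep["shortest_len"])
    return {"NE": ne, "SR": 100.0 * sr / n, "SPL": 100.0 * spl / n}


episodes = [
    {"final_dist": 1.0, "path_len": 12.0, "shortest_len": 10.0},  # success, detour
    {"final_dist": 5.0, "path_len": 8.0, "shortest_len": 10.0},   # failure
]
metrics = r2r_metrics(episodes)
print(metrics["NE"], metrics["SR"])  # 3.0 50.0
```

In this example the single successful episode took a 12m path where 10m would have sufficed, so SPL is lower than SR, which shows how SPL penalizes detours that SR ignores.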

For CVDN, following [thomason2020vision], we evaluate performance on the navigation from dialog history (NDH) task by Goal Progress, which measures how many meters the agent moves closer to the goal. There are three settings, depending on the supervision strategy. Oracle means the agent uses the shortest path as ground truth, and Navigator means it learns from the navigator path (which may not be optimal). Mixed supervision means learning from the navigator path if it reaches the goal point, and from the shortest path otherwise.
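Goal Progress and the choice of supervision path under the three settings can be written down in a few lines. This is a minimal sketch: paths are represented as lists of viewpoint identifiers, "reaches the goal point" is taken to mean that the navigator path ends at the goal, and the function names and viewpoint labels are hypothetical.

```python
def goal_progress(start_to_goal, end_to_goal):
    """Reduction in distance to the goal (metres); negative if the agent moved away."""
    return start_to_goal - end_to_goal


def supervision_path(navigator_path, shortest_path, goal, setting):
    """Pick the ground-truth path for the Oracle, Navigator or Mixed setting."""
    if setting == "oracle":
        return shortest_path
    if setting == "navigator":
        return navigator_path
    if setting == "mixed":
        # Use the human navigator path only if it actually ends at the goal.
        return navigator_path if navigator_path[-1] == goal else shortest_path
    raise ValueError(f"unknown setting: {setting}")


shortest = ["a", "b", "goal"]
navigator_ok = ["a", "c", "b", "goal"]
navigator_miss = ["a", "c", "d"]

print(goal_progress(10.0, 5.43))  # about 4.57 m of progress
print(supervision_path(navigator_miss, shortest, "goal", "mixed"))  # shortest path
```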

Figure 3: Impacts of the memory bank size on the seen and unseen validation sets of the R2R dataset in terms of NE, SR and SPL. Solid lines are our results with different memory bank sizes, and dashed lines are the results of PREVALENT [hao2020towards], from which our model is initialized.

| Methods | Val Unseen Ora | Val Unseen Nav | Val Unseen Mix | Test Ora | Test Nav | Test Mix |
|---|---|---|---|---|---|---|
| Random | 1.09 | 1.09 | 1.09 | 0.83 | 0.83 | 0.83 |
| Shortest Path | 8.36 | 7.99 | 9.58 | 8.06 | 8.48 | 9.76 |
| Seq-to-seq [thomason2020vision] | 1.23 | 1.98 | 2.10 | 1.25 | 2.11 | 2.35 |
| PREVALENT [hao2020towards] | 2.58 | 2.99 | 3.15 | 1.67 | 2.39 | 2.44 |
| CMN [zhu2020vision1] | 2.68 | 2.28 | 2.97 | 2.69 | 2.26 | 2.95 |
| ORIST [qi2021know] | 3.30 | 3.29 | 3.55 | 2.78 | 3.17 | 3.15 |
| DR-Attacker [lin2021adversarial] | 3.27 | 4.00 | 4.18 | 2.77 | 2.95 | 3.26 |
| Ours | 4.57 | 4.80 | 5.15 | 4.23 | 4.46 | 4.82 |

Table 2: Comparisons with state-of-the-art methods in terms of Goal Progress (m) on the navigation from dialog history (NDH) task on the CVDN dataset [thomason2020vision]. ‘Ora’, ‘Nav’ and ‘Mix’ denote the three settings ‘Oracle’, ‘Navigator’ and ‘Mixed’, respectively.

Implementation Details: To leverage vision-and-language pre-trained models, we initialize the language encoder and the cross-modality encoder from the pre-trained VLN model PREVALENT [hao2020towards]. Following PREVALENT [hao2020towards] and VLNBERT [hong2021vln], we train the agent on the original training data and the augmented data provided by [hao2020towards]. The vision encoder is a fixed ResNet-152 [he2016deep] pre-trained on Places365 [zhou2017places], as provided with the R2R dataset. The experiments are conducted on 3 NVIDIA V100 GPUs. We train the model for 10,000 iterations and adopt early stopping, keeping the model that achieves the best performance on the evaluation metric. We use the AdamW optimiser [loshchilov2017decoupled] with a fixed learning rate. The consistency-loss weights $\alpha$ and $\beta$ are set to 0.6 and 0.2, respectively, and the IL trade-off weight $\lambda$ is set to 0.2. We find that different levels of word dropping are all helpful, and we fix the word dropping probability to 0.5.

4.2 Comparisons with SoTA

Table 1 shows the performance comparisons of different VLN methods on the R2R dataset in a single-run setting. Our model performs best on all the metrics on both the unseen validation and test sets, suggesting good generalization ability. Compared with other Transformer-based methods, including PRESS [li2019robust], ORIST [qi2021know] and VLNBERT [hong2021vln], which also initialize their models from pre-trained ones [hao2020towards, chen2019uniter], our method is at least 2% higher in terms of SPL or SR in both the test and unseen validation scenarios. In addition, the lowest navigation error achieved by our model indicates that it moves the agent closer to the target. Note that for the very recent concurrent work HAMT [chen2021history] (released in late October 2021), we report its results with ResNet-152 as the vision encoder for fair comparison. When changing ResNet-152 to ViT [dosovitskiy2020image], HAMT reports better performance. However, its end-to-end training requires 20 NVIDIA V100 GPUs for 20 hours, a much higher cost than ours (3 V100 GPUs for 1 day). Another recent work, Mixup [liu2021vision], extends the 61 scenes in the training set to 116 cross-connected scenes with data augmentation and thus uses more data than other methods. We leave more comparisons to the Appendix.

Table 2 shows the performance comparisons in terms of Goal Progress on the CVDN dataset under the three different settings. Again, our method achieves the best performance, with significant gains on both the unseen validation and test sets, demonstrating its effectiveness in handling a variety of language instructions. Note that the Shortest Path agent takes the shortest path to the supervision goal at inference, which represents the upper bound on navigation performance for an agent.

4.3 Ablation studies

Memory bank size. Recall that our method stores the activations at each step as history information in a memory bank. Here, we evaluate the model performance with different memory bank sizes. When the memory bank size is 1, we only record the activations of the last step; when the size is variable, we record every step. Note that the paths in the R2R dataset are all around four to six steps long. The results are shown in Figure 3. In general, a larger memory size helps, and the variable-length memory gives the best performance, suggesting the importance of explicitly storing the history information. We also show the performance of PREVALENT as our baseline (dashed lines), since our model is initialized from it. Our model outperforms the baseline under most of the fixed-length memory bank sizes.
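The difference between a fixed-size and a variable-length memory bank can be illustrated with a bounded queue. This is a minimal sketch that assumes a fixed-size bank keeps only the most recent tokens; the function name and the string placeholders standing in for activations are illustrative.

```python
from collections import deque


def run_episode(num_steps, max_size=None):
    """Simulate the memory contents after an episode.

    max_size=None gives the variable-length bank (every step is kept);
    an integer keeps only the most recent max_size tokens.
    """
    memory = deque(maxlen=max_size)
    for step in range(num_steps):
        memory.append(f"token_{step}")  # placeholder for a stored activation
    return list(memory)


print(run_episode(5, max_size=1))  # ['token_4']
print(run_episode(5, max_size=3))  # ['token_2', 'token_3', 'token_4']
print(len(run_episode(5)))         # 5
```

With paths of around four to six steps, a variable-length bank stays small in practice while still retaining the full history.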

| Methods | Val Unseen SR(%) | Val Unseen SPL(%) |
|---|---|---|
| VLNBERT [hong2021vln] | 63.3 | 57.5 |
| VLNBERT [hong2021vln] + consistency | 62.8 | 57.2 |
| MTVM w/o consistency | 64.0 | 58.6 |
| MTVM | 65.7 | 59.4 |
| MTVM w/o consistency + drop words | 64.5 | 57.8 |

Table 3: Impacts of our proposed memory-aware consistency loss and random word dropping. “MTVM w/o consistency + drop words” refers to our model without the consistency loss but with random word dropping in language instructions for data augmentation.

Impacts of consistency loss and random word dropping. Table 3 compares the results with and without our proposed consistency loss. For our MTVM model, the consistency loss improves SR from 64.0% to 65.7% and SPL from 58.6% to 59.4%. We also evaluate its impact on another method, VLNBERT, in Table 3. Specifically, we apply the same word dropping strategy and the consistency loss to VLNBERT. However, for VLNBERT, adding the consistency loss performs slightly worse, suggesting that it is hard to learn cross-modal representations without an explicit memory bank.

Note that our word-drop strategy for the consistency loss is similar to the conventional random word dropping used for data augmentation. Thus, we compare against direct word dropping for data augmentation (denoted as “MTVM w/o consistency + drop words”) in Table 3, where we fix the word dropping rate to 0.5 in all methods. Direct word dropping as data augmentation is not as effective as our consistency loss.

We further investigate the effect of different word dropping rates on SR and SPL on both the seen and unseen validation sets of the R2R dataset by varying the word dropping rate. As shown in Figure 4, a small dropping rate (e.g., 0.1) does not perform as well as a larger one (e.g., 0.5), while too large a dropping rate (e.g., 0.7) also hurts performance. Thus, the best choice is 0.5.

Figure 4: Impact of different random word dropping rates on SR and SPL on both seen and unseen validation sets of R2R dataset.

Hyper-parameter sensitivity. We analyze the sensitivity of the SPL metric on the R2R unseen validation set to the hyper-parameters, using the consistency-loss weights $\alpha$ and $\beta$ in Eq. (5) as examples. The results are reported in Figure 5. From these results, we can see that SPL is not very sensitive to variations of $\alpha$ and $\beta$ in a range of around 0.2 to 0.8, and we find that $\alpha = 0.6$ and $\beta = 0.2$ is a good choice.

Figure 5: Sensitivity examples of the hyper-parameters in Eq. (5) to the SPL metric on the R2R unseen validation set. The darker the color, the better the performance.

Figure 6: Visualization examples of panoramic views and language attention weights. From sub-figures (a) and (b), it can be seen that without the consistency loss, MTVM took a longer path to reach “stairs”. (c) and (d) are the language attention weights at the final layer of the cross-modality encoder corresponding to (a) and (b) at each step.

| Methods | Params | Memory | Val Unseen SR(%) | Val Unseen SPL(%) |
|---|---|---|---|---|
| VLNBERT | 41.9M | 8.6GB | 63.3 | 57.5 |
| MTVM (single-direction cross-attention) | **41.6M** | **8.4GB** | 63.6 | 58.2 |
| MTVM (bi-directional cross-attention) | 68.4M | 17.9GB | **64.0** | **58.6** |

Table 4: Comparisons of training memory and computation cost on the R2R dataset. The single-direction MTVM variant uses the same cross-attention strategy as VLNBERT, where language is used as keys and values but not as queries. The bi-directional MTVM row reports the model without the consistency loss (the same result as “MTVM w/o consistency” in Table 3). The best results are in bold.

Memory and computation cost. Following most cross-modal Transformer methods [tan2019lxmert, hao2020towards], our MTVM facilitates vision-and-language interactions through bi-directional cross-attention sub-layers, where language is used as the query attending to vision and vice versa. To compare with the single-direction cross-modal Transformer method VLNBERT [hong2021vln], which only uses language tokens as keys and values but not as queries, we also develop a similar single-direction version of MTVM. The comparison of VLNBERT, the single-direction MTVM and the bi-directional MTVM in terms of parameters and GPU memory cost is shown in Table 4. For a fair comparison with VLNBERT, all the experiments are conducted on a single V100 GPU with batch size 16. With the same cross-attention strategy as VLNBERT, our single-direction MTVM achieves better performance with lower memory and computation cost. This is because VLNBERT needs an additional small network to encode and update its hidden states for temporal context, while our MTVM directly reuses the previous activations. This demonstrates the efficiency and effectiveness of our proposed memory-bank-based Transformer design.

4.4 Visualization

To illustrate the effect of the proposed consistency loss, we give a few visualization examples of panoramic views and language attention weights in Figure 6. In the R2R dataset, the agent needs to navigate by following the instruction from beginning to end. Sub-figures (a) and (b) in Figure 6 show that our MTVM model with the consistency loss reaches the target with a much shorter trajectory. In sub-figures (c) and (d), we observe that our model with the consistency loss is better able to ground the sub-instructions, while MTVM without the consistency loss fails to focus on the action word at each step.

5 Conclusion

We have proposed the Multimodal Transformer with Variable-length Memory (MTVM) framework, which enables the agent to explicitly model history information in a simple and effective way. We have also designed a memory-aware consistency loss to improve the generalization ability of our model. MTVM has demonstrated strong performance, outperforming almost all existing works on both the R2R and CVDN datasets. We see the benefit of allowing long-range dependency for the VLN task, and we hope this idea can benefit other vision-and-language interaction tasks.