Enabling robots to assist humans in the real world has long been a goal of deep learning research [chen2011learning, tellex2011understanding, guadarrama2013grounding]. One particularly important capability for a robot is following human instructions to navigate. Vision-and-Language Navigation (VLN) is a task in which an embodied agent must follow a language instruction to navigate to a goal position. Specifically, the agent is first given a detailed instruction, such as "Head a bit ahead and towards the double doors on the left towards the kitchen. Stop upon reaching the counter." At each step, the agent observes a panorama of its surroundings and makes a decision for the next step until reaching the end point.
In the past few years, many methods [chattopadhyay2021robustnav, ke2019tactical, lin2021adversarial, nguyen2019vision, ma2019self, fried2018speaker, liu2021vision, hong2020language, tan2019learning, zhu2020vision, wang2019reinforced, zhao2021evaluation] have been proposed for the VLN task. Most of them adopt the encoder-decoder framework, which first encodes the instruction and visual observations and then decodes the action sequence. Recent VLN studies [qi2021know, li2019robust, majumdar2020improving, hao2020towards, guhur2021airbert]
have shown great performance by directly modeling cross-modal relationships with Transformers. Different from other vision-and-language tasks such as VQA and image captioning, which learn relationships between a single image and its corresponding language, VLN aims to learn a joint representation between an instruction and a series of observations obtained by interacting with the environment. Thus, taking the temporal context into account is the key to grounding the instruction in the observations: figuring out what has been completed, what is next, and where to go. Most existing VLN methods use a fixed-length latent vector to represent the temporal context. For example, [hong2021vln] employs a recurrent hidden state to inject temporal information into the Transformer, and [qi2021know, hao2020towards] inherit the encoder-decoder structure with an additional LSTM to encode the temporal context. However, a single hidden state vector is not expressive enough to encode the whole history of interactions with the environment in a Transformer. It remains a huge challenge to align such a hidden state at time step t with the corresponding sub-instruction for decision making.
To address this problem, we propose a Multimodal Transformer with Variable-length Memory (MTVM) framework for VLN. Instead of using hidden states or an LSTM to encode the temporal context, we find it simple and effective to directly reuse the cross-modal Transformer activations obtained in the previous steps. Keeping past activations in an explicit memory bank models the history information explicitly without the need to consider their distances in the path, and the Transformer architecture naturally accommodates variable-length memory token inputs. In this way, the agent can easily update the temporal context by adding the output activation corresponding to the action at step t into the memory bank, as shown in Figure 1.
Thanks to the explicit memory bank, we further design a memory-aware consistency loss to boost the navigation performance. The consistency loss helps cross-modal alignment by learning the relations between the previous activations and the language instruction. Specifically, we randomly mask out some instruction words and force the model's output distribution to be consistent with that of the unmasked instruction. In this way, the model avoids overfitting to the language modality with the help of the explicit memory bank.
Our contributions can be summarized as follows:
We propose MTVM, which allows the agent to capture temporal context without distance dependency by simply reusing the previous cross-modal activations corresponding to its actions.
We design a memory-aware consistency loss to learn strong relations between instruction and temporal context which can further boost the navigation performance.
We conduct extensive experiments on the R2R and CVDN datasets; our method improves the success rate by 2% on the R2R test set and Goal Progress by 1.6m on the CVDN test set compared with the best baseline method.
2 Related Work
Vision-and-Language Navigation. VLN [anderson2018vision] is a task that requires an agent to follow a natural-language instruction to navigate in a photo-realistic environment to a goal location. In this process, the given instruction describes the trajectory in detail and the embodied agent needs to move through the scene with first-person views as observations. Following [anderson2018vision], several navigation tasks [chen2019touchdown, thomason2020vision, nguyen2019help, nguyen2019vision, qi2020reverie] have been further proposed for interactions with surrounding environments. In particular, different from [anderson2018vision], which collects data from an indoor environment, [chen2019touchdown] extends the navigation environment to real-life visual urban streets. [thomason2020vision] introduces navigating according to several question-answering pairs in a dialog history. [nguyen2019vision] and [nguyen2019help] consider object-finding tasks by requesting and interpreting simulated human assistants. [qi2020reverie] requires the agent to navigate to an appropriate location and identify the target object.
As a practical task in real-world applications, VLN has made incredible progress in recent years. [lin2021adversarial] uses adversarial attacking to capture key information from long instructions for a robust navigation. [ma2019self] ensures grounding the instruction correctly with a progress monitor so that the agent is constantly aware of what has been completed, what is next, and where to go. RCM [wang2019reinforced]
enforces cross-modal grounding both locally and globally via a matching critic providing rewards for reinforcement learning.
In the vision-and-language navigation setting, it is difficult to collect enough annotated data due to the large navigation space. [fried2018speaker] synthesizes new instructions with a speaker model that helps the agent by providing additional route-instruction pairs to expand the limited training data. To make further advances, [zhao2021evaluation] proposes an instruction-trajectory compatibility model to improve instruction evaluation. [tan2019learning] proposes an environmental dropout method based on view consistency to mimic novel and diverse environments. From a different perspective, REM [liu2021vision] reconnects the seen scenes to generate augmented data by mixing up environments. To further understand the relations between instructions and scenes, [hong2020language] and [qi2021know] take the objects in scenes and the corresponding words in instructions as the minimal encoding units. AuxRN [zhu2020vision] introduces additional training signals, including explaining actions and predicting the next orientation, to help acquire semantic knowledge. In contrast, our method focuses on modeling the temporal context to help the alignment between language and observations.
Multi-Modal Transformers. The Transformer [vaswani2017attention] architecture has shown great effectiveness in vision-and-language tasks [tan2019lxmert, lu2019vilbert, chen2019uniter, li2020unicoder, huang2020pixel, zhou2020unified, gan2020large]. Most vision-and-language tasks focus on joint embedding learning over individual pairs of an image and its corresponding language, such as VQA, image captioning, and text-to-image retrieval. Different from these tasks, VLN is a Markov Decision Process that learns the joint representation between the instruction and a series of observations along the corresponding trajectory. Inspired by the success of BERT [devlin2018bert], PRESS [li2019robust] first introduces a large-scale pre-trained language model to VLN for text representations. As cross-modal joint learning is the key to the VLN task, VLN-BERT [majumdar2020improving] and PREVALENT [hao2020towards] develop Transformer-based models in a self-supervised manner on image-text pairs from the web and image-text-action triplets from the R2R dataset [anderson2018vision], respectively.
[hong2021vln] and [qi2021know] adapt pre-trained V&L BERT models to the VLN task by leveraging the hidden state representations with a learned linear projection or an LSTM.
Recently, HAMT [chen2021history] and Episodic Transformer [pashevich2021episodic] have also proposed to model the history information explicitly by directly encoding all past observations and actions, which, however, is fairly complex. For example, HAMT requires a hierarchical vision Transformer to encode intra-panorama and inter-panorama visual information for the temporal context, whose end-to-end training is extremely costly and time-consuming. In contrast, our MTVM simply copies the past activations into a memory bank as the history information, and we further design a memory-aware consistency loss to help the alignment between the history information and the language instruction.
Formally, at the beginning of each episode, the agent is given a natural-language instruction W = {w_1, w_2, ..., w_L}, where L is the length of the instruction and w_i denotes a word. The VLN task requires the agent to follow the instruction to navigate from a start position to the goal location. At each step t, the agent observes the surrounding environment as a panoramic view composed of 36 single-view images. Figure 2 gives an overview of our proposed MTVM. At each step, MTVM directly lets the visual information, language information, and history information interact to make the action decision. After that, we update the memory bank by reusing the activation of the Transformer output according to the action decision. Moreover, a consistency loss is introduced to measure the distance between the output distributions of the full instruction and a randomly masked instruction, helping the cross-modal alignment. Note that instruction masking is only used in training, not in inference.
3.2 Memory-based Multimodal Transformer
As VLN is a Markov decision process [anderson2018vision], an embodied agent needs to pay attention to the temporal context during navigation. A generic Transformer is insufficient to model the instruction and the observations due to the lack of temporal context. At each navigation step, the agent needs to ground the instruction: which part has been finished and which part describes the next action. MTVM learns the cross-modal alignment so that the completed instruction parts are matched with the past trajectory. Our memory bank makes the agent aware of the navigation progress by directly interacting with the previous actions, so that it can ground the sub-instructions as guidance. In this way, it is easier for the agent to locate the relevant sub-instruction and gain useful information to select a candidate direction from the current-step observation. We construct our model following the vision-and-language pre-training works [tan2019lxmert, hao2020towards]; it consists of a language encoder, a vision encoder, and a cross-modality encoder.
Language Encoder. The language encoder is a standard multi-layer Transformer with self-attention. At the beginning of an episode, we feed the instruction to the language encoder to obtain the language representation.
Vision Encoder. The vision encoder is a convolutional network that encodes each single-view image into a 2048-dimensional visual feature. A 128-dimensional directional feature, obtained by repeating the trigonometric function representation [fried2018speaker], is concatenated with the visual feature to represent the orientation of each single view. At each step, the features of all candidate directions together form the vision representation.
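As an illustration, the directional encoding can be sketched as follows (a minimal sketch: the function names are ours, and the repeated (sin, cos) base over heading and elevation is the common implementation of [fried2018speaker] that we assume here):

```python
import math

def directional_feature(heading, elevation, dim=128):
    # Repeat [sin(h), cos(h), sin(e), cos(e)] until the vector has `dim`
    # entries, following the trigonometric encoding of Fried et al. (2018).
    base = [math.sin(heading), math.cos(heading),
            math.sin(elevation), math.cos(elevation)]
    return base * (dim // len(base))

def view_representation(visual_feat, heading, elevation):
    # Concatenate the 2048-d visual feature with the 128-d directional
    # feature, giving a 2176-d representation for one candidate view.
    return list(visual_feat) + directional_feature(heading, elevation)
```

Repeating the four trigonometric values simply lifts the low-dimensional orientation signal to a width comparable with the visual feature before concatenation.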
Cross-modality Encoder. In order to learn cross-modality representations, the cross-modality encoder is composed of self-attention layers and cross-attention layers, where the cross-attention layers treat one modality as the query and the other as the key and value to exchange information and align entities between the two modalities. In particular, we feed the language representation, the vision representation, and the previous activations stored in the memory bank (concatenated with the vision tokens) to the cross-modality encoder (Eq. (1)). Then, the action prediction head takes the output to make the action decision for this step.
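To make this interaction concrete, here is a minimal pure-Python sketch of a single cross-attention head in which language tokens act as queries over the concatenation of the current vision tokens and the variable-length memory tokens (all names, dimensions, and values are illustrative, not the actual implementation):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attention(queries, keys, values):
    # One cross-attention head: every query token attends over all
    # key/value tokens, however many memory tokens have accumulated.
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out

# Language tokens as queries; keys/values are the current candidate-view
# tokens plus the variable-length memory (grows by one token per step).
lang = [[1.0, 0.0], [0.0, 1.0]]
vision = [[0.5, 0.5]]
memory = [[0.2, 0.8], [0.9, 0.1]]
context = cross_attention(lang, vision + memory, vision + memory)
assert len(context) == len(lang)
```

Because attention is defined over an arbitrary number of key/value tokens, appending memory tokens requires no architectural change.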
At the end of each step, we update the memory bank by appending the activation corresponding to the current action decision: the cross-modality output of the selected candidate direction, together with the directional feature of the chosen action, is added to the existing memory bank.
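The update rule above can be sketched as a simple container (a hypothetical sketch: names are ours, and the feature dimensions are illustrative):

```python
class MemoryBank:
    """Variable-length memory of past cross-modal activations."""

    def __init__(self):
        self.tokens = []

    def update(self, vision_outputs, action_index, directional_feat):
        # Reuse the already-computed activation of the *selected*
        # candidate direction; past observations are never re-encoded.
        self.tokens.append(list(vision_outputs[action_index])
                           + list(directional_feat))

    def __len__(self):
        return len(self.tokens)
```

Because the bank only appends one precomputed activation per step, the history grows with the path length at negligible cost, and its tokens can be fed straight back into the cross-modality encoder.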
3.3 Memory-aware Consistency Loss
As aforementioned, the key challenge in VLN is that an embodied agent needs to be aware of the progress along the navigation trajectory by learning the cross-modal representation. However, existing studies [hu2019you, anderson2018vision] show that the agent tends to overfit the instructions, which could be due to large variations in the visual modality. In order to prevent the model from overfitting to a single modality, we design a memory-aware consistency loss. By randomly dropping some words in the instruction, we force the model to learn strong representations among language, vision, and temporal context from the cross-modality encoder.
Specifically, given an instruction W = {w_1, ..., w_L}, we randomly drop some words with a fixed probability and obtain a corrupted instruction W'. Both W and W' are then encoded by the language encoder to produce their instruction representations, and the representation of W' is additionally fed through the cross-modality encoder with the same memory and vision representations as in Eq. (1). Although some words are discarded, we expect the similarities between the two instruction features and their corresponding outputs to be preserved. Concretely, we minimize the bidirectional Kullback-Leibler (KL) divergence (the sum of the KL divergences in both directions) between the outputs of the full instruction and the randomly dropped instruction, from both the language encoder and the cross-modality encoder, by generating probability vectors with a Softmax layer. The consistency loss is defined as

L_con = λ1 · KL_bi(p_lang, p'_lang) + λ2 · KL_bi(p_cross, p'_cross),   (5)
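The memory-aware consistency loss can be sketched in a few lines (a sketch under our assumptions: the weight values 0.6/0.2 follow the implementation details reported later, and which weight attaches to which term is illustrative):

```python
import math
import random

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def bidirectional_kl(logits_a, logits_b):
    # Symmetric KL between the two output distributions.
    p, q = softmax(logits_a), softmax(logits_b)
    return 0.5 * (kl(p, q) + kl(q, p))

def drop_words(words, p=0.5, rng=random):
    # Randomly drop words with probability p (training only);
    # keep at least one word so the encoder input is never empty.
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else words[:1]

def consistency_loss(lang_full, lang_drop, cross_full, cross_drop,
                     w1=0.6, w2=0.2):
    # Weighted sum of the two bidirectional-KL terms (language-encoder
    # outputs and cross-modality-encoder outputs).
    return (w1 * bidirectional_kl(lang_full, lang_drop)
            + w2 * bidirectional_kl(cross_full, cross_drop))
```

When the two distributions agree the loss vanishes, so the gradient only pushes the model where the masked and unmasked predictions diverge.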
where λ1 and λ2 are the weights to balance the two distance terms. The first term in Eq. (5) aims to prevent the agent from overfitting to special words (such as route words), while the second term aims to avoid overfitting to the language input.
[Table 1: performance comparisons on the R2R Validation Seen, Validation Unseen, and Test splits]
Following the existing VLN works, we apply a mixture of Imitation Learning (IL) and Reinforcement Learning (RL) training strategies [wang2019reinforced, tan2019learning]. In IL, the agent learns to follow the teacher action of the ground-truth path at each step by minimizing the negative log-probability loss. In RL, the agent learns from rewards using the A2C algorithm [mnih2016asynchronous]: sampling actions from the agent's action prediction, the agent receives a reward if it successfully arrives within 3m of the target or reduces the distance to the target after taking an action. Besides, we use the similarity between the agent's path and the ground-truth path as an additional reward to encourage the agent to follow the instruction and move closer to the target. The overall loss function can be written as

L = λ · Σ_{t=1}^{T} −log p_t(a_t*) + Σ_{t=1}^{T} −A_t · log p_t(a_t),

where λ is a trade-off weight for the IL loss, T is the length of the navigation path, a_t* is the teacher action, a_t is the sampled action, and A_t is the advantage calculated by the A2C algorithm [mnih2016asynchronous]. We alternately train the agent with the IL and RL strategies while applying the consistency loss in both.
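Under these definitions, the overall training objective can be sketched as follows (names are ours and the IL weight 0.2 follows the settings reported in the implementation details; the reward shaping itself is omitted):

```python
import math

def il_loss(teacher_log_probs):
    # Negative log-probability of the teacher actions along the path.
    return -sum(teacher_log_probs)

def rl_loss(sampled_log_probs, advantages):
    # Advantage-weighted policy-gradient surrogate (A2C-style).
    return -sum(a * lp for a, lp in zip(advantages, sampled_log_probs))

def total_loss(teacher_log_probs, sampled_log_probs, advantages,
               consistency=0.0, il_weight=0.2):
    # IL term weighted by the trade-off coefficient, plus the RL term
    # and the memory-aware consistency loss.
    return (il_weight * il_loss(teacher_log_probs)
            + rl_loss(sampled_log_probs, advantages)
            + consistency)
```

With zero advantage the RL term contributes nothing, so only the weighted IL and consistency terms drive the update, matching the alternation described above.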
Datasets: We evaluate our method on the Room-to-Room (R2R) dataset [anderson2018vision] and the Cooperative Vision-and-Dialog Navigation (CVDN) dataset [thomason2020vision], both in 3D environments based on the Matterport3D Simulator [chang2017matterport3d]. The simulated environments include 90 different housing scenes. The R2R dataset provides fully specified instructions describing the steps necessary to reach the goal, while the CVDN dataset provides an ambiguous, underspecified goal location and human-human dialogs to guide the agent. R2R is split into a training set of 61 environments with 14,025 instructions, a seen validation set of the same 61 environments with 1,020 instructions, an unseen validation set of another 11 environments with 2,349 instructions, and a test set of the remaining 18 environments with 4,173 instructions. CVDN contains 4,742 training, 382 seen validation, 907 unseen validation, and 1,384 unseen test instances.
Evaluation Metrics: For R2R, we use its three standard metrics: Navigation Error (NE) defined as the distance (in meters) from the stop viewpoint to the goal position, Success Rate (SR), and Success rate weighted by Path Length (SPL), where SPL is regarded as the primary metric.
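For reference, SR and SPL follow the standard definitions used in the VLN literature; a minimal sketch:

```python
def success_rate(successes):
    # Fraction of episodes that end within the success threshold of the goal.
    return sum(successes) / len(successes)

def spl(successes, shortest_lengths, path_lengths):
    # SPL = (1/N) * sum_i S_i * l_i / max(p_i, l_i), where l_i is the
    # shortest-path length and p_i the agent's actual path length.
    return sum(s * l / max(p, l) for s, l, p in
               zip(successes, shortest_lengths, path_lengths)) / len(successes)
```

SPL discounts each success by how much longer the agent's path is than the shortest path, which is why it is treated as the primary metric.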
For CVDN, following [thomason2020vision], we evaluate performance on the Navigation from Dialog History (NDH) task by Goal Progress, which measures how much reduction in meters the agent makes towards the goal. There are three settings depending on the supervision strategy: Oracle means the agent regards the shortest path as the ground truth; Navigator means learning from the navigator path (which may not be optimal); Mixed supervision means learning from the navigator path if it reaches the goal and from the shortest path otherwise.
Implementation Details: To leverage vision-and-language pre-trained models, we initialize the language encoder and the cross-modality encoder with the pre-trained VLN model PREVALENT [hao2020towards]. Following PREVALENT [hao2020towards] and VLNBERT [hong2021vln], we train the agent on the original training data and the augmented data provided by [hao2020towards]. The vision encoder is a fixed ResNet-152 [he2016deep] pre-trained on Places365 [zhou2017places], as provided with the R2R dataset. The experiments are conducted on 3 NVIDIA V100 GPUs. We train the model for 10,000 iterations and adopt an early stopping strategy when the model achieves its best performance on the evaluation metric. We use a fixed learning rate with the AdamW optimiser [loshchilov2017decoupled]. The two consistency-loss weights in Eq. (5) are set to 0.6 and 0.2, respectively, and the IL trade-off weight is set to 0.2. We find that different levels of word dropping are all helpful, and we fix the word dropping probability to 0.5.
4.2 Comparisons with SoTA
Table 1 shows the performance comparisons of different VLN methods on the R2R dataset in the single-run setting. Our model performs the best on all metrics on both the unseen validation and test sets, suggesting good generalization ability. Compared with other Transformer-based methods including PRESS [li2019robust], ORIST [qi2021know], and VLNBERT [hong2021vln], which also initialize their models from pre-trained ones [hao2020towards, chen2019uniter], our method is at least 2% higher in terms of SPL or SR in both the test and validation unseen scenarios. In addition, the lowest navigation error achieved by our model indicates that our agent moves closer to the target. Note that for the very recent concurrent work HAMT [chen2021history] (released in late Oct. 2021), we report its results with ResNet-152 as the vision encoder for a fair comparison. When changing ResNet-152 to ViT [dosovitskiy2020image], HAMT reports better performance; however, its end-to-end training requires 20 NVIDIA V100 GPUs for 20 hours, which is much more costly than ours (3 V100 GPUs for 1 day). Another recent work, Mixup [liu2021vision], extends the 61 scenes in the training set to 116 cross-connected scenes with data augmentation and thus uses more data than other methods. We leave more comparisons to the Appendix.
Table 2 shows the performance comparisons in terms of Goal Progress on the CVDN dataset under the three settings. Again, our method achieves the best performance with significant gains on both the unseen validation and test sets, demonstrating its effectiveness in handling a variety of language instructions. Note that the Shortest Path Agent takes the shortest path to the supervision goal at inference, which represents the upper bound on navigation performance.
4.3 Ablation studies
Memory bank size. Recall that our method stores the activations at each step as history information in a memory bank. Here, we evaluate the model performance with different memory bank sizes. When the memory bank size is fixed to k, we only record the activations of the last k steps; when the size is variable, we record every step. Note that the paths in the R2R dataset are all around four to six steps. The results are shown in Figure 3. In general, a larger memory size helps, and the variable-length memory gives the best performance, suggesting the importance of explicitly storing the history information. In addition, we also show the performance of PREVALENT as our baseline (dashed lines), since our model is initialized from it. Our model outperforms the baseline under most of the fixed-length memory bank sizes.
|Methods||SR||SPL|
|VLNBERT [hong2021vln] + consistency||62.8||57.2|
|MTVM w/o consistency||64.0||58.6|
|MTVM w/o consistency + drop words||64.5||57.8|
Impacts of consistency loss and random word dropping. Table 3 compares the results with and without our proposed consistency loss. For our MTVM model, the consistency loss significantly improves the performance. We also evaluate its impact on another method, VLNBERT, in Table 3, applying the same word dropping strategy to VLNBERT with the consistency loss. However, for VLNBERT, adding the consistency loss performs slightly worse, suggesting that it is hard to learn cross-modal representations without an explicit memory bank.
Note that our word-drop strategy for the consistency loss is similar to conventional random word dropping used for data augmentation. Thus, we compare with direct word dropping as data augmentation (denoted as "MTVM w/o consistency + drop words") in Table 3, where we fix the word dropping rate to 0.5 in all methods. Direct word dropping as data augmentation is not as effective as our consistency loss.
We further investigate the effect of different word dropping rates on SR and SPL on both the seen and unseen validation sets of the R2R dataset. As shown in Fig. 4, a small dropping rate (e.g., 0.1) does not perform as well as a larger one (e.g., 0.5), while a too large dropping rate (e.g., 0.7) also hurts the performance. Thus, the best choice is 0.5.
Hyper-parameter sensitivity. We analyze the sensitivity of the hyper-parameters with respect to the SPL metric on the R2R unseen validation set, using the two weights in Eq. (5) as examples. The results are reported in Figure 5. From these results, we can see that SPL is not very sensitive to variations of the two weights within a moderate range, and we adopt the best-performing setting.
Memory and Computation Cost. Following most cross-modal Transformer methods [tan2019lxmert, hao2020towards], our MTVM facilitates vision-and-language interactions with bi-directional cross-attention sub-layers, where language is used as the query attending to vision and vice versa. To compare with the single-direction cross-modal Transformer method VLNBERT [hong2021vln], which only considers language tokens as keys and values but not as queries, we also develop a similar single-direction variant of MTVM. The comparison of VLNBERT, MTVM, and this variant in terms of parameters and GPU memory cost is shown in Table 4. For a fair comparison with VLNBERT, all the experiments are conducted on a single V100 GPU with batch size 16. With the same cross-attention strategy, our single-direction variant achieves better performance than VLNBERT but with lower memory and computation cost. This is because VLNBERT needs an additional small network to update its hidden states for the temporal context, while our MTVM directly reuses the previous activations. This demonstrates the efficiency and effectiveness of our memory-bank-based Transformer design.
To illustrate the effect of the proposed consistency loss, we give a few visualization examples of panoramic views and language attention weights in Fig. 6. In the R2R dataset, the agent needs to follow the instruction from the beginning to the end. Sub-figures (a) and (b) in Fig. 6 show that our MTVM model with the consistency loss achieves better navigation performance with a much shorter trajectory. In sub-figures (c) and (d), we observe that the model with the consistency loss better grounds the sub-instructions, while MTVM without the consistency loss fails to focus on the action word at each step.
We have proposed the Multimodal Transformer with Variable-length Memory (MTVM) framework, which enables the agent to model the history information explicitly in a simple and effective way. We have also designed a memory-aware consistency loss to improve the generalization ability of our model. MTVM has demonstrated strong performance, outperforming almost all existing works on both the R2R and CVDN datasets. We see the benefit of allowing long-range dependency in the VLN task, and we hope this idea can benefit other vision-and-language interaction tasks.