Vision-and-language navigation (VLN) is a task where an agent navigates in a real environment by following natural language instructions, as illustrated in Fig. 0(a). Different from indoor navigation [2, 34, 9, 32, 22, 29, 23, 18], the outdoor navigation task [5, 25, 24] takes place in urban environments that contain diverse street views. The vast urban area leads to a much larger space for an agent to explore, which provides a wide variety of objects for visual grounding and requires more informative instructions to address the complicated navigation environment. It is also more difficult to recover from a mistaken action when routing in a real-life urban environment. These problems make outdoor navigation a much more challenging task than indoor navigation, which has been broadly studied. Although tools such as the Google Maps API enable researchers to gather large-scale street scenes for visual perception, it is expensive to collect human-annotated instructions and generate adequate trajectory-instruction pairs to train an agent. This data scarcity limits navigation performance in sophisticated urban environments.
To deal with the data scarcity issue, Fried et al. propose the Speaker model to generate additional training pairs. By sampling many trajectories in the navigation environment and adopting the Speaker model to back-translate them into instructions, one can obtain a broad set of augmented training data.
Although pre-training with augmented data produced by the Speaker model improves the agent’s performance to some extent, this method has a few inherent drawbacks that limit the benefits of introducing augmented trajectories. First, the trained Speaker model hardly back-translates specific objects in the trajectory correctly; it can only provide general guidance regarding directions (Fig. 0(b)). As a result, it is challenging to learn the alignment between language and object groundings from the augmented data generated by the Speaker model. Moreover, since the back-translated instructions are reconstructed by a Speaker model, we cannot guarantee their correctness: any error within the augmented instructions will propagate into the navigation agent during the pre-training process and hinder the final navigation performance. In short, the inferior quality of the instructions generated by the Speaker model may greatly reduce its effectiveness for data augmentation.
To overcome data scarcity while avoiding error propagation, we leverage external resources to help outdoor navigation. Google Street View (https://developers.google.com/maps/documentation/streetview/intro) has worldwide coverage of street scenes, and it also supports navigation between two locations. With its assistance, we can collect various additional navigation trajectories in the urban environment. One major challenge of utilizing external resources lies in the style difference between instructions. For example, human-annotated instructions in the outdoor navigation task emphasize the visual environment’s attributes as navigation targets. For the external trajectories provided by Google Street View, in contrast, the instructions are generated by the Google Maps API, with guidance on street names and directions only. Even though such tools can provide external trajectory-instruction pairs, the difference in instruction style undermines the power of data augmentation.
In this paper, we present a novel Multimodal Text Style Transfer (MTST) learning approach that introduces external resources to overcome the data scarcity issue in the outdoor VLN task. We use the multimodal text style transfer model to narrow the gap between the machine-generated instructions in the external resources and the human-annotated instructions for the outdoor navigation task. The multimodal style transfer model is used to infer style-modified instructions for trajectories in the external resources, which are later used to pre-train the navigation agent. While providing direction guidance, such an approach can inject more visual objects from the navigation environment into the instructions (Fig. 0(b)). The enriched object-related information in the instruction can further help the navigation agent learn the grounding between the visual environment and the instruction. Meanwhile, the external trajectories and the style-modified instructions mitigate the data scarcity issue and serve as a more robust source for pre-training in the outdoor VLN task. Moreover, pre-training the navigation agent on the external resources exposes the agent to additional visual environments, which improves the agent’s generalizability.
Experimental results show that utilizing external resources during the pre-training process improves the navigation agent’s performance. In addition, pre-training with the style-modified instructions generated by our multimodal text style transfer model can further improve navigation performance and make the pre-training process more robust. In summary, the contribution of our work is three-fold:
We introduce external multimodal resources into the outdoor VLN task to overcome the data scarcity issue.
We propose a novel VLN Transformer model as the navigation agent for the outdoor VLN task.
We present a novel Multimodal Text Style Transfer learning approach to generate style-modified instructions and make more robust augmented data, which benefits the navigation agent in the pre-training process and leads to better navigation performance.
2 Related Work
Vision-and-Language Navigation Vision-and-language navigation (VLN) [5, 24, 20, 2] is a task that requires an agent to achieve the final goal based on the given instructions in a 3D environment. Besides the generalizability problem studied by many previous works on the VLN task [34, 9, 32, 22, 29, 23, 18], the data scarcity problem is still a critical issue for the VLN task. For the outdoor navigation task [5, 24], since the instructions are annotated by humans, it is difficult to collect large-scale human-written data for training agents under vast urban environments. This kind of data scarcity makes learning the optimal match between vision-and-language challenging. In this paper, we introduce the multimodal text-style transfer model that can utilize additional street view scenes and leverage different domains’ instructions to further improve the outdoor navigation agent.
Pre-training for Vision-and-Language Navigation To deal with the data scarcity problem, Fried et al. propose the Speaker model. The Speaker model is trained on the original trajectory-instruction pairs and back-translates a navigation trajectory into an instruction. By sampling numerous trajectories in the environment and pairing them with the reconstructed instructions, the navigation agent can be pre-trained on the augmented data [10, 12]. Though these pre-training methods show some improvement, they mainly rely on the quality of the instructions reconstructed by the Speaker model. However, the Speaker model can only produce instructions with guidance on direction and hardly back-translates specific target objects. This restriction heavily limits the benefit of introducing augmented trajectories with the Speaker model. For our proposed multimodal text style transfer model, instead of relying wholly on the Speaker model, we take advantage of numerous yet high-quality instructions from a different domain and apply style transfer for a better pre-training process.
Leveraging External Resources External resources offer numerous but unlabeled data, and utilizing them effectively can improve performance and generalizability. Chen et al. search related articles on Wikipedia to answer open-domain questions. Wang et al. leverage the external corpus on WikiHow to realize zero-shot video captioning. Zheng et al. help neural machine translation by incorporating information from human interactive suggestions. In this paper, we utilize the urban environment resources from Google Street View to enhance the outdoor navigation task. Instead of leveraging them directly, we propose a multimodal style-transfer model to make the external instructions more robust for our primary task.
Style Transfer for Data Augmentation Data augmentation is a training technique that prevents the trained model from overfitting and enhances its generalizability by generating more diverse data as training input. Unlike traditional data augmentation methods such as rotating, flipping, and cropping the image, style transfer can simultaneously maintain the original semantics and supply more distinct image features for data augmentation. Jackson et al. adopt style transfer between two domains to increase the model’s robustness for cross-domain image classification. Xu and Goel also propose domain adaptive text style transfer to leverage massively available text data from other domains. Inspired by the above data augmentation methods via style transfer, our multimodal style transfer utilizes cross-domain adaptation for augmented data and realizes instruction recovery with the reference trajectory to enhance the navigation agent.
3.1.1 Vision-and-Language Navigation (VLN)
In the vision-and-language navigation task, the reasoning navigator is asked to find the correct path to reach the target location by following the guidance of the instructions (a set of sentences). The navigation procedure can be viewed as a series of decision-making processes. At each time step, the navigation environment presents an image view. With reference to the instruction and the visual view, the navigator is expected to choose an action. The action set for urban environment navigation usually contains four actions, namely turn left, turn right, go forward, and stop.
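The decision process described above can be sketched as a simple rollout loop. This is only an illustrative sketch: the names `run_episode` and `toy_policy`, the string-valued views, and the step limit are assumptions made for the example and are not part of the task definition.

```python
from typing import Callable, List

# The four actions named in the text for urban navigation.
ACTIONS = ("turn_left", "turn_right", "forward", "stop")


def run_episode(policy: Callable[[str, str], str], views: List[str],
                instruction: str, max_steps: int = 10) -> List[str]:
    """Query the policy once per time step until it stops or the limit is hit."""
    taken: List[str] = []
    for t in range(max_steps):
        # Toy environment (assumption): views are pre-listed, the last one repeats.
        view = views[min(t, len(views) - 1)]
        action = policy(instruction, view)
        if action not in ACTIONS:
            raise ValueError(f"unknown action: {action}")
        taken.append(action)
        if action == "stop":
            break
    return taken


def toy_policy(instruction: str, view: str) -> str:
    # Hypothetical rule: stop once the current view is mentioned in the instruction.
    return "stop" if view in instruction else "forward"


trace = run_episode(toy_policy, ["street", "corner", "awning"], "stop at the awning")
```

A learned navigator would replace `toy_policy` with a model that scores the four actions from the instruction and the current view.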
3.1.2 Instruction Style
The navigation instructions vary across different VLN datasets in real-world urban environments. The human-annotated instructions for the outdoor VLN task emphasize attributes of the visual environment as navigation targets, and they frequently refer to objects in the panorama, such as traffic lights, cars, and awnings. In contrast, for the external trajectories provided by Google Street View, the instructions are generated by the Google Maps API; they are template-based and mainly consist of street names and directions.
Our Multimodal Text Style Transfer learning approach can effectively utilize the external resources for the outdoor VLN task and leverage instructions with different styles.
Following natural language instructions while navigating through a busy urban environment remains a great challenge for navigation agents because of the lack of annotated data. In this paper, we propose the Multimodal Text Style Transfer (MTST) learning approach for vision-and-language navigation in real-life urban environments to deal with the issue of data scarcity. The MTST learning framework mainly consists of two modules, namely the multimodal text style transfer model and the VLN Transformer. Fig. 2 provides an overview of our MTST approach.
To mitigate the data scarcity problem for the outdoor VLN task, we leverage external resources in our training process. We use the multimodal text style transfer model to narrow the gap between the human-annotated instructions for the outdoor navigation task, and the machine-generated instructions in the external resources. The multimodal text style transfer model is trained on the dataset for outdoor navigation, and it learns to infer style-modified instructions for trajectories in the external resources. Furthermore, we apply the two-stage training pipeline to train the VLN Transformer. We first pre-train the VLN Transformer on the external resources with the style-modified instructions and then fine-tune it on the outdoor navigation dataset.
3.3 Multimodal Text Style Transfer Model
In this section, we introduce the detailed implementation of the multimodal text style transfer model. The main difference between human-annotated instructions and machine-generated instructions is that the instructions written by human annotators often focus on objects in the surrounding environment, while the machine-generated instructions emphasize nearby street names. The goal of multimodal text style transfer is to inject more object-related information from the surrounding navigation environment into the machine-generated instruction while keeping the correct guiding signals.
3.3.1 Masking-and-Recovering Scheme
To inject objects that appear in the panorama into the instructions, the multimodal text style transfer model is trained with a “masking-and-recovering” scheme. We train the model on the outdoor VLN dataset with human-annotated instructions, then infer instructions for trajectories in the external resources. We mask out certain portions of the instructions and try to recover the missing portions with the help of the remaining instruction skeleton and the paired trajectory. To be specific, we use NLTK to mask out the object-related tokens in the human-annotated instructions and the street names in the machine-generated instructions. (Footnote: we masked out the tokens with the following part-of-speech tags: [JJ, JJR, JJS, NN, NNS, NNP, NNPS, PDT, POS, RB, RBR, RBS, PRP$, PRP, MD, CD].) Multiple tokens that are masked out in a row are replaced by a single [MASK] token. We aim to maintain the correct guiding signals for navigation after the style transfer process. Tokens that provide guiding signals, such as “turn left” or “take a right”, are not masked out. Instead, they are part of the remaining instruction skeleton that the style transfer decoder attends to when generating instructions with the new style.
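The masking step can be illustrated with a minimal sketch. It assumes the tokens have already been part-of-speech tagged (for example with NLTK's `nltk.pos_tag`) and are passed in as `(token, tag)` pairs. The text does not say how guiding words are protected from masking, so the `GUIDING_WORDS` guard list below is a hypothetical stand-in, as is the function name `mask_instruction`.

```python
from typing import List, Tuple

# Part-of-speech tags listed in the footnote of the masking scheme.
MASKED_TAGS = {"JJ", "JJR", "JJS", "NN", "NNS", "NNP", "NNPS", "PDT", "POS",
               "RB", "RBR", "RBS", "PRP$", "PRP", "MD", "CD"}

# Hypothetical guard list (assumption): direction words stay in the skeleton
# even when their tag is in MASKED_TAGS.
GUIDING_WORDS = {"left", "right", "forward", "straight"}


def mask_instruction(tagged: List[Tuple[str, str]]) -> List[str]:
    """Replace maskable tokens with [MASK], collapsing consecutive masks into one."""
    skeleton: List[str] = []
    for token, tag in tagged:
        if tag in MASKED_TAGS and token.lower() not in GUIDING_WORDS:
            # Emit a single [MASK] for a run of masked tokens.
            if not skeleton or skeleton[-1] != "[MASK]":
                skeleton.append("[MASK]")
        else:
            skeleton.append(token)
    return skeleton


# Hand-tagged example; in practice the tags would come from a POS tagger.
example = [("Turn", "VB"), ("left", "RB"), ("at", "IN"), ("the", "DT"),
           ("red", "JJ"), ("awning", "NN"), (".", ".")]
skeleton = mask_instruction(example)
```

Here “red awning” collapses into one [MASK] while “Turn left” survives in the skeleton, which is the behaviour the scheme asks for.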
Fig. 3 provides an example of the “masking-and-recovering” process during training and inference. The MTST model is trained on the outdoor navigation dataset with human-annotated instructions. We mask out the objects in the human-annotated instructions to get the instruction template. The model takes both the trajectory and the instruction skeleton as input and tries to fill the missing objects back into the instruction skeleton. The training objective is to recover the instructions with objects. With the “masking-and-recovering” training scheme, the MTST model learns to generate object-grounded instructions. When inferring instructions for the external resources, we mask out the street names. The MTST model again takes the visual trajectory and the masked instruction template as input, and it is prompted to fill the missing portions of the instructions with objects. As a result, the generated instructions have a style similar to that of the human-annotated instructions. The style-modified instructions are later used to pre-train the VLN Transformer.
3.3.2 Model Structure
We build our multimodal text style transfer model on the Speaker model proposed by Fried et al. On top of the visual-attention-based LSTM structure in the Speaker model, we inject textual attention over the masked instruction skeleton into the encoder, which allows the model to attend to the original guiding signals.
The encoder takes both visual and textual inputs, which encode the trajectory and the masked instruction skeleton. To be specific, each visual view in the trajectory is represented as a feature vector, which is the concatenation of the visual encoding and the orientation encoding. The visual encoding is the output of the second-to-last layer of RESNET18 for the current view. The orientation encoding encodes the current heading by repeating a heading vector 32 times, following Fried et al. As described in section 3.4.2, the feature matrix of a panorama is the concatenation of eight projected visual views.
In the multimodal style transfer encoder, we use a soft-attention module to calculate the grounded visual feature for the current view at each step. The attention weight over each slice of the view in the current panorama is computed from the hidden context of the previous step and a set of learnable parameters.
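A minimal sketch of such a soft-attention step is given below in plain Python. The bilinear scoring (a learnable matrix applied to the previous hidden context, then a dot product with each slice feature) is an assumption chosen for illustration, as are the names `soft_attention`, `hidden`, `weight_matrix`, and `slices`; the text only states which quantities the attention depends on.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_attention(hidden: List[float], weight_matrix: List[List[float]],
                   slices: List[List[float]]) -> Tuple[List[float], List[float]]:
    """Return (attention weights, attended feature) over the panorama slices."""
    # Project the previous hidden context with the learnable matrix (assumed form).
    query = [sum(w * h for w, h in zip(row, hidden)) for row in weight_matrix]
    # One score per slice: dot product between the query and the slice feature.
    scores = [sum(q * x for q, x in zip(query, s)) for s in slices]
    weights = softmax(scores)
    dim = len(slices[0])
    # Grounded feature: attention-weighted sum of the slice features.
    grounded = [sum(wt * s[d] for wt, s in zip(weights, slices)) for d in range(dim)]
    return weights, grounded


weights, grounded = soft_attention(
    hidden=[1.0, 0.0],
    weight_matrix=[[1.0, 0.0], [0.0, 1.0]],
    slices=[[1.0, 0.0], [0.0, 1.0]],
)
```

In this toy call the first slice aligns with the hidden context and therefore receives the larger weight.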
We use full stops to split the input text into multiple sentences. For each sentence in the input text, the textual encoding is the average of the word embeddings of all tokens in that sentence. We also use a soft-attention module to calculate the grounded textual feature at the current step. The attention weight over each sentence encoding at that step is computed with a separate set of learnable parameters, and the attention ranges over the sentences up to the maximum sentence number in the input text. The input text for the multimodal style transfer encoder is the masked instruction template.
The hidden context at the current timestamp is then computed from the grounded visual feature, the grounded textual feature, and the visual view feature at that timestamp.
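For concreteness, one common form of the encoder computations described in this section can be written as follows. The notation is assumed for this sketch and is not taken from the text: $h_{t-1}$ is the previous hidden context, $v_{t,i}$ the feature of the $i$-th of the eight slices at step $t$, $v_t$ the view feature, $s_j$ the encoding of the $j$-th sentence, $M$ the maximum sentence number, and $W_v$, $W_s$ the learnable parameters. The bilinear score and the concatenation fed to the LSTM are likewise assumptions consistent with the description, not a statement of the exact equations.

```latex
% Visual soft attention over the eight panorama slices (assumed bilinear score)
\alpha^{v}_{t,i} = \operatorname{softmax}_{i}\!\big((W_v h_{t-1})^{\top} v_{t,i}\big),
\qquad
\hat{v}_t = \sum_{i=1}^{8} \alpha^{v}_{t,i}\, v_{t,i}

% Textual soft attention over the sentence encodings
\alpha^{s}_{t,j} = \operatorname{softmax}_{j}\!\big((W_s h_{t-1})^{\top} s_{j}\big),
\qquad
\hat{s}_t = \sum_{j=1}^{M} \alpha^{s}_{t,j}\, s_{j}

% Hidden context update from the three features
h_t = \operatorname{LSTM}\big([\hat{v}_t;\, \hat{s}_t;\, v_t],\; h_{t-1}\big)
```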
3.3.3 Training Objectives
We train the multimodal text style transfer model in a teacher-forcing manner. The decoder generates tokens auto-regressively, conditioning on the masked instruction template and the trajectory.
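The training objective is to minimize the cross-entropy loss between the generated tokens and the tokens of the original instruction. The loss sums over all tokens in the original instruction, and each token is predicted conditioned on the preceding tokens, the masked instruction template, and the views in the trajectory (up to the maximum view number).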
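The teacher-forced objective amounts to the negative log-likelihood of the original tokens, that is, minus the sum over token positions of the log-probability assigned to the gold token. The sketch below assumes the decoder's per-step output distributions are already available as lists of probabilities; `teacher_forced_nll`, `step_probs`, and `target_ids` are hypothetical names used only for this example.

```python
import math
from typing import List


def teacher_forced_nll(step_probs: List[List[float]], target_ids: List[int]) -> float:
    """Sum of -log p(gold token) over all positions.

    step_probs[j] is the predicted vocabulary distribution at position j,
    computed with the gold tokens before j as decoder input (teacher forcing).
    """
    if len(step_probs) != len(target_ids):
        raise ValueError("one distribution is needed per target token")
    return -sum(math.log(probs[token]) for probs, token in zip(step_probs, target_ids))


# Two-token toy instruction over a two-word vocabulary.
loss = teacher_forced_nll([[0.5, 0.5], [0.25, 0.75]], [0, 1])
```

A perfectly confident and correct decoder would give a loss of zero; the toy call yields log 2 + log(4/3).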
3.4 VLN Transformer
In this section, we introduce the implementation details of the VLN Transformer. As illustrated in Fig. 4, our VLN Transformer is composed of an instruction encoder, a trajectory encoder, a cross-modal encoder that fuses the modality of the instruction encodings and trajectory encodings, and an action predictor.
3.4.1 Instruction Encoder
We use the instruction encoder to generate embeddings for each sentence in the instruction.
The instruction encoder is a pre-trained uncased BERT-base model. For each sentence, the sentence embedding is computed from the word embeddings that BERT generates for its tokens, followed by a fully-connected layer.
3.4.2 View Encoder
We use the view encoder to retrieve embeddings for the visual views at each time step.
Following Chen et al., we embed each panorama by slicing it into eight images and projecting each image from an equirectangular projection to a perspective projection. Each projected image is passed through the RESNET18 pre-trained on ImageNet. We use the output of the fourth-to-last layer before classification as the feature for each slice. The feature map for each panorama is the concatenation of the eight image slices, which forms a single tensor.
We center the feature map according to the agent’s heading at the current time stamp. We crop a fixed-size feature map from the center and calculate the mean value along the channel dimension. The resulting features are regarded as the current panorama feature for each state. Following Mirowski et al., we then apply a three-layer convolutional neural network to extract the view features at the current time stamp.
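The centering, cropping, and channel-averaging steps can be sketched on a tiny nested-list feature map. Several details here are assumptions for illustration only: column 0 is taken to correspond to heading 0 degrees, headings map linearly to columns, the crop wraps around the panorama seam, and the sizes in the example are toy values rather than the actual feature dimensions. The function name `center_and_crop` is hypothetical.

```python
from typing import List

Feature = List[List[List[float]]]  # indexed as [channel][row][column]


def center_and_crop(feature: Feature, heading_deg: float, crop_width: int) -> List[List[float]]:
    """Crop a window centered on the heading column, then average over channels."""
    channels = len(feature)
    height = len(feature[0])
    width = len(feature[0][0])
    # Assumed linear mapping from heading (degrees) to a column index.
    heading_col = int(round((heading_deg % 360.0) / 360.0 * width)) % width
    start = heading_col - crop_width // 2
    # Wrap around the seam, since a panorama covers the full 360 degrees.
    cols = [(start + k) % width for k in range(crop_width)]
    return [
        [sum(feature[c][r][col] for c in range(channels)) / channels for col in cols]
        for r in range(height)
    ]


# Toy map: 2 channels, 1 row, 8 columns; the channel mean of column k is k + 5.
toy = [[[float(k) for k in range(8)]], [[float(k + 10) for k in range(8)]]]
front = center_and_crop(toy, heading_deg=90.0, crop_width=4)
wrapped = center_and_crop(toy, heading_deg=0.0, crop_width=4)
```

The second call shows the wrap-around case, where the window straddles the first and last columns.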
3.4.3 Cross-Modal Encoder
In order to navigate through complicated real-world environments, the agent needs to jointly understand the natural language instructions and the visual views to choose proper actions for each state. Since the instructions and the trajectory lie in different modalities and are encoded separately, we introduce the cross-modal encoder to fuse the features from different modalities and jointly encode the instructions and the trajectory. The cross-modal encoder is an 8-layer Transformer encoder with mask. We use eight self-attention heads and a hidden size of 256.
In the teacher-forcing training process, we add a mask when calculating the multi-head self-attention across different modalities. By masking out all the future views in the ground-truth trajectory, the current view is only allowed to refer to the full instructions and all the previous views that the agent has passed by, that is, the sentence encodings up to the maximum sentence number followed by the view encodings up to the current step.
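The masking rule can be sketched as a boolean matrix over the concatenated input sequence, with sentence encodings first and view encodings after them. Two points are assumptions in this sketch: a view may attend to itself as well as to earlier views, and sentence positions attend only to other sentence positions (the text does not specify this case). The function name `build_attention_mask` is hypothetical.

```python
from typing import List


def build_attention_mask(num_sentences: int, num_views: int) -> List[List[bool]]:
    """mask[q][k] is True when position q may attend to position k.

    Positions 0..num_sentences-1 are sentences; the rest are views in time order.
    """
    n = num_sentences + num_views
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if k < num_sentences:
                # The full instruction is always visible.
                mask[q][k] = True
            elif q >= num_sentences:
                # A view sees itself and earlier views, never future views.
                mask[q][k] = k <= q
            # Assumption: sentence positions do not attend to views.
    return mask


mask = build_attention_mask(num_sentences=2, num_views=3)
```

With two sentences and three views, the first view sees both sentences and itself, while the last view sees the entire sequence.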
Since the Transformer architecture is based solely on the attention mechanism and thus contains no recurrence or convolution, we need to inject additional information about the relative or absolute position of the features in the input sequence. We add a learned segment embedding to every input feature vector, specifying whether it belongs to the sentence encodings or the view encodings. We also add a learned position embedding to indicate the relative position of the sentences in the instruction sequence or of the views in the trajectory sequence.
3.4.4 Training Objective
To predict the action for the current view, we concatenate the cross-modal encoder’s (Transformer’s) outputs for all views in the trajectory up to the current timestamp and apply a fully-connected layer on top of the concatenation.
During training, we use the cross-entropy loss for optimization.
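A minimal sketch of the prediction head and its loss is shown below. It assumes the concatenated view outputs are zero-padded to a fixed number of views so that a single fully-connected layer can be applied; that padding, the toy weights, and the names `action_probabilities` and `action_loss` are assumptions made for this example, not details given in the text.

```python
import math
from typing import List

ACTIONS = ("turn_left", "turn_right", "forward", "stop")


def action_probabilities(view_outputs: List[List[float]], weights: List[List[float]],
                         bias: List[float], max_views: int) -> List[float]:
    """Concatenate per-view outputs, apply one linear layer, return softmax probabilities."""
    if len(view_outputs) > max_views:
        raise ValueError("more views than the padded length allows")
    dim = len(view_outputs[0])
    features = [x for output in view_outputs for x in output]
    # Assumed zero-padding so the linear layer sees a fixed-length input.
    features += [0.0] * (dim * max_views - len(features))
    logits = [sum(w * f for w, f in zip(row, features)) + b
              for row, b in zip(weights, bias)]
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def action_loss(probs: List[float], gold_action: str) -> float:
    """Cross-entropy loss for a single step: -log p(gold action)."""
    return -math.log(probs[ACTIONS.index(gold_action)])


# Toy layer: only the "forward" row responds to the first feature.
toy_weights = [[0.0] * 4, [0.0] * 4, [2.0, 0.0, 0.0, 0.0], [0.0] * 4]
probs = action_probabilities([[1.0, 0.0]], toy_weights, [0.0] * 4, max_views=2)
predicted = ACTIONS[probs.index(max(probs))]
```

The loss is smallest when the gold action is the one the head already favours, here “forward”.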
4.1 Experimental Setup
For the outdoor VLN task, we conduct experiments on the Touchdown dataset [5, 24], which is designed for navigation in realistic urban environments. Based on Google Street View, Touchdown’s navigation environment encompasses 29,641 Street View panoramas of the Manhattan area in New York City, which are connected by 61,319 undirected edges. The dataset contains 9,326 trajectories for the navigation task, and each trajectory is paired with a human-written instruction. The training set consists of 6,526 samples, while the development set and the test set are made up of 1,391 and 1,409 samples, respectively.
4.1.2 External Resource
We use the StreetLearn dataset as the external resource for the outdoor VLN task. StreetLearn is another dataset for navigation in real-life urban environments based on Google Street View. It contains 114k panoramas from New York City and Pittsburgh. In the StreetLearn navigation environment, the graph for New York City contains 56k nodes and 115k edges, while the graph for Pittsburgh contains 57k nodes and 118k edges. The StreetLearn dataset contains 580k samples in the Manhattan area and 8k samples in the Pittsburgh area for navigation.
While the trajectories in the StreetLearn dataset contain more panoramas on average, the paired instructions are shorter than those in the Touchdown dataset. For the convenience of conducting experiments, we extract a sub-dataset, Manh-50, from the original large-scale StreetLearn dataset. Manh-50 consists of the navigation samples in the Manhattan area whose trajectories contain no more than 50 panoramas, which gives 31k training samples. More statistical details of the dataset can be found in the appendix.
4.1.3 Dataset Comparison
Even though the Touchdown dataset and the StreetLearn dataset are both built upon Google Street View, and both of them contain urban environments in New York City, pre-training the model with the VLN task on the StreetLearn dataset does not pose a risk of test data leakage. This is due to several reasons:
The instructions in the two datasets are distinct in style. The instructions in the StreetLearn dataset are generated by the Google Maps API; they are template-based and focus on street names. In contrast, the instructions in the Touchdown dataset are created by human annotators and emphasize the visual environment’s attributes as navigational cues.
As previously reported, the panoramas in the two datasets have little overlap. In addition, Touchdown instructions constantly refer to transient objects such as cars and bikes, which might not appear in a panorama taken at a different time. The different granularity of the panorama spacing also leads to distinct panorama distributions in the two datasets.
4.1.4 Instruction Style Transfer
Among the 9,326 trajectories in the Touchdown dataset, 9,000 are used to train the multimodal text style transfer model, while the remaining 326 form the validation set. We generate style-transferred instructions for the Manh-50 dataset, which are used to pre-train the VLN Transformer.
4.1.5 Evaluation Metrics
We use the following metrics to evaluate VLN performance:
Task Completion (TC): the accuracy of completing the navigation task correctly. Following Chen et al., the navigation result is considered correct if the agent reaches the specific goal or one of the adjacent nodes in the environment graph.
Shortest-Path Distance (SPD): the mean distance between the agent’s final position and the goal position in the environment graph.
Success weighted by Edit Distance (SED): the normalized Levenshtein edit distance between the path predicted by the agent and the reference path, which is constrained only to the successful navigation.
Coverage weighted by Length Score (CLS): a measurement of the fidelity of the agent’s path with respect to the reference path.
Normalized Dynamic Time Warping (nDTW): the minimized cumulative distance between the predicted path and the reference path, normalized by the length of the reference path.
Success weighted Dynamic Time Warping (SDTW): the nDTW value where the summation is only over the successful navigation.
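Several of the metrics above can be sketched per episode in a few lines. The formulas used here are assumptions drawn from the usual definitions of these metrics rather than from the text: SED is taken as success times one minus the Levenshtein distance normalized by the longer path, and nDTW as the exponential of the negative DTW cost divided by the reference length times a distance threshold. The distance function, the threshold, and all function names are hypothetical inputs for this example.

```python
import math
from typing import Callable, Dict, List, Sequence, Set


def task_completion(final_node: str, goal: str, neighbors: Dict[str, Set[str]]) -> bool:
    """Success if the agent ends at the goal or at a node adjacent to it."""
    return final_node == goal or final_node in neighbors.get(goal, set())


def levenshtein(a: Sequence, b: Sequence) -> int:
    """Edit distance between two node sequences (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def sed(pred: Sequence, ref: Sequence, success: bool) -> float:
    """Assumed form: success * (1 - edit distance / longer path length)."""
    if not success:
        return 0.0
    return 1.0 - levenshtein(pred, ref) / max(len(pred), len(ref))


def ndtw(pred: List, ref: List, dist: Callable, threshold: float) -> float:
    """Assumed form: exp(-DTW(pred, ref) / (len(ref) * threshold))."""
    inf = float("inf")
    n, m = len(pred), len(ref)
    table = [[inf] * (m + 1) for _ in range(n + 1)]
    table[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(pred[i - 1], ref[j - 1])
            table[i][j] = cost + min(table[i - 1][j], table[i][j - 1], table[i - 1][j - 1])
    return math.exp(-table[n][m] / (m * threshold))


def sdtw(pred: List, ref: List, dist: Callable, threshold: float, success: bool) -> float:
    """nDTW counted only for successful episodes."""
    return ndtw(pred, ref, dist, threshold) if success else 0.0
```

For example, `sed(["a", "b", "d"], ["a", "b", "c"], True)` evaluates to 2/3, and `ndtw` returns 1.0 when the predicted and reference paths coincide. Dataset-level scores would average these per-episode values.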
4.2 Results and Analysis
4.2.1 Baseline Model
4.2.2 Quantitative Results
Table 1 presents the navigation results on the Touchdown validation set and test set. We have the following observations from the evaluation results:
First, we compare the navigation performance of our VLN Transformer to the baseline RCONCAT model. When the navigation model is trained solely on the Touchdown dataset, our VLN Transformer surpasses the RCONCAT model on all metrics on the test set.
We also conduct experiments to compare both models’ outdoor navigation results with and without pre-training on external resources. Experimental results show that pre-training on external resources helps improve the task completion rate for both models, and it also improves the navigation performance on the metrics that are calculated on the success cases, such as SED and SDTW. However, pre-training with machine-generated instructions whose style differs from that of the human-annotated instructions significantly harms the fidelity of the paths generated by our VLN Transformer, resulting in a performance drop on SPD, CLS, and nDTW. These results suggest that the difference in instruction style might misguide the agent in the pre-training stage and cause the agent to take longer paths in failure cases.
In addition, we evaluate the effect of our Multimodal Text Style Transfer learning approach by pre-training the VLN Transformer on external resources with style-modified instructions. Evaluation results indicate that pre-training with style-modified instructions stably improves navigation performance on all metrics for the RCONCAT model. It also improves navigation performance on most metrics for the VLN Transformer, which suggests that our Multimodal Text Style Transfer learning approach is model-agnostic. The VLN Transformer’s inferior performance on SPD indicates that the instructions generated by the MTST model still have certain flaws compared to human-annotated instructions, which could be addressed in future work.
4.3 Ablation Study
In the ablation studies, we use the following notation when displaying the evaluation scores: +M-50 stands for pre-training with vanilla Manh-50; in the +speaker setting, the instructions are generated by the original Speaker, which only attends to the visual input; +text_attn denotes that we add a textual attention module to the Speaker so that it can attend to both the visual input and the machine-generated instruction, which is automatically obtained using the Google Maps API; in the +style setting, the instructions are generated by our MTST model with the “masking-and-recovering” learning objective.
4.3.1 Quality of the Generated Instruction
In the first ablation study, we evaluate the quality of instructions generated by the original Speaker and by the MTST model. We utilize five automatic metrics for natural language generation to evaluate the quality of the generated instructions: BLEU, ROUGE, METEOR, CIDEr, and SPICE. Among the 9,326 trajectories in the Touchdown dataset, 9,000 are used to train the MTST model, while the rest form the validation set.
We report the quantitative results on the validation set in Table 2. After adding textual attention to the original Speaker, the evaluation performance on all five metrics improved. Our MTST model scores the highest on all five metrics, which indicates that the “masking-and-recovering” scheme is beneficial for the multimodal text style transfer process and that the MTST model can generate higher quality instructions.
4.3.2 Multimodal Instruction Style Transfer
We conduct another group of ablation studies to reveal the effect of each component in the multimodal text style transfer model. The VLN Transformer is pre-trained with external trajectories and instructions generated by different models, then fine-tuned on the outdoor VLN task. Navigation results are shown in Table 3.
According to the evaluation results, the instructions generated by the original Speaker model misguide the navigation agent, which indicates that relying solely on the Speaker model is not able to reduce the gap between different instruction styles. Adding textual attention to the Speaker model slightly improves the navigation results, but still hinders the agent from navigating correctly. The style-modified instructions improve the agent’s performance on all the navigation metrics, which suggests that our Multimodal Text Style Transfer learning approach can assist the outdoor VLN task.
4.3.3 Case Study
We present case studies to illustrate the performance of our Multimodal Text Style Transfer learning approach. Fig. 5 provides two examples of the instruction generation results. As listed in the charts, the instructions generated by the original Speaker model not only fail to keep the guiding signals in the ground truth instructions but also suffer from hallucinations, that is, they refer to objects that do not appear in the trajectory.
The Speaker with textual attention can provide direction guidance. However, the instructions generated in this manner do not utilize the rich visual information in the trajectory. On the other hand, the instructions generated by our multimodal text style transfer model inject more object-related information (“the light”, “scaffolding”) from the surrounding navigation environment into the StreetLearn instruction while keeping the correct guiding signals.
In this paper, we proposed the Multimodal Text Style Transfer learning approach for vision-and-language navigation in real-life urban environments. This learning framework allows us to utilize out-of-domain navigation samples in outdoor environments and enrich the original navigation reasoning training process. Experimental results show that our MTST learning approach is model-agnostic and significantly outperforms the baseline models on the outdoor VLN task, improving the task completion rate by a relative 22% on the test set and achieving new state-of-the-art performance. We believe our study provides a possible solution to mitigate the data scarcity issue in the outdoor VLN task, and we will further improve the quality of the style-modified instructions in future work.
- (2016) SPICE: semantic propositional image caption evaluation. In ECCV, Cited by: §4.3.1.
- (2018) Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments. In CVPR, Cited by: §1, §2.
- (2009) Natural Language Processing with Python. O’Reilly Media. Cited by: §3.3.1.
- (2017) Reading Wikipedia to Answer Open-Domain Questions. In ACL, Cited by: §2.
- (2019) Touchdown: natural language navigation and spatial reasoning in visual street environments. In , pp. 12538–12547. Cited by: §0.A.2, §0.A.3, §1, §2, §3.4.2, 1st item, §4.1.1, §4.1.5, §4.2.1.
- (2019) AutoAugment: Learning Augmentation Policies from Data. In CVPR, Cited by: §2.
- (2018) BERT: pre-training of deep bidirectional transformers for language understanding. Cited by: §3.4.1.
- (2013) Image description using visual dependency representations. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1292–1302. Cited by: §4.3.1.
- (2018) Speaker-follower models for vision-and-language navigation. Cited by: §1, §2, §3.3.2, §4.3.
- (2019) Counterfactual Vision-and-Language Navigation via Adversarial Path Sampling. In arXiv:1911.07308, Cited by: §2.
- (2015) A Neural Algorithm of Artistic Style. In arXiv:1508.06576, Cited by: §2.
- (2020) Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-training. In CVPR, Cited by: §2.
- (2015) Deep residual learning for image recognition. Cited by: §0.A.3, §3.3.2, §3.4.2.
- (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780. Cited by: §3.3.2.
- (2019) General evaluation for instruction conditioned navigation using dynamic time warping. Cited by: §4.1.5.
- (2018) Style Augmentation: Data Augmentation via Style Randomization. In arXiv:1809.05375, Cited by: §2.
- (2019) Stay on the path: instruction fidelity in vision-and-language navigation. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Cited by: §4.1.5.
- (2019) Tactical Rewind: Self-Correction via Backtracking in Vision-and-Language Navigation. In CVPR, Cited by: §1, §2.
- (2014) Adam: a method for stochastic optimization. Cited by: §0.A.4.
- (2019) VALAN: vision and language agent navigation. Cited by: §2.
- (2004) ROUGE: a package for automatic evaluation of summaries. In Text Summarization Branches Out, Barcelona, Spain, pp. 74–81. Cited by: §4.3.1.
- Self-monitoring navigation agent via auxiliary progress estimation. arXiv preprint arXiv:1901.03035. Cited by: §1, §2.
- The Regretful Agent: Heuristic-Aided Navigation through Progress Estimation. In CVPR, Cited by: §1, §2.
- (2020) Retouchdown: adding touchdown to streetlearn as a shareable resource for language grounding tasks in street view. Cited by: §0.A.2, §1, §2, 2nd item, §4.1.1.
- (2018) Learning to navigate in cities without a map. Cited by: §0.A.2, §0.A.3, §1, §3.4.2, §4.1.2, §4.2.1.
- HOGWILD!: a lock-free approach to parallelizing stochastic gradient descent. Cited by: §4.2.1.
- (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pp. 311–318. Cited by: §4.3.1.
- (2014) ImageNet large scale visual recognition challenge. Cited by: §0.A.3, §3.4.2.
- (2019) Learning to navigate unseen environments: back translation with environmental dropout. arXiv preprint arXiv:1904.04195. Cited by: §1, §2.
- (2017) Attention is all you need. Cited by: §3.3.2, §3.4.3.
- (2015) Cider: consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4566–4575. Cited by: §4.3.1.
- Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6629–6638. Cited by: §1, §2.
- (2019) Learning to Compose Topic-Aware Mixture of Experts for Zero-Shot Video Captioning. In AAAI, Cited by: §2.
- Look before you leap: bridging model-free and model-based reinforcement learning for planned-ahead vision-and-language navigation. Lecture Notes in Computer Science, pp. 38–55. Cited by: §1, §2.
- A learning algorithm for continually running fully recurrent neural networks. Neural computation 1 (2), pp. 270–280. Cited by: §3.3.3.
- (2019) Cross-Domain Image Classification through Neural-Style Transfer Data Augmentation. In arXiv:1910.05611, Cited by: §2.
- (2018) Learning to Discriminate Noises for Incorporating External Information in Neural Machine Translation. In arXiv:1810.10317, Cited by: §2.
Appendix 0.A
0.a.1 Dataset Comparison
Table 4 lists the statistics of the datasets used in pre-training and fine-tuning. The StreetLearn dataset has longer trajectories than the Touchdown dataset, yet its instructions are significantly shorter.
0.a.2 Instruction Style
The instructions in the StreetLearn dataset, a large-scale interactive navigation environment built upon Google Street View (https://developers.google.com/maps/documentation/streetview/intro), are generated by the Google Maps API. These template-based instructions always mention street names when suggesting navigation actions. However, static information such as street names is not directly revealed in the navigation environment, especially when the agent has no top-down view of the overall environment.
The instructions in the Touchdown dataset [5, 24], another urban navigation environment with Street View panoramas, are written by human annotators. These natural language instructions frequently refer to objects in the panorama, such as traffic lights, cars, and awnings.
Table 5 lists one instruction sample from each of the StreetLearn and Touchdown datasets.
| Dataset | Instruction sample |
| --- | --- |
| StreetLearn | Head northwest on E 23rd St toward 2nd Ave. Turn left at the 2nd cross street onto 3rd Ave. Turn right at the 2nd cross street onto E 21st St. |
| Touchdown | Orient yourself so you are facing the same as the traffic on the 4 lane road. Travel down this road until the first intersection. Turn left and go down this street with the flow of traffic. You’ll see a black and white stripped awning on your right as you travel down the street. Keep going pass the parking building on your right, until you are right next to a large open red dumpster. |
0.a.3 View Encoder Implementation
Following Chen et al., we embed each panorama by slicing it into eight images and projecting each image from an equirectangular projection to a perspective projection. Each projected image of size  is passed through ResNet18 pre-trained on ImageNet. We use the output of size  from the fourth-to-last layer before classification as the feature for each slice. The feature map for each panorama is the concatenation of the features of the eight image slices, which is a single tensor of size .
We center the feature map according to the agent’s heading at time stamp . We crop a  sized feature map from the center and calculate the mean value along the channel dimension. The resulting feature is regarded as the current panorama feature for each state. Following Mirowski et al., we then apply a three-layer convolutional neural network on  to extract the view features at time stamp .
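The centering, cropping, and channel-averaging steps can be sketched in plain Python. This is only an illustration under stated assumptions: the function names are hypothetical, column 0 of the feature map is assumed to correspond to a heading of 0° with column index growing with the heading angle, and the feature map is stored as nested lists indexed as [channel][row][column]. The actual tensor sizes are not reproduced here.

```python
from typing import List

Feature = List[List[List[float]]]  # indexed as [channel][row][column]


def center_on_heading(feature: Feature, heading_deg: float) -> Feature:
    """Roll the width axis so the column at `heading_deg` lands in the middle.

    Assumption: column 0 is heading 0 degrees and the panorama wraps around.
    """
    width = len(feature[0][0])
    heading_col = int(round((heading_deg % 360.0) / 360.0 * width)) % width
    shift = width // 2 - heading_col
    return [
        [[row[(c - shift) % width] for c in range(width)] for row in channel]
        for channel in feature
    ]


def crop_center(feature: Feature, crop_width: int) -> Feature:
    """Keep `crop_width` columns around the middle of the width axis."""
    width = len(feature[0][0])
    start = (width - crop_width) // 2
    return [[row[start:start + crop_width] for row in channel] for channel in feature]


def channel_mean(feature: Feature) -> List[List[float]]:
    """Average over the channel dimension, giving a single-channel map."""
    n_channels = len(feature)
    rows, cols = len(feature[0]), len(feature[0][0])
    return [
        [sum(feature[k][r][c] for k in range(n_channels)) / n_channels
         for c in range(cols)]
        for r in range(rows)
    ]
```

Calling `channel_mean(crop_center(center_on_heading(feature, heading), crop_width))` yields a single-channel map, which matches the single input channel of the first convolutional layer.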
The first layer has one input channel and 32 output channels, using  kernels with stride 4. The second layer has 32 input channels and 64 output channels, using  kernels with stride 4. ReLU is applied as the activation function after each convolutional operation. The output of the convolutional layers is projected by a single fully-connected layer to obtain the view feature representation.
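To see how two strided convolutions shrink the panorama feature before the fully-connected layer, the standard output-size formula for an unpadded convolution, floor((n − k) / s) + 1, can be applied layer by layer. The sketch below is illustrative only: the kernel sizes (8 and 4) and the 100 × 100 input are placeholder assumptions, since the text above does not give them; only the channel counts and the stride of 4 come from the description.

```python
from typing import Iterable, Tuple


def conv_out_size(size: int, kernel: int, stride: int, padding: int = 0) -> int:
    """Spatial output size of a convolution along one axis."""
    if size + 2 * padding < kernel:
        raise ValueError("kernel is larger than the padded input")
    return (size + 2 * padding - kernel) // stride + 1


def stack_out_shape(
    height: int, width: int, layers: Iterable[Tuple[int, int, int]]
) -> Tuple[int, int, int]:
    """Propagate a (1, height, width) input through (out_channels, kernel, stride) layers."""
    channels = 1  # the channel-averaged panorama feature has a single channel
    for out_channels, kernel, stride in layers:
        height = conv_out_size(height, kernel, stride)
        width = conv_out_size(width, kernel, stride)
        channels = out_channels
    return channels, height, width


# Placeholder kernel sizes (8, 4) and input size (100 x 100): assumptions, not values from the paper.
example_shape = stack_out_shape(100, 100, [(32, 8, 4), (64, 4, 4)])
```

Under these placeholder values the fully-connected layer would receive a flattened vector of `channels * height * width` entries from `example_shape`.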
0.a.4 Parameter Setting
We pre-train the VLN Transformer with the outdoor VLN task on Manh-50, the sub-dataset extracted from the StreetLearn dataset. Then, we fine-tune the pre-trained VLN Transformer on the Touchdown dataset for the VLN task.
In the pre-training phase, we use a learning rate of  for the VLN Transformer. We fine-tune BERT with a learning rate of . When pre-training on Manh-50, the batch size is 30, and the model is pre-trained for 25 epochs.
When training or fine-tuning the VLN Transformer on the Touchdown dataset, the batch size is 36. The learning rate for fine-tuning BERT is initially set to , while the learning rate for the other parameters in the model is initialized to . The Adam optimizer is used to optimize all parameters.
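Using one learning rate for BERT and another for the remaining parameters is commonly expressed as two parameter groups. The following is a minimal sketch, assuming parameters are identified by name and that BERT parameters share a name prefix; the function name, the `bert.` prefix, and the parameter names are hypothetical, and the learning rates are left as arguments because their values are not given above.

```python
from typing import Dict, List, Sequence, Union

Group = Dict[str, Union[List[str], float]]


def build_param_groups(
    param_names: Sequence[str],
    bert_lr: float,
    other_lr: float,
    bert_prefix: str = "bert.",  # hypothetical naming convention
) -> List[Group]:
    """Split parameter names into a BERT group and an 'everything else' group."""
    bert_params = [n for n in param_names if n.startswith(bert_prefix)]
    other_params = [n for n in param_names if not n.startswith(bert_prefix)]
    return [
        {"params": bert_params, "lr": bert_lr},
        {"params": other_params, "lr": other_lr},
    ]
```

The list-of-dictionaries layout mirrors the per-parameter options that optimizers such as `torch.optim.Adam` accept, where each group carries its own `lr`; in that setting the lists would hold parameter tensors rather than names.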
0.a.5 Leverage Multimodal Features
In this section, we discuss how our approach leverages information from different modalities to assist the VLN task in real-life urban environments. Our MTST learning approach makes use of the multimodal features in the outdoor navigation datasets in the following three ways:
- We use the trajectory and the masked instruction skeleton in the Touchdown dataset to train our multimodal text style transfer (MTST) model. Conditioned on both the visual features in the trajectory and the textual features in the incomplete instruction template, the MTST model learns to recover the incomplete instruction by injecting object-related information into the generated instruction.
- With the trajectory and masked instruction pairs in the StreetLearn dataset as inference input, we use the MTST model trained on the Touchdown dataset to infer style-modified instructions for StreetLearn trajectories. This narrows the gap between the instruction styles of the two outdoor navigation datasets.
- We use the StreetLearn trajectories and the style-modified instructions to pre-train our VLN Transformer. The VLN Transformer learns to fuse and reason over the visual features of the navigation environment and the textual features in the instruction. We then fine-tune the VLN Transformer with the multimodal features in the Touchdown dataset.
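The notion of a masked instruction skeleton can be illustrated with a small sketch. It assumes that masking means replacing object-related words with a placeholder token; the `[MASK]` string, the function name, and the hand-written word set are hypothetical stand-ins for whatever tokenizer and tagging procedure the actual pipeline uses.

```python
import re
from typing import List, Set

MASK = "[MASK]"  # hypothetical placeholder token


def mask_objects(instruction: str, object_words: Set[str]) -> List[str]:
    """Return the instruction tokens with object-related words replaced by MASK.

    `object_words` is an illustrative, hand-written set of lowercase words.
    """
    tokens = re.findall(r"\w+|[^\w\s]", instruction)
    return [MASK if tok.lower() in object_words else tok for tok in tokens]


skeleton = mask_objects(
    "Turn left at the light and stop next to the scaffolding.",
    {"light", "scaffolding"},
)
```

Here `skeleton` keeps the directional words ("Turn", "left", "stop") and leaves placeholders where object mentions were, which is the kind of incomplete template the style transfer model is trained to fill in using the visual features of the trajectory.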