Explore the Potential Performance of Vision-and-Language Navigation Model: a Snapshot Ensemble Method

by   Wenda Qin, et al.
Boston University

Vision-and-Language Navigation (VLN) is a challenging task in the field of artificial intelligence. Although massive progress has been made on this task over the past few years thanks to breakthroughs in deep vision and language models, it remains difficult to build VLN models that generalize as well as humans. In this paper, we provide a new perspective for improving VLN models. Based on our discovery that snapshots of the same VLN model behave significantly differently even when their success rates are nearly the same, we propose a snapshot-based ensemble solution that leverages the predictions of multiple snapshots. Built on snapshots of the existing state-of-the-art (SOTA) model VLN↻BERT and our past-action-aware modification, our proposed ensemble achieves new SOTA performance on the R2R dataset challenge in Navigation Error (NE) and Success weighted by Path Length (SPL).




1 Introduction

Given a textual instruction and visual inputs describing the surrounding environment, a Vision-and-Language Navigation (VLN) system controls an agent to complete a set of goals listed in the instruction. Building a good VLN system is difficult because it must understand both vision and language information and coordinate them well.

Recent advancements in computer vision and natural language processing and the advent of better vision-language models sundermeyer2012lstm; vaswani2017attention; lu2019vilbert; tan2019lxmert, along with the effort to prepare large-scale realistic datasets Matterport3D, have enabled rapid development of VLN systems. Among all current VLN datasets, the R2R dataset (Anderson et al. 2018) is based on real photos taken in indoor environments. It attracts massive attention for its simple task form, which at the same time requires a complex understanding of both images and texts.

To obtain better performance in R2R, various studies in the past have discussed how to adjust the best vision/language models at the time for the R2R VLN task anderson2018vision; majumdar2020improving; hong2021vln. Previous studies have also made efforts to prevent overfitting due to the limited size of the R2R dataset fried2018speaker; liu2021vision; li2019robust; hao2020towards.

In this paper, we offer a new perspective for analyzing the R2R VLN model that focuses on the by-products of the model training process: snapshots. Snapshots are the saved parameters of a model at various intervals during training. Although all snapshots have the same goal as the model, their parameters are different due to the ongoing optimization. We discover that some of the best snapshots saved at various intervals during training share similar navigation success rates while making significantly different errors. Based on this observation, we construct our VLN system with an ensemble of snapshots instead of just one. Through experiments, we find that such an ensemble can take advantage of its members and thus significantly improve the navigation performance.

In addition, to allow more model variants in the ensemble, we also propose a novel modification of an existing state-of-the-art (SOTA) model for VLN, i.e., the VLNBERT hong2021vln. Our ensemble, which consists of snapshots of both the VLNBERT model and our proposed modification, the past-action-aware VLNBERT model, achieves a new SOTA performance in the single-run setting of the R2R dataset.

To conclude, our contributions are as follows:

  • We discover that the best snapshots of the same model behave differently while having similar navigation success rates. Based on this observation, we propose a snapshot ensemble method to take advantage of the different snapshots.

  • We also propose a past-action-aware modification on the current best VLN model: the VLNBERT. It creates additional variant snapshots to the original model with equivalent navigation performance.

  • By combining the snapshots from both the original and the modified model, our ensemble achieves a new SOTA performance on the R2R challenge leaderboard in the single-run setting. (Our method is listed as “SE-Mixed (Single-Run)” on the leaderboard webpage: https://eval.ai/web/challenges/challenge-page/97/leaderboard)

  • We evaluate the snapshot ensemble method on two different datasets and apply it with two other VLN models. The evaluation results show that the snapshot ensemble also improves performance on more complicated VLN tasks and with different model architectures.

2 Related Works

2.1 Vision-and-language Navigation datasets

Teaching a robot to complete instructions is a long-standing goal in the AI community winograd1971procedures. Compared to GPS-based navigation, VLN accepts surrounding environments as visual inputs and correlates them with instructions in human language. Most VLN datasets in the past are based on synthesized 3-D scenes kolve2017ai2; brodeur2017home; wu2018building; yan2018chalet; Song_2017_CVPR. Recently, the emergence of data based on real-life scenarios has allowed VLN systems to be developed and tested in realistic environments. Specifically, 3-D views from Google Street View (https://www.google.com/streetview/) and the Matterport3D dataset (Chang et al. 2017) allow people to build simulators that generate navigation data from photos taken in real life. Different from the previous datasets, the R2R dataset that we use consists of navigations in real indoor environments. Concretely, the R2R dataset provides 15,000 instructions and 5,000 navigation paths in 90 indoor scenes. Since its construction, variants of the R2R dataset have been proposed to address certain shortcomings of the original one ku2020room; jain2019stay; hong2020sub; krantz2020beyond. However, the community still considers the R2R dataset a necessary test for evaluating all kinds of VLN systems for indoor navigation.

2.2 VLN systems for navigation in R2R dataset

To improve navigation performance in the R2R dataset, various models and techniques have been proposed. fried2018speaker and tan2019learning further developed the LSTM sundermeyer2012lstm + soft-attention luong2015effective baseline system anderson2018vision. majumdar2020improving proposed a VLN system based on ViLBERT lu2019vilbert to replace the LSTM + soft-attention architecture for better image and text understanding. Recently, chen2021topological; Wang_2021_CVPR; hong2020language proposed VLN systems based on graph models. In terms of techniques, fried2018speaker built a speaker model for data augmentation; ma2019regretful; ma2019self introduced regularization loss and back-tracking; tan2019learning improved the dropout mechanism in its VLN model; li2019robust; hao2020towards improved the models’ initial states by pre-training them on large-scale datasets; and hong2021vln developed a recurrent VLN model (VLNBERT) based on the BERT structure for the single-run setting in VLN. liu2021vision provides further data augmentation by splitting and mixing scenes.

The previous work closest in spirit to ours is hu2019you, which proposed a mixture of VLN models. However, each of their models is trained with different inputs. In this paper, we build an ensemble based on snapshots of the same model.

2.3 Ensemble

The concept of applying ensembles to neural network models appeared very early in the machine learning community hansen1990neural. There are well-known ensemble techniques such as Bagging breiman1996bagging, Random forests ho1995random, and boosting (AdaBoost) freund1997decision. However, applying such ensembles to deep learning models directly is very time-consuming. There are previous works that provide ensemble-like solutions for deep learning models xie2013horizontal; moghimi2016boosted; laine2016temporal; french2017self. Our work is inspired by the idea of “snapshot ensemble” from huang2017snapshot, which constructs the ensemble from a set of snapshots collected in local minima. Different from the previous work, we collect snapshots based on training intervals and success rates. Also, we apply beam search to optimize the combination of snapshots in the ensemble.

3 Preliminaries

3.1 Vision-and-language Navigation in R2R dataset

Navigation in R2R consists of three parts: an instruction $I$, a scene $S$, and a path $P$. The instruction $I$ is a sequence of $L$ words in the vocabulary $W$: $I = (w_1, w_2, \dots, w_L)$ with each $w_i \in W$. The scene $S$ is a connected graph that contains viewpoints $V$ and the edges $E$ that connect viewpoints: $S = (V, E)$. For a viewpoint $v \in V$ where the agent stands, there is a panoramic view $O_v$ that describes the visual surroundings of the agent. To be more precise, $O_v$ is a set of 36 views that a camera captured at the viewpoint from different horizontal and vertical directions. A viewpoint $v$ is connected (“navigable”) to another viewpoint $v'$ when one could directly walk from $v$ to $v'$ in the real environment that $S$ represents. The path $P$ is a sequence of viewpoints that starts from the starting viewpoint and ends at the destination viewpoint: $P = (v_0, v_1, \dots, v_n)$. $v_0$ is the initial position of the agent.

A VLN model for the R2R navigation task works as a policy function $\pi$ with the instruction $I$ and the panoramic view $O_v$ of a certain viewpoint $v$ as inputs: $\pi(I, O_v)$. At each time step $t$, the policy function predicts an action $a_t$, and tries to get as close as possible to the ground-truth destination in $P$ at the end.

After the agent chooses to stop or the number of its actions exceeds a limit, its last viewpoint $v_{\text{last}}$ will be evaluated. If $v_{\text{last}}$ is within 3 meters of the destination $v_n$ of the ground-truth path $P$, the navigation is considered successful; otherwise, it is considered failed.

There are three different settings for the VLN task in R2R: single-run, pre-explore, and beam search. In this paper, we focus on the single-run setting. The “single-run” setting requires the agent to finish the navigation with minimum actions taken and without prior knowledge of the environment.

3.2 VLNBERT model

We apply our modification and snapshot ensemble on the VLNBERT model proposed by hong2021vln. The model currently holds the best performance for the single-run setting in the R2R dataset liu2021vision. In this section, we briefly recap this model. A simplified visualization of the model structure is in Figure 1. (Footnote: hong2021vln proposed two VLNBERT models in their work. The VLNBERT model we use here is the LXMERT-based VLNBERT model pre-trained by PREVALENT hao2020towards. We call the other one, which is BERT-based and pre-trained by OSCAR li2020oscar, OSCAR-init VLNBERT to distinguish it from the PREVALENT-initialized one.)

Before computing the prediction of actions, the model selects a set of candidate views from the panoramic view. After that, the VLNBERT model projects the candidate views and the instruction into the same feature space. We discuss this process in detail in Appendix A. Eventually, we have a vector of instruction features and a vector of candidate action features as inputs of the action prediction.

At the first time step, the instruction feature vector is sent to a 9-layer self-attention module. The word features are thus attended to the state feature. The model then takes the state feature from the instruction vector and appends it to the candidate feature vector. After that, a cross-attention sub-module attends the remaining elements in the instruction vector to both the state feature and the candidate features. Lastly, another sub-module computes the self-attention of the instruction-attended state and candidate features. Such cross and self sub-modules build up the cross + self-attention module in Figure 1. The process repeats for four layers, and the attention scores between the state feature and each element of the candidate features in the last layer are the prediction scores of each action. Additionally, the state feature in the output is sent to a cross-modal-matching module. The output of the module is used as the state feature in the next time step, while the other features in the instruction vector remain unchanged. The cross and self-attention computation is repeated to compute action predictions for the remaining time steps.


The VLNBERT model minimizes two losses, an imitation learning loss and a reinforcement learning loss:

$$\mathcal{L} = -\sum_{t} a_t^{s} \log(p_t^{a}) A_t - \lambda \sum_{t} a_t^{*} \log(p_t^{a})$$

where $a_t^{*}$ is the teacher action (the one-hot encoded action that gets closest to the destination), $p_t^{a}$ is the probability of the taken action, $a_t^{s}$ is the action taken, and $A_t$ is the advantage value at time step $t$, computed by the A2C algorithm mnih2016asynchronous. $\lambda$ is a hyper-parameter that balances the weights of the imitation learning loss and the reinforcement learning loss.

Figure 1: A visualization of the VLNBERT model. The instruction feature first passes through a self-attention module and then attends to a candidate feature vector through a cross + self-attention module. The candidate feature then attends to itself in the same module. After four layers of computation, the last layer outputs the probabilities of each action and sends the state feature to a cross-modal matching module. The output will replace the state feature in the instruction vector for the next time step.

4 Proposed Method

4.1 Differences of Snapshots in the Same VLN model

Like other machine learning models, VLNBERT chooses the best snapshot by validation to represent the trained model. We train the VLNBERT model and observe its validation success rates, as measured on the val_unseen split of R2R, of the snapshots saved in the training process. As shown in Appendix B Figure 1, we saw that the success rate fluctuates drastically over time. This fluctuation is however not seen in the training loss. As shown in Appendix B, figures 2 and 3, both imitation and reinforcement learning losses drop consistently with time. This interesting discovery leads us to further investigate whether the snapshots that perform similarly (in terms of success rates) might behave differently with respect to the errors that they make.

We set up an experiment designed as follows: we train the VLNBERT model for 300,000 iterations and save the best snapshot in the validation split for every 30,000 iterations. The top-5 snapshots among them are shown in Table 1. We chose the best two snapshots, namely the snapshots with 62.32% and 61.60% success rates. We then count the navigations that only one of the snapshots failed, both of the snapshots failed or none of the snapshots failed. Our result shows that 563 navigations ended with different results between the best and the second-best snapshots, approximately 24% of the validation data. In comparison, the difference in their success rate is only 0.72%. The massive difference between 24% and 0.72% suggests that the agents of the two snapshots have different navigation behaviors even though they are almost equal in success rates. Naturally, we wonder if we could leverage both of their behaviors and thus create a better agent. One of the techniques we find to be effective for this problem is the snapshot ensemble that we discuss in the next section.
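The comparison described above can be sketched as follows. This is a minimal illustration, assuming each snapshot's validation results are available as a mapping from a navigation identifier to a success flag; the function name `compare_snapshots` and the toy data are hypothetical and not taken from the released code.

```python
from typing import Dict


def compare_snapshots(a: Dict[str, bool], b: Dict[str, bool]) -> Dict[str, float]:
    """Compare per-navigation outcomes (True = success) of two snapshots."""
    if not a or a.keys() != b.keys():
        raise ValueError("both snapshots must cover the same non-empty set of navigations")
    total = len(a)
    only_a_fails = sum(1 for k in a if not a[k] and b[k])
    only_b_fails = sum(1 for k in a if a[k] and not b[k])
    both_fail = sum(1 for k in a if not a[k] and not b[k])
    return {
        "only_a_fails": only_a_fails,
        "only_b_fails": only_b_fails,
        "both_fail": both_fail,
        "both_succeed": total - only_a_fails - only_b_fails - both_fail,
        # Share of navigations on which the two snapshots end differently.
        "differ_ratio": (only_a_fails + only_b_fails) / total,
        # Absolute gap between the two success rates.
        "sr_gap": abs(sum(a.values()) - sum(b.values())) / total,
    }


# Toy data (hypothetical): identical success rates, yet half of the outcomes differ.
snap_a = {"nav1": True, "nav2": False, "nav3": True, "nav4": False}
snap_b = {"nav1": True, "nav2": True, "nav3": False, "nav4": False}
stats = compare_snapshots(snap_a, snap_b)
print(stats["differ_ratio"], stats["sr_gap"])
```

The two quantities printed at the end correspond to the two numbers contrasted in the text: the share of navigations with different outcomes and the gap in success rate.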

Snapshot Period | Success Rate in val_unseen Split
90K - 120K | 62.32%
240K - 270K | 61.60%
210K - 240K | 61.56%
60K - 90K | 61.52%
180K - 210K | 61.30%
Table 1: The navigation success rates for the top-5 snapshots of VLNBERT in 10 periods of a 300,000-iteration training cycle.

4.2 Snapshot Ensemble for VLNBERT models

A snapshot is a set of saved parameters of a model at a particular time in training. Naturally, the first thing to do to set up the ensemble is to decide what snapshots to save during training. According to huang2017snapshot, the ensemble mechanism works best when “the individual models have low test error and do not overlap in the set of examples they misclassify”. Therefore, we want to save snapshots that are “different enough” while doing well individually. Our approach is as follows (where snapshots and ensembles are evaluated on the validation set):

  • For a training cycle of $N$ iterations, we evenly divide it into $M$ periods (assuming $N$ is divisible by $M$).

  • For each period $i \in \{1, \dots, M\}$, we save the snapshot $s_i$ with the highest success rate in the validation split.

  • The saved snapshots will be the candidates to build the ensemble: $\{s_1, s_2, \dots, s_M\}$.
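The three steps above can be sketched as a small selection routine. This is only an illustration under stated assumptions: validation results are given as (iteration, success rate) pairs, iterations are counted from 1, and the names `best_snapshot_per_period` and `history` are hypothetical. A real training loop would instead overwrite a checkpoint file whenever a better snapshot appears within the current period.

```python
from typing import List, Optional, Tuple

Record = Tuple[int, float]  # (training iteration, validation success rate)


def best_snapshot_per_period(
    history: List[Record], total_iters: int, num_periods: int
) -> List[Record]:
    """Pick the highest-success-rate evaluation inside each equal-length period."""
    if total_iters % num_periods != 0:
        raise ValueError("total_iters must be divisible by num_periods")
    period_len = total_iters // num_periods
    best: List[Optional[Record]] = [None] * num_periods
    for iteration, success_rate in history:
        if not 1 <= iteration <= total_iters:
            raise ValueError(f"iteration {iteration} is outside the training cycle")
        # Iterations 1..period_len fall in period 0, and so on.
        idx = (iteration - 1) // period_len
        current = best[idx]
        if current is None or success_rate > current[1]:
            best[idx] = (iteration, success_rate)
    # Periods without any evaluation contribute no candidate.
    return [record for record in best if record is not None]


# Toy validation history (hypothetical numbers), 3 periods of 4,000 iterations each.
history = [(2000, 0.55), (4000, 0.58), (6000, 0.57),
           (8000, 0.60), (10000, 0.59), (12000, 0.61)]
candidates = best_snapshot_per_period(history, total_iters=12000, num_periods=3)
print(candidates)
```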

Among the $M$ candidate snapshots $s_1, \dots, s_M$ saved this way, we conduct a beam search with beam size $l$ to construct an ensemble of maximum size $k$. The process is as follows:

  • Evaluate all possible ensembles of size 1, that is: $\{s_1\}, \{s_2\}, \dots, \{s_M\}$.

  • Keep the top-$l$ ensembles from the previous step. For each kept ensemble, evaluate all possible ensembles of size 2 that contain the kept snapshot(s). E.g., say $\{s_1\}$ is one of the top-$l$ ensembles of size 1; the ensembles of size 2 to be evaluated related to $\{s_1\}$ are: $\{s_1, s_2\}, \{s_1, s_3\}, \dots, \{s_1, s_M\}$.

  • Keep the top-$l$ ensembles of size 2 from the previous step. Then we repeat the process for size-3 ensembles, and so on. The evaluation stops when we finish evaluating the ensembles of size $k$.

  • In the end, we choose the ensemble with the highest success rate among all the ensembles evaluated during the whole process.

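With $M$ candidate snapshots, beam size $l$, and maximum ensemble size $k$, our beam search strategy needs at most $M + l(k-1)(M-1)$ evaluations, i.e., on the order of $l \cdot k \cdot M$, which is much smaller than the cost of an exhaustive search over all ensembles of up to $k$ snapshots, $\sum_{j=1}^{k} \binom{M}{j}$.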
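The beam search described above can be sketched as follows. This is a minimal sketch rather than the authors' implementation: snapshots are referred to by index, `evaluate` stands for whatever routine returns the validation success rate of an ensemble, and the toy scoring function built on `MARGINS` is a hypothetical stand-in that sums per-task margins and counts a task as solved when the sum is positive.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple

Ensemble = FrozenSet[int]  # indices of the candidate snapshots in the ensemble


def beam_search_ensemble(
    num_candidates: int,
    evaluate: Callable[[Ensemble], float],
    beam_size: int,
    max_size: int,
) -> Tuple[Ensemble, float, int]:
    """Return (best ensemble, its score, number of evaluations performed)."""
    scores: Dict[Ensemble, float] = {}
    frontier: List[Ensemble] = [frozenset([i]) for i in range(num_candidates)]
    for ensemble in frontier:
        scores[ensemble] = evaluate(ensemble)
    for _size in range(2, max_size + 1):
        # Keep the top-`beam_size` ensembles of the previous size.
        kept = sorted(frontier, key=lambda e: scores[e], reverse=True)[:beam_size]
        # Extend every kept ensemble by one snapshot it does not contain yet.
        expanded = {e | {i} for e in kept for i in range(num_candidates) if i not in e}
        frontier = sorted(expanded, key=sorted)  # deterministic order
        if not frontier:
            break
        for ensemble in frontier:
            scores[ensemble] = evaluate(ensemble)
    # Choose the best ensemble among everything evaluated, of any size.
    best = max(scores, key=lambda e: scores[e])
    return best, scores[best], len(scores)


# Hypothetical toy data: MARGINS[i][task] is snapshot i's signed margin on a task.
MARGINS = [
    [2, 2, -1, -1, 1],
    [-1, 1, 3, -1, 1],
    [1, -1, -1, 3, -1],
    [-1, -1, 1, 1, -1],
]


def toy_success_rate(ensemble: Ensemble) -> float:
    num_tasks = len(MARGINS[0])
    solved = sum(1 for t in range(num_tasks) if sum(MARGINS[i][t] for i in ensemble) > 0)
    return solved / num_tasks


best, best_score, num_evals = beam_search_ensemble(4, toy_success_rate, beam_size=2, max_size=3)
print(sorted(best), best_score, num_evals)
```

In this toy run the search evaluates 12 ensembles, compared with the 14 that an exhaustive search over sizes 1 to 3 would require; the gap widens quickly as the number of candidates grows.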

During the evaluation, the ensemble completes a navigation task as follows: at each time step, the instruction inputs and the visual inputs of the current viewpoint are sent to each snapshot in the ensemble. Each snapshot then gives its predictions on the available actions. After that, the agent sums those predictions up and takes the action with the highest summed score. At the end of the time step, each snapshot uses the action taken to update its own state. We visualize the ensemble navigation workflow in Figure 2. We do not normalize the prediction scores of the snapshots, so that each snapshot's confidence acts as the weight of its scores.

Figure 2: The workflow of the snapshot ensemble in the recurrent navigation process. The inputs are broadcast to all snapshots, and the ensemble sums their predictions to choose an action at every time step.
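One time step of the ensemble navigation can be sketched as below. This is a hedged illustration: `FixedSnapshot`, with its `predict` and `update` methods, is a hypothetical stand-in for a recurrent VLN snapshot and is not the interface of the VLNBERT code base. The toy scores show how un-normalized summation lets one confident snapshot outweigh two weakly opinionated ones.

```python
from typing import List, Sequence


class FixedSnapshot:
    """Hypothetical stand-in for a recurrent VLN snapshot with fixed scores."""

    def __init__(self, scores: Sequence[float]) -> None:
        self.scores = list(scores)
        self.taken: List[int] = []  # actions this snapshot has been told about

    def predict(self, instruction: str, views: Sequence[str]) -> List[float]:
        # A real snapshot would compute scores from its inputs and its state.
        return self.scores

    def update(self, action: int) -> None:
        # A real snapshot would update its recurrent state here.
        self.taken.append(action)


def ensemble_step(snapshots: Sequence[FixedSnapshot], instruction: str,
                  views: Sequence[str]) -> int:
    """Broadcast the inputs, sum the raw scores, act, and update every member."""
    all_scores = [s.predict(instruction, views) for s in snapshots]
    # Sum raw (un-normalized) scores per candidate action.
    summed = [sum(column) for column in zip(*all_scores)]
    action = max(range(len(summed)), key=lambda i: summed[i])
    for s in snapshots:
        s.update(action)  # every member follows the ensemble's action
    return action


members = [
    FixedSnapshot([0.2, 2.5, 0.1]),  # strongly prefers action 1
    FixedSnapshot([1.0, 0.4, 0.9]),  # mildly prefers action 0
    FixedSnapshot([1.1, 0.3, 0.8]),  # mildly prefers action 0
]
action = ensemble_step(members, "walk to the kitchen", ["view_a", "view_b", "view_c"])
print(action)
```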

4.3 A past-action-aware VLNBERT model

We saw a significant improvement in the snapshot ensemble of the original VLNBERT model, as shown in Table 2. Still, we could improve the performance of the ensemble by adding more variant snapshots. To do that, we modify the VLNBERT model and combine the snapshots of the original and the modified model.

Our modification is based on two ideas to improve the model: adding the state features of the past time steps and regularizing the attention scores between the state features and the words in the instruction, based on our observation of the OSCAR-init VLNBERT model in hong2021vln. The modification is visualized by the blue parts in Figure 3.

At the beginning of each time step, we add a copy of the cross-modal matching output from the last time step to a history vector, so that the history vector holds the state features of all past time steps, re-indexed in order. We then concatenate the history vector with the current state and candidate features as one large vector and pass them through the cross + self-attention module. Note that we do not update the features in the history vector during the attention computation. As a result, we will not only have the attention scores from the current state feature to each word feature in the instruction, but also those from the past state features in the last layer outputs.

For each set of attention scores from to each word in the instruction where , we compute an “attention regularization” loss defined as follows:

“MSE” stands for Mean-Squared-Error, computed against the “ground truth” values for the normalized attention scores. The ground truth is computed based on the sub-instruction annotation from the Fine-Grained R2R dataset (FGR2R). Concretely, the FGR2R dataset divides each instruction in the R2R dataset into a set of ordered sub-instructions, where n is the number of sub-instructions the original instruction consists of. Each sub-instruction corresponds to one or a sequence of viewpoints in the ground-truth path. To compute the ground truth, we first build a map from each viewpoint in the ground-truth path to a specific sub-instruction. The map function is very straightforward: we choose the first sub-instruction that corresponds to the viewpoint as the mapped sub-instruction. By doing so, each viewpoint in the ground-truth path now has its own related sub-instruction. We then compute the ground truth by the following steps:

  • Find the viewpoint where the agent stands at the current time step. If this viewpoint is not in the ground-truth path, we choose the viewpoint in the ground-truth path that is closest to it as the new viewpoint.

  • Since every has its mapped , we compute each by:

We compute each and the total loss becomes:

is a hyper-parameter and is the total time steps.

The added history vector provides additional information to the model during action prediction. In addition, the attention regularization forces the VLNBERT model to align attention scores to words that correspond to the agent’s actions without a drop in performance (success rate). The visualization of how the attention scores change as the agent moves along its path is given in Appendix C.

Figure 3: The structure of the past-action-aware VLNBERT model. The blue parts are the modification added to the original structure. A history vector is added to keep track of the past state features. An attention regularization loss is added to regularize the attention scores between the state feature at each time step and the relevant words in the instruction (visualization is given in Appendix C).

Model | val_unseen TL | val_unseen NE | val_unseen SR | val_unseen SPL | test TL | test NE | test SR | test SPL
Random | 9.77 | 9.23 | 16 | - | 9.89 | 9.79 | 13 | 12
Human | - | - | - | - | 11.85 | 1.61 | 86 | 76
Seq2Seq-SF anderson2018vision | 8.39 | 7.81 | 22 | - | 8.13 | 7.85 | 20 | 18
Speaker-Follower fried2018speaker | - | 6.62 | 35 | - | 14.82 | 6.62 | 35 | 28
PRESS li2019robust | 10.36 | 5.28 | 49 | 45 | 10.77 | 5.49 | 49 | 45
EnvDrop tan2019learning | 10.7 | 5.22 | 52 | 48 | 11.66 | 5.23 | 51 | 47
AuxRN Zhu_2020_CVPR | - | 5.28 | 55 | 50 | - | 5.15 | 55 | 51
PREVALENT hao2020towards | 10.19 | 4.71 | 58 | 53 | 10.51 | 5.3 | 54 | 51
RelGraph hong2020language | 9.99 | 4.73 | 57 | 53 | 10.29 | 4.75 | 55 | 52
VLNBERT hong2021vln | 12.01 | 3.93 | 63 | 57 | 12.35 | 4.09 | 63 | 57
VLNBERT + REM liu2021vision | 12.44 | 3.89 | 63.6 | 57.9 | 13.11 | 3.87 | 65.2 | 59.1
Past-action-aware VLNBERT (ours) | 13.2 | 3.88 | 63.47 | 56.27 | 13.86 | 4.11 | 62.49 | 56.11
VLNBERT Original Snapshot Ensemble (ours) | 11.79 | 3.75 | 65.55 | 59.2 | 12.41 | 4 | 64.22 | 58.96
VLNBERT Past-action-aware Snapshot Ensemble (ours) | 12.35 | 3.72 | 65.26 | 58.65 | 13.19 | 3.93 | 64.65 | 58.78
VLNBERT Mixed Snapshot Ensemble (ours) | 12.05 | 3.63 | 66.67 | 60.16 | 12.71 | 3.82 | 65.11 | 59.61
Table 2: The evaluation results for our snapshot ensemble and past-action-aware VLNBERT models on the R2R val_unseen and test splits. Our mixed snapshot ensemble achieves the new SOTA performance in NE and SPL on the test split, and is only 0.09% worse in SR than liu2021vision, which uses further data augmentation.

Model | val_unseen_half TL | val_unseen_half NE | val_unseen_half SR | val_unseen_half SPL | val_unseen_full TL | val_unseen_full NE | val_unseen_full SR | val_unseen_full SPL
Speaker-Follower | - | - | - | - | 19.9 | 8.47 | 23.8 | 12.2
EnvDrop | - | - | - | - | - | 9.18 | 34.7 | 21
VLNBERT + REM liu2021vision | - | - | - | - | - | 6.21 | 46 | 38.1
VLNBERT | 13.76 | 7.05 | 37.29 | 27.38 | 13.92 | 6.55 | 43.11 | 32.13
VLNBERT Original Snapshot Ensemble (ours) | 15.09 | 7.03 | 39 | 28.66 | 14.71 | 6.44 | 44.55 | 33.45
Table 3: Evaluation results on the R4R dataset. We also present the evaluation result of the full split with our constructed ensemble. The model improves over the original VLNBERT model after applying the snapshot ensemble. Note that liu2021vision uses further data augmentation, which is orthogonal to our approach. At the time of writing, they have yet to release their code or dataset, which might improve our performance further.

Model | val_unseen TL | val_unseen NE | val_unseen SR | val_unseen SPL | test TL | test NE | test SR | test SPL
EnvDrop | 10.7 | 5.22 | 52 | 48 | 11.66 | 5.23 | 51 | 47
EnvDrop Snapshot Ensemble | 11.74 | 4.9 | 53.34 | 49.49 | 11.9 | 4.98 | 53.58 | 50.01
OSCAR-init VLNBERT | 11.86 | 4.29 | 59 | 53 | 12.34 | 4.59 | 57 | 53
OSCAR-init VLNBERT Snapshot Ensemble | 11.22 | 4.21 | 59.73 | 54.76 | 11.74 | 4.36 | 59.72 | 55.35
Table 4: The evaluation results of the snapshot ensembles of EnvDrop and OSCAR-init VLNBERT on the R2R val_unseen and test splits. Both models are consistently improved by the snapshot ensemble method.

5 Experiment

We run the following experiments to evaluate the performances of snapshot ensembles in different models and datasets:

  • We evaluate the performance of snapshot ensemble on the R2R dataset, including the ensemble built from the original model snapshots and the ensemble built from both the original and the modified model snapshots.

  • We also apply snapshot ensemble on the OSCAR-init VLNBERT model from hong2021vln and the Env-Drop model from tan2019learning. We evaluate and compare their ensemble performances on the R2R dataset against their best single snapshot.

  • We evaluate the performance of snapshot ensemble on the R4R dataset, which is a larger VLN dataset than R2R and with more complicated navigation paths.

5.1 Dataset Setting and Evaluation Metrics

We use the R2R train split as training data, val_unseen split as validation data, and test split to evaluate the ensemble. For the R4R dataset, we also use the train split as the training data. As there’s no test split in the R4R dataset, we divide the val_unseen split into two halves. The two halves do not share scenes in common. We construct the snapshot ensemble on one half and evaluate it on the other half.

We adopt four metrics for evaluation: Success Rate (SR), Trajectory Length (TL), Navigation Error (NE), and Success weighted by Path Length (SPL). SR is the ratio of successful navigations to all navigations (higher is better). TL is the average length of the model’s navigation path (lower is better). NE is the average distance between the last viewpoint in the predicted path and the ground-truth destination viewpoint (lower is better). SPL is the success rate weighted by the ratio of the shortest path length to the length of the path actually taken, so a successful but unnecessarily long navigation counts for less than it does in SR (higher is better).
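The four metrics can be computed as in the sketch below. This is an illustration, not the official evaluation script: the `Episode` container and the function name `navigation_metrics` are hypothetical, the success test uses a strict "less than 3 meters" comparison as an assumption, and SPL follows the standard definition, the mean over episodes of success times shortest-path length divided by the larger of the taken and shortest path lengths.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Episode:
    path_length: float      # length of the agent's trajectory, in meters
    shortest_length: float  # shortest-path length from start to destination
    final_distance: float   # distance from the last viewpoint to the destination


def navigation_metrics(episodes: List[Episode], success_radius: float = 3.0) -> Dict[str, float]:
    """Compute TL, NE, SR and SPL over a list of finished navigations."""
    if not episodes:
        raise ValueError("need at least one episode")
    n = len(episodes)
    successes = 0
    spl_sum = 0.0
    for ep in episodes:
        if ep.final_distance < success_radius:  # assumption: strict inequality
            successes += 1
            longest = max(ep.path_length, ep.shortest_length)
            # Degenerate zero-length episodes contribute nothing (assumption).
            if longest > 0:
                spl_sum += ep.shortest_length / longest
    return {
        "TL": sum(ep.path_length for ep in episodes) / n,
        "NE": sum(ep.final_distance for ep in episodes) / n,
        "SR": successes / n,
        "SPL": spl_sum / n,
    }


# Toy episodes (hypothetical): two successes, one of them with a detour.
episodes = [
    Episode(path_length=10.0, shortest_length=10.0, final_distance=1.0),
    Episode(path_length=20.0, shortest_length=10.0, final_distance=2.0),
    Episode(path_length=8.0, shortest_length=12.0, final_distance=6.0),
    Episode(path_length=12.0, shortest_length=12.0, final_distance=5.0),
]
metrics = navigation_metrics(episodes)
print(metrics)
```

In the toy data SR is 0.5 while SPL is 0.375, because the second successful navigation took a path twice as long as the shortest one.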

5.2 Training Setting and Hard/Software Setup

We train the VLNBERT and the OSCAR-init VLNBERT model with the default 300,000 iterations. We run an ablation study to decide for constructing the ensemble (the candidate number for beam search is when mixing the original and modified models. In R4R, we set to shorten the evaluation time). Appendix D describes the ablation study result. For other parameters, we use the default given by the authors. (Footnote: We do not adopt the cyclic learning rate schedule loshchilov2016sgdr suggested in huang2017snapshot that forces the model to generate local minima. Our trial experiments show no significant improvement from doing so in this task.)

We set the pseudo-random seed to 0 for the training process. We run the training code on Ubuntu 20.04.1 LTS with a GeForce RTX 3090 graphics card with 24 GB of memory. It takes around 10,000 MB of graphics card memory to evaluate an ensemble of 4 snapshots with a batch size of 8. The code is developed with PyTorch 1.7.1 and CUDA 11.2. The training takes approximately 30 - 40 hours to finish. The beam search evaluation takes 3 - 5 hours.

6 Results

6.1 Evaluation Results of our proposed methods on R2R dataset

We evaluate our modification of the VLNBERT model and the snapshot ensembles of original-only and mixed (original and modified) snapshots on the R2R test split. Table 2 shows the evaluation results of our model and ensembles. The past-action-aware modified model has a similar performance to the original model. Both snapshot ensembles significantly improve the performance of the model in NE, SR, and SPL. Specifically, the mixed snapshot ensemble achieves the new SOTA in NE and SPL while being only 0.09% worse in SR than VLNBERT + REM liu2021vision, which uses further data augmentation. At the time of writing, they have yet to release their code or dataset, which may improve our results further.

Additionally, we evaluate whether the snapshot ensemble also improves different VLN models. We train an OSCAR-init VLNBERT hong2021vln, which is a BERT-structure model, and an EnvDrop model tan2019learning, which has an LSTM plus soft-attention structure, with their default training settings. We apply the snapshot ensemble to both and compare each ensemble’s performance with the best single snapshot on the R2R test split. Table 4 shows the evaluation results. Both ensembles gain an increase of more than 2% in SR and SPL over the best snapshot of the model. That suggests the snapshot ensemble is able to improve the performance of other VLN models as well.

6.2 Evaluation Results of Snapshot Ensemble on R4R dataset

In addition to the R2R dataset, we evaluate the snapshot ensemble method on a more challenging dataset, R4R jain2019stay. The R4R dataset contains more navigation data and more complicated paths of more varied lengths. Table 3 compares the best snapshot and the snapshot ensemble. We see an increase of more than 1% in SR and SPL after applying the snapshot ensemble. We also evaluate the same snapshot ensemble on the full val_unseen split of R4R, as shown in Table 3.

7 Discussion

In this section, we discuss potential reasons why the snapshot ensemble works well in the VLN task. Additionally, we provide a case study in appendix E that qualitatively analyzes our proposed snapshot ensemble.

7.1 Ensemble is More Similar to Its Snapshots

Figure 4: The Venn diagram on val_unseen that counts the navigations failed by one or more snapshots. The numbers outside the circles are navigations in which all 3 snapshots succeed. The numbers in parentheses are the counts when snapshot 2 is replaced by the ensemble.

To find out if the snapshot ensemble leverages the predictions of its snapshots, we analyze the errors that it makes in comparison to those of its snapshots (similar to our analysis in Section 4.1), counting the distinctive failures of the snapshots and the ensemble on the validation set. We choose an ensemble of 3 snapshots with SR 65.43% obtained via beam search to visualize the shared failure cases among the three snapshots and the failure cases shared between snapshots 1, 3, and the ensemble (replacing snapshot 2). We draw the corresponding Venn diagram with and without the replacement, as shown in Figure 4. In comparison to snapshot 2, the ensemble shares more navigations that succeed or fail in common with the other two snapshots (529 + 1086 > 477 + 988). Meanwhile, the number of navigations failed only by the ensemble is smaller than that of snapshot 2. These changes suggest that the ensemble behaves more similarly to its snapshot members than the replaced snapshot does. We repeat this process by replacing snapshots 1 and 3 and obtain similar results.

To understand the benefits of the ensemble acting more similarly to its snapshots, we count the successful navigations of the ensemble/snapshots in each scene. Table 5 shows the result. We find that each snapshot is good at different scenes. The ensemble either outperforms its snapshots or is comparable to the best snapshot in most scenes suggesting that the ensemble leverages the advantages of snapshots in different scenes to achieve better performance.

7.2 Ensemble Avoids Long Navigations

In our setting, the system forces the agent to stop when the number of actions it has taken exceeds 15. We call navigations that take the agent 15 or more actions to complete Long Navigations (LNs). As a ground-truth navigation only needs 5 - 7 actions, we wonder if LNs are harmful to the model performance. In Table 6, we count the LNs for the snapshots and the ensemble discussed in Section 7.1 and compute how often their LNs fail. We discover that 5.5% to 10.5% of the agents’ navigations are LNs. Meanwhile, an LN has a high likelihood (> 90%) of failing. As the ensemble has far fewer LNs (131) than any of its snapshots, we consider avoiding LNs one of the reasons why the ensemble outperforms single snapshots.
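The LN statistics can be reproduced with a short routine such as the one below. It is a sketch under stated assumptions: per-navigation action counts and success flags are given as two parallel lists, and the function name `long_navigation_stats` and the toy lists are hypothetical. The second part simply recomputes the failure percentages from the LN counts and failed-LN counts reported in Table 6.

```python
from typing import Dict, List, Tuple


def long_navigation_stats(
    num_actions: List[int], succeeded: List[bool], limit: int = 15
) -> Tuple[int, int]:
    """Return (number of long navigations, number of those that failed)."""
    if len(num_actions) != len(succeeded):
        raise ValueError("inputs must have the same length")
    # A navigation is "long" when it takes `limit` or more actions.
    long_flags = [n >= limit for n in num_actions]
    ln_count = sum(long_flags)
    ln_failed = sum(1 for is_long, ok in zip(long_flags, succeeded) if is_long and not ok)
    return ln_count, ln_failed


# Toy data (hypothetical): two of the four navigations reach the action limit.
count, failed = long_navigation_stats([6, 15, 7, 15], [True, False, True, True])

# Failure percentages recomputed from the (LN count, failed LN) pairs in Table 6.
table6: Dict[str, Tuple[int, int]] = {
    "Snapshot 1": (172, 159),
    "Snapshot 2": (155, 141),
    "Snapshot 3": (246, 223),
    "Ensemble": (131, 123),
}
failure_pct = {name: round(100 * f / c, 2) for name, (c, f) in table6.items()}
print(count, failed, failure_pct)
```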

Scene | Ensemble | Snapshot 1 | Snapshot 2 | Snapshot 3
1 | 178 | 165 | 169 | 159
2 | 32 | 33 | 32 | 29
3 | 140 | 131 | 131 | 144
4 | 208 | 189 | 199 | 185
5 | 10 | 11 | 8 | 9
6 | 169 | 161 | 170 | 152
7 | 203 | 198 | 200 | 196
8 | 217 | 205 | 204 | 212
9 | 93 | 80 | 89 | 84
10 | 102 | 95 | 89 | 89
11 | 185 | 177 | 173 | 181
Table 5: The count of successful navigations for the ensemble and its snapshots in each scene on val_unseen split.

Model | SR (%) | LN Count | LN that fails (%)
Snapshot 1 | 61.52 | 172 | 159 (92.44%)
Snapshot 2 | 62.32 | 155 | 141 (90.97%)
Snapshot 3 | 61.3 | 246 | 223 (90.65%)
Ensemble | 65.43 | 131 | 123 (93.89%)
Table 6: Long navigations (LNs) for the ensemble and snapshots. The ensemble has fewer long navigations (131) than any of its snapshot members.

8 Conclusion

In this paper, we discover differences in snapshots of the same VLN model. We apply the snapshot ensemble method that leverages the behaviors of multiple snapshots. By combining snapshots of the VLNBERT model and its past-action-aware modification that we propose, we achieve a new SOTA performance on the R2R dataset. We also show that our snapshot ensemble method works with different models and on more complicated VLN tasks. In the future, we will train the model with augmented data from liu2021vision and see if it improves performance. We will also apply our approach to pre-explore and beam search settings to see if ensemble methods can improve performance on these settings.