1 Introduction
Given a textual instruction and visual inputs describing the surrounding environment, a Vision-and-Language Navigation (VLN) system controls an agent to complete the goals listed in the instruction. Building a good VLN system is difficult because it must understand both vision and language information and coordinate them well.
Recent advancements in computer vision and natural language processing, the advent of better vision-language models sundermeyer2012lstm; vaswani2017attention; lu2019vilbert; tan2019lxmert, and the effort to prepare large-scale realistic datasets such as Matterport3D have enabled rapid development of VLN systems. Among current VLN datasets, the R2R dataset (Anderson et al. 2018) is based on real photos taken in indoor environments. It attracts massive attention for its simple task format, which nevertheless requires complex understanding of both images and texts. To obtain better performance on R2R, various studies have discussed how to adapt the best vision/language models of the time to the R2R VLN task anderson2018vision; majumdar2020improving; hong2021vln. Previous studies have also made efforts to prevent overfitting caused by the limited size of the R2R dataset fried2018speaker; liu2021vision; li2019robust; hao2020towards.
In this paper, we offer a new perspective for analyzing R2R VLN models that focuses on a by-product of the model training process: snapshots. Snapshots are the parameters of a model saved at various intervals during training. Although all snapshots share the same objective, their parameters differ because of the ongoing optimization. We discover that some of the best snapshots saved at various intervals during training share similar navigation success rates while making significantly different errors. Based on this observation, we construct our VLN system with an ensemble of snapshots instead of a single one. Through experiments, we find that such an ensemble leverages the strengths of its members and thus significantly improves navigation performance.
In addition, to allow more model variants in the ensemble, we propose a novel modification of an existing state-of-the-art (SOTA) model for VLN, the VLNBERT hong2021vln. Our ensemble, which consists of snapshots of both the VLNBERT model and our proposed modification (the past-action-aware VLNBERT model), achieves a new SOTA performance in the single-run setting of the R2R dataset.
To conclude, our contributions are as follows:
- We discover that the best snapshots of the same model behave differently while having similar navigation success rates. Based on this observation, we propose a snapshot ensemble method to take advantage of the different snapshots.
- We also propose a past-action-aware modification of the current best VLN model, the VLNBERT. It provides snapshots that differ from those of the original model while achieving equivalent navigation performance.
- By combining the snapshots of the original and the modified model, our ensemble achieves a new SOTA performance on the R2R challenge leaderboard in the single-run setting. (Our method is listed as “SE-Mixed (Single-Run)” on the leaderboard webpage: https://eval.ai/web/challenges/challenge-page/97/leaderboard)
- We evaluate the snapshot ensemble method on two different datasets and apply it to two other VLN models. The evaluation results show that the snapshot ensemble also improves performance on more complicated VLN tasks and with different model architectures.
2 Related Works
2.1 Vision-and-language Navigation datasets
Teaching a robot to follow instructions is a long-standing goal in the AI community winograd1971procedures. Compared to GPS-based navigation, VLN accepts the surrounding environment as visual input and correlates it with instructions in human language. Most past VLN datasets are based on synthesized 3-D scenes kolve2017ai2; brodeur2017home; wu2018building; yan2018chalet; Song_2017_CVPR. Recently, the emergence of data based on real-life scenarios allows VLN systems to be developed and tested in realistic environments. Specifically, 3-D views from Google Street View (https://www.google.com/streetview/) and the Matterport3D dataset (Chang et al. 2017) allow people to build simulators that generate navigation data from photos taken in real life. Different from the earlier datasets, the R2R dataset that we use consists of navigations in real indoor environments. Concretely, the R2R dataset provides 15,000 instructions and 5,000 navigation paths in 90 indoor scenes. Since its construction, people have proposed variants of the R2R dataset to address certain shortcomings of the original one ku2020room; jain2019stay; hong2020sub; krantz2020beyond. However, the community still considers the R2R dataset a necessary test for evaluating all kinds of VLN systems for indoor navigation.
2.2 VLN systems for navigation in R2R dataset
To improve navigation performance on the R2R dataset, various models and techniques have been proposed. fried2018speaker and tan2019learning further developed the LSTM sundermeyer2012lstm + soft-attention luong2015effective baseline system anderson2018vision. majumdar2020improving proposed a VLN system based on ViLBERT lu2019vilbert to replace the LSTM + soft-attention architecture for better image and text understanding. Recently, chen2021topological; Wang_2021_CVPR; hong2020language proposed VLN systems based on graph models. In terms of techniques, fried2018speaker built a speaker model for data augmentation; ma2019regretful; ma2019self introduced regularization losses and back-tracking; tan2019learning improved the dropout mechanism in its VLN model; li2019robust; hao2020towards improved the models' initial states by pre-training them on large-scale datasets; and hong2021vln developed a recurrent VLN model (VLNBERT) based on the BERT structure for the single-run setting in VLN. liu2021vision provides further data augmentation by splitting and mixing scenes.
The previous work whose idea is closest to ours is hu2019you, which proposed a mixture of VLN models. However, each of their models is trained with different inputs. In this paper, we instead build an ensemble from snapshots of the same model.
2.3 Ensemble
The concept of applying ensembles to neural network models appeared early in the machine learning community hansen1990neural. Well-known ensemble techniques include bagging breiman1996bagging; ho1995random and boosting (AdaBoost) freund1997decision. However, applying such ensembles to deep learning models directly is very time-consuming. Previous works have provided ensemble-like solutions for deep learning models xie2013horizontal; moghimi2016boosted; laine2016temporal; french2017self. Our work is inspired by the idea of the "snapshot ensemble" from huang2017snapshot, which constructs the ensemble from a set of snapshots collected at local minima. Different from that work, we collect snapshots based on training intervals and success rates. We also apply beam search to optimize the combination of snapshots included in the ensemble.
3 Preliminaries
3.1 Vision-and-language Navigation in R2R dataset
Navigation in R2R consists of three parts: an instruction $I$, a scene $S$, and a path $P$. The instruction is a sequence of words from the vocabulary $V$: $I = (w_1, w_2, \dots, w_L)$, $w_i \in V$. The scene is a connected graph $S = (V_S, E_S)$ that contains viewpoints $V_S$ and the edges $E_S$ that connect them. For a viewpoint $v$ where the agent stands, there is a panoramic view $O_v$ that describes the visual surroundings of the agent. More precisely, $O_v$ is a set of 36 views that a camera captures at viewpoint $v$ from different horizontal and vertical directions. A viewpoint $v$ is connected ("navigable") to another viewpoint $v'$ when one can walk directly from $v$ to $v'$ in the real environment that $S$ represents. The path is a sequence of viewpoints that starts from the starting viewpoint and ends at the destination viewpoint: $P = (p_1, p_2, \dots, p_n)$. $p_1$ is the initial position of the agent.
A VLN model for the R2R navigation task works as a policy function that takes the instruction and the panoramic view of the current viewpoint as inputs: $a_t = \pi(I, O_{v_t})$. At each time step $t$, the policy function predicts an action $a_t$ and tries to end up as close as possible to the ground-truth destination $p_n$ of $P$.
After the agent chooses to stop or the number of its actions exceeds a limit, its last viewpoint $v_{\text{last}}$ is evaluated. If $v_{\text{last}}$ is within 3 meters of the destination $p_n$ of the ground-truth path $P$, the navigation is considered successful, and failed otherwise.
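To make the notation concrete, the following is a minimal Python sketch of the episode data just described; the class and field names (`Scene`, `Episode`, the `dist` lookup) are illustrative assumptions and not the Matterport3D simulator's API.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class Scene:
    """Connectivity graph of one environment: viewpoints and navigable edges."""
    viewpoints: List[str]
    edges: Dict[str, List[str]]     # viewpoint id -> directly navigable neighbors
    views: Dict[str, List[str]]     # viewpoint id -> 36 panoramic view images (paths)

@dataclass
class Episode:
    instruction: List[str]          # word sequence drawn from the vocabulary
    scene: Scene
    path: List[str]                 # ground-truth viewpoint sequence p_1 ... p_n

def is_success(last_viewpoint: str, episode: Episode,
               dist: Dict[Tuple[str, str], float]) -> bool:
    """A navigation succeeds if the agent stops within 3 m of the goal viewpoint."""
    goal = episode.path[-1]
    return dist[(last_viewpoint, goal)] <= 3.0
```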
There are three different settings for the VLN task in R2R: single-run, pre-explore, and beam search. In this paper, we focus on the single-run setting, which requires the agent to finish the navigation with a minimum number of actions and without prior knowledge of the environment.
3.2 VLNBERT model
We apply our modification and snapshot ensemble to the VLNBERT model proposed by hong2021vln. (hong2021vln proposed two VLNBERT models in their work. The VLNBERT model used here is the LXMERT-based one pre-trained by PREVALENT hao2020towards. We call the other one, which is BERT-based and pre-trained by OSCAR li2020oscar, the OSCAR-init VLNBERT to distinguish it from the PREVALENT-initialized one.) The model currently holds the best performance for the single-run setting on the R2R dataset liu2021vision. In this section, we briefly recap this model. A simplified visualization of the model structure is shown in Figure 1.
Before computing action predictions, the model selects a set of candidate views from $O_{v_t}$. The VLNBERT model then projects the candidate views and the instruction into the same feature space; we discuss this process in detail in Appendix A. Eventually, we have a vector of instruction features $X$ and a vector of candidate action features $C_t$ as inputs to the action prediction. At the first time step, $X$ is sent to a 9-layer self-attention module, so the word features are attended to the state feature $s_t$. The model then appends $s_t$ from $X$ to $C_t$. After that, a cross-attention sub-module attends the remaining elements of $X$ to both $s_t$ and $C_t$. Lastly, another sub-module computes the self-attention of the instruction-attended $[s_t; C_t]$. Such cross and self sub-modules build up the cross + self-attention module in Figure 1. The process repeats for four layers, and the attention scores between $s_t$ and each element of $C_t$ in the last layer are the prediction scores of the corresponding actions. Additionally, the $s_t$ in the output is sent to a cross-modal-matching module, whose output is used as $s_{t+1}$ in the next time step while the other features in $X$ remain unchanged. The cross- and self-attention computation is repeated to compute action predictions for the remaining time steps.
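The recurrent decision loop described above can be summarized by the following hypothetical sketch; the module names (`language_bert`, `cross_self_attention`, `cross_modal_matching`), the STOP index, and the shapes are assumptions for illustration, not the released VLNBERT implementation.

```python
import torch

def navigate(model, instruction_tokens, get_candidates, max_steps=15):
    """Hypothetical outline of the recurrent VLNBERT loop described in Section 3.2."""
    # One-time language encoding: word features attended with the state token.
    X = model.language_bert(instruction_tokens)        # (L, hidden)
    state = X[0]                                       # state feature s_t, taken from X

    trajectory = []
    for t in range(max_steps):
        C_t = get_candidates(t)                        # candidate action features
        # Cross + self attention over [state; C_t] and X; the state-to-candidate
        # attention scores of the last layer act as action logits.
        logits, attended_state = model.cross_self_attention(state, C_t, X)
        action = int(torch.argmax(logits))
        trajectory.append(action)
        if action == 0:                                # assume index 0 means STOP
            break
        # Cross-modal matching produces the recurrent state for the next step.
        state = model.cross_modal_matching(attended_state, C_t[action])
    return trajectory
```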
The VLNBERT model minimizes the sum of an imitation learning loss and a reinforcement learning loss:

$$\mathcal{L} = \lambda \sum_{t} -a_t^{*} \log(p_t) \;+\; \sum_{t} -a_t^{s} \log(p_t)\, A_t$$

where $a_t^{*}$ is the teacher action (the one-hot encoded action that gets closest to the destination), $p_t$ is the probability of the taken action, $a_t^{s}$ is the action taken, and $A_t$ is the advantage value at time step $t$, computed by the A2C algorithm mnih2016asynchronous. $\lambda$ is a hyper-parameter that balances the weights of the imitation learning loss and the reinforcement learning loss.
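As a rough illustration of this objective (not the authors' exact implementation), the two terms can be combined as below; the tensor shapes and the weight name `ml_weight` are assumptions.

```python
import torch
import torch.nn.functional as F

def mixed_loss(logits, teacher_actions, taken_actions, advantages, ml_weight=0.2):
    """Sketch of the IL + RL objective described above (shapes and weight assumed).

    logits:          (T, num_candidates) action prediction scores per time step
    teacher_actions: (T,) indices of the teacher (shortest-path) actions
    taken_actions:   (T,) indices of the actions actually taken by the agent
    advantages:      (T,) A2C advantage values, treated as constants here
    """
    log_probs = F.log_softmax(logits, dim=-1)

    # Imitation learning: cross-entropy against the teacher action.
    il_loss = F.nll_loss(log_probs, teacher_actions)

    # Reinforcement learning: policy gradient weighted by the advantage.
    taken_log_probs = log_probs.gather(1, taken_actions.unsqueeze(1)).squeeze(1)
    rl_loss = -(taken_log_probs * advantages.detach()).mean()

    return ml_weight * il_loss + rl_loss
```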
4 Proposed Method
4.1 Differences of Snapshots in the Same VLN model
Like other machine learning models, VLNBERT uses validation to choose the best snapshot to represent the trained model. We train the VLNBERT model and observe the validation success rates, measured on the val_unseen split of R2R, of the snapshots saved during training. As shown in Appendix B, Figure 1, the success rate fluctuates drastically over time. This fluctuation is, however, not seen in the training loss: as shown in Appendix B, Figures 2 and 3, both the imitation and the reinforcement learning losses drop consistently over time. This observation leads us to investigate whether snapshots that perform similarly in terms of success rate might behave differently with respect to the errors they make.
We set up an experiment as follows: we train the VLNBERT model for 300,000 iterations and save the best snapshot on the validation split for every 30,000 iterations. The top-5 snapshots among them are shown in Table 1. We choose the best two snapshots, namely the snapshots with 62.32% and 61.60% success rates. We then count the navigations for which only one of the snapshots failed, both snapshots failed, or neither failed. Our result shows that 563 navigations ended with different results between the best and the second-best snapshots, approximately 24% of the validation data. In comparison, the difference in their success rates is only 0.72%. The gap between 24% and 0.72% suggests that the agents of the two snapshots have different navigation behaviors even though their success rates are almost equal. Naturally, we wonder whether we could leverage both of their behaviors and thus create a better agent. One effective technique for this is the snapshot ensemble, which we discuss in the next section.
Snapshot Period | Success Rate in val_unseen Split |
---|---|
90K - 120K | 62.32% |
240K - 270K | 61.60% |
210K - 240K | 61.56% |
60K - 90K | 61.52% |
180K - 210K | 61.30% |
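The comparison above boils down to a set comparison of per-episode outcomes; a small sketch, assuming each snapshot's results are given as a dict from episode id to success:

```python
def compare_snapshots(results_a, results_b):
    """results_*: dict mapping episode id -> True (success) / False (failure)."""
    both_fail = only_a_fails = only_b_fails = neither_fails = 0
    for ep_id, ok_a in results_a.items():
        ok_b = results_b[ep_id]
        if ok_a and ok_b:
            neither_fails += 1
        elif not ok_a and not ok_b:
            both_fail += 1
        elif ok_b:
            only_a_fails += 1
        else:
            only_b_fails += 1
    return {"both_fail": both_fail, "neither_fails": neither_fails,
            "only_a_fails": only_a_fails, "only_b_fails": only_b_fails,
            "differing": only_a_fails + only_b_fails}
```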
4.2 Snapshot Ensemble for VLNBERT models
A snapshot is a set of model parameters saved at a particular time during training. Naturally, the first thing to do when setting up the ensemble is to decide which snapshots to save during training. According to huang2017snapshot, an ensemble works best when "the individual models have low test error and do not overlap in the set of examples they misclassify". Therefore, we want to save snapshots that are "different enough" while doing well individually. Our approach is as follows (snapshots and ensembles are evaluated on the validation set):
- For a training cycle of $T$ iterations, we evenly divide it into $N$ periods (assuming $T$ is divisible by $N$).
- For each period, we save the snapshot with the highest success rate on the validation split.
- The $N$ saved snapshots become the candidates for building the ensemble: $C = \{c_1, c_2, \dots, c_N\}$ (a code sketch of this procedure follows the list).
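A minimal sketch of this period-wise snapshot collection, assuming helper functions `train_one_iter` and `validate` and an arbitrary validation interval:

```python
import copy

def train_and_collect_snapshots(model, train_one_iter, validate,
                                total_iters=300_000, num_periods=10,
                                val_every=2_000):
    """Keep the best-validation snapshot of each training period (sketch).

    train_one_iter, validate, and the interval values are assumptions about the
    surrounding training code, not the released implementation.
    """
    period_len = total_iters // num_periods        # assumes divisibility
    candidates, best_sr, best_state = [], -1.0, None

    for it in range(1, total_iters + 1):
        train_one_iter(model)
        if it % val_every == 0:
            sr = validate(model)                   # success rate on val_unseen
            if sr > best_sr:
                best_sr, best_state = sr, copy.deepcopy(model.state_dict())
        if it % period_len == 0:                   # end of a period
            candidates.append(best_state)
            best_sr, best_state = -1.0, None
    return candidates
```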
Among the $N$ candidate snapshots saved this way, we conduct a beam search with beam size $b$ to construct an ensemble of maximum size $K$. The process is as follows:
- Evaluate all possible ensembles of size 1, that is, $\{c_1\}, \{c_2\}, \dots, \{c_N\}$.
- Keep the top-$b$ ensembles from the previous step. For each kept ensemble, evaluate all possible ensembles of size 2 that contain the kept snapshot(s). E.g., if $\{c_1\}$ is one of the top-$b$ ensembles of size 1, the size-2 ensembles evaluated for it are $\{c_1, c_2\}, \{c_1, c_3\}, \dots, \{c_1, c_N\}$.
- Keep the top-$b$ ensembles of size 2 from the previous step, then repeat the process for size-3 ensembles, and so forth. The evaluation stops when we finish evaluating the ensembles of size $K$.
- In the end, we choose the ensemble with the highest success rate among all the ensembles evaluated during the whole process.
The number of evaluations needed for our beam search strategy is on the order of $K \cdot b \cdot N$ when $b \ll N$, which is much smaller than the cost of an exhaustive search over all $\sum_{k=1}^{K} \binom{N}{k}$ possible ensembles. A code sketch of this search procedure is given below.
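A sketch of the beam search, under the assumption that `evaluate_ensemble` returns the validation success rate of a given subset of snapshots (identified here by sortable ids such as file paths):

```python
def beam_search_ensembles(candidates, evaluate_ensemble, beam_size=3, max_size=4):
    """Beam search over snapshot subsets (sketch); interfaces are assumed."""
    best_subset, best_sr = None, -1.0

    # Size-1 ensembles: evaluate every candidate on its own.
    scored = [(evaluate_ensemble([c]), (c,)) for c in candidates]
    scored.sort(key=lambda x: x[0], reverse=True)
    beam = scored[:beam_size]
    if beam and beam[0][0] > best_sr:
        best_sr, best_subset = beam[0]

    # Grow the kept ensembles one snapshot at a time up to max_size.
    for _ in range(2, max_size + 1):
        expansions, seen = [], set()
        for _, subset in beam:
            for c in candidates:
                if c in subset:
                    continue
                new_subset = tuple(sorted(subset + (c,)))
                if new_subset in seen:          # skip subsets reached twice
                    continue
                seen.add(new_subset)
                expansions.append((evaluate_ensemble(list(new_subset)), new_subset))
        if not expansions:
            break
        expansions.sort(key=lambda x: x[0], reverse=True)
        beam = expansions[:beam_size]
        if beam[0][0] > best_sr:                # track the best ensemble seen so far
            best_sr, best_subset = beam[0]

    return list(best_subset), best_sr
```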
During evaluation, the ensemble completes a navigation task as follows: at each time step, the instruction inputs and the visual inputs of the current viewpoint are sent to each snapshot in the ensemble. Each snapshot then gives its prediction scores for the available actions. The agent sums these scores across snapshots and takes the action with the highest total score. At the end of the time step, each snapshot uses the taken action to update its own state. We visualize the ensemble navigation workflow in Figure 2 and sketch a single decision step in code below. We do not normalize the prediction scores of the snapshots, which allows model confidence to act as the score weights.
(Figure 2: The navigation workflow of the snapshot ensemble.)
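A single ensemble decision step might look like the following sketch, reusing the hypothetical model interface from Section 3.2; summing the unnormalized scores matches the description above, while the exact interface is an assumption.

```python
import torch

def ensemble_step(snapshots, states, instruction_feats, candidate_feats):
    """One ensemble decision step (sketch): sum unnormalized scores, pick argmax.

    snapshots: list of models sharing the same (assumed) interface
    states:    per-snapshot recurrent states, updated independently
    """
    total_scores, per_model = None, []
    for model, state in zip(snapshots, states):
        scores, attended_state = model.cross_self_attention(
            state, candidate_feats, instruction_feats)
        per_model.append(attended_state)
        total_scores = scores if total_scores is None else total_scores + scores

    action = int(torch.argmax(total_scores))

    # Each snapshot updates its own state with the jointly chosen action.
    new_states = [model.cross_modal_matching(att_state, candidate_feats[action])
                  for model, att_state in zip(snapshots, per_model)]
    return action, new_states
```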
4.3 A past-action-aware VLNBERT model
As shown in Table 2, the snapshot ensemble of the original VLNBERT model already yields a significant improvement. Still, we can further improve the ensemble by adding more variant snapshots. To do so, we modify the VLNBERT model and combine the snapshots of the original and the modified model.
Our modification is based on two ideas: adding the states of past time steps as inputs, and regularizing the attention scores between the state feature $s_t$ and the words in $X$, motivated by the observations on the OSCAR-init VLNBERT model in hong2021vln. The modification is visualized by the blue parts in Figure 3.
At the beginning of each time step $t$, we append a copy of the cross-modal matching output of the last time step, $s_{t-1}$, to a history vector. At time step $t$, the history vector is $H_t = [h_1, \dots, h_{t-1}]$, re-indexed from $s_1$ to $s_{t-1}$. We then concatenate $X$ and $[H_t; s_t; C_t]$ into one large vector and pass it through the cross + self-attention module. Note that we do not update the features in $H_t$ during the attention computation. As a result, the last-layer outputs provide not only the attention scores from the current $s_t$ to each word feature in $X$, but also those from the history features in $H_t$ to $X$.
For each set of attention scores $\hat{A}_t$ from $s_t$ to the words in the instruction, where $1 \le t \le T$, we compute an "attention regularization" loss defined as follows:

$$\mathcal{L}^{t}_{attn} = \mathrm{MSE}(\hat{A}_t, A^{*}_t)$$
"MSE" stands for Mean-Squared-Error, and $A^{*}_t$ is the "ground truth" for the normalized attention scores $\hat{A}_t$. $A^{*}_t$ is computed from the sub-instruction annotations of the Fine-Grained R2R dataset (FGR2R). Concretely, the FGR2R dataset divides each instruction in the R2R dataset into a set of ordered sub-instructions $(sub_1, \dots, sub_m)$, where $m$ is the number of sub-instructions the original instruction consists of. Each sub-instruction corresponds to one viewpoint or a sequence of viewpoints in the ground-truth path $P$. To compute $A^{*}_t$, we first build a map from each viewpoint in $P$ to a specific sub-instruction. The map function is straightforward: we choose the first sub-instruction that corresponds to the viewpoint as its mapped sub-instruction. By doing so, each viewpoint in $P$ has its own related sub-instruction. We then compute $A^{*}_t$ by the following steps:
- Find the viewpoint $v_t$ where the agent stands at time step $t$. If $v_t \notin P$, we choose the viewpoint in $P$ that is closest to $v_t$ as the new $v_t$.
- Since every $v_t$ has its mapped sub-instruction, we compute each entry of $A^{*}_t$ from that sub-instruction, placing the target attention mass on its words.
We compute each $\mathcal{L}^{t}_{attn}$, and the total attention regularization loss becomes:

$$\mathcal{L}_{attn} = \beta \sum_{t=1}^{T} \mathcal{L}^{t}_{attn}$$

where $\beta$ is a hyper-parameter and $T$ is the total number of time steps.
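A sketch of this regularization term, assuming the target distribution is uniform over the words of the mapped sub-instruction (one plausible choice for $A^{*}_t$, labeled as an assumption rather than the exact definition used here):

```python
import torch
import torch.nn.functional as F

def attention_regularization(attn_scores, sub_instruction_masks, beta=1.0):
    """Sketch of the attention-regularization loss; target construction is assumed.

    attn_scores:           (T, L) normalized state-to-word attention per time step
    sub_instruction_masks: (T, L) 1 for words of the sub-instruction mapped to the
                           viewpoint at step t, 0 otherwise (from FGR2R annotations)
    """
    masks = sub_instruction_masks.float()
    # Assumed target: uniform over the mapped sub-instruction's words, zero elsewhere.
    targets = masks / masks.sum(dim=-1, keepdim=True).clamp(min=1.0)
    per_step = F.mse_loss(attn_scores, targets, reduction="none").mean(dim=-1)
    return beta * per_step.sum()
```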
The added history vector provides additional information to the model during action prediction. In addition, the attention regularization pushes the VLNBERT model to align its attention scores with the words that correspond to the agent's actions, without a drop in performance (success rate). A visualization of how the attention scores change as the agent moves along its path is given in Appendix C.
(Figure 3: The past-action-aware modification of the VLNBERT model; the modified parts are highlighted in blue.)
Model | R2R val_unseen | R2R test | ||||||
TL | NE | SR | SPL | TL | NE | SR | SPL | |
Random | 9.77 | 9.23 | 16 | - | 9.89 | 9.79 | 13 | 12 |
Human | - | - | - | - | 11.85 | 1.61 | 86 | 76 |
Seq2Seq-SF anderson2018vision | 8.39 | 7.81 | 22 | - | 8.13 | 7.85 | 20 | 18 |
Speaker-Follower fried2018speaker | - | 6.62 | 35 | - | 14.82 | 6.62 | 35 | 28 |
PRESS li2019robust | 10.36 | 5.28 | 49 | 45 | 10.77 | 5.49 | 49 | 45 |
EnvDrop tan2019learning | 10.7 | 5.22 | 52 | 48 | 11.66 | 5.23 | 51 | 47 |
AuxRN Zhu_2020_CVPR | - | 5.28 | 55 | 50 | - | 5.15 | 55 | 51 |
PREVALENT hao2020towards | 10.19 | 4.71 | 58 | 53 | 10.51 | 5.3 | 54 | 51 |
RelGraph hong2020language | 9.99 | 4.73 | 57 | 53 | 10.29 | 4.75 | 55 | 52 |
VLNBERT hong2021vln | 12.01 | 3.93 | 63 | 57 | 12.35 | 4.09 | 63 | 57 |
VLNBERT + REM liu2021vision | 12.44 | 3.89 | 63.6 | 57.9 | 13.11 | 3.87 | 65.2 | 59.1 |
Past-action-aware VLNBERT (ours) | 13.2 | 3.88 | 63.47 | 56.27 | 13.86 | 4.11 | 62.49 | 56.11 |
VLNBERT Original Snapshot Ensemble (ours) | 11.79 | 3.75 | 65.55 | 59.2 | 12.41 | 4 | 64.22 | 58.96 |
VLNBERT Past-action-aware Snapshot Ensemble (ours) | 12.35 | 3.72 | 65.26 | 58.65 | 13.19 | 3.93 | 64.65 | 58.78 |
VLNBERT Mixed Snapshot Ensemble (ours) | 12.05 | 3.63 | 66.67 | 60.16 | 12.71 | 3.82 | 65.11 | 59.61 |
Model | R4R val_unseen_half | R4R val_unseen_full | ||||||
---|---|---|---|---|---|---|---|---|
TL | NE | SR | SPL | TL | NE | SR | SPL | |
Speaker-Follower | - | - | - | - | 19.9 | 8.47 | 23.8 | 12.2 |
EnvDrop | - | - | - | - | - | 9.18 | 34.7 | 21 |
VLNBERT + REM liu2021vision | - | - | - | - | - | 6.21 | 46 | 38.1 |
VLNBERT | 13.76 | 7.05 | 37.29 | 27.38 | 13.92 | 6.55 | 43.11 | 32.13 |
VLNBERT Original Snapshot Ensemble (ours) | 15.09 | 7.03 | 39 | 28.66 | 14.71 | 6.44 | 44.55 | 33.45 |
Model | R2R val_unseen | R2R Test | ||||||
---|---|---|---|---|---|---|---|---|
TL | NE | SR | SPL | TL | NE | SR | SPL | |
EnvDrop | 10.7 | 5.22 | 52 | 48 | 11.66 | 5.23 | 51 | 47 |
EnvDrop Snapshot Ensemble | 11.74 | 4.9 | 53.34 | 49.49 | 11.9 | 4.98 | 53.58 | 50.01 |
OSCAR-init VLNBERT | 11.86 | 4.29 | 59 | 53 | 12.34 | 4.59 | 57 | 53 |
OSCAR-init VLNBERT Snapshot Ensemble | 11.22 | 4.21 | 59.73 | 54.76 | 11.74 | 4.36 | 59.72 | 55.35 |
5 Experiment
We run the following experiments to evaluate the performances of snapshot ensembles in different models and datasets:
- We evaluate the performance of the snapshot ensemble on the R2R dataset, including the ensemble built from the original model's snapshots and the ensemble built from snapshots of both the original and the modified model.
- We also apply the snapshot ensemble to the OSCAR-init VLNBERT model from hong2021vln and the EnvDrop model from tan2019learning. We evaluate and compare their ensemble performances on the R2R dataset against their best single snapshots.
- We evaluate the performance of the snapshot ensemble on the R4R dataset, which is larger than R2R and has more complicated navigation paths.
5.1 Dataset Setting and Evaluation Metrics
We use the R2R train split as training data, the val_unseen split as validation data, and the test split to evaluate the ensemble. For the R4R dataset, we also use the train split as training data. As there is no test split in the R4R dataset, we divide its val_unseen split into two halves that share no scenes. We construct the snapshot ensemble on one half and evaluate it on the other.
We adopt four metrics for evaluation: Success Rate (SR), Trajectory Length (TL), Navigation Error (NE), and Success weighted by Path Length (SPL). SR is the ratio of successful navigations to all navigations (higher is better). TL is the average length of the model's navigation paths (lower is better). NE is the average distance between the last viewpoint of the predicted path and the ground-truth destination viewpoint (lower is better). SPL is the success rate weighted by path length (higher is better).
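These metrics follow standard definitions; the sketch below shows how they could be computed from per-episode records, where the record fields and the shortest-path distance table are assumptions about the evaluation code.

```python
from typing import Dict, List, Tuple

def evaluate_episodes(episodes: List[dict],
                      shortest: Dict[Tuple[str, str], float]) -> dict:
    """Compute SR, TL, NE, and SPL from per-episode records (sketch).

    Each episode dict is assumed to hold: 'path' (predicted viewpoint sequence),
    'path_length' (meters walked), 'goal' (goal viewpoint id), and
    'gt_length' (shortest-path length from start to goal).
    """
    sr = tl = ne = spl = 0.0
    for ep in episodes:
        error = shortest[(ep["path"][-1], ep["goal"])]
        success = error <= 3.0
        sr += success
        tl += ep["path_length"]
        ne += error
        # SPL: success weighted by (shortest length / max(taken, shortest)).
        spl += success * ep["gt_length"] / max(ep["path_length"], ep["gt_length"])
    n = len(episodes)
    return {"SR": sr / n, "TL": tl / n, "NE": ne / n, "SPL": spl / n}
```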
5.2 Training Setting and Hardware/Software Setup
We train the VLNBERT and the OSCAR-init VLNBERT models with the default 300,000 iterations. We run an ablation study to decide the candidate number, beam size, and maximum ensemble size for constructing the ensemble (when mixing the original and the modified models, the candidate pool for the beam search contains the snapshots of both; for R4R, we use a smaller candidate number to shorten the evaluation time). Appendix D describes the ablation study results. For other parameters, we use the defaults given by the authors. (We do not adopt the cyclic learning rate schedule loshchilov2016sgdr suggested in huang2017snapshot that forces the model to generate local minima; our trial experiments show no significant improvement from doing so in this task.)
We set the pseudo-random seed to 0 for the training process. We run the training code on Ubuntu 20.04.1 LTS with a GeForce RTX 3090 graphics card with 24 GB of memory. Evaluating an ensemble of 4 snapshots with a batch size of 8 takes around 10,000 MB of graphics memory. The code is developed with PyTorch 1.7.1 and CUDA 11.2. Training takes approximately 30-40 hours, and the beam search evaluation takes 3-5 hours.
6 Results
6.1 Evaluation Results of our proposed methods on R2R dataset
We evaluate our modification of the VLNBERT model and the snapshot ensembles of original-only and mixed (original and modified) snapshots on the R2R test split. Table 2 shows the evaluation results. The past-action-aware modified model performs similarly to the original model. Both snapshot ensembles significantly improve the model's NE, SR, and SPL. Specifically, the mixed snapshot ensemble achieves a new SOTA in NE and SPL while being only 0.09% worse in SR than VLNBERT + REM liu2021vision, which uses further data augmentation. At the time of writing, they have yet to release their code or dataset, which might otherwise improve our results further.
Additionally, we evaluate whether the snapshot ensemble also improves other VLN models. We train an OSCAR-init VLNBERT hong2021vln, which is a BERT-structured model, and an EnvDrop model tan2019learning, which has an LSTM plus soft-attention structure, with their default training settings. We apply the snapshot ensemble to both and compare each ensemble's performance with the model's best single snapshot on the R2R test split. Table 4 shows the evaluation results. Both ensembles gain more than a 2% increase in SR and SPL over the best snapshot of the corresponding model, suggesting that the snapshot ensemble can improve the performance of other VLN models as well.
6.2 Evaluation Results of Snapshot Ensemble on R4R dataset
In addition to the R2R dataset, we evaluate the snapshot ensemble method on the more challenging R4R dataset jain2019stay. The R4R dataset contains more navigation data and more complicated paths with more varied lengths. Table 3 shows the comparison between the best snapshot and the snapshot ensemble: we see an increase of more than 1% in SR and SPL after applying the snapshot ensemble. We also evaluate the same snapshot ensemble on the full val_unseen split of R4R, as shown in Table 3.
7 Discussion
In this section, we discuss potential reasons why the snapshot ensemble works well in the VLN task. Additionally, we provide a case study in appendix E that qualitatively analyzes our proposed snapshot ensemble.
7.1 Ensemble is More Similar to Its Snapshots
(Figure 4: Venn diagrams of shared successes/failures among the three snapshots, and with the ensemble replacing snapshot 2.)
To find out whether the snapshot ensemble leverages the predictions of its snapshots, we analyze the errors it makes in comparison to those of its snapshots (similar to our analysis in Section 4.1), counting the distinctive failures of the snapshots and the ensemble on the validation set. We choose an ensemble of 3 snapshots with an SR of 65.43% obtained via beam search and visualize the failure cases shared among the three snapshots, as well as those shared between snapshots 1, 3, and the ensemble (replacing snapshot 2). We draw the corresponding Venn diagrams with and without the replacement, as shown in Figure 4. Compared to snapshot 2, the ensemble shares more navigations that succeed/fail together with the other two snapshots (529+1086 > 477+988). Meanwhile, the number of navigations failed only by the ensemble is smaller than the corresponding number for snapshot 2. These changes suggest that the ensemble behaves more similarly to its snapshot members than the replaced snapshot does. We repeat this process by replacing snapshots 1 and 3 and obtain similar results.
To understand the benefits of the ensemble acting more similarly to its snapshots, we count the successful navigations of the ensemble and the snapshots in each scene. Table 5 shows the result. We find that each snapshot is good at different scenes. In most scenes, the ensemble either outperforms its snapshots or is comparable to the best snapshot, suggesting that the ensemble leverages the advantages of its snapshots across different scenes to achieve better performance.
7.2 Ensemble Avoids Long Navigations
In our setting, the system forces the agent to stop once it has taken 15 actions. We call navigations that take the agent 15 or more actions Long Navigations (LNs). As a ground-truth navigation only needs 5-7 actions, we wonder whether LNs are harmful to model performance. In Table 6, we count the LNs of the snapshots and the ensemble discussed in Section 7.1 and compute the rates at which their LNs fail. We find that 5.5% to 10.5% of an agent's navigations are LNs, and that an LN has a high likelihood (over 90%) of failing. As the ensemble produces far fewer LNs (131) than its snapshots, we consider avoiding LNs one of the reasons why the ensemble outperforms single snapshots.
Scene | Ensemble | Snapshot 1 | Snapshot 2 | Snapshot 3 |
---|---|---|---|---|
1 | 178 | 165 | 169 | 159 |
2 | 32 | 33 | 32 | 29 |
3 | 140 | 131 | 131 | 144 |
4 | 208 | 189 | 199 | 185 |
5 | 10 | 11 | 8 | 9 |
6 | 169 | 161 | 170 | 152 |
7 | 203 | 198 | 200 | 196 |
8 | 217 | 205 | 204 | 212 |
9 | 93 | 80 | 89 | 84 |
10 | 102 | 95 | 89 | 89 |
11 | 185 | 177 | 173 | 181 |
Model | SR | LN Count | LNs that fail (%) |
---|---|---|---|
Snapshot 1 | 61.52 | 172 | 159 (92.44%) |
Snapshot 2 | 62.32 | 155 | 141 (90.97%) |
Snapshot 3 | 61.3 | 246 | 223 (90.65%) |
Ensemble | 65.43 | 131 | 123 (93.89%) |
8 Conclusion
In this paper, we uncover differences among snapshots of the same VLN model and apply a snapshot ensemble method that leverages the behaviors of multiple snapshots. By combining snapshots of the VLNBERT model and of our proposed past-action-aware modification, we achieve a new SOTA performance on the R2R dataset. We also show that our snapshot ensemble method works with different models and on more complicated VLN tasks. In the future, we will train the model with the augmented data from liu2021vision to see whether it further improves performance. We will also apply our approach to the pre-explore and beam-search settings to see whether ensemble methods can improve performance there as well.