When reading an instruction (e.g. “Exit the bathroom, take the second door on your right, pass the sofa and stop at the top of the stairs.”), a person builds a mental map of how to arrive at a specific location. This map can include landmarks, such as the second door, and markers such as reaching the top of the stairs. Training an embodied agent to accomplish such a task with access to only ego-centric vision and individually supervised actions requires building rich multi-modal representations from limited data.
Most current approaches to Vision-and-Language Navigation (VLN) formulate the task in the seq2seq (or encoder-decoder) framework, where language and vision are encoded as input and an optimal action sequence is decoded as output. Several subsequent architectures also use this framing; however, they augment it with important advances in attention mechanisms, global scoring, and beam search [2, 13, 10].
Inherent to the seq2seq formulation is the problem of exposure bias: a model that has been trained to predict one step into the future given the ground-truth sequence cannot perform accurately given its self-generated sequence. Previous work with seq2seq models attempted to address this using student forcing and beam search.
Student forcing exposes a model to its own generated sequence during training, teaching the agent how to recover. However, once the agent has deviated from the correct path, the original instruction no longer applies. The Supplementary Materials (§A.1) show that student forcing cannot solve the exposure bias problem, causing the confused agent to fall into loops.
Beam search, at the other extreme, collects multiple global trajectories to score and incurs a cost proportional to the number of trajectories, which can be prohibitively high. This approach runs counter to the goal of building an agent that can efficiently navigate an environment: No one would likely deploy a household robot that re-navigates an entire house 100 times (a figure calculated from the length of Speaker-Follower agent paths and human paths on the R2R dataset) before executing each command, even if it ultimately arrives at the correct location. The top performing systems on the VLN leaderboard (https://evalai.cloudcv.org/web/challenges/challenge-page/97/leaderboard/270) all require broad exploration that yields long trajectories, causing poor SPL performance (Success weighted by Path Length).
To alleviate the issues of exposure bias and expensive, inefficient beam-search decoding, we propose the Frontier Aware Search with backTracking (Fast Navigator). This framework lets agents compare partial paths of different lengths based on local and global information and then backtrack if they discern a mistake. Figure 1 shows trajectory graphs created by the current published state-of-the-art (SoTA) agent using beam search versus our own.
Our method is a form of asynchronous search, which combines global and local knowledge to score and compare partial trajectories of different lengths. We evaluate our progress to the goal by modeling how closely our previous actions align with the given text instructions. To achieve this, we use a fusion function, which converts local action knowledge and history into an estimated score of progress. This score determines which local action to take and whether the agent should backtrack. This insight yields significant gains on evaluation metrics relative to existing models. The primary contributions of our work are:
A method to alleviate the exposure bias of action decoding and the expense of beam search.
An algorithm that makes use of asynchronous search with neural decoding.
An extensible framework that can be applied to existing models to achieve significant gains on SPL.
The VLN challenge requires an agent to carry out a natural language instruction in photo-realistic environments. The agent takes an input instruction, which contains several sentences describing a desired trajectory. At each step, the agent observes its surroundings. Because the agent can look around 360 degrees, each observation is in fact a set of different views. Using this multimodal input, the agent is trained to execute a sequence of actions to reach a desired location. Consistent with recent work [13, 10], we use a panoramic action space, where each action corresponds to moving towards one of the views, instead of R2R’s original primitive action space (i.e., left, right, etc.) [2, 23]. In addition, this formulation includes a stop action to indicate that the agent has reached its goal.
2.1 Learning Signals
Key to progress in visual navigation is that all VLN approaches perform a search (Figure 2). Current work often goes toward two extremes: using only local information, e.g. greedy decoding, or fully sweeping multiple paths simultaneously, e.g. beam search. To build an agent that can navigate an environment successfully and efficiently, we leverage both local and global information, letting the agent make a local decision while remaining aware of its global progress and efficiently backtracking when the agent discerns a mistake. Inspired by previous work [10, 13], our work uses three learning signals:
Logit: local distribution over actions. We track the logit of the action chosen at each time step. Specifically, the original language instruction is encoded via an LSTM. Another LSTM acts as a decoder, using an attention mechanism to generate logits over actions. At each time step of decoding, logits are calculated by taking the dot product of the decoder’s hidden state and each candidate action.
PM: global progress monitor. It tracks how much of an instruction has been completed. Formally, the model takes as input the (decoder) LSTM’s current cell state, previous hidden state, visual inputs, and attention over language embeddings to compute a score. The score ranges between [0,1], indicating the agent’s normalized progress. Training this indicator regularizes attention alignments, helping the model learn language-to-vision correspondences that it can use to compare multiple trajectories.
Speaker: global scoring. Given a sequence of visual observations and actions, we train a seq2seq captioning model as a “speaker” to produce a textual description. Doing so provides two benefits: (1) the new speaker can automatically annotate new trajectories in the environment with synthetic instructions, and (2) the speaker can score the likelihood that a given trajectory will correspond to the original instruction.
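The local signal described above (logits obtained as dot products between the decoder's hidden state and each candidate action, then normalized into a distribution over actions) can be sketched as follows. This is only an illustration under stated assumptions: the vectors are made-up toy values, and the function names `action_logits` and `softmax` are ours, not the paper's.

```python
import math

def action_logits(hidden, candidates):
    """Dot product of the decoder hidden state with each candidate action embedding."""
    return [sum(h * c for h, c in zip(hidden, cand)) for cand in candidates]

def softmax(logits):
    """Numerically stable softmax: the local distribution over actions."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical decoder hidden state and candidate-action embeddings.
hidden = [0.5, -1.0, 2.0]
candidates = [
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
    [0.5, 0.5, 0.5],
]

logits = action_logits(hidden, candidates)  # unnormalized scores, one per action
probs = softmax(logits)                     # normalized local distribution
best_action = probs.index(max(probs))       # greedy choice under this distribution
```

Keeping both `logits` and `probs` around matters later: the unnormalized values retain magnitude information that the normalized distribution discards.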
We now introduce an extendible framework that integrates the preceding three signals (logit, PM, and speaker) and trains new indicators. Figure 3(a) shows an example of integrating the three signals in a seq2seq framework, and Figure 3(b) shows how to compute them. This framework equips an agent to answer:
Should we backtrack?
Where should we backtrack to?
Which visited node is most likely to be the goal?
When do we terminate the search?
These questions pertain to all existing approaches to navigation tasks. In particular, greedy approaches never backtrack and do not compare partial trajectories. Global beam search techniques always backtrack but can waste effort. By taking a more principled approach to modeling navigation as graph traversal, our framework permits nuanced and adaptive answers to each of these questions.
For navigation, the graph is defined by a series of locations in the environment, called nodes. For each task, the agent is placed at a starting node, and the agent’s movement in the house creates a trajectory composed of a sequence of (node, action) pairs. A partial trajectory up to a given time step is the set of physical locations visited and the action taken at each point.
For any partial trajectory, the last action is proposed and evaluated, but not executed. Instead, the model chooses whether to expand a partial trajectory or execute a stop action to complete the trajectory. Importantly, this means that every node the agent visited can serve as a possible final destination. The agent moves in the environment by choosing to extend a partial trajectory: it does this by moving to the last node of the partial trajectory and executing its last action to arrive at a new node. The agent then realizes the actions available at the new node and collects them to build a set of new partial trajectories.
At each time step, the agent must (1) access the set of partial trajectories it has not expanded, (2) access the completed trajectories that might constitute the candidate path, (3) calculate the accumulated cost of partial trajectories and the expected gain of its proposed action, and (4) compare all partial trajectories.
To do so, we maintain two priority queues: a frontier queue for partial trajectories and a global candidate queue for completed trajectories. These queues are sorted by local and global scores, respectively. The local scoring function scores the quality of all partial trajectories with their proposed actions and maintains their order in the frontier queue; the global scoring function scores the quality of completed trajectories and maintains the order in the candidate queue. The global scoring function is implemented as a neural network.
To allow the agent to efficiently navigate and follow the instruction, we use an approximation of the A* search. Fast expands its optimal partial trajectory until it decides to backtrack (Q1). It decides on where to backtrack (Q2) by ranking all partial trajectories. To propose the final goal location (Q3 & Q4), the agent ranks the completed global trajectories in the candidate queue. We explore these questions in more detail below.
Q1: Should we backtrack?
When an agent makes a mistake or gets lost, backtracking lets it move to a more promising partial trajectory; however, retracing steps increases the length of the final path. To determine when it is worth incurring this cost, we propose two simple strategies: explore and exploit.
Explore always backtracks to the most promising partial trajectory. This approach resembles beam search, but, rather than simply moving to the next partial trajectory in the beam, the agent computes the most promising node to backtrack to (Q2).
Exploit, in contrast, commits to the current partial trajectory, always executing the best action available at the agent’s current location. This approach resembles greedy decoding, except that the agent backtracks when it is confused (i.e., when the best local action causes the agent to revisit a node, creating a loop; see the SMNA examples in Supplementary Materials §A.1).
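The two strategies above differ only in the condition that triggers backtracking. A minimal sketch of that decision, assuming a hypothetical helper `should_backtrack` whose arguments are our own illustrative names rather than the paper's interface:

```python
def should_backtrack(strategy, best_next_node, visited, current_is_top_ranked):
    """Decide whether to leave the current partial trajectory.

    strategy: "explore" or "exploit" (names taken from the text).
    best_next_node: node reached by the best local action (hypothetical argument).
    visited: set of nodes the agent has already visited.
    current_is_top_ranked: True if extending the current trajectory is also the
        top entry of the frontier queue (hypothetical argument).
    """
    if strategy == "explore":
        # Always move to the most promising partial trajectory; this requires
        # physical backtracking only when that trajectory is not the current one.
        return not current_is_top_ranked
    if strategy == "exploit":
        # Commit to the current trajectory unless the best local action would
        # revisit a node, i.e. close a loop.
        return best_next_node in visited
    raise ValueError(f"unknown strategy: {strategy}")
```

For example, `should_backtrack("exploit", "B", {"A", "B"}, True)` reports that the agent is about to loop back to node B and should backtrack, whereas an explore agent in the same state would keep going because its current trajectory is still ranked first.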
Q2: Where should we backtrack to?
Making this decision involves using the local scoring function to score all partial trajectories. Intuitively, the better a partial trajectory aligns with a given description, the higher its local score. Thus, if we can assume the veracity of this score, the agent simply returns to the highest scoring node when backtracking. Throughout this paper, we explore several functions for computing the local score, but we present two simple techniques here, each acting over the sequence of actions that comprise a trajectory:
Sum-of-log-probabilities sums the log-probabilities of every previous action, thereby computing the log-probability of a partial trajectory.
Sum-of-logits sums the unnormalized logits of previous actions, which outperforms summing probabilities. These values are computed using an attention mechanism over the hidden state, observations, and language. In this way, their magnitude captures how well the action was aligned with the target description (this information is lost during normalization). The loss is particularly problematic when an agent is lost: normalizing many low-value logits can yield a comparatively high probability (e.g. uniform or random). We also experiment with variations of this approach (e.g. means instead of sums) in §4.
Finally, during exploration, the agent implicitly constructs a “mental map” of the visited space. This lets it search more efficiently by refusing to revisit nodes, unless they lead to a high-value unexplored path.
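To see why normalization discards the magnitude information that sum-of-logits keeps, consider the following small sketch. The two trajectories are made-up toy data: their per-step logits differ only by a constant offset, so softmax assigns them identical probabilities even though one set of logits is much lower.

```python
import math

def log_softmax(logits):
    """Log of the softmax distribution, computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def sum_of_log_probs(steps):
    """steps: list of (logits over actions, index of the chosen action)."""
    return sum(log_softmax(logits)[chosen] for logits, chosen in steps)

def sum_of_logits(steps):
    """Sum of the unnormalized logits of the chosen actions."""
    return sum(logits[chosen] for logits, chosen in steps)

# Toy trajectories (hypothetical values): same shape, different magnitude.
confident = [([5.0, 1.0, 1.0], 0), ([6.0, 2.0, 2.0], 0)]
lost = [([1.0, -3.0, -3.0], 0), ([2.0, -2.0, -2.0], 0)]

gap_in_log_probs = sum_of_log_probs(confident) - sum_of_log_probs(lost)
gap_in_logits = sum_of_logits(confident) - sum_of_logits(lost)
```

Here `gap_in_log_probs` is zero up to floating-point error, so the normalized score cannot tell the two trajectories apart, while `gap_in_logits` is 8.0 in favor of the trajectory with stronger activations.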
Q3: Which visited node is most likely to be the goal?
Unlike existing approaches, Fast considers every point that the agent has visited as a candidate for the final destination (there can be more than one trajectory connecting the starting node to each visited node), meaning we must rerank all candidates. We achieve this using the global scoring function, a trainable neural network function that incorporates all global information for each candidate and ranks them accordingly. Figure 4(a) shows a simple visualization.
We experimented with several approaches to compute the global score, e.g., by integrating the local score, the progress monitor, the speaker score, and a trainable ensemble (§4.3).
Q4: When do we terminate the search?
The flexibility of Fast allows it to recover both the greedy decoding and beam search frameworks. In addition, we define two alternative stopping criteria:
When a partial trajectory decides to terminate.
When we have expanded a fixed number of nodes. In §4.2 we ablate the effect of choosing a different number.
We now present the algorithm flow of our Fast framework. When an agent is initialized and placed on the starting node, both the candidate and frontier queues are empty. The agent then adds all possible next actions to the frontier queue and adds its current location to the candidate queue.
Now that the frontier queue is not empty and the stop criterion is not met, Fast can choose the best partial trajectory from the frontier queue under the local scoring function.
Following the chosen partial trajectory, we perform its final proposed action to move to a new node (location in the house). Fast can now update the candidate queue with this location and the frontier queue with all possible new actions. We then either continue, by exploiting the available actions at the new location, or backtrack, depending on the choice of backtrack criteria. We repeat this process until the model chooses to stop and returns the best candidate trajectory.
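The flow described above can be sketched with two heap-based priority queues. This is a simplified illustration, not the authors' implementation: it assumes sum-of-logits as the local score, takes the global scorer as an arbitrary callable, stops after a fixed number of expansions, never revisits a node, and ignores the physical cost of moving back to an earlier node. The toy graph, the `fast_search` name, and the score table are all hypothetical.

```python
import heapq
from itertools import count

def fast_search(graph, start, global_score, max_expansions):
    """graph[node] -> list of (next_node, logit). Returns the best candidate path."""
    tie = count()      # unique tie-breaker so the heaps never compare paths
    frontier = []      # partial trajectories: (-local_score, tie, path, proposed_node)
    candidates = []    # completed trajectories: (-global_score, tie, path)
    visited = {start}

    def register(path, local_score):
        # Every visited node is a possible final destination.
        heapq.heappush(candidates, (-global_score(path), next(tie), path))
        # Each available action becomes a new partial trajectory whose last
        # action is proposed and scored but not yet executed.
        for nxt, logit in graph.get(path[-1], []):
            if nxt not in visited:
                heapq.heappush(frontier, (-(local_score + logit), next(tie), path, nxt))

    register([start], 0.0)
    expanded = 0
    while frontier and expanded < max_expansions:
        neg_score, _, path, nxt = heapq.heappop(frontier)  # best partial trajectory
        if nxt in visited:
            continue   # already reached through a better-ranked trajectory
        visited.add(nxt)
        register(path + [nxt], -neg_score)  # execute the proposed action
        expanded += 1
    return candidates[0][2]

# Toy environment: A -> B looks best locally, but A -> C -> E is the right path.
toy_graph = {"A": [("B", 2.0), ("C", 1.5)], "B": [("D", -1.0)], "C": [("E", 2.0)]}
toy_global = {"A": 0.0, "B": 0.2, "C": 0.3, "D": 0.1, "E": 0.9}
best = fast_search(toy_graph, "A", lambda p: toy_global[p[-1]], max_expansions=3)
```

In this toy run the agent first expands A to B, then finds that the partial trajectory through C outranks continuing to D, which corresponds to backtracking to A before reaching E.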
We evaluate our approach using the Room-to-Room (R2R) dataset. At the beginning of the task, the agent receives a natural language instruction and a specific start location in the environment; the agent must navigate to the target location specified in the instruction as quickly as possible. R2R is built upon the Matterport3D dataset, which consists of 194K images, yielding 10,800 panoramic views (“nodes”) and 7,189 paths. Each path is matched with three natural language instructions.
3.1 Evaluation Criteria
We evaluate our approach on the following metrics in the R2R dataset:
Trajectory Length (TL) measures the average length of the navigation trajectory.
Navigation Error (NE) is the mean of the shortest path distance in meters between the agent’s final location and the goal location.
Success Rate (SR) is the percentage of episodes in which the agent’s final location is less than 3 meters away from the goal location.
Success weighted by Path Length (SPL) trades off SR against TL. A higher score represents more efficient navigation.
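As a concrete reading of these metrics, the sketch below computes SR and SPL for a handful of made-up episodes. SPL follows its standard definition: each successful episode contributes the ratio of the shortest-path length to the larger of the path actually taken and the shortest path, and failed episodes contribute zero. The episode tuples and function names are illustrative assumptions.

```python
def success_rate(episodes, threshold=3.0):
    """episodes: list of (final_dist_to_goal, path_length, shortest_path_length)."""
    return sum(1 for dist, _, _ in episodes if dist < threshold) / len(episodes)

def spl(episodes, threshold=3.0):
    """Success weighted by Path Length, averaged over all episodes."""
    total = 0.0
    for dist, taken, shortest in episodes:
        if dist < threshold:  # only successful episodes earn credit
            total += shortest / max(taken, shortest)
    return total / len(episodes)

# Hypothetical episodes (meters): a direct success, a success after a detour
# twice as long as necessary, and a failure.
episodes = [(1.0, 10.0, 10.0), (2.0, 20.0, 10.0), (5.0, 10.0, 10.0)]
sr_value = success_rate(episodes)
spl_value = spl(episodes)
```

The detour episode counts fully toward SR but only half toward SPL, which is why long exploratory trajectories can reach a high success rate while scoring poorly on SPL.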
We compare our results to four published baselines for this task. (Some baselines on the leaderboard were not yet public when submitted; therefore, we cannot compare with them directly on the training and validation sets.)
Random: an agent that randomly selects a direction and moves five steps in that direction.
Seq2seq: the best performing model in the R2R dataset paper.
Speaker-Follower: an agent trained with data augmentation from a speaker model on the panoramic action space.
SMNA: an agent trained with a visual-textual co-grounding module and a progress monitor on the panoramic action space. (Our SMNA implementation matches published validation numbers. All our experiments are based on full re-implementations.)
3.3 Our Model
As our framework provides a flexible design space, we report performance for two versions:
Fast(short) uses the exploit strategy. We use the sum-of-logits fusion method to compute the local score and terminate when the best local action is stop.
Fast(long) uses the explore strategy. We again use the sum of logits for fusion, terminating the search after a fixed number of nodes and using a trained neural network reranker to select the goal state.
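|Model||Val Seen TL||Val Seen NE||Val Seen SR||Val Seen SPL||Val Unseen TL||Val Unseen NE||Val Unseen SR||Val Unseen SPL||Test Unseen TL||Test Unseen NE||Test Unseen SR||Test Unseen SPL|
|Our baseline SMNA||11.69||3.31||0.69||0.63||12.61||5.48||0.47||0.41||-||-||-||-|
|+ Fast (short)||-||-||-||-||21.17||4.97||0.56||0.43||22.08||5.14||0.54||0.41|
|+ Fast (long)||188.06||3.13||0.70||0.04||224.42||4.03||0.63||0.02||196.53||4.29||0.61||0.03|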
Table 1 compares the performance of our model against published numbers of existing models. Our approach significantly outperforms existing models in terms of efficiency, matching the best overall success rate despite taking 150 - 1,000 fewer steps. This efficiency gain can be seen in the SPL metric, where our models outperform previous approaches in every setting. Note that our short trajectory model appreciably outperforms current approaches in both SR and SPL. When allowed to continue exploring, our agent matches existing peak success rates in half of the steps (196 vs 373).
|Validation Unseen||SR (%)||SPL (%)||TL|
Another key advantage of our technique is how simple it is to integrate with current approaches to achieve dramatic performance gains. Table 2 shows how the sum-of-logits fusion method enhances the two previously best performing models. Simply changing their greedy decoders to Fast with no added global information and therefore no reranking yields immediate gains of 6 and 9 points in success rate for Speaker-Follower and SMNA, respectively. Due to those models’ new ability to backtrack, the trajectory lengths increase slightly. However, the success rate increases so much that SPL increases, as well.
Here, we isolate the effects of local and global knowledge, the importance of backtracking, and various stopping criteria. In addition, we include three qualitative intuitive examples to illustrate the model’s behavior in the Supplementary Materials (§A.1). We can perform this analysis because our approach has access to the same information as previous architectures, but it is more efficient. Our claims and results are general, and our Fast approach should benefit future VLN architectures.
4.1 Fixing Your Mistakes
To investigate the degree to which models benefit from backtracking, Figure 5 plots a model’s likelihood of successfully completing the task after making its first mistake at each step. We use SMNA as our greedy baseline. Our analysis finds that the previous SoTA model makes a mistake at the very first action 40% of the time. Figure 5 shows the effect of this error: the greedy approach, if it makes a mistake at its first step, has a 30% chance of successfully completing the task. In contrast, because Fast detects its mistake, it returns to the starting position and tries again. This simple one-step backtracking increases its likelihood of success by over 10%. In fact, the greedy approach is equally successful only if it progresses over halfway through the instruction without making a mistake.
4.2 Knowing When To Stop Exploring
The stopping criterion balances exploration and exploitation. Unlike previous approaches, our framework lets us compare different criteria and offers the flexibility to determine which is optimal for a given domain. The best available stopping criterion for VLN is not necessarily the best in general. We investigated the number of nodes to expand before terminating the algorithm, and we plot the resulting success rate and SPL in Figure 6. One important finding is that the model’s success rate, though increasing with more nodes expanded, does not match the oracle’s rate, i.e., as the agent expands 40 nodes, it has visited the true target node over 90% of the time but cannot recognize it as the final destination. This motivates an analysis of the utility of our global information and whether it is truly predictive (Table 4), which we investigate further in §4.3.
4.3 Local and Global Scoring
As noted in §2.3, core to our approach are two queues: the frontier queue for expansion and the candidate queue for proposing the final candidate. Each queue can use arbitrary information for scoring (partial) trajectories. We now compare the effects of combining different sets of signals for scoring each queue.
Fusion methods for scoring partial trajectories
An ideal model would include as much global information as possible when scoring partial trajectories in the frontier expansion queue. Thus, we investigated several sources of pseudo-global information and ten different ways to combine them. The first four use only local information, while the others attempt to fuse local and global information.
The top half of Table 3 shows the performance when considering only local information providers. For example, the third row of the table shows that summing the logit scores of nodes along the partial trajectory as the score for that trajectory achieves an SR score of 56.66. Note that although all information originates with the same hidden vectors, the values computed and how they are aggregated substantially affect performance. Overall, we find that summing unnormalized logits (the 3rd row) performs the best considering its outstanding SR. This suggests that important activation information in the network outputs is being thrown away by normalization and therefore discarded by other techniques.
The bottom part of Table 3 explores ways of combining local and global information providers. These are motivated by beam-rescoring techniques in previous work (e.g., multiplying by the normalized progress monitor score). Correctly integrating signals is challenging, in part due to differences in scale. For example, the logit is unbounded (+/-), log probabilities are unbounded in the negative, and the progress monitor is normalized to a score between 0 and 1. Unfortunately, direct integration of the progress monitor did not yield promising results, but future signals may prove more powerful.
|Local signal||Fusion||SR (%)||SPL (%)||TL|
|logit||mean / pm||53.00||44.51||13.67|
|log prob||mean / pm||53.72||44.64||13.85|
|logit||mean * pm||54.78||44.70||15.91|
|log prob||mean * pm||55.04||43.70||17.45|
|logit||sum * pm||50.95||41.28||20.25|
|log prob||sum * pm||56.15||43.19||21.55|
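One possible reading of the fusion variants in these rows is sketched below. The text does not spell out the exact formulas, so this is an assumption: the per-step values (logits or log-probabilities of the chosen actions) are aggregated by sum or mean, and the result is multiplied or divided by a single progress monitor estimate in (0, 1]. The `fuse` function and its arguments are hypothetical names.

```python
def fuse(step_values, progress, aggregate="sum", combine="*"):
    """Combine a local aggregate with a progress-monitor estimate.

    step_values: per-step logits or log-probabilities of the chosen actions.
    progress: progress monitor estimate, assumed to lie in (0, 1].
    aggregate: "sum" or "mean" over the steps.
    combine: "*" to multiply by progress, "/" to divide by it.
    """
    if not step_values:
        raise ValueError("step_values must not be empty")
    if aggregate not in ("sum", "mean"):
        raise ValueError(f"unknown aggregate: {aggregate}")
    total = sum(step_values)
    local = total if aggregate == "sum" else total / len(step_values)
    if combine == "*":
        return local * progress
    if combine == "/":
        return local / progress
    raise ValueError(f"unknown combine: {combine}")

# Hypothetical two-step trajectory with a progress estimate of 0.5.
mean_times_pm = fuse([2.0, 4.0], 0.5, aggregate="mean", combine="*")
sum_times_pm = fuse([2.0, 4.0], 0.5, aggregate="sum", combine="*")
mean_over_pm = fuse([2.0, 4.0], 0.5, aggregate="mean", combine="/")
```

The sketch also makes the scale problem visible: the same progress value rescales an unbounded logit sum and a strictly non-positive log-probability sum in very different ways, so a single combination rule need not suit both signals.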
Fusion methods for ranking complete trajectories
Previous work used state-factored beam search to generate candidates and ranked the complete trajectories using the probability of speaker and follower scores. In addition to the speaker and progress monitor scores used by previous models, we also experiment with using the local score to compute the global score. To inspect the performance of using different fusion methods, we ran Fast Navigator to expand 40 nodes on the frontier and collect candidate trajectories. Table 4 shows the performance of different fusion scores that rank complete trajectories. We see that most techniques have a limited understanding of the global task’s goal and formulation. We do, however, find a significant improvement on unseen trajectories when all signals are combined. For this we train a multi-layer perceptron to aggregate and weight our predictors. Note that any improvements to the underlying models or new features introduced by future work will directly correlate to gains in this component of the pipeline.
The top line of Table 4 shows the oracle’s performance. This indicates how far current global information providers still have to go. Closing this gap is an important direction for future work.
|Train||Val Seen||Val Unseen|
4.4 Intuitive Behavior
The Supplementary Materials (§A.1) provide three real examples to show how our model performs when compared to greedy decoding (SMNA model). They highlight how the same observations can lead to drastically different behaviors during an agent’s rollout. Specifically, in Figures A1 and A2, the greedy decoder is forced into a behavioral loop because only local improvements are considered. Using Fast clearly shows that even a single backtracking step can free the agent of poor behavioral choices.
5 Related Work
Our work focuses on and complements recent advances in Vision-and-Language Navigation (VLN) as introduced with the Room-to-Room dataset, but many aspects of the task and core technologies date back much further. The natural language community has explored instruction following using 2D maps [17, 14] and computer-rendered 3D environments. Due to the enormous visual complexity of real-world scenes, the VLN literature usually builds on computer vision work from referring expressions [15, 24], visual question answering, and ego-centric QA that requires navigation to answer questions [11, 8, 9]. Finally, core to our work is the field of search algorithms, dating back to the earliest days of AI [18, 20], but largely absent from recent VLN literature, which tends to focus more on neural architecture design.
Alongside the Room-to-Room dataset for VLN, its authors introduced the “student forcing” method for seq2seq models. Later work integrated a planning module to combine model-based and model-free reinforcement learning to better generalize to unseen environments, and a Cross-Modal Matching method that enforces cross-modal grounding both locally and globally via reinforcement learning. Two substantial improvements came from panoramic action spaces and a “speaker” model trained to enable data augmentation and trajectory reranking for beam search. Most recently, SMNA leverages a visual-textual co-grounding attention mechanism to better align the instruction and visual scenes and incorporates a progress monitor to estimate the agent’s current progress towards a goal. These approaches require beam search for peak SR. Beam search techniques can unfortunately lead to long trajectories when exploring unknown environments. This limitation motivates the work we present here. Existing approaches trade off a high success rate against short trajectories: greedy decoding provides short, often incorrect paths, whereas beam search yields high success rates but long trajectories.
We present Fast Navigator, a framework for using asynchronous search to boost any VLN navigator by enabling explicit backtracking when an agent detects that it is lost. This framework can be easily plugged into the most advanced agents to immediately improve their efficiency. Further, empirical results on the Room-to-Room dataset show that our agent achieves state-of-the-art Success Rates and SPLs. Our search-based method is easily extendible to more challenging settings, e.g., when an agent is given a goal without any route instruction [6, 12], or a complicated real visual environment.
Partial funding provided by DARPA’s CwC program through ARO (W911NF-15-1-0543), NSF (IIS-1524371, 1703166), National Institute of Health (R01EB019335), National Science Foundation CPS (1544797), National Science Foundation NRI (1637748), the Office of Naval Research, the RCTA, Amazon, and Honda.
-  P. Anderson, A. Chang, D. S. Chaplot, A. Dosovitskiy, S. Gupta, V. Koltun, J. Kosecka, J. Malik, R. Mottaghi, M. Savva, and A. Zamir. On evaluation of embodied navigation agents. arXiv preprint arXiv:1807.06757, 2018.
-  P. Anderson, Q. Wu, D. Teney, J. Bruce, M. Johnson, N. Sünderhauf, I. Reid, S. Gould, and A. van den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, 2018.
-  S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh. Vqa: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2425–2433, 2015.
-  C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender. Learning to rank using gradient descent. In Proceedings of the 22nd International Conference on Machine Learning, pages 89–96. ACM, 2005.
-  A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Nießner, M. Savva, S. Song, A. Zeng, and Y. Zhang. Matterport3d: Learning from rgb-d data in indoor environments. International Conference on 3D Vision (3DV), 2017.
-  D. S. Chaplot, K. M. Sathyendra, R. K. Pasumarthi, D. Rajagopal, and R. Salakhutdinov. Gated-attention architectures for task-oriented language grounding. In 32nd AAAI Conference on Artificial Intelligence (AAAI), 2018.
-  H. Chen, A. Shur, D. Misra, N. Snavely, and Y. Artzi. Touchdown: Natural language navigation and spatial reasoning in visual street environments. In Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
-  A. Das, S. Datta, G. Gkioxari, S. Lee, D. Parikh, and D. Batra. Embodied question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 5, page 6, 2018.
-  H. de Vries, K. Shuster, D. Batra, D. Parikh, J. Weston, and D. Kiela. Talk the walk: Navigating new york city through grounded dialogue. arXiv preprint arXiv:1807.03367, 2018.
-  D. Fried, R. Hu, V. Cirik, A. Rohrbach, J. Andreas, L.-P. Morency, T. Berg-Kirkpatrick, K. Saenko, D. Klein, and T. Darrell. Speaker-follower models for vision-and-language navigation. In Neural Information Processing Systems (NeurIPS), 2018.
-  D. Gordon, A. Kembhavi, M. Rastegari, J. Redmon, D. Fox, and A. Farhadi. IQA: Visual question answering in interactive environments. In Computer Vision and Pattern Recognition (CVPR), volume 1, 2018.
-  K. M. Hermann, F. Hill, S. Green, F. Wang, R. Faulkner, H. Soyer, D. Szepesvari, W. M. Czarnecki, M. Jaderberg, D. Teplyashin, et al. Grounded language learning in a simulated 3d world. arXiv preprint arXiv:1706.06551, 2017.
-  C.-Y. Ma, J. Lu, Z. Wu, G. AlRegib, Z. Kira, R. Socher, and C. Xiong. Self-monitoring navigation agent via auxiliary progress estimation. In International Conference on Learning Representations (ICLR), 2019.
-  H. Mei, M. Bansal, and M. R. Walter. Listen, attend, and walk: Neural mapping of navigational instructions to action sequences. In AAAI, volume 1, page 2, 2016.
-  P. Mirowski, R. Pascanu, F. Viola, H. Soyer, A. Ballard, A. Banino, M. Denil, R. Goroshin, L. Sifre, K. Kavukcuoglu, D. Kumaran, and R. Hadsell. Learning to navigate in complex environments. In International Conference on Learning Representations (ICLR), 2017.
-  D. Misra, A. Bennett, V. Blukis, E. Niklasson, M. Shatkhin, and Y. Artzi. Mapping instructions to actions in 3d environments with visual goal prediction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018.
-  D. Misra, J. Langford, and Y. Artzi. Mapping instructions and visual observations to actions with reinforcement learning. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2017.
-  J. Pearl. Heuristics: intelligent search strategies for computer problem solving. 1984.
-  M. Ranzato, S. Chopra, M. Auli, and W. Zaremba. Sequence level training with recurrent neural networks. In International Conference on Learning Representations (ICLR), 2016.
-  S. J. Russell and P. Norvig. Artificial intelligence: a modern approach. Malaysia; Pearson Education Limited, 2016.
-  I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 3104–3112, 2014.
-  X. Wang, Q. Huang, A. Celikyilmaz, J. Gao, D. Shen, Y.-F. Wang, W. Y. Wang, and L. Zhang. Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation. In Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
-  X. Wang, W. Xiong, H. Wang, and W. Y. Wang. Look before you leap: Bridging model-free and model-based reinforcement learning for planned-ahead vision-and-language navigation. In European Conference on Computer Vision (ECCV), 2018.
-  Y. Zhu, R. Mottaghi, E. Kolve, J. J. Lim, A. Gupta, L. Fei-Fei, and A. Farhadi. Target-driven visual navigation in indoor scenes using deep reinforcement learning. In IEEE International Conference on Robotics and Automation (ICRA), pages 3357–3364. IEEE, 2017.
A Supplementary Material
Our appendix is structured to provide both corresponding qualitative examples for the quantitative results in the paper and additional implementation details for replication.
A.1 Qualitative comparison
Figures A1 through A3 show three examples comparing our approach to the previous state-of-the-art. In addition, the following URL includes a 90-second video (https://youtu.be/AD9TNohXoPA) showing a first-person view of several agents navigating the environment with corresponding bird’s-eye-view maps.
A.2 Candidate Reranker
Given a collection of candidate trajectories, our reranker module assigns a score to each of the trajectories. The highest scoring trajectory is selected for the Fast agent’s next step. In our implementation, we use a 2-layer MLP as the reranker. We train the neural network using pairwise cross-entropy loss.
As input to the reranker, we concatenate the following features to obtain a 6-dimensional vector:
Sum of score logits for actions on the trajectory.
Mean of score logits for actions on the trajectory.
Sum of log probabilities for actions on the trajectory.
Mean of log probabilities for actions on the trajectory.
Progress monitor score for the completed trajectory.
Speaker score for the completed trajectory.
We feed the 6-dimensional vector through an MLP: BN → FC → BN → Tanh → FC, where BN is a layer of Batch Normalization, FC is a Fully Connected layer, and Tanh is the nonlinearity used. The first FC layer transforms the 6-dimensional input vector to a 6-dimensional hidden vector. The second FC layer projects the 6-dimensional vector to a single floating-point value, which is used as the score for the given partial trajectory.
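The layer stack just described can be traced with a dependency-free forward pass. This sketch makes several assumptions that the text does not state: batch normalization is applied in inference mode with fixed running statistics and without learnable scale and shift, and all parameter values are placeholders chosen so the arithmetic is easy to follow. The names `rerank_score` and `params` are hypothetical.

```python
import math

def batch_norm(x, mean, var, eps=1e-5):
    """Inference-mode batch normalization using fixed running statistics."""
    return [(xi - m) / math.sqrt(v + eps) for xi, m, v in zip(x, mean, var)]

def linear(x, weight, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def rerank_score(features, params):
    """BN -> FC -> BN -> Tanh -> FC on a 6-dimensional feature vector."""
    x = batch_norm(features, *params["bn1"])
    x = linear(x, *params["fc1"])            # 6 -> 6
    x = batch_norm(x, *params["bn2"])
    x = [math.tanh(v) for v in x]
    return linear(x, *params["fc2"])[0]      # 6 -> 1 scalar score

# Placeholder parameters: identity first layer, summing second layer.
dim = 6
identity = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
params = {
    "bn1": ([0.0] * dim, [1.0] * dim),
    "fc1": (identity, [0.0] * dim),
    "bn2": ([0.0] * dim, [1.0] * dim),
    "fc2": ([[1.0] * dim], [0.0]),
}

score_zero = rerank_score([0.0] * dim, params)
score_ones = rerank_score([1.0] * dim, params)
```

With these placeholder weights the score is simply the sum of the tanh-squashed normalized features, so it stays within (-6, 6); trained weights would instead learn how much each of the six signals should count.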
To train the MLP, we cache the candidate queue after running Fast for 40 steps. Each candidate trajectory in the queue has a corresponding score. To calculate the loss, we minimize the pairwise cross-entropy loss over pairs consisting of the score for a qualified candidate and the score for an unqualified candidate. We define qualified candidate trajectories as those that end within 3 meters of the ground-truth destination. In our cached training set, we have pairs of training data. We train using a batch size of , SGD optimizer with a learning rate of , and momentum . We train for epochs.
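The exact loss expression is not reproduced here, so the sketch below assumes the standard pairwise cross-entropy used in learning to rank: the negative log-sigmoid of the score difference between a qualified and an unqualified candidate. The pairing helper, the strict 3-meter comparison, and all names are illustrative assumptions rather than the authors' code.

```python
import math

def pairwise_ce_loss(score_qualified, score_unqualified):
    """-log(sigmoid(s_q - s_u)), written in a numerically stable form."""
    d = score_qualified - score_unqualified
    return max(-d, 0.0) + math.log1p(math.exp(-abs(d)))

def make_pairs(candidates, threshold=3.0):
    """candidates: list of (score, distance_to_goal_in_meters).

    Pairs every qualified candidate (ending closer than the threshold)
    with every unqualified one.
    """
    good = [s for s, dist in candidates if dist < threshold]
    bad = [s for s, dist in candidates if dist >= threshold]
    return [(g, b) for g in good for b in bad]

# Hypothetical cached candidate queue: (reranker score, distance to goal).
cached = [(1.2, 0.5), (0.3, 2.0), (0.8, 7.5), (-0.4, 12.0)]
pairs = make_pairs(cached)
mean_loss = sum(pairwise_ce_loss(q, u) for q, u in pairs) / len(pairs)
```

The loss is small when a qualified candidate is scored well above an unqualified one and grows roughly linearly when the order is reversed, which pushes the reranker to place trajectories that end near the goal at the top of the candidate queue.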