PyTorch code for ICLR 2019 paper: Self-Monitoring Navigation Agent via Auxiliary Progress Estimation
The Vision-and-Language Navigation (VLN) task entails an agent following navigational instructions in photo-realistic unknown environments. This challenging task demands that the agent be aware of which instruction was completed, which instruction is needed next, which way to go, and its navigation progress towards the goal. In this paper, we introduce a self-monitoring agent with two complementary components: (1) a visual-textual co-grounding module to locate the instruction completed in the past, the instruction required for the next action, and the next moving direction from surrounding images, and (2) a progress monitor to ensure the grounded instruction correctly reflects the navigation progress. We test our self-monitoring agent on a standard benchmark and analyze our proposed approach through a series of ablation studies that elucidate the contributions of the primary components. Using our proposed method, we set the new state of the art by a significant margin (8% absolute increase in success rate on the unseen test set). Code is available at https://github.com/chihyaoma/selfmonitoring-agent .
Recently, the Vision-and-Language Navigation (VLN) task (Anderson et al., 2018b), which requires the agent to follow natural language instructions to navigate through a photo-realistic unknown environment, has received significant attention (Wang et al., 2018b; Fried et al., 2018). In the VLN task, an agent is placed in an unknown realistic environment and is required to follow natural language instructions to navigate from its starting location to a target location. In contrast to some existing navigation tasks (Kempka et al., 2016; Zhu et al., 2017; Mirowski et al., 2017, 2018), we address the class of tasks where the agent does not have an explicit representation of the target (e.g., location in a map or image representation of the goal) to know if the goal has been reached or not (Matuszek et al., 2013; Hemachandra et al., 2015; Duvallet et al., 2016; Arkin et al., 2017). Instead, the agent needs to be aware of its navigation status through the association between the sequence of observed visual inputs and the instructions.
Consider the example shown in Fig. 1. Given the instruction “Exit the bedroom and go towards the table. Go to the stairs on the left of the couch. Wait on the third step.”, the agent first needs to locate which instruction is needed for the next movement, which in turn requires the agent to be aware of (i.e., to explicitly represent or have an attentional focus on) which instructions were completed or ongoing in the previous steps. For instance, the action “Go to the stairs” should be carried out once the agent has exited the room and moved towards the table. However, there exists inherent ambiguity for “go towards the table”. Intuitively, the agent is expected to “Go to the stairs” after completing “go towards the table”. But it is not clear what defines the completeness of “Go towards the table”. The completeness of an ongoing action often depends on the availability of the next action. Since the transition between the past and next parts of the instruction is a soft boundary, the agent is required to keep track of both grounded instructions in order to determine when to transition and to follow the instruction correctly. On the other hand, assessing the progress made towards the goal has indeed been shown to be important for goal-directed tasks in human decision-making (Benn et al., 2014; Chatham et al., 2012; Berkman & Lieberman, 2009). While a number of approaches have been proposed for VLN (Anderson et al., 2018b; Wang et al., 2018b; Fried et al., 2018), previous approaches generally are aware of neither which instruction is next nor the progress towards the goal; indeed, we qualitatively show that even the attentional mechanism of the baseline does not successfully track this information through time.
In this paper, we propose an agent endowed with the following abilities: (1) visual grounding: identify which direction to go by finding the part of the instruction that corresponds to the observed images, (2) textual grounding: identify which part of the instruction has been completed or is ongoing and which part is potentially needed for the next action selection, and (3) progress monitoring: ensure that the grounded instruction can correctly be used to estimate the progress made towards the goal, and apply regularization to ensure this. Therefore, we introduce the self-monitoring agent consisting of two complementary modules: visual-textual co-grounding and progress monitor.
More specifically, we achieve both visual and textual grounding simultaneously by incorporating the full history of grounded instruction, observed images, and selected actions into the agent. We leverage the structural bias between the words in instructions used for action selection and progress made towards the goal and propose a new objective function for the agent to measure how well it can estimate the completeness of instruction-following. We then demonstrate that by conditioning on the positions and weights of grounded instruction as input, the agent can be self-monitoring of its progress and further ensure that the textual grounding accurately reflects the progress made.
Overall, we propose a novel self-monitoring agent for VLN and make the following contributions: (1) We introduce the visual-textual co-grounding module, which performs grounding interdependently across both visual and textual modalities. We show that it can outperform the baseline method by a large margin. (2) We propose to equip the self-monitoring agent with a progress monitor, and for navigation tasks involving instructions instantiate this by introducing a new objective function for training. We demonstrate that, unlike the baseline method, the position of grounded instruction can follow both past and future instructions, thereby tracking progress to the goal. (3) With the proposed self-monitoring agent, we set the new state-of-the-art performance on both seen and unseen environments on the standard benchmark. With 8% absolute improvement in success rate on the unseen test set, we are ranked #1 on the challenge leaderboard.
Given a natural language instruction with $L$ words, its representation is denoted by $X = \{x_1, x_2, \ldots, x_L\}$, where $x_l$ is the feature vector for the $l$-th word encoded by an LSTM language encoder. Following Fried et al. (2018), we enable the agent with panoramic view. At each time step $t$, the agent perceives a set of images at each viewpoint $v_t = \{v_{t,1}, v_{t,2}, \ldots, v_{t,K}\}$, where $K$ is the maximum number of navigable directions and $v_{t,k}$ represents the image feature of direction $k$. (Empirically, we found that using only the images on navigable directions is slightly better than using all 36 surrounding images, i.e., 12 headings × 3 elevations with 30 degree intervals.) The co-grounding features of instruction and image are denoted as $\hat{x}_t$ and $\hat{v}_t$ respectively. The selected action is denoted as $a_t$. The learnable weights are denoted with $W$, with appropriate sub/super-scripts as necessary. We omit the bias term to avoid notational clutter in the exposition.
First, we propose a visual and textual co-grounding model for the vision and language navigation task, as illustrated in Fig. 2. We model the agent with a sequence-to-sequence architecture with attention by using a recurrent neural network. More specifically, we use Long Short Term Memory (LSTM) to carry the flow of information effectively. At each step $t$, the decoder observes representations of the current attended panoramic image feature $\hat{v}_t$, previous selected action $a_{t-1}$ and current grounded instruction feature $\hat{x}_t$ as input, and outputs an encoder context $h_t$:

$$h_t = \mathrm{LSTM}([\hat{x}_t, \hat{v}_t, a_{t-1}]) \tag{1}$$

where $[\cdot, \cdot]$ denotes concatenation. The previous encoder context $h_{t-1}$ is used to obtain the textual grounding feature $\hat{x}_t$ and visual grounding feature $\hat{v}_t$, whereas we use the current encoder context $h_t$ to obtain the next action $a_t$, all of which will be illustrated in the rest of the section.
Textual grounding. When the agent moves from one viewpoint to another, it is required to identify which direction to go by relying on a grounded instruction, i.e. which parts of the instruction should be used. This can either be the instruction matched with the past (ongoing action) or predicted for the future (next action). To capture the relative position between words within an instruction, we incorporate the positional encoding $\mathrm{PE}(\cdot)$ (Vaswani et al., 2017) into the instruction features. We then perform soft-attention on the instruction features $X$, as shown on the left side of Fig. 2. The attention distribution over the $L$ words of the instruction is computed as:

$$z^{\mathrm{textual}}_{t,l} = (W_x h_{t-1})^\top \mathrm{PE}(x_l), \qquad \alpha_t = \mathrm{softmax}(z^{\mathrm{textual}}_t) \tag{2}$$

where $W_x$ are parameters to be learnt. $z^{\mathrm{textual}}_{t,l}$ is a scalar value computed as the correlation between word $l$ of the instruction and the previous hidden state $h_{t-1}$, and $\alpha_t$ is the attention weight over features in $X$ at time $t$. Based on the textual attention distribution, the grounded textual feature $\hat{x}_t$ can be obtained by the weighted sum over the textual features, $\hat{x}_t = \alpha_t^\top X$.
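The textual grounding step can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the released implementation: the names `positional_encoding`, `textual_grounding`, `word_feats`, `h_prev` and `W_x` are hypothetical, the positional encoding is assumed to be the sinusoidal form of Vaswani et al. (2017) added to each word feature, and the weighted sum is assumed to be taken over the un-encoded word features.

```python
import math

def positional_encoding(pos, dim):
    """Sinusoidal positional encoding for one position (assumed form)."""
    pe = []
    for i in range(dim):
        angle = pos / (10000 ** (2 * (i // 2) / dim))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def textual_grounding(word_feats, h_prev, W_x):
    """Return (attention weights over words, grounded instruction feature)."""
    dim = len(word_feats[0])
    # Project the previous hidden state into the word-feature space.
    q = [sum(w * h for w, h in zip(row, h_prev)) for row in W_x]
    scores = []
    for l, x in enumerate(word_feats):
        # Assumption: the positional encoding is added to the word feature.
        x_pe = [a + b for a, b in zip(x, positional_encoding(l, dim))]
        scores.append(sum(qi * xi for qi, xi in zip(q, x_pe)))
    alpha = softmax(scores)
    # Grounded feature: attention-weighted sum of the word features.
    grounded = [sum(alpha[l] * word_feats[l][d] for l in range(len(word_feats)))
                for d in range(dim)]
    return alpha, grounded
```

The visual grounding described next follows the same pattern, with projected image features in place of word features.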
Visual grounding. In order to locate the completed or ongoing instruction, the agent needs to keep track of the sequence of images observed along the navigation trajectory. We thus perform visual attention over the surrounding views based on its previous hidden vector $h_{t-1}$. The visual attention weight $\beta_t$ can be obtained as:

$$z^{\mathrm{visual}}_{t,k} = (W_v h_{t-1})^\top g(v_{t,k}), \qquad \beta_t = \mathrm{softmax}(z^{\mathrm{visual}}_t) \tag{3}$$

where $g$ is a one-layer Multi-Layer Perceptron (MLP) and $W_v$ are parameters to be learnt. Similar to Eq. 2, the grounded visual feature $\hat{v}_t$ can be obtained by the weighted sum over the visual features, $\hat{v}_t = \beta_t^\top v_t$.
Action selection. To make a decision on which direction to go, the agent finds the image features on navigable directions with the highest correlation with the grounded navigation instruction $\hat{x}_t$ and the current hidden state $h_t$. We use the inner-product to compute the correlation, and the probability of each navigable direction is then computed as:

$$o_{t,k} = (W_a [h_t, \hat{x}_t])^\top g(v_{t,k}), \qquad p_t = \mathrm{softmax}(o_t) \tag{4}$$

where $W_a$ are the learnt parameters, $g(\cdot)$ is the same MLP as in Eq. 3, and $p_t$ is the probability of each navigable direction at time $t$. We use categorical sampling during training to select the next action $a_t$. Unlike the previous method with the panoramic view (Fried et al., 2018), which attends to instructions only based on the history of observed images, we achieve both textual and visual grounding using the shared hidden state output containing grounded information from both textual and visual modalities. During action selection, we rely on both hidden state output and grounded instruction, instead of only relying on grounded instruction.
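As a rough sketch of the action selection step, the snippet below scores each navigable direction by an inner product and samples an action from the resulting distribution. The function and argument names (`action_probabilities`, `sample_action`, `view_feats`, `W_a`) are illustrative assumptions, and `view_feats` is assumed to already hold the MLP-projected image features.

```python
import math
import random

def action_probabilities(h_t, x_hat, view_feats, W_a):
    """Softmax over inner products between a query and each view feature."""
    query_in = list(h_t) + list(x_hat)  # concatenate hidden state and grounded instruction
    q = [sum(w * x for w, x in zip(row, query_in)) for row in W_a]
    logits = [sum(a * b for a, b in zip(q, v)) for v in view_feats]
    m = max(logits)
    exps = [math.exp(o - m) for o in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(probs, rng):
    """Categorical sampling, as used during training."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Usage with toy values: two navigable directions, 1-d hidden state and instruction feature.
probs = action_probabilities([1.0], [0.0], [[1.0, 0.0], [0.0, 1.0]],
                             [[1.0, 0.0], [0.0, 0.0]])
action = sample_action(probs, random.Random(0))
```

At test time without sampling, the same probabilities would simply be ranked rather than sampled.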
It is imperative that the textual-grounding correctly reflects the progress towards the goal, since the agent can then implicitly know where it is now and what the next instruction to be completed will be. In the visual-textual co-grounding module, we ensure that the grounded instruction reasonably informs decision making when selecting a navigable direction. This is necessary but not sufficient for ensuring that the notion of progress to the goal is encoded. Thus, we propose to equip the agent with a progress monitor that serves as regularizer during training and prunes unfinished trajectories during inference.
Since the positions of localized instruction can be a strong indication of the navigation progress due to the structural alignment bias between navigation steps and instruction, the progress monitor can estimate how close the current viewpoint is to the final goal by conditioning on the positions and weights of grounded instruction. This can further enforce the result of textual-grounding to align with the progress made towards the goal and to ensure the correctness of the textual-grounding.
The progress monitor aims to estimate the navigation progress by conditioning on three inputs: the history of grounded images and instructions, the current observation of the surrounding images, and the positions of grounded instructions. We therefore represent these inputs by using (1) the previous hidden state $h_{t-1}$ and the current cell state $c_t$ of the LSTM, (2) the grounded surrounding images $\hat{v}_t$, and (3) the distribution of attention weights of textual-grounding $\alpha_t$, as shown at the bottom of Fig. 2 represented by dotted lines.
Our proposed progress monitor first computes an additional hidden state output $h^{pm}_t$ by using grounded image representations $\hat{v}_t$ as input, similar to how a regular LSTM computes hidden states except we use concatenation over element-wise addition for empirical reasons (we found that using concatenation provides slightly better performance and stable training). The hidden state output is then concatenated with the attention weights $\alpha_t$ on textual-grounding to estimate how close the agent is to the goal (we use zero-padding to handle instructions with various lengths). The output of the progress monitor $p^{pm}_t$, which represents the completeness of instruction-following, is computed as:

$$h^{pm}_t = \sigma(W_h [h_{t-1}, \hat{v}_t]) \otimes \tanh(c_t), \qquad p^{pm}_t = \tanh(W_{pm} [\alpha_t, h^{pm}_t]) \tag{5}$$

where $W_h$ and $W_{pm}$ are the learnt parameters, $c_t$ is the cell state of the LSTM, $\otimes$ denotes the element-wise product, and $\sigma$ is the sigmoid function.
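The progress monitor computation described above can be sketched as follows. This is a hedged illustration: the function name `progress_monitor` and all argument names are hypothetical, the gate is assumed to be applied as in an LSTM output gate (sigmoid of a linear map, multiplied element-wise with the squashed cell state), and the final `tanh` is an assumption chosen so that the output can fall below 0.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def progress_monitor(h_prev, v_hat, c_t, alpha, W_h, W_pm, max_len=80):
    """Estimate instruction-following completeness from the grounded inputs."""
    concat = list(h_prev) + list(v_hat)  # concatenation instead of addition
    gate = [sigmoid(sum(w * x for w, x in zip(row, concat))) for row in W_h]
    # Additional hidden state: gate modulates the squashed LSTM cell state.
    h_pm = [g * math.tanh(c) for g, c in zip(gate, c_t)]
    # Zero-pad the textual attention weights to a fixed maximum length.
    alpha_padded = list(alpha) + [0.0] * (max_len - len(alpha))
    feats = alpha_padded + h_pm
    return math.tanh(sum(w * x for w, x in zip(W_pm, feats)))
```

Because the attention weights enter the estimate directly, the position of the attended words acts as a signal of how far along the instruction the agent is.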
Training. We introduce a new objective function to train the proposed progress monitor. The training target $y^{pm}_t$ is defined as the normalized distance in units of length from the current viewpoint to the goal, i.e., the target will be 0 at the beginning and closer to 1 as the agent approaches the goal (we set the target to 1 if the agent’s distance to the goal is less than 3). Note that the target can also be lower than 0, if the agent’s current distance from the goal is farther than the starting point. Finally, our self-monitoring agent is optimized with a cross-entropy loss and a mean squared error loss, computed with respect to the outputs from both action selection and progress monitor:

$$\mathcal{L} = -\lambda \sum_{t=1}^{T} y^{nv}_t \log(p_{k,t}) + (1 - \lambda) \sum_{t=1}^{T} (y^{pm}_t - p^{pm}_t)^2 \tag{6}$$

where $p_{k,t}$ is the action probability of each navigable direction, $\lambda$ is the weight balancing the two losses, and $y^{nv}_t$ is the ground-truth navigable direction at step $t$.
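A minimal sketch of the training target and the combined objective follows. The helper names are hypothetical, the normalization by the starting distance is an assumption consistent with a target of 0 at the start and negative values when the agent is farther away than where it began, and the default weight `lam=0.5` is a placeholder rather than a reported value.

```python
import math

def progress_target(dist_to_goal, start_dist, threshold=3.0):
    """Normalized progress: 0 at the start, 1 once within the threshold of the goal."""
    if dist_to_goal < threshold:
        return 1.0
    # Assumption: normalize by the distance at the starting viewpoint.
    return 1.0 - dist_to_goal / start_dist

def self_monitoring_loss(action_probs, gt_actions, pm_outputs, pm_targets, lam=0.5):
    """Weighted sum of cross-entropy (action) and squared error (progress)."""
    ce = -sum(math.log(p[k]) for p, k in zip(action_probs, gt_actions))
    se = sum((y - o) ** 2 for y, o in zip(pm_targets, pm_outputs))
    return lam * ce + (1.0 - lam) * se
```

For example, an agent that starts 10 units from the goal and wanders to 15 units away receives a target of -0.5 under this normalization.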
Inference. During inference, we follow Fried et al. (2018) by using beam search. We propose that, while the agent decides which trajectories in the beams to keep, it is equally important to evaluate the state of the beams on actions as well as on the agent’s confidence in completing the given instruction at each traversed viewpoint. We accomplish this idea by integrating the output of our progress monitor into the accumulated probability of beam search. At each step, when candidate trajectories compete based on accumulated probability, we integrate the estimated completeness of instruction-following $p^{pm}_t$ (normalized between 0 and 1) with the action probability $p_{k,t}$ to directly evaluate the partial and unfinished candidate routes: $p^{beam}_t = p^{pm}_t \times p_{k,t}$.
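One way to picture this beam scoring is the sketch below, which ranks candidate trajectories by accumulated log-probability plus the log of the current completeness estimate. The exact way the two quantities are accumulated is an assumption here, as are the names `rank_beams` and `candidates` and the clamping constant `eps`.

```python
import math

def rank_beams(candidates, beam_size, eps=1e-8):
    """Keep the top candidates; each is (path, accumulated_log_prob, completeness)."""
    def score(candidate):
        _, log_prob, completeness = candidate
        # Clamp the completeness estimate into (0, 1] before taking the log.
        pm = min(max(completeness, eps), 1.0)
        return log_prob + math.log(pm)
    return sorted(candidates, key=score, reverse=True)[:beam_size]

# Usage: a slightly less probable path with much higher completeness wins.
beams = rank_beams([("a", math.log(0.5), 0.2), ("b", math.log(0.4), 0.9)], beam_size=1)
```

Under this scoring, an unfinished route that the agent believes is nearly complete can outrank a route that is merely more probable action by action.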
Without beam search, we use greedy decoding for action selection with one condition. If the progress monitor output decreases ($p^{pm}_{t+1} < p^{pm}_t$), the agent is required to move back to the previous viewpoint and select the action with the next highest probability. We repeat this process until the selected action leads to increasing progress monitor output. We denote this procedure as progress inference.
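Progress inference can be sketched as a loop over actions in descending probability. In this illustration `estimate_progress` is a hypothetical callback standing in for "move to that viewpoint and run the progress monitor", and the fallback to the most probable action when every option lowers the estimate is an assumption, since the text does not specify that case.

```python
def progress_inference_step(action_probs, pm_current, estimate_progress):
    """Pick the most probable action that does not decrease the progress estimate."""
    order = sorted(range(len(action_probs)),
                   key=lambda k: action_probs[k], reverse=True)
    for k in order:
        # Trying k and finding lower progress corresponds to moving back.
        if estimate_progress(k) >= pm_current:
            return k
    return order[0]  # assumed fallback: keep the greedy choice

# Usage: the greedy action (index 0) lowers the estimate, so index 1 is chosen.
outcomes = {0: 0.4, 1: 0.7, 2: 0.9}
chosen = progress_inference_step([0.6, 0.3, 0.1], 0.5, lambda k: outcomes[k])
```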
R2R Dataset. We use the Room-to-Room (R2R) dataset (Anderson et al., 2018b) for evaluating our proposed approach. The R2R dataset is built upon the Matterport3D dataset (Chang et al., 2017) and has 7,189 paths sampled from its navigation graphs. Each path has three ground-truth navigation instructions written by humans. The whole dataset is divided into 4 sets: training, validation seen, validation unseen, and test unseen.
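Table 1 (rows for our method; NE in meters):

|Method|Val seen NE|Val seen SR|Val seen OSR|Val seen SPL|Val unseen NE|Val unseen SR|Val unseen OSR|Val unseen SPL|Test unseen NE|Test unseen SR|Test unseen OSR|Test unseen SPL|
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|Ours (beam search) (leaderboard)|3.23|0.70|0.78|0.66|5.04|0.57|0.70|0.51|4.99|0.57|0.68|0.51|
|Ours* (beam search) (leaderboard)|3.04|0.71|0.78|0.67|4.62|0.58|0.68|0.52|4.48|0.61|0.70|0.56|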
We follow the same evaluation metrics used by previous work on the R2R task: (1) Navigation Error (NE), the mean of the shortest path distance in meters between the agent’s final position and the goal location. (2) Success Rate (SR), the percentage of final positions less than 3m away from the goal location. (3) Oracle Success Rate (OSR), the success rate if the agent can stop at the closest point to the goal along its trajectory. In addition, we also include the recently introduced Success rate weighted by (normalized inverse) Path Length (SPL) (Anderson et al., 2018a), which trades off Success Rate against trajectory length.
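The four metrics can be computed from per-episode distances as in the sketch below. The episode dictionary keys (`final_dist`, `min_dist`, `path_len`, `shortest_len`) are hypothetical names for illustration; SPL follows the definition of Anderson et al. (2018a), where each success is weighted by the shortest path length divided by the larger of the taken and shortest path lengths.

```python
def evaluate(episodes, threshold=3.0):
    """Return NE, SR, OSR and SPL averaged over a list of episodes."""
    n = len(episodes)
    ne = sum(e["final_dist"] for e in episodes) / n
    sr = sum(1.0 for e in episodes if e["final_dist"] < threshold) / n
    # Oracle success: the closest point along the trajectory is within the threshold.
    osr = sum(1.0 for e in episodes if e["min_dist"] < threshold) / n
    spl = sum(
        e["shortest_len"] / max(e["path_len"], e["shortest_len"])
        for e in episodes if e["final_dist"] < threshold
    ) / n
    return {"NE": ne, "SR": sr, "OSR": osr, "SPL": spl}

# Usage with two toy episodes: one success with a detour, one oracle-only success.
metrics = evaluate([
    {"final_dist": 1.0, "min_dist": 1.0, "path_len": 10.0, "shortest_len": 8.0},
    {"final_dist": 5.0, "min_dist": 2.0, "path_len": 12.0, "shortest_len": 10.0},
])
```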
We first compare the proposed self-monitoring agent with existing approaches. As shown in Table 1, our method achieves significant performance improvement compared to the state of the art without data augmentation. We achieve 70% SR on the seen environment and 57% on the unseen environment, while the existing best performing method achieved 63% and 50% SR respectively. When trained with synthetic data (we use the exact same synthetic data generated from the Speaker as in Fried et al. (2018) for comparison), our approach achieves slightly better performance on the seen environments and significantly better performance on both the validation unseen environments and the test unseen environments when submitted to the test server. We achieve 3% and 8% improvement in SR on the validation and test unseen environments, respectively. Both results, with and without data augmentation, indicate that our proposed approach is more generalizable to unseen environments. At the time of writing, our self-monitoring agent is ranked #1 on the challenge leaderboard.
Note that both Speaker-Follower and our approach in Table 1 use beam search. For comparison without using beam search, please refer to the Appendix.
Textually grounded agent. Intuitively, an instruction-following agent is required to strongly demonstrate the ability to correctly focus on and follow the corresponding part of the instruction as it navigates through an environment. We thus record the distribution of attention weights on the instruction at each step as an indication of which parts of the instruction are being used for action selection. We average all runs across both validation seen and unseen dataset splits. Ideally, we expect the distribution of attention weights to lie close to a diagonal, where at the beginning the agent focuses on the beginning of the instruction and shifts its attention towards the end of the instruction as it moves closer to the goal.
To demonstrate, we use the method with panoramic action space proposed in Fried et al. (2018) as a baseline for comparison. As shown in Figure 3, our self-monitoring agent with progress monitor demonstrates that the positions of grounded instruction over time form a line similar to a diagonal. This result may further indicate that the agent successfully utilizes the attention on instruction to complete the task sequentially. We can also see that both agents were able to focus on the first part of the instruction at the beginning of navigation consistently. However, as the agent moves further in unknown environments, our self-monitoring agent can still successfully identify the parts of instruction that are potentially useful for action selection, whereas the baseline approach becomes uncertain about which part of the instruction should be used for selecting an action.
We now discuss the importance of each component proposed in this work. We begin with the same baseline as before (agent with panoramic action space in Fried et al. (2018)). Note that our results for this baseline are slightly higher on val-seen and slightly lower on val-unseen than those reported, due to differences in hyper-parameter choices.
Co-grounding. When comparing the baseline with row #1 in our proposed method, we can see that our co-grounding agent outperformed the baseline by a large margin. This is due to the fact that we use the LSTM to carry both the textually and visually grounded content, and the decision on each navigable direction is predicted with both the textually grounded instruction and the hidden state output of the LSTM. On the other hand, the baseline agent relies on the LSTM to carry visually grounded content, and uses the hidden state output for predicting the textually grounded instruction. As a result, we observed that instead of predicting the instruction needed for selecting a navigable direction, the textually grounded instruction may match with the past sequence of observed images implicitly saved within the LSTM.
Progress monitor. Given the effective co-grounding, the proposed progress monitor further ensures that the grounded instruction correctly reflects the progress made toward the goal. This further improves the performance, especially on the unseen environments, as we can see from rows #1 and #2.
When using the progress inference, the progress monitor serves as a progress indicator for the agent to decide when to move back to the last viewpoint. We can see from rows #2 and #4 that the SR performance can be further improved by around 2% on both seen and unseen environments.
Finally, we integrate the output of the progress monitor with the state-factored beam search (Fried et al., 2018), so that the candidate paths compete not only based on the probability of selecting a certain navigable direction but also on the estimated correspondence between the past trajectory and the instruction. As we can see by comparing rows #2, #6, and #7, the progress monitor significantly improved the success rate on both seen and unseen environments and is the key for surpassing the state of the art even without data augmentation. We can also see that when using beam search without progress monitor, the SR on unseen improved by 7% (row #1 vs #6), while using beam search integrated with progress estimation improved it by 13% (row #2 vs #7).
Data augmentation. In the above, we have shown each row in our approach contributes to the performance. Each of them increases the success rate and reduces the navigation error incrementally. By further combining them with the data augmentation pre-trained from the speaker (Fried et al., 2018), the SR and OSR are further increased, and the NE is also drastically reduced. Interestingly, the performance improvement introduced by data augmentation is smaller than from Speaker-Follower on the validation sets (see Table 1 for comparison). This demonstrates that our proposed method is more data-efficient.
To further validate the proposed method, we qualitatively show how the agent navigates through unseen environments by following instructions as shown in Fig. 4. In each figure, the agent follows the grounded instruction (at the top of the figure) and decides to move towards a certain direction (green arrow). For the full figures and more examples of successful and failed agents in both unseen and seen environments, please see the supplementary material.
Consider the trajectory on the left side in Fig. 4. At step 3, the grounded instruction illustrates that the agent has just completed “turn right” and focuses mainly on “walk straight to bedroom”. As the agent enters the bedroom, it shifts the textual grounding to the next action “Turn left and walk to bed lamp”. Finally, at step 6, the agent completes another “turn left” and successfully stops at the rug (see the supplementary material for the importance of dealing with duplicate actions). Consider the example on the right side: the agent has already entered the hallway and now turns right to walk across to another room. However, it is ambiguous which room the instructor is referring to. At step 5, our agent checks out the room on the left first and realizes that it does not match with “Stop in doorway in front of rug”. It then moves to the next room and successfully stops at the goal.
In both cases, we can see that the completeness estimated by progress monitor gradually increases as the agent steadily navigates toward the goal. We have also observed that the estimated completeness ends up much lower for failure cases (see the supplementary material for further details).
Vision, Language, and Navigation. There is a plethora of work investigating the combination of vision and language for a multitude of applications (Zhou et al., 2018a, b; Antol et al., 2015; Tapaswi et al., 2016; Das et al., 2017), etc. While success has been achieved in these tasks to handle massive corpora of static visual input and text data, a resurgence of interest focuses on equipping an agent with the ability to interact with its surrounding environment for a particular goal such as object manipulation with instructions (Misra et al., 2016; Arkin et al., 2017), grounded language acquisition (Al-Omari et al., 2017; Kollar et al., 2013; Spranger & Steels, 2015; Dubba et al., 2014), embodied question answering (Das et al., 2018; Gordon et al., 2018), and navigation (Matuszek et al., 2013; Hemachandra et al., 2015; Duvallet et al., 2016; Zhu et al., 2017; de Vries et al., 2018; Yuke Zhu, 2017; Mousavian et al., 2018; Wayne et al., 2018; Wang et al., 2018a; Mirowski et al., 2017, 2018; Zamir et al., 2018). In this work, we concentrate on the recently proposed Vision-and-Language Navigation task (Anderson et al., 2018b), which asks an agent to carry out sophisticated natural-language instructions in a 3D environment. This task has application to fields such as robotics; in contrast to traditional map-based navigation systems, navigation with instructions provides a flexible way to generalize across different environments.
A few approaches have been proposed for the VLN task. For example, Anderson et al. (2018b) address the task in the form of a sequence-to-sequence translation model. Yu et al. (2018) introduce a guided feature transformation for textual grounding. Wang et al. (2018b) present a planned-ahead module by combining model-free and model-based reinforcement learning approaches. Recently, Fried et al. (2018) propose to train a speaker to synthesize new instructions for data augmentation and further use it for pragmatic inference to rank the candidate routes. These approaches leverage attentional mechanisms to select related words from a given instruction when choosing an action, but those agents are deployed to explore the environment without knowing what progress has been made and how far away the goal is. In this paper, we propose a self-monitoring agent that performs co-grounding on both visual and textual inputs and constantly monitors its own progress toward the goal as a way of regularizing the textual grounding.
Visual and textual grounding. Visual grounding learns to localize the most relevant object or region in an image given linguistic descriptions, and has been demonstrated as an essential component for a variety of vision tasks like image captioning (Hu et al., 2016; Rohrbach et al., 2016; Lu et al., 2018), visual question answering (Lu et al., 2016b; Agrawal et al., 2018), relationship detection (Lu et al., 2016a; Ma et al., 2018) and referring expression (Nagaraja et al., 2016; Gavrilyuk et al., 2018). In contrast to identifying regions or objects, we perform visual grounding to locate relevant images (views) in a panoramic photo constructed by stitching multiple images with the aim of choosing which direction to go. Extensive efforts have been made to ground language instructions into a sequence of actions (MacMahon et al., 2006; Branavan et al., 2009; Vogel & Jurafsky, 2010; Tellex et al., 2011; Artzi & Zettlemoyer, 2013; Andreas & Klein, 2015; Mei et al., 2016; Cohn et al., 2016; Misra et al., 2017). These early approaches mainly emphasize the incorporation of structural alignment biases between the linguistic structure and sequence of actions (Mei et al., 2016; Andreas & Klein, 2015), and assume the agents are in relatively easy environments where limited visual perception is required to fulfill the instructions.
We introduce a self-monitoring agent which consists of two complementary modules: visual-textual co-grounding module and progress monitor. The visual-textual co-grounding module locates the instruction completed in the past, the instruction needed in the next action, and the moving direction from surrounding images. The progress monitor regularizes and ensures the grounded instruction correctly reflects the progress towards the goal by explicitly estimating the completeness of instruction-following. This estimation is conditioned on the positions and weights of grounded instruction. Our approach sets a new state-of-the-art performance on the standard Room-to-Room dataset on both seen and unseen environments. While we present one instantiation of self-monitoring for a decision-making agent, we believe that this concept can be applied to other domains as well.
This research was partially supported by DARPA’s Lifelong Learning Machines (L2M) program, under Cooperative Agreement HR0011-18-2-001. We thank the authors from Fried et al. (2018), Ronghang Hu and Daniel Fried, for communicating with us and providing details of the implementation and synthetic instructions for fair comparison.
Association for the Advancement of Artificial Intelligence (AAAI), pp. 4349–4356, 2017.
Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2015.
Weakly supervised learning of semantic parsers for mapping instructions to actions. Transactions of the Association of Computational Linguistics (ACL), 1:49–62, 2013.
End-to-end dense video captioning with masked transformer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8739–8748, 2018b.
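Table 3 (rows for our method; NE in meters):

|Method|Val seen NE|Val seen SR|Val seen OSR|Val seen SPL|Val unseen NE|Val unseen SR|Val unseen OSR|Val unseen SPL|Test unseen NE|Test unseen SR|Test unseen OSR|Test unseen SPL|
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|Ours* (Greedy Decoding)|3.22|0.67|0.78|0.58|5.52|0.45|0.56|0.32|5.99|0.43|0.55|0.32|
|Ours* (Progress Inference)|3.18|0.68|0.77|0.58|5.41|0.47|0.59|0.34|5.67|0.48|0.59|0.35|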
We provide the comparison with the state of the art without using beam search. The results are shown in Table 3. We can see that our proposed method outperformed existing approaches by a large margin on both the validation unseen and test sets. Our method with greedy decoding for action selection improved the SR by 9% and 8% on the validation unseen and test sets, respectively. When using progress inference for action selection, the performance on the test set significantly improved by 5% compared to using greedy decoding, yielding 13% improvement over the best existing approach.
Similar to previous work, we use the pre-trained ResNet-152 on ImageNet to extract image features. Each image feature is thus a 2048-d vector. The embedded feature vector for each navigable direction is obtained by concatenating an appearance feature with a 4-d orientation feature $[\sin\phi; \cos\phi; \sin\theta; \cos\theta]$, where $\phi$ and $\theta$ are the heading and elevation angles. Following the work in Fried et al. (2018), the 4-d orientation features are tiled 32 times, resulting in an embedding feature vector with 2176 dimensions.
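The dimension count above (2048 + 4 × 32 = 2176) can be checked with a short sketch. The function name `view_embedding` is hypothetical, and the ordering of the sine and cosine terms inside the orientation feature is an assumption made for illustration.

```python
import math

def view_embedding(appearance, heading, elevation, tiles=32):
    """Concatenate an appearance feature with a tiled 4-d orientation feature."""
    # Assumed ordering of the orientation terms (angles in radians).
    orientation = [math.sin(heading), math.cos(heading),
                   math.sin(elevation), math.cos(elevation)]
    return list(appearance) + orientation * tiles

# Usage: a 2048-d appearance vector yields a 2176-d embedding.
feature = view_embedding([0.0] * 2048, heading=math.pi / 6, elevation=0.0)
```

Tiling the orientation feature gives it more weight relative to the much larger appearance vector.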
Network architecture. The embedding dimension for encoding the navigation instruction is 256. We use a dropout layer with ratio 0.5 after the embedding layer. We then encode the instruction using a regular LSTM, and the hidden state is 512 dimensional. In the MLP $g$ used for projecting the raw image feature, the FC layer projects the 2176-d input vector to a 1024-d vector, and the dropout ratio is set to be 0.5. The hidden state of the LSTM used for carrying the textual and visual information through time in Eq. 1 is 512. We set the maximum length of instruction to be 80, thus the dimension of the attention weights of textual grounding is also 80. The dimensions of the learnable matrices from Eq. 2 to 5 are: $W_x \in \mathbb{R}^{512 \times 512}$, $W_v \in \mathbb{R}^{512 \times 1024}$, $W_a \in \mathbb{R}^{1024 \times 1024}$, $W_h \in \mathbb{R}^{1536 \times 512}$, and $W_{pm} \in \mathbb{R}^{592 \times 1}$.
Training. We use ADAM as the optimizer. The learning rate is with batch size of 64 consistently throughout all experiments. When using beam search, we set the beam size to be 15. We perform categorical sampling during training for action selection.
To evaluate our proposed approach on the unseen test set, we participated in the Vision and Language Navigation challenge and submitted the results of our full proposed approach to the test server. We achieved a 61% success rate and ranked #1 on the test server at the time of writing.
We follow the submission guidelines, under which picking the highest-confidence trajectory from multiple trials for each instruction is not permissible. This means that using beam search to compare candidates and select a final trajectory is not directly allowed. Similar to the submission from Speaker-Follower (Fried et al., 2018), we record all the viewpoints traversed during the beam search process. The final agent traverses all recorded trajectories by first reaching the end of one trajectory and then backtracking to the viewpoint it shares with the next trajectory. This means that the agent could backtrack all the way to the start point during this process. The trajectories are, however, logged according to the closest previous trajectory, so that when a single agent traverses all recorded trajectories, the overhead of switching from one trajectory to another is reduced significantly. The trajectory finally selected by beam search is logged last. This yields exactly the same success rate and navigation error, as these metrics are computed from the last viewpoint of a trajectory.
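The logging procedure above can be sketched as follows. This is a simplified illustration under stated assumptions: trajectories are lists of hypothetical viewpoint identifiers that all start at the same viewpoint, backtracking retraces the previous trajectory in reverse, and the ordering by closest previous trajectory is not implemented.

```python
def shared_prefix_length(a, b):
    """Number of leading viewpoints that two trajectories have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def stitch_trajectories(candidates, final):
    """Merge beam-search trajectories into one path that ends with `final`."""
    # Log the selected trajectory last so the path ends at its final viewpoint.
    ordered = [t for t in candidates if t != final] + [final]
    path = list(ordered[0])
    for prev, nxt in zip(ordered, ordered[1:]):
        k = shared_prefix_length(prev, nxt)
        if k == 0:
            raise ValueError("trajectories must share a start viewpoint")
        # Walk back from the end of `prev` to the last shared viewpoint.
        path.extend(reversed(prev[k - 1:-1]))
        # Continue along the unshared remainder of `nxt`.
        path.extend(nxt[k:])
    return path


beams = [["A", "B", "C"], ["A", "B", "D"], ["A", "E"]]  # hypothetical viewpoints
path = stitch_trajectories(beams, final=["A", "B", "D"])
print(path)  # ['A', 'B', 'C', 'B', 'A', 'E', 'A', 'B', 'D']
```

Because the stitched path ends at the last viewpoint of the selected trajectory, metrics computed from the final viewpoint are unchanged, while the total path length grows with every backtrack.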
We provide and discuss additional qualitative results of the self-monitoring agent navigating in seen and unseen environments. We first discuss four successful examples in Fig. 5 and 6, followed by two failure examples in Fig. 7.
In Fig. 5 (a), at the beginning, the agent mostly focuses on "walk up" to make its first movement. While the agent keeps its attention on "walk up" as the completed instruction or ongoing action, it shifts its attention on the instruction to "turn right" as it walks up the stairs. Once it reaches the top of the stairs, it decides to turn right according to the grounded instruction. After turning right, we can again see that the agent pays attention to both the past action "turn right" and the next action "walk straight to bedroom". The agent continues to do so until it decides to stop by grounding on the word "stop".
In Fig. 5 (b), the agent starts by focusing on both "enter bedroom from balcony" and "turn left" to navigate. It then correctly shifts its textual-grounding attention to the following instruction. Interestingly, the given instruction "walk straight across rug to room" at step 3 is ambiguous, since there are two rooms across the rug. Our agent decided to sneak out of the first room on the left and noticed that it did not match the description in the instruction. It then moved to the other room across the rug and decided to stop because there is a rug inside the room, as described.
In Fig. 6 (a), the given instruction is ambiguous, as it only asks the agent to take actions around the stairs. Since several actions are repeated in the instruction, e.g. "walk up" and "turn left", only an agent that is able to follow the instruction precisely, step by step, can successfully complete the task. Otherwise, the agent is likely to stop early before it reaches the goal. The agent also needs to assess the completeness of the instruction-following task in order to stop after the correct number of repeated actions described in the instruction.
In Fig. 6 (b), at the beginning (step 0), the agent focuses only on "left" to make its first movement (the agent is initially facing the painting). We can see that at each step, the agent correctly focuses on the relevant part of the instruction for making each movement, and it finally believes that the instruction is completed (attention on the period ending the last sentence) and stops.
In Fig. 7 (a) step 1, although the attention on the instruction correctly focused on "take a left" and "go down", the agent failed to follow the instruction and was not able to complete the task. We can, however, see that the progress monitor correctly reflected that the agent did not follow the given instruction successfully. The agent ended up stopping with the progress monitor reporting that only 16% of the instruction was completed.
In Fig. 7 (b) step 2, the attention on the instruction focuses only on "go down", and the agent thus failed to associate "go down steps" with the stairs previously mentioned in "turn right to stairs". The agent was, however, able to follow the rest of the instruction correctly by turning right and stopping near a mirror. Note that, in contrast to Fig. 7 (a), the final completeness of instruction-following estimated by the progress monitor is much higher than the 16% reported there, which indicates that the agent was not aware that it was not correctly following the instruction.