Chasing Ghosts: Instruction Following as Bayesian State Tracking

07/03/2019 ∙ by Peter Anderson, et al. ∙ Georgia Institute of Technology

A visually-grounded navigation instruction can be interpreted as a sequence of expected observations and actions an agent following the correct trajectory would encounter and perform. Based on this intuition, we formulate the problem of finding the goal location in Vision-And-Language Navigation (VLN) within the framework of Bayesian state tracking - learning observation and motion models conditioned on these expectable events. Together with a mapper that constructs a semantic spatial map on-the-fly during navigation, we formulate an end-to-end differentiable Bayes filter and train it to identify the goal by predicting the most likely trajectory through the map according to the instructions. The resulting navigation policy constitutes a new approach to instruction following that explicitly models a probability distribution over states, encoding strong geometric and algorithmic priors while enabling greater explainability. Our experiments show that our approach outperforms strong baselines when predicting the goal location in VLN.




1 Introduction

One long-term challenge in AI is to build agents that can navigate complex 3D environments from natural language instructions. In the Vision-and-Language Navigation (VLN) instantiation of this task Anderson et al. (2018a), an agent is placed in a photo-realistic reconstruction of an indoor environment and given a natural language navigation instruction, similar to the example in Figure 1. The agent must interpret this instruction and execute a sequence of actions to navigate efficiently from its starting point to the corresponding goal. This task is challenging for existing models Wang et al. (2019, 2018); Ma et al. (2019a, b); Fried et al. (2018); Tan et al. (2019); Ke et al. (2019), particularly as the test environments are unseen during training and no prior exploration is permitted in the hardest setting.

To be successful, agents must learn to ground language instructions to both visual observations and actions. Since the environment is only partially-observable, this in turn requires the agent to relate instructions, visual observations and actions through memory. Current approaches to the VLN task use unstructured general purpose memory representations implemented with recurrent neural network (RNN) hidden state vectors Anderson et al. (2018a); Wang et al. (2019, 2018); Ma et al. (2019a, b); Fried et al. (2018); Tan et al. (2019); Ke et al. (2019). However, these approaches lack geometric priors and contain no mechanism for reasoning about the likelihood of alternative trajectories – a crucial skill for the task, e.g., ‘Would this look more like the goal if I was on the other side of the room?’. Due to this limitation, many previous works have resorted to performing inefficient first-person search through the environment using search algorithms such as beam search Fried et al. (2018); Ma et al. (2019a). While this greatly improves performance, it is clearly inconsistent with practical applications like robotics since the resulting agent trajectories are enormously long – in the range of hundreds or thousands of meters.

To address these limitations, it is essential to move towards reasoning about alternative trajectories in a representation of the environment – where there are no search costs associated with moving a physical robot – rather than in the environment itself. Towards this, we extend the Matterport3D simulator Anderson et al. (2018a) to provide depth outputs, enabling us to investigate the use of a semantic spatial map Gupta et al. (2017); Blukis et al. (2018); Henriques and Vedaldi (2018); Gordon et al. (2018) in the context of the VLN task for the first time. We propose an instruction-following agent incorporating three components: (1) a mapper that builds a semantic spatial map of its environment from first-person views; (2) a filter that determines the most probable trajectory(ies) and goal location(s) in the map, and (3) a policy that executes a sequence of actions to reach the predicted goal.

From a modeling perspective, our key contribution is the filter that formulates instruction following as a problem of Bayesian state tracking Thrun et al. (2005). We notice that a visually-grounded navigation instruction typically contains a description of expected future observations and actions on the path to the goal. For example, consider the instruction ‘walk out of the bathroom, turn left, and go on to the bottom of the stairs and wait near the coat rack’ shown in Figure 1. When following this instruction, we would expect to immediately observe a bathroom, and at the end a coat rack near a stairwell. Further, in reaching the goal we can anticipate performing certain actions, such as turning left and continuing that way. Based on this intuition, we use a sequence-to-sequence model with attention to extract sequences of latent vectors representing observations and actions from a natural language instruction.

Figure 1: Navigation instructions can be interpreted as encoding a set of latent expectable observations and actions an agent would encounter and undertake while successfully following the directions.

Faced with a known starting state, a (partially-observed) semantic spatial map generated by the mapper, and a sequence of (latent) observations and actions, we now quite naturally interpret our instruction following task within the framework of Bayesian state tracking. Specifically, we formulate an end-to-end differentiable histogram filter Jonschkowski and Brock (2016) with learnable observation and motion models, and we train it to predict the most likely trajectory taken by a human demonstrator. We emphasize that we are not tracking the state of the actual agent. In the VLN setting, the pose of the agent is known with certainty at all times. The key challenge lies in determining the location of the natural-language-specified goal state. Leveraging the machinery of Bayesian state estimation allows us to reason in a principled fashion about what a (hallucinated) human demonstrator would do when following this instruction – by explicitly modeling the demonstrator’s trajectory over multiple time steps in terms of a probability distribution over map cells. The resulting model encodes both strong geometric priors (e.g., pinhole camera projection) and strong algorithmic priors (e.g., explicit handling of uncertainty, which can be multi-modal), while enabling explainability of the learned model. For example, we can separately examine the motion model, the observation model, and their interaction during filtering.

Empirically, we show that our filter-based approach significantly outperforms a strong neural net baseline when tasked with predicting the goal location in VLN given a partially-observed semantic spatial map. On the full VLN task (incorporating the learned policy as well), our approach achieves a success rate on the test server Anderson et al. (2018a) of 32.7% (29.9% SPL Anderson et al. (2018b)), a credible result for a new class of model trained exclusively with imitation learning and without data augmentation.

Contributions. In summary, we:

  • Extend the existing Matterport3D simulator Anderson et al. (2018a) used for VLN to support depth image outputs.

  • Implement and investigate a semantic spatial memory in the context of VLN for the first time.

  • Propose a novel formulation of instruction following / goal prediction as Bayesian state tracking of a hypothetical human demonstrator.

  • Show that our approach outperforms a strong baseline for goal location prediction.

  • Demonstrate credible results on the full VLN task with the addition of a simple reactive policy, trained exclusively with imitation learning and without data augmentation.

2 Related work

Vision-and-Language Navigation Task. The VLN task Anderson et al. (2018a), based on the Matterport3D dataset Chang et al. (2017), builds on a rich history of prior work on situated instruction-following tasks beginning with SHRDLU Winograd (1971). Despite the task’s difficulty, a recent flurry of work has seen significant improvements in success rates and related metrics Wang et al. (2019, 2018); Ma et al. (2019a, b); Fried et al. (2018); Tan et al. (2019); Ke et al. (2019). Key developments include the use of instruction-generation (‘speaker’) models for trajectory re-ranking and data augmentation Fried et al. (2018); Tan et al. (2019), which have been widely adopted. Other work has focused on developing modules for estimating progress towards the goal Ma et al. (2019a) and learning when to backtrack Ma et al. (2019b); Ke et al. (2019). However, comparatively little attention has been paid to the memory architecture of the agent. LSTM Hochreiter and Schmidhuber (1997) memory has been used in all previous work.

Memory architectures for navigation agents. Beyond the VLN task, various categories of memory structures for deep neural navigation agents can be identified in the literature, including unstructured, addressable, metric and topological. General purpose unstructured memory representations, such as LSTM memory Hochreiter and Schmidhuber (1997), have been used extensively in both 2D and 3D environments Wierstra et al. (2007); Jaderberg et al. (2017); Mirowski et al. (2017); Savva et al. (2017); Das et al. (2018). However, LSTM memory does not offer context-dependent storage or retrieval, and so does not naturally facilitate local reasoning when navigating large or complex environments Oh et al. (2016). To overcome these limitations, both addressable Oh et al. (2016); Parisotto and Salakhutdinov (2018) and topological Savinov et al. (2018) memory representations have been proposed for navigating in mazes and for predicting free space. However, in this work we elect to use a metric semantic spatial map Gupta et al. (2017); Blukis et al. (2018); Henriques and Vedaldi (2018); Gordon et al. (2018) – which preserves the geometry of the environment – as our agent’s memory representation since reasoning about observed phenomena from alternative viewpoints is an important aspect of the VLN task. Semantic spatial maps are grid-based representations containing convolutional neural network (CNN) features which have been recently proposed in the context of visual navigation Gupta et al. (2017), interactive question answering Gordon et al. (2018), and localization Henriques and Vedaldi (2018). However, there has been little work on incorporating these memory representations into tasks involving natural language. The closest work to ours is Blukis et al. (2018); however, our map construction is more sophisticated as we use depth images and do not assume that all pixels lie on the ground plane. Furthermore, our major contribution is formulating instruction-following as Bayesian state tracking.

3 Preliminaries: Bayes filters

A Bayes filter Thrun et al. (2005) is a framework for estimating a probability distribution over a latent state $s$ (e.g., the pose of a robot) given a history of observations $o$ and actions $a$ (e.g., camera observations, odometry, etc.). At each time step $t$ the algorithm computes a posterior probability distribution $bel(s_t) = p(s_t \mid a_{1:t}, o_{1:t})$ conditioned on the available data. This is also called the belief.

Taking as a key assumption the Markov property of states, and conditional independence between observations and actions given the state, the belief $bel(s_t)$ can be recursively updated from $bel(s_{t-1})$ using two alternating steps to efficiently combine the available evidence. These steps may be referred to as the prediction based on action $a_t$ and the observation update using observation $o_t$.

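Prediction. In the prediction step, the filter processes the action $a_t$ using a motion model $p(s_t \mid s_{t-1}, a_t)$ that defines the probability of a state $s_t$ given the previous state $s_{t-1}$ and an action $a_t$. In particular, the updated belief $\overline{bel}(s_t)$ is obtained by integrating (summing) over all prior states $s_{t-1}$ from which action $a_t$ could have led to $s_t$, as follows:

$$\overline{bel}(s_t) = \int p(s_t \mid s_{t-1}, a_t)\, bel(s_{t-1})\, ds_{t-1} \qquad (1)$$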

Observation update. During the observation update, the filter incorporates information from the observation $o_t$ using an observation model $p(o_t \mid s_t)$ which defines the likelihood of an observation $o_t$ given a state $s_t$. The observation update is given by:

$$bel(s_t) = \eta\, p(o_t \mid s_t)\, \overline{bel}(s_t) \qquad (2)$$

where $\eta$ is a normalization constant and Equation 2 is derived from Bayes rule.
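
As an illustration of the two steps above, here is a minimal sketch of a Bayes filter over a small discrete state space, where the integral in the prediction step becomes a sum. The three-state corridor, the transition matrix and the likelihood values are hypothetical and chosen only to make the arithmetic easy to follow; they are not taken from the paper.

```python
import numpy as np

def predict(belief, transition):
    """Prediction step: sum over all prior states.

    belief:     shape (S,), probabilities over the previous state.
    transition: shape (S, S), transition[i, j] = p(next = j | prev = i, action).
    """
    return belief @ transition

def observation_update(predicted, likelihood):
    """Observation update: weight by p(o | s), then renormalize."""
    posterior = predicted * likelihood
    return posterior / posterior.sum()

# Toy corridor with 3 cells; "move right" succeeds with probability 0.8.
belief = np.array([1.0, 0.0, 0.0])
move_right = np.array([
    [0.2, 0.8, 0.0],
    [0.0, 0.2, 0.8],
    [0.0, 0.0, 1.0],
])
likelihood = np.array([0.1, 0.7, 0.2])  # hypothetical p(o | s) per cell

predicted = predict(belief, move_right)
posterior = observation_update(predicted, likelihood)
```

The normalization inside `observation_update` plays the role of the constant in the observation update equation.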

Differentiable implementations. To apply Bayes filters in practice, a major challenge is to construct accurate probabilistic motion and observation models for a given choice of belief representation $bel(s_t)$. However, recent work has demonstrated that Bayes filter implementations – including Kalman filters Haarnoja et al. (2016), histogram filters Jonschkowski and Brock (2016) and particle filters Jonschkowski et al. (2018); Karkus et al. (2018) – can be embedded into deep neural networks. The resulting models may be seen as new recurrent architectures that encode algorithmic priors from Bayes filters (e.g., explicit representations of uncertainty, conditionally independent observation and motion models) yet are fully differentiable and end-to-end learnable.

4 Agent model

In this section, we describe our VLN agent that simultaneously: (1) builds a semantic spatial map from first-person views; (2) determines the most probable goal location in the current map by filtering likely trajectories taken by a human demonstrator from the start location (i.e., the ‘ghost’); and (3) executes actions to reach the predicted goal. Each of these functions is the responsibility of a separate module which we refer to as the mapper, filter, and policy, respectively. We begin with the mapper.

4.1 Mapper

At each time step $t$, the mapper updates a learned semantic spatial map $\mathcal{M}_t$ in the world coordinate frame from first-person views. This map is a grid-based metric representation in which each grid cell contains a fixed-size latent vector representing the visual appearance of a small corresponding region in the environment. The map maintains a representation for every world coordinate that has been observed by the agent, and each map cell is computed from all past observations of the region. We define the world coordinate frame by placing the agent at the center of the map at the start of each episode, and defining the xy plane to coincide with the ground plane.

Inputs. As with previous work on the VLN task Fried et al. (2018); Ma et al. (2019a, b), we provide the agent with a panoramic view of its environment at each time step (the panoramic setting is chosen for comparison with prior work, not as a requirement of our architecture), comprising a set of RGB images $\mathcal{I}_t = \{I_{t,k}\}$, where $I_{t,k}$ represents the image captured in direction $k$. The agent also receives the associated depth images $\mathcal{D}_t$ and camera poses $\mathcal{P}_t$. We additionally assume that the camera intrinsics and the ground plane are known. In the VLN task, these inputs are provided by the simulator; in other settings they could be provided by SLAM systems etc.

Image processing. Each image $I_{t,k}$ is processed with a pretrained convolutional neural network (CNN) to extract a downsized visual feature representation $V_{t,k}$. We apply 2D adaptive average pooling to the matching depth image $D_{t,k}$, excluding missing (zero) depth values, to extract a corresponding downsized depth image $D'_{t,k}$.
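
The depth pooling described above can be sketched as follows. This is a simplified, hypothetical version that assumes the input size is an exact multiple of the output size (true adaptive pooling does not need this), and the function name `masked_average_pool` is ours, not the paper's.

```python
import numpy as np

def masked_average_pool(depth, out_h, out_w):
    """Average-pool a depth image to (out_h, out_w), ignoring zero (missing) values.

    Assumes depth.shape is an exact multiple of (out_h, out_w).
    """
    h, w = depth.shape
    bh, bw = h // out_h, w // out_w
    # blocks[i, a, j, b] == depth[i * bh + a, j * bw + b]
    blocks = depth.reshape(out_h, bh, out_w, bw)
    valid = blocks > 0
    total = (blocks * valid).sum(axis=(1, 3))
    count = valid.sum(axis=(1, 3))
    # Cells with no valid depth stay at zero, i.e. they remain "missing".
    out = np.zeros((out_h, out_w), dtype=float)
    return np.divide(total, count, out=out, where=count > 0)

depth = np.array([
    [2.0, 0.0, 1.0, 1.0],
    [4.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 3.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
])
pooled = masked_average_pool(depth, 2, 2)
```

Excluding zeros matters because a plain average would pull valid depths towards zero wherever the sensor returned no reading.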

Feature projection. Similarly to MapNet Henriques and Vedaldi (2018), we project CNN features $V_{t,k}$ onto the ground plane in the world coordinate frame using the corresponding depth image $D'_{t,k}$, the camera pose $P_{t,k}$, and a pinhole camera model using known camera intrinsics. We then discretize the projected features into a 2D spatial grid $F_t$, using elementwise max pooling to handle feature collisions in a cell.
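
A rough sketch of this projection step is given below, under several assumptions that are ours rather than the paper's: camera axes are x right, y down, z forward; world z points up so the ground plane is xy; the grid is centered on the world origin; and all names (`project_to_grid`, `cell_size`, `grid_size`) are hypothetical.

```python
import numpy as np

def project_to_grid(features, depth, fx, fy, cx, cy, cam_to_world, cell_size, grid_size):
    """Scatter per-pixel features into a top-down grid with max pooling.

    features:     (C, H, W) feature map.
    depth:        (H, W) distance along the optical axis; 0 marks missing depth.
    cam_to_world: (4, 4) homogeneous camera pose.
    Returns the (C, grid_size, grid_size) grid and a boolean observed mask.
    """
    C, H, W = features.shape
    v, u = np.mgrid[0:H, 0:W]
    # Pinhole back-projection of each pixel into camera coordinates.
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    pts = np.stack([x, y, depth, np.ones_like(depth)], axis=0).reshape(4, -1)
    world = cam_to_world @ pts
    # Discretize world x, y into grid indices (origin at the grid center).
    col = np.floor(world[0] / cell_size).astype(int) + grid_size // 2
    row = np.floor(world[1] / cell_size).astype(int) + grid_size // 2
    ok = (depth.reshape(-1) > 0) & (row >= 0) & (row < grid_size) \
        & (col >= 0) & (col < grid_size)
    grid = np.full((C, grid_size, grid_size), -np.inf)
    feats = features.reshape(C, -1)
    for c in range(C):
        # Elementwise max resolves several pixels landing in one cell.
        np.maximum.at(grid[c], (row[ok], col[ok]), feats[c, ok])
    observed = np.isfinite(grid[0])
    grid[:, ~observed] = 0.0
    return grid, observed

# Camera at the world origin looking along world +x (camera z maps to world x).
pose = np.array([[0.0, 0.0, 1.0, 0.0],
                 [-1.0, 0.0, 0.0, 0.0],
                 [0.0, -1.0, 0.0, 0.0],
                 [0.0, 0.0, 0.0, 1.0]])
features = np.array([[[3.0, 7.0]]])  # C=1, H=1, W=2
depth = np.array([[2.0, 2.0]])
grid, observed = project_to_grid(features, depth, 100.0, 100.0, -1.0, 0.0, pose, 1.0, 8)
```

In this toy call both pixels fall into the same grid cell, so the cell keeps the larger of the two feature values.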

Map update. To integrate map observations $F_t$ into our semantic spatial map $\mathcal{M}$, we use a convolutional implementation Xingjian et al. (2015) of a Gated Recurrent Unit (GRU) Cho et al. (2014). In preliminary experiments we found that using convolutions in both the input-to-state and state-to-state transitions reduced the variance in the performance of the complete agent by sharing information across neighboring map cells. However, since both the map $\mathcal{M}$ and the map update $F_t$ are sparse, we use a sparsity-aware convolution operation that evaluates only observed pixels and normalizes the output Uhrig et al. (2017). We also mask the GRU map update to prevent bias terms from accumulating in the unobserved regions.

4.2 Filter

Figure 2: Proposed filter architecture. To identify likely goal locations in the partially-observed semantic spatial map $\mathcal{M}$ generated by the mapper, we first initialize the belief $bel(s_0)$ with the known starting state $s_0$. We then recursively: (1) generate a latent observation $o_t$ and action $a_t$ from the instruction, (2) compute the prediction step using the motion model (Equation 3), and (3) compute the observation update using the observation model (Equation 5), stopping after $T$ time steps. The resulting belief $bel(s_T)$ represents the posterior probability distribution over likely goal locations.

At the beginning of each episode the agent is placed at a start location $s_0 = (x_0, y_0, \theta_0)$, where $\theta_0$ represents the agent’s heading and $x_0$ and $y_0$ are coordinates in the world frame as previously described. The agent is given an instruction describing the trajectory to an unknown goal coordinate. As an intermediate step towards actually reaching the goal, we wish to identify likely goal locations in the partially-observed semantic spatial map $\mathcal{M}$ generated by the mapper.

Our approach to this problem is based on the observation that a natural language navigation instruction typically conveys a sequence of expected future observations and actions, as previously discussed. Based on this observation, we frame the problem of determining the goal location as a tracking problem. As illustrated in Figure 2 and described further below, we implement a Bayes filter to track the pose of a hypothetical human demonstrator (i.e., the ‘ghost’) from the start location to the goal. As inputs to the filter, we provide a series of latent observations $o_{1:T}$ and actions $a_{1:T}$ extracted from the navigation instruction. The output of the filter is the belief over likely goal locations $bel(s_T)$.

Note that in this section we use the subscript $t$ to denote time steps in the filter, overloading the notation from Section 4.1 in which $t$ referred to agent time steps. We wish to make clear that in our model the filter runs in an inner loop, re-estimating belief over trajectories for an ideal agent starting from $s_0$ each time the map is updated by the agent in the outer loop.

Belief. We define the state $s_t = (x_t, y_t, \theta_t)$ using the agent’s position $(x_t, y_t)$ and heading $\theta_t$. We represent the belief $bel(s_t)$ over the ideal agent’s state at each time step $t$ with a histogram, implemented as a tensor with one dimension per state component ($x$, $y$ and $\theta$), whose sizes are the number of bins for each component, respectively. Using a histogram-based approach allows the filter to track multiple hypotheses, meshes easily with our implementation of a grid-based semantic map, and leads naturally to an efficient motion model implementation based on convolutions, as discussed further below. However, our proposed approach could also be implemented as a particle filter Jonschkowski et al. (2018); Karkus et al. (2018), for example if discretization error was a significant concern.

Observations and actions. To transform the instruction into a latent representation of observations $o_{1:T}$ and actions $a_{1:T}$, we use a sequence-to-sequence model with attention Bahdanau et al. (2015). We first tokenize the instruction into a sequence of words which are encoded using learned word embeddings and a bi-directional LSTM Hochreiter and Schmidhuber (1997) to output a series of encoder hidden states and a final hidden state representing the output of a complete pass in each direction. We then use an LSTM decoder to generate a series of latent observation and action vectors $o_t$ and $a_t$ respectively. Here, $o_t$ is computed from $h_t$, the hidden state of the decoder LSTM, and the attended instruction representation, which is computed using a standard dot-product attention mechanism Luong et al. (2014). The action vectors $a_t$ are computed analogously, using the same decoder LSTM but with a separate learned attention mechanism. The only input to the decoder LSTM is a positional encoding Vaswani et al. (2017) of the decoding time step $t$. While the correct number of decoding time steps is unknown, in practice we always run the filter for a fixed number of time steps $T$ equal to the maximum trajectory length in the dataset (which is 6 steps in the navigation graph).
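
For readers unfamiliar with dot-product attention, the following minimal sketch shows how a decoder hidden state can be used to compute an attended instruction representation. The shapes and names are assumptions for illustration, and learned projections that a full model would typically include are omitted.

```python
import numpy as np

def dot_product_attention(query, encoder_states):
    """Return the attention-weighted sum of encoder states and the weights.

    query:          (D,) decoder hidden state.
    encoder_states: (L, D) one vector per instruction token.
    """
    scores = encoder_states @ query
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ encoder_states, weights

# Three hypothetical token encodings; the query is closest to the first one.
encoder_states = np.eye(3)
query = np.array([10.0, 0.0, 0.0])
context, weights = dot_product_attention(query, encoder_states)
```

Using two separate attention mechanisms over the same encoder states lets the observation and action vectors attend to different words in the instruction.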

Motion model. We implement the motion model $p(s_t \mid s_{t-1}, a_t)$ as a convolution over the belief $bel(s_{t-1})$. This ensures that agent motion is consistent across the state space while explicitly enforcing locality, i.e., the agent cannot move further than half the kernel size in a single time step. Similarly to Jonschkowski and Brock (2016), the prediction step from Equation 1 is thus reformulated as:

$$\overline{bel}(s_t) = bel(s_{t-1}) * g(a_t, \mathcal{M}) \qquad (3)$$

where we define an action- and map-dependent motion kernel $g(a_t, \mathcal{M})$ of size $K$ given by:

$$g(a_t, \mathcal{M}) = \mathrm{softmax}\big(\mathrm{conv}(a_t, \mathcal{M})\big) \qquad (4)$$

where conv is a small 3-layer CNN with ReLU activations operating on the semantic spatial map $\mathcal{M}$ and the spatially-tiled action vector $a_t$, $K$ is the motion kernel size and the softmax function enforces the prior that $g$ represents a probability mass function. Note that we include $\mathcal{M}$ in the input so that the motion model can learn that the agent is unlikely to move through obstacles.
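
To make the convolutional prediction step concrete, here is a deliberately simplified sketch: a 2D belief over position only, and a single kernel shared by all cells. In the model described above the kernel is produced by a CNN and depends on the map and action, so this is an illustration of the locality and normalization properties rather than the paper's implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def motion_update(belief, kernel_logits):
    """Spread 2D belief mass with a (K, K) kernel normalized by softmax.

    kernel[i, j] is the probability of moving by (i - K // 2, j - K // 2) cells.
    Mass that would leave the grid is dropped and the result renormalized.
    """
    K = kernel_logits.shape[0]
    r = K // 2
    kernel = softmax(kernel_logits)
    H, W = belief.shape
    padded = np.zeros((H + 2 * r, W + 2 * r))
    for i in range(K):
        for j in range(K):
            padded[i:i + H, j:j + W] += kernel[i, j] * belief
    out = padded[r:r + H, r:r + W]
    return out / out.sum()

belief = np.zeros((5, 5))
belief[2, 2] = 1.0
logits = np.full((3, 3), -50.0)
logits[1, 2] = 0.0  # strongly prefer a move of one cell in +column direction
moved = motion_update(belief, logits)
```

Because the kernel is only three cells wide, no probability mass can travel more than one cell per step, which is the locality property mentioned above.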

Observation model. We require an observation model $p(o_t \mid s_t, \mathcal{M})$ to define the likelihood of a latent observation $o_t$ conditioned on the agent’s state $s_t$ and the map $\mathcal{M}$. A generative observation model like this would be hard to learn, since it is not clear how to generate high-dimensional latent observations and normalization needs to be done across observations, not states. Therefore, we follow prior work Karkus et al. (2018) and learn a discriminative observation model that takes $o_t$ and $\mathcal{M}$ as inputs and directly outputs the likelihood of this observation for each state. As detailed further in Section 4.4, this observation model is trained end-to-end without direct supervision of the likelihood.

To implement our observation model we use LingUNet Misra et al. (2018), a language-conditioned image-to-image network based on U-Net Ronneberger et al. (2015). Specifically, we use the LingUNet implementation from Blukis et al. (2018) with 3 cascaded convolution and deconvolution operations. The spatial dimensionality of the LingUNet output matches the input image (in this case, the semantic spatial map $\mathcal{M}$), and the number of output channels is selected to match the number of heading bins. Outputs are restricted to the range $[0, 1]$ using a sigmoid function. The observation update from Equation 2 is re-defined as:

$$bel(s_t) = \eta\, \overline{bel}(s_t) \odot \mathrm{LingUNet}(o_t, \mathcal{M}) \qquad (5)$$

where $\eta$ is a normalization constant and $\odot$ represents element-wise multiplication.

Goal prediction. In summary, to identify goal locations in the partially-observed spatial map $\mathcal{M}$, we initialize the belief $bel(s_0)$ with the known starting state $s_0$. We then iteratively: (1) Generate a latent observation $o_t$ and action $a_t$, (2) Compute the prediction step using Equation 3, and (3) Compute the observation update using Equation 5. We stop after $T$ filter update time steps. The resulting belief $bel(s_T)$ represents the posterior probability distribution over goal locations.
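
The loop summarized above can be sketched generically as follows. The two callables stand in for the learned motion and observation models; in this toy example they are a fixed smoothing kernel and a fixed likelihood vector over a five-cell corridor, both hypothetical.

```python
import numpy as np

def predict_goal(start, shape, steps, motion_step, observation_likelihood):
    """Run a histogram filter for a fixed number of steps.

    start:                     index of the known starting cell.
    motion_step(belief, t):    returns the predicted belief (prediction step).
    observation_likelihood(t): returns per-cell likelihoods in [0, 1].
    """
    belief = np.zeros(shape)
    belief[start] = 1.0  # the starting state is known with certainty
    for t in range(steps):
        predicted = motion_step(belief, t)
        posterior = predicted * observation_likelihood(t)  # element-wise
        belief = posterior / posterior.sum()               # normalization
    return belief

kernel = np.array([0.25, 0.5, 0.25])           # stand-in motion model
likelihood = np.array([0.1, 0.2, 0.4, 0.8, 0.9])  # stand-in observation model

final_belief = predict_goal(
    start=0,
    shape=(5,),
    steps=2,
    motion_step=lambda b, t: np.convolve(b, kernel, mode="same"),
    observation_likelihood=lambda t: likelihood,
)
```

After two steps with a three-cell kernel, cells more than two positions from the start still have zero belief, which illustrates how the motion model constrains where the goal can be.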

4.3 Policy

The final component of our agent is a simple reactive policy network. It operates over a global action space defined by the complete set of panoramic viewpoints observed in the current episode (including both visited viewpoints, and their immediate neighbors). Our agent thus memorizes the local structure of the observed navigation graph to enable it to return to any previously observed location in a single action. The probability distribution over actions is defined by a softmax function, where the logit associated with each viewpoint $i$ is computed by an MLP, a two-layer neural network, from two inputs: $b_i$, a vector containing the belief at each time step in a Gaussian neighborhood around viewpoint $i$, and $d_i$, a vector containing the distance from the agent’s current location to viewpoint $i$ and an indicator variable for whether $i$ has been previously visited. If the policy chooses to revisit a previously visited viewpoint, we interpret this as a stop action. Note that our policy does not have access to any representation of the instruction, or the semantic map $\mathcal{M}$. Although our policy network is specific to the Matterport3D simulator environment, the rest of our pipeline is general and operates without knowledge of the simulator’s navigation graph.
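
A small sketch of the action selection described above is shown below. The logits are assumed to come from some per-viewpoint scoring network, which is not modeled here, and greedy selection is used for simplicity; the function names are hypothetical.

```python
import numpy as np

def policy_distribution(logits):
    """Softmax over per-viewpoint logits."""
    z = np.asarray(logits, dtype=float)
    p = np.exp(z - z.max())
    return p / p.sum()

def choose_action(logits, visited):
    """Pick the most likely viewpoint; choosing a visited one means stop."""
    probs = policy_distribution(logits)
    target = int(np.argmax(probs))
    return ("stop" if visited[target] else "move"), target

logits = [0.1, 2.0, -1.0]  # hypothetical scores for three candidate viewpoints
action_a = choose_action(logits, visited=[True, False, False])
action_b = choose_action(logits, visited=[False, True, False])
```

Folding the stop decision into the choice of a visited viewpoint means the policy needs no separate stop output.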

4.4 Learning

Our entire agent model is fully differentiable, from policy actions back to image pixels via the semantic spatial map, geometric feature projection function, etc. We train the filter using supervised learning by minimizing the KL-divergence between the predicted belief $bel(s_t)$ and the true state $s_t^*$, backpropagating gradients through the previous belief $bel(s_{t-1})$ at each step. We concurrently train the policy with cross-entropy loss to maximize the likelihood of the ground-truth target action, defined as the next action in the shortest path from the current location to the goal. During training, we sample an action from the policy with 50% probability, or we select the ground-truth action otherwise.
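
The two training ingredients above can be illustrated with a short sketch. It assumes the true state is represented as a one-hot target distribution and that the divergence is taken from the target to the prediction, in which case it reduces to the negative log of the belief at the true cell; the paper does not spell out these details here, so treat them as our assumptions.

```python
import numpy as np

def kl_to_one_hot(belief, true_index, eps=1e-12):
    """KL(target || belief) when the target is one-hot at true_index."""
    return -np.log(belief[true_index] + eps)

def next_action(policy_probs, expert_action, rng, p_sample=0.5):
    """With probability p_sample sample from the policy, else follow the expert."""
    if rng.random() < p_sample:
        return int(rng.choice(len(policy_probs), p=policy_probs))
    return expert_action

belief = np.array([0.25, 0.75])
loss = kl_to_one_hot(belief, true_index=1)

rng = np.random.default_rng(0)
probs = np.array([0.1, 0.6, 0.3])
rollout = [next_action(probs, expert_action=2, rng=rng) for _ in range(10)]
```

Mixing sampled and ground-truth actions exposes the agent to states reached by its own mistakes while still keeping trajectories close to the demonstrations.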

Implementation details.

We provide further implementation details in the supplementary. PyTorch code will be released to replicate all experiments.

5 Experiments

5.1 Environment and dataset

Simulator. We use the Matterport3D Simulator Anderson et al. (2018a) based on the Matterport3D dataset Chang et al. (2017) containing RGB-D images, textured 3D meshes and other annotations captured from 11K panoramic viewpoints densely sampled throughout 90 buildings. Using this dataset, the simulator implements a visually-realistic first-person environment that allows the agent to look in any direction while moving between panoramic viewpoints along edges in a navigation graph. Viewpoints are 2.25m apart on average.

Depth outputs. As the Matterport3D Simulator supports RGB output only, we extend it to support depth outputs which are necessary to accurately project CNN features into the semantic spatial map. Our simulator extension projects the undistorted depth images from the Matterport3D dataset onto cubes aligned with the provided ‘skybox’ images, such that each cube-mapped pixel represents the Euclidean distance from the camera center. We then adapt the existing rendering pipeline to render depth images from these cube-maps, converting depth values from Euclidean distance back to distance from the camera plane in the process. To fill missing depth values corresponding to shiny, bright, transparent, and distant surfaces, we apply a simple cross-bilateral filter based on the NYUv2 implementation Nathan Silberman and Fergus (2012). We additionally implement various other performance improvements, such as caching, which boosts the frame-rate of the simulator up to 1000 FPS, subject to GPU performance and CPU-GPU memory bandwidth. We have made our simulator extension available to the community.

R2R instruction dataset. We evaluate using the Room-to-Room (R2R) dataset for Vision-and-Language Navigation (VLN) Anderson et al. (2018a). The dataset consists of 22K open-vocabulary, crowd-sourced navigation instructions with an average length of 29 words. Each instruction corresponds to a 5–24m trajectory in the Matterport3D dataset, traversing 5–7 viewpoint transitions. Instructions are divided into splits for training, validation and testing. The validation set is further split into two components: val-seen, where instructions and trajectories are situated in environments seen during training, and val-unseen containing instructions situated in environments that are not seen during training. All the test set instructions and trajectories are from environments that are unseen in training and validation.

5.2 Goal prediction results

Val-Seen

Time step                       0     1     2     3     4     5     6     7     Avg
Map Seen (m²)                   47.2  62.5  73.3  82.1  90.7  98.3  105   112   83.9
Goal Seen (%)                   8.82  17.2  25.9  33.7  41.2  48.8  54.5  60.2  36.3
Prediction Error (m)
  Hand-coded baseline           7.42  7.33  7.19  7.18  7.15  7.13  7.09  7.11  7.20
  LingUNet baseline             7.17  6.66  6.17  5.75  5.42  5.15  4.89  4.69  5.74
  Filter, s = (x, y) (ours)     6.45  5.94  5.66  5.25  5.00  4.86  4.67  4.62  5.31
  Filter, s = (x, y, θ) (ours)  6.10  5.75  5.30  5.06  4.81  4.71  4.59  4.46  5.09
Success Rate (<3m error)
  Hand-coded baseline           17.3  17.8  18.5  18.2  18.0  19.1  18.8  18.6  18.3
  LingUNet baseline             10.7  16.7  21.2  25.8  29.7  33.6  36.9  39.1  26.7
  Filter, s = (x, y) (ours)     24.6  29.3  31.9  35.9  39.7  41.0  42.1  41.2  35.7
  Filter, s = (x, y, θ) (ours)  30.9  34.3  38.4  41.6  43.7  44.9  44.3  46.2  40.6

Val-Unseen

Time step                       0     1     2     3     4     5     6     7     Avg
Map Seen (m²)                   45.6  60.3  69.8  78.0  84.9  91.1  96.7  102   78.6
Goal Seen (%)                   16.0  25.2  34.6  43.2  50.5  57.0  62.8  67.6  44.6
Prediction Error (m)
  Hand-coded baseline           6.75  6.53  6.40  6.37  6.29  6.20  6.15  6.12  6.35
  LingUNet baseline             6.18  5.80  5.40  5.17  4.90  4.65  4.44  4.27  5.10
  Filter, s = (x, y) (ours)     5.92  5.50  5.14  4.88  4.67  4.45  4.41  4.30  4.91
  Filter, s = (x, y, θ) (ours)  5.69  5.28  4.90  4.60  4.40  4.26  4.14  4.05  4.67
Success Rate (<3m error)
  Hand-coded baseline           18.9  20.1  21.1  21.3  21.8  22.2  22.6  22.9  21.4
  LingUNet baseline             16.9  22.3  27.7  31.6  35.2  38.4  41.1  44.5  32.2
  Filter, s = (x, y) (ours)     29.1  32.5  36.1  39.2  41.9  44.5  45.7  46.2  39.4
  Filter, s = (x, y, θ) (ours)  34.2  38.7  42.7  46.1  48.2  48.4  49.9  51.2  44.9

Table 1: Goal prediction results given a natural language navigation instruction and a fixed trajectory that either moves towards the goal, or randomly, with 50:50 probability. We evaluate predictions at each time step, although on average the goal is not seen until later time steps. Our filtering approach that explicitly models trajectories outperforms LingUNet Blukis et al. (2018); Misra et al. (2018) across all time steps (i.e., regardless of map sparsity). We confirm that adding heading to the filter state provides a robust boost.

We first evaluate the goal prediction performance of our proposed mapper and filter architecture in a setting with fixed trajectories. Trajectories are generated by an agent that moves towards the goal with 50% probability, or randomly otherwise. As an ablation, we also report results for our model excluding heading from the agent’s filter state, i.e., $s_t = (x_t, y_t)$, to quantify the value of encoding the agent’s orientation in the motion and observation models. We compare to two baselines as follows:

LingUNet baseline. As a strong neural net baseline, we compare to LingUNet Misra et al. (2018) – a language-conditioned variant of the U-Net image-to-image architecture Ronneberger et al. (2015) – that has recently been applied to goal location prediction in the context of a simulated quadrocopter instruction-following task Blukis et al. (2018). Following Blukis et al. (2018) we train a 5-layer LingUNet module conditioned on the sentence encoding and the semantic map to directly predict the goal location distribution (as well as a path visitation distribution, as an auxiliary loss) in a single forward pass. As we implement our observation model using a (smaller, 3-layer) LingUNet, the LingUNet baseline resembles an ablated single-step version of our model that dispenses with the decoder generating latent observations and actions as well as the motion model. Note that we use the same mapper architecture for our filter and for LingUNet.

Hand-coded baseline. We additionally compare to a hand-coded goal prediction baseline designed to exploit biases in the R2R dataset Anderson et al. (2018a) and the provided trajectories. We first calculate the mean straight-line distance from the start position to the goal across the entire training set, which is 7.6m. We then select as the predicted goal the position in the map at a radius of 7.6m from the start position that has the greatest observed map area in a Gaussian-weighted neighborhood.

As illustrated in Table 1, our proposed filter architecture that explicitly models belief over trajectories that could be taken by a human demonstrator outperforms a strong LingUNet baseline at predicting the goal location, regardless of the sparsity of the map. We confirm that removing heading from the agent’s state degrades our model’s performance significantly, demonstrating that the model is using the agent’s heading to learn about relative orientation. Finally, the poor performance of the hand-coded baseline confirms that the goal location cannot be trivially predicted from the trajectory.

5.3 Vision-and-Language Navigation results

Having established the efficacy of our approach for goal prediction from a partial map, we turn to the full VLN task that requires our agent to take actions to actually reach the goal.

Evaluation. In VLN, an episode is successful if the final navigation error is less than 3m. We report our agent’s average success rate at reaching the goal (SR), and SPL Anderson et al. (2018b), a recently proposed summary measure of an agent’s navigation performance that balances navigation success against trajectory efficiency (higher is better). We also report trajectory length (TL) and navigation error (NE) in meters, as well as oracle success (OS), defined as the agent’s success rate under an oracle stopping rule.
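As a worked illustration of these metrics, the sketch below computes SR, OS, SPL, NE and TL for a batch of episodes. SPL follows the definition in Anderson et al. (2018b): the success indicator weighted by the ratio of the shortest-path length to the larger of the shortest-path length and the agent's actual path length. The function and argument names are our own, and oracle_error is assumed to be the minimum distance to the goal reached at any point along the trajectory.

```python
import numpy as np

def vln_metrics(nav_error, path_length, shortest_length, oracle_error, threshold=3.0):
    """Summarize navigation performance over a set of episodes.

    nav_error: final distance to the goal in meters, per episode.
    path_length: length of the path the agent actually took, in meters.
    shortest_length: shortest-path distance from start to goal, in meters.
    oracle_error: closest distance to the goal along the trajectory (assumed input).
    """
    nav_error = np.asarray(nav_error, dtype=float)
    path_length = np.asarray(path_length, dtype=float)
    shortest_length = np.asarray(shortest_length, dtype=float)
    oracle_error = np.asarray(oracle_error, dtype=float)

    success = nav_error < threshold       # episode succeeds if it ends within 3m
    oracle = oracle_error < threshold     # success under an oracle stopping rule
    spl = success * shortest_length / np.maximum(path_length, shortest_length)
    return {
        "TL": float(path_length.mean()),
        "NE": float(nav_error.mean()),
        "OS": float(oracle.mean()),
        "SR": float(success.mean()),
        "SPL": float(spl.mean()),
    }

metrics = vln_metrics(
    nav_error=[1.0, 5.0, 2.0],
    path_length=[10.0, 8.0, 20.0],
    shortest_length=[10.0, 9.0, 10.0],
    oracle_error=[1.0, 2.5, 2.0],
)
```

In this toy batch the third episode succeeds but takes twice the shortest path, so it contributes only 0.5 to SPL, which is how the measure penalizes inefficient trajectories.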

Model                                 | Val-Seen                           | Val-Unseen                         | Test
                                      |     TL     NE     OS     SR    SPL |     TL     NE     OS     SR    SPL |     TL     NE     OS     SR    SPL
RPA Wang et al. (2018)                |   8.46   5.56   0.53   0.43      - |   7.22   7.65   0.32   0.25      - |   9.15   7.53   0.32   0.25   0.23
Speaker-Follower Fried et al. (2018)  |      -   3.36   0.74   0.66      - |      -   6.62   0.45   0.36      - |  14.82   6.62   0.44   0.35   0.28
RCM Wang et al. (2019)                |  10.65   3.53   0.75   0.67      - |  11.46   6.09   0.50   0.43      - |  11.97   6.12   0.50   0.43   0.38
Self-Monitoring Ma et al. (2019a)     |      -   3.18   0.77   0.68   0.58 |      -   5.41   0.59   0.47   0.34 |  18.04   5.67   0.59   0.48   0.35
Regretful Agent Ma et al. (2019b)     |      -   3.23   0.77   0.69   0.63 |      -   5.32   0.59   0.50   0.41 |  13.69   5.69   0.56   0.48   0.40
FAST Ke et al. (2019)                 |      -      -      -      -      - |   21.1   4.97      -   0.56   0.43 |  22.08   5.14   0.64   0.54   0.41
Back Translation Tan et al. (2019)    |   11.0   3.99      -   0.62   0.59 |   10.7   5.22      -   0.52   0.48 |  11.66   5.23   0.59   0.51   0.47
Seq2Seq Anderson et al. (2018a)       |  11.33   6.01   0.52   0.39      - |   8.39   7.81   0.28   0.22      - |   8.13   7.85   0.26   0.20   0.18
Ours                                  |   9.86   7.52   0.36   0.31   0.27 |  10.33   7.70   0.40   0.31   0.27 |   9.14   7.71   0.38   0.33   0.30
Table 2: Results for the full VLN task on the R2R dataset. Our model achieves credible results for a new model class trained exclusively with imitation learning (no RL) and without any data augmentation (Aug) and without an internal ‘speaker’ model for re-ranking trajectories (Spk).
Figure 3: Left: Textual attention during latent observation and action generation is appropriately more focused towards action words (‘left’, ‘right’) for the motion model, and visual words (‘bedroom’, ‘corridor’, ‘table’) for the observation model. Right: Top-down view illustrating the agent’s expanding semantic spatial map (lighter-colored region), navigation graph (blue dots) and corresponding belief (red heatmap and circles with white heading markers) when following this instruction. Early in the episode, the map is largely unexplored, and the belief is approximately correct but dispersed. Later in the episode, the agent has become confident about the correct goal location, despite many now-visible alternative paths.

Results. In Table 2, we present our results in the context of state-of-the-art methods; however, as noted by the RL, Aug, and Spk columns in the table, these approaches include reinforcement learning strategies, complex data augmentation, or embedded ‘speaker’ models. These are non-trivial extensions that are the result of a community effort Wang et al. (2019, 2018); Ma et al. (2019a, b); Fried et al. (2018); Tan et al. (2019); Ke et al. (2019) and are orthogonal to our own contribution. We also use a less powerful CNN (ResNet-34 vs. ResNet-152 in prior work). For direct comparison, we consider the Seq2Seq model of Anderson et al. (2018a), which we outperform significantly on unseen environments, increasing success rate on test by 13 percentage points (from 0.20 to 0.33). We find these results promising given this is the first work to explore such a drastically different model class (i.e., maintaining a metric map and a probability distribution over alternative trajectories in the map). Our model also exhibits less overfitting than other approaches, achieving the same success rate (0.31) on both seen (val-seen) and unseen (val-unseen) environments.

Further, our filtering approach allows us greater insight into the model. We examine a qualitative example in Figure 3. On the left, we can see the agent attends to appropriate visual and direction words when generating latent observations and actions, supporting the intuition in Figure 1. On the right, we can see the growing confidence our goal predictor places on the correct location as more of the map is explored – despite the increasing number of visible alternatives. We provide further examples (including insight into the motion and observation models) in the supplementary video.

6 Conclusion

We show that instruction following can be formulated as Bayesian state tracking in a model that maintains a semantic spatial map of the environment, and an explicit probability distribution over alternative possible trajectories in that map. To evaluate our approach we choose the complex problem of Vision-and-Language Navigation (VLN). This represents a significant departure from existing work in the area, and required augmenting the Matterport3D simulator with depth. Empirically, we show that our approach outperforms recent alternative approaches to goal location prediction, and achieves credible results on the full VLN task without using RL or data augmentation – while offering reduced overfitting to seen environments and unprecedented interpretability.


References

  • Anderson et al. [2018a] Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton van den Hengel. Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments. In CVPR, 2018a.
  • Wang et al. [2019] Xin Wang, Qiuyuan Huang, Asli Celikyilmaz, Jianfeng Gao, Dinghan Shen, Yuan-Fang Wang, William Yang Wang, and Lei Zhang. Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation. In CVPR, 2019.
  • Wang et al. [2018] Xin Wang, Wenhan Xiong, Hongmin Wang, and William Yang Wang. Look before you leap: Bridging model-free and model-based reinforcement learning for planned-ahead vision-and-language navigation. In ECCV, September 2018.
  • Ma et al. [2019a] Chih-Yao Ma, Jiasen Lu, Zuxuan Wu, Ghassan AlRegib, Zsolt Kira, Richard Socher, and Caiming Xiong. Self-monitoring navigation agent via auxiliary progress estimation. In ICLR, 2019a.
  • Ma et al. [2019b] Chih-Yao Ma, Zuxuan Wu, Ghassan AlRegib, Caiming Xiong, and Zsolt Kira. The regretful agent: Heuristic-aided navigation through progress estimation. In CVPR, 2019b.
  • Fried et al. [2018] Daniel Fried, Ronghang Hu, Volkan Cirik, Anna Rohrbach, Jacob Andreas, Louis-Philippe Morency, Taylor Berg-Kirkpatrick, Kate Saenko, Dan Klein, and Trevor Darrell. Speaker-follower models for vision-and-language navigation. In NeurIPS, 2018.
  • Tan et al. [2019] Hao Tan, Licheng Yu, and Mohit Bansal. Learning to navigate unseen environments: Back translation with environmental dropout. In NAACL, 2019.
  • Ke et al. [2019] Liyiming Ke, Xiujun Li, Yonatan Bisk, Ari Holtzman, Zhe Gan, Jingjing Liu, Jianfeng Gao, Yejin Choi, and Siddhartha Srinivasa. Tactical rewind: Self-correction via backtracking in vision-and-language navigation. In CVPR, 2019.
  • Gupta et al. [2017] Saurabh Gupta, James Davidson, Sergey Levine, Rahul Sukthankar, and Jitendra Malik. Cognitive mapping and planning for visual navigation. In CVPR, 2017.
  • Blukis et al. [2018] Valts Blukis, Dipendra Misra, Ross A Knepper, and Yoav Artzi. Mapping navigation instructions to continuous control actions with position-visitation prediction. In CoRL, 2018.
  • Henriques and Vedaldi [2018] J. F. Henriques and A. Vedaldi. MapNet: An allocentric spatial memory for mapping environments. In CVPR, 2018.
  • Gordon et al. [2018] D. Gordon, A. Kembhavi, M. Rastegari, J. Redmon, D. Fox, and A. Farhadi. IQA: Visual question answering in interactive environments. In CVPR, 2018.
  • Thrun et al. [2005] Sebastian Thrun, Wolfram Burgard, and Dieter Fox. Probabilistic robotics. MIT Press, 2005.
  • Jonschkowski and Brock [2016] Rico Jonschkowski and Oliver Brock. End-to-end learnable histogram filters. In Workshop on Deep Learning for Action and Interaction at the Conference on Neural Information Processing Systems (NIPS), 2016.
  • Anderson et al. [2018b] Peter Anderson, Angel Chang, Devendra Singh Chaplot, Alexey Dosovitskiy, Saurabh Gupta, Vladlen Koltun, Jana Kosecka, Jitendra Malik, Roozbeh Mottaghi, Manolis Savva, and Amir R. Zamir. On evaluation of embodied navigation agents. arXiv:1807.06757, 2018b.
  • Chang et al. [2017] Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments. International Conference on 3D Vision (3DV), 2017.
  • Winograd [1971] Terry Winograd. Procedures as a representation for data in a computer program for understanding natural language. Technical report, Massachusetts Institute of Technology, 1971.
  • Hochreiter and Schmidhuber [1997] Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation, 1997.
  • Wierstra et al. [2007] Daan Wierstra, Alexander Foerster, Jan Peters, and Juergen Schmidhuber. Solving deep memory pomdps with recurrent policy gradients. In International Conference on Artificial Neural Networks, 2007.
  • Jaderberg et al. [2017] Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, and Koray Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. In ICLR, 2017.
  • Mirowski et al. [2017] Piotr Mirowski, Razvan Pascanu, Fabio Viola, Hubert Soyer, Andy Ballard, Andrea Banino, Misha Denil, Ross Goroshin, Laurent Sifre, Koray Kavukcuoglu, et al. Learning to navigate in complex environments. In ICLR, 2017.
  • Savva et al. [2017] Manolis Savva, Angel X. Chang, Alexey Dosovitskiy, Thomas Funkhouser, and Vladlen Koltun. MINOS: Multimodal indoor simulator for navigation in complex environments. arXiv:1712.03931, 2017.
  • Das et al. [2018] Abhishek Das, Samyak Datta, Georgia Gkioxari, Stefan Lee, Devi Parikh, and Dhruv Batra. Embodied Question Answering. In CVPR, 2018.
  • Oh et al. [2016] Junhyuk Oh, Valliappa Chockalingam, Satinder Singh, and Honglak Lee. Control of memory, active perception, and action in minecraft. In ICML, 2016.
  • Parisotto and Salakhutdinov [2018] Emilio Parisotto and Ruslan Salakhutdinov. Neural map: Structured memory for deep reinforcement learning. In ICLR, 2018.
  • Savinov et al. [2018] Nikolay Savinov, Alexey Dosovitskiy, and Vladlen Koltun. Semi-parametric topological memory for navigation. In ICLR, 2018.
  • Haarnoja et al. [2016] Tuomas Haarnoja, Anurag Ajay, Sergey Levine, and Pieter Abbeel. Backprop KF: Learning discriminative deterministic state estimators. In NIPS, 2016.
  • Jonschkowski et al. [2018] Rico Jonschkowski, Divyam Rastogi, and Oliver Brock. Differentiable Particle Filters: End-to-End Learning with Algorithmic Priors. In Proceedings of Robotics: Science and Systems (RSS), 2018.
  • Karkus et al. [2018] Peter Karkus, David Hsu, and Wee Sun Lee. Particle filter networks with application to visual localization. In Proceedings of the Annual Conference on Robot Learning (CoRL), 2018.
  • Xingjian et al. [2015] SHI Xingjian, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In NIPS, 2015.
  • Cho et al. [2014] Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, 2014.
  • Uhrig et al. [2017] Jonas Uhrig, Nick Schneider, Lukas Schneider, Uwe Franke, Thomas Brox, and Andreas Geiger. Sparsity invariant CNNs. In 2017 International Conference on 3D Vision (3DV), pages 11–20. IEEE, 2017.
  • Bahdanau et al. [2015] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.
  • Luong et al. [2014] Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Effective approaches to attention-based neural machine translation. In EMNLP, 2014.
  • Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
  • Misra et al. [2018] Dipendra Misra, Andrew Bennett, Valts Blukis, Eyvind Niklasson, Max Shatkhin, and Yoav Artzi. Mapping instructions to actions in 3d environments with visual goal prediction. In EMNLP, 2018.
  • Ronneberger et al. [2015] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
  • Nathan Silberman and Fergus [2012] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from RGBD images. In ECCV, 2012.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • Russakovsky et al. [2015] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. Imagenet large scale visual recognition challenge. IJCV, 2015.
  • Tompson et al. [2015] Jonathan Tompson, Ross Goroshin, Arjun Jain, Yann LeCun, and Christoph Bregler. Efficient object localization using convolutional networks. In CVPR, 2015.
  • Kingma and Ba [2014] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Supplementary Materials

Implementation Details

Simulator. In experiments, we set the Matterport3D simulator Anderson et al. (2018a) to generate pixel images with a degree vertical field of view. To capture more of the floor and nearby obstacles (and less of the roof) we set the camera elevation to degrees down from horizontal. At each panoramic viewpoint location in the simulator we capture a horizontal sweep containing 12 images at 30 degree increments, which are projected into the map in a single time step as described in Section 4.1 of the main paper.

Mapper. For our CNN implementation we use a ResNet-34 He et al. (2016) architecture that is pretrained on ImageNet Russakovsky et al. (2015). We found that fine-tuning the CNN while training our model mainly improved performance on the Val-Seen set, and so we left the CNN parameters fixed in the reported experiments. To extract the visual feature representation we concatenate the output from the CNN’s last 2 layers to provide a representation. The dimensionality of our map representation is fixed at and each cell represents a square region with side length m (the entire map is thus ). In the mapper’s convolutional Xingjian et al. (2015) GRU Cho et al. (2014) we use convolutional filters and we train with spatial dropout Tompson et al. (2015) of in both the input-to-state and state-to-state transitions with fixed dropout masks for the duration of each episode.

Filter. In the instruction encoder we use a hidden state size of for both the forward and backward encoders, and a word embedding size of . We use a motion kernel size of 7, but we upscale the motion kernel by a scale factor of before applying it such that the agent can move a maximum of m in a single time step.
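To make the role of the motion kernel concrete, the following is a simplified histogram-filter step in NumPy: the kernel is upscaled, the belief is convolved with it (motion update), and the result is reweighted by an observation likelihood (observation update). This is a sketch under several assumptions rather than the model described above: it tracks position only and omits heading, the kernel and likelihood are hand-set rather than predicted from the instruction, and the scale factor of 3 and the nearest-neighbour upsampling are hypothetical choices. Only the kernel size of 7 comes from the text.

```python
import numpy as np

def upscale_kernel(kernel, scale):
    """Nearest-neighbour upsampling (assumed), renormalized to sum to 1."""
    big = np.repeat(np.repeat(kernel, scale, axis=0), scale, axis=1)
    return big / big.sum()

def predict(belief, kernel):
    """Motion update: convolve the belief with an odd-sized motion kernel.

    kernel[c + u, c + v], with c the center index, is the probability of a
    displacement of (u, v) cells. Mass pushed outside the map is discarded.
    """
    belief = np.asarray(belief, dtype=float)
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(belief, ((ph, ph), (pw, pw)))
    flipped = kernel[::-1, ::-1]  # flip so the sliding sum is a true convolution
    out = np.zeros_like(belief)
    h, w = belief.shape
    for dy in range(kh):
        for dx in range(kw):
            out += flipped[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out

def update(belief, likelihood):
    """Observation update: reweight by the likelihood and renormalize."""
    posterior = belief * likelihood
    return posterior / posterior.sum()

# The agent is known to be at cell (10, 10).
belief = np.zeros((31, 31))
belief[10, 10] = 1.0

# 7x7 kernel placing all mass on "two kernel cells to the right".
kernel = np.zeros((7, 7))
kernel[3, 5] = 1.0

predicted = predict(belief, upscale_kernel(kernel, scale=3))  # scale is hypothetical

# Hand-set observation likelihood favouring cell (10, 16).
likelihood = np.full((31, 31), 0.1)
likelihood[10, 16] = 1.0
posterior = update(predicted, likelihood)
```

Upscaling lets a small kernel express larger displacements: with a scale factor of 3, the 7x7 kernel becomes 21x21 and can move belief mass up to 10 cells from its source in one step.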

Training. In training, we use the Adam optimizer Kingma and Ba (2014) with an initial learning rate of 1e-3, weight decay of 1e-7, and a batch size of 5. In the goal prediction experiment, all models are trained for 8K iterations, after which all models have converged. In the full VLN experiment, our models are trained for 17.5K iterations. Training the model takes around 1 day for goal prediction, and 2.5 days for the full VLN task, using a single Titan X GPU.

Visualizations. In the main paper and the supplementary video (to be released), we depict top-down floorplan visualizations of Matterport environments to provide greater insight into the model’s behavior. These visualizations are rendered from textured meshes in the Matterport3D dataset Chang et al. (2017), using the provided GAPS software which was modified to render using an orthographic projection.