Evolving Graphical Planner: Contextual Global Planning for Vision-and-Language Navigation

by   Zhiwei Deng, et al.
Princeton University

The ability to perform effective planning is crucial for building an instruction-following agent. When navigating through a new environment, an agent is challenged with (1) connecting the natural language instructions with its progressively growing knowledge of the world; and (2) performing long-range planning and decision making in the form of effective exploration and error correction. Current methods are still limited on both fronts despite extensive efforts. In this paper, we introduce the Evolving Graphical Planner (EGP), a model that performs global planning for navigation based on raw sensory input. The model dynamically constructs a graphical representation, generalizes the action space to allow for more flexible decision making, and performs efficient planning on a proxy graph representation. We evaluate our model on a challenging Vision-and-Language Navigation (VLN) task with photorealistic images and achieve superior performance compared to previous navigation architectures. For instance, we achieve a 53% success rate on the Room-to-Room navigation task through pure imitation learning, outperforming previous navigation architectures by up to 5%.


1 Introduction

Recent work has made remarkable progress towards building autonomous agents that navigate by following instructions fried2018speaker ; fu2019language ; hao2020towards ; chen2019touchdown ; mirowski2018learning ; oh2017zero ; thomason2019vision ; roman2020rmm ; kajic2020learning and constructing memory structures for maps parisotto2017neural ; savinov2018semi ; laskin2020sparse . An important problem setting within this space is the paradigm of online navigation, where an agent needs to perform navigation based on goal descriptions in an unseen environment using a limited number of steps ma2019self ; anderson2018vision .

In order to successfully navigate through an unseen environment, an agent needs to overcome two key challenges. First, the instructions given to the agent are natural language descriptions of the goal and the landmarks along the way; these descriptions need to be grounded onto the evolving visual world that the agent is observing. Second, the agent needs to perform non-trivial planning over a large action space, including: 1) deciding which step to take next to resolve ambiguities in the instructions through novel observations and 2) gaining a better understanding of the environment layout in order to progress towards the goal or recover from its prior mistakes. Notably this planning requires not only selecting from an increasingly large set of possible actions but also performing complex long-term reasoning.

Existing work tackles only one or two components of the above and may require additional pre-processing steps. Some approaches are constrained to local control policies ma2019self ; fried2018speaker or use rule-based algorithms such as beam search fried2018speaker ; ke2019tactical ; ma2019self to perform localized path corrections. Others focus on processing long-range observations instead of actions fang2019scene or employ offline pre-training schemes to learn topological structures savinov2018semi ; laskin2020sparse ; liu2020hallucinative . The latter is challenging since accurate construction of graphs is non-trivial and requires special adaptation to work during real-time navigation laskin2020sparse ; liu2020hallucinative .

Figure 1: Under the guidance of natural language instruction, the autonomous agent needs to navigate through the environment from the start state to the target location (red flag). Our proposed Evolving Graphical Planner (EGP) constructs a dynamic representation and makes decisions in a global action space (right). With the EGP, the agent, currently in the orange node, maintains and reasons over the evolving graph to select the next node to visit (green) from possible choices (blue).

In this paper, we propose the Evolving Graphical Planner (EGP) (Figure 1), which 1) dynamically constructs a graphical map of the environment in an online fashion during exploration and 2) incorporates a global planning module for selecting actions. EGP can operate directly on raw sensory inputs in partially observable settings by building a structured representation of the geometric layout of the environment using discrete symbols to represent visited states and unexplored actions. This expressive representation allows our agent to choose from a greater number of global actions conditioned on the text instruction and perform course corrections if needed.

Incorporating a global navigation module is challenging since we do not always have access to ground-truth supervision from the environment. Further, the ever-expanding size of the global graph requires a scalable action selection module. To solve the first challenge, we introduce a novel method for training our planning modules using imitation learning; this allows the agent to efficiently learn how to select global actions and backtrack when necessary. For the second, we introduce proxy graphs, which are local approximations of the entire map and allow for more scalable planning. Our entire model is end-to-end differentiable under a pure imitation learning framework.

We test EGP on two benchmarks for 3D navigation with instructions – Room-to-Room anderson2018vision and Room-for-Room jain2019stay . Our model outperforms several state-of-the-art backbone architectures on both datasets – e.g., on Room-to-Room, we achieve a 5% improvement over the Regretful agent ma2019regretful on success rate. We also perform a series of ablation studies on our model to justify the design choices.

2 Related work

Embodied Navigation Agent Many recent papers develop neural architectures for navigation tasks ma2019regretful ; fried2018speaker ; ma2019self ; zhu2017target ; gupta2017cognitive ; andreas2017modular ; fang2019scene ; xia2020interactive . Vision-and-Language Navigation (VLN) anderson2018vision ; chen2019touchdown is one representative task that focuses on language-driven navigation across photo-realistic 3D environments. Anderson et al. anderson2018vision propose the Room-to-Room benchmark and an attention-based sequence-to-sequence method. Fried et al. fried2018speaker extend the model with a pragmatic agent that can synthesize data through a speaker model. With an emphasis on grounding, the self-monitoring agent ma2019self adopts a co-grounding module and a progress estimation auxiliary task for more progress-sensitive alignment. Similarly, an intrinsic reward wang2018nervenet is introduced to improve cross-modal alignment for the navigation agent. Ma et al. ma2019regretful extend these works toward a graph-search-like algorithm by adding a one-step regretful action. Anderson et al. anderson2019chasing propose an agent formulated under Bayesian filtering with a global mapper. Hand-crafted decoding algorithms are also used as a post-processing technique, but may lead to long trajectories and difficulties in joint optimization fried2018speaker ; ke2019tactical ; karaman2010incremental .

Navigation memory structures A recent emerging trend in navigation focuses on extending agents with different types of memory structures. Simple structures: Parisotto et al. parisotto2017neural propose a tensor-based memory structure with operations the agent can use to access memory and perform navigation. Fang et al. fang2019scene adopt a transformer to extract historical memory information stored in a sequence. Topological structures: landmark-based topological representations have been shown to be effective in pre-exploration-based navigation tasks without the need for externally provided camera poses or ego-motion information savinov2018semi . Laskin et al. laskin2020sparse propose methods to sparsify the graphical memory through consistency checks and graph cleanups. Liu et al. liu2020hallucinative use a contrastive energy model for building more accurate edges in the memory graph.

Graphical representation Graph-based methods have been shown to be an effective intermediate representation for information exchange defferrard2016convolutional ; li2015gated ; henaff2015deep ; duvenaud2015convolutional ; DengVHM16 . In image and video understanding, graphical representations are used for visual question answering teney2017graph ; santoro2017simple , video captioning pan2020spatio , and action recognition guo2018neural ; huang2017unsupervised . In robotics, Huang et al. huang2019neural propose the Neural Task Graph (NTG) as an intermediate modularized representation leading to better generalization. Graph Neural Networks have been demonstrated to be effective in learning structured policies wang2018nervenet and automatic robot design wang2019neural .

Supervision strategies for imitation learning Properly training sequential models is challenging due to the drift issue ross2011reduction . DAgger ross2011reduction aggregates datasets with expert supervision provided for sequences sampled from the student model. Scheduled sampling bengio2015scheduled tackles this problem by mixing samples from both the ground truth and the student model. Professor forcing lamb2016professor shows a more effective approach through adversarial domain adaptation. OCD sabour2018optimal adopts online-computed characters as ground truth for speech recognition. In this paper, we instead propose a graph-augmented strategy to provide expert supervision, which alleviates the mismatch issue between instructions and newly computed expert trajectories ma2019self .

3 Model

Problem definition and setup We follow the standard instruction-following navigation problem setting anderson2018vision . Given the demonstration dataset $\mathcal{D} = \{(X, P, E)\}$, where $X = \{x_1, \dots, x_L\}$ is the language instruction with length $L$, $P$ is the expert navigation trajectory and $E$ is the environment paired with the data, the agent is trained to imitate the expert behaviours to correctly follow the instruction and navigate to the goal location. At each navigation step $t$, the agent is located at state $s_t$ with observation $o_t$ from the environment and performs an action $a_t \in \mathcal{A}_{s_t}$, where $\mathcal{A}_{s_t}$ is the decision space at state $s_t$. In our task, the decision space is the set of navigable locations fried2018speaker .

To set up an agent, we build upon a basic navigation architecture ma2019self and utilize its language encoder and attention-based decoder. The agent encodes the language instruction $X$ into a hidden encoding $h^X$ through an LSTM gers1999learning . Conditioned on $h^X$, the agent uses an attention-based decoder to model the distribution over the trajectory. At every step, the decoder takes in the encoding $h^X$, the observation $o_t$ and a maintained hidden memory $h_t$ to produce the per-step action probability distribution $p(a_t \mid X, o_{1:t})$. Note that such a navigation agent suffers from a constrained local action set and lacks the ability to perform long-range planning over the navigation space and to effectively correct errors along the way.

Our approach In this section, we introduce our Evolving Graphical Planner (EGP), an end-to-end global planner that navigates an agent through a new environment by re-defining the decision space and performing long-term planning over graphs. With the graphical representation, the agent accumulates knowledge about the unseen environment and has access to the entire action space. Each long-distance navigation is then reduced to one-step decision making, leading to easier and more efficient exploration and error correction. We also show that the graphical representation elicits a new supervision strategy for effectively training the imitation agent, which alleviates the mismatch issue in standard navigation agent training ma2019self . The global planning is performed on an efficient proxy representation, making the model scalable to navigation with longer horizons.

The proposed model consists of two core components: (1) an evolving graphical representation that generalizes the local action space; (2) a planning core that performs multi-channel information propagation on a proxy representation. Starting from the initial information, the EGP agent gradually expands an internal representation of the explored space (Section 3.1) and performs planning efficiently over the graphs (Section 3.2).

3.1 Evolving Graphical Planner: Graphical representation

We start by introducing the notation for the representation. Let $G_t = (V_t, E_t, C_t)$ denote the graph at time $t$, where $V_t = \{v_i\}$ is the set of node embeddings, $E_t = \{e_{ij}\}$ is the set of edge embeddings between nodes $i$ and $j$, and $C_t$ is the graph connectivity tensor with $F$ function types. In our graph, nodes are separated into two types: leaf nodes represent possible actions (e.g. navigable locations) and internal nodes represent visited locations, as shown in Figure 2.

Graph construction The agent builds up the graphical representation progressively during navigation. Initialized as an empty graph, $G_t$ expands its node set, edge set and connectivity tensor through the observations and local connection information given by the environment. When receiving an observation $o_t$ and the set of information over possible actions at the new state, the agent maps the current location information and action information through two separate neural networks, resulting in the node embeddings for the visited node and its leaf nodes. The incoming and outgoing edges of new nodes are determined through the local map information and the moving direction of the agent. Three function types are considered and stored in the tensor $C_t$: the forward and backward directions between two visited nodes, and the connection from a visited node to a leaf node. With the new graph $G_{t+1}$, the agent continues navigation and the loop repeats. To reduce the memory cost, the model has the option to selectively add nodes to the graph. We use the top-K leaf nodes, ranked by confidence scores in the policy distribution, to expand the graph, as shown in Figure 2.
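The bookkeeping behind this expansion can be sketched as follows. This is a minimal illustration of the data structure, not the paper's implementation; all class, method and node names are hypothetical, and embeddings are omitted to keep the sketch small:

```python
# Illustrative sketch of the evolving graph bookkeeping. Internal nodes
# are visited states; leaf nodes are as-yet unexplored candidate actions.

class EvolvingGraph:
    def __init__(self):
        self.nodes = {}   # node_id -> {"kind": "visited" | "leaf"}
        self.edges = {}   # (src, dst) -> function type

    def expand(self, state_id, action_ids, top_k=16):
        """Add the current state and edges to its top-K candidate actions."""
        self.nodes[state_id] = {"kind": "visited"}
        for aid in action_ids[:top_k]:
            if aid not in self.nodes:   # do not demote already-visited nodes
                self.nodes[aid] = {"kind": "leaf"}
            self.edges[(state_id, aid)] = "to_leaf"

    def visit(self, prev_id, new_id):
        """Promote a leaf to a visited node after the agent moves there."""
        self.nodes[new_id]["kind"] = "visited"
        self.edges[(prev_id, new_id)] = "forward"
        self.edges[(new_id, prev_id)] = "backward"

g = EvolvingGraph()
g.expand("s0", ["a1", "a2", "a3"])  # start state with three candidates
g.visit("s0", "a1")                 # agent moves to a1
g.expand("a1", ["a4"])              # a1 contributes new candidates
```

The three edge labels mirror the three function types stored in the connectivity tensor; unchosen leaves ("a2", "a3") remain in the graph as globally reachable actions.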

Figure 2: Overall scheme of the Evolving Graphical Planner model. (1): Our graphical representation progressively expands during navigation (light-yellow = visited nodes; orange = current node; blue = potential action nodes; green = selected node). Based on the graphical representation, the agent performs planning over the actions, and selects the next action through student sampling. The top-K nodes on the current state will be kept in the graph. (2): The model performs multi-channel planning on a proxy graph pooled from the entire graph representation. The refined node states are unpooled back to the entire graph and used to compute the probability distribution over actions.

Navigation with graph jump With the graph representing the entire action space, the agent can directly execute actions that were left unexplored at previously visited locations. For an action proposed by the planner, the agent computes the shortest-path route on the graph and follows it. This allows the agent to “jump” around the full graph-defined action space and execute unexplored actions through a single-step decision. The long-range decision space also makes error correction easier: a single-step decision is all the agent needs to backtrack to the correct location. While executing the intermediate steps of the route generated through the graph, the agent keeps updating its hidden states based on the observations from the environment.
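Since the evolving graph is unweighted, the jump reduces to a breadth-first shortest-path query. A minimal sketch (function and node names are illustrative):

```python
from collections import deque

def jump_route(edges, start, target):
    """BFS shortest path over the evolving graph, so that a single global
    decision expands into a concrete multi-step navigation route."""
    adj = {}
    for (u, v) in edges:
        adj.setdefault(u, []).append(v)
    frontier, parent = deque([start]), {start: None}
    while frontier:
        u = frontier.popleft()
        if u == target:           # reconstruct route by walking parents back
            route = []
            while u is not None:
                route.append(u)
                u = parent[u]
            return route[::-1]
        for v in adj.get(u, []):
            if v not in parent:
                parent[v] = u
                frontier.append(v)
    return None                   # target not reachable in current graph

# Toy graph: two visited states connected both ways, plus one leaf action.
edges = [("s0", "s1"), ("s1", "s0"), ("s1", "s2"), ("s2", "s1"), ("s0", "leaf_a")]
route = jump_route(edges, "s2", "leaf_a")
```

From `s2`, executing the unexplored leaf at `s0` costs the agent one planner decision; the internal route `s2 → s1 → s0 → leaf_a` is then walked step by step while hidden states are updated.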

Supervising imitation learning Properly training an imitation learning agent is a challenging problem sabour2018optimal ; ross2011reduction ; lamb2016professor . Student forcing with a newly computed route as supervision is a widely used solution in navigation ma2019self ; ma2019regretful ; fried2018speaker . However, the mismatch between the new route and the language instructions can provide noisy and incorrect signals for learning language grounding and navigation. We provide a new graph-augmented solution for computing the supervision for each student-sampled trajectory. Assume a metric $m(\cdot, \cdot)$ between two navigation trajectories (e.g. ilharco2019general ). With the graph memorizing the entire possible action space, the subset of nodes on the ground-truth route is guaranteed to exist in $G_t$. We choose the node in $G_t$ that maximizes the metric as the ground-truth action for step $t$. This provides a “correction” signal that indicates the best action to take for correcting the mistake made by the agent.
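The label computation amounts to an argmax over graph nodes under the path metric. The sketch below is illustrative: the paper uses a fidelity metric such as CLS, whereas `overlap_metric` here is a deliberately simple stand-in, and all names are hypothetical:

```python
def graph_augmented_label(candidate_nodes, agent_path, gt_path, metric):
    """Pick, among reachable graph nodes, the one whose extended trajectory
    scores best under the path metric -- the 'correction' supervision signal."""
    best, best_score = None, float("-inf")
    for node in candidate_nodes:
        score = metric(agent_path + [node], gt_path)
        if score > best_score:
            best, best_score = node, score
    return best

def overlap_metric(path, gt_path):
    """Toy stand-in metric: fraction of ground-truth nodes covered."""
    return len(set(path) & set(gt_path)) / len(gt_path)

# The agent drifted after n2; among reachable nodes, n7 lies on the
# ground-truth route, so it becomes the corrective label.
label = graph_augmented_label(["n3", "n7", "n9"], ["n1", "n2"],
                              ["n1", "n2", "n7", "n8"], overlap_metric)
```

Because the label is always a node already present in the graph, no new route to the goal has to be recomputed, avoiding the instruction/trajectory mismatch of standard student forcing.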

3.2 Evolving Graphical Planner: Planning Core

With the graphical representation $G_t$, a straightforward method is to directly perform planning on the full graph using the embeddings of nodes and edges. However, a progressively growing graph can lead to high costs and limit the scalability of the model. Pre-exploration-based methods often tackle this issue by performing offline sparsification with multiple rounds of cleanups on pre-collected graphs laskin2020sparse containing full knowledge of the map, which is unsuitable in online navigation settings. In this section, we show that, interestingly, effective planning need not be performed on the full graph, and present the second component of our model, which performs goal-driven information extraction dynamically on the entire graph to build a condensed proxy representation used for planning. Our model utilizes Graph Neural Networks (GNNs) as a basic operator, which we explain at the end of the section.

Proxy graph Denote the proxy graph as $\tilde{G}_t = (\tilde{V}_t, \tilde{E}_t, \tilde{C}_t)$, where $\tilde{V}_t$, $\tilde{E}_t$ and $\tilde{C}_t$ are the pooled node embedding set, edge embedding set and connectivity tensor with function types, respectively. The proxy graph contains a fixed number of nodes $K$, invariant to the growing graph size in $G_t$. We hypothesize that, given the instruction information and the current state of the agent, only a subset of nodes provides useful information for planning. To construct the proxy representation, the model uses a normalized pooling similar to ying2018hierarchical . Differently, our graphical representation consists of a richer set of information, including edge states and function types in the connectivity tensor besides the node states. We describe the process of generating the proxy representation and the corresponding planning as follows.

Given the entire graphical representation $G_t$, the planner contains two functions, $f_{\text{pool}}$ and $f_{\text{plan}}$. The function $f_{\text{pool}}$ takes in the agent state information $c_t$ and constructs the proxy representation through a lightweight neural network. Assume the pooling function uses $D_{\text{pool}}$ as the graph dimension in its propagation model and performs $T_{\text{pool}}$ steps of message passing. We derive the pooling matrix as follows, using a small $D_{\text{pool}}$:

$$S_t = \mathrm{softmax}\big(f_{\text{pool}}(G_t, c_t)\big) \in \mathbb{R}^{|V_t| \times K}$$

The normalized pooling matrix $S_t$ holds the attention weights over the entire graph and extracts relevant information conditioned on the agent state and instructions for further planning. To simplify the description, we denote the concatenated matrix of all node vectors as $V_t \in \mathbb{R}^{|V_t| \times D}$ and the concatenation of all edge vectors as the tensor $E_t \in \mathbb{R}^{|V_t| \times |V_t| \times D_e}$; the concatenation order is aligned with the connectivity tensor. The following operations derive the proxy representation:

$$\tilde{V}_t = S_t^\top V_t, \qquad \tilde{C}_t^{(f)} = S_t^\top C_t^{(f)} S_t, \qquad \tilde{E}_{t,(:,:,d)} = S_t^\top E_{t,(:,:,d)} S_t$$

where $\tilde{C}_t^{(f)}$ is a non-negative matrix indicating the weights of connectivity among nodes for function type $f$, and $E_{t,(:,:,d)}$ is the matrix at dimension $d$ of the edge state tensor. The pooled tensors $\tilde{V}_t$ and $\tilde{E}_t$ correspond to the node set and edge set of the proxy graph $\tilde{G}_t$.

Planning The planning of the navigation agent is achieved by propagating information among nodes in the proxy graph, conditioned on the agent state information $c_t$ and the instruction encodings. Denote the graph dimension used in the propagation model as $D_{\text{plan}}$ and the number of message passing steps as $T_{\text{plan}}$; the refined node embeddings of the proxy graph are derived as $\tilde{V}'_t = f_{\text{plan}}(\tilde{G}_t, c_t)$, with $D_{\text{plan}}$ and $T_{\text{plan}}$ controlling the capacity of the function. The refined node embeddings contain information involving the neighboring nodes (visited locations and unexplored actions), the state of the agent, the instruction, and the connectivity types between nodes. After this global refinement step, the node embeddings are unpooled back to the original graph through $V'_t = S_t \tilde{V}'_t$. With the updated node representations containing both the current state of the agent and the full action-observation space in the history, the distribution over actions is generated based on the node vectors:

$$p(a_t = i \mid X, o_{1:t}) = \mathrm{softmax}_i\big(\phi(v'_{t,i}, c_t; W)\big) \qquad (1)$$

where the function $\phi$ is a dot product with a linear mapping parameterized by $W$.

Multi-channel planning The model is further strengthened with the ability to perform multi-channel planning over the graph. Instead of using only one proxy graph representation to extract information and perform propagation, we find it useful to learn a set of proxy graphs and perform planning independently on each of them. The final information is aggregated by summing the embeddings across channels. The final policy over actions is generated through the same operation described in eqn. 1.

Training objective We train our full agent through the standard maximum likelihood objective using a cross-entropy loss. Given the demonstration dataset $\mathcal{D}$, the loss function to optimize is:

$$\mathcal{L} = -\sum_{t} \log p(a_t = a^*_t \mid X, o_{1:t})$$

where $a^*_t$ is the ground-truth action label at step $t$, generated through the graph-augmented supervision using the information of the ground-truth trajectory $P$, as described in Sec. 3.1.
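Concretely, the objective is a per-step cross-entropy summed along the trajectory. A small sketch (names illustrative; real per-step distributions would come from eqn. 1 rather than being hand-written):

```python
import math

def trajectory_nll(step_probs, gt_actions):
    """Imitation-learning objective: -sum_t log p(a_t = a_t* | X, o_{1:t}).
    step_probs[t] maps each action id to its predicted probability;
    gt_actions[t] is the graph-augmented ground-truth label for step t."""
    return -sum(math.log(p[a]) for p, a in zip(step_probs, gt_actions))

# Two-step toy trajectory with hand-written action distributions.
probs = [{"left": 0.7, "right": 0.3},
         {"stop": 0.9, "left": 0.1}]
loss = trajectory_nll(probs, ["left", "stop"])
```

Because the labels are graph nodes chosen by the correction rule of Sec. 3.1, the same loss trains both on-route steps and backtracking decisions.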

3.2.1 Message Passing Operations

In this subsection we explain the Graph Neural Network (GNN) operator used in the EGP. As the operator is used in both pooling and planning, we describe it as a general function that takes in a graph and general context information $c$ (e.g. the agent hidden state and language encodings), with hyper-parameters $D$ and $T$. Formally, given the input graph $G$ and the context vector $c$, the function generates the refined node vectors after $T$ steps of message passing, where $D$ is the vector dimension in the propagation model. The subscript on nodes and edges denotes the index of the message passing iteration, with a slight abuse of notation. The function contains two components: an input network and a propagation network.

Input network Along with the initial vectors for nodes and edges, the input network considers the context vector $c$ as shared additional information across nodes and edges. The input model maps the context vector with the node and edge vectors respectively into two fixed-size embeddings as follows:

$$h^{(0)}_i = g\big([v_i; c]\big), \qquad h_{ij} = g\big([e_{ij}; c]\big)$$

where $g$ is a neural network, and the generated embedding vectors are used for message communications among graph nodes in the propagation model.

Propagation network The propagation model, taking in the mapped embedding vectors $\{h^{(0)}_i\}$ and $\{h_{ij}\}$, consecutively generates a sequence of node embedding vectors for each node. At step $k$, the propagation operation updates every node by computing and aggregating information from the neighborhood nodes. The process is executed in the following order:

$$m^{(k)}_i = \sum_{j \in N(i)} M\big(h^{(k-1)}_j, h_{ij}\big), \qquad h^{(k)}_i = U\big(h^{(k-1)}_i, m^{(k)}_i\big)$$

where $M$ is the message function, $U$ is the aggregator function that collects messages back into node vectors, and $N(i)$ represents the neighbours of node $i$ in the graph. The refined node vector set $\{h^{(T)}_i\}$, containing global information from the whole graph, is mapped through a matrix to recover the input node dimension and is used as the output of the function for either the pooling or the planning component, as described in Sec. 3.2.
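To make the propagation loop concrete, here is a scalar-per-node sketch: each step sums incoming neighbour messages and mixes them into the node state. The specific message and update functions (a sum and a fixed 0.5/0.5 blend) are illustrative stand-ins for the learned $M$ and $U$:

```python
def message_passing(node_vecs, edges, steps=3):
    """Toy propagation network: at each step, every node aggregates
    messages from its in-neighbours and blends them into its own state."""
    adj = {}
    for (u, v) in edges:
        adj.setdefault(v, []).append(u)   # messages flow along directed edges
    h = dict(node_vecs)
    for _ in range(steps):
        # Message + aggregation (stand-in for M and the sum over N(i)):
        msgs = {n: sum(h[src] for src in adj.get(n, [])) for n in h}
        # Node update (stand-in for U):
        h = {n: 0.5 * h[n] + 0.5 * msgs[n] for n in h}
    return h

# A chain a -> b -> c: after one step, only b has received a's signal.
h1 = message_passing({"a": 1.0, "b": 0.0, "c": 0.0},
                     [("a", "b"), ("b", "c")], steps=1)
```

With `steps=T` large enough, information from any node reaches every node within graph distance `T`, which is why three iterations suffice for the small proxy graphs used in planning.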

4 Experiments

4.1 Experimental setup

Dataset Train Val:seen Val:unseen Test
R2R 14,039 1,021 2,349 4,173
R4R 233,532 1,035 45,234 -
Table 1: Dataset statistics.

Datasets We evaluate our method on the standard benchmark datasets for Vision-and-Language Navigation (VLN). The VLN task is built upon photo-realistic simulated environments chang2017matterport3d with human-generated instructions describing the landmarks and directions for navigation routes. Starting at a randomly sampled location in the environment, the agent needs to follow the instruction to navigate through the environment. Two datasets are commonly used for VLN: (1) the Room-to-Room (R2R) benchmark anderson2018vision with 7,189 paths, each associated with 3 sentences, resulting in 21,567 human instructions in total. The paths are produced by computing shortest paths from start to end points; (2) Room-for-Room (R4R) jain2019stay , which extends the R2R dataset by re-emphasizing the necessity of following instructions, compared to the goal-driven definition in R2R. The R4R dataset contains 278,766 instructions associated with twisted routes formed by joining two shortest-path trajectories from R2R. The dataset statistics are summarized in table 1.

Implementation details We follow ma2019self and adopt the co-grounding agent (w/o auxiliary loss) as our base agent. Following the standard protocol for R2R, visual features for each location are pre-computed ResNet features from the panoramic images. In the Evolving Graphical Planner, we use 256 dimensions as the graph embedding size for both the full graph and the proxy graph. The propagation model uses three iterations of message passing operations. For every expansion step, the default setting adds all possible navigable locations into the graph (top-K is set to 16, the maximum number of navigable locations in both datasets). For student-forced training, we use graph-augmented ground-truth supervision throughout the experiments, except for the ablation study on supervision methods. The model is trained jointly, using Adam kingma2014adam with 1e-4 as the default learning rate.

4.2 Room-to-Room benchmark

Evaluation metrics We follow prior work on the R2R dataset and report: Navigation Error (NE) in meters, lower is better; Success Rate (SR), i.e., the percentage of navigation end-locations within 3m of the goal location; Success weighted by Path Length (SPL); and Oracle Success Rate (OSR), i.e., the percentage of paths that pass within 3m of the goal at any point.
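SR and SPL can be computed directly from episode statistics; the sketch below follows the standard SPL definition from anderson2018vision (variable names are illustrative):

```python
def success_rate(final_dists, threshold=3.0):
    """Fraction of episodes ending within `threshold` meters of the goal."""
    return sum(d <= threshold for d in final_dists) / len(final_dists)

def spl(successes, shortest_lens, path_lens):
    """Success weighted by Path Length (anderson2018vision):
    mean over episodes of S_i * l_i / max(l_i, p_i), where l_i is the
    shortest-path length and p_i the agent's actual path length."""
    total = 0.0
    for s, l, p in zip(successes, shortest_lens, path_lens):
        total += s * l / max(l, p)
    return total / len(successes)

# Three toy episodes: two succeed, one wanders twice the optimal length.
sr = success_rate([1.2, 4.5, 2.9])
score = spl([1, 0, 1], [10.0, 8.0, 10.0], [10.0, 12.0, 20.0])
```

SPL penalizes exactly the failure mode that unconstrained search-based agents exhibit: reaching the goal but with an inflated path, which is why SR and SPL are reported together.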

4.2.1 Comparison with prior art

Architectures for comparison We compare our model with the following state-of-the-art navigation architectures: (1) the Seq2Seq agent anderson2018vision that translates instructions to actions; (2) Speaker-Follower (SF) fried2018speaker agent that augments the dataset with a speaker model; (3) the Reinforced Cross-Modal (RCM) agent wang2019reinforced

using modal-alignment score as intrinsic reward for reinforcement learning; (4) the Self-Monitoring (Monitor) agent

ma2019self that uses a co-grounding module and a progress estimation component to increase the progress alignment between texts and trajectories; (5) the Regretful agent ma2019regretful that uses the Regretful module and Progress Marker to perform one-step rollback; (6) the Ghost anderson2019chasing with Bayesian filters.

Results We report results in table 2. We train our models both using only the standard demonstrations and by augmenting the dataset with the synthetic data containing 178,330 instruction-route pairs generated by the Speaker model fried2018speaker . On the Val Unseen split, we observe that just through using the EGP module (without synthetic data augmentation), the performance of the agent increases over the baseline agent by 0.86 meters on NE (from 6.20 to 5.34), by 9% on SR (from 43% to 52%), by 5% on SPL (from 36% to 41%), and by 13% on OSR (from 52% to 65%). Our path length remains short, at 13.7 meters compared to 12.8m for the baseline, 14.8m for RCM wang2019reinforced and 15.2m for SF fried2018speaker (not shown in the table). Most notably, our EGP agent with synthetic data augmentation outperforms prior art on all metrics, across both the validation-unseen and the test set. Concretely, on the test set we achieve a 0.33 meter reduction in NE (from 5.67 to 5.34), a 5% improvement on SR (from 48% to 53%), a 2% improvement on SPL (from 40% to 42%), and a 2% improvement on OSR (from 59% to 61%) over the best performing prior model on each metric.

Model                        Training   Val Unseen            Test
                                        NE   SR  SPL  OSR     NE   SR  SPL  OSR
Seq2Seq anderson2018vision   IL         6.01 39  -    53      7.81 22  -    28
Ghost anderson2019chasing    IL         7.20 35  31   44      7.83 33  30   42
SF fried2018speaker *        IL         6.62 36  -    45      6.62 35  28   44
RCM wang2019reinforced       IL+RL      5.88 43  -    52      6.12 43  38   50
Monitor ma2019self           IL         5.98 44  30   58      -    -   -    -
Monitor ma2019self *         IL         5.52 45  32   56      5.67 48  35   59
Regretful ma2019regretful    IL         5.36 48  37   61      -    -   -    -
Regretful ma2019regretful *  IL         5.32 50  41   59      5.69 48  40   56
Baseline agent               IL         6.20 43  36   52      -    -   -    -
EGP (ours)                   IL         5.34 52  41   65      -    -   -    -
EGP (ours) *                 IL         4.83 56  44   64      5.34 53  42   61
Table 2: We compare our architecture with previous state-of-the-art architectures on the Val Unseen and Test splits of R2R anderson2018vision (*: models using additional synthetic data; NE in meters, lower is better; SR, SPL, OSR in percent, higher is better).

Discussion of other works Note that other works contribute to this benchmark through non-architectural approaches: using more data (6,582K) for BERT-style pre-training hao2020towards ; exploiting web data majumdar2020improving ; adding extra tasks zhu2019vision ; adding dropout regularization tan2019learning ; different evaluation settings (fusing information from three instructions) xia2020multi ; li2019robust ; and post-processing decoding methods ke2019tactical . We contribute to the backbone navigation architecture, and these works can potentially serve as complementary approaches.

4.2.2 Ablation studies

We now justify the design choices of our model by analyzing the individual components. In addition to the metrics above we also include Path Length (PL) for completeness. Results are summarized in table 3, with the last row depicting our model with the default settings.

Does global planning matter We verify the importance of global planning by controlling the top-K expansion rate of the graphical representation. In the R2R dataset, there are at most 16 navigable locations per state. With a smaller expansion rate, the EGP planner has less power to exploit global information from the environment. As seen in the top group of rows of table 3, with smaller top-K the path becomes shorter (fewer options to explore) but the accuracy of the model consistently drops, indicating the importance of global planning.

Planner implementation Next we analyze the effects of the number of message passing steps and the number of channels used in our planner module. The results are summarized in the second group of rows in table 3. Through the information propagation operations, our model achieves an 8% increase on SR (from 42% with mp=0, channel=1 to 50% with mp=3, channel=1). With more independent planning channels, we obtain a further 2% improvement on SR (from 50% with mp=3, channel=1 to 52% for our default setting with mp=3, channel=3, last row). To verify whether the increase is simply due to more parameters, we also compare against a single-channel planner with a three times larger graph dimension (768), which shows no similar effect to the multi-channel models.

Comparison across supervision methods Finally, we compare our graph-augmented supervision with the standard supervision used for student forcing, which recomputes a new route to the goal at each location. This can produce noisy data and larger generalization error, as shown in the second-from-last row of table 3.

Ablation type              Model                      PL     NE    SR  SPL  OSR
Global vs. local planning  EGP - topK = 3             13.07  5.95  47  38   56
                           EGP - topK = 5             13.17  5.75  49  40   59
                           EGP - topK = 10            13.50  5.71  50  40   61
Planner implementation     EGP - mp=0, channel=1      18.83  6.06  42  32   62
                           EGP - mp=3, channel=1      14.65  5.73  50  40   62
                           EGP - graph dim ×3 (768)   14.16  5.68  49  38   60
Supervision                EGP - with shortest path   14.68  5.65  46  36   57
Default                    EGP                        13.68  5.34  52  41   65
Table 3: Ablation studies of our method on the val-unseen set of Room-to-Room. The default setting for our model (bottom row) is top-K = 16, mp = 3, channel = 3, graph dimension = 256; we use graph-augmented supervision rather than shortest-path supervision for training.

4.3 Room-for-Room benchmark

Evaluation metrics The Room-for-Room dataset emphasizes the ability to correctly follow instructions rather than solely reaching the goal location. We follow the metrics in jain2019stay ; ilharco2019general and mainly compare our results on Coverage weighted by Length Score (CLS), which measures the fidelity of the agent's path to the reference weighted by a length score, as well as normalized Dynamic Time Warping (nDTW) and Success weighted by normalized Dynamic Time Warping (SDTW), which measure the spatiotemporal similarity between the paths of the agent and the expert reference.
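The DTW-based metrics can be made concrete with a short sketch. Following ilharco2019general, nDTW(R, Q) = exp(-DTW(Q, R) / (|R| · d_th)) for reference path R, query path Q, and success threshold d_th, and SDTW multiplies nDTW by the success indicator. The code below is a simplified illustration; the Euclidean node distance and the d_th value are assumptions:

```python
import math

def dtw(path, ref, dist):
    """Classic dynamic-programming Dynamic Time Warping cost."""
    n, m = len(path), len(ref)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = dist(path[i - 1], ref[j - 1])
            D[i][j] = c + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

def ndtw(path, ref, dist, d_th=3.0):
    """Normalized DTW: exp(-DTW / (|ref| * d_th)); 1.0 = perfect fidelity."""
    return math.exp(-dtw(path, ref, dist) / (len(ref) * d_th))

def sdtw(path, ref, dist, d_th=3.0):
    """Success weighted by nDTW: zero unless the agent stops within d_th of the goal."""
    success = dist(path[-1], ref[-1]) <= d_th
    return ndtw(path, ref, dist, d_th) if success else 0.0

euclid = lambda a, b: math.dist(a, b)
ref = [(0, 0), (1, 0), (2, 0)]
ndtw(ref, ref, euclid)  # identical paths -> 1.0
```

Unlike SR, these measures penalize an agent that reaches the goal by a route that deviates from the reference, which is why they are the primary metrics on R4R.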

Architectures for comparison We compare our model with the Speaker-Follower model fried2018speaker , the Reinforced Cross-Modal (RCM) agent wang2019reinforced trained under goal-oriented and fidelity-oriented rewards, as reported in jain2019stay , and the Perceive, Transform, and Act (PTA) agent landi2019perceive , which uses more complex multi-modal attention.

Results analysis We summarize the results in table 4. Note that all previous state-of-the-art methods require mixed training with imitation learning and reinforcement learning objectives: the student forcing method yields goal-oriented supervision and harms agents' ability to follow instructions jain2019stay . Our model is the first to successfully train a navigation agent through pure imitation learning on the R4R benchmark, owing to the graphical representation, the powerful planning module, and the new supervision method. We obtain a consistent margin across all metrics; specifically, our model outperforms other architectures by 7.0, 5.0, and 4.9 points on the fidelity-oriented measurements CLS, nDTW, and SDTW respectively. Moreover, despite using a global search mechanism, our model maintains a relatively short path length, which is difficult for rule-based global search algorithms fried2018speaker ; ke2019tactical .

Model                                  Training  PL    NE    SR    CLS   nDTW  SDTW
Random                                 -         23.6  10.4  13.8  22.3  18.5  4.1
Speaker-Follower jain2019stay          IL+RL     19.9  8.47  23.8  29.6  -     -
RCM + goal-oriented jain2019stay       IL+RL     32.5  8.45  28.6  20.4  26.9  11.4
RCM + fidelity-oriented jain2019stay   IL+RL     28.5  8.08  26.1  34.6  30.4  12.6
PTA low-level landi2019perceive        IL+RL     10.2  8.19  27.0  35.0  20.0  8.0
PTA high-level landi2019perceive       IL+RL     17.7  8.25  24.0  37.0  32.0  10.0
EGP (ours)                             IL        18.3  8.0   30.2  44.4  37.4  17.5
Table 4: Comparison across methods on the R4R val-unseen split. Path Length (PL) is reported for reference. (We take the DTW-based numbers from ilharco2019general , as jain2019stay did not report them.)

5 Conclusion

In this work, we proposed a solution to the long-standing problem of contextual global planning for vision-and-language navigation. Our system, based on the new Evolving Graphical Planner (EGP) module, outperforms prior backbone navigation architectures on multiple metrics across two benchmarks. Specifically, we show that building a policy over the global action space is critical to decision making; that the graphical representation enables a new supervision strategy for student forcing in navigation; and, interestingly, that planning can be carried out on a proxy graph rather than the full topological representation, which would incur high costs in both time and memory.

6 Broader Impact

This work has several downstream applications in areas like autonomous navigation and robotic control, especially through the use of natural language instruction. Potential downstream uses of such agents range from healthcare delivery to elderly home assistance to disaster relief efforts. We believe that imbuing these agents with a global awareness of the environment and long-term planning will enable them to handle more challenging tasks and recover gracefully from mistakes. While the graph-based approach we propose is scalable and easy to manipulate in real time, future research can address computation challenges and increase the planning time-scale to enable better decision making.

7 Acknowledgement

This work is partially supported by King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research (OSR) under Award No. OSRCRG2017-3405 and by Princeton University’s Center for Statistics and Machine Learning (CSML) DataX fund. We would also like to thank Felix Yu and Zeyu Wang for offering insightful discussions and comments on the paper.