Recent work has made remarkable progress towards building autonomous agents that navigate by following instructions fried2018speaker ; fu2019language ; hao2020towards ; chen2019touchdown ; mirowski2018learning ; oh2017zero ; thomason2019vision ; roman2020rmm ; kajic2020learning and constructing memory structures for maps parisotto2017neural ; savinov2018semi ; laskin2020sparse . An important problem setting within this space is the paradigm of online navigation, where an agent needs to perform navigation based on goal descriptions in an unseen environment using a limited number of steps ma2019self ; anderson2018vision .
In order to successfully navigate through an unseen environment, an agent needs to overcome two key challenges. First, the instructions given to the agent are natural language descriptions of the goal and the landmarks along the way; these descriptions need to be grounded in the evolving visual world that the agent is observing. Second, the agent needs to perform non-trivial planning over a large action space, including: 1) deciding which step to take next to resolve ambiguities in the instructions through novel observations, and 2) gaining a better understanding of the environment layout in order to progress towards the goal or recover from prior mistakes. Notably, this planning requires not only selecting from an increasingly large set of possible actions but also performing complex long-term reasoning.
Existing work tackles only one or two components of the above and may require additional pre-processing steps. Some approaches are constrained to local control policies ma2019self ; fried2018speaker or use rule-based algorithms such as beam search fried2018speaker ; ke2019tactical ; ma2019self to perform localized path corrections. Others focus on processing long-range observations instead of actions fang2019scene or employ offline pre-training schemes to learn topological structures savinov2018semi ; laskin2020sparse ; liu2020hallucinative . The latter is challenging since accurate graph construction is non-trivial and requires special adaptation to work during real-time navigation laskin2020sparse ; liu2020hallucinative .
In this paper, we propose the Evolving Graphical Planner (EGP) (Figure 1), which 1) dynamically constructs a graphical map of the environment in an online fashion during exploration and 2) incorporates a global planning module for selecting actions. EGP can operate directly on raw sensory inputs in partially observable settings by building a structured representation of the geometric layout of the environment using discrete symbols to represent visited states and unexplored actions. This expressive representation allows our agent to choose from a greater number of global actions conditioned on the text instruction and perform course corrections if needed.
Incorporating a global navigation module is challenging since we do not always have access to ground-truth supervision from the environment. Further, the ever-expanding size of the global graph requires a scalable action selection module. To solve the first challenge, we introduce a novel method for training our planning modules using imitation learning, which allows the agent to efficiently learn how to select global actions and backtrack when necessary. For the second, we introduce proxy graphs, which are local approximations of the entire map and allow for more scalable planning. Our entire model is end-to-end differentiable under a pure imitation learning framework.
We test EGP on two benchmarks for 3D navigation with instructions – Room-to-Room anderson2018vision and Room-for-Room jain2019stay . Our model outperforms several state-of-the-art backbone architectures on both datasets – e.g., on Room-to-Room, we achieve a 5% improvement over the Regretful agent ma2019regretful on success rate. We also perform a series of ablation studies on our model to justify the design choices.
2 Related work
Embodied Navigation Agents Many recent papers develop neural architectures for navigation tasks ma2019regretful ; fried2018speaker ; ma2019self ; zhu2017target ; gupta2017cognitive ; andreas2017modular ; fang2019scene ; xia2020interactive . Vision-and-Language Navigation (VLN) anderson2018vision ; chen2019touchdown is one representative task that focuses on language-driven navigation across photo-realistic 3D environments. Anderson et al. anderson2018vision propose the Room-to-Room benchmark and an attention-based sequence-to-sequence method. Fried et al. fried2018speaker extend the model with a pragmatic agent that can synthesize data through a speaker model. With an emphasis on grounding, the self-monitoring agent ma2019self adopts a co-grounding module and a progress-estimation auxiliary task for more progress-sensitive alignment. Similarly, an intrinsic reward wang2018nervenet is introduced to improve cross-modal alignment for navigation agents. Ma et al. ma2019regretful extend these works toward a graph-search-like algorithm by adding a one-step regretful action. Anderson et al. anderson2019chasing propose an agent formulated under Bayesian filtering with a global mapper. Hand-crafted decoding algorithms are also used as a post-processing technique but may lead to long trajectories and difficulty in joint optimization fried2018speaker ; ke2019tactical ; karaman2010incremental .
Navigation memory structures A recent trend in navigation focuses on extending agents with different types of memory structures. Simple structures: Parisotto et al. parisotto2017neural propose a tensor-based memory structure with operations that the agent can use to access memory and perform navigation. Fang et al. fang2019scene adopt a transformer to extract historical information stored in a sequential memory. Topological structures: landmark-based topological representations have been shown to be effective in pre-exploration-based navigation tasks without the need for externally provided camera poses or ego-motion information savinov2018semi . Laskin et al. laskin2020sparse propose methods to sparsify the graphical memory through consistency checks and graph cleanups. Liu et al. liu2020hallucinative use a contrastive energy model for building more accurate edges in the memory graph.
Graphical representation Graph-based methods have been shown to be an effective intermediate representation for information exchange defferrard2016convolutional ; li2015gated ; henaff2015deep ; duvenaud2015convolutional ; DengVHM16 . In image and video understanding, graphical representations are used for visual question answering teney2017graph ; santoro2017simple , video captioning pan2020spatio , and action recognition guo2018neural ; huang2017unsupervised . In robotics, Huang et al. huang2019neural propose the Neural Task Graph (NTG) as an intermediate modularized representation leading to better generalization. Graph Neural Networks have also been demonstrated to be effective in learning structured policies wang2018nervenet and in automatic robot design wang2019neural .
Supervision strategies for imitation learning Properly training sequential models is challenging due to drifting issues ross2011reduction . DAgger ross2011reduction aggregates datasets with expert supervision provided for sequences sampled from student models. Scheduled sampling bengio2015scheduled tackles this problem by mixing samples from both the ground truth and the student model. Professor forcing lamb2016professor shows a more effective approach through adversarial domain adaptation. OCD sabour2018optimal adopts characters computed online as ground truth for speech recognition. In this paper, we instead propose a graph-augmented strategy for providing expert supervision, which alleviates the mismatch between instructions and newly computed expert trajectories ma2019self .
Problem definition and setup We follow the standard instruction-following navigation setting anderson2018vision . Given a demonstration dataset of language instructions paired with expert navigation trajectories and their environments, the agent is trained to imitate the expert behaviour: correctly following each instruction to navigate to the goal location. At each navigation step, the agent is located at a state, receives observations from the environment, and performs an action drawn from the decision space at that state. In our task, the decision space is the set of navigable locations fried2018speaker .
To set up an agent, we build upon the basic navigation architecture of ma2019self and utilize its language encoder and attention-based decoder. The agent encodes the language instruction into a hidden encoding through an LSTM gers1999learning . Conditioned on this encoding, the agent uses an attention-based decoder to model the distribution over trajectories. At every step, the decoder takes in the instruction encoding, the current observation, and a maintained hidden memory to produce the per-step action probability distribution. Note that such a navigation agent suffers from a constrained local action set: it lacks the ability to perform long-range planning over the navigation space and to effectively correct errors along the way.
Our approach In this section, we introduce our Evolving Graphical Planner (EGP), an end-to-end global planner that navigates an agent through a new environment by redefining the decision space and performing long-term planning over graphs. With the graphical representation, the agent accumulates knowledge about the unseen environment and has access to the entire action space. Each long-distance navigation is then reduced to a one-step decision, leading to easier and more efficient exploration and error correction. We also show that the graphical representation elicits a new supervision strategy for effectively training an imitation agent, which alleviates the mismatch issue in standard navigation-agent training ma2019self . Global planning is performed on an efficient proxy representation, making the model scalable to navigation with longer horizons.
The proposed model consists of two core components: (1) an evolving graphical representation that generalizes the local action space; (2) a planning core that performs multi-channel information propagation on a proxy representation. Starting from the initial observation, the EGP agent gradually expands an internal representation of the explored space (Section 3.1) and performs planning efficiently over the graph (Section 3.2).
3.1 Evolving Graphical Planner: Graphical representation
We start by introducing notation. The graph at each navigation step consists of a set of node embeddings, a set of edge embeddings between pairs of nodes, and a graph connectivity tensor with multiple function types. Nodes are separated into two types: leaf nodes represent possible actions (e.g., navigable locations) and internal nodes represent visited locations, as shown in Figure 2.
Graph construction The agent builds the graphical representation progressively during navigation. Initialized as an empty graph, it expands its node set, edge set, and connectivity tensor using the observations and local connection information given by the environment. Upon receiving an observation and a set of possible actions at a new state, the agent maps the current location information and the action information through two separate neural networks, producing node embeddings for the visited location and for each candidate action. The incoming and outgoing edges of new nodes are determined by the local map information and the agent's direction of motion. Three function types are considered and stored in the connectivity tensor: the forward and backward directions between two visited nodes, and the connection from a visited node to a leaf node. With the updated graph, the agent continues navigation and the loop repeats. To reduce memory cost, the model can selectively add nodes to the graph: we expand with only the top-K leaf nodes, ranked by confidence scores in the policy distribution, as shown in Figure 2.
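To make the construction above concrete, here is a minimal Python sketch of such an evolving graph. The class name, the dictionary-based feature storage, and the `score` field used for top-K ranking are illustrative assumptions, not the paper's implementation.

```python
class EvolvingGraph:
    """Stores visited locations (internal nodes) and unexplored
    actions (leaf nodes) with typed, directed edges."""
    FWD, BWD, TO_LEAF = 0, 1, 2  # the three edge function types

    def __init__(self):
        self.nodes = {}    # node_id -> embedding (any feature vector)
        self.is_leaf = {}  # node_id -> bool
        self.edges = {}    # (src, dst) -> edge function type

    def add_visited(self, node_id, embedding, prev_id=None):
        # Promote a leaf to an internal node, or create a new one,
        # wiring forward/backward edges along the movement direction.
        self.nodes[node_id] = embedding
        self.is_leaf[node_id] = False
        if prev_id is not None:
            self.edges[(prev_id, node_id)] = self.FWD
            self.edges[(node_id, prev_id)] = self.BWD

    def expand(self, node_id, candidates, k=3):
        # Keep only the top-K candidate actions by policy confidence.
        top = sorted(candidates, key=lambda c: -c["score"])[:k]
        for c in top:
            if c["id"] not in self.nodes:
                self.nodes[c["id"]] = c["embedding"]
                self.is_leaf[c["id"]] = True
            self.edges[(node_id, c["id"])] = self.TO_LEAF
```

After each move, `add_visited` is called for the new location and `expand` for its navigable neighbors, so the graph grows one frontier at a time.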
Navigation with graph jump With the graph representing the entire action space, the agent can easily choose to execute actions that were left unexplored at previously visited locations. Given an action proposed by the planner, the agent computes the shortest-path route over the graph and follows it. This allows the agent to "jump" around the full graph-defined action space, executing an unexplored action through a single-step decision. The long-range decision space also makes error correction easier: a single decision is all the agent needs to backtrack to the correct location. While executing the intermediate steps of the route generated from the graph, the agent keeps updating its hidden state based on the observations from the environment.
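Since the graph's edges are navigable connections of roughly uniform cost, the shortest-path route underlying a "jump" can be sketched with plain breadth-first search; the adjacency-dict interface is an assumption for illustration.

```python
from collections import deque

def plan_route(edges, start, target):
    """BFS shortest path over the evolving graph. `edges` is an
    adjacency dict {node: [neighbor, ...]}. Returns the node sequence
    the agent physically traverses to reach `target`, or None."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == target:
            route = []
            while node is not None:   # walk parents back to start
                route.append(node)
                node = parent[node]
            return route[::-1]
        for nxt in edges.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None  # target not reachable in the current graph
```

The returned route is exactly the sequence of intermediate states during which the agent keeps updating its hidden memory.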
Supervising imitation learning Properly training an imitation learning agent is a challenging problem sabour2018optimal ; ross2011reduction ; lamb2016professor . Student forcing, with a newly computed route as supervision, is a widely used solution in navigation ma2019self ; ma2019regretful ; fried2018speaker . However, the mismatch between the new route and the language instructions can produce noisy and incorrect signals for learning language grounding and navigation. We provide a new graph-augmented solution for computing the supervision of each student-sampled trajectory. Assume a similarity metric between two navigation trajectories (e.g., ilharco2019general ). Because the graph memorizes the entire possible action space, a subset of the nodes on the ground-truth route is guaranteed to exist in the graph. We choose the node on the ground-truth route that maximizes the metric as the ground-truth action for the current step. This provides a "correction" signal indicating the best action to take to recover from the agent's mistake.
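The selection rule can be sketched as follows; the trajectory metric is a placeholder (a toy prefix-match stands in for the fidelity metric of ilharco2019general), and the function names are assumptions.

```python
def graph_supervision(graph_nodes, gt_route, partial_traj, metric):
    """For a student-sampled partial trajectory, choose as ground-truth
    action the ground-truth-route node (present in the graph) whose
    one-step extension of the trajectory scores best under `metric`."""
    best_node, best_score = None, float("-inf")
    for node in gt_route:
        if node not in graph_nodes:
            continue  # pruned by top-K expansion; cannot be selected
        score = metric(partial_traj + [node], gt_route)
        if score > best_score:
            best_node, best_score = node, score
    return best_node

# Toy stand-in metric: length of the matching prefix with the reference.
def prefix_match(traj, ref):
    return sum(1 for a, b in zip(traj, ref) if a == b)
```

Because the target is always a node the agent can actually reach in one graph-jump, the supervision never forces a trajectory that conflicts with the instruction.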
3.2 Evolving Graphical Planner: Planning Core
With the graphical representation, a straightforward method would be to plan directly on the full graph using the node and edge embeddings. However, a progressively growing graph leads to high costs and limits the scalability of the model. Pre-exploration-based methods often tackle this issue by performing offline sparsification, with multiple rounds of cleanup on pre-collected graphs laskin2020sparse that contain full knowledge of the map; this is unsuitable in online navigation settings. In this section we show that, interestingly, effective planning does not require the full graph, and present the second component of our model, which dynamically performs goal-driven information extraction over the entire graph to build a condensed proxy representation used for planning. Our model utilizes Graph Neural Networks (GNNs) as a basic operator, which we explain at the end of the section.
Proxy graph The proxy graph consists of a pooled node embedding set, a pooled edge embedding set, and a connectivity matrix with function types. It contains a fixed number of nodes, invariant to the growing size of the full graph. We hypothesize that, given the instruction and the current state of the agent, only a subset of nodes provides useful information for planning. To construct the proxy representation, the model uses a normalized pooling similar to ying2018hierarchical . Unlike that work, our graphical representation carries a richer set of information, including edge states and function types in the connectivity matrices in addition to the node states. We describe the process of generating the proxy representation and the corresponding planning below.
Given the entire graphical representation, the planner contains two functions: a pooling function and an unpooling function. The pooling function takes in the agent's state information and constructs the proxy representation through a lightweight neural network, using a small graph dimension in its propagation model and a fixed number of message-passing steps, and outputs a pooling matrix over the nodes of the full graph.
The normalized pooling matrix provides attention weights over the entire graph, extracting information relevant to the agent's state and the instructions for further planning. To simplify the description, we stack all node vectors into a single matrix and all edge vectors into a tensor, with the concatenation order aligned with the connectivity function matrix. The proxy representation is then derived by applying the pooling matrix to the node matrix, the edge tensor, and each function type of the connectivity tensor.
The resulting pooled connectivity is a non-negative matrix indicating the weights of connectivity among proxy nodes for each function type, and the pooled node and edge tensors correspond to the node set and edge set of the proxy graph.
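The pooling step can be sketched in the soft-assignment style of ying2018hierarchical ; the exact shapes and the use of a plain assignment matrix are assumptions for illustration, not the paper's equations.

```python
import numpy as np

def build_proxy(X, A, S):
    """DiffPool-style soft pooling as a stand-in for the proxy-graph
    construction. X: (n, d) node matrix; A: (f, n, n) connectivity,
    one slice per function type; S: (n, m) normalized assignment of
    n full-graph nodes onto m proxy nodes. Returns pooled nodes and
    pooled typed connectivity."""
    X_proxy = S.T @ X                               # (m, d)
    # S^T A_f S for every function type f at once: (f, m, m)
    A_proxy = np.einsum("nm,fnk,kp->fmp", S, A, S)
    return X_proxy, A_proxy
```

Because `m` is fixed, planning cost stays constant even as the full graph keeps growing during navigation.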
Planning The navigation agent plans by propagating information among the nodes of the proxy graph, conditioned on the agent's state and the instruction encodings. The propagation model uses a fixed graph dimension and a fixed number of message-passing steps, which control the capacity of the function, and produces refined node embeddings for the proxy graph. Each refined node embedding aggregates information from the neighboring nodes (visited locations and unexplored actions), the state of the agent, the instruction, and the connectivity types between nodes. After this global refinement step, the node embeddings are unpooled back to the original graph by reversing the pooling matrix. With the updated node representations containing both the current state of the agent and the full action-observation history, the distribution over actions is generated from the node vectors, where the scoring function is a dot product with a learned linear mapping.
Multi-channel planning The model is further strengthened by performing multi-channel planning over the graph. Instead of extracting and propagating information through a single proxy graph, we find it useful to learn a set of proxy graphs and perform planning independently on each of them. The final information is aggregated by summing the embeddings across channels, and the final policy over actions is generated through the same operation described in eqn. 1.
Training objective We train our full agent through the standard maximum-likelihood objective with a cross-entropy loss. Given the demonstration dataset, the loss sums the negative log-likelihood of the ground-truth action at each step of each trajectory, where the ground-truth action labels are generated through the graph-augmented supervision described in Sec. 3.1.
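The per-trajectory objective can be written as a short sketch; the list-of-logits interface is an assumption standing in for the model's per-step score vectors over graph actions.

```python
import math

def trajectory_nll(step_logits, gt_actions):
    """Sum of per-step cross-entropy terms. `step_logits` is a list of
    score vectors over that step's graph actions; `gt_actions` holds
    the graph-augmented ground-truth indices (Sec. 3.1)."""
    loss = 0.0
    for logits, a in zip(step_logits, gt_actions):
        z = max(logits)  # stabilize the log-sum-exp
        log_norm = z + math.log(sum(math.exp(l - z) for l in logits))
        loss += log_norm - logits[a]  # -log softmax(logits)[a]
    return loss
```

Note that the action-set size may differ at every step, since the graph keeps growing; the loss handles that naturally.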
3.2.1 Message Passing Operations
In this subsection we explain the Graph Neural Network (GNN) operator used in EGP. As the operator is used in both pooling and planning, we describe it as a general function that takes in a graph and a context vector (e.g., the agent's hidden state and language encodings), with hyper-parameters for the embedding dimension and the number of message-passing steps. Given the input graph and the context vector, the function generates refined node vectors after the message-passing steps of the propagation model. With a slight abuse of notation, a subscript on nodes and edges denotes the index of the message-passing iteration. The function contains two components: an input network and a propagation network.
Input network Along with the initial vectors for nodes and edges, the input network treats the context vector as additional information shared across all nodes and edges. The input model maps the context vector together with the node and edge vectors, respectively, into two fixed-size embeddings through a neural network; the generated embeddings are used for message communication among graph nodes in the propagation model.
Propagation network Taking in the mapped embedding vectors, the propagation model consecutively generates a sequence of embedding vectors for each node. At each step, the propagation operation updates every node by computing messages from its neighboring nodes with a message function and aggregating them back into node vectors with an aggregator function, where a node's neighbors are defined by the graph connectivity. The refined node vectors, containing global information from the whole graph, are mapped through a linear transformation to recover the input node dimension and serve as the output of the function for either the pooling or the planning component, as described in Sec. 3.2.
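A minimal numeric sketch of such a message-passing loop, assuming linear message/update maps, mean aggregation over neighbors, and a residual tanh update; all of these choices are illustrative, not the paper's exact operator.

```python
import numpy as np

def propagate(H, A, W_msg, W_upd, steps=3):
    """Minimal message passing. H: (n, d) node states; A: (n, n)
    adjacency; W_msg, W_upd: (d, d) weights. Each step computes
    linear messages, mean-aggregates them over neighbors, and updates
    nodes with a residual + tanh nonlinearity."""
    deg = np.maximum(A.sum(axis=1, keepdims=True), 1.0)
    for _ in range(steps):
        msgs = (A @ (H @ W_msg)) / deg   # mean over incoming neighbors
        H = np.tanh(H + msgs @ W_upd)    # residual node update
    return H
```

Conditioning on the context vector (agent state and instruction) would amount to concatenating it onto each node state before the linear maps.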
4.1 Experimental setup
Datasets We evaluate our method on standard benchmark datasets for Vision-and-Language Navigation (VLN). The VLN task is built upon photo-realistic simulated environments chang2017matterport3d with human-generated instructions describing the landmarks and directions along navigation routes. Starting at a randomly sampled location in the environment, the agent needs to follow the instruction to navigate through the environment. Two datasets are commonly used for VLN: (1) the Room-to-Room (R2R) benchmark anderson2018vision with 7,189 paths, each associated with 3 sentences, resulting in 21,567 human instructions in total; the paths are produced by computing shortest paths from start to end points. (2) Room-for-Room (R4R) jain2019stay , which extends R2R by re-emphasizing the necessity of following instructions, in contrast to the goal-driven definition in R2R; R4R contains 278,766 instructions associated with twisted routes formed by joining pairs of shortest-path trajectories from R2R. Dataset details are summarized in table 1.
Implementation details We follow ma2019self and adopt the co-grounding agent (without the auxiliary loss) as our base agent. Following the standard protocol for R2R, visual features for each location are pre-computed ResNet features of the panoramic images. In the Evolving Graphical Planner, we use a graph embedding size of 256 for both the full graph and the proxy graph. The propagation model uses three iterations of message passing. At every expansion step, the default setting adds all possible navigable locations to the graph (top-K is set to 16, the maximum number of navigable locations in both datasets). For student-forced training, we use graph-augmented ground-truth supervision throughout the experiments, except in the ablation study on supervision methods. The model is trained jointly, using Adam kingma2014adam with a default learning rate of 1e-4.
4.2 Room-to-Room benchmark
Evaluation metrics We follow prior work on the R2R dataset and report: Navigation Error (NE) in meters, where lower is better; Success Rate (SR), i.e., the percentage of episodes whose end location is within 3m of the goal; Success weighted by Path Length (SPL); and Oracle Success Rate (OSR), i.e., the percentage of episodes whose path passes within 3m of the goal at any point.
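For concreteness, these per-episode metrics can be computed as below; Euclidean distance stands in for the benchmark's geodesic distance, and the function interface is an assumption.

```python
def nav_metrics(path, goal, success_radius=3.0):
    """Compute NE, SR, OSR, and an SPL-style score for one episode.
    `path` is a list of 2D points; the real benchmark would use
    geodesic distances on the environment graph instead."""
    dist = lambda p, q: ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5
    ne = dist(path[-1], goal)                                  # NE
    sr = ne <= success_radius                                  # SR
    osr = any(dist(p, goal) <= success_radius for p in path)   # OSR
    length = sum(dist(a, b) for a, b in zip(path, path[1:]))
    shortest = dist(path[0], goal)
    # SPL: success weighted by shortest / max(taken, shortest)
    spl = float(sr) * shortest / max(length, shortest) if shortest > 0 else float(sr)
    return {"NE": ne, "SR": sr, "OSR": osr, "SPL": spl}
```

SPL thus penalizes successful but meandering paths, which is why path length matters in table 2.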
4.2.1 Comparison with prior art
Architectures for comparison We compare our model with the following state-of-the-art navigation architectures: (1) the Seq2Seq agent anderson2018vision that translates instructions to actions; (2) the Speaker-Follower (SF) agent fried2018speaker that augments the dataset with a speaker model; (3) the Reinforced Cross-Modal (RCM) agent wang2019reinforced that uses a modal-alignment score as an intrinsic reward for reinforcement learning; (4) the Self-Monitoring (Monitor) agent ma2019self that uses a co-grounding module and a progress-estimation component to improve the alignment between text and trajectory; (5) the Regretful agent ma2019regretful that uses a Regretful module and a Progress Marker to perform one-step rollbacks; (6) the Ghost agent anderson2019chasing formulated with Bayesian filtering.
Results We report results in table 2. We train our models both using only the standard demonstrations and with the dataset augmented by 178,330 synthetic instruction-route pairs generated by the Speaker model fried2018speaker . On the Val Unseen split, we observe that the EGP module alone (without synthetic data augmentation) improves over the baseline agent by 9 points on SR, 5 points on SPL, and 13% on OSR, while also reducing NE. Our path length remains short, at 13.7 meters, compared to 12.8m for the baseline, 14.8m for RCM wang2019reinforced , and 15.2m for SF fried2018speaker (not shown in the table). Most notably, our EGP agent with synthetic data augmentation outperforms prior art on all metrics, across both the validation-unseen and the test set, improving NE, SR, SPL, and OSR over the best-performing prior model on each respective metric.
Discussion of other works Note that other works contribute to this benchmark through non-architectural approaches: using more data (6,582K instructions) for BERT-style pre-training hao2020towards ; exploiting web data majumdar2020improving ; adding auxiliary tasks zhu2019vision ; adding dropout regularization tan2019learning ; different evaluation settings (fusing information from three instructions) xia2020multi ; li2019robust ; and post-processing decoding methods ke2019tactical . We contribute a backbone navigation architecture, and these approaches are potentially complementary.
4.2.2 Ablation studies
We now justify the design choices of our model by analyzing its individual components. In addition to the metrics above, we also include Path Length (PL) for completeness. Results are summarized in table 3, with the last row showing our model with the default settings.
Does global planning matter? We verify the importance of global planning by controlling the top-K expansion rate of the graphical representation. In the R2R dataset, each state has at most 16 navigable locations. With a smaller expansion rate, the EGP planner has less power to exploit global information from the environment. As seen in the top group of rows of table 3, with smaller top-K the path becomes shorter (fewer options to explore) but the accuracy of the model consistently drops, indicating the importance of global planning.
Planner implementation Next we analyze the effects of the number of message-passing steps and the number of channels in the planner module. The results are summarized in the second group of rows in table 3. Through the information propagation operations, our model achieves an 8-point increase on SR (from 42 with mp=0, channel=1 to 50 with mp=3, channel=1). With more independent planning channels, we obtain a further 2-point improvement on SR (from 50 with mp=3, channel=1 to our default setting with mp=3, channel=3, last row). To verify whether the increase is simply due to more parameters, we also compare against a single-channel planner with a three-times-larger graph dimension (768), which shows no similar effect.
Comparison across supervision methods Finally, we compare our supervision method against the standard supervision used for student forcing, which recomputes a new route to the goal from each location; this can lead to noisy data and larger generalization error, as shown in the second-from-last row of table 3.
| Ablation | Model | PL | NE | SR | SPL | OSR |
|---|---|---|---|---|---|---|
| Global vs. local planning | EGP - topK = 3 | 13.07 | 5.95 | 47 | 38 | 56 |
| | EGP - topK = 5 | 13.17 | 5.75 | 49 | 40 | 59 |
| | EGP - topK = 10 | 13.50 | 5.71 | 50 | 40 | 61 |
| Planner implementation | EGP - mp=0, channel=1 | 18.83 | 6.06 | 42 | 32 | 62 |
| | EGP - mp=3, channel=1 | 14.65 | 5.73 | 50 | 40 | 62 |
| | EGP - graph dim ×3 | 14.16 | 5.68 | 49 | 38 | 60 |
| Supervision | EGP - with shortest path | 14.68 | 5.65 | 46 | 36 | 57 |
4.3 Room-for-Room benchmark
Evaluation metrics The Room-for-Room dataset emphasizes the ability to correctly follow instructions rather than solely reaching the goal location. We follow the metrics in jain2019stay ; ilharco2019general and mainly compare on Coverage weighted by Length Score (CLS), which measures the fidelity of the agent's path to the reference weighted by a length score, and on normalized Dynamic Time Warping (nDTW) and Success weighted by normalized Dynamic Time Warping (SDTW), which measure the spatio-temporal similarity between the agent's path and the expert reference.
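A sketch of the nDTW computation, assuming the exp(-DTW / (|reference| · threshold)) normalization of ilharco2019general with the 3m success threshold; Euclidean distance stands in for the benchmark's geodesic distance.

```python
import math

def ndtw(path, ref, d_th=3.0):
    """Normalized Dynamic Time Warping between an agent path and the
    reference path, computed with a standard O(n*m) DTW table."""
    dist = lambda p, q: math.hypot(p[0] - q[0], p[1] - q[1])
    n, m = len(path), len(ref)
    INF = float("inf")
    dp = [[INF] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(path[i - 1], ref[j - 1])
            dp[i][j] = cost + min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1])
    return math.exp(-dp[n][m] / (m * d_th))
```

SDTW would then multiply this score by the episode's binary success indicator, so only successful episodes receive fidelity credit.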
Architectures for comparison We compare our model with the Speaker-Follower model fried2018speaker , the Reinforced Cross-Modal agent wang2019reinforced trained under goal-directed and fidelity-oriented rewards as reported in jain2019stay , and the Perceive, Transform, Act (PTA) agent landi2019perceive , which uses more complex multi-modal attention.
Results analysis We summarize the results in table 4. Note that all previous state-of-the-art methods require mixed training with both imitation learning and reinforcement learning objectives: standard student forcing yields goal-oriented supervision and harms the agent's ability to follow instructions jain2019stay . Our model is the first to successfully train a navigation agent through pure imitation learning on the R4R benchmark, owing to the graphical representation, the powerful planning module, and the new supervision method. We obtain consistent margins across all metrics; specifically, our model outperforms other architectures by 7.0, 5.0, and 4.9 points on the fidelity-oriented measurements CLS, nDTW, and SDTW respectively. Moreover, despite using a global search mechanism, our model maintains a relatively short path length, which is difficult for rule-based global search algorithms fried2018speaker ; ke2019tactical .
| Model | Training | PL | NE | SR | CLS | nDTW | SDTW |
|---|---|---|---|---|---|---|---|
| RCM + goal-oriented jain2019stay | IL+RL | 32.5 | 8.45 | 28.6 | 20.4 | 26.9 | 11.4 |
| RCM + fidelity-oriented jain2019stay | IL+RL | 28.5 | 8.08 | 26.1 | 34.6 | 30.4 | 12.6 |
In this work, we proposed a solution to the long-standing problem of contextual global planning for vision-and-language navigation. Our system, built around the new Evolving Graphical Planner (EGP) module, outperforms prior backbone navigation architectures on multiple metrics across two benchmarks. Specifically, we show that building a policy over the global action space is critical for decision making, that the graphical representation elicits a new supervision strategy for student forcing in navigation, and, interestingly, that planning can be performed on a proxy graph rather than the full topological representation, which would incur high costs in both time and memory.
6 Broader Impact
This work has several downstream applications in areas like autonomous navigation and robotic control, especially through the use of natural language instruction. Potential downstream uses of such agents range from healthcare delivery to elderly home assistance to disaster relief efforts. We believe that imbuing these agents with a global awareness of the environment and long-term planning will enable them to handle more challenging tasks and recover gracefully from mistakes. While the graph-based approach we propose is scalable and easy to manipulate in real time, future research can address computation challenges and increase the planning time-scale to enable better decision making.
This work is partially supported by King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research (OSR) under Award No. OSRCRG2017-3405 and by Princeton University’s Center for Statistics and Machine Learning (CSML) DataX fund. We would also like to thank Felix Yu and Zeyu Wang for offering insightful discussions and comments on the paper.
- (1) Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton van den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3674–3683, 2018.
- (2) Daniel Fried, Ronghang Hu, Volkan Cirik, Anna Rohrbach, Jacob Andreas, Louis-Philippe Morency, Taylor Berg-Kirkpatrick, Kate Saenko, Dan Klein, and Trevor Darrell. Speaker-follower models for vision-and-language navigation. In Advances in Neural Information Processing Systems, pages 3314–3325, 2018.
- (3) Justin Fu, Anoop Korattikara, Sergey Levine, and Sergio Guadarrama. From language to goals: Inverse reinforcement learning for vision-based instruction following. arXiv preprint arXiv:1902.07742, 2019.
- (4) Weituo Hao, Chunyuan Li, Xiujun Li, Lawrence Carin, and Jianfeng Gao. Towards learning a generic agent for vision-and-language navigation via pre-training. arXiv preprint arXiv:2002.10638, 2020.
- (5) Howard Chen, Alane Suhr, Dipendra Misra, Noah Snavely, and Yoav Artzi. Touchdown: Natural language navigation and spatial reasoning in visual street environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 12538–12547, 2019.
- (6) Piotr Mirowski, Matt Grimes, Mateusz Malinowski, Karl Moritz Hermann, Keith Anderson, Denis Teplyashin, Karen Simonyan, Andrew Zisserman, Raia Hadsell, et al. Learning to navigate in cities without a map. In Advances in Neural Information Processing Systems, pages 2419–2430, 2018.
- (7) Junhyuk Oh, Satinder Singh, Honglak Lee, and Pushmeet Kohli. Zero-shot task generalization with multi-task deep reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2661–2670. JMLR. org, 2017.
- (8) Jesse Thomason, Michael Murray, Maya Cakmak, and Luke Zettlemoyer. Vision-and-dialog navigation. arXiv preprint arXiv:1907.04957, 2019.
- (9) Homero Roman Roman, Yonatan Bisk, Jesse Thomason, Asli Celikyilmaz, and Jianfeng Gao. Rmm: A recursive mental model for dialog navigation. arXiv preprint arXiv:2005.00728, 2020.
- (10) Ivana Kajić, Eser Aygün, and Doina Precup. Learning to cooperate: Emergent communication in multi-agent navigation. arXiv preprint arXiv:2004.01097, 2020.
- (11) Emilio Parisotto and Ruslan Salakhutdinov. Neural map: Structured memory for deep reinforcement learning. arXiv preprint arXiv:1702.08360, 2017.
- (12) Nikolay Savinov, Alexey Dosovitskiy, and Vladlen Koltun. Semi-parametric topological memory for navigation. arXiv preprint arXiv:1803.00653, 2018.
- (13) Michael Laskin, Scott Emmons, Ajay Jain, Thanard Kurutach, Pieter Abbeel, and Deepak Pathak. Sparse graphical memory for robust planning. arXiv preprint arXiv:2003.06417, 2020.
- (14) Chih-Yao Ma, Jiasen Lu, Zuxuan Wu, Ghassan AlRegib, Zsolt Kira, Richard Socher, and Caiming Xiong. Self-monitoring navigation agent via auxiliary progress estimation. arXiv preprint arXiv:1901.03035, 2019.
- (15) Liyiming Ke, Xiujun Li, Yonatan Bisk, Ari Holtzman, Zhe Gan, Jingjing Liu, Jianfeng Gao, Yejin Choi, and Siddhartha Srinivasa. Tactical rewind: Self-correction via backtracking in vision-and-language navigation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6741–6749, 2019.
- (16) Kuan Fang, Alexander Toshev, Li Fei-Fei, and Silvio Savarese. Scene memory transformer for embodied agents in long-horizon tasks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 538–547, 2019.
- (17) Kara Liu, Thanard Kurutach, Christine Tung, Pieter Abbeel, and Aviv Tamar. Hallucinative topological memory for zero-shot visual planning. arXiv preprint arXiv:2002.12336, 2020.
- (18) Vihan Jain, Gabriel Magalhaes, Alex Ku, Ashish Vaswani, Eugene Ie, and Jason Baldridge. Stay on the path: Instruction fidelity in vision-and-language navigation. arXiv preprint arXiv:1905.12255, 2019.
Chih-Yao Ma, Zuxuan Wu, Ghassan AlRegib, Caiming Xiong, and Zsolt Kira.
The regretful agent: Heuristic-aided navigation through progress estimation.In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6732–6740, 2019.
- (20) Yuke Zhu, Roozbeh Mottaghi, Eric Kolve, Joseph J Lim, Abhinav Gupta, Li Fei-Fei, and Ali Farhadi. Target-driven visual navigation in indoor scenes using deep reinforcement learning. In 2017 IEEE international conference on robotics and automation (ICRA), pages 3357–3364. IEEE, 2017.
- (21) Saurabh Gupta, James Davidson, Sergey Levine, Rahul Sukthankar, and Jitendra Malik. Cognitive mapping and planning for visual navigation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2616–2625, 2017.
- (22) Jacob Andreas, Dan Klein, and Sergey Levine. Modular multitask reinforcement learning with policy sketches. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 166–175. JMLR. org, 2017.
- (23) Fei Xia, William B Shen, Chengshu Li, Priya Kasimbeg, Micael Edmond Tchapmi, Alexander Toshev, Roberto Martín-Martín, and Silvio Savarese. Interactive gibson benchmark: A benchmark for interactive navigation in cluttered environments. IEEE Robotics and Automation Letters, 5(2):713–720, 2020.
- (24) Tingwu Wang, Renjie Liao, Jimmy Ba, and Sanja Fidler. Nervenet: Learning structured policy with graph neural networks. 2018.
- (25) Peter Anderson, Ayush Shrivastava, Devi Parikh, Dhruv Batra, and Stefan Lee. Chasing ghosts: Instruction following as bayesian state tracking. In Advances in Neural Information Processing Systems, pages 369–379, 2019.
- (26) Sertac Karaman and Emilio Frazzoli. Incremental sampling-based algorithms for optimal motion planning. Robotics Science and Systems VI, 104(2), 2010.
- (27) Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pages 3844–3852, 2016.
- (28) Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493, 2015.
- (29) Mikael Henaff, Joan Bruna, and Yann LeCun. Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163, 2015.
- (30) David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in neural information processing systems, pages 2224–2232, 2015.
Zhiwei Deng, Arash Vahdat, Hexiang Hu, and Greg Mori.
Structure inference machines: Recurrent neural networks for analyzing relations in group activity recognition.In Computer Vision and Pattern Recognition (CVPR), 2016.
- (32) Damien Teney, Lingqiao Liu, and Anton van Den Hengel. Graph-structured representations for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2017.
- (33) Adam Santoro, David Raposo, David G Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Timothy Lillicrap. A simple neural network module for relational reasoning. In Advances in neural information processing systems, pages 4967–4976, 2017.
- (34) Boxiao Pan, Haoye Cai, De-An Huang, Kuan-Hui Lee, Adrien Gaidon, Ehsan Adeli, and Juan Carlos Niebles. Spatio-temporal graph for video captioning with knowledge distillation. arXiv preprint arXiv:2003.13942, 2020.
- (35) Michelle Guo, Edward Chou, De-An Huang, Shuran Song, Serena Yeung, and Li Fei-Fei. Neural graph matching networks for fewshot 3d action recognition. In Proceedings of the European Conference on Computer Vision (ECCV), pages 653–669, 2018.
- (36) De-An Huang, Joseph J Lim, Li Fei-Fei, and Juan Carlos Niebles. Unsupervised visual-linguistic reference resolution in instructional videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2183–2192, 2017.
- (37) De-An Huang, Suraj Nair, Danfei Xu, Yuke Zhu, Animesh Garg, Li Fei-Fei, Silvio Savarese, and Juan Carlos Niebles. Neural task graphs: Generalizing to unseen tasks from a single video demonstration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8565–8574, 2019.
- (38) Tingwu Wang, Yuhao Zhou, Sanja Fidler, and Jimmy Ba. Neural graph evolution: Towards efficient automatic robot design. arXiv preprint arXiv:1906.05370, 2019.
Stéphane Ross, Geoffrey Gordon, and Drew Bagnell.
A reduction of imitation learning and structured prediction to
no-regret online learning.
Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–635, 2011.
- (40) Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1171–1179, 2015.
- (41) Alex M Lamb, Anirudh Goyal Alias Parth Goyal, Ying Zhang, Saizheng Zhang, Aaron C Courville, and Yoshua Bengio. Professor forcing: A new algorithm for training recurrent networks. In Advances In Neural Information Processing Systems, pages 4601–4609, 2016.
- (42) Sara Sabour, William Chan, and Mohammad Norouzi. Optimal completion distillation for sequence learning. arXiv preprint arXiv:1810.01398, 2018.
- (43) Felix A Gers, Jürgen Schmidhuber, and Fred Cummins. Learning to forget: Continual prediction with lstm. 1999.
- (44) Gabriel Ilharco, Vihan Jain, Alexander Ku, Eugene Ie, and Jason Baldridge. General evaluation for instruction conditioned navigation using dynamic time warping. arXiv preprint arXiv:1907.05446, 2019.
- (45) Zhitao Ying, Jiaxuan You, Christopher Morris, Xiang Ren, Will Hamilton, and Jure Leskovec. Hierarchical graph representation learning with differentiable pooling. In Advances in neural information processing systems, pages 4800–4810, 2018.
- (46) Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments. arXiv preprint arXiv:1709.06158, 2017.
- (47) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- (48) Xin Wang, Qiuyuan Huang, Asli Celikyilmaz, Jianfeng Gao, Dinghan Shen, Yuan-Fang Wang, William Yang Wang, and Lei Zhang. Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6629–6638, 2019.
- (49) Arjun Majumdar, Ayush Shrivastava, Stefan Lee, Peter Anderson, Devi Parikh, and Dhruv Batra. Improving vision-and-language navigation with image-text pairs from the web. arXiv preprint arXiv:2004.14973, 2020.
- (50) Fengda Zhu, Yi Zhu, Xiaojun Chang, and Xiaodan Liang. Vision-language navigation with self-supervised auxiliary reasoning tasks. arXiv preprint arXiv:1911.07883, 2019.
- (51) Hao Tan, Licheng Yu, and Mohit Bansal. Learning to navigate unseen environments: Back translation with environmental dropout. arXiv preprint arXiv:1904.04195, 2019.
- (52) Qiaolin Xia, Xiujun Li, Chunyuan Li, Yonatan Bisk, Zhifang Sui, Jianfeng Gao, Yejin Choi, and Noah A Smith. Multi-view learning for vision-and-language navigation. arXiv preprint arXiv:2003.00857, 2020.
- (53) Xiujun Li, Chunyuan Li, Qiaolin Xia, Yonatan Bisk, Asli Celikyilmaz, Jianfeng Gao, Noah Smith, and Yejin Choi. Robust navigation with language pretraining and stochastic sampling. arXiv preprint arXiv:1909.02244, 2019.
- (54) Federico Landi, Lorenzo Baraldi, Marcella Cornia, Massimiliano Corsini, and Rita Cucchiara. Perceive, transform, and act: Multi-modal attention networks for vision-and-language navigation. arXiv preprint arXiv:1911.12377, 2019.