RMM: A Recursive Mental Model for Dialog Navigation

05/02/2020 ∙ by Homero Roman Roman, et al. ∙ Microsoft ∙ Carnegie Mellon University

Fluent communication requires understanding your audience. In the new collaborative task of Vision-and-Dialog Navigation, one agent must ask questions and follow instructive answers, while the other must provide those answers. We introduce the first true dialog navigation agents in the literature which generate full conversations, and introduce the Recursive Mental Model (RMM) to conduct these dialogs. RMM dramatically improves generated language questions and answers by recursively propagating reward signals to find the question expected to elicit the best answer, and the answer expected to elicit the best navigation. Additionally, we provide baselines for future work to build on when investigating the unique challenges of embodied visual agents that not only interpret instructions but also ask questions in natural language.




1 Introduction

A key challenge for embodied language is moving beyond instruction following to instruction generation. This dialog paradigm raises a myriad of new research questions, from grounded versions of traditional problems like co-reference resolution Das et al. (2017a) to modeling theory of mind to consider the listener Bisk et al. (2020).

In this work, we develop end-to-end dialog agents to navigate photorealistic, indoor scenes to reach goal rooms, trained on human-human dialog from the Collaborative Vision-and-Dialog Navigation (CVDN) Thomason et al. (2019) dataset. Previous work considers only the vision-and-language navigation task, conditioned on single instructions Wang et al. (2019); Ma et al. (2019); Fried et al. (2018); Anderson et al. (2018) or dialog histories Hao et al. (2020); Zhu et al. (2020); Wang et al. (2020); Thomason et al. (2019). Towards dialog, recent work has modelled question answering in addition to navigation Chi et al. (2020); Nguyen and Daumé III (2019); Nguyen et al. (2019). Closing the loop, our work is the first to train agents to perform end-to-end, collaborative dialogs with question generation, question answering, and navigation conditioned on dialog history.

Figure 1: The RMM agent recursively models conversations with instances of itself to choose the right questions to ask (and answers to give) to reach the goal.

Theory of mind Gopnik and Wellman (1992) guides human communication. Efficient questions and answers build on a shared world of experiences and referents. We formalize this notion through a Recursive Mental Model (RMM) of a conversational partner. With this formalism, an agent spawns instances of itself and converses with them to anticipate the effects of dialog acts before asking a question or generating an answer, thus enabling conversational planning to achieve the desired navigation result.

2 Related Work and Background

This work builds on research in multimodal navigation, multimodal dialog, and the wider goal-oriented dialog literature.

Instruction Following

tasks an embodied agent with interpreting natural language instructions along with visual observations to reach a goal Anderson et al. (2018); Chen and Mooney (2011). These instructions describe step-by-step actions the agent needs to take. This paradigm has been extended to longer trajectories and outdoor environments Chen et al. (2019), as well as to agents in the real world Chai et al. (2018); Tellex et al. (2014). In this work, we focus on the simulated, photorealistic indoor environments of the MatterPort dataset Chang et al. (2017), and go beyond instruction following to a cooperative, two-agent dialog setting.

Navigation Dialogs

task a navigator and a guide to cooperate to find a destination. Agents can be trained on human-human dialogs, but previous work either includes substantial information asymmetry between the navigator and oracle de Vries et al. (2018); Narayan-Chen et al. (2019) or only investigates the navigation portion of the dialog without considering question generation and answering Thomason et al. (2019). The latter approach treats dialog histories as longer and more ambiguous forms of static instructions. No text is generated to approach such navigation-only tasks. Going beyond models that perform navigation from dialog history alone Wang et al. (2020); Zhu et al. (2020); Hao et al. (2020), in this work we train two agents: a navigator agent that asks questions, and a guide agent that answers those questions.

Multimodal Dialog

takes several forms. In Visual Dialog Das et al. (2017a), an agent answers a series of questions about an image while accounting for dialog context in the process. Reinforcement learning Das et al. (2017b) has proved essential to strong performance on this task, and such paradigms have been extended to producing multi-domain visual dialog agents Ju et al. (2019). GuessWhat de Vries et al. (2017) presents a similar paradigm, where agents use visual properties of objects to reason about which referent meets various constraints. Identifying visual attributes can also lead to emergent communication between pairs of learning agents Cao et al. (2018).

Goal Oriented Dialog

Goal-oriented dialog systems, or chatbots, help a user achieve a predefined goal, like booking flights, within a closed domain (Gao et al., 2019; Vlad Serban et al., 2015; Bordes and Weston, 2017) while trying to limit the number of questions asked of the user. Modeling goal-oriented dialog requires skills that go beyond language modeling, such as asking questions to clearly define a user request, querying knowledge bases, and interpreting results from queries as options to complete a transaction. Most current task-oriented systems are data-driven and are trained mostly end-to-end using semi-supervised or transfer learning methods (Ham et al., 2020; Mrksic et al., 2017). However, these data-driven approaches may lack grounding between the text and the current state of the environment. Reinforcement learning-based dialog modeling Su et al. (2016); Peng et al. (2017); Liu et al. (2017) can improve completion rate and user experience by helping ground conversational data to environments.

3 Task and Data

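while the Navigator has not chosen STOP and the maximum number of turns has not been reached do
        the Questioner generates a question        // Question
        the Guide generates an answer              // Answer
        for each action chosen by the Navigator do
                the agent takes the action         // Move
        end for
end while
Algorithm 1 Dialog Navigation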

Our work creates a two-agent dialog task, building on the CVDN dataset Thomason et al. (2019) of human-human dialogs. In that dataset, a human Navigator and Guide collaborate to find a goal room containing a target object, such as a plant. The Navigator moves through the environment, and the Guide views this navigation until the Navigator asks a question in natural language. Then, the Guide can see the next few steps a shortest path planner would take towards the goal, and writes a natural language response. This dialog continues until the Navigator arrives at the goal.

We model this dialog between two agents:

  1. Navigator ($\mathcal{N}$) & Questioner ($\mathcal{Q}$)

  2. Guide ($\mathcal{G}$)

We split the first agent into its two roles: navigation and question asking. The agents receive the same input as their human counterparts in CVDN. In particular, both agents (and all three roles) have access to the entire dialog and visual navigation histories, in addition to a textual description of the target object (e.g., a plant). The Navigator uses this information to decide on a series of actions: forward, left, right, look up, look down, and stop. The Questioner asks for specific guidance from the Guide. The Guide is presented not only with the navigation/dialog history but also with the next five shortest path steps to the goal.

Agents are trained on real human dialogs of natural language questions and answers from CVDN. Individual question-answer exchanges in that dataset are underspecified and rarely provide simple step-by-step instructions like “straight, straight, right, …”. Instead, exchanges rely on assumptions of world knowledge and shared context Frank and Goodman (2012); Grice et al. (1975), which manifest as instructions full of visual-linguistic co-references such as “should I go back to the room I just passed or continue on?”

The CVDN release does not provide any baselines or evaluations for this interactive dialog setting, focusing instead solely on the navigation component of the task. They evaluate navigation agents by “progress to goal” in meters, the distance reduction between the agent’s starting position versus ending position with respect to the goal location.
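The "progress to goal" metric can be illustrated with a minimal sketch. It assumes straight-line (Euclidean) distances between 2D points purely for illustration; the dataset may measure distance differently (for example, along a navigation graph), and the function name `goal_progress` is hypothetical.

```python
import math


def goal_progress(start, end, goal):
    """Reduction in distance to the goal, in meters.

    Positive values mean the agent ended closer to the goal than it
    started; negative values mean it moved away. Euclidean distance is
    an assumption made for this sketch.
    """
    return math.dist(start, goal) - math.dist(end, goal)


# The agent starts 10 m from the goal and ends 7 m from it.
print(goal_progress((0.0, 0.0), (3.0, 0.0), (10.0, 0.0)))  # 3.0
```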

Dialog navigation proceeds by iterating through the three roles until either the navigator chooses to stop or a maximum number of turns is played (Algorithm 1). Upon terminating, the “progress to goal” is returned for evaluation. We also report BLEU scores Papineni et al. (2002) for evaluating the generation of questions and answers by comparing against human questions and answers.
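The turn-taking loop of Algorithm 1 can be sketched in a few lines. This is an illustrative outline under stated assumptions, not the released implementation: the three roles are passed in as plain callables, and the names `run_dialog`, `max_turns`, and the stub agents are hypothetical.

```python
def run_dialog(questioner, guide, navigator, target, max_turns=20):
    """Iterate Question -> Answer -> Move until STOP or the turn limit.

    questioner(history) and guide(history) return a string;
    navigator(history) returns a list of actions for this turn.
    """
    history = [target]  # the dialog context starts as the target object
    path = []
    action = None
    turns = 0
    while action != "STOP" and turns < max_turns:
        history.append(questioner(history))  # Question
        history.append(guide(history))       # Answer
        for action in navigator(history):    # Move
            path.append(action)
            if action == "STOP":
                break
        turns += 1
    return history, path


# Stub agents standing in for the learned models.
def ask(history):
    return "which way?"


def answer(history):
    return "go forward"


def move(history):
    # Stop once two question-answer exchanges have happened.
    return ["forward", "STOP"] if len(history) >= 5 else ["forward"] * 4


history, path = run_dialog(ask, answer, move, "plant")
print(len(history), path[-1])
```

Progress to goal would then be computed from the start and end positions of `path` once the loop terminates.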

Conditioning Context

In our experiments, we define three different notions of dialog context (target object only, previous exchange, and full dialog history) to evaluate how well agents utilize or are confused by the generated conversations.

  • Target object only: The agent must navigate to the goal while only knowing what type of object they are looking for (e.g., a plant).

  • Previous exchange: The agent has access to their previous Question-Answer exchange. They can condition on this information to both generate the next exchange and then navigate towards the goal.

  • Full dialog history: This is the “full” evaluation paradigm in which an agent has access to the entire dialog when interacting. This context also affords the most potential distractor information.
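The three conditioning contexts can be made concrete with a small helper. This is a sketch only: the mode names and the function `build_context` are hypothetical, and it assumes the target object is always kept at the front of the context, which the text states explicitly only for the full-history setting.

```python
def build_context(target, exchanges, mode):
    """Select the dialog context an agent conditions on.

    exchanges is a list of (question, answer) pairs, oldest first.
    """
    if mode == "target_only":
        return [target]
    if mode == "previous_exchange":
        recent = exchanges[-1:]  # at most the latest exchange
    elif mode == "full_dialog":
        recent = exchanges
    else:
        raise ValueError(f"unknown context mode: {mode}")
    return [target] + [utterance for pair in recent for utterance in pair]


dialog = [("left or right?", "go left"), ("up the stairs?", "yes")]
print(build_context("plant", dialog, "previous_exchange"))
```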

4 Models

We introduce the Recursive Mental Model (RMM) as an initial approach to our full dialog task formulation of the CVDN dataset. Key to this approach is allowing component models (Navigator, Questioner, and Guide) to learn from each other and roll out possible dialogs and trajectories. We compare our model to a traditional sequence-to-sequence baseline, and we explore Speaker-Follower data augmentation Fried et al. (2018).

4.1 Sequence-to-Sequence Architecture

The underlying architecture, shown in Figure 2, is shared across all approaches. The core dialog tasks are navigation action decoding and language generation for asking and answering questions. We present three sequence-to-sequence Bahdanau et al. (2015) models to perform as Navigator, Questioner, and Guide. The models rely on an LSTM Hochreiter and Schmidhuber (1997) encoder for the dialog history and a ResNet backbone He et al. (2015) for processing the visual surroundings; we take the penultimate ResNet layer as image observations.

(a) Dialogue and action histories combined with the current observation are used to predict the next navigation action.
(b) A Bi-LSTM over the path is attended to during decoding for question and instruction generation.
Figure 2: Our backbone Seq2Seq architectures are provided visual observations and have access to the dialogue history when taking actions or asking/answering questions Thomason et al. (2019).
Navigation Action Decoding

Initially, the dialog context is a target object that can be found in the goal room, for example “plant” indicating that the goal room contains a plant. As questions are asked and answered, the dialog context grows. Following prior work Anderson et al. (2018); Thomason et al. (2019), dialog history words are embedded as 256-dimensional vectors and passed through an LSTM to produce context vectors and a final hidden state. The final hidden state is used to initialize the LSTM decoder. At every timestep the decoder is updated with the previous action and the current image. The decoder's hidden state is used to attend over the language context and predict the next action (Figure 2(a)).

We pretrain the decoder on the navigation task alone Thomason et al. (2019) before fine-tuning in the full dialog setting we introduce in this paper. The next action is sampled from the model’s predicted logits and the episode ends when either a stop action is sampled or 80 steps are taken.
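Sampling actions from predicted logits until a stop action or the 80-step limit can be sketched as follows. The `policy` callable stands in for the learned decoder and is an assumption of this example, as are the helper names; only the action set and the step limit come from the text.

```python
import math
import random

ACTIONS = ["forward", "left", "right", "look up", "look down", "stop"]


def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(logits, rng):
    """Draw one action in proportion to its softmax probability."""
    return rng.choices(ACTIONS, weights=softmax(logits), k=1)[0]


def rollout(policy, rng, max_steps=80):
    """Sample actions until "stop" is drawn or max_steps actions are taken.

    policy(actions_so_far) returns one logit per entry of ACTIONS.
    """
    actions = []
    for _ in range(max_steps):
        action = sample_action(policy(actions), rng)
        actions.append(action)
        if action == "stop":
            break
    return actions


# A toy policy that strongly prefers "stop" after three steps.
def toy_policy(actions_so_far):
    stop_logit = 1000.0 if len(actions_so_far) >= 3 else -1000.0
    return [0.0, 0.0, 0.0, 0.0, 0.0, stop_logit]


print(rollout(toy_policy, random.Random(0)))
```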

Language Generation

To generate questions and answers, we train sequence-to-sequence models (Figure 2(b)) where an encoder takes in a sequence of images and a decoder produces a sequence of word tokens. At each decoding timestep, the decoder attends over the input images to predict the next word of the question or answer. This model is also initialized via training on CVDN dialogs. In particular, question asking (Questioner) encodes the images of the current viewpoint where a question is asked, and then decodes the question asked by the human Navigator. Question answering (Guide) is encoded by viewing images of the next five steps the shortest path planner would take towards the goal, then decoding the language answer produced by the human Guide.

Pretraining initializes the lexical embeddings and attention alignments before fine-tuning in the collaborative, turn-taking setting we introduce in this paper. We experimented with several beam and temperature based sampling methods, but saw only minor effects; hence, we use direct sampling from the model’s predicted logits for this paper.

4.2 Data Augmentation (DA)

Navigation agents can benefit from data augmentation produced by a learned agent that provides additional, generated language instructions Fried et al. (2018). Data pairs of generated novel language with visual observations along random routes in the environment can help with navigation generalization. We assess the effectiveness of such data augmentation in our two-agent dialog task.

To augment navigation training data, we choose a CVDN conversation but initialize the navigation agent in a random location in the environment, then sample multiple action trajectories and evaluate their progress towards the conversation goal location. In practice, we sample two trajectories and additionally consider the trajectory obtained from picking the top predicted action each time without sampling. We give the visual observations of the best path to the pretrained Questioner model to produce a relevant instruction. This augmentation allows the agent to explore and collect alternative routes to the goal location. We downweight the contributions of these noisier trajectories to the overall loss, and explored different weighting ratios before settling on a final value. The choice of weight affects the fluency of the language generated, because a navigator too tuned to generated language leads to deviation from grammatically valid English and lack of diversity (Section 6 and Appendix).
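One simple way to downweight augmented trajectories is a weighted sum of two mean losses. The sketch below is an assumption for illustration: the exact form of the combined loss is not spelled out here, and `aug_weight` as well as the function name are hypothetical.

```python
def combined_loss(human_losses, augmented_losses, aug_weight):
    """Mean loss on human data plus a downweighted mean loss on augmented data.

    aug_weight is a hypothetical scalar in [0, 1]; smaller values reduce
    the influence of the noisier generated trajectories.
    """
    human = sum(human_losses) / len(human_losses)
    if augmented_losses:
        augmented = sum(augmented_losses) / len(augmented_losses)
    else:
        augmented = 0.0
    return human + aug_weight * augmented


print(combined_loss([1.0, 3.0], [4.0], aug_weight=0.5))  # 4.0
```

Setting `aug_weight` to zero recovers training on human dialogs only, which makes the trade-off between extra exploration and language drift easy to probe.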

4.3 Recursive Mental Model

We introduce the Recursive Mental Model agent (RMM; code at https://github.com/HomeroRR/rmm), which is trained with reinforcement learning to propagate feedback through all three component models: Navigator, Questioner, and Guide. In this way, the training signal for question generation includes the training signal for answer generation, which in turn has access to the training signal from navigation error. Over training, the agent’s progress towards the goal in the environment informs the dialog itself; each model educates the others (Figure 3). This model does not use any data augmentation but still explores the world and updates its representations and language.

Figure 3: The Recursive Mental Model allows for each sampled generation to spawn a new dialog and corresponding trajectory to the goal. The dialog that leads to the most goal progress is followed by the agent.

Each model among the Navigator, Questioner, and Guide may sample several trajectories or generations, each up to a maximum length. These samples in turn are considered recursively by the RMM agent, so the number of possible dialog trajectories grows exponentially with the recursion depth, which is at most the maximum trajectory length. To prevent unbounded exponential growth during training, each model is limited to a maximum number of total recursive calls per run. Search techniques, such as frontiers Ke et al. (2019), could be employed in future work to guide the agent.
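The recursive expansion with a cap on total calls can be illustrated on a toy problem. This is a sketch under stated assumptions rather than the released code: states are integers on a line, `propose` stands in for sampling questions, answers, or actions, `score` stands in for goal progress, and `CallBudget` and `best_rollout` are hypothetical names.

```python
class CallBudget:
    """Caps the total number of recursive expansions (hypothetical helper)."""

    def __init__(self, max_calls):
        self.remaining = max_calls

    def spend(self):
        if self.remaining <= 0:
            return False
        self.remaining -= 1
        return True


def best_rollout(state, depth, n_samples, propose, score, budget):
    """Return (best score, sequence of states) reachable from state.

    Each expansion considers n_samples candidates and recurses on each,
    so the tree grows exponentially with depth until the budget runs out.
    """
    if depth == 0 or not budget.spend():
        return score(state), []
    best_value, best_path = score(state), []
    for candidate in propose(state, n_samples):
        value, rest = best_rollout(
            candidate, depth - 1, n_samples, propose, score, budget
        )
        if value > best_value:
            best_value, best_path = value, [candidate] + rest
    return best_value, best_path


GOAL = 5


def propose(state, n_samples):
    """Toy sampler: candidate next positions are 1..n_samples steps ahead."""
    return [state + step for step in range(1, n_samples + 1)]


def score(state):
    """Toy reward: negative distance to the goal position."""
    return -abs(GOAL - state)


value, path = best_rollout(0, 3, 2, propose, score, CallBudget(100))
print(value, path)
```

With an exhausted budget the function falls back to scoring the current state, which mirrors the idea of bounding the number of recursive calls per run.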


In the dialog task we introduce, the agents begin only knowing the name of the target object. The Navigator agent must move towards the goal room containing the target object, and can ask questions using the Questioner model. The Guide agent answers those questions given a privileged view of the next steps in the shortest path to the goal rendered as visual observations.

We train using a reinforcement learning objective to learn a policy which maximizes the log-likelihood of the shortest path trajectory (Eq. 1), defined in terms of the action decoder, the language encoder, and the dialog context at each timestep.


We can calculate the cross entropy loss between the generated action and the shortest path action at each timestep to do behavioral cloning before sampling the next action from the Navigator predictions.

Reward Shaping with Advantage Actor Critic

As part of the Navigator loss, the goal progress can be leveraged for reward shaping. We use the Advantage Actor Critic Sutton and Barto (1998) formulation with regularization (Eq. 2), which is defined in terms of an advantage function.


The RL agent loss can then be expressed as the sum of the A2C loss with regularization and the cross entropy loss between the ground truth and the generated trajectories. This loss is then propagated through the generation models, Questioner and Guide, as well, by simply accumulating the RL Navigator loss on top of the standard generation cross entropy.


During training, exact environmental feedback can be used to evaluate samples and trajectories. This information is not available at inference, so we instead rely on the navigator’s confidence to determine which of several sampled paths should be explored. Specifically, for every question-answer pair sampled, the agent rolls forward five navigation actions, and the probabilities of all resulting navigation sequences are compared. The trajectory with the highest probability is used for the next timestep. Note that this does not guarantee that the model is actually progressing towards the goal, but rather that the agent is confident that it is acting correctly given the dialog context and target object hint.
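The inference-time selection rule can be sketched as picking the candidate whose rolled-forward actions have the highest joint probability. The data layout and function names below are assumptions for illustration; log-probabilities are summed instead of multiplying raw probabilities, which gives the same ranking with better numerical behavior.

```python
import math


def sequence_log_prob(step_probs):
    """Log of the joint probability of a sequence of sampled actions."""
    return sum(math.log(p) for p in step_probs)


def pick_most_confident(candidates):
    """Choose the (qa_pair, step_probs) candidate with the likeliest rollout.

    step_probs holds the navigator's probability for each of the actions
    rolled forward after that question-answer pair.
    """
    return max(candidates, key=lambda c: sequence_log_prob(c[1]))


candidates = [
    (("left or right?", "go left"), [0.5, 0.5, 0.5, 0.5, 0.5]),
    (("up the stairs?", "yes, go up"), [0.9, 0.9, 0.9, 0.9, 0.9]),
]
qa_pair, _ = pick_most_confident(candidates)
print(qa_pair)
```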

4.4 Gameplay

As is common in dialog settings, there are several moving pieces and a growing notion of state throughout training and evaluation. In addition to the Navigator, Questioner, and Guide, ideally there should also be a model which generates the target object and one which determines when it is best to ask a question. We leave these two components for future work and instead assume we have access to the human-provided target (e.g., a plant) and set the number of steps before asking a question to four based on the human average of 4.5 in CVDN.

Setting a maximum trajectory length is required due to computational constraints, as the language context grows with every exchange. Following Thomason et al. (2019), we use a maximum navigation length of 80 steps, leading to a maximum of 20 dialog question-answer pairs (one every four steps).

To simplify training we use a single model for both question and answer generation, and indicate the role of spans of text by prepending <NAV> (Questioner asks during navigation) or <ORA> (Guide answers questions) tags (Figure 2(a)). During rollouts the model is reinitialized to prevent information sharing via the hidden units.
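The role-tagging scheme can be illustrated with a short helper that flattens a dialog into one tagged string. Only the `<NAV>` and `<ORA>` tags come from the text; the function name, the whitespace joining, and placing the target object first are assumptions of this sketch.

```python
NAV_TAG = "<NAV>"  # marks a question asked by the Questioner
ORA_TAG = "<ORA>"  # marks an answer given by the Guide


def tag_dialog(target, exchanges):
    """Flatten (question, answer) pairs into a single role-tagged string."""
    parts = [target]
    for question, answer in exchanges:
        parts.append(f"{NAV_TAG} {question}")
        parts.append(f"{ORA_TAG} {answer}")
    return " ".join(parts)


print(tag_dialog("plant", [("should i turn left?", "yes, then go upstairs")]))
# plant <NAV> should i turn left? <ORA> yes, then go upstairs
```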

                                       Goal Progress (m)                                 BLEU
Split       Model                      Target only  Prev. QA  Full dialog  + Oracle      Prev. QA  Full dialog
Val Seen    Baseline                   20.1         10.5      15.0         22.9          0.9       0.8
            Data Aug.                  20.1         10.5      10.0         14.2          1.3       1.3
            RMM (single sample)        18.7         10.0      13.3         20.4          3.3       3.0
            RMM (multiple samples)     18.9         11.5      14.0         16.8          3.4       3.6
            Shortest Path              32.8 (single value spanning the goal progress columns)
Val Unseen  Baseline                   6.8          4.7       4.6          6.3           0.5       0.5
            Data Aug.                  6.8          5.6       4.4          6.5           1.3       1.1
            RMM (single sample)        6.1          6.1       5.1          6.0           2.6       2.8
            RMM (multiple samples)     7.3          5.5       5.6          8.9           2.9       2.9
            Shortest Path              29.3 (single value spanning the goal progress columns)
Table 1: Gameplay results on CVDN evaluated when agent voluntarily stops or at 80 steps. Full evaluations are highlighted in gray with the best results in blue; the remaining white columns are ablation results.

5 Results

In Table 1 we present gameplay results for our RMM model and competitive baselines. We report two main results and four ablations for seen and unseen house evaluations; the former are novel dialogs in houses seen at training time, while the latter are novel dialogs in novel houses.

Full Evaluation

The full evaluation paradigm corresponds to the full dialog history context for goal progress and BLEU. In this setup, the agent has access to and is attending over the entire dialog history up until the current timestep in addition to the original target object. We present three models and two conditions for RMM, which differ in the number of samples explored in our recursive calls: the first corresponds to simply taking the single maximum prediction, while the second allows the agent to explore. In the second condition, the choice of path/dialog is determined by the probabilities assigned by the Navigator (Section 4.3).

An additional challenge for navigation agents is knowing when to stop. Following previous work Anderson et al. (2018), we report Oracle Success Rates measuring the best goal progress the agents achieve along the trajectory, rather than the goal progress when the stop action is taken.

In unseen environments, the RMM-based agent makes the most progress towards the goal and benefits from exploration during inference. During inference the agent is not provided any additional supervision, but still makes noticeable gains by evaluating trajectories based on learned Navigator confidence. Additionally, we see that while low, the BLEU scores are better for RMM-based agents across settings.


We also include two simpler results: the target-object-only context, where the agent is only provided the target object and explores based on this simple goal, and the previous-exchange context, where the agent is only provided the previous question-answer pair. Both of these settings simplify the learning and evaluation by focusing the agent on search and less ambiguous language, respectively. There are two results to note. First, even in the simple target-object-only case, the RMM-trained model generalizes best to unseen environments. In this setting, during inference all models have the same limited information, so the RL loss and exploration have better equipped RMM to generalize.

Second, several trends invert between the seen and unseen scenarios. Specifically, the simplest model with the least information performs best overall in seen houses. This high performance coupled with weak language appears to indicate the models are learning a different (perhaps search-based) strategy rather than how to communicate via and effectively utilize dialog. In the previous-exchange and full-dialog settings, the agent generates a question-answer pair before navigating, so the relative strength of the RMM model’s communication becomes clear. We next analyze the language and behavior of our models to investigate these results.

Figure 4: Log-frequency of words generated by human speakers as compared to the Data Augmentation (DA) and our Recursive Mental Model (RMM) models.

6 Analysis

We analyze the lexical diversity and effectiveness of generated questions by the RMM, and present a qualitative inspection of generated dialogs.

(a) Normalized plot of goal progress and #Qs asked by humans. Note, even for long dialogs, most questions lead to substantial progress towards the goal.
(b) DA and RMM generated dialogs make slower but consistent progress (ending below 25% of total goal progress).
Figure 5: Effectiveness of human dialogs (left) vs our models (right) at achieving the goal. The slopes indicate the effectiveness of each dialog exchange in reaching the target.

6.1 Lexical Diversity

Both RMM and Data Augmentation introduce new language by exploring the environment and generating dialogs. In the case of RMM, an RL loss is used to update the models based on the most successful dialog. In the Data Augmentation strategy, the best generations are simply appended to the dataset for one epoch and weighted appropriately for standard, supervised training. The augmentation strategy leads to a small boost in BLEU performance and goal progress in several settings (Table 1), but the language appears to collapse to repetitive and generic interactions. We see this manifest rather dramatically in Figure 4, where the DA is limited to only 22 lexical types. In contrast, the Recursive Mental Model continues to produce over 500 unique lexical types, much closer to the nearly 900 of humans.
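Counting lexical types amounts to counting distinct word forms in the generated text. The sketch below assumes a simple lowercase, letters-and-apostrophes tokenization, which may differ from the tokenization used for the reported numbers; the function name is hypothetical.

```python
import re
from collections import Counter


def lexical_types(utterances):
    """Count how often each distinct word form appears across utterances."""
    counts = Counter()
    for utterance in utterances:
        counts.update(re.findall(r"[a-z']+", utterance.lower()))
    return counts


generated = ["should i go into the room?", "you are in the goal room."]
counts = lexical_types(generated)
print(len(counts))  # 10 distinct word types
```

The number of keys in the returned counter is the type count, and the counts themselves can be log-scaled to compare frequency distributions across models.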

6.2 Effective Questions

A novel component of a dialog paradigm is assessing the efficacy of every speech act in accomplishing a goal. Specifically, the optimal question should elicit the optimal response, which in turn maximizes the progress towards the goal room. If agents were truly effective at modeling each other, we would expect the number of dialog acts to be kept to a minimum. We plot the percent of questions asked against the percent of goal progress in Figures 5(a) and 5(b). Human conversations in CVDN always reach the goal location, and usually with only 3-4 questions (Figure 5(a)). We see that the relationship between questions and progress is roughly linear, excusing the occasional lost and confused human teams. The final human-human question is often simply confirmation that navigation has arrived successfully at the goal room.

Figure 6: Generated trajectories in an unseen environment. The red stop-sign is the target, while the black stop-signs are distractors (other fire extinguishers) that may confuse the agents. The white dashed trajectory is the human path from CVDN, black is the baseline model, and green is our RMM.

In Figure 5(b), we plot dialogs for the Baseline, Data Augmentation, and RMM agents against percent goal progress. The RMM consistently outperforms the other two agents in terms of goal progress for each dialog act. We see an increase in progress for the first 10 to 15 questions before the model levels off. In contrast, the other agents exhibit shallower curves and fail to reach the same level of performance.

Model   Conversation                                                                    GP
Human   Q: Do I go in between the ropes to my right or straight forward?
        A: straight forward through the next room                                       0
        Q: Should I proceed down the hall to the left of turn right?
        A: head down the hall to your right into the next room                          13.3
        Q: Should I go through the open doors that are the closest to me?
        A: You are in the goal room                                                     29.1
DA      Q: should i go into the room?
        A: you are in the goal room.                                                    5.7
        Q: should i go into the room?
        A: you are in the goal room.                                                    0.0
RMM     Q: should i head forward or bedroom the next hallway in front of me?
        A: yes, all the way down the small hall.                                        4.0
        Q: should i turn left here?
        A: head into the house, then you will find a doorway at the goal staircase. go through the doors before those two small exit chairs, about half way down the hall.    5.7
        Q: lots of sink in this house, or wrong did. ok which way do i go
        A: go down the hallway, take a left and go down the next hallway and up the stairs on the right.    8.8
Table 2: Dialog samples for Figure 6 with corresponding Goal Progress (GP); see appendix for complete outputs.

6.2.1 Qualitative Results

Figure 1 gives a cherry-picked example trajectory, and Figure 6 gives a lemon-picked example trajectory, from the unseen validation environments.

We discuss the successes and failures of the lemon-picked Figure 6. As with all CVDN instances, there are multiple target object candidates (here, “fire extinguisher”) but only one valid goal room. Goal progress is measured against the goal room. When the Guide is shown the next few shortest path steps to communicate, those steps are towards the goal room. As can be seen in Figure 6, the learned agents have difficulty in deciding when to stop and begin retracing their steps.

This distinction is most obvious when comparing the language generated. Table 2 shows generated conversations along with the Goal Progress (GP) at each point when a question was asked. Note that the generation procedure for all models is the same sampler, and they start training from the same checkpoint, so the relatively coherent nature of the RMM as compared to the simple repetitiveness of the Data Augmentation is entirely due to the recursive calls and RL loss. No model has access to length penalties or other generation tricks to avoid degenerating.

7 Conclusions and Future Work

In this paper, we present a two-agent task paradigm for cooperative vision-and-dialog navigation. Previous work was limited to navigation only, or to navigation with limited additional instructions, while our work involves navigation, question asking, and question answering components for a full, end-to-end dialog. We demonstrate that a simple speaker model for data augmentation is insufficient for the dialog setting, and instead see promising results from a recursive RL formulation with turn taking informed by theory of mind. This task presents novel and complex challenges for future work, including modeling uncertainty for when to ask a question, incorporating world knowledge for richer notions of common ground, and increasingly effective question-answer generation.


  • P. Anderson, Q. Wu, D. Teney, J. Bruce, M. Johnson, N. Sunderhauf, I. Reid, S. Gould, and A. van den Hengel. (2018) Vision-and-language navigation: interpreting visually-grounded navigation instructions in real environments.. In Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2, §4.1, §5.
  • D. Bahdanau, K. Cho, and Y. Bengio (2015) Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations (ICLR), External Links: Link Cited by: §4.1.
  • Y. Bisk, A. Holtzman, J. Thomason, J. Andreas, Y. Bengio, J. Chai, M. Lapata, A. Lazaridou, J. May, A. Nisnevich, N. Pinto, and J. Turian (2020) Experience Grounds Language. ArXiv. External Links: Link Cited by: §1.
  • A. Bordes and J. Weston (2017) Learning end-to-end goal-oriented dialog. In International Conference on Learning Representations (ICLR), External Links: Link Cited by: §2.
  • K. Cao, A. Lazaridou, M. Lanctot, J. Z. Leibo, K. Tuyls, and S. Clark (2018) Emergent communication through negotiation. In International Conference on Learning Representations (ICLR), External Links: Link Cited by: §2.
  • J. Y. Chai, Q. Gao, L. She, S. Yang, S. Saba-Sadiya, and G. Xu (2018) Language to action: towards interactive task learning with physical agents. In

    International Joint Conference on Artificial Intelligence (IJCAI)

    External Links: Link Cited by: §2.
  • A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Nießner, M. Savva, S. Song, A. Zeng, and Y. Zhang. (2017) Matterport3D: learning from rgb-d data in indoor environments. In International Conference on 3D Vision., Cited by: §2.
  • D. Chen and R. J. Mooney (2011) Learning to interpret natural language navigation instructions from observations. In Conference on Artificial Intelligence (AAAI), Cited by: §2.
  • H. Chen, A. Suhr, D. Misra, N. Snavely, and Y. Artzi (2019) Touchdown: natural language navigation and spatial reasoning in visual street environments. In Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • T. Chi, M. Eric, S. Kim, M. Shen, and D. Hakkani-tur (2020) Just ask: an interactive learning framework for vision and language navigation. In Conference on Artificial Intelligence (AAAI), Cited by: §1.
  • A. Das, S. Kottur, K. Gupta, A. Singh, D. Yadav, J. M.F. Moura, D. Parikh, and D. Batra (2017a) Visual Dialog. In Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2.
  • A. Das, S. Kottur, J. M.F. Moura, S. Lee, and D. Batra (2017b) Learning cooperative visual dialog agents with deep reinforcement learning. In International Conference on Computer Vision (ICCV), Cited by: §2.
  • H. de Vries, K. Shuster, D. Batra, D. Parikh, J. Weston, and D. Kiela (2018) Talk the walk: navigating new york city through grounded dialogue. arXiv:1807.03367. Cited by: §2.
  • H. de Vries, F. Strub, S. Chandar, O. Pietquin, H. Larochelle, and A. C. Courville (2017) GuessWhat?! visual object discovery through multi-modal dialogue. In Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • M. C. Frank and N. D. Goodman (2012) Predicting pragmatic reasoning in language games. Science 336 (6084), pp. 998–998. External Links: ISSN 0036-8075, Document, https://science.sciencemag.org/content/336/6084/998.full.pdf, Link Cited by: §3.
  • D. Fried, R. Hu, V. Cirik, A. Rohrbach, J. Andreas, L. Morency, T. Berg-Kirkpatrick, K. Saenko, D. Klein, and T. Darrell (2018) Speaker-follower models for vision-and-language navigation. In Neural Information Processing Systems (NeurIPS), Cited by: §1, §4.2, §4.
  • J. Gao, M. Galley, and L. Li (2019) Neural approaches to conversational AI. Foundations and Trends in Information Retrieval. External Links: Link Cited by: §2.
  • A. Gopnik and H. M. Wellman (1992) Why the child’s theory of mind really is a theory. Mind & Language 7 (1-2), pp. 145–171. Cited by: §1.
  • H. P. Grice, P. Cole, and J. J. Morgan (1975) Logic and conversation. Syntax and Semantics, volume 3: Speech Acts, pp. 41–58. External Links: Link Cited by: §3.
  • D. Ham, J. Lee, Y. Jang, and K. Kim (2020) End-to-end neural pipeline for goal-oriented dialogue system using gpt-2. In Conference on Artificial Intelligence (AAAI), Cited by: §2.
  • W. Hao, C. Li, X. Li, L. Carin, and J. Gao (2020) Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-training. In Computer Vision and Pattern Recognition (CVPR), External Links: Link Cited by: §1, §2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2015) Deep residual learning for image recognition. arXiv:1512.03385. External Links: Link Cited by: §4.1.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780. Cited by: §4.1.
  • D. Y. Ju, K. Shuster, Y. Boureau, and J. Weston (2019) All-in-one image-grounded conversational agents. arXiv:1912.12394. Cited by: §2.
  • L. Ke, X. Li, Y. Bisk, A. Holtzman, Z. Gan, J. Liu, J. Gao, Y. Choi, and S. Srinivasa (2019) Tactical rewind: self-correction via backtracking in vision-and-language navigation. In Computer Vision and Pattern Recognition (CVPR), Cited by: §4.3.
  • B. Liu, G. Tür, D. Hakkani-Tür, P. Shah, and L. P. Heck (2017) End-to-end optimization of task-oriented dialogue model with deep reinforcement learning. CoRR abs/1711.10712. External Links: Link, 1711.10712 Cited by: §2.
  • C. Ma, Z. Wu, G. AlRegib, C. Xiong, and Z. Kira (2019) The regretful agent: heuristic-aided navigation through progress estimation. In Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
  • N. Mrksic, D. Ó. Séaghdha, T. Wen, B. Thomson, and S. J. Young (2017) Neural belief tracker: data-driven dialogue state tracking. In Association for Computational Linguistics (ACL), Cited by: §2.
  • A. Narayan-Chen, P. Jayannavar, and J. Hockenmaier (2019) Collaborative dialogue in Minecraft. In Association for Computational Linguistics (ACL), External Links: Link Cited by: §2.
  • K. Nguyen and H. Daumé III (2019) Help, anna! visual navigation with natural multimodal assistance via retrospective curiosity-encouraging imitation learning. In Empirical Methods in Natural Language Processing (EMNLP), External Links: Link Cited by: §1.
  • K. Nguyen, D. Dey, C. Brockett, and B. Dolan (2019) Vision-based navigation with language-based assistance via imitation learning with indirect intervention. In Computer Vision and Pattern Recognition (CVPR), External Links: Link Cited by: §1.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) Bleu: a method for automatic evaluation of machine translation. In Association for Computational Linguistics (ACL), Cited by: §3.
  • B. Peng, X. Li, L. Li, J. Gao, A. Celikyilmaz, S. Lee, and K. Wong (2017) Composite task-completion dialogue system via hierarchical deep reinforcement learning. In Empirical Methods in Natural Language Processing (EMNLP), Cited by: §2.
  • P. Su, M. Gasic, N. Mrksic, L. M. Rojas-Barahona, S. Ultes, D. Vandyke, T. Wen, and S. J. Young (2016) On-line active reward learning for policy optimisation in spoken dialogue systems. In Proceedings of the Association for Computational Linguistics (ACL). Cited by: §2.
  • R. S. Sutton and A. G. Barto (1998) Reinforcement learning: an introduction. MIT press Cambridge. Cited by: §4.3.
  • S. Tellex, R. A. Knepper, A. Li, T. M. Howard, D. Rus, and N. Roy (2014) Asking for help using inverse semantics. In Robots: Science and Systems (RSS), External Links: Link Cited by: §2.
  • J. Thomason, M. Murray, M. Cakmak, and L. Zettlemoyer (2019) Vision-and-dialog navigation. Conference on Robot Learning (CoRL). Cited by: §1, §2, §3, Figure 2, §4.1, §4.1, §4.4.
  • I. Vlad Serban, R. Lowe, P. Henderson, L. Charlin, and J. Pineau (2015) A Survey of Available Corpora for Building Data-Driven Dialogue Systems. ArXiv. Cited by: §2.
  • X. Wang, Q. Huang, A. Celikyilmaz, J. Gao, D. Shen, Y. Wang, W. Y. Wang, and L. Zhang (2019) Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation. In Computer Vision and Pattern Recognition (CVPR), External Links: Link Cited by: §1.
  • X. Wang, V. Jain, E. Ie, W. Y. Wang, Z. Kozareva, and S. Ravi (2020) Environment-agnostic Multitask Learning for Natural Language Grounded Navigation. arXiv:2003.00443. External Links: Link Cited by: §1, §2.
  • Y. Zhu, F. Zhu, Z. Zhan, B. Lin, J. Jiao, X. Chang, and X. Liang (2020) Vision-Dialog Navigation by Exploring Cross-modal Memory. In Computer Vision and Pattern Recognition (CVPR), External Links: Link Cited by: §1, §2.

Appendix A Appendix

A.1 Training Details

Action Decoder Pretraining

We train with a batch size of 100 for 20,000 iterations. The model is updated every 100 iterations using RMSProp with a learning rate of 0.0001 and a weight decay of 0.0005.
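As a rough illustration of the hyperparameters stated above, the following sketch collects them in a configuration object and applies a single RMSProp update in plain Python. Only the batch size, iteration count, learning rate, and weight decay come from the text. The names `PretrainConfig` and `rmsprop_step`, the smoothing constant `alpha`, the stabilizer `eps`, and the choice to fold weight decay into the gradient as an L2 penalty are assumptions made for illustration, not details of the authors' implementation.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class PretrainConfig:
    # Values stated in the training details.
    batch_size: int = 100
    iterations: int = 20_000
    learning_rate: float = 1e-4
    weight_decay: float = 5e-4
    # Assumed values (not given in the text): common RMSProp defaults.
    alpha: float = 0.99
    eps: float = 1e-8


def rmsprop_step(params, grads, sq_avg, cfg):
    """Apply one RMSProp update in place to params and sq_avg; return params."""
    for i, (p, g) in enumerate(zip(params, grads)):
        # Assumption: weight decay is an L2 penalty added to the gradient.
        g = g + cfg.weight_decay * p
        # Running average of squared gradients.
        sq_avg[i] = cfg.alpha * sq_avg[i] + (1.0 - cfg.alpha) * g * g
        # Scale the step by the root of the running average.
        params[i] = p - cfg.learning_rate * g / (math.sqrt(sq_avg[i]) + cfg.eps)
    return params


cfg = PretrainConfig()
params = [1.0, -2.0]   # toy parameters, hypothetical
sq_avg = [0.0, 0.0]    # running averages start at zero
rmsprop_step(params, [0.5, -0.5], sq_avg, cfg)
```

With a zero-initialized running average, the first step moves each parameter by roughly the learning rate divided by the square root of (1 - alpha), regardless of the gradient's magnitude. This normalization is what makes RMSProp fairly insensitive to gradient scale.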

A.2 Conversation Examples

Full human and baseline conversations for Figure 6.

Conversation GP
Human Do I go in between the ropes to my right or straight forward? 0
straight forward through the next room
Should I proceed down the hall to the left of turn right? 13.31
head down the hall to your right into the next room
Should I go through the open doors that are the closest to me? 29.05
You are in the goal room
Baseline ok now which way ? 4.0
keep going forward and then into a bedroom on your left . it ’ s to the left of the bed .
do i continue going to the right into the living room ? 5.7
go into the door in front of you , then an right , the bathroom ’ s there .
do i keep going to the right into the living room ? 8.8
it looks like that is the goal room .
should i go into the laundry room to my right , or do i need to go in ? 7.5
i see many ways to tell . you just go around . you ’ re in the right room .
ok now which way ? 4.0
go into the room near you . that should be the goal
okay i see any set of stairs . should i go up them ? 0.4
turn right , then another right , the bathroom ’ s there .
ok . can you go more directions . i see many rooms on this floor . 0
turn right , then another right , the bathroom ’ s there .
should i go the room ? or into the door to my left ? 4.0
turn right , then another right , the bathroom ’ s there .
should i stay in this room or go out ? 8.8
turn right , then another right , the bathroom ’ s there .
ok , should i go in this direction ? 9.9
turn right , go past the pool table , into the room just towards the stairs . that should be the goal .
should i go left toward the room , or past the painting and straight ? or to my right ? 7.5
turn right , then another right , the bathroom ’ s there .
should i take a left here in the room ? 5.7
that is the goal room
should i stay in this room or go out ? 8.8
you see the front , that is the room . its goal room
the room ? or into the door to my left ? 13.3
go up the stairs in front of you , then to your left , you ’ ll see a bathroom . i think that bathroom is the goal
should i go left toward the room , or past the painting and straight ? or to my right ? 9.3
go towards the table in front of me of the hallway after take a right , then make a left into the sitting room .
should i go to the right towards the blue painting ? 9.3
continue going up , go to the bedroom
should i go into the door on the left , or keep going forward ? 11.1
go into the bedroom near you . that should be the goal
do i need to go up ? 8.8
go out in the room , make a right past the blue painting and go up the stairs
ok now which way ? 7.5
go up the spiral stairs again when you ’ go into that bedroom straight ahead .
should i go outside ? 8.8
a bathroom on pretty left , go in it and then into the bathroom in that bedroom

Data Augmentation and RMM conversations for Figure 6.

Conversation GP
DA should i go into the room? 5.66
you are in the goal room.
should i go into the room? 0.03
you are in the goal room.
RMM should i head forward or bedroom the next hallway in front of me ? 4.0
yes , all the way down the small hall .
should i turn left here ? 5.7
head into the house , then you will find a doorway at the goal staircase . go through the doors before those two small exit chairs , about half way down the hall .
lots of sink in this house , or wrong did . ok which way do i go 8.8
go down the hallway , take a left and go down the next hallway and up the stairs on the right .
right place now ? 7.5
turn left and when you reach to the bathroom . before these go up the stairs on your left , and walk passed the mirror and chairs on the left into hallway across from the stairs
do i go left , right , or up the stairs ? 4.0
turn to your left and keep going straight
do i make a right turn or shall i enter the room to my left ? 0.4
turn left to go down the hall , then that fire bed has that
left or continue ? 0
yes , go past the dining table and take an immediate right . head through the small door to the left of the window before those two way doors behind you , go up the second small set of stairs .
should i go downstairs or turn left ? 4.0
go to the right and go to the front doorway .
should i go down the stairs or stay on this floor ? 8.8
take a left towards the bathroom , then take a left before it and go all the way down the hall
do i go up these is to the right or right of the steps ? 9.9
go to the left side of the staircase and turn left in the doorway before the two small office chairs , about half way down the hall .
should i turn left , go straight into the living room , or up the stairs ? 7.5
turn to your right and go straight down the hall
do i go out into the hallway ? 5.7
go left down the hall where the office floor . and pass the second door after a right and table .
ok , should i go right or left next ? 8.8
go back to the staircase . go through the doorway you and before the hallway on the right .
do i make a left turn or shall i enter the room to my left ? 13.3
go down the hall and turn right into the bedroom
should i go to the left or the right ? 9.3
yes , go out of this room , turn right and go down the white hall before the staircase stairs , then go down the way down that way you get .
ok i was a in by this office painting , or i just in the second hallway in front of me ? 9.3
okay .
which way do i go in , or do i head up the stairs ? 11.1
go all the way to the one of the staircase . turn left in the doorway before the two two office chairs , about half way down the hall .
ok wrong far which way do i go 8.8
right then at the top of the stairs .
left or continue ? 7.5
yes . go down the hall and stop at the landing of the stairs .