Talk the Walk: Navigating New York City through Grounded Dialogue

by   Harm de Vries, et al.
Université de Montréal

We introduce "Talk The Walk", the first large-scale dialogue dataset grounded in action and perception. The task involves two agents (a "guide" and a "tourist") that communicate via natural language in order to achieve a common goal: having the tourist navigate to a given target location. The task and dataset, which are described in detail, are challenging and their full solution is an open problem that we pose to the community. We (i) focus on the task of tourist localization and develop the novel Masked Attention for Spatial Convolutions (MASC) mechanism that allows for grounding tourist utterances into the guide's map, (ii) show it yields significant improvements for both emergent and natural language communication, and (iii) using this method, we establish non-trivial baselines on the full task.


page 2

page 14

page 17

page 21


Learning to Play Guess Who? and Inventing a Grounded Language as a Consequence

Acquiring your first language is an incredible feat and not easily dupli...

A recipe for annotating grounded clarifications

In order to interpret the communicative intents of an utterance, it need...

Reference-Centric Models for Grounded Collaborative Dialogue

We present a grounded neural dialogue model that successfully collaborat...

The Pragmatics of Indirect Commands in Collaborative Discourse

Today's artificial assistants are typically prompted to perform tasks th...

A Natural Language Corpus of Common Grounding under Continuous and Partially-Observable Context

Common grounding is the process of creating, repairing and updating mutu...

Wizard of Wikipedia: Knowledge-Powered Conversational agents

In open-domain dialogue intelligent agents should exhibit the use of kno...

Learning to Speak and Act in a Fantasy Text Adventure Game

We introduce a large scale crowdsourced text adventure game as a researc...

1 Introduction

As artificial intelligence plays an ever more prominent role in everyday human lives, it becomes increasingly important to enable machines to communicate via natural language—not only with humans, but also with each other. Learning algorithms for natural language understanding, such as in machine translation and reading comprehension, have progressed at an unprecedented rate in recent years, but still rely on static, large-scale, text-only datasets that lack crucial aspects of how humans understand and produce natural language. Namely, humans develop language capabilities by being embodied in an environment which they can perceive, manipulate and move around in; and by interacting with other humans. Hence, we argue that we should incorporate all three fundamental aspects of human language acquisition—perception, action and interactive communication—and develop a task and dataset to that effect.

We introduce the Talk the Walk dataset, where the aim is for two agents, a “guide” and a “tourist”, to interact with each other via natural language in order to achieve a common goal: having the tourist navigate towards the correct location. The guide has access to a map and knows the target location, but does not know where the tourist is; the tourist has a 360-degree view of the world, but knows neither the target location on the map nor the way to it. The agents need to work together through communication in order to successfully solve the task. An example of the task is given in Figure 1.

Grounded language learning has (re-)gained traction in the AI community, and much attention is currently devoted to virtual embodiment—the development of multi-agent communication tasks in virtual environments—which has been argued to be a viable strategy for acquiring natural language semantics (Kiela et al., 2016). Various related tasks have recently been introduced, but in each case with some limitations. Although visually grounded dialogue tasks (de Vries et al., 2016; Das et al., 2016) comprise perceptual grounding and multi-agent interaction, their agents are passive observers and do not act in the environment. By contrast, instruction-following tasks, such as VNL (Anderson et al., 2017), involve action and perception but lack natural language interaction with other agents. Furthermore, some of these works use simulated environments (Das et al., 2017a) and/or templated language (Hermann et al., 2017), which arguably oversimplifies real perception or natural language, respectively. See Table 1 for a comparison.

Figure 1: Example of the Talk The Walk task: two agents, a “tourist” and a “guide”, interact with each other via natural language in order to have the tourist navigate towards the correct location. The guide has access to a map and knows the target location but not the tourist location, while the tourist does not have a map and is tasked with navigating a 360-degree street view environment.

Talk The Walk is the first task to bring all three aspects together: perception for the tourist observing the world, action for the tourist to navigate through the environment, and interactive dialogue for the tourist and guide to work towards their common goal. To collect grounded dialogues, we constructed a virtual 2D grid environment by manually capturing 360-views of several neighborhoods in New York City (NYC)111We avoided using existing street view resources due to licensing issues.. As the main focus of our task is on interactive dialogue, we limit the difficulty of the control problem by having the tourist navigating a 2D grid via discrete actions (turning left, turning right and moving forward). Our street view environment was integrated into ParlAI (Miller et al., 2017) and used to collect a large-scale dataset on Mechanical Turk involving human perception, action and communication.

We argue that for artificial agents to solve this challenging problem, some fundamental architecture designs are missing, and our hope is that this task motivates their innovation. To that end, we focus on the task of localization and develop the novel Masked Attention for Spatial Convolutions (MASC) mechanism. To model the interaction between language and action, this architecture repeatedly conditions the spatial dimensions of a convolution on the communicated message sequence.

This work makes the following contributions: 1) We present the first large scale dialogue dataset grounded in action and perception; 2) We introduce the MASC architecture for localization and show it yields improvements for both emergent and natural language; 4) Using localization models, we establish initial baselines on the full task;  5) We show that our best model exceeds human performance under the assumption of “perfect perception” and with a learned emergent communication protocol, and sets a non-trivial baseline with natural language.

2 Talk The Walk

We create a perceptual environment by manually capturing several neighborhoods of New York City (NYC) with a 360 camera222A 360fly 4K camera.. Most parts of the city are grid-like and uniform, which makes it well-suited for obtaining a 2D grid. For Talk The Walk, we capture parts of Hell’s Kitchen, East Village, the Financial District, Williamsburg and the Upper East Side—see Figure 6 in Appendix 13 for their respective locations within NYC. For each neighborhood, we choose an approximately 5x5 grid and capture a 360 view on all four corners of each intersection, leading to a grid-size of roughly 10x10 per neighborhood.

The tourist’s location is given as a tuple , where are the coordinates and signifies the orientation (north, east, south or west). The tourist can take three actions: turn left, turn right and go forward. For moving forward, we add , , , to the coordinates for the respective orientations. Upon a turning action, the orientation is updated by where for left and for right. If the tourist moves outside the grid, we issue a warning that they cannot go in that direction and do not update the location. Moreover, tourists are shown different types of transitions: a short transition for actions that bring the tourist to a different corner of the same intersection; and a longer transition for actions that bring them to a new intersection.

The guide observes a map that corresponds to the tourist’s environment. We exploit the fact that urban areas like NYC are full of local businesses, and overlay the map with these landmarks as localization points for our task. Specifically, we manually annotate each corner of the intersection with a set of landmarks , each coming from one of the following categories:

  • Bar

  • Playfield

  • Bank

  • Hotel

  • Shop

  • Subway

  • Coffee Shop

  • Restaurant

  • Theater

The right-side of Figure 1 illustrates how the map is presented. Note that within-intersection transitions have a smaller grid distance than transitions to new intersections. To ensure that the localization task is not too easy, we do not include street names in the overhead map and keep the landmark categories coarse. That is, the dialogue is driven by uncertainty in the tourist’s current location and the properties of the target location: if the exact location and orientation of the tourist were known, it would suffice to communicate a sequence of actions.

Project Perception Action Language Dial. Size Acts
Visual Dialog (Das et al., 2016) Real Human 120k dialogues 20
GuessWhat (de Vries et al., 2016) Real Human 131k dialogues 10
VNL (Anderson et al., 2017) Real Human 23k instructions -
Embodied QA (Das et al., 2017a) Simulated Scripted 5k questions -
TalkTheWalk Real Human 10k dialogues 62
Table 1: Talk The Walk grounds human generated dialogue in (real-life) perception and action.

2.1 Task

For the Talk The Walk task, we randomly choose one of the five neighborhoods, and subsample a 4x4 grid (one block with four complete intersections) from the entire grid. We specify the boundaries of the grid by the top-left and bottom-right corners . Next, we construct the overhead map of the environment, i.e. with and . We subsequently sample a start location and orientation and a target location at random333Note that we do not include the orientation in the target, as we found in early experiments that this led to an unnatural task for humans. Similarly, we explored bigger grid sizes but found these to be too difficult for most annotators..

The shared goal of the two agents is to navigate the tourist to the target location , which is only known to the guide. The tourist perceives a “street view” planar projection of the 360 image at location and can simultaneously chat with the guide and navigate through the environment. The guide’s role consists of reading the tourist description of the environment, building a “mental map” of their current position and providing instructions for navigating towards the target location. Whenever the guide believes that the tourist has reached the target location, they instruct the system to evaluate the tourist’s location. The task ends when the evaluation is successful—i.e., when —or otherwise continues until a total of three failed attempts. The additional attempts are meant to ease the task for humans, as we found that they otherwise often fail at the task but still end up close to the target location, e.g., at the wrong corner of the correct intersection.

2.2 Data Collection

We crowd-sourced the collection of the dataset on Amazon Mechanical Turk (MTurk). We use the MTurk interface of ParlAI (Miller et al., 2017) to render 360 images via WebGL and dynamically display neighborhood maps with an HTML5 canvas. Detailed task instructions, which were also given to our workers before they started their task, are shown in Appendix 14. We paired Turkers at random and let them alternate between the tourist and guide role across different HITs.

2.3 Dataset Statistics

The Talk The Walk dataset consists of over 10k successful dialogues—see Table 6 in the appendix for the dataset statistics split by neighborhood. Turkers successfully completed of all finished tasks (we use this statistic as the human success rate). More than six hundred participants successfully completed at least one Talk The Walk HIT. Although the Visual Dialog (Das et al., 2016) and GuessWhat (de Vries et al., 2016) datasets are larger, the collected Talk The Walk dialogs are significantly longer. On average, Turkers needed more than 62 acts (i.e utterances and actions) before they successfully completed the task, whereas Visual Dialog requires 20 acts. The majority of acts comprise the tourist’s actions, with on average more than 44 actions per dialogue. The guide produces roughly 9 utterances per dialogue, slightly more than the tourist’s 8 utterances. Turkers use diverse discourse, with a vocabulary size of more than 10K (calculated over all successful dialogues). An example from the dataset is shown in Appendix 13. The dataset is available at

3 Experiments

We investigate the difficulty of the proposed task by establishing initial baselines. The final Talk The Walk task is challenging and encompasses several important sub-tasks, ranging from landmark recognition to tourist localization and natural language instruction-giving. Arguably the most important sub-task is localization: without such capabilities the guide can not tell whether the tourist reached the target location. In this work, we establish a minimal baseline for Talk The Walk by utilizing agents trained for localization. Specifically, we let trained tourist models undertake random walks, using the following protocol: at each step, the tourist communicates its observations and actions to the guide, who predicts the tourist’s location. If the guide predicts that the tourist is at target, we evaluate its location. If successful, the task ends, otherwise we continue until there have been three wrong evaluations. The protocol is given as pseudo-code in Appendix 11.

3.1 Tourist Localization

The designed navigation protocol relies on a trained localization model that predicts the tourist’s location from a communicated message. Before we formalize this localization sub-task in Section 3.1.1, we further introduce two simplifying assumptions—perfect perception and orientation-agnosticism—so as to overcome some of the difficulties we encountered in preliminary experiments.

Perfect Perception

Early experiments revealed that perceptual grounding of landmarks is difficult: we set up a landmark classification problem, on which models with extracted CNN (He et al., 2016) or text recognition features (Gupta et al., 2016) barely outperform a random baseline—see Appendix 12 for full details. This finding implies that localization models from image input are limited by their ability to recognize landmarks, and, as a result, would not generalize to unseen environments. To ensure that perception is not the limiting factor when investigating the landmark-grounding and action-grounding capabilities of localization models, we assume “perfect perception”: in lieu of the 360 image view, the tourist is given the landmarks at its current location. More formally, each state observation now equals the set of landmarks at the -location, i.e. . If the -location does not have any visible landmarks, we return a single “empty corner” symbol. We stress that our findings—including a novel architecture for grounding actions into an overhead map, see Section 4.2.1—should carry over to settings without the perfect perception assumption.


We opt to ignore the tourist’s orientation, which simplifies the set of actions to [Left, Right, Up, Down], corresponding to adding [(-1, 0), (1, 0), (0, 1), (0, -1)] to the current coordinates, respectively. Note that actions are now coupled to an orientation on the map—e.g. up is equal to going north—and this implicitly assumes that the tourist has access to a compass. This also affects perception, since the tourist now has access to views from all orientations: in conjunction with “perfect perception”, implying that only landmarks at the current corner are given, whereas landmarks from different corners (e.g. across the street) are not visible.

Even with these simplifications, the localization-based baseline comes with its own set of challenges. As we show in Section 5.1, the task requires communication about a short (random) path—i.e., not only a sequence of observations but also actions

—in order to achieve high localization accuracy. This means that the guide needs to decode observations from multiple time steps, as well as understand their 2D spatial arrangement as communicated via the sequence of actions. Thus, in order to get to a good understanding of the task, we thoroughly examine whether the agents can learn a communication protocol that simultaneously grounds observations and actions into the guide’s map. In doing so, we thoroughly study the role of the communication channel in the localization task, by investigating increasingly constrained forms of communication: from differentiable continuous vectors to emergent discrete symbols to the full complexity of natural language.

3.1.1 Formalization

The full navigation baseline hinges on a localization model from random trajectories. While we can sample random actions in the emergent communication setup, this is not possible for the natural language setup because the messages are coupled to the trajectories of the human annotators. This leads to slightly different problem setups, as described below.

Emergent language

A tourist, starting from a random location, takes random actions to reach target location . Every location in the environment has a corresponding set of landmarks for each of the coordinates. As the tourist navigates, the agent perceives state-observations where each observation consists of a set of landmark symbols . Given the observations and actions , the tourist generates a message which is communicated to the other agent. The objective of the guide is to predict the location from the tourist’s message .

Natural language

In contrast to our emergent communication experiments, we do not take random actions but instead extract actions, observations, and messages from the dataset. Specifically, we consider each tourist utterance (i.e. at any point in the dialogue), obtain the current tourist location as target location , the utterance itself as message , and the sequence of observations and actions that took place between the current and previous tourist utterance as and , respectively. Similar to the emergent language setting, the guide’s objective is to predict the target location models from the tourist message . We conduct experiments with taken from the dataset and with generated from the extracted observations and actions .

4 Model

This section outlines the tourist and guide architectures. We first describe how the tourist produces messages for the various communication channels across which the messages are sent. We subsequently describe how these messages are processed by the guide, and introduce the novel Masked Attention for Spatial Convolutions (MASC) mechanism that allows for grounding into the 2D overhead map in order to predict the tourist’s location.

4.1 The Tourist

For each of the communication channels, we outline the procedure for generating a message . Given a set of state observations , we represent each observation by summing the -dimensional embeddings of the observed landmarks, i.e. for , , where is the landmark embedding lookup table. In addition, we embed action into a -dimensional embedding via a look-up table . We experiment with three types of communication channel.

Continuous vectors

The tourist has access to observations of several time steps, whose order is important for accurate localization. Because summing embeddings is order-invariant, we introduce a sum over positionally-gated embeddings, which, conditioned on time step , pushes embedding information into the appropriate dimensions. More specifically, we generate an observation message , where is a learned gating vector for time step . In a similar fashion, we produce action message and send the concatenated vectors as message to the guide. We can interpret continuous vector communication as a single, monolithic model because its architecture is end-to-end differentiable, enabling gradient-based optimization for training.

Discrete symbols

Like the continuous vector communication model, with discrete communication the tourist also uses separate channels for observations and actions, as well as a sum over positionally gated embeddings to generate observation embedding . We pass this embedding through a sigmoid and generate a message

by sampling from the resulting Bernoulli distributions:

The action message is produced in the same way, and we obtain the final tourist message through concatenating the messages.

The communication channel’s sampling operation yields the model non-differentiable, so we use policy gradients (Sutton & Barto, 1998; Williams, 1992) to train the parameters

of the tourist model. That is, we estimate the gradient by

where the reward function is the negative guide’s loss (see Section 4.2) and

a state-value baseline to reduce variance. We use a linear transformation over the concatenated embeddings as baseline prediction, i.e.

, and train it with a mean squared error loss444This is different from A2C which uses a state-value baseline that is trained by the Bellman residual.

Natural Language

Because observations and actions are of variable-length, we use an LSTM encoder over the sequence of observations embeddings , and extract its last hidden state . We use a separate LSTM encoder for action embeddings , and concatenate both and to the input of the LSTM decoder at each time step:


where a look-up table, taking input tokens . We train with teacher-forcing, i.e. we optimize the cross-entropy loss: . At test time, we explore the following decoding strategies: greedy, sampling and a beam-search. We also fine-tune a trained tourist model (starting from a pre-trained model) with policy gradients in order to minimize the guide’s prediction loss.

4.2 The Guide

Given a tourist message describing their observations and actions, the objective of the guide is to predict the tourist’s location on the map. First, we outline the procedure for extracting observation embedding and action embeddings from the message for each of the types of communication. Next, we discuss the MASC mechanism that takes the observations and actions in order to ground them on the guide’s map in order to predict the tourist’s location.


For the continuous communication model, we assign the observation message to the observation embedding, i.e. . To extract the action embedding for time step , we apply a linear layer to the action message, i.e. .


For discrete communication, we obtain observation by applying a linear layer to the observation message, i.e. . Similar to the continuous communication model, we use a linear layer over action message to obtain action embedding for time step .

Natural Language

The message

contains information about observations and actions, so we use a recurrent neural network with attention mechanism to extract the relevant observation and action embeddings. Specifically, we encode the message

, consisting of tokens taken from vocabulary , with a bidirectional LSTM:


where is the word embedding look-up table. We obtain observation embedding through an attention mechanism over the hidden states :


where is a learned control embedding who is updated through a linear transformation of the previous control and observation embedding: . We use the same mechanism to extract the action embedding from the hidden states. For the observation embedding, we obtain the final representation by summing positionally gated embeddings, i.e., .

4.2.1 Masked Attention for Spatial Convolutions (MASC)

We represent the guide’s map as , where in this case , where each -dimensional location embedding is computed as the sum of the guide’s landmark embeddings for that location.


While the guide’s map representation contains only local landmark information, the tourist communicates a trajectory of the map (i.e. actions and observations from multiple locations), implying that directly comparing the tourist’s message with the individual landmark embeddings is probably suboptimal. Instead, we want to aggregate landmark information from surrounding locations by imputing trajectories over the map to predict locations. We propose a mechanism for translating landmark embeddings according to state transitions (left, right, up, down), which can be expressed as a 2D convolution over the map embeddings. For simplicity, let us assume that the map embedding

is 1-dimensional, then a left action can be realized through application of the following kernel: which effectively shifts all values of one position to the left. We propose to learn such state-transitions from the tourist message through a differentiable attention-mask over the spatial dimensions of a 3x3 convolution.


We linearly project each predicted action embedding to a 9-dimensional vector , normalize it by a softmax and subsequently reshape the vector into a 3x3 mask :


We learn a 3x3 convolutional kernel , with features, and apply the mask to the spatial dimensions of the convolution by first broadcasting its values along the feature dimensions, i.e. , and subsequently taking the Hadamard product: . For each action step , we then apply a 2D convolution with masked weight to obtain a new map embedding

, where we zero-pad the input to maintain identical spatial dimensions.

Prediction model

We repeat the MASC operation times (i.e. once for each action), and then aggregate the map embeddings by a sum over positionally-gated embeddings: . We score locations by taking the dot-product of the observation embedding , which contains information about the sequence of observed landmarks by the tourist, and the map. We compute a distribution over the locations of the map by taking a softmax over the computed scores:

Predicting T

While emergent communication models use a fixed length trasjectory , natural language messages may differ in the number of communicated observations and actions. Hence, we predict from the communicated message. Specifically, we use a softmax regression layer over the last hidden state of the RNN, and subsequently sample from the resulting multinomial distribution:


We jointly train the -prediction model via REINFORCE, with the guide’s loss as reward function and a mean-reward baseline.

MASC Train Valid Test
Cont. Disc. Upper Cont. Disc. Upper Cont. Disc. Upper
Random 6.25 6.25 6.25 6.25 6.25 6.25 6.25 6.25 6.25
T=0 29.59 28.89 30.23 30.00 30.63 32.50 32.29 33.12 35.00
T=1 39.83 35.40 43.44 35.23 36.56 45.39 35.16 39.53 51.72
55.64 51.66 62.78 53.12 53.20 65.78 56.09 55.78 72.97
T=2 41.50 40.15 47.84 33.50 37.77 50.29 35.08 41.41 57.15
67.44 62.24 78.90 64.55 59.34 79.77 66.80 62.15 86.64
T=3 43.48 44.49 45.22 35.40 39.64 48.77 33.11 43.51 55.84
71.32 71.80 87.92 67.48 65.63 87.45 69.85 69.51 92.41
Table 2: Accuracy results for tourist localization with emergent language, showing continuous (Cont.) and discrete (Disc.) communication, along with the prediction upper bound. denotes the length of the path and a ✓ in the “MASC” column indicates that the model is conditioned on the communicated actions.

4.3 Comparisons

To better analyze the performance of the models incorporating MASC, we compare against a no-MASC baseline in our experiments, as well as a prediction upper bound.


We compare the proposed MASC model with a model that does not include this mechanism. Whereas MASC predicts a convolution mask from the tourist message, the “No MASC” model uses , the ordinary convolutional kernel to convolve the map embedding to obtain . We also share the weights of this convolution at each time step.

Prediction upper-bound

Because we have access to the class-conditional likelihood , we are able to compute the Bayes error rate (or irreducible error). No model (no matter how expressive) with any amount of data can ever obtain better localization accuracy as there are multiple locations consistent with the observations and actions.

5 Results and Discussion

In this section, we describe the findings of various experiments. First, we analyze how much information needs to be communicated for accurate localization in the Talk The Walk environment, and find that a short random path (including actions) is necessary. Next, for emergent language, we show that the MASC architecture can achieve very high localization accuracy, significantly outperforming the baseline that does not include this mechanism. We then turn our attention to the natural language experiments, and find that localization from human utterances is much harder, reaching an accuracy level that is below communicating a single landmark observation. We show that generated utterances from a conditional language model leads to significantly better localization performance, by successfully grounding the utterance on a single landmark observation (but not yet on multiple observations and actions). Finally, we show performance of the localization baseline on the full task, which can be used for future comparisons to this work.

5.1 Analysis of Localization Task

Task is not too easy

The upper-bound on localization performance in Table 2 suggest that communicating a single landmark observation is not sufficient for accurate localization of the tourist (35% accuracy). This is an important result for the full navigation task because the need for two-way communication disappears if localization is too easy; if the guide knows the exact location of the tourist it suffices to communicate a list of instructions, which is then executed by the tourist. The uncertainty in the tourist’s location is what drives the dialogue between the two agents.

Importance of actions

We observe that the upperbound for only communicating observations plateaus around % (even for actions), whereas it exceeds % when we also take actions into account. This implies that, at least for random walks, it is essential to communicate a trajectory, including observations and actions, in order to achieve high localization accuracy.

5.2 Emergent Language Localization

We first report the results for tourist localization with emergent language in Table 2.

MASC improves performance

The MASC architecture significantly improves performance compared to models that do not include this mechanism. For instance, for action, MASC already achieves 56.09 % on the test set and this further increases to 69.85% for . On the other hand, no-MASC models hit a plateau at 43%. In Appendix 10, we analyze learned MASC values, and show that communicated actions are often mapped to corresponding state-transitions.

Continuous vs discrete

We observe similar performance for continuous and discrete emergent communication models, implying that a discrete communication channel is not a limiting factor for localization performance.

table[.53][][t] Model Decoding Train Valid Test Random 6.25 6.25 6.25 Human utterances 23.46 15.56 16.17 Supervised sampling 17.19 12.23 12.43 greedy 34.14 29.90 29.05 beam (size: 4) 26.21 22.53 25.02 Policy Grad. sampling 29.67 26.93 27.05 greedy 29.23 27.62 27.30 table[.45][][t] Train Valid Test #steps Random 18.75 18.75 18.75 - Human 76.74 76.74 76.74 15.05 Best Cont. 89.44 86.35 88.33 34.47 Best Disc. 86.23 82.81 87.08 34.83 Best NL 39.65 39.68 50.00 39.14

Table 3: Localization accuracy of tourist communicating in natural language.
Table 4: Full task evaluation of localization models using protocol of Appendix 11.

5.3 Natural Language Localization

We report the results of tourist localization with natural language in Table 4. We compare accuracy of the guide model (with MASC) trained on utterances from (i) humans, (ii) a supervised model with various decoding strategies, and (iii) a policy gradient model optimized with respect to the loss of a frozen, pre-trained guide model on human utterances.

Human utterances

Compared to emergent language, localization from human utterances is much harder, achieving only on the test set. Here, we report localization from a single utterance, but in Appendix 9.2 we show that including up to five dialogue utterances only improves performance to . We also show that MASC outperform no-MASC models for natural language communication.

Generated utterances

We also investigate generated tourist utterances from conditional language models. Interestingly, we observe that the supervised model (with greedy and beam-search decoding) as well as the policy gradient model leads to an improvement of more than 10 accuracy points over the human utterances. However, their level of accuracy is slightly below the baseline of communicating a single observation, indicating that these models only learn to ground utterances in a single landmark observation.

Better grounding of generated utterances

We analyze natural language samples in Table 5, and confirm that, unlike human utterances, the generated utterances are talking about the observed landmarks. This observation explains why the generated utterances obtain higher localization accuracy. The current language models are most successful when conditioned on a single landmark observation; We show in Appendix 9.1.1 that performance quickly deteriorates when the model is conditioned on more observations, suggesting that it can not produce natural language utterances about multiple time steps.

Method Decoding Utterance
Observations (Bar)
Actions -
Human a field of some type
Supervised greedy at a bar
sampling sec just hard to tell which is a restaurant ?
beam search im at a bar
Policy Grad. greedy bar from bar from bar and rigth rigth bulding bulding
sampling which bar from bar from bar and bar rigth bulding bulding..
Table 5:

Samples from the tourist models communicating in natural language. Contrary to the human generated utterance, the supervised model with greedy and beam search decoding produces an utterance containing the current state observation (bar). Also the reinforcement learning model mentions the current observation but has lost linguistic structure. The fact that these localization models are better grounded in observations than human utterances explains why they obtain higher localization accuracy.

5.4 Localization-based Baseline

Table 4 shows results for the best localization models on the full task, evaluated via the random walk protocol defined in Algorithm 1.

Comparison with human annotators

Interestingly, our best localization model (continuous communication, with MASC, and ) achieves 88.33% on the test set and thus exceed human performance of 76.74% on the full task. While emergent models appear to be stronger localizers, humans might cope with their localization uncertainty through other mechanisms (e.g. better guidance, bias towards taking particular paths, etc). The simplifying assumption of perfect perception also helps.

Number of actions

Unsurprisingly, humans take fewer steps (roughly 15) than our best random walk model (roughly 34). Our human annotators likely used some form of guidance to navigate faster to the target.

6 Conclusion

We introduced the Talk The Walk task and dataset, which consists of crowd-sourced dialogues in which two human annotators collaborate to navigate to target locations in the virtual streets of NYC. For the important localization sub-task, we proposed MASC—a novel grounding mechanism to learn state-transition from the tourist’s message—and showed that it improves localization performance for emergent and natural language. We use the localization model to provide baseline numbers on the Talk The Walk task, in order to facilitate future research.


7 Related Work

The Talk the Walk task and dataset facilitate future research on various important subfields of artificial intelligence, including grounded language learning, goal-oriented dialogue research and situated navigation. Here, we describe related previous work in these areas.

Related tasks

There has been a long line of work involving related tasks. Early work on task-oriented dialogue dates back to the early 90s with the introduction of the Map Task (Anderson et al., 1991) and Maze Game (Garrod & Anderson, 1987) corpora. Recent efforts have led to larger-scale goal-oriented dialogue datasets, for instance to aid research on visually-grounded dialogue (Das et al., 2016; de Vries et al., 2016), knowledge-base-grounded discourse (He et al., 2017) or negotiation tasks (Lewis et al., 2017). At the same time, there has been a big push to develop environments for embodied AI, many of which involve agents following natural language instructions with respect to an environment(Artzi & Zettlemoyer, 2013; Yu et al., 2017; Hermann et al., 2017; Mei et al., 2016; Chaplot et al., 2018b, a), following-up on early work in this area (MacMahon et al., 2006; Chen & Mooney, 2011). An early example of navigation using neural networks is (Hadsell et al., 2007), who propose an online learning approach for robot navigation. Recently, there has been increased interest in using end-to-end trainable neural networks for learning to navigate indoor scenes(Gupta et al., 2017b, a) or large cities (Brahmbhatt & Hays, 2017; Mirowski et al., 2018), but, unlike our work, without multi-agent communication. Also the task of localization (without multi-agent communication) has recently been studied (Chaplot et al., 2018a; Vo et al., 2017).

Grounded language learning

Grounded language learning is motivated by the observation that humans learn language embodied (grounded) in sensorimotor experience of the physical world (Barsalou, 2008; Smith & Gasser, 2005). On the one hand, work in multi-modal semantics has shown that grounding can lead to practical improvements on various natural language understanding tasks (see Baroni, 2016; Kiela, 2017, and references therein). In robotics, researchers dissatisfied with purely symbolic accounts of meaning attempted to build robotic systems with the aim of grounding meaning in physical experience of the world (Roy, 2005; Steels & Hild, 2012). Recently, grounding has also been applied to the learning of sentence representations (Kiela et al., 2017), image captioning (Lin et al., 2014; Xu et al., 2015), visual question answering (Antol et al., 2015; de Vries et al., 2017), visual reasoning (Johnson et al., 2017; Perez et al., 2018), and grounded machine translation (Riezler et al., 2014; Elliott et al., 2016). Grounding also plays a crucial role in the emergent research of multi-agent communication, where, agents communicate (in natural language or otherwise) in order to solve a task, with respect to their shared environment (Lazaridou et al., 2016; Das et al., 2017b; Mordatch & Abbeel, 2017; Evtimova et al., 2017; Lewis et al., 2017; Strub et al., 2017; Kottur et al., 2017).

8 Implementation Details

For the emergent communication models, we use an embedding size . The natural language experiments use 128-dimensional word embeddings and a bidirectional RNN with units. In all experiments, we train the guide with a cross entropy loss using the ADAM optimizer with default hyper-parameters (Kingma & Ba, 2014)

. We perform early stopping on the validation accuracy, and report the corresponding train, valid and test accuracy. We optimize the localization models with continuous, discrete and natural language communication channels for 200, 200, and 25 epochs, respectively. To facilitate further research on Talk The Walk, we make our code base for reproducing experiments publicly available at

9 Additional Natural Language Experiments

First, we investigate the sensitivity of tourist generation models to the trajectory length, finding that the model conditioned on a single observation (i.e. ) achieves best performance. In the next subsection, we further analyze localization models from human utterances by investigating MASC and no-MASC models with increasing dialogue context.

table[.48][][t] Trajectories T Train Valid Test Random 18.75 18.75 18.75 Human 0 38.21 40.93 40.00 1 21.82 23.75 25.62 2 19.77 24.68 23.12 3 18.95 20.93 20.00 Random 0 39.65 39.68 50.00 1 28.99 30.93 25.62 2 27.04 19.06 19.38 3 20.28 20.93 22.50 table[.48][][t] Beam size Train Valid Test Random 6.25 6.25 6.25 1 34.14 29.90 29.05 2 26.24 23.65 25.10 4 23.59 22.87 21.80 8 20.31 19.24 20.87

Table 6: Full task performance of localization models trained on human and random trajectories. There are small benefits for training on random trajectories, but the most important hyper-parameter is to condition the tourist utterance on a single observation (i.e. trajectories of size .) at evaluation time.
Table 7:

Localization performance using pre-trained tourist (via imitation learning) with beam search decoding of varying beam size. Locations and observations extracted from human trajectories. Larger beam-sizes lead to worse localization performance.

#utterances MASC Train Valid Test
Random 6.25 6.25 6.25 -
1 23.95 13.91 13.89 0.99
23.46 15.56 16.17 1.00
3 26.92 16.28 16.62 1.00
20.88 17.50 18.80 1.79
5 25.75 16.11 16.88 1.98
30.45 18.41 20.33 1.99
Table 8: Localization given last {1, 3, 5} dialogue utterances (including the guide). We observe that (1) performance increases when more utterances are included; and (2) MASC outperforms no-MASC in all cases; and (3) mean increases when more dialogue context is included.

9.1 Tourist Generation Models

9.1.1 Path length

After training the supervised tourist model (conditioned on observations and action from human expert trajectories), there are two ways to train an accompanying guide model. We can optimize a location prediction model on either (i) extracted human trajectories (as in the localization setup from human utterances) or (ii) on all random paths of length (as in the full task evaluation). Here, we investigate the impact of (1) using either human or random trajectories for training the guide model, and (2) the effect of varying the path length during the full-task evaluation. For random trajectories, guide training uses the same path length as is used during evaluation. We use a pre-trained tourist model with greedy decoding for generating the tourist utterances. Table 7 summarizes the results.

Human vs random trajectories

We only observe small improvements for training on random trajectories. Human trajectories are thus diverse enough to generalize to random trajectories.

Effect of path length

There is a strong negative correlation between task success and the conditioned trajectory length. We observe that the full task performance quickly deteriorates for both human and random trajectories. This suggests that the tourist generation model can not produce natural language utterances that describe multiple observations and actions. Although it is possible that the guide model can not process such utterances, this is not very likely because the MASC architectures handles such messages successfully for emergent communication.

9.1.2 Effect of beam-size

We report localization performance of tourist utterances generated by beam search decoding of varying beam size in Table 7. We find that performance decreases from 29.05% to 20.87% accuracy on the test set when we increase the beam-size from one to eight.

9.2 Localization from Human Utterances

We conduct an ablation study for MASC on natural language with varying dialogue context. Specifically, we compare localization accuracy of MASC and no-MASC models trained on the last [1, 3, 5] utterances of the dialogue (including guide utterances). We report these results in Table 8. In all cases, MASC outperforms the no-MASC models by several accuracy points. We also observe that mean predicted (over the test set) increases from to when more dialogue context is included.

10 Visualizing MASC predictions

Figure 2 shows the MASC values for a learned model with emergent discrete communications and actions. Specifically, we look at the predicted MASC values for different action sequences taken by the tourist. We observe that the first action is always mapped to the correct state-transition, but that the second and third MASC values do not always correspond to right state-transitions.

Action sequence:
Right, Left, Up

Action sequence:
Up, Right, Down

Figure 2: We show MASC values of two action sequences for tourist localization via discrete communication with actions. In general, we observe that the first action always corresponds to the correct state-transition, whereas the second and third are sometimes mixed. For instance, in the top example, the first two actions are correctly predicted but the third action is not (as the MASC corresponds to a “no action”). In the bottom example, the second action appears as the third MASC.

11 Evaluation on Full Setup

We provide pseudo-code for evaluation of localization models on the full task in Algorithm 1, as well as results for all emergent communication models in Table 9.

T MASC Comm. Train Valid Test #steps
Random 18.75 18.75 18.75 -
Human 76.74 76.74 76.74 15.05
0 cont. 46.17 46.56 52.91 39.87
disc. 46.65 47.70 52.50 38.56
1 cont. 51.46 46.98 46.46 38.05
disc 52.11 51.25 55.00 41.13
cont. 76.57 74.06 77.70 34.59
disc 71.96 72.29 74.37 36.19
2 cont. 53.51 45.93 46.66 40.26
disc 53.38 52.39 55.00 42.35
cont. 87.29 84.05 86.66 32.27
disc 86.23 82.81 87.08 34.83
3 cont. 54.30 43.43 43.54 39.14
disc 57.88 55.20 57.50 43.67
cont. 89.44 86.35 88.33 34.47
disc 86.23 82.81 87.08 34.83
Table 9: Accuracy of localization models on full task, using evaluation protocol defined in Algorithm 1. We report the average over 3 runs.
procedure Evaluate()
      initialize with random location
     for  do create -sized feature buffer
     for  do
         if  then target predicted
              if  then
                  return True
                  if  then
                       return False                                          
          take new action
Algorithm 1 Performance evaluation of location prediction model on full Talk The Walk setup
Figure 3: Result of running the text recognizer of Gupta et al. (2016) on four examples of the Hell’s Kitchen neighborhood. Top row: two positive examples. Bottom row: example of false negative (left) and many false positives (right)

12 Landmark Classification

While the guide has access to the landmark labels, the tourist needs to recognize these landmarks from raw perceptual information. In this section, we study landmark classification as a supervised learning problem to investigate the difficulty of perceptual grounding in Talk The Walk.

The Talk The Walk dataset contains a total of 307 different landmarks divided among nine classes, see Figure 4 for how they are distributed. The class distribution is fairly imbalanced, with shops and restaurants as the most frequent landmarks and relatively few play fields and theaters. We treat landmark recognition as a multi-label classification problem as there can be multiple landmarks on a corner555Strictly speaking, this is more general than a multi-label setup because a corner might contain multiple landmarks of the same class..

Figure 4: Frequency of landmark classes

For the task of landmark classification, we extract the relevant views of the 360 image from which a landmark is visible. Because landmarks are labeled to be on a specific corner of an intersection, we assume that they are visible from one of the orientations facing away from the intersection. For example, for a landmark on the northwest corner of an intersection, we extract views from both the north and west direction. The orientation-specific views are obtained by a planar projection of the full 360-image with a small field of view (60 degrees) to limit distortions. To cover the full field of view, we extract two images per orientation, with their horizontal focus point 30 degrees apart. Hence, we obtain eight images per 360 image with corresponding orientation .

We run the following pre-trained feature extractors over the extracted images:


We resize the extracted view to a 224x224 image and pass it through a ResNet-152 network He et al. (2016) to obtain a 2048-dimensional feature vector from the penultimate layer.

Text Recognition

We use a pre-trained text-recognition model Gupta et al. (2016) to extract a set of text messages from the images. Local businesses often advertise their wares through key phrases on their storefront, and understanding this text might be a good indicator of the type of landmark. In Figure 3, we show the results of running the text recognition module on a few extracted images.

For the text recognition model, we use a learned look-up table to embed the extracted text features , and fuse all embeddings of four images through a bag of embeddings, i.e., . We use a linear layer followed by a sigmoid to predict the probability for each class, i.e. . We also experiment with replacing the look-up embeddings with pre-trained FastText embeddings Bojanowski et al. (2016). For the ResNet model, we use a bag of embeddings over the four ResNet features, i.e. , before we pass it through a linear layer to predict the class probabilities: . We also conduct experiments where we first apply PCA to the extracted ResNet and FastText features before we feed them to the model.

Features Train loss Valid Loss Train F1 Valid F1 Valid prec. Valid recall
All positive - - - 0.39313 0.26128 1
Random (0.5) - - - 0.32013 0.24132 0.25773
Textrecog 0.01462 0.01837 0.31205 0.31684 0.2635 0.50515
Fasttext 0.00992 0.00994 0.24019 0.31548 0.26133 0.47423
Fasttext (100 dim) 0.00721 0.00863 0.32651 0.28672 0.24964 0.4433
ResNet 0.00735 0.00751 0.17085 0.20159 0.13114 0.58763
ResNet (256 dim) 0.0051 0.00748 0.60911 0.31953 0.27733 0.50515
Table 10: Results for landmark classification.

To account for class imbalance, we train all described models with a binary cross entropy loss weighted by the inverted class frequency. We create a 80-20 class-conditional split of the dataset into a training and validation set. We train for 100 epochs and perform early stopping on the validation loss.

The F1 scores for the described methods in Table 10. We compare to an “all positive” baseline that always predicts that the landmark class is visible and observe that all presented models struggle to outperform this baseline. Although 256-dimensional ResNet features achieve slightly better precision on the validation set, it results in much worse recall and a lower F1 score. Our results indicate that perceptual grounding is a difficult task, which easily merits a paper of its own right, and so we leave further improvements (e.g. better text recognizers) for future work.

13 Dataset Details


Figure 5: Map of New York City with red rectangles indicating the captured neighborhoods of the Talk The Walk dataset.
Neighborhood #success #failed #disconnects
Hell’s Kitchen 2075 762 867
Williamsburg 2077 683 780
East Village 2035 713 624
Financial District 2042 607 497
Upper East 2081 359 576
Total 10310 3124 3344
Figure 6: Dataset statistics split by neighborhood and dialogue status.
Dataset split

We split the full dataset by assigning entire 4x4 grids (independent of the target location) to the train, valid or test set. Specifically, we design the split such that the valid set contains at least one intersection (out of four) is not part of the train set. For the test set, all four intersections are novel. See our source code, available at URLANONYMIZED, for more details on how this split is realized.


[fontsize=] Tourist: ACTION:TURNRIGHT ACTION:TURNRIGHT Guide: Hello, what are you near? Tourist: ACTION:TURNLEFT ACTION:TURNLEFT ACTION:TURNLEFT Tourist: Hello, in front of me is a Brooks Brothers Tourist: ACTION:TURNLEFT ACTION:FORWARD ACTION:TURNLEFT ACTION:TURNLEFT Guide: Is that a shop or restaurant? Tourist: ACTION:TURNLEFT Tourist: It is a clothing shop. Tourist: ACTION:TURNLEFT Guide: You need to go to the intersection in the northwest corner of the map Tourist: ACTION:TURNLEFT Tourist: There appears to be a bank behind me. Tourist: ACTION:TURNLEFT ACTION:TURNLEFT ACTION:TURNRIGHT ACTION:TURNRIGHT Guide: Ok, turn left then go straight up that road Tourist: ACTION:TURNLEFT ACTION:TURNLEFT ACTION:TURNLEFT ACTION:FORWARD ACTION:TURNRIGHT ACTION:FORWARD ACTION:FORWARD ACTION:TURNLEFT ACTION:TURNLEFT ACTION:TURNLEFT Guide: There should be shops on two of the corners but you need to go to the corner without a shop. Tourist: ACTION:FORWARD ACTION:FORWARD ACTION:FORWARD ACTION:TURNLEFT ACTION:TURNLEFT Guide: let me know when you get there. Tourist: on my left is Radio city Music hall Tourist: ACTION:TURNLEFT ACTION:FORWARD ACTION:TURNLEFT ACTION:TURNRIGHT ACTION:TURNRIGHT Tourist: I can’t go straight any further. Guide: ok. turn so that the theater is on your right. Guide: then go straight Tourist: That would be going back the way I came Guide: yeah. I was looking at the wrong bank Tourist: I’ll notify when I am back at the brooks brothers, and the bank. Tourist: ACTION:TURNRIGHT Guide: make a right when the bank is on your left Tourist: ACTION:FORWARD ACTION:FORWARD ACTION:TURNRIGHT Tourist: Making the right at the bank. Tourist: ACTION:FORWARD ACTION:FORWARD Tourist: I can’t go that way. Tourist: ACTION:TURNLEFT Tourist: Bank is ahead of me on the right Tourist: ACTION:FORWARD ACTION:FORWARD ACTION:TURNLEFT Guide: turn around on that intersection Tourist: I can only go to the left or back the way I just came. Tourist: ACTION:TURNLEFT Guide: you’re in the right place. do you see shops on the corners? Guide: If you’re on the corner with the bank, cross the street Tourist: I’m back where I started by the shop and the bank. Tourist: ACTION:TURNRIGHT Guide: on the same side of the street? Tourist: crossing the street now Tourist: ACTION:FORWARD ACTION:FORWARD ACTION:TURNLEFT Tourist: there is an I love new york shop across the street on the left from me now Tourist: ACTION:TURNRIGHT ACTION:FORWARD Guide: ok. I’ll see if it’s right. Guide: EVALUATE_LOCATION Guide: It’s not right. Tourist: What should I be on the look for? Tourist: ACTION:TURNRIGHT ACTION:TURNRIGHT ACTION:TURNRIGHT Guide: There should be shops on two corners but you need to be on one of the corners without the shop. Guide: Try the other corner. Tourist: this intersection has 2 shop corners and a bank corner Guide: yes. that’s what I see on the map. Tourist: should I go to the bank corner? or one of the shop corners? or the blank corner (perhaps a hotel) Tourist: ACTION:TURNLEFT ACTION:TURNLEFT ACTION:TURNRIGHT ACTION:TURNRIGHT Guide: Go to the one near the hotel. The map says the hotel is a little further down but it might be a little off. Tourist: It’s a big hotel it’s possible. Tourist: ACTION:FORWARD ACTION:TURNLEFT ACTION:FORWARD ACTION:TURNRIGHT Tourist: I’m on the hotel corner Guide: EVALUATE_LOCATION

14 Mechanical Turk Instructions

Figure 7: Set of instructions presented to turkers before starting their first task.
Figure 8: (cont.) Set of instructions presented to turkers before starting their first task.