As artificial intelligence plays an ever more prominent role in everyday human lives, it becomes increasingly important to enable machines to communicate via natural language—not only with humans, but also with each other. Learning algorithms for natural language understanding, such as in machine translation and reading comprehension, have progressed at an unprecedented rate in recent years, but still rely on static, large-scale, text-only datasets that lack crucial aspects of how humans understand and produce natural language. Namely, humans develop language capabilities by being embodied in an environment which they can perceive, manipulate and move around in; and by interacting with other humans. Hence, we argue that we should incorporate all three fundamental aspects of human language acquisition—perception, action and interactive communication—and develop a task and dataset to that effect.
We introduce the Talk the Walk dataset, where the aim is for two agents, a “guide” and a “tourist”, to interact with each other via natural language in order to achieve a common goal: having the tourist navigate towards the correct location. The guide has access to a map and knows the target location, but does not know where the tourist is; the tourist has a 360-degree view of the world, but knows neither the target location on the map nor the way to it. The agents need to work together through communication in order to successfully solve the task. An example of the task is given in Figure 1.
Grounded language learning has (re-)gained traction in the AI community, and much attention is currently devoted to virtual embodiment (the development of multi-agent communication tasks in virtual environments), which has been argued to be a viable strategy for acquiring natural language semantics (Kiela et al., 2016). Various related tasks have recently been introduced, but in each case with some limitations. Although visually grounded dialogue tasks (de Vries et al., 2016; Das et al., 2016) comprise perceptual grounding and multi-agent interaction, their agents are passive observers and do not act in the environment. By contrast, instruction-following tasks, such as VLN (Anderson et al., 2017), involve action and perception but lack natural language interaction with other agents. Furthermore, some of these works use simulated environments (Das et al., 2017a) and/or templated language (Hermann et al., 2017), which arguably oversimplify real perception or natural language, respectively. See Table 1 for a comparison.
Talk The Walk is the first task to bring all three aspects together: perception for the tourist observing the world, action for the tourist to navigate through the environment, and interactive dialogue for the tourist and guide to work towards their common goal. To collect grounded dialogues, we constructed a virtual 2D grid environment by manually capturing 360-degree views of several neighborhoods in New York City (NYC); we avoided using existing street view resources due to licensing issues. As the main focus of our task is on interactive dialogue, we limit the difficulty of the control problem by having the tourist navigate a 2D grid via discrete actions (turning left, turning right and moving forward). Our street view environment was integrated into ParlAI (Miller et al., 2017) and used to collect a large-scale dataset on Mechanical Turk involving human perception, action and communication.
We argue that for artificial agents to solve this challenging problem, some fundamental architecture designs are missing, and our hope is that this task motivates their innovation. To that end, we focus on the task of localization and develop the novel Masked Attention for Spatial Convolutions (MASC) mechanism. To model the interaction between language and action, this architecture repeatedly conditions the spatial dimensions of a convolution on the communicated message sequence.
This work makes the following contributions: 1) We present the first large-scale dialogue dataset grounded in action and perception; 2) We introduce the MASC architecture for localization and show it yields improvements for both emergent and natural language; 3) Using localization models, we establish initial baselines on the full task; 4) We show that our best model exceeds human performance under the assumption of “perfect perception” and with a learned emergent communication protocol, and sets a non-trivial baseline with natural language.
2 Talk The Walk
We create a perceptual environment by manually capturing several neighborhoods of New York City (NYC) with a 360-degree camera (a 360fly 4K camera). Most parts of the city are grid-like and uniform, which makes it well-suited for obtaining a 2D grid. For Talk The Walk, we capture parts of Hell’s Kitchen, East Village, the Financial District, Williamsburg and the Upper East Side; see Figure 6 in Appendix 13 for their respective locations within NYC. For each neighborhood, we choose an approximately 5x5 grid and capture a 360-degree view on all four corners of each intersection, leading to a grid size of roughly 10x10 per neighborhood.
The tourist’s location is given as a tuple $(x, y, o)$, where $x, y$ are the coordinates and $o$ signifies the orientation (north, east, south or west). The tourist can take three actions: turn left, turn right and go forward. For moving forward, we add $(0, 1)$, $(1, 0)$, $(0, -1)$, $(-1, 0)$ to the $x, y$ coordinates for the respective orientations. Upon a turning action, the orientation is updated by $o \leftarrow (o + d) \bmod 4$, where $d = -1$ for left and $d = 1$ for right. If the tourist moves outside the grid, we issue a warning that they cannot go in that direction and do not update the location. Moreover, tourists are shown different types of transitions: a short transition for actions that bring the tourist to a different corner of the same intersection; and a longer transition for actions that bring them to a new intersection.
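The movement rules above can be sketched as a small state-transition function. This is an illustrative sketch only: the function name `step`, the integer encoding of orientations (0 = north, clockwise), the assumption that $y$ grows northwards, and the default 4x4 grid bounds are assumptions made for this example and are not taken from the released code.

```python
# Orientations in clockwise order, so +1 is a right turn and -1 a left turn.
ORIENTATIONS = ["north", "east", "south", "west"]
# Forward displacement (dx, dy) per orientation; assumes y grows northwards.
FORWARD = [(0, 1), (1, 0), (0, -1), (-1, 0)]


def step(state, action, grid_min=(0, 0), grid_max=(3, 3)):
    """Apply one tourist action to an (x, y, o) state and return the new state."""
    x, y, o = state
    if action == "left":
        return (x, y, (o - 1) % 4)
    if action == "right":
        return (x, y, (o + 1) % 4)
    if action == "forward":
        dx, dy = FORWARD[o]
        nx, ny = x + dx, y + dy
        inside = (grid_min[0] <= nx <= grid_max[0]
                  and grid_min[1] <= ny <= grid_max[1])
        # A move that would leave the grid is rejected: the state is unchanged.
        return (nx, ny, o) if inside else state
    raise ValueError(f"unknown action: {action}")
```

For instance, a tourist facing west at `(0, 0)` who tries to move forward stays in place, which corresponds to the warning described above.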
The guide observes a map that corresponds to the tourist’s environment. We exploit the fact that urban areas like NYC are full of local businesses, and overlay the map with these landmarks as localization points for our task. Specifically, we manually annotate each corner of the intersection with a set of landmarks $\Lambda_{x,y}$, each coming from one of the following categories:
The right side of Figure 1 illustrates how the map is presented. Note that within-intersection transitions have a smaller grid distance than transitions to new intersections. To ensure that the localization task is not too easy, we do not include street names in the overhead map and keep the landmark categories coarse. That is, the dialogue is driven by uncertainty in the tourist’s current location and the properties of the target location: if the exact location and orientation of the tourist were known, it would suffice to communicate a sequence of actions.
| Task | Perception | Action | Language | Dialogue | Size | Acts |
|---|---|---|---|---|---|---|
| Visual Dialog (Das et al., 2016) | Real | ✗ | Human | ✓ | 120k dialogues | 20 |
| GuessWhat (de Vries et al., 2016) | Real | ✗ | Human | ✓ | 131k dialogues | 10 |
| VLN (Anderson et al., 2017) | Real | ✓ | Human | ✗ | 23k instructions | - |
| Embodied QA (Das et al., 2017a) | Simulated | ✓ | Scripted | ✗ | 5k questions | - |
For the Talk The Walk task, we randomly choose one of the five neighborhoods, and subsample a 4x4 grid (one block with four complete intersections) from the entire grid. We specify the boundaries of the grid by its top-left and bottom-right corners, which define the coordinate ranges $[x_{min}, x_{max}]$ and $[y_{min}, y_{max}]$. Next, we construct the overhead map of the environment, i.e. $\Lambda_{x', y'}$ with $x_{min} \le x' \le x_{max}$ and $y_{min} \le y' \le y_{max}$. We subsequently sample a start location and orientation $(x, y, o)$ and a target location $(x_{tgt}, y_{tgt})$ at random. (Note that we do not include the orientation in the target, as we found in early experiments that this led to an unnatural task for humans. Similarly, we explored bigger grid sizes but found these to be too difficult for most annotators.)
The shared goal of the two agents is to navigate the tourist to the target location $(x_{tgt}, y_{tgt})$, which is only known to the guide. The tourist perceives a “street view” planar projection of the 360-degree image at location $(x, y, o)$ and can simultaneously chat with the guide and navigate through the environment. The guide’s role consists of reading the tourist’s description of the environment, building a “mental map” of their current position and providing instructions for navigating towards the target location. Whenever the guide believes that the tourist has reached the target location, they instruct the system to evaluate the tourist’s location. The task ends when the evaluation is successful, i.e., when $(x, y) = (x_{tgt}, y_{tgt})$, or otherwise continues until a total of three failed attempts. The additional attempts are meant to ease the task for humans, as we found that they otherwise often fail at the task but still end up close to the target location, e.g., at the wrong corner of the correct intersection.
2.2 Data Collection
We crowd-sourced the collection of the dataset on Amazon Mechanical Turk (MTurk). We use the MTurk interface of ParlAI (Miller et al., 2017) to render 360 images via WebGL and dynamically display neighborhood maps with an HTML5 canvas. Detailed task instructions, which were also given to our workers before they started their task, are shown in Appendix 14. We paired Turkers at random and let them alternate between the tourist and guide role across different HITs.
2.3 Dataset Statistics
The Talk The Walk dataset consists of over 10k successful dialogues; see Table 6 in the appendix for the dataset statistics split by neighborhood. Turkers successfully completed 76.74% of all finished tasks (we use this statistic as the human success rate). More than six hundred participants successfully completed at least one Talk The Walk HIT. Although the Visual Dialog (Das et al., 2016) and GuessWhat (de Vries et al., 2016) datasets are larger, the collected Talk The Walk dialogues are significantly longer. On average, Turkers needed more than 62 acts (i.e. utterances and actions) before they successfully completed the task, whereas Visual Dialog requires 20 acts. The majority of acts comprise the tourist’s actions, with on average more than 44 actions per dialogue. The guide produces roughly 9 utterances per dialogue, slightly more than the tourist’s 8 utterances. Turkers use diverse discourse, with a vocabulary size of more than 10K (calculated over all successful dialogues). An example from the dataset is shown in Appendix 13. The dataset is available at https://github.com/facebookresearch/talkthewalk.
We investigate the difficulty of the proposed task by establishing initial baselines. The final Talk The Walk task is challenging and encompasses several important sub-tasks, ranging from landmark recognition to tourist localization and natural language instruction-giving. Arguably the most important sub-task is localization: without such capabilities the guide cannot tell whether the tourist reached the target location. In this work, we establish a minimal baseline for Talk The Walk by utilizing agents trained for localization. Specifically, we let trained tourist models undertake random walks, using the following protocol: at each step, the tourist communicates its observations and actions to the guide, who predicts the tourist’s location. If the guide predicts that the tourist is at the target, we evaluate its location. If successful, the task ends; otherwise we continue until there have been three wrong evaluations. The protocol is given as pseudo-code in Appendix 11.
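The random-walk protocol described above can be sketched as follows. This is a hedged illustration rather than the pseudo-code of Appendix 11: `random_walk_episode`, `guide_predict`, `max_steps` and the map-aligned action set are hypothetical names and choices for this example, and `guide_predict` is a stand-in for the trained tourist and guide pair (it receives the true position only so that a stub can act as an oracle or a faulty predictor).

```python
import random

# Map-aligned moves; an assumption made for this sketch.
ACTIONS = {"left": (-1, 0), "right": (1, 0), "up": (0, 1), "down": (0, -1)}


def random_walk_episode(start, target, guide_predict, grid_size=4,
                        max_attempts=3, max_steps=1000, rng=None):
    """Run one episode and return (success, number_of_steps_taken)."""
    rng = rng or random.Random(0)
    x, y = start
    attempts = 0
    for steps in range(max_steps + 1):
        # The guide predicts a location; an evaluation is triggered only
        # when that prediction coincides with the target.
        if guide_predict((x, y)) == target:
            if (x, y) == target:
                return True, steps
            attempts += 1
            if attempts == max_attempts:
                return False, steps
        # Take a random action; moves that would leave the grid are skipped.
        dx, dy = ACTIONS[rng.choice(sorted(ACTIONS))]
        if 0 <= x + dx < grid_size and 0 <= y + dy < grid_size:
            x, y = x + dx, y + dy
    return False, max_steps
```

A guide that always claims the tourist is at the target exhausts its three evaluations almost immediately, which is why localization quality dominates the success rate of this baseline.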
3.1 Tourist Localization
The designed navigation protocol relies on a trained localization model that predicts the tourist’s location from a communicated message. Before we formalize this localization sub-task in Section 3.1.1, we further introduce two simplifying assumptions—perfect perception and orientation-agnosticism—so as to overcome some of the difficulties we encountered in preliminary experiments.
Early experiments revealed that perceptual grounding of landmarks is difficult: we set up a landmark classification problem, on which models with extracted CNN (He et al., 2016) or text recognition features (Gupta et al., 2016) barely outperform a random baseline (see Appendix 12 for full details). This finding implies that localization models from image input are limited by their ability to recognize landmarks, and, as a result, would not generalize to unseen environments. To ensure that perception is not the limiting factor when investigating the landmark-grounding and action-grounding capabilities of localization models, we assume “perfect perception”: in lieu of the 360-degree image view, the tourist is given the landmarks at its current location. More formally, each state observation $s_t$ now equals the set of landmarks at the $(x, y)$-location, i.e. $s_t = \Lambda_{x,y}$. If the $(x, y)$-location does not have any visible landmarks, we return a single “empty corner” symbol. We stress that our findings, including a novel architecture for grounding actions into an overhead map (see Section 4.2.1), should carry over to settings without the perfect perception assumption.
We opt to ignore the tourist’s orientation, which simplifies the set of actions to [Left, Right, Up, Down], corresponding to adding [(-1, 0), (1, 0), (0, 1), (0, -1)] to the current $(x, y)$ coordinates, respectively. Note that actions are now coupled to an orientation on the map (e.g. up is equal to going north), and this implicitly assumes that the tourist has access to a compass. This also affects perception, since the tourist now has access to views from all orientations: in conjunction with “perfect perception”, this implies that only landmarks at the current corner are given, whereas landmarks from different corners (e.g. across the street) are not visible.
Even with these simplifications, the localization-based baseline comes with its own set of challenges. As we show in Section 5.1, the task requires communication about a short (random) path, i.e., not only a sequence of observations but also actions, in order to achieve high localization accuracy. This means that the guide needs to decode observations from multiple time steps, as well as understand their 2D spatial arrangement as communicated via the sequence of actions. Thus, in order to get to a good understanding of the task, we examine whether the agents can learn a communication protocol that simultaneously grounds observations and actions into the guide’s map. In doing so, we thoroughly study the role of the communication channel in the localization task, by investigating increasingly constrained forms of communication: from differentiable continuous vectors to emergent discrete symbols to the full complexity of natural language.
The full navigation baseline hinges on a localization model from random trajectories. While we can sample random actions in the emergent communication setup, this is not possible for the natural language setup because the messages are coupled to the trajectories of the human annotators. This leads to slightly different problem setups, as described below.
A tourist, starting from a random location, takes $T \ge 0$ random actions $A = \{a_0, \ldots, a_{T-1}\}$ to reach target location $(x_{tgt}, y_{tgt})$. Every location in the environment has a corresponding set of landmarks $\Lambda_{x,y} = \{l_0, \ldots, l_K\}$ for each of the $(x, y)$ coordinates. As the tourist navigates, the agent perceives $T + 1$ state-observations $S = \{s_0, \ldots, s_T\}$, where each observation $s_t$ consists of a set of landmark symbols $\{l_0, \ldots, l_K\}$. Given the observations $S$ and actions $A$, the tourist generates a message $M$ which is communicated to the other agent. The objective of the guide is to predict the location $(x_{tgt}, y_{tgt})$ from the tourist’s message $M$.
In contrast to our emergent communication experiments, we do not take random actions but instead extract actions, observations, and messages from the dataset. Specifically, we consider each tourist utterance (i.e. at any point in the dialogue), obtain the current tourist location as target location $(x_{tgt}, y_{tgt})$, the utterance itself as message $M$, and the sequence of observations and actions that took place between the current and previous tourist utterance as $S$ and $A$, respectively. Similar to the emergent language setting, the guide’s objective is to predict the target location $(x_{tgt}, y_{tgt})$ from the tourist message $M$. We conduct experiments with $M$ taken from the dataset and with $M$ generated from the extracted observations $S$ and actions $A$.
This section outlines the tourist and guide architectures. We first describe how the tourist produces messages for the various communication channels across which the messages are sent. We subsequently describe how these messages are processed by the guide, and introduce the novel Masked Attention for Spatial Convolutions (MASC) mechanism that allows for grounding into the 2D overhead map in order to predict the tourist’s location.
4.1 The Tourist
For each of the communication channels, we outline the procedure for generating a message $M$. Given a set of state observations $\{s_0, \ldots, s_T\}$, we represent each observation by summing the $d$-dimensional embeddings of the observed landmarks, i.e. for $\{o_0, \ldots, o_T\}$, $o_t = \sum_{l \in s_t} E^{\Lambda}(l)$, where $E^{\Lambda}$ is the landmark embedding lookup table. In addition, we embed action $a_t$ into a $d$-dimensional embedding $\alpha_t$ via a look-up table $E^{A}$. We experiment with three types of communication channel.
The tourist has access to observations of several time steps, whose order is important for accurate localization. Because summing embeddings is order-invariant, we introduce a sum over positionally-gated embeddings, which, conditioned on time step $t$, pushes embedding information into the appropriate dimensions. More specifically, we generate an observation message $m^{obs} = \sum_{t=0}^{T} \mathrm{sigmoid}(g_t^{obs}) \odot o_t$, where $g_t^{obs}$ is a learned gating vector for time step $t$. In a similar fashion, we produce action message $m^{act}$ and send the concatenated vectors $M = [m^{obs}; m^{act}]$ as message to the guide. We can interpret continuous vector communication as a single, monolithic model because its architecture is end-to-end differentiable, enabling gradient-based optimization for training.
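To illustrate why positional gating makes the summed message order-sensitive, here is a minimal pure-Python sketch. The two-dimensional embeddings `bar` and `shop` and the hand-set gate values are made up for the example; in the model the gates are learned.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gated_sum(embeddings, gates):
    """Sum over time of sigmoid(gates[t]) * embeddings[t], element-wise."""
    dim = len(embeddings[0])
    out = [0.0] * dim
    for emb, gate in zip(embeddings, gates):
        for i in range(dim):
            out[i] += sigmoid(gate[i]) * emb[i]
    return out


# Hypothetical landmark embeddings and per-time-step gates.
bar, shop = [1.0, 0.0], [0.0, 1.0]
gates = [[4.0, -4.0], [-4.0, 4.0]]

bar_then_shop = gated_sum([bar, shop], gates)
shop_then_bar = gated_sum([shop, bar], gates)
```

A plain sum would give the same vector for both orderings, whereas the gated sums differ: each time step mostly writes into its own dimensions, so the guide can tell which landmark was observed first.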
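Like the continuous vector communication model, with discrete communication the tourist also uses separate channels for observations and actions, as well as a sum over positionally gated embeddings to generate observation embedding $h^{obs}$. We pass this embedding through a sigmoid and generate a message $m^{obs}$ by sampling from the resulting Bernoulli distributions:
$$m_i^{obs} \sim \mathrm{Bernoulli}(\mathrm{sigmoid}(h_i^{obs}))$$
The action message $m^{act}$ is produced in the same way, and we obtain the final tourist message $M = [m^{obs}; m^{act}]$ through concatenating the messages.
Because the sampling operation is not differentiable, we use policy gradients to train the parameters $\theta$ of the tourist model. That is, we estimate the gradient by
$$\nabla_\theta \mathbb{E}_{M \sim p_\theta}[r(M)] = \mathbb{E}_{M \sim p_\theta}[\nabla_\theta \log p_\theta(M) \, (r(M) - b)]$$
where the reward function $r(M)$ is the negative of the guide’s loss (see Section 4.2) and $b$ is a learned baseline that reduces variance, which we train with a mean squared error loss. (This is different from A2C, which uses a state-value baseline that is trained by the Bellman residual.)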
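The sampling step and the score-function (REINFORCE) gradient can be sketched for a single message as follows. This is a minimal illustration under stated assumptions: the helper names are hypothetical, the gradient is taken with respect to the pre-sigmoid logits only, and the reward and baseline are passed in as plain numbers instead of being produced by a guide model and a learned baseline network.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def sample_message(logits, rng):
    """Sample one binary symbol per dimension from Bernoulli(sigmoid(logit))."""
    return [1 if rng.random() < sigmoid(h) else 0 for h in logits]


def reinforce_grad(logits, message, reward, baseline):
    """Single-sample estimate of d E[reward] / d logits.

    Uses d log p(m) / d h_i = m_i - sigmoid(h_i) for a Bernoulli variable
    parameterized by logit h_i.
    """
    advantage = reward - baseline
    return [advantage * (m - sigmoid(h)) for h, m in zip(logits, message)]
```

When the reward equals the baseline the estimate vanishes, which is the variance-reduction effect of subtracting a baseline: only messages that do better or worse than expected move the logits.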
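Because observations and actions are of variable length, we use an LSTM encoder over the sequence of observation embeddings $[o_t]_{t=0}^{T}$, and extract its last hidden state $h^{obs}$. We use a separate LSTM encoder for action embeddings $[\alpha_t]_{t=0}^{T-1}$, and concatenate both $h^{obs}$ and $h^{act}$ to the input of the LSTM decoder at each time step:
$$i_k = [E^{dec}(w_{k-1}); h^{obs}; h^{act}], \quad h_k^{dec} = \mathrm{LSTM}(h_{k-1}^{dec}, i_k), \quad p(w_k \mid w_{<k}, S, A) = \mathrm{softmax}(W^{out} h_k^{dec} + b^{out})_{w_k}$$
where $E^{dec}$ is a look-up table, taking input tokens $w_k$. We train with teacher-forcing, i.e. we optimize the cross-entropy loss: $\mathcal{L} = -\sum_{k=1}^{K} \log p(w_k \mid w_{<k}, S, A)$. At test time, we explore the following decoding strategies: greedy, sampling and beam search. We also fine-tune a trained tourist model (starting from a pre-trained model) with policy gradients in order to minimize the guide’s prediction loss.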
4.2 The Guide
Given a tourist message $M$ describing their observations and actions, the objective of the guide is to predict the tourist’s location on the map. First, we outline the procedure for extracting observation embedding $\hat{o}$ and action embeddings $\hat{a}_t$ from the message $M$ for each of the types of communication. Next, we discuss the MASC mechanism that takes the observations and actions and grounds them on the guide’s map in order to predict the tourist’s location.
For the continuous communication model, we assign the observation message to the observation embedding, i.e. $\hat{o} = m^{obs}$. To extract the action embedding for time step $t$, we apply a linear layer to the action message, i.e. $\hat{a}_t = W_t^{act} m^{act} + b_t^{act}$.
For discrete communication, we obtain observation $\hat{o}$ by applying a linear layer to the observation message, i.e. $\hat{o} = W^{obs} m^{obs} + b^{obs}$. Similar to the continuous communication model, we use a linear layer over action message $m^{act}$ to obtain action embedding $\hat{a}_t$ for time step $t$.
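The message $M$ contains information about observations and actions, so we use a recurrent neural network with attention mechanism to extract the relevant observation and action embeddings. Specifically, we encode the message $M$, consisting of $K$ tokens $w_k$ taken from vocabulary $V$, with a bidirectional LSTM:
$$\overrightarrow{h}_k = \mathrm{LSTM}(\overrightarrow{h}_{k-1}, E^{W}(w_k)), \quad \overleftarrow{h}_k = \mathrm{LSTM}(\overleftarrow{h}_{k+1}, E^{W}(w_k)), \quad h_k = [\overrightarrow{h}_k; \overleftarrow{h}_k]$$
where $E^{W}$ is the word embedding look-up table. We obtain observation embedding $\hat{o}_t$ through an attention mechanism over the hidden states $h$:
$$s_k = h_k \cdot c_t, \quad \hat{o}_t = \sum_{k} \mathrm{softmax}(s)_k \, h_k$$
where $c_0$ is a learned control embedding which is updated through a linear transformation of the previous control and observation embedding: $c_{t+1} = W^{ctrl} [c_t; \hat{o}_t] + b^{ctrl}$. We use the same mechanism to extract the action embedding $\hat{a}_t$ from the hidden states. For the observation embedding, we obtain the final representation by summing positionally gated embeddings, i.e., $\hat{o} = \sum_{t=0}^{T} \mathrm{sigmoid}(g_t) \odot \hat{o}_t$.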
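One attention read-out over the encoder states can be sketched as below. The sketch assumes dot-product scoring between each hidden state and the control embedding, and it omits the LSTM encoder and the linear control update; `attend` and its arguments are hypothetical names for this example.

```python
import math


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(hidden_states, control):
    """Dot-product attention: weights = softmax(h_k . c), output = sum_k weights_k * h_k."""
    scores = [sum(h_i * c_i for h_i, c_i in zip(h, control))
              for h in hidden_states]
    weights = softmax(scores)
    dim = len(control)
    pooled = [sum(w * h[i] for w, h in zip(weights, hidden_states))
              for i in range(dim)]
    return pooled, weights
```

With a zero control vector every token receives the same weight; a control vector aligned with one hidden state concentrates the read-out on that token, which is how successive control embeddings can pick out different observations or actions from the same utterance.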
4.2.1 Masked Attention for Spatial Convolutions (MASC)
We represent the guide’s map as $U \in \mathbb{R}^{G_1 \times G_2 \times d}$, where in this case $G_1 = G_2 = 4$, and where each $d$-dimensional $(x, y)$ location embedding $u_{x,y}$ is computed as the sum of the guide’s landmark embeddings for that location.
While the guide’s map representation contains only local landmark information, the tourist communicates a trajectory over the map (i.e. actions and observations from multiple locations), implying that directly comparing the tourist’s message with the individual landmark embeddings is probably suboptimal. Instead, we want to aggregate landmark information from surrounding locations by imputing trajectories over the map to predict locations. We propose a mechanism for translating landmark embeddings according to state transitions (left, right, up, down), which can be expressed as a 2D convolution over the map embeddings. For simplicity, let us assume that the map embedding $U$ is 1-dimensional; then a left action can be realized through application of a one-hot $3 \times 3$ kernel, with a single 1 in the middle row next to the center and zeros elsewhere, which effectively shifts all values of $U$ one position to the left. We propose to learn such state-transitions from the tourist message through a differentiable attention-mask over the spatial dimensions of a 3x3 convolution.
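The shift-by-convolution idea can be checked on a toy single-feature map. The sketch below uses cross-correlation with zero padding, which is what most deep learning libraries call convolution; under that convention the single 1 sits to the right of the kernel center, whereas a true convolution flips the kernel and would place it on the left. The function name and the toy grid are illustrative assumptions.

```python
def correlate2d_same(grid, kernel):
    """3x3 cross-correlation with zero padding, keeping the input size."""
    rows, cols = len(grid), len(grid[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            for kr in range(3):
                for kc in range(3):
                    rr, cc = r + kr - 1, c + kc - 1
                    if 0 <= rr < rows and 0 <= cc < cols:
                        out[r][c] += kernel[kr][kc] * grid[rr][cc]
    return out


# One-hot kernel that reads from the right-hand neighbour, so every value
# moves one cell to the left.
SHIFT_LEFT = [[0, 0, 0],
              [0, 0, 1],
              [0, 0, 0]]

grid = [[0, 0, 0, 0],
        [0, 0, 1, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0]]
shifted = correlate2d_same(grid, SHIFT_LEFT)
```

After the operation the single non-zero entry has moved from column 2 to column 1 of the same row. A learned soft mask generalizes this by spreading weight over several kernel positions.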
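We linearly project each predicted action embedding $\hat{a}_t$ to a 9-dimensional vector $z_t$, normalize it by a softmax and subsequently reshape the vector into a 3x3 mask $\Phi_t$:
$$z_t = W^{mask} \hat{a}_t + b^{mask}, \quad \phi_t = \mathrm{softmax}(z_t), \quad \Phi_t = \mathrm{reshape}(\phi_t) \in \mathbb{R}^{3 \times 3}$$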
We learn a 3x3 convolutional kernel $W \in \mathbb{R}^{3 \times 3 \times N \times N}$, with $N$ features, and apply the mask $\Phi_t$ to the spatial dimensions of the convolution by first broadcasting its values along the feature dimensions, i.e. $\hat{\Phi}_t(x, y, i, j) = \Phi_t(x, y)$, and subsequently taking the Hadamard product: $W_t = \hat{\Phi}_t \odot W$. For each action step $t$, we then apply a 2D convolution with masked weight $W_t$ to obtain a new map embedding $U_{t+1} = U_t * W_t$, where we zero-pad the input to maintain identical spatial dimensions.
We repeat the MASC operation $T$ times (i.e. once for each action), and then aggregate the map embeddings by a sum over positionally-gated embeddings: $u_{x,y} = \sum_{t=0}^{T} \mathrm{sigmoid}(g_t^{map}) \odot u_{x,y}^{t}$. We score locations by taking the dot-product of the observation embedding $\hat{o}$, which contains information about the sequence of landmarks observed by the tourist, and the map. We compute a distribution over the locations of the map by taking a softmax over the computed scores:
$$s_{x,y} = \hat{o} \cdot u_{x,y}, \quad p(x, y \mid M, \Lambda) = \frac{\exp(s_{x,y})}{\sum_{x', y'} \exp(s_{x', y'})}$$
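A single MASC step followed by location scoring can be sketched in pure Python as below. This is a simplified illustration, not the released implementation: it applies one masked convolution (no repetition over actions and no gated aggregation), takes the nine mask logits directly instead of projecting them from an action embedding, and uses the cross-correlation convention. All names and the toy values are assumptions made for this example.

```python
import math


def softmax(values):
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def masc_step(map_emb, weight, mask_logits):
    """One masked 3x3 convolution over a map of shape [rows][cols][features].

    weight[kr][kc][i][j] maps input feature i to output feature j;
    mask_logits holds the 9 values predicted from the action embedding.
    """
    probs = softmax(mask_logits)
    mask = [probs[0:3], probs[3:6], probs[6:9]]  # reshape to 3x3
    rows, cols, feats = len(map_emb), len(map_emb[0]), len(map_emb[0][0])
    out = [[[0.0] * feats for _ in range(cols)] for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            for kr in range(3):
                for kc in range(3):
                    rr, cc = r + kr - 1, c + kc - 1
                    if not (0 <= rr < rows and 0 <= cc < cols):
                        continue  # zero padding
                    for i in range(feats):
                        for j in range(feats):
                            # The mask value is shared by all feature pairs.
                            out[r][c][j] += (mask[kr][kc] * weight[kr][kc][i][j]
                                             * map_emb[rr][cc][i])
    return out


def location_distribution(map_emb, obs_emb):
    """Softmax over dot products between the observation and each map cell."""
    scores = [sum(o * u for o, u in zip(obs_emb, cell))
              for row in map_emb for cell in row]
    probs = softmax(scores)
    cols = len(map_emb[0])
    return [probs[r * cols:(r + 1) * cols] for r in range(len(map_emb))]


# Toy example: a landmark at cell (1, 2), identity feature weights, and a mask
# that puts almost all mass on the kernel position right of the center.
identity = [[1.0, 0.0], [0.0, 1.0]]
weight = [[identity for _ in range(3)] for _ in range(3)]
map_emb = [[[0.0, 0.0] for _ in range(4)] for _ in range(4)]
map_emb[1][2] = [1.0, 0.0]
logits = [0.0] * 9
logits[5] = 50.0
shifted = masc_step(map_emb, weight, logits)
probs = location_distribution(shifted, [5.0, 0.0])
```

In this toy setting the landmark embedding moves from cell (1, 2) to cell (1, 1), and the location distribution peaks there. With a softer mask the embedding would be spread over several neighbouring cells.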
While emergent communication models use a fixed-length trajectory $T$, natural language messages may differ in the number of communicated observations and actions. Hence, we predict $T$ from the communicated message. Specifically, we use a softmax regression layer over the last hidden state $h_K$ of the RNN, and subsequently sample $T$ from the resulting multinomial distribution:
$$z = \mathrm{softmax}(W^{len} h_K + b^{len}), \quad \hat{T} \sim \mathrm{Multinomial}(z)$$
We jointly train the $T$-prediction model via REINFORCE, with the guide’s loss as reward function and a mean-reward baseline.
To better analyze the performance of the models incorporating MASC, we compare against a no-MASC baseline in our experiments, as well as a prediction upper bound.
We compare the proposed MASC model with a model that does not include this mechanism. Whereas MASC predicts a convolution mask from the tourist message, the “No MASC” model uses $W$, the ordinary convolutional kernel, to convolve the map embedding $U_t$ to obtain $U_{t+1}$. We also share the weights of this convolution at each time step.
Because we have access to the class-conditional likelihood $p(S, A \mid x, y)$, we are able to compute the Bayes error rate (or irreducible error). No model (no matter how expressive) with any amount of data can ever obtain better localization accuracy than this bound, as there are multiple locations consistent with the observations and actions.
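For the simplest case, in which only the landmark set at the current location is communicated, the accuracy upper bound can be computed by counting. The sketch below assumes a uniform prior over locations and covers only this single-observation case; bounds for longer trajectories with actions require enumerating paths. The function name and the toy map are hypothetical.

```python
from collections import Counter


def observation_only_upper_bound(landmarks_by_location):
    """Best achievable accuracy when only the current landmark set is sent.

    Assumes a uniform prior over locations: for every distinct observation,
    the best guess is any one of the locations that share it.
    """
    counts = Counter(frozenset(lms) for lms in landmarks_by_location.values())
    total = len(landmarks_by_location)
    # Each observation class contributes (size / total) * (1 / size) = 1 / total.
    return len(counts) / total


toy_map = {
    (0, 0): {"bar"},
    (0, 1): {"bar"},
    (1, 0): {"shop"},
    (1, 1): set(),
}
bound = observation_only_upper_bound(toy_map)
```

In the toy map two corners share the same landmark set, so even a perfect guide can be right at most three times out of four. This is the sense in which ambiguity between locations, not model capacity, limits accuracy.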
5 Results and Discussion
In this section, we describe the findings of various experiments. First, we analyze how much information needs to be communicated for accurate localization in the Talk The Walk environment, and find that a short random path (including actions) is necessary. Next, for emergent language, we show that the MASC architecture can achieve very high localization accuracy, significantly outperforming the baseline that does not include this mechanism. We then turn our attention to the natural language experiments, and find that localization from human utterances is much harder, reaching an accuracy level that is below communicating a single landmark observation. We show that generated utterances from a conditional language model lead to significantly better localization performance, by successfully grounding the utterance on a single landmark observation (but not yet on multiple observations and actions). Finally, we show performance of the localization baseline on the full task, which can be used for future comparisons to this work.
5.1 Analysis of Localization Task
Task is not too easy
The upper bound on localization performance in Table 2 suggests that communicating a single landmark observation is not sufficient for accurate localization of the tourist (35% accuracy). This is an important result for the full navigation task because the need for two-way communication disappears if localization is too easy; if the guide knows the exact location of the tourist it suffices to communicate a list of instructions, which is then executed by the tourist. The uncertainty in the tourist’s location is what drives the dialogue between the two agents.
Importance of actions
We observe that the upper bound for only communicating observations plateaus even as more actions are taken, and remains well below the upper bound obtained when we also take actions into account (see Table 2). This implies that, at least for random walks, it is essential to communicate a trajectory, including observations and actions, in order to achieve high localization accuracy.
5.2 Emergent Language Localization
We first report the results for tourist localization with emergent language in Table 2.
MASC improves performance
The MASC architecture significantly improves performance compared to models that do not include this mechanism. For instance, with a single action, MASC already achieves 56.09% on the test set, and this further increases to 69.85% for longer trajectories. On the other hand, no-MASC models hit a plateau at 43%. In Appendix 10, we analyze learned MASC values, and show that communicated actions are often mapped to corresponding state-transitions.
Continuous vs discrete
We observe similar performance for continuous and discrete emergent communication models, implying that a discrete communication channel is not a limiting factor for localization performance.
5.3 Natural Language Localization
We report the results of tourist localization with natural language in Table 4. We compare accuracy of the guide model (with MASC) trained on utterances from (i) humans, (ii) a supervised model with various decoding strategies, and (iii) a policy gradient model optimized with respect to the loss of a frozen, pre-trained guide model on human utterances.
Compared to emergent language, localization from human utterances is much harder, reaching a markedly lower accuracy on the test set (see Table 4). Here, we report localization from a single utterance, but in Appendix 9.2 we show that including up to five dialogue utterances yields only a limited improvement. We also show that MASC outperforms no-MASC models for natural language communication.
We also investigate generated tourist utterances from conditional language models. Interestingly, we observe that the supervised model (with greedy and beam-search decoding) as well as the policy gradient model lead to an improvement of more than 10 accuracy points over the human utterances. However, their level of accuracy is slightly below the baseline of communicating a single observation, indicating that these models only learn to ground utterances in a single landmark observation.
Better grounding of generated utterances
We analyze natural language samples in Table 5, and confirm that, unlike human utterances, the generated utterances are talking about the observed landmarks. This observation explains why the generated utterances obtain higher localization accuracy. The current language models are most successful when conditioned on a single landmark observation; we show in Appendix 9.1.1 that performance quickly deteriorates when the model is conditioned on more observations, suggesting that it cannot produce natural language utterances about multiple time steps.
| Method | Decoding | Utterance |
|---|---|---|
| Human | | a field of some type |
| Supervised | greedy | at a bar |
| | sampling | sec just hard to tell which is a restaurant ? |
| | beam search | im at a bar |
| Policy Grad. | greedy | bar from bar from bar and rigth rigth bulding bulding |
| | sampling | which bar from bar from bar and bar rigth bulding bulding.. |

Table 5: Samples from the tourist models communicating in natural language. Contrary to the human-generated utterance, the supervised model with greedy and beam search decoding produces an utterance containing the current state observation (bar). The reinforcement learning model also mentions the current observation but has lost linguistic structure. The fact that these localization models are better grounded in observations than human utterances explains why they obtain higher localization accuracy.
5.4 Localization-based Baseline
Comparison with human annotators
Interestingly, our best localization model (continuous communication, with MASC) achieves 88.33% on the test set and thus exceeds human performance of 76.74% on the full task. While emergent models appear to be stronger localizers, humans might cope with their localization uncertainty through other mechanisms (e.g. better guidance, bias towards taking particular paths, etc.). The simplifying assumption of perfect perception also helps.
Number of actions
Unsurprisingly, humans take fewer steps (roughly 15) than our best random walk model (roughly 34). Our human annotators likely used some form of guidance to navigate faster to the target.
We introduced the Talk The Walk task and dataset, which consists of crowd-sourced dialogues in which two human annotators collaborate to navigate to target locations in the virtual streets of NYC. For the important localization sub-task, we proposed MASC, a novel grounding mechanism to learn state transitions from the tourist’s message, and showed that it improves localization performance for emergent and natural language. We use the localization model to provide baseline numbers on the Talk The Walk task, in order to facilitate future research.
- Anderson et al. (1991) Anne H. Anderson, Miles Bader, Ellen Gurman Bard, Elizabeth Boyle, Gwyneth Doherty, Simon Garrod, Stephen Isard, Jacqueline Kowtko, Jan McAllister, Jim Miller, Catherine Sotillo, Henry S. Thompson, and Regina Weinert. The hcrc map task corpus. Language and Speech, 34(4):351–366, 1991.
- Anderson et al. (2017) Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian D. Reid, Stephen Gould, and Anton van den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. CoRR, abs/1711.07280, 2017.
- Antol et al. (2015) Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In Proc. of ICCV, 2015.
- Artzi & Zettlemoyer (2013) Yoav Artzi and Luke Zettlemoyer. Weakly supervised learning of semantic parsers for mapping instructions to actions. Transactions of the Association for Computational Linguistics, 1:49–62, 2013.
- Baroni (2016) Marco Baroni. Grounding distributional semantics in the visual world. Language and Linguistics Compass, 10(1):3–13, 2016.
- Barsalou (2008) Lawrence W. Barsalou. Grounded cognition. Annual Review of Psychology, 59(1):617–645, 2008.
- Bojanowski et al. (2016) Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606, 2016.
- Brahmbhatt & Hays (2017) Samarth Brahmbhatt and James Hays. Deepnav: Learning to navigate large cities. CoRR, abs/1701.09135, 2017. URL http://arxiv.org/abs/1701.09135.
- Chaplot et al. (2018a) Devendra Singh Chaplot, Emilio Parisotto, and Ruslan Salakhutdinov. Active neural localization. arXiv preprint arXiv:1801.08214, 2018a.
- Chaplot et al. (2018b) Devendra Singh Chaplot, Kanthashree Mysore Sathyendra, Rama Kumar Pasumarthi, Dheeraj Rajagopal, and Ruslan Salakhutdinov. Gated-attention architectures for task-oriented language grounding. AAAI, 2018b.
- Chen & Mooney (2011) David L. Chen and Raymond J. Mooney. Learning to interpret natural language navigation instructions from observations. In Proceedings of the 25th AAAI Conference on Artificial Intelligence (AAAI-2011), San Francisco, CA, USA, August 2011.
- Das et al. (2016) Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M. F. Moura, Devi Parikh, and Dhruv Batra. Visual dialog. arXiv preprint arXiv:1611.08669, 2016.
- Das et al. (2017a) Abhishek Das, Samyak Datta, Georgia Gkioxari, Stefan Lee, Devi Parikh, and Dhruv Batra. Embodied question answering. CoRR, abs/1711.11543, 2017a.
- Das et al. (2017b) Abhishek Das, Satwik Kottur, José MF Moura, Stefan Lee, and Dhruv Batra. Learning cooperative visual dialog agents with deep reinforcement learning. arXiv preprint arXiv:1703.06585, 2017b.
- de Vries et al. (2016) Harm de Vries, Florian Strub, Sarath Chandar, Olivier Pietquin, Hugo Larochelle, and Aaron C. Courville. Guesswhat?! visual object discovery through multi-modal dialogue. arXiv preprint arXiv:1611.08481, 2016.
- de Vries et al. (2017) Harm de Vries, Florian Strub, Jeremie Mary, Hugo Larochelle, Olivier Pietquin, and Aaron C. Courville. Modulating early visual processing by language. In Proc. of NIPS, 2017.
- Elliott et al. (2016) Desmond Elliott, Stella Frank, Khalil Sima’an, and Lucia Specia. Multi30k: Multilingual english-german image descriptions. arXiv preprint arXiv:1605.00459, 2016.
- Evtimova et al. (2017) Katrina Evtimova, Andrew Drozdov, Douwe Kiela, and Kyunghyun Cho. Emergent language in a multi-modal, multi-step referential game. arXiv preprint arXiv:1705.10369, 2017.
- Garrod & Anderson (1987) Simon Garrod and Anthony Anderson. Saying what you mean in dialogue: A study in conceptual and semantic co-ordination. Cognition, 27(2):181–218, 1987.
- Gupta et al. (2016) A. Gupta, A. Vedaldi, and A. Zisserman. Synthetic data for text localisation in natural images. In
- Gupta et al. (2017a) Saurabh Gupta, James Davidson, Sergey Levine, Rahul Sukthankar, and Jitendra Malik. Cognitive mapping and planning for visual navigation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017a.
- Gupta et al. (2017b) Saurabh Gupta, David Fouhey, Sergey Levine, and Jitendra Malik. Unifying map and landmark based representations for visual navigation. arXiv preprint arXiv:1712.08125, 2017b.
- Hadsell et al. (2007) Raia Hadsell, Pierre Sermanet, Jeff Han, Beat Flepp, Urs Muller, and Yann LeCun. Online learning for offroad robots: Using spatial label propagation to learn long-range traversability. In Proc. of Robotics: Science and Systems (RSS), volume 11, 2007.
- He et al. (2017) He He, Anusha Balakrishnan, Mihail Eric, and Percy Liang. Learning symmetric collaborative dialogue agents with dynamic knowledge graph embeddings. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1766–1776, Vancouver, Canada, July 2017. Association for Computational Linguistics. URL http://aclweb.org/anthology/P17-1162.
- He et al. (2016) K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016.
- Hermann et al. (2017) Karl Moritz Hermann, Felix Hill, Simon Green, Fumin Wang, Ryan Faulkner, Hubert Soyer, David Szepesvari, Wojtek Czarnecki, Max Jaderberg, Denis Teplyashin, et al. Grounded language learning in a simulated 3d world. arXiv preprint arXiv:1706.06551, 2017.
- Johnson et al. (2017) Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In Proc. of CVPR, 2017.
- Kiela (2017) Douwe Kiela. Deep embodiment: grounding semantics in perceptual modalities (PhD thesis). Technical Report UCAM-CL-TR-899, University of Cambridge, Computer Laboratory, February 2017.
- Kiela et al. (2016) Douwe Kiela, Luana Bulat, Anita L. Vero, and Stephen Clark. Virtual embodiment: A scalable long-term strategy for artificial intelligence research. arXiv preprint arXiv:1610.07432, 2016.
- Kiela et al. (2017) Douwe Kiela, Alexis Conneau, Allan Jabri, and Maximilian Nickel. Learning visually grounded sentence representations. arXiv preprint arXiv:1707.06320, 2017.
- Kingma & Ba (2014) Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
- Kottur et al. (2017) Satwik Kottur, José M.F. Moura, Stefan Lee, and Dhruv Batra. Natural Language Does Not Emerge ’Naturally’ in Multi-Agent Dialog. volume abs/1706.08502, 2017.
- Lazaridou et al. (2016) Angeliki Lazaridou, Alexander Peysakhovich, and Marco Baroni. Multi-agent cooperation and the emergence of (natural) language. arXiv preprint arXiv:1612.07182, 2016.
- Lewis et al. (2017) Mike Lewis, Denis Yarats, Yann N Dauphin, Devi Parikh, and Dhruv Batra. Deal or no deal? end-to-end learning for negotiation dialogues. arXiv preprint arXiv:1706.05125, 2017.
- Lin et al. (2014) Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Proc. of ECCV, 2014.
- MacMahon et al. (2006) Matt MacMahon, Brian Stankiewicz, and Benjamin Kuipers. Walk the talk: Connecting language, knowledge, and action in route instructions. In Proceedings of the 21st National Conference on Artificial Intelligence (AAAI-2006), Boston, MA, USA, July 2006.
- Mei et al. (2016) Hongyuan Mei, Mohit Bansal, and Matthew R Walter. Listen, attend, and walk: Neural mapping of navigational instructions to action sequences. In Proceedings of AAAI, 2016.
- Miller et al. (2017) Alexander H Miller, Will Feng, Adam Fisch, Jiasen Lu, Dhruv Batra, Antoine Bordes, Devi Parikh, and Jason Weston. Parlai: A dialog research software platform. arXiv preprint arXiv:1705.06476, 2017.
- Mirowski et al. (2018) Piotr Mirowski, Matthew Koichi Grimes, Mateusz Malinowski, Karl Moritz Hermann, Keith Anderson, Denis Teplyashin, Karen Simonyan, Koray Kavukcuoglu, Andrew Zisserman, and Raia Hadsell. Learning to navigate in cities without a map. CoRR, abs/1804.00168, 2018. URL http://arxiv.org/abs/1804.00168.
- Mordatch & Abbeel (2017) Igor Mordatch and Pieter Abbeel. Emergence of grounded compositional language in multi-agent populations. arXiv preprint arXiv:1703.04908, 2017.
- Perez et al. (2018) Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. Film: Visual reasoning with a general conditioning layer. In Proc. of AAAI, 2018.
- Riezler et al. (2014) Stefan Riezler, Patrick Simianer, and Carolin Haas. Response-based learning for grounded machine translation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pp. 881–891, 2014.
- Roy (2005) Deb Roy. Grounding words in perception and action: computational insights. Trends in cognitive sciences, 9(8):389–396, 2005.
- Smith & Gasser (2005) Linda Smith and Michael Gasser. The development of embodied cognition: Six lessons from babies. Artificial Life, 11(1-2):13–29, 2005.
- Steels & Hild (2012) Luc Steels and Manfred Hild. Language grounding in robots. Springer Science & Business Media, 2012.
- Strub et al. (2017) Florian Strub, Harm De Vries, Jeremie Mary, Bilal Piot, Aaron Courville, and Olivier Pietquin. End-to-end optimization of goal-driven and visually grounded dialogue systems. arXiv preprint arXiv:1703.05423, 2017.
- Sutton & Barto (1998) Richard S. Sutton and Andrew G. Barto. Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition, 1998. ISBN 0262193981.
- Vo et al. (2017) Nam Vo, Nathan Jacobs, and James Hays. Revisiting im2gps in the deep learning era. In Computer Vision (ICCV), 2017 IEEE International Conference on, pp. 2640–2649. IEEE, 2017.
- Williams (1992) Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. In Reinforcement Learning. Springer, 1992.
- Xu et al. (2015) Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In Proc. of ICML, 2015.
- Yu et al. (2017) Haonan Yu, Haichao Zhang, and Wei Xu. A deep compositional framework for human-like language acquisition in virtual environment. arXiv preprint arXiv:1703.09831, 2017.
7 Related Work
The Talk the Walk task and dataset facilitate future research on various important subfields of artificial intelligence, including grounded language learning, goal-oriented dialogue research and situated navigation. Here, we describe related previous work in these areas.
There has been a long line of work involving related tasks. Early work on task-oriented dialogue dates back to the early 90s with the introduction of the Map Task (Anderson et al., 1991) and Maze Game (Garrod & Anderson, 1987) corpora. Recent efforts have led to larger-scale goal-oriented dialogue datasets, for instance to aid research on visually-grounded dialogue (Das et al., 2016; de Vries et al., 2016), knowledge-base-grounded discourse (He et al., 2017) or negotiation tasks (Lewis et al., 2017). At the same time, there has been a big push to develop environments for embodied AI, many of which involve agents following natural language instructions with respect to an environment (Artzi & Zettlemoyer, 2013; Yu et al., 2017; Hermann et al., 2017; Mei et al., 2016; Chaplot et al., 2018a,b), following up on early work in this area (MacMahon et al., 2006; Chen & Mooney, 2011). An early example of navigation using neural networks is Hadsell et al. (2007), who propose an online learning approach for robot navigation. Recently, there has been increased interest in using end-to-end trainable neural networks for learning to navigate indoor scenes (Gupta et al., 2017a,b) or large cities (Brahmbhatt & Hays, 2017; Mirowski et al., 2018), but, unlike our work, without multi-agent communication. The task of localization (again without multi-agent communication) has also recently been studied (Chaplot et al., 2018a; Vo et al., 2017).
Grounded language learning
Grounded language learning is motivated by the observation that humans learn language embodied (grounded) in sensorimotor experience of the physical world (Barsalou, 2008; Smith & Gasser, 2005). On the one hand, work in multi-modal semantics has shown that grounding can lead to practical improvements on various natural language understanding tasks (see Baroni, 2016; Kiela, 2017, and references therein). On the other hand, in robotics, researchers dissatisfied with purely symbolic accounts of meaning attempted to build robotic systems with the aim of grounding meaning in physical experience of the world (Roy, 2005; Steels & Hild, 2012). Recently, grounding has also been applied to the learning of sentence representations (Kiela et al., 2017), image captioning (Lin et al., 2014; Xu et al., 2015), visual question answering (Antol et al., 2015; de Vries et al., 2017), visual reasoning (Johnson et al., 2017; Perez et al., 2018), and grounded machine translation (Riezler et al., 2014; Elliott et al., 2016). Grounding also plays a crucial role in emerging research on multi-agent communication, where agents communicate (in natural language or otherwise) in order to solve a task with respect to their shared environment (Lazaridou et al., 2016; Das et al., 2017b; Mordatch & Abbeel, 2017; Evtimova et al., 2017; Lewis et al., 2017; Strub et al., 2017; Kottur et al., 2017).
8 Implementation Details
For the emergent communication models, we use an embedding size . The natural language experiments use 128-dimensional word embeddings and a bidirectional RNN with units. In all experiments, we train the guide with a cross entropy loss using the ADAM optimizer with default hyper-parameters (Kingma & Ba, 2014). We perform early stopping on the validation accuracy, and report the corresponding train, valid and test accuracy. We optimize the localization models with continuous, discrete and natural language communication channels for 200, 200, and 25 epochs, respectively. To facilitate further research on Talk The Walk, we make our code base for reproducing experiments publicly available at https://github.com/facebookresearch/talkthewalk.
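The reporting protocol described above (early stopping on validation accuracy, then reporting the train, valid and test accuracy of that same epoch) can be sketched as follows. This is a minimal illustration, not the released code: the function name `select_checkpoint`, the record layout and the accuracy values are all hypothetical.

```python
def select_checkpoint(history):
    """Pick the epoch with the highest validation accuracy.

    `history` is a list of per-epoch records (a hypothetical layout), each a
    dict with keys 'epoch', 'train', 'valid' and 'test'. On ties, the earliest
    epoch wins because max() returns the first maximal element.
    """
    return max(history, key=lambda record: record["valid"])


# Made-up numbers, for illustration only.
history = [
    {"epoch": 1, "train": 0.40, "valid": 0.35, "test": 0.33},
    {"epoch": 2, "train": 0.55, "valid": 0.48, "test": 0.46},
    {"epoch": 3, "train": 0.70, "valid": 0.45, "test": 0.47},
]
best = select_checkpoint(history)
print(best["epoch"], best["train"], best["valid"], best["test"])
```

Note that the test accuracy is read off at the selected epoch rather than maximized separately, so epoch 3's slightly higher test number in this toy history is not the one reported.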
9 Additional Natural Language Experiments
First, we investigate the sensitivity of tourist generation models to the trajectory length, finding that the model conditioned on a single observation (i.e. ) achieves best performance. In the next subsection, we further analyze localization models from human utterances by investigating MASC and no-MASC models with increasing dialogue context.
9.1 Tourist Generation Models
9.1.1 Path length
After training the supervised tourist model (conditioned on observations and action from human expert trajectories), there are two ways to train an accompanying guide model. We can optimize a location prediction model on either (i) extracted human trajectories (as in the localization setup from human utterances) or (ii) on all random paths of length (as in the full task evaluation). Here, we investigate the impact of (1) using either human or random trajectories for training the guide model, and (2) the effect of varying the path length during the full-task evaluation. For random trajectories, guide training uses the same path length as is used during evaluation. We use a pre-trained tourist model with greedy decoding for generating the tourist utterances. Table 7 summarizes the results.
Human vs random trajectories
We only observe small improvements for training on random trajectories. Human trajectories are thus diverse enough to generalize to random trajectories.
Effect of path length
There is a strong negative correlation between task success and the conditioned trajectory length. We observe that the full task performance quickly deteriorates for both human and random trajectories. This suggests that the tourist generation model cannot produce natural language utterances that describe multiple observations and actions. Although it is possible that the guide model cannot process such utterances, this is not very likely because the MASC architecture handles such messages successfully for emergent communication.
9.1.2 Effect of beam-size
We report localization performance of tourist utterances generated by beam search decoding of varying beam size in Table 7. We find that performance decreases from 29.05% to 20.87% accuracy on the test set when we increase the beam-size from one to eight.
9.2 Localization from Human Utterances
We conduct an ablation study for MASC on natural language with varying dialogue context. Specifically, we compare localization accuracy of MASC and no-MASC models trained on the last [1, 3, 5] utterances of the dialogue (including guide utterances). We report these results in Table 8. In all cases, MASC outperforms the no-MASC models by several accuracy points. We also observe that mean predicted (over the test set) increases from to when more dialogue context is included.
10 Visualizing MASC predictions
Figure 2 shows the MASC values for a learned model with emergent discrete communications and actions. Specifically, we look at the predicted MASC values for different action sequences taken by the tourist. We observe that the first action is always mapped to the correct state-transition, but that the second and third MASC values do not always correspond to the correct state-transitions.
11 Evaluation on Full Setup
12 Landmark Classification
While the guide has access to the landmark labels, the tourist needs to recognize these landmarks from raw perceptual information. In this section, we study landmark classification as a supervised learning problem to investigate the difficulty of perceptual grounding in Talk The Walk.
The Talk The Walk dataset contains a total of 307 different landmarks divided among nine classes; see Figure 4 for how they are distributed. The class distribution is fairly imbalanced, with shops and restaurants as the most frequent landmarks and relatively few play fields and theaters. We treat landmark recognition as a multi-label classification problem as there can be multiple landmarks on a corner. (Strictly speaking, this is more general than a multi-label setup because a corner might contain multiple landmarks of the same class.)
For the task of landmark classification, we extract the relevant views of the 360 image from which a landmark is visible. Because landmarks are labeled to be on a specific corner of an intersection, we assume that they are visible from one of the orientations facing away from the intersection. For example, for a landmark on the northwest corner of an intersection, we extract views from both the north and west direction. The orientation-specific views are obtained by a planar projection of the full 360-image with a small field of view (60 degrees) to limit distortions. To cover the full field of view, we extract two images per orientation, with their horizontal focus point 30 degrees apart. Hence, we obtain eight images per 360 image with corresponding orientation .
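The view-extraction geometry can be illustrated with a small sketch that enumerates the camera headings for one 360 image. The compass convention (north at 0 degrees, increasing clockwise) and the symmetric placement of the two focus points at plus and minus 15 degrees around each orientation are assumptions; the text only fixes the 60-degree field of view and the 30-degree spacing. All names are hypothetical.

```python
# Assumed compass convention: north = 0 degrees, increasing clockwise.
ORIENTATIONS = {"north": 0.0, "east": 90.0, "south": 180.0, "west": 270.0}
FOV_DEGREES = 60.0     # field of view of each planar projection (from the text)
FOCUS_SPACING = 30.0   # spacing between the two focus points (from the text)


def view_headings():
    """Return (orientation, heading) pairs: two views per orientation."""
    views = []
    for name, center in ORIENTATIONS.items():
        # Assumption: the two focus points sit symmetrically around the center.
        for offset in (-FOCUS_SPACING / 2, FOCUS_SPACING / 2):
            views.append((name, (center + offset) % 360.0))
    return views


def corner_views(corner):
    """Views for a landmark on a corner such as 'northwest': north and west."""
    return [(name, heading) for name, heading in view_headings() if name in corner]


views = view_headings()
# Two 60-degree views whose centers are 30 degrees apart span 90 degrees.
coverage = FOCUS_SPACING + FOV_DEGREES
northwest = corner_views("northwest")
print(len(views), coverage, northwest)
```

Under these assumptions each orientation is covered by a 90-degree span, and a landmark on a given corner is associated with four of the eight views.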
We run the following pre-trained feature extractors over the extracted images:
- ResNet
We resize the extracted view to a 224x224 image and pass it through a ResNet-152 network (He et al., 2016) to obtain a 2048-dimensional feature vector from the penultimate layer.
- Text Recognition
We use a pre-trained text-recognition model (Gupta et al., 2016) to extract a set of text messages from the images. Local businesses often advertise their wares through key phrases on their storefront, and understanding this text might be a good indicator of the type of landmark. In Figure 3, we show the results of running the text recognition module on a few extracted images.
For the text recognition model, we use a learned look-up table to embed the extracted text features , and fuse all embeddings of four images through a bag of embeddings, i.e., . We use a linear layer followed by a sigmoid to predict the probability for each class, i.e. . We also experiment with replacing the look-up embeddings with pre-trained FastText embeddings (Bojanowski et al., 2016). For the ResNet model, we use a bag of embeddings over the four ResNet features, i.e. , before we pass it through a linear layer to predict the class probabilities: . We also conduct experiments where we first apply PCA to the extracted ResNet and FastText features before we feed them to the model.
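A minimal sketch of the classifier described in this paragraph is given below: embeddings from the four views are fused by an order-free bag of embeddings and passed through a linear layer and a sigmoid to give one probability per landmark class. Several details are assumptions rather than the paper's specification: the bag is implemented as a sum, the linear layer has a bias, and the embedding size, vocabulary size and random weights are purely illustrative.

```python
import math
import random

NUM_CLASSES = 9   # nine landmark classes (from the text)
EMBED_DIM = 8     # illustrative size, not the paper's setting
VOCAB_SIZE = 20   # illustrative vocabulary of recognized text tokens

random.seed(0)
# Stand-ins for learned parameters (randomly initialized here).
embedding_table = [[random.gauss(0, 1) for _ in range(EMBED_DIM)]
                   for _ in range(VOCAB_SIZE)]
weight = [[random.gauss(0, 0.1) for _ in range(NUM_CLASSES)]
          for _ in range(EMBED_DIM)]
bias = [0.0] * NUM_CLASSES


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bag_of_embeddings(vectors, dim):
    """Sum a collection of vectors; the result ignores their order."""
    bag = [0.0] * dim
    for vector in vectors:
        for i, value in enumerate(vector):
            bag[i] += value
    return bag


def class_probabilities(bag, weight, bias):
    """Linear layer followed by a sigmoid: one independent probability per class."""
    return [
        sigmoid(sum(bag[i] * weight[i][c] for i in range(len(bag))) + bias[c])
        for c in range(len(bias))
    ]


def predict_from_text(token_ids_per_view):
    """token_ids_per_view: four lists of recognized token ids, one per view."""
    vectors = [embedding_table[t] for view in token_ids_per_view for t in view]
    return class_probabilities(bag_of_embeddings(vectors, EMBED_DIM), weight, bias)


probs = predict_from_text([[3, 7], [], [12], [7, 7, 1]])
print([round(p, 3) for p in probs])
```

The same two helper functions would apply to the ResNet variant by passing the four per-view feature vectors to `bag_of_embeddings` instead of token embeddings. Because each class gets its own sigmoid, several classes can be predicted for the same corner, which matches the multi-label framing.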
| Features | Train loss | Valid loss | Train F1 | Valid F1 | Valid prec. | Valid recall |
|---|---|---|---|---|---|---|
| FastText (100 dim) | 0.00721 | 0.00863 | 0.32651 | 0.28672 | 0.24964 | 0.4433 |
| ResNet (256 dim) | 0.0051 | 0.00748 | 0.60911 | 0.31953 | 0.27733 | 0.50515 |
To account for class imbalance, we train all described models with a binary cross entropy loss weighted by the inverted class frequency. We create an 80-20 class-conditional split of the dataset into a training and validation set. We train for 100 epochs and perform early stopping on the validation loss.
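One way to realize a binary cross entropy loss weighted by inverted class frequency is sketched below. How the weights enter the loss is an assumption: here every term of class c, positive or negative, is scaled by 1 / (fraction of examples labeled with c), and the loss is averaged over all entries. The function names and the toy data are hypothetical.

```python
import math


def inverse_frequency_weights(labels):
    """labels: list of binary rows, one row per example, one column per class."""
    num_examples = len(labels)
    num_classes = len(labels[0])
    weights = []
    for c in range(num_classes):
        positives = sum(row[c] for row in labels)
        # Guard against classes that never occur in the training set.
        frequency = max(positives / num_examples, 1e-8)
        weights.append(1.0 / frequency)
    return weights


def weighted_bce(probs, labels, class_weights, eps=1e-7):
    """Mean over all (example, class) entries of the class-weighted BCE."""
    total, count = 0.0, 0
    for prob_row, label_row in zip(probs, labels):
        for p, y, w in zip(prob_row, label_row, class_weights):
            p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
            total += -w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
            count += 1
    return total / count


# Toy data: class 0 appears in every example, class 1 in one of four.
labels = [[1, 0], [1, 0], [1, 1], [1, 0]]
weights = inverse_frequency_weights(labels)
loss = weighted_bce([[0.5, 0.5]] * 4, labels, weights)
print(weights, loss)
```

In this toy example the rare class receives four times the weight of the frequent one, so mistakes on rare landmark types contribute more to the gradient.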
Table 10 reports the F1 scores for the described methods. We compare to an “all positive” baseline that always predicts that the landmark class is visible and observe that all presented models struggle to outperform this baseline. Although 256-dimensional ResNet features achieve slightly better precision on the validation set, they result in much worse recall and a lower F1 score. Our results indicate that perceptual grounding is a difficult task, which easily merits a paper in its own right, and so we leave further improvements (e.g. better text recognizers) for future work.
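To see why an “all positive” baseline is hard to beat on F1, note that it attains perfect recall by construction, so its precision equals the fraction of positive labels. The sketch below computes precision, recall and F1 for such a baseline on toy data. Pooling counts over all classes (micro-averaging) is an assumption, since the averaging used for the reported scores is not stated here; names and data are hypothetical.

```python
def precision_recall_f1(predictions, labels):
    """Micro-averaged scores over all (example, class) entries."""
    tp = fp = fn = 0
    for pred_row, label_row in zip(predictions, labels):
        for pred, label in zip(pred_row, label_row):
            if pred == 1 and label == 1:
                tp += 1
            elif pred == 1 and label == 0:
                fp += 1
            elif pred == 0 and label == 1:
                fn += 1
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy multi-label ground truth: 4 positives among 12 entries.
labels = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 0]]
all_positive = [[1] * len(row) for row in labels]
precision, recall, f1 = precision_recall_f1(all_positive, labels)
print(precision, recall, f1)
```

A learned model only improves on this baseline if its gain in precision outweighs the recall it gives up, which is the trade-off discussed above.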
13 Dataset Details
We split the full dataset by assigning entire 4x4 grids (independent of the target location) to the train, valid or test set. Specifically, we design the split such that, for each grid in the valid set, at least one intersection (out of four) is not part of the train set. For the test set, all four intersections are novel. See our source code, available at URLANONYMIZED, for more details on how this split is realized.
Tourist: ACTION:TURNRIGHT ACTION:TURNRIGHT
Guide: Hello, what are you near?
Tourist: ACTION:TURNLEFT ACTION:TURNLEFT ACTION:TURNLEFT
Tourist: Hello, in front of me is a Brooks Brothers
Tourist: ACTION:TURNLEFT ACTION:FORWARD ACTION:TURNLEFT ACTION:TURNLEFT
Guide: Is that a shop or restaurant?
Tourist: ACTION:TURNLEFT
Tourist: It is a clothing shop.
Tourist: ACTION:TURNLEFT
Guide: You need to go to the intersection in the northwest corner of the map
Tourist: ACTION:TURNLEFT
Tourist: There appears to be a bank behind me.
Tourist: ACTION:TURNLEFT ACTION:TURNLEFT ACTION:TURNRIGHT ACTION:TURNRIGHT
Guide: Ok, turn left then go straight up that road
Tourist: ACTION:TURNLEFT ACTION:TURNLEFT ACTION:TURNLEFT ACTION:FORWARD ACTION:TURNRIGHT ACTION:FORWARD ACTION:FORWARD ACTION:TURNLEFT ACTION:TURNLEFT ACTION:TURNLEFT
Guide: There should be shops on two of the corners but you need to go to the corner without a shop.
Tourist: ACTION:FORWARD ACTION:FORWARD ACTION:FORWARD ACTION:TURNLEFT ACTION:TURNLEFT
Guide: let me know when you get there.
Tourist: on my left is Radio city Music hall
Tourist: ACTION:TURNLEFT ACTION:FORWARD ACTION:TURNLEFT ACTION:TURNRIGHT ACTION:TURNRIGHT
Tourist: I can’t go straight any further.
Guide: ok. turn so that the theater is on your right.
Guide: then go straight
Tourist: That would be going back the way I came
Guide: yeah. I was looking at the wrong bank
Tourist: I’ll notify when I am back at the brooks brothers, and the bank.
Tourist: ACTION:TURNRIGHT
Guide: make a right when the bank is on your left
Tourist: ACTION:FORWARD ACTION:FORWARD ACTION:TURNRIGHT
Tourist: Making the right at the bank.
Tourist: ACTION:FORWARD ACTION:FORWARD
Tourist: I can’t go that way.
Tourist: ACTION:TURNLEFT
Tourist: Bank is ahead of me on the right
Tourist: ACTION:FORWARD ACTION:FORWARD ACTION:TURNLEFT
Guide: turn around on that intersection
Tourist: I can only go to the left or back the way I just came.
Tourist: ACTION:TURNLEFT
Guide: you’re in the right place. do you see shops on the corners?
Guide: If you’re on the corner with the bank, cross the street
Tourist: I’m back where I started by the shop and the bank.
Tourist: ACTION:TURNRIGHT
Guide: on the same side of the street?
Tourist: crossing the street now
Tourist: ACTION:FORWARD ACTION:FORWARD ACTION:TURNLEFT
Tourist: there is an I love new york shop across the street on the left from me now
Tourist: ACTION:TURNRIGHT ACTION:FORWARD
Guide: ok. I’ll see if it’s right.
Guide: EVALUATE_LOCATION
Guide: It’s not right.
Tourist: What should I be on the look for?
Tourist: ACTION:TURNRIGHT ACTION:TURNRIGHT ACTION:TURNRIGHT
Guide: There should be shops on two corners but you need to be on one of the corners without the shop.
Guide: Try the other corner.
Tourist: this intersection has 2 shop corners and a bank corner
Guide: yes. that’s what I see on the map.
Tourist: should I go to the bank corner? or one of the shop corners? or the blank corner (perhaps a hotel)
Tourist: ACTION:TURNLEFT ACTION:TURNLEFT ACTION:TURNRIGHT ACTION:TURNRIGHT
Guide: Go to the one near the hotel. The map says the hotel is a little further down but it might be a little off.
Tourist: It’s a big hotel it’s possible.
Tourist: ACTION:FORWARD ACTION:TURNLEFT ACTION:FORWARD ACTION:TURNRIGHT
Tourist: I’m on the hotel corner
Guide: EVALUATE_LOCATION