Shifting the Baseline: Single Modality Performance on Visual Navigation & QA

11/01/2018 ∙ by Jesse Thomason, et al.

Language-and-vision navigation and question answering (QA) are exciting AI tasks situated at the intersection of natural language understanding, computer vision, and robotics. Researchers from all of these fields have begun creating datasets and model architectures for these domains. It is, however, not always clear if strong performance is due to advances in multimodal reasoning or if models are learning to exploit biases and artifacts of the data. We present single modality models and explore the linguistic, visual, and structural biases of these benchmarks. We find that single modality models often outperform published baselines that accompany multimodal task datasets, suggesting a need for change in community best practices moving forward. In light of this, we recommend presenting single modality baselines alongside new multimodal models to provide a fair comparison of information gained over dataset biases when considering multimodal input.



While it has long been believed that language, vision, and robotics should be complementary fields building on and learning from one another, disjoint methodologies have often made this difficult. For example, integrating the symbolic representations of language with the continuous RGB inputs of vision or the motor control of robotics is challenging.

Deep learning based embedding methods and recurrent architectures are beginning to bridge this gap by allowing arbitrary inputs to be mapped into a latent representation which can be shared across modalities. These new, fully end-to-end models appear to learn multimodal representations. However, it is often unclear when and how these models rely on input from each perceptual modality to make a decision. For example, in navigating a house, one typically walks straight down a hall, and might guess to do so without any explicit instructions (or even with their eyes closed – Figure 1). A model’s ability to choose which modalities to attend to, and when, is a clear boon, but we show that these gains may not be strictly additive.

Figure 1: Taking the most likely valid action when navigating without language (or vision) often leads to sensible navigation trajectories. At the marked step, “forward” is unavailable because the agent would collide with the wall.

This work directly extends observations made within both the Computer Vision Goyal et al. (2018) and Natural Language Glockner et al. (2018); Poliak et al. (2018); Gururangan et al. (2018); Kaushik and Lipton (2018) communities that sophisticated models often perform well by fitting to spurious correlations in the data, thereby ignoring the type of grounding or reasoning experimenters hoped was necessary for their task.

This paper provides a comparable analysis for more recent visual navigation and egocentric question answering tasks rising in popularity. Thanks to the community’s impressive data collection efforts, we can evaluate baselines in a single modality framework across three task datasets: navigation using images of real homes paired with crowdsourced language descriptions Anderson et al. (2018); and navigation and question answering Gordon et al. (2018), or embodied question answering alone Das et al. (2018), in simulation with synthetic questions. In this paper, we quantify the difference between the published baselines in those benchmarks (often majority class or random selection) versus single modality versions of the corresponding published multimodal models.

Recommendation for Best Practices.

It is important to investigate when “multimodal” models are truly building multimodal representations. We call on the community to include ablation baselines on tasks with multimodal inputs using existing architectures by omitting inputs for each modality in turn. Such baselines expose possible gains from unimodal biases in multimodal datasets versus details of training or the architecture alone.
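As a minimal sketch of the recommended ablation protocol, the following hypothetical helper zeroes out one modality's feature vector while preserving its shape, so the original multimodal architecture can be retrained unchanged. The function name and interface are illustrative, not from any benchmark's codebase.

```python
import numpy as np

def ablate(vision_feats, language_feats, drop="none"):
    """Zero out one modality's features before feeding a multimodal model.

    `drop` is "vision", "language", or "none". Shapes are preserved, so the
    existing architecture can be retrained with no structural modification:
    any residual performance must come from the remaining inputs (and biases).
    """
    if drop == "vision":
        vision_feats = np.zeros_like(vision_feats)
    elif drop == "language":
        language_feats = np.zeros_like(language_feats)
    return vision_feats, language_feats
```

Training one model per `drop` setting yields the vision-only, language-only, and full-model rows that the paper argues should accompany every multimodal result.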

In the remainder of this paper, we demonstrate such a baseline evaluation framework on a number of visual navigation and egocentric QA tasks. We find that unimodal baselines typically outperform published random and majority-class baselines that accompany multimodal tasks. It is important to note that many papers ablate one modality (e.g., language or vision), though rarely both. In the new space of navigation and egocentric QA, an agent’s action history and available actions serve as an important structural guide, a well-studied signal in the minimally sensing literature O’Kane and LaValle (2006), making it necessary to ablate both vision and language.

Related Work

The last few years have seen dramatic growth in benchmark datasets at the intersections of natural language processing, computer vision, and robotics, especially related to navigation and question answering. In related natural language processing and computer vision tasks, past investigations have exposed biases that allow models to provide the right answers for the wrong reasons. In this work, we perform a similar exposition for visual navigation and QA tasks.

Language-Driven Visual Navigation

Work on mapping natural language instructions to visual navigation spans many years, and was popularized in a simulation environment for an agent navigating a set of hallways given a human instruction MacMahon et al. (2006); Chen and Mooney (2011). For single-sentence instructions in this simple environment, researchers have achieved state-of-the-art performance with deep recurrent models Mei et al. (2016), in contrast to inference relying on traditional, symbolic representations of language for navigation (e.g., Duvallet et al. (2013) and many others). Simple navigation policies conditioned on language instructions can also be learned via reinforcement learning Chaplot et al. (2018) and via predictions of visitation and goal distributions Blukis et al. (2018) in simulated environments. For multi-sentence and more complex, realistic domains, navigation from human instructions remains a difficult problem. We investigate a recent benchmark dataset Anderson et al. (2018) that introduces human language instructions for trajectories through simulations of real home environments captured with high-definition, panoramic cameras Chang et al. (2017).

Egocentric Question Answering

Visual question answering can be seen as beginning with image captioning, where the “question” is the image and the answer is the generic content of the image in natural language Chen et al. (2015). Visual question answering (VQA) has been formalized as a popular multimodal task Antol et al. (2015) where questions are given in natural language with an accompanying subject image. Questions in VQA are conceptually simple, but in theory require understanding the relationships between different objects in the scene to answer correctly. An important simplifying assumption of all of this work is that the answer exists in the current scene. The task can be further generalized by requiring an agent to navigate and interact with the world to answer a question (e.g., Misra et al. (2018)). We investigate IQUAD V1 Gordon et al. (2018) and EmbodiedQA Das et al. (2018), which introduce simulators paired with questions that require learning basic object affordances (e.g., look in the fridge) and scene dynamics (e.g., navigate to the garage) to effectively answer questions using an egocentric, moving agent.

Discovering Benchmark Bias.

Bias is not inherently undesirable. Any dataset that reflects the world will also reflect its biases. Much of the commonsense knowledge we hope models will capture boils down to understanding patterns in the world Zellers et al. (2018). For example, hallways are typically long and refrigerators are found in the kitchen. For this reason, it does not make sense, in general, to control for all world biases (e.g. via randomization), but it is important to understand when they can be exploited to provide an incorrect or inflated perception of task success.

A core source of unwanted dataset bias is collection methodology. In particular, within the Natural Language Processing community, most recent datasets are collected via Amazon’s Mechanical Turk. In cases where the annotators are asked to generate natural language, rather than select a label, the priming of the question, the difficulty of the task, and the pay provided can all have effects on the types of language annotators provide. Recent work has shown that for large scale Natural Language Inference (NLI) tasks, annotators asked to provide entailing or contradicting sentences to a prompt will often do so in a formulaic way (e.g., by introducing negation). This creates consistent words and phrases that can be used as indicator features of the class label, requiring no reasoning or world knowledge Glockner et al. (2018). Exploiting these regularities can allow models to guess correctly without seeing all of the input Poliak et al. (2018), but filtering for these effects can yield subsets of the data which appear to “truly” require inference Gururangan et al. (2018). Recent work has proposed modeling these biases during data collection in order to construct adversarial datasets which resist such artifacts by design Zellers et al. (2018).

A similar set of biases has been seen in the multimodal context of Visual Question Answering Goyal et al. (2018), where some questions can be answered without looking at the image or by exploiting majority class baselines. Similarly, others have created competitive image captioning systems which replace complex language-generation models with simple visual nearest-neighbor lookups from the training set Devlin et al. (2015). This paper conducts investigations in visual navigation and QA benchmarks to uncover similar biases. Our concrete recommendation is not to eliminate these biases during dataset creation, but to conduct evaluations in a bias-aware way by including single modality baselines.

Benchmark Tasks

We evaluate on navigation and question answering tasks across three benchmark datasets. When applicable, we use the authors’ provided code, dataset splits, and evaluation procedures in order to maintain consistency with published results.


The Matterport room-to-room navigation benchmark Anderson et al. (2018) is a navigation challenge in which an agent is given a route to follow in natural language and must take steps through a discretized map to the destination. The benchmark is built using the Matterport 3D Simulator Chang et al. (2017), providing high-definition scans as visual input to the agent (Figure 1). This benchmark’s high fidelity visual inputs, together with real, crowdsourced natural language descriptions, provide a great resource for researchers studying visual navigation. Our single modality experiments help elucidate how much information about routes through the Matterport houses can be captured without the use of all available sensors.

In all experiments, we use the train and validation splits provided by the original benchmark release Anderson et al. (2018). We present results on two validation sets: seen houses and unseen houses. The former contain houses seen during training (with different trajectories), while the latter contain both novel houses and (necessarily) novel trajectories.

Figure 2: Sample question pairs from IQUAD V1. Each row shows a true/false question pair where an image from the same viewpoint can be used to answer the question in both cases, but only if the question is also known. To achieve better than chance accuracy, an agent requires both visual and language inputs. Here, the object of interest is circled in yellow.

Interactive Question Answering Dataset (IQUAD V1).

This dataset Gordon et al. (2018) consists of over 75,000 questions in 30 interactive kitchen environments from AI2THOR Kolve et al. (2017). Each environment is professionally hand-designed to model a real-life room and rendered with the Unity game engine, which provides nearly photo-realistic renderings and extensive physics modeling. The IQUAD V1 dataset contains three question types: existence (e.g., Is there a mug in the room?), counting (e.g., How many mugs are in the room?) where the answer ranges from 0 to 3, and spatial relations (e.g., Is there a mug in the microwave?). IQUAD V1 was constructed via randomly generated scene configurations (locations of smaller objects such as mugs), and the agent is placed at a random location at the beginning of each trial. Each question in the dataset is paired with other instances of the same question such that all answers are equally likely. For example, if the dataset contains the question Is there an apple in the fridge? where the answer is yes, there will be another, identical question in the same room where the answer is no. Balancing the frequency of each answer per question (as opposed to over all the questions) controls for language bias. Two sample pairs can be found in Figure 2. For our experiments, we use the existing models, data, and train/test splits in seen and unseen rooms from the IQUAD V1 benchmark release.
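The per-question balancing described above can be checked with a simple diagnostic. This is an illustrative sketch (the function and data format are our own, not from the IQUAD V1 release): it computes, for each distinct question string, the frequency of its most common answer, which balanced construction drives toward the reciprocal of the answer-set size.

```python
from collections import Counter, defaultdict

def answer_balance(dataset):
    """Per-question answer balance diagnostic.

    `dataset` is a list of (question_text, answer) pairs. Returns, for each
    distinct question string, the fraction held by its majority answer; a
    language-only model cannot beat this number on that question.
    """
    by_q = defaultdict(Counter)
    for question, answer in dataset:
        by_q[question][answer] += 1
    return {q: max(c.values()) / sum(c.values()) for q, c in by_q.items()}
```

On a perfectly balanced yes/no pair, the diagnostic returns 0.5, i.e., chance.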

We additionally test a semantic navigation task in the AI2THOR environments which we call THOR-Nav. In this task, the agent is placed in a random location in the room and given a large, unambiguous object (such as the fridge) to approach. In AI2THOR, large objects remain stationary, meaning that although the agent starts in a random location, the fridge is always at the same place for a given room. In our evaluation, we consider navigation successful when the agent is within two steps of the goal location.
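The THOR-Nav success criterion can be written down directly. The sketch below assumes grid-aligned (x, z) coordinates and Manhattan distance over grid steps; the benchmark's exact distance function may differ, so treat this as illustrative.

```python
def thor_nav_success(agent_pos, goal_pos, grid_step=1, threshold=2):
    """THOR-Nav success as described above: the agent must stop within two
    grid steps of the goal. Positions are (x, z) grid coordinates; Manhattan
    distance is an assumption made for this sketch."""
    dist = (abs(agent_pos[0] - goal_pos[0])
            + abs(agent_pos[1] - goal_pos[1])) / grid_step
    return dist <= threshold
```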

Embodied Question Answering (EQA).

This benchmark Das et al. (2018) comprises over 5000 questions in 750 full house environments (contrasting the single-room scenes in IQUAD V1), from the House3D catalog Wu et al. (2018). The questions are programmatically generated to refer to a single, unambiguous object for that specific environment, and are subsequently filtered via an entropy metric to avoid easy questions (e.g., What room is the bathtub in?). The vast majority of the remaining questions fall into three categories: color (e.g., What color is the chair?), color-location (e.g., What color is the chair in the kitchen?), and location (e.g., What room is the chair in?). At evaluation time, an agent is given a question in natural language and placed in a random location up to 50 actions away from the target scene where the question can be answered.

We use the existing models, data, and train and validation splits available in the EQA benchmark release. We evaluate on the validation fold, whose questions are all given in house environments unseen at training time. The model presented along with this benchmark trains a navigation and a visual question answering module separately, then adjoins them for embodied question answering. In this paper, we evaluate only on the visual question answering portion of this task. (Our evaluation is performed with the most up-to-date data available in the repository.)

Ablation Baseline Models

At each timestep during a navigation task, an agent receives a new visual observation, a static language observation encoding, and the previous action. Similarly, before answering a question in a visual question answering task, an agent receives some visual observation history (possibly only its current observation) and an encoding of the question. Intuitively, an agent trained for these tasks should perform at chance if it does not receive either language information (the navigation directions or asked question) or visual information in the scene. We demonstrate that this is not the case, and recommend including these unimodal, ablation baseline models alongside future multimodal model results.

Vision Only.

By removing the language input to an existing model, the best the model can do is find and exploit biases in the dataset’s visual information. For example, in navigation, models can learn to walk down hallways and turn when they reach walls without language data (Figure 1). In question answering, these models can learn to give the color or count of the most salient object(s) in the scene, or, in the case of EQA, perform scene classification (Figure 4).
Language Only.

Conversely, removing the vision input to an existing model leaves it to find and exploit biases in language information. In navigation, this can involve minimal sensing: for example, turn left at the end of the hallway can mean walk forward until forward is not an option, then turn left, in which case vision data is not necessary. For navigation, we zero out vision inputs to full benchmark baseline models to evaluate this condition. In question answering, question categories and their majority class answer can be memorized, a language bias demonstrated in Visual Question Answering (VQA v1) Antol et al. (2015) and later counterbalanced for Goyal et al. (2018) in VQA v2. For example, counting questions in VQA v1 overwhelmingly had the answer two. For QA benchmarks, we train simple LSTM networks to take in questions and predict answers.
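The ceiling for such a language-only model, once question types are recoverable from the text, is the majority answer per question type. The sketch below is a stand-in for the LSTM baseline (the class name and the assumption that the type is already extracted are ours); an LSTM without vision can at best learn this mapping.

```python
from collections import Counter, defaultdict

class MajorityByType:
    """Language-only QA ceiling: memorize the majority answer per question
    type. Illustrative stand-in for the LSTM baseline described in the text;
    extraction of the question type from raw text is assumed."""

    def fit(self, examples):
        # examples: iterable of (question_type, answer) pairs
        counts = defaultdict(Counter)
        for qtype, answer in examples:
            counts[qtype][answer] += 1
        self.best = {t: c.most_common(1)[0][0] for t, c in counts.items()}
        return self

    def predict(self, qtype):
        return self.best.get(qtype)
```

Fit on VQA v1-style data, such a model answers "two" to every counting question, which is exactly the bias the text describes.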

Figure 3:

Conditional (probability of taking the column action given the previous row action) and marginal distributions of actions in ground truth trajectories for Matterport. Trajectories involve peaky distributions of actions, enabling agents without access to visual information to memorize simple rules like not turning left immediately after turning right, or moving forward an average number of steps.
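The statistics visualized in Figure 3 can be estimated directly from the ground truth action sequences. This is a generic sketch (our own data format, not the benchmark's): it counts marginal action frequencies and the conditional distribution of the next action given the previous one.

```python
from collections import Counter, defaultdict

def action_statistics(trajectories):
    """Estimate marginal and conditional (next given previous) action
    distributions from a list of action sequences, as plotted in Figure 3."""
    marginal, cond = Counter(), defaultdict(Counter)
    for traj in trajectories:
        marginal.update(traj)
        for prev, nxt in zip(traj, traj[1:]):
            cond[prev][nxt] += 1
    total = sum(marginal.values())
    marg_p = {a: n / total for a, n in marginal.items()}
    cond_p = {p: {a: n / sum(c.values()) for a, n in c.items()}
              for p, c in cond.items()}
    return marg_p, cond_p
```

Peaky rows in `cond_p` (e.g., left rarely following right) are precisely what a blind agent can memorize.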

Action History.

Finally, by removing both language and vision inputs to an existing model, the only information at each timestep is the previous action. This ablation can be thought of as a majority class baseline conditioned on the action sequence. Intuitively, this can learn “average” trajectories through a map for navigation. For example, it should be able to learn that navigating to a destination never involves walking forward, turning 180 degrees, and walking back the way it came. In a question answering context, this model can at best learn to report the majority answer over all questions, so we do not evaluate the Action History baseline on question answering benchmarks.
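A toy version of this Action History baseline can be rolled out greedily from a conditional action table: at each step it takes the most probable next action given only the previous one. This is a caricature for intuition; the actual baseline retrains the full architecture with both input modalities zeroed.

```python
def rollout(cond_p, start="start", max_steps=20):
    """Greedy action-history agent: choose the most probable next action
    given only the previous action (no vision, no language). `cond_p` maps
    previous action -> {next action: probability}. Toy sketch of the AH
    baseline's "average trajectory" behavior."""
    actions, prev = [], start
    for _ in range(max_steps):
        nxt = max(cond_p[prev], key=cond_p[prev].get)
        if nxt == "end":
            break
        actions.append(nxt)
        prev = nxt
    return actions
```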

Full Model.

Full models reflect the original, multimodal baseline code released with the corresponding benchmark dataset. We briefly describe each model, but leave full details to the corresponding papers.

Matterport Anderson et al. (2018): An LSTM encoding model takes in the input natural language trajectory description, e.g., Move inside the kitchen and walk around the dining room table. An action decoder is initialized with its hidden state set to the language encoder’s output. At each timestep, the decoder receives the ResNet-152 vector He et al. (2015) of the first-person visual scene image for the current location and heading, and its previous action, to decode a next action and update its hidden state.

IQUAD V1 Gordon et al. (2018): Images from the environment are fed through an object detector which populates a top down 2D spatial map of the environment. Questions are input to an LSTM to extract subject and question-type. The output of the LSTM and map are concatenated together, fed through several convolutional and fully-connected layers, outputting probabilities over the answer and action space.

EQA Das et al. (2018): We evaluate on the visual question answering portion of the full Embodied Question Answering model. The full VQA model takes in the last five frames of the ground truth trajectory between the agent’s starting and ending position to answer a given question as visual input, which are processed to a vector embedding with a learned CNN. The question is encoded using an LSTM encoder. The five visual frame representations are summed according to a learned attention weighting, then this sum and the language encoding are used to predict the answer.
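The attention-weighted fusion step described for the EQA answering model can be sketched in a few lines. The bilinear scoring matrix `w_att` and the final concatenation are assumptions made for this sketch; the exact parameterization is in Das et al. (2018).

```python
import numpy as np

def attend_and_fuse(frames, question, w_att):
    """Sketch of the EQA answering step: score each frame embedding against
    the question encoding, softmax the scores into attention weights, and sum
    the frames accordingly. `frames` is (num_frames, d_v), `question` is
    (d_q,), and `w_att` is a hypothetical learned (d_v, d_q) scoring matrix."""
    scores = frames @ w_att @ question           # one score per frame
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # softmax attention
    fused = weights @ frames                     # attention-weighted frame sum
    return np.concatenate([fused, question])     # input to an answer classifier
```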


                                    Matterport           THOR-Nav
Signal    Model                   (Seen)  (Unseen)    (Seen)  (Unseen)
Teacher   AH+V                     0.174   0.139       0.051   0.081
          AH+L                     0.227   0.217       0.056   0.067
          AH                       0.185   0.162       0.081   0.059
          Random Baseline          0.159   0.163       0.018   0.036
          Full Model               0.253   0.205       0.839   0.823
          Δ(Full Model, Baseline)  0.094   0.042       0.821   0.787
          Δ(Full Model, Unimodal)  0.026  -0.012       0.758   0.742
          Δ(Unimodal, Baseline)   +0.068  +0.054      +0.063  +0.045
Sample    AH+V                     0.306   0.132         --      --
          AH+L                     0.147   0.127         --      --
          AH                       0.038   0.032         --      --
          Random Baseline          0.159   0.163         --      --
          Full Model               0.405   0.212         --      --
          Δ(Full Model, Baseline)  0.246   0.049         --      --
          Δ(Full Model, Unimodal)  0.099   0.080         --      --
          Δ(Unimodal, Baseline)   +0.147  -0.031         --      --
Table 1: For Matterport, navigation success is defined as stopping on the correct map location, while for THOR-Nav the agent must be within two steps of the goal location. In nearly all cases, published performance gains shrink when compared with unimodal baselines versus those used in the papers. We highlight cases where a unimodal baseline creates a 5% absolute difference with a published baseline. In only one setting (sample-based Matterport in unseen environments) is a published baseline competitive with unimodal approaches.

Our primary metric for evaluating performance compares unimodal baselines to the majority class and random baselines published in the original works. This delta should theoretically be small if both modalities are necessary for solving the problem. Additionally, we explore the true benefit each architecture receives from having both modalities by showing the delta between the published model and our best unimodal result. Finally, the strength of our unimodal baselines against those used in existing publications is presented, and gaps larger than 5% are highlighted in red.


Navigation

We first evaluate our ablation baselines on the two language-and-vision navigation benchmarks mentioned before: Matterport and THOR-Nav. The Matterport benchmark presents two training conditions: teacher-forcing and action sampling. With teacher-forcing, at each timestep the navigation agent takes the gold-standard action regardless of what action it predicted (and loss is computed against that gold-standard action), meaning it only sees steps along gold-standard trajectories. This paradigm is used to train the navigation agent in THOR-Nav as well. Under action sampling, by contrast, the agent samples the action to take from its predictions, and loss is computed at each timestep against the action which would have put the agent on the shortest path to the goal. This means the agent sees more of the scene, but takes more training iterations to learn to move to the goal.

In both benchmarks, agents are given feedback from the environment indicating when an action is not possible. For example, when the agent is facing a wall, it can no longer move forward (Figure 1). When visual input is not available to the agent, it still maintains this minimally sensing perception, detecting when a forward action from the current location is unavailable. (Matterport’s random baseline also includes structural information similar to BUG algorithms Lumelsky and Stepanov (1987); Taylor and LaValle (2009).)
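This minimal sensing can be folded into any of the ablated agents as a masking step over the action distribution. The sketch below is illustrative (the renormalization choice is ours): even a blind agent restricts its choice to the actions the environment reports as available.

```python
def masked_action(probs, available):
    """Minimal sensing: the environment reports which actions are possible
    (e.g., `forward` is unavailable when facing a wall), so the agent
    renormalizes its action distribution over the valid actions and picks
    the most probable remaining one."""
    valid = {a: p for a, p in probs.items() if a in available}
    total = sum(valid.values())
    valid = {a: p / total for a, p in valid.items()}
    return max(valid, key=valid.get)
```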

                        IQUAD V1             EQA
Model                 (Unseen)  (Seen)    (Unseen)
V only                 0.435     0.428     0.442
L only                 0.417     0.417     0.488
Lang LSTM              0.419     0.422     0.480
Maj Class Baseline     0.417     0.417     0.198
Full Model             0.883     0.893     0.640
Δ(Full, Baseline)      0.466     0.476     0.442
Δ(Full, Unimodal)      0.448     0.465     0.152
Δ(Unimodal, Base)     +0.018    +0.011    +0.290
Table 2: Top-1 accuracy for QA performance. In EQA, we see that replacing published baselines with our unimodal results shrinks gains by an absolute 29%. Following Das et al. (2018), this translates to very similar mean rank scores of 1.8 for the published Full Model and 2.0 for the LSTM (lower is better).

Baseline Model Performance.

For the Matterport benchmark, we train each model under both teacher-forcing and action sampling paradigms, and report the highest validation success rate achieved across all training epochs in Table 1. (We evaluate on validation because it contains both seen and unseen houses, while the test fold contains only unseen houses. We perform no hyperparameter tuning on our ablation models.) We measure the difference between the full model and the best unimodal ablation. Two salient points emerge: the gain when training by sampling the agent’s next action is appreciably higher than when using teacher-forcing, and the gains are lower on unseen environments (in the case of teacher-forcing, the full model actually performs worse than considering only language input).

Intuitively, sampling may make better use of both modalities by learning a more robust alignment between language and vision inputs. The models learn an attention mechanism over the language input, and sampling (which allows the agent to veer off the gold-standard trajectory) may tune that attention such that the agent has a better guess of when it is “on” or “off” the path to the goal.

The agent with access only to its action history is able to perform surprisingly well, achieving a success rate of 16%. This agent is able to exploit structural biases in navigation trajectories. In particular, priors on common actions and the conditional distributions of the next action given the previous one are highly skewed (Figure 3), allowing the agent to narrow the space of reasonable actions to take. (The end action is less common than the start action because ground-truth trajectories are prematurely ended at 20 steps during training and inference in Anderson et al. (2018). We suspect this is because most trajectories are shorter than 20 meters, but note that 20 action steps is insufficient for many trajectories due to turning and tilting actions.) Further, this agent, as well as the language-only agent, can take advantage of minimal sensing from the environment, knowing they must turn or adjust their elevation heading when forward is unavailable.

Using both modalities in unseen environments likely results in less gain over just language or vision because the attention mechanism aligning these modalities is not tuned for unseen visual environments. This lack of tuning creates more noise in the attention, and highlights the importance of attention in multimodal alignment for this task.

Question Answering

For examining the bias in the Question Answering tasks (Table 2), we factored out the navigation component by providing the agent extra ground truth information. The primary difficulty in both EQA and IQUAD V1 involves finding the object(s) of interest relating to the question. If we remove that difficulty, we can assess the extent to which biases affect the question-answer portion of the dataset, as well as the upper-bound of the Full Model performance of a given network architecture. The EQA dataset provides ground truth trajectories from the agent’s start location to the goal which we force the visual agents to follow. Since there are questions in IQUAD V1 which may require multiple distinct viewpoints (e.g. counting questions, false existence questions), rather than a ground truth trajectory, we provide a complete, ground truth semantic memory as described in the benchmark release Gordon et al. (2018); this additionally factors out the object detection task.

Figure 4: Qualitative results of the various baselines on the EQA benchmark task. The language only model can pick out the most likely answer for a question. The vision only model is able to find salient color and room features, but is unaware of the question type.

Baseline Model Performance.

For EQA, both of the single modality models perform significantly better than the majority class baseline (and chance). The vision-only model is able to identify salient colors and basic room features that allow it to dramatically reduce the likely set of answers, even without seeing the question, given an oracle navigator. The language-only models are able to exploit the biases underlying the questions themselves to reach nearly a coin-flip. Intuitively, this means that each of the questions in the EQA dataset has one answer that is as likely as all other answers combined (e.g., 50% of the answers for What color is the bathtub? are grey, which is a very reasonable reflection of the world). Sample outputs from these methods can be seen in Figure 4. We also find that the Full Model performance using oracle navigation leaves significant room for improvement. Both the navigation and QA portions of EQA are difficult tasks with substantial headroom for improvement from tighter integration of language and vision information.

On IQUAD V1, we find a different trend. All models without both modalities perform at nearly chance accuracy. (Majority class and chance for IQUAD V1 both achieve 0.5, 0.5, and 0.25 when conditioned on question type; we present the average of these.) Because of the construction and randomization of IQUAD V1, we find that language-only models are unable to identify any structural bias. The vision-only model gets a minor improvement over chance due to two sources of bias. If there are exactly three of some object, the question is more likely (slightly more than chance) about that count, with answer 3. Similarly, if there are many of a certain object, the question is more likely (slightly more than chance) about existence of that object class and the answer is yes. This allows a model with perfect information about the environment to exploit a 2% bias. The Full Model upper bound on performance still leaves room for improvement, especially for counting questions. The QA network presented along with the benchmark Gordon et al. (2018) could benefit from recent improvements in QA architectures. However, its lack of a full 3D scene representation and low semantic memory spatial resolution means the model will still likely fail when objects are stacked on top of each other or too close together.
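The averaged chance figure quoted above follows directly from the per-type answer-set sizes. A quick check (question-type names are ours): balanced construction caps guessing at the reciprocal of each type's answer count, and the unweighted mean over the three types recovers the ~0.417 baseline in Table 2.

```python
# Per-type chance accuracy on IQUAD V1: yes/no for existence and spatial
# questions, four answers (0-3) for counting questions.
chance_by_type = {"existence": 0.5, "spatial": 0.5, "counting": 0.25}

# Unweighted mean over the three question types, as reported in the text.
avg_chance = sum(chance_by_type.values()) / len(chance_by_type)
```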

For both benchmarks, it is clear that using multimodal models improves accuracy. Similar to previous work Goyal et al. (2018), we argue that quantifying the language bias is especially important in QA tasks. Otherwise, visual features can appear to provide more information than they actually do. In the second question of Figure 4, the full model misidentifies the room as a bathroom, which could be due to the tile floor as a visual cue. However, the language-only model reveals that the visual features may have played no part in that particular answer.


In this work, for three benchmarks in visual navigation and question answering, we measure explicit performance gains from the multimodal versus unimodal baseline models. As a result, we recommend that the community include ablation baselines on tasks with multimodal inputs using existing architectures by omitting inputs for each modality in turn. In the simplest form, the presented models should be retrained and evaluated while zeroing out the inputs from the various modalities.

While the gap between robotics, language, and vision techniques and representations can be narrowed by deep learning based embedding methods and recurrent architectures, we must be careful to analyze whether latent representations across multiple modalities actually capture information in a joint space. Seeing qualitative examples on which unimodal models perform well, as well as when multimodal models fail, is one step towards this (e.g., Figure 4).

New advances happen quickly in this space, introducing more complicated architectures and achieving impressive gains over the original baselines evaluated here Fried et al. (2018); Anonymous (2019). The architectures may be learning better language-to-action or vision-to-action representations that are not as multimodal as we would hope. Evaluating these architectures under the single modality framework described here would help tease apart what is being learned from multimodal inputs.

As a community, we are interested in solving difficult, multimodal learning tasks like visual navigation and question answering in ways that do not rely on structural biases exposed by individual modality signals. By including unimodal model performance when reporting results on such tasks, researchers can make it clear when useful information is being extracted from the joint, multimodal space by proposed models, marking clearer progress at this exciting intersection.


  • Anderson et al. (2018) Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton van den Hengel. 2018. Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Anonymous (2019) Anonymous. 2019. Self-aware visual-textual co-grounded navigation agent. In Submitted to International Conference on Learning Representations. Under review.
  • Antol et al. (2015) Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual Question Answering. In International Conference on Computer Vision (ICCV).
  • Blukis et al. (2018) Valts Blukis, Dipendra Misra, Ross A. Knepper, and Yoav Artzi. 2018. Mapping navigation instructions to continuous control actions with position visitation prediction. In Proceedings of the Conference on Robot Learning.
  • Chang et al. (2017) Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. 2017. Matterport3D: Learning from RGB-D data in indoor environments. International Conference on 3D Vision (3DV).
  • Chaplot et al. (2018) Devendra Singh Chaplot, Kanthashree Mysore Sathyendra, Rama Kumar Pasumarthi, Dheeraj Rajagopal, and Ruslan Salakhutdinov. 2018. Gated-attention architectures for task-oriented language grounding. In AAAI Conference on Artificial Intelligence (AAAI-18), pages 1050–1055.
  • Chen and Mooney (2011) David L. Chen and Raymond J. Mooney. 2011. Learning to interpret natural language navigation instructions from observations. In Proceedings of the 25th AAAI Conference on Artificial Intelligence (AAAI-2011), pages 859–865.
  • Chen et al. (2015) Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. 2015. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325.
  • Das et al. (2018) Abhishek Das, Samyak Datta, Georgia Gkioxari, Stefan Lee, Devi Parikh, and Dhruv Batra. 2018. Embodied Question Answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Devlin et al. (2015) Jacob Devlin, Saurabh Gupta, Ross Girshick, Margaret Mitchell, and C Lawrence Zitnick. 2015. Exploring nearest neighbor approaches for image captioning. arXiv preprint arXiv:1505.04467.
  • Duvallet et al. (2013) F. Duvallet, T. Kollar, and A. Stentz. 2013. Imitation learning for natural language direction following through unknown environments. In 2013 IEEE International Conference on Robotics and Automation, pages 1047–1053.
  • Fried et al. (2018) Daniel Fried, Ronghang Hu, Volkan Cirik, Anna Rohrbach, Jacob Andreas, Louis-Philippe Morency, Taylor Berg-Kirkpatrick, Kate Saenko, Dan Klein, and Trevor Darrell. 2018. Speaker-follower models for vision-and-language navigation. arXiv preprint arXiv:1806.02724.
  • Glockner et al. (2018) Max Glockner, Vered Shwartz, and Yoav Goldberg. 2018. Breaking nli systems with sentences that require simple lexical inferences. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 650–655.
  • Gordon et al. (2018) Daniel Gordon, Aniruddha Kembhavi, Mohammad Rastegari, Joseph Redmon, Dieter Fox, and Ali Farhadi. 2018. IQA: Visual question answering in interactive environments. In Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on. IEEE.
  • Goyal et al. (2018) Yash Goyal, Tejas Khot, Aishwarya Agrawal, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2018. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. International Journal of Computer Vision (IJCV).
  • Gururangan et al. (2018) Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 107–112.
  • He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385.
  • Kaushik and Lipton (2018) Divyansh Kaushik and Zachary C. Lipton. 2018. How much reading does reading comprehension require? a critical investigation of popular benchmarks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5010–5015. Association for Computational Linguistics.
  • Kolve et al. (2017) Eric Kolve, Roozbeh Mottaghi, Daniel Gordon, Yuke Zhu, Abhinav Gupta, and Ali Farhadi. 2017. AI2-THOR: An Interactive 3D Environment for Visual AI. arXiv.
  • Lumelsky and Stepanov (1987) Vladimir J. Lumelsky and Alexander A. Stepanov. 1987. Path-planning strategies for a point mobile automaton moving amidst unknown obstacles of arbitrary shape. Algorithmica, pages 403–430.
  • MacMahon et al. (2006) Matt MacMahon, Brian Stankiewicz, and Benjamin Kuipers. 2006. Walk the talk: Connecting language, knowledge, and action in route instructions. In Proceedings of the 21st National Conference on Artificial Intelligence (AAAI-2006), Boston, MA, USA.
  • Mei et al. (2016) Hongyuan Mei, Mohit Bansal, and Matthew R. Walter. 2016. Listen, attend, and walk: Neural mapping of navigational instructions to action sequences. In Proceedings of AAAI.
  • Misra et al. (2018) Dipendra Misra, Andrew Bennett, Valts Blukis, Eyvind Niklasson, Max Shatkhin, and Yoav Artzi. 2018. Mapping instructions to actions in 3D environments with visual goal prediction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.
  • O’Kane and LaValle (2006) Jason M O’Kane and Steven M. LaValle. 2006. On comparing the power of mobile robots. In Robotics: Science and Systems.
  • Poliak et al. (2018) Adam Poliak, Jason Naradowsky, Aparajita Haldar, Rachel Rudinger, and Benjamin Van Durme. 2018. Hypothesis Only Baselines in Natural Language Inference. In Joint Conference on Lexical and Computational Semantics (StarSem).
  • Taylor and LaValle (2009) Kamilah Taylor and Steven M. LaValle. 2009. I-bug: An intensity-based bug algorithm. In IEEE International Conference on Robotics and Automation.
  • Wu et al. (2018) Yi Wu, Yuxin Wu, Georgia Gkioxari, and Yuandong Tian. 2018. Building generalizable agents with a realistic and rich 3d environment. arXiv preprint arXiv:1801.02209.
  • Zellers et al. (2018) Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. 2018. Swag: A large-scale adversarial dataset for grounded commonsense inference. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP).