Recent work in vision-and-language demonstrates that large-scale pretrai...
To be successful, Vision-and-Language Navigation (VLN) agents must be ab...
Images are a convenient way to specify which particular object instance ...
Animal navigation research posits that organisms build and maintain inte...
We consider the problem of embodied visual navigation given an image-goa...
Recent work in Vision-and-Language Navigation (VLN) has presented two en...
A growing number of service providers are exploring methods to improve s...
Natural language instructions for visual navigation often use scene desc...
Little inquiry has explicitly addressed the role of action spaces in lan...
Multilingual Neural Machine Translation (NMT) enables one model to serve...
Neural networks are a popular tool for modeling sequential data but they...
We consider the problem of modeling the dynamics of continuous spatial-t...
We present Where Are You? (WAY), a dataset of 6k dialogs in which two h...
We study the challenging problem of releasing a robot in a previously un...
Imitation learning is a popular approach for teaching motor skills to ro...
We study an approach to offline reinforcement learning (RL) based on opt...
There have been significant efforts to interpret the encoder of Transfor...
We study the task of semantic mapping - specifically, an embodied agent ...
Recent work has presented embodied agents that can navigate to point-goa...
Can we develop visually grounded dialog agents that can efficiently adap...
Following a navigation instruction such as 'Walk down the stairs and sto...
We develop a language-guided navigation task set in a continuous 3D envi...
Does progress in simulation translate to progress in robotics? Specifica...
Much of vision-and-language research focuses on a small but diverse set ...
While Visual Question Answering (VQA) models continue to push the state-...
We present Decentralized Distributed Proximal Policy Optimization (DD-PP...
While models for Visual Question Answering (VQA) have steadily improved ...
We present ViLBERT (short for Vision-and-Language BERT), a model for lea...
A visually-grounded navigation instruction can be interpreted as a seque...
Consider a collaborative task that requires communication. Two agents ar...
A counterfactual query is typically of the form 'For situation X, why wa...
To help bridge the gap between internet vision-style problems and the go...
We propose a new class of probabilistic neural-symbolic models that hav...
Many vision and language models suffer from poor visual grounding - ofte...
We introduce EvalAI, an open source platform for evaluating and comparin...
We introduce the task of scene-aware dialog. Given a follow-up question ...
Image captioning models have achieved impressive results on datasets con...
We present a modular approach for learning policies for navigation over ...
Modern Visual Question Answering (VQA) models have been shown to rely he...
In an open-world setting, it is inevitable that an intelligent agent (e....
Individual neurons in convolutional neural networks supervised for image...
We propose a novel scene graph generation model called Graph R-CNN, that...
Many structured prediction problems (particularly in vision and language...
We present a new AI task -- Embodied Question Answering (EmbodiedQA) -- ...
As AI continues to advance, human-AI teams are inevitable. However, prog...
A number of recent works have proposed techniques for end-to-end learnin...
We develop the first approximate inference algorithm for 1-Best (and M-B...
In this paper, we make a simple observation that questions about images ...
We introduce the first goal-driven training for visual question answerin...
Neural sequence models are widely used to model time-series data in many...