
-
Where Are You? Localization from Embodied Dialog
We present Where Are You? (WAY), a dataset of 6k dialogs in which two h...
-
Sim-to-Real Transfer for Vision-and-Language Navigation
We study the challenging problem of releasing a robot in a previously un...
-
Language-Conditioned Imitation Learning for Robot Manipulation Tasks
Imitation learning is a popular approach for teaching motor skills to ro...
-
DeepAveragers: Offline Reinforcement Learning by Solving Derived Non-Parametric MDPs
We study an approach to offline reinforcement learning (RL) based on opt...
-
On the Sub-Layer Functionalities of Transformer Decoder
There have been significant efforts to interpret the encoder of Transfor...
-
Semantic MapNet: Building Allocentric Semantic Maps and Representations from Egocentric Views
We study the task of semantic mapping - specifically, an embodied agent ...
-
Integrating Egocentric Localization for More Realistic Point-Goal Navigation Agents
Recent work has presented embodied agents that can navigate to point-goa...
-
Dialog without Dialog Data: Learning Visual Dialog Agents from VQA Data
Can we develop visually grounded dialog agents that can efficiently adap...
-
Improving Vision-and-Language Navigation with Image-Text Pairs from the Web
Following a navigation instruction such as 'Walk down the stairs and sto...
-
Beyond the Nav-Graph: Vision-and-Language Navigation in Continuous Environments
We develop a language-guided navigation task set in a continuous 3D envi...
-
Are We Making Real Progress in Simulated Environments? Measuring the Sim2Real Gap in Embodied Visual Navigation
Does progress in simulation translate to progress in robotics? Specifica...
-
12-in-1: Multi-Task Vision and Language Representation Learning
Much of vision-and-language research focuses on a small but diverse set ...
-
Question-Conditioned Counterfactual Image Generation for VQA
While Visual Question Answering (VQA) models continue to push the state-...
-
Decentralized Distributed PPO: Solving PointGoal Navigation
We present Decentralized Distributed Proximal Policy Optimization (DD-PP...
-
Sunny and Dark Outside?! Improving Answer Consistency in VQA through Entailed Question Generation
While models for Visual Question Answering (VQA) have steadily improved ...
-
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
We present ViLBERT (short for Vision-and-Language BERT), a model for lea...
-
Chasing Ghosts: Instruction Following as Bayesian State Tracking
A visually-grounded navigation instruction can be interpreted as a seque...
-
Emergence of Compositional Language with Deep Generational Transmission
Consider a collaborative task that requires communication. Two agents ar...
-
Counterfactual Visual Explanations
A counterfactual query is typically of the form 'For situation X, why wa...
-
Embodied Question Answering in Photorealistic Environments with Point Cloud Perception
To help bridge the gap between internet vision-style problems and the go...
-
Probabilistic Neural-symbolic Models for Interpretable Visual Question Answering
We propose a new class of probabilistic neural-symbolic models, that hav...
-
Taking a HINT: Leveraging Explanations to Make Vision and Language Models More Grounded
Many vision and language models suffer from poor visual grounding - ofte...
-
EvalAI: Towards Better Evaluation Systems for AI Agents
We introduce EvalAI, an open source platform for evaluating and comparin...
-
Audio-Visual Scene-Aware Dialog
We introduce the task of scene-aware dialog. Given a follow-up question ...
-
nocaps: novel object captioning at scale
Image captioning models have achieved impressive results on datasets con...
-
Neural Modular Control for Embodied Question Answering
We present a modular approach for learning policies for navigation over ...
-
Overcoming Language Priors in Visual Question Answering with Adversarial Regularization
Modern Visual Question Answering (VQA) models have been shown to rely he...
-
Visual Curiosity: Learning to Ask Questions to Learn Visual Recognition
In an open-world setting, it is inevitable that an intelligent agent (e....
-
Choose Your Neuron: Incorporating Domain Knowledge through Neuron-Importance
Individual neurons in convolutional neural networks supervised for image...
-
Graph R-CNN for Scene Graph Generation
We propose a novel scene graph generation model called Graph R-CNN, that...
-
Learn from Your Neighbor: Learning Multi-modal Mappings from Sparse Annotations
Many structured prediction problems (particularly in vision and language...
-
Embodied Question Answering
We present a new AI task -- Embodied Question Answering (EmbodiedQA) -- ...
-
Evaluating Visual Conversational Agents via Cooperative Human-AI Games
As AI continues to advance, human-AI teams are inevitable. However, prog...
-
Natural Language Does Not Emerge 'Naturally' in Multi-Agent Dialog
A number of recent works have proposed techniques for end-to-end learnin...
-
Bidirectional Beam Search: Forward-Backward Inference in Neural Sequence Models for Fill-in-the-Blank Image Captioning
We develop the first approximate inference algorithm for 1-Best (and M-B...
-
The Promise of Premise: Harnessing Question Premises in Visual Question Answering
In this paper, we make a simple observation that questions about images ...
-
Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning
We introduce the first goal-driven training for visual question answerin...
-
Diverse Beam Search: Decoding Diverse Solutions from Neural Sequence Models
Neural sequence models are widely used to model time-series data in many...
-
Stochastic Multiple Choice Learning for Training Diverse Deep Ensembles
Many practical perception systems exist within larger processes that inc...
-
Why M Heads are Better than One: Training a Diverse Ensemble of Deep Networks
Convolutional Neural Networks have achieved state-of-the-art performance...