
- Vx2Text: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs
We present Vx2Text, a framework for text generation from multimodal inpu...
- Object-Centric Diagnosis of Visual Reasoning
Answering questions about an image requires not only knowing what ...
- Creative Sketch Generation
Sketching or doodling is a popular creative activity that people engage ...
- Where Are You? Localization from Embodied Dialog
We present Where Are You? (WAY), a dataset of 6k dialogs in which two h...
- Sim-to-Real Transfer for Vision-and-Language Navigation
We study the challenging problem of releasing a robot in a previously un...
- SOrT-ing VQA Models: Contrastive Gradient Learning for Improved Consistency
Recent research in Visual Question Answering (VQA) has revealed state-of...
- The Open Catalyst 2020 (OC20) Dataset and Community Challenges
Catalyst discovery and optimization is key to solving many societal and ...
- An Introduction to Electrocatalyst Design using Machine Learning for Renewable Energy Storage
Scalable and cost-effective solutions to renewable energy storage are es...
- Contrast and Classify: Alternate Training for Robust VQA
Recent Visual Question Answering (VQA) models have shown impressive perf...
- Integrating Egocentric Localization for More Realistic Point-Goal Navigation Agents
Recent work has presented embodied agents that can navigate to point-goa...
- Dialog without Dialog Data: Learning Visual Dialog Agents from VQA Data
Can we develop visually grounded dialog agents that can efficiently adap...
- Spatially Aware Multimodal Transformers for TextVQA
Textual cues are essential for everyday tasks like buying groceries and ...
- Seeing the Un-Scene: Learning Amodal Semantic Maps for Room Navigation
We introduce a learning-based approach for room navigation using semanti...
- Neuro-Symbolic Generative Art: A Preliminary Study
There are two classes of generative art approaches: neural, where a deep...
- Feel The Music: Automatically Generating A Dance For An Input Song
We present a general computational approach that enables a machine to ge...
- Exploring Crowd Co-creation Scenarios for Sketches
As a first step towards studying the ability of human crowds and machine...
- Improving Vision-and-Language Navigation with Image-Text Pairs from the Web
Following a navigation instruction such as 'Walk down the stairs and sto...
- Are we pretraining it right? Digging deeper into visio-linguistic pretraining
Numerous recent works have proposed pretraining generic visio-linguistic...
- Predicting A Creator's Preferences In, and From, Interactive Generative Art
As a lay user creates an art piece using an interactive generative art t...
- SQuINTing at VQA Models: Interrogating VQA Models with Sub-Questions
Existing VQA datasets contain questions with varying levels of complexit...
- Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline
Prior work in visual dialog has focused on training deep neural models o...
- 12-in-1: Multi-Task Vision and Language Representation Learning
Much of vision-and-language research focuses on a small but diverse set ...
- Decentralized Distributed PPO: Solving PointGoal Navigation
We present Decentralized Distributed Proximal Policy Optimization (DD-PP...
- Improving Generative Visual Dialog by Answering Diverse Questions
Prior work on training generative Visual Dialog models with reinforcemen...
- ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
We present ViLBERT (short for Vision-and-Language BERT), a model for lea...
- Unsupervised Discovery of Decision States for Transfer in Reinforcement Learning
We present a hierarchical reinforcement learning (HRL) or options framew...
- Chasing Ghosts: Instruction Following as Bayesian State Tracking
A visually-grounded navigation instruction can be interpreted as a seque...
- RUBi: Reducing Unimodal Biases in Visual Question Answering
Visual Question Answering (VQA) is the task of answering questions about...
- SplitNet: Sim2Sim and Task2Task Transfer for Embodied Visual Navigation
We propose SplitNet, a method for decoupling visual perception and polic...
- Fashion++: Minimal Edits for Outfit Improvement
Given an outfit, what small changes would most improve its fashionabilit...
- Emergence of Compositional Language with Deep Generational Transmission
Consider a collaborative task that requires communication. Two agents ar...
- Towards VQA Models that can Read
Studies have shown that a dominant class of questions asked by visually ...
- Counterfactual Visual Explanations
A counterfactual query is typically of the form 'For situation X, why wa...
- Embodied Visual Recognition
Passive visual systems typically fail to recognize objects in the amodal...
- Embodied Question Answering in Photorealistic Environments with Point Cloud Perception
To help bridge the gap between internet vision-style problems and the go...
- Habitat: A Platform for Embodied AI Research
We present Habitat, a new platform for research in embodied artificial i...
- Align2Ground: Weakly Supervised Phrase Grounding Guided by Image-Caption Alignment
We address the problem of grounding free-form textual phrases by using w...
- Trick or TReAT: Thematic Reinforcement for Artistic Typography
An approach to make text visually appealing and memorable is semantic re...
- Lemotif: Abstract Visual Depictions of your Emotional States in Life
We present Lemotif. Lemotif generates a motif for your emotional life. Y...
- CLEVR-Dialog: A Diagnostic Dataset for Multi-Round Reasoning in Visual Dialog
Visual Dialog is a multimodal task of answering a sequence of questions ...
- Learning Dynamics Model in Reinforcement Learning by Incorporating the Long Term Future
In model-based reinforcement learning, the agent interleaves between mod...
- Probabilistic Neural-symbolic Models for Interpretable Visual Question Answering
We propose a new class of probabilistic neural-symbolic models that hav...
- Cycle-Consistency for Robust Visual Question Answering
Despite significant progress in Visual Question Answering over the years...
- Taking a HINT: Leveraging Explanations to Make Vision and Language Models More Grounded
Many vision and language models suffer from poor visual grounding - ofte...
- Embodied Multimodal Multitask Learning
Recent efforts on training visual navigation agents conditioned on langu...
- Audio-Visual Scene-Aware Dialog
We introduce the task of scene-aware dialog. Given a follow-up question ...
- Response to "Visual Dialogue without Vision or Dialogue" (Massiceti et al., 2018)
In a recent workshop paper, Massiceti et al. presented a baseline model ...
- Dialog System Technology Challenge 7
This paper introduces the Seventh Dialog System Technology Challenges (D...
- nocaps: novel object captioning at scale
Image captioning models have achieved impressive results on datasets con...
- Do Explanations make VQA Models more Predictable to a Human?
A rich line of research attempts to make deep neural networks more trans...