
-
CLASTER: Clustering with Reinforcement Learning for Zero-Shot Action Recognition
Zero-shot action recognition is the task of recognizing action classes w...
-
SMART Frame Selection for Action Recognition
Action recognition is computationally expensive. In this paper, we addre...
-
Remembering for the Right Reasons: Explanations Reduce Catastrophic Forgetting
The goal of continual learning (CL) is to learn a sequence of tasks with...
-
TextCaps: a Dataset for Image Captioning with Reading Comprehension
Image descriptions can help visually impaired people to quickly understa...
-
Adversarial Continual Learning
Continual learning aims to learn new tasks without forgetting previously...
-
In Defense of Grid Features for Visual Question Answering
Popularized as 'bottom-up' attention, bounding box (or region) based vis...
-
12-in-1: Multi-Task Vision and Language Representation Learning
Much of vision-and-language research focuses on a small but diverse set ...
-
Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA
Many visual scenes contain text that carries crucial information, and it...
-
Decoupling Representation and Classifier for Long-Tailed Recognition
The long-tail distribution of the visual world poses great challenges fo...
-
Uncertainty-guided Continual Learning with Bayesian Neural Networks
Continual learning aims to learn new tasks without forgetting previously...
-
Learning to Generate Grounded Image Captions without Localization Supervision
When generating a sentence description for an image, it frequently remai...
-
Towards VQA Models that can Read
Studies have shown that a dominant class of questions asked by visually ...
-
Drop an Octave: Reducing Spatial Redundancy in Convolutional Neural Networks with Octave Convolution
In natural images, information is conveyed at different frequencies wher...
-
CLEVR-Dialog: A Diagnostic Dataset for Multi-Round Reasoning in Visual Dialog
Visual Dialog is a multimodal task of answering a sequence of questions ...
-
Continual Learning with Tiny Episodic Memories
Learning with less supervision is a major challenge in artificial intell...
-
Probabilistic Neural-symbolic Models for Interpretable Visual Question Answering
We propose a new class of probabilistic neural-symbolic models, that hav...
-
Cycle-Consistency for Robust Visual Question Answering
Despite significant progress in Visual Question Answering over the years...
-
DMC-Net: Generating Discriminative Motion Cues for Fast Compressed Video Action Recognition
Motion has been shown to be useful for video understanding, where motion is t...
-
Exploring the Challenges towards Lifelong Fact Learning
So far life-long learning (LLL) has been studied in relatively small-sca...
-
Grounded Video Description
Video description is one of the most challenging problems in vision and ...
-
Adversarial Inference for Multi-Sentence Video Description
While significant progress has been made in the image captioning task, v...
-
Efficient Lifelong Learning with A-GEM
In lifelong learning, the learner is presented with a sequence of tasks,...
-
Graph-Based Global Reasoning Networks
Globally modeling and reasoning over relations between regions can be be...
-
Visual Coreference Resolution in Visual Dialog using Neural Module Networks
Visual dialog entails answering a series of questions grounded in an ima...
-
Pythia v0.1: the Winning Entry to the VQA Challenge 2018
This document describes Pythia v0.1, the winning entry from Facebook AI ...
-
Selfless Sequential Learning
Sequential learning studies the problem of learning tasks in a sequence ...
-
Large-Scale Visual Relationship Understanding
Large scale visual understanding is challenging, as it requires a model ...
-
Multimodal Explanations: Justifying Decisions and Pointing to the Evidence
Deep models that are both effective and explainable are desirable in man...
-
Memory Aware Synapses: Learning what (not) to forget
Humans can learn in a continuous manner. Old rarely utilized knowledge c...
-
Attentive Explanations: Justifying Decisions and Pointing to the Evidence (Extended Abstract)
Deep models are the de facto standard in visual decision problems due to ...
-
Generating Descriptions with Grounded and Co-Referenced People
Learning how to generate descriptions of images or videos received major...
-
Speaking the Same Language: Matching Machine to Human Captions by Adversarial Training
While strong progress has been made in image captioning over the last ye...
-
Attentive Explanations: Justifying Decisions and Pointing to the Evidence
Deep models are the de facto standard in visual decision models due to th...
-
Modeling Relationships in Referential Expressions with Compositional Modular Networks
People often refer to entities in an image in terms of their relationshi...
-
Utilizing Large Scale Vision and Text Datasets for Image Segmentation from Referring Expressions
Image segmentation from referring expressions is a joint vision and lang...
-
Captioning Images with Diverse Objects
Recent captioning models are limited in their ability to scale and descr...
-
Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding
Modeling textual or visual information with vector representations train...
-
Movie Description
Audio Description (AD) provides linguistic descriptions of movies and al...
-
Ask Your Neurons: A Deep Learning Approach to Visual Question Answering
We address a question answering task on real-world images that is set up...
-
Attributes as Semantic Units between Natural Language and Visual Recognition
Impressive progress has been made in the fields of computer vision and n...
-
Generating Visual Explanations
Clearly explaining a rationale for a classification decision to an end-u...
-
Segmentation from Natural Language Expressions
In this paper we approach the novel problem of segmenting an image based...
-
Learning to Compose Neural Networks for Question Answering
We describe a question answering model that applies to both images and s...
-
Deep Compositional Captioning: Describing Novel Object Categories without Paired Training Data
While recent deep neural network models have achieved promising results ...
-
Natural Language Object Retrieval
In this paper, we address the task of natural language object retrieval,...
-
Grounding of Textual Phrases in Images by Reconstruction
Grounding (i.e. localizing) arbitrary, free-form textual phrases in visu...
-
Neural Module Networks
Visual question answering is fundamentally compositional in nature---a q...
-
Spatial Semantic Regularisation for Large Scale Object Detection
Large scale object detection with thousands of classes introduces the pr...
-
The Long-Short Story of Movie Description
Generating descriptions for videos has many applications including assis...
-
Ask Your Neurons: A Neural-based Approach to Answering Questions about Images
We address a question answering task on real-world images that is set up...