
-
Task-Adaptive Negative Class Envision for Few-Shot Open-Set Recognition
Recent works seek to endow recognition systems with the ability to handl...
read it
-
Open-Vocabulary Object Detection Using Captions
Despite the remarkable accuracy of deep neural networks in object detect...
read it
-
Neuro-Symbolic Representations for Video Captioning: A Case for Leveraging Inductive Biases for Vision and Language
Neuro-symbolic representations have proved effective in learning structu...
read it
-
Weakly-supervised VisualBERT: Pre-training without Parallel Images and Captions
Pre-trained contextual vision-and-language (V L) models have brought i...
read it
-
Uncertainty-Aware Few-Shot Image Classification
Few-shot image classification aims to learn to recognize new categories ...
read it
-
Ref-NMS: Breaking Proposal Bottlenecks in Two-Stage Referring Expression Grounding
The prevailing framework for solving referring expression grounding is b...
read it
-
COVID-19 Literature Knowledge Graph Construction and Drug Repurposing Report Generation
To combat COVID-19, both clinicians and scientists need to digest the va...
read it
-
Learning Visual Commonsense for Robust Scene Graph Generation
Scene graph generation models understand the scene through object and pr...
read it
-
Deep Learning Guided Building Reconstruction from Satellite Imagery-derived Point Clouds
3D urban reconstruction of buildings from remotely sensed imagery has dr...
read it
-
Cross-media Structured Common Space for Multimedia Event Extraction
We introduce a new task, MultiMedia Event Extraction (M2E2), which aims ...
read it
-
Unifying Specialist Image Embedding into Universal Image Embedding
Deep image embedding provides a way to measure the semantic similarity o...
read it
-
Training with Streaming Annotation
In this paper, we address a practical scenario where training data is re...
read it
-
Weakly Supervised Visual Semantic Parsing
Scene Graph Generation (SGG) aims to extract entities, predicates and th...
read it
-
Bridging Knowledge Graphs to Generate Scene Graphs
Scene graphs are powerful representations that encode images into their ...
read it
-
General Partial Label Learning via Dual Bipartite Graph Autoencoder
We formulate a practical yet challenging problem: General Partial Label ...
read it
-
Flow-Distilled IP Two-Stream Networks for Compressed Video Action Recognition
Two-stream networks have achieved great success in video recognition. A ...
read it
-
Flow-Distilled IP Two-Stream Networks for Compressed Video ActionRecognition
Two-stream networks have achieved great success in video recognition. A ...
read it
-
Learning to Learn Words from Narrated Video
When we travel, we often encounter new scenarios we have never experienc...
read it
-
LPAT: Learning to Predict Adaptive Threshold for Weakly-supervised Temporal Action Localization
Recently, Weakly-supervised Temporal Action Localization (WTAL) has been...
read it
-
Context-Gated Convolution
As the basic building block of Convolutional Neural Networks (CNNs), the...
read it
-
Report of 2017 NSF Workshop on Multimedia Challenges, Opportunities and Research Roadmaps
With the transformative technologies and the rapidly changing global R ...
read it
-
Detecting and Simulating Artifacts in GAN Fake Images
To detect GAN generated images, conventional supervised machine learning...
read it
-
Variational Context: Exploiting Visual and Textual Context for Grounding Referring Expressions
We focus on grounding (i.e., localizing or linking) referring expression...
read it
-
CDSA: Cross-Dimensional Self-Attention for Multivariate, Geo-tagged Time Series Imputation
Many real-world applications involve multivariate, geo-tagged time serie...
read it
-
Unsupervised Embedding Learning via Invariant and Spreading Instance Feature
This paper studies the unsupervised embedding learning problem, which re...
read it
-
Unsupervised Rank-Preserving Hashing for Large-Scale Image Retrieval
We propose an unsupervised hashing method which aims to produce binary c...
read it
-
DMC-Net: Generating Discriminative Motion Cues for Fast Compressed Video Action Recognition
Motion has shown to be useful for video understanding, where motion is t...
read it
-
Scene Dynamics: Counterfactual Critic Multi-Agent Training for Scene Graph Generation
Scene graphs -- objects as nodes and visual relationships as edges -- de...
read it
-
Multi-level Multimodal Common Semantic Space for Image-Phrase Grounding
We address the problem of phrase grounding by learning a multi-level com...
read it
-
Multi-granularity Generator for Temporal Action Proposal
Temporal action proposal generation is an important task, aiming to loca...
read it
-
Low-shot Learning via Covariance-Preserving Adversarial Augmentation Networks
Deep neural networks suffer from over-fitting and catastrophic forgettin...
read it
-
Heated-Up Softmax Embedding
Metric learning aims at learning a distance which is consistent with the...
read it
-
Multimodal Social Media Analysis for Gang Violence Prevention
Gang violence is a severe issue in major cities across the U.S. and rece...
read it
-
AutoLoc: Weakly-supervised Temporal Action Localization
Temporal Action Localization (TAL) in untrimmed video is important for m...
read it
-
Entity-aware Image Caption Generation
Image captioning approaches currently generate descriptions which lack s...
read it
-
Online Action Detection in Untrimmed, Streaming Videos - Modeling and Evaluation
The goal of Online Action Detection (OAD) is to detect action in a timel...
read it
-
Zero-Shot Visual Recognition using Semantics-Preserving Adversarial Embedding Network
We propose a novel framework called Semantics-Preserving Adversarial Emb...
read it
-
Grounding Referring Expressions in Images by Variational Context
We focus on grounding (i.e., localizing or linking) referring expression...
read it
-
Multi-Modal Multi-Scale Deep Learning for Large-Scale Image Annotation
Large-scale image annotation is a challenging task in image content anal...
read it
-
Skip RNN: Learning to Skip State Updates in Recurrent Neural Networks
Recurrent Neural Networks (RNNs) continue to show outstanding performanc...
read it
-
More cat than cute? Interpretable Prediction of Adjective-Noun Pairs
The increasing availability of affect-rich multimedia resources has bols...
read it
-
PPR-FCN: Weakly Supervised Visual Relation Detection via Parallel Pairwise R-FCN
We aim to tackle a novel vision task called Weakly Supervised Visual Rel...
read it
-
Localizing Actions from Video Labels and Pseudo-Annotations
The goal of this paper is to determine the spatio-temporal location of a...
read it
-
Modeling Multimodal Clues in a Hybrid Deep Learning Framework for Video Classification
Videos are inherently multimodal. This paper studies the problem of how ...
read it
-
PatternNet: Visual Pattern Mining with Deep Neural Network
Visual patterns represent the discernible regularity in the visual world...
read it
-
CDC: Convolutional-De-Convolutional Networks for Precise Temporal Action Localization in Untrimmed Videos
Temporal action localization is an important yet challenging problem. Gi...
read it
-
Visual Translation Embedding Network for Visual Relation Detection
Visual relations, such as "person ride bike" and "bike next to car", off...
read it
-
Model-Driven Feed-Forward Prediction for Manipulation of Deformable Objects
Robotic manipulation of deformable objects is a difficult problem especi...
read it
-
Deep Image Set Hashing
In applications involving matching of image sets, the information from m...
read it
-
Multilingual Visual Sentiment Concept Matching
The impact of culture in visual emotion perception has recently captured...
read it