
-
Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding
This paper presents a new Vision Transformer (ViT) architecture Multi-Sc...
read it
-
VinVL: Making Visual Representations Matter in Vision-Language Models
This paper presents a detailed study of improving visual representations...
read it
-
Object-Centric Diagnosis of Visual Reasoning
When answering questions about an image, it not only needs knowing what ...
read it
-
Neuro-Symbolic Representations for Video Captioning: A Case for Leveraging Inductive Biases for Vision and Language
Neuro-symbolic representations have proved effective in learning structu...
read it
-
Embodied Visual Recognition
Passive visual systems typically fail to recognize objects in the amodal...
read it
-
Visual Curiosity: Learning to Ask Questions to Learn Visual Recognition
In an open-world setting, it is inevitable that an intelligent agent (e....
read it
-
Graph R-CNN for Scene Graph Generation
We propose a novel scene graph generation model called Graph R-CNN, that...
read it
-
Neural Baby Talk
We introduce a novel framework for image captioning that can produce nat...
read it
-
Best of Both Worlds: Transferring Knowledge from Discriminative Learning to a Generative Visual Dialog Model
We present a novel training framework for neural sequence models, partic...
read it
-
LR-GAN: Layered Recursive Generative Adversarial Networks for Image Generation
We present LR-GAN: an adversarial image generation model which takes sce...
read it
-
Hierarchical Question-Image Co-Attention for Visual Question Answering
A number of recent works have proposed attention models for Visual Quest...
read it
-
Joint Unsupervised Learning of Deep Representations and Image Clusters
In this paper, we propose a recurrent framework for Joint Unsupervised L...
read it
-
Learn Convolutional Neural Network for Face Anti-Spoofing
Though having achieved some progresses, the hand-crafted texture feature...
read it