
- Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling
  The canonical approach to video-and-language learning (e.g., video quest...
- A Closer Look at the Robustness of Vision-and-Language Pre-trained Models
  Large-scale pre-trained multimodal transformers, such as ViLBERT and UNI...
- Graph Optimal Transport for Cross-Domain Alignment
  Cross-domain alignment between two sets of entities (e.g., objects in an...
- Large-Scale Adversarial Training for Vision-and-Language Representation Learning
  We present VILLA, the first known effort on large-scale adversarial trai...
- HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training
  We present HERO, a Hierarchical EncodeR for Omni-representation learning...
- Meta Module Network for Compositional Visual Reasoning
  There are two main lines of research on visual reasoning: neural module ...
- UNITER: Learning UNiversal Image-TExt Representations
  Joint image-text embedding is the bedrock for most Vision-and-Language (...
- Relation-aware Graph Attention Network for Visual Question Answering
  In order to answer semantically-complicated questions about an image, a ...
- Multi-step Reasoning via Recurrent Dual Attention for Visual Dialog
  This paper presents Recurrent Dual Attention Network (ReDAN) for visual ...
- Learning to see people like people
  Humans make complex inferences on faces, ranging from objective properti...