Fine-Grained Semantically Aligned Vision-Language Pre-Training

08/04/2022
by   Juncheng Li, et al.
2

Large-scale vision-language pre-training has shown impressive advances in a wide range of downstream tasks. Existing methods mainly model the cross-modal alignment by the similarity of the global representations of images and texts, or advanced cross-modal attention upon image and text features. However, they fail to explicitly learn the fine-grained semantic alignment between visual regions and textual phrases, as only global image-text alignment information is available. In this paper, we introduce LOUPE, a fine-grained semantically aLigned visiOn-langUage PrE-training framework, which learns fine-grained semantic alignment from the novel perspective of game-theoretic interactions. To efficiently compute the game-theoretic interactions, we further propose an uncertainty-aware neural Shapley interaction learning module. Experiments show that LOUPE achieves state-of-the-art on image-text retrieval benchmarks. Without any object-level human annotations and fine-tuning, LOUPE achieves competitive performance on object detection and visual grounding. More importantly, LOUPE opens a new promising direction of learning fine-grained semantics from large-scale raw image-text pairs.

READ FULL TEXT
research
11/09/2021

FILIP: Fine-grained Interactive Language-Image Pre-Training

Unsupervised large-scale vision-language pre-training has shown promisin...
research
12/17/2021

Align and Prompt: Video-and-Language Pre-training with Entity Prompts

Video-and-language pre-training has shown promising improvements on vari...
research
08/07/2023

COPA: Efficient Vision-Language Pre-training Through Collaborative Object- and Patch-Text Alignment

Vision-Language Pre-training (VLP) methods based on object detection enj...
research
11/16/2021

Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts

Most existing methods in vision language pre-training rely on object-cen...
research
03/09/2023

Replacement as a Self-supervision for Fine-grained Vision-language Pre-training

Fine-grained supervision based on object annotations has been widely use...
research
10/09/2022

MAMO: Masked Multimodal Modeling for Fine-Grained Vision-Language Representation Learning

Multimodal representation learning has shown promising improvements on v...
research
07/12/2022

IDEA: Increasing Text Diversity via Online Multi-Label Recognition for Vision-Language Pre-training

Vision-Language Pre-training (VLP) with large-scale image-text pairs has...

Please sign up or login with your details

Forgot password? Click here to reset