FILIP: Fine-grained Interactive Language-Image Pre-Training

11/09/2021
by Lewei Yao, et al.

Unsupervised large-scale vision-language pre-training has shown promising advances on various downstream tasks. Existing methods typically model the cross-modal interaction either via the similarity of the global features of each modality, which loses fine-grained information, or via finer-grained interactions using cross/self-attention over visual and textual tokens; cross/self-attention, however, is inefficient in both training and inference. In this paper, we introduce a large-scale Fine-grained Interactive Language-Image Pre-training (FILIP) framework that achieves finer-level alignment through a cross-modal late interaction mechanism, which uses the token-wise maximum similarity between visual and textual tokens to guide the contrastive objective. FILIP leverages the finer-grained expressiveness between image patches and textual words by modifying only the contrastive loss, while retaining the ability to pre-compute image and text representations offline at inference, keeping both large-scale training and inference efficient. Furthermore, we construct a new large-scale image-text pair dataset called FILIP300M for pre-training. Experiments show that FILIP achieves state-of-the-art performance on multiple downstream vision-language tasks, including zero-shot image classification and image-text retrieval. Visualization of word-patch alignment further shows that FILIP learns meaningful fine-grained features with promising localization ability.
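A minimal sketch of the token-wise maximum-similarity late interaction described in the abstract (the function name, array shapes, and NumPy framing are illustrative assumptions, not the paper's code): each image patch is matched to its most similar word token, each word to its most similar patch, and the averaged maxima serve as the image-text similarity that guides the contrastive objective.

```python
import numpy as np

def late_interaction_similarity(image_tokens: np.ndarray, text_tokens: np.ndarray):
    """Cross-modal late interaction via token-wise maximum similarity.

    image_tokens: (n_patches, d) L2-normalized patch embeddings
    text_tokens:  (n_words, d)   L2-normalized word embeddings
    Returns (image-to-text, text-to-image) similarity scores.
    """
    # Cosine similarity between every patch and every word token.
    sim = image_tokens @ text_tokens.T          # (n_patches, n_words)

    # Each patch is matched to its single most similar word, then averaged.
    image_to_text = sim.max(axis=1).mean()

    # Each word is matched to its single most similar patch, then averaged.
    text_to_image = sim.max(axis=0).mean()

    return image_to_text, text_to_image
```

Because this score depends only on token embeddings produced independently by each encoder, image and text representations can still be pre-computed offline, and only the cheap max-then-mean step runs at inference; this is what keeps the scheme efficient compared with cross/self-attention over concatenated tokens.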

Related research

08/04/2022
Fine-Grained Semantically Aligned Vision-Language Pre-Training
Large-scale vision-language pre-training has shown impressive advances i...

09/28/2022
TokenFlow: Rethinking Fine-grained Cross-modal Alignment in Vision-Language Retrieval
Most existing methods in vision-language retrieval match two modalities ...

03/08/2023
Exploiting the Textual Potential from Vision-Language Pre-training for Text-based Person Search
Text-based Person Search (TPS), is targeted on retrieving pedestrians to...

03/27/2023
Revisiting Multimodal Representation in Contrastive Learning: From Patch and Token Embeddings to Finite Discrete Tokens
Contrastive learning-based vision-language pre-training approaches, such...

05/27/2023
PuMer: Pruning and Merging Tokens for Efficient Vision Language Models
Large-scale vision language (VL) models use Transformers to perform cros...

08/24/2023
Grounded Entity-Landmark Adaptive Pre-training for Vision-and-Language Navigation
Cross-modal alignment is one key challenge for Vision-and-Language Navig...

09/06/2022
Language-aware Domain Generalization Network for Cross-Scene Hyperspectral Image Classification
Text information including extensive prior knowledge about land cover cl...
