CLIP-Driven Fine-grained Text-Image Person Re-identification

10/19/2022
by Shuanglin Yan, et al.

Text-image person re-identification (TIReID) aims to retrieve the image corresponding to a given text query from a pool of candidate images. Existing methods employ prior knowledge from single-modality pre-training to facilitate learning, but lack multi-modal correspondences. Moreover, due to the substantial gap between modalities, existing methods embed the original modal features into a shared latent space for cross-modal alignment; however, such feature embedding may distort intra-modal information. Recently, CLIP has attracted extensive attention for its powerful semantic concept learning capacity and rich multi-modal knowledge, which can help address the above problems. Accordingly, in this paper we propose a CLIP-driven Fine-grained information excavation framework (CFine) that fully exploits CLIP's knowledge for TIReID. To transfer the multi-modal knowledge effectively, we perform fine-grained information excavation to mine intra-modal discriminative clues and inter-modal correspondences. Specifically, we first design a multi-grained global feature learning module that fully mines intra-modal discriminative local information; it emphasizes identity-relevant discriminative clues by strengthening the interactions between the global image (text) representation and informative local patches (words). Second, cross-grained feature refinement (CFR) and fine-grained correspondence discovery (FCD) modules establish cross-grained and fine-grained interactions between modalities, filtering out non-modality-shared image patches/words and mining cross-modal correspondences from coarse to fine. CFR and FCD are removed during inference to save computational cost. Note that the entire process operates in the original modality space without further feature embedding. Extensive experiments on multiple benchmarks demonstrate the superior performance of our method on TIReID.
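As a concrete illustration of the multi-grained interaction described above, the sketch below enriches a global CLIP feature with its top-k most informative local tokens (patches or words) via cross-attention. This is a minimal sketch under our own assumptions: the module name `TopKGlobalFusion`, the choice of `k`, and the use of PyTorch's built-in multi-head attention are illustrative and do not come from the authors' released code.

```python
# Minimal sketch (not the authors' code): fuse a global CLIP token with its
# top-k most similar local tokens, mimicking the "interaction between global
# feature and informative local patches/words" idea from the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKGlobalFusion(nn.Module):
    """Refine a global token using its top-k most similar local tokens."""
    def __init__(self, dim: int, k: int = 8, num_heads: int = 8):
        super().__init__()
        self.k = k  # number of informative local tokens to keep (assumed)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, global_tok: torch.Tensor, local_toks: torch.Tensor):
        # global_tok: (B, D); local_toks: (B, N, D) patch or word tokens
        sim = F.cosine_similarity(local_toks, global_tok.unsqueeze(1), dim=-1)  # (B, N)
        idx = sim.topk(self.k, dim=1).indices                                    # (B, k)
        picked = torch.gather(
            local_toks, 1, idx.unsqueeze(-1).expand(-1, -1, local_toks.size(-1))
        )                                                  # (B, k, D) informative locals
        q = global_tok.unsqueeze(1)                        # query with the global token
        fused, _ = self.attn(q, picked, picked)            # attend over selected locals
        return self.norm(global_tok + fused.squeeze(1))    # residual refinement, same space

# Toy usage with CLIP ViT-B/32-like shapes: 49 image patches, 77 text tokens.
if __name__ == "__main__":
    B, D = 2, 512
    img_fuse, txt_fuse = TopKGlobalFusion(D), TopKGlobalFusion(D)
    v = img_fuse(torch.randn(B, D), torch.randn(B, 49, D))
    t = txt_fuse(torch.randn(B, D), torch.randn(B, 77, D))
    print(F.cosine_similarity(v, t, dim=-1))  # cross-modal matching score
```

Note that the sketch keeps all features at the CLIP dimension and refines the global token with a residual connection rather than projecting into a new latent space, consistent with the paper's claim of operating in the original modality space without further feature embedding.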


