TokenFlow: Rethinking Fine-grained Cross-modal Alignment in Vision-Language Retrieval

09/28/2022
by   Xiaohan Zou, et al.
0

Most existing methods in vision-language retrieval match two modalities by either comparing their global feature vectors which misses sufficient information and lacks interpretability, detecting objects in images or videos and aligning the text with fine-grained features which relies on complicated model designs, or modeling fine-grained interaction via cross-attention upon visual and textual tokens which suffers from inferior efficiency. To address these limitations, some recent works simply aggregate the token-wise similarities to achieve fine-grained alignment, but they lack intuitive explanations as well as neglect the relationships between token-level features and global representations with high-level semantics. In this work, we rethink fine-grained cross-modal alignment and devise a new model-agnostic formulation for it. We additionally demystify the recent popular works and subsume them into our scheme. Furthermore, inspired by optimal transport theory, we introduce TokenFlow, an instantiation of the proposed scheme. By modifying only the similarity function, the performance of our method is comparable to the SoTA algorithms with heavy model designs on major video-text retrieval benchmarks. The visualization further indicates that TokenFlow successfully leverages the fine-grained information and achieves better interpretability.

READ FULL TEXT

page 1

page 9

research
11/09/2021

FILIP: Fine-grained Interactive Language-Image Pre-Training

Unsupervised large-scale vision-language pre-training has shown promisin...
research
03/01/2020

Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning

Cross-modal retrieval between videos and texts has attracted growing att...
research
06/19/2023

WiCo: Win-win Cooperation of Bottom-up and Top-down Referring Image Segmentation

The top-down and bottom-up methods are two mainstreams of referring segm...
research
07/29/2022

Paired Cross-Modal Data Augmentation for Fine-Grained Image-to-Text Retrieval

This paper investigates an open research problem of generating text-imag...
research
03/10/2022

StyleBabel: Artistic Style Tagging and Captioning

We present StyleBabel, a unique open access dataset of natural language ...
research
08/14/2020

Weakly supervised cross-domain alignment with optimal transport

Cross-domain alignment between image objects and text sequences is key t...
research
04/12/2023

TextANIMAR: Text-based 3D Animal Fine-Grained Retrieval

3D object retrieval is an important yet challenging task, which has draw...

Please sign up or login with your details

Forgot password? Click here to reset