Towards Fast and Accurate Image-Text Retrieval with Self-Supervised Fine-Grained Alignment

08/27/2023
by Jiamin Zhuang, et al.

Image-text retrieval requires a system to bridge the heterogeneous gap between vision and language for accurate retrieval while keeping the network lightweight enough for efficient retrieval. Existing trade-off solutions mainly either incorporate cross-modal interactions into the independent-embedding framework or leverage stronger pretrained encoders, and both still demand time-consuming similarity measurement or a heavyweight model structure in the retrieval stage. In this work, we propose SelfAlign, an image-text alignment module built on top of the independent-embedding framework, which improves retrieval accuracy while maintaining retrieval efficiency, without extra supervision. SelfAlign contains two collaborative sub-modules that enforce image-text alignment at both the concept level and the context level through self-supervised contrastive learning. It does not require cross-modal embedding interactions during training, and it keeps the image and text encoders independent during retrieval. With comparable time cost, SelfAlign consistently boosts the accuracy of state-of-the-art non-pretraining independent-embedding models respectively by 9.1 Flickr30K, MS-COCO 1K and MS-COCO 5K datasets. Its retrieval accuracy also exceeds that of most existing interactive-embedding models, with an orders-of-magnitude decrease in retrieval time. The source code is available at: https://github.com/Zjamie813/SelfAlign.
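To make the independent-embedding setting concrete, the following is a minimal sketch, not the SelfAlign implementation. It assumes images and texts have already been encoded separately into fixed-size vectors, scores pairs by cosine similarity, and uses a generic symmetric InfoNCE-style contrastive loss as a stand-in for the contrastive objective. The function names, the temperature value, and the toy data are all illustrative assumptions, and the concept-level and context-level sub-modules are not modelled here.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def similarity_matrix(img_emb, txt_emb):
    """Cosine similarity between every image and every text.

    The two inputs come from separate encoders, so gallery embeddings can be
    precomputed offline and retrieval reduces to one matrix product.
    """
    return l2_normalize(img_emb) @ l2_normalize(txt_emb).T


def _log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Generic symmetric InfoNCE loss (an assumption, not the paper's exact loss).

    Row i of img_emb and row i of txt_emb are treated as the positive pair;
    all other rows in the batch act as negatives.
    """
    logits = similarity_matrix(img_emb, txt_emb) / temperature
    image_to_text = -np.mean(np.diag(_log_softmax(logits, axis=1)))
    text_to_image = -np.mean(np.diag(_log_softmax(logits, axis=0)))
    return 0.5 * (image_to_text + text_to_image)


def retrieve(query_emb, gallery_emb, k=5):
    """Return the indices of the k most similar gallery items for each query."""
    sims = similarity_matrix(query_emb, gallery_emb)
    return np.argsort(-sims, axis=1)[:, :k]


# Toy data: each "image" embedding is a slightly perturbed copy of its caption.
rng = np.random.default_rng(0)
txt = rng.normal(size=(8, 16))
img = txt + 0.01 * rng.normal(size=(8, 16))

loss_matched = symmetric_contrastive_loss(img, txt)
loss_shuffled = symmetric_contrastive_loss(img, txt[::-1])  # misaligned pairs
top1 = retrieve(img, txt, k=1)[:, 0]
```

In this toy setup the correctly paired batch yields a lower loss than the misaligned one, and image-to-text retrieval needs only a similarity lookup over precomputed embeddings, which is the efficiency property the independent-embedding framework relies on.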


