Rethinking Benchmarks for Cross-modal Image-text Retrieval

04/21/2023
by   Weijing Chen, et al.

Image-text retrieval, a fundamental and important branch of information retrieval, has attracted extensive research attention. The main challenge of this task is cross-modal semantic understanding and matching, and some recent works focus on fine-grained cross-modal semantic matching in particular. With the prevalence of large-scale multimodal pretraining models, several state-of-the-art models (e.g. X-VLM) have achieved near-perfect performance on the widely used image-text retrieval benchmarks MSCOCO-Test-5K and Flickr30K-Test-1K. In this paper, we review these two common benchmarks and observe that they are insufficient to assess the true capability of models on fine-grained cross-modal semantic matching, because a large number of the images and texts in the benchmarks are coarse-grained. Based on this observation, we renovate the coarse-grained images and texts in the old benchmarks and establish improved benchmarks called MSCOCO-FG and Flickr30K-FG. Specifically, on the image side, we enlarge the original image pool by adding more similar images. On the text side, we propose a novel semi-automatic renovation approach that refines coarse-grained sentences into finer-grained ones with little human effort. Furthermore, we evaluate representative image-text retrieval models on our new benchmarks to demonstrate the effectiveness of our method. We also analyze the capability of models on fine-grained semantic comprehension through extensive experiments. The results show that even state-of-the-art models have much room for improvement in fine-grained semantic understanding, especially in distinguishing attributes of close objects in images. Our code and improved benchmark datasets are publicly available at https://github.com/cwj1412/MSCOCO-Flikcr30K_FG, which we hope will inspire further in-depth research on cross-modal retrieval.
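To make the image-side idea concrete, the following is a minimal sketch of why enlarging the image pool with similar images can expose weak fine-grained matching. It assumes retrieval is scored with Recall@K, which the abstract does not state; the function name `recall_at_k` and all similarity scores are hypothetical and are not taken from the paper or its repository.

```python
def recall_at_k(similarity, gt_index, k):
    """Fraction of text queries whose ground-truth image ranks in the top k.

    similarity[i][j] is a model's score for text i against image j;
    gt_index[i] is the column of the image that text i describes.
    (Recall@K is assumed here as the metric; it is not named in the abstract.)
    """
    hits = 0
    for scores, gt in zip(similarity, gt_index):
        # Rank image columns by descending score and keep the top k.
        top_k = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]
        hits += gt in top_k
    return hits / len(similarity)


# Made-up scores: 2 text queries against an original pool of 2 images.
original = [[0.9, 0.2],
            [0.3, 0.8]]
gt = [0, 1]

# Enlarged pool: one extra, similar image (column 2). A model with weak
# fine-grained matching scores it above the true match for query 0.
enlarged = [[0.9, 0.2, 0.95],
            [0.3, 0.8, 0.4]]

r1_original = recall_at_k(original, gt, k=1)
r1_enlarged = recall_at_k(enlarged, gt, k=1)
print(r1_original, r1_enlarged)
```

In this toy setting the coarse pool lets the model look perfect, while a single similar distractor is enough to lower the score, which is the kind of gap the enlarged pool is meant to reveal.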


