RoCOCO: Robust Benchmark MS-COCO to Stress-test Robustness of Image-Text Matching Models

by Seulki Park, et al.

Recently, large-scale vision-language pre-training models and visual semantic embedding methods have significantly improved image-text matching (ITM) accuracy on the MS COCO 5K test set. However, it is unclear how robust these state-of-the-art (SOTA) models are when used in the wild. In this paper, we propose a novel evaluation benchmark to stress-test the robustness of ITM models. To this end, we add various fooling images and captions to a retrieval pool. Specifically, we change images by inserting unrelated images, and change captions by substituting a noun, which can change the meaning of a sentence. We discover that just adding these newly created images and captions to the test set can degrade the performance (i.e., Recall@1) of a wide range of SOTA models (e.g., 81.9 VSE∞). We expect that our findings can provide insights for improving the robustness of vision-language models and for devising more diverse stress-test methods in cross-modal retrieval tasks. Source code and dataset will be available at
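
The stress test described in the abstract can be illustrated with a minimal sketch: Recall@1 is computed once on a clean caption pool and once after a fooling caption is added to the pool. Everything here is an assumption made for illustration and is not taken from the paper's code: the function names `substitute_noun` and `recall_at_1`, the toy 2-D embeddings, and the use of a dot product as the similarity score.

```python
from typing import List, Sequence, Set


def substitute_noun(caption: str, noun: str, replacement: str) -> str:
    """Build a fooling caption by swapping one noun (hypothetical helper)."""
    return " ".join(replacement if w == noun else w for w in caption.split())


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot-product similarity between two embeddings."""
    return sum(a * b for a, b in zip(u, v))


def recall_at_1(
    image_embs: List[Sequence[float]],
    caption_embs: List[Sequence[float]],
    gt_caption_ids: List[Set[int]],
) -> float:
    """Fraction of images whose top-scoring caption is a ground-truth caption."""
    hits = 0
    for img, gt in zip(image_embs, gt_caption_ids):
        scores = [dot(img, cap) for cap in caption_embs]
        best = max(range(len(scores)), key=scores.__getitem__)
        hits += best in gt
    return hits / len(image_embs)


# Toy embeddings (assumed values, not real model outputs).
images = [[1.0, 0.0], [0.0, 1.0]]
captions = [[0.9, 0.1], [0.1, 0.9]]
ground_truth = [{0}, {1}]  # caption indices that are correct for each image

# Assumed embedding of a noun-substituted caption that sits close to image 0.
fooling_caption = substitute_noun("a dog sits on a bench", "dog", "cat")
fooling_embs = [[1.0, 0.05]]

clean_r1 = recall_at_1(images, captions, ground_truth)
stressed_r1 = recall_at_1(images, captions + fooling_embs, ground_truth)
print(fooling_caption, clean_r1, stressed_r1)
```

The ground-truth sets are unchanged in the stressed run: the fooling caption only enlarges the retrieval pool, so any drop in Recall@1 comes from the model ranking the distractor above the correct caption.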



∙ 07/29/2022

ALADIN: Distilling Fine-grained Alignment Scores for Efficient Image-Text Matching and Retrieval

Image-text matching is gaining a leading role among tasks involving the ...
∙ 04/07/2022

ECCV Caption: Correcting False Negatives by Collecting Machine-and-Human-verified Image-Caption Associations for MS-COCO

Image-Text matching (ITM) is a common task for evaluating the quality of...
∙ 05/30/2023

LANCE: Stress-testing Visual Models by Generating Language-guided Counterfactual Images

We propose an automated algorithm to stress-test a trained visual model ...
∙ 08/27/2023

Towards Fast and Accurate Image-Text Retrieval with Self-Supervised Fine-Grained Alignment

Image-text retrieval requires the system to bridge the heterogeneous gap ...
∙ 11/13/2019

IStego100K: Large-scale Image Steganalysis Dataset

In order to promote the rapid development of image steganalysis technolo...
∙ 11/22/2021

L-Verse: Bidirectional Generation Between Image and Text

Far beyond learning long-range interactions of natural language, transfo...
∙ 10/05/2022

Improving Visual-Semantic Embedding with Adaptive Pooling and Optimization Objective

Visual-Semantic Embedding (VSE) aims to learn an embedding space where r...
