Revising Image-Text Retrieval via Multi-Modal Entailment

08/22/2022
by Xu Yan, et al.

An outstanding image-text retrieval model depends on high-quality labeled data. While the builders of existing image-text retrieval datasets strive to ensure that each caption matches its linked image, they cannot prevent a caption from also fitting other images. We observe that this many-to-many matching phenomenon is quite common in widely used retrieval datasets, where a single caption can describe up to 178 images. This large amount of matching-lost data not only confuses the model during training but also weakens evaluation accuracy. Inspired by visual and textual entailment tasks, we propose a multi-modal entailment classifier that determines whether a sentence is entailed by an image plus its linked captions. We then revise the image-text retrieval datasets by adding these entailed captions as additional weak labels of an image, and we develop a universal variable learning rate strategy that teaches a retrieval model to distinguish the entailed captions from other negative samples. In experiments, we manually annotate an entailment-corrected image-text retrieval dataset for evaluation. The results demonstrate that the proposed entailment classifier achieves about 78% accuracy and consistently improves the performance of image-text retrieval baselines.
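
The dataset revision described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the variable learning rate is computed, so the constant down-scaling factor for weak labels is an assumption, and the names `Pair`, `revise_labels`, `learning_rate`, `base_lr`, and `weak_scale` are hypothetical. The entailment classifier is also assumed to have already produced, per image, the list of captions it judges entailed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pair:
    image_id: str
    caption: str
    weak: bool  # True if the caption was added as an entailed (weak) label


def revise_labels(gold, entailed):
    """Merge original captions with entailed captions as weak labels.

    gold:     dict mapping image_id -> list of originally linked captions
    entailed: dict mapping image_id -> list of captions judged entailed
              (assumed output of an entailment classifier)
    """
    pairs = []
    for image_id, captions in gold.items():
        # Originally linked captions stay as strong positives.
        pairs.extend(Pair(image_id, c, weak=False) for c in captions)
        seen = set(captions)
        for c in entailed.get(image_id, []):
            # Skip captions that are already linked or already added.
            if c not in seen:
                pairs.append(Pair(image_id, c, weak=True))
                seen.add(c)
    return pairs


def learning_rate(pair, base_lr=1e-4, weak_scale=0.1):
    """Assumed rule: weak labels use a smaller learning rate than gold ones."""
    return base_lr * weak_scale if pair.weak else base_lr


gold = {"img1": ["a dog runs"], "img2": ["a dog plays"]}
entailed = {"img1": ["a dog plays", "a dog runs"]}
pairs = revise_labels(gold, entailed)
```

In this toy input, the caption linked to `img2` is also attached to `img1` as a weak label, so a training loop could treat it neither as a full positive nor as a negative.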


research
05/09/2021

Passage Retrieval for Outside-Knowledge Visual Question Answering

In this work, we address multi-modal information needs that contain text...
research
12/03/2022

Named Entity and Relation Extraction with Multi-Modal Retrieval

Multi-modal named entity recognition (NER) and relation extraction (RE) ...
research
06/01/2021

Towards Efficient Cross-Modal Visual Textual Retrieval using Transformer-Encoder Deep Features

Cross-modal retrieval is an important functionality in modern search eng...
research
04/16/2021

Cross-Modal Retrieval Augmentation for Multi-Modal Classification

Recent advances in using retrieval components over external knowledge so...
research
10/23/2018

How to Read Paintings: Semantic Art Understanding with Multi-Modal Retrieval

Automatic art analysis has been mostly focused on classifying artworks i...
research
05/16/2016

Joint Learning of Sentence Embeddings for Relevance and Entailment

We consider the problem of Recognizing Textual Entailment within an Info...
research
12/13/2022

CREPE: Can Vision-Language Foundation Models Reason Compositionally?

A fundamental characteristic common to both human vision and natural lan...
