Constructing Phrase-level Semantic Labels to Form Multi-Grained Supervision for Image-Text Retrieval

09/12/2021
by   Zhihao Fan, et al.

Existing research on image-text retrieval mainly relies on sentence-level supervision to distinguish matched from mismatched sentences for a query image. However, semantic mismatch between an image and a sentence usually occurs at a finer granularity, i.e., the phrase level. In this paper, we explore introducing additional phrase-level supervision to better identify mismatched units in the text. In practice, multi-grained semantic labels are automatically constructed for a query image at both the sentence level and the phrase level. We construct text scene graphs for the matched sentences and extract entities and triples as the phrase-level labels. To integrate both sentence-level and phrase-level supervision, we propose the Semantic Structure Aware Multimodal Transformer (SSAMT) for multi-modal representation learning. Inside SSAMT, we use different kinds of attention mechanisms to enforce interactions among multi-grained semantic units on both the vision and language sides. For training, we propose multi-scale matching losses from both global and local perspectives and penalize mismatched phrases. Experimental results on MS-COCO and Flickr30K show the effectiveness of our approach compared to some state-of-the-art models.
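
As a rough illustration of the idea in the abstract, the sketch below flattens a text scene graph into phrase-level labels (entities and triples), finds the phrases of a candidate sentence that the matched sentence does not support, and combines a sentence-level hinge term with a phrase-level one. It is a minimal sketch under stated assumptions, not the paper's implementation: the names SceneGraph, phrase_labels, mismatched_phrases, and multi_grained_loss are hypothetical, the scene graphs are written by hand rather than produced by a parser, the similarity scores are plain numbers rather than SSAMT outputs, and the hinge form, margin, and weight are illustrative choices that the abstract does not specify.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class SceneGraph:
    """Hypothetical container for a hand-written text scene graph."""
    entities: Tuple[str, ...]
    triples: Tuple[Tuple[str, str, str], ...]  # (subject, relation, object)


def phrase_labels(graph: SceneGraph) -> List[str]:
    """Flatten a scene graph into phrase-level labels: entities, then triples."""
    return list(graph.entities) + [" ".join(t) for t in graph.triples]


def mismatched_phrases(matched: Sequence[str], candidate: Sequence[str]) -> List[str]:
    """Phrases of a candidate sentence that the matched sentence does not contain."""
    supported = set(matched)
    return [p for p in candidate if p not in supported]


def hinge(pos: float, neg: float, margin: float) -> float:
    """Penalize a negative score that comes within `margin` of the positive score."""
    return max(0.0, margin + neg - pos)


def multi_grained_loss(
    sent_pos: float,
    sent_neg: float,
    phrase_pos: Sequence[float],
    phrase_neg: Sequence[float],
    margin: float = 0.2,   # assumed value, for illustration only
    weight: float = 1.0,   # assumed balance between the two terms
) -> float:
    """Sentence-level hinge term plus the mean hinge over all phrase pairs."""
    sentence_term = hinge(sent_pos, sent_neg, margin)
    pairs = [(p, n) for p in phrase_pos for n in phrase_neg]
    if pairs:
        phrase_term = sum(hinge(p, n, margin) for p, n in pairs) / len(pairs)
    else:
        phrase_term = 0.0
    return sentence_term + weight * phrase_term


# Usage: "a man rides a horse" (matched) vs. "a man rides a bike" (mismatched).
matched_graph = SceneGraph(("man", "horse"), (("man", "rides", "horse"),))
candidate_graph = SceneGraph(("man", "bike"), (("man", "rides", "bike"),))
negatives = mismatched_phrases(phrase_labels(matched_graph), phrase_labels(candidate_graph))
print(negatives)
```

In this toy case the two sentences differ only in one entity and one triple, which is the kind of phrase-level mismatch that a sentence-level score alone may not separate clearly.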


research
04/07/2022

HunYuan_tvr for Text-Video Retrivial

Text-Video Retrieval plays an important role in multi-modal understandin...
research
12/06/2017

Learning Semantic Concepts and Order for Image and Sentence Matching

Image and sentence matching has made great progress recently, but it rem...
research
09/15/2023

Uncertainty-Aware Multi-View Visual Semantic Embedding

The key challenge in image-text retrieval is effectively leveraging sema...
research
09/16/2021

Phrase Retrieval Learns Passage Retrieval, Too

Dense retrieval methods have shown great promise over sparse retrieval m...
research
01/09/2017

Task-Specific Attentive Pooling of Phrase Alignments Contributes to Sentence Matching

This work studies comparatively two typical sentence matching tasks: tex...
research
11/05/2021

Negative Sample is Negative in Its Own Way: Tailoring Negative Sentences for Image-Text Retrieval

Matching model is essential for Image-Text Retrieval framework. Existing...
research
04/23/2016

Why and How to Pay Different Attention to Phrase Alignments of Different Intensities

This work studies comparatively two typical sentence pair classification...
