Scene Graph Based Fusion Network For Image-Text Retrieval

03/20/2023
by   Guoliang Wang, et al.

A critical challenge in image-text retrieval is learning accurate correspondences between images and texts. Most existing methods focus on coarse-grained correspondences based on the co-occurrence of semantic objects and fail to capture fine-grained local correspondences. In this paper, we propose a novel Scene Graph based Fusion Network (dubbed SGFN), which enhances image and text features through intra- and cross-modal fusion for image-text retrieval. Specifically, we design an intra-modal hierarchical attention fusion that incorporates semantic contexts, such as objects, attributes, and relationships, into image/text feature vectors via scene graphs, and a cross-modal attention fusion that combines contextual semantics with local features via contextual vectors. Extensive experiments on the public Flickr30K and MSCOCO datasets show that SGFN outperforms a number of state-of-the-art image-text retrieval methods.
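The cross-modal attention fusion described above can be sketched as generic scaled dot-product attention, where features of one modality (e.g. image regions) attend over features of the other (e.g. text tokens) and are enhanced with the attended context. This is a minimal illustrative stand-in, not the paper's exact formulation; the function name, residual fusion step, and dimensions are assumptions for the sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention_fusion(queries, contexts):
    """Fuse one modality's features with another's via attention.

    queries:  (n_q, d) e.g. image region features
    contexts: (n_c, d) e.g. text token features
    returns:  (n_q, d) context-enhanced query features
    (Illustrative sketch only; not the paper's exact architecture.)
    """
    d = queries.shape[-1]
    scores = queries @ contexts.T / np.sqrt(d)   # (n_q, n_c) similarities
    weights = softmax(scores, axis=-1)           # attention over contexts
    attended = weights @ contexts                # (n_q, d) attended context
    return queries + attended                    # residual-style fusion

# Toy example: 3 image regions attend over 5 text tokens, d = 4.
rng = np.random.default_rng(0)
img_regions = rng.normal(size=(3, 4))
txt_tokens = rng.normal(size=(5, 4))
fused = cross_modal_attention_fusion(img_regions, txt_tokens)
```

In a full model the same fused vectors would then feed a similarity objective (e.g. a triplet ranking loss) so that matching image-text pairs score higher than mismatched ones.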


