Structured Multi-modal Feature Embedding and Alignment for Image-Sentence Retrieval

08/05/2021
by Xuri Ge, et al.

Current state-of-the-art image-sentence retrieval methods implicitly align visual-textual fragments, such as regions in images and words in sentences, and adopt attention modules to highlight the relevance of cross-modal semantic correspondences. However, retrieval performance remains unsatisfactory due to a lack of consistent representation in both the semantic and structural spaces. In this work, we address this issue from two aspects: (i) constructing the intrinsic structure (along with relations) among the fragments of each modality, e.g., the semantic structure "dog → play → ball" for an image, and (ii) seeking explicit inter-modal structural and semantic correspondence between the visual and textual modalities. To this end, we propose a novel Structured Multi-modal Feature Embedding and Alignment (SMFEA) model for image-sentence retrieval. To jointly and explicitly learn the visual-textual embedding and the cross-modal alignment, SMFEA introduces a multi-modal structured module with a shared context-aware referral tree. In particular, the relations of the visual and textual fragments are modeled by constructing a Visual Context-aware Structured Tree encoder (VCS-Tree) and a Textual Context-aware Structured Tree encoder (TCS-Tree) with shared labels, from which visual and textual features can be jointly learned and optimized. We utilize the multi-modal tree structure to explicitly align the heterogeneous image-sentence data by maximizing the semantic and structural similarity between corresponding inter-modal tree nodes. Extensive experiments on the Microsoft COCO and Flickr30K benchmarks demonstrate the superiority of the proposed model over state-of-the-art methods.
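The abstract does not spell out the alignment objective, but the node-wise alignment it describes can be sketched as follows. Below is a minimal PyTorch sketch, assuming the VCS-Tree and TCS-Tree encoders produce fixed-size node embeddings and that alignment maximizes the cosine similarity between corresponding inter-modal tree nodes; the function name node_alignment_loss and the tensor shapes are hypothetical illustrations, not taken from the paper.

    import torch
    import torch.nn.functional as F

    def node_alignment_loss(visual_nodes: torch.Tensor, textual_nodes: torch.Tensor) -> torch.Tensor:
        """Hypothetical node-wise alignment objective.

        visual_nodes, textual_nodes: (batch, num_nodes, dim) embeddings of the
        VCS-Tree and TCS-Tree nodes. Because the trees share node labels, the
        k-th visual node is assumed to correspond to the k-th textual node.
        """
        v = F.normalize(visual_nodes, dim=-1)
        t = F.normalize(textual_nodes, dim=-1)
        # Cosine similarity between corresponding inter-modal tree nodes;
        # maximizing it is equivalent to minimizing (1 - similarity).
        node_sim = (v * t).sum(dim=-1)  # shape: (batch, num_nodes)
        return (1.0 - node_sim).mean()

    # Toy usage with random tensors standing in for tree-encoder outputs,
    # e.g. three nodes per tree for a triplet like "dog -> play -> ball".
    visual_nodes = torch.randn(4, 3, 256)
    textual_nodes = torch.randn(4, 3, 256)
    loss = node_alignment_loss(visual_nodes, textual_nodes)

In practice such a node-level term would be combined with a global image-sentence ranking loss, but that combination is likewise an assumption here rather than a detail given in the abstract.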
