Step-Wise Hierarchical Alignment Network for Image-Text Matching

06/11/2021
by Zhong Ji, et al.

Image-text matching plays a central role in bridging the semantic gap between vision and language. The key to precise visual-semantic alignment lies in capturing the fine-grained cross-modal correspondence between image and text. Most previous methods rely on single-step reasoning to discover visual-semantic interactions, and therefore lack the ability to exploit multi-level information to locate hierarchical fine-grained relevance. In contrast, we propose a step-wise hierarchical alignment network (SHAN) that decomposes image-text matching into a multi-step cross-modal reasoning process. Specifically, we first achieve local-to-local alignment at the fragment level, followed by global-to-local and then global-to-global alignment at the context level. This progressive alignment strategy supplies our model with more complementary and sufficient semantic clues to understand the hierarchical correlations between image and text. Experimental results on two benchmark datasets demonstrate the superiority of the proposed method.
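
To make the three alignment steps concrete, the following is a minimal sketch in plain Python. It is not the SHAN architecture: the abstract does not specify how each step is computed, so the cosine similarity, the softmax attention with a fixed temperature, the mean pooling used for global vectors, and the equal-weight averaging of the three scores are all assumptions chosen for illustration. The function names (`attend`, `mean_pool`, `stepwise_score`) are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot(a, b) / (na * nb)


def mean_pool(vectors):
    """Assumed global representation: the average of the fragment vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def attend(query, keys, temperature=4.0):
    """Softmax attention of one query over a list of key vectors.

    The temperature value is an arbitrary illustrative choice.
    """
    scores = [temperature * cosine(query, k) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    return [sum(w * k[d] for w, k in zip(weights, keys))
            for d in range(len(query))]


def stepwise_score(regions, words):
    """Combine three alignment scores for one image-text pair.

    regions: list of image-region feature vectors (fragment level)
    words:   list of word feature vectors (fragment level)
    """
    # Step 1: local-to-local. Each word attends over the image regions.
    local_local = sum(cosine(w, attend(w, regions)) for w in words) / len(words)

    # Step 2: global-to-local. The pooled image vector attends over the words.
    image_global = mean_pool(regions)
    global_local = cosine(image_global, attend(image_global, words))

    # Step 3: global-to-global. Compare the pooled image and pooled text.
    text_global = mean_pool(words)
    global_global = cosine(image_global, text_global)

    # Assumed fusion: an unweighted average of the three scores.
    return (local_local + global_local + global_global) / 3.0


if __name__ == "__main__":
    regions = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    caption = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    print(stepwise_score(regions, caption))
```

In this toy setup, a caption whose word vectors lie in the same directions as the region vectors scores higher than one whose word vectors are orthogonal to every region. A learned model would replace the fixed pooling and averaging with trained components.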

