Focus Your Attention: A Bidirectional Focal Attention Network for Image-Text Matching

09/25/2019
by Chunxiao Liu, et al.

Learning semantic correspondence between image and text is significant as it bridges the semantic gap between vision and language. The key challenge is to accurately find and correlate shared semantics in image and text. Most existing methods achieve this goal by representing the shared semantic as a weighted combination of all the fragments (image regions or text words), where fragments relevant to the shared semantic obtain more attention and irrelevant ones obtain less. However, although relevant fragments contribute more to the shared semantic, irrelevant ones still disturb it to some degree and thus lead to semantic misalignment in the correlation phase. To address this issue, we present a novel Bidirectional Focal Attention Network (BFAN), which not only attends to relevant fragments but also diverts all the attention to these relevant fragments to concentrate on them. The main difference from existing works is that they mostly focus on learning attention weights, while our BFAN focuses on eliminating irrelevant fragments from the shared semantic. The focal attention is achieved by pre-assigning attention based on inter-modality relation, identifying relevant fragments based on intra-modality relation, and reassigning attention. Furthermore, the focal attention is jointly applied in both image-to-text and text-to-image directions, which avoids a preference for long text or complex images. Experiments show our simple but effective framework significantly outperforms the state of the art, with relative Recall@1 gains of 2.2
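
The three steps of focal attention named in the abstract (pre-assign, identify, reassign) can be illustrated with a minimal sketch. The abstract does not give the formulas, so everything concrete here is an assumption made for illustration: the function name `focal_attention`, the `temperature` parameter, the use of cosine similarity with a softmax for pre-assignment, and the relevance rule that compares each fragment's weight against the other fragments of the same modality. It should be read as one plausible instance of the idea, not as the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def focal_attention(query, fragments, temperature=10.0):
    """Hypothetical focal attention of one query fragment over a list of fragments.

    query:     feature vector from one modality (e.g. a word).
    fragments: feature vectors from the other modality (e.g. image regions).
    Returns one weight per fragment; weights sum to 1.
    """
    # Step 1 (pre-assign, inter-modality): softmax over scaled similarities.
    # Assumption: cosine similarity and a fixed temperature.
    sims = [cosine(query, f) for f in fragments]
    top = max(sims)
    exps = [math.exp(temperature * (s - top)) for s in sims]
    total = sum(exps)
    pre = [e / total for e in exps]

    # Step 2 (identify, intra-modality): score each fragment against the
    # others in the same modality. Assumption: a fragment counts as relevant
    # when its weight exceeds a sqrt-weighted average of its peers' weights.
    keep = [sum((w_j - w_t) * math.sqrt(w_t) for w_t in pre) > 0 for w_j in pre]

    # Step 3 (reassign): drop irrelevant fragments and renormalize the rest.
    kept_total = sum(w for w, k in zip(pre, keep) if k)
    if kept_total == 0:
        # All fragments tie, so none stands out; fall back to the pre-assigned weights.
        return pre
    return [w / kept_total if k else 0.0 for w, k in zip(pre, keep)]
```

Under these assumptions, the bidirectional use described in the abstract would call the same function twice: once with each word as the query over image regions (text-to-image), and once with each region as the query over words (image-to-text). The contrast with ordinary soft attention is visible in step 3: irrelevant fragments receive exactly zero weight instead of a small positive one.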


research
04/01/2020

Graph Structured Network for Image-Text Matching

Image-text matching has received growing interest since it bridges visio...
research
12/16/2022

SceneGATE: Scene-Graph based co-Attention networks for TExt visual question answering

Most TextVQA approaches focus on the integration of objects, scene texts...
research
12/16/2022

HGAN: Hierarchical Graph Alignment Network for Image-Text Retrieval

Image-text retrieval (ITR) is a challenging task in the field of multimo...
research
04/03/2017

AMC: Attention guided Multi-modal Correlation Learning for Image Search

Given a user's query, traditional image search systems rank images accor...
research
07/05/2018

Jigsaw Puzzle Solving Using Local Feature Co-Occurrences in Deep Neural Networks

Archaeologists are in dire need of automated object reconstruction metho...
research
12/12/2019

ManiGAN: Text-Guided Image Manipulation

The goal of our paper is to semantically edit parts of an image to match...
research
02/25/2020

Declarative Memory-based Structure for the Representation of Text Data

In the era of intelligent computing, computational progress in text proc...
