Expressing Objects just like Words: Recurrent Visual Embedding for Image-Text Matching

02/20/2020
by   Tianlang Chen, et al.
21

Existing image-text matching approaches typically infer the similarity of an image-text pair by capturing and aggregating the affinities between the text and each independent object of the image. However, they ignore the connections between the objects that are semantically related. These objects may collectively determine whether the image corresponds to a text or not. To address this problem, we propose a Dual Path Recurrent Neural Network (DP-RNN) which processes images and sentences symmetrically by recurrent neural networks (RNN). In particular, given an input image-text pair, our model reorders the image objects based on the positions of their most related words in the text. In the same way as extracting the hidden features from word embeddings, the model leverages RNN to extract high-level object features from the reordered object inputs. We validate that the high-level object features contain useful joint information of semantically related objects, which benefit the retrieval task. To compute the image-text similarity, we incorporate a Multi-attention Cross Matching Model into DP-RNN. It aggregates the affinity between objects and words with cross-modality guided attention and self-attention. Our model achieves the state-of-the-art performance on Flickr30K dataset and competitive performance on MS-COCO dataset. Extensive experiments demonstrate the effectiveness of our model.

READ FULL TEXT

page 1

page 6

page 7

research
03/21/2018

Stacked Cross Attention for Image-Text Matching

In this paper, we study the problem of image-text matching. Inferring th...
research
12/20/2014

Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN)

In this paper, we present a multimodal Recurrent Neural Network (m-RNN) ...
research
10/23/2020

Dual-path Self-Attention RNN for Real-Time Speech Enhancement

We propose a dual-path self-attention recurrent neural network (DP-SARNN...
research
11/24/2016

Training and Evaluating Multimodal Word Embeddings with Large-scale Web Annotated Images

In this paper, we focus on training and evaluating effective word embedd...
research
10/04/2014

Explain Images with Multimodal Recurrent Neural Networks

In this paper, we present a multimodal Recurrent Neural Network (m-RNN) ...
research
02/23/2020

Deep Multimodal Image-Text Embeddings for Automatic Cross-Media Retrieval

This paper considers the task of matching images and sentences by learni...
research
11/15/2016

Knowledge Enhanced Hybrid Neural Network for Text Matching

Long text brings a big challenge to semantic matching due to their compl...

Please sign up or login with your details

Forgot password? Click here to reset