Multi-modal gated recurrent units for image description

04/20/2019
by Xuelong Li, et al.
Describing the content of an image with a natural language sentence is a challenging but important task. It is challenging because a description must not only capture the objects in the image and the relationships among them, but also be relevant and grammatically correct. In this paper, we propose a multi-modal embedding model based on gated recurrent units (GRU) that can generate variable-length descriptions for a given image. In the training step, we apply a convolutional neural network (CNN) to extract the image feature. The feature is then fed into the multi-modal GRU together with the corresponding sentence representations, and the multi-modal GRU learns the inter-modal relations between image and sentence. In the testing step, when an image is fed to our multi-modal GRU model, a sentence describing the image content is generated. The experimental results demonstrate that our multi-modal GRU model achieves state-of-the-art performance on the Flickr8K, Flickr30K and MS COCO datasets.
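To make the pipeline in the abstract more concrete, the following is a minimal NumPy sketch of one way a GRU step could combine an image feature with a word representation, followed by greedy decoding that stops at an end token (which is what makes the output variable-length). It is not the authors' implementation: the abstract does not specify how the two modalities are fused, so the additive fusion through `W_img`, the class name `MultiModalGRUCell`, the function `greedy_decode`, the weight shapes, and the random initialisation are all assumptions made for illustration. The gate equations are the standard GRU ones.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class MultiModalGRUCell:
    """Standard GRU cell fed with a word vector plus a projected image feature.

    Assumption: the image feature is fused by simple addition in the
    embedding space. The paper's actual fusion may differ.
    """

    def __init__(self, embed_dim, image_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)

        def init(rows, cols):
            return rng.normal(0.0, 0.1, size=(rows, cols))

        self.hidden_dim = hidden_dim
        # Projects the CNN image feature into the word-embedding space.
        self.W_img = init(embed_dim, image_dim)
        # Update gate (z), reset gate (r), and candidate state (h) weights.
        self.W_z, self.U_z = init(hidden_dim, embed_dim), init(hidden_dim, hidden_dim)
        self.W_r, self.U_r = init(hidden_dim, embed_dim), init(hidden_dim, hidden_dim)
        self.W_h, self.U_h = init(hidden_dim, embed_dim), init(hidden_dim, hidden_dim)
        self.b_z = np.zeros(hidden_dim)
        self.b_r = np.zeros(hidden_dim)
        self.b_h = np.zeros(hidden_dim)

    def step(self, word_vec, image_feat, h_prev):
        # Hypothetical multi-modal input: word embedding + projected image feature.
        x = word_vec + self.W_img @ image_feat
        z = sigmoid(self.W_z @ x + self.U_z @ h_prev + self.b_z)
        r = sigmoid(self.W_r @ x + self.U_r @ h_prev + self.b_r)
        h_tilde = np.tanh(self.W_h @ x + self.U_h @ (r * h_prev) + self.b_h)
        # Interpolate between the previous state and the candidate state.
        return (1.0 - z) * h_prev + z * h_tilde


def greedy_decode(cell, embeddings, W_out, image_feat, start_id, end_id, max_len=20):
    """Generate token ids until the end token appears or max_len is reached."""
    h = np.zeros(cell.hidden_dim)
    token = start_id
    caption = []
    for _ in range(max_len):
        h = cell.step(embeddings[token], image_feat, h)
        token = int(np.argmax(W_out @ h))  # most likely next word
        if token == end_id:
            break
        caption.append(token)
    return caption
```

In practice `image_feat` would come from a pretrained CNN and the weights would be learned from image-sentence pairs. With the untrained random weights used here, the decoded token ids are meaningless and only the control flow is illustrated.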


