Self-Guiding Multimodal LSTM - when we do not have a perfect training dataset for image captioning

09/15/2017
by Yang Xian, et al.

In this paper, a self-guiding multimodal LSTM (sg-LSTM) image captioning model is proposed to handle uncontrolled, imbalanced real-world image-sentence datasets. We collect the FlickrNYC dataset of 306,165 images from Flickr as our testbed, using the original text descriptions uploaded by the users as the ground truth for training. Descriptions in FlickrNYC vary dramatically, ranging from short term-descriptions to long paragraph-descriptions; they can describe any visual aspect of an image, or even refer to objects that are not depicted. To deal with this imbalanced and noisy situation, and to fully exploit the dataset itself, we propose a novel guiding textual feature extracted with a multimodal LSTM (m-LSTM) model. The m-LSTM is trained on the portion of the data in which the image content and the corresponding descriptions are strongly correlated. Afterwards, during the training of sg-LSTM on the remaining data, this guiding information serves as additional input to the network, alongside the image representations and the ground-truth descriptions. By integrating these input components into a multimodal block, we aim to form a training scheme in which the textual information is tightly coupled with the image content. Experimental results demonstrate that the proposed sg-LSTM model outperforms a state-of-the-art multimodal RNN captioning framework in describing the key components of the input images.
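The core idea of the abstract — feeding the guiding textual feature into the caption decoder alongside the image representation and the current word — can be sketched as a small fusion step. This is a minimal illustration, not the paper's implementation: the function name `multimodal_block`, the additive fusion after per-modality projection, and all dimensions are assumptions chosen for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not from the paper)
D_WORD, D_IMG, D_GUIDE, D_M = 32, 64, 32, 48

# Per-modality projection matrices, randomly initialized for the sketch
W_w = rng.normal(scale=0.1, size=(D_WORD, D_M))
W_i = rng.normal(scale=0.1, size=(D_IMG, D_M))
W_g = rng.normal(scale=0.1, size=(D_GUIDE, D_M))

def multimodal_block(word_emb, img_feat, guide_feat):
    """Fuse word, image, and guiding-text inputs into one joint vector.

    In sg-LSTM the guiding textual feature (produced by the pre-trained
    m-LSTM) enters the network together with the image feature and the
    ground-truth word embedding; additive fusion after projecting each
    modality into a shared space is one plausible realization.
    """
    return np.tanh(word_emb @ W_w + img_feat @ W_i + guide_feat @ W_g)

# One decoding step: the fused vector would feed the caption LSTM cell.
word = rng.normal(size=D_WORD)    # embedding of the current word
image = rng.normal(size=D_IMG)    # CNN image representation
guide = rng.normal(size=D_GUIDE)  # guiding textual feature from m-LSTM

m = multimodal_block(word, image, guide)
print(m.shape)  # (48,)
```

At training time this fused vector would be computed at every time step and passed to the recurrent decoder, so the guidance signal constrains generation throughout the caption rather than only at initialization.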


research · 08/17/2017 · Incorporating Copying Mechanism in Image Captioning for Learning Novel Objects
Image captioning often requires a large set of training image-sentence p...

research · 07/10/2018 · "Factual" or "Emotional": Stylized Image Captioning with Adaptive Learning and Attention
Generating stylized captions for an image is an emerging topic in image ...

research · 06/07/2022 · Improving Image Captioning with Control Signal of Sentence Quality
In the dataset of image captioning, each image is aligned with several c...

research · 09/11/2018 · End-to-end Image Captioning Exploits Multimodal Distributional Similarity
We hypothesize that end-to-end neural image captioning systems work seem...

research · 04/11/2018 · Decoupled Novel Object Captioner
Image captioning is a challenging task where the machine automatically d...

research · 12/13/2021 · MAGIC: Multimodal relAtional Graph adversarIal inferenCe for Diverse and Unpaired Text-based Image Captioning
Text-based image captioning (TextCap) requires simultaneous comprehensio...

research · 03/23/2022 · Affective Feedback Synthesis Towards Multimodal Text and Image Data
In this paper, we have defined a novel task of affective feedback synthe...
