Watch What You Just Said: Image Captioning with Text-Conditional Attention

06/15/2016
by Luowei Zhou, et al.

Attention mechanisms have attracted considerable interest in image captioning due to their strong performance. However, existing methods attend over visual content alone, and whether textual context can improve attention in image captioning remains an open question. To explore this question, we propose a novel attention mechanism, called text-conditional attention, which allows the caption generator to focus on certain image features given the previously generated text. To obtain text-related image features for our attention model, we adopt the guiding Long Short-Term Memory (gLSTM) captioning architecture with CNN fine-tuning. Our method allows joint learning of the image embedding, the text embedding, the text-conditional attention, and the language model within a single network architecture in an end-to-end manner. We perform extensive experiments on the MS-COCO dataset. The results show that our method outperforms state-of-the-art captioning methods on various quantitative metrics as well as in human evaluation, supporting the use of text-conditional attention in image captioning.
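The abstract does not give the attention equations, but the core idea, weighting image regions by how well they match the previously generated text, can be sketched in a few lines. The NumPy sketch below is an illustrative assumption, not the paper's exact formulation: `W_t`, `w_score`, and the sigmoid-gating form are all hypothetical choices standing in for the learned text-conditioning and scoring parameters.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D array of logits
    e = np.exp(x - x.max())
    return e / e.sum()

def text_conditional_attention(V, w_prev, W_t, w_score):
    """Illustrative sketch of text-conditional attention.

    V       : (R, d) image region features (R regions, d dims each)
    w_prev  : (e,)   embedding of the previously generated word
    W_t     : (e, d) hypothetical projection of text into image-feature space
    w_score : (d,)   hypothetical scoring vector, one logit per region
    """
    # gate each image-feature dimension by the previous word (assumed form)
    gate = 1.0 / (1.0 + np.exp(-(w_prev @ W_t)))   # (d,) in (0, 1)
    V_cond = V * gate                               # text-conditioned features
    alpha = softmax(V_cond @ w_score)               # (R,) attention weights
    context = alpha @ V                             # (d,) attended image feature
    return alpha, context

# toy example with random parameters
rng = np.random.default_rng(0)
R, d, e = 4, 8, 6
V = rng.normal(size=(R, d))
w_prev = rng.normal(size=e)
W_t = rng.normal(size=(e, d))
w_score = rng.normal(size=d)
alpha, context = text_conditional_attention(V, w_prev, W_t, w_score)
```

In a full captioning model, `context` would be fed to the language model at each decoding step, so the attended regions change as the generated text changes, which is the behavior the paper's title alludes to.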


