Attend More Times for Image Captioning

12/08/2018
by Jiajun Du, et al.

Most attention-based image captioning models attend to the image once per generated word. However, attending only once per word is rigid and can miss relevant information. Attending more times allows the model to adjust the attention position, recover the missed information, and avoid generating the wrong word. In this paper, we show that attending more times per word yields improvements on the image captioning task. We propose a flexible two-LSTM merge model that makes it convenient to encode more attention steps than words. Our captioning model uses two LSTMs to encode the word sequence and the attention sequence respectively; the states of the two LSTMs and the image feature are combined to predict the next word. Experiments on the MSCOCO caption dataset show that our method outperforms the state of the art. Using bottom-up features and self-critical training, our method achieves BLEU-4, METEOR, ROUGE-L and CIDEr scores of 0.381, 0.283, 0.580 and 1.261 on the Karpathy test split.
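The abstract's two-LSTM merge idea can be sketched as follows. This is a minimal, hypothetical illustration (not the authors' implementation): all dimensions, weight matrices, and the `N_ATTEND` parameter are made up for the example; one LSTM encodes the word sequence, a second LSTM takes several soft-attention steps over regional image features per word, and the two hidden states plus a global image feature are merged to score the next word.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W):
    """One step of a basic LSTM cell; W packs the four gate weights, bias folded in."""
    H = h.size
    z = W @ np.concatenate([x, h, [1.0]])
    i, f = sigmoid(z[:H]), sigmoid(z[H:2*H])
    o, g = sigmoid(z[2*H:3*H]), np.tanh(z[3*H:])
    c_new = f * c + i * g
    return o * np.tanh(c_new), c_new

def soft_attention(h, feats, Wa):
    """Weight the K regional image features by their relevance to hidden state h."""
    scores = feats @ (Wa @ h)                 # (K,)
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    return alpha @ feats                      # attended context vector, (D,)

# Hypothetical dimensions for illustration only.
E, H, D, K, V = 8, 16, 8, 5, 20               # embed, hidden, feature, regions, vocab
W_word = rng.normal(0, 0.1, (4*H, E + H + 1))
W_attn = rng.normal(0, 0.1, (4*H, D + H + 1))
Wa     = rng.normal(0, 0.1, (D, H))
W_out  = rng.normal(0, 0.1, (V, 2*H + D))
feats  = rng.normal(0, 1.0, (K, D))           # regional image features
img_mean = feats.mean(axis=0)                 # global image feature

h_w = c_w = np.zeros(H)                       # word-LSTM state
h_a = c_a = np.zeros(H)                       # attention-LSTM state
word_emb = rng.normal(0, 1.0, E)              # embedding of the current word

# Attend N times per word: the attention LSTM takes several steps for each
# single step of the word LSTM, letting it refine where it looks.
N_ATTEND = 3
h_w, c_w = lstm_step(word_emb, h_w, c_w, W_word)
for _ in range(N_ATTEND):
    ctx = soft_attention(h_a, feats, Wa)
    h_a, c_a = lstm_step(ctx, h_a, c_a, W_attn)

# Merge both LSTM states with the image feature to score the next word.
logits = W_out @ np.concatenate([h_w, h_a, img_mean])
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(probs.shape)
```

With `N_ATTEND = 1` this collapses to the usual one-attention-per-word decoder; decoupling the attention sequence from the word sequence is what lets the model attend more times than it emits words.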

Related research

10/01/2021
Geometry Attention Transformer with Position-aware LSTMs for Image Captioning
In recent years, transformer structures have been widely applied in imag...

12/06/2016
Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning
Attention-based neural encoder-decoder frameworks have been widely adopt...

10/13/2016
Video Fill in the Blank with Merging LSTMs
Given a video and its incomplete textural description with missing words...

10/12/2018
Quantifying the amount of visual information used by neural caption generators
This paper addresses the sensitivity of neural image caption generators ...

01/04/2020
Understanding Image Captioning Models beyond Visualizing Attention
This paper explains predictions of image captioning models with attentio...

04/30/2021
End-to-End Attention-based Image Captioning
In this paper, we address the problem of image captioning specifically f...

11/10/2019
Can Neural Image Captioning be Controlled via Forced Attention?
Learned dynamic weighting of the conditioning signal (attention) has bee...
