Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning

12/06/2016
by   Jiasen Lu, et al.

Attention-based neural encoder-decoder frameworks have been widely adopted for image captioning. Most methods force visual attention to be active for every generated word. However, the decoder likely requires little to no visual information from the image to predict non-visual words such as "the" and "of". Other words that may seem visual can often be predicted reliably from the language model alone, e.g., "sign" after "behind a red stop" or "phone" following "talking on a cell". In this paper, we propose a novel adaptive attention model with a visual sentinel. At each time step, the model decides whether to attend to the image (and if so, to which regions) or to fall back on the visual sentinel, extracting only the information needed for sequential word generation. We test our method on the COCO image captioning 2015 challenge dataset and on Flickr30K. Our approach sets a new state of the art by a significant margin.
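The mechanism the abstract describes can be sketched compactly: alongside its hidden state, the decoder exposes a visual sentinel vector, and attention is computed over the image regions plus one extra slot for the sentinel, yielding a gate that mixes the visual context with the sentinel. The PyTorch sketch below is a minimal illustration under assumed layer names and dimensions (w_v, w_g, w_s, w_h and the 512/256 sizes are illustrative), not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaptiveAttention(nn.Module):
    """Minimal sketch of adaptive attention with a visual sentinel.

    The sentinel s_t is assumed to be produced by the decoder LSTM,
    e.g. as sigmoid(W_x x_t + W_h h_{t-1}) * tanh(memory cell), and is
    passed in here as an input.
    """

    def __init__(self, hidden_size: int, att_size: int):
        super().__init__()
        # Project image regions, decoder state, and sentinel into a common space.
        self.w_v = nn.Linear(hidden_size, att_size)  # spatial image features V
        self.w_g = nn.Linear(hidden_size, att_size)  # decoder hidden state h_t
        self.w_s = nn.Linear(hidden_size, att_size)  # visual sentinel s_t
        self.w_h = nn.Linear(att_size, 1)            # scores -> scalar logits

    def forward(self, V, h_t, s_t):
        # V: (batch, k, hidden), h_t: (batch, hidden), s_t: (batch, hidden)
        content_v = self.w_v(V) + self.w_g(h_t).unsqueeze(1)        # (batch, k, att)
        z_t = self.w_h(torch.tanh(content_v)).squeeze(-1)           # (batch, k)
        # Extra logit for the sentinel: attending here means "do not look at the image".
        z_s = self.w_h(torch.tanh(self.w_s(s_t) + self.w_g(h_t)))   # (batch, 1)
        alpha_hat = F.softmax(torch.cat([z_t, z_s], dim=1), dim=1)  # (batch, k+1)
        beta_t = alpha_hat[:, -1:]                                  # sentinel gate
        c_t = (alpha_hat[:, :-1].unsqueeze(-1) * V).sum(dim=1)      # visual context
        c_hat = beta_t * s_t + (1.0 - beta_t) * c_t                 # adaptive context
        return c_hat, alpha_hat, beta_t


# Example usage with random tensors (batch of 2, k = 49 regions, 512-d states).
V = torch.randn(2, 49, 512)
h_t = torch.randn(2, 512)
s_t = torch.randn(2, 512)
att = AdaptiveAttention(hidden_size=512, att_size=256)
c_hat, alpha_hat, beta_t = att(V, h_t, s_t)
```

When beta_t is close to 1 the adaptive context is dominated by the sentinel, so the next word is generated mostly from the language model; when it is close to 0 the attended image regions drive generation.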


Related research

12/26/2018 · Hierarchical LSTMs with Adaptive Attention for Visual Captioning
Recent progress has been made in using attention based encoder-decoder f...

09/19/2019 · Adaptively Aligned Image Captioning via Adaptive Attention Time
Recent neural models for image captioning usually employs an encoder-dec...

07/14/2023 · AIC-AB NET: A Neural Network for Image Captioning with Spatial Attention and Text Attributes
Image captioning is a significant field across computer vision and natur...

03/06/2019 · Image captioning with weakly-supervised attention penalty
Stories are essential for genealogy research since they can help build e...

11/02/2020 · Boost Image Captioning with Knowledge Reasoning
Automatically generating a human-like description for a given image is a...

04/18/2019 · Learning to Collocate Neural Modules for Image Captioning
We do not speak word by word from scratch; our brain quickly structures ...

12/08/2018 · Attend More Times for Image Captioning
Most attention-based image captioning models attend to the image once pe...
