Dual Attention on Pyramid Feature Maps for Image Captioning

11/02/2020
by Litao Yu, et al.

Generating natural sentences from images is a fundamental learning task for visual-semantic understanding in multimedia. In this paper, we propose to apply dual attention on pyramid image feature maps to fully explore the visual-semantic correlations and improve the quality of generated sentences. Specifically, by taking full account of the contextual information provided by the hidden state of the RNN controller, the pyramid attention can better localize the visually indicative and semantically consistent regions in images. The contextual information also helps re-calibrate the importance of feature components by learning the channel-wise dependencies, which improves the discriminative power of the visual features for better content description. We conducted comprehensive experiments on three well-known datasets, Flickr8K, Flickr30K and MS COCO, and achieved impressive results in generating descriptive and smooth natural sentences from images. Using either convolutional visual features or the more informative bottom-up attention features, our composite captioning model achieves very promising performance in a single-model setting. The proposed pyramid attention and dual attention methods are highly modular and can be inserted into various image captioning modules to further improve performance.
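To make the idea concrete, below is a minimal PyTorch sketch of a dual-attention block in the spirit of the abstract: spatial attention over concatenated pyramid region features guided by the RNN hidden state, followed by a channel-wise re-calibration gate conditioned on the same hidden state. The module name, dimensions, and the way pyramid levels are fused are assumptions for illustration, not the authors' exact architecture.

    # Hedged sketch, not the paper's reference implementation.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DualAttention(nn.Module):
        def __init__(self, feat_dim, hidden_dim, attn_dim=512):
            super().__init__()
            # Spatial (pyramid) attention: scores each region given the RNN hidden state.
            self.feat_proj = nn.Linear(feat_dim, attn_dim)
            self.hid_proj = nn.Linear(hidden_dim, attn_dim)
            self.score = nn.Linear(attn_dim, 1)
            # Channel attention: re-calibrates feature channels given the hidden state.
            self.channel_gate = nn.Sequential(
                nn.Linear(feat_dim + hidden_dim, feat_dim // 4),
                nn.ReLU(inplace=True),
                nn.Linear(feat_dim // 4, feat_dim),
                nn.Sigmoid(),
            )

        def forward(self, pyramid_feats, hidden):
            # pyramid_feats: list of (B, N_l, feat_dim) region features, one per pyramid level
            # hidden: (B, hidden_dim) RNN controller state
            feats = torch.cat(pyramid_feats, dim=1)                       # (B, N, feat_dim)
            # Spatial attention over all pyramid regions.
            e = self.score(torch.tanh(self.feat_proj(feats)
                                      + self.hid_proj(hidden).unsqueeze(1)))  # (B, N, 1)
            alpha = F.softmax(e, dim=1)
            context = (alpha * feats).sum(dim=1)                          # (B, feat_dim)
            # Channel-wise re-calibration conditioned on the hidden state.
            gate = self.channel_gate(torch.cat([context, hidden], dim=-1))
            return context * gate

    # Example usage with two hypothetical pyramid levels (7x7 and 14x14 grids):
    # block = DualAttention(feat_dim=2048, hidden_dim=512)
    # feats = [torch.randn(2, 49, 2048), torch.randn(2, 196, 2048)]
    # ctx = block(feats, torch.randn(2, 512))   # -> (2, 2048) context vector per step

At each decoding step, the resulting context vector would be fed to the RNN controller alongside the previous word embedding; this is one plausible way to wire the block into a captioning decoder.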

Related research

12/12/2016 · Text-guided Attention Model for Image Captioning
Visual attention plays an important role to understand images and demons...

07/20/2022 · GRIT: Faster and Better Image captioning Transformer Using Dual Visual Features
Current state-of-the-art methods for image captioning employ region-base...

11/17/2016 · SCA-CNN: Spatial and Channel-wise Attention in Convolutional Networks for Image Captioning
Visual attention has been successfully applied in structural prediction ...

05/18/2021 · Dependent Multi-Task Learning with Causal Intervention for Image Captioning
Recent work for image captioning mainly followed an extract-then-generat...

12/23/2018 · Chinese Herbal Recognition based on Competitive Attentional Fusion of Multi-hierarchies Pyramid Features
Convolution neural networks (CNNs) are successfully applied in image rec...

11/17/2021 · DiverGAN: An Efficient and Effective Single-Stage Framework for Diverse Text-to-Image Generation
In this paper, we present an efficient and effective single-stage framew...

03/30/2016 · Dense Image Representation with Spatial Pyramid VLAD Coding of CNN for Locally Robust Captioning
The workflow of extracting features from images using convolutional neur...
