Exploring and Distilling Cross-Modal Information for Image Captioning

02/28/2020
by   Fenglin Liu, et al.

Recently, attention-based encoder-decoder models have been used extensively in image captioning, yet current methods still struggle to achieve deep image understanding. In this work, we argue that such understanding requires visual attention to correlated image regions and semantic attention to coherent attributes of interest. To perform effective attention, we explore image captioning from a cross-modal perspective and propose the Global-and-Local Information Exploring-and-Distilling approach, which explores and distills source information in both vision and language. Globally, it derives the aspect vector, a spatial and relational representation of the image grounded in the caption context, by extracting salient region groupings and attribute collocations; locally, it extracts fine-grained regions and attributes with reference to the aspect vector for word selection. Our fully-attentive model achieves a CIDEr score of 129.3 in offline evaluation on the COCO test set while remaining efficient in speed and parameter budget.
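The abstract outlines a two-stage attention scheme: region and attribute features are first pooled into a global aspect vector conditioned on the caption context, and that aspect vector then guides a second, fine-grained attention over regions and attributes for word selection. The PyTorch sketch below illustrates this global-then-local pattern only; the class name, dimensions, and the scaled dot-product scoring are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GlobalLocalAttention(nn.Module):
    """Hypothetical sketch of global exploring followed by local distilling."""

    def __init__(self, dim: int):
        super().__init__()
        self.global_query = nn.Linear(dim, dim)  # caption context -> global query
        self.local_query = nn.Linear(dim, dim)   # aspect vector -> local query
        self.key = nn.Linear(dim, dim)

    def _attend(self, query, features):
        # Scaled dot-product attention of one query over a set of features.
        keys = self.key(features)                            # (N, dim)
        scores = keys @ query / features.size(-1) ** 0.5     # (N,)
        weights = F.softmax(scores, dim=0)
        return weights @ features                            # (dim,)

    def forward(self, regions, attributes, context):
        # 1) Globally explore: pool regions and attributes into an aspect
        #    vector conditioned on the current caption context.
        q_global = self.global_query(context)
        aspect = self._attend(q_global, torch.cat([regions, attributes], dim=0))
        # 2) Locally distill: re-attend to fine-grained regions and attributes
        #    with a query derived from the aspect vector, for word selection.
        q_local = self.local_query(aspect)
        visual = self._attend(q_local, regions)
        semantic = self._attend(q_local, attributes)
        return aspect, visual, semantic


# Toy usage: 36 region features, 10 attribute embeddings, one decoder state.
model = GlobalLocalAttention(dim=512)
regions = torch.randn(36, 512)
attributes = torch.randn(10, 512)
context = torch.randn(512)
aspect, visual, semantic = model(regions, attributes, context)
print(aspect.shape, visual.shape, semantic.shape)  # torch.Size([512]) each
```

The key design point the abstract emphasizes is that the local, per-word attention is conditioned on the globally distilled aspect vector rather than directly on the raw decoder state, which is what the two separate query projections above are meant to mimic.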

