HAAV: Hierarchical Aggregation of Augmented Views for Image Captioning

05/25/2023
by Chia-Wen Kuo, et al.

A great deal of progress has been made in image captioning, driven by research into how to encode the image using pre-trained models. This includes visual encodings (e.g. image grid features or detected objects) and, more recently, textual encodings (e.g. image tags or text descriptions of image regions). As more advanced encodings become available and are incorporated, it is natural to ask how to leverage this heterogeneous set of encodings efficiently and effectively. In this paper, we propose to regard the encodings as augmented views of the input image. The image captioning model efficiently encodes each view independently with a shared encoder, and a contrastive loss is incorporated across the encoded views in a novel way to improve their representation quality and the model's data efficiency. Our proposed hierarchical decoder then adaptively weighs the encoded views according to their effectiveness for caption generation, first aggregating within each view at the token level and then across views at the view level. We demonstrate significant performance improvements of +5.6% CIDEr on MS-COCO and +12.9% CIDEr on Flickr30k compared to the state of the art, and conduct analyses to demonstrate the importance of each part of our design.
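The two-level aggregation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single query vector standing in for the decoder state, plain scaled dot-product attention in place of the model's learned attention layers, and toy two-dimensional features. The names `attend` and `hierarchical_aggregate` and the example views are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, vectors):
    """Scaled dot-product attention of one query over a list of vectors.

    Returns the weighted sum of the vectors and the attention weights.
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, v) / scale for v in vectors])
    dim = len(vectors[0])
    pooled = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
    return pooled, weights


def hierarchical_aggregate(query, views):
    """Aggregate within each view (token level), then across views (view level)."""
    # Token level: summarize each view separately with the same query.
    summaries = [attend(query, tokens)[0] for tokens in views]
    # View level: weigh the per-view summaries against each other.
    context, view_weights = attend(query, summaries)
    return context, view_weights


# Hypothetical toy input: two views, each with two token features.
query = [1.0, 0.0]
views = [
    [[1.0, 0.0], [0.9, 0.1]],  # view whose tokens align with the query
    [[0.0, 1.0], [0.1, 0.9]],  # view whose tokens are less relevant
]
context, view_weights = hierarchical_aggregate(query, views)
```

In this toy setup the first view receives the larger view-level weight because its token summary is better aligned with the query, which mirrors the idea of weighing views according to their usefulness at a given decoding step.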


