Understanding Guided Image Captioning Performance across Domains

12/04/2020
by Edwin G. Ng, et al.

Image captioning models generally lack the capability to take into account user interest, and usually default to global descriptions that try to balance readability, informativeness, and information overload. On the other hand, VQA models generally lack the ability to provide long descriptive answers, while expecting the textual question to be quite precise. We present a method to control the concepts that an image caption should focus on, using an additional input called the guiding text that refers to either groundable or ungroundable concepts in the image. Our model consists of a Transformer-based multimodal encoder that uses the guiding text together with global and object-level image features to derive early-fusion representations used to generate the guided caption. While models trained on Visual Genome data have an in-domain advantage of fitting well when guided with automatic object labels, we find that guided captioning models trained on Conceptual Captions generalize better on out-of-domain images and guiding texts. Our human-evaluation results indicate that attempting in-the-wild guided image captioning requires access to large, unrestricted-domain training datasets, and that increased style diversity (even without increasing vocabulary size) is a key factor for improved performance.
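To make the early-fusion idea concrete, here is a minimal, illustrative sketch of how a guiding text could be fused with global and object-level image features in a shared Transformer encoder. All module names, dimensions, and hyperparameters below are assumptions for illustration, not the authors' released implementation.

```python
# Hypothetical early-fusion encoder: guiding-text tokens and image features
# (one global vector plus per-object vectors) are projected into a shared
# space, concatenated into a single sequence, and encoded with a standard
# Transformer encoder. The fused output would then feed a caption decoder.
import torch
import torch.nn as nn


class GuidedCaptionEncoder(nn.Module):
    def __init__(self, vocab_size=30000, d_model=512, n_heads=8, n_layers=6,
                 img_feat_dim=2048):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)   # guiding-text tokens
        self.img_proj = nn.Linear(img_feat_dim, d_model)     # global + object features
        self.type_emb = nn.Embedding(2, d_model)             # 0 = text, 1 = image
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, guide_ids, global_feat, object_feats):
        # guide_ids:    (B, T) token ids of the guiding text
        # global_feat:  (B, 1, img_feat_dim) whole-image feature
        # object_feats: (B, O, img_feat_dim) per-object detector features
        text = self.token_emb(guide_ids)
        image = self.img_proj(torch.cat([global_feat, object_feats], dim=1))
        seq = torch.cat([text, image], dim=1)                 # early fusion
        types = torch.cat([
            torch.zeros(text.shape[:2], dtype=torch.long, device=seq.device),
            torch.ones(image.shape[:2], dtype=torch.long, device=seq.device),
        ], dim=1)
        seq = seq + self.type_emb(types)                      # mark modality of each position
        return self.encoder(seq)


# Example: 4 guiding-text tokens, 1 global feature, 5 object features.
enc = GuidedCaptionEncoder()
fused = enc(torch.randint(0, 30000, (1, 4)),
            torch.randn(1, 1, 2048),
            torch.randn(1, 5, 2048))
print(fused.shape)  # torch.Size([1, 10, 512])
```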


Related research:
08/04/2021  Question-controlled Text-aware Image Captioning
02/07/2021  Iconographic Image Captioning for Artworks
06/15/2018  Partially-Supervised Image Captioning
09/10/2019  Compositional Generalization in Image Captioning
10/28/2020  Fusion Models for Improved Visual Captioning
12/02/2016  Guided Open Vocabulary Image Captioning with Constrained Beam Search
07/26/2018  Rethinking the Form of Latent States in Image Captioning
