Guiding Image Captioning Models Toward More Specific Captions

07/31/2023
by Simon Kornblith et al.

Image captioning is conventionally formulated as the task of generating captions for images that match the distribution of reference image-caption pairs. However, reference captions in standard captioning datasets are short and may not uniquely identify the images they describe. These problems are further exacerbated when models are trained directly on image-alt text pairs collected from the internet. In this work, we show that it is possible to generate more specific captions with minimal changes to the training process. We implement classifier-free guidance for an autoregressive captioning model by fine-tuning it to estimate both conditional and unconditional distributions over captions. The guidance scale applied at decoding controls a trade-off between maximizing p(caption|image) and p(image|caption). Compared to standard greedy decoding, decoding with a guidance scale of 2 substantially improves reference-free metrics such as CLIPScore (0.808 vs. 0.775) and caption→image retrieval performance in the CLIP embedding space (recall@1 44.6), but worsens standard reference-based captioning metrics (e.g., CIDEr 78.6 vs. 126.1). We further explore the use of language models to guide the decoding process, obtaining small improvements over the Pareto frontier of reference-free vs. reference-based captioning metrics that arises from classifier-free guidance, and substantially improving the quality of captions generated from a model trained only on minimally curated web data.
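The decoding-time combination described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names are invented here, and the logits are assumed to come from a captioning model queried once with the image and once with the image dropped (the unconditional pass). A guidance scale of 1 recovers ordinary conditional decoding; scales above 1 push probability mass toward tokens the image-conditioned model favors more than the unconditional model does.

```python
import numpy as np

def guided_logits(cond_logits, uncond_logits, guidance_scale):
    """Classifier-free guidance over next-token logits.

    Implements the standard CFG interpolation
        l_guided = l_uncond + gamma * (l_cond - l_uncond),
    so gamma = 1 reproduces the conditional logits and gamma > 1
    amplifies the image-dependent component (l_cond - l_uncond).
    """
    cond_logits = np.asarray(cond_logits, dtype=float)
    uncond_logits = np.asarray(uncond_logits, dtype=float)
    return uncond_logits + guidance_scale * (cond_logits - uncond_logits)

def greedy_step(cond_logits, uncond_logits, guidance_scale):
    # Greedy decoding simply takes the argmax of the guided logits
    # at each step; sampling would instead softmax these logits.
    return int(np.argmax(guided_logits(cond_logits, uncond_logits, guidance_scale)))
```

The same interpolation also accommodates the language-model guidance explored in the paper: conceptually, the unconditional captioning distribution is replaced (or mixed) with a separate language model's next-token distribution, leaving the decoding loop unchanged.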

