CLIPScore: A Reference-free Evaluation Metric for Image Captioning

04/18/2021
by Jack Hessel, et al.

Image captioning has conventionally relied on reference-based automatic evaluations, where machine captions are compared against captions written by humans. This is in stark contrast to the reference-free manner in which humans assess caption quality. In this paper, we report the surprising empirical finding that CLIP (Radford et al., 2021), a cross-modal model pretrained on 400M image+caption pairs from the web, can be used for robust automatic evaluation of image captioning without the need for references. Experiments spanning several corpora demonstrate that our new reference-free metric, CLIPScore, achieves the highest correlation with human judgements, outperforming existing reference-based metrics like CIDEr and SPICE. Information gain experiments demonstrate that CLIPScore, with its tight focus on image-text compatibility, is complementary to existing reference-based metrics that emphasize text-text similarities. Thus, we also present a reference-augmented version, RefCLIPScore, which achieves even higher correlation. Beyond literal description tasks, several case studies reveal domains where CLIPScore performs well (clip-art images, alt-text rating), but also where it is relatively weaker than reference-based metrics, e.g., news captions that require richer contextual knowledge.
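The metric itself reduces to simple vector arithmetic once embeddings are in hand: the paper scores a caption against an image as a rescaled, clipped cosine similarity of their CLIP embeddings (CLIPScore = w * max(cos(image, caption), 0), with w = 2.5), and RefCLIPScore takes the harmonic mean of that score and the best caption-to-reference similarity. The sketch below shows these two formulas in plain Python, assuming the CLIP image, caption, and reference embeddings have already been extracted; the embedding step and the exact CLIP checkpoint are outside this snippet.

```python
import math


def _cos(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def clipscore(image_emb, caption_emb, w=2.5):
    """Reference-free CLIPScore: w * max(cos(image, caption), 0).

    `image_emb` and `caption_emb` are assumed to be CLIP embeddings
    (any sequence of floats works for illustration).
    """
    return w * max(_cos(image_emb, caption_emb), 0.0)


def ref_clipscore(image_emb, caption_emb, ref_embs, w=2.5):
    """Reference-augmented RefCLIPScore: harmonic mean of CLIPScore
    and the best caption-to-reference cosine similarity (clipped at 0)."""
    cs = clipscore(image_emb, caption_emb, w=w)
    best_ref = max(max(_cos(caption_emb, r), 0.0) for r in ref_embs)
    if cs == 0.0 or best_ref == 0.0:
        return 0.0
    return 2 * cs * best_ref / (cs + best_ref)
```

With toy unit vectors, a caption embedding identical to the image embedding scores w = 2.5, and an orthogonal one scores 0; in practice CLIP embeddings of good captions land well between those extremes, which is why the rescaling weight exists.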

Related research

03/21/2023
Positive-Augmented Contrastive Learning for Image and Video Captioning Evaluation
The CLIP model has been recently proven to be very effective for a varie...

09/04/2019
TIGEr: Text-to-Image Grounding for Image Caption Evaluation
This paper presents a new metric called TIGEr for the automatic evaluati...

07/31/2023
Guiding Image Captioning Models Toward More Specific Captions
Image captioning is conventionally formulated as the task of generating ...

05/10/2023
InfoMetIC: An Informative Metric for Reference-free Image Caption Evaluation
Automatic image captioning evaluation is critical for benchmarking and p...

05/25/2022
Crossmodal-3600: A Massively Multilingual Multimodal Evaluation Dataset
Research in massively multilingual image captioning has been severely ha...

11/07/2021
Machine-in-the-Loop Rewriting for Creative Image Captioning
Machine-in-the-loop writing aims to enable humans to collaborate with mo...

03/15/2023
PR-MCS: Perturbation Robust Metric for MultiLingual Image Captioning
Vulnerability to lexical perturbation is a critical weakness of automati...
