ContextRef: Evaluating Referenceless Metrics For Image Description Generation

09/21/2023
by Elisa Kreiss, et al.

Referenceless metrics (e.g., CLIPScore) use pretrained vision–language models to assess image descriptions directly, without costly ground-truth reference texts. Such methods can facilitate rapid progress, but only if they truly align with human preference judgments. In this paper, we introduce ContextRef, a benchmark for assessing how well referenceless metrics achieve this alignment. ContextRef has two components: human ratings along a variety of established quality dimensions, and ten diverse robustness checks designed to uncover fundamental weaknesses. A crucial aspect of ContextRef is that images and descriptions are presented in context, reflecting prior work showing that context is important for description quality. Using ContextRef, we assess a variety of pretrained models, scoring functions, and techniques for incorporating context. None of the methods succeeds on ContextRef, but we show that careful fine-tuning yields substantial improvements. ContextRef nonetheless remains a challenging benchmark, in large part because of context dependence.
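To make the idea of a referenceless metric concrete, here is a minimal sketch of a CLIPScore-style scoring function. It assumes the image and description have already been embedded by a vision–language model; the toy vectors below are hypothetical placeholders, not real model outputs. The rescaling weight of 2.5 and the clipping at zero follow the CLIPScore definition as commonly stated, and the sketch does not attempt to show how ContextRef incorporates context.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def clipscore(image_emb: Sequence[float],
              text_emb: Sequence[float],
              w: float = 2.5) -> float:
    """CLIPScore-style score: w * max(cos(image, text), 0).

    No reference description is needed; only the image embedding and the
    candidate description embedding are compared.
    """
    return w * max(cosine_similarity(image_emb, text_emb), 0.0)


# Hypothetical toy embeddings (assumption: real ones would come from the
# image and text encoders of a pretrained model such as CLIP).
image_embedding = [0.2, 0.7, 0.1]
good_description = [0.25, 0.65, 0.05]
poor_description = [-0.6, 0.1, 0.9]

print(clipscore(image_embedding, good_description))
print(clipscore(image_embedding, poor_description))
```

A benchmark such as ContextRef would then ask whether scores like these track human ratings of the same descriptions, which this sketch does not evaluate.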


research
05/21/2022

Context Matters for Image Descriptions for Accessibility: Challenges for Referenceless Evaluation Metrics

Few images on the Web receive alt-text descriptions that would make them...
research
06/01/2023

CapText: Large Language Model-based Caption Generation From Image Context and Description

While deep-learning models have been shown to perform well on image-to-t...
research
09/15/2022

Distribution Aware Metrics for Conditional Natural Language Generation

Traditional automated metrics for evaluating conditional natural languag...
research
03/11/2022

CoDA21: Evaluating Language Understanding Capabilities of NLP Models With Context-Definition Alignment

Pretrained language models (PLMs) have achieved superhuman performance o...
research
09/15/2021

What Vision-Language Models 'See' when they See Scenes

Images can be described in terms of the objects they contain, or in term...
research
06/23/2020

Automating Text Naturalness Evaluation of NLG Systems

Automatic methods and metrics that assess various quality criteria of au...
research
08/26/2021

Weisfeiler-Leman in the BAMBOO: Novel AMR Graph Metrics and a Benchmark for AMR Graph Similarity

Several metrics have been proposed for assessing the similarity of (abst...
