Distribution Aware Metrics for Conditional Natural Language Generation

09/15/2022
by David M. Chan, et al.

Traditional automated metrics for evaluating conditional natural language generation rely on pairwise comparisons between a single generated text and the best-matching gold-standard ground truth text. When multiple ground truths are available, scores are aggregated using an average or max operation across references. While this approach works well when diversity in the ground truth data (i.e., the dispersion of the distribution of conditional texts) can be ascribed to noise, as in automated speech recognition, it does not allow for robust evaluation when the diversity in the ground truths represents signal for the model. In this work we argue that existing metrics are not appropriate for domains such as visual description or summarization, where the ground truths are semantically diverse and that diversity captures useful additional information about the context. We propose a novel paradigm for multi-candidate evaluation of conditional language generation models, and a new family of metrics that compare the distributions of reference and model-generated caption sets using small sample sets of each. We demonstrate the utility of our approach with a case study in visual description, where we show that existing models optimize for single-description quality over diversity, and we gain insight into how sampling methods and temperature impact description quality and diversity.
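The proposed paradigm, comparing small sample sets of references and generations as distributions rather than matching one candidate against its closest reference, can be illustrated with a kernel two-sample test. The sketch below is an illustrative assumption and not the paper's exact metric: the sentence encoder (all-MiniLM-L6-v2 from sentence-transformers), the RBF kernel, and the median bandwidth heuristic are all choices made here for the example.

import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder, not from the paper

def _sq_dists(a, b):
    # Pairwise squared Euclidean distances between rows of a and rows of b.
    return np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T

def mmd2(references, generations):
    """Biased (V-statistic) estimate of squared maximum mean discrepancy
    between the reference and generated caption distributions.
    Lower is better; 0 means the two sample sets are indistinguishable
    under the kernel embedding."""
    x = encoder.encode(references, convert_to_numpy=True)
    y = encoder.encode(generations, convert_to_numpy=True)
    # Median heuristic for the RBF bandwidth (an assumption, not the paper's choice).
    d = _sq_dists(np.vstack([x, y]), np.vstack([x, y]))
    gamma = 1.0 / (np.median(d[d > 0]) + 1e-12)

    def k(a, b):
        return np.exp(-gamma * _sq_dists(a, b))

    return float(k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean())

if __name__ == "__main__":
    refs = ["a man rides a horse along the beach",
            "someone gallops on horseback by the shoreline"]
    gens = ["a person is riding a horse near the ocean",
            "a man on a horse by the sea"]
    print(mmd2(refs, gens))  # small value: caption distributions are close

Scoring caption sets this way rewards a model whose output distribution matches the references, so a model that repeats one high-quality caption scores worse than one that also covers the reference diversity.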


Related research

05/12/2022
What's in a Caption? Dataset-Specific Linguistic Diversity and Its Effect on Visual Description Models and Metrics
While there have been significant gains in the field of automated video ...

04/06/2020
Evaluating the Evaluation of Diversity in Natural Language Generation
Despite growing interest in natural language generation (NLG) models tha...

01/17/2022
Evaluation of HTR models without Ground Truth Material
The evaluation of Handwritten Text Recognition (HTR) models during their...

10/09/2020
Mark-Evaluate: Assessing Language Generation using Population Estimation Methods
We propose a family of metrics to assess language generation derived fro...

09/21/2023
ContextRef: Evaluating Referenceless Metrics For Image Description Generation
Referenceless metrics (e.g., CLIPScore) use pretrained vision–language m...

11/06/2018
Language GANs Falling Short
Generating high-quality text with sufficient diversity is essential for ...

12/14/2020
Fork or Fail: Cycle-Consistent Training with Many-to-One Mappings
Cycle-consistent training is widely used for jointly learning a forward ...
