Cluster-based Evaluation of Automatically Generated Text

05/31/2022
by Tiago Pimentel, et al.

While probabilistic language generators have improved dramatically over the last few years, the automatic evaluation metrics used to assess them have not kept pace with this progress. In the domain of language generation, a good metric must correlate highly with human judgements. Yet, with few exceptions, there is a lack of such metrics in the literature. In this work, we analyse the general paradigm of language generator evaluation. We first discuss the computational and qualitative issues with using automatic evaluation metrics that operate on probability distributions over strings, the backbone of most language generators. We then propose the use of distributions over clusters instead, where we cluster strings based on their text embeddings (obtained from a pretrained language model). While we find the biases introduced by this substitution to be quite strong, we observe that, empirically, this methodology leads to metric estimators with higher correlation with human judgements, while simultaneously reducing estimator variance. We finish the paper with a probing analysis, which leads us to conclude that – by encoding syntactic- and coherence-level features of text, while ignoring surface-level features – these clusters may simply be better equipped to evaluate state-of-the-art language models.
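
As a rough illustration of the approach described above, the sketch below embeds strings with a pretrained sentence encoder, fits clusters on the human reference texts, and compares the resulting cluster distributions of human and generated text. The specific embedder ("all-MiniLM-L6-v2" from sentence-transformers), the use of k-means, and the total variation distance are assumptions made for illustration only; the abstract does not specify the paper's exact clustering method or comparison metric.

import numpy as np
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

def cluster_distribution(texts, kmeans, embedder):
    # Embed each string, assign it to its nearest cluster, and return
    # the empirical distribution over clusters.
    embeddings = embedder.encode(texts)
    labels = kmeans.predict(embeddings)
    counts = np.bincount(labels, minlength=kmeans.n_clusters)
    return counts / counts.sum()

def cluster_based_score(human_texts, generated_texts, n_clusters=10):
    # Hypothetical embedder choice; the paper only states that text embeddings
    # come from a pretrained language model.
    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    # Fit the clusters on the human (reference) texts.
    kmeans = KMeans(n_clusters=n_clusters, random_state=0)
    kmeans.fit(embedder.encode(human_texts))
    # Distributions over clusters for reference and generated text.
    p = cluster_distribution(human_texts, kmeans, embedder)
    q = cluster_distribution(generated_texts, kmeans, embedder)
    # Total variation distance between the two cluster distributions
    # (an illustrative choice; any distribution-level comparison could be used).
    return 0.5 * np.abs(p - q).sum()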

Related research

Automating Text Naturalness Evaluation of NLG Systems (06/23/2020)
Automatic methods and metrics that assess various quality criteria of au...

Social Biases in Automatic Evaluation Metrics for NLG (10/17/2022)
Many studies have revealed that word embeddings, language models, and mo...

RSTGen: Imbuing Fine-Grained Interpretable Control into Long-Form Text Generators (05/25/2022)
In this paper, we study the task of improving the cohesion and coherence...

Layer or Representation Space: What makes BERT-based Evaluation Metrics Robust? (09/06/2022)
The evaluation of recent embedding-based evaluation metrics for text gen...

Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation Practices for Generated Text (02/14/2022)
Evaluation practices in natural language generation (NLG) have many know...

Towards the Human Global Context: Does the Vision-Language Model Really Judge Like a Human Being? (07/18/2022)
As computer vision and NLP make progress, Vision-Language (VL) is becomin...

Human or Machine: Automating Human Likeliness Evaluation of NLG Texts (06/05/2020)
Automatic evaluation of various text quality criteria produced by data-d...
