Open-Domain Text Evaluation via Meta Distribution Modeling

06/20/2023
by Sidi Lu, et al.

Open-domain text generation models powered by large pre-trained language models (LLMs) have recently achieved remarkable performance. However, evaluating and controlling these models for desired attributes remains a challenge, as traditional reference-based metrics such as BLEU, ROUGE, and METEOR are insufficient for open-ended generation tasks. Similarly, while trainable discriminator-based evaluation metrics show promise, obtaining high-quality training data for them is non-trivial. In this paper, we introduce a novel approach to evaluating open-domain generation: Meta-Distribution Methods (MDM). Drawing on the observation that LLM performance improves as parameter counts grow, MDM maps the contrast between two probability distributions, one known to be superior to the other, to quality measures; this contrast can be viewed as a distribution over distributions, i.e., a meta-distribution. We investigate MDM for open-domain text generation evaluation under two paradigms: 1) Generative MDM, which leverages Meta-Distribution Methods to generate in-domain negative samples for training discriminator-based metrics; and 2) Discriminative MDM, which directly uses distribution discrepancies between two language models for evaluation. Our experiments on multi-turn dialogue and on factuality in abstractive summarization demonstrate that MDMs correlate better with human judgment than existing automatic evaluation metrics on both tasks, highlighting the strong performance and generalizability of such methods.
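As a rough illustration of the discriminative idea, the sketch below scores a candidate text by the gap in average per-token log-likelihood between a stronger and a weaker causal language model, treating a larger gap as a sign of higher quality. This is a minimal sketch under assumed choices: the model pair (gpt2-large vs. gpt2) and the log-likelihood-gap scoring rule are illustrative stand-ins, not the paper's actual MDM configuration.

# Hypothetical sketch: contrast two language models to score a text.
# Assumption: quality is proxied by how much more a stronger LM likes
# the text than a weaker LM does (average per-token log-likelihood gap).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

STRONG = "gpt2-large"  # assumed "superior" distribution
WEAK = "gpt2"          # assumed "inferior" distribution


def avg_log_likelihood(model, tokenizer, text: str) -> float:
    """Mean per-token log-probability of `text` under `model`."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # With labels=input_ids, the model returns mean cross-entropy loss,
        # i.e. the negative average per-token log-likelihood.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return -loss.item()


def contrast_score(text: str) -> float:
    """Score: strong-model likelihood minus weak-model likelihood."""
    tok_s = AutoTokenizer.from_pretrained(STRONG)
    tok_w = AutoTokenizer.from_pretrained(WEAK)
    lm_s = AutoModelForCausalLM.from_pretrained(STRONG).eval()
    lm_w = AutoModelForCausalLM.from_pretrained(WEAK).eval()
    return avg_log_likelihood(lm_s, tok_s, text) - avg_log_likelihood(lm_w, tok_w, text)


if __name__ == "__main__":
    print(contrast_score("The committee approved the proposal after a brief discussion."))

In practice, such a contrast score would be compared against human quality judgments; the paper's reported correlations refer to its own MDM formulations, not to this simplified stand-in.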


