Towards Multiple References Era – Addressing Data Leakage and Limited Reference Diversity in NLG Evaluation

08/06/2023
by Xianfeng Zeng, et al.

N-gram matching-based evaluation metrics, such as BLEU and chrF, are widely used across a range of natural language generation (NLG) tasks. However, recent studies have revealed a weak correlation between these matching-based metrics and human evaluations, especially when compared with neural metrics such as BLEURT. In this paper, we conjecture that the performance bottleneck of matching-based metrics may be the limited diversity of references. To address this issue, we propose using multiple references to improve the consistency between these metrics and human evaluations. On the WMT Metrics benchmarks, the multi-reference F200spBLEU surpasses its conventional single-reference counterpart by 7.2% in accuracy. Remarkably, it also exceeds the neural BERTScore by 3.9% in accuracy. Moreover, we observe that the data leakage issue in large language models (LLMs) can be mitigated to a large extent by our multi-reference metric. We release the code and data at <https://github.com/SefaZeng/LLM-Ref>.
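For concreteness, the single- vs. multi-reference comparison described in the abstract can be reproduced with off-the-shelf tooling. The sketch below uses the sacrebleu library with hypothetical example strings; it assumes a sacrebleu release that ships the FLORES-200 "flores200" SPM tokenizer underlying F200spBLEU (substitute the default "13a" tokenizer otherwise). It is an illustration of the general technique, not the authors' released code.

```python
# Minimal sketch: single-reference vs. multi-reference spBLEU with sacrebleu.
# The sentences are hypothetical, and tokenize="flores200" is assumed to be
# available in your sacrebleu version; use the default tokenizer if it is not.
import sacrebleu

hypotheses = ["The cat sat on the mat."]

# sacrebleu expects one list per reference stream, each aligned with the
# hypotheses; three streams here means three references per hypothesis.
references = [
    ["The cat is sitting on the mat."],
    ["A cat sat down on the mat."],
    ["On the mat, a cat was sitting."],
]

single_ref = sacrebleu.corpus_bleu(hypotheses, references[:1], tokenize="flores200")
multi_ref = sacrebleu.corpus_bleu(hypotheses, references, tokenize="flores200")

print(f"single-reference spBLEU: {single_ref.score:.2f}")
print(f"multi-reference spBLEU:  {multi_ref.score:.2f}")
```

With multiple streams, BLEU clips each n-gram count against the maximum count observed across the references and uses the closest reference length for the brevity penalty, so adding references can only broaden the pool of n-grams a hypothesis is rewarded for matching.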
