Not All Metrics Are Guilty: Improving NLG Evaluation with LLM Paraphrasing

05/24/2023
by Tianyi Tang, et al.

Most research on natural language generation (NLG) relies on evaluation benchmarks that provide only a limited number of references per sample, which can result in poor correlation with human judgements. The underlying reason is that the same meaning can be expressed in many different forms, so evaluation against a single reference, or only a few, may not accurately reflect the quality of a model's hypotheses. To address this issue, this paper presents a novel method, named Para-Ref, that enhances existing evaluation benchmarks by enriching the reference set: we leverage large language models (LLMs) to paraphrase a single reference into multiple high-quality, diversely worded references. Experimental results on the representative NLG tasks of machine translation, text summarization, and image captioning demonstrate that our method can effectively improve the correlation of sixteen automatic evaluation metrics with human evaluation by +7.82% in ratio. We release the code and data at https://github.com/RUCAIBox/Para-Ref.
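
As a rough illustration of the Para-Ref recipe (not the authors' released code), the sketch below paraphrases a single gold reference with an LLM and then scores a hypothesis against the enlarged reference set. The llm_paraphrase helper, the prompt wording, the choice of gpt-3.5-turbo, and the use of sentence-level BLEU via sacrebleu are all assumptions for illustration; the paper applies the idea to sixteen metrics across three tasks.

```python
# Minimal sketch of the Para-Ref idea: enrich one reference with LLM
# paraphrases, then score against the whole set. Not the authors' code;
# the prompt, model choice, and metric are illustrative assumptions.
from typing import List

import sacrebleu              # pip install sacrebleu
from openai import OpenAI     # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def llm_paraphrase(reference: str, n: int = 10) -> List[str]:
    """Ask an LLM for n diverse paraphrases of `reference`, one per line."""
    prompt = (
        f"Paraphrase the following sentence in {n} diverse ways, "
        f"one paraphrase per line, preserving the meaning:\n{reference}"
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    lines = resp.choices[0].message.content.splitlines()
    return [line.strip() for line in lines if line.strip()][:n]


def para_ref_bleu(hypothesis: str, reference: str, n: int = 10) -> float:
    """Score `hypothesis` against the original reference plus n paraphrases."""
    references = [reference] + llm_paraphrase(reference, n)
    # sacrebleu's sentence_bleu accepts a list of references and matches
    # n-grams against all of them, so the hypothesis gets credit for
    # agreeing with any valid phrasing, not just the single gold one.
    return sacrebleu.sentence_bleu(hypothesis, references).score
```

A variant closer to the paper would feed the same enlarged reference set to each automatic metric under study and re-measure its correlation with human judgements, rather than reporting a single BLEU score.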

