KPEval: Towards Fine-grained Semantic-based Evaluation of Keyphrase Extraction and Generation Systems

03/27/2023
by   Di Wu, et al.

Despite significant advances in keyphrase extraction and generation methods, the predominant evaluation approach relies solely on exact matching against human references and disregards reference-free attributes. This scheme fails to credit systems that produce keyphrases semantically equivalent to the references, or keyphrases with practical utility. To better understand the strengths and weaknesses of different keyphrase systems, we propose a comprehensive evaluation framework consisting of six critical dimensions: naturalness, faithfulness, saliency, coverage, diversity, and utility. For each dimension, we discuss the desiderata and design semantic-based metrics aligned with the evaluation objectives. Rigorous meta-evaluation studies demonstrate that our evaluation strategy correlates better with human preferences than a range of previously used metrics. Using this framework, we re-evaluate 18 keyphrase systems and find that (1) the best model differs across dimensions, with pre-trained language models achieving the best results in most of them; (2) utility in downstream tasks does not always correlate well with reference-based metrics; and (3) large language models exhibit strong performance in reference-free evaluation.
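The core idea behind semantic-based evaluation is to replace exact string matching with embedding similarity, so that a prediction like "keyphrase generation" earns partial credit against a reference like "keyphrase extraction". The following is a minimal sketch of that contrast, assuming the sentence-transformers library and the all-MiniLM-L6-v2 model; the function names (exact_f1, semantic_f1) and the greedy max-similarity matching are illustrative assumptions, not KPEval's actual metric definitions.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def exact_f1(predictions, references):
    # Traditional evaluation: F1 over exact string overlap with references.
    pred, ref = set(predictions), set(references)
    tp = len(pred & ref)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(ref)
    return 2 * precision * recall / (precision + recall)

def semantic_f1(predictions, references):
    # Soft matching: score each phrase by its best cosine similarity
    # on the other side, then combine as a harmonic mean.
    p_emb = model.encode(predictions, normalize_embeddings=True)
    r_emb = model.encode(references, normalize_embeddings=True)
    sim = p_emb @ r_emb.T                # cosine similarity matrix
    precision = sim.max(axis=1).mean()   # each prediction vs. best reference
    recall = sim.max(axis=0).mean()      # each reference vs. best prediction
    return 2 * precision * recall / (precision + recall)

preds = ["keyphrase generation", "evaluation metrics"]
refs = ["keyphrase extraction", "automatic evaluation"]
print(exact_f1(preds, refs))     # 0.0: no exact string overlap
print(semantic_f1(preds, refs))  # > 0: credits near-synonymous phrases

Exact matching gives zero credit here even though the predictions are topically close to the references; the soft score captures that partial agreement, which is the gap the semantic metrics above are designed to close.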

Related research

08/06/2023
Towards Multiple References Era – Addressing Data Leakage and Limited Reference Diversity in NLG Evaluation
N-gram matching-based evaluation metrics, such as BLEU and chrF, are wid...

04/27/2023
Large Language Models Are State-of-the-Art Evaluators of Code Generation
Recent advancements in the field of natural language generation have fac...

08/19/2021
Language Model Augmented Relevance Score
Although automated metrics are commonly used to evaluate NLG systems, th...

05/24/2023
Not All Metrics Are Guilty: Improving NLG Evaluation with LLM Paraphrasing
Most research about natural language generation (NLG) relies on evaluati...

09/24/2018
WiRe57: A Fine-Grained Benchmark for Open Information Extraction
We build a reference for the task of Open Information Extraction, on fiv...

05/24/2023
Evaluate What You Can't Evaluate: Unassessable Generated Responses Quality
LLMs (large language models) such as ChatGPT have shown remarkable langu...

08/11/2023
Assessing Guest Nationality Composition from Hotel Reviews
Many hotels target guest acquisition efforts to specific markets in orde...
