Revisiting the Evaluation Metrics of Paraphrase Generation

02/17/2022
by Lingfeng Shen, et al.

Paraphrase generation is an important NLP task that has achieved significant progress recently. However, one crucial problem is overlooked: how should the quality of a paraphrase be evaluated? Most existing paraphrase generation models borrow reference-based metrics (e.g., BLEU) from neural machine translation (NMT) to evaluate their generated paraphrases. The reliability of such metrics has hardly been examined, and they are plausible only when a standard reference exists. This paper therefore first answers a fundamental question: are existing metrics reliable for paraphrase generation? We present two conclusions that contradict conventional wisdom in paraphrase generation: (1) existing metrics align poorly with human annotation in both system-level and segment-level paraphrase evaluation, and (2) reference-free metrics outperform reference-based metrics, indicating that standard references are unnecessary for evaluating a paraphrase's quality. These empirical findings expose the lack of reliable automatic evaluation metrics. This paper therefore proposes BBScore, a reference-free metric that reflects the quality of a generated paraphrase. BBScore consists of two sub-metrics, the S3C score and SelfBLEU, which correspond to the two criteria of paraphrase evaluation: semantic preservation and diversity. By combining the two sub-metrics, BBScore significantly outperforms existing paraphrase evaluation metrics.
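The abstract does not give the exact formula by which BBScore combines its two sub-metrics, but the reference-free idea can be sketched as follows. SelfBLEU is computed between the source sentence and its paraphrase (lower values mean more surface-form diversity), while a semantic-preservation score plays the role of the S3C score. In the sketch below, the S3C score is approximated with sentence-embedding cosine similarity, and the weight `alpha`, the model name, and the `bb_score` combination rule are all illustrative assumptions, not the authors' definitions.

```python
# Hedged sketch of a reference-free paraphrase metric in the spirit of BBScore.
# NOTE: the real S3C score and the exact way BBScore connects its sub-metrics
# are defined in the paper; this is an illustrative stand-in, not the authors' code.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from sentence_transformers import SentenceTransformer, util

_model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model


def self_bleu(source: str, paraphrase: str) -> float:
    """BLEU between the source and its paraphrase (no external reference).

    Lower values indicate more lexical diversity from the input.
    """
    smooth = SmoothingFunction().method1
    return sentence_bleu([source.split()], paraphrase.split(),
                         smoothing_function=smooth)


def semantic_score(source: str, paraphrase: str) -> float:
    """Stand-in for the S3C score: cosine similarity of sentence embeddings,
    used here as a proxy for semantic preservation."""
    emb = _model.encode([source, paraphrase], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()


def bb_score(source: str, paraphrase: str, alpha: float = 0.5) -> float:
    """Illustrative combination: reward semantic preservation, penalize
    lexical copying. `alpha` is a hypothetical weight, not from the paper."""
    return (alpha * semantic_score(source, paraphrase)
            + (1 - alpha) * (1.0 - self_bleu(source, paraphrase)))


# Example: a good paraphrase preserves meaning while changing surface form.
print(bb_score("The cat sat on the mat.", "A cat was sitting on the rug."))
```

Note that neither term requires a gold reference: both sub-scores are computed directly from the source sentence and the candidate paraphrase, which is what makes the metric reference-free.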


Related research

REAM♯: An Enhancement Approach to Reference-based Evaluation Metrics for Open-domain Dialog Generation (05/30/2021)
The lack of reliable automatic evaluation metrics is a major impediment ...

Evaluating for Diversity in Question Generation over Text (08/17/2020)
Generating diverse and relevant questions over text is a task with wides...

Why We Need New Evaluation Metrics for NLG (07/21/2017)
The majority of NLG evaluation relies on automatic metrics, such as BLEU...

Learning to Evaluate the Artness of AI-generated Images (05/08/2023)
Assessing the artness of AI-generated images continues to be a challenge...

An Analysis Method for Metric-Level Switching in Beat Tracking (10/13/2022)
For expressive music, the tempo may change over time, posing challenges ...

On the Limitations of Reference-Free Evaluations of Generated Text (10/22/2022)
There is significant interest in developing evaluation metrics which acc...

Evaluating Subtitle Segmentation for End-to-end Generation Systems (05/19/2022)
Subtitles appear on screen as short pieces of text, segmented based on f...
