Perplexity from PLM Is Unreliable for Evaluating Text Quality

10/12/2022
by Yequan Wang, et al.

Recently, many works have used perplexity (PPL) to evaluate the quality of generated text, on the assumption that a lower PPL indicates higher quality (i.e., fluency). However, we find that PPL is an unqualified referee that cannot evaluate generated text fairly, for the following reasons: (i) the PPL of short text is larger than that of long text, which goes against common sense; (ii) repeated text spans can damage the reliability of PPL; and (iii) punctuation marks can heavily affect PPL. Experiments show that PPL is unreliable for evaluating the quality of a given text. Finally, we discuss the key problems with evaluating text quality using language models.
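To make the length issue concrete, here is a minimal sketch of how PPL is typically computed from a language model's per-token log-probabilities: PPL is the exponential of the mean negative log-probability per token. The numeric log-probabilities below are invented for illustration, not taken from the paper, but they show the mechanism the authors criticize: in longer text, later tokens are often easy to predict given context, which dilutes the per-token mean and pushes PPL down regardless of actual fluency.

```python
import math

def perplexity(token_logprobs):
    """PPL = exp of the mean negative log-probability per token."""
    n = len(token_logprobs)
    return math.exp(-sum(token_logprobs) / n)

# Hypothetical per-token log-probs from a language model (illustrative only):
short_text = [-4.0, -3.5]                     # 2 tokens, little context to help
long_text  = [-4.0, -3.5, -0.5, -0.4, -0.3]   # later tokens easy given context

print(perplexity(short_text))  # higher PPL, ~42.5
print(perplexity(long_text))   # lower PPL, ~5.7, despite the same opening tokens
```

Because the score is a per-token average, simply appending predictable tokens (including repeated spans or certain punctuation) lowers PPL without improving the text, which is one intuition behind the paper's findings.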

Related research

10/06/2020 · RoFT: A Tool for Evaluating Human Detection of Machine-Generated Text
In recent years, large neural networks for natural language generation (...

05/08/2023 · ANALOGICAL - A New Benchmark for Analogy of Long Text for Large Language Models
Over the past decade, analogies, in the form of word-level analogies, ha...

04/17/2023 · An Evaluation on Large Language Model Outputs: Discourse and Memorization
We present an empirical evaluation of various outputs generated by nine ...

04/09/2021 · Towards objectively evaluating the quality of generated medical summaries
We propose a method for evaluating the quality of generated text by aski...

06/07/2023 · Long-form analogies generated by chatGPT lack human-like psycholinguistic properties
Psycholinguistic analyses provide a means of evaluating large language m...

05/19/2023 · Visualizing Linguistic Diversity of Text Datasets Synthesized by Large Language Models
Large language models (LLMs) can be used to generate smaller, more refin...

09/09/2019 · Evaluating Long-form Text-to-Speech: Comparing the Ratings of Sentences and Paragraphs
Text-to-speech systems are typically evaluated on single sentences. When...
