Automating Text Naturalness Evaluation of NLG Systems

06/23/2020
by Erion Çano, et al.

Automatic methods and metrics that assess various quality criteria of automatically generated texts are important for developing NLG systems because they produce repeatable results and allow for a fast development cycle. We present here an attempt to automate the evaluation of text naturalness, a very important characteristic of natural language generation methods. Instead of relying on human participants for scoring or labeling the text samples, we propose to automate the process using a human likeliness metric that we define and a discrimination procedure based on the probability distributions of large pretrained language models. We analyze the text probability fractions and observe how they are influenced by the size of the generative and discriminative models involved in the process. Our results indicate that bigger generators and larger pretrained discriminators are more appropriate for a better evaluation of text naturalness. A comprehensive validation procedure with human participants is required as a follow-up to check how well this automatic evaluation scheme correlates with human judgments.
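As a rough illustration of the kind of discrimination procedure described in the abstract, the sketch below scores a candidate text by the average log-probability a large pretrained language model assigns to its tokens. It is a minimal sketch under stated assumptions, not the paper's exact metric: the model name ("gpt2-large"), the per-token log-probability score, and the example sentence are illustrative choices.

```python
# Minimal sketch (assumed setup, not the authors' implementation) of using a
# pretrained language model as a discriminator that scores how probable a
# text is under the model's distribution.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2-large"  # assumed discriminator; the paper studies several model sizes
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def per_token_log_prob(text: str) -> float:
    """Average log-probability the pretrained LM assigns to the text's tokens."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # With labels == input_ids, the model returns the mean cross-entropy
        # (negative log-likelihood per token) over the sequence.
        outputs = model(**inputs, labels=inputs["input_ids"])
    return -outputs.loss.item()

# Higher (less negative) scores mean the discriminator finds the text more
# probable, i.e. closer to the distribution of human-written text it was
# trained on.
generated_sample = "The city council approved the new budget after a lengthy debate."
print(per_token_log_prob(generated_sample))
```

In practice, such scores for generated samples could be compared against scores for human-written reference texts to decide which samples pass as human-like, which is the role the discriminator plays in the evaluation scheme outlined above.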

Related research

Human or Machine: Automating Human Likeliness Evaluation of NLG Texts (06/05/2020)
Automatic evaluation of various text quality criteria produced by data-d...

Cluster-based Evaluation of Automatically Generated Text (05/31/2022)
While probabilistic language generators have improved dramatically over ...

Judge the Judges: A Large-Scale Evaluation Study of Neural Language Models for Online Review Generation (01/02/2019)
Recent advances in deep learning have resulted in a resurgence in the po...

Do Massively Pretrained Language Models Make Better Storytellers? (09/24/2019)
Large neural language models trained on massive amounts of text have eme...

Challenges in Detoxifying Language Models (09/15/2021)
Large language models (LM) generate remarkably fluent text and can be ef...

Dynamic Human Evaluation for Relative Model Comparisons (12/15/2021)
Collecting human judgements is currently the most reliable evaluation me...

ContextRef: Evaluating Referenceless Metrics For Image Description Generation (09/21/2023)
Referenceless metrics (e.g., CLIPScore) use pretrained vision–language m...
