MISMATCH: Fine-grained Evaluation of Machine-generated Text with Mismatch Error Types

06/18/2023
by Keerthiram Murugesan, et al.

With the growing interest in large language models, evaluating the quality of machine-generated text against reference (typically human-generated) text has become a focal point of attention. Most recent work either focuses on task-specific evaluation metrics or studies which properties of machine-generated text existing metrics capture. In this work, we propose a new evaluation scheme that models human judgments across 7 NLP tasks based on the fine-grained mismatches between a pair of texts. Inspired by recent efforts toward fine-grained evaluation in several NLP tasks, we introduce a set of 13 mismatch error types, such as spatial/geographic errors and entity errors, to guide the model toward better prediction of human judgments. We propose a neural framework for evaluating machine text that uses these mismatch error types as auxiliary tasks and re-purposes existing single-number evaluation metrics as additional scalar features, alongside textual features extracted from the machine and reference texts. Our experiments reveal key insights about existing metrics via the mismatch errors. We show that the mismatch errors between sentence pairs on held-out datasets from the 7 NLP tasks align well with human evaluation.
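As a rough illustration (not the authors' released implementation), the sketch below shows one way such a framework could be wired up: a pretrained encoder reads the (machine, reference) text pair, existing metric scores enter as scalar features, and two heads jointly predict the human-judgment score and the 13 mismatch error types as auxiliary outputs. The encoder choice (roberta-base), the assumed metric set, and all layer sizes and names are illustrative assumptions.

# Minimal sketch of a multi-task evaluator combining textual and scalar metric features.
# All hyperparameters and names are assumptions, not the paper's exact architecture.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

NUM_ERROR_TYPES = 13      # mismatch error types (spatial/geographic, entity, ...)
NUM_METRIC_FEATURES = 4   # assumed scalar metrics, e.g. BLEU, ROUGE-L, METEOR, BERTScore

class MismatchEvaluator(nn.Module):
    def __init__(self, encoder_name="roberta-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        # Shared projection over the pair representation plus metric scalars
        self.proj = nn.Sequential(
            nn.Linear(hidden + NUM_METRIC_FEATURES, 256),
            nn.ReLU(),
        )
        self.quality_head = nn.Linear(256, 1)               # human-judgment score
        self.error_head = nn.Linear(256, NUM_ERROR_TYPES)   # auxiliary mismatch-type logits

    def forward(self, input_ids, attention_mask, metric_feats):
        # First token embedding of the jointly encoded (machine, reference) pair
        cls = self.encoder(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state[:, 0]
        h = self.proj(torch.cat([cls, metric_feats], dim=-1))
        return self.quality_head(h).squeeze(-1), self.error_head(h)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
batch = tokenizer(["the cat sat on the mat"],        # machine text
                  ["a cat is sitting on the mat"],   # reference text
                  return_tensors="pt", padding=True, truncation=True)
model = MismatchEvaluator()
score, error_logits = model(batch["input_ids"], batch["attention_mask"],
                            torch.zeros(1, NUM_METRIC_FEATURES))
# Training would combine a regression loss on `score` against human judgments with a
# multi-label loss (e.g. BCEWithLogitsLoss) on `error_logits` for the auxiliary tasks.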


