CushLEPOR: Customised hLEPOR Metric Using LABSE Distilled Knowledge Model to Improve Agreement with Human Judgements

08/21/2021
by Lifeng Han, et al.

Human evaluation has always been expensive, while researchers struggle to trust automatic metrics. To address this, we propose to customise traditional metrics by taking advantage of pre-trained language models (PLMs) and the limited human-labelled scores that are available. We first re-introduce the hLEPOR metric factors, followed by the portable Python version we developed, which enables automatic tuning of the weighting parameters in the hLEPOR metric. We then present customised hLEPOR (cushLEPOR), which uses the LaBSE distilled knowledge model to improve the metric's agreement with human judgements by automatically optimising the factor weights for the exact MT language pairs that cushLEPOR is deployed on. We also optimise cushLEPOR towards human evaluation data based on the MQM and pSQM frameworks for the English-German and Chinese-English language pairs. Experimental investigations show that cushLEPOR boosts hLEPOR's agreement with PLMs such as LaBSE at much lower cost, improves its agreement with human evaluations including MQM and pSQM scores, and performs much better than BLEU (data available at <https://github.com/poethan/cushLEPOR>).

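The weight-tuning idea described in the abstract, i.e. searching for hLEPOR factor weights that maximise agreement with LaBSE-based or human (MQM/pSQM) scores, can be sketched as a small hyperparameter search. The snippet below is a minimal illustration using Optuna with a Pearson-correlation objective; `toy_hlepor`, the parameter names, and the search ranges are placeholder assumptions for illustration only, not the authors' implementation (see the repository linked above for the actual code).

```python
# Minimal sketch, assuming a simplified hLEPOR-style scorer: tune its weighting
# parameters so the metric correlates with target scores (e.g. LaBSE similarity
# or MQM/pSQM human judgements). `toy_hlepor` is a crude stand-in for the real
# portable hLEPOR scorer; parameter names and ranges are illustrative assumptions.
from collections import Counter

import optuna
from scipy.stats import pearsonr


def toy_hlepor(hyp, ref, alpha, beta, weight_lp, weight_hpr):
    """Crude stand-in: length penalty combined with a weighted harmonic mean of
    unigram precision/recall. The real hLEPOR also includes an n-gram
    position-difference penalty, omitted here for brevity."""
    hyp_toks, ref_toks = hyp.split(), ref.split()
    overlap = sum((Counter(hyp_toks) & Counter(ref_toks)).values())
    precision = overlap / max(len(hyp_toks), 1)
    recall = overlap / max(len(ref_toks), 1)
    hpr = ((alpha + beta) * precision * recall) / max(alpha * precision + beta * recall, 1e-9)
    lp = min(len(hyp_toks), len(ref_toks)) / max(len(hyp_toks), len(ref_toks), 1)
    # Weighted harmonic mean of the two factors.
    return (weight_lp + weight_hpr) / (weight_lp / max(lp, 1e-9) + weight_hpr / max(hpr, 1e-9))


def tune_weights(hyps, refs, target_scores, n_trials=200):
    """Search for parameter values maximising Pearson correlation with target_scores."""
    def objective(trial):
        params = {
            "alpha": trial.suggest_float("alpha", 1.0, 5.0),
            "beta": trial.suggest_float("beta", 1.0, 5.0),
            "weight_lp": trial.suggest_float("weight_lp", 1.0, 5.0),
            "weight_hpr": trial.suggest_float("weight_hpr", 1.0, 10.0),
        }
        metric = [toy_hlepor(h, r, **params) for h, r in zip(hyps, refs)]
        corr, _ = pearsonr(metric, target_scores)
        return corr

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=n_trials)
    return study.best_params
```

Once the best parameters are found for a given language pair, they would simply be fixed and reused whenever the metric is applied to that pair, which is the "customisation" the paper describes.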
