Large Language Models Are State-of-the-Art Evaluators of Translation Quality

02/28/2023
by Tom Kocmi, et al.

We describe GEMBA, a GPT-based metric for assessing translation quality that works both with and without a reference translation. Our evaluation focuses on zero-shot prompting: we compare four prompt variants in two modes, depending on whether a reference is available. We investigate seven versions of GPT models, including ChatGPT, and show that our method works only with GPT 3.5 and larger models. Compared with results from the WMT22 Metrics shared task, our method achieves state-of-the-art accuracy in both modes when evaluated against MQM-based human labels. Our results are valid on the system level for all three WMT22 Metrics shared task language pairs, namely English into German, English into Russian, and Chinese into English. This provides a first glimpse into the usefulness of pre-trained, generative large language models for quality assessment of translations. We publicly release all code and prompt templates used for the experiments described in this work, as well as all corresponding scoring results, to allow for external validation and reproducibility.
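
To make the two modes concrete, here is a minimal sketch of how a zero-shot scoring prompt might be assembled with or without a reference, and how a numeric score might be read back from the model's reply. The function names (`build_prompt`, `parse_score`), the prompt wording, and the 0 to 100 scale are illustrative assumptions, not the released GEMBA templates; the call to the language model itself is deliberately left out.

```python
import re
from typing import Optional


def build_prompt(source: str, hypothesis: str, src_lang: str, tgt_lang: str,
                 reference: Optional[str] = None) -> str:
    """Assemble a zero-shot scoring prompt (hypothetical wording).

    If `reference` is given, the prompt is reference-based;
    otherwise it is reference-free (quality-estimation mode).
    """
    with_ref = reference is not None
    lines = [
        f"Score the following translation from {src_lang} to {tgt_lang}"
        + (" with respect to the human reference" if with_ref else "")
        + " on a scale from 0 to 100, where 0 means no meaning preserved"
        + " and 100 means a perfect translation.",
        "",
        f"{src_lang} source: \"{source}\"",
    ]
    if with_ref:
        lines.append(f"{tgt_lang} human reference: \"{reference}\"")
    lines.append(f"{tgt_lang} translation: \"{hypothesis}\"")
    lines.append("Score:")
    return "\n".join(lines)


def parse_score(reply: str) -> Optional[float]:
    """Return the first number in the model reply, or None if the reply
    has no number or the number falls outside the assumed 0-100 range."""
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        return None
    score = float(match.group())
    return score if 0.0 <= score <= 100.0 else None


# Reference-free mode: only the source and the candidate translation.
prompt_qe = build_prompt("Guten Morgen.", "Good morning.", "German", "English")

# Reference-based mode: a human reference is added to the prompt.
prompt_ref = build_prompt("Guten Morgen.", "Good morning.", "German", "English",
                          reference="Good morning.")
```

Under these assumptions, segment scores parsed this way could be averaged per system to obtain the system-level ranking that the abstract evaluates against MQM-based human labels.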


research
09/22/2022

Approaching English-Polish Machine Translation Quality Assessment with Neural-based Methods

This paper presents our contribution to the PolEval 2021 Task 2: Evaluat...
research
10/08/2020

Learning to Evaluate Translation Beyond English: BLEURT Submissions to the WMT Metrics 2020 Shared Task

The quality of machine translation systems has dramatically improved ove...
research
10/21/2022

SIT at MixMT 2022: Fluent Translation Built on Giant Pre-trained Models

This paper describes the Stevens Institute of Technology's submission fo...
research
08/20/2020

Lite Training Strategies for Portuguese-English and English-Portuguese Translation

Despite the widespread adoption of deep learning for machine translation...
research
12/20/2022

DocAsRef: A Pilot Empirical Study on Repurposing Reference-Based Summary Quality Metrics Reference-Freely

Summary quality assessment metrics have two categories: reference-based ...
research
03/10/2022

A new approach to calculating BERTScore for automatic assessment of translation quality

The study of the applicability of the BERTScore metric was conducted to ...
