Are Large Language Model-based Evaluators the Solution to Scaling Up Multilingual Evaluation?

09/14/2023
by   Rishav Hada, et al.
0

Large Language Models (LLMs) have demonstrated impressive performance on Natural Language Processing (NLP) tasks, such as Question Answering, Summarization, and Classification. The use of LLMs as evaluators, that can rank or score the output of other models (usually LLMs) has become increasingly popular, due to the limitations of current evaluation techniques including the lack of appropriate benchmarks, metrics, cost, and access to human annotators. While LLMs are capable of handling approximately 100 languages, the majority of languages beyond the top 20 lack systematic evaluation across various tasks, metrics, and benchmarks. This creates an urgent need to scale up multilingual evaluation to ensure a precise understanding of LLM performance across diverse languages. LLM-based evaluators seem like the perfect solution to this problem, as they do not require human annotators, human-created references, or benchmarks and can theoretically be used to evaluate any language covered by the LLM. In this paper, we investigate whether LLM-based evaluators can help scale up multilingual evaluation. Specifically, we calibrate LLM-based evaluation against 20k human judgments of five metrics across three text-generation tasks in eight languages. Our findings indicate that LLM-based evaluators may exhibit bias towards higher scores and should be used with caution and should always be calibrated with a dataset of native speaker judgments, particularly in low-resource and non-Latin script languages.

READ FULL TEXT

page 9

page 10

page 11

page 12

page 13

page 15

page 23

page 24

research
09/14/2023

SIB-200: A Simple, Inclusive, and Big Evaluation Dataset for Topic Classification in 200+ Languages and Dialects

Despite the progress we have recorded in the last few years in multiling...
research
04/12/2023

ChatGPT Beyond English: Towards a Comprehensive Evaluation of Large Language Models in Multilingual Learning

Over the last few years, large language models (LLMs) have emerged as th...
research
04/01/2021

Low-Resource Language Modelling of South African Languages

Language models are the foundation of current neural network-based model...
research
09/02/2023

Multilingual Text Representation

Modern NLP breakthrough includes large multilingual models capable of pe...
research
05/28/2023

Breaking Language Barriers with a LEAP: Learning Strategies for Polyglot LLMs

Large language models (LLMs) are at the forefront of transforming numero...
research
10/18/2021

BEAMetrics: A Benchmark for Language Generation Evaluation Evaluation

Natural language processing (NLP) systems are increasingly trained to ge...
research
10/24/2019

Capacity, Bandwidth, and Compositionality in Emergent Language Learning

Many recent works have discussed the propensity, or lack thereof, for em...

Please sign up or login with your details

Forgot password? Click here to reset