Evaluating task understanding through multilingual consistency: A ChatGPT case study

05/19/2023
by Xenia Ohmer, et al.

With the capabilities of large language models (LLMs) increasing at a staggering pace, creating future-proof evaluation sets to assess their understanding becomes more and more challenging. In this paper, we propose a novel evaluation paradigm for LLMs that leverages the idea that correct world understanding should be consistent across different (Fregean) senses of the same meaning. Accordingly, we measure understanding not in terms of correctness but by evaluating consistency across multiple senses that are generated by the model itself. We showcase our approach by instantiating a test in which the different senses are different languages, thus using multilingual self-consistency as a litmus test for the model's understanding while simultaneously addressing the important topic of multilingualism. Taking one of the latest versions of ChatGPT as our object of study, we evaluate multilingual consistency on two different tasks across three different languages. We show that its multilingual consistency is still lacking, and that its task and world understanding are therefore not language-independent. Because our approach does not require any static evaluation corpora in languages other than English, it can easily and cheaply be extended to new languages and tasks, and could become an integral part of future benchmarking efforts.
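To make the idea concrete, here is a minimal sketch of such a multilingual self-consistency check. The helper name `ask_model`, the back-translation step, and the exact-match agreement score are illustrative assumptions rather than the paper's actual implementation; the tasks and metrics used in the study are described in the full text.

```python
# Minimal sketch of a multilingual self-consistency check, assuming an
# `ask_model` callable that wraps a chat LLM and returns its answer as text.
# Helper names and the agreement metric are illustrative, not from the paper.

from itertools import combinations
from typing import Callable

def multilingual_consistency(
    ask_model: Callable[[str], str],  # one LLM call: prompt in, answer out
    prompt_en: str,                   # English task prompt (e.g., one test item)
    languages: list[str],             # target languages, e.g. ["German", "Japanese"]
) -> float:
    """Fraction of language pairs on which the model gives the same answer."""
    answers = {"English": ask_model(prompt_en)}
    for lang in languages:
        # The model itself generates the other-language "senses" of the prompt,
        # so no static non-English evaluation corpus is required.
        translated = ask_model(f"Translate the following into {lang}:\n{prompt_en}")
        answer = ask_model(translated)
        # Map the answer back to English so the labels are comparable.
        answers[lang] = ask_model(f"Translate the following into English:\n{answer}")

    pairs = list(combinations(answers.values(), 2))
    agree = sum(a.strip().lower() == b.strip().lower() for a, b in pairs)
    return agree / len(pairs)
```

Because the model generates the translations itself, extending the test to a new language only costs additional queries, which is what makes the paradigm cheap compared to building static multilingual evaluation sets.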

