Dialect-robust Evaluation of Generated Text

11/02/2022
by Jiao Sun et al.

Evaluation metrics that are not robust to dialect variation make it impossible to tell how well systems perform for many groups of users, and can even penalize systems for producing text in lower-resource dialects. However, there is currently no way to quantify how metrics respond to changes in the dialect of a generated utterance. We thus formalize dialect robustness and dialect awareness as goals for NLG evaluation metrics. We introduce a suite of methods and corresponding statistical tests that can be used to assess metrics in light of these two goals. Applying the suite to current state-of-the-art metrics, we demonstrate that they are not dialect-robust: semantic perturbations frequently lead to smaller decreases in a metric than the introduction of dialect features. As a first step toward overcoming this limitation, we propose a training schema, NANO, which introduces regional and language information into the pretraining process of a metric. We demonstrate that NANO provides a size-efficient way for models to improve dialect robustness while simultaneously improving their performance on the standard metric benchmark.
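The core failure mode can be illustrated with a toy comparison. The sketch below uses a simple token-overlap F1 as a stand-in for a learned evaluation metric, and hypothetical example sentences (not from the paper): a dialect variant that preserves meaning and a semantic perturbation that changes it. A dialect-robust metric should score the dialect variant at least as high as the meaning-changing edit; a surface-similarity metric does the opposite.

```python
from collections import Counter

def token_f1(reference: str, hypothesis: str) -> float:
    """Token-overlap F1 between reference and hypothesis.

    A deliberately simple surface metric, standing in for learned
    metrics such as BERTScore in this illustration.
    """
    ref = Counter(reference.lower().split())
    hyp = Counter(hypothesis.lower().split())
    overlap = sum((ref & hyp).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Hypothetical examples for illustration only.
reference = "I'm coming to the party tonight"
dialect_variant = "I'm fixin' to come to the party tonight"  # meaning preserved
semantic_perturb = "I'm coming to the party tomorrow"        # meaning changed

d = token_f1(reference, dialect_variant)
s = token_f1(reference, semantic_perturb)
print(f"dialect variant:       {d:.3f}")
print(f"semantic perturbation: {s:.3f}")
# The surface metric scores the meaning-changing perturbation HIGHER
# than the meaning-preserving dialect variant, i.e. it is not
# dialect-robust in the sense formalized above.
```

The paper's statistical tests generalize this pairwise comparison across many dialect features and perturbation types; the sketch only shows why a single comparison can already expose the bias.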


