Social Biases in Automatic Evaluation Metrics for NLG

10/17/2022
by   Mingqi Gao, et al.

Many studies have revealed that word embeddings, language models, and models for specific downstream NLP tasks are prone to social biases, especially gender bias. Recently, these techniques have gradually been applied to automatic evaluation metrics for text generation. In this paper, we propose an evaluation method based on the Word Embeddings Association Test (WEAT) and the Sentence Embeddings Association Test (SEAT) to quantify social biases in evaluation metrics, and we discover that social biases are also widely present in some model-based automatic evaluation metrics. Moreover, we construct gender-swapped meta-evaluation datasets to explore the potential impact of gender bias in image captioning and text summarization tasks. Results show that, given gender-neutral references, model-based evaluation metrics may show a preference for the male hypothesis, and their performance, i.e., the correlation between evaluation metrics and human judgments, usually varies more significantly after gender swapping.
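The WEAT statistic that such bias tests build on measures how strongly two sets of target embeddings (X, Y) associate with two sets of attribute embeddings (A, B). As a rough illustration of the quantity being measured, here is a minimal sketch of the WEAT effect size over toy 2-D vectors; the function names and vectors are illustrative, not taken from the paper:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def assoc(w, A, B):
    """s(w, A, B): mean cosine similarity of w to attribute set A minus set B."""
    return (sum(cosine(w, a) for a in A) / len(A)
            - sum(cosine(w, b) for b in B) / len(B))

def weat_effect_size(X, Y, A, B):
    """WEAT effect size: difference of mean target-set associations,
    normalized by the std. dev. of s(w, A, B) over all w in X ∪ Y."""
    sx = [assoc(x, A, B) for x in X]
    sy = [assoc(y, A, B) for y in Y]
    s_all = sx + sy
    mean_all = sum(s_all) / len(s_all)
    std = math.sqrt(sum((s - mean_all) ** 2 for s in s_all) / (len(s_all) - 1))
    return (sum(sx) / len(sx) - sum(sy) / len(sy)) / std

# Toy 2-D "embeddings": X aligns with attribute set A and Y with B,
# so the effect size comes out positive (a biased association).
A = [[1.0, 0.0], [1.0, 0.1]]   # e.g. one gendered attribute set
B = [[0.0, 1.0], [0.1, 1.0]]   # e.g. the other gendered attribute set
X = [[0.9, 0.1], [1.0, 0.2]]   # e.g. one target-word set
Y = [[0.1, 0.9], [0.2, 1.0]]   # e.g. the other target-word set
print(weat_effect_size(X, Y, A, B))  # positive for these toy vectors
```

SEAT applies the same statistic to sentence embeddings by inserting the target and attribute words into neutral template sentences and encoding those instead.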


Related research

05/24/2023
Gender Biases in Automatic Evaluation Metrics: A Case Study on Image Captioning
Pretrained model-based evaluation metrics have demonstrated strong perfo...

04/15/2021
Unmasking the Mask – Evaluating Social Biases in Masked Language Models
Masked Language Models (MLMs) have shown superior performances in numero...

08/21/2023
FairBench: A Four-Stage Automatic Framework for Detecting Stereotypes and Biases in Large Language Models
Detecting stereotypes and biases in Large Language Models (LLMs) can enh...

05/31/2022
Cluster-based Evaluation of Automatically Generated Text
While probabilistic language generators have improved dramatically over ...

02/07/2023
Auditing Gender Presentation Differences in Text-to-Image Models
Text-to-image models, which can generate high-quality images based on te...

04/11/2023
Approximating Human Evaluation of Social Chatbots with Prompting
Once powerful conversational models have become available for a wide aud...

11/15/2021
Evaluating Metrics for Bias in Word Embeddings
Over the last years, word and sentence embeddings have established as te...
