Gender Biases in Automatic Evaluation Metrics: A Case Study on Image Captioning

05/24/2023
by Haoyi Qiu, et al.

Pretrained model-based evaluation metrics correlate strongly with human judgments in various natural language generation tasks such as image captioning. Despite these results, their impact on fairness is under-explored: it is widely acknowledged that pretrained models can encode societal biases, and using them for evaluation may inadvertently manifest and potentially amplify those biases. In this paper, we conduct a systematic study of gender biases in model-based evaluation metrics, with a focus on image captioning tasks. Specifically, we first identify and quantify gender biases in different evaluation metrics with respect to profession, activity, and object concepts. We then demonstrate the negative consequences of using these biased metrics, such as favoring biased generation models in deployment and propagating the biases to generation models through reinforcement learning. We also present a simple but effective alternative that reduces gender biases by combining n-gram matching-based and pretrained model-based evaluation metrics.
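
The abstract does not say how the two kinds of metric are combined. As a rough illustration only, the sketch below assumes a weighted average of an n-gram overlap score and a model-based score. The function names, the clipped n-gram precision used as the n-gram metric, and the `weight` parameter are all hypothetical and are not taken from the paper; the model-based score is passed in as a plain number rather than computed.

```python
from collections import Counter


def ngram_overlap(candidate: str, reference: str, n: int = 1) -> float:
    """Clipped n-gram precision of a candidate caption against one reference.

    Hypothetical stand-in for an n-gram matching-based metric.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(cand_ngrams.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref_ngrams[gram]) for gram, count in cand_ngrams.items())
    return matched / total


def combined_score(ngram_score: float, model_score: float, weight: float = 0.5) -> float:
    """Weighted average of the two scores (assumed combination rule, not the paper's)."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in [0, 1]")
    return weight * ngram_score + (1.0 - weight) * model_score


if __name__ == "__main__":
    reference = "a woman is cooking"
    candidate = "a man is cooking"
    lexical = ngram_overlap(candidate, reference)  # 3 of 4 unigrams match
    # 0.5 is a made-up placeholder for a pretrained model-based score.
    print(combined_score(lexical, model_score=0.5, weight=0.5))
```

In this toy case the gendered word mismatch lowers the lexical term regardless of what the model-based score does, which is the intuition behind mixing the two signals; how the weight should be set is left open here.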


