
Mitigating Biases in Toxic Language Detection through Invariant Rationalization

by   Yung-Sung Chuang, et al.

Automatic detection of toxic language plays an essential role in protecting social media users, especially minority groups, from verbal abuse. However, biases toward certain attributes, including gender, race, and dialect, exist in most training datasets for toxicity detection. These biases make the learned models unfair and can even exacerbate the marginalization of the very groups they are meant to protect. Because current debiasing methods for general natural language understanding tasks cannot effectively mitigate the biases in toxicity detectors, we propose to use invariant rationalization (InvRat), a game-theoretic framework consisting of a rationale generator and a predictor, to rule out spurious correlations between certain syntactic patterns (e.g., identity mentions, dialect) and toxicity labels. We empirically show that our method yields a lower false positive rate on both lexical and dialectal attributes than previous debiasing methods.
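The core idea of InvRat is that a rationale is invariant if knowing the "environment" (e.g., the dialect or identity-mention group of a post) does not help predict toxicity beyond what the rationale already provides. A minimal sketch of that invariance penalty is below; the function name, arguments, and the hinge-style penalty form are illustrative assumptions, not the paper's exact formulation, and the real method trains the generator and two predictors adversarially rather than computing a single static loss.

```python
def invrat_objective(loss_env_agnostic: float,
                     loss_env_aware: float,
                     lam: float = 1.0) -> float:
    """Sketch of an InvRat-style objective for the rationale generator.

    loss_env_agnostic: loss of a predictor that sees only the rationale.
    loss_env_aware:    loss of a predictor that sees the rationale AND the
                       environment label (e.g., dialect group).
    lam:               weight on the invariance penalty (assumed hyperparameter).
    """
    # If the environment-aware predictor does better (lower loss), the
    # rationale still leaks environment-correlated, spurious cues, so the
    # generator is penalized by the gap; otherwise the penalty is zero.
    gap = max(loss_env_agnostic - loss_env_aware, 0.0)
    return loss_env_agnostic + lam * gap
```

In this toy form, a rationale whose environment-aware predictor gains nothing (`loss_env_aware >= loss_env_agnostic`) incurs no penalty, which is the invariance condition the generator is pushed toward.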



