Entropy-based Attention Regularization Frees Unintended Bias Mitigation from Lists

03/17/2022
by Giuseppe Attanasio, et al.

Natural Language Processing (NLP) models risk overfitting to specific terms in the training data, thereby reducing their performance, fairness, and generalizability. For example, neural hate speech detection models are strongly influenced by identity terms such as "gay" or "women", resulting in false positives, severe unintended bias, and lower performance. Most mitigation techniques use lists of identity terms or samples from the target domain during training. However, this approach requires a priori knowledge and introduces further bias if important terms are neglected. Instead, we propose a knowledge-free Entropy-based Attention Regularization (EAR) to discourage overfitting to training-specific terms. An additional objective function penalizes tokens with low self-attention entropy. We fine-tune BERT via EAR: the resulting model matches or exceeds state-of-the-art performance for hate speech classification and bias metrics on three benchmark corpora in English and Italian. EAR also reveals overfitting terms, i.e., the terms most likely to induce bias, helping to identify their effect on the model, task, and predictions.
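
The following is a minimal PyTorch sketch of the entropy-penalty idea described above, using Hugging Face transformers; it is not the authors' released implementation. The helper mean_attention_entropy, the strength ALPHA, and the choice to average entropy over heads, layers, and non-padding tokens are illustrative assumptions.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

ALPHA = 0.01  # illustrative regularization strength (assumption, not from the paper)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2, output_attentions=True
)

def mean_attention_entropy(attentions, attention_mask):
    """Average self-attention entropy over layers, heads, and real tokens.

    `attentions` is a tuple with one tensor per layer, each of shape
    (batch, heads, seq_len, seq_len); every row is a probability
    distribution over the attended tokens.
    """
    valid = attention_mask.float()                                    # (batch, seq)
    per_layer = []
    for layer_att in attentions:
        # H(a_i) = -sum_j a_ij * log(a_ij): one entropy per query token
        ent = -(layer_att * torch.log(layer_att + 1e-9)).sum(dim=-1)  # (batch, heads, seq)
        ent = ent.mean(dim=1)                                         # average over heads
        per_layer.append((ent * valid).sum() / valid.sum())           # skip padding tokens
    return torch.stack(per_layer).mean()                              # average over layers

# One training step: subtracting the mean entropy from the loss penalizes
# tokens whose attention collapses onto few terms (low entropy) and pushes
# the model toward more diffuse, context-aware attention.
batch = tokenizer(["an example input sentence"], return_tensors="pt", padding=True)
labels = torch.tensor([0])
out = model(**batch, labels=labels)
reg = mean_attention_entropy(out.attentions, batch["attention_mask"])
loss = out.loss - ALPHA * reg
loss.backward()
```

Because the entropy term is differentiable, it trains jointly with the classification loss; per-token entropies can also be inspected after training to surface the overfitting terms the abstract mentions.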

Related research

09/05/2023 · Weigh Your Own Words: Improving Hate Speech Counter Narrative Generation via Attention Regularization
Recent computational approaches for combating online hate speech involve...

05/27/2022 · StereoKG: Data-Driven Knowledge Graph Construction for Cultural Knowledge and Stereotypes
Analyzing ethnic or religious bias is important for improving fairness, ...

08/03/2021 · Improving Counterfactual Generation for Fair Hate Speech Detection
Bias mitigation approaches reduce models' dependence on sensitive featur...

05/05/2020 · Contextualizing Hate Speech Classifiers with Post-hoc Explanation
Hate speech classifiers trained on imbalanced datasets struggle to deter...

05/13/2020 · Mitigating Gender Bias Amplification in Distribution by Posterior Regularization
Advanced machine learning techniques have boosted the performance of nat...

11/01/2022 · Why Is It Hate Speech? Masked Rationale Prediction for Explainable Hate Speech Detection
In a hate speech detection model, we should consider two critical aspect...

10/07/2022 · A Keyword Based Approach to Understanding the Overpenalization of Marginalized Groups by English Marginal Abuse Models on Twitter
Harmful content detection models tend to have higher false positive rate...
