Contextualizing Hate Speech Classifiers with Post-hoc Explanation

05/05/2020
by Brendan Kennedy, et al.

Hate speech classifiers trained on imbalanced datasets struggle to determine whether group identifiers like "gay" or "black" are used in offensive or prejudiced ways. This bias manifests as false positives when such identifiers are present, because the models fail to learn the contexts that constitute hateful usage of identifiers. We extract post-hoc explanations from fine-tuned BERT classifiers to detect bias towards identity terms. We then propose a novel regularization technique based on these explanations that encourages models to learn from the context of group identifiers in addition to the identifiers themselves. Our approach improves over baselines in limiting false positives on out-of-domain data while maintaining or improving in-domain performance.
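
To make the idea of explanation-based regularization concrete, here is a minimal sketch under stated assumptions. It is not the authors' implementation: the abstract does not name the explanation algorithm, so simple token occlusion stands in for it, and a toy bag-of-words scorer stands in for the fine-tuned BERT classifier. The names `WEIGHTS`, `IDENTITY_TERMS`, `occlusion_importance`, `regularized_loss`, and the strength `alpha` are all hypothetical.

```python
import math

# Hypothetical toy weights for a bag-of-words scorer; a stand-in for a
# fine-tuned classifier, chosen only for illustration.
WEIGHTS = {"gay": 2.0, "hate": 1.5, "proud": -0.5}

# Illustrative list of group identifiers whose importance is penalized.
IDENTITY_TERMS = {"gay", "black"}


def score(tokens):
    """Logit for the 'hate' class: the sum of per-token weights."""
    return sum(WEIGHTS.get(t, 0.0) for t in tokens)


def occlusion_importance(tokens, index):
    """Post-hoc importance of one token: logit change when it is removed."""
    occluded = tokens[:index] + tokens[index + 1:]
    return score(tokens) - score(occluded)


def cross_entropy(logit, label):
    """Binary cross-entropy of a logit against a 0/1 label."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -math.log(p) if label == 1 else -math.log(1.0 - p)


def regularized_loss(tokens, label, alpha=0.1):
    """Classification loss plus a penalty on identity-term importance.

    The penalty pushes the importance attributed to identifiers toward
    zero, so the prediction has to rely on the surrounding context.
    """
    base = cross_entropy(score(tokens), label)
    penalty = sum(
        occlusion_importance(tokens, i) ** 2
        for i, t in enumerate(tokens)
        if t in IDENTITY_TERMS
    )
    return base + alpha * penalty


tokens = ["proud", "to", "be", "gay"]
plain = cross_entropy(score(tokens), 0)
regularized = regularized_loss(tokens, 0, alpha=0.1)
```

In this toy setting the non-hateful example is pushed toward the hate class by the identifier alone, and the added penalty grows with how much the explanation attributes to that identifier. In a real training loop the penalty would be differentiated along with the classification loss.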


