Log In Sign Up

Self-Diagnosis and Self-Debiasing: A Proposal for Reducing Corpus-Based Bias in NLP

by   Timo Schick, et al.

When trained on large, unfiltered crawls from the internet, language models pick up and reproduce all kinds of undesirable biases that can be found in the data: they often generate racist, sexist, violent or otherwise toxic language. As large models often require millions of training examples to achieve good performance, it is difficult to completely prevent them from being exposed to such content. In this paper, we investigate whether pretrained language models at least know when they exhibit some undesirable bias or produce toxic content. Based on our findings, we propose a decoding algorithm that reduces the probability of a model producing problematic text given only a textual description of the undesired behavior. This algorithm does not rely on manually curated word lists, nor does it require any training data or changes to the model's parameters. While our approach does by no means eliminate the issue of language models generating biased text, we believe it to be an important step in this direction.


page 1

page 2

page 3

page 4


Leashing the Inner Demons: Self-Detoxification for Language Models

Language models (LMs) can reproduce (or amplify) toxic language seen dur...

Word Order Does Matter (And Shuffled Language Models Know It)

Recent studies have shown that language models pretrained and/or fine-tu...

Measuring Geographic Performance Disparities of Offensive Language Classifiers

Text classifiers are applied at scale in the form of one-size-fits-all s...

Detoxifying Language Models with a Toxic Corpus

Existing studies have investigated the tendency of autoregressive langua...

The Birth of Bias: A case study on the evolution of gender bias in an English language model

Detecting and mitigating harmful biases in modern language models are wi...

Mitigating Political Bias in Language Models Through Reinforced Calibration

Current large-scale language models can be politically biased as a resul...

Debiased Large Language Models Still Associate Muslims with Uniquely Violent Acts

Recent work demonstrates a bias in the GPT-3 model towards generating vi...

Code Repositories


This repository contains the code for "Self-Diagnosis and Self-Debiasing: A Proposal for Reducing Corpus-Based Bias in NLP".

view repo