LEACE: Perfect linear concept erasure in closed form

06/06/2023
by Nora Belrose, et al.

Concept erasure aims to remove specified features from a representation. It can improve fairness (e.g. preventing a classifier from using gender or race) and interpretability (e.g. removing a concept to observe changes in model behavior). We introduce LEAst-squares Concept Erasure (LEACE), a closed-form method which provably prevents all linear classifiers from detecting a concept while changing the representation as little as possible, as measured by a broad class of norms. We apply LEACE to large language models with a novel procedure called "concept scrubbing," which erases target concept information from every layer in the network. We demonstrate our method on two tasks: measuring the reliance of language models on part-of-speech information, and reducing gender bias in BERT embeddings. Code is available at https://github.com/EleutherAI/concept-erasure.
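The closed form mentioned above can be illustrated with a small NumPy sketch. It assumes the eraser has the form r(x) = x - W^+ P W (x - E[x]), where W is the whitening transform (the pseudoinverse of the square root of the covariance of x) and P is the orthogonal projection onto the column space of W times the cross-covariance between x and the concept labels z; this form is taken from the full paper rather than from the abstract. The names fit_leace and erase are hypothetical and are not the API of the linked concept-erasure package, which remains the reference implementation.

```python
import numpy as np

def fit_leace(X, Z, tol=1e-10):
    """Fit a LEACE-style eraser (illustrative sketch, not the official code).

    X: (n, d) array of representations.
    Z: (n,) or (n, k) array of concept labels (e.g. one-hot or 0/1).
    Returns (mu_x, P) such that erase(x) = x - P (x - mu_x).
    """
    X = np.asarray(X, dtype=np.float64)
    Z = np.asarray(Z, dtype=np.float64).reshape(len(X), -1)
    n = len(X)

    mu_x = X.mean(axis=0)
    Xc = X - mu_x
    Zc = Z - Z.mean(axis=0)

    sigma_xx = Xc.T @ Xc / n   # covariance of X, shape (d, d)
    sigma_xz = Xc.T @ Zc / n   # cross-covariance of X and Z, shape (d, k)

    # Whitening matrix W = (sigma_xx^{1/2})^+ and its pseudoinverse,
    # built from the eigendecomposition of the symmetric PSD covariance.
    evals, evecs = np.linalg.eigh(sigma_xx)
    keep = evals > tol * evals.max()
    V, s = evecs[:, keep], np.sqrt(evals[keep])
    W = (V / s) @ V.T
    W_pinv = (V * s) @ V.T

    # Orthonormal basis U for the column space of W @ sigma_xz;
    # U @ U.T is the orthogonal projection onto that subspace.
    U, sv, _ = np.linalg.svd(W @ sigma_xz, full_matrices=False)
    U = U[:, sv > tol]

    P = W_pinv @ (U @ U.T) @ W
    return mu_x, P

def erase(X, mu_x, P):
    """Apply the fitted eraser to the rows of X."""
    return X - (X - mu_x) @ P.T

# Small synthetic demo: a binary concept leaks into two coordinates.
rng = np.random.default_rng(0)
z = rng.integers(0, 2, size=1000)
x = rng.normal(size=(1000, 5))
x[:, 0] += 2.0 * z
x[:, 2] -= 1.0 * z

mu_x, P = fit_leace(x, z)
x_erased = erase(x, mu_x, P)
```

On the data used for fitting, the erased representations have zero empirical cross-covariance with the labels up to floating-point error. Under the paper's result, that is the condition under which no linear classifier can beat a constant predictor.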


research
06/21/2023

VisoGender: A dataset for benchmarking gender bias in image-text pronoun resolution

We introduce VisoGender, a novel dataset for benchmarking gender bias in...
research
01/24/2021

Stereotype and Skew: Quantifying Gender Bias in Pre-trained and Fine-tuned Language Models

This paper proposes two intuitive metrics, skew and stereotype, that qua...
research
02/10/2023

FairPy: A Toolkit for Evaluation of Social Biases and their Mitigation in Large Language Models

Studies have shown that large pretrained language models exhibit biases ...
research
09/10/2020

Investigating Gender Bias in BERT

Contextual language models (CLMs) have pushed the NLP benchmarks to a ne...
research
05/01/2020

Predicting Declension Class from Form and Meaning

The noun lexica of many natural languages are divided into several decle...
research
01/28/2022

Linear Adversarial Concept Erasure

Modern neural models trained on textual data rely on pre-trained represe...
research
06/16/2018

Right for the Right Reason: Training Agnostic Networks

We consider the problem of a neural network being requested to classify ...
