Circuit Breaking: Removing Model Behaviors with Targeted Ablation

09/12/2023
by Maximilian Li, et al.

Language models often exhibit behaviors that improve performance on a pre-training objective but harm performance on downstream tasks. We propose a novel approach to removing undesirable behaviors by ablating a small number of causal pathways between model components, with the intention of disabling the computational circuit responsible for the bad behavior. Given a small dataset of inputs where the model behaves poorly, we learn to ablate a small number of important causal pathways. In the setting of reducing GPT-2 toxic language generation, we find ablating just 12 of the 11.6K causal edges mitigates toxic generation with minimal degradation of performance on other inputs.
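
As a rough illustration of the idea of ablating a few edges chosen from a small set of bad inputs, here is a minimal toy sketch. It is not the authors' implementation: the abstract describes learning which pathways to ablate, while this sketch swaps in a simple greedy search. The four-edge linear "model", the `WEIGHTS` values, the use of zero-ablation, the drift penalty, and all function names are assumptions made only for illustration.

```python
from typing import List, Sequence, Set

# Hypothetical toy "model": edge e contributes WEIGHTS[e] * x[e] to a scalar
# output, which stands in for an undesirable-behavior score.
WEIGHTS: List[float] = [0.5, -0.3, 2.0, 0.1]


def forward(x: Sequence[float], ablated: Set[int]) -> float:
    """Sum the edge contributions, skipping (zero-ablating) edges in `ablated`."""
    return sum(w * xi for e, (w, xi) in enumerate(zip(WEIGHTS, x)) if e not in ablated)


def objective(ablated: Set[int], bad_inputs, clean_inputs, penalty: float = 1.0) -> float:
    """Mean score on bad inputs plus a penalty for output drift on clean inputs."""
    bad = sum(forward(x, ablated) for x in bad_inputs) / len(bad_inputs)
    drift = sum(
        abs(forward(x, ablated) - forward(x, set())) for x in clean_inputs
    ) / len(clean_inputs)
    return bad + penalty * drift


def greedy_ablate(bad_inputs, clean_inputs, max_edges: int = 1, penalty: float = 1.0) -> Set[int]:
    """Greedily pick up to `max_edges` edges whose ablation lowers the objective."""
    ablated: Set[int] = set()
    best = objective(ablated, bad_inputs, clean_inputs, penalty)
    for _ in range(max_edges):
        candidates = [e for e in range(len(WEIGHTS)) if e not in ablated]
        if not candidates:
            break
        score, edge = min(
            (objective(ablated | {e}, bad_inputs, clean_inputs, penalty), e)
            for e in candidates
        )
        if score >= best:  # no remaining edge improves the objective
            break
        ablated.add(edge)
        best = score
    return ablated


# Edge 2 is active only on the bad inputs, so ablating it leaves clean outputs intact.
bad_inputs = [[1, 1, 1, 0], [1, 0, 1, 0]]
clean_inputs = [[1, 1, 0, 1], [0, 1, 0, 1]]
chosen = greedy_ablate(bad_inputs, clean_inputs, max_edges=1)
```

In this toy setup the search selects the single edge that drives the score on the bad inputs. The drift term plays the role of the abstract's "minimal degradation of performance on other inputs": an edge that also matters on clean inputs is penalized for being removed.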
