The Hydra Effect: Emergent Self-repair in Language Model Computations

07/28/2023
by Thomas McGrath et al.

We investigate the internal structure of language model computations using causal analysis and demonstrate two motifs: (1) a form of adaptive computation where ablations of one attention layer of a language model cause another layer to compensate (which we term the Hydra effect) and (2) a counterbalancing function of late MLP layers that act to downregulate the maximum-likelihood token. Our ablation studies demonstrate that language model layers are typically relatively loosely coupled (ablations to one layer only affect a small number of downstream layers). Surprisingly, these effects occur even in language models trained without any form of dropout. We analyse these effects in the context of factual recall and consider their implications for circuit-level attribution in language models.
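
As a rough illustration of the zero-ablation methodology the abstract describes, the sketch below knocks out one attention block of GPT-2 with a PyTorch forward hook and compares the logit of the top next-token prediction before and after. This is not the authors' code (the paper studies a 7B-parameter Chinchilla-family model); the model, prompt, and layer index here are illustrative assumptions.

```python
# Minimal, hypothetical sketch of attention-layer zero-ablation -- not the
# paper's actual setup (which targets a 7B Chinchilla model, not GPT-2).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The Eiffel Tower is located in the city of"
inputs = tokenizer(prompt, return_tensors="pt")

def zero_ablate(module, args, output):
    # Hugging Face's GPT2Attention returns a tuple whose first element is
    # the attention output added to the residual stream; zero it out.
    if isinstance(output, tuple):
        return (torch.zeros_like(output[0]),) + output[1:]
    return torch.zeros_like(output)

LAYER = 5  # illustrative choice of attention layer to ablate

with torch.no_grad():
    clean_logits = model(**inputs).logits[0, -1]

handle = model.transformer.h[LAYER].attn.register_forward_hook(zero_ablate)
with torch.no_grad():
    ablated_logits = model(**inputs).logits[0, -1]
handle.remove()

tok = clean_logits.argmax()
print(f"top token: {tokenizer.decode(tok)!r}")
print(f"clean logit:   {clean_logits[tok].item():.3f}")
print(f"ablated logit: {ablated_logits[tok].item():.3f}")
```

Detecting the Hydra effect itself requires a per-layer attribution (e.g. projecting each layer's output through the unembedding to estimate its direct effect on the target token) computed both with and without the ablation, so that downstream layers whose contribution increases to compensate can be identified.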


