Overthinking the Truth: Understanding how Language Models Process False Demonstrations

07/18/2023
by Danny Halawi, et al.

Modern language models can imitate complex patterns through few-shot learning, enabling them to complete challenging tasks without fine-tuning. However, imitation can also lead models to reproduce inaccuracies or harmful content if present in the context. We study harmful imitation through the lens of a model's internal representations and identify two related phenomena: overthinking and false induction heads. The first phenomenon, overthinking, appears when we decode predictions from intermediate layers given correct vs. incorrect few-shot demonstrations. At early layers, both kinds of demonstrations induce similar model behavior, but the behavior diverges sharply at some "critical layer", after which the accuracy given incorrect demonstrations progressively decreases. The second phenomenon, false induction heads, is a possible mechanistic cause of overthinking: these are heads in late layers that attend to and copy false information from previous demonstrations, and whose ablation reduces overthinking. Beyond scientific understanding, our results suggest that studying intermediate model computations could be a promising avenue for understanding and guarding against harmful model behaviors.
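The layer-wise decoding behind the overthinking measurement can be probed with a standard logit-lens-style technique: project each layer's hidden state through the model's final layer norm and unembedding, and read off label probabilities at the last position. Below is a minimal sketch, not the authors' code, assuming a Hugging Face GPT-2 model; the model choice, toy prompts, and single-token labels are illustrative assumptions.

# Logit-lens-style probe: decode each intermediate layer's prediction.
# Assumes a GPT-2-style causal LM; prompts and labels are toy examples.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any GPT-2-style model works similarly
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def layerwise_label_probs(prompt, labels=(" positive", " negative")):
    """Decode every layer's prediction at the final token position."""
    # Assumes each label string encodes to a single token.
    label_ids = [tok(l).input_ids[0] for l in labels]
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    probs_per_layer = []
    for h in out.hidden_states:  # embeddings, then one entry per layer
        # Apply the final layer norm and unembedding to the last position.
        logits = model.lm_head(model.transformer.ln_f(h[:, -1]))
        p = torch.softmax(logits, dim=-1)[0, label_ids]
        probs_per_layer.append(p.tolist())
    return probs_per_layer

# Correct vs. flipped-label demonstrations for the same test input.
correct = "great movie -> positive\nterrible movie -> negative\nawful film ->"
flipped = "great movie -> negative\nterrible movie -> positive\nawful film ->"

for name, prompt in [("correct", correct), ("flipped", flipped)]:
    probs = layerwise_label_probs(prompt)
    print(name, [f"{pos:.2f}/{neg:.2f}" for pos, neg in probs])

Plotting accuracy per layer from such probes is what exposes the "critical layer": under this kind of probe, predictions from the two prompts track each other early on and separate only in later layers.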
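For the second phenomenon, zero-ablating a single attention head can be sketched by masking that head's slice of the merged-head activations before the block's output projection (c_proj in GPT-2). This is a hedged illustration of head ablation in general, not the paper's code; the layer and head indices are placeholders, not heads identified in the paper.

# Head ablation via a forward pre-hook on the attention output projection.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def ablate_head(layer, head):
    """Register a hook that zeroes one head's output in `layer`; returns its handle."""
    attn = model.transformer.h[layer].attn
    head_dim = attn.head_dim  # hidden_size // num_heads
    lo, hi = head * head_dim, (head + 1) * head_dim

    def pre_hook(module, args):
        # args[0]: merged per-head outputs, shape (batch, seq, hidden),
        # laid out head-by-head just before the output projection.
        x = args[0].clone()
        x[..., lo:hi] = 0.0
        return (x,)

    return attn.c_proj.register_forward_pre_hook(pre_hook)

prompt = "great movie -> negative\nterrible movie -> positive\nawful film ->"
handle = ablate_head(layer=10, head=7)  # placeholder indices
with torch.no_grad():
    logits = model(**tok(prompt, return_tensors="pt")).logits
handle.remove()  # restore the unmodified model
print(tok.decode(logits[0, -1].argmax()))

Comparing label accuracy with and without such hooks on candidate late-layer heads is one way to test whether a head behaves like a false induction head, i.e., whether removing it recovers accuracy under incorrect demonstrations.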
