Inference-Time Intervention: Eliciting Truthful Answers from a Language Model

06/06/2023
by Kenneth Li, et al.

We introduce Inference-Time Intervention (ITI), a technique designed to enhance the truthfulness of large language models (LLMs). ITI operates by shifting model activations during inference, following a set of directions across a limited number of attention heads. This intervention significantly improves the performance of LLaMA models on the TruthfulQA benchmark. On an instruction-finetuned LLaMA called Alpaca, ITI improves its truthfulness from 32.5% to 65.1%. We identify a tradeoff between truthfulness and helpfulness and demonstrate how to balance it by tuning the intervention strength. ITI is minimally invasive and computationally inexpensive. Moreover, the technique is data efficient: while approaches like RLHF require extensive annotations, ITI locates truthful directions using only a few hundred examples. Our findings suggest that LLMs may have an internal representation of the likelihood of something being true, even as they produce falsehoods on the surface.
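To make the mechanism concrete, below is a minimal PyTorch sketch of the idea, not the authors' released implementation. It assumes a HuggingFace-style LLaMA whose decoder layers expose `self_attn.o_proj`, estimates a per-head "truthful" direction from labeled activations with a simple mean-difference probe (a stand-in for the linear probes described in the paper), and shifts that head's activation along the direction at every decoding step via a forward pre-hook. All layer/head indices, the strength `alpha`, and the activation tensors are illustrative placeholders.

```python
# Minimal ITI sketch, assuming a HuggingFace-style LLaMA model
# (model.model.layers[i].self_attn.o_proj). Illustration only.
import torch


def find_truthful_direction(truthful_acts: torch.Tensor,
                            untruthful_acts: torch.Tensor) -> torch.Tensor:
    """Estimate a per-head 'truthful' direction from activations collected on
    labeled examples. Here: the normalized difference of class means, a simple
    stand-in for the paper's linear probes. Both inputs are hypothetical
    tensors of shape (num_examples, head_dim)."""
    direction = truthful_acts.mean(dim=0) - untruthful_acts.mean(dim=0)
    return direction / direction.norm()


def add_iti_hook(model, layer_idx: int, head_idx: int,
                 direction: torch.Tensor, alpha: float, num_heads: int):
    """Register a forward pre-hook on the attention output projection so that,
    at every decoding step, the chosen head's activation is shifted by
    alpha * direction before it is mixed back into the residual stream."""
    attn = model.model.layers[layer_idx].self_attn  # LLaMA-style layout (assumption)

    def shift_head(module, inputs):
        (hidden,) = inputs                    # (batch, seq, num_heads * head_dim)
        b, s, d_model = hidden.shape
        head_dim = d_model // num_heads
        hidden = hidden.view(b, s, num_heads, head_dim).clone()
        hidden[:, :, head_idx, :] += alpha * direction.to(hidden.device, hidden.dtype)
        return (hidden.view(b, s, d_model),)

    return attn.o_proj.register_forward_pre_hook(shift_head)
```

Hypothetical usage: `handle = add_iti_hook(model, layer_idx=14, head_idx=18, direction=direction, alpha=15.0, num_heads=32)`; subsequent `model.generate` calls then run with the shift applied, and `handle.remove()` restores the unmodified model. In the paper, a small set of top heads is selected by probe accuracy and all of them are intervened on at once, with the shift scaled by the activation standard deviation; the single-head version above is kept short for clarity.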

Related research

07/24/2019 | Info Intervention
We proposed the info intervention, which intervening the information sen...

08/07/2023 | Simple synthetic data reduces sycophancy in large language models
Sycophancy is an undesirable behavior where models tailor their response...

01/17/2023 | Tracing and Manipulating Intermediate Values in Neural Math Problem Solvers
How language models process complex input that requires multiple steps o...

05/06/2023 | Refining the Responses of LLMs by Themselves
In this paper, we propose a simple yet efficient approach based on promp...

08/24/2018 | Under the Hood: Using Diagnostic Classifiers to Investigate and Improve how Language Models Track Agreement Information
How do neural language models keep track of number agreement between sub...

05/24/2023 | A Causal View of Entity Bias in (Large) Language Models
Entity bias widely affects pretrained (large) language models, causing t...

01/10/2023 | Memory Augmented Large Language Models are Computationally Universal
We show that transformer-based large language models are computationally...
