The Internal State of an LLM Knows When It's Lying

04/26/2023
by Amos Azaria, et al.

While Large Language Models (LLMs) have shown exceptional performance in various tasks, their (arguably) most prominent drawback is generating inaccurate or false information with a confident tone. In this paper, we hypothesize that the LLM's internal state can be used to reveal the truthfulness of a statement. Therefore, we introduce a simple yet effective method to detect the truthfulness of LLM-generated statements, which utilizes the LLM's hidden layer activations to determine the veracity of statements. To train and evaluate our method, we compose a dataset of true and false statements in six different topics. A classifier is trained to detect which statement is true or false based on an LLM's activation values. Specifically, the classifier receives as input the activation values from the LLM for each of the statements in the dataset. Our experiments demonstrate that our method for detecting statement veracity significantly outperforms even few-shot prompting methods, highlighting its potential to enhance the reliability of LLM-generated content and its practical applicability in real-world scenarios.
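The idea in the abstract, training a classifier on hidden-layer activations to separate true from false statements, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the activations are synthetic vectors rather than real LLM hidden states, the 16-dimensional size is arbitrary, and a plain logistic regression stands in for whatever classifier the authors actually use. The helper names (`make_synthetic_activations`, `train_probe`, `predict`, `accuracy`) are hypothetical.

```python
import math
import random


def make_synthetic_activations(n, dim, seed=0):
    """Hypothetical stand-in for LLM hidden-layer activations.

    Vectors labeled 1 ("true") are shifted along a fixed random direction,
    vectors labeled 0 ("false") are shifted the opposite way.
    """
    rng = random.Random(seed)
    direction = [rng.gauss(0, 1) for _ in range(dim)]
    data = []
    for i in range(n):
        label = i % 2
        sign = 1.0 if label == 1 else -1.0
        vec = [rng.gauss(0, 1) + sign * 0.8 * d for d in direction]
        data.append((vec, label))
    rng.shuffle(data)
    return data


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_probe(data, dim, epochs=50, lr=0.1):
    """Fit a logistic-regression probe with plain stochastic gradient descent."""
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for vec, label in data:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, vec)) + b)
            grad = p - label  # gradient of the log loss w.r.t. the logit
            for j in range(dim):
                w[j] -= lr * grad * vec[j]
            b -= lr * grad
    return w, b


def predict(w, b, vec):
    """Return 1 (classified true) or 0 (classified false)."""
    score = sum(wi * xi for wi, xi in zip(w, vec)) + b
    return 1 if sigmoid(score) >= 0.5 else 0


def accuracy(w, b, data):
    return sum(predict(w, b, vec) == label for vec, label in data) / len(data)


DIM = 16  # arbitrary; real hidden states are far larger
data = make_synthetic_activations(400, DIM, seed=0)
train, test = data[:300], data[300:]
w, b = train_probe(train, DIM)
test_acc = accuracy(w, b, test)
print(f"held-out accuracy on synthetic activations: {test_acc:.2f}")
```

With a real model, the synthetic vectors would be replaced by activations read from one hidden layer while the model processes each statement. The accuracy printed here reflects only how separable the synthetic data is and says nothing about the paper's results.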

research
12/31/2019

True and false discoveries with e-values

We discuss controlling the number of false discoveries using e-values in...
research
05/26/2023

Can large language models generate salient negative statements?

We examine the ability of large language models (LLMs) to generate salie...
research
08/04/2023

How Good Are SOTA Fake News Detectors

Automatic fake news detection with machine learning can prevent the diss...
research
06/14/2023

Assessing the Effectiveness of GPT-3 in Detecting False Political Statements: A Case Study on the LIAR Dataset

The detection of political fake statements is crucial for maintaining in...
research
05/06/2017

People on Drugs: Credibility of User Statements in Health Communities

Online health communities are a valuable source of information for patie...
research
06/30/2023

Improved NL2SQL based on Multi-layer Expert Network

The Natural Language to SQL (NL2SQL) technique is used to convert natura...
research
06/09/2023

Reliability Check: An Analysis of GPT-3's Response to Sensitive Topics and Prompt Wording

Large language models (LLMs) have become mainstream technology with thei...
