LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond

05/23/2023
by Philippe Laban, et al.

With the recent appearance of LLMs in practical settings, having methods that can effectively detect factual inconsistencies is crucial to reduce the propagation of misinformation and improve trust in model outputs. When tested on existing factual consistency benchmarks, several large language models (LLMs) perform competitively with traditional non-LLM methods at classifying factual inconsistencies. However, a closer analysis shows that most LLMs fail on more complex formulations of the task, and it also exposes issues with the existing benchmarks themselves that limit the precision of evaluation. To address this, we propose a new protocol for creating inconsistency detection benchmarks and implement it in a 10-domain benchmark called SummEdits. The new benchmark is 20 times more cost-effective per sample than previous benchmarks and highly reproducible: we estimate inter-annotator agreement at about 0.9. Most LLMs struggle on SummEdits, with performance close to random chance. The best-performing model, GPT-4, is still 8% below estimated human performance, highlighting gaps in LLMs' ability to reason about facts and to detect inconsistencies when they occur.
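
As a rough illustration of the classification setup described in the abstract, the sketch below frames inconsistency detection as a binary consistent/inconsistent judgment and scores it with balanced accuracy, so that a score near 0.5 corresponds to the "close to random chance" level mentioned above. The prompt wording, the query_llm placeholder, and the use of Cohen's kappa for inter-annotator agreement are illustrative assumptions, not the paper's exact protocol.

```python
# Minimal sketch of a binary factual-consistency classifier and its evaluation.
# The prompt template, query_llm helper, and example data are assumptions for
# illustration only; they do not reproduce the SummEdits protocol exactly.
from sklearn.metrics import balanced_accuracy_score, cohen_kappa_score

PROMPT = (
    "Document:\n{document}\n\n"
    "Summary:\n{summary}\n\n"
    "Is the summary factually consistent with the document? Answer Yes or No."
)

def query_llm(prompt: str) -> str:
    """Placeholder for a call to an LLM API; expected to return 'Yes' or 'No'."""
    raise NotImplementedError

def judge_consistency(document: str, summary: str) -> int:
    """Return 1 if the model judges the summary consistent, 0 otherwise."""
    answer = query_llm(PROMPT.format(document=document, summary=summary))
    return 1 if answer.strip().lower().startswith("yes") else 0

def evaluate(samples) -> float:
    """samples: list of (document, summary, gold_label) with gold_label in {0, 1}.
    Balanced accuracy is 0.5 for a random or constant classifier, so scores
    near 0.5 correspond to chance-level performance."""
    gold = [label for _, _, label in samples]
    pred = [judge_consistency(doc, summ) for doc, summ, _ in samples]
    return balanced_accuracy_score(gold, pred)

def agreement(annotator_a, annotator_b) -> float:
    """One way to quantify inter-annotator agreement between two label lists
    (Cohen's kappa); the abstract reports agreement of roughly 0.9 on SummEdits."""
    return cohen_kappa_score(annotator_a, annotator_b)
```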

