HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models

05/19/2023
by Junyi Li, et al.

Large language models (LLMs), such as ChatGPT, are prone to generating hallucinations, i.e., content that conflicts with the source or cannot be verified against factual knowledge. To understand what types of content LLMs tend to hallucinate, and to what extent, we introduce the Hallucination Evaluation for Large Language Models (HaluEval) benchmark, a large collection of generated and human-annotated hallucinated samples for evaluating how well LLMs recognize hallucination. To generate these samples, we propose a two-step, ChatGPT-based framework: sampling-then-filtering. In addition, we hire human labelers to annotate the hallucinations in ChatGPT responses. The empirical results suggest that ChatGPT is likely to generate hallucinated content on specific topics by fabricating unverifiable information (in about 11.4% of user queries). Moreover, existing LLMs face great challenges in recognizing hallucinations in text. However, our experiments also show that hallucination recognition can be improved by providing external knowledge or adding reasoning steps. Our benchmark is available at https://github.com/RUCAIBox/HaluEval.
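The two-step sampling-then-filtering framework can be sketched as follows. This is a minimal illustration, not the paper's implementation: the candidate generator and plausibility scorer below are hypothetical stand-ins for ChatGPT calls (step 1 would prompt the model with different hallucination patterns; step 2 would ask it to select the hardest, most plausible sample), and the token-overlap heuristic is purely a placeholder.

```python
# Sketch of sampling-then-filtering for building hallucinated samples.
# All helper names and the scoring heuristic are assumptions for
# illustration; the real pipeline uses ChatGPT for both steps.

def sample_candidates(question: str, answer: str) -> list[str]:
    """Step 1 (sampling): generate several hallucinated candidates,
    e.g. via different hallucination patterns. Stubbed here."""
    return [
        answer.replace("2023", "2021"),            # factual distortion
        answer + " according to unnamed sources",  # unverifiable claim
        "An unrelated statement.",                 # off-topic fabrication
    ]

def plausibility(candidate: str, answer: str) -> float:
    """Step 2 helper: crude plausibility proxy via token overlap
    (Jaccard similarity) with the true answer."""
    ref = set(answer.lower().split())
    cand = set(candidate.lower().split())
    return len(ref & cand) / max(len(ref | cand), 1)

def sample_then_filter(question: str, answer: str) -> str:
    """Keep the most plausible candidate: a hard negative that
    looks right but is not."""
    candidates = sample_candidates(question, answer)
    return max(candidates, key=lambda c: plausibility(c, answer))

if __name__ == "__main__":
    q = "When was the HaluEval benchmark released?"
    a = "HaluEval was released in 2023."
    print(sample_then_filter(q, a))
```

The filtering step is what makes the benchmark challenging: instead of keeping arbitrary wrong answers, it retains the hallucination most likely to fool a recognizer.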


