FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation

05/23/2023
by   Sewon Min, et al.

Evaluating the factuality of long-form text generated by large language models (LMs) is non-trivial because (1) generations often contain a mixture of supported and unsupported pieces of information, making binary judgments of quality inadequate, and (2) human evaluation is time-consuming and costly. In this paper, we introduce FActScore (Factual precision in Atomicity Score), a new evaluation that breaks a generation into a series of atomic facts and computes the percentage of atomic facts supported by a reliable knowledge source. We conduct an extensive human evaluation to obtain FActScores of biographies of people generated by several state-of-the-art commercial LMs – InstructGPT, ChatGPT, and the retrieval-augmented PerplexityAI – and report new analysis demonstrating the need for such a fine-grained score (e.g., ChatGPT only achieves 58%). Since human evaluation is costly, we also introduce an automated model that estimates FActScore, using retrieval and a strong language model, with less than a 2% error rate. Finally, we use this automated metric to evaluate 6,500 generations from a new set of 13 recent LMs that would have cost $26K if evaluated by humans, with various findings: GPT-4 and ChatGPT are more factual than public models, and Vicuna and Alpaca are some of the best public models.
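The score itself, as the abstract defines it, is the percentage of a generation's atomic facts that a knowledge source supports. A minimal sketch of that arithmetic follows. It assumes the generation has already been split into atomic facts and that some support check exists; `factual_precision`, `is_supported`, `knowledge_source`, and the toy facts are hypothetical placeholders for illustration, not the authors' implementation or data.

```python
from typing import Callable, Iterable


def factual_precision(
    atomic_facts: Iterable[str],
    is_supported: Callable[[str], bool],
) -> float:
    """Return the percentage of atomic facts judged as supported.

    `is_supported` stands in for whatever judges a fact against the
    knowledge source (a human annotator or an automated estimator).
    """
    facts = list(atomic_facts)
    if not facts:
        raise ValueError("no atomic facts to score")
    supported = sum(1 for fact in facts if is_supported(fact))
    return 100.0 * supported / len(facts)


# Hypothetical toy data: a made-up knowledge source and four made-up
# atomic facts extracted from one generated biography.
knowledge_source = {
    "Ada Example was born in 1950.",
    "Ada Example is a chemist.",
    "Ada Example lives in Oslo.",
}
facts = [
    "Ada Example was born in 1950.",
    "Ada Example is a chemist.",
    "Ada Example lives in Oslo.",
    "Ada Example won a Nobel Prize.",  # not in the toy knowledge source
]

# Exact string membership is a deliberate simplification of the support check.
score = factual_precision(facts, lambda fact: fact in knowledge_source)
print(f"{score:.1f}")  # three of four facts supported
```

A binary judgment would mark this whole biography as wrong because of one unsupported claim, whereas the per-fact percentage credits the three supported ones.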


