Automatic Evaluation of Attribution by Large Language Models

05/10/2023
by Xiang Yue, et al.

A recent focus of large language model (LLM) development, as exemplified by generative search engines, is to incorporate external references to generate and support their claims. However, evaluating the attribution, i.e., verifying whether the generated statement is indeed fully supported by the cited reference, remains an open problem. Although human evaluation is common practice, it is costly and time-consuming. In this paper, we investigate the automatic evaluation of attribution by LLMs. We begin by providing a definition of attribution and then explore two approaches for automatic evaluation: prompting LLMs and fine-tuning smaller LMs. The fine-tuning data is repurposed from related tasks, such as question answering, fact-checking, natural language inference, and summarization. To facilitate the evaluation, we manually curate a set of test examples covering 12 domains from a generative search engine, New Bing. Our results on the curated test set and simulated test examples from existing benchmark questions highlight both promising signals as well as remaining challenges for the automatic evaluation of attribution. We hope our testbed, modeling methodology, and insights will help lay the foundation for future studies on this important problem.
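To make the prompting approach concrete, the sketch below formats a (claim, reference) pair into an instruction, queries an LLM, and maps the free-form answer onto a small label set. This is a minimal illustration, not the paper's exact prompt or evaluation protocol: the template wording, the three-way label scheme, and the query_llm helper are hypothetical stand-ins for whatever model API is available.

```python
# Minimal sketch of prompting an LLM to judge attribution.
# `query_llm` is a hypothetical callable standing in for any
# chat/completion API that takes a prompt and returns a string.

PROMPT_TEMPLATE = """Claim: {claim}
Reference: {reference}

Is the claim fully supported by the reference?
Answer with exactly one word: Attributable, Extrapolatory, or Contradictory."""

LABELS = ("attributable", "extrapolatory", "contradictory")

def evaluate_attribution(claim: str, reference: str, query_llm) -> str:
    """Return an attribution label for a (claim, reference) pair."""
    prompt = PROMPT_TEMPLATE.format(claim=claim, reference=reference)
    answer = query_llm(prompt).strip().lower()
    # Map the model's free-form reply onto the closed label set.
    for label in LABELS:
        if answer.startswith(label):
            return label
    return "unparseable"  # model answered off-format

# Usage with a trivial stub in place of a real model:
# evaluate_attribution("Paris is in France.",
#                      "France's capital is Paris.",
#                      query_llm=lambda p: "Attributable")
```

The fine-tuning alternative described in the abstract follows the same input/output contract: a smaller LM is trained on (claim, reference, label) triples repurposed from QA, fact-checking, NLI, and summarization data, so the classifier replaces the prompt-and-parse step above.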


Related research

04/11/2022
A Multilingual Perspective Towards the Evaluation of Attribution Methods in Natural Language Inference
Most evaluations of attribution methods focus on the English language. I...

10/17/2022
RARR: Researching and Revising What Language Models Say, Using Language Models
Language models (LMs) now excel at many tasks such as few-shot learning,...

10/12/2022
AD-DROP: Attribution-Driven Dropout for Robust Language Model Fine-Tuning
Fine-tuning large pre-trained language models on downstream tasks is apt...

05/23/2023
Evaluating and Modeling Attribution for Cross-Lingual Question Answering
Trustworthy answer content is abundant in many high-resource languages a...

09/18/2023
Speaker attribution in German parliamentary debates with QLoRA-adapted large language models
The growing body of political texts opens up new opportunities for rich ...

09/04/2023
Prompting or Fine-tuning? A Comparative Study of Large Language Models for Taxonomy Construction
Taxonomies represent hierarchical relations between entities, frequently...

07/31/2023
HAGRID: A Human-LLM Collaborative Dataset for Generative Information-Seeking with Attribution
The rise of large language models (LLMs) had a transformative impact on ...
