Evaluation of Faithfulness Using the Longest Supported Subsequence

08/23/2023
by Anirudh Mittal, et al.

As increasingly sophisticated language models emerge, their trustworthiness becomes a pivotal issue, especially in tasks such as summarization and question answering. Ensuring that their responses are contextually grounded and faithful is challenging because of linguistic diversity and the large number of possible answers. In this paper, we introduce a novel approach to evaluating the faithfulness of machine-generated text by computing the longest non-contiguous subsequence of the claim that is supported by the context, which we refer to as the Longest Supported Subsequence (LSS). Using a new human-annotated dataset, we fine-tune a model to generate the LSS. We introduce a new method of evaluation and demonstrate that existing metrics correlate better with human ratings when the LSS is employed than when it is not. Our proposed metric demonstrates an 18% enhancement over the prevailing state-of-the-art metric for faithfulness on our dataset. Our metric consistently outperforms other metrics on a summarization dataset across six different models. Finally, we compare several popular Large Language Models (LLMs) for faithfulness using this metric. We release the human-annotated dataset built for predicting the LSS and our fine-tuned model for evaluating faithfulness.
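
To make the idea of a non-contiguous supported subsequence concrete, here is a minimal sketch of a purely lexical stand-in. It is not the method described above: the paper fine-tunes a model to generate the LSS, whereas this sketch assumes that "supported" means exact token overlap in the same order, so it misses paraphrased support. The function names `longest_common_subsequence` and `lexical_support_ratio` are hypothetical and chosen only for illustration.

```python
from typing import List


def longest_common_subsequence(claim: List[str], context: List[str]) -> List[str]:
    """Return one longest subsequence of claim tokens (not necessarily
    contiguous) that also appears, in the same order, in the context."""
    m, n = len(claim), len(context)
    # dp[i][j] holds the LCS length of claim[i:] and context[j:].
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if claim[i] == context[j]:
                dp[i][j] = 1 + dp[i + 1][j + 1]
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])

    # Walk the table forward to recover the matched tokens.
    matched: List[str] = []
    i = j = 0
    while i < m and j < n:
        if claim[i] == context[j]:
            matched.append(claim[i])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1  # skip a claim token with no in-order match
        else:
            j += 1  # skip a context token
    return matched


def lexical_support_ratio(claim: str, context: str) -> float:
    """Fraction of claim tokens covered by the lexical subsequence.
    Assumption: lowercased whitespace tokenization, exact matching only."""
    claim_tokens = claim.lower().split()
    context_tokens = context.lower().split()
    if not claim_tokens:
        return 0.0
    matched = longest_common_subsequence(claim_tokens, context_tokens)
    return len(matched) / len(claim_tokens)


if __name__ == "__main__":
    context = "the cat sat on the mat"
    claim = "the cat sat on the red mat"
    print(longest_common_subsequence(claim.split(), context.split()))
    print(lexical_support_ratio(claim, context))
```

In this toy case the unsupported token "red" is dropped from the subsequence, so six of the seven claim tokens are covered. A learned LSS generator would be needed to credit support that is expressed in different words.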


research
07/26/2023

This is not correct! Negation-aware Evaluation of Language Generation Systems

Large language models underestimate the impact of negations on how much ...
research
08/13/2021

Semantic Answer Similarity for Evaluating Question Answering Models

The evaluation of question answering models compares ground-truth annota...
research
05/24/2023

Is Summary Useful or Not? An Extrinsic Human Evaluation of Text Summaries on Downstream Tasks

Research on automated text summarization relies heavily on human and aut...
research
11/05/2020

Detecting Hallucinated Content in Conditional Neural Sequence Generation

Neural sequence models can generate highly fluent sentences but recent s...
research
05/29/2023

Assess and Summarize: Improve Outage Understanding with Large Language Models

Cloud systems have become increasingly popular in recent years due to th...
research
04/16/2021

Q^2: Evaluating Factual Consistency in Knowledge-Grounded Dialogues via Question Generation and Question Answering

Neural knowledge-grounded generative models for dialogue often produce c...
research
05/17/2023

Statistical Knowledge Assessment for Generative Language Models

Generative Language Models (GLMs) have demonstrated capabilities to stor...
