Measuring Reliability of Large Language Models through Semantic Consistency

11/10/2022
by   Harsh Raj, et al.

While large pretrained language models (PLMs) demonstrate incredible fluency and performance on many natural language tasks, recent work has shown that well-performing PLMs are highly sensitive to the prompts that are fed to them. Even when prompts are semantically identical, language models may give very different answers. For safe and trustworthy deployment of PLMs, we would like their outputs to be consistent under prompts that mean the same thing or convey the same intent. While some prior work has examined how state-of-the-art PLMs address this need, it has been limited to evaluating lexical equality of single- or multi-word answers and does not address the consistency of generated text sequences. To understand the consistency of PLMs in text generation settings, we develop a measure of semantic consistency that allows the comparison of open-ended text outputs. We implement several versions of this consistency metric to evaluate the performance of a number of PLMs on paraphrased versions of questions from the TruthfulQA dataset. We find that our proposed metrics are considerably more consistent than traditional metrics embodying lexical consistency, and that they correlate with human evaluation of output consistency to a higher degree.
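The abstract does not spell out the metric's exact form, but the general shape of such a consistency score can be sketched: generate answers to several paraphrases of the same question, then average a pairwise similarity over all answer pairs. The sketch below is illustrative only; `jaccard_similarity` is a toy token-overlap stand-in for the learned semantic similarity the paper's approach would use (e.g., cosine similarity over sentence embeddings), and the function names are assumptions, not the authors' API.

```python
import itertools

def jaccard_similarity(a: str, b: str) -> float:
    """Toy token-overlap similarity. A real semantic-consistency metric would
    replace this with a learned similarity (e.g., embedding cosine or NLI)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def pairwise_consistency(outputs, sim=jaccard_similarity):
    """Mean pairwise similarity across all answers to paraphrased prompts.
    Returns 1.0 when every pair of answers is judged identical by `sim`."""
    pairs = list(itertools.combinations(outputs, 2))
    return sum(sim(a, b) for a, b in pairs) / len(pairs)

# Answers a model gave to three paraphrases of one TruthfulQA-style question:
answers = ["the sky is blue", "the sky appears blue", "blue"]
score = pairwise_consistency(answers)
```

Swapping in a semantic similarity function is exactly what distinguishes this family of metrics from the lexical-equality checks the abstract critiques: two answers that share intent but no surface wording would still score high.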


Related research

- 08/17/2023: Semantic Consistency for Assuring Reliability of Large Language Models. "Large Language Models (LLMs) exhibit remarkable fluency and competence a..."
- 05/24/2022: The Curious Case of Control. "Children acquiring English make systematic errors on subject control sen..."
- 05/11/2023: Evaluating Open-Domain Question Answering in the Era of Large Language Models. "Lexical matching remains the de facto evaluation method for open-domain ..."
- 08/15/2021: Accurate, yet inconsistent? Consistency Analysis on Language Understanding Models. "Consistency, which refers to the capability of generating the same predi..."
- 03/28/2023: Exposing and Addressing Cross-Task Inconsistency in Unified Vision-Language Models. "As general purpose vision models get increasingly effective at a wide se..."
- 09/21/2023: Reranking for Natural Language Generation from Logical Forms: A Study based on Large Language Models. "Large language models (LLMs) have demonstrated impressive capabilities i..."
- 07/11/2023: Self-consistency for open-ended generations. "In this paper, we present a novel approach for improving the quality and..."
