Evaluating Open-Domain Question Answering in the Era of Large Language Models

05/11/2023
by Ehsan Kamalloo, et al.

Lexical matching remains the de facto evaluation method for open-domain question answering (QA). Unfortunately, lexical matching fails completely when a plausible candidate answer does not appear in the list of gold answers, which is increasingly the case as we shift from extractive to generative models. The recent success of large language models (LLMs) for QA aggravates lexical matching failures since candidate answers become longer, thereby making matching with the gold answers even more challenging. Without accurate evaluation, the true progress in open-domain QA remains unknown. In this paper, we conduct a thorough analysis of various open-domain QA models, including LLMs, by manually evaluating their answers on a subset of NQ-open, a popular benchmark. Our assessments reveal that while the true performance of all models is significantly underestimated, the performance of the InstructGPT (zero-shot) LLM increases by nearly +60%, and the InstructGPT (few-shot) model actually achieves a new state-of-the-art on NQ-open. We also find that more than 50% of lexical matching failures are attributed to semantically equivalent answers. We further demonstrate that regex matching ranks QA models consistently with human judgments, although it still suffers from unnecessary strictness. Finally, we demonstrate that automated evaluation models are a reasonable surrogate for lexical matching in some circumstances, but not for long-form answers generated by LLMs. The automated models struggle to detect hallucinations in LLM answers and are thus unable to evaluate LLMs. At this time, there appears to be no substitute for human evaluation.
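For concreteness, the sketch below illustrates the two automatic evaluation strategies the abstract contrasts: lexical (exact) matching against a list of gold answers, and regex matching against gold patterns. It uses the standard SQuAD-style answer normalization; the function names and the example question are illustrative assumptions, not the paper's own evaluation scripts.

```python
import re
import string


def normalize(text: str) -> str:
    """SQuAD-style normalization: lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(candidate: str, gold_answers: list[str]) -> bool:
    """Lexical matching: the normalized candidate must equal one of the normalized gold answers."""
    return any(normalize(candidate) == normalize(gold) for gold in gold_answers)


def regex_match(candidate: str, gold_patterns: list[str]) -> bool:
    """Regex matching: a gold pattern only needs to occur somewhere in the candidate answer."""
    return any(re.search(pattern, candidate, flags=re.IGNORECASE) for pattern in gold_patterns)


# A long-form LLM answer fails lexical matching even though it contains the correct answer,
# whereas regex matching gives it credit.
gold = ["Ottawa"]
llm_answer = "The capital of Canada is Ottawa, located in the province of Ontario."
print(exact_match(llm_answer, gold))             # False
print(regex_match(llm_answer, [r"\bOttawa\b"]))  # True
```

This contrast is exactly why longer, generative answers are penalized by exact match: the candidate is compared as a whole string, so any correct but verbose answer misses the gold list, while regex matching remains closer to human judgments at the cost of its own strictness.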

