A Critical Evaluation of Evaluations for Long-form Question Answering

05/29/2023
by Fangyuan Xu, et al.

Long-form question answering (LFQA) enables answering a wide range of questions, but its flexibility poses enormous challenges for evaluation. We perform the first targeted study of the evaluation of long-form answers, covering both human and automatic evaluation practices. We hire domain experts in seven areas to provide preference judgments over pairs of answers, along with free-form justifications for their choices. We present a careful analysis of experts' evaluation, which focuses on new aspects such as the comprehensiveness of the answer. Next, we examine automatic text generation metrics, finding that no existing metrics are predictive of human preference judgments. However, some metrics correlate with fine-grained aspects of answers (e.g., coherence). We encourage future work to move away from a single "overall score" of the answer and adopt a multi-faceted evaluation, targeting aspects such as factuality and completeness. We publicly release all of our annotations and code to spur future work into LFQA evaluation.
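To illustrate what "predictive of human preference judgments" means operationally, the sketch below computes pairwise agreement: the fraction of answer pairs in which the answer scored higher by an automatic metric is also the one the expert preferred. This is a hypothetical, minimal example rather than the paper's released evaluation code; the metric scores and the pairwise_agreement helper are illustrative assumptions.

from typing import List, Tuple

def pairwise_agreement(pairs: List[Tuple[float, float, str]]) -> float:
    """Fraction of answer pairs where the metric agrees with the human preference.

    Each tuple holds (metric_score_a, metric_score_b, human_choice), with
    human_choice being "a" or "b". Metric ties count as disagreement.
    """
    if not pairs:
        return 0.0
    agree = 0
    for score_a, score_b, human_choice in pairs:
        metric_choice = "a" if score_a > score_b else "b" if score_b > score_a else None
        if metric_choice == human_choice:
            agree += 1
    return agree / len(pairs)

# Hypothetical metric scores for three answer pairs and the expert's picks.
example = [(0.42, 0.37, "a"), (0.18, 0.25, "a"), (0.51, 0.60, "b")]
print(f"Pairwise agreement: {pairwise_agreement(example):.2f}")  # 0.67

A metric whose pairwise agreement hovers near 0.5 on such comparisons is no better than chance at predicting which answer experts prefer, which is the sense in which the paper reports that no existing metric is predictive of human preference judgments.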

