KPQA: A Metric for Generative Question Answering Using Word Weights

05/01/2020
by Hwanhee Lee, et al.

For the automatic evaluation of Generative Question Answering (genQA) systems, it is essential to assess the correctness of the generated answers. However, n-gram similarity metrics, which are widely used to compare generated texts with references, are prone to misjudging fact-based answers. Moreover, there is a lack of benchmark datasets for measuring how well a metric captures correctness. To study a better metric for genQA, we collect high-quality human judgments of correctness on two standard genQA datasets. Using our human-evaluation datasets, we show that existing metrics based on n-gram similarity do not correlate with human judgments. To alleviate this problem, we propose a new metric for evaluating the correctness of genQA. Specifically, the new metric assigns a different weight to each token via keyphrase prediction, thereby judging whether a predicted answer sentence captures the key meaning of the ground-truth answer. Our proposed metric shows a significantly higher correlation with human judgment than widely used existing metrics.
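The idea of weighting tokens by their importance can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `weighted_f1`, the hand-written `weights` dictionary, and the `default_weight` parameter are all assumptions made for illustration. In the paper the weights come from a trained keyphrase predictor, whereas here they are supplied by hand, and the score is a plain weighted unigram F1.

```python
from collections import Counter
from typing import Dict, List


def weighted_f1(
    reference: List[str],
    candidate: List[str],
    weights: Dict[str, float],
    default_weight: float = 0.05,
) -> float:
    """Unigram F1 in which each token counts in proportion to its weight.

    `weights` is a hypothetical stand-in for the output of a keyphrase
    predictor; tokens missing from it receive `default_weight`.
    """
    ref_counts = Counter(reference)
    cand_counts = Counter(candidate)
    # Multiset intersection: each token matches at most as often as it
    # appears in both the reference and the candidate.
    overlap = ref_counts & cand_counts

    def mass(counts: Counter) -> float:
        return sum(weights.get(tok, default_weight) * n for tok, n in counts.items())

    matched = mass(overlap)
    cand_mass = mass(cand_counts)
    ref_mass = mass(ref_counts)
    if matched == 0 or cand_mass == 0 or ref_mass == 0:
        return 0.0

    precision = matched / cand_mass
    recall = matched / ref_mass
    return 2 * precision * recall / (precision + recall)


# Illustrative data (made up for this sketch).
reference = "the eiffel tower is in paris".split()
wrong = "the eiffel tower is in london".split()  # high overlap, wrong fact
right = "it is in paris".split()                 # low overlap, correct fact
keyphrase_weights = {"paris": 1.0, "london": 1.0, "eiffel": 0.3, "tower": 0.3}

# Uniform weights reproduce ordinary unigram F1.
uniform_wrong = weighted_f1(reference, wrong, {}, default_weight=1.0)
uniform_right = weighted_f1(reference, right, {}, default_weight=1.0)
# Keyphrase weights emphasize the answer-bearing tokens.
weighted_wrong = weighted_f1(reference, wrong, keyphrase_weights)
weighted_right = weighted_f1(reference, right, keyphrase_weights)
```

With uniform weights the factually wrong answer scores higher because it shares more tokens with the reference. With the hand-set keyphrase weights the ordering flips, which is the behavior the abstract attributes to token weighting.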

research
09/21/2023

SQUARE: Automatic Question Answering Evaluation using Multiple Positive and Negative References

Evaluation of QA systems is very challenging and expensive, with the mos...
research
08/13/2021

Semantic Answer Similarity for Evaluating Question Answering Models

The evaluation of question answering models compares ground-truth annota...
research
02/28/2022

'Tis but Thy Name: Semantic Question Answering Evaluation with 11M Names for 1M Entities

Classic lexical-matching-based QA metrics are slowly being phased out be...
research
08/30/2018

Towards a Better Metric for Evaluating Question Generation Systems

There has always been criticism for using n-gram based similarity metric...
research
09/17/2020

Small but Mighty: New Benchmarks for Split and Rephrase

Split and Rephrase is a text simplification task of rewriting a complex ...
research
11/19/2022

Towards good validation metrics for generative models in offline model-based optimisation

In this work we propose a principled evaluation framework for model-base...
research
07/29/2016

SPICE: Semantic Propositional Image Caption Evaluation

There is considerable interest in the task of automatically generating i...
