SQUARE: Automatic Question Answering Evaluation using Multiple Positive and Negative References

09/21/2023
by   Matteo Gabburo, et al.

Evaluating QA systems is challenging and expensive; the most reliable approach is human annotation of the correctness of answers to questions. Recent works (AVA, BEM) have shown that similarity metrics based on transformer LM encoders transfer well to QA evaluation, but they are limited by their use of a single correct reference answer. We propose a new evaluation metric, SQuArE (Sentence-level QUestion AnsweRing Evaluation), which uses multiple reference answers (combining multiple correct and incorrect references) for sentence-form QA. We evaluate SQuArE on both sentence-level extractive (Answer Selection) and generative (GenQA) QA systems across multiple academic and industrial datasets, and show that it outperforms previous baselines and obtains the highest correlation with human annotations.

research
05/24/2023

Learning Answer Generation using Supervision from Automatic Question Answering Evaluators

Recent studies show that sentence-level extractive QA, i.e., based on An...
research
05/02/2020

AVA: an Automatic eValuation Approach to Question Answering Systems

We introduce AVA, an automatic evaluation approach for Question Answerin...
research
05/01/2020

KPQA: A Metric for Generative Question Answering Using Word Weights

For the automatic evaluation of Generative Question Answering (genQA) sy...
research
10/07/2020

Unsupervised Evaluation for Question Answering with Transformers

It is challenging to automatically evaluate the answer of a QA model at ...
research
04/12/2022

ASQA: Factoid Questions Meet Long-Form Answers

An abundance of datasets and availability of reliable evaluation metrics...
research
09/11/2019

A Discrete Hard EM Approach for Weakly Supervised Question Answering

Many question answering (QA) tasks only provide weak supervision for how...
research
05/21/2023

Evaluating Open Question Answering Evaluation

This study focuses on the evaluation of Open Question Answering (Open-QA...
