Semantic Answer Similarity for Evaluating Question Answering Models

08/13/2021
by Julian Risch, et al.

The evaluation of question answering models compares ground-truth annotations with model predictions. However, as of today, this comparison is mostly lexical and therefore misses answers that have no lexical overlap with the ground truth but are still semantically similar, thus treating correct answers as incorrect. This underestimation of the true performance of models hinders user acceptance in applications and complicates a fair comparison of different models. Therefore, there is a need for an evaluation metric that is based on semantics instead of pure string similarity. In this short paper, we present SAS, a cross-encoder-based metric for the estimation of semantic answer similarity, and compare it to seven existing metrics. To this end, we create an English and a German three-way annotated evaluation dataset containing pairs of answers along with human judgments of their semantic similarity, which we release along with an implementation of the SAS metric and the experiments. We find that semantic similarity metrics based on recent transformer models correlate much better with human judgment than traditional lexical similarity metrics on our two newly created datasets and one dataset from related work.
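
To make the limitation of lexical comparison concrete, the following minimal sketch implements two lexical metrics, exact match and token-level F1, and applies them to answer pairs. The abstract does not name the lexical metrics it refers to, so the choice of these two, the normalization rules, and the example answer pairs are all assumptions made for illustration; this is not the SAS metric itself, which relies on a trained cross-encoder.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list:
    """Lowercase, drop punctuation and English articles, split on whitespace.

    These normalization rules are an assumption for this sketch.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, truth: str) -> float:
    """Return 1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))


def token_f1(prediction: str, truth: str) -> float:
    """Harmonic mean of token precision and recall (bag-of-words overlap)."""
    pred, gold = normalize(prediction), normalize(truth)
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical (prediction, ground truth) pairs, not taken from the paper's datasets.
pairs = [
    ("a century", "100 years"),          # same meaning, no shared tokens
    ("in 1945", "1945"),                 # partial token overlap
    ("The Eiffel Tower", "eiffel tower"),  # identical after normalization
]

for prediction, truth in pairs:
    em = exact_match(prediction, truth)
    f1 = token_f1(prediction, truth)
    print(f"{prediction!r} vs {truth!r}: EM={em:.2f}, F1={f1:.2f}")
```

The first pair scores zero on both metrics although a human annotator would likely judge the two answers equivalent, which is the kind of case a semantic metric is meant to capture.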
