Evaluation of Semantic Answer Similarity Metrics

06/25/2022
by Farida Mustafazade, et al.

Existing general-purpose evaluation metrics for machine translation and natural language generation have several known issues, and question-answering (QA) systems are no different in that respect. To build robust QA systems, we need equally robust evaluation systems that verify whether model predictions are similar to ground-truth annotations. Comparing similarity on the basis of semantics rather than pure string overlap is important for comparing models fairly and for setting more realistic acceptance criteria in real-life applications. We build upon what is, to our knowledge, the first paper to use transformer-based model metrics to assess semantic answer similarity, and we achieve higher correlations with human judgement in the case of no lexical overlap. We propose cross-encoder augmented bi-encoder and BERTScore models for semantic answer similarity, trained on a new dataset consisting of name pairs of US-American public figures. To the best of our knowledge, we provide the first dataset of co-referent name string pairs along with their similarities, which can be used for training.
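
To make the limitation of pure string overlap concrete, the following is a minimal sketch of two common lexical QA metrics, exact match and token-level F1. It is not the authors' code: the function names and the normalization rule are assumptions, and the name pairs in the usage note are illustrative examples rather than entries from the paper's dataset.

```python
import re
from collections import Counter


def normalize(text):
    """Lowercase, replace punctuation with spaces, and split on whitespace."""
    return re.sub(r"[^\w\s]", " ", text.lower()).split()


def exact_match(prediction, truth):
    """True only if both answers are identical after normalization."""
    return normalize(prediction) == normalize(truth)


def token_f1(prediction, truth):
    """Harmonic mean of token precision and recall (bag-of-tokens overlap)."""
    pred, gold = normalize(prediction), normalize(truth)
    if not pred or not gold:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == gold)
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Under these assumptions, `token_f1("JFK", "John Fitzgerald Kennedy")` is 0.0 even though both strings refer to the same person, and `token_f1("Bill Clinton", "William Jefferson Clinton")` only receives partial credit. Co-referent pairs like these, where lexical overlap is low or absent, are the case a semantic similarity metric is meant to handle.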


