RQUGE: Reference-Free Metric for Evaluating Question Generation by Answering the Question

11/02/2022
by   Alireza Mohammadshahi, et al.

Existing metrics for evaluating the quality of automatically generated questions, such as BLEU, ROUGE, BERTScore, and BLEURT, compare the reference and predicted questions, assigning a high score when there is considerable lexical overlap or semantic similarity between the candidate and the reference question. This approach has two major shortcomings. First, it requires expensive human-provided reference questions. Second, it penalises valid questions that have little lexical or semantic similarity to the reference questions. In this paper, we propose a new metric, RQUGE, based on the answerability of the candidate question given the context. The metric consists of a question-answering module and a span scorer module, both built from pre-trained models in the existing literature, so our metric can be used without further training. We show that RQUGE correlates better with human judgment than existing metrics, despite not relying on a reference question, and that it is significantly more robust to several adversarial corruptions. Additionally, we illustrate that the performance of QA models on out-of-domain datasets can be significantly improved by fine-tuning on synthetic data generated by a question generation model and re-ranked by RQUGE.
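To make the scoring recipe concrete, below is a minimal sketch of an RQUGE-style pipeline, not the authors' released code: a generic extractive QA model stands in for the paper's pre-trained question-answering module, and a sentence-embedding cosine similarity stands in for its span scorer. The model names and the helper rquge_like_score are illustrative assumptions, not taken from the paper.

    from transformers import pipeline
    from sentence_transformers import SentenceTransformer, util

    # Stand-ins for the paper's pre-trained modules (illustrative model choices).
    qa_model = pipeline("question-answering", model="deepset/roberta-base-squad2")
    span_scorer = SentenceTransformer("all-MiniLM-L6-v2")

    def rquge_like_score(context, candidate_question, gold_answer):
        # 1) Answer the candidate question from the context
        #    (no reference question is needed at any point).
        predicted_answer = qa_model(question=candidate_question, context=context)["answer"]
        # 2) Score the predicted answer span against the gold answer span.
        embeddings = span_scorer.encode([predicted_answer, gold_answer], convert_to_tensor=True)
        return util.cos_sim(embeddings[0], embeddings[1]).item()

    context = ("The metric was introduced in 2022 and evaluates a question "
               "by checking whether it can be answered from its context.")
    print(rquge_like_score(context, "When was the metric introduced?", "2022"))

Note how a valid generated question that is lexically far from any reference would still score well here, since only the recovered answer span is compared; the same score can also be used to re-rank synthetic question-answer pairs before fine-tuning a QA model, as the paper does with its own modules.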


Related research

QACE: Asking Questions to Evaluate an Image Caption (08/28/2021)
In this paper, we propose QACE, a new metric based on Question Answering...

QAScore – An Unsupervised Unreferenced Metric for the Question Generation Evaluation (10/09/2022)
Question Generation (QG) aims to automate the task of composing question...

Semantic-based Self-Critical Training For Question Generation (08/26/2021)
We present in this work a fully Transformer-based reinforcement learning...

QRelScore: Better Evaluating Generated Questions with Deeper Understanding of Context-aware Relevance (04/29/2022)
Existing metrics for assessing question generation not only require cost...

Can Question Generation Debias Question Answering Models? A Case Study on Question-Context Lexical Overlap (09/23/2021)
Question answering (QA) models for reading comprehension have been demon...

Evaluation of Semantic Answer Similarity Metrics (06/25/2022)
There are several issues with the existing general machine translation o...

AGenT Zero: Zero-shot Automatic Multiple-Choice Question Generation for Skill Assessments (11/25/2020)
Multiple-choice questions (MCQs) offer the most promising avenue for ski...
