QAScore – An Unsupervised Unreferenced Metric for the Question Generation Evaluation

10/09/2022
by   Tianbo Ji, et al.

Question Generation (QG) aims to automate the task of composing questions for a passage, given a set of chosen answers found within that passage. In recent years, neural generation models have substantially improved the quality of automatically generated questions, especially compared to traditional approaches that rely on manually crafted heuristics. However, the metrics commonly applied in QG evaluations have been criticized for their low agreement with human judgement. We therefore propose QAScore, a new reference-free evaluation metric that has the potential to provide a better mechanism for evaluating QG systems. Instead of fine-tuning a language model to maximize its correlation with human judgements, QAScore evaluates a question by computing the cross entropy according to the probability that the language model can correctly generate the masked words in the answer to that question. Furthermore, we conduct a new crowd-sourcing human evaluation experiment for QG to investigate how well QAScore and other metrics correlate with human judgements. Experiments show that QAScore obtains a stronger correlation with the results of our proposed human evaluation method than the existing word-overlap-based metrics BLEU and ROUGE, and than the existing pretrained-model-based metric BERTScore.
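The scoring idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the exact formula, so the sign convention (a sum of log-probabilities, i.e. a negative cross entropy), the one-token-at-a-time masking, and every name below (`masked_answer_score`, `toy_predict_proba`, `MASK`) are assumptions made for illustration. A hard-coded toy function stands in for the pretrained language model.

```python
import math

MASK = "[MASK]"  # hypothetical mask symbol, chosen for illustration


def masked_answer_score(passage, question, answer_tokens, predict_proba):
    """Sum the log-probabilities of each answer token when it alone is masked.

    `predict_proba(passage, question, masked_tokens, position)` is an assumed
    interface: it returns a dict mapping candidate words to probabilities
    for the masked position.
    """
    total = 0.0
    for i, token in enumerate(answer_tokens):
        # Mask exactly one answer token and ask the model to recover it.
        masked = answer_tokens[:i] + [MASK] + answer_tokens[i + 1:]
        dist = predict_proba(passage, question, masked, i)
        # A small floor avoids log(0) for words the model did not propose.
        total += math.log(dist.get(token, 1e-12))
    return total


def toy_predict_proba(passage, question, masked_tokens, position):
    """Toy stand-in for a language model (assumption, not a real model)."""
    if "capital" in question:
        return {"Paris": 0.9, "Rome": 0.1}
    return {"Paris": 0.2, "Rome": 0.8}


passage = "Paris is the capital of France."
answer = ["Paris"]

relevant = masked_answer_score(
    passage, "What is the capital of France?", answer, toy_predict_proba
)
off_topic = masked_answer_score(
    passage, "Who painted the Mona Lisa?", answer, toy_predict_proba
)
```

Under these assumptions, a question that helps the model recover the masked answer receives a higher (less negative) score than an unrelated question, which is the behaviour the abstract attributes to the metric.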

research
11/02/2022

RQUGE: Reference-Free Metric for Evaluating Question Generation by Answering the Question

Existing metrics for evaluating the quality of automatically generated q...
research
05/26/2023

Evaluation of Question Generation Needs More References

Question generation (QG) is the task of generating a valid and fluent qu...
research
04/02/2022

CTRLEval: An Unsupervised Reference-Free Metric for Evaluating Controlled Text Generation

Existing reference-free metrics have obvious limitations for evaluating ...
research
09/19/2023

What is the Best Automated Metric for Text to Motion Generation?

There is growing interest in generating skeleton-based human motions fro...
research
04/29/2022

QRelScore: Better Evaluating Generated Questions with Deeper Understanding of Context-aware Relevance

Existing metrics for assessing question generation not only require cost...
research
07/29/2016

SPICE: Semantic Propositional Image Caption Evaluation

There is considerable interest in the task of automatically generating i...
research
05/15/2022

Mask and Cloze: Automatic Open Cloze Question Generation using a Masked Language Model

Open cloze questions have been attracting attention for both measuring t...
