Evaluation of Question Generation Needs More References

05/26/2023
by Shinhyeok Oh, et al.

Question generation (QG) is the task of generating a valid and fluent question from a given context and a target answer. Depending on the purpose, instructors can ask questions about different concepts even for the same context, and even the same concept can be phrased in different ways. However, QG evaluation usually relies on similarity to a single reference, measured with n-gram-based or learned metrics, which is not sufficient to fully assess the potential of QG methods. To this end, we propose paraphrasing the reference question for a more robust QG evaluation. Using large language models such as GPT-3, we create semantically and syntactically diverse questions, and then adopt a simple aggregation of popular evaluation metrics as the final score. Our experiments show that using multiple (pseudo) references is more effective for QG evaluation and correlates more strongly with human judgments than evaluation against a single reference.
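To make the scoring scheme concrete, here is a minimal sketch of multi-reference evaluation against pseudo-references, not the authors' exact pipeline. The choice of sacrebleu's sentence-level BLEU as the base metric and max-over-references as the aggregation are illustrative assumptions; the paper aggregates several popular metrics, with pseudo-references produced by GPT-3 paraphrases of the original reference question.

```python
# Sketch: score a generated question against multiple (pseudo-)references.
# Assumptions (not from the paper): BLEU via sacrebleu as the base metric,
# and max-over-references as the "simple aggregation".
import sacrebleu

def multi_reference_score(candidate: str, references: list[str]) -> float:
    # Score the candidate against each reference separately; taking the max
    # rewards a match with any acceptable phrasing of the question.
    return max(
        sacrebleu.sentence_bleu(candidate, [ref]).score
        for ref in references
    )

# One human-written reference plus hypothetical LLM paraphrases of it.
references = [
    "What is the capital of France?",
    "Which city serves as the capital of France?",
    "What city is France's capital?",
]
generated = "Which city is the capital of France?"

print(f"single-reference BLEU: {sacrebleu.sentence_bleu(generated, references[:1]).score:.1f}")
print(f"multi-reference BLEU:  {multi_reference_score(generated, references):.1f}")
```

Here the generated question is penalized under the single reference despite being a valid paraphrase, while the multi-reference score credits it, which is the effect the paper's evaluation aims to capture.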

Related research

08/06/2023 | Towards Multiple References Era – Addressing Data Leakage and Limited Reference Diversity in NLG Evaluation
  N-gram matching-based evaluation metrics, such as BLEU and chrF, are wid...

08/26/2021 | Semantic-based Self-Critical Training For Question Generation
  We present in this work a fully Transformer-based reinforcement learning...

10/09/2022 | QAScore – An Unsupervised Unreferenced Metric for the Question Generation Evaluation
  Question Generation (QG) aims to automate the task of composing question...

11/21/2022 | Evaluating the Knowledge Dependency of Questions
  The automatic generation of Multiple Choice Questions (MCQ) has the pote...

03/09/2022 | On the Evaluation of Answer-Agnostic Paragraph-level Multi-Question Generation
  We study the task of predicting a set of salient questions from a given ...

04/29/2022 | QRelScore: Better Evaluating Generated Questions with Deeper Understanding of Context-aware Relevance
  Existing metrics for assessing question generation not only require cost...

03/01/2023 | Can ChatGPT Assess Human Personalities? A General Evaluation Framework
  Large Language Models (LLMs) especially ChatGPT have produced impressive...
