Towards a Better Metric for Evaluating Question Generation Systems

08/30/2018
by Preksha Nema, et al.

N-gram based similarity metrics such as BLEU and NIST have long been criticized as measures of NLG system performance. Nevertheless, these metrics remain popular and have recently been used to evaluate systems that automatically generate questions from documents, knowledge graphs, images, etc. Given the rising interest in such automatic question generation (AQG) systems, it is important to examine objectively whether these metrics are suitable for the task. In particular, it is important to verify whether the metrics used for evaluating AQG systems focus on the answerability of the generated question by preferring questions that contain all relevant information, such as question type (Wh-types), entities, and relations. In this work, we show that current automatic evaluation metrics based on n-gram similarity do not always correlate well with human judgments about the answerability of a question. To alleviate this problem, and as a first step towards better evaluation metrics for AQG, we introduce a scoring function that captures answerability and show that existing metrics correlate significantly better with human judgments when this scoring function is integrated with them. The scripts and data developed as part of this work are publicly available at https://github.com/PrekshaNema25/Answerability-Metric.
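To make the idea concrete, the following is a minimal sketch of how an answerability score might be combined with an n-gram metric. It is not the authors' implementation (their scripts are in the repository linked above). The word lists, the category weights in `WEIGHTS`, the mixing parameter `delta`, the linear combination itself, and the use of unigram F1 as a stand-in for BLEU or NIST are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical word lists and weights, chosen only for illustration.
QUESTION_WORDS = {"what", "who", "whom", "whose", "when", "where", "which", "why", "how"}
FUNCTION_WORDS = {"the", "a", "an", "of", "in", "is", "was", "did", "do", "does", "to", "for", "on", "at", "by"}
WEIGHTS = {"question": 0.3, "content": 0.6, "function": 0.1}


def bucket(tokens):
    """Count tokens per category; 'content' stands in for entities and relations."""
    counts = {name: Counter() for name in WEIGHTS}
    for tok in tokens:
        if tok in QUESTION_WORDS:
            counts["question"][tok] += 1
        elif tok in FUNCTION_WORDS:
            counts["function"][tok] += 1
        else:
            counts["content"][tok] += 1
    return counts


def weighted_coverage(other, denominator):
    """Weighted share of `denominator` tokens that also occur in `other`.

    Categories that are empty in `denominator` are skipped and the
    remaining weights are renormalized.
    """
    score, total_weight = 0.0, 0.0
    for name, weight in WEIGHTS.items():
        size = sum(denominator[name].values())
        if size == 0:
            continue
        overlap = sum((denominator[name] & other[name]).values())
        score += weight * overlap / size
        total_weight += weight
    return score / total_weight if total_weight else 0.0


def harmonic_mean(p, r):
    return 2 * p * r / (p + r) if p + r else 0.0


def answerability(hypothesis, reference):
    """Harmonic mean of category-weighted precision and recall."""
    hyp = bucket(hypothesis.lower().split())
    ref = bucket(reference.lower().split())
    return harmonic_mean(weighted_coverage(ref, hyp), weighted_coverage(hyp, ref))


def unigram_f1(hypothesis, reference):
    """Plain unigram F1, used here as a stand-in for an n-gram metric."""
    hyp = Counter(hypothesis.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((hyp & ref).values())
    if overlap == 0:
        return 0.0
    return harmonic_mean(overlap / sum(hyp.values()), overlap / sum(ref.values()))


def combined_score(hypothesis, reference, delta=0.7):
    """Assumed integration: a convex combination of the two scores."""
    return (delta * answerability(hypothesis, reference)
            + (1 - delta) * unigram_f1(hypothesis, reference))


reference = "who was the director of titanic"
missing_entity = "who was the director of"   # fluent, but not answerable
missing_function = "who director titanic"    # terse, but still answerable
for question in (missing_entity, missing_function):
    print(question, unigram_f1(question, reference),
          answerability(question, reference), combined_score(question, reference))
```

In this toy example, plain unigram overlap ranks the question that drops the entity above the one that drops only function words, whereas the answerability term reverses that order. With these particular weights the combined score also reverses it, which is the kind of behaviour the abstract argues an AQG metric should have.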


