
DISTO: Evaluating Textual Distractors for Multi-Choice Questions using Negative Sampling based Approach

by Bilal Ghanem, et al.

Multiple choice questions (MCQs) are an efficient and common way to assess reading comprehension (RC). Every MCQ needs a set of distractor answers that are incorrect, but plausible enough to test student knowledge. Distractor generation (DG) models have been proposed, and their performance is typically evaluated using machine translation (MT) metrics. However, MT metrics often misjudge the suitability of generated distractors. We propose DISTO: the first learned evaluation metric for generated distractors. We validate DISTO by showing its scores correlate highly with human ratings of distractor quality. At the same time, DISTO ranks the performance of state-of-the-art DG models very differently from MT-based metrics, showing that MT metrics should not be used for distractor evaluation.
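To illustrate the kind of disagreement the abstract describes, here is a minimal, hypothetical sketch (not the paper's data or implementation): two toy "metrics" score the same five generated distractors, and Spearman rank correlation measures how well each metric's ranking agrees with made-up human quality ratings. A learned metric that tracks plausibility can rank distractors exactly as humans do, while a surface-overlap metric can rank them in the opposite order.

```python
def ranks(xs):
    """Return the rank (1 = smallest) of each value in xs (no ties assumed)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(xs, ys):
    """Spearman correlation via the classic sum-of-squared-rank-differences formula."""
    n = len(xs)
    rx, ry = ranks(xs), ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Entirely made-up scores for five hypothetical distractors.
human_ratings  = [4.5, 2.0, 3.5, 1.0, 5.0]   # hypothetical annotator scores
learned_metric = [0.9, 0.3, 0.7, 0.1, 0.95]  # tracks human judgment closely
overlap_metric = [0.2, 0.8, 0.5, 0.9, 0.1]   # rewards surface overlap instead

print(spearman(human_ratings, learned_metric))  # 1.0: identical ranking
print(spearman(human_ratings, overlap_metric))  # -1.0: inverted ranking
```

This is only a schematic of the correlation analysis; the paper's actual validation uses real human ratings of distractor quality and real metric outputs.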



