Comparing Test Sets with Item Response Theory

by Clara Vania, et al.

Recent years have seen numerous NLP datasets introduced to evaluate the performance of fine-tuned models on natural language understanding tasks. Recent results from large pretrained models, however, show that many of these datasets are largely saturated and unlikely to be able to detect further progress. What kinds of datasets are still effective at discriminating among strong models, and what kinds of datasets should we expect to be able to detect future improvements? To measure this uniformly across datasets, we draw on Item Response Theory and evaluate 29 datasets using predictions from 18 pretrained Transformer models on individual test examples. We find that Quoref, HellaSwag, and MC-TACO are best suited for distinguishing among state-of-the-art models, while SNLI, MNLI, and CommitmentBank seem to be saturated for current strong models. We also observe that the span selection task format, used for QA datasets like QAMR or SQuAD2.0, is effective at differentiating between strong and weak models.
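To make the Item Response Theory setup concrete, here is a minimal sketch of fitting a two-parameter logistic (2PL) IRT model to a binary response matrix of models × test items, estimating each model's ability and each item's difficulty and discrimination by gradient ascent. This is an illustrative toy implementation, not the paper's actual estimation procedure; the function `fit_2pl`, the simulated data, and all hyperparameters are assumptions for demonstration.

```python
import numpy as np

def fit_2pl(responses, n_iters=3000, lr=0.1):
    """Fit a 2PL IRT model by gradient ascent on the Bernoulli log-likelihood.

    P(model i answers item j correctly) = sigmoid(a_j * (theta_i - b_j)),
    where theta_i is model ability, b_j item difficulty, a_j item discrimination.
    responses: binary array of shape (n_models, n_items).
    """
    n_models, n_items = responses.shape
    rng = np.random.default_rng(0)
    theta = rng.normal(0, 0.1, n_models)  # model abilities
    a = np.ones(n_items)                  # item discriminations
    b = rng.normal(0, 0.1, n_items)       # item difficulties
    for _ in range(n_iters):
        z = a[None, :] * (theta[:, None] - b[None, :])
        p = 1.0 / (1.0 + np.exp(-z))
        err = responses - p  # d(log-likelihood)/dz for each cell
        theta += lr * (err * a[None, :]).mean(axis=1)
        b += lr * (-err * a[None, :]).mean(axis=0)
        a += lr * (err * (theta[:, None] - b[None, :])).mean(axis=0)
        theta -= theta.mean()  # center abilities for identifiability
    return theta, a, b

# Illustrative use: 18 simulated "models" answering 200 "test items".
rng = np.random.default_rng(0)
true_theta = rng.normal(0, 1, 18)
true_b = rng.normal(0, 1, 200)
prob = 1.0 / (1.0 + np.exp(-(true_theta[:, None] - true_b[None, :])))
responses = (rng.random((18, 200)) < prob).astype(float)
theta, a, b = fit_2pl(responses)
```

Under this model, items that remain hard even for high-ability models get large fitted difficulties `b`, which is the sense in which a dataset can "distinguish among state-of-the-art models"; nearly saturated datasets have few such items.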



Code Repositories

Analysis of NLU test sets with IRT