Comparing Test Sets with Item Response Theory

06/01/2021
by Clara Vania et al.

Recent years have seen numerous NLP datasets introduced to evaluate the performance of fine-tuned models on natural language understanding tasks. Recent results from large pretrained models, though, show that many of these datasets are largely saturated and unlikely to be able to detect further progress. What kinds of datasets are still effective at discriminating among strong models, and which datasets should we expect to detect future improvements? To measure this uniformly across datasets, we draw on Item Response Theory and evaluate 29 datasets using predictions from 18 pretrained Transformer models on individual test examples. We find that Quoref, HellaSwag, and MC-TACO are best suited for distinguishing among state-of-the-art models, while SNLI, MNLI, and CommitmentBank seem to be saturated for current strong models. We also observe that the span selection task format, used for QA datasets like QAMR and SQuAD 2.0, is effective in differentiating between strong and weak models.
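The evaluation setup can be sketched with a two-parameter logistic (2PL) IRT model: each model j has a latent ability θ_j, each test item i has a difficulty b_i and a discrimination a_i, and the probability that model j answers item i correctly is σ(a_i(θ_j − b_i)). The following is a minimal, illustrative NumPy sketch fit by plain gradient ascent on synthetic data; it assumes the paper's general setup (models as IRT "subjects", test examples as items) but the variable names, hyperparameters, and optimizer are our own assumptions, not the authors' implementation.

```python
# Illustrative 2PL IRT fit with gradient ascent (synthetic data).
# Rows = models ("subjects"), columns = test items; entries are 1 if
# the model answered the item correctly, else 0.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic response matrix: 18 models x 40 items.
true_theta = rng.normal(0, 1, size=18)    # model ability
true_b = rng.normal(0, 1, size=40)        # item difficulty
true_a = rng.uniform(0.5, 2.0, size=40)   # item discrimination
logits = true_a * (true_theta[:, None] - true_b)
responses = (rng.random((18, 40)) < 1 / (1 + np.exp(-logits))).astype(float)

# Parameters to fit.
theta = np.zeros(18)
a = np.ones(40)
b = np.zeros(40)

lr = 0.01
for step in range(2000):
    z = a * (theta[:, None] - b)     # (models, items)
    p = 1 / (1 + np.exp(-z))
    err = responses - p              # d(log-likelihood)/dz for a Bernoulli
    theta += lr * (err * a).sum(axis=1)
    b += lr * (-err * a).sum(axis=0)
    a += lr * (err * (theta[:, None] - b)).sum(axis=0)
    a = np.clip(a, 0.1, None)        # keep discrimination positive
    theta -= theta.mean()            # fix the location of the ability scale

# Items with high discrimination `a` separate strong from weak models;
# items whose difficulty `b` falls below every model's ability are saturated.
print("most discriminating item:", int(a.argmax()))
```

Under this reading, a dataset is "best suited for distinguishing among state-of-the-art models" when its items have high discrimination at difficulty levels near the abilities of the strongest models, and "saturated" when item difficulties sit well below every current model's ability.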

02/25/2021

Automated essay scoring using efficient transformer-based language models

Automated Essay Scoring (AES) is a cross-disciplinary effort involving E...
07/09/2020

Advances of Transformer-Based Models for News Headline Generation

Pretrained language models based on Transformer architecture are the rea...
03/21/2022

AraBART: a Pretrained Arabic Sequence-to-Sequence Model for Abstractive Summarization

Like most natural language understanding and generation tasks, state-of-...
02/14/2020

Stress Test Evaluation of Transformer-based Models in Natural Language Understanding Tasks

There has been significant progress in recent years in the field of Natu...
04/05/2022

Improved and Efficient Conversational Slot Labeling through Question Answering

Transformer-based pretrained language models (PLMs) offer unmatched perf...
05/02/2020

UnifiedQA: Crossing Format Boundaries With a Single QA System

Question answering (QA) tasks have been posed using a variety of formats...
02/17/2020

Predicting trends in the quality of state-of-the-art neural networks without access to training or testing data

In many applications, one works with deep neural network (DNN) models tr...

Code Repositories

nlu-test-sets

Analysis of NLU test sets with IRT
