Comparing Test Sets with Item Response Theory

06/01/2021
by Clara Vania, et al.

Recent years have seen numerous NLP datasets introduced to evaluate the performance of fine-tuned models on natural language understanding tasks. Recent results from large pretrained models, though, show that many of these datasets are largely saturated and unlikely to be able to detect further progress. What kinds of datasets are still effective at discriminating among strong models, and what kinds of datasets should we expect to be able to detect future improvements? To measure this uniformly across datasets, we draw on Item Response Theory and evaluate 29 datasets using predictions from 18 pretrained Transformer models on individual test examples. We find that Quoref, HellaSwag, and MC-TACO are best suited for distinguishing among state-of-the-art models, while SNLI, MNLI, and CommitmentBank seem to be saturated for current strong models. We also observe that the span selection task format, used for QA datasets like QAMR or SQuAD2.0, is effective in differentiating between strong and weak models.
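To illustrate the general approach, the sketch below fits a two-parameter logistic (2PL) IRT model to a binary response matrix (models as "subjects", test examples as items) by simple gradient ascent. This is a minimal illustration under assumptions: the response matrix is synthetic, and the 2PL parameterization and fitting procedure are stand-ins, not necessarily the exact IRT variant or estimator used in the paper.

```python
import numpy as np

# Hypothetical binary response matrix: rows = models ("subjects"),
# columns = test items; 1 means the model got the item right.
# Synthetic data stands in for real model predictions.
rng = np.random.default_rng(0)
n_models, n_items = 18, 500
responses = rng.integers(0, 2, size=(n_models, n_items)).astype(float)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# 2PL IRT: P(correct | model j, item i) = sigmoid(a_i * (theta_j - b_i)),
# with model ability theta, item difficulty b, item discrimination a.
theta = np.zeros(n_models)
b = np.zeros(n_items)
log_a = np.zeros(n_items)          # parameterize a = exp(log_a) > 0

lr = 0.05
for step in range(2000):
    a = np.exp(log_a)
    logits = a[None, :] * (theta[:, None] - b[None, :])
    p = sigmoid(logits)
    err = responses - p            # gradient of Bernoulli log-likelihood w.r.t. logits

    # Gradient-ascent updates for abilities, difficulties, discriminations.
    theta += lr * (err * a[None, :]).mean(axis=1)
    b     += lr * (-err * a[None, :]).mean(axis=0)
    log_a += lr * (err * (theta[:, None] - b[None, :]) * a[None, :]).mean(axis=0)

# Items with high estimated discrimination are the ones that best
# separate strong models from weak ones.
top_items = np.argsort(-np.exp(log_a))[:10]
print("Most discriminative items:", top_items)
```

Under this kind of fit, a dataset whose items retain high discrimination at difficulty levels near the abilities of the strongest models is the kind the paper identifies as still useful for distinguishing state-of-the-art systems.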
