Statistical Dataset Evaluation: Reliability, Difficulty, and Validity

12/19/2022
by Chengwen Wang, et al.

Datasets serve as crucial training resources and as trackers of model performance. However, existing datasets exhibit numerous problems that induce biased models and unreliable evaluation results. In this paper, we propose a model-agnostic framework for automatic dataset quality evaluation. We examine the statistical properties of datasets along three fundamental dimensions (reliability, difficulty, and validity), following classical test theory. Taking Named Entity Recognition (NER) datasets as a case study, we introduce nine statistical metrics within this framework. Experimental results and human evaluation validate that the framework effectively assesses various aspects of dataset quality. Furthermore, we study how dataset scores on our statistical metrics affect model performance, and we call for dataset quality evaluation or targeted dataset improvement before training or testing models.

