Revisiting Sample Size Determination in Natural Language Understanding

07/01/2023
by Ernie Chang, et al.

Knowing exactly how many data points need to be labeled to achieve a certain model performance is a hugely beneficial step towards reducing the overall budgets for annotation. It pertains to both active learning and traditional data annotation, and is particularly beneficial for low-resource scenarios. Nevertheless, it remains a largely under-explored area of research in NLP. We therefore explored various techniques for estimating the training sample size necessary to achieve a targeted performance value. We derived a simple yet effective approach to predict the maximum achievable model performance based on a small amount of training samples, which serves as an early indicator during data annotation for data quality and sample size determination. We performed ablation studies on four language understanding tasks, and showed that the proposed approach allows us to forecast model performance within a small margin of mean absolute error (  0.9
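To make the idea of sample size determination concrete, the following is a minimal sketch of one common way to extrapolate a learning curve. It is not necessarily the approach proposed in the paper, which the abstract does not specify. It assumes an inverse power-law form score(n) ≈ a - b·n^(-c), where a plays the role of the maximum achievable performance. The function names, the grid over c, and the data points are all hypothetical and purely illustrative.

```python
import math


def fit_learning_curve(sizes, scores, c_grid=None):
    """Fit score(n) ~ a - b * n**(-c).

    Assumed model (not taken from the paper): an inverse power law.
    For each candidate exponent c, the model is linear in (a, b), so a
    and b are found by ordinary least squares; the c with the lowest
    squared error is kept.
    """
    if c_grid is None:
        c_grid = [i / 100 for i in range(5, 201)]  # candidates 0.05 .. 2.00
    best = None
    my = sum(scores) / len(scores)
    for c in c_grid:
        xs = [n ** (-c) for n in sizes]
        mx = sum(xs) / len(xs)
        sxx = sum((x - mx) ** 2 for x in xs)
        sxy = sum((x - mx) * (y - my) for x, y in zip(xs, scores))
        slope = sxy / sxx
        a = my - slope * mx      # asymptote: estimated performance ceiling
        b = -slope
        sse = sum((a - b * x - y) ** 2 for x, y in zip(xs, scores))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    _, a, b, c = best
    return a, b, c


def required_sample_size(a, b, c, target):
    """Invert the fitted curve: smallest n with a - b * n**(-c) >= target.

    Returns None when the target is at or above the estimated ceiling,
    or when the fit is not an increasing curve (b <= 0).
    """
    if target >= a or b <= 0:
        return None
    return math.ceil((b / (a - target)) ** (1 / c))


# Synthetic points generated from a = 0.9, b = 2.0, c = 0.5.
# These are NOT results from the paper.
sizes = [100, 200, 400, 800, 1600]
scores = [0.9 - 2.0 * n ** -0.5 for n in sizes]

a, b, c = fit_learning_curve(sizes, scores)
n_needed = required_sample_size(a, b, c, target=0.85)
```

In practice the observed scores would come from models trained on increasing subsets of the labeled data, and a target above the fitted ceiling a signals that more annotation alone is unlikely to reach it.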

research
12/04/2020

Fine-tuning BERT for Low-Resource Natural Language Understanding via Active Learning

Recently, leveraging pre-trained Transformer based language models in do...
research
11/06/2012

Sample Size Planning for Classification Models

In biospectroscopy, suitably annotated and statistically independent sam...
research
08/14/2020

How little data do we need for patient-level prediction?

Objective: Provide guidance on sample size considerations for developing...
research
08/01/2023

ALE: A Simulation-Based Active Learning Evaluation Framework for the Parameter-Driven Comparison of Query Strategies for NLP

Supervised machine learning and deep learning require a large amount of ...
research
03/01/2023

Bayesian inference for the Net Promoter Score

The Net Promoter Score is a simple measure used by several companies as ...
research
08/11/2022

Statistical parameters for assessing environmental model performance related to sample size: Case study in ocean color remote sensing

Environmental model performances need to be assessed using some statisti...
research
08/11/2018

The Impact of Automatic Pre-annotation in Clinical Note Data Element Extraction - the CLEAN Tool

Objective. Annotation is expensive but essential for clinical note revie...
