Fine Tuning with Abnormal Examples

04/26/2023
by   Will Rieger, et al.

Given the prevalence of crowd-sourced labor in creating natural language processing datasets, these datasets have become increasingly large. For instance, the SQuAD dataset currently sits at over 80,000 records. However, because the English language is rather repetitive in structure, the distribution of word frequencies in the SQuAD dataset's contexts is relatively unchanged. By measuring each sentence's distance from the co-variate distance of frequencies of all sentences in the dataset, we identify 10,500 examples that create a more uniform distribution for training. Fine-tuning ELECTRA [4] on this subset of examples reaches better performance than a model trained on all 87,000 examples. Herein we introduce a methodology for systematically pruning datasets for fine-tuning, reaching better out-of-sample performance.
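The selection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the "co-variate distance" can be read as a Mahalanobis distance between each sentence's word-frequency vector and the dataset mean, and that the "abnormal" examples are the k sentences with the largest distance. The function names (`frequency_matrix`, `mahalanobis_distances`, `select_abnormal`), the tokenizer, and the `vocab_size` parameter are all hypothetical choices made for illustration.

```python
import re
from collections import Counter

import numpy as np


def frequency_matrix(sentences, vocab_size=50):
    """Relative word frequencies per sentence over the most common words."""
    tokenized = [re.findall(r"[a-z']+", s.lower()) for s in sentences]
    counts = Counter(tok for toks in tokenized for tok in toks)
    vocab = [word for word, _ in counts.most_common(vocab_size)]
    index = {word: i for i, word in enumerate(vocab)}
    X = np.zeros((len(sentences), len(vocab)))
    for row, toks in enumerate(tokenized):
        for tok in toks:
            col = index.get(tok)
            if col is not None:
                X[row, col] += 1
        if toks:
            X[row] /= len(toks)  # normalize counts by sentence length
    return X


def mahalanobis_distances(X):
    """Distance of each row from the mean, scaled by the feature covariance.

    Assumption: this stands in for the paper's "co-variate distance".
    """
    centered = X - X.mean(axis=0)
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    # The pseudo-inverse handles the singular covariance matrices that
    # sparse word-frequency features typically produce.
    inv = np.linalg.pinv(cov, rcond=1e-10)
    squared = np.einsum("ij,jk,ik->i", centered, inv, centered)
    return np.sqrt(np.clip(squared, 0.0, None))


def select_abnormal(sentences, k, vocab_size=50):
    """Indices (ascending) of the k sentences farthest from the mean."""
    distances = mahalanobis_distances(frequency_matrix(sentences, vocab_size))
    farthest = np.argsort(-distances)[:k]
    return sorted(farthest.tolist())
```

Under these assumptions, calling `select_abnormal(contexts, k=10500)` on a list of context strings would return the indices of a pruned subset that could then be passed to a fine-tuning routine. Whether the largest-distance rule matches the paper's actual selection criterion is not stated in the abstract.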


