Data Selection for Fine-tuning Large Language Models Using Transferred Shapley Values

06/16/2023
by Stephanie Schoch, et al.

Although Shapley values have been shown to be highly effective for identifying harmful training instances, dataset size and model complexity constraints limit the ability to apply Shapley-based data valuation to fine-tuning large pre-trained language models. To address this, we propose TS-DShapley, an algorithm that reduces the computational cost of Shapley-based data valuation through: 1) an efficient sampling-based method that aggregates Shapley values computed on subsets to value the entire training set, and 2) a value transfer method that leverages value information extracted from a simple classifier trained on representations from the target language model. Our experiments applying TS-DShapley to select data for fine-tuning BERT-based language models on benchmark natural language understanding (NLU) datasets show that TS-DShapley outperforms existing data selection methods. Further, TS-DShapley can filter the fine-tuning data to improve language model performance over training with the full fine-tuning dataset.
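As an illustration only, the sketch below shows how the two components described in the abstract might fit together: data Shapley values are estimated with a Monte Carlo procedure on small sampled subsets using a cheap proxy classifier (here, logistic regression over frozen language-model embeddings standing in for the "simple classifier"), then aggregated per point across the subsets it appears in. The estimator, the proxy classifier, and every function name and parameter are assumptions made for this sketch, not the paper's implementation.

```python
# Minimal illustrative sketch (not the authors' reference implementation) of:
# (1) estimating Shapley values on sampled training subsets and aggregating
#     them across subsets, and
# (2) using a cheap proxy classifier over frozen language-model embeddings.
import numpy as np
from sklearn.linear_model import LogisticRegression


def mc_shapley(X, y, X_val, y_val, n_perms=50):
    """Monte Carlo estimate of data Shapley values for one subset.

    For each sampled permutation, points are added one at a time; each
    point's marginal contribution to validation accuracy is recorded,
    and contributions are averaged over permutations.
    """
    n = len(X)
    values = np.zeros(n)
    for _ in range(n_perms):
        perm = np.random.permutation(n)
        prev_score = 0.0  # assumed utility of the empty set
        for k in range(1, n + 1):
            idx = perm[:k]
            if len(np.unique(y[idx])) < 2:
                score = prev_score  # cannot fit a classifier on one class
            else:
                clf = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
                score = clf.score(X_val, y_val)
            values[perm[k - 1]] += (score - prev_score) / n_perms
            prev_score = score
    return values


def subset_aggregated_values(emb, labels, val_emb, val_labels,
                             n_subsets=10, subset_size=100):
    """Estimate values on random subsets of the training set and average
    each point's value over the subsets it was sampled into."""
    n = len(emb)
    value_sum, counts = np.zeros(n), np.zeros(n)
    for _ in range(n_subsets):
        sub = np.random.choice(n, size=subset_size, replace=False)
        value_sum[sub] += mc_shapley(emb[sub], labels[sub],
                                     val_emb, val_labels)
        counts[sub] += 1
    return value_sum / np.maximum(counts, 1)
```

Under this sketch, `emb` would hold pre-computed representations from the target language model (e.g., frozen sentence embeddings), and the highest-valued points would be retained, with low- or negative-valued points filtered out, before fine-tuning the full model.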

Related research

06/18/2021
BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models
We show that with small-to-medium training data, fine-tuning only the bi...

06/18/2021
Process for Adapting Language Models to Society (PALMS) with Values-Targeted Datasets
Language models can generate harmful and biased outputs and exhibit unde...

08/27/2021
Exploring the Capacity of a Large-scale Masked Language Model to Recognize Grammatical Errors
In this paper, we explore the capacity of a language model-based method ...

10/02/2019
The merits of Universal Language Model Fine-tuning for Small Datasets – a case with Dutch book reviews
We evaluated the effectiveness of using language models, that were pre-t...

08/08/2023
In-Context Alignment: Chat with Vanilla Language Models Before Fine-Tuning
In this note, we explore inference-time alignment through in-context lea...

01/26/2022
Synchromesh: Reliable code generation from pre-trained language models
Large pre-trained language models have been used to generate code, provid...

04/04/2019
Inoculation by Fine-Tuning: A Method for Analyzing Challenge Datasets
Several datasets have recently been constructed to expose brittleness in...
