A Kernel-Based View of Language Model Fine-Tuning

10/11/2022
by Sadhika Malladi et al.

It has become standard to solve NLP tasks by fine-tuning pre-trained language models (LMs), especially in low-data settings. There is minimal theoretical understanding of this empirical success, e.g., of why fine-tuning a model with 10^8 or more parameters on a couple dozen training points does not result in overfitting. We investigate whether the Neural Tangent Kernel (NTK) - which originated as a model to study the gradient descent dynamics of infinitely wide networks with suitable random initialization - describes fine-tuning of pre-trained LMs. This study was inspired by the decent performance of the NTK on computer vision tasks (Wei et al., 2022). We also extend the NTK formalism to fine-tuning with Adam. We present extensive experiments showing that, once the downstream task is formulated as a language modeling problem through prompting, the NTK lens can often reasonably describe the model updates during fine-tuning with both SGD and Adam. This kernel view also suggests an explanation for the success of parameter-efficient subspace-based fine-tuning methods. Finally, we suggest a path toward a formal explanation of our findings via Tensor Programs (Yang, 2020).
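
To make the kernel view concrete: the quantity in question is the empirical NTK of the prompted model at its pre-trained parameters, K(x, x') = <grad_theta f(theta_0; x), grad_theta f(theta_0; x')>, where f(theta_0; x) is the scalar the prompt reduces the task to (e.g., a label-word logit). The sketch below is not the authors' code; it is a minimal PyTorch illustration under the assumption that a hypothetical `model` maps one example to that scalar, and the helper names `flat_grad` and `empirical_ntk` are ours.

# Minimal sketch (PyTorch), not the authors' code: the empirical NTK of a model
# with scalar output f(theta; x), evaluated at its current (pre-trained) parameters,
#   K(x, x') = <grad_theta f(theta; x), grad_theta f(theta; x')>.
# `model` is a hypothetical stand-in for a prompted LM whose forward pass has
# already been reduced to a single scalar per example (e.g., a label-word logit).

import torch

def flat_grad(model, x):
    # Gradient of the scalar output at example x, flattened into one long vector.
    model.zero_grad()
    out = model(x)  # assumed to return a scalar tensor for one example
    out.backward()
    return torch.cat([p.grad.reshape(-1)
                      for p in model.parameters() if p.grad is not None])

def empirical_ntk(model, examples):
    # Kernel matrix K[i, j] = <grad f(x_i), grad f(x_j)>.
    # Note: storing all flattened gradients is memory-heavy for large LMs;
    # this is only meant to show what the kernel is, not how to scale it.
    grads = torch.stack([flat_grad(model, x) for x in examples])  # (n, num_params)
    return grads @ grads.T                                        # (n, n)

Kernel regression (or a kernel classifier) built on empirical_ntk over the few-shot training set then stands in for fine-tuning under the NTK lens; the paper's Adam extension replaces the plain inner product above with a sign-based kernel, whose exact form is given in the paper rather than guessed here.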

Related research:

06/18/2021 · BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models
We show that with small-to-medium training data, fine-tuning only the bi...

05/23/2022 · Improving language models fine-tuning with representation consistency targets
Fine-tuning contextualized representations learned by pre-trained langua...

08/07/2021 · NASOA: Towards Faster Task-oriented Online Fine-tuning with a Zoo of Models
Fine-tuning from pre-trained ImageNet models has been a simple, effectiv...

10/24/2022 · Different Tunes Played with Equal Skill: Exploring a Unified Optimization Subspace for Delta Tuning
Delta tuning (DET, also known as parameter-efficient tuning) is deemed a...

10/12/2022 · AD-DROP: Attribution-Driven Dropout for Robust Language Model Fine-Tuning
Fine-tuning large pre-trained language models on downstream tasks is apt...

09/29/2021 · Targeted Gradient Descent: A Novel Method for Convolutional Neural Networks Fine-tuning and Online-learning
A convolutional neural network (ConvNet) is usually trained and then tes...

10/14/2021 · P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks
Prompt tuning, which only tunes continuous prompts with a frozen languag...
