Newer is not always better: Rethinking transferability metrics, their peculiarities, stability and performance

10/13/2021
by Shibal Ibrahim, et al.

Fine-tuning large pre-trained image and language models on small customized datasets has become increasingly popular for improved prediction and efficient use of limited resources. Fine-tuning requires identifying the best models to transfer-learn from, and quantifying transferability prevents expensive re-training on all candidate model/task pairs. We show that statistical problems with covariance estimation drive the poor performance of H-score [Bao et al., 2019], a common baseline for newer metrics, and propose a shrinkage-based estimator. This results in up to an 80% gain in H-score correlation performance, making it competitive with the state-of-the-art LogME measure of You et al. [2021]. Our shrinkage-based H-score is 3-55 times faster to compute than LogME. Additionally, we examine the less common setting of target (as opposed to source) task selection. We identify previously overlooked problems in such settings, with differing numbers of labels, class-imbalance ratios, etc., for some recent metrics, e.g., LEEP [Nguyen et al., 2020], that resulted in them being misrepresented as leading measures. We propose a correction and recommend measuring correlation performance against relative accuracy in such settings. We also outline the difficulties of comparing feature-dependent metrics, both supervised (e.g., H-score) and unsupervised (e.g., Maximum Mean Discrepancy [Long et al., 2015]), across source models/layers with different feature embedding dimensions. We show that dimensionality reduction methods allow for meaningful comparison across models and improve the performance of some of these measures. We investigate the performance of 14 different supervised and unsupervised metrics and demonstrate that even unsupervised metrics can identify the leading models for domain adaptation. We support our findings with roughly 65,000 fine-tuning experiments.
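For concreteness, the H-score of Bao et al. [2019] is tr(cov(f)^{-1} cov(E[f|Y])), computed from the source model's embeddings f(X) on the target data; when the number of target examples is small relative to the embedding dimension, the empirical covariance is near-singular and this quantity becomes unstable, which is where a shrinkage estimate helps. The snippet below is a minimal, illustrative sketch of that idea only, using scikit-learn's Ledoit-Wolf estimator as a stand-in for the covariance estimator; the function name shrinkage_h_score and its signature are assumptions for illustration, not the paper's code.

```python
# Illustrative sketch: H-score with a shrinkage covariance estimate (not the paper's implementation).
import numpy as np
from sklearn.covariance import LedoitWolf

def shrinkage_h_score(features, labels):
    """H-score (Bao et al., 2019) with a Ledoit-Wolf shrinkage covariance estimate.

    features: (n_samples, d) array of source-model embeddings on the target data.
    labels:   (n_samples,) array of target-task class labels.
    """
    features = features - features.mean(axis=0, keepdims=True)

    # Shrinkage estimate of the overall feature covariance; better conditioned
    # than the empirical covariance when n_samples is small relative to d.
    cov_f = LedoitWolf().fit(features).covariance_

    # Covariance of the class-conditional means E[f(X) | Y]: replace each row
    # by the mean embedding of its class, then take the covariance.
    class_means = np.zeros_like(features)
    for c in np.unique(labels):
        idx = labels == c
        class_means[idx] = features[idx].mean(axis=0)
    cov_g = np.cov(class_means, rowvar=False)

    # H-score = trace( cov(f)^{-1} cov(E[f|Y]) ); solve instead of forming an explicit inverse.
    return np.trace(np.linalg.solve(cov_f, cov_g))
```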
