BERTs of a feather do not generalize together: Large variability in generalization across models with similar test set performance

11/07/2019
by R. Thomas McCoy, et al.

If the same neural architecture is trained multiple times on the same dataset, will it make similar linguistic generalizations across runs? To study this question, we fine-tuned 100 instances of BERT on the Multi-genre Natural Language Inference (MNLI) dataset and evaluated them on the HANS dataset, which measures syntactic generalization in natural language inference. On the MNLI development set, the behavior of all instances was remarkably consistent, with accuracy ranging between 83.6% and 84.8%. In stark contrast, the same models varied widely in their generalization performance. For example, on the simple case of subject-object swap (e.g., knowing that "the doctor visited the lawyer" does not entail "the lawyer visited the doctor"), accuracy ranged from 0.0% to 66.2%. Such variation is likely due to the presence of many local minima that are equally attractive to a low-bias learner such as a neural network; decreasing the variability may therefore require models with stronger inductive biases.
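Evaluating an MNLI-trained model on HANS involves one detail worth making concrete: MNLI has three labels (entailment, contradiction, neutral), while HANS is labeled only as entailment vs. non-entailment, so contradiction and neutral predictions are collapsed into "non-entailment" before scoring. The sketch below is an illustration of that collapsing step with dummy predictions, not the authors' code; the function names are ours.

```python
# Hedged sketch of HANS-style scoring: MNLI's three-way predictions are
# collapsed into the two-way HANS label space before computing accuracy.
# Predictions here are hand-written dummies, not real model outputs.

def collapse_label(mnli_label: str) -> str:
    """Map a 3-way MNLI label onto HANS's 2-way label space."""
    return "entailment" if mnli_label == "entailment" else "non-entailment"

def hans_accuracy(predictions, gold_labels):
    """Accuracy after collapsing each MNLI prediction to 2-way."""
    collapsed = [collapse_label(p) for p in predictions]
    correct = sum(c == g for c, g in zip(collapsed, gold_labels))
    return correct / len(gold_labels)

# Subject-object-swap cases are all gold-labeled "non-entailment",
# e.g. premise "The doctor visited the lawyer",
#      hypothesis "The lawyer visited the doctor".
preds = ["entailment", "contradiction", "neutral", "entailment"]
gold = ["non-entailment"] * 4
print(hans_accuracy(preds, gold))  # contradiction/neutral collapse to correct -> 0.5
```

Because any non-entailment prediction counts as correct on these cases, a model can score well here by heuristics alone on some subsets, which is why per-case accuracies (like the 0.0%–66.2% subject-object-swap range above) are reported separately rather than as one aggregate number.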


