BERT Busters: Outlier LayerNorm Dimensions that Disrupt BERT

05/14/2021
by Olga Kovaleva, et al.

Multiple studies have shown that BERT is remarkably robust to pruning, and that few if any of its components retain high importance across downstream tasks. Contrary to this received wisdom, we demonstrate that pre-trained Transformer encoders are surprisingly fragile to the removal of a very small number of scaling factors and biases in the output layer normalization (<0.0001% of model weights). These are high-magnitude normalization parameters that emerge early in pre-training and show up consistently in the same dimensional position throughout the model. They are present in all six models of the BERT family that we examined, and removing them significantly degrades both MLM perplexity and downstream task performance. Our results suggest that layer normalization plays a much more important role than is usually assumed.
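As a concrete illustration of the kind of intervention described above, the sketch below, assuming the HuggingFace transformers and PyTorch APIs, zeroes the scaling factor and bias at a single dimension of every encoder layer's output LayerNorm in bert-base-uncased and compares a crude masked-LM loss before and after. The outlier index, the sample sentence, and the mlm_loss helper are illustrative placeholders, not values or code taken from the paper.

```python
# A minimal sketch, not the authors' code: assumes HuggingFace transformers
# and PyTorch; the outlier index below is a hypothetical placeholder.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

OUTLIER_DIM = 308  # hypothetical position; actual outlier dimensions are model-specific

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()


def mlm_loss(text: str) -> float:
    """Crude MLM loss: mask every 5th token and score the model's predictions."""
    enc = tokenizer(text, return_tensors="pt")
    input_ids = enc["input_ids"].clone()
    labels = input_ids.clone()
    masked = torch.arange(input_ids.size(1)) % 5 == 2  # positions to mask
    labels[:, ~masked] = -100                          # ignore unmasked tokens in the loss
    input_ids[:, masked] = tokenizer.mask_token_id
    with torch.no_grad():
        out = model(input_ids=input_ids, attention_mask=enc["attention_mask"], labels=labels)
    return out.loss.item()


sample = "The quick brown fox jumps over the lazy dog."
print("MLM loss before:", mlm_loss(sample))

# Zero the scaling factor and bias of the chosen dimension in the output
# LayerNorm of every encoder layer.
with torch.no_grad():
    for layer in model.bert.encoder.layer:
        layer.output.LayerNorm.weight[OUTLIER_DIM] = 0.0
        layer.output.LayerNorm.bias[OUTLIER_DIM] = 0.0

print("MLM loss after:", mlm_loss(sample))
```

Zeroing one dimension across the twelve output LayerNorms of a base-size model touches only a few dozen scalars out of roughly 110M parameters, which is the scale of intervention the abstract refers to.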

Related research

- Learning to Win Lottery Tickets in BERT Transfer via Task-agnostic Mask Training (04/24/2022)
- A multi-perspective combined recall and rank framework for Chinese procedure terminology normalization (01/22/2021)
- A Matter of Framing: The Impact of Linguistic Formalism on Probing Results (04/30/2020)
- Emergent Properties of Finetuned Language Representation Models (10/23/2019)
- Why Can You Lay Off Heads? Investigating How BERT Heads Transfer (06/14/2021)
- What does BERT know about books, movies and music? Probing BERT for Conversational Recommendation (07/30/2020)
- AUBER: Automated BERT Regularization (09/30/2020)
