BERT Busters: Outlier LayerNorm Dimensions that Disrupt BERT

05/14/2021 ∙ by Olga Kovaleva, et al.

Multiple studies have shown that BERT is remarkably robust to pruning, yet few if any of its components retain high importance across downstream tasks. Contrary to this received wisdom, we demonstrate that pre-trained Transformer encoders are surprisingly fragile to the removal of a very small number of scaling factors and biases in the output layer normalization (<0.0001% of model weights). These are high-magnitude normalization parameters that emerge early in pre-training and appear consistently in the same dimensional position throughout the model. They are present in all six models of the BERT family that we examined, and removing them significantly degrades both MLM perplexity and downstream task performance. Our results suggest that layer normalization plays a much more important role than is usually assumed.
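The ablation the abstract describes can be illustrated with a minimal sketch: zero out the scaling factor (gamma) and bias (beta) of the highest-magnitude LayerNorm dimension and observe that only that output dimension is affected. This is a toy NumPy reconstruction with hypothetical parameter values, not the authors' actual experimental code; the outlier position and magnitudes are invented for illustration.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Standard layer normalization over the last axis,
    # followed by the learned elementwise scale and shift.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
hidden = 8  # toy hidden size (BERT-base uses 768)

# Typical LayerNorm parameters cluster near gamma=1, beta=0;
# we plant one hypothetical high-magnitude outlier at index 3.
gamma = rng.normal(1.0, 0.1, hidden)
beta = rng.normal(0.0, 0.1, hidden)
gamma[3] = 8.0
beta[3] = 4.0

x = rng.normal(size=(2, hidden))
full = layer_norm(x, gamma, beta)

# "Remove" the outlier dimension: zero its scaling factor and bias.
outlier = int(np.argmax(np.abs(gamma)))
g_ablated, b_ablated = gamma.copy(), beta.copy()
g_ablated[outlier] = 0.0
b_ablated[outlier] = 0.0
ablated = layer_norm(x, g_ablated, b_ablated)
```

Because the normalization statistics depend only on the input, zeroing gamma and beta for one dimension silences that dimension's output while leaving all other dimensions untouched; in a full model, the degradation reported in the paper comes from downstream layers depending on the outlier feature.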





