Undivided Attention: Are Intermediate Layers Necessary for BERT?

12/22/2020
by Sharath Nittur Sridhar, et al.

In recent times, BERT-based models have been extremely successful in solving a variety of natural language processing (NLP) tasks such as reading comprehension, natural language inference, and sentiment analysis. All BERT-based architectures have a self-attention block followed by a block of intermediate layers as the basic building component. However, a strong justification for the inclusion of these intermediate layers remains missing in the literature. In this work, we investigate the importance of the intermediate layers to the network's overall performance on downstream tasks. We show that reducing the number of intermediate layers and modifying the architecture of BERT-Base results in minimal loss in fine-tuning accuracy on downstream tasks while decreasing the number of parameters and the training time of the model. Additionally, we use the centered kernel alignment (CKA) similarity metric and probing classifiers to demonstrate that removing the intermediate layers has little impact on the learned self-attention representations.
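To make the architectural change concrete, the sketch below shows a simplified, PyTorch-style BERT encoder layer in which the intermediate (feed-forward) sub-block can be shrunk or dropped. This is an illustrative approximation, not the authors' implementation; the default layer sizes follow BERT-Base, and `use_intermediate` and `intermediate_size` are hypothetical knobs standing in for the kind of modification the abstract describes.

```python
# Minimal sketch (not the paper's code) of a BERT-style encoder layer whose
# intermediate feed-forward sub-block can be resized or removed entirely.
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, hidden_size=768, num_heads=12,
                 intermediate_size=3072, use_intermediate=True):
        super().__init__()
        # Self-attention block (kept in every variant).
        self.attention = nn.MultiheadAttention(hidden_size, num_heads,
                                               batch_first=True)
        self.attn_norm = nn.LayerNorm(hidden_size)
        # Intermediate (feed-forward) block; optional in this sketch.
        self.use_intermediate = use_intermediate
        if use_intermediate:
            self.ffn = nn.Sequential(
                nn.Linear(hidden_size, intermediate_size),
                nn.GELU(),
                nn.Linear(intermediate_size, hidden_size),
            )
            self.ffn_norm = nn.LayerNorm(hidden_size)

    def forward(self, x):
        attn_out, _ = self.attention(x, x, x)
        x = self.attn_norm(x + attn_out)        # residual + LayerNorm
        if self.use_intermediate:
            x = self.ffn_norm(x + self.ffn(x))  # residual + LayerNorm
        return x

# BERT-Base-like layer vs. a variant with the intermediate block removed.
full = EncoderLayer()                        # attention + feed-forward
slim = EncoderLayer(use_intermediate=False)  # attention only
```

Dropping or shrinking the feed-forward sub-block removes most of a layer's parameters (roughly two hidden_size x intermediate_size matrices per layer in BERT-Base), which is where the reported savings in parameters and training time come from.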
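For reference, the CKA similarity mentioned above can be computed in its linear form as below. This is a generic implementation of the metric (after Kornblith et al., 2019), not code from the paper, and the activation matrices in the usage example are random placeholders rather than real BERT activations.

```python
# Linear centered kernel alignment (CKA) between two representations.
# X and Y are (n_examples, n_features) activation matrices from two models/layers.
import numpy as np

def linear_cka(X, Y):
    # Center the features (columns) of each representation.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    numerator = np.linalg.norm(Y.T @ X, ord='fro') ** 2
    denominator = (np.linalg.norm(X.T @ X, ord='fro') *
                   np.linalg.norm(Y.T @ Y, ord='fro'))
    return numerator / denominator

# Usage example with placeholder activations for two model variants.
rng = np.random.default_rng(0)
acts_full = rng.standard_normal((128, 768))
acts_slim = rng.standard_normal((128, 768))
print(linear_cka(acts_full, acts_slim))  # 1.0 means identical (up to rotation/scale)
```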


Related research

02/24/2022 - TrimBERT: Tailoring BERT for Trade-offs
Models based on BERT have been extremely successful in solving a variety...

01/25/2020 - Further Boosting BERT-based Models by Duplicating Existing Layers: Some Intriguing Phenomena inside BERT
Although Bidirectional Encoder Representations from Transformers (BERT) ...

02/12/2020 - Utilizing BERT Intermediate Layers for Aspect Based Sentiment Analysis and Natural Language Inference
Aspect based sentiment analysis aims to identify the sentimental tendenc...

10/08/2019 - SesameBERT: Attention for Anywhere
Fine-tuning with pre-trained models has achieved exceptional results for...

08/06/2020 - ConvBERT: Improving BERT with Span-based Dynamic Convolution
Pre-trained language models like BERT and its variants have recently ach...

01/10/2022 - TiltedBERT: Resource Adjustable Version of BERT
In this paper, we proposed a novel adjustable finetuning method that imp...

06/19/2020 - SqueezeBERT: What can computer vision teach NLP about efficient neural networks?
Humans read and write hundreds of billions of messages every day. Furthe...
