BERMo: What can BERT learn from ELMo?

10/18/2021
by Sangamesh Kodge, et al.

We propose BERMo, an architectural modification to BERT that makes predictions based on a hierarchy of surface, syntactic and semantic language features. We use the linear combination scheme proposed in Embeddings from Language Models (ELMo) to combine the scaled internal representations from different network depths. Our approach has two benefits: (1) improved gradient flow for the downstream task, since every layer has a direct connection to the gradients of the loss function, and (2) increased representative power, since the model no longer needs to copy features learned in shallower layers that are necessary for the downstream task. Moreover, the parameter overhead is negligible, as only a single scalar parameter is associated with each layer in the network. Experiments on the probing tasks from the SentEval dataset show that our model performs up to 4.65% better in accuracy than the baseline, with an average improvement of 2.67% on the semantic tasks. When subjected to compression techniques, our model enables stable pruning on small datasets such as SST-2, where the BERT model commonly diverges. We also observe that our approach converges 1.67× and 1.15× faster than the baseline on the MNLI and QQP tasks from the GLUE benchmark, respectively, and that it can obtain better parameter efficiency for penalty-based pruning approaches on the QQP task.
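The abstract's description of the ELMo-style combination suggests a softmax-weighted scalar mixture over the encoder's per-layer hidden states. The sketch below is an illustrative reconstruction under that reading, not the paper's released code; the module name ScalarMix, the softmax normalisation of the per-layer scalars, and the global scale gamma follow the original ELMo formulation and are assumptions here.

import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """ELMo-style linear combination of per-layer representations.

    Each of the L layers contributes through one learnable scalar
    (softmax-normalised across layers) plus a single global scale,
    so the parameter overhead is only L + 1 scalars.
    """
    def __init__(self, num_layers: int):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))  # one scalar per layer
        self.gamma = nn.Parameter(torch.ones(()))                   # global scaling factor

    def forward(self, layer_outputs):
        # layer_outputs: sequence of L tensors, each (batch, seq_len, hidden)
        weights = torch.softmax(self.layer_weights, dim=0)
        mixed = sum(w * h for w, h in zip(weights, layer_outputs))
        return self.gamma * mixed

# Hypothetical usage with a Hugging Face BERT encoder:
#   outputs = bert(input_ids, output_hidden_states=True)
#   mix = ScalarMix(num_layers=len(outputs.hidden_states))
#   features = mix(outputs.hidden_states)   # fed to the downstream head

Because every layer's output enters the mixture directly, gradients from the downstream loss reach each layer without passing through all later layers, which is the gradient-flow benefit described above.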

Related research

01/13/2020 · AdaBERT: Task-Adaptive BERT Compression with Differentiable Neural Architecture Search
Large pre-trained language models such as BERT have shown their effectiv...

10/10/2019 · Structured Pruning of Large Language Models
Large language models have recently achieved state of the art performanc...

09/10/2023 · RGAT: A Deeper Look into Syntactic Dependency Information for Coreference Resolution
Although syntactic information is beneficial for many NLP tasks, combini...

04/30/2020 · Investigating Transferability in Pretrained Language Models
While probing is a common technique for identifying knowledge in the rep...

10/12/2022 · GMP*: Well-Tuned Global Magnitude Pruning Can Outperform Most BERT-Pruning Methods
We revisit the performance of the classic gradual magnitude pruning (GMP...

04/17/2020 · Fast and Accurate Deep Bidirectional Language Representations for Unsupervised Learning
Even though BERT achieves successful performance improvements in various...

10/06/2020 · BERT Knows Punta Cana is not just beautiful, it's gorgeous: Ranking Scalar Adjectives with Contextualised Representations
Adjectives like pretty, beautiful and gorgeous describe positive propert...
