Poor Man's BERT: Smaller and Faster Transformer Models

04/08/2020
by Hassan Sajjad, et al.

The ongoing neural revolution in Natural Language Processing has recently been dominated by large-scale pre-trained Transformer models, where size does matter: it has been shown that the number of parameters in such a model is typically positively correlated with its performance. Naturally, this situation has unleashed a race for ever larger models, many of which, including the large versions of popular models such as BERT, XLNet, and RoBERTa, are now out of reach for researchers and practitioners without large-memory GPUs/TPUs. To address this issue, we explore a number of memory-light model reduction strategies that do not require model pre-training from scratch. The experimental results show that we are able to prune BERT, RoBERTa, and XLNet models by up to 40%, while maintaining up to 98% of their original performance. We also show that our pruned models are on par with DistilBERT in terms of both model size and performance. Finally, our pruning strategies enable interesting comparative analysis between BERT and XLNet.
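One family of reduction strategies of this kind drops entire encoder layers from an already pre-trained model instead of re-pre-training a smaller one. Below is a minimal sketch of top-layer dropping using the HuggingFace transformers API; the library choice, the drop_top_layers helper, and the value k=5 are illustrative assumptions, not details prescribed by the paper.

    # Minimal sketch: top-layer dropping for a pre-trained BERT encoder.
    # Assumes the HuggingFace `transformers` and `torch` packages; the
    # helper name and k=5 are hypothetical choices for illustration.
    import torch
    from transformers import BertModel, BertTokenizer

    def drop_top_layers(model: BertModel, k: int) -> BertModel:
        """Remove the top k encoder layers in place and return the model."""
        assert 0 < k < model.config.num_hidden_layers
        model.encoder.layer = torch.nn.ModuleList(model.encoder.layer[:-k])
        model.config.num_hidden_layers -= k
        return model

    model = BertModel.from_pretrained("bert-base-uncased")  # 12 encoder layers
    model = drop_top_layers(model, k=5)                     # keep bottom 7 (~40% fewer layers)

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    inputs = tokenizer("Layer-pruned BERT still encodes text.", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    print(outputs.last_hidden_state.shape)  # (1, seq_len, 768)

In a setup like the one the abstract describes, the reduced model would then be fine-tuned on the downstream task rather than pre-trained from scratch, which is what keeps the approach memory-light.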

Related research

Q8BERT: Quantized 8Bit BERT (10/14/2019)
Compressing Large-Scale Transformer-Based Models: A Case Study on BERT (02/27/2020)
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter (10/02/2019)
Load What You Need: Smaller Versions of Multilingual BERT (10/12/2020)
Multi-node Bert-pretraining: Cost-efficient Approach (08/01/2020)
DACT-BERT: Differentiable Adaptive Computation Time for an Efficient BERT Inference (09/24/2021)
The Effect of Model Size on Worst-Group Generalization (12/08/2021)
