Prune Once for All: Sparse Pre-Trained Language Models

11/10/2021
by Ofir Zafrir, et al.

Transformer-based language models are applied to a wide range of applications in natural language processing. However, they are inefficient and difficult to deploy. In recent years, many compression algorithms have been proposed to increase the implementation efficiency of large Transformer-based models on target hardware. In this work, we present a new method for training sparse pre-trained Transformer language models by integrating weight pruning and model distillation. These sparse pre-trained models can be used for transfer learning to a wide range of tasks while maintaining their sparsity pattern. We demonstrate our method on three well-known architectures, creating sparse pre-trained BERT-Base, BERT-Large, and DistilBERT models. We show how these compressed sparse pre-trained models transfer their knowledge to five downstream natural language tasks with minimal accuracy loss. Moreover, we show how to further compress the sparse models' weights to 8-bit precision using quantization-aware training. For example, with our sparse pre-trained BERT-Large fine-tuned on SQuADv1.1 and quantized to 8-bit, we achieve a compression ratio of 40X for the encoder with less than 1% accuracy loss. To the best of our knowledge, our results show the best compression-to-accuracy ratio for BERT-Base, BERT-Large, and DistilBERT.
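The abstract describes the recipe only at a high level: prune the weights once during pre-training (together with distillation from a dense teacher) and then keep the resulting sparsity pattern fixed while fine-tuning downstream. The PyTorch sketch below illustrates just the pruning-mask part of that idea; the names (MaskedLinear, prune_by_magnitude) are hypothetical, the distillation loss is omitted, and this is not the authors' implementation.

```python
# Minimal sketch (not the paper's code): magnitude pruning with a fixed mask.
# The mask is computed once and then applied on every forward pass, so pruned
# weights stay zero during downstream fine-tuning.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskedLinear(nn.Linear):
    """Linear layer whose weight is element-wise multiplied by a binary mask."""

    def __init__(self, in_features, out_features, bias=True):
        super().__init__(in_features, out_features, bias)
        # Start dense; the mask is filled in by prune_by_magnitude().
        self.register_buffer("mask", torch.ones_like(self.weight))

    @torch.no_grad()
    def prune_by_magnitude(self, sparsity: float):
        """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
        k = int(sparsity * self.weight.numel())
        threshold = self.weight.abs().flatten().kthvalue(k).values
        self.mask = (self.weight.abs() > threshold).float()

    def forward(self, x):
        # Re-applying the mask here keeps gradient updates from reviving
        # pruned weights once fine-tuning on a downstream task begins.
        return F.linear(x, self.weight * self.mask, self.bias)


# Usage: prune once (e.g. during pre-training, alongside distillation from a
# dense teacher), then fine-tune with the mask frozen.
layer = MaskedLinear(768, 3072)
layer.prune_by_magnitude(sparsity=0.9)  # roughly 90% of the weights become zero
```

As a rough sanity check on the reported 40X figure: at about 90% unstructured sparsity (an assumed level, used here only for illustration) roughly a tenth of the encoder weights remain nonzero, and quantizing those values from 32-bit floats to 8-bit integers contributes roughly another 4x, which multiplies out to about 40x. The exact ratio depends on how the sparse weights and their indices are stored.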


Related research

10/14/2019 · Q8BERT: Quantized 8Bit BERT
Recently, pre-trained Transformer based language models such as BERT and...

03/14/2022 · The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models
Pre-trained Transformer-based language models have become a key building...

04/08/2020 · Exploiting Redundancy in Pre-trained Language Models for Efficient Transfer Learning
Large pre-trained contextual word representations have transformed the f...

10/14/2020 · An Investigation on Different Underlying Quantization Schemes for Pre-trained Language Models
Recently, pre-trained language models like BERT have shown promising per...

06/01/2023 · Quantization-Aware and Tensor-Compressed Training of Transformers for Natural Language Understanding
Fine-tuned transformer models have shown superior performances in many n...

04/28/2022 · RobBERTje: a Distilled Dutch BERT Model
Pre-trained large-scale language models such as BERT have gained a lot o...

12/23/2021 · Distilling the Knowledge of Romanian BERTs Using Multiple Teachers
Running large-scale pre-trained language models in computationally const...
