The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models

03/14/2022
by   Eldar Kurtic, et al.

Pre-trained Transformer-based language models have become a key building block for natural language processing (NLP) tasks. While these models are extremely accurate, they can be too large and computationally expensive to run on standard deployments. A variety of compression methods, including distillation, quantization, and structured and unstructured pruning, can be applied to decrease model size and increase inference speed. In this context, this paper's contributions are two-fold. First, we perform an in-depth study of the accuracy-compression trade-off for unstructured weight pruning of BERT models, and introduce Optimal BERT Surgeon (O-BERT-S), an efficient and accurate weight pruning method based on approximate second-order information, which we show to yield state-of-the-art results in terms of the compression/accuracy trade-off. Specifically, Optimal BERT Surgeon extends existing work on second-order pruning by allowing for pruning of blocks of weights, and by being applicable at BERT scale. Second, we investigate the impact of this pruning method when compounding compression approaches for Transformer-based models, which allows us to combine state-of-the-art structured and unstructured pruning together with quantization, in order to obtain highly compressed but accurate models. The resulting compression framework is powerful yet general and efficient: we apply it to both the fine-tuning and pre-training stages of language tasks, and obtain state-of-the-art results on the accuracy-compression trade-off with relatively simple compression recipes. For example, we obtain 10x model size compression with < 1% relative drop in accuracy with respect to the dense BERT-base, a 10x CPU-inference speedup with < 2% relative accuracy drop, and a 29x CPU-inference speedup with < 7.5% relative accuracy drop.
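To make the second-order idea concrete, here is a minimal sketch of the classical Optimal Brain Surgeon (OBS) objective in its block form, which this line of work builds on; the notation below is ours rather than the paper's, and the dampened empirical-Fisher approximation is shown only as one standard way to make the Hessian tractable at BERT scale. With trained weights w, loss Hessian H, and Q the index set of a candidate block, OBS scores the block by the second-order loss increase incurred when the block is zeroed out and the remaining weights are optimally adjusted:

    \rho_Q = \tfrac{1}{2}\, w_Q^{\top} \big( [H^{-1}]_{Q,Q} \big)^{-1} w_Q,
    \qquad
    \delta w = -\, H^{-1} E_Q^{\top} \big( [H^{-1}]_{Q,Q} \big)^{-1} w_Q,

where E_Q is the selector matrix with E_Q w = w_Q; blocks with the smallest saliency \rho_Q are pruned, and the surviving weights receive the compensating update \delta w. One common approximation (assumed here purely for illustration) replaces H by a dampened empirical Fisher built from m per-sample gradients,

    H \;\approx\; \lambda I \;+\; \frac{1}{m} \sum_{i=1}^{m} \nabla \ell_i(w)\, \nabla \ell_i(w)^{\top},

with \lambda > 0 a small dampening constant; block-diagonal or otherwise structured versions of this matrix keep the required inverse computable for BERT-sized models.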

Related research

11/10/2021  Prune Once for All: Sparse Pre-Trained Language Models
03/30/2023  oBERTa: Improving Sparse Transfer Learning via improved initialization, distillation, and pruning regimes
02/07/2023  ZipLM: Hardware-Aware Structured Pruning of Language Models
10/10/2019  Structured Pruning of Large Language Models
10/14/2022  oViT: An Accurate Second-Order Pruning Framework for Vision Transformers
05/25/2022  Sparse*BERT: Sparse Models are Robust
09/17/2020  Efficient Transformer-based Large Scale Language Representations using Hardware-friendly Block Structured Pruning
