The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models

by Eldar Kurtic, et al.

Pre-trained Transformer-based language models have become a key building block for natural language processing (NLP) tasks. While these models are extremely accurate, they can be too large and computationally intensive to run on standard deployments. A variety of compression methods, including distillation, quantization, and structured and unstructured pruning, are known to decrease model size and increase inference speed. In this context, this paper's contributions are twofold. First, we present an in-depth study of the accuracy-compression trade-off for unstructured weight pruning in the context of BERT models, and introduce Optimal BERT Surgeon (O-BERT-S), an efficient and accurate weight pruning method based on approximate second-order information, which we show to yield state-of-the-art results in terms of the compression/accuracy trade-off. Specifically, Optimal BERT Surgeon extends existing work on second-order pruning by allowing blocks of weights to be pruned and by being applicable at BERT scale. Second, we investigate the impact of this pruning method when compounding compression approaches for Transformer-based models, which allows us to combine state-of-the-art structured and unstructured pruning with quantization in order to obtain highly compressed but accurate models. The resulting compression framework is powerful, yet general and efficient: we apply it to both the fine-tuning and pre-training stages of language tasks, and obtain state-of-the-art results on the accuracy-compression trade-off with relatively simple compression recipes. For example, we obtain 10x model size compression with a < 1% drop in accuracy, and CPU-inference speedups with < 2% and < 7.5% drops in accuracy.
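To make the idea of pruning with approximate second-order information concrete, here is a minimal sketch of a single pruning step in the style of the classical Optimal Brain Surgeon, which the method's name alludes to. It is not the paper's implementation: the function names, the dampening constant `damp`, the use of an empirical Fisher matrix as the Hessian approximation, and the direct matrix inversion (feasible only for tiny parameter vectors) are all assumptions made for illustration, and the block-wise pruning described in the abstract is not covered.

```python
import numpy as np


def empirical_fisher_inverse(grads, damp=1e-4):
    """Invert a dampened empirical Fisher matrix built from per-sample gradients.

    grads: array of shape (m, d), one gradient row per sample (hypothetical input).
    damp:  small constant added to the diagonal so the matrix is invertible (assumed).
    """
    m, d = grads.shape
    fisher = damp * np.eye(d) + grads.T @ grads / m
    # Direct inversion is only practical for small d; used here for clarity.
    return np.linalg.inv(fisher)


def obs_prune_one(weights, h_inv):
    """Remove the single weight with the lowest Optimal Brain Surgeon saliency.

    Saliency of weight q:   w_q^2 / (2 * [H^-1]_qq)
    Compensating update:    -w_q / [H^-1]_qq * H^-1[:, q]
    Returns the updated weight vector and the index of the pruned weight.
    """
    saliency = weights ** 2 / (2.0 * np.diag(h_inv))
    q = int(np.argmin(saliency))
    update = -weights[q] / h_inv[q, q] * h_inv[:, q]
    pruned = weights + update
    pruned[q] = 0.0  # the update already zeroes w_q; set it exactly to avoid round-off
    return pruned, q
```

With an identity inverse Hessian the saliency reduces to magnitude pruning, which is a quick sanity check: `obs_prune_one(np.array([0.5, -0.1, 2.0]), np.eye(3))` removes the weight at index 1 and leaves the others unchanged. With a non-trivial inverse Hessian, the remaining weights are adjusted to compensate for the removed one.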



Prune Once for All: Sparse Pre-Trained Language Models

Transformer-based language models are applied to a wide range of applica...

oBERTa: Improving Sparse Transfer Learning via improved initialization, distillation, and pruning regimes

In this paper, we introduce the range of oBERTa language models, an easy...

ZipLM: Hardware-Aware Structured Pruning of Language Models

The breakthrough performance of large language models (LLMs) comes with ...

Structured Pruning of Large Language Models

Large language models have recently achieved state of the art performanc...

oViT: An Accurate Second-Order Pruning Framework for Vision Transformers

Models from the Vision Transformer (ViT) family have recently provided b...

Sparse*BERT: Sparse Models are Robust

Large Language Models have become the core architecture upon which most ...

Efficient Transformer-based Large Scale Language Representations using Hardware-friendly Block Structured Pruning

Pretrained large-scale language models have increasingly demonstrated hi...