Gradient-based Intra-attention Pruning on Pre-trained Language Models

12/15/2022
by Ziqing Yang, et al.

Pre-trained language models achieve superior performance, but they are computationally expensive due to their large size. Techniques such as pruning and knowledge distillation (KD) have been developed to reduce their size and latency. In most structural pruning methods, the pruning units, such as attention heads and feed-forward hidden dimensions, span only a small space of model structures and limit the structures that the pruning algorithm can explore. In this work, we propose Gradient-based Intra-attention pruning (GRAIN), which inspects fine-grained intra-attention structures and allows different heads to have different sizes. Intra-attention pruning greatly expands the search space of model structures and yields highly heterogeneous structures. We further propose structure regularization to encourage generating more regular structures, which achieve higher speedups than heterogeneous ones. We also integrate KD into the pruning process with a gradient separation strategy to reduce the interference of KD with pruning. GRAIN is evaluated on a variety of tasks. Results show that it notably outperforms other methods at the same or similar model size. Even under extreme compression, where only 3% of the transformer weights remain, the pruned model is still competitive.
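To make the idea of intra-attention pruning concrete, below is a minimal, self-contained sketch of gradient-based importance scoring at the level of individual per-head dimensions. It is not the paper's exact GRAIN algorithm; the toy attention layer, the |weight × gradient| saliency, the budget `keep_per_head_budget`, and the name `head_dim_importance` are illustrative assumptions. It only shows how scoring the Q/K/V projections per (head, dimension) lets different heads retain different numbers of dimensions.

```python
# A hedged sketch, assuming a single fused Q/K/V projection and a first-order
# |w * dL/dw| saliency criterion; not the authors' implementation.
import torch
import torch.nn as nn

hidden, n_heads, head_dim = 64, 4, 16   # toy sizes; real PLMs are much larger
keep_per_head_budget = 32               # total intra-attention dims to keep (hypothetical)

qkv = nn.Linear(hidden, 3 * hidden)     # fused Q/K/V projection of one attention layer
x = torch.randn(8, 10, hidden)          # dummy batch: (batch, seq_len, hidden)

# Forward a toy objective and backprop to obtain gradients on the projection weights.
loss = qkv(x).pow(2).mean()
loss.backward()

# First-order saliency per weight: |w * dL/dw|, a common gradient-based criterion.
saliency = (qkv.weight * qkv.weight.grad).abs()           # (3*hidden, hidden)

# Aggregate over input features and over Q/K/V to get one score
# per (head, intra-head dimension).
per_row = saliency.sum(dim=1)                              # (3*hidden,)
q_s, k_s, v_s = per_row.split(hidden)                      # scores for Q, K, V rows
head_dim_importance = (q_s + k_s + v_s).view(n_heads, head_dim)

# Keep the globally top-scoring intra-attention dimensions, so heads end up
# with heterogeneous sizes (some heads lose more dimensions than others).
flat = head_dim_importance.flatten()
keep = torch.zeros_like(flat, dtype=torch.bool)
keep[flat.topk(keep_per_head_budget).indices] = True
keep = keep.view(n_heads, head_dim)

print("kept dims per head:", keep.sum(dim=1).tolist())
```

Because the budget is allocated globally rather than per head, the surviving structure is naturally heterogeneous, which is the kind of structure the paper's structure regularization then encourages to become more regular for better speedups.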


