Combining Compressions for Multiplicative Size Scaling on Natural Language Tasks

08/20/2022
by Rajiv Movva, et al.

Quantization, knowledge distillation, and magnitude pruning are among the most popular methods for neural network compression in NLP. Independently, these methods reduce model size and can accelerate inference, but their relative benefit and combinatorial interactions have not been rigorously studied. For each of the eight possible subsets of these techniques, we compare accuracy vs. model size tradeoffs across six BERT architecture sizes and eight GLUE tasks. We find that quantization and distillation consistently provide greater benefit than pruning. Surprisingly, except for the pair of pruning and quantization, using multiple methods together rarely yields diminishing returns. Instead, we observe complementary and super-multiplicative reductions to model size. Our work quantitatively demonstrates that combining compression methods can synergistically reduce model size, and that practitioners should prioritize (1) quantization, (2) knowledge distillation, and (3) pruning to maximize accuracy vs. model size tradeoffs.
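To make the three techniques concrete, here is a minimal sketch (not the paper's released code) of stacking them on a GLUE-style classifier, assuming PyTorch and Hugging Face Transformers are available. The model name, task label count, and 30% pruning amount are illustrative choices, not values taken from the paper.

```python
import torch
import torch.nn.utils.prune as prune
from transformers import AutoModelForSequenceClassification

# (1) Knowledge distillation: start from an already-distilled student
# (e.g. DistilBERT) rather than the full BERT teacher.
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2  # illustrative checkpoint/task
)

# (2) Magnitude pruning: zero out the smallest 30% of weights in each
# Linear layer, then make the pruning masks permanent.
for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")

# (3) Quantization: post-training dynamic quantization of Linear layers
# to 8-bit integers for a further ~4x weight-size reduction.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

print(quantized)
```

In practice the order matters: pruning is applied before quantization so that the zeroed weights are carried into the quantized representation, and fine-tuning between steps (omitted here) helps recover accuracy.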


