LoSparse: Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation

06/20/2023 · by Yixiao Li, et al.

Transformer models have achieved remarkable results in various natural language tasks, but they are often prohibitively large, requiring massive memory and computational resources. To reduce the size and complexity of these models, we propose LoSparse (Low-Rank and Sparse approximation), a novel model compression technique that approximates a weight matrix by the sum of a low-rank matrix and a sparse matrix. Our method combines the advantages of both low-rank approximation and pruning while avoiding their limitations. Low-rank approximation compresses the coherent and expressive parts of neurons, while pruning removes the incoherent and non-expressive parts. Pruning enhances the diversity of low-rank approximations, and low-rank approximation prevents pruning from losing too many expressive neurons. We evaluate our method on natural language understanding, question answering, and natural language generation tasks, and show that it significantly outperforms existing compression methods.
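To make the decomposition concrete, below is a minimal one-shot sketch in PyTorch: the low-rank part comes from a truncated SVD of the weight matrix, and the sparse part from magnitude pruning of the residual. The function name `losparse_split`, its signature, and the one-shot SVD-plus-pruning recipe are illustrative assumptions; the paper's full method compresses the model during training, which is not shown here.

```python
import torch

def losparse_split(W: torch.Tensor, rank: int, density: float):
    """Hypothetical sketch: split W into a low-rank part plus a sparse
    residual, so that W ≈ low_rank + sparse. The truncated-SVD and
    magnitude-pruning recipe is an illustrative assumption, not the
    paper's exact procedure."""
    # Low-rank part: keep the top-`rank` singular directions of W.
    U, sigma, Vh = torch.linalg.svd(W, full_matrices=False)
    low_rank = U[:, :rank] @ torch.diag(sigma[:rank]) @ Vh[:rank, :]

    # Sparse part: keep only the largest-magnitude entries of the
    # residual, zeroing the rest.
    residual = W - low_rank
    k = max(1, int(density * residual.numel()))
    threshold = residual.abs().flatten().topk(k).values.min()
    sparse = torch.where(residual.abs() >= threshold, residual,
                         torch.zeros_like(residual))
    return low_rank, sparse

# Usage: decompose a 768x768 weight and check reconstruction quality.
W = torch.randn(768, 768)
L, S = losparse_split(W, rank=64, density=0.05)
rel_err = (torch.norm(W - (L + S)) / torch.norm(W)).item()
print(f"nonzeros in S: {int(S.count_nonzero())}, relative error: {rel_err:.3f}")
```

With rank 64 and 5% density, the stored parameters shrink to roughly a fifth of the dense matrix while the sparse residual recovers detail the low-rank factors miss, which is the intuition the abstract describes.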

Related research

03/25/2022 · Vision Transformer Compression with Structured Pruning and Low Rank Approximation
Transformer architecture has gained popularity due to its ability to sca...

06/25/2023 · Low-Rank Prune-And-Factorize for Language Model Compression
The components underpinning PLMs – large weight matrices – were shown to...

06/18/2018 · GroupReduce: Block-Wise Low-Rank Approximation for Neural Language Model Shrinking
Model compression is essential for serving large deep neural nets on dev...

08/24/2021 · Greenformers: Improving Computation and Memory Efficiency in Transformer Models via Low-Rank Approximation
In this thesis, we introduce Greenformers, a collection of model efficie...

02/07/2023 · What Matters In The Structured Pruning of Generative Language Models?
Auto-regressive large language models such as GPT-3 require enormous com...

02/14/2021 · Doping: A technique for efficient compression of LSTM models using sparse structured additive matrices
Structured matrices, such as those derived from Kronecker products (KP),...

05/14/2019 · Network Pruning for Low-Rank Binary Indexing
Pruning is an efficient model compression technique to remove redundancy...
