SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot

01/02/2023
by Elias Frantar, et al.

We show for the first time that large-scale generative pretrained transformer (GPT) family models can be pruned to at least 50% sparsity in one-shot, without any retraining, at minimal loss of accuracy. This is achieved via a new pruning method called SparseGPT, specifically designed to work efficiently and accurately on massive GPT-family models. When executing SparseGPT on the largest available open-source models, OPT-175B and BLOOM-176B, we can reach 60% sparsity with negligible increase in perplexity: remarkably, more than 100 billion weights from these models can be ignored at inference time. SparseGPT generalizes to semi-structured (2:4 and 4:8) patterns, and is compatible with weight quantization approaches.
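To make the semi-structured patterns concrete: a 2:4 pattern requires that every contiguous group of 4 weights contain at most 2 nonzeros, which gives exactly 50% sparsity in a hardware-friendly layout. The sketch below only illustrates that constraint, using simple magnitude selection as a stand-in criterion; SparseGPT itself chooses which weights to drop and updates the surviving weights via its own reconstruction procedure, which is not shown here. The function name prune_2_4 and the use of NumPy are illustrative assumptions, not part of the paper.

```python
import numpy as np

def prune_2_4(weights: np.ndarray) -> np.ndarray:
    """Zero out 2 of every 4 consecutive weights in each row.

    Keeps the 2 largest-magnitude entries per group of 4. This shows the
    2:4 layout only; SparseGPT's actual selection/reconstruction differs.
    """
    rows, cols = weights.shape
    assert cols % 4 == 0, "column count must be divisible by the group size 4"
    groups = weights.reshape(rows, cols // 4, 4)              # split each row into groups of 4
    order = np.argsort(np.abs(groups), axis=-1)               # indices sorted by ascending magnitude
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, order[..., :2], False, axis=-1)   # drop the 2 smallest per group
    return (groups * mask).reshape(rows, cols)

# Usage: each group of 4 keeps exactly 2 nonzeros, i.e. 50% sparsity overall.
W = np.random.randn(8, 16).astype(np.float32)
W_sparse = prune_2_4(W)
assert np.count_nonzero(W_sparse) == W.size // 2
```

A 4:8 pattern works the same way with groups of 8 weights, of which at most 4 may be nonzero; it is less constrained than 2:4 and therefore typically easier on accuracy at the same 50% sparsity level.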


Related research:

oViT: An Accurate Second-Order Pruning Framework for Vision Transformers (10/14/2022)

Efficient Transformer-based Large Scale Language Representations using Hardware-friendly Block Structured Pruning (09/17/2020)

Token-Scaled Logit Distillation for Ternary Weight Generative Language Models (08/13/2023)

ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers (06/04/2022)

Self-Distilled Quantization: Achieving High Compression Rates in Transformer-Based Language Models (07/12/2023)

QuIP: 2-Bit Quantization of Large Language Models With Guarantees (07/25/2023)
