nuQmm: Quantized MatMul for Efficient Inference of Large-Scale Generative Language Models

06/20/2022
by Gunho Park, et al.

Recent advances in self-supervised learning, combined with the Transformer architecture, have enabled natural language processing (NLP) models to reach extremely low perplexity. Such powerful models, however, demand ever-increasing model sizes and therefore large amounts of computation and memory. In this paper, we propose an efficient inference framework for large-scale generative language models. As the key to reducing model size, we quantize weights with a non-uniform quantization method. Quantized matrix multiplications are then accelerated by our proposed kernel, called nuQmm, which allows a wide trade-off between compression ratio and accuracy. Our proposed nuQmm reduces not only the latency of each GPU but also that of the entire inference of large LMs, because a high compression ratio (from low-bit quantization) lowers the minimum required number of GPUs. We demonstrate that nuQmm can accelerate the inference speed of the GPT-3 (175B) model by about 14.4 times and save energy consumption by 93%.
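To make the idea of non-uniform weight quantization concrete, the sketch below shows one common scheme of this kind, binary-coding quantization, in which a weight vector is approximated as a sum of scaled {-1, +1} vectors, so a matrix-vector product can be computed from sign-only dot products without reconstructing full-precision weights. This is an illustrative NumPy sketch under those assumptions; the function names (`bcq_quantize`, `bcq_matvec`) are hypothetical, and the paper's actual quantization algorithm and GPU kernel (nuQmm) are not reproduced here.

```python
# Illustrative sketch: greedy binary-coding quantization (one possible
# non-uniform quantization scheme) and a dequantization-free matvec.
import numpy as np

def bcq_quantize(w, q_bits=3):
    """Greedily approximate w by sum_i alpha_i * b_i, where each b_i is a
    {-1, +1} vector and alpha_i >= 0 is a per-bit scale."""
    residual = w.astype(np.float64).copy()
    alphas, codes = [], []
    for _ in range(q_bits):
        b = np.where(residual >= 0, 1.0, -1.0)   # binary code for this bit-plane
        a = np.mean(np.abs(residual))            # least-squares scale for b = sign(residual)
        alphas.append(a)
        codes.append(b)
        residual -= a * b                        # fit the next bit-plane to the residual
    return np.array(alphas), np.stack(codes)

def bcq_matvec(alphas, codes, x):
    """Compute w_hat @ x directly from the quantized form: each term is a
    scale times a sign-only dot product, so weights are never dequantized."""
    return sum(a * (b @ x) for a, b in zip(alphas, codes))

rng = np.random.default_rng(0)
w = rng.standard_normal(1024)
x = rng.standard_normal(1024)
alphas, codes = bcq_quantize(w, q_bits=3)
print("full precision:", w @ x)
print("3-bit approx. :", bcq_matvec(alphas, codes, x))
```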

