GroupReduce: Block-Wise Low-Rank Approximation for Neural Language Model Shrinking

06/18/2018
by   Patrick H. Chen, et al.

Model compression is essential for serving large deep neural nets on devices with limited resources or in applications that require real-time responses. As a case study, a state-of-the-art neural language model usually consists of one or more recurrent layers sandwiched between an embedding layer used for representing input tokens and a softmax layer for generating output tokens. For problems with a very large vocabulary size, the embedding and the softmax matrices can account for more than half of the model size. For instance, the bigLSTM model achieves state-of-the-art performance on the One-Billion-Word (OBW) dataset with a vocabulary of around 800k words; its word embedding and softmax matrices occupy more than 6 GB and account for over 90% of the model's parameters. In this paper, we propose GroupReduce, a novel compression method for neural language models based on vocabulary-partition (block-wise) low-rank matrix approximation that exploits the inherent frequency distribution of tokens (the power-law distribution of words). Experimental results show our method significantly outperforms traditional compression methods such as low-rank approximation and pruning. On the OBW dataset, our method achieves a 6.6 times compression rate for the embedding and softmax matrices, and when combined with quantization it achieves a 26 times compression rate, which translates to 12.8 times compression for the entire model with very little degradation in perplexity.
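To make the block-wise idea concrete, below is a minimal NumPy sketch of the core mechanism: the vocabulary is sorted by word frequency, split into blocks, and each block is compressed with a truncated SVD whose rank grows with the block's frequency so that frequent words are approximated more faithfully. The group count, rank schedule, and function names here are illustrative assumptions for exposition, not the paper's exact algorithm or settings.

```python
import numpy as np

def group_reduce_sketch(E, word_freqs, num_groups=4, base_rank=8):
    """Block-wise low-rank approximation of an embedding matrix E (V x d).

    Rows are sorted by word frequency, split into `num_groups` blocks, and
    each block is replaced by a truncated SVD whose rank increases with the
    block's frequency (frequent words get higher fidelity). The rank
    schedule below is an assumed placeholder, not the paper's setting.
    """
    order = np.argsort(-word_freqs)             # most frequent words first
    groups = np.array_split(order, num_groups)

    blocks = []
    for g, idx in enumerate(groups):
        block = E[idx]                           # rows for this vocabulary block
        rank = max(1, base_rank * (num_groups - g))
        rank = min(rank, min(block.shape))
        U, S, Vt = np.linalg.svd(block, full_matrices=False)
        # Keep only the two low-rank factors instead of the dense block.
        blocks.append((idx, U[:, :rank] * S[:rank], Vt[:rank]))
    return blocks

def reconstruct(blocks, V, d):
    """Rebuild a dense matrix from the block factors (for error checking)."""
    E_hat = np.zeros((V, d))
    for idx, A, B in blocks:
        E_hat[idx] = A @ B
    return E_hat

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    V, d = 1000, 64
    E = rng.normal(size=(V, d))
    freqs = 1.0 / np.arange(1, V + 1)            # power-law-like frequencies
    blocks = group_reduce_sketch(E, freqs)
    E_hat = reconstruct(blocks, V, d)
    print("relative error:", np.linalg.norm(E - E_hat) / np.linalg.norm(E))
```

Storing the per-block factor pairs instead of the full matrix is what yields the compression: a block of n rows and d columns at rank r costs n*r + r*d values rather than n*d, and the frequency-aware rank schedule concentrates that budget on the words that matter most.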

