GroupReduce: Block-Wise Low-Rank Approximation for Neural Language Model Shrinking

06/18/2018 ∙ by Patrick H. Chen, et al. ∙ Google ∙ University of California-Davis

Model compression is essential for serving large deep neural nets on devices with limited resources or in applications that require real-time responses. As a case study, a state-of-the-art neural language model usually consists of one or more recurrent layers sandwiched between an embedding layer used for representing input tokens and a softmax layer for generating output tokens. For problems with a very large vocabulary size, the embedding and the softmax matrices can account for more than half of the model size. For instance, the bigLSTM model achieves state-of-the-art performance on the One-Billion-Word (OBW) dataset with a vocabulary of around 800k words, and its word embedding and softmax matrices use more than 6 GB of space and are responsible for over 90% of the model parameters. In this paper, we propose GroupReduce, a novel compression method for neural language models, based on vocabulary-partition (block) based low-rank matrix approximation and the inherent frequency distribution of tokens (the power-law distribution of words). The experimental results show that our method can significantly outperform traditional compression methods such as low-rank approximation and pruning. On the OBW dataset, our method achieves a 6.6 times compression rate for the embedding and softmax matrices, and when combined with quantization, it achieves a 26 times compression rate, which translates to a 12.8 times compression of the entire model with very little degradation in perplexity.

1 Introduction

Deep neural nets with a large number of parameters have a great capacity for modeling complex problems. However, the large size of these models is a major obstacle for serving them on-device where computational resources are limited. As such, compressing deep neural nets has become a crucial problem that draws an increasing amount of interest from the research community. Given a large neural net, the goal of compression is to build a light-weight approximation of the original model, which can offer a much smaller model size while maintaining the same (or similar) prediction accuracy.

In this paper, we focus on compressing neural language models, which have been successfully applied in a range of important NLP tasks including language modeling (e.g., next word prediction) and machine translation. A neural language model often consists of three major components: one or more recurrent layers (often using LSTM), an embedding layer for representing input tokens, and a softmax layer for generating output tokens. The dimension of the recurrent layers (e.g., LSTM), which corresponds to the hidden state, is typically small and independent of the vocabulary size of the input/output tokens. In contrast, the dimension of the embedding and softmax layers grows with the vocabulary size, which can easily be at the scale of hundreds of thousands. As a result, the parameter matrices of the embedding and softmax layers are often responsible for most of the memory consumption of a neural language model. For example, the DE-EN Neural Machine Translation task has a vocabulary size of around 30k, and around 80% of the memory is used to store the embedding and softmax matrices. Furthermore, the One Billion Word language modeling task has a vocabulary size of around 800k, and more than 90% of the memory footprint is due to storing the embedding and softmax matrices. Therefore, to reduce the size of a neural language model, it is highly valuable to compress these layers, which is the focus of our paper.

There have been extensive studies on compressing fully connected and convolutional networks sainath2013low ; denton2014exploiting ; DBLP:journals/corr/HanPTD15 ; DBLP:journals/corr/HanMD15 ; DBLP:journals/corr/WuLWHC15 ; Yu2017OnCD ; hubara2016quantized . The mainstream algorithms from this line of work, such as low-rank approximation, quantization, and pruning, can also be directly applied to compress the embedding and softmax matrices. However, previous papers have reported that these algorithms, though effective for CNN compression, are not able to achieve a good compression rate for word embedding matrices. For instance, hubara2016quantized proposed a very successful quantization method for CNNs, but for language models its compression rate is less than 3 times.

One important aspect that has not been well explored in the literature is that the embedding matrix has several specific properties that do not exist in a general weight matrix of CNNs. Each column of the input embedding and softmax matrix represents a token, which implies that on a given training or test set the parameters in that column are used with a frequency that follows Zipf's law.

By exploiting these structures, we propose GroupReduce, a novel method for compressing the embedding and softmax matrices using block-wise, weighted low-rank approximation. Our method starts by grouping words into blocks based on their frequencies, and then refines the clustering iteratively by constructing weighted low-rank approximation for each block. This allows word vectors to be projected into a better subspace during compression. Our experiments show that GroupReduce is more effective than standard low-rank approximation methods for compressing these layers. It is easy-to-implement and can handle very large embedding and softmax matrices.

Our method achieves good performance on compressing a range of benchmark models for language modeling and neural machine translation tasks, and outperforms previous methods. For example, on the DE-EN NMT task, our method achieves a 10 times compression rate on the embedding and softmax matrices without much degradation of performance. Results can be further improved to a 24 times compression rate when combined with a quantization scheme. On the One Billion Word dataset, our method achieves a 6.6 times compression rate on the embedding and softmax matrices, which originally occupy more than 6 GB. When combined with a quantization scheme, our method achieves a compression rate of more than 26 times while maintaining similar perplexity.

2 Related Work

2.1 Model Compression for CNN

Low-rank matrix/tensor factorization.

To compress a deep net, a natural direction is to approximate each of its weight matrices by a low-rank approximation of the matrix using SVD. Based on this idea, sainath2013low compressed the fully connected layers in neural nets. For convolution layers, the kernels can be viewed as 3D tensors. Thus, jaderberg2014speeding ; denton2014exploiting applied higher-order tensor decomposition to compress CNNs. In the same vein, howard2017mobilenets developed another structural approximation. kim2015compression proposed an algorithm to select the rank for each layer. More recently, Yu2017OnCD reconstructed the weight matrices by using a sparse plus low-rank approximation.

Pruning.

Algorithms have been proposed to remove unimportant weights in deep neural nets. In order to do this, one needs to define the importance of each weight. For example, lecun1990optimal showed that the importance can be estimated by using the Hessian of the loss function. DBLP:journals/corr/HanPTD15 considered adding $\ell_1$ or $\ell_2$ regularization and applied iterative thresholding approaches to achieve very good compression rates. Later on, DBLP:journals/corr/HanMD15 demonstrated that state-of-the-art CNNs can be compressed by combining pruning, weight sharing and quantization.

Quantization.

Storing parameters using lower precision representations has been used for model compression. Recently, hubara2016quantized showed that a simple uniform quantization scheme can effectively reduce both the model size and the prediction time of a deep neural net. lin2016fixed showed that non-uniform quantization can further improve the performance. Recently, several advanced quantization techniques have been proposed for CNN compression DBLP:journals/corr/abs-1803-03289 ; DBLP:journals/corr/abs-1802-02271 .

2.2 Model Compression for RNN/LSTM

Although model compression has been studied extensively for CNN models, fewer works have focused on compression for recurrent neural nets (RNNs), another widely used category of deep models in NLP applications. Since an RNN involves a collection of fully connected layers, many of the aforementioned approaches can be naturally applied. For example, hubara2016quantized applied their quantization and retraining procedure to compress an LSTM (a popular type of RNN) language model on the Penn Tree Bank (PTB) dataset. tjandra2017compressing applied a matrix/tensor factorization approach to compress the transition matrices of LSTM and GRU, and tested their algorithm on image and music classification problems (which do not need word embedding matrices). narang2017exploring ; lobacheva2017bayesian proposed pruning algorithms for compressing LSTM models.

Among the previous work, we found that only hubara2016quantized ; lobacheva2017bayesian tried to compress the word embedding matrix in NLP applications. hubara2016quantized showed that the quantization-plus-retraining approach can only achieve a compression rate of less than 3 times on PTB data with no performance loss. lobacheva2017bayesian showed that for word-level LSTM models, the pruning approach incurs more than 5% performance loss at the sparsity level it reaches. Because this approach also needs to store the index of each non-zero location, the effective number of stored parameters is larger than the sparsity level alone suggests. Very recently, word2bit compressed the word embeddings computed by the word2vec algorithm and applied them to similarity/analogy tasks and question answering. Just before submitting this work, we found another very recent paper, shu2017compressing, which applies compositional coding to compress the input embedding matrix of an LSTM. However, as the authors explicitly mention on OpenReview (https://openreview.net/forum?id=BJRZzFlRb), their algorithm is not able to compress the softmax (output) layer matrix. As a result, the overall compressed model from this approach is still large. One main issue of the approach is that multiple words share the same coding, which makes these words indistinguishable in the output layer during inference.

These previous results indicate that compressing embedding matrices in natural language tasks is a difficult problem: it is extremely challenging to achieve a 4 times compression rate without sacrificing performance. In this paper, we show that, instead of treating the embedding or the softmax parameters as a pure matrix, the GroupReduce algorithm exploits the inherent structure of natural languages and can achieve much better compression rates.

3 Proposed Algorithms

We now introduce a novel algorithm for compressing both the embedding and the softmax layer, two major components in a neural language model as discussed earlier. Assume the word embedding matrix has size $N$-by-$D$, where $N$ is the vocabulary size and $D$ is the embedding dimension. We will use $A \in \mathbb{R}^{N \times D}$ to denote the embedding matrix (either input or softmax layer), and each row of $A$ corresponds to the embedding vector of a word, i.e., the vector representation of the word.

Our goal is to compress the embedding matrix so that it uses less memory while achieving similar prediction performance. For a typical language model, especially one with a large vocabulary size, the large memory size of the model is mostly due to the need to store the input and output word embedding matrices. In Table 1, we show a breakdown of memory consumption for several classic models trained on publicly available datasets. We can see that for three out of four setups, the embedding matrices contribute more than 75% of the overall memory usage. For example, in the bigLSTM model that achieved state-of-the-art performance on OBW, more than 90% of the memory is used to store the two (input and output) word embedding matrices. Thus, for such models, the main challenge in serving them on-device is the tremendous memory usage of the word embedding matrices. As such, it is highly valuable to compress these word embedding matrices.

Given a word embedding matrix $A$, a standard way to compress $A$ while preserving the information is to perform low-rank approximation over $A$. A low-rank approximation can be acquired by using singular value decomposition (SVD), which achieves the best rank-$k$ approximation:

$$A \approx U S V^T \tag{1}$$

where $U \in \mathbb{R}^{N \times k}$, $V \in \mathbb{R}^{D \times k}$, $k < \min(D, N)$ is the target rank, and $S$ is a diagonal matrix of singular values. After the rank-$k$ low-rank approximation, the memory footprint for $A$ reduces from $O(ND)$ to $O(k(N + D))$.
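As an illustration of Eq. (1), the following is a minimal NumPy sketch of a truncated SVD and the resulting parameter count. The matrix is random and the sizes `N`, `D`, `k` are toy values chosen only for this example, not settings from the paper; the helper name `truncated_svd` is hypothetical.

```python
import numpy as np

def truncated_svd(A, k):
    """Return (U, S, V) such that A is approximated by U @ diag(S) @ V.T with rank k."""
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], S[:k], Vt[:k, :].T

rng = np.random.default_rng(0)
N, D, k = 1000, 64, 8                  # toy sizes (assumption, not from the paper)
A = rng.standard_normal((N, D))        # stand-in for an embedding matrix

U, S, V = truncated_svd(A, k)
A_hat = (U * S) @ V.T                  # rank-k reconstruction
rel_err = np.linalg.norm(A - A_hat) / np.linalg.norm(A)

params_full = N * D                    # O(ND) entries in the original matrix
params_low_rank = k * (N + D)          # O(k(N + D)) once S is folded into U
```

A random Gaussian matrix has a nearly flat spectrum, so `rel_err` stays large here; this mirrors the observation that low-rank truncation only pays off when the singular values decay quickly.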

There are two issues with using vanilla SVD to compress an embedding matrix. First, the rank of an embedding matrix is not necessarily low. For example, Figure 1(b) shows that all the eigenvalues of the PTB word embedding matrices are quite large, which leads to the poor reconstruction error of the low-rank approximation in Figure 1(c). Second, the SVD approach treats $A$ as a regular matrix, but in fact each row of $A$ corresponds to the embedding of a word, which implies additional structure that we can further exploit in the language model case.

Figure 1: Illustration on the Penn Treebank (PTB) dataset with a vocabulary size of 10k and a model embedding dimension of 1500. (a) Frequency: log of word frequency vs. rank of the word. A word's rank is defined as the log of the number of words that occur less often than it. We can clearly observe the power-law distribution of the word frequency. (b) Eigenvalues: the x-axis shows the rank of the approximation, and the y-axis shows the eigenvalues. The two curves correspond to the embedding matrices of the input embedding layer and the softmax layer; we can see that the eigenvalues are very large. (c) Reconstruction error: low-rank reconstruction error based on singular value decomposition for the two embedding matrices. This is another indication that vanilla SVD may not work well for the embedding matrix.

3.1 The Word Frequency Matters

One important statistical property of natural languages is that the distribution of word frequencies can be approximated by a power law. That means a small fraction of words occur many times, while many words appear only a few times. Figure 1(a) shows the power-law distribution of word frequency in the PTB dataset.

None of the previous compression methods takes the word frequency into consideration when approximating the embedding matrix. Intuitively, to construct a good compressed model with low-rank approximation under a limited memory budget, it is important to ensure that more frequent words are approximated better. In this paper, we consider two strategies to exploit the frequency information in low-rank approximation: weighted low-rank approximation and block low-rank approximation.

3.2 Improved Low-rank Approximation by Exploiting Frequency

| Models | Vocabulary size | Dimension | Model size | Input layer | Softmax layer | LSTM cell |
|---|---|---|---|---|---|---|
| PTB-Small | 10k | 200 | 17.7 MB | 7.6 MB (42.9%) | 7.6 MB (42.9%) | 2.5 MB (14.2%) |
| PTB-Large | 10k | 1500 | 251 MB | 57 MB (22.7%) | 57 MB (22.7%) | 137 MB (54.6%) |
| NMT: DE-EN | 30k | 500 | 148 MB | 68 MB (45.9%) | 47 MB (31.8%) | 33 MB (22.3%) |
| OBW-BigLSTM | 793k | 1024 | 6.8 GB | 3.1 GB (45.6%) | 3.1 GB (45.6%) | 0.6 GB (8.8%) |

Table 1: The size of each layer in the model. The number in parentheses shows the ratio with respect to the entire model size.

Weighted low-rank approximation.

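First, we introduce a weighted low-rank approximation to compress the embedding matrix $A$. This will be used to replace the original SVD and serves as the basic building block of our proposed algorithm. The main idea is to assign a different weight to each word's approximation and to penalize errors on higher-frequency words more when constructing the low-rank approximation. Mathematically, letting $f_i$ be the frequency of the $i$-th word, we want to approximate the embedding $A$ by minimizing

$$\min_{U \in \mathbb{R}^{N \times k},\, V \in \mathbb{R}^{D \times k}} \sum_{i=1}^{N} \sum_{j=1}^{D} f_i \left( A_{ij} - U_i V_j^T \right)^2 \tag{2}$$

where $k$ is the reduced rank; $A_{ij}$ is the $i$-th word's $j$-th feature; $U \in \mathbb{R}^{N \times k}$, $V \in \mathbb{R}^{D \times k}$; and $U_i$ and $V_j$ are the $i$-th and $j$-th rows of $U$ and $V$ respectively. Note that here we do not require $U, V$ to be orthonormal.

Although it is known that weighted SVD with element-wise weights does not have a closed-form solution srebro2003weighted , in our case the elements in the same row of $A$ are associated with the same weight, which leads to a simple solution. Define $Q = \mathrm{diag}(\sqrt{f_1}, \dots, \sqrt{f_N})$; then the optimization problem (2) is equivalent to

$$\min_{U \in \mathbb{R}^{N \times k},\, V \in \mathbb{R}^{D \times k}} \left\| QA - QUV^T \right\|_F^2 \tag{3}$$

Therefore, assuming all the $f_i$ are nonzero, we can solve (2) by conducting a low-rank approximation of $QA$. Let $[\bar{U}, \bar{S}, \bar{V}] = \mathrm{svd}(QA)$ be the rank-$k$ truncated SVD; then $(U^*, V^*) = (Q^{-1} \bar{U} \bar{S}, \bar{V})$ is a solution of (2). Solving Eq. (2) is therefore easy, and the solution can be computed immediately from the SVD of $QA$.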
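The row-scaling argument behind Eq. (2) and Eq. (3) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names, the random matrix, and the Zipf-like toy frequencies are all assumptions. The last lines compare the frequency-weighted error of the weighted solution with that of plain truncated SVD.

```python
import numpy as np

def weighted_low_rank(A, freq, k):
    """Minimize sum_i f_i * ||A_i - U_i V^T||^2 via the SVD of Q A, Q = diag(sqrt(f))."""
    q = np.sqrt(np.asarray(freq, dtype=float))        # requires every f_i > 0
    U_bar, S_bar, Vt_bar = np.linalg.svd(q[:, None] * A, full_matrices=False)
    U = (U_bar[:, :k] * S_bar[:k]) / q[:, None]       # Q^{-1} U_bar S_bar
    V = Vt_bar[:k, :].T                               # V_bar (orthonormal columns)
    return U, V

def weighted_error(A, A_hat, freq):
    """Frequency-weighted squared reconstruction error, the objective of Eq. (2)."""
    f = np.asarray(freq, dtype=float)
    return float(np.sum(f[:, None] * (A - A_hat) ** 2))

rng = np.random.default_rng(0)
N, D, k = 500, 32, 4                                  # toy sizes (assumption)
A = rng.standard_normal((N, D))
freq = 1.0 / np.arange(1, N + 1)                      # Zipf-like toy frequencies

U_w, V_w = weighted_low_rank(A, freq, k)
U0, S0, Vt0 = np.linalg.svd(A, full_matrices=False)
A_plain = (U0[:, :k] * S0[:k]) @ Vt0[:k]              # unweighted rank-k baseline

err_weighted = weighted_error(A, U_w @ V_w.T, freq)
err_plain = weighted_error(A, A_plain, freq)
```

Because the plain rank-`k` approximation is itself a feasible point of the weighted problem, `err_weighted` can never exceed `err_plain`.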

Block low-rank approximation. As can be seen from Figure 1(b), the embedding matrix is in general not low-rank. Instead of constructing one low-rank approximation for the entire matrix, we can consider a block-wise low-rank approximation, in which each block has its own approximation, to achieve better compression. A similar strategy has been exploited in si2014memory for kernel approximation (symmetric PSD matrices). Mathematically, suppose we partition the words into $c$ disjoint blocks $\mathcal{V}_1, \dots, \mathcal{V}_c$, where each $\mathcal{V}_p$ contains a set of words. For each block $\mathcal{V}_p$ and its corresponding word embeddings $A_{\mathcal{V}_p}$ in $A$, we can generate a low-rank approximation with rank $k_p$ as $A_{\mathcal{V}_p} \approx U^p (V^p)^T$ for $p = 1, \dots, c$. The block low-rank approximation for $A$ is then represented as:

$$A = \begin{bmatrix} A_{\mathcal{V}_1} \\ A_{\mathcal{V}_2} \\ \vdots \\ A_{\mathcal{V}_c} \end{bmatrix} \approx \begin{bmatrix} U^1 (V^1)^T \\ U^2 (V^2)^T \\ \vdots \\ U^c (V^c)^T \end{bmatrix} \tag{4}$$

The challenge for Eq. (4) is how to construct the clustering structure. Intuitively, we want words of similar frequency to be grouped in the same block, so that we can assign different ranks to different blocks based on their average frequency. For clusters of higher-frequency words, we can provide more rank (budget) for a better approximation. Meanwhile, we want to make sure that the approximation error for words is small under the same memory budget. Therefore, in this paper we consider two factors, word frequency and reconstruction performance, when constructing the partition. Next, we explain how to construct the partition.
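A minimal NumPy sketch of the block low-rank idea in Eq. (4) follows. The grouping rule used here (sort words by frequency and split into equally sized blocks) is an assumption made for illustration, since the text only states that words of similar frequency are grouped together. The function names, toy sizes, and per-block ranks are likewise hypothetical.

```python
import numpy as np

def frequency_blocks(freq, c):
    """Split word indices into c nearly equal blocks, most frequent words first."""
    order = np.argsort(-np.asarray(freq, dtype=float))
    return np.array_split(order, c)

def block_low_rank(A, blocks, ranks):
    """Factor each block A[idx] into (U^p, V^p) with its own rank k_p."""
    factors = []
    for idx, k in zip(blocks, ranks):
        U, S, Vt = np.linalg.svd(A[idx], full_matrices=False)
        factors.append((U[:, :k] * S[:k], Vt[:k].T))
    return factors

def reconstruct(factors, blocks, shape):
    """Reassemble the approximation of A from the per-block factors."""
    A_hat = np.zeros(shape)
    for (U, V), idx in zip(factors, blocks):
        A_hat[idx] = U @ V.T
    return A_hat

rng = np.random.default_rng(0)
N, D, c = 200, 16, 4                      # toy sizes (assumption)
A = rng.standard_normal((N, D))
freq = 1.0 / np.arange(1, N + 1)          # Zipf-like toy frequencies

blocks = frequency_blocks(freq, c)
ranks = [8, 6, 4, 2]                      # more rank for more frequent blocks
factors = block_low_rank(A, blocks, ranks)
A_hat = reconstruct(factors, blocks, A.shape)
```

Each block stores `k_p * (|V_p| + D)` numbers, so giving high-frequency blocks a larger `k_p` spends the budget where errors matter most.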

Figure 2: Illustration of our method. Given an embedding matrix A in (a), we first group the words by their frequency (step (b)), and then perform weighted SVD inside each group as shown in Eq. (2) (step (c)). Finally, we refine the clustering by considering the low-rank reconstruction error of words as in Eq. (5) (step (d)).

Block weighted low-rank approximation. To take both the matrix approximation and the frequency information into account when forming the block structure in Eq. (4), we propose to refine the blocks, after initializing them from the frequency grouping, to achieve a lower reconstruction error. In the refinement stage, we move words between blocks by simultaneously learning a clustering structure and a low-rank approximation inside each cluster for the word embedding matrix.

Mathematically, given an embedding matrix $A$, we first initialize the blocks by frequency grouping, and then jointly learn both the clustering and the low-rank embeddings for each block by minimizing the following clustering objective:

$$\min_{\mathcal{V}_1, \dots, \mathcal{V}_c} \sum_{p=1}^{c} \left( \min_{U^p,\, V^p} \left\| Q^p A_{\mathcal{V}_p} - Q^p U^p (V^p)^T \right\|_F^2 \right) \tag{5}$$

where $A_{\mathcal{V}_p}$ denotes the rows of $A$ that belong to block $\mathcal{V}_p$ and $Q^p = \mathrm{diag}(\sqrt{f_i})_{i \in \mathcal{V}_p}$. Intuitively, the inner part aims to minimize the weighted low-rank approximation error for one cluster, and the outer sum searches for the partition that minimizes the overall reconstruction error.

Optimization: Eq. (5) is non-convex. In this paper, we use alternating minimization to minimize the above objective. When the cluster assignment is fixed, we use weighted SVD to solve for $U^p$ and $V^p$ for each block $A_{\mathcal{V}_p}$: as described for Eq. (2), we perform SVD over $Q^p A_{\mathcal{V}_p}$ (the rows of block $p$ scaled by the square roots of their word frequencies) to obtain the approximation. The time complexity is the same as that of traditional SVD on $A_{\mathcal{V}_p}$.

To find the clustering structure, we first initialize the cluster assignment by frequency, and then refine the block structure by moving words from one cluster to another if the move decreases the reconstruction error in Eq. (5). To compute the reduction in reconstruction error, we project each row $A_i$ onto each basis $V^p$ and see how much the reconstruction error improves. So if

$$\left\| A_i - A_i V^{\bar{p}} (V^{\bar{p}})^T \right\|_2 < \left\| A_i - A_i V^{p} (V^{p})^T \right\|_2 \tag{6}$$

then we move the $i$-th word from the $p$-th cluster to the $\bar{p}$-th cluster. With this strategy, we decrease the reconstruction error.
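The move test described above can be sketched as follows, assuming each cluster basis has orthonormal columns (as is the case when it comes from an SVD), so that projecting a row onto the basis gives its best reconstruction in that subspace. The function names and the tiny hand-built example are hypothetical and only illustrate the comparison of projection errors.

```python
import numpy as np

def projection_errors(A, bases):
    """errs[i, p] = ||A_i - A_i V^p (V^p)^T||_2 for orthonormal-column bases V^p."""
    errs = np.empty((A.shape[0], len(bases)))
    for p, V in enumerate(bases):
        residual = A - (A @ V) @ V.T
        errs[:, p] = np.linalg.norm(residual, axis=1)
    return errs

def propose_moves(A, assignment, bases):
    """Return (word indices, target clusters, error reductions) for improving moves."""
    errs = projection_errors(A, bases)
    rows = np.arange(A.shape[0])
    best = errs.argmin(axis=1)
    gain = errs[rows, assignment] - errs[rows, best]
    movers = np.flatnonzero(gain > 0)     # strict improvement only
    return movers, best[movers], gain[movers]

# Tiny hand-built example: two clusters spanning disjoint coordinate planes of R^4.
eye = np.eye(4)
bases = [eye[:, :2], eye[:, 2:]]          # V^0 spans e1, e2; V^1 spans e3, e4
A = np.array([[1.0, 0.0, 0.0, 0.0],       # lies in cluster 0's subspace
              [0.0, 0.0, 1.0, 0.0]])      # lies in cluster 1's subspace
assignment = np.array([1, 1])             # word 0 starts in the wrong cluster

movers, targets, gains = propose_moves(A, assignment, bases)
```

In this example only word 0 is proposed for a move, to cluster 0, since its row is reconstructed exactly by that cluster's basis.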

The overall algorithm, GroupReduce, is given in Algorithm 1, and Figure 2 illustrates it. First, we group the words into blocks based on frequency. After that, we perform the weighted low-rank approximation of Eq. (2) for each block, and then solve Eq. (5) to iteratively refine the clusters and obtain a block-wise approximation based on the reconstruction error.

There are some implementation details for Algorithm 1. After the initial grouping, we assign different ranks to different blocks based on the average frequency of the words inside each cluster: the rank for block $p$ is proportional to the average frequency of the words inside that cluster. Suppose the block with the smallest frequency is assigned rank $r_{\min}$; then the rank of cluster $p$ is $r_{\min} \cdot \bar{f}_p / \bar{f}_{\min}$, where $\bar{f}_p$ is the average frequency of the words in cluster $p$ and $\bar{f}_{\min}$ is the average frequency for the block with the least frequent words. $r_{\min}$ is related to the budget requirement. This dynamic rank assignment can significantly boost the performance, as it assigns more rank to high-frequency words and approximates them better.
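The proportional rank rule can be sketched as below. Rounding to the nearest integer and clipping to an upper bound `max_rank` are assumptions added here so that the result is a valid rank; the text does not say how these cases are handled. The function name and the toy frequencies are hypothetical.

```python
import numpy as np

def assign_ranks(freq, blocks, r_min, max_rank):
    """Rank of block p = r_min * (avg freq of block p) / (smallest avg freq), clipped."""
    freq = np.asarray(freq, dtype=float)
    avg = np.array([freq[np.asarray(idx)].mean() for idx in blocks])
    ranks = np.rint(r_min * avg / avg.min()).astype(int)
    return np.clip(ranks, r_min, max_rank)   # clipping is an assumption

freq = [100, 100, 10, 10, 1, 1]              # toy word counts
blocks = [[0, 1], [2, 3], [4, 5]]            # already grouped by frequency
ranks = assign_ranks(freq, blocks, r_min=2, max_rank=500)
capped = assign_ranks(freq, blocks, r_min=2, max_rank=64)
```

With a power-law vocabulary the uncapped ratio between the most and least frequent blocks can be very large, which is why some upper bound (for instance the embedding dimension) is likely needed in practice.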

In Table 2, we compare the effectiveness of the different strategies in our algorithm. We test on the PTB-Small setting, with statistics shown in Table 1. Every method in the table has the same compression rate, and we report perplexity. We compare vanilla SVD, weighted SVD, weighted SVD for each block (10 blocks), assigning different ranks to different blocks, and refining the blocks. We can see that every operation involved improves the final performance and is necessary for our algorithm. The overall memory usage to represent $A$ after our algorithm is $O(N\bar{k} + c\bar{k}D)$, where $N$ is the vocabulary size, $c$ is the number of clusters, $D$ is the embedding dimension, and $\bar{k}$ is the average rank of each cluster.

| Vanilla SVD | Weighted low-rank | Block low-rank | Block low-rank with dynamic rank | Refinement |
|---|---|---|---|---|
| 189.7 | 179.8 | 155.3 | 129.2 | 127.5 |

Table 2: PTB-Small with 10 blocks and a 5 times compression rate. We add the proposed strategies one by one to see the effectiveness of each of them, using perplexity as the performance metric.
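To make the memory accounting for the block-wise representation concrete, the sketch below counts stored parameters as the sum over blocks of `k_p * (|V_p| + D)` and derives a compression rate. The vocabulary size (10k) and dimension (200) match the PTB-Small row of Table 1, but the equal block sizes and the uniform rank of 38 are arbitrary illustrative choices, not the settings behind Table 2.

```python
def block_low_rank_params(block_sizes, ranks, D):
    """Parameters stored by the factors: sum_p k_p * (|V_p| + D)."""
    return sum(k * (n + D) for n, k in zip(block_sizes, ranks))

def compression_rate(block_sizes, ranks, D):
    """Ratio of the dense N x D parameter count to the block low-rank count."""
    N = sum(block_sizes)
    return N * D / block_low_rank_params(block_sizes, ranks, D)

block_sizes = [1000] * 10        # 10 equal blocks over a 10k vocabulary (assumption)
ranks = [38] * 10                # uniform rank, for illustration only
D = 200

stored = block_low_rank_params(block_sizes, ranks, D)
rate = compression_rate(block_sizes, ranks, D)
```

Here `stored` is 10 × 38 × 1200 = 456,000 parameters against 2,000,000 in the dense matrix, i.e. a rate of roughly 4.4 times.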

Input: embedding matrix $A$; number of clusters $c$; the smallest rank $r_{\min}$; the maximal number of iterations; the minimal size of the candidate set
Output: compact representation $\{U^p, V^p\}_{p=1}^{c}$
Initialize $c$ clusters of words $\mathcal{V}_1, \dots, \mathcal{V}_c$ by clustering on the frequency of words;
Compute the desired rank $k_p$ for each cluster based on the average frequency of that cluster and $r_{\min}$;
for $p = 1, \dots, c$ do
    Compute the rank-$k_p$ weighted low-rank approximation of each sub-matrix $A_{\mathcal{V}_p}$ as $U^p (V^p)^T$;
for each iteration, up to the maximal number of iterations, do
    $M$ = [];
    for $i = 1, \dots, N$ do
        Compute the reconstruction error of the $i$-th word in each cluster $p = 1, \dots, c$;
        Find the cluster $\bar{p}$ with the smallest reconstruction error;
        if $\bar{p} \neq p$ ($p$ is the original cluster index of the $i$-th word) then
            put $i$ into the candidate set $M$;
    Choose the top words in $M$ with the least reconstruction error;
    Move these words (we choose 10% in the paper) into the clusters with the smallest reconstruction error;
    if the size of $M$ is below the minimal candidate-set size then
        Stop and output;
    for $p = 1, \dots, c$ do
        if cluster $\mathcal{V}_p$ changes then
            Compute the rank-$k_p$ weighted low-rank approximation from Eq. (2) of the sub-matrix $A_{\mathcal{V}_p}$ as $U^p (V^p)^T$;
Output: $\{U^p, V^p\}_{p=1}^{c}$

Algorithm 1: GroupReduce: Block-Wise Low-Rank Approximation for Neural Language Model Shrinking

4 Experiments

4.1 Datasets and Pretrained Models

We evaluate our method (GroupReduce) on two tasks: language modeling (LM) and neural machine translation (NMT). For LM, we evaluate GroupReduce on two datasets: Penn Treebank (PTB) and the One-Billion-Word Benchmark (OBW). OBW was introduced by chelba2013one , and it contains a vocabulary of 793,471 words with the sentences shuffled and the duplicates removed. For NMT, we evaluate our method on the IWSLT 2014 German-to-English translation task cettolo2014report . On these three benchmark datasets, we compress four models, with the model details shown in Table 1. All four models use a 2-layer LSTM. Two of them (OBW and NMT) are based on existing model checkpoints, and the other two (based on PTB) are trained from scratch due to the lack of publicly released model checkpoints.

We train a 2-layer LSTM-based language model on PTB from scratch with two setups: PTB-Small and PTB-Large. The LSTM hidden state sizes are 200 for PTB-Small and 1500 for PTB-Large, and the embedding sizes are the same. For OBW, we use the "2-LAYER LSTM-8192-1024" model shown in Table 1 of jozefowicz2016exploring . For NMT, we use the PyTorch checkpoint provided by OpenNMT klein2017opennmt to perform the German-to-English translation task. We verified that all four models achieve benchmark performance on the corresponding datasets as reported in the literature. We then apply our method to compress these benchmark models.

For experiments using BLEU score as the performance measure, we report results for which the BLEU score achieved after compression is within 3 percent of the original score. For experiments using perplexity (PPL) as the measure, such as on the PTB dataset, we also target a performance drop of at most 3 percent. For the OBW dataset, since it has a larger vocabulary size, we report results within 10 percent of the original PPL. For each method in Tables 3, 4, 5 and 7, we tested various parameters and report the smallest model size among the compressed models that fulfill the above criteria.

Note that the goal of this work is to compress an existing model to a significantly-reduced size while maintaining accuracy (e.g., perplexity or BLEU scores), rather than attempting to achieve higher accuracy. It is possible that there are models that could achieve higher accuracy, in which case our method can be applied to compress these models as well.

4.2 Comparison with Low-Rank and Pruning

We compare GroupReduce with two standard model compression strategies: low-rank approximation and pruning. These two techniques are widely used for language model compression, for example in lobacheva2017bayesian ; narang2017exploring ; LuSS16 . We compress both the input embedding and softmax matrices. For the low-rank approximation approach, we perform standard SVD on the embedding and softmax matrices and obtain the low-rank approximation. For pruning, we set the entries whose magnitude is less than a certain threshold to zero. Note that storing the sparse matrix requires the Compressed Sparse Row or Compressed Sparse Column format, so the memory usage is 2 times the number of non-zeros in the matrix after pruning. After approximation, we retrain the rest of the parameters with an SGD optimizer with an initial learning rate of 0.1. Whenever the validation perplexity stops decreasing, we reduce the learning rate by an order of magnitude. As shown in Table 3, GroupReduce can compress both the input embedding and softmax layer 5-10 times without losing much accuracy. In particular, GroupReduce achieves 6.6 times compression on the language model trained on the OBW benchmark, which saves more than 5 GB of memory.
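The pruning baseline and its storage accounting can be sketched as follows. The sketch counts two stored numbers per non-zero (the value and its index) and ignores the small row-pointer array of a CSR layout. The helper that picks a threshold for a target compression rate is an assumption added for illustration, as are all function names and the toy matrix.

```python
import numpy as np

def magnitude_prune(W, threshold):
    """Zero out every entry whose magnitude is below the threshold."""
    return np.where(np.abs(W) < threshold, 0.0, W)

def sparse_storage_cost(W_pruned):
    """Approximate sparse cost: one value plus one index per non-zero entry."""
    return 2 * int(np.count_nonzero(W_pruned))

def threshold_for_rate(W, rate):
    """Hypothetical helper: threshold keeping about W.size / (2 * rate) entries."""
    keep = int(W.size / (2 * rate))
    if keep <= 0:
        return np.inf
    magnitudes = np.sort(np.abs(W), axis=None)
    return magnitudes[-keep]              # magnitude of the keep-th largest entry

rng = np.random.default_rng(0)
W = rng.standard_normal((100, 50))        # toy stand-in for an embedding matrix
t = threshold_for_rate(W, rate=5)
W_pruned = magnitude_prune(W, t)
cost = sparse_storage_cost(W_pruned)      # compare against W.size dense entries
```

Because of the factor of two, reaching a 5 times smaller footprint requires removing 90% of the entries rather than 80%, which helps explain why pruning is at a disadvantage in this comparison.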

Notice that GroupReduce achieves good results even before retraining. This is important, as retraining might be infeasible or take a long time to converge. We experimented with different learning rates and retrained for 100k steps (about 3 hours), but we observed that none of the retraining schemes for the OBW-bigLSTM model after approximation led to a significant improvement in accuracy. One reason is that, to retrain the model, we need to keep the approximated embedding matrices fixed, re-initialize the other parameters, and train these parameters from scratch as done in shu2017compressing . On OBW-bigLSTM, this retraining process would take more than 3 weeks, which is not practical if the goal is to compress the model within a short period of time. Therefore, performance before retraining is important, and GroupReduce in general obtains good results.

| Model | Metric | Original | Low-rank | Pruning | GroupReduce |
|---|---|---|---|---|---|
| PTB-Small | Embedding memory | 1x | 2x | 2x | 5x |
| | PPL (before retrain) | 112.28 | 117.11 | 115.9 | 115.24 |
| | PPL (after retrain) | - | 113.83 | 113.78 | 113.78 |
| PTB-Large | Embedding memory | 1x | 5x | 3.3x | 10x |
| | PPL (before retrain) | 78.32 | 84.63 | 84.23 | 82.86 |
| | PPL (after retrain) | - | 80.04 | 78.38 | 78.92 |
| OBW-bigLSTM | Embedding memory | 1x | 2x | 1.14x | 6.6x |
| | PPL (before retrain) | 31.04 | 39.41 | 128.31 | 32.47 |
| | PPL (after retrain) | - | 38.03 | 84.11 | 32.50 |
| NMT: DE-EN | Embedding memory | 1x | 3.3x | 3.3x | 10x |
| | BLEU (before retrain) | 30.33 | 29.63 | 26.47 | 29.62 |
| | BLEU (after retrain) | - | 29.96 | 29.40 | 29.96 |

Table 3: Embedding compression results on three datasets comparing our method GroupReduce with low-rank approximation and pruning. The compression rate is measured over both the input embedding and softmax layers. For example, 10x means that the approximated embedding uses 10 times less memory than the original input layer and softmax layer.

4.3 Comparison with Quantization

As noted in the related work, quantization has been shown to be a competitive method for model compression hubara2016quantized . We implement $b$-bit quantization by dividing the range of a matrix into $2^b$ equally spaced intervals and using one value to represent each interval. For example, 4-bit quantization transforms the original matrix into a matrix with 16 distinct values.
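A minimal sketch of the equal-interval quantization scheme follows. Using the midpoint of each interval as its representative value is an assumption; the text only says that one value represents each interval. The function name and the toy matrix are hypothetical.

```python
import numpy as np

def uniform_quantize(W, bits):
    """Split [W.min(), W.max()] into 2**bits equal intervals.

    Returns integer codes (one per entry) and the representative value of each
    interval (its midpoint, an assumption of this sketch).
    """
    levels = 2 ** bits
    lo, hi = float(W.min()), float(W.max())
    width = (hi - lo) / levels
    if width == 0.0:                                   # constant matrix
        return np.zeros(W.shape, dtype=np.int64), np.full(levels, lo)
    codes = np.minimum(((W - lo) / width).astype(np.int64), levels - 1)
    centers = lo + (np.arange(levels) + 0.5) * width
    return codes, centers

rng = np.random.default_rng(0)
W = rng.standard_normal((200, 32))                     # toy matrix
codes, centers = uniform_quantize(W, bits=4)
W_q = centers[codes]                                   # de-quantized matrix
max_abs_err = float(np.abs(W - W_q).max())
```

Only `codes` (4 bits per entry) and the 16 `centers` need to be stored, compared with 32 bits per entry for single-precision floats.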

We need to point out that quantization is not mutually exclusive with other methods. In fact, GroupReduce can be combined with quantization to achieve a better compression rate. We first approximate the embedding or the softmax matrices by GroupReduce to obtain the low-rank matrices of each block, and then apply 4-bit or 8-bit quantization to these low-rank matrices. After retraining, quantized GroupReduce achieves at least 24 times compression for both the input embedding and softmax matrices, as shown in Table 4.

4.4 Overall Compression

The results above show that GroupReduce is an effective compression method when the frequency information is given. We need to point out that part of the model (e.g., the LSTM cells) cannot leverage this information, as the transition matrices in an LSTM cell do not correspond to the representation of a word. To obtain an overall compression of the model, we adopt a simple quantized low-rank approximation of the LSTM cells. To be more specific, we first compute a low-rank approximation of the LSTM matrices by SVD to obtain 2 times compression, and then quantize the entries of the low-rank matrices using only 16 bits. In total, the LSTM cells become 4 times smaller. However, we found that for the OBW-bigLSTM model, the LSTM matrices do not have a clear low-rank structure, and even a slight compression of the LSTM part causes performance to drop significantly. Therefore, we only apply 16-bit quantization on OBW-bigLSTM to obtain 2 times compression of the LSTM cells. The overall compression rate is shown in Table 5. With the aid of GroupReduce, we can achieve over 10 times compression on both the language modeling and neural machine translation tasks.

| Model | Metric | Original | Quantization | Quantized GroupReduce |
|---|---|---|---|---|
| PTB-Small | Embedding memory | 1x | 8x | 40x |
| | PPL (before retrain) | 112.28 | 132.5 | 146.59 |
| | PPL (after retrain) | - | 112.94 | 112.45 |
| PTB-Large | Embedding memory | 1x | 8x | 40x |
| | PPL (before retrain) | 78.32 | 116.54 | 88.67 |
| | PPL (after retrain) | - | 80.72 | 80.68 |
| OBW-bigLSTM | Embedding memory | 1x | 4x | 26x |
| | PPL (before retrain) | 31.04 | 32.63 | 34.43 |
| | PPL (after retrain) | - | 33.86 | 33.60 |
| NMT: DE-EN | Embedding memory | 1x | 8x | 24x |
| | BLEU (before retrain) | 30.33 | 27.96 | 29.08 |
| | BLEU (after retrain) | - | 30.19 | 29.81 |

Table 4: Embedding compression results on three datasets comparing our method Quantized GroupReduce with traditional quantization. 10x means that the approximated embedding uses 10 times less memory than the original input embedding layer and softmax layer.

| Models | Original PPL/BLEU | PPL/BLEU after approximation | Input layer | Softmax layer | LSTM cell | Overall compression |
|---|---|---|---|---|---|---|
| NMT: DE-EN | 30.33 (BLEU) | 29.68 (BLEU) | 24x (45.9%) | 24x (31.8%) | 4x (22.3%) | 11.3x |
| OBW-BigLSTM | 31.04 (PPL) | 33.61 (PPL) | 26x (45.6%) | 26x (45.6%) | 2x (8.8%) | 12.8x |

Table 5: Compression rate of overall model compression using Quantized GroupReduce. The compression rates shown in columns 4-6 are relative to the corresponding part of the model.
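The overall compression rate follows from the per-component rates and each component's share of the original model: the compressed size is the sum of each share divided by its rate. The sketch below applies this to the NMT: DE-EN row, using the rounded percentages from Table 1; the function name is hypothetical. Because the inputs are rounded, results computed this way may differ slightly from the figures reported in Table 5.

```python
def overall_compression(fractions, rates):
    """Overall rate = original size / sum of (component share / component rate)."""
    compressed = sum(f / r for f, r in zip(fractions, rates))
    return sum(fractions) / compressed

# NMT: DE-EN shares of the model (input layer, softmax layer, LSTM cell) from Table 1
nmt_fractions = [0.459, 0.318, 0.223]
nmt_rates = [24, 24, 4]                  # per-component rates from Table 5

nmt_overall = overall_compression(nmt_fractions, nmt_rates)
```

For this row the result is about 11.3, and most of the remaining size comes from the LSTM cells, which are compressed only 4 times.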

4.5 Selection of the Number of Clusters

In our method, the number of clusters is a hyperparameter that we need to choose. We experimented with different numbers of clusters on the PTB-Large setup with 6.6 times compression (i.e., using only about 15% of the memory of the original matrices) of both the input embedding and softmax matrices, and the results are shown in Table 6. Overall, our method is robust to the number of clusters. In our experiments with the PTB and IWSLT datasets, we set the number of clusters to 5. On the OBW dataset, as the vocabulary size is larger, we set the number of clusters to 20.

| Number of clusters | 5 | 10 | 20 | 30 |
|---|---|---|---|---|
| PPL (before retrain) | 81.79 | 80.52 | 82.88 | 83.1 |
| PPL (after retrain) | 78.44 | 78.5 | 78.52 | 80.1 |

Table 6: GroupReduce with different numbers of clusters. Results are evaluated on the PTB-Large setup with a 6.6 times compression rate on both the input embedding and softmax layers.

4.6 Comparison with Deep Compositional Coding

Since deep compositional coding shu2017compressing can only compress the input embedding matrix, to demonstrate the effectiveness of GroupReduce, we implement that method and compare it to GroupReduce on approximating only the input embedding matrix. We evaluate results on the NMT: DE-EN and PTB-Large setups. Again, after compressing each model, we retrain the rest of its parameters and keep the input embedding fixed. We use SGD with an initial learning rate of 0.1, and lower the learning rate by an order of magnitude whenever the validation loss stops decreasing. Results are summarized in Table 7. As shown in the table, GroupReduce achieves twice the compression rate of deep compositional coding. More importantly, GroupReduce can be applied to both the input and softmax embeddings, which makes the overall model smaller, not just the input embedding.

| Model | Metric | Original | Deep Compositional Coding | Quantized GroupReduce |
|---|---|---|---|---|
| PTB-Large | Embedding memory | 1x | 11.8x | 23.6x |
| | PPL (before retrain) | 78.32 | 81.82 | 80.20 |
| | PPL (after retrain) | - | 79.58 | 79.18 |
| NMT: DE-EN | Embedding memory | 1x | 16.6x | 33.3x |
| | BLEU (before retrain) | 30.33 | 28.90 | 28.89 |
| | BLEU (after retrain) | - | 30.00 | 30.16 |

Table 7: Comparison of input embedding compression results on two datasets. Note that the numbers in the table are compression rates based on the input embedding only, not on the overall model size.

5 Conclusion

In this paper, we propose a novel compression method for neural language models that achieves at least 6.6 times compression without losing prediction accuracy. Our method leverages the statistical properties of words in language to form block-wise low-rank matrix approximations for the embedding and softmax layers. The experimental results show that our method can significantly outperform traditional compression methods such as low-rank approximation and pruning. In particular, on the OBW dataset, our method combined with quantization achieves a 26 times compression rate for both the embedding and softmax matrices, which saves more than 5 GB of memory. This provides practical benefits when deploying neural language models on memory-constrained devices. For future work, we will investigate different retraining schemes, such as training the block low-rank parameterization of the model end-to-end.

References