1 Introduction
Deep neural nets with a large number of parameters have a great capacity for modeling complex problems. However, the large size of these models is a major obstacle to serving them on-device, where computational resources are limited. As such, compressing deep neural nets has become a crucial problem that draws an increasing amount of interest from the research community. Given a large neural net, the goal of compression is to build a lightweight approximation of the original model that offers a much smaller model size while maintaining the same (or similar) prediction accuracy.
In this paper, we focus on compressing neural language models, which have been successfully applied in a range of important NLP tasks including language modeling (e.g., next word prediction) and machine translation. A neural language model often consists of three major components: one or more recurrent layers (often using LSTM), an embedding layer for representing input tokens, and a softmax layer for generating output tokens. The dimension of the recurrent layers (e.g., LSTM), which corresponds to the hidden state, is typically small and independent of the vocabulary size of input/output tokens. In contrast, the dimensions of the embedding and softmax layers grow with the vocabulary size, which can easily be on the scale of hundreds of thousands. As a result, the parameter matrices of the embedding and softmax layers are often responsible for most of the memory consumption of a neural language model. For example, the DE-EN neural machine translation task has a vocabulary size of around 30k, and around 80% of the memory is used to store the embedding and softmax matrices. Furthermore, the One Billion Word language modeling task has a vocabulary size of around 800k, and more than 90% of the memory footprint is due to storing the embedding and softmax matrices. Therefore, to reduce the size of a neural language model, it is highly valuable to compress these layers, which is the focus of our paper.
There have been extensive studies on compressing fully connected and convolutional networks sainath2013low ; denton2014exploiting ; DBLP:journals/corr/HanPTD15 ; DBLP:journals/corr/HanMD15 ; DBLP:journals/corr/WuLWHC15 ; Yu2017OnCD ; hubara2016quantized . The mainstream algorithms from these works, such as low-rank approximation, quantization, and pruning, can also be directly applied to compress the embedding and softmax matrices. However, it has been reported in previous papers that these algorithms, though efficient for CNN compression, are not able to achieve a good compression rate for word embedding matrices. For instance, hubara2016quantized proposed a very successful quantization method for CNNs, but for language models the compression rate is less than 3 times.
One important aspect that has not been well explored in the literature is that the embedding matrix has several specific properties that do not exist in a general weight matrix of CNNs. Each column of the input embedding and softmax matrix represents a token, which implies that, on a given training or test set, the parameters in that column are used with a frequency that follows Zipf's law.
By exploiting these structures, we propose GroupReduce, a novel method for compressing the embedding and softmax matrices using block-wise, weighted low-rank approximation. Our method starts by grouping words into blocks based on their frequencies, and then refines the clustering iteratively by constructing a weighted low-rank approximation for each block. This allows word vectors to be projected into a better subspace during compression. Our experiments show that GroupReduce is more effective than standard low-rank approximation methods for compressing these layers. It is easy to implement and can handle very large embedding and softmax matrices.
Our method achieves good performance on compressing a range of benchmark models for language modeling and neural machine translation tasks, and outperforms previous methods. For example, on the DE-EN NMT task, our method achieves a 10 times compression rate on the embedding and softmax matrices without much degradation of performance. Results can be further improved to a 24 times compression rate when combined with a quantization scheme. On the One Billion Word dataset, our method achieves a 6.6 times compression rate on the embedding and softmax matrices, which are originally more than 6GB. When combined with a quantization scheme, our method achieves more than 26 times compression rate while maintaining similar perplexity.
2 Related Work
2.1 Model Compression for CNN
Low-rank matrix/tensor factorization.
To compress a deep net, a natural direction is to approximate each of its weight matrices, $W$, by a low-rank approximation of the matrix using SVD. Based on this idea, sainath2013low compressed the fully connected layers in neural nets. For convolution layers, the kernels can be viewed as 3D tensors. Thus, jaderberg2014speeding ; denton2014exploiting applied higher-order tensor decomposition to compress CNN. In the same vein, howard2017mobilenets developed another structural approximation. kim2015compression proposed an algorithm to select the rank for each layer. More recently, Yu2017OnCD reconstructed the weight matrices by using sparse plus low-rank approximation.
Pruning.
Algorithms have been proposed to remove unimportant weights in deep neural nets. In order to do this, one needs to define the importance of each weight. For example, lecun1990optimal showed that the importance can be estimated by using the Hessian of the loss function. DBLP:journals/corr/HanPTD15 considered adding $\ell_1$ or $\ell_2$ regularization and applied iterative thresholding approaches to achieve very good compression rates. Later on, DBLP:journals/corr/HanMD15 demonstrated that state-of-the-art CNNs can be compressed by combining pruning, weight sharing and quantization.
Quantization.
Storing parameters using lower precision representations has been used for model compression. Recently, hubara2016quantized showed that a simple uniform quantization scheme can effectively reduce both the model size and the prediction time of a deep neural net. lin2016fixed showed that non-uniform quantization can further improve the performance. Recently, several advanced quantization techniques have been proposed for CNN compression DBLP:journals/corr/abs180303289 ; DBLP:journals/corr/abs180202271 .
2.2 Model Compression for RNN/LSTM
Although model compression has been studied extensively for CNN models, fewer works have focused on compression for recurrent neural nets (RNNs), another widely used category of deep models in NLP applications. Since an RNN involves a collection of fully connected layers, many of the aforementioned approaches can be naturally applied. For example, hubara2016quantized applied their quantization and retraining procedure to compress an LSTM (a popular type of RNN) language model on the Penn Treebank (PTB) dataset. tjandra2017compressing applied a matrix/tensor factorization approach to compress the transition matrix of LSTM and GRU, and tested their algorithm on image and music classification problems (which do not need word embedding matrices). narang2017exploring ; lobacheva2017bayesian proposed pruning algorithms for LSTM model compression.
Among the previous work, we found that only hubara2016quantized ; lobacheva2017bayesian tried to compress the word embedding matrix in NLP applications. hubara2016quantized showed that the quantization-plus-retraining approach can only achieve less than 3 times compression rate on PTB data with no performance loss. lobacheva2017bayesian showed that for word-level LSTM models, the pruning approach can only achieve sparsity with more than 5% performance loss. This means roughly parameters over the original model since this approach also needs to store the index for nonzero locations. Very recently, word2bit compressed the word embeddings computed by the word2vec algorithm and applied them to similarity/analogy tasks and Question Answering. Just before submitting this work, we found another very recent paper, shu2017compressing , which applies compositional coding to compress the input embedding matrix of LSTM. However, as they explicitly mentioned in OpenReview (https://openreview.net/forum?id=BJRZzFlRb), their algorithm is not able to compress the softmax (output) layer matrix. As a result, the overall compressed model from this approach is still large. One main issue of the approach is that multiple words share the same coding, which makes these words indistinguishable in the output layer during inference.
These previous results indicate that compressing embedding matrices in natural language tasks is a difficult problem: it is extremely challenging to achieve a 4 times compression rate without sacrificing performance. In this paper, we will show that, instead of treating the embedding or the softmax parameters as a pure matrix, by exploiting the inherent structure of natural languages the GroupReduce algorithm can achieve much better compression rates.
3 Proposed Algorithms
We now introduce a novel algorithm for compressing both the embedding and the softmax layer, two major components in a neural language model as discussed earlier. Assume the word embedding matrix has size $N$ by $D$, where $N$ is the vocabulary size and $D$ is the embedding dimension. We will use $A \in \mathbb{R}^{N \times D}$ to denote the embedding matrix (either input or softmax layer), and each row of $A$ corresponds to the embedding vector of a word, i.e., the vector representation of the word.
Our goal is to compress the embedding matrix so that it uses less memory while achieving similar prediction performance. For a typical language model, especially one with a large vocabulary size, the large memory size of the model is mostly due to the need to store the input and output word embedding matrices. In Table 1, we show an anatomy of memory consumption for several classic models trained on publicly available datasets. We can see that for three out of four setups, embedding matrices contribute more than 75% of the overall memory usage. For example, in the bigLSTM model that achieved state-of-the-art performance on OBW, more than 90% of the memory is used to store the two (input and output) word embedding matrices. Thus, for such deep neural net models, the main challenge in serving them on-device is the very large memory footprint of the word embedding matrices. As such, it is highly valuable to compress these word embedding matrices.
Given a word embedding matrix $A$, a standard way to compress $A$ while preserving the information is to perform low-rank approximation over $A$. A low-rank approximation can be acquired by using singular value decomposition (SVD), which achieves the best rank-$r$ approximation:
$$A \approx U S V^\top, \tag{1}$$
where $U \in \mathbb{R}^{N \times r}$, $V \in \mathbb{R}^{D \times r}$, $r < \min(N, D)$ is the target rank, and $S$ is a diagonal matrix of singular values. After the rank-$r$ low-rank approximation, the memory footprint for $A$ reduces from $O(ND)$ to $O(Nr + Dr)$.
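As an illustration of Eq. (1), the following is a minimal NumPy sketch of rank-$r$ truncation. The matrix here is random and the sizes `N`, `D`, `r` are toy values chosen for the example, not settings from our experiments; the helper name `low_rank_svd` is hypothetical.

```python
import numpy as np

def low_rank_svd(A, r):
    """Return U_r (N x r), S_r (r,), V_r (D x r) with A ~= U_r diag(S_r) V_r^T."""
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :r], S[:r], Vt[:r].T

rng = np.random.default_rng(0)
N, D, r = 1000, 64, 8                 # toy sizes, not from the paper
A = rng.standard_normal((N, D))

U_r, S_r, V_r = low_rank_svd(A, r)
A_hat = (U_r * S_r) @ V_r.T           # rank-r reconstruction

stored_before = N * D                 # numbers stored for the dense matrix
stored_after = N * r + D * r          # S_r is folded into U_r
rel_err = np.linalg.norm(A - A_hat) / np.linalg.norm(A)
print(stored_before, stored_after, rel_err)
```

A random Gaussian matrix has no low-rank structure, so the relative error in this toy run is large; this is the same difficulty that the singular value spectrum of a real embedding matrix can cause.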
There are two issues with using vanilla SVD to compress an embedding matrix. First, the rank of an embedding matrix is not necessarily low. For example, Figure 1(b) shows that all the singular values of the PTB word embedding matrices are quite large, which leads to the poor reconstruction error of the low-rank approximation in Figure 1(c). Second, the SVD approach treats $A$ as a regular matrix, but in fact each row of $A$ corresponds to the embedding of a word, which implies additional structure that we can further exploit in the language model case.
3.1 The Word Frequency Matters
One important statistical property of natural languages is that the distribution of word frequencies can be approximated by a power law. That means a small fraction of words occur many times, while many words appear only a few times. Figure 1(a) shows the power-law distribution of word frequency in the PTB dataset.
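To give a feel for how skewed such a distribution is, here is a small sketch under an idealized assumption: word frequencies follow Zipf's law with exponent exactly 1 over a 10k-word vocabulary. The exponent and the normalization are assumptions for illustration, not values measured on PTB.

```python
import numpy as np

vocab_size = 10_000                    # PTB-sized vocabulary
ranks = np.arange(1, vocab_size + 1)
freq = 1.0 / ranks                     # idealized Zipf law, exponent 1 (assumption)
freq /= freq.sum()                     # normalize to a probability distribution

top = vocab_size // 10                 # the 10% most frequent words
top_share = freq[:top].sum()
print(f"top 10% of words cover {top_share:.1%} of tokens")
```

Under this idealized law, the most frequent tenth of the vocabulary accounts for roughly three quarters of all token occurrences, which is why approximation errors on frequent words matter most.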
None of the previous compression methods takes word frequency into consideration when approximating the embedding matrix. Intuitively, to construct a good compressed model with low-rank approximation under a limited memory budget, it is important to ensure that more frequent words are approximated better. In this paper, we consider two strategies to exploit the frequency information in low-rank approximation: weighted low-rank approximation and block low-rank approximation.
3.2 Improved Low-rank Approximation by Exploiting Frequency
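Table 1: Memory consumption of the models (percentages are shares of the total model size).

| Model | Vocabulary size | Dimension | Model size | Input layer | Softmax layer | LSTM cell |
| --- | --- | --- | --- | --- | --- | --- |
| PTB-Small | 10k | 200 | 17.7MB | 7.6MB (42.9%) | 7.6MB (42.9%) | 2.5MB (14.2%) |
| PTB-Large | 10k | 1500 | 251MB | 57MB (22.7%) | 57MB (22.7%) | 137MB (54.6%) |
| NMT: DE-EN | 30k | 500 | 148MB | 68MB (45.9%) | 47MB (31.8%) | 33MB (22.3%) |
| OBW-BigLSTM | 793k | 1024 | 6.8GB | 3.1GB (45.6%) | 3.1GB (45.6%) | 0.6GB (8.8%) |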
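The PTB embedding sizes in Table 1 can be roughly reproduced from the vocabulary size and dimension alone. The sketch below assumes 4-byte (32-bit float) parameters and binary megabytes (2^20 bytes); both are assumptions made for this example, and the helper name `embedding_mib` is hypothetical.

```python
def embedding_mib(vocab_size, dim, bytes_per_param=4):
    """Size of a vocab_size x dim matrix in MiB (2**20 bytes)."""
    return vocab_size * dim * bytes_per_param / 2**20

ptb_small = embedding_mib(10_000, 200)     # PTB-Small input or softmax layer
ptb_large = embedding_mib(10_000, 1500)    # PTB-Large input or softmax layer
print(round(ptb_small, 1), round(ptb_large, 1))
```

Under these assumptions the two values come out close to the 7.6MB and 57MB entries of the table.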
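Weighted low-rank approximation.
First, we introduce a weighted low-rank approximation to compress the embedding matrix $A$. This replaces the original SVD and serves as the basic building block of our proposed algorithm. The main idea is to assign a different weight to each word's approximation and to penalize errors on higher-frequency words more heavily when constructing the low-rank approximation. Mathematically, letting $f_i$ denote the frequency of the $i$-th word, we want to approximate the embedding $A$ by minimizing
$$\min_{U \in \mathbb{R}^{N \times r},\, V \in \mathbb{R}^{D \times r}} \sum_{i=1}^{N} \sum_{j=1}^{D} f_i \left( A_{ij} - U_i V_j^\top \right)^2, \tag{2}$$
where $r$ is the reduced rank; $A_{ij}$ is the $i$-th word's $j$-th feature; $U \in \mathbb{R}^{N \times r}$, $V \in \mathbb{R}^{D \times r}$; and $U_i$ and $V_j$ are the $i$-th and $j$-th rows of $U$ and $V$, respectively. Note that here we do not require $U$ and $V$ to be orthonormal.
Although it is known that weighted SVD with element-wise weights does not have a closed-form solution srebro2003weighted , in our case elements in the same row of $A$ are associated with the same weight, which leads to a simple solution. Define $Q = \mathrm{diag}(\sqrt{f_1}, \dots, \sqrt{f_N})$; then the optimization problem (2) is equivalent to
$$\min_{U \in \mathbb{R}^{N \times r},\, V \in \mathbb{R}^{D \times r}} \left\| Q A - Q U V^\top \right\|_F^2. \tag{3}$$
Therefore, assuming all the $f_i$ are nonzero, we can solve (2) by computing a low-rank approximation of $QA$. Let $\bar{U} \bar{S} \bar{V}^\top$ be the rank-$r$ truncated SVD of $QA$; then $(U^* = Q^{-1} \bar{U} \bar{S},\ V^* = \bar{V})$ is a solution of (2). Solving Eq. (2) is therefore easy, and the solution can be computed immediately from the SVD of $QA$.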
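The reduction from Eq. (2) to an ordinary SVD of $QA$ can be sketched in a few lines of NumPy. This is an illustrative sketch on random data with toy Zipf-like frequencies, not our experimental code; the function names are hypothetical.

```python
import numpy as np

def weighted_low_rank(A, f, r):
    """Minimize sum_i f_i * ||A_i - U_i V^T||^2 via the SVD of Q A, Q = diag(sqrt(f))."""
    q = np.sqrt(f)                                   # diagonal of Q, all entries > 0
    Ub, Sb, Vbt = np.linalg.svd(q[:, None] * A, full_matrices=False)
    U = (Ub[:, :r] * Sb[:r]) / q[:, None]            # U* = Q^{-1} U_bar S_bar
    V = Vbt[:r].T                                    # V* = V_bar
    return U, V

def weighted_error(A, f, A_hat):
    """Frequency-weighted squared reconstruction error, as in Eq. (2)."""
    return float((f[:, None] * (A - A_hat) ** 2).sum())

rng = np.random.default_rng(0)
N, D, r = 500, 32, 4                                 # toy sizes
A = rng.standard_normal((N, D))
f = 1.0 / np.arange(1, N + 1)                        # toy Zipf-like frequencies

U, V = weighted_low_rank(A, f, r)
err_weighted = weighted_error(A, f, U @ V.T)

# Baseline: plain rank-r SVD that ignores the frequencies.
Up, Sp, Vpt = np.linalg.svd(A, full_matrices=False)
err_plain = weighted_error(A, f, (Up[:, :r] * Sp[:r]) @ Vpt[:r])
print(err_weighted <= err_plain)
```

Because the weighted solution is optimal for the weighted objective over all rank-$r$ matrices, its weighted error can never exceed that of the plain SVD at the same rank.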
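Block low-rank approximation. As can be seen from Figure 1(b), the embedding matrix is in general not low-rank. Instead of constructing one low-rank approximation for the entire matrix, we can consider block-wise low-rank approximation, in which each block has its own approximation, to achieve better compression. A similar strategy has been exploited in si2014memory for kernel approximation (symmetric PSD matrix). Mathematically, suppose we partition the words into $K$ disjoint blocks $\mathcal{V}_1, \dots, \mathcal{V}_K$, and each $\mathcal{V}_k$ contains a set of words. For each block $\mathcal{V}_k$ and its corresponding words' embedding $A_{\mathcal{V}_k}$ in $A$, we can generate a low-rank approximation with rank $r_k$ as $A_{\mathcal{V}_k} \approx U^k (V^k)^\top$ for $k = 1, \dots, K$. Then the block low-rank approximation for $A$ is represented as:
$$A = \begin{bmatrix} A_{\mathcal{V}_1} \\ A_{\mathcal{V}_2} \\ \vdots \\ A_{\mathcal{V}_K} \end{bmatrix} \approx \begin{bmatrix} U^1 (V^1)^\top \\ U^2 (V^2)^\top \\ \vdots \\ U^K (V^K)^\top \end{bmatrix}. \tag{4}$$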
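A minimal sketch of Eq. (4) follows, assuming the rows of the matrix are already sorted by frequency so that contiguous row ranges can serve as blocks. The block sizes, the ranks, and the function names are illustrative choices, not values from our experiments, and each block uses plain (unweighted) SVD here.

```python
import numpy as np

def block_low_rank(A, blocks, ranks):
    """blocks: list of row-index arrays partitioning the rows; ranks: one rank per block."""
    factors = []
    for idx, r in zip(blocks, ranks):
        U, S, Vt = np.linalg.svd(A[idx], full_matrices=False)
        factors.append((U[:, :r] * S[:r], Vt[:r].T))     # (U^k, V^k)
    return factors

def reconstruct(factors, blocks, shape):
    """Stack the per-block approximations U^k (V^k)^T back into one matrix."""
    A_hat = np.zeros(shape)
    for (Uk, Vk), idx in zip(factors, blocks):
        A_hat[idx] = Uk @ Vk.T
    return A_hat

rng = np.random.default_rng(0)
N, D = 600, 32
A = rng.standard_normal((N, D))
blocks = np.array_split(np.arange(N), 3)     # rows assumed sorted by frequency
ranks = [8, 4, 2]                            # more rank for the frequent block

factors = block_low_rank(A, blocks, ranks)
A_hat = reconstruct(factors, blocks, A.shape)
stored = sum(Uk.size + Vk.size for Uk, Vk in factors)
print(stored, A.size)
```

Each block keeps its own pair of factors, so the stored size is the sum over blocks of the two factor sizes rather than a single $N r + D r$ term.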
The challenge for Eq. (4) is how to construct the clustering structure. Intuitively, we want words of similar frequency to be grouped in the same block, so that we can assign different ranks to different blocks based on their average frequency. For clusters of higher-frequency words, we can provide more rank (budget) for a better approximation. Meanwhile, we want to keep the approximation error for all words small under the same memory budget. Therefore, in this paper we consider two factors, word frequency and reconstruction performance, when constructing the partition. Next, we explain how to construct the partition.
Block weighted low-rank approximation. To take both matrix approximation and frequency information into account when forming the block structure in Eq. (4), we propose to initialize the blocks by frequency grouping and then refine them to achieve lower reconstruction error. In the refinement stage, we move words between blocks by simultaneously learning a clustering structure and a low-rank approximation inside each cluster for the word embedding matrix.
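Mathematically, given an embedding matrix $A$, we first initialize the blocks by frequency grouping, and then jointly learn both the clustering and the low-rank embeddings for each block by minimizing the following clustering objective:
$$\min_{\mathcal{V}_1, \dots, \mathcal{V}_K} \sum_{k=1}^{K} \min_{U^k, V^k} \left\| Q_{\mathcal{V}_k} A_{\mathcal{V}_k} - Q_{\mathcal{V}_k} U^k (V^k)^\top \right\|_F^2, \tag{5}$$
where $Q_{\mathcal{V}_k}$ is the diagonal matrix with entries $\sqrt{f_i}$ for $i \in \mathcal{V}_k$. Intuitively, the inner part minimizes the weighted low-rank approximation error for one cluster, and the outer sum searches over partitions so as to minimize the overall reconstruction error.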
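Optimization: Eq. (5) is non-convex. In this paper, we use alternating minimization to minimize the above objective. When fixing the cluster assignment, we use weighted SVD to solve for $U^k$ and $V^k$ for each $A_{\mathcal{V}_k}$. To solve for $U^k$ and $V^k$, as mentioned above for Eq. (2), we can perform SVD over $Q_{\mathcal{V}_k} A_{\mathcal{V}_k}$ (the rows of $A$ in block $\mathcal{V}_k$, each scaled by the square root of its word frequency) to obtain the approximation. The time complexity is the same as that of traditional SVD on $A_{\mathcal{V}_k}$.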
To find the clustering structure, we first initialize the cluster assignment by frequency, and then refine the block structure by moving words from one cluster to another whenever the move decreases the reconstruction error in Eq. (5). To compute the reduction in reconstruction error, we project each $A_i$ (the $i$-th row of $A$) onto each basis $V^k$ and see how much the reconstruction error improves. So if
$$\left\| A_i - A_i V^p (V^p)^\top \right\|_2 > \left\| A_i - A_i V^q (V^q)^\top \right\|_2, \tag{6}$$
then we move the $i$-th word from the $p$-th cluster to the $q$-th cluster. With this strategy, the reconstruction error decreases.
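The alternating scheme can be sketched as follows. This is an illustrative NumPy sketch on synthetic data, not our implementation: the function names are hypothetical, the number of iterations is arbitrary, and the guard that stops before a cluster becomes empty is an added assumption that the text does not specify.

```python
import numpy as np

def fit_bases(A, f, assign, ranks):
    """Weighted SVD per cluster; returns one orthonormal basis V^k (D x r_k) per cluster."""
    bases = []
    for k, r in enumerate(ranks):
        rows = np.flatnonzero(assign == k)
        QA = np.sqrt(f[rows])[:, None] * A[rows]
        _, _, Vt = np.linalg.svd(QA, full_matrices=False)
        bases.append(Vt[:r].T)
    return bases

def residuals(A, bases):
    """Column k holds ||A_i - A_i V^k (V^k)^T||^2 for every word i."""
    return np.stack([((A - A @ V @ V.T) ** 2).sum(axis=1) for V in bases], axis=1)

def objective(A, f, assign, ranks):
    """Value of Eq. (5) for a fixed assignment, with bases refit to that assignment."""
    res = residuals(A, fit_bases(A, f, assign, ranks))
    return float((f * res[np.arange(len(A)), assign]).sum())

def refine(A, f, assign, ranks, n_iter=5):
    for _ in range(n_iter):
        new = residuals(A, fit_bases(A, f, assign, ranks)).argmin(axis=1)  # Eq. (6)
        if np.bincount(new, minlength=len(ranks)).min() == 0:
            break                      # assumption: never empty a cluster
        assign = new
    return assign

rng = np.random.default_rng(0)
N, D = 400, 16
B0, B1 = rng.standard_normal((2, 2, D))              # two hidden 2-D subspaces
hidden = rng.integers(0, 2, size=N)
coef = rng.standard_normal((N, 2))
A = np.where(hidden[:, None] == 0, coef @ B0, coef @ B1)
A = A + 0.01 * rng.standard_normal((N, D))           # small noise
f = 1.0 / np.arange(1, N + 1)                        # toy Zipf-like frequencies
ranks = [2, 2]

assign0 = (np.arange(N) >= N // 2).astype(int)       # initial split by frequency rank
assign1 = refine(A, f, assign0, ranks)
obj_before = objective(A, f, assign0, ranks)
obj_after = objective(A, f, assign1, ranks)
print(obj_before, obj_after)
```

Both steps are non-increasing in the objective: refitting the bases is optimal for a fixed assignment, and each reassignment picks the basis with the smallest residual for that word.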
The overall algorithm, GroupReduce, is given in Algorithm 1, and Figure 2 illustrates it. First, we group the words into blocks based on frequency. After that, we perform the weighted low-rank approximation of Eq. (2) for each block, and then solve Eq. (5) to iteratively refine the clusters and obtain the block-wise approximation based on reconstruction error.
There are some implementation details for Algorithm 1. After the initial grouping, we assign different ranks to different blocks based on the average frequency of the words inside each cluster: the rank for block $k$ is proportional to the average frequency of the words inside that cluster. Suppose the block with the smallest frequency is assigned rank $r_0$; then the rank of cluster $k$ is $r_k = \frac{\bar{f}_k}{\bar{f}_0} r_0$, where $\bar{f}_k$ is the average frequency of the words in cluster $k$ and $\bar{f}_0$ is the average frequency for the block with the least frequent words. $r_0$ is related to the budget requirement. This dynamic rank assignment can significantly boost the performance, as it assigns more rank to high-frequency words and approximates them better.
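The proportional rank assignment described above can be sketched as follows. The rounding and the cap `max_rank` are assumptions added for this example (a rank larger than the embedding dimension would not be meaningful); the text only states proportionality. The function name and the toy frequencies are hypothetical.

```python
import numpy as np

def assign_ranks(f, blocks, r_min, max_rank):
    """Rank of block k = (mean freq of block k / smallest block mean) * r_min, then capped."""
    means = np.array([f[idx].mean() for idx in blocks])
    ranks = np.rint(means / means.min() * r_min).astype(int)
    return np.clip(ranks, r_min, max_rank)   # cap at max_rank is an added assumption

f = np.array([8.0, 8.0, 4.0, 4.0, 1.0, 1.0])             # toy word frequencies
blocks = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]
ranks = assign_ranks(f, blocks, r_min=2, max_rank=10)
print(ranks.tolist())
```

In this toy case the uncapped ranks would be 16, 8 and 2, so only the most frequent block is affected by the cap.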
In Table 2, we compare the effectiveness of the different strategies in our algorithm. We test on the PTB-Small setting with the statistics shown in Table 1. Every method in the table has the same compression rate, and we report perplexity. We compare vanilla SVD, weighted SVD, weighted SVD for each block (10 blocks), assigning different ranks to different blocks, and refining the blocks. We can see that every operation involved improves the final performance and is necessary for our algorithm. The overall memory usage to represent $A$ after our algorithm is $O(N \bar{r} + K \bar{r} D)$, where $N$ is the vocabulary size, $K$ is the number of clusters, $D$ is the embedding dimension, and $\bar{r}$ is the average rank of each cluster.
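Table 2: Perplexity on PTB-Small for the different strategies at the same compression rate.

| Vanilla SVD | Weighted low-rank | Block low-rank | Block low-rank with dynamic rank | Refinement |
| --- | --- | --- | --- | --- |
| 189.7 | 179.8 | 155.3 | 129.2 | 127.5 |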
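The memory usage stated before Table 2 can be checked with a short count. As a sketch, assume block $k$ stores its two factors $U^k \in \mathbb{R}^{|\mathcal{V}_k| \times r_k}$ and $V^k \in \mathbb{R}^{D \times r_k}$, write $\bar{r} = \frac{1}{K} \sum_{k} r_k$, and assume the blocks have roughly equal size $N/K$ (an assumption made here only to simplify the first sum):

$$\sum_{k=1}^{K} \left( |\mathcal{V}_k|\, r_k + D\, r_k \right) = \sum_{k=1}^{K} |\mathcal{V}_k|\, r_k + D K \bar{r} \approx \frac{N}{K} \sum_{k=1}^{K} r_k + K \bar{r} D = N \bar{r} + K \bar{r} D.$$

The first term is the cost of the per-word coefficients and the second is the cost of keeping $K$ separate bases, so using more clusters trades a larger basis cost for a better fit within each cluster.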
4 Experiments
4.1 Datasets and Pretrained Models
We evaluate our method (GroupReduce) on two tasks: language modeling (LM) and neural machine translation (NMT). For LM, we evaluate GroupReduce on two datasets: Penn Treebank (PTB) and the One-Billion-Word Benchmark (OBW). OBW was introduced by chelba2013one , and it contains a vocabulary of 793,471 words with the sentences shuffled and the duplicates removed. For NMT, we evaluate our method on the IWSLT 2014 German-to-English translation task cettolo2014report . On these three benchmark datasets, we compress four models, with the model details shown in Table 1. All four models use a 2-layer LSTM. Two of them (OBW and NMT) are based on existing model checkpoints, and the other two (based on PTB) are trained from scratch due to the lack of publicly released model checkpoints.
We train a 2-layer LSTM-based language model on PTB from scratch with two setups: PTB-Small and PTB-Large. The LSTM hidden state sizes are 200 for PTB-Small and 1500 for PTB-Large, and their embedding sizes are the same. For OBW, we use the "2-LAYER LSTM-8192-1024" model shown in Table 1 of jozefowicz2016exploring . For NMT, we use the PyTorch checkpoint provided by OpenNMT klein2017opennmt to perform German to English translation tasks. We verified that all four models achieve benchmark performance on the corresponding datasets as reported in the literature. We then apply our method to compress these benchmark models.
For experiments using BLEU scores as the performance measure, we report results when the BLEU score achieved after compression is within 3 percent of the original score. For experiments using perplexity (PPL) as the measure, such as on the PTB dataset, we also target a performance drop of at most 3 percent. For the OBW dataset, since it has a larger vocabulary size, we report results within 10 percent of the original PPL. For each method in Tables 3, 4, 5 and 7, we tested various parameters and report the smallest model size among the compressed models fulfilling the above criteria.
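The reporting rule above amounts to picking the smallest compressed model whose score stays within the tolerance. A minimal sketch follows, assuming the tolerance is interpreted as a relative difference from the baseline score; the function name and the candidate runs are hypothetical and are not results from our experiments.

```python
def smallest_within_tolerance(candidates, baseline, tol, higher_is_better):
    """candidates: list of (model_size, score). Return the smallest one within tol of baseline."""
    def ok(score):
        if higher_is_better:                      # e.g. BLEU
            return score >= baseline * (1 - tol)
        return score <= baseline * (1 + tol)      # e.g. perplexity
    feasible = [c for c in candidates if ok(c[1])]
    return min(feasible, key=lambda c: c[0]) if feasible else None

# Hypothetical (size in MB, perplexity) pairs for a model with baseline PPL 100.
runs = [(10.0, 109.0), (20.0, 102.5), (40.0, 100.4)]
best = smallest_within_tolerance(runs, baseline=100.0, tol=0.03, higher_is_better=False)
print(best)
```

Here the 10MB candidate is rejected because its perplexity exceeds the 3 percent tolerance, so the 20MB candidate is the one reported.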
Note that the goal of this work is to compress an existing model to a significantly reduced size while maintaining accuracy (e.g., perplexity or BLEU scores), rather than attempting to achieve higher accuracy. It is possible that there are models that could achieve higher accuracy, in which case our method can be applied to compress these models as well.
4.2 Comparison with Low-Rank and Pruning
We compare GroupReduce with two standard model compression strategies: low-rank approximation and pruning. These two techniques are widely used for language model compression, such as in lobacheva2017bayesian ; narang2017exploring ; LuSS16 . We compress both the input embedding and softmax matrices. For the low-rank approximation approach, we perform standard SVD on the embedding and softmax matrices and obtain the low-rank approximation. For pruning, we set the entries whose magnitude is less than a certain threshold to zero. Note that storing the sparse matrix requires the Compressed Sparse Row or Compressed Sparse Column format, so the memory usage is 2 times the number of nonzeros in the matrix after pruning. After approximation, we retrain the rest of the parameters with the SGD optimizer and an initial learning rate of 0.1. Whenever the validation perplexity stops decreasing, we decrease the learning rate by an order of magnitude. As shown in Table 3, GroupReduce can compress both the input embedding and softmax layers 5-10 times without losing much accuracy. In particular, GroupReduce compresses the language model trained on the OBW benchmark by 6.6 times, which saves more than 5GB of memory.
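The pruning baseline and its memory accounting can be sketched as follows. This is an illustrative example on a random matrix: the threshold is chosen here as a quantile so that 10% of the entries survive, which is an arbitrary choice for the example, and the factor of 2 counts one value plus one index per nonzero while ignoring the comparatively small row-pointer array of a CSR layout.

```python
import numpy as np

def magnitude_prune(W, threshold):
    """Zero out entries with |w| < threshold."""
    return np.where(np.abs(W) < threshold, 0.0, W)

def pruned_compression_rate(W_pruned):
    """Dense size divided by sparse size, counting 2 stored numbers per nonzero."""
    nnz = np.count_nonzero(W_pruned)
    return W_pruned.size / (2 * nnz)

rng = np.random.default_rng(0)
W = rng.standard_normal((1000, 50))          # toy matrix, not a trained embedding
threshold = np.quantile(np.abs(W), 0.9)      # keep roughly the largest 10% of entries
W_pruned = magnitude_prune(W, threshold)
rate = pruned_compression_rate(W_pruned)
print(np.count_nonzero(W_pruned), rate)
```

Because of the index overhead, a matrix that is about 90% sparse yields only about 5 times compression rather than 10.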
Notice that GroupReduce achieves good results even before retraining. This is important, as retraining might be infeasible or take a long time to converge. We experimented with different learning rates and retrained for 100k steps (about 3 hours), but we observed that none of the retraining schemes for the OBW-bigLSTM model after approximation leads to a significant improvement in accuracy. One reason is that to retrain the model, we need to keep the approximated embedding matrices fixed, reinitialize the other parameters, and train these parameters from scratch as done in shu2017compressing . On OBW-bigLSTM, the retraining process would take more than 3 weeks. This is not practical if the goal is to compress a model within a short period of time. Therefore, performance before retraining is important, and GroupReduce in general obtains good results.
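Table 3: Comparison of low-rank approximation, pruning, and GroupReduce on the embedding and softmax matrices.

| Model | Metric | Original | Low-rank | Pruning | GroupReduce |
| --- | --- | --- | --- | --- | --- |
| PTB-Small | Embedding Memory | 1x | 2x | 2x | 5x |
| | PPL (before retrain) | 112.28 | 117.11 | 115.9 | 115.24 |
| | PPL (after retrain) | – | 113.83 | 113.78 | 113.78 |
| PTB-Large | Embedding Memory | 1x | 5x | 3.3x | 10x |
| | PPL (before retrain) | 78.32 | 84.63 | 84.23 | 82.86 |
| | PPL (after retrain) | – | 80.04 | 78.38 | 78.92 |
| OBW-bigLSTM | Embedding Memory | 1x | 2x | 1.14x | 6.6x |
| | PPL (before retrain) | 31.04 | 39.41 | 128.31 | 32.47 |
| | PPL (after retrain) | – | 38.03 | 84.11 | 32.50 |
| NMT: DE-EN | Embedding Memory | 1x | 3.3x | 3.3x | 10x |
| | BLEU (before retrain) | 30.33 | 29.63 | 26.47 | 29.62 |
| | BLEU (after retrain) | – | 29.96 | 29.40 | 29.96 |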
4.3 Comparison with Quantization
As noted in the related work, quantization has been shown to be a competent method for model compression hubara2016quantized . We implement $b$-bit quantization by dividing the range of a matrix into $2^b$ equally spaced intervals and using one value to represent each interval. For example, 4-bit quantization transforms the original matrix into a matrix with 16 distinct values.
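A minimal sketch of such a uniform quantizer is given below. Representing each interval by its midpoint is an assumption made for this example (the text only says that one value represents each interval), and the function name is hypothetical.

```python
import numpy as np

def quantize_uniform(W, bits):
    """Split [W.min(), W.max()] into 2**bits equal intervals; map entries to interval midpoints."""
    levels = 2 ** bits
    lo, hi = W.min(), W.max()
    step = (hi - lo) / levels
    # Interval index of each entry; the maximum entry is clipped into the last interval.
    codes = np.minimum(((W - lo) / step).astype(np.int64), levels - 1)
    return codes, lo + (codes + 0.5) * step      # integer codes, dequantized values

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 64))               # toy matrix
codes, W_q = quantize_uniform(W, bits=4)
print(len(np.unique(W_q)), float(np.abs(W - W_q).max()))
```

Only the integer codes (4 bits each here) and the two range parameters need to be stored; with midpoints as representatives, the error of any entry is at most half an interval width.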
We need to point out that quantization can be combined with other methods. In fact, GroupReduce can be combined with quantization to achieve a better compression rate. We first approximate the embedding or the softmax matrices by GroupReduce to obtain the low-rank matrices of each block, and then apply 4-bit or 8-bit quantization to these low-rank matrices. After retraining, quantized GroupReduce achieves at least 24 times compression for both the input embedding and softmax matrices, as shown in Table 4.
4.4 Overall Compression
The results above show that GroupReduce is an effective compression method when the frequency information is given. We need to point out that part of the model (e.g., the LSTM cells) cannot leverage this information, as the transition matrices in an LSTM cell do not correspond to the representation of a word. To obtain an overall compression of the model, we adopt a simple quantized low-rank approximation of the LSTM cells. To be more specific, we first compute a low-rank approximation of the LSTM matrix by SVD to obtain 2 times compression, and then quantize the entries of the low-rank matrices using only 16 bits. In total, this part of the model becomes 4 times smaller. However, we found that for the OBW-bigLSTM model, the LSTM matrix does not have a clear low-rank structure. Even slight compression of the LSTM part causes performance to drop significantly. Therefore, we only apply 16-bit quantization on OBW-bigLSTM to obtain 2 times compression of the LSTM cells. The overall compression rate is shown in Table 5. With the aid of GroupReduce, we can achieve over 10 times compression on both the language modeling and neural machine translation tasks.
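Table 4: Comparison of quantization and quantized GroupReduce on the embedding and softmax matrices.

| Model | Metric | Original | Quantization | Quantized GroupReduce |
| --- | --- | --- | --- | --- |
| PTB-Small | Embedding Memory | 1x | 8x | 40x |
| | PPL (before retrain) | 112.28 | 132.5 | 146.59 |
| | PPL (after retrain) | – | 112.94 | 112.45 |
| PTB-Large | Embedding Memory | 1x | 8x | 40x |
| | PPL (before retrain) | 78.32 | 116.54 | 88.67 |
| | PPL (after retrain) | – | 80.72 | 80.68 |
| OBW-bigLSTM | Embedding Memory | 1x | 4x | 26x |
| | PPL (before retrain) | 31.04 | 32.63 | 34.43 |
| | PPL (after retrain) | – | 33.86 | 33.60 |
| NMT: DE-EN | Embedding Memory | 1x | 8x | 24x |
| | BLEU (before retrain) | 30.33 | 27.96 | 29.08 |
| | BLEU (after retrain) | – | 30.19 | 29.81 |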
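Table 5: Overall compression (percentages in parentheses are each component's share of the original model size).

| Model | Original PPL/BLEU | PPL/BLEU after approximation | Input layer | Softmax layer | LSTM cell | Overall compression |
| --- | --- | --- | --- | --- | --- | --- |
| NMT: DE-EN | 30.33 (BLEU) | 29.68 (BLEU) | 24x (45.9%) | 24x (31.8%) | 4x (22.3%) | 11.3x |
| OBW-BigLSTM | 31.04 (PPL) | 33.61 (PPL) | 26x (45.6%) | 26x (45.6%) | 2x (8.8%) | 12.8x |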
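The overall rate in Table 5 follows from the per-component rates and each component's share of the original model size: the compressed size is the sum of each share divided by its rate. The sketch below applies this to the DE-EN row; the function name is hypothetical, and the inputs are the rounded percentages printed in the table, so results computed this way can differ slightly from figures derived from unrounded sizes.

```python
def overall_compression(parts):
    """parts: list of (fraction_of_original_size, compression_rate) per component."""
    compressed_fraction = sum(frac / rate for frac, rate in parts)
    return 1.0 / compressed_fraction

# DE-EN row: input layer, softmax layer, LSTM cell.
de_en = overall_compression([(0.459, 24), (0.318, 24), (0.223, 4)])
print(round(de_en, 1))
```

The LSTM cell, although it is the smallest component before compression, dominates the compressed model because it is compressed the least.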
4.5 Selection of the Number of Clusters
In our method, the number of clusters is a hyperparameter that we need to choose. We experimented with different numbers of clusters on the PTB-Large setup with 6.6 times compression (i.e., using only 15% of the memory compared to the original matrices) of both the input embedding and softmax matrices, and the results are shown in Table 6. Overall, our method is robust to the number of clusters. In the experiments with the PTB and IWSLT datasets, we set the number of clusters to 5. On the OBW dataset, since the vocabulary size is larger, we set the number of clusters to 20.

Table 6: Perplexity on PTB-Large for different numbers of clusters.

| Number of Clusters | 5 | 10 | 20 | 30 |
| --- | --- | --- | --- | --- |
| PPL (before retrain) | 81.79 | 80.52 | 82.88 | 83.1 |
| PPL (after retrain) | 78.44 | 78.5 | 78.52 | 80.1 |
4.6 Comparison with Deep Compositional Coding
Since deep compositional coding shu2017compressing can only compress the input embedding matrix, to demonstrate the effectiveness of GroupReduce, we implement that method and compare it to GroupReduce when approximating only the input embedding matrix. We evaluate results on the NMT: DE-EN and PTB-Large setups. Again, after compressing each model, we retrain the rest of its parameters and keep the input embedding fixed. We use SGD with an initial learning rate of 0.1, and lower the learning rate by an order of magnitude whenever the validation loss stops decreasing. Results are summarized in Table 7. As shown in the table, GroupReduce achieves twice the compression rate of deep compositional coding. More importantly, GroupReduce can be applied to both the input and softmax embeddings, which makes the overall model smaller, not just the input embedding.
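Table 7: Comparison with deep compositional coding on the input embedding matrix.

| Model | Metric | Original | Deep Compositional Coding | Quantized GroupReduce |
| --- | --- | --- | --- | --- |
| PTB-Large | Embedding Memory | 1x | 11.8x | 23.6x |
| | PPL (before retrain) | 78.32 | 81.82 | 80.20 |
| | PPL (after retrain) | – | 79.58 | 79.18 |
| NMT: DE-EN | Embedding Memory | 1x | 16.6x | 33.3x |
| | BLEU (before retrain) | 30.33 | 28.90 | 28.89 |
| | BLEU (after retrain) | – | 30.00 | 30.16 |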
5 Conclusion
In this paper, we propose a novel compression method for neural language models that achieves at least 6.6 times compression without losing prediction accuracy. Our method leverages the statistical property of words in language to form block-wise low-rank matrix approximations for the embedding and softmax layers. The experimental results show that our method can significantly outperform traditional compression methods such as low-rank approximation and pruning. In particular, on the OBW dataset, our method combined with quantization achieves a 26 times compression rate for both the embedding and softmax matrices, which saves more than 5GB of memory. It provides practical benefits when deploying neural language models on memory-constrained devices. For future work, we will investigate different retraining schemes, such as training the block low-rank parameterization of the model end-to-end.
References
[1] Mauro Cettolo, Jan Niehues, Sebastian Stüker, Luisa Bentivogli, and Marcello Federico. Report on the 11th IWSLT evaluation campaign, IWSLT 2014. In Proceedings of the International Workshop on Spoken Language Translation, Hanoi, Vietnam, 2014.
[2] Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005, 2013.
[3] Yoojin Choi, Mostafa El-Khamy, and Jungwon Lee. Universal deep neural network compression. CoRR, abs/1802.02271, 2018.
[4] Emily L Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems, pages 1269–1277, 2014.
[5] Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and Huffman coding. CoRR, abs/1510.00149, 2015.
[6] Song Han, Jeff Pool, John Tran, and William J. Dally. Learning both weights and connections for efficient neural networks. CoRR, abs/1506.02626, 2015.
[7] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[8] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. arXiv preprint arXiv:1609.07061, 2016.
[9] Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014.
[10] Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016.
[11] Yong-Deok Kim, Eunhyeok Park, Sungjoo Yoo, Taelim Choi, Lu Yang, and Dongjun Shin. Compression of deep convolutional neural networks for fast and low power mobile applications. In ICLR, 2016.
[12] Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M Rush. OpenNMT: Open-source toolkit for neural machine translation. arXiv preprint arXiv:1701.02810, 2017.
[13] Maximilian Lam. Word2Bits - quantized word vectors. arXiv preprint arXiv:1803.05651, 2018.
[14] Yann LeCun, John S Denker, and Sara A Solla. Optimal brain damage. In Advances in Neural Information Processing Systems, pages 598–605, 1990.
[15] Darryl Lin, Sachin Talathi, and Sreekanth Annapureddy. Fixed point quantization of deep convolutional networks. In International Conference on Machine Learning, pages 2849–2858, 2016.
[16] Ekaterina Lobacheva, Nadezhda Chirkova, and Dmitry Vetrov. Bayesian sparsification of recurrent neural networks. arXiv preprint arXiv:1708.00077, 2017.
[17] Zhiyun Lu, Vikas Sindhwani, and Tara N. Sainath. Learning compact recurrent neural networks. CoRR, abs/1604.02594, 2016.
[18] Sharan Narang, Erich Elsen, Gregory Diamos, and Shubho Sengupta. Exploring sparsity in recurrent neural networks. In ICLR, 2017.
[19] Tara N Sainath, Brian Kingsbury, Vikas Sindhwani, Ebru Arisoy, and Bhuvana Ramabhadran. Low-rank matrix factorization for deep neural network training with high-dimensional output targets. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 6655–6659. IEEE, 2013.
[20] Raphael Shu and Hideki Nakayama. Compressing word embeddings via deep compositional code learning. In ICLR, 2018.
[21] Si Si, Cho-Jui Hsieh, and Inderjit S Dhillon. Memory efficient kernel approximation. J. Mach. Learn. Res., 2017.
[22] Nathan Srebro and Tommi Jaakkola. Weighted low-rank approximations. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 720–727, 2003.
[23] Andros Tjandra, Sakriani Sakti, and Satoshi Nakamura. Compressing recurrent neural network with tensor train. In Neural Networks (IJCNN), 2017 International Joint Conference on, pages 4451–4458. IEEE, 2017.
[24] Jiaxiang Wu, Cong Leng, Yuhang Wang, Qinghao Hu, and Jian Cheng. Quantized convolutional neural networks for mobile devices. CoRR, abs/1512.06473, 2015.
[25] Yuhui Xu, Yongzhuang Wang, Aojun Zhou, Weiyao Lin, and Hongkai Xiong. Deep neural network compression with single and multiple level quantization. CoRR, abs/1803.03289, 2018.
[26] Xiyu Yu, Tongliang Liu, Xinchao Wang, and Dacheng Tao. On compressing deep models by low rank and sparse decomposition. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 67–76, 2017.