Neural Networks Compression for Language Modeling

by Artem M. Grachev, et al.

In this paper, we consider several compression techniques for the language modeling problem based on recurrent neural networks (RNNs). It is known that conventional RNNs, e.g., LSTM-based networks in language modeling, are characterized by either high space complexity or substantial inference time. This problem is especially crucial for mobile applications, in which constant interaction with a remote server is inappropriate. Using the Penn Treebank (PTB) dataset, we compare pruning, quantization, low-rank factorization, and tensor train decomposition for LSTM networks in terms of model size and suitability for fast inference.





1 Introduction

Neural network models can require a lot of space on disk and in memory. They can also need a substantial amount of time for inference. This is especially important for models deployed on devices such as mobile phones. There are several approaches to these problems. Some of them are based on sparse computations; they include pruning and more advanced methods. In general, such approaches provide a large reduction in the size of a trained network when the model is stored on disk. However, problems arise when such models are used for inference, and they are caused by the high computation time of sparse operations. Another branch of methods uses different matrix-based approaches in neural networks. These include methods based on Toeplitz-like structured matrices [1] and different matrix decomposition techniques: low-rank decomposition [1] and TT-decomposition (Tensor Train decomposition) [2, 3]. In addition, [4] proposes a new type of RNN, called uRNN (Unitary Evolution Recurrent Neural Networks).

In this paper, we analyze some of the aforementioned approaches. The material is organized as follows. In Section 2, we give an overview of language modeling methods and then focus on the corresponding neural network approaches. Next, we describe different types of compression. In Section 3.1, we consider the simplest methods for neural network compression, such as pruning and quantization. In Section 3.2, we consider approaches to neural network compression based on different matrix factorization methods. Section 3.3 deals with TT-decomposition. Section 4 describes our results and some implementation details. Finally, in Section 5, we summarize the results of our work.

2 Language modeling with neural networks

Consider the language modeling problem. We need to compute the probability of a sentence or sequence of words $(w_1, \ldots, w_T)$ in a language $L$:

$$P(w_1, \ldots, w_T) = P(w_1, \ldots, w_{T-1})\, P(w_T \mid w_1, \ldots, w_{T-1}) = \prod_{t=1}^{T} P(w_t \mid w_1, \ldots, w_{t-1}). \tag{1}$$

Using such a model directly would require calculating $P(w_t \mid w_1, \ldots, w_{t-1})$, which in general is too difficult because of the large number of computation steps. That is why a common approach fixes a value of $N$ and approximates (1) with $P(w_t \mid w_{t-N}, \ldots, w_{t-1})$. This leads to the widely known $N$-gram models [5, 6], which were very popular until the middle of the 2000s. A new milestone in language modeling was the use of recurrent neural networks [7]. A lot of work in this area was done by Tomas Mikolov [8].
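As a small illustration of the fixed-context approximation described above, the following sketch estimates $P(w_t \mid w_{t-1})$, i.e., a context of a single previous word, from raw counts. The toy corpus and the function name `train_bigram` are hypothetical and not taken from the paper, and no smoothing is applied.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Maximum-likelihood estimate of P(w_t | w_{t-1}) from a token list."""
    pair_counts = defaultdict(Counter)
    for prev, cur in zip(tokens, tokens[1:]):
        pair_counts[prev][cur] += 1
    probs = {}
    for prev, counter in pair_counts.items():
        total = sum(counter.values())
        probs[prev] = {w: c / total for w, c in counter.items()}
    return probs

# Hypothetical toy corpus, for illustration only.
corpus = "the cat sat on the mat the cat ran".split()
bigram = train_bigram(corpus)
```

The table `pair_counts` grows with the number of distinct contexts, which is the source of the space cost of count-based models with longer contexts.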

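Consider a recurrent neural network (RNN), where $T$ is the number of timesteps, $L$ is the number of recurrent layers, and $x_{\ell-1}^{t}$ is the input of layer $\ell$ at moment $t$. Here $t \in \{1, \ldots, T\}$, $\ell \in \{1, \ldots, L\}$, and $x_0^t$ is the embedding vector. We can describe each layer as follows:

$$z_\ell^t = W_\ell x_{\ell-1}^t + V_\ell x_\ell^{t-1} + b_\ell, \tag{2}$$

$$x_\ell^t = \sigma(z_\ell^t), \tag{3}$$

where $W_\ell$ and $V_\ell$ are matrices of weights and $\sigma$ is an activation function. The output of the network is given by

$$y^t = \mathrm{softmax}\left[W_{L+1} x_L^t + b_{L+1}\right]. \tag{4}$$

Then, we define

$$P(w_t \mid w_{t-N}, \ldots, w_{t-1}) = y^t. \tag{5}$$

While $N$-gram models require a lot of space even for moderate $N$ because of the combinatorial explosion, neural networks can learn representations of words and their sequences without directly memorizing all options.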
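The recurrent layer and softmax output described above can be sketched in NumPy as follows. This is a minimal single-layer illustration, not the authors' implementation: the choice of `tanh` as the activation, the sizes `k` and `vocab`, and the random weights are all assumptions made for the example.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())  # subtract the max for numerical stability
    return e / e.sum()

def rnn_layer_step(W, V, b, x_below, x_prev):
    """One recurrent layer step: activation(W x_below + V x_prev + b)."""
    return np.tanh(W @ x_below + V @ x_prev + b)

rng = np.random.default_rng(0)
k, vocab = 8, 20                      # hypothetical hidden and vocabulary sizes
W, V, b = rng.normal(size=(k, k)), rng.normal(size=(k, k)), np.zeros(k)
W_out, b_out = rng.normal(size=(vocab, k)), np.zeros(vocab)

x_prev = np.zeros(k)                  # layer state from the previous time step
x_emb = rng.normal(size=k)            # embedding of the current word
x_t = rnn_layer_step(W, V, b, x_emb, x_prev)
y_t = softmax(W_out @ x_t + b_out)    # distribution over the next word
```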

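Most RNN variants in use today are designed to solve the problem of decaying gradients [9]. The most popular variants are Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU) [10]. Let us describe one layer of LSTM:

$$i_\ell^t = \sigma\left[W_\ell^i x_{\ell-1}^t + V_\ell^i x_\ell^{t-1} + b_\ell^i\right] \quad \text{input gate} \tag{6}$$

$$f_\ell^t = \sigma\left[W_\ell^f x_{\ell-1}^t + V_\ell^f x_\ell^{t-1} + b_\ell^f\right] \quad \text{forget gate} \tag{7}$$

$$c_\ell^t = f_\ell^t \cdot c_\ell^{t-1} + i_\ell^t \cdot \tanh\left[W_\ell^c x_{\ell-1}^t + V_\ell^c x_\ell^{t-1} + b_\ell^c\right] \quad \text{cell state} \tag{8}$$

$$o_\ell^t = \sigma\left[W_\ell^o x_{\ell-1}^t + V_\ell^o x_\ell^{t-1} + b_\ell^o\right] \quad \text{output gate} \tag{9}$$

$$x_\ell^t = o_\ell^t \cdot \tanh\left[c_\ell^t\right] \tag{10}$$

where again $t \in \{1, \ldots, T\}$, $\ell \in \{1, \ldots, L\}$, $c_\ell^t$ is the memory vector at layer $\ell$ and time step $t$, and $\cdot$ denotes element-wise multiplication. The output of the network is given by the same formula (4) as above.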
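A single LSTM layer step with input, forget, and output gates and a cell state, as listed above, can be sketched as follows. The dictionary layout of `params`, the layer size `k`, and the random initialization are assumptions made for this example, not details from the paper.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_layer_step(params, x_below, x_prev, c_prev):
    """One LSTM layer step.

    params maps each of "i", "f", "c", "o" to a tuple (W, V, b).
    """
    def affine(name):
        W, V, b = params[name]
        return W @ x_below + V @ x_prev + b

    i = sigmoid(affine("i"))                    # input gate
    f = sigmoid(affine("f"))                    # forget gate
    c = f * c_prev + i * np.tanh(affine("c"))   # cell state
    o = sigmoid(affine("o"))                    # output gate
    x = o * np.tanh(c)                          # layer output
    return x, c

rng = np.random.default_rng(0)
k = 6  # hypothetical layer size
params = {g: (rng.normal(size=(k, k)), rng.normal(size=(k, k)), np.zeros(k))
          for g in "ifco"}
x, c = lstm_layer_step(params, rng.normal(size=k), np.zeros(k), np.zeros(k))
n_matrices = 2 * len(params)  # four gates, each with a W and a V matrix
```

Each of the four gate blocks holds two weight matrices, which gives the eight matrices per layer counted in the text.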

Approaches to the language modeling problem based on neural networks are efficient and widely adopted, but still require a lot of space. In each LSTM layer of size $k \times k$ we have 8 matrices of size $k \times k$. Moreover, the first (or zeroth) layer of such a network is usually an embedding layer that maps a word's vocabulary index to a vector, and we need to store this embedding matrix too. Its size is $|V| \times k$, where $|V|$ is the vocabulary size. We also have an output softmax layer with the same number of parameters as the embedding, i.e., $|V| \times k$. In our experiments, we try to reduce the embedding size and to decompose the softmax layer as well as the hidden layers.
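The parameter accounting in this paragraph can be checked with a few lines of arithmetic. The sketch below assumes a PTB vocabulary of 10,000 words, two LSTM layers, 32-bit weights, and ignores bias vectors; the function name is hypothetical.

```python
def lstm_lm_weight_count(vocab_size, k, num_layers=2):
    """Count weight-matrix entries (biases ignored) of an LSTM language model."""
    embedding = vocab_size * k
    recurrent = num_layers * 8 * k * k   # 8 matrices of size k x k per layer
    softmax = vocab_size * k
    return embedding + recurrent + softmax

# Assumed setup: 10,000-word vocabulary, two layers of size 200.
n_small = lstm_lm_weight_count(10_000, 200)   # 4,640,000 weights
size_mb = n_small * 4 / 1e6                   # 18.56 MB at 4 bytes per weight
```

Under these assumptions, the count agrees with the 4.64 M parameters and roughly 18.6 MB reported for LSTM 200-200 in Section 4, and the embedding and softmax layers together account for most of the weights.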

We conduct our compression experiments on the standard PTB models. There are three main benchmarks: the Small, Medium, and Large LSTM models [11]. We mostly work with the Small and Medium ones.

3 Compression methods

3.1 Pruning and quantization

In this subsection, we consider techniques that may not be the most effective but are still useful. Some of them have been described in application to audio processing [12] or image processing [13, 14], but for language modeling this field is not yet well described.

Pruning is a method for reducing the number of parameters of a neural network. In Fig. 1 (left), we can see that the majority of weight values are usually concentrated near zero. This means that such weights contribute little to the final output. We can set a threshold and then remove from the network all connections whose weights are below it in absolute value. After that, we retrain the network to learn the final weights for the remaining sparse connections.

Figure 1: Weights distribution before and after pruning
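The thresholding step of pruning can be sketched as follows. This is an illustrative magnitude-based version in which the threshold is chosen as a quantile of the absolute weight values; the function name, the random matrix, and the 90% sparsity level (chosen to mirror the output-layer setting in Table 1) are assumptions, and the retraining step is not shown.

```python
import numpy as np

def prune_by_magnitude(W, sparsity):
    """Zero out the given fraction of entries with the smallest |w|.

    Returns the pruned matrix and a boolean mask of surviving weights.
    """
    threshold = np.quantile(np.abs(W), sparsity)
    mask = np.abs(W) > threshold
    return W * mask, mask

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(200, 100))   # hypothetical weight matrix
W_pruned, mask = prune_by_magnitude(W, sparsity=0.9)
kept_fraction = mask.mean()
```

During retraining, one would typically keep `mask` fixed and reapply it after every update so that the removed connections stay at zero.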

Quantization is a method for reducing the size of a neural network in memory. We compress each float value to an eight-bit integer representing the closest real number in one of 256 equally sized intervals within the range of the values.
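A minimal sketch of such 8-bit linear quantization is shown below. It uses 256 evenly spaced levels spanning the minimum and maximum of the matrix, which is one common reading of the scheme described above; the function names are hypothetical, and the code assumes the matrix is not constant (otherwise `scale` would be zero).

```python
import numpy as np

def quantize_uint8(W):
    """Map float weights to 8-bit codes on 256 evenly spaced levels in [min, max]."""
    lo, hi = float(W.min()), float(W.max())
    scale = (hi - lo) / 255.0
    codes = np.round((W - lo) / scale).astype(np.uint8)
    return codes, lo, scale

def dequantize(codes, lo, scale):
    """Recover approximate float32 weights from the 8-bit codes."""
    return (codes.astype(np.float32) * scale + lo).astype(np.float32)

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(200, 100)).astype(np.float32)
codes, lo, scale = quantize_uint8(W)
W_hat = dequantize(codes, lo, scale)
max_error = float(np.abs(W - W_hat).max())   # roughly bounded by scale / 2
```

The stored codes take one byte per weight instead of four, while `dequantize` restores 32-bit values for computation.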

Pruning and quantization share common disadvantages: training from scratch is impossible, and using them is quite laborious. For pruning, the reason mostly lies in the inefficiency of sparse computing. With quantization, we store the model in an 8-bit representation, but we still need to perform 32-bit computations. This means that we gain no advantage in RAM usage, at least until we use a tensor processing unit (TPU), which is adapted for efficient 8- and 16-bit computations.

3.2 Low-rank factorization

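Low-rank factorization represents a more powerful class of methods. For example, in [1], the authors applied it to a voice recognition task. A simple factorization can be done as follows:

$$W = W^a W^b. \tag{11}$$

Following [1], we require $W_\ell^b = V_{\ell-1}^b$. After this we can rewrite our equation for RNN:

$$x_\ell^t = \sigma\left[W_\ell^a W_\ell^b x_{\ell-1}^t + V_\ell^a W_{\ell+1}^b x_\ell^{t-1} + b_\ell\right]. \tag{12}$$

For LSTM it is mostly the same, with more complicated formulas. The main advantage comes from the sizes of the matrices $W_\ell^a$, $W_\ell^b$, $V_\ell^a$. The matrices $W_\ell^a$ and $V_\ell^a$ have size $m \times r$ and $W_\ell^b$ has size $r \times n$, where the original $W_\ell$ and $V_\ell$ matrices have size $m \times n$. With small $r$ we gain in both size and multiplication speed. We discuss some implementation details in Section 4.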
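To make the size and speed argument concrete, the sketch below factorizes a matrix into two thin factors and compares parameter counts. It uses a truncated SVD of an existing matrix purely for illustration; this is an assumption of the example, and the paper does not state that its factors are obtained this way. The sizes 650 and 128 mirror the configuration reported in Section 4.

```python
import numpy as np

def low_rank_factorize(W, r):
    """Return W_a (m x r) and W_b (r x n); W_a @ W_b is the best rank-r fit to W."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    W_a = U[:, :r] * s[:r]     # scale columns by the singular values
    W_b = Vt[:r, :]
    return W_a, W_b

rng = np.random.default_rng(0)
m, n, r = 650, 650, 128
W = rng.normal(size=(m, n))
W_a, W_b = low_rank_factorize(W, r)

x = rng.normal(size=n)
y_fast = W_a @ (W_b @ x)          # r * (m + n) multiplications instead of m * n

params_full = m * n               # 422500
params_low_rank = r * (m + n)     # 166400
```

Multiplying by `W_b` first and `W_a` second is what yields the speed-up; forming the product `W_a @ W_b` explicitly would restore the original cost.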

3.3 The Tensor Train decomposition

In light of recent advances in the tensor train approach [2, 3], we also decided to apply this technique to LSTM compression in language modeling.

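The tensor train decomposition was originally proposed as an alternative and more efficient form of tensor representation [15]. The TT-decomposition (or TT-representation) of a tensor $A \in \mathbb{R}^{n_1 \times \ldots \times n_d}$ is the set of matrices $G_k[j_k] \in \mathbb{R}^{r_{k-1} \times r_k}$, where $j_k = 1, \ldots, n_k$, $k = 1, \ldots, d$, and $r_0 = r_d = 1$, such that each of the tensor elements can be represented as

$$A(j_1, j_2, \ldots, j_d) = G_1[j_1]\, G_2[j_2] \cdots G_d[j_d].$$

In the same paper, the author proposed to consider an input matrix as a multidimensional tensor and apply the same decomposition to it. If we have a matrix $A$ of size $N \times M$, we can fix $d$ and such $n_1, \ldots, n_d$, $m_1, \ldots, m_d$ that the following conditions are fulfilled: $\prod_{j=1}^{d} n_j = N$, $\prod_{i=1}^{d} m_i = M$. Then we reshape the matrix $A$ into a tensor $\mathcal{A}$ with $d$ dimensions and size $n_1 m_1 \times n_2 m_2 \times \ldots \times n_d m_d$. Finally, we can perform the tensor train decomposition of this tensor. This approach was successfully applied to compress fully connected neural networks [2] and to develop a convolutional TT layer [3].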
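The definition above can be illustrated with a short NumPy sketch. It computes the cores with the standard sequential-SVD procedure (TT-SVD), reads one element back as a product of core slices, and shows the matrix-to-tensor reshape for $d = 2$. The function names, tensor sizes, and the use of TT-SVD on a given tensor are assumptions of this example; the paper does not describe how its TT cores are obtained.

```python
import numpy as np

def tt_svd(tensor, max_rank):
    """Decompose a d-way tensor into TT cores of shape (r_{k-1}, n_k, r_k)."""
    shape = tensor.shape
    cores, r_prev = [], 1
    rest = tensor.reshape(r_prev * shape[0], -1)
    for k in range(len(shape) - 1):
        U, s, Vt = np.linalg.svd(rest, full_matrices=False)
        r = min(max_rank, len(s))                       # truncate the rank
        cores.append(U[:, :r].reshape(r_prev, shape[k], r))
        rest = (s[:r, None] * Vt[:r]).reshape(r * shape[k + 1], -1)
        r_prev = r
    cores.append(rest.reshape(r_prev, shape[-1], 1))
    return cores

def tt_element(cores, index):
    """A(j_1, ..., j_d) as the matrix product G_1[j_1] G_2[j_2] ... G_d[j_d]."""
    out = np.ones((1, 1))
    for core, j in zip(cores, index):
        out = out @ core[:, j, :]
    return out[0, 0]

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 5, 6))            # hypothetical small tensor
cores = tt_svd(A, max_rank=100)           # rank cap high enough to be exact here
value = tt_element(cores, (1, 2, 3))      # agrees with A[1, 2, 3] up to rounding

# Reshape a 16 x 15 matrix (16 = 4*4, 15 = 3*5) into a tensor of size 12 x 20,
# pairing each row factor n_k with the matching column factor m_k.
n, m = (4, 4), (3, 5)
Wm = rng.normal(size=(16, 15))
T = (Wm.reshape(n[0], n[1], m[0], m[1])
       .transpose(0, 2, 1, 3)
       .reshape(n[0] * m[0], n[1] * m[1]))
```

Lowering `max_rank` trades reconstruction accuracy for fewer parameters in the cores.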

In turn, we have applied this approach to LSTM. As with the usual matrix decomposition above, here we describe only the RNN layer. We apply TT-decomposition to each of the matrices $W$ and $V$ in equation (2) and get:

$$z_\ell^t = \mathrm{TT}(W_\ell)\, x_{\ell-1}^t + \mathrm{TT}(V_\ell)\, x_\ell^{t-1} + b_\ell. \tag{13}$$

Here $\mathrm{TT}(W)$ means that we apply TT-decomposition to the matrix $W$. Note that even with a fixed number of tensors in the TT-decomposition and fixed sizes, we still have plenty of variants because we can choose the rank of each tensor.
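To show how strongly the choice of ranks affects the model size, the sketch below counts the parameters of a matrix stored in TT format, where core $k$ has shape $r_{k-1} \times n_k m_k \times r_k$. The factorization $600 = 4 \cdot 5 \cdot 5 \cdot 6$ and the rank values are hypothetical choices for illustration; the paper does not list the configurations it searched.

```python
def tt_matrix_param_count(n_modes, m_modes, ranks):
    """Parameters of a TT-matrix with cores of shape (r_{k-1}, n_k * m_k, r_k).

    ranks lists the internal ranks r_1, ..., r_{d-1}; r_0 = r_d = 1 is implied.
    """
    full_ranks = [1, *ranks, 1]
    return sum(
        full_ranks[k] * n * m * full_ranks[k + 1]
        for k, (n, m) in enumerate(zip(n_modes, m_modes))
    )

# Hypothetical factorization of a 600 x 600 matrix into four cores.
modes = (4, 5, 5, 6)
for r in (2, 4, 8, 16):
    count = tt_matrix_param_count(modes, modes, ranks=(r, r, r))
    print(f"rank {r}: {count} parameters vs {600 * 600} in the full matrix")
```

Because the count grows quadratically in the internal ranks, a grid search over ranks spans models of very different capacity even when the number of cores is fixed.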

4 Results

For testing pruning and quantization we chose the Small PTB benchmark. The results can be found in Table 1. We can see a reduction in size with a small loss of quality.

For matrix decomposition we perform experiments with the Medium and Large PTB benchmarks. In language modeling, the embedding and the output layer each occupy about one third of the total network size. This makes it necessary to reduce their sizes too. We reduce the output layer by applying matrix decomposition. We describe the sizes for LR LSTM 650-650 since it is the most useful model for practical applications. We start with the basic sizes $650 \times 650$ for $W$ and $V$, and $10000 \times 650$ for the embedding. We reduce each $W$ and $V$ down to $650 \times 128$ and reduce the embedding down to $10000 \times 128$. The value 128 is chosen as the most suitable power of 2 for efficient device implementation. We have performed several experiments, and this configuration is close to the best. Our compressed model, LR LSTM 650-650, is even smaller than LSTM 200-200 and has better perplexity. The results of the experiments can be found in Table 2.

| Model | Size | No. of params | Test PP |
|---|---|---|---|
| LSTM 200-200 (Small benchmark) | 18.6 MB | 4.64 M | 117.659 |
| Pruning output layer 90%, w/o additional training | 5.5 MB | 0.5 M | 149.310 |
| Pruning output layer 90%, with additional training | 5.5 MB | 0.5 M | 121.123 |
| Quantization (1 byte per number) | 4.7 MB | 4.64 M | 118.232 |

Table 1: Pruning and quantization results on PTB dataset

In TT decomposition we have some freedom in choosing the internal ranks and the number of tensors. We fix the basic configuration of an LSTM network with two 600-600 layers and four tensors for each matrix in a layer, and we perform a grid search over different dimensions and various ranks.

We trained about 100 models using the Adam optimizer [16]. The average training time for each is about 5-6 hours on a GeForce GTX TITAN X (Maxwell architecture), but unfortunately none of them achieved acceptable quality. The best result obtained (TT LSTM 600-600) is even worse than LSTM 200-200 in terms of both size and perplexity.

| Group | Model | Size | No. of params | Test PP |
|---|---|---|---|---|
| PTB Benchmarks | LSTM 200-200 | 18.6 MB | 4.64 M | 117.659 |
| PTB Benchmarks | LSTM 650-650 | 79.1 MB | 19.7 M | 82.07 |
| PTB Benchmarks | LSTM 1500-1500 | 264.1 MB | 66.02 M | 78.29 |
| Ours | LR LSTM 650-650 | 16.8 MB | 4.2 M | 92.885 |
| Ours | TT LSTM 600-600 | 50.4 MB | 12.6 M | 168.639 |
| Ours | LR LSTM 1500-1500 | 94.9 MB | 23.72 M | 89.462 |

Table 2: Matrix decomposition results on PTB dataset
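For a quick reading of Table 2, the size reductions of the low-rank models relative to their uncompressed counterparts can be computed directly from the reported sizes. The dictionary below only copies values from the table; the variable names are hypothetical.

```python
# Sizes in megabytes copied from Table 2: (baseline, compressed) pairs.
sizes_mb = {
    "LSTM 650-650 -> LR LSTM 650-650": (79.1, 16.8),
    "LSTM 1500-1500 -> LR LSTM 1500-1500": (264.1, 94.9),
}
ratios = {name: base / comp for name, (base, comp) in sizes_mb.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x smaller")
```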

5 Conclusion

In this article, we have considered several methods of neural network compression for the language modeling problem. The first part is about pruning and quantization. We have shown that for language modeling there is no difference in applying these two techniques. The second part is about matrix decomposition methods. We have shown some advantages of these methods when models are implemented on devices, since such tasks usually impose tight restrictions on the model size and its structure. From this point of view, the model LR LSTM 650-650 has nice characteristics. It is even smaller than the smallest benchmark on PTB and demonstrates quality comparable with the medium-sized benchmark on PTB.


This study is supported by Russian Federation President grant MD-306.2017.9. A.V. Savchenko is supported by the Laboratory of Algorithms and Technologies for Network Analysis, National Research University Higher School of Economics.