1 Introduction
Neural network models can require a lot of space on disk and in memory, and a substantial amount of time for inference. This is especially important for models deployed on devices such as mobile phones. There are several approaches to these problems. Some of them are based on sparse computations and include pruning and more advanced methods. In general, such approaches can greatly reduce the size of a trained network when the model is stored on disk. However, problems arise when such models are used for inference, because sparse computations are slow. Another branch of methods uses matrix-based approaches in neural networks: methods based on Toeplitz-like structured matrices [1] and various matrix decomposition techniques, such as low-rank decomposition [1] and TT-decomposition (Tensor Train decomposition) [2, 3]. In addition, [4] proposes a new type of RNN, called uRNN (Unitary Evolution Recurrent Neural Networks).
In this paper, we analyze some of the aforementioned approaches. The material is organized as follows. In Section 2, we give an overview of language modeling methods and then focus on the respective neural network approaches. Next, we describe different types of compression. In Section 3.1, we consider the simplest methods for neural network compression, such as pruning and quantization. In Section 3.2, we consider approaches to compression of neural networks based on different matrix factorization methods. Section 3.3 deals with TT-decomposition. Section 4 describes our results and some implementation details. Finally, in Section 5, we summarize the results of our work.
2 Language modeling with neural networks
Consider the language modeling problem. We need to compute the probability of a sentence or sequence of words $(w_1, \ldots, w_T)$ in a language:
$$P(w_1, \ldots, w_T) = P(w_1, \ldots, w_{T-1})\, P(w_T \mid w_1, \ldots, w_{T-1}) = \prod_{t=1}^{T} P(w_t \mid w_1, \ldots, w_{t-1}). \tag{1}$$
Using such a model directly would require calculating $P(w_t \mid w_1, \ldots, w_{t-1})$, which in general is too difficult because of the large number of computation steps. That is why a common approach fixes a value of $N$ and approximates (1) with $P(w_t \mid w_{t-N}, \ldots, w_{t-1})$. This leads us to the widely known $N$-gram models [5, 6]. It was a very popular approach until the middle of the 2000s. The use of recurrent neural networks [7] became a new milestone in language modeling. A lot of work in this area was done by Tomas Mikolov [8].
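To make the truncated-history approximation concrete, the following is a minimal sketch of a maximum-likelihood model that conditions on a single previous word. The toy corpus, the function names `train_bigram` and `sequence_probability`, and the absence of smoothing are assumptions made purely for illustration.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Maximum-likelihood estimates of P(w_t | w_{t-1}) from a token list."""
    pair_counts = defaultdict(Counter)
    for prev, cur in zip(tokens, tokens[1:]):
        pair_counts[prev][cur] += 1
    model = {}
    for prev, counter in pair_counts.items():
        total = sum(counter.values())
        model[prev] = {word: count / total for word, count in counter.items()}
    return model

def sequence_probability(model, tokens):
    """Product of conditional probabilities; the first word is taken as given."""
    prob = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        # Unseen pairs get probability 0 because no smoothing is applied.
        prob *= model.get(prev, {}).get(cur, 0.0)
    return prob

corpus = "the cat sat on the mat the cat ran".split()  # hypothetical toy corpus
bigram = train_bigram(corpus)
p = sequence_probability(bigram, "the cat sat".split())
```

A practical model of this kind would add smoothing such as the backing-off scheme of [6], since any unseen word pair otherwise drives the whole product to zero.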
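Consider a recurrent neural network (RNN), where $N$ is the number of time steps, $L$ is the number of recurrent layers, and $x_{\ell-1}^{t}$ is the input of layer $\ell$ at time step $t$. Here $t \in \{1, \ldots, N\}$, $\ell \in \{1, \ldots, L\}$, and $x_0^t$ is the embedding vector. We can describe each layer as follows:
$$z_\ell^t = W_\ell x_{\ell-1}^t + V_\ell x_\ell^{t-1} + b_\ell, \tag{2}$$
$$x_\ell^t = \sigma(z_\ell^t), \tag{3}$$
where $W_\ell$ and $V_\ell$ are matrices of weights, $b_\ell$ is a bias vector, and $\sigma$ is an activation function. The output of the network is given by
$$y^t = \mathrm{softmax}\left[ W_{L+1} x_L^t + b_{L+1} \right]. \tag{4}$$
Then, we define
$$P(w_t \mid w_{t-N}, \ldots, w_{t-1}) = y^t. \tag{5}$$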
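The recurrence can be sketched in a few lines of plain Python. The sketch assumes the usual form of such a layer, in which the pre-activation is an input matrix times the lower-layer input plus a recurrent matrix times the previous state plus a bias, followed by a softmax output. The choice of tanh as the activation, all toy weights, and the helper names are assumptions, not the configuration used in the experiments.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def rnn_layer_step(W, V, b, x_below, x_prev):
    """Pre-activation z = W x_below + V x_prev + b, then x = tanh(z)."""
    z = [wx + vx + bi
         for wx, vx, bi in zip(matvec(W, x_below), matvec(V, x_prev), b)]
    return [math.tanh(zi) for zi in z]

def output_distribution(W_out, b_out, x_top):
    """Softmax over the vocabulary computed from the top-layer state."""
    return softmax([s + bi for s, bi in zip(matvec(W_out, x_top), b_out)])

# Toy sizes (hypothetical): hidden size 2, vocabulary size 3.
W = [[0.5, -0.2], [0.1, 0.3]]
V = [[0.0, 0.4], [-0.3, 0.2]]
b = [0.0, 0.1]
W_out = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b_out = [0.0, 0.0, 0.0]

x_prev = [0.0, 0.0]  # state of this layer before the first time step
for embedding in ([1.0, 0.0], [0.0, 1.0]):
    x_prev = rnn_layer_step(W, V, b, embedding, x_prev)
y = output_distribution(W_out, b_out, x_prev)  # distribution over 3 words
```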
While $N$-gram models require a lot of space even for moderate $N$ because of the combinatorial explosion, neural networks can learn representations of words and their sequences without directly memorizing all options.
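The combinatorial explosion is easy to quantify. The short sketch below counts the number of distinct word sequences of a given length, which is an upper bound on the number of entries a count table may need. The vocabulary size of 10,000 is the usual PTB value and is used here as an assumption; real tables store only the sequences that were actually observed.

```python
def ngram_table_size(vocab_size, n):
    """Number of distinct word sequences of length n over the vocabulary."""
    return vocab_size ** n

vocab_size = 10_000  # usual PTB vocabulary size (assumption for this sketch)
sizes = {n: ngram_table_size(vocab_size, n) for n in (1, 2, 3, 4)}
for n, size in sizes.items():
    print(f"sequences of length {n}: {size:.0e}")
```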
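The RNN variations in common use today are designed to solve the problem of decaying gradients [9]. The most popular variations are Long Short-Term Memory (LSTM) [7] and the Gated Recurrent Unit (GRU) [10]. Let us describe one layer of LSTM:
$$\text{input gate:} \quad i_\ell^t = \sigma\left[ W_\ell^i x_{\ell-1}^t + V_\ell^i x_\ell^{t-1} + b_\ell^i \right], \tag{6}$$
$$\text{forget gate:} \quad f_\ell^t = \sigma\left[ W_\ell^f x_{\ell-1}^t + V_\ell^f x_\ell^{t-1} + b_\ell^f \right], \tag{7}$$
$$\text{cell state:} \quad c_\ell^t = f_\ell^t \odot c_\ell^{t-1} + i_\ell^t \odot \tanh\left[ W_\ell^c x_{\ell-1}^t + V_\ell^c x_\ell^{t-1} + b_\ell^c \right], \tag{8}$$
$$\text{output gate:} \quad o_\ell^t = \sigma\left[ W_\ell^o x_{\ell-1}^t + V_\ell^o x_\ell^{t-1} + b_\ell^o \right], \tag{9}$$
$$x_\ell^t = o_\ell^t \odot \tanh\left[ c_\ell^t \right], \tag{10}$$
where again $t \in \{1, \ldots, N\}$, $\ell \in \{1, \ldots, L\}$, and $c_\ell^t$ is the memory vector at layer $\ell$ and time step $t$. Here $\sigma$ denotes the logistic sigmoid and $\odot$ denotes elementwise multiplication. The output of the network is given by the same formula (4) as above.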
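A single time step of one LSTM layer can be sketched as follows, assuming the standard gate equations (sigmoid gates, a tanh candidate, and elementwise products). The dictionary layout of the parameters, the helper names, and the constant toy weights are hypothetical and serve only to make the example self-contained.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))

def affine(W, V, b, x_below, x_prev):
    """W x_below + V x_prev + b, the pre-activation used by every gate."""
    return [wx + vx + bi
            for wx, vx, bi in zip(matvec(W, x_below), matvec(V, x_prev), b)]

def lstm_step(params, x_below, x_prev, c_prev):
    """One time step of one LSTM layer.

    params maps each of "i", "f", "c", "o" to a (W, V, b) triple.
    Returns the new hidden state and the new memory vector.
    """
    i = [sigmoid(v) for v in affine(*params["i"], x_below, x_prev)]
    f = [sigmoid(v) for v in affine(*params["f"], x_below, x_prev)]
    g = [math.tanh(v) for v in affine(*params["c"], x_below, x_prev)]
    o = [sigmoid(v) for v in affine(*params["o"], x_below, x_prev)]
    c = [fi * ci + ii * gi for fi, ci, ii, gi in zip(f, c_prev, i, g)]
    x = [oi * math.tanh(ci) for oi, ci in zip(o, c)]
    return x, c

def constant_gate(size, value):
    """Hypothetical toy parameters: every weight and bias equals `value`."""
    matrix = [[value] * size for _ in range(size)]
    return matrix, matrix, [value] * size

size = 3
params = {name: constant_gate(size, 0.1) for name in "ifco"}
x, c = lstm_step(params, [1.0, 0.0, -1.0], [0.0] * size, [0.0] * size)
num_matrices = sum(2 for _ in params)  # one W and one V per gate
```

The last line makes the storage cost visible: four gates with two square matrices each give the eight matrices per layer that dominate the size of the recurrent part.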
Approaches to the language modeling problem based on neural networks are efficient and widely adopted, but still require a lot of space. In each LSTM layer of size $k \times k$ we have 8 matrices of size $k \times k$. Moreover, the first (or zero) layer of such a network is usually an embedding layer that maps a word's vocabulary index to a vector, and we need to store this embedding matrix too. Its size is $|\mathcal{V}| \times k$, where $|\mathcal{V}|$ is the vocabulary size. We also have an output softmax layer with the same number of parameters as the embedding, i.e. $k \times |\mathcal{V}|$. In our experiments, we try to reduce the embedding size and to decompose the softmax layer as well as the hidden layers.

We conduct our compression experiments on the standard PTB models. There are three main benchmarks: the Small, Medium and Large LSTM models [11]. We mostly work with the Small and Medium ones.
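The storage breakdown described above can be checked with simple arithmetic. The sketch below assumes two LSTM layers, a vocabulary of 10,000 words, bias vectors for every gate and for the output layer, and an embedding that is not shared with the softmax layer; these are assumptions about the benchmark configurations rather than details stated in the text. The function name `lstm_lm_param_count` is hypothetical.

```python
def lstm_lm_param_count(vocab_size, hidden_size, num_layers=2):
    """Parameter count for embedding + stacked LSTM layers + softmax output."""
    embedding = vocab_size * hidden_size
    # Each layer: 8 square matrices (W and V for 4 gates) plus 4 bias vectors.
    per_layer = 8 * hidden_size * hidden_size + 4 * hidden_size
    softmax = hidden_size * vocab_size + vocab_size
    return {
        "embedding": embedding,
        "lstm": num_layers * per_layer,
        "softmax": softmax,
        "total": embedding + num_layers * per_layer + softmax,
    }

counts = {k: lstm_lm_param_count(10_000, k) for k in (200, 650, 1500)}
medium = counts[650]
embedding_share = medium["embedding"] / medium["total"]  # roughly one third
```

Under these assumptions the totals come out close to the parameter counts listed for the PTB benchmarks in Table 2, and the embedding accounts for about a third of the medium-sized model.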
3 Compression methods
3.1 Pruning and quantization
In this subsection, we consider techniques that may not be the most effective but are still useful. Some of them have been described in application to audio processing [12] or image processing [13, 14], but for language modeling this field is not yet well described.
Pruning is a method for reducing the number of parameters of a neural network. In Fig. 1 (left), we can see that the majority of weight values are usually concentrated near zero. This means that such weights do not make a valuable contribution to the final output. We can set a threshold and then remove from the network all connections whose weight magnitudes fall below it. After that, we retrain the network to learn the final weights for the remaining sparse connections.
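The thresholding step can be sketched as follows. The function name `prune_by_magnitude`, the way the threshold is derived from a target sparsity, and the toy weight matrix are assumptions for illustration; the retraining of the remaining connections is not shown.

```python
def prune_by_magnitude(weights, sparsity):
    """Zero out the given fraction of weights with the smallest magnitude.

    Returns the pruned matrix and the magnitude threshold that was used.
    Ties at the threshold are pruned too, so slightly more than the
    requested fraction may be removed.
    """
    flat = sorted(abs(w) for row in weights for w in row)
    num_pruned = int(len(flat) * sparsity)
    if num_pruned == 0:
        return [row[:] for row in weights], 0.0
    threshold = flat[num_pruned - 1]
    pruned = [[0.0 if abs(w) <= threshold else w for w in row]
              for row in weights]
    return pruned, threshold

weights = [
    [0.02, -0.75, 0.01, 0.40],
    [-0.03, 0.05, 0.90, -0.01],
]
pruned, threshold = prune_by_magnitude(weights, sparsity=0.5)
num_zero = sum(w == 0.0 for row in pruned for w in row)
```

To benefit on disk, the pruned matrix would be stored in a sparse format that keeps only the non-zero values and their positions.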
Quantization is a method for reducing the size of a neural network in memory. We compress each float value to an eight-bit integer representing the closest real number in one of 256 equally-sized intervals within the range.
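A minimal sketch of such a linear quantizer is given below. It assumes that the range is taken as the minimum and maximum of the values being quantized and that codes are mapped to 256 equally spaced levels; the function names are hypothetical, and production toolkits may choose the range and rounding differently.

```python
def quantize(values, levels=256):
    """Map floats to integer codes in [0, levels - 1] over [min, max]."""
    low, high = min(values), max(values)
    step = (high - low) / (levels - 1) if high > low else 1.0
    codes = [round((v - low) / step) for v in values]
    return codes, low, step

def dequantize(codes, low, step):
    """Recover approximate float values from the integer codes."""
    return [low + code * step for code in codes]

weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
codes, low, step = quantize(weights)        # each code fits in one byte
restored = dequantize(codes, low, step)     # 32-bit values used for compute
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

The reconstruction error of each value is bounded by half a quantization step, which is consistent with the small loss of quality reported for quantization in Table 1.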
Pruning and quantization share common disadvantages: training from scratch is impossible, and their usage is quite laborious. In pruning, the reason mostly lies in the inefficiency of sparse computing. When we do quantization, we store our model in an 8-bit representation, but we still need to do 32-bit computations. This means that we gain no advantage in RAM usage, at least until we use a tensor processing unit (TPU), which is adapted for effective 8- and 16-bit computations.
3.2 Low-rank factorization
Low-rank factorization represents a more powerful family of methods. For example, in [1], the authors applied it to a voice recognition task. A simple factorization can be done as follows:
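As a sketch under assumed notation (the factor names $W_\ell^{a}$, $W_\ell^{b}$, $V_\ell^{a}$, $V_\ell^{b}$, the layer size $k$ and the rank $r$ are hypothetical), each weight matrix of the RNN layer in equation 2 can be replaced by a product of two thin matrices:

```latex
W_\ell \approx W_\ell^{a} W_\ell^{b}, \qquad
V_\ell \approx V_\ell^{a} V_\ell^{b}, \qquad
W_\ell^{a}, V_\ell^{a} \in \mathbb{R}^{k \times r}, \quad
W_\ell^{b}, V_\ell^{b} \in \mathbb{R}^{r \times k}, \quad r \ll k

x_\ell^{t} = \sigma\left( W_\ell^{a} W_\ell^{b}\, x_{\ell-1}^{t}
                        + V_\ell^{a} V_\ell^{b}\, x_\ell^{t-1} + b_\ell \right)
```

Each factorized matrix then stores $2kr$ values instead of $k^2$, which pays off whenever $r < k/2$.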
For LSTM it is mostly the same with more complicated formulas. The main advantage here comes from the sizes of the factor matrices. They have sizes $k \times r$ and $r \times k$, respectively, where the original $W_\ell$ and $V_\ell$ matrices have size $k \times k$. With small $r$ we gain both in size and in multiplication speed. We discuss some implementation details in Section 4.
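The gain in size and speed can be illustrated with a short sketch. It assumes a plain two-factor decomposition of a square matrix; the helper names are hypothetical, and the sizes 650 and 128 are taken from the LR LSTM 650-650 configuration discussed in Section 4.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def low_rank_matvec(A, B, x):
    """Compute (A B) x as A (B x), never forming the full product A B."""
    return matvec(A, matvec(B, x))

def dense_cost(k):
    """Stored values (and multiply-adds per product) for a k x k matrix."""
    return k * k

def low_rank_cost(k, r):
    """Stored values (and multiply-adds) for k x r and r x k factors."""
    return 2 * k * r

# Tiny check that the two-step product equals the dense one.
A = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]   # 3 x 2 factor
B = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]     # 2 x 3 factor
full = [[sum(A[i][j] * B[j][c] for j in range(2)) for c in range(3)]
        for i in range(3)]
x = [1.0, -1.0, 2.0]
same = low_rank_matvec(A, B, x) == matvec(full, x)

ratio = dense_cost(650) / low_rank_cost(650, 128)  # about 2.5x fewer values
```

Multiplying by the factors one after the other is what keeps the cost proportional to $2kr$; forming the product of the factors first would restore the full $k \times k$ cost.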
3.3 The Tensor Train decomposition
In light of recent advances in the tensor train approach [2, 3], we have also decided to apply this technique to LSTM compression in language modeling.
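The tensor train decomposition was originally proposed as an alternative and more efficient form of tensor representation [15]. The TT-decomposition (or TT-representation) of a tensor $A \in \mathbb{R}^{n_1 \times \ldots \times n_d}$ is the set of matrices $G_k[j_k] \in \mathbb{R}^{r_{k-1} \times r_k}$, where $j_k = 1, \ldots, n_k$, $k = 1, \ldots, d$, and $r_0 = r_d = 1$, such that each of the tensor elements can be represented as
$$A(j_1, j_2, \ldots, j_d) = G_1[j_1]\, G_2[j_2] \cdots G_d[j_d].$$
In the same paper, the author proposed to consider the input matrix as a multidimensional tensor and apply the same decomposition to it. If we have a matrix $B$ of size $p \times q$, we can fix $d$ and such $p_1, \ldots, p_d$, $q_1, \ldots, q_d$ that the following conditions are fulfilled: $\prod_{i=1}^{d} p_i = p$, $\prod_{i=1}^{d} q_i = q$. Then we reshape our matrix $B$ to the tensor $\mathcal{B}$ with $d$ dimensions and size $p_1 q_1 \times p_2 q_2 \times \ldots \times p_d q_d$. Finally, we can perform tensor train decomposition with this tensor. This approach was successfully applied to compress fully connected neural networks [2] and to develop a convolutional TT layer [3].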
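The defining property of the format, that every tensor element is a product of small matrices selected by its indices, can be sketched directly. The example tensor, its cores, and the function names are hypothetical; the cores are written by hand rather than obtained from a decomposition algorithm.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def tt_element(cores, index):
    """Evaluate one tensor entry as a product of index-selected matrices.

    cores[k][j] is the matrix of core k for index value j; the first
    core holds row vectors and the last holds column vectors, so the
    chain product is a 1 x 1 matrix.
    """
    result = cores[0][index[0]]
    for core, j in zip(cores[1:], index[1:]):
        result = matmul(result, core[j])
    return result[0][0]

# Hypothetical 2 x 3 x 2 tensor with internal ranks 2 whose entries are
# A(i, j, k) = a[i] * b[j] * c[k] + 1.
a, b, c = [1.0, 2.0], [1.0, 0.0, -1.0], [3.0, 4.0]
cores = [
    [[[a_i, 1.0]] for a_i in a],                 # 1 x 2 matrices
    [[[b_j, 0.0], [0.0, 1.0]] for b_j in b],     # 2 x 2 matrices
    [[[c_k], [1.0]] for c_k in c],               # 2 x 1 matrices
]
value = tt_element(cores, (1, 2, 0))  # a[1] * b[2] * c[0] + 1
```

For a tensor this small the cores hold more numbers than the tensor itself; the format only saves space when the dimensions are large relative to the ranks.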
In turn, we have applied this approach to LSTM. As with the usual matrix decomposition above, we describe only the RNN layer here. We apply TT-decomposition to each of the matrices $W_\ell$ and $V_\ell$ in equation 2 and get:
$$z_\ell^t = \mathrm{TT}(W_\ell)\, x_{\ell-1}^t + \mathrm{TT}(V_\ell)\, x_\ell^{t-1} + b_\ell. \tag{15}$$
Here $\mathrm{TT}(W)$ means that we apply TT-decomposition to the matrix $W$. Note that even with a fixed number of tensors in the TT-decomposition and fixed sizes, we still have plenty of variants because we can choose the rank of each tensor.
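How strongly the choice of ranks affects the model size can be seen from a parameter count. The sketch assumes that each core of a TT-matrix stores one small matrix per combined row and column index, and it uses a hypothetical factorization of 600 into 4, 5, 5 and 6 with equal internal ranks; neither the factorization nor the ranks are taken from the experiments.

```python
from math import prod

def tt_matrix_params(row_factors, col_factors, ranks):
    """Stored values in a TT-matrix whose i-th core has
    ranks[i] * row_factors[i] * col_factors[i] * ranks[i + 1] entries."""
    assert len(row_factors) == len(col_factors) == len(ranks) - 1
    assert ranks[0] == ranks[-1] == 1
    return sum(
        ranks[i] * row_factors[i] * col_factors[i] * ranks[i + 1]
        for i in range(len(row_factors))
    )

# Hypothetical factorization of a 600 x 600 matrix into four cores.
factors = (4, 5, 5, 6)
assert prod(factors) == 600
dense = 600 * 600
by_rank = {r: tt_matrix_params(factors, factors, (1, r, r, r, 1))
           for r in (2, 4, 8, 16)}
```

Because the two inner cores scale with the square of the rank, doubling the rank roughly quadruples the storage, which is why the search space over ranks is large even for a fixed number of cores.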
4 Results
For testing pruning and quantization, we choose the Small PTB benchmark. The results can be found in Table 1. We can see that the size is reduced with a small loss of quality.
For matrix decomposition, we perform experiments with the Medium and Large PTB benchmarks. In language modeling, the embedding and the output layer each occupy one third of the total network size, which makes it necessary to reduce their sizes too. We reduce the output layer by applying matrix decomposition. We describe the sizes for LR LSTM 650-650 since it is the most useful model for practical application. We start with the basic sizes $650 \times 650$ for $W$ and $V$, and $10000 \times 650$ for the embedding. We reduce each $W$ and $V$ down to $650 \times 128$ and reduce the embedding down to $10000 \times 128$. The value 128 is chosen as the most suitable power of 2 for efficient device implementation. We have performed several experiments, and this configuration is near the best. Our compressed model, LR LSTM 650-650, is even smaller than LSTM 200-200 and has better perplexity. The results of the experiments can be found in Table 2.
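Table 1. Pruning and quantization results on the Small PTB benchmark.

| Model | Size | No. of params | Test PP |
|---|---|---|---|
| LSTM 200-200 (Small benchmark) | 18.6 MB | 4.64 M | 117.659 |
| Pruning output layer 90%, w/o additional training | 5.5 MB | 0.5 M | 149.310 |
| Pruning output layer 90%, with additional training | 5.5 MB | 0.5 M | 121.123 |
| Quantization (1 byte per number) | 4.7 MB | 4.64 M | 118.232 |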
In TT-decomposition, we have some freedom in choosing the internal ranks and the number of tensors. We fix the basic configuration of an LSTM network with two 600-600 layers and four tensors for each matrix in a layer, and we perform a grid search over different numbers of dimensions and various ranks.
We have trained about 100 models using the Adam optimizer [16]. The average training time for each is about 5-6 hours on a GeForce GTX TITAN X (Maxwell architecture), but unfortunately none of them has achieved acceptable quality. The best result obtained (TT LSTM 600-600) is even worse than LSTM 200-200 in terms of both size and perplexity.
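Table 2. Matrix decomposition results on PTB.

| Group | Model | Size | No. of params | Test PP |
|---|---|---|---|---|
| PTB Benchmarks | LSTM 200-200 | 18.6 MB | 4.64 M | 117.659 |
| PTB Benchmarks | LSTM 650-650 | 79.1 MB | 19.7 M | 82.07 |
| PTB Benchmarks | LSTM 1500-1500 | 264.1 MB | 66.02 M | 78.29 |
| Ours | LR LSTM 650-650 | 16.8 MB | 4.2 M | 92.885 |
| Ours | TT LSTM 600-600 | 50.4 MB | 12.6 M | 168.639 |
| Ours | LR LSTM 1500-1500 | 94.9 MB | 23.72 M | 89.462 |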
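The sizes in Table 2 are consistent with storing every parameter as a 32-bit float. The sketch below recomputes them from the listed parameter counts; the assumption of 4 bytes per value and of megabytes as $10^6$ bytes is ours, and small deviations from the table are expected because the counts are rounded.

```python
BYTES_PER_FLOAT32 = 4

# (model, parameters in millions) as listed in Table 2.
models = [
    ("LSTM 200-200", 4.64),
    ("LSTM 650-650", 19.7),
    ("LSTM 1500-1500", 66.02),
    ("LR LSTM 650-650", 4.2),
    ("TT LSTM 600-600", 12.6),
    ("LR LSTM 1500-1500", 23.72),
]

def size_mb(params_millions, bytes_per_value=BYTES_PER_FLOAT32):
    """Storage in megabytes (10^6 bytes) for the given parameter count."""
    return params_millions * bytes_per_value

sizes = {name: size_mb(params) for name, params in models}
# How many times smaller the low-rank medium model is than its baseline.
compression = sizes["LSTM 650-650"] / sizes["LR LSTM 650-650"]
```

By this estimate the low-rank medium model is between four and five times smaller than the uncompressed one.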
5 Conclusion
In this article, we have considered several methods of neural network compression for the language modeling problem. The first part is about pruning and quantization. We have shown that for language modeling there is no difference in applying these two techniques. The second part is about matrix decomposition methods. We have shown some advantages when models are implemented on devices, since such tasks usually impose tight restrictions on the model size and its structure. From this point of view, the model LR LSTM 650-650 has nice characteristics. It is even smaller than the smallest benchmark on PTB and demonstrates quality comparable with the medium-sized benchmarks on PTB.
Acknowledgements.
This study is supported by Russian Federation President grant MD-306.2017.9. A.V. Savchenko is supported by the Laboratory of Algorithms and Technologies for Network Analysis, National Research University Higher School of Economics.
References
 [1] Lu, Z., Sindhwani, V., Sainath, T.N.: Learning compact recurrent neural networks. Acoustics, Speech and Signal Processing (ICASSP) (2016)
 [2] Novikov, A., Podoprikhin, D., Osokin, A., Vetrov, D.P.: Tensorizing neural networks. In: Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015. (2015) 442–450
 [3] Garipov, T., Podoprikhin, D., Novikov, A., Vetrov, D.P.: Ultimate tensorization: compressing convolutional and FC layers alike. CoRR/NIPS 2016 workshop: Learning with Tensors: Why Now and How? abs/1611.03214 (2016)

 [4] Arjovsky, M., Shah, A., Bengio, Y.: Unitary evolution recurrent neural networks. In: Proceedings of the 33rd International Conference on Machine Learning, ICML 2016. (2016) 1120–1128
 [5] Jelinek, F.: Statistical Methods for Speech Recognition. MIT Press (1997)
 [6] Kneser, R., Ney, H.: Improved backing-off for m-gram language modeling. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing 1 (1995) 181–184
 [7] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8) (1997) 1735–1780
 [8] Mikolov, T.: Statistical Language Models Based on Neural Networks. PhD thesis, Brno University of Technology (2012)
 [9] Hochreiter, S., Bengio, Y., Frasconi, P., Schmidhuber, J.: Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In: Kremer, S.C., Kolen, J.F. (eds.): A Field Guide to Dynamical Recurrent Neural Networks (2001)
 [10] Cho, K., van Merrienboer, B., Bahdanau, D., Bengio, Y.: On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259 (2014)
 [11] Zaremba, W., Sutskever, I., Vinyals, O.: Recurrent neural network regularization. arXiv preprint (2014)
 [12] Han, S., Mao, H., Dally, W.J.: Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. Acoustics, Speech and Signal Processing (ICASSP) (2016)
 [13] Molchanov, P., Tyree, S., Karras, T., Aila, T., Kautz, J.: Pruning convolutional neural networks for resource efficient transfer learning. arXiv preprint arXiv:1611.06440 (2016)
 [14] Rassadin, A.G., Savchenko, A.V.: Deep neural networks performance optimization in image recognition. Proceedings of the 3rd International Conference on Information Technologies and Nanotechnologies (ITNT) (2017)
 [15] Oseledets, I.V.: Tensor-train decomposition. SIAM J. Scientific Computing 33(5) (2011) 2295–2317
 [16] Kingma, D., Ba, J.: Adam: A method for stochastic optimization. The International Conference on Learning Representations (ICLR) (2015)