1 Introduction
LSTM networks (Hochreiter & Schmidhuber, 1997) have been successfully used in language modeling (Jozefowicz et al., 2016; Shazeer et al., 2017), speech recognition (Xiong et al., 2016), machine translation (Wu et al., 2016), and many other tasks. However, these networks have millions of parameters and require weeks of training on multi-GPU systems.
We introduce two modifications of the LSTM cell with projection, LSTMP (Sak et al., 2014), to reduce the number of parameters and speed up training. The first method, factorized LSTM (FLSTM), approximates the big LSTM matrix with a product of two smaller matrices. The second method, group LSTM (GLSTM), partitions the LSTM cell into independent groups. We test the FLSTM and GLSTM architectures on the task of language modeling using the One Billion Word Benchmark (Chelba et al., 2013). As a baseline, we use the BIGLSTM model without CNN inputs described by Jozefowicz et al. (2016). We train all networks for 1 week on a DGX Station system with 4 Tesla V100 GPUs, after which BIGLSTM’s evaluation perplexity was 35.1. Our GLSTM-based model reached 36 and our FLSTM-based model reached 36.3, while using two to three times fewer RNN parameters.
1.1 Long Short-Term Memory overview
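Learning long-range dependencies with Recurrent Neural Networks (RNN) is challenging due to the vanishing and exploding gradient problems (Bengio et al., 1994; Pascanu et al., 2013). To address this issue, the LSTM cell was introduced by Hochreiter & Schmidhuber (1997), with the following recurrent computations:

$$h_t, c_t = \mathrm{LSTMCell}(x_t, h_{t-1}, c_{t-1}) \tag{1}$$

where $x_t$ is the input, $h_t$ is the cell’s state, and $c_t$ is the cell’s memory. We consider the LSTM cell with projection of size $p$, LSTMP, where Equation 1 is computed as follows (Sak et al., 2014; Zaremba et al., 2014). First, the cell gates $(i, f, o, g)$ are computed:

$$\begin{pmatrix} i \\ f \\ o \\ g \end{pmatrix} = \begin{pmatrix} \mathrm{sigm} \\ \mathrm{sigm} \\ \mathrm{sigm} \\ \tanh \end{pmatrix} T \begin{pmatrix} x_t \\ h_{t-1} \end{pmatrix} \tag{2}$$

where $x_t \in \mathbb{R}^p$, $h_t \in \mathbb{R}^p$, and $T: \mathbb{R}^{2p} \to \mathbb{R}^{4n}$ is an affine transform $T = W [x_t, h_{t-1}] + b$.

Next, the state $h_t \in \mathbb{R}^p$ and memory $c_t \in \mathbb{R}^n$ are computed using the following equations:

$$c_t = f \odot c_{t-1} + i \odot g, \qquad h_t = P(o \odot \tanh(c_t)),$$

where $P: \mathbb{R}^n \to \mathbb{R}^p$ is a linear projection. The major part of the LSTMP cell computation is in computing the affine transform $T$, because it involves multiplication with the $2p \times 4n$ matrix $W$. Thus we focus on reducing the number of parameters in $W$.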
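The LSTMP step described above can be sketched in a few lines of dependency-free Python. This is only an illustrative sketch, not the authors' implementation: the names `lstmp_step` and `vecmat`, the gate ordering (i, f, o, g) inside the pre-activation vector, and the row-vector convention (chosen so that the weight matrix has the 2p × 4n shape stated in the text) are assumptions made for this example. Here p is the projection size and n is the cell size.

```python
import math
import random


def sigm(v):
    """Element-wise logistic sigmoid."""
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def tanh(v):
    """Element-wise hyperbolic tangent."""
    return [math.tanh(a) for a in v]


def vecmat(v, M):
    """Row vector times matrix: v (length m) times M (m rows)."""
    return [sum(a * row[j] for a, row in zip(v, M)) for j in range(len(M[0]))]


def lstmp_step(x, h_prev, c_prev, W, b, P):
    """One LSTMP step.

    x, h_prev: length p; c_prev: length n
    W: 2p x 4n weights, b: length 4n bias, P: n x p projection
    (gate order i, f, o, g is an assumption of this sketch)
    """
    n = len(c_prev)
    # Affine transform T applied to the concatenation [x_t, h_{t-1}]
    z = [s + bj for s, bj in zip(vecmat(x + h_prev, W), b)]
    i, f, o = sigm(z[:n]), sigm(z[n:2 * n]), sigm(z[2 * n:3 * n])
    g = tanh(z[3 * n:])
    # c_t = f * c_{t-1} + i * g (element-wise)
    c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c_prev, i, g)]
    # h_t = P(o * tanh(c_t)): project from size n down to size p
    h = vecmat([oj * tj for oj, tj in zip(o, tanh(c))], P)
    return h, c


# Tiny usage example with random weights (p = 2, n = 3)
random.seed(0)
p, n = 2, 3
W = [[random.uniform(-0.1, 0.1) for _ in range(4 * n)] for _ in range(2 * p)]
b = [0.0] * (4 * n)
P = [[random.uniform(-0.1, 0.1) for _ in range(p)] for _ in range(n)]
h, c = lstmp_step([1.0, -1.0], [0.0] * p, [0.0] * n, W, b, P)
print(len(h), len(c))
```

Almost all of the multiply-adds in this step come from the single `vecmat` call against `W`, which is why the two methods in this paper target that matrix.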
1.2 Related Work
The partition of a layer into parallel groups was introduced by Krizhevsky et al. (2012) in AlexNet, where some convolutional layers were divided into two groups to split the model between two GPUs. Multi-group convnets have been widely used to reduce network weights and required compute, for example by Esser et al. (2016). This multi-group approach was extended to the extreme in the Xception architecture by Chollet (2016). The idea of factorizing a large convolutional layer into a stack of layers with smaller filters was used, for example, in VGG networks (Simonyan & Zisserman, 2014) and in the ResNet “bottleneck design” (He et al., 2016). Denil et al. (2013) have shown that it is possible to train several different deep architectures by learning only a small number of weights and predicting the rest. In the case of LSTM networks, ConvLSTM (Shi et al., 2015) has been introduced to better exploit possible spatiotemporal correlations, which is conceptually similar to grouping.
2 Models
2.1 Factorized LSTM cell
Factorized LSTM (FLSTM) replaces the matrix $W$ by the product of two smaller matrices that essentially try to approximate $W$ as $W \approx W_1 W_2$, where $W_1$ is of size $2p \times r$, $W_2$ is $r \times 4n$, and $r < p \le n$ (“factorization by design”). The key assumption here is that $W$ can be well approximated by a matrix of rank $r$. Such an approximation contains fewer LSTMP parameters than the original model, $(r \cdot 2p + r \cdot 4n)$ versus $(2p \cdot 4n)$, and therefore can be computed faster and synchronized faster in the case of distributed training.
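As a quick arithmetic check of the parameter counts above, the sketch below compares the size of the full 2p × 4n matrix with that of its two rank-r factors, (2p × r) and (r × 4n). The values p = 1024 (projection size), n = 8192 (cell size), and r = 512 are taken from the experimental setup reported later in this paper; the function names are hypothetical.

```python
def full_w_params(p, n):
    """Parameters in the full 2p x 4n matrix W."""
    return 2 * p * 4 * n


def factorized_w_params(p, n, r):
    """Parameters in the two factors: (2p x r) plus (r x 4n)."""
    return r * 2 * p + r * 4 * n


p, n, r = 1024, 8192, 512
full = full_w_params(p, n)
fact = factorized_w_params(p, n, r)
print(f"full W:       {full:,}")
print(f"factorized W: {fact:,}")
print(f"reduction:    {full / fact:.2f}x")
```

The saving grows as r shrinks, at the cost of restricting the transform to rank r.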
2.2 Group LSTM cell
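This approach is inspired by groups in AlexNet (Krizhevsky et al., 2012). We postulate that some parts of the input $x_t$ and hidden state $h_t$ can be thought of as independent feature groups. For example, if we use two groups, then both $x_t$ and $h_t$ are effectively split into two vectors concatenated together, $x_t = (x_t^1, x_t^2)$ and $h_t = (h_t^1, h_t^2)$, with $h_t^i$ dependent only on $x_t^i$, $h_{t-1}^i$, and the cell’s memory state. Therefore, for $k$ groups Equation 2 changes to:

$$\begin{pmatrix} i \\ f \\ o \\ g \end{pmatrix} = \left( \begin{pmatrix} \mathrm{sigm} \\ \mathrm{sigm} \\ \mathrm{sigm} \\ \tanh \end{pmatrix} T^1 \begin{pmatrix} x_t^1 \\ h_{t-1}^1 \end{pmatrix}, \ldots, \begin{pmatrix} \mathrm{sigm} \\ \mathrm{sigm} \\ \mathrm{sigm} \\ \tanh \end{pmatrix} T^k \begin{pmatrix} x_t^k \\ h_{t-1}^k \end{pmatrix} \right) \tag{3}$$

where $T^j$ is group $j$’s affine transform from $\mathbb{R}^{2p/k}$ to $\mathbb{R}^{4n/k}$. The partitioned $T$ will now have $k \cdot \frac{4n \cdot 2p}{k \cdot k}$ parameters. This cell architecture is well suited for model parallelism since every group computation is independent. An alternative interpretation of GLSTM layers is shown in Figure 1(c). While this might look similar to ensemble (Shazeer et al., 2017) or multi-tower (Ciregan et al., 2012) models, the key differences are: (1) the input to different groups is different and assumed independent, and (2) instead of computing an ensemble output, the group outputs are concatenated into independent pieces.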
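A minimal sketch of the grouped affine transform, written in dependency-free Python. With k groups, a projection size p, and a cell size n, each group gets its own small weight matrix of shape (2p/k) × (4n/k), so a group's output depends only on that group's slice of the input and previous state. The function names, the row-vector convention, and the toy sizes are assumptions of this illustration, not the authors' implementation.

```python
def vecmat(v, M):
    """Row vector times matrix."""
    return [sum(a * row[j] for a, row in zip(v, M)) for j in range(len(M[0]))]


def split(v, k):
    """Split a vector into k equal contiguous groups."""
    size = len(v) // k
    return [v[j * size:(j + 1) * size] for j in range(k)]


def grouped_transform(x, h_prev, Ws, bs):
    """Apply each group's affine transform to its own slice of (x, h_prev).

    Ws[j]: (2p/k) x (4n/k) weights, bs[j]: length 4n/k bias.
    Returns one pre-activation vector of length 4n/k per group.
    """
    k = len(Ws)
    outs = []
    for xj, hj, Wj, bj in zip(split(x, k), split(h_prev, k), Ws, bs):
        outs.append([s + bb for s, bb in zip(vecmat(xj + hj, Wj), bj)])
    return outs


def grouped_w_params(p, n, k):
    """k groups, each with a (2p/k) x (4n/k) weight matrix."""
    return k * (2 * p // k) * (4 * n // k)


# Toy example: p = 4, n = 2, k = 2, so each group's matrix is 4 x 4
Ws = [[[0.1 * (r + c) for c in range(4)] for r in range(4)] for _ in range(2)]
bs = [[0.0] * 4 for _ in range(2)]
x, h_prev = [1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 0.5, 0.5]
out = grouped_transform(x, h_prev, Ws, bs)

# Perturb only the second group's input and compare per-group outputs
out2 = grouped_transform([1.0, 2.0, -9.0, 7.0], h_prev, Ws, bs)
print(out[0] == out2[0], out[1] == out2[1])
print(grouped_w_params(1024, 8192, 2), grouped_w_params(1024, 8192, 8))
```

Because the loop iterations share no data, each group could be placed on a separate device, which is the model-parallelism property noted above.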
3 Experiments and Results
For testing we used the task of learning the joint probabilities over word sequences of arbitrary length $n$: $P(w_1, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})$, such that “real” sentences have high probabilities compared to random sequences of words. Figure 1(a) shows the typical LSTM-based model, where first the words are embedded into the low-dimensional dense input for the RNN, then the “context” is learned using RNNs via a number of steps and, finally, the softmax layer converts the RNN output into the probability distribution $P(w_i \mid w_1, \ldots, w_{i-1})$. We test the following models:
BIGLSTM: the model with projections but without CNN inputs from Jozefowicz et al. (2016);

BIG FLSTM F512: with an intermediate rank of 512 for the LSTM matrix $W$;

BIG GLSTM G4: with 4 groups in both layers;

BIG GLSTM G16: with 16 groups in both layers.
We train all models on a DGX Station with 4 GV100 GPUs for one week using the Adagrad optimizer, a projection size of 1024, a cell size of 8192, a mini-batch of 256 per GPU, sampled softmax with 8192 samples, and a learning rate of 0.2. Note that the use of projection is crucial as it helps to keep down the embedding and softmax layer sizes. Table 1 summarizes our experiments.
Judging from the training loss plots (Figure 2 in the Appendix), it is clearly visible that at the same step count, the model with more parameters wins. However, given the same amount of time, factorized models train faster. While the difference between BIGLSTM and BIG GLSTM G2 is clearly visible, BIG GLSTM G2 contains almost 2 times fewer RNN parameters than BIGLSTM, trains faster and, as a result, achieves similar evaluation perplexity within the same training time budget (1 week).
Our code is available at https://github.com/okuchaiev/flm
| Model | Perplexity | Step | Num of RNN parameters | Words/sec |
|---|---|---|---|---|
| BIGLSTM baseline | 35.1 | 0.99M | 151,060,480 | 33.8K |
| BIG FLSTM F512 | 36.3 | 1.67M | 52,494,336 | 56.5K |
| BIG GLSTM G2 | 36 | 1.37M | 83,951,616 | 41.7K |
| BIG GLSTM G4 | 40.6 | 1.128M | 50,397,184 | 56K |
| BIG GLSTM G8 | 39.4 | 850.4K | 33,619,968 | 58.5K |
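The RNN parameter counts in the table can be reproduced with simple arithmetic, under assumptions that are not spelled out in the text but that match the reported numbers exactly: two recurrent layers, a bias of size 4n per layer, and an ungrouped n × p projection per layer, with projection size p = 1024 and cell size n = 8192. The sketch below (hypothetical names throughout) also derives throughput relative to the baseline from the Words/sec column.

```python
P_SIZE, N_SIZE, LAYERS = 1024, 8192, 2  # projection size, cell size, assumed layer count


def rnn_params(w_params):
    """Total RNN parameters given the size of the gate transform's weights.

    Assumes (to match the table) a 4n bias and a full n x p projection per layer.
    """
    per_layer = w_params + 4 * N_SIZE + N_SIZE * P_SIZE
    return LAYERS * per_layer


full_w = 2 * P_SIZE * 4 * N_SIZE
# model name -> (RNN parameter count, throughput in thousands of words/sec)
models = {
    "BIGLSTM baseline": (rnn_params(full_w), 33.8),
    "BIG FLSTM F512": (rnn_params(512 * 2 * P_SIZE + 512 * 4 * N_SIZE), 56.5),
    "BIG GLSTM G2": (rnn_params(full_w // 2), 41.7),
    "BIG GLSTM G4": (rnn_params(full_w // 4), 56.0),
    "BIG GLSTM G8": (rnn_params(full_w // 8), 58.5),
}

base_params, base_speed = models["BIGLSTM baseline"]
for name, (params, kwords_per_sec) in models.items():
    print(f"{name:18s} {params:>12,d}  "
          f"{base_params / params:4.2f}x fewer params  "
          f"{kwords_per_sec / base_speed:4.2f}x throughput")
```

The bias and projection terms are the same for every model, so they set a floor on the total that the shrinking gate matrix approaches as the group count grows.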
3.1 Future research
While one might go further and try to approximate the transform $T$ using an arbitrary feed-forward neural network with $2p$ inputs and $4n$ outputs, during our initial experiments we did not see immediate benefits of doing so. Hence, it remains a topic of future research.

It might be possible to reduce the number of RNN parameters even further by stacking GLSTM layers with increasing group counts on top of each other. In our second, smaller experiment, we replaced the second layer of the BIG GLSTM G4 network with a layer with 8 groups instead of 4, and call it BIG GLSTM G4G8. We let both BIG GLSTM G4 and BIG GLSTM G4G8 run for 1 week on 4 GPUs each, and they achieved very similar perplexities. Hence, the model with “hierarchical” groups did not lose much accuracy, ran faster, and got better perplexity. Such “hierarchical” group layers look intriguing as they might provide a way of learning different levels of abstraction, but this remains a topic of future research.
Acknowledgements We are grateful to Scott Gray and Ciprian Chelba for helping us identify and correct issues with earlier versions of this work.
References
 Bengio et al. (1994) Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.
 Chelba et al. (2013) Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005, 2013.
 Chollet (2016) François Chollet. Xception: Deep learning with depthwise separable convolutions. arXiv preprint arXiv:1610.02357, 2016.
 Ciregan et al. (2012) Dan Ciregan, Ueli Meier, and Jürgen Schmidhuber. Multi-column deep neural networks for image classification. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pp. 3642–3649. IEEE, 2012.

 Denil et al. (2013) Misha Denil, Babak Shakibi, Laurent Dinh, Nando de Freitas, et al. Predicting parameters in deep learning. In Advances in Neural Information Processing Systems, pp. 2148–2156, 2013.
 Esser et al. (2016) Steven K Esser, Paul A Merolla, John V Arthur, Andrew S Cassidy, Rathinakumar Appuswamy, Alexander Andreopoulos, David J Berg, Jeffrey L McKinstry, Timothy Melano, Davis R Barch, et al. Convolutional networks for fast, energy-efficient neuromorphic computing. Proceedings of the National Academy of Sciences, pp. 201604850, 2016.
 He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
 Hochreiter & Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
 Jozefowicz et al. (2016) Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016.
 Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.
 Pascanu et al. (2013) Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. ICML (3), 28:1310–1318, 2013.
 Sak et al. (2014) Hasim Sak, Andrew W Senior, and Françoise Beaufays. Long shortterm memory recurrent neural network architectures for large scale acoustic modeling. In Interspeech, pp. 338–342, 2014.
 Shazeer et al. (2017) Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.

 Shi et al. (2015) Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-kin Wong, and Wang-chun Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Proceedings of the 28th International Conference on Neural Information Processing Systems, NIPS’15, pp. 802–810, Cambridge, MA, USA, 2015. MIT Press. URL http://dl.acm.org/citation.cfm?id=2969239.2969329.
 Simonyan & Zisserman (2014) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 Wu et al. (2016) Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
 Xiong et al. (2016) Wayne Xiong, Jasha Droppo, Xuedong Huang, Frank Seide, Mike Seltzer, Andreas Stolcke, Dong Yu, and Geoffrey Zweig. Achieving human parity in conversational speech recognition. arXiv preprint arXiv:1610.05256, 2016.
 Zaremba et al. (2014) Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014.