Deep learning has become the dominant force behind recent breakthroughs in artificial intelligence [Geng et al.2017]. Recurrent Neural Networks (RNNs) in particular have gained wide attention for their dramatic performance improvements in sequential data modeling, e.g., automatic speech recognition (ASR) [Sak et al.2014], language modeling [Mikolov et al.2011], image captioning [Vinyals et al.2017], and machine translation [Cho et al.2014]. Owing to the success of RNN-based ASR [Lei et al.2013], personal assistant systems (e.g., Amazon’s Alexa, Apple’s Siri, Google Now, Samsung’s Bixby, Microsoft’s Cortana) have become a standard feature of smartphones.
Generally, trained deep learning models are deployed in the cloud, which requires a reliable Internet connection and may compromise user privacy. There is therefore a high demand for deploying RNNs with millions of parameters directly on mobile devices. Deep model compression techniques have been proposed to solve this mobile deployment problem. Specifically for RNN compression, a pruning-based method [Narang et al.2017] was introduced that progressively prunes away small parameters using a monotonically increasing threshold during training. Another representative method applies a low-rank singular value decomposition (SVD) to decompose the RNN weight matrices [Prabhavalkar et al.2016]. Although good results for compressing RNNs have been reported, these works fail to consider the hierarchical changes of weight matrices. In addition, none of the existing methods can dynamically adjust the compression rate according to the requirements of the target mobile device.
In this work, we propose dynamically hierarchy revolution (DirNet) to address these problems in existing RNN compression methods (Fig. 1). To further compress the model while accounting for the different degrees of redundancy among layers, we exploit a novel way of dynamically mining dictionary atoms from the original network structure without manually setting the compression rate for each layer. Our approach achieves a significant improvement in model compression over previous approaches. It also converges much faster during re-training and loses less accuracy than the state of the art. In addition, considering the different functionalities of hierarchical hidden layers, we adaptively adjust the sparsity of the different hidden layers in RNNs. This yields a more precise structural approximation of the original network while minimizing the performance drop.
Our contributions are threefold. First, this is the first work to dynamically explore dictionary learning on the weight matrices of both RNNs and LSTMs, jointly compressing both the hidden-layer and inter-layer matrices to reconstruct the original weight matrices. Second, our approach is a dynamic model compression algorithm: it dynamically mines the dictionary atoms of the projection dictionary matrix within a layer to better learn a common codebook representation shared across the inter-layer and recurrent layer. This finds the optimal number of neurons for each layer, giving better control over the degree of compression with negligible accuracy loss. Third, given a hierarchical RNN structure, our approach adaptively sets a different sparsity of the sparse codes for each layer to minimize the performance drop of the compressed network. Experimental results demonstrate that DirNet achieves significant improvements in compression rate and speedup over state-of-the-art approaches.
2 Related Work
In general, existing RNN compression methods fall into three categories: pruning, low-rank approximation, and knowledge distillation. The first pruning work was proposed by Han et al. [Han et al.2015b], where a hard threshold was applied as the pruning criterion to compress a deep neural network. Li et al. [Li et al.2017] discussed the optimization of non-tensor layers, such as pooling and normalization layers, which have no tensor-like trainable parameters, for deep model compression. Xue et al. [Xue et al.2013] used low-rank approximation to reduce parameter dimensions in order to save storage and reduce time complexity. Denil et al. [Denil et al.2013] showed that a given neural network can be represented by a small number of parameters. Sainath et al. [Sainath et al.2013] reduced the number of model parameters by applying low-rank matrix factorization to simplify the neural network architecture.
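As a concrete illustration of the low-rank family, the following minimal NumPy sketch factorizes a weight matrix with a truncated SVD and counts the parameters before and after. The matrix size, the rank `r` and the helper name `low_rank_factorize` are illustrative assumptions, not settings taken from any of the cited works.

```python
import numpy as np

def low_rank_factorize(W, r):
    """Return A (m x r) and B (r x n) so that A @ B is the best rank-r
    approximation of W in the Frobenius norm (truncated SVD)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :r] * s[:r]   # scale the kept left singular vectors
    B = Vt[:r, :]          # kept right singular vectors
    return A, B

# Hypothetical 64 x 48 weight matrix, used only for illustration.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 48))
A, B = low_rank_factorize(W, r=8)

original_params = W.size              # 64 * 48
compressed_params = A.size + B.size   # 64 * 8 + 8 * 48
rel_error = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
print(original_params, compressed_params, rel_error)
```

A random matrix has little low-rank structure, so the relative error here is large; trained weight matrices are typically far more compressible.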
Hinton et al. [Hinton et al.2015] trained a small student network to mimic the original teacher network by minimizing the loss between the student and teacher outputs. Kim and Rush [Kim and Rush2016] integrated knowledge distillation to approximately match the sequence-level distribution of the teacher network. Chen et al. [Chen et al.2017] introduced knowledge of cross-sample similarities for model compression and acceleration.
In addition, there are other compression methods. Low-precision quantization [Xu et al.2018] is a scalar-level compression method that does not consider the relations among the learned parameters within layers. Compared with these works, our approach dynamically adjusts the compression rate across layers and exploits the sparsity of the weight matrices to compress RNNs. It achieves a better compression rate with less accuracy loss than the above methods.
Recently, Han et al. [Han et al.2017] used a pruning method to compress an LSTM for speech recognition, reducing the weight matrix to 10% of its parameters and achieving a 5X compression rate (each non-zero element is represented by its value and its index in the matrix). In contrast, DirNet considers the sparsity of the various matrices and dynamically adjusts it based on the characteristics of the model, which yields a better compression rate: our approach can compress the model by around 8X with negligible performance loss.
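The step from 10% of the parameters to a 5X compression rate follows from the storage format. The short sketch below reproduces the arithmetic, assuming that an index costs the same storage as a value; the function name and the `words_per_nonzero` parameter are illustrative.

```python
def sparse_compression_rate(nonzero_fraction, words_per_nonzero=2):
    """Compression rate of a sparse matrix whose non-zeros are each stored
    as `words_per_nonzero` words (assumed here: one value plus one index)."""
    return 1.0 / (nonzero_fraction * words_per_nonzero)

# 10% non-zeros, each stored as (value, index).
rate = sparse_compression_rate(0.10)
print(rate)
```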
In this section, we first introduce the background of RNNs. Then, we present the basic sparse coding based RNN compression method without considering the mobile requirement and the hierarchical network structure. Further, we present the proposed approach DirNet.
3.1 RNN Background
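Let $T$ denote the number of time steps and $L$ the number of hidden layers in the RNN. At time $t$, $h_t^l \in \mathbb{R}^{N^l}$ denotes the activations of the $l$-th hidden layer with $N^l$ nodes. The inputs to this layer at time $t$ are therefore denoted by $h_t^{l-1}$. We can then define the output activations of the $l$-th and $(l+1)$-th layers in a classical RNN:

$$h_t^{l} = \sigma\left(W_x^{l-1} h_t^{l-1} + W_h^{l} h_{t-1}^{l} + b^{l}\right) \quad (1)$$

$$h_t^{l+1} = \sigma\left(W_x^{l} h_t^{l} + W_h^{l+1} h_{t-1}^{l+1} + b^{l+1}\right) \quad (2)$$

where $\sigma(\cdot)$ denotes a non-linear activation function, and $b^{l} \in \mathbb{R}^{N^{l}}$ and $b^{l+1} \in \mathbb{R}^{N^{l+1}}$ represent bias vectors. $W_x^{l} \in \mathbb{R}^{N^{l+1} \times N^{l}}$ and $W_h^{l} \in \mathbb{R}^{N^{l} \times N^{l}}$ denote inter-layer and recurrent weight matrices, respectively.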
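The recurrence of a single classical RNN layer can be sketched in NumPy as follows. The names `W_x` (inter-layer matrix), `W_h` (recurrent matrix) and `b` (bias), the choice of `tanh` as the non-linearity, and all sizes are assumptions made for illustration.

```python
import numpy as np

def rnn_layer_step(h_below, h_prev, W_x, W_h, b):
    """One time step of a classical RNN layer: combine the input from the
    layer below with this layer's previous activations."""
    return np.tanh(W_x @ h_below + W_h @ h_prev + b)

rng = np.random.default_rng(0)
n_in, n_hidden = 4, 3
W_x = rng.standard_normal((n_hidden, n_in)) * 0.1      # inter-layer weights
W_h = rng.standard_normal((n_hidden, n_hidden)) * 0.1  # recurrent weights
b = np.zeros(n_hidden)

h = np.zeros(n_hidden)                       # initial hidden state
for x_t in rng.standard_normal((5, n_in)):   # five time steps
    h = rnn_layer_step(x_t, h, W_x, W_h, b)
print(h)
```

The two matrices `W_x` and `W_h` hold essentially all parameters of the layer, which is why compression methods target them.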
Unlike traditional RNNs, an LSTM contains specialized cells, called memory cells, in the recurrent hidden layer. A memory cell stores information for long periods of time; we use $c$ to denote this long-term memory. The LSTM can decide to overwrite the memory cell, retrieve it, or keep it for the next time step [Zaremba et al.2014]. In this paper, our LSTM architecture is the same as in [Graves et al.2013]. The general structure is illustrated in Figure 2.
where $i$, $o$, $f$ and $g$ are the input gate, output gate, forget gate and input modulation gate, respectively; $c$ denotes the memory cells, and all gates have the same size as the memory cells; $\odot$ is the element-wise product of vectors; and $n$ and $4n$ denote the dimension of one gate and of the four gates, respectively.
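For orientation, one step of a standard LSTM cell without peephole connections can be sketched as below. This is a common textbook variant and an assumption here, not necessarily the exact architecture used in the paper; the gate ordering inside the stacked matrices and all names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W_x, W_h, b):
    """One LSTM step. W_x, W_h and b stack the four gates along the first
    axis, so the pre-activation has dimension 4n for gates of dimension n."""
    n = h_prev.shape[0]
    pre = W_x @ x + W_h @ h_prev + b   # shape (4n,)
    i = sigmoid(pre[0:n])              # input gate
    f = sigmoid(pre[n:2 * n])          # forget gate
    o = sigmoid(pre[2 * n:3 * n])      # output gate
    g = np.tanh(pre[3 * n:4 * n])      # input modulation gate
    c = f * c_prev + i * g             # element-wise memory cell update
    h = o * np.tanh(c)                 # new hidden activations
    return h, c

rng = np.random.default_rng(0)
n_in, n = 5, 3
W_x = rng.standard_normal((4 * n, n_in)) * 0.1
W_h = rng.standard_normal((4 * n, n)) * 0.1
b = np.zeros(4 * n)

h, c = np.zeros(n), np.zeros(n)
for x_t in rng.standard_normal((6, n_in)):
    h, c = lstm_step(x_t, h, c, W_x, W_h, b)
print(h, c)
```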
3.2 Sparse Coding based Model Compression
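As indicated above, one direct way to compress an RNN is to sparsify the weight matrices $W_x$ and $W_h$. Given a weight matrix $W \in \mathbb{R}^{m \times n}$ with columns $w_i \in \mathbb{R}^{m}$, sparse coding aims to learn a dictionary $D \in \mathbb{R}^{m \times r}$, where $r < n$, and a sparse code matrix $Z \in \mathbb{R}^{r \times n}$. The original matrix $W$ is modeled by a sparse linear combination of $D$ and $Z$ as $W \approx DZ$. We can formulate the following optimization problem:

$$\min_{D, Z} \ \frac{1}{2}\|W - DZ\|_F^2 + \beta \|Z\|_1 \quad \text{s.t.} \quad \|d_i\|_2 \le 1, \ i = 1, \dots, r \quad (4)$$

where $\|Z\|_1$ is the sum of the absolute values of the entries of $Z$, $d_i$ denotes the $i$-th column of $D$ and $r$ is the number of atoms. $\beta$ is the positive regularization parameter and $\|\cdot\|_F$ denotes the Frobenius norm of the matrix.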
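To make the trade-off between reconstruction error and sparsity concrete, the sketch below evaluates a commonly used sparse coding objective, half the squared Frobenius reconstruction error plus an $\ell_1$ penalty on the codes. The exact form of the objective, the names `W`, `D`, `Z`, `beta` and the sizes are assumptions chosen for illustration.

```python
import numpy as np

def sparse_coding_objective(W, D, Z, beta):
    """0.5 * ||W - D Z||_F^2 + beta * sum of |Z| entries (assumed form)."""
    residual = W - D @ Z
    return 0.5 * np.sum(residual ** 2) + beta * np.sum(np.abs(Z))

rng = np.random.default_rng(0)
D = rng.standard_normal((20, 8))
D /= np.linalg.norm(D, axis=0)        # unit-norm atoms

Z = np.zeros((8, 30))
mask = rng.random(Z.shape) < 0.2      # roughly 20% non-zero codes
Z[mask] = rng.standard_normal(int(mask.sum()))

W = D @ Z                             # a matrix that D and Z reproduce exactly
obj = sparse_coding_objective(W, D, Z, beta=0.1)
density = np.count_nonzero(Z) / Z.size
print(obj, density)
```

Because `W` is built as `D @ Z`, the reconstruction term vanishes and the objective reduces to the sparsity penalty alone.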
For RNNs, we can compress simultaneously the weight matrices of both recurrent layer and inter-layer (from Eq. (1) or Eq. (2)). Such joint compression can be achieved by a projection matrix [Sak et al.2014], denoted by (), such that, and . Therefore, we can incorporate this idea into Eq. (1) and Eq. (2), then we get the followings:
where and . Therefore, we can calculate the compression rate by
However, a problem with this compression model is that the compression rate and the sparsity are set manually, which limits the compression space. It would be more efficient and promising to set the compression rate and sparsity automatically, based on an understanding of the deep neural network’s structure. We should also minimize the training time during model compression, which may otherwise take weeks or even months, especially for RNN-based sequential models. In this paper, we propose DirNet, which dynamically adjusts the sparsity in Eq. (5) and Eq. (7), to meet this requirement.
DirNet is composed of two main steps, depicted graphically in Fig. 1. In Step 1, dynamically adaptive dictionary learning (Sec. 3.3) adapts the atoms of the dictionary shared between the inter-layer and recurrent layer to adjust the sparsity (and hence the compression rate and accuracy) within the layer (Eq. (5)). In Step 2, an adaptive sparsity learning approach that takes the network’s hierarchical structure into account (Sec. 3.4) optimizes the sparse codes and the sparsity parameter hierarchically with an appropriate degree of sparsity. The sparsity (compression rate and accuracy) achieved by each layer depends on its number of neurons and its architecture. In general, the non-zero values in the sparse codes amount to ten to twenty percent of the original weight matrices in our work. The proposed approach thus adaptively sets different sparsities across the hierarchical RNN architecture in Eq. (7), instead of using a fixed fraction of explained variance as advocated by Prabhavalkar et al. [Prabhavalkar et al.2016]. We summarize the key steps of DirNet in Algorithm 1.
3.3 Dynamically Adaptive Dictionary Learning
It is an essential step to dynamically adjust the dimension of the projection matrices to get the optimal compression rate. We introduce a shift operation on atoms to better adjust the compression rate. Given a set of shift operations which contains only small shifts relative to the size of the (time window), for every there exist coefficients and shift operators [Hitziger et al.2013], such that
Now, we can formulate the dynamically adaptive dictionary learning problem as follows:
The problem becomes Eq. 4 when , thus we use alternate minimization to solve it.
Sparse Codes Update. Updating the sparse code is known to be the most time-consuming part [Mairal et al.2009]. One of the state-of-the-art methods for solving such a lasso problem is coordinate descent (CD) [Friedman et al.2007]. Given an input vector, CD initializes the sparse code and then updates it many times via matrix-vector multiplication and thresholding. However, the iteration can take thousands of steps to converge. Lin et al. [Lin et al.2014] observed that the support (the set of non-zero elements) of the coordinates after fewer than ten steps of CD is already very accurate. Moreover, the support of the sparse code is usually more important than its exact values. Since the original sparse coding problem is non-convex anyway, we do not need to run CD to final convergence and instead update the sparse code with only a few steps of CD. For the $k$-th epoch, we denote the updated sparse code as $z^{k}$; it is used as the initial sparse code for the $(k+1)$-th epoch.
Update via one or a few steps of coordinate descent:
Specifically, for from 1 to , we update the th coordinate of cyclicly as follows:
where is the soft thresholding shrinkage function [Combettes and Wajs2005]. We call above updating cycle as one step of CD. The updated sparse code is then denoted by .
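A few steps of cyclic coordinate descent with soft thresholding can be sketched as follows for a single input vector. The lasso objective used here, $\frac{1}{2}\|x - Dz\|_2^2 + \beta\|z\|_1$, as well as the names `x`, `D`, `z`, `beta` and the sizes, are assumptions for illustration; the shift operators of DirNet are omitted.

```python
import numpy as np

def soft_threshold(v, beta):
    """Shrink v toward zero by beta (proximal operator of beta * |v|)."""
    return np.sign(v) * np.maximum(np.abs(v) - beta, 0.0)

def cd_step(x, D, z, beta):
    """One step of CD: a single cyclic pass over all coordinates of z."""
    z = z.copy()
    residual = x - D @ z
    for j in range(D.shape[1]):
        d_j = D[:, j]
        sq_norm = d_j @ d_j
        # Correlation of atom j with the residual that excludes its own term.
        rho = d_j @ residual + sq_norm * z[j]
        z_new = soft_threshold(rho, beta) / sq_norm
        residual += d_j * (z[j] - z_new)   # keep the residual up to date
        z[j] = z_new
    return z

def lasso_objective(x, D, z, beta):
    return 0.5 * np.sum((x - D @ z) ** 2) + beta * np.sum(np.abs(z))

rng = np.random.default_rng(0)
D = rng.standard_normal((30, 12))
D /= np.linalg.norm(D, axis=0)
x = rng.standard_normal(30)

z = np.zeros(12)                            # sparse code starts at zero
objectives = [lasso_objective(x, D, z, beta=0.1)]
for _ in range(3):                          # only a few CD steps
    z = cd_step(x, D, z, beta=0.1)          # warm start from the previous code
    objectives.append(lasso_objective(x, D, z, beta=0.1))
print(objectives)
```

Each coordinate update minimizes the objective exactly in that coordinate, so the objective cannot increase from one step to the next.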
Dictionary Update For updating the dictionary, we use block coordinate descent [Tseng2001] for updating each atom .
This can be solved in two steps: solving the unconstrained problem by differentiation, followed by normalization. The result can be summarized by
As in [Mairal et al.2009], we found that one update loop through all of the atoms was enough to ensure fast convergence of Algorithm 1. The only difference of this update compared to common dictionary learning is the shift operator . In Eq. (10), . If the shift operator is a non-circular operators, the inverse needs to be replaced by the adjoint and the rescaling function needs to be applied to the update term.
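For reference, one loop of block coordinate descent over the atoms can be sketched as below, in the style of the online dictionary learning update of [Mairal et al.2009]. The shift operators are deliberately left out, so this is ordinary dictionary learning rather than the DirNet update, and the names `W`, `D`, `Z` and the sizes are illustrative assumptions.

```python
import numpy as np

def update_dictionary(W, D, Z, eps=1e-12):
    """One pass of block coordinate descent over all atoms of D."""
    D = D.copy()
    A = Z @ Z.T          # (r x r) code correlations
    B = W @ Z.T          # (m x r) data-code correlations
    for j in range(D.shape[1]):
        if A[j, j] < eps:
            continue     # atom j is unused by the current codes
        # Unconstrained minimizer for atom j, obtained by differentiation.
        u = (B[:, j] - D @ A[:, j]) / A[j, j] + D[:, j]
        # Rescale so that the atom norm does not exceed one.
        D[:, j] = u / max(np.linalg.norm(u), 1.0)
    return D

rng = np.random.default_rng(0)
W = rng.standard_normal((20, 50))
D = rng.standard_normal((20, 8))
D /= np.linalg.norm(D, axis=0)
Z = rng.standard_normal((8, 50)) * (rng.random((8, 50)) < 0.3)

err_before = np.linalg.norm(W - D @ Z)
D_new = update_dictionary(W, D, Z)
err_after = np.linalg.norm(W - D_new @ Z)
print(err_before, err_after)
```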
We initialize the dictionaries for the different layers in DirNet by random selection, i.e., by randomly selecting samples from the weight matrices to construct the initial dictionaries. We then set all sparse codes to zero at the beginning.
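A possible reading of this initialization is sketched below: atoms are drawn as randomly chosen columns of the weight matrix and the codes start at zero. Selecting columns (rather than rows), normalizing the atoms to unit norm, and the helper name `init_dictionary` are assumptions.

```python
import numpy as np

def init_dictionary(W, n_atoms, rng):
    """Build an initial dictionary from randomly selected columns of W."""
    idx = rng.choice(W.shape[1], size=n_atoms, replace=False)
    D = W[:, idx].astype(float)            # fancy indexing returns a copy
    norms = np.linalg.norm(D, axis=0)
    D /= np.maximum(norms, 1e-12)          # assumed: unit-norm atoms
    return D

rng = np.random.default_rng(0)
W = rng.standard_normal((20, 50))          # hypothetical weight matrix
D0 = init_dictionary(W, n_atoms=8, rng=rng)
Z0 = np.zeros((8, W.shape[1]))             # all sparse codes start at zero
print(D0.shape, Z0.shape)
```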
3.4 Adaptively Hierarchical Network
Considering the different functionalities of hierarchical hidden layers, we propose adaptively changing the sparsity in different hidden layers. Following [Zhang et al.2016], we use initial perturbations so that all features can be selected by competing with each other. We then gradually shrink the network by using stronger $\ell_1$-penalties, so that fewer features remain as the shrinking progresses. DirNet therefore goes through a sequence of self-adjusting stages before reaching the final optimum.
We note that sharing across the inter-layer and recurrent-layer matrices allows more efficient parameterization for the weight matrices. Besides, this does not result in a significant loss of performance. By adjusting the dimensions of the projection matrices () in each of the layers of the network, the compression rate of the model will be determined. Therefore, after fixing the dictionary to , solving Eq. 7 is equivalent to solve a LASSO problem [Tibshirani1996]. The objective function of solving as follow:
After we get dictionary from Eq. 5, we can determine as the solution to the following LASSO problem with adaptive weights to regularize the model coefficients along with different features:
where is different across layers and we selected the best value from to in this work. denotes regularization weight, we will receive different penalized matrices instead of controlling the sparsity by as in Eq. (11). In addition, and , where is the ordinal least-square solution.
Therefore, we can consider the following linear regression problem with the inter weight matrix, where is the sample size and is the dimension of the target response vector. We use to represent . Then, we use an adaptive weight vector to regularize over different covariates, as
where and we alternatively optimize and in the learning process.
Suppose we initialize , with , denotes -th layer. Then, we alternatively update and in Eq. (13) under this equality norm constraint until convergence. We will start the second stage of iterations with an updated norm constraint after the initial stage, which imposes a stronger penalty. Then we alternatively update and until the second stage ends. We keep strengthening the global -norm regularization stage by stage during the updating procedure. We use to denote the index of each stage and is the compose of iteration.
To solve Eq. (13), we first fix and solve , which can be computationally converted to a LASSO problem:
Then, when we fix update , the problem becomes the following constrained optimization problem:
We used the Lagrangian of Eq. 15, and drop the non-negativity constraint. Let . Then the Lagrangian can be written as
By setting , we have
Plugging the above relation in the constraint , then we have
Finally, we plug the above equation into Eq. 16 and get
Since and , the solution will satisfy the non-negative constraints automatically. After training the original RNN model and retrieving the weight matrices of both the inter-layer and the recurrent layer, we apply the proposed algorithm to learn the new matrices of each layer, which are used to initialize the parameters of the layers in the new network structure (Fig. 1 (b)). After this initialization, we fine-tune the neural network as in Han et al. [Han et al.2015a]; sparsity is preserved by updating only the non-zero elements of the sparse matrix. A fine-tuning step is also required by other compression works, e.g., [Prabhavalkar et al.2016].
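Preserving sparsity during fine-tuning amounts to masking the gradient so that zero entries never receive an update. The sketch below shows one plain SGD step with such a mask; the optimizer, the learning rate and the names `Z`, `grad`, `mask` are illustrative assumptions.

```python
import numpy as np

def masked_sgd_step(Z, grad, mask, lr):
    """Update only the entries of Z that are non-zero according to mask."""
    return Z - lr * grad * mask

rng = np.random.default_rng(0)
Z = rng.standard_normal((4, 6)) * (rng.random((4, 6)) < 0.3)  # sparse matrix
mask = (Z != 0).astype(Z.dtype)       # 1 where Z is non-zero, 0 elsewhere
grad = rng.standard_normal(Z.shape)   # stand-in for a back-propagated gradient

Z_new = masked_sgd_step(Z, grad, mask, lr=0.01)
print(np.count_nonzero(Z), np.count_nonzero(Z_new))
```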
3.5 Extended Model Compression
We extend Sec. 3.3 and Sec. 3.4 from standard RNNs to LSTMs. In an LSTM, the recurrent weight matrix and the inter-layer matrix are each a concatenation of four gate matrices, corresponding to the input gate, output gate, forget gate and input modulation gate. We stack them vertically.
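Stacking the four gate matrices vertically turns the LSTM case into a single matrix that can be compressed like an ordinary RNN weight matrix. A minimal sketch, with hypothetical gate matrices `W_i`, `W_o`, `W_f`, `W_g` and illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_in = 4, 3   # gate dimension and input dimension (illustrative)

# One matrix per gate: input, output, forget, input modulation.
W_i, W_o, W_f, W_g = (rng.standard_normal((n, n_in)) for _ in range(4))

# Vertical stacking gives a single (4n x n_in) matrix.
W_stacked = np.vstack([W_i, W_o, W_f, W_g])

# Each gate can be recovered as a block of n consecutive rows.
W_f_recovered = W_stacked[2 * n:3 * n]
print(W_stacked.shape)
```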
Moreover, compressing a single-layer LSTM model is a special case of the proposed DirNet: (1) we can still learn the shared dictionary from the inter-layer and recurrent weight matrices of the single LSTM/RNN layer, as in Fig. 1 (b), to achieve a high compression rate on a single-layer LSTM model; (2) compressing multiple LSTM/RNN layers, where the sparsity of the sparse codes can be adapted using cross-layer information, achieves a higher compression rate than compressing a single-layer LSTM/RNN model.
4 Experimental Results and Discussion
The proposed approach is a general algorithm for compressing recurrent neural networks, including vanilla RNNs, LSTMs and GRUs, and can thus be applied directly to many problems, e.g., language modeling (LM), speech recognition, machine translation and image captioning. In this paper, we select two popular domains: LM and speech recognition. We compare DirNet with LSTM-SVD and LSTM-ODL. LSTM-SVD is the method proposed by Prabhavalkar et al. [Prabhavalkar et al.2016], which compresses RNNs by low-rank SVD. LSTM-ODL compresses RNNs by online dictionary learning [Mairal et al.2009]. Training DirNet takes 10 hours on PTB with one Nvidia K80 GPU and around 120 hours on LibriSpeech with 16 Nvidia K80 GPUs.
4.1 Language Modeling
We conduct word-level prediction on the Penn Treebank (PTB) dataset [Marcus et al.1993], which consists of 929k training words, 73k validation words and 82k test words. We first train a two-layer unrolled LSTM model with 650 units per layer, using the same network setting as LSTM-medium in [Zaremba et al.2014]. For comparison, the same model is compressed using LSTM-SVD, LSTM-ODL and DirNet. We report results on the test data in Table 1. DirNet achieves superior compression and speedup rates while maintaining accuracy. The reason is that dynamically adjusting the projection matrices achieves a better compression rate, and adapting different sparsities across layers keeps the accuracy loss negligible.
4.2 Speech Recognition
Dataset: The LibriSpeech corpus is a large (1000-hour) corpus of read English speech derived from audiobooks in the LibriVox project, sampled at 16 kHz. The accents are varied and not marked, but the majority are US English. It is publicly available at http://www.openslr.org/12/ [Panayotov et al.2015]. LibriSpeech comes with its own train, validation and test sets, and we use all the available data for training and validating our models. We use the 100-hour “test clean” set as the test set. Mel-frequency cepstral coefficient (MFCC) [Davis and Mermelstein1990] features are computed with 26 coefficients, a 25 ms sliding window and a 10 ms stride. We use nine time slices before and nine after each frame, for a total of 19 time points per window. As a result, with 26 cepstral coefficients, there are 494 data points per 25 ms observation.
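The 494-dimensional input follows from stacking 19 frames of 26 coefficients. The sketch below builds such context windows from a matrix of per-frame features; zero padding at the utterance boundaries and the helper name `stack_context` are assumptions, since the boundary handling is not specified above.

```python
import numpy as np

def stack_context(features, left=9, right=9):
    """Concatenate each frame with `left` preceding and `right` following
    frames. Boundaries are zero-padded (an assumption)."""
    n_frames, _ = features.shape
    padded = np.pad(features, ((left, right), (0, 0)), mode="constant")
    width = left + right + 1
    windows = [padded[t:t + width].reshape(-1) for t in range(n_frames)]
    return np.stack(windows)

rng = np.random.default_rng(0)
mfcc = rng.standard_normal((100, 26))   # 100 frames of 26 coefficients
stacked = stack_context(mfcc)
print(stacked.shape)                    # (100, 494), since 19 * 26 = 494
```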
Baseline Model: Following [Prabhavalkar et al.2016], our baseline model for speech recognition is a 5-layer RNN model with the Connectionist Temporal Classification (CTC) loss function [Graves et al.2006], which predicts 41 context-independent (CI) phonemes [Sak et al.2015]. Each hidden RNN layer is composed of 500 LSTM units. We train all models for 100 epochs using the Adam optimizer [Kingma and Ba2014] with an initial learning rate of 0.001. DirNet and the baseline model take around 35 and 55 epochs to converge, respectively. The batch size is 32. The model is first trained to convergence on the CTC criterion and is then sequence-discriminatively trained to optimize the state-level minimum Bayes risk (sMBR) criterion [Kingsbury2009]. We implement DirNet in TensorFlow [Abadi et al.2016] and use a Samsung Galaxy S8 smartphone as the mobile platform to evaluate performance.
Results: To give a comprehensive evaluation of the proposed approach, we list the word error rate (WER), the number of parameters and the execution time of the different approaches in Table 2. There are two sets of experiments: 1) comparing the maximum compression rate obtained by the different approaches without an accuracy drop, and 2) measuring the WER of the different approaches at the same compression rate.
As indicated in the first row of Table 2, the basic LSTM model achieves a WER of 12.7% before any compression. It has 10.3M parameters in total and takes 1.77 s to recognize 1 s of audio. DirNet can compress the model by 4.3 times without losing any accuracy. DirNet achieves a 4.2X speedup, compared with 1.4X and 2.6X for LSTM-SVD and LSTM-ODL, respectively. These results indicate the superiority of DirNet over current compression algorithms. Moreover, such a significant improvement suggests that dynamically adjusting the sparsity across layers helps compressed models reach higher compression rates. From the second row of Table 2, we find that when the model size is compressed by 7.9 times, LSTM-ODL and LSTM-SVD lose considerable accuracy. The loss is largest for the SVD-based algorithm, at 3.4% (16.1% - 12.7%), compared with 0.2% for DirNet.
4.3 Adaptively Shrinking Parameter Selection
In this section, we study how the performance of our approach is affected by two parameters: the parameter that controls the initial “sparsity” of the system, and the shrinking factor that controls the compression rate of the system. We use the F-score on the LibriSpeech dataset to measure performance.
First, we examine the performance of choosing different values. The range is from to , and the result is shown on the left side of the Fig. 3. We notice that when is below , the performance quickly drops and the lower initial sparsity even fails to start the whole system with sufficient energy. As a result the iterations could quickly stop at a local optimal. Therefore, we choose the best performance in all experimental settings.
Second, we study the influence of varying shrinking factor . The explored value range is from to . We observe that the performance sharply increases first and becomes stable afterward. Besides, the system evolves slowly such that the shrinking stage is sufficient when the shrinking scheme . In practice, we choose to strike a balance between efficiency and the quality of shrinking procedure.
4.4 Adaptive Hierarchy of the Network
We also compare the adaptive cross-layer compression capability of DirNet (O) with that of LSTM-SVD (G). All results are listed in Table 3. DirNet achieves a much lower WER than LSTM-SVD even with the same number of neurons. These results further demonstrate the advantage of DirNet’s adaptive adjustment of sparse features in deep model compression.
| Neurons of each layer | Params | WER (%), LSTM-SVD (G) | WER (%), DirNet (O) |
|---|---|---|---|
| 500, 500, 500, 500, 500 | 10.3M | 12.7 | 12.7 |
| 350, 375, 395, 405, 410 | 8.6M | 12.3 | 12.3 |
| 270, 305, 335, 345, 350 | 7.2M | 12.5 | 12.3 |
| 175, 215, 245, 260, 265 | 5.4M | 12.5 | 12.4 |
| 120, 150, 180, 195, 200 | 4.1M | 12.6 | 12.5 |
| 80, 105, 130, 145, 150 | 3.1M | 12.9 | 12.5 |
| 50, 70, 90, 100, 110 | 2.3M | 13.2 | 12.7 |
| 30, 45, 55, 65, 75 | 1.7M | 14.4 | 12.9 |
| 25, 35, 45, 50, 55 | 1.3M | 16.6 | 12.9 |
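The parameter counts in the table translate directly into compression factors relative to the uncompressed 10.3M-parameter model. The short sketch below computes them, together with the WER gap between the two methods; the variable names are illustrative and the numbers are copied from the table.

```python
# Parameter counts (millions) and WER (%) copied from the table above.
params_millions = [10.3, 8.6, 7.2, 5.4, 4.1, 3.1, 2.3, 1.7, 1.3]
wer_svd = [12.7, 12.3, 12.5, 12.5, 12.6, 12.9, 13.2, 14.4, 16.6]
wer_dirnet = [12.7, 12.3, 12.3, 12.4, 12.5, 12.5, 12.7, 12.9, 12.9]

baseline = params_millions[0]
factors = [round(baseline / p, 1) for p in params_millions]
wer_gap = [round(g - o, 1) for g, o in zip(wer_svd, wer_dirnet)]

for p, f, gap in zip(params_millions, factors, wer_gap):
    print(f"{p:>5.1f}M  {f:>4.1f}X  WER gap {gap:.1f}")
```

The smallest configuration corresponds to a factor of about 7.9X, where the gap between the two methods is also the largest.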
In this paper, we introduce DirNet, which dynamically adjusts the compression rate of each layer in the network and adaptively changes the hierarchical structure of the weight matrices across layers. Experimental results show that, compared with other RNN compression methods, DirNet significantly improves performance before retraining. In ongoing work, we will integrate scalar-level compression with DirNet to further compress deep models.
- [Abadi et al.2016] Martín Abadi, Paul Barham, et al. Tensorflow: A system for large-scale machine learning. In OSDI, pages 265–283, GA, 2016. USENIX Association.
- [Chen et al.2017] Yuntao Chen, Naiyan Wang, and Zhaoxiang Zhang. Darkrank: Accelerating deep metric learning via cross sample similarities transfer. arXiv preprint arXiv:1707.01220, 2017.
- [Cho et al.2014] Kyunghyun Cho, Bart Van Merriënboer, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation. abs/1406.1078, 2014.
- [Combettes and Wajs2005] Patrick L Combettes and Valérie R Wajs. Signal recovery by proximal forward-backward splitting. Multiscale Modeling & Simulation, 4(4):1168–1200, 2005.
- [Davis and Mermelstein1990] Steven B Davis and Paul Mermelstein. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. In Readings in speech recognition, pages 65–74. Elsevier, 1990.
- [Denil et al.2013] Misha Denil, Babak Shakibi, et al. Predicting parameters in deep learning. In NIPS, pages 2148–2156, 2013.
- [Friedman et al.2007] Jerome Friedman, Trevor Hastie, et al. Pathwise coordinate optimization. The Annals of Applied Statistics, 1(2):302–332, 2007.
- [Geng et al.2017] Yanyan Geng, Guohui Zhang, Weizhi Li, Yi Gu, Gaoyuan Liang, Jingbin Wang, Yanbin Wu, Nitin Patil, and Jing-Yan Wang. A novel image tag completion method based on convolutional neural network. In International Conference on Artificial Neural Networks, pages 539–546. Springer, 2017.
- [Graves et al.2006] Alex Graves, Santiago Fernández, et al. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In ICML, pages 369–376. ACM, 2006.
- [Graves et al.2013] Alex Graves, Abdel-rahman Mohamed, et al. Speech recognition with deep recurrent neural networks. In ICASSP, pages 6645–6649. IEEE, 2013.
- [Han et al.2015a] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.
- [Han et al.2015b] Song Han, Jeff Pool, et al. Learning both weights and connections for efficient neural network. In NIPS, pages 1135–1143, 2015.
- [Han et al.2017] Song Han, Junlong Kang, Huizi Mao, Yiming Hu, Xin Li, Yubin Li, Dongliang Xie, Hong Luo, Song Yao, Yu Wang, et al. Ese: Efficient speech recognition engine with sparse lstm on fpga. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 75–84. ACM, 2017.
- [Hinton et al.2015] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
- [Hitziger et al.2013] Sebastian Hitziger, Maureen Clerc, et al. Jitter-adaptive dictionary learning-application to multi-trial neuroelectric signals. arXiv preprint arXiv:1301.3611, 2013.
- [Kim and Rush2016] Yoon Kim and Alexander M Rush. Sequence-level knowledge distillation. arXiv preprint arXiv:1606.07947, 2016.
- [Kingma and Ba2014] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
- [Kingsbury2009] Brian Kingsbury. Lattice-based optimization of sequence classification criteria for neural-network acoustic modeling. In ICASSP, pages 3761–3764, 2009.
- [Lei et al.2013] Xin Lei, Andrew W Senior, et al. Accurate and compact large vocabulary speech recognition on mobile devices. In INTERSPEECH. Citeseer, 2013.
- [Li et al.2017] Dawei Li, Xiaolong Wang, and Deguang Kong. Deeprebirth: Accelerating deep neural network execution on mobile devices. arXiv preprint arXiv:1708.04728, 2017.
- [Lin et al.2014] Binbin Lin, Qingyang Li, et al. Stochastic coordinate coding and its application for drosophila gene expression pattern annotation. arXiv:1407.8147, 2014.
- [Mairal et al.2009] Julien Mairal, Francis Bach, Jean Ponce, and Guillermo Sapiro. Online dictionary learning for sparse coding. In ICML, pages 689–696. ACM, 2009.
- [Marcus et al.1993] Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a large annotated corpus of english: The penn treebank. Computational linguistics, 19(2):313–330, 1993.
- [Mikolov et al.2011] Tomàš Mikolov, S. Kombrink, et al. Extensions of recurrent neural network language model. In ICASSP, pages 5528–5531, May 2011.
- [Narang et al.2017] Sharan Narang, Gregory Diamos, Shubho Sengupta, and Erich Elsen. Exploring sparsity in recurrent neural networks. ICLR, 2017.
- [Panayotov et al.2015] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an asr corpus based on public domain audio books. In ICASSP, pages 5206–5210. IEEE, 2015.
- [Prabhavalkar et al.2016] Rohit Prabhavalkar, Ouais Alsharif, et al. On the compression of recurrent neural networks with an application to lvcsr acoustic modeling for embedded speech recognition. In ICASSP, pages 5970–5974. IEEE, 2016.
- [Sainath et al.2013] Tara N Sainath, Brian Kingsbury, et al. Low-rank matrix factorization for deep neural network training with high-dimensional output targets. In ICASSP, pages 6655–6659, 2013.
- [Sak et al.2014] Hasim Sak, Andrew Senior, et al. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In INTERSPEECH, 2014.
- [Sak et al.2015] Haşim Sak, Andrew Senior, et al. Learning acoustic frame labeling for speech recognition with recurrent neural networks. In ICASSP, pages 4280–4284. IEEE, 2015.
- [Tibshirani1996] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.
- [Tseng2001] Paul Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of optimization theory and applications, 109(3):475–494, 2001.
- [Vinyals et al.2017] Oriol Vinyals, Alexander Toshev, et al. Show and tell: Lessons learned from the 2015 mscoco image captioning challenge. IEEE transactions on pattern analysis and machine intelligence, 39(4):652–663, 2017.
- [Xu et al.2018] Chen Xu, Jianqiang Yao, et al. Alternating multi-bit quantization for recurrent neural networks. In ICLR, 2018.
- [Xue et al.2013] Jian Xue, Jinyu Li, and Yifan Gong. Restructuring of deep neural network acoustic models with singular value decomposition. In INTERSPEECH, pages 2365–2369, 2013.
- [Zaremba et al.2014] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014.
- [Zhang et al.2016] Kai Zhang, Shandian Zhe, et al. Annealed sparsity via adaptive and dynamic shrinking. In KDD, pages 1325–1334, 2016.