Dynamically Hierarchy Revolution: DirNet for Compressing Recurrent Neural Network on Mobile Devices

06/04/2018
by   Jie Zhang, et al.
Arizona State University
SAMSUNG

Recurrent neural networks (RNNs) achieve cutting-edge performance on a variety of problems. However, due to their high computational and memory demands, deploying RNNs on resource-constrained mobile devices is a challenging task. To keep accuracy loss to a minimum at a higher compression rate, and driven by mobile resource requirements, we introduce DirNet, a novel model compression approach based on an optimized fast dictionary learning algorithm, which 1) dynamically mines the dictionary atoms of the projection dictionary matrix within each layer to adjust the compression rate, and 2) adaptively changes the sparsity of the sparse codes across the hierarchical layers. Experimental results on a language model and an ASR model trained with a 1000h speech dataset demonstrate that our method significantly outperforms prior approaches. Evaluated on off-the-shelf mobile devices, we are able to reduce the size of the original model by eight times with real-time model inference and negligible accuracy loss.

1 Introduction

Deep learning is becoming the dominant force behind recent breakthroughs in artificial intelligence [Geng et al.2017]. Recurrent Neural Networks (RNNs) have gained wide attention thanks to dramatic performance improvements in sequential data modeling, e.g., automatic speech recognition (ASR) [Sak et al.2014], language modeling [Mikolov et al.2011], image captioning [Vinyals et al.2017], and neural machine translation [Cho et al.2014]. Owing to the success of RNN-based ASR [Lei et al.2013], personal assistant systems (e.g., Amazon’s Alexa, Apple’s Siri, Google Now, Samsung’s Bixby, Microsoft’s Cortana) have become a standard feature of smartphones.

Generally, trained deep learning models are deployed in the cloud, which requires a reliable Internet connection and may compromise user privacy. Therefore, there is a high demand for deploying such RNNs, with their millions of parameters, on mobile devices. Deep model compression techniques have been proposed to solve the mobile deployment problem. For RNN compression specifically, a pruning-based method [Narang et al.2017] was introduced that progressively prunes away small parameters using a monotonically increasing threshold during training. Another representative method applies a low-rank singular value decomposition (SVD) to the RNN weight matrices [Prabhavalkar et al.2016]. Although good results for compressing RNNs have been reported, these works do not consider the hierarchical changes of the weight matrices. In addition, none of the existing methods can dynamically adjust the compression rate according to the requirements of the target mobile device.

In this work, we propose dynamically hierarchy revolution (DirNet) to address these problems in existing RNN compression methods (Fig. 1). To compress the model further while accounting for the different degrees of redundancy among layers, we dynamically mine dictionary atoms from the original network structure, without manually setting the compression rate of each layer. Our approach achieves a significant improvement in model compression over previous approaches. It also converges much faster during re-training and loses less accuracy than the state-of-the-art. In addition, considering the different functionalities of the hierarchical hidden layers, we adaptively adjust the sparsity of the different hidden layers in RNNs. This yields a more precise approximation of the original network structure while minimizing the performance drop.

Our most significant contributions are threefold. First, this is the first work to dynamically explore dictionary learning on the weight matrices of both RNNs and LSTMs, jointly compressing the hidden layer and the inter-layer to reconstruct the original weight matrices. Second, our approach is a dynamic model compression algorithm. It dynamically mines the dictionary atoms of the projection dictionary matrix within each layer to better learn a common codebook representation across the inter-layer and the recurrent layer. This finds the optimal number of neurons in each layer and gives better control over the degree of compression with negligible accuracy loss. Third, given a hierarchical RNN structure, our approach adaptively sets a different sparsity of the sparse codes for each layer to minimize the performance drop of the compressed network. The experimental results demonstrate that DirNet achieves a significant improvement in compression scale and speedup rate over state-of-the-art approaches.

Figure 1: (a) The initial model is compressed by (b) jointly and dynamically adjusting the sparsity of the recurrent and inter-layer matrices, using a shared projection dictionary.

2 Related Work

In general, existing RNN compression methods fall into three categories: pruning, low-rank approximation, and knowledge distillation. The first pruning work was proposed by Han et al. [2015], where a hard threshold was applied as the pruning criterion to compress the deep neural network. Li et al. [2017] discussed the optimization of non-tensor layers, such as pooling and normalization layers without tensor-like trainable parameters, for deep model compression. Xue et al. [2013] used low-rank approximation to reduce parameter dimensions in order to save storage and reduce time complexity. Denil et al. [2013] showed that a given neural network can be represented by a small number of parameters. Sainath et al. [2013] reduced the number of model parameters by applying low-rank matrix factorization to simplify the neural network architecture.

Hinton et al. [2015] trained a small student network to mimic the original teacher network by minimizing the loss between the student and teacher outputs. Kim et al. [2016] integrated knowledge distillation to approximately match the sequence-level distribution of the teacher network. Chen et al. [2017] introduced knowledge of cross-sample similarities for model compression and acceleration.

In addition, there are other compression methods. Low-precision quantization [Xu et al.2018] is a scalar-level compression method that does not consider the relations among the learned parameters within layers. Compared to these works, our approach dynamically adjusts the compression rate across layers and exploits the sparsity of the weight matrices to compress RNNs. It achieves a better compression rate with less accuracy loss than the above methods.

Recently, Han et al. [2017] used a pruning method to compress an LSTM for speech recognition, reducing the parameters of the weight matrix to 10% and achieving a 5X compression rate (each non-zero element is represented by its value and its index in the matrix). In contrast, DirNet considers the sparsity of the various matrices and dynamically adjusts it based on the characteristics of the model, which yields a better compression rate. Our approach can compress the model by around 8X with negligible performance loss.

3 Methods

In this section, we first introduce the background of RNNs. Then, we present the basic sparse coding based RNN compression method without considering the mobile requirement and the hierarchical network structure. Further, we present the proposed approach DirNet.

3.1 RNN Background

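Let $t$ denote the time steps and $l$ denote the hidden layers of an RNN. At time $t$, $h_t^l \in \mathbb{R}^{N^l}$ denotes the activations of the $l$-th hidden layer with $N^l$ nodes. The inputs to this layer at time $t$ are therefore denoted by $h_t^{l-1}$. The output activations of the $l$-th and $(l+1)$-th layers in a classical RNN are then defined as:

$$h_t^{l} = \sigma\left(W_x^{l-1} h_t^{l-1} + W_h^{l} h_{t-1}^{l} + b^{l}\right) \quad (1)$$

$$h_t^{l+1} = \sigma\left(W_x^{l} h_t^{l} + W_h^{l+1} h_{t-1}^{l+1} + b^{l+1}\right) \quad (2)$$

where $\sigma(\cdot)$ denotes a non-linear activation function, $b^{l}$ and $b^{l+1}$ represent bias vectors, and $W_x$ and $W_h$ denote the inter-layer and recurrent weight matrices, respectively.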
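As a rough illustration of the recurrence just described, the NumPy sketch below computes one time step of a single recurrent layer from the activations of the layer below and the layer's own previous activations. The names `W_x`, `W_h`, `b`, the choice of `tanh` as the non-linearity, and the toy dimensions are assumptions made for this example only.

```python
import numpy as np

def rnn_layer_step(h_below, h_prev, W_x, W_h, b, activation=np.tanh):
    """One time step of a recurrent layer: act(W_x @ h_below + W_h @ h_prev + b)."""
    return activation(W_x @ h_below + W_h @ h_prev + b)

rng = np.random.default_rng(0)
n_in, n_hid = 4, 3                       # toy sizes, chosen for illustration
W_x = 0.1 * rng.standard_normal((n_hid, n_in))   # inter-layer weights
W_h = 0.1 * rng.standard_normal((n_hid, n_hid))  # recurrent weights
b = np.zeros(n_hid)

x_t = rng.standard_normal(n_in)          # activations of the layer below at time t
h_prev = np.zeros(n_hid)                 # this layer's activations at time t-1
h_t = rnn_layer_step(x_t, h_prev, W_x, W_h, b)
```

Stacking layers amounts to feeding `h_t` of one layer as `h_below` of the next, which is why the inter-layer and recurrent matrices of adjacent layers can be considered jointly for compression.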

Different from traditional RNNs, LSTM contains specialized cells called memory cell in the recurrent hidden layer. Empirically, LSTM has a memory cell for storing information for long periods of time, we use to denote the long-term memory. LSTM can decide to overwrite the memory cell, retrieve it, or keep it for the next time step [Zaremba et al.2014]. In this paper, our LSTM architecture is the same as [Graves et al.2013]. The general structure is illustrated in Figure 2.

Figure 2: A graphical representation of LSTM memory cells used in this work.
(3)

where and are input gate, output gate, forget gate and input modulation gate, respectively. And denotes the memory cells and all gates are the same size as the memory cells. is the element-wise product of the vectors. and denote the dimension of one gate and four gate, respectively.
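The gate structure described above can be sketched as follows. This is a generic LSTM step without peephole connections, in the style of [Zaremba et al.2014], not necessarily the exact parameterization of Eq. (3). The ordering of the four gates inside the stacked weight matrices and all variable names are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(h_below, h_prev, c_prev, W_x, W_h, b):
    """One LSTM step. W_x is (4n, m) and W_h is (4n, n): four gates stacked vertically."""
    n = h_prev.shape[0]
    pre = W_x @ h_below + W_h @ h_prev + b   # pre-activations of all four gates, shape (4n,)
    i = sigmoid(pre[0:n])                    # input gate
    f = sigmoid(pre[n:2 * n])                # forget gate
    o = sigmoid(pre[2 * n:3 * n])            # output gate
    g = np.tanh(pre[3 * n:4 * n])            # input modulation gate
    c = f * c_prev + i * g                   # memory cell update (element-wise products)
    h = o * np.tanh(c)                       # new hidden activations
    return h, c

n, m = 3, 4
h, c = lstm_step(np.ones(m), np.zeros(n), np.ones(n),
                 np.zeros((4 * n, m)), np.zeros((4 * n, n)), np.zeros(4 * n))
```

With all weights set to zero, every sigmoid gate equals 0.5 and the modulation gate equals 0, so the cell simply halves its previous memory. The stacked `(4n, ·)` matrices are what a compression method operates on.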

3.2 Sparse Coding based Model Compression

As indicated above, one direct way to compress RNN is to sparsify weight matrix and . Given a weight matrix , each . Sparse coding aims to learn a dictionary where and a sparse code matrix . The original matrix is modeled by a sparse linear combination of and as . We can formulate the following optimization problem:

(4)

where , denotes the th column of and . is the positive regularization parameter and denotes the Frobenius norm of the matrix.

For RNNs, we can compress simultaneously the weight matrices of both recurrent layer and inter-layer (from Eq. (1) or Eq. (2)). Such joint compression can be achieved by a projection matrix [Sak et al.2014], denoted by (), such that, and . Therefore, we can incorporate this idea into Eq. (1) and Eq. (2), then we get the followings:

(5)
(6)
(7)
(8)

where and . Therefore, we can calculate the compression rate by

However, a problem with this compression model is that the compression rate and the sparsity are set manually, which limits the compression space. It would be more efficient and promising to set the compression rate and sparsity automatically, based on an understanding of the deep neural network’s structure. We should also minimize the training time spent during model compression, which may take weeks or even months, especially for RNN-based sequential models. In this paper, we propose DirNet, which dynamically adjusts the sparsity of Eq. (5) and Eq. (7) to meet this requirement.
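To make the trade-off between projection size and sparsity concrete, the sketch below counts parameters for one layer under a simplifying assumption: the recurrent and inter-layer matrices are each written as a code matrix times one shared projection matrix with `r` rows, and only a fraction `density` of the code entries is non-zero. The function name, the layer shapes and the counting convention (index storage for sparse entries is ignored) are hypothetical and are not the paper's compression-rate formula.

```python
def layer_compression_rate(n_hidden, n_next, r, density):
    """Dense parameter count divided by (shared projection + non-zero code entries)."""
    dense = n_hidden * n_hidden + n_next * n_hidden      # recurrent + inter-layer matrices
    projection = r * n_hidden                            # shared projection matrix
    codes = density * (n_hidden * r + n_next * r)        # non-zero entries of both code matrices
    return dense / (projection + codes)

low_rank_only = layer_compression_rate(500, 500, 100, density=1.0)
with_sparsity = layer_compression_rate(500, 500, 100, density=0.2)
```

Under these assumptions, lowering either `r` or `density` raises the rate, which is why both are treated as quantities to be tuned per layer rather than fixed by hand.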

input : Inter-layer matrix and recurrent matrix, shift operators
output : Compressed weight matrices and projection matrix
begin
    for each layer do
        Initialize the dictionary and the sparse codes
        repeat
            Update sparse code
            Update dictionary
        until convergence
        while model is not empty do
            repeat
                Fix the adaptive weights, update the sparse codes according to Eq. 14
                Fix the sparse codes, update the adaptive weights according to Eq. 15
            until convergence
    return the compressed weight matrices and the projection matrix
Algorithm 1 DirNet

DirNet is composed of two main steps, depicted graphically in Fig. 1. In Step 1, dynamically adaptive dictionary learning (Sec. 3.3) adapts the atoms of the dictionary shared by the inter-layer and the recurrent layer to adjust the sparsity (and hence the compression rate and accuracy) within the layer (Eq. (5)). In Step 2, an adaptive sparsity learning approach that takes the network’s hierarchical structure into account (Sec. 3.4) optimizes the sparse codes and the sparsity parameter hierarchically with an appropriate sparsity degree. The sparsity (compression rate and accuracy) achieved by each layer depends on its number of neurons and its architecture. In general, the non-zero values in the sparse code matrices amount to ten to twenty percent of the original weight matrices in our work. The proposed approach thus adaptively sets a different sparsity for each level of the hierarchical RNN architecture in Eq. 7, instead of using a fixed fraction of the explained variance as advocated by Prabhavalkar et al. [2016]. We summarize the key steps of DirNet in Algorithm 1.

3.3 Dynamically Adaptive Dictionary Learning

It is an essential step to dynamically adjust the dimension of the projection matrices to get the optimal compression rate. We introduce a shift operation on atoms to better adjust the compression rate. Given a set of shift operations which contains only small shifts relative to the size of the (time window), for every there exist coefficients and shift operators  [Hitziger et al.2013], such that

Now, we can formulate the dynamically adaptive dictionary learning problem as follows:

(9)

The problem becomes Eq. 4 when , thus we use alternate minimization to solve it.

Sparse Codes Update It is known that updating the sparse code is the most time-consuming part [Mairal et al.2009]. One of the state-of-the-art methods for solving such a lasso problem is coordinate descent (CD) [Friedman et al.2007]. Given an input vector, CD initializes the sparse code and then updates it many times via matrix-vector multiplication and thresholding. However, the iteration can take thousands of steps to converge. Lin et al. [2014] observed that the support (the set of non-zero elements) of the coordinates is already very accurate after fewer than ten steps of CD. Besides, the support of the sparse code is usually more important than its exact values. Since the overall sparse coding problem is non-convex, there is no need to run CD to final convergence, so we update the sparse code using only a few CD steps. The sparse code obtained at the end of one epoch is used as the initial sparse code for the next epoch.

Update via one or a few steps of coordinate descent:

Specifically, for from 1 to , we update the th coordinate of cyclicly as follows:

where is the soft thresholding shrinkage function [Combettes and Wajs2005]. We call above updating cycle as one step of CD. The updated sparse code is then denoted by .
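The update cycle described above can be sketched in NumPy as follows. This is a generic cyclic coordinate descent for the lasso objective `0.5 * ||x - D z||^2 + lam * ||z||_1`, run for a small fixed number of steps and warm-started from a previous code. The function and variable names are assumptions, and the shift operators of Sec. 3.3 are not modeled.

```python
import numpy as np

def soft_threshold(v, lam):
    """Soft thresholding shrinkage: sign(v) * max(|v| - lam, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def cd_steps(x, D, z, lam, n_steps=3):
    """Run a few cyclic coordinate-descent steps, starting from the code z."""
    z = z.astype(float).copy()
    residual = x - D @ z
    col_sq = np.sum(D * D, axis=0)                # squared norm of each atom
    for _ in range(n_steps):
        for j in range(D.shape[1]):
            if col_sq[j] == 0.0:
                continue                          # skip empty atoms
            residual += D[:, j] * z[j]            # remove atom j's current contribution
            rho = D[:, j] @ residual              # correlation with the partial residual
            z[j] = soft_threshold(rho, lam) / col_sq[j]
            residual -= D[:, j] * z[j]            # add the updated contribution back
    return z

D = np.eye(3)
x = np.array([3.0, -0.5, 1.5])
z = cd_steps(x, D, np.zeros(3), lam=1.0, n_steps=1)   # approximately [2.0, 0.0, 0.5]
```

Passing the code from the previous epoch as `z` gives the warm start mentioned in the text, so a handful of steps is usually enough to settle the support.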

Dictionary Update For updating the dictionary, we use block coordinate descent [Tseng2001] for updating each atom .

This can be solved in two steps, the solution of the unconstrained problem by differentiation followed by normalization and can be summarized by

(10)

As in [Mairal et al.2009], we found that one update loop through all of the atoms was enough to ensure fast convergence of Algorithm 1. The only difference of this update compared to common dictionary learning is the shift operator . In Eq. (10), . If the shift operator is a non-circular operators, the inverse needs to be replaced by the adjoint and the rescaling function needs to be applied to the update term.
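For reference, one pass of block coordinate descent over the atoms can be sketched as below. It follows the standard dictionary update of [Mairal et al.2009] and omits the shift operators, so it illustrates the "unconstrained solution followed by normalization" pattern rather than Eq. (10) itself. It also projects onto the unit ball instead of rescaling to exactly unit norm, which is a choice made for this sketch. All names are assumptions.

```python
import numpy as np

def update_dictionary(X, D, Z, eps=1e-12):
    """One loop over all atoms for 0.5 * ||X - D Z||_F^2 with ||d_j||_2 <= 1."""
    D = D.copy()
    A = Z @ Z.T                                   # (r, r) code statistics
    B = X @ Z.T                                   # (m, r) data-code correlations
    for j in range(D.shape[1]):
        if A[j, j] < eps:
            continue                              # atom j is unused by the current codes
        # Unconstrained minimizer with respect to atom j, all other atoms fixed.
        u = (B[:, j] - D @ A[:, j]) / A[j, j] + D[:, j]
        # Normalization step: project back onto the unit ball.
        D[:, j] = u / max(np.linalg.norm(u), 1.0)
    return D

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 20))
D0 = rng.standard_normal((5, 4))
D0 /= np.linalg.norm(D0, axis=0)                  # start from unit-norm atoms
Z = rng.standard_normal((4, 20)) * (rng.random((4, 20)) < 0.3)   # sparse codes
D1 = update_dictionary(X, D0, Z)
```

Because each atom update exactly minimizes the reconstruction error over that atom, a single loop never increases the error when the starting atoms already satisfy the norm constraint.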

Besides, we used a random selection method (randomly select samples from matrices to construct initial dictionaries ) to initialize the dictionaries for different layers in DirNet. Then, we set all the sparse codes to be zero in the beginning and epoch.

3.4 Adaptively Hierarchical Network

Considering the different functionalities of the hierarchical hidden layers, we propose to adaptively change the sparsity of the different hidden layers. Following [Zhang et al.2016], we start from an initial perturbation so that all features can be selected by competing with each other. We then gradually shrink the network by applying stronger $\ell_1$-penalties, so that fewer features remain as the shrinking progresses. DirNet therefore goes through a sequence of self-adjusting stages before reaching the final optimum.

We note that sharing across the inter-layer and recurrent-layer matrices allows more efficient parameterization for the weight matrices. Besides, this does not result in a significant loss of performance. By adjusting the dimensions of the projection matrices () in each of the layers of the network, the compression rate of the model will be determined. Therefore, after fixing the dictionary to , solving Eq. 7 is equivalent to solve a LASSO problem [Tibshirani1996]. The objective function of solving as follow:

(11)

After we get dictionary from Eq. 5, we can determine as the solution to the following LASSO problem with adaptive weights to regularize the model coefficients along with different features:

(12)

where is different across layers and we selected the best value from to in this work. denotes regularization weight, we will receive different penalized matrices instead of controlling the sparsity by as in Eq. (11). In addition, and , where is the ordinal least-square solution.
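The role of the ordinary least-squares solution in setting per-feature penalties can be illustrated with the classical adaptive-lasso weighting, in which a feature with a small least-squares coefficient receives a large penalty. The exponent `gamma`, the stabilizer `eps` and the function name are assumptions for this sketch. The exact weighting of Eq. (12) may differ.

```python
import numpy as np

def adaptive_weights(X, y, gamma=1.0, eps=1e-8):
    """Penalty weights 1 / |beta_ols|^gamma from the least-squares solution."""
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    return 1.0 / (np.abs(beta_ols) ** gamma + eps)

X = np.eye(2)
y = np.array([2.0, 0.5])
w = adaptive_weights(X, y)        # close to [0.5, 2.0]
```

A weighted lasso with these weights can be reduced to a plain lasso by dividing column `j` of `X` by `w[j]` and rescaling the solution afterwards, so any standard lasso solver can be reused.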

Therefore, we can consider the following linear regression problem with the inter weight matrix

, where is the sample size and is the dimension of the target response vector. We use to represent . Then, we use an adaptive weight vector to regularize over different covariates, as

(13)

where and we alternatively optimize and in the learning process.

Suppose we initialize , with , denotes -th layer. Then, we alternatively update and in Eq. (13) under this equality norm constraint until convergence. We will start the second stage of iterations with an updated norm constraint after the initial stage, which imposes a stronger penalty. Then we alternatively update and until the second stage ends. We keep strengthening the global -norm regularization stage by stage during the updating procedure. We use to denote the index of each stage and is the compose of iteration.

To solve Eq. (13), we first fix and solve , which can be computationally converted to a LASSO problem:

(14)

Then, when we fix update , the problem becomes the following constrained optimization problem:

(15)

We used the Lagrangian of Eq. 15, and drop the non-negativity constraint. Let . Then the Lagrangian can be written as

By setting , we have

(16)

Plugging the above relation in the constraint , then we have

Finally, we plug the above equation into Eq. 16 and get

(17)

Since and , the solution will satisfy the non-negative constraints automatically. After training the original RNN model and retrieving the weight matrices of both the inter-layer and the recurrent layer, we apply the proposed algorithm to learn the new matrices and of each layer, which are used to initialize the parameters of the layers in the new network structure (Fig. 1 (b)). After the initialization, we fine-tune the neural network as in Han et al. [2015], and the sparsity is preserved by updating only the non-zero elements of the sparse matrix. A fine-tuning step is also required by other compression works, e.g., [Prabhavalkar et al.2016].
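The sparsity-preserving fine-tuning described above can be sketched as a masked gradient step: the mask is computed once from the initial sparse matrix, and gradients at zero positions are discarded. The plain SGD update, the learning rate and the variable names are assumptions for illustration. The optimizer actually used for fine-tuning is not specified here.

```python
import numpy as np

def masked_sgd_step(Z, grad, mask, lr=0.01):
    """Gradient step that only touches entries where mask == 1."""
    return Z - lr * grad * mask

Z = np.array([[0.0, 1.5],
              [-2.0, 0.0]])
mask = (Z != 0).astype(Z.dtype)   # fixed sparsity pattern from the initialization
grad = np.ones_like(Z)            # stand-in for a real back-propagated gradient
Z_new = masked_sgd_step(Z, grad, mask, lr=0.1)
```

Keeping the mask fixed across steps means the number of stored non-zero parameters, and therefore the compression rate, cannot grow during fine-tuning.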

3.5 Extended Model Compression

We extend the Sec. 3.3 - Sec. 3.4 from standard RNNs to LSTM. In LSTM, the recurrent-weight matrix and the inter-layer matrix are both concatenation of four gates, which are input gate, output gate, forget gate and input modulation gate. We stack them vertically and denote as .

In this work, we do not consider the peephole weights because they are already narrow, so compressing them would not remove many parameters. Thus, we rewrite Eq. (1) and Eq. (2) as follows:

Besides, compressing a single-layer LSTM model is a special case for the proposed DirNet: (1) we can still learn the shared using and of the single LSTM/RNN layer as in Fig.1 (b) to achieve a high compression rate on a single-layer LSTM model; (2) compressing multiple LSTM/RNN layers, which can adaptively change the sparsity of the sparse codes relying on cross-layer information, will achieve higher compression rate than a single-layer LSTM/RNN model.

4 Experimental Results and Discussion

The proposed approach is a general algorithm for compressing recurrent neural networks, including vanilla RNNs, LSTMs and GRUs, and can thus be applied directly to many problems, e.g., language modeling (LM), speech recognition, machine translation and image captioning. In this paper, we select two popular domains: LM and speech recognition. We compare DirNet with two baselines, LSTM-SVD and LSTM-ODL. LSTM-SVD is the method proposed by Prabhavalkar et al. [2016], which compresses RNNs by low-rank SVD. LSTM-ODL compresses RNNs by online dictionary learning [Mairal et al.2009]. Training DirNet takes 10 hours on PTB with one K80 GPU and around 120 hours on LibriSpeech with 16 Nvidia K80 GPUs.

4.1 Language Modeling

We conduct word-level prediction on the Penn Treebank (PTB) dataset [Marcus et al.1993], which consists of 929k training words, 73k validation words and 82k test words. We first train a two-layer unrolled LSTM model with 650 units per layer, using the same network setting as LSTM-medium in [Zaremba et al.2014]. For comparison, the same model is compressed using LSTM-SVD, LSTM-ODL and DirNet. We report the results on the test data in Table 1. DirNet achieves superior compression and speedup rates while maintaining accuracy. The reason is that dynamically adjusting the projection matrices yields a better compression rate, while adapting the sparsity across layers keeps the accuracy loss negligible.

| Network | Params | R | T | PER |
|---|---|---|---|---|
| LSTM | 4.7M | 1x | 1x | 81.3 |
| LSTM-SVD | 2.4M | 2.0x | 1.9x | 81.4 |
| LSTM-ODL | 1.6M | 2.9x | 2.8x | 81.5 |
| DirNet | 0.7M | 6.7x | 6.4x | 81.5 |
| LSTM-SVD | 0.7M | 6.7x | 6.4x | 87.4 |
| LSTM-ODL | 0.7M | 6.7x | 6.4x | 83.2 |
| DirNet | 0.7M | 6.7x | 6.4x | 81.5 |

Table 1: Comparison of the number of parameters (Params), compression rate (R), speedup on a mobile CPU (T) and perplexity (PER) on the PTB dataset.

4.2 Speech Recognition

Dataset: The LibriSpeech corpus is a large (1000 hour) corpus of English read speech derived from audiobooks in the LibriVox project, sampled at 16kHz. The accents are various and not marked, but the majority are US English. It is publicly available at http://www.openslr.org/12/ [Panayotov et al.2015]. LibriSpeech comes with its own train, validation and test sets, and we use all the available data for training and validating our models. Moreover, we use the 100-hour “test clean” set as the testing set. Mel-frequency cepstral coefficient (MFCC) [Davis and Mermelstein1990] features are computed with 26 coefficients, a 25 ms sliding window and a 10 ms stride. We use nine time slices before and nine after each frame, for a total of 19 time points per window. As a result, with 26 cepstral coefficients, there are 494 (= 19 × 26) data points per 25 ms observation.
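The context stacking described above (nine frames on each side of the current frame, 26 coefficients per frame) can be sketched as follows. How the first and last frames are padded is not stated in the text. Repeating the edge frame is an assumption of this sketch, as are the function and variable names.

```python
import numpy as np

def stack_context(features, left=9, right=9):
    """Concatenate each frame with its left/right neighbours into one input vector."""
    n_frames, _ = features.shape
    # Repeat the first/last frame so every position has a full window (assumption).
    padded = np.pad(features, ((left, right), (0, 0)), mode="edge")
    width = left + 1 + right
    return np.stack([padded[t:t + width].reshape(-1) for t in range(n_frames)])

mfcc = np.random.default_rng(0).standard_normal((50, 26))   # 50 frames, 26 coefficients
stacked = stack_context(mfcc)                               # shape (50, 494)
```

Each row of `stacked` holds 19 × 26 = 494 values, with the current frame occupying the middle 26 entries.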

Baseline Model: Following [Prabhavalkar et al.2016], our baseline model for speech recognition is a 5-layer RNN model with the Connectionist Temporal Classification (CTC) loss function [Graves et al.2006], which predicts 41 context-independent (CI) phonemes [Sak et al.2015]. Each hidden RNN layer is composed of 500 LSTM units. We train all models for 100 epochs using the Adam optimizer [Kingma and Ba2014] with an initial learning rate of 0.001. DirNet and the baseline model take around 35 and 55 epochs to converge, respectively. The batch size is 32. The model is first trained to convergence on the CTC criterion, followed by sequence-discriminative training to optimize the state-level minimum Bayes risk (sMBR) criterion [Kingsbury2009]. We implement DirNet in TensorFlow [Abadi et al.2016] and use a Samsung Galaxy S8 smartphone as the mobile platform to evaluate the performance.

| Network | Params | R | T | WER |
|---|---|---|---|---|
| LSTM | 10.3M | 1x | 1x | 12.7 |
| LSTM-SVD | 6.9M | 1.5x | 1.4x | 12.7 |
| LSTM-ODL | 3.7M | 2.8x | 2.6x | 12.7 |
| DirNet | 2.4M | 4.3x | 4.2x | 12.7 |
| LSTM-SVD | 1.3M | 7.9x | 7.6x | 16.1 |
| LSTM-ODL | 1.3M | 7.9x | 7.6x | 14.3 |
| DirNet | 1.3M | 7.9x | 7.6x | 12.9 |

Table 2: Comparison of the number of parameters (Params), compression rate (R), speedup on a mobile CPU (T) and word error rate (%) (WER) on the LibriSpeech dataset.

Results: To give a comprehensive evaluation of the proposed approach, we list the word error rate along with the number of parameters and the execution time of the different approaches in Table 2. There are two sets of experiments: 1) comparing the maximum compression rate obtained by each approach without an accuracy drop, and 2) measuring the word error rate of each approach at the same compression rate.

As indicated in the first row of Table 2, the basic LSTM model achieves a WER of 12.7% before any compression. It has 10.3M parameters in total and takes 1.77s to recognize 1s of audio. DirNet compresses the model by 4.3 times without losing any accuracy. The resulting speedup is 4.2 times, compared to the 1.4 and 2.6 times obtained by LSTM-SVD and LSTM-ODL, respectively. These results indicate the superiority of DirNet over current compression algorithms. Moreover, this improvement suggests that dynamically adjusting the sparsity across layers helps compressed models reach higher compression rates. From the second group of rows in Table 2, we find that when the model is compressed by 7.9 times, LSTM-ODL and LSTM-SVD lose considerable accuracy. The SVD-based algorithm in particular loses 3.4% (16.1%-12.7%), compared to a loss of 0.2% for DirNet.
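The figures quoted above can be cross-checked with a few lines of arithmetic. The parameter counts come from Table 2. The per-second recognition times are derived here by dividing the baseline time of 1.77s by the reported speedups, which assumes the speedup applies uniformly to that measurement. These derived times are not reported in the text.

```python
def compression_rate(baseline_params_m, compressed_params_m):
    """Ratio of parameter counts, both given in millions."""
    return baseline_params_m / compressed_params_m

rate_no_loss = round(compression_rate(10.3, 2.4), 1)    # DirNet, no accuracy loss
rate_smallest = round(compression_rate(10.3, 1.3), 1)   # smallest models in Table 2

baseline_seconds_per_audio_second = 1.77
time_at_4_2x = baseline_seconds_per_audio_second / 4.2  # derived, about 0.42 s
time_at_7_6x = baseline_seconds_per_audio_second / 7.6  # derived, about 0.23 s
wer_loss_svd = round(16.1 - 12.7, 1)
wer_loss_dirnet = round(12.9 - 12.7, 1)
```

Under that assumption, both compressed DirNet models would process one second of audio in less than one second, which is consistent with the real-time claim, while the uncompressed baseline would not.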

4.3 Adaptively Shrinking Parameter Selection

In this section, we study how the performance of our approach is affected by the following two parameters: that controls the initial “sparsity” of the system and the shrinking factor

that controls the compression rate of the system. We use F-score on LibriSpeech dataset to measure the performance.

Figure 3: Performance of different parameter selections.

First, we examine the performance of choosing different values. The range is from to , and the result is shown on the left side of the Fig. 3. We notice that when is below , the performance quickly drops and the lower initial sparsity even fails to start the whole system with sufficient energy. As a result the iterations could quickly stop at a local optimal. Therefore, we choose the best performance in all experimental settings.

Second, we study the influence of varying shrinking factor . The explored value range is from to . We observe that the performance sharply increases first and becomes stable afterward. Besides, the system evolves slowly such that the shrinking stage is sufficient when the shrinking scheme . In practice, we choose to strike a balance between efficiency and the quality of shrinking procedure.

4.4 Adaptive Hierarchy of the Network

We also compare the adaptive cross-layer compression capability of DirNet (O) with that of LSTM-SVD (G). All results are listed in Table 3. With the same number of neurons, DirNet matches or improves on the WER of LSTM-SVD at every model size, and the gap widens as the model shrinks (16.6 vs. 12.9 at 1.3M parameters). These results further demonstrate the benefit of adaptively adjusting sparse features in deep model compression.

| Neurons of each layer | Params | WER (G) | WER (O) |
|---|---|---|---|
| 500, 500, 500, 500, 500 | 10.3M | 12.7 | 12.7 |
| 350, 375, 395, 405, 410 | 8.6M | 12.3 | 12.3 |
| 270, 305, 335, 345, 350 | 7.2M | 12.5 | 12.3 |
| 175, 215, 245, 260, 265 | 5.4M | 12.5 | 12.4 |
| 120, 150, 180, 195, 200 | 4.1M | 12.6 | 12.5 |
| 80, 105, 130, 145, 150 | 3.1M | 12.9 | 12.5 |
| 50, 70, 90, 100, 110 | 2.3M | 13.2 | 12.7 |
| 30, 45, 55, 65, 75 | 1.7M | 14.4 | 12.9 |
| 25, 35, 45, 50, 55 | 1.3M | 16.6 | 12.9 |

Table 3: Word error rates (%) (WER) on the testing set when varying the dimensions of each projection matrix by adaptively adjusting the hierarchical structure of the RNN.

5 Conclusions

In this paper, we introduce DirNet, which dynamically adjusts the compression rate of each layer in the network and adaptively changes the hierarchical structure of the weight matrices across layers. Experimental results show that, compared to other RNN compression methods, DirNet significantly improves the performance before retraining. In our ongoing work, we will integrate scalar-level compression with DirNet to further compress the deep model.

References