1 Introduction
Long short-term memory (LSTM) has been widely deployed for applications like speech recognition [1, 2], health monitoring [3], and language modeling [4, 5]. It is capable of learning both the long-term and short-term dependencies in sequential data [6]. Researchers have kept increasing the depth and size of LSTM models to improve their accuracy. For example, the DeepSpeech2 architecture [1] is more than 2× deeper and 10× larger than the initial DeepSpeech architecture [7]. A deep neural network (NN) architecture allows the model to capture low/mid/high-level features through a multi-level abstraction that typically results in high inference accuracy [8]. But it also leads to a sharp increase in computation, thus posing significant challenges to model deployment. In addition, a large NN model consumes substantial storage, memory bandwidth, and computational resources that may be excessive for mobile and embedded devices [9, 10, 11, 12]. Furthermore, the increasingly stringent latency constraints imposed by real-time applications make large high-latency LSTMs unusable. Thus, it is practically important to optimize the model from all three aspects of performance simultaneously: model compactness, accuracy, and execution efficiency.
Network compression has emerged as a promising technique to reduce the computational cost of deep NNs by eliminating connections with insignificant weights, such as zeros or near-zeros. By leveraging effective network growth [13] and pruning [14] techniques, the number of parameters can be cut down by over 30× for convolutional neural networks (CNNs) [14, 13, 15] and more than 10× for LSTMs [16, 5, 17, 18]. However, current compression strategies are mostly hardware-agnostic, and network complexity reduction does not always translate into execution efficiency and may even have an adverse impact on other performance metrics. For example, training NNs towards extreme weight sparsity offers little execution performance gain on current GPUs due to a lack of effective sparsity support. Moreover, some compressed networks may even suffer from inefficient execution, as observed in [19].
In this work, we propose a novel hardware-guided symbiotic training methodology based on our observation that the hardware may introduce substantial non-monotonicity (we call this the latency hysteresis effect (LHE)): a smaller model, which typically has a lower accuracy, may also be slower at runtime. This observation raises a question about the mainstream smaller-dimension-is-better strategy, which often leads to a suboptimal design point in the model architecture space. By leveraging the hardware-impacted hysteresis effect, we are able to achieve the symbiosis of all three performance aspects: higher accuracy, smaller model size, and lower inference latency. To evaluate this symbiotic strategy, we adopt the internally deeper hidden-layer LSTM (HLSTM) structure [18] to reduce the number of stacked layers, and start training from a sparse seed architecture, which grows effective connections to reach an initial high accuracy. Then we employ our hardware-guided structured grow-and-prune algorithms to shrink the network into hardware-favored dimensions. Finally, we prune the network weights again for extra compactness.
The major contributions of our approach can be summarized as follows:

1) We propose a novel training methodology to exploit hardware LHE to achieve a symbiosis of model compactness and accuracy with reduced runtime latency.

2) We combine multi-granular grow-and-prune algorithms with hardware guidance to reduce the model into a hardware-favored architecture. This is the first work that effectively avoids suboptimal design points that may consume as much as 90% of the model architecture space.

3) The reported results outperform those from the literature from all three design perspectives: (a) 7.0× to 30.5× model compression, (b) higher accuracy, and (c) 1.4× to 5.2× reduction in runtime latency on Nvidia GPUs and Intel Xeon CPUs. Thus, our method yields compact, accurate, yet execution-efficient inference models.
The rest of this paper is organized as follows. We review related work in Section 2. Then, we explain the motivation of this work in Section 3. In Section 4, we discuss our proposed methodology in detail. We present our experimental results on both language modeling and speech recognition in Section 5. Finally, we draw conclusions in Section 6.
2 Related work
Various attempts have been made to improve the efficiency of LSTM models. One direction focuses on improving the LSTM cells. The gated recurrent unit (GRU) utilizes reset and update gates to achieve a similar performance to an LSTM while reducing computational cost [20]. Quasi-RNN explores the intrinsic parallelism of time series data to outperform an LSTM for the same hidden state width [21]. HLSTM incorporates deeper control gates to reduce the number of external stacked layers. It achieves higher accuracy than the GRU and LSTM with fewer parameters [18].
Network compression techniques, such as the grow-and-prune paradigm, have recently emerged as another direction for reducing LSTM redundancy. The pruning method was initially shown to be effective on large CNNs by demonstrating the reduction in the number of parameters in AlexNet by 9× and VGG by 13× for the well-known ImageNet dataset, without any accuracy loss [14]. Follow-up works have successfully scaled this technique to LSTMs [16, 5, 17]. For example, a recent work proposes structured pruning for LSTMs through group LASSO regularization [5]. Network growth is a complementary method to pruning. It enables a sparser yet accurate model to be obtained before pruning starts [13]. A grow-and-prune paradigm typically reduces the number of parameters in CNNs [13] and LSTMs [18] by another 2×. However, all these methods are hardware-agnostic. Most of them utilize monotonic optimization metrics, e.g., smaller matrix dimensions or fewer multiply-accumulate operations, and hence optimize towards slimmer or sparser models that may not necessarily translate into execution efficiency.
There have been some recent efforts towards bridging the gap between complexity removal and execution efficiency for CNNs through hardware-heuristics-guided pruning approaches. For example, Scalpel [19] adopts different pruning strategies based on three parallelism levels (low, moderate, and high) of the underlying hardware. DeftNN [22] removes a synapse vector that is highly correlated with another one in the weight matrix, on the assumption that a smaller dimension leads to improved latency. However, both works are based on high-level hardware heuristics rather than real hardware behavior like LHE, which may lead to suboptimal networks. Energy-aware pruning [23] adopts a hardware energy consumption model in its pruning criteria. However, it leads to a redesign for each target hardware and requires expert knowledge of the hardware. Thus, it is not very user-friendly. Chameleon [24] can effectively adapt CNNs to target platforms and deliver ChamNets that achieve consistent accuracy gains across various latency constraints relative to MobileNetV2 [25], MnasNet [26], and ShuffleNetV2 [27]. However, construction of Chameleon’s three predictors may require training hundreds of baseline models, and hence may be time-consuming.
In the domain of recurrent NNs, a relevant work explores hardware-inspired weight- and block-level sparsity for speech recognition [28]. With cuSPARSE library support, it delivers 0.8× to 4.5× speedup relative to the dense baseline. However, we find cuSPARSE [29] to be slower than the latest cuBLAS [30] dense library for matrix multiplication, which is a key operation in LSTMs. In Fig. 1, we compare these two libraries on GPUs over a typical dimension range of LSTMs. It can be observed that cuSPARSE is slower than cuBLAS even at a 95% sparsity level.
3 Motivation
As opposed to most prior works that use floating-point operations (FLOPs) or multiply-and-accumulate (MAC) operations as an indirect metric for evaluating model compactness, we aim to develop an automated LSTM synthesis flow that acts on directly measured inference latency. This flow does not adopt the traditional assumption that a smaller model (e.g., with smaller hidden state widths) is implicitly faster. In fact, we show that such an assumption is often not valid at runtime on hardware. This points to the need for a new methodology that can link model simplification algorithms to direct execution benefits.
Let us first profile the latency of the matrix multiplication operation on a GPU, as shown in Fig. 2, due to its computational importance. This operation consumes more than half of the computational time in LSTMs. We observed two distinct trends when considering matrix dimension vs. latency:

Global monotonic trend: a smaller dimension is, in general, faster in terms of runtime latency due to the reduced number of weights (i.e., computation).

Local non-monotonic trend: the runtime latency lags behind or even reverses the trend as the weight dimension decreases. We refer to this local trend as LHE and the point where LHE starts to occur as the latency hysteresis point (LHP). Within the latency hysteresis bin (i.e., the local range), smaller dimensions worsen runtime latency relative to the corresponding LHP.
LHE is caused by cache line granularity when loading/storing data and by vectorization optimizations (e.g., vectorized vs. general matrix multiplication kernels) that are enabled only at particular input dimensions to take full advantage of the bus bandwidth and the single-instruction-multiple-data (SIMD) nature of hardware processing units. A change of optimization strategy for memory placement or computation scheduling can easily impact the final execution efficiency for inference [31, 32]. This also leads to a pervasive presence of LHE on CPUs.
The impact of LHE can scale up to the inference model level. For example, Fig. 3 shows a plot of the DeepSpeech2 inference latency against model size (specified by the hidden state width and the control gate hidden layer width). We can observe that a smaller DeepSpeech2 architecture, which also typically has a lower accuracy, may also be slower at runtime. This raises a question about the mainstream smaller-is-better strategy, given the existence of a large number of LHPs that make more than 90% of the design points in Fig. 3 suboptimal.
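To illustrate how a measured latency profile can be screened for hysteresis, the following Python sketch keeps only the dimensions that are strictly faster than every larger dimension. This is one possible operational reading of an LHP, not a definition taken from our implementation; the function names `find_lhps` and `snap_to_lhp` and the latency values in the example are hypothetical and purely synthetic.

```python
def find_lhps(dims, latencies):
    """Return the dimensions that no larger dimension matches or beats in latency.

    Assumption: a design point is suboptimal if some larger dimension
    (more model capacity) runs at the same or a lower latency.
    """
    points = sorted(zip(dims, latencies))  # ascending dimension
    lhps = []
    best_latency_above = float("inf")
    # Scan from the largest dimension down, tracking the fastest larger point.
    for dim, lat in reversed(points):
        if lat < best_latency_above:
            lhps.append(dim)
            best_latency_above = lat
    return sorted(lhps)


def snap_to_lhp(dim, lhps):
    """Smallest retained dimension that is at least dim (None if there is none)."""
    candidates = [d for d in lhps if d >= dim]
    return min(candidates) if candidates else None


# Synthetic profile (milliseconds); the values are made up for illustration.
dims = [1184, 1192, 1197, 1200, 1208]
lats = [1.70, 1.78, 1.80, 1.65, 1.90]
lhps = find_lhps(dims, lats)
print(lhps, snap_to_lhp(1197, lhps))
```

In this synthetic profile, the dimensions 1184, 1192, and 1197 are all slower than 1200, so a model pruned down to 1197 would be grown back to 1200.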
4 Methodology
In this work, we combine the multi-granular grow-and-prune algorithms with network profiling to optimize the network towards a joint algorithm-hardware optimal solution. We summarize our main synthesis flow in Fig. 4. The proposed synthesis process starts with a partially connected seed architecture. Under the guidance of hardware profiles, the multi-granularity of our algorithms enables the network to adaptively expand (row/column growth), shrink (row/column pruning), condense (weight growth), and sparsify (weight pruning) into hardware-favored design points. This benefits hardware even when there is no sparsity support. In the final stage, the synthesis flow rests at a design point that is both compact and hardware-friendly.
We illustrate the details of our training methodology in Fig. 5. As shown in its upper section, we adopt the HLSTM cell that adds hidden layers to its control gates [18]. Its internal computation flow is governed by the following equations:
\[
\begin{aligned}
f_t &= NN_f([x_t, h_{t-1}]) = \sigma(W_f H^{*}([x_t, h_{t-1}]) + b_f) \\
i_t &= NN_i([x_t, h_{t-1}]) = \sigma(W_i H^{*}([x_t, h_{t-1}]) + b_i) \\
o_t &= NN_o([x_t, h_{t-1}]) = \sigma(W_o H^{*}([x_t, h_{t-1}]) + b_o) \\
g_t &= NN_g([x_t, h_{t-1}]) = \tanh(W_g H^{*}([x_t, h_{t-1}]) + b_g) \\
c_t &= f_t \otimes c_{t-1} + i_t \otimes g_t \\
h_t &= o_t \otimes \tanh(c_t)
\end{aligned}
\]
where $f_t$, $i_t$, $o_t$, $g_t$, $x_t$, $h_t$, and $c_t$ denote the forget gate, input gate, output gate, cell update vector, input, hidden state, and cell state at step $t$, respectively; $h_{t-1}$ and $c_{t-1}$ refer to the previous hidden and cell states at step $t-1$; $NN$, $H$, $W$, $b$, $\sigma$, and $\otimes$ refer to NN gates, hidden layer that performs a linear transformation followed by an activation function, weight matrix, bias, sigmoid function, and element-wise multiplication, respectively; $*$ indicates zero or more $H$ layers for each NN gate.
The $H$ layers offer three advantages. First, they enhance gate control through a multi-level abstraction, and hence alleviate HLSTM’s reliance on external stacking. Second, they can be easily regularized through dropout, and thus lead to better generalization. Third, they offer a wide range of choices for internal activation functions, such as the ReLU. This may provide additional benefits, such as faster learning and computation reduction due to zero outputs [18].
We utilize four training steps to learn the values, connectivity, and dimensions of the NN gates in the HLSTM. We show these steps in the lower part of Fig. 5. Training starts from a sparse seed architecture that contains a small fraction of connections to facilitate the initial back propagation of gradient information. During the weight growth (wg) phase, it iteratively wakes up only the most effective connections to reach high accuracy based on the gradient information. Then, it uses structured row/column pruning (rcp) algorithms to shrink the network dimensions, leading to lower inference latency. Next, it profiles the latency model on hardware and uses row/column growth (rcg) algorithms to obtain a network from LHE-aware locally optimal design points. This enables simultaneous latency and accuracy gains, as shown later. Finally, it prunes away some weights (weight pruning, wp) for extra compactness.
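For readers who prefer code, the HLSTM cell computation described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not our PyTorch implementation: the helper names are hypothetical, vectors are Python lists, and ReLU is assumed as the activation of the hidden layers inside each gate.

```python
import math


def linear(W, b, v):
    """Affine map W v + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]


def relu(v):
    return [max(0.0, x) for x in v]


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]


def tanh(v):
    return [math.tanh(x) for x in v]


def nn_gate(z, hidden_layers, W, b, activation):
    """One NN gate: zero or more hidden layers, then activation(W z + b)."""
    for W_h, b_h in hidden_layers:
        z = relu(linear(W_h, b_h, z))  # assumed ReLU hidden activation
    return activation(linear(W, b, z))


def hlstm_step(x_t, h_prev, c_prev, gates):
    """One HLSTM step; gates maps 'f', 'i', 'o', 'g' to (hidden_layers, W, b)."""
    z = x_t + h_prev  # list concatenation of the input and previous hidden state
    f = nn_gate(z, *gates["f"], sigmoid)
    i = nn_gate(z, *gates["i"], sigmoid)
    o = nn_gate(z, *gates["o"], sigmoid)
    g = nn_gate(z, *gates["g"], tanh)
    # New cell state: gated carry-over plus gated update.
    c_t = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c_prev, i, g)]
    # New hidden state: output gate applied to the squashed cell state.
    h_t = [oj * math.tanh(cj) for oj, cj in zip(o, c_t)]
    return h_t, c_t
```

Passing an empty `hidden_layers` list for every gate reduces the sketch to a conventional LSTM cell, which matches the "zero or more layers" reading of the gate definition.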
We next explain in detail the algorithms involved in these four steps. Unless otherwise stated, we assume a mask-based approach for tackling sparse networks. Each weight matrix $W$ has a corresponding binary mask matrix $Msk$ of exactly the same size. After the training flow terminates, we update each $W$ with its corresponding $W \otimes Msk$.
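The mask-based bookkeeping can be sketched as follows, assuming matrices are stored as lists of rows. The helper names `apply_mask` and `sparsity` are hypothetical and only meant to make the convention concrete.

```python
def apply_mask(W, Msk):
    """Element-wise product of W and its binary mask (lists of rows)."""
    return [[w * m for w, m in zip(w_row, m_row)] for w_row, m_row in zip(W, Msk)]


def sparsity(Msk):
    """Fraction of masked-out (dormant or pruned) connections."""
    total = sum(len(row) for row in Msk)
    zeros = sum(1 for row in Msk for m in row if m == 0)
    return zeros / total


W = [[0.5, -1.0], [2.0, 3.0]]
Msk = [[1, 0], [0, 1]]
print(apply_mask(W, Msk), sparsity(Msk))
```

Keeping the mask separate from the weights lets growth and pruning toggle connections without discarding the underlying values until the flow terminates.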
4.1 Weight growth & pruning
The main objective of the weight growth phase is to locate only the most effective dormant connections to reduce the value of the loss function $L$. We first evaluate $\partial L / \partial w$ for each dormant connection $w$ based on its average gradient over the entire training set. Then, we activate a dormant connection based on the following policy:
\[
\text{activate } w \iff \left| \frac{\partial L}{\partial w} \right| > \text{the } 100(1-\alpha)\text{th percentile of the gradient magnitudes in its weight matrix}
\]
where $\alpha$ denotes the weight growth ratio. This rule was first proposed in [13]. It caters to dormant connections that are most efficient at loss function reduction, and enables the network to reach a target accuracy with far less redundancy than a fully connected model. This offers an accurate yet non-redundant model for all the subsequent steps to act on.
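A minimal sketch of gradient-based weight growth on a mask is given below. It is an illustration rather than our implementation: the function name `grow_weights` is hypothetical, and the growth ratio is assumed to be measured as a fraction of all entries in the matrix.

```python
def grow_weights(mask, grad, alpha):
    """Activate the dormant connections with the largest gradient magnitudes.

    mask, grad: equally sized lists of rows; mask entries are 0 (dormant) or 1.
    alpha: growth ratio, assumed here to be a fraction of all matrix entries.
    """
    dormant = [
        (abs(grad[r][c]), r, c)
        for r, row in enumerate(mask)
        for c, m in enumerate(row)
        if m == 0
    ]
    total = sum(len(row) for row in mask)
    budget = min(len(dormant), int(alpha * total))
    dormant.sort(reverse=True)  # largest gradient magnitude first
    new_mask = [list(row) for row in mask]  # leave the input mask untouched
    for _, r, c in dormant[:budget]:
        new_mask[r][c] = 1
    return new_mask


mask = [[1, 0], [0, 0]]
grad = [[0.1, -0.9], [0.2, 0.05]]  # averaged gradients (synthetic values)
print(grow_weights(mask, grad, alpha=0.25))
```

Only the sign-free magnitude matters for ranking, so the connection with gradient -0.9 is woken up before the one with gradient 0.2.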
We adopt the magnitude-based weight pruning strategy for final redundancy removal. Pruning of insignificant weights is an iterative process. In each iteration, we adopt the following policy for weight selection:
\[
\text{prune } w \iff |w| < \text{the } 100\beta\text{th percentile of the weight magnitudes in its weight matrix}
\]
where $\beta$ denotes the weight pruning ratio. We prune a neuron if all its input (or output) connections are pruned away. We retrain the network after each weight pruning iteration to recover its performance. The pruning phase terminates when retraining cannot achieve a predefined accuracy threshold. In the final training step, weight pruning minimizes the memory requirement of the final inference model. It also provides a high weight sparsity level for sparsity-driven libraries, such as the Intel Math Kernel Library (MKL)
[33] on Intel CPUs, as shown later.
4.2 Row/column growth & pruning
The grow-and-prune approach at the row/column level enables the network to adaptively expand and shrink its dimensions. This leads to an effective descent in the model architecture space towards fast, accurate, yet execution-efficient design points. However, due to the introduction of a large number of suboptimal design points by hardware, the model architecture space may become rather non-monotonic, thus necessitating a carefully crafted stopping criterion for this process.
We propose row/column grow-and-prune algorithms to exploit LHE for hardware-symbiotic solutions. The pruning algorithm takes advantage of the global trend to shrink the model dimension for latency reduction, whereas the growth algorithm recovers the model back to its corresponding LHP for simultaneous latency and accuracy gains.
We present the row/column pruning algorithm in Algorithm 1. Inspired by magnitude-based weight pruning methods, we examine the sum of the magnitudes of the weights per row/column for importance ranking. Row/column pruning is also an iterative process. We retrain the network after each pruning iteration, and stop if retraining cannot recover the performance to a predefined accuracy threshold.
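The importance ranking used for row/column pruning can be sketched as below. This is not Algorithm 1 itself but a simplified single-matrix illustration: the function name `prune_rows_cols` is hypothetical, retraining is omitted, and the coupling between the rows of one matrix and the columns of the next layer is ignored.

```python
def prune_rows_cols(W, row_ratio, col_ratio):
    """Drop the rows and columns with the smallest summed weight magnitudes.

    Returns the shrunken matrix plus the indices of the kept rows and columns.
    """
    n_rows, n_cols = len(W), len(W[0])
    row_score = [sum(abs(w) for w in row) for row in W]
    col_score = [sum(abs(W[r][c]) for r in range(n_rows)) for c in range(n_cols)]
    keep_r = n_rows - int(row_ratio * n_rows)
    keep_c = n_cols - int(col_ratio * n_cols)
    # Rank by descending score, keep the top entries, restore original order.
    rows = sorted(sorted(range(n_rows), key=lambda r: -row_score[r])[:keep_r])
    cols = sorted(sorted(range(n_cols), key=lambda c: -col_score[c])[:keep_c])
    return [[W[r][c] for c in cols] for r in rows], rows, cols


W = [[1.0, 0.1, 2.0],
     [0.0, 0.0, 0.1],
     [3.0, 0.2, 1.0]]
print(prune_rows_cols(W, row_ratio=0.34, col_ratio=0.34))
```

Because whole rows and columns are removed, the result is a smaller dense matrix, which is why this step reduces latency even on hardware without sparsity support.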
Algorithm 2 illustrates our gradient-based row/column growth algorithm. Similar to the weight growth algorithm, we first evaluate $\partial L / \partial w$ for all the dormant connections in the network based on the average gradient over the entire training set (or a large batch). We only wake up the dormant rows and columns that possess the largest gradient magnitude sums, and hence are the most efficient at reducing the loss function $L$.
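A simplified sketch of the growth side follows. It is not Algorithm 2 itself: the names `grow_rows_cols` and `rows_cols_to_grow` are hypothetical, the dormant rows and columns are tracked with 0/1 flags, and the number of rows and columns to wake up is derived from a target dimension such as an LHP.

```python
def grow_rows_cols(row_mask, col_mask, grad, n_rows, n_cols):
    """Wake up the dormant rows/columns with the largest gradient-magnitude sums.

    row_mask, col_mask: lists of 0/1 flags (1 = active).
    grad: full-size averaged gradient matrix as a list of rows.
    n_rows, n_cols: how many dormant rows and columns to activate.
    """
    height, width = len(grad), len(grad[0])
    row_score = {r: sum(abs(g) for g in grad[r])
                 for r in range(height) if not row_mask[r]}
    col_score = {c: sum(abs(grad[r][c]) for r in range(height))
                 for c in range(width) if not col_mask[c]}
    new_rows, new_cols = list(row_mask), list(col_mask)
    for r in sorted(row_score, key=row_score.get, reverse=True)[:n_rows]:
        new_rows[r] = 1
    for c in sorted(col_score, key=col_score.get, reverse=True)[:n_cols]:
        new_cols[c] = 1
    return new_rows, new_cols


def rows_cols_to_grow(current, target):
    """Rows and columns needed to move from current to target dimensions."""
    return max(0, target[0] - current[0]), max(0, target[1] - current[1])


grad = [[0.1, 0.1, 0.1], [0.5, -0.5, 0.2], [0.0, 0.1, 0.0]]  # synthetic values
print(grow_rows_cols([1, 0, 0], [1, 1, 0], grad, 1, 1))
print(rows_cols_to_grow((1197, 317), (1200, 320)))
```

The second call shows the arithmetic behind recovering a pruned 1197/317 gate matrix into a 1200/320 design point: three rows and three columns are grown.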
Table I
Step  Model                 #Params.  Perplexity  Latency (P2000)  Latency (P100)  Latency (V100)
-     Baseline LSTM         14.46M    72.1        3.52ms           2.12ms          2.72ms
(a)   HLSTM+wg              4.68M     70.2        2.21ms           1.45ms          1.81ms
(b)   HLSTM+wg+rcp          3.21M     72.2        1.78ms           1.27ms          1.51ms
(c)   HLSTM+wg+rcp+rcg      3.24M     71.8        1.65ms           1.18ms          1.14ms
(d)   HLSTM+wg+rcp+rcg+wp   0.80M     72.1        1.65ms           1.18ms          1.14ms
5 Experimental Results
We next present our experimental results for the language modeling and speech recognition benchmarks. We implement our methodology using PyTorch [34] on both Nvidia GPUs and Intel Xeon CPUs. For GPU inference, we experiment with three Nvidia GPUs: Quadro P2000, Tesla P100, and Tesla V100. For CPU inference, we target three Intel Xeon CPUs: Gold 5118 (2.3 GHz), E5-2682 v4 (2.5 GHz), and Broadwell (2.4 GHz), and use Intel MKL [33] implementations for sparse matrix operations.
5.1 Language modeling
We first demonstrate the effectiveness of our approach on language modeling.
Model architecture: We experiment with a stacked LSTM architecture for this application that feeds embedded word vectors to the recurrent layers. The word vocabulary has size 10,000. The dimension of the input word embedding is 400. We first train a conventional stacked LSTM architecture as the baseline. It contains two stacked recurrent layers, each with the hidden state width set to 1500, same as in [5, 35, 4]. Next, we implement our methodology on a one-layer HLSTM with the hidden state width again set to 1500. Each control gate contains one hidden layer with this width.
Dataset: We report results on the Penn Treebank (PTB) dataset [36]. It contains 929k, 73k, and 82k words in the training, validation, and test sets, respectively.
Training:
We use a stochastic gradient descent (SGD) optimizer for this application. We initialize the learning rate to 30, decayed by 10 when the validation accuracy does not increase in 50 consecutive epochs. We use a batch size of 32 for training. We use a dropout ratio of 0.2 for the hidden layers in the control gates, as in
[18], 0.65 for input embedding layers, and 0.1 for input words, as in [37]. We employ L2 regularization during training with a weight decay of . We use word-level perplexity as our evaluation criterion, same as in [5, 35, 4].
We next present our experimental results for GPU and CPU inference.
Table II
Model                                  #Params.       Perplexity  Latency (P2000)  Latency (P100)  Latency (V100)
Our stacked LSTM (baseline)            14.5M          72.1        3.52ms           2.12ms          2.72ms
Sukhbaatar et al. [38]                 -              120         -                -               -
Mikolov et al. [39]                    -              115/115     -                -               -
Wen et al. [5]                         14.9M          78.7        -                -               -
Zhu and Gupta [35]                     7.2M           77.5        7.94ms*          3.69ms*         3.02ms*
Zaremba, Sutskever, and Vinyals [40]   18.0M          73.6        -                -               -
Lin et al. [4]                         9.0M           72.2        -                -               -
This work                              0.80M (18.0×)  72.1        1.65ms (2.1×)    1.18ms (1.8×)   1.14ms (2.4×)
Pessimistic estimate at zero word embedding dimension, i.e., ultimate lower bound, and 1500 hidden state width as reported in the paper.
* Measured based on our implementation of the exact same configuration as reported in the paper.
5.1.1 Synthesized models for GPU inference
We first implement our methodology on various GPUs and compare our results with those for the conventional stacked LSTM architecture in Table I. Latency indicates the average value per test word sequence of length 70 over the entire test set at a batch size of 16. It can be observed that the four steps in our learning algorithm work sequentially and collaboratively to learn structured sparsity in the network, as shown in Fig. 6. To further illustrate the performance gains obtained in each training step, we also break down their individual contributions in Table I, and present the details next:
Step (a): The seed architecture has a 50% sparsity level. In the weight growth phase, we use a growth ratio set to 10% for the first eight epochs. This enables the network to reach a 70.2 perplexity with only 65% of its available connections, i.e., at a 35% sparsity level.
Step (b): We use equal pruning ratios for rows and columns of 20%. We halve the pruning ratios if the post-retraining perplexity surpasses a predefined performance threshold. For this application, we set the performance threshold to 72.1. This is the performance achieved by our stacked LSTM model, and better than the values reported in [40, 5, 4, 35]. In the final stage, we iteratively prune away single rows and columns until the performance threshold can no longer be satisfied. This enables us to fully exploit the global monotonic trend for latency reduction. After this step, the dimension of each control gate matrix shrinks from 1500/400 to 1197/317. This brings a 14% to 20% reduction in inference latency.
Step (c): We next locate the LHPs for each GPU before starting the growth process. For this application, we find all three GPUs favor the same 1200/320 LHP in the model architecture space defined by the control gate dimension. We then calculate the growth ratios accordingly to recover the network into this LHP. As expected, LHE exploration enables a 7% to 23% reduction in measured inference latency jointly with a 0.4 perplexity bonus.
Step (d): We use an initial weight pruning ratio of 70% and update it based on the same rule as in Step (b). This step further reduces the number of network parameters by 4.1×.
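The ratio-halving schedule shared by Steps (b) and (d) can be sketched as a small control loop. This is a sketch under stated assumptions rather than our training code: `prune` and `retrain_and_eval` are hypothetical callbacks, a lower metric (e.g., perplexity) is assumed to be better, and the `min_ratio` floor stands in for the final stage that prunes single rows and columns.

```python
def iterative_prune(model, prune, retrain_and_eval, threshold,
                    ratio=0.2, min_ratio=1e-3):
    """Prune repeatedly, halving the ratio whenever retraining misses the threshold.

    prune(model, ratio) returns a smaller candidate model.
    retrain_and_eval(candidate) returns its metric after retraining (lower is better).
    """
    while ratio >= min_ratio:
        candidate = prune(model, ratio)
        if retrain_and_eval(candidate) <= threshold:
            model = candidate  # accept the smaller model and keep the ratio
        else:
            ratio /= 2.0       # too aggressive: discard the candidate and back off
    return model


# Toy stand-ins: the "model" is just a width, and the metric is a synthetic
# step function, chosen only to exercise the control flow.
toy_prune = lambda width, r: width - max(1, int(width * r))
toy_eval = lambda width: 72.0 if width >= 1197 else 73.0
print(iterative_prune(1500, toy_prune, toy_eval, threshold=72.1))
```

Rejected candidates never replace the current model, so the loop always returns the smallest model that still met the threshold after retraining.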
We compare our final inference model with relevant work in Table II. Our models outperform the ones in the literature from all three design perspectives. Against the stacked LSTM baseline, we reduce the number of parameters by 18.0×, and measured runtime latency by 2.1×, 1.8×, and 2.4× on Nvidia P2000, P100, and V100 GPUs, respectively, without any accuracy degradation.
5.1.2 Synthesized models for CPU inference
We next report results from the implementation of our methodology on CPUs. We base our experiments on the Intel MKL [33] implementation due to its support and acceleration of sparse matrix computations. For CPU inference, we skip the dimension reduction process (i.e., Steps (b) and (c)) to fully exploit the potential of weight sparsity. Without dimension reduction, the sparsity level reaches 93.4%, which lets us fully exploit sparsity acceleration, as opposed to a sparsity level of 83.5% with dimension reduction, which undermines the benefits of MKL. This yields an additional 1.6× latency and 2× parameter reduction. Our final model has a test perplexity of 72.1, same as the test perplexity of the LSTM baseline. It only contains 0.47M parameters as opposed to the baseline LSTM model that has 14.5M parameters (leading to a 30.5× compression ratio).
Table III
CPU platform            Baseline LSTM latency  This work  Speedup factor
Intel Xeon E5-2682 v4   115.90ms               22.64ms    5.1×
Intel Xeon Gold 5118    125.08ms               24.02ms    5.2×
Intel Xeon Broadwell    81.03ms                19.57ms    4.1×
We next compare the latency of the final inference models on CPUs in Table III. Relative to the LSTM baseline for language modeling, we reduce the inference latency by 80.5% (5.1×), 80.8% (5.2×), and 75.8% (4.1×) on Intel Xeon E5-2682 v4, Gold 5118, and Broadwell CPUs, respectively. Sparsity-driven MKL acceleration contributes approximately 2.5× speedup, while utilizing HLSTM cells contributes the remaining 2× speedup on the CPUs.
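The speedup factors and latency reductions follow directly from the measured CPU latencies for language modeling. The short sketch below recomputes them from the baseline and final latencies listed in the CPU table above; the helper names are hypothetical.

```python
def speedup(baseline_ms, ours_ms):
    """Speedup factor of the synthesized model over the baseline."""
    return baseline_ms / ours_ms


def latency_reduction_pct(baseline_ms, ours_ms):
    """Relative latency reduction in percent."""
    return 100.0 * (1.0 - ours_ms / baseline_ms)


# (baseline latency, this work) in milliseconds, language modeling on CPUs.
measured = {
    "Intel Xeon E5-2682 v4": (115.90, 22.64),
    "Intel Xeon Gold 5118": (125.08, 24.02),
    "Intel Xeon Broadwell": (81.03, 19.57),
}
for cpu, (base, ours) in measured.items():
    print(f"{cpu}: {latency_reduction_pct(base, ours):.1f}% lower, "
          f"{speedup(base, ours):.1f}x faster")
```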
5.2 Speech recognition
We now consider another well-known application: speech recognition.
Model architecture: We implement a bidirectional DeepSpeech2 architecture that employs stacked recurrent layers following the convolutional layers for speech recognition [1]. We extract Mel-frequency cepstral coefficients from the speech data in a 20ms feature extraction window. There are two CNN layers present prior to the recurrent layers and one connectionist temporal classification layer for decoding [41] after the recurrent layers. The width of the hidden state is 800, same as in [4, 42]. Each control gate contains one hidden layer with width 800.
Dataset: We obtain the results for the AN4 dataset [43]. It contains 948 training utterances and 130 testing utterances.
Training: We utilize a Nesterov SGD optimizer in our experiment. We use a batch size of 16 for training. We initialize the learning rate to
and decay it by 0.99 after every training epoch. We use a dropout ratio of 0.2 for the hidden layers in the HLSTM. We use batch normalization between recurrent layers. We use L2 regularization with a weight decay of
. We use word error rate (WER) as our evaluation criterion, same as in [4, 44, 42].
We adopt the model reported in [4] as our LSTM baseline. It contains five stacked LSTM layers with a hidden state width of 800. Then, we implement our methodology and compare our results for GPU and CPU inference as follows.
5.2.1 Synthesized models for GPU inference
We summarize our results for GPU inference in Table IV. Latency values indicate the average instance latency over the test set with a batch size of 16. In Table IV, we also break down the changes in model characteristics throughout the training flow to separate the performance gains at each training step:
Table IV
GPU platform   Step  Model                 #Layers  Dimension  #Params.  WER(%)  Latency
Nvidia P2000   -     Baseline LSTM [4]     5        800/800    50.4M     12.90   35.87ms*
Nvidia P2000   (a)   HLSTM+wg              3        800/800    27.1M     8.39    32.13ms
Nvidia P2000   (b)   HLSTM+wg+rcp          3        626/626    18.2M     10.29   22.55ms
Nvidia P2000   (c)   HLSTM+wg+rcp+rcg      3        644/644    18.3M     9.44    20.79ms
Nvidia P2000   (d)   HLSTM+wg+rcp+rcg+wp   3        644/644    8.1M      9.97    20.79ms
Nvidia P100    -     Baseline LSTM [4]     5        800/800    50.4M     12.90   24.04ms*
Nvidia P100    (a)   HLSTM+wg              3        800/800    27.1M     8.39    22.77ms
Nvidia P100    (b)   HLSTM+wg+rcp          3        626/626    18.2M     10.30   19.61ms
Nvidia P100    (c)   HLSTM+wg+rcp+rcg      3        640/640    18.3M     10.08   17.70ms
Nvidia P100    (d)   HLSTM+wg+rcp+rcg+wp   3        640/640    7.2M      10.25   17.70ms
Nvidia V100    -     Baseline LSTM [4]     5        800/800    50.4M     12.90   19.35ms*
Nvidia V100    (a)   HLSTM+wg              3        800/800    27.1M     8.39    17.67ms
Nvidia V100    (b)   HLSTM+wg+rcp          3        626/626    18.2M     10.30   17.32ms
Nvidia V100    (c)   HLSTM+wg+rcp+rcg      3        640/640    18.3M     10.08   13.99ms
Nvidia V100    (d)   HLSTM+wg+rcp+rcg+wp   3        640/640    7.2M      10.25   13.99ms
* Measured based on our implementation of the exact same configuration as reported in the paper.
Table V (entries with three values are listed as P2000 / P100 / V100)
Model                       #Params.                                   WER(%)                 Latency
Lin et al. [4] (baseline)   50.4M                                      12.90                  35.87ms* / 24.04ms* / 19.35ms*
Alistarh et al. [44]        13.0M                                      18.85                  -
Naren [42]                  37.8M                                      10.52                  -
Dai, Yin, and Jha [18]      2.6M                                       10.37                  32.13ms / 22.77ms / 17.67ms
This work                   8.1M (6.2×) / 7.2M (7.0×) / 7.2M (7.0×)    9.97 / 10.25 / 10.25   20.79ms (1.7×) / 17.70ms (1.4×) / 13.99ms (1.4×)
Step (a): The seed architecture has a 50% sparsity level. In the weight growth phase, we use a growth ratio of 10% for the first six training epochs. This enables the network to reach an 8.39% WER with only 62% of its available connections, i.e., at a 38% sparsity level.
Step (b): We also adopt equal row and column pruning ratios of 20%, and update them using the same method as in Step (b) of the language modeling application. This trims down the dimension of each weight matrix from 800/800 to 626/626, thus reducing the measured inference latency by 2% to 30% across the three targeted GPUs.
Step (c): Unlike in the case of language modeling, this step unveils different LHPs for the P2000 and P100 GPUs: it recovers the network into a 644/644 LHP for the P2000 GPU and a 640/640 LHP for the P100 GPU, as listed in Table IV. LHE exploration for speech recognition enables an additional 0.22% to 0.85% WER reduction, jointly with a latency reduction of 9.2% to 19.2%.
Step (d): We initialize the weight pruning ratio to 70% and update it using the same rule as in Step (d) in the language modeling application. This step reduces the number of network parameters by 2.3× to 2.5×.
We compare our final inference models with relevant work in Table V. Our results outperform most of the previous work from all three design perspectives. Though containing more parameters than the models presented in [18], our models achieve higher accuracy and deliver substantial inference speedups. Relative to the conventional LSTM baseline [4], our method reduces the measured inference latency by 1.7× (1.4×/1.4×) on the P2000 (P100/V100) GPU, while simultaneously reducing the number of parameters by 6.2× (7.0×/7.0×), and WER from 12.90% to 9.97% (10.25%/10.25%).
Table VI
CPU platform            Baseline LSTM latency  This work  Speedup factor
Intel Xeon E5-2682 v4   179.0ms                75.3ms     2.4×
Intel Xeon Gold 5118    202.1ms                93.3ms     2.2×
Intel Xeon Broadwell    146.7ms                66.3ms     2.2×
Baseline latency values are measured based on our implementation of the exact same configuration as reported in [4].
5.2.2 Synthesized models for CPU inference
We also exploit weight sparsity for speech recognition on CPUs. Augmented by Intel MKL, weight sparsity offers substantial memory and latency reductions at runtime. Similar to language modeling, we skip the dimension reduction process (i.e., Steps (b) and (c)) to fully exploit the potential of weight sparsity. Without dimension reduction, the sparsity level reaches 94.2%, which lets us fully exploit sparsity acceleration, as opposed to a sparsity level of 88.8% with dimension reduction, which undermines the benefits of MKL. This yields an additional 1.4× latency and 2.8× parameter reduction. Our final CPU inference model contains only 2.6M parameters as opposed to the baseline LSTM model that has 50.4M parameters (19.4× compression ratio). It has a WER of 10.37%, which is 2.53% lower than that of the LSTM baseline.
We next compare the latency of the final inference models on the CPUs in Table VI. Relative to the LSTM baseline for speech recognition, we reduce the inference latency by 57.9% (2.4×), 53.8% (2.2×), and 54.8% (2.2×) on Intel Xeon E5-2682 v4, Gold 5118, and Broadwell CPUs, respectively. Sparsity-driven MKL acceleration contributes approximately 2× speedup, whereas utilizing HLSTM cells contributes the remaining 1.1× speedup.
6 Conclusions
In this work, we proposed a hardware-guided symbiotic training methodology for compact, accurate, yet execution-efficient inference models. By leveraging hardware-impacted LHE and multi-granular grow-and-prune algorithms, we were able to reduce LSTM latency while increasing its accuracy. We evaluated our algorithms on the language modeling and speech recognition applications. Relative to the traditional stacked LSTM architecture obtained for the Penn Treebank dataset, we reduced the number of parameters by 18.0×, and measured runtime latency by 2.1×, 1.8×, and 2.4× on the Nvidia P2000, P100, and V100 GPUs, respectively, without any accuracy degradation. For CPU inference, we reduced the number of parameters by 30.5×, and measured runtime latency by 5.1×, 5.2×, and 4.1× on the Intel Xeon E5-2682 v4, Gold 5118, and Broadwell CPUs, respectively, without any accuracy degradation. Relative to the DeepSpeech2 architecture obtained for the AN4 dataset, we reduced the number of parameters by 7.0×, WER from 12.9% to 9.9%, and measured runtime latency by 1.7×, 1.4×, and 1.4× on the Nvidia P2000, P100, and V100 GPUs, respectively. We also reduced the number of parameters by 19.4×, WER from 12.9% to 10.4%, and measured runtime latency by 2.4×, 2.2×, and 2.2× on the Intel Xeon E5-2682 v4, Gold 5118, and Broadwell CPUs, respectively. Thus, our method yields compact, accurate, yet execution-efficient inference models.
References

[1] D. Amodei, R. Anubhai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, J. Chen, M. Chrzanowski, A. Coates, G. Diamos, E. Elsen, J. Engel, L. Fan, C. Fougner, T. Han, A. Hannun, B. Jun, P. LeGresley, L. Lin, S. Narang, A. Ng, S. Ozair, R. Prenger, J. Raiman, S. Satheesh, D. Seetapun, S. Sengupta, Y. Wang, Z. Wang, C. Wang, B. Xiao, D. Yogatama, J. Zhan, and Z. Zhu, “Deep Speech 2: End-to-end speech recognition in English and Mandarin,” in Proc. Int. Conf. Machine Learning, vol. 48, 2016, pp. 173–182.
[2] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Proc. Advances in Neural Information Processing Systems, 2014, pp. 3104–3112.

[3] B. Ballinger, J. Hsieh, A. Singh, N. Sohoni, J. Wang, G. H. Tison, G. M. Marcus, J. M. Sanchez, C. Maguire, J. E. Olgin, and M. J. Pletcher, “DeepHeart: Semi-supervised sequence learning for cardiovascular risk prediction,” in Proc. AAAI Conf. Artificial Intelligence, 2018, pp. 2079–2086.
[4] Y. Lin, S. Han, H. Mao, Y. Wang, and W. J. Dally, “Deep gradient compression: Reducing the communication bandwidth for distributed training,” arXiv preprint arXiv:1712.01887, 2017.
[5] W. Wen, Y. He, S. Rajbhandari, W. Wang, F. Liu, B. Hu, Y. Chen, and H. Li, “Learning intrinsic sparse structures within long short-term memory,” arXiv preprint arXiv:1709.05027, 2017.
[6] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[7] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, and A. Y. Ng, “Deep Speech: Scaling up end-to-end speech recognition,” arXiv preprint arXiv:1412.5567, 2014.
[8] H. Yin, Z. Wang, and N. K. Jha, “A hierarchical inference model for Internet-of-Things,” IEEE Trans. Multi-Scale Computing Systems, vol. 4, pp. 260–271, 2018.
 [9] C. Zhu, S. Han, H. Mao, and W. J. Dally, “Trained ternary quantization,” arXiv preprint arXiv:1612.01064, 2016.

[10] B. Wu, A. Wan, X. Yue, P. Jin, S. Zhao, N. Golmant, A. Gholaminejad, J. Gonzalez, and K. Keutzer, “Shift: A zero flop, zero parameter alternative to spatial convolutions,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2018, pp. 9127–9135.
[11] A. O. Akmandor, H. Yin, and N. K. Jha, “Smart, secure, yet energy-efficient, Internet-of-Things sensors,” IEEE Trans. Multi-Scale Computing Systems, 2018.
[12] H. Yin, B. H. Gwee, Z. Lin, A. Kumar, S. G. Razul, and C. M. S. See, “Novel real-time system design for floating-point sub-Nyquist multi-coset signal blind reconstruction,” in Proc. IEEE Int. Symp. Circuits and Systems, May 2015, pp. 954–957.
[13] X. Dai, H. Yin, and N. K. Jha, “NeST: A neural network synthesis tool based on a grow-and-prune paradigm,” arXiv preprint arXiv:1711.02017, 2017.
 [14] S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and connections for efficient neural network,” in Proc. Advances in Neural Information Processing Systems, 2015, pp. 1135–1143.
[15] T. Zhang, K. Zhang, S. Ye, J. Li, J. Tang, W. Wen, X. Lin, M. Fardad, and Y. Wang, “ADAM-ADMM: A unified, systematic framework of structured weight pruning for DNNs,” arXiv preprint arXiv:1807.11091, 2018.
[16] S. Han, J. Kang, H. Mao, Y. Hu, X. Li, Y. Li, D. Xie, H. Luo, S. Yao, Y. Wang, H. Yang, and W. J. Dally, “ESE: Efficient speech recognition engine with sparse LSTM on FPGA,” in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays, 2017, pp. 75–84.
 [17] S. Narang, E. Elsen, G. Diamos, and S. Sengupta, “Exploring sparsity in recurrent neural networks,” arXiv preprint arXiv:1704.05119, 2017.
 [18] X. Dai, H. Yin, and N. K. Jha, “Grow and prune compact, fast, and accurate LSTMs,” arXiv preprint arXiv:1805.11797, 2018.
 [19] J. Yu, A. Lukefahr, D. Palframan, G. Dasika, R. Das, and S. Mahlke, “Scalpel: Customizing DNN pruning to the underlying hardware parallelism,” ACM SIGARCH Computer Architecture News, vol. 45, no. 2, pp. 548–560, 2017.
[20] K. Cho, B. Van Merriënboer, D. Bahdanau, and Y. Bengio, “On the properties of neural machine translation: Encoder-decoder approaches,” arXiv preprint arXiv:1409.1259, 2014.
[21] J. Bradbury, S. Merity, C. Xiong, and R. Socher, “Quasi-recurrent neural networks,” arXiv preprint arXiv:1611.01576, 2016.
[22] P. Hill, A. Jain, M. Hill, B. Zamirai, C.-H. Hsu, M. A. Laurenzano, S. Mahlke, L. Tang, and J. Mars, “DeftNN: Addressing bottlenecks for DNN execution on GPUs via synapse vector elimination and near-compute data fission,” in Proc. IEEE/ACM Int. Symp. Microarchitecture, 2017, pp. 786–799.
[23] T.-J. Yang, Y.-H. Chen, and V. Sze, “Designing energy-efficient convolutional neural networks using energy-aware pruning,” arXiv preprint arXiv:1611.05128, 2016.
[24] X. Dai, P. Zhang, B. Wu, H. Yin, F. Sun, Y. Wang, M. Dukhan, Y. Hu, Y. Wu, Y. Jia, P. Vajda, M. Uyttendaele, and N. K. Jha, “ChamNet: Towards efficient network design through platform-aware model adaptation,” arXiv preprint arXiv:1812.08934, 2018.
[25] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “Inverted residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation,” arXiv preprint arXiv:1801.04381, 2018.
[26] M. Tan, B. Chen, R. Pang, V. Vasudevan, and Q. V. Le, “MnasNet: Platform-aware neural architecture search for mobile,” arXiv preprint arXiv:1807.11626, 2018.
[27] N. Ma, X. Zhang, H.-T. Zheng, and J. Sun, “ShuffleNet V2: Practical guidelines for efficient CNN architecture design,” arXiv preprint arXiv:1807.11164, 2018.
[28] S. Narang, E. Undersander, and G. Diamos, “Block-sparse recurrent neural networks,” arXiv preprint arXiv:1711.02782, 2017.
 [29] “cuSPARSE library,” NVIDIA Corporation, Santa Clara, California, 2018.
 [30] “cuBLAS library,” NVIDIA Corporation, Santa Clara, California, 2018.
 [31] G. Chen, B. Wu, D. Li, and X. Shen, “PORPLE: An extensible optimizer for portable data placement on GPU,” in Proc. IEEE Int. Symp. Microarchitecture, 2014, pp. 88–100.
[32] G. Chen, Y. Zhao, X. Shen, and H. Zhou, “EffiSha: A software framework for enabling efficient preemptive scheduling of GPU,” ACM SIGPLAN Notices, vol. 52, no. 8, pp. 3–16, 2017.
 [33] “Intel math kernel library,” Intel Corporation, Santa Clara, California, 2018.
 [34] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in PyTorch,” NIPS Workshop Autodiff, 2017.
 [35] M. Zhu and S. Gupta, “To prune, or not to prune: Exploring the efficacy of pruning for model compression,” arXiv preprint arXiv:1710.01878, 2017.
 [36] M. Marcus, G. Kim, M. A. Marcinkiewicz, R. MacIntyre, A. Bies, M. Ferguson, K. Katz, and B. Schasberger, “The Penn Treebank: Annotating predicate argument structure,” in Proc. Workshop Human Language Technology, 1994, pp. 114–119.
 [37] S. Merity, N. S. Keskar, and R. Socher, “An analysis of neural language modeling at multiple scales,” arXiv preprint arXiv:1803.08240, 2018.
[38] S. Sukhbaatar, A. Szlam, J. Weston, and R. Fergus, “End-to-end memory networks,” in Proc. Advances in Neural Information Processing Systems, 2015, pp. 2440–2448.
[39] T. Mikolov, A. Deoras, S. Kombrink, L. Burget, and J. Černocký, “Empirical evaluation and combination of advanced language modeling techniques,” in Proc. Annual Conf. Int. Speech Communication Association, 2011, pp. 605–608.
 [40] W. Zaremba, I. Sutskever, and O. Vinyals, “Recurrent neural network regularization,” arXiv preprint arXiv:1409.2329, 2014.

[41] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks,” in Proc. Int. Conf. Machine Learning, 2006, pp. 369–376.
[42] S. Naren, “Speech recognition using DeepSpeech2,” https://github.com/SeanNaren/deepspeech.pytorch/releases, 2018.
 [43] A. Acero, “Acoustical and environmental robustness in automatic speech recognition,” in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, 1990.
[44] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, “QSGD: Communication-efficient SGD via gradient quantization and encoding,” in Proc. Advances in Neural Information Processing Systems, 2017, pp. 1709–1720.