1 Introduction
Traditionally, network pruning methods [1, 2] have been employed to obtain sparse neural network models that support edge devices with limited resources [3]. However, today’s machine learning production models often target a variety of consumer hardware capabilities. Across the wide spectrum of mobile devices alone, latency for the same machine learning model can differ by multiple orders of magnitude [4]. The situation becomes more complicated when systems such as home speakers and cars are taken into consideration. As an additional compounding factor, different software applications might have different latency requirements. For example, despite likely using the same architecture, a speech recognizer for video conference captioning requires higher synchronicity than one that aids online video subtitle generation.
Ideally, different-sized models with varying sparsity levels should be trained to target every single device type. However, this scheme is impractical given the myriad of existing devices. Alternatively, one could train a few sparse models targeting only typical hardware configurations. In addition to the maintenance overhead of an offline device sparsity table, this strategy will necessarily under- or overutilize resources on the heterogeneous long tail of devices. Moreover, even on a single device, resource availability usually changes dynamically as concurrent activities vary. Models with a static sparsity level will hence likely result in suboptimal resource usage.
To support such diverse sets of scenarios, we propose dynamic sparsity neural networks (DSNN). After training, a single DSNN model is able to execute at any sparsity level at runtime with no or insignificant loss in accuracy compared to individually trained single sparsity networks. With such a model, we can dynamically adjust its sparsity according to the device capability and resource availability, thereby achieving an optimal accuracy-latency tradeoff with minimal memory footprint.
DSNN was inspired by recent work [5, 6] which showed that even for untrained random networks, there exist subnetworks of arbitrary sparsity levels that achieve very high quality. Therefore, trained networks should also simultaneously contain powerful subnetworks at different sparsity levels.
Methodologically, DSNN builds upon the recent line of work on slimmable neural networks (SNN) [7, 8], which were developed to tackle a similar issue of model deployment across heterogeneous devices. However, these models are designed only for convolutional neural networks, which restricts their applicability in many domains and tasks. We demonstrate in Section 5.2 that a naive generalization of SNN to the task of automatic speech recognition (ASR) shows poor performance.
DSNN, on the other hand, is a sparsity-based extension of SNN that is applicable to any weight-based neural network. In this paper, we choose to focus on the task of ASR due to an increasing demand for on-device ASR [9, 3]. We show that a single DSNN model can match, or sometimes exceed, the performance of individually trained single sparsity networks across a series of sparsity levels (Section 5.1).
DSNN models contribute to practical machine learning systems in two ways. First, the same DSNN model can be deployed to multiple hardware types with different resource and energy constraints. This greatly reduces both the training overhead and the management complexity of deployment processes. Second, as DSNN models exceed the performance of single sparsity models at high sparsity levels, their training scheme also constitutes an effective approach to improving sparse model performance.
2 Related Work
Overparameterization is a commonly addressed issue of neural networks [10, 11]. To deal with this issue, model pruning methods have been developed to remove unimportant connections in the weight matrices of neural network models. Optimal Brain Damage [12] and Optimal Brain Surgeon [13] first proposed pruning methods based on second-order derivatives, and [1] demonstrated a magnitude-based pruning approach on which we base our work. The resulting pruned models contain only sparse structures, allowing them to run efficiently at inference time while maintaining performance [12, 1]. Many studies have demonstrated the empirical strength of such sparse networks [1, 3] and examined their theoretical properties [14, 15, 5, 6].
Recently, [5] and [6] showed that untrained random networks contain subnetworks at arbitrary sparsity levels that perform well without training. The best of these subnetworks, usually at around 50% sparsity, can perform as well as the full (i.e., zero-sparsity) model on specific datasets. Our work also tries to find a single network containing multiple high-quality subnetworks, but we allow model training while requiring these subnetworks to match the quality of individually trained single sparsity networks.
Dynamic neural networks are a family of models that optimize the runtime tradeoff between accuracy and efficiency using dynamic inference graphs [16, 17, 7, 8]. These models often allow selective execution, which is desirable when the target inference platforms vary in their constraints. To our knowledge, our proposed DSNN is the first such model to achieve this optimization using sparse networks.
3 Dynamic Sparsity Neural Networks
In this section, we first provide a formulation of dynamic sparsity neural networks (DSNN) and justify it using previous studies. We then introduce the DSNN training algorithm. Finally, we sketch several key distinctions from slimmable neural networks, our methodological precursor.
3.1 Model Formulation
DSNN aims to train a supernetwork $\mathcal{M}$ such that, given an arbitrary sparsity level $s$ in a range $[s_{\min}, s_{\max}]$, we can find a subnetwork $\mathcal{M}_s$ with only a subset of the connections in $\mathcal{M}$, without further fine-tuning or retraining. This subnetwork should have the same or better quality than an individually trained single sparsity model obtained through traditional pruning algorithms at the same sparsity level $s$. With such a supernetwork, we are able to dynamically switch to subnetworks with different sparsity levels during deployment, optimizing for hardware capacities and application latency constraints.
While it is nontrivial to theoretically prove the existence of such supernetworks, empirical evidence does suggest they are likely to exist. Zhou et al. [5] and Ramanujan et al. [6] showed that an untrained random model can simultaneously contain subnetworks that perform well. At very specific sparsity levels, their models can perform as well as individually trained dense models. Slimmable neural networks [7, 8] demonstrated that for ImageNet [18] classification, a trained convolutional supernetwork can have structured subnetworks with similar or better performance than individually trained networks with the same architecture. More recently, BigNAS [19] trained a single set of shared weights on ImageNet, which are used to obtain child models via a simple coarse-to-fine architecture selection heuristic. All these works hint at a possible supernetwork that encompasses multiple high-quality subnetworks. They hence encourage us to explore DSNN, a general sparsity-based supernetwork.
3.2 Approach
With the likely existence of such a supernetwork, the question becomes how we can efficiently find it. In order for a single model to execute at arbitrary sparsity levels, we jointly train the same network with a variety of sparsity levels. Specifically, at each training step, we choose a target sparsity level at which to train the model.
We leverage “the sandwich rule” [8], which states that the quality of models of varying sizes is bounded by that of the largest and the smallest model. Formally, given two sparsity levels $s_1$ and $s_2$, as long as the pruning function guarantees
$$s_1 \le s_2 \;\Longrightarrow\; C_{s_2} \subseteq C_{s_1} \qquad (1)$$
where $C_{s_1}$ and $C_{s_2}$ are the sets of connections remaining after pruning at the respective levels (Footnote 1: This is a weak condition and many pruning functions satisfy it, including magnitude-based pruning, which is used in this work), then the residual error of a network with a smaller sparsity level should be no higher than that of one with a larger sparsity level. Then, by extension, writing $E(s)$ for the residual error at sparsity level $s$,
$$E(s_{\min}) \le E(s) \le E(s_{\max}) \quad \text{for all } s_{\min} \le s \le s_{\max} \qquad (2)$$
This formulation allows us to focus on training two endpoint models with a minimum and a maximum sparsity level, $s_{\min}$ and $s_{\max}$. In addition, we also sample intermediate sparsity levels to allow better generalizability between the endpoints.
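The nesting condition required of the pruning function can be checked directly for magnitude-based pruning. The following is a minimal sketch in plain Python, not the implementation used in this work; the helper name `kept_connections` and the toy weight values are illustrative assumptions.

```python
def kept_connections(weights, sparsity):
    """Indices that survive when the smallest-magnitude weights are zeroed."""
    n_prune = round(len(weights) * sparsity)
    # Sort indices by |w|; ties are broken by index so the order is deterministic.
    order = sorted(range(len(weights)), key=lambda i: (abs(weights[i]), i))
    return set(order[n_prune:])

# Hypothetical toy weights, chosen only for illustration.
weights = [0.9, -0.1, 0.4, -0.7, 0.05, 0.3, -0.2, 0.6, 0.0, -0.5]

kept_30 = kept_connections(weights, 0.3)
kept_60 = kept_connections(weights, 0.6)
kept_90 = kept_connections(weights, 0.9)

# A higher sparsity level keeps a subset of what a lower level keeps.
print(kept_90 <= kept_60 <= kept_30)
```

Because every level prunes a prefix of the same magnitude ordering, the surviving sets are nested by construction, which is the property the sandwich rule relies on.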
We take inspiration from regular sparse model training, where it is common practice to pretrain the full model for some number of steps before pruning begins [20, 21]. This gives a high-performance first-step model as the initialization for sparse models to prune from. For DSNN, because we choose a minimum sparsity level of 0% (Section 4.3), the full model is already present during the training algorithm as the minimally sparse endpoint. Nevertheless, we find empirically that including the pretraining stage is still crucial for DSNN quality (Section 5.3).
We sketch the DSNN training procedure in Algorithm 1. For each iteration, we alternate among the minimum, intermediate, and maximum sparsity levels for weight masking and execute forward and backward propagation. However, this alternation during training could be a source of instability. To deal with this, instead of updating model parameters immediately after backward propagation in each training step, we accumulate the parameter gradients across training steps and only do one parameter update per iteration.
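The iteration described above can be sketched as follows. This is an illustrative toy in plain Python rather than the actual training code: it assumes a linear model with squared error, plain gradient descent instead of Adam, one uniformly sampled intermediate sparsity level, and averaging of the accumulated gradients. None of these choices are specified in the text, and all names other than `sparsify` are hypothetical.

```python
import random

def sparsify(weights, sparsity):
    """Magnitude mask: 0.0 for the `sparsity` fraction of smallest |w|, else 1.0."""
    n_prune = round(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: (abs(weights[i]), i))
    pruned = set(order[:n_prune])
    return [0.0 if i in pruned else 1.0 for i in range(len(weights))]

def predict(weights, mask, x):
    return sum(m * w * xi for m, w, xi in zip(mask, weights, x))

def loss(weights, mask, data):
    return sum((predict(weights, mask, x) - y) ** 2 for x, y in data) / len(data)

def gradient(weights, mask, data):
    """Gradient of `loss` w.r.t. the weights; masked weights get zero gradient."""
    grad = [0.0] * len(weights)
    for x, y in data:
        err = predict(weights, mask, x) - y
        for i, xi in enumerate(x):
            grad[i] += 2.0 * err * mask[i] * xi / len(data)
    return grad

def dsnn_iteration(weights, data, s_min, s_max, n_intermediate, lr, rng):
    """One iteration: visit every sparsity level, then apply a single update."""
    levels = [s_min]
    levels += [rng.uniform(s_min, s_max) for _ in range(n_intermediate)]
    levels.append(s_max)
    accumulated = [0.0] * len(weights)
    for s in levels:
        mask = sparsify(weights, s)  # recomputed, so pruned weights can recover
        grad = gradient(weights, mask, data)
        accumulated = [a + g for a, g in zip(accumulated, grad)]
    # Assumption: the accumulated gradients are averaged before the update.
    return [w - lr * a / len(levels) for w, a in zip(weights, accumulated)]

rng = random.Random(0)
true_weights = [2.0, -1.5, 1.0, 0.5, 0.0, 0.0, 0.0, 0.0]
data = []
for _ in range(64):
    x = [rng.gauss(0.0, 1.0) for _ in true_weights]
    data.append((x, sum(w * xi for w, xi in zip(true_weights, x))))

weights = [0.1] * len(true_weights)
dense_mask = [1.0] * len(weights)
initial_loss = loss(weights, dense_mask, data)
for _ in range(200):
    weights = dsnn_iteration(weights, data, s_min=0.0, s_max=0.5,
                             n_intermediate=1, lr=0.05, rng=rng)
final_loss = loss(weights, dense_mask, data)
print(initial_loss, final_loss)
```

The point of the sketch is the control flow: gradients from the minimum, intermediate, and maximum sparsity levels are summed first, and the parameters change only once per iteration.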
3.3 Comparison with Slimmable Neural Networks
Our model is similar to slimmable neural networks (SNN) [7, 8] in that both allow dynamic inference graphs, albeit with several key distinctions. First, SNN shrinks models by truncating convolutional channels, while DSNN obtains smaller model variants using model pruning. This allows DSNN to be easily applied to more domains and tasks. The sparse structure also allows DSNN to preserve the high dimensionality of the input and output spaces, although the mapping from input to output is low-dimensional. Second, we may consider a simple generalization of SNN that prunes whole nodes in a network instead of convolutional channels; this restricts SNN to always use fully connected subnetworks. In contrast, DSNN uses an edge-pruning approach, the common practice for model pruning, and its lack of a predefined network structure allows greater modeling flexibility. SNN is therefore a special case of DSNN whose sparsity patterns are skewed, with all connections to the last channels masked to zero. See Figure 1 for an illustration.
4 Experimental Setup
In this section we describe our experimental settings.
4.1 Task and Dataset
While our approach is widely applicable to all weight-based neural networks, we choose to conduct experiments in automatic speech recognition (ASR) due to an increased interest in on-device ASR [9, 3]. We perform experiments on the LibriSpeech dataset [22], which consists of read speech from audio books. We merge its three training sets while maintaining the “clean” and “other” distinction corresponding to low versus high noise conditions. The final dataset contains 960.9 hours of training data, 5.4/5.3 hours of clean/other development data, and 5.4/5.1 hours of clean/other test data.
4.2 Model Architecture and Settings
We use an attentive encoder-decoder architecture as the base model [23]. The encoder consists of 2 layers of 3x3 convolutional neural networks, 3 layers of projection matrices with 2048 output units, and 4 layers of bidirectional LSTMs [24] with 2048 output units. The decoder consists of 2 layers of LSTMs with 1024 output units, 1 attention layer with 128 hidden units, and 1 fully connected layer. The model contains 184M parameters. The quality of our base model generally matches that of the similar architecture in [25] (Table 1).
Our models are implemented in TensorFlow [26] and trained on 8x8 Tensor Processing Units (TPU) with a batch size of 2048. We use a constant learning rate of 1e-3 after warmup with the Adam optimizer [27].
4.3 Model Pruning
We use magnitude-based pruning which, given a target sparsity level $s$ and a weight matrix $W$, zeros out the fraction $s$ of elements in $W$ with the smallest absolute values by applying a binary mask over $W$. We employ block pruning [28] with a fixed block size to leverage hardware resources more efficiently. Instead of the smallest individual elements, we zero out the smallest blocks in $W$. This procedure corresponds to the sparsify function in Algorithm 1. We only prune the recurrent connections, which constitute 64.57% of the model parameters. (Footnote 2: When we indicate a sparsity level in this paper, we refer to the sparsity level of these recurrent weights only.) We also allow the mask to update at each iteration, which enables a pruned weight to be recovered if, at a later step, its magnitude is greater than that of some surviving weight, similar to [29, 30].
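A block-level magnitude mask of this kind can be sketched as follows. This is a minimal illustration in plain Python, not the sparsify function from Algorithm 1: the block shape, the toy matrix, and the choice to score each block by its largest absolute weight are assumptions made here for concreteness, and the function name `block_magnitude_mask` is hypothetical.

```python
def block_magnitude_mask(matrix, sparsity, block_rows, block_cols):
    """Binary mask that zeros the lowest-scoring blocks of a 2-D weight matrix."""
    rows, cols = len(matrix), len(matrix[0])
    assert rows % block_rows == 0 and cols % block_cols == 0
    blocks = []
    for r0 in range(0, rows, block_rows):
        for c0 in range(0, cols, block_cols):
            # Assumption: a block is scored by its largest absolute weight.
            score = max(abs(matrix[r][c])
                        for r in range(r0, r0 + block_rows)
                        for c in range(c0, c0 + block_cols))
            blocks.append((score, r0, c0))
    blocks.sort()  # lowest-scoring blocks first
    n_prune = round(len(blocks) * sparsity)
    mask = [[1] * cols for _ in range(rows)]
    for _, r0, c0 in blocks[:n_prune]:
        for r in range(r0, r0 + block_rows):
            for c in range(c0, c0 + block_cols):
                mask[r][c] = 0
    return mask

# Hypothetical 4x4 weight matrix pruned in 2x2 blocks at 50% sparsity.
W = [[0.9, 0.1, 0.0, 0.1],
     [0.2, 0.8, 0.1, 0.0],
     [0.05, 0.0, 0.7, 0.3],
     [0.1, 0.02, 0.4, 0.6]]
mask = block_magnitude_mask(W, 0.5, 2, 2)
pruned = [[w * m for w, m in zip(w_row, m_row)] for w_row, m_row in zip(W, mask)]
for row in mask:
    print(row)
```

Recomputing the mask from the current weights at every iteration is what lets a previously pruned block return once its score overtakes that of a surviving block.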
For both baseline single sparsity models (Section 5.1) and DSNN, we first train a zero-sparsity network until convergence (around 200k steps). When training this network, we maintain exponential moving averages of all model parameters. When the pruning stage begins, we load all model variables from these averages. The baseline models use a constant pruning schedule that fixes the sparsity level. The DSNN training algorithm does not require a pruning schedule.
We train DSNN with a minimum sparsity level of 0% (i.e., the full model) and a maximum sparsity level of 90%. During each iteration, in addition to the two endpoint sparsity levels, we also train at intermediate sparsity levels sampled from this range.
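Because the reported sparsity levels refer to the recurrent weights only, the fraction of all model parameters that is zeroed is lower than the nominal level. The following back-of-the-envelope calculation uses the 184M parameter count and the 64.57% recurrent share stated above, and assumes that every recurrent matrix is pruned to exactly the nominal level; the resulting figures are approximations, not reported results.

```python
TOTAL_PARAMS = 184e6         # rounded model size from Section 4.2
RECURRENT_FRACTION = 0.6457  # share of parameters that is subject to pruning

def overall_sparsity(recurrent_sparsity):
    """Fraction of all parameters zeroed when only recurrent weights are pruned."""
    return RECURRENT_FRACTION * recurrent_sparsity

for s in (0.0, 0.3, 0.6, 0.9):
    remaining = TOTAL_PARAMS * (1.0 - overall_sparsity(s))
    print(f"recurrent sparsity {s:.0%}: overall {overall_sparsity(s):.1%}, "
          f"about {remaining / 1e6:.0f}M non-zero parameters")
```

Under these assumptions, the 90% setting corresponds to roughly 58% sparsity over the whole model.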
5 Results and Discussion
In this section we present our experimental results. We evaluate the models on the minimum and maximum sparsity levels, 0% and 90%, and two intermediate sparsity levels, 30% and 60%, that test the model generalizability between the endpoints.
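Table 1: Word error rate (WER, %) on LibriSpeech.

| Sparsity Level | Model Type | Dev clean | Dev other | Test clean | Test other |
| --- | --- | --- | --- | --- | --- |
| 0% | Zeyer et al. | 3.54 | 11.52 | 3.82 | 12.76 |
| 0% | Single | 3.6 | 11.8 | 3.9 | 11.8 |
| 0% | SNN | 4.7 | 14.2 | 4.8 | 14.5 |
| 0% | DSNN | 3.7 | 12.3 | 4.0 | 12.4 |
| 30% | Single | 3.7 | 12.3 | 4.0 | 12.2 |
| 30% | SNN | 4.7 | 14.4 | 4.9 | 14.8 |
| 30% | DSNN | 3.7 | 12.3 | 4.0 | 12.3 |
| 60% | Single | 3.8 | 12.5 | 4.2 | 12.6 |
| 60% | SNN | 5.9 | 16.8 | 6.3 | 17.5 |
| 60% | DSNN | 3.7 | 12.3 | 4.0 | 12.3 |
| 90% | Single | 4.2 | 13.9 | 4.5 | 14.1 |
| 90% | SNN | 8.0 | 20.3 | 8.6 | 21.5 |
| 90% | DSNN | 3.9 | 13.2 | 4.2 | 13.2 |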
5.1 Comparison with Single Sparsity Networks
As argued in Section 3, the most important success criterion of DSNN is to match individually trained single sparsity networks in quality. If the DSNN model significantly degraded model quality, it would not be useful, especially in real-world scenarios where quality is prioritized. We therefore conducted baseline experiments with single sparsity networks and show the results in Table 1.
The dynamic sparsity model generally matches the quality of single sparsity networks. Additionally, quality declines much more slowly with increased sparsity in DSNN than in single sparsity networks. The DSNN quality slightly trails that of single sparsity networks at 0% but is the best at 60% and 90% sparsity. We hypothesize that this is because the sparser networks in DSNN have thinner structures whose connections are, in expectation, trained more frequently in each step. On the other hand, connections with smaller weights are trained more sporadically, receiving relatively less focus. Notably, a similar trend is observed for SNN in [7, 8], where the quality difference relative to individually trained networks also increases as fewer convolutional channels are used in inference, growing less negative or more positive.
In practice, even in highly quality-driven scenarios where the slight quality gap for denser models is unacceptable, one can still deploy DSNN only at high sparsity levels, complementing single sparsity networks that are used for the denser models.
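The trend discussed above can be read off Table 1 by subtracting the single sparsity WER from the DSNN WER at each sparsity level. A small sketch using the Test other column (the values are copied from Table 1; the variable names are illustrative):

```python
# Test-other WER (%) from Table 1, keyed by sparsity level in percent.
test_other = {
    0: {"single": 11.8, "dsnn": 12.4},
    30: {"single": 12.2, "dsnn": 12.3},
    60: {"single": 12.6, "dsnn": 12.3},
    90: {"single": 14.1, "dsnn": 13.2},
}

# Positive gap: DSNN is worse than the single sparsity model; negative: better.
gaps = {s: round(v["dsnn"] - v["single"], 1) for s, v in test_other.items()}

for sparsity, gap in gaps.items():
    print(f"{sparsity:>2}% sparsity: DSNN - Single = {gap:+.1f} WER")
```

On this column the gap shrinks monotonically with sparsity and changes sign between 30% and 60%, which is the crossover the deployment suggestion above relies on.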
5.2 Comparison with Slimmable Neural Networks
Slimmable neural networks (SNN) [7, 8] only vary the number of channels in convolutional networks. For each target width, the last channels (the ones with the largest indices) are removed. Although SNN is not directly applicable to arbitrary weight matrices, we experiment with a simple generalization of it as follows.
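Given a target sparsity level $s$ and a weight matrix $W \in \mathbb{R}^{d_1 \times \dots \times d_n}$, we apply a binary mask $M$ on $W$. For each dimension $i$, we select a threshold $t_i$ by
$$t_i = \left[ d_i \, (1 - s)^{1/n} \right] \qquad (3)$$
where $[\cdot]$ rounds to the nearest integer. We then generate the mask $M$ by
$$M_{j_1 \dots j_n} = \begin{cases} 1 & \text{if } j_i \le t_i \text{ for all } i \\ 0 & \text{otherwise} \end{cases} \qquad (4)$$
Finally we prune $W$ by
$$W' = W \odot M \qquad (5)$$
where $\odot$ denotes elementwise multiplication.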
Intuitively, we truncate the last rows in each dimension by an equal fraction, trying to make the resulting matrix have a sparsity close to the target sparsity. This is analogous to removing the last convolutional channels. (Footnote 3: This procedure does not guarantee an exact final sparsity level because the pre-rounded threshold is usually not an integer, but the larger the original matrix is, the closer the result will be. After all, model pruning is usually only applied to large weight matrices. In our experiments, the differences between the target sparsity levels and the resulting sparsity levels are always within 0.01%. We therefore neglect this difference when comparing results.)
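The truncation described above can be sketched for a two-dimensional weight matrix as follows. This is a hedged illustration in plain Python: it assumes that "an equal fraction" means each of the $n$ dimensions keeps a fraction $(1 - s)^{1/n}$ of its indices, which is one reading consistent with the description but is not confirmed by the text, and the function names are hypothetical.

```python
def truncation_thresholds(shape, sparsity):
    """Number of leading indices kept per dimension (assumed equal-fraction rule)."""
    keep_fraction = (1.0 - sparsity) ** (1.0 / len(shape))
    return [round(d * keep_fraction) for d in shape]

def truncation_mask(rows, cols, sparsity):
    """Binary mask keeping only the top-left block of a rows x cols matrix."""
    t_rows, t_cols = truncation_thresholds((rows, cols), sparsity)
    return [[1 if r < t_rows and c < t_cols else 0 for c in range(cols)]
            for r in range(rows)]

thresholds = truncation_thresholds((10, 10), 0.75)
mask = truncation_mask(10, 10, 0.75)
kept = sum(sum(row) for row in mask)
achieved_sparsity = 1.0 - kept / (10 * 10)
print(thresholds, kept, achieved_sparsity)
```

Unlike magnitude-based pruning, this mask depends only on the matrix shape and the target sparsity, never on the weight values, which is why the surviving subnetwork is always a smaller fully connected one.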
We compare SNN and DSNN quality in Table 1. We see that DSNN’s edge pruning approach significantly outperforms SNN’s node pruning approach.
5.3 Ablations
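Table 2: Ablation of pretraining and gradient accumulation (GA), in WER (%).

| Sparsity Level | Model Setting | Dev clean | Dev other | Test clean | Test other |
| --- | --- | --- | --- | --- | --- |
| 0% | Baseline DSNN | 4.2 | 12.9 | 4.4 | 13.1 |
| 0% | + pretraining | 3.8 | 12.3 | 4.1 | 12.3 |
| 0% | + GA | 3.7 | 12.3 | 4.0 | 12.4 |
| 30% | Baseline DSNN | 4.2 | 12.9 | 4.4 | 13.1 |
| 30% | + pretraining | 3.8 | 12.3 | 4.1 | 12.3 |
| 30% | + GA | 3.7 | 12.3 | 4.0 | 12.3 |
| 60% | Baseline DSNN | 4.2 | 12.9 | 4.4 | 13.2 |
| 60% | + pretraining | 3.8 | 12.3 | 4.1 | 12.3 |
| 60% | + GA | 3.7 | 12.3 | 4.0 | 12.3 |
| 90% | Baseline DSNN | 4.3 | 13.4 | 4.5 | 13.6 |
| 90% | + pretraining | 4.0 | 13.4 | 4.3 | 13.3 |
| 90% | + GA | 3.9 | 13.2 | 4.2 | 13.2 |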
We analyzed the effect of pretraining and gradient accumulation (GA) and show the ablation results in Table 2. Similar to previous model pruning work [20, 21], we find it important to pretrain the zero sparsity model for a certain number of steps before pruning begins. This pretraining stage uniformly improves the performance by up to 0.8% WER. This further confirms the importance of initialization for sparse model training: it helps even when the zero sparsity model is present during the DSNN training algorithm. The gradient accumulation technique also consistently improves the model performance across sparsity levels by stabilizing the training process.
6 Conclusion and Future Work
We presented a training scheme that allows a single trained model to switch its sparsity level at inference time. Given that its performance is on par with individually trained single sparsity networks, such a model can simultaneously support a variety of devices with different hardware capabilities and applications with diverse latency requirements. As it outperforms single sparsity models at high sparsity levels, DSNN also serves as a way to improve sparse model performance. Nevertheless, the DSNN model still trails single sparsity networks in quality at low sparsity levels. We leave closing this gap to future work.
In this work, we only considered models with all parameters pruned by the same fraction. However, components of a machine learning model are sometimes not equally important, and setting different sparsity levels for different weights may yield a higher quality model [31]. As each weight matrix is independently pruned in the DSNN training algorithm, DSNN is able to approximate the performance of individually trained networks with arbitrary sparsity configurations across weights. Combined with a greedy search algorithm, DSNN can be used to search for an optimal per-weight sparsity configuration, analogous to [32]. This can be an interesting future exploration.
References
 [1] S. Han, J. Pool, J. Tran, and W. J. Dally, “Learning both weights and connections for efficient neural networks,” 2015.
 [2] M. Zhu and S. Gupta, “To prune, or not to prune: exploring the efficacy of pruning for model compression,” arXiv preprint arXiv:1710.01878, 2017.
 [3] Y. Shangguan, J. Li, L. Qiao, R. Alvarez, and I. McGraw, “Optimizing speech recognition for the edge,” arXiv preprint arXiv:1909.12408, 2019.
 [4] A. Ignatov, R. Timofte, W. Chou, K. Wang, M. Wu, T. Hartley, and L. Van Gool, “Ai benchmark: Running deep neural networks on android smartphones,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 0–0.
 [5] H. Zhou, J. Lan, R. Liu, and J. Yosinski, “Deconstructing lottery tickets: Zeros, signs, and the supermask,” in Advances in Neural Information Processing Systems, 2019, pp. 3592–3602.
 [6] V. Ramanujan, M. Wortsman, A. Kembhavi, A. Farhadi, and M. Rastegari, “What’s hidden in a randomly weighted neural network?” 2019.
 [7] J. Yu, L. Yang, N. Xu, J. Yang, and T. Huang, “Slimmable neural networks,” arXiv preprint arXiv:1812.08928, 2018.
 [8] J. Yu and T. Huang, “Universally slimmable networks and improved training techniques,” arXiv preprint arXiv:1903.05134, 2019.
 [9] Y. He, T. N. Sainath, R. Prabhavalkar, I. McGraw, R. Alvarez, D. Zhao, D. Rybach, A. Kannan, Y. Wu, R. Pang et al., “Streaming end-to-end speech recognition for mobile devices,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 6381–6385.
 [10] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, “Understanding deep learning requires rethinking generalization,” 2016.
 [11] D. Arpit, S. Jastrzębski, N. Ballas, D. Krueger, E. Bengio, M. S. Kanwal, T. Maharaj, A. Fischer, A. Courville, Y. Bengio, and S. Lacoste-Julien, “A closer look at memorization in deep networks,” 2017.
 [12] Y. LeCun, J. S. Denker, and S. A. Solla, “Optimal brain damage,” in Advances in Neural Information Processing Systems 2, D. S. Touretzky, Ed. Morgan-Kaufmann, 1990, pp. 598–605. [Online]. Available: http://papers.nips.cc/paper/250-optimal-brain-damage.pdf
 [13] B. Hassibi and D. G. Stork, “Second order derivatives for network pruning: Optimal brain surgeon,” in Advances in neural information processing systems, 1993, pp. 164–171.
 [14] J. Frankle and M. Carbin, “The lottery ticket hypothesis: Finding sparse, trainable neural networks,” 2018.
 [15] J. Frankle, G. K. Dziugaite, D. M. Roy, and M. Carbin, “Stabilizing the lottery ticket hypothesis,” 2019.
 [16] L. Liu and J. Deng, “Dynamic deep neural networks: Optimizing accuracy-efficiency trade-offs by selective execution,” 2017.
 [17] G. Huang, D. Chen, T. Li, F. Wu, L. van der Maaten, and K. Q. Weinberger, “Multi-scale dense networks for resource efficient image classification,” 2017.
 [18] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009, pp. 248–255.
 [19] J. Yu, P. Jin, H. Liu, G. Bender, P.-J. Kindermans, M. Tan, T. Huang, X. Song, R. Pang, and Q. Le, “Bignas: Scaling up neural architecture search with big single-stage models,” arXiv preprint arXiv:2003.11142, 2020.
 [20] J.-H. Luo, J. Wu, and W. Lin, “Thinet: A filter level pruning method for deep neural network compression,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5058–5066.
 [21] M. A. Carreira-Perpinán and Y. Idelbayev, ““Learning-compression” algorithms for neural net pruning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8532–8541.
 [22] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an asr corpus based on public domain audio books,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5206–5210.
 [23] W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, “Listen, attend and spell,” arXiv preprint arXiv:1508.01211, 2015.
 [24] H. Sak, A. Senior, and F. Beaufays, “Long short-term memory recurrent neural network architectures for large scale acoustic modeling,” in Fifteenth Annual Conference of the International Speech Communication Association, 2014.
 [25] A. Zeyer, K. Irie, R. Schlüter, and H. Ney, “Improved training of end-to-end attention models for speech recognition,” arXiv preprint arXiv:1805.03294, 2018.
 [26] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin et al., “Tensorflow: Large-scale machine learning on heterogeneous distributed systems,” arXiv preprint arXiv:1603.04467, 2016.
 [27] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
 [28] S. Narang, E. Undersander, and G. Diamos, “Block-sparse recurrent neural networks,” arXiv preprint arXiv:1711.02782, 2017.
 [29] Y. Guo, A. Yao, and Y. Chen, “Dynamic network surgery for efficient dnns,” in Advances in neural information processing systems, 2016, pp. 1379–1387.
 [30] Y. He, G. Kang, X. Dong, Y. Fu, and Y. Yang, “Soft filter pruning for accelerating deep convolutional neural networks,” arXiv preprint arXiv:1808.06866, 2018.
 [31] A. See, M.-T. Luong, and C. D. Manning, “Compression of neural machine translation models via pruning,” arXiv preprint arXiv:1606.09274, 2016.
 [32] J. Yu and T. Huang, “Network slimming by slimmable networks: Towards one-shot architecture search for channel numbers,” arXiv preprint arXiv:1903.11728, 2019.