However, today’s machine learning production models often target a variety of consumer hardware capabilities. Across the wide spectrum of mobile devices alone, latency for the same machine learning model can differ by multiple orders of magnitude. The situation becomes even more complicated when systems such as home speakers and cars are taken into consideration. As an additional compounding factor, different software applications might have different latency requirements. For example, despite likely using the same architecture, the speech recognizer for video conference captioning requires higher synchronicity than one that aids online video subtitle generation.
Ideally, different-sized models with varying sparsity levels should be trained to target every single device type. However, this scheme is impractical given the myriad of existing devices. Alternatively, one could train a few sparse models only targeting typical hardware configurations. In addition to the maintenance overhead of an offline device sparsity table, this strategy will also necessarily under- or over-utilize resources on the heterogeneous long tail of devices. Additionally, even on a single device, resource availability usually changes dynamically as concurrent activities vary. Models with a static sparsity level will hence likely result in sub-optimal resource usage.
To support such diverse sets of scenarios, we propose dynamic sparsity neural networks (DSNN). After training, a single DSNN model is able to execute at any sparsity level at run-time with no or insignificant loss in accuracy compared to individually trained single sparsity networks. With such a model, we can dynamically adjust its sparsity according to the device capability and resource availability, thereby achieving an optimal accuracy-latency trade-off with minimal memory footprint.
DSNN was inspired by recent work [5, 6] which showed that even for untrained random networks, there exist sub-networks of arbitrary sparsity levels that achieve very high quality. Therefore, trained networks should also simultaneously contain powerful sub-networks at different sparsity levels.
DSNN is also closely related to slimmable neural networks (SNN) [7, 8], which were developed to tackle a similar issue of model deployment across heterogeneous devices. However, these models are designed only for convolutional neural networks, which restricts their applicability across domains and tasks. We demonstrate in Section 5.2 that a naive generalization of SNN to the task of automatic speech recognition (ASR) performs poorly.
DSNN, on the other hand, is a sparsity-based extension of SNN that is applicable to any weight-based neural network. In this paper, we choose to focus on the task of ASR due to an increasing demand for on-device ASR [9, 3]. We show that a single DSNN model can match, or sometimes exceed, the performance of individually trained single sparsity networks across a series of sparsity levels (Section 5.1).
DSNN models contribute to practical machine learning systems in two ways. First, the same DSNN model can be deployed to multiple hardware types with different resource and energy constraints. This greatly reduces both the training overhead and the management complexity of deployment processes. Second, as DSNN models exceed the performance of single sparsity models at high sparsity levels, their training scheme also constitutes an effective approach to improving sparse model performance.
2 Related Work
Over-parameterization is a commonly addressed issue of neural networks [10, 11]. To deal with this issue, model pruning methods have been developed to remove unimportant connections in the weight matrices of neural network models. Optimal Brain Damage [12] and Optimal Brain Surgeon [13] first proposed pruning methods based on second-order derivatives, and later work demonstrated the magnitude-based pruning approach on which we build. The resulting pruned models contain only sparse structures, allowing them to run efficiently at inference time while maintaining performance [12, 1]. Many studies have demonstrated the empirical strength of such sparse networks [1, 3] and examined their theoretical properties [14, 15, 5, 6].
Recently, Zhou et al. [5] and Ramanujan et al. [6] showed that untrained random networks contain sub-networks at arbitrary sparsity levels that perform well without training. The best of these sub-networks, usually at around 50% sparsity, can perform as well as the full (i.e. zero-sparsity) model on specific datasets. Our work also tries to find a single network containing multiple high-quality sub-networks, but we allow model training while requiring these sub-networks to match the quality of individually trained single sparsity networks.
Dynamic neural networks are a family of models that optimize the run-time trade-off between accuracy and efficiency using dynamic inference graphs [16, 17, 7, 8]. These models often allow selective execution, which is desirable when the target inference platforms vary in their constraints. To our knowledge, our proposed DSNN is the first such model that achieves this optimization using sparse networks.
3 Dynamic Sparsity Neural Networks
In this section, we first provide a formulation of dynamic sparsity neural networks (DSNN) and justify it using previous studies. We then introduce the DSNN training algorithm. Finally, we sketch several key distinctions from slimmable neural networks, our methodological precursor.
3.1 Model Formulation
DSNN aims to train a super-network $\mathcal{N}$ such that, given an arbitrary sparsity level $s$ in a range $[s_{\min}, s_{\max}]$, we can find a sub-network $\mathcal{N}_s$ with only a subset of connections in $\mathcal{N}$ without further fine-tuning or re-training. This sub-network should have the same or better quality than an individually trained single sparsity model obtained through traditional pruning algorithms at the same sparsity level $s$. With such a super-network, we are able to dynamically switch to sub-networks with different sparsity levels during deployment, optimizing for hardware capacities and application latency constraints.
While it is non-trivial to theoretically prove the existence of such a super-network $\mathcal{N}$, empirical evidence does suggest that it is likely to exist. Zhou et al. [5] and Ramanujan et al. [6] showed that an untrained random model can simultaneously contain sub-networks that perform well. At very specific sparsity levels, their models can perform as well as individually trained dense models. Slimmable neural networks [7, 8] demonstrated that for ImageNet [18] classification, a trained convolutional super-network can have structured sub-networks with similar or better performance than individually trained networks with the same architecture. More recently, BigNAS [19] trained a single set of shared weights on ImageNet which are used to obtain child models via a simple coarse-to-fine architecture selection heuristic. All these works hint at a possible super-network that encompasses multiple high-quality sub-networks. They hence encourage us to explore DSNN, a general sparsity-based super-network.
3.2 Training Algorithm
With the likely existence of such a super-network $\mathcal{N}$, the question becomes how we can efficiently find it. In order for a single model to execute at arbitrary sparsity levels, we jointly train the same network with a variety of sparsity levels. Specifically, at each training step, we choose a target sparsity level $s$ at which to train the model.
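We leverage “the sandwich rule” [8], which states that the quality of models of varying sizes is bounded by that of the largest and the smallest model. Formally, given two sparsity levels $s_1$ and $s_2$ with $s_1 \le s_2$, as long as the pruning function guarantees
$$C_{s_2} \subseteq C_{s_1},$$
where $C_{s_1}$ and $C_{s_2}$ are the sets of connections remaining after pruning, then the residual error of a network with a smaller sparsity level should be no higher than that of one with a larger sparsity level. (This is a weak condition and many pruning functions satisfy it, including magnitude-based pruning, which is used in this work.) Then, by extension, for any sparsity level $s \in [s_{\min}, s_{\max}]$,
$$E(\mathcal{N}_{s_{\min}}) \le E(\mathcal{N}_s) \le E(\mathcal{N}_{s_{\max}}),$$
where $E(\mathcal{N}_s)$ denotes the residual error of the sub-network $\mathcal{N}_s$ at sparsity level $s$.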
This formulation allows us to focus on training two endpoint models with a minimum and a maximum sparsity level, $s_{\min}$ and $s_{\max}$. In addition, we also sample intermediate sparsity levels to allow better generalizability between the endpoints.
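The condition that a sparser sub-network keeps only connections that a denser one also keeps can be checked directly for magnitude-based pruning. The following is a minimal sketch in plain Python, not the authors' implementation: the helper name `surviving_connections`, the toy weight vector, and the tie-breaking by index are all assumptions made for illustration.

```python
def surviving_connections(weights, sparsity):
    """Return the set of indices kept by magnitude-based pruning.

    Illustrative helper (not from the paper): prunes the
    round(sparsity * n) entries with the smallest absolute value,
    breaking ties by index so the ranking is deterministic.
    """
    n = len(weights)
    num_pruned = int(round(sparsity * n))
    # Rank indices from smallest to largest magnitude.
    order = sorted(range(n), key=lambda i: (abs(weights[i]), i))
    return set(order[num_pruned:])


weights = [0.9, -0.1, 0.4, -0.7, 0.05, 0.3, -0.2, 0.6, 0.0, -0.8]
kept_30 = surviving_connections(weights, 0.3)
kept_60 = surviving_connections(weights, 0.6)
kept_90 = surviving_connections(weights, 0.9)

# A higher sparsity level keeps a subset of what a lower level keeps.
print(kept_90 <= kept_60 <= kept_30)
```

Because every sparsity level is cut from the same magnitude ranking, the surviving sets are nested by construction, which is what lets the two endpoint models bound the intermediate ones.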
We take inspiration from regular sparse model training, where it is common practice to pretrain the full model for some number of steps before pruning begins [20, 21]. This gives a high-performance first-step model as the initialization for sparse models to prune from. For DSNN, because we choose a minimum sparsity level of $s_{\min} = 0$ (Section 4.3), the full model is already present during the training algorithm as the minimally sparse endpoint. Nevertheless, we find empirically that the inclusion of the pretraining stage is still crucial for DSNN quality (Section 5.3).
We sketch the DSNN training procedure in Algorithm 1. For each iteration, we alternate among the minimum, intermediate, and maximum sparsity levels for weight masking and execute forward and backward propagation. However, this alternation during training could be a source of instability. To deal with this, instead of updating model parameters immediately after backward propagation in each training step, we accumulate the parameter gradients across training steps and only do one parameter update per iteration.
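The procedure described above can be sketched on a toy problem. This is a hedged illustration rather than the paper's Algorithm 1: the linear regression model, the helper names (`magnitude_mask`, `masked_loss_and_grad`, `dsnn_iteration`), the uniform sampling of intermediate levels, the learning rate, and the choice to give masked-out weights zero gradient within a pass are all assumptions.

```python
import random


def magnitude_mask(weights, sparsity):
    """Binary mask zeroing the smallest-magnitude fraction of the weights."""
    n = len(weights)
    num_pruned = int(round(sparsity * n))
    order = sorted(range(n), key=lambda i: (abs(weights[i]), i))
    pruned = set(order[:num_pruned])
    return [0.0 if i in pruned else 1.0 for i in range(n)]


def masked_loss_and_grad(weights, mask, batch):
    """Mean squared-error loss of the masked linear model and its gradient."""
    loss = 0.0
    grad = [0.0] * len(weights)
    for x, y in batch:
        err = sum(w * m * xi for w, m, xi in zip(weights, mask, x)) - y
        loss += 0.5 * err * err / len(batch)
        for i, (m, xi) in enumerate(zip(mask, x)):
            # Masked-out weights receive zero gradient in this pass.
            grad[i] += err * m * xi / len(batch)
    return loss, grad


def dsnn_iteration(weights, batch, s_min, s_max, num_intermediate, lr, rng):
    """Run one masked pass per sparsity level, then apply a single update."""
    levels = [s_min]
    levels += [rng.uniform(s_min, s_max) for _ in range(num_intermediate)]
    levels.append(s_max)
    accumulated = [0.0] * len(weights)
    for s in levels:
        mask = magnitude_mask(weights, s)  # mask recomputed from current weights
        _, grad = masked_loss_and_grad(weights, mask, batch)
        accumulated = [a + g for a, g in zip(accumulated, grad)]
    # Gradient accumulation: one parameter update per iteration.
    return [w - lr * a / len(levels) for w, a in zip(weights, accumulated)]


rng = random.Random(0)
inputs = [[rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(32)]
batch = [(x, 2.0 * x[0] - 1.0 * x[1]) for x in inputs]  # toy regression targets
weights = [0.1, -0.1, 0.05, 0.02]

dense, half = [1.0] * 4, magnitude_mask(weights, 0.5)
loss_dense_before, _ = masked_loss_and_grad(weights, dense, batch)
loss_sparse_before, _ = masked_loss_and_grad(weights, half, batch)

for _ in range(300):
    weights = dsnn_iteration(weights, batch, 0.0, 0.5, 1, 0.2, rng)

loss_dense_after, _ = masked_loss_and_grad(weights, dense, batch)
loss_sparse_after, _ = masked_loss_and_grad(
    weights, magnitude_mask(weights, 0.5), batch)
print(loss_dense_after < loss_dense_before,
      loss_sparse_after < loss_sparse_before)
```

Summing the gradients from all sparsity levels before updating means each step moves the shared weights in a direction that reflects every sub-network at once, instead of oscillating between the objectives of individual levels.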
3.3 Comparison with Slimmable Neural Networks
DSNN is closely related to slimmable neural networks (SNN) [7, 8]: both allow dynamic inference graphs, albeit with several key distinctions. First, SNN shrinks models by truncating convolutional channels, while DSNN obtains smaller model variants using model pruning. This allows DSNN to be easily applied to more domains and tasks. The sparse structure also allows DSNN to preserve the high dimensionality of input and output spaces, although the mapping from input to output is low-dimensional. Second, we may consider a simple generalization of SNN that prunes whole nodes in a network instead of convolutional channels. This restricts SNN to always use fully connected sub-networks. In contrast, DSNN uses an edge-pruning approach, the common practice for model pruning, and its lack of a pre-defined network structure allows greater modeling flexibility. SNN can therefore be viewed as a special case of DSNN whose sparsity patterns are skewed, with all connections to the last channels masked to zero. See Figure 1 for an illustration.
4 Experimental Setup
In this section we describe our experimental settings.
4.1 Task and Dataset
While our approach is widely applicable to all weight-based neural networks, we choose to conduct experiments in automatic speech recognition (ASR) due to an increased interest in on-device ASR [9, 3]. We perform experiments on the LibriSpeech dataset [22], which consists of read speech from audio books. We merge its three training sets while maintaining the “clean” and “other” distinction corresponding to low versus high noise conditions. The final dataset contains 960.9 hours of training data, 5.4/5.3 hours of clean/other development data, and 5.4/5.1 hours of clean/other test data.
4.2 Model Architecture and Settings
We use an attentive encoder-decoder architecture as the base model [23]. The encoder consists of 2 layers of 3x3 convolutional neural networks, 3 layers of projection matrices with 2048 output units, and 4 layers of bidirectional LSTMs with 2048 output units. The decoder consists of 2 layers of LSTMs with 1024 output units, 1 attention layer with 128 hidden units, and 1 fully connected layer. The model contains 184M parameters. The quality of our base model generally matches that of the similar architecture in Zeyer et al. (Table 1).
4.3 Model Pruning
We use magnitude-based pruning which, given a target sparsity level $s$ and a weight matrix $W$, zeros out the fraction $s$ of elements in $W$ with the smallest absolute values by applying a binary mask over $W$. We employ block pruning with a fixed block size to leverage hardware resources more efficiently: instead of the smallest elements, we zero out the smallest blocks in $W$. This procedure corresponds to the sparsify function in Algorithm 1. We only prune the recurrent connections, which constitute 64.57% of model parameters. (When we indicate a sparsity level in this paper, we refer to the sparsity level of these recurrent weights only.) We also allow the mask to update at each iteration, which enables pruned weights to be recovered if at a later step their magnitude is greater than that of some surviving weights, similar to [29, 30].
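A small sketch may help make block-wise magnitude pruning concrete. It is an illustration under stated assumptions, not the paper's sparsify function: the name `block_sparsify`, the block shape arguments, and in particular the use of the summed absolute value as the measure of a block's magnitude are choices made here, since the text does not specify how blocks are ranked.

```python
def block_sparsify(matrix, sparsity, block_rows, block_cols):
    """Zero out the lowest-scoring blocks of a matrix given as nested lists.

    Assumption: a block's score is the sum of absolute values of its
    entries. Returns the pruned matrix and the binary mask that was applied.
    """
    rows, cols = len(matrix), len(matrix[0])
    assert rows % block_rows == 0 and cols % block_cols == 0
    # Top-left corner of every block.
    blocks = [(r, c)
              for r in range(0, rows, block_rows)
              for c in range(0, cols, block_cols)]

    def score(block):
        r0, c0 = block
        return sum(abs(matrix[r][c])
                   for r in range(r0, r0 + block_rows)
                   for c in range(c0, c0 + block_cols))

    num_pruned = int(round(sparsity * len(blocks)))
    # Ties are broken by block position so the result is deterministic.
    pruned = sorted(blocks, key=lambda b: (score(b), b))[:num_pruned]

    mask = [[1.0] * cols for _ in range(rows)]
    for r0, c0 in pruned:
        for r in range(r0, r0 + block_rows):
            for c in range(c0, c0 + block_cols):
                mask[r][c] = 0.0
    pruned_matrix = [[w * m for w, m in zip(w_row, m_row)]
                     for w_row, m_row in zip(matrix, mask)]
    return pruned_matrix, mask


W = [[0.5, -0.4, 0.1, 0.0],
     [0.05, 0.02, -0.9, 0.3]]
pruned_W, mask = block_sparsify(W, sparsity=0.5, block_rows=1, block_cols=2)
print(pruned_W)
```

Because the mask is recomputed from the current dense weights on every call, a block that was zeroed earlier can be kept again once its score overtakes that of another block, which mirrors the mask-update behaviour described above.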
For both baseline single sparsity models (Section 5.1) and DSNN, we first train a zero-sparsity network until convergence (around 200k steps). When training this network, we maintain exponential moving averages of all model parameters. When the pruning stage begins, we load all model variables from these averages. The baseline models use a constant pruning schedule that fixes the sparsity level. The DSNN training algorithm does not require a pruning schedule.
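The parameter averaging used to initialize the pruning stage can be sketched as follows. This is only an illustration: the helper name `update_ema`, the decay value of 0.5, and the stand-in parameter updates are assumptions chosen to keep the arithmetic easy to follow, and the decay used in the paper is not stated.

```python
def update_ema(ema, params, decay):
    """Return the updated exponential moving average of the parameters."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema, params)]


params = [0.0, 0.0]
ema = list(params)
for step in range(1, 4):
    # Stand-in for an optimizer update of the zero-sparsity model.
    params = [p + 1.0 for p in params]
    ema = update_ema(ema, params, decay=0.5)

# When the pruning stage begins, training restarts from the averaged values
# rather than from the latest raw parameters.
pruning_init = list(ema)
print(params, pruning_init)
```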
We train DSNN with a minimum sparsity level $s_{\min} = 0$ (i.e. the full model) and a maximum $s_{\max} = 0.9$. During each iteration, we sample intermediate sparsity levels in this range to train at, in addition to the two endpoint sparsity levels.
5 Results and Discussion
In this section we present our experimental results. We evaluate the models on the minimum and maximum sparsity levels, 0% and 90%, and two intermediate sparsity levels, 30% and 60%, that test the model generalizability between the endpoints.
|Sparsity||Model||Dev clean||Dev other||Test clean||Test other|
|0%||Zeyer et al.||3.54||11.52||3.82||12.76|
5.1 Comparison with Single Sparsity Networks
As argued in Section 3, the most important success criterion of DSNN is to match individually trained single sparsity networks in quality. If the DSNN model significantly degraded the model quality, it would not be useful especially in real-world scenarios when quality is prioritized. We, therefore, conducted baseline experiments with single sparsity networks and show the results in Table 1.
The dynamic sparsity model generally matches the quality of single sparsity networks. Additionally, the quality decline with increased sparsity is much slower in DSNN than in single sparsity networks. The DSNN quality slightly trails that of single sparsity networks at 0% sparsity but is the best at 60% and 90% sparsity. We hypothesize that this is because the sparser networks in DSNN have thinner structures which, in expectation, are trained more frequently in each step. On the other hand, connections with smaller weights are trained more sporadically, receiving relatively less focus. Notably, a similar trend is observed for SNN in [7, 8], where the quality difference relative to individually trained networks also increases as fewer convolutional channels are used in inference, growing less negative or more positive.
In practice, even in highly quality-driven scenarios where the slight quality gap at denser models is unacceptable, one can still deploy DSNN at only high sparsity levels, complementing single sparsity networks that can be used for the denser models.
5.2 Comparison with Slimmable Neural Networks
Slimmable neural networks (SNN) [7, 8] only vary the number of channels in convolutional networks. For each target width, the last channels (the ones with the largest indices) are removed. Although SNN is not directly applicable to arbitrary weight matrices, we experiment with a simple generalization of SNN as follows.
Given a target sparsity level $s$ and a weight matrix $W \in \mathbb{R}^{d_1 \times d_2}$, we apply a binary mask $M$ on $W$. For each dimension $i \in \{1, 2\}$, we select a threshold $t_i$ by
$$t_i = \mathrm{round}\left(d_i \sqrt{1 - s}\right),$$
where $\mathrm{round}(\cdot)$ rounds to the nearest integer. We then generate the mask by
$$M_{jk} = \begin{cases} 1 & \text{if } j \le t_1 \text{ and } k \le t_2, \\ 0 & \text{otherwise.} \end{cases}$$
Finally we prune $W$ by
$$W' = W \odot M,$$
where $\odot$ denotes element-wise multiplication.
Intuitively, we truncate the last rows in each dimension by an equal fraction so that the resulting matrix has a sparsity close to the target sparsity. This is analogous to removing the last convolutional channels. (This procedure does not guarantee an exact final sparsity level because the pre-rounded $t_i$ is usually not an integer, but the larger the original matrix is, the closer it will be. After all, model pruning is usually only applied on large weight matrices. In our experiments, the differences between the target sparsity levels and the resulting sparsity levels are always within 0.01%. We therefore neglect this difference when comparing results.)
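The truncation described above, which removes the trailing indices of each dimension by an equal fraction, can be sketched in a few lines. This is an illustrative reading of the prose, not the authors' code: the function names are hypothetical, the matrix is assumed to be two-dimensional, and each dimension is assumed to keep a fraction equal to the square root of one minus the target sparsity. Indices are 0-based here.

```python
import math


def node_pruning_mask(num_rows, num_cols, sparsity):
    """Mask that keeps only the leading rows and columns of a matrix."""
    keep_fraction = math.sqrt(1.0 - sparsity)
    t_rows = int(round(num_rows * keep_fraction))
    t_cols = int(round(num_cols * keep_fraction))
    return [[1.0 if r < t_rows and c < t_cols else 0.0
             for c in range(num_cols)]
            for r in range(num_rows)]


def realized_sparsity(mask):
    """Fraction of entries that the mask sets to zero."""
    total = sum(len(row) for row in mask)
    kept = sum(sum(row) for row in mask)
    return 1.0 - kept / total


exact = node_pruning_mask(8, 8, 0.75)      # thresholds fall on integers
rounded = node_pruning_mask(10, 10, 0.9)   # thresholds need rounding
print(realized_sparsity(exact), realized_sparsity(rounded))
```

For the small 10 by 10 example the rounding of the thresholds shifts the realized sparsity away from the 0.9 target; the gap shrinks as the matrix grows, consistent with the remark that the difference is negligible for large weight matrices.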
We compare SNN and DSNN quality in Table 1. We see that DSNN’s edge pruning approach significantly outperforms SNN’s node pruning approach.
5.3 Ablation Study
We analyzed the effect of pretraining and gradient accumulation (GA) and show the ablation results in Table 2. Similar to previous model pruning work [20, 21], we find it important to pretrain the zero sparsity model for a certain number of steps before pruning begins. This pretraining stage uniformly improves the performance by up to 0.8% WER. This further confirms the importance of initialization for sparse model training: it helps even when the zero sparsity model is present during the DSNN training algorithm. The gradient accumulation technique also consistently improves the model performance across sparsity levels by stabilizing the training process.
6 Conclusion and Future Work
We presented a training scheme that allows a single trained model to optimally switch its sparsity level at inference time. Given that its performance is on par with individually trained single sparsity networks, such a model can simultaneously support a variety of devices with different hardware capabilities and applications with diverse latency requirements. As it outperforms single sparsity models at high sparsity levels, DSNN also serves as a way to improve sparse model performance. Nevertheless, the DSNN model still trails the quality of single sparsity networks at low sparsity levels. We leave it to future work to close this gap.
In this work, we only considered models with all parameters pruned by the same fraction. However, components of a machine learning model are sometimes not equally important, and setting different sparsity levels for different weights may yield a higher quality model, as explored by See et al. As each weight matrix is independently pruned in the DSNN training algorithm, DSNN is able to approximate the performance of individually trained networks with arbitrary sparsity configurations across weights. Combined with a greedy search algorithm, DSNN can be used to search for an optimal per-weight sparsity configuration, analogous to the search over channel numbers with slimmable networks by Yu and Huang. This can be an interesting future exploration.
References
-  S. Han, J. Pool, J. Tran, and W. J. Dally, “Learning both weights and connections for efficient neural networks,” 2015.
-  M. Zhu and S. Gupta, “To prune, or not to prune: exploring the efficacy of pruning for model compression,” arXiv preprint arXiv:1710.01878, 2017.
-  Y. Shangguan, J. Li, L. Qiao, R. Alvarez, and I. McGraw, “Optimizing speech recognition for the edge,” arXiv preprint arXiv:1909.12408, 2019.
-  A. Ignatov, R. Timofte, W. Chou, K. Wang, M. Wu, T. Hartley, and L. Van Gool, “Ai benchmark: Running deep neural networks on android smartphones,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018.
-  H. Zhou, J. Lan, R. Liu, and J. Yosinski, “Deconstructing lottery tickets: Zeros, signs, and the supermask,” in Advances in Neural Information Processing Systems, 2019, pp. 3592–3602.
-  V. Ramanujan, M. Wortsman, A. Kembhavi, A. Farhadi, and M. Rastegari, “What’s hidden in a randomly weighted neural network?” 2019.
-  J. Yu, L. Yang, N. Xu, J. Yang, and T. Huang, “Slimmable neural networks,” arXiv preprint arXiv:1812.08928, 2018.
-  J. Yu and T. Huang, “Universally slimmable networks and improved training techniques,” arXiv preprint arXiv:1903.05134, 2019.
-  Y. He, T. N. Sainath, R. Prabhavalkar, I. McGraw, R. Alvarez, D. Zhao, D. Rybach, A. Kannan, Y. Wu, R. Pang et al., “Streaming end-to-end speech recognition for mobile devices,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 6381–6385.
-  C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, “Understanding deep learning requires rethinking generalization,” 2016.
-  D. Arpit, S. Jastrzębski, N. Ballas, D. Krueger, E. Bengio, M. S. Kanwal, T. Maharaj, A. Fischer, A. Courville, Y. Bengio, and S. Lacoste-Julien, “A closer look at memorization in deep networks,” 2017.
-  Y. LeCun, J. S. Denker, and S. A. Solla, “Optimal brain damage,” in Advances in Neural Information Processing Systems 2, D. S. Touretzky, Ed. Morgan-Kaufmann, 1990, pp. 598–605. [Online]. Available: http://papers.nips.cc/paper/250-optimal-brain-damage.pdf
-  B. Hassibi and D. G. Stork, “Second order derivatives for network pruning: Optimal brain surgeon,” in Advances in neural information processing systems, 1993, pp. 164–171.
-  J. Frankle and M. Carbin, “The lottery ticket hypothesis: Finding sparse, trainable neural networks,” 2018.
-  J. Frankle, G. K. Dziugaite, D. M. Roy, and M. Carbin, “Stabilizing the lottery ticket hypothesis,” 2019.
-  L. Liu and J. Deng, “Dynamic deep neural networks: Optimizing accuracy-efficiency trade-offs by selective execution,” 2017.
-  G. Huang, D. Chen, T. Li, F. Wu, L. van der Maaten, and K. Q. Weinberger, “Multi-scale dense networks for resource efficient image classification,” 2017.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition. IEEE, 2009, pp. 248–255.
-  J. Yu, P. Jin, H. Liu, G. Bender, P.-J. Kindermans, M. Tan, T. Huang, X. Song, R. Pang, and Q. Le, “Bignas: Scaling up neural architecture search with big single-stage models,” arXiv preprint arXiv:2003.11142, 2020.
-  J.-H. Luo, J. Wu, and W. Lin, “Thinet: A filter level pruning method for deep neural network compression,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 5058–5066.
-  M. A. Carreira-Perpinán and Y. Idelbayev, ““learning-compression” algorithms for neural net pruning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8532–8541.
-  V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an asr corpus based on public domain audio books,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5206–5210.
-  W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, “Listen, attend and spell,” arXiv preprint arXiv:1508.01211, 2015.
-  A. Zeyer, K. Irie, R. Schlüter, and H. Ney, “Improved training of end-to-end attention models for speech recognition,” arXiv preprint arXiv:1805.03294, 2018.
-  M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin et al., “Tensorflow: Large-scale machine learning on heterogeneous distributed systems,” arXiv preprint arXiv:1603.04467, 2016.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
-  S. Narang, E. Undersander, and G. Diamos, “Block-sparse recurrent neural networks,” arXiv preprint arXiv:1711.02782, 2017.
-  Y. Guo, A. Yao, and Y. Chen, “Dynamic network surgery for efficient dnns,” in Advances in neural information processing systems, 2016, pp. 1379–1387.
-  Y. He, G. Kang, X. Dong, Y. Fu, and Y. Yang, “Soft filter pruning for accelerating deep convolutional neural networks,” arXiv preprint arXiv:1808.06866, 2018.
-  A. See, M.-T. Luong, and C. D. Manning, “Compression of neural machine translation models via pruning,” arXiv preprint arXiv:1606.09274, 2016.
-  J. Yu and T. Huang, “Network slimming by slimmable networks: Towards one-shot architecture search for channel numbers,” arXiv preprint arXiv:1903.11728, 2019.