1 Introduction
Current state-of-the-art neural networks need extensive computational resources to be trained and can have capacities of close to one billion connections between neurons (Vaswani et al., 2017; Devlin et al., 2018; Child et al., 2019). One solution that nature found to improve neural network scaling is to use sparsity: the more neurons a brain has, the fewer connections neurons make with each other (Herculano-Houzel et al., 2010). Similarly, for deep neural networks, it has been shown that sparse weight configurations exist which train faster and achieve the same errors as dense networks (Frankle and Carbin, 2019). However, currently, these sparse configurations are found by starting from a dense network, which is pruned and retrained repeatedly, an expensive procedure.
In this work, we demonstrate the possibility of training sparse networks that rival the performance of their dense counterparts with a single training run; no retraining is required. We train from random initializations and maintain sparse weights throughout training while also speeding up the overall training time. We achieve this by developing sparse momentum, an algorithm which uses the exponentially smoothed gradient of network weights (momentum) as a measure of persistent errors to identify which layers are most efficient at reducing the error and which missing connections between neurons would reduce the error the most. Sparse momentum follows a cycle of (1) pruning weights with small magnitude, (2) redistributing weights across layers according to the mean momentum magnitude of existing weights, and (3) growing new weights to fill in missing connections which have the highest momentum magnitude.
We compare the performance of sparse momentum to compression algorithms and recent methods that maintain sparse weights throughout training. We demonstrate state-of-the-art sparse performance on MNIST, CIFAR-10, and ImageNet-2012. Sparse momentum also matches the performance of several dense baselines on MNIST and CIFAR-10. We estimate mean speedups of our sparse convolutional networks on CIFAR-10 for optimal sparse convolution algorithms and naive dense convolution algorithms compared to dense baselines. For sparse convolution, we estimate speedups between 3.50x and 11.85x and for dense convolution speedups between 1.16x and 1.45x. Finally, we present an analysis of the feature representations of our sparse networks. We find that networks trained by sparse momentum learn more general features which are useful to a broader range of classes than dense features, which might explain why sparse networks can compete with dense networks.
2 Related Work
From Dense to Sparse Neural Networks: Work that focuses on creating sparse from dense neural networks has an extensive history. Earlier work focused on pruning via second-order derivatives (LeCun et al., 1989; Karnin, 1990; Hassibi and Stork, 1992) and heuristics which ensure efficient training of networks after pruning (Chauvin, 1988; Mozer and Smolensky, 1988; Ishikawa, 1996). Recent work is often motivated by the memory and computational benefits of sparse models that enable the deployment of deep neural networks on mobile and low-energy devices. A very influential paradigm has been the iterative (1) train-dense, (2) prune, (3) retrain cycle introduced by Han et al. (2015). Extensions to this work include: compressing recurrent neural networks and other models (Narang et al., 2017; Zhu and Gupta, 2018; Dai et al., 2018), continuous pruning and retraining (Guo et al., 2016), joint loss/pruning-cost optimization (Carreira-Perpinán and Idelbayev, 2018), layer-by-layer pruning (Dong et al., 2017), fast-switching growth-pruning cycles (Dai et al., 2017), and soft weight-sharing (Ullrich et al., 2017). These approaches often involve retraining phases which increase the training time. However, since the main goal of this line of work is a compressed model for mobile devices, reducing the runtime of these procedures is desirable but not a primary goal. This is contrary to our motivation. Despite the difference in motivation, we include many of these dense-to-sparse compression methods in our comparisons. Other compression algorithms include $L_0$ regularization (Louizos et al., 2018) and Bayesian methods (Louizos et al., 2017; Molchanov et al., 2017). For further details, see the survey of Gale et al. (2019).
Interpretation and Analysis of Sparse Neural Networks: Frankle and Carbin (2019) show that "winning lottery tickets" exist for deep neural networks: sparse initializations which reach predictive performance similar to dense networks and train just as fast. However, finding these winning lottery tickets is computationally expensive and involves multiple prune and retrain cycles starting from a dense network. Follow-up work concentrated on finding these configurations faster (Frankle et al., 2019; Zhou et al., 2019). In contrast, we reach dense performance levels with a sparse network from random initialization with a single training run while accelerating training.
Sparse Neural Networks Throughout Training: Methods that maintain sparse weights throughout training through a prune-redistribute-regrowth cycle are most closely related to our work. Bellec et al. (2018) introduce DEEP-R, which takes a Bayesian perspective and performs sampling for prune and regrowth decisions, that is, sampling sparse network configurations from a posterior. While theoretically rigorous, this approach is computationally expensive and challenging to apply to large networks and datasets. Sparse evolutionary training (SET) (Mocanu et al., 2018) simplifies prune-regrowth cycles by using heuristics: (1) prune the smallest and most negative weights, (2) grow new weights in random locations. Unlike our work, where many convolutional channels are empty and can be excluded from computation, growing weights randomly fills most convolutional channels and makes it challenging to harness computational speedups during training without specialized sparse algorithms. SET also does not include the cross-layer redistribution of weights which we find to be critical for good performance, as shown in our ablation study. The most closely related work to ours is Dynamic Sparse Reparameterization (DSR) by Mostafa and Wang (2019), which includes the full prune-redistribute-regrowth cycle. However, similar to SET, DSR includes random regrowth, which hampers the possibility of speedups during training. More distantly related is Single-shot Network Pruning (SNIP) (Lee et al., 2019), which aims to find the best sparse network from a single pruning decision. The goal of SNIP is simplicity, while our goal is maximizing predictive and runtime performance. In our work, we compare against all four methods: DEEP-R, SET, DSR, and SNIP.
3 Method
3.1 Sparse Learning
We define sparse learning to be the training of deep neural networks which maintain sparsity throughout training while matching the predictive performance of dense neural networks. To achieve this, intuitively, we want to find the weights that reduce the error most effectively. This is challenging since most deep neural networks can hold trillions of different combinations of sparse weights. Additionally, during training, as feature hierarchies are learned, efficient weights might change gradually from shallow to deep layers. How can we find good sparse configurations? In this work, we follow a divide-and-conquer strategy that is guided by computationally efficient heuristics. We divide sparse learning into the following sub-problems which can be tackled independently: (1) pruning weights, (2) redistribution of weights across layers, and (3) regrowing weights, as defined in more detail below.
3.2 Sparse Momentum
We use the mean magnitude of momentum of existing weights in each layer to estimate how efficient the average weight in each layer is at reducing the overall error. Intuitively, we want to take weights from less efficient layers and redistribute them to weight-efficient layers. The sparse momentum algorithm is depicted in Figure 1. In this section, we first describe the intuition behind sparse momentum and then present a more detailed description of the algorithm.
The gradient of the error with respect to a weight, $\frac{\partial E}{\partial W}$, yields the directions which reduce the error at the highest rate. However, if we use stochastic gradient descent, most weights of $\frac{\partial E}{\partial W}$ oscillate between small/large and negative/positive gradients with each minibatch (Qian, 1999): a good change for one minibatch might be a bad change for another. We can reduce oscillations if we take the average gradient over time, thereby finding weights which reduce the error consistently. However, we want to value recent gradients, which are closer to the local minimum, more highly than the distant past. This can be achieved by exponentially smoothing $\frac{\partial E}{\partial W}$, which yields the momentum $M_i$:
$$M_i^{t+1} = \alpha M_i^{t} + (1 - \alpha) \frac{\partial E}{\partial W_i^{t}},$$
where $\alpha$ is a smoothing factor and $M_i$ is the momentum for the weight $W_i$ in layer $i$; $M_i$ is initialized with $M_i^{0} = 0$.
Momentum is efficient at accelerating the optimization of deep neural networks by identifying weights which reduce the error consistently. Similarly, the aggregated momentum of weights in each layer should reflect how good each layer is at reducing the error consistently. Additionally, the momentum of zero-valued weights, which are equivalent to missing weights in sparse networks, can be used to estimate how quickly the error would change if these weights were included in a sparse network.
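As a small illustration of why smoothing separates consistent from oscillating gradients, the following sketch applies plain exponential smoothing to two hand-made gradient sequences for a single weight. The function name, the smoothing factor of 0.9, and the gradient values are assumptions chosen for the example, not values taken from our experiments.

```python
def smoothed_gradient(gradients, alpha=0.9):
    """Exponentially smooth a sequence of per-minibatch gradients of one weight."""
    momentum = 0.0  # the smoothed value starts at zero
    history = []
    for grad in gradients:
        momentum = alpha * momentum + (1.0 - alpha) * grad
        history.append(momentum)
    return history


# Hypothetical gradients: one weight flips sign every minibatch,
# the other receives a small but steady gradient.
oscillating = smoothed_gradient([1.0, -1.0] * 50)
consistent = smoothed_gradient([0.2] * 100)

# The steady weight ends up with the larger momentum magnitude,
# even though its raw gradients are five times smaller.
print(abs(oscillating[-1]), abs(consistent[-1]))
```

In this toy setting the oscillating gradients largely cancel, while the steady gradient accumulates to roughly its own value.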
The details of the algorithm are shown in Algorithm 1. Before training, we initialize the network with a chosen sparsity: we initialize the network as usual and then remove that fraction of the weights in each layer. During training, we apply sparse momentum after each epoch. We can break the sparse momentum algorithm itself into three major parts: (a) redistribution of weights, (b) pruning weights, and (c) regrowing weights. In step (a), we calculate the weight redistribution proportions and in turn how many weights to regrow in each layer: for each layer, we take the mean of the element-wise momentum magnitude that belongs to all nonzero weights. We then sum-normalize these means across all layers to get the momentum contribution of each layer. Finally, we multiply the momentum contribution of each layer by the overall number of removed weights to get the number of weights which we will regrow in that layer. In step (b), we prune a fixed proportion (the pruning rate) of the weights with the lowest magnitude in each layer. In step (c), we regrow weights by enabling the gradient flow of zero-valued (missing) weights which have the largest momentum magnitude.
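The three steps described above can be sketched in a few lines of plain Python. This is a simplified illustration rather than our released implementation: layers are flat lists, a boolean mask marks which weights are active, and the names `sparse_momentum_step`, `weights`, `momenta`, and `masks` as well as all numbers are hypothetical. The sketch ignores the edge cases discussed next and lets freshly pruned weights compete for regrowth.

```python
def sparse_momentum_step(weights, momenta, masks, prune_rate):
    """One redistribute/prune/regrow cycle over flat per-layer lists (illustrative)."""
    # (a) Mean momentum magnitude of active weights, sum-normalised across layers.
    means = {}
    for name in weights:
        active = [abs(m) for m, on in zip(momenta[name], masks[name]) if on]
        means[name] = sum(active) / len(active) if active else 0.0
    total = sum(means.values())
    contribution = {
        name: (means[name] / total if total > 0 else 1.0 / len(means))
        for name in means
    }

    # (b) Prune the lowest-magnitude active weights in every layer.
    removed = 0
    for name in weights:
        active_idx = sorted(
            (i for i, on in enumerate(masks[name]) if on),
            key=lambda i: abs(weights[name][i]),
        )
        n_prune = int(prune_rate * len(active_idx))
        for i in active_idx[:n_prune]:
            masks[name][i] = False
            weights[name][i] = 0.0
        removed += n_prune

    # (c) Regrow: enable the missing weights with the largest momentum magnitude.
    for name in weights:
        n_grow = round(contribution[name] * removed)
        missing = sorted(
            (i for i, on in enumerate(masks[name]) if not on),
            key=lambda i: abs(momenta[name][i]),
            reverse=True,
        )
        for i in missing[:n_grow]:
            masks[name][i] = True  # the weight stays 0.0 but now receives gradient
    return contribution


# Hypothetical two-layer network.
weights = {"fc1": [0.5, -0.1, 0.0, 0.0, 0.3, 0.05], "fc2": [0.2, 0.0, -0.4, 0.0]}
masks = {"fc1": [True, True, False, False, True, True],
         "fc2": [True, False, True, False]}
momenta = {"fc1": [0.1, 0.1, 0.9, 0.05, 0.1, 0.1], "fc2": [0.3, 0.02, 0.3, 0.6]}

contribution = sparse_momentum_step(weights, momenta, masks, prune_rate=0.5)
print(contribution, masks)
```

In this example the second layer has the larger mean momentum magnitude, so it receives two of the three regrown weights, while the overall number of active weights stays the same.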
Additionally, there are two edge cases which we did not include in Algorithm 1 for clarity: (1) If we allocate more weights to be regrown than is possible for a specific layer, for example regrowing 100 weights for a layer with a maximum of 10 weights, we redistribute the excess number of weights equally among all other layers. (2) If a layer is dense and still growing, we reduce the pruning rate for that layer in proportion to its sparsity.
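The first edge case can be illustrated with a short sketch. The helper `redistribute_excess`, the layer names, and the capacities are hypothetical; the sketch makes a single pass and assumes the receiving layers have enough free capacity, which a full implementation would have to check again.

```python
def redistribute_excess(regrow, capacity):
    """Cap each layer's regrowth at its free capacity and split the excess
    equally among all other layers (single pass, illustrative only)."""
    regrow = dict(regrow)
    for name in list(regrow):
        excess = regrow[name] - capacity[name]
        if excess <= 0:
            continue
        regrow[name] = capacity[name]
        others = [n for n in regrow if n != name]
        if not others:
            continue  # nowhere to move the excess
        share, remainder = divmod(excess, len(others))
        for k, other in enumerate(others):
            # Spread any remainder over the first few layers.
            regrow[other] += share + (1 if k < remainder else 0)
    return regrow


# 100 weights were allocated to a layer that can only hold 10 more.
planned = {"conv1": 100, "conv2": 20, "fc": 30}
free = {"conv1": 10, "conv2": 500, "fc": 500}
adjusted = redistribute_excess(planned, free)
print(adjusted)
```

The total number of regrown weights is preserved; only their allocation across layers changes.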
After each epoch, we decay the pruning rate in Algorithm 1 in the same way learning rates are decayed. We find that a cosine decay schedule that anneals the pruning rate to zero on the last epoch yields the best validation error and we use this procedure for all experiments.
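A cosine schedule of this kind can be written as follows. This is a minimal sketch assuming per-epoch decay over a fixed number of epochs; the function name and the initial rate of 0.5 are placeholders for illustration, and the released code may apply the decay at a different granularity.

```python
import math


def cosine_pruning_rate(initial_rate, epoch, last_epoch):
    """Cosine-anneal the pruning rate from initial_rate at epoch 0 to zero at last_epoch."""
    return 0.5 * initial_rate * (1.0 + math.cos(math.pi * epoch / last_epoch))


# Hypothetical 100-epoch run with an initial pruning rate of 0.5.
schedule = [cosine_pruning_rate(0.5, epoch, last_epoch=99) for epoch in range(100)]
print(schedule[0], schedule[50], schedule[-1])
```

With this form, few weights are moved late in training, so the sparse configuration settles before the final epochs.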
3.3 Experimental Setup
For comparison, we follow two different experimental settings from Lee et al. (2019) and Mostafa and Wang (2019). For MNIST (LeCun, 1998), we use a batch size of 100 and decay the learning rate by a factor of 0.1 every 25000 minibatches. For CIFAR-10 (Krizhevsky and Hinton, 2009), we use standard data augmentations (horizontal flip, and random crop with reflective padding), a batch size of 128, and decay the learning rate every 30000 minibatches. We train for 100 and 250 epochs on MNIST and CIFAR-10, respectively, use a learning rate of 0.1, stochastic gradient descent with Nesterov momentum of 0.9, and we use a weight decay of
. We use a fixed 10% of the training data as the validation set and train on the remaining 90%. We evaluate the test set performance of our models on the last epoch. For all experiments on MNIST and CIFAR-10, we report the standard errors. Our sample size is generally between 10 and 12 experiments per method/architecture/sparsity level with different random seeds for each experiment.
We use the modified network architectures of AlexNet, VGG16, and LeNet-5 as introduced by Lee et al. (2019). For the setup of Mostafa and Wang (2019), we use no validation set, and for the Wide Residual Network (WRN) 28-2 (Zagoruyko and Komodakis, 2016) experiments on CIFAR-10 we start with the following layers as dense: the first convolutional layer, the last fully connected layer, and all downsample residual convolutional layers.
On ImageNet (Deng et al., 2009), we use ResNet-50 (He et al., 2016)
with a stride of 2 for the 3x3 convolution in the bottleneck layers. We use a batch size of 256, input size of 224, momentum of 0.9, and weight decay of
. We train for 100 epochs and report validation set performance after the last epoch.
For all experiments, we keep biases and batch normalization weights dense. We additionally tune a single parameter: the initial pruning rate. We search in the space {0.2, 0.3, 0.4, 0.5, 0.6, 0.7} and find that for most networks on MNIST and CIFAR-10 a pruning rate of works best. We use this pruning rate throughout all experiments.
ImageNet experiments were run on 4x RTX 2080 Ti and all other experiments on individual GPUs.
Our software builds on PyTorch (Paszke et al., 2017) and is a wrapper for PyTorch neural networks with a modular architecture for growth, redistribution, and pruning algorithms. Using our software, any PyTorch neural network can be adapted to be a sparse momentum network with 5 lines of code. We will open-source our software along with trained models and individual experimental results (https://github.com/TimDettmers/sparse_learning).
Figure 2: Test set accuracy with 95% confidence intervals on MNIST and CIFAR at varying sparsity levels for LeNet 300-100 and WRN 28-2.
4 Results
Results in Table 1 and Table 2 follow the procedure of Lee et al. (2019). On MNIST, sparse momentum does very well for the LeNet-5 Caffe model, achieving performance equal to the dense baseline with 20% of the weights. For LeNet 300-100, sparse momentum outperforms baselines when using a moderate amount of weights and for 20% exceeds dense baseline performance. However, for 1–2% of weights, variational dropout is more effective.
On CIFAR-10 in Table 2, we can see that sparse momentum outperforms Single-shot Network Pruning (SNIP) for all models and can achieve the same performance level as dense models for VGG16-D and WRN 16-10 with just 5% of weights.
Figure 2 shows the results on MNIST and CIFAR that follow the experimental procedure of Mostafa and Wang (2019). For LeNet 300-100 on MNIST, we can see that sparse momentum outperforms all other methods. For CIFAR-10, sparse momentum is better than dynamic sparse in 4 out of 5 cases. However, in general, the confidence intervals for most methods overlap. This particular setup for CIFAR-10 with specifically selected dense weights seems to be too easy to differentiate performance between methods and we do not recommend this setup for future work. Sparse momentum outperforms all other methods on ImageNet (ILSVRC2012) as shown in Table 3.
Table 1: MNIST results for LeNet 300-100 and LeNet-5 Caffe.

| Method | LeNet 300-100: W (%) | LeNet 300-100: Error (%) | LeNet-5 Caffe: W (%) | LeNet-5 Caffe: Error (%) |
|---|---|---|---|---|
| Dense | 100.0 | 1.34 ± 0.011 | 100.0 | 0.58 ± 0.010 |
| Opt. Brain Damage (LeCun et al., 1989) | 8.0 | 2.0 | 8.0 | 2.7 |
| Layer-wise Brain Damage (Dong et al., 2017) | 1.5 | 2.0 | 1.0 | 2.1 |
| Compression via optimization** | 1.0 | 3.2 | 1.0 | 1.1 |
| Single-shot Net. Pruning (Lee et al., 2019) | 2.0 | 2.4 | 1.0 | 1.1 |
| Soft weight-sharing (Ullrich et al., 2017) | 4.4 | 1.9 | 0.5 | 1.0 |
| Dyn. Network Surgery (Guo et al., 2016) | 1.8 | 2.0 | 0.9 | 0.9 |
| Learn weights & connections (Han et al., 2015) | 8.3 | 1.6 | 9.3 | 0.8 |
| Single-shot Net. Pruning (Lee et al., 2019) | 5.0 | 1.6 | 2.0 | 0.8 |
| Variational Dropout (Molchanov et al., 2017) | 1.5 | 1.9 | 0.4 | 0.8 |
| Sparse Momentum | 1.0 | 2.36 ± 0.044 | 1.0 | 0.83 ± 0.040 |
| Sparse Momentum | 2.0 | 1.99 ± 0.019 | 2.0 | 0.76 ± 0.022 |
| Sparse Momentum | 5.0 | 1.53 ± 0.020 | 5.0 | 0.69 ± 0.021 |
| Sparse Momentum | 20.0 | 1.26 ± 0.017* | 20.0 | 0.60 ± 0.013* |

* 95% confidence intervals overlap with or exceed dense model.
** (Carreira-Perpinán and Idelbayev, 2018).
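Table 2: CIFAR-10 results for sparse momentum compared with SNIP.

| Model | Dense Error (%) | Sparse Error (%): SNIP | Sparse Error (%): Momentum | Weights (%) |
|---|---|---|---|---|
| AlexNet-s | 12.95 ± 0.056 | 14.99 | 14.35 ± 0.057 | 10 |
| AlexNet-b | 12.85 ± 0.068 | 14.50 | 13.93 ± 0.048 | 10 |
| VGG16-C | 6.49 ± 0.038 | 7.27 | 6.77 ± 0.056 | 5 |
| VGG16-D | 6.59 ± 0.050 | 7.09 | 6.49 ± 0.045* | 5 |
| VGG16-like | 6.50 ± 0.054 | 8.00 | 6.71 ± 0.046 | 3 |
| WRN-16-8 | 4.57 ± 0.022 | 6.63 | 5.66 ± 0.054 | 5 |
| WRN-16-10 | 4.45 ± 0.040 | 6.43 | 4.59 ± 0.043* | 5 |
| WRN-22-8 | 4.26 ± 0.032 | 5.85 | 4.96 ± 0.042 | 5 |

* 95% confidence intervals overlap with dense model.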
Table 3: ImageNet (ILSVRC2012) accuracy (%).

| Model | Top-1 | Top-5 | Top-1 | Top-5 |
|---|---|---|---|---|
| Dense baseline (He et al., 2016) | 79.3 | 94.8 | 79.3 | 94.8 |
| | 10% weights | | 20% weights | |
| Static sparse (Mostafa and Wang, 2019) | 67.8 | 88.4 | 71.6 | 90.4 |
| Thin Dense (Mostafa and Wang, 2019) | 70.7 | 89.9 | 72.4 | 90.9 |
| DeepR (Bellec et al., 2018) | 70.2 | 90.0 | 71.7 | 90.6 |
| Compressed sparse (Mostafa and Wang, 2019) | 70.3 | 90.0 | 73.2 | 91.5 |
| Sparse Evolutionary Training (Mocanu et al., 2018) | 70.4 | 90.1 | 72.6 | 91.2 |
| Dynamic Sparse (Mostafa and Wang, 2019) | 71.6 | 90.5 | 73.3 | 92.4 |
| Sparse momentum | 73.1 | 91.5 | 74.9 | 92.5 |
4.1 Speedups and Overhead
We estimated the speedups that could be obtained using sparse momentum in two ways: theoretical speedups for sparse convolution algorithms and practical speedups using dense convolution algorithms. For our sparse convolution estimates, we first benchmark the time taken for each dense convolutional layer for a training run and scale it by the sparsity to estimate the speedups gained (equivalent to FLOPs saved). This reflects the maximum speedup for our sparse networks, which can be obtained if optimized sparse convolution algorithms are used. While a fast sparse convolution algorithm for coarse block structures exists for GPUs (Gray et al., 2017), optimal sparse convolution algorithms for fine-grained patterns do not and need to be developed to enable these speedups.
The second method measures practical speedups that can be obtained with naive, dense convolution algorithms which are available today. For dense convolution algorithms, we estimate speedups as follows: if a convolutional channel contains only zero-valued weights, we can remove it from the computation without any consequences and obtain speedups. We assume a linear speedup with an increasing number of empty convolutional channels. We use an RTX Titan and measure the runtime of a dense convolution in 32-bit. We then scale these measurements by the proportion of empty convolutional channels. Using this measure, we estimated the speedups for our models on CIFAR-10. The resulting speedups can be seen in Table 4. We see that dense convolution speedups are mostly dependent on width, with wider networks receiving larger speedups. Sparse convolution speedups are particularly pronounced for Wide Residual Networks (WRN). These results highlight the importance of developing optimized algorithms for sparse convolution.
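Both estimates follow the same recipe: scale each layer's measured dense runtime by the fraction of work that remains and compare the totals. The sketch below illustrates this with made-up layer timings and weights; the helper names and all numbers are assumptions for the example, and channels are represented as flat lists of weights.

```python
def weight_density(channels):
    """Fraction of weights in a layer that are nonzero."""
    flat = [w for channel in channels for w in channel]
    return sum(1 for w in flat if w != 0.0) / len(flat)


def nonempty_channel_fraction(channels):
    """Fraction of channels that contain at least one nonzero weight."""
    nonempty = sum(1 for channel in channels if any(w != 0.0 for w in channel))
    return nonempty / len(channels)


def estimated_speedup(dense_layer_times, kept_fractions):
    """Dense runtime divided by the runtime after scaling each layer by what is kept."""
    dense_total = sum(dense_layer_times)
    scaled_total = sum(t * f for t, f in zip(dense_layer_times, kept_fractions))
    return dense_total / scaled_total


# Hypothetical measurements: dense runtime per layer in milliseconds.
layer_times_ms = [2.0, 6.0]
layer_channels = [
    [[0.3, 0.0], [0.0, -0.2]],                          # no empty channel
    [[0.0, 0.0], [0.1, 0.0], [0.0, 0.0], [0.0, 0.4]],   # half of the channels empty
]

# Sparse convolution estimate: every zero weight saves work.
sparse_conv_speedup = estimated_speedup(
    layer_times_ms, [weight_density(c) for c in layer_channels])
# Dense convolution estimate: only entirely empty channels save work.
dense_conv_speedup = estimated_speedup(
    layer_times_ms, [nonempty_channel_fraction(c) for c in layer_channels])
print(sparse_conv_speedup, dense_conv_speedup)
```

The gap between the two numbers in this toy example mirrors the gap between the two speedup columns: weights that are zero but sit in a non-empty channel only pay off with a sparse convolution algorithm.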
Beyond speedups, we also measured the overhead of our sparse momentum procedure to be equivalent to a slowdown to 0.973x ± 0.029x compared to a dense baseline.
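Table 4: Estimated speedups on CIFAR-10.

| Model | Speedup: Dense Convolution | Speedup: Sparse Convolution | Weights (%) |
|---|---|---|---|
| AlexNet-s | 1.45x | 4.00x | 10 |
| VGG16-D | 1.36x | 3.51x | 5 |
| WRN 28-2 | 1.19x | 5.82x | 5 |
| WRN 16-10 | 1.16x | 11.85x | 5 |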
5 Analysis
5.1 Ablation Analysis
Our method differs from previous methods like Sparse Evolutionary Training and Dynamic Sparse Reparameterization in two ways: (1) redistribution of weights and (2) growth of weights. To better understand how these components contribute to the overall performance, we ablate these components on CIFAR-10 for VGG16-D and MNIST for LeNet 300-100 and LeNet-5 Caffe with 5% weights for all experiments. The results can be seen in Table 5.
Redistribution according to the magnitude of momentum increases the performance the most for the deeper networks VGG16-D and LeNet-5 Caffe. We hypothesize that the benefit of redistribution algorithms is proportional to the depth of the network: the deeper a network is, the more it relies on learning a hierarchy of features across layers. Redistribution facilitates the learning of hierarchies by moving parameters from shallow layers to deeper layers as training progresses.
Momentum growth increases performance for LeNet 300-100 reliably. There is some evidence that random growth improves performance slightly for VGG16-D and LeNet-5 Caffe, but the confidence intervals overlap, and this observation might be a statistical anomaly. Furthermore, the use of random growth distributes parameters across all convolutional channels, and thus it is no longer possible to achieve speedups with dense convolution algorithms, which is contrary to the main goal of our work. If one is interested in predictive performance, it is more reasonable to increase the number of parameters and use momentum growth, which would yield both better performance and speedups compared to random growth.
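Table 5: Ablation results, test error (%).

| Redistribution | Growth | VGG16-D (CIFAR-10) | LeNet 300-100 (MNIST) | LeNet-5 Caffe (MNIST) |
|---|---|---|---|---|
| momentum | momentum | 6.49 ± 0.045 | 1.53 ± 0.020 | 0.69 ± 0.021 |
| momentum | random | 0.15 ± 0.054 | 0.07 ± 0.022 | 0.05 ± 0.011 |
| None | momentum | 0.79 ± 0.082 | 0.01 ± 0.018 | 0.32 ± 0.071 |
| None | random | 0.49 ± 0.060 | 0.11 ± 0.020 | 0.13 ± 0.013 |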
5.2 Dense vs Sparse Features
Sparse networks need to use every weight effectively to build feature representations which are competitive with dense networks. In this section, we study the difference between sparse and dense features to further our understanding of what the features that enable sparse learning look like.
For feature visualization, it is common to backpropagate activity to the inputs to be able to visualize what these activities represent (Simonyan et al., 2013; Zeiler and Fergus, 2014; Springenberg et al., 2014). However, in our case, we are more interested in the overall distribution of features for each layer within our network, and as such we want to look at the magnitude of the activity in a channel since, unlike feature visualization, we are not just interested in feature detectors but also discriminators. For example, a face detector would induce positive activity for a ‘person’ class but might produce negative activity for a ‘mushroom’ class. Both kinds of activity are useful.
With this reasoning, we develop the following convolutional channel-activation analysis: (1) pass the entire training set through the network and aggregate the magnitude of the activation in each convolutional channel separately for each class; (2) normalize across classes to receive for each channel the proportion of activation which is due to each class; (3) look at the maximum proportion of each channel as a measure of class specialization: a maximum proportion of $1/N_c$, where $N_c$ is the number of classes, indicates that the channel is equally active for all classes in the training set. The further the proportion deviates from this value, the more a channel is specialized for a particular class.
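The three steps of this analysis can be sketched as follows. This is an illustrative simplification, not the code used for our figures: each sample is assumed to be already reduced to one activation value per channel, the function name and the toy data are hypothetical, and a channel that is never active is arbitrarily assigned the uniform value.

```python
def class_specialization(samples, num_classes):
    """Maximum per-class share of activation magnitude for every channel.

    samples: list of (class_label, per_channel_activations) pairs.
    """
    num_channels = len(samples[0][1])
    # (1) Aggregate activation magnitude per channel and class.
    totals = [[0.0] * num_classes for _ in range(num_channels)]
    for label, activations in samples:
        for channel, value in enumerate(activations):
            totals[channel][label] += abs(value)
    scores = []
    for per_class in totals:
        norm = sum(per_class)
        if norm == 0.0:
            scores.append(1.0 / num_classes)  # assumption: treat dead channels as uniform
            continue
        # (2) Normalise across classes, (3) keep the largest proportion.
        scores.append(max(per_class) / norm)
    return scores


# Toy data with two classes and two channels: channel 0 responds to both
# classes (with opposite sign), channel 1 only to class 0.
samples = [(0, [1.0, 2.0]), (1, [-1.0, 0.0]), (0, [1.0, 2.0]), (1, [-1.0, 0.0])]
scores = class_specialization(samples, num_classes=2)
print(scores)
```

Here the first channel scores exactly one over the number of classes (a general feature), while the second scores 1.0 (fully specialized for one class). Using magnitudes means that the negative response of the first channel counts as useful activity.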
Results of this method can be seen for AlexNet-s, VGG16-D, and WRN 28-2 on CIFAR-10 in Figure 3. We see that the convolutional channels in sparse networks have lower class-specialization, indicating they learn features which are useful for a broader range of classes compared to dense networks. This trend intensifies with depth. This suggests that sparse networks might be able to rival dense networks by learning more general features.
6 Conclusion and Future Work
We presented our sparse learning algorithm, sparse momentum, which uses the mean magnitude of momentum to grow and redistribute weights. We showed that sparse momentum outperforms other sparse algorithms on MNIST, CIFAR-10, and ImageNet. Additionally, sparse momentum can rival dense neural network performance while yielding speedups. In our analysis, we showed that sparse networks might be able to rival dense networks by learning more general features compared to dense models. We believe that further study of sparse networks and their representations can inform the design of architectures and deep feature learning algorithms. To fully utilize the improved runtime performance of sparse learning algorithms, future research should focus on specialized sparse convolution and sparse matrix multiplication algorithms.
7 Acknowledgements
This work was funded by a Jeff Dean – Heidi Hopper Endowed Regental Fellowship. We thank Ofir Press, Jungo Kasai, Omer Levy, Sebastian Riedel and Yejin Choi for helpful discussions. We thank Ofir Press, Jungo Kasai, Judit Acs, Zoey Chen, Ethan Perez, and Mohit Shridhar for their helpful reviews and comments.
References
 Bellec et al. (2018) Bellec, G., Kappel, D., Maass, W., and Legenstein, R. A. (2018). Deep rewiring: Training very sparse deep networks. CoRR, abs/1711.05136.

 Carreira-Perpinán and Idelbayev (2018) Carreira-Perpinán, M. A. and Idelbayev, Y. (2018). “Learning-compression” algorithms for neural net pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8532–8541.
 Chauvin (1988) Chauvin, Y. (1988). A backpropagation algorithm with optimal use of hidden units. In NIPS.
 Child et al. (2019) Child, R., Gray, S., Radford, A., and Sutskever, I. (2019). Generating long sequences with sparse transformers. CoRR, abs/1904.10509.
 Dai et al. (2017) Dai, X., Yin, H., and Jha, N. K. (2017). Nest: A neural network synthesis tool based on a grow-and-prune paradigm. CoRR, abs/1711.02017.
 Dai et al. (2018) Dai, X., Yin, H., and Jha, N. K. (2018). Grow and prune compact, fast, and accurate lstms. CoRR, abs/1805.11797.
 Deng et al. (2009) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE.
 Devlin et al. (2018) Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.
 Dong et al. (2017) Dong, X., Chen, S., and Pan, S. J. (2017). Learning to prune deep neural networks via layer-wise optimal brain surgeon. In NIPS.
 Frankle and Carbin (2019) Frankle, J. and Carbin, M. (2019). The lottery ticket hypothesis: Finding sparse, trainable neural networks. In ICLR 2019.
 Frankle et al. (2019) Frankle, J., Dziugaite, G. K., Roy, D. M., and Carbin, M. (2019). The lottery ticket hypothesis at scale. CoRR, abs/1903.01611.
 Gale et al. (2019) Gale, T., Elsen, E., and Hooker, S. (2019). The state of sparsity in deep neural networks. CoRR, abs/1902.09574.
 Gray et al. (2017) Gray, S., Radford, A., and Kingma, D. P. (2017). Gpu kernels for block-sparse weights.
 Guo et al. (2016) Guo, Y., Yao, A., and Chen, Y. (2016). Dynamic network surgery for efficient dnns. In Advances In Neural Information Processing Systems, pages 1379–1387.
 Han et al. (2015) Han, S., Pool, J., Tran, J., and Dally, W. (2015). Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, pages 1135–1143.
 Hassibi and Stork (1992) Hassibi, B. and Stork, D. G. (1992). Second order derivatives for network pruning: Optimal brain surgeon. In NIPS.
 He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778.
 Herculano-Houzel et al. (2010) Herculano-Houzel, S., Mota, B., Wong, P., and Kaas, J. H. (2010). Connectivity-driven white matter scaling and folding in primate cerebral cortex. Proceedings of the National Academy of Sciences of the United States of America, 107(44):19008–13.
 Ishikawa (1996) Ishikawa, M. (1996). Structural learning with forgetting. Neural Networks, 9:509–521.
 Karnin (1990) Karnin, E. D. (1990). A simple procedure for pruning backpropagation trained neural networks. IEEE transactions on neural networks, 1(2):239–42.
 Krizhevsky and Hinton (2009) Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny images. Technical report, Citeseer.
 LeCun (1998) LeCun, Y. (1998). Gradient-based learning applied to document recognition.
 LeCun et al. (1989) LeCun, Y., Denker, J. S., and Solla, S. A. (1989). Optimal brain damage. In NIPS.
 Lee et al. (2019) Lee, N., Ajanthan, T., and Torr, P. H. S. (2019). Snip: Single-shot network pruning based on connection sensitivity. In ICLR 2019.

 Louizos et al. (2017) Louizos, C., Ullrich, K., and Welling, M. (2017). Bayesian compression for deep learning. In Advances in Neural Information Processing Systems, pages 3288–3298.
 Louizos et al. (2018) Louizos, C., Welling, M., and Kingma, D. P. (2018). Learning sparse neural networks through $L_0$ regularization. CoRR, abs/1712.01312.
 Mocanu et al. (2018) Mocanu, D. C., Mocanu, E., Stone, P., Nguyen, P. H., Gibescu, M., and Liotta, A. (2018). Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature communications, 9(1):2383.
 Molchanov et al. (2017) Molchanov, D., Ashukha, A., and Vetrov, D. P. (2017). Variational dropout sparsifies deep neural networks. In International Conference on Machine Learning (ICML).

 Mostafa and Wang (2019) Mostafa, H. and Wang, X. (2019). Parameter efficient training of deep convolutional neural networks by dynamic sparse reparameterization. In International Conference on Machine Learning (ICML).
 Mozer and Smolensky (1988) Mozer, M. C. and Smolensky, P. (1988). Skeletonization: A technique for trimming the fat from a network via relevance assessment. In NIPS.
 Narang et al. (2017) Narang, S., Diamos, G. F., Sengupta, S., and Elsen, E. (2017). Exploring sparsity in recurrent neural networks. CoRR, abs/1704.05119.
 Paszke et al. (2017) Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. (2017). Automatic differentiation in pytorch.
 Qian (1999) Qian, N. (1999). On the momentum term in gradient descent learning algorithms. Neural networks: the official journal of the International Neural Network Society, 12(1):145–151.
 Simonyan et al. (2013) Simonyan, K., Vedaldi, A., and Zisserman, A. (2013). Deep inside convolutional networks: Visualising image classification models and saliency maps. CoRR, abs/1312.6034.
 Springenberg et al. (2014) Springenberg, J. T., Dosovitskiy, A., Brox, T., and Riedmiller, M. A. (2014). Striving for simplicity: The all convolutional net. CoRR, abs/1412.6806.
 Ullrich et al. (2017) Ullrich, K., Meeds, E., and Welling, M. (2017). Soft weight-sharing for neural network compression. CoRR, abs/1702.04008.
 Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems, pages 5998–6008.
 Zagoruyko and Komodakis (2016) Zagoruyko, S. and Komodakis, N. (2016). Wide residual networks. ArXiv, abs/1605.07146.
 Zeiler and Fergus (2014) Zeiler, M. D. and Fergus, R. (2014). Visualizing and understanding convolutional networks. In ECCV.
 Zhou et al. (2019) Zhou, H., Lan, J., Liu, R., and Yosinski, J. (2019). Deconstructing lottery tickets: Zeros, signs, and the supermask. arXiv preprint arXiv:1905.01067.
 Zhu and Gupta (2018) Zhu, M. and Gupta, S. (2018). To prune, or not to prune: Exploring the efficacy of pruning for model compression. CoRR, abs/1710.01878.