1 Introduction
Deep Neural Networks (DNNs) have achieved great success in a broad range of applications, including image recognition imagenet09, language understanding bert18, and games alphago16. The latest DNN architectures, such as ResNet resnet16, DenseNet densenet17 and WideResNet zagoruyko2016wide, incorporate hundreds of millions of parameters to achieve state-of-the-art predictive performance. However, the expanding number of parameters not only increases the risk of overfitting, but also leads to high computational costs. Many practical real-time applications of DNNs, such as on smart phones, drones and IoT (Internet of Things) devices, call for compute- and memory-efficient models, as these devices typically have very limited computation and memory capacities.
Fortunately, it has been shown that DNNs can be pruned or sparsified significantly with minor accuracy losses han2015learning; han2015deep, and sometimes sparsified networks can even achieve higher accuracies due to the regularization effects of the network sparsification algorithms NekMolAsh17; Louizos2017. Driven by the widespread deployment of DNNs in real-time systems, there has been an increasing interest in pruning or sparsifying networks recently han2015learning; han2015deep; WenWuWan16; li2016pruning; louizos2017bayesian; molchanov2017variational; NekMolAsh17; Louizos2017. Earlier methods, such as the magnitude-based approaches han2015learning; han2015deep, prune networks by removing weights of small magnitude; although simple, this approach is very effective at sparsifying network architectures with minor accuracy losses. More recently, the $L_0$-norm based regularization method Louizos2017 has been gaining traction, as it explicitly penalizes the number of nonzero parameters and can drive redundant or insignificant parameters to exactly zero. However, the gradient of the $L_0$-regularized objective function is intractable. Louizos et al. Louizos2017 propose to use the hard concrete distribution as a close surrogate to the Bernoulli distribution; this leads to a differentiable objective function while still being able to zero out redundant or insignificant weights during training. Due to the hard concrete substitution, however, the resulting hard concrete estimator is biased with respect to the original objective function.
In this paper, we propose $L_0$-ARM for network sparsification. $L_0$-ARM is built on top of the $L_0$ regularization framework of Louizos et al. Louizos2017. However, instead of using the biased hard concrete gradient estimator, we investigate the Augment-Reinforce-Merge (ARM) estimator Yin2019, a recently proposed unbiased gradient estimator for stochastic binary optimization. Because of the unbiasedness and flexibility of the ARM estimator, $L_0$-ARM prunes network architectures and reduces FLOPs at a significantly faster rate than the hard concrete estimator. Extensive experiments on multiple public datasets demonstrate the superior performance of $L_0$-ARM at sparsifying networks with fully connected layers and convolutional layers. It achieves state-of-the-art prune rates while retaining similar, and sometimes even higher, accuracies compared to baseline methods. Additionally, it sparsifies the WideResNet models on CIFAR-10 and CIFAR-100, while the original hard concrete estimator cannot.
The remainder of the paper is organized as follows. In Sec. 2 we describe the $L_0$ regularized empirical risk minimization for network sparsification and formulate it as a stochastic binary optimization problem. A new unbiased estimator for this problem, $L_0$-ARM, is presented in Sec. 3, followed by related work in Sec. 4. Example results on multiple public datasets are presented in Sec. 5, with comparisons to baseline methods and state-of-the-art sparsification algorithms. Conclusions and future work are discussed in Sec. 6.

2 Formulation
Given a training set $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$, where $x_i$ denotes the input and $y_i$ denotes the target, a neural network is a function $h(x; \theta)$, parametrized by $\theta$, that fits the training data with the goal of achieving good generalization to unseen test data. To optimize $h(x; \theta)$, typically a regularized empirical risk is minimized, which contains two terms: a data loss over the training data and a regularization loss over the model parameters. Empirically, the regularization term can be weight decay or Lasso, i.e., the $L_2$ or $L_1$ norm of the model parameters.
Since the $L_2$ or $L_1$ norm only imposes shrinkage for large values of $\theta$, the resulting model parameters are often manifested by smaller magnitudes, but none of them is exactly zero. Intuitively, a more appealing alternative is $L_0$ regularization, since the $L_0$ norm measures explicitly the number of nonzero elements, and minimizing it over the model parameters drives the redundant or insignificant weights to exactly zero. With $L_0$ regularization, the empirical risk objective can be written as

$$\mathcal{R}(\theta) = \frac{1}{N}\sum_{i=1}^{N}\mathcal{L}\left(h(x_i; \theta), y_i\right) + \lambda \|\theta\|_0, \qquad (1)$$

where $\mathcal{L}(\cdot)$ denotes the data loss over the training data $\mathcal{D}$, such as the cross-entropy loss for classification or the mean squared error (MSE) for regression, $\|\theta\|_0 = \sum_{j=1}^{|\theta|}\mathbb{1}[\theta_j \neq 0]$ denotes the $L_0$ norm over the model parameters, i.e., the number of nonzero weights, and $\lambda$ is a regularization hyperparameter that balances data loss against model complexity.
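To make the objective concrete, here is a minimal sketch of an $L_0$-regularized risk for a toy linear regression problem; the model, data and $\lambda$ value are illustrative, not from the paper:

```python
import numpy as np

def l0_risk(theta, X, y, lam):
    """L0-regularized risk: MSE data loss + lambda * ||theta||_0."""
    data_loss = float(np.mean((X @ theta - y) ** 2))
    l0_norm = int(np.count_nonzero(theta))  # number of nonzero weights
    return data_loss + lam * l0_norm

X = np.eye(3)
y = np.array([1.0, 0.0, 2.0])
theta = np.array([1.0, 0.0, 2.0])   # fits perfectly with only 2 nonzero weights

risk = l0_risk(theta, X, y, lam=0.1)
assert abs(risk - 0.2) < 1e-12      # zero data loss + 0.1 * 2 nonzero weights
```

Note that the $L_0$ term counts parameters rather than shrinking them, which is exactly why it is non-differentiable and needs the probabilistic treatment developed next.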
To represent a sparsified model, we attach a binary random variable $z_j$ to each element of the model parameters $\theta$. Therefore, we can reparameterize the model parameters as an element-wise product of nonzero parameters $\tilde{\theta}$ and binary random variables $z$:

$$\theta = \tilde{\theta} \odot z, \qquad (2)$$

where $z \in \{0, 1\}^{|\theta|}$ and $\odot$ denotes the element-wise product. As a result, Eq. 1 can be rewritten as

$$\mathcal{R}(\tilde{\theta}, z) = \frac{1}{N}\sum_{i=1}^{N}\mathcal{L}\left(h(x_i; \tilde{\theta} \odot z), y_i\right) + \lambda \sum_{j=1}^{|\theta|} \mathbb{1}[z_j \neq 0], \qquad (3)$$

where $\mathbb{1}[c]$ is an indicator function that is 1 if the condition $c$ is satisfied, and 0 otherwise. Note that neither the first term nor the second term of Eq. 3 is differentiable w.r.t. $z$. Therefore, further approximations need to be considered.
According to stochastic variational optimization SVO18, given any function $\mathcal{F}(z)$ and any distribution $q(z)$, the following inequality holds:

$$\min_{z} \mathcal{F}(z) \le \mathbb{E}_{z \sim q(z)}\left[\mathcal{F}(z)\right], \qquad (4)$$

i.e., the minimum of a function is upper bounded by the expectation of the function. With this result, we can derive an upper bound of Eq. 3 as follows.
Since $z_j$, $\forall j \in \{1, \cdots, |\theta|\}$, is a binary random variable, we assume that $z_j$ follows a Bernoulli distribution with parameter $\pi_j \in [0, 1]$, i.e., $z_j \sim \mathrm{Ber}(z; \pi_j)$. Thus, we can upper bound $\min_z \mathcal{R}(\tilde{\theta}, z)$ by the expectation

$$\hat{\mathcal{R}}(\tilde{\theta}, \pi) = \mathbb{E}_{z \sim \mathrm{Ber}(z; \pi)}\left[\frac{1}{N}\sum_{i=1}^{N}\mathcal{L}\left(h(x_i; \tilde{\theta} \odot z), y_i\right)\right] + \lambda \sum_{j=1}^{|\theta|} \pi_j. \qquad (5)$$

As we can see, the second term is now differentiable w.r.t. the new model parameters $\pi$, while the first term remains problematic, since the expectation over a large number of binary random variables $z$ is intractable, and so is its gradient. With $z$ being binary random variables following Bernoulli distributions with parameters $\pi$, we have thus converted the original $L_0$ regularized empirical risk (1) into a stochastic binary optimization problem (5).
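As a quick sanity check of the variational bound, we can enumerate a small binary space and compare the exact minimum against the Bernoulli expectation. The function $f$ and the probabilities below are illustrative stand-ins for the network loss, chosen only to make the inequality checkable by hand:

```python
import itertools
import numpy as np

target = np.array([1.0, 0.0, 1.0])

def f(z):
    # illustrative "loss" over a binary vector z (not the network loss)
    return float(np.sum((z - target) ** 2) + 0.1 * z.sum())

# enumerate all binary vectors to get the exact minimum
zs = [np.array(bits, dtype=float) for bits in itertools.product([0, 1], repeat=3)]
f_min = min(f(z) for z in zs)

# expectation under an arbitrary Bernoulli(pi) distribution
pi = np.array([0.7, 0.2, 0.9])
f_exp = sum(float(np.prod(np.where(z == 1, pi, 1 - pi))) * f(z) for z in zs)

assert f_min <= f_exp   # min_z f(z) <= E_{z~q(z)}[f(z)]
```

Enumeration is of course infeasible for millions of gates; that is precisely why a gradient estimator for the expectation is needed.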
Existing gradient estimators for this kind of discrete latent variable model include REINFORCE reinforce92, Gumbel-Softmax gumbelsoftmax17; concrete17, REBAR rebar17, RELAX relax18 and the hard concrete estimator Louizos2017. However, these estimators are either biased, suffer from high variance, or are computationally expensive due to auxiliary modeling. Recently, the Augment-Reinforce-Merge (ARM) Yin2019 gradient estimator was proposed for the optimization of binary latent variable models; it is unbiased and exhibits low variance. Extending this gradient estimator to network sparsification, we find that $L_0$-ARM demonstrates superior performance at pruning network architectures while retaining almost the same accuracies as the baseline models. More importantly, similar to the hard concrete estimator, $L_0$-ARM also enables conditional computation BenLeoCou18, which not only sparsifies model architectures for inference but also accelerates model training.

3 $L_0$-ARM: Stochastic Binary Optimization
To minimize Eq. 5, we propose $L_0$-ARM, a stochastic binary optimization algorithm based on the Augment-Reinforce-Merge (ARM) gradient estimator Yin2019. We first introduce the main theorem of ARM; we refer readers to Yin2019 for the proof and other details.
Theorem 1 (ARM) Yin2019. For a vector of $V$ binary random variables $z = (z_1, \cdots, z_V)$, the gradient of

$$\mathcal{E}(\phi) = \mathbb{E}_{z \sim \prod_{v=1}^{V} \mathrm{Ber}(z_v; \sigma(\phi_v))}\left[f(z)\right] \qquad (6)$$

w.r.t. $\phi = (\phi_1, \cdots, \phi_V)$, the logits of the Bernoulli distribution parameters, can be expressed as

$$\nabla_{\phi}\,\mathcal{E}(\phi) = \mathbb{E}_{u \sim \prod_{v=1}^{V} \mathrm{Unif}_{[0,1]}(u_v)}\left[\left(f\big(\mathbb{1}[u > \sigma(-\phi)]\big) - f\big(\mathbb{1}[u < \sigma(\phi)]\big)\right)\left(u - \frac{1}{2}\right)\right], \qquad (7)$$

where $\mathbb{1}[u > \sigma(-\phi)] := \left(\mathbb{1}[u_1 > \sigma(-\phi_1)], \cdots, \mathbb{1}[u_V > \sigma(-\phi_V)]\right)$ and $\sigma(\cdot)$ is the sigmoid function.
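The ARM identity of Eq. 7 is easy to verify numerically. The sketch below (the toy function $f$ and the logits are illustrative choices, not from the paper) compares the Monte Carlo ARM gradient with the exact gradient obtained by enumerating the Bernoulli expectation and taking finite differences:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

w = np.array([1.0, -2.0, 3.0])
def f(z):                       # toy function on binary vectors, vectorized
    return (z @ w) ** 2

phi = np.array([0.3, -0.5, 1.0])

def exact_E(phi):               # E_{z~Ber(sigma(phi))}[f(z)] by enumeration
    p = sigmoid(phi)
    total = 0.0
    for bits in itertools.product([0, 1], repeat=3):
        z = np.array(bits, dtype=float)
        total += float(np.prod(np.where(z == 1, p, 1 - p)) * f(z))
    return total

eps = 1e-5
exact_grad = np.array([(exact_E(phi + eps * e) - exact_E(phi - eps * e)) / (2 * eps)
                       for e in np.eye(3)])

# ARM: two antithetic forward passes per uniform draw u
u = rng.uniform(size=(200_000, 3))
f1 = f((u > sigmoid(-phi)).astype(float))
f2 = f((u < sigmoid(phi)).astype(float))
arm_grad = ((f1 - f2)[:, None] * (u - 0.5)).mean(axis=0)

assert np.allclose(arm_grad, exact_grad, atol=0.05)
```

Note that the two forward evaluations share the same uniform draw $u$, which is the antithetic coupling responsible for the estimator's low variance.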
Parameterizing $\pi_j$ as $\sigma(\phi_j)$, Eq. 5 can be rewritten as

$$\hat{\mathcal{R}}(\tilde{\theta}, \phi) = \mathbb{E}_{z \sim \mathrm{Ber}(z; \sigma(\phi))}\left[f(z)\right] + \lambda \sum_{j=1}^{|\theta|} \sigma(\phi_j), \qquad (8)$$

where $f(z) = \frac{1}{N}\sum_{i=1}^{N}\mathcal{L}\left(h(x_i; \tilde{\theta} \odot z), y_i\right)$. Now, according to Theorem 1, we can evaluate the gradient of Eq. 8 w.r.t. $\phi$ by

$$\nabla_{\phi}\,\hat{\mathcal{R}}(\tilde{\theta}, \phi) = \mathbb{E}_{u \sim \mathrm{Unif}_{[0,1]}}\left[\left(f\big(\mathbb{1}[u > \sigma(-\phi)]\big) - f\big(\mathbb{1}[u < \sigma(\phi)]\big)\right)\left(u - \frac{1}{2}\right)\right] + \lambda \sum_{j=1}^{|\theta|} \nabla_{\phi}\,\sigma(\phi_j), \qquad (9)$$

which is an unbiased and low-variance estimator, as demonstrated in Yin2019.
Note from Eq. 9 that we need to evaluate $f(\cdot)$ twice to compute the gradient, the second of which, $f(\mathbb{1}[u < \sigma(\phi)])$, is the same operation required by the data loss of Eq. 8. Therefore, one extra forward pass, $f(\mathbb{1}[u > \sigma(-\phi)])$, is required by the ARM gradient estimator. This additional forward pass can be computationally expensive, especially for networks with millions of parameters. To reduce the computational complexity of Eq. 9, we further consider another gradient estimator, Augment-Reinforce (AR) Yin2019:

$$\nabla_{\phi}^{\mathrm{AR}}\,\hat{\mathcal{R}}(\tilde{\theta}, \phi) = \mathbb{E}_{u \sim \mathrm{Unif}_{[0,1]}}\left[f\big(\mathbb{1}[u < \sigma(\phi)]\big)\left(1 - 2u\right)\right] + \lambda \sum_{j=1}^{|\theta|} \nabla_{\phi}\,\sigma(\phi_j), \qquad (10)$$

which requires only one forward pass, $f(\mathbb{1}[u < \sigma(\phi)])$, the same operation required by Eq. 8. The AR gradient estimator is still unbiased, but has higher variance. With AR, we can thus trade the variance of the estimator against computational complexity. We evaluate the impact of this trade-off in our experiments.
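For a single gate, the AR estimator can likewise be checked against the closed-form gradient $\sigma'(\phi)\,(f(1) - f(0))$; the toy $f$ and logit below are for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

phi = 0.4
f = lambda z: 3.0 * z + 1.0     # toy scalar function on z in {0, 1}

# exact: d/dphi [sigma(phi) f(1) + (1 - sigma(phi)) f(0)] = sigma'(phi) (f(1) - f(0))
s = sigmoid(phi)
exact = s * (1.0 - s) * (f(1.0) - f(0.0))

# AR: a single forward pass f(1[u < sigma(phi)]) per uniform draw
u = rng.uniform(size=500_000)
ar = (f((u < s).astype(float)) * (1.0 - 2.0 * u)).mean()

assert abs(ar - exact) < 0.02   # unbiased, but higher variance than ARM
```

In practice the variance gap shows up as a noisier gradient for the same number of samples, which is the trade-off against the saved forward pass.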
3.1 Choice of $g(\phi)$
Theorem 1 of ARM defines $\pi = \sigma(\phi)$, where $\sigma(\cdot)$ is the sigmoid function. For the purpose of network sparsification, we find that this parametric function is not very effective due to its slow transition between the values 0 and 1. Thanks to the flexibility of ARM, we have a lot of freedom in designing the parametric function $g(\phi)$, and it is straightforward to generalize Theorem 1 to any parametric function (smooth or non-smooth) as long as $g(\phi) \in [0, 1]$ and $g(-\phi) = 1 - g(\phi)$. (The second condition is not strictly necessary, but for simplicity we impose it to select parametric functions that are antithetic; designing $g(\phi)$ without this constraint is a potential area worthy of further investigation.) Example parametric functions that work well in our experiments are the scaled sigmoid function

$$\sigma_k(\phi) = \sigma(k\phi) = \frac{1}{1 + e^{-k\phi}}, \qquad (11)$$

and the centered-scaled hard sigmoid

$$\bar{\sigma}_k(\phi) = \min\left(1, \max\left(0, \frac{k}{7}\phi + \frac{1}{2}\right)\right), \qquad (12)$$

where the factor 7 is introduced such that $\bar{\sigma}_1(\phi) \approx \sigma_1(\phi) = \sigma(\phi)$. See Fig. 1 for example plots of $\sigma_k(\phi)$ and $\bar{\sigma}_k(\phi)$ with different $k$. Empirically, we find that $k = 7$ works well for all of our experiments.
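Both gates are a few lines of code. The sketch below (the slope of the hard sigmoid is an illustrative choice) checks the two required conditions: values stay in $[0, 1]$ and the gate is antithetic, $g(-\phi) = 1 - g(\phi)$:

```python
import numpy as np

def scaled_sigmoid(phi, k):
    """sigma(k * phi): larger k gives a sharper 0-to-1 transition."""
    return 1.0 / (1.0 + np.exp(-k * phi))

def hard_sigmoid(phi, slope):
    """A centered hard sigmoid: a linear ramp through (0, 1/2), clipped to [0, 1]."""
    return np.clip(slope * phi + 0.5, 0.0, 1.0)

phi = np.linspace(-3.0, 3.0, 13)
for g in (lambda p: scaled_sigmoid(p, 7.0), lambda p: hard_sigmoid(p, 1.0)):
    vals = g(phi)
    assert np.all((0.0 <= vals) & (vals <= 1.0))                 # g(phi) in [0, 1]
    assert abs(float(g(np.array([0.0]))[0]) - 0.5) < 1e-12       # g(0) = 1/2
    assert np.allclose(g(-phi), 1.0 - g(phi))                    # antithetic
```

The hard sigmoid is the one gate that can output exactly 0 or 1, which matters for conditional computation as discussed next.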
One important difference between the hard concrete estimator of Louizos et al. Louizos2017 and $L_0$-ARM is that the hard concrete estimator has to rely on the hard sigmoid gate to zero out some parameters during training (a.k.a. conditional computation BenLeoCou18), while $L_0$-ARM achieves conditional computation naturally by sampling from the Bernoulli distribution parameterized by $g(\phi)$, where $g(\phi)$ can be any parametric function (smooth or non-smooth), as shown in Fig. 1. We validate this in our experiments.
3.2 Sparsifying Network Architectures for Inference
After training, we obtain the model parameters $\tilde{\theta}$ and $\phi$. At test time, we can use the expectation of $z \sim \mathrm{Ber}(z; g(\phi))$ as the mask $\hat{z}$ for the final model parameters $\theta$:

$$\hat{z} = \mathbb{E}[z] = g(\phi), \qquad \theta = \tilde{\theta} \odot \hat{z}. \qquad (13)$$

However, this does not yield a sparsified network for inference, since none of the elements of $\hat{z}$ is exactly zero (unless the hard sigmoid gate $\bar{\sigma}_k(\phi)$ is used). A simple approximation is to set the elements of $\hat{z}$ to zero if the corresponding values in $g(\phi)$ are less than a threshold $\tau$, i.e.,

$$\hat{z}_j = \begin{cases} 0, & \text{if } g(\phi_j) \le \tau \\ g(\phi_j), & \text{otherwise} \end{cases} \qquad j = 1, \cdots, |\theta|. \qquad (14)$$

We find that this approximation is very effective in all of our experiments, as the histogram of $g(\phi)$ is widely split into two spikes around the values of 0 and 1 after training, thanks to the sharp transition of the scaled sigmoid (or hard sigmoid) function. See Fig. 2 for a typical plot of the histograms of $g(\phi)$ evolving during the training process. We notice that our algorithm is not very sensitive to $\tau$: tuning it incurs a negligible impact on prune rates and model accuracies. Therefore, we set $\tau = 0.5$ by default for all of our experiments. A better designed $\tau$ is certainly possible by inspecting the histogram of $g(\phi)$, but we find this unnecessary for the experiments in this paper and leave such a histogram-dependent $\tau$ as a future improvement.
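The test-time procedure amounts to thresholding the gate probabilities and scaling the survivors by the expected gate. A minimal sketch (the gate function, its sharpness, and the parameter values are illustrative):

```python
import numpy as np

def prune_for_inference(theta_tilde, phi, g, tau=0.5):
    """Mask parameters whose gate probability g(phi) is at most tau,
    and scale the survivors by the expected gate E[z] = g(phi)."""
    pi = g(phi)
    mask = (pi > tau).astype(theta_tilde.dtype)
    return theta_tilde * pi * mask

theta_tilde = np.array([0.5, -1.2, 0.8, 2.0])
phi = np.array([-4.0, 3.0, -5.0, 6.0])
g = lambda p: 1.0 / (1.0 + np.exp(-7.0 * p))   # sharp scaled sigmoid, illustrative

pruned = prune_for_inference(theta_tilde, phi, g)
assert np.all(pruned[[0, 2]] == 0.0)    # low-probability gates pruned to exact zero
assert np.all(pruned[[1, 3]] != 0.0)    # high-probability gates survive
```

Because a sharp gate pushes $g(\phi)$ close to 0 or 1 during training, the surviving weights are scaled by a factor near 1, so thresholding barely perturbs the trained function.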
3.3 Imposing Shrinkage on Model Parameters
The $L_0$ regularized objective function (3) leads to sparse estimates of the model parameters without imposing any shrinkage on the magnitudes of $\tilde{\theta}$. In some cases it might still be desirable to regularize the magnitudes of the model parameters with other norms, such as $L_1$ or $L_2$ (weight decay), to improve the robustness of the model. This can be achieved conveniently by computing the expected $L_1$ or $L_2$ norm of $\theta$ under the same Bernoulli distribution $z_j \sim \mathrm{Ber}(z; g(\phi_j))$ as follows:

$$\mathbb{E}\left[\|\theta\|_1\right] = \sum_{j=1}^{|\theta|} \mathbb{E}\left[|z_j \tilde{\theta}_j|\right] = \sum_{j=1}^{|\theta|} g(\phi_j)\,|\tilde{\theta}_j|, \qquad (15)$$

$$\mathbb{E}\left[\|\theta\|_2^2\right] = \sum_{j=1}^{|\theta|} \mathbb{E}\left[(z_j \tilde{\theta}_j)^2\right] = \sum_{j=1}^{|\theta|} g(\phi_j)\,\tilde{\theta}_j^2, \qquad (16)$$

which can be incorporated into the training objective as additional regularization terms.
3.4 Group Sparsity Under $L_0$ and $L_2$ Norms
The formulation so far promotes weight-level sparsity in network architectures. This sparsification strategy can compress a model and reduce its memory footprint. However, it usually does not lead to effective speedups, because weight-sparsified networks require sparse matrix multiplication and irregular memory access, which make it extremely challenging to effectively utilize the parallel computing resources of GPUs and CPUs. For computational efficiency, it is usually preferable to enforce group sparsity instead of weight-level sparsity. Similar to WenWuWan16; NekMolAsh17; Louizos2017, we can achieve this by sharing a stochastic binary gate $z$ among all the weights in a group. For example, a group can be all fan-out weights of a neuron in a fully-connected layer or all weights of a convolution filter. With this, the group-regularized $L_0$ and $L_2$ norms can be conveniently expressed as

$$\mathbb{E}\left[\|\theta\|_0\right] = \sum_{g=1}^{G} |g|\; g(\phi_g), \qquad (17)$$

$$\mathbb{E}\left[\|\theta\|_2^2\right] = \sum_{g=1}^{G} g(\phi_g) \sum_{j=1}^{|g|} \tilde{\theta}_j^2, \qquad (18)$$

where $G$ denotes the number of groups and $|g|$ denotes the number of weights in group $g$. For computational efficiency, we enforce this group sparsity in all of our experiments.
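Once each group shares a single gate probability, the expected group norms reduce to weighted sums; a sketch with illustrative groups and probabilities:

```python
import numpy as np

def expected_group_norms(theta_groups, pi):
    """Expected L0 and squared-L2 norms when each group g shares one
    Bernoulli(pi[g]) gate over all of its weights."""
    exp_l0 = sum(p * len(tg) for p, tg in zip(pi, theta_groups))
    exp_l2 = sum(p * float(np.sum(tg ** 2)) for p, tg in zip(pi, theta_groups))
    return exp_l0, exp_l2

# two illustrative groups (e.g. one neuron's fan-out, one conv filter)
groups = [np.array([1.0, -1.0]), np.array([2.0, 0.0, 1.0])]
pi = [0.5, 1.0]

l0, l2 = expected_group_norms(groups, pi)
assert l0 == 0.5 * 2 + 1.0 * 3      # = 4.0 expected nonzero weights
assert l2 == 0.5 * 2.0 + 1.0 * 5.0  # = 6.0 expected squared L2 norm
```

Note that there is one gate parameter per group rather than per weight, so the regularizer's cost is negligible next to the forward pass.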
4 Related Work
It is well known that DNNs are extremely compute- and memory-intensive. Recently, there has been an increasing interest in network sparsification han2015learning; han2015deep; WenWuWan16; li2016pruning; louizos2017bayesian; molchanov2017variational; NekMolAsh17; Louizos2017, as the applications of DNNs to practical real-time systems, such as IoT devices, call for compute- and memory-efficient networks. One of the earliest sparsification methods is to prune redundant weights based on their magnitudes lecun1990optimal, which has proven effective in modern CNNs han2015learning. Although weight sparsification is able to compress networks, it can barely improve computational efficiency due to unstructured sparsity WenWuWan16. Therefore, magnitude-based group sparsity was proposed WenWuWan16; li2016pruning, which can compress networks while reducing computation cost significantly. These magnitude-based methods usually proceed in three stages: pretrain a full network, prune the redundant weights or filters, and finetune the pruned model. By comparison, our method $L_0$-ARM trains a sparsified network from scratch, without pretraining or finetuning, and is therefore preferable.
Another category of sparsification methods is based on Bayesian statistics and information theory molchanov2017variational; NekMolAsh17; louizos2017bayesian. For example, inspired by variational dropout kingma2015variational, Molchanov et al. propose a method that unbinds the dropout rate and also leads to sparsified networks molchanov2017variational.
Recently, Louizos et al. Louizos2017 proposed to sparsify networks with the $L_0$ norm. Since the $L_0$ regularization explicitly penalizes the number of nonzero parameters, this method is conceptually very appealing. However, the non-differentiability of the $L_0$ norm prevents effective gradient-based optimization, so Louizos et al. Louizos2017 propose a hard concrete gradient estimator for this optimization problem. Our work is built on top of their formulation; however, instead of using the hard concrete estimator, we investigate the Augment-Reinforce-Merge (ARM) estimator Yin2019, a recently proposed unbiased estimator, for this binary optimization problem.
5 Experimental Results
We evaluate the performance of $L_0$-ARM and $L_0$-AR on multiple public datasets and multiple network architectures. Specifically, we evaluate an MLP lecun1998gradient and LeNet-5-Caffe (https://github.com/BVLC/caffe/tree/master/examples/mnist) on the MNIST dataset mnist, and Wide Residual Networks zagoruyko2016wide on the CIFAR-10 and CIFAR-100 datasets cifar10. For baselines, we refer to the following state-of-the-art sparsification algorithms: Sparse Variational Dropout (Sparse VD) molchanov2017variational, Bayesian Compression with group normal-Jeffreys (BC-GNJ) and group horseshoe (BC-GHS) louizos2017bayesian, and $L_0$-norm regularization with the hard concrete estimator (HC) Louizos2017. For a fair comparison, we closely follow the experimental setups of HC (https://github.com/AMLabAmsterdam/L0_regularization).

5.1 Implementation Details
We incorporate $L_0$-ARM and $L_0$-AR into the architectures of MLP, LeNet-5 and Wide-ResNet. As described in Sec. 3.4, instead of sparsifying weights, we apply group sparsity on neurons in fully-connected layers or on convolution filters in convolutional layers. Once a neuron or filter is pruned, all related weights are removed from the network.
The Multi-Layer Perceptron (MLP) lecun1998gradient has two hidden layers of size 300 and 100, respectively. We initialize $\phi$ by random samples from a normal distribution for the input layer and for the hidden layers, which activate around 80% of the neurons in the input layer and around 50% of the neurons in the hidden layers. LeNet-5-Caffe consists of two convolutional layers of 20 and 50 filters interspersed with max pooling layers, followed by two fully-connected layers with 500 and 10 neurons. We initialize $\phi$ for all neurons and filters by random samples from a normal distribution. Wide-ResNets (WRNs) zagoruyko2016wide have shown state-of-the-art performance on many image classification benchmarks. Following Louizos2017, we only apply $L_0$ regularization on the first convolutional layer of each residual block, which allows us to incorporate $L_0$ regularization without further modifying the residual block architecture. The architectural details of WRN are listed in Table 1. For initialization, we activate around 70% of the convolution filters.

Group name  Layers
conv1  [Original Conv (16)]
conv2  [$L_0$-ARM (160); Original Conv (160)] x 4
conv3  [$L_0$-ARM (320); Original Conv (320)] x 4
conv4  [$L_0$-ARM (640); Original Conv (640)] x 4
For MLP and LeNet-5, we train with a mini-batch of 100 data samples and use Adam kingma2014adam as the optimizer, with an initial learning rate that is halved every 100 epochs. For Wide-ResNet, we train with a mini-batch of 128 data samples and use Nesterov momentum as the optimizer, with an initial learning rate that is decayed by 0.2 at epochs 60 and 120. Each of these experiments runs for 200 epochs in total. For a fair comparison, these experimental setups closely follow those described in HC Louizos2017 and its open-source implementation.

5.2 MNIST Experiments
We run both MLP and LeNet-5 on the MNIST dataset. By tuning the regularization strength $\lambda$, we can control the trade-off between sparsity and accuracy. We can use one $\lambda$ for all layers or a separate $\lambda$ for each layer to fine-tune the sparsity preference. In our experiments, we set a single $\lambda$ or separate per-layer $\lambda$s for MLP and for LeNet-5, expressed in units of $1/N$, where $N$ denotes the number of training datapoints.
We use three metrics to evaluate the performance of an algorithm: prediction accuracy, prune rate, and expected number of floating point operations (FLOPs). Prune rate is defined as the ratio of the number of pruned weights to the number of all weights. Prune rate manifests the memory saving of a sparsified network, while expected FLOPs demonstrates the training / inference cost of a sparsification algorithm.
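The prune-rate metric, as defined above, is one minus the fraction of weights kept; a sketch over per-layer binary keep-masks (the masks below are illustrative):

```python
import numpy as np

def prune_rate(masks):
    """Fraction of weights removed: 1 - kept / total, over all layer masks
    (mask entry 1 = weight kept, 0 = weight pruned)."""
    total = sum(m.size for m in masks)
    kept = sum(int(m.sum()) for m in masks)
    return 1.0 - kept / total

masks = [np.array([1, 0, 0, 1]), np.array([0, 0, 1, 1, 1, 0])]
assert abs(prune_rate(masks) - 0.5) < 1e-12   # 5 of 10 weights pruned
```

With group sparsity, a neuron's or filter's mask entry is simply replicated over all weights in its group before this count is taken.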
Network  Method  Pruned Architecture  Prune rate (%)  Accuracy (%)

MLP:
Sparse VD  219-214-100  74.72  98.2
BC-GNJ  278-98-13  89.24  98.2
BC-GHS  311-86-14  89.45  98.2
HC ($\lambda$)  219-214-100  73.98  98.6
HC ($\lambda$ sep.)  266-88-33  89.99  98.2
AR ($\lambda$)  453-150-68  70.39  98.3
ARM ($\lambda$)  143-153-78  87.00  98.3
AR ($\lambda$ sep.)  464-114-65  77.10  98.2
ARM ($\lambda$ sep.)  159-74-73  92.96  98.1

LeNet-5-Caffe:
Sparse VD  14-19-242-131  90.7  99.0
GL  3-12-192-500  76.3  99.0
GD  7-13-208-16  98.62  99.0
SBP  3-18-284-283  80.34  99.0
BC-GNJ  8-13-88-13  99.05  99.0
BC-GHS  5-10-76-16  99.36  99.0
HC ($\lambda$)  20-25-45-462  91.1  99.1
HC ($\lambda$ sep.)  9-18-65-25  98.6  99.0
AR ($\lambda$)  18-28-46-249  93.73  98.8
ARM ($\lambda$)  20-16-32-257  95.52  99.1
AR ($\lambda$ sep.)  5-12-131-22  98.90  98.4
ARM ($\lambda$ sep.)  6-10-39-11  99.49  98.7
We compare $L_0$-ARM and $L_0$-AR to five state-of-the-art sparsification algorithms on MNIST, with results shown in Table 2. For the comparison between HC and AR(M) with a single shared $\lambda$, we use the exact same hyperparameters for both algorithms (the fairest comparison). In this case, ARM achieves the same accuracy (99.1%) on LeNet-5 with an even sparser pruned architecture (95.52% vs. 91.1%). When separate $\lambda$s are considered ($\lambda$ sep.), since HC does not disclose the specific $\lambda$s for the last two fully-connected layers, we tune them ourselves and use the settings that yield the best performance. In this case, ARM achieves the highest prune rate (99.49% vs. 98.6%) with very similar accuracies (98.7% vs. 99.1%) on LeNet-5. Similar patterns are also observed on MLP. Regarding AR, although its performance is not as good as ARM's, it is still very competitive with all the other methods. The advantage of AR over ARM is its lower computational complexity during training: as discussed in Sec. 3, ARM needs one extra forward pass to estimate the gradient w.r.t. $\phi$, and for large DNN architectures this extra cost can be significant.
To evaluate the training cost and network sparsity of the different algorithms, we compare the prune rates of HC and AR(M) on LeNet-5 as a function of epoch in Fig. 3 (a, b). Similarly, we compare the expected FLOPs of the different algorithms as a function of epoch in Fig. 3 (c, d). As we can see from (a, b), ARM yields much sparser network architectures over the whole course of training, followed by AR and HC. The FLOPs vs. epoch plots in (c, d) are more nuanced. Because HC and AR only need one forward pass to compute the gradient, they have the same expected FLOPs for training and inference. ARM needs two forward passes for training. Therefore, ARM is computationally more expensive during training (red curves), but it leads to sparser / more efficient architectures for inference (green curves), which pays off its extra cost in training.
5.3 CIFAR Experiments
We further evaluate the performance of $L_0$-ARM and $L_0$-AR with Wide-ResNet zagoruyko2016wide on CIFAR-10 and CIFAR-100. Following Louizos2017, we only apply $L_0$ regularization on the first convolutional layer of each residual block, which allows us to incorporate $L_0$ regularization without further modifying the residual block architecture.
Table 3 shows the performance comparison between AR(M) and three baseline methods. We find that HC cannot sparsify the Wide-ResNet architecture (prune rate 0%); this was also reported recently in the appendix of GalElsHoo19, and can be easily reproduced with the open-source implementation of HC. In contrast, ARM and AR prune around 50% of the parameters of the impacted subnet. Since we activate 70% of the convolution filters at initialization, the roughly 50% prune rate is not an artifact of initialization. We also inspect the histograms of $g(\phi)$: as expected, they are all split into two spikes around the values of 0 and 1, similar to the histograms shown in Fig. 2. In terms of accuracy, both ARM and AR achieve very similar accuracies to the baseline methods.
Network  Method  Prune rate (%)  Accuracy (%)

Wide-ResNet (CIFAR-10):
Original WRN zagoruyko2016wide (full model)  0  96.00
Original WRN-dropout zagoruyko2016wide (full model)  0  96.11
HC ($\lambda_1$) Louizos2017 (full model)  0  96.17
HC ($\lambda_2$) Louizos2017 (full model)  0  96.07
AR ($\lambda_1$)  49.49  95.58
ARM ($\lambda_1$)  49.46  95.68
AR ($\lambda_2$)  49.95  95.60
ARM ($\lambda_2$)  49.63  95.70

Wide-ResNet (CIFAR-100):
Original WRN zagoruyko2016wide (full model)  0  78.82
Original WRN-dropout zagoruyko2016wide (full model)  0  81.15
HC ($\lambda_1$) Louizos2017 (full model)  0  81.25
HC ($\lambda_2$) Louizos2017 (full model)  0  80.96
AR ($\lambda_1$)  49.37  80.50
ARM ($\lambda_1$)  50.51  80.74
AR ($\lambda_2$)  50.93  80.09
ARM ($\lambda_2$)  50.78  80.56
To evaluate the training and inference costs of the different algorithms, we compare the expected FLOPs of HC and AR(M) on CIFAR-10 and CIFAR-100 as a function of iteration in Fig. 4. Similar to Fig. 3, ARM is computationally more expensive for training, but leads to sparser / more efficient architectures for inference, which pays off its extra cost in training. It is worth emphasizing that in these experiments AR has the lowest training FLOPs and inference FLOPs (since only one forward pass is needed for both training and inference), while achieving very similar accuracies to the baseline methods (Table 3).
Finally, we compare the test accuracies of the different algorithms as a function of epoch on CIFAR-10, with results shown in Fig. 5. We apply the exact same hyperparameters of HC to AR(M). As AR(M) prunes around 50% of the parameters during training (while HC prunes 0%), its test accuracies are lower than HC's before convergence, but all algorithms yield very similar accuracies after convergence, demonstrating the effectiveness of AR(M).
6 Conclusion
We propose $L_0$-ARM, built on an unbiased and low-variance gradient estimator, to sparsify network architectures. Compared to HC Louizos2017 and other state-of-the-art sparsification algorithms, $L_0$-ARM demonstrates superior performance at sparsifying network architectures while retaining almost the same accuracies as the baseline methods. Extensive experiments on multiple public datasets and multiple network architectures validate the effectiveness of $L_0$-ARM. Overall, $L_0$-ARM yields the sparsest architectures and the lowest inference FLOPs for all the networks considered, with very similar accuracies to the baseline methods.
As for future extensions, we plan to design better (possibly non-antithetic) parametric functions $g(\phi)$ to improve the sparsity of solutions. We also plan to investigate more efficient algorithms for evaluating the ARM gradient (Eq. 9) by utilizing the antithetic structure of its two forward passes.
References
 [1] Mingzhang Yin and Mingyuan Zhou. ARM: Augment-REINFORCE-merge gradient for stochastic binary networks. In International Conference on Learning Representations (ICLR), 2019.
 [2] Christos Louizos, Max Welling, and Diederik P. Kingma. Learning sparse neural networks through $L_0$ regularization. In International Conference on Learning Representations (ICLR), 2018.

 [3] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
 [4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
 [5] David Silver, Aja Huang, Christopher J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of go with deep neural networks and tree search. Nature, 529:484–503, 2016.
 [6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
 [7] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
 [8] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In The British Machine Vision Conference (BMVC), 2016.
 [9] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, pages 1135–1143, 2015.
 [10] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. In International Conference on Learning Representations (ICLR), 2016.
 [11] Kirill Neklyudov, Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Structured bayesian pruning via lognormal multiplicative noise. In Advances in Neural Information Processing Systems (NIPS), 2017.
 [12] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems (NIPS), 2016.
 [13] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.

 [14] Christos Louizos, Karen Ullrich, and Max Welling. Bayesian compression for deep learning. In Advances in Neural Information Processing Systems, pages 3288–3298, 2017.
 [15] Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variational dropout sparsifies deep neural networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 2498–2507. JMLR.org, 2017.
 [16] Thomas Bird, Julius Kunze, and David Barber. Stochastic variational optimization. arXiv preprint arXiv:1809.04855, 2018.

 [17] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, May 1992.
 [18] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-Softmax. In International Conference on Learning Representations (ICLR), 2017.

 [19] Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. In International Conference on Learning Representations (ICLR), 2017.
 [20] George Tucker, Andriy Mnih, Chris J. Maddison, John Lawson, and Jascha Sohl-Dickstein. REBAR: Low-variance, unbiased gradient estimates for discrete latent variable models. In Advances in Neural Information Processing Systems (NIPS), 2017.
 [21] Will Grathwohl, Dami Choi, Yuhuai Wu, Geoff Roeder, and David Duvenaud. Backpropagation through the void: Optimizing control variates for blackbox gradient estimation. In International Conference on Learning Representations (ICLR), 2018.
 [22] Yoshua Bengio, Nicholas Leonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
 [23] Yann LeCun, John S Denker, and Sara A Solla. Optimal brain damage. In Advances in neural information processing systems, pages 598–605, 1990.
 [24] Durk P Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pages 2575–2583, 2015.
 [25] Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradientbased learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 [26] Yann Lecun, Leon Bottou, Yoshua Bengio, and Patrick Haffner. Gradientbased learning applied to document recognition. In Proceedings of the IEEE, pages 2278–2324, 1998.
 [27] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
 [28] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
 [29] Trevor Gale, Erich Elsen, and Sara Hooker. The state of sparsity in deep neural networks. arXiv preprint arXiv:1902.09574, 2019.