1 Introduction
Deep neural networks (DNNs) have provided state-of-the-art solutions for several challenging tasks in many domains such as computer vision, natural language understanding, and speech processing. With the increasing demand for deploying DNNs on resource-constrained edge devices, it has become even more critical to reduce the memory footprint of neural networks and to achieve power-efficient inference on these devices. Many methods in model compression
(Hassibi et al., 1993; LeCun et al., 1989; Han et al., 2015b; Zhang et al., 2018), model quantization (Jacob et al., 2018; Lin et al., 2016; Zhou et al., 2017; Faraone et al., 2018) and neural architecture search (Sandler et al., 2018; Tan and Le, 2019a; Cai et al., 2018; Wu et al., 2019) have been introduced with these goals in mind. Neural network compression mainly falls into two categories: structured and unstructured pruning. Structured pruning methods, e.g., (He et al., 2017; Li et al., 2017; Zhang et al., 2016; He et al., 2018), change the network's architecture by removing input channels from convolutional layers or by applying tensor decomposition to the layer weight matrices, whereas unstructured pruning methods such as
(Han et al., 2015b; Frankle and Carbin, 2019; Zhang et al., 2018) rely on removing individual weights from the neural network. Although unstructured pruning methods achieve much higher weight-sparsity ratios than structured pruning, unstructured pruning is thought to be less hardware-friendly because the irregular sparsity is often difficult to exploit for efficient computation (Anwar et al., 2017). However, recent advances in AI accelerator design (Ignatov et al., 2018) have targeted support for highly efficient sparse matrix multiply-and-accumulate operations. Because of this, it is becoming increasingly important to develop state-of-the-art algorithms for unstructured pruning. Most unstructured weight-pruning methods are based on the assumption that smaller weights do not contribute as much to the model's performance. These pruning methods iteratively prune the weights that are smaller than a certain threshold and retrain the network to regain the performance lost during pruning. A key challenge in unstructured pruning is to find an optimal setting for these pruning thresholds. Merely setting the same threshold for all layers may not be appropriate because the distributions and ranges of the weights in each layer can be very different. Also, different layers may have varying sensitivities to pruning, depending on their position in the network (initial layers versus final layers) or their type (depthwise-separable versus standard convolutional layers). The best setting of thresholds should consider these layer-wise characteristics. Many methods (Zhang et al., 2018; Ye et al., 2019; Manessi et al., 2018) propose ways to search for these layer-wise thresholds but become quite computationally expensive for networks with a large number of layers, such as ResNet-50 or EfficientNet.
In this paper, we propose Learned Threshold Pruning (LTP) to address these challenges. Our proposed method uses a separate pruning threshold for every layer. We make the layer-wise thresholds trainable, allowing the training procedure to find optimal thresholds alongside the layer weights during fine-tuning. An added benefit of making these thresholds trainable is that it makes LTP fast; the method converges quickly compared to other iterative methods such as Zhang et al. (2018); Ye et al. (2019). LTP also achieves high compression on newer networks (Tan and Le, 2019a; Sandler et al., 2018; Tan and Le, 2019b) with squeeze-excite (Hu et al., 2018) and depthwise convolutional layers (Chollet, 2017).
Our key contributions in this work are the following:


- We propose a gradient-based algorithm for unstructured pruning that introduces a learnable threshold parameter for every layer. This threshold is trained jointly with the layer weights. We use soft-pruning and soft $L_0$ regularization to make this process end-to-end trainable.

- We show that making the layer-wise thresholds trainable makes LTP computationally very efficient compared to other methods that search for per-layer thresholds via an iterative pruning and fine-tuning process; e.g., LTP pruned ResNet-50 to 9.11x in just 18 epochs with 12 additional epochs of fine-tuning, and MixNet-S to 2x in 17 epochs without need for further fine-tuning.

- We demonstrate state-of-the-art compression ratios on newer architectures, i.e., MobileNetV2, EfficientNet-B0 and MixNet-S, which are already optimized for efficient inference, with only a marginal drop in Top-1 accuracy.

- The proposed method produces a trace of checkpoints with varying pruning ratios and accuracies (e.g., refer to Table 2 for ResNet-50), so the user can choose any checkpoint based on the sparsity and performance requirements of the desired application.
2 Related Work
Several methods have been proposed for both structured and unstructured pruning of deep networks. Methods like (He et al., 2017; Li et al., 2017) use layer-wise statistics and data to remove input channels from convolutional layers. Other methods apply tensor decompositions to neural network layers: (Denton et al., 2014; Jaderberg et al., 2014; Zhang et al., 2016) apply SVD to decompose weight matrices, and (Kim et al., 2015; Lebedev et al., 2014) apply Tucker and CP decompositions for compression. An overview of these methods can be found in Kuzmin et al. (2019). These methods are all applied after training a network and need fine-tuning afterwards. Other structured methods change the shape of a neural network while training. Methods like Bayesian Compression (Louizos et al., 2017), VIBNets (Dai et al., 2018) and L1/L0 regularization (Srinivas et al., 2017; Louizos et al., 2018) add trainable gates to each layer to prune while training.
In this paper we consider unstructured pruning, i.e., removing individual weights from a network. This type of pruning was already in use in 1989 in the optimal brain damage (LeCun et al., 1989) and optimal brain surgeon (Hassibi et al., 1993) papers, which removed individual weights from neural networks using Hessian information. More recently, Han et al. (2015a) used the method from Han et al. (2015b) as part of their full model-compression pipeline, removing weights with small magnitudes and fine-tuning afterwards. This type of method is frequently used for pruning, and has recently been picked up for finding DNN subnetworks that work just as well as their mother network (Frankle and Carbin, 2019; Zhou et al., 2019). Finally, papers such as Molchanov et al. (2017); Ullrich et al. (2017) apply a variational Bayesian framework to network pruning.
Other methods similar to our work are Zhang et al. (2018) and Ye et al. (2019). These papers apply the alternating direction method of multipliers (ADMM) to pruning, which slowly coaxes a network into pruning weights with an L2-regularization-like term. One problem with these methods is that they are time-intensive; another is that they need manual tweaking of compression rates for each layer. Our method removes these restrictions and achieves comparable compression results at a fraction of the computational burden, without any need for setting per-layer pruning ratios manually. Manessi et al. (2018) is the closest to our work, and also learns per-layer thresholds automatically. However, it relies on a combination of L1 and L2 regularization, which, as shown in section 3.2, is inefficient when used in networks with batch-normalization (Ioffe and Szegedy, 2015). Our method also achieves much better compression results on AlexNet (Krizhevsky et al., 2017), which does not use batch-normalization. He et al. (2018) use reinforcement learning to set layer-wise prune ratios for structured pruning, whereas we learn the pruning thresholds in the fine-tuning process.
3 Method
LTP comprises two key ideas, soft-pruning and soft $L_0$ regularization, detailed in sections 3.1 and 3.2, respectively. The full LTP algorithm is then presented in section 3.3.
3.1 Soft Pruning
The main challenge in learning per-layer thresholds during training is that the pruning operation is not differentiable. More precisely, consider an $L$-layer DNN where the weights of the $l$-th convolutional or fully-connected layer are denoted by $\{w_{kl}\}$, with $k$ indexing the weights within the layer. In magnitude-based pruning (Han et al., 2015b) the relation between layer $l$'s uncompressed weights and pruned weights $v_{kl}$ is given by:
(1)  $v_{kl} = w_{kl} \cdot \mathrm{step}(w_{kl}^2 - \tau_l)$
where $\tau_l$ denotes the layer's pruning threshold and $\mathrm{step}(\cdot)$ denotes the Heaviside step function. We name this scheme hard-pruning. Since the step function is not differentiable, (1) cannot be used to learn thresholds through backpropagation. To get around this problem, during training LTP replaces (1) with soft-pruning
(2)  $v_{kl} = w_{kl} \cdot \mathrm{sigm}\!\left(\frac{w_{kl}^2 - \tau_l}{T}\right)$
where $\mathrm{sigm}(\cdot)$ denotes the sigmoid function and $T$ is a temperature hyper-parameter. As a result of (2) being differentiable, backpropagation can now be applied to learn both the weights and the thresholds simultaneously. Defining soft-pruning as in (2) has another advantage. Note that if $w_{kl}^2$ is much smaller than $\tau_l$, $w_{kl}$'s soft-pruned version is almost zero and it is effectively pruned away, whereas if it is much larger, $v_{kl} \approx w_{kl}$. Weights falling within the transitional region of the sigmoid function (i.e., $|w_{kl}^2 - \tau_l|$ on the order of $T$), however, may end up being pruned or kept depending on their contribution to optimizing the loss function. If they are important, the weights are pushed above the threshold through minimization of the classification loss. Otherwise, they are pulled below the threshold through regularization. This means that although LTP utilizes pruning thresholds similar to previous methods, it is not a purely magnitude-based pruning method, as it allows the network to keep important weights that were initially small and to remove some of the unimportant weights that were initially large, c.f., Figure 1 (top-right). Continuing with equation (2), it follows that
(3)  $\dfrac{\partial v_{kl}}{\partial \tau_l} = -\,w_{kl}\, f(w_{kl}, \tau_l)$
and
(4)  $\dfrac{\partial v_{kl}}{\partial w_{kl}} = \mathrm{sigm}\!\left(\dfrac{w_{kl}^2 - \tau_l}{T}\right) + 2\,w_{kl}^2\, f(w_{kl}, \tau_l)$
with
(5)  $f(w_{kl}, \tau_l) \triangleq \dfrac{1}{T}\,\mathrm{sigm}\!\left(\dfrac{w_{kl}^2 - \tau_l}{T}\right)\left(1 - \mathrm{sigm}\!\left(\dfrac{w_{kl}^2 - \tau_l}{T}\right)\right)$
The function $f(\cdot)$ also appears in subsequent equations and merits some discussion. First note that $f$ as given by (5) is the derivative of $\mathrm{sigm}((x - \tau_l)/T)$ with respect to $x$, evaluated at $x = w_{kl}^2$. Since the latter approaches the step function (located at $x = \tau_l$) in the limit $T \to 0$, it follows that the former, i.e., $f$, approaches a Dirac delta function, meaning that its value over the transitional region is inversely proportional to the region's width, i.e.,
(6)  $f(w_{kl}, \tau_l) \sim \dfrac{1}{T} \quad \text{for } |w_{kl}^2 - \tau_l| \lesssim T,$
and $f(w_{kl}, \tau_l) \approx 0$ elsewhere.
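To make the mechanics concrete, the hard- and soft-pruning rules of (1) and (2) can be sketched in a few lines of numpy. This is an illustrative sketch, not the paper's implementation; the function names and the example threshold/temperature values are our own.

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def hard_prune(w, tau):
    # Eq. (1): keep a weight only if its square exceeds the layer threshold.
    return w * (w**2 > tau)

def soft_prune(w, tau, T):
    # Eq. (2): differentiable surrogate; a sigmoid gate replaces the step.
    return w * sigm((w**2 - tau) / T)

# Two weights with squares 1.0 and 0.01, threshold tau = 0.25 (illustrative):
w = np.array([1.0, 0.1])
hp = hard_prune(w, tau=0.25)
sp = soft_prune(w, tau=0.25, T=0.001)  # small T: soft result ~ hard result
```

For weights well outside the transitional region, the soft-pruned values coincide with the hard-pruned ones, which is why a small enough temperature makes the two schemes nearly indistinguishable at convergence.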
3.2 Soft L0 Regularization
In the absence of weight regularization, the per-layer thresholds decrease to zero if initialized otherwise. This is because larger thresholds correspond to pruning more weights away, and unless these weights are completely spurious, their removal causes the classification loss $\mathcal{L}_C$ to increase. Loosely speaking, the gradient of the classification loss alone always pulls the thresholds down.
For example, Figure 2 (left) shows that with no regularization, LTP only achieves a modest sparsity on ResNet-20 with CIFAR-100, even after many epochs of pruning.
Among the different weight regularization methods, $L_0$-norm regularization, which targets minimization of the number of non-zero weights, i.e., $\|W_l\|_0 = \#\{k : w_{kl} \neq 0\}$, befits pruning applications the most. This is because it directly quantifies the size of memory or FLOPS needed during inference. However, many works use $L_1$ or $L_2$ regularization instead, due to the $L_0$ norm's lack of differentiability. Notably, (Han et al., 2015b) utilizes $L_1$ and $L_2$ regularization to push redundant weights below the pruning thresholds.
$L_1$ or $L_2$ regularization methods may work well for pruning older architectures such as AlexNet and VGG. However, they fail to properly regularize weights in networks that utilize batch-normalization layers (van Laarhoven, 2017), (Hoffer et al., 2018). This includes virtually all modern architectures such as ResNet, EfficientNet, MobileNet, and MixNet. This is because all weights in a layer preceding a batch-normalization layer can be rescaled by an arbitrary factor without any change in the batch-norm outputs. This uniform rescaling prevents $L_1$ or $L_2$ penalties from having their regularizing effect: weights can become arbitrarily small while not influencing the classification loss negatively, as they keep their original relative magnitudes. For example, Figure 2 (center) shows that with $L_2$ regularization, LTP performs poorly when applied to ResNet-20 on CIFAR-100, as a result of the layer weights falling in unison and without bound during the course of pruning (right). To fix this issue, (van Laarhoven, 2017) suggests normalizing the $L_2$ norm of a layer's weight tensor after each update. This, however, is not desirable when learning pruning thresholds, as the magnitude of individual weights constantly changes as a result of the normalization. (Hoffer et al., 2018), on the other hand, suggests using $L_1$ or $L_\infty$ batch-normalization instead of the standard scheme. This, again, is not desirable as it does not address current architectures. Consequently, in this work, we focus on $L_0$ regularization, which does work well with batch-normalization.
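The rescaling invariance described above can be checked numerically: uniformly shrinking the weights feeding a batch-normalization step leaves its output unchanged, so an $L_2$ penalty can drive them toward zero at no cost to the loss. The toy single-channel example below is ours, not from the paper.

```python
import numpy as np

def scale_then_batchnorm(x, w, eps=1e-12):
    # Toy single-channel "layer": multiply by weight w, then batch-normalize.
    y = w * x
    return (y - y.mean()) / np.sqrt(y.var() + eps)

x = np.array([1.0, 2.0, 3.0, 4.0])
big = scale_then_batchnorm(x, w=2.0)
tiny = scale_then_batchnorm(x, w=0.002)  # 1000x smaller weight, same output
```

Since `big` and `tiny` coincide, the $L_2$ penalty of the scaled-down layer is a million times smaller at identical classification loss, which is exactly the degeneracy that defeats $L_1$/$L_2$ regularization under batch-normalization.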
As was the case with hard-pruning in (1), the challenge in using $L_0$ regularization for learning per-layer pruning thresholds is that it is not differentiable, i.e., $\|W_l\|_0 = \sum_k \mathrm{step}(w_{kl}^2 - \tau_l)$, and hence cannot be used with backpropagation. This motivates our soft $L_0$ norm definition for layer $l$, i.e.,
(7)  $L_{0,l} \triangleq \sum_k \mathrm{sigm}\!\left(\dfrac{w_{kl}^2 - \tau_l}{T}\right)$
which is differentiable, and therefore can be used with backpropagation, i.e.,
(8)  $\dfrac{\partial L_{0,l}}{\partial w_{kl}} = 2\,w_{kl}\, f(w_{kl}, \tau_l)$
where $f(w_{kl}, \tau_l)$ is given by (5), and
(9)  $\dfrac{\partial L_{0,l}}{\partial \tau_l} = -\sum_k f(w_{kl}, \tau_l)$
Inspecting (9) reveals an important aspect of $L_{0,l}$, namely that only weights falling within the sigmoid transitional region, i.e., $|w_{kl}^2 - \tau_l| \lesssim T$, may contribute to any change in $L_{0,l}$. This is because other weights are either very small and completely pruned away, or very large and unaffected by pruning. The consequence is that if a significant fraction of these weights, e.g., as a result of a backpropagation update, are moved out of the transitional region, $L_{0,l}$ becomes constant and the pruning process stalls. The condition for preventing the premature termination of pruning, when using (7), can then be expressed as
(10)  $\bigl|\Delta(w_{kl}^2)\bigr| = \Bigl|2\,\eta\,w_{kl}\,\dfrac{\partial \mathcal{L}}{\partial w_{kl}}\Bigr| \ll T$
where $\eta$ denotes the weight learning rate and $\mathcal{L}$ denotes the overall objective function comprising both the classification and soft $L_0$ regularization losses, i.e.,
(11)  $\mathcal{L} = \mathcal{L}_C + \lambda \sum_l L_{0,l}$
Note that the left-hand side of Equation (10) is the displacement of $w_{kl}^2$ as a result of a weight update, so (10) states that this displacement should not be comparable to the transition region's width (on the order of $T$).
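As a sketch, the soft $L_0$ norm of (7) and its delta-like gradient factor from (5) and (9) can be written directly in numpy. The function names and the toy weight values are our own illustration, under the same notation as the equations above.

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def soft_l0(w, tau, T):
    # Eq. (7): differentiable count of surviving weights in one layer.
    return float(np.sum(sigm((w**2 - tau) / T)))

def f(w, tau, T):
    # Eq. (5): delta-like factor, non-negligible only in the transitional
    # region |w^2 - tau| on the order of T.
    s = sigm((w**2 - tau) / T)
    return s * (1.0 - s) / T

def grad_tau(w, tau, T):
    # Eq. (9): d(soft_l0)/d(tau); only transitional weights contribute.
    return -float(np.sum(f(w, tau, T)))

# One surviving weight, one weight on the threshold, one pruned weight:
w = np.array([1.0, 0.5, 0.01])
count = soft_l0(w, tau=0.25, T=0.01)   # ~1.5: one kept, one half-gated
g = grad_tau(w, tau=0.25, T=0.01)      # dominated by the transitional weight
```

Note how `g` is driven almost entirely by the weight sitting exactly on the threshold; the clearly-kept and clearly-pruned weights contribute essentially nothing, which is the stalling mechanism that condition (10) guards against.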
3.3 Learned Threshold Pruning
LTP is a threshold-based pruning method that learns per-layer thresholds while training or fine-tuning a network. More precisely, LTP adopts a framework of updating all network weights, but only using their soft-pruned versions in the forward pass. Applying the chain rule to the classification loss $\mathcal{L}_C$, we have

(12)  $\dfrac{\partial \mathcal{L}_C}{\partial \tau_l} = \sum_k \dfrac{\partial \mathcal{L}_C}{\partial v_{kl}} \cdot \dfrac{\partial v_{kl}}{\partial \tau_l}$
and
(13)  $\dfrac{\partial \mathcal{L}_C}{\partial w_{kl}} = \dfrac{\partial \mathcal{L}_C}{\partial v_{kl}} \cdot \dfrac{\partial v_{kl}}{\partial w_{kl}}$
LTP uses (11), (12), (3) and (9) to update the per-layer thresholds:
(14)  $\Delta \tau_l = \eta_\tau \sum_k \left( \dfrac{\partial \mathcal{L}_C}{\partial v_{kl}}\, w_{kl} + \lambda \right) f(w_{kl}, \tau_l)$
Updating the weights needs more care; in particular, minimization of $\mathcal{L}$ with respect to $w_{kl}$ is subject to the constraint given by (10). Interestingly, $\partial \mathcal{L} / \partial w_{kl}$ as given by (11), (13), (4) and (8), i.e.,
(15)  $\dfrac{\partial \mathcal{L}}{\partial w_{kl}} = \dfrac{\partial \mathcal{L}_C}{\partial v_{kl}} \left( \mathrm{sigm}\!\left(\dfrac{w_{kl}^2 - \tau_l}{T}\right) + 2\,w_{kl}^2\, f(w_{kl}, \tau_l) \right) + 2\,\lambda\, w_{kl}\, f(w_{kl}, \tau_l)$
includes $f(w_{kl}, \tau_l)$, which as a result of (6) could violate (10) when $T$ is small (a requirement for setting $T$, c.f., (19) and Table 1). There are two simple solutions to enforce (10). The first approach is to compute $\partial \mathcal{L} / \partial w_{kl}$ as given by (15), but clamp it based on (10). The second, and arguably simpler, approach is to use
(16)  $\dfrac{\partial \mathcal{L}}{\partial w_{kl}} \approx \dfrac{\partial \mathcal{L}_C}{\partial v_{kl}}\; \mathrm{sigm}\!\left(\dfrac{w_{kl}^2 - \tau_l}{T}\right)$
To appreciate the logic behind this approximation, note that for the vast majority of weights, namely those outside the sigmoid transitional region, equations (16) and (15) give almost identical values. On the other hand, although the values given by (16) and (15), after clamping, do differ for weights within the transitional region, these weights remain there for only a very small fraction of the training time (as $\tau_l$ moves past them). This means that they acquire their correct values through backpropagation once they are out of the transitional region. Also note that (16) is equivalent to using
(17)  $\dfrac{\partial \mathcal{L}}{\partial w_{kl}} \approx \dfrac{\partial \mathcal{L}_C}{\partial w_{kl}}$
i.e., not using the soft $L_0$ term for updating the weights, and
(18)  $\dfrac{\partial v_{kl}}{\partial w_{kl}} \approx \mathrm{sigm}\!\left(\dfrac{w_{kl}^2 - \tau_l}{T}\right)$
instead of (4), i.e., treating the sigmoid function in (4) as a constant. These adjustments are necessary for preventing the premature termination of LTP, c.f., Figure 1.
Finally, after training is finished, LTP uses the learned thresholds to hard-prune the network, which can further be fine-tuned, without regularization, for improved performance.
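Putting the pieces together, one LTP update for a single layer can be sketched as follows. This is a simplified illustration under the notation above, not the paper's implementation: it assumes the upstream gradient $\partial \mathcal{L}_C / \partial v_{kl}$ is supplied by backpropagation through the rest of the network, and the function and learning-rate names are hypothetical.

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def ltp_step(w, tau, grad_v, T, lam, lr_w, lr_tau):
    # One simplified LTP update for a single layer.
    # grad_v: dL_C/dv_kl for each soft-pruned weight (from backprop).
    s = sigm((w**2 - tau) / T)
    f = s * (1.0 - s) / T                        # Eq. (5)
    # Threshold update: classification term via (3) plus soft-L0 term (9),
    # matching the combined update (14).
    dtau = np.sum(grad_v * (-w * f)) + lam * (-np.sum(f))
    tau_new = tau - lr_tau * dtau
    # Weight update via the approximation (16): keep only the sigmoid
    # factor and drop the delta-like f terms so pruning cannot stall.
    w_new = w - lr_w * grad_v * s
    return w_new, tau_new

# A weight sitting exactly on the threshold, with no classification gradient:
w0 = np.array([0.5])
w1, tau1 = ltp_step(w0, tau=0.25, grad_v=np.zeros(1),
                    T=0.05, lam=0.1, lr_w=0.1, lr_tau=0.01)
```

With a zero classification gradient the regularization term alone raises the threshold (`tau1 > 0.25`), illustrating how the soft $L_0$ loss drives pruning while the approximate weight gradient leaves the weights untouched.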
4 Experiments
Guidelines for a proper choice of hyperparameters are given in section 4.1, followed by LTP pruning results in section 4.2.
4.1 Choice of Hyperparameters
LTP has three main hyperparameters: the temperature $T$, the threshold learning rate $\eta_\tau$, and the regularization coefficient $\lambda$. Table 1 provides the values used for generating the data reported in this paper.
Network  Dataset

ResNet-20  CIFAR-100
AlexNet  ImageNet
ResNet-50  ImageNet
MobileNetV2  ImageNet
EfficientNet-B0  ImageNet
MixNet-S  ImageNet
LTP uses soft-pruning during training to learn per-layer thresholds, but hard-pruning to finally remove redundant weights. Selecting a small enough $T$ ensures that the performances of the soft-pruned and hard-pruned networks are close. Too small a $T$, on the other hand, is undesirable as it makes the transitional region of the sigmoid function too narrow, which could terminate pruning prematurely. To set the per-layer $T$ in this paper, the following equation was used:
(19) 
While one could consider starting with a larger $T$ and annealing it during training, a fixed value of $T$ provided good results for all experiments reported in this paper (Table 1).
One important consideration when choosing $\eta_\tau$ is given by equation (12), namely, that the gradient with respect to the pruning threshold receives contributions from the gradients of all weights in the layer. This means that $\partial \mathcal{L}_C / \partial \tau_l$ can potentially be much larger than a typical $\partial \mathcal{L}_C / \partial w_{kl}$, especially if the individual contributions are correlated. Therefore, to prevent changes in $\tau_l$ that are orders of magnitude larger than changes in $w_{kl}$, $\eta_\tau$ should be small. Table 1 summarizes the values of $\eta_\tau$ used for producing the results reported in this paper.
$\lambda$ is the primary hyperparameter determining the sparsity level achieved by LTP. The following equation was used in this paper to set $\lambda$:
(20)  $\lambda = k\,\lambda_0$
where $\lambda_0$ is a constant and $k$ is an integer initialized to 1 and incremented every time the model's sparsity does not improve by a preset margin over a preset number of epochs since $\lambda$ was last updated. Equation (20) provides a mechanism for gradually increasing $\lambda$ if $\lambda_0$ is not large enough to reach the target sparsity. Our experiments show that to get the best results, it is best to choose $\lambda_0$ large enough that the desired sparsity is reached sooner rather than later (this is likely due to some networks' tendency to overfit if trained for too long). For example, Table 2 gives the trail of pruned networks generated by LTP when applied to ResNet-50 on ImageNet. It is noteworthy that a too aggressive $\lambda$ may be disadvantageous: note how the Top-1 accuracy of the pruned networks given in Table 2 improves for the first three epochs despite the increase in sparsity. We observed that if $\lambda_0$ is too large, the first pruned model may have poor performance without any subsequent recovery. Finally, Table 1 gives the values of $T$, $\eta_\tau$, and $\lambda_0$ used for producing the reported results.
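The schedule described above can be sketched as follows. The names `min_gain` and `patience` stand in for the preset margin and epoch window from the text, whose exact values are not reproduced here; the whole function is an illustrative sketch of the mechanism behind (20).

```python
def update_lambda(lam0, k, sparsity_history, min_gain, patience):
    """Sketch of the schedule lambda = k * lam0 implied by Eq. (20).

    lam0: base regularization constant; k: current multiplier;
    sparsity_history: per-epoch model sparsity so far;
    min_gain / patience: hypothetical names for the preset margin and
    the number of epochs to wait before strengthening regularization.
    """
    if len(sparsity_history) > patience:
        gain = sparsity_history[-1] - sparsity_history[-1 - patience]
        if gain < min_gain:
            k += 1  # sparsity has stalled: strengthen the L0 penalty
    return k * lam0, k

# Stalled run: sparsity flat for the last `patience` epochs -> k increments.
lam_a, k_a = update_lambda(0.1, 1, [0.5, 0.5, 0.5, 0.5], 0.01, 2)
# Healthy run: sparsity still improving -> k (and lambda) unchanged.
lam_b, k_b = update_lambda(0.1, 1, [0.3, 0.4, 0.5, 0.6], 0.01, 2)
```

This keeps $\lambda$ small while pruning makes progress on its own, and only ramps up the penalty when the sparsity curve flattens out.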
Epoch  Sparsity  Loss  Top-1 Acc.  Top-5 Acc.

4.2 LTP Pruning Results
In this section we evaluate LTP on the ImageNet ILSVRC-2015 dataset (Russakovsky et al., 2015). LTP is used to prune a wide variety of networks comprising AlexNet (Krizhevsky et al., 2017), ResNet-50 (He et al., 2016), MobileNetV2 (Sandler et al., 2018), EfficientNet-B0 (Tan and Le, 2019a) and MixNet-S (Tan and Le, 2019b).
Table 3 gives LTP's ResNet-50 performance results on ImageNet, where it matches the state-of-the-art (Ye et al., 2019) in both compression rate and Top-5 accuracy. As Table 2 shows, LTP reaches the desired sparsity in just 18 epochs, whereas (Ye et al., 2019) typically spends hundreds of epochs of training and requires preset per-layer compression ratios as input.
Method  Top-5 Acc.  Rate

Uncompressed (CaffeNet Model)  
Fine-grained Pruning (Mao et al., 2017)  
Progressive ADMM Pruning (Ye et al., 2019)  
Uncompressed (TorchVision Model)  
LTP (using TorchVision Model) 
Table 4 provides LTP's AlexNet performance results on ImageNet, where it achieves a high compression rate without any drop in Top-5 accuracy. It is noteworthy that TorchVision's AlexNet implementation is slightly different from CaffeNet's: while both implementations have the same number of weights in their fully-connected layers, the TorchVision model has fewer weights in its convolutional layers than CaffeNet. As a result of being slimmer, the uncompressed TorchVision model achieves lower Top-1 accuracy, and we conjecture that it can be compressed less.
Method  Top-5 Acc.  Δ(Top-5 Acc.)  Rate

Uncompressed (CaffeNet Model)  N/A  
Network Pruning (Han et al., 2015b)  
Automated Pruning (Manessi et al., 2018)  N/A  
ADMM Pruning (Zhang et al., 2018)  
Progressive ADMM Pruning (Ye et al., 2019)  
Uncompressed (TorchVision Model)  N/A  
LTP (using TorchVision Model)  
LTP (using TorchVision Model) 
To the best of our knowledge, this is the first time that (unstructured) pruning results for MobileNetV2, EfficientNet-B0 and MixNet-S are reported, c.f., Table 5, Table 6 and Table 7, respectively. This is partially because LTP, in contrast to many other methods, e.g., (Han et al., 2015b), (Zhang et al., 2018) and (Ye et al., 2019), does not require preset per-layer compression rates, which are non-trivial to choose given these networks' large number of layers, parallel branches, and novel architectural building blocks such as squeeze-and-excite and inverted-bottleneck. This, along with LTP's computational efficiency and batch-normalization compatibility, enables it to be applied to such diverse architectures out of the box. In the absence of pruning results in the literature, Global Pruning, as described and implemented in (Ortiz et al., 2019), was used to produce benchmarks. In particular, we see that LTP's compressed MobileNetV2 provides a clear Top-1 advantage over one compressed by Global Pruning. Finally, note that LTP can compress MobileNetV2, EfficientNet-B0, and MixNet-S, which are by design architecturally efficient, with only a marginal drop in Top-1 accuracy.
Method  Top-1 Acc.  Top-5 Acc.  Rate

Uncompressed  
Global Pruning  
Global Pruning  
Global Pruning  
LTP  
LTP  
LTP 
Method  Top-1 Acc.  Top-5 Acc.  Rate

Uncompressed  
Global Pruning  
Global Pruning  
Global Pruning  
LTP  
LTP  
LTP 
Method  Top-1 Acc.  Top-5 Acc.  Rate

Uncompressed  
Global Pruning  
Global Pruning  
Global Pruning  
LTP  
LTP  
LTP 
5 Conclusion
In this work, we introduced Learned Threshold Pruning (LTP), a novel gradient-based algorithm for unstructured pruning of deep networks. With the help of soft-pruning and soft $L_0$ regularization, we proposed a framework in which the pruning threshold for each layer is learned in an end-to-end manner. With an extensive set of experiments, we showed that LTP is an out-of-the-box method that achieves remarkable compression rates on "old" (AlexNet, ResNet-50) as well as "new" architectures (EfficientNet, MobileNetV2, MixNet). Our experiments also established that LTP gives high compression rates even in the presence of batch-normalization layers. LTP achieves high compression on AlexNet and ResNet-50 with no loss in Top-1 accuracy on the ImageNet dataset. We are also the first to report compression results on efficient architectures comprised of depthwise-separable convolutions and squeeze-excite blocks, e.g., LTP compresses MobileNetV2, EfficientNet-B0 and MixNet-S with only a marginal drop in Top-1 accuracy on ImageNet. Additionally, LTP demonstrates fast convergence, e.g., it prunes ResNet-50 in 18 epochs (plus 12 epochs of fine-tuning) and MixNet-S within 17 epochs without need for further fine-tuning.
References

Anwar et al. (2017). Structured pruning of deep convolutional neural networks. ACM Journal on Emerging Technologies in Computing Systems (JETC) 13(3), pp. 1–18.
Cai et al. (2018). ProxylessNAS: direct neural architecture search on target task and hardware. arXiv:1812.00332.
Chollet (2017). Xception: deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1251–1258.
Dai et al. (2018). Compressing neural networks using the variational information bottleneck. arXiv:1802.10399.
Denton et al. (2014). Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems, pp. 1269–1277.
Faraone et al. (2018). SYQ: learning symmetric quantization for efficient deep neural networks. In CVPR, pp. 4300–4309.
Frankle and Carbin (2019). The lottery ticket hypothesis: finding sparse, trainable neural networks. In ICLR.
Han et al. (2015a). Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv:1510.00149.
Han et al. (2015b). Learning both weights and connections for efficient neural networks. arXiv:1506.02626.
Hassibi et al. (1993). Optimal brain surgeon: extensions and performance comparison. In Advances in Neural Information Processing Systems, pp. 263–270.
He et al. (2016). Deep residual learning for image recognition. In CVPR 2016, pp. 770–778.
He et al. (2018). AMC: AutoML for model compression and acceleration on mobile devices. In ECCV, pp. 815–832.
He et al. (2017). Channel pruning for accelerating very deep neural networks. In ICCV 2017, pp. 1398–1406.
Hoffer et al. (2018). Norm matters: efficient and accurate normalization schemes in deep networks. In NeurIPS 2018, pp. 2164–2174.
Hu et al. (2018). Squeeze-and-excitation networks. In CVPR, pp. 7132–7141.
Ignatov et al. (2018). AI benchmark: running deep neural networks on Android smartphones. In ECCV Workshops.
Ioffe and Szegedy (2015). Batch normalization: accelerating deep network training by reducing internal covariate shift. In ICML 2015.
Jacob et al. (2018). Quantization and training of neural networks for efficient integer-arithmetic-only inference. In CVPR.
Jaderberg et al. (2014). Speeding up convolutional neural networks with low rank expansions. In BMVC.
Kim et al. (2015). Compression of deep convolutional neural networks for fast and low power mobile applications. arXiv:1511.06530.
Krizhevsky et al. (2017). ImageNet classification with deep convolutional neural networks. Communications of the ACM.
Kuzmin et al. (2019). Taxonomy and evaluation of structured compression of convolutional neural networks. arXiv:1912.09802.
Lebedev et al. (2014). Speeding-up convolutional neural networks using fine-tuned CP-decomposition. arXiv:1412.6553.
LeCun et al. (1989). Optimal brain damage. In Advances in Neural Information Processing Systems, pp. 598–605.
Li et al. (2017). Pruning filters for efficient ConvNets. In ICLR.
Lin et al. (2016). Fixed point quantization of deep convolutional networks. In ICML, pp. 2849–2858.
Louizos et al. (2017). Bayesian compression for deep learning. In Advances in Neural Information Processing Systems 30, pp. 3288–3298.
Louizos et al. (2018). Learning sparse neural networks through L0 regularization. In ICLR.
Manessi et al. (2018). Automated pruning for deep neural network compression. In ICPR, pp. 657–664.
Mao et al. (2017). Exploring the granularity of sparsity in convolutional neural networks. In CVPR Workshops 2017, pp. 1927–1934.
Molchanov et al. (2017). Variational dropout sparsifies deep neural networks. In ICML, pp. 2498–2507.
Ortiz et al. (2019). Standardizing evaluation of neural network pruning. In Workshop on AI Systems at SOSP.
Russakovsky et al. (2015). ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115(3), pp. 211–252.
Sandler et al. (2018). MobileNetV2: inverted residuals and linear bottlenecks. In CVPR.
Srinivas et al. (2017). Training sparse neural networks. In CVPR Workshops, pp. 455–462.
Tan and Le (2019a). EfficientNet: rethinking model scaling for convolutional neural networks. arXiv:1905.11946.
Tan and Le (2019b). MixConv: mixed depthwise convolutional kernels. arXiv:1907.09595.
Ullrich et al. (2017). Soft weight-sharing for neural network compression. In ICLR.
van Laarhoven (2017). L2 regularization versus batch and weight normalization. arXiv:1706.05350.
Wu et al. (2019). FBNet: hardware-aware efficient ConvNet design via differentiable neural architecture search. In CVPR, pp. 10734–10742.
Ye et al. (2019). Progressive DNN compression: a key to achieve ultra-high weight pruning and quantization rates using ADMM. arXiv:1903.09769.
Zhang et al. (2018). A systematic DNN weight pruning framework using alternating direction method of multipliers. In ECCV, pp. 191–207.
Zhang et al. (2016). Accelerating very deep convolutional networks for classification and detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 38(10), pp. 1943–1955.
Zhou et al. (2017). Incremental network quantization: towards lossless CNNs with low-precision weights. arXiv:1702.03044.
Zhou et al. (2019). Deconstructing lottery tickets: zeros, signs, and the supermask. In NeurIPS, pp. 3592–3602.