Learned Threshold Pruning

02/28/2020 · by Kambiz Azarian, et al.

This paper presents a novel differentiable method for unstructured weight pruning of deep neural networks. Our learned-threshold pruning (LTP) method enjoys a number of important advantages. First, it learns per-layer thresholds via gradient descent, unlike conventional methods where they are set as input. Making thresholds trainable also makes LTP computationally efficient, hence scalable to deeper networks. For example, it takes less than 30 epochs for LTP to prune most networks on ImageNet. This is in contrast to other methods that search for per-layer thresholds via a computationally intensive iterative pruning and fine-tuning process. Additionally, with a novel differentiable L_0 regularization, LTP is able to operate effectively on architectures with batch-normalization. This is important since L_1 and L_2 penalties lose their regularizing effect in networks with batch-normalization. Finally, LTP generates a trail of progressively sparser networks from which the desired pruned network can be picked based on sparsity and performance requirements. These features allow LTP to achieve state-of-the-art compression rates on ImageNet networks such as AlexNet (26.4× compression with 79.1% Top-5 accuracy) and ResNet50 (9.1× compression with 92.0% Top-5 accuracy). We also show that LTP effectively prunes newer architectures, such as EfficientNet, MobileNetV2 and MixNet.




1 Introduction

Deep neural networks (DNNs) have provided state-of-the-art solutions for several challenging tasks in many domains such as computer vision, natural language understanding, and speech processing. With the increasing demand for deploying DNNs on resource-constrained edge devices, it has become even more critical to reduce the memory footprint of neural networks and also to achieve power-efficient inference on these devices. Many methods in model compression (Hassibi et al., 1993; LeCun et al., 1989; Han et al., 2015b; Zhang et al., 2018), model quantization (Jacob et al., 2018; Lin et al., 2016; Zhou et al., 2017; Faraone et al., 2018) and neural architecture search (Sandler et al., 2018; Tan and Le, 2019a; Cai et al., 2018; Wu et al., 2019) have been introduced with these goals in mind.

Neural network compression mainly falls into two categories: structured and unstructured pruning. Structured pruning methods, e.g., (He et al., 2017; Li et al., 2017; Zhang et al., 2016; He et al., 2018), change the network's architecture by removing input channels from convolutional layers or by applying tensor decomposition to the layer weight matrices, whereas unstructured pruning methods such as (Han et al., 2015b; Frankle and Carbin, 2019; Zhang et al., 2018) rely on removing individual weights from the neural network. Although unstructured pruning methods achieve much higher weight sparsity ratios than structured pruning, unstructured pruning is thought to be less hardware-friendly because the irregular sparsity is often difficult to exploit for efficient computation (Anwar et al., 2017). However, recent advances in AI accelerator design (Ignatov et al., 2018) have targeted support for highly efficient sparse matrix multiply-and-accumulate operations. Because of this, it is becoming increasingly important to develop state-of-the-art algorithms for unstructured pruning.

Most unstructured weight pruning methods are based on the assumption that smaller weights do not contribute as much to the model’s performance. These pruning methods iteratively prune the weights that are smaller than a certain threshold and retrain the network to regain the performance lost during pruning. A key challenge in unstructured pruning is to find an optimal setting for these pruning thresholds. Merely setting the same threshold for all layers may not be appropriate because the distribution and ranges of the weights in each layer can be very different. Also, different layers may have varying sensitivities to pruning, depending on their position in the network (initial layers versus final layers) or their type (depth-wise separable versus standard convolutional layers). The best setting of thresholds should consider these layer-wise characteristics. Many methods (Zhang et al., 2018; Ye et al., 2019; Manessi et al., 2018) propose a way to search these layer-wise thresholds but become quite computationally expensive for networks with a large number of layers, such as ResNet50 or EfficientNet.

In this paper, we propose Learned Threshold Pruning (LTP) to address these challenges. Our proposed method uses separate pruning thresholds for every layer. We make the layer-wise thresholds trainable, allowing the training procedure to find optimal thresholds alongside the layer weights during finetuning. An added benefit of making these thresholds trainable is that it makes LTP fast, and the method converges quickly compared to other iterative methods such as Zhang et al. (2018); Ye et al. (2019). LTP also achieves high compression on newer networks (Tan and Le, 2019a; Sandler et al., 2018; Tan and Le, 2019b) with squeeze-excite (Hu et al., 2018) and depth-wise convolutional layers (Chollet, 2017).

Our key contributions in this work are the following:


  • We propose a gradient-based algorithm for unstructured pruning that introduces a learnable threshold parameter for every layer. These thresholds are trained jointly with the layer weights. We use soft-pruning and soft $L_0$ regularization to make this process end-to-end trainable.

  • We show that making the layer-wise thresholds trainable makes LTP computationally very efficient compared to other methods that search for per-layer thresholds via an iterative pruning and finetuning process; e.g., LTP pruned ResNet50 to 9.11× in just 18 epochs, with 12 additional epochs of fine-tuning, and MixNet-S to 2× in 17 epochs without the need for further finetuning.

  • We demonstrate state-of-the-art compression ratios on newer architectures, i.e., MobileNetV2, EfficientNet-B0 and MixNet-S, which are already optimized for efficient inference, with only a small drop in Top-1 accuracy.

  • The proposed method provides a trail of checkpoints with varying pruning ratios and accuracies (e.g., refer to Table 2 for ResNet50). Because of this, the user can choose any desired checkpoint based on the sparsity and performance requirements of the target application.

2 Related Work

Several methods have been proposed for both structured and unstructured pruning of deep networks. Methods like (He et al., 2017; Li et al., 2017) use layer-wise statistics and data to remove input channels from convolutional layers. Other methods apply tensor decompositions to neural network layers: (Denton et al., 2014; Jaderberg et al., 2014; Zhang et al., 2016) apply SVD to decompose weight matrices, while (Kim et al., 2015; Lebedev et al., 2014) compress using Tucker and CP decompositions. An overview of these methods can be found in Kuzmin et al. (2019). These methods are all applied after training a network and require fine-tuning afterwards. Other structured methods change the shape of a neural network while training. Methods like Bayesian Compression (Louizos et al., 2017), VIBNets (Dai et al., 2018) and L1/L0-regularization (Srinivas et al., 2017; Louizos et al., 2018) add trainable gates to each layer to prune while training.

In this paper we consider unstructured pruning, i.e., removing individual weights from a network. This type of pruning was already in use in 1989 in the Optimal Brain Damage (LeCun et al., 1989) and Optimal Brain Surgeon (Hassibi et al., 1993) papers, which removed individual weights in neural networks using Hessian information. More recently, Han et al. (2015a) used the method from Han et al. (2015b) as part of their full model-compression pipeline, removing weights with small magnitudes and fine-tuning afterwards. This type of method is frequently used for pruning, and has recently been applied to finding DNN subnetworks that work just as well as their dense parent network (Frankle and Carbin, 2019; Zhou et al., 2019). Finally, papers such as Molchanov et al. (2017); Ullrich et al. (2017) apply a variational Bayesian framework to network pruning.

Other methods that are similar to our work are Zhang et al. (2018) and Ye et al. (2019). These papers apply the alternating direction method of multipliers (ADMM) to pruning, which slowly coaxes a network into pruning weights with an $L_2$-regularization-like term. One problem with these methods is that they are time-intensive; another is that they need manual tweaking of the compression rate for each layer. Our method removes these restrictions and achieves comparable compression results, at a fraction of the computational burden and without any need for setting per-layer pruning ratios manually. Manessi et al. (2018) is the closest to our work, as it also learns per-layer thresholds automatically. However, it relies on a combination of $L_1$ and $L_2$ regularization which, as shown in section 3.2, is ineffective when used in networks with batch-normalization (Ioffe and Szegedy, 2015). Our method also achieves much better compression results on AlexNet (Krizhevsky et al., 2017), which does not use batch-normalization. He et al. (2018) use reinforcement learning to set layer-wise prune ratios for structured pruning, whereas we learn the pruning thresholds in the fine-tuning process.

3 Method

LTP comprises two key ideas, soft-pruning and soft regularization, detailed in sections 3.1 and 3.2, respectively. The full LTP algorithm is then presented in section 3.3.

3.1 Soft Pruning

The main challenge in learning per-layer thresholds during training is that the pruning operation is not differentiable. More precisely, consider an $L$-layer DNN where the weights of the $l$-th convolutional or fully-connected layer are denoted by $W_l$, and let $k$ index the weights within the layer. In magnitude-based pruning (Han et al., 2015b) the relation between layer $l$'s uncompressed weights $w_{kl}$ and pruned weights $v_{kl}$ is given by:

$$v_{kl} = w_{kl} \cdot \mathrm{step}\!\left(w_{kl}^2 - \tau_l\right), \tag{1}$$

where $\tau_l$ denotes the layer's pruning threshold and step denotes the Heaviside step function. We name this scheme hard-pruning. Since the step function is not differentiable, (1) cannot be used to learn thresholds through back-propagation. To get around this problem, during training LTP replaces (1) with soft-pruning

$$v_{kl} = w_{kl} \cdot \mathrm{sigm}\!\left(\frac{w_{kl}^2 - \tau_l}{T}\right), \tag{2}$$

where sigm denotes the sigmoid function and $T$ is a temperature hyper-parameter. As a result of (2) being differentiable, back-propagation can now be applied to learn both the weights and thresholds simultaneously.
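As a concrete illustration of Equation (2) versus (1), soft- and hard-pruning can be sketched in a few lines of NumPy (a minimal sketch following the notation above; the threshold and temperature values are illustrative, not the paper's settings):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def soft_prune(w, tau, T):
    """Soft-pruning: v = w * sigm((w^2 - tau) / T).

    Differentiable in both w and tau, so back-propagation can
    update the per-layer threshold tau alongside the weights.
    """
    return w * sigmoid((w * w - tau) / T)

def hard_prune(w, tau):
    """Hard-pruning: v = w * step(w^2 - tau); keeps only w with w^2 > tau."""
    return w * (w * w > tau)

w = np.array([0.01, -0.05, 0.2, -0.9])
print(soft_prune(w, tau=0.01, T=0.002))  # small weights -> ~0, large -> ~unchanged
print(hard_prune(w, tau=0.01))           # exact zeros below the threshold
```

For a small temperature the two functions nearly coincide; during training LTP uses the soft version, and only switches to the hard version once the thresholds are learned.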

Defining soft-pruning as in (2) has another advantage. Note that if $w_{kl}^2$ is much smaller than $\tau_l$ (i.e., $\tau_l - w_{kl}^2 \gg T$), $w_{kl}$'s soft-pruned version is almost zero and it is pruned away, whereas if it is much larger (i.e., $w_{kl}^2 - \tau_l \gg T$), $v_{kl} \approx w_{kl}$. Weights falling within the transitional region of the sigmoid function (i.e., $|w_{kl}^2 - \tau_l| \lesssim T$), however, may end up being pruned or kept depending on their contribution to optimizing the loss function. If they are important, the weights are pushed above the threshold through minimization of the classification loss. Otherwise, they are pulled below the threshold through regularization. This means that although LTP utilizes similar pruning thresholds as previous methods, it is not a purely magnitude-based pruning method, as it allows the network to keep important weights that were initially small and to remove some of the unimportant weights that were initially large, c.f., Figure 1 (top-right).

Continuing with equation (2), it follows that

$$\frac{\partial v_{kl}}{\partial \tau_l} = -\frac{w_{kl}}{T}\,\mathrm{sigm}'\!\left(\frac{w_{kl}^2 - \tau_l}{T}\right), \tag{3}$$

$$\frac{\partial v_{kl}}{\partial w_{kl}} = \mathrm{sigm}\!\left(\frac{w_{kl}^2 - \tau_l}{T}\right) + \frac{2 w_{kl}^2}{T}\,\mathrm{sigm}'\!\left(\frac{w_{kl}^2 - \tau_l}{T}\right), \tag{4}$$

where

$$\mathrm{sigm}'(x) \triangleq \frac{d\,\mathrm{sigm}(x)}{dx} = \mathrm{sigm}(x)\bigl(1 - \mathrm{sigm}(x)\bigr). \tag{5}$$
The function $\mathrm{sigm}'(\cdot)$ also appears in subsequent equations and merits some discussion. First note that $\mathrm{sigm}'(x)$, as given by (5), is the derivative of $\mathrm{sigm}(x)$ with respect to $x$. Since the latter approaches the step function (located at $x = 0$) in the limit as $T \to 0$, it follows that the former, i.e., $\mathrm{sigm}'\big((w_{kl}^2 - \tau_l)/T\big)/T$, would approach a Dirac delta function, meaning that its value over the transitional region is inversely proportional to the region's width, i.e.,

$$\frac{1}{T}\,\mathrm{sigm}'\!\left(\frac{w_{kl}^2 - \tau_l}{T}\right) \sim \frac{1}{T} \quad \text{for } |w_{kl}^2 - \tau_l| \lesssim T, \tag{6}$$

and approximately zero elsewhere.
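This concentration behavior is easy to check numerically: $\mathrm{sigm}'\big((x - \tau)/T\big)/T$ peaks at $x = \tau$ with height growing like $1/T$, while its total mass stays at one, exactly as a Dirac delta (a small NumPy sketch; the grid and values are illustrative):

```python
import numpy as np

def dsigmoid(x):
    # Numerically safe sigmoid derivative for large |x|.
    e = np.exp(-np.abs(x))
    s = np.where(x >= 0, 1.0 / (1.0 + e), e / (1.0 + e))
    return s * (1.0 - s)

# g(x) = sigm'((x - tau)/T) / T acts like a Dirac delta at x = tau:
# it is ~1/(4T) at x = tau, near zero a few T away, and integrates to 1.
tau = 0.01
x = np.linspace(-0.5, 0.5, 2_000_001)
dx = x[1] - x[0]
for T in (1e-2, 1e-3, 1e-4):
    g = dsigmoid((x - tau) / T) / T
    mass = g.sum() * dx              # stays ~1.0 for every T
    print(T, g.max(), mass)          # peak grows as T shrinks
```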

3.2 Soft Regularization

In the absence of weight regularization, the per-layer thresholds decrease to zero if initialized otherwise. This is because larger thresholds correspond to pruning more weights away, and unless these weights are completely spurious, their removal causes the classification loss, i.e., $\mathcal{L}_{\mathrm{cls}}$, to increase. Loosely speaking, $\partial \mathcal{L}_{\mathrm{cls}} / \partial \tau_l > 0$, so gradient descent drives the thresholds toward zero. For example, Figure 2 (left) shows that with no regularization, LTP only achieves a modest sparsity on ResNet20 with CIFAR100, even after many epochs of pruning.

Among the different weight regularization methods, $L_0$-norm regularization, which targets minimization of the number of non-zero weights, i.e.,

$$\|W_l\|_0 = \sum_k \mathbb{1}\{w_{kl} \neq 0\},$$

befits pruning applications the most. This is because it directly quantifies the size of memory or FLOPS needed during inference. However, many works use $L_1$ or $L_2$ regularization instead, due to the $L_0$-norm's lack of differentiability. Notably, (Han et al., 2015b) utilizes $L_1$ and $L_2$ regularization to push redundant weights below the pruning thresholds.

$L_1$ or $L_2$ regularization may work well for pruning older architectures such as AlexNet and VGG. However, these penalties fail to properly regularize weights in networks that utilize batch-normalization layers (van Laarhoven, 2017), (Hoffer et al., 2018). This includes virtually all modern architectures, such as ResNet, EfficientNet, MobileNet, and MixNet. The reason is that all weights in a layer preceding a batch-normalization layer can be re-scaled by an arbitrary factor without any change in the batch-norm outputs. This uniform re-scaling prevents $L_1$ or $L_2$ penalties from having their regularizing effect: weights can become arbitrarily small while not influencing the classification loss negatively, as they keep their original relative magnitudes. For example, Figure 2 (center) shows that with $L_2$ regularization, LTP performs poorly when applied to ResNet20 on Cifar100, as a result of the unison and unbounded fall of layer weights during the course of pruning (right). To fix this issue, (van Laarhoven, 2017) suggests normalizing the $L_2$-norm of a layer's weight tensor after each update. This, however, is not desirable when learning pruning thresholds, as the magnitude of individual weights constantly changes as a result of the normalization. (Hoffer et al., 2018), on the other hand, suggests using $L_1$ or $L_\infty$ batch-normalization instead of the standard scheme. This, again, is not desirable, as it does not address current architectures. Consequently, in this work we focus on $L_0$ regularization, which does work well with batch-normalization.
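The re-scaling argument can be verified numerically: multiplying all weights feeding a batch-normalization layer by a constant leaves the normalized outputs unchanged, so an $L_2$ penalty can shrink such weights indefinitely without affecting the loss (a minimal sketch with a hand-rolled batch-norm and no learned affine parameters):

```python
import numpy as np

def batch_norm(z, eps=1e-12):
    # Normalize over the batch dimension (no learned scale/shift).
    return (z - z.mean(0)) / np.sqrt(z.var(0) + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 8))   # a batch of inputs
W = rng.normal(size=(8, 4))    # layer weights preceding batch-norm

y1 = batch_norm(x @ W)
y2 = batch_norm(x @ (0.001 * W))           # uniformly re-scaled weights

print(np.abs(y1 - y2).max())               # ~0: outputs are identical
print((W ** 2).sum(), ((0.001 * W) ** 2).sum())  # yet the L2 penalty collapses
```

The L2 penalty drops by a factor of a million while the network's output, and hence the classification loss, is unchanged; this is precisely the failure mode discussed above.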

As was the case with hard-pruning in (1), the challenge in using $L_0$ regularization for learning per-layer pruning thresholds is that it is not differentiable, i.e.,

$$L_{0,l} = \sum_k \mathrm{step}\!\left(w_{kl}^2 - \tau_l\right),$$

hence, cannot be used with back-propagation. This motivates our soft $L_0$ norm definition for layer $l$, i.e.,

$$L_{0,l}^{\mathrm{soft}} = \sum_k \mathrm{sigm}\!\left(\frac{w_{kl}^2 - \tau_l}{T}\right), \tag{7}$$

which is differentiable, and therefore can be used with back-propagation, i.e.,

$$\frac{\partial L_{0,l}^{\mathrm{soft}}}{\partial w_{kl}} = \frac{2 w_{kl}}{T}\,\mathrm{sigm}'\!\left(\frac{w_{kl}^2 - \tau_l}{T}\right), \tag{8}$$

where $\mathrm{sigm}'$ is given by (5), and

$$\frac{\partial L_{0,l}^{\mathrm{soft}}}{\partial \tau_l} = -\frac{1}{T}\sum_k \mathrm{sigm}'\!\left(\frac{w_{kl}^2 - \tau_l}{T}\right). \tag{9}$$
Inspecting (9) reveals an important aspect of $L_{0,l}^{\mathrm{soft}}$, namely that only weights falling within the sigmoid transitional region, i.e., $|w_{kl}^2 - \tau_l| \lesssim T$, may contribute to any change in $L_{0,l}^{\mathrm{soft}}$. This is because other weights are either very small and completely pruned away, or very large and unaffected by pruning. The consequence is that if a significant fraction of these weights, e.g., as a result of a back-propagation update, are moved out of the transitional region, $L_{0,l}^{\mathrm{soft}}$ becomes constant and the pruning process stalls. The condition for preventing the premature termination of pruning when using $L_{0,l}^{\mathrm{soft}}$ can then be expressed as

$$\left|\left(w_{kl} - \eta\,\frac{\partial \mathcal{L}}{\partial w_{kl}}\right)^{2} - w_{kl}^2\right| \ll T, \tag{10}$$

where $\eta$ is the weight learning rate and $\mathcal{L}$ denotes the overall objective function comprising both classification and soft $L_0$ regularization losses, i.e.,

$$\mathcal{L} = \mathcal{L}_{\mathrm{cls}} + \lambda \sum_l L_{0,l}^{\mathrm{soft}}. \tag{11}$$

Note that the left-hand side of Equation (10) is the displacement of $w_{kl}^2$ as a result of a weight update. So (10) states that this displacement should not be comparable to the transition region's width ($\sim T$).
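To make the soft $L_0$ norm and its gradients concrete, the following NumPy sketch computes them analytically and checks the threshold gradient against a finite difference (a sketch under the notation above, not the authors' implementation; all values are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def soft_l0(w, tau, T):
    """Soft L0 norm of a layer: sum_k sigm((w_k^2 - tau) / T)."""
    return sigmoid((w * w - tau) / T).sum()

def soft_l0_grads(w, tau, T):
    """Analytic gradients of the soft L0 norm w.r.t. weights and threshold."""
    s = sigmoid((w * w - tau) / T)
    ds = s * (1.0 - s)                 # sigm'(x) = sigm(x)(1 - sigm(x))
    dw = 2.0 * w * ds / T              # per-weight gradient
    dtau = -(ds / T).sum()             # threshold gradient (sum over the layer)
    return dw, dtau

rng = np.random.default_rng(1)
w, tau, T = rng.normal(scale=0.1, size=100), 0.01, 1e-3
dw, dtau = soft_l0_grads(w, tau, T)

# Finite-difference check of the threshold gradient:
eps = 1e-8
fd = (soft_l0(w, tau + eps, T) - soft_l0(w, tau - eps, T)) / (2 * eps)
print(dtau, fd)  # the two should agree closely
```

Note that only weights whose squared magnitude lies within roughly $T$ of the threshold contribute appreciably to either gradient, which is the stalling mechanism discussed above.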

3.3 Learned Threshold Pruning

LTP is a magnitude-based pruning method that learns per-layer thresholds while training or finetuning a network. More precisely, LTP adopts a framework of updating all network weights, but only using their soft-pruned versions in the forward pass. When we apply the chain rule to the classification loss $\mathcal{L}_{\mathrm{cls}}$, we have

$$\frac{\partial \mathcal{L}_{\mathrm{cls}}}{\partial \tau_l} = \sum_k \frac{\partial \mathcal{L}_{\mathrm{cls}}}{\partial v_{kl}}\,\frac{\partial v_{kl}}{\partial \tau_l}, \tag{12}$$

$$\frac{\partial \mathcal{L}_{\mathrm{cls}}}{\partial w_{kl}} = \frac{\partial \mathcal{L}_{\mathrm{cls}}}{\partial v_{kl}}\,\frac{\partial v_{kl}}{\partial w_{kl}}. \tag{13}$$

LTP uses (11), (12), (3) and (9) to update the per-layer thresholds:

$$\tau_l \leftarrow \tau_l - \eta_\tau \left(\frac{\partial \mathcal{L}_{\mathrm{cls}}}{\partial \tau_l} + \lambda\,\frac{\partial L_{0,l}^{\mathrm{soft}}}{\partial \tau_l}\right), \tag{14}$$

where $\eta_\tau$ denotes the threshold learning rate.
Updating weights needs more care; in particular, minimization of $\mathcal{L}$ with respect to $w_{kl}$ is subject to the constraint given by (10). Interestingly, $\partial \mathcal{L} / \partial w_{kl}$, as given by (11), (13), (4) and (8), i.e.,

$$\frac{\partial \mathcal{L}}{\partial w_{kl}} = \frac{\partial \mathcal{L}_{\mathrm{cls}}}{\partial v_{kl}}\left[\mathrm{sigm}\!\left(\frac{w_{kl}^2 - \tau_l}{T}\right) + \frac{2 w_{kl}^2}{T}\,\mathrm{sigm}'\!\left(\frac{w_{kl}^2 - \tau_l}{T}\right)\right] + \lambda\,\frac{2 w_{kl}}{T}\,\mathrm{sigm}'\!\left(\frac{w_{kl}^2 - \tau_l}{T}\right), \tag{15}$$

includes $\mathrm{sigm}'(\cdot)/T$ terms which, as a result of (6), could violate (10) for small $T$ (a requirement for setting $T$, c.f., (19) and Table 1). There are two simple solutions to enforce (10). The first approach is to compute $\partial \mathcal{L} / \partial w_{kl}$ as given by (15), but clamp it based on (10). The second, and arguably simpler, approach is to use

$$\frac{\partial \mathcal{L}}{\partial w_{kl}} \approx \frac{\partial \mathcal{L}_{\mathrm{cls}}}{\partial v_{kl}}\,\mathrm{sigm}\!\left(\frac{w_{kl}^2 - \tau_l}{T}\right). \tag{16}$$
To appreciate the logic behind this approximation, note that for the vast majority of weights, which are outside the sigmoid transitional region, equations (16) and (15) give almost identical values. On the other hand, although the values given by (16) and (15), after clamping, do differ for weights within the transitional region, these weights remain there for only a very small fraction of the training time (as $\tau_l$ moves past them). This means that they would acquire their correct values through back-propagation once they are out of the transitional region. Also note that (16) is equivalent to using

$$\frac{\partial \mathcal{L}}{\partial w_{kl}} \approx \frac{\partial \mathcal{L}_{\mathrm{cls}}}{\partial w_{kl}}, \tag{17}$$

i.e., not using the soft $L_0$ regularizer for updating weights, and

$$\frac{\partial v_{kl}}{\partial w_{kl}} \approx \mathrm{sigm}\!\left(\frac{w_{kl}^2 - \tau_l}{T}\right) \tag{18}$$

instead of (4), i.e., treating the sigmoid function in (4) as constant. These adjustments are necessary for preventing premature termination of LTP, c.f., Figure 1.

Finally, after training is finished, LTP uses the learned thresholds to hard-prune the network, which can be further finetuned, without regularization, for improved performance.
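The final hard-pruning step with the learned thresholds can be sketched as follows (layer names, shapes and threshold values are hypothetical; this is a minimal illustration, not the authors' code):

```python
import numpy as np

def hard_prune_model(weights, thresholds):
    """Apply learned per-layer thresholds as a final hard prune.

    `weights` maps layer name -> weight array; `thresholds` maps
    layer name -> learned tau_l.  Returns the pruned weights and the
    overall sparsity (fraction of zeroed weights).
    """
    pruned, kept, total = {}, 0, 0
    for name, w in weights.items():
        mask = (w * w) > thresholds[name]   # step(w^2 - tau), per Equation (1)
        pruned[name] = w * mask
        kept += int(mask.sum())
        total += w.size
    return pruned, 1.0 - kept / total

rng = np.random.default_rng(0)
weights = {"conv1": rng.normal(scale=0.05, size=(16, 3, 3, 3)),
           "fc":    rng.normal(scale=0.02, size=(10, 256))}
thresholds = {"conv1": 0.003, "fc": 0.0008}  # hypothetical learned taus
pruned, sparsity = hard_prune_model(weights, thresholds)
print(f"overall sparsity: {sparsity:.2%}")
```

In practice this step would be followed by a short finetuning pass, with the mask held fixed, to recover any accuracy lost to hard-pruning.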

Figure 1: CDF (left) and scatter plot (right) of the uncompressed and pruned model's weights for layer3.2.conv2 of ResNet20 on Cifar100: LTP (top), LTP using the soft $L_0$ regularizer for updating weights (middle) and LTP using (4) instead of (18) (bottom). Red lines indicate the layer's pruning threshold. Note how the formation of a gap around the threshold causes pruning to terminate prematurely in the latter two scenarios.
Figure 2: Sparsity (left), training Top-1 accuracy (center) and mean squared weight for layer3.2.conv2 (right) when pruning ResNet20 on Cifar100 with no regularization (blue), $L_2$ regularization (red) and soft $L_0$ regularization (green).

4 Experiments

Guidelines for a proper choice of hyper-parameters are given in section 4.1, followed by LTP pruning results in section 4.2.

4.1 Choice of Hyper-parameters

LTP has three main hyper-parameters: the temperature $T$, the threshold learning rate $\eta_\tau$, and the regularization strength $\lambda$. Table 1 provides the values used for generating the data reported in this paper.

Network Dataset
ResNet20 Cifar100
AlexNet ImageNet
ResNet50 ImageNet
MobileNetV2 ImageNet
EfficientNet-B0 ImageNet
MixNet-S ImageNet
Table 1: Hyper-parameter values used to produce results reported in this paper.

LTP uses soft-pruning during training to learn per-layer thresholds, but hard-pruning to finally remove redundant weights. Selecting a small enough $T$ ensures that the performance of the soft-pruned and hard-pruned networks are close. Too small a $T$, on the other hand, is undesirable, as it makes the transitional region of the sigmoid function too narrow and could terminate pruning prematurely. To set the per-layer $T$ in this paper, the following equation was used:


While one could consider starting with a larger $T$ and annealing it during training, a fixed value of $T$ provided good results for all experiments reported in this paper (Table 1).

One important consideration when choosing $\eta_\tau$ is given by equation (12), namely, that the gradient with respect to the pruning threshold gets contributions from the gradients of all weights in the layer. This means that $\partial \mathcal{L} / \partial \tau_l$ can potentially be much larger than a typical $\partial \mathcal{L} / \partial w_{kl}$, especially if the per-weight contributions are correlated. Therefore, to prevent changes in $\tau_l$ that are orders of magnitude larger than changes in $w_{kl}$, $\eta_\tau$ should be small. Table 1 summarizes the values used for $\eta_\tau$ for producing the results reported in this paper.

$\lambda$ is the primary hyper-parameter determining the sparsity levels achieved by LTP. The following equation was used in this paper to set $\lambda$:

$$\lambda = 2^{m}\,\lambda_0, \tag{20}$$

where $\lambda_0$ is a constant, and $m$ is an integer initialized to zero and incremented every time the model's sparsity does not improve by a preset margin over a preset number of epochs since $m$ was last updated. Equation (20) provides a mechanism for gradually increasing $\lambda$ if $\lambda_0$ is not large enough to reach the target sparsity. Our experiments show that to get the best results, it is best to choose $\lambda_0$ large enough such that the desired sparsity is reached sooner rather than later (this is likely due to some networks' tendency to overfit if trained for too long). For example, Table 2 gives the trail of pruned networks generated by LTP when applied to ResNet50 on ImageNet. It is noteworthy that a too-aggressive $\lambda_0$ may also be disadvantageous. Note how the Top-1 accuracy of the pruned networks given in Table 2 improves for the first three epochs, despite the increase in sparsity; we observed that if $\lambda_0$ is too large, the first pruned model may have poor performance without any subsequent recovery. Finally, Table 1 gives the hyper-parameter values used for producing the reported results.
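The schedule described above can be sketched as follows (the doubling rule, the margin, and the patience window used here are illustrative assumptions, not the paper's exact settings):

```python
# Sketch of the lambda schedule: start from a base value lambda_0 and
# increment m (doubling lambda) whenever sparsity has not improved by a
# preset margin over a preset number of checks (e.g., epochs).

def make_lambda_schedule(lambda_0, margin=0.01, patience=3):
    state = {"m": 0, "best": 0.0, "stale": 0}

    def step(current_sparsity):
        if current_sparsity >= state["best"] + margin:
            state["best"] = current_sparsity   # sparsity still improving
            state["stale"] = 0
        else:
            state["stale"] += 1
            if state["stale"] >= patience:
                state["m"] += 1                # increase regularization strength
                state["stale"] = 0
        return lambda_0 * (2 ** state["m"])

    return step

step = make_lambda_schedule(lambda_0=1e-6)
for s in [0.10, 0.30, 0.40, 0.40, 0.40, 0.40]:
    print(step(s))   # lambda doubles once sparsity plateaus
```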

Epoch Sparsity Loss Top-1 Acc. Top-5 Acc.
Table 2: LTP generates a trail of progressively sparser models when applied to ResNet50 on ImageNet (note: Top-1 and Top-5 accuracies can further be improved by hard-pruning the model, then finetuning it without applying regularization.)

4.2 LTP Pruning Results

In this section we evaluate LTP on the ImageNet ILSVRC-2015 dataset (Russakovsky et al., 2015). LTP is used to prune a wide variety of networks comprising AlexNet (Krizhevsky et al., 2017), ResNet50 (He et al., 2016), MobileNetV2 (Sandler et al., 2018), EfficientNet-B0 (Tan and Le, 2019a) and MixNet-S (Tan and Le, 2019b).

Table 3 gives LTP's ResNet50 performance results on ImageNet, where it matches the state-of-the-art (Ye et al., 2019), i.e., 9.11× compression with a Top-5 accuracy of 92.0%. As Table 2 shows, LTP reaches the desired sparsity in just 18 epochs, whereas (Ye et al., 2019) typically spends hundreds of epochs of training and requires preset per-layer compression ratios as input.

Method Top-5 Acc. Rate
Uncompressed (CaffeNet Model)
Fine-grained Pruning (Mao et al., 2017)
Progressive ADMM Pruning (Ye et al., 2019)
Uncompressed (TorchVision Model)
LTP (using TorchVision Model)
Table 3: Comparison of weight pruning results on ResNet50 for ImageNet dataset.

Table 4 provides LTP's AlexNet performance results on ImageNet, where it achieves a compression rate of 26.4× without any drop in Top-5 accuracy. It is noteworthy that TorchVision's AlexNet implementation is slightly different from CaffeNet's: while both implementations have the same number of weights in their fully-connected layers, the TorchVision model has fewer weights in its convolutional layers than CaffeNet. As a result of being slimmer, the uncompressed TorchVision model achieves a lower Top-1 accuracy and, we conjecture, can be compressed less.

Method Top-5 Acc. (Top-5 Acc.) Rate
Uncompressed (CaffeNet Model) N/A
Network Pruning (Han et al., 2015b)
Automated Pruning (Manessi et al., 2018) N/A
ADMM Pruning (Zhang et al., 2018)
Progressive ADMM Pruning (Ye et al., 2019)
Uncompressed (TorchVision Model) N/A
LTP (using TorchVision Model)
LTP (using TorchVision Model)
Table 4: Comparison of weight pruning results on AlexNet for ImageNet data set.

To the best of our knowledge, this is the first time that (unstructured) pruning results for MobileNetV2, EfficientNet-B0 and MixNet-S are reported, c.f., Table 5, Table 6 and Table 7, respectively. This is partially because LTP, in contrast to many other methods, e.g., (Han et al., 2015b), (Zhang et al., 2018) and (Ye et al., 2019), does not require preset per-layer compression rates, which are non-trivial to set given these networks' large number of layers, parallel branches, and novel architectural building blocks such as squeeze-and-excite and inverted-bottleneck. This, along with LTP's computational efficiency and batch-normalization compatibility, enables it to be applied to such diverse architectures out-of-the-box. In the absence of pruning results in the literature, Global-Pruning, as described and implemented in (Ortiz et al., 2019), was used to produce benchmarks. In particular, we see that LTP's compressed MobileNetV2 provides a Top-1 advantage over one compressed by Global-Pruning. Finally, note that LTP can be used to compress MobileNetV2, EfficientNet-B0, and MixNet-S, which are by design architecturally efficient, with only a small drop in Top-1 accuracy.

Method Top-1 Acc. Top-5 Acc. Rate
Global Pruning
Global Pruning
Global Pruning
Table 5: Comparison of weight pruning results on MobileNetV2 for ImageNet data set.
Method Top-1 Acc. Top-5 Acc. Rate
Global Pruning
Global Pruning
Global Pruning
Table 6: Comparison of weight pruning results on EfficientNet-B0 for ImageNet data set (note: the implementation used here does not include Swish activations).
Method Top-1 Acc. Top-5 Acc. Rate
Global Pruning
Global Pruning
Global Pruning
Table 7: Comparison of weight pruning results on MixNet-S for ImageNet data set.

5 Conclusion

In this work, we introduced Learned Threshold Pruning (LTP), a novel gradient-based algorithm for unstructured pruning of deep networks. With the help of soft-pruning and soft $L_0$ regularization, we proposed a framework where pruning thresholds for each layer are learned in an end-to-end manner. With an extensive set of experiments, we showed that LTP is an out-of-the-box method that achieves remarkable compression rates on "old" (AlexNet, ResNet50) as well as "new" architectures (EfficientNet, MobileNetV2, MixNet). Our experiments also established that LTP gives high compression rates even in the presence of batch-normalization layers. LTP achieves 26.4× compression on AlexNet and 9.1× compression on ResNet50 with no loss in top-1 accuracy on the ImageNet dataset. We are also the first to report compression results on efficient architectures comprised of depth-wise separable convolutions and squeeze-excite blocks: LTP compresses MobileNetV2, EfficientNet-B0 and MixNet-S with only a small drop in top-1 accuracy on ImageNet. Additionally, LTP demonstrates fast convergence characteristics, e.g., it prunes ResNet50 in 18 epochs (plus 12 epochs for finetuning) and MixNet-S within 17 epochs without the need for further finetuning.


  • S. Anwar, K. Hwang, and W. Sung (2017) Structured pruning of deep convolutional neural networks. ACM Journal on Emerging Technologies in Computing Systems (JETC) 13 (3), pp. 1–18. Cited by: §1.
  • H. Cai, L. Zhu, and S. Han (2018) Proxylessnas: direct neural architecture search on target task and hardware. arXiv preprint arXiv:1812.00332. Cited by: §1.
  • F. Chollet (2017) Xception: deep learning with depthwise separable convolutions. Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1251–1258. Cited by: §1.
  • B. Dai, C. Zhu, and D. Wipf (2018) Compressing neural networks using the variational information bottleneck. arXiv preprint arXiv:1802.10399. Cited by: §2.
  • E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus (2014) Exploiting linear structure within convolutional networks for efficient evaluation. See DBLP:conf/nips/2014, pp. 1269–1277. Cited by: §2.
  • J. Faraone, N. J. Fraser, M. Blott, and P. H. W. Leong (2018) SYQ: learning symmetric quantization for efficient deep neural networks. See DBLP:conf/cvpr/2018, pp. 4300–4309. External Links: Document Cited by: §1.
  • J. Frankle and M. Carbin (2019) The lottery ticket hypothesis: finding sparse, trainable neural networks. See DBLP:conf/iclr/2019, Cited by: §1, §2.
  • S. Han, H. Mao, and W. J. Dally (2015a) Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149. Cited by: §2.
  • S. Han, J. Pool, J. Tran, and W. J. Dally (2015b) Learning both weights and connections for efficient neural networks. CoRR abs/1506.02626. External Links: 1506.02626 Cited by: §1, §1, §2, §3.1, §3.2, §4.2, Table 4.
  • B. Hassibi, D. G. Stork, and G. J. Wolff (1993) Optimal brain surgeon: extensions and performance comparison. See DBLP:conf/nips/1993, pp. 263–270. Cited by: §1, §2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 770–778. External Links: Document Cited by: §4.2.
  • Y. He, J. Lin, Z. Liu, H. Wang, L. Li, and S. Han (2018) AMC: automl for model compression and acceleration on mobile devices. See DBLP:conf/eccv/2018-7, pp. 815–832. External Links: Document Cited by: §1, §2.
  • Y. He, X. Zhang, and J. Sun (2017) Channel pruning for accelerating very deep neural networks. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pp. 1398–1406. External Links: Document Cited by: §1, §2.
  • E. Hoffer, R. Banner, I. Golan, and D. Soudry (2018) Norm matters: efficient and accurate normalization schemes in deep networks. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, S. Bengio, H. M. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 2164–2174. Cited by: §3.2.
  • J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7132–7141. Cited by: §1.
  • A. Ignatov, R. Timofte, W. Chou, K. Wang, M. Wu, T. Hartley, and L. Van Gool (2018) AI benchmark: running deep neural networks on android smartphones. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 0–0. Cited by: §1.
  • S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, Cited by: §2.
  • B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko (2018) Quantization and training of neural networks for efficient integer-arithmetic-only inference. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
  • M. Jaderberg, A. Vedaldi, and A. Zisserman (2014) Speeding up convolutional neural networks with low rank expansions. See DBLP:conf/bmvc/2014, Cited by: §2.
  • Y. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin (2015) Compression of deep convolutional neural networks for fast and low power mobile applications. arXiv preprint arXiv:1511.06530. Cited by: §2.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton (2017) ImageNet classification with deep convolutional neural networks. Commun. ACM. Cited by: §2, §4.2.
  • A. Kuzmin, M. Nagel, S. Pitre, S. Pendyam, T. Blankevoort, and M. Welling (2019) Taxonomy and evaluation of structured compression of convolutional neural networks. arXiv preprint arXiv:1912.09802. Cited by: §2.
  • V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, and V. Lempitsky (2014) Speeding-up convolutional neural networks using fine-tuned cp-decomposition. arXiv preprint arXiv:1412.6553. Cited by: §2.
  • Y. LeCun, J. S. Denker, and S. A. Solla (1989) Optimal brain damage. See DBLP:conf/nips/1989, pp. 598–605. Cited by: §1, §2.
  • H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf (2017) Pruning filters for efficient convnets. See DBLP:conf/iclr/2017, Cited by: §1, §2.
  • D. D. Lin, S. S. Talathi, and V. S. Annapureddy (2016) Fixed point quantization of deep convolutional networks. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML’16, pp. 2849–2858. Cited by: §1.
  • C. Louizos, K. Ullrich, and M. Welling (2017) Bayesian compression for deep learning. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 3288–3298. Cited by: §2.
  • C. Louizos, M. Welling, and D. P. Kingma (2018) Learning sparse neural networks through L0 regularization. In International Conference on Learning Representations, Cited by: §2.
  • F. Manessi, A. Rozza, S. Bianco, P. Napoletano, and R. Schettini (2018) Automated pruning for deep neural network compression. In International Conference on Pattern Recognition (ICPR) 2018, pp. 657–664. Cited by: §1, §2, Table 4.
  • H. Mao, S. Han, J. Pool, W. Li, X. Liu, Y. Wang, and W. J. Dally (2017) Exploring the granularity of sparsity in convolutional neural networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2017, Honolulu, HI, USA, July 21-26, 2017, pp. 1927–1934. Cited by: Table 3.
  • D. Molchanov, A. Ashukha, and D. P. Vetrov (2017) Variational dropout sparsifies deep neural networks. In Proceedings of the 34th International Conference on Machine Learning (ICML 2017), pp. 2498–2507. Cited by: §2.
  • J. J. G. Ortiz, D. W. Blalock, and J. V. Guttag (2019) Standardizing evaluation of neural network pruning. In Workshop on AI Systems at SOSP, Cited by: §4.2.
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei (2015) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252. Cited by: §4.2.
  • M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) MobileNetV2: inverted residuals and linear bottlenecks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §1, §4.2.
  • S. Srinivas, A. Subramanya, and R. V. Babu (2017) Training sparse neural networks. In CVPR Workshops 2017, pp. 455–462. Cited by: §2.
  • M. Tan and Q. V. Le (2019a) EfficientNet: rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946. Cited by: §1, §1, §4.2.
  • M. Tan and Q. V. Le (2019b) MixConv: mixed depthwise convolutional kernels. CoRR abs/1907.09595. Cited by: §1, §4.2.
  • K. Ullrich, E. Meeds, and M. Welling (2017) Soft weight-sharing for neural network compression. See DBLP:conf/iclr/2017, Cited by: §2.
  • T. van Laarhoven (2017) L2 regularization versus batch and weight normalization. CoRR abs/1706.05350. Cited by: §3.2.
  • B. Wu, X. Dai, P. Zhang, Y. Wang, F. Sun, Y. Wu, Y. Tian, P. Vajda, Y. Jia, and K. Keutzer (2019) FBNet: hardware-aware efficient convnet design via differentiable neural architecture search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10734–10742. Cited by: §1.
  • S. Ye, X. Feng, T. Zhang, X. Ma, S. Lin, Z. Li, K. Xu, W. Wen, S. Liu, J. Tang, M. Fardad, X. Lin, Y. Liu, and Y. Wang (2019) Progressive DNN compression: a key to achieve ultra-high weight pruning and quantization rates using ADMM. CoRR abs/1903.09769. Cited by: §1, §1, §2, §4.2, §4.2, Table 3, Table 4.
  • T. Zhang, S. Ye, K. Zhang, J. Tang, W. Wen, M. Fardad, and Y. Wang (2018) A systematic DNN weight pruning framework using alternating direction method of multipliers. In European Conference on Computer Vision (ECCV) 2018, pp. 191–207. Cited by: §1, §1, §1, §1, §2, §4.2, Table 4.
  • X. Zhang, J. Zou, K. He, and J. Sun (2016) Accelerating very deep convolutional networks for classification and detection. IEEE transactions on pattern analysis and machine intelligence 38 (10), pp. 1943–1955. Cited by: §1, §2.
  • A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen (2017) Incremental network quantization: towards lossless CNNs with low-precision weights. arXiv preprint arXiv:1702.03044. Cited by: §1.
  • H. Zhou, J. Lan, R. Liu, and J. Yosinski (2019) Deconstructing lottery tickets: zeros, signs, and the supermask. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019), pp. 3592–3602. Cited by: §2.