Convolutional Neural Networks (CNNs) have become the state of the art in various computer vision tasks, e.g., image classification [20], object detection [34], and depth estimation [5]. These CNN models are designed with deeper [10] and wider [43] convolutional layers with a large number of parameters and convolutional operations. Such architectures hinder deployment on low-power devices (e.g., phones, robots, and wearable devices) as well as in real-time critical applications such as autonomous driving. As a result, computationally efficient models are becoming increasingly important, and multiple paradigms have been proposed to minimize the complexity of CNNs.
A straightforward direction is to manually design networks with a small footprint from the start, such as [29, 13, 41, 15, 14]. This direction not only requires expert knowledge and multiple trials (e.g., up to 1000 neural architectures explored manually [21]), but also does not benefit from available pre-trained large models. Quantization [40, 17] and distillation [44, 18] are two other techniques that utilize pre-trained models to obtain smaller architectures. Quantization reduces the bit-width of parameters and thus decreases the memory footprint, but requires specialized hardware instructions to achieve latency reduction, while distillation trains a pre-defined smaller model (student) with guidance from a larger pre-trained model (teacher). Finally, model pruning aims to automatically remove the least important filters (or weights) to reduce the number of parameters or FLOPs (i.e., indirect measures). However, prior work [46, 47, 2] showed that neither the number of pruned parameters nor the FLOPs reduction directly correlates with latency (i.e., a direct measure). Latency reduction in that case depends on various aspects, such as the number of filters per layer (the signature) and the deployment device. Most GPU programming tools require careful tuning of compute kernels for different matrix shapes (e.g., convolution weights) [39, 32]; a compute kernel here refers to a function, such as a convolution operation, that runs on a high-throughput accelerator such as a GPU. These aspects make latency a non-linear function of the number of filters per layer. Recognizing the limitations, in terms of latency or energy, of simply pruning away filters, recent works [47, 45, 46] proposed optimizing directly over these direct measures. These methods require collecting latency measurements per hardware platform and per architecture to create lookup tables or a latency prediction model, which can be time intensive. In addition, these filter pruning methods are bounded by the model’s depth and can only reach a limited latency reduction.
In this work, we show the limitations of filter pruning methods in terms of latency reduction. Fig. 1 shows the range of attainable latency reduction on randomly generated models. Each box summarizes the latency reduction of 100 random models with filter and layer pruning on different network architectures and hardware platforms. For each filter-pruned model, a pruning ratio is randomly generated per layer, so the models differ in signature (width). For each layer-pruned model, a subset of the network’s layers is randomly selected for retention, so the models differ in depth. As expected, layer pruning has a higher upper bound on latency reduction than filter pruning, especially on modern complex architectures with residual blocks. However, we want to highlight quantitatively in the plot the discrepancy in attainable latency reduction between the two methods. Filter pruning is constrained not only by the depth of the model but also by the connection dependencies in the architecture. An example of such a connection dependency is the element-wise sum in a residual block between the identity connection and the residual connection. Filter pruning methods commonly prune only the in-between convolution layers of a residual block so that the number of channels and the spatial dimensions at the sum remain matched. BAR [22] proposed an atypical residual block that allows mixed connectivity between blocks to tackle the issue. However, this requires special implementations to realize the speedup. Another limitation of filter pruning is its iterative process, which must keep a minimum number of filters per layer during optimization to allow data to pass through. LayerPrune performs one-shot pruning before fine-tuning and thus allows layer removal even from single-path networks.
Motivated by these points, what remains to ask is how well layer-pruned models perform in terms of accuracy compared to filter-pruned ones. Fig. 2 shows accuracy and throughput (images per second) for our LayerPrune, several state-of-the-art pruning methods, and several handcrafted architectures. In general, pruning methods tend to find better-quality models than handcrafted architectures. It is worth noting that filter pruning methods such as ThiNet [27] and Taylor [30] show only a small speedup gain as more filters are pruned, compared to LayerPrune. This shows the limitation of filter pruning methods on latency reduction.
2 Related Work
We divide existing pruning methods into four categories: weight pruning, hardware-agnostic filter pruning, hardware-aware filter pruning and layer pruning.
Weight pruning. An early major category in pruning is individual weight pruning (unstructured pruning). Weight pruning methods leverage the fact that some weights have minimal effect on the task accuracy and thus can be zeroed out. In [8], weights with small magnitude are removed, and in [7], quantization is further applied to achieve more model compression. Another, data-free pruning method is [37], where neurons are removed iteratively from fully connected layers. An L0-regularization-based method [26] is proposed to encourage network sparsity in training. Finally, in the lottery ticket hypothesis [6], the authors propose a method of finding winning tickets, which are subnetworks from random initialization that achieve higher accuracy than the dense model. The limitation of unstructured weight pruning is that dedicated hardware and libraries [35] are needed to achieve speedup from the compression. Given our focus on latency, and to keep the evaluation setup simple, we do not consider these methods in our evaluation.
Hardware-agnostic filter pruning. Methods in this category (also known as structured pruning) aim to reduce the footprint of a model by pruning filters without any knowledge of the inference resource consumption. Examples are [30, 25, 27, 42, 23], which focus on removing the least important filters to obtain a slimmer model. Earlier filter pruning methods [27, 23] required a layer-wise sensitivity analysis to generate the signature (i.e., the number of filters per layer) as a prior, and then removed filters based on a filter criterion. The sensitivity analysis is computationally expensive and becomes even less feasible for deeper models. Recent methods [30, 25, 42] learn a global importance measure, removing the need for sensitivity analysis. Molchanov et al. [30] propose a Taylor approximation on the network’s weights, where a filter’s gradients and norm are used to approximate its global importance score. Liu et al. [25] and Wen et al. [42] add a sparsity loss to the classification cross-entropy loss during training. Filters whose criterion value falls below a threshold are removed, and the pruned model is finally fine-tuned. Zhao et al. [48] introduce a channel saliency that is parameterized as a Gaussian distribution and optimized in the training process. After training, channels with small mean and variance are pruned. In general, methods with a sparsity loss lack a simple way to respect a resource consumption target and require hyperparameter tuning to balance the different losses.
Hardware-aware filter pruning. To respect a resource consumption budget, recent works [4, 47, 45, 11] take a resource target into consideration within the optimization process. NetAdapt [47] prunes a model to meet a target budget using a heuristic greedy search. A lookup table is built for latency prediction, and multiple candidates are then generated at each pruning iteration by pruning a given percentage of filters from each layer independently. The candidate with the highest accuracy is selected, and the process continues to the next pruning iteration with a progressively increasing percentage. On the other hand, AMC [11] and ECC [45] propose end-to-end constrained pruning. AMC utilizes reinforcement learning to select a model’s signature by trial and error. ECC simplifies the latency reduction model as a bilinear per-layer model. Its training utilizes the alternating direction method of multipliers (ADMM) to perform constrained optimization by alternating between optimizing the network weights and the dual variables that control the layer-wise pruning ratios. Although these methods incorporate resource consumption as a constraint in the training process, the range of attainable budgets is limited by the depth of the model. In addition, generating measurements to model resource consumption per hardware platform and architecture can be expensive, especially on low-end hardware platforms.
Layer pruning. Unlike filter pruning, little attention has been paid to shallowing CNNs in the pruning literature. In SSS [16], the authors propose to train a scaling factor for structure selection over neurons, blocks, and groups. However, shallower models are only possible for architectures with residual connections, which allow data flow during the optimization process. Closest to our work, as a general layer pruning approach (i.e., not constrained by architecture type), is the work of Chen et al. [3]. In their method, linear classifier probes are trained independently per layer for layer ranking. After the layer-ranking learning stage, they prune the least important layers and fine-tune the shallower model. Although their method requires training for ranking, it brings no gain in classification accuracy over our one-shot LayerPrune layer ranking, as will be shown in the experiments section.
In this section, we describe LayerPrune in detail: layer pruning using existing filter criteria, along with a novel layer-wise accuracy approximation.
3.1 Layer criteria
A typical filter pruning method follows a three-stage pipeline, as illustrated in Figure 3. Filter importance is iteratively re-evaluated after each pruning step, which is controlled by a pruning meta-parameter such as pruning N filters or pruning those below a threshold. In LayerPrune, we remove the need for the iterative pruning step and show that, using the same filter criterion, we can remove layers in one shot to respect a budget. This simplifies the pruning step to a hyperparameter-free process and is computationally efficient. Layer importance is calculated as the average of the filter importance in that layer. Unlike [3], LayerPrune does not require training for layer ranking and leverages the filters’ statistics.
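The one-shot ranking described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names `layer_importance` and `one_shot_layer_prune`, the toy scores, and the choice to express the budget as a number of layers to remove (`num_to_remove`) are all assumptions made for the example.

```python
from statistics import mean


def layer_importance(filter_scores):
    """Average the per-filter scores of one layer into a single layer score."""
    return mean(filter_scores)


def one_shot_layer_prune(scores_per_layer, num_to_remove):
    """Return the names of the `num_to_remove` least important layers.

    `scores_per_layer` maps a layer name to its list of per-filter scores.
    No iterative re-evaluation is done: layers are ranked once and cut.
    """
    ranked = sorted(
        scores_per_layer,
        key=lambda name: layer_importance(scores_per_layer[name]),
    )
    return ranked[:num_to_remove]


# Hypothetical per-filter scores (e.g. weight norms) for three prunable layers.
scores = {
    "conv1": [0.9, 1.1, 1.0],
    "conv2": [0.2, 0.1, 0.3],
    "conv3": [0.6, 0.4, 0.5],
}
pruned = one_shot_layer_prune(scores, num_to_remove=1)
print(pruned)  # ['conv2']
```

In this toy setting `conv2` has the lowest average filter score and is selected for removal; in practice the per-filter scores would come from one of the filter criteria, and the remaining network would then be fine-tuned.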
Layer-wise imprinting. In addition to existing filter criteria, we present a novel layer importance based on layer-wise accuracy approximation. Motivated by the few-shot learning literature [33, 28], we use imprinting to approximate the classification accuracy up to each layer. Imprinting is used to approximate a classifier’s weight matrix when only a few training samples are available. Although we have adequate training samples, we are inspired by the efficiency of imprinting to approximate the accuracy in one pass without the need for training. We create a classifier proxy for each prunable candidate (e.g., a convolution layer or a residual block), and then the training data is used to imprint the classifier weight matrix for each proxy. Since each layer has a different output feature shape, we apply adaptive average pooling to simplify our method and unify the embedding length so that each layer produces an output of roughly the same size. Specifically, the pooling is done as follows:
$$d_i = \mathrm{round}\left(\sqrt{N / n_i}\right), \qquad E_i = \mathrm{AdaptiveAvgPool}(O_i, d_i), \tag{1}$$
where $N$ is the embedding length, $n_i$ is layer $i$’s number of filters, $O_i$ is layer $i$’s output feature map, and AdaptiveAvgPool [9] reduces $O_i$ to the embedding $E_i \in \mathbb{R}^{d_i \times d_i \times n_i}$. Finally, the embeddings per layer are flattened to be used in imprinting. Imprinting calculates the proxy classifier’s weight matrix $P_i$ as follows:
$$P_i[:, c] = \frac{1}{N_c} \sum_{j=1}^{D} I[c_j = c]\, E_j, \tag{2}$$
where $c$ is the class id, $c_j$ is sample $j$’s class id, $N_c$ is the number of samples in class $c$, $D$ is the total number of samples, and $I[\cdot]$ denotes the indicator function.
The accuracy at each proxy is then calculated using the imprinted weight matrices. The prediction for each sample $j$ is calculated for each layer $i$ as:
$$\hat{y}_j = \arg\max_{c} \; P_i[:, c]^{\top} E_j, \tag{3}$$
where $E_j$ is sample $j$’s embedding at layer $i$, calculated as shown in Eq. (1). This is equivalent to finding the nearest class from the imprinted weights in the embedding space. The ranking of each layer is then calculated as the gain in accuracy over the previous pruning candidate.
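To make the imprinting procedure concrete, the following pure-Python sketch imprints one class prototype per class from already pooled and flattened embeddings, predicts by the largest dot product, and turns per-layer proxy accuracies into gains. It is an illustration under stated assumptions, not the paper's code: the function names and toy data are hypothetical, embeddings are used unnormalized, and the first candidate's gain is taken to be its own accuracy.

```python
def imprint(embeddings, labels, num_classes):
    """Proxy classifier weights: one prototype (mean embedding) per class."""
    dim = len(embeddings[0])
    sums = [[0.0] * dim for _ in range(num_classes)]
    counts = [0] * num_classes
    for emb, label in zip(embeddings, labels):
        counts[label] += 1
        for k, value in enumerate(emb):
            sums[label][k] += value
    # max(..., 1) guards against classes with no samples.
    return [[v / max(counts[c], 1) for v in sums[c]] for c in range(num_classes)]


def predict(prototypes, emb):
    """Class whose imprinted weight vector has the largest dot product with emb."""
    scores = [sum(p * e for p, e in zip(proto, emb)) for proto in prototypes]
    return max(range(len(scores)), key=scores.__getitem__)


def proxy_accuracy(embeddings, labels, num_classes):
    """Accuracy of the imprinted proxy classifier on the given samples."""
    prototypes = imprint(embeddings, labels, num_classes)
    hits = sum(predict(prototypes, e) == y for e, y in zip(embeddings, labels))
    return hits / len(labels)


def accuracy_gains(per_layer_accuracy):
    """Layer score = accuracy gain over the previous prunable candidate.

    Assumption: the first candidate is scored by its own accuracy.
    """
    gains = [per_layer_accuracy[0]]
    for prev, cur in zip(per_layer_accuracy, per_layer_accuracy[1:]):
        gains.append(cur - prev)
    return gains


# Hypothetical flattened embeddings of four samples from two classes.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1]
acc = proxy_accuracy(embeddings, labels, num_classes=2)
```

A layer whose gain is small or negative adds little to the approximated accuracy and is therefore a natural candidate for removal.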
3.2 Filter criteria
Although existing filter pruning methods differ in the algorithms and optimization used, they mostly focus on finding the optimal per-layer number of filters and share common filter criteria. We group the methods by the filter criterion used and propose the layer importance counterpart used in LayerPrune.
Preliminary notation. Consider a network with $L$ layers, where each layer $l$ has a weight matrix $W^{(l)} \in \mathbb{R}^{n_{l-1} \times n_l \times k_l \times k_l}$ with $n_{l-1}$ input channels, $n_l$ filters, and filters of size $k_l$. The evaluated criteria and methods are:
Taylor weights. The Taylor method [30] is slightly different from the previous criterion in that the gradients are included in the ranking as well. Filter ranking is based on $\sum_s (g_s w_s)^2$, where $s$ iterates over all individual weights in a filter of $W^{(l)}$, $g$ is the gradient, and $w$ is the weight value. Similarly, layer ranking can be expressed as:
$$\text{Taylor-layer-importance}[l] = \left\| G^{(l)} \odot W^{(l)} \right\|_F^2,$$
where $\odot$ is the element-wise product and $G^{(l)}$ is the gradient of the loss with respect to the weights $W^{(l)}$.
Feature map based heuristics. [27, 31, 24] rank filters based on statistics of the layer’s output. In [27], ranking is based on the effect on the next layer, while [31], similar to Taylor weights, utilizes gradients and norm but on feature maps.
Channel saliency. In this criterion, a scalar is multiplied by the feature maps and optimized within a typical training cycle with the task loss and a sparsity regularization loss to encourage sparsity. Slimming [25] utilizes the Batch Normalization scale parameter as the channel saliency. Similarly, we use the Batch Normalization scale parameter to calculate layer importance for this criterion.
Ensemble. We also consider a diverse ensemble of layer ranks, where the ensemble rank of each layer is the sum of its ranks over all methods. More specifically:
$$\text{EnsembleRank}[l] = \sum_{m=1}^{M} \text{LayerRank}(m, l),$$
where $l$ is the layer’s index, $M$ is the number of all criteria, and LayerRank$(m, l)$ indicates the position of layer $l$ in the sorted list for criterion $m$.
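The rank-sum ensemble can be illustrated with a short Python sketch. The function names, the toy scores, and the convention that rank 0 denotes the least important layer (ascending sort, ties broken by input order) are assumptions for the example; the text only states that per-criterion ranks are summed.

```python
def layer_ranks(scores):
    """Map each layer to its position in the list sorted by ascending score."""
    ordered = sorted(scores, key=scores.get)
    return {layer: pos for pos, layer in enumerate(ordered)}


def ensemble_rank(scores_per_criterion):
    """Sum each layer's rank over all criteria."""
    total = {}
    for scores in scores_per_criterion.values():
        for layer, pos in layer_ranks(scores).items():
            total[layer] = total.get(layer, 0) + pos
    return total


# Hypothetical layer importance scores under two criteria.
criteria = {
    "weight_norm": {"conv1": 0.9, "conv2": 0.1, "conv3": 0.5},
    "bn_scale": {"conv1": 0.3, "conv2": 0.2, "conv3": 0.8},
}
ranks = ensemble_rank(criteria)
print(ranks)  # {'conv2': 0, 'conv3': 3, 'conv1': 3}
```

Summing ranks rather than raw scores sidesteps the fact that different criteria produce values on different scales, at the cost of discarding how far apart two layers are under any single criterion.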
4 Evaluation Results
In this section, we present experimental results comparing state-of-the-art pruning methods and LayerPrune in terms of accuracy and latency reduction on two different hardware platforms. We report latency on a high-end 1080Ti GPU and on an NVIDIA Jetson Xavier embedded device, which is used in mobile vision systems and contains a 512-core Volta GPU. We evaluate the methods on the CIFAR-10/100 [19] and ImageNet [20] datasets.
4.1 Implementation details
Latency is averaged over 1000 forward passes, after 10 warm-up forward passes to account for lazy GPU initialization. Latency is measured with batch size 1 unless otherwise stated, due to its practical importance in real-time applications such as robotics, where an online stream of frames is processed. All pruned architectures are implemented and measured using PyTorch [1]. For a fair comparison, we compare latency reduction at a similar accuracy retention from the baseline, as reported by the original papers, or compare accuracy at a similar latency reduction with methods that support layer or block pruning.
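The measurement protocol (warm-up passes followed by an averaged timing loop) can be sketched generically as below. This is a framework-agnostic illustration with hypothetical names: `forward` stands for any zero-argument callable that runs one forward pass, and `latency_reduction` assumes the usual definition of LR as the relative drop in latency with respect to the baseline, so that a negative value means the pruned model is slower.

```python
import time


def measure_latency(forward, warmup=10, runs=1000):
    """Mean wall-clock seconds per call of `forward`, after warm-up calls."""
    for _ in range(warmup):
        forward()  # warm-up calls are executed but not timed
    start = time.perf_counter()
    for _ in range(runs):
        forward()
    return (time.perf_counter() - start) / runs


def latency_reduction(baseline, pruned):
    """Latency reduction in percent; negative means a latency increase."""
    return 100.0 * (baseline - pruned) / baseline


# Toy stand-in for a model's forward pass.
calls = []
mean_seconds = measure_latency(lambda: calls.append(1), warmup=2, runs=5)
```

When timing a GPU model, kernel launches are asynchronous, so the device would need to be synchronized before each clock read (in PyTorch, for instance, with `torch.cuda.synchronize()`); the sketch omits this to stay dependency-free.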
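Handling filter shapes after layer removal. If the pruned layer $l$, with weight $W^{(l)} \in \mathbb{R}^{n_{l-1} \times n_l \times k_l \times k_l}$, has $n_{l-1} \neq n_l$, we replace layer $l+1$’s weight matrix, changing its shape from $\mathbb{R}^{n_l \times n_{l+1} \times k_{l+1} \times k_{l+1}}$ to $\mathbb{R}^{n_{l-1} \times n_{l+1} \times k_{l+1} \times k_{l+1}}$, with random initialization. All other layers are initialized from the pre-trained dense model.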
4.2 Results on CIFAR
4.2.1 Random filters vs. Random layers
As an initial verification of our hypothesis, we generate random filter-pruned and layer-pruned models and train them to compare their accuracy and latency reduction. Random model generation follows the same setup as explained in Section 1. Each model is trained with SGD for 164 epochs with a learning rate of 0.1 that decays by a factor of 0.1 at epochs 81, 121, and 151. Figure 4 shows the latency-accuracy plot for both random pruning methods. Layer-pruned models outperform filter-pruned ones in accuracy by 7.09% on average and can achieve up to 60% latency reduction. In addition, within the same latency budget, filter pruning shows higher variance in accuracy than layer pruning. This suggests that latency-constrained optimization with filter pruning is complex and requires careful selection of the per-layer pruning ratios. Layer pruning, on the other hand, generally shows small accuracy variation within a budget.
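The step learning-rate schedule used for training the random models can be written as a small function. This sketch assumes 0-indexed epochs and that each decay takes effect at the listed epoch; the function name and parameter names are hypothetical. In PyTorch, `torch.optim.lr_scheduler.MultiStepLR` with `milestones=[81, 121, 151]` and `gamma=0.1` expresses the same kind of schedule.

```python
def learning_rate(epoch, base_lr=0.1, milestones=(81, 121, 151), gamma=0.1):
    """Step schedule: multiply base_lr by gamma once per milestone reached."""
    passed = sum(epoch >= m for m in milestones)
    return base_lr * gamma ** passed


# Learning rate at a few epochs of a 164-epoch run.
schedule = {epoch: learning_rate(epoch) for epoch in (0, 80, 81, 121, 151, 163)}
```

The rate stays at 0.1 through epoch 80, then drops by a factor of ten at each of the three milestones, ending at roughly 1e-4 for the last epochs.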
Results on CIFAR-100 are presented in Table 1. The table is divided according to the filter criterion categorization in Section 3.2. First, we compare with Chen et al. [3] at a similar latency reduction, as both [3] and LayerPrune perform layer pruning. Although [3] requires training for layer ranking, LayerPrune outperforms it by 1.11%. We achieve up to 56% latency reduction with a 1.52% accuracy increase over the baseline. As VGG19-BN is over-parameterized for CIFAR-100, removing layers acts as regularization and can find models with better accuracy than the baseline. Filter pruning methods, in contrast, are bounded by small accuracy variations around the baseline. It is worth mentioning that the latency reduction obtained by removing a similar number of filters with different filter criteria varies from -0.06% to 40.0%, while for layer pruning methods with the same number of pruned layers it ranges from 34.3% to 41% regardless of the criterion. This suggests that latency reduction using filter pruning is sensitive to the environment setup and requires complex optimization to respect a latency budget.
Each bar represents the approximated classification accuracy up to this layer (rounded for visualization). We see a drop in accuracy followed by an increasing trend from conv10 to conv15. This is likely because the number of features is the same from conv10 to conv12. We start to observe an accuracy increase only at conv13, which follows a max pooling layer and has twice as many features. This highlights the importance of downsampling and of doubling the number of features at this point in the model. Layer pruning thus not only improves inference speed but can also discover a better-regularized shallow model, especially on small datasets. It is also worth mentioning that the proxy classifier from the last layer, conv16, and the actual model classifier, GT, have the same accuracy, showing that the proxy classifier is a plausible approximation of the converged classifier.
|Method||Layer pruning||Accuracy (%)||LR (%)||LR (%)||LR (%)||LR (%)|
|Chen et al. [3]||✓||73.25||56.01||52.86||58.06||49.86|
|Weight norm [8]||✗||73.01||-2.044||-0.873||-4.256||-0.06|
We also compare on the more complex ResNet56 architecture on CIFAR-10 and CIFAR-100 in Table 2. At a similar latency reduction, LayerPrune outperforms [3] by 0.54% and 1.23% on CIFAR-10 and CIFAR-100, respectively. Moreover, within each filter criterion, LayerPrune outperforms filter pruning and is on par with the baseline in accuracy. In addition, filter pruning can result in a latency increase (i.e., negative LR) with specific hardware targets and batch sizes [38], as shown with batch size 8, whereas LayerPrune consistently shows latency reduction under different environment setups. We also compare with a larger batch size to further encourage filter-pruned models to better utilize the resources. Still, we found that LayerPrune achieves overall better latency reduction with a large batch size. The latency reduction variance (LR var) between different batch sizes within the same hardware platform is shown as well. Consistent with the previous results on VGG, LayerPrune is less sensitive to changes in criterion, batch size, and hardware than filter pruning. We also show results with up to 2.5x latency reduction and less than 2% accuracy drop.
|Method||Layer pruning||Accuracy (%)||LR (%)||LR (%)||LR (%)||LR (%)|
|CIFAR-10 ResNet56 baseline (93.55%)|
|Chen et al. [3]||✓||93.09||26.60||26.31||26.96||25.66|
|Taylor weight [30]||✗||93.15||0.31||5.28||-0.11||2.67|
|Weight norm [8]||✗||92.95||-0.90||5.22||1.49||3.87|
|L1 norm [23]||✗||93.30||-1.09||-0.48||2.31||1.64|
|Feature maps [31]||✗||92.7||-0.79||6.17||1.09||8.38|
|Batch Normalization [25]||✗||93.00||0.6||3.85||2.26||1.42|
|CIFAR-100 ResNet56 baseline (71.2%)|
|Chen et al. [3]||✓||69.77||38.30||34.31||38.53||39.38|
|Taylor weight [30]||✗||71.03||2.13||5.23||-1.1||3.75|
|Weight norm [8]||✗||71.00||2.52||6.46||-0.3||3.86|
|L1 norm [23]||✗||70.65||-1.04||4.06||0.58||1.34|
|Feature maps [31]||✗||70.00||1.22||9.49||-1.27||7.94|
|Batch Normalization [25]||✗||70.71||0.37||2.26||-1.02||2.89|
4.3 Results on ImageNet
We evaluate the methods on the challenging ImageNet dataset for classification. For all experiments in this section, PyTorch pre-trained models are used as the starting point for network pruning. We follow the same setup as in [30], where we prune 100 filters every 30 minibatches for 10 pruning iterations (an ablation study over different hyperparameters is presented in the supplementary material). The pruned model is then fine-tuned with learning rate 1e using the SGD optimizer and a batch size of 256. Results on ResNet50 are presented in Table 3. In general, LayerPrune methods improve accuracy over the baseline and over their filter pruning counterparts. Although the feature maps criterion [31] achieves 0.92% better accuracy than LayerPrune, LayerPrune has a higher latency reduction, exceeding it by 5.7%. It is worth mentioning that the latency-aware optimization ECC has an upper-bound latency reduction of 11.56% on 1080Ti, with accuracy 16.3%. This stems from the fact that iterative filter pruning is bounded by the network’s depth and by structural dependencies within the network, so not all layers are considered for pruning, such as the gates at residual blocks. In addition, ECC builds a layer-wise bilinear model to approximate the latency of a model given the number of input channels and output filters per layer. This simplifies the non-linear relationship between the number of filters per layer and latency. We show the latency reduction on Xavier for an ECC-pruned model optimized for 1080Ti; this pruned model results in a latency increase at batch size 1 and the lowest latency reduction at batch size 64. This suggests that a hardware-aware filter-pruned model for one hardware architecture might perform worse on another hardware platform than even a hardware-agnostic filter pruning method. It is worth noting that the filter pruning method HRank [24], with 2.6x FLOPs reduction, shows a large accuracy degradation compared to LayerPrune (71.98 vs. 74.31). Even with aggressive filter pruning, the speedup is noticeable with a large batch size but small with a small batch size. Among shallower models, LayerPrune outperforms SSS at the same latency budget, even though SSS supports block pruning for ResNet50, which shows the effectiveness of accuracy approximation as a layer importance measure.
|Method||Layer pruning||Accuracy (%)||LR (%)||LR (%)||LR (%)||LR (%)|
|ResNet50 baseline (76.14%)|
|Weight norm [8]||✗||76.50||6.79||3.46||6.57||8.06|
|Feature maps [31]||✗||75.92||10.86||3.86||20.25||8.74|
|Channel pruning* [12]||✗||72.26||3.54||6.13||2.70||7.42|
4.4 Layer pruning comparison
In this section, we analyze different criteria for layer pruning under the same latency budget, as presented in Table 4. Our imprinting method consistently outperforms the other methods, especially at higher latency reduction rates. Imprinting achieves 30% latency reduction with only 0.36% accuracy loss from the baseline. The ensemble method, although it achieves better accuracy than the average of the individual criteria, is still sensitive to the errors of individual criteria. We further compare imprinting-based layer pruning with smaller ResNet variants at similar latency budgets. We outperform ResNet34 by 1.44% (LR=39%) and ResNet18 by 0.56% (LR=65%) in accuracy, showing the effectiveness of incorporating accuracy in block importance. A detailed numerical evaluation can be found in the supplementary material.
|1 block (LR 15%)||2 blocks (LR 20%)||3 blocks (LR 25%)||4 blocks (LR 30%)|
We presented the LayerPrune framework, which includes a set of layer pruning methods, and showed its benefits for latency reduction compared to filter pruning. The key findings of this paper are the following:
- For a given filter criterion, training a LayerPrune model based on this criterion achieves the same, if not better, accuracy as the filter-pruned model obtained using the same criterion.
- Filter pruning reduces the number of convolution operations per layer, so its latency reduction depends on the hardware architecture, while LayerPrune removes the whole layer. As a result, filter-pruned models might produce non-optimal matrix shapes for the compute kernels, which can even lead to a latency increase on some hardware targets and batch sizes.
- Filter-pruned models within a latency budget have a larger variance in accuracy than LayerPrune models. This stems from the fact that the relation between latency and the number of filters is non-linear, and optimization constrained by a resource budget requires complex selection of per-layer pruning ratios.
- We also showed the importance of incorporating accuracy approximation in layer ranking by imprinting.
[1] Automatic differentiation in machine learning: a survey. The Journal of Machine Learning Research (1), pp. 5595–5637.
[2] (2018) Benchmark analysis of representative deep neural network architectures. IEEE Access, pp. 64270–64277.
[3] (2018) Shallowing deep networks: layer-wise pruning based on feature representations. IEEE transactions on pattern analysis and machine intelligence (12), pp. 3048–3056.
[4] (2018) Layer-compensated pruning for resource-constrained convolutional neural networks. NeurIPS.
[5] (2019) Lightweight monocular depth estimation model by joint end-to-end filter pruning. In 2019 IEEE International Conference on Image Processing (ICIP), pp. 4290–4294.
[6] (2018) The lottery ticket hypothesis: finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635.
[7] (2015) Compressing deep neural networks with pruning, trained quantization and Huffman coding. ICLR 2017.
[8] (2015) Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, pp. 1135–1143.
[9] (2014) Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, pp. 346–361.
[10] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE CVPR, pp. 770–778.
[11] (2018) Amc: automl for model compression and acceleration on mobile devices. In Proceedings of the ECCV, pp. 784–800.
[12] (2017) Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE ICCV, pp. 1389–1397.
[13] (2017) Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
[14] (2017) Multi-scale dense networks for resource efficient image classification. arXiv preprint arXiv:1703.09844.
[15] Densely connected convolutional networks. Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708.
[16] (2018) Data-driven sparse structure selection for deep neural networks. In Proceedings of the European conference on computer vision (ECCV), pp. 304–320.
[17] (2017) Quantized neural networks: training neural networks with low precision weights and activations. The Journal of Machine Learning Research 18 (1), pp. 6869–6898.
[18] (2019) Knowledge distillation via route constrained optimization. In Proceedings of the IEEE ICCV, pp. 1345–1354.
[19] (2009) Learning multiple layers of features from tiny images.
[20] (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105.
[21] Abandoning the dark arts: scientific approaches to efficient deep learning. The 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing, Conference on Neural Information Processing Systems.
[22] (2019) Structured pruning of neural networks with budget-aware regularization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9108–9116.
[23] (2017) Pruning filters for efficient convnets. ICLR.
[24] (2020) HRank: filter pruning using high-rank feature map. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1529–1538.
[25] (2017) Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE ICCV, pp. 2736–2744.
[26] (2017) Learning sparse neural networks through L0 regularization. arXiv preprint arXiv:1712.01312.
[27] (2017) Thinet: a filter level pruning method for deep neural network compression. In Proceedings of the IEEE ICCV, pp. 5058–5066.
[28] (2019) AMP: adaptive masked proxies for few-shot segmentation. In Proceedings of the IEEE ICCV.
[29] (2018) Shufflenet v2: practical guidelines for efficient cnn architecture design. In Proceedings of the European conference on computer vision (ECCV), pp. 116–131.
[30] (2019) Importance estimation for neural network pruning. In Proceedings of the IEEE CVPR, pp. 11264–11272.
[31] Pruning convolutional neural networks for resource efficient transfer learning. arXiv preprint arXiv:1611.06440.
[32] (2015) CLTune: a generic auto-tuner for opencl kernels. In 2015 IEEE 9th International Symposium on Embedded Multicore/Many-core Systems-on-Chip, pp. 195–202.
[33] (2018) Low-shot learning with imprinted weights. In Proceedings of the IEEE CVPR, pp. 5822–5830.
[34] (2018) Yolov3: an incremental improvement. arXiv preprint arXiv:1804.02767.
[35] (2019) Laconic deep learning inference acceleration. In Proceedings of the 46th International Symposium on Computer Architecture, pp. 304–317.
[36] (2015) Very deep convolutional networks for large-scale image recognition. ICLR.
[37] (2015) Data-free parameter pruning for deep neural networks. arXiv preprint arXiv:1507.06149.
[38] (2017) Efficient processing of deep neural networks: a tutorial and survey. Proceedings of the IEEE 105 (12), pp. 2295–2329.
[39] (2019) Kernel tuner: a search-optimizing gpu code auto-tuner. Future Generation Computer Systems, pp. 347–358.
[40] (2019) Haq: hardware-aware automated quantization with mixed precision. In Proceedings of the IEEE CVPR, pp. 8612–8620.
[41] (2018) Pelee: a real-time object detection system on mobile devices. In Advances in Neural Information Processing Systems, pp. 1963–1972.
[42] (2016) Learning structured sparsity in deep neural networks. In Advances in neural information processing systems, pp. 2074–2082.
[43] (2019) Wider or deeper: revisiting the resnet model for visual recognition. Pattern Recognition, pp. 119–133.
[44] (2019) Snapshot distillation: teacher-student optimization in one generation. In Proceedings of the IEEE CVPR, pp. 2859–2868.
[45] (2019) Ecc: platform-independent energy-constrained deep neural network compression via a bilinear regression model. In Proceedings of the IEEE CVPR, pp. 11206–11215.
[46] (2017) Designing energy-efficient convolutional neural networks using energy-aware pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5687–5695.
[47] (2018) Netadapt: platform-aware neural network adaptation for mobile applications. In Proceedings of the ECCV, pp. 285–300.
[48] (2019) Variational convolutional neural network pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2780–2789.