Convolutional neural networks (CNNs) are a dominant class of models in deep learning (DL), delivering outstanding performance on many computer vision tasks such as image classification and object detection. However, CNN-based DL applications are typically too computationally intensive to be deployed on resource-limited platforms such as smartphones, where the user experience is very sensitive to latency.
At present, almost all CNN structures are designed without considering backend platforms. It is hard for a single network to run across all platforms in an optimal way due to the different hardware architecture characteristics. For example, the fastest CNN model on a desktop GPU may not be the fastest one on an embedded domain-specific accelerator with the same accuracy. However, manually crafting CNN structures for a given target platform requires deep knowledge about details of the backend platform, including the toolchains, configuration, and hardware architecture, which are usually unavailable.
Moreover, a large number of works have focused on DL model optimization, such as weight sparsification and pruning techniques. In spite of promising optimization results, they are not platform-aware and cannot generate optimal models across all platforms, from servers and desktops to resource-limited devices.
In this work, we propose a platform-aware solution, called CompactNet, to address the aforementioned issues. It automatically generates optimal platform-specific CNN models with a certain speedup target guaranteed. CompactNet (see Figure 1) is driven by a platform simulator that collects data from real backend hardware and simulates the latency of a model trimmed by removing certain redundant filters in certain convolutional layers. Guided by the simulated latency, the searching loop can generate the best-trimmed model satisfying the target speedup with the highest accuracy. Since the latency can be simulated on any platform that supports common DL algorithms, this solution supports any platform without detailed knowledge of the platform itself.
The main contributions of this paper are:
- A general and adaptable platform simulator that guides the whole approach. This simulator can collect data from any platform that supports common DL algorithms and precisely simulate the latency on it.
- An automatic optimizing approach with a certain speedup target guaranteed. We trim entire filters (output channels) instead of individual weights of each layer until the speedup target is satisfied. This makes our approach more direct and purposeful.
- A platform-aware optimizing approach that generates platform-specific optimal CNN models without knowing the backend hardware architecture details. The experiments show that optimal models generated by our approach for different platforms with the same speedup have different structures. An optimal model for one platform cannot achieve the same speedup on another platform as the model optimized specifically for that platform. The intrinsic reason likely lies deep in the hardware architecture characteristics, and our work can be considered a black box for interacting with them.
2 Related Work
In recent years, a large number of works [Cheng et al.2018] aiming to optimize CNN models have achieved great success. Most of the works can be divided into two main categories.
First, many well-known works adopt pruning techniques [Han et al.2015b] [Molchanov et al.2016]. These approaches focus on removing redundant weights to sparsify the filters in the model. They can be further divided into weight-level [Guo et al.2016] [Han et al.2015a], vector-level [Mao et al.2017], kernel-level [Anwar et al.2017] and group-level [Lebedev and Lempitsky2016] [Wen et al.2016] pruning. Unfortunately, not all platforms can fully take advantage of such sparse data structures [Yu et al.2017], and therefore there is no guarantee of reducing the latency. Other works [He et al.2017] [Luo et al.2017], in contrast, consider removing entire filters, which shows a more conspicuous speedup. The main issue of these approaches is that they are not automatic or platform-aware: the number of removed filters needs to be set manually, since different backend platforms may have different optimal options. ADC [He and Han2018] proposes using reinforcement learning, and MorphNet [Gordon et al.2018] leverages sparsifying regularizers to decide the compression rates. NetAdapt [Yang et al.2018] uses direct metrics as guides for adapting DL models to mobile devices given a specific resource budget. Our CompactNet addresses the same issue in a different way, by removing certain redundant filters according to the simulated latency data of the backend platforms in order to satisfy the target speedup.
Another way to optimize CNN models is to focus on the network structure. MobileNets [Howard et al.2017] [Sandler et al.2018], SqueezeNet [Iandola et al.2016] and ShuffleNet [Zhang et al.2018] are typical examples of this kind. They are all general designs that build more efficient CNN models by removing the FC layer, using multiple group convolutions or proposing depth-wise convolution. Such works have undoubtedly achieved great success in saving resources and reducing latency. However, they are not designed for specific platforms, and our experiments show that, when deployed on different backend platforms, they still leave significant room for kernel computation speedup via our CompactNet.
Besides, other approaches based on low-rank approximation [Jaderberg et al.2014] [Kim et al.2015] use matrix decomposition to reduce the number of operations. The motivation behind such decomposition is to find an approximate matrix that substitutes for the original weights. Others, like [Courbariaux and Bengio2016] [Gong et al.2014], focus on the data type and significantly reduce the latency by quantization. All those works are stand-alone optimizations and can be considered complements to our CompactNet.
Different from the solutions in the literature, we propose a platform-aware optimization, called CompactNet, that automatically trims a pre-trained CNN model given a certain target speedup while maintaining the accuracy. CompactNet is driven by a platform simulator based on the actual target platform. Guided by the simulated latency data, CompactNet generates platform-specific optimal models by progressively removing redundant filters without the requirement of expertise of the platform itself and guarantees the target speedup.
3.1 General and Adaptable Platform Simulator
The platform simulator is the foundation of our work. Because it is guided by latency data that the simulator derives from real hardware platforms, CompactNet is platform-aware and is able to trim a CNN model purposefully given a certain speedup target.
Different from traditional hardware simulators, the proposed one does not focus on the architecture or mechanism of the platform. Instead, it directly simulates the kernel computation latency of a CNN model on the real backend platform. The implementation of the simulator is not complicated as long as the platform supports common DL algorithms. Taking MobileNetV2 [Sandler et al.2018] (see Table 1) as an example, we implement each of the 18 layers, including the first Conv2D layer and each of the following bottleneck layers, on the target backend platform. Then we repeatedly vary the number of input channels and filters (output channels) of each layer. Thus we can collect the latency data of each layer with any number of input and output channels. A simple example of the data collected by the simulator is shown in Figure 2.
Table 1 (columns: Layers, Input Channels, Output Channels)
Table 2 (columns: Layers, Input Channels, Output Channels)
With the collected data, we can simulate the kernel computation latency of a trimmed model (after removing some filters across layers, as in Table 2) by simply summing up the latency of each layer with the trimmed number of input and output channels (see Figure 2). According to our experiments (see Section 4.3), such simulated latency closely approximates the real latency of the model. Then, in the searching loop, we can decide whether the trimmed model satisfies the target speedup. The details of the searching loop are introduced in the next part.
Our simulator is general and adaptable. It can collect latency data of any kind of CNN model on any platform by simply implementing each layer of the model on the target platform and executing with any number of input and output channels.
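The lookup-and-sum simulation described above can be sketched as follows. This is a minimal illustration rather than the released implementation: the table layout, the function names `simulate_latency` and `simulated_speedup`, and the use of milliseconds are all assumptions.

```python
from typing import Dict, List, Tuple

# Hypothetical layout of the collected data:
# (layer index, input channels, output channels) -> measured latency in ms.
LatencyTable = Dict[Tuple[int, int, int], float]


def simulate_latency(table: LatencyTable, channels: List[Tuple[int, int]]) -> float:
    """Sum the measured per-layer latencies for the given (input, output) channel counts."""
    return sum(table[(i, c_in, c_out)] for i, (c_in, c_out) in enumerate(channels))


def simulated_speedup(table: LatencyTable,
                      original: List[Tuple[int, int]],
                      trimmed: List[Tuple[int, int]]) -> float:
    """Speedup of the trimmed model over the original one, from simulated latency only."""
    return simulate_latency(table, original) / simulate_latency(table, trimmed)
```

Because the table is collected once per platform, evaluating a trimmed configuration under this sketch costs only a few dictionary lookups and never touches the device.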
3.2 Automatic and Platform-Aware Searching Algorithm
In this part, we introduce the searching loop that generates the optimal trimmed model given a target speedup and its relationship with the platform simulator. The whole process is shown in Figure 3-left.
To trim the pre-trained model and satisfy the target speedup progressively, we first break the optimization into N iterations, where N is one of the key hyperparameters in this work. In each iteration, we therefore have a sub-target of speedup. Here we use another important hyperparameter, the decay of the sub-target of speedup, which is similar to the decay of the learning rate in DL model training. It enables us to trim the model more carefully during the search to avoid removing imperative filters. The sub-target of speedup for each iteration is computed by Eq. 1 from the initial target of speedup, i.e., the target without considering the decay. The final target speedup can then be computed by Eq. 2. In practice, we keep the final target speedup in mind and inversely compute the initial target first. For each iteration, guided by the simulated latency, we trim the model to satisfy the sub-target of that iteration, and the trimmed model is used in the next iteration. When we finish the last iteration and all the sub-targets are satisfied, we get a new model that meets the final speedup target. Then we retrain it and generate the final optimal model.
With the platform simulator, the whole searching loop does not need to be deployed on or interact with the target platform itself. All the required latency data can be simulated beforehand. Each iteration in the loop takes the simulated latency data and trims the model to satisfy the sub-target of that iteration. Since different platforms have different latency data, the final optimal model is platform-specific. The trimming step is clearly the important part of each iteration; we discuss its details in the next part.
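The overall loop can be sketched as below. The geometric schedule in `sub_targets` and the product in `final_speedup` are assumptions chosen for illustration and are not necessarily the forms of Eq. 1 and Eq. 2. The sketch also assumes that each sub-target is relative to the model entering the iteration, and `trim_step` and `retrain` are hypothetical callables standing in for the trimming step and the long-term retraining.

```python
from typing import Callable, List, TypeVar

Model = TypeVar("Model")


def sub_targets(initial: float, decay: float, n_iters: int) -> List[float]:
    """Assumed schedule: the per-iteration speedup factor shrinks geometrically toward 1."""
    return [1.0 + (initial - 1.0) * decay ** i for i in range(n_iters)]


def final_speedup(targets: List[float]) -> float:
    """Under the assumption above, the overall speedup is the product of the sub-targets."""
    total = 1.0
    for target in targets:
        total *= target
    return total


def search(model: Model,
           targets: List[float],
           trim_step: Callable[[Model, float], Model],
           retrain: Callable[[Model], Model]) -> Model:
    """Trim progressively, one sub-target per iteration, then retrain once at the end."""
    for target in targets:
        model = trim_step(model, target)  # uses simulated latency only
    return retrain(model)
```

With this assumed schedule, an initial target of 1.5 and a decay of 0.5 over three iterations give the sub-targets 1.5, 1.25 and 1.125, so the later iterations trim less aggressively than the first one.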
3.3 Trimming Approach
In this part, we introduce the core step in the searching loop to trim the model progressively according to the simulated latency data. The inside process of this trimming step is shown in Figure 3-right.
During each iteration in the searching loop, we generate one candidate model per trimmable layer in the CNN model and select the one with the highest accuracy as the update for the next iteration. To generate the candidate models, the trimming approach goes through each of the trimmable layers in the model. For each layer, guided by the simulated latency data, we reduce the number of filters (output channels) one by one until the latency of the trimmed model satisfies the sub-target speedup of this iteration (given by Eq. 1). Note that when the number of filters in one layer is modified, the number of input channels of the next layer must be modified accordingly.
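A minimal sketch of trimming a single layer is given below. The names `trim_layer` and `toy_latency` are hypothetical. In particular, `toy_latency` is a made-up cost model that stands in for the simulator, used only to show that shrinking one layer also shrinks the input channels of the next layer.

```python
from typing import Callable, List, Optional


def toy_latency(out_channels: List[int], first_in: int = 3) -> float:
    """Made-up cost model: each layer costs (input channels * output channels).

    The input channels of layer i are the output channels of layer i - 1,
    so trimming one layer also reduces the cost of the next layer.
    """
    in_channels = [first_in] + out_channels[:-1]
    return float(sum(c_in * c_out for c_in, c_out in zip(in_channels, out_channels)))


def trim_layer(out_channels: List[int],
               layer: int,
               latency_of: Callable[[List[int]], float],
               target_latency: float) -> Optional[List[int]]:
    """Remove filters from one layer, one by one, until the latency target is met."""
    trimmed = list(out_channels)
    while latency_of(trimmed) > target_latency:
        if trimmed[layer] <= 1:
            return None  # this layer alone cannot reach the sub-target
        trimmed[layer] -= 1
    return trimmed
```

Calling `trim_layer` once per trimmable layer yields the candidate configurations of one iteration; in this sketch, layers that cannot reach the sub-target on their own simply produce no candidate.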
After we know how many filters should remain in the current layer to satisfy the sub-target, we need to decide which filters to preserve. Several studies, such as [Yang et al.2017], discuss the influence of individual filters in a model and can be used to choose filters here. We simply use a magnitude-based method: the filters with the largest L2-norm magnitude are kept.
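The magnitude-based choice can be sketched in a few lines. For simplicity, this illustration assumes that each filter is given as a flat sequence of its weights; the function name `keep_largest_l2` is hypothetical.

```python
import math
from typing import List, Sequence


def keep_largest_l2(filters: Sequence[Sequence[float]], n_keep: int) -> List[int]:
    """Return the indices (in ascending order) of the n_keep filters with the largest L2 norm."""
    norms = [math.sqrt(sum(w * w for w in f)) for f in filters]
    ranked = sorted(range(len(filters)), key=lambda i: norms[i], reverse=True)
    return sorted(ranked[:n_keep])
```

Returning the indices in ascending order keeps the surviving filters in their original order, which makes it straightforward to slice the matching input channels of the next layer.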
Then we fine-tune the model with the trimmed layer to restore the accuracy and keep the fine-tuned model as a candidate before trying to trim the next layer. In this way, only one layer is trimmed in each candidate model, and in total we generate one candidate per trimmable layer, each satisfying the sub-target of the iteration.
Finally, we select the model with the highest accuracy from the candidates. Therefore, in one iteration, we essentially trim only one layer, the one with the minimum accuracy loss, to satisfy the sub-target.
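One iteration of candidate generation and selection can be sketched as follows. The callables `trim`, `fine_tune` and `evaluate` are hypothetical placeholders for the per-layer trimming, the short-term fine-tuning and the validation accuracy measurement; the sketch only fixes the control flow.

```python
from typing import Callable, Optional, TypeVar

Model = TypeVar("Model")


def run_iteration(model: Model,
                  n_layers: int,
                  trim: Callable[[Model, int], Optional[Model]],
                  fine_tune: Callable[[Model], Model],
                  evaluate: Callable[[Model], float]) -> Model:
    """Build one candidate per trimmable layer and keep the most accurate one."""
    best = None
    best_accuracy = float("-inf")
    for layer in range(n_layers):
        candidate = trim(model, layer)  # trims only this layer
        if candidate is None:           # layer could not meet the sub-target
            continue
        candidate = fine_tune(candidate)
        accuracy = evaluate(candidate)
        if accuracy > best_accuracy:
            best, best_accuracy = candidate, accuracy
    if best is None:
        raise ValueError("no layer can satisfy the sub-target on its own")
    return best
```

Each candidate starts from the same input model, so the layers compete against each other within an iteration and only the winner is carried forward.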
The above is the trimming approach we use in CompactNet. We emphasize that it is not the only option. Other methods can be considered, not only for choosing which filters remain in each layer, as mentioned, but also for the whole trimming approach. For example, we can use other pruning approaches, such as reinforcement learning, to trim the model in a more fine-grained way instead of removing entire filters. From this point of view, our searching loop (see Figure 1 and Figure 3) is more like a general algorithm framework that makes various trimming approaches platform-aware and automatic.
We choose the state-of-the-art slim CNN model MobileNetV2 [Sandler et al.2018] as our target model for optimizing, and we agree with the viewpoint in [Jacob et al.2018] that traditional CNN models are designed redundantly so it is easy to achieve great optimization on them. So the more significant challenge should be optimizing the models which have already achieved a great tradeoff between speed and accuracy.
We experiment on a HUAWEI Mate10 smartphone and apply CompactNet to two processors: an ARM mobile CPU and a domain-specific accelerator called NPU with the Cambricon-1A ISA [Liu et al.2016]. Both processors are typical backend hardware platforms for CNN applications on mobile devices.
We use the Cifar-10 dataset [Krizhevsky and Hinton2009] with 50K images as training data and another 10K as validation data to pre-train an original MobileNetV2 model. Then driven by the simulated latency data of the two platforms in the smartphone, we implement the searching loop to optimize the pre-trained model for the two backend platforms. We also validate our platform simulator which is the foundation of our work to see whether it can simulate the latency precisely. The details of the experiments are discussed in this section. Note that the code of our work has been released and all the experiment results can be easily reproduced.
As described in Section 3.2, driven by the simulated latency data, the searching loop (see Figure 3) does not need to be deployed on or interact with the actual backend platform. We therefore implement the algorithm on a desktop with an Nvidia GeForce GTX 1080 Ti GPU and an Intel Xeon E5-2620v3 CPU, since the CNN training workloads run there. The searching loop that generates the optimal model has several hyperparameters to set. These hyperparameters can be divided into two sets.
4.1.1 The Searching Algorithm
As described in Section 3.2, the whole searching process is broken into N iterations, and the sub-target of speedup in Eq. 1 is determined by several hyperparameters jointly. The number of iterations N should not be too small, since that makes each sub-target so large that the trimming step may lose imperative features in the layers. The decay is just like that of the learning rate in DL training. In our experiments, these hyperparameters are set as in Table 3. Due to the much better performance of the NPU accelerator, its room for optimization should be smaller. As a result, we lower the target speedup (both the initial and the final target) and increase the number of iterations on the NPU platform to trim more carefully.
4.1.2 CNN Training
There are two training processes during the search. One is a short-term fine-tuning after trimming each layer, and the other is a long-term retraining after obtaining the optimal model that satisfies the target speedup. We use the same Cifar-10 dataset as for the pre-trained model and the standard RMSPropOptimizer built into TensorFlow. The training parameters for the two processes are shown in Table 4. We use a larger learning rate with faster decay in the short-term fine-tuning so that the loss converges quickly and the accuracy is restored after removing filters in a certain layer of the model.
4.2 Optimizing Results
With the configurations set as above, CompactNet can generate a platform-specific optimal model and achieve better speedup results than other optimizing approaches on both the mobile CPU and the NPU accelerator. At the same time, CompactNet maintains accuracy at such speedups and can even slightly improve it if the speedup target is less aggressive.
4.2.1 Mobile CPU
We set two different speedup targets; the results are shown in Table 5. CompactNet achieves higher accuracy than the original MobileNetV2 with a 1.5x speedup and maintains the accuracy with up to a 1.8x speedup. We also compare with state-of-the-art counterparts, including NetAdapt [Yang et al.2018], MorphNet [Gordon et al.2018] and ADC [He and Han2018], on the same Cifar-10 dataset, and CompactNet outperforms them in terms of both speedup and accuracy. The number of filters in each layer of the optimal models, compared with the original MobileNetV2 model, is shown in Figure 4.
Table 5: Results on the mobile CPU.
|Model||Speedup||Accuracy (%)|
|Original MobileNetV2 (100%)||1.0x||71.98|
|NetAdapt [Yang et al.2018]||1.3x||70.63|
|MorphNet [Gordon et al.2018]||1.2x||69.87|
|ADC [He and Han2018]||1.2x||69.51|
|CompactNet on mobile CPU||1.5x||72.12|
4.2.2 NPU Accelerator
Since other works are not specifically designed for the NPU platform, they cannot achieve the same speedup as on the mobile CPU platform. In contrast, CompactNet can still generate optimal models for the NPU. The results are shown in Table 6, and the number of filters per layer in the optimal models, compared with the original MobileNetV2, is shown in Figure 5.
Table 6: Results on the NPU accelerator.
|Model||Speedup||Accuracy (%)|
|Original MobileNetV2 (100%)||1.0x||71.98|
|NetAdapt [Yang et al.2018]||1.2x||70.63|
|MorphNet [Gordon et al.2018]||1.1x||69.87|
|ADC [He and Han2018]||1.1x||69.51|
|CompactNet on NPU||1.3x||72.56|
These results also give us some interesting insights into the CNN model itself. For example, the original MobileNetV2 has an architecture in which the number of filters increases from layer to layer. However, the majority of the filters trimmed by CompactNet are in the last several layers, which might mean that many filters in those layers are redundant for the classification task on Cifar-10.
4.2.3 Platform-Specific Models
As in Table 5 and Table 6, we achieve a 1.5x speedup on both the mobile CPU and the NPU accelerator platforms. Regarding the number of filters per layer (see Figure 6), the optimal models for the two platforms with the same speedup are different in some layers. Furthermore, if we exchange the models between them, neither the mobile CPU nor the NPU accelerator can satisfy the target speedup (see Table 7). It suggests that the optimal model generated by our CompactNet is platform-specific. The intrinsic reason for this should be deep in the hardware architecture characteristics and our work can be considered to provide a black-box to interact with them.
|Mobile CPU||Mobile CPU||1.5x||72.12|
4.3 Validation of Platform Simulator
We use the Cambricon NeuWare SDK with Google TensorFlow [Abadi et al.2016] to implement kernel computations in each layer of the original MobileNetV2 model on both mobile CPU and the NPU accelerator to simulate the latency data. Then we compare the latency of the original model with that simulated by accumulating the latency of each layer (method shown in Figure 2). The results are shown in Figure 7. For both backend platforms, no matter how many layers are executed, the difference between the real latency and the simulated one is sufficiently small.
In summary, we propose an optimizing approach, named CompactNet, to optimize a pre-trained CNN model specifically for different backend platforms given a specific target speedup. Based on a platform simulator, CompactNet can automatically trim a pre-trained model and generate a platform-specific optimal model that satisfies the target speedup. Driven by the simulated latency data of the backend platform, CompactNet progressively trims the pre-trained model by removing redundant filters across the layers while maintaining the accuracy through fine-tuning and retraining. We deploy CompactNet on a smartphone whose backend platforms are a mobile CPU and a domain-specific accelerator. Compared with the state-of-the-art slim CNN model, MobileNetV2, we achieve up to a 1.8x kernel computation speedup on different platforms with equal or even higher accuracy for the image classification task on the Cifar-10 dataset.
- [Abadi et al.2016] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, et al. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2016, Savannah, GA, USA, November 2-4, 2016., pages 265–283, 2016.
- [Anwar et al.2017] Sajid Anwar, Kyuyeon Hwang, and Wonyong Sung. Structured pruning of deep convolutional neural networks. JETC, 13(3):32:1–32:18, 2017.
- [Cheng et al.2018] Jian Cheng, Peisong Wang, Gang Li, Qinghao Hu, and Hanqing Lu. Recent advances in efficient computation of deep convolutional neural networks. Frontiers of IT & EE, 19(1):64–77, 2018.
- [Courbariaux and Bengio2016] Matthieu Courbariaux and Yoshua Bengio. Binarynet: Training deep neural networks with weights and activations constrained to +1 or -1. CoRR, abs/1602.02830, 2016.
- [Gong et al.2014] Yunchao Gong, Liu Liu, Ming Yang, and Lubomir D. Bourdev. Compressing deep convolutional networks using vector quantization. CoRR, abs/1412.6115, 2014.
- [Gordon et al.2018] Ariel Gordon, Elad Eban, Ofir Nachum, Bo Chen, Hao Wu, et al. Morphnet: Fast & simple resource-constrained structure learning of deep networks. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 1586–1595, 2018.
- [Guo et al.2016] Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient dnns. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 1379–1387, 2016.
- [Han et al.2015a] Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. CoRR, abs/1510.00149, 2015.
- [Han et al.2015b] Song Han, Jeff Pool, John Tran, and William J. Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 1135–1143, 2015.
- [He and Han2018] Yihui He and Song Han. ADC: automated deep compression and acceleration with reinforcement learning. CoRR, abs/1802.03494, 2018.
- [He et al.2017] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 1398–1406, 2017.
- [Howard et al.2017] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, et al. Mobilenets: Efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861, 2017.
- [Iandola et al.2016] Forrest N. Iandola, Matthew W. Moskewicz, Khalid Ashraf, Song Han, William J. Dally, and Kurt Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <1mb model size. CoRR, abs/1602.07360, 2016.
- [Jacob et al.2018] Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, et al. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 2704–2713, 2018.
- [Jaderberg et al.2014] Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. In British Machine Vision Conference, BMVC 2014, Nottingham, UK, September 1-5, 2014, 2014.
- [Kim et al.2015] Yong-Deok Kim, Eunhyeok Park, Sungjoo Yoo, Taelim Choi, Lu Yang, and Dongjun Shin. Compression of deep convolutional neural networks for fast and low power mobile applications. CoRR, abs/1511.06530, 2015.
- [Krizhevsky and Hinton2009] A Krizhevsky and G Hinton. Learning multiple layers of features from tiny images. Computer Science Department, University of Toronto, Tech. Rep, 1, 01 2009.
- [Lebedev and Lempitsky2016] Vadim Lebedev and Victor S. Lempitsky. Fast convnets using group-wise brain damage. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 2554–2564, 2016.
- [Liu et al.2016] Shaoli Liu, Zidong Du, Jinhua Tao, Dong Han, Tao Luo, Yuan Xie, Yunji Chen, and Tianshi Chen. Cambricon: An instruction set architecture for neural networks. In 43rd ACM/IEEE Annual International Symposium on Computer Architecture, ISCA 2016, Seoul, South Korea, June 18-22, 2016, pages 393–405, 2016.
- [Luo et al.2017] Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. Thinet: A filter level pruning method for deep neural network compression. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 5068–5076, 2017.
- [Mao et al.2017] Huizi Mao, Song Han, Jeff Pool, Wenshuo Li, Xingyu Liu, et al. Exploring the regularity of sparse structure in convolutional neural networks. CoRR, abs/1705.08922, 2017.
- [Molchanov et al.2016] Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks for resource efficient transfer learning. CoRR, abs/1611.06440, 2016.
- [Sandler et al.2018] Mark Sandler, Andrew G. Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 4510–4520, 2018.
- [Wen et al.2016] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 2074–2082, 2016.
- [Yang et al.2017] Tien-Ju Yang, Yu-Hsin Chen, and Vivienne Sze. Designing energy-efficient convolutional neural networks using energy-aware pruning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 6071–6079, 2017.
- [Yang et al.2018] Tien-Ju Yang, Andrew G. Howard, Bo Chen, Xiao Zhang, Alec Go, et al. Netadapt: Platform-aware neural network adaptation for mobile applications. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part X, pages 289–304, 2018.
- [Yu et al.2017] Jiecao Yu, Andrew Lukefahr, David J. Palframan, Ganesh S. Dasika, Reetuparna Das, et al. Scalpel: Customizing DNN pruning to the underlying hardware parallelism. In Proceedings of the 44th Annual International Symposium on Computer Architecture, ISCA 2017, Toronto, ON, Canada, June 24-28, 2017, pages 548–560, 2017.
- [Zhang et al.2018] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 6848–6856, 2018.