Empowering embedded systems to run well-known deep learning architectures, such as convolutional neural networks (CNNs), has been a hot topic in recent years. For smart Internet of Things applications, the challenge is that the whole system must be both energy-constrained and small in size. To meet this challenge, work on improving the efficiency of the whole computing process can be roughly divided into two directions: the first is to design lightweight networks with small MAdds howard2017mobilenets ; sandler2018mobilenetv2 ; zhang2018shufflenet ; ma2018shufflenet , which are friendly to low-power platforms; the second is to optimize hardware-side configurations, such as FPGA-based accelerators FarabetPHL09 ; ZhangLSGXC15 , or to make the whole computing process more efficient by improving the compiler and generating smarter instructions abdelfattah2018dla ; chen2018tvm ; xing2019dnnvm .
All of the works mentioned above have demonstrated great practical value in various applications. However, the real performance may not live up to the designer's expectations, due to the gap between the two optimization directions. Specifically, for elaborately tuned networks with small MAdds, the overall latency may still be high ma2018shufflenet , while carefully designed compilers or accelerators may struggle to process real networks.
In this work, we intend to close the existing gap by systematically analyzing the necessary properties of a lightweight network that is friendly to the embedded hardware and the corresponding compilers. More precisely, since the computation patterns of a chip in an embedded system are strictly limited, we propose that an embedded-system-friendly network should fit the targeted computation patterns and the ideal data layout. By fitting the ideal data layout, we can reduce the communication cost between on-chip and off-chip memory and thus fully exploit the computation throughput.
Inspired by the observation that the computation graph of a network is easier to optimize when the computational intensities of its operations are more balanced, we propose variable group convolution, which is based on depthwise separable convolution krizhevsky2012imagenet ; chollet2017xception ; xie2017aggregated . In variable group convolution, the number of input channels in each group is fixed and can be tuned as a hyperparameter, in contrast to standard group convolution, where the number of groups is fixed. The benefits are twofold: first, fixing the number of channels per group is more suitable for compiler-side optimization, due to the more coherent computation pattern and data layout; second, compared with depthwise convolution howard2017mobilenets ; sandler2018mobilenetv2 , which sets the group number equal to the channel number, variable group convolution has a larger network capacity sandler2018mobilenetv2 , allowing smaller channel numbers and thus relieving the time-consuming off-chip communication.
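The bookkeeping behind variable group convolution can be sketched in a few lines of plain Python (the channel numbers below are hypothetical): with the channels per group S fixed, the group count simply grows with the layer width, so every group is processed with the same pattern.

```python
def num_groups(in_channels: int, channels_per_group: int) -> int:
    """Variable group convolution fixes the channels per group (S)
    and lets the number of groups vary with the layer width."""
    assert in_channels % channels_per_group == 0
    return in_channels // channels_per_group

S = 8  # hypothetical channels per group
print(num_groups(32, S))   # 4 groups in a narrow layer
print(num_groups(256, S))  # 32 groups in a wide layer
```

Contrast this with standard group convolution, where the group count is the fixed hyperparameter and the channels per group change with the layer width.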
Another key component of our network better exploits the on-chip memory, building on the inverted residual block sandler2018mobilenetv2 . In MobileNetV2 sandler2018mobilenetv2 , however, the number of channels is adjusted by pointwise convolutions, whose computing pattern differs from that of the depthwise convolution in between, making the block hard to optimize under limited computation patterns. We therefore propose that an input feature with C channels is first expanded to 2C channels by a variable group convolution and then reduced back to C channels by a pointwise convolution. In this manner, the computational costs of the two types of layers are more balanced, making the block more hardware and compiler friendly. Our contributions can be summed up as follows:
We systematically analyze how to optimize the computation of CNNs on embedded systems from the perspective of both network architectures and hardware/compilers. We find that a gap exists between the two optimization directions: some elaborately designed architectures are hard to optimize due to the limited computation patterns of an embedded system.
Observing that a more unified computation pattern and data layout are friendlier to an embedded system, we propose the variable group convolution and a correspondingly improved network, named variable group network, or VarGNet for short.
Experiments on prevalent vision tasks, such as classification, detection, segmentation and face recognition, on the corresponding large-scale datasets verify the practical value of our proposed VarGNet.
1.1 Related works
Designing lightweight CNNs has been a hot topic in recent years. Representative manually designed networks include SqueezeNet 2016_SqueezeNet , Xception chollet2017xception , MobileNets howard2017mobilenets ; sandler2018mobilenetv2 , ShuffleNets zhang2018shufflenet ; ma2018shufflenet and IGC zhang2017interleaved ; xie2018interleaved ; sun2018igcv3 . Besides, neural architecture search (NAS) zoph2016neural ; pham2018efficient ; Real2018Regularized ; zoph2017learning ; liu2018darts is a promising direction for automatically designing lightweight CNNs. These methods can effectively speed up the recognition process. More recently, platform-aware NAS methods cai2018proxylessnas ; fbnet ; dai2018chamnet ; stamoulis2019single have been proposed to search for networks that are efficient on specific hardware platforms. Our network, VarGNet, is complementary to these platform-aware NAS methods, since the proposed variable group convolution is helpful for setting the search space in NAS methods.
Optimizations on CNN accelerators.
To accelerate neural networks, FPGA FarabetPHL09 ; ZhangLSGXC15 ; gupta2015deep ; ma2017optimizing and ASIC designs chen2014diannao ; reagen2016minerva ; jouppi2017datacenter ; luo2017dadiannao ; hegde2018ucnn have been widely studied. Generally speaking, Streaming Architectures (SAs) venieris2017fpgaconvnet ; xiao2017exploring and Single Computation Engines (SCEs) guo2016angel ; chang2017compiling ; abdelfattah2018dla are the two main kinds of FPGA-based accelerators venieris2018toolflows . The two directions differ in customization versus generality: SA designs favor customization over generality, while SCEs emphasize the trade-off between flexibility and customization. In this work, we aim to propose a network that can be optimized by existing accelerators more easily, thus improving the overall performance.
2 Designing Efficient Networks on Embedded Systems
For chips used in embedded systems, such as FPGAs or ASICs, a low unit price and a fast time to market are critical factors in designing the whole system. These constraints result in a relatively simple chip configuration; in other words, the supported computation schemes are strictly limited compared with general-purpose processing units. However, the operators in a state-of-the-art network are so diverse that some layers can be accelerated by the hardware design while others cannot. Thus, for designing efficient networks on embedded systems, the first intuition is that the layers in a network should be similar to each other in some sense.
Another important intuition is based on two properties of the convolutions used in CNNs. The first property is the computation pattern: in a convolution, several filters (kernels) slide over the whole feature map, so the kernel values are reused repeatedly while each value from the feature map is used only once. The second property concerns data size: the size of the convolutional kernels is typically much smaller than that of the feature maps in 2D convolutions. In light of these two properties, an effective solution is to load all kernel data first and then perform the convolution while streaming feature data in and out sequentially xing2019dnnvm . This practical solution is the second intuition behind our two guidelines for efficient network design on embedded systems:
The intermediate feature maps between blocks should be small.
The computational intensity of layers in a block should be balanced.
Next, we introduce the two guidelines in detail.
Small intermediate feature maps between blocks.
In state-of-the-art networks, a common practice is to first design a normal block and a down-sampling block, and then stack several such blocks to form a deep network. In these blocks, residual connections he2016deep are widely adopted. Hence, in recent compiler-side optimizations xing2019dnnvm , the layers in a block are usually grouped and computed together; off-chip and on-chip memory then communicate only when the computation of a block starts or ends. Therefore, smaller intermediate feature maps between blocks directly reduce the data transfer time.
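To make this guideline concrete, here is a rough traffic estimate in plain Python (hypothetical shapes; 8-bit quantized values assumed): the whole intermediate feature map crosses the on-chip/off-chip boundary at each block boundary, so shrinking it shrinks the transfer proportionally.

```python
def feature_bytes(h: int, w: int, c: int, bytes_per_value: int = 1) -> int:
    # Data moved between on-chip and off-chip memory at a block
    # boundary: the whole intermediate feature map (8-bit values here).
    return h * w * c * bytes_per_value

# Handing the next block a feature map after one more down-sampling
# step cuts the off-chip transfer by 4x.
print(feature_bytes(56, 56, 64))  # 200704 bytes
print(feature_bytes(28, 28, 64))  # 50176 bytes
```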
Balanced computational intensity inside a block.
As mentioned before, in practice the weights of several layers are loaded before the convolutions are performed. If the loaded layers diverge greatly in computational intensity, extra on-chip memory is needed to store intermediate slices of feature maps. In MobileNetV1 howard2017mobilenets , a depthwise convolution and a pointwise convolution are used. Different from previous definitions, in our implementation the weights are already loaded, so the computational intensity is computed as the MAdds divided by the size of the feature maps. Then, for a 3×3 depthwise convolution and a pointwise convolution on a feature map with 256 channels, the computational intensities are 9 and 256, respectively. As a result, when running the two layers together, we either have to increase the on-chip buffer to satisfy the pointwise convolution, or give up grouping the computation of the two layers.
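The two intensity numbers above can be reproduced from the definition used here (MAdds divided by feature map size); this is a sketch with illustrative shapes, not tied to any particular hardware.

```python
def depthwise_intensity(k: int) -> int:
    # depthwise conv: MAdds = H*W*C*k*k over a feature map of size
    # H*W*C, so the intensity reduces to k*k
    return k * k

def pointwise_intensity(c_in: int) -> int:
    # pointwise conv: MAdds = H*W*c_in*c_out over an output feature
    # map of size H*W*c_out, so the intensity reduces to c_in
    return c_in

print(depthwise_intensity(3))    # 9
print(pointwise_intensity(256))  # 256
```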
3 Variable Group Network
Based on the two guidelines above, we propose a novel network in this section. To balance the computational intensity, we fix the number of channels in each group throughout the network, resulting in a variable number of groups in each convolution layer. The motivation for fixing the channel number is easy to see from the MAdds of a group convolution over an h×w feature map with k×k kernels, c_in input channels, c_out output channels and g groups: MAdds = h · w · k² · (c_in / g) · c_out.
Thus, if the size of the feature map is constant, then by fixing the number of channels per group S = c_in / g, the computational intensity inside a block becomes more balanced. Further, the number of channels in a group can be set to match the configuration of the processing elements, which process a fixed number of channels at a time.
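A quick check of this claim, using the standard MAdds count for a group convolution (shapes are hypothetical): with the channels per group S fixed, the computational intensity of a variable group convolution is k²·S regardless of the layer width.

```python
def group_conv_madds(h, w, k, c_in, c_out, groups):
    # group convolution cost: each output value needs
    # k*k*(c_in/groups) multiply-adds
    return h * w * c_out * k * k * (c_in // groups)

def intensity(madds, h, w, c_out):
    # intensity as defined in the text: MAdds / feature map size
    return madds / (h * w * c_out)

S = 8  # fixed channels per group (hypothetical value)
for c in (32, 128, 512):
    madds = group_conv_madds(14, 14, 3, c, c, groups=c // S)
    print(intensity(madds, 14, 14, c))  # 72.0 at every width (k*k*S)
```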
Compared with depthwise convolution, variable group convolution increases the MAdds as well as the expressiveness sandler2018mobilenetv2 . Thus we can reduce the channel number of the intermediate feature maps while keeping the same generalization ability as previous networks. Specifically, we design novel network blocks as shown in Fig. 1. For the normal block used in the early stages of the network, the weights are still relatively small, so the weights of all four layers can be cached in on-chip memory. In the later stages, where the channel numbers and weight sizes grow, the normal block can still be optimized by loading only one variable group convolution and one pointwise convolution at a time. Similarly, the operations in the down-sampling block are also friendly to compiler-side and hardware-side optimizations. The whole computing process for a normal block is illustrated in Fig. 2. Based on the architecture of MobileNetV1 howard2017mobilenets , we substitute our blocks for its basic blocks; the detailed network architecture is shown in Tab. 1. Another ShuffleNet v2 based architecture is shown in Tab. 2.
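To see why such a block is balanced, here is a minimal sketch (hypothetical shapes; it assumes, for illustration, that the block expands C channels to 2C with a 3×3 variable group convolution and projects back to C with a pointwise convolution):

```python
def vargconv_madds(h, w, k, c_in, c_out, S):
    # variable group conv with groups = c_in / S
    return h * w * c_out * k * k * S

def pointwise_madds(h, w, c_in, c_out):
    return h * w * c_out * c_in

h, w, C, S = 14, 14, 128, 8  # hypothetical shapes
expand = vargconv_madds(h, w, 3, C, 2 * C, S)   # C -> 2C, 3x3
project = pointwise_madds(h, w, 2 * C, C)       # 2C -> C, 1x1
depthwise = h * w * (2 * C) * 9                 # depthwise baseline, 3x3
print(expand, project)     # same order of magnitude: balanced pair
print(depthwise, project)  # over an order of magnitude apart: unbalanced
```

The variable group convolution's cost sits close to the pointwise convolution's, whereas a depthwise layer paired with the same pointwise layer would be far cheaper, which is exactly the intensity mismatch discussed in Section 2.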
| Layer | Output Size | KSize | Stride | Repeat | Output Channels |
|---|---|---|---|---|---|
| Image | 224×224 | | | | 3 |
| Conv 1 | 112×112 | 3×3 | 2 | 1 | 8 / 16 / 24 / 32 / 40 / 48 / 56 |
| DownSample | 56×56 | | 2 | 3 | 16 / 32 / 48 / 64 / 80 / 96 / 112 |
| DownSample | 14×14 | | 2 | 1 | 64 / 128 / 192 / 256 / 320 / 384 / 448 |
| Stage Block | 14×14 | | 1 | 2 | 64 / 128 / 192 / 256 / 320 / 384 / 448 |
| DownSample | 7×7 | | 2 | 1 | 128 / 256 / 384 / 512 / 640 / 768 / 896 |
| Stage Block | 7×7 | | 1 | 1 | 128 / 256 / 384 / 512 / 640 / 768 / 896 |
| Conv 5 | 7×7 | 1×1 | 1 | 1 | 1024 / 1024 / 1024 / 1024 / 1280 / 1536 / 1792 |
| Global Pool | 1×1 | 7×7 | | | |
| Layer | Output Size | KSize | Stride | Repeat | Output Channels |
|---|---|---|---|---|---|
| Image | 224×224 | | | | 3 |
| Conv 1 | 112×112 | 3×3 | 2 | 1 | 8 / 16 / 24 / 32 / 40 / 48 / 56 / 64 |
| Head Block | 56×56 | | 2 | 1 | 8 / 16 / 24 / 32 / 40 / 48 / 56 / 64 |
| Stage 2 | 28×28 | | 2 | 1 | 16 / 32 / 48 / 64 / 80 / 96 / 112 / 128 |
| | 28×28 | | 1 | 2 | |
| Stage 3 | 14×14 | | 2 | 1 | 32 / 64 / 96 / 128 / 160 / 192 / 224 / 256 |
| | 14×14 | | 1 | 6 | |
| Stage 4 | 7×7 | | 2 | 1 | 64 / 128 / 192 / 256 / 320 / 384 / 448 / 512 |
| | 7×7 | | 1 | 3 | |
| Conv 5 | 7×7 | 1×1 | 1 | 1 | 1024 / 1024 / 1024 / 1024 / 1280 / 1536 / 1792 / 2048 |
| Global Pool | 1×1 | 7×7 | | | |
VarGNet v1 performance on ImageNet (S is the number of channels in a group); (c) comparison network: MobileNet v1.
VarGNet v2 performance on ImageNet; (c) comparison network: ShuffleNet v2.
4.1 ImageNet Classification
Training hyperparameters are: batch size 1024, crop ratio 0.875, learning rate 0.4, cosine learning rate schedule, weight decay 4e-5, and 240 training epochs. As shown in Tab. 3, VarGNet v1 performs better than MobileNet v1. From (c) in Tab. 4, we can see that when the model scale is small, VarGNet v2 performs worse than ShuffleNet v2, due to the smaller channel numbers in VarGNet v2; when the model size is large, our network performs better.
4.2 Object Detection
In Tab. 5, we present the performance of our proposed VarGNet along with comparison methods. We evaluate the object detection performance of our networks on the COCO dataset Lin2014MicrosoftCC and compare them with other state-of-the-art lightweight architectures. We choose FPN-based Faster R-CNN Lin2017FeaturePN as the framework, and all experiments are run under the same settings, with an input resolution of 800×1333 and 18 training epochs. Notably, we find that ShuffleNet v2 achieves better accuracy when trained longer, so we additionally train a ShuffleNet v2 model for 27 epochs. At test time, 1000 proposals per image are evaluated in the RPN stage. We train on the train+val set excluding 8,000 minival images and test on the minival set.
| Backbone | MAdds | mAP |
|---|---|---|
| MobileNet v1 1.0 | 24.15 | 31.1 |
| MobileNet v2 1.0 | 18.71 | 31.0 |
| ShuffleNet v1 1.0 | 15.31 | 27.9 |
| ShuffleNet v2 1.0 | 15.55 | 27.5 |
| ShuffleNet v2 1.0 (27 epochs) | 15.55 | 28.9 |
| VarGNet v1 1.0 | 24.91 | 33.7 |
| VarGNet v2 0.5 | 14.98 | 28.6 |
| VarGNet v2 1.0 | 19.61 | 33.3 |
4.3 Pixel Level Parsing
We use the standard Adam optimizer with weight decay 1e-5 and batch size 16. The learning rate is initialized to 1e-4 and follows a polynomial decay with power 0.9. Training runs for 100 epochs in total. For data augmentation, random horizontal flips are used and images are resized with a scale randomly chosen from 0.6 to 1.2. For multitask training, we define the loss function as a weighted sum of the per-task losses.
When the task is panoptic segmentation alone, we use one setting of the loss weights; after adding the depth task, the weights are rebalanced accordingly.
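A minimal sketch of the multitask objective (the loss values and weights below are placeholders for illustration, not the values used in our experiments):

```python
def multitask_loss(losses, weights):
    # Weighted sum of per-task losses; each task contributes one term.
    assert len(losses) == len(weights)
    return sum(w * l for w, l in zip(weights, losses))

# panoptic segmentation = semantic + instance terms; adding the
# depth task appends one more weighted term
print(multitask_loss([1.0, 0.5], [1.0, 2.0]))             # 2.0
print(multitask_loss([1.0, 0.5, 2.0], [1.0, 2.0, 0.25]))  # 2.5
```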
Parameters and MAdds of the comparison methods are presented in Tab. 6. Results and visual examples for segmentation and depth prediction are shown in Tab. 7 and Fig. 4, respectively. These tables demonstrate the superiority of the proposed VarGNet v1 and v2: both are efficient and perform on par with large networks.
(a) Semantic Segmentation (image size 2048×1024)
(c) Panoptic Segmentation (MAdds calculated with 2048×1024 input size.)
(d) Panoptic Segmentation + Depth (MAdds calculated with 2048×1024 input size.)
Fig. 4 columns: input image, VarGNet v1, VarGNet v2, ground truth (GT).
4.4 Depth Prediction and Stereo Matching
For single-image depth prediction and stereo tasks on the KITTI dataset geiger2013vision , we present the performance of our VarGNet based models. A U-Net style architecture is employed in the experiments. All depth models are trained on the KITTI RAW dataset: we train on about 23,488 images from 32 scenes and test on 697 images from the remaining 29 scenes, following the split of Eigen et al. eigen2014depth . All results are evaluated with depth ranging from 0m to 80m and from 0m to 50m, with the same evaluation metrics as previous works. All stereo models are trained on the KITTI RAW dataset and the train set of KITTI15, and tested on the test split of Eigen et al. eigen2014depth . The evaluation metrics for stereo are EPE and D1. During training, the standard SGD optimizer is used with momentum 0.9. The weight decay is 0.0001 for ResNet-18 and ResNet-50, and 0.00004 for the others. Models are trained for 300 epochs, with an initial learning rate of 0.001 decayed by 0.1 at epochs 120, 180 and 240. We train with 4 GPUs and a batch size of 24.
In Tab. 8 and Tab. 9, we show our depth and stereo results under various evaluation metrics, along with our implementations of MobileNet and ResNet for comparison. Visual results are presented in Fig. 5 and Fig. 6.
(a) On KITTI RAW
(b) On KITTI 15
4.5 Face Recognition
All networks are trained on the DeepGlint MS-Celeb-1M-v1c dataset dg , cleaned from MS-Celeb-1M guo2016ms , which contains 3,923,399 aligned face images from 86,876 identities. LFW huang2008labeled , CFP-FP sengupta2016frontal and AgeDB-30 moschoglou2017agedb are used as the validation datasets. Finally, all network models are evaluated on MegaFace Challenge 1 nech2017level . Tab. 10 lists the best face recognition accuracies on the validation datasets, as well as the face verification true accepted rates under a 1e-6 false accepted rate on the refined version of the MegaFace dataset deng2018arcface . We use MobileNet v1 and MobileNet v2 as baseline models. To adapt to the 112×112 input image size, the stride of the first convolutional layer is set to 1 for each baseline and VarGNet model. To achieve better performance, we further replace the pooling layer with a “BN-Dropout-FC-BN” structure as in InsightFace deng2018arcface , followed by the ArcFace loss deng2018arcface . The standard SGD optimizer is used with momentum 0.9, and the batch size is 512 on 8 GPUs. The learning rate begins at 0.1 and is divided by 10 at the 100K, 140K and 160K iterations. The weight decay is 5e-4. The embedding feature dimension is 256 with a 0.4 dropout rate. The normalization scale is 64 and the ArcFace margin is 0.5. All training is based on the InsightFace toolbox deng2018arcface .
Tab. 10 columns: Networks, MAdds, LFW huang2008labeled , CFP-FP sengupta2016frontal , AgeDB-30 moschoglou2017agedb , MegaFace deng2018arcface .
References
-  http://trillionpairs.deepglint.com/overview.
-  Mohamed S Abdelfattah, David Han, Andrew Bitar, Roberto DiCecco, Shane O’Connell, Nitika Shanker, Joseph Chu, Ian Prins, Joshua Fender, Andrew C Ling, et al. Dla: Compiler and fpga overlay for neural network inference acceleration. In International Conference on Field Programmable Logic and Applications, pages 411–4117. IEEE, 2018.
-  Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12):2481–2495, 2017.
-  Han Cai, Ligeng Zhu, and Song Han. ProxylessNAS: Direct neural architecture search on target task and hardware. In International Conference on Learning Representations (ICLR), 2019.
-  Andre Xian Ming Chang, Aliasger Zaidy, Vinayak Gokhale, and Eugenio Culurciello. Compiling deep learning models for custom hardware accelerators. arXiv preprint arXiv:1708.00117, 2017.
-  Tianqi Chen, Thierry Moreau, Ziheng Jiang, Haichen Shen, Eddie Q Yan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. Tvm: end-to-end optimization stack for deep learning. arXiv preprint arXiv:1802.04799, pages 1–15, 2018.
-  Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and Olivier Temam. Diannao: A small-footprint high-throughput accelerator for ubiquitous machine-learning. In ACM Sigplan Notices, volume 49, pages 269–284. ACM, 2014.
-  François Chollet. Xception: Deep learning with depthwise separable convolutions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1251–1258, 2017.
-  Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3213–3223, 2016.
-  Xiaoliang Dai, Peizhao Zhang, Bichen Wu, Hongxu Yin, Fei Sun, Yanghan Wang, Marat Dukhan, Yunqing Hu, Yiming Wu, Yangqing Jia, et al. Chamnet: Towards efficient network design through platform-aware model adaptation. arXiv preprint arXiv:1812.08934, 2018.
-  Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. arXiv preprint arXiv:1801.07698, 2018.
-  David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. In Advances in neural information processing systems, pages 2366–2374, 2014.
-  Clément Farabet, Cyril Poulet, Jefferson Y. Han, and Yann LeCun. CNP: an fpga-based processor for convolutional networks. In International Conference on Field Programmable Logic and Applications, pages 32–37, 2009.
-  Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.
-  Kaiyuan Guo, Lingzhi Sui, Jiantao Qiu, Song Yao, Song Han, Yu Wang, and Huazhong Yang. Angel-eye: A complete design flow for mapping cnn onto customized hardware. In IEEE Computer Society Annual Symposium on VLSI (ISVLSI), pages 24–29. IEEE, 2016.
-  Yandong Guo, Lei Zhang, Yuxiao Hu, Xiaodong He, and Jianfeng Gao. Ms-celeb-1m: A dataset and benchmark for large-scale face recognition. In European Conference on Computer Vision (ECCV), pages 87–102. Springer, 2016.
-  Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited numerical precision. In International Conference on Machine Learning (ICML), pages 1737–1746, 2015.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE conference on computer vision and pattern recognition (CVPR), pages 770–778, 2016.
-  Kartik Hegde, Jiyong Yu, Rohit Agrawal, Mengjia Yan, Michael Pellauer, and Christopher W Fletcher. Ucnn: Exploiting computational reuse in deep neural networks via weight repetition. In ACM/IEEE Annual International Symposium on Computer Architecture (ISCA), pages 674–687. IEEE Press, 2018.
-  Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
-  Gary B Huang, Marwan Mattar, Tamara Berg, and Eric Learned-Miller. Labeled faces in the wild: A database forstudying face recognition in unconstrained environments. In Workshop on faces in’Real-Life’Images: detection, alignment, and recognition, 2008.
-  Forrest N. Iandola, Song Han, Matthew W. Moskewicz, Khalid Ashraf, William J. Dally, and Kurt Keutzer. SqueezeNet: Alexnet-level accuracy with 50x fewer parameters and 0.5mb model size. arXiv:1602.07360, 2016.
-  Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-datacenter performance analysis of a tensor processing unit. In ACM/IEEE Annual International Symposium on Computer Architecture (ISCA), pages 1–12. IEEE, 2017.
-  Alexander Kirillov, Ross B. Girshick, Kaiming He, and Piotr Dollár. Panoptic feature pyramid networks. arXiv preprint arXiv:1901.02446, 2019.
-  Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
-  Tsung-Yi Lin, Piotr Dollár, Ross B. Girshick, Kaiming He, Bharath Hariharan, and Serge J. Belongie. Feature pyramid networks for object detection. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 936–944, 2017.
-  Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision (ECCV), 2014.
-  Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. In International Conference on Learning Representations (ICLR), 2019.
-  Tao Luo, Shaoli Liu, Ling Li, Yuqing Wang, Shijin Zhang, Tianshi Chen, Zhiwei Xu, Olivier Temam, and Yunji Chen. Dadiannao: A neural network supercomputer. IEEE Transactions on Computers, 66(1):73–88, 2017.
-  Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In European Conference on Computer Vision (ECCV), pages 116–131, 2018.
-  Yufei Ma, Yu Cao, Sarma Vrudhula, and Jae-sun Seo. Optimizing loop operation and dataflow in fpga acceleration of deep convolutional neural networks. In ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 45–54. ACM, 2017.
-  Stylianos Moschoglou, Athanasios Papaioannou, Christos Sagonas, Jiankang Deng, Irene Kotsia, and Stefanos Zafeiriou. Agedb: the first manually collected, in-the-wild age database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 51–59, 2017.
-  Aaron Nech and Ira Kemelmacher-Shlizerman. Level playing field for million scale face recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7044–7053, 2017.
-  Adam Paszke, Abhishek Chaurasia, Sangpil Kim, and Eugenio Culurciello. Enet: A deep neural network architecture for real-time semantic segmentation. arXiv preprint arXiv:1606.02147, 2016.
-  Hieu Pham, Melody Y. Guan, Barret Zoph, Quoc V. Le, and Jeff Dean. Efficient neural architecture search via parameter sharing. In International Conference on Machine Learning (ICML), pages 4092–4101, 2018.
-  Brandon Reagen, Paul Whatmough, Robert Adolf, Saketh Rama, Hyunkwang Lee, Sae Kyu Lee, José Miguel Hernández-Lobato, Gu-Yeon Wei, and David Brooks. Minerva: Enabling low-power, highly-accurate deep neural network accelerators. In ACM/IEEE Annual International Symposium on Computer Architecture (ISCA), pages 267–278. IEEE, 2016.
-  Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V. Le. Regularized evolution for image classifier architecture search. CoRR, abs/1802.01548, 2018.
-  Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4510–4520, 2018.
-  Soumyadip Sengupta, Jun-Cheng Chen, Carlos Castillo, Vishal M Patel, Rama Chellappa, and David W Jacobs. Frontal to profile face verification in the wild. In IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1–9. IEEE, 2016.
-  Dimitrios Stamoulis, Ruizhou Ding, Di Wang, Dimitrios Lymberopoulos, Bodhi Priyantha, Jie Liu, and Diana Marculescu. Single-path nas: Designing hardware-efficient convnets in less than 4 hours. arXiv preprint arXiv:1904.02877, 2019.
-  Ke Sun, Mingjie Li, Dong Liu, and Jingdong Wang. Igcv3: Interleaved low-rank group convolutions for efficient deep neural networks. arXiv preprint arXiv:1806.00178, 2018.
-  Stylianos I Venieris and Christos-Savvas Bouganis. fpgaconvnet: Automated mapping of convolutional neural networks on fpgas. In ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 291–292. ACM, 2017.
-  Stylianos I Venieris, Alexandros Kouris, and Christos-Savvas Bouganis. Toolflows for mapping convolutional neural networks on fpgas: A survey and future directions. ACM Computing Surveys (CSUR), 51(3):56, 2018.
-  Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong Tian, Peter Vajda, Yangqing Jia, and Kurt Keutzer. Fbnet: Hardware-aware efficient convnet design via differentiable neural architecture search. CoRR, abs/1812.03443, 2018.
-  Qingcheng Xiao, Yun Liang, Liqiang Lu, Shengen Yan, and Yu-Wing Tai. Exploring heterogeneous algorithms for accelerating deep convolutional neural networks on fpgas. In ACM/EDAC/IEEE Design Automation Conference (DAC), pages 1–6. IEEE, 2017.
-  Guotian Xie, Jingdong Wang, Ting Zhang, Jianhuang Lai, Richang Hong, and Guo-Jun Qi. Interleaved structured sparse convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8847–8856, 2018.
-  Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1492–1500, 2017.
-  Yu Xing, Shuang Liang, Lingzhi Sui, Xijie Jia, Jiantao Qiu, Xin Liu, Yushun Wang, Yu Wang, and Yi Shan. Dnnvm: End-to-end compiler leveraging heterogeneous optimizations on fpga-based cnn accelerators. arXiv preprint arXiv:1902.07463, 2019.
-  Changqian Yu, Jingbo Wang, Chao Peng, Changxin Gao, Gang Yu, and Nong Sang. Bisenet: Bilateral segmentation network for real-time semantic segmentation. In European Conference on Computer Vision (ECCV), pages 325–341, 2018.
-  Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong. Optimizing fpga-based accelerator design for deep convolutional neural networks. In ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 161–170, 2015.
-  Ting Zhang, Guo-Jun Qi, Bin Xiao, and Jingdong Wang. Interleaved group convolutions. In IEEE International Conference on Computer Vision (ICCV), pages 4373–4382, 2017.
-  Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6848–6856, 2018.
-  Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. CoRR, abs/1611.01578, 2016.
-  Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. Learning transferable architectures for scalable image recognition. CoRR, abs/1707.07012, 2017.