1 Introduction
Edge AI applications not only require high inference accuracy from deep neural networks (DNNs), but also demand aggressive inference speed (e.g., low latency for real-time response and high throughput for streaming inputs) and efficiency (e.g., low power and energy consumption for less heat and longer battery life). These applications are in urgent need of hardware acceleration on energy-efficient edge devices, such as embedded GPUs [Franklin2017] and FPGAs [Zhang et al.2018]. Since both inference accuracy and efficiency are key factors that distinguish good solutions, software algorithms (DNNs) need to be compact and hardware friendly; otherwise, it is impossible to overcome the hardware limitations in various edge-computing scenarios. To demonstrate our DNN design, we participated in DAC-SDC, a low power object detection challenge that asks for novel object detection solutions on resource-constrained embedded hardware platforms [Xu et al.2018]. This challenge targets a single object detection task from real-life UAV applications and requires comprehensive evaluation of detection accuracy, throughput, and energy consumption on two targeted embedded platforms (Nvidia TX2 GPU and Xilinx Ultra96 FPGA).


Table 1: Top-3 designs of DAC-SDC 2018 with their predefined DNNs and compression strategies.

Rank | Track | Predefined DNN | IoU | FPS | Compression strategies
---|---|---|---|---|---
1st | GPU | Tiny YOLO | 0.698 | 24.55 | ① ② ③ ④
2nd | GPU | Tiny YOLO | 0.691 | 25.30 | Not clear
3rd | GPU | YOLOv2 | 0.685 | 23.64 | ① ② ③
1st | FPGA | SSD | 0.624 | 11.96 | ① ② ③
2nd | FPGA | SqueezeNet | 0.492 | 25.97 | ① ② ③
3rd | FPGA | MobileNet | 0.573 | 7.35 | ① ② ③
We address the task in DAC-SDC by proposing SkyNet — a lightweight object detection DNN developed by a novel bottom-up design approach. The main contributions of this paper are summarized as follows:
- We summarize the designs proposed by the top-3 winners of DAC-SDC 2018 (both GPU and FPGA tracks) to locate the potential obstacles to better detection accuracy and hardware efficiency.
- We develop SkyNet using a bottom-up design approach with comprehensive awareness of the hardware limitations. In addition, three features are added: feature map bypassing, feature map reordering, and ReLU6 (instead of ReLU).
- We deploy the proposed SkyNet on both the TX2 GPU and the Ultra96 FPGA, achieving the highest IoU and total score in DAC-SDC 2019 and winning the first place award in both GPU and FPGA tracks (Figure 1).
2 DAC-SDC
This year's DAC-SDC launches an object detection challenge for images taken from drones, with a comprehensive evaluation system that considers design accuracy, throughput, and energy consumption. The goal of this competition is to provide unified edge platforms on which to develop and compare state-of-the-art object detection system designs.
2.1 Targeted UAV applications
DAC-SDC targets single object detection, one of the most important tasks in real-life UAV applications. It reflects the essential needs of UAV applications, such as real-time processing capability, energy efficiency, and detection accuracy. To better reflect real-life challenges, the images of the dataset are captured by UAVs and provided by the drone manufacturer DJI. The whole dataset is divided into two parts: the training dataset (100,000 images with objects of interest across 12 main categories and 95 sub-categories) and the hidden test set for official evaluation (50,000 images that only the contest organizers can access) [DJI2018]. Examples from the training dataset are shown in Figure 2; most of the objects are very small and challenging to detect.
2.2 Previous winning design summary

By examining the top-3 entries from both GPU and FPGA tracks, we notice that all of these designs share a similar top-down DNN design approach, as shown in Figure 3. They first adopt well-known DNNs with outstanding accuracy on similar tasks (such as YOLO [Redmon et al.2016] and SSD [Liu et al.2016] for object detection) and then apply optimization techniques on both the software and hardware sides, trying to compress the network to fit edge devices. Since these well-known DNNs are originally accuracy-oriented and designed for general computing platforms such as desktop and server GPUs, they may not perform well on resource-limited edge devices, which easily introduces quality degradation in the final results.
We summarize the top-3 designs of DAC-SDC 2018 in Table 1 with their predefined DNNs and the compression strategies they used. All of the GPU teams start from YOLO and end with IoU lower than 0.7 and throughput around 25 FPS. To compress the original DNNs, participants employ input resizing to lower the computational complexity and network pruning to remove unnecessary network connections. They also use half-precision floating-point (16-bit) instead of 32-bit to improve throughput. In the FPGA track, participants are required to perform more aggressive network compression because of the tighter hardware resource budget on the targeted FPGA. In these designs, DNN parameters are greatly reduced by shrinking the depth (the number of DNN layers) and width (the number of channels per layer) of the predefined DNNs. Meanwhile, the DNN parameters are quantized to 8 bits or even 1–2 bits.

3 Motivations
Since all top-3 GPU teams in 2018 fully investigated the potential of YOLO, we start our preliminary investigation with SSD, another popular DNN candidate for object detection. Following the top-down DNN design approach (Figure 3) as all teams did, we pick two predefined backbone networks (VGG16 [Simonyan and Zisserman2014] and MobileNet [Howard et al.2017]) for feature extraction with input size 3×360×640. We use the DAC-SDC training dataset (containing 100K images) for DNN training and the officially provided 1,000 images for validation. After sufficient training, the accuracy reaches 0.70 and 0.66 IoU using the VGG16 and MobileNet backbones, respectively. Without network compression, these two SSD models can only run at 15 FPS (with VGG16) and 24 FPS (with MobileNet) on a desktop GPU (Nvidia 1080Ti), so great effort would still be needed to adapt them to the targeted embedded platforms. Following the top-down DNN design flow, we face two essential challenges that prevent us from reaching better solutions with both higher inference accuracy and faster inference speed:
- 1) Similar inference speedup (with similar DNN compression ratios) but vastly different accuracy.
- 2) Uncertain inference accuracy variation for a given task.
For the first challenge, the underlying factor is the different sensitivities of DNN configurations with respect to inference accuracy and hardware performance. It is hard to strike a perfect balance, since negligible changes in the DNN model may cause huge differences in its hardware deployment and vice versa, resulting in a difficult trade-off between inference accuracy and hardware performance.
An example regarding the compression ratio of AlexNet [Krizhevsky, Sutskever, and Hinton2012] and its inference accuracy on the ImageNet dataset [Deng et al.2009] is given in [Zhang et al.2019a]. With data quantization, the memory footprints of parameters and feature maps shrink, enabling better inference throughput. However, the accuracy trends of designs with parameter quantization and with feature map quantization vary significantly; the precision of the feature maps clearly contributes more to inference accuracy than that of the parameters. To overcome the first challenge, we need to study the accuracy and speed sensitivity of each DNN component before network compression.

For the second challenge, the accuracy upper bound on a given task is very difficult to determine. An experiment evaluating the accuracy variation on the DAC-SDC dataset is presented in [Zhang et al.2019a]. With the back-end bounding box regression part fixed, well-known DNNs (including VGG16 and ResNet [He et al.2016]) are adopted as backbones to obtain accuracy results on the targeted dataset. However, no clear relationship between network size and inference accuracy emerges, even within the same architecture family (ResNet-18, -32, and -50). It is therefore not easy to select a promising predefined model for a given task following the top-down design approach.
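The distinction between parameter and feature-map precision can be illustrated with a minimal sketch: uniform symmetric quantization where the two bit-widths are independent knobs, so their accuracy sensitivities can be swept separately (this is an illustrative toy, not the quantization scheme of the cited study).

```python
# Illustrative sketch (not the authors' code): uniform symmetric quantization
# with an adjustable bit-width, the knob whose accuracy sensitivity differs
# between parameters and feature maps in the AlexNet study cited above.

def quantize(values, bits):
    """Quantize a list of floats onto a signed grid with 2^(bits-1)-1 levels."""
    max_abs = max(abs(v) for v in values) or 1.0
    levels = 2 ** (bits - 1) - 1          # e.g. 127 for 8 bits
    scale = max_abs / levels
    return [round(v / scale) * scale for v in values]

def quant_error(values, bits):
    """Mean absolute error introduced by quantization at a given bit-width."""
    q = quantize(values, bits)
    return sum(abs(a - b) for a, b in zip(values, q)) / len(values)

weights = [0.31, -0.87, 0.05, 0.44, -0.12, 0.96, -0.53, 0.2]
# Sweeping the bit-width exposes how fast the representation degrades:
for bits in (8, 4, 2):
    print(bits, round(quant_error(weights, bits), 4))
```

Applying the same sweep once to parameters and once to recorded feature maps would reproduce, in miniature, the sensitivity study described above.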
4 New Approach: Bottom-up DNN Design
Because of the two challenges mentioned above, we take a different design direction: a bottom-up DNN design strategy that builds the DNN from scratch. We adopt the DNN design method from [Hao et al.2019b] and propose a hardware-oriented DNN model with adequate understanding of the hardware constraints, so that our design can balance inference accuracy and performance on edge devices. The overall bottom-up design flow is shown in Figure 4.
4.1 Step 1: Bundle construction
The proposed design flow starts by constructing hardware-aware basic building blocks called Bundles. In our definition, a Bundle is a set of sequential DNN layers, which can be repeatedly stacked to construct DNNs. To guarantee that Bundles capture the hardware constraints, we use analytical models to estimate the performance and resource usage of each Bundle, so that Bundles can be selected according to their hardware performance.
In the first step, we enumerate DNN components of the essential layer types (such as Conv, Pooling, and activation layers) and assemble them into Bundles. Since our design targets both GPU and FPGA tracks, we use the resource constraints of the FPGA (more restrictive than the GPU) to evaluate the hardware performance (e.g., inference latency) of each Bundle. During implementation, all DNN components inside a Bundle are instantiated on the targeted hardware, which means a larger Bundle (with more DNN components) results in higher resource overhead and longer latency, making it less likely to be selected.
To estimate each Bundle's potential accuracy contribution, we build a DNN sketch with fixed front-end and back-end structures, and each time insert one Bundle (with replications) in the middle. In our case, the front-end and back-end are input resizing and bounding box regression, respectively. Each DNN is quickly trained for 20 epochs to obtain its accuracy. Since each Bundle has its own latency, accuracy, and resource characteristics, the most promising Bundles, i.e., those that meet the speed target with the best relative accuracy, are selected for the next step.
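The selection at the end of step 1 can be sketched as a simple filter-and-rank over candidate Bundles; every number and Bundle name below is made up for illustration only.

```python
# Hypothetical sketch of Step 1's Bundle selection: each candidate carries an
# analytically estimated latency/resource cost plus a proxy accuracy from a
# 20-epoch quick training of the fixed-sketch DNN. All values are invented.

candidates = [
    # (name, est_latency_ms, est_resource_util, proxy_iou)
    ("DWConv3+PWConv1+BN+ReLU6", 1.8, 0.55, 0.62),
    ("Conv3+BN+ReLU",            3.5, 0.80, 0.64),
    ("Conv5+Pool+ReLU",          4.9, 0.90, 0.60),
    ("PWConv1+BN+ReLU",          0.9, 0.30, 0.48),
]

def select_bundles(candidates, max_latency, max_resource, top_k=2):
    """Keep Bundles within the latency/resource budget, best proxy IoU first."""
    feasible = [c for c in candidates
                if c[1] <= max_latency and c[2] <= max_resource]
    return sorted(feasible, key=lambda c: c[3], reverse=True)[:top_k]

picked = select_bundles(candidates, max_latency=4.0, max_resource=0.85)
print([name for name, *_ in picked])
```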
4.2 Step 2: Network search
In step 2, we perform the DNN structure search. The required inputs include the initial DNNs (built by stacking selected Bundles), the latency target Lat_target, the acceptable latency tolerance ε, and the resource constraint RES_max, while the outputs are the generated DNNs. We use stochastic coordinate descent (SCD) to update three variables related to the DNN structure: the number of Bundle replications, the down-sampling configuration between Bundles, and the channel expansion configuration. Assuming the achieved latency and resource overhead of a generated DNN are Lat and Res, the objective of SCD is |Lat − Lat_target| < ε and Res ≤ RES_max. These three variables define three coordinates; in every iteration, the SCD algorithm picks one coordinate at random and updates the DNN structure along that direction [Hao et al.2019b]. As we specify the latency target at the very beginning, the DNNs generated by the network search favor lightweight structures for the targeted hardware platforms.
In addition, we make two contributions that boost the DNN search efficiency. First, during step 1 we already select the promising Bundles with respect to latency, accuracy, and resource overhead; since this exploration is done in step 1, the search process in step 2 is not overwhelming. Second, we limit the DNN design space for a faster search, exploring only the number of layers, the channel expansion insertion locations, and the pooling opportunities. As a result, the generated DNNs have a more structured, traditional stacked network architecture. Such regular network structures help boost hardware efficiency when deploying on edge devices.
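The SCD loop described above can be sketched as follows. This is an assumption-laden toy, not the authors' implementation: the analytical latency model, step sizes, and initial state are invented for illustration.

```python
# Minimal SCD sketch: three coordinates of the DNN structure (Bundle count,
# down-sampling stages, channel expansion) are perturbed one at a time until
# a toy latency model meets the target within tolerance.
import random

def latency_ms(state):
    """Toy analytical latency model: more Bundles and wider channels cost
    time; each down-sampling stage shrinks later feature maps."""
    n_bundles, n_downsample, ch_expand = state
    return n_bundles * ch_expand * 2.0 / (1 + n_downsample)

def scd_search(lat_target, eps, steps=500, seed=0):
    rng = random.Random(seed)
    state = [4, 1, 1]                      # bundles, down-samplings, expansion
    for _ in range(steps):
        if abs(latency_ms(state) - lat_target) < eps:
            return state                   # target met within tolerance
        coord = rng.randrange(3)           # pick one coordinate at random
        cand = list(state)
        cand[coord] = max(1, cand[coord] + rng.choice([-1, 1]))
        # accept the move only if it gets closer to the latency target
        if abs(latency_ms(cand) - lat_target) < abs(latency_ms(state) - lat_target):
            state = cand
    return state

best = scd_search(lat_target=6.0, eps=0.5)
print(best, latency_ms(best))
```

A real instance would replace `latency_ms` with the Bundle-level analytical model from step 1 and add the resource constraint as a second acceptance test.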
4.3 Step 3: Feature addition
In step 3, we add more advanced DNN features, when hardware resources allow, to further tailor the generated network for the targeted task. For DAC-SDC, since most of the objects to be detected are small, we add a bypass with feature map reordering in SkyNet to strengthen its small object detection capability. To enhance hardware efficiency, we replace ReLU with ReLU6. More discussion is provided in the next section.

5 SkyNet
The main idea of building SkyNet is to explore DNN design following the bottom-up approach to deliver AI capabilities on resource-constrained edge devices.
5.1 SkyNet architecture
The Bundle we selected is a combination of a 3×3 depth-wise convolutional layer (DW-Conv3), a 1×1 point-wise convolutional layer (PW-Conv1), a batch normalization layer (BN), and a rectified linear unit 6 (ReLU6). By repeatedly stacking this Bundle, we generate the backbone network architecture A in Table 2. Advanced features are added afterwards, as shown in architectures B and C (Table 2). For the back-end of SkyNet, we adapt the YOLO back-end by removing the classification output and use two anchors for bounding box regression.

Table 2: Configurations of SkyNet A, B, and C: an input layer (3×160×360 color image), repeatedly stacked Bundles (DW-Conv3 + PW-Conv1 + BN + ReLU6) interleaved with 2×2 max-pooling layers, and a back-end for bounding box regression.
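Assuming a backbone that alternates Bundles with 2×2 max-pooling as in Table 2, the feature map sizes can be sanity-checked with a small shape tracer; the channel widths below are illustrative guesses, not SkyNet's actual configuration.

```python
# Sketch: trace feature-map shapes through stacked Bundles and 2x2
# max-poolings, starting from the 3x160x360 input. The Bundle itself
# (DW-Conv3 with stride 1 and padding 1, then PW-Conv1) preserves the
# spatial size; only pooling halves it. Channel widths are invented.

def bundle(shape, out_channels):
    """DW-Conv3 (stride 1, pad 1) keeps HxW; PW-Conv1 sets the channel count."""
    _, h, w = shape
    return (out_channels, h, w)

def max_pool_2x2(shape):
    c, h, w = shape
    return (c, h // 2, w // 2)

shape = (3, 160, 360)
for out_ch, pool in [(48, True), (96, True), (192, False)]:
    shape = bundle(shape, out_ch)
    if pool:
        shape = max_pool_2x2(shape)
    print(shape)
```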

5.2 Feature map bypassing and reordering

By examining the competition training data, we record the size ratio between the output bounding box and the input image and present the distribution in Figure 7. It clearly shows that 91% of the objects to be detected in the DAC-SDC dataset occupy less than 9% of the input image area, and 31% of them are even smaller than 1% of the input image area. This means the majority of objects in this dataset can be considered small objects, and we need to design the DNN accordingly.
We add feature map bypassing and reordering to enhance small object detection. The bypass helps preserve small-object features into the deeper part of the DNN without them being diluted by the pooling layers. It is also beneficial to have multiple feature maps (from different layers) before generating the bounding boxes. Since the bypass crosses a pooling layer (highlighted in Figure 5), we use reordering (shown in Figure 6) to align the size of the original feature map (generated by the fifth Bundle) with the bypassed one without losing valuable features.
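Reordering across a 2×2 pooling stage is commonly realized as a space-to-depth transform; the sketch below shows one such layout (SkyNet's exact ordering may differ, so treat this as an illustration of the idea rather than the paper's implementation).

```python
# Sketch of feature-map reordering (space-to-depth): a CxHxW map is
# rearranged into a (4C)x(H/2)x(W/2) map so the bypassed features match
# the spatial size of the post-pooling path without discarding any values.

def reorder(fmap):
    """fmap: nested lists [C][H][W] with even H and W."""
    c = len(fmap)
    h, w = len(fmap[0]), len(fmap[0][0])
    out = []
    for dy in (0, 1):               # each 2x2 phase becomes its own channel set
        for dx in (0, 1):
            for ch in range(c):
                out.append([[fmap[ch][2 * y + dy][2 * x + dx]
                             for x in range(w // 2)]
                            for y in range(h // 2)])
    return out

# 1 channel, 4x4 map -> 4 channels of 2x2; every value survives.
fmap = [[[r * 4 + col for col in range(4)] for r in range(4)]]
out = reorder(fmap)
print(len(out), len(out[0]), len(out[0][0]))
```

Unlike pooling, which keeps one value per 2×2 window, this rearrangement is lossless, which is exactly why it suits a bypass meant to preserve small-object features.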
5.3 ReLU6
The other feature we use to improve hardware efficiency is ReLU6, an activation function whose output range is compressed to [0, 6]. Since ReLU6 produces a much smaller data range than the original ReLU ([0, +∞)), fewer bits are required to represent activations, e.g., lower-precision floating-point on embedded GPUs or fixed-point data types on embedded FPGAs.

5.4 Training
We train SkyNet in an end-to-end fashion using multi-scale training, with the learning rate decaying from 1e-4 to 1e-7. We use stochastic gradient descent (SGD) to update parameters, and data augmentation to distort, jitter, crop, and resize inputs to size 160×320. The accuracy of SkyNet is shown in Table 3, where SkyNet C reaches the highest IoU (0.741) on the validation set. We therefore use SkyNet C as the proposed SkyNet in the following experiments.

DNN Model | IoU
---|---
SkyNet A | —
SkyNet B | —
SkyNet C | 0.741
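The training schedule above can be sketched as follows; the epoch count, decay milestones, and scale set are assumptions for illustration, not the paper's exact recipe.

```python
# Illustrative sketch of the training schedule: SGD learning rate decaying
# from 1e-4 to 1e-7, plus multi-scale input resizing per batch. The epoch
# count, decay factor, and scale choices are invented.
import random

def lr_schedule(epoch, total_epochs=120, lr_start=1e-4):
    """Piecewise-constant decay: divide by 10 at evenly spaced milestones,
    so the rate walks 1e-4 -> 1e-5 -> 1e-6 -> 1e-7."""
    n_drops = 3
    stage = min(n_drops, epoch * (n_drops + 1) // total_epochs)
    return lr_start * (0.1 ** stage)

def sample_input_scale(rng, base=(160, 320), scales=(0.75, 1.0, 1.25)):
    """Multi-scale training: randomly rescale the nominal 160x320 input."""
    s = rng.choice(scales)
    return (int(base[0] * s), int(base[1] * s))

rng = random.Random(0)
print(lr_schedule(0), lr_schedule(119))
print(sample_input_scale(rng))
```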
6 Experiments
We demonstrate the capability of the proposed SkyNet in the low power object detection challenge of DAC-SDC 2019. The evaluation is based on detection accuracy (IoU), inference throughput (FPS), and energy consumption (J). The calculation of the final score is defined in [Xu et al.2018] as follows.

Assuming there are I registered teams and K images in the test set, the IoU score for team i, denoted as RIoU_i, is the average IoU over all test images:

RIoU_i = (Σ_{k=1}^{K} IoU_{i,k}) / K    (1)

For energy, Ē denotes the average energy consumption of all entries when performing DNN inference on the test dataset (Equation 2). The energy score of team i (ES_i) is then computed using Equation 3, based on the ratio between the average energy and the energy consumed by this team. The base x in this equation is set to 2 and 10 for the FPGA category and the GPU category, respectively.

Ē = (Σ_{i=1}^{I} E_i) / I    (2)

ES_i = max{0, 1 + 0.2 × log_x(Ē / E_i)}    (3)

Eventually, the total score of team i, denoted as TS_i, is calculated in Equation 4, combining both inference accuracy (RIoU_i) and energy consumption (ES_i):

TS_i = RIoU_i × (1 + ES_i)    (4)
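The scoring can be made concrete with a short calculator; it follows the contest rules as described in [Xu et al.2018], and the example numbers (a team using half the field's average energy) are hypothetical.

```python
# Sketch of the DAC-SDC scoring: energy score from the ratio of average
# energy to a team's energy, and total score combining IoU and energy.
import math

def energy_score(team_energy, avg_energy, base):
    """ES = max(0, 1 + 0.2 * log_base(avg_energy / team_energy));
    base is 10 for the GPU category and 2 for the FPGA category."""
    return max(0.0, 1.0 + 0.2 * math.log(avg_energy / team_energy, base))

def total_score(avg_iou, team_energy, avg_energy, base):
    """Total score TS = RIoU * (1 + ES)."""
    return avg_iou * (1.0 + energy_score(team_energy, avg_energy, base))

# Hypothetical GPU-track team: IoU 0.731, half the field's average energy.
ts = total_score(0.731, team_energy=50.0, avg_energy=100.0, base=10)
print(round(ts, 3))
```

Note how the log base makes the energy bonus far more sensitive in the FPGA category (base 2) than in the GPU category (base 10).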
6.1 Result comparison
The proposed SkyNet is deployed on the given GPU and FPGA platforms. For the GPU implementation, we keep all network parameters in 32-bit floating-point format; for the FPGA design, we quantize the feature maps and parameters to 9 bits and 11 bits, respectively, for better hardware performance. The final results of the top-3 teams are listed in Tables 4 and 5 [Hu et al.2019]. In total, 52 GPU teams and 58 FPGA teams participated worldwide, making for a very intense competition. Our SkyNet design delivered the best inference accuracy and total score in both the GPU and FPGA tracks.
Table 4: GPU-track results.

Rank | IoU | FPS | Power (W) | Total Score
---|---|---|---|---
Results from 2019 | | | |
1st (SkyNet, ours) | 0.731 | 67.33 | 13.50 | 1.504
2nd | 0.713 | 28.79 | 8.55 | 1.442
3rd | 0.723 | 26.37 | 15.12 | 1.422
Results from 2018 | | | |
1st | 0.698 | 24.55 | 12.58 | 1.373
2nd | 0.691 | 25.30 | 13.27 | 1.359
3rd | 0.685 | 23.64 | 10.31 | 1.358
Table 5: FPGA-track results.

Rank | IoU | FPS | Power (W) | Total Score
---|---|---|---|---
Results in 2019 | | | |
1st (SkyNet, ours) | 0.716 | 25.05 | 7.26 | 1.526
2nd | 0.615 | 50.91 | 9.25 | 1.394
3rd | 0.553 | 55.13 | 6.69 | 1.318
Results in 2018 | | | |
1st | 0.624 | 11.96 | 4.20 | 1.267
2nd | 0.492 | 25.97 | 2.45 | 1.179
3rd | 0.573 | 7.35 | 2.59 | 1.164
7 Conclusions and Discussions
In this paper, we proposed SkyNet, a lightweight DNN developed following a bottom-up design approach and specialized for low power object detection. The proposed design was demonstrated in the 56th IEEE/ACM Design Automation Conference System Design Contest (DAC-SDC) and won the first place award in both GPU and FPGA tracks. This success also indicates that the proposed bottom-up DNN design approach is effective for enhancing object detection performance on embedded GPUs and FPGAs. The method can be extended to other edge devices with appropriate latency and resource modeling: Bundles can be enumerated and evaluated on the targeted devices, and, given specific performance targets, DNNs can be grown by stacking Bundles under the guidance of search algorithms. In our case we used SCD to approach the performance targets, but other search algorithms are also feasible. The proposed design method can also be extended to other DNN-related edge applications, such as classification, recognition, and object tracking.
8 Acknowledgments
This work was partly supported by the IBM-Illinois Center for Cognitive Computing System Research (C3SR), a research collaboration as part of the IBM AI Horizons Network. The authors would like to express their deep gratitude to the additional members of team iSmart3-SkyNet (GPU track): Haoming Lu, Jiachen Li, Yuchen Fan, Sitao Huang, Bowen Cheng, Yunchao Wei, Thomas Huang, Honghui Shi, and of team iSmart3 (FPGA track): Yao Chen, Xingheng Liu, Sitao Huang.
References
- [Deng and Zhuo2018] Deng, J., and Zhuo, C. 2018. DAC-SDC’18 2nd place winner in GPU track. https://github.com/jndeng/DACSDC-DeepZ. Accessed: 2019-06-09.
- [Deng et al.2009] Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, 248–255. IEEE.
- [Deng et al.2019] Deng, J.; Shen, T.; Yan, X.; Chen, Y.; Zhang, H.; Wang, R.; Zhou, P.; and Zhuo, C. 2019. DAC-SDC’19 3rd place winner in GPU track.
- [DJI2018] DJI. 2018. DAC-SDC dataset. https://github.com/xyzxinyi zhang/2018-DAC-System-Design-Contest. Accessed: 2019-06-08.
- [Franklin2017] Franklin, D. 2017. NVIDIA Jetson TX2 delivers twice the intelligence to the edge. NVIDIA Accelerated Computing— Parallel Forall.
- [Hao et al.2018] Hao, C.; Li, Y.; Huang, S. H.; Zhang, X.; Gao, T.; Xiong, J.; Rupnow, K.; Yu, H.; Hwu, W.-M.; and Chen, D. 2018. DAC-SDC’18 3rd place winner in FPGA track. https://github.com/onioncc/iSmartDNN. Accessed: 2019-06-09.
- [Hao et al.2019a] Hao, C.; Zhang, X.; Li, Y.; Chen, Y.; Liu, X.; Huang, S. H.; Rupnow, K.; Xiong, J.; Hwu, W.-M.; and Chen, D. 2019a. DAC-SDC’19 1st place winner in FPGA track.
- [Hao et al.2019b] Hao, C.; Zhang, X.; Li, Y.; Huang, S.; Xiong, J.; Rupnow, K.; Hwu, W.-m.; and Chen, D. 2019b. FPGA/DNN co-design: An efficient design methodology for IoT intelligence on the edge. In Proceedings of the 56th Annual Design Automation Conference 2019, 206. ACM.
- [He et al.2016] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778.
- [Howard et al.2017] Howard, A. G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; and Adam, H. 2017. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
- [Hu et al.2019] Hu, J.; Goeders, J.; Brisk, P.; Wang, Y.; Luo, G.; and Yu, B. 2019. 2019 DAC system design contest on low power object detection. When Accuracy meets Power: 2019 DAC System Design Contest on Low Power Object Detection.
- [Kara and Alonso2019] Kara, K., and Alonso, G. 2019. DAC-SDC’19 3rd place winner in FPGA track.
- [Kara, Zhang, and Alonso2018] Kara, K.; Zhang, C.; and Alonso, G. 2018. DAC-SDC’18 2nd place winner in FPGA track. https://github.com/fpgasystems/spooNN. Accessed: 2019-06-09.
-
[Krizhevsky, Sutskever, and
Hinton2012]
Krizhevsky, A.; Sutskever, I.; and Hinton, G. E.
2012.
Imagenet classification with deep convolutional neural networks.
In Advances in neural information processing systems, 1097–1105. - [Liu et al.2016] Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; and Berg, A. C. 2016. Ssd: Single shot multibox detector. In European conference on computer vision, 21–37. Springer.
- [Lu et al.2018] Lu, H.; Cai, X.; Zhao, X.; and Wang, Y. 2018. DAC-SDC’18 1st place winner in GPU track. https://github.com/lvhao7896/DAC2018. Accessed: 2019-06-09.
- [Redmon et al.2016] Redmon, J.; Divvala, S.; Girshick, R.; and Farhadi, A. 2016. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, 779–788.
- [Simonyan and Zisserman2014] Simonyan, K., and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
- [Xiong et al.2019] Xiong, F.; Yin, S.; Fan, Y.; and Ouyang, P. 2019. DAC-SDC’19 2nd place winner in GPU track.
- [Xu et al.2018] Xu, X.; Zhang, X.; Yu, B.; Hu, X. S.; Rowen, C.; Hu, J.; and Shi, Y. 2018. DAC-SDC low power object detection challenge for UAV applications. arXiv preprint arXiv:1809.00110.
- [Zang et al.2018] Zang, C.; Liu, J.; Hao, Y.; Li, S.; Yu, M.; Zhao, Y.; Li, M.; Xue, P.; Qin, X.; Ju, L.; Li, X.; Zhao, M.; and Dai, H. 2018. DAC-SDC’18 3rd place winner in GPU track. https://github.com/xiaoyuuuuu/dac-hdc-2018-object-detection-in-Jetson-TX2. Accessed: 2019-06-09.
- [Zeng et al.2018] Zeng, S.; Chen, W.; Huang, T.; Lin, Y.; Meng, W.; Zhu, Z.; and Wang, Y. 2018. DAC-SDC’18 1st place winner in FPGA track. https://github.com/hirayaku/DAC2018-TGIIF. Accessed: 2019-06-09.
- [Zhang et al.2018] Zhang, X.; Wang, J.; Zhu, C.; Lin, Y.; Xiong, J.; Hwu, W.-m.; and Chen, D. 2018. DNNBuilder: an automated tool for building high-performance DNN hardware accelerators for FPGAs. In Proceedings of the International Conference on Computer-Aided Design, 56. ACM.
- [Zhang et al.2019a] Zhang, X.; Hao, C.; Li, Y.; Chen, Y.; Xiong, J.; Hwu, W.-m.; and Chen, D. 2019a. A bi-directional co-design approach to enable deep learning on IoT devices. arXiv preprint arXiv:1905.08369.
- [Zhang et al.2019b] Zhang, X.; Lu, H.; Li, J. L.; Hao, C.; Fan, Y.; Li, Y.; Huang, S.; Cheng, B.; Wei, Y.; Huang, T.; Xiong, J.; Shi, H.; Hwu, W.-m.; and Chen, D. 2019b. DAC-SDC’19 1st place winner in GPU track.
- [Zhao et al.2019] Zhao, B.; Zhao, W.; Xia, T.; Chen, F.; Fan, L.; Zong, P.; Wei, Y.; Tu, Z.; Zhao, Z.; Dong, Z.; and Ren, P. 2019. DAC-SDC’19 2nd place winner in FPGA track.