1 Introduction
(Readers can also find our challenge report for the DAC System Design Contest 2019 in Zhang et al. (2019). Our code is open-sourced.)
Edge AI applications not only require high inference accuracy from deep neural networks (DNNs), but also demand aggressive inference speed, throughput, and energy efficiency to meet real-life needs. These applications rely on hardware-efficient DNN design when deployed onto embedded systems with extremely limited computation and memory resources. Recently, we have seen intensive studies on DNN accelerators in hardware, which attempt to take advantage of different hardware design styles, such as GPUs, FPGAs, and ASICs, to improve the speed and efficiency of DNN inference and training processes Qiu et al. (2016); Chen et al. (2016); Zhang et al. (2017a); Jouppi et al. (2017); Franklin (2017); Zhang et al. (2018b).
Although hardware accelerators can be helpful, they are still limited by available resources when handling varied real-life applications, especially on embedded systems, since most DNNs are not originally designed to be hardware-efficient. As a result, optimization has turned to the software side, compressing DNNs to lower their complexity, computation demands, and memory footprints. Recent research has demonstrated the possibility of using low bit-width data to represent the original floating-point parameters, as in binary and ternary networks Courbariaux et al. (2016); Rastegari et al. (2016); Li et al. (2016); Tschannen et al. (2018); Wang et al. (2018a); Gope et al. (2019). These solutions replace hardware-intensive floating-point multiplications with logical operations, so that DNNs become more efficient on hardware platforms.
Researchers have also investigated network pruning strategies to reduce the redundancy of DNN structures Han et al. (2015, 2016); Luo et al. (2017). In the published pruning strategies, the relatively less important connections between DNN layers are discarded and the network is then retrained to regain accuracy. Significant reductions can be achieved on classic DNNs, such as AlexNet Krizhevsky et al. (2012) and VGG-16 Simonyan and Zisserman (2014). Since the major benefit of network compression comes from the fully-connected (FC) layers, more sophisticated algorithms are required for pruning to remain effective on later DNNs with reduced FC layers (e.g., GoogleNet Szegedy et al. (2015) and ResNet He et al. (2016)). Recently published literature adopts evolutionary algorithms Dai et al. (2019a), the alternating direction method of multipliers (ADMM) Ren et al. (2019), and iterative pruning Ding et al. (2018) for better compression while maintaining DNN accuracy.
As most of the computation happens inside the convolutional (Conv) layers, previous works also attempt to reduce computation complexity by using depth-wise separable Conv layers for image classification and ubiquitous keyword-spotting applications Howard et al. (2017); Zhang et al. (2017b). This depth-wise separable structure efficiently reduces the number of operations and provides more compact DNN designs for resource-constrained hardware. To further improve DNN deployment on hardware, layer fusion is proposed in Alwani et al. (2016) to minimize data movement between on-chip and off-chip memory.
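The operation savings of depth-wise separable convolutions can be quantified with a back-of-the-envelope count of multiply-accumulates (MACs); the layer sizes below are illustrative, not taken from any cited network:

```python
def conv_macs(h, w, k, c_in, c_out):
    """MACs for a standard k x k convolution with 'same' padding."""
    return h * w * k * k * c_in * c_out

def dw_separable_macs(h, w, k, c_in, c_out):
    """MACs for a depthwise k x k conv followed by a 1x1 pointwise conv."""
    depthwise = h * w * k * k * c_in   # one k x k filter per input channel
    pointwise = h * w * c_in * c_out   # 1x1 conv mixes channels
    return depthwise + pointwise

# Example: a 3x3 layer with 256 input/output channels on a 32x32 feature map.
std = conv_macs(32, 32, 3, 256, 256)
sep = dw_separable_macs(32, 32, 3, 256, 256)
# The ratio approaches 1/c_out + 1/k^2, i.e., roughly 8-9x fewer MACs for k=3.
print(round(std / sep, 2))  # → 8.69
```

The same ratio explains why networks built from these layers fit resource-constrained hardware so much better than standard-convolution backbones.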
In general, the design process of hardware-efficient DNNs can be summarized as in Figure 1 with the adoption of the above-mentioned technologies. It is a top-down design flow which starts from step 1: selecting a reference DNN with a focus on accuracy. For computer vision applications, the VGG Simonyan and Zisserman (2014) and ResNet He et al. (2016) families are highly likely to be selected as backbones of the desired designs. Such DNNs are excessively complicated for targeted embedded systems and must be compressed using software and hardware optimizations in steps 2 and 3, respectively. Since software compression and hardware implementation are typically carried out separately, steps 2 and 3 are usually performed iteratively to balance DNN accuracy and hardware performance on the targeted devices. Network retraining is also required to regain accuracy after compression, before step 4. Because of the iterative nature of the process, it is very challenging to cover both inference accuracy in software and deployment efficiency in hardware.
In this paper, we address the hardware-efficient DNN design problem by proposing SkyNet, a bottom-up DNN design approach with comprehensive awareness of hardware constraints. SkyNet has been demonstrated on a low-power object detection task, delivering state-of-the-art results in both DNN accuracy and hardware efficiency. The main contributions of this paper are summarized as follows:
We summarize the latest low-power object detectors for embedded systems and identify the potential obstacles in top-down DNN design flows that may prevent improvements in DNN accuracy and hardware efficiency.
We propose a bottom-up design strategy for hardware-efficient DNNs on both embedded GPUs and embedded FPGAs. Using this method, we develop SkyNet, which has comprehensive awareness of hardware limitations and overcomes the challenges of the top-down design flow.
We demonstrate SkyNet in DAC-SDC'19 on both the TX2 GPU and the Ultra96 FPGA with state-of-the-art accuracy. SkyNet achieved the highest overall score regarding accuracy, throughput, and energy efficiency, and won the first-place award in both the GPU and FPGA tracks.
We extend SkyNet for object tracking. By using SkyNet as the backbone DNN, SiamRPN++ and SiamMask obtain 1.60X and 1.73X speedup with better or similar accuracy, and 37.20X smaller parameter size compared to using the original ResNet-50 backbone when running on a 1080Ti GPU.
2 Related Work
|Backbone DNN||Software optimizations||Hardware optimizations|
|ShuffleNet + RetinaNet||1⃝ 2⃝ 3⃝||9⃝|
|Tiny YOLO||Not clear||9⃝|
|Tiny YOLO||1⃝ 2⃝ 3⃝ 4⃝||Not clear|
|Tiny YOLO||Not clear||9⃝|
|YOLOv2||1⃝ 2⃝ 3⃝||9⃝|
|ShuffleNetV2 + YOLO||2⃝ 3⃝||5⃝ 6⃝ 8⃝|
|SqueezeNet + YOLO||1⃝ 2⃝ 3⃝||7⃝|
|SSD||1⃝ 2⃝ 3⃝||5⃝ 6⃝|
|SqueezeNet + YOLO||1⃝ 2⃝ 3⃝||7⃝|
|MobileNet + YOLO||1⃝ 2⃝ 3⃝||5⃝ 7⃝|
Recent state-of-the-art object detectors feature DNN backbones to extract input features. Researchers initially proposed a two-stage approach: the first stage outputs multiple region proposals for object candidates, and the second stage generates more accurate regions with corresponding class labels Dai et al. (2016); Lin et al. (2017a); He et al. (2017); Cheng et al. (2018b, a); Cai and Vasconcelos (2019); Li et al. (2019b). Since two-stage detectors have long latency, one-stage approaches have been proposed to simultaneously regress object locations and classes to reduce latency Sermanet et al. (2014); Redmon et al. (2016); Liu et al. (2016); Lin et al. (2017b); Law and Deng (2018); Shen et al. (2019); Zhou et al. (2019); Tian et al. (2019). Object tracking also relies on features extracted from powerful DNN backbones, and recent Siamese-network-based trackers formulate tracking as a correlation between features of the exemplar image and the search region Tao et al. (2016); Valmadre et al. (2017); Wang et al. (2018b); Li et al. (2019a); Wang et al. (2019). These state-of-the-art methods make real-time object detection and tracking possible on desktop GPUs but still need aggressive compression before deployment onto embedded systems.
2.1 Low-Power Object Detectors
Nowadays, much attention has been paid to delivering hardware-efficient designs for object detection instead of simply pursuing higher inference quality. To address the design difficulties of real-life applications, a low power object detection challenge in DAC-SDC is proposed to target unmanned aerial vehicle (UAV) applications using embedded platforms, such as NVIDIA TX2 GPU, Ultra96 FPGA, and Xilinx Pynq-Z1 FPGA Xu et al. (2019). By examining the winning entries, we notice that all of them share similar top-down DNN design approaches as shown in Figure 1.
All teams listed in Table 1 adopt one-stage detectors. Most of them start from well-established hardware-efficient DNNs, such as ShuffleNet Zhang et al. (2018a), SqueezeNet Iandola et al. (2016), and MobileNet Howard et al. (2017), and replace the image classifier with a YOLO Redmon et al. (2016); Redmon and Farhadi (2017) or RetinaNet Lin et al. (2017b) back-end for object detection. Other solutions directly adopt object detection algorithms, such as SSD Liu et al. (2016) and YOLO. To deliver hardware-efficient DNNs, they employ input resizing and network pruning to lower network complexity. Some of the GPU entries use the half-precision (16-bit) data format and TensorRT for improved throughput. More aggressive compression is necessary for FPGA designs because of even tighter resource budgets: DNN parameters are quantized to around 8 bits or even down to 1 or 2 bits. The FPGA teams also employ task partitioning (between the host CPU and FPGA), double-pumped DSPs (with doubled working frequency in the DSP units), tailored pipelines, and multithreading to boost hardware performance. One of the teams applies clock gating for even better energy efficiency on the embedded system.
2.2 Hardware-Aware Neural Network Search
To deliver DNNs for edge devices, there has been growing interest in using neural architecture search (NAS) Tan et al. (2019); Wu et al. (2019); Dai et al. (2019b); Howard et al. (2019); Xiong et al. (2019b); Stamoulis et al. (2019); Dong et al. (2018); Cai et al. (2018) to automatically find resource-constrained convolutional neural networks (CNNs) targeting edge platforms. Tan et al. (2019) are among the first to use NAS for efficient CNNs, adding latency to the optimization constraints and using reinforcement learning Zoph and Le (2016); Zoph et al. (2018) to maximize the reward (high accuracy and low latency). To find efficient networks for a specific platform, Tan et al. (2019) use real latency measured by running models on the targeted device instead of a latency proxy. Limited by the number of available physical devices, Wu et al. (2019); Cai et al. (2018) use look-up tables (LUTs) to approximate the run time of models on a specific device. To incorporate human knowledge, Howard et al. (2019) use platform-aware NAS to search CNNs for a platform and manually adjust parts of the structure to make it more efficient. Compared to previous hardware-aware NAS methods that target a specific platform, SkyNet targets both embedded GPU and embedded FPGA platforms and captures hardware limitations by using realistic hardware performance feedback instead of LUT approximations.
3 Challenges of the Top-Down Design Flow
To deliver an even better solution than the winning designs listed in Table 1, we investigate the potential obstacles in the top-down design flow (Figure 1) which may hinder further improvements in DNN accuracy and efficiency. We summarize two challenges as follows:
It is difficult to balance the software and hardware sensitivities of DNN configurations during model compression in the top-down approach.
It is difficult to select appropriate reference DNNs at the very beginning of the top-down flow because of uncertain accuracy variations on a given real-life task.
The first challenge causes tedious iterative explorations between software and hardware optimizations. With similar hardware performance (e.g., throughput and latency), DNN candidates may have different accuracy results because compression technologies are applied to different network components. We take data quantization as an example. As shown in Figure 2 (a), the accuracy results vary significantly between quantizing parameters and quantizing intermediate feature maps (FMs). In this figure, the coordinates of each bubble center represent accuracy and model compression ratio, while the area of a bubble shows data size in megabytes (MB); we scale up the FM bubbles for better visibility. By compressing the model from float32 to a fixed-point representation, we reduce the parameter size by 22X (237.9 MB → 10.8 MB) and the FM size by 16X (15.7 MB → 0.98 MB), respectively. Results show that inference accuracy is more sensitive to the precision of FMs.
|Backbone DNN||# of Parameters||IoU|
On the other hand, DNN models with similar accuracy may result in different hardware efficiency. To provide a quantitative analysis, we implement DNNs with the same architecture but different configurations on an FPGA and examine their impacts on hardware. Figure 2 (b) shows the BRAM (FPGA on-chip memory) usage under different input resize factors and FM quantization configurations. By reducing the resize factor from 1.00 to 0.78, we can maintain nearly the same DNN accuracy (<1.0% drop) while halving the memory usage once the factor drops below 0.9. Similarly, Figure 2 (c) indicates that small changes may lead to diverse DSP utilization. Taking the 16-bit FM (FM16) as an example, the required DSPs reduce from 128 to 64 when weights change from 15-bit (W15) to 14-bit (W14).
For the second challenge, it is difficult to select a reference DNN with a relatively high accuracy upper bound on a given task. DNNs with impressive accuracy on published datasets (e.g., CIFAR-10/100 and ImageNet) may not always be suitable. We evaluate the accuracy of popular reference DNNs on the DAC-SDC object detection dataset and list the results in Table 2. With the fixed back-end bounding-box regression part, these reference DNNs show no clear correlation between parameter size and inference accuracy after adequate training. Thus, it is not easy to select a promising reference model for a given task.
4 A Bottom-Up Design Approach
Motivated by the challenges discussed in Section 3, we propose a bottom-up approach to deliver hardware-efficient DNN designs for embedded systems. It is a three-stage approach, as shown in Figure 3.
4.1 Stage 1: Bundle Selection and Evaluation
This flow starts with building hardware-aware basic blocks called Bundles. From the software perspective, a Bundle is a set of sequential DNN layers which can be repeatedly stacked to construct DNNs. From the hardware perspective, a Bundle is a set of IPs to be implemented on hardware. To capture hardware constraints, Bundles are evaluated on the targeted embedded systems to collect realistic latency results (for both FPGA and GPU) and resource utilization results (for FPGA).
In the first stage, we enumerate DNN components (such as Conv, pooling, and activation layers) and assemble them into Bundles. Since our design for DAC-SDC needs to target both GPU and FPGA, we use the resource constraints from the FPGA (more restrictive than the GPU) to evaluate the hardware performance (latency and resource utilization) of each Bundle. To estimate each Bundle's potential accuracy contribution, we build a DNN sketch with fixed front-end and back-end structures and insert one type of Bundle (with replications) in the middle. In our case, the front-end is an input resizing unit while the back-end is a bounding-box regression unit. Each DNN sketch is quickly trained for 20 epochs to get its accuracy. The most promising Bundles, those located on the Pareto curve, are selected for the next stage.
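The Pareto-curve selection over trained Bundle sketches can be outlined as follows; the bundle names, accuracy numbers, and latencies are hypothetical placeholders, not measurements from the paper:

```python
# Hypothetical bundle records: (name, accuracy of the trained DNN sketch,
# latency in ms on the target device). Values are illustrative only.
bundles = [
    ("dwconv3-pwconv1-bn-relu6", 0.68, 10.4),
    ("conv3-bn-relu", 0.70, 22.1),
    ("conv1-dwconv3-bn-relu", 0.64, 9.8),
    ("conv5-bn-relu", 0.71, 35.0),
    ("maxpool-conv3-bn-relu", 0.60, 12.5),
]

def pareto_frontier(candidates):
    """Keep bundles not dominated by any other candidate, i.e., no other
    bundle has strictly higher accuracy AND strictly lower latency."""
    frontier = []
    for name, acc, lat in candidates:
        dominated = any(a > acc and l < lat for _, a, l in candidates)
        if not dominated:
            frontier.append(name)
    return frontier

print(pareto_frontier(bundles))
```

Here the slow, inaccurate `maxpool-conv3-bn-relu` bundle is pruned because another bundle beats it on both axes; the rest form the accuracy-latency trade-off curve passed to stage 2.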
4.2 Stage 2: Hardware-Aware DNN Search
During DNN search, the inputs include the target application (e.g., image classification or object detection), software and hardware metrics (e.g., DNN accuracy and throughput performance), and target hardware platforms. The outputs are DNN candidates which meet the software and hardware requirements running on targeted embedded platforms.
To solve such a multi-objective optimization problem, we propose a group-based particle swarm optimization (PSO) evolutionary algorithm to discover proper DNN candidates. In this algorithm, each individual DNN is regarded as a particle, and all active DNNs during the search contribute to the swarm. Since we use only one type of Bundle per DNN, DNNs composed of the same type of Bundle are considered a particle group. In order to maintain evolution stability, a DNN only evolves within its own group. In our group-based PSO, we label the global optimal DNN as $P_{gbest}$ and the group optimal DNN within the $i$-th group as $P^i_{best}$. We denote a DNN particle within group $i$ as $P^i_j$, and each particle has two tunable dimensions: $ch$, the number of channels of each Bundle replication, and $pl$, the pooling positions between Bundles. Both dimensions affect accuracy and hardware performance. We present the search procedure in Algorithm 1 with the following major components:
Population generation. An initial network population is generated, with $g$ groups and $n$ networks in each group. The search is conducted for $T$ iterations. Within the $t$-th iteration, all networks are trained for $e_t$ epochs, where $e_t$ increases with $t$.
Latency estimation. We perform platform-specific latency estimation. For GPUs, we directly measure the inference latency on the training GPU, and scale the latency to the target GPU for deployment if the target GPU differs from the training one. For FPGAs, we follow the FPGA implementation template in Hao et al. (2019), which is an IP-based mapping strategy. Given a configurable IP pool used for DNN implementation, all DNN layers of the same type share the same hardware computational IP in order to save FPGA resources. To maximize FPGA implementation performance, we configure the IPs to be as large as possible within the available FPGA resources. For each IP under different configurations, such as computation parallelism and buffer size, we collect its hardware resource usage and latency from the high-level synthesis tool. Based on individual IP performance, we adopt the DNN performance model from Hao et al. (2019) to obtain the end-to-end latency and resource usage of a DNN.
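A minimal sketch of how such an IP-based model composes end-to-end latency from per-IP profiles (the IP names and latency values below are placeholders, not HLS results):

```python
# Per-invocation latency of each shared hardware IP, profiled offline.
# These numbers are illustrative placeholders.
ip_latency_ms = {"dwconv3": 1.8, "pwconv1": 2.6, "pool": 0.4}

def dnn_latency(layers):
    """End-to-end latency estimate: layers of the same type reuse the same
    shared IP, so the total is one IP invocation per layer, summed."""
    return sum(ip_latency_ms[layer] for layer in layers)

# A chain of six stacked bundles (DW-Conv3 + PW-Conv1), pooling after the
# first four bundles.
net = ["dwconv3", "pwconv1", "pool"] * 4 + ["dwconv3", "pwconv1"] * 2
print(round(dnn_latency(net), 1))  # → 28.0
```

A real model would also account for buffer transfers and resource limits; this sketch only captures the additive per-IP accounting.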
Fitness value calculation. After each iteration of training and latency estimation, we calculate the fitness value of each network $P^i_j$ as:

$$Fit(P^i_j) = Acc(P^i_j) - \beta \sum_{h \in H} \alpha_h \cdot \frac{\max(0,\, Lat_h(P^i_j) - L_h)}{L_h}$$

where $Acc(P^i_j)$ is the validation accuracy of $P^i_j$; $h$ represents each targeted hardware platform from all candidates $H$; $Lat_h(P^i_j)$ is the estimated latency on hardware $h$; and $L_h$ is the required hardware latency on $h$. The parameters $\alpha_h$ are used to balance the penalty across different platforms, and $\beta$ is used to balance network accuracy against hardware latency. Since FPGA latency is more strictly constrained by the resource budget, we set the FPGA platform factor larger than that of the GPU to prioritize the FPGA implementation.
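One plausible implementation of such a fitness value, consistent with the description above (the exact functional form and all hyperparameter values are assumptions for illustration, not the paper's published equation):

```python
def fitness(acc, latencies, targets, alphas, beta):
    """Sketch of a fitness value: validation accuracy minus a weighted
    latency penalty summed over target platforms. Networks meeting every
    latency target are ranked purely by accuracy."""
    penalty = sum(alphas[h] * max(0.0, latencies[h] - targets[h]) / targets[h]
                  for h in latencies)
    return acc - beta * penalty

# FPGA gets a larger platform factor than GPU to prioritize its constraints.
lat = {"fpga": 45.0, "gpu": 12.0}   # estimated latency (ms), illustrative
req = {"fpga": 40.0, "gpu": 15.0}   # required latency (ms), illustrative
a   = {"fpga": 2.0,  "gpu": 1.0}    # platform balancing factors
print(round(fitness(0.72, lat, req, a, beta=0.5), 3))  # → 0.595
```

Only the FPGA term is penalized here (45 ms > 40 ms), while the GPU already meets its target, so its `max(0, ...)` term vanishes.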
Velocity calculation and particle update. In the standard PSO algorithm, a velocity for each dimension of a particle is calculated based on the global best particle, and the particle position is updated by the velocity vector. In our case, DNNs in the same group are updated based on the group-best particle and have two tunable dimensions. To determine the number of channels ($ch$), we first compute the per-layer difference between the current network and the group best; then we update the number of channels of the current network by a random percentage of the difference to approach the group best. Similarly, to determine the pooling positions ($pl$), we compare the current network with the group best and change a random number of pooling positions to approach the group best.
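The channel-dimension update can be sketched as follows (a simplified, hypothetical version of the group-based PSO step; the layer counts and channel values are illustrative):

```python
import random

def update_channels(current, group_best, rng):
    """Move each layer's channel count a random fraction of the way toward
    the group-best network, so every update stays between the current value
    and the group best."""
    updated = []
    for ch, best_ch in zip(current, group_best):
        step = rng.random() * (best_ch - ch)  # random percentage of the gap
        updated.append(int(round(ch + step)))
    return updated

rng = random.Random(0)
new = update_channels([32, 48, 96, 160], [48, 96, 96, 192], rng)
print(new)
```

Layers whose channel count already matches the group best stay unchanged, while the others drift toward it without overshooting.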
4.3 Stage 3: Feature Addition
We manually add more advanced DNN design features if hardware resources and constraints allow. For DAC-SDC, since most objects are very small, we add a bypass directly from low-level features to high-level features, along with feature-map reordering Redmon and Farhadi (2017), to improve small-object detection. To enhance hardware efficiency, we replace ReLU with ReLU6 Sandler et al. (2018). More discussion is provided in the next section.
5 SkyNet
5.1 SkyNet Architecture
During the design process, the best Bundle is selected as a combination of a 3×3 depth-wise Conv layer (DW-Conv3 Howard et al. (2017)), a 1×1 point-wise Conv layer (PW-Conv1), a batch normalization layer (BN Ioffe and Szegedy (2015)), and a rectified linear unit 6 (ReLU6 Sandler et al. (2018)). By repeatedly stacking this Bundle, we generate the three backbone networks in Table 3 for DAC-SDC. These networks share the same chain structure and bounding-box regression function but have different feature-map bypass configurations. Model A includes no bypass, while in models B and C the output feature maps of Bundle #3 are fed into Bundle #6. To meet the requirements of DAC-SDC object detection, SkyNet adapts the YOLO detector head by removing the classification output and using two anchors for bounding-box regression.
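As a rough illustration of why this Bundle is compact, its parameter count can be compared with a standard 3×3 convolution; the channel sizes are arbitrary examples and bias handling is simplified (biases folded into BN):

```python
def bundle_params(c_in, c_out):
    """Parameter count of one Bundle: 3x3 depthwise conv + 1x1 pointwise
    conv, each followed by batch normalization (scale and shift per
    channel). An illustrative accounting, not the paper's exact tally."""
    dw = 3 * 3 * c_in    # one 3x3 filter per input channel
    bn1 = 2 * c_in       # BN scale/shift after the depthwise conv
    pw = c_in * c_out    # 1x1 pointwise conv mixing channels
    bn2 = 2 * c_out      # BN scale/shift after the pointwise conv
    return dw + bn1 + pw + bn2

# Compare against a standard 3x3 conv with 3*3*c_in*c_out weights.
std = 3 * 3 * 64 * 128
print(std / bundle_params(64, 128))
```

Stacking such Bundles keeps the backbone's total parameter size far below a plain 3×3-conv chain of the same width, which is what makes the chain fit the Ultra96's memory budget.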
5.2 Feature Map Bypass, Reordering, and ReLU6
By examining the DAC-SDC competition training data, we record the size ratio between the output bounding box and the input image and present the distribution in Figure 6. It clearly shows that 91% of the objects to be detected in the DAC-SDC dataset occupy less than 9% of the original input image area, and 31% of them are even smaller than 1% of the input image area. This means the majority of objects in this dataset can be considered small objects, and we need to design a DNN accordingly.
We add feature-map bypass and reordering Redmon and Farhadi (2017) to enhance the ability to detect small objects (models B and C). The bypass helps keep small-object features in the later part of the DNN (closer to the output layer) by adding low-level, high-resolution feature maps. It is also beneficial to have multiple feature maps (from different layers) before generating the bounding boxes. Since the bypass crosses a pooling layer (highlighted in Figure 4), we use reordering (shown in Figure 5) to align the size of the original feature maps (generated by Bundle #5) with the low-level features without losing information.
The other feature used to improve hardware efficiency is ReLU6 Sandler et al. (2018). It is an activation function which clips the output range to [0, 6]. Since ReLU6 produces a much smaller data range than the original ReLU ([0, +∞)), fewer bits are required to represent intermediate FMs. It also helps to better implement lower-precision floating point on embedded GPUs and fixed-point data formats on embedded FPGAs.
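ReLU6 itself is a one-line function; the sketch below also notes the fixed-point implication (the bit-width remark is our own reasoning, not a figure from the paper):

```python
def relu6(x):
    """ReLU6 clips activations to the range [0, 6]."""
    return min(max(x, 0.0), 6.0)

# A bounded output range needs few integer bits: values in [0, 6] fit in
# 3 integer bits, leaving more fractional bits at a given fixed-point width.
print([relu6(v) for v in [-2.0, 0.5, 3.0, 7.5]])  # → [0.0, 0.5, 3.0, 6.0]
```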
|Configurations of SkyNet|
|input (3×160×360 color image)|
|Back-end for bounding box regression|
6 Experiment on DAC-SDC
DAC-SDC features a single-object detection challenge for embedded systems, including embedded GPUs (NVIDIA TX2) and FPGAs (Pynq-Z1 and Ultra96) with very low energy consumption. The competition targets the practical needs of UAV applications, such as real-time processing, energy efficiency, and detection accuracy. To better reflect real-life challenges, the images of the dataset are captured by UAVs and provided by the drone manufacturer DJI. The whole dataset is divided into two parts: a training set of 100,000 images with objects of interest across 12 main categories and 95 sub-categories, and a hidden test set of 50,000 images for official evaluation that only the contest organizers can access DJI (2018). Results generated by SkyNet are shown in Figure 7; as noted, 91% of the targeted objects are smaller than 9% of the input image size. In DAC-SDC'19, 52 GPU teams and 58 FPGA teams participated worldwide, creating a very intense competition. Our SkyNet design delivered the best inference accuracy and total score in both the GPU and FPGA tracks.
6.1 Ablation Study
We perform an ablation study on the DAC-SDC dataset to analyze the three configurations of SkyNet (models A, B, and C listed in Table 3). Combined with the two activation functions (ReLU and ReLU6), six configurations of SkyNet are evaluated. We train these models in an end-to-end fashion using multi-scale training, with the learning rate decaying from 1e-4 to 1e-7, and apply stochastic gradient descent (SGD) to update parameters. To further enrich the training data, we use data augmentation to distort, jitter, crop, and resize inputs to size 160×320. The accuracy results are presented in Table 4, where SkyNet C with ReLU6 reaches the highest IoU (0.741) on the validation set. Therefore, we use this model as the proposed design in the following experiments.
|DNN Model||Parameter Size||IoU|
6.2 DAC-SDC Evaluation Criteria
Comprehensive evaluations are introduced in DAC-SDC covering detection accuracy (IoU), throughput (FPS), and energy consumption Xu et al. (2019). To identify the best design, a total score is calculated as described below.
Assuming there are $K$ registered teams and $I$ images in the test set, the IoU score of team $k$, denoted as $R_{IoU}^k$, is calculated as the average IoU over the test set:

$$R_{IoU}^k = \frac{1}{I} \sum_{i=1}^{I} IoU_{k,i}$$
For energy, $\bar{E}$ denotes the average energy consumption of all entries when performing DNN inference on the test set (Equation 3). The energy score of team $k$ ($ES_k$) is then computed using Equation 4, based on the ratio between the average energy $\bar{E}$ and the energy $E_k$ consumed by the team. The parameter $x$ in this equation is set to 2 for the FPGA track and 10 for the GPU track.
Eventually, the total score of team $k$, denoted as $TS_k$, is calculated using Equation 5, combining inference accuracy ($R_{IoU}^k$) and energy consumption ($ES_k$).
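As a hedged sketch only: Equations 2 through 5 are not reproduced in this text, so the code below uses an assumed log-ratio energy score and a multiplicative combination purely to illustrate how the IoU and energy components interact; it is not the official DAC-SDC formula:

```python
import math

def energy_score(avg_energy, team_energy, base):
    """Energy score sketch: rewards consuming less than the contest average.
    The log-base form and the 0.2 constant are assumptions for
    illustration."""
    return 1.0 + 0.2 * math.log(avg_energy / team_energy, base)

def total_score(iou_score, e_score):
    """Total score sketch combining accuracy and energy (assumed form)."""
    return iou_score * (1.0 + e_score)

# GPU track uses base 10 in this sketch; energies in joules, illustrative.
es = energy_score(avg_energy=10000.0, team_energy=8000.0, base=10)
print(round(total_score(0.731, es), 3))
```

The key behavior is directional and matches the text: an entry consuming less energy than the average gets an energy score above 1, which scales up its IoU-based score.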
|Team Name||IoU||FPS||Power(W)||Total Score|
|Results from 2019|
|Results from 2018|
|Team Name||IoU||FPS||Power (W)||Total Score|
|Results from 2019|
|Results from 2018|
6.3 GPU Implementation
For the TX2 GPU implementation, we keep all network parameters in the 32-bit floating-point format to maintain the best inference accuracy. Since the most compute-intensive parts of DNN inference are handled by NVIDIA cuDNN, which leaves little room for handcrafted improvement, we instead optimize our design at the system level.
Running SkyNet involves four steps: 1) input fetching from flash storage in units of batches; 2) image pre-processing, which includes input resizing and normalization; 3) DNN inference; and 4) post-processing to generate bounding boxes and buffer results in DDR memory. The most straightforward approach executes these four steps serially, at the cost of low resource utilization and poor throughput. In our design, we first merge steps 1 and 2 into a pre-processing stage and use multithreading to execute the stages in a pipelined fashion, as shown in Figure 10. We use the NVIDIA System Profiler (L4T) to capture latency results. On average, the proposed system-level optimizations deliver a 3.35X speedup over the original serial design and help our design reach the highest throughput, peaking at 67.33 FPS.
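The pipelined execution can be sketched with standard Python threads and queues; the stage functions are stand-ins, since a real deployment would call the actual pre-processing and cuDNN inference routines:

```python
import queue
import threading

def stage(fn, inbox, outbox):
    """Run one pipeline stage: pull items, process, push downstream.
    A None item is the poison pill that shuts the stage down."""
    while True:
        item = inbox.get()
        if item is None:
            outbox.put(None)
            break
        outbox.put(fn(item))

pre   = lambda b: ("resized",) + b[1:]   # stand-in for resize + normalize
infer = lambda b: ("boxes",) + b[1:]     # stand-in for DNN inference
post  = lambda b: b                      # stand-in for box post-processing

q0, q1, q2, q3 = (queue.Queue() for _ in range(4))
threads = [threading.Thread(target=stage, args=a)
           for a in [(pre, q0, q1), (infer, q1, q2), (post, q2, q3)]]
for t in threads:
    t.start()

for batch in range(4):          # batch i+1 is pre-processed while batch i
    q0.put(("batch", batch))    # is still in inference
q0.put(None)

results = []
while (r := q3.get()) is not None:
    results.append(r)
print(len(results))  # → 4
```

The throughput win comes from overlap: with three stages running concurrently, steady-state throughput is bounded by the slowest stage rather than the sum of all stages.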
6.4 FPGA Implementation
Implementing SkyNet on the FPGA faces an even scarcer resource budget, as the peak performance of the Ultra96 FPGA (144 GOPS @ 200MHz) is much lower than that of the TX2 GPU (665 GFLOPS @ 1300MHz). With the proposed bottom-up design flow, hardware limitations have already been captured by the Bundle design, and the Bundle is instantiated on the FPGA as a single customized hardware IP. Since SkyNet is structured from the same type of Bundle, this IP can be shared across different SkyNet layers to cope with the resource constraints. Still, we need more optimization strategies to further enhance performance.
6.4.1 Quantization, Batch, and Tiling
|Scheme||Feature Map||Weight||Accuracy (IoU)|
|1||9 bits||11 bits||0.727|
|2||9 bits||10 bits||0.714|
|3||8 bits||11 bits||0.690|
|4||8 bits||10 bits||0.680|
Since fixed-point representations are more favorable in FPGA designs, we quantize the FMs and weights from floating point to fixed point and explore the quantization schemes in Table 7. After quantization, the same SkyNet model suffers accuracy drops ranging from 1.4% to 6.1% across schemes 1 to 4. Since accuracy has a higher weight in the total score calculation (Equation 5), we pick scheme 1 as the quantization design for SkyNet.
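A minimal sketch of the post-training fixed-point quantization step; the split between integer and fractional bits below is an assumption, since Table 7 only lists total bit widths:

```python
def quantize(x, total_bits, frac_bits):
    """Round x onto a signed fixed-point grid with the given total width and
    fractional bit count, saturating at the representable range."""
    scale = 1 << frac_bits
    lo = -(1 << (total_bits - 1))        # most negative code
    hi = (1 << (total_bits - 1)) - 1     # most positive code
    q = max(lo, min(hi, round(x * scale)))
    return q / scale

# Scheme 1 in Table 7 uses 11-bit weights; an 8-bit fractional split here
# is purely illustrative.
print(quantize(0.30127, total_bits=11, frac_bits=8))  # → 0.30078125
```

Per-value rounding error is bounded by half a least-significant bit, and values outside the representable range saturate instead of wrapping, which is the behavior fixed-point DNN hardware typically wants.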
To exploit data reuse opportunities, input batching is a common technique that increases the amount of input workload. With a larger batch size, network inference demands more FPGA on-chip memory (BRAM) to buffer intermediate FMs. Since our implementation is based on an IP-shared structure, buffers instantiated on the FPGA are shared by different layers; a buffer may therefore be too small for the FMs generated by the first few layers while too large for the last few, as FMs shrink after pooling. To overcome this problem, we propose the input tiling and batching scheme shown in Figure 9. Four inputs are stitched together to form a larger input which can be processed as a whole. With this tiling and batching process, one shared buffer can be used across different layers without changing its size. The proposed solution inherits the benefit of batching, allowing better reuse of DNN weights, and eliminates the potential waste of unused buffer space.
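The 2×2 stitching can be sketched on nested-list "images" (the shapes and pixel values are illustrative):

```python
def stitch_2x2(imgs):
    """Stitch four H x W inputs (nested lists) into one 2H x 2W tile so a
    single shared buffer processes them as one input (a sketch of the
    tiling-and-batching scheme)."""
    a, b, c, d = imgs
    top = [ra + rb for ra, rb in zip(a, b)]       # a | b
    bottom = [rc + rd for rc, rd in zip(c, d)]    # c | d
    return top + bottom

# Four 2x2 "images" become one 4x4 tile.
imgs = [[[i] * 2 for _ in range(2)] for i in range(4)]
tile = stitch_2x2(imgs)
print(len(tile), len(tile[0]))  # → 4 4
```

A real implementation must also keep the tiles independent at tile borders (e.g., padding at the seams) so convolutions do not mix pixels from neighboring images; the sketch only shows the layout.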
6.4.2 Task Partitioning
To fully utilize the available computational resources, we also implement task partitioning on the Ultra96. The whole design, shown in Figure 10, is highly similar to our GPU design: workloads are distributed to both the CPU and the FPGA, creating a system-level pipeline. With all three tasks (pre-processing, SkyNet inference, and post-processing) overlapped, our FPGA design reaches 25.05 FPS.
6.5 Result Comparison
After implementing SkyNet on the GPU and FPGA following the strategies in Sections 6.3 and 6.4, our designs were evaluated by the DAC-SDC organizers using the hidden test set. As shown in Tables 5 and 6, we compare against the top-3 teams in DAC-SDC'19 and '18. Our GPU design outperforms all other competitors, delivering the best accuracy (0.731), throughput (67.33 FPS), and total score (1.504). Our FPGA design also reaches the best accuracy and the highest total score.
7 SkyNet Extension on GOT-10K
Since SkyNet can deliver real-time object detection on embedded systems, we set up experiments on the GOT-10k benchmark Huang et al. (2018) to demonstrate its potential for object tracking. GOT-10k is a large, high-diversity database for generic object tracking with rich motion trajectories and wide coverage of object classes. Models are evaluated with two metrics in the GOT-10k benchmark: average overlap (AO) and success rate (SR). Average overlap is defined as the mean intersection over union (IoU) between the predicted and ground-truth bounding boxes, while success rate is defined as the proportion of predictions whose IoU exceeds some threshold. With its open evaluation server, we can compare SkyNet against conventional backbones.
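The two GOT-10k metrics are straightforward to compute from per-frame IoUs (the IoU values below are made up for illustration):

```python
def average_overlap(ious):
    """AO: mean IoU between predicted and ground-truth boxes."""
    return sum(ious) / len(ious)

def success_rate(ious, threshold=0.5):
    """SR: fraction of frames whose IoU exceeds the threshold."""
    return sum(1 for i in ious if i > threshold) / len(ious)

ious = [0.82, 0.41, 0.66, 0.09, 0.73]
print(round(average_overlap(ious), 3), success_rate(ious))  # → 0.542 0.6
```

AO rewards consistently tight boxes, while SR at a fixed threshold counts how often the tracker stays on the target at all, which is why the benchmark reports both.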
7.1 Evaluation Using SiamRPN++
Siamese trackers locate the object through the correlation between features extracted from the exemplar image and the search image, so the quality of the feature extractor plays an important role. SiamRPN++ Li et al. (2019a) is the first Siamese tracker proven to profit from backbones of different capacities as long as they are properly trained. To evaluate the performance of different backbones, we trained the networks on GOT-10k with learning rates from 1e-3 to 1e-5, setting the exemplar and search images to sizes 128/127 and 256/255 for SkyNet and the other backbones, respectively. Results are shown in Table 8.
7.2 Evaluation Using SiamMask
SiamMask Wang et al. (2019) is another Siamese tracker that outperforms SiamRPN++ under the same configuration. However, segmentation annotations are required during its training, so it cannot be trained directly on the GOT-10k dataset. Instead, we train on the YouTube-VOS dataset Xu et al. (2018) to compare the performance of different backbones under this structure. The networks are trained with learning rates from 1e-3 to 1e-4, with exemplar size 128/127 and search size 256/255 for SkyNet and ResNet-50, respectively. The results are shown in Table 9.
8 Conclusion
In this paper, we proposed SkyNet, a hardware-efficient method to generate compact DNNs for object detection on embedded GPUs and embedded FPGAs. SkyNet features a novel bottom-up DNN design flow which captures hardware limitations using realistic hardware feedback and delivers DNNs with a good balance between software and hardware metrics, such as inference accuracy and throughput. SkyNet was demonstrated in the 56th IEEE/ACM DAC-SDC low power object detection challenge and won the first-place award in both the GPU and FPGA tracks. We also extended SkyNet to the object tracking task, where it delivered 1.60X and 1.73X higher FPS and a 37.20X smaller parameter size with comparable accuracy compared to Siamese trackers with a ResNet-50 backbone.
Acknowledgments
This work was supported in part by the IBM-Illinois Center for Cognitive Computing Systems Research (C3SR), a research collaboration as part of the IBM AI Horizons Network.
References
- Fused-layer CNN accelerators. In Proceedings of the International Symposium on Microarchitecture, Cited by: §1.
- Proxylessnas: direct neural architecture search on target task and hardware. arXiv preprint arXiv:1812.00332. Cited by: §2.2.
- Cascade R-CNN: high quality object detection and instance segmentation. Cited by: §2.
- Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks. In IEEE International Solid-State Circuits Conference (ISSCC), Cited by: §1.
- Decoupled classification refinement: hard false positive suppression for object detection. Cited by: §2.
- Revisiting RCNN: on awakening the classification power of faster RCNN. In Proceedings of the European Conference on Computer Vision (ECCV), Cited by: §2.
- Binarized neural networks: training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830. Cited by: §1.
- R-FCN: object detection via region-based fully convolutional networks. In Advances in neural information processing systems, Cited by: §2.
- NeST: a neural network synthesis tool based on a grow-and-prune paradigm. IEEE Transactions on Computers. Cited by: §1.
- Chamnet: towards efficient network design through platform-aware model adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.2.
- DAC-SDC’19 3rd place winner in GPU track. Cited by: Table 1, Table 5.
- DAC-SDC’18 2nd place winner in GPU track. Note: https://github.com/jndeng/DACSDC-DeepZ, accessed 2019-09-01. Cited by: Table 1, Table 5.
- Auto-balanced filter pruning for efficient convolutional neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Cited by: §1.
- The DAC-SDC dataset for low power object detection. Note: http://www.cse.cuhk.edu.hk/~byu/2019-DAC-SDC/index.html, accessed 2019-09-04. Cited by: §6.
- PPP-net: platform-aware progressive search for pareto-optimal neural architectures. Cited by: §2.2.
- NVIDIA Jetson TX2 delivers twice the intelligence to the edge. NVIDIA Accelerated Computing | Parallel Forall. Cited by: §1.
- Ternary hybrid neural-tree networks for highly constrained iot applications. Cited by: §1.
- Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. In Proceedings of International Conference on Learning Representations (ICLR), Cited by: §1.
- Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, Cited by: §1.
- DAC-SDC’18 3rd place winner in FPGA track. Note: https://github.com/onioncc/iSmartDNN, accessed 2019-09-01. Cited by: Table 1, Table 6.
- FPGA/dnn co-design: an efficient design methodology for iot intelligence on the edge. In Proceedings of the 56th Annual Design Automation Conference 2019, pp. 206. Cited by: §4.2.
- Mask R-CNN. In Proceedings of the IEEE international conference on computer vision (ICCV), Cited by: §2.
- Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Cited by: §1, §1, Table 2.
- Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. Cited by: §1, §2.1, §5.1.
- Searching for mobilenetv3. In Proceedings of the International Conference on Computer Vision, Cited by: §2.2.
- GOT-10k: a large high-diversity benchmark for generic object tracking in the wild. arXiv preprint arXiv:1810.11981. Cited by: §7.
- SqueezeNet: alexnet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360. Cited by: §2.1.
- Batch normalization: accelerating deep network training by reducing internal covariate shift. Cited by: §5.1.
- In-datacenter performance analysis of a tensor processing unit. In Proceedings of International Symposium on Computer Architecture (ISCA), Cited by: §1.
- DAC-SDC’19 3rd place winner in FPGA track. Cited by: Table 1, Table 6.
- DAC-SDC’18 2nd place winner in FPGA track. Note: https://github.com/fpgasystems/spooNN, accessed 2019-09-01. Cited by: Table 1, Table 6.
- Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, Cited by: §1.
- CornerNet: detecting objects as paired keypoints. Lecture Notes in Computer Science, pp. 765–781. Cited by: §2.
- SiamRPN++: evolution of siamese visual tracking with very deep networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Cited by: §2, §7.1.
- Ternary weight networks. arXiv preprint arXiv:1605.04711. Cited by: §1.
- Scale-aware trident networks for object detection. Cited by: §2.
- Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Cited by: §2.
- Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision (ICCV), Cited by: §2.1, §2.
- SSD: single shot multibox detector. In Proceedings of the European conference on computer vision (ECCV), Cited by: §2.1, §2.
- DAC-SDC’18 1st place winner in GPU track. Note: https://github.com/lvhao7896/DAC2018, accessed 2019-09-01. Cited by: Table 1, Table 5.
- Thinet: a filter level pruning method for deep neural network compression. In Proceedings of the IEEE international conference on computer vision (ICCV), Cited by: §1.
- Going deeper with embedded FPGA platform for convolutional neural network. In Proceedings of International Symposium on Field-Programmable Gate Arrays (FPGA), Cited by: §1.
- Xnor-net: imagenet classification using binary convolutional neural networks. In Proceedings of European Conference on Computer Vision, Cited by: §1.
- You only look once: unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Cited by: §2.1, §2.
- YOLO9000: better, faster, stronger. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Cited by: §2.1, §4.3, §5.2.
- ADMM-nn: an algorithm-hardware co-design framework of dnns using alternating direction methods of multipliers. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Cited by: §1.
- Mobilenetv2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §4.3, §5.1, §5.2.
- Overfeat: integrated recognition, localization and detection using convolutional networks. Cited by: §2.
- Improving object detection from scratch via gated feature reuse. Cited by: §2.
- Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §1, §1, Table 2.
- Single-path nas: device-aware efficient convnet design. arXiv preprint arXiv:1905.04159. Cited by: §2.2.
- Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Cited by: §1.
- Mnasnet: platform-aware neural architecture search for mobile. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.2.
- Siamese instance search for tracking. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §2.
- FCOS: fully convolutional one-stage object detection. Cited by: §2.
- StrassenNets: deep learning with a multiplication budget. Cited by: §1.
- End-to-end representation learning for correlation filter based tracking. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §2.
- Efficient inference with tensorrt. Cited by: Table 1.
- Design flow of accelerating hybrid extremely low bit-width neural network in embedded FPGA. In Proceedings of the International Conference on Field Programmable Logic and Applications (FPL), pp. 163–169. Cited by: §1.
- Learning attentions: residual attentional siamese network for high performance online visual tracking. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4854–4863. Cited by: §2.
- Fast online object tracking and segmentation: a unifying approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2, §7.2.
- Fbnet: hardware-aware efficient convnet design via differentiable neural architecture search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.2.
- DAC-SDC’19 2nd place winner in GPU track. Cited by: Table 1, Table 5.
- Resource constrained neural network architecture search. In Proceedings of the International Conference on Computer Vision, Cited by: §2.2.
- Youtube-vos: a large-scale video object segmentation benchmark. In European Conference on Computer Vision (ECCV), Cited by: §7.2.
- DAC-SDC low power object detection challenge for UAV applications. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §2.1, §6.2.
- DAC-SDC’18 3rd place winner in GPU track. Note: https://github.com/xiaoyuuuuu/dac-hdc-2018-object-detection-in-Jetson-TX2, accessed 2019-09-01. Cited by: Table 1, Table 5.
- DAC-SDC’18 1st place winner in FPGA track. Note: https://github.com/hirayaku/DAC2018-TGIIF, accessed 2019-09-01. Cited by: Table 1, Table 6.
- Shufflenet: an extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.1.
- SkyNet: a champion model for dac-sdc on low power object detection. arXiv preprint arXiv:1906.10327. Cited by: footnote †.
- High-performance video content recognition with long-term recurrent convolutional network for FPGA. In Proceedings of the International Conference on Field Programmable Logic and Applications (FPL), pp. 1–4. Cited by: §1.
- DNNBuilder: an automated tool for building high-performance DNN hardware accelerators for FPGAs. In Proceedings of International Conference on Computer-Aided Design (ICCAD), Cited by: §1.
- Hello edge: keyword spotting on microcontrollers. arXiv preprint arXiv:1711.07128. Cited by: §1.
- DAC-SDC’19 2nd place winner in FPGA track. Cited by: Table 1, Table 6.
- Objects as points. Cited by: §2.
- Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578. Cited by: §2.2.
- Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, Cited by: §2.2.