EdgeNet: Balancing Accuracy and Performance for Edge-based Convolutional Neural Network Object Detectors

11/14/2019 ∙ by George Plastiras, et al. ∙ University of Cyprus

Visual intelligence at the edge is becoming a growing necessity for low-latency applications and situations where real-time decisions are vital. Object detection, the first step in visual data analytics, has enjoyed significant improvements in state-of-the-art accuracy due to the emergence of Convolutional Neural Networks (CNNs) and Deep Learning. However, such complex paradigms introduce increasing computational demands and hence prevent their deployment on resource-constrained devices. In this work, we propose a hierarchical framework that detects objects in high-resolution video frames and maintains the accuracy of state-of-the-art CNN-based object detectors, while outperforming existing works in terms of processing speed when targeting a low-power embedded processor, by using an intelligent data reduction mechanism. Moreover, a use case for pedestrian detection from an Unmanned Aerial Vehicle (UAV) is presented, showing the impact that the proposed approach has on sensitivity, average processing time, and power consumption when implemented on different platforms. Using the proposed selection process, our framework manages to reduce the processed data by 100x, leading to under 4W power consumption on different edge devices.




1. Introduction

Visual intelligence is a rapidly growing field that can provide improved high-level understanding of the environment. Computer vision algorithms, in particular, are increasingly employed on mobile/edge devices that support high-resolution cameras. Applications such as emergency response, disaster management and recovery, and monitoring of critical infrastructures can all benefit from real-time video analytics. In many cases, connectivity to a cloud service may be unreliable or not available at all for such applications. Furthermore, processing information on-board can eliminate security issues that arise when transmitting sensitive information. Hence, on-board processing is highly desirable at the edge.

In particular, object detection, the first step in visual data analytics, has recently enjoyed significant accuracy and performance improvements due to the emergence of deep learning and the technological advances in Graphics Processing Units (GPUs), respectively. However, such complex paradigms introduce increasing computational demands and are not traditionally implemented on resource-constrained devices.

Convolutional Neural Networks (CNNs) build hierarchical representations that can efficiently perform a variety of vision tasks such as detection, recognition and segmentation (Hinton et al., 2012), (He et al., 2017). To facilitate the mapping of CNNs on resource-constrained devices, recent works have focused on co-designing for high task-level accuracy and low computational complexity. This has been addressed from different aspects, with emphasis on precision reduction, network pruning, and compression, as well as compact network design. However, such optimizations work mostly on small, fixed image sizes and do not consider applications, such as Unmanned Aerial Vehicles (UAVs), that need to work on higher-resolution images. As such, there is still a need to combine improvements in CNN architectures and design techniques with intelligent data reduction to maximize the efficiency of CNNs for such applications.

Thus, our contribution focuses on an intelligent way to reduce the processed data on larger-scale images by using the proposed framework, which can work with any predefined architecture, leading to an increase in both the accuracy and the overall performance of the system. We propose a way of focusing only on promising regions and examine the impact of building resolution-optimized networks to further improve the computation and accuracy trade-offs, as shown in Fig. 1.

Figure 1. Proposed tiles for processing based on the selection process.

The proposed EdgeNet framework consists of three main stages:

  • An optimized CNN, called DroNet_V3, that is lightweight and operates on lower-resolution input to provide initial estimates for object positions

  • A pool of per-scale and per-region-size optimized CNNs, called DroNet_Tile, out of which the most suitable for processing are selected at each time instance based on statistical metrics

  • An optical-flow tracker to compensate for the increasing demands of the previous stages and speed up the whole process

The proposed framework was evaluated and compared with state-of-the-art object detectors using a pedestrian dataset of Unmanned Aerial Vehicle (UAV)-captured images, on an i5 CPU and two ARM-based CPUs on different platforms. Through the analysis on our test dataset, we demonstrate that the detection accuracy can improve considerably, along with a reduction in the energy consumption of the system and an increase in performance compared to state-of-the-art CNNs. EdgeNet is able to maintain the accuracy of a high-end implementation while outperforming existing works in terms of processing speed and energy consumption when targeting a low-power embedded processor implementation, without changing the structure of an existing network, just by intelligently selecting regions of the image.

Figure 2. Overview of EdgeNet

2. CNN inference at the Edge

Convolutional Neural Networks have shown remarkable promise in a variety of scenarios with impressive accuracy and performance. In most cases, this comes at the cost of high computational, power and memory requirements. In typical application scenarios, these CNNs run on powerful GPUs that consume a lot of power. In response to the excessive resource demands of CNNs, the traditional approach is to use powerful cloud datacenters for training and evaluating CNNs (Huang et al., 2017). Input data generated by mobile devices are sent to the cloud for processing, and the results are sent back to the mobile devices after inference. This cannot be applied in some cases, such as search and rescue missions in remote areas or natural disasters where the network grid might not be available. With the advancement of technology and powerful devices such as Jetson by Nvidia (Nvidia, 2019) and Edge TPU by Google (Google, 2019), which can analyze real-time data at the edge, a new wealth of possibilities opens up for potential applications, including sensing the user's immediate environment, navigating, assisting medical professionals, and home automation (Chinta et al., 2014), (Billinghurst and Starner, 1999).

However, in some cases, particularly for high-resolution image processing, the use of deep neural networks on devices like mobile phones or smart watches is challenging, since model sizes are large and do not fit in the limited memory available on such devices. Recent works aim to minimize the size of neural networks while maintaining accuracy, using strategies such as down-sampling and filter count reduction (Howard et al., 2017), (Iandola et al., 2016). Other works (Ravi, 2018), (Ravi, 2017) focus on creating specialized frameworks to compress neural network models, using state-of-the-art techniques such as pruning the weights and operations that are least useful for prediction, and quantization, which reduces the number of bits used for model weights and activations.

Other approaches look at optimizations beyond the CNN itself (Ren et al., 2017), (Erhan et al., 2013). These CNNs work on a region proposal basis, where a small network slides over a convolutional feature map in order to generate proposals for the region where an object lies. Different anchor boxes are proposed for each position of an image, and these are examined by a classifier and regressor to check for the occurrence of objects. On the other hand, some approaches try to look at the image only once (Redmon and Farhadi, 2017), (Liu et al., 2015) and predict these boxes without the two-stage approach of region proposal and the large number of proposals that need to be processed. Moreover, in (Plastiras et al., 2018), we presented a Selective Tile Processing approach where, instead of resizing the input image and processing it with a CNN, we selected only regions of the image for processing, using a static separation of the input image into same-sized tiles based on the image and CNN input sizes.

To this end, in this work we focus on techniques beyond CNN optimization, in order to intelligently reduce the data that need to be processed by a CNN and enable real-time processing of high-resolution images on mobile/edge devices. In particular, we focus on techniques that reduce both the large number of proposals of Region Proposal networks and the number of times an image is resized in Single-Shot networks. Based on the Selective Tile Processing approach (Plastiras et al., 2018), we propose a framework that evaluates and dynamically selects regions of the image based on statistical metrics gathered from previous frames. In particular, we are able to use smaller CNN structures that can efficiently utilize the tiling approach and avoid the static separation of the input image.

3. Proposed Approach

We propose EdgeNet, a framework based on multiple CNN detectors that aims to improve both the accuracy and the processing time, along with reducing the power consumption, of an edge-based detector on high-resolution images. Moreover, we present an evaluation of different algorithmic parameters and configurations, that is, the number of frames the framework spends at each stage before moving to the next stage, in order to analyze the impact on both the performance and the accuracy of the detections. To this end, Fig. 2 shows the pipeline of the EdgeNet framework, which consists of three stages. A detailed description of each stage is given below:

CNN           Input Size (pixels)   Processing Time (sec)
DroNet_V3     512                   0.08
DroNet_Tile   512                   0.03
DroNet_Tile   416                   0.02
DroNet_Tile   352                   0.014
DroNet_Tile   256                   0.008
DroNet_Tile   128                   0.002
Table 1. Processing time of DroNet_V3 and DroNet_Tile for different input sizes, indicating the pool of CNNs

The first stage of our framework is responsible for producing the initial positions of objects in a frame; thus, an appropriate method must be selected that is accurate enough to steer the framework in the right direction. For this task, we used an efficient Convolutional Neural Network designed for edge applications (Kyrkou et al., 2018). We extend the structure of this network by up-sampling feature maps from earlier layers to detect objects at multiple scales (Redmon and Farhadi, 2018), which improves the accuracy of the detector for smaller objects such as pedestrians. We refer to the resulting network as DroNet_V3. This stage works in the traditional way: the input image is resized and passed through DroNet_V3, and the produced detections are saved as a set of bounding boxes, where each box corresponds to an object in the image.
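The bookkeeping of this stage can be illustrated with a minimal sketch that maps boxes predicted on the resized network input back to full-frame coordinates. The `Box` type, the function name, and the assumption that the frame is stretched to a square input without letterboxing are hypothetical and are not taken from the original implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box: top-left corner plus width and height."""
    x: float
    y: float
    w: float
    h: float


def to_frame_coords(box, net_size, frame_w, frame_h):
    """Rescale a box from a square net_size x net_size input to the frame.

    Assumes the frame was stretched to the network input (no letterboxing),
    so each axis has its own scale factor.
    """
    sx = frame_w / net_size
    sy = frame_h / net_size
    return Box(box.x * sx, box.y * sy, box.w * sx, box.h * sy)


# Hypothetical detection on a 512-pixel input, mapped to a 1920x1080 frame.
frame_box = to_frame_coords(Box(256, 128, 32, 64), 512, 1920, 1080)
```

Keeping the saved boxes in full-frame coordinates means the later stages can reason about tiles directly on the high-resolution image.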

The second stage of the proposed framework is responsible for reducing the data that need to be processed by a CNN detector. The idea is to select different regions of the image, referred to as tiles, in order to find the minimum image region that needs to be processed, based on the detected positions of the objects at prior time instances. To illustrate the whole process, we use the proposed CNN (Kyrkou et al., 2018), which operates on different input sizes depending on the tile size, and refer to it as DroNet_Tile.

Prior to the selection process, it is necessary to profile and benchmark CNNs with different input sizes, between 128 and 512 pixels in our case, as shown in Table 1. These CNNs make up a pool out of which the best ones are chosen at every time instance to guarantee the minimum processing time.
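A profiling pass of this kind could be sketched as follows. The `run_inference` callable is a hypothetical stand-in for a forward pass of a CNN at a given input size, and the warm-up call and repeat count are assumptions rather than details from the original benchmarking setup.

```python
import time


def profile_pool(run_inference, input_sizes, repeats=10):
    """Return {input_size: mean seconds per call} for each size in the pool.

    run_inference is a hypothetical callable taking the input size; in
    practice it would run one forward pass of the corresponding CNN.
    """
    timings = {}
    for size in input_sizes:
        run_inference(size)  # warm-up call, excluded from the measurement
        start = time.perf_counter()
        for _ in range(repeats):
            run_inference(size)
        timings[size] = (time.perf_counter() - start) / repeats
    return timings
```

The resulting dictionary plays the role of Table 1: a lookup from tile size to expected processing time that the selection process can consult at run time.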

In addition, we also utilize the number of objects in a tile as a factor to guide the selection. This procedure requires identifying candidate tiles that cover each object. Hence, for each detected box proposed by the first stage (Fig. 2), a number of tiles are generated by positioning the object at each of the four tile corners, as shown in Fig. 3. In addition, tiles of different sizes are generated, matching the different input sizes in the CNN pool, as shown in Table 1. Each object therefore receives a set of candidate tiles (four corner positions for each tile size), and each tile is evaluated by the selection process based on the objects that it covers and its associated processing time. Thus, for each of the tiles proposed per object we calculate an Effective Processing Time (EPT), which is its corresponding processing time (Table 1) divided by the number of objects that it covers. From the proposed tiles per object we select the one with the minimum EPT. Finally, we combine the tiles extracted for all objects, discard the redundant ones (i.e., those that cover the same or fewer objects), and retain only those with the minimum EPT.
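The selection procedure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the EPT is read as processing time per covered object (so that lower is better), that boxes are `(x, y, w, h)` tuples in frame coordinates, that a tile covers an object only if the whole box lies inside it, and that tiles are clamped to the frame. The timings are the DroNet_Tile values from Table 1; all function names are hypothetical.

```python
# Per-tile processing times (seconds) for DroNet_Tile, copied from Table 1.
TILE_TIME = {512: 0.03, 416: 0.02, 352: 0.014, 256: 0.008, 128: 0.002}


def propose_tiles(box, frame_w, frame_h):
    """Candidate (x, y, size) tiles with the box placed at each tile corner."""
    x, y, w, h = box
    tiles = []
    for size in sorted(TILE_TIME):
        if w > size or h > size or size > min(frame_w, frame_h):
            continue  # the object or the frame cannot accommodate this size
        for tx in (x, x + w - size):
            for ty in (y, y + h - size):
                # Clamp so the tile stays inside the frame (an assumption).
                tx_c = min(max(tx, 0), frame_w - size)
                ty_c = min(max(ty, 0), frame_h - size)
                tiles.append((tx_c, ty_c, size))
    return tiles


def covered(tile, boxes):
    """Indices of the boxes that lie entirely inside the tile."""
    tx, ty, size = tile
    return frozenset(
        i for i, (x, y, w, h) in enumerate(boxes)
        if tx <= x and ty <= y and x + w <= tx + size and y + h <= ty + size
    )


def ept(tile, boxes):
    """Processing time per covered object (assumed reading of the EPT)."""
    return TILE_TIME[tile[2]] / len(covered(tile, boxes))


def select_tiles(boxes, frame_w, frame_h):
    """Pick the minimum-EPT tile per object, then drop redundant tiles."""
    best = {}  # covered object set -> fastest tile chosen for that set
    for box in boxes:
        candidates = propose_tiles(box, frame_w, frame_h)
        if not candidates:
            continue  # object larger than every tile size in the pool
        tile = min(candidates, key=lambda t: ept(t, boxes))
        cov = covered(tile, boxes)
        if cov not in best or TILE_TIME[tile[2]] < TILE_TIME[best[cov][2]]:
            best[cov] = tile
    # Discard tiles whose objects are all covered by another selected tile.
    return [t for cov, t in best.items() if not any(cov < o for o in best)]


# Hypothetical detections: two nearby pedestrians and one far away.
boxes = [(100, 100, 40, 80), (180, 120, 40, 80), (1500, 800, 40, 80)]
tiles = select_tiles(boxes, 1920, 1080)
total_time = sum(TILE_TIME[size] for _, _, size in tiles)
```

In this made-up scene the two nearby pedestrians share one small tile and the third gets its own, so the summed tile time stays well below the 0.08 s that Table 1 lists for a full DroNet_V3 pass.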

For the example in Fig. 1, four tiles are selected by the selection process. Each selected tile is processed by the appropriate CNN from the pool, based on its size. As a result, the processing time using the selected tiles is lower than when using DroNet_V3 on the whole image, which shows a significant impact on performance even in this simple example.

Figure 3. Different tile proposals (panels a to d), with respect to the position and size of the tiles, for an object in the image.

The third and final stage of the proposed framework is the Lucas-Kanade optical flow tracker (Lucas and Kanade, 1981). The Lucas-Kanade tracker works on the principle that the motion between two consecutive images is approximately constant in the local neighborhood of the tracked object. This tracker was selected for its fast execution time, even with a large number of tracked points in the image. It is worth mentioning that any other tracker can also be used at this stage, depending on application requirements such as the trade-off between accuracy and speed. This stage is used for two main reasons: 1) to track the objects alongside the first two stages, so that the position of each detected object can be compared and verified using both tracking and detection algorithms, and 2) to reduce the processing time of the framework by using only the tracker for some frames before running detection on the whole image again. To use this tracker, a center point must be calculated for each detected box in the frame, based on the detections of the preceding stages. These points, along with the corresponding frame, are used to initialize the tracker. Each time the tracker is called, it uses the previous and current frames to calculate the optical flow of the points that correspond to the objects, and returns the estimated new position of each object. Based on the application requirements and the processing platform, a specific time-slot combination is selected to obtain a good trade-off between accuracy and performance. This combination determines how many times each stage is executed in the process loop, as described in Section 5.2.
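The process loop implied by a time-slot combination, and the center points handed to the tracker, can be sketched as below. Representing a configuration as a triple of frame counts `(n1, n2, n3)` and boxes as `(x, y, w, h)` tuples are assumptions made for illustration; the tracker itself is not shown.

```python
from itertools import cycle, islice


def stage_schedule(n1, n2, n3):
    """Repeating sequence of stage ids for one time-slot combination.

    n1, n2 and n3 are the numbers of consecutive frames spent in the
    full-frame detector, the tile-based detectors and the tracker.
    """
    return cycle([1] * n1 + [2] * n2 + [3] * n3)


def box_center(box):
    """Center point of an (x, y, w, h) box, used to initialize the tracker."""
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)


# Hypothetical configuration: 1 detection frame, 2 tiling frames, 3 tracking frames.
first_frames = list(islice(stage_schedule(1, 2, 3), 8))
```

Under this representation, longer tiling and tracking windows postpone the next full-frame detection, which is the trade-off examined in Section 5.2.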

4. Training Dataset For UAV Case Study

Images were collected from manually annotated UAV video footage and from the UCF Aerial Action Data Set (UCF, 2011) in order to train DroNet_V3, DroNet_Tile, and the detector of (Redmon and Farhadi, 2018) to detect pedestrians in a variety of scenarios and under different conditions with regard to illumination, viewpoint, occlusion, and backgrounds. We used Darknet (Redmon, 2016), a C- and CUDA-based neural network framework, to train, test and evaluate each CNN on different platforms. Each CNN that we tested was trained on the Titan Xp GPU on the same dataset.

5. Evaluation and Experimental results

In this section, we present an extensive evaluation of the proposed framework for different configurations. The configurations differ in the amount of time (number of frames) that is allocated to each stage, and each configuration is denoted by the number of frames allotted to each stage. Moreover, we present an extensive evaluation and comparison with three different single-shot models that vary in terms of computational complexity. In this way we demonstrate that any approach not utilizing some form of tiling and dynamic selection exhibits an accuracy drop because of the reduced image resolution. We also compare them on three different computational platforms that facilitate different use cases. The CNNs were trained and tested on the same dataset for various input sizes and compared initially on a low-end laptop CPU, and then ported to two embedded platforms: an Odroid device (Samsung Exynos-5422 Cortex-A15 2GHz and Cortex-A7 octa-core CPUs with Mali-T628 MP6 GPU) and a Raspberry Pi 3 (quad-core 1.2GHz Broadcom 64-bit CPU). All evaluations use the same constructed aerial-view pedestrian dataset, which consists of sequential images containing pedestrians.

5.1. Metrics

The different approaches are analyzed and evaluated on the same test dataset using the following metrics:

Sensitivity (SEN): This metric is defined as the proportion of actual objects (positives) that are correctly identified by the detector. It is widely used as an accuracy metric that returns the percentage of correctly detected objects. It is calculated from the True Positives ($TP$) and False Negatives ($FN$) of the detected objects, as given by (1).

$$SEN = \frac{TP}{TP + FN} \qquad (1)$$

Average Processing Time (APT): To evaluate and compare the performance of each network, we use the average processing time metric, which shows the time needed to process a single frame from a sequence of images. Specifically, this metric is the average processing time across all $N$ test images, where $t_i$ is the processing time for image $i$.

$$APT = \frac{1}{N} \sum_{i=1}^{N} t_i \qquad (2)$$

Average Power Consumption (APC): This metric is defined as the average power (measured in watts) drawn while processing a single frame from a sequence of images on a particular platform. It is calculated as the sum of the power consumption at each frame divided by the total number $N$ of test images in a particular test set, where $P_i$ is the power consumption for image $i$.

$$APC = \frac{1}{N} \sum_{i=1}^{N} P_i \qquad (3)$$
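The three metrics can be computed as in the following sketch, which follows the verbal definitions above. The function names and the per-frame measurements are hypothetical and serve only to show the arithmetic; they are not results from the paper.

```python
def sensitivity(true_positives, false_negatives):
    """SEN: fraction of actual objects that the detector found."""
    return true_positives / (true_positives + false_negatives)


def average(per_frame_values):
    """Mean over all test frames; used for both APT and APC."""
    values = list(per_frame_values)
    return sum(values) / len(values)


# Hypothetical measurements for a four-frame test sequence.
frame_times_sec = [0.020, 0.030, 0.025, 0.025]
frame_power_watts = [3.5, 3.9, 3.6, 3.8]

sen = sensitivity(true_positives=90, false_negatives=10)
apt = average(frame_times_sec)
apc = average(frame_power_watts)
```

Because APT and APC are both plain per-frame means, a single helper is enough, provided time and power are logged for the same set of test frames.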


5.2. Evaluation of Framework

We investigate the impact of each stage, for different configurations, on the overall performance and accuracy. To make the framework more suitable for real-time processing at the edge, and considering that stage 1 is the most time-consuming component of the framework, we set its time allocation to one frame. In the analysis we only vary the time allocation for stages 2 and 3. The average processing time and sensitivity on our constructed pedestrian test set are presented for different configurations in Fig. 4.

Figure 4. Comparison of average processing time (CPU) and sensitivity between different configurations for different time frames for each stage

This figure shows that the time allocated to each stage has a significant impact on both the performance and the sensitivity of the detection framework. Increasing the time spent at one of the two varied stages significantly reduces the processing time with no impact on the sensitivity of the framework. Increasing the time spent at the other also decreases the processing time, but at the same time decreases the sensitivity. Comparing the two extreme configurations, there is a significant variation in terms of both processing time and sensitivity (see Fig. 4). Increasing the time allocated to both stages 2 and 3 significantly decreases the processing time, since we delay the use of the slower network for a window of frames. On the other hand, since stages 2 and 3 operate on initial target position estimates from stage 1, they are more susceptible to missing objects that newly enter the field of view. This is reflected in a decrease in sensitivity.

Figure 5. Sensitivity of EdgeNet and the three single-shot CNNs on different platforms

This indicates that it is important to choose appropriate values for each stage in order to avoid operating with outdated information, which can lead to a reduction in accuracy. To this end, it is worth exploring the design space between the two extremes. Our objective is to obtain the highest possible processing speed with the highest possible accuracy. As seen in Fig. 4, the left-most points provide the best processing time. Among these there is a point where the framework achieves both high accuracy and low average processing time. Consequently, we select this configuration as the best one in order to compare it with the different alternatives and implement it on the different edge platforms.

5.3. Performance analysis on CPU, Odroid and Raspberry platforms

In this section, we present an evaluation of EdgeNet on different platforms, compared to the three single-shot CNN approaches, with respect to sensitivity, average processing time, and energy consumption. Fig. 5 shows the sensitivity of each CNN detector on the pedestrian dataset. The sensitivity comparison shows that EdgeNet keeps its accuracy high compared to the other CNNs. This can be attributed to the fact that the single-shot models resize the image prior to processing and, as a result, reduce the object resolution as well, leading to accuracy degradation. Even when compared with a deeper and larger CNN, EdgeNet manages to outperform it, an indication of how well the selection process works along with the tiling approach. EdgeNet spends most of its time working on parts of the higher-resolution image and as a result manages to improve accuracy. This shows that even though EdgeNet utilizes smaller, theoretically less capable CNNs, the appropriate combination of a single deep network with tiling for attention focusing and tracking for fast position estimation can significantly boost accuracy.

Figure 6. Average Processing Time of EdgeNet and the three single-shot CNNs on different platforms
Figure 7. Average Power Consumption of EdgeNet and the three single-shot CNNs on different platforms
Figure 8. EdgeNet selection process on different frames of the test set. White boxes indicate the proposed tiles for processing and orange boxes the actual detections of the objects

Fig. 6 shows the average processing time on the different evaluation platforms. First, it is noticeable that the selected configuration of EdgeNet is faster on all platforms than the other approaches. On all devices there is a reduction of the APT, which shows that EdgeNet manages to reduce the inference time of the detector with no loss, and in some cases an increase, in sensitivity. It is also worth noting that the performance of EdgeNet adapts to the activity (number and location of pedestrians) in the scene due to its dynamic nature, whereas the processing time of the other approaches is constant regardless of the frame content. Fig. 8 shows different time instances of the tile selection and the detections on the constructed dataset, another example of how EdgeNet can select different tiles for processing and at the same time cover all the objects efficiently. Overall, EdgeNet is able to run in real time on all platforms. As such, it verifies our claim that an intelligent processing pipeline can be more efficient than a single CNN for use on mobile/edge devices.

Moreover, as shown in Fig. 7, EdgeNet leads to a reduction of the average power consumption, which makes it the most power-efficient detector compared to the other CNNs. The reduction of processed data, along with the use of CNNs with small input sizes and the tracker, has a direct impact on the average power consumption on all platforms due to the reduction in computation. Compared to the network that consumes the most power among the alternatives, EdgeNet decreases the power consumption on the CPU platform as well as on the other two platforms.

6. Conclusion & Future Work

This paper proposed a three-stage framework for more efficient object detection on higher-resolution images for edge/mobile devices. We have demonstrated that an intelligent data reduction mechanism can go a long way towards improving the overall accuracy and focusing the computation on the important image regions. Furthermore, we have shown that selectively choosing the best CNNs to use, based on the positions of the targets and their proximity to each other, yields significant benefits in terms of performance and accuracy. Overall, EdgeNet provides promising frame rates, accuracy, and power consumption, with the exact figures depending on the inference device. Future research plans include optimizing each individual CNN, using binarization and pruning techniques, to further improve the speed. Moreover, we plan to test EdgeNet on various scenes and under different conditions, and to incorporate the movement of the objects in the selection process in order to further increase the detection accuracy in cases of high movement of both the objects and the camera.


Christos Kyrkou gratefully acknowledges the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.