Intelligent Surveillance as an Edge Network Service: from Harr-Cascade, SVM to a Lightweight CNN

04/24/2018 ∙ by Seyed Yahya Nikouei, et al. ∙ Binghamton University

Edge computing efficiently extends the realm of information technology beyond the boundary defined by the cloud computing paradigm. By performing computation near the source and destination, edge computing promises to address the challenges in many delay-sensitive applications, such as real-time surveillance. Leveraging ubiquitously connected cameras and smart mobile devices, it enables video analytics at the edge. However, traditional human-object detection and tracking approaches are still computationally too expensive for edge devices. Aiming at intelligent surveillance as an edge network service, this work explores the feasibility of two popular human-object detection schemes, Haar-Cascade and SVM, at the edge. Based on an understanding of the constraints of these algorithms, a lightweight Convolutional Neural Network (LCNN) is proposed using depthwise separable convolution. The proposed LCNN considerably reduces the number of parameters without affecting the quality of the output, which makes it ideal for use on an edge device. Trained with the Single Shot MultiBox Detector (SSD) to pinpoint the location of each human object, it gives the coordinates of a bounding box around the object. We implemented and tested the LCNN on an edge device, a Raspberry PI 3. An intensive experimental comparison study has validated that the proposed LCNN is a feasible design for real-time human-object detection as an edge service.




I Introduction

Nowadays almost every person can be connected to the network using their pocket-sized mobile devices wherever and whenever. The advancement of cyber-physical technologies and their interconnection through elastic communication networks facilitate the concept of the Smart Cities that improve the life quality of residents. Attracted by the convenient lifestyle in bigger cities, the world’s population has been increasingly concentrated in urban areas at an unprecedented scale and speed [10].

The fast pace of urbanization [11] poses many opportunities and challenges. The recent concept of Smart Cities has attracted the attention of urban planners and researchers seeking to enhance the security and well-being of residents. The proliferation of information and communication technologies (ICT) connects cyber-physical systems and social entities and facilitates many smart community services. One of the most essential smart community services is intelligent resident surveillance [7]. It enables a broad spectrum of promising applications, including access control in areas of interest, human identity or behavior recognition, detection of anomalous behaviors, interactive surveillance using multiple cameras, and crowd flux statistics and congestion analysis [26].

Many of these smart surveillance applications require significant computing and storage resources to handle the massive contextual data created by video sensors. A typical low frame rate (1.25 Hz) wide area motion imagery (WAMI) sequence alone can generate over 100M of data per second (400G per hour). According to a recent study, video data dominates real-time traffic and creates a heavy workload on communication networks. For example, online video accounted for 74% of all online traffic in 2017, and 78% of mobile traffic will be video data by 2021 [13]. Thus, it is important to handle this massive data transfer in new ways. The cloud computing paradigm provides excellent flexibility and scales with the increasing number of surveillance cameras. In practice, however, there are significant hurdles for a remote cloud-based smart surveillance architecture.

Key surveillance applications such as monitoring and tracking need real-time capability. However, processing raw video data from widely distributed video sensors such as Closed-Circuit Television (CCTV) cameras and mobile cameras not only incurs uncertainty in data transfer and timing but also imposes significant overhead and delay on the communication networks [10]. It may also cause data security and privacy issues by giving adversaries more opportunities to attack. Therefore, current surveillance applications serve off-line forensic analysis rather than acting as a proactive tool to deter suspicious activities before damage is caused.

The surveillance community has been aware of the growing demand for human resources to interpret the data due to the ubiquitous deployment of networked static and mobile cameras and has made many efforts in past decades [34]. For example, many automated anomaly detection algorithms have been investigated using machine learning [39] and statistical analysis [19] approaches. Although these intelligent approaches are powerful, they are computationally very expensive. Hence they are implemented as a central cloud service. Researchers are also trying to help operation personnel be aware of events using event-driven visualization mechanisms [18], re-configuring the networked cameras [36], and mapping conventional real-time images to 3D camera images [45]. However, the traditional human-in-the-loop solutions are still challenged significantly by the demands of real-time surveillance systems due to their lack of scalability. For example, video analysis mostly relies on teams of specially trained officers manually watching thousands of hours of videos of different formats and quality, looking for one specific target. Because of the manual coordination and tracking and the mechanical pan-tilt-zoom (PTZ) remote control, it is challenging to achieve adequate real-time surveillance.

Edge computing as a surveillance service is considered as the answer to the shortcomings [2], [4], [6]. The edge computing technology migrates more computing tasks to the connected smart “things” (sensors and actuators) at the edge of the network [40]. In general, edge computing possesses the following advantages compared to cloud computing:

  1. Real-time response: applications or services are executed directly on-site or near-site, so communication delays are minimized, which is essential to delay-sensitive, mission-critical tasks such as smart surveillance;

  2. Lower network workload: raw data generated by sensors or monitors is consumed at the edge of the network instead of being outsourced to a remote cloud center. While the processed results may be sent to the cloud for future analysis, the communication overhead is much lower than outsourcing the tasks to the cloud;

  3. Lower energy consumption: most edge devices are energy constrained, so the algorithms deployed at the edge are lightweight by nature, which reduces the total energy consumed by processing and data transmission; and

  4. Data security and privacy: the less data is sent, the fewer opportunities adversaries have to compromise the confidentiality and integrity of the data. It is also easier to enforce security and privacy policies in a local network than to request collaboration among multiple network domains under different administrations.

In this paper, we propose to devolve more intelligence to the edge to significantly improve many smart tasks such as fast object detection and tracking. Building on recent research in machine learning, we choose the Convolutional Neural Network (CNN) algorithm, which incurs comparatively less pre-processing overhead than other human image classification algorithms. We tailored the CNN to fit resource-constrained edge devices based on the observation that surveillance systems mainly serve the safety and security of human beings. According to our experimental study, the lightweight CNN (L-CNN) algorithm can process an average of 1.79 and up to 2.06 frames per second (FPS) on the selected edge device, a Raspberry PI 3 board. It meets the design goals considering the limited computing power and the requirements of the application.

In summary, the major contributions of this work are highlighted below:

  1. Aiming at intelligent surveillance as an edge network service, a thorough study of two well-known human-object detection schemes, Haar-Cascade and HOG+SVM, has been conducted, which evaluates the feasibility of running them on resource-limited edge devices;

  2. A lightweight Convolutional Neural Network (L-CNN) is applied to enable real-time human-object identification at the network edge [35];

  3. Instead of simulation, system-oriented research has been conducted. The L-CNN, SSD GoogleNet, Haar-Cascade, and HOG+SVM algorithms are implemented on a Raspberry PI Model 3 board as the edge device; and

  4. An extensive experimental validation study has been conducted using real-world surveillance video data. Compared with SSD GoogleNet, Haar-Cascade and HOG+SVM, the L-CNN is a promising approach for delay-sensitive, mission-critical applications like real-time smart surveillance.

Figure 1: Edge-Fog-Cloud hierarchical architecture.

The rest of the paper is organized as follows. Section II provides background on the closely related work. Section III explains Haar-Cascade and HOG+SVM at the edge. Then, Section IV introduces the proposed lightweight CNN architecture, and the training of the L-CNN is discussed in Section V. Section VI explains the results of the tracking algorithm implemented on a Raspberry PI 3 model B and a Tinker board. Finally, Section VII wraps up this paper with conclusions and a discussion of our on-going efforts.

II Background Knowledge and Related Work

II-A Smart Surveillance as an Edge Service

The surveillance community has been aware of the growing demand for human resources to interpret the data due to the ubiquitous deployment of networked static and mobile cameras and has made many efforts in past decades [34].

Traditional surveillance systems depend on human operators to manage the processing of captured video [8]. However, there are many shortcomings with this approach. Not only is it unrealistic to expect a human operator to maintain full concentration on the video for a long time, but the approach also does not scale as the number of cameras used as sensors grows significantly. More recently proposed smart systems, considered the second generation of surveillance systems, aim at minimizing the role that human operators play in object detection, and the responsibility for abnormal behavior detection is taken over by various more intelligent machine learning algorithms [11][44]. The algorithm automatically processes the collected video frames in a cloud to detect, track, and report any unusual circumstances.

Figure 1 presents an edge-fog-cloud hierarchical architecture in which functions in a smart surveillance system are classified into three levels:

  • Level 1: each object of interest is identified through low-level feature extraction from video data;

  • Level 2: the behavior or intention of each object of interest is detected or recognized, and alarms are raised quickly; and

  • Level 3: profiles of anomalous or suspicious activities are built, historical statistical analysis is performed, and the decision-making algorithm is fine-tuned through online training.

Ideally, the minimum delay and communication overhead would be achieved if all the functions were conducted on-site at the network edge where the sensor is located, and the decision were made instantly [9][10]. However, it is not realistic to accomplish the operations of Level 1 and Level 2 by the edge devices. Therefore, once the detection and tracking tasks are done, the results are outsourced to the fog layer for further data contextualization and decision making. The computationally expensive Level 3 functions can be positioned at the fog computing level or even further away in the cloud centers, considering the constraints on edge processing power. Functions like long-term profile building based on the geo-location of the camera are not required to be accomplished instantly. In a smart surveillance system, the Level 1 functions are fundamental. More specifically, human-object detection is vital, as missing any objects in the frame will lead to undetected behavior. Also, the false positive rate should be minimized, because wrongly identifying an object in the frame as a human may result in false alarms.

II-B Human-Object Detection

Haar-like feature extraction is well suited for face and eye detection [14]. Haar models are lightweight and very fast, which makes them attractive candidates for edge implementation. However, the human body can have a different appearance under different ambient lighting, which makes it harder for this type of model to achieve high detection accuracy [20].

Grids of Histograms of Oriented Gradients (HOG) can produce reliable features for human detection [15]. While the Haar features fail to detect humans when the body angle toward the camera changes, HOG features continue to perform well. HOG features are given to a Support Vector Machine (SVM) classifier to create a human detection algorithm called HOG+SVM [33]. One downside of using HOG feature extraction at the edge is that this method creates a burden on the limited-resource environment. Performance results are discussed and compared in Section VI.

The Scale-Invariant Feature Transform (SIFT) is another well-known algorithm for human detection. It extracts distinctive invariant features from images, which can be used to perform reliable matching between different views of an object or scene [22].

II-C Machine Learning at the Edge

Powerful machine learning algorithms are recognized as the solution to take full advantage of big data in many areas [29][48]. However, when the big data comes to the edge, the demand for computing and storage resources makes them unfit. The edge environment necessitates lightweight but robust algorithms. Applications of some simple machine learning algorithms have been investigated in environments with constraints on resources, such as wireless sensor networks (WSNs) and IoT devices [3]. There are efforts to build large-scale distributed machine learning systems to leverage heterogeneous computing devices and reduce the communication overhead [1][42]. Although none of the reported algorithms is ideally fit to the resource-constrained edge environment, they have laid a solid foundation for us. In fact, the community has recognized the importance of efficient models for mobile and embedded applications [5][24][47].

GoogleNet [43] and Microsoft ResNet [21] are widely used and well-known architectures for image classification because of their high accuracy. They can take a picture as input and classify it into up to one thousand different object classes. This type of network has as many filters as needed to create a feature map that can differentiate between the possible object classes [23][25]. If only human objects need to be classified or detected, the network architecture needs fewer filters to reach the same accuracy. However, the authors could not find any deep learning network designed specifically with human-object detection in mind; rather, the networks tend to target more generalized use cases.

Recent attempts have been made to create faster deep learning networks that require fewer resources without losing performance. As its name implies, SqueezeNet achieves the same performance as AlexNet but takes less memory [27]. MobileNet is another architecture created to work on resource-constrained devices [24]. It is not only memory efficient but also runs very fast because of the different convolutional architecture it is built on. It has been mathematically proven that this network creates less computational burden while having fewer parameters [41]. MobileNet produces results comparable to GoogleNet, which is one of the best performing architectures in terms of accuracy.

III Haar-Cascade and SVM at the Edge

In this paper, we focus on fast and accurate detection of humans as objects of interest (human objects), as it is vital for the algorithm to give the exact position coordinates of the object of interest for tracking purposes. Otherwise, an abnormality detection algorithm based on human behavior may not function properly when it is given incomplete information. Although both have been discussed comprehensively in the literature, this section provides an overview of the Haar-Cascade and HOG+SVM algorithms. Their wide usage for human detection in surveillance makes them notable candidates for edge applications, and exploring them gives insight into their weaknesses on edge devices.

III-A Haar-Cascade

Haar-like features consist of three general shapes. Figure 2 shows examples of the Haar-like feature set. These filters are convolved over an input image, and at each position the sum of pixel values in the black rectangles is subtracted from the sum of pixel values in the white rectangles. When the capturing angle changes, the same filter may produce very different results. A simpler classifier may miss an object if the features are not totally reliable for detection. One may also argue that, with a better image set for training, Haar-Cascade will provide more accurate results [30].

When all possible scenarios are considered, even a 24×24 image will produce more than 160 thousand features, since the filters in Fig. 2 can have any combination of sizes, rotations, and positions. The learning process is computationally expensive and needs to take place on CPU clusters, which might not be available to many. However, once the training is finished, a feature set is ready and only the selected features are stored for future classification. Thus, the computational complexity of the overall algorithm is small in the execution phase.
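The subtraction of rectangle sums described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a grayscale window stored as a NumPy array, and it uses an integral image (summed-area table) to evaluate rectangle sums, which is the usual technique in Viola-Jones style detectors but is not described in the text. The function names `integral_image`, `rect_sum`, and `two_rect_feature` are hypothetical.

```python
import numpy as np


def integral_image(img):
    """Summed-area table with a zero row and column prepended."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = img.astype(np.int64).cumsum(axis=0).cumsum(axis=1)
    return ii


def rect_sum(ii, top, left, height, width):
    """Sum of the pixels inside a rectangle, using four table lookups."""
    bottom, right = top + height, left + width
    return ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]


def two_rect_feature(ii, top, left, height, width):
    """Two-rectangle feature: white (left half) sum minus black (right half) sum."""
    half = width // 2
    white = rect_sum(ii, top, left, height, half)
    black = rect_sum(ii, top, left + half, height, half)
    return white - black


# Toy 24x24 window whose left half is bright and right half is dark.
window = np.zeros((24, 24), dtype=np.uint8)
window[:, :12] = 255
print(two_rect_feature(integral_image(window), 0, 0, 24, 24))
```

Because every rectangle sum costs four lookups regardless of its size, evaluating a stored feature at execution time is cheap, which is consistent with the low execution-phase complexity noted above.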

Figure 2: Examples of Haar-like features. (a) two rectangular features. (b) three rectangular features. (c) four rectangular features

In the training phase, the best performing features are selected. The selection is performed by the AdaBoost (Adaptive Boosting) algorithm, which is constructed from classifiers called "weak learners". This algorithm generates a weighted sum of the results of the weak learners (Eq. 1), where $f_t(x) = \alpha_t h(x)$ is the weak learner of iteration $t$ applied to input $x$:

$$F_T(x) = \sum_{t=1}^{T} f_t(x) \qquad (1)$$

During the learning process each weak learner receives a weight $\alpha_t$ in the summation used for error calculation (Eq. 2), which is based on the most recently calculated boosted classifier $F_{t-1}$. The goal is to minimize the error value $E_t$ in Eq. 2, where $x_i$ is every input for learning iteration $t$:

$$E_t = \sum_{i} E\left[F_{t-1}(x_i) + \alpha_t h(x_i)\right] \qquad (2)$$
Positive and negative images are collected for training. Positive images contain the object of interest against different backgrounds, together with the position and coordinates of the sample. In practice, many images are used more than once by mirroring them or cropping their edges. Negative images do not include the object of interest. Around 2000 positive and 1000 negative images are used in the training. The result of the training is a file containing the specific best performing features to be executed on input images. Based on the results, a regression step is then applied in the areas where the features give positive results to produce more accurate coordinates as the output.
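The boosting procedure described above can be illustrated with a small sketch. It is a generic discrete AdaBoost over threshold "stumps", written under stated assumptions and not the training code used in this work: each column of `X` stands for the response of one candidate feature, labels are in {-1, +1}, and the names `train_adaboost` and `predict` are hypothetical.

```python
import numpy as np


def train_adaboost(X, y, rounds=10):
    """X: (n_samples, n_features) feature responses; y: labels in {-1, +1}."""
    n_samples, n_features = X.shape
    weights = np.full(n_samples, 1.0 / n_samples)
    ensemble = []  # entries are (feature index, threshold, polarity, alpha)
    for _ in range(rounds):
        best = None
        # Exhaustively pick the stump with the lowest weighted error.
        for j in range(n_features):
            for thr in np.unique(X[:, j]):
                for pol in (1, -1):
                    pred = np.where(pol * (X[:, j] - thr) >= 0, 1, -1)
                    err = weights[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, thr, pol, pred)
        err, j, thr, pol, pred = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the logarithm finite
        alpha = 0.5 * np.log((1 - err) / err)  # weight of this weak learner
        # Misclassified samples gain weight for the next round.
        weights *= np.exp(-alpha * y * pred)
        weights /= weights.sum()
        ensemble.append((j, thr, pol, alpha))
    return ensemble


def predict(ensemble, X):
    """Sign of the weighted sum of the weak learners."""
    score = np.zeros(X.shape[0])
    for j, thr, pol, alpha in ensemble:
        score += alpha * np.where(pol * (X[:, j] - thr) >= 0, 1, -1)
    return np.where(score >= 0, 1, -1)


# Feature 0 separates the classes; feature 1 is uninformative.
X = np.array([[1, 5], [2, 1], [3, 4], [4, 2]])
y = np.array([-1, -1, 1, 1])
model = train_adaboost(X, y, rounds=3)
print(model[0][0], predict(model, X))
```

The feature indices stored in the ensemble play the role of the "selected features" that are saved after training; everything else can be discarded, which is why only a small file is needed at execution time.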

Figure 3: (a) HOG convolution filter. (b) HOG representation of an Image. (c) cell representation of the HOG [32].

As revealed by the training procedure and the simple workflow of the algorithm, Haar-Cascade object classification will not perform well if the training set does not contain all possible angles, as shown in Section VI. Also, if the object of interest is far away from the camera, which is the usual case in outdoor surveillance videos, the algorithm may fail to detect the object. Furthermore, the simple structure may imply a loss of robustness. However, it is used for low-power and real-time applications because of its fast detection.

III-B SVM Classifiers

Pixel values cannot be trusted because so many parameters affect them, so other features are sought. Figure 3(a) depicts a filter that is placed on each pixel, shown in white, with its four neighboring pixels in black. The X and Y derivatives are calculated simply by subtracting the horizontal and vertical neighboring pixel values, respectively, around the white pixel. In particular, the X derivatives respond to vertical lines and the Y derivatives respond to horizontal lines, which makes the overall features sensitive to lines and object edges. Converting the derivatives to amplitude and angle yields unsigned gradients for each pixel. In practice, a filter is convolved over the image and calculates the gradient for the given pixel at each step. Because the gradients are unsigned, the angular values lie between 0° and 180°. If nine bins of 20° each are considered, the amplitude of each gradient is placed in the respective bin based on its angular value. If the angular value is not at the center of a bin, the amplitude is divided between the two bins closest to the angle of the gradient. If the input image has more than one channel, such as RGB, the channel with the highest amplitude is chosen and its angle is used for the histogram representation.

Figure 3(b) shows one such histogram, with amplitudes normalized by the highest amplitude value. This figure is taken from a batch of pixels where a line passes through the window, so a single angular range is the most abundant.
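The histogram construction described above can be sketched for a single grayscale cell as follows. This is an illustrative sketch rather than the HOG implementation used in the experiments: the function name `cell_histogram` is hypothetical, bin centers are assumed to sit at 10°, 30°, ..., 170°, votes are assumed to wrap around between the last and the first bin, and the multi-channel selection step is omitted.

```python
import numpy as np


def cell_histogram(cell, bins=9):
    """Unsigned-gradient histogram (0 to 180 degrees) for one grayscale cell."""
    cell = cell.astype(np.float64)
    gx = np.zeros_like(cell)
    gy = np.zeros_like(cell)
    # Differences of the horizontal and vertical neighbors of each pixel.
    gx[:, 1:-1] = cell[:, 2:] - cell[:, :-2]
    gy[1:-1, :] = cell[2:, :] - cell[:-2, :]
    magnitude = np.hypot(gx, gy).ravel()
    angle = (np.degrees(np.arctan2(gy, gx)) % 180.0).ravel()

    bin_width = 180.0 / bins             # 20 degrees for nine bins
    position = angle / bin_width - 0.5   # position relative to bin centers
    lower = np.floor(position).astype(int)
    upper_share = position - lower       # fraction given to the next bin

    hist = np.zeros(bins)
    # Split each amplitude between the two nearest bins.
    np.add.at(hist, lower % bins, magnitude * (1.0 - upper_share))
    np.add.at(hist, (lower + 1) % bins, magnitude * upper_share)
    return hist


# A vertical edge: only the X derivative responds.
cell = np.zeros((8, 8))
cell[:, 4:] = 1.0
print(cell_histogram(cell))
```

In this toy cell every non-zero gradient points along 0°, which lies exactly between the centers of the last and the first bin, so the amplitude is shared equally between those two bins.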

As an example, the HOG algorithm output is depicted as an image in Fig. 3(c), where each cell is one gradient cell (the cells are enlarged so that they are visible to the human eye, which also requires less computation).

Figure 4: Image pyramid used in HOG feature extraction method.

To capture all details at different distances from the camera location, a pyramid of the image is usually employed. The image at its original resolution is considered first. Then some pixels in each row and column are discarded to create a lower resolution version of the same image, and the same HOG algorithm generates another feature map. The steps iterate until it is no longer feasible to conduct classification on the image. Figure 4 shows an image pyramid, where the top-left image is the actual input and the bottom-right image has the fewest pixels, but each of its pixels is the largest, which preserves the dimensions of the input image.
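A minimal sketch of the pyramid construction described above is given here. It assumes that every second row and column is discarded at each level and that iteration stops once the image is smaller than a 128×64 detection window; both the step and the window size are illustrative assumptions, not values stated in the text, and `image_pyramid` is a hypothetical name.

```python
import numpy as np


def image_pyramid(image, step=2, min_size=(128, 64)):
    """Yield progressively smaller copies made by discarding rows and columns."""
    level = image
    while level.shape[0] >= min_size[0] and level.shape[1] >= min_size[1]:
        yield level
        level = level[::step, ::step]  # keep every step-th row and column


frame = np.zeros((512, 256), dtype=np.uint8)
for level in image_pyramid(frame):
    print(level.shape)
```

Each level would be passed through the same HOG feature extraction, which is why the per-frame cost grows with the number of pyramid levels.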

The SVM classifies objects of interest at each stage, so multiple detection reports are possible. In different scenes, fine-tuning of the HOG variables might be needed to determine the number of maps generated. Figure 5 is an example in which the output detects a human object several times because several feature maps are provided to the SVM. Using the general pre-tuned variables therefore requires an extra step that keeps only one of the bounding boxes and discards the rest. One commonly used method is to take the biggest bounding box as the object. This approach may lead to inaccurate detection. The effect is more noticeable when there are multiple human objects close to each other. Although the detection rate can be improved by fine-tuning the filter size and variables, in practice it is non-trivial to reconfigure them once the cameras have been installed.
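The "keep the biggest bounding box" step mentioned above might look like the following sketch. It is one possible reading of that step, not the authors' code: boxes are assumed to be `(x1, y1, x2, y2)` tuples, two boxes are treated as the same detection when their intersection covers at least half of the smaller box, and the names `area`, `overlap_ratio`, and `keep_biggest` are hypothetical.

```python
def area(box):
    """Area of a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return max(0, x2 - x1) * max(0, y2 - y1)


def overlap_ratio(a, b):
    """Intersection area divided by the area of the smaller box."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    smaller = min(area(a), area(b))
    return inter / smaller if smaller > 0 else 0.0


def keep_biggest(boxes, threshold=0.5):
    """Visit boxes from largest to smallest and drop those overlapping a kept box."""
    kept = []
    for box in sorted(boxes, key=area, reverse=True):
        if all(overlap_ratio(box, k) < threshold for k in kept):
            kept.append(box)
    return kept


detections = [(30, 40, 90, 180), (10, 10, 110, 210), (300, 20, 350, 160)]
print(keep_biggest(detections))
```

The sketch also makes the weakness noted above visible: when two people stand close together, one large box can swallow the smaller, correct boxes around each person.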

Figure 5: False multiple detection for a single human object.

As explained above, the HOG algorithm extracts features, and an SVM trained on these features classifies the humans. The person category of the COCO image set [31], with around 20K images, is used for training. Unfortunately, while the feature extraction presents useful information, SVM and HOG are too expensive for edge devices, where these computing-intensive tasks are executed repeatedly for each frame.

IV Lightweight CNN

Recently, CNNs have been widely applied as a powerful tool for object classification. However, fitting CNNs into network edge devices is considered a challenging task due to the very strict constraints on resources. Even if the time-consuming and computing-intensive training is outsourced to the cloud and the network layer architecture is simplified, edge devices still cannot afford the storage space for the parameters and filter weights of these deep neural networks, nor the computation required. Therefore, a CNN with a lightweight design is needed in the edge environment.

In designing the L-CNN architecture, Depthwise Separable Convolution [24][41] is employed to reduce the computational cost of the CNN itself without sacrificing much of the accuracy of the whole network. Also, the network is specialized for human detection to avoid an unnecessarily large number of filters in each layer. This yields a network that is implementable at the edge.

IV-A Depthwise Separable Convolution

Depthwise separable convolution splits each conventional convolution layer into two parts, a depthwise convolution and a pointwise convolution, which makes the computational complexity more suitable for edge devices. More specifically, the conventional convolution takes an input $F$, which has a dimensionality of $D_F \times D_F$ and $M$ channels, and maps it into $G$, which has $N$ channels of the same $D_F \times D_F$ dimension. This is done by the filter $K$, which is a set of $N$ filters, each of size $D_K \times D_K$ with $M$ channels, as calculated in Eq. 3:

$$G_{k,l,n} = \sum_{i,j,m} K_{i,j,m,n} \cdot F_{k+i-1,\,l+j-1,\,m} \qquad (3)$$

The computational complexity is

$$D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F \qquad (4)$$

Figure 6: Comparison between (a) The conventional convolution; and (b) Depthwise separable convolution.

Figure 6 compares the depthwise separable convolution filters with the conventional convolution, where the same operation is split into two parts to minimize its complexity. The depthwise separable convolution consists of two parts. The first is a depthwise convolution layer, which has $M$ channels of $D_K \times D_K$ filters that generate $M$ outputs. Next is a pointwise convolution layer, in which the filters are $N$ channels of $1 \times 1 \times M$ filters. With the same input $F$ as before, the depthwise layer produces an output $\hat{G}$ (Eq. 5) in the same manner as Eq. 3:

$$\hat{G}_{k,l,m} = \sum_{i,j} \hat{K}_{i,j,m} \cdot F_{k+i-1,\,l+j-1,\,m} \qquad (5)$$

where $\hat{K}$ is a depthwise convolutional filter with a dimension of $D_K \times D_K \times M$, and the $m$-th filter in $\hat{K}$ is applied to the $m$-th channel of $F$. The computational complexity of the depthwise separable convolution is

$$D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F \qquad (6)$$

Based on Eq. 6 and Eq. 4, the computational complexity is reduced by the factor given in Eq. 7 [24]. This makes for a faster and more efficient network that is an ideal fit for edge devices.

$$\frac{D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F}{D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F} = \frac{1}{N} + \frac{1}{D_K^2} \qquad (7)$$
Immediately after each convolutional step, there is a Batch Normalization layer for normalization and a ReLU layer to introduce nonlinearity.
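The complexity comparison in this subsection can be checked numerically. The sketch below assumes the notation of the MobileNet paper [24]: a `d_k` × `d_k` kernel, `m` input channels, `n` output channels, and a `d_f` × `d_f` feature map. The layer sizes in the example are illustrative only and are not taken from the L-CNN specification; the function names are hypothetical.

```python
def conv_cost(d_k, m, n, d_f):
    """Multiply-accumulate count of a conventional convolution layer."""
    return d_k * d_k * m * n * d_f * d_f


def separable_cost(d_k, m, n, d_f):
    """Depthwise cost plus pointwise (1x1) cost."""
    depthwise = d_k * d_k * m * d_f * d_f
    pointwise = m * n * d_f * d_f
    return depthwise + pointwise


def reduction_factor(d_k, n):
    """Ratio of separable to conventional cost: 1/N + 1/D_K^2."""
    return 1.0 / n + 1.0 / (d_k * d_k)


# Illustrative layer: 3x3 kernel, 32 -> 64 channels, 112x112 feature map.
d_k, m, n, d_f = 3, 32, 64, 112
ratio = separable_cost(d_k, m, n, d_f) / conv_cost(d_k, m, n, d_f)
print(round(ratio, 4), round(reduction_factor(d_k, n), 4))
```

For 3×3 kernels the ratio is dominated by the 1/9 term, so under these assumptions a separable layer needs roughly one eighth to one ninth of the operations of its conventional counterpart once the number of output channels is large.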

IV-B The L-CNN Architecture

The proposed L-CNN network architecture has 23 layers, counting depthwise and pointwise convolutions as separate layers and not counting the final softmax classifier and the regression layers that produce a bounding box around the detected object. A simple fully connected neural network classifier takes the prior probabilities of each window of objects, identifies the objects within the proposed window, and adds the label for the output bounding boxes at the end of the network. Figure 7 depicts the network filter specifications for each layer.

Figure 7: L-CNN network layers specification.

Downsizing is achieved through striding in the filters, and no specific layer is added for it, which reduces computation. The first convolutional layer of the L-CNN architecture is a conventional convolution, but the rest of the network uses depthwise convolutions along with pointwise convolutions. The L-CNN is focused on human-object detection, so the network is used only for pedestrian detection, which further simplifies the network and decreases the number of parameters to store.

Introduced in late 2016, the Single Shot MultiBox Detector (SSD) method is faster than R-CNN [32] and more accurate than YOLO [37]. The name comes from the fact that results are generated in one feed-forward pass through the network, with no need for the extra steps taken in R-CNN. It is a unified framework for object detection with a single network. For training purposes, the SSD architecture needs more layers than a conventional CNN, and once installed, it receives the input image and outputs the coordinates of each object detected in the image along with a label for the object. It modifies the proposal generator to produce class probabilities instead of the existence of an object in the proposal. Instead of using the classical sliding window and checking each window for an object to report, SSD creates a set of default bounding boxes over different aspect ratios at the beginning layer of convolutional filters, then scales them with each feature map through the convolutional layers along with the image itself. In the end, it checks for the presence of each object category based on the prior probabilities of the objects in the bounding box, and finally adjusts the bounding box to better fit the detection, which means adding five layers and using the outputs of two layers in the SSD application. One of the downsides of SSD is that detection accuracy for smaller objects is low if prior probability extraction is performed in only one layer. In smart surveillance, this can lead to a loss of generalization, because the goal is to detect every human object regardless of its distance from the camera or its angle toward it. However, if the outputs of different feature maps from different layers are used [38], the detection rate can be increased.
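The idea of default bounding boxes over different aspect ratios can be sketched as follows. This follows the general SSD formulation, in which a box of scale s and aspect ratio a has width s·√a and height s/√a, and is not the configuration used for the L-CNN: the feature map size, scale, and aspect ratios below are illustrative assumptions, and `default_boxes` is a hypothetical name.

```python
import itertools
import math


def default_boxes(fmap_size, scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """Return (cx, cy, w, h) boxes, normalized to [0, 1], for a square feature map."""
    boxes = []
    for row, col in itertools.product(range(fmap_size), repeat=2):
        # Center of the feature-map cell in normalized image coordinates.
        cx = (col + 0.5) / fmap_size
        cy = (row + 0.5) / fmap_size
        for ratio in aspect_ratios:
            width = scale * math.sqrt(ratio)
            height = scale / math.sqrt(ratio)
            boxes.append((cx, cy, width, height))
    return boxes


boxes = default_boxes(fmap_size=4, scale=0.2)
print(len(boxes), boxes[0])
```

Repeating this for feature maps of several resolutions, each with its own scale, is what allows objects at different distances from the camera to be matched by some default box.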

V L-CNN Training

V-A CNN Training

A CNN needs to be well trained before being deployed and applied to conduct the task of classification. Usually, the training process requires a lot of computing resources and large storage space that allows the training images to be loaded and fed to the network in batches. The filters and other parameters should also be pre-loaded into memory. In addition, the back propagation operation incurs math-intensive matrix and differential calculations. Clearly, the edge environment is not an ideal place for training.

There are several widely used frameworks that serve this purpose, such as TensorFlow, Keras and Caffe [28]. A brief introduction to each clarifies the strengths and weaknesses of each platform. TensorFlow is an open source software library for machine learning and artificial intelligence in general. One major benefit of this framework is that many GPUs can work together to increase the training speed. Also, a light version of TensorFlow was recently introduced for mobile devices, which allows CNN models to be loaded without additional libraries. However, architecture definitions in TensorFlow can be lengthy, so other platforms such as TFlearn are used to make them more compact.

Keras gained popularity for its user-friendly, easy-to-learn environment. It uses TensorFlow as a back-end engine. The Keras libraries, accessed using Python, create a bridge between Python syntax and TensorFlow. The libraries are designed to make it easy for the user to create and test deep models as fast as possible. The trade-off is that, in exchange for spontaneous coding in Python, the low-level flexibility of TensorFlow is sacrificed. Moreover, one of the problems that make Keras less than ideal for edge devices is that, while the OpenCV library supports deep learning models, it still fails to import Keras-based networks. To use Keras, the library itself has to be installed on the edge device, and the results also need to be loaded into the OpenCV library.

Introduced in 2014 and written in C++, Caffe is a well-known tool in the deep learning community. It is a low-level library for working with CNNs. Its speed makes Caffe well suited for research experiments and industry deployment. Caffe can process over 60M images per day with a single NVIDIA K40 GPU [28]. On the other hand, Caffe has two main weaknesses. First, the lack of documentation on its commands makes coding hard. Specifically, in the SSD realization of a Caffe model, there are layers needed for SSD deployment, but very little information is provided on their functionality. Second, the Caffe architecture is written as a plain text file, which becomes harder to manage as more layers are included in the architecture.

In this work, the proposed L-CNN is trained using MXNet because of its implementation of SSD and its good documentation. Apache MXNet is well suited to a variety of machine learning tasks. The architecture is simply coded in the programming language, and the architecture file is generated during training. As a result, the fully trained network is fast and can be deployed on edge devices. Also, MXNet is a flexible and scalable deep learning framework that many cloud centers provide today. Writing a deep learning architecture becomes an easy task because of its support for several languages such as C++, Java, and Python. MXNet has a large community and documentation that help reach results faster. Moreover, the SSD branch of MXNet has a responsive community that helps programmers familiarize themselves with SSD. The L-CNN used MXNet in Python, as the structure can be generated easily with Python functions. The L-CNN architecture file, called a symbol file, is ready for training and has fewer than 80 lines, with the SSD code added at the end. During the training phase, a .json text file of the architecture and another file containing the network weights are created [46], which can later be used instead of the symbol file for better speed. These outputs are closely related to the files used by Caffe (design.txt and .caffemodel).

The images from VOC07 and VOC12 [17] along with ImageNet [16] are used to train the proposed L-CNN network, where 85% of the total set is used for training and 15% for validation. ImageNet is the biggest image set and contains more than 14 million different images from more than one thousand different classes of objects. In this application, only the images from the human class are used. ImageNet provides bounding box coordinates for some of the images in particular classes. Note that ImageNet uses synsets to name the classes, so a file with the same image list format as VOC07 is needed. Although the synset naming system is machine-readable, it is harder for humans to understand. A combination of sub-classes for human images with coordinates from the ImageNet website is employed.
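The 85%/15% split described above can be sketched as follows. This is a minimal illustration, not the authors' tooling: the function name `split_image_list`, the fixed random seed, and the placeholder image identifiers are all assumptions made for the example.

```python
import random


def split_image_list(image_ids, train_fraction=0.85, seed=0):
    """Shuffle image identifiers and split them into training and validation lists.

    The shuffle uses a private, seeded generator so the split is reproducible.
    """
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    cut = int(round(len(ids) * train_fraction))
    return ids[:cut], ids[cut:]


# Hypothetical identifiers standing in for VOC/ImageNet image names.
example_ids = ["img_{:04d}".format(i) for i in range(200)]
train_ids, val_ids = split_image_list(example_ids)
print(len(train_ids), len(val_ids))
```

Shuffling before cutting avoids a validation set that contains only images from one source dataset, which could happen if the combined list were split in file order.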

The images have to be the same size as the input of the network. The network accepts colored (RGB) images of a fixed size in pixels. Thus, blobs of 16 images, each image having three-dimensional data, are created for training, and blobs of 16 images are likewise created for validation. 75% of the total set was used for training and 25% for validation. Lightning Memory-Mapped Database (LMDB) files were produced, which store data as {key, value} pairs. Converting the image set to such a format leads to a faster reading speed. Furthermore, before the training, the data is normalized by calculating the mean value of each RGB channel using Caffe platform packages.
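The batching and per-channel normalization steps can be illustrated in plain Python. This is a sketch under stated assumptions: images are represented as nested lists of (r, g, b) tuples, the normalization is assumed to be mean subtraction, and the helper names `channel_means`, `subtract_mean`, and `batches` are hypothetical. It does not use or reproduce the Caffe packages mentioned above.

```python
def channel_means(images):
    """Return the mean of each RGB channel over every pixel of every image.

    Each image is a list of rows, and each row is a list of (r, g, b) tuples.
    """
    sums = [0.0, 0.0, 0.0]
    count = 0
    for image in images:
        for row in image:
            for pixel in row:
                for c in range(3):
                    sums[c] += pixel[c]
                count += 1
    return [s / count for s in sums]


def subtract_mean(image, means):
    """Subtract the per-channel mean from every pixel (assumed normalization)."""
    return [[tuple(pixel[c] - means[c] for c in range(3)) for pixel in row]
            for row in image]


def batches(items, batch_size=16):
    """Yield consecutive groups of at most batch_size items (blobs of 16 by default)."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Two tiny 1x2 images used only to show the calls.
tiny_images = [[[(0, 10, 20), (10, 20, 30)]], [[(20, 30, 40), (30, 40, 50)]]]
means = channel_means(tiny_images)
normalized = [subtract_mean(image, means) for image in tiny_images]
print(means, [len(b) for b in batches(list(range(40)))])
```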

Training is done on a server machine with 28 CPU cores of Intel(R) Xeon(R) CPU at a base frequency of 2.4 GHz and 256 GB of physical memory. Training took 9.7 days. Several stop criteria are introduced, such as a maximum of 400 iterations, where each epoch consists of 250 batches, and one validation test takes place for every iteration. Also, every 40 iterations a snapshot of the weights was created to save the progress. The training error is calculated as:



$$E = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2$$

where $\hat{y}_i$ is the value calculated by the network, and $y_i$ is the actual value. This Mean Square Error represents the error in object detection; the linear regression used for fine-tuning the bounding box uses a separate error term.
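As a small illustration of the Mean Square Error described above, the following sketch assumes the usual definition (the average of squared differences between network outputs and actual values). The exact normalization used in training is not shown in this text, and the function name `mean_square_error` is hypothetical.

```python
def mean_square_error(predicted, actual):
    """Average of squared differences between predicted and actual values."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(predicted)


# Only the last of three outputs is wrong, by 2, so the squared error 4 is averaged over 3 values.
print(mean_square_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```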

VI Experimental Results

VI-A Experimental Setup

All of the above-discussed methods are implemented on an edge device, Raspberry PI 3 Model B with ARMv7 1.2 GHz processor and 1 GB of RAM.

Raspberry PI is a Single Board Computer (SBC). SBCs, which run a full operating system and have sufficient peripherals (memory, CPU, power regulation) to start execution without additional hardware, are targeted at industrial platforms such as vending machines. The Raspberry PI Foundation made the SBC accessible to almost anyone at low cost (less than $100) by delivering the Raspberry PI product family. Given merits such as commodity hardware, support for high-level programming languages (e.g., Python), and the ability to run popular variants of Unix-based operating systems, the Raspberry PI is an ideal platform for edge computing.

The CPU and memory utilization of the algorithms are captured by a third-party application named memory profiler. This software is used for Python applications and can track the CPU and memory used by a process. It saves the data and later plots it using the Python Matplotlib library. Frames Per Second (FPS) is the major parameter used to evaluate the performance of these algorithms. Figure 8 shows the average FPS over 30 seconds of run time for each algorithm. Note again that the other CNN architectures needed to be retrained using the SSD model so that they can be used for detection rather than classification and can be compared with the other object detection algorithms.
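Averaging FPS over a fixed run time can be sketched as below. This is an illustrative measurement loop, not the authors' benchmarking code: the function name `average_fps`, the injectable `clock` argument, and the dummy per-frame function are assumptions made so that the example is self-contained.

```python
import time


def average_fps(process_frame, frames, duration_s=30.0, clock=time.perf_counter):
    """Process frames until duration_s has elapsed and return frames per second.

    process_frame stands in for one detector call on one frame.
    """
    start = clock()
    processed = 0
    for frame in frames:
        if clock() - start >= duration_s:
            break
        process_frame(frame)
        processed += 1
    elapsed = clock() - start
    return processed / elapsed if elapsed > 0 else 0.0


# Dummy workload in place of a real detector; the printed value depends on the machine.
print(average_fps(lambda frame: sum(frame), [[1, 2, 3]] * 1000, duration_s=1.0))
```

Counting completed frames and dividing by the measured elapsed time, rather than averaging per-frame rates, keeps slow frames from being under-weighted in the result.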

Figure 8: Performance in FPS, CPU, Memory Utility, Average False Positive Rate (FPR%) and Average False Negative Rate (FNR%)

VI-B Results and Discussions

Figure 8 summarizes the experimental results. The fastest algorithm is Haar Cascaded; the proposed L-CNN is second and very close to the best, while the other algorithms are very slow. The figure also shows that Haar Cascaded is the best in terms of resource efficiency, and again the L-CNN is a close second. However, in terms of accuracy, the L-CNN achieved an average false positive rate (FPR) of 6.6% and a false negative rate (FNR) of 18.1%, much better than those of Haar Cascaded (26.3% and 34.9%). In fact, the L-CNN's accuracy is comparable with that of SSD GoogleNet (5.3% and 15.6%), but the latter has a much higher resource demand and an extremely low speed (0.39 FPS), which makes it unsuitable for the edge. In contrast, the average speed of the L-CNN is 1.79 FPS, with a peak performance of 2.06 FPS. This is 64% faster than MobileNet, which, together with lower memory usage, makes the L-CNN the best choice.
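For readers unfamiliar with the two accuracy measures, the sketch below shows how FPR and FNR percentages are commonly computed from confusion-matrix counts. The text does not state how detections were counted, so the definitions used here (FP / (FP + TN) and FN / (FN + TP)) are an assumption, and the counts in the example are invented purely for illustration.

```python
def false_positive_rate(false_positives, true_negatives):
    """Percentage of negatives that were wrongly reported as detections."""
    return 100.0 * false_positives / (false_positives + true_negatives)


def false_negative_rate(false_negatives, true_positives):
    """Percentage of actual human objects that the detector missed."""
    return 100.0 * false_negatives / (false_negatives + true_positives)


# Hypothetical counts, not taken from the experiments reported here.
print(false_positive_rate(5, 95), false_negative_rate(20, 80))
```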

It is worth mentioning that, contrary to other reports, GoogleNet does not use a huge portion of memory here, because this is a reduced SSD-based GoogleNet. As shown in Fig. 8 and Table I, with fewer classes, fewer parameters (and thus less memory) are needed to get the same accuracy. To compute these accuracy measures, real-life surveillance video is used along with the VOC12 test dataset, so the percentages reported here may be higher than the general-purpose figures reported in other literature.

Figure 9: (a) Haar Cascaded. (b) HOG+SVM. (c) SSD-GoogleNet. (d) L-CNN.

Figures 9 (a) to (d) show the results of Haar Cascaded, HOG+SVM, GoogleNet and L-CNN in processing a sample surveillance video. The footage is resized to the same pixel dimensions for all algorithms. The smaller the image size, the fewer computational resources are required. Also, the deep model architecture only accepts fixed-size images. Therefore, to compare all the algorithms fairly, they are all fed images of the same size. Because surveillance videos are in practice not allowed to be exposed to the public, the figures included in this paper are footage from an open-source online video.

The Haar Cascaded algorithm gives false detections by misidentifying the stone and the tripod as humans, as shown in Fig. 9 (a). Meanwhile, the HOG+SVM algorithm does not make the same mistakes, as illustrated in Fig. 9 (b). However, two other issues are observed. First, the bounding box is not fixed around the human objects, which may lead to inaccurate tracking performance in later steps. Second, in the middle of the frame, objects that are very close to each other are considered as one in some frames, although in later frames separate boxes are created for each person. Figures 9 (c) and 9 (d) verify the high accuracy achieved by the CNNs at the edge.

Figure 10 highlights the results of the L-CNN algorithm in processing video frames in which human objects are captured from various angles and distances. These are challenging scenarios for detection algorithms to decide whether or not the objects are human beings. Not only do the visible features vary when the angles and distances are different, but sometimes the human body is also only partially visible or in different gestures. For example, in the upper-right subfigure, the legs of the worker standing in the middle overlap, and the second person has only the head and part of the left arm captured. In the lower-left subfigure, neither leg of the pedestrian is visible. Many algorithms either cannot identify such an object as a human body or incur a very high false positive rate.

Figure 10: L-CNN: A human object from variant angles and distances.

Table I compares different CNN architectures with the proposed L-CNN, including several well-known architectures such as VGG, GoogleNet, and the lightweight MobileNet. Note that SSD networks, because of their changes in architecture and their need for training images annotated with object contours, are dissimilar to classification CNNs; the SSD CNNs in this table are architectures that are trained using SSD. This test is performed on a desktop machine without a graphics card, with an Intel(R) Core(TM) i7 processor (3.40 GHz) and 16 GB of RAM. The results match our intuition that many heavy algorithms are not good choices for an edge device, as they require up to 20 times more memory space.

Architecture Memory (MB)
VGG 2459.8
SSD-GoogleNet 320.4
SqueezeNet 145.3
MobileNet 172.2
SSD-L-CNN 139.5
Table I: Memory utility of CNNs.
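The relative memory cost of each architecture can be read off Table I with a short calculation. The sketch below only divides the tabulated values by the SSD-L-CNN figure; the dictionary name `MEMORY_MB` and the function name `memory_ratios` are hypothetical.

```python
# Memory usage in MB, copied from Table I.
MEMORY_MB = {
    "VGG": 2459.8,
    "SSD-GoogleNet": 320.4,
    "SqueezeNet": 145.3,
    "MobileNet": 172.2,
    "SSD-L-CNN": 139.5,
}


def memory_ratios(baseline="SSD-L-CNN"):
    """Return each architecture's memory use as a multiple of the baseline's."""
    base = MEMORY_MB[baseline]
    return {name: mb / base for name, mb in MEMORY_MB.items()}


for name, ratio in sorted(memory_ratios().items(), key=lambda item: item[1]):
    print("{:<14} {:.2f}x".format(name, ratio))
```

With these values, VGG needs roughly 17.6 times the memory of SSD-L-CNN, which is the largest gap in the table.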

VII Conclusions

To provide proactive urban surveillance and human behavior recognition and prediction as edge network services, timely and accurate human object detection at the edge is the essential first step. While there are many algorithms for human detection, they are not suitable for the edge computing environment. In this paper, leveraging the Depthwise Separable Convolutional network, a lightweight CNN architecture is introduced for human object detection at the edge. This model was trained using the VOC07 dataset, which contains the coordinates of the objects of interest. The MXNet platform for neural networks was used for training, and OpenCV libraries were later used for implementation on the edge device.

This paper has also studied the advantages and constraints of two widely used human object detection algorithms, namely the Haar Cascaded object detector and the HOG+SVM human detector, in the context of edge computing. Along with GoogLeNet, they are implemented on a Raspberry PI as an edge device for a comparison study. The experimental results have verified that the proposed L-CNN algorithm has met the design goals. The L-CNN has achieved a satisfactory FPS (maximum 2.03 and average 1.79) and high accuracy (false positive rate of 6.6% and false negative rate of 18.1%). It uses two times fewer parameters than GoogleNet and occupies 2.6 times less memory than SSD GoogleNet.

With the capability of immediate human object identification, our on-going efforts include the following tasks: (1) lightweight object tracking, (2) human behavior recognition, (3) suspicious activity prediction and early alarm, and (4) video clip marking for batch replay. Our ultimate goal is a proactive surveillance system that enables a safer and more secure community by identifying suspicious activities and raising alerts before damage is caused.


  • [1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin et al., “Tensorflow: Large-scale machine learning on heterogeneous distributed systems,” arXiv preprint arXiv:1603.04467, 2016.
  • [2] E. Ahmed and M. H. Rehmani, “Mobile edge computing: opportunities, solutions, and challenges,” 2017.
  • [3] J. Ahn, J. Paek, and J. Ko, “Machine learning-based image classification for wireless camera sensor networks,” in Embedded and Real-Time Computing Systems and Applications (RTCSA), 2016 IEEE 22nd International Conference on.   IEEE, 2016, pp. 103–103.
  • [4] M. Ali, R. Dhamotharan, E. Khan, S. U. Khan, A. V. Vasilakos, K. Li, and A. Y. Zomaya, “Sedasc: secure data sharing in clouds,” IEEE Systems Journal, vol. 11, no. 2, pp. 395–404, 2017.
  • [5] D. Anisimov and T. Khanova, “Towards lightweight convolutional neural networks for object detection,” in Advanced Video and Signal Based Surveillance (AVSS), 2017 14th IEEE International Conference on.   IEEE, 2017, pp. 1–8.
  • [6] H. Cao, M. Wachowicz, C. Renso, and E. Carlini, “An edge-fog-cloud platform for anticipatory learning process designed for internet of mobile things,” arXiv preprint arXiv:1711.09745, 2017.
  • [7] A. Cenedese, A. Zanella, L. Vangelista, and M. Zorzi, “Padova smart city: An urban internet of things experimentation,” in World of Wireless, Mobile and Multimedia Networks (WoWMoM), 2014 IEEE 15th International Symposium on a.   IEEE, 2014, pp. 1–6.
  • [8] F. F. Chamasemani and L. S. Affendey, “Systematic review and classification on video surveillance systems,” International Journal of Information Technology and Computer Science (IJITCS), vol. 5, no. 7, p. 87, 2013.
  • [9] N. Chen, Y. Chen, E. Blasch, H. Ling, Y. You, and X. Ye, “Enabling smart urban surveillance at the edge,” in 2017 IEEE International Conference on Smart Cloud (SmartCloud).   IEEE, 2017, pp. 109–119.
  • [10] N. Chen, Y. Chen, S. Song, C.-T. Huang, and X. Ye, “Smart urban surveillance using fog computing,” in Edge Computing (SEC), IEEE/ACM Symposium on.   IEEE, 2016, pp. 95–96.
  • [11] N. Chen, Y. Chen, Y. You, H. Ling, P. Liang, and R. Zimmermann, “Dynamic urban surveillance video stream processing using fog computing,” in Multimedia Big Data (BigMM), 2016 IEEE Second International Conference on.   IEEE, 2016, pp. 105–112.
  • [12] F. Chollet et al., “Keras,” 2015.
  • [13] Cisco, “Cisco visual networking index: Forecast and methodology, 2016–2021,” 2017.
  • [14] M. Cristani, R. Raghavendra, A. Del Bue, and V. Murino, “Human behavior analysis in video surveillance: A social signal processing perspective,” Neurocomputing, vol. 100, pp. 86–97, 2013.
  • [15] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1.   IEEE, 2005, pp. 886–893.
  • [16] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on.   IEEE, 2009, pp. 248–255.
  • [17] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results.”
  • [18] C.-T. Fan, Y.-K. Wang, and C.-R. Huang, “Heterogeneous information fusion and visualization for a large-scale intelligent video surveillance system,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 47, no. 4, pp. 593–604, 2017.
  • [19] T. Fuse and K. Kamiya, “Statistical anomaly detection in human dynamics monitoring using a hierarchical dirichlet process hidden markov model,” IEEE Transactions on Intelligent Transportation Systems, 2017.
  • [20] S. Hannuna, M. Camplani, J. Hall, M. Mirmehdi, D. Damen, T. Burghardt, A. Paiement, and L. Tao, “Ds-kcf: a real-time tracker for rgb-d data,” Journal of Real-Time Image Processing, pp. 1–20, 2016.
  • [21] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [22] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, “High-speed tracking with kernelized correlation filters,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 3, pp. 583–596, 2015.
  • [23] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” science, vol. 313, no. 5786, pp. 504–507, 2006.
  • [24] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.
  • [25] W. Hu, Y. Huang, L. Wei, F. Zhang, and H. Li, “Deep convolutional neural networks for hyperspectral image classification,” Journal of Sensors, vol. 2015, 2015.
  • [26] W. Hu, T. Tan, L. Wang, and S. Maybank, “A survey on visual surveillance of object motion and behaviors,” IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 34, no. 3, pp. 334–352, 2004.
  • [27] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, “Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5 MB model size,” arXiv preprint arXiv:1602.07360, 2016.
  • [28] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in Proceedings of the 22nd ACM international conference on Multimedia.   ACM, 2014, pp. 675–678.
  • [29] K. Kolomvatsos and C. Anagnostopoulos, “Reinforcement learning for predictive analytics in smart cities,” in Informatics, vol. 4, no. 3.   Multidisciplinary Digital Publishing Institute, 2017, p. 16.
  • [30] R. Lienhart and J. Maydt, “An extended set of haar-like features for rapid object detection,” in Image Processing. 2002. Proceedings. 2002 International Conference on, vol. 1.   IEEE, 2002, pp. I–I.
  • [31] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European conference on computer vision.   Springer, 2014, pp. 740–755.
  • [32] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “Ssd: Single shot multibox detector,” in European conference on computer vision.   Springer, 2016, pp. 21–37.
  • [33] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International journal of computer vision, vol. 60, no. 2, pp. 91–110, 2004.
  • [34] J. Ma, Y. Dai, and K. Hirota, “A survey of video-based crowd anomaly detection in dense scenes,” Journal of Advanced Computational Intelligence and Intelligent Informatics, vol. 21, no. 2, pp. 235–246, 2017.
  • [35] S. Y. Nikouei, Y. Chen, S. Song, R. Xu, B.-Y. Choi, and T. R. Faughnan, “Real-time human detection as an edge service enabled by a lightweight cnn,” arXiv preprint arXiv:1805.00330, 2018.
  • [36] C. Piciarelli, L. Esterle, A. Khan, B. Rinner, and G. L. Foresti, “Dynamic reconfiguration in camera networks: a short survey,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 26, no. 5, pp. 965–977, 2016.
  • [37] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 779–788.
  • [38] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 6, pp. 1137–1149, 2017.
  • [39] M. Ribeiro, A. E. Lazzaretti, and H. S. Lopes, “A study of deep convolutional auto-encoders for anomaly detection in videos,” Pattern Recognition Letters, 2017.
  • [40] W. Shi, J. Cao, Q. Zhang, Y. Li, and L. Xu, “Edge computing: Vision and challenges,” IEEE Internet of Things Journal, vol. 3, no. 5, pp. 637–646, 2016.
  • [41] L. Sifre, “Rigid-motion scattering for image classification,” Ph.D. dissertation, PSU, 2014.
  • [42] P. Sun, Y. Wen, T. N. B. Duong, and S. Yan, “Timed dataflow: Reducing communication overhead for distributed machine learning systems,” in Parallel and Distributed Systems (ICPADS), 2016 IEEE 22nd International Conference on.   IEEE, 2016, pp. 1110–1117.
  • [43] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9.
  • [44] X. Wang, “Intelligent multi-camera video surveillance: A review,” Pattern recognition letters, vol. 34, no. 1, pp. 3–19, 2013.
  • [45] J. Wu, “Mobility-enhanced public safety surveillance system using 3d cameras and high speed broadband networks,” GENI NICE Evening Demos, 2015.
  • [46] J. Z. Zhang, “Mxnet port of ssd: Single shot multibox object detector,” 2018.
  • [47] X. Zhang, X. Zhou, M. Lin, and J. Sun, “Shufflenet: An extremely efficient convolutional neural network for mobile devices,” arXiv preprint arXiv:1707.01083, 2017.
  • [48] L. Zhou, S. Pan, J. Wang, and A. V. Vasilakos, “Machine learning on big data: Opportunities and challenges,” Neurocomputing, vol. 237, pp. 350–361, 2017.