The unprecedented pace of urbanization poses many opportunities and challenges. The recent concept of Smart Cities has attracted the attention of urban planners and researchers seeking to enhance the security and well-being of residents. One of the most essential smart community services is intelligent resident surveillance. It enables a broad spectrum of promising applications, including access control in areas of interest, human identity or behavior recognition, detection of anomalous behaviors, interactive surveillance using multiple cameras, and crowd flux statistics and congestion analysis.
Many of these smart surveillance applications require significant computing and storage resources to handle the massive contextual data created by video sensors. The cloud computing paradigm provides excellent flexibility and scales with the increasing number of surveillance cameras. In practice, however, a remote cloud-based smart surveillance architecture faces significant hurdles. Many key surveillance applications, such as monitoring and tracking, need real-time capability. Processing raw video data from widely distributed video sensors, such as Closed-Circuit Television (CCTV) cameras and mobile cameras, not only introduces uncertainty in data transfer and timing but also imposes significant overhead and delay on the communication networks. It may also raise data security and privacy issues by giving adversaries more opportunities to attack. Therefore, current surveillance applications mostly serve offline forensic analysis rather than acting as a proactive tool to deter suspicious activities before damage is caused.
Edge computing as a surveillance service is considered the answer to these shortcomings. Edge computing migrates more computing tasks to the connected smart “things” (sensors and actuators) at the edge of the network. Consequently, it offers the following advantages: real-time response, lower network workload, lower energy consumption, and higher data security and privacy.
Despite the promising benefits of edge computing, one of the critical challenges is how to process data-intensive tasks efficiently and in real time on resource-constrained edge nodes. Specifically, many smart video surveillance approaches for object detection and tracking propose to use Artificial Intelligence (AI) and Machine Learning (ML) algorithms, which usually have high computational requirements. How to migrate those computing- and data-intensive tasks to the edge nodes is still a significant challenge.
In this paper, a novel lightweight, hybrid tracking algorithm named Kerman is proposed. It addresses the weaknesses of the well-known Kernelized Correlation Filter (KCF) by leveraging some features of the Kalman Filter (KF) and Background Subtraction (BS). The proposed Kerman algorithm achieves higher performance while preserving the favorable features of KCF, such as its fast adaptation and tracking. An experimental study has been conducted using real-world surveillance video streams on two types of single board computers (SBC) as edge devices, the Raspberry Pi 3 and the Tinker Board.
The rest of the paper is organized as follows. Section II discusses previous attempts at object tracking at the edge of the network. Section III presents the complete Kerman tracking algorithm. Experimental results of the tracker are presented in Section IV. Finally, Section V wraps up this paper with the conclusions.
II Motivation and Related Work
Currently, most video surveillance systems function as an archive of footage that is used for after-the-fact, offline forensic analysis and that depends on human operators in the processing loop. Because of long transmission and processing times, such systems cannot support uninterrupted, real-time video surveillance tasks. However, thanks to the rapid development of machine learning (ML) algorithms, very promising results have been reported in human-oriented surveillance, where human detection and tracking along with abnormal behavior detection are feasible using deep learning or other ML algorithms. Various methods exist for automated collection of video frames in a cloud and for detection of unusual events. The research community has recognized that heavy communication overhead is not tolerable in many delay-sensitive, mission-critical tasks. Leveraging the fog computing paradigm, online and uninterrupted target tracking systems have been proposed to meet the requirements of real-time video processing and instant decision making.
While a lot of work has been conducted in image processing areas such as object detection and tracking, only a small part of the literature focuses specifically on human objects. As the first step for any video surveillance application, object detection and coordinate allocation are essential for further object tracking tasks. CNNs offer high accuracy as well as shorter run time after training. Smaller networks are tailored for the constrained edge environment to resolve the issue of limited memory. For instance, a lightweight Single Shot Multi-box Detector (SSD) CNN was chosen for human detection and achieved decent performance. However, even lightweight CNNs designed for low-power devices are not fast enough to perform real-time object detection; the best ones reached about 2 FPS in experiments on the Raspberry Pi 3.
This calls for a tracker that can follow a human being once identified and does not confuse that person with other moving objects in the frame. Online trackers are preferred since they conduct training online for better results. This is critical because the people to be tracked may be of any size and wear clothes of any color, so offline-trained systems are not suitable. Meanwhile, the tracking tasks take place in the edge environment, where it is assumed that no GPUs are available; therefore, CPU-based tracking algorithms are the subject of this research.
Analytically, a smart surveillance task can be considered as a three-layer framework:
Layer 1: the low-level conducts information extraction like feature detection and object tracking;
Layer 2: the intermediate-level is in charge of mode recognition like action recognition and behavior understanding; and
Layer 3: the high-level is focused on decision making like abnormal event detection.
Functions belonging to each layer may be deployed at different positions of the hierarchical platform consisting of edge, fog, and cloud strata. Most of the first-layer functions are expected to run on the edge devices, like the surveillance cameras, leveraging lightweight algorithms. Figure 1 shows how each layer of the human surveillance system is mapped onto this hierarchy. The detection and tracking algorithms can be implemented at the edge layer with minimum latency and online training for better performance. It should be noted that in a human surveillance application, multiple objects may be tracked in one individual frame. The detection algorithm may give the bounding box around a person, but it does not guarantee the order of detection, which implies that an object labeled in one frame may be re-labeled differently and be referred to as another object in the following frame.
For trackers, the common challenges include partial or full object occlusions, scene illumination changes, and variations in object shape and motion. The well-known region-based tracking algorithm detects a human and extracts it from the background. Because it only separates the background and foreground based on Gaussian modeling, region-based tracking is not suitable for the surveillance application at hand. Another method is feature-based tracking, where a classifier looks for features that describe the object of interest well, such as lines or points that separate the object from the background. Feature-based methods suffer from the occlusion problem, as they need at least some sub-features to remain visible, and even then the accuracy of the classification drops. Active contour-based tracking represents an object's outline as a bounding contour.
In 2017, Need for Speed (NFS) was introduced as a benchmark dataset of very high quality videos; it divides object trackers into deep trackers and correlation filter (CF) trackers. Several of the best-performing algorithms are included in the benchmark. The fastest algorithm is the Multi-Domain Network (MDNet), which exceeds 50 FPS when tested on the benchmark. In contrast, the CF trackers like Multiple Instance Learning (MIL) and the Boosting algorithm are slow. The MOSSE filter is very fast but not accurate. The KCF also builds on MOSSE, but it achieves better accuracy with support from HOG features. Because of the boundary issues in frequency-domain learning, some researchers use boundary learning methods to reach good performance. KCF has a much higher speed on CPU than the others without sacrificing accuracy. CF trackers using CNNs achieve high accuracy at low speed.
III The Kerman Algorithm
III-A Design Rationale
In this effort, the edge computing paradigm is leveraged for real-time detection and tracking of human targets. All raw video streams are processed locally by edge devices instead of being sent to a remote cloud center over the communication network. While there are several methods for object tracking, only a few are feasible for edge implementation. The KCF algorithm is chosen as the foundation because it is fast and lightweight, which gives it the potential to serve the purpose of real-time surveillance at the edge. The kernel trick and element-wise multiplications in the frequency domain enable faster computation and thus the ability to use more complex classifiers online. However, the KCF has weaknesses to be addressed. It loses the object of interest if the object moves fast, and this flaw exists no matter how well the algorithm is implemented. The reason is that the KCF algorithm considers the background inside the given coordinates in each frame as well as the object. If the object of interest moves fast, the tracker will soon be mistakenly trained to focus on the background and lose the object of interest. Additionally, the tracker tends to stop when a pedestrian walks at the usual speed but a sharp border line of another object or a shadow blocks the human object of interest. Therefore, another method for occlusion detection is needed.
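To make the frequency-domain idea concrete, the following is a minimal sketch of a correlation filter trained and evaluated with FFTs. It assumes a linear kernel on raw pixel values, which is a simplification of the KCF (which typically uses a Gaussian kernel on HOG features); the function names `gaussian_target`, `train`, and `detect` are hypothetical and not taken from the paper.

```python
import numpy as np

def gaussian_target(shape, sigma=2.0):
    """Desired response: a Gaussian peak at index (0, 0), wrapped cyclically."""
    h, w = shape
    ys = np.minimum(np.arange(h), h - np.arange(h))  # cyclic distance to row 0
    xs = np.minimum(np.arange(w), w - np.arange(w))  # cyclic distance to column 0
    return np.exp(-(ys[:, None] ** 2 + xs[None, :] ** 2) / (2.0 * sigma ** 2))

def train(patch, target, lam=1e-2):
    """Ridge regression over all cyclic shifts, solved element-wise in Fourier space."""
    x_hat = np.fft.fft2(patch)
    k_hat = np.conj(x_hat) * x_hat  # linear-kernel auto-correlation (assumption)
    alpha_hat = np.fft.fft2(target) / (k_hat + lam)
    return x_hat, alpha_hat

def detect(x_hat, alpha_hat, patch):
    """Response map for a new patch; the location of its peak is the cyclic shift."""
    z_hat = np.fft.fft2(patch)
    k_hat = np.conj(x_hat) * z_hat  # cross-correlation of template and new patch
    return np.real(np.fft.ifft2(k_hat * alpha_hat))

# Synthetic example: the "object" is a random texture shifted by (3, 5) pixels.
rng = np.random.default_rng(0)
template = rng.standard_normal((32, 32))
x_hat, alpha_hat = train(template, gaussian_target(template.shape))
moved = np.roll(template, shift=(3, 5), axis=(0, 1))
response = detect(x_hat, alpha_hat, moved)
peak = np.unravel_index(np.argmax(response), response.shape)
```

Because training and detection reduce to element-wise operations on FFTs, the cost per frame is dominated by a few transforms of the patch, which is what makes this family of trackers attractive on CPU-only devices.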
The Kalman Filter (KF) is one of the most popular object tracking methods. If the object of interest is viewed as a system represented by the central point of its bounding box, its position in the next frame can be predicted. Running additional code alongside the KCF algorithm may improve the accuracy of the tracker at the cost of slightly lower speed. The KF needs to be fed the actual position of the object of interest in each frame to create a feedback system and update its parameters. Here, the KF is treated as a post-processing step for the KCF algorithm, and the output of the KCF algorithm is used for the KF update. If the KCF bounding box stops because of a tracking error, the KF follows with a delay because it tends to remain in the state it held before the update. This delay creates a distance between the center of the KCF bounding box and the output of the KF, which can be used as an indicator of occlusions and to prevent the KCF algorithm from mistakenly re-focusing on the background or on items other than the object of interest.
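The lag described above can be illustrated with a small constant-velocity Kalman filter on the bounding-box center. This is a sketch under stated assumptions: the state layout `[x, y, vx, vy]`, the noise levels `q` and `r`, and the class name `CenterKalman` are illustrative choices, not parameters reported in the paper.

```python
import numpy as np

class CenterKalman:
    """Constant-velocity Kalman filter over a bounding-box center (assumed model)."""

    def __init__(self, cx, cy, q=1e-2, r=1.0):
        self.x = np.array([cx, cy, 0.0, 0.0])  # state: x, y, vx, vy
        self.P = np.eye(4) * 100.0             # large initial uncertainty
        self.F = np.array([[1, 0, 1, 0],
                           [0, 1, 0, 1],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], dtype=float)
        self.H = np.array([[1, 0, 0, 0],
                           [0, 1, 0, 0]], dtype=float)
        self.Q = np.eye(4) * q                 # process noise
        self.R = np.eye(2) * r                 # measurement noise

    def step(self, measured_center):
        """Predict one frame ahead, then correct with the KCF center."""
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        z = np.asarray(measured_center, dtype=float)
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ (z - self.H @ self.x)
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x[:2].copy()

def center_gap(kcf_center, kf_center):
    """Euclidean distance between the KCF center and the KF output."""
    return float(np.hypot(kcf_center[0] - kf_center[0],
                          kcf_center[1] - kf_center[1]))

# Simulated KCF centers: steady motion of 5 px/frame, then the box stalls.
kf = CenterKalman(0.0, 100.0)
for t in range(1, 21):
    kcf_center = (5.0 * t, 100.0)
    steady_gap = center_gap(kcf_center, kf.step(kcf_center))
stalled_center = (100.0, 100.0)  # KCF box stops moving
stalled_gap = center_gap(stalled_center, kf.step(stalled_center))
```

While the KCF center moves smoothly, the gap stays close to zero; when the KCF box stalls, the KF keeps its estimated velocity for a while and the gap jumps, which is the kind of signal that can be thresholded.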
By nature, the movement of a person is hard to predict; a person can make sudden changes in speed or direction. These variations create the same challenges that occlusions introduce. On the one hand, the KCF is very capable of following an object with sudden changes in appearance or moving direction. On the other hand, the KF lacks this ability and follows with a delay, but this “shortcoming” is useful for stopping the KCF from launching wrong updates. The key is an intelligent decision-making mechanism that helps the system choose correctly between the KCF and the KF. In this paper, the background subtraction (BS) method based on the Gaussian Mixture-based Background/Foreground Segmentation Algorithm is selected to address this issue.
Basically, given a bounding box from the KCF algorithm, the pixels of the mask that are classified as foreground and lie near the KCF bounding box are considered the object of interest. The background subtraction is executed only once per frame. The algorithm complexity is $O(N)$, where $N$ is the number of modeling pixels. In the framework of our proposal, background subtraction is not used for object tracking. Instead, it functions as an indicator telling the system whether the KCF or the KF should be applied. The bounding boxes output by the background subtraction can be at any position in the frame, including positions where no human is detected. Thus, it is important to associate them with each human being tracked. If the center of a contour is within range of the KCF bounding box, the background subtraction confirms that the human object is the subject of tracking.
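The association step can be sketched as a simple geometric test. The snippet below assumes boxes in `(x, y, w, h)` form and a hypothetical `margin` parameter that widens the KCF box to define "in range"; the paper does not state how the range is chosen, and the foreground boxes would come from contours of the background subtraction mask.

```python
def box_center(box):
    """Center (cx, cy) of a box given as (x, y, w, h)."""
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)

def in_range(kcf_box, point, margin=10.0):
    """True if the point lies inside the KCF box widened by `margin` pixels."""
    x, y, w, h = kcf_box
    px, py = point
    return (x - margin <= px <= x + w + margin and
            y - margin <= py <= y + h + margin)

def matching_foreground(kcf_box, foreground_boxes, margin=10.0):
    """Foreground boxes whose centers fall in range of the KCF box."""
    return [b for b in foreground_boxes
            if in_range(kcf_box, box_center(b), margin)]

def bs_confirms(kcf_box, foreground_boxes, margin=10.0):
    """Background subtraction 'votes' for the track if any blob matches it."""
    return len(matching_foreground(kcf_box, foreground_boxes, margin)) > 0

# Example: one blob overlaps the tracked person, one is elsewhere in the frame.
kcf_box = (100, 100, 40, 80)
blobs = [(105, 110, 30, 60), (300, 50, 20, 20)]
matched = matching_foreground(kcf_box, blobs)
```

Only the first blob is associated with the track; the second one is ignored for this object but may still match another tracked person.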
III-B The Kerman Algorithm: Pseudocode
The proposed Kerman algorithm is an answer to the flaws in the KCF algorithm. While each individual method adds some overhead, the combination is lighter than simply cascading the three methods in series.
The Kerman algorithm is designed on the premise that tracking with an online learning method is the best way to make the tracker adaptive to the object of interest and more robust to sudden changes in the object's appearance. In surveillance applications, pedestrians are walking and may change direction swiftly. Consequently, the tracker may lose the target, because such a change alters the features that the tracker is using.
Pseudocode of the proposed tracker is presented in Algorithm 1. The Kerman algorithm makes decisions based on a majority voting mechanism that takes into account the opinions of the KCF, KF, and BS. These algorithms work together in each situation to stop KCF training, recalibrate the bounding box, or continue on the normal path. In order to use the center coordinates obtained by the KCF, the tracker is set to KCF by default. Next, the center of the bounding box given by the KCF is set as the coordinate for each specific object. Knowing the centers given by the KF and BS algorithms for the same object yields two gradients for the corresponding object, and a threshold on these gradients sets the flag to recalibrate the KCF bounding box. It should be noted that the Kerman algorithm introduces a class whose objects are created in a multi-threaded manner in order to utilize the multi-core processor of the edge device. The number of threads depends on the number of cores the edge device has.
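Since Algorithm 1 is not reproduced in this text, the following is only a hypothetical reading of the voting step, not the authors' pseudocode. It assumes that the "gradients" are distances from the KCF center to the KF and BS centers, that a single `threshold` in pixels applies to both, and that a missing BS match means the model update should be frozen; the names `Action`, `decide`, and `decide_all` are illustrative.

```python
import math
import os
from concurrent.futures import ThreadPoolExecutor
from enum import Enum

class Action(Enum):
    CONTINUE = "continue"            # keep tracking and training the KCF
    STOP_TRAINING = "stop_training"  # keep the box but freeze the KCF model
    RECALIBRATE = "recalibrate"      # move the KCF box to the other estimates

def distance(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def decide(kcf_center, kf_center, bs_center, threshold=20.0):
    """Hypothetical vote among the KCF, KF, and BS centers for one object."""
    if bs_center is None:
        # No foreground blob near the box: possible occlusion (assumed rule).
        return Action.STOP_TRAINING
    kf_disagrees = distance(kcf_center, kf_center) > threshold
    bs_disagrees = distance(kcf_center, bs_center) > threshold
    if kf_disagrees and bs_disagrees:
        return Action.RECALIBRATE    # the KCF is outvoted two to one
    if kf_disagrees or bs_disagrees:
        return Action.STOP_TRAINING  # partial disagreement: do not learn
    return Action.CONTINUE

def decide_all(tracks, threshold=20.0):
    """Evaluate every tracked object, using one worker thread per CPU core."""
    workers = os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda t: decide(*t, threshold=threshold), tracks))

tracks = [
    ((100, 100), (102, 101), (99, 100)),   # all three agree
    ((100, 100), (140, 100), (150, 100)),  # KF and BS both far from KCF
    ((100, 100), (101, 100), None),        # no foreground blob matched
]
actions = decide_all(tracks)
```

In CPython, threads mainly help when the per-object work releases the global interpreter lock, for example inside native image processing calls; the pure-Python vote here is cheap either way.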
The overall workflow of the Kerman algorithm is shown in Fig. 2, where the bounding boxes of the human objects are produced from the input frame.
IV Experimental Results
The proposed Kerman algorithm has been implemented and tested on two types of Single Board Computers (SBC). One is the Raspberry Pi 3 with 1 GB of RAM and an ARMv7 1.2 GHz processor; the other is a Tinker Board with a 1.8 GHz ARM-based RK3288 SoC and 2 GB of dual-channel LPDDR3 memory. These SBCs are selected as the edge devices because of their low price (<$80 each) and because they are capable of running UNIX or Android based operating systems, so they support high-level programming languages like Python and various I/O connections. Indeed, there is a trend of increasingly powerful small devices like SBCs showing faster and more reliable performance at the edge of the network.
IV-A High Level Overview
In the tracking process, objects that are already being tracked should not be labeled again. Therefore, the center of a new detection should lie outside a circle whose diameter is that of an object already in the tracking queue. Meanwhile, a non-trivial lag between the bounding box and the actual position of the object will cause the tracking algorithm to fail. Figure 3 shows example cases in which the object was lost when using the KCF algorithm only. In the upper left image, the tracker lag leads the system to re-detect the human and label him as a new object. This comparison is based on the KCF because, to the best of the authors' knowledge, it is the fastest among today's online tracking methods that are not based on CNNs. The human object detection method applied here is the L-CNN, which is tailored for the edge environment, to detect humans and pass them to the tracker. As shown in Fig. 4, the KF lags behind the KCF algorithm when the object moves fast or changes its direction abruptly.
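The rule for skipping already-tracked objects can be sketched as follows. The snippet assumes boxes in `(x, y, w, h)` form and takes the circle diameter to be the larger side of the tracked box; that choice, and the function name `is_new_object`, are assumptions made for illustration.

```python
import math

def box_center(box):
    """Center (cx, cy) of a box given as (x, y, w, h)."""
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)

def is_new_object(detection_box, tracked_boxes):
    """True if the detection center lies outside the circle of every tracked box."""
    dx, dy = box_center(detection_box)
    for tracked in tracked_boxes:
        tx, ty = box_center(tracked)
        radius = max(tracked[2], tracked[3]) / 2.0  # assumed circle size
        if math.hypot(dx - tx, dy - ty) <= radius:
            return False  # same person as an existing track: do not relabel
    return True

tracked = [(100, 100, 40, 80)]      # one person already in the queue
same_person = (104, 98, 38, 82)     # detector box, slightly shifted
newcomer = (300, 120, 40, 80)       # detection far from any track
results = (is_new_object(same_person, tracked), is_new_object(newcomer, tracked))
```

Only the second detection would start a new track; the first is treated as a repeat of the person already being followed.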
The proposed Kerman algorithm integrates three fast tracking algorithms, KCF, KF, and BS, to achieve higher accuracy than any of them can achieve separately. Kerman makes better decisions when the human object moves faster than the KCF can follow or when there is an occlusion between the object of interest and another object in the frame. Although the tracker is able to find a lost object of interest again using the object detection algorithm, the performance is impacted significantly, and the interrupted tracking further slows down the next steps in surveillance, e.g., anomalous behavior detection.
Figure 5 shows the results of processing real-life surveillance footage and compares the Kerman algorithm with the KCF algorithm. Part (a) shows instances from the KCF-based algorithm, where the object is lost or the bounding box around the object lags behind or contains a large empty space. In contrast, Fig. 5(b) shows the results of the Kerman algorithm for the same instances of the same video stream, where the bounding box fits better and the object is not lost.
IV-B Performance Analysis
The performance of the Kerman algorithm has been evaluated experimentally in terms of memory consumption, CPU usage, and video processing speed (FPS). Figure 6 compares the memory consumption of the Kerman algorithm with the memory used by the KCF algorithm. Memory usage is recorded over 30 seconds of run time in two scenarios. In one scenario, there are only one or two human objects in the frame and they are positioned far away from each other. In the other scenario, the frame is more crowded, with at most 10 pedestrians in the frame at the same time. The memory usage shown in Fig. 6 includes both the tracking and detection algorithms. The human detection algorithm is the L-CNN, which runs once every five frames for a video input rate of 10 FPS, that is, two times per second. Considering the velocity of pedestrians, this is sufficient. With up to 10 human objects in the frame, the algorithm needs up to 350 MB of memory, which is available even on a memory-limited device like the Raspberry Pi. The experimental results also show that the memory consumption is not sensitive to the number of objects in the frame. The difference between having many objects and having few objects is not significant, which verifies that the Kerman algorithm is scalable in terms of memory usage.
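The detector schedule described above amounts to simple arithmetic, sketched here with hypothetical helper names. It assumes the detector fires on frame indices that are multiples of the interval, which the text does not specify.

```python
def detector_frames(total_frames, interval=5):
    """Indices of frames on which the detector runs (assumed to start at frame 0)."""
    return [i for i in range(total_frames) if i % interval == 0]

def detections_per_second(fps=10, interval=5):
    """Detector invocations per second for a given input rate and interval."""
    return fps / interval

one_second = detector_frames(10)   # frames 0..9 of a 10 FPS stream
rate = detections_per_second()     # 10 FPS with a five-frame interval
```

On the remaining frames only the tracker runs, which is why the tracker speed, rather than the detector speed, bounds the achievable frame rate.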
CPU usage is also a critical metric for the detection and tracking algorithms as a whole system designed for human surveillance automation. The percentage of CPU usage is read on the Raspberry Pi 3 and the Tinker Board over 30 seconds of run time and averaged. The same video inputs used for the memory consumption test are used to assess CPU usage, again divided into two scenarios: one with at most 2 human objects in a frame and another with 6-10 human objects. Both cases are handled using the CPU only.
V Conclusions
In this paper, the Kerman algorithm is introduced. It integrates three well-known lightweight tracking algorithms, KCF, KF, and BS, to enable smart surveillance as an edge service. The Kerman algorithm calculates the gradients of the KF and BS outputs relative to the KCF output for each object of interest and recalibrates the bounding box given by the KCF. On the selected edge devices, the Kerman algorithm achieved decent performance in processing real-world security surveillance videos. The experimental results verified that the Kerman algorithm addresses the flaws associated with the KCF algorithm at a tolerable trade-off in processing time. It meets the design goals and is a feasible solution at the edge.
-  E. Ahmed and M. H. Rehmani, “Mobile edge computing: opportunities, solutions, and challenges,” 2017.
-  M. Ali, R. Dhamotharan, E. Khan, S. U. Khan, A. V. Vasilakos, K. Li, and A. Y. Zomaya, “Sedasc: secure data sharing in clouds,” IEEE Systems Journal, vol. 11, no. 2, pp. 395–404, 2017.
-  B. Babenko, M.-H. Yang, and S. Belongie, “Visual tracking with online multiple instance learning,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009, pp. 983–990.
-  D. S. Bolme, J. R. Beveridge, B. A. Draper, and Y. M. Lui, “Visual object tracking using adaptive correlation filters,” in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010, pp. 2544–2550.
-  H. Cao, M. Wachowicz, C. Renso, and E. Carlini, “An edge-fog-cloud platform for anticipatory learning process designed for internet of mobile things,” arXiv preprint arXiv:1711.09745, 2017.
-  A. Cenedese, A. Zanella, L. Vangelista, and M. Zorzi, “Padova smart city: An urban internet of things experimentation,” in World of Wireless, Mobile and Multimedia Networks (WoWMoM), 2014 IEEE 15th International Symposium on a. IEEE, 2014, pp. 1–6.
-  F. F. Chamasemani and L. S. Affendey, “Systematic review and classification on video surveillance systems,” International Journal of Information Technology and Computer Science (IJITCS), vol. 5, no. 7, p. 87, 2013.
-  N. Chen, Y. Chen, S. Song, C.-T. Huang, and X. Ye, “Smart urban surveillance using fog computing,” in Edge Computing (SEC), IEEE/ACM Symposium on. IEEE, 2016, pp. 95–96.
-  N. Chen, Y. Chen, Y. You, H. Ling, P. Liang, and R. Zimmermann, “Dynamic urban surveillance video stream processing using fog computing,” in Multimedia Big Data (BigMM), 2016 IEEE Second International Conference on. IEEE, 2016, pp. 105–112.
-  M. Cristani, R. Raghavendra, A. Del Bue, and V. Murino, “Human behavior analysis in video surveillance: A social signal processing perspective,” Neurocomputing, vol. 100, pp. 86–97, 2013.
-  H. K. Galoogahi, A. Fagg, C. Huang, D. Ramanan, and S. Lucey, “Need for speed: A benchmark for higher frame rate object tracking,” in 2017 IEEE International Conference on Computer Vision (ICCV). IEEE, 2017, pp. 1134–1143.
-  H. K. Galoogahi, T. Sim, and S. Lucey, “Correlation filters with limited boundaries,” in Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on. IEEE, 2015, pp. 4630–4638.
-  H. Grabner, M. Grabner, and H. Bischof, “Real-time tracking via on-line boosting,” in BMVC, vol. 1, no. 5, 2006, p. 6.
-  J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, “High-speed tracking with kernelized correlation filters,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 3, pp. 583–596, 2015.
-  A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.
-  W. Hu, T. Tan, L. Wang, and S. Maybank, “A survey on visual surveillance of object motion and behaviors,” IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 34, no. 3, pp. 334–352, 2004.
-  F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, “Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5MB model size,” arXiv preprint arXiv:1602.07360, 2016.
-  C. Mouradian, D. Naboulsi, S. Yangui, R. H. Glitho, M. J. Morrow, and P. A. Polakos, “A comprehensive survey on fog computing: State-of-the-art and research challenges,” IEEE Communications Surveys & Tutorials, 2017.
-  M. Mukherjee, L. Shu, and D. Wang, “Survey of fog computing: Fundamental, network applications, and research challenges,” IEEE Communications Surveys & Tutorials, 2018.
-  H. Nam and B. Han, “Learning multi-domain convolutional neural networks for visual tracking,” in Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on. IEEE, 2016, pp. 4293–4302.
-  D. T. Nguyen, W. Li, and P. O. Ogunbona, “Human detection from images and videos: A survey,” Pattern Recognition, vol. 51, pp. 148–175, 2016.
-  S. Y. Nikouei, Y. Chen, S. Song, R. Xu, B.-Y. Choi, and T. R. Faughnan, “Intelligent surveillance as an edge network service: from harr-cascade, svm to a lightweight cnn,” arXiv preprint arXiv:1805.00331, 2018.
-  ——, “Real-time human detection as an edge service enabled by a lightweight cnn,” in Edge Computing, the IEEE International Conference on, 2018.
-  S. Ojha and S. Sakhare, “Image processing techniques for object tracking in video surveillance-a survey,” in Pervasive Computing (ICPC), 2015 International Conference on. IEEE, 2015, pp. 1–6.
-  R. O’Malley, E. Jones, and M. Glavin, “Rear-lamp vehicle detection and tracking in low-exposure color video for night conditions,” IEEE Transactions on Intelligent Transportation Systems, vol. 11, no. 2, pp. 453–462, 2010.
-  Y. Pang, H. Yan, Y. Yuan, and K. Wang, “Robust cohog feature extraction in human-centered image/video management system,” IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 42, no. 2, pp. 458–468, 2012.
-  H. A. Patel and D. G. Thakore, “Moving object tracking using kalman filter,” International Journal of Computer Science and Mobile Computing, vol. 2, no. 4, pp. 326–332, 2013.
-  W. Shi, J. Cao, Q. Zhang, Y. Li, and L. Xu, “Edge computing: Vision and challenges,” IEEE Internet of Things Journal, vol. 3, no. 5, pp. 637–646, 2016.
-  S. Vishwakarma and A. Agrawal, “A survey on activity recognition and behavior understanding in video surveillance,” The Visual Computer, vol. 29, no. 10, pp. 983–1009, 2013.
-  C. R. Wren, A. Azarbayejani, T. Darrell, and A. P. Pentland, “Pfinder: Real-time tracking of the human body,” IEEE Transactions on pattern analysis and machine intelligence, vol. 19, no. 7, pp. 780–785, 1997.
-  Y. Wu, J. Lim, and M.-H. Yang, “Object tracking benchmark,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 9, pp. 1834–1848, 2015.
-  Z. Zivkovic and F. Van Der Heijden, “Efficient adaptive density estimation per image pixel for the task of background subtraction,” Pattern Recognition Letters, vol. 27, no. 7, pp. 773–780, 2006.