I-SAFE: Instant Suspicious Activity identiFication at the Edge using Fuzzy Decision Making

09/12/2019 ∙ by Seyed Yahya Nikouei, et al. ∙ Binghamton University

Urban imagery usually serves forensic analysis and by design is available for incident mitigation. As more imagery is collected, it becomes harder to narrow thousands of video clips down to the frames relevant to a specific incident. A real-time, proactive surveillance system is desirable, which could instantly detect dubious personnel, identify suspicious activities, or raise momentous alerts. The recent proliferation of the edge computing paradigm allows more data-intensive tasks to be accomplished by smart edge devices with lightweight but powerful algorithms. This paper presents a forensic surveillance strategy by introducing Instant Suspicious Activity identiFication at the Edge (I-SAFE) using fuzzy decision making. A fuzzy control system is proposed to mimic the decision-making process of a security officer. Decisions are made based on video features extracted by a lightweight Deep Machine Learning (DML) model. Based on the requirements of first-line law enforcement officers, several features are selected and fuzzified to cope with the uncertainty that exists in the officers' decision-making process. Using features in the edge hierarchy minimizes the communication delay such that instant alerting is achieved. Additionally, by leveraging a microservices architecture, the I-SAFE scheme scales well given the increasing complexity at the network edge. Implemented as an edge-based application and tested using exemplary surveillance videos and various labeled datasets, the I-SAFE scheme raises alerts by identifying suspicious activity in an average of 0.002 seconds. Compared to four other state-of-the-art methods over two other data sets, the experimental study verified the superiority of the I-SAFE decentralized method.


I Introduction

Ensuring the safety and well-being of residents in populated cities is a rising challenge. City planners have answered the security challenge by adding more cameras to enhance ubiquitous surveillance [10]. The capability of real-time activity monitoring enables faster reaction by first responders in emergencies. For instance, when a security agent sees the live footage and identifies a problem, quick action can be taken. However, it is very difficult, if not impossible, for a security agent to focus on one of so many cameras when an event happens. In fact, most surveillance video streams are normally used as forensic evidence for diagnostics, lessons learned, and preparation for future events. Likewise, it takes a long time to search for information in thousands of video clips. The situation could be worse when the footage has been deleted due to limited storage space. To utilize the limited storage space more efficiently, many video surveillance systems integrate motion sensors. The cameras do not capture video unless they are triggered by the motion sensors’ detection of certain movements.

Currently, smart security cameras use intelligence to detect, classify, and recognize objects of interest to determine which video clips to retain [27]. These advances accelerate decision making for the operator and also classify the data for later forensic analysis. Recently, machine learning (ML) models have been adopted to detect anomalous behaviours by identifying certain bio-mechanical movements [43], but they suffer from high false positive rates caused by inadequate training.

Edge computing is recognized as a promising solution to tackle the challenges in today’s ubiquitously deployed video surveillance systems [11], [40]. Migrating computing power to the edge allows more intelligence at each edge node such that on-site or near-site data processing becomes feasible, which consequently enables real-time object detection [33], tracking [32], and feature abstraction at the edge. Presently, it is still challenging to recognize the activities based on the features and identify suspicious behaviors.

Recently, researchers introduced a lightweight Convolutional Neural Network (L-CNN) [33] for real-time object detection focusing on a primary object of interest (humans) and a hybrid KERnelized Machine learning and Artificial Networks (Kerman) algorithm [32] for object tracking at the edge. The Kerman algorithm analyzes each frame of the video stream and extracts movement features and models patterns.

In this paper, we introduce an Instant Suspicious Activity identiFication at the Edge (I-SAFE) scheme using a fuzzy decision-making engine. A set of contextualized detectors is created by considering the features in a spatio-temporal context [41], which includes the location of the camera and the time of day. Then, a fuzzy model with five membership functions is proposed to decide whether or not the behavior or activity of each person in the frame is suspicious. The fuzzy model was generated with input from campus police officers. Their knowledge and experience are integrated in the rule set, which uses the fuzzified contextualized features. The experimental study using real-world surveillance video streams verifies that the I-SAFE scheme flags suspicious activities in an average of 0.002 seconds.

The contributions of this paper are as follows:

  • A lightweight, decentralized, real-time smart safety system that detects humans and raises an individual-based alarm when certain factors are observed in their behavior;

  • A data experiment leveraging domain-expert feature selection to find the features that best describe the behavior elements of each pedestrian while supporting edge paradigm constraints;

  • A noise resistant fuzzy control module that upon receiving image features determines the intentions of each pedestrian in the frame; and

  • A comprehensive experimental study that verified the effectiveness of the I-SAFE decentralized method by comparing it to four other state-of-the-art methods over two other data sets.

The rest of this paper is structured as follows. In Section II, the background knowledge of the problem and fuzzy systems are introduced. Section III presents a novel smart surveillance architecture to support the proposed I-SAFE scheme. Section IV explains the feature extraction method and contextualization. The fuzzy model of the I-SAFE scheme is introduced in Section V. The experimental results are presented in Section VI. Finally, Section VII provides conclusions.

II Background and Related Work

II-A Human-in-Loop Surveillance Systems

The surveillance community is aware of the growing demand for human resources to interpret data such as live video streams [5]. The ubiquitous deployment of networked static and mobile cameras creates a huge amount of video data that is transmitted to data centers for analysis [7] and to automate the process [27]. Many automated object detection algorithms have been investigated using ML [39] and statistical analysis [15] approaches that are implemented at the server side of the surveillance system.

There are also efforts made to promote operators’ awareness by leveraging context [6], providing query languages [3], re-configuring the networked cameras [38], utilizing event-driven visualization [14], and mapping conventional real-time images to 3D camera images [44]. Lack of scalability is still a challenge in traditional human-in-the-loop solutions to meet the demand of real-time surveillance.

II-B Safety Modeling and Anomaly Detection

Anomaly detection can have various definitions. For example, some researchers define an anomaly as the rarest state in a sequence [13], [25]. In the case of a video, the algorithm selects the rarest frames in the sequence, for example by detecting a person or vehicle that deviates from normal behavior in pattern of life/anomaly detection (POL/AD) [4]. In the Appearance and Motion DeepNet (AMDN) framework, several classifiers work in parallel to detect whether or not there is an object in a frame [45], such as a car or bike in the park [28]. An integrated pipeline incorporates the output of object trajectory analysis and pixel-based analysis for abnormal behavior inference [12]. It aims at better automation in the surveillance system with an algorithm that robustly captures activities such as loitering, fighting, and passing cars. Although POL/AD is comprehensive, the approach is unfortunately too expensive to be implemented on fog nodes which host edge units’ data streams. By labeling partial video segments rather than bounding boxes in video frames, anomaly analysis segments the video where an action such as moving, stealing, or an incident occurs [17], [42]. Video range labeling using a refined Recurrent Neural Network (RNN) also translates to more accurate rare-instance detection along with outputs of the bounding box around the anomalous object [26].

Loitering detection is selected as the case study to implement and test the I-SAFE framework. Loitering is moving back and forth around a central spot. Hence stopping, starting, and returning to the same spot are obvious indicators. Although loitering and move-stop-move actions look similar, a model that differentiates between the motivations helps to distinguish them accurately. A spatio-temporal clustering method based on pedestrian speed has been adopted for classification [36]. It uses a few features for motion detection and alarm raising and suffers from low overall performance. An unsupervised dynamic sparse coding approach was suggested for unusual event detection using an automatically learned event dictionary [51]. This approach reveals unusual scenes, of which loitering may be one, but does not address the security aspect. Finally, the loitering problem was approached using a Markov random field (MRF) [21], which generally seeks rare-occurrence anomalies.

Two-dimensional human pose estimation in still images and videos has been explored [19], where both top-down [29] and bottom-up [18] approaches have been proposed. The research community has also retrieved the spatial configuration of humans by matching the holistic human shape [30], aggregating poses from segmentations, or using contours to model the shape of human body parts [22]. Recently, many new ML models have been introduced that detect and connect human body parts to estimate human pose [2], [9]. These approaches suffer from the heavy computational burden of image analysis for each and every object in the frame; consequently, the computing time may be longer than one second in a crowded scene [37]. Besides the long delay to recognize the pose of an object, the detected pose must be compared to predefined “normal poses” to detect anomalous ones, so the detection accuracy depends heavily on the quality of the training set. If the training set fails to provide a sufficiently large number of “normal pose” exemplars, the system could suffer a high false positive rate. To reduce the need for a complete training data set, we seek a model that uses fuzzy logic to estimate between representative CNN-extracted examples.

II-C Fuzzy Controller

Fuzzy sets and fuzzy logic together form fuzzy controllers, an attractive and promising method [20] for dealing with uncertainties. While the concept of fuzzy mathematics and its use is not new, it is specifically useful when methodologies are too complex for analysis by conventional quantitative techniques or when the available sources of information are interpreted in a qualitatively inaccurate or uncertain way [49].

Fuzzy logic, the formal basis of fuzzy logic control, emulates human thinking and natural language processes rather than traditional Boolean logic [23]. It is an effective tool for capturing the approximate, inexact nature of the real world. Viewed as a semantic classifier, the underlying part of the fuzzy logic controller (FLC) is a set of linguistic rules related by the dual concepts of fuzzy implication and the compositional rule of inference. The FLC provides the automation technique which can convert the linguistic control strategy based on expert knowledge into a stand-alone control system that is robust to environment and measurement noise [47]. When the FLC is applied to domains for which it is trained, reports provide evidence that it yields superior results compared to those obtained by conventional control methods. Hence, fuzzy logic control is considered a step toward bridging conventional precise mathematical control and human-like decision making [16].

Because of the high uncertainty and complexity in describing the human activities, fuzzy decision making is appealing to address suspicious human activity detection in public safety surveillance systems.

A comprehensive survey of video classification and anomaly detection for video analytics was given in [24]. Some works focus on specific human behavior classification [48]. Both methods used the context of the data to show human decision making and behavior prediction, instead of anomalous behavior. Researchers have also tried to reach better accuracy for human behavior prediction by combining fuzzy logic and a Hidden Markov Model (HMM) [31].

III I-SAFE System Architecture

Fig. 1: The smart surveillance system architecture for I-SAFE scheme.

In a well-performing safety system, humans are detected and their movement and behavior are closely watched to draw a conclusion about their intent. This process may be performed by human agents or by a smart system. The proposed I-SAFE system tries to mimic the same decision-making logic as the human agent. Figure 1 presents the decentralized smart surveillance system architecture proposed to enable I-SAFE. Within the architecture, the functionality of the I-SAFE scheme can be considered as three steps:

  • Video Feature Extraction: Based on the design of the decision-making algorithm, features are extracted from the video. These features represent the movement of the humans as objects of interest in the frame. A lightweight CNN that feeds an online tracker gives the position of the objects in each processed frame. To obtain the domain-selected features for decision making, I-SAFE calculates the relative speed of each human in the frame and also the relative movement direction.

  • Feature Contextualization: The context of the data supports the decision. Experience from law-enforcement officers confirms that time and geo-location are factors that have an undeniable effect on whether an officer engages in further investigation. Recognizing this importance, I-SAFE adds contextual information to the features before feeding them to the decision-making algorithm.

  • Fuzzy Decision Making: As one of the more robust approaches to control an environment, a fuzzy system armed with a complete set of rules and comprehensive membership functions (please refer to Section V for details) can make a powerful tool for decision making. The fuzzy relations between inputs and outputs enhance system robustness to noise. This robustness matters all the more because imperfections in the object tracker and the feature extraction add noise to position-related features.

As illustrated in Fig. 1, the I-SAFE system utilizes edge camera units. The edge device is either a basic surveillance camera with a Single Board Computer (SBC) mounted, like a Raspberry Pi board, or a smart camera with integrated computing resources. The edge node processes the video frames and extracts the features for decision making purposes [32]. Under the framework of I-SAFE, although the live feed is available for the human operators in the control room, it is not recommended due to the high traffic volume, and it is impractical to expect any human operator to identify useful information from hundreds of real-time video streams. Only the extracted features that are essential for decision making are immediately transferred to guarantee low network communication workload. Exploiting a web service, the camera video stream can be stopped in case of no requests. Additionally, data context is not added to features at the edge, because of the repetition that is involved leading to undesired overhead in feature communication.

The feature stream is outsourced to the closest fog device. A fog node can be a smartphone, laptop, or desktop, which is more powerful than the edge device and is deployed near the source of data. The reason for this outsourcing is simply the limitations of the edge node. After video processing, further calculations could drastically impact performance, as verified by our experimental study reported in Section VI.

Adopting the decentralized approach, the I-SAFE scheme possesses several advantages over traditional cloud-based services. In terms of the system architecture design, the network manager, which usually becomes a performance bottleneck as the number of nodes increases, is eliminated. According to the capacity of a fog node, a number of edge servers are assigned to it at system setup. The operator may have access to the real-time stream from the edge or to the decisions made by the fog services.

Figure 1 shows that access to the edge is managed by a private blockchain access control (BAC) protocol. The access authentication is conducted in the network setup phase and the smart contract is enforced in the blockchain network. To ensure the whole platform is scalable and easy to upgrade, both the video processing and security management functions are implemented using a microservices architecture. Each microservice is placed inside a Docker container with all necessary requirements for ease of distribution. Due to limited space, the rationale, architecture, implementation, and performance evaluation of these components are not presented in this paper. Interested readers are referred to two papers for more information: one details the microservices-based surveillance platform [34] and the other focuses on the blockchain-enabled security mechanisms [46].

IV Feature Preparations

In the I-SAFE scheme, dynamic data is analyzed to prepare features for the fuzzy engine with the following steps: feature generation, selection, extraction, and contextualization. The classifier works best if the features produce the most divergence between classes. Deep Learning (DL) models, which auto-define the features, need labeled training datasets that are unfortunately not available for the security use case. Thus, features selected based on subject matter expert (SME) input are presented in this section. Moreover, context supports robust operations by including environmental, societal, and cultural information.

IV-A Feature Selection

Recent publications on anomaly detection suggest many different features that could be of importance. However, this work consults law enforcement officers to better understand which features should be of interest. Based on their input, loitering at odd hours or in odd places indicates a high chance of misbehavior. Although there are other clues such as appearance, clothing, and certain smells that may also draw an officer’s attention, the system only considers the pattern of movement after tracking the general figure of a human body, in order to minimize bias or profiling of individuals and to avoid extracting sensitive private information. In the future, gestures and important body keypoints can be added to the tracking module for a more accurate description of the activities.

The pattern of an individual’s movement is an important factor in the decision-making procedure of an officer. Indeed, many ML models can learn from dynamic time-series data to model patterns and detect anomalies [3]. In a 2-D RGB picture of a random scenario, the pattern of movement may be confusing because depth information also appears as up or down movement. In addition, studies show that movement patterns are context specific [50]. For example, a person who is walking to his/her car in a parking lot may behave differently from a person who is walking into his/her dorm room. Thus, comparing patterns and directions across generalized scenarios leads to a high false positive rate. Furthermore, mapping the pattern of movement is time consuming and resource demanding, so doing it for each person is not affordable on resource-limited edge or fog devices. Instead, indicators of the movement are chosen because they generalize to all scenes.

A strategy is employed that uses the number of speed and direction changes to characterize the moving pattern. A larger number of changes indicates a higher probability of loitering. Again according to the law officers, who watch pedestrian movements closely on a daily basis, a person who has a known destination in mind is likely to walk straight and at a reasonable pace. Turning around in an area or changing position without an apparent destination should raise an alarm. The other benefit of this strategy is that there is no need to extract complex pattern routes for each person, only the indicators. The difference in calculation time is more noticeable when more than three objects are present in the frame.
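As a rough illustration of this indicator-based strategy, the sketch below counts speed and direction changes along a track of bounding-box centers. It is not the authors' Algorithm 1: the function name `count_changes`, the use of box centers, and the tolerances `speed_tol` and `angle_tol_deg` are assumptions made for this example.

```python
import math


def count_changes(centers, speed_tol=2.0, angle_tol_deg=45.0):
    """Count speed and direction changes along a track of (x, y) centers.

    Hypothetical helper: the tolerances are illustrative assumptions,
    expressed in pixels per frame and degrees.
    """
    speed_changes = 0
    direction_changes = 0
    prev_speed = None
    prev_heading = None
    for (x0, y0), (x1, y1) in zip(centers, centers[1:]):
        dx, dy = x1 - x0, y1 - y0
        speed = math.hypot(dx, dy)  # displacement between consecutive frames
        if prev_speed is not None and abs(speed - prev_speed) > speed_tol:
            speed_changes += 1
        prev_speed = speed
        if speed > 0:  # heading is undefined while the person stands still
            heading = math.degrees(math.atan2(dy, dx))
            if prev_heading is not None:
                # Smallest absolute angle between the two headings, in [0, 180].
                diff = abs((heading - prev_heading + 180.0) % 360.0 - 180.0)
                if diff > angle_tol_deg:
                    direction_changes += 1
            prev_heading = heading
    return speed_changes, direction_changes
```

Only two counters per person are kept, which is what makes this cheaper than storing and comparing full movement routes.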

There are two complementary features. Standing at one location for a long time may be an indicator of loitering. Nonetheless, this feature should be used in context. In addition, if there is more than one person in the scene, the activity is less likely to be suspicious. Thus, the number of people in the frame is also considered as a feature.

IV-B Video Feature Extraction

The human detection uses a Convolutional Neural Network (CNN). If a human, as the object of interest, is detected, the object’s bounding box coordinates are placed in a queue. In each upcoming frame, the queue is updated with the tracker’s bounding box prediction. Once every several frames, the CNN is applied to the input frame; it not only adds any newly detected objects to the queue for tracking but also checks the object placement in each previous bounding box. If the Intersection over Union (IOU) is smaller than a threshold set by the administrator, the person is added as a new object, and the old bounding box and the information related to it are deleted.
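The IOU check described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: boxes are taken to be `(x1, y1, x2, y2)` tuples, matching is a simple greedy best-IOU search, tracks without a sufficiently overlapping detection are dropped, and the names `refresh_tracks`, `iou_threshold`, and `next_id` are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def refresh_tracks(tracks, detections, next_id, iou_threshold=0.5):
    """Reconcile tracked boxes with fresh CNN detections.

    tracks: dict mapping track id -> last tracked box.
    detections: list of boxes from the detector.
    next_id: first unused track id (the caller keeps it unique).
    Returns (refreshed tracks, updated next_id).
    """
    refreshed = {}
    unmatched = dict(tracks)
    for det in detections:
        best_id, best_iou = None, 0.0
        for track_id, box in unmatched.items():
            score = iou(det, box)
            if score > best_iou:
                best_id, best_iou = track_id, score
        if best_id is not None and best_iou >= iou_threshold:
            refreshed[best_id] = det  # same person: keep the id and history
            del unmatched[best_id]
        else:
            refreshed[next_id] = det  # low overlap: treat as a new object
            next_id += 1
    # Tracks left in `unmatched` are dropped along with their features.
    return refreshed, next_id
```

The threshold plays the role of the administrator-set value mentioned in the text; 0.5 is only a placeholder default.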

The online tracker should give an accurate estimate of the position of each object; otherwise the extracted features lead to inaccurate classification. The Kerman hybrid tracker [32] maintains tracking and supports track hand-off to improve tracking accuracy. By extracting the coordinates of the object in the frame and comparing them to the previously collected information, we can obtain the indicators and features for decision making. The post-processing algorithm applied to the frame after reception of each bounding box is presented in Fig. 2. The feature set for each frame is then transmitted to the fog unit for contextualization and change detection.

Fig. 2: Pseudocode of Algorithm 1.

It should be mentioned that the accuracy of the I-SAFE system depends on the algorithms that detect and track the human as the object of interest. Although the algorithms used for human detection and tracking at the edge utilize deep learning architectures and methods, their accuracy is not 100%, which means that the camera may lose the object of interest, and so the features used in the fog node will be disrupted. The I-SAFE framework has a decoupled architecture in which performance can be improved by integrating new algorithms.

IV-C Feature Contextualization

The first step toward dynamic data analytics is to consider all of the features that explain the input data, including the feature generation method. Feature generation is learned from previous analysis with salient feature selection. In the case of human-oriented public safety surveillance, features are directly extracted from the video frames (e.g., intensities, lines, shapes) as well as from other external factors (e.g., camera placement, lighting conditions, and scene content). The procedure of putting these features together with the factors will be referred to as feature contextualization.

This paper focuses on university campus surveillance and attempts generalization. During the normal operating hours of a building on campus, many students, faculty, staff, and other personnel may be detected in the scene. While it is normal to observe many people during the day, it is abnormal if many people appear after 11:30 pm. In abnormality detection, the contextual data of the time of day assists in decision making. According to the police, the time of day determines how vigilantly officers monitor a gathering. Which context features should be selected and how commonly they are used in surveillance are determined based on suggestions from our campus police officers. The edge device that hosts the data extraction from the video and prepares the feature list for each frame cannot handle the contextualization of features due to resource constraints. The video processing task uses most of the computational power [35] while the rest is allocated to the transmission and security modules. Moreover, context such as spatiotemporal and geo-location information is usually repeated data, so sending it with each frame creates overhead. Therefore, the video features are extracted by the edge camera and sent to the fog node for contextualization. There are three features that are added to the camera data during the contextualization phase.

The first is the time of the day. The importance of this feature may differ from one location to the other, but it is considered as one of the most important factors to make a decision in security systems. The second factor is the geo-location of the camera. Cameras installed indoors should have a different set of thresholds for decision making than the outdoor cameras. Just as the accessibility and space use-case varies, normal behavior changes too. Additionally, cameras that are installed outside of a bank should have less tolerance for detection of a human being after hours and should raise an immediate alarm in such a case. Thus, the security level of the building where the camera is mounted is the third context we consider.
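A minimal sketch of this contextualization step at the fog node is shown below. The field names (`hour`, `location`, `security_level`), the `CameraContext` class, and the feature dictionary layout are assumptions for illustration; the paper does not specify a data format.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class CameraContext:
    """Static context stored once per camera at the fog node (hypothetical)."""
    location: str        # e.g. "indoor" or "outdoor"
    security_level: int  # higher means less tolerance for after-hours presence


def contextualize(features, camera, timestamp):
    """Attach the three context features to the per-frame edge features."""
    enriched = dict(features)  # leave the edge payload untouched
    enriched["hour"] = timestamp.hour + timestamp.minute / 60.0
    enriched["location"] = camera.location
    enriched["security_level"] = camera.security_level
    return enriched
```

Because the camera context lives at the fog node, the edge device only has to send the movement features, which avoids repeating the same context with every frame.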

V Fuzzy Model

V-A Rationale

The decision-making process for the safety surveillance system is based on a fuzzy control system model. Although FLC systems have lost momentum due to the absence of experts for many problems, the fuzzy method remains one of the best methods for systems with high noise levels. In surveillance, the law officers can act as the experts, and their opinion can be used for system operation. The officers take months or even years to develop an innate sense of behavior analysis for a certain location of their duty.

Training a DNN that can classify with proper accuracy requires large training sets with both negative and positive examples; while negative data is easy to acquire, positive samples are harder to gather. Even after labeling, if the dataset does not contain all scenarios, the result does not cover the whole input space, which leads to undetected events.

In the case of the campus safety system, there are many campus police officers who have spent time in the domain of interest and know what to expect from the crowd. Their experience is used to create a series of rules that are implemented in the fuzzy model to detect anomalies. On the other hand, general-purpose classifiers with unsupervised learning methods do not offer high accuracy and suffer from noise distortion, trivial solutions, and collapsing of features in deeper models [8].

With the contextualized features, a fuzzy control system is introduced to mimic how the police officers make decisions and to obtain a semantic output amenable to human operators. Unlike mathematical probability analysis, fuzzy-based models are not based on numbers but on semantic classes. An FLC maps the input sensor measurements to linguistic labels, which are a description of the input. The fuzzy system affords the mapping of operator knowledge into a decision-making model.

For anomalous activity detection, the officer performs linguistic-type reasoning in his/her mind and reaches a conclusion that a behavior is normal or abnormal, instead of giving a numeric description of the observation. To mimic this cognitive behavior, the fuzzy model gives a linguistic output, which is the classification label. This output can be translated to a number through a defuzzification formulation so that it can be consumed by digital computers. The results are reported to the police department with the amount of attention (i.e., based on a confidence, credibility, or reliability estimate) needed to assess a specific scene.
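The text does not say which defuzzification formulation is used. As one common choice, the sketch below approximates the centroid of an aggregated output membership function on a grid; the function name `centroid`, the grid resolution, and the choice of centroid defuzzification are all assumptions made here for illustration.

```python
def centroid(aggregated_mu, lo, hi, steps=1000):
    """Approximate the centroid of a fuzzy output over [lo, hi].

    aggregated_mu: callable returning the membership degree in [0, 1]
    for a candidate output value z. Returns 0.0 when no rule fired.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for k in range(steps + 1):
        z = lo + (hi - lo) * k / steps  # evenly spaced sample points
        weight = aggregated_mu(z)
        weighted_sum += z * weight
        total_weight += weight
    return weighted_sum / total_weight if total_weight > 0 else 0.0
```

For example, an output fuzzy set that is symmetric around the middle of a 0 to 100 suspicion scale defuzzifies to a score close to 50.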

V-B Fuzzy Model

The feature set for each of the objects of interest is sent to the fog node, where it is contextualized and fed to the FLC.

The first step to realizing the fuzzy model is fuzzification, which translates the features to a fuzzy value. For any set $X$, membership functions represent the fuzzy subsets of it. For an element $x \in X$, the corresponding value of the fuzzy subset $A$ is denoted by $\mu_A(x)$, as shown in Eq. (1):

$$\mu_A : X \rightarrow [0, 1], \qquad x \mapsto \mu_A(x) \tag{1}$$

In order to fuzzify the measurements (the contextualized features in this case), each measurement is compared to its respective subsets as shown in Eq. (2), which creates the linguistic variables that are used in the rules. Linguistic variables contrast with normal variables in that each one represents a range of meanings such as cold, medium, hot.

$$\ell_{X,i} = \mu_{A_i}(x'), \qquad x' \in X' \subseteq X \tag{2}$$

where each measurement is considered as a range shown by $X'$. If $X'$ is only one number as the measurement, then $\ell_{X,i}$ becomes only one number too. Note, subscript $i$ represents each of the fuzzifiers (membership functions).

If two sets $X$ and $Y$ are considered in a fuzzy system, based on each fuzzifier in each set for each reading, linguistic variables $\ell_{X,i}$ and $\ell_{Y,j}$ are calculated. The linguistic variables are used in the rule set to calculate the results ($\mathrm{Premise}_r$) for each rule $r$. The minimum between the resulting premise and each fuzzy membership function $\mu_{B_k}$ for the output yields Eq. (3):

$$\mu'_{B_k}(z) = \min\left(\mathrm{Premise}_r,\ \mu_{B_k}(z)\right) \tag{3}$$

Fig. 3: Membership functions: (a) The hour in the day. (b) The number of times there is a change in the speed of the object. (c) The total time that the object is in the frame. (d) The number of people present in the frame. (e) The number of times there is a change in the direction of movement of each object. (f) The malicious behaviour levels.
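The fuzzification and minimum operations described around Eqs. (1) to (3) can be sketched in a few lines. This is an illustrative sketch only: the triangular shape, the breakpoints, the subset names, and the use of the minimum to combine the antecedents of a rule (a common Mamdani-style choice) are assumptions, not values taken from Fig. 3.

```python
def triangle(a, b, c):
    """Triangular membership function: 0 at a and c, 1 at the peak b (a < b < c)."""
    def mu(x):
        if x <= a or x >= c:
            return 0.0
        if x <= b:
            return (x - a) / (b - a)
        return (c - x) / (c - b)
    return mu


def fuzzify(x, subsets):
    """Map a crisp reading to a degree in [0, 1] for each named fuzzy subset."""
    return {name: mu(x) for name, mu in subsets.items()}


def rule_premise(*degrees):
    """Combine the antecedent degrees of one rule with the minimum (assumed AND)."""
    return min(degrees)


def clip_output(premise, output_mu):
    """Clip an output membership function at the premise of the rule."""
    return lambda z: min(premise, output_mu(z))


# Hypothetical subsets for two inputs and one output level.
direction_sets = {"few": triangle(-10, 0, 10), "many": triangle(5, 15, 25)}
time_sets = {"short": triangle(-10, 0, 10), "long": triangle(5, 15, 25)}
high_suspicion = triangle(50, 75, 100)

direction = fuzzify(10, direction_sets)  # 10 direction changes
duration = fuzzify(15, time_sets)        # 15 seconds in the frame
# Rule: IF direction changes are "many" AND time in frame is "long" THEN "high".
premise = rule_premise(direction["many"], duration["long"])
clipped_high = clip_output(premise, high_suspicion)
```

In this toy setup the rule fires with strength 0.5, so the "high" output set is capped at 0.5 wherever its own membership would be larger.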

V-C Membership Functions

According to Eq. (2), the linguistic variable is mapped to the interval [0, 1], which can be interpreted as a credibility analysis. If the credibility of a subset is less than the nominal value, then the linguistic variable is not reliable enough. The subsets (membership functions) should therefore cover each set so that at no point does the credibility of aligning a feature to a set fall below the nominal value. In return, the output of the system can have higher credibility because the inputs are more reliably fuzzified. In addition, it is very important to choose the best membership functions to cover the entire set. With the wrong candidate as the membership function, the credibility of that set drops faster or the membership function fails to cover the desired area in the set.

Figure 3 presents the fuzzy-relation types, where the first five sets (Fig. 3a - e) are the input membership functions and the last one (f) is the output.

In Fig. 3, the x-axis shows the input variable of Eq. (1) and the y-axis represents the credibility of each membership function as that variable changes. In part (d), for example, having five people in the frame gives some confidence in “medium activity” and some confidence in “high activity”. Eq. (2) then determines that having five people in the frame means “medium activity”, which is taken as the linguistic variable of the set “NumPpl”.

Three membership functions are considered for each feature interval, and five membership functions are used to describe all possible levels of suspicious behavior with high accuracy. As illustrated in Fig. 3, the following contextualized features are presented to the fuzzy system, where the input is fuzzified and the output is produced based on the rules:

  • (a) The time of the day, given as the hour in the range [0:00, 24:00];

  • (b) The number of times an object in a frame changes speed, in the range [0, 30]. If a person walks in the frame for a long time and changes direction many times, showing no clear destination, there is a good chance of loitering;

  • (c) The time that a human object stays in a frame, in the range [0, 30] seconds. Normally, a person walking at normal speed leaves the frame within several seconds;

  • (d) The number of people in the frame under processing, in the range [0, 40]; and

  • (e) The number of times an object in a frame changes direction, in the range [0, 30].

The boundaries of each set are designed to handle almost every possible scenario, and the fuzzy system handles noise and out-of-range values well. The key is the fuzzification process. During the calculation phase, if a value is an outlier (outside the scope of the set), the fuzzification still maps it to the linguistic variable closest to the edge of the set limit. If the fuzzy system fails to align the measurements to set values, the human operator is notified that the inputs are erroneous and is alerted to observe the situation in the videos. The failure is reported in an error log, and the output is set to zero so that no alarm is raised.
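The outlier handling described above can be sketched as a guard placed in front of the fuzzy system. The ranges follow the feature list of Fig. 3, but the dictionary keys, the function names, and the use of Python's `logging` module are assumptions made for this illustration.

```python
import logging
import math

# Ranges from the feature list; the key names are hypothetical.
FEATURE_RANGES = {
    "hour": (0.0, 24.0),
    "speed_changes": (0.0, 30.0),
    "time_in_frame_s": (0.0, 30.0),
    "num_people": (0.0, 40.0),
    "direction_changes": (0.0, 30.0),
}


def clamp_features(features):
    """Clamp each reading to its set boundary; reject unusable readings."""
    clamped = {}
    for name, (low, high) in FEATURE_RANGES.items():
        value = features.get(name)
        if not isinstance(value, (int, float)) or math.isnan(value):
            raise ValueError(f"unusable reading for {name!r}: {value!r}")
        clamped[name] = min(max(float(value), low), high)
    return clamped


def safe_score(features, score_fn):
    """Run the fuzzy scorer; on bad input, log the error and return zero."""
    try:
        return score_fn(clamp_features(features))
    except ValueError as err:
        logging.error("fuzzy input rejected: %s", err)
        return 0.0
```

An out-of-range reading such as 55 people is mapped to the boundary value 40, while a missing or non-numeric reading produces a log entry and a score of zero, so no alarm is raised.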

It is noticeable in the models of Fig. 3 that each set is covered by at least one membership function for any given input: there is no point on the x-axis where no membership function is above the nominal value. This implies that the set classification decisions are made with high credibility.
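The coverage property can be checked numerically. The sketch below scans a grid over one set and reports the lowest value reached by the best membership function; the nominal value of 0.5, the symmetric "tent" shapes, and all names are assumptions for illustration only.

```python
def tent(peak, half_width):
    """Symmetric triangular membership centred on `peak` (hypothetical shape)."""
    return lambda x: max(0.0, 1.0 - abs(x - peak) / half_width)


def min_coverage(mfs, low, high, steps=400):
    """Lowest value, over a grid on [low, high], of the best membership."""
    worst = 1.0
    for i in range(steps + 1):
        x = low + (high - low) * i / steps
        worst = min(worst, max(mf(x) for mf in mfs))
    return worst


NOMINAL = 0.5  # assumed nominal credibility value

covered = [tent(0, 15), tent(15, 15), tent(30, 15)]   # overlapping neighbours
gappy = [tent(5, 5), tent(25, 5)]                     # leaves a hole in the middle

print(min_coverage(covered, 0, 30) >= NOMINAL)
print(min_coverage(gappy, 0, 30) >= NOMINAL)
```

A set passes when the reported minimum stays at or above the nominal value, which is the situation described for Fig. 3.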

A membership function that holds a salient value over a long range gives the corresponding linguistic variable a higher share and more power. All membership function shapes are chosen with this in mind. This is why, in the malicious behavior membership functions, the highest confidence in each result is reached at just one point, so that no bias toward one behavior is added.

In a completely defined fuzzy control system, all possible combinations of the fuzzy linguistic variables should be considered in the rule set. However, there is an exponential relationship between the number of rules and the processing time. Hence, the rule set for I-SAFE is designed to cover all scenarios of interest while utilizing a combination of the features.
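To see why a completely defined rule set is costly, the number of label combinations can be counted directly. With three membership functions for each of the five input features, a full rule base would contain 3^5 = 243 rules. The feature names below are hypothetical placeholders.

```python
from itertools import product

LABELS_PER_FEATURE = 3
FEATURES = ["hour", "speed_changes", "time_in_frame",
            "num_people", "direction_changes"]  # hypothetical names

# Every combination of one label index per feature is one candidate rule.
full_rule_base = list(product(range(LABELS_PER_FEATURE), repeat=len(FEATURES)))

# Growth of the rule count as features are added one at a time.
growth = [LABELS_PER_FEATURE ** n for n in range(1, len(FEATURES) + 1)]

print(len(full_rule_base), growth)
```

Each additional three-label feature triples the size of the complete rule base, which motivates restricting the rules to the scenarios of interest.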

V-D Producing the Output

Figure 4 shows how the features are combined in rules to reach a conclusion. The column shows the combination of the linguistic conditions (features). Only logical Intersection (AND) and Union (OR) are used, and the parentheses show which logical operation should be executed first between the conditions. Since the speed change and direction change variables are subject to high measurement noise, the I-SAFE system employs the OR operator between them to reduce their impact based on the number of rules using them.

Fig. 4: Set of rules that are used for the safety system.
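A single rule of this kind can be sketched as follows, assuming the common choice of minimum for Intersection and maximum for Union. The rule itself and the condition names are hypothetical; they are not copied from the rule set of Fig. 4.

```python
def fuzzy_and(*degrees):
    """Intersection of conditions, taken here as the minimum membership."""
    return min(degrees)


def fuzzy_or(*degrees):
    """Union of conditions, taken here as the maximum membership."""
    return max(degrees)


def rule_premise(m):
    """Hypothetical rule: night AND few people AND long stay AND
    (many speed changes OR many direction changes)."""
    motion = fuzzy_or(m["many_speed_changes"], m["many_direction_changes"])
    return fuzzy_and(m["night"], m["few_people"], m["long_stay"], motion)


memberships = {
    "night": 0.9,
    "few_people": 0.8,
    "long_stay": 0.6,
    "many_speed_changes": 0.2,
    "many_direction_changes": 0.7,
}
premise = rule_premise(memberships)
print(premise)
```

The parenthesized OR is evaluated first, so the two noisy motion features enter the rule as one combined condition.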

Based on the five linguistic variables, one for the fuzzy set of each contextualized feature, the system gives the fuzzy probability of suspicious activity. After defuzzification, the geo-location of the camera and the building security level can be considered to set an appropriate threshold for raising the alarm. In this way, the camera position and the building security level serve as the two final features.

Once again, the expert’s experience or knowledge enforces the rules to control the environment. The rules may differ under different conditions. The expert is asked to generalize the rules and features that are vital in the decision-making process. Note that each membership function boundary used for the inputs and output in Fig. 3 is based on a camera installed in a hallway of a campus building. If the area under surveillance needs more supervision, the administrator can change the fuzzifiers to tolerate less activity and/or raise the alarm sooner at certain times. All changes are made at setup and no further adjustments are needed.

The rule set used in the I-SAFE system emulates the calculations that an officer performs before approaching a suspicious object. As shown in the fuzzy sets, the number of people in a frame is an important feature. The highest attention is drawn to scenarios where only a few people are present at night. Based on law enforcement experience, videos with only one person in the frame are the most important. Therefore, video footage with one or two people is considered of high interest. As the number increases, the interest goes down according to the double sign membership functions. Another key feature is the amount of time that the object is present in the location. As the time grows, the probability that the object is loitering rises according to its set membership functions.

V-E Suspicious Score

The last step is to defuzzify the results of the rules and translate them to a number between 0 and 100%. A threshold decides whether or not the output should raise an alarm to make operators aware of an activity. The operator then decides on further actions. The output can be defuzzified using Eq. (4):

(4)

where the is the membership function of the output.
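The defuzzification and thresholding step can be sketched as below. The sketch assumes a centroid (center of gravity) defuzzifier over a sampled output axis and a maximum-based aggregation of the clipped rule outputs; both are common choices and are assumptions here, not a statement of the exact form of Eq. (4). The sample curves are hypothetical, and the 60% threshold is the value reported later in the experiments.

```python
def aggregate(clipped_curves):
    """Combine the clipped outputs of all rules point by point (maximum)."""
    return [max(values) for values in zip(*clipped_curves)]


def centroid(ys, mu):
    """Discrete centre of gravity; returns 0 when no rule fired."""
    total = sum(mu)
    if total == 0:
        return 0.0
    return sum(y * m for y, m in zip(ys, mu)) / total


ys = [0, 25, 50, 75, 100]             # sampled score axis (percent)
rule_a = [0.0, 0.0, 0.2, 0.6, 0.2]    # hypothetical clipped rule outputs
rule_b = [0.0, 0.0, 0.1, 0.3, 0.1]

score = centroid(ys, aggregate([rule_a, rule_b]))
ALERT_THRESHOLD = 60.0                # percent
alert = score >= ALERT_THRESHOLD
print(round(score, 2), alert)
```

Returning zero when nothing fires matches the behavior described earlier, where erroneous inputs lead to a zero output and no alarm.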

The I-SAFE system is able to draw attention to the scene where anomalous or suspicious activities are determined, but it is the human operator that makes the decision of action. The fuzzy model is implemented on the fog level devices and it is easy to access and reconfigure parameters through a single cloud node, for a batch of edge units connected to the fog, if the operator chooses so.

VI Experimental Results

A proof-of-concept prototype of the I-SAFE scheme has been implemented and tested using real-world surveillance video streams. The experimental results are encouraging: the design goals of a secure, agile, and fast surveillance system for safety monitoring are achieved. The I-SAFE system detects the activity successfully in an average of 0.002 seconds after the features are pre-processed.

VI-A System Setup

The prototype consists of both edge and fog layer function units. At the edge, human detection and tracking are accomplished using the lightweight L-CNN and the Kerman algorithm. Features are created using the indicators and other methods explained previously. The edge functions are hosted on a Tinker Board with a 1.8 GHz ARM-based RK3288 SoC and 2 GB of dual-channel LPDDR3 memory. The Tinker Board is placed behind the camera. In this sense, the camera can be considered the sensor, and the edge device is the Tinker Board, which connects to the sensor through a Local Area Network (LAN). The features are sent to a laptop PC running the Ubuntu 16.04 operating system as a fog node, where the contextualization and fuzzy decision making of the I-SAFE scheme are located. The PC has a 7th generation Intel Core i7 processor @ 3.1 GHz and 32 GB of RAM. The fog and edge are connected through a wireless LAN (WLAN) at 100 Mbit/s.

VI-B Threshold Setting

Fig. 5: Two objects normally walking in the frame.

Suspicious activity can be interpreted in different ways, but the goal here is to provide a machine-level triage of the situation for cueing operators to abnormal human behavior. For example, the abnormal behavior could be a bicyclist in a crowded place where everyone else is walking [45]. In another approach, a certain pose of a human is a sign of abnormal behavior [37]. The challenge is to determine an ontology of activities that would alert an operator to potential abnormal activity. In this paper, we consider a campus environment where, simply put, students are unlikely to loiter in parking lots or in hallways at late hours. The system is designed so that certain thresholds must be met before an alarm is raised.

It is important to note that separating the feature map from the decision-making algorithm makes the project more suitable for the edge computing paradigm, where the need to outsource the process to a higher link in the hierarchy is indisputable.

Figure 5 is a scenario where two people are walking at their normal pace and they will exit the frame when reaching the end of the hallway. The algorithm follows both objects and outputs the abnormality score corresponding to the measured likelihood of suspicious activity respective to each individual.

Figure 6 compares the suspicious scores of two sample cases. The x-axis is the time an object is in the frame, in seconds, and the y-axis is the defuzzified suspicious score. The red line is the score of a single person walking at 11:00 am, where the object walked through the corridor in about 100 seconds. The score stays in a reasonable range, showing no suspicious behavior. In contrast, the blue curve is the score of a single person walking at 3:00 am and staying in the area for a long time. The blue curve starts at a higher value at time zero than the red curve does, because the scheme considers walking at 3:00 am more suspicious, and its score rises as time goes by. As the time of movement increases, the score of the normal activity rises more slowly, driven by the other parameters that indicate suspicious activity besides the time, whereas the blue line shows larger jumps in the score.

Figure 6 does not indicate the threshold for raising the alarm. It can be set conveniently by the system administrator based on experience with the building usage. In other words, setting a reasonable threshold requires a statistical analysis of the distribution of activity scores corresponding to the behavior patterns in this building.

Fig. 6: Comparison of two sample cases in decision making process.
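One possible form of the statistical analysis mentioned above is to place the threshold at a high percentile of the scores recorded during normal building use. The sketch below is only an illustration of that idea: the 95th percentile, the function name, and the sample scores are assumptions, not the procedure used in the paper.

```python
import statistics


def threshold_from_history(normal_scores, percentile=95):
    """Pick the alarm threshold as a percentile of historical normal scores."""
    cuts = statistics.quantiles(normal_scores, n=100, method="inclusive")
    return cuts[percentile - 1]


# Hypothetical history: suspicious scores (percent) logged during normal use.
history = list(range(101))
threshold = threshold_from_history(history)
print(threshold)
```

An administrator could recompute the threshold per camera, since the score distribution depends on how each area of the building is used.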

VI-C Performance Evaluation

Fig. 7: Time delays (in micro-seconds) as the result of the data encryption and transmission on the local wifi network.

Figure 7 shows the delay due to the data transfer from the edge to the fog. Three scenarios are compared: no encryption, AES (Advanced Encryption Standard) encryption, and AES+RSA (Rivest-Shamir-Adleman) encryption. In AES+RSA, the handshake and connection establishment are based on RSA and the rest of the data transaction is based on AES, which is better in terms of low latency on resource-limited devices. The delay does not have a substantial impact on the real-time performance of the system, since the edge device processes the input frames at only five to eight frames per second. In addition, as the number of detected human objects increases, the feature file transmitted for each frame gets bigger and the communication takes more time. As more data is transmitted, there are spikes in the transmission times due to the unstable network connection. Two scenarios are included in Fig. 7 to compare the delay as the files get bigger: one with 0 to 2 objects in the frame and another with 6 to 10 objects in each frame.

Figure 8 compares processing the fuzzy model at the edge with processing it at the fog level. The total time shown in Fig. 8 includes the time for data contextualization and for the fuzzy control system results. As shown in Fig. 7, the communication time is much shorter than the time required by the decision-making process, so the communication time is neglected in Fig. 8. Note how the edge device struggles to reach around 1.5 frames per second (FPS). The same operation takes about 0.002 seconds on the fog node. Figure 8 justifies outsourcing the fuzzy decision-making function to the fog node and shows the heavy load on the edge devices. Processing a frame and generating a decision in 0.002 seconds on average meets the real-time requirements of human activity detection. Considering the velocity of pedestrians, a person cannot move much in 0.1 seconds (the processing rate of the current video processing applied at the edge), which gives ample time for a security response.
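The timing figures quoted in this subsection can be put side by side with a short calculation. The inputs are the values stated in the text (five to eight frames per second at the edge, about 1.5 FPS for the fuzzy model on the edge device, 0.002 seconds on the fog node); the variable names are illustrative.

```python
edge_fps_range = (5, 8)                  # video processing rate at the edge
frame_period_s = tuple(1 / fps for fps in edge_fps_range)

fog_decision_s = 0.002                   # fuzzy decision time on the fog node
edge_decision_s = 1 / 1.5                # fuzzy decision time at about 1.5 FPS

# How much faster the fog node is for the same decision step.
speedup = edge_decision_s / fog_decision_s

# Share of the shortest frame period taken by one fog-side decision.
share_of_frame = fog_decision_s / min(frame_period_s)

print(frame_period_s, round(speedup), share_of_frame)
```

Under these numbers the fog node is roughly 333 times faster than the edge device for the decision step, and one decision takes under 2% of the shortest frame period.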

In the future, with the introduction of more powerful edge devices, the whole process may be executed at the edge. Figure 8 is generated based on a scenario of two people in the frame. Intuitively, if more people are in the image frame, longer delay is expected both for video processing and decision making.

Fig. 8: Time (in micro-seconds) needed for decision making given the features from the video.

Figure 9 presents some occasions where the detection and tracking algorithms failed. As explained before, such failures lower the accuracy of the decision making. Figure 9(a) shows the object of interest (the person in white) at the beginning, who is closer to the camera and is walking past the other person. The tracking algorithm stops following the person in white and stays with the person in red as the latter comes closer. The detection algorithm, however, detects the person in white again and deletes the other bounding box. Unfortunately, the suspicious score data gathered for the person in the white shirt is lost. Figure 9(b) shows a scenario where the person in red is far away from the camera and the system fails to detect them. In Fig. 9(c), the detection algorithm detects only one person instead of the two that are in the frame. While this problem exists, it happens very seldom over numerous trials. Finally, Fig. 9(d) shows a very challenging case in which one object blocks the other, so there is no way to detect both people. Each of these issues can be mitigated through intelligent design, such as more cameras and enhanced trackers, of which the general operation is scalable.

Fig. 9: Moments of human detection and tracking failure that make the decision making harder: (a) The tracker loses the object and finds him again. (b) The object is too small for the detection algorithm to detect. (c) The detection confuses two people as one. (d) One object covers the other so he/she will remain undetected.

We also ran I-SAFE on two publicly available video datasets, namely the Adam [1] subway entrance and more than 3 hours of video from a mall security camera. One very noticeable point in these datasets is that the original videos have more frames than what is provided. The main reason for this down-sampling is the slow human motion compared to the fast cameras of today. Figure 10 shows some of the instances where I-SAFE detects people and assigns a score to them. In Fig. 10, the bounding boxes around the humans are shown with the corresponding loitering score next to each object. Notice how objects that get farther away from the camera are not detected, which is the result of inadequate pixel density for detection or, in some congested cases, low detection resolution. It is worth highlighting that in Fig. 10 the blue boxes are the tracker output, while the green boxes show the detection algorithm, which checks the frames for new objects at a frequency of 1 in 5 frames.

Fig. 10: Instances of the fuzzy algorithm output for the videos in the Adam dataset including mall and subway entrance videos.
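The interplay of the two box colors can be pictured as a simple per-frame schedule. The sketch below assumes that the tracker runs on every frame and that the detector is added on every fifth frame, starting with the first; the function name and the step labels are hypothetical.

```python
DETECT_EVERY = 5  # the detector checks for new objects 1 in 5 frames


def plan_frames(num_frames, detect_every=DETECT_EVERY):
    """Label each frame with the processing steps assumed to run on it."""
    return [
        "detect+track" if index % detect_every == 0 else "track"
        for index in range(num_frames)
    ]


print(plan_frames(7))
```

Running the detector only periodically keeps the per-frame cost low, while the tracker carries each object between detections.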

It can be seen in the sample pictures in Fig. 10 that, for more traditional detection methods, the camera needs to be trained on a person with abnormal behavior (here, a man walking while on the phone and carrying a cardboard piece with different colors on it, for camera detection). Note that the proposed I-SAFE does not require training when deployed, as camera placement supports the threshold parameters that the operator needs to recalculate.

Finally, Table I compares the results of I-SAFE to the ground truth and to other models for abnormal loitering detection, as reported in each paper, and shows comparable results. Although the small number of loitering cases is also detected by the other algorithms, I-SAFE achieves these results while minimizing the delay and network overhead in a decentralized edge computing environment. Observing the scores for the videos in the dataset, we concluded that a threshold of 60% is best for flagging the abnormal activity, using an average of 68.3% CPU on a single thread and 96 MB of memory. Of course, with a higher threshold the system gives fewer false positives.

Although these examples are used for model performance analysis, they contain a very limited number of positive examples, too few to otherwise support a machine-learning solution.

A closer look at Table I and Fig. 10 shows that human activity recognition depends heavily on how accurately each individual can be detected and tracked. In the mall video segment, because of the frame complexity and the partial or complete collision of objects, tracking may be interrupted and data may be lost. The advantage of decoupling feature extraction from decision making is that the same fuzzy model can be used with more accurate video processing techniques in the future.

Detection Model      Sub. Ent.     Mall 1        Mall 2
                     TP    FP      TP    FP      TP    FP
Ground Truth         14    0       4     0       4     0
Adam et al. [1]      13    4       4     1       4     3
Zhao et al. [51]     14    5       NA    NA      NA    NA
Coşar et al. [12]    14    4       NA    NA      NA    NA
Kim et al. [21]      13    6       NA    NA      NA    NA
I-SAFE               13    4       3     2       4     1
TABLE I: Loitering Score in different video samples (TP: True Positive, FP: False Positive, NA: Not Available).
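The I-SAFE row of Table I can be summarized as precision and recall per video. The sketch assumes that the TP count in the Ground Truth row is the number of true loitering events in each video, so that recall is the detected TP divided by that count; the function and variable names are illustrative.

```python
# (TP, FP) for I-SAFE and the ground-truth event counts, copied from Table I.
GROUND_TRUTH = {"Sub. Ent.": 14, "Mall 1": 4, "Mall 2": 4}
I_SAFE = {"Sub. Ent.": (13, 4), "Mall 1": (3, 2), "Mall 2": (4, 1)}


def precision_recall(tp, fp, positives):
    """Precision = TP / (TP + FP); recall = TP / ground-truth positives."""
    return tp / (tp + fp), tp / positives


summary = {
    video: precision_recall(tp, fp, GROUND_TRUTH[video])
    for video, (tp, fp) in I_SAFE.items()
}

for video, (precision, recall) in summary.items():
    print(f"{video}: precision={precision:.3f}, recall={recall:.3f}")
```

Raising the alarm threshold would be expected to trade recall for precision, in line with the remark that a higher threshold gives fewer false positives.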

VII Conclusions

A smart surveillance system for public safety should be able to detect suspicious people or activities in real time. Based on the lightweight human object detection and tracking algorithms previously reported, this paper advances proactive surveillance system design by proposing I-SAFE, an instant suspicious activity identification scheme in the edge paradigm using CNN feature extraction and fuzzy decision making. The algorithms that extract features from the incoming video stream are implemented on an edge device, which efficiently reduces the communication overhead and enables outsourcing the decision-making process to the fog level. The fog device contextualizes the features, fuses the seven features with a fuzzy logic control system, and provides decision making. The rules and features adopted are chosen under the guidance of campus police officers. A proof-of-concept prototype of the I-SAFE scheme has been implemented and tested using real-world surveillance video streams.

Our on-going efforts consider two directions: (1) adding features to the tracking and classification algorithms to detect gestures for more accurate decision making; and (2) enhancing the lightweight detection and tracking algorithms to tackle the challenging situations shown in Fig. 9.

Acknowledgements

The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the United States Air Force.

References

  • [1] A. Adam, E. Rivlin, I. Shimshoni, and D. Reinitz, “Robust real-time unusual event detection using multiple fixed-location monitors,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 3, pp. 555–560, 2008.
  • [2] M. Andriluka, U. Iqbal, A. Milan, E. Insafutdinov, L. Pishchulin, J. Gall, and B. Schiele, “Posetrack: A benchmark for human pose estimation and tracking,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5167–5176.
  • [3] A. J. Aved and E. P. Blasch, “Multi-int query language for dddas designs,” Procedia Computer Science, vol. 51, pp. 2518–2532, 2015.
  • [4] E. Blasch, C. Banas, M. Paul, B. Bussjager, and G. Seetharaman, “Pattern activity clustering and evaluation (pace),” in Evolutionary and Bio-Inspired Computation: Theory and Applications VI, vol. 8402.   International Society for Optics and Photonics, 2012, p. 84020C.
  • [5] E. Blasch, É. Bossé, and D. A. Lambert, High-level information fusion management and systems design.   Artech House, 2012.
  • [6] E. Blasch, J. Nagy, A. Aved, E. K. Jones, W. M. Pottenger, A. Basharat, A. Hoogs, M. Schneider, R. Hammoud, G. Chen et al., “Context aided video-to-text information fusion,” in Information Fusion (FUSION), 2014 17th International Conference on.   IEEE, 2014, pp. 1–8.
  • [7] E. P. Blasch, K. Liu, B. Liu, D. Shen, and G. Chen, “Cloud based video detection and tracking system,” Jun. 2016, US Patent 9,373,174.
  • [8] P. Bojanowski and A. Joulin, “Unsupervised learning by predicting noise,” in Proceedings of the 34th International Conference on Machine Learning-Volume 70.   JMLR. org, 2017, pp. 517–526.
  • [9] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, “Realtime multi-person 2d pose estimation using part affinity fields,” in CVPR, 2017.
  • [10] N. Chen, Y. Chen, E. Blasch, H. Ling, Y. You, and X. Ye, “Enabling smart urban surveillance at the edge,” in 2017 IEEE International Conference on Smart Cloud (SmartCloud).   IEEE, 2017, pp. 109–119.
  • [11] N. Chen, Y. Chen, S. Song, C.-T. Huang, and X. Ye, “Smart urban surveillance using fog computing,” in Edge Computing (SEC), IEEE/ACM Symposium on.   IEEE, 2016, pp. 95–96.
  • [12] S. Coşar, G. Donatiello, V. Bogorny, C. Garate, L. O. Alvares, and F. Brémond, “Toward abnormal trajectory and event detection in video surveillance,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 27, no. 3, pp. 683–695, 2017.
  • [13] A. Del Giorno, J. A. Bagnell, and M. Hebert, “A discriminative framework for anomaly detection in large videos,” in European Conference on Computer Vision.   Springer, 2016, pp. 334–349.
  • [14] C.-T. Fan, Y.-K. Wang, and C.-R. Huang, “Heterogeneous information fusion and visualization for a large-scale intelligent video surveillance system,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 47, no. 4, pp. 593–604, 2017.
  • [15] T. Fuse and K. Kamiya, “Statistical anomaly detection in human dynamics monitoring using a hierarchical dirichlet process hidden markov model,” IEEE Transactions on Intelligent Transportation Systems, 2017.
  • [16] M. M. Gupta and Y. Tsukamoto, “Fuzzy logic controllers - a perspective,” in Joint Automatic Control Conference, no. 17, 1980, p. 93.
  • [17] H. Holloway, E. K. Jones, A. Kaluzniacki, E. Blasch, and J. Tierno, “Activity recognition using video event segmentation with text (vest),” in Signal Processing, Sensor/Information Fusion, and Target Recognition XXIII, vol. 9091.   International Society for Optics and Photonics, 2014, p. 90910O.
  • [18] G. Hua, M.-H. Yang, and Y. Wu, “Learning to estimate human pose with data driven belief propagation,” in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 2.   IEEE, 2005, pp. 747–754.
  • [19] S. Ioffe and D. Forsyth, “Finding people by sampling,” in Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, vol. 2.   IEEE, 1999, pp. 1092–1097.
  • [20] D. Karaboga and E. Kaya, “Adaptive network based fuzzy inference system (anfis) training approaches: a comprehensive survey,” Artificial Intelligence Review, pp. 1–31, 2018.
  • [21] J. Kim and K. Grauman, “Observe locally, infer globally: a space-time mrf for detecting abnormal activities with incremental updates,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition.   IEEE, 2009, pp. 2921–2928.
  • [22] M. P. Kumar, A. Zisserman, and P. H. Torr, “Efficient discriminative learning of parts-based models,” in Computer Vision, 2009 IEEE 12th International Conference on.   IEEE, 2009, pp. 552–559.
  • [23] C.-C. Lee, “Fuzzy logic in control systems: fuzzy logic controller. i,” IEEE Transactions on systems, man, and cybernetics, vol. 20, no. 2, pp. 404–418, 1990.
  • [24] C. H. Lim, E. Vats, and C. S. Chan, “Fuzzy human motion analysis: A review,” Pattern Recognition, vol. 48, no. 5, pp. 1773–1796, 2015.
  • [25] W. Liu, W. Luo, D. Lian, and S. Gao, “Future frame prediction for anomaly detection–a new baseline,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6536–6545.
  • [26] W. Luo, W. Liu, and S. Gao, “A revisit of sparse coding based anomaly detection in stacked rnn framework,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 341–349.
  • [27] J. Ma, Y. Dai, and K. Hirota, “A survey of video-based crowd anomaly detection in dense scenes,” Journal of Advanced Computational Intelligence and Intelligent Informatics, vol. 21, no. 2, pp. 235–246, 2017.
  • [28] V. Mahadevan, W. Li, V. Bhalodia, and N. Vasconcelos, “Anomaly detection in crowded scenes,” in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on.   IEEE, 2010, pp. 1975–1981.
  • [29] K. Mikolajczyk, C. Schmid, and A. Zisserman, “Human detection based on a probabilistic assembly of robust part detectors,” in European Conference on Computer Vision.   Springer, 2004, pp. 69–82.
  • [30] G. Mori and J. Malik, “Estimating human body configurations using shape context matching,” in European conference on computer vision.   Springer, 2002, pp. 666–680.
  • [31] K. Mozafari, N. M. Charkari, H. S. Boroujeni, and M. Behrouzifar, “A novel fuzzy hmm approach for human action recognition in video,” in Knowledge Technology Week.   Springer, 2011, pp. 184–193.
  • [32] S. Y. Nikouei, Y. Chen, S. Song, and T. R. Faughnan, “Kerman: A hybrid lightweight tracking algorithm to enable smart surveillance as an edge service,” arXiv preprint arXiv:1808.02134, 2018.
  • [33] S. Y. Nikouei, Y. Chen, S. Song, R. Xu, B.-Y. Choi, and T. R. Faughnan, “Real-time human detection as an edge service enabled by a lightweight cnn,” in Edge Computing, the IEEE International Conference on, 2018.
  • [34] S. Y. Nikouei, R. Xu, Y. Chen, A. Aved, and E. Blasch, “Decentralized smart surveillance through microservices platform,” arXiv preprint arXiv:1903.04563, 2019.
  • [35] S. Y. Nikouei, R. Xu, D. Nagothu, Y. Chen, A. Aved, and E. Blasch, “Real-time index authentication for event-oriented surveillance video query using blockchain,” arXiv preprint arXiv:1807.06179, 2018.
  • [36] A. T. Palma, V. Bogorny, B. Kuijpers, and L. O. Alvares, “A clustering-based approach for discovering interesting places in trajectories,” in Proceedings of the 2008 ACM symposium on Applied computing.   ACM, 2008, pp. 863–868.
  • [37] S. Penmetsa, F. Minhuj, A. Singh, and S. Omkar, “Autonomous uav for suspicious action detection using pictorial human pose estimation and classification,” ELCVIA: electronic letters on computer vision and image analysis, vol. 13, no. 1, pp. 18–32, 2014.
  • [38] C. Piciarelli, L. Esterle, A. Khan, B. Rinner, and G. L. Foresti, “Dynamic reconfiguration in camera networks: a short survey,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 26, no. 5, pp. 965–977, 2016.
  • [39] M. Ribeiro, A. E. Lazzaretti, and H. S. Lopes, “A study of deep convolutional auto-encoders for anomaly detection in videos,” Pattern Recognition Letters, 2017.
  • [40] W. Shi, J. Cao, Q. Zhang, Y. Li, and L. Xu, “Edge computing: Vision and challenges,” IEEE Internet of Things Journal, vol. 3, no. 5, pp. 637–646, 2016.
  • [41] L. Snidaro, J. García, J. Llinas, and E. Blasch, Context-enhanced information fusion: boosting real-world performance with domain knowledge, ser. Advances in Computer Vision and Pattern Recognition.   Springer International Publishing, 2016.
  • [42] W. Sultani, C. Chen, and M. Shah, “Real-world anomaly detection in surveillance videos,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6479–6488.
  • [43] X. Wang, “Intelligent multi-camera video surveillance: A review,” Pattern recognition letters, vol. 34, no. 1, pp. 3–19, 2013.
  • [44] J. Wu, “Mobility-enhanced public safety surveillance system using 3d cameras and high speed broadband networks,” GENI NICE Evening Demos, 2015.
  • [45] D. Xu, E. Ricci, Y. Yan, J. Song, and N. Sebe, “Learning deep representations of appearance and motion for anomalous event detection,” arXiv preprint arXiv:1510.01553, 2015.
  • [46] R. Xu, S. Y. Nikouei, Y. Chen, E. Blasch, and A. Aved, “Blendmas: A blockchain-enabled decentralized microservices architecture for smart public safety,” arXiv preprint arXiv:1902.10567, 2019.
  • [47] O. Yagishita, “Application of fuzzy reasoning to the water purification process,” Industrial applications of fuzzy control, pp. 19–40, 1985.
  • [48] B. Yao, H. Hagras, M. J. Alhaddad, and D. Alghazzawi, “A fuzzy logic-based system for the automation of human behavior recognition using machine vision in intelligent environments,” Soft Computing, vol. 19, no. 2, pp. 499–506, 2015.
  • [49] J. Yen and R. Langari, Fuzzy logic: intelligence, control, and information.   Prentice Hall Upper Saddle River, NJ, 1999, vol. 1.
  • [50] X. Zhang, M. Ding, and G. Fan, “Video-based human walking estimation using joint gait and pose manifolds,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 27, no. 7, pp. 1540–1554, 2017.
  • [51] B. Zhao, L. Fei-Fei, and E. P. Xing, “Online detection of unusual events in videos via dynamic sparse coding,” in CVPR 2011.   IEEE, 2011, pp. 3313–3320.