Advances in networking, intelligence, and media available in urban areas attract people towards a more comfortable lifestyle. Urbanization at an unprecedented scale and speed incurs significant challenges for city administrators, urban planners, and policy makers. In order to efficiently manage city functions and respond to dynamic transitions, surveillance systems are essential for situational awareness (SAW). Nowadays, a prohibitively large amount of surveillance data is generated every second by ubiquitously distributed video sensors. For example, North America alone had more than 62 million cameras in 2016. These cameras are connected to powerful data centers through communication networks, and the delivery of surveillance video streams creates a heavy burden on the network. Researchers have shown that video streaming accounted for 74% of total online traffic in 2017.
Since the first generation of video surveillance systems, known as Closed-Circuit TV (CCTV), was introduced in the 1960s, urban surveillance mechanisms have adapted to changing technology. Compared with today’s edge computing paradigm, CCTV-like surveillance systems are limited because:
The network is “best effort” based, which means that not only does transmission of the video data suffer delays and jitter, but the data may also be lost or dropped because of network congestion.
The raw-data transmission is “dedicated,” which wastes resources in the communication network and at the data center, because not all data is globally significant or worth storing for a long time.
An agent needs to pay “full attention” to the video to capture any emergency in real-time. Obviously this naïve approach is not scalable, and several architectures have been introduced that apply computer vision techniques and make decisions with machine learning algorithms. However, to date there is no system that meets performance requirements such as real-time operation, good scalability, and robustness.
An agent’s “working memory” is limited, as the available computing capability affords only searching for a specific target of interest or focusing on a special feature. Meanwhile, today’s multimedia forensics demands real-time or near real-time search by scanning through large surveillance video archives.
It is very challenging to immediately analyze the objects of interest or zoom in on suspicious actions from thousands of video frames. Making the big data indexable is critical to tackle the object analytics problem. Ideally, pattern indexes are generated in a real-time, on-site manner on the video streams instead of depending on batch processing at cloud centers. The modern edge-fog-cloud computing paradigm allows implementation of time-sensitive tasks at the network edge. In this paper, a novel event-oriented indexable and queryable intelligent surveillance (EIQIS) system is introduced, leveraging on-site edge devices to collect the information sensed in the form of frames and extract useful features to enhance situation awareness.
The rest of this paper is organized as follows. Section 2 briefly discusses background knowledge and related work. Section 3 highlights the main challenges in real-time surveillance. Section 4 introduces the rationale of the proposed indexable and queryable surveillance system. A preliminary study is presented in Section 5, which validates the concept and shows the feasibility of the system architecture. Finally, Section 6 concludes the paper with future research directions.
II Background Knowledge and Related Work
Today, most available surveillance systems archive streaming video footage to be used off-line for forensics analysis. Communication delays and uncertainties associated with the data transfer from image sensors to a remote computing facility limit implementation of online surveillance tasks. However, delay-sensitive applications require on-line processing. Thanks to the recent development of lightweight machine learning (ML) algorithms that require less computing power and storage space, more processing can be migrated to the edge of the network, where no extra delay is incurred for data transmission. For tasks like anomalous behavior detection that are not affordable at the edge, instead of directly outsourcing the job to the remote cloud, near-site fog nodes are powerful enough for complex data analytics tasks.
For instance, in a smart transportation application following a hierarchical system architecture, data is acquired by sensors mounted on buses and transferred to a fog node where contextualization and decision making happen. For video surveillance systems, the remote cloud is mainly used for profile building, pattern analysis, and long-term historical record analysis.
In general, a smart surveillance system includes three layers, as shown in Fig. 1. In the first layer, image analysis, the input camera frame is given to an edge device and the low-level features are extracted. The edge devices are able to conduct object detection and object tracking tasks. The intermediate level, considered the fog stratum, is in charge of pattern recognition for action recognition, behavior understanding, and abnormal event detection. Finally, the high level, the cloud center, is focused on systems analysis including historical profile building, global statistical analysis, and narrative reporting. Connections among the edge, fog, and cloud nodes present challenges in terms of the overall platform, connections, quality of service (QoS) requirements, and preserving privacy and security.
The first step of a video surveillance system is to simultaneously track and identify (STID) the objects of interest in the video. STID continues to be a challenging task performed at the edge of the network. Nowadays, once an event occurs, operators need to spend a considerable amount of time going through the footage and looking at videos from different cameras in order to find a specific target. Even in next-generation surveillance systems that are combined with image processing techniques for better decision making, performing a search in real-time or near real-time is very challenging.
Ideally, the surveillance system is expected to quickly and automatically identify the clips of interest based on a given query. Earlier researchers have proposed adopting video parsing techniques that automatically extract index data from video and store the index data in relational tables. The index is used through SQL queries to retrieve events of interest quickly. However, this approach cannot meet the performance requirements of online, real-time, operator-in-the-loop interactions. Future smart surveillance video streams have to be indexable and queryable such that the operator is able to obtain the information of interest instantly.
III Real-time Queryable Surveillance: Architecture and Challenges
This section introduces an edge-fog-cloud computing based system architecture to achieve event-based indexable and queryable intelligent surveillance (EIQIS). It is non-trivial to extract features in real-time and use them as indexes to conduct online queries on surveillance video streams. Advances in machine learning, multi-modal data fusion, and physics-based and human-derived information fusion (PHIF) show promise for EIQIS. Current systems are designed to enhance user responsibilities to include security, surveillance, and forensics. Typically, the user provides a standing query for which the image processing is to provide event triggers. The user would like the system to perform these functions autonomously; however, the ultimate design would include a combination of humans in, on, or out of the loop (HIL, HON, HOON).
In order to have a smart surveillance system raise an alarm when something abnormal is detected, each captured frame that is processed requires knowledge of the preceding frames. A three-layer edge-fog-cloud hierarchical architecture reduces the delays that are incurred when a frame is transferred to a remote cloud center. The more processing that is migrated to the network edge, the faster the features are obtained and indexes are constructed, because of the close proximity of the edge node to the geo-location of the camera. Meanwhile, due to the constraints on computation and storage capacity at the edge devices, more computing- or data-intensive tasks are outsourced to the more powerful cloud.
The first layer is the edge camera. It should be mentioned that the most reliable detection and tracking algorithms are dedicated to specific surveillance applications. Running them in a resource-constrained environment requires a lightweight version of the original algorithm, which does not help the accuracy. Thus, finding better methods is a contemporary research topic.
Once a frame is captured by the image sensor, it is either transferred to an edge device connected via a local area network (LAN) or processed on-site if the camera is a smart camera (edge device) with sufficient computing power. The edge node has limited computing power, so not all computing-intensive event detections can be executed at this level. The edge device conducts pre-processing using a convolutional neural network (CNN), which identifies the objects of interest and gives their positions in the image frame. Even with small architectures with few layers that reduce the overall computational complexity, CNNs are heavy for the edge device. The edge device cannot afford to execute the CNN more than a couple of times per second. Therefore, in order to reach a higher temporal resolution of detection, the bounding box around the object of interest is given to a tracker that uses an online learning algorithm to follow the object in each frame until it moves out of the frame. Each time the CNN runs, the newly found bounding boxes are sent to a fast tracker such as the Kernelized Correlation Filter (KCF), improving the speed. It should be noted that although newer and more powerful edge nodes are made every day, with more features to be extracted, a longer processing time is needed. Consequently, the key for real-time application is a trade-off between the speed and the amount of features to be extracted in each frame.
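The detect-then-track scheduling described above can be sketched as follows. This is a minimal illustration of the control flow only: the `StubDetector` and `StubTracker` classes are stand-ins for a real CNN detector and a KCF tracker, and the ten-frame detection interval is an assumed value.

```python
# Sketch of the edge-device loop: run the heavy detector only every N
# frames and rely on a fast tracker in between. StubDetector/StubTracker
# stand in for the CNN and KCF tracker (assumptions for illustration).

DETECT_EVERY = 10  # run the "CNN" once per 10 frames (illustrative value)

class StubDetector:
    """Stands in for a CNN object detector returning bounding boxes."""
    def detect(self, frame):
        # A real detector would return boxes found in the frame.
        return [(10, 20, 50, 80)]  # (x, y, w, h)

class StubTracker:
    """Stands in for a fast correlation-filter tracker such as KCF."""
    def __init__(self, frame, box):
        self.box = box
    def update(self, frame):
        # A real tracker shifts the box to follow the object; here we
        # simply move it one pixel to the right per frame.
        x, y, w, h = self.box
        self.box = (x + 1, y, w, h)
        return self.box

def process_stream(frames):
    detector, trackers, results = StubDetector(), [], []
    for i, frame in enumerate(frames):
        if i % DETECT_EVERY == 0:
            # Heavy CNN pass: re-initialize trackers on fresh detections.
            trackers = [StubTracker(frame, b) for b in detector.detect(frame)]
        # Cheap tracker pass on every frame keeps temporal resolution high.
        boxes = [t.update(frame) for t in trackers]
        results.append(boxes)
    return results
```

The trade-off noted above maps directly onto `DETECT_EVERY`: a larger value lowers the computational load but lets tracker drift accumulate between CNN corrections.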
After each object is detected and tracked, features can be extracted. These features might include, but are not limited to, the object’s current position and walking speed, the direction of travel, and other physical features such as the angles formed by the parts of the upper body, and hence the pose of the pedestrian. For each detected pedestrian, a table is updated with each frame, holding a key and value for the features extracted from the video. The actual video may not need to be transferred to the fog-level device where the decision-making code is executed.
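A per-pedestrian key-value record of this kind might look as follows; the field names and units are illustrative assumptions, not the authors' exact schema.

```python
# Illustrative key-value feature table kept per tracked pedestrian.
# Field names and units are assumptions for the sketch.

def update_feature_table(table, obj_id, frame_no, box, speed, direction):
    """Update the per-object key-value feature record for one frame."""
    table.setdefault(obj_id, {}).update({
        "last_frame": frame_no,
        "position": (box[0], box[1]),  # top-left corner of the bounding box
        "speed": speed,                # e.g., pixels per frame
        "direction": direction,        # e.g., heading in degrees
    })
    return table

features = {}
update_feature_table(features, "ped_1", 42, (10, 20, 50, 80), 1.5, 90)
```

Only this compact table, not the raw frames, would then be shipped to the fog node.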
The edge device is designed to conduct immediate techniques such as feature extraction, while the advanced analytics is outsourced to a more powerful, near-site node. Several edge devices serving several camera feeds can be connected to a fog node, which conducts feature contextualization, indexing, and storage. One of the challenges in a surveillance system is the security of the connection between the edge and the fog. Although there are promising new technologies to address privacy and security, like blockchain technology, more development is needed to make them lightweight and robust for smaller networks with low power. The features transmitted to the fog node can be contextualized to support decision making. Valuable data in the contextualization include the location of the camera, the time of the footage, terrain information, semantic ontologies of descriptors, etc. For example, while it is normal for people to walk and stand in a campus building, it can be considered abnormal late at night when the building should be closed. Also, connecting several cameras in the same area to the same fog node gives the fog the ability to look at the monitored area from different perspectives, illuminations, and contexts.
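The campus-building example can be sketched as a simple contextualization rule at the fog node. The rule, the closing hours, and the field names are all illustrative assumptions; a deployed system would learn or configure such rules per site.

```python
from datetime import time

# Sketch of rule-based contextualization at the fog node: the same feature
# (people present in a building) is normal by day and flagged at night.
# Hours and thresholds are illustrative assumptions.

BUILDING_CLOSED = (time(22, 0), time(6, 0))  # assumed closing hours

def is_after_hours(t):
    start, end = BUILDING_CLOSED
    return t >= start or t < end  # the closed interval wraps past midnight

def contextualize(feature, camera_ctx):
    """Attach camera context and a preliminary normal/abnormal label."""
    abnormal = (camera_ctx["location"] == "campus_building"
                and feature["person_count"] > 0
                and is_after_hours(feature["time"]))
    return {**feature, **camera_ctx, "abnormal": abnormal}
```

Because several cameras feed the same fog node, `camera_ctx` could also carry per-camera viewpoint or illumination information for cross-camera checks.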
Another challenge that the surveillance community faces is the decision support algorithm, which may be based on supervised, unsupervised, or semi-supervised methods. The lack of labeled data for unknown situations requires methods in semi-supervised training to better characterize abnormal situations. The answer may depend on the location, several other contextual factors, and the sequence of events leading to abnormal behavior detection. Also, the security camera and the function of the place under surveillance may differ from one deployment to another, which makes it very difficult to differentiate between normal and abnormal activity.
The historical analysis, profile building, and situation analysis are conducted by the most powerful node in the edge-fog-cloud architecture hierarchy, the cloud. The decisions made, the detected false alarms, and the features that raised the alarms are sent back for future fine-tuning of the algorithms and for analytical studies. Figure 1 shows the interconnections of the nodes in the network described in this section.
IV Making the Video Streams Indexable
The usability of any exploited video is based on what is stored and indexed for fast retrieval, such as content-based image retrieval. The surveillance video streamed to the edge device enables feature extraction for decision making. Decision making is based on the real-time search query. The real-time video search makes the job of the operator/user easier by returning the instances of the video that are asked for in a query to the system. The search string is the query that is given to the fog node. The fog node is the ideal level to handle search requests, as it stores contextualized information from nearby cameras. The following describes how a query is handled at the fog layer:
The fog node receives the query and checks the eligibility of the machine asking for the information. The access level of the nodes in such a network is defined in a smart contract on a blockchain-enabled security platform.
The fog node searches for the query in the index table to find the corresponding camera, timestamp, and other information based on the real-time features provided, and selects matches, if any.
The fog node answers the search requester based on the information found.
The operator then selects the cameras matching the query and obtains the live feed or recorded clips (it is assumed that the operator has access to the edge device in charge of the camera of interest if he/she has access to the higher-level fog).
The operator thus can search the video streams in real-time.
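The query-handling steps above can be condensed into a short sketch. The access-control dictionary stands in for the blockchain smart contract, and the index contents are illustrative assumptions.

```python
# Minimal sketch of query handling at the fog node: eligibility check,
# index lookup, response. ACCESS stands in for the blockchain smart
# contract; INDEX holds assumed example entries.

ACCESS = {"operator_7": "read"}  # assumed access-control table
INDEX = {
    ("person", "late_night"): [("cam_3", "2018-07-12T23:41:00")],
}

def handle_query(requester, key):
    # Step 1: check the eligibility of the requesting machine.
    if ACCESS.get(requester) != "read":
        return {"status": "denied"}
    # Step 2: look up the query key in the index table.
    hits = INDEX.get(key, [])
    # Step 3: answer with the matching camera/timestamp pairs, if any.
    return {"status": "ok", "matches": hits}
```

The returned camera identifiers are what the operator would then use to pull the live feed or recorded clip from the corresponding edge device.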
Indexing requires the association of complementary information (hashed, correlated, and linked) with the video frame for storage. Using the mapping table affords fast information retrieval. Considering the index table to be the same as the features simplifies the search operation. While there are many features extracted from the video, there might be several different indexes that are required by the system administrator. Features are generated in order to make a decision about the actions of the object in the video. However, indexes that are based on features might include more options. Two scenarios are plausible. First, the fog node uses the same features and adds context to make the data usable as the index table. Second, the fog node uses several edge devices (performing as microservices) to extract the required features, creating a table to be used as indexes based on the resulting features.
In order to facilitate faster search results, one known method used today in search engines and operating systems is to create an index table which is used later for answering search queries. Indexing means keeping a key-value table of the features that are of interest; once the keys are searched for (in query format), the corresponding values are the results of the search, identifying the files that contain the query. This way the search is faster and there is no need to scan all files for the key values being searched. The same principle applied to the video captured by the surveillance cameras results in efficient and real-time operations. Because the index table points to the corresponding edge device, the live camera feed or recorded footage clips are identified and sent to the query sender.
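The key-value indexing principle described above amounts to an inverted index from feature keys to clip locations, as this small sketch shows (the clip identifiers and keys are illustrative):

```python
# Sketch of the inverted-index principle applied to surveillance footage:
# map each feature key to the clips that contain it, so a query is a
# direct lookup rather than a scan over every clip. Data is illustrative.

def build_index(clips):
    """clips: iterable of (clip_id, feature_keys). Returns key -> [clip_id]."""
    index = {}
    for clip_id, keys in clips:
        for key in keys:
            index.setdefault(key, []).append(clip_id)
    return index

def query(index, key):
    # Direct lookup instead of scanning all files for the key value.
    return index.get(key, [])
```

Each returned clip identifier would, in the proposed system, carry along the camera and timestamp needed to fetch the footage from the edge.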
Once the camera captures each frame, an edge device extracts features in real-time or near real-time from the video and the features are transferred to a fog node. After contextualization of the features, they can be used as the indexes for querying when the operator needs to find something instantly. For example, if the operator is looking for moments when there is a congestion of people on campus in the late-night hours, the search can be directed to the exact hours and locations, then filtered for features that report ten or more people in the same frame. Using the query-based parameters inherent in the index table leads to the corresponding video clips faster, and the operator can look for incidents that match the exact search keys. The EIQIS method is obviously more efficient than having to check all the camera footage in the security system to find the imagery of interest.
IV-B Features vs. Indexes
Creating indexes for the extracted features that are useful for video search supports historical analytics. However, the features that are of interest in abnormal behavior detection may not be sufficient to support an operator search, nor exactly the same as the indexes (key values) that are applicable in a typical search. Figure 2 shows a scenario in which more feature extraction from the video is needed. The job can be divided among more than one edge device, and each feature can be handled as a microservice. A microservice is defined as a separate piece of a program that provides a service to a bigger program. In this case, feature extraction can be considered the microservice that is used in the video indexing platform. More features can be extracted as a result of this architecture. If any indexes need to be added, simply adding the service to the platform can expand the scope of the indexes that are used.
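The per-feature microservice idea can be sketched as a registry of independent extractors: adding a new index then means registering one more service. The extractor names and the frame representation are assumptions for illustration.

```python
# Sketch of per-feature extractors as microservices: the indexing platform
# dispatches each frame to every registered service, so new indexes are
# added by registering a new service. Names and data are illustrative.

SERVICES = {}

def register(name):
    """Decorator registering a feature extractor under an index name."""
    def wrap(fn):
        SERVICES[name] = fn
        return fn
    return wrap

@register("person_count")
def person_count(frame):
    return len(frame.get("detections", []))

@register("max_speed")
def max_speed(frame):
    return max((d["speed"] for d in frame.get("detections", [])), default=0)

def extract_all(frame):
    # Each registered microservice contributes one indexable feature.
    return {name: fn(frame) for name, fn in SERVICES.items()}
```

In a real deployment each extractor would run on its own edge device; the registry pattern only illustrates how the index scope grows by adding services.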
V A Preliminary Case Study
A preliminary proof-of-concept prototype has been built to validate the feasibility of EIQIS. It shows that the edge devices are capable of extracting and sending features in real-time to the fog layer. The features are written into a text file and sent to the fog through a secure channel. The features are synchronized with every node of the network for added security. Figure 3 is an example of features stored in the fog in a key-value manner, and Fig. 4 is the graphical output of the edge device, where the device adds a bounding box around the object of interest (e.g., person, vehicle, other) and the box follows the object. Figure 4 presents several moments that are challenging to detect, demonstrating acceptable performance of the edge device.
A real-environment deployment validates the feasibility of the proposed system. The prototype runs on two Asus Tinker Boards with the following configuration: a 1.8 GHz 32-bit quad-core ARM Cortex-A17 CPU, 2 GB of LPDDR3 dual-channel memory, and the TinkerOS operating system based on the Linux kernel. The fog layer functions are implemented on a laptop with the following configuration: a 2.3 GHz Intel Core i7 processor (8 cores), 16 GB of RAM, and the Ubuntu 16.04 operating system. A private blockchain network is implemented to secure the feature data transferred from edge to fog. Our private Ethereum network includes four miners, distributed across four desktops each running Ubuntu 16.04 with a 3 GHz Intel Core TM (2 cores) processor and 4 GB of memory. Each miner uses two CPU cores for the mining task to maintain the private blockchain network, and the resulting blocks are synchronized through the whole network so that every node has a copy of the latest block. The data transfer between the fog node and the miner is carried out through an encrypted channel, so that no adversary can tamper with the surveillance data before the fog node secures the features. Python-based socket programming is used for both ends of the channel. More details of the prototype are reported elsewhere.
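The edge-to-fog channel can be sketched with Python sockets as in the prototype. This is a bare-bones illustration: the JSON payload layout is an assumption, and the prototype's encryption and blockchain layers are omitted for brevity.

```python
import json
import socket
import threading

# Hedged sketch of the edge-to-fog feature transfer over a Python socket.
# The payload fields are assumptions; encryption is omitted for brevity.

def fog_server(srv_sock, results):
    """Accept one connection and collect the decoded feature record."""
    conn, _ = srv_sock.accept()
    with conn:
        data = conn.recv(4096)
        results.append(json.loads(data.decode()))

def send_features(port, features):
    """Edge side: serialize the feature record and ship it to the fog."""
    with socket.create_connection(("127.0.0.1", port)) as s:
        s.sendall(json.dumps(features).encode())

# Demonstration on localhost with an ephemeral port.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
port = srv.getsockname()[1]
received = []
t = threading.Thread(target=fog_server, args=(srv, received))
t.start()
send_features(port, {"camera": "cam_1", "person_count": 3})
t.join()
srv.close()
```

In the prototype the same pattern runs between a Tinker Board and the laptop fog node, with the channel wrapped in encryption.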
VI Conclusions
Many surveillance systems available today cannot meet the performance requirements of real-time, human-in-the-loop interactive operations. The event-oriented indexable and queryable intelligent surveillance (EIQIS) edge-fog-cloud hierarchical architecture is promising for real-time or near real-time applications, as it allows instant querying of online surveillance video streams to give more time to first responders. In this paper, the architecture of an event-oriented, indexable, queryable smart surveillance system is introduced. The proposed system enables querying of video in real-time based on an index table, which is created on top of the features that are extracted on-site by edge computing nodes. This intelligent surveillance system enables the operator to search for scenes or events of interest instantly. A preliminary study has validated the feasibility of the proposed architecture.
-  A. J. Aved and E. P. Blasch, “Multi-int query language for dddas designs,” Procedia Computer Science, vol. 51, pp. 2518–2532, 2015.
-  E. Blasch and A. Aved, “Urref for veracity assessment in query-based information fusion systems,” in Information Fusion (Fusion), 2015 18th International Conference on. IEEE, 2015, pp. 58–65.
-  E. Blasch and L. Hong, “Data association through fusion of target track and identification sets,” Fusion00, 2000.
-  E. Blasch and B. Kahler, “Multiresolution eo/ir target tracking and identification,” AIR FORCE RESEARCH LAB WRIGHT-PATTERSON AFB OH SENSORS DIRECTORATE, Tech. Rep., 2005.
-  E. Blasch, J. Nagy, A. Aved, E. K. Jones, W. M. Pottenger, A. Basharat, A. Hoogs, M. Schneider, R. Hammoud, G. Chen et al., “Context aided video-to-text information fusion,” in Information Fusion (FUSION), 2014 17th International Conference on. IEEE, 2014, pp. 1–8.
-  E. P. Blasch and A. J. Aved, “Dynamic data-driven application system (dddas) for video surveillance user support,” Procedia Computer Science, vol. 51, pp. 2503–2517, 2015.
-  E. P. Blasch, S. K. Rogers, H. Holloway, J. Tierno, E. K. Jones, and R. I. Hammoud, “Quest for information fusion in multimedia reports,” International Journal of Monitoring and Surveillance Technologies Research (IJMSTR), vol. 2, no. 3, pp. 1–30, 2014.
-  F. F. Chamasemani, L. S. Affendey et al., “Systematic review and classification on video surveillance systems,” Int. Journal Information Technology and Computer Science, no. 7, pp. 87–102, 2013.
-  N. Chen and Y. Chen, “Smart city surveillance at the network edge in the era of iot: Opportunities and challenges,” in Smart Cities. Springer, 2018, pp. 153–176.
-  N. Chen, Y. Chen, E. Blasch, H. Ling, Y. You, and X. Ye, “Enabling smart urban surveillance at the edge,” in Smart Cloud (SmartCloud), 2017 IEEE International Conference on. IEEE, 2017, pp. 109–119.
-  R. I. Hammoud, C. S. Sahin, E. P. Blasch, B. J. Rhodes, and T. Wang, “Automatic association of chats and video tracks for activity learning and recognition in aerial video surveillance,” Sensors, vol. 14, no. 10, pp. 19 843–19 860, 2014.
-  A. Hampapur, L. Brown, R. Feris, A. Senior, C.-F. Shu, Y. Tian, Y. Zhai, and M. Lu, “Searching surveillance video,” in Advanced Video and Signal Based Surveillance, 2007. AVSS 2007. IEEE Conference on. IEEE, 2007, pp. 75–80.
-  N. Khanezaei and Z. M. Hanapi, “A framework based on rsa and aes encryption algorithms for cloud computing services,” in Systems, Process and Control (ICSPC), 2014 IEEE Conference on. IEEE, 2014, pp. 58–62.
-  H. Li, K. Sudusinghe, Y. Liu, J. Yoon, M. Van Der Schaar, E. Blasch, and S. S. Bhattacharyya, “Dynamic, data-driven processing of multispectral video streams,” IEEE Aerospace & Electronic Systems Magazine, 2017.
-  B. Liu, Y. Chen, D. Shen, G. Chen, K. Pham, E. Blasch, and B. Rubin, “An adaptive process-based cloud infrastructure for space situational awareness applications,” in Sensors and Systems for Space Applications VII, vol. 9085. International Society for Optics and Photonics, 2014, p. 90850M.
-  D. Nagothu, R. Xu, S. Y. Nikouei, and Y. Chen, “A microservice-enabled architecture for smart surveillance using blockchain technology,” arXiv preprint arXiv:1807.07487, 2018.
-  S. Y. Nikouei, Y. Chen, S. Song, R. Xu, B.-Y. Choi, and T. R. Faughnan, “Intelligent surveillance as an edge network service: from harr-cascade, svm to a lightweight cnn,” arXiv preprint arXiv:1805.00331, 2018.
-  ——, “Real-time human detection as an edge service enabled by a lightweight cnn,” arXiv preprint arXiv:1805.00330, 2018.
-  S. Y. Nikouei, R. Xu, D. Nagothu, Y. Chen, A. Aved, and E. Blasch, “Real-time index authentication for event-oriented surveillance video query using blockchain,” arXiv preprint arXiv:1807.06179, 2018.
-  A. Ouaddah, A. Abou Elkalam, and A. Ait Ouahman, “Fairaccess: a new blockchain-based access control framework for the internet of things,” Security and Communication Networks, vol. 9, no. 18, pp. 5943–5964, 2016.
-  K. Palaniappan, F. Bunyak, P. Kumar, I. Ersoy, S. Jaeger, K. Ganguli, A. Haridas, J. Fraser, R. M. Rao, and G. Seetharaman, “Efficient feature extraction and likelihood fusion for vehicle tracking in low frame rate airborne video,” MISSOURI UNIV-COLUMBIA DEPT OF COMPUTER SCIENCE, Tech. Rep., 2010.
-  S. Penmetsa, F. Minhuj, A. Singh, and S. Omkar, “Autonomous uav for suspicious action detection using pictorial human pose estimation and classification,” ELCVIA: Electronic Letters on Computer Vision and Image Analysis, vol. 13, no. 1, pp. 18–32, 2014.
-  L. Snidaro, J. Garcia-Herrera, J. Llinas, and E. Blasch, Context-Enhanced Information Fusion. Springer, 2016, vol. 748.
-  R. Surette, “The thinking eye: Pros and cons of second generation cctv surveillance systems,” Policing: An International Journal of Police Strategies & Management, vol. 28, no. 1, pp. 152–173, 2005.
-  V. Tsakanikas and T. Dagiuklas, “Video surveillance systems-current status and future trends,” Computers & Electrical Engineering, 2017.
-  P. Turaga, R. Chellappa, V. S. Subrahmanian, and O. Udrea, “Machine recognition of human activities: A survey,” IEEE Transactions on Circuits and Systems for Video technology, vol. 18, no. 11, p. 1473, 2008.
-  R. Wu, B. Liu, Y. Chen, E. Blasch, H. Ling, and G. Chen, “Pseudo-real-time wide area motion imagery (wami) processing for dynamic feature detection,” in Information Fusion (Fusion), 2015 18th International Conference on. IEEE, 2015, pp. 1962–1969.
-  R. Xu, Y. Chen, E. Blasch, and G. Chen, “Blendcac: A blockchain-enabled decentralized capability-based access control for iots,” arXiv preprint arXiv:1804.09267, 2018.
-  D. Yu, Y. Jin, Y. Zhang, and X. Zheng, “A survey on security issues in services communication of microservices-enabled fog applications,” Concurrency and Computation: Practice and Experience, p. e4436, 2018.