Intelligent surveillance cameras have become increasingly popular in both commercial and private applications. Traditionally, Closed-Circuit Television (CCTV) systems have been the de facto monitoring systems. More recent products such as Ring home security systems use Internet of Things (IoT) connected surveillance video feeds to provide homeowners an extra degree of security through remote monitoring. Most existing monitoring systems require some amount of manual operation and have limited use in low-light environments. Other similar solutions include the Code Blue emergency call system, which many universities have deployed on their campuses. This system provides a direct communication line to first responders. However, the call station depends on the user being able to reach and physically push the call button. Furthermore, the rate of reported crimes depends on victims or bystanders self-reporting.
Although algorithms exist for fully automated action recognition [DBLP:journals/corr/two_stream, DBLP:journals/corr/temporal_networks, DBLP:journals/corr/spatio_temporal_conv, DBLP:journals/corr/real_time_action_rec, DBLP:journals/corr/quo_vadis], many are not applied in real-time or low-light environments. The existing benchmark action recognition datasets such as HMDB-51 [Kuehne:2011:HLV:2355573.2356424] (Human Motion DataBase), UCF-101 [DBLP:journals/corr/abs-1212-0402] (University of Central Florida), and Sports-1M [karpathy2014large] contain primarily daytime videos. UCF released the UCF-Crime dataset [DBLP:journals/corr/abs-1801-04264] for general anomaly detection, which recognizes 13 crime categories, including arrest, arson, assault, burglary, explosion, fighting, robbery, shooting, stealing, shoplifting, and vandalism. This dataset consists of indoor video feeds or camera views focused on buildings, along with untrimmed videos with noisy labels, which makes it hardly viable for real-life application.
We propose Low-Light Environment Neural Surveillance (LENS), a proactive, scalable, computer vision based solution for low-light surveillance (Fig. 1). LENS implements real-time action recognition algorithms to detect crime in low-light conditions, assists the abilities of civilians to report a crime, and provides potentially expedited response times from government entities in urgent situations. LENS does not rely on the victims or bystanders and is intended to aid victims who are in shock, injured, or unaware of local authorities/blue light systems. We collect and use a dataset tailored for low-light crime, namely, LENS-4 (Section 6.1). LENS consists of 3 modules: the computer vision module (Section 3); the IoT module (Section 4); and the user interface module (Section 5). The computer vision module performs crime-detection in low light environments via a two-stream approach [DBLP:journals/corr/two_stream]. The IoT module performs post-processing steps to the computer vision module output and relays the post-processed messages to the user interface module. The user interface module receives metadata, such as location of crime, time of crime, and type of crime, and provides a user-to-user notification system.
2 Previous Work
[DBLP:journals/corr/two_stream] advanced action recognition research by training deep Convolutional Neural Networks (CNN) on videos with a two-stream CNN architecture combining the predictions of a spatial and a temporal network. The spatial stream was trained on video frames, while the temporal stream was trained on multi-frame dense optical flow, using the UCF-101 and HMDB-51 datasets. Building upon [DBLP:journals/corr/two_stream], [DBLP:journals/corr/temporal_networks] proposed the Temporal Segment Network (TSN) and improved the two-stream CNN architecture in terms of initialization and training setup. Nevertheless, TSN suffers from a computational bottleneck in calculating the optical flow, preventing real-time application. [DBLP:journals/corr/quo_vadis] proposed the Inflated 3D Convnet (I3D) architecture, a two-stream CNN employing 3D convolutions. As 3D convolutions lead to an explosion in the number of network parameters, I3D requires training on large datasets such as Kinetics-400 [DBLP:journals/corr/KayCSZHVVGBNSZ17]. The next step in action recognition is to leverage spatio-temporal features with 3D residual networks, which have shown promise in recent works such as [DBLP:journals/corr/abs-1708-07632].
[DBLP:journals/corr/spatio_temporal_conv, DBLP:journals/corr/real_time_action_rec] focused on real-time action recognition. [DBLP:journals/corr/spatio_temporal_conv] developed the C3D architecture, a spatio-temporal neural network using 3-dimensional CNNs, where the temporal information between video frames is implicitly calculated. C3D performs action recognition at a frame rate of 313.9 FPS, at the cost of degraded detection performance compared to TSN. [DBLP:journals/corr/real_time_action_rec] circumvented the optical flow bottleneck by calculating optical flow via pre-extracted motion vectors from compressed videos and achieved a frame rate of 390.7 FPS.
In only a few recent works, action recognition has been successfully explored in dark environments with infrared and thermal cameras [deeprep, batchuluun2019action]. Nevertheless, using infrared and thermal cameras has several drawbacks and requires additional technical work compared to standard cameras. [deeprep] requires multiple video streams, while [batchuluun2019action] requires significant pre-processing of video frames with cycle-consistent generative adversarial networks (CycleGANs) and uses a long short-term memory (LSTM) network to capture temporal information. LSTM and CycleGAN-based methods lead to slow inference time and leave less room for future improvement, preventing real-time crime detection. Moreover, both [deeprep] and [batchuluun2019action] analyze datasets that have no human-to-human or human-object interaction, making the learning task significantly less complex than crime detection. More complex tasks, including crime detection, call for normal rather than thermal images, which also allow the application to be extended easily to daytime.
We differ from the previous works by proposing an end-to-end action recognition system for accurate real-time crime detection for low-light scenarios. Furthermore, as the state-of-the-art datasets for crime detection are insufficient for low-light environments, we collect and use an action recognition dataset tailored to this application.
3 Computer Vision
The LENS computer vision algorithm consists of two parts: dense optical flow calculation and action recognition. We use FlowNet2.0 [DBLP:journals/corr/flownet2] for dense optical flow calculation, and a two-stream approach [DBLP:journals/corr/two_stream] for action recognition, similar to the approaches discussed in Section 2. The overall network architecture is depicted in Fig. 3.
3.1 Optical Flow
Optical flow is calculated for the temporal stream of our two-stream approach. We employ FlowNet2.0-CSS, which is shown to operate at high FPS for optical flow calculation without a significant degradation in end-point-error (EPE) compared to FlowNet2.0 [DBLP:journals/corr/flownet2].
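To make the EPE metric concrete, the following is a minimal sketch of how end-point error is commonly computed: the mean Euclidean distance between predicted and reference flow vectors. The function name and the three-pixel flow fields are hypothetical and serve only as an illustration; they are not taken from the LENS code or the LENS-4 data.

```python
import math


def end_point_error(pred_flow, true_flow):
    """Mean Euclidean distance between predicted and reference flow vectors.

    Each flow field is a list of (u, v) displacement pairs, one per pixel.
    """
    if len(pred_flow) != len(true_flow) or not pred_flow:
        raise ValueError("flow fields must be non-empty and of equal size")
    total = 0.0
    for (u_p, v_p), (u_t, v_t) in zip(pred_flow, true_flow):
        total += math.hypot(u_p - u_t, v_p - v_t)
    return total / len(pred_flow)


# Hypothetical three-pixel flow fields (illustrative values only).
predicted = [(1.0, 0.0), (0.0, 2.0), (3.0, 4.0)]
reference = [(1.0, 0.0), (0.0, 0.0), (0.0, 0.0)]
epe = end_point_error(predicted, reference)  # per-pixel errors are 0, 2 and 5
```

A lower EPE means the estimated motion field is closer to the reference, which is the sense in which FlowNet2.0-CSS is compared against the full FlowNet2.0.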
The FlowNet2.0 architecture consists of layered networks, each of which calculates or refines the optical flow. The first three layers calculate large displacement optical flow, which translates to large movements and requires less accuracy, while the last layer calculates small displacements corresponding to fine movements and details. The small displacement layer is of relevance as subjects may occupy only a few pixels, or the movements we are interested in may not be very large. At the output of the network, the last large displacement layer and the single small displacement layer are concatenated to produce the final prediction. Fig. 4 shows the output of FlowNet2.0 and FlowNet2.0-CSS applied on an example image in the LENS-4 dataset.
3.2 Action Recognition
The action recognition approach follows [DBLP:journals/corr/two_stream, DBLP:journals/corr/temporal_networks], where the outputs of two CNN architectures are fused for the final action prediction (Fig. 3): a spatial stream trained on video frames, and a temporal stream trained on the dense optical flow estimates provided by FlowNet2.0-CSS. We use a two-stream 2D CNN instead of a 3D CNN [DBLP:journals/corr/TranBFTP14] to explicitly calculate optical flow in the temporal stream and avoid poor motion estimates. We replace the spatial and temporal stream architectures in [DBLP:journals/corr/two_stream] with ResNet [DBLP:journals/corr/deep_res_learning], a novel deep neural network architecture that alleviates vanishing/exploding gradients with residual connections. The softmax predictions from the spatial and temporal streams are fused via an SVM with a polynomial kernel of order 5.
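As a rough illustration of the fusion step, the sketch below builds an SVM input by concatenating the two softmax vectors and evaluates a degree-5 polynomial kernel on such inputs. Concatenation as the feature construction, the `gamma` and `coef0` values, and the four-class probability vectors are assumptions made for this example; the text only states that an SVM with a polynomial kernel of order 5 is used.

```python
def fusion_features(spatial_probs, temporal_probs):
    """Concatenate the two softmax vectors into one SVM input (assumed layout)."""
    return list(spatial_probs) + list(temporal_probs)


def polynomial_kernel(x, y, degree=5, gamma=1.0, coef0=1.0):
    """Polynomial kernel K(x, y) = (gamma * <x, y> + coef0) ** degree."""
    dot = sum(a * b for a, b in zip(x, y))
    return (gamma * dot + coef0) ** degree


# Hypothetical softmax outputs over four classes for two clips.
x_a = fusion_features([0.7, 0.1, 0.1, 0.1], [0.6, 0.2, 0.1, 0.1])
x_b = fusion_features([0.1, 0.7, 0.1, 0.1], [0.2, 0.6, 0.1, 0.1])

k_same = polynomial_kernel(x_a, x_a)  # similarity of a clip with itself
k_diff = polynomial_kernel(x_a, x_b)  # similarity of two differently scored clips
```

In practice the kernel would be evaluated inside an SVM library rather than by hand, for example scikit-learn's `SVC(kernel="poly", degree=5)`; the point here is only that the high-order kernel lets the classifier model nonlinear interactions between the two streams' class probabilities.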
4 IoT Capabilities
4.1 Network Architecture
The three main network components are edge computers, cloud, and clients. The edge computers use Amazon FreeRTOS and AWS IoT Greengrass to process the images, perform inference using the computer vision module, and communicate with the cloud. The edge computers are WiFi enabled and communicate to a local router that contains the local network of edge computers. In the cloud, AWS IoT Core and Lambda functions process incoming messages and relay them to the clients. The client is a mobile app that requires authentication with the cloud to consume and send messages and requests. These messages are handled with Representational State Transfer (REST) application program interface (API) calls to populate the mobile app and notify law enforcement.
4.2 Cloud Services
In order for the local board hosting the cameras to communicate with the app, a cloud component is used to handle requests. The local board hosting the camera is WiFi enabled in order to scale to multiple cameras in the same network. If the computer vision module detects criminal activity, the corresponding camera ID, GPS location, and clip of the incident are sent to the cloud. From there, the message is relayed to law enforcement through the mobile app.
Most of the computation for the inference model and triggers is handled by the GPU or processor on the local board hosting the camera, allowing the cloud to simply parse the results and relay messages.
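The incident message described above could be serialized along the following lines. This is a sketch only: the class name, field names, the JSON encoding, the clip path, and the coordinates are all hypothetical choices made for illustration, since the text specifies the contents of the message (camera ID, GPS location, clip) but not its format.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class IncidentMessage:
    camera_id: str    # identifier of the camera that triggered the detection
    latitude: float   # GPS location of that camera
    longitude: float
    clip_uri: str     # hypothetical pointer to the uploaded incident clip


def encode_incident(message):
    """Serialize the incident as a JSON string for publishing to the cloud."""
    return json.dumps(asdict(message), sort_keys=True)


def decode_incident(payload):
    """Rebuild the incident from the JSON string on the receiving side."""
    return IncidentMessage(**json.loads(payload))


# Illustrative values only.
example = IncidentMessage("cam-07", 42.3398, -71.0892, "clips/cam-07/incident.mp4")
payload = encode_incident(example)
restored = decode_incident(payload)
```

Keeping the payload this small is consistent with the division of labor in the text: the edge board does the heavy computation, and the cloud only parses and relays the result.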
5 User Interface
The mobile app (Fig. 5) allows local authorities to monitor and control the LENS system in real-time via a cellular device. There are two levels of user privileges to the mobile app: local authority level and civilian level. The local authority privileges allow features such as requesting video-feed from cameras, and sending out push notifications/text messages to civilians. Civilians still have access to the crime log, but cannot access the associated video feed (Fig. 5). Notifications are pushed to a local authority’s device with metadata such as location of crime in relation to location of cameras, time of incident, and type of crime. The local authorities may then request the video feed of the crime that occurred (Fig. 1) by navigating to the crime-log in the mobile app (Fig. 5). The local authority may send a push notification/text message to civilians with the civilian-mobile app.
An important feature of the mobile app is the confidence threshold slider (Fig. 5). This slider controls the precision and recall trade-off of crime detection. High precision with low recall means that the local authorities receive only a few confident crime detections (low false-positive rate) and therefore fewer notifications. Low precision with high recall means that the local authorities receive many notifications, a larger share of which are uncertain crime detections (high false-positive rate). This trade-off is left to the local authorities' discretion and gives them more control.
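The effect of moving the slider can be sketched with a small worked example. The confidence scores and ground-truth labels below are invented for illustration, and the function name is hypothetical; the code simply counts true positives, false positives, and false negatives at a given threshold.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when clips scoring >= threshold trigger a notification.

    scores: model confidence that a clip contains a crime.
    labels: True if a crime actually occurred in the clip.
    """
    tp = fp = fn = 0
    for score, is_crime in zip(scores, labels):
        flagged = score >= threshold
        if flagged and is_crime:
            tp += 1
        elif flagged and not is_crime:
            fp += 1
        elif not flagged and is_crime:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Invented scores and labels for six clips.
scores = [0.95, 0.80, 0.65, 0.55, 0.40, 0.30]
labels = [True, True, False, True, False, True]

strict = precision_recall(scores, labels, 0.75)   # 2 notifications, both correct
lenient = precision_recall(scores, labels, 0.35)  # 5 notifications, 3 correct
```

With the strict threshold, precision is 1.0 and recall is 0.5; with the lenient threshold, precision falls to 0.6 while recall rises to 0.75, which is the trade-off the slider exposes.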
The app is created using React Native and Node.js for easy deployment across iOS, Android, and the web, as well as simple integration with AWS.
6.1 LENS-4 Dataset
Although there are numerous publicly available computer vision datasets, fewer than 2% of them feature low-light data [DBLP:journals/corr/abs-1805-11227]. On top of this, our application requires crime data captured in low light. Thus, we create our own dataset, LENS-4, for crime detection in low light. The LENS-4 dataset, described in detail in Table 1, has the same structure as the UCF-101 dataset, and its low-light videos are recorded with an ELP Sony IMX322. All clips have a fixed frame rate of 30 FPS and a resolution of 640x480 or 1920x1080.
|Groups per Action||20|
|Clips per Group||45|
|Total Duration||244 mins|
|Min Clip Length||3 sec|
|Max Clip Length||4 sec|
The ELP Sony IMX322 has both high SNR and high dynamic range capabilities. The crop CMOS IMX322 is rated to perform in 0.01 lux minimum illumination, which satisfies low-light requirements for collecting data. We preferred the ELP Sony IMX322 over an infrared (IR) camera or a thermal camera for several reasons. Thermal cameras have two disadvantages: a halo effect on objects with high temperature, and temperature similarities between background and objects. Furthermore, IR and thermal cameras are unable to differentiate objects because they rely on a heat signature [batchuluun2019action]. Finally, the ELP Sony IMX322 is functional in daytime as well as nighttime, allowing LENS to operate 24/7, whereas IR and thermal cameras, although they may deliver brighter images in complete darkness, lose pixel-level information and details.
6.2 Training Setup
The temporal and spatial stream ResNet networks are initialized with weights pre-trained on the ImageNet dataset [deng2009imagenet]. The first convolutional layer weights for the temporal stream are initialized with cross-modality weight initialization [DBLP:journals/corr/temporal_networks].
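Cross-modality weight initialization, as described in the TSN work, averages the pre-trained first-layer weights across the RGB channels and replicates that average over the input channels of the temporal network. The sketch below does this for a single output filter using plain nested lists; the 2x2 kernel, its values, and the choice of 20 flow channels are hypothetical, since the stack length used in LENS is not stated here.

```python
def cross_modality_init(rgb_filter, flow_channels):
    """Adapt one pre-trained first-layer filter from RGB input to flow input.

    rgb_filter: weights shaped [3][kh][kw] for a single output filter.
    Returns weights shaped [flow_channels][kh][kw], where every channel holds
    the average of the RGB channels.
    """
    n_rgb = len(rgb_filter)
    kh = len(rgb_filter[0])
    kw = len(rgb_filter[0][0])
    mean = [
        [sum(rgb_filter[c][i][j] for c in range(n_rgb)) / n_rgb for j in range(kw)]
        for i in range(kh)
    ]
    # Copy the averaged kernel once per flow input channel.
    return [[row[:] for row in mean] for _ in range(flow_channels)]


# Hypothetical 2x2 filter with distinct R, G and B weights.
rgb = [
    [[1.0, 0.0], [0.0, 1.0]],  # R
    [[2.0, 0.0], [0.0, 2.0]],  # G
    [[3.0, 0.0], [0.0, 3.0]],  # B
]
flow_filter = cross_modality_init(rgb, flow_channels=20)  # 20 is an assumed stack depth
```

The same operation would be applied to every output filter of the first convolutional layer, so that the temporal stream starts from ImageNet-derived weights despite its different input depth.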
We employ data augmentation, batch normalization and dropout on several layers of the CNN architectures as regularization to prevent overfitting. Several data augmentation techniques including random cropping, resizing, channel normalization, and scale/aspect jittering are employed during training and testing.
The spatial CNN parameters are optimized using stochastic gradient descent with momentum set to 0.9. (Code and the LENS-4 dataset are publicly available at https://github.com/mcgridles/LENS.) The initial learning rate is set to 5 and decayed over multiple epochs using a Plateau learning rate scheduler with patience set to 1. For every video in a mini-batch of 64 videos, 3 frames are randomly sampled within equally spaced temporal windows and a consensus of the frame predictions provides a video-level prediction for calculating the loss.
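The frame sampling and consensus step can be sketched as follows. The function names are hypothetical, and averaging is assumed as the consensus function (as in TSN); the text does not state which consensus is used. The 90-frame clip corresponds to a 3-second clip at 30 FPS.

```python
import random


def sample_frame_indices(num_frames, num_segments=3, rng=random):
    """Pick one random frame index from each of num_segments equal windows."""
    if num_frames < num_segments:
        raise ValueError("clip is shorter than the number of segments")
    seg_len = num_frames / num_segments
    indices = []
    for k in range(num_segments):
        start = int(k * seg_len)
        end = max(start + 1, int((k + 1) * seg_len))  # exclusive upper bound
        indices.append(rng.randrange(start, end))
    return indices


def video_consensus(frame_scores):
    """Average per-frame class scores into one video-level score vector."""
    n = len(frame_scores)
    return [sum(column) / n for column in zip(*frame_scores)]


# A 3-second clip at 30 FPS has 90 frames; sample one frame per 30-frame window.
indices = sample_frame_indices(90, num_segments=3, rng=random.Random(0))

# Hypothetical two-class scores for the three sampled frames.
video_scores = video_consensus([[0.6, 0.4], [0.8, 0.2], [0.4, 0.6]])
```

Sampling one frame per window keeps the cost per video fixed while still covering the start, middle, and end of the clip.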
The temporal CNN parameters are also optimized using stochastic gradient descent with 0.9 momentum. The initial learning rate is set to 1 and decayed over epochs using a Plateau learning rate scheduler with patience set to 3. For every mini-batch, 64 videos are randomly selected and 1 stacked optical flow is randomly sampled in each video.
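The role of the patience setting in a plateau scheduler can be illustrated with a small stand-alone sketch. The class below is hypothetical and is not the LENS training code; it assumes that validation loss is the monitored quantity and that the learning rate is multiplied by 0.1 on each decay, neither of which is specified in the text. The starting rate of 1.0 and the loss values are placeholders.

```python
class PlateauScheduler:
    """Reduce the learning rate when the monitored loss stops improving."""

    def __init__(self, initial_lr, patience, factor=0.1):
        self.lr = initial_lr
        self.patience = patience  # epochs without improvement that are tolerated
        self.factor = factor      # assumed multiplicative decay
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr


# Placeholder validation losses that stop improving after the second epoch.
losses = [1.00, 0.90, 0.95, 0.96, 0.97]

impatient = PlateauScheduler(initial_lr=1.0, patience=1)
lrs_patience_1 = [impatient.step(loss) for loss in losses]

tolerant = PlateauScheduler(initial_lr=1.0, patience=3)
lrs_patience_3 = [tolerant.step(loss) for loss in losses]
```

With patience 1 the rate drops after the second consecutive epoch without improvement, whereas with patience 3 it is unchanged over the same five epochs. PyTorch's `torch.optim.lr_scheduler.ReduceLROnPlateau` exposes comparable `patience` and `factor` parameters.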
The temporal and spatial streams are initially trained on the UCF-101 dataset, with pre-computed TV-L1 optical flow. We improve the two-stream action recognition pipeline accuracy on split 1 of UCF-101 compared to [DBLP:journals/corr/two_stream] (Table 2). Finally, the spatial and temporal streams are fine-tuned using transfer learning on the LENS-4 dataset.
|Spatial Acc.||Temporal Acc.||Two Stream Acc.|
The SVM is trained using 5-fold cross validation, and hyper-parameters are selected by randomized grid search.
All models are trained on Google Cloud, as large memory and computation capabilities are required to train deep neural networks. A virtual machine (VM) is deployed using the Deep Learning VM image in the Google Cloud Marketplace on the Google Cloud Compute Engine, where the PyTorch, FastAI, and NVIDIA configuration settings are applied. Because the training datasets occupy over 100 gigabytes (GB), a 500 GB zonal persistent disk is added and mounted to the VM for dataset storage. The GPU used is an NVIDIA Tesla P100.
6.3 Action Recognition
For the spatial stream, we achieve 65.5% accuracy after fine-tuning on the LENS-4 dataset. A similar process for the temporal stream achieves 70.3% accuracy, confirming that motion plays an inherent role for action recognition. The last step in the inference pipeline, the SVM, combines the predictions from each of the two streams for final crime inference, where the nonlinear predictor improves the accuracy of crime detection up to 71.5% (table 3).
|Train Acc.||Val. Acc.||Test Acc.|
|Two Stream + SVM||-||-||71.5|
An example of LENS crime detection inference (an assault at dusk) is shown in Fig. 7.
The RGB confusion matrix (Fig. 6a) indicates that the spatial stream cannot distinguish between assault and theft. The two are hard to distinguish because they are not mutually exclusive classes; rather, assault may be seen as a subclass of theft. Even state-of-the-art computer vision tasks such as object detection [DBLP:journals/corr/RedmonF16] are reported to have difficulty distinguishing non-mutually exclusive classes. The temporal stream resolves the spatial stream's difficulty in distinguishing between assault and theft, as the motion that occurs for the two actions is typically different (for example, when a thief tries to steal a bag, both actors grab the bag and pull in a seesaw manner). However, the temporal stream performs worse than the spatial stream on shooting detection: the optical flow (OPF) confusion matrix indicates that the temporal network confuses shooting with assault (Fig. 6b). Again, shooting and assault are not mutually exclusive classes, and the motion associated with shooting shares similarities with the motion of assault. Because the temporal stream only learns from motion, it does not learn what the gun looks like, only the motion associated with movements of an actor holding a gun. The SVM confusion matrix (Fig. 6c) shows that the deficiencies of each individual stream are compensated by the SVM fusion (Table 4).
Overall, we achieve an accuracy of 71.5% for crime detection in near-pitch-black environments at a frame rate of 19 FPS (Section 6.4). This is a breakthrough, as crime detection in low-light environments with real-time inference capability has not yet been explored.
6.4 Frame Rate
For total inference time, we achieve approximately 10 FPS running on the NVIDIA Tesla P100 in the cloud. Skipping frames improves the FPS without loss of accuracy, since the motion captured between two sequential frames when recording at 30 FPS is largely redundant. As many as three or four frames can be skipped before performance starts to suffer, which would amount to a 3-4x increase in speed. On the NVIDIA Jetson TX2, we achieve only approximately 5 FPS due to the decreased processing power. However, we improve this in two ways. The first is by skipping frames, as mentioned previously. The second is by streaming from the Jetson to the cloud and performing the inference there. While this does not run as fast as native video on the cloud, it still improves the inference time and would be worth developing further. By dropping every other frame, we achieve 20 FPS (Table 3).
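The frame-skipping arithmetic can be sketched as below. This is an idealized estimate that ignores decoding and transfer overhead, and the function names are hypothetical: if the pipeline processes about 10 frames per second and only every second captured frame is processed, the video is consumed at roughly twice that rate.

```python
def keep_every_kth(frames, stride):
    """Keep one frame out of every `stride` captured frames."""
    if stride < 1:
        raise ValueError("stride must be at least 1")
    return frames[::stride]


def effective_fps(processing_fps, stride):
    """Idealized rate at which captured video is consumed when skipping frames."""
    return processing_fps * stride


one_second = list(range(30))            # frame indices for one second at 30 FPS
kept = keep_every_kth(one_second, 2)    # drop every other frame
estimated_fps = effective_fps(10, 2)    # about 10 FPS of processing, stride 2
```

Larger strides raise the effective rate further, up to the point where too much motion is lost between the frames that remain.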
7 Conclusions and Future Work
We introduce LENS, a complete and modular system that serves as a tool for law enforcement to proactively combat criminal activity in low-light areas. We have shown that LENS runs in real-time, operates in a low-light environment on a single camera (unlike many current architectures), and is scalable with the cloud infrastructure. LENS allows police to be immediately notified of potential criminal activity detected, and does so with a high degree of accuracy.
Future improvements to this project, aimed at a more robust, faster, and more accurate system, include: 1. collecting a larger and more diverse dataset than LENS-4, 2. using Active Learning in the LENS-human feedback loop, and 3. developing a shallower LENS architecture.
The number and diversity of actors in LENS-4 are extremely small, which could lead to over-fitted models if regularization techniques are not carefully incorporated. Actors and camera operators should be hired to record a large volume of clips, in varying scenery, with actors performing fake shootings, thefts, and assaults.
Active learning is a subfield of machine learning in which the algorithm may interactively query the user to obtain labels for new data points. With active learning, local authorities would receive clips of the detected crime, and if a clip is a false positive or carries an incorrect crime label, the user could correct the label, after which LENS could be retrained in an online manner with the new information.
Finally, we will look at developing a shallower LENS architecture for faster inference time without reducing the accuracy of the model.