Robust and Computationally-Efficient Anomaly Detection using Powers-of-Two Networks

10/30/2019, by Usama Muneeb et al.

Robust and computationally efficient anomaly detection in videos is an important problem in video surveillance systems. We propose a technique to increase robustness and reduce computational complexity in a Convolutional Neural Network (CNN) based anomaly detector that utilizes the optical flow information of video data. We reduce the complexity of the network by denoising the intermediate layer outputs of the CNN and by using powers-of-two weights, which replace the computationally expensive multiplication operations with bit-shift operations. The denoising operation during inference forces small-valued intermediate layer outputs to zero. Since the number of zeros in the network significantly increases as a result of denoising, we can implement the CNN about 10% to 25% faster than a comparable network while detecting all the anomalies in the testing set. The denoising operation also provides robustness, because the contribution of small intermediate values to the final result is negligible. During training, we also generate motion vector images with a Generative Adversarial Network (GAN) to improve the robustness of the overall system. We experimentally observe that the resulting system is robust to background motion.


1 Introduction

Generative Adversarial Networks (GANs) [1] are networks trained via an adversarial process, primarily to capture the data distribution and to generate artificial data that look like real objects. In recent years, GANs have also been used to perform anomaly detection [2, 3].

Owing to the huge computational complexity of GAN-based anomaly detection techniques, this paper explores the potential of using a powers-of-two [4, 5] Convolutional Neural Network (CNN) for anomaly detection with no computational overhead during inference. We use the GAN structure only during training, to generate motion vector images for the computationally efficient CNN with powers-of-two coefficients. Because of the special structure of the filter coefficients, the convolutional filters can be implemented using only bit-shift operations, without performing any multiplications.
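To illustrate the idea: once a weight is constrained to $w = \pm 2^k$ with integer $k$, multiplying a fixed-point activation by $w$ reduces to a single bit shift. Below is a minimal sketch; the function name and the fixed-point (integer) activation assumption are ours, not the paper's.

```python
def pow2_multiply(x, sign, k):
    """Multiply an integer activation x by a power-of-two weight
    w = sign * 2**k using only a bit shift: k >= 0 shifts left,
    k < 0 shifts right. No multiplication is performed."""
    shifted = x << k if k >= 0 else x >> -k
    return sign * shifted

# Example: w = -0.25 is stored as (sign=-1, k=-2); activation x = 96.
print(pow2_multiply(96, -1, -2))  # -24, i.e. 96 * (-0.25)
```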

Our CNN structure is more efficient than other power-of-two networks and GAN-improved CNNs because (i) we not only replace multiplications with bit-shift operations but also (ii) prune the intermediate outputs of the network layers by denoising. As the number of zeros significantly increases as a result of denoising, we can implement the CNN about 30% faster than a comparable power-of-two CNN during inference. While the primary motive is to reduce computational complexity, denoising also achieves robustness against background motion due to wind, leaves, etc., because the contributions of small-valued intermediate outputs to the final result are eliminated. It should be noted that denoising the network layers is different from filter weight pruning, which is often used in deep neural networks: instead of the weights, we prune the outputs of the intermediate layers in this paper.

We project the intermediate layer output vectors onto $\ell_1$-balls to achieve denoising. This is equivalent to soft-thresholding, which does not add any significant computational load. Motion vectors due to background objects are very small and should not contribute to the anomaly detection process; intermediate layer denoising removes the effects of background objects and their motion from the final result. This approach is similar to wavelet denoising [6], in which band-pass and high-pass filtered data are soft-thresholded in the wavelet domain. In this paper, we soft-threshold the intermediate layer outputs obtained after convolutional filtering and nonlinear activation.

The paper is organized as follows. In Section 2, we describe the motion vector based anomaly detection network. In Section 3, we describe the $\ell_1$-ball based denoising that we use to denoise the intermediate layer outputs of the CNN. In Section 4, we present the simulation results. In Section 5, we conclude the paper.

2 Motion Vector Based Anomaly Detection

Most anomaly detection methods are based on motion information and use hand-crafted features to model normal-activity patterns [7, 8, 9, 10, 11, 12, 13, 14, 15, 16]. On the other hand, our method uses the entire set of motion vectors obtained from the video, as in [17], [18], [19]. We find the optical flow domain to be a reasonable representation for training and evaluating our model: for example, suddenly stopping or fast moving objects have larger motion vectors than objects performing usual activities.

Figure 1: An image from the UCSD Pedestrian database (left) and its Farneback optical flow (magnitudes) computed using the neighboring image frame.

In Figure 1, an image (left) and the corresponding optical flow image (right) from the UCSD Pedestrians database [20] are shown. We use the Farneback method [21] to compute the optical flow; the image on the right shows only the magnitudes of the motion vectors. The vehicle and the biker are fast moving objects in the scene (left) and therefore appear as large, bright blobs; they are considered anomalous objects in the scene. Pedestrians, on the other hand, have relatively small motion vectors, which translate to darker blobs in the optical flow image. In general, the magnitudes of the motion vectors are larger for trucks, bicycles and skateboarders than for the relatively slow moving pedestrians.
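A magnitude image like the one in Figure 1 can be computed with OpenCV's Farneback implementation. The frame file names and the Farneback parameter values below are illustrative assumptions, not the paper's settings:

```python
import cv2
import numpy as np

# Two consecutive grayscale frames (hypothetical file names).
prev = cv2.imread("frame_000.png", cv2.IMREAD_GRAYSCALE)
curr = cv2.imread("frame_001.png", cv2.IMREAD_GRAYSCALE)

# Dense Farneback optical flow between the two frames.
flow = cv2.calcOpticalFlowFarneback(
    prev, curr, None,
    pyr_scale=0.5, levels=3, winsize=15,
    iterations=3, poly_n=5, poly_sigma=1.2, flags=0)

# Keep only the motion-vector magnitudes; brighter pixels correspond
# to faster objects (e.g. bikers, vehicles).
mag, _ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
magnitude_image = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
```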

Background objects (e.g., leaves, grass) and camera sway due to wind add noise to the optical flow of the video. One of our motives is to make the detector resilient to such noise. The challenge is the lack of noisy training data; to this end, we use a GAN to generate data that compensates for this lack.

To improve the accuracy of the CNN, we train a GAN on optical flows and use it to generate blurry optical flow images on which the CNN is further trained. A GAN-generated optical flow image is shown in Figure 2. Our hypothesis is that further training the CNN on the blurry GAN-generated data will make it resilient to noise, and hence make our system resilient to real-life disturbances (i.e., camera sway and background object movements).

Figure 2: GAN generated optical flow image (magnitude)

We assume that we do not know the actual noise model of the optical flow images, and we do not use any noisy optical flow images in training. Hence the GAN is a valuable tool for producing the “noisy” data used in training.

3 Denoising the Intermediate Output Layers During Inference

The $\ell_1$ norm has been used in training both GANs and CNNs in the past. For example, it has been observed that it is beneficial to combine the GAN objective function with a more traditional loss, such as the $\ell_1$ distance [22, 23]. In the pix2pix system ([23, 2]), a linear combination of the GAN cost function and an $\ell_1$-norm based cost function is used:

$$G^\ast = \arg\min_G \max_D \; \mathcal{L}_{cGAN}(G, D) + \lambda \mathcal{L}_{\ell_1}(G), \qquad (1)$$

where $y$ is the output image, $x$ is the observed image and $z$ is a noise vector; $G$ and $D$ represent the generator and the discriminator networks, respectively; $\lambda$ is the mixture parameter; and

$$\mathcal{L}_{\ell_1}(G) = \mathbb{E}_{x,y,z}\left[\, \| y - G(x, z) \|_1 \,\right]. \qquad (2)$$

$\ell_1$-based cost functions make the resulting solution sharper and sparser than Euclidean objective functions [24, 25, 26, 27, 28].
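A minimal PyTorch sketch of this mixed objective follows. Here `G`, `D`, the noise input `z` and the value $\lambda = 100$ are placeholders and assumptions for illustration, not the paper's configuration:

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()  # adversarial term of L_cGAN
l1 = nn.L1Loss()              # the l1 distance of Eq. (2)
lam = 100.0                   # mixture parameter lambda (assumed value)

def generator_loss(D, G, x, y, z):
    """Eq. (1) seen from the generator: adversarial loss plus
    lambda * E[||y - G(x, z)||_1]."""
    fake = G(x, z)                          # generated image G(x, z)
    pred = D(x, fake)                       # discriminator logits on the fake
    adv = bce(pred, torch.ones_like(pred))  # generator tries to fool D
    return adv + lam * l1(fake, y)          # L_cGAN term + lambda * L_l1 term
```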

Our approach differs from previous work [29] in that we project the convolutional layer outputs onto $\ell_1$-balls (Figure 3) during inference. The projection “denoises” the layer outputs by eliminating small-valued coefficients. This not only leads to a computationally efficient system, by zeroing a significant portion (around 30%) of the small-valued output values, but also to a robust system, because the contribution of small-valued intermediate values due to background motion to the final result is reduced.

Figure 3: Geometric illustration of projections onto an $\ell_1$-ball

Let $\mathbf{x}_k$ be the vector obtained after the $k$-th convolutional layer (filtering) and nonlinearity (activation function). We project $\mathbf{x}_k$ onto an $\ell_1$-ball of size $s_k$,

$$B_1(s_k) = \{\mathbf{x} : \|\mathbf{x}\|_1 \le s_k\}, \qquad (3)$$

and obtain

$$\hat{\mathbf{x}}_k = P_{B_1(s_k)}(\mathbf{x}_k), \qquad (4)$$

where $P_{B_1(s_k)}$ represents the orthogonal projection operator onto the $\ell_1$-ball. The size of the ball, $s_k$, may vary depending on the layer and the individual convolutional filter. Projection onto the $\ell_1$-ball can be implemented in a computationally efficient manner [28]. It is also possible to estimate the size of the $\ell_1$-ball in an adaptive manner using the epigraph set of the $\ell_1$-ball [27, 30], but this may increase the computational cost. We instead estimate $s_k$ during training for each layer or individual convolutional filter.
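A minimal NumPy sketch of the sorting-based projection of Duchi et al. [28] is given below (the function name is ours). Note that its last line is exactly the soft-thresholding operation discussed next, with a data-dependent threshold $\theta$:

```python
import numpy as np

def project_l1_ball(v, s):
    """Euclidean projection of v onto the l1-ball of radius s,
    following the sorting-based algorithm of Duchi et al."""
    if np.abs(v).sum() <= s:
        return v.copy()                    # already inside the ball
    u = np.sort(np.abs(v))[::-1]           # magnitudes sorted in descending order
    css = np.cumsum(u)
    j = np.arange(1, len(u) + 1)
    rho = np.nonzero(u * j > css - s)[0][-1]   # largest index with positive gap
    theta = (css[rho] - s) / (rho + 1.0)       # data-dependent threshold
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)  # soft threshold

print(project_l1_ball(np.array([3.0, 1.0]), 2.0))  # [2. 0.]
```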

Projection onto the $\ell_1$-ball is essentially equivalent to soft thresholding [31], [6], because the last step of the projection algorithm [28] is the soft-thresholding operation

$$\hat{x}_k(i) = \max(x_k(i) - \theta,\, 0) \quad \text{for } x_k(i) \ge 0, \qquad (5)$$

with

$$\hat{x}_k(i) = \min(x_k(i) + \theta,\, 0) \quad \text{for } x_k(i) < 0, \qquad (6)$$

where $x_k(i)$ and $\hat{x}_k(i)$ are the $i$-th entries of the vectors $\mathbf{x}_k$ and $\hat{\mathbf{x}}_k$, respectively, and $\theta$ is a parameter related to the size of the $\ell_1$-ball. In Fig. 3 the projection onto the $\ell_1$-ball is illustrated in 2D: the parameter $\theta$ is subtracted from all the components of $\mathbf{x}_k$ to obtain $\hat{\mathbf{x}}_k$; in the second case, the first component of $\mathbf{x}_k$ becomes 0 after the projection. In Donoho and Johnstone's wavelet denoising algorithm, the input vector goes through a wavelet filter and is then soft- or hard-thresholded. Our idea of projecting the convolutional layer outputs is therefore similar to adaptive wavelet denoising, in the sense that the input goes through the filters of the neural network and is then soft-thresholded. Deep neural networks contain an additional nonlinearity such as ReLU or Leaky ReLU, which also provides some denoising, but we observed that this is not sufficient for zeroing out small-valued convolutional filter outputs. This may be due to a problem in training deep neural networks, whose cost functions (and those of GANs) have many local optima. Soft-thresholding after the nonlinear Leaky ReLU operations leads to about 30% computational savings during inference without losing any recognition accuracy. As pointed out above, we estimate the parameter $\theta$ during the training of the neural network. Obviously, if the nonlinearity is the ReLU, then we do not need Eq. (6), because all the elements are non-negative.
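For concreteness, a NumPy sketch of Eqs. (5)–(6) applied to a layer-output vector is shown below (function names are ours); after a ReLU, only the non-negative branch of Eq. (5) is needed:

```python
import numpy as np

def soft_threshold(x, theta):
    """Eqs. (5)-(6) combined: shrink every entry toward zero by theta;
    entries with |x_i| < theta become exactly zero."""
    return np.sign(x) * np.maximum(np.abs(x) - theta, 0.0)

def soft_threshold_relu(x, theta):
    """After a ReLU all entries are non-negative, so the negative
    branch (Eq. 6) is unnecessary."""
    return np.maximum(x - theta, 0.0)

# Example: small background-motion responses are zeroed out.
x = np.array([0.004, -0.02, 0.15, 0.0005])
print(soft_threshold(x, 0.01))  # [ 0.    -0.01   0.14   0.   ]
```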

4 Experimental Results

We trained both a Deep Convolutional Generative Adversarial Network (DCGAN) [32] and a powers-of-two (Pow2) CNN using motion vectors from the UCSD Pedestrians datasets. The training process is described below:

GAN Training: The topology of a DCGAN is summarized as follows:

- Strided convolutions are performed in the discriminator network and fractionally strided convolutions are carried out in the generator network of the GAN.


- Discriminator: LeakyReLU activations in all layers.
- Generator: ReLU activation for all layers except the last one, which uses Tanh function.

- Batch Normalization in some layers of both the generator and the discriminator.
- Unlike the original DCGAN, we also add two fully connected hidden layers to the end of the discriminator (an illustrative sketch of this topology is given after the next paragraph).
The GAN is trained on the optical flows computed from the entire training dataset. The UCSD Pedestrians datasets (both 1 and 2) contain only normal data in their training sets. Once trained, our GAN can generate optical flow images such as the one shown in Figure 2. After training, we use only the generator block to improve the CNN training.
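The following PyTorch sketch reflects the topology listed above. The 64×64 single-channel input size, the channel widths and the 100-dimensional latent vector are illustrative assumptions; the paper does not report these values:

```python
import torch.nn as nn

# Discriminator: strided convolutions, LeakyReLU in all layers, Batch
# Normalization in some layers, and two fully connected layers at the end.
discriminator = nn.Sequential(
    nn.Conv2d(1, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(64, 128, 4, stride=2, padding=1),
    nn.BatchNorm2d(128), nn.LeakyReLU(0.2),
    nn.Conv2d(128, 256, 4, stride=2, padding=1),
    nn.BatchNorm2d(256), nn.LeakyReLU(0.2),
    nn.Flatten(),
    nn.Linear(256 * 8 * 8, 128), nn.LeakyReLU(0.2),  # the two extra
    nn.Linear(128, 1),                               # fully connected layers
)

# Generator: fractionally strided (transposed) convolutions, ReLU in all
# layers except the last one, which uses Tanh.
generator = nn.Sequential(
    nn.ConvTranspose2d(100, 256, 8), nn.BatchNorm2d(256), nn.ReLU(),
    nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1),
    nn.BatchNorm2d(128), nn.ReLU(),
    nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1),
    nn.BatchNorm2d(64), nn.ReLU(),
    nn.ConvTranspose2d(64, 1, 4, stride=2, padding=1), nn.Tanh(),
)
```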

The CNN has the same topology as the discriminator block of the DCGAN. To obtain Pow2 filter coefficients, we quantize the convolutional filter coefficients to powers of two during each training epoch. The CNN is also trained using the augmented data. Since we do not have any anomalous training motion images and the CNN requires supervised training, we generate anomalous motion vector images artificially. This is done by creating artificial optical flow images with a random bright blob denoting an anomaly, and feeding them to the intermediate layers of the generator to obtain a set of optical flow images. Additionally, we use a mini-batch size of 64 and train until the discriminator loss ($D_{loss}$) is below 0.01. Training is stopped soon after the loss falls below this value to avoid overfitting. This model is saved for evaluation as a baseline [33].
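A sketch of the per-epoch weight quantization step is given below: each weight is rounded to the nearest signed power of two in the log domain. The helper name and the exponent range are our assumptions; the paper does not specify them:

```python
import numpy as np

def quantize_pow2(w, k_min=-7, k_max=0):
    """Quantize each weight to a signed power of two, sign(w) * 2**k,
    by rounding the base-2 logarithm of its magnitude to the nearest
    integer (zero weights stay zero)."""
    out = np.zeros_like(w, dtype=float)
    nz = w != 0
    k = np.clip(np.round(np.log2(np.abs(w[nz]))), k_min, k_max)
    out[nz] = np.sign(w[nz]) * np.exp2(k)
    return out

# Applied to every convolutional filter after each training epoch, e.g.
# (hypothetical training loop):
#   for layer in conv_layers:
#       layer.weights[...] = quantize_pow2(layer.weights)
print(quantize_pow2(np.array([0.3, -0.06, 0.011])))
# [ 0.25      -0.0625     0.0078125]
```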

It should be pointed out that, while the convolutional filter coefficients of the CNN are constrained to Pow2, the GAN was trained without this constraint, because the GAN is not part of the final inference model; only the GAN generator output is used during the training of the CNN. In our case, the CNN is trained with 3456 normal and 3456 anomalous optical flow images, which are generated using the generator network of the GAN. We use the binary cross-entropy loss (also known as log loss) as the cost function. We stop the training when the discriminator loss ($D_{loss}$) and the generator loss ($G_{loss}$) both fall below 0.01.

Robustness to background motion: We first compare our GAN-improved CNN with a baseline CNN at various noise levels. Figure 5 shows the ROC curves corresponding to 9 different cases. Background motion is modelled by adding random Gaussian blobs onto the actual motion vector (mv) images; the Peds1 and Peds2 datasets contain no background motion due to leaves etc., which is why we add artificial motion vectors onto the optical flow images. The number of blobs varies between 5 and 20, and their spatial variance is 16 squared pixels. An example frame overlaid with Gaussian blobs is shown in Figure 4.
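This noise model can be sketched as follows. The blob count and the 16 squared-pixel variance follow the text; the unit blob amplitude and the uniform placement of blob centers are our assumptions:

```python
import numpy as np

def add_gaussian_blobs(flow_mag, n_blobs, var=16.0, amp=1.0, rng=None):
    """Overlay n_blobs isotropic Gaussian blobs (spatial variance `var`,
    in squared pixels) on an optical-flow magnitude image to simulate
    background motion."""
    rng = rng or np.random.default_rng()
    h, w = flow_mag.shape
    yy, xx = np.mgrid[0:h, 0:w]
    noisy = flow_mag.astype(float).copy()
    for _ in range(n_blobs):
        cy, cx = rng.integers(0, h), rng.integers(0, w)  # random blob center
        noisy += amp * np.exp(-((yy - cy) ** 2 + (xx - cx) ** 2) / (2 * var))
    return noisy

noisy = add_gaussian_blobs(np.zeros((64, 64)), n_blobs=10)  # as in Figure 4
```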

Figure 4: An example optical flow of a frame overlaid with 10 Gaussian blobs to simulate background motion.
Figure 5: Comparison of a GAN-trained CNN, a GAN-trained Pow2 CNN and a baseline CNN at different noise levels. The GAN-trained CNN (dots) outperforms the baseline CNN (other lines). The GAN-trained Pow2 CNN closely follows both in performance while greatly reducing the number of multiplications.

Due to the lack of space we plotted all the ROC curves on top of each other, but the ROC curves of the GAN-improved CNNs (curves with big dots) have more area under them: curves with green (blue) [red] dots have more area than the other green (blue) [red] curves. In Tables 1–4 we present the Area Under the Curve (AUC) values and the percentage savings for different sets of both soft and hard denoising thresholds. All of the networks are also trained with artificial motion vector images generated by a GAN. Denoising with hard thresholds 0.009 and 0.01 in the first and second layers, respectively, improves the AUC of the GAN-trained CNN while providing more than 9% computational savings on the Peds1 dataset, as shown in Table 1. In the Pow2 network, denoising with soft thresholds 0.009 and 0.01 leads to almost 20% computational savings with a better AUC, as shown in Table 2; this network is also more robust to noise than all the other networks. Soft thresholds of 0.003 and 0.02 provide even higher AUC values, but the computational savings are on the order of 7% for this network. As shown in Tables 1–4, when there is no noise the computational savings can reach almost 30% while all of the abnormal events (trucks, skateboarders and bikers) in the Peds1 and Peds2 datasets are detected. We also observe that layer denoising has a bigger advantage with Pow2 networks than with regular networks.

Network Threshold(s) (Layer1, Layer2) | Noise = 0: AUC, Saving | Noise = 5 blobs: AUC, Saving | Noise = 10 blobs: AUC, Saving | Noise = 20 blobs: AUC, Saving
Regular N/A 0.744 0.00% 0.727 0.00% 0.698 0.00% 0.641 0.00%
Denoising (0.001, None) 0.744 0.06% 0.727 0.13% 0.698 0.20% 0.641 0.33%
Denoising (0.001, 0.01) 0.748 3.15% 0.727 3.24% 0.696 3.33% 0.635 3.50%
Denoising (0.002, None) 0.744 2.05% 0.727 2.10% 0.697 2.16% 0.641 2.27%
Denoising (0.002, 0.01) 0.748 5.13% 0.727 5.21% 0.695 5.28% 0.635 5.44%
Denoising (0.003, None) 0.744 3.96% 0.727 3.97% 0.697 3.99% 0.641 4.03%
Denoising (0.003, 0.01) 0.748 7.05% 0.727 7.08% 0.695 7.11% 0.635 7.20%
Denoising (0.005, None) 0.744 4.17% 0.727 4.25% 0.697 4.33% 0.640 4.50%
Denoising (0.005, 0.01) 0.748 7.26% 0.726 7.36% 0.695 7.46% 0.634 7.67%
Denoising (0.007, None) 0.745 4.30% 0.727 4.44% 0.697 4.58% 0.639 4.86%
Denoising (0.007, 0.01) 0.748 7.39% 0.727 7.55% 0.695 7.71% 0.633 8.03%
Denoising (0.009, None) 0.744 6.19% 0.726 6.29% 0.696 6.39% 0.639 6.61%
Denoising (0.009, 0.01) 0.748 9.28% 0.726 9.40% 0.694 9.53% 0.633 9.79%
Denoising (0.009, 0.02) 0.749 10.04% 0.712 10.19% 0.673 10.34% 0.612 10.66%
Denoising (0.009, 0.03) 0.754 11.31% 0.716 11.45% 0.656 11.60% 0.576 11.91%
Denoising (0.01, None) 0.744 6.28% 0.727 6.40% 0.696 6.53% 0.639 6.80%
Denoising (0.02, None) 0.744 10.26% 0.724 10.39% 0.691 10.53% 0.633 10.82%
Denoising (0.1, None) 0.744 26.06% 0.710 26.16% 0.649 26.26% 0.543 26.46%
Denoising (0.1, 0.01) 0.745 29.63% 0.707 29.77% 0.641 29.92% 0.528 30.22%
Power-of-2 (Pow2) without Denoising N/A 0.756 0.00% 0.741 0.00% 0.720 0.00% 0.687 0.00%
Denoising, Pow2 (0.001, None) 0.756 0.06% 0.741 0.08% 0.720 0.10% 0.687 0.14%
Denoising, Pow2 (0.001, 0.01) 0.769 4.29% 0.752 4.37% 0.725 4.46% 0.666 4.62%
Denoising, Pow2 (0.002, None) 0.756 0.16% 0.741 0.20% 0.720 0.24% 0.687 0.32%
Denoising, Pow2 (0.002, 0.01) 0.769 4.39% 0.751 4.49% 0.725 4.59% 0.666 4.81%
Denoising, Pow2 (0.003, None) 0.756 0.20% 0.741 0.26% 0.719 0.32% 0.686 0.46%
Denoising, Pow2 (0.003, 0.01) 0.769 4.43% 0.751 4.55% 0.725 4.68% 0.665 4.94%
Denoising, Pow2 (0.005, None) 0.756 0.32% 0.740 0.44% 0.718 0.56% 0.685 0.83%
Denoising, Pow2 (0.005, 0.01) 0.769 4.54% 0.751 4.73% 0.724 4.92% 0.664 5.31%
Denoising, Pow2 (0.007, None) 0.755 0.46% 0.739 0.67% 0.717 0.89% 0.684 1.33%
Denoising, Pow2 (0.007, 0.01) 0.768 4.68% 0.750 4.96% 0.723 5.24% 0.662 5.82%
Denoising, Pow2 (0.009, None) 0.758 0.91% 0.745 1.27% 0.726 1.62% 0.691 2.32%
Denoising, Pow2 (0.009, 0.01) 0.768 5.14% 0.752 5.56% 0.727 5.98% 0.663 6.81%
Denoising, Pow2 (0.009, 0.02) 0.768 7.92% 0.756 8.29% 0.740 8.65% 0.703 9.37%
Denoising, Pow2 (0.009, 0.03) 0.746 8.40% 0.710 8.76% 0.683 9.12% 0.632 9.83%
Denoising, Pow2 (0.01, None) 0.755 2.84% 0.744 3.19% 0.724 3.54% 0.688 4.22%
Denoising, Pow2 (0.02, None) 0.759 14.43% 0.744 14.66% 0.722 14.88% 0.686 15.32%
Denoising, Pow2 (0.1, None) 0.756 28.00% 0.722 28.04% 0.688 28.07% 0.644 28.15%
Denoising, Pow2 (0.1, 0.01) 0.766 33.18% 0.742 33.29% 0.699 33.40% 0.631 33.60%
Table 1: [UCSD Pedestrians 1 Dataset] Observed Area Under the Curve (AUC) of the ROCs and percentage savings for different optimizations using hard thresholding. Denoising not only improves the AUC in both regular and Pow2 networks but also provides 9.28% and 28.0% savings in the regular and Pow2 networks, respectively. Soft thresholding provides robustness against noise, as shown in Table 2.
Network Threshold(s) (Layer1, Layer2) | Noise = 0: AUC, Saving | Noise = 5 blobs: AUC, Saving | Noise = 10 blobs: AUC, Saving | Noise = 20 blobs: AUC, Saving
Regular N/A 0.744 0.00% 0.727 0.00% 0.698 0.00% 0.687 0.00%
Denoising (0.001, None) 0.745 2.05% 0.727 2.10% 0.697 2.16% 0.686 0.32%
Denoising (0.001, 0.01) 0.753 5.84% 0.717 5.95% 0.670 6.06% 0.696 7.37%
Denoising (0.002, None) 0.745 4.06% 0.727 4.11% 0.696 4.16% 0.686 0.64%
Denoising (0.002, 0.01) 0.753 7.89% 0.717 7.99% 0.668 8.10% 0.696 7.70%
Denoising (0.003, None) 0.746 4.22% 0.727 4.33% 0.696 4.44% 0.685 1.06%
Denoising (0.003, 0.01) 0.753 8.09% 0.716 8.25% 0.667 8.42% 0.695 8.12%
Denoising (0.005, None) 0.746 6.28% 0.727 6.40% 0.694 6.53% 0.686 4.22%
Denoising (0.005, 0.01) 0.754 10.46% 0.721 10.62% 0.670 10.78% 0.691 11.30%
Denoising (0.007, None) 0.749 6.83% 0.726 7.04% 0.690 7.26% 0.683 11.28%
Denoising (0.007, 0.01) 0.755 11.03% 0.715 11.29% 0.659 11.55% 0.690 18.37%
Denoising (0.009, None) 0.748 8.73% 0.726 8.93% 0.690 9.12% 0.681 13.95%
Denoising (0.009, 0.01) 0.756 12.98% 0.713 13.22% 0.654 13.46% 0.687 21.05%
Denoising (0.009, 0.02) 0.759 15.36% 0.717 15.60% 0.657 15.83% 0.454 21.92%
Denoising (0.009, 0.03) 0.500 16.71% 0.500 16.91% 0.500 17.10% 0.500 21.93%
Denoising (0.01, None) 0.748 10.26% 0.726 10.39% 0.689 10.53% 0.683 15.32%
Denoising (0.02, None) 0.756 16.58% 0.727 16.82% 0.681 17.06% 0.671 25.93%
Denoising (0.1, None) 0.500 31.91% 0.500 31.91% 0.500 31.91% 0.577 30.55%
Denoising (0.1, 0.01) 0.500 39.89% 0.500 39.89% 0.500 39.89% 0.467 38.30%
Power-of-2 (Pow2) without Denoising N/A 0.756 0.00% 0.741 0.00% 0.720 0.00% 0.687 0.00%
Denoising, Pow2 (0.001, None) 0.756 0.16% 0.741 0.20% 0.720 0.24% 0.686 0.32%
Denoising, Pow2 (0.001, 0.01) 0.768 7.17% 0.753 7.21% 0.733 7.26% 0.696 7.37%
Denoising, Pow2 (0.002, None) 0.756 0.27% 0.741 0.36% 0.719 0.45% 0.686 0.64%
Denoising, Pow2 (0.002, 0.01) 0.769 7.28% 0.754 7.38% 0.734 7.48% 0.696 7.70%
Denoising, Pow2 (0.003, None) 0.756 0.39% 0.741 0.55% 0.719 0.72% 0.685 1.06%
Denoising, Pow2 (0.003, 0.01) 0.769 7.40% 0.755 7.58% 0.733 7.75% 0.695 8.12%
Denoising, Pow2 (0.005, None) 0.756 2.84% 0.743 3.19% 0.722 3.54% 0.686 4.22%
Denoising, Pow2 (0.005, 0.01) 0.761 9.85% 0.751 10.22% 0.731 10.58% 0.691 11.30%
Denoising, Pow2 (0.007, None) 0.757 10.40% 0.742 10.62% 0.720 10.84% 0.683 11.28%
Denoising, Pow2 (0.007, 0.01) 0.765 17.42% 0.752 17.66% 0.730 17.90% 0.690 18.37%
Denoising, Pow2 (0.009, None) 0.757 12.96% 0.741 13.21% 0.719 13.47% 0.681 13.95%
Denoising, Pow2 (0.009, 0.01) 0.764 19.99% 0.750 20.26% 0.727 20.53% 0.687 21.05%
Denoising, Pow2 (0.009, 0.02) 0.459 20.92% 0.459 21.18% 0.459 21.43% 0.454 21.92%
Denoising, Pow2 (0.009, 0.03) 0.500 20.94% 0.500 21.19% 0.500 21.44% 0.500 21.93%
Denoising, Pow2 (0.01, None) 0.758 14.43% 0.744 14.66% 0.721 14.88% 0.683 15.32%
Denoising, Pow2 (0.02, None) 0.758 25.85% 0.741 25.87% 0.716 25.89% 0.671 25.93%
Denoising, Pow2 (0.1, None) 0.695 30.27% 0.666 30.35% 0.628 30.42% 0.577 30.55%
Denoising, Pow2 (0.1, 0.01) 0.626 37.93% 0.567 38.03% 0.523 38.13% 0.467 38.30%
Table 2: [UCSD Pedestrians 1 Dataset] Observed Area Under Curve (AUC) of ROCs and percentage savings on different optimizations using soft thresholding, which provides better AUC results in both regular and Pow2 networks under no-noise and noisy conditions.
Network Threshold(s) (Layer1, Layer2) | Noise = 0: AUC, Saving | Noise = 5 blobs: AUC, Saving | Noise = 10 blobs: AUC, Saving | Noise = 20 blobs: AUC, Saving
Regular N/A 0.837 0.00% 0.811 0.00% 0.793 0.00% 0.751 0.00%
Denoising (0.001, None) 0.837 0.06% 0.811 0.09% 0.793 0.13% 0.751 0.19%
Denoising (0.001, 0.01) 0.838 3.14% 0.812 3.19% 0.794 3.23% 0.751 3.31%
Denoising (0.002, None) 0.836 2.07% 0.811 2.09% 0.793 2.11% 0.750 2.15%
Denoising (0.002, 0.01) 0.838 5.16% 0.811 5.18% 0.793 5.21% 0.750 5.27%
Denoising (0.003, None) 0.837 3.94% 0.811 3.94% 0.793 3.93% 0.750 3.93%
Denoising (0.003, 0.01) 0.838 7.02% 0.811 7.03% 0.793 7.03% 0.750 7.04%
Denoising (0.005, None) 0.836 4.15% 0.811 4.19% 0.793 4.22% 0.750 4.29%
Denoising (0.005, 0.01) 0.837 7.24% 0.811 7.28% 0.793 7.32% 0.749 7.41%
Denoising (0.007, None) 0.836 4.31% 0.811 4.38% 0.793 4.44% 0.750 4.57%
Denoising (0.007, 0.01) 0.837 7.40% 0.811 7.47% 0.793 7.55% 0.750 7.69%
Denoising (0.009, None) 0.836 6.15% 0.810 6.20% 0.792 6.24% 0.749 6.32%
Denoising (0.009, 0.01) 0.837 9.24% 0.811 9.29% 0.792 9.34% 0.749 9.45%
Denoising (0.009, 0.02) 0.819 9.98% 0.792 10.05% 0.774 10.11% 0.741 10.24%
Denoising (0.009, 0.03) 0.821 11.27% 0.797 11.32% 0.779 11.38% 0.744 11.50%
Denoising (0.01, None) 0.836 6.25% 0.810 6.30% 0.792 6.36% 0.749 6.47%
Denoising (0.02, None) 0.836 10.20% 0.808 10.26% 0.788 10.32% 0.745 10.45%
Denoising (0.1, None) 0.809 26.05% 0.786 26.09% 0.763 26.14% 0.729 26.23%
Denoising (0.1, 0.01) 0.807 29.60% 0.787 29.66% 0.762 29.72% 0.728 29.84%
Power-of-2 (Pow2) without Denoising N/A 0.811 0.00% 0.800 0.00% 0.783 0.00% 0.751 0.00%
Denoising, Pow2 (0.001, None) 0.811 0.06% 0.800 0.07% 0.783 0.07% 0.751 0.09%
Denoising, Pow2 (0.001, 0.01) 0.824 4.30% 0.805 4.33% 0.789 4.37% 0.756 4.44%
Denoising, Pow2 (0.002, None) 0.811 0.14% 0.800 0.16% 0.783 0.17% 0.751 0.20%
Denoising, Pow2 (0.002, 0.01) 0.824 4.38% 0.805 4.43% 0.788 4.47% 0.756 4.56%
Denoising, Pow2 (0.003, None) 0.811 0.19% 0.799 0.21% 0.783 0.23% 0.751 0.28%
Denoising, Pow2 (0.003, 0.01) 0.824 4.43% 0.804 4.48% 0.788 4.53% 0.756 4.63%
Denoising, Pow2 (0.005, None) 0.811 0.28% 0.799 0.33% 0.782 0.38% 0.750 0.49%
Denoising, Pow2 (0.005, 0.01) 0.824 4.52% 0.805 4.59% 0.788 4.68% 0.754 4.84%
Denoising, Pow2 (0.007, None) 0.811 0.42% 0.798 0.51% 0.781 0.60% 0.747 0.79%
Denoising, Pow2 (0.007, 0.01) 0.823 4.65% 0.803 4.77% 0.786 4.89% 0.752 5.14%
Denoising, Pow2 (0.009, None) 0.817 0.98% 0.808 1.14% 0.793 1.30% 0.764 1.61%
Denoising, Pow2 (0.009, 0.01) 0.823 5.21% 0.808 5.40% 0.795 5.59% 0.767 5.97%
Denoising, Pow2 (0.009, 0.02) 0.815 7.99% 0.802 8.15% 0.796 8.31% 0.777 8.63%
Denoising, Pow2 (0.009, 0.03) 0.813 8.46% 0.788 8.62% 0.781 8.78% 0.740 9.09%
Denoising, Pow2 (0.01, None) 0.801 2.92% 0.791 3.07% 0.777 3.22% 0.756 3.52%
Denoising, Pow2 (0.02, None) 0.815 14.45% 0.804 14.56% 0.787 14.66% 0.756 14.87%
Denoising, Pow2 (0.1, None) 0.805 27.99% 0.773 28.00% 0.746 28.01% 0.711 28.04%
Denoising, Pow2 (0.1, 0.01) 0.769 33.20% 0.754 33.25% 0.735 33.30% 0.717 33.40%
Table 3: [UCSD Pedestrians 2 Dataset] Observed Area Under the Curve (AUC) of the ROCs and percentage savings for different optimizations using hard thresholding. This dataset is smaller than the Pedestrians 1 dataset.
Network Threshold(s) (Layer1, Layer2) | Noise = 0: AUC, Saving | Noise = 5 blobs: AUC, Saving | Noise = 10 blobs: AUC, Saving | Noise = 20 blobs: AUC, Saving
Regular N/A 0.837 0.00% 0.811 0.00% 0.793 0.00% 0.751 0.00%
Denoising (0.001, None) 0.835 2.07% 0.810 2.09% 0.793 2.11% 0.750 2.15%
Denoising (0.001, 0.01) 0.817 5.86% 0.792 5.91% 0.777 5.95% 0.747 6.03%
Denoising (0.002, None) 0.834 4.03% 0.810 4.05% 0.792 4.07% 0.750 4.11%
Denoising (0.002, 0.01) 0.818 7.85% 0.792 7.89% 0.777 7.93% 0.747 8.02%
Denoising (0.003, None) 0.833 4.21% 0.809 4.26% 0.791 4.31% 0.750 4.41%
Denoising (0.003, 0.01) 0.815 8.06% 0.790 8.13% 0.775 8.21% 0.745 8.36%
Denoising (0.005, None) 0.831 6.25% 0.807 6.30% 0.790 6.36% 0.750 6.47%
Denoising (0.005, 0.01) 0.826 10.41% 0.801 10.48% 0.785 10.55% 0.753 10.68%
Denoising (0.007, None) 0.821 6.77% 0.800 6.88% 0.784 6.98% 0.748 7.19%
Denoising (0.007, 0.01) 0.801 10.98% 0.779 11.10% 0.765 11.22% 0.739 11.46%
Denoising (0.009, None) 0.827 8.67% 0.803 8.77% 0.785 8.87% 0.747 9.07%
Denoising (0.009, 0.01) 0.795 12.93% 0.772 13.04% 0.758 13.16% 0.732 13.39%
Denoising (0.009, 0.02) 0.825 15.30% 0.795 15.42% 0.772 15.53% 0.737 15.76%
Denoising (0.009, 0.03) 0.500 16.65% 0.500 16.75% 0.500 16.85% 0.500 17.04%
Denoising (0.01, None) 0.826 10.20% 0.803 10.26% 0.785 10.32% 0.748 10.45%
Denoising (0.02, None) 0.809 16.65% 0.791 16.75% 0.776 16.85% 0.749 17.04%
Denoising (0.1, None) 0.500 31.91% 0.500 31.91% 0.500 31.91% 0.500 31.91%
Denoising (0.1, 0.01) 0.500 39.89% 0.500 39.89% 0.500 39.89% 0.500 39.89%
Power-of-2 (Pow2) without Denoising N/A 0.811 0.00% 0.800 0.00% 0.783 0.00% 0.751 0.00%
Denoising, Pow2 (0.001, None) 0.810 0.14% 0.799 0.16% 0.783 0.17% 0.751 0.20%
Denoising, Pow2 (0.001, 0.01) 0.807 7.15% 0.794 7.17% 0.789 7.19% 0.775 7.23%
Denoising, Pow2 (0.002, None) 0.809 0.25% 0.798 0.28% 0.782 0.32% 0.751 0.39%
Denoising, Pow2 (0.002, 0.01) 0.807 7.26% 0.795 7.30% 0.789 7.34% 0.777 7.42%
Denoising, Pow2 (0.003, None) 0.808 0.35% 0.798 0.42% 0.781 0.49% 0.750 0.63%
Denoising, Pow2 (0.003, 0.01) 0.808 7.36% 0.795 7.44% 0.789 7.51% 0.780 7.66%
Denoising, Pow2 (0.005, None) 0.801 2.92% 0.792 3.07% 0.777 3.22% 0.753 3.52%
Denoising, Pow2 (0.005, 0.01) 0.809 9.95% 0.794 10.10% 0.785 10.26% 0.780 10.57%
Denoising, Pow2 (0.007, None) 0.808 10.38% 0.799 10.49% 0.782 10.59% 0.752 10.79%
Denoising, Pow2 (0.007, 0.01) 0.804 17.42% 0.789 17.53% 0.778 17.64% 0.776 17.86%
Denoising, Pow2 (0.009, None) 0.802 12.88% 0.794 13.00% 0.777 13.13% 0.749 13.37%
Denoising, Pow2 (0.009, 0.01) 0.793 19.91% 0.778 20.05% 0.770 20.18% 0.773 20.44%
Denoising, Pow2 (0.009, 0.02) 0.516 20.84% 0.483 20.97% 0.512 21.09% 0.507 21.34%
Denoising, Pow2 (0.009, 0.03) 0.500 20.85% 0.500 20.98% 0.500 21.11% 0.500 21.35%
Denoising, Pow2 (0.01, None) 0.810 14.45% 0.799 14.56% 0.782 14.66% 0.754 14.87%
Denoising, Pow2 (0.02, None) 0.804 25.84% 0.793 25.84% 0.776 25.84% 0.748 25.84%
Denoising, Pow2 (0.1, None) 0.618 30.35% 0.612 30.38% 0.615 30.41% 0.573 30.47%
Denoising, Pow2 (0.1, 0.01) 0.569 38.01% 0.568 38.06% 0.553 38.10% 0.515 38.20%
Table 4: [UCSD Pedestrians 2 Dataset] Observed Area Under the Curve (AUC) of the ROCs and percentage savings for different optimizations using soft thresholding, which provides robustness against noise, as shown in the last row. The AUC values of the soft-thresholded network are higher than the corresponding cases when the noise is high, while providing about 10% savings.

5 Conclusion

In this paper, we developed a computationally efficient anomaly detection algorithm using motion vector images and Pow2 arithmetic. Our deep neural network inference algorithm is about 10% to 25% faster than the corresponding powers-of-two networks while detecting all the anomalous events with almost the same AUC. We reduced the complexity of the network by denoising the outputs of the first two layers. As a result, the anomaly detection scheme can be implemented in real time on low-power devices.

The resulting system turns out to be more robust to background motion, because the denoising forces small-valued intermediate outputs to zero. This process eliminates the contribution of small motion vectors to the final result in Pow2 networks. Furthermore, we augmented the training data set of the Pow2 network with GAN-generated images to improve the anomaly detection rate.

Future work will include enhancing the denoised powers-of-two networks with other complexity reduction techniques such as network pruning or conditional execution [34, 35, 36].

References

  • [1] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio, “Generative adversarial nets,” in Proceedings of NIPS 2014, 2014, pp. 2672–2680.
  • [2] H. Zenati, C.-S. Foo, B. Lecouat, G. Manek, and V. R. Chandrasekhar, “Efficient GAN-based anomaly detection,” CoRR, vol. abs/1802.06222, 2018.
  • [3] M. Ravanbakhsh, M. Nabi, E. Sangineto, L. Marcenaro, C. S. Regazzoni, and N. Sebe, “Abnormal event detection in videos using generative adversarial nets,” in IEEE ICIP 2017, 2017, pp. 1577–1581.
  • [4] H. Tann, S. Hashemi, R. I. Bahar, and S. Reda, “Hardware-software codesign of accurate, multiplier-free deep neural networks,” in Proceedings of DAC 2017, 2017, pp. 28:1–28:6.
  • [5] R. Ding, Z. Liu, T.-W. Chin, D. Marculescu, and R. D. Blanton, “FLightNNs: Lightweight quantized deep neural networks for fast and accurate inference,” in Proceedings of DAC 2019, 2019, p. 200.
  • [6] H. Krim, D. Tucker, S. Mallat, and D. L. Donoho, “On denoising and best signal representation,” IEEE Trans. Information Theory, vol. 45, no. 7, pp. 2225–2238, 1999.
  • [7] H. Mousavi, S. Mohammadi, A. Perina, R. Chellali, and V. Murino, “Analyzing tracklets for the detection of abnormal crowd behavior,” in IEEE WACV 2015, 2015, pp. 148–155.
  • [8] R. Mehran, A. Oyama, and M. Shah, “Abnormal crowd behavior detection using social force model,” in IEEE CVPR 2009, 2009, pp. 935–942.
  • [9] V. Mahadevan, W. Li, V. Bhalodia, and N. Vasconcelos, “Anomaly detection in crowded scenes,” in IEEE CVPR 2010, 2010, pp. 1975–1981.
  • [10] Y. Cong, J. Yuan, and J. Liu, “Sparse reconstruction cost for abnormal event detection,” in IEEE CVPR 2011, 2011, pp. 3449–3456.
  • [11] J. Kim and K. Grauman, “Observe locally, infer globally: A space-time MRF for detecting abnormal activities with incremental updates,” in IEEE CVPR 2009, 2009, pp. 2921–2928.
  • [12] R. Raghavendra, M. Cristani, A. Del Bue, E. Sangineto, and V. Murino, “Anomaly detection in crowded scenes: A novel framework based on swarm optimization and social force modeling,” in Modeling, Simulation and Visual Analysis of Crowds - A Multidisciplinary Perspective, pp. 383–411. 2013.
  • [13] H. R. Rabiee, J. Haddadnia, H. Mousavi, M. Kalantarzadeh, M. Nabi, and V. Murino, “Novel dataset for fine-grained abnormal behavior understanding in crowd,” in IEEE AVSS 2016, 2016, pp. 95–101.
  • [14] V. Saligrama and Z. Chen, “Video anomaly detection based on local statistical aggregates,” in IEEE CVPR 2012, 2012, pp. 2112–2119.
  • [15] H. R. Rabiee, H. Mousavi, M. Nabi, and M. Ravanbakhsh, “Detection and localization of crowd behavior using a novel tracklet-based model,” Int. J. Machine Learning & Cybernetics, vol. 9, no. 12, pp. 1999–2010, 2018.
  • [16] X. Huang, W. Wang, G. Shen, X. Feng, and X. Kong, “Crowd activity classification using category constrained correlated topic model,” TIIS, vol. 10, no. 11, pp. 5530–5546, 2016.
  • [17] M. Ravanbakhsh, M. Nabi, H. Mousavi, E. Sangineto, and N. Sebe, “Plug-and-play CNN for crowd motion analysis: An application in abnormal event detection,” in IEEE WACV 2018, 2018, pp. 1689–1698.
  • [18] M. Sabokrou, M. Fayyaz, M. Fathy, Z. Moayed, and R. Klette, “Deep-anomaly: Fully convolutional neural network for fast anomaly detection in crowded scenes,” Computer Vision and Image Understanding, vol. 172, pp. 88–97, 2018.
  • [19] D. Xu, Y. Yan, E. Ricci, and N. Sebe, “Detecting anomalous events in videos by learning deep representations of appearance and motion,” Computer Vision and Image Understanding, vol. 156, pp. 117–127, 2017.
  • [20] “Peds dataset,” http://www.svcl.ucsd.edu/projects/anomaly.
  • [21] G. Farnebäck, “Two-frame motion estimation based on polynomial expansion,” in SCIA 2003, Halmstad, Sweden, 2003, pp. 363–370.
  • [22] D. Pathak, P. Krähenbühl, J. Donahue, T. Darrell, and A. A. Efros, “Context encoders: Feature learning by inpainting,” in IEEE CVPR 2016, 2016, pp. 2536–2544.
  • [23] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in IEEE CVPR 2017, 2017, pp. 5967–5976.
  • [24] L. I. Rudin, S. Osher, and E. Fatemi, “Nonlinear total variation based noise removal algorithms,” Physica D: Nonlinear Phenomena, vol. 60, no. 1, pp. 259 – 268, 1992.
  • [25] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society. Series B (Methodological), vol. 58, no. 1, pp. 267–288, 1996.
  • [26] A. E. Çetin and M. Tofighi, “Projection-based wavelet denoising [lecture notes],” IEEE Signal Process. Mag., vol. 32, no. 5, pp. 120–124, 2015.
  • [27] M. Tofighi, K. Köse, and A. E. Çetin, “Denoising using projections onto the epigraph set of convex cost functions,” in IEEE ICIP 2014, Paris, France, 2014, pp. 2709–2713.
  • [28] J. C. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra, “Efficient projections onto the $\ell_1$-ball for learning in high dimensions,” in Proceedings of ICML 2008, 2008, pp. 272–279.
  • [29] A. Afrasiyabi, D. Badawi, B. Nasir, O. Yildiz, F. T. Yarman-Vural, and A. E. Çetin, “Non-euclidean vector product for neural networks,” in IEEE ICASSP 2018, 2018, pp. 6862–6866.
  • [30] Y. Kopsinis, K. Slavakis, and S. Theodoridis, “Online sparse system identification and signal reconstruction using projections onto weighted $\ell_1$ balls,” IEEE Trans. Signal Processing, vol. 59, no. 3, pp. 936–952, 2011.
  • [31] D. L. Donoho and I. M. Johnstone, “Adapting to unknown smoothness via wavelet shrinkage,” Journal of the American Statistical Association, vol. 90, no. 432, pp. 1200–1224, 1995.
  • [32] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” in ICLR 2016, 2016.
  • [33] P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz, “Pruning convolutional neural networks for resource efficient transfer learning,” CoRR, vol. abs/1611.06440, 2016.
  • [34] Y. Cheng, D. Wang, P. Zhou, and T. Zhang, “A survey of model compression and acceleration for deep neural networks,” CoRR, vol. abs/1710.09282, 2017.
  • [35] P. Panda, A. Sengupta, and K. Roy, “Conditional deep learning for energy-efficient and enhanced pattern recognition,” in Design, Automation & Test in Europe Conference & Exhibition, 2016, pp. 475–480.
  • [36] M. Biasielli, C. Bolchini, L. Cassano, E. Koyuncu, and A. Miele, “A neural network based fault management scheme for reliable image processing,” Submitted for publication, Apr. 2019.