EvAn: Neuromorphic Event-based Anomaly Detection

11/21/2019 ∙ by Lakshmi Annamalai, et al. ∙ Indian Institute of Science

Event-based cameras are bio-inspired novel sensors that asynchronously record changes in illumination in the form of events, resulting in significant advantages over conventional cameras in terms of low power consumption, high dynamic range, and absence of motion blur. Moreover, such cameras, by design, encode only the relative motion between the scene and the sensor (and not the static background), yielding a very sparse data structure that can be exploited for various motion analytics tasks. In this paper, for the first time in the event data analytics community, we leverage these advantages of an event camera for a critical vision application: video anomaly detection. We propose to model the motion dynamics in the event domain with a dual discriminator conditional Generative Adversarial Network (cGAN) built on state-of-the-art architectures. To adapt event data for use as input to the cGAN, we also put forward a deep learning solution that learns a novel representation of event data, one that retains the sparsity of the data and encodes the temporal information readily available from these sensors. Since no dataset exists for anomaly detection in the event domain, we also provide an anomaly detection event dataset with an exhaustive set of anomalies. We empirically validate the different components of our architecture on this proposed dataset, and demonstrate the benefits of our event data representation over state-of-the-art event representations on the video anomaly detection task.


1 Introduction

This paper focuses on anomaly detection using bio-inspired event-based cameras, which register pixel-wise changes in brightness asynchronously in an efficient manner, radically different from how a conventional camera works. The asynchronous principle of operation enables event cameras [9] [10] [36] [41] to capture high-speed motions (with temporal resolution on the order of microseconds), high dynamic range (around 120 dB), and sparse data. These low latency sensors have paved the way for agile robotic applications [1] that were not feasible with conventional cameras. Only limited achievements have been accomplished in designing robust and accurate visual analytics algorithms for event data, mainly because of the commercial unavailability of event cameras and, consequently, the dearth of large scale event datasets.

Video anomaly detection [28] [19] is a pervasive application of computer vision, with uses as diverse as surveillance and intrusion detection. Anomaly detection can be posed as a foreground motion analytics task. This makes the event camera an ideal candidate for video anomaly detection, as it comes with the ability to encode motion information at the sensor level. We note that a vast majority of state-of-the-art frame-based anomaly detection networks rely on optical flow estimation or 3D convolution to explicitly model a motion constraint to detect anomalies. It is important to note that event cameras exhibit embedded motion information at the sensor level, which allows us to circumvent optical flow estimation and 3D convolution. Hence, the direct application of existing conventional anomaly detection networks to event data is debatable.

In this paper, we introduce a solution to event-based anomaly detection with a dual discriminator conditional Generative Adversarial Network (cGAN), which has not been explored in prior event vision works. The dual discriminator allows the detection of spatial event anomalies, which would not have been feasible otherwise. However, the modality of event data does not directly fit this network owing to its inherent difference from conventional camera output. State-of-the-art deep-learning-compatible event representations are handcrafted features, which mandate domain expertise; the power of deep learning to learn nested concepts from data has not yet been exploited for event data representation. To close this gap, we introduce a computationally light, shallow encoder-decoder architecture (referred to here as the DL (Deep Learning) memory surface generation network) that learns a latent-space sparse event representation while efficiently modeling motion. To validate the efficacy of our algorithm, we introduce, for the first time, a novel anomaly detection event dataset recorded with a type of event camera known as the Dynamic and Active Pixel Vision Sensor (DAVIS) [30] [22].

Event data representation:

The models that can natively cope with event data are biologically inspired neural networks known as Spiking Neural Networks (SNNs) [38]. SNNs have not become popular owing to the lack of scalable training procedures. An alternative methodology followed in the literature is the adaptation of event data to make it compatible with conventional networks.

Earlier works restricted themselves to encoding basic information such as polarity. In [33], the list of events is converted into an image by recording the occurrence of the most recent event at each pixel within a given time window. The drawback of this representation is that it encodes solely the latest event information at each pixel. In [16], a two-channel event image is created with histograms of positive and negative events respectively. Storing events of different polarity in different channels avoids the cancellation of events of opposite polarity at a given location. This choice proves to be better than that of [33]. The predominant drawback of the above basic strategies is that they discard the valuable time information provided by event cameras.
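To make these representations concrete, the following is a minimal sketch (our own illustration, not the authors' code) of the two-channel histogram image of [16], assuming events arrive as an (N, 4) array of (x, y, t, p) tuples with polarity p in {-1, +1}:

```python
import numpy as np

def two_channel_histogram(events, height, width):
    """Histogram of positive events in channel 0 and negative events
    in channel 1, so opposite polarities cannot cancel."""
    img = np.zeros((2, height, width), dtype=np.float32)
    x = events[:, 0].astype(int)
    y = events[:, 1].astype(int)
    pos = events[:, 3] > 0
    np.add.at(img[0], (y[pos], x[pos]), 1.0)
    np.add.at(img[1], (y[~pos], x[~pos]), 1.0)
    return img
```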

The timestamps of events carry useful motion information that can enhance the accuracy of vision algorithms; however, incorporating this information is a cumbersome task. In [35], time-stamp maps are created using three distinct techniques: pixel replication, temporal interpolation, and spatio-temporal correlation. These time-stamp maps are merged temporally for further processing, and hence tend to lose the details of the time information provided by the event camera.

In [20], an intensity image is coded with the timestamps of the most recent positive and negative events at each pixel, within a given integration time and spatial neighborhood. This image is further used to construct features known as time surfaces. Following this, [48] encode the first two channels as the number of positive and negative events that have occurred at each pixel, and the last two channels as the time-stamps of the most recent positive and negative events. This representation discards all time information except that of the most recent event. Moreover, this kind of encoding is very sensitive to noise.

[29] and [3] have attempted to improve the time channel by combining the time information more expertly. In [29], a third channel stores the average timestamp of the events that occurred at each pixel within a given temporal window. [3] improved on this by allocating four channels that encode the standard deviations of the timestamps of positive and negative events (separately) at each pixel in the given time interval, in addition to their average values.

Recently, [2] proposed an interesting approach to encoding time information by introducing a representation (highly resistant to noise) known as memory surfaces, which exponentially weighs the information carried by past events. Following this, [47] proposed an event representation that discretizes the time domain. However, this representation may incur a higher computational cost when fed to a deep network. [12] generated frames by accumulating a constant number of events, thus claiming an adaptive frame rate.

Anomaly Detection on Conventional Cameras: As there is no prior work on event data anomaly detection, we briefly describe frame-based deep learning algorithms for anomaly detection [5]. Researchers build a statistical model (reconstruction modeling or predictive modeling) to characterize normal samples, and actions that deviate from the estimated model are identified as anomalies.

The learning capability of the deep networks used in reconstruction modeling [32] [39] [7] [8] [14] is so high that they fail to conform to the expectation of higher reconstruction error for abnormal events. This led to the new attractive phase of predictive models such as convolutional LSTM [27] [24] [26] and generative modeling. While convolutional LSTMs learn the transformation required to predict frames, generative models such as the Variational Auto-encoder (VAE) [11] [4] and the Generative Adversarial Network (GAN) [13] learn the probability distribution needed to generate the present from the history, which makes them ideal candidates for anomaly detection. As this work is built on GAN, we limit our survey to GAN-based conventional camera anomaly detection.

[40] proposed AnoGAN to identify the manifold of abnormal anatomical variability. AnoGAN is trained with a weighted sum of a residual loss and a discriminator loss (the dissimilarity between the intermediate feature representations of the original and reconstructed images). However, temporal information is discarded in modeling anomalies. [37] proposed an anomaly detection framework that also models anomalies based on motion inconsistency. The framework consists of two conditional GAN (cGAN) [34] networks, trained on the cross-channel tasks of generating frames of the training video from the optical flow [6] and vice versa. At test time, the discriminators are applied to detect anomalies. A similar architecture is followed in [25]; the significant distinction lies in the methodology used to detect possible anomalies.

[45] proposed a GAN solution that leverages reconstruction loss, gradient loss, and optical flow loss in addition to the adversarial loss. The constraint on motion is modeled as the difference between the optical flow of the predicted frames and that of the original frames. Recently, [46] proposed a 3D convolutional GAN to capture temporal information for anomaly detection; however, 3D convolution increases the computational complexity of the network.

Contributions: In the context of the previous discussion, our contributions in this paper can be summarized as follows:

  1. Event Data Representation: a shallow encoder-decoder module that generates an appropriate sparse representation of event data and inherently learns the temporal information from the data, as opposed to state-of-the-art handcrafted features.

  2. Anomaly Detection Network: a dual discriminator cGAN that combines the conditional and unconditional settings of the discriminator (inspired by [34] and [44]) to model spatial and temporal anomalies in the event domain.

2 Proposed Anomaly Detection Method

The pipeline of our event data prediction framework for anomaly detection is shown in Fig. 1. We pose anomaly detection as a conditional generative problem that predicts future events conditioned on past events. To predict the future events, we train a dual discriminator conditional GAN with two discriminators (details are given in the forthcoming section), one under a conditional setting and the other under an unconditional setting. As the generator network of a GAN is deep, predicting future events conditioned directly on previous events would result in a computationally heavy model. To make the computation faster, we introduce a DL memory surface generation network (details are furnished in the upcoming section) that captures the coherence of motion from a given set of events in a single structure known as the DL memory surface, conditioned on which the cGAN learns to predict future events. To feed the DL memory surface network, we stack the events of a given time duration into a discretized volume (adapted from [47]) over a set of discrete time bins, each of equal duration, as sketched after Fig. 1.

Fig. 1: Framework of the proposed DL memory surface network and dual discriminator cGAN. The DL memory surface network is trained standalone. The dual discriminator cGAN predicts future events conditioned on DL memory surfaces.
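As a concrete illustration of the stacking step, the sketch below builds a discretized event volume by simple binning; it assumes events are given as an (N, 4) array of (x, y, t, p) and omits the temporal interpolation used in [47]:

```python
import numpy as np

def discretized_event_volume(events, height, width, num_bins, t_start, duration):
    """Stack events into num_bins temporal slices of equal duration;
    each event adds its polarity at its (bin, y, x) location."""
    volume = np.zeros((num_bins, height, width), dtype=np.float32)
    x = events[:, 0].astype(int)
    y = events[:, 1].astype(int)
    # Map each timestamp to a bin index in [0, num_bins - 1].
    b = ((events[:, 2] - t_start) / duration * num_bins).astype(int)
    b = np.clip(b, 0, num_bins - 1)
    np.add.at(volume, (b, y, x), events[:, 3])
    return volume
```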

2.1 DL Memory Surface Generation Network

In this work, we propose a novel event representation generated by a shallow, computationally inexpensive encoder-decoder network with a loss function comprising a data term and a sparsity term. Before delving into the loss function, we introduce the architecture of the encoder-decoder network.

2.1.1 Network Architecture

We adapt a fully convolutional encoder-decoder architecture [15] with layers of the form 1×1 convolution (with sigmoid activations) that maps a discretized volume of event data (explained above) to a single image, known as the DL memory surface, of the same spatial resolution at the bottleneck layer. The input and output are renderings of the discretized volume data.

In order to model only the temporal structure in the data without upsetting the spatial distribution, it suffices to restrict the convolution operation to the time dimension. Hence, we construct the network with 1×1 convolution layers, inspired by [23], which was the first to introduce the 1×1 convolution. As studies have shown [43] [18], networks that rely on 1×1 convolutions can learn complex tasks with shallow architectures, unlike their full spatial convolution counterparts, thereby resulting in a "small" network. We thus utilize the potential of deep learning to model the motion, capturing the full information available from the data, in contrast to state-of-the-art handcrafted frame generation counterparts.
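A minimal PyTorch sketch of such an architecture is given below; the layer count and channel widths are illustrative assumptions, not the paper's exact configuration. The essential property is that every convolution is 1×1, so the network mixes only the temporal bins at each pixel, and the bottleneck is a single-channel map, the DL memory surface:

```python
import torch.nn as nn

class DLMemorySurfaceNet(nn.Module):
    """Shallow 1x1-convolutional encoder-decoder: per-pixel temporal
    mixing only, with a single-channel bottleneck (the memory surface)."""
    def __init__(self, num_bins=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(num_bins, 16, kernel_size=1), nn.Sigmoid(),
            nn.Conv2d(16, 1, kernel_size=1), nn.Sigmoid(),   # bottleneck
        )
        self.decoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=1), nn.Sigmoid(),
            nn.Conv2d(16, num_bins, kernel_size=1), nn.Sigmoid(),
        )

    def forward(self, x):
        h = self.encoder(x)          # (B, 1, H, W): DL memory surface
        return self.decoder(h), h    # reconstructed volume + surface
```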

2.1.2 Loss Function for Learning Temporal Information

The DL memory surface generation network tries to learn a function such that the output $\hat{X}$ is similar to the input discretized event volume $X$, while the bottleneck layer models the temporal information encoded by the event camera. The architecture has two parts, an encoder and a decoder, defined by the transformation functions $h = g_{\theta_e}(X)$ and $\hat{X} = g_{\theta_d}(h)$, with $\theta_d$ and $\theta_e$ being the parameters of the decoder and encoder respectively. As the event camera has built-in sparsity in its data encoding, we would like to retain this sparsity in our encoding process as well. By placing sparsity constraints on the bottleneck layer, we can discover a representation that models the temporal information while preserving the sparse structure of the input event data. The addition of the sparsity constraint to the loss function gives us the liberty to use shallow networks while still forcing the network to learn appropriate temporal information. The objective of the network can be expressed as a weighted sum of a data term and a sparsity term as follows,

$$\mathcal{L}(\theta_e, \theta_d) = \mathcal{L}_{data}(\theta_e, \theta_d) + \lambda\, \mathcal{L}_{sparsity}(\theta_e) \qquad (1)$$

In order to maximize the usefulness of the latent variable encoding, we put forth a data term that models the probability distribution of the event discretized volume $X$ given the ideal DL memory surface $h^*$, by minimizing the forward divergence between the ideal distribution and our estimate (Eq. 2). The forward divergence results in a latent variable that covers all the modes of the probability distribution of normal videos.

$$KL\big(p(X|h^*)\,\|\,p_{\theta}(X|h)\big) = \mathbb{E}_{p(X|h^*)}\big[\log p(X|h^*)\big] - \mathbb{E}_{p(X|h^*)}\big[\log p_{\theta}(X|h)\big] \qquad (2)$$

As the first term does not depend on the estimated latent variable, it can be ignored. Hence, the second term of Eq. 2 boils down to maximizing the log likelihood of $p_{\theta}(X|h)$ as the sample size tends to infinity. The output of the decoder can be modeled as a function of the latent variable plus noise, $X = g_{\theta_d}(h) + \eta$ with $\eta \sim \mathcal{N}(0, \sigma^2 I)$. This makes $p_{\theta}(X|h)$ a Gaussian distribution with mean $g_{\theta_d}(h)$. Thus maximizing the log likelihood turns into minimizing the squared reconstruction error $\|X - g_{\theta_d}(h)\|^2$.

In order to impose the sparsity constraint on the bottleneck layer, we define the sparsity term [17] as the minimization of

$$\mathcal{L}_{sparsity} = \sum_{i=1}^{M} \sum_{j=1}^{N} KL\big(\rho \,\|\, \hat{\rho}_{ij}\big) \qquad (3)$$

where $M$ and $N$ are the spatial dimensions of the event data, $\rho$ is the parameter of a Bernoulli distribution chosen to be close to zero, and $\hat{\rho}_{ij}$ is the activation of the bottleneck layer at pixel $(i,j)$. By modelling the bottleneck activations as a Bernoulli distribution with parameter $\hat{\rho}_{ij}$, the divergence between the two Bernoulli distributions has the following analytical expression

$$KL\big(\rho \,\|\, \hat{\rho}_{ij}\big) = \rho \log \frac{\rho}{\hat{\rho}_{ij}} + (1-\rho) \log \frac{1-\rho}{1-\hat{\rho}_{ij}} \qquad (4)$$

Minimizing this term ensures that the output of the encoder stays as close to the sparsity parameter $\rho$ as possible.
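The complete objective of Eqs. 1, 3, and 4 can be sketched as follows; `rho` and the weight `lam` are illustrative hyperparameter values, and the mean over pixels stands in for the spatial sum of Eq. 3:

```python
import torch

def dl_memory_surface_loss(x_hat, x, h, rho=0.05, lam=1e-3, eps=1e-8):
    """Data term (MSE, from the Gaussian likelihood) plus the
    Bernoulli-KL sparsity penalty on the bottleneck activations h."""
    data_term = torch.mean((x_hat - x) ** 2)
    rho_hat = h.clamp(eps, 1 - eps)          # sigmoid outputs in (0, 1)
    kl = rho * torch.log(rho / rho_hat) \
       + (1 - rho) * torch.log((1 - rho) / (1 - rho_hat))
    return data_term + lam * kl.mean()
```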

2.2 Anomaly Detection Network

The framework adopted here is a dual discriminator conditional GAN architecture. A striking effect of a stationary event camera under surveillance is that it captures moving objects alone, which results in inexpensive modeling of temporal anomalies. However, this has the effect of producing only a silhouette of moving objects. Hence, the cGAN yields a poor characterization of spatial anomalies whose temporal modality overlaps with that of normal data. This can be attributed to the fact that cGAN penalizes the mismatch between input and output by modeling the joint distribution of input and output. In order to incentivize the detection of spatial anomalies in event data, we propose a variant of cGAN that pushes the generator distribution closer to the ground truth distribution while still sustaining the quality of the match between input and output.

We start this section with a brief review of the conditional GAN architecture. For ease of readability, we use the notation $p_d$ and $p_g$ for the data distribution and the generator distribution respectively. Conditional GAN is a two-player game, wherein a discriminator takes two points x and y in data space and emits a high probability if they are samples from the data distribution, whereas a generator maps a noise vector z drawn from $p_z$ and an input sample x drawn from $p_d$ to a sample G(x, z) that closely resembles the data y. This is learned by solving the following minimax optimization

$$\min_G \max_D V(D,G) = \mathbb{E}_{x,y \sim p_d}\big[\log D(x,y)\big] + \mathbb{E}_{x \sim p_d,\, z \sim p_z}\big[\log\big(1 - D(x, G(x,z))\big)\big] \qquad (5)$$

Proposition: For a fixed G, the optimal discriminator is

$$D^*(x,y) = \frac{p_d(x,y)}{p_d(x,y) + p_g(x,y)} \qquad (6)$$

Given this $D^*$, the minimization of $V(D^*, G)$ turns into minimizing the Jensen–Shannon divergence between $p_d$ and $p_g$.

Proof: The value function in Eq. 5 can be expanded as

$$V(D,G) = \int_x \int_y p_d(x,y)\log D(x,y)\, dy\, dx + \int_x p_d(x) \int_z p_z(z) \log\big(1 - D(x, G(x,z))\big)\, dz\, dx \qquad (7)$$

Writing the second term as an integral over the generated samples y = G(x, z), the above equation becomes

$$V(D,G) = \int_x \int_y \Big[ p_d(x,y)\log D(x,y) + p_g(x,y)\log\big(1 - D(x,y)\big) \Big]\, dy\, dx \qquad (8)$$

The optimal $D^*$ is estimated by differentiating the integrand of the above equation with respect to $D(x,y)$ and equating to zero,

$$\frac{p_d(x,y)}{D(x,y)} - \frac{p_g(x,y)}{1 - D(x,y)} = 0 \qquad (9)$$

On simplification, we get $D^*$ as given in Eq. 6. Substituting the optimal $D^*$ into the generator objective yields

$$V(D^*, G) = \mathbb{E}_{x,y \sim p_d}\Big[\log \frac{p_d(x,y)}{p_d(x,y)+p_g(x,y)}\Big] + \mathbb{E}_{x,y \sim p_g}\Big[\log \frac{p_g(x,y)}{p_d(x,y)+p_g(x,y)}\Big]$$

Multiplying and dividing the terms inside the logarithms by 2, and using the fact that $\log 2 + \log a = \log 2a$, we get

$$V(D^*, G) = -\log 4 + KL\Big(p_d \,\Big\|\, \frac{p_d + p_g}{2}\Big) + KL\Big(p_g \,\Big\|\, \frac{p_d + p_g}{2}\Big)$$

The second and third terms together are nothing but $2\, JSD(p_d \| p_g)$. Thus, the objective function of G is minimized when $p_g(x,y) = p_d(x,y)$.

2.2.1 Dual Discriminator Loss Function of cGAN

We propose a three-player game with two discriminators, $D_1$ and $D_2$, and one generator. The discriminator $D_1$ sees the inputs x and y, whereas $D_2$ sees only y. The new objective function becomes

$$\min_G \max_{D_1, D_2} V(D_1, D_2, G) = \mathbb{E}_{x,y \sim p_d}\big[\log D_1(x,y)\big] + \mathbb{E}_{x \sim p_d,\, z \sim p_z}\big[\log\big(1 - D_1(x, G(x,z))\big)\big] + \mathbb{E}_{y \sim p_d}\big[\log D_2(y)\big] + \mathbb{E}_{x \sim p_d,\, z \sim p_z}\big[\log\big(1 - D_2(G(x,z))\big)\big]$$

Expanding as before, we get

$$V = \int_x \int_y \Big[ p_d(x,y)\log D_1(x,y) + p_g(x,y)\log\big(1 - D_1(x,y)\big) \Big]\, dy\, dx + \int_y \Big[ p_d(y)\log D_2(y) + p_g(y)\log\big(1 - D_2(y)\big) \Big]\, dy \qquad (13)$$

By differentiating and equating to zero, we get the optimal $D_1^*$ as in Eq. 6 and the optimal $D_2^*$ as follows

$$D_2^*(y) = \frac{p_d(y)}{p_d(y) + p_g(y)} \qquad (14)$$

Substituting these optimal values $D_1^*$ and $D_2^*$ into the generator optimization function, we get

$$V(D_1^*, D_2^*, G) = \mathbb{E}_{p_d(x,y)}\Big[\log \frac{p_d(x,y)}{p_d(x,y)+p_g(x,y)}\Big] + \mathbb{E}_{p_g(x,y)}\Big[\log \frac{p_g(x,y)}{p_d(x,y)+p_g(x,y)}\Big] + \mathbb{E}_{p_d(y)}\Big[\log \frac{p_d(y)}{p_d(y)+p_g(y)}\Big] + \mathbb{E}_{p_g(y)}\Big[\log \frac{p_g(y)}{p_d(y)+p_g(y)}\Big] \qquad (15)$$

On simplification (as done for the conventional conditional GAN), the above equation reduces to

$$V(D_1^*, D_2^*, G) = -\log 16 + 2\, JSD\big(p_d(x,y)\,\|\,p_g(x,y)\big) + 2\, JSD\big(p_d(y)\,\|\,p_g(y)\big) \qquad (16)$$

The last four terms of Eq. 15 turn out to be the sum of $2\, JSD(p_d(x,y)\|p_g(x,y))$ and $2\, JSD(p_d(y)\|p_g(y))$. Hence the generator achieves its minimum when $p_g(x,y) = p_d(x,y)$ and $p_g(y) = p_d(y)$.

A cGAN achieves its optimum when $p_g(x,y) = p_d(x,y)$, whereas the dual discriminator cGAN additionally enforces a constraint on the marginal $p_g(y)$, which enables it to memorize the objects in the training set in addition to learning the input-output image relation.
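The three-player objective translates into the following training-step sketch (our own illustration, assuming $D_1$ and $D_2$ end in sigmoids and that G takes the memory surface x plus noise z; the exact network interfaces are assumptions):

```python
import torch
import torch.nn.functional as F

def dual_discriminator_losses(G, D1, D2, x, y_real, z):
    """One loss computation of the three-player game. x: DL memory
    surface condition, y_real: ground-truth future event frame."""
    y_fake = G(x, z)

    # Conditional discriminator D1 scores (condition, output) pairs.
    d1_real = D1(x, y_real)
    d1_fake = D1(x, y_fake.detach())
    d1_loss = F.binary_cross_entropy(d1_real, torch.ones_like(d1_real)) \
            + F.binary_cross_entropy(d1_fake, torch.zeros_like(d1_fake))

    # Unconditional discriminator D2 scores outputs alone, matching p_d(y).
    d2_real = D2(y_real)
    d2_fake = D2(y_fake.detach())
    d2_loss = F.binary_cross_entropy(d2_real, torch.ones_like(d2_real)) \
            + F.binary_cross_entropy(d2_fake, torch.zeros_like(d2_fake))

    # The generator tries to fool both discriminators.
    g1, g2 = D1(x, y_fake), D2(y_fake)
    g_loss = F.binary_cross_entropy(g1, torch.ones_like(g1)) \
           + F.binary_cross_entropy(g2, torch.ones_like(g2))
    return d1_loss, d2_loss, g_loss
```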

3 Experimental Validation

In this section, we provide the details of the dataset and the experiments conducted to validate the proposed algorithm.

3.1 Dataset

Although a few event-based datasets are available for other vision tasks such as visual odometry [31] and object recognition [21], there is no event-based dataset available for anomaly detection. To address this constraint, and in order to evaluate the proposed algorithm, we introduce a new event dataset for anomaly detection. We present two variations of the event dataset from two distinct environments, an indoor lab environment and an outdoor corridor environment, to set a realistic baseline for algorithm evaluation. The dataset comprises short event clips of pedestrian movements parallel to the camera plane, captured from a static DAVIS camera. The normal and anomalous scenes are staged, with typical training videos consisting of people walking, talking, sitting on a couch, etc. The testing videos consist of the following anomalous activities: people running, fighting, bending, and stealing a bag. We summarize the statistics of the collected dataset in Table I.

Indoor                      | Outdoor
----------------------------|----------------------------
Normal Instances            | Normal Instances
Walking               25    | Walking               23
Sitting on sofa       11    | Sitting on chair       3
Talking                5    | Talking                5
Handshaking            5    | Handshaking            5
Abnormal Instances          | Abnormal Instances
Running               11    | Running               13
Bending               13    | Bending               13
Fighting               3    | Fighting               3
Stealing bag           3    | Stealing some object   3
TABLE I: Details of the normal and anomalous videos captured in the indoor and outdoor environments

3.2 Evaluation Procedure

In this section, we evaluate the different components of the proposed method on the event anomaly dataset introduced in this paper. We conducted three sets of experiments, designed to validate the DL memory surface network, the dual discriminator cGAN, and the anomaly detection system as a whole under the various event representations in the literature.

3.2.1 Experiment A: DL Memory Surface

In this section, we present the qualitative and quantitative assessment of our proposed DL memory surface generation network on real event data. We applied the network to a training set of discretized event volumes of two different temporal classes, walking and running. For qualitative analysis, we provide visualizations (Fig. 2) of the DL memory surfaces (of walking and running) and of the temporal filters learnt by the network. The filters learned by the algorithm represent the different motion models of the events presented to the network.

Fig. 2: Leftmost: DL memory surfaces generated at bottleneck layer of DL memory surface network on walking (first) and running (second) sequences, Rightmost: Visualization of 32 encoder bases learnt by DL memory surface network

To quantitatively evaluate the temporal modeling efficiency and the sparsity of the latent representation, we conducted experiments that include tuning the sparsity weight. As the sparsity term acts as a regularizer, we demonstrate that an optimum sparsity level is needed to learn a representation suitable for downstream processing such as anomaly detection. Towards this, the DL memory surface network was trained on walking and running discretized event volumes with varied levels of sparsity imposed. After freezing the model for each value of the sparsity weight $\lambda$ (Eq. 1), DL memory surfaces were extracted from the bottleneck layer of the network for the walking and running event volumes. To confirm the effectiveness of the temporal information captured by the proposed network, we examined the performance of these surfaces in a classification task. Towards this, a convNet with 5 convolution and pooling layers, followed by 3 fully connected layers and a softmax output layer, was trained as a two-class classifier on walking and running DL memory surfaces. Fig. 3 (right) shows the classification accuracy for different levels of sparsity. The higher accuracy near an optimum value of $\lambda$ emphasizes that an appropriate amount of sparsity enforces the network to learn the global motion model. Fig. 3 (left) shows the t-SNE embedding of the features extracted from the last fully connected layer of the convNet trained on DL memory surfaces generated under the optimum sparsity. Better clustering of the features is clearly evident.

Fig. 3: Left: Visualization of the t-SNE embedding of features learnt by the CNN classification network on DL memory surfaces of walking and running sequences generated with the optimum sparsity constraint. Right: Classification accuracy of the CNN on DL memory surfaces vs. the sparsity constraint for different weights $\lambda$

3.2.2 Experiment B: Dual Discriminator cGAN

This section is dedicated to the analysis of the dual discriminator cGAN in terms of anomaly detection subsequent to event prediction. In addition, we performed experiments to emphasize the role of the unconditional discriminator in the cGAN for anomaly detection. Towards this, a training set of normal activity clips, such as pedestrians walking, talking, and sitting, was provided to the network for learning the model.

Performance of dual discriminator cGAN: For qualitative validation, we show the prediction output of the network for a normal activity and an abnormal activity in Fig. 4. It is evident that the prediction capability of the network is more pronounced for normal activities than for anomalous activities.

Fig. 4: Top row: Person walking, Bottom row: Anomalies (Left: Person bending and Right: running). The system predicts the normal activity (top row), whereas the prediction accuracy is low in the case of anomalies

To quantitatively evaluate the anomaly detection efficiency of the dual discriminator cGAN, test cases with intermittent abnormal activities such as running and fighting were presented to the network. The proposed algorithm detects the presence of anomalous events at the event frame level by evaluating the Mean Square Error (MSE) between the predicted events and the ground truth events; a sketch of this computation follows Fig. 5. Note that the spatial location of the anomaly is not considered in this evaluation. Fig. 5 shows the plot of MSE vs. event frame number. It can be seen that the MSE is higher for events such as running and fighting, owing to the fact that these kinds of motion never appeared in the training set. Running is distinguished as an anomaly with higher probability than the fighting sequence.

Fig. 5: Plot of MSE between original and predicted frames. Left: Sequence with walking as normal activity and running as abnormal activity. Right: Sequence with walking as normal activity and fighting as abnormal activity
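The frame-level score described above amounts to the following computation (a sketch; `predicted` and `actual` are hypothetical arrays of event frames):

```python
import numpy as np

def frame_anomaly_scores(predicted, actual):
    """Per-frame MSE between predicted and ground-truth event frames,
    shape (num_frames, H, W); higher scores flag unseen motion."""
    return ((predicted - actual) ** 2).mean(axis=(1, 2))
```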

Significance of the dual discriminator in anomaly detection: To evaluate the importance of the dual discriminator in anomaly detection, we experimented with removing the unconditional discriminator from the objective function. The training set did not include events with a bag; hence the presence of a bag should be classified as an anomaly. In Fig. 6, we show the future frame predicted without (left) and with (right) the unconditional discriminator. It can be seen that the cGAN with a conditional discriminator alone predicts the bag in the future event frame, as it models only the relation between the input DL memory surface and the output events, and hence fails to flag the bag as an anomaly. The cGAN with the dual discriminator produces a distorted prediction of the bag, as it constrains the network to capture the input data probability density in addition to the joint probability density between input and output. This proves the necessity of explicitly imposing a dual discriminator with conditional and unconditional settings to model shape-based anomalies in the event domain.

Fig. 6: Left: Prediction by the cGAN with a single conditional discriminator. Right: Prediction by the cGAN with conditional and unconditional discriminators. As the single-discriminator cGAN models only the relation between input and output, it is able to predict the future event frame with the bag even though it has not seen a bag during training.

3.2.3 Experiment C: Different Input Representations

In this section, we quantitatively assess the performance of the proposed system as a whole. As anomaly detection is generally a highly imbalanced problem, it has been stated in [5] that the Precision-Recall (PR) curve is the best suited metric. Hence, we evaluate our models with the PR curve, generated by evaluating precision and recall at multiple threshold values.

As there is no prior work on event-based anomaly detection, we compare our anomaly detection model under different input event representations. This is an interesting evaluation, as it demonstrates how well our DL memory surface network learns the temporal information from the data, in contrast to conventional fixed encodings of event data. The three event representations used for comparison are [48], [42], and [12]. Wherever an event representation yields multiple frames, they are treated as multiple input channels. Fig. 7 shows the comparison of PR curves, estimated using the precision_recall_curve routine of Python's scikit-learn. It can be seen that the proposed DL memory surface generation learns the temporal information better than the other state-of-the-art methods, resulting in better performance in terms of PR.

Fig. 7: Comparison of the Precision-Recall curves of the proposed anomaly detection method for different event representations. The DL memory surface learns temporal information from the data, as opposed to the other non-deep-learning event data representations, hence outperforming them
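For reference, the PR curve can be traced over all thresholds in a single call; `labels` (1 for anomalous frames, 0 for normal) and `scores` (per-frame MSE) are hypothetical arrays produced by the evaluation above:

```python
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(labels, scores)
```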

4 Conclusion

In this paper, we presented the first baseline for event-based anomaly detection by taking advantage of two of the most well-known deep learning models, the generative model and the encoder-decoder model. The proposed solution involves a cGAN with a dual discriminator loss function, which permits the capture of spatial anomalies in the event domain that might escape a cGAN with a single discriminator. We also proposed a first-of-its-kind deep learning solution to effectively encode event data as a sparse DL memory surface, wherein the motion information provided by the event camera at the sensor level is learnt adaptively from the data, as opposed to state-of-the-art hard-wired event data representations. We further provided an event-based anomaly dataset, on which the proposed algorithm was validated from different perspectives. Our analysis of other event data representations reveals that the proposed DL-based approach allows effective automatic learning of temporal patterns from event data.

References

  • [1] L. A, A. Chakraborty, and C. S. Thakur (2019) Neuromorphic vision: from sensors to event based algorithms. WIREs Data mining and knowledge discovery 9. Cited by: §1.
  • [2] A. Sironi, M. Brambilla, N. Bourdis, X. Lagorce, and R. Benosman (2018) HATS: histograms of averaged time surfaces for robust event-based object classification. CVPR, pp. 1731–1740. Cited by: §1.
  • [3] I. Alonso and A. C. Murillo (2019) EV-segnet: semantic segmentation for event-based cameras. CVPRW. Cited by: §1.
  • [4] J. An and S. Cho (2015) Variational autoencoder based anomaly detection using reconstruction probability. Technical Report. Cited by: §1.
  • [5] B. R. Kiran, D. Thomas, and R. Parakkal (2018) An overview of deep learning based methods for unsupervised and semi-supervised anomaly detection in videos. Journal of Imaging 4 (2), 36. Cited by: §1, §3.2.3.
  • [6] T. Brox, A. Bruhn, N. Papenberg, and J. Weickert (2004) High accuracy optical flow estimation based on a theory for warping. ECCV. Cited by: §1.
  • [7] R. Chalapathy, A. K. Menon, and S. Chawla (2017) Robust, deep and inductive anomaly detection. European Conference on Machine Learning and Principles and Practice of Knowledge Discovery. Cited by: §1.
  • [8] Y. S. Chong and Y. H. Tay (2017) Abnormal event detection in videos using spatiotemporal autoencoder. International Symposium on Neural Networks, pp. 189–196. Cited by: §1.
  • [9] T. Delbruck and B. L. et. al (2010) Activity-driven, event-based vision sensors. Proc. IEEE International Symposium on Circuits and Systems. Cited by: §1.
  • [10] T. Delbruck and C. Mead (1989) An electronic photoreceptor sensitive to small changes in intensity. NIPS. Cited by: §1.
  • [11] D. P. Kingma and M. Welling (2014) Auto-encoding variational Bayes. Proceedings of the 2nd International Conference on Learning Representations (ICLR). Cited by: §1.
  • [12] C. E, T. G, and A. E. C. et. al (2019) DHP19: dynamic vision sensor 3d human pose dataset. CVPRW. Cited by: §1, §3.2.3.
  • [13] I. Goodfellow and J. P. et. al (2014) Generative adversarial nets. Advances in neural information processing systems, pp. 2672–2680. Cited by: §1.
  • [14] M. Hasan, J. Choi, and J. N. et. al (2016) Learning temporal regularity in video sequences. CVPR, pp. 733–742. Cited by: §1.
  • [15] G. E. Hinton and R. R. Salakhutdinov (2006) Reducing the dimensionality of data with neural networks. Science 313 (5786), pp. 504–507. Cited by: §2.1.1.
  • [16] M. A. I and L. A. et. al (2018) Event-based vision meets deep learning on steering prediction for self-driving cars. CVPR, pp. 5419–5427. Cited by: §1.
  • [17] G. Ian, Y. Bengio, and A. Courville (2016) Deep learning. MIT press. Cited by: §2.1.2.
  • [18] F. N. Iandola and M. W. M. et al. (2016) SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size. arXiv preprint arXiv:1602.07360. Cited by: §2.1.1.
  • [19] S.W. Joo and R. Chellappa (2006) Attribute grammar-based event recognition and anomaly detection. CVPRW, pp. 107–107. Cited by: §1.
  • [20] X. Lagorce, G. Orchard, and F. G. et al. (2017) HOTS: a hierarchy of event-based time-surfaces for pattern recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (7), pp. 1346–1359. Cited by: §1.
  • [21] H. Li, H. Liu, X. Ji, G. Li, and L. Shi (2017) Cifar10-dvs: an event-stream dataset for object classification. Frontiers in neuroscience 11 (309). Cited by: §3.1.
  • [22] P. Lichtsteiner, C. Posch, and T. Delbruck (2008) A 128 x 128 120 dB 15 µs latency asynchronous temporal contrast vision sensor. IEEE Journal of Solid-State Circuits 43 (2), pp. 566–576. Cited by: §1.
  • [23] M. Lin, Q. Chen, and S. Yan (2013) Network in network. arXiv preprint arXiv:1312.4400. Cited by: §2.1.1.
  • [24] W. Luo, W. Liu, and S. Gao (2017) Remembering history with convolutional lstm for anomaly detection. IEEE International Conference on Multimedia and Expo (ICME), pp. 439–444. Cited by: §1.
  • [25] R. M and N. M. et. al (2017) Abnormal event detection in videos using generative adversarial nets. IEEE International Conference on Image Processing (ICIP), pp. 1577–1581. Cited by: §1.
  • [26] J. R. Medel and A. Savakis (2016) Anomaly detection in video using predictive convolutional long short-term memory networks. arXiv preprint arXiv:1612.00390. Cited by: §1.
  • [27] J. R. Medel (2016) Anomaly detection using predictive convolutional long short-term memory units. Rochester Institute of Technology. Cited by: §1.
  • [28] J. C. S. Miguel and J. M. Martinez (2008) Robust unattended and stolen object detection by fusing simple algorithms. Advanced Video and Signal Based Surveillance, pp. 18–25. Cited by: §1.
  • [29] C. Y. A. Mitrokhin and C. P. et. al (2019) Unsupervised learning of dense optical flow and depth from sparse event data. arXiv:1809.08625. Cited by: §1.
  • [30] D. P. Moeys and F. C. et al. (2017) A sensitive dynamic and active pixel vision sensor for color or neural imaging applications. IEEE Transactions on Biomedical Circuits and Systems 12 (1), pp. 123–136. Cited by: §1.
  • [31] E. Mueggler and H. R. et al. (2017) The event-camera dataset and simulator: event-based data for pose estimation, visual odometry, and SLAM. The International Journal of Robotics Research. Cited by: §3.1.
  • [32] A. Ng (2011) Sparse autoencoder. CS294A Lecture notes 72 (2011), pp. 1–19. Cited by: §1.
  • [33] A. Nguyen, T.T. Do, D. G. Caldwell, and N. G. Tsagarakis (2017) Real-time pose estimation for event cameras with stacked spatial lstm networks. arXiv preprint arXiv:1708.09011. Cited by: §1.
  • [34] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. CVPR, pp. 1125–1134. Cited by: item 2, §1.
  • [35] P. K. Park and B. H. C. et al. (2016) Performance improvement of deep learning based gesture recognition using spatiotemporal demosaicing technique. Image Processing (ICIP), 2016 IEEE International Conference on, pp. 1624–1628. Cited by: §1.
  • [36] C. Posch, D. Matolin, and R. Wohlgenannt (2008) An asynchronous timebased image sensor. IEEE International Symposium on Circuits and Systems, pp. 2130–2133. Cited by: §1.
  • [37] M. Ravanbakhsh, E. Sangineto, M. Nabi, and N. Sebe (2017) Training adversarial discriminators for cross-channel abnormal event detection in crowds. CoRR, vol. abs/1706.07680. Cited by: §1.
  • [38] A. Russell and G. O. et. al (2010) Optimization methods for spiking neurons and networks. IEEE transactions on neural networks, pp. 123–136. Cited by: §1.
  • [39] M. Sabokrou, M. Fathy, and M. Hoseini (2016) Video anomaly detection and localisation based on the sparsity and reconstruction error of auto-encoder. Electronics Letters 52 (13), pp. 1122–1124. Cited by: §1.
  • [40] T. Schlegl, P. Seebock, and S. M. W. at. al (2017) Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. International Conference on Information Processing in Medical Imaging, pp. 146–157. Cited by: §1.
  • [41] T. Serrano-Gotarredona and B. Linares-Barranco (2013) A 128 x 128 1.5% contrast sensitivity 0.9% FPN 3 µs latency 4 mW asynchronous frame-free dynamic vision sensor using transimpedance preamplifiers. IEEE Journal of Solid-State Circuits 48 (3), pp. 827–838. Cited by: §1.
  • [42] M. Shu, C. Guang, N. Xiangyu, Z. Yang, R. Kejia, B. Zhenshan, and K. A. C (2019) Neuromorphic benchmark datasets for pedestrian detection, action recognition, and fall detection. Frontiers in neurorobotics 13, pp. 38. Cited by: §3.2.3.
  • [43] C. Szegedy, W. Liu, and Y. J. et. al (2015) Going deeper with convolutions. Computer Vision and Pattern Recognition. External Links: Link Cited by: §2.1.1.
  • [44] T. D. Nguyen, T. Le, H. Vu, and D. Phung (2017) Dual discriminator generative adversarial nets. Advances in Neural Information Processing Systems, pp. 2670–2680. Cited by: item 2.
  • [45] W. Liu, W. Luo, D. Lian, and S. Gao (2018) Future frame prediction for anomaly detection – a new baseline. CVPR, pp. 6536–6545. Cited by: §1.
  • [46] M. Yan, X. Jiang, and J. Yuan (2018) 3D convolutional generative adversarial networks for detecting temporal irregularities in videos. 24th ICPR. Cited by: §1.
  • [47] A. Z. Zhu, L. Yuan, K. Chaney, and K. Daniilidis (2019) Unsupervised event-based learning of optical flow, depth, and egomotion. CVPR, pp. 989–997. Cited by: §1, §2.
  • [48] A. Zhu, L. Yuan, K. Chaney, and K. Daniilidis (2018) EV-flownet: self-supervised optical flow estimation for event-based cameras. Proceedings of Robotics: Science and Systems. Cited by: §1, §3.2.3.