Many real-life anomaly detection problems, including surveillance, infrastructure monitoring, environmental and natural disaster monitoring, border security using unattended ground sensors, crime hot-spot detection for law enforcement, and real-time traffic monitoring, involve multi-modal data. For example, in a traffic monitoring application, a decision maker who wishes to detect abnormal behavior or impending congestion may have access to CCTV imagery data, social media data, and other physical sensor data. Such applications need efficient algorithms that can detect anomalies or deviations from normal behavior as quickly as possible. Effective algorithms can be developed only when one has access to, and a good understanding of, the multi-modal data encountered in these applications. Motivated by this, in this paper, we develop statistical models and algorithms for detecting anomalous behavior in multi-modal data. The statistical models studied here are motivated by an analysis of a real-life multi-modal traffic monitoring dataset.
The datasets studied in this paper were collected by us around a 5K run that occurred in New York City on Sunday, September 24th, 2017. We collected data on two Sundays before the run, and one Sunday after the run. We collected CCTV images and Twitter and Instagram posts over a geographic region from the Red Hook village in Brooklyn on the south end to the Tribeca village on the north end of the collection area. An analysis of the data reveals that the 5K run changes the average counts of persons and vehicles appearing in the CCTV cameras and the number of Instagram posts per second posted in the geographical areas near the run. The counts of persons and vehicles appearing in the CCTV images were obtained by passing the images through a convolutional neural network-based object detector. See Fig. 1. The analysis also suggests that the data have periodic or cyclostationary behavior (see Section 2 for more details). In general, in many monitoring applications, a certain cyclostationary behavior is expected, especially while observing long-term patterns of life, unless an unexpected event occurs.
In this paper, we define a statistical model to capture the cyclostationary behavior. We also develop sequential algorithms to detect deviations from the learned cyclostationary behavior. We develop these algorithms in the framework of quickest change detection and provide their delay and false alarm analysis. The salient features of our paper are as follows.
- We use a novel framework, introduced by us in our previous work, for decision making using multi-modal data involving CCTV images and social media data. In this framework, we use a deep neural network to extract counts of objects from the images. This count is then combined with counts of the number of Tweets and Instagram posts near the CCTV cameras. The decision making is then based on the sequence of counts.
- We define the concept of an independent and periodically identically distributed (i.p.i.d.) process (see Definition 1). We model the count data as an instance of an i.p.i.d. process. We then propose novel algorithms to detect deviations from learned i.p.i.d. behavior.
- We define the concept of asymptotic efficiency for a change point detection algorithm (see Definition 2) and show that our proposed algorithms are asymptotically efficient.
Machine learning and signal processing algorithms for event detection have been developed in the literature. However, in these studies, the abnormal event is often well defined or can be created to train a model. Since the algorithms proposed by us are based on detecting deviations from learned normal behavior, our framework allows for decision making in rare-event scenarios where the anomalous behavior is hard to learn.
2 Data Analysis
Details of the data collected, including information on the deep neural network employed, timings, and frame rates, can be found in our previous work. The objective is to detect the 5K run from the multi-modal data collected. In Figs. 2 to 4 below, we have plotted averages of the count data collected on the four days: one event day (Sept. 24) and three non-event days (Sept. 10, Sept. 17, and Oct. 1). The data were extracted in 3-second intervals and averaged over a sliding window of size 1000. The figures show plots for two selected cameras: one that was away from the path of the run, called the off-path camera, and one that was near the path of the run, called the on-path camera.
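The smoothing step described above can be sketched in a few lines of Python. This is a minimal illustration, not the processing code used for the figures: the function name `sliding_average` is hypothetical, and we assume a trailing window, so the first outputs average over fewer than `window` samples. The text does not say whether the window is trailing or centered.

```python
from collections import deque


def sliding_average(counts, window=1000):
    """Trailing moving average of a count sequence.

    Assumption: a trailing window; the first window-1 outputs
    average over the samples seen so far.
    """
    buf = deque()
    total = 0
    out = []
    for c in counts:
        buf.append(c)
        total += c
        if len(buf) > window:
            total -= buf.popleft()  # drop the oldest sample
        out.append(total / len(buf))
    return out
```

With a 3-second sampling interval, a window of 1000 samples spans 50 minutes of data.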
In Fig. 2(a), we have plotted the average person count for the off-path camera, and in Fig. 2(b), the average person count for the on-path camera. Similar plots for the average vehicle counts are shown in Fig. 3(a) and Fig. 3(b), and for the Instagram counts in Fig. 4(a) and Fig. 4(b). The Instagram counts in Fig. 4 were obtained by averaging the counts of the Instagram posts in the geographical vicinity of the off-path and on-path cameras.
We see a clear increase in the average count on the event day for the on-path camera. Thus, the 5K run event can be detected using the count sequences from both CCTV data and social media posts. More generally, we can expect counts and sequences of sub-events to capture information about anomalous behavior. For example, an event happening twice in a day or two events happening too close to each other may indicate a deviation from normal behavior.
We see from the figures that the data are nonstationary in nature, even on non-event days. We also observe similar statistical behavior across all four days in the data from the off-path camera, and across the non-event days in the data from the on-path camera. The data also have cyclic behavior. For example, the Instagram count data in Fig. 4(a) show a trend that repeats itself every Sunday. Thus, the anomaly detection problem here can be rephrased either as the problem of detecting deviations from normal nonstationary behavior or as the problem of detecting deviations from normal cyclostationary behavior. In our previous work, we studied a Bayesian problem that captures the problem of detecting changes in the levels of nonstationarity. In this paper, we study the latter problem.
3 Mathematical Model and Problem Formulation
The central modeling object in this paper is the following.
Definition 1. A stochastic process $\{X_n\}$ is called independent and periodically identically distributed (i.p.i.d.) if the random variables $\{X_n\}$ are independent, and there is a positive integer $T$ such that for each $i \in \{1, \dots, T\}$, the process $\{X_{i+nT}\}_{n \ge 0}$ is independent and identically distributed (i.i.d.).
An i.p.i.d. process with period $T$ can be seen as an interleaved version of $T$ i.i.d. stochastic processes, interleaved in a round-robin fashion. An i.p.i.d. process is a wide-sense cyclostationary process, but it has more structure that we will exploit to develop efficient algorithms. We model a count observation sequence as an i.p.i.d. process. Although counts are discrete in nature, the following discussion is valid for more general random variables as well.
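The round-robin interleaving can be made concrete with a short simulation. The sketch below is an illustration under our own assumptions: the counts are taken to be Poisson, the helper names `poisson_sample`, `simulate_ipid`, and `deinterleave` are hypothetical, and the sampler uses Knuth's multiplication method, which is only suitable for moderate means.

```python
import math
import random


def poisson_sample(rng, lam):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def simulate_ipid(rates, n_periods, seed=0):
    """Generate an i.p.i.d. Poisson sequence with period T = len(rates).

    Sample n (0-based) is drawn independently with mean rates[n % T].
    """
    rng = random.Random(seed)
    T = len(rates)
    return [poisson_sample(rng, rates[n % T]) for n in range(n_periods * T)]


def deinterleave(x, T):
    """Split x into T subsequences; subsequence i holds x[i], x[i+T], x[i+2T], ..."""
    return [x[i::T] for i in range(T)]
```

Each subsequence returned by `deinterleave` is an i.i.d. sample from one position in the period, which is the structure the definition asks for.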
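In our statistical model, the variables $\{X_n\}$ in the i.p.i.d. process have distributions in a parametric family $\{f(\cdot\,; \theta)\}$ with parameters $\{\theta_n\}$, and the parameter sequence is periodic with period $T$. In other words, we have the sequence model
$$X_n \sim f(\cdot\,; \theta_n), \qquad \theta_{n+T} = \theta_n, \qquad n \ge 1. \tag{1}$$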
If the data are collected once per hour and the pattern repeats daily, then in the above model the period $T$ corresponds to the 24 hours in a day, and the variables $X_1, \dots, X_T$ correspond to the data collected each hour. In many applications, the data are collected more frequently, at the rate of many samples per second. In such applications, the period $T$ can be much larger. For example, with a daily cycle it would equal $24 \times 3600 \times N$, where $N$ is the number of samples collected per second. Note that the statistical model in (1) has only $T$ parameters, $\theta_1, \dots, \theta_T$.
The statistical problem we wish to solve is described as follows. Given the parameters $(\theta_1, \dots, \theta_T)$, the objective is to observe the process $\{X_n\}$ sequentially over time and detect any changes in the values of any of the parameters. This change has to be detected in real time with the minimum possible delay, subject to a constraint on the rate of false alarms. The baseline parameters in the problem, namely the period $T$ and the parameters within a period $(\theta_1, \dots, \theta_T)$, can be learned from the training data. General tests for learning an i.p.i.d. process will be reported elsewhere. In this paper, we will make additional modeling assumptions to make the learning process simpler.
Note that the sequence model (1) studied in this paper is different from the sequence model studied in the nonparametric estimation literature. In that model, the random variables are modeled as Gaussian random variables and the parameters are not periodic. Furthermore, the problem there is the simultaneous estimation of all the parameters given all the observations. That is, the problem is not sequential in nature. It is also not a change point problem.
To summarize, in the absence of an anomaly, we model the data as a nonstationary process. But, we believe there is some regularity in the statistical properties of the process. This allows us to model the data as a cyclostationary process. The type of cyclostationary behavior we are interested in is captured by the i.p.i.d. process defined above. The objective in the anomaly detection problem then is to detect a deviation away from a learned cyclostationary or i.p.i.d. behavior.
The algorithm to be used for change detection will depend on the pattern of changes that we assume in the statistical model. We now discuss two change point models for our problem. As discussed above, if the number of samples taken per second is $N$ and the statistical behavior of the data repeats itself after one week, then we have $T = 7 \times 24 \times 3600 \times N$. In practice, it may be hard to learn such a large number of parameters and to detect changes in them. In order to control the complexity of the problem, we assume that the parameters are divided into batches and that the parameters in each batch are approximately constant. For example, a batch may correspond to the data collected in an hour, over which the average count of objects may not change. Mathematically, we assume that in each cycle or period of length $T$, the vector of parameters $(\theta_1, \dots, \theta_T)$ is partitioned into $B$ batches or episodes. Specifically, for $B \le T$ and positive integers $N_1, \dots, N_B$ (the batch sizes), batch $b$ consists of the $N_b$ consecutive parameters that follow the first $N_1 + \dots + N_{b-1}$ parameters of the period. Note that we have $N_1 + \dots + N_B = T$.
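We further assume a step model for the parameters. Let $B$ denote the number of batches in a period and $N_b$ the size of batch $b$. Under the step model, the parameters remain constant within a batch, resulting in a step-wise constant sequence model in which every parameter in batch $b$ takes a common value $\theta^{(b)}$. That is, the first $N_1$ parameters of the period equal $\theta^{(1)}$, the next $N_2$ parameters equal $\theta^{(2)}$, and so on. Thus, if the batch sizes are large, there are only $B \ll T$ parameters to learn from the data. Also, each period contributes $N_b$ samples for batch $b$. The objective is then to observe the process $\{X_n\}$ over time and detect any changes in the parameters $\theta^{(1)}, \dots, \theta^{(B)}$.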
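Under the step model, learning the baseline reduces to estimating one parameter per batch. The following Python sketch illustrates this under assumptions of our own: the batch parameter is estimated by the sample mean of the training counts in that batch (the natural estimate when the parameter is a mean, as for Poisson counts), time indices are 0-based, and the names `batch_index` and `learn_batch_means` are hypothetical.

```python
def batch_index(n, batch_sizes):
    """Return the 0-based batch of 0-based time n; the period is sum(batch_sizes)."""
    pos = n % sum(batch_sizes)  # position of n within its period
    for b, size in enumerate(batch_sizes):
        if pos < size:
            return b
        pos -= size


def learn_batch_means(train, batch_sizes):
    """Estimate one parameter per batch as the mean of the training counts in it."""
    sums = [0.0] * len(batch_sizes)
    nums = [0] * len(batch_sizes)
    for n, x in enumerate(train):
        b = batch_index(n, batch_sizes)
        sums[b] += x
        nums[b] += 1
    # A batch with no training samples gets NaN rather than a made-up value.
    return [s / m if m else float("nan") for s, m in zip(sums, nums)]
```

For instance, with batch sizes `[2, 3]` the period is 5 and time indices 0 to 6 map to batches 0, 0, 1, 1, 1, 0, 0.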
We now define two change point models. Let $\nu$ be the change point. If $\nu = \infty$, i.e., no change occurs, then the stochastic process that we observe follows the baseline step model at all times. If $\nu < \infty$, i.e., a change occurs at a finite time $\nu$, we have two possible change point models. For each time index $n$, we define the batch of $n$ as the index of the batch that contains the position of $n$ within its period.
- Change in parameter values in a single batch: In this model, the distribution of the random variables changes only inside a specific batch. That is, starting at the change point $\nu$, the parameter values change at all times that fall in this batch. Also, the post-change parameter may be different at each such time, even within the batch. The affected batch is not known to the decision maker.
- Change in parameter values in all the batches: In this model, the distribution of the random variables changes for all the batches.
In a traffic monitoring scenario, if the period $T$ corresponds to a day, the single-batch change point model may correspond to anomalous behavior between 7 am and 8 am every day, while the all-batch change point model may correspond to anomalous behavior throughout the day.
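The two change point models can be illustrated by writing out which parameter governs each sample. The Python sketch below is a simplified illustration: it assumes a single post-change parameter per batch (the general single-batch model allows the post-change parameter to vary within the batch), uses 1-based time with samples at times $n \ge \nu$ treated as post-change, and the function name `parameter_at` and its arguments are hypothetical.

```python
def parameter_at(n, nu, pre, post, batch_sizes, changed_batch=None):
    """Parameter governing sample n (1-based) when the change point is nu.

    pre[b] / post[b]: baseline and post-change parameter of batch b (0-based).
    changed_batch=None models a change in all the batches; an integer models
    a change confined to that batch. nu=float('inf') means no change.
    """
    pos = (n - 1) % sum(batch_sizes)  # position of n within its period
    b = 0
    while pos >= batch_sizes[b]:
        pos -= batch_sizes[b]
        b += 1
    affected = changed_batch is None or b == changed_batch
    return post[b] if (n >= nu and affected) else pre[b]
```

With two batches of size 2, baselines `[1, 5]`, post-change values `[2, 10]`, and $\nu = 4$, the single-batch model for batch 0 leaves the sample at time 4 unchanged because time 4 falls in batch 1, whereas the all-batch model changes it immediately.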
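We wish to find a stopping time $\tau$ for the sequence $\{X_n\}$ so as to minimize some version of the average detection delay $\tau - \nu$, where $\nu$ is the change point, subject to a constraint on the false alarm rate. A popular criterion studied in the literature is that by Pollak:
$$\min_{\tau} \; \sup_{\nu \ge 1} \; \mathsf{E}_\nu[\tau - \nu \mid \tau \ge \nu] \quad \text{subject to} \quad \mathsf{E}_\infty[\tau] \ge \gamma, \tag{7}$$
where $\mathsf{E}_\nu$ denotes expectation with respect to the probability measure when the change occurs at time $\nu$ (with $\nu = \infty$ meaning that no change occurs), and $\gamma$ is a given constraint on the mean time to false alarm. Finding an optimal solution to such a minimax quickest change detection problem is generally hard. We, therefore, propose algorithms (stopping times), and show that they have the following important property, which we also define.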
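Definition 2. A stopping time $\tau$ is called asymptotically efficient for a change point problem if, as the false alarm constraint $\gamma \to \infty$,
$$\mathsf{E}_\infty[\tau] \ge \gamma \, (1 + o(1)),$$
and there exists a positive constant $C$ such that
$$\sup_{\nu \ge 1} \mathsf{E}_\nu[\tau - \nu \mid \tau \ge \nu] \le C \log \gamma \, (1 + o(1)).$$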
We note that most of the classical optimal algorithms in the literature are asymptotically efficient, while a trivial algorithm like the deterministic rule $\tau = \gamma$ is not, since its delay grows linearly in $\gamma$. Furthermore, according to fundamental limit theorems on change point detection, the delay of any stopping time that satisfies the false alarm constraint $\gamma$ cannot be of a smaller order of magnitude than $\log \gamma$. Thus, asymptotic efficiency is an important property for a change detection algorithm. Comments on optimality with respect to Pollak’s criterion (7) or Lorden’s criterion will be provided in an extended version of this paper.
4 Algorithms for Anomaly Detection
The change point models defined in (5) and (6) are similar to the change point models studied in the sensor network literature, where a change can affect one or all of the sensors. Observations from a batch can be viewed as observations from a sensor. The important difference between our problem and the sensor network problem is that the decision maker here observes the data from the batches in sequence, i.e., does not have access to all the data at the same time. Nonetheless, the analogy between the two problems provides us with guidelines for identifying relevant algorithms for our problem. We will make some assumptions about the way the change occurs to simplify our notation, algorithms, and analysis. Algorithms for more general change point models can be developed by following the techniques discussed below.
4.1 Algorithm for Detecting Change in a Single Batch
We assume that after the change occurs in a single batch, the post-change parameter is the same for all the variables in that batch. Since it is not known in which batch the change occurs, we execute one algorithm per batch and raise an alarm as soon as any of them detects the change. Mathematically, for each batch $b$ we define a statistic computed from the observations, together with a stopping time $\tau_b$ for the batch $b$ that raises an alarm when this statistic crosses a threshold. The statistic involves a parameter that sets the minimum amount of change from the baseline parameter that the algorithm can detect. Note that a condition on the batch of each observation ensures that only data from the batch $b$ are utilized for computing its statistic. Our change detection algorithm is the minimum of these stopping times, $\min_b \tau_b$.
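The false alarm result is true because the proposed stopping time stochastically dominates Lorden’s stopping time designed for the pre-change parameter. The finite family assumption and martingale arguments imply that a suitable choice of the threshold ensures a mean time to false alarm of at least $\gamma$, as $\gamma \to \infty$. For the delay, it can be shown that, as $\gamma \to \infty$, the detection delay grows on the order of $\log \gamma$, with a constant determined by the Kullback-Leibler divergence between the post-change and pre-change distributions in the affected batch, implying asymptotic efficiency. ∎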
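A detector of this type can be sketched as follows. This is a minimal illustration under stated assumptions and is not the exact statistic analyzed above: we assume Poisson counts, a finite list of candidate post-change means per batch (standing in for the finite family assumption), and one CUSUM recursion per batch and candidate. The names `poisson_llr`, `single_batch_detector`, `batch_of`, `candidates`, and `threshold` are hypothetical.

```python
import math


def poisson_llr(x, lam0, lam1):
    """Log-likelihood ratio log f(x; lam1) / f(x; lam0) for Poisson counts."""
    return x * math.log(lam1 / lam0) - (lam1 - lam0)


def single_batch_detector(xs, batch_of, baseline, candidates, threshold):
    """Run one CUSUM per (batch, candidate) pair; stop when any crosses the threshold.

    xs: observed counts; batch_of(n): 0-based batch of 0-based time n;
    baseline[b]: learned pre-change mean of batch b (positive);
    candidates[b]: finite list of hypothesised post-change means (positive).
    Returns (1-based alarm time, alarming batch), or (None, None) if no alarm.
    """
    stats = [[0.0] * len(c) for c in candidates]
    for n, x in enumerate(xs):
        b = batch_of(n)  # only the statistics of the current batch are updated
        for j, lam1 in enumerate(candidates[b]):
            stats[b][j] = max(0.0, stats[b][j] + poisson_llr(x, baseline[b], lam1))
            if stats[b][j] > threshold:
                return n + 1, b
    return None, None
```

Stopping at the first crossing by any per-batch statistic is equivalent to taking the minimum of the per-batch stopping times. Keeping candidates at a minimum distance from the baseline plays the role of the minimum detectable change.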
4.2 Algorithm for Detecting Change in All the Batches
We assume that after the change occurs, the post-change parameter is the same for all the variables in a batch. Since the change occurs in all the batches, we use an algorithm that combines observations from all the batches. Mathematically, we compute a statistic that pools the observations from all the batches and declare an anomaly at the stopping time at which this statistic first crosses a threshold.
Independence and separation of the suprema over the batches relate the pooled statistic to the single-batch statistics. The false alarm result then follows from the previous theorem. For the delay analysis, note that removing the maximum operators gives a simpler statistic that lower-bounds the pooled statistic. Asymptotic efficiency follows because the behavior of the latter is similar to that of a random walk, together with standard renewal-theoretic arguments. ∎
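The idea of combining evidence across batches can be sketched with a single CUSUM recursion. This is an illustration under our own simplifying assumptions, not the statistic analyzed above: we assume Poisson counts and one known post-change mean per batch, so no maximization over parameters is needed. The names `all_batch_detector`, `batch_of`, `post`, and `threshold` are hypothetical.

```python
import math


def all_batch_detector(xs, batch_of, baseline, post, threshold):
    """CUSUM that accumulates log-likelihood ratios across all batches.

    baseline[b] / post[b]: pre- and post-change Poisson means of batch b.
    Returns (1-based alarm time or None, path of the statistic).
    """
    w, path = 0.0, []
    for n, x in enumerate(xs):
        b = batch_of(n)
        # Poisson log-likelihood ratio using the parameters of the current batch.
        llr = x * math.log(post[b] / baseline[b]) - (post[b] - baseline[b])
        w = max(0.0, w + llr)
        path.append(w)
        if w > threshold:
            return n + 1, path
    return None, path
```

Unlike the per-batch detectors, every observation contributes to the same statistic here, so evidence gathered in one batch carries over to the next.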
5 Numerical Results and Conclusions
We now apply the developed algorithm to the NYC data. Due to a paucity of space, the performance of the algorithm for simulated data will be reported elsewhere. We apply the all-batch test of Section 4.2 to the count data because the change appears to affect the entire day’s data. In Fig. 5(a), we have plotted the evolution of the test statistic for all the count data: person count, vehicle count, and Instagram count. In the figure, the data for each modality are arranged in a concatenated fashion, with labeled segments separated by red vertical lines. Each day has the same number of samples. To compute the statistic, we divided the data into four batches, with the first three batches being of equal length. We modeled the data as a sequence of Poisson random variables. We used the count data from Sept. 10 (one of the non-event days) to learn the averages of these Poisson random variables for each of the four batches. We assumed that there is only one post-change parameter per batch, equal to twice the normal parameter for that batch (half the normal parameter for vehicles). We then applied the test to all four days of data. In Fig. 5(b), we have replotted the test statistic applied to the Instagram counts. As seen from the figures, the algorithm detects the anomaly that occurs on Sept. 24 (event day).
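For the Poisson model with a doubled or halved post-change mean, the log-likelihood ratio of a single count has a simple closed form. The worked computation below is an illustration. The symbols $\lambda$ (baseline mean of a batch) and $x$ (an observed count) are introduced here and do not appear in the original text.

```latex
\begin{align*}
\log \frac{f(x; 2\lambda)}{f(x; \lambda)}
  &= \log \frac{e^{-2\lambda} (2\lambda)^x / x!}{e^{-\lambda} \lambda^x / x!}
   = x \log 2 - \lambda, \\
\log \frac{f(x; \lambda/2)}{f(x; \lambda)}
  &= \log \frac{e^{-\lambda/2} (\lambda/2)^x / x!}{e^{-\lambda} \lambda^x / x!}
   = -x \log 2 + \frac{\lambda}{2}.
\end{align*}
```

In the doubled case, the expected increment is $\lambda(\log 2 - 1) < 0$ before the change and $\lambda(2 \log 2 - 1) > 0$ after it. The post-change value is the Kullback-Leibler divergence between the two Poisson distributions, so a cumulative statistic built from these increments drifts downward on normal days and upward once the mean doubles.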
In the future, we will apply the algorithms to other multi-modal datasets to test their effectiveness. We will also study the optimality of the proposed algorithms for Lorden’s and Pollak’s criteria.
-  T. Banerjee, G. Whipps, P. Gurram, and V. Tarokh, “Sequential event detection using multimodal data in nonstationary environments,” in Proc. of the 21st International Conference on Information Fusion, July 2018.
-  S. Ren, K. He, R. B. Girshick, and J. Sun, “Faster R-CNN: towards real-time object detection with region proposal networks,” CoRR, vol. abs/1506.01497, 2015.
-  K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” CoRR, vol. abs/1409.1556, 2014.
-  M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The PASCAL visual object classes (VOC) challenge,” International Journal of Computer Vision, vol. 88, pp. 303–338, June 2010.
-  V. V. Veeravalli and T. Banerjee, Quickest Change Detection. Academic Press Library in Signal Processing: Volume 3 – Array and Statistical Signal Processing, 2014. http://arxiv.org/abs/1210.5552.
-  H. V. Poor and O. Hadjiliadis, Quickest detection. Cambridge University Press, 2009.
-  A. G. Tartakovsky, I. V. Nikiforov, and M. Basseville, Sequential Analysis: Hypothesis Testing and Change-Point Detection. Statistics, CRC Press, 2014.
-  R. Panda and A. K. Roy-Chowdhury, “Multi-view surveillance video summarization via joint embedding and sparse optimization,” IEEE Transactions on Multimedia, vol. 19, no. 9, pp. 2010–2021, 2017.
-  S. C. Lee and R. Nevatia, “Hierarchical abnormal event detection by real time and semi-real time multi-tasking video surveillance system,” Machine vision and applications, vol. 25, no. 1, pp. 133–143, 2014.
-  R. Szechtman, M. Kress, K. Lin, and D. Cfir, “Models of sensor operations for border surveillance,” Naval Research Logistics (NRL), vol. 55, no. 1, pp. 27–41, 2008.
-  D. B. Neill and W. L. Gorr, “Detecting and preventing emerging epidemics of crime,” Advances in Disease Surveillance, vol. 4, no. 13, 2007.
-  R. Mitchell and I. R. Chen, “Effect of intrusion detection and response on reliability of cyber physical systems,” IEEE Transactions on Reliability, vol. 62, pp. 199–210, March 2013.
-  E. D’Andrea, P. Ducange, B. Lazzerini, and F. Marcelloni, “Real-time detection of traffic from Twitter stream analysis,” IEEE Transactions on Intelligent Transportation Systems, vol. 16, pp. 2269–2283, Aug 2015.
-  E. W. Dereszynski and T. G. Dietterich, “Probabilistic models for anomaly detection in remote sensor data streams,” arXiv preprint arXiv:1206.5250, 2012.
-  T. Sakaki, M. Okazaki, and Y. Matsuo, “Earthquake shakes Twitter users: Real-time event detection by social sensors,” in Proceedings of the 19th Int. Conf. on World Wide Web, pp. 851–860, ACM, 2010.
-  W. A. Gardner, A. Napolitano, and L. Paura, “Cyclostationarity: Half a century of research,” Signal processing, vol. 86, no. 4, pp. 639–697, 2006.
-  I. M. Johnstone, Gaussian estimation: Sequence and wavelet models. Book Draft, 2017. Available for download from http://statweb.stanford.edu/~imj/GE_08_09_17.pdf.
-  A. B. Tsybakov, Introduction to nonparametric estimation. Springer Series in Statistics. Springer, New York, 2009.
-  M. Pollak, “Optimal detection of a change in distribution,” Ann. Statist., vol. 13, pp. 206–227, Mar. 1985.
-  T. L. Lai, “Information bounds and quick detection of parameter changes in stochastic systems,” IEEE Trans. Inf. Theory, vol. 44, pp. 2917–2929, Nov. 1998.
-  G. Lorden, “Procedures for reacting to a change in distribution,” Ann. Math. Statist., vol. 42, pp. 1897–1908, Dec. 1971.
-  A. G. Tartakovsky and V. V. Veeravalli, “An efficient sequential procedure for detecting changes in multichannel and distributed systems,” in IEEE International Conference on Information Fusion, vol. 1, (Annapolis, MD), pp. 41–48, July 2002.
-  Y. Mei, “Efficient scalable schemes for monitoring a large number of data streams,” Biometrika, vol. 97, pp. 419–433, Apr. 2010.
-  A. G. Tartakovsky and V. V. Veeravalli, “Asymptotically optimal quickest change detection in distributed sensor systems,” Sequential Analysis, vol. 27, pp. 441–475, Oct. 2008.
-  T. Banerjee, H. Firouzi, and A. O. Hero III, “Quickest detection for changes in maximal knn coherence of random matrices,” arXiv preprint arXiv:1508.04720, 2015.
-  M. Woodroofe, Nonlinear Renewal Theory in Sequential Analysis. CBMS-NSF regional conference series in applied mathematics, SIAM, 1982.