When datasets become large and complex, it is often more important and interesting to know what stands out in the data than to learn about its general behavior. The anomaly detection, an important branch of data mining, is to discover rare occurrences in datasets. This field has wide applications in finance, security, health care, law enforcement, and many others.
, classifying Web and email spam, monitoring data-center , detecting malware/spyware , and image/video surveillance 
. Anomaly detection has been studied in a variety of data domains including high-dimensional data, uncertain data , streaming data , network data , and time series data .
Different types of data always require techniques from different domains. Inspecting anomalies in time series needs to examine the behavior of the data across time. Thus, several models were proposed in the statistics literature, which includes autoregressive integrated moving average (ARIMA), autoregressive moving average (ARMA), CUmulative SUM Statistics (CUSUM), vector autoregression (VARMA), exponentially weighted moving average, etc. In this work, we focus on multi-dimensional time series in streams, where temporal dependencies are highly non-linear and data is analyzed in an online way. Yamanishi et al. 
present SmartSifter algorithm for time series in streams, which employs an online discounting learning algorithm to incrementally learn the probabilistic mixture model, with a decay factor to account for anomalies. Other than the online discounting methods, large number of methods use dynamically maintained cluster models for computing outliers from data streams such as.
Because of the rising applications of social network, wearable devices and sensor networks, the analysis of structured (graph) time series becomes more and more important. In this work, we also incorporate graph structures of data into anomaly detection. Ide et al. 
proposed an eigenvector-based method for anomaly detection over graph streams. A similar method is also proposed by Akoglu et al. to spot anomalous points in time at which many agents in an agent network change their behavior in a way such that it deviates from the norm. However, these methods only learn graph behavior in the time interval , and cannot work well on time series with long and complex repeating patterns.
In this paper, we combine variation inference (VI) and recurrent neural network (RNN) to model the graph time series, and detect anomalies based on log likelihood. RNNs can be employed for a wide range of tasks as they inherit their flexibility from plain neural networks. This includes universal approximation capabilities, since RNNs are capable of approximating any measurable sequence to sequence mapping and have been shown to be Turing complete . However, the internal transition structure of the standard RNN is entirely deterministic. There is recent evidence that when complex sequences are modeled, the performances of RNNs can be dramatically improved when uncertainty is included in their hidden states . In 
, authors propose to extend the variational autoencoder (VAE) into a recurrent framework for modeling complex sequences which is called variational RNN (VRNN). The proposed time series model is based on VRNN.
In order to incorporate the structural information, the graphical convolution neural network (GCNN) is applied to extract features from graphical data. We use K-localized filter to incorporate and utilize the local information on the graph. In practice, time series are influenced by external features, which should be considered in outliers detection. Our model is conditioned on the external features as extra input. The experiments on human wearable senor data and traffic flow data show the detection capability of the proposed model.
2.1 Recurrent Neural Network
RNNs are discrete-time state–space models trainable by specialized weight adaptation algorithms. The input to RNN is a variable-length sequence which can be recursively processed. And when processing each symbol, RNN maintains its internal hidden state . The operation of RNN at each timestep can be formulated as
where is the deterministic state transition function and is the parameter of
where is the function mapping the hidden state to the output distribution conditioned on previous observations, which is parameterized by .
The representational power of an RNN is limited by the output function in (1). With the deterministic transition function , the joint probabilities which can be represented by the RNN are determined by the function .
In this work, the function
can be realized by long short-term memory. Since the state transition is performed in a deterministic way, when modeling high-dimensional and structured time series, the variability embedded in the hidden states cannot be learned and expressed. That’s because the conditional output probability density is the only source of variability. So, it is necessary to introduce stochasticity into the hidden state transition. And in analyzing some highly structured time series, the hidden state of RNNs is not expressive enough to capture the state transition dynamics. Some previous work have introduced extra latent variables to model the output functions . Different from other work, following 
, this work makes the prior distribution of the latent random variableat timestep dependent on all the preceding inputs via the RNN hidden state which can improve the representational power of the model.
2.2 Variational Autoencoder
Different from previous work, we detect anomalies based on the conditional probability of time series conditioned on previous data. However, it is difficult to compute the probability of complex time series. Recently variational autoencoder (VAE)  has been shown to be a powerful tool to approximate the intractable complex posterior in the data space. The VAE introduces a set of latent random variables , designed to capture the variations underlie the observed variables . Typically VAE models the conditional probability as a highly flexible function approximator such as a neural network, which makes the inference of the posterior intractable. Thus VAE uses a variational approximation of the posterior, which introduces the evidence lower bound (ELBO):
is the Kullback-Leibler divergence between two distributionsand .
In , the approximate posterior is a Gaussian and the mean
and varianceare modeled as the output of neural networks of . The prior
is assumed to simple standard Gaussian distribution. The training process is to maximize the ELBO, which can yield optimal selection of parameters for generative modeland inference model . Based on re-parameterizing trick, we formulate and rewrite:
is drawn from standard Gaussian distribution. Then, both generative and inference model can be trained by standard backpropagation technique based on gradient descent.
3 Anomaly Detection Model
In this work, our model is based on variational recurrent neural network (VRNN) , which is a recurrent extension of VAE. And we extend it to anomaly detection on graph time series by graph Laplacian transform. Moreover, in order to better model the practical situation, we incorporate external features as extra input.
3.1 Graph Laplacian Transform
The graph time series always demonstrate strong spatial dependency. Each node is strongly influenced by its neighbors. For example, on highways, the sensors are installed every 1-2 miles, and the traffic flow of adjacent sensors are highly correlated. Moreover, the wearable sensors on human body are always dependent on the neighboring sensors, and this dependency is different in different activities. However, VRNN cannot explicitly model such spatial dependencies. In this work, VRNN is augmented by modeling spatial dependency in an efficient way.
In order to model spatial dependency, in vertex domain, the hidden state transition of certain node should incorporate the hidden states of neighboring nodes weighted by correlation coefficients. However, it is quite time-consuming to consider the pairwise dependencies of every vertex in a large graph. So, it is difficult to express a meaningful translation operator in the vertex domain . Therefore, we transform the graph time series from vertex domain into spectral domain by graph Laplacian transform.
Assume that the graph time series are defined on undirected and connected graph , where is a finite set of vertices, is a set of edges and is a weighted adjacency representing the weight between two vertices. Define the snapshot of graph time series at time , , as a signal defined on the nodes of the graph, where the i-th element of is its value at the i-th node. Define the diagonal matrix as . The graph Laplacian operator  is the one-step diffusion of the signal defined on the graph, where the Laplacian matrix is defined as and normalized version is .
Diagonalizing the Laplacian matrix , where and
are eigenvectors and eigenvalues of the Laplacian matrix
, the graph Fourier transform is defined asand its inverse as . The spectral filtering of graph signals  is defined as:
where and is a vector of Fourier coefficients. However, this filter is enough, since the Laplacian operator is local and working on 1-hop neighborhoods. The th power of Laplacian operator is supported by exactly k-hop neighbors , representing the signals on the graph at different scales. Since the th power of Laplacian matrix is expensive to compute, We apply the Chebyshev polynomial expansion  in the following:
where is the learned filter based on Laplacian matrix parameterized by , and is the Chebyshev polynomial of order with . The computational complexity is reduced from to .
3.2 Generative Model
In this model, VAE is applied in a recurrent setting. At each timestep, different from previous work , the prior of latent variables is dependent on the hidden state of RNN, which makes the output function incorporate the temporal structure of the time series. Then we have:
where and denote the mean and variance of the prior distribution. Then, the generating distribution will be conditioned on both latent variable and hidden state such that:
where and denote the mean and variance of the generating distribution, and can be realized by neural networks. Since the latent variables are assumed to be independent of each other at the same time, the feature extractor can introduce dependencies to better model the sequential variation. Inside the RNN, the hidden state is updated as follows:
where transition function parameterized by is realized by LSTM cell , which can capture both long and short-term temporal dependencies. In order to capture the spatial dependencies, the function is modeled by graph spectral filter (3) with Chebyshev polynomial expansion, i.e. . From (6), we know that hidden state is dependent on and . The distributions (4) and (5) define and respectively. We can see that the parameterization of this generative model is motivated by the factorization of joint probability as below:
3.3 Conditional Inference Model
Denote the external feature at time , such as holiday and meteorological data, as a extra input vector . It is first processed by a two-layer neural networks denoted by . Then it is directly added to the mean of latent variables , which has no influences on the variance . Based on Gaussian assumption, the approximate posterior is a function of input and hidden state as below:
where and denote the mean and variance of the approximate posterior. Conditioning on , the posterior follows the factorization:
3.4 Detection Objective Function
which is learning objective function during the training. By maximizing (10), we first train this model on time series data without anomalies. The optimization problem is solved by ADAM  algorithm. The optimal generative and inference model can be learned simultaneously. During the testing phase, at time , we detect the anomaly based on the ELBO bound as below:
which is an approximate to the log likelihood.
3.5 Likelihood Ratio Test
Different from previous work , we can locate anomalies on the graph via likelihood ratio test (LRT). In statistics, people use the likelihood ratio test to compare the fit of two models, one of which (the null model) is a special case of (or ”nested within”) the other (the alternative model). This often occurs when testing whether a simplifying assumption for a model is valid, as when two or more model parameters are assumed to be related. Each of the two compared models, the null model and the alternative model, is separately fitted to the data with the log-likelihood recorded 
. The test statistics (often denoted by) is twice negative the difference in these log-likelihoods:
Whether the alternative model fits the data significantly better than the null model can be determined by deriving the probability or p-value of the obtained difference .
When applying LRT to a single node on the graph, based on the generative model, we assume its value follows Gaussian distribution with mean and variance calculated in (5). Suppose the observed value at node is , the likelihood ratio is:
where is the new parameter changing over that fits the observed data best; denotes the supreme function that finds the maximizer of over . For simplicity, we assume the mean and variance in the has same proportion as . The anomalous degree od of this test is calculated as below:
denotes the cumulative density function of Chi-Square distribution;df
is the degree of freedom, which means the the number of free parameters of the null model and the alternative model, respectively. The nodes withod larger than a given threshold are likely to be anomalous.
For example, suppose a certain node at time follows Gaussian distribution . Assume the its observed value at time is 30. Then we calculate the degree of anomaly at as below. The likelihood of the null model is . In order to achieve the supreme of likelihood, the alternative model should have mean 30 and variance 60. The likelihood of alternative model is . According to above equations, the anomalous degree is calculated as below:
Normally, we set the threshold of anomalous degree to be 0.95.
4 Experimental Study: Beijing Taxi Flows
In this experiment, the city of Beijing is partitioned into a grid map based on the longitude and latitude where a grid denotes a region. The dataset contains Beijing’s taxicab’s trajectories and meteorological data . At -th time interval, the inflow and outflow in all
regions can be denoted as a tensorwhere represent the amount of taxi infow and outflow in region . The inflow matrix is shown as below.
The traffic flows are always influenced by external factors, such as weather and holidays. There are 16 types of weather conditions, such as sunny and rainy. The holidays include all public holidays in China, with total number of 41. The figures below show the influence of Chinese Spring Festival and thunderstorm on the CBD of Beijing, i.e., grid (18,22) on the map.
The other external factors are weekday, temperature and windspeed. The weekdays are ranging from Monday to Sunday. The temperature is from C to C. And windspeed is within 0 and 48.6mph. For examples, traffic flows in weekends are usually lower than working days. And strong wind can make people stay at home and reduce the traffic. In implementation, we use one-hot binary vectors to represent weekday, holiday and weather, and use float numbers to represent temperature and windspeed.
The dataset contains traffic flows ranging in two time intervals, i.e., 1st Mar. 2015 to 30th Jun. 2015, and from 1st Nov. 2015 to 10th Apr. 2016. The traffic flow in every region is scaled into . Each data point shows the traffic flow in 30 minutes. The first 80% data points are used for training, and the rest is for testing. Since the external factors are already given, we assume that the training set doesn’t have any anomalies.
Anomalies are manually added into testing set. All anomalies can be categorized into global anomaly and local anomaly. The global anomaly is always outstanding from global perspective, which has significant difference from normal values and covers a relatively wide range. The local anomalies are bound between the seasonal minimum and maximum and may not appear to be outliers from a global perspective. We evaluate the detection performances on the following types of anomalies (, anomaly center ):
(a) global mean shift (GMS): with and
(b) local mean shift (LMS): with and
(c) global amplitude change (GAC): multiplying multiple dimensions of the data with where function is a Gaussian random variable with 0 mean and , and
(d) local amplitude change (LAC): multiplying single dimensions of the data with where function is a Gaussian random variable with 0 mean and , and .
where is the generated variance in (5). The detection examples are shown as below. The y-axis label shows the index of the corresponding dimension. The red-dotted points are detected outliers.
The following table shows the detection performance on four types of anomalies, where each type of anomaly is tested 200 times in different locations.
This paper shows the the application of variational inference and recurrent neural networks on time series anomaly detection. With comprehensive experiments, we prove that the proposed model can have competitive performance on anomaly detection in graph time series. The spectral filter can better learn the spatial information, and the variational recurrent neural networks can model the temporal correlation. The multi-variable Gaussian model is sensitive enough to outliers. Different from previous work, based on likelihood ratio test, our model can not only detect anomalous time points, but also localize the anomalous nodes (positions) on the graph. In the future, we plan to apply this model to detect video outliers.
-  Q. Ding, N. Katenka, P. Barford, E. D. Kolaczyk, and M. Crovella. Intrusion as (anti)social communication: characterization and detection. In Proceedings of the 18th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Beijing, China, pages 886-894. ACM, 2012.
-  C. Cortes, D. Pregibon, and C. Volinsky. Communities of interest. Intelligent Data Analysis, 6(3):211-219, 2002.
-  C. Castillo, D. Donato, A. Gionis, V. Murdock, and F. Silvestri. Know your neighbors: web spam detection using the web topology. In Proceedings of the 30th International Conference on Research and Development in Information Retrieval (SIGIR), Amsterdam, pages 423-430. ACM, 2007.
-  L. Li, C. Liang, J. Liu, S. Nath, A. Terzis, and C. Faloutsos. Thermocast: A cyber-physical forecasting model for data centers. In Proceedings of the 17th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), San Diego, CA. ACM, 2011.
-  L. Invernizzi and P. M. Comparetti. Evilseed: A guided approach to finding malicious web pages. In IEEE Symposium on Security and Privacy, pages 428-442, 2012.
-  B. Krausz and R. Herpers. MetroSurv: detecting events in subway stations. Multimedia Tools and Applications, 50(1):123-147, 2010.
-  C. C. Aggarwal and P. S. Yu, Outlier detection for high dimensional data, SIGMOD Rec., vol. 30, pp. 37-46, May 2001.
-  C. C. Aggarwal and P. S. Yu. Outlier detection with uncertain data, in Proc. 2008 SIAM Int. Conf. SDM, pp. 483-493.
-  C. Aggarwal and K. Subbian. Event detection in social streams, in Proc. 12th SIAM Int. Conf. SDM, 2012, pp. 624-635.
-  C. C. Aggarwal, Y. Zhao, and P. S. Yu, Outlier detection in graph streams, in Proc. 27th ICDE, Hannover, Germany, 2011, pp. 399-409.
-  J. P. Burman and M. C. Otto, Census bureau research project: Outliers in time series, Bureau of the Census, SRD Res. Rep. CENSUS/SRD/RR-88114, May 1988.
-  G. Deco and B. Schurmann, Neural learning of chaotic dynamics. Neural Process Lett., vol. 2, no. 2, pp. 23-26, 1995.
-  Archer, Evan, et al. Black box variational inference for state space models. arXiv preprint arXiv:1511.07367 (2015).
-  V. Barnett and T. Lewis, Outliers in Statistical Data. New York, NY, USA: Wiley, 1978.
-  D. M. Hawkins, Identification of Outliers. London, U.K.: Chapman and Hall, 1980.
-  P. J. Rousseeuw and A. M. Leroy, Robust Regression and Outlier Detection. New York, NY, USA: Wiley, 1987.
-  K. Yamanishi and J.-I. Takeuchi. A unifying framework for detecting outliers and change points from non-stationary time series data, in Proc. 8th ACM Int. Conf. KDD, Edmonton, AB, Canada, 2002, pp. 676-681.
-  K. Sequeira and M. Zaki. ADMIT: Anomaly-based data mining for intrusions, in Proc. 8th ACM Int. Conf. KDD, New York, NY, USA, 2002, pp. 386-395.
T. Idé and H. Kashima. Eigenspace-based anomaly detection in computer systems, in Proc. 10th ACM Int. Conf. KDD, Seattle, WA, USA, 2004, pp. 440-449.
-  L. Akoglu and C. Faloutsos, “Event detection in time series of mobile communication graphs,” in Proc. Army Science Conf., 2010.
-  B. Hammer. On the approximation capability of recurrent neural networks. Neurocomputing, 31(1):107-123, 2000.
-  J. Bayer and C. Osendorfer. Learning stochastic recurrent networks. arXiv:1411.7610, 2014.
-  J. Chung, K. Kastner, L. Dinh, K. Goel, A. C. Courville, and Y. Bengio. A recurrent latent variable model for sequential data. In NIPS, pages 2962-2970, 2015.
-  N. Boulanger-Lewandowski, Y. Bengio, and P. Vincent. Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. arXiv:1206.6392, 2012.
-  S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.
-  A. Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
-  J. Bayer and C. Osendorfer. Learning stochastic recurrent networks. arXiv preprint arXiv:1411.7610, 2014.
-  D. P. Kingma and M. Welling. Auto-encoding variational bayes. In Proceedings of the International Conference on Learning Representations (ICLR), 2014.
-  D. Kingma, and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
-  M. Sölch, J. Bayer, M. Ludersdorfer and P. van der Smagt. Variational inference for on-line anomaly detection in high-dimensional time series. arXiv preprint arXiv:1602.07109 (2016).
-  J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun. Spectral Networks and Locally Connected Networks on Graphs. In International Conference on Learning Representations (ICML), 2014.
-  F. R. K. Chung. Spectral Graph Theory, volume 92. American Mathematical Society, 1997.
-  M. Defferrard, X. Bresson, and P. Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems (NIPS), 2016.
-  D. I. Shuman, S. K. Narang, P. Frossard, A. Ortega, and P. Vandergheynst. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Processing Magazine, 30(3):83-98, 2013.
-  M. Wu, X. Song, C. Jermaine, et al, A LRT framework for fast spatial anomaly detection, In Proc. of KDD ’09, pp. 887-896.
-  Junbo Zhang, Yu Zheng, Dekang Qi. Deep Spatio-Temporal Residual Networks for Citywide Crowd Flows Prediction. In AAAI 2017.