1 Introduction
Complex systems are ubiquitous in modern manufacturing industry and information services. Monitoring the behaviors of these systems generates a substantial amount of multivariate time series data, such as the readings of the networked sensors (e.g., temperature and pressure) distributed in a power plant or the connected components (e.g., CPU usage and disk I/O) in an Information Technology (IT) system. A critical task in managing these systems is to detect anomalies at certain time steps so that the operators can take further actions to resolve the underlying issues. For instance, an anomaly score can be produced from the sensor data and used as an indicator of power plant failure [León, Vittal, and Manimaran2007]. Accurate detection is crucial to avoid serious financial and business losses, as even one minute of downtime at an automotive manufacturing plant has been reported to be costly [Djurdjanovic, Lee, and Ni2003]. In addition, pinpointing the root causes, i.e., identifying which sensors (system components) cause an anomaly, can help the system operator perform system diagnosis and repair in a timely manner. In real-world applications, it is common that a short-term anomaly caused by temporal turbulence or a system status switch does not eventually lead to a true system failure, owing to the auto-recovery capability and robustness of modern systems. Therefore, it would be ideal if an anomaly detection algorithm could provide operators with different levels of anomaly scores based upon the severity of various incidents. For simplicity, we assume in this work that the severity of an incident is proportional to the duration of the anomaly. Figure 1(a) illustrates two anomalies, each marked by a red dashed circle, in multivariate time series data. The root causes are the yellow and black time series, respectively, and the two anomalies differ in duration (severity level).
To build a system that can automatically detect and diagnose anomalies, one main problem is that few or even no anomaly labels are available in the historical data, which makes supervised algorithms [Görnitz et al.2013] infeasible. In the past few years, a substantial number of unsupervised anomaly detection methods have been developed. The most prominent techniques include distance/clustering methods [He, Xu, and Deng2003, Hautamäki, Kärkkäinen, and Fränti2004, Idé, Papadimitriou, and Vlachos2007, Campello et al.2015], probabilistic methods [Chandola, Banerjee, and Kumar2009], density estimation methods [Manevitz and Yousef2001], temporal prediction approaches [Chen et al.2008, Günnemann, Günnemann, and Faloutsos2014], and the more recent deep learning techniques [Malhotra et al.2016, Qin et al.2017, Zhou and Paffenroth2017, Wu et al.2018, Zong et al.2018]. Despite the intrinsic unsupervised setting, most of them may still not be able to detect anomalies effectively for the following reasons:

There exists temporal dependency in multivariate time series data. For this reason, distance/clustering methods, e.g., k-Nearest Neighbor (kNN) [Hautamäki, Kärkkäinen, and Fränti2004], classification methods, e.g., One-Class SVM [Manevitz and Yousef2001], and density estimation methods, e.g., Deep Autoencoding Gaussian Mixture Model (DAGMM) [Zong et al.2018], may not perform well since they cannot capture temporal dependencies across different time steps.

Multivariate time series data usually contain noise in real-world applications. When the noise becomes relatively severe, it may affect the generalization capability of temporal prediction models, e.g., Autoregressive Moving Average (ARMA) [Hamilton1994] and LSTM encoder-decoder [Malhotra et al.2016, Qin et al.2017], and increase the number of false positive detections.

In real-world applications, it is meaningful to provide operators with different levels of anomaly scores based upon the severity of different incidents. The existing methods for root cause analysis, e.g., Ranking Causal Anomalies (RCA) [Cheng et al.2016], are sensitive to noise and cannot handle this issue.

In this paper, we propose a Multi-Scale Convolutional Recurrent Encoder-Decoder (MSCRED) to jointly consider the aforementioned issues. Specifically, MSCRED first constructs multi-scale (resolution) signature matrices to characterize multiple levels of the system statuses across different time steps. In particular, different levels of the system statuses are used to indicate the severity of different abnormal incidents. Subsequently, given the signature matrices, a convolutional encoder is employed to encode the inter-sensor (time series) correlation patterns and an attention based Convolutional Long-Short Term Memory (ConvLSTM) network is developed to capture the temporal patterns. Finally, with the feature maps which encode the inter-sensor correlations and temporal information, a convolutional decoder is used to reconstruct the signature matrices, and the residual signature matrices are further utilized to detect and diagnose anomalies. The intuition is that MSCRED may not reconstruct the signature matrices well if it has never observed similar system statuses before. For example, Figure 1(b) shows two signature matrices, one from a normal period and one from an abnormal period. Ideally, MSCRED cannot reconstruct the abnormal-period matrix well, because the training matrices (e.g., the normal-period one) are distinct from it. To summarize, the main contributions of our work are:


We formulate the anomaly detection and diagnosis problem as three underlying tasks, i.e., anomaly detection, root cause identification, and anomaly severity (duration) interpretation. Unlike previous studies which investigate each problem independently, we address these issues jointly.

We introduce the concept of system signature matrix, develop MSCRED to encode the inter-sensor correlations via a convolutional encoder, incorporate temporal patterns with attention based ConvLSTM networks, and reconstruct the signature matrix via a convolutional decoder. As far as we know, MSCRED is the first model that considers correlations among multivariate time series for anomaly detection and can jointly resolve all three tasks.

We conduct extensive empirical studies on a synthetic dataset as well as a power plant dataset. Our results demonstrate the superior performance of MSCRED over state-of-the-art baseline methods.
2 Related Work
Unsupervised anomaly detection on multivariate time series data is a challenging task and various types of approaches have been developed in the past few years.
One traditional type is the distance methods [Hautamäki, Kärkkäinen, and Fränti2004, Idé, Papadimitriou, and Vlachos2007]. For instance, the k-Nearest Neighbor (kNN) algorithm [Hautamäki, Kärkkäinen, and Fränti2004] computes the anomaly score of each data sample based on the average distance to its k nearest neighbors. Similarly, the clustering models [He, Xu, and Deng2003, Campello et al.2015] cluster different data samples and find anomalies via a predefined outlierness score. In addition, the classification methods, e.g., One-Class SVM [Manevitz and Yousef2001], model the density distribution of training data and classify new data as normal or abnormal. Although these methods have demonstrated their effectiveness in various applications, they may not work well on multivariate time series since they cannot capture the temporal dependencies appropriately. To address this issue, temporal prediction methods, e.g., Autoregressive Moving Average (ARMA) [Hamilton1994] and its variants [Brockwell and Davis2013], have been used to model temporal dependency and perform anomaly detection. However, these models are sensitive to noise and thus may produce more false positives when noise is severe. Other traditional methods include correlation methods [Kriegel et al.2012], ensemble methods [Lazarevic and Kumar2005], etc.
Besides traditional methods, deep learning based unsupervised anomaly detection algorithms [Malhotra et al.2016, Zhai et al.2016, Zhou and Paffenroth2017, Zong et al.2018] have gained a lot of attention recently. For instance, Deep Autoencoding Gaussian Mixture Model (DAGMM) [Zong et al.2018] jointly considers a deep autoencoder and a Gaussian mixture model to model the density distribution of multi-dimensional data. LSTM encoder-decoder [Malhotra et al.2016, Qin et al.2017] models the temporal dependency of time series with LSTM networks and achieves better generalization capability than traditional methods. Despite their effectiveness, they cannot jointly consider the temporal dependency, noise resistance, and the interpretation of severity of anomalies.
In addition, our model design is inspired by fully convolutional neural networks [Long, Shelhamer, and Darrell2015], convolutional LSTM networks [Shi et al.2015], and attention techniques [Bahdanau, Cho, and Bengio2014, Yang et al.2016]. This paper is also related to other time series applications such as clustering/classification [Li and Prakash2011, Hallac et al.2017, Karim et al.2018], segmentation [Keogh et al.2001, Lemire2007], and so on.
3 MSCRED Framework
In this section, we first introduce the problem we aim to study and then elaborate the proposed Multi-Scale Convolutional Recurrent Encoder-Decoder (MSCRED) in detail. Specifically, we first show how to generate multi-scale (resolution) system signature matrices. Then, we encode the spatial information in signature matrices via a convolutional encoder and model the temporal information via an attention based ConvLSTM. Finally, we reconstruct signature matrices with a convolutional decoder and use a square loss to perform end-to-end learning.
3.1 Problem Statement
Given the historical data of $n$ time series with length $T$, i.e., $X = (\mathbf{x}_1, \cdots, \mathbf{x}_n)^{\top} \in \mathbb{R}^{n \times T}$, and assuming that there exists no anomaly in the data, we aim to achieve two goals:

Anomaly detection, i.e., detecting anomaly events at certain time steps after $T$.

Anomaly diagnosis, i.e., given the detection results, identifying the abnormal time series that are most likely to be the causes of each anomaly and interpreting the anomaly severity (duration scale) qualitatively.
3.2 Characterizing Status with Signature Matrices
Previous studies [Hallac et al.2017, Song et al.2018] suggest that the correlations between different pairs of time series are critical to characterize the system status. To represent the inter-correlations between different pairs of time series in a multivariate time series segment from $t-w$ to $t$, we construct an $n \times n$ signature matrix $M^t$ based upon the pairwise inner-product of two time series within this segment. Two examples of signature matrices are shown in Figure 1(b). Specifically, given two time series $\mathbf{x}_i^w = (x_i^{t-w}, x_i^{t-w+1}, \cdots, x_i^{t})$ and $\mathbf{x}_j^w = (x_j^{t-w}, x_j^{t-w+1}, \cdots, x_j^{t})$ in a multivariate time series segment $X^w$, their correlation $m^t_{ij} \in M^t$ is calculated with:
$$m^{t}_{ij} = \frac{\sum_{\delta=0}^{w} x_i^{t-\delta}\, x_j^{t-\delta}}{\kappa} \tag{1}$$
where $\kappa$ is a rescale factor ($\kappa = w$). The signature matrix, i.e., $M^t$, not only can capture the shape similarities and value scale correlations between two time series, but also is robust to input noise, as the turbulence at certain time series has little impact on the signature matrix as a whole. In this work, the interval between two segments is set as 10. In addition, to characterize system status at different scales, we construct $s$ ($s = 3$) signature matrices with different lengths ($w = 10, 30, 60$) at each time step.
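The construction above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names `signature_matrix` and `multiscale_signature` are hypothetical, the segment is assumed to cover steps t-w through t inclusive, and the rescale factor is assumed to equal the segment length.

```python
import numpy as np

def signature_matrix(X, t, w, kappa=None):
    """Signature matrix of the segment X[:, t-w .. t] (inclusive).

    X has shape (n, T): one row per time series. Entry (i, j) is the inner
    product of series i and j over the segment, divided by kappa.
    """
    if kappa is None:
        kappa = w  # assumption: rescale by the segment length
    segment = X[:, t - w : t + 1]
    return segment @ segment.T / kappa

def multiscale_signature(X, t, windows=(10, 30, 60)):
    """Stack one signature matrix per window length into an (n, n, s) tensor."""
    return np.stack([signature_matrix(X, t, w) for w in windows], axis=-1)

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 200))   # toy data: n = 4 series, T = 200 steps
M = multiscale_signature(X, t=100)  # shape (4, 4, 3)
```

Each channel of `M` is symmetric by construction, which is why the order of the series only permutes rows and columns of the matrix without changing its content.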
3.3 Convolutional Encoder
We employ a fully convolutional encoder [Long, Shelhamer, and Darrell2015] to encode the spatial patterns of system signature matrices. Specifically, we concatenate $M^t$ at different scales as a tensor $\mathcal{X}^{t,0} \in \mathbb{R}^{n \times n \times s}$, and then feed it to a number of convolutional layers. Assuming that $\mathcal{X}^{t,l-1} \in \mathbb{R}^{n_{l-1} \times n_{l-1} \times d_{l-1}}$ denotes the feature maps in the $(l-1)$-th layer, the output of the $l$-th layer is given by:
$$\mathcal{X}^{t,l} = f\left(W^{l} * \mathcal{X}^{t,l-1} + b^{l}\right) \tag{2}$$
where $*$ denotes the convolutional operation, $f(\cdot)$ is the activation function, $W^{l} \in \mathbb{R}^{k_l \times k_l \times d_{l-1} \times d_l}$ denotes $d_l$ convolutional kernels of size $k_l \times k_l \times d_{l-1}$, $b^{l} \in \mathbb{R}^{d_l}$ is a bias term, and $\mathcal{X}^{t,l} \in \mathbb{R}^{n_l \times n_l \times d_l}$ denotes the output feature map at the $l$-th layer. In this work, we use Scaled Exponential Linear Unit (SELU) [Klambauer et al.2017] as the activation function and 4 convolutional layers, i.e., Conv1-Conv4 with 32, 64, 128, and 256 kernels, respectively. Note that the exact order of the time series based on which the signature matrices are formed is not important, because for any given permutation, the resulting local patterns can be captured by the convolutional encoder. Figure 2(a) illustrates the detailed encoding process of signature matrices.
3.4 Attention based ConvLSTM
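The spatial feature maps generated by the convolutional encoder are temporally dependent on previous time steps. Although ConvLSTM [Shi et al.2015] has been developed to capture the temporal information in a video sequence, its performance may deteriorate as the length of the sequence increases. To address this issue, we develop an attention based ConvLSTM which can adaptively select relevant hidden states (feature maps) across different time steps. Specifically, given the feature maps $\mathcal{X}^{t,l}$ from the $l$-th convolutional layer and the previous hidden state $\mathcal{H}^{t-1,l}$, the current hidden state $\mathcal{H}^{t,l}$ is updated with $\mathcal{H}^{t,l} = \mathrm{ConvLSTM}(\mathcal{X}^{t,l}, \mathcal{H}^{t-1,l})$, where the ConvLSTM cell [Shi et al.2015] is formulated as:
$$
\begin{aligned}
z^{t,l} &= \sigma\left(\tilde{W}^{l}_{XZ} * \mathcal{X}^{t,l} + \tilde{W}^{l}_{HZ} * \mathcal{H}^{t-1,l} + \tilde{W}^{l}_{CZ} \circ \mathcal{C}^{t-1,l} + \tilde{b}^{l}_{Z}\right) \\
r^{t,l} &= \sigma\left(\tilde{W}^{l}_{XR} * \mathcal{X}^{t,l} + \tilde{W}^{l}_{HR} * \mathcal{H}^{t-1,l} + \tilde{W}^{l}_{CR} \circ \mathcal{C}^{t-1,l} + \tilde{b}^{l}_{R}\right) \\
\mathcal{C}^{t,l} &= z^{t,l} \circ \tanh\left(\tilde{W}^{l}_{XC} * \mathcal{X}^{t,l} + \tilde{W}^{l}_{HC} * \mathcal{H}^{t-1,l} + \tilde{b}^{l}_{C}\right) + r^{t,l} \circ \mathcal{C}^{t-1,l} \\
o^{t,l} &= \sigma\left(\tilde{W}^{l}_{XO} * \mathcal{X}^{t,l} + \tilde{W}^{l}_{HO} * \mathcal{H}^{t-1,l} + \tilde{W}^{l}_{CO} \circ \mathcal{C}^{t,l} + \tilde{b}^{l}_{O}\right) \\
\mathcal{H}^{t,l} &= o^{t,l} \circ \tanh\left(\mathcal{C}^{t,l}\right)
\end{aligned} \tag{3}
$$
where $*$ denotes the convolutional operator, $\circ$ represents the Hadamard product, $\sigma$ is the sigmoid function, the $\tilde{W}^{l}$ terms are convolutional kernels, and the $\tilde{b}^{l}$ terms are bias parameters of the $l$-th layer ConvLSTM. In our work, we maintain the same convolutional kernel size as the convolutional encoder at each layer. Note that all the inputs $\mathcal{X}^{t,l}$, cell outputs $\mathcal{C}^{t,l}$, hidden states $\mathcal{H}^{t,l}$, and gates $z^{t,l}$, $r^{t,l}$, $o^{t,l}$ are 3D tensors, which is different from LSTM. We tune the step length $h$ (i.e., the number of previous segments) and set it as 5, which gives the best empirical performance. In addition, considering that not all previous steps are equally correlated to the current state $\mathcal{H}^{t,l}$, we adopt a temporal attention mechanism to adaptively select the steps that are relevant to the current step and aggregate the representations of those informative feature maps to form a refined output of feature maps $\hat{\mathcal{H}}^{t,l}$, which is given by:
$$\hat{\mathcal{H}}^{t,l} = \sum_{i \in (t-h,\,t)} \alpha^{i}\, \mathcal{H}^{i,l}, \qquad \alpha^{i} = \frac{\exp\left\{\mathrm{Vec}(\mathcal{H}^{t,l})^{\top}\, \mathrm{Vec}(\mathcal{H}^{i,l}) / \chi\right\}}{\sum_{j \in (t-h,\,t)} \exp\left\{\mathrm{Vec}(\mathcal{H}^{t,l})^{\top}\, \mathrm{Vec}(\mathcal{H}^{j,l}) / \chi\right\}} \tag{4}$$
where $\mathrm{Vec}(\cdot)$ denotes vectorization and $\chi$ is a rescale factor ($\chi = 5.0$). That is, we take the last hidden state $\mathcal{H}^{t,l}$ as the group level context vector and measure the importance weights $\alpha^{i}$ of previous steps through a softmax function. Unlike the general attention mechanism [Bahdanau, Cho, and Bengio2014] that introduces transformation and context parameters, the above formulation is purely based on the learned hidden feature maps and achieves a similar function. Essentially, the attention based ConvLSTM jointly models the spatial patterns of signature matrices with temporal information at each convolutional layer. Figure 2(b) illustrates the temporal modeling procedure.
3.5 Convolutional Decoder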
To decode the feature maps obtained in the previous step and get the reconstructed signature matrices, we design a convolutional decoder which is formulated as:
$$\hat{\mathcal{X}}^{t,l-1} = \begin{cases} f\left(\hat{W}^{t,l} \circledast \hat{\mathcal{H}}^{t,l} + \hat{b}^{t,l}\right), & l = 4 \\ f\left(\hat{W}^{t,l} \circledast \left[\hat{\mathcal{H}}^{t,l} \oplus \hat{\mathcal{X}}^{t,l}\right] + \hat{b}^{t,l}\right), & l = 3, 2, 1 \end{cases} \tag{5}$$
where $\circledast$ denotes the deconvolution operation, $\oplus$ is the concatenation operation, $f(\cdot)$ is the activation unit (same as the encoder), and $\hat{W}^{t,l}$ and $\hat{b}^{t,l}$ are the filter kernel and bias parameter of the $l$-th deconvolutional layer. Specifically, we follow the reverse order and feed $\hat{\mathcal{H}}^{t,l}$ of the $l$-th ConvLSTM layer to a deconvolutional neural network. The output feature map $\hat{\mathcal{X}}^{t,l-1}$ is concatenated with the output of the previous ConvLSTM layer, making the decoder process stacked. The concatenated representation is further fed into the next deconvolutional layer. The final output $\hat{\mathcal{X}}^{t,0}$ (with the same size as the input matrices) denotes the representations of the reconstructed signature matrices. As a result, we use 4 deconvolutional layers: DeConv4-DeConv1 with 128, 64, 32, and 3 kernels, respectively. The decoder is able to incorporate feature maps at different deconvolutional and ConvLSTM layers, which is effective in improving anomaly detection performance, as we will demonstrate in the experiments. Figure 2(c) illustrates the decoding procedure.
3.6 Loss Function
For MSCRED, the objective is defined as the reconstruction errors over the signature matrices, i.e.,
$$\mathcal{L}_{MSCRED} = \sum_{t} \sum_{c=1}^{s} \left\| \mathcal{X}^{t,0}_{:,:,c} - \hat{\mathcal{X}}^{t,0}_{:,:,c} \right\|_{F}^{2} \tag{6}$$
where $\mathcal{X}^{t,0} \in \mathbb{R}^{n \times n \times s}$. We employ mini-batch stochastic gradient descent together with the Adam optimizer [Kingma and Ba2014] to minimize the above loss. After a sufficient number of training epochs, the learned neural network parameters are utilized to infer the reconstructed signature matrices of validation and test data. Finally, we perform anomaly detection and diagnosis based on the residual signature matrices, which will be elaborated in the next section.
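As a small illustration of the objective, the sketch below computes the residual signature matrices and the summed squared reconstruction error with NumPy. The function names and the array layout (steps, n, n, s) are assumptions made for this example; the optimizer and the network itself are omitted.

```python
import numpy as np

def residual_matrices(X, X_hat):
    """Element-wise residual between input and reconstructed signature matrices."""
    return X - X_hat

def reconstruction_loss(X, X_hat):
    """Sum of squared Frobenius norms of the residuals over all steps and channels.

    X and X_hat are assumed to have shape (steps, n, n, s).
    """
    return float(np.sum(residual_matrices(X, X_hat) ** 2))

X = np.ones((2, 3, 3, 3))        # toy inputs: 2 steps, n = 3, s = 3
X_hat = np.zeros_like(X)         # a deliberately poor reconstruction
loss = reconstruction_loss(X, X_hat)
```

Summing the squared entries of the whole tensor is equivalent to summing the squared Frobenius norm of each channel at each step, so no explicit loop over channels is needed.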
4 Experiments
In this section, we conduct extensive experiments to answer the following research questions:


Anomaly detection. Can MSCRED outperform baseline methods for anomaly detection in multivariate time series (RQ1)? How does each component of MSCRED affect its performance (RQ2)?

Anomaly diagnosis. Can MSCRED perform root cause identification (RQ3) and anomaly severity (duration) interpretation (RQ4) effectively?

Robustness to noise. Compared with baseline methods, is MSCRED more robust to input noise (RQ5)?
4.1 Experimental Setup
4.1.1 Data.
We use a synthetic dataset and a real world power plant dataset for empirical studies. The detailed statistics and settings of these two datasets are shown in Table 1.


Synthetic data. Each time series is formulated as:
$$S(t) = \begin{cases} \underbrace{\sin}_{C1}\big[\underbrace{(t - t_0)/\omega}_{C2}\big] + \underbrace{\lambda \cdot \epsilon}_{C3}, & s_{rand} = 0 \\ \underbrace{\cos}_{C1}\big[\underbrace{(t - t_0)/\omega}_{C2}\big] + \underbrace{\lambda \cdot \epsilon}_{C3}, & s_{rand} = 1 \end{cases} \tag{7}$$
where $s_{rand}$ is a 0 or 1 random seed. The above formula captures three attributes of multivariate time series: (a) the trigonometric function (C1) simulates temporal patterns; (b) the time delay $t_0$ and frequency $\omega$ (C2) simulate various periodic cycles; (c) the random Gaussian noise $\epsilon$ scaled by the factor $\lambda$ (C3) simulates data noise as well as various shapes. In addition, two sinusoidal waves have high correlation if their frequencies are similar and they are almost in phase. By randomly selecting the frequency and phase of each time series, we expect some pairs to have high correlations while others have low correlations. We randomly generate 30 time series, each of which includes 20,000 points. Besides, 5 shock-wave-like anomalies (with a value range similar to that of normal data, as in the examples in Figure 1(a)) are randomly injected into 3 random time series (root causes) during the test period. The duration of each anomaly belongs to one of three scales, i.e., 30, 60, 90.

Power plant data. This dataset was collected from a real power plant. It contains 36 time series generated by sensors distributed in the power plant system. It has 23,040 time steps and contains one anomaly identified by the system operator. Besides, we randomly inject 4 additional anomalies (similar to what we did in the synthetic data) into the test period for a thorough evaluation.
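| Statistics | Synthetic | Power Plant |
| --- | --- | --- |
| # time series | 30 | 36 |
| # points | 20,000 | 23,040 |
| # anomalies | 5 | 5 |
| # root causes | 3 | 3 |
| train period | 0–8,000 | 0–10,080 |
| valid period | 8,001–10,000 | 10,081–18,720 |
| test period | 10,001–20,000 | 18,721–23,040 |

Table 1: The detailed statistics and settings of two datasets.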
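The synthetic series described in this subsection (a sine or cosine wave with a random time delay and frequency, plus scaled Gaussian noise) could be generated along the following lines. This is only a sketch: the function name `synthetic_series`, the sampling ranges for the time delay and frequency, and the noise factor of 0.3 are assumptions chosen for illustration, and anomaly injection is not shown.

```python
import numpy as np

def synthetic_series(length, rng, t0_range=(50, 100), omega_range=(40, 50), lam=0.3):
    """One noisy sinusoid: sin or cos of (t - t0) / omega plus lam * Gaussian noise."""
    t = np.arange(length)
    t0 = rng.uniform(*t0_range)          # assumed range for the time delay
    omega = rng.uniform(*omega_range)    # assumed range for the frequency term
    wave = np.sin if rng.integers(0, 2) == 0 else np.cos  # 0/1 random seed
    return wave((t - t0) / omega) + lam * rng.standard_normal(length)

rng = np.random.default_rng(42)
data = np.stack([synthetic_series(20000, rng) for _ in range(30)])  # shape (30, 20000)
```

Series that happen to draw similar delays and frequencies will be strongly correlated, while others will not, which gives the signature matrices a non-trivial structure to learn.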
4.1.2 Baseline methods.
We compare MSCRED with eight baseline methods of four categories, i.e., classification model, density estimation model, temporal prediction model, and variants of MSCRED.


Classification model. It learns a decision function and classifies test data as similar or dissimilar to the training set. We use the One-Class SVM model (OC-SVM) [Manevitz and Yousef2001] for comparison.

Density estimation model. It models data density for outlier detection. We use the Deep Autoencoding Gaussian Mixture Model (DAGMM) [Zong et al.2018] and take the energy score [Zong et al.2018] as the anomaly score.

Prediction model. It models the temporal dependencies of training data and predicts the value of test data. We employ three methods: History Average (HA), Auto-Regression Moving Average (ARMA) [Hamilton1994] and LSTM encoder-decoder (LSTM-ED) [Cho et al.2014]. The anomaly score is defined as the average prediction error over all time series.

MSCRED variants. Besides the above baseline methods, we consider three variants of MSCRED to justify the effectiveness of each component: (1) $\text{CNN}^{ED(4)}_{ConvLSTM}$ is MSCRED with the attention module and the first three ConvLSTM layers removed. (2) $\text{CNN}^{ED(3,4)}_{ConvLSTM}$ is MSCRED with the attention module and the first two ConvLSTM layers removed. (3) $\text{CNN}^{ED}_{ConvLSTM}$ is MSCRED with the attention module removed.

We employ TensorFlow to implement MSCRED and its variants, and train them on a server with an Intel(R) Xeon(R) CPU E5-2637 v4 3.50GHz and 4 NVIDIA GTX 1080 Ti graphics cards. The parameter settings of MSCRED are described in the model section. In addition, the anomaly score is defined as the number of poorly reconstructed pairwise correlations, i.e., the number of elements in the residual signature matrices whose value is larger than a given threshold $\theta$, where $\theta$ is determined empirically over different datasets.
4.1.3 Evaluation metrics.
We use three metrics, i.e., Precision, Recall, and F1 Score, to evaluate the anomaly detection performance of each method. To detect anomalies, we follow the suggestion of a domain expert by setting a threshold $\tau = \beta \cdot \max\{s(t)_{valid}\}$, where $s(t)_{valid}$ are the anomaly scores over the validation period and $\beta$ is set to maximize the F1 Score over the validation period. Recall and Precision scores over the test period are computed based on this threshold. Experiments on both datasets are repeated 5 times and the average results are reported for comparison. Note that the output of MSCRED contains three channels of residual signature matrices w.r.t. different segment lengths. We use the smallest one ($w = 10$) for the following anomaly detection and root cause identification evaluation. The performance comparison of the three channel results will also be provided for anomaly severity interpretation.
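The scoring and thresholding procedure can be sketched as follows. The helper names are hypothetical, the residual is assumed to be compared by absolute value, and the detection threshold is assumed to be a multiple (here called `beta`) of the largest validation score; the search over `beta` is left out.

```python
import numpy as np

def anomaly_score(residual, theta):
    """Number of entries of one residual signature matrix that exceed theta."""
    return int(np.sum(np.abs(residual) > theta))

def detect(scores, valid_scores, beta):
    """Flag time steps whose score exceeds beta times the largest validation score."""
    tau = beta * np.max(valid_scores)
    return np.asarray(scores) > tau

def precision_recall_f1(pred, truth):
    """Precision, recall, and F1 for boolean prediction and ground-truth arrays."""
    tp = int(np.sum(pred & truth))
    fp = int(np.sum(pred & ~truth))
    fn = int(np.sum(~pred & truth))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

residual = np.array([[0.1, 0.9], [0.9, 0.2]])
score = anomaly_score(residual, theta=0.5)           # two entries exceed 0.5

pred = detect(scores=[2, 4, 7, 3], valid_scores=[1, 2, 3], beta=1.5)
truth = np.array([False, True, True, False])
precision, recall, f1 = precision_recall_f1(pred, truth)
```

In this toy run the threshold is 4.5, so only the third step is flagged: the detection is precise but misses one of the two true anomalies.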
4.2 Performance Evaluation
4.2.1 Anomaly detection result (RQ1, RQ2).
The performance of different methods for anomaly detection is reported in Table 2, where the best scores are highlighted in boldface and the best baseline scores are indicated by underline. The last row reports the improvement (%) of MSCRED over the best baseline method.


(RQ1: comparison with baselines) In Table 2, we observe that (a) temporal prediction models perform better than classification and density estimation models, indicating that both datasets have temporal dependency; (b) LSTM-ED performs better than ARMA, showing that a deep learning model can capture more complex relationships in the data than a traditional method; (c) MSCRED performs best in all settings. The improvements over the best baseline range from 13.3% to 30.0%. In other words, MSCRED is much better than baseline methods as it can model both inter-sensor correlations and temporal patterns of multivariate time series effectively.
In order to show the comparison in detail, Figure 3 provides a case study of MSCRED and the two best baseline methods, i.e., ARMA and LSTM-ED, for both datasets. We can observe that the anomaly score of ARMA is not stable and the results contain many false positives and false negatives. Meanwhile, the anomaly score of LSTM-ED is smoother than that of ARMA but still contains several false positives and false negatives. MSCRED can detect all anomalies without any false positive or false negative.
For a more convincing evaluation, we run an experiment on another synthetic dataset with 10 anomalies (it is easy to generate larger data with more anomalies). The average recall and precision scores (5 repeated experiments) of MSCRED are (0.84, 0.95) while the values of LSTM-ED are (0.64, 0.87). In addition, we run an experiment on another large power plant dataset which has 920 sensors and 11 labeled anomalies. The recall and precision scores of MSCRED are (7/11, 7/13) while the values of LSTM-ED are (5/11, 5/17). All evaluation results show the effectiveness of our model.
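| Method | Synthetic Pre | Synthetic Rec | Synthetic F1 | Power Plant Pre | Power Plant Rec | Power Plant F1 |
| --- | --- | --- | --- | --- | --- | --- |
| OC-SVM | 0.14 | 0.44 | 0.22 | 0.11 | 0.28 | 0.16 |
| DAGMM | 0.33 | 0.20 | 0.25 | 0.26 | 0.20 | 0.23 |
| HA | 0.71 | 0.52 | 0.60 | 0.48 | 0.52 | 0.50 |
| ARMA | 0.91 | 0.52 | 0.66 | 0.58 | 0.60 | 0.59 |
| LSTM-ED | 1.00 | 0.56 | 0.72 | 0.75 | 0.68 | 0.71 |
| $\text{CNN}^{ED(4)}_{ConvLSTM}$ | 0.37 | 0.24 | 0.29 | 0.67 | 0.56 | 0.61 |
| $\text{CNN}^{ED(3,4)}_{ConvLSTM}$ | 0.63 | 0.56 | 0.59 | 0.80 | 0.72 | 0.76 |
| $\text{CNN}^{ED}_{ConvLSTM}$ | 0.80 | 0.76 | 0.78 | 0.85 | 0.72 | 0.78 |
| MSCRED | 1.00 | 0.80 | 0.89 | 0.85 | 0.80 | 0.82 |
| Gain (%) | – | 30.0 | 23.8 | 13.3 | 19.4 | 15.5 |

Table 2: Anomaly detection results on two datasets.
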
(RQ2: comparison with model variants) In Table 2, we also observe that the performance of MSCRED improves as the number of ConvLSTM layers increases. Specifically, $\text{CNN}^{ED(3,4)}_{ConvLSTM}$ (two ConvLSTM layers) outperforms $\text{CNN}^{ED(4)}_{ConvLSTM}$ (one ConvLSTM layer), and $\text{CNN}^{ED}_{ConvLSTM}$ (all four ConvLSTM layers, no attention) is superior to $\text{CNN}^{ED(3,4)}_{ConvLSTM}$, indicating the effectiveness of ConvLSTM layers and the stacked decoding process for model refinement. We also observe that $\text{CNN}^{ED}_{ConvLSTM}$ is worse than MSCRED, suggesting that attention based ConvLSTM can further improve anomaly detection performance.
To further demonstrate the effectiveness of the attention module, Figure 4 reports the average distribution of attention weights over 5 previous timesteps at the last two ConvLSTM layers. The results are obtained using the power plant data. We compute the average attention weight distribution for segments in the normal periods and that for segments in the abnormal periods separately. Note that in the latter distribution, the older timesteps (step 1 or 2), which tend to still be normal and therefore in a different system status than the current timestep (step 5), are assigned lower weights than in the distribution for normal segments. In other words, the attention modules show high sensitivity to system status change and thus are beneficial for anomaly detection.
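The attention weights discussed above come from a softmax over the similarity between the current hidden feature map and each of the previous ones. The NumPy sketch below shows one way to compute them, assuming the hidden states of the last few steps are stacked into a single array with the current step last; the function name `temporal_attention` is hypothetical and the rescale factor defaults to 5.0 as in the model description.

```python
import numpy as np

def temporal_attention(H, chi=5.0):
    """Softmax attention over stacked hidden feature maps.

    H has shape (h, n, n, d); H[-1] is the current step. Returns the weights
    (one per step) and the weighted sum of the feature maps.
    """
    flat = H.reshape(H.shape[0], -1)            # vectorize each feature map
    scores = flat @ flat[-1] / chi              # similarity to the current step
    scores = scores - scores.max()              # stabilize the exponentials
    alpha = np.exp(scores) / np.exp(scores).sum()
    refined = np.tensordot(alpha, H, axes=1)    # weighted sum, shape (n, n, d)
    return alpha, refined

rng = np.random.default_rng(1)
H = rng.standard_normal((5, 3, 3, 2))           # toy hidden states for 5 steps
alpha, refined = temporal_attention(H)
```

Steps whose feature maps point in a different direction from the current one receive small weights, which matches the behavior reported for abnormal segments.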
4.2.2 Root cause identification result (RQ3).
As one of the anomaly diagnosis tasks, root cause identification depends on good anomaly detection performance. Therefore, we compare the performances of MSCRED and the best baseline, i.e., LSTM-ED. Specifically, for LSTM-ED, we use the prediction error of each time series as the anomaly score of that series. For MSCRED, the anomaly score of a series is defined as the number of poorly reconstructed pairwise correlations in the corresponding row/column of the residual signature matrices, as each row/column denotes a time series. For each anomaly event, we rank all time series by their anomaly scores and identify the top-$k$ series as the root causes. Figure 5 shows the average recall@$k$ ($k = 3$) in 5 repeated experiments. MSCRED outperforms LSTM-ED by a margin of 25.9% and 32.4% in the synthetic and power plant data, respectively.
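A minimal sketch of this ranking step is given below. The function names, the use of absolute residuals, and the toy residual matrix are assumptions made for illustration only.

```python
import numpy as np

def rank_root_causes(residual, theta, k=3):
    """Rank series by how many of their pairwise correlations are poorly reconstructed.

    residual is an (n, n) residual signature matrix; row i belongs to series i.
    Returns the indices of the top-k series and the per-series counts.
    """
    poorly_reconstructed = np.abs(residual) > theta
    counts = poorly_reconstructed.sum(axis=1)
    order = np.argsort(-counts, kind="stable")   # highest count first
    return order[:k], counts

def recall_at_k(predicted, true_causes):
    """Fraction of the true root causes that appear among the predicted ones."""
    return len(set(predicted.tolist()) & set(true_causes)) / len(true_causes)

residual = np.zeros((5, 5))
residual[2, :] = 1.0                             # series 2 breaks all of its correlations
residual[:, 2] = 1.0
top, counts = rank_root_causes(residual, theta=0.5, k=3)
```

Because a faulty series disturbs its correlation with every other series, its row stands out even though each healthy row also contains one large residual.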
4.2.3 Anomaly severity (duration) interpretation (RQ4).
The signature matrices of MSCRED include $s$ channels ($s = 3$ in current experiments) that capture system status at different scales. To interpret anomaly severity, we first compute different anomaly scores based on the residual signature matrices of three channels, i.e., small, medium, and large with segment size $w$ = 10, 30, and 60, respectively, and denote them as MSCRED(S), MSCRED(M), and MSCRED(L). Then, we independently evaluate their performances on three types of anomalies, i.e., short, medium, and long with durations of 10, 30, and 60, respectively. The average recall scores over 5 repeated experiments on two datasets are reported in Figure 6. We can observe that MSCRED(S) is able to detect all types of anomalies and MSCRED(M) can detect both medium and long duration anomalies. In contrast, MSCRED(L) can only detect the long duration anomaly. Accordingly, we can interpret the anomaly severity by jointly considering the three anomaly scores. The anomaly is more likely to be of long duration if it can be detected in all three channels. Otherwise, it may be a short or medium duration anomaly. To better show the effectiveness of MSCRED, Figure 7 provides a case study of anomaly diagnosis in power plant data. In this case, MSCRED(S) detects all 5 anomalies, including 3 short, 1 medium, and 1 long duration anomalies. MSCRED(M) misses two short duration anomalies and MSCRED(L) only detects the long duration anomaly. Moreover, four residual signature matrices of injected anomaly events show the root cause identification results. We can accurately pinpoint more than half of the anomaly root causes (rows/columns highlighted by red rectangles) in this case.
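The interpretation rule above can be written as a small decision function. This is a sketch, not part of the original method description: the function name is hypothetical, and separating short from medium anomalies by whether the medium channel fires is an extrapolation from the observation that MSCRED(M) detects medium and long anomalies.

```python
def interpret_severity(detected_s, detected_m, detected_l):
    """Map detections in the small, medium, and large channels to a duration label."""
    if detected_s and detected_m and detected_l:
        return "long"
    if detected_s and detected_m and not detected_l:
        return "medium"
    if detected_s and not detected_m and not detected_l:
        return "short"
    if not (detected_s or detected_m or detected_l):
        return "none"
    return "undetermined"   # combinations the reported results do not cover

label = interpret_severity(True, True, False)   # detected by S and M only
```

Combinations such as a detection in the large channel alone are returned as undetermined rather than forced into one of the three duration classes.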
4.2.4 Robustness to Noise (RQ5).
Multivariate time series often contain noise in real-world applications, so it is important for an anomaly detection algorithm to be robust to input noise. To study the robustness of MSCRED for anomaly detection, we conduct experiments on different synthetic datasets by varying the noise factor $\lambda$ in Equation 7. Figure 8 shows the impact of $\lambda$ on the performance of MSCRED, ARMA, and LSTM-ED. Similar to the previous evaluation, we compute Precision and Recall scores based on the optimized cutting threshold, and the average values of 5 repeated experiments are reported for comparison. We can observe that MSCRED consistently outperforms ARMA and LSTM-ED when the scale of noise varies from 0.2 to 0.45. This suggests that, compared with ARMA and LSTM-ED, MSCRED is more robust to the input noise.
5 Conclusion
In this paper, we formulated the anomaly detection and diagnosis problem, and developed an innovative model, i.e., MSCRED, to solve it. MSCRED employs multi-scale (resolution) system signature matrices to characterize the whole system statuses at different time segments and adopts a deep encoder-decoder framework to generate reconstructed signature matrices. The framework is able to model both inter-sensor correlations and temporal dependencies of multivariate time series. The residual signature matrices are further utilized to detect and diagnose anomalies. Extensive empirical studies on a synthetic dataset as well as a power plant dataset demonstrated that MSCRED can outperform state-of-the-art baseline methods.
Acknowledgments
Chuxu Zhang and Nitesh V. Chawla are supported by the Army Research Laboratory under Cooperative Agreement Number W911NF-09-2-0053 and the National Science Foundation (NSF) grant IIS-1447795.
References
 [Bahdanau, Cho, and Bengio2014] Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine translation by jointly learning to align and translate. In ICLR.
 [Brockwell and Davis2013] Brockwell, P. J., and Davis, R. A. 2013. Time series: theory and methods. Springer Science & Business Media.
 [Campello et al.2015] Campello, R. J.; Moulavi, D.; Zimek, A.; and Sander, J. 2015. Hierarchical density estimates for data clustering, visualization, and outlier detection. ACM Trans. Knowl. Discov. Data. 10(1):5.
 [Chandola, Banerjee, and Kumar2009] Chandola, V.; Banerjee, A.; and Kumar, V. 2009. Anomaly detection: A survey. ACM Comput. Surv. 41(3):15.
 [Chen et al.2008] Chen, H.; Cheng, H.; Jiang, G.; and Yoshihira, K. 2008. Exploiting local and global invariants for the management of large scale information systems. In ICDM, 113–122.
 [Cheng et al.2016] Cheng, W.; Zhang, K.; Chen, H.; Jiang, G.; Chen, Z.; and Wang, W. 2016. Ranking causal anomalies via temporal and dynamical analysis on vanishing correlations. In KDD, 805–814.
 [Cho et al.2014] Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
 [Djurdjanovic, Lee, and Ni2003] Djurdjanovic, D.; Lee, J.; and Ni, J. 2003. Watchdog agent—an infotronics-based prognostics approach for product performance degradation assessment and prediction. Adv. Eng. Inform. 17(3-4):109–125.
 [Görnitz et al.2013] Görnitz, N.; Kloft, M.; Rieck, K.; and Brefeld, U. 2013. Toward supervised anomaly detection. J. Artif. Intell. Res. 46:235–262.
 [Günnemann, Günnemann, and Faloutsos2014] Günnemann, N.; Günnemann, S.; and Faloutsos, C. 2014. Robust multivariate autoregression for anomaly detection in dynamic product ratings. In WWW, 361–372.
 [Hallac et al.2017] Hallac, D.; Vare, S.; Boyd, S.; and Leskovec, J. 2017. Toeplitz inverse covariance-based clustering of multivariate time series data. In KDD, 215–223.
 [Hamilton1994] Hamilton, J. D. 1994. Time series analysis, volume 2. Princeton university press Princeton, NJ.
 [Hautamäki, Kärkkäinen, and Fränti2004] Hautamäki, V.; Kärkkäinen, I.; and Fränti, P. 2004. Outlier detection using k-nearest neighbour graph. In ICPR, 430–433.
 [He, Xu, and Deng2003] He, Z.; Xu, X.; and Deng, S. 2003. Discovering cluster-based local outliers. Pattern Recognit. Lett. 24(9-10):1641–1650.
 [Idé, Papadimitriou, and Vlachos2007] Idé, T.; Papadimitriou, S.; and Vlachos, M. 2007. Computing correlation anomaly scores using stochastic nearest neighbors. In ICDM, 523–528.
 [Karim et al.2018] Karim, F.; Majumdar, S.; Darabi, H.; and Chen, S. 2018. LSTM fully convolutional networks for time series classification. IEEE Access 6:1662–1669.
 [Keogh et al.2001] Keogh, E.; Chu, S.; Hart, D.; and Pazzani, M. 2001. An online algorithm for segmenting time series. In ICDM, 289–296.
 [Kingma and Ba2014] Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
 [Klambauer et al.2017] Klambauer, G.; Unterthiner, T.; Mayr, A.; and Hochreiter, S. 2017. Self-normalizing neural networks. In NIPS, 971–980.
 [Kriegel et al.2012] Kriegel, H.-P.; Kröger, P.; Schubert, E.; and Zimek, A. 2012. Outlier detection in arbitrarily oriented subspaces. In ICDM, 379–388.
 [Lazarevic and Kumar2005] Lazarevic, A., and Kumar, V. 2005. Feature bagging for outlier detection. In KDD, 157–166.
 [Lemire2007] Lemire, D. 2007. A better alternative to piecewise linear time series segmentation. In SDM, 545–550.
 [León, Vittal, and Manimaran2007] León, R. A.; Vittal, V.; and Manimaran, G. 2007. Application of sensor network for secure electric energy infrastructure. IEEE Trans. Power Del. 22(2):1021–1028.
 [Li and Prakash2011] Li, L., and Prakash, B. A. 2011. Time series clustering: Complex is simpler! In ICML, 185–192.
 [Long, Shelhamer, and Darrell2015] Long, J.; Shelhamer, E.; and Darrell, T. 2015. Fully convolutional networks for semantic segmentation. In CVPR, 3431–3440.
 [Malhotra et al.2016] Malhotra, P.; Ramakrishnan, A.; Anand, G.; Vig, L.; Agarwal, P.; and Shroff, G. 2016. LSTM-based encoder-decoder for multi-sensor anomaly detection. In ICML Workshop.
 [Manevitz and Yousef2001] Manevitz, L. M., and Yousef, M. 2001. One-class SVMs for document classification. J. Mach. Learn. Res. 2(Dec):139–154.
 [Qin et al.2017] Qin, Y.; Song, D.; Chen, H.; Cheng, W.; Jiang, G.; and Cottrell, G. 2017. A dual-stage attention-based recurrent neural network for time series prediction. In IJCAI.
 [Shi et al.2015] Shi, X.; Chen, Z.; Wang, H.; Yeung, D.-Y.; Wong, W.-K.; and Woo, W.-c. 2015. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In NIPS, 802–810.
 [Song et al.2018] Song, D.; Xia, N.; Cheng, W.; Chen, H.; and Tao, D. 2018. Deep r-th root of rank supervised joint binary embedding for multivariate time series retrieval. In KDD, 2229–2238.
 [Wu et al.2018] Wu, X.; Shi, B.; Dong, Y.; Huang, C.; Faust, L.; and Chawla, N. V. 2018. Restful: Resolution-aware forecasting of behavioral time series data. In CIKM, 1073–1082.
 [Yang et al.2016] Yang, Z.; Yang, D.; Dyer, C.; He, X.; Smola, A.; and Hovy, E. 2016. Hierarchical attention networks for document classification. In NAACL-HLT, 1480–1489.
 [Zhai et al.2016] Zhai, S.; Cheng, Y.; Lu, W.; and Zhang, Z. 2016. Deep structured energy based models for anomaly detection. In ICML, 1100–1109.
 [Zhou and Paffenroth2017] Zhou, C., and Paffenroth, R. C. 2017. Anomaly detection with robust deep autoencoders. In KDD, 665–674.
 [Zong et al.2018] Zong, B.; Song, Q.; Min, M. R.; Cheng, W.; Lumezanu, C.; Cho, D.; and Chen, H. 2018. Deep autoencoding Gaussian mixture model for unsupervised anomaly detection. In ICLR.