1 Introduction
As Industry 4.0 accelerates system automation, the consequences of system failures can have a significant social impact [lee2008cyber; lee2015cyber; baheti2011cyber]. To prevent such failures, detecting the anomalous state of a system is more important than ever, and this problem is studied under the name of anomaly detection (AD). Meanwhile, deep learning has shown its effectiveness in modeling multivariate time-series data collected from the numerous sensors and actuators of large systems [chalapathy2019deep]. Therefore, various time-series AD (TAD) methods have widely adopted deep learning, and each has demonstrated its superiority by reporting higher F1 scores than the preceding methods [choi2021deep]. For some datasets, the reported F1 scores exceed 0.9, giving an encouraging impression of today's TAD capabilities.
However, most current TAD methods measure the F1 score only after applying a peculiar evaluation protocol named point adjustment (PA), proposed by [2018unsupervised] and widely adopted since [su2019robust; audibert2020usad; shen2020timeseries]. PA works as follows: if at least one moment in a contiguous anomaly segment is detected as an anomaly, the entire segment is considered correctly detected. The F1 score is then calculated with the adjusted predictions (hereinafter denoted by F1_PA). If the F1 score is computed without PA, it is denoted as F1. The PA protocol was proposed on the basis that a single alert within an anomaly period is sufficient to take action for system recovery. It has become a fundamental step in TAD evaluation, and some subsequent studies reported only F1_PA without F1 [chen2021learning]. A higher F1_PA has been taken to indicate better detection capability.
However, PA is highly likely to overestimate model performance. A typical TAD model produces an anomaly score that indicates the degree of input abnormality, and predicts an anomaly if this score is higher than a threshold. In Figure 1(a), the black solid lines show two different anomaly scores; the upper line shows informative scores from a well-trained model, while the lower is randomly generated. The shaded area and dashed line indicate the ground-truth (GT) anomaly segment and the TAD threshold, respectively. The informative scores (above) are ideal in that they are high only during the GT segment. In contrast, the randomly generated anomaly scores (below) cross the threshold only once within the GT segment. Despite their disparity, the predictions after PA become indistinguishable, as indicated by the red line. If random anomaly scores can yield an F1_PA as high as that of a proficient detection model, it is difficult to conclude that a model with a higher F1_PA performs better than the others. Our experimental results in Section 5 show that random anomaly scores can overturn most state-of-the-art TAD methods (Figure 1(b)).
Another question that arises is whether PA is the only problem in the evaluation of TAD methods. Until now, only absolute F1 scores have been reported, without any attempt to establish a baseline and compare against it. If the accuracy of a binary classifier is 50%, it is not much different from random guessing despite being an apparently large number. Similarly, a proper baseline should be discussed for TAD, and future methods should be evaluated by their improvement over that baseline. According to our observations, existing TAD methods do not seem to have obtained a significant improvement over the baseline that this paper proposes; furthermore, several methods fail to exceed it. Our observations for one of the benchmark datasets are summarized on the right of Figure 1(b).
In this paper, we raise the question of whether the current TAD methods that claim significant improvements are being properly evaluated, and we suggest directions for the rigorous evaluation of TAD for the first time. Our contributions are summarized as follows:

We show that PA, a peculiar evaluation protocol for TAD, greatly overestimates the detection performance of existing methods.

We show that, without PA, existing methods exhibit no (or mostly insignificant) improvement over the baseline.

Based on our findings, we propose a new baseline and an evaluation protocol for rigorous evaluation of TAD.
2 Background
2.1 Types of anomaly in time series signals
Various types of anomalies exist in TAD datasets [choi2021deep]. A contextual anomaly is a signal whose shape differs from that of the normal signal. A collective anomaly is a small amount of noise accumulated over a period of time. A point anomaly is a temporary and significant deviation from the expected range owing to a rapid increase or decrease in the signal value. Point anomalies are the most dominant type in current TAD datasets.
2.2 Unsupervised TAD
A typical AD setting assumes that only normal data are accessible at training time. Therefore, unsupervised methods, which train a model to learn patterns shared among normal signals, are among the most appropriate approaches for TAD. The final objective is to assign anomaly scores to inputs according to their degree of abnormality, i.e., low scores for normal inputs and high scores for abnormal ones. Recent unsupervised TAD methods can be categorized into three types: reconstruction-based, forecasting-based, and others.
Reconstruction-based AD methods train a model to minimize the distance between a normal input and its reconstruction. An anomalous input at test time results in a large distance because it is difficult to reconstruct. The distance, or reconstruction error, serves as the anomaly score. Various network architectures have been adopted for reconstruction, from autoencoders [malhotra2016lstm] to generative adversarial networks (GANs) [li2019mad; zhou2019beatgan]. The distance metrics also vary, from the Euclidean distance [audibert2020usad] to the likelihood of reconstruction [su2019robust].
Forecasting-based AD methods are similar to reconstruction-based methods, except that they predict the signal for future time steps. The distance between the predicted and ground-truth signal is considered the anomaly score. [hundman2018detecting] adopted a long short-term memory (LSTM) network [hochreiter1997long] to forecast the current time-step input. Another variant of the recurrent model, the gated recurrent unit [chung2014empirical], has been applied to forecasting-based TAD [wu2020developing].
Others include various attempts to model the normal data distribution. One-class classification-based approaches [ma2003timeseries; shen2020timeseries] measure the similarity between the hidden representations of normal and abnormal signals. Real-time anomaly detection in multivariate time series (RADM) [ding2018radm] applied hierarchical temporal memory and a Bayesian network to model the normal data distribution. Recently, as graph neural networks (GNNs) have shown strong performance in modeling temporal dependencies, their use in TAD is growing rapidly [chen2021learning; deng2021graph].
3 Pitfalls of the TAD evaluation
3.1 Problem formulation
First, we denote the time-series signal observed from d sensors over time T as X = (x_1, …, x_T), where x_t ∈ R^d. Following convention, X is normalized and split into a series of windows {W_w, …, W_T} with stride 1, where W_t = (x_{t−w+1}, …, x_t) and w is the window size. The ground-truth binary label y_t, indicating whether a signal is an anomaly (1) or not (0), is given only for the test dataset. The goal of TAD is to predict the anomaly label ŷ_t for all windows in the test dataset. The labels are obtained by comparing the anomaly score A(W_t) with a TAD threshold δ as follows:
(1)  ŷ_t = 1 if A(W_t) > δ, and 0 otherwise.
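As a minimal sketch of this stride-1 windowing step (assuming the signal is held in a NumPy array of shape (T, d); the function name is ours, not from the paper):

```python
import numpy as np

def make_windows(X, w):
    """Split a (T, d) multivariate series into overlapping windows of
    length w with stride 1. Returns an array of shape (T - w + 1, w, d)."""
    T = X.shape[0]
    return np.stack([X[t:t + w] for t in range(T - w + 1)])

X = np.arange(12, dtype=float).reshape(6, 2)  # T = 6 time steps, d = 2 sensors
W = make_windows(X, w=3)                      # W.shape == (4, 3, 2)
```

For long signals, `numpy.lib.stride_tricks.sliding_window_view` achieves the same result without copying the data.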
An example of A(W_t) is the mean squared error (MSE) between the original input and its reconstruction, defined as follows:
(2)  A(W_t) = ‖W_t − f(W_t; θ)‖²,
where f(W_t; θ) denotes the output of a reconstruction model parameterized by θ. After labeling, the precision (P), recall (R), and F1 score for the evaluation are computed as follows:
(3)  P = TP / (TP + FP),  R = TP / (TP + FN),  F1 = 2PR / (P + R),
where TP, FP, and FN denote the number of true positives, false positives, and false negatives, respectively.
The TAD test dataset may contain multiple anomaly segments lasting over a few time steps. We denote the set of anomaly segments by S = {S_1, …, S_M}, where S_m = {t : s_m ≤ t ≤ e_m}; s_m and e_m denote the start and end times of S_m, respectively. PA adjusts ŷ_t to 1 for all t ∈ S_m if the anomaly score exceeds δ at least once within S_m. With PA, the labeling scheme of Eq. 1 changes as follows:
(4)  ŷ_t = 1 if A(W_t) > δ, or if t ∈ S_m and A(W_{t'}) > δ for some t' ∈ S_m; otherwise ŷ_t = 0.
F1_PA denotes the F1 score computed with the adjusted labels.
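The adjustment of Eq. 4 can be sketched as follows (a minimal illustration, not the authors' implementation; the function name is ours):

```python
def point_adjust(scores, labels, delta):
    """Point adjustment (PA): if any score within a ground-truth anomaly
    segment exceeds the threshold, the entire segment counts as detected."""
    preds = [1 if s > delta else 0 for s in scores]   # Eq. (1)
    adjusted = list(preds)
    t, T = 0, len(labels)
    while t < T:
        if labels[t] == 1:                # entered a GT anomaly segment
            s = t
            while t < T and labels[t] == 1:
                t += 1
            if any(preds[s:t]):           # at least one alert inside it
                adjusted[s:t] = [1] * (t - s)   # mark the whole segment
        else:
            t += 1
    return adjusted

# One alert inside a three-step segment flips the whole segment to 1
adjusted = point_adjust([0.1, 0.2, 0.9, 0.2, 0.1], [0, 1, 1, 1, 0], 0.5)
```

Computing P, R, and F1 on `adjusted` instead of the raw predictions yields F1_PA.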
3.2 Random anomaly score with high F1_PA
In this section, we demonstrate that the PA protocol overestimates detection capability. We start from an abstract analysis of the P and R of Eq. 3, and we mathematically show that a randomly generated anomaly score can achieve a high F1_PA close to 1. According to Eq. 3, as the F1 score is a harmonic mean of P and R, it depends on TP, FP, and FN. As shown in Eq. 4, PA increases TP and decreases FN while leaving FP unchanged. Therefore, after PA, the P, R, and consequently the F1 score can only increase.
Next, we show that F1_PA can easily get close to 1. First, P and R are restated as conditional probabilities as follows:
(5)  P = p(y_t = 1 | ŷ_t = 1),  R = p(ŷ_t = 1 | y_t = 1).
Let us assume that A(W_t) is drawn from a uniform distribution U(0, 1). We use δ_r to denote the TAD threshold under this assumption. If only one anomaly segment exists, i.e., S = {S_1}, the recall after PA can be expressed as follows, referring to Eq. 4:
(6)  R_PA = p(ŷ_t = 1 | y_t = 1) = 1 − δ_r^N,
where ra denotes the anomaly ratio of the test dataset and N = e_1 − s_1 + 1 is the length of the anomaly segment. P_PA can be obtained by applying Bayes' rule to Eq. 5 as follows:
(7)  P_PA = ra · R_PA / (ra · R_PA + (1 − ra)(1 − δ_r)).
For a more generalized proof, please refer to the Appendix. The anomaly ratio ra for a dataset is mostly between 0 and 0.2; the segment length N is also determined by the dataset and generally ranges from 100 to 5,000 in the benchmark datasets. Figure 2 depicts F1_PA varying with δ_r under different N when ra is fixed to 0.05. As shown in the figure, we can always obtain an F1_PA close to 1 by adjusting δ_r, except when the anomaly segment is short.
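Plugging representative values into the expressions above illustrates the effect numerically (a sketch using our reconstruction of Eqs. 6 and 7 for the single-segment, uniform-score case):

```python
def expected_pa_scores(delta_r, N, ra):
    """Expected precision/recall/F1 after PA for uniform random scores,
    assuming a single anomaly segment of length N and anomaly ratio ra."""
    r_pa = 1.0 - delta_r ** N                                   # Eq. (6)
    p_pa = ra * r_pa / (ra * r_pa + (1 - ra) * (1 - delta_r))   # Eq. (7)
    f1_pa = 2 * p_pa * r_pa / (p_pa + r_pa)
    return p_pa, r_pa, f1_pa

# With ra = 0.05 and a segment of length 1000, a high threshold pushes
# F1_PA above 0.9 even though the scores carry no information (f1 ≈ 0.91).
p, r, f1 = expected_pa_scores(delta_r=0.99, N=1000, ra=0.05)
```

Sweeping `delta_r` toward 1 reproduces the behavior described for Figure 2: F1_PA approaches 1 unless N is small.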
3.3 Untrained model with comparably high F1
This section shows that the anomaly scores obtained from an untrained model are informative to a certain extent. A deep neural network is generally initialized with random weights drawn from a Gaussian distribution N(0, σ²), where σ is often much smaller than 1. Without training, the outputs of the model are close to zero because they also follow a zero-mean Gaussian distribution. The anomaly score of a reconstruction-based or forecasting-based method is typically defined as the Euclidean distance between the input and the output, which in this case is proportional to the norm of the input window:
(8)  A(W_t) = ‖W_t − f(W_t; θ)‖² ≈ ‖W_t‖².
In the case of a point anomaly, specific sensor values increase abruptly. This leads to a larger magnitude of ‖W_t‖² than for normal windows, which directly translates into a high A(W_t) for GT anomalies. The experimental results in Section 5 reveal that the F1 calculated from the anomaly score of Eq. 8 is comparable to that of current TAD methods. They also show that this F1 increases further as the window size grows.
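A small synthetic experiment makes this concrete (our own illustration, assuming a univariate signal; the untrained model's output is approximated by zero, so Eq. 8 reduces to the squared window norm):

```python
import numpy as np

rng = np.random.default_rng(0)
signal = rng.normal(0.0, 0.1, size=200)   # normal operating signal
signal[120] = 5.0                         # injected point anomaly

w = 10
# f(W_t; theta) ~ 0 for an untrained model, so A(W_t) ~ ||W_t||^2 (Eq. 8)
scores = np.array([np.sum(signal[t:t + w] ** 2)
                   for t in range(len(signal) - w + 1)])
# Every window covering the spike scores far above the normal windows,
# so even this "model-free" score separates the GT anomaly cleanly.
```

The same effect explains why a longer window raises the baseline F1: more windows overlap each point anomaly.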
4 Towards a rigorous evaluation of TAD
4.1 New baseline for TAD
For a classification task, the baseline accuracy is often defined as that of a random guess; there is an improvement only when the classification accuracy exceeds this baseline. Similarly, TAD methods need to be compared not only with existing methods but also with a baseline detection performance. Therefore, based on the findings of Section 3.3, we suggest establishing a new baseline with the F1 measured from the predictions of a randomly initialized reconstruction model with a simple architecture, such as an untrained autoencoder comprising a single-layer LSTM. Alternatively, the anomaly score can be defined as the input itself, which is the extreme case of Eq. 8 in which the model consistently outputs zero regardless of the input. If the performance of a new TAD model does not exceed this baseline, the effectiveness of the model should be re-examined.
4.2 New evaluation protocol PA%K
In the previous section, we demonstrated that PA is highly likely to overestimate detection performance. Computing F1 without PA immediately settles the overestimation; in this case, it is recommended to set a baseline as introduced in Section 4.1. However, depending on the test data distribution, F1 can unexpectedly underestimate the detection capability. In fact, owing to incomplete test set labeling, some signals labeled as anomalies share more statistics with normal signals. Even if anomalies are inserted only intermittently over a period of time, y_t = 1 for all t in that period.
We further investigated this problem using t-distributed stochastic neighbor embedding (t-SNE) [van2008visualizing], as depicted in Figure 3. The t-SNE is generated from the test dataset of secure water treatment (SWaT) [goh2016dataset]. Blue and orange indicate normal and abnormal samples, respectively. The majority of the anomalies form a distinctive cluster located far from the normal data distribution. However, some abnormal windows lie closer to the normal data than to the other anomalies. The signals corresponding to the green and red points are visualized in (b) and (c), respectively. Although both samples were annotated as GT anomalies, (b) shares more patterns with the normal data of (a) than (c) does. Concluding that a model's performance is deficient only because it cannot detect signals such as (b) can lead to an underestimation of the detection capability.
Therefore, we propose a new evaluation protocol, PA%K, which can mitigate the overestimation effect of PA and the possibility of underestimation by F1. The idea of PA%K is to apply PA to a segment S_m only if the ratio of the number of correctly detected anomalies in S_m to its length exceeds the PA%K threshold K. PA%K modifies Eq. 4 as follows:
ŷ_t = 1 if A(W_t) > δ, or if t ∈ S_m and |{t' ∈ S_m : A(W_{t'}) > δ}| / |S_m| > K/100; otherwise ŷ_t = 0,
where |S_m| denotes the size of S_m (i.e., e_m − s_m + 1) and K can be selected manually between 0 and 100 based on prior knowledge. For example, if the test set labels are reliable, a larger K is allowable. If a user wants to remove the dependency on K, it is recommended to measure the area under the curve of F1_PA%K obtained by increasing K from 0 to 100.
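A minimal sketch of the PA%K adjustment (our illustration, not the authors' code; note that K = 0 recovers plain PA, while K = 100 can never trigger adjustment and thus recovers plain F1):

```python
def point_adjust_k(scores, labels, delta, k):
    """PA%K: adjust a ground-truth segment to fully detected only if the
    fraction of its time steps scoring above the threshold exceeds k (in %)."""
    preds = [1 if s > delta else 0 for s in scores]
    adjusted = list(preds)
    t, T = 0, len(labels)
    while t < T:
        if labels[t] == 1:                # entered a GT anomaly segment
            s = t
            while t < T and labels[t] == 1:
                t += 1
            if sum(preds[s:t]) / (t - s) > k / 100:
                adjusted[s:t] = [1] * (t - s)
        else:
            t += 1
    return adjusted

# One alert in a four-step segment (25% detected): adjusted under K=0,
# left unchanged under K=50.
scores = [0.1, 0.9, 0.1, 0.1, 0.1, 0.1]
labels = [0, 1, 1, 1, 1, 0]
```

Evaluating F1 on `point_adjust_k(...)` while sweeping `k` from 0 to 100 yields the curve whose area we suggest reporting.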
5 Experimental results
5.1 Benchmark TAD datasets
In this section, we introduce the five most widely used TAD benchmark datasets:
Secure water treatment (SWaT) [goh2016dataset]: the SWaT dataset was collected over 11 days from a scaled-down water treatment testbed comprising 51 sensors [mathur2016swat]. Only normal data were generated during the first 7 days, while in the last 4 days, 41 anomalies were injected using diverse attack methods.
Water distribution testbed (WADI) wadi: the WADI dataset was acquired from a reduced city water distribution system with 123 sensors and actuators operating for 16 days. Only normal data were collected during the first 14 days, and the remaining two days contained anomalies. The test dataset had a total of 15 anomaly segments.
Server Machine Dataset (SMD) su2019robust: the SMD dataset was collected from 28 server machines with 38 sensors for 10 days; only normal data appeared for the first 5 days, and anomalies were intermittently injected for the last 5 days. The results for the SMD dataset are the averaged values from 28 different models for each machine.
Mars Science Laboratory (MSL) and Soil Moisture Active Passive (SMAP) [hundman2018detecting]: the MSL and SMAP datasets are real-world datasets collected from NASA spacecraft. They consist of anomaly data from incident surprise anomaly (ISA) reports for a spacecraft monitoring system. Unlike the other datasets, unlabeled anomalies are contained in the training data, which makes training difficult.
The statistics are summarized in Table 1.
5.2 Evaluated methods
Below, we describe 7 recently proposed, representative TAD methods and the 3 cases investigated in Section 3.
USAD [audibert2020usad] stands for unsupervised anomaly detection; it trains two autoencoders consisting of one shared encoder and two separate decoders under a two-phase training scheme: an autoencoder training phase and an adversarial training phase.
DAGMM [zong2018deep] represents a deep autoencoding Gaussian mixture model that adopts an autoencoder to yield a representation vector and feeds it to a Gaussian mixture model. It uses the estimated sample energy as the reconstruction error; high energy indicates high abnormality.
LSTM-VAE [park2018multimodal] represents an LSTM-based variational autoencoder that adopts variational inference for reconstruction.
OmniAnomaly [su2019robust] applies a VAE to model the time-series signal as a stochastic representation and predicts an anomaly if the reconstruction likelihood of a given input is lower than a threshold. It also defines the reconstruction probabilities of individual features as attribution scores and quantifies their interpretability.
MSCRED [zhang2019deep] represents a multi-scale convolutional recurrent encoder-decoder comprising convolutional LSTMs; it reconstructs input matrices that characterize multiple system levels, rather than the input itself.
Table 1: Statistics of the benchmark datasets.

Dataset | Train   | Test (anomaly %)  | Dimensions
SWaT    | 495,000 | 449,919 (12.33%)  | 51
WADI    | 784,537 | 172,801 (5.77%)   | 123
SMD     | 25,300  | 25,300 (4.21%)    | 38
MSL     | 58,317  | 73,729 (10.5%)    | 55
SMAP    | 135,183 | 427,617 (12.8%)   | 25
THOC [shen2020timeseries] represents a temporal hierarchical one-class network, which combines a multi-layer dilated recurrent neural network with hierarchical deep support vector data description.
GDN [deng2021graph] represents a graph deviation network that learns a sensor relationship graph and detects deviations of anomalies from the learned pattern.
Case 1. Random anomaly score corresponds to the case described in Section 3.2. The F1 score is measured with a randomly generated anomaly score drawn from a uniform distribution, i.e., A(W_t) ~ U(0, 1).
Case 2. Input itself as an anomaly score denotes the case assuming f(W_t; θ) = 0 regardless of the input. This is the extreme case of Eq. 8; therefore, A(W_t) = ‖W_t‖².
Case 3. Anomaly score from a randomized model corresponds to Eq. 8, where f(W_t; θ) denotes the small output of a randomly initialized model. The parameters θ were fixed after being initialized from a Gaussian distribution N(0, σ²).
5.3 Correlation between F1_PA and F1
F1 is the most conservative indicator of detection performance. Therefore, if F1_PA reliably represents the detection capability, it should at least correlate with F1. Figure 4 plots F1_PA and F1 for SWaT and WADI, as reported by the original studies on USAD, DAGMM, LSTM-VAE, OmniAnomaly, and GDN. The figure also includes the results of Cases 1–3. Note that only a subset of the datasets and methods reported F1_PA and F1 together, so we plotted only those. For SWaT, the Pearson correlation coefficient (PCC) and Kendall rank correlation coefficient (KRC) were 0.59 and 0.07, respectively. For WADI, the PCC and KRC were 0.41 and 0.43, respectively. These numbers are insufficient to assure the existence of a correlation, and they confirm that comparing the superiority of methods using only F1_PA risks an improper evaluation of detection performance.
5.4 Comparison results
Here, we compare the results of the AD methods with Cases 1–3. Note that for Cases 1 and 2, the anomaly score is generated directly without model inference. For Case 3, we adopted the simplest encoder-decoder architecture with LSTM layers. The window size for Cases 2 and 3 was set to 120. For experiments that include randomness, such as Cases 1 and 3, we repeated them with five different seeds and report the average values. For the existing methods, we used the best numbers reported in the original papers and officially reproduced results [choi2021deep]; if no scores were available, we reproduced them using the officially provided code. The F1 for MSL, SMAP, and SMD has not been provided in previous papers; thus these values are all reproduced. Note that we searched for optimal hyperparameters within the ranges suggested in the papers and did not apply downsampling. All thresholds were set to those yielding the best score. Further details of the implementation are provided in the Appendix.
The results are shown in Table 2, with the reproduced results marked in the table. Bold and underlined numbers indicate the best and second-best results, respectively. An up arrow (↑) is displayed with a result in the following cases: (1) F1_PA is higher than that of Case 1, or (2) F1 is higher than that of Case 2 or 3, whichever is greater.
Clearly, the randomly generated anomaly score (Case 1) is unable to detect anomalies, because it reflects nothing about the abnormality of an input. Correspondingly, its F1 was quite low, clearly revealing a deficient detection capability. However, under the PA protocol, Case 1 appears to yield state-of-the-art F1_PA far beyond the existing methods, except on SMD. If results are provided only with PA, as in the case of MSL, SMAP, and SMD, it is impossible to distinguish whether a method successfully detects anomalies or merely outputs a random anomaly score irrelevant to the input. In particular, the F1 on MSL and SMAP is quite low; this implies difficulty in modeling them, originating from the fact that both are real-world datasets whose training data contain anomalies. However, F1_PA appears considerably high, creating an illusion that anomalies are detected well on those datasets.
The F1_PA of Case 1 on SMD is lower than on the other datasets, and some previous methods surpass it. This may be attributed to the composition of the SMD test dataset. According to Eqs. 6 and 7, F1_PA varies with three parameters: the ratio of anomalies in the test dataset ra, the length of the anomaly segments N, and the TAD threshold δ_r. Unlike the other datasets, the anomaly ratio of SMD is quite low, as shown in Table 1. Moreover, the lengths of its anomaly segments are relatively short; the average length over the 28 machines is 90, unlike the other datasets, which range from hundreds to thousands. This resembles the lowest case in Figure 2, which shows that the maximum achievable F1_PA in this case is only approximately 0.8. Therefore, we can conclude that the overestimation effect of PA depends on the test dataset distribution, and its effect becomes less conspicuous with shorter anomaly segments.
Across all datasets, the F1 of the existing methods is mostly inferior to that of Cases 2 and 3, implying that the currently proposed methods may have obtained marginal or even no advancement over the baselines. Only GDN consistently exceeded the baselines on all datasets. The F1 of Cases 2 and 3 depends on the length of the input window: with a longer window, the F1 baseline becomes even larger. We experimented with various window lengths w ranging from 1 to 250 in Case 2 and depict the results in Figure 5. For SWaT, WADI, and SMAP, F1 begins to increase after a short decrease as w increases. This increase occurs because a longer window is more likely to contain point anomalies, resulting in a high anomaly score for the window. If w becomes too large, F1 saturates or degrades, possibly because windows that used to contain only normal signals unexpectedly include anomalies.
5.5 Effect of the PA%K protocol
To examine how PA%K alleviates the overestimation effect of PA and the underestimation tendency of F1, we observed F1_PA%K varying with different PA%K thresholds K. Figure 6 shows the F1_PA%K for SWaT from Case 1 and from a fully trained encoder-decoder as K changes in increments of 10 from 0 to 100. The values at K = 0 and K = 100 are equal to the original F1_PA and F1, respectively. The F1_PA%K of a well-trained model is expected to remain constant regardless of the value of K. Correspondingly, the trained encoder-decoder (orange) shows a consistently high F1_PA%K. In contrast, the F1_PA%K of Case 1 (blue) decreases rapidly as K increases. We also propose measuring the area under the curve (AUC) to reduce the dependence on K. In this case, the AUCs were 0.88 and 0.41 for the trained encoder-decoder and Case 1, respectively; this demonstrates that PA%K clearly distinguishes the former from the latter regardless of K.
6 Discussion
Throughout this paper, we have demonstrated that the current evaluation of TAD has pitfalls in two respects: (1) since PA overestimates the detection performance, we cannot ensure that a method with a higher F1_PA indeed has better detection capability; (2) results have been compared only with existing methods, not against a baseline. A better anomaly detector can be developed only when current achievements are properly assessed. In this section, we suggest several directions for future TAD evaluations.
The motivation for PA, i.e., the source of the first pitfall, originates from the incompleteness of the test dataset labeling process, as observed in Section 4.2. A definitive solution is to develop a new benchmark dataset annotated in a more fine-grained manner, so that the time-step-wise labels become reliable. As this is often not feasible because fine-grained annotation requires tremendous resources, PA%K can be a good alternative that alleviates overestimation without any modification of the dataset. For the second issue, it is important to set a baseline from the performance of an untrained model, as in Cases 2 and 3, and to measure the relative improvement against it. The window size should be carefully determined by considering its effect on the baselines, as described in Section 5.4.
Furthermore, predefining the TAD threshold without any access to the test dataset is often impractical in the real world. Correspondingly, many AD methods in the vision field evaluate themselves using the area under the receiver operating characteristic (AUROC) curve [yi2020patch]. In contrast, existing TAD methods set the threshold after investigating the test dataset or simply use the optimal threshold that yields the best F1. Thus, the detection result depends significantly on threshold selection. Additional metrics with reduced threshold dependency, such as the AUROC or the area under the precision-recall curve (AUPR), will help in rigorous evaluation. Even in this case, the proposed baselines and the PA%K protocol remain valid.
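A threshold-free AUROC can be computed directly from anomaly scores and labels; a minimal sketch using the rank-sum (Mann-Whitney U) identity, which equals the probability that a randomly chosen anomaly outscores a randomly chosen normal point (our illustration; in practice, library routines such as scikit-learn's `roc_auc_score` serve the same purpose):

```python
def auroc(scores, labels):
    """AUROC without choosing any threshold: the probability that a random
    anomalous point receives a higher score than a random normal point,
    counting ties as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Perfectly separated scores give AUROC = 1.0; uninformative scores ~ 0.5
perfect = auroc([0.9, 0.8, 0.1, 0.2], [1, 1, 0, 0])
```

This quadratic-time form is only for illustration; a sorting-based implementation runs in O(n log n).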
7 Conclusion
In this paper, we showed for the first time that applying PA can severely overestimate a TAD model's capability, such that the reported scores may not reflect the true detection performance. Our experimental results show that randomly generated anomaly scores can yield state-of-the-art results. We also proposed a new baseline for TAD and showed that only a few methods have achieved significant advancement over it. To mitigate the overestimation caused by PA, we proposed a new protocol called PA%K. Finally, we suggested several directions for the rigorous evaluation of future TAD methods, including baseline selection and the reduction of TAD threshold dependence. We expect that our research will help clarify the true potential of current TAD methods and lead to improved TAD in the future.