Towards a Rigorous Evaluation of Time-series Anomaly Detection

In recent years, studies on time-series anomaly detection (TAD) have reported high F1 scores on benchmark TAD datasets, giving the impression of clear improvements. However, most studies apply a peculiar evaluation protocol called point adjustment (PA) before scoring. In this paper, we theoretically and experimentally reveal that the PA protocol has a great possibility of overestimating the detection performance; that is, even a random anomaly score can easily turn into a state-of-the-art TAD method. Therefore, comparing TAD methods with F1 scores after the PA protocol can lead to misguided rankings. Furthermore, we question the potential of existing TAD methods by showing that an untrained model obtains detection performance comparable to that of the existing methods even without PA. Based on our findings, we propose a new baseline and an evaluation protocol. We expect that our study will help the rigorous evaluation of TAD and lead to further improvement in future research.




1 Introduction

As Industry 4.0 accelerates system automation, the consequences of system failures can have a significant social impact lee2008cyber; lee2015cyber; baheti2011cyber. To prevent such failures, detecting the anomalous state of a system is more important than ever, and it is studied under the name of anomaly detection (AD). Meanwhile, deep learning has shown its effectiveness in modeling multivariate time-series data collected from the numerous sensors and actuators of large systems chalapathy2019deep. Therefore, various time-series AD (TAD) methods have widely adopted deep learning, and each of them has demonstrated its superiority by reporting higher F1 scores than the preceding methods choi2021deep. For some datasets, the reported F1 scores exceed 0.9, giving an encouraging impression of today's TAD capabilities.

Figure 1: (a) PA makes different anomaly scores indistinguishable. The black solid lines, gray area, and dashed line indicate the anomaly scores, GT anomaly segment, and TAD threshold, respectively. After applying PA, the predictions for informative and random anomaly scores degenerate to the same adjusted prediction (red line). (b) Existing methods fail to exceed the F1_PA of a randomly generated anomaly score (left) and show no improvement against the newly proposed baseline (right) for the WADI dataset.

However, most of the current TAD methods measure the F1 score after applying a peculiar evaluation protocol named point adjustment (PA), proposed by 2018unsupervised and followed by su2019robust; audibert2020usad; shen2020timeseries. PA works as follows: if at least one moment in a contiguous anomaly segment is detected as an anomaly, the entire segment is then considered to be correctly detected. The F1 score calculated with the adjusted predictions is hereinafter denoted by F1_PA; if the F1 score is computed without PA, it is denoted as F1. The PA protocol was proposed on the basis that a single alert within an anomaly period is sufficient to take action for system recovery. It has become a fundamental step in TAD evaluation, and some of the following studies reported only F1_PA without F1 chen2021learning. A higher F1_PA has been taken to indicate better detection capability.

However, PA has a high possibility of overestimating the model performance. A typical TAD model produces an anomaly score that indicates the degree of input abnormality, and predicts an anomaly if this score is higher than a threshold. In Figure 1-(a), the black solid lines show two different anomaly scores; the upper line shows informative scores from a well-trained model, while the lower is randomly generated. The shaded area and dashed line indicate the ground truth (GT) anomaly segment and TAD threshold, respectively. The informative scores (above) are ideal given that they are high only during the GT segment. In contrast, the randomly generated anomaly scores (below) cross the threshold only once within the GT segment. Despite this disparity, the predictions after PA become indistinguishable, as indicated by the red line. If random anomaly scores can yield F1_PA as high as a proficient detection model, it is difficult to conclude that a model with a higher F1_PA performs better than the others. Our experimental results in Section 5 show that random anomaly scores can overturn most state-of-the-art TAD methods (Figure 1-(b)).
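This failure mode is easy to reproduce numerically. Below is a minimal sketch (the segment position, threshold, and score values are invented for illustration): once a random score crosses the threshold anywhere inside the GT segment, its adjusted prediction matches that of an informative score over the whole segment.

```python
import numpy as np

# A minimal sketch of the Figure 1-(a) scenario. All values here are invented.
rng = np.random.default_rng(0)

T = 100
labels = np.zeros(T, dtype=int)
labels[40:60] = 1                              # one GT anomaly segment

informative = np.where(labels == 1, 0.9, 0.1)  # high only inside the segment
random_score = rng.uniform(0.0, 1.0, size=T)   # carries no information

def point_adjust(pred, labels):
    """If any point of a GT segment is flagged, flag the entire segment."""
    adjusted = pred.copy()
    t = 0
    while t < len(labels):
        if labels[t] == 1:
            e = t
            while e < len(labels) and labels[e] == 1:
                e += 1
            if adjusted[t:e].any():
                adjusted[t:e] = 1
            t = e
        else:
            t += 1
    return adjusted

delta = 0.5
pred_inf = point_adjust((informative > delta).astype(int), labels)
pred_rand = point_adjust((random_score > delta).astype(int), labels)
print((pred_inf[40:60] == pred_rand[40:60]).all())  # identical on the segment
```

Outside the segment the random score still produces false positives, but on the GT segment the two adjusted predictions are indistinguishable, which is exactly what inflates recall under PA.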

Another question that arises is whether PA is the only problem in the evaluation of TAD methods. Until now, only the absolute F1_PA has been reported, without any attempt to establish a baseline and make a relative comparison against it. If the accuracy of a binary classifier is 50%, it is not much different from random guessing despite being an apparently large number. Similarly, a proper baseline should be discussed for TAD, and future methods should be evaluated based on their improvement over the baseline. According to our observations, existing TAD methods do not seem to have obtained a significant improvement over the baseline that this paper proposes; furthermore, several methods fail to exceed it. Our observations for one of the benchmark datasets are summarized in the right panel of Figure 1.


In this paper, we raise the question of whether the current TAD methods that claim to bring significant improvements are being properly evaluated, and suggest directions for the rigorous evaluation of TAD for the first time. Our contributions are summarized as follows:

  • We show that PA, a peculiar evaluation protocol for TAD, greatly overestimates the detection performance of existing methods.

  • We show that, without PA, existing methods exhibit no (or mostly insignificant) improvement over the baseline.

  • Based on our findings, we propose a new baseline and an evaluation protocol for rigorous evaluation of TAD.

2 Background

2.1 Types of anomaly in time series signals

Various types of anomalies exist in TAD datasets choi2021deep. A contextual anomaly represents a signal that has a different shape from that of the normal signal. A collective anomaly indicates a small amount of noise accumulated over a period of time. A point anomaly indicates a temporary and significant deviation from the expected range owing to a rapid increase or decrease in the signal value. The point anomaly is the most dominant type in the current TAD datasets.

2.2 Unsupervised TAD

A typical AD setting assumes that only normal data are accessible during training. Therefore, unsupervised methods, which train a model to learn the patterns shared by normal signals, are among the most appropriate approaches for TAD. The final objective is to assign different anomaly scores to inputs depending on their degree of abnormality, i.e., low and high anomaly scores for normal and abnormal inputs, respectively. Recent unsupervised TAD methods can be categorized into three types: reconstruction-based, forecasting-based, and others.

Reconstruction-based AD methods train a model to minimize the distance between a normal input and its reconstruction. An anomalous input at test time results in a large distance, as it is difficult to reconstruct. The distance, or reconstruction error, serves as an anomaly score. Various network architectures have been adopted for reconstruction, from an autoencoder to a generative adversarial network (GAN) li2019mad; zhou2019beatgan. The distance metrics also vary, from the Euclidean distance audibert2020usad to the likelihood of reconstruction su2019robust.

Forecasting-based AD methods are similar to the reconstruction-based methods, except that they predict the signal for future time steps. The distance between the predicted and ground truth signal is considered an anomaly score. hundman2018detecting adopted a long short-term memory (LSTM) network to forecast the one-time-step-ahead input. Another variant of the recurrent model, the gated recurrent unit chung2014empirical, was also applied to forecasting-based TAD wu2020developing.

Others include various attempts to model the normal data distribution. One-class classification-based approaches ma2003timeseries; shen2020timeseries measured the similarity between the hidden representations of the normal and abnormal signals. Real-time anomaly detection in multivariate time series (RADM) applied hierarchical temporal memory and a Bayesian network to model the normal data distribution. Recently, as graph neural networks (GNNs) have shown very good performance in modeling temporal dependency, their use in TAD is growing rapidly chen2021learning; deng2021graph.

3 Pitfalls of the TAD evaluation

3.1 Problem formulation

First, we denote the time-series signal observed from the sensors over T time steps as X = {x_1, ..., x_T}. Following conventional approaches, it is normalized and split into a series of windows W = {W_1, ..., W_{T−w+1}} with stride 1, where W_t = {x_t, ..., x_{t+w−1}} and w is the window size. The ground truth binary label y_t, indicating whether a signal is an anomaly (1) or not (0), is given only for the test dataset. The goal of TAD is to predict the anomaly label ŷ_t for all windows in the test dataset. The labels are obtained by comparing the anomaly score A_t with a TAD threshold δ as follows:

$$\hat{y}_t = \begin{cases} 1 & \text{if } A_t > \delta \\ 0 & \text{otherwise.} \end{cases} \qquad (1)$$
An example of A_t is the mean squared error (MSE) between the original input and its reconstructed version, which is defined as follows:

$$A_t = \frac{1}{w}\,\lVert W_t - f(W_t; \theta)\rVert_2^2, \qquad (2)$$
where f(·; θ) denotes the output from a reconstruction model parameterized with θ. After labeling, the precision (P), recall (R), and F1 score for the evaluation are computed as follows:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \cdot P \cdot R}{P + R}, \qquad (3)$$

where TP, FP, and FN denote the number of true positives, false positives, and false negatives, respectively.
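As a concrete reference for Eqs. 1 and 3, the labeling and scoring steps can be sketched as follows (the toy arrays are illustrative only):

```python
import numpy as np

# Threshold an anomaly score into predicted labels (Eq. 1), then compute
# precision, recall, and F1 (Eq. 3).
def f1_score(score, labels, delta):
    pred = (score > delta).astype(int)            # Eq. 1
    tp = int(((pred == 1) & (labels == 1)).sum())
    fp = int(((pred == 1) & (labels == 0)).sum())
    fn = int(((pred == 0) & (labels == 1)).sum())
    p = tp / (tp + fp) if tp + fp else 0.0        # precision
    r = tp / (tp + fn) if tp + fn else 0.0        # recall
    return 2 * p * r / (p + r) if p + r else 0.0  # Eq. 3

labels = np.array([0, 0, 1, 1, 1, 0])
score = np.array([0.1, 0.2, 0.9, 0.8, 0.3, 0.2])
print(round(f1_score(score, labels, delta=0.5), 3))  # P = 1.0, R = 2/3 -> 0.8
```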

The TAD test dataset may contain multiple anomaly segments lasting over a few time steps. We denote S = {S_1, ..., S_M} as a set of anomaly segments, where S_m = {t | t_s^m ≤ t ≤ t_e^m}; t_s^m and t_e^m denote the start and end times of S_m, respectively. PA adjusts ŷ_t to 1 for all t ∈ S_m if the anomaly score A_t is higher than δ at least once in S_m. With PA, the labeling scheme of Eq. 1 changes as follows:

$$\hat{y}_t = \begin{cases} 1 & \text{if } A_t > \delta \ \text{ or } \ \bigl(t \in S_m \text{ and } \exists\, t' \in S_m \text{ s.t. } A_{t'} > \delta\bigr) \\ 0 & \text{otherwise.} \end{cases} \qquad (4)$$

F1_PA denotes the F1 score computed with the adjusted labels.
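The adjustment of Eq. 4 can be sketched as a short routine (variable names are ours, not from the paper):

```python
import numpy as np

# Point adjustment (Eq. 4): any hit inside a GT anomaly segment marks the
# entire segment as detected.
def point_adjust(pred, labels):
    adjusted = pred.copy()
    in_seg = False
    for t, y in enumerate(labels):
        if y == 1 and not in_seg:
            start, in_seg = t, True
        if in_seg and (y == 0 or t == len(labels) - 1):
            end = t if y == 0 else t + 1
            if adjusted[start:end].any():     # at least one hit in the segment
                adjusted[start:end] = 1       # -> whole segment counted correct
            in_seg = False
    return adjusted

labels = np.array([0, 1, 1, 1, 0, 0])
pred = np.array([0, 0, 1, 0, 0, 0])           # a single hit inside the segment
print(point_adjust(pred, labels))             # -> [0 1 1 1 0 0]
```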

3.2 Random anomaly score with high F1_PA

In this section, we demonstrate that the PA protocol overestimates the detection capability. We start from an abstract analysis of the P and R of Eq. 3, and we mathematically show that a randomly generated A_t can achieve an F1_PA close to 1. According to Eq. 3, as the F1 score is the harmonic mean of P and R, it depends on TP, FP, and FN. As shown in Eq. 4, PA increases TP and decreases FN while maintaining FP. Therefore, after PA, P, R, and consequently the F1 score can only increase.

Next, we show that F1_PA can easily get close to 1. First, R is restated as a conditional probability as follows:

$$R = \frac{TP}{TP + FN} = P(\hat{y}_t = 1 \mid y_t = 1). \qquad (5)$$
Let us assume that A_t is drawn from a uniform distribution U(0, 1). We use δ_u to denote a TAD threshold for this assumption. If only one anomaly segment exists, i.e., S = {S_1}, R after PA can be expressed as follows, referring to Eq. 4:

$$R_{PA} = P(\exists\, t \in S_1 \text{ s.t. } A_t > \delta_u) = 1 - \delta_u^{\,L}, \qquad (6)$$

where r is the test dataset anomaly ratio and L = t_e^1 − t_s^1 + 1 is the length of the anomaly segment. P_PA can be obtained by applying Bayes' rule to Eq. 5 as follows:

$$P_{PA} = \frac{R_{PA} \cdot r}{R_{PA} \cdot r + (1 - \delta_u)(1 - r)}. \qquad (7)$$

For a more generalized proof, please refer to the Appendix. The anomaly ratio r for a dataset is mostly between 0 and 0.2; L is also determined by the dataset and generally ranges from 100 to 5,000 in the benchmark datasets. Figure 2 depicts F1_PA varying with δ_u under different L when r is fixed to 0.05. As shown in the figure, we can always obtain an F1_PA close to 1 by changing δ_u, except for the case when the length of the anomaly segment is short.
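The analysis above can be checked with a quick simulation under the same assumptions: anomaly scores drawn from U(0, 1), a single long GT segment, and a high threshold δ_u. The sizes below are illustrative.

```python
import numpy as np

# Uniform random scores, one GT segment of length L, anomaly ratio r = 0.05.
rng = np.random.default_rng(0)

T, L = 100_000, 5_000
labels = np.zeros(T, dtype=int)
labels[10_000:10_000 + L] = 1

delta_u = 0.99
pred = (rng.uniform(0.0, 1.0, size=T) > delta_u).astype(int)

def f1(pred, labels):
    tp = ((pred == 1) & (labels == 1)).sum()
    fp = ((pred == 1) & (labels == 0)).sum()
    fn = ((pred == 0) & (labels == 1)).sum()
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

f1_raw = f1(pred, labels)               # without PA: essentially useless

adjusted = pred.copy()
if adjusted[10_000:10_000 + L].any():   # PA over the single segment
    adjusted[10_000:10_000 + L] = 1
f1_pa = f1(adjusted, labels)            # after PA: looks state-of-the-art

print(f1_raw, f1_pa)                    # f1_raw stays tiny, f1_pa exceeds 0.85
```

With L = 5,000 and δ_u = 0.99, the probability that no point in the segment crosses the threshold is negligible, so R_PA ≈ 1 and Eq. 7 gives P_PA ≈ 0.84, i.e., F1_PA ≈ 0.91 for a score that carries no information.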

Figure 2: F1_PA for the case of uniform random anomaly scores varying with δ_u for different L. If an anomaly segment is considerably long, that is, if L is sufficiently large, F1_PA approaches 1 as δ_u increases.

3.3 Untrained model with comparably high F1

This section shows that the anomaly scores obtained from an untrained model are informative to a certain extent. A deep neural network is generally initialized with random weights drawn from a Gaussian distribution N(0, σ²), where σ is often much smaller than 1. Without training, the outputs of the model are close to zero because they also follow a zero-mean Gaussian distribution. The anomaly score of a reconstruction-based or forecasting-based method is typically defined as the Euclidean distance between the input and output, which in the above case is proportional to the norm of the input window:

$$A_t = \lVert W_t - f(W_t; \theta)\rVert_2^2 \approx \lVert W_t \rVert_2^2. \qquad (8)$$

In the case of a point anomaly, specific sensor values increase abruptly. This leads to a larger magnitude of A_t than for normal windows, which translates directly into a high anomaly score for GT anomalies. The experimental results in Section 5 reveal that the F1 calculated from the A_t of Eq. 8 is comparable to that of current TAD methods. It is also shown that this F1 increases even more when the window size w gets longer.
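A minimal numerical sketch of this effect, assuming a single random linear layer with σ = 0.01 as the "untrained model" (all shapes and values are invented for illustration):

```python
import numpy as np

# With tiny Gaussian weights, the model output is near zero, so the
# reconstruction error is roughly the squared norm of the input itself (Eq. 8).
rng = np.random.default_rng(0)

w, M = 120, 8                              # window size, number of sensors
W_normal = rng.normal(0.0, 1.0, (w, M))
W_anomaly = W_normal.copy()
W_anomaly[60, 3] = 25.0                    # a point anomaly: one sensor spikes

theta = rng.normal(0.0, 0.01, (M, M))      # untrained "reconstruction model"

def score(W):
    return np.mean((W - W @ theta) ** 2)   # MSE reconstruction error

print(score(W_anomaly) > score(W_normal))  # the spike dominates the score
```

Since the spike contributes 25² to the squared error while everything else stays near its input magnitude, the anomalous window receives the higher score despite the model having learned nothing.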

4 Towards a rigorous evaluation of TAD

4.1 New baseline for TAD

For a classification task, the baseline accuracy is often defined as that of a random guess; there is an improvement only when the classification accuracy exceeds this baseline. Similarly, TAD methods need to be compared not only with the existing methods but also with a baseline detection performance. Therefore, based on the findings of Section 3.3, we suggest establishing a new baseline with the F1 measured from the prediction of a randomly initialized reconstruction model with a simple architecture, such as an untrained autoencoder comprising a single-layer LSTM. Alternatively, the anomaly score can be defined as the input itself, which is the extreme case of Eq. 8 where the model consistently outputs zero regardless of the input. If the performance of a new TAD model does not exceed this baseline, the effectiveness of the model should be reexamined.

4.2 New evaluation protocol PA%K

In the previous section, we demonstrated that PA has a great possibility of overestimating detection performance. Measuring F1 without PA can settle the overestimation immediately; in this case, it is recommended to set a baseline as introduced in Section 4.1. However, depending on the test data distribution, F1 can unexpectedly underestimate the detection capability. In fact, due to the incomplete test set labeling, some signals labeled as anomalies share more statistics with normal signals. Even if anomalies are inserted intermittently over a period of time, y_t = 1 for all t in that period.

We further investigated this problem using t-distributed stochastic neighbor embedding (t-SNE) van2008visualizing, as depicted in Figure 3. The t-SNE is generated from the test dataset of secure water treatment (SWaT) goh2016dataset. Blue and orange colors indicate the normal and abnormal samples, respectively. The majority of the anomalies form a distinctive cluster located far from the normal data distribution. However, some abnormal windows lie closer to the normal data than to the other anomalies. The signals corresponding to the green and red points are visualized in (b) and (c), respectively. Although both samples were annotated as GT anomalies, (b) shares more patterns with the normal data of (a) than with (c). Concluding that a model's performance is deficient only because it cannot detect signals such as (b) can lead to an underestimation of the detection capability.

Figure 3: t-SNE of the input windows of the SWaT test dataset and visualization of the corresponding signals. Blue indicates ground truth (GT) normal, while orange, green, and red indicate GT anomalies. Even though (b) is a GT anomaly, it shares more patterns with (a), a GT normal signal, than with (c), an abnormal signal.

Therefore, we propose a new evaluation protocol, PA%K, which can mitigate the overestimation effect of F1_PA and the possibility of underestimation of F1. The idea of PA%K is to apply PA to S_m only if the ratio of the number of correctly detected anomalies in S_m to its length exceeds the PA%K threshold K. PA%K modifies Eq. 4 as follows:

$$\hat{y}_t = \begin{cases} 1 & \text{if } A_t > \delta \ \text{ or } \ \Bigl(t \in S_m \text{ and } \dfrac{|\{t' \in S_m \mid A_{t'} > \delta\}|}{|S_m|} > \dfrac{K}{100}\Bigr) \\ 0 & \text{otherwise,} \end{cases}$$

where |S_m| denotes the length of S_m (i.e., t_e^m − t_s^m + 1), and K can be selected manually between 0 and 100 based on prior knowledge. For example, if the test set labels are reliable, a larger K is allowable. If a user wants to remove the dependency on K, it is recommended to measure the area under the curve of F1_PA%K obtained by increasing K from 0 to 100.
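A sketch implementation of PA%K (variable names are ours; with the strict inequality, K = 0 coincides with plain PA and K = 100 leaves the predictions unadjusted, matching the endpoints described above):

```python
import numpy as np

# PA%K: adjust a GT segment only if more than K percent of it was detected.
def pa_at_k(pred, labels, k):
    adjusted = pred.copy()
    t = 0
    while t < len(labels):
        if labels[t] == 1:
            e = t
            while e < len(labels) and labels[e] == 1:
                e += 1
            if adjusted[t:e].sum() / (e - t) > k / 100:
                adjusted[t:e] = 1
            t = e
        else:
            t += 1
    return adjusted

labels = np.array([0, 1, 1, 1, 1, 0])
pred = np.array([0, 1, 0, 0, 0, 0])       # 25% of the segment detected
print(pa_at_k(pred, labels, k=20))        # 25% > 20%: whole segment adjusted
print(pa_at_k(pred, labels, k=50))        # 25% <= 50%: prediction unchanged
```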

5 Experimental results

5.1 Benchmark TAD datasets

In this section, we introduce the five most widely used TAD benchmark datasets:

Secure water treatment (SWaT) goh2016dataset: the SWaT dataset was collected over 11 days from a scaled-down water treatment testbed comprising 51 sensors mathur2016swat. In the last 4 days, 41 anomalies were injected using diverse attack methods, while only normal data were generated during the first 7 days.

Water distribution testbed (WADI) wadi: the WADI dataset was acquired from a reduced city water distribution system with 123 sensors and actuators operating for 16 days. Only normal data were collected during the first 14 days, and the remaining two days contained anomalies. The test dataset had a total of 15 anomaly segments.

Server Machine Dataset (SMD) su2019robust: the SMD dataset was collected from 28 server machines with 38 sensors for 10 days; only normal data appeared for the first 5 days, and anomalies were intermittently injected for the last 5 days. The results for the SMD dataset are the averaged values from 28 different models for each machine.

Mars Science Laboratory (MSL) and Soil Moisture Active Passive (SMAP) hundman2018detecting: the MSL and SMAP datasets are real-world datasets collected from NASA spacecraft. They consist of the anomaly data from incident surprise anomaly (ISA) reports for a spacecraft monitoring system. Unlike the other datasets, unlabeled anomalies are contained in the training data, which makes training difficult.

The statistics are summarized in Table 1.

5.2 Evaluated methods

Below, we present descriptions of the seven recently proposed representative TAD methods and the three cases investigated in Section 3.

USAD audibert2020usad stands for unsupervised anomaly detection, which trains two autoencoders consisting of one shared encoder and two separate decoders, under a two-phase training scheme: an autoencoder training phase and an adversarial training phase.

DAGMM zong2018deep represents a deep autoencoding Gaussian mixture model that adopts an autoencoder to yield a representation vector and feeds it to a Gaussian mixture model. It uses the estimated sample energy as an anomaly score; high energy indicates high abnormality.

LSTM-VAE park2018multimodal represents an LSTM-based variational autoencoder that adopts variational inference for reconstruction.

OmniAnomaly su2019robust applied a VAE to model the time-series signal into a stochastic representation, which would predict an anomaly if the reconstruction likelihood of a given input is lower than a threshold value. It also defined the reconstruction probabilities of individual features as attribution scores and quantified their interpretability.

MSCRED zhang2019deep represents a multi-scale convolutional recurrent encoder-decoder comprising convolutional LSTMs to reconstruct the input matrices that characterize multiple system levels, rather than the input itself.

Dataset   Train     Test (anomaly%)      M
SWaT      495,000   449,919 (12.33%)     51
WADI      784,537   172,801 (5.77%)      123
SMD       25,300    25,300 (4.21%)       38
MSL       58,317    73,729 (10.5%)       55
SMAP      135,183   427,617 (12.8%)      25

Table 1: Statistics of benchmark TAD datasets. M denotes the dimension of the input features.

              SWaT           WADI           MSL            SMAP           SMD
              F1_PA  F1      F1_PA  F1     F1_PA  F1      F1_PA  F1     F1_PA  F1
USAD          0.846  0.791   0.429  0.232  0.927  0.211   0.818  0.228  0.938  0.426
DAGMM         0.853  0.550   0.209  0.121  0.701  0.199   0.712  0.333  0.723  0.238
LSTM-VAE      0.805  0.775   0.380  0.227  0.678  0.212   0.756  0.235  0.808  0.435
OmniAnomaly   0.866  0.782   0.417  0.223  0.899  0.207   0.805  0.227  0.944  0.474
MSCRED        0.868  0.662   0.346  0.087  0.775  0.199   0.942  0.232  0.389  0.097
THOC          0.880  0.612   0.506  0.130  0.891  0.190   0.781  0.240  0.541  0.168
GDN           0.935  0.81    0.855  0.57   0.903  0.217   0.708  0.252  0.716  0.529
Case 1        0.969  0.216   0.965  0.109  0.931  0.190   0.961  0.227  0.804  0.080
Case 2        0.873  0.781   0.694  0.353  0.812  0.239   0.675  0.229  0.896  0.494
Case 3        0.869  0.789   0.695  0.331  0.427  0.236   0.699  0.229  0.893  0.466

Table 2: F1_PA and F1 scores for various methods. † indicates reproduced results. The bottom three rows represent the following: Case 1, random anomaly score; Case 2, the input itself as an anomaly score; Case 3, the anomaly score from a randomized model. Please refer to the manuscript for detailed explanations. Bold and underlined entries indicate the best and the second best, respectively. ↑ is marked in the following cases: (1) F1_PA is higher than that of Case 1; (2) F1 is higher than that of Case 2 or 3.

THOC shen2020timeseries represents a temporal hierarchical one-class network, which combines a multi-layer dilated recurrent neural network with hierarchical deep support vector data descriptions.

GDN deng2021graph represents a graph deviation network that learns a sensor relationship graph to detect deviations of anomalies from the learned pattern.

Case 1. Random anomaly score corresponds to the case described in Section 3.2. The F1 score is measured with a randomly generated anomaly score drawn from a uniform distribution, i.e., A_t ∼ U(0, 1).

Case 2. Input itself as an anomaly score denotes the case assuming f(W_t; θ) = 0 regardless of W_t. This is equal to the extreme case of Eq. 8; therefore, A_t = ‖W_t‖₂².

Case 3. Anomaly score from the randomized model corresponds to Eq. 8, where f(W_t; θ) denotes the small output from a randomized model. The parameters were fixed after being initialized from a Gaussian distribution N(0, σ²).

5.3 Correlation between F1_PA and F1

F1 is the most conservative indicator of detection performance. Therefore, if F1_PA reliably represents the detection capability, it should have at least some correlation with F1. Figure 4 plots F1_PA and F1 for SWaT and WADI, as reported by the original studies on USAD, DAGMM, LSTM-VAE, OmniAnomaly, and GDN. The figure also includes the results of Cases 1-3. It is noteworthy that, given that only a subset of the datasets and methods reported F1_PA and F1 together, we plotted only those. For SWaT, the Pearson correlation coefficient (PCC) and Kendall rank correlation coefficient (KRC) were -0.59 and 0.07, respectively. For WADI, the PCC and KRC were 0.41 and 0.43, respectively. These numbers are insufficient to assure the existence of a correlation, and they confirm that comparing the superiority of methods using only F1_PA carries the risk of improper evaluation of the detection performance.

5.4 Comparison results

Here, we compare the results of the AD methods with Cases 1-3. It should be noted that the anomaly score is directly generated without model inference for Cases 1 and 2.

Figure 4: Correlation between F1_PA and F1 of the existing methods on the SWaT and WADI datasets. The Kendall rank correlation coefficient (KRC) and Pearson correlation coefficient (PCC) are indicated in the figure.

For Case 3, we adopted the simplest encoder-decoder architecture with LSTM layers. The window size for Cases 2 and 3 was set to 120. For experiments that included randomness, such as Cases 1 and 3, we repeated them with five different seeds and report the average values. For the existing methods, we used the best numbers reported in the original papers and officially reproduced results choi2021deep; if there were no available scores, we reproduced them referring to the officially provided codes. The F1 values for MSL, SMAP, and SMD have not been provided in previous papers; thus, they are all reproduced. It is worth noting that we searched for optimal hyperparameters within the ranges suggested in the papers, and we did not apply down-sampling. All thresholds were set to those that yielded the best score. Further details of the implementation are provided in the Appendix. The results are shown in Table 2. The reproduced results are marked with †. Bold and underlined numbers indicate the best and second-best results, respectively. The up arrow (↑) is displayed with a result in the following cases: (1) F1_PA is higher than that of Case 1; (2) F1 is higher than that of Case 2 or 3, whichever is greater.

Clearly, the randomly generated anomaly score (Case 1) is not able to detect anomalies, because it reflects nothing about the abnormality of an input. Correspondingly, its F1 was quite low, which clearly revealed a deficient detection capability. However, when the PA protocol is applied, Case 1 appears to yield state-of-the-art F1_PA far beyond the existing methods, except on SMD. If the result is provided only with PA, as in the case of MSL, SMAP, and SMD, it is impossible to distinguish whether a method successfully detects anomalies or merely outputs a random anomaly score irrelevant to the input. In particular, the F1 of MSL and SMAP is quite low; this implies difficulty in modeling them, originating from the fact that they are both real-world datasets and the training data contain anomalies. However, F1_PA appears considerably high, creating an illusion that the anomalies are being detected well for those datasets.

The F1_PA of Case 1 on SMD is lower than that on the other datasets, and there are previous methods surpassing it. This may be attributed to the composition of the SMD test dataset. According to Eqs. 6 and 7, F1_PA varies with three parameters: the ratio of anomalies in the test dataset r, the length of the anomaly segments L, and the TAD threshold δ_u. Unlike the other datasets, the anomaly ratio of SMD is quite low, as shown in Table 1. Moreover, the lengths of the anomaly segments are relatively short; the average length over the 28 machines is 90, unlike the other datasets, where it ranges from hundreds to thousands. This is similar to the lowest case in Figure 2, which shows that the maximum achievable F1_PA in this case is only approximately 0.8. Therefore, we can conclude that the overestimation effect of PA depends on the test dataset distribution, and its effect becomes less conspicuous with shorter anomaly segments.

Across all datasets, the F1 of the existing methods is mostly inferior to that of Cases 2 and 3, implying that the currently proposed methods may have obtained marginal or even no advancement over the baselines. Only GDN consistently exceeded the baselines on all datasets. The F1 of Cases 2 and 3 depends on the length of the input window; with a longer window, the F1 baseline becomes even larger. We experimented with various window lengths w ranging from 1 to 250 for Case 2 and depict the results in Figure 5. For SWaT, WADI, and SMAP, F1 begins to increase after a short decrease as w increases. This increase occurs because a longer window is more likely to contain more point anomalies, resulting in a high anomaly score for the window. If w becomes too large, F1 saturates or degrades, possibly because windows that used to contain only normal signals unexpectedly contain anomalies.

Figure 5: F1 for various window sizes w. As w increases, F1 mostly increases after a short decrease.

5.5 Effect of the PA%K protocol

To examine how PA%K alleviates the overestimation effect of PA and the underestimation tendency of F1, we observed F1_PA%K varying with different PA%K thresholds K. Figure 6 shows the F1_PA%K for SWaT from Case 1 and from a fully trained encoder-decoder when K changes in increments of 10 from 0 to 100. The F1_PA%K values at K = 0 and K = 100 are equal to the original F1_PA and F1, respectively. The F1_PA%K of a well-trained model is expected to remain constant regardless of the value of K. Correspondingly, the F1_PA%K of the trained encoder-decoder (orange) is consistently high. In contrast, the F1_PA%K of Case 1 (blue) rapidly decreases as K increases. We also proposed measuring the area under the curve (AUC) to reduce the dependence on K. In this case, the AUC was 0.88 and 0.41 for the trained encoder-decoder and Case 1, respectively; this demonstrates that PA%K clearly distinguishes the former from the latter regardless of K.
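The K-sweep and its AUC summary can be sketched as follows (the toy predictions are invented: "lucky" detects one point per segment, "solid" detects most of each segment):

```python
import numpy as np

def f1(pred, labels):
    tp = ((pred == 1) & (labels == 1)).sum()
    fp = ((pred == 1) & (labels == 0)).sum()
    fn = ((pred == 0) & (labels == 1)).sum()
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def pa_at_k(pred, labels, k):
    # adjust a GT segment only if more than K percent of it was detected
    adjusted = pred.copy()
    t = 0
    while t < len(labels):
        if labels[t] == 1:
            e = t
            while e < len(labels) and labels[e] == 1:
                e += 1
            if adjusted[t:e].sum() / (e - t) > k / 100:
                adjusted[t:e] = 1
            t = e
        else:
            t += 1
    return adjusted

def auc_over_k(pred, labels, ks=range(0, 101, 10)):
    # average F1 under PA%K as K sweeps from 0 (= PA) to 100 (= plain F1)
    return float(np.mean([f1(pa_at_k(pred, labels, k), labels) for k in ks]))

labels = np.array([0, 1, 1, 1, 1, 0, 0, 1, 1, 0])
lucky = np.array([0, 1, 0, 0, 0, 0, 0, 1, 0, 0])   # one hit per segment
solid = np.array([0, 1, 1, 1, 0, 0, 0, 1, 1, 0])   # most of each segment

print(auc_over_k(lucky, labels) < auc_over_k(solid, labels))
```

The "lucky" detector scores perfectly at K = 0 but collapses as K grows, while the "solid" detector stays high, so the AUC separates them even though plain PA would not.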

6 Discussion

Throughout this paper, we have demonstrated that the current evaluation of TAD has pitfalls in two respects: (1) since PA overestimates the detection performance, we cannot ensure that a method with a higher F1_PA indeed has a better detection capability; (2) the results have been compared only with existing methods, not against a baseline. A better anomaly detector can be developed when the current achievements are properly assessed. In this section, we suggest several directions for future TAD evaluations.

Figure 6: F1 score with PA%K for varying K. If K = 0, it is equal to F1_PA, and if K = 100, it is equal to F1.

The motivation of PA, i.e., the source of the first pitfall, originates from the incompleteness of the test dataset labeling process, as observed in Section 4.2. A fundamental solution is to develop a new benchmark dataset annotated in a more fine-grained manner, so that the time-step-wise labels become reliable. As this is often not feasible because fine-grained annotation requires tremendous resources, F1_PA%K can be a good alternative that alleviates overestimation without any dataset modification. For the second issue, it is important to set a baseline from the performance of an untrained model, as in Cases 2 and 3, and to measure the relative improvement against it. The window size should be carefully determined by considering its effect on the baselines, as described in Section 5.4.

Furthermore, pre-defining the TAD threshold without any access to the test dataset is often impractical in the real world. Correspondingly, many AD methods in the vision field evaluate themselves using the area under the receiver operating characteristic (AUROC) curve yi2020patch. In contrast, existing TAD methods set the threshold after investigating the test dataset or simply use the optimal threshold that yields the best F1 score. Thus, the detection result depends significantly on threshold selection. Additional metrics with reduced threshold dependency, such as the AUROC or the area under the precision-recall (AUPR) curve, will help in rigorous evaluation. Even in this case, the proposed baselines and the PA%K protocol remain valid.

7 Conclusion

In this paper, we showed for the first time that applying PA can severely overestimate a TAD model's capability, which may not reflect the true modeling performance. Our experimental results show that randomly generated anomaly scores can yield state-of-the-art results. We also proposed a new baseline for TAD and showed that only a few methods have achieved significant advancement over it. To mitigate the overestimation caused by PA, we proposed a new protocol called PA%K. Finally, we suggested several directions for the rigorous evaluation of future TAD methods, including baseline selection and the reduction of TAD threshold dependence. We expect that our research will help clarify the potential of current TAD methods and lead to further improvement of TAD in the future.