Neural Contextual Anomaly Detection for Time Series

07/16/2021 ∙ by Chris U. Carmona, et al. ∙ Amazon ∙ University of Oxford

We introduce Neural Contextual Anomaly Detection (NCAD), a framework for anomaly detection on time series that scales seamlessly from the unsupervised to supervised setting, and is applicable to both univariate and multivariate time series. This is achieved by effectively combining recent developments in representation learning for multivariate time series, with techniques for deep anomaly detection originally developed for computer vision that we tailor to the time series setting. Our window-based approach facilitates learning the boundary between normal and anomalous classes by injecting generic synthetic anomalies into the available data. Moreover, our method can effectively take advantage of all the available information, be it as domain knowledge, or as training labels in the semi-supervised setting. We demonstrate empirically on standard benchmark datasets that our approach obtains a state-of-the-art performance in these settings.




1 Introduction

Detecting anomalies in real-valued time series data has many practical applications, such as monitoring machinery for faults, finding anomalous behavior in IoT sensor data, improving the availability of computer applications and (cloud) infrastructure, and monitoring patients' vital signs, among many others. Since Shewhart's pioneering work on statistical process control (Shewhart, 1931), statistical techniques for monitoring and detecting abnormal behavior have been developed, refined, and deployed in countless highly impactful applications.

Recently, deep learning techniques have been successfully applied to various anomaly detection problems (see e.g. the surveys by Ruff et al. (2021) and Pang et al. (2020)). In the particular case of time series, these methods have demonstrated remarkable performance for large-scale monitoring problems such as those encountered by companies like Google (Shipmon et al., 2017), Microsoft (Ren et al., 2019), Alibaba (Gao et al., 2020), and Amazon (Ayed et al., 2020).

Classically, anomaly detection on time series is cast as an unsupervised learning problem, where the training data contains both normal and anomalous instances, but without knowing which is which. However, in many practical applications, a fully unsupervised approach can leave valuable information unutilized, as it is often possible to obtain (small amounts of) labeled anomalous instances, or to characterize the relevant anomalies in some general way.

Ideally, an effective method for anomaly detection takes a semi-supervised approach that can exploit information about known anomalous patterns or out-of-distribution observations whenever these are available. Recent developments in deep anomaly detection for computer vision have achieved remarkable performance by following such a learning strategy. A notable example is the line of work leading to the Hypersphere Classifier (HSC) (Ruff et al., 2018, 2020, 2020), which extends the concept of one-class classification to a powerful framework for semi-supervised anomaly detection on complex data.

In this work, we introduce Neural Contextual Anomaly Detection (NCAD), a framework for anomaly detection on time series that scales seamlessly from the unsupervised to the supervised setting and can incorporate additional information, both through labeled examples and through known anomalous patterns. This is achieved by combining recent developments in representation learning for multivariate time series (Franceschi et al., 2019) with a number of deep anomaly detection techniques originally developed for computer vision, such as the HSC (Ruff et al., 2020) and Outlier Exposure (OE) (Hendrycks et al., 2019), tailored to the time series setting.

Our approach is based on breaking each time series into overlapping, fixed-size windows. Each window is further divided into two parts: a context window and a suspect window (see fig. 1), which are mapped into neural representations (embeddings) using Temporal Convolutional Networks (TCN) (Bai et al., 2018). Our aim is to detect anomalies in the suspect window. Anomalies are identified in the space of learned latent representations, building on the intuition that anomalies create a substantial perturbation of the embeddings: when we compare the representations of two overlapping segments, one with the anomaly and one without it, we expect them to be distant.

Time series anomalies are inherently contextual. We account for this in our methodology by extending the HSC loss to a contextual hypersphere loss, which dynamically adapts the hypersphere's center based on the context's representation. We use data augmentation techniques to ease the learning of the boundary between the normal and anomalous classes. In particular, we employ a variant of OE to create contextual anomalies, and we inject simple point outliers.

In summary, we make the following contributions: (I) we propose a simple yet effective framework for time series anomaly detection that achieves state-of-the-art performance across well-known benchmark datasets, covering univariate and multivariate time series, and across the unsupervised, semi-supervised, and fully-supervised settings (our implementation of NCAD is publicly available); (II) we build on related work on deep anomaly detection using the hypersphere classifier (Ruff et al., 2020) and expand it to introduce contextual hypersphere detection; (III) we adapt the OE (Hendrycks et al., 2019) and Mixup (Zhang et al., 2018) methods to the particular case of anomaly detection for time series.

2 Related work

Anomaly detection (AD) is an important problem with many applications and has consequently been widely studied. We refer the reader to one of the recent reviews on the topic for a general overview of methods (Chandola et al., 2009; Ruff et al., 2021; Pang et al., 2020).

We are interested in anomaly detection for time series. This is a problem typically framed in an unsupervised way. A traditional approach is to use a predictive model, estimating the distribution (or confidence bands) of future values conditioned on historical observations, and mark observations as anomalous if they are considered unlikely under the model (Shipmon et al., 2017). Forecasting models such as ARIMA or exponential smoothing methods are often used here, assuming a Gaussian noise distribution. Siffer et al. (2017) propose SPOT and DSPOT, which detect outliers in time series using extreme value theory to model the tail of the distribution.

Advances in deep learning models for anomaly detection have become popular recently. For time series, some models have maintained the classical predictive approach and introduced flexible neural networks for the dependency structure, yielding significant improvements. Shipmon et al. (2017) use deep (recurrent) neural networks to parametrize a Gaussian distribution and use the tail probability to detect outliers.

Effective ideas for deep anomaly detection that deviate from the predictive approach have been successfully imported to the time series domain from other fields. These include reconstruction-based methods, e.g. with variational autoencoders (VAE), and density-based methods, e.g. with generative adversarial networks (GAN): Donut (Xu et al., 2018) uses a VAE to predict the distribution of sliding windows; LSTM-VAE (Park et al., 2018) uses a recurrent neural network with a VAE; OmniAnomaly (Su et al., 2019) extends this framework with deep innovation state space models and normalizing flows; AnoGAN (Schlegl et al., 2017) uses GANs to model sequences of observations and estimate their probabilities in latent space; MSCRED (Zhang et al., 2019) uses convolutional auto-encoders and identifies anomalies by measuring the reconstruction error.

Compression-based approaches have become very popular in image anomaly detection. The working principle is similar to the one-class classification used in the support vector data description method (Tax and Duin, 2004): instances are mapped to latent representations which are pulled together during training, forming a sphere in the latent space; instances that are distant from the center are considered anomalous. Ruff et al. (2018, 2020) build on this idea to learn a neural mapping $\phi(\cdot;\theta): \mathbb{R}^D \to \mathbb{R}^E$, such that the representations of nominal points concentrate around a (fixed) center $c$, while anomalous points are mapped away from that center. In the unsupervised case, DeepSVDD (Ruff et al., 2018) achieves this by minimizing the Euclidean distance $\sum_i \lVert \phi(w_i;\theta) - c \rVert^2$, subject to a suitable regularization of the mapping and assuming that anomalies are rare.

THOC (Shen et al., 2020) applies this principle to time series, extending the model to consider multiple spheres in order to obtain more convenient representations. This method differs from our work in two ways. First, it relies on a dilated recurrent neural network with skip connections to handle the contextual aspect of the data, whereas we use a much simpler network and a context window to handle contextuality. Second, our method can seamlessly handle the semi-supervised setting and benefits from our data augmentation techniques.

Ruff et al. (2020) propose the HSC, improving on DeepSVDD by training the network using the standard binary cross-entropy (BCE) loss, thereby extending the approach to the (semi-)supervised setting. With this method, they can rely on labeled examples to regularize the training and do not have to resort to limiting the network. In particular, the HSC loss is given by setting the pseudo-probability of an anomalous instance ($y = 1$) as $p = 1 - \ell(\phi(w_i;\theta))$, i.e.

$$\mathcal{L}_{\mathrm{HSC}} = -(1 - y_i)\,\log \ell(\phi(w_i;\theta)) - y_i\,\log\bigl(1 - \ell(\phi(w_i;\theta))\bigr), \tag{1}$$

where $\ell: \mathbb{R}^E \to (0, 1]$ maps the representation to a probabilistic prediction. Choosing $\ell(z) = \exp(-\lVert z \rVert^2)$ leads to a spherical decision boundary in representation space, and reduces to the DeepSVDD loss (with center $c = 0$) when all labels are $0$.

Current work on semi-supervised anomaly detection indicates that including even a few labeled anomalies can already yield remarkable performance improvements on complex data (Ruff et al., 2021; Liznerski et al., 2020; Tuluptceva et al., 2020). A powerful resource in this line is OE (Hendrycks et al., 2019), which improves detection by incorporating large amounts of out-of-distribution examples from auxiliary datasets during training. Although such negative samples may not coincide with ground-truth anomalies, such contrasting can be beneficial for learning characteristic representations of normal concepts. Moreover, the combination of OE and expressive representations with the hypersphere classifier has shown exceptional results for deep AD on images (Ruff et al., 2020; Deecke et al., 2020).

For time series data, however, artificial anomalies and related data augmentation techniques have not been studied extensively. Smolyakov et al. (2019) used artificial anomalies to select thresholds in ensembles of anomaly detection models. Most closely related to our approach, SR-CNN (Ren et al., 2019) trains a supervised CNN on top of an unsupervised anomaly detection model (SR), using labels from injected single-point outliers.

Fully supervised methods are not as widely studied because labeling all the anomalies is too expensive and unreliable in most applications. An exception is the work by Liu et al. (2015), who propose a system to continuously collect anomaly labels and to iteratively re-train and deploy a supervised random forest model. The U-Net-DeWA approach of Gao et al. (2020) relies on preprocessing using robust time series decomposition to train a convolutional network in a supervised way, relying on data augmentations that preserve the anomaly labels to increase the training set size.

3 Neural Contextual Anomaly Detection

This section describes our anomaly detection framework and its building blocks. We combine a window-based anomaly detection approach with a flexible training paradigm and effective heuristics for data augmentation to produce a state-of-the-art system for anomaly detection.

We consider the following general time series anomaly detection problem: We are given a collection of $N$ discrete-time time series $x^{(i)}_{1:T_i}$, where for time series $i$ and time step $t$ we have an observation vector $x^{(i)}_t \in \mathbb{R}^D$. We further assume that we are given a corresponding set of partial anomaly labels $y^{(i)}_{1:T_i}$ with $y^{(i)}_t \in \{0, 1, ?\}$, indicating whether the corresponding observation is normal ($0$), anomalous ($1$), or unlabeled ($?$).

The goal is to predict anomaly labels $\hat{y}_{1:T}$, with $\hat{y}_t \in \{0, 1\}$, given a time series $x_{1:T}$. Instead of predicting the binary labels directly, we predict a positive anomaly score for each time step, which can subsequently be thresholded to obtain anomaly labels satisfying a desired precision/recall trade-off.

3.1 Window-based Contextual Hypersphere Detection

Similar to other work on time series AD (e.g. Ren et al. (2019); Guha et al. (2016)), we convert the time series problem to a vector problem by splitting each time series into a sequence of overlapping, fixed-size windows $w$ of length $L$. A key element of our approach is that within each window, we identify two segments: a context window $w^{(c)}$ of length $C$ and a suspect window $w^{(s)}$ of length $S$: $w = (w^{(c)}, w^{(s)})$, where we typically choose $C \gg S$. Our goal is to detect anomalies in the suspect window relative to the local context provided by the context window. This split not only naturally aligns with the typically contextual nature of anomalies in time series data, it also allows for short suspect windows (even $S = 1$), minimizing detection delays and improving anomaly localization.
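The window construction can be illustrated with a minimal sketch. The function name `split_into_windows`, the `stride` parameter, and the use of plain Python lists for a univariate series are assumptions made for illustration; they are not taken from the authors' implementation.

```python
def split_into_windows(series, context_len, suspect_len, stride=1):
    """Split a series into overlapping (context, suspect) window pairs.

    Each full window has length context_len + suspect_len; the suspect
    segment is the tail of the window, the context is everything before it.
    """
    window_len = context_len + suspect_len
    pairs = []
    for start in range(0, len(series) - window_len + 1, stride):
        full = series[start:start + window_len]
        pairs.append((full[:context_len], full[context_len:]))
    return pairs


# Toy series with a long context and a single-point suspect window.
series = list(range(10))
pairs = split_into_windows(series, context_len=4, suspect_len=1)
first_context, first_suspect = pairs[0]
```

With a stride of one and a suspect window of length one, every time step after the first context window appears as the suspect point of exactly one window.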

Intuitively, our approach is based on the idea that we can identify anomalies by comparing representation vectors $\phi(w;\theta)$ and $\phi(w^{(c)};\theta)$, obtained by applying a neural network feature extractor $\phi(\cdot;\theta)$, which is trained in such a way that representations are pulled together if there is no anomaly present in the suspect window $w^{(s)}$, and pushed apart otherwise.

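We propose a loss function that can be seen as a contextual version of the HSC (equation 1), contrasting the representation of the context window with the representation of the full window:

$$\mathcal{L} = -(1 - y_i)\,\log \ell\bigl(\mathrm{dist}(\phi(w_i;\theta), \phi(w_i^{(c)};\theta))\bigr) - y_i\,\log\Bigl(1 - \ell\bigl(\mathrm{dist}(\phi(w_i;\theta), \phi(w_i^{(c)};\theta))\bigr)\Bigr). \tag{2}$$

In our experiments we follow Ruff et al. (2020) and use the Euclidean distance $\mathrm{dist}(x, z) = \lVert x - z \rVert_2$ and a radial basis function $\ell(d) = \exp(-d^2)$ to create a spherical decision boundary as in HSC/DeepSVDD, resulting in the loss function

$$(1 - y_i)\,\lVert \phi(w_i;\theta) - \phi(w_i^{(c)};\theta) \rVert_2^2 - y_i\,\log\Bigl(1 - \exp\bigl(-\lVert \phi(w_i;\theta) - \phi(w_i^{(c)};\theta) \rVert_2^2\bigr)\Bigr). \tag{3}$$
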
Intuitively, this is the HSC loss where the center of the hypersphere is chosen dynamically for each instance as the representation of the context. This introduces an inductive bias: representations of the context window and representations of the full window should be different if an anomaly occurs in the suspect window. As we show in our empirical analysis, this inductive bias makes the model more label efficient and leads to better generalization. In particular, we show that when this model is trained using generic injected anomalies such as point outliers, it is able to generalize to the more complex anomalies found in real-world datasets.
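As a rough illustration of the contextual hypersphere loss, the sketch below evaluates it for one pair of embeddings. It assumes a squared Euclidean distance and a scoring function of the form exp(-d^2), so that the normal-class term reduces to the squared distance; the function names, the `eps` clamp for numerical stability, and the support for soft labels are illustrative choices, not details confirmed by the paper.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def contextual_hsc_loss(z_full, z_context, y, eps=1e-12):
    """Contextual hypersphere loss for one window.

    z_full    : embedding of the full window
    z_context : embedding of the context window (acts as the sphere center)
    y         : 0 for normal, 1 for anomalous (values in between are allowed)
    """
    d2 = squared_distance(z_full, z_context)
    # -log(exp(-d^2)) = d^2: pulls normal windows towards their context.
    normal_term = d2
    # -log(1 - exp(-d^2)): pushes anomalous windows away from their context.
    # eps is an assumed clamp that avoids log(0) when the embeddings coincide.
    anomalous_term = -math.log(max(1.0 - math.exp(-d2), eps))
    return (1.0 - y) * normal_term + y * anomalous_term
```

Replacing `z_context` with a fixed vector of zeros gives a loss of the same form with a static center, which is the non-contextual setting the section contrasts against.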

3.2 NCAD architecture & training

Our model identifies anomalies in a space of learned latent representations, building on the intuition that if an anomaly is present in the suspect window $w^{(s)}$, then representation vectors constructed from $w$ and $w^{(c)}$ should be distant.

Figure 1: NCAD encodes two windows that differ by a suspect window (the full window and its context window) using the same TCN network and computes a distance score between the embeddings. The model is trained to give a high score for instances with an anomaly in the suspect window.

As illustrated in fig. 1, our NCAD architecture has three components:

  1. A neural network encoder $\phi(\cdot;\theta)$ that maps input sequences to representation vectors in $\mathbb{R}^E$. The same encoder is applied both to the full window and to the context window, resulting in representations $z = \phi(w;\theta)$ and $z_c = \phi(w^{(c)};\theta)$, respectively. While any neural network could be used, in our implementation we opt for a CNN with exponentially dilated causal convolutions (van den Oord et al., 2016), in particular the TCN architecture (Bai et al., 2018; Franceschi et al., 2019) with adaptive max-pooling along the time dimension.

  2. A distance-like function, $\mathrm{dist}(\cdot,\cdot): \mathbb{R}^E \times \mathbb{R}^E \to \mathbb{R}^+$, to compute the similarity between the representations $z$ and $z_c$.

  3. A probabilistic scoring function $\ell(\cdot)$, which creates a spherical decision boundary centered at the embedding of the context window.
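The three components can be sketched end to end in a toy, single-channel form. This is only an illustration under several assumptions: the hand-written `causal_dilated_conv` stands in for a real TCN layer, the `filter_banks` structure (one small stack of kernels per embedding dimension) and the fixed weights are hypothetical, and the score is taken to be one minus exp(-d^2) of the embedding distance.

```python
import math


def causal_dilated_conv(x, kernel, dilation):
    """1-D causal convolution: output[t] depends only on x[t], x[t-d], x[t-2d], ..."""
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, weight in enumerate(kernel):
            idx = t - j * dilation
            if idx >= 0:  # implicit zero padding on the left
                acc += weight * x[idx]
        out.append(acc)
    return out


def encode(window, filter_banks):
    """Map a window to an embedding with one entry per filter bank."""
    embedding = []
    for layers in filter_banks:
        h = list(window)
        for depth, kernel in enumerate(layers):
            # Dilation doubles with depth (1, 2, 4, ...), followed by a ReLU.
            h = [max(0.0, v) for v in causal_dilated_conv(h, kernel, 2 ** depth)]
        embedding.append(max(h))  # max-pooling along the time dimension
    return embedding


def anomaly_score(window, context_len, filter_banks):
    """Encode full and context windows with the same encoder and score their distance."""
    z = encode(window, filter_banks)
    z_c = encode(window[:context_len], filter_banks)
    d2 = sum((a - b) ** 2 for a, b in zip(z, z_c))
    return 1.0 - math.exp(-d2)


# Hypothetical fixed weights; in practice these would be learned.
filter_banks = [[[1.0, -1.0]], [[0.5, 0.5], [1.0, 0.0]]]
score_normal = anomaly_score([1.0] * 8, context_len=6, filter_banks=filter_banks)
score_spike = anomaly_score([1.0] * 7 + [5.0], context_len=6, filter_banks=filter_banks)
```

Because the pooling collapses the time dimension, the same encoder can be applied to the full window and to the shorter context window without any change.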

The parameters $\theta$ of the encoder are learned by minimizing the classification loss on minibatches of windows $w$. These are sampled uniformly at random (across time series and across time) from the training data set after applying the data augmentation techniques that follow.

Rolling predictions

While our window-based approach allows the model to decide if an anomaly is present in the suspect window, in many applications it is important to react quickly when an anomaly occurs, or to locate the anomaly with some accuracy. To support these requirements, we apply the model on rolling windows of the time series. Each time point can then be part of different suspect windows corresponding to different rolling windows, and so is given multiple anomaly scores. Using these we can either alert on the first high score, to reduce time to alert, or average the scores for each point to pinpoint the anomalies in time more accurately.
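A minimal sketch of how rolling-window scores might be turned into per-point scores follows. It assumes windows are rolled with a stride of one, that `window_scores[k]` is the score of the window starting at index `k`, and that both function names are hypothetical helpers rather than part of the published code.

```python
def pointwise_scores(window_scores, series_len, context_len, suspect_len):
    """Average, for each time step, the scores of all suspect windows covering it."""
    sums = [0.0] * series_len
    counts = [0] * series_len
    for start, score in enumerate(window_scores):
        first = start + context_len
        for t in range(first, first + suspect_len):
            sums[t] += score
            counts[t] += 1
    # Time steps never covered by a suspect window receive no score.
    return [s / c if c else None for s, c in zip(sums, counts)]


def first_flagged_step(window_scores, threshold, context_len):
    """Earliest time step inside a suspect window whose score reaches the threshold."""
    for start, score in enumerate(window_scores):
        if score >= threshold:
            return start + context_len
    return None


scores = [0.1, 0.9, 0.2]  # three rolling windows over a series of length 6
averaged = pointwise_scores(scores, series_len=6, context_len=2, suspect_len=2)
alert_at = first_flagged_step(scores, threshold=0.5, context_len=2)
```

Averaging trades a little latency for better localization, while alerting on the first high score favors reaction time.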

3.3 Data augmentation

In addition to the contrastive classifier, we utilize a collection of data augmentation methods that inject synthetic anomalies, allowing us to rely on a supervised training objective without requiring ground-truth labels. While we can rely on the hypersphere to train without any labels, having some labeled anomalous examples greatly improves performance, as has been observed in computer vision (Hendrycks et al., 2019). We cannot effectively rely on ground-truth anomalies, as few datasets have any and in practice it is very costly to obtain training labels; therefore, we propose to generate the anomalous examples. These data augmentation methods explicitly do not attempt to characterise the full data distribution of anomalies, which would be infeasible; rather, we combine effective generic heuristics that work well for detecting common types of out-of-distribution examples.

Contextual Outlier Exposure (COE) — Motivated by the success of OE for out-of-distribution detection (Hendrycks et al., 2019), we propose a simple task-agnostic method to create contextual out-of-distribution examples. Given a data window $w = (w^{(c)}, w^{(s)})$, we induce anomalies in the suspect segment $w^{(s)}$ by replacing a chunk of its values with values taken from another time series. The replaced values in $w^{(s)}$ will most likely break the temporal relation with their neighboring context, therefore creating an out-of-distribution example. In our implementation, we apply COE at training time by selecting random examples in a minibatch and permuting a random length of their suspect windows (visualizations in section B.1). In multivariate time series, as anomalies do not have to happen in all dimensions, we randomly select a subset of the dimensions in which the windows are swapped.
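The swap can be sketched for the univariate case as follows. The function name, the way the chunk length and position are drawn, and the explicit `rng` argument are assumptions for illustration; the multivariate variant described above would additionally pick a random subset of dimensions.

```python
import random


def contextual_outlier_exposure(window_a, window_b, context_len, rng):
    """Replace a random chunk of window_a's suspect segment with values from window_b.

    Returns the augmented window and the label 1 (anomalous). The context
    segment of window_a is left untouched.
    """
    suspect_len = len(window_a) - context_len
    chunk_len = rng.randint(1, suspect_len)           # assumed: uniform chunk length
    offset = rng.randint(0, suspect_len - chunk_len)  # assumed: uniform position
    start = context_len + offset
    augmented = list(window_a)
    augmented[start:start + chunk_len] = window_b[start:start + chunk_len]
    return augmented, 1


rng = random.Random(0)
window_a = [0.0] * 10   # e.g. a window from one series
window_b = [5.0] * 10   # a window from another series in the minibatch
augmented, label = contextual_outlier_exposure(window_a, window_b, context_len=6, rng=rng)
```

Since only the suspect segment changes, the context embedding stays the same while the full-window embedding is perturbed, which is the contrast the loss is trained on.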

Anomaly Injection — We propose to inject simple single point outliers (PO) in the time series. We use a simple method: at a set of randomly selected time points we add (or subtract) a spike to the time series. The spike is proportional to the inter-quartile range of the points surrounding the spike location. As for COE, in multivariate time series we simply select a random subset of dimensions on which we add the spike. These simple point outliers serve the same purpose as COE: they create clearly labeled abnormal points to help the learning of the hypersphere (visualizations in section B.2).
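A possible sketch of the point outlier injection for a univariate series is given below. The neighborhood size, the proportionality factor `scale`, and the function name are assumptions; the text only states that the spike is proportional to the inter-quartile range of the surrounding points.

```python
import random
import statistics


def inject_point_outlier(series, index, rng, neighborhood=10, scale=3.0):
    """Add or subtract a spike at `index`, sized by the local inter-quartile range.

    Returns a modified copy of the series and a 0/1 label list marking the spike.
    """
    lo = max(0, index - neighborhood)
    hi = min(len(series), index + neighborhood + 1)
    q1, _, q3 = statistics.quantiles(series[lo:hi], n=4)
    iqr = q3 - q1
    sign = rng.choice([-1.0, 1.0])        # spike direction chosen at random
    injected = list(series)
    injected[index] += sign * scale * iqr
    labels = [0] * len(series)
    labels[index] = 1
    return injected, labels


rng = random.Random(0)
series = [float(i % 5) for i in range(50)]
injected, labels = inject_point_outlier(series, index=25, rng=rng)
```

Scaling by the local inter-quartile range keeps the spike size meaningful across series with very different amplitudes.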

In addition to these, in some practical applications it is possible to identify general characteristics of the anomalies that should be detected. Some widely known anomalous patterns include sudden changes in the location or scale of the series (change-points), interruption of seasonality, etc. We have used this approach in our practical application, where the domain knowledge allowed us to improve the detection performance. As these injections require domain knowledge, it would be unfair to compare our method against others when incorporating them; therefore, in the results tables we only use the point outliers described above.

Window Mixup — If we do not have access to training labels and know little about the relevant anomalies, we can only rely on COE and PO, which may result in a significant mismatch between injected and true anomalies. To improve generalization of our model in this case, we propose to create linear combinations of training examples inspired by the mixup procedure (Zhang et al., 2018).

Mixup was proposed in the context of computer vision and creates new training examples out of original samples by using convex combinations of the features and their labels. This data augmentation technique creates more variety in training examples, but more importantly, the soft labels result in smoother decision functions that generalize better. Mixup is well suited for time series applications: convex combinations of time series most often result in realistic and plausible new time series (see visualizations in section B.3). We show that mixup can improve generalization of our model even in cases with a large mismatch between injected and true anomalies.
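The convex combination can be sketched as below. Drawing the mixing weight from a Beta(alpha, alpha) distribution follows the original mixup procedure; the value of `alpha`, the function name, and the univariate list representation are assumptions for illustration.

```python
import random


def window_mixup(window_a, label_a, window_b, label_b, rng, alpha=0.5):
    """Convex combination of two windows and of their labels."""
    lam = rng.betavariate(alpha, alpha)   # mixing weight in [0, 1]
    mixed = [lam * a + (1.0 - lam) * b for a, b in zip(window_a, window_b)]
    mixed_label = lam * label_a + (1.0 - lam) * label_b   # soft label
    return mixed, mixed_label


rng = random.Random(0)
normal_window = [0.0, 0.0, 0.0, 0.0]
anomalous_window = [1.0, 1.0, 1.0, 1.0]
mixed, soft_label = window_mixup(normal_window, 0, anomalous_window, 1, rng)
```

The resulting soft label lies between 0 and 1, so the training loss needs to accept fractional targets.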

4 Experiments

In this section, we compare the performance of our approach with alternative methods on public benchmark datasets, and explore the model behavior under different data settings and model variations in ablation studies. Further details on the experiments are included in the supplement.

4.1 Benchmark datasets

We benchmark our method against others on six datasets (more details in appendix E). While we share many of the concerns expressed by Wu and Keogh (2020) about the lack of quality benchmark datasets for time series anomaly detection, we use these commonly used benchmark datasets here for lack of better alternatives and to enable direct comparison of our approach to competing methods.

SMAP and MSL — Two datasets published by NASA (Hundman et al., 2018), Soil Moisture Active Passive (SMAP) and Mars Science Laboratory (MSL), with 55 and 27 series respectively. The lengths of the time series vary from 300 to 8500 observations.

SWaT — The dataset was collected on a water treatment testbed over 11 days; 36 attacks were launched in the last 4 days and compose the test set (Mathur and Tippenhauer, 2016). To compare our numbers with Shen et al. (2020), we use the first half of the proposed test set for validation and the second half for testing.

SMD — A 5-week-long dataset with 28 time series of 38 dimensions each, every one collected from a different machine in large internet companies (Su et al., 2019).

SMAP, MSL, SWaT, and SMD each have a pre-defined train/test split, where anomalies in the test set are labeled, while the training set contains unlabeled anomalies.

Yahoo — A dataset published by Yahoo Labs, consisting of 367 real and synthetic time series. Following Ren et al. (2019), we use the last 50% of the time points of each of the time series as test set and split the rest into 30% training and 20% validation sets.

KPI — A univariate dataset released in the AIOPS data competition (1). It consists of KPI curves from different internet companies at 1-minute intervals. Like Ren et al. (2019), we use 30% of the train set for validation. For KPI and Yahoo, labels are available for all the series.

4.2 Evaluation setup

Measuring the performance of time series anomaly detection methods in a universal way is challenging, as different applications often require different trade-offs between sensitivity, specificity, and temporal localization. To account for this, various measures that improve upon simple point-wise classification metrics have been proposed, e.g. the flexible segment-based score proposed by Tatbul et al. (2018) or the score used in the Numenta anomaly benchmark (Lavin and Ahmad, 2015). To make our results directly comparable, we follow the procedure proposed by Xu et al. (2018) (and subsequently used in other work: Su et al. (2019); Ren et al. (2019); Shen et al. (2020)), which offers a practical compromise: point-wise scores are used, but the predicted labels are expanded to mark an entire true anomalous segment as detected correctly if at least one time point was detected by the model (we use the implementation by Su et al. (2019)). We align our experimental protocol with this body of prior work and report F1 scores computed by choosing the best threshold on the test set. For each dataset, the best threshold is chosen and used on all the time series of the test set.
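The segment-expansion protocol can be illustrated with a short sketch. This is a simplified re-implementation written for illustration, not the code of Su et al. (2019); the function names and the exhaustive search over observed score values as candidate thresholds are assumptions.

```python
def point_adjust(y_true, y_pred):
    """Mark a whole true anomalous segment as detected if any point in it was predicted."""
    adjusted = list(y_pred)
    t, n = 0, len(y_true)
    while t < n:
        if y_true[t] == 1:
            end = t
            while end < n and y_true[end] == 1:
                end += 1
            if any(y_pred[t:end]):
                adjusted[t:end] = [1] * (end - t)
            t = end
        else:
            t += 1
    return adjusted


def f1_score(y_true, y_pred):
    """Point-wise F1 score for binary labels."""
    tp = sum(1 for a, b in zip(y_true, y_pred) if a == 1 and b == 1)
    fp = sum(1 for a, b in zip(y_true, y_pred) if a == 0 and b == 1)
    fn = sum(1 for a, b in zip(y_true, y_pred) if a == 1 and b == 0)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_adjusted_f1(y_true, scores):
    """Best point-adjusted F1 over all thresholds taken from the observed scores."""
    best = 0.0
    for threshold in sorted(set(scores)):
        y_pred = [1 if s >= threshold else 0 for s in scores]
        best = max(best, f1_score(y_true, point_adjust(y_true, y_pred)))
    return best


y_true = [0, 1, 1, 1, 0, 0, 1, 1, 0]
y_pred = [0, 0, 1, 0, 0, 1, 0, 0, 0]
adjusted = point_adjust(y_true, y_pred)
```

In this toy example the raw point-wise F1 is about 0.29, while the adjusted F1 is about 0.67, which shows how strongly the protocol rewards detecting any point of a long segment.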

In many real-world scenarios, one is interested in detecting anomalies in a streaming fashion on a variety of different time series that may not be known at deployment time. We incorporate these requirements by training a single model on all the training time series of each dataset, and evaluating that model on all the test set time series. Further, we use short suspect windows, which allow us to decide whether a point is anomalous when it is first observed. We report the performance in this harder detection setting.

Hyperparameters were chosen in the following way: for Yahoo, KPI and SWaT, as the validation datasets have anomaly labels available, we use Bayesian optimization (Perrone et al., 2020) for parameter tuning, maximizing the F1 score on the validation set. If no validation set with labels is available, we use a set of standard hyperparameter settings inferred from the datasets with labeled validation sets (see details in the supplement). On each dataset we pick the context window length to roughly match the length of the seasonal patterns in the time series.

We run the model 10 times on each of the benchmark datasets and report mean and standard deviation. We use standard AWS EC2 ml.p3.2xlarge instances with a single Tesla V100 GPU. Training the model on one of the benchmark datasets takes on average 90 minutes. In our code we provide scripts to reproduce the results on the benchmark datasets shown below.

4.3 Benchmark results

Table 1 shows the performance of our NCAD approach compared against the state-of-the-art methods on two commonly used univariate datasets. As these datasets contain labels for anomalies both in the training and the test set, we evaluate our method on them both in the supervised setting (sup.) and the unsupervised setting (un.). The numbers for the competing methods are taken from Ren et al. (2019). Our approach significantly outperforms competing approaches on Yahoo, performs similarly to the best unsupervised approach on KPI, and slightly worse than the best supervised approach. It is important to note that while other methods are designed either for the supervised or for the unsupervised setting, our method can be used seamlessly in both settings.

| Model | Yahoo (un.) | KPI (un.) | KPI (sup.) |
|---|---|---|---|
| SPOT (Siffer et al., 2017) | 33.8 | 21.7 | - |
| DSPOT (Siffer et al., 2017) | 31.6 | 52.1 | - |
| DONUT (Xu et al., 2018) | 2.6 | 34.7 | - |
| SR (Ren et al., 2019) | 56.3 | 62.2 | - |
| SR-CNN (Ren et al., 2019) | 65.2 | 77.1 | - |
| SR+DNN (Ren et al., 2019) | - | - | 81.1 |
| NCAD w/ COE, PO, mixup | 81.16 ± 1.43 | 76.64 ± 0.89 | 79.20 ± 0.92 |

Table 1: F1 score of the model on univariate datasets

The Yahoo-supervised experiments are included in section C.2, where we compare against the supervised approach of Gao et al. (2020), which to the best of our knowledge represents the state of the art in this setting. Our approach significantly outperforms theirs in terms of point-wise F1 score.

| Model | SMAP | MSL | SWaT | SMD |
|---|---|---|---|---|
| AnoGAN (Schlegl et al., 2017) | 74.59 | 86.39 | 86.64 | - |
| DeepSVDD (Ruff et al., 2018) | 71.71 | 88.12 | 82.82 | - |
| DAGMM (Zong et al., 2018) | 82.04 | 86.08 | 85.38 | 70.94 |
| LSTM-VAE (Park et al., 2018) | 75.73 | 73.79 | 86.39 | 78.42 |
| MSCRED (Zhang et al., 2019) | 77.45 | 85.97 | 86.84 | - |
| OmniAnomaly (Su et al., 2019) | 84.34 | 89.89 | - | 88.57 |
| MTAD-GAT (Zhao et al., 2020) | 90.13 | 90.84 | - | - |
| THOC (Shen et al., 2020) | 95.18 | 93.67 | 88.09 | - |
| NCAD w/ COE, PO, mixup | 94.45 ± 0.68 | 95.60 ± 0.59 | 95.28 ± 0.76 | 80.16 ± 0.69 |

Table 2: F1 score of the model on multivariate datasets

Table 2 shows the performance of our NCAD approach compared against the state-of-the-art methods on multivariate datasets. None of these datasets provides labels for the anomalies in the training set, and all benchmark methods are designed for unsupervised anomaly detection. Our method outperforms THOC by a reasonable margin on both MSL and SWaT. On SMAP, while our average score is slightly lower, the difference is within the variance. OmniAnomaly (Su et al., 2019) is the state of the art on SMD, and our numbers are second only to theirs. We note that OmniAnomaly is considerably more costly and less scalable, since it trains one model for each of the 28 time series of the dataset, while we train a single global model.

4.4 Ablation study

To better understand the advantage brought by each of the components of our method, we perform an ablation study on the SMAP and MSL datasets, shown in fig. 1(a). We average two runs for each configuration; the full table with all configurations and standard deviations is shown and discussed in section C.1. The row labeled "- contextual …" does not use the contextual hypersphere described in section 3.1, but instead a model trained using the original hypersphere classifier loss on the whole-window representation $\phi(w;\theta)$. The contextual loss function provides a substantial performance boost, making our approach competitive even without the data augmentation techniques. Each of the data augmentation techniques improves the performance further. A further ablation study on the supervised Yahoo dataset can be found in table 4.

| Model | SMAP | MSL |
|---|---|---|
| THOC (Shen et al., 2020) | 95.18 | 93.67 |
| NCAD w/ COE, PO, mixup | 94.45 | 95.60 |
| - PO | 94.28 | 94.73 |
| - COE | 88.59 | 94.66 |
| - mixup - COE - PO | 66.9 | 79.47 |
| - contextual - mixup - COE - PO | 55.09 | 36.03 |

(a) Ablation study on SMAP and MSL
(b) F1 score of NCAD on the Yahoo dataset trained with only a fraction of training anomalies being labeled.

4.5 Scaling from unsupervised to supervised

To investigate how the performance of our approach changes as we scale from unsupervised, to semi-supervised, to fully supervised, we measure the performance of our approach as a function of the amount of ground truth labels on the Yahoo dataset, shown in fig. 1(b). Firstly, we observe that the performance increases steadily with the amount of true anomaly labels, as desired. Secondly, by using synthetic anomalies (either PO or COE), we can significantly boost the performance in the regime where no or only few labels are available. Finally, by using an injection technique that is well aligned with the desired type of anomalies (PO in this case, as Yahoo contains a large number of single-point outliers), one can significantly improve performance over relying solely on the labeled data. This is explained by the very high class imbalance in anomaly detection. The flipside is, of course, that injecting anomalies that may be significantly different from the desired anomalies (COE in this case) can ultimately hurt when enough labeled examples are available.

4.6 Using specialized anomaly injection methods

While in all our benchmarks we rely on completely generic anomalies for injection (COE and PO), a by-product of our methodology is that the model can be guided towards detecting the desired class of anomalies by designing anomaly injection methods that mimic the true anomalies. Designing such methods is often simple compared to finding enough examples of true anomalies, as they are rare. Figure 1(c) demonstrates the effectiveness of this approach: the first dimension of the SMAP dataset contains slow slopes that are labeled as anomalous in the dataset. These are harder to detect for our model when only using COE and PO, because these cannot create similar behavior. We can design a simple anomaly injection that injects slopes into a randomly selected region and labels it as anomalous. Training NCAD with these slopes gives a model that achieves a much better score.

This approach can be effective in applications where anomalies are subtle and closer to the normal data, and where some prior knowledge is available about the kind of anomalies that are to be detected. However, one may not have this prior knowledge or the resources required to create these injections. This is a limitation of the technique, which prevents it from being generally applicable, and it is the reason why we did not use it for the comparison to the other methods.
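A slope injection of the kind described in this section might look like the sketch below. The linear ramp shape, the `total_drift` parameter, the return to the baseline after the region, and the function name are all assumptions; the text only states that slopes are injected into a randomly selected region that is labeled as anomalous.

```python
def inject_slope(series, start, length, total_drift):
    """Add a linear ramp over [start, start + length) and label that region anomalous.

    The ramp grows from total_drift / length up to total_drift; values after
    the region are left unchanged (an assumption of this sketch).
    """
    injected = list(series)
    labels = [0] * len(series)
    end = min(start + length, len(series))
    for step, t in enumerate(range(start, end), start=1):
        injected[t] += total_drift * step / length
        labels[t] = 1
    return injected, labels


flat = [0.0] * 10
injected, labels = inject_slope(flat, start=3, length=4, total_drift=2.0)
```

In practice the start position and length would be drawn at random for each training window, in the same way as for the generic injections.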

Model                smap 1st dimension (F1)
ncad                 93.38
ncad + injections    96.48
(c) F1 score on the first dimension of smap with specialized anomaly injections.
(d) F1 score vs. width of true anomalies for models trained only on point outliers, with different fractions of training examples mixed-up.
Figure 2: Investigating the potential of anomaly injections taking advantage of domain knowledge, and investigating the generalization of the model from po when this knowledge is not available.

4.7 Generalization from injected anomalies

Artificial anomalies will always differ from the true anomalies to some extent, be it the ones created by coe, po, or more complex methods. This requires the model to bridge this gap and generalize from imperfect training examples to true anomalies. By design, the hypersphere formulation can help to bridge the gap, and we use mixup to further improve the generalization capabilities of the model. Figure 1(d) shows the results of an experiment exploring one aspect of this generalization ability for ncad. The model is trained with injected single-point outliers, and we measure the detection performance for anomalies of longer width. For this experiment we use a synthetic base dataset containing simple sinusoid time series with Gaussian noise. We create multiple datasets from this base dataset, adding true anomalies of varying width by convolving spike anomalies with Gaussian filters of different widths. For training, regardless of the shape of the true anomalies, we use po and train models using different mixup rates, i.e., fractions of training examples with mixup applied. We observe that mixup helps the model to generalize in this setting: the higher the mixup rate, the better the model generalizes to anomalies that differ from the injected examples, achieving higher F1 scores.
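A synthetic series of the kind described above could be generated along the following lines. This is only a sketch: the period, noise level, anomaly height, the peak-normalised kernel, and the rule that labels the whole kernel support as anomalous are assumptions made for illustration, and the function names are hypothetical.

```python
import math
import random


def gaussian_kernel(sigma, radius=None):
    """Gaussian filter, normalised so that its peak equals 1 (an assumption)."""
    radius = radius if radius is not None else max(1, int(3 * sigma))
    weights = [math.exp(-0.5 * (k / sigma) ** 2) for k in range(-radius, radius + 1)]
    peak = max(weights)
    return [w / peak for w in weights]


def make_series(n=1000, period=50, noise_std=0.1, anomaly_at=500,
                height=2.0, sigma=3.0, seed=0):
    """Noisy sinusoid with one spike anomaly widened by a Gaussian filter."""
    rng = random.Random(seed)
    values = [math.sin(2 * math.pi * t / period) + rng.gauss(0.0, noise_std)
              for t in range(n)]
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    labels = [0] * n
    # Convolving a single spike of the given height with the kernel is the
    # same as adding a scaled copy of the kernel centred on the spike.
    for offset, w in enumerate(kernel):
        t = anomaly_at + offset - radius
        if 0 <= t < n:
            values[t] += height * w
            labels[t] = 1
    return values, labels
```

Increasing `sigma` widens the true anomaly while the training-time injections remain single-point outliers.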

5 Discussion

We present ncad, a methodology for anomaly detection in time series that achieves state-of-the-art performance in a broad range of settings, including both the univariate and multivariate cases, as well as across the unsupervised, semi-supervised, and supervised anomaly detection regimes. We demonstrate that combining expressive neural representation for time series with data augmentation techniques can outperform traditional approaches such as predictive models or methods based on reconstruction error.

We do not foresee any clear potential negative societal impact of this work. Time series anomaly detection is a general problem with applications in many different domains, such as cyber-security, where it can be used to automatically prevent attacks on power plants or hospitals. While the anomaly detection results of our approach are good, we think that the detections of the algorithm should not be blindly followed in medical applications that directly impact patients' health.


References

  • [1] AIOps challenge. Cited by: Appendix E, §4.1.
  • F. Ayed, L. Stella, T. Januschowski, and J. Gasthaus (2020) Anomaly detection at scale: the case for deep distributional time series models. arXiv preprint arXiv:2007.15541. Cited by: §1.
  • S. Bai, J. Z. Kolter, and V. Koltun (2018) An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling. arXiv preprint arXiv:1803.01271. External Links: 1803.01271, Link Cited by: §A.1, §1, item 1.
  • V. Chandola, A. Banerjee, and V. Kumar (2009) Anomaly detection: a survey. ACM computing surveys (CSUR) 41 (3), pp. 1–58. Cited by: §2.
  • L. Deecke, L. Ruff, R. A. Vandermeulen, and H. Bilen (2020) Deep Anomaly Detection by Residual Adaptation. arXiv preprint arXiv:2010.02310. External Links: 2010.02310, Link Cited by: §2.
  • W. A. Falcon (2019) PyTorch Lightning. External Links: Link Cited by: Appendix D.
  • J. Y. Franceschi, A. Dieuleveut, and M. Jaggi (2019) Unsupervised scalable representation learning for multivariate time series. In Proceedings of the 33rd Conference on Neural Information Processing Systems, NeurIPS 2019, Vol. 32. External Links: 1901.10738, ISSN 10495258 Cited by: §A.1, §1, item 1.
  • J. Gao, X. Song, Q. Wen, P. Wang, L. Sun, and H. Xu (2020) RobustTAD: Robust Time Series Anomaly Detection via Decomposition and Convolutional Neural Networks. arXiv preprint arXiv:2002.09545. External Links: 2002.09545, Link Cited by: §C.2, Table 4, §1, §2, §4.3.
  • S. Guha, N. Mishra, G. Roy, and O. Schrijvers (2016) Robust random cut forest based anomaly detection on streams. In International conference on machine learning, pp. 2712–2721. Cited by: §3.1.
  • C. R. Harris, K. J. Millman, S. J. van der Walt, R. Gommers, P. Virtanen, D. Cournapeau, E. Wieser, J. Taylor, S. Berg, N. J. Smith, R. Kern, M. Picus, S. Hoyer, M. H. van Kerkwijk, M. Brett, A. Haldane, J. F. del Río, M. Wiebe, P. Peterson, P. Gérard-Marchant, K. Sheppard, T. Reddy, W. Weckesser, H. Abbasi, C. Gohlke, and T. E. Oliphant (2020) Array programming with NumPy. Nature 585 (7825), pp. 357–362. External Links: Document, Link Cited by: Appendix E.
  • D. Hendrycks, M. Mazeika, and T. Dietterich (2019) Deep anomaly detection with outlier exposure. In Proceedings of the 7th International Conference on Learning Representations, ICLR 2019, External Links: 1812.04606, Link Cited by: §1, §1, §2, §3.3, §3.3.
  • K. Hundman, V. Constantinou, C. Laporte, I. Colwell, and T. Soderstrom (2018) Detecting spacecraft anomalies using lstms and nonparametric dynamic thresholding. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, pp. 387–395. Cited by: Appendix E, §4.1.
  • A. Lavin and S. Ahmad (2015) Evaluating real-time anomaly detection algorithms–the numenta anomaly benchmark. In 2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA), pp. 38–44. Cited by: §4.2.
  • E. Liberty, Z. Karnin, B. Xiang, L. Rouesnel, B. Coskun, R. Nallapati, J. Delgado, A. Sadoughi, Y. Astashonok, P. Das, C. Balioglu, S. Chakravarty, M. Jha, P. Gautier, D. Arpin, T. Januschowski, V. Flunkert, Y. Wang, J. Gasthaus, L. Stella, S. Rangapuram, D. Salinas, S. Schelter, and A. Smola (2020) Elastic machine learning algorithms in amazon sagemaker. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, pp. 731–737. Cited by: Appendix D.
  • T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar (2020) Focal Loss for Dense Object Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 42 (2), pp. 318–327. External Links: Document, 1708.02002, ISSN 0162-8828, Link Cited by: §A.3.
  • D. Liu, Y. Zhao, H. Xu, Y. Sun, D. Pei, J. Luo, X. Jing, and M. Feng (2015) Opprentice: towards practical and automatic anomaly detection through machine learning. In Proceedings of the 2015 Internet Measurement Conference, pp. 211–224. Cited by: §2.
  • P. Liznerski, L. Ruff, R. A. Vandermeulen, B. J. Franks, M. Kloft, and K. Müller (2020) Explainable Deep One-Class Classification. In Proceedings of the 9th International Conference on Learning Representations, ICLR 2021, External Links: 2007.01760, Link Cited by: §2.
  • A. P. Mathur and N. O. Tippenhauer (2016) SWaT: a water treatment testbed for research and training on ics security. In 2016 international workshop on cyber-physical systems for smart water networks (CySWater), pp. 31–36. Cited by: Appendix E, §4.1.
  • G. Pang, C. Shen, L. Cao, and A. v. d. Hengel (2020) Deep learning for anomaly detection: a review. arXiv preprint arXiv:2007.02500. Cited by: §1, §2.
  • D. Park, Y. Hoshi, and C. C. Kemp (2018) A multimodal anomaly detector for robot-assisted feeding using an lstm-based variational autoencoder. IEEE Robotics and Automation Letters 3 (3), pp. 1544–1551. Cited by: §2, Table 2.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: An Imperative Style, High-Performance Deep Learning Library. Proceedings of the 33rd Conference on Neural Information Processing Systems, NeurIPS 2019. External Links: 1912.01703, Link Cited by: Appendix D.
  • V. Perrone, H. Shen, A. Zolic, I. Shcherbatyi, A. Ahmed, T. Bansal, M. Donini, F. Winkelmolen, R. Jenatton, J. B. Faddoul, B. Pogorzelska, M. Miladinovic, K. Kenthapadi, M. Seeger, and C. Archambeau (2020) Amazon SageMaker Automatic Model Tuning: Scalable Black-box Optimization. Technical report Amazon. External Links: 2012.08489, Link Cited by: §D.1, §4.2.
  • H. Ren, B. Xu, Y. Wang, C. Yi, C. Huang, X. Kou, T. Xing, M. Yang, J. Tong, and Q. Zhang (2019) Time-Series Anomaly Detection Service at Microsoft. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Vol. 19, New York, NY, USA, pp. 3009–3017. External Links: Document, 1906.03821, ISBN 9781450362016, Link Cited by: §1, §2, §3.1, §4.1, §4.1, §4.2, §4.3, Table 1.
  • L. Ruff, J. R. Kauffmann, R. A. Vandermeulen, G. Montavon, W. Samek, M. Kloft, T. G. Dietterich, and K. Müller (2021) A Unifying Review of Deep and Shallow Anomaly Detection. Proceedings of the IEEE 109 (5), pp. 756–795. External Links: Document, 2009.11732, ISSN 0018-9219, Link Cited by: §1, §2, §2.
  • L. Ruff, R. A. Vandermeulen, N. Görnitz, A. Binder, E. Müller, K. Müller, and M. Kloft (2020) Deep Semi-Supervised Anomaly Detection. In Proceedings of the 8th International Conference on Learning Representations, ICLR 2020, External Links: 1906.02694, Link Cited by: §1, §2.
  • L. Ruff, R. A. Vandermeulen, N. Görnitz, L. Deecke, S. A. Siddiqui, A. Binder, E. Müller, and M. Kloft (2018) Deep One-Class Classification. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, pp. 4393–4402. External Links: Link Cited by: §1, §2, Table 2.
  • L. Ruff, R. A. Vandermeulen, B. J. Franks, K. Müller, and M. Kloft (2020) Rethinking Assumptions in Deep Anomaly Detection. arXiv preprint arXiv:2006.00339. External Links: 2006.00339, Link Cited by: §A.2, §1, §1, §1, §2, §2, §3.1.
  • T. Schlegl, P. Seeböck, S. M. Waldstein, U. Schmidt-Erfurth, and G. Langs (2017) Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In International conference on information processing in medical imaging, pp. 146–157. Cited by: §2, Table 2.
  • L. Shen, Z. Li, and J. Kwok (2020) Timeseries anomaly detection using temporal hierarchical one-class network. Advances in Neural Information Processing Systems 33. Cited by: Table 3, §2, 1(a), §4.1, §4.2, Table 2.
  • W. A. Shewhart (1931) Economic control of quality of manufactured product. Macmillan And Co Ltd, London. Cited by: §1.
  • D. T. Shipmon, J. M. Gurevitch, P. M. Piselli, and S. T. Edwards (2017) Time series anomaly detection; detection of anomalous drops with limited features and sparse examples in noisy highly periodic data. arXiv preprint arXiv:1708.03665. Cited by: §1, §2, §2.
  • A. Siffer, P. Fouque, A. Termier, and C. Largouet (2017) Anomaly detection in streams with extreme value theory. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1067–1075. Cited by: §2, Table 1.
  • D. Smolyakov, N. Sviridenko, V. Ishimtsev, E. Burikov, and E. Burnaev (2019) Learning Ensembles of Anomaly Detectors on Synthetic Data. arXiv:1905.07892 [cs, stat]. External Links: 1905.07892 Cited by: §2.
  • Y. Su, Y. Zhao, C. Niu, R. Liu, W. Sun, and D. Pei (2019) Robust anomaly detection for multivariate time series through stochastic recurrent neural network. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2828–2837. Cited by: Appendix E, §2, §4.1, §4.2, §4.3, Table 2, footnote 4.
  • N. Tatbul, T. J. Lee, S. Zdonik, M. Alam, and J. Gottschlich (2018) Precision and recall for time series. Advances in Neural Information Processing Systems 31, pp. 1920–1930. Cited by: §4.2.
  • D. M. Tax and R. P. Duin (2004) Support vector data description. Machine learning 54 (1), pp. 45–66. Cited by: §2.
  • N. Tuluptceva, B. Bakker, I. Fedulova, H. Schulz, and D. V. Dylov (2020) Anomaly Detection with Deep Perceptual Autoencoders. arXiv preprint arXiv:2006.13265. External Links: 2006.13265, ISBN 2006.13265v2, Link Cited by: §2.
  • A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu (2016) WaveNet: A Generative Model for Raw Audio. Technical report Google. External Links: 1609.03499, Link Cited by: item 1.
  • R. Wu and E. J. Keogh (2020) Current time series anomaly detection benchmarks are flawed and are creating the illusion of progress. arXiv preprint arXiv:2009.13807. Cited by: footnote 2.
  • H. Xu, W. Chen, N. Zhao, Z. Li, J. Bu, Z. Li, Y. Liu, Y. Zhao, D. Pei, Y. Feng, et al. (2018) Unsupervised anomaly detection via variational auto-encoder for seasonal kpis in web applications. In Proceedings of the 2018 World Wide Web Conference, pp. 187–196. Cited by: §4.2, Table 1.
  • M. Zaheer, S. J. Reddi, D. Sachan, S. Kale, and S. Kumar (2018) Adaptive methods for nonconvex optimization. In Proceedings of the 32nd Conference on Neural Information Processing Systems, NIPS 2018, Vol. 2018-Decem, pp. 9793–9803. External Links: ISSN 10495258 Cited by: Appendix D.
  • C. Zhang, D. Song, Y. Chen, X. Feng, C. Lumezanu, W. Cheng, J. Ni, B. Zong, H. Chen, and N. V. Chawla (2019) A Deep Neural Network for Unsupervised Anomaly Detection and Diagnosis in Multivariate Time Series Data. Proceedings of the AAAI Conference on Artificial Intelligence 33, pp. 1409–1416. External Links: Document, 1811.08055, ISSN 2374-3468, Link Cited by: §2, Table 2.
  • H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz (2018) Mixup: Beyond Empirical Risk Minimization. In Proceedings of the Sixth International Conference on Learning Representations, ICLR 2018, External Links: 1710.09412, ISSN 23318422 Cited by: §B.3, §1, §3.3.
  • H. Zhao, Y. Wang, J. Duan, C. Huang, D. Cao, Y. Tong, B. Xu, J. Bai, J. Tong, and Q. Zhang (2020) Multivariate time-series anomaly detection via graph attention network. arXiv preprint arXiv:2009.02040. Cited by: Table 2.
  • B. Zong, Q. Song, M. R. Min, W. Cheng, C. Lumezanu, D. Cho, and H. Chen (2018) Deep autoencoding gaussian mixture model for unsupervised anomaly detection. In International Conference on Learning Representations, Cited by: Table 2.

Appendix A Model Architecture

A.1 Encoder

Our encoder function is similar to the encoder proposed by Franceschi et al. (2019) for generating universal representations of multivariate time series. It maps a window of fixed length, taken from a time series of a given dimension (1 for univariate series), to a vector representation of fixed dimension.

The architecture is based on multi-stack tcn (Bai et al., 2018), which combines causal convolutions with residual connections. The output of this causal network is passed to an adaptive max pooling layer, aggregating the temporal dimension into a fixed-size vector. In our experiments, using adaptive pooling consistently outperformed the global pooling alternative. A linear transformation is applied to produce the unnormalized vector representations, which are then L2-normalized to produce the final output of the encoder. See fig. 3 for an illustration.

Figure 3: Illustration of the encoder architecture.
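The last two stages of the encoder, pooling over time and normalization, can be illustrated in isolation. The sketch below is pure Python and leaves out the TCN and the linear layer entirely. The binning rule follows the usual adaptive-pooling convention (bin i covers indices from floor(i·n/k) up to ceil((i+1)·n/k)), and the use of the L2 norm is an assumption of this sketch.

```python
import math


def adaptive_max_pool_1d(values, output_size):
    """Reduce a sequence of any length to `output_size` values by max pooling."""
    n = len(values)
    pooled = []
    for i in range(output_size):
        start = (i * n) // output_size
        end = -((-(i + 1) * n) // output_size)  # ceiling of (i + 1) * n / output_size
        pooled.append(max(values[start:end]))
    return pooled


def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit Euclidean norm (eps guards against division by zero)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / max(norm, eps) for v in vector]
```

For example, `adaptive_max_pool_1d([1, 5, 2, 8, 3, 4], 3)` pools the three bins `[1, 5]`, `[2, 8]` and `[3, 4]`, so windows of different lengths all yield a vector of the same size.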

A.2 Distance

We introduce a “distance” function as a central element in our approach. The function in our framework is not strictly a distance in the mathematical sense: the triangle inequality is not required, but the function is expected to be symmetric and non-negative. For a given window, the distance function is used to compare the embedding of the entire window with the embedding of its corresponding context segment. The output of this function can be directly interpreted as an anomaly score for the associated suspect segment.

We explored two types of distances: the Euclidean distance between the two embeddings, and the Cosine distance, defined as a simple logarithmic mapping of the cosine similarity between the two embeddings to the positive reals.
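The two distances can be sketched as follows. The Euclidean distance is standard. For the Cosine distance, the particular mapping used here, the negative logarithm of (1 + similarity) / 2, is an assumption: it is one simple logarithmic mapping of the cosine similarity to the non-negative reals and may differ from the exact form used in the experiments.

```python
import math


def euclidean_distance(x, z):
    """Euclidean (L2) distance between two embeddings."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, z)))


def cosine_similarity(x, z):
    """Cosine similarity, in [-1, 1], between two non-zero embeddings."""
    dot = sum(a * b for a, b in zip(x, z))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_z = math.sqrt(sum(b * b for b in z))
    return dot / (norm_x * norm_z)


def cosine_distance(x, z, eps=1e-12):
    """Assumed mapping: -log((1 + sim) / 2), which is 0 for identical directions
    and grows without bound as the embeddings point in opposite directions."""
    sim = cosine_similarity(x, z)
    return -math.log(max((1.0 + sim) / 2.0, eps))
```

Both functions are symmetric and non-negative, which is all the framework requires of a “distance”.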

In our experiments, we found that both distances were able to achieve state-of-the-art results in most of the reported benchmark datasets, with slightly better performance from the Euclidean distance. All the results of ncad reported in section 4 and appendix C are based on the Euclidean distance.

Moreover, the ncad framework can be extended to use other distances, e.g. other norms, the pseudo-Huber norm used by Ruff et al. (2020), or even trainable neural-based distances.

A.3 Probabilistic scoring function and Classification Loss

The final element in our model is the binary classification loss which measures the agreement between target labels and the assigned anomaly scores.

As described in section 3.2, for a given window and encoder, we compute the anomaly score for the suspect window as the distance between the corresponding representations of the full window and the context window.

This score is mapped into a pseudo-probability of an anomaly via the probabilistic scoring function, which is used within the bce loss to define the target to minimize during training. We train the encoder using mini-batch gradient descent (see appendix D), taking randomly selected windows and minimizing the bce loss between the target labels and these pseudo-probabilities over the windows in the mini-batch.

Alternative classification losses, for example the mae, the mse, or the Focal Loss (Lin et al., 2020), can be considered as an extension of the standard ncad framework. These losses may be particularly useful in applications with significant contamination of labels.
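The standard scoring and loss described in this section can be illustrated with a short sketch. The exponential form of the scoring function, which maps a distance d to the pseudo-probability 1 − exp(−|d|) of an anomaly, is an assumption in the style of the hypersphere classifier of Ruff et al. (2020). The function names and the clipping constant are hypothetical.

```python
import math


def anomaly_probability(distance):
    """Assumed scoring function: 0 at distance 0, approaching 1 for large distances."""
    return 1.0 - math.exp(-abs(distance))


def bce_loss(distances, labels, eps=1e-7):
    """Mean binary cross-entropy between labels and pseudo-probabilities.

    Labels may be hard (0 or 1) or soft (values in between, as produced by mixup).
    """
    total = 0.0
    for d, y in zip(distances, labels):
        p = min(max(anomaly_probability(d), eps), 1.0 - eps)  # clip for numerical stability
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(distances)
```

Under this form, a normal window is pushed towards a small distance between the full-window and context embeddings, while an anomalous window is pushed towards a large one.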

Appendix B Data Augmentation

As presented in section 3.3, our framework relies on three data augmentation methods for time series. We provide more details and some visualizations in this section.

Figure 4 shows four time series drawn from the SMAP benchmark dataset. We use these to visualize each of the data augmentation methods.

Figure 4: The four original time series used for visualization of the data augmentation methods.

B.1 Contextual Outlier Exposure (COE)

As described in section 3.3, we use coe at training time to generate additional anomalous training examples. These are created by replacing a chunk of values in one suspect window with values taken from another suspect window in the same batch.

As an example, consider fig. 5, where each time series has the window between 1500 and 1550 swapped with its horizontal neighbor. We can see that this creates an anomalous window that does not follow the expected behavior. In some cases, the swapping of values creates jumps, as in (c); in other cases the change is more subtle, as in (d), where the series becomes constant for the duration of the interval, or in (a) and (b), where the regular pattern is broken. Our implementation of coe can be found in file src/ncad/model/ of the supplementary code.

Figure 5: Visualization of coe. Each of the series has its window between 1500 and 1550 swapped with its horizontal neighbor time series: (a) swaps with (b) and (c) swaps with (d).
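The swapping step of coe could be sketched as follows. This is not the code from the supplementary material: the function name, the assumption that the suspect segment is the last `suspect_len` points of the window, and the random choice of chunk length and dimensions are all illustrative.

```python
import random


def contextual_outlier_exposure(window_a, window_b, suspect_len, rng=None):
    """Replace a chunk of the suspect segment of window_a with values from window_b.

    Each window is a list of channels, and each channel is a list of values.
    The suspect segment is assumed to be the last `suspect_len` points.
    Returns the new window together with the anomalous label 1.
    """
    rng = rng or random.Random()
    n_dims = len(window_a)
    length = len(window_a[0])
    dims = rng.sample(range(n_dims), rng.randint(1, n_dims))     # random subset of dimensions
    chunk_len = rng.randint(1, suspect_len)                      # how many values to swap
    start = rng.randint(length - suspect_len, length - chunk_len)
    new_window = [list(channel) for channel in window_a]         # leave the input untouched
    for d in dims:
        new_window[d][start:start + chunk_len] = window_b[d][start:start + chunk_len]
    return new_window, 1
```

The context segment is never modified, so the resulting window is anomalous only relative to its own context.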

B.2 Point Outliers (PO)

As described in section 3.3, po allows us to add single isolated outliers to the time series. We simply add a spike to the time series at a random location in time. By default, the spike is between 0.5 and 3 times the inter-quartile range of the 100 points around the spike location.

With this method, the injected spike can be a local outlier, but is not necessarily a global outlier, as its magnitude could be within the range of other values in the time series. Similarly to coe, in the case of multivariate time series we select a random subset of dimensions on which we add the spike. In fig. 6 we visualize some examples of the injected point outliers. These are added to the 1550th value of each of the time series. We can see that they break the normal patterns but do not necessarily result in extreme events. Our implementation of po can be found in file src/ncad/ts/transforms/ of the supplementary code.

Figure 6: Visualization of po. In each of the time series a point outlier is injected at the 1550th value.
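A univariate version of the po injection can be sketched as follows, using the default range of 0.5 to 3 times the inter-quartile range of the 100 surrounding points. The function names, the linear-interpolation quantile rule, and the random sign of the spike are assumptions of this sketch.

```python
import random


def interquartile_range(values):
    """IQR using linear interpolation between order statistics."""
    ordered = sorted(values)

    def quantile(q):
        pos = q * (len(ordered) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(ordered) - 1)
        return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

    return quantile(0.75) - quantile(0.25)


def inject_point_outlier(series, neighbourhood=100, min_scale=0.5, max_scale=3.0, rng=None):
    """Add one spike at a random location, scaled by the local IQR."""
    rng = rng or random.Random()
    n = len(series)
    t = rng.randrange(n)                                      # random spike location
    half = neighbourhood // 2
    local = series[max(0, t - half):min(n, t + half)]         # points around the location
    magnitude = rng.uniform(min_scale, max_scale) * interquartile_range(local)
    out = list(series)
    out[t] += rng.choice((-1.0, 1.0)) * magnitude             # spike up or down (assumption)
    labels = [0] * n
    labels[t] = 1
    return out, labels
```

Because the magnitude is tied to the local spread, the spike stands out from its neighbourhood without necessarily exceeding the global range of the series.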

B.3 Time series Mixup

As described in section 3.3, inspired by Zhang et al. (2018) we use an adapted version of the Mixup procedure as part of our framework.

We sample a mixing coefficient; the parameter of its sampling distribution was set to the value that gave the best generalization among the values tried in the experiment of fig. 1(d). Using this coefficient, we create a new training example as a convex combination of two examples from the batch.

Note that, in addition to the new time series values, the method also produces soft labels, different from 0 or 1, which are used during training. Our implementation of mixup can be found in file src/ncad/model/ of the supplementary code.
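The convex combination can be sketched as below. Following Zhang et al. (2018), the mixing coefficient is drawn from a symmetric Beta distribution. The default value of `alpha` is a placeholder chosen for illustration and is not the value used in the experiments.

```python
import random


def mixup(x_i, y_i, x_j, y_j, alpha=0.1, rng=None):
    """Mix two training examples and their labels.

    x_i, x_j: windows of equal length (lists of floats); y_i, y_j: labels in [0, 1].
    alpha: Beta distribution parameter (placeholder value, an assumption).
    """
    rng = rng or random.Random()
    lam = rng.betavariate(alpha, alpha)                      # mixing coefficient in [0, 1]
    x_new = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y_new = lam * y_i + (1.0 - lam) * y_j                    # soft label
    return x_new, y_new, lam
```

Mixing a normal example (label 0) with an anomalous one (label 1) gives a soft label of 1 − lam, so the loss has to accept targets strictly between 0 and 1.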

Figure 7 shows example time series created using mixup. Each of the original time series is mixed up with its horizontal neighbor time series. We see that the newly created series have characteristics from both time series to create a new realistic time series. The patterns in (a) and (b) became a bit more noisy. The slope of (c) has the additional spiky pattern from (d) and the pattern in (d) now slowly ramps up.

Figure 7: Visualization of time series mixup. Each of the series is "mixed up" with its horizontal neighbor time series: (a) with (b) and (c) with (d).

Appendix C Further Results and Ablation Studies

C.1 Ablation Study on SMAP and MSL

Here we present a full ablation study on the smap and msl datasets. We consider variations of the framework by removing some of its components, and train the model in each configuration twice. We report the average and standard deviation of these runs.

First, we observe that the contextual hypersphere formulation improves the performance of the model. In the setting with all the data augmentation techniques (row "- contextual"), the difference is not very big: 1.98% F1 and 1.17% F1 on SMAP and MSL, respectively. However, in the setting where none of the data augmentation techniques is used, this formulation makes a dramatic difference: 11.81% F1 and 43.44% F1 on SMAP and MSL, respectively. Further, we can see that, solely with the contextual hypersphere and without relying on any data augmentation technique, the model can achieve a very reasonable performance.

Model                                 smap            msl
THOC (Shen et al., 2020)              95.18           93.67
ncad w/ coe, po, mixup                94.45 ± 0.68    95.60 ± 0.68
   - coe                              88.59 ± 1.81    94.66 ± 0.22
   - po                               94.28 ± 0.45    94.73 ± 0.35
   - mixup                            92.69 ± 1.14    95.59 ± 0.01
   - mixup - po                       94.4 ± 0.43     94.12 ± 0.77
   - mixup - coe                      86.86 ± 0.7     91.7 ± 2.58
   - coe - po                         60.48 ± 9.7     42.02 ± 6.34
   - mixup - coe - po                 66.9 ± 2.01     79.47 ± 9.39
   - contextual                       92.47 ± 0.53    94.43 ± 0.15
   - contextual - coe                 91.86 ± 0.96    88.29 ± 0.43
   - contextual - po                  93.39 ± 0.61    90.68 ± 0.74
   - contextual - mixup               94.37 ± 0.21    95.07 ± 0.14
   - contextual - mixup - po          93.24 ± 0.31    90.89 ± 0.46
   - contextual - mixup - coe         89.88 ± 2.53    87.26 ± 4.17
   - contextual - coe - po            54.95 ± 2.62    32.05 ± 0.17
   - contextual - coe - mixup - po    55.09 ± 1.0     36.03 ± 3.01
Table 3: F1 score (average ± standard deviation over two runs) of the model on smap and msl

We also observe that coe and po jointly improve the model performance. If we remove only one of these elements, neither of the two tends to have a large impact on the performance. However, if neither of them is used, the performance drops drastically: by 33.97% F1 and 53.58% F1 for SMAP and MSL, respectively, and the drop is even bigger when not using the contextual inductive bias.

It is interesting to note that it does not seem to be a good idea to use mixup as the only data augmentation method (at least in this unsupervised setting). In the setting where neither coe nor po is used, using mixup seems to significantly deteriorate the performance: by 6.42% and 37.45% on SMAP and MSL, respectively. We conjecture that this is because, for these datasets, there are no labels in the training data, so mixup cannot create soft labels. In addition, mixup creates new time series that may not correspond to the original data distribution, which may steer the learning away from that distribution.
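The differences discussed in this section are gaps between mean F1 scores in Table 3. As a small worked example, the sketch below recomputes three of them. The dictionary keys copy the row labels of the table, and the helper name `f1_gap` is hypothetical.

```python
# Mean F1 scores (smap, msl) copied from Table 3.
MEAN_F1 = {
    "ncad w/ coe, po, mixup": (94.45, 95.60),
    "- coe - po": (60.48, 42.02),
    "- mixup - coe - po": (66.90, 79.47),
    "- contextual": (92.47, 94.43),
    "- contextual - coe - mixup - po": (55.09, 36.03),
}


def f1_gap(row_a, row_b):
    """Difference in mean F1 (row_a minus row_b), as (smap, msl) in F1 points."""
    return tuple(round(a - b, 2) for a, b in zip(MEAN_F1[row_a], MEAN_F1[row_b]))


# Effect of the contextual formulation with all augmentations enabled.
print(f1_gap("ncad w/ coe, po, mixup", "- contextual"))                   # (1.98, 1.17)
# Effect of the contextual formulation without any augmentation.
print(f1_gap("- mixup - coe - po", "- contextual - coe - mixup - po"))    # (11.81, 43.44)
# Cost of keeping mixup as the only augmentation.
print(f1_gap("- mixup - coe - po", "- coe - po"))                         # (6.42, 37.45)
```

These gaps are differences between averages of two runs, so they should be read together with the standard deviations reported in the table.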

C.2 Ablation study and Supervised benchmark on Yahoo dataset

Here we present the results of our method on the supervised Yahoo dataset. It is important to note that the only baseline method that we found was evaluated with the point-wise F1 score, so we use the same metric here to make our results comparable.

Model                          F1       prec     rec
U-Net-Raw                      40.3     47.3     35.1
U-Net-De                       62.1     65.1     59.4
U-Net-DeW                      66.2     79.3     56.9
U-Net-DeWA                     69.3     85.9     58.1

ncad supervised                62.11    80.44    50.59
    + mixup                    63.08    76.70    53.57
    + po                       79.92    74.96    85.57
    + coe                      53.66    78.84    40.67
    + coe + mixup              59.85    78.89    48.21
    + po + coe                 58.36    54.89    62.30
    + po + coe + mixup         67.32    88.38    54.36
    - contextual               5.50     3.42     14.08
    - contextual + po          67.90    64.15    72.13
    - contextual + coe         39.53    42.56    36.90
    - contextual + po + coe    55.25    43.87    74.60
Table 4: Supervised anomaly detection performance on Yahoo. Results for U-Net taken from (Gao et al., 2020).

The supervised approach proposed by Gao et al. (2020) is based on a U-net architecture, which is combined with preprocessing (using robust time series decomposition), loss weighting (to up-weight the rare anomalous class), and several forms of tailored data augmentation applied to the time series (keeping the labels unchanged). They report results for four variants: U-Net-Raw (plain supervised U-net on raw data), U-Net-De (applied to residual after preprocessing), U-Net-DeW (with loss weighting), U-Net-DeWA (with loss weighting and data augmentation).

Using only the true labels but no data augmentation (ncad supervised), our approach significantly outperforms U-Net-Raw, and performs on par with U-Net-De, without relying on time series decomposition and using an arguably much simpler architecture.

When we use the po data augmentation, our approach outperforms the full U-Net-DeWA by a large margin, hinting at the possibility that addressing the class imbalance problem by creating artificial anomalies is more effective than using their strategy of loss weighting while keeping the labels intact.

In the supervised setting, injecting the generic coe anomalies (either individually or in combination with po) hurts performance, presumably by steering the model away from the specific kind of anomalies that are labeled as anomalous in this dataset. On the other hand, adding mixup generally improves performance. The contrastive loss is crucial for good performance, as shown by the rows labeled "- contextual" in Table 4, where it is replaced with a standard softmax classifier.

Appendix D Model Implementation and Training

At the core of the ncad framework, we use a single encoder to produce time series representations. The same encoder is applicable to all the time series in a given dataset, and it is used to encode both the full windows and the context windows. The parameters of the encoder are learned via mini-batch gradient descent, aimed at minimizing the classification loss discussed in section A.3.

Training mini-batches are created by first randomly selecting a number of series from the training set, and then taking a number of random fixed-size windows from each (num_series_in_train_batch and num_crops_per_series, respectively, in the supplementary code). The data augmentation strategies described in section 3.3 are applied to these windows, creating additional examples which are incorporated into the batch. The number of augmented examples is controlled as a proportion of the original batch, using two additional hyperparameters: one rate for coe and one for Mixup. The size of the training batch is therefore the original batch size multiplied by one plus the sum of these two rates.
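The batch-size arithmetic of this paragraph can be written out explicitly. In the sketch below, `num_series_in_train_batch` and `num_crops_per_series` are the names mentioned for the supplementary code, whereas `coe_rate`, `mixup_rate`, and the rounding of fractional counts are assumptions.

```python
def training_batch_size(num_series_in_train_batch, num_crops_per_series,
                        coe_rate, mixup_rate):
    """Total number of examples in a training batch after augmentation."""
    base = num_series_in_train_batch * num_crops_per_series  # original windows
    n_coe = int(round(coe_rate * base))                       # added coe examples
    n_mixup = int(round(mixup_rate * base))                   # added mixup examples
    return base + n_coe + n_mixup


# 32 series with 4 crops each give 128 windows; rates of 1.0 and 0.5
# add 128 coe examples and 64 mixup examples.
print(training_batch_size(32, 4, coe_rate=1.0, mixup_rate=0.5))  # 320
```

With both rates set to zero, the batch reduces to the original windows only.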

Our implementation (open-source code available at Anonymous Github repository) is based on PyTorch (Paszke et al., 2019) and PyTorch Lightning (Falcon, 2019). We used the default initialization defined in PyTorch, and the YOGI optimizer (Zaheer et al., 2018) for all our experiments. We use standard AWS EC2 ml.p3.2xlarge instances with a single-core Tesla V100 GPU. Training and hyperparameter tuning were aided by AWS SageMaker (Liberty et al., 2020). Training takes on average 90 minutes for each dataset.

D.1 Model Hyperparameters

Hyperparameters in our framework can be mainly divided into four categories:

  1. Encoder architecture: number of TCN layers, TCN kernel size, embedding dimension, embedding normalization.

  2. Data augmentation: coe rate, mixup rate (described above).

  3. Optimizer: learning rate, number of epochs.

  4. Mini-batch cropping: window length, suspect window length, number of series per batch, number of crops per series.

For Yahoo, KPI and swat, validation labels are available, so we use Bayesian optimization (Perrone et al., 2020) for hyperparameter tuning, maximizing the F1 score on the validation set. We restricted the search to only “sensible” values for most of the hyperparameters (e.g. max. 10 TCN layers, max. 256 dimensions for the embedding, max. 2.0 for the augmentation rates, etc.). The lengths of the window and suspect window are set by observing the lengths and seasonal patterns of the training dataset, so that the window covers at least one cycle and this seasonality can be encoded in the representation. We use early stopping and keep the model with the highest validation F1, which is then evaluated on the test dataset and the result is reported.

For smap, msl and smd, we do not have validation data to pick the hyperparameters, so we use default values that seemed to work well for the other datasets. It is not possible to do early stopping either, so we keep the model resulting from training until the last epoch, which is then evaluated on the test dataset and the result is reported.

In all cases, we align our experimental protocol with prior works and report scores computed by choosing the best threshold on the test set.
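Choosing the best threshold amounts to sweeping candidate thresholds over the anomaly scores and keeping the one with the highest F1. The sketch below does this for the plain point-wise F1 score only; it does not reproduce any benchmark-specific adjustments of the metric, and the function names are hypothetical.

```python
def f1_score(labels, predictions):
    """Point-wise F1 score for binary labels and predictions."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_f1(scores, labels):
    """Return (best F1, threshold), flagging points whose score is >= threshold."""
    best = (0.0, None)
    for threshold in sorted(set(scores)):        # every distinct score is a candidate
        predictions = [1 if s >= threshold else 0 for s in scores]
        f1 = f1_score(labels, predictions)
        if f1 > best[0]:
            best = (f1, threshold)
    return best
```

Because the threshold is selected on the test set itself, the resulting score is an upper bound on what a fixed, pre-selected threshold would achieve.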

We provide hyperparameter configuration files in the supplementary code, which allow our benchmark results in section 4 to be replicated.

Appendix E Datasets and External Assets

We use the following datasets to compare the performance of ncad to other methods:

smap and msl (Hundman et al., 2018): the datasets are under a custom license.

swat (Mathur and Tippenhauer, 2016): this dataset is distributed by the ITrust Centre for Research in Cyber Security. We were not able to find the precise license of the dataset.

smd (Su et al., 2019): the dataset is distributed under the MIT License.

Yahoo: a dataset published by Yahoo labs. We were not able to find the precise license of the dataset beyond the ReadMe.txt specifying that the dataset could be used for non-commercial research purposes.

KPI [1]: we were not able to find the precise license of the dataset.

Additional assets

In addition to the datasets, we use existing code for the TCN encoder, which is under the Apache License Version 2.0. We also use existing evaluation code, which is under the MIT License. We use the standard Python library numpy (Harris et al., 2020), which is under the BSD 3-Clause "New" or "Revised" License.

We make our code available, licensed under the Apache License, Version 2.0.

All the datasets and code that we use in this work are openly available under licenses that allow their use. As a result, we did not seek additional consent from their creators. None of the datasets contains personally identifiable information, nor do they contain offensive content.