
Are we certain it's anomalous?

The progress in modelling time series and, more generally, sequences of structured data has recently revamped research in anomaly detection. The task is to identify abnormal behaviours in financial series, IT systems, aerospace measurements, and the medical domain, where anomaly detection may aid in isolating cases of depression and attending to the elderly. Anomaly detection in time series is a complex task: anomalies are rare, temporal correlations are highly non-linear, and the definition of anomalous is sometimes subjective. Here we propose the novel use of Hyperbolic uncertainty for Anomaly Detection (HypAD). HypAD learns self-supervisedly to reconstruct the input signal. We adopt best practices from the state of the art, encoding the sequence with an LSTM jointly learnt with a decoder that reconstructs the signal, with the aid of GAN critics. Uncertainty is estimated end-to-end by means of a hyperbolic neural network. Using uncertainty, HypAD can assess whether it is certain about the input signal yet fails to reconstruct it because it is anomalous, or whether a large reconstruction error does not necessarily imply anomaly because the model is uncertain, e.g. for a complex but regular input signal. The novel key idea is that a detectable anomaly is one where the model is certain but predicts wrongly. HypAD outperforms the current state of the art for univariate anomaly detection on established benchmarks based on data from NASA, Yahoo, Numenta, Amazon, and Twitter. It also yields state-of-the-art performance on a multivariate dataset of anomalous activities in elderly home residences, and it outperforms the baseline on SWaT. Overall, HypAD yields the fewest false alarms at the best performance rate, thanks to successfully identifying detectable anomalies.





I Introduction

Fig. 1: HypAD detects anomalies by the joint use of reconstruction error and uncertainty, learning both aspects end-to-end. In the illustration, the colored circular sector represents the hyperbolic Poincaré ball, where the radial distance of data embeddings is their degree of certainty (points on the circumference are most certain). The point x_hyp is the hyperbolic mapping of the input signal, which HypAD attempts to match by the reconstruction x̂_hyp. Thanks to hyperbolic neural networks and their exponentially larger penalisation of errors at high certainty, HypAD learns to prefer signal reconstructions such as x̂′_hyp, i.e. with the same amount of error as x̂_hyp (the same angle and cosine distance) but smaller radius and thus higher uncertainty. HypAD uses both the reconstruction error and the uncertainty to identify detectable anomalies, where the model is certain but the prediction is wrong, i.e. it knows what to expect but something anomalous occurs.

Anomaly detection stands for detecting outliers (anomalies) in data, i.e. points that deviate significantly from the data distribution. Outlier detection, however, is an under-specified and consequently ill-posed task due to its inherent unsupervised nature. Anomaly detection strategies such as distance-based [fan2006nonparametric, ghoting2008fast], density-based [lof, papadimitriou2003loci], and subspace-based methods [keller2012hics, lazarevic2005feature] have been pioneers in the literature. Additionally, autoencoders [chen2017outlier, sarvari2021unsupervised] and adversarial networks [geiger2020tadgan] have made substantial contributions. However, the literature neglects assessing the trustworthiness of the predicted outcomes, namely their uncertainty.

Uncertainty is a measure of model confidence, which may be learnt from the data [lakshminarayanan2016simple, kendall2017uncertainties] or by the use of extra instances [gal2016dropout]. Uncertainty estimation has been a long-standing challenge in machine learning. Most recently, it has been successfully adopted to improve performance on object detection [jiang2018acquisition, kuppers2020multivariate, 9156274, neumann2018relaxed], pose estimation, and unsupervised and self-supervised learning [fiery2021, 8289350, suris2021hyperfuture]. Yet, uncertainty remains largely unexplored in the context of anomaly detection.

In this work, we propose a novel model based on Hyperbolic uncertainty for Anomaly Detection, which we dub HypAD. We leverage the current state-of-the-art technique for anomaly detection in univariate time series, TadGAN [geiger2020tadgan]. TadGAN detects anomalies by attempting to reconstruct the input signal, making use of an LSTM sequence encoding and two GAN critics, cf. Sec. III-A. We introduce uncertainty into the anomaly detector: we map the input and the reconstructed signal into a hyperbolic space, where the signals additionally have an uncertainty score; and we train the novel embeddings end-to-end with a Poincaré distance loss, cf. Sec. III-B.

The proposed HypAD uses uncertainty to discern whether the reconstruction error is large because the signal is anomalous, or simply because the model cannot reconstruct it well. In the former case, HypAD is certain about the reconstruction (e.g. most of the signal is well-behaved and the model expects known patterns) but its reconstruction is wrong, as a part of the signal is anomalous. In the latter, HypAD downgrades its anomaly score because it is not certain about the signal reconstruction. This may be because of a complex pattern which the model did not have enough capacity to learn. The larger uncertainty indicates that the larger reconstruction error may be due to an anomaly or to a model failure in the reconstruction (see discussion in Sec. III-B2).

Thanks to uncertainty, HypAD outperforms the state-of-the-art univariate anomaly detector TadGAN [geiger2020tadgan] on the established univariate benchmarks of NASA, Yahoo, Numenta Anomaly Benchmark [lavin2015evaluating], as well as on two multivariate datasets of daily activities in elderly home residences CASAS [cook2012casas] and industrial water treatment plant SWaT [Mathur2016SWaTAW]. As we show in experimental results in Sec. IV, reducing anomaly scores in uncertain cases also yields fewer false alarms (the model achieves best F1 performance with larger precision).

The main contributions of this work are:

  • We propose the first model for anomaly detection based on hyperbolic uncertainty;

  • We propose the novel key idea of detectable anomaly: an instance is anomalous when the model is certain about it but wrong;

  • We integrate the estimated uncertainty into a state-of-the-art univariate anomaly detector and consistently outperform it on established univariate and multivariate datasets.

II Related Works

To the best of our knowledge, this is the first work to have combined anomaly detection with uncertainty estimation, and the first work to have further proposed hyperbolic uncertainty for it. Previous work relates to ours from three main perspectives, which we review here: uncertainty estimation techniques, anomaly detection in time series and hyperbolic neural networks.

II-A Uncertainty Estimation Techniques

We identified two different strategies for approximating uncertainty. Ensemble-based posterior approximation uses several weak models to make naive predictions and combines them, according to a consensus function, into a more complex predictive model [dietterich2000ensemble]. One of the most popular ensemble-based approaches to uncertainty estimation is Monte Carlo (MC) Dropout, which drops neurons in every layer during both the training and test phases.

Generative models for aleatoric modelling use an additional latent variable to make stochastic predictions and evaluate the uncertainty of the model. Generative Adversarial Networks (GANs) play a minimax game where the discriminator needs to distinguish between real examples and generated outcomes. GANs yield state-of-the-art performance, and we build on top of that by attaching hyperspace mapping layers to estimate the uncertainty of the model. Another interesting approach to estimating uncertainty is energy-based models [du2019implicit, salakhutdinov2009deep, xie2016theory], which learn an energy function modelling the compatibility of the input and the output. Our method improves on energy-based models because the integrated hyperbolic uncertainty mechanism suffers from neither cold- nor warm-start problems [zhang2021dense], which complicate training [xie2018cooperative].

II-B Anomaly Detection in Time Series

We identified five categories of methods proposed in the literature for anomaly detection in time series. Distance-based outlier detectors consider the distance of a point from its k-nearest neighbours [conf/vldb/KnorrN98, angiulli2002fast, ghoting2008fast]. Density-based methods [lof, papadimitriou2003loci, He2003DiscoveringCL, 10.1145/3447548.3467137, 10.1145/3292500.3330672, WangCNLCT21, su2019robust] take into account the density of a point and its neighbours. Prediction-based methods [benkabou2021local, unsupahmad, conf/kdd/HundmanCLCS18] calculate the difference between the predicted value and the true value to detect anomalies. Reconstruction-based methods [An2015VariationalAB, Malhotra2016LSTMbasedEF, 10.1145/3447548.3467174] compare the input signal and the reconstructed one in the output layer, typically by using autoencoders. These methods assume that anomalies are difficult to reconstruct and are lost when the signal gets mapped to lower dimensions; thus a higher reconstruction error means a higher anomaly score. [Malhotra2016LSTMbasedEF] uses an LSTM autoencoder for multi-sensor anomaly detection. [chen2017outlier, sarvari2021unsupervised] use an ensemble of autoencoders to boost performance by focusing on learning the inlier characteristics at each iteration. [li2021multivariate] uses a hierarchical variational autoencoder with two stochastic latent variables to learn the temporal and inter-metric embeddings of multivariate data. SISVAE [li2020anomaly] uses a variational autoencoder with a smoothness-inducing prior over possible estimations to capture latent temporal structures of time series without relying on the assumption of constant noise. Recently, GANs have been employed to detect anomalies in time series data; our method also lies in this category. MAD-GAN [Li2019MADGANMA] combines the discriminator output and the reconstruction error to detect anomalies in multivariate time series. BeatGAN [ijcai2019-616] uses an encoder-decoder generator with a modified time-warping-based data augmentation to detect anomalies in medical ECG inputs. TadGAN [geiger2020tadgan] uses a cycle-consistent GAN architecture with an encoder-decoder generator and additionally proposes several ways to compute the reconstruction error and combine it with the critic outputs. We build on top of TadGAN's architecture by incorporating a hyperbolic mapping layer for the reconstructed time windows to assess the uncertainty of the detector.

II-C Hyperbolic Neural Networks

Deep representation learning in hyperspaces has gained momentum after the pioneering work on hyperNNs [NEURIPS2018_dbab2adc], which generalises Euclidean operations (e.g. matrix multiplications) to their counterparts in hyperspace. The authors propose hyperspace analogues of neural network components such as fully-connected (FC) layers, multinomial logistic regression (MLR) and recurrent neural networks. Furthermore, methods like the Einstein midpoint [gulcehre2018hyperbolic] and the Fréchet mean [Lou2020DifferentiatingTT] propose different ways of aggregating features in hyperspace. The work in [shimizu2021hyperbolic] extends hyperNNs and proposes Poincaré split/concatenation operations, generalising the convolutional layer to hyperspace. [NEURIPS2019_103303dd, chami2019hyperbolic] propose hyperbolic graph neural networks, leveraging hyperNNs.

Thus formulated, hyperNNs have mainly been adopted to improve performance by leveraging hierarchies and uncertainty in zero-shot learning [Liu_2020_CVPR], re-identification [Khrulkov_2020_CVPR], and action recognition [9157196]. Of particular interest, [suris2021hyperfuture] has leveraged hyperNNs to model a hierarchy of actions from unlabeled videos. To the best of our knowledge, this is the first work to have applied hyperNNs for sequence modelling with the goal of anomaly detection.

III Method

In this section, we first discuss best practices in anomaly detection (Section III-A); then we detail the proposed hyperbolic uncertainty and its use for detecting anomalies (Section III-B); finally, we discuss the motivation for it (Section III-C).

Fig. 2: Overall architecture of our proposed model HypAD. The model integrates hyperNNs [NEURIPS2018_dbab2adc] with an LSTM-based encoder-decoder trained with two GAN-based critics, C_x and C_z. HypAD maps the input signal x and the output of the decoder (x̂) to the Poincaré ball model of hyperbolic spaces, shown as the dotted red-edged box with red background, to get the corresponding hyperbolic embeddings. The two embeddings are then compared using the Poincaré distance, as described in Section III-B1. Solid magenta and green boxes denote the input and output of the hyperbolic mapping, respectively.

III-A Background

The current state of the art in univariate anomaly detection is a reconstruction-based technique [geiger2020tadgan] which additionally leverages a GAN critic score. TadGAN encodes the input data to a latent space and then decodes the encoded data. This encoding-decoding operation requires two mapping functions, an encoder E and a decoder (generator) G. The reconstruction operation can be given as x̂ = G(E(x)). TadGAN leverages adversarial learning to train the two mappings by using two adversarial critics, C_x and C_z. The goal of C_x is to distinguish between the real and the generated time series, while C_z measures the performance of the mapping into the latent space. The model is trained using a combination of the Wasserstein loss [pmlr-v70-arjovsky17a] and the cycle-consistency loss [8237506]. TadGAN computes the reconstruction error between x and x̂ using three types of reconstruction functions: i. point-wise difference, which considers the difference of values at every time stamp; ii. area difference, which is applied to signals of fixed lengths and measures the similarity between local regions; iii. dynamic time warping, which additionally handles time gaps between the two signals when calculating the reconstruction error.

To calculate the anomaly score, TadGAN first normalises the reconstruction error and the critic scores by subtracting the mean and dividing by the standard deviation. The normalised scores, Z_RE(x) and Z_Cx(x), are then combined using their product:

a(x) = Z_RE(x) · Z_Cx(x).  (1)
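As a concrete illustration, this normalise-and-multiply combination can be sketched in a few lines of NumPy. This is our own sketch, not TadGAN's actual API; the function and variable names are illustrative:

```python
import numpy as np

def anomaly_score(rec_errors, critic_scores):
    """Combine reconstruction errors and critic outputs, TadGAN-style.

    Both signals are z-score normalised, then multiplied, so that a
    window receives a high score when both indicators agree.
    """
    z_re = (rec_errors - rec_errors.mean()) / rec_errors.std()
    z_c = (critic_scores - critic_scores.mean()) / critic_scores.std()
    return z_re * z_c
```

The product (rather than a sum) rewards agreement between the two signals: a window scores highly only when both the reconstruction error and the critic deviate from their means in the same direction.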
III-B Hyperbolic Uncertainty for Anomaly Detection (HypAD)

We propose a novel model for anomaly detection in time series based on hyperbolic uncertainty. HypAD is a reconstruction-based model and minimises the reconstruction loss, given by a measure of the hyperbolic distance between the input signal and its reconstruction. In hyperbolic space, errors are exponentially larger when predictions are certain. Therefore, HypAD tends to predict either certain correct reconstructions or uncertain possibly mistaken reconstructions.

This leads, as we discuss in Sec. III-C, to a novel definition of detectable anomaly: i.e. the case of a large reconstruction error with high certainty.

III-B1 Hyperbolic Reconstruction Error

The proposed HypAD is illustrated in Figure 2. It integrates the machinery of hyperbolic neural networks into the reconstruction-based architecture of TadGAN. In HypAD, the input signal is first passed through an encoder and then through a decoder sub-network. The output of the decoder, as well as the original signal, is mapped to the hyperspace, shown as the red-edged box with red background.

An n-dimensional hyperbolic space is a Riemannian manifold with constant negative sectional curvature. As in [Khrulkov_2020_CVPR, suris2021hyperfuture], we adopt the Poincaré ball model of hyperbolic spaces, given by the manifold D^n = {x ∈ R^n : ‖x‖ < 1} endowed with the Riemannian metric g^D_x = λ_x² g^E, where λ_x = 2 / (1 − ‖x‖²) is the conformal factor and g^E is the Euclidean metric tensor. For details, see [riemannian, smoothmanifold].

In order to map x and x̂ to the Poincaré ball, we leverage an exponential map centered at 0 [suris2021hyperfuture]. This is followed by a hyperbolic feed-forward layer [NEURIPS2018_dbab2adc] to estimate the corresponding hyperbolic embeddings x_hyp and x̂_hyp, shown as solid green boxes in Figure 2. Finally, the two hyperbolic embeddings are compared using the Poincaré distance, formulated as follows:

d_P(x_hyp, x̂_hyp) = arccosh( 1 + 2 ‖x_hyp − x̂_hyp‖² / ((1 − ‖x_hyp‖²)(1 − ‖x̂_hyp‖²)) ),  (2)

where ‖x_hyp‖ and ‖x̂_hyp‖ are the distances of the embeddings from the center of the Poincaré ball. Note that the same reconstruction error function d_P is used at training as well as inference time.
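As a sketch (not the paper's code), the exponential map at the origin and the Poincaré distance of the unit ball can be written in NumPy as follows; the same Euclidean gap costs far more near the boundary than near the centre:

```python
import numpy as np

def expmap0(v):
    """Exponential map at the origin of the unit Poincaré ball
    (curvature -1): maps a Euclidean vector into the open ball."""
    n = np.linalg.norm(v)
    if n == 0:
        return v
    return np.tanh(n) * v / n

def poincare_distance(x, y):
    """Geodesic distance between two points inside the unit ball."""
    sq = np.sum((x - y) ** 2)
    denom = (1 - np.sum(x ** 2)) * (1 - np.sum(y ** 2))
    return np.arccosh(1 + 2 * sq / denom)

# Same Euclidean separation, very different hyperbolic cost:
near_centre = poincare_distance(np.array([0.1, 0.0]), np.array([0.0, 0.1]))
near_boundary = poincare_distance(np.array([0.9, 0.0]), np.array([0.8, 0.1]))
```

Here `near_boundary` is several times larger than `near_centre`, which is the penalisation behaviour Section III-B2 builds on.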

III-B2 Hyperbolic Uncertainty

A key property of the Poincaré ball is that the distance between two points grows exponentially as we move away from the origin. This means that an erroneous reconstruction towards the circumference of the disk is penalised exponentially more than an erroneous reconstruction close to the centre. This leads to the useful tendency of HypAD, in order to minimise Eq. (2), to either predict a matched reconstruction (x_hyp and x̂_hyp are close by) or an unmatched reconstruction towards the origin (x_hyp and x̂_hyp differ in direction, but ‖x̂_hyp‖ is small).

Hence, the distance of the reconstruction from the origin provides a natural estimate of the model's uncertainty, referred to as hyperbolic uncertainty u(x̂), thus formulated:

u(x̂) = 1 − ‖x̂_hyp‖.  (3)

The smaller the distance from the origin, the more uncertain the model.

III-B3 Combining Hyperbolic Uncertainty with Reconstruction Error and Critic Score

Hyperbolic uncertainty is integrated into the anomaly score as follows:

a(x) = (1 − u(x̂)) · Z_RE(x) · Z_Cx(x).  (4)

Equation 4 brings together the reconstruction error (the larger the error, the more likely the anomaly) with the critic score (larger critic scores point to anomalies) and the model certainty, 1 − u(x̂) = ‖x̂_hyp‖.

This simple multiplicative formulation of the model's certainty reduces the scores of anomalies when HypAD is not confident in its reconstructions. While simple, it outperforms the current state of the art, as we show in Section IV.
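A minimal sketch of this certainty-weighted score, assuming the certainty term is the norm of the hyperbolic reconstruction and the reconstruction error is the Poincaré distance of Eq. (2). The names are illustrative, not the paper's code:

```python
import numpy as np

def poincare_distance(x, y):
    """Geodesic distance in the unit Poincaré ball (Eq. (2))."""
    sq = np.sum((x - y) ** 2)
    denom = (1 - np.sum(x ** 2)) * (1 - np.sum(y ** 2))
    return np.arccosh(1 + 2 * sq / denom)

def hypad_score(x_hyp, xhat_hyp, critic_score):
    """Eq. (4)-style anomaly score: the hyperbolic reconstruction
    error, scaled by the critic output and by the certainty
    ||xhat_hyp|| = 1 - u(xhat)."""
    certainty = np.linalg.norm(xhat_hyp)
    return certainty * poincare_distance(x_hyp, xhat_hyp) * critic_score
```

Under this sketch, a confident-but-wrong reconstruction (large radius, large angular error) scores far higher than an equally wrong but uncertain one near the origin, which is precisely the notion of detectable anomaly.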

III-C Motivation for HypAD

HypAD takes motivation from a key idea: a detectable anomaly is one where the model is certain, but it predicts wrongly. In other words, if the model encounters a known pattern, which it knows how to reconstruct, then it will call the input anomalous if the reconstruction does not match the input signal.

The principled formulation of hyperbolic uncertainty is paramount towards this goal: HypAD predicts a reconstruction as uncertain if it suspects that it may be wrong.

Fig. 3: Bar plot of the average cosine distance of all the datasets for specific intervals of uncertainty (see Sec. III-C). The first two rows contain three plots corresponding to the signals that report the best improvement in terms of F1-score (g-measure) and one plot (the last column) that corresponds to the signal with the worst improvement. Notice that even the signals with the worst improvement follow the same increasing trend. Because SWaT contains a single long-term signal, we report only its corresponding bar-plot.

Fig. 3 illustrates this key concept for all the datasets. The first, second and third rows correspond to the univariate, U-CASAS and SWaT datasets, respectively. The bar plots depict the average cosine distances between the input signals and their reconstructions against specific intervals of uncertainty, along the x-axis. The higher the cosine distance, the more distinct the reconstruction is from the provided signal (the Poincaré ball model is conformal to the Euclidean space and preserves the same angles [shimizu2021hyperbolic]). Note that for the first two rows, the initial three plots (columns) correspond to the signals that report the best improvement in F1-score and the last plot corresponds to the signal with the worst improvement. A single bar plot is reported for SWaT because it consists of a single long-term signal.

Observe in the second row how HypAD learns to correctly assign higher uncertainty to more erroneous estimates for the cases of Fall, Weakness, and Nocturia. For the last signal, SlowerWalking, HypAD fails to learn a meaningful uncertainty and labels all reconstructions as certain. It is nevertheless notable that the representation is still interpretable and the failure case discernible. Trends are similar for the univariate and SWaT datasets.

| | SMAP | MSL | A1 | A2 | A3 | A4 | Art | AdEx | AWS | Traf | Tweets | F | MTC | W | SW | N | SWaT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Num. of signals | 53 | 27 | 67 | 100 | 100 | 100 | 6 | 5 | 17 | 7 | 10 | 1 | 1 | 1 | 1 | 1 | 1 |
| Num. of anomalies | 67 | 36 | 178 | 200 | 939 | 835 | 6 | 11 | 30 | 14 | 33 | 2 | 2 | 4 | 2 | 2 | 33 |
| Point anomalies | 0 | 0 | 68 | 33 | 935 | 833 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Collective anomalies | 67 | 36 | 110 | 167 | 4 | 2 | 6 | 11 | 30 | 14 | 33 | 2 | 2 | 4 | 2 | 2 | 33 |
| Num. of anomaly points | 54,696 | 7,766 | 1,699 | 466 | 943 | 837 | 2,418 | 795 | 6,312 | 1,560 | 15,651 | 99 | 239 | 1,248 | 276 | 1,060 | 10,786 |
| Percentage of total | 9.7% | 5.8% | 1.8% | 0.3% | 0.6% | 0.5% | 10% | 9.9% | 9.3% | 9.8% | 9.8% | 0.7% | 1.7% | 8% | 2.4% | 6.3% | 10.05% |
| Num. out of distribution | 18,126 | 642 | 861 | 153 | 21 | 49 | 123 | 15 | 210 | 86 | 520 | - | - | - | - | - | - |
| Num. of instances | - | - | - | - | 168k | 168k | - | - | - | - | - | - | - | - | - | - | - |
| Synthetic? | No | No | No | Yes | Yes | Yes | Yes | No | No | No | No | No | No | No | No | No | No |

TABLE I: Overview of the selected univariate (SMAP through Tweets) and multivariate (F through SWaT) datasets and their characteristics, grouped by the sources of signals (NASA, Yahoo, NAB, CASAS and SWaT). See Sec. IV-A for details.

IV Results

We compare HypAD against the current best univariate anomaly detector TadGAN [geiger2020tadgan] on a benchmark of 11 time series, and extend the comparison to 2 multivariate sensor datasets: one comprising a water treatment plant and one comprising daily activities in elderly residences. First, we introduce the benchmarks (Sec. IV-A), then we compare against baselines and the state-of-the-art (Sec. IV-B), finally we conduct ablative studies on the importance of uncertainty for performance and the reduction of false alarms (Sec. IV-C).

IV-A Datasets, metrics and experimental setup

Table I summarizes the main characteristics of the datasets, which we coarsely divide into univariate (the main test bed of our baseline TadGAN [geiger2020tadgan]) and multivariate, to which we extend the comparison. In the table, we report the sources of signals (NASA, Yahoo, NAB, CASAS and SWaT) and the datasets within each source (SMAP, MSL, A1, etc.), cf. the detailed description later in this section.

In the table, the number of signals is the number of time series within each dataset. Note that the univariate datasets are composed of multiple signals, while each multivariate dataset comprises a single large time series. The number of anomalies counts the instances, within the time series, labelled as anomalous. These are further detailed as point anomalies, single anomalous values at a specific point in time, or collective anomalies, sets of contiguous times which are altogether anomalous. Yahoo is the sole source with point anomalies, and the only one with synthetic sequences (A2, A3, A4). We also report the percentage of total anomalous points and, following [geiger2020tadgan], for the univariate datasets, the number of out-of-distribution points, i.e. those exceeding the mean by more than a fixed number of standard deviations.

Univariate datasets

NASA includes two spacecraft telemetry datasets, based on the Mars Science Laboratory (MSL) and the Soil Moisture Active Passive (SMAP) signals. The former consists of scientific and housekeeping engineering data taken from the Rover Environmental Monitoring Station aboard the Mars Science Laboratory. The latter includes measurements of soil moisture and freeze/thaw state from space for all non-liquid water surfaces globally within the top layer of the Earth.

We analyse the Yahoo datasets based on real production traffic to Yahoo computing systems, and additionally consider three synthetic datasets coming from the same source. The collection tests the detection accuracy of various anomaly types, including outliers and change-points. The synthetic datasets consist of time series with varying trend, noise and seasonality; the real dataset consists of time series representing the metrics of various Yahoo services.

Numenta Anomaly Benchmark (NAB) is a well-established collection of univariate time series from real-world application domains. To be consistent with [geiger2020tadgan], we analyse Art, AdEx, AWS, Traf, and Tweets from the original collection.

Multivariate datasets

We consider for analysis the SWaT [10.1007/978-3-319-71368-7_8, Mathur2016SWaTAW] and CASAS [cook2012casas, dahmen2021indirectly] datasets. SWaT is collected from a cyber-physical system testbed that is a scaled-down replica of an industrial water treatment plant. The data was collected every second for a total of 11 days: for the first few days the system was operated normally, while for the remaining days certain cyber-physical attacks were launched. Following [10.1145/3447548.3467137], we sample sensor data every 5 seconds.

CASAS is a collection of two weeks of sensor data from retirement homes. Each sensor reading has a label attached to it, according to the activity of the elderly person recognised by human annotators. The 5 time series are collections of activities, grouped by the medical conditions established prior to the experimentation: Falling (F), More Time in Chair (MTC), Weakness (W), Slower Walking (SW), and Nocturia (N) for nightly toilet visits. Although sensor readings give fine-grained information, we are interested in creating daily patient profiles. Hence, we collapse each run of consecutive sensor signals into a single aggregated activity with a start and end time (the start time corresponds to the first sensor reading for that activity, whereas the end time is the last reading). We then create a D × 1440 time-matrix structure, where D is the number of days the patient is monitored and 1440 represents the total number of minutes in a day. Entry (i, j) represents the activity label performed in minute j of day i. Lastly, we enrich each value with contextual and duration information corresponding to the label therein.
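The construction of the per-minute daily profiles can be sketched as follows. The input tuple format `(day, start_minute, end_minute, label)` is a hypothetical simplification of the actual preprocessing, used only for illustration:

```python
import numpy as np

def daily_profiles(activities, num_days):
    """Build a D x 1440 matrix of per-minute activity labels.

    `activities` is a list of (day, start_minute, end_minute, label)
    tuples, one per aggregated activity. Label 0 is reserved for
    'no activity'; minutes are 0-indexed and the end minute is
    inclusive.
    """
    profiles = np.zeros((num_days, 1440), dtype=int)
    for day, start, end, label in activities:
        profiles[day, start:end + 1] = label
    return profiles
```

Each row of the resulting matrix is one day's profile, ready to be enriched with contextual and duration features.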

Finally, we create train and test splits of this data, taking care to safeguard the sequentiality of observations and gathering the few anomalies into the test set, for evaluation only. For this reason we name this split Unsupervised-CASAS, dubbed U-CASAS, differing from the original CASAS [cook2012casas, dahmen2021indirectly], since the latter interleaves anomalous sensor readings with normal instances, thus breaking the sequentiality of anomalies. For each anomalous day encountered in the test set, we pad two days prior to it and two after. If the padding overlaps with another anomalous day, then we concatenate them and perform the padding procedure again (the concatenation procedure merges common sequences: if a sequence contains more than one anomalous day within the two-day padding window, the padded sequences get collapsed into a single one). We delete the days assigned to the test set from the overall dataset and assign the rest to training. Moreover, because we employ a time-related strategy, we create time windows of 30 actions to detect anomalies.


Since all of the enlisted datasets are highly unbalanced, accuracy is misleading. Therefore, as done in [suris2021hyperfuture], we use the F1 score to account for this challenge. Notice that we do not use the cumulative F1 score proposed in [garg2021evaluation] to evaluate performance, because not all datasets contain anomalous events; rather, they contain anomalous data points. Based on [hundman2018detecting], for the univariate and the CASAS datasets, we penalise high false positive rates and encourage the detection of true positives in a timely fashion. Since anomalies are rare events and come in collective sequences in real-world applications, we proceed as follows:

  1. We record a true positive (TP) if any predicted window overlaps a true anomalous window.

  2. We record a false negative (FN) if a true anomalous window does not overlap any predicted window.

  3. We record a false positive (FP) if a predicted window does not overlap any true anomalous region.

For U-CASAS, we also measure the g-measure, the geometric mean of precision (P) and recall (R), a robust metric when classes are imbalanced: g = √(P · R).
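The overlap-based counting rules above, together with the F1 score and g-measure, can be sketched as follows (a simplified sketch of the protocol, not the evaluation code used in the paper):

```python
import math

def overlaps(a, b):
    """True if windows a = (start, end) and b = (start, end) share any time step."""
    return a[0] <= b[1] and b[0] <= a[1]

def window_scores(pred, true):
    """Overlap-based TP/FN/FP counting (rules 1-3 above), plus
    precision, recall, F1 and g-measure over (start, end) windows."""
    tp = sum(1 for t in true if any(overlaps(p, t) for p in pred))
    fn = len(true) - tp
    fp = sum(1 for p in pred if not any(overlaps(p, t) for t in true))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    g = math.sqrt(precision * recall)
    return precision, recall, f1, g
```

For example, with one of two predicted windows overlapping one of two true anomalous windows, this yields precision = recall = 0.5 and hence F1 = g = 0.5.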
| | MSL | SMAP | A1 | A2 | A3 | A4 | Art | AdEx | AWS | Traf | Tweets | F1 (μ ± σ) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TadGAN [geiger2020tadgan] | 0.623 | 0.680 | 0.668 | 0.820 | 0.631 | 0.497 | 0.667 | 0.667 | 0.610 | 0.455 | 0.605 | 0.629 ± 0.123 |
| AE | 0.199 | 0.270 | 0.283 | 0.008 | 0.100 | 0.073 | 0.283 | 0.100 | 0.239 | 0.088 | 0.296 | 0.176 ± 0.099 |
| LstmAE | 0.317 | 0.318 | 0.310 | 0.023 | 0.097 | 0.089 | 0.261 | 0.130 | 0.223 | 0.136 | 0.299 | 0.200 ± 0.103 |
| ConvAE | 0.300 | 0.292 | 0.301 | 0.000 | 0.103 | 0.073 | 0.289 | 0.129 | 0.254 | 0.082 | 0.301 | 0.212 ± 0.096 |
| TadGAN* | 0.500 | 0.580 | 0.620 | 0.865 | 0.750 | 0.576 | 0.420 | 0.550 | 0.670 | 0.480 | 0.590 | 0.600 ± 0.115 |
| HypAD (proposed) | 0.565 | 0.643 | 0.610 | 0.670 | 0.670 | 0.470 | 0.777 | 0.663 | 0.630 | 0.570 | 0.670 | 0.631 ± 0.075 |

TABLE II: Results on the univariate datasets from NASA, Yahoo and NAB, measured in terms of F1-score. The entry TadGAN [geiger2020tadgan] is reported from the original paper. TadGAN* refers to its reproduced result with the PyTorch version (see Secs. IV-A, IV-B). The mean and standard deviation are computed across all datasets.
| Method (g-measure / F1) | F | MTC | W | SW | N | g-measure (μ ± σ) | F1 (μ ± σ) |
|---|---|---|---|---|---|---|---|
| LstmAE | 0.085 / 0.014 | 0.182 / 0.108 | 0.000 / 0.000 | 0.158 / 0.049 | 0.133 / 0.035 | 0.112 ± 0.064 | 0.041 ± 0.037 |
| AE | 0.139 / 0.127 | 0.033 / 0.027 | 0.116 / 0.103 | 0.000 / 0.000 | 0.158 / 0.049 | 0.089 ± 0.062 | 0.061 ± 0.047 |
| ConvAE | 0.086 / 0.014 | 0.284 / 0.150 | 0.251 / 0.119 | 0.158 / 0.048 | 0.134 / 0.035 | 0.183 ± 0.074 | 0.073 ± 0.052 |
| TadGAN* | 0.222 / 0.267 | 0.570 / 0.555 | 0.000 / 0.000 | 0.630 / 0.570 | 0.267 / 0.222 | 0.338 ± 0.233 | 0.323 ± 0.216 |
| HypAD (proposed) | 0.447 / 0.333 | 0.660 / 0.610 | 0.447 / 0.333 | 0.470 / 0.364 | 0.577 / 0.500 | 0.520 ± 0.095 | 0.428 ± 0.123 |

TABLE III: Results for the U-CASAS multivariate datasets, measured in terms of g-measure and F1-score. TadGAN* refers to its PyTorch version (see Secs. IV-A, IV-B). The datasets for the specific medical conditions are abbreviated as follows: Falling (F), More Time in Chair (MTC), Weakness (W), Slower Walking (SW), and Nocturia (N).

We include the following strategies as our baselines in this paper:

  • AE [baldi2012autoencoders] - We use a six-layer fully-connected autoencoder.

  • ConvAE [maggipinto2018convolutional] - We use three layers of convolutional encoding interleaved with max pooling. The decoder mirrors the encoder, with de-convolutions aided by two-dimensional up-sampling layers.

  • LstmAE [sagheer2019unsupervised] - We use a deep stacked LSTM autoencoder with four layers. The hidden and output vectors of the first LSTM get passed to the second LSTM layer. The latent representation of the encoder then gets reconstructed in reverse order by the decoder.

  • TadGAN* [geiger2020tadgan] - We use a one-layer bidirectional LSTM and a two-layer bidirectional LSTM for the generator's encoder and decoder, respectively. For the critic C_x we use a fully connected layer, and two dense layers for C_z.

Implementation details.

For the first three baselines, we set the number of epochs to 30, the batch size to 32, and the learning rate to . For TadGAN*, we set the epochs to 30, the batch size to 64, the learning rate to , and the number of critic iterations to 5. We use Adam as the optimisation function to train all the baselines.
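A minimal training loop matching these settings (Adam optimiser, MSE reconstruction objective) might look as follows; the learning-rate default is a placeholder, since the exact value is elided in this excerpt.

```python
import torch

def train_autoencoder(model, loader, epochs=30, lr=1e-3):
    """Train a reconstruction model with Adam and an MSE objective.
    `lr` is a placeholder default; set it per baseline."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()
    for _ in range(epochs):
        for (x,) in loader:
            opt.zero_grad()
            loss = loss_fn(model(x), x)   # reconstruct the input window
            loss.backward()
            opt.step()
    return model
```

The loader would be a `torch.utils.data.DataLoader` with `batch_size=32` for the autoencoder baselines (64 for TadGAN*, whose adversarial training additionally alternates critic and generator updates).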

For our proposed method HypAD, we took inspiration from a publicly available PyTorch implementation. We leave the architecture of TadGAN unvaried, but we incorporate the hyperbolic transformation as in [suris2021hyperfuture]. The hyperparameters are the same as in the original paper, but we use Riemannian Adam as the optimisation function.

IV-B Comparison to the state of the art

HypAD sets a new state-of-the-art performance for univariate anomaly detection, with the highest average F1-score of 0.631. In Table II, HypAD outperforms the current best technique (TadGAN*, 0.600) by 5.17%, as well as all baselines by a large margin. Note that we also report the performance of TadGAN from the original paper [geiger2020tadgan] in the first row (0.629), which we could not reproduce with the available PyTorch code (cf. Sec. IV-A). In the table, the F1 column reports the mean and standard deviation over all datasets; judging by the standard deviation, HypAD is also the most consistent performer. Considering the F1-score, the largest performance gains of HypAD vs. TadGAN are on the NAB and NASA datasets, while it is outperformed by larger margins on the A2, A3 and A4 Yahoo datasets, the only synthetic univariate ones.

In Table III, we extend the evaluation of HypAD to the multivariate U-CASAS dataset. We cannot include Isudra [dahmen2021indirectly], because its underlying architecture uses a small amount of labels to select its parameters; moreover, Isudra is trained in a supervised fashion, differing from all the other methods reported here. For completeness, we also report the g-measure [dahmen2021indirectly] and its average across all datasets. As shown in the table, HypAD surpasses TadGAN* by 32.51% in terms of average F1-score (0.428 vs. 0.323).

Finally, in Table IV, we extend the comparison to the multivariate SWaT dataset. Here, following previous work [10.1145/3447548.3467137], we also report precision and recall in addition to the F1-score. HypAD outperforms the baseline TadGAN* (0.753 vs. 0.722), but both techniques are behind the current state-of-the-art on multivariate anomaly detection, NSIBF [10.1145/3447548.3467137]. Observe, however, that HypAD achieves its F1-performance at the highest precision (0.996) among all methods. This confirms that HypAD detects anomalies it is certain about, i.e. when it understands the time series and knows what to expect, but cannot reconstruct the input signal due to an anomaly.
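The key intuition, that a detectable anomaly is a point the model is certain about yet reconstructs wrongly, can be illustrated with a toy scoring rule. This is only a sketch of the idea; HypAD's actual score is derived from the hyperbolic reconstruction and its estimated uncertainty, and may differ in form.

```python
import numpy as np

def detectable_anomaly_score(recon_error: np.ndarray,
                             certainty: np.ndarray) -> np.ndarray:
    """Toy scoring rule: high only when the model is certain AND wrong.

    recon_error -- per-point reconstruction error (higher = worse)
    certainty   -- per-point model certainty in [0, 1]
    """
    return certainty * recon_error

# Three regimes: certain+wrong (a detectable anomaly), uncertain+wrong
# (e.g. a complex but regular signal), certain+right (normal behaviour).
err = np.array([0.9, 0.9, 0.1])
cert = np.array([0.95, 0.20, 0.95])
scores = detectable_anomaly_score(err, cert)
assert scores.argmax() == 0   # only the certain-but-wrong point stands out
```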

Model Precision Recall F1
EncDec-AD [Malhotra2016LSTMbasedEF] [ICML-WorkShop16] 0.945 0.620 0.748
DAGMM [zong2018deep][ICLR18] 0.946 0.747 0.835
OmniAnomaly [10.1145/3292500.3330672] [KDD19] 0.979 0.757 0.854
USAD [10.1145/3394486.3403392] [KDD20] 0.987 0.740 0.846
TadGAN* [BigData20] 0.937 0.587 0.722
NSIBF [10.1145/3447548.3467137] [KDD21] 0.982 0.863 0.919
HypAD (Ours) 0.996 0.605 0.753
TABLE IV: Results on the SWaT dataset, in terms of F1-score, including the corresponding precision and recall. TadGAN* refers to the PyTorch version.
Method                            Univariate  U-CASAS  SWaT
Euclidean (TadGAN)                0.600       0.323    0.722
Hyperbolic                        0.604       0.397    0.566
Hyperbolic + Uncertainty (HypAD)  0.631       0.428    0.753
TABLE V: Ablative evaluation on the importance of the hyperbolic mapping and the integration of uncertainty. Average F1-scores are reported for Univariate, U-CASAS and SWaT datasets.

IV-C Ablation Studies

In Table V, we analyze the importance of the hyperbolic embedding and the use of uncertainty for anomaly detection on the univariate datasets, as well as on the multivariate U-CASAS and SWaT. The first row shows the performance of the model in Euclidean space (average F1-score across datasets); this corresponds to TadGAN* in Tables II, III and IV. In the second row, we report the performance of the hyperbolic TadGAN*, without uncertainty (cf. Sec. III-B1). This improves marginally on the univariate datasets (0.604 vs. 0.600) and more substantially on the U-CASAS datasets (0.397 vs. 0.323), but it decreases performance on SWaT (0.566 vs. 0.722), which we analyze further in the following subsection.

The complete proposed HypAD model, in the third row, improves consistently over both ablative variants: it yields 0.631 on the univariate datasets, 0.428 on U-CASAS, and 0.753 on SWaT. The improvements w.r.t. not using uncertainty are large: 4.5%, 7.8% and 33%, respectively. We therefore conclude that uncertainty is fundamental for improving anomaly detection.

Qualitative Ablation on SWaT: In Figure 4, we present a qualitative ablation on the SWaT dataset, which consists of a single long signal. The three plots correspond to the three ablative variants of Table V. In all plots, the blue points represent the predicted anomaly scores and the green points the ground-truth anomalies; the red line denotes the anomaly detection threshold, i.e. blue points above the red line are the predicted anomalies. False positives are points above the threshold that fall outside the green (ground-truth) anomalous regions.

As the figure shows, the Euclidean model (first plot) yields many false positives. The hyperbolic model without uncertainty (second plot) reduces the number of false positives substantially, but it also misses anomalies, especially in the middle part of the signal; this explains the drop in F1-score on the SWaT dataset (cf. Table V). Integrating hyperbolic uncertainty (the proposed HypAD, third plot) recovers the detection of these anomalies: it increases the true positives while keeping the number of false positives low, yielding the best F1-score of 0.753.
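The thresholding and error-counting logic described above (flag points whose score exceeds the red-line threshold; count as false positives those outside ground-truth anomalous regions) can be sketched as follows. This is illustrative code, not the evaluation script used in the paper.

```python
import numpy as np

def detect(scores: np.ndarray, threshold: float) -> np.ndarray:
    """Predicted anomalies: points whose score exceeds the threshold."""
    return scores > threshold

def false_pos_neg(pred: np.ndarray, truth: np.ndarray):
    """False positives (flagged outside the ground truth) and false
    negatives (ground-truth anomalies that were missed)."""
    fp = int(np.sum(pred & ~truth))
    fn = int(np.sum(~pred & truth))
    return fp, fn

scores = np.array([0.1, 0.9, 0.8, 0.2, 0.7])
truth = np.array([False, True, True, False, False])
pred = detect(scores, threshold=0.5)
assert false_pos_neg(pred, truth) == (1, 0)   # one false alarm, no misses
```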

Fig. 4: Qualitative ablation on SWaT: anomaly scores (blue points) that lie above the anomaly detection threshold (red line) but do not coincide with the ground truth (green points) are false positives. The Euclidean model in plot 1 has many false positives. The corresponding hyperbolic model in plot 2 is more precise but loses some true positives, especially in the middle region. The integration of uncertainty helps to recover these points, increasing the true positives and the overall F1-score (see Sec. IV-C).

V Conclusions

We have proposed HypAD, a novel model for anomaly detection based on hyperbolic uncertainty. The proposed hyperbolic uncertainty allows HypAD to self-adjust its output, encouraging the model to either predict a correct reconstruction or a less certain wrong one. This benefits anomaly detection in two ways: it provides better reconstructions of the signal (deviations from which indicate anomalies) and it yields a measure of certainty. This is a novel viewpoint on anomaly detection: detectable anomalies are those instances which the model predicts with certainty, yet wrongly.