Variational Inference for On-line Anomaly Detection in High-Dimensional Time Series

by   Maximilian Soelch, et al.

Approximate variational inference has shown to be a powerful tool for modeling unknown complex probability distributions. Recent advances in the field allow us to learn probabilistic models of sequences that actively exploit spatial and temporal structure. We apply a Stochastic Recurrent Network (STORN) to learn robot time series data. Our evaluation demonstrates that we can robustly detect anomalies both off- and on-line.


page 1

page 2

page 3

page 4


Anomaly Detection on Graph Time Series

In this paper, we use variational recurrent neural network to investigat...

Probabilistic Recurrent State-Space Models

State-space models (SSMs) are a highly expressive model class for learni...

Anomaly Detection at Scale: The Case for Deep Distributional Time Series Models

This paper introduces a new methodology for detecting anomalies in time ...

Recurrent Kalman Networks: Factorized Inference in High-Dimensional Deep Feature Spaces

In order to integrate uncertainty estimates into deep time-series modell...

Process Monitoring Using Maximum Sequence Divergence

Process Monitoring involves tracking a system's behaviors, evaluating th...

Synergetic Learning of Heterogeneous Temporal Sequences for Multi-Horizon Probabilistic Forecasting

Time-series is ubiquitous across applications, such as transportation, f...

1 Introduction

With a complex system like a robot, we would like to be able to discriminate between normal and anomalous behavior of this system. For instance, we would like to be able to recognize that something went wrong while the robot was fulfilling a task. Generally speaking, determining whether an unknown sample is structurally different from prior knowledge is referred to as anomaly detection.

Recording anomalous data is costly (or even dangerous) in comparison to normal data. Moreover, anomalies are inherently diverse, which prohibits explicit modeling. Due to the underrepresentation of anomalous samples in training data, anomaly detection remains a challenging instance of two-class classification to this day. Consequently, the problem is reversed: A normality score is learned from normal data only, and the fully trained normality score is used to discriminate anomalous from normal data by thresholding.

The contribution of this paper is an application of approximate variational inference for anomaly detection. We learn a generative time series model of the data, which can handle high-dimensional, spatially and temporally structured data. Neither the learning algorithm nor subsequent anomaly detection via scores requires domain knowledge.

2 Problem Description: Anomaly Detection

As (Pimentel et al., 2014) show, a plethora of anomaly detection approaches exist. A common assumption for time series anomaly detection is that data streams are i.i.d. in time and/or space. For robots (and many other systems) this is not true: Joint torques that are perfectly normal in one joint configuration are anomalous in another. This is not captured by previous approaches. A notable exception is (Milacski et al., 2015). However, their approach, requiring the entire time series for processing, lacks on-line capability and their off-line evaluation cannot be transferred to different models because it is based on a model-based segmentation of the time-series. (An & Cho, 2015) have independently applied variational inference anomaly detection on static data. Since no comparable algorithm exists, no comparison is possible.

Figure 1: 1008 normal samples. This approximates the target distribution we are trying to learn with STORN.

For training and testing, we recorded the joint configurations of the seven joints of a Rethink Robotics Baxter Robot arm. We recorded 1008 anomaly-free samples at 15 Hz of a pick-and-place task, our target distribution, cf. Fig. 1. This task is simulated by first fixing a pool of 10 waypoints and then, for each sample, traversing a random sequence of as many waypoints as possible for a duration of 30s and finally returning to the initial configuration. This results in roughly 8 to 10 waypoints per sample. For this distribution, we would like to learn a generative model.

For testing purposes, we recorded 300 samples with anomalies obtained by manually hitting the robot on random hit commands. For each time stamp, we obtained two labels: whether or not a hit command occurred within the previous 4 seconds (a rough bound on the human response time), and unusual torque.111Torque was only used for labeling, not for learning. Both are depicted as red and blue background color in Fig. 3. Neither of these labels is perfect, the temporal label is necessarily too loose, whereas the torque label misses subtle anomalies while putting false positive labels on artifacts in the data.

Figure 2: Receiver operating characteristic (ROC) for evaluating the off-line detection algorithm for different normality criteria and thresholds.
Positive Negative
Detect. P 141 141 142 140 0 3 2 1 1.0 .979 .986 .993 PPV
N 4 4 3 5 109 106 107 108 .965 .964 .973 .956 NPV
.972 .972 .979 .966 1.0 .972 .982 .991 .984 .972 .980 .976
Sensitivity Specificity Accuracy
Table 1: Standard metrics for classification on unseen test set data. PPV and NPV denote positive and negative predictive value, respectively. Each cell contains four values, from left to right the result for the four scores in the order outlined in Section 4.1.

3 Methodology: Variational Inference and Stochastic Recurrent Networks

In the wake of (Rezende et al., 2014; Kingma & Welling, 2013), who introduced the Variational Auto-Encoder (VAE), there has been a renewed interest in variational inference. The VAE replaces an auto-encoder’s latent representation of given data with stochastic variables.

The decoding distribution (or recognition model) approximates the true, unknown posterior . The encoding distribution implements a simple graphical model. Decoder and encoder are parametrized by and

, respectively, e.g., neural networks. This allows to overcome intractable posterior distributions by learning.

The approach is theoretically justified by the observation that an arbitrary inverse transform sampling is approximated.

The approach is generalized to time series by (Bayer & Osendorfer, 2014) by applying recurrent neural networks with hidden layers as encoder and decoder. For observations and corresponding latent states , we assume the factorization


The two distributions, the generative model (decoder) and the prior over latents, are implemented by two RNNs with parameters and and hidden states and

, respectively. These RNNs output sufficient statistics of normal distributions. This introduces a

trending prior, i.e., a prior that can change over time, which is an extension of the original STORN. Also, it forms an extension of (Chung et al., 2015).

With the encoding RNN with parameters implementing the recognition model , we arrive at the typical variational lower bound (also referred to as free energy) to the marginal log-likelihood :


Maximization of the lower bound is provably equivalent to minimizing the KL-divergence between the true posterior and the approximate posterior

. The lower bound can be used to simultaneously train all adjustable parameters by stochastic backpropagation. The decomposition over time allows for computationally feasible on-line detection.

For each time step, STORN outputs a lower bound value and a predictive distribution for the next time step. These (and post-processed versions) serve as scores for thresholding anomalies.

4 Experiments

Prior to any anomaly detection, we trained STORN on a training set of 640 normal time series. Model selection was then based on 160 validation samples. Based on a fixed STORN, the anomaly detection then derives a scalar score from the outputs of the model and finds a threshold to discriminate normal from anomalous data. Anomaly detection was tested on the 208 remaining normal samples and 300 anomalous samples: Half of these 508 samples were taken to extract thresholds and the remaining half was taken to test overall performance of the detection algorithm.

4.1 Off-line Detection

For off-line detection, i.e., detecting whether an unknown test sample has an anomaly or not, we used different normality scores:

  1. Lower bound (value of objective function (3)).

  2. Monte Carlo estimate of the true likelihood (

    4) (for comparative reasons).

  3. A high percentile (outlier-adjusted extreme value) of the deviation from MAP one-step prediction.

  4. A high percentile of the step-wise components of the lower bound.

It should be stressed that none of the normality scores is related to the original data domain. Anomaly detection is entirely transferred onto probabilistic grounds.

A ROC curve on the 254 samples used for threshold extraction can be seen in Fig. 2

. For each of the four scores, we used the threshold minimizing the sum of squared sensitivity and specificity (which coincides with the threshold corresponding to the point closest to the top left corner of the ROC curve—the perfect classifier).

Table 1 reports standard metrics for classification on unseen test data. We see that off-line classification is remarkably robust.

4.2 On-line Detection

The more challenging case of on-line detection is depicted in Fig. 3. Again, we applied four different normality scores. Three were based on the step-wise lower bound—we used the step-wise lower bound output of our model, as well as a smoothed version (with a narrow Gaussian Kernel), and the absolute value of forward differences. As a fourth score, we used the step-wise magnitude of lower-bound gradient w.r.t. the sample. A large gradient magnitude in one time step indicates a significant perturbation from a more likely time series; such perturbations are indicators of anomalous data. As with the off-line scores, none of the on-line scores is related to the original data domain. This renders the overall approach very flexible.

For each of the four scores, we extracted three different thresholds, each leveraging the two types of imperfect labels differently. For each of the four scores, we extracted three thresholds, namely the ones maximizing

  1. the sum of squared sensitivity and specificity on torque-based labels,

  2. in addition to 1. the positive predictive value (PPV) on torque labels,

  3. in addition to 1. and 2. the weighted PPV on hit-command-based labels.

The underlying assumption is that we do not necessarily want to perfectly recover the true labels (which we do not know), but we want to spot anomalies qualitatively (i.e., report an anomaly while it is ongoing while at the same time having few false alarms). The choice of metric for extracting a threshold highly depends on the application at hand.

These four times three thresholds can be seen in Fig. 3.

Figure 3: Anomalous test sample with labels (red and blue background colors), hit commands (black lines). On-line normality scores color-coded by the 12 bars (4 scores, 3 thresholds each)—green-red-yellow spectrum indicating that the respective score is below, on, or above the anomaly threshold.

5 Conclusion and Future Work

In this paper, we successfully applied the framework of variational inference (VI), in particular Stochastic Recurrent Networks (STORNs), for learning a probabilistic generative model of high-dimensional robot time series data. No comparable approach has been proposed previously.

This new approach enables feasible off- and on-line detection without further assumptions on the data. In particular, no domain knowledge is required for applying both the learning and the detection algorithm. This renders our algorithm a very flexible, generic approach for anomaly detection in spatially and temporally structured time series. We have shown that the new approach is able to detect anomalies in robot time series data with remarkably high precision.

Future research will have to show reproducibility of the results (i) with different kinds of anomalies, (ii) in new environments (e.g., on other robots).

Furthermore, we believe that variational inference will enable us to extract the true latent dynamics of the system from observable data by introducing suitable priors and transitions into STORN. This will equip us with a more meaningful latent space, which can in turn serve as a basis for new detection methods based on output of STORN.


A previous version of this paper was presented at the Workshop of ICLR 2016.

Part of this work has been supported by the TAC-MAN project, EC Grant agreement no. 610967, within the FP7 framework program.

Patrick van der Smagt is also affiliated with BRML, Technische Universität München, Germany.