1 Introduction
With a complex system like a robot, we would like to be able to discriminate between normal and anomalous behavior of this system. For instance, we would like to be able to recognize that something went wrong while the robot was fulfilling a task. Generally speaking, determining whether an unknown sample is structurally different from prior knowledge is referred to as anomaly detection.
Recording anomalous data is costly (or even dangerous) in comparison to normal data. Moreover, anomalies are inherently diverse, which prohibits explicit modeling. Due to the underrepresentation of anomalous samples in training data, anomaly detection remains a challenging instance of two-class classification to this day. Consequently, the problem is reversed: A normality score is learned from normal data only, and the fully trained normality score is used to discriminate anomalous from normal data by thresholding.
The contribution of this paper is an application of approximate variational inference for anomaly detection. We learn a generative time series model of the data, which can handle high-dimensional, spatially and temporally structured data. Neither the learning algorithm nor subsequent anomaly detection via scores requires domain knowledge.
2 Problem Description: Anomaly Detection
As (Pimentel et al., 2014) show, a plethora of anomaly detection approaches exist. A common assumption for time series anomaly detection is that data streams are i.i.d. in time and/or space. For robots (and many other systems) this is not true: Joint torques that are perfectly normal in one joint configuration are anomalous in another. This is not captured by previous approaches. A notable exception is (Milacski et al., 2015). However, their approach requires the entire time series for processing and therefore lacks online capability, and their offline evaluation cannot be transferred to different models because it is based on a model-based segmentation of the time series. (An & Cho, 2015) have independently applied variational inference to anomaly detection on static data. Since no comparable algorithm exists, no comparison is possible.
For training and testing, we recorded the joint configurations of the seven joints of a Rethink Robotics Baxter Robot arm. We recorded 1008 anomaly-free samples at 15 Hz of a pick-and-place task, our target distribution, cf. Fig. 1. This task is simulated by first fixing a pool of 10 waypoints and then, for each sample, traversing a random sequence of as many waypoints as possible for a duration of 30 s and finally returning to the initial configuration. This results in roughly 8 to 10 waypoints per sample. For this distribution, we would like to learn a generative model.
For testing purposes, we recorded 300 samples with anomalies obtained by manually hitting the robot on random hit commands. For each time stamp, we obtained two labels: whether or not a hit command occurred within the previous 4 seconds (a rough bound on the human response time), and unusual torque. (Torque was only used for labeling, not for learning.) Both are depicted as red and blue background color in Fig. 3. Neither of these labels is perfect: the temporal label is necessarily too loose, whereas the torque label misses subtle anomalies while putting false positive labels on artifacts in the data.
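The temporal label can be illustrated with a minimal sketch. The function name, the inclusive window boundaries, and the example hit time are assumptions made for illustration; only the 15 Hz rate and the 4-second window come from the text.

```python
def hit_command_labels(num_steps, hit_times_s, rate_hz=15.0, window_s=4.0):
    """Return one boolean per time step: True if a hit command was issued
    at most `window_s` seconds before (or exactly at) that step."""
    labels = []
    for step in range(num_steps):
        now_s = step / rate_hz  # time stamp of this step in seconds
        labels.append(any(0.0 <= now_s - hit_s <= window_s for hit_s in hit_times_s))
    return labels


# Hypothetical example: 450 steps (30 s at 15 Hz) with one hit command at 10 s.
labels = hit_command_labels(num_steps=450, hit_times_s=[10.0])
print(sum(labels))  # number of time steps carrying the temporal label
```

Because the window is a fixed bound on the response time, every step inside it is labeled, whether or not the hit has already happened or is still ongoing. This is why the text calls the label necessarily too loose.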
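Table 1: Confusion matrix and derived metrics on the test data. Each cell lists four values, one per normality score.

|  | Anomaly: Positive | Anomaly: Negative |  |
| Detect.: P | 141  141  142  140 | 0  3  2  1 | PPV: 1.0  .979  .986  .993 |
| Detect.: N | 4  4  3  5 | 109  106  107  108 | NPV: .965  .964  .973  .956 |
|  | Sensitivity: .972  .972  .979  .966 | Specificity: 1.0  .972  .982  .991 | Accuracy: .984  .972  .980  .976 |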
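The derived metrics in Table 1 follow from the four confusion-matrix counts in the usual way. The sketch below uses a hypothetical helper name and recomputes them for the first entry of each cell (141 true positives, 0 false positives, 4 false negatives, 109 true negatives).

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard binary classification metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),   # detected anomalies among all anomalies
        "specificity": tn / (tn + fp),   # accepted normals among all normals
        "ppv": tp / (tp + fp),           # true anomalies among all detections
        "npv": tn / (tn + fn),           # true normals among all non-detections
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


# First entry of each cell in Table 1.
metrics = classification_metrics(tp=141, fp=0, fn=4, tn=109)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Rounded to three digits, these values agree with the first entries reported for sensitivity, specificity, PPV, NPV, and accuracy.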
3 Methodology: Variational Inference and Stochastic Recurrent Networks
In the wake of (Rezende et al., 2014; Kingma & Welling, 2013), who introduced the Variational Auto-Encoder (VAE), there has been a renewed interest in variational inference. The VAE replaces an auto-encoder's latent representation of given data with stochastic variables.
The encoding distribution $q_\phi(z \mid x)$ (or recognition model) approximates the true, unknown posterior $p(z \mid x)$. The decoding distribution $p_\theta(x \mid z)$ implements a simple graphical model. Decoder and encoder are parametrized by $\theta$ and $\phi$, respectively, e.g., as neural networks. This makes it possible to overcome intractable posterior distributions by learning.
The approach is theoretically justified by the observation that an arbitrary inverse transform sampling is approximated.
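The approach is generalized to time series by (Bayer & Osendorfer, 2014) by applying recurrent neural networks with hidden layers as encoder and decoder. For observations $x_{1:T}$ and corresponding latent states $z_{1:T}$, we assume the factorization
\begin{align}
p(x_{1:T}, z_{1:T}) &= \prod_{t=1}^{T} p(x_t \mid x_{1:t-1}, z_{1:t}) \, p(z_t \mid x_{1:t-1}, z_{1:t-1}) \tag{1} \\
&= \prod_{t=1}^{T} p_{\theta_x}(x_t \mid h^x_t) \, p_{\theta_z}(z_t \mid h^z_t). \tag{2}
\end{align}
The two distributions, the generative model (decoder) and the prior over latents, are implemented by two RNNs with parameters $\theta_x$ and $\theta_z$ and hidden states $h^x_t$ and $h^z_t$, respectively. These RNNs output sufficient statistics of normal distributions. This introduces a trending prior, i.e., a prior that can change over time, which is an extension of the original STORN. Also, it forms an extension of (Chung et al., 2015). With the encoding RNN with parameters $\phi$ implementing the recognition model $q_\phi(z_{1:T} \mid x_{1:T})$, we arrive at the typical variational lower bound (also referred to as free energy) to the marginal log-likelihood $\ln p(x_{1:T})$:
\begin{align}
\ln p(x_{1:T}) &\ge \mathbb{E}_{q_\phi(z_{1:T} \mid x_{1:T})}\left[ \ln p(x_{1:T}, z_{1:T}) - \ln q_\phi(z_{1:T} \mid x_{1:T}) \right] \tag{3} \\
&= \mathbb{E}_{q_\phi(z_{1:T} \mid x_{1:T})}\left[ \sum_{t=1}^{T} \left( \ln p_{\theta_x}(x_t \mid h^x_t) + \ln p_{\theta_z}(z_t \mid h^z_t) \right) - \ln q_\phi(z_{1:T} \mid x_{1:T}) \right]. \tag{4}
\end{align}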
Maximization of the lower bound is provably equivalent to minimizing the KL divergence between the true posterior $p(z_{1:T} \mid x_{1:T})$ and the approximate posterior $q_\phi(z_{1:T} \mid x_{1:T})$. The lower bound can be used to simultaneously train all adjustable parameters by stochastic backpropagation. The decomposition over time allows for computationally feasible online detection.
For each time step, STORN outputs a lower bound value and a predictive distribution for the next time step. These (and postprocessed versions) serve as scores for thresholding anomalies.
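To make the per-step lower bound concrete, the following sketch computes one time step's contribution under stated assumptions: the decoder, the prior, and the recognition model are all diagonal Gaussians, the recognition model factorizes over time, and the reconstruction term is estimated with a single latent sample. All function and argument names are hypothetical, and the means and standard deviations stand in for the outputs of the respective RNNs. It is not the authors' implementation.

```python
import math


def gaussian_log_density(x, mean, std):
    """Log density of a diagonal Gaussian, summed over dimensions."""
    return sum(
        -0.5 * math.log(2.0 * math.pi) - math.log(s) - 0.5 * ((xi - m) / s) ** 2
        for xi, m, s in zip(x, mean, std)
    )


def gaussian_kl(mean_q, std_q, mean_p, std_p):
    """KL(q || p) between two diagonal Gaussians, summed over dimensions."""
    return sum(
        math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2.0 * sp ** 2) - 0.5
        for mq, sq, mp, sp in zip(mean_q, std_q, mean_p, std_p)
    )


def stepwise_lower_bound(x_t, dec_mean, dec_std, q_mean, q_std, prior_mean, prior_std):
    """Single-sample estimate of one time step's lower-bound term.

    dec_mean and dec_std are assumed to be the decoder outputs for a latent
    sampled from the recognition model; the KL term is computed analytically.
    """
    reconstruction = gaussian_log_density(x_t, dec_mean, dec_std)
    regularizer = gaussian_kl(q_mean, q_std, prior_mean, prior_std)
    return reconstruction - regularizer


# Hypothetical two-dimensional example with made-up network outputs.
bound_t = stepwise_lower_bound(
    x_t=[0.1, -0.3],
    dec_mean=[0.0, -0.2], dec_std=[0.5, 0.5],
    q_mean=[0.2, 0.0], q_std=[0.8, 0.9],
    prior_mean=[0.0, 0.0], prior_std=[1.0, 1.0],
)
print(bound_t)
```

A time step that the model explains poorly yields a low reconstruction term, a large divergence between recognition model and prior, or both. Either effect lowers the step's value, which is what makes it usable as a normality score.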
4 Experiments
Prior to any anomaly detection, we trained STORN on a training set of 640 normal time series. Model selection was then based on 160 validation samples. Based on a fixed STORN, the anomaly detection then derives a scalar score from the outputs of the model and finds a threshold to discriminate normal from anomalous data. Anomaly detection was tested on the 208 remaining normal samples and 300 anomalous samples: Half of these 508 samples were taken to extract thresholds and the remaining half was taken to test overall performance of the detection algorithm.
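The split sizes can be checked with a short sketch. The text does not state how samples were assigned to the subsets, so the random shuffling, the seed, and the function name are assumptions. Only the counts (640, 160, 208, 300, and the two halves of 254) come from the text.

```python
import random


def split_dataset(num_normal=1008, num_anomalous=300, seed=0):
    """Split sample indices into training, validation, threshold-extraction,
    and test subsets with the sizes described in the text."""
    rng = random.Random(seed)
    normal = list(range(num_normal))
    rng.shuffle(normal)  # assumption: random assignment of normal samples
    train, valid, held_out = normal[:640], normal[640:800], normal[800:]

    # Detection data: remaining normal samples plus all anomalous samples.
    detection = [("normal", i) for i in held_out]
    detection += [("anomalous", i) for i in range(num_anomalous)]
    rng.shuffle(detection)
    half = len(detection) // 2
    return train, valid, detection[:half], detection[half:]


train, valid, threshold_set, test_set = split_dataset()
print(len(train), len(valid), len(threshold_set), len(test_set))
```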
4.1 Offline Detection
For offline detection, i.e., detecting whether an unknown test sample has an anomaly or not, we used different normality scores:
It should be stressed that none of the normality scores is related to the original data domain. Anomaly detection is entirely transferred onto probabilistic grounds.
A ROC curve on the 254 samples used for threshold extraction can be seen in Fig. 2. For each of the four scores, we used the threshold minimizing the sum of squared deviations of sensitivity and specificity from 1, i.e., $(1 - \text{sensitivity})^2 + (1 - \text{specificity})^2$ (which coincides with the threshold corresponding to the point closest to the top left corner of the ROC curve, the perfect classifier).
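Picking the ROC point closest to the top left corner amounts to minimizing (1 - sensitivity)^2 + (1 - specificity)^2 over candidate thresholds. The sketch below is a minimal illustration under assumptions: lower scores are taken to mean "more anomalous", a sample is flagged when its score is at or below the threshold, and the function name and toy data are hypothetical.

```python
def closest_to_corner_threshold(scores, is_anomaly):
    """Return (threshold, squared distance to the ROC corner) for the
    candidate threshold whose ROC point is closest to the top left corner."""
    n_pos = sum(is_anomaly)
    n_neg = len(is_anomaly) - n_pos
    best_thr, best_dist = None, float("inf")
    for thr in sorted(set(scores)):
        # Assumption: a sample is flagged as anomalous if score <= thr.
        tp = sum(1 for s, a in zip(scores, is_anomaly) if a and s <= thr)
        fp = sum(1 for s, a in zip(scores, is_anomaly) if not a and s <= thr)
        sensitivity = tp / n_pos
        specificity = 1.0 - fp / n_neg
        dist = (1.0 - sensitivity) ** 2 + (1.0 - specificity) ** 2
        if dist < best_dist:
            best_thr, best_dist = thr, dist
    return best_thr, best_dist


# Toy data: three anomalous and four normal samples with overlapping scores.
scores = [-9.0, -7.5, -3.0, -2.0, -1.5, -1.0, -0.5]
is_anomaly = [True, True, False, True, False, False, False]
threshold, distance = closest_to_corner_threshold(scores, is_anomaly)
print(threshold, distance)
```

On the toy data, the selected threshold accepts one false positive in order to catch all three anomalies, since that point lies closer to the corner than the alternative with no false positives but one missed anomaly.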
Table 1 reports standard metrics for classification on unseen test data. We see that offline classification is remarkably robust.
4.2 Online Detection
The more challenging case of online detection is depicted in Fig. 3. Again, we applied four different normality scores. Three were based on the stepwise lower bound: the stepwise lower bound output of our model, a smoothed version of it (with a narrow Gaussian kernel), and the absolute value of its forward differences. As a fourth score, we used the stepwise magnitude of the gradient of the lower bound w.r.t. the sample. A large gradient magnitude in one time step indicates a significant perturbation from a more likely time series; such perturbations are indicators of anomalous data. As with the offline scores, none of the online scores is related to the original data domain. This renders the overall approach very flexible.
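The two post-processed scores can be sketched as follows. The kernel width, the truncation radius, and the renormalization at the sequence borders are assumptions, as are the function names and the toy sequence. The symmetric kernel used here looks a few steps ahead, so a strictly online variant would need a causal kernel or a short delay. The gradient-based score is omitted because it requires the trained model.

```python
import math


def gaussian_smooth(values, sigma=1.0, radius=3):
    """Smooth a sequence with a truncated Gaussian kernel, renormalizing
    the kernel weights at the borders."""
    offsets = range(-radius, radius + 1)
    weights = [math.exp(-0.5 * (k / sigma) ** 2) for k in offsets]
    smoothed = []
    for t in range(len(values)):
        num = den = 0.0
        for k, w in zip(offsets, weights):
            if 0 <= t + k < len(values):
                num += w * values[t + k]
                den += w
        smoothed.append(num / den)
    return smoothed


def abs_forward_differences(values):
    """Absolute value of the forward differences values[t + 1] - values[t]."""
    return [abs(values[t + 1] - values[t]) for t in range(len(values) - 1)]


# Toy stepwise lower bound with a sudden drop at index 3.
lower_bound = [-1.0, -1.2, -0.9, -6.0, -1.1, -1.0]
smoothed = gaussian_smooth(lower_bound)
jumps = abs_forward_differences(lower_bound)
print(smoothed)
print(jumps)
```

Smoothing trades a slightly delayed and flattened response for fewer single-step false alarms, whereas the forward differences react to abrupt changes of the bound rather than to its level.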
For each of the four scores, we extracted three different thresholds, each leveraging the two types of imperfect labels differently, namely the ones maximizing
1. the sum of squared sensitivity and specificity on torque-based labels,
2. in addition to 1., the positive predictive value (PPV) on torque labels,
3. in addition to 1. and 2., the weighted PPV on hit-command-based labels.
The underlying assumption is that we do not necessarily want to perfectly recover the true labels (which we do not know), but we want to spot anomalies qualitatively (i.e., report an anomaly while it is ongoing and, at the same time, raise few false alarms). The choice of metric for extracting a threshold highly depends on the application at hand.
These four times three thresholds can be seen in Fig. 3.
5 Conclusion and Future Work
In this paper, we successfully applied the framework of variational inference (VI), in particular Stochastic Recurrent Networks (STORNs), for learning a probabilistic generative model of high-dimensional robot time series data. No comparable approach has been proposed previously.
This new approach enables feasible offline and online detection without further assumptions on the data. In particular, no domain knowledge is required for applying either the learning or the detection algorithm. This renders our algorithm a very flexible, generic approach for anomaly detection in spatially and temporally structured time series. We have shown that the new approach is able to detect anomalies in robot time series data with remarkably high precision.
Future research will have to show reproducibility of the results (i) with different kinds of anomalies, (ii) in new environments (e.g., on other robots).
Furthermore, we believe that variational inference will enable us to extract the true latent dynamics of the system from observable data by introducing suitable priors and transitions into STORN. This will equip us with a more meaningful latent space, which can in turn serve as a basis for new detection methods based on output of STORN.
Acknowledgments
A previous version of this paper was presented at the Workshop of ICLR 2016.
Part of this work has been supported by the TACMAN project, EC Grant agreement no. 610967, within the FP7 framework program.
Patrick van der Smagt is also affiliated with BRML, Technische Universität München, Germany.
References

An & Cho (2015) An, Jinwon and Cho, Sungzoon. Variational autoencoder based anomaly detection using reconstruction probability. 2015.
Bayer & Osendorfer (2014) Bayer, Justin and Osendorfer, Christian. Learning stochastic recurrent networks. arXiv preprint arXiv:1411.7610, 2014.
Chung et al. (2015) Chung, Junyoung, Kastner, Kyle, Dinh, Laurent, Goel, Kratarth, Courville, Aaron C., and Bengio, Yoshua. A recurrent latent variable model for sequential data. CoRR, abs/1506.02216, 2015. URL http://arxiv.org/abs/1506.02216.
Kingma & Welling (2013) Kingma, Diederik P and Welling, Max. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
Milacski et al. (2015) Milacski, Zoltán Á, Ludersdorfer, Marvin, Lorincz, András, and van der Smagt, Patrick. Robust detection of anomalies via sparse methods. In Proc. 22nd Int. Conf. on Neural Information Processing (ICONIP 2015), 2015.
Pimentel et al. (2014) Pimentel, Marco AF, Clifton, David A, Clifton, Lei, and Tarassenko, Lionel. A review of novelty detection. Signal Processing, 99:215–249, 2014.
Rezende et al. (2014) Rezende, Danilo J., Mohamed, Shakir, and Wierstra, Daan. Stochastic backpropagation and approximate inference in deep generative models. In Jebara, Tony and Xing, Eric P. (eds.), Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 1278–1286. JMLR Workshop and Conference Proceedings, 2014. URL http://jmlr.org/proceedings/papers/v32/rezende14.pdf.