Unsupervised Contextual Anomaly Detection using Joint Deep Variational Generative Models

04/01/2019 ∙ by Yaniv Shulman, et al. ∙ Aleph Zero Records 1

A method for unsupervised contextual anomaly detection is proposed using a cross-linked pair of Variational Auto-Encoders for assigning a normality score to an observation. The method enables a distinct separation of contextual from behavioral attributes and is robust to the presence of anomalous or novel contextual attributes. The method can be trained with data sets that contain anomalies without any special pre-processing.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Anomaly detection is an important area of research since anomalies represent a substantial deviation from the normal characteristics of a system or process of interest. Often these processes result in highly dimensional data sets, with complex relationships within the data and exhibit stochastic behavior. Furthermore the anomalies by definition contain high self-information measure and therefore carry useful information about the underlying data generation process. There exist a number of similar definitions of what an anomaly is however in this paper the following definition is adopted

CompareAnomalyDetection2016 :

  1. Anomalies are different from the norm in respect to their attributes.

  2. They are rare in a data set compared to the normal instances.

  3. In addition a novel observation is defined as an observation that is substantially different than any observation in the training data set.

In this paper a method for contextual anomaly detection is proposed using a cross-linked pair of Variational Auto-Encoders (VAE) for assigning a normality score to an observation. The method enables a distinct separation of contextual from behavioral attributes and is robust to the presence of anomalous or novel contextual attributes. The method can be trained with data sets that contain anomalies without any special pre-processing. In addition the method can be extended in a straight forward way to further decompose and separately model the joint variational approximation by introducing additional independent recognition networks thus allowing for more accurate representation in the latent space.

In summary the key contributions of this paper are:

  • A novel architecture for auto-encoding joint latent variational Bayes.

  • A novel method for robust unsupervised anomaly detection in the presence of contextual anomalies.

2 Preliminaries

2.1 Anomaly Detection

In this section a number of criteria for broadly categorizing anomaly detection algorithms is briefly discussed. These concepts are covered in more detail in Aggarwal:2013:OA:2436823 ; Chandola:2009 ; CompareAnomalyDetection2016 .

Proximity based anomaly detection

assumes that anomalous data are isolated from the majority of the data whether in relation to clusters or global/local dense regions. To determine if an observation is anomalous, the distance to the clusters or the density estimate is calculated to generate a normality score;

Statistical based anomaly detection

assumes that data is generated from a known probability distribution which can be described by parametric or non-parametric formulation. To determine if a data point is an anomaly the probability of it being generated from the assumed distribution is determined and a normality score is produced derived from this probability;

Deviation based anomaly detection is based on the reconstruction errors following a spectral or other transformation of the data to a lower dimensional space and then back to the original space. The magnitude of the reconstruction error is used to generate a normality score.

Supervised anomaly detection is employed where both the training and test data sets specify for each observation whether it is normal or anomalous; Semi-supervised anomaly detection is typically defined as scenarios where the training data contains only normal observations; Unsupervised anomaly detection is the case where there are no labels provided in either the training or the testing data sets and no assumptions are made on the existence or number of anomalous observations in the available data.

Contextual anomaly detection is formulated such that the data contains two types of attributes, behavioral and contextual attributes. Behavioral attributes are attributes that relate directly to the process of interest whereas contextual attributes relate to exogenous but highly affecting factors in relation to the process. Generally the behavioral attributes are conditional on the contextual attributes.

2.2 Variational Auto-Encoder

In this section a brief overview of the Variational Auto-Encoder (VAE) Kingma:2009 is provided for presenting the notation used in subsequent sections of the paper.

A Variational Auto-Encoder (VAE) is a directed probabilistic graphical model that enables an efficient variational inference for intractable posterior distributions which are approximated by a neural network. The VAE is comprised of two serially adjoined neural networks which are referred to as encoder/recognizer and decoder/generator respectively. The generator network

where is a latent variable approximates the generative process . The recognition network models a variational approximation of the intractable posterior . All parameters are learned jointly and efficiently by employing the Stochastic Gradient Variational Bayes (SGVB) Kingma:2009 estimator. As the marginal likelihood of the data is intractable, the problem is transformed into an optimization problem where the objective function of the VAE is the Evidence Lower Bound (ELBO), a lower bound on as formulated in equation 2c.


Where the inequality in equation 2a

follows from the non-negativity of the Kullback–Leibler divergence. A complete derivation can be found in

doi:10.1080/01621459.2017.1285773 .

2.3 Conditional Variational Auto-Encoder

In this section a very brief overview of the Conditional Variational Auto-Encoder (CVAE) NIPS2015_5775 . The CVAE expands on the learning capacity of the VAE by defining an architecture that enables the model to learn explicit joint variational approximation of the latent variable and a directly modulated conditional generative model. In CVAE the input is denoted as , the output is denoted as and the latent variable is . The CVAE utilizes the SGVB optimization framework and an objective function closely related to the VAE defined in equation 3e.


2.4 Related Work

Anomaly detection has attracted large interest from the research community over decades due to the varied areas of application and theoretical importance. There are many suggested methods for the general case however a much smaller number of methods that deal explicitly with contextual anomaly detection exist. A review of related work is given in Chandola:2009 and in CompareAnomalyDetection2016

, the latter being more recent and also endeavors to provide an elaborate comparative evaluation for a large number of methods. In this section the focus is on more recent methods proposed either for contextual anomaly detection or anomaly detection that make use of variational inference and deep learning methods. Note that both supervised, semi-supervised and unsupervised methods are included.

DBLP:journals/corr/LiangP16 proposes a contextual anomaly detection method (ROCOD) for dealing with situations where there are abnormal or sparse contextual attributes by utilizing local and global behavioral models conditional on the context. DBLP:journals/corr/LiangP16 also performs comparative analysis of a number of methods and demonstrates that state-of-the-art point methods achieve relatively poor results on contextual anomaly detection problems. Song2007ConditionalAD

has proposed a method for general contextual anomaly detection and proposes three different expectation-maximization algorithms for learning the model. Additionally

Song2007ConditionalAD comparatively evaluates more than 13 different data sets against several other non-contextual anomaly detection methods. DBLP:conf/aaai/HongH16

propose a multivariate conditional outlier detection framework for clinical applications by defining a multi-variate function to calculate the normality score.


propose a method for improved unsupervised learning of

constrained representations for clustering analysis using deep Auto-Encoders. Normality scores are then calculated based on similarity measure to clusters. Note that in

DBLP:journals/corr/abs-1802-00187 the number of clusters is assumed to be known. Slch2016VariationalIF apply a Stochastic Recurrent Network (STORN) Bayer:2015 for supervised detection of anomalies in robot sensors time series data. An2015VariationalAB suggests an anomaly detection method using a VAE and proposes the Reconstruction Probability a novel normality score based on the probabilistic measure expressed in the objective function of the VAE. Xu:2018:UAD:3178876.3185996 suggest Donot, an unsupervised anomaly detection algorithm utilizing a Variational Auto-Encoder for anomaly detection in Seasonal KPI arising from web applications utilizing the Reconstruction Probability.

3 Problem Description

3.1 Unsupervised Contextual Anomaly Detection

Most anomaly detection methods known to the author at this time do not provide explicit treatment of contextual and behavioral attributes separately but simply merge the two attribute types into a single observation thus transforming the original task into a standard point anomaly detection CompareAnomalyDetection2016 ; Chandola:2009 . On the other hand some contextual anomaly detection methods either require a labeled data set for training or are designed for specific domains therefore it seems not many methods exist to perform general unsupervised contextual anomaly detection. Furthermore by definition relatively little information is available on the distribution of the behavioral attributes in low density areas of the contextual subspace which results in an additional challenge for the existing algorithms especially when there is no information available on the distribution of the behavioral attributes when the context is in itself novel.

In this paper the focus is on unsupervised contextual anomaly detection where the training and testing data sets are generated by the same process. It is of interest to develop a robust model that is able to learn efficiently the state of the process and correctly predict an observation as an anomaly when the behavioral attributes are in fact an anomaly given the context. However it is desirable for such a model to be robust to anomalies present in the contextual attributes and use the “best” available relevant context to make meaningful predictions.

4 Proposed Method

4.1 The Data Generation Model

Given a data set of observations where denotes concatenation, the set contains only behavioral attributes and the set contains only the corresponding contextual attributes and are jointly and independently drawn. The data generation process where the samples are taken can be modeled as follows:

  1. A sample is taken from a latent variable with prior distribution .

  2. A sample is taken from a latent variable with prior distribution .

  3. A sample is taken from a variable with conditional distribution .

  4. A sample is taken from a variable with conditional distribution .

The generative process is defined as ,
and is chosen so to prevent from modulating the generative process of directly for reasons brought in subsequent sections. Figure 1 provides an overview of the generative process. and are often intractable.

Figure 1: Illustration of the generative model as a directed graphical model. is the behavioral attributes for the process of interest, is the contextual attributes which in this case do not participate directly in the generative process . Solid lines denote the generative process whereas dashed line denote the variational approximations.

4.2 Joint Deep Variational Generative Models

4.2.1 The Variational Bound

Let denote the complete set of latent variables. The variational lower bound of and is defined as follows:


To optimize jointly the variational lower bound objective of the two marginal likelihoods equations 4 and 5 are combined.


Given the KL terms in equation 6 may be integrated analytically under certain conditions for calculating the empirical loss, the objective is optimized using the Stochastic Gradient Variational Bayes (SGVB) Kingma:2009 estimator:


Where and , is the number of samples. The first two terms in equation 7 represent the latent error for the two variational distributions and , and the two remaining terms the log probability of the reconstruction errors for the contextual and behavioral attributes and respectively.

4.2.2 Architecture

To approximate the posteriors of the joint generative models and two recognition networks and two generator networks are jointly trained. The behavioral attributes are input into one of the recognition networks and both the contextual and behavioral attributes are input into the other. Both recognition networks output the parameters of the variational approximations to the prior followed by samples that are drawn from the variational approximations to form a Monte Carlo approximation of the expectations of the reconstruction with respect to variational approximations Kingma:2009 . This architecture provides a number of benefits:

  1. Explicit treatment of behavioral and contextual attributes.

  2. Enables an indirect modulation of the generative process of by based on the latent representation of the contextual attributes rather than a direct modulation of the process as done in a CVAE architecture which results in increased robustness to the presence of outliers and novelties in the contextual space. Intuitively this can be explained by the similarity of the latent representation of to spectral dimensionality reduction representation which maps the data into a known sub-space, but with the increased model capacity of the recognition network and the benefit of a probabilistic interpretation.

  3. Enables assigning different priors for the contextual and behavioral spaces, having multiple of each as a method to decompose and separately model the joint latent distribution.

4.2.3 Training and Classification

All recognition and generator networks are jointly trained using the Stochastic Gradient Variational Bayes (SGVB) Kingma:2009 estimator. Having learned the model parameters a normality score can be obtained by either calculating a reconstruction error norm for the behavioral attributes or by calculating the Reconstruction Probability of defined as An2015VariationalAB . Note that the reconstructed context is not strictly required for assigning a normality score for classification but can be used to estimate the normality score of the context if desired. Figure 2 provides an overview of the architecture.

Figure 2: Illustration of the architecture

5 Experimental Results

5.1 Kddcup99

5.1.1 Data

Comparative evaluation of contextual anomaly detection methods is a challenging task due to lack of availability of common and suitable data sets that are both labeled and partitioned into behavioral and contextual attributes. To overcome this challenge a publicly available data set the Kddcup99111http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html was adopted as well as an evaluation method used in DBLP:journals/corr/LiangP16 to provide a performance baseline. The Kddcup99 is ”the data set used for The Third International Knowledge Discovery and Data Mining Tools Competition, which was held in conjunction with KDD-99 The Fifth International Conference on Knowledge Discovery and Data Mining. The competition task was to build a network intrusion detector, a predictive model capable of distinguishing between “bad” connections, called intrusions or attacks, and “good” normal connections. This database contains a standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment”. The Kddcup99 data set is by a large margin the most challenging data set evaluated by DBLP:journals/corr/LiangP16 and therefore was elected for this experiment. An effort was made to adhere to the same method of pre-processing and data inclusion as described in DBLP:journals/corr/LiangP16 however there are some differences as described subsequently.

The observations from the r2l and u2r attack families were retained as well as attacks of type ipsweep and nmap, and normal observations. This results in a total of 605,803 observations out of which 595,797 are labeled as normal, and the rest 10,006 are considered anomalies (approx. 1.652%). Similarly to DBLP:journals/corr/LiangP16 the service, duration, src_bytes and dst_bytes were used as behavioral attributes and all other as contextual attributes. The logarithm of duration, src_bytes and dst_bytes was taken since these attributes are processed in the same manner in DBLP:journals/corr/LiangP16

. All categorical features were one-hot-encoded and finally all attributes are normalized to

range. The resulting data set contains 65 behavioral attributes and 45 contextual attributes and enables quantitative analysis of the proposed algorithm’s effectiveness against the algorithms evaluated in DBLP:journals/corr/LiangP16 on a similar data set.

5.1.2 Model

The model is comprised of behavioral recognizer and generator networks and contextual recognizer and generator networks as in the basic architecture described in section 4.2.2 and illustrated in figure 2

. The arrangement of units in the behavioral recognizer MLP were: 65 (input), 58, 32 and 4 units for the latent output, with the generator having a mirror architecture. The arrangement of units in the contextual recognizer MLP were: 110 (input), 40, 22 and 4 units for the latent output, with the generator having a mirror architecture except for the output layer containing 45 units. All activation functions in the MLPs are Relu where applicable, however the latent parameters layer as well as the outputs of both generators employ linear activation. Isotropic normal distribution were assumed to the data and latent distributions which lead to the total empirical objective is presented in equation

8, note there is an added L1 regularization term over the MLPs’ weights with .


Where since a mini-batch size of 200 was used, , and are the dimensions of the latent variables , and respectively. For optimization Adam journals/corr/KingmaB14 was employed. Note that the aforementioned architecture is likely not optimal and was chosen based on previous personal experience for illustrative purposes with no attempt to find an optimal hyper-parameter setting for this experiment. Training was performed with early stop strategy once the loss on the validation set has started increasing.

5.1.3 Additional Baselines

Despite aiming to compare primarily against the results presented in DBLP:journals/corr/LiangP16 for diligence the same data set was evaluated by three additional algorithms: Isolation Forest Liu:2012:IAD:2133360.2133363 , One Class SVM Cortes1995 and Local Outlier Factor Breunig:2000:LID:335191.335388 . Not much effort was put into fine tuning these algorithms on the target data set and the results should be taken as indicative only.

5.1.4 Metrics

The following metrics were evaluated against each of the methods:

  1. Area under the Precision-Recall Curve (PRC): The area under the curve when plotting the recall on the x-axis against precision on the y-axis for all relevant possible threshold values for discriminating between normal and anomalous observations. PRC is recommended in scenarios where the data set is highly imbalanced Davis:2006:RPR:1143844.1143874

    . The area under the curve (AUC) provides a summary statistic to the performance of a classifier in the PRC space.

  2. Average Precision Score (APS): Provides a summary statistic for the Precision-Recall Curve as a weighted mean of precision obtained at each threshold, with the weight being the increase in recall from the previous threshold, calculated as where and

    are the precision and recall at the


  3. Area under the Receiver Operating Characteristics Curve (ROC): The ROC curve enables the visualization of the relative trade-off between true-positive rate (TPR) and false-positive rate (FPR) by plotting the FPR on the x-axis against TPR on the y-axis for all relevant threshold values. The area under the curve (AUC) provides a summary statistic to the performance of a classifier in the ROC space.

  4. Top-100 Precision: The fraction of correctly detected anomalies in the top 100 scored observations.

5.1.5 Performance Metrics for Standard Classification

Due to the challenges related to binary classification over a highly imbalanced data sets DBLP:journals/corr/BrancoTR15 a cross-validation with 5-fold stratified partitioning was performed where the ratio of the two classes in each of the the train/test partitions was kept equal to the distribution in the complete data set. The results are summarized in the following tables:

Method PRC (AUC) APS ROC (AUC) Top-100 Precis.
JLVAE 0.51848 0.51874 0.99257 0.018
IF Liu:2012:IAD:2133360.2133363 0.00842 0.00855 0.01937 0
OCSVM Cortes1995 0.00846 0.00853 0.02459 0
LOF Breunig:2000:LID:335191.335388 0.02458 0.03579 0.64849 0.056
Table 1: Summary of mean results obtained over the 5-folds for all methods.

The results obtained demonstrate a substantial improvement compared to the benchmark algorithms tested in the described setting and to the results obtained by DBLP:journals/corr/LiangP16 for a similar data set. The following tables contain detailed information as to the results obtained for each of the algorithms and k-folds.

K-Fold PRC (AUC) APS ROC (AUC) Top-100 Precision
1 0.50543 0.5057 0.99240 0.01
2 0.53492 0.53524 0.99321 0.03
3 0.51293 0.5131 0.99227 0.01
4 0.53134 0.53162 0.99264 0.03
5 0.50777 0.50805 0.99233 0.01
mean 0.51848 0.51874 0.99257 0.018
Table 2: JLVAE - proposed method.
K-Fold PRC (AUC) APS ROC (AUC) Top-100 Precision
1 0.0084 0.00854 0.01773 0
2 0.0084 0.0085 0.01203 0
3 0.00843 0.00859 0.02282 0
4 0.00848 0.0086 0.02942 0
5 0.00838 0.00853 0.01486 0
mean 0.00842 0.00855 0.01937 0
Table 3: Isolation Forest.
K-Fold PRC (AUC) APS ROC (AUC) Top-100 Precision
1 0.00847 0.00854 0.02476 0
2 0.00847 0.00853 0.02475 0
3 0.00846 0.00853 0.02462 0
4 0.00846 0.00853 0.02469 0
5 0.00846 0.00852 0.02414 0
mean 0.00846 0.00853 0.02459 0
Table 4: One Class SVM.
K-Fold PRC (AUC) APS ROC (AUC) Top-100 Precision
1 0.02442 0.03593 0.64976 0.01
2 0.02464 0.03627 0.64744 0.09
3 0.02394 0.0353 0.64671 0.03
4 0.02483 0.0351 0.64564 0.08
5 0.02506 0.03633 0.65290 0.07
mean 0.02458 0.03579 0.64849 0.056
Table 5: Local Outlier Factor.

5.2 Waste Water Treatment Plant

5.2.1 Robustness to Contextual Anomalies

To demonstrate the effectiveness of the method in dealing with contextual anomalies it was evaluated on a real-world waste water treatment plant located at Western Australia. The plant design features a splitter chamber that divides the incoming waste water into two wells each having two pumps. Waste water pumped by the pumps are then merged into a single outlet pipe by a series of two joiner pipes, one joining the pumps output in each well, and one joining the two well’s output. The control logic for the plant under normal conditions will turn pumps on and off as required to meet inflow conditions and also use variable speed drives to modulate the speed of the operational pumps based on a level reading of the splitter chamber. This design results in a system where the operational characteristics of a pump is not independent from the other pumps.

5.2.2 Data

The data set contains roughly 30 months of operational data, close to 150 attributes and about 690,200 coincident observations with 2 minutes frequency and is comprised of the following information:

  1. Sensors specific per pump such as vibration, temperature, speed, operational pressures and flows, power supply characteristics, and more.

  2. Generated features per pump such as efficiency.

  3. Environmental readings from the two wells.

  4. Other useful data such as the splitter chamber level and external weather conditions.

The data is assumed to contain anomalies of unknown nature and frequency. The data was partitioned such that data generated in a particular pump run-cycle was kept together and not partitioned across sets. Partitioning was done into training (65%), validation(15%) and testing (%20) sets where the percentages represent the portion of pump run-cycles rather than single observations. Lastly the data was not pre-processed except for aligning observations in time by mean interpolation, discarding partial observations with the remaining observations standardized. Note that there are no categorical attributes in this data set.

5.2.3 Model

A model is developed for each pump individually where the behavioral attribute are the data relating directly to the operational sensor readings of the pump, and where contextual attributes are some of the behavioral attributes of the remaining pumps as well as environmental factors such as the splitter chamber level and weather conditions. For example, a model for pump one will include as context the inflow and outflow rate and pressure of pumps 2-4, the splitter chamber level and environmental information. The setup was similar to the one described in section 5.1.2 with arrangement of units in the behavioral recognizer as follows: 28 (input), 20, 10 and 5 units for the latent output, with the generator having a mirror architecture. The arrangement of units in the contextual recognizer were: 38 (input), 20, 10 and 2 units for the latent output, with the generator having 4, 7 and 10 units in the output layer. Note that similarly to the previous experiment the aforementioned architecture is likely not optimal and was chosen based on personal experience of the author for illustrative purposes.

5.2.4 Metrics

In this case it is intended to evaluate the models robustness to contextual anomalies and novelties. To do so the following method is applied. A threshold was set so that the number of anomalies detected by the model in the test data set is roughly 1%. Then 10,000 normal observation are randomly selected and transformed by scaling and offsetting a randomly chosen subset of the attributes element-wise where and

resulting in 15 new test data sets. For the A data sets approximately 10% of each group was transformed, for the B data sets about 30% and for the C data sets about 50%. For the D,E and F data sets the same absolute number of attributes was transformed in each of the groups. Given the attributes are standardized to zero mean and unit standard deviation the noise levels applied to the attributes are substantial and result in many anomalies detected as per the summary in table


Data set # behavior trans. # context trans. # anomalies reported
A1 3 0 1240
A2 0 1 7
A3 3 1 576
B1 9 0 4614
B2 0 3 2
B3 9 3 4251
C1 14 0 6949
C2 0 5 10
C3 14 5 5909
Dx 2 0 288
Dc 0 2 0
Ex 5 0 1567
Ec 0 5 13
Fx 10 0 4485
Fc 0 10 17
Table 6: Summary of number of anomalies detected in the noisy data sets.

The results demonstrate the algorithm is robust to anomalies and novelties in the contextual data attributes whilst maintaining sensitivity to anomalies in the behavioral space. It is notable that even when the entire set of contextual attributes is transformed in data set Fc, still less anomalies are reported than data set Dx where only two behavioral attributes are corrupted with noise.

6 Conclusion

In this paper a novel algorithm for contextual anomaly detection is presented and a novel ANN architecture comprised of multiple cross-linked VAEs to model directed graphical distribution models for modeling generative processes. The algorithm performs well in the test scenarios and is robust to contextual anomalies and novelties.

7 Acknowledgements

This research was supported by the Water Corporation of Western Australia. I gratefully acknowledge my colleagues from the Water Corporation for access to infrastructure and for their cooperation, which greatly assisted the research.