1 Introduction
Most anomaly detection and clustering methods perform well on good data. Real-life applications, however, do not follow any guidelines. Actual data is typically incomplete, faulty, unbalanced or contaminated by outliers. Data can have a large number of mixed categorical and numerical features, making it difficult to ascertain their relational importance. There is often a size imbalance between categories, so a machine learning model can become biased towards the larger classes in the data set. Furthermore, class overlap can make it difficult to identify the class to which each data point belongs. In this case, anomaly detection can be challenging since an anomalous data point for one class may be a normal point for another. Last but not least, data can be wrongly labeled, which in turn can lead to severe modeling errors.
Clustering is a classic, well-studied machine learning problem, but it is also heavily data dependent [32]. The success of clustering often comes down to feature engineering or some other preprocessing of the data if it is not already nicely placed in some Euclidean space.
Anomaly detection is one of the most important problems in manufacturing [20], cyber security [28], medical imaging [37], fraud detection and several other fields. At the same time, anomaly detection is a difficult problem since it is heavily dependent on both data quantity and quality. The problem becomes even more complicated in an unsupervised setting such as the one we consider here.
We consider data which is unmarked in terms of containing anomalies or not. As a result, while the class label is known, we do not know which points are anomalous and which are normal. For this case it is important to specify what we mean by an anomaly. In this paper, we consider anomalies to be data points which occur in low probability regions of the dataset. We use dimensionality reduction to find these probabilities and evaluate their degree of anomaly using the EM-MV measure [9]. More importantly, however, our proposed framework is able to work with a broad category of data in an unsupervised manner. It also allows us to use established methods of clustering and anomaly detection for comparison purposes.
We propose to use a variation of the VAE [14] for this purpose. Instead of using the generative aspects of the VAE, we perform our analysis in latent space. This is a reduced dimensionality space which forces the data to be close to some prior distribution. The latent space also retains relational information and makes it possible to observe how data features compare with each other. We propose to name this algorithm Conditional Latent Space Variational Autoencoder, or CLVAE for short.
We apply two different clustering methods in the latent space: $k$-means and expectation maximization (EM). Specifically, the EM algorithm [15] alternates between estimating the expectation of the log-likelihood for each data point and then maximizing that expectation. This method therefore iteratively updates the model parameters and creates a probability for each point belonging to a cluster. The $k$-means method relies on Lloyd’s algorithm [18] and works much like the EM method described above in order to cluster the data. The main difference from the EM method is that we must provide the number of data clusters as an input. Overall, however, the algorithm is simple in its implementation. We outline the $k$-means method below in order to better highlight how clusters are created:
Assume $k$ clusters are to be established from the data.

1. Randomly assign $k$ points as centroids.

2. Assign each of the points in the dataset to the closest of the $k$ centroids.

3. Compute the new centroid of each cluster based on its assigned points from above.

4. Repeat steps 2 and 3 until the centroids do not change or the maximum number of iterations is exceeded.
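The steps above can be sketched in a few lines of plain Python. This is only a minimal illustration of Lloyd's algorithm under stated assumptions, not the implementation used for the experiments: the function name `kmeans`, the `seed` and `init` parameters, and the toy points are all hypothetical, and squared Euclidean distance is assumed.

```python
import random

def kmeans(points, k, max_iter=100, seed=0, init=None):
    """Lloyd's algorithm on a list of equal-length tuples (illustrative sketch)."""
    rng = random.Random(seed)
    # Pick k data points at random as initial centroids, unless some are supplied.
    centroids = list(init) if init is not None else rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assign every point to its closest centroid (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        # Recompute each centroid as the mean of the points assigned to it.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # keep the centroid of an empty cluster
        # Stop as soon as the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels

# Toy data: two well separated groups in the plane.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
centroids, labels = kmeans(points, k=2)
```

The result depends on the random initial centroids, which is why the number of clusters and, in practice, several restarts have to be chosen by the user.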
Finally, we apply the V-score as a widely accepted measure of the success or failure of each clustering method.
We begin in Section 2 with a brief overview of the clustering and anomaly detection metrics which we apply during this study. In Section 3 we present the theory behind VAEs and in particular how exactly we condition our VAE to fit multiple Gaussians and categorize the latent space based on the data. The choice of loss function is critical in forming the clusters. In Section 3.3 we provide the theoretical background on the Kullback-Leibler (KL) divergence used for regularization as well as the reconstruction piece of the loss function. We present the unsupervised case of anomaly detection in Section 4.2 where we also discuss the proposed EM-MV measure to evaluate its performance. We end with a number of remarks and an overview in Section 5.
2 Approach and state of the art
There are a number of approaches to anomaly detection using generative models, many of which use the reconstruction error as a measure of the "degree" of abnormality [1] [21]. In this paper, however, we implement a VAE as a preprocessing algorithm for use with common anomaly detection algorithms. In order to do so effectively, we introduce the CLVAE, which shapes the latent space by discovering which distribution each data point belongs to, making it easier to work with.
Using a Gaussian Mixture Model (GMM) as the prior for training a VAE is not a new idea [30, 4]. However, in the case of GMVAE [4], there is no conditioning on a predefined label. Rather, it tries to estimate the prior distribution with Monte Carlo methods. A similar algorithm was cleverly described by J. Su [12], but it includes a classifier that we do not need here since we know which class to assign to which Gaussian. Our approach is also related to the DEC algorithm by J. Xie et al. [36] and the VaDE algorithm by Z. Jiang et al. [13], which have a built-in unsupervised cluster assignment, something we do not attempt here. Our algorithm differs from the CVAE (Conditional Variational Autoencoder) in [31, 5] because we condition the latent space itself on the class label by selecting the appropriate Gaussian, rather than assuming that all classes can be mapped to the same Gaussian. The CVAE instead conditions the encoder and decoder, keeping the one-Gaussian assumption in the latent space, which we avoid.
Two different datasets are used in this paper in order to illustrate different properties of the proposed methodology. First, the standard MNIST dataset is used to present clustering and anomaly detection capabilities in order to compare with other publications and results on that dataset. The second dataset consists of actual trades made by traders at a Swedish investment bank. This dataset is typical of the many problems a real dataset can have: unbalanced classes, mislabeling, missing data etc. It was also used in the thesis [22] and is included to diversify the discussion on anomaly detection as well as to illustrate a real application of the proposed approach.
We first train and produce latent spaces from the two different autoencoders: VAE and CLVAE. We then compare how well we can perform clustering and anomaly detection on each. For the sake of visualization we have chosen to present the results in 2 latent dimensions. We note, however, that performance could increase with more latent dimensions, as the latent space would then include more information. Arguably two of the most standard clustering algorithms, which we also use in this paper, are the $k$-means [18] and EM [15] algorithms. Similarly, for anomaly detection the most frequently used algorithms are Isolation Forest [17], LOF [2] and OCSVM [27].
2.1 Detection methodology
A typical issue for most clustering algorithms is how to evaluate their results. While clustering itself is an unsupervised task, in some cases we can check the clusters that form against some ground truth. In this case, that would be the class label.
The V-score or validity score is one such metric capable of measuring this kind of clustering result. It is an entropy (12) based approach and works by measuring two interdependent characteristics of clusters: completeness and homogeneity.
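Homogeneity measures whether every member of a given cluster belongs to a single class. For a set of classes $C$ and a set of clusters $K$ this is done as follows,
$$h = 1 - \frac{H(C|K)}{H(C)}.$$
Note that the case with conditional entropy $H(C|K) = 0$ denotes a totally homogeneous clustering.
Completeness on the other hand measures whether all known members of a class are assigned to the same cluster. We define completeness via,
$$c = 1 - \frac{H(K|C)}{H(K)}.$$
Note that $0 \le h \le 1$ and $0 \le c \le 1$. In practice the conditional entropies are normalized by $H(C)$ and $H(K)$, as in the expressions above, in order to remove any class size dependencies.
The validation score or V-score is defined to be the weighted mean between $h$ and $c$,
$$V_\beta = \frac{(1 + \beta)\, h\, c}{\beta h + c},$$
where $\beta$ is a weight favoring completeness if $\beta > 1$ or homogeneity if $\beta < 1$.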
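To make the two entropy-based quantities concrete, the following sketch computes homogeneity, completeness and the V-score from two label sequences. It assumes the standard V-measure definitions (conditional entropies normalized by the class and cluster entropies, combined by a weighted harmonic mean); the function names and toy labels are hypothetical.

```python
from collections import Counter
from math import log

def entropy(labels):
    """Shannon entropy H(labels) of a label sequence."""
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())

def conditional_entropy(a, b):
    """Conditional entropy H(a | b) of two equally long label sequences."""
    n = len(a)
    joint = Counter(zip(a, b))
    marg_b = Counter(b)
    return -sum(c / n * log(c / marg_b[y]) for (_, y), c in joint.items())

def v_score(classes, clusters, beta=1.0):
    """Return (homogeneity, completeness, V-score); beta > 1 favors completeness."""
    h_c, h_k = entropy(classes), entropy(clusters)
    homogeneity = 1.0 if h_c == 0 else 1.0 - conditional_entropy(classes, clusters) / h_c
    completeness = 1.0 if h_k == 0 else 1.0 - conditional_entropy(clusters, classes) / h_k
    if homogeneity + completeness == 0:
        return homogeneity, completeness, 0.0
    v = (1 + beta) * homogeneity * completeness / (beta * homogeneity + completeness)
    return homogeneity, completeness, v

# Toy example: cluster ids are arbitrary, only the grouping matters.
classes = [0, 0, 1, 1, 2, 2]
clusters = [1, 1, 0, 0, 0, 2]
h, c, v = v_score(classes, clusters)
```

Because only the grouping enters the entropies, relabeling the clusters leaves all three values unchanged.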
Readers might recognize how the completeness and homogeneity defined above encompass metrics such as inertia and the Dunn index [7], which have been used in the past to measure intra-cluster and inter-cluster distances.
2.2 Anomaly detection
Confirming anomalies which are not already labeled is not a trivial task. In this unsupervised setting we cannot use Precision-Recall or ROC curves simply because there is no way to check our results against a ground truth. The excess mass-mass volume (EM-MV) method can therefore be a useful tool for indicating anomalies, as it has been found to agree with its supervised counterparts [9].
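We assume that anomalies occur on the tail ends of a probability distribution, so our goal is to estimate the density level curves of that distribution. The method works on level sets, for which a function $f$ equals a given constant $c$, $\{x : f(x) = c\}$. In our case the function $f$ is the probability density which we estimated by our CLVAE in latent space. The degree of abnormality or anomaly for the EM-MV method is given by a scoring function $s$ for data in $\mathbb{R}^d$. The method relies on the mass volume (MV) and excess mass (EM) curves of $s$ as follows,
$$MV_s(\alpha) = \inf_{u \ge 0} \mathrm{Leb}(s \ge u) \quad \text{s.t.} \quad P(s(X) \ge u) \ge \alpha, \tag{1}$$
$$EM_s(t) = \sup_{u \ge 0} \left\{ P(s(X) \ge u) - t\, \mathrm{Leb}(s \ge u) \right\}, \tag{2}$$
where $\alpha \in (0, 1)$ and $t > 0$. We can now evaluate our method by computing the distance between the level sets of $s$ and $f$ for excess mass and mass volume as follows: $C^{EM}(s) = \|EM_s - EM_f\|_{L^1(I^{EM})}$ and $C^{MV}(s) = \|MV_s - MV_f\|_{L^1(I^{MV})}$ [9]. Practically we let $I^{MV} = [0.9, 0.999]$ and $I^{EM} = [0, EM_s^{-1}(0.9)]$, where $EM_s^{-1}(\mu) = \inf\{t \ge 0 : EM_s(t) \le \mu\}$, as shown in [9]. One technical issue involves the computation of the Lebesgue terms above, which can be resolved via Monte Carlo estimation.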
The measure is then based on the area under the EM and MV curves. This area should be maximized for EM and minimized for MV.
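As an illustration of how such curves might be estimated, the sketch below approximates the volume of each level set by uniform Monte Carlo sampling in a bounding box and the mass by the empirical fraction of data points. Everything here is an assumption made for illustration: the function name `em_mv_curves`, the bounding box, the restriction of thresholds to observed scores, and the toy Gaussian data and scoring function.

```python
import math
import random

def em_mv_curves(score, data, low, high, ts, alphas, n_uniform=4000, seed=0):
    """Monte Carlo estimates of the EM and MV curves of a scoring function.

    score     : maps a point (list of floats) to a non-negative score
    data      : observed points
    low, high : corners of a bounding box used for volume estimation
    """
    rng = random.Random(seed)
    volume = math.prod(b - a for a, b in zip(low, high))
    uniform_scores = [
        score([rng.uniform(a, b) for a, b in zip(low, high)]) for _ in range(n_uniform)
    ]
    data_scores = [score(x) for x in data]

    levels = sorted(set(data_scores))  # candidate thresholds u
    # Empirical mass and estimated volume of each level set {s >= u}.
    mass = [sum(s >= u for s in data_scores) / len(data_scores) for u in levels]
    leb = [volume * sum(s >= u for s in uniform_scores) / n_uniform for u in levels]

    # Excess mass: best mass minus t times volume over all levels (0 for the empty set).
    em = [max(0.0, max(m - t * v for m, v in zip(mass, leb))) for t in ts]
    # Mass volume: smallest volume among level sets holding at least mass alpha.
    mv = [min(v for m, v in zip(mass, leb) if m >= alpha) for alpha in alphas]
    return em, mv

# Toy example: 2-d Gaussian data scored by an unnormalized Gaussian density.
rng = random.Random(1)
data = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(300)]
score = lambda x: math.exp(-0.5 * (x[0] ** 2 + x[1] ** 2))
ts = [0.0, 0.01, 0.05, 0.1]
alphas = [0.9, 0.95, 0.99]
em, mv = em_mv_curves(score, data, [-4.0, -4.0], [4.0, 4.0], ts, alphas)
```

The areas under the two estimated curves could then be approximated with, for example, a trapezoidal rule over the chosen intervals.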
3 Creating Generative Models
As mentioned earlier, Variational Autoencoders (VAEs) are generative models introduced in 2014 by Kingma and Welling [14]. The purpose of such a model is density estimation: a VAE can estimate the underlying distribution of a dataset and generate new samples from it.
The VAE consists of an encoder-decoder pair of neural networks, with a stochastic latent layer in the middle as depicted in Figure 1. The encoder, which we express here with $q_\phi(z|x)$, processes the input $x$ and produces the parameters of a Gaussian distribution, represented by $\mu$ and $\sigma$, in the latent layer as shown in Figure 1. The decoder, which we denote by $p_\theta(x|z)$, uses that Gaussian distribution in latent space as input to produce the parameters of an approximation of the probability distribution describing the original data. The weights and biases of the encoder and decoder networks are represented by $\phi$ and $\theta$ respectively.
VAEs allow us to effectively compress information within the data through dimensionality reduction occurring in the latent space. Since the decoder goes from a smaller to a larger space, information is lost. We measure this loss by the reconstruction loss, which is based on the log-likelihood of $p_\theta(x|z)$ and gives us a loss function to minimize in training. It also shows how effective the encoder was in compressing information from the original data $x$ into $z$. Furthermore, VAEs can be used with categorical data as well as nonlinear transformations, in contrast to PCA-type methods [23].
Note that the VAE architecture is very similar to that of an autoencoder, shown in Figure 10 in the Appendix, except for the latent space, which is sampled by estimating a mean $\mu$ and a variance $\sigma^2$ for each input. This is done in the stochastic layer, where each latent point $z$ is drawn from the Gaussian with that mean and variance. We also note that every point in the latent space of a VAE is forced to have feasible features since it is assigned a variance. A prime advantage of this approach is that VAEs, unlike regular autoencoders, map values to the latent space while retaining relational information between them, so that they remain meaningful for analysis purposes [3].
Another type of generative model is the GAN [10], which solves the same problem with a game-theoretic approach. The VAE on the other hand is derived from a purely statistical point of view, and the fact that it can be used for generation is almost just a bonus [35]. Because of its statistical interpretability and ease of manipulation, we chose to use a VAE for this task.
3.1 The optimization function
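The choice of loss function in any optimization or neural network problem is critical and has direct implications for the success or failure of the given problem. Creating a proper loss function, however, requires knowledge of the posterior $p_\theta(z|x)$ in order to compute the likelihood $p_\theta(x)$, which is intractable (see Appendix B). To counter this problem we define a lower bound of the likelihood and optimize that instead.
Assuming that $p_\theta(z|x)$ can be estimated using the distribution $q_\phi(z|x)$, we work on estimating the logarithm of $p_\theta(x)$ (see also (9) in Appendix B),
$$\log p_\theta(x) = \mathbb{E}_{z \sim q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - \mathbb{E}_{z \sim q_\phi(z|x)}\left[\log \frac{q_\phi(z|x)}{p(z)}\right] + \mathbb{E}_{z \sim q_\phi(z|x)}\left[\log \frac{q_\phi(z|x)}{p_\theta(z|x)}\right].$$
This, however, is a typical application for the Kullback-Leibler (KL) divergence (see Appendix C or [33]). The Kullback-Leibler divergence is a semimetric [33] and can provide an estimate of the distance between two measures $P$ and $Q$ with densities $p$ and $q$,
$$D_{KL}(P \,\|\, Q) = \int p(x) \log \frac{p(x)}{q(x)}\, dx.$$
Using the KL divergence we can therefore express $\log p_\theta(x)$ as,
$$\log p_\theta(x) = \mathbb{E}_{z \sim q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - D_{KL}\left(q_\phi(z|x) \,\|\, p(z)\right) + D_{KL}\left(q_\phi(z|x) \,\|\, p_\theta(z|x)\right). \tag{3}$$
The first term above can be computed directly from the decoder by sampling with the reparameterization trick [14], which will be discussed later. The second term is a KL divergence between a general Gaussian and a standard normal Gaussian, which has a closed form solution based on Property 2 of Appendix C. Finally, the problematic last term can be shown to be non-negative [34] through Property 1 in Appendix C. Therefore we can define the lower bound of (3) as,
$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{z \sim q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - D_{KL}\left(q_\phi(z|x) \,\|\, p(z)\right). \tag{4}$$
Thus our task is to maximize the above, or equivalently to minimize its negative, with respect to the parameters $\theta$ and $\phi$,
$$(\theta^*, \phi^*) = \arg\min_{\theta, \phi} \; -\mathcal{L}(\theta, \phi; x). \tag{5}$$
The only remaining issue is to resolve the expected value in (4). This can be accomplished with a reparameterization idea from [14], provided below for completeness.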
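For reference, the closed form of the KL term between a general Gaussian and a standard normal Gaussian can be written out explicitly. The expression below is the well-known result for a diagonal Gaussian; the symbols $\mu_j$, $\sigma_j$ and the latent dimension $d$ are notation assumed here for illustration.

```latex
D_{KL}\left( N(\mu, \mathrm{diag}(\sigma^2)) \,\|\, N(0, I) \right)
  = \frac{1}{2} \sum_{j=1}^{d} \left( \mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2 \right)
```

As a quick check, the divergence vanishes for $\mu = 0$ and $\sigma = 1$, and a unit shift of the mean in a single dimension with $\sigma = 1$ gives a value of $1/2$.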
3.2 The change of variables idea
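Functions such as (4) are not uncommon in optimization problems [11]. The novel idea introduced in [14], which removes the difficulties of the expected value in (4), is to introduce a suitable change of variables.
We express $z$ through a deterministic transformation $g_\phi$ and a random variable $\epsilon$ as $z = g_\phi(\epsilon, x)$, where $\epsilon \sim p(\epsilon)$ is a simple distribution which does not depend on $\phi$ or $\theta$. In our case we choose $\epsilon \sim N(0, I)$ since we want the latent space to be Gaussian. This reduces to,
$$z = \mu + \sigma \odot \epsilon.$$
We now define an auxiliary function $f$ over the distribution $q_\phi(z|x)$ and write (4) as an expectation of $f$. If we then substitute $z = g_\phi(\epsilon, x)$ and compute the gradient with respect to $\phi$ we get,
$$\nabla_\phi\, \mathbb{E}_{z \sim q_\phi(z|x)}\left[f(z)\right] = \nabla_\phi\, \mathbb{E}_{\epsilon \sim p(\epsilon)}\left[f(g_\phi(\epsilon, x))\right] = \mathbb{E}_{\epsilon \sim p(\epsilon)}\left[\nabla_\phi f(g_\phi(\epsilon, x))\right].$$
The expectation and the gradient commute, which allows us to optimize the above in practice using backpropagation.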
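A small numerical experiment may help show why the change of variables is useful. The sketch below assumes a one-dimensional latent variable written as a mean plus a standard deviation times standard normal noise, and a toy function $f(z) = z^2$ chosen only because the exact gradient of its expectation with respect to the mean is known to be twice the mean. The function names are hypothetical.

```python
import random

def reparam_sample(mu, sigma, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), independent of mu and sigma."""
    eps = rng.gauss(0.0, 1.0)
    return mu + sigma * eps, eps

def grad_mu_estimate(mu, sigma, n=20000, seed=0):
    """Monte Carlo estimate of d/dmu E[f(z)] for the toy choice f(z) = z**2."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        z, _eps = reparam_sample(mu, sigma, rng)
        # Chain rule through the deterministic map: df/dz * dz/dmu, with dz/dmu = 1.
        total += 2.0 * z
    return total / n

estimate = grad_mu_estimate(mu=1.5, sigma=0.5)
exact = 2.0 * 1.5  # since E[z**2] = mu**2 + sigma**2
```

The randomness enters only through the noise variable, so the gradient can be moved inside the expectation and averaged over samples, which is what backpropagation through a sampled latent layer relies on.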
3.3 The Conditional Variational Autoencoder
VAEs are designed to estimate the unknown probability distribution for the given using a Gaussian in latent space. Instead we propose to use information which is already available in the data in order to more accurately express the dataset through several Gaussians. We specify the Gaussians in latent space by conditioning on any one of the labels in the data set. For the MNIST data set of images for instance we condition on the number label associated with each image. Therefore for categories in that label in the data,
(6) 
where we chose
to be a weighted uniform distribution since the numbers in the MNIST data are discrete constants and
is a set of Gaussians each with unknown mean and variance. Finally to understand we apply Bayes rule,(7) 
where can be estimated from the encoder . Note furthermore that is known and gives . We can therefore write .
Following arguments similar to constructing the loss function for VAE we now produce the following loss function for our CLVAE,
(8) 
Since is discrete the KL divergence above can be written as,
To produce a closed form solution for in the nonnormal Gaussian case we follow ideas from [12] and let be some Gaussian distribution that is conditioned on ,
We now let be Gaussian with mean and variance 1,
and use in simplifying,
We apply the same reparameterization idea as before with where which simplifies the expression above to,
Computation of now reduces to a simple matrix operation.
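To illustrate the idea of selecting a Gaussian by class label, the sketch below evaluates the KL divergence between a diagonal Gaussian produced by an encoder and the unit-variance Gaussian assigned to a given label. It covers only this Gaussian piece and is not the full CLVAE loss; the function name, the dictionary of class means and the numbers are hypothetical.

```python
from math import log

def kl_to_class_gaussian(mu, sigma, label, class_means):
    """KL( N(mu, diag(sigma^2)) || N(class_means[label], I) ) in closed form."""
    target = class_means[label]  # select the Gaussian that matches the label
    return 0.5 * sum(
        (m - c) ** 2 + s * s - 1.0 - log(s * s)
        for m, s, c in zip(mu, sigma, target)
    )

# Hypothetical prior means for two classes in a 2-d latent space.
class_means = {0: [0.0, 0.0], 1: [3.0, 0.0]}

# An encoding that sits exactly on the prior of class 1.
mu, sigma = [3.0, 0.0], [1.0, 1.0]
kl_own_class = kl_to_class_gaussian(mu, sigma, 1, class_means)
kl_other_class = kl_to_class_gaussian(mu, sigma, 0, class_means)
```

With unit variances the penalty grows with the squared distance between the encoded mean and the mean of the selected class, so encodings are pulled towards the Gaussian of their own label rather than towards a single shared prior.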
4 Clustering and Anomaly Detection Results
Training a regular VAE on the MNIST dataset, we obtain a latent space representation where the classes accumulate around a single Gaussian. The model is trained by minimizing the loss function (4) with a single standard normal prior. The result is clear in the top of Figure 2, as the points all expand outward from the origin, where the standard normal prior has its mean. As can be seen in that figure, there is some structure and class separation, but the clusters have inconsistent shapes and overlap heavily. Class overlap is natural since handwritten numbers can easily resemble each other and can be hard to tell apart for humans as well. However, class overlap will remain a problem when using the VAE, because the classes are all trying to adapt their form to a single Gaussian prior.
We also train the CLVAE on the same data set and present the results in the middle of Figure 2. It is immediately evident, at least visually, that the class separation has now improved compared with that from a regular VAE. The class overlap is still visible since numbers due to their inherent shape resemble each other in the MNIST data. In that respect it is not surprising to see that the distribution for numbers and are close to each other and the clusters forming 4 and 9 are completely overlapping. We also see that and seem to be completely overlapping and 3, 5 and 7 are partially overlapping. This is not going to be helpful for clustering.
Finally, we also show a weight adjusted CLVAE in the bottom of Figure 2. Here the reconstruction error is weighted down s.t. and . This means that more focus is put on the KLdivergence term in training. Visually, we can see that and no longer overlap and no other class is completely overlapping with any other.
4.1 Clustering
We now present and analyze a number of clustering results comparing our proposed CLVAE and a regular VAE. During our analysis we use two different clustering methods: $k$-means and expectation maximization (EM). A visual inspection of the results is always helpful, although we also measure the V-score in order to better assess the success of each algorithm.
Based on the latent space representations from VAE and CLVAE we compute their respective V-scores in order to find the optimal number of clusters for each. For the VAE, the V-score is low for both the EM and $k$-means clustering algorithms, as can be seen in Table 1. Another problem is that it reaches its maximum V-score using 14 clusters, more than the 10 digit classes in the dataset. This is not that strange, however, since the clusters based on the VAE latent space shown in Figure 2 are not clearly separated.
Algorithm               EM V-score  EM clusters  k-means V-score  k-means clusters
VAE                     0.6313      14           0.5572           14
CLVAE                   0.8132      7            0.8088           7
Weight adjusted CLVAE   0.9248      10           0.9233           10
For the CLVAE we get a higher V-score for both the EM and $k$-means algorithms, but now the clusters are too few. Inspecting Figure 3 and comparing it with Figure 2, we see that the numbers 4 and 9 ended up in the same cluster. The same is true for the numbers 3, 5 and 7.
Finally, considering the weight adjusted CLVAE, we get 10 unique clusters and, again, a much higher V-score for both EM and $k$-means. With this setup the numbers can therefore be clustered quite successfully. As a result we will use the weight adjusted CLVAE to identify anomalies in the next section.
4.2 Anomaly detection
In this section we present and compare unsupervised anomaly detection capabilities based on the latent space of the proposed CLVAE versus that of the regular VAE. We perform these comparisons in the context of three of the most widely used methods for anomaly detection. We also note that neither of our data sets includes information or labels as to whether a given data point is an anomaly or not.
Some of the algorithms applied here to identify anomalies include hyperparameters which must be tuned. We have therefore undertaken this task and in all results presented we have already established the best set of such hyperparameters for those methods.
[Table 2: EM and MV scores for each latent space (LS) and anomaly detection algorithm.]
Recalling that we should maximize the EM score and minimize the MV score, the results presented in Table 2 clearly indicate that the VAE produces better EM-MV scores than the CLVAE. This is rather unexpected and is discussed further below. It is also clear that the OCSVM provides the best EM-MV scores for both types of latent spaces. Investigating a little further, we find that the OCSVM classifies more than half of the test data as anomalous, which renders it of little practical use here.
Ruling out the OCSVM, the Isolation Forest (IF) algorithm performs best on both latent spaces. Closer investigation of the anomaly classification mechanism, however, reveals that the IF algorithm classifies the center of the prior distribution for the VAE as normal and the edges of the distribution as anomalous. This means that, while the number of elements in each class is approximately the same, the numbers 0, 1 and 7 are overrepresented among the anomalies while the numbers 6, 9, 2, 8 and 5 are underrepresented. Looking at the top of Figure 2 again, the reason for this becomes clear: there is plenty of class overlap in the center of the distribution for the VAE, where 6, 9, 2, 8 and 5 are all placed. At the same time, 0, 1 and 7 occupy the fringes of the distribution, leading to their overrepresentation among the anomalies.
This behavior is not as prevalent in the LOF algorithm, which actually finds some anomalies between clusters, leading to a more well-balanced distribution of anomalies among the different classes. So even though the LOF performed the worst on the EM-MV score, it seems to be the most reasonable classifier of anomalies in this case.
Now considering the weight adjusted CLVAE, we find that, again disregarding the OCSVM, Isolation Forest gets the best EM-MV scores. However, the same over- and underrepresentation problems apply here. It seems to weigh the regions with slightly overlapping classes as ’more normal’, while classes that are clearly distinct are considered ’more anomalous’. For instance, it finds 144 anomalies for the number 6 but only 14 anomalies for the number 8. Once again, however, LOF manages to provide a much more even distribution of anomalies.
Overall, we see that the VAE in general produces better EM-MV scores than the CLVAE. This could be because the VAE latent space has a few points that end up far out on the tail of the distribution, so the algorithms have an easier time distinguishing normal points from anomalous ones. That being said, these points mostly belong to a few classes, leading to a more unbalanced set of anomalies, which is clearly wrong. That is where the CLVAE has an advantage. To visualize this, we have plotted the percentages of anomalies for each class in Figure 4. What is clear from this plot is that both autoencoders overestimate and underestimate certain classes. Overall, however, the CLVAE gets a lower RMSE from the actual distribution of the dataset, at 0.0326 compared with 0.0408 for the VAE.
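One plausible way to compute such an RMSE is to compare the share of each class among the detected anomalies with the share of that class in the whole dataset. The sketch below follows that reading, which is an assumption about how the comparison was made; the function names and the toy labels are hypothetical, and a non-empty list of anomalies is assumed.

```python
from collections import Counter
from math import sqrt

def class_share(labels, classes):
    """Fraction of the given labels that fall in each class."""
    counts = Counter(labels)
    return [counts[c] / len(labels) for c in classes]

def anomaly_balance_rmse(all_labels, anomaly_labels):
    """RMSE between the class shares among anomalies and in the full dataset."""
    classes = sorted(set(all_labels))
    data_share = class_share(all_labels, classes)
    anomaly_share = class_share(anomaly_labels, classes)
    squared = [(a - d) ** 2 for a, d in zip(anomaly_share, data_share)]
    return sqrt(sum(squared) / len(classes))

# Toy example: a balanced two-class dataset where only class 0 is flagged.
all_labels = [0, 0, 1, 1]
rmse_unbalanced = anomaly_balance_rmse(all_labels, [0])
rmse_balanced = anomaly_balance_rmse(all_labels, [0, 1])
```

A value of zero means the anomalies are spread over the classes exactly as the data is, so lower values indicate a more balanced set of anomalies.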
To summarize our findings in this section, the VAE separates the data in a more extreme way than the CLVAE while at the same time overestimating anomalies for some classes at the edges of its single-blob type latent space. This higher degree of separation seems to be why the VAE achieves better EM-MV scores. The CLVAE produces a more balanced set of anomalies, which stem from all of the classes in the data. Considering Figure 5, its most extreme anomalies seem more difficult to categorize than those anomalies identified by the VAE.
4.3 Anomaly Detection in Trading Data
The dataset of trades from [22] contains about 100 000 rows and has columns describing what type of trade occurred. These include margin, nominal value, currency, type of instrument, counterparty, portfolio and, importantly, trader name. By conditioning on which trader made which trade, we can try to create clusters similar to those in Figure 2. Instead of variations of handwritten numbers, the clusters describe the trading behavior of each trader in the dataset. Two overlapping traders, for instance, suggest that they traded in a similar fashion, meaning we have found a trading category. For more details related to the data we refer to the thesis work in [22].
[Table 3: EM and MV scores for each latent space (LS) and anomaly detection algorithm on the trading dataset.]
As can be seen in Table 3, on this dataset the CLVAE latent space outperforms the VAE latent space on the EM-MV measure for all algorithms tried. One big difference here is that the clusters formed with the CLVAE are placed much further apart than the clusters in MNIST, as can be seen in Figure 6. This is likely contributing to the better EM-MV scores, as it seems easier for the anomaly detection algorithms to contrast normal from anomalous points in such a latent space.
4.4 Using Misclassification for Anomaly Detection
We now explore an alternative approach to anomaly detection. If a cluster is found in latent space where one most frequent class can be identified, then the points inside this cluster which do not belong to the most frequent class can be considered anomalies. Identifying these points can, however, be tricky. These are points that in some sense behave more like the predicted class than their own class in the conditioned space, meaning that they should be classified as anomalies.
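The rule described in this paragraph can be sketched as follows: find the most frequent class in each cluster and flag every point in that cluster whose label differs from it. The function name and the toy cluster assignments are hypothetical, and ties between classes are broken arbitrarily in this sketch.

```python
from collections import Counter, defaultdict

def misclassified_points(cluster_ids, class_labels):
    """Indices of points whose class differs from the majority class of their cluster."""
    by_cluster = defaultdict(list)
    for idx, (k, c) in enumerate(zip(cluster_ids, class_labels)):
        by_cluster[k].append((idx, c))
    anomalies = []
    for members in by_cluster.values():
        # Most frequent class label inside this cluster.
        majority, _count = Counter(c for _, c in members).most_common(1)[0]
        anomalies.extend(idx for idx, c in members if c != majority)
    return sorted(anomalies)

# Toy example: one handwritten "9" lands in the cluster dominated by "4".
cluster_ids = [0, 0, 0, 1, 1, 1]
class_labels = ["4", "4", "9", "9", "9", "9"]
flagged = misclassified_points(cluster_ids, class_labels)
```

This only makes sense when each cluster has a clearly dominant class, which is why a high V-score of the underlying clustering matters for this approach.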
We will employ the V-score which is related to the Precision-Recall curves and the F1-score described in Section 4.1 to assist us in scoring and detecting these anomalies. As we showed in Table 1, the CLVAE has a much better V-score than the VAE. In the case of the MNIST data this means that numbers which are assigned to a cluster will be more likely to have the same label (homogeneity) and that all of the members of a class are more likely to be assigned to the same cluster (completeness).
We provide a representative sample of anomaly misclassification based on both the VAE and CLVAE latent spaces in Figure 7. As can be seen in that figure, some of the handwritten numbers which were classified as ’anomalies’ using the VAE latent space are rather easy to identify. Some would probably be considered anomalies, but certainly not most. This is not the case for the CLVAE, where many of the found anomalies are in fact unintelligible.
One of the key points of the CLVAE is that points that end up on the tails of their respective prior distributions should be ’more anomalous’. There is no such guarantee with the VAE, because it only has one prior distribution. Giving each class its own prior distribution should instead mean that the classes do not have to compete for placement in the latent space. To illustrate this further, we coloured the latent spaces according to the average deviation of each point from the respective mean of its class in Figure 8.
What this means is that points that are closer to the mean of each class also end up closer to their respective means in the latent space of the CLVAE. If we compare with the VAE, we see that the middle of the prior distribution actually tends to have a higher error than some pockets on the tail of the distribution. Ultimately, this means that the latent space of the VAE cannot be interpreted as samples from a standard normal distribution, because there is no clear structure to where the points should end up based on their ’degree of abnormality’. This is in contrast with the CLVAE, where the points can be interpreted as being sampled from a GMM prior. Consequently, points that either end up on the tail of their respective prior or end up in the wrong cluster can in fact be called anomalies.
5 Discussion
In this article we explore classification of labeled data by conditioning in latent space. Conditioning allows us to use information within the data to improve clustering. We subsequently use these clusters to identify anomalies in the data. To showcase our findings we used the MNIST data set as well as actual trades from traders in a Swedish bank.
Our overall strategy for classification and anomaly detection involves a number of methodologies. We begin with automatic formation of clusters in latent space. These clusters are found and shaped according to optimal recommendations from our CLVAE. In that respect we fit different Gaussians around each cluster via the EM clustering method. This improved clustering description allows us to apply a number of methods for detecting anomalies. Anomaly detection for us relies on detecting outliers in the latent space of the autoencoder. We compare among the $k$-means, the percentile, one-class SVM and LOF methods in order to identify the outliers for each of those clusters. Finally we apply multiple isolation forests (one isolation forest per cluster) in order to single out these anomalies.
We also note that neither of the datasets used was labeled in terms of whether a given point is an anomaly or not. This led us to propose methodologies which, while unsupervised, can indicate whether a given trade is typical or anomalous. The proposed EM-MV measure is able to detect anomaly candidates and the one-class SVM can indicate their severity. Furthermore, the proposed methodology works with data of a categorical (non-continuous) nature and finds meaningful latent representations, both of which are not possible for classic PCA methods [23].
The proposed CLVAE makes the latent space easier to understand while clustering data and performing anomaly detection. The question of how much of the proposed methodology is truly unsupervised should also be addressed. Clearly we are using labeling in the data to help us in the initial classification, so this is not an entirely unsupervised method. The data is forced into Gaussians based on that information. For the MNIST data, for instance, that corresponds to 10 clusters. However, the proposed methodology is not completely supervised either, since after the conditioning is performed the method performs all actions in an unsupervised way based on a newly formed latent space. We do not supervise the algorithm in terms of how to find the anomalies after that. So this is still a form of unsupervised learning [8, 3].
We have shown that a weight adjusted CLVAE can be a successful preprocessing procedure for clustering and eventual anomaly detection. It strongly outperformed the VAE on the clustering task, for instance. While the VAE gets good EM-MV scores on the MNIST dataset, it does so while overestimating some classes and underestimating others. This problem is reduced when using the CLVAE. Also, the anomalies that get the worst possible scores with the CLVAE do look much more unintelligible than the ones found for the VAE. Meanwhile, on the trading dataset, the CLVAE obtained better EM-MV scores than the VAE.
The proposed CLVAE attempts to make the latent space more understandable and suitable for analysis with established methods. It seems to succeed in this regard, as it both divides the latent space up in accordance with class labels and ensures that points that are improbable do in fact end up on the tail of their respective prior distributions or in other clusters.
It would be interesting to study the generative aspects of the CLVAE as well in the future. In particular, it would be interesting to compare it with a CVAE, as they both have the functionality of generating samples from a given class, something that the VAE does not.
Appendix A. Background on Autoencoders
An autoencoder is a neural network capable of performing unsupervised dimensionality reduction. As a result it is able to discover a lower-dimensional representation of a higher-dimensional data space.
An autoencoder is typically constructed of two neural networks: an encoder and a decoder. An idealized autoencoder representation is given in Figure 9 where we reduce a 3-dimensional input to a 2-dimensional latent space.
In general autoencoder applications both the encoder and the decoder networks are densely connected.
Definition 1
Autoencoder. Given an input $x \in \mathbb{R}^n$ we assume that there is a mapping $f$ to $z \in \mathbb{R}^m$ s.t. $f: x \mapsto z$. This mapping is called the encoder and can be defined with an activation function $\sigma$, a weight matrix $W$ and a bias term $b$ as,
$$z = f(x) = \sigma(Wx + b).$$
Conversely, we assume that there is a mapping $g$ s.t. $g: z \mapsto x'$ which is called the decoder and can be defined in a similar way to the encoder with the corresponding terms $\sigma'$, $W'$ and $b'$ as,
$$x' = g(z) = \sigma'(W'z + b').$$
According to this definition therefore the autoencoder’s job is to estimate a nonlinear transformation from $x$ to $z$ and its inverse from $z$ to $x$. In order for this autoencoder to compute these transformations it minimizes the reconstruction error of $x$,
$$L(x, x') = \|x - x'\|^2 = \|x - \sigma'(W'\sigma(Wx + b) + b')\|^2.$$
In general however autoencoders produce latent spaces which are not useful for analysis [3]. This is one of the main reasons that variational autoencoders have gained such popularity.
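The encoder and decoder maps of Definition 1 can be sketched in a few lines of NumPy. This is only a minimal illustration under stated assumptions: the sigmoid activation, the random untrained weights, and the names `W_enc`, `b_enc`, `W_dec`, `b_dec` are all chosen here for the example and are not taken from the paper. The dimensions follow the 3-to-2 reduction of Figure 9.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    """Elementwise logistic activation (an assumed choice of activation)."""
    return 1.0 / (1.0 + np.exp(-a))

n, m = 3, 2  # input and latent dimensions, as in the 3-to-2 example

# Randomly initialised, untrained parameters (illustrative names).
W_enc, b_enc = rng.normal(size=(m, n)), np.zeros(m)
W_dec, b_dec = rng.normal(size=(n, m)), np.zeros(n)

def encode(x):
    """Encoder: affine map followed by the activation, R^n -> R^m."""
    return sigmoid(W_enc @ x + b_enc)

def decode(z):
    """Decoder: affine map followed by the activation, R^m -> R^n."""
    return sigmoid(W_dec @ z + b_dec)

def reconstruction_error(x):
    """Squared Euclidean distance between x and its reconstruction."""
    x_hat = decode(encode(x))
    return float(np.sum((x - x_hat) ** 2))

x = rng.uniform(size=n)
z = encode(x)
print("latent code:", z)
print("reconstruction error:", reconstruction_error(x))
```

Training would adjust the four parameter arrays by gradient descent on `reconstruction_error` averaged over a data set; that step is omitted here to keep the sketch short.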
Appendix B. Variational autoencoders and Bayes rule
Variational autoencoders (VAEs) are able to discover the distributions responsible for the provided data. A VAE therefore solves the problem of probability density estimation and is a true generative model. This practically means that you can generate new samples from an unknown distribution [16]. Applications in image processing for example use VAEs to generate new images which retain some of the main features of the original data set [35].
If we have some set of locally observed variables $x$ and we assume that they follow some unknown stochastic process that we want to sample from, we can use some prior $p(z)$ that we assume to be Gaussian. By then taking the expectation of the conditional distribution of $x$ given $z$, under $p(z)$, we get the distribution for $x$ from,
$$p(x) = \int p(x|z)\, p(z)\, dz. \qquad (9)$$
Note however that the integral above is intractable [14]. This is where Bayes' rule can help.
We assume that the density functions $p(x)$ and $p(y)$ for stochastic variables $x$ and $y$ are known. If the conditional density function $p(x|y)$ is given then the conditional density function $p(y|x)$ can be computed from,
$$p(y|x) = \frac{p(x|y)\, p(y)}{p(x)}. \qquad (10)$$
Bayes' rule will be useful in terms of computing the posterior $p(z|x)$,
$$p(z|x) = \frac{p(x|z)\, p(z)}{p(x)}. \qquad (11)$$
However $p(x)$ above is not possible to compute in general. Instead we estimate the posterior distribution $p(z|x)$ using some other distribution $q_\phi(z|x)$ where $\phi$ contains estimates of the model parameters.
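To make the roles of the marginal and of Bayes' rule concrete, the sketch below uses a one-dimensional toy model in which everything is available in closed form. The model is an assumption made purely for illustration and is not the model of the paper: the prior is a standard normal, and the conditional of the observation given the latent variable is a normal centred on the latent variable with variance `noise_var`. For such a model the marginal is a normal with variance `1 + noise_var`, so a Monte Carlo average over prior samples can be compared with the exact value.

```python
import numpy as np

def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

rng = np.random.default_rng(0)
noise_var = 0.5  # assumed variance of the conditional p(x|z)
x_obs = 1.0      # a single observed value

# Marginal as an expectation under the prior: average p(x|z_i), z_i ~ N(0, 1).
z_samples = rng.standard_normal(200_000)
p_x_mc = normal_pdf(x_obs, z_samples, noise_var).mean()

# Exact marginal for this toy model: x ~ N(0, 1 + noise_var).
p_x_exact = normal_pdf(x_obs, 0.0, 1.0 + noise_var)

# Posterior density at one latent value via Bayes' rule.
z0 = 0.3
posterior_bayes = (
    normal_pdf(x_obs, z0, noise_var) * normal_pdf(z0, 0.0, 1.0) / p_x_exact
)

# Known closed-form posterior of the toy model, for comparison.
post_mean = x_obs / (1.0 + noise_var)
post_var = noise_var / (1.0 + noise_var)
posterior_exact = normal_pdf(z0, post_mean, post_var)

print("marginal (Monte Carlo / exact):", p_x_mc, p_x_exact)
print("posterior (Bayes / exact):", posterior_bayes, posterior_exact)
```

In one dimension the sampling average converges quickly. With a neural-network decoder and a high-dimensional latent space this kind of estimate becomes impractical, which is the motivation for approximating the posterior with a second distribution.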
Appendix C. Kullback-Leibler divergence
The KL-divergence has a number of important properties which we outline below. First we provide some definitions from information theory.
Given two probability distributions $p$ and $q$ the entropy of $p$ is defined from,
$$H(p) = -\sum_x p(x) \log p(x), \qquad (12)$$
and the cross-entropy of $p$ and $q$ is given by,
$$H(p, q) = -\sum_x p(x) \log q(x). \qquad (13)$$
Given two probability distributions $p$ and $q$ we define the Kullback-Leibler divergence by taking the cross-entropy minus the entropy,
$$D_{KL}(p \| q) = H(p, q) - H(p) = \sum_x p(x) \log \frac{p(x)}{q(x)}. \qquad (14)$$
The Kullback-Leibler divergence measures how well $q$ approximates $p$.
Property 1
Properties of KL-divergence [6].
if $p = q$ then $D_{KL}(p \| q) = 0$,
if $p \neq q$ then $D_{KL}(p \| q) > 0$.
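The entropy, cross-entropy and Kullback-Leibler divergence can be checked numerically on small discrete distributions. The sketch below is illustrative only. It assumes strictly positive probability vectors, so that no logarithm of zero occurs, and the example vectors `p` and `q` are made up. It verifies that the divergence equals the cross-entropy minus the entropy, that it vanishes for identical distributions, and that it is positive and asymmetric otherwise.

```python
import numpy as np

def entropy(p):
    """Entropy of a discrete distribution with strictly positive entries."""
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    """Cross-entropy of q relative to p."""
    return -np.sum(p * np.log(q))

def kl_divergence(p, q):
    """Kullback-Leibler divergence of q from p."""
    return np.sum(p * np.log(p / q))

# Two made-up distributions over three outcomes.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)

print("KL(p || q):", kl_pq)
print("cross-entropy minus entropy:", cross_entropy(p, q) - entropy(p))
print("KL(p || p):", kl_divergence(p, p))
print("KL(q || p):", kl_qp)
```

The last line shows that swapping the arguments changes the value, so the divergence is not a metric even though it behaves like a distance in the two properties checked here.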
Property 2
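Solution to $-D_{KL}(q_\phi(z|x) \| p(z))$ in the normal Gaussian case [14]. Let’s assume that $x$ is some random variable in $\mathbb{R}^n$ and
$$q_\phi(z|x) = \mathcal{N}(z; \mu, \operatorname{diag}(\sigma^2)), \qquad p(z) = \mathcal{N}(z; 0, I), \qquad z \in \mathbb{R}^J,$$
so that
$$-D_{KL}(q_\phi(z|x) \| p(z)) = \int q_\phi(z|x) \log p(z)\, dz - \int q_\phi(z|x) \log q_\phi(z|x)\, dz.$$
This can now be evaluated as two different integrals. The first being
$$\int q_\phi(z|x) \log p(z)\, dz = -\frac{J}{2} \log(2\pi) - \frac{1}{2} \sum_{j=1}^{J} \left(\mu_j^2 + \sigma_j^2\right),$$
and the second,
$$\int q_\phi(z|x) \log q_\phi(z|x)\, dz = -\frac{J}{2} \log(2\pi) - \frac{1}{2} \sum_{j=1}^{J} \left(1 + \log \sigma_j^2\right),$$
giving us the closed form solution,
$$-D_{KL}(q_\phi(z|x) \| p(z)) = \frac{1}{2} \sum_{j=1}^{J} \left(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\right).$$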
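As a numerical sanity check of the Gaussian case in Property 2, the sketch below compares the standard closed form from [14], namely one half of the sum over latent dimensions of 1 + log σ² − μ² − σ², with a Monte Carlo estimate of the same quantity. It assumes a diagonal Gaussian approximate posterior and a standard normal prior. The values of `mu` and `sigma`, the sample size and the function names are arbitrary choices made for this illustration.

```python
import numpy as np

def neg_kl_closed_form(mu, sigma):
    """Closed-form -KL(N(mu, diag(sigma^2)) || N(0, I))."""
    return 0.5 * np.sum(1.0 + np.log(sigma ** 2) - mu ** 2 - sigma ** 2)

def neg_kl_monte_carlo(mu, sigma, n_samples=200_000, seed=0):
    """Estimate E_q[log p(z) - log q(z)] with samples z ~ q."""
    rng = np.random.default_rng(seed)
    z = mu + sigma * rng.standard_normal((n_samples, mu.size))
    # Log-density of the standard normal prior at each sample.
    log_p = -0.5 * np.sum(z ** 2 + np.log(2.0 * np.pi), axis=1)
    # Log-density of the diagonal Gaussian q at each sample.
    log_q = -0.5 * np.sum(
        ((z - mu) / sigma) ** 2 + np.log(2.0 * np.pi * sigma ** 2), axis=1
    )
    return float(np.mean(log_p - log_q))

# Arbitrary two-dimensional example.
mu = np.array([0.5, -1.0])
sigma = np.array([0.8, 1.5])

closed = neg_kl_closed_form(mu, sigma)
estimate = neg_kl_monte_carlo(mu, sigma)
print("closed form:", closed)
print("Monte Carlo:", estimate)
```

The closed form is what makes this term cheap to use as a regulariser during training, since no sampling is needed to evaluate it or its gradient.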
References
 [1] A. Borghesi, A. Bartolini, M. Lombardi, M. Milano and L. Benini, Anomaly Detection using Autoencoders in High Performance Computing Systems, (2018), Available at: https://arxiv.org/pdf/1811.05269.pdf
 [2] M. M. Breunig, H. Kriegel, R. Ng and J. Sander, LOF: Identifying Density-Based Local Outliers, in Proc. ACM Sigmod Int. Conf. on Management of Data, Dallas, TX, (2000).
 [3] F. Chollet, Deep Learning With Python, Manning Publ., (2017).
 [4] N. Dilokthanakul, P. A. M. Mediano, M. Garnelo, M. C. H. Lee, H. Salimbeni, K. Arulkumaran and M. Shanahan, Deep Unsupervised Clustering with Gaussian Mixture Variational Autoencoders, (2016), Available at: https://arxiv.org/pdf/1611.02648.pdf
 [5] C. Doersch, Tutorial on Variational Autoencoders, (2016), Available at: https://arxiv.org/pdf/1606.05908.pdf
 [6] J. Duchi, Derivations For Linear Algebra and Optimization, (2018), Available at: http://web.stanford.edu/~jduchi/projects/general_notes.pdf
 [7] J. C. Dunn, Well-Separated Clusters and Optimal Fuzzy Partitions, Journal of Cybernetics, 4 (1): pp. 95–104, (1974).
 [8] A. Géron, Hands-On Machine Learning with Scikit-Learn and TensorFlow, 2nd Ed. O’Reilly, (2017).
 [9] N. Goix, How to Evaluate the Quality of Unsupervised Anomaly Detection Algorithms?, (2016), Available at: https://arxiv.org/pdf/1607.01152.pdf
 [10] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Stat. ML. arXiv:1406.2661, Generative Adversarial Nets, (2014).
 [11] G. E. Hinton, P. Dayan, B. J. Frey and R. M. Neal, The wake-sleep algorithm for unsupervised neural networks, (1995), Available at: http://science.sciencemag.org/content/268/5214/1158
 [12] S. Jianlin, Variational Self-Encoder: One-Step Clustering Scheme, (2018), Available at: https://kexue.fm/archives/5887 (Chinese)
 [13] Z. Jiang, Y. Zheng, H. Tan, B. Tang and H. Zhou, CS arXiv 1611.05148, Variational Deep Embedding: An Unsupervised and Generative Approach to Clustering, (2016).
 [14] D. P. Kingma, M. Welling, Stat. ML. arXiv:1312.6114, Auto-Encoding Variational Bayes, (2014).
 [15] A. Dempster, N. Laird and D. Rubin, Maximum Likelihood from Incomplete Data via the EM Algorithm, (1977), Available at: http://web.mit.edu/6.435/www/Dempster77.pdf
 [16] F. F. Li, J. Johnson and S. Yeung, Lecture 13, Generative Models, (2017), Available at: https://www.youtube.com/watch?v=5WoItGTWV54
 [17] F. T. Liu, K. M. Ting and Z.-H. Zhou, Isolation forest, In: Proc. of the 8th IEEE International Conference on Data Mining (ICDM’08), Pisa, Italy, pp. 413–422 (2008).
 [18] S. P. Lloyd, Least squares quantization in PCM, (1982), Available at: https://scinapse.io/papers/2150593711
 [19] W. McCulloch and W. Pitts, A Logical Calculus of the Ideas Immanent in Nervous Activity, Bulletin of Mathematical Biophysics, 5, pp. 115–133, (1943).
 [20] L. Martí, N. Sanchez-Pi, J. M. Molina, and A. C. Bicharra Garcia, Anomaly detection based on sensor data in petroleum industry applications, Sensors, 15(2): 2774–2797, (2015).
 [21] Q. Nguyen, K. Lim, D. Divakaran, K. Low and M. Chan, GEE: A Gradient-based Explainable Variational Autoencoder for Network Anomaly Detection, (2019), Available at: https://arxiv.org/pdf/1903.06661.pdf
 [22] E. Norlander, Clustering and Anomaly Detection on Financial Trading Data using the Conditional Latent Space Variational Autoencoder, master thesis, Lund University, (2019).
 [23] K. Pearson, On Lines and Planes of Closest Fit to Systems of Points in Space, The London Edinburgh Dublin Philosophical Magazine and J. of Sc., 2 (11), pp. 559–572, (1901).
 [24] A. Rosenberg and J. Hirschberg, V-Measure: A conditional entropy-based external cluster evaluation measure, (2007), Available at: https://www.aclweb.org/anthology/D071043
 [25] F. Rosenblatt, The Perceptron: A Probabilistic Model For Information Storage And Organization In The Brain, Psychological Review, 65(6), (1958).
 [26] D. E. Rumelhart, G. E. Hinton and R. J. Williams, Learning representations by back-propagating errors, Nature, 323, pp. 533–536, (1986).
 [27] B. Schölkopf, R. Williamson, A. Smola, J. Shawe-Taylor and J. Platt, Support Vector Method for Novelty Detection, (2000), Available at: http://papers.nips.cc/paper/1723supportvectormethodfornoveltydetection.pdf
 [28] E. Schubert, A. Zimek, and H.-P. Kriegel, Local outlier detection reconsidered: a generalized view on locality with applications to spatial, video, and network outlier detection, Data Min. Knowl. Discov., 28: 190–237, (2014).
 [29] R. Shu, Density Estimation: Variational Autoencoders, (2018), Available at: http://ruishu.io/2018/03/14/vae/
 [30] R. Shu, Gaussian Mixture VAE: Lessons In Variational Inference, Generative Models and Deep Nets, (2016), Available at: http://ruishu.io/2016/12/25/gmvae/
 [31] K. Sohn, X. Yan, H. Lee, Learning Structured Output Representation using Deep Conditional Generative Models (2015), Available at: https://pdfs.semanticscholar.org/3f25/e17eb717e5894e0404ea634451332f85d287.pdf
 [32] A. Sopasakis, Traffic demand and longer-term forecasting from real-time observations, Conf. on Time Series and Forecasting, Granada, Vol. 2, pp. 1247–1259 (2019).
 [33] A. Sopasakis, M. A. Katsoulakis, Information metrics for improved traffic model fidelity through sensitivity analysis and data assimilation, Transportation Research Part B: Methodological 86, 1–18 (2016).
 [34] L. Tiao, A Tutorial on Variational Autoencoders with a Concise Keras Implementation, (2018), Available at: https://tiao.io/post/tutorialonvariationalautoencoderswithaconcisekerasimplementation/
 [35] D. Tran, Variational autoencoders do not train complex generative models, (2016), Available at: http://dustintran.com/blog/variationalautoencodersdonottraincomplexgenerativemodels
 [36] J. Xie, R. Girshick and A. Farhadi, CS arXiv 1511.06335, Unsupervised Deep Embedding for Clustering Analysis, (2015).
 [37] H. Zenati, C.-S. Foo, B. Lecouat, G. Manek, V. R. Chandrasekhar, Efficient GAN-Based anomaly detection, Workshop track, ICLR (2018).