Divergences between the anticipated and actual distributions of the data, colloquially called data anomalies, are often analysed at the level of individual elements of the data set: a yellow ball in a basket where only red balls are expected would be considered an anomaly. A less extreme example would be a basket of red balls with yellow spots, but with still ‘enough’ red exposed. Here, a ball with, say, more than 90% of the surface covered with yellow spots would be considered an anomaly.
However, there are cases when the anomaly can be identified only by considering the whole data set. For example, a basket contains red and yellow balls. We expect the number of red and yellow balls to be about the same. An anomaly then is five times as many yellow balls as red balls in the basket.
This work is concerned with the latter case. We call population anomaly a phenomenon where the distribution of elements, rather than an individual element taken in isolation, is abnormal. In population anomalies, each anomalous data point may have high probability mass or density w.r.t. the anticipated distribution (Figure 1). Fraud trends in electronic payment systems, disease clusters in public health care, and data exfiltration through network protocols are just some examples of population anomalies. While considering a population anomaly, we want to evaluate the hypothesis that the distribution underlying the data set diverges from the anticipated distribution, and, assuming that the data set is a mixture, obtain the probability for each sample to come either from the regular or from the anomalous component of the mixture.
In a single dimension, the problem can be solved rather straightforwardly, for example, by performing a Kolmogorov-Smirnov test or constructing a histogram. However, as the number of dimensions grows, in particular in the presence of complicated interdependencies and heterogeneous data types, straightforward brute-force approaches stop working, which is known as ‘the curse of dimensionality.’ Building on previous work, we propose a method for efficient detection and ranking of population anomalies.
2 Problem statement
We formulate population anomaly detection as the following machine learning problem.
We are given a data set $X$ in which each sample is drawn i.i.d. from an unknown distribution $P$ — the training set. Further on, we are given a data set $Y$ in which each sample is drawn from $Q = (1 - \alpha)P + \alpha A$, which is a mixture of $P$ and an unknown distribution $A$ — the evaluation set, with unknown mixing probability $\alpha$. We assume that, given a sample set of sufficient size from each of $P$ and $A$, $P$ and $A$ can be distinguished at any given confidence level.
We want to test the hypothesis that $X$ and $Y$ were drawn from two different distributions $P$ and $Q$, and to assess the probability for each sample in $Y$ to be drawn from $A$ rather than from $P$.
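As a toy illustration of this setting, a training set can be drawn from a regular component and an evaluation set from a mixture with an anomalous component. The distributions and parameters below are illustrative only; note that each anomalous sample still has non-negligible density under the regular component, so point-wise detection would fail:

```python
import random

random.seed(0)

def sample_regular():
    # Regular component: a standard normal in one dimension (toy stand-in).
    return random.gauss(0.0, 1.0)

def sample_anomalous():
    # Anomalous component: a shifted normal; each individual sample is
    # still plausible under the regular component.
    return random.gauss(1.5, 1.0)

alpha = 0.1  # hypothetical mixing probability, unknown to the detector

# Training set: i.i.d. samples from the regular component.
train = [sample_regular() for _ in range(10_000)]

# Evaluation set: a mixture of the regular and anomalous components.
evaluation = [sample_anomalous() if random.random() < alpha else sample_regular()
              for _ in range(10_000)]
```

Only the distribution of the evaluation set as a whole, not any single element, reveals the anomaly.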
3 Related work
Related work belongs to three areas of machine learning research: population anomalies, gaussianization, and adversarial autoencoders.
Population anomaly detection and divergence estimation is explored in [PBX+11, XPS+11]. [YHL16] apply population (group) anomaly detection to social media analysis.
Gaussianization as a principle for tackling the curse of dimensionality in high-dimensional data was first introduced in [CG01] and further developed in [LCM11, EJR+06] and other publications. Iterative algorithms involving component-wise gaussianization and ICA were initially proposed, with various modifications and improvements in later publications.
Adversarial autoencoders, a deep learning architecture for variational inference, facilitate learnable invertible gaussianization of high-dimensional large data sets. The architecture was introduced in [MSJ+16]. The use of autoencoders in general and adversarial autoencoders in particular for detection of (individual) anomalies is explored in [HHW+02, SSW+17, ZP17, CSA+17].
This work differs from earlier research in that it introduces a black-box machine learning approach to detection and ranking of population anomalies. The approach is robust to dimensionality and distribution properties of the data and scalable to large data sets.
4 The Method
To handle population anomalies, we employ an adversarial autoencoder [MSJ+16] to project the anticipated distribution of the data set into a multivariate unit normal distribution, that is, to apply multivariate gaussianization [CG01] to the data. We then use the gaussianized representation to detect and analyse population anomalies.
An adversarial autoencoder (Figure 2) consists of two networks, the autoencoder and the discriminator.
The autoencoder has two subnetworks, the encoder and the decoder. The encoder projects the data into an internal representation, which is in our case a multivariate unit normal, $\mathcal{N}(0, I)$. The decoder reconstructs the data sample from a point in the space of the internal representation. The discriminator ensures that the internal representation is indeed normally distributed.
After training, the encoder and decoder implement an invertible gaussianization. For every sample in the data set, a corresponding sample from $\mathcal{N}(0, I)$ is computed by the encoder, and the sample can be recovered by the decoder.
We train the autoencoder on the training data set, which represents the anticipated distribution. Then, we use the encoder to project the evaluation data set on the space of the internal representation. If the distribution of the evaluation set diverges from that of the training set, the projection of the evaluation set will diverge from unit multivariate normal distribution.
The unit multivariate normal distribution eliminates the curse of dimensionality because its dimensions are mutually independent. Instead of testing the projection of the evaluation set against the multivariate distribution, we can test the distribution along each dimension independently, and then combine the statistics over all dimensions to detect an anomaly.
Any goodness-of-fit test assessing normality of a sample can be used. Statistics which scale well to large sample sizes and are sensitive to local discrepancies in the distributions should be preferred. In our realization of the method we use the Kolmogorov-Smirnov statistic, which allows a natural probabilistic interpretation and works sufficiently well in practice (see Section 5).
For combining the statistics over all dimensions, we use a p-norm. In simple computational evaluations the $L_\infty$ norm worked well enough. When we are concerned with anomalies caused by small intrusions or perturbations in particular, $p = \infty$, that is, taking the maximum of the statistics over the axes, is a reasonable choice. The $L_\infty$ norm also serves as a lower bound on the hypothesis test that the training and evaluation sets come from the same distribution. If the hypothesis can be rejected (that is, there is a population anomaly in the evaluation set) based on a single dimension of the gaussianized representation, then by all means the hypothesis could have been rejected if all dimensions were considered.
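A minimal sketch of the per-dimension test and the combination step (function names are ours; standard library only): the one-sample Kolmogorov-Smirnov statistic is computed against the standard normal CDF for each latent dimension, and the per-dimension statistics are combined with a p-norm, the default $p = \infty$ giving the maximum:

```python
import math

def std_normal_cdf(x):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def ks_statistic(sample):
    """One-sample Kolmogorov-Smirnov statistic of `sample` against N(0, 1)."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        cdf = std_normal_cdf(x)
        # Compare the empirical CDF just before and just after each data point.
        d = max(d, abs((i + 1) / n - cdf), abs(i / n - cdf))
    return d

def combine(stats, p=math.inf):
    """Combine per-dimension statistics with a p-norm; p = inf takes the maximum."""
    if p == math.inf:
        return max(stats)
    return sum(s ** p for s in stats) ** (1.0 / p)
```

Applied to the columns of the gaussianized projection of the evaluation set, `combine` with the default $p = \infty$ yields the maximum KS statistic used for the hypothesis test.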
In addition to testing for the presence of an anomaly in the evaluation set, we would like to rank each element of the evaluation set by its probability of belonging to the anomaly.
Here again we leverage the adversarial autoencoder. The discriminator component is trained to distinguish between the projection and the unit multivariate normal distribution. We will reuse the discriminator component to predict the anomality of each element.
As trained during the training phase of the adversarial autoencoder, the discriminator is not yet useful for ranking the evaluation set. However, we can take the pre-trained discriminator and train it on the evaluation set to distinguish between the projection of the evaluation set and samples from the unit multivariate normal distribution. The more the evaluation set diverges from the training set, the higher the classification accuracy will be. Elements which are more likely to come from the anomalous component of the mixture will be classified as such with higher confidence.
Concretely, we rank the elements of the evaluation set using the discriminator as follows:
We project the evaluation set into the internal representation using the encoder trained on the training set.
We train the discriminator to distinguish between the projection of the evaluation set and the unit multivariate normal distribution, assigning label 1 to the evaluation set and label 0 to the random samples.
After training, we classify the projection of the evaluation set with the discriminator and use the predicted label (1 is definitely an anomaly, 0 is definitely a random sample) as the rank of anomality, and then propagate the labels back to the original data.
The discriminator is trained with binary cross-entropy loss. An optimally trained discriminator will rank each projection $z$ of a sample with $D(z)$, the probability of the projection (and hence of the data sample) to come from the evaluation distribution rather than from the unit normal. $D(z)$ can be used to estimate the probability of the sample to come from the anomalous component. Indeed, denoting the densities of the projected evaluation distribution and of the unit normal as $q$ and $p$ correspondingly, and the ratio $q(z)/p(z)$ as $r$, we obtain:

$$D(z) = \frac{q(z)}{q(z) + p(z)} = \frac{r}{1 + r}.$$
Figure 4.3: $D$ as a function of $r$.
$D$ increases monotonically with $r$. For $r \geq 1$, which is the case for anomalies, the dependence on $r$ is shown in Figure 4.3. $r = 1$, that is, the density of anomalous samples being equal to the density of regular samples, corresponds to $D = 0.5$.
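The relation between the discriminator output and the density ratio can be sanity-checked numerically; this is a minimal illustration of the formula, not the trained network:

```python
def discriminator_output(r):
    """Output of an optimally trained discriminator at a point where the
    density ratio (evaluation projection vs. unit normal) equals r."""
    return r / (1.0 + r)

def density_ratio(d):
    """Invert the discriminator output back to the density ratio."""
    return d / (1.0 - d)
```

Equal densities ($r = 1$) give the neutral rank 0.5, and the output grows monotonically toward 1 as $r$ increases.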
4.4 Method Outline
Let us now summarize the algorithmic steps constituting the method:
Train the adversarial autoencoder on the training set.
Project the evaluation set on the internal representation space using the encoder.
Compute KS statistics for each dimension of the projection.
Combine the computed statistics over all dimensions using a p-norm (e.g. take the maximum KS statistic) and use the combined value to test whether an anomaly is present.
Train the discriminator to distinguish between the projection of the evaluation set and random samples from the unit multivariate normal distribution.
Classify the evaluation set using the trained discriminator network.
Sort the elements in the evaluation set according to the rank assigned by the classifier.
Report elements with the highest ranks as the most ‘surprising’ ones, i.e. those most likely to belong to an anomaly.
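The ranking steps of the outline (classify, sort, report) can be sketched as follows, with `encoder` and `discriminator` standing in for the trained networks (hypothetical callables here, not the actual implementation):

```python
def rank_evaluation_set(encoder, discriminator, eval_set):
    """Score each element by the discriminator's output on its projection
    (1 = definitely an anomaly, 0 = definitely a regular sample) and sort
    the most 'surprising' elements first."""
    scored = [(discriminator(encoder(x)), x) for x in eval_set]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored
```

The top of the returned list contains the elements most likely to belong to the anomaly.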
5 Empirical Evaluation
In the empirical evaluation that follows we evaluate the method on three domains of different structure and from different application areas. In all cases, point-based anomaly detection cannot be applied to detect the anomalies, as the probability of each individual anomalous element is as high as or higher than that of some of the regular elements in the training set. However, the unusually high probabilities of the anomalous elements in the evaluation set indicate the anomaly, which is detected by the introduced method for population anomaly detection.
5.1 Credit Card Payments
We were provided with a data set of credit card transaction data over a month. The data set contains million transactions. We divided the data into 168 buckets, one for each hour of each day of the week. For each bucket, a separate model was trained. A data record consists of 14 fields of both continuous (transaction amount, conversion rate) and categorical (country, currency, market segment, etc.) types. In the expanded form, each record is represented by a 415-element vector. An 8-dimensional internal representation was used. Since the data is a mixture of continuous and categorical values, mean squared error loss was used as the reconstruction loss of the autoencoder.
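The expansion of a mixed-type record into a vector can be sketched as one-hot encoding of the categorical fields alongside pass-through of the continuous ones. The schema and field names below are illustrative, not the actual payment schema:

```python
def expand_record(record, schema):
    """One-hot expand categorical fields and pass through continuous fields.

    `schema` maps each field name to either 'continuous' or the list of its
    categories. Field order in the schema fixes the vector layout."""
    vec = []
    for field, kind in schema.items():
        value = record[field]
        if kind == 'continuous':
            vec.append(float(value))
        else:
            vec.extend(1.0 if value == cat else 0.0 for cat in kind)
    return vec

# Illustrative schema: 1 continuous field + two 3-category fields -> 7 dims.
schema = {
    'amount': 'continuous',
    'currency': ['USD', 'EUR', 'JPY'],
    'country': ['US', 'DE', 'JP'],
}
expand_record({'amount': 10.0, 'currency': 'EUR', 'country': 'US'}, schema)
# -> [10.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0]
```

With the full 14-field payment schema, the same expansion yields the 415-element vectors mentioned above.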
We use the method to compare different hours of the day and different days of the week. There are $168 \times 168$ possible combinations of model and data, to which we apply the method.
Figure 3 presents some of the results of quantifying anomality (novelty) between different hours of the week. The hours are per the Pacific Time Zone.
Figure 3a shows novelty of each hour over a week relative to a particular hour’s model, for 4 randomly selected hours.
Same hours on different days, even as different as Monday and Sunday, have similar distributions.
Weekdays are more similar to each other than a weekday and a weekend.
Figure 3b shows relative novelty in each pair of hours over a single day (Wednesday). The lighter the square, the more similar the model and the data hours are. There are three regions of similar-looking hours (appearing as light squares on the plots). We marked these regions with dashed colored boxes. Ranking of transactions in hours belonging to different regions (see Section 5.1.2) suggests a geographical interpretation of peak activities:
1 – 4 (blue box) — Europe and Middle East
6 – 14 (red box) — Americas
16 – 23 (green box) — Asia-Pacific
We consider top-ranked transactions from several combinations of data and model hours. For simplicity, only the sender and the receiver country are shown here; however, other fields may also have affected the ranking.
Sanity check — same hour
First, we make sure that the transactions are not surprising when they come from the model’s bucket (0.5 is the neutral rank):
03:00 on Monday
The highest probability is and the lowest is which is a rather narrow range of surprise, as expected.
Different hours within the same day
Comparing different hours on the same day helps give interpretation to different similarity regions (colored boxes) in Figure 3b.
12:00 vs. 3:00 on Wednesday
3:00 vs. 12:00 on Wednesday
The most surprising transactions at 12:00 on Wednesday compared to 3:00 are payments within the US. When ranked in the opposite direction (ranking is not symmetric), the most surprising transactions at 3:00 compared to 12:00 are payments within Europe. Let’s now check the evening hours:
22:00 vs. 12:00 on Wednesday
At 22:00 the most surprising transactions relative to 12:00 are those within the Far East.
Same hour, different days
Same hours on different days are generally similar, but we saw that weekends are different from weekdays. Let’s try to explain some of the differences:
2:00 on Sunday vs. on Monday
At 2:00 on Sunday the most surprising transactions relative to Monday 2:00am are certain payments within Germany (probably involving other attributes).
5.2 London Crime Data
The Kaggle data set of London Crime [KaggleLondonCrime] contains million unique crime cases for the years 2008–2016. Each crime case record contains the crime category, the borough where the crime happened, and the year and month of the event. We divided the data into 9 buckets, one bucket per year. In the expanded form, each record is represented by a 78-dimensional vector. An 8-dimensional internal representation was used. All fields are categorical; hence, binary cross-entropy loss was used as the reconstruction loss of the autoencoder.
Figure 4 presents results of quantifying novelty between different years.
Figure 4a shows relative novelty for each pair of years. The lighter the square, the more similar the years are. Adjacent years are similar to each other; the further apart the years, the greater the mutual novelty.
Figure 4b shows the novelty (maximum KS statistic over the dimensions of the internal representation) of each year relative to the model trained on the data of all years. Years 2010–2011 appear to be the closest to the overall distribution of crimes, with years at the beginning and the end of the range farther apart. Year 2016 is much further from the overall distribution than year 2008, though.
To illustrate insights which can be obtained through ranking of anomalous records we compare the first and the last year in the span to each other, as well as the overall distribution to the last year.
In year 2008 the highest ranked records compared to year 2016 are thefts from motor vehicles in several boroughs. This can be interpreted as meaning that the frequency of this crime decreased in London by 2016.
2008 vs. 2016
| 0.743 | 2 | Theft From Motor Vehicle | Islington |
| 0.721 | 2 | Theft From Motor Vehicle | Wandsworth |
| 0.717 | 2 | Theft From Motor Vehicle | Hammersmith and Fulham |
In 2016 compared to 2008 the highest ranked record is of harassment. Note that harassment is not an outlier in 2008 — 5% of reported crimes are harassment, compared to 11% in 2016. Still, harassment records appear to constitute the greatest novelty in 2016.
2016 vs. 2008
Comparing all years’ data to the model of 2016, we find that the highest ranked records are of assault with injury in central boroughs of London. This can be interpreted as meaning that this particular crime was frequent in central London, but its frequency decreased by 2016.
all years vs. 2016
| 0.744 | 7 | Assault with Injury | Westminster |
| 0.720 | 7 | Assault with Injury | Kensington and Chelsea |
| 0.716 | 7 | Assault with Injury | Lambeth |
5.3 DNS-based data exfiltration
We applied the method to the detection of DNS-based data exfiltration. The CAIDA UCSD DNS Names Dataset [CAIDA] was used.
We emulate data exfiltration by replacing the last component of the domain in a certain fraction of the data set with a sequence of characters sampled from the characters permitted in domain names (uppercase and lowercase letters, as well as digits and the dash). For example, foobar.example.com might be replaced with AsdR5t.example.com. This method approximates the distribution of encoded data while keeping the distribution of domain lengths unaffected. 0.1%, 1%, and 10% of the entries in the evaluation data set are replaced with entries emulating data exfiltration. Only domain names are considered for machine learning. The domain names were mapped to 64-dimensional (by the number of allowed characters) vectors of character counts. A 4-dimensional internal representation was used. Mean squared error loss was used as the reconstruction loss of the autoencoder.
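A sketch of the exfiltration emulation and the character-count features described above. `LABEL_CHARS` covers the 63 characters permitted inside a label; we assume the 64th dimension of the count vector corresponds to the dot separator (an assumption about how the 64-character alphabet is reached):

```python
import random
import string

# Characters permitted inside a domain label: letters, digits, and the dash.
LABEL_CHARS = string.ascii_letters + string.digits + '-'
# Alphabet for the count vector; adding the dot gives 64 dimensions.
ALPHABET = LABEL_CHARS + '.'

def exfiltrate(domain, rng=random):
    """Replace the leftmost (host) label with random permitted characters of
    the same length, approximating encoded data while preserving the
    distribution of domain lengths (e.g. foobar.example.com -> AsdR5t.example.com)."""
    first, _, rest = domain.partition('.')
    fake = ''.join(rng.choice(LABEL_CHARS) for _ in first)
    return fake + '.' + rest if rest else fake

def char_counts(domain):
    """Map a domain name to a 64-dimensional vector of character counts."""
    return [domain.count(c) for c in ALPHABET]
```

Replacing 0.1%, 1%, or 10% of entries with `exfiltrate(domain)` reproduces the three evaluation regimes.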
Figure 5 shows the ROC and precision-recall curves of exfiltration detection.
Note that the classification accuracy (for the given training budget) increases as the number of anomalous entries goes up, unlike in methods which rank every sample individually. This self-boosting is a useful feature of the proposed method: the more severe the attack, the higher the ranking accuracy.
6 Conclusion
We described a method for detecting and quantifying population anomalies in high-dimensional data and evaluated the method on several application domains. An anomaly, or novelty, in the data is an unusually high probability of occurrence of certain elements. Individual anomalies are commonly detected based on the low probability of the elements relative to the anticipated distribution, which is a sufficient but not a necessary condition of anomality. Elements of population anomalies may still have relatively high probability.
Population anomalies and methods of their detection have been the subject of earlier research; however, the introduced method offers a black-box approach to population anomaly detection and is robust to data set sizes, data types, and distributions. One challenge that still needs to be addressed for any population anomaly detection method introduced so far is explanation — a summary characterization of the anomaly instead of just presenting the most anomalous samples. Our method, by allowing augmentation and reconstruction of anomalies from the internal representation, may be a good foundation for addressing this challenge, a subject for future research.
References
- London crime data, 2008–2016. https://www.kaggle.com/jboysen/london-crime. Accessed: 2017-12-30.
-  The CAIDA UCSD IPv4 routed /24 DNS names dataset. http://www.impactcybertrust.org. Accessed: 2017-12-12.
-  Jinghui Chen, Saket Sathe, Charu Aggarwal, and Deepak Turaga. Outlier Detection with Autoencoder Ensembles, pages 90–98. 2017.
-  Scott Saobing Chen and Ramesh A. Gopinath. Gaussianization. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 423–429. MIT Press, 2001.
-  Deniz Erdogmus, Robert Jenssen, Yadunandana N. Rao, and Jose C. Principe. Gaussianization: An efficient multivariate density estimation technique for statistical signal processing. Journal of VLSI signal processing systems for signal, image and video technology, 45(1):67–83, Nov 2006.
- Simon Hawkins, Hongxing He, Graham Williams, and Rohan Baxter. Outlier detection using replicator neural networks. In Yahiko Kambayashi, Werner Winiwarter, and Masatoshi Arikawa, editors, Data Warehousing and Knowledge Discovery, pages 170–180, Berlin, Heidelberg, 2002. Springer Berlin Heidelberg.
-  Valero Laparra, Gustavo Camps-Valls, and Jesús Malo. Iterative gaussianization: from ICA to random rotations. IEEE transactions on neural networks, 22(4):537–549, 2011.
-  Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, and Ian Goodfellow. Adversarial autoencoders. In International Conference on Learning Representations, 2016.
- Barnabás Póczos, Liang Xiong, and Jeff Schneider. Nonparametric divergence estimation with applications to machine learning on distributions. In Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, pages 599–608. AUAI Press, 2011.
-  Thomas Schlegl, Philipp Seeböck, Sebastian M. Waldstein, Ursula Schmidt-Erfurth, and Georg Langs. Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In Marc Niethammer, Martin Styner, Stephen Aylward, Hongtu Zhu, Ipek Oguz, Pew-Thian Yap, and Dinggang Shen, editors, Information Processing in Medical Imaging, pages 146–157, Cham, 2017. Springer International Publishing.
-  Liang Xiong, Barnabas Poczos, Jeff Schneider, Andrew Connolly, and Jake VanderPlas. Hierarchical probabilistic models for group anomaly detection. In Geoffrey Gordon, David Dunson, and Miroslav Dudík, editors, Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of Proceedings of Machine Learning Research, pages 789–797, Fort Lauderdale, FL, USA, 11–13 Apr 2011. PMLR.
-  Rose Yu, Xinran He, and Yan Liu. GLAD: group anomaly detection in social media analysis. ACM Transactions on Knowledge Discovery from Data (TKDD), 10(2):18, 2015.
-  Chong Zhou and Randy C. Paffenroth. Anomaly detection with robust deep autoencoders. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’17, pages 665–674, New York, NY, USA, 2017. ACM.