1 Introduction
Addressing complex outliers is an important challenge for machine learning and statistics. Outliers need to be detected and removed as they typically prevent learning systems from generalizing [14]. On the other hand, an outlier itself can hold important information and thus be the central target of interest in data analysis [12, 5, 1]. Complex outliers are especially hard to detect since the underlying data may be structured [8, 34, 13] and/or may have unknown latent variables [28], and it is precisely this new and challenging scenario to which this work contributes a novel mathematical model. Specifically, we propose a CRF-based [16] one-class classifier [31, 35] that incorporates latent structure. Figure 1 illustrates the idea of our proposed method, which we call the latent-class contextual anomaly detector (LCCAD). Here, we assume an unsupervised scenario where the data is generated from two unknown latent processes (red and blue dots in the figure). We would like to learn two hyperspheres (normal data) containing most of the data points corresponding to the respective latent variables (color) while taking the given (complex) dependency structure (black lines) into account. This should allow us to detect anomalies even if they hide in high-density regions (cf. the highlighted points in Figure 1). This is in stark contrast to traditional anomaly detectors such as vanilla SVDD [35], one-class SVM [31], isolation forest [19, 20], and LODA [27], or structured approaches such as [8, 34], where no unknown latent variable is present beyond existing structures. While in toy scenarios we can measure and assess the quality of our model's performance, analyzing complex real-world data requires more: a user needs to be given an explanation of why a specific data point is considered anomalous, e.g. by answering which input features of the data point make the model decide "anomalous".
We provide toy simulations demonstrating that our model, unlike existing ones, can detect anomalies reliably in this scenario. To demonstrate that the described complex structured outlier detection scenario is a practically relevant real-world problem, we study geophysics data from oil exploration; here, too, we observe that our LCCAD algorithm compares very favorably to potential competing approaches. In addition, we show the usefulness of our novel explanation method for LCCAD anomaly detection on geophysical facies data and illustrate possible scientific insights that can be obtained.
2 Detecting Latent-class Contextual Anomalies
The mathematical setup of our unsupervised learning problem is the following: We are given an unlabeled dataset X = {x_1, ..., x_n} with x_i ∈ R^d, a feature map φ mapping each data point into a possibly high-dimensional feature space, and a general loss function ℓ. Further, each entry x_i is assigned a corresponding discrete latent class variable z_i with z_i ∈ {1, ..., K}. The task is to find the center c_k and the radius R_k of a hypersphere for each context k such that the bulk of the corresponding data points belonging to this context is contained, with a fraction ν lying outside or on the border. The sets of centers and radii are referred to as C and R, correspondingly. Dependency structure between latent variables is induced using a joint feature map ψ. Subsequently, we will derive our novel latent-class contextual anomaly detector (LCCAD) based on the support vector data description (SVDD) [35] and give its probabilistic interpretation. Additionally, we will show how to induce structural dependencies among latent variables, using Markov random fields (MRF) as an example. Further, we will discuss in detail how to efficiently solve the resulting optimization problems and consider special cases and relations to existing methods.
2.1 From SVDD to Latent-class Contextual Anomaly Detector
Support vector data description (SVDD) [35, 3, 9] is a well-known, efficient anomaly detector; it assumes, however, i.i.d. data without any latent variable dependencies or other complex dependency structure. The formulation of unconstrained SVDD in [10, Def. 3] will now be extended to allow arbitrary loss functions, ultimately yielding our latent-class contextual anomaly detector (LCCAD):

min_{c, R}  R^2 + (1/(νn)) ∑_{i=1}^n ℓ( ‖φ(x_i) − c‖^2 − R^2 ).

Here, setting ℓ to the hinge loss ℓ(t) = max(0, t) recovers the vanilla unconstrained SVDD. Moreover, we extend this model in the spirit of ClusterSVDD [10, Def. 4] by introducing a latent variable z_i for each example. Basically, the class assignment of each entry selects the corresponding SVDD. The resulting optimization neatly splits into a sum of K independent SVDDs once the set of latent variables H = (z_1, ..., z_n) is fixed:

min_{C, R, H}  ∑_{k=1}^K ( R_k^2 + (1/(νn)) ∑_{i: z_i = k} ℓ( ‖φ(x_i) − c_k‖^2 − R_k^2 ) ).
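As an illustration, the unconstrained hinge-loss SVDD above can be optimized by plain subgradient descent. The following is a minimal sketch with the identity feature map; the function and variable names are our own, not the paper's implementation:

```python
import numpy as np

def svdd_fit(X, nu=0.1, lr=0.01, epochs=500):
    """Minimal unconstrained SVDD with hinge loss (sketch):
    minimize R^2 + 1/(nu*n) * sum_i max(0, ||x_i - c||^2 - R^2)."""
    n, d = X.shape
    c = X.mean(axis=0)                               # center, initialized at the data mean
    r2 = np.mean(np.sum((X - c) ** 2, axis=1))       # squared radius R^2
    for _ in range(epochs):
        dist2 = np.sum((X - c) ** 2, axis=1)
        out = dist2 > r2                             # points with positive hinge
        # subgradients of the objective w.r.t. r2 and c
        g_r2 = 1.0 - out.sum() / (nu * n)
        if out.any():
            g_c = (2.0 / (nu * n)) * (c - X[out]).sum(axis=0)
        else:
            g_c = np.zeros(d)
        r2 = max(r2 - lr * g_r2, 0.0)
        c = c - lr * g_c
    return c, r2

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
c, r2 = svdd_fit(X, nu=0.5)
scores = np.sum((X - c) ** 2, axis=1) - r2           # > 0 means anomalous
```

At a stationary point, roughly a fraction ν of the points lies outside the hypersphere, matching the role of ν in the formulation above.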
Finally, we respect the dependency structure between the latent variables by using a convex combination of the above optimization problem with a log-linear model employing joint feature maps. The specifics of the structure are hereby encoded in the joint feature map ψ; we give an example based on Markov random fields in Section 2.3.
Definition 1 (Latent-class Contextual Anomaly Detector, LCCAD).
Given the predefined number of latent classes K, the fraction of outliers ν, and the regularizer trade-off η, the primal non-convex optimization problem of our proposed latent-class contextual anomaly detector (LCCAD) is given by:

min_{C, R, H, w}  (1 − η) ∑_{k=1}^K ( R_k^2 + (1/(νn)) ∑_{i: z_i = k} ℓ( ‖φ(x_i) − c_k‖^2 − R_k^2 ) ) − η ( ⟨w, ψ(X, H)⟩ − A(w) ),   (1)

where A(w) denotes the log-partition function.
2.2 A Probabilistic Interpretation
For the specific setting of LCCAD as given in Def. 1, namely for ν = 1 and ℓ being the hinge loss, we can derive a simple probabilistic interpretation in terms of a mixture of Gaussians (with additional dependency structure). Setting ν = 1 and ℓ(t) = max(0, t) results in all radii vanishing, R_k = 0 (cf. Lemma 3 in [10]), and the optimal centers of the SVDD part can be solved analytically as the class-conditional means c_k = (1/|{i : z_i = k}|) ∑_{i: z_i = k} φ(x_i) (cf. Theorem 2 in [10]). We arrive at

min_{C, H, w}  (1 − η) ∑_{i=1}^n ‖φ(x_i) − c_{z_i}‖^2 − η ( ⟨w, ψ(X, H)⟩ − A(w) ).   (2)

The above optimization problem is a combination of a mixture model (much like k-means) and a conditional random field, where the corresponding probabilistic model of the latter can be written as

p(H | X; w) = exp( ⟨w, ψ(X, H)⟩ − A(w) ).

We are interested in finding the most likely latent configuration, hence argmax_H p(H | X; w) (cf. Eqn (2)).
2.3 Inducing Structural Dependencies
Here, we give an example of how to induce structural dependencies in the case of Markov random fields (MRF), more specifically, conditional random fields (CRF); the only difference between the two is the log-partition function, which for the MRF normalizes over inputs and latent states jointly, while for the CRF it is conditioned on the input. Given an undirected graph G = (V, E) with binary edges E and vertices V, where each vertex corresponds to a sample and the state space is {1, ..., K}, we employ a joint feature map that stacks unary emission features with pairwise transition counts:

ψ(X, H) = ( ∑_{i ∈ V} 1[z_i = u] x_i )_{u=1..K} ⊕ ( ∑_{(i,j) ∈ E} 1[z_i = u, z_j = v] )_{u,v=1..K},

and hence the log-linear score decomposes as

⟨w, ψ(X, H)⟩ = ∑_{i ∈ V} ⟨w_{z_i}, x_i⟩ + ∑_{(i,j) ∈ E} w_{z_i, z_j}.
Joint feature maps were introduced in [23, 6] and are heavily used in structured output prediction, e.g. [36].
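To make this concrete, here is a small sketch of such a joint feature map, combining per-state sums of inputs (unary block) with pairwise transition counts over the graph edges (pairwise block); the function and variable names are ours, not the paper's:

```python
import numpy as np

def joint_feature_map(X, z, edges, K):
    """Sketch of an MRF joint feature map psi(X, H): unary emission
    features (one block of summed inputs per latent state) stacked
    with pairwise transition counts over the graph edges."""
    n, d = X.shape
    unary = np.zeros((K, d))
    for i in range(n):
        unary[z[i]] += X[i]               # sum of inputs assigned to each state
    pairwise = np.zeros((K, K))
    for (i, j) in edges:
        pairwise[z[i], z[j]] += 1         # transition/compatibility counts
    return np.concatenate([unary.ravel(), pairwise.ravel()])

# chain graph over 4 samples with K = 2 latent states
X = np.array([[0.0], [1.0], [2.0], [3.0]])
z = np.array([0, 0, 1, 1])
edges = [(0, 1), (1, 2), (2, 3)]
psi = joint_feature_map(X, z, edges, K=2)

# log-linear score <w, psi> for a weight vector w of matching length
w = np.ones_like(psi)
score = float(w @ psi)
```

The inner product of a weight vector with this feature map yields exactly one unary term per node and one transition term per edge, as in the decomposition above.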
2.4 Efficient Optimization
The key to making our model applicable in practice is an efficient optimization of the non-convex problem stated in Def. 1. A common scheme is to alternate between the blocks of variables, updating only one at a time given the current solutions of the remaining variables. In our case, this splits the optimization into three parts: (i) finding the most likely latent state configuration given the intermediate solutions of the latent-class SVDD and log-linear model parts, (ii) finding the optimal solution for the latent-class SVDD part given the current latent state configuration, and (iii) finding the optimal parameters of the log-linear model given the current latent state configuration. Luckily, subproblems (ii) and (iii) are convex, so any local solution is also globally optimal.
If ν = 1, the optimal centers of subproblem (ii) can be found analytically as the class-conditional means, c_k = (1/|{i : z_i = k}|) ∑_{i: z_i = k} φ(x_i).
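The alternating scheme for the ν = 1 special case, with the log-linear structure term switched off so that step (i) reduces to a nearest-center assignment, can be sketched as follows; the deterministic initialization and the names are illustrative choices:

```python
import numpy as np

def lccad_alternate(X, K, iters=20):
    """Alternating-optimization sketch for nu = 1 with the structure
    term switched off: (i) assign each point to its nearest center,
    (ii) recompute each center as the class-conditional mean (the
    analytic solution for nu = 1)."""
    X = np.asarray(X, dtype=float)
    # deterministic init: spread initial centers across the data indices
    centers = X[np.linspace(0, len(X) - 1, K).astype(int)].copy()
    for _ in range(iters):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        z = d2.argmin(axis=1)                 # step (i): latent states
        for k in range(K):                    # step (ii): analytic centers
            if np.any(z == k):
                centers[k] = X[z == k].mean(axis=0)
    return z, centers

# two well-separated groups of points
A = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
X = np.vstack([A, A + 10.0])
z, centers = lccad_alternate(X, K=2)
```

With the structure term switched off this degenerates to k-means, which is consistent with the special cases discussed in the Supplement.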
For subproblem (i) and tree-like structures, optimal solutions can be found using, e.g., belief propagation algorithms or linear programming approximations (cf. [38] for a comprehensive discussion). For arbitrary structures, loopy belief propagation (LBP) [39], where each latent state is sequentially updated given the states of its neighbors, is a powerful and fast approximate method. The algorithm works by iteratively sending messages m_{i→j}(z_j) from node i to each neighboring node j:

m_{i→j}(z_j) = α max_{z_i} exp( θ_i(z_i) + θ_{ij}(z_i, z_j) ) ∏_{k ∈ N(i) \ {j}} m_{k→i}(z_i),

where α is a normalization constant, N(i) denotes the set of neighboring nodes of node i, and θ_i and θ_{ij} are the unary and pairwise potentials induced by the model. After convergence, max-marginals can be computed as

μ_i(z_i) = exp( θ_i(z_i) ) ∏_{k ∈ N(i)} m_{k→i}(z_i).
Finally, backtracking using the max-marginals reveals the latent state per node. We empirically found that LBP approximations are fast, converging within a few iterations, and give reasonable results. The full optimization procedure is given in Algorithm 1. For relations to other anomaly detection algorithms and special cases, see the Supplement.
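For illustration, the message-passing scheme can be sketched in its max-product (log-domain) form on a chain, where it is exact; the potentials and names below are illustrative, not the paper's implementation:

```python
import numpy as np

def max_product_chain(theta_u, theta_p):
    """Max-product message passing on a chain (log domain, sketch).
    theta_u: (n, K) unary log-potentials; theta_p: (K, K) pairwise
    log-potentials shared across edges. Returns the MAP latent states
    by reading off the per-node max-marginals."""
    n, K = theta_u.shape
    fwd = np.zeros((n, K))   # messages passed left -> right
    bwd = np.zeros((n, K))   # messages passed right -> left
    for i in range(1, n):
        # max over previous state of (incoming message + unary + pairwise)
        fwd[i] = (fwd[i - 1] + theta_u[i - 1] + theta_p.T).max(axis=1)
    for i in range(n - 2, -1, -1):
        bwd[i] = (bwd[i + 1] + theta_u[i + 1] + theta_p).max(axis=1)
    max_marg = theta_u + fwd + bwd   # per-node max-marginals
    return max_marg.argmax(axis=1)

rng = np.random.default_rng(0)
theta_u = rng.normal(size=(5, 3))   # unary log-potentials for 5 nodes, K = 3
theta_p = rng.normal(size=(3, 3))   # shared pairwise log-potentials
z_map = max_product_chain(theta_u, theta_p)
```

On a chain this reduces to the Viterbi algorithm; on loopy graphs the same updates are iterated to convergence, as described above.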
3 Explaining Latent-class Contextual Anomalies
In addition to reliably detecting anomalies, it is important to explain and understand why a point has been detected as anomalous [24, 21, 15]. The explanation method we present here identifies to what extent each input feature is relevant to the decision. Denoting by o(x) the predicted outlierness of a data point x, we explain o(x) by producing a decomposition o(x) = ∑_d r_d, where r_d is the relevance of input variable d.
A number of techniques have been proposed for decomposing machine learning predictions in terms of input variables [29, 17, 2, 25, 33]. In the SVDD model of Section 2.1, the outlier scores to be explained are squared norms with an offset and a rectification function. While the squared norm has a simple additive structure, it applies to the feature space, the dimensions of which are not interpretable. Instead, we would like a decomposition in terms of the input variables. As a first step, we note that SVDD applied to the feature space of a Gaussian kernel is strictly equivalent to the one-class SVM with the same kernel [32]. Kauffmann et al. [15] proposed a method called one-class deep Taylor decomposition (OC-DTD) which explains one-class SVM predictions in terms of input variables. The method is based on the deep Taylor decomposition (DTD) explanation framework first developed in the context of deep neural network classifiers [25]. We describe the main idea of OC-DTD in the following. Let
f(x) = ∑_j α_j k(x, u_j)   (3)

be the SVM discriminant, where k is the kernel function associated to the feature map φ and the u_j are the support vectors. When k is Gaussian, we can measure the degree of outlierness by the monotonically decreasing function o(x) = −log f(x). OC-DTD first writes this function as a two-layer neural network: Layer 1 is a mapping to the effective distances h_j = ‖x − u_j‖^2/(2σ^2) − log α_j, where x − u_j is the scaled difference from the support vector and non-support vectors (α_j = 0) are effectively infinitely far. Layer 2 applies a soft min-pooling to these effective distances: o(x) = −log ∑_j exp(−h_j). That is, a point is an outlier if no support vector is nearby. Application of deep Taylor decomposition to this two-layer neural network yields the relevance scores:
r_d = ∑_j ( (x_d − u_{j,d})^2 / ‖x − u_j‖^2 ) · ( exp(−h_j) / ∑_{j'} exp(−h_{j'}) ) · o(x).   (4)

While we refer the reader to [15] for a derivation, inspecting the product structure in Eqn (4) gives the following interpretation: an input feature d is relevant to the anomaly if (1) it differs from the support vector more than other features, (2) that support vector is among the nearest support vectors, and (3) the point is an outlier within its assigned cluster.
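A small numeric sketch of this two-layer view and the resulting decomposition, following our reading of Eqns (3) and (4); the support vectors, coefficients, and bandwidth below are made-up values for illustration:

```python
import numpy as np

def ocdtd_relevance(x, sv, alpha, sigma):
    """Sketch of an OC-DTD style decomposition for a Gaussian-kernel
    one-class model: effective distances h_j, soft min-pooling into an
    outlierness score o(x), and per-feature relevances r_d that sum
    back to o(x) by construction."""
    diff2 = (x - sv) ** 2                                 # (m, d) squared differences
    h = diff2.sum(axis=1) / (2 * sigma**2) - np.log(alpha)
    o = -np.log(np.sum(np.exp(-h)))                       # soft min over support vectors
    p = np.exp(-h) / np.sum(np.exp(-h))                   # nearest-support-vector weights
    contrib = diff2 / diff2.sum(axis=1, keepdims=True)    # within-SV feature shares
    r = (p[:, None] * contrib).sum(axis=0) * o            # per-feature relevance
    return o, r

x = np.array([5.0, 0.0])                                  # query point (outlier)
sv = np.array([[0.0, 0.0], [0.5, 0.0], [0.0, 0.5]])       # made-up support vectors
alpha = np.array([0.5, 0.3, 0.2])                         # made-up SVM coefficients
o, r = ocdtd_relevance(x, sv, alpha, sigma=1.0)
```

Because the per-support-vector weights and the per-feature shares each sum to one, the relevances conserve the outlierness score, i.e. ∑_d r_d = o(x), mirroring the decomposition property stated above.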
4 Empirical Evaluation
We divide the empirical evaluation into three distinct parts. First, we verify the usefulness of our method in the described latent-class contextual anomaly setting, where we have full control over data generation and hence over the latent variables as well as the corresponding anomalies. We compare the anomaly detection accuracy of our method against several state-of-the-art anomaly detection methods. Furthermore, we test the ability to uncover the underlying cluster structure and compare the results against k-means clustering. In the second part, we apply our method to the well-known synthetic reservoir dataset. Here, we can only quantify the ability to identify the underlying clustering structure, as no ground-truth anomalies are known. Finally, we unleash the full potential of our methodology by applying it to a large-scale real-world reservoir data set where no ground truth is given, neither in the form of anomaly labels nor in the form of latent variables. We let a domain expert assess the solution and explanation produced by our method.
Throughout the empirical evaluation, we restrict the model to the special case described in Section 2.2. Furthermore, we measure anomaly detection accuracy by the area under the ROC curve (AUROC) and clustering accuracy by the adjusted Rand index (ARI). We empirically found that automatically adjusting the regularization hyperparameter η gives reasonable results across multiple applications. The use of Gaussian kernels (cf. [26] for an introduction to kernel methods) has been observed to achieve good results for one-class classification [37]. In order to use our model as defined in Def. 1, we employ a feature transformation that approximates Gaussian kernels [30]. We can rewrite the kernel machine as a neural network [15] and use our explanation method from Section 3.

Verification on Artificial Data
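The experiments rely on the Gaussian-kernel approximation mentioned above; it can be sketched with random Fourier features in the spirit of [30], where the number of features and the bandwidth below are illustrative choices:

```python
import numpy as np

def random_fourier_features(X, D, sigma, seed=0):
    """Random Fourier feature map approximating the Gaussian kernel
    k(x, y) = exp(-||x - y||^2 / (2 sigma^2)), so that z(x)^T z(y)
    is approximately k(x, y)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=1.0 / sigma, size=(d, D))   # frequencies from the kernel's spectral density
    b = rng.uniform(0, 2 * np.pi, size=D)            # random phases
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
Z = random_fourier_features(X, D=2000, sigma=1.0)
K_approx = Z @ Z.T                                   # explicit-feature kernel approximation
```

With an explicit feature map like this, the centers c_k of Def. 1 live in a finite-dimensional space, so the analytic mean updates can be computed directly.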
We test the ability to find anomalies and the corresponding clustering structure in latent-class contextual settings on an artificially generated two-dimensional dataset, shown in Figure 2 (left). The two classes are generated by two latent processes with additive Gaussian noise. The ground-truth anomaly score of a point is therefore defined as the amount of distortion that is due to the Gaussian noise.
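Since the exact generative details are specific to our figure, the following is a generic, hypothetical sketch of such a two-latent-class generator whose ground-truth anomaly score is the magnitude of the injected noise; the means and noise level are illustrative choices, not the paper's:

```python
import numpy as np

def make_toy_data(n=200, dist=4.0, noise=1.0, seed=0):
    """Hypothetical generator for the two-class toy setting: each point
    is drawn from one of two latent classes and perturbed by Gaussian
    noise; the ground-truth anomaly score is the norm of that noise."""
    rng = np.random.default_rng(seed)
    z = rng.integers(0, 2, size=n)                   # latent class per point
    means = np.array([[0.0, 0.0], [dist, 0.0]])      # illustrative class means
    eps = noise * rng.normal(size=(n, 2))            # Gaussian distortion
    X = means[z] + eps
    score = np.linalg.norm(eps, axis=1)              # ground-truth outlierness
    return X, z, score

X, z, score = make_toy_data(n=100)
```

Varying `dist` reproduces the overlap-to-separation sweep used for the clustering comparison, and thresholding `score` yields the ground-truth outlier labels for the AUROC evaluation.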
On this dataset, we expect a baseline anomaly model that ignores the latent classes (depicted by the colors red and blue) to identify as outliers only points outside the mixed distribution. The LCCAD approach, on the other hand, should find additional outliers located within the respective other class.
As a preliminary step, in order to verify the clusters built by LCCAD, we perform a comparison with the standard k-means clustering algorithm. The third plot of Figure 2 shows the adjusted Rand index for various distances between the means of the two latent classes. On the left, the classes overlap completely and no separation is possible; on the right, the classes are completely separated. We observe that LCCAD performs comparably to k-means in this task. For a fair comparison, we performed a nested cross-validation and show the performance on ten unseen test sets per distance for both methods.
Now that the clustering performed by LCCAD has been validated, we assess the ability of our method to detect anomalies reliably. The last plot of Figure 2 shows the AUROC for several outlier fractions. Also here, results are shown for ten unseen test sets per outlier fraction. Our proposed LCCAD method is clearly superior to all baselines in this task.
Evaluation on Synthetic Reservoir Data
In this section, we measure the ability to find grouped structures on synthetic reservoir data. We further use explanation methods to verify anomaly findings, due to the lack of ground-truth data. We use the synthetic 3D reservoir benchmark data set Stanford VI [4], which was created through realistic geological modeling. It contains two facies: the sand channels (blue in Fig. 6, top left) and the background shale (red). Due to the low vertical resolution of the seismic acquisition process [7], we simplify our setting by only considering connections within horizontal slices. Hence, we extract horizontal slices from each volume and assume that the whole impedance data is available as input (center and right images in the top row of Fig. 6). Since the facies are available as a ground-truth clustering for the data set, we calculated the ARI for k-means and LCCAD and found an index of 0.88 for both methods, without significant differences throughout the data set. This is no surprise if we look at the cross-plot of both available features in Figure 3. Our goal is to find extremal contextual (spatial) anomalies by inferring the latent class structures (facies).
We compare our findings (center row in Fig. 6) against the baseline competitors LODA, vanilla OC-SVM, and isolation forest (bottom row in Fig. 6), which find similar global anomalies (at different scales) within the shale facies as well as the sand channels. In contrast, the results of our method suggest, after an almost perfect reconstruction of the latent-class configuration, that there is only a single spot of anomalies within the shale facies that significantly deviates from the remainder of the data. Moreover, the decomposition into input feature slices (center row, center and right) according to our proposed explanation method suggests that seismic impedance is the origin of the anomalous signal. If we assume that the discovered latent states correspond to the ground truth, we find that some features are better suited for inference of latent states, since they contain fewer anomalies. Interestingly, the found anomalous spot has since been assessed by a geophysics expert to be an echo from a deeper sand structure.
Application to Realworld Reservoir Data
We apply our method to a real petroleum reservoir located off the coast of Brazil. It covers an area of approximately 100 square kilometers and reaches 460 meters in depth. The data in this region comprises a 3D volume of voxels containing acoustic and S-wave impedance samples. Figure 4 shows a cross-plot of the particular slice used in our analysis. The data contains ground-truth labels from only four wells, which is far too little supervision for any general-purpose machine learning method. Hence, any finding needs careful manual assessment by a domain expert.
For our analysis, we assume three latent states to be relevant for the detection of contextual anomalies; this also corresponds to the number of facies in this data set [18]. Similar to the results obtained on the simulated data, we can use the feature-wise explanation to analyze the origin of detected anomalies. Firstly, we see that the latent states capture ridge-like spatial structures which partially overlap with findings from related semi-supervised methods (cf. [18]). Secondly, one can see from the explanations that the anomalous fragments are mostly due to distortion in the S-wave impedance features. This is similar to our findings on the synthetic data set.
5 Conclusion
Analyzing real-world data requires rich models that can learn the underlying complexity and structure. Outliers can significantly disrupt this modeling process, and they are particularly harmful when few labels exist and the data has a rich structure including latent states. We have contributed to addressing these challenges by proposing a novel outlier detection model for structured data with intrinsic, unknown latent states. Toy data show the usefulness and unique capabilities of our novel model, providing a scenario where only our model can succeed, as effectively no competitors exist. To demonstrate that such scenarios exist and are relevant in the real world, we study geophysics data from oil exploration. Here we show that structured outliers with a latent state can help to accurately detect the unknown facies structure. An important additional finding is that we can adapt explanation techniques originally developed for supervised learning to our unsupervised structured outlier detection setting. This allows us not only to detect but also to explain and visualize why outliers are considered outliers by our model, which is immense progress for a geophysics practitioner.
Future work will explore this framework in other scientific applications beyond geophysics. In the medical domain in particular, outliers in complex data with latent states are of high interest; they may, e.g., be the responders to drugs or long-term survivors.
Acknowledgments
This work was supported by the German Research Foundation (GRK 1589/1), by the Federal Ministry of Education and Research (BMBF) under the BMBF project ALICE II, Autonomous Learning in Complex Environments (01IB15001B), and by the project Berlin Big Data Center (FKZ 01IS14013A).
References
 [1] C. C. Aggarwal. Outlier Analysis. Springer, 2013.
 [2] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, and W. Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLOS ONE, 10(7):e0130140, July 2015.
 [3] A. Banerjee, P. Burlina, and C. Diehl. A support vector method for anomaly detection in hyperspectral imagery. IEEE Transactions on Geoscience and Remote Sensing, 44(8):2282–2291, August 2006.
 [4] S. Castro, J. Caers, and T. Mukerji. The Stanford VI reservoir. 18th Annual Report, Stanford Center for Reservoir Forecasting (SCRF), pages 1–73, 2005.
 [5] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey. ACM Computing Surveys (CSUR), 41(3):1–58, 2009.

 [6] M. Collins. Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. In Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1–8, 2002.
 [7] C. V. Deutsch. Geostatistical Reservoir Modeling. Oxford University Press, 2002.
 [8] N. Görnitz, M. Braun, and M. Kloft. Hidden Markov Anomaly Detection. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pages 1833–1842, 2015.

 [9] N. Görnitz, M. Kloft, K. Rieck, and U. Brefeld. Toward Supervised Anomaly Detection. Journal of Artificial Intelligence Research (JAIR), 46:235–262, 2013.
 [10] N. Görnitz, L. A. Lima, K.-R. Müller, M. Kloft, and S. Nakajima. Support Vector Data Descriptions and k-Means Clustering: One Class? IEEE Transactions on Neural Networks and Learning Systems (TNNLS), 2017.
 [11] N. Görnitz, L. A. Lima, L. E. Varella, K.-R. Müller, and S. Nakajima. Transductive Regression for Data with Latent Dependency Structure. IEEE Transactions on Neural Networks and Learning Systems (TNNLS), 2017.
 [12] S. Harmeling, G. Dornhege, D. Tax, F. Meinecke, and K.-R. Müller. From outliers to prototypes: Ordering data. Neurocomputing, 69(13–15):1608–1618, August 2006.
 [13] J. Höhner, S. Nakajima, A. Bauer, K.-R. Müller, and N. Görnitz. Minimizing Trust Leaks for Robust Sybil Detection. In International Conference on Machine Learning (ICML), pages 1520–1528, July 2017.
 [14] P. J. Huber. Robust Statistics. Wiley, New York, 1981.
 [15] J. Kauffmann, K. Müller, and G. Montavon. Towards explaining anomalies: A deep taylor decomposition of oneclass models. CoRR, abs/1805.06230, 2018.
 [16] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In International Conference on Machine Learning (ICML), pages 282–289, 2001.
 [17] W. Landecker, M. D. Thomure, L. M. A. Bettencourt, M. Mitchell, G. T. Kenyon, and S. P. Brumby. Interpreting individual classifications of hierarchical networks. In IEEE Symposium on Computational Intelligence and Data Mining (CIDM 2013), Singapore, 16–19 April 2013, pages 32–38, 2013.

 [18] L. A. Lima, N. Görnitz, L. E. Varella, M. Vellasco, K.-R. Müller, and S. Nakajima. Porosity estimation by semi-supervised learning with sparsely available labeled samples. Computers & Geosciences, 106:33–48, 2017.
 [19] F. T. Liu, K. M. Ting, and Z.-H. Zhou. Isolation forest. In IEEE International Conference on Data Mining (ICDM), pages 413–422, 2008.
 [20] F. T. Liu, K. M. Ting, and Z.H. Zhou. IsolationBased Anomaly Detection. ACM Transactions on Knowledge Discovery from Data, 6(1):1–39, 2012.
 [21] N. Liu, D. Shin, and X. Hu. Contextual outlier interpretation. CoRR, abs/1711.10589, 2017.

[22]
J. MacQueen.
Some methods for classification and analysis of multivariate
observations.
In
Berkeley Symposium on Mathematical Statistics and Probability
. University of California Press, 1967.  [23] A. McCallum, D. Freitag, and F. Pereira. Maximum Entropy Markov Models for Information Extraction and Segmentation. In International Conference on Machine Learning (ICML), pages 591–598. Morgan Kaufmann, 2000.
 [24] B. Micenková, R. T. Ng, X. Dang, and I. Assent. Explaining outliers by subspace separability. In 2013 IEEE 13th International Conference on Data Mining, Dallas, TX, USA, December 7–10, 2013, pages 518–527, 2013.
 [25] G. Montavon, S. Lapuschkin, A. Binder, W. Samek, and K. Müller. Explaining nonlinear classification decisions with deep taylor decomposition. Pattern Recognition, 65:211–222, 2017.
 [26] K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf. An Introduction to Kernel-based Learning Algorithms. IEEE Transactions on Neural Networks (TNN), 12(2):181–201, 2001.
 [27] T. Pevny. Loda: Lightweight online detector of anomalies. Machine Learning, 102:275–304, 2016.
 [28] A. K. Porbadnigk, N. Görnitz, C. Sannelli, A. Binder, M. Braun, M. Kloft, and K.R. Müller. Extracting latent brain states — Towards true labels in cognitive neuroscience experiments. NeuroImage, 120:225–253, 2015.
 [29] B. Poulin, R. Eisner, D. Szafron, P. Lu, R. Greiner, D. S. Wishart, A. Fyshe, B. Pearcy, C. Macdonell, and J. Anvik. Visual explanation of evidence with additive classifiers. In Proceedings of the Twenty-First National Conference on Artificial Intelligence and the Eighteenth Innovative Applications of Artificial Intelligence Conference, July 16–20, 2006, Boston, Massachusetts, USA, pages 1822–1829, 2006.
 [30] A. Rahimi and B. Recht. Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning. In Advances in Neural Information Processing Systems (NIPS), pages 1313–1320, 2009.
 [31] B. Schölkopf, J. C. Platt, J. ShaweTaylor, A. J. Smola, and R. C. Williamson. Estimating the Support of a Highdimensional Distribution. Neural Computation, 13(7):1443–1471, 2001.
 [32] B. Schölkopf, J. C. Platt, J. ShaweTaylor, A. J. Smola, and R. C. Williamson. Estimating the support of a highdimensional distribution. Neural Computation, 13(7):1443–1471, 2001.
 [33] A. Shrikumar, P. Greenside, and A. Kundaje. Learning important features through propagating activation differences. In Proceedings of the 34th International Conference on Machine Learning (ICML 2017), Sydney, NSW, Australia, 6–11 August 2017, pages 3145–3153, 2017.
 [34] Y. Song, Z. Wen, C. Lin, and R. Davis. Oneclass conditional random fields for sequential anomaly detection. In International Joint Conference on Artificial Intelligence (IJCAI), 2013.
 [35] D. Tax and R. Duin. Support Vector Data Description. Machine Learning, 54:45–66, 2004.
 [36] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large Margin Methods for Structured and Interdependent Output Variables. Journal of Machine Learning Research (JMLR), 6:1453–1484, 2005.
 [37] R. Vert and J.P. Vert. Consistency and Convergence Rates of OneClass SVMs and Related Algorithms. Journal for Machine Learning Research (JMLR), 7:817–854, 2006.
 [38] M. J. Wainwright and M. I. Jordan. Graphical Models, Exponential Families, and Variational Inference. Foundations and Trends in Machine Learning, 1(1–2):1–305, 2008.
 [39] Y. Weiss and W. T. Freeman. Correctness of belief propagation in Gaussian graphical models of arbitrary topology. Neural Computation, 13(10):2173–2200, 2001.
Appendix A Supplement
A.1 Discussion, Relations, and Special Cases
Our LCCAD method performs unsupervised anomaly detection. It is inspired by the supervised algorithm of transductive conditional random field regression (TCRFR) [11], with LCCAD being easier to use as it has fewer free parameters. Moreover, the latent-class SVDD part of LCCAD is inspired by ClusterSVDD as given in [10].
Notable special cases of our LCCAD approach include vanilla SVDD [35], which is recovered when a single latent class is used and the log-linear structure term is switched off. Conditional random fields [16] are recovered when the SVDD part is switched off (however, without any provided latent states for parameter estimation). Moreover, with ν = 1, the hinge loss, and the structure term switched off, our proposed method LCCAD becomes equivalent to k-means [22].
LCCAD assumes that (i) a useful dependency structure is given and (ii) latent-class dependencies exist (cf. Fig. 1). Especially if (i) is not fulfilled, our method can easily become less accurate than its base model SVDD, since there is no structure information to capitalize on. Further, it is worth mentioning that contexts of out-of-sample data points cannot readily be inferred without retraining. Moreover, due to the increased complexity and non-convexity, the runtime of LCCAD is much higher than that of vanilla SVDD.