
Towards Explaining Anomalies: A Deep Taylor Decomposition of One-Class Models

A common machine learning task is to discriminate between normal and anomalous data points. In practice, it is not always sufficient to reach high accuracy at this task; one would also like to understand why a given data point has been predicted in a certain way. We present a new principled approach for one-class SVMs that decomposes outlier predictions in terms of input variables. The method first recomposes the one-class model as a neural network with distance functions and min-pooling, and then performs a deep Taylor decomposition (DTD) of the model output. The proposed One-Class DTD is applicable to a number of common distance-based SVM kernels and is able to reliably explain a wide set of data anomalies. Furthermore, it outperforms baselines such as sensitivity analysis, nearest neighbor, or simple edge detection.


1 Introduction

Novelty detection, or outlier detection, is a well-studied and well-formalized machine learning problem with numerous practical applications. One such application is intrusion detection in computer systems, where data points are typically digital messages transmitted over a network, and messages that are detected as outliers are considered likely to carry a threat (Denning, 1987; Görnitz et al., 2013). Another application is obstacle detection in autonomous car driving (Häne et al., 2015). The ability to detect outliers is also important in scientific applications, where points detected as such are intrinsically more interesting than inliers, and should therefore be given more attention (Zhang et al., 2004; Laurikkala et al., 2000). A number of techniques can be used for outlier detection (Day, 1969; Hinton, 2002; Pearl, 1988; Schölkopf et al., 1999; Tax and Duin, 2004). In practice, it is not only important to detect outliers and inliers with high accuracy; one would also like to be able to explain why a machine learning model considers a sample an inlier or an outlier. Interpretable explanatory feedback can be used by a human operator for appropriate decision making: the data point could either be considered benign and possibly incorporated into the dataset, or appropriate action might be taken. The problem of outlier explanation is shown schematically in Figure 1. A dataset, here one class from the MNIST data set of handwritten digits, is fitted by a one-class model from which outlier scores can be obtained. These scores must then be traced back to interpretable quantities such as the input variables of the model.

Interpretability of machine learning models has received growing attention, especially in scientific applications (Schütt et al., 2017; Kraus et al., 2016; Sturm et al., 2016; Hansen et al., 2011; Vidovic et al., 2017) and for systems that interact with humans (Caruana et al., 2015; Kamarinou et al., 2016; Lipton, 2017; Bojarski et al., 2017). A number of generally applicable techniques for interpreting machine learning models have been proposed Simonyan et al. (2013); Erhan et al. (2009); Zeiler and Fergus (2014); Bach et al. (2015); Caruana et al. (2015). Most of them have been developed in the context of supervised learning.

Figure 1: Illustration of the outlier detection and explanation setting. Left: data is generated from an unknown distribution; we are, for example, interested in potential outliers. Middle: unsupervised machine learning techniques estimate the data-generating distribution and assign an outlier score to unlikely data points. Right: our explanation method assigns a relevance score to every input variable that reflects the contribution of that variable to the model decision. We apply dithering to all heatmaps for printing reliability.

Therefore, the present work addresses the lack of interpretability of unsupervised machine learning models, and provides a practical solution in the context of one-class SVMs for outlier detection. We will first argue that the problems of explaining inlier and outlier decisions are qualitatively different and need to be treated in distinct ways. Inliers are best explained in terms of contributions of support vectors, whereas outliers are better explained in terms of contributions of input variables. We propose fairly general conditions for inlierness and outlierness that can be reconciled with many common models.

This will be reflected in the identification of two distinct compositions of the one-class SVM. The first performs a sum-pooling over similarity scores; this architecture enables the interpretation of inlierness. The second performs a min-pooling over distances, which provides an interpretation of outlierness. In particular, we will propose in this paper a deep Taylor decomposition / integrated gradients approach (Montavon et al., 2017; Sundararajan et al., 2017). The proposed method can be applied to a number of outlier detection models, namely those of RBF type Knorr et al. (2000); Tax and Duin (2004); Harmeling et al. (2006); Bishop (1994); Hoffmann (2007); Yang et al. (2009); Breunig et al. (2000); Tax and Duin (1998). For that, the model does not have to be modified, and neither re-training nor access to the training data is required for the presented explanation method. Instead, only the detection model needs to be known and an appropriate measure of outlierness has to be constructed. The latter will be formally defined in Section 3. In Section 8, we will show empirically that the proposed technique provides meaningful explanations.

1.1 Related work

A number of studies have considered the problem of outlier explanation: Schwenk and Bach (2014) applied structured one-class SVMs to explaining anomalies in MediaCloud applications, and proposed a technique to decompose their predictions in terms of input variables for sum-decomposable kernels. We extend this previous work by proposing a Taylor-based decomposition framework applicable to various non-decomposable RBF-type kernels. Liu et al. (2017) use the decision of a complex outlier detection model to train a set of simple detectors that separate outliers linearly from clusters of nearby training patterns. Subsequently, the linear weights are used for the interpretation of the outlier.

Micenková et al. (2013) heuristically remove features from detected outliers and return a subset of features that maximizes the separability of the outlier from the surrounding training patterns. These methods rely on (1) the existence of a hypothetical outlier class that is approximated by sampling in the vicinity of the supposed outliers, and (2) access to the training data in the explanation stage. On the other hand, the methods are implementation-invariant and model-agnostic and can be applied to any outlier detection model.

We take a different approach, in which we look at the model as a mathematical function and identify marginal contributions of the input variables to the produced detection score.

2 One-Class SVM

In one-class learning, we are trying to separate patterns that are generated by one common distribution from the rest of the input domain. Schölkopf et al. (1999) proposed the one-class SVM as an algorithm that learns the tails of a high-dimensional distribution, which is sufficient for the separation task. For a set of training data $x_1, \dots, x_n$ and some feature map $\Phi$, the primal one-class SVM problem takes the form

$$\min_{w \in F,\; \rho \in \mathbb{R},\; \xi \in \mathbb{R}^n} \quad \frac{1}{2}\|w\|^2 - \rho + \frac{1}{\nu n} \sum_{i=1}^{n} \xi_i$$
$$\text{subject to} \quad \langle w, \Phi(x_i) \rangle \geq \rho - \xi_i, \quad \forall i = 1, \dots, n, \qquad \xi_i \geq 0,$$

where $\nu$ controls the fraction of outliers that are extracted by the model Schölkopf et al. (1999). Given an explicit map with interpretable features (e.g. BoW or pixel intensities), the one-class SVM is readily interpretable in feature space by means of the linear weight vector $w$. For RBF kernels, the optimization is performed in the dual formulation, which does not provide an explicit representation of the weight vector, but a set of Lagrangian multipliers, taken as coefficients of radial basis functions centered at the support vectors. Let $k(x, x')$ be an RBF kernel that acts on the distance between two points and produces a large output for patterns that are similar and a small output for distinct patterns. The one-class SVM extracts a small set of support vectors $u_1, \dots, u_m$ from the training set, together with coefficients $\alpha_j > 0$, such that

$$f(x) = \sum_{j=1}^{m} \alpha_j\, k(x, u_j)$$

is large for data points that are typical in terms of the training data and the chosen similarity measure $k$. For anomalous points, $f(x)$ will be small.
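
As an illustration, the following minimal sketch evaluates such a discriminant for a Gaussian kernel; the support vectors and coefficients are random stand-ins here, whereas in practice they would come from a trained one-class SVM (e.g. the support_vectors_ and dual_coef_ attributes of scikit-learn's OneClassSVM).

```python
import numpy as np

def gaussian_kernel(x, u, gamma=1.0):
    """RBF kernel k(x, u) = exp(-||x - u||^2 / gamma)."""
    return np.exp(-np.sum((x - u) ** 2, axis=-1) / gamma)

def discriminant(x, support_vectors, alphas, gamma=1.0):
    """One-class SVM discriminant f(x) = sum_j alpha_j k(x, u_j):
    large for typical points, small for anomalous ones."""
    return np.dot(alphas, gaussian_kernel(x, support_vectors, gamma))

# Toy stand-ins: 20 support vectors near the origin, uniform coefficients.
rng = np.random.default_rng(0)
U = rng.normal(0.0, 1.0, size=(20, 2))
a = np.full(20, 1.0 / 20)

print(discriminant(np.zeros(2), U, a))       # inlier-like point: large f(x)
print(discriminant(np.full(2, 10.0), U, a))  # far-away point: f(x) close to 0
```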

3 Inlierness and Outlierness

Having introduced the one-class SVM, we now take a more abstract look at the problem. We will characterize inliers and outliers by answering: (1) what is an appropriate compositional structure for these two quantities, and (2) how should inlierness and outlierness be quantified. Our answers to these questions will provide a theoretical basis for the design of our explanation method described in Sections 5 and 6.

3.1 Modeling Inlierness and Outlierness

Figure 2: The compositional structure of inlier and outlier decisions is substantially different. Left: inlier decisions are characterized by a max-pooling over similarity activations; Right: outlier decisions are a min-pooling over distance activations.

Any complex prediction task requires a set of function classes to choose from. These functions are preselected based on some prior knowledge about the problem, and incorporate properties such as linearity, smoothness, and more general types of equivariances or invariances. Practically, these function classes can be implemented by a model which can be, for example, a composition of multiple layers.

The compositional structure of the model differs substantially between the types of prediction tasks. For example, a model that detects “airplanes” in an image will typically consist of multiple detectors that test for the presence of an airplane at various locations in the image. The detection decision can be expressed as: “Decide ‘airplane’ if any airplane template is matched.”

An appropriate architecture for this problem would therefore be a collection of similarity functions in the first layer, followed by a max-pooling operation in the second layer. This structure of the prediction function is prototypical for state-of-the-art classification architectures such as the deep convolutional neural network, where detection layers are interleaved with max-pooling layers (Boureau et al., 2010).

Max-pooling architectures are also particularly suitable for the problem of detecting inliers. A first layer will detect the similarity to every individual airplane in the data, and a second layer will retain the maximum of the similarity scores obtained in the previous layer. Here, each airplane detector measures the similarity to an airplane or a group of airplanes in the data. The inlier decision can in that case also be expressed as "Decide 'inlier' if any airplane template is matched.", i.e. in the same way as for the detection task. An appropriate composition of the inlierness function is therefore of type $x \mapsto \max_j s_j(x)$, where the first layer maps the input data to the similarity scores $s_j(x)$, and the second layer applies some max-pooling operation or a soft variant of it. (Typical soft variants of max-pooling are the sum, the $p$-norm, the arithmetic mean, or the log-sum-exp; henceforth, we refer to these functions as soft max-pooling.) This structure is visualized in Figure 2 (left).

On the other hand, if we were using the same max-pooling approach for outlier detection, one would need to build as many detectors as there are possible inputs without an airplane, and there is an exponential number of them. Instead, the outlier detection problem is better expressed as follows: "Decide 'outlier' if all airplane templates are unmatched." In that case, the first layer models distance functions, and the second layer becomes a min-pooling operation. An appropriate composition of the outlierness function will be of type $x \mapsto \min_j d_j(x)$, where the first layer maps the input data to the distances $d_j(x)$, and the second layer applies some min-pooling operation or a soft variant of it. This new structure is visualized in Figure 2 (right).
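
The asymmetry between the two compositions can be made concrete with a small sketch (ours, with arbitrary 2D "templates" standing in for the detectors):

```python
import numpy as np

# Three templates standing in for the learned detectors.
templates = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])

def inlierness(x, gamma=1.0):
    # Layer 1: similarity to each template; Layer 2: max-pooling.
    sims = np.exp(-np.sum((x - templates) ** 2, axis=1) / gamma)
    return sims.max()

def outlierness(x):
    # Layer 1: distance to each template; Layer 2: min-pooling.
    dists = np.linalg.norm(x - templates, axis=1)
    return dists.min()

for point in [np.array([0.1, 0.1]), np.array([5.0, 5.0])]:
    print(point, inlierness(point), outlierness(point))
```

A point close to any template receives a high inlierness and a low outlierness; a point far from all templates receives the opposite.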

3.2 Quantifying Inlierness and Outlierness

In problems such as classification and regression, the output of the model can be readily interpreted, e.g. as the probability of membership to a given class, or as the expected value of the target variable, respectively. When using, e.g., a one-class SVM, such an interpretation is not obvious: the discriminant function $f(x)$ does provide an ordering from the most to the least anomalous point (cf. Harmeling et al. (2006)); however, it only answers which of two data points is the more anomalous, not what the absolute level of anomaly of a given data point is. We propose the following axiomatic definitions for inlierness and outlierness, and then briefly discuss how common machine learning outlier detection models fulfill or violate these definitions:

Definition 1.

A measure of inlierness $s(x)$ must fulfill the following two conditions for all $x \in \mathbb{R}^d$:

  1. It is bounded by zero and some positive number $C$: $0 \leq s(x) \leq C$, and

  2. It converges asymptotically to zero: $\lim_{\|x\| \to \infty} s(x) = 0$.

For example, the Gaussian mixture model, which is sometimes used for inlier/outlier detection (e.g. Tax and Duin (1998)), associates to each input point a probability score representing the likelihood of that point being generated from the underlying distribution (Bishop, 2006). This probability score is bounded between 0 and 1 and converges to 0 when moving away from the data. Thus, these probability scores fulfill our definition of inlierness. Similarly, the discriminant function of the one-class SVM with RBF kernel is upper bounded by the kernel bound, and converges to zero as we move away from the data.

These quantities are however not suitable as an outlierness model: they asymptote to zero as $x$ moves away from the data, which does not capture the fact that the degree of outlierness continues to increase. Outlierness is instead better defined by the following set of axioms:

Definition 2.

A function $o(x)$ is called a measure of outlierness if it fulfills the following conditions for all $x \in \mathbb{R}^d$:

  1. It is lower bounded by zero: $o(x) \geq 0$, and

  2. It converges asymptotically with the Euclidean norm: $o(x) / \|x\|^q \to c$ as $\|x\| \to \infty$, for some $q > 0$ and some $c > 0$.

To reflect the Euclidean geometry of the input space, the norm in the denominator will be assumed to be a 2-norm. Examples of functions that satisfy Definition 2 are the distance to the mean, or the negative log-likelihood under an isotropic probability distribution, e.g. an isotropic Gaussian. These functions are typically used in machine learning for measuring error.

As a counter-example, the negative log-likelihood of a general Gaussian distribution learned from the data does not satisfy Definition 2: it is not suitable for measuring outlierness, as the learned covariance overrides the natural metric of the input space on which the outlier decision should be based.

Having defined inlierness and outlierness, we now provide measures for the one-class SVM of interest in this paper. These measures are based on the discriminant defined above. In general, there may be more than one measure of inlierness or outlierness, and we shall here apply a principle of parsimony.

Exponential kernels

The first class of kernels we consider are exponential kernels, which can be parametrized as $k(x, x') = \exp(-\|x - x'\|^q / \gamma)$. Parameter $\gamma$ is the bandwidth of the kernel. For $q = 1$, the kernel is called Laplacian, for $q = 2$ Gaussian. The simplest measures of inlierness and outlierness that satisfy Definitions 1 and 2 would be:

(1)

A proof that the outlierness meets Definition 2 can be found in B.1.

$t$-Student kernels

The second class of kernels that we consider are $t$-Student kernels, $k(x, x') = 1 / (1 + \|x - x'\|^q)$. The parameter $q$ is positive and often set to 1. When the norm is also scaled by a bandwidth, the kernel is also referred to as the Cauchy kernel. Inlierness and outlierness will be measured by the following functions:

(2)

A proof of the agreement of $o(x)$ with Definition 2 is in B.1.
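
For reference in the later sections, the following sketch implements the two kernel families as plain functions; the exact parametrizations (bandwidth placed inside the exponent, unit scale in the $t$-Student denominator) are our reading of the text and may differ in detail from the paper's own equations.

```python
import numpy as np

def exponential_kernel(x, u, q=2, gamma=1.0):
    """k(x, u) = exp(-||x - u||^q / gamma); q = 1 Laplacian, q = 2 Gaussian."""
    return np.exp(-np.linalg.norm(x - u, axis=-1) ** q / gamma)

def t_student_kernel(x, u, q=1):
    """k(x, u) = 1 / (1 + ||x - u||^q); with a bandwidth-scaled norm this
    family is also referred to as the Cauchy kernel."""
    return 1.0 / (1.0 + np.linalg.norm(x - u, axis=-1) ** q)

x, u = np.array([1.0, 2.0]), np.array([0.0, 0.0])
print(exponential_kernel(x, u), t_student_kernel(x, u))
```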

4 Explaining Machine Learning Decisions

In this section, we review several techniques to explain the predictions of a machine learning classifier in terms of input variables. Let $x \in \mathbb{R}^d$ be an input example and $f(x)$ be its prediction, where $f$ is a function learned from the data. The goal of an explanation is to assign a relevance score $R_i$ to each feature $x_i$ that reflects the importance of that feature for the prediction.

Sensitivity Analysis

The simplest technique for explanation is to attribute relevance to the input variables to which the prediction is locally most sensitive (Zurada et al., 1994; Gevrey et al., 2003; Baehrens et al., 2010). That is, for a given prediction, we define the importance score for each input variable as $R_i = \left(\frac{\partial f}{\partial x_i}\right)^2$, i.e. the squared locally evaluated partial derivatives. A limitation of sensitivity analysis is that it is an explanation of the function variation rather than of the function value. Considering a simple distance norm as the outlier function, we observe that the gradient does not grow with the distance, implying that sensitivity analysis does not capture the amount of outlierness that a pattern holds. Another observation is that the gradient vanishes between modes of the data, so that sensitivity analysis assigns zero importance to variables at a local maximum of outlierness. These weaknesses of sensitivity analysis have led to the development of more precise explanation techniques, which we take up in the following.
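
As a concrete illustration (our sketch, using a finite-difference gradient for simplicity), sensitivity analysis reduces to squaring the locally evaluated partial derivatives:

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-5):
    """Central-difference approximation of the gradient of f at x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x); e[i] = eps
        grad[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return grad

def sensitivity_scores(f, x):
    """R_i = (df/dx_i)^2, evaluated locally at x."""
    return numerical_gradient(f, x) ** 2

# Example outlier function: distance to a single mode at the origin.
o = lambda z: np.linalg.norm(z)
print(sensitivity_scores(o, np.array([3.0, 0.0])))
print(sensitivity_scores(o, np.array([300.0, 0.0])))
```

Both calls return the same scores even though the second point is a hundred times farther from the mode, which is exactly the limitation discussed above.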

Simple Taylor Decomposition

Taylor decomposition Bazen and Joutard (2013); Bach et al. (2015) seeks to determine the importance of input variables for a certain prediction by performing an expansion of the function at a certain reference point $\tilde{x}$:

$$f(x) = \underbrace{f(\tilde{x})}_{(1)} + \underbrace{\sum_i \left.\frac{\partial f}{\partial x_i}\right|_{\tilde{x}} (x_i - \tilde{x}_i)}_{(2)} + \underbrace{\varepsilon}_{(3)}$$

It then identifies as importance for a given variable the various terms of the expansion that are bound to it. In the equation above, (1) is the function value at the reference point, (2) contains the linear contributions, and (3) contains all higher-order terms, including interdependence relations between input variables. Simple Taylor decomposition focuses on term (2), where each summand is bound to a given input variable. Thus, we define the relevance scores for the prediction as

$$R_i = \left.\frac{\partial f}{\partial x_i}\right|_{\tilde{x}} (x_i - \tilde{x}_i).$$

In our analysis, we will choose functions and reference points such that term (1) is zero, i.e. contains no information on the model's prediction, and term (3) is small. In that case, we obtain the relevance conservation property $\sum_i R_i \approx f(x)$, which guarantees that the explanation matches in magnitude the amount of predicted inlierness or outlierness. A limitation of simple Taylor decomposition is the need to find a root point $\tilde{x}$ in the vicinity of $x$, which can be time-consuming. Further, the reference point might jump to a different mode as the input pattern moves from one mode of the distribution to another. This may cause two nearly equivalent data points with nearly equivalent predictions to receive different explanations. Stated differently, the explanation as a function of $x$ is discontinuous.
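
A minimal numeric sketch of the first-order term (reusing the finite-difference gradient from above; the root points are hand-picked for illustration):

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-5):
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x); e[i] = eps
        grad[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return grad

def simple_taylor_relevance(f, x, x_root):
    """R_i = df/dx_i |_{x_root} * (x_i - x_root_i)."""
    return numerical_gradient(f, x_root) * (x - x_root)

x = np.array([3.0, 1.0])

# Linear function with root at the origin: conservation holds exactly.
w = np.array([2.0, -1.0])
f = lambda z: w @ z
R = simple_taylor_relevance(f, x, np.zeros(2))
print(R, R.sum(), f(x))      # sum of relevances equals f(x)

# Squared distance with the root at its minimum: the first-order term is
# zero and the ignored higher-order term carries the whole function value.
o = lambda z: np.sum(z ** 2)
print(simple_taylor_relevance(o, x, np.zeros(2)).sum(), o(x))
```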

Integrated Gradients

Another approach for setting importance scores of inputs to a prediction has been proposed by Sundararajan et al. (2017). For some reference point $\tilde{x}$, a prediction is explained by summing over a finite number of small steps of first-order simple Taylor decompositions between the input $x$ and the reference point $\tilde{x}$. In the limit, the attribution can be written in terms of the integral

$$R_i = (x_i - \tilde{x}_i) \int_0^1 \frac{\partial f}{\partial x_i}\big(\tilde{x} + t\,(x - \tilde{x})\big)\, dt,$$

which typically has to be evaluated numerically, but may also have an analytical solution for simpler models. Like for simple Taylor decomposition, one needs to choose an appropriate reference point. An advantage of integrated gradients is the absence of second- and higher-order residual terms. In Section 6.3 we apply the method to convex functions for which the integral has an analytical solution.
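
A numerical sketch of this attribution, approximating the integral with a midpoint Riemann sum (reference point assumed given):

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-5):
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x); e[i] = eps
        grad[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return grad

def integrated_gradients(f, x, x_ref, steps=100):
    """R_i = (x_i - x_ref_i) * integral_0^1 df/dx_i(x_ref + t (x - x_ref)) dt,
    approximated by averaging gradients along the straight path."""
    ts = (np.arange(steps) + 0.5) / steps
    grads = np.stack([numerical_gradient(f, x_ref + t * (x - x_ref)) for t in ts])
    return (x - x_ref) * grads.mean(axis=0)

# For a convex quadratic the attribution is exact up to discretization error.
o = lambda z: np.sum(z ** 2)
x, x_ref = np.array([3.0, 1.0]), np.zeros(2)
R = integrated_gradients(o, x, x_ref)
print(R, R.sum(), o(x) - o(x_ref))   # completeness: sum(R) = f(x) - f(x_ref)
```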

Deep Taylor Decomposition

Deep Taylor decomposition (DTD) is a method for decomposing the prediction of a neural network on its input variables Montavon et al. (2017). The decomposition is obtained by propagating the model output into the neural network graph by means of redistribution rules, until the input variables are reached. As such, it belongs to a broader class of propagation techniques Bach et al. (2015); Landecker et al. (2013); Shrikumar et al. (2017); Zhang et al. (2016). A distinctive feature of DTD is that the propagation rules are derived from a Taylor decomposition performed at each neuron of the network.

The decomposition process starts from the top neuron, whose activation is redistributed into relevance scores of neurons in the previous layer. The previous-layer relevance scores are then expressed as a function of the activations of the layer before, which enables another step of redistribution. The Taylor decomposition process is iterated from the top layer down to the input layer, where the decomposition in each layer has a closed form for known compositions. The procedure ultimately leads to a relevance score for each input variable. Like the forward pass or standard gradient propagation, DTD can be computed quickly.

The original DTD method uses Taylor decomposition as a unit of explanation at each neuron. However, our adaptation of DTD in the context of one-class SVM leads to the observation that, for certain neuron types, e.g. mapping on the kernel basis function, this unit of explanation can be advantageously substituted by other analyses such as integrated gradients. Overall, the method we present in this paper generalizes DTD to a “deep decomposition” where we use standard Taylor decomposition or integrated gradients as unit of explanation at various layers.

Other methods

A number of other methods have been proposed for explanation, including methods based on locally sampling the decision function Ribeiro et al. (2016), local perturbations Zeiler and Fergus (2014); Zintgraf et al. (2017), other types of propagation techniques Zeiler and Fergus (2014); Springenberg et al. (2014), as well as explanation methods supported by specific choices of architectures Zhou et al. (2016); Caruana et al. (2015).

5 Explaining Inlierness

Figure 3: Neural network equivalent of the one-class SVM for inlier detection, and the relevance redistribution from the top layer to the intermediate layer.

In this section, we present the decomposition of the measure of inlierness defined in Section 3.2. As argued in Section 3, inlierness is best modeled by a detection/max-pooling architecture. Such an architecture is common in convolutional neural networks, where max-pools are composed of outputs from different detectors that were applied to the same lower-level features. A two-layer neural network that implements the measure of inlierness is given by:

layer 1: $h_j = \alpha_j\, k(x, u_j)$
layer 2: $s(x) = \sum_j h_j$

where the first layer computes the weighted similarities to the support vectors measured by the kernel, and the second layer performs a sum-pooling, which can be viewed as a soft variant of max-pooling. Deep Taylor decomposition applies as a first step the decomposition of the output on the first-layer activations, which we call "effective similarities" due to the weighting term $\alpha_j$. The two-layer architecture and the process of relevance redistribution from the top layer to the intermediate layer is shown in Figure 3.

The input (e.g. a handwritten digit) is first propagated into the neural network to compute the inlier score. Then, this score is redistributed from the top layer to the hidden layer, which gives a decomposition of inlierness in terms of support vectors. Technically, we perform a Taylor expansion of the inlier score as a function of the hidden-layer activations $h = (h_j)_j$. Relevance scores are then given by:

$$R_j = \left.\frac{\partial s}{\partial h_j}\right|_{\tilde{h}} (h_j - \tilde{h}_j) \qquad (3)$$

Due to the linearity of the sum-pooling function, there is no second-order term. In order to satisfy the conservation property $\sum_j R_j = s(x)$, we further need to have $s(\tilde{h}) = 0$, i.e. we need to perform the Taylor expansion at a root point of the function. Here, we choose the root point $\tilde{h} = 0$, because it is the only admissible root point in the space of activations. The deep Taylor decomposition method Montavon et al. (2017) does not require the root point to have a pre-image in the lower layer; however, in this particular case, one can still interpret the segment $[\tilde{h}, h]$ as moving in some direction orthogonal to the data manifold in the input domain.

Injecting the root point in Equation 3 gives the relevance score

$$R_j = h_j = \alpha_j\, k(x, u_j).$$

That is, the relevance of support vector $u_j$ corresponds to its hidden-neuron activation $h_j$. This operation can be interpreted as a "max-takes-most" redistribution.

We now ask if it is sufficient for explanation to stop at this layer, or if relevance should be further propagated to the input variables. For this, consider the simplest inlier model, composed of a single support vector $u$, and consider the most inlier point $x = u$. At this location, it is easy to conclude that $u$ has contributed to the inlierness of $x$; however, because the kernel is of RBF type and $x$ lies at the maximum, it is impossible to assign a directional explanation for this inlierness: the function looks exactly the same along each direction. Based on this prototypical example, one concludes that explanations for inlierness are better given in terms of support vectors than input directions.
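
The resulting redistribution is simple enough to state directly in code: the inlier score is split over the support vectors, each receiving exactly its effective similarity. A minimal sketch under the same Gaussian-kernel assumptions as before:

```python
import numpy as np

def inlier_relevances(x, support_vectors, alphas, gamma=1.0):
    """R_j = alpha_j k(x, u_j): each support vector receives its own
    effective similarity as relevance ("max-takes-most")."""
    k = np.exp(-np.sum((x - support_vectors) ** 2, axis=1) / gamma)
    return alphas * k

rng = np.random.default_rng(0)
U = rng.normal(size=(20, 2))
a = np.full(20, 1.0 / 20)
x = np.array([0.2, -0.1])

R = inlier_relevances(x, U, a)
print(R.sum())       # equals the inlier score s(x) by construction
print(R.argmax())    # the most relevant support vector is the most similar one
```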

6 Explaining Outlierness

As discussed in Section 3, outlier detection is more naturally described as a min-pooling over local distances. Unlike the explanation of inliers, the analysis here will depend on the choice of kernel. For each family of kernels, one needs to find a suitable model composition and appropriate root points for the explanation. In this section, two classes of kernels are considered. These kernels are frequently encountered in practical applications.

6.1 $t$-Student Kernels

The first kernel we focus on is the generalized $t$-Student kernel, given by $k(x, x') = 1 / (1 + \|x - x'\|^q)$. We compute the one-class SVM discriminant $f(x) = \sum_j \alpha_j k(x, u_j)$ and apply the measure of outlierness $o(x)$ proposed in Section 3.2 for this kernel. The measure of outlierness can be implemented by the following two-layer neural network (see B.2 for a proof):

layer 1: $h_j = \big(1 + \|x - u_j\|^q\big) / \alpha_j$
layer 2: $o(x) = \big(\sum_j h_j^{-1}\big)^{-1}$

The first layer can be interpreted as a mapping to the effective distances from each support vector. By effective distance, we mean the distance as perceived by the data point $x$, i.e. modulated by the support vector coefficients $\alpha_j$. The second layer computes the harmonic mean of the effective distances, which implements a soft min-pooling.

We would now like to redistribute the output to the lower layer. We let $o(x)$ depend on the hidden-layer activations $h_j$ so that a Taylor decomposition can be performed on the previous layer. Specifically, we choose a root point of $o$ and perform a first-order Taylor decomposition at that point. It can be shown that the higher-order terms sum to zero in the Taylor expansion for that root (see B.3). Relevance scores are given by:

$$R_j = h_j \left(\frac{o(x)}{h_j}\right)^2 \qquad (4)$$

where $h_j$ is the first-layer activation representing the effective distance, and $(o(x)/h_j)^2$ is a factor that only retains support vectors that are active in the min-pooling operation (i.e. those with the lowest effective distance). In the input domain, this factor can be interpreted as a localization term. A large relevance score is therefore the result of a large effective distance $h_j$ that is nevertheless low in comparison to the other effective distances in the pool. In B.3, we show that the decomposition is conservative, i.e. $\sum_j R_j = o(x)$. In Section 6.3 we will show how to redistribute $R_j$ to the input layer.
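
The following numeric sketch illustrates this decomposition under our reading of the section, i.e. assuming effective distances $h_j = (1 + \|x - u_j\|^q)/\alpha_j$, the harmonic-mean pooling, and relevances $R_j = h_j\,(o(x)/h_j)^2$; the exact form of Equation 4 may differ in the paper, but this instance is conservative and min-take-most as described.

```python
import numpy as np

def t_student_outlier_decomposition(x, support_vectors, alphas, q=1):
    # Layer 1: effective distances (distances modulated by the coefficients).
    h = (1.0 + np.linalg.norm(x - support_vectors, axis=1) ** q) / alphas
    # Layer 2: harmonic-mean-style pooling, a soft min over effective distances.
    o = 1.0 / np.sum(1.0 / h)
    # Relevance: large effective distance, discounted when other support
    # vectors are much closer ("min-take-most").
    R = h * (o / h) ** 2
    return o, R

rng = np.random.default_rng(1)
U = rng.normal(size=(10, 2))
a = np.full(10, 0.1)
x = np.array([4.0, 4.0])              # a far-away (outlier) point

o, R = t_student_outlier_decomposition(x, U, a)
print(o, R.sum())                     # conservation: sum_j R_j equals o(x)
print(R.argmax(), np.linalg.norm(x - U, axis=1).argmin())   # most relevant = closest
```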

6.2 Exponential Kernels

In this section we consider the family of kernels of type $k(x, x') = \exp(-\|x - x'\|^q / \gamma)$. Unlike the kernels of Section 6.1, this family of kernels implements a stronger locality. The Laplacian and Gaussian kernels are special cases for $q = 1$ and $q = 2$ respectively. Like in the previous section, we compute the SVM discriminant $f(x)$; however, we apply a different measure of outlierness, $o(x)$, proposed in Section 3.2 for this kernel. The function $o$ can be mapped to the following two-layer neural network (proven in B.2):

layer 1: $h_j = \|x - u_j\|^q - \gamma \log\big(\alpha_j / \|\alpha\|_1\big)$ (detection)
layer 2: $o(x) = -\gamma \log \sum_j \exp(-h_j / \gamma)$ (pooling)

i.e. a set of radial basis distance functions followed by a flipped log-sum-exp computation, which implements a soft min-pooling. We let the neural network output depend on the hidden layer, and choose the root point $\tilde{h}_j = h_j - o(x)$, i.e. we subtract the output of the model from each dimension of the vector of activations. Relevance scores on the hidden layer are obtained by Taylor decomposition:

$$R_j = \frac{\exp(-h_j/\gamma)}{\sum_k \exp(-h_k/\gamma)}\; o(x) \qquad (5)$$

One can also show that this decomposition is conservative (see B.3).
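
Again as a sketch under our reading of this section (effective activations $h_j = \|x - u_j\|^q - \gamma\log(\alpha_j/\|\alpha\|_1)$, a negated log-sum-exp pooling, and relevances given by the soft-min weights times the outlier score); the decomposition below is conservative by construction, even if the paper's exact parametrization differs.

```python
import numpy as np

def exponential_outlier_decomposition(x, support_vectors, alphas, q=2, gamma=1.0):
    # Layer 1: distances shifted by the (normalized) log-coefficients.
    h = (np.linalg.norm(x - support_vectors, axis=1) ** q
         - gamma * np.log(alphas / alphas.sum()))
    # Layer 2: flipped log-sum-exp, a soft min-pooling over the h_j.
    o = -gamma * np.log(np.sum(np.exp(-h / gamma)))
    # Relevance: soft-min weights (a softmax of -h/gamma, invariant to shifts
    # of h) times the outlier score.
    w = np.exp(-h / gamma)
    R = o * w / w.sum()
    return o, R

rng = np.random.default_rng(2)
U = rng.normal(size=(10, 2))
a = np.full(10, 0.1)
x = np.array([5.0, -3.0])

o, R = exponential_outlier_decomposition(x, U, a)
print(o, R.sum())     # conservation: sum_j R_j equals o(x)
```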

6.3 Redistribution on the Input Layer

Figure 4: Neural network equivalent of the one-class SVM for outlier detection, along with the relevance propagation procedure to determine pixel-wise contributions to outlierness.

In the inlier detection case, it was sufficient to perform the redistribution on the domain of support vectors. We argue that explaining an outlier in terms of support vectors does not provide much interpretability. Take a prototypical outlier that is very far from the data. From this distance, two distinct support vectors will look very similar, and the main information about outlierness is not contained in which point the outlier is closest to, but in the distance and direction between the outlier and the data. Thus, motivated by this prototypical-case argument, we now look at how to backpropagate the outlier explanation one layer below, onto the input domain.

In Sections 6.1 and 6.2, support vector relevance was given by Equations 4 and 5. Redistribution on the input domain requires expressing $R_j$ as a function of $x$. We first observe that the pooling factors in these equations are approximately constant: when support vector $j$ dominates locally, the factor approaches one. Furthermore, the factor of Equation 4 is constant under any rescaling of the activations, and the factor of Equation 5 is constant under any increment of the activations by a constant value. A proof for these invariances can be found in B.4. These transformations also describe the path along which we look for the root point. Thus, considering these terms as effectively constant, $R_j$ can be modeled locally as an affine transformation of the activations, which are themselves an affine transformation of the distances. We write:

$$R_j \approx c_j\, \|x - u_j\|^q + c'_j$$

where $c_j$ and $c'_j$ are constant. This quantity is redistributed on the input dimensions by means of integrated gradients Sundararajan et al. (2017). A detailed derivation of the integrated gradients of $R_j$ can be found in B.5.

The attribution on the input variables is given in vector form by

$$R = \sum_j (x - \tilde{x}_j) \odot \int_0^1 \nabla R_j\big(\tilde{x}_j + t\,(x - \tilde{x}_j)\big)\, dt \qquad (6)$$

where, like in the original paper Montavon et al. (2017), we have summed over the relevance received from all higher-layer units. The symbol $\odot$ denotes element-wise multiplication, and the integral is the vector of individual integrals of the partial derivatives of $R_j$, taken along the path from the reference point $\tilde{x}_j$ to $x$.

The whole process of layer-wise redistribution from the top layer down to the input layer is shown in Figure 4. The data point (e.g. a handwritten digit) is given as input to the neural network. The network implements the outlier function as a soft min-pooling over support vector distances. The outlier score obtained at the output of the network is redistributed using deep Taylor decomposition: it is first redistributed using Taylor decomposition on the support vectors, and then further propagated to the input domain using integrated gradients.
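
For the common case $q = 2$, the integrated gradients of the squared distance along the straight path from $u_j$ to $x$ have a closed form, and each $R_j$ is then shared over input dimensions in proportion to the per-dimension squared differences. A sketch under this assumption (our simplification of the redistribution step, not necessarily the exact Equation 6):

```python
import numpy as np

def redistribute_to_inputs(x, support_vectors, R_support):
    """Share each support vector relevance R_j over input dimensions in
    proportion to (x_i - u_{j,i})^2 / ||x - u_j||^2, then sum over j; this is
    the closed-form integrated gradients of the squared distance along the
    straight path from u_j to x."""
    diffs2 = (x - support_vectors) ** 2                   # shape (m, d)
    fractions = diffs2 / diffs2.sum(axis=1, keepdims=True)
    return (R_support[:, None] * fractions).sum(axis=0)   # shape (d,)

rng = np.random.default_rng(3)
U = rng.normal(size=(10, 2))
R_support = rng.random(10)        # stand-in for the relevances of Eq. 4 or 5
x = np.array([4.0, 0.5])

R_input = redistribute_to_inputs(x, U, R_support)
print(R_input, R_input.sum(), R_support.sum())   # conservation across layers
```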

7 Extension for Sequential Data

When applied to sequential data such as images or time series, one-class models based on RBF kernels become affected by the curse of dimensionality. Thus, it is sometimes preferable to apply these models to small sequences or patches of the input Li and Perona (2005); Li et al. (2005); Frome et al. (2007). The scores computed for all patches are then pooled to compute a global score for the sequence. Let $s_\ell(x)$ and $o_\ell(x)$ be the inlier and outlier scores associated to a collection of patches or segments $\ell = 1, \dots, L$ taken from the input sequence. One measure of outlierness that satisfies Definition 2 is obtained by summing all outlier scores, thus forming a third layer of representation:

layer 3: $o(x) = \sum_{\ell=1}^{L} o_\ell(x)$

This composition resembles the max-pooling layer 2 from Section 5. Choosing the root point at zero therefore results in a max-takes-most redistribution over the spatial locations, $R_\ell = o_\ell(x)$, from where on we proceed as explained in Section 6: we first redistribute the location relevance to support vectors by Equation 4 or Equation 5, and then perform a final redistribution on input variables by Equation 6.
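
A sketch of the patch-level pooling for image data (patch extraction and the per-patch outlier function below are placeholders; o_patch stands for any of the outlierness measures above):

```python
import numpy as np

def extract_patches(image, size):
    """All overlapping size x size patches of a 2D image, flattened."""
    H, W = image.shape
    return np.array([image[i:i + size, j:j + size].ravel()
                     for i in range(H - size + 1)
                     for j in range(W - size + 1)])

def image_outlierness(image, o_patch, size=4):
    """Layer 3: sum of per-patch outlier scores; the per-location scores are
    themselves redistributed further as in Sections 5 and 6."""
    patches = extract_patches(image, size)
    scores = np.array([o_patch(z) for z in patches])
    return scores.sum(), scores    # global score and per-location relevances

# Toy per-patch outlier function: distance to the mean patch of the image.
img = np.random.default_rng(4).random((16, 16))
mean_patch = extract_patches(img, 4).mean(axis=0)
total, per_patch = image_outlierness(img, lambda z: np.linalg.norm(z - mean_patch))
print(total, per_patch.shape)      # 13 * 13 = 169 patch locations
```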

8 Experiments

Figure 5: A one-class SVM is trained on small patches of the very image itself. Parameter $\nu$ is set to allow at most 10% outliers. Images are taken from a texture data set (Cimpoi et al., 2014) (rows one, two and four) and PatternNet (Zhou et al., 2017); the top image is altered by us. For every image, we show, Left: input image; Middle: decomposition of the one-class SVM; Right: Sobel filter for reference. All images were resized to 256 pixels width.

We first test our deep Taylor decomposition (DTD)-based method for outlier explanation on large images, where we use the sequential model of Section 7. Figure 5 shows heatmaps for images taken from various image datasets. These heatmaps are compared to a simple baseline edge detector.

All models are trained on patches from the single image itself, thus heatmaps should highlight unusual statistics in the image. The function that the model implements depends solely on the model parameters: (1) the fraction of outliers $\nu$, here chosen as 0.1, (2) the degree of the Euclidean distance $q$, here set to 2, (3) the kernel bandwidth $\gamma$, chosen as the 0.1 quantile of one-nearest-neighbor distances for the exponential kernel (a variation of the heuristic from Smola (2010)), and (4) the patch size. Having these parameters fixed, the one-class SVM has a unique solution and explanation. Examples in Figure 5 are generated with a Gaussian kernel. We rescale images to a common width of 256 pixels and apply anti-aliasing in the rescaling, because we observed that the method is sensitive to aliasing artifacts.

The One-Class DTD grounds anomalies to individual pixels. The first row in Figure 5 shows a modified image from the class "grid" of the Describable Textures Dataset (Cimpoi et al., 2014). We perturbed the clean grid by a circle in the grid that is invisible to the human eye. The outlier edges that are due to our modification are indeed discovered by our method, as we can see in the heatmap right next to the grid image. The Sobel edge detector is not able to detect these special edges. The second image has a small defect in the middle of the lower-right quadrant. It is not obvious that this defect can be detected in the presence of other distractions, like the lamps, which are detected as well. The first three images show that our One-Class DTD is able to discard recurring patterns, e.g. grid lines, wood lines or parking lines. In the fourth image, we see that the method is also robust to some amount of noisy patterns. While the Sobel filter detects edges reliably, One-Class DTD puts emphasis on edges that are outstanding on a small scale. The scale on which outlierness is detected is parameterized by the patch size.

8.1 External Validation

The following experiment tests the ability of One-Class DTD to produce correct explanations on an artificial problem where we have ground truth information on the input features that cause outlierness.

We build a dataset composed of two horizontally concatenated images of size 28×28 each. Inliers are constituted by an MNIST digit of a particular digit class on the left, e.g. the class "0", and a blank image on the right. A simple one-class SVM with no extension for sequential inputs is trained with a Gaussian kernel. After training, the following three cases are considered for explanation: (1) Inlier: a test image from the training class is presented. (2) Type I outlier: the structure of the inliers is present (i.e. a test example from the training class appears in the left panel) together with some distraction; as distraction, we replace the right panel with a random sample from another random class. (3) Type II outlier: the structure of the inliers is distracted on both sides; the left panel of type I outliers is replaced by a random sample from another random class.

Figure 6 (left) shows some example data for the class of zeros in a 2D PCA embedding. The ground truth explanation for inliers contains no relevance at all, because the measure of outlierness should detect no evidence for outlierness in these images. For type I outliers, the ground truth only contains relevance in the right side of the image. Consider a growing amount of outlierness in the right side only: an explanation of the left side should not be affected by these distractions. For type II outliers, relevance should fall in both sides of the image: the left side contains relevance for deviation from the training digit, and the relevance on the right side explains deviation from the blank image. If we consider an input with growing amount of outlierness in the left panel, we see that relevance should also increase in the left panel only and vice versa.

As a baseline, we compare the relevance attribution with the maximum likelihood estimate of a multivariate normal density (MVN) of the training data with no off-diagonal covariances (Murphy, 2012; Bishop, 1994). The maximum likelihood estimate for the variances is given by

$$\sigma_i^2 = \frac{1}{n} \sum_{k=1}^{n} (x_{k,i} - \mu_i)^2 + \varepsilon$$

with $\mu$ being the mean of the training data and $\varepsilon$ a regularization term. The negative log-likelihood of the MVN, although not a measure of outlierness in the strict sense of Definition 2, provides a natural decomposition on input features as

$$-\log p(x) = c + \sum_i \frac{(x_i - \mu_i)^2}{2\sigma_i^2}$$

where the first term is a non-decomposable zero-order term and the terms of the sum determine the relevance of the input features.
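
A sketch of this baseline (diagonal covariance, per-feature relevance from the quadratic term of the negative log-likelihood):

```python
import numpy as np

def fit_diagonal_mvn(X, eps=1e-6):
    """Maximum likelihood mean and (regularized) per-feature variances."""
    mu = X.mean(axis=0)
    var = X.var(axis=0) + eps
    return mu, var

def mvn_relevances(x, mu, var):
    """Per-feature terms of the negative log-likelihood,
    R_i = (x_i - mu_i)^2 / (2 var_i); the normalization constant is the
    non-decomposable zero-order term."""
    return (x - mu) ** 2 / (2.0 * var)

rng = np.random.default_rng(5)
X = rng.normal(size=(500, 8))
mu, var = fit_diagonal_mvn(X)
x = np.array([0.0, 0.0, 0.0, 0.0, 5.0, 0.0, 0.0, 0.0])   # outlying in feature 4
print(mvn_relevances(x, mu, var).round(2))               # relevance concentrates there
```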

Figure 6 collects the outcomes of the experiment. In the bottom plots, every sample is represented by one dot. On one axis, we plot the amount of relevance that falls in the right side of the image, and on the other axis, the amount of relevance that falls in the left side.

Figure 6: Top: Random subset of the artificial data set in a 2D PCA embedding for visualization (training and explaining is performed in the original space); only inliers are used for training; Middle: Explanations from one-class SVM and multivariate normal for one example from every type of pattern; Bottom: Amount of relevance falling in the left or right side of the image; plot shows results for all ten classes, both models are trained on each class separately; dotted line corresponds to share of relevance.

One-class DTD and MVN are both able to explain inliers and type I outliers reliably. They both attribute a small amount of outlierness to the inlier data points, though. The effect of growing outlierness on the right side leading to more relevance in that area can still be observed by looking at blue dots in Figure 6 (right). One-class SVM is better able to explain the outlierness of type II outliers, because it reacts equally strongly to permutations over the input dimensions. Instead, the MVN largely ignores outlying patterns in the left panel and thus produces a partial explanation. The incorrect behavior of MVN explanations stems from the fact that the MVN negative log-likelihood is not a true measure of outlierness in the sense of Definition 2 as it distorts the natural metric of the input space.

8.2 Internal Validation

In the following experiments, we consider the output of the one-class SVM as a ground-truth model for outlierness. This allows us to perform validation on datasets for which we do not have a priori knowledge of which features are causing outlierness.

The deep Taylor decomposition method will be compared to a number of other explanation techniques: Sensitivity analysis uses the same trained model but assigns relevance based on the locally evaluated gradient. Other analyses assign relevance based on a simple decomposition of the distance to data, or on the output of some image filter.

For the evaluation of explanation quality, we consider the pixel-flipping approach described by Samek et al. (2017a) in the context of DNN classifiers. The approach consists of gradually destroying pixels from most to least relevant and measuring how quickly the prediction score decreases.

In the context of outlier detection, however, destroying a pixel does not reduce evidence for outlierness and might even create more of it. Thus, the original pixel-flipping method must be adapted to the specific outlier detection problem. Our approach consists of performing the flipping procedure not in the pixel space directly, but in a feature space

$$z(x) = \big(x_i - u_{j,i}\big)_{i = 1, \dots, d;\; j = 1, \dots, m} \qquad (7)$$

containing all component-wise differences to the support vectors. The one-class SVM discriminant can be rewritten in terms of the elements of this feature space, and similarly for the outlier function.

Our modified procedure reduces the dimensionality of the data one dimension at a time. Once dimensionality 0 is reached, the pattern is necessarily an inlier, because no deviation from the support vectors exists anymore. This makes the removal of outlierness computationally feasible. The ordering of variables is inferred from the relevance scores assigned to each input dimension (cf. A.4 for pseudo-code). Also note that we seek to provide a global explanation of the outlierness of a pattern $x$. Except for trivial cases (with only one support vector), no single pattern in the input space can represent a minimizer of all detectors that the model is composed of.
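
A sketch of the adapted procedure under our simplification: input dimensions are removed from most to least relevant, a removed dimension no longer contributes to any distance to a support vector, and the remaining outlier score is recorded after every removal (a faster drop indicating a better explanation).

```python
import numpy as np

def outlier_score(x, U, alphas, mask, gamma=1.0):
    """Exponential-kernel outlierness restricted to the unmasked dimensions."""
    d2 = np.sum(((x - U) * mask) ** 2, axis=1)
    f = np.sum(alphas * np.exp(-d2 / gamma))
    return -gamma * np.log(f / alphas.sum())

def flipping_curve(x, U, alphas, relevances, gamma=1.0):
    """Remove dimensions in order of decreasing relevance and track the score."""
    order = np.argsort(relevances)[::-1]
    mask = np.ones_like(x)
    curve = [outlier_score(x, U, alphas, mask, gamma)]
    for i in order:
        mask[i] = 0.0
        curve.append(outlier_score(x, U, alphas, mask, gamma))
    return np.array(curve)    # reaches 0 once all dimensions are removed

rng = np.random.default_rng(6)
U = rng.normal(size=(20, 5))
a = np.full(20, 0.05)
x = np.array([0.0, 0.0, 6.0, 0.0, 0.0])    # outlying in dimension 2
# Crude explanation for illustration: squared differences to the nearest support vector.
R = (x - U[np.argmin(np.linalg.norm(x - U, axis=1))]) ** 2
print(flipping_curve(x, U, a, R).round(2))
```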

We train a one-class SVM on the CIFAR-10 data set. The data set consists of 50,000 images of size 32×32×3 with values ranging from 0 to 255. The images are divided into 625 patches of adjacent pixels each. This leads to more than 31 million training vectors. For training speedup, we randomly select 30,000 patches from the data set and train a one-class SVM on these patches. The outlier scores are summed over all patches of an image, as described in Section 7, to get a measure of outlierness for the whole image.

Figure 7: Pixel-flipping experiment; Left: example from the CIFAR-10 class "airplane" shown next to the explanation and several baselines, here shown for the Gaussian kernel; Right: pixel-flipping experiment for several kernels, including the $t$-Student kernel.

Figure 7 shows an example image and heatmaps from all attribution methods.

We consider as baselines for explanation: sensitivity analysis (SA) as defined in Section 4, the squared difference to the nearest neighbor (NN), the squared difference to the expected pixel value (EV), and the Sobel filter. The squared difference to the nearest-neighbor support vector (NN), $R_i = (x_i - u_{j^*,i})^2$ with $j^* = \arg\min_j \|x - u_j\|$, is similar to DTD but performs a min-take-all redistribution instead of min-take-most. This yields discontinuities in the explanation under perturbations of the inputs; this issue is reduced in the sequential model due to the overlap of patches, however. We also add the squared difference to the expected pixel value (EV) to the baselines. EV is inferred from the support vectors as their coefficient-weighted mean.

Finally, a random pixel ordering is considered as a completely uninformed baseline method. Figure 7 shows the results of the pixel-flipping experiment for all methods and several kernels. Deep Taylor decomposition is indeed superior for all considered kernels. Sensitivity analysis can be interpreted as an explanation of the local variation of the detection function in the vicinity of the pattern in question. We can see that the local gradient is not as well suited for explanation as DTD. In particular, sensitivity analysis is unable to detect the truly relevant pixels that cause the outlier score to be large. Instead, it assigns the most relevance to pixels to which the model is locally sensitive (cf. Samek et al. (2017a)). As mentioned before, the nearest-neighbor support vector provides an explanation that is discontinuous under perturbations of the inputs. Explanations from the NN procedure are more complete compared to SA, yet, as we see in the right plots of Figure 7, the pixel ordering still diverges from the explanation produced by DTD. The EV baseline corresponds roughly to a squared difference to the data mean, and is even more global than DTD. We see that its performance in the pixel-flipping experiment is still better than Sobel and random flipping. The remaining baselines (EV, Sobel, Random) fail to produce a competitive explanation.

Results for more kernels can be found in C.

8.3 Intrusion detection

One-class SVM has been applied to network intrusion detection and malware detection Görnitz et al. (2013); Wressnegger et al. (2013); Rieck (2009); Wang et al. (2004); Denning (1987). Having interpretable model outputs can help to identify the intent or the method of an attack. We take up this idea in a simpler setting where no domain knowledge is necessary and where it is arguably possible to detect outlierness on a symbolic level, that can be compared to an attack. In particular, we train a one-class SVM on the personal attacks corpus from the Detox data set (Wulczyn et al., 2017). In this dataset, documents are labeled by up to ten annotators as either 0 (neutral) or 1 (personal attack).

A dictionary is constructed from stemmed terms that appear in at least five documents, and binary features are extracted as a vectorial representation of documents. No stop words are removed and no document frequencies are used for feature extraction. The model is trained on samples with label mean 0 with a Gaussian kernel. Parameter $\gamma$ is set to 10, which corresponds to a soft assumption of an expected difference of 10 terms between similar documents.

Interpretable outputs are produced in terms of term relevance scores Arras et al. (2017); Horn et al. (2017). Figure 8 shows the explanation for two example documents. As one would expect, common terms have no or low relevance in the document, and terms that would not be expected in a neutral message receive more relevance. Due to the RBF property, relevance is also assigned to terms that do not appear in a document. These terms can be interpreted as being benign and expected to appear in a typical example. This quantity can be of interest in text analysis and could not be derived from, e.g., a linear model. Note the ironic use of the word fantastic. The term receives the most relevance, simply because it is not used frequently in neutral messages. The interpretation that the term is detected due to its ironic use cannot be justified for such a symbolic model. The property of assigning high relevance to rare events still holds; rare events, here, are the presence of terms that appear rarely, or the absence of terms that usually appear. Outlierness continues to grow as more rare events appear Samek et al. (2017b).

Figure 8: Relevance assignment for two sample messages from the Detox data set; red color indicates relevance scores. Below each document, the most frequent terms in the document and the most frequent terms that are not in the document are listed.

9 Conclusion

In this paper, we have addressed the problem of anomaly explanation. Technically, we have proposed a deep Taylor decomposition of the one-class SVM. It is applicable to a number of commonly used kernels, and produces explanations in terms of support vectors or input variables. Our empirical analysis has demonstrated that the proposed method is able to reliably explain a wide range of outliers, and that these explanations are more robust than those obtained by sensitivity analysis or nearest neighbor.

A crucial aspect of our explanation method is that it required us to elicit a natural neural network architecture for the problem at hand. Achieving this in the context of the one-class SVM model has highlighted the asymmetry between the problem of inlier and outlier detection, where the first one can be modeled as a max-pooling over similarities, and where the latter is better modeled as a min-pooling over distances. The novel insight on the structure of the outlier detection problem might inspire the design of deeper and more structured outlier detection models.

Acknowledgments

This work was supported by the Brain Korea 21 Plus Program through the National Research Foundation of Korea; the Institute for Information & Communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) [No. 2017-0-01779]; the Deutsche Forschungsgemeinschaft (DFG) [grant MU 987/17-1]; and the German Ministry for Education and Research as Berlin Big Data Center (BBDC) [01IS14013A]. This publication only reflects the authors' views. Funding agencies are not liable for any use that may be made of the information contained herein. We are grateful to Guido Schwenk for the valuable discussion.

References

  • Arras et al. [2017] L. Arras, F. Horn, G. Montavon, K.-R. Müller, and W. Samek. ”What is relevant in a text document?”: An interpretable machine learning approach. PLOS ONE, 12(8):1–23, 08 2017. doi: 10.1371/journal.pone.0181142.
  • Bach et al. [2015] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, and W. Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLOS ONE, 10(7):1–46, 07 2015. doi: 10.1371/journal.pone.0130140.
  • Baehrens et al. [2010] D. Baehrens, T. Schroeter, S. Harmeling, M. Kawanabe, K. Hansen, and K.-R. Müller. How to explain individual classification decisions. Journal of Machine Learning Research, 11:1803–1831, 2010.
  • Bazen and Joutard [2013] S. Bazen and X. Joutard. The Taylor decomposition: A unified generalization of the Oaxaca method to nonlinear models. Working papers, HAL, 2013.
  • Bishop [1994] C. M. Bishop. Novelty detection and neural network validation. IEE Proceedings-Vision, Image and Signal processing, 141(4):217–222, 1994.
  • Bishop [2006] C. M. Bishop. Pattern Recognition and Machine Learning. Information Science and Statistics. Springer, 2006.
  • Bojarski et al. [2017] M. Bojarski, P. Yeres, A. Choromanska, K. Choromanski, B. Firner, L. D. Jackel, and U. Muller. Explaining how a deep neural network trained with end-to-end learning steers a car. CoRR, abs/1704.07911, 2017.
  • Boureau et al. [2010] Y.-L. Boureau, J. Ponce, and Y. LeCun. A theoretical analysis of feature pooling in visual recognition. In Proceedings of the 27th international conference on machine learning (ICML-10), pages 111–118, 2010.
  • Breunig et al. [2000] M. M. Breunig, H. Kriegel, R. T. Ng, and J. Sander. LOF: identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, May 16-18, 2000, Dallas, Texas, USA., pages 93–104, 2000. doi: 10.1145/342009.335388.
  • Caruana et al. [2015] R. Caruana, Y. Lou, J. Gehrke, P. Koch, M. Sturm, and N. Elhadad. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, NSW, Australia, August 10-13, 2015, pages 1721–1730, 2015. doi: 10.1145/2783258.2788613.
  • Cimpoi et al. [2014] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi. Describing textures in the wild. In Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2014.
  • Day [1969] N. E. Day. Estimating the components of a mixture of normal distributions. Biometrika, 56(3):463–474, 1969. doi: 10.1093/biomet/56.3.463.
  • Denning [1987] D. E. Denning. An intrusion-detection model. IEEE Trans. Software Eng., 13(2):222–232, 1987. doi: 10.1109/TSE.1987.232894.
  • Erhan et al. [2009] D. Erhan, Y. Bengio, A. Courville, and P. Vincent. Visualizing higher-layer features of a deep network. University of Montreal, 1341(3):1, 2009.
  • Frome et al. [2007] A. Frome, Y. Singer, F. Sha, and J. Malik. Learning globally-consistent local distance functions for shape-based image retrieval and classification. In IEEE 11th International Conference on Computer Vision, ICCV 2007, Rio de Janeiro, Brazil, October 14-20, 2007, pages 1–8, 2007. doi: 10.1109/ICCV.2007.4408839.
  • Gevrey et al. [2003] M. Gevrey, I. Dimopoulos, and S. Lek. Review and comparison of methods to study the contribution of variables in artificial neural network models. Ecological Modelling, 160(3):249–264, feb 2003. doi: 10.1016/s0304-3800(02)00257-0.
  • Görnitz et al. [2013] N. Görnitz, M. Kloft, K. Rieck, and U. Brefeld. Toward supervised anomaly detection. J. Artif. Intell. Res., 46:235–262, 2013. doi: 10.1613/jair.3623.
  • Häne et al. [2015] C. Häne, T. Sattler, and M. Pollefeys. Obstacle detection for self-driving cars using only monocular cameras and wheel odometry. In Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on, pages 5101–5108. IEEE, 2015.
  • Hansen et al. [2011] K. Hansen, D. Baehrens, T. Schroeter, M. Rupp, and K.-R. Müller. Visual interpretation of kernel-based prediction models. Molecular Informatics, 30(9):817–826, 2011. doi: 10.1002/minf.201100059.
  • Harmeling et al. [2006] S. Harmeling, G. Dornhege, D. Tax, F. Meinecke, and K.-R. Müller. From outliers to prototypes: ordering data. Neurocomputing, 69(13):1608–1618, 2006.
  • Hinton [2002] G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002. doi: 10.1162/089976602760128018.
  • Hoffmann [2007] H. Hoffmann. Kernel PCA for novelty detection. Pattern Recognition, 40(3):863–874, 2007. doi: 10.1016/j.patcog.2006.07.009.
  • Horn et al. [2017] F. Horn, L. Arras, G. Montavon, K. Müller, and W. Samek. Exploring text datasets by visualizing relevant words. CoRR, abs/1707.05261, 2017.
  • Kamarinou et al. [2016] D. Kamarinou, C. Millard, and J. Singh. Machine learning with personal data. Queen Mary School of Law Legal Studies Research Paper, 247, 2016.
  • Knorr et al. [2000] E. M. Knorr, R. T. Ng, and V. Tucakov. Distance-based outliers: Algorithms and applications. VLDB J., 8(3-4):237–253, 2000. doi: 10.1007/s007780050006.
  • Kraus et al. [2016] O. Z. Kraus, L. J. Ba, and B. J. Frey. Classifying and segmenting microscopy images with deep multiple instance learning. Bioinformatics, 32(12):52–59, 2016. doi: 10.1093/bioinformatics/btw252.
  • Landecker et al. [2013] W. Landecker, M. D. Thomure, L. M. A. Bettencourt, M. Mitchell, G. T. Kenyon, and S. P. Brumby. Interpreting individual classifications of hierarchical networks. In IEEE Symposium on Computational Intelligence and Data Mining, CIDM 2013, Singapore, 16-19 April, 2013, pages 32–38, 2013. doi: 10.1109/CIDM.2013.6597214.
  • Laurikkala et al. [2000] J. Laurikkala, M. Juhola, and E. Kentala. Informal identification of outliers in medical data. In The Fifth International Workshop on Intelligent Data Analysis in Medicine and Pharmacology, 2000.
  • Li and Perona [2005] F. Li and P. Perona. A bayesian hierarchical model for learning natural scene categories. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), 20-26 June 2005, San Diego, CA, USA, pages 524–531, 2005. doi: 10.1109/CVPR.2005.16.
  • Li et al. [2005] P. Li, K. L. Chan, and S. M. Krishnan. Learning a multi-size patch-based hybrid kernel machine ensemble for abnormal region detection in colonoscopic images. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), 20-26 June 2005, San Diego, CA, USA, pages 670–675, 2005. doi: 10.1109/CVPR.2005.201.
  • Lipton [2017] Z. C. Lipton. The Doctor Just Won’t Accept That! ArXiv e-prints, Nov. 2017.
  • Liu et al. [2017] N. Liu, D. Shin, and X. Hu. Contextual outlier interpretation. CoRR, abs/1711.10589, 2017.
  • Micenková et al. [2013] B. Micenková, X.-H. Dang, I. Assent, and R. T. Ng. Explaining outliers by subspace separability. In Data Mining (ICDM), 2013 IEEE 13th International Conference on, pages 518–527. IEEE, 2013.
  • Montavon et al. [2017] G. Montavon, S. Lapuschkin, A. Binder, W. Samek, and K.-R. Müller. Explaining nonlinear classification decisions with deep Taylor decomposition. Pattern Recognition, 65:211–222, 2017. doi: 10.1016/j.patcog.2016.11.008.
  • Murphy [2012] K. P. Murphy. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012.
  • Pearl [1988] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1988.
  • Ribeiro et al. [2016] M. T. Ribeiro, S. Singh, and C. Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016, pages 1135–1144, 2016. doi: 10.1145/2939672.2939778.
  • Rieck [2009] K. Rieck. Machine learning for application-layer intrusion detection. PhD thesis, 2009.
  • Samek et al. [2017a] W. Samek, A. Binder, G. Montavon, S. Lapuschkin, and K. Müller. Evaluating the Visualization of What a Deep Neural Network Has Learned. IEEE Trans. Neural Netw. Learning Syst., 28(11):2660–2673, 2017a. doi: 10.1109/TNNLS.2016.2599820.
  • Samek et al. [2017b] W. Samek, T. Wiegand, and K. Müller. Explainable artificial intelligence: Understanding, visualizing and interpreting deep learning models. CoRR, abs/1708.08296, 2017b.
  • Schölkopf et al. [1999] B. Schölkopf, R. C. Williamson, A. J. Smola, J. Shawe-Taylor, and J. C. Platt. Support vector method for novelty detection. In Advances in Neural Information Processing Systems 12, [NIPS Conference, Denver, Colorado, USA, November 29 - December 4, 1999], pages 582–588, 1999.
  • Schütt et al. [2017] K. T. Schütt, F. Arbabzadah, S. Chmiela, K. R. Müller, and A. Tkatchenko. Quantum-chemical insights from deep tensor neural networks. Nature Communications, 8:13890, 2017.
  • Schwenk and Bach [2014] G. Schwenk and S. Bach. Detecting behavioral and structural anomalies in mediacloud applications. CoRR, abs/1409.8035, 2014.
  • Shrikumar et al. [2017] A. Shrikumar, P. Greenside, and A. Kundaje. Learning important features through propagating activation differences. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3145–3153, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
  • Simonyan et al. [2013] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. CoRR, abs/1312.6034, 2013.
  • Smola [2010] A. Smola. Easy kernel width choice. http://blog.smola.org/post/940859888/, 2010. Accessed: 2018-01-08.
  • Springenberg et al. [2014] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. A. Riedmiller. Striving for simplicity: The all convolutional net. CoRR, abs/1412.6806, 2014.
  • Sturm et al. [2016] I. Sturm, S. Bach, W. Samek, and K.-R. Müller. Interpretable deep neural networks for single-trial EEG classification. Journal of Neuroscience Methods, 274:141–145, 2016. doi: 10.1016/j.jneumeth.2016.10.008.
  • Sundararajan et al. [2017] M. Sundararajan, A. Taly, and Q. Yan. Axiomatic attribution for deep networks. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3319–3328, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
  • Tax and Duin [1998] D. M. J. Tax and R. P. W. Duin. Outlier detection using classifier instability. In Advances in Pattern Recognition, Joint IAPR International Workshops SSPR ’98 and SPR ’98, Sydney, NSW, Australia, August 11-13, 1998, Proceedings, pages 593–601, 1998. doi: 10.1007/BFb0033283.
  • Tax and Duin [2004] D. M. J. Tax and R. P. W. Duin. Support vector data description. Machine Learning, 54(1):45–66, 2004. doi: 10.1023/B:MACH.0000008084.60811.49.
  • Vidovic et al. [2017] M. M. C. Vidovic, M. Kloft, K.-R. Müller, and N. Görnitz. ML2motif—reliable extraction of discriminative sequence motifs from learning machines. PLOS ONE, 12(3):e0174392, mar 2017. doi: 10.1371/journal.pone.0174392.
  • Wang et al. [2004] Y. Wang, J. Wong, and A. Miner. Anomaly intrusion detection using one class SVM. In Information Assurance Workshop, 2004. Proceedings from the Fifth Annual IEEE SMC, pages 358–364. IEEE, 2004.
  • Wressnegger et al. [2013] C. Wressnegger, G. Schwenk, D. Arp, and K. Rieck. A close look on n-grams in intrusion detection: anomaly detection vs. classification. In Proceedings of the 2013 ACM Workshop on Artificial Intelligence and Security, pages 67–76. ACM, 2013.
  • Wulczyn et al. [2017] E. Wulczyn, N. Thain, and L. Dixon. Ex machina: Personal attacks seen at scale. In Proceedings of the 26th International Conference on World Wide Web, WWW 2017, pages 1391–1399, 2017. doi: 10.1145/3038912.3052591.
  • Yang et al. [2009] X. Yang, L. J. Latecki, and D. Pokrajac. Outlier detection with globally optimal exemplar-based GMM. In Proceedings of the SIAM International Conference on Data Mining, SDM 2009, April 30 - May 2, 2009, Sparks, Nevada, USA, pages 145–154, 2009. doi: 10.1137/1.9781611972795.13.
  • Zeiler and Fergus [2014] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part I, pages 818–833, 2014. doi: 10.1007/978-3-319-10590-1_53.
  • Zhang et al. [2016] J. Zhang, Z. Lin, J. Brandt, X. Shen, and S. Sclaroff. Top-down neural attention by excitation backprop. In European Conference on Computer Vision, pages 543–559. Springer, 2016.
  • Zhang et al. [2004] Y.-X. Zhang, A.-L. Luo, and Y.-H. Zhao. Outlier detection in astronomical data. In Optimizing Scientific Return for Astronomy through Information Technologies. SPIE, Sep 2004. doi: 10.1117/12.550998.
  • Zhou et al. [2016] B. Zhou, A. Khosla, À. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative localization. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 2921–2929, 2016. doi: 10.1109/CVPR.2016.319.
  • Zhou et al. [2017] W. Zhou, S. Newsam, C. Li, and Z. Shao. PatternNet: A benchmark dataset for performance evaluation of remote sensing image retrieval. ISPRS Journal of Photogrammetry and Remote Sensing, 2017.
  • Zintgraf et al. [2017] L. M. Zintgraf, T. S. Cohen, T. Adel, and M. Welling. Visualizing deep neural network decisions: Prediction difference analysis. CoRR, abs/1702.04595, 2017.
  • Zurada et al. [1994] J. M. Zurada, A. Malinowski, and I. Cloete. Sensitivity analysis for minimization of input data dimension for feedforward neural network. In 1994 IEEE International Symposium on Circuits and Systems, ISCAS 1994, London, England, UK, May 30 - June 2, 1994, pages 447–450, 1994. doi: 10.1109/ISCAS.1994.409622.

Appendix A Pseudo-codes

In this section, we list pseudo-codes for the proposed algorithms.

A.1 Support Vector Relevance

Support vector relevance provides the explanation for inliers and the higher-layer relevance for outliers. All decompositions can be calculated from quantities that are already computed in the forward pass.

inputs:
      Weight vector α
      Vector of kernel evaluations k at the current data point
outputs:
      Support vector relevances R
procedure SupportVectorRelevance(α, k)
      R_j ← contribution of support vector j to the model output, for all j
end procedure
Algorithm 1 Inlier explanation
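A minimal NumPy sketch of Algorithm 1 is given below. It assumes the common setting in which the inlier score is the kernel expansion Σ_j α_j k(x, x_j) and each summand α_j k_j is taken as the relevance of support vector j; the kernel choice and all names are illustrative, not necessarily the exact formulation used above.

import numpy as np

def gaussian_kernel(x, X_sv, gamma=0.5):
    # k_j = exp(-gamma * ||x - x_j||^2) for every support vector x_j (illustrative kernel)
    return np.exp(-gamma * np.sum((X_sv - x) ** 2, axis=1))

def support_vector_relevance(alpha, k):
    # R_j = alpha_j * k_j: contribution of support vector j to the kernel expansion
    return alpha * k

rng = np.random.default_rng(0)
X_sv = rng.normal(size=(10, 3))                # hypothetical support vectors
alpha = rng.random(10); alpha /= alpha.sum()   # hypothetical expansion weights
x = rng.normal(size=3)                         # data point to explain

k = gaussian_kernel(x, X_sv)                   # already available from the forward pass
R = support_vector_relevance(alpha, k)
print(R, R.sum())                              # R sums to the kernel expansion value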

A.2 Input relevance for t-Student kernels

For the t-Student kernel, we identify a decomposable form of the upper-layer relevance, which leads to the following algorithm.

inputs:
      Input vector x
      Matrix of support vectors
      Weight vector α
outputs:
      Input relevance vector R
procedure OutlierExplanationTStudent
      compute the vector d of distances between x and the support vectors
      compute the vector k of kernel evaluations and the corresponding support vector relevances
      redistribute each support vector relevance onto the input dimensions, for all i, j
      R ← sum of the contributions over the support vectors
end procedure
Algorithm 2 Outlier explanation for t-Student kernels

Here, ⊙ denotes the element-wise product and ⊘ the element-wise division. The matrix V and the vectors d and k can be precomputed for better runtime performance.
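The following NumPy sketch illustrates the propagation step of Algorithm 2 under a simple assumed redistribution rule: each support vector relevance R_j is split across input dimensions i in proportion to the squared differences (x_i − x_ji)², in the spirit of the segment-based integration of Appendix B.5 but not necessarily the exact formula used above. All names are illustrative.

import numpy as np

def input_relevance(x, X_sv, R_sv, eps=1e-12):
    # Split each support vector relevance R_j across input dimensions in
    # proportion to the squared differences, then sum over support vectors.
    D = (x - X_sv) ** 2                              # (n_sv, n_dim) squared differences
    W = D / (D.sum(axis=1, keepdims=True) + eps)     # row-normalized redistribution weights
    return W.T @ R_sv                                # input relevance vector

rng = np.random.default_rng(0)
X_sv = rng.normal(size=(10, 3))                      # hypothetical support vectors
x = rng.normal(size=3)                               # data point to explain
R_sv = rng.random(10)                                # upper-layer (support vector) relevances
R_input = input_relevance(x, X_sv, R_sv)
print(R_input.sum(), R_sv.sum())                     # conservation: both sums match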

A.3 Input relevance for exponential kernels

For the exponential kernels, the upper-layer relevance can likewise be written in a decomposable form. This allows the following fast algorithm for computing the input relevance.

inputs:
      Input vector x
      Matrix of support vectors
      Weight vector α
outputs:
      Input relevance vector R
procedure OutlierExplanationExponential
      compute the vector d of distances between x and the support vectors
      compute the vector k of kernel evaluations and the corresponding support vector relevances
      redistribute each support vector relevance onto the input dimensions, for all i, j
      R ← sum of the contributions over the support vectors
end procedure
Algorithm 3 Outlier explanation for exponential kernels

Here, ⊙ denotes element-wise multiplication.
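A possible end-to-end sketch for the exponential (Gaussian) kernel case is shown below. It assumes, for illustration only, the outlierness o(x) = −log Σ_j α_j exp(−γ‖x − x_j‖²), support vector relevances obtained by distributing o(x) proportionally to the expansion terms, and the same per-dimension split as in the Algorithm 2 sketch; none of these choices should be read as the exact formulas of the algorithm above.

import numpy as np

def outlier_explanation_exponential(x, X_sv, alpha, gamma=0.5, eps=1e-12):
    d2 = np.sum((x - X_sv) ** 2, axis=1)             # squared distances (forward pass)
    s = alpha * np.exp(-gamma * d2)                  # expansion terms alpha_j * k_j
    o = -np.log(s.sum())                             # assumed outlierness score
    R_sv = o * s / s.sum()                           # support vector relevances, summing to o
    D = (x - X_sv) ** 2
    W = D / (D.sum(axis=1, keepdims=True) + eps)     # per-dimension redistribution weights
    return W.T @ R_sv, o                             # input relevances and outlierness

rng = np.random.default_rng(0)
X_sv = rng.normal(size=(10, 3))
alpha = rng.random(10); alpha /= alpha.sum()
x = rng.normal(size=3) + 3.0                         # a point far from the data (outlier)
R_input, o = outlier_explanation_exponential(x, X_sv, alpha)
print(o, R_input.sum())                              # conservation: R_input sums to o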

A.4 Pixel-flipping procedure

We show here the pseudo-code of the pixel-flipping experiment that we perform in Section 8.2.

inputs:
      Effective inputs x
      Heatmap R
outputs:
      pfcurve: declining outlier score
procedure PixelFlipping(x, R)
      pfcurve ← [ ]
      for i in argsort(R), from most to least relevant, do
          flip the input dimension x_i
          pfcurve.append(outlier score of the modified x)
      end for
      return pfcurve
end procedure
Algorithm 4 Pixel flipping procedure

where R is the heatmap to evaluate and "pfcurve" is the result of the analysis.
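For reference, a small Python sketch of this procedure is given below; the reference value used for flipping and the outlier_score callable are illustrative assumptions.

import numpy as np

def pixel_flipping(x, heatmap, outlier_score, reference=0.0):
    x = x.copy()
    pfcurve = []
    for i in np.argsort(-heatmap):        # most relevant dimensions first
        x[i] = reference                  # "flip" this input dimension
        pfcurve.append(outlier_score(x))  # record the (ideally declining) score
    return pfcurve

# toy usage with a hypothetical distance-based outlier score
center = np.zeros(5)
outlier_score = lambda z: float(np.sum((z - center) ** 2))
x = np.array([3.0, -2.0, 0.5, 0.0, 1.0])
heatmap = (x - center) ** 2               # a simple relevance heatmap
print(pixel_flipping(x, heatmap, outlier_score))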

Appendix B Proofs

In this section, we prove asymptotic convergence of the proposed measures of outlierness, the equivalence of the proposed neural networks with the kernelized one-class SVM, and a unified formulation of support vector relevance for the measures of outlierness.

B.1 Proofs for the outlierness measures of Section 3.2

In the following, we show that the outlierness measures constructed from the one-class SVM with the t-Student and exponential kernels satisfy Definition 2. First, we consider the t-Student kernel. Condition 1 follows from the positivity of the kernel. Condition 2 is proven below.

Next, we show the convergence for the exponential kernels. Condition 1 follows from the kernel being upper bounded by 1. The proof of the second condition follows below.

B.2 Equivalence of the one-class SVM with the neural network representation

We show that the neural network from Section 6.1 implements the measure of outlierness proposed in Section 3.2, first for the t-Student kernel and then for the exponential kernels.

Let k be the t-Student kernel; the network output then coincides with the corresponding measure of outlierness.

Next, we show the equivalence for the exponential kernels, with k now the exponential kernel.
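As a numerical illustration of this kind of equivalence (a sketch under assumed definitions, not the exact construction of Section 6.1): for a Gaussian/exponential kernel k(x, x′) = exp(−γ‖x − x′‖²) and the assumed outlierness o(x) = −log Σ_j α_j k(x, x_j), the kernel expansion can be rewritten exactly as a distance layer followed by soft min-pooling.

import numpy as np

rng = np.random.default_rng(0)
gamma = 0.5
X_sv = rng.normal(size=(20, 5))                  # hypothetical support vectors
alpha = rng.random(20); alpha /= alpha.sum()     # hypothetical expansion weights
x = rng.normal(size=5)                           # test point

# (a) kernel view: o(x) = -log sum_j alpha_j exp(-gamma * ||x - x_j||^2)
d2 = np.sum((x - X_sv) ** 2, axis=1)
o_kernel = -np.log(np.sum(alpha * np.exp(-gamma * d2)))

# (b) network view: per-support-vector distance units, then soft min-pooling
z = gamma * d2 - np.log(alpha)                   # effective (shifted) distances
o_network = -np.log(np.sum(np.exp(-z)))          # soft-min pooling: -log sum exp(-z)

assert np.allclose(o_kernel, o_network)
print(o_kernel, o_network)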

B.3 Conservation of the decompositions

We prove that the proposed decompositions are conservative, i.e. that the assigned relevances sum to the quantity being explained. First, we show this for the t-Student kernels.

Next, we show conservation for the exponential kernels.

As a consequence, all higher-order terms in the Taylor expansion sum to zero.

B.4 Constancy properties

First, we show the invariance with respect to scaling, by considering a rescaled version of the quantity for some scaling factor.

Next, we prove that the quantities from Section 6.2 are constant under any constant increment, by considering an incremented version for some constant.

It follows that these quantities stay constant for the transformations that we discuss in Section 6.3.

B.5 Decomposition of the relevance function

In this section, we derive the decomposition of the relevance function

(8)

in terms of input variables, in order to elaborate on some critical steps. For that, we show that the integrated gradients (Sundararajan et al., 2017) of this function have a closed analytic solution and do not need to be computed numerically. For a function f of the input x and a reference point x̃, the integrated gradients are formally defined as

IG_i(x) = (x_i − x̃_i) ∫₀¹ [∂f/∂x_i](x̃ + t(x − x̃)) dt        (9)

which, as a consequence of the gradient theorem, is a conservative decomposition of f(x) − f(x̃). If x̃ is also a root of f, then integrated gradients can serve as an explanation of f(x). The coefficient appearing in Equation (8) is always positive. We need to consider three cases separately:

  1. In the first case, there is no root point, but the relevance function still admits a minimum. Performing integrated gradients at this minimum is possible, but the decomposition will not be conservative.

  2. In the second case, the relevance function has a root point.

  3. In the third case, there is always a root point of the relevance function. The nearest root point is the one on the segment between the data point and the support vector.

All these cases can be handled with the same redistribution formulas, mapping support vector relevances to the heatmap in the input domain. We prove the decomposition for the third case, where a root point exists on the segment, and subsequently generalize to the other two cases.

  1. The gradient of the relevance function with respect to the input can be written in closed form.

  2. The nearest root of the relevance function lies on the segment between the data point and the support vector (and exists by assumption). Integrating the gradient along this segment yields

    (10)

  3. The decomposition by integrated gradients (Sundararajan et al., 2017) is obtained by combining Equations (10) and (9) with a parametrization of the segment.

  4. Injecting the root point of Equation (8) gives the final decomposition.
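To make the closed-form property concrete on a simple special case (an illustration only, not the exact relevance function of Equation (8)): for f(x) = ‖x − v‖² with the root point x̃ = v, the integrated gradients reduce analytically to (x_i − v_i)² and sum to f(x). The snippet below verifies this numerically; v, x, and the choice of f are illustrative.

import numpy as np

rng = np.random.default_rng(1)
v = rng.normal(size=4)                 # illustrative root point (e.g. a support vector)
x = rng.normal(size=4)                 # data point to explain

def grad_f(z):
    # gradient of f(z) = ||z - v||^2
    return 2.0 * (z - v)

# numerical integrated gradients along the segment from v to x
t = np.linspace(0.0, 1.0, 10001)
path = v + t[:, None] * (x - v)                      # points on the segment
ig_numeric = (x - v) * np.trapz(grad_f(path), t, axis=0)

ig_closed_form = (x - v) ** 2                        # analytic solution

assert np.allclose(ig_numeric, ig_closed_form, atol=1e-8)
assert np.allclose(ig_closed_form.sum(), np.sum((x - v) ** 2))   # conservation: sums to f(x)
print(ig_closed_form)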