Interpretable Anomaly Detection with DIFFI: Depth-based Feature Importance for the Isolation Forest

07/21/2020 · Mattia Carletti et al. · Università di Padova

Anomaly Detection is one of the most important tasks in unsupervised learning as it aims at detecting anomalous behaviours w.r.t. historical data; in particular, multivariate Anomaly Detection has an important role in many applications thanks to the capability of summarizing the status of a complex system or observed phenomenon with a single indicator (typically called 'Anomaly Score') and thanks to the unsupervised nature of the task, which does not require human tagging. The Isolation Forest is one of the most commonly adopted algorithms in the field of Anomaly Detection, due to its proven effectiveness and low computational complexity. A major problem affecting the Isolation Forest is its lack of interpretability, as it is not possible to grasp the logic behind the model predictions. In this paper we propose effective, yet computationally inexpensive, methods to define feature importance scores at both the global and the local level for the Isolation Forest. Moreover, we define a procedure to perform unsupervised feature selection for Anomaly Detection problems based on our interpretability method. We provide an extensive analysis of the proposed approaches, including comparisons against state-of-the-art interpretability techniques. We assess the performance on several synthetic and real-world datasets and make the code publicly available to enhance reproducibility and foster research in the field.

I Introduction

Anomaly Detection (AD) techniques are aimed at automatically identifying anomalies (or outliers) within a given collection of data points. Their effectiveness is of paramount importance in a wide array of application domains, ranging from wireless sensor networks [30] and industrial cyber-physical systems [44] to healthcare [29] and driving systems [49], and their high usability is mainly due to the fact that most AD algorithms can be trained and deployed in unsupervised settings. This is particularly appealing in environments where data labelling by human experts is prohibitively expensive and time-consuming, in other words environments where any intelligent technological solution should be conceived according to an underlying human-centered principle that minimizes human effort. In recent years, a growing volume of research has focused on approaches based on Deep Neural Networks (DNNs) to tackle the AD task, especially for applications involving graphs [46] and videos [36]. Despite their high performance, DNNs cannot be considered the ultimate solution to every AD problem, as they exhibit a number of drawbacks in several real-world scenarios: i) depending on the complexity of the task and the dimensionality of the data, the training process of a DNN might last many hours or even days; ii) state-of-the-art DNN models are implemented (and trained) on expensive Graphics Processing Units, which might not be affordable in environments characterized by limited budgets or resource-constrained devices; iii) typically, a huge number of data points is required for a DNN to achieve satisfying generalization capabilities. For these reasons, there remain countless applications where traditional AD techniques, such as LOF [5], ABOD [18] and the Isolation Forest [23, 25], are preferred over solutions based on DNNs.

Although AD algorithms have proved to be extremely useful and effective, their widespread adoption is far from being a reality, even in industries and organizations with adequate infrastructures. This is actually a more general problem affecting any technology based on Machine Learning (ML) and it is mainly due to two ‘soft’ factors: (i) lack of confidence/trust from the users in the outcomes of AD algorithms and (ii) no immediate association between AD algorithm outcomes and root causes. The first issue arises from the lack of labelled data points (which, on the other hand, is one of the main reasons why AD algorithms are appealing in the first place), which makes it impossible to set up an adequate testing procedure. This leads either to blindly trusting the algorithm or to not using it at all, both undesirable outcomes. The second issue, instead, concerns the possibility of gaining additional knowledge about the task at hand, which may translate into actionable insights for troubleshooting or root cause analysis. The aforementioned issues can be addressed following the principles of eXplainable Artificial Intelligence (XAI) [13], whose objective is to make so-called black-box ML models easily understandable by human beings.

In the remainder of this Section we review the relevant literature in the XAI field (Section I-A) and provide an overview of the main contributions of this work (Section I-B), trying to clarify the positioning of the proposed approaches within the XAI ecosystem according to the principles recently introduced in the field for the classification and evaluation of interpretability methods (Section I-C). Section II is devoted to the description and analysis of the proposed interpretability methods while in Section III the experimental results and a discussion thereof are provided. Finally, in Section IV we draw the conclusions and identify some interesting research directions for future works.

I-A Related works

XAI is a research field that has attracted great interest in recent years, as evidenced by a growing body of research and as a consequence of the widespread interest in, and diffusion of, ML-based solutions in countless application scenarios and industries [17]. The general goal of XAI is to shed light on the inner workings of Machine Learning and Deep Learning models, especially in the context of regression and classification problems. A major focus is put on DNNs [7] and ensemble methods [41], two emblematic examples of algorithm classes that provide models that are highly accurate, but hard for humans to understand.

Given that DNNs achieve state-of-the-art performance on several complex tasks such as image classification, text classification and time series forecasting, just to name a few, it comes as no surprise that a considerable volume of research in the XAI field has focused on the problem of DNN interpretability. The latter can be tackled with the purpose of either providing explanations about the predictions (i.e. the outputs) produced by the model [47, 38, 32], or interpreting the internal representations of the processed data [37, 3]. It is worth highlighting a third promising line of research aimed at designing inherently interpretable DNNs [28, 19, 33]. Since a complete dissertation on DNN interpretability is out of the scope of this work, we refer the curious reader to [11, 50].

As regards ensemble methods, we mainly relate our work to Random Forests (RFs) [4], but several works on the interpretation of other ensembles (such as Gradient Boosting Decision Trees) can be found in the literature [42, 43]. RFs are ensembles of classification or regression trees leveraging bagging to reduce the variance of predictions. With respect to single Decision Trees, RFs significantly improve performance in terms of accuracy, at the price of reduced interpretability. In this context, many works address the problem of improving standard feature importance score methods. Relevant examples are [39], which proposes an improvement of the permutation importance measure based on a conditional permutation scheme, and [20], in which the authors introduce a variant of the Mean Decrease Impurity (MDI) feature importance measure aimed at overcoming the problem of MDI feature selection bias. Besides single-feature importance measures, it is worth mentioning some recent works focused on the detection of interactions between features [2, 26, 8].

Beyond interpretability methods tailored to particular ML models, so-called model-specific methods characterized by high translucency [31] (i.e. they heavily rely on the inherent structure of the specific ML model under examination), there also exist several flexible approaches which have earned remarkable interest due to their high portability (i.e. they can be applied to a wide range of models), so-called model-agnostic methods. Among the most prominent model-agnostic techniques used to explain individual predictions are LIME [35] and SHAP [27]. Partial Dependence Plots [10] and Accumulated Local Effects plots [1], instead, are examples of model-agnostic methods used to explain a model's behavior at the global level. While on one hand high portability may appear as an attractive feature for interpretability methods, on the other hand it must be said that the interpretability problem is usually dealt with only once a specific model type has been chosen, and the usefulness of model-agnostic methods often simply lies in the lack of model-specific methods for several classes of ML models. Moreover, model-agnostic methods usually exhibit non-negligible problems:

  • Since the inner structure of the model being examined is not exploited, the user might suspect that the provided explanation is just a simplistic and coarse approximation of the true underlying relation between the input and the output.

  • The majority of model-agnostic methods are based on the manipulation of inputs and evaluation of the effects said manipulations induce on the corresponding predictions; this represents a delicate process as the artificially created input instances might not belong to the original data manifold, potentially causing stability issues and raising doubts about the actual information conveyed by the interpretability method.

  • In light of the need for further restrictive assumptions and/or opaque methodological choices (e.g. independence between features, the creation of perturbed input instances), the user is asked to take a leap of faith and consider the method as reasonable while not fully understanding the theoretical underpinnings; undoubtedly, this simply shifts the problem from the lack of trust in the model to the lack of trust in the interpretability method itself.

Exhaustive descriptions, analyses and examples of both model-specific and model-agnostic approaches can be found in [12, 31].

I-B Contributions

Motivated by the attention it attracts from a growing and heterogeneous community of researchers and practitioners, in this work we direct our efforts to the interpretation of the Isolation Forest (IF) [23, 25]. The IF model is particularly appreciated and widely used thanks to its high detection performance (very often even with default hyperparameter values, with no tuning required) and its computational efficiency. Despite that, just like all ensemble learning methods, it might trigger perplexities and doubts as far as interpretability is concerned: indeed, no information is available about the logic behind the mechanism producing the predictions, nor any indication of which features are most relevant to solve the AD task. In this work, we propose for the first time model-specific methods (i.e. methods based on the particular structure of the IF model) to address the mentioned issues. Specifically, we introduce:

  • A global interpretability method, called Depth-based Isolation Forest Feature Importance (DIFFI), to provide Global Feature Importances (GFIs) which represent a condensed measure describing the macro-behavior of the IF model on training data.

  • A local version of the DIFFI method, called Local-DIFFI, to provide Local Feature Importances (LFIs) aimed at interpreting individual predictions made by the IF model at test time.

  • A simple and effective procedure to perform unsupervised feature selection for AD problems based on the DIFFI method.

Each contribution mentioned above complies with the human-centered principle adopted throughout this work, whose main goal is to match the user's needs to the best extent possible. This translates into a number of characteristics we sought to prioritise, e.g. limited computational times and light, straightforward hyperparameter tuning procedures. Additionally, our approach does not require additional background knowledge (e.g. the game theory concepts necessary to fully grasp the rationale behind SHAP), since it is based on very basic computations on quantities that naturally emerge from the principles governing the IF model. Along these lines, the proposed methods are consistent with the simplicity that characterizes the IF model, thus avoiding the risk of developing an interpretability framework which is more complex than the model itself.

I-C Motivations

If we consider the design of the evaluation procedure as part of the problem formalization process, the need for interpretable algorithms in the context of AD is consistent with the connection between the notions of interpretability and incompleteness evidenced in [9]. Indeed, due to the lack of labelled datasets in AD problems, we are almost always in unsupervised settings and AD algorithms can rarely be tested in practice. To fill this gap, which may prevent the adoption of such automated systems, we need to provide proxies to assess their trustworthiness.

DIFFI is, as far as we know, the first model-specific method addressing the need for interpretability of the IF detector. Notice the use of the term interpretability: according to the definitions in [11], we consider the proposed feature importance scoring systems as simple and easy-to-grasp tools to capture the intrinsic logic governing the behavior of IF. Nonetheless, such a condensed representation is not meant to be a complete description of the model's inner workings and predictions, as it does not allow the behavior of the system to be anticipated.

The global DIFFI method is inspired by the preliminary work [6], but differs entirely in the information it is supposed to convey: while in [6] the goal is to gain additional knowledge on the specific AD problem at hand (which is extremely useful especially in contexts where no domain expertise is available), in this work we focus on providing additional information about a trained instance of the IF model, with the main goal of increasing users' trust. Indeed, if the estimated feature importance scores align well with human prior knowledge, users would be more prone to lessen supervision and safely give more autonomy to the machine (at least in non-critical scenarios), thus facilitating a massive adoption of the IF in fields where professionals' skepticism towards intelligent algorithms is still a major obstacle to a more widespread use.

The model-specific nature of DIFFI is motivated by the intent to reflect the actual logic governing the IF behavior, which may not be feasible with some model-agnostic techniques. For example, when exploiting interpretable surrogate models [31] trained to approximate the predictions of a black-box model, we need to make sure that the surrogate model fits the predictions of the original model with a satisfactory level of accuracy. Such a requirement represents an undesirable source of suspicion. Moreover, as argued in [31, 22], models commonly considered universally interpretable, such as decision trees or linear regression, may lose their transparency advantage when asked to fit complex relations: very deep decision trees do not offer simple and intuitive visualizations, while linear regression is not suitable to model highly non-linear mappings.

DIFFI is a post-hoc method: we decided to preserve the performance of an established and effective AD algorithm and focus on providing global and local feature importance measures computed a posteriori. The design of an intrinsically interpretable model would have required sacrificing some predictive power, in light of the trade-off between accuracy and interpretability [31].

The introduction of a local variant of the original algorithm for the interpretation of individual predictions serves a two-fold objective: on one hand it enables the interpretation of single data points in online settings, when the model has already been deployed; on the other hand it helps in enhancing trust as the user can check not only whether the model tends to make mistakes on those kinds of inputs where humans also make mistakes [22], but also whether the misclassified inputs are being misinterpreted in the same way a human would.

Finally, it should be noted that by providing both a global and a local interpretability method we can guarantee maximum flexibility: based on the required granularity or the amount of time that can be invested in the analysis of the results, the user can choose the solution best suited to the specific scenario in which they operate.

II DIFFI

In this Section, we first summarize the key concepts at the core of the IF algorithm and introduce the necessary notation. Then we extensively discuss the rationale behind the DIFFI method and thoroughly analyse each building block. We then propose a local variant of the DIFFI approach, Local-DIFFI, for the interpretation of individual predictions. We conclude with the introduction of a novel method based on global DIFFI for unsupervised feature selection.

II-A Background: the Isolation Forest

As introduced in Section I, the IF is an unsupervised AD algorithm leveraging an isolation procedure to infer a measure of outlierness, called anomaly score, for each data point: the isolation procedure is based on recursive partitioning and aims at defining an area in the data domain where only the data point under examination lies. The underlying mechanism of IF is based on the reasonable hypothesis that isolating an outlier is easy and requires few recursive splits, while isolating an inlier requires many more. Let us explain the IF algorithm more formally in the following.

The IF is an ensemble of N Isolation Trees (ITs), i.e. base anomaly detectors characterized by a tree-like structure. ITs are data-induced random trees, in which each internal node v is associated with a randomly chosen splitting feature, denoted f_v, and a randomly chosen splitting threshold, denoted θ_v. Data points associated with node v undergo a split test: points for which the value of f_v is less than θ_v are sent to the left child of v, the others to the right child.

Given a dataset X of n p-dimensional data points, each IT t is assigned a subset X_t ⊂ X (usually called bootstrap sample) sampled from the original set, and carries out an isolation procedure based on the split tests associated with its internal nodes. Bootstrap samples have the same predetermined size ψ, i.e. |X_t| = ψ for every tree t.

Data points in X_t (called in-bag samples, from the perspective of tree t) are recursively partitioned until either all points are isolated or the IT reaches a predetermined depth limit h_max, a function of the bootstrap sample size ψ (in the original formulation [23], h_max = ⌈log₂ ψ⌉). As a result, each data point x ends up in a leaf node, denoted l_t(x). We will denote with h_t(x) the number of edges that x passes through in its path from the root node to the corresponding leaf node, which is equivalent to the depth of the leaf node l_t(x).
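For concreteness, a minimal NumPy sketch of the isolation procedure carried out by a single IT is reported below; it is a didactic illustration rather than the reference implementation, and all function and variable names are ours.

```python
import numpy as np

def grow_tree(X, depth=0, depth_limit=8, rng=None):
    """Recursively grow an Isolation Tree on the in-bag sample X (shape: n x p).

    Each internal node stores a randomly chosen splitting feature and threshold;
    the recursion stops when a point is isolated or the depth limit is reached.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, p = X.shape
    if n <= 1 or depth >= depth_limit:
        return {"type": "leaf", "size": n, "depth": depth}
    f = rng.integers(p)                              # random splitting feature f_v
    lo, hi = X[:, f].min(), X[:, f].max()
    if lo == hi:                                     # degenerate node: no valid threshold
        return {"type": "leaf", "size": n, "depth": depth}
    theta = rng.uniform(lo, hi)                      # random splitting threshold theta_v
    go_left = X[:, f] < theta                        # split test
    return {"type": "node", "feature": f, "threshold": theta,
            "left": grow_tree(X[go_left], depth + 1, depth_limit, rng),
            "right": grow_tree(X[~go_left], depth + 1, depth_limit, rng)}

def leaf_depth(tree, x):
    """h_t(x): number of edges from the root to the leaf reached by data point x."""
    node = tree
    while node["type"] == "node":
        node = node["left"] if x[node["feature"]] < node["threshold"] else node["right"]
    return node["depth"]
```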

The procedure described above is iterated over all the ITs, each of which is assigned a different bootstrap sample. The anomaly score for a generic data point x is then computed as

    s(x) = 2^( −E[h(x)] / c(ψ) )        (1)

where c(ψ) is a normalization factor given by

    c(ψ) = 2 H(ψ − 1) − 2 (ψ − 1)/ψ        (2)

and H(i) is the harmonic number, which can be estimated as H(i) ≈ ln(i) + 0.5772156649 (Euler's constant). E[h(x)] is the average path length associated with x and is computed as

    E[h(x)] = (1/N) Σ_{t=1}^{N} h_t(x)        (3)
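To make Eqs. (1)–(3) concrete, the following minimal NumPy sketch computes the normalization factor c(ψ) and the anomaly scores from an array of per-tree path lengths (again a didactic illustration, not the reference implementation).

```python
import numpy as np

EULER_GAMMA = 0.5772156649

def c_factor(psi):
    """Normalization factor c(psi) of Eq. (2), defined for psi >= 2."""
    harmonic = np.log(psi - 1) + EULER_GAMMA          # H(psi - 1) ~ ln(psi - 1) + Euler's constant
    return 2.0 * harmonic - 2.0 * (psi - 1) / psi

def anomaly_scores(path_lengths, psi):
    """Eqs. (1) and (3): s(x) = 2^(-E[h(x)] / c(psi)).

    `path_lengths` is an (n_points, n_trees) array holding h_t(x) for every tree.
    Scores close to 1 flag likely outliers; scores well below 0.5 flag inliers.
    """
    avg_depth = path_lengths.mean(axis=1)             # E[h(x)], Eq. (3)
    return 2.0 ** (-avg_depth / c_factor(psi))
```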

In the last step of the IF algorithm, anomalous data points are flagged through a thresholding operation on the anomaly scores. In this way, it is possible to partition the original set X as follows:

  • the subset of predicted inliers X_I = {x ∈ X : ŷ(x) = 0},

  • the subset of predicted outliers X_O = {x ∈ X : ŷ(x) = 1},

where ŷ(x) is the binary label produced by the thresholding operation, indicating whether the corresponding data point is anomalous (ŷ(x) = 1) or not (ŷ(x) = 0).

For further details on the IF algorithm and its properties, we refer the reader to the original paper [23] and to the extended work [25]. To conclude, it is worth highlighting that the IF, as a tree-based ensemble model, shares an inherent structure similar to that of RF. Nonetheless, random choices have a far greater impact in the IF since, unlike in RF, the attributes associated with internal nodes are not selected according to specific splitting criteria but, indeed, randomly. This may be daunting to researchers interested in making the IF interpretable, but the present work serves as evidence that finding a solution to such a challenge is feasible.

II-B DIFFI: method

At the core of the DIFFI method are two simple hypotheses defining when a variable is ‘important’ for the AD task at hand, which we describe hereafter. A split test associated with a feature deemed important (for the purposes of the AD task at hand) should:

  • (I1) induce the isolation of anomalous data points at small depths (i.e. close to the root), while relegating regular data points to the bottom end of the trees;

  • (I2) produce higher imbalance on anomalous data points, while ideally being useless on regular points.

Let us consider an already-trained instance of the IF detector and the corresponding training set X. For each tree t we partition, based solely on the predictions produced by tree t, the assigned bootstrap sample X_t into the subset of predicted inliers X_I^(t) and the subset of predicted outliers X_O^(t), where

    X_I^(t) = {x ∈ X_t : ŷ_t(x) = 0},    X_O^(t) = {x ∈ X_t : ŷ_t(x) = 1},

and ŷ_t(x) denotes the prediction produced by tree t for data point x. Predictions are obtained, as usual, through a thresholding operation on the anomaly scores, which are now computed by replacing E[h(x)] with h_t(x) in (1). The choice to consider only bootstrap samples for each tree, rather than the entire training set, is motivated by the desire to decouple the evaluation of feature importance scores from the generalization capability of the trained model: if we considered the whole training set, we would implicitly take into account also the performance of single trees on unseen data since, as an effect of the bootstrap procedure, each tree was trained only on a fraction of the training data points. While this might not be a major problem, since training data are supposed to be drawn from the same distribution and thus the performance on in-bag and out-of-bag samples should be similar, considering only in-bag samples for each tree also makes the computational cost independent of the training set size.

We will define Cumulative Feature Importances (CFIs) for inliers and outliers, real-valued quantities that will then be properly normalized and combined to produce the final feature importance measures, by exploiting data points in X_I^(t) and X_O^(t), for t = 1, …, N. The update of the CFIs is performed in an additive fashion and depends on two quantities that reflect the two intuitions explained above: the depth of the leaf node where a specific data point ends up (intuition I1) and the Induced Imbalance Coefficient (IIC) associated with a specific internal node (intuition I2). In the remainder of this Section we first explain how to compute IICs, then we describe the procedure for the update of the CFIs for inliers and outliers and how they are combined to produce the final GFIs.

II-B1 IICs computation

Let us consider the generic internal node v in tree t. Let n(v) represent the number of data points associated with node v, n_l(v) the number of data points associated with its left child and n_r(v) the number associated with its right child. The IIC of node v, denoted λ(v), is obtained as follows

    λ(v) = 0                 if n_l(v) = 0 or n_r(v) = 0
    λ(v) = g(λ̃(v))           otherwise                         (4)

where

    λ̃(v) = max(n_l(v), n_r(v)) / n(v)        (5)

and g(·) is a scaling function mapping its input into the interval [0.5, 1]. Specifically, we use the following scaling function

    g(λ̃(v)) = 0.5 + 0.5 · (λ̃(v) − λ̃_min(v)) / (λ̃_max(v) − λ̃_min(v))        (6)

where λ̃_min(v) = ⌈n(v)/2⌉ / n(v) and λ̃_max(v) = (n(v) − 1)/n(v) denote the minimum and maximum scores, respectively, that can be obtained a priori given the number of data points associated with the specific node location v. We notice that by scaling the values of λ̃(v) we can reduce the impact of problems related to the different number of data points we may have at different locations. For instance, if 10 data points are associated with a specific node, the worst non-useless split (5 points to the left child and 5 to the right child) leads to λ̃(v) = 0.5; instead, if n(v) = 9, the worst non-useless split (5 samples to the left child and 4 to the right child, or vice versa) leads to λ̃(v) ≈ 0.56. After extended testing, we concluded that the new IICs computation strategy (5) has little to no effect on the overall performance of the proposed interpretability method if compared to that used in our previous work [6]. Even so, we consider it conceptually more appropriate and, as such, worth introducing since no additional complexity is involved. In (4), the first case represents a useless split, in which all data points are sent either to the left or to the right child. The best possible split, instead, is what we call an isolating split: this happens when either the left or the right child receives exactly one data point. An isolating split is assigned the highest possible IIC, i.e. 1. As it will become clear later on, we need to distinguish between IICs computed on inliers, denoted λ_I(v), and the counterpart computed on outliers, denoted λ_O(v).
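As an illustration, a minimal Python sketch of the IIC computation for a single node is reported below; it follows the reconstruction of Eqs. (4)–(6) given above (in particular, the scaling into [0.5, 1] reflects our reading and the exact scaling used in the official implementation may differ), and the function name is ours.

```python
import numpy as np

def induced_imbalance_coefficient(n_left, n_right):
    """IIC of an internal node, following Eqs. (4)-(6) as reconstructed above.

    Useless splits (one empty child) get 0; valid splits are min-max scaled so that
    the most balanced split maps to 0.5 and the isolating split to 1, independently
    of the node size n(v).
    """
    n = n_left + n_right
    if n_left == 0 or n_right == 0:                          # useless split
        return 0.0
    lam = max(n_left, n_right) / n                           # Eq. (5)
    lam_min = np.ceil(n / 2) / n                             # most balanced split achievable
    lam_max = (n - 1) / n                                    # isolating split
    if lam_max == lam_min:                                   # e.g. n <= 3: every valid split isolates
        return 1.0
    return 0.5 + 0.5 * (lam - lam_min) / (lam_max - lam_min) # Eq. (6)
```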

II-B2 CFIs update

Let V_t represent the set of nodes in tree t and P_t(x) the path from the root node to the leaf node associated with data point x in tree t. We will distinguish between the CFI for inliers, denoted I_I, and the counterpart for outliers, denoted I_O. Notice that both I_I and I_O are p-dimensional vectors, where the j-th component represents the CFI (for inliers or outliers) of the j-th feature.

We describe the CFI update rule only for inliers (i.e. for I_I), as the extension to I_O is immediate. First we initialize I_I = 0_p, where 0_p denotes the p-dimensional vector of zeros. Then we update I_I in an additive fashion. Specifically, we iterate over the subset of predicted inliers X_I^(t) and, for the generic predicted inlier x, we iterate over the internal nodes in its path P_t(x) (in tree t). If the splitting feature associated with the generic internal node v is the j-th feature, then we update the j-th component of I_I by adding the quantity

    (1 / h_t(x)) · λ_I(v)        (7)

where we recall that h_t(x) denotes the depth of the leaf node (in tree t) associated with data point x. In (7), we can notice the contributions of two factors, formalizing the two intuitions at the core of DIFFI: the right-hand side factor λ_I(v) characterizes the “local” effect of the split through the induced imbalance at that specific node location; the left-hand side factor 1/h_t(x), instead, characterizes the “global” effect of the split, taking into account potential situations in which an apparently bad (from the “local” perspective) split actually re-organizes the data points in a way that makes it easier for subsequent split tests to isolate them.

As regards the update rule for I_O, the only differences w.r.t. the procedure detailed above are that we iterate over X_O^(t) rather than X_I^(t) and that we replace λ_I(v) with λ_O(v) in (7).

II-B3 GFIs computation

Recalling that in the IF, differently from what happens in the RF algorithm, the splitting features are selected randomly, the careful reader should perceive a potential issue: if a generic feature were sampled more frequently than others, it would unfairly receive a higher CFI. We define the features counter for inliers, denoted C_I, and the counterpart for outliers, denoted C_O, as p-dimensional vectors whose j-th component represents how many times the j-th feature appeared while updating the CFIs. Again, C_I is updated iterating over X_I^(t), while C_O is updated iterating over X_O^(t). In order to filter out the effect of the random splitting feature selection, we simply normalize the CFIs by the corresponding features counters, C_I and C_O respectively. The GFIs are then obtained as

    GFI = (I_O / C_O) / (I_I / C_I)        (8)

where divisions are performed element-wise. Notice that a higher feature importance for inliers (i.e. a high value of the denominator in (8)) implies a lower overall feature importance. This is consistent with intuition I1: important features isolate outliers closer to the root and simultaneously do not contribute to the isolation of inliers.
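A schematic Python sketch of the resulting GFI computation is reported below; the representation of root-to-leaf paths as (feature, IIC) pairs is a simplification of ours, not the authors' data structure, and the small constant added to the denominators is only there to avoid divisions by zero.

```python
import numpy as np

def update_cfi(cfi, counter, path, leaf_depth):
    """Add the contribution of one data point along its root-to-leaf path (Eq. (7)).

    `path` is a list of (feature_index, iic) pairs for the traversed internal nodes,
    where `iic` is lambda_I(v) or lambda_O(v) depending on the class being processed;
    `leaf_depth` is h_t(x).
    """
    for feature, iic in path:
        counter[feature] += 1
        cfi[feature] += (1.0 / leaf_depth) * iic

def global_diffi(outlier_paths, inlier_paths, n_features):
    """Aggregate the CFIs over all trees and points and return the GFI of Eq. (8).

    `outlier_paths` / `inlier_paths` collect, over all trees, one (path, leaf_depth)
    pair per predicted outlier / inlier in the corresponding bootstrap sample.
    """
    cfi_o, cnt_o = np.zeros(n_features), np.zeros(n_features)
    cfi_i, cnt_i = np.zeros(n_features), np.zeros(n_features)
    for path, depth in outlier_paths:
        update_cfi(cfi_o, cnt_o, path, depth)
    for path, depth in inlier_paths:
        update_cfi(cfi_i, cnt_i, path, depth)
    eps = 1e-12                                   # guard against divisions by zero
    return (cfi_o / (cnt_o + eps)) / (cfi_i / (cnt_i + eps) + eps)
```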

II-C DIFFI: local interpretability

For the interpretation of individual predictions produced by the IF, we exploit a procedure similar to the one described in Section II-B, with some differences due to the impossibility of computing certain quantities in the local case (i.e. when considering one sample at a time). Specifically:

  • the Induced Imbalance Coefficients cannot be computed since we consider only one sample;

  • all quantities referred to predicted inliers cannot be computed, since the focus is on the interpretation of predicted outliers.

The Local Feature Importance (LFI) for a predicted outlier x is computed as

    LFI(x) = I_x / C_x        (9)

where C_x is the features counter for the (single) predicted outlier x, divisions are performed element-wise, and the CFI vector I_x is now updated by adding the quantity

    1/h_t(x) − 1/h_max        (10)

while iterating over all the ITs in the forest.

Notice that while in the global case the contribution due to the depth of the leaf node where the data point ends up is weighted through the IIC at the current node, in the local case we need a different strategy to take into account the usefulness of the splits. To overcome this problem, we introduced the correction term −1/h_max in Equation (10), which accounts for the non-zero contribution of a useless split: without the correction term, the added quantity would always be strictly greater than zero, also in cases where the data point under examination is not isolated (i.e. when it ends up in a leaf node at the maximum depth h_max).
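A minimal Python sketch of the Local-DIFFI computation is reported below, assuming the reconstruction of Eqs. (9)–(10) given above; the representation of the per-tree paths is ours.

```python
import numpy as np

def local_diffi(forest_paths, n_features, depth_limit):
    """Local feature importance for a single predicted outlier (Eqs. (9)-(10)).

    `forest_paths` holds one (features_on_path, leaf_depth) pair per tree, where
    `features_on_path` lists the splitting features met along the root-to-leaf
    path of the point and `leaf_depth` is h_t(x); `depth_limit` is h_max.
    """
    cfi = np.zeros(n_features)
    counter = np.zeros(n_features)
    for features_on_path, leaf_depth in forest_paths:
        contribution = 1.0 / leaf_depth - 1.0 / depth_limit   # Eq. (10), with correction term
        for f in features_on_path:
            counter[f] += 1
            cfi[f] += contribution
    return cfi / np.maximum(counter, 1)                        # Eq. (9), element-wise
```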

II-D Unsupervised feature selection with global DIFFI

The DIFFI method outlined above can be effectively exploited to perform feature selection in the context of AD problems when labels associated with training data points are not available. The procedure consists in training several instances of IF (with different random seeds) and aggregating the corresponding DIFFI scores to define a ranking over the features. The motivations behind this strategy stem from the human-centered design principle adopted in this work and can be summarized as follows:

  • Users can spend only a limited amount of time on preprocessing operations since, especially in production environments, the deployment of novel algorithmic solutions is usually meant to promptly react to emerging issues; the light computational cost of the DIFFI method, thanks to the in-bag samples trick, is particularly appealing in such time-constrained applications.

  • With the intention of minimizing the effort of users, the choice of the IF (combined with DIFFI) as a proxy model to produce a ranking on the features is attractive as it introduces just a few hyperparameters to be tuned; in addition, it is worth mentioning that the IF is often preferred over other AD algorithms due to its good performance with the default hyperparameters values suggested in the original paper.

  • The proposed strategy for unsupervised feature selection takes into account the nature of the task, while most other methods do not. This is particularly important for AD tasks, as features that are relevant for classification might not be relevant for AD and vice versa. A user interested in solving an AD problem may trust a method specifically suited for that purpose more than other, task-agnostic methods.

In addition to the unquestionable usefulness of the unsupervised feature selection task, the procedure outlined above also represents an excellent proxy to indirectly assess the quality of the feature importance scores provided by the global DIFFI method described in Section II-B. Indeed, good feature importance scores lead to a good ranking of the features, which in turn translates into a good solution to the unsupervised feature selection problem.

III Experimental Results

In this Section we report experimental results on synthetic and real-world datasets to assess the effectiveness of both the global DIFFI method, used to perform unsupervised feature selection, and its local variant, used to provide feature importance scores associated with individual predictions. We do not provide direct results on the global DIFFI scores themselves since it is impossible to get a ground-truth measure of what the model has actually learnt. For the same reason, any comparison with other state-of-the-art interpretability methods would be meaningless, since the latter are usually meant as tools for knowledge discovery, i.e. aimed at providing additional knowledge on the problem/data rather than on the model itself.

We make the code publicly available at https://github.com/mattiacarletti/DIFFI to enhance reproducibility of our experimental results and to foster research in the field.

III-A Interpretation of individual predictions

For the interpretation of individual predictions provided by the IF model we exploit the local variant of the DIFFI method described in Section II-C. We assess the effectiveness of the Local-DIFFI method on a synthetic dataset and a real-world dataset: on both datasets we have prior knowledge about the most relevant features for the AD task to be solved, which is fundamental for evaluating the performance of DIFFI. We remark that finding real-world data for AD tasks with a priori knowledge on the relevant features is not a trivial task. The experimental setup adopted here simulates a real scenario of remarkable interest in several application domains: given a trained instance of the IF, the user is interested in deploying the model in online settings to get the prediction and the corresponding local feature importance scores associated with each individual data point being processed. Typical applications include (but are not limited to) the monitoring of smart manufacturing systems and the detection of abnormal patterns in healthcare data: in both examples the promptness of responses may be crucial to ensure quick and effective corrective actions for the industrial processes/machines or for the well-being of patients.

III-A1 Synthetic dataset

The synthetic dataset employed in this work was created by initially considering 2-dimensional data points, whose dimension is then augmented by adding noise features, similarly to what was done in [6]. Specifically, the generic data point x is represented by the p-dimensional vector

    x = [x_1, x_2, x_3, …, x_p]^T        (11)

where x_j, for j = 3, …, p, are white noise samples. The two informative components x_1 and x_2 are generated from parameters drawn from continuous uniform distributions, whose supports differ between regular and anomalous data points.

For our experiments we consider a training set composed of 1000 6-dimensional data points (thus 4 noise features), with 10% anomalies. We trained an instance of IF with 100 trees and ψ = 256 (typical choices for the IF hyperparameters [23]), and we obtained an F1-score on the training data equal to 0.76.

For the testing phase, we generated 300 additional ad-hoc anomalies, displayed in Figure 1 (projected on the subspace of relevant features): 100 lying on the x_1-axis (blue points), 100 on the x_2-axis (orange points) and 100 on the bisector (green points). The prior knowledge for this AD task is represented by the fact that only feature x_1 is relevant for outliers on the x_1-axis, only feature x_2 is relevant for outliers on the x_2-axis, and both x_1 and x_2 are relevant for outliers on the bisector (all the other features, being white noise samples, are irrelevant in all cases).
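For illustration, test outliers of this kind could be generated along the lines of the following sketch; the magnitude of the informative components and the white-noise distribution are our own illustrative assumptions, as the exact values used in the experiments are not reported here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_noise = 4                      # 4 white-noise features, as in the training set described above
scale = 5.0                      # assumed magnitude placing the points far from the inlier region

def make_test_outliers(direction, n=100):
    """Outliers along a given direction of the (x1, x2) plane, padded with noise features."""
    radii = scale * rng.uniform(1.0, 2.0, size=(n, 1))          # assumed radial range
    informative = radii * np.asarray(direction)                  # broadcast to shape (n, 2)
    noise = rng.standard_normal((n, n_noise))                    # assumed white-noise distribution
    return np.hstack([informative, noise])

outliers_x1 = make_test_outliers([1.0, 0.0])                             # only x1 should matter
outliers_x2 = make_test_outliers([0.0, 1.0])                             # only x2 should matter
outliers_bis = make_test_outliers([1.0 / np.sqrt(2), 1.0 / np.sqrt(2)])  # x1 and x2 both matter
```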

Once the predictions associated with the generated test outliers were obtained, we ran the Local-DIFFI algorithm to get the corresponding local feature importance scores, and we compared the performance of the Local-DIFFI method with SHAP. As can be seen in Figure 2, both methods perfectly identify the actually important feature(s): in the first two rows, for all correctly predicted outliers, the first column (representing the most important feature as estimated by the interpretability method) is always associated with the correct feature, namely x_1 and x_2 for outliers on the x_1-axis and on the x_2-axis, respectively; in the third row (referring to the points on the bisector), instead, both x_1 and x_2 are deemed important by Local-DIFFI and SHAP, thus aligning with prior knowledge. In this latter case, we also observed that the feature importance scores provided by Local-DIFFI for features x_1 and x_2 are comparable, while the same rightly does not happen for outliers on the axes. A major advantage of Local-DIFFI over SHAP is the computational time: while SHAP has an average execution time of 0.221 seconds per sample, Local-DIFFI runs in 0.023 seconds per sample on average.

Fig. 1: Synthetic test outliers projected on the (x_1, x_2) plane.
Fig. 2: Feature rankings for the synthetic dataset based on local DIFFI scores (left column) and SHAP scores (right column): outliers on the x_1-axis (first row), on the x_2-axis (second row) and on the bisector (third row).

III-A2 Real-world dataset

We consider a modified version of the Glass Identification UCI dataset (https://archive.ics.uci.edu/ml/datasets/Glass+Identification), originally intended for multiclass classification tasks. The dataset consists of 213 glass samples represented by a 9-dimensional feature vector: one feature is the refractive index (RI), while the remaining eight features indicate the concentration of Magnesium (Mg), Silicon (Si), Calcium (Ca), Iron (Fe), Sodium (Na), Aluminum (Al), Potassium (K) and Barium (Ba). Originally the glass samples were representative of seven categories of glass type, but for our experiments we group classes 1, 2, 3 and 4 (i.e. window glass) to form the class of regular points, while the other three classes contribute to the set of anomalous data points (i.e. non-window glass): containers glass (class 5), tableware glass (class 6) and headlamps glass (class 7). We assess the performance of Local-DIFFI on predicted outliers belonging to class 7, considered as test data points. Similarly to [14], we exploit prior knowledge on headlamps glass: the concentration of Aluminum, used as a reflective coating, and the concentration of Barium, which induces heat-resistant properties, should be important features when distinguishing between headlamps glass and window glass.

We trained an instance of IF with 100 trees, and obtained an F1-score on the training data equal to 0.55. On the test data (class 7), the IF was able to identify 28 out of 29 anomalies. As for the synthetic dataset, we ran the Local-DIFFI algorithm to get the local feature importance scores and compared the performance with that obtained with the SHAP method. As can be seen in Figure 3, Local-DIFFI identifies the concentrations of Barium and Aluminum as the most important features in the vast majority of predicted anomalies, well aligned with the a priori information about the task. The same cannot be said for SHAP: while the most important feature is still the concentration of Barium, the second most important feature for almost all predictions is the concentration of Magnesium. Additionally, also in this case Local-DIFFI exhibits a much smaller execution time (0.019 seconds per sample on average) than SHAP (0.109 seconds per sample on average).

Fig. 3: Feature rankings for the glass dataset based on local DIFFI scores (left column) and SHAP scores (right column): class 7 outliers (headlamps glass).

III-B Unsupervised feature selection

According to the procedure outlined in Section II-D, we exploit the global DIFFI scores to define a ranking over the features representing the data points in the problem at hand. In all experiments described below we run several instances of IF, obtained with the same training data but different random seeds, in order to filter out effects due to the stochasticity inherently present in the model. The global DIFFI scores associated with each instance of IF are then aggregated as follows:

  1. We define the vector of aggregated scores, initialized as a p-dimensional vector of zeros, where p is the number of features.

  2. For each of the IF instances:

    • we rearrange the global DIFFI scores in decreasing order, thus obtaining a ranking of the features (for the specific IF instance) from the most important one to the least important one;

    • we update the vector of aggregated scores by adding, for each feature, a quantity that is a function of its estimated rank r, namely

      1 / r        (14)

      Notice that in (14) we differentiate more the scores among the most important features, while for the least important ones the added scores are similar and very small (see Figure 4).

  3. The resulting vector of aggregated scores is then used to define a ranking over the features: the higher the aggregated score, the more important the feature (see the code sketch below).

Fig. 4: Update function for the aggregated scores as a function of the feature rank.
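A minimal Python sketch of this aggregation step is given below; it assumes the reconstruction of Eq. (14) as 1/r given above, and the function name is ours rather than that of the official implementation.

```python
import numpy as np

def aggregate_rankings(gfi_runs):
    """Aggregate the global DIFFI scores of several IF instances into a single ranking.

    `gfi_runs` is an (n_runs, p) array of GFI vectors; each run adds 1 / rank to every
    feature (rank 1 = most important), following the reconstruction of Eq. (14) above.
    Returns feature indices sorted from most to least important.
    """
    gfi_runs = np.asarray(gfi_runs)
    n_runs, p = gfi_runs.shape
    aggregated = np.zeros(p)
    for run_scores in gfi_runs:
        order = np.argsort(-run_scores)            # features by decreasing GFI
        ranks = np.empty(p, dtype=int)
        ranks[order] = np.arange(1, p + 1)         # rank 1 for the most important feature
        aggregated += 1.0 / ranks
    return np.argsort(-aggregated)
```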

To verify the quality of the selected features, we perform experiments on six common AD datasets from the Outlier Detection DataSets (ODDS) database (http://odds.cs.stonybrook.edu/), whose characteristics are summarized in Table I. Once the ranking is obtained, we train an instance of IF by exploiting only the top k most important features (according to the ranking), for increasing values of k. We repeat the procedure 30 times (with different random seeds) and compute the median F1-score.

Dataset       Num. samples   Num. features
satellite     6435           36
cardio        1831           21
ionosphere    351            33
lympho        148            18
musk          3062           166
letter        1600           32
TABLE I: AD datasets used for unsupervised feature selection experiments.
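The evaluation protocol just described can be sketched as follows using scikit-learn's IsolationForest; the hyperparameter values shown are illustrative defaults rather than the grid-searched values used for each dataset, and `ranking` is assumed to come from an aggregation step like the one sketched earlier.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import f1_score

def median_f1_topk(X, y, ranking, k, n_repeats=30):
    """Median F1-score of an IF trained on the top-k ranked features only.

    `ranking` is the feature ordering produced by the aggregation step; `y` holds
    ground-truth labels (1 = outlier) and is used exclusively for evaluation.
    """
    cols = np.asarray(ranking)[:k]
    scores = []
    for seed in range(n_repeats):
        clf = IsolationForest(n_estimators=100, random_state=seed)
        pred = clf.fit_predict(X[:, cols])         # scikit-learn convention: +1 inlier, -1 outlier
        scores.append(f1_score(y, (pred == -1).astype(int)))
    return np.median(scores)
```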

We provide comparisons with two other commonly used unsupervised feature selection techniques, i.e. the Laplacian Score [16] and SPEC [51]. We did not consider other techniques such as Nonnegative Discriminative Feature Selection [21] and ℓ2,1-Norm Regularized Discriminative Feature Selection [45] due to their prohibitive computational cost, which makes extensive usage practically cumbersome. The hyperparameter values for the final IF model are tuned separately for each dataset through grid search, exploiting all the available features, and then kept fixed for all the experiments involving the same dataset. For the unsupervised feature selection methods, instead, we set the hyperparameters to the default values in order to be consistent with the goal of minimizing the user effort in time-consuming operations. Furthermore, this approach closely matches real-world applications in which the lack of ground-truth labels prevents the design of any principled hyperparameter tuning procedure, leading users to rely on existing "rules of thumb".

Fig. 5: Evaluation of the global DIFFI method (diffi_5) for unsupervised feature selection, compared with Laplacian Score (lapl) and SPEC (spec) methods.

As can be seen in Figure 5, the performance of DIFFI is comparable with that of the Laplacian Score and SPEC methods, and DIFFI consistently outperforms them for a wide range of values of k (i.e. the number of exploited features) on the cardio, ionosphere, and musk datasets. Also notice that in most cases DIFFI is able to identify the optimal combination of features, i.e. the subset of features leading to the highest (median) F1-score value.

Beyond these considerations, which are essentially of a quantitative nature, it is equally (if not more) important to adopt the perspective of the final user and go through all the subtle aspects that make a specific method more attractive than others. Along these lines, we believe that task-specific methods such as DIFFI are preferable over task-agnostic methods (like the Laplacian Score and SPEC), as the features that are actually relevant to solve a classification problem might not be relevant to solve an AD problem [34]. This comes as no surprise in light of the different nature of the two tasks and it may unconsciously affect the user's preference when reasoning about the most appropriate approach. Additionally, as mentioned in Section II-D, the procedure based on DIFFI requires minimal (if any) hyperparameter tuning: the only hyperparameters are inherited from the underlying proxy model, i.e. an instance of IF, which has proved to provide satisfactory performance with the default hyperparameter values on a broad spectrum of applications.

IV Conclusions and Future Works

This paper introduces Depth-based Feature Importance for the Isolation Forest (DIFFI), a method to provide interpretability traits to the Isolation Forest (IF), one of the most popular and effective Anomaly Detection (AD) algorithms. By providing a quantitative measure of feature importance in the context of the AD task, DIFFI makes it possible to describe the behavior of IF at both the global and the local scale, providing insightful information that can be exploited by final users of an IF-based AD solution to get a better understanding of the underlying process and to enable root cause analysis; moreover, the approach can help scientists and developers improve their solutions by getting a better understanding of the important variables in their AD task.

One of the main merits of DIFFI is that it is straightforward to implement and requires very few parameters to be tuned; however, despite its simplicity, DIFFI is as effective as the current state-of-the-art method SHAP, with significantly smaller computational costs, making it appealing for real-world production applications and even amenable to real-time scenarios. Moreover, we show that DIFFI can be employed to perform unsupervised feature selection, allowing the development of computationally parsimonious (and potentially more accurate) AD solutions. We believe that, given the rapidly growing interest in the IF, DIFFI will be of paramount importance to enhance its usability and applicability; we also believe that equipping the IF with DIFFI would lead to an increase in its adoption, thanks to the increased trust of users towards methods that exhibit interpretability traits.

Finally, we envision that DIFFI could be extended to other tree-based models for AD (for example the Extended Isolation Forest [15], SCiForest [24] or the Streaming HSTrees [40]). In particular, the low computational cost opens up the opportunity to exploit DIFFI in the flourishing field of online AD applications with streaming data, where time efficiency is crucial [30, 48]. Moreover, it may also be possible to employ DIFFI for other tasks such as out-of-distribution sample detection: we will explore this direction in future research.

References

  • [1] D. W. Apley (2016) Visualizing the effects of predictor variables in black box supervised learning models. arXiv preprint arXiv:1612.08468. Cited by: §I-A.
  • [2] S. Basu, K. Kumbier, J. B. Brown, and B. Yu (2018) Iterative random forests to discover predictive and stable high-order interactions. Proceedings of the National Academy of Sciences 115 (8), pp. 1943–1948. Cited by: §I-A.
  • [3] D. Bau, B. Zhou, A. Khosla, A. Oliva, and A. Torralba (2017) Network dissection: quantifying interpretability of deep visual representations. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6541–6549. Cited by: §I-A.
  • [4] L. Breiman (2001) Random forests. Machine learning 45 (1), pp. 5–32. Cited by: §I-A.
  • [5] M. M. Breunig, H. Kriegel, R. T. Ng, and J. Sander (2000) LOF: identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD international conference on Management of data, pp. 93–104. Cited by: §I.
  • [6] M. Carletti, C. Masiero, A. Beghi, and G. A. Susto (2019) Explainable machine learning in industry 4.0: evaluating feature importance in anomaly detection to enable root cause analysis. In 2019 IEEE International Conference on Systems, Man and Cybernetics (SMC), pp. 21–26. Cited by: §I-C, §II-B1, §III-A1.
  • [7] Z. Che, S. Purushotham, R. Khemani, and Y. Liu (2016) Interpretable deep models for icu outcome prediction. In AMIA Annual Symposium Proceedings, Vol. 2016, pp. 371. Cited by: §I-A.
  • [8] S. Devlin, C. Singh, W. J. Murdoch, and B. Yu (2019) Disentangled attribution curves for interpreting random forests and boosted trees. arXiv preprint arXiv:1905.07631. Cited by: §I-A.
  • [9] F. Doshi-Velez and B. Kim (2017) Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608. Cited by: §I-C.
  • [10] J. H. Friedman (2001) Greedy function approximation: a gradient boosting machine. Annals of statistics, pp. 1189–1232. Cited by: §I-A.
  • [11] L. H. Gilpin, D. Bau, B. Z. Yuan, A. Bajwa, M. Specter, and L. Kagal (2018) Explaining explanations: an overview of interpretability of machine learning. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pp. 80–89. Cited by: §I-A, §I-C.
  • [12] R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, and D. Pedreschi (2019) A survey of methods for explaining black box models. ACM computing surveys (CSUR) 51 (5), pp. 93. Cited by: §I-A.
  • [13] D. Gunning (2017) Explainable artificial intelligence (XAI). Defense Advanced Research Projects Agency (DARPA). Cited by: §I.
  • [14] N. Gupta, D. Eswaran, N. Shah, L. Akoglu, and C. Faloutsos (2018) Beyond outlier detection: lookout for pictorial explanation. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 122–138. Cited by: §III-A2.
  • [15] S. Hariri, M. C. Kind, and R. J. Brunner (2018) Extended isolation forest. arXiv preprint arXiv:1811.02141. Cited by: §IV.
  • [16] X. He, D. Cai, and P. Niyogi (2006) Laplacian score for feature selection. In Advances in neural information processing systems, pp. 507–514. Cited by: §III-B.
  • [17] M. I. Jordan and T. M. Mitchell (2015) Machine learning: trends, perspectives, and prospects. Science 349 (6245), pp. 255–260. Cited by: §I-A.
  • [18] H. Kriegel, M. Schubert, and A. Zimek (2008) Angle-based outlier detection in high-dimensional data. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 444–452. Cited by: §I.
  • [19] O. Li, H. Liu, C. Chen, and C. Rudin (2018) Deep learning for case-based reasoning through prototypes: a neural network that explains its predictions. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §I-A.
  • [20] X. Li, Y. Wang, S. Basu, K. Kumbier, and B. Yu (2019) A debiased mdi feature importance measure for random forests. arXiv preprint arXiv:1906.10845. Cited by: §I-A.
  • [21] Z. Li, Y. Yang, J. Liu, X. Zhou, and H. Lu (2012) Unsupervised feature selection using nonnegative spectral analysis. In Twenty-Sixth AAAI Conference on Artificial Intelligence, Cited by: §III-B.
  • [22] Z. C. Lipton (2018) The mythos of model interpretability. Queue 16 (3), pp. 31–57. Cited by: §I-C, §I-C.
  • [23] F. T. Liu, K. M. Ting, and Z. Zhou (2008) Isolation forest. In 2008 Eighth IEEE International Conference on Data Mining, pp. 413–422. Cited by: §I-B, §I, §II-A, §III-A1.
  • [24] F. T. Liu, K. M. Ting, and Z. Zhou (2010) On detecting clustered anomalies using sciforest. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 274–290. Cited by: §IV.
  • [25] F. T. Liu, K. M. Ting, and Z. Zhou (2012) Isolation-based anomaly detection. ACM Transactions on Knowledge Discovery from Data (TKDD) 6 (1), pp. 3. Cited by: §I-B, §I, §II-A.
  • [26] S. M. Lundberg, G. G. Erion, and S. Lee (2018) Consistent individualized feature attribution for tree ensembles. arXiv preprint arXiv:1802.03888. Cited by: §I-A.
  • [27] S. M. Lundberg and S. Lee (2017) A unified approach to interpreting model predictions. In Advances in neural information processing systems, pp. 4765–4774. Cited by: §I-A.
  • [28] D. A. Melis and T. Jaakkola (2018) Towards robust interpretability with self-explaining neural networks. In Advances in Neural Information Processing Systems, pp. 7775–7784. Cited by: §I-A.
  • [29] L. Meneghetti, M. Terzi, S. Del Favero, G. A. Susto, and C. Cobelli (2018) Data-driven anomaly recognition for unsupervised model-free fault detection in artificial pancreas. IEEE Transactions on Control Systems Technology. Cited by: §I.
  • [30] X. Miao, Y. Liu, H. Zhao, and C. Li (2018) Distributed online one-class support vector machine for anomaly detection over networks. IEEE Transactions on Cybernetics 49 (4), pp. 1475–1488. Cited by: §I, §IV.
  • [31] C. Molnar (2019) Interpretable machine learning. Lulu. com. Cited by: §I-A, §I-C, §I-C.
  • [32] W. J. Murdoch, P. J. Liu, and B. Yu (2018) Beyond word importance: contextual decomposition to extract interactions from lstms. arXiv preprint arXiv:1801.05453. Cited by: §I-A.
  • [33] B. N. Oreshkin, D. Carpov, N. Chapados, and Y. Bengio (2019) N-beats: neural basis expansion analysis for interpretable time series forecasting. arXiv preprint arXiv:1905.10437. Cited by: §I-A.
  • [34] L. Puggini and S. McLoone (2018) An enhanced variable selection and isolation forest based methodology for anomaly detection with oes data. Engineering Applications of Artificial Intelligence 67, pp. 126–135. Cited by: §III-B.
  • [35] M. T. Ribeiro, S. Singh, and C. Guestrin (2016) "Why should I trust you?": explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp. 1135–1144. Cited by: §I-A.
  • [36] M. Sabokrou, M. Fayyaz, M. Fathy, and R. Klette (2017) Deep-cascade: cascading 3d deep neural networks for fast anomaly detection and localization in crowded scenes. IEEE Transactions on Image Processing 26 (4), pp. 1992–2004. Cited by: §I.
  • [37] A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson (2014) CNN features off-the-shelf: an astounding baseline for recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 806–813. Cited by: §I-A.
  • [38] K. Simonyan, A. Vedaldi, and A. Zisserman (2013) Deep inside convolutional networks: visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034. Cited by: §I-A.
  • [39] C. Strobl, A. Boulesteix, T. Kneib, T. Augustin, and A. Zeileis (2008) Conditional variable importance for random forests. BMC bioinformatics 9 (1), pp. 307. Cited by: §I-A.
  • [40] S. C. Tan, K. M. Ting, and T. F. Liu (2011) Fast anomaly detection for streaming data. In Twenty-Second International Joint Conference on Artificial Intelligence, Cited by: §IV.
  • [41] G. Tolomei, F. Silvestri, A. Haines, and M. Lalmas (2017) Interpretable predictions of tree-based ensembles via actionable feature tweaking. In Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, pp. 465–474. Cited by: §I-A.
  • [42] G. Valdes, J. M. Luna, E. Eaton, C. B. Simone II, L. H. Ungar, and T. D. Solberg (2016) MediBoost: a patient stratification tool for interpretable decision making in the era of precision medicine. Scientific reports 6, pp. 37854. Cited by: §I-A.
  • [43] X. Wang, X. He, F. Feng, L. Nie, and T. Chua (2018) Tem: tree-enhanced embedding model for explainable recommendation. In Proceedings of the 2018 World Wide Web Conference, pp. 1543–1552. Cited by: §I-A.
  • [44] J. Yang, C. Zhou, S. Yang, H. Xu, and B. Hu (2017) Anomaly detection based on zone partition for security protection of industrial cyber-physical systems. IEEE Transactions on Industrial Electronics 65 (5), pp. 4257–4267. Cited by: §I.
  • [45] Y. Yang, H. T. Shen, Z. Ma, Z. Huang, and X. Zhou (2011) L2,1-norm regularized discriminative feature selection for unsupervised learning. In Twenty-Second International Joint Conference on Artificial Intelligence, Cited by: §III-B.
  • [46] Y. Yuan, D. Ma, and Q. Wang (2015) Hyperspectral anomaly detection by graph pixel selection. IEEE transactions on cybernetics 46 (12), pp. 3123–3134. Cited by: §I.
  • [47] M. D. Zeiler and R. Fergus (2014) Visualizing and understanding convolutional networks. In European conference on computer vision, pp. 818–833. Cited by: §I-A.
  • [48] L. Zhang, J. Zhao, and W. Li (2019) Online and unsupervised anomaly detection for streaming data using an array of sliding windows and pdds. IEEE Transactions on Cybernetics. Cited by: §IV.
  • [49] M. Zhang, C. Chen, T. Wo, T. Xie, M. Z. A. Bhuiyan, and X. Lin (2017) SafeDrive: online driving anomaly detection from large-scale vehicle data. IEEE Transactions on Industrial Informatics 13 (4), pp. 2087–2096. Cited by: §I.
  • [50] Q. Zhang and S. Zhu (2018) Visual interpretability for deep learning: a survey. Frontiers of Information Technology & Electronic Engineering 19 (1), pp. 27–39. Cited by: §I-A.
  • [51] Z. Zhao and H. Liu (2007) Spectral feature selection for supervised and unsupervised learning. In Proceedings of the 24th international conference on Machine learning, pp. 1151–1157. Cited by: §III-B.