Rule Extraction in Unsupervised Anomaly Detection for Model Explainability: Application to OneClass SVM

11/21/2019 ∙ by Alberto Barbado, et al. ∙ Universidad Politécnica de Madrid ∙ Telefónica

OneClass SVM is a popular method for unsupervised anomaly detection. Like many other methods, it suffers from the black box problem: it is difficult to justify, in an intuitive and simple manner, why the decision frontier identifies data points as anomalous or non-anomalous. This type of problem is widely addressed for supervised models. However, it is still an uncharted area for unsupervised learning. In this paper, we describe a method to infer rules that justify why a point is labelled as an anomaly, so as to obtain intuitive explanations for models created using the OneClass SVM algorithm. We evaluate our proposal with different datasets, including real-world data coming from industry. With this, our proposal contributes to extending Explainable AI techniques to unsupervised machine learning models.


Introduction

Responsible Artificial Intelligence [alej2019explainable] is an emerging discipline that is gaining relevance in both academic and industrial research. An increasing number of organisations are defining policies and criteria for the usage of data and for the development of decision support systems using AI techniques.

For example, Telefónica has defined its own AI principles [benjamins2019responsible], which can be organised into the following categories:

  • Detecting sensitive data in the datasets used to train Machine Learning (ML) models (for example, the use of gender information to support the decision of giving a credit score), or detecting whether there is a bias in the data, which may have an unwanted effect on decision making. Besides directly analysing the dataset, there are also methods to evaluate a trained model and check whether there is any bias in its decisions. This is done using a group of metrics known as fairness metrics [Speicher_2018, Hardt:2016:EOS:3157382.3157469, zhang2018mitigating].

  • Explaining how an algorithm reaches a conclusion in a way that is clear and intuitive for a human being. This is crucial not only to avoid the fear of considering AI systems as black boxes that may suddenly take harmful decisions in the future, but also to contribute to the democratisation of AI and increase trust in these systems.

The first group of challenges is being widely addressed by researchers, with tools that audit datasets and trained models to detect risks while also providing solutions to mitigate those biases [bellamy2018ai, saleiro2018aequitas]. The second group is addressed through the use of Explainable AI (XAI) techniques, which generate post-hoc explanations based on the information provided by the black box model. In the literature, there are many XAI proposals for supervised ML models. However, some of the most recent and thorough reviews on XAI [gilpin2018explaining, mueller2019explanation, alej2019explainable, molnar2019interpretable] do not mention the application of such techniques to unsupervised learning.

In this paper we focus on unsupervised ML models used for anomaly detection. Our main contribution is a novel algorithmic solution to generate explanations for the particular case of anomaly detection using OneClass SVM (OCSVM) algorithms. Such explanations will be obtained using rule extraction techniques.

We perform an empirical validation of the results of our proposal using open datasets as well as real data from Telefónica. Our evaluation consists in analysing whether the number of rules for the non-anomalous data points is lower than the number obtained for the anomalous data points. We also compare the rules extracted with our algorithm with those extracted by a surrogate Decision Tree (DT) trained over the features and outputs of the unsupervised anomaly detection model.

The rest of the paper is organised as follows. First, we describe related work in the area of XAI and rule extraction applied to SVM. After identifying research opportunities derived from those works, the paper introduces the algorithm implementation for OCSVM. Following this, we present an empirical evaluation of our algorithm with several datasets. We then conclude, also outlining potential lines of future research.

Related Work

This section reviews unsupervised ML models used for anomaly detection, as well as previous work on rule extraction in SVM that is relevant for our proposal.

Unsupervised ML for Anomaly Detection

Many algorithms for unsupervised anomaly detection exist. Examples are Isolation Forest [liu2008isolation], Local Outlier Factor (LOF) [breunig2000lof] and OCSVM [scholkopf2000support]. The latter has relevant advantages over the former ones, mainly in terms of computational performance. This is because it creates a decision frontier using only the support vectors (like general supervised SVM) and because model training always leads to the same solution, since the optimization problem is convex. However, SVM (and hence OCSVM) algorithms are among the most difficult ones to explain, due to the mathematically complex method that obtains the decision frontier.

SVM for classification conceptually maps the data points available in the dataset to a space of higher dimension than the one determined by their features, so that the separation between classes can be done linearly. It uses a hyperplane obtained from data points of all the classes. These data points, known as support vectors, are the ones closest to the decision frontier and the only ones needed to determine it. However, it is not actually necessary to map the points to a higher dimension, because the equation that appears in the optimization of the algorithm only uses dot products of those mapped points. Hence, the only thing to be calculated is that dot product, something that can be accomplished with the well-known kernel trick: instead of explicitly calculating the mapping to a higher dimension, the equation is solved using a kernel function.
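To make the kernel trick concrete, the following sketch (in Python, with illustrative data) computes an RBF kernel matrix with Scikit-Learn and verifies that each entry equals exp(-γ‖x_i - x_j‖²), i.e. that only pairwise kernel values are needed and the higher-dimensional mapping is never built explicitly:

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))               # 5 points with 2 features

gamma = 0.5
K = rbf_kernel(X, X, gamma=gamma)         # K[i, j] = exp(-gamma * ||x_i - x_j||^2)

# The same values computed directly from pairwise distances, with no explicit mapping:
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
assert np.allclose(K, np.exp(-gamma * D2))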

In OCSVM there are no labels, hence all data points are initially considered to belong to the same class. The decision frontier is computed by trying to separate the region of the hyperspace with a high density of data points close to each other from another region with low density, considering the points in the latter as anomalies. To do so, the algorithm tries to define a decision frontier that maximizes the distance to the origin of the hyperspace and, at the same time, separates the maximum number of data points from it. The compromise between those two factors leads to the optimization problem of the algorithm and allows obtaining the optimal decision frontier. The data points that are separated from the origin are labelled as non-anomalous (+1) and the others are labelled as anomalous (-1).

The optimization problem is reflected in the following equations:

$$\min_{w,\,\xi,\,\rho}\;\;\frac{1}{2}\lVert w\rVert^{2}+\frac{1}{\nu\ell}\sum_{i=1}^{\ell}\xi_{i}-\rho \qquad \text{subject to} \qquad (w\cdot\Phi(x_{i}))\geq\rho-\xi_{i},\;\;\xi_{i}\geq 0 \qquad (1)$$

In that equation, ν is a hyper-parameter known as the rejection rate, which needs to be selected by the user. It sets an upper bound on the fraction of points that may be considered anomalies, and also a lower bound on the fraction of support vectors; ℓ is the number of training points, Φ the implicit feature mapping, and ξ_i are slack variables. Using Lagrange techniques, the decision frontier obtained is the following one (where α_i are the Lagrange multipliers and k is the kernel function):

$$f(x)=\operatorname{sgn}\!\left(\sum_{i=1}^{\ell}\alpha_{i}\,k(x_{i},x)-\rho\right) \qquad (2)$$

Hence, the hyper-parameters that must be defined in this method are the rejection rate, ν, and the type of kernel used.
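In practice, those are the only choices exposed when fitting the model. A minimal sketch with Scikit-Learn follows (the hyperparameter values here are illustrative, not the ones used later in the evaluation):

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(42)
X = np.concatenate([rng.normal(0, 1, size=(200, 2)),    # dense, mostly normal region
                    rng.uniform(-6, 6, size=(10, 2))])   # sparse points, likely anomalous

# nu is the rejection rate: an upper bound on the fraction of anomalies and a lower
# bound on the fraction of support vectors; gamma is the RBF kernel coefficient.
ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma=0.1).fit(X)

labels = ocsvm.predict(X)                # +1 = non-anomalous, -1 = anomalous
print((labels == -1).sum(), "points labelled as anomalies")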

Rule Extraction in SVM

Several papers deal with the importance of XAI applied to models such as SVM. In particular, [martens2008rule] aims to resolve the black box problem in SVM for supervised classification tasks. It obtains a set of rules that explain in a simple manner the boundaries that contain the values of the different classes. Thanks to that, it is easier to understand which conditions identify a data point as belonging to one class or to another (in OCSVM, belonging to class +1, normal data, or to class -1, anomaly). The challenge consists in discovering a way to map the algorithm results to a particular set of rules. There are two general ways to do this. One consists in inferring the rules directly from the decision frontier, using a decompositional rule extraction technique. The other (simpler) one, called a pedagogical rule extraction technique, does not consider the decision frontier itself and treats the algorithm as a black box from which rules are extracted according to the class it assigns to the different relevant data points used as input.

The first method is clearly more transparent, as it deals directly with the inner structure of the model. However, it is generally more difficult to implement. One way to implement it is with a technique known as SVM+ Prototypes [nunez2002rule], which consists in finding hypercubes using the centroids (or prototypes) of the data points of each class, and using as vertices the data points of that hyperspace region that are farthest away from the centroid (or directly the support vectors themselves, if they are available). A rule is then inferred from the values of the vertices of the hypercube, which contain the limits of all the points inside it, creating one rule per hypercube.

For example, a dataset that contains two numerical features X and Y is defined in a 2-dimensional space. The algorithm will create a square that contains the data points of each of the classes, as shown in Figure 1. The rule that justifies that a data point belongs to class 2 is:

  • Rule 1: CLASS 2 IF X ≥ X1 ∧ Y ≥ Y1 ∧ X ≤ X2 ∧ Y ≤ Y2

Figure 1: SVM with linear kernel classifying data points of two classes.

The generated hypercubes may wrongly include points from the other class when the decision frontier is not linear or spherical, as shown in Figure 2. In this case the algorithm considers an additional number of clusters trying to include the points into a smaller hypercube, as shown in Figure 3.

Figure 2: A hypercube generated using the farthest points leads to the wrong inclusion of data from the other class.
Figure 3: Using more hypercubes avoids the aforementioned problem. Now there is no wrong inclusion of data points from another class.

A rule will be generated for each hypercube, considering all those scenarios as independent, leading to this output:

  • Group 1: CLASS 1 IF X…

  • Group 2: CLASS 1 IF X…

There are some downsides to that method in supervised classification tasks, especially when the problem is not simply a binary classification or when the algorithm is performing a regression. For instance, the number of rules may grow immensely, since a set of rules is generated for each category and each set may contain a huge number of rule groups, leading to an incomprehensible output.

However, in OCSVM these difficulties may potentially be mitigated for two reasons. On the one hand, the explanations are reduced to rules that explain when a data point is not an anomaly (so there is no need to define rules for the anomalies). On the other hand, the algorithm tries to group all non-anomalous points together, setting them apart from the outliers. Because of this, the chance of defining a hypercube that does not contain a point from the other class may be higher than in a standard classification task. Both the inherently unbalanced nature of data in anomaly detection (few anomalies vs. many more non-anomalous data points) and the fact that non-anomalous points tend to be close to each other may help achieve good results with this method.

Method

We first describe the intuition behind our approach for rule extraction from an OCSVM anomaly detection model. Then, we describe the algorithm implementation in detail.

Algorithm Intuition

We propose using rule extraction techniques within OCSVM models for anomaly detection, by generating hypercubes that encapsulate the non-anomalous data points, and using their vertices as rules that explain when a data point is considered non-anomalous. Our method has the following characteristics, according to the taxonomy for XAI in [molnar2019interpretable]:

  • Post-hoc: Explainability is achieved using external techniques.

  • Global and individual: Explanations serve to explain how the whole model works, as well as why a specific data point is considered anomalous or non-anomalous.

  • Model-agnostic: As with other techniques for global explanations [molnar2019interpretable], the only information needed to build the explanations is the input features and the outcomes of the system after fitting the model.

  • Counterfactual: The explanations for why a data point is anomalous also include information on the changes that should take place in the feature values in order to consider that data point as non-anomalous.

Since explanations are based on hypercubes, the OCSVM kernel to be used is the Radial Basis Function (RBF), illustrated in Figure 4.

Figure 4: With an RBF Kernel the correct hypercube will be the one that encloses the points that are not anomalies, since the OCSVM algorithm will try to enclose most of the points inside the decision frontier and leave anomalies outside.

The dimensionality of the hyperspace corresponds to the different numerical features used for fitting the model. However, a caveat must be considered when some features are categorical, non-ordinal variables. In that case, the approach is to extract a rule for each of the possible combinations of categorical values among the data points that are not considered anomalous. Considering again the aforementioned 2-dimensional example, with variable X being a binary categorical feature, a dataset may look like the one in Figure 5:

Figure 5: Rule extraction with a categorical variable.

In that case, two rules would be extracted, one for each of the possible states of X (a simplified sketch of this per-state extraction follows the two rules):

  • Rule 1: NOT OUTLIER IF X = 0 ∧ Y ≤ Y2 ∧ Y ≥ Y1

  • Rule 2: NOT OUTLIER IF X = 1 ∧ Y ≤ Y4 ∧ Y ≥ Y3
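A deliberately simplified sketch of this per-state extraction (it ignores the clustering step and simply bounds Y by the extreme values observed for each state of X; the data and column names are hypothetical):

import pandas as pd

# Hypothetical non-anomalous points: X is a binary categorical feature, Y is numerical.
df_no_anomaly = pd.DataFrame({
    "X": [0, 0, 0, 1, 1, 1],
    "Y": [1.2, 1.8, 1.5, 4.1, 4.9, 4.4],
})

# One rule per categorical state, bounding Y by the extremes observed in that state.
for x_value, group in df_no_anomaly.groupby("X"):
    y_min, y_max = group["Y"].min(), group["Y"].max()
    print(f"NOT OUTLIER IF X = {x_value} AND Y >= {y_min} AND Y <= {y_max}")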

Generally speaking, the algorithm logic can be summarised as:

  • Apply OCSVM to the dataset to create the model.

  • Depending on the characteristics of variables, do:

    • Case 1. Numerical only: Iteratively create clusters in the non-anomalous data (starting with one cluster) and create a hypercube for each cluster using its centroid and the points furthest away from it. Check whether any hypercube contains a data point from the anomalous group; if it does, repeat using one more cluster than before, and end when no anomalies are contained in the generated hypercubes. If there are anomalies but the number of data points in a cluster is smaller than the number of vertices needed for the hypercube, complete the missing vertices with artificial data points, and end when there are no anomalies inside or when the convergence criterion is reached.

    • Case 2. Categorical only: The rules will correspond directly to the different value states contained in the dataset of non-anomalous points.

    • Case 3. Both numerical and categorical: This case is analogous to Case 1, but data points are filtered for each of the combinations of categorical variable states. For each combination, there will be a set of rules for the numerical features.

  • Use these vertices to obtain the boundaries of that hypercube and directly extract rules from them.

Algorithm Description

Algorithm 1 contains the proposal for rule extraction from an OCSVM model that may be applied over a dataset with either categorical or numerical variables (or both). ocsvm_rule_extractor is the main function of the algorithm. Regarding input parameters, X is the input data frame with the features, ln is a list with the numerical columns, lc is a list with the categorical columns, and d is a dictionary with the hyperparameters for OCSVM (since an RBF kernel is used, the hyperparameters are ν, the upper bound on the fraction of training errors and lower bound on the fraction of support vectors, and the kernel coefficient, γ). This function starts with the feature scaling of the numerical features (function featureScaling). After that, it fits an OCSVM model with all the data available and detects the anomalies within it, generating two datasets: one with the anomalous data points and one with the rest (function filterAnomalies). The next step is assessing whether the number of data points within the decision frontier is enough to build the hypercubes. In order to check this, the number of hypercube vertices should be lower than the number of non-anomalous data points for that categorical state. The dimension of that hypercube equals the number of numerical features, so for each combination of categorical states the number of vertices will be 2^len(ln), where len(ln) is the length of the list containing the numerical column names (for example, with 7 numerical features a hypercube has 2^7 = 128 vertices).

Next, the algorithm checks whether the features are numerical, categorical, or both. In the case of only numerical columns, it calls function getRules, described in Algorithm 2. getRules receives as input the matrices with the anomalous and the non-anomalous data points, the number of vertices for each hypercube (based on the number of numerical features), and the list of names of those numerical features. The main purpose of this function is to cluster the non-anomalous data points into a set of hypercubes that do not contain any anomalous data points. To do that, it iteratively increases the number of clusters (hypercubes) until there are no anomalous points within any hypercube. The function outPosition checks that the rules defined from the vertices of the hypercube do not include any data point from the anomalous subset.

getRules calls function getVertices with a specific number of clusters. This function performs the clustering over the non-anomalous data points using the function getClusters, which returns the label of the cluster for each data point, as well as the centroid position of each cluster, using the K-means++ algorithm [arthur2007k]. Then, it iterates through each cluster and first obtains the subset of data points belonging to that cluster. After that, if there are enough data points in that cluster (more data points than vertices in the hypercube), it computes the distance of each of them to the centroid with computeDistance and uses the furthest ones as vertices.

If there are not enough data points in that cluster (fewer than the number of vertices of the hypercube), all of them are used as limits, and the missing vertices are artificially generated to close the hypercube (leaving all anomalous data points outside).

Algorithm 1: Rule Extractor algorithm for OneClassSVM - Main (procedure ocsvm_rule_extractor).
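As a complement to the pseudocode, the following is a simplified, runnable sketch of the main flow for the numerical-only case (Case 1). The real getRules (sketched after Algorithm 2) clusters the points and checks for anomalies inside each hypercube; here it is stubbed with a single bounding box so the block is self-contained, and all names other than ocsvm_rule_extractor are our own:

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

def get_rules_stub(df_anomalies, df_no_anomalies, ln):
    # Stand-in for getRules: a single rule bounding all non-anomalous points.
    # The real getRules also uses df_anomalies to check that no anomaly falls inside.
    return [{c: (df_no_anomalies[c].min(), df_no_anomalies[c].max()) for c in ln}]

def ocsvm_rule_extractor(X, ln, d):
    scaler = StandardScaler().fit(X[ln].values)               # featureScaling
    Xs = pd.DataFrame(scaler.transform(X[ln].values), columns=ln)

    model = OneClassSVM(kernel="rbf", **d).fit(Xs)            # fit on all the data
    labels = model.predict(Xs)                                # filterAnomalies
    normals, anomalies = Xs[labels == 1], Xs[labels == -1]

    if len(normals) < 2 ** len(ln):                           # one point per hypercube vertex
        return None

    rules = get_rules_stub(anomalies, normals, ln)
    # Express the bounds back in the original units (undo the feature scaling).
    restored = []
    for rule in rules:
        lo = scaler.inverse_transform([[rule[c][0] for c in ln]])[0]
        hi = scaler.inverse_transform([[rule[c][1] for c in ln]])[0]
        restored.append({c: (round(lo[i], 2), round(hi[i], 2)) for i, c in enumerate(ln)})
    return restored

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(300, 2)), columns=["gdenergy", "gdpuls"])
print(ocsvm_rule_extractor(df, ["gdenergy", "gdpuls"], {"nu": 0.05, "gamma": 0.1}))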

If all the features are categorical, then the rules for non-anomalous data points will simply be the unique combinations of values observed among them. If there are both categorical and numerical features, the algorithm obtains the hypercubes (as described for the numerical-only case) for the subset of data points associated with each combination of categorical values.

After obtaining the rules, a post-processing function is used to express the rules in their original values (not the scaled ones used for the ML model). Another function then checks whether some rules may be included inside others; that is, for each rule it checks whether there is another rule with a bigger scope that includes it as a subset case.
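A minimal sketch of that pruning step, with rules represented as {feature: (lower, upper)} dictionaries (the representation and names are ours): a rule is dropped when another single rule fully contains it, and exact duplicates would have to be removed beforehand.

def contained_in(inner, outer):
    # True when every bound of `inner` lies within the corresponding bound of `outer`.
    return all(outer[f][0] <= inner[f][0] and inner[f][1] <= outer[f][1] for f in inner)

def prune_rules(rules):
    kept = []
    for i, rule in enumerate(rules):
        others = [r for j, r in enumerate(rules) if j != i]
        if not any(contained_in(rule, other) for other in others):
            kept.append(rule)
    return kept

rules = [
    {"gdenergy": (0.0, 10.0), "gdpuls": (0.0, 5.0)},
    {"gdenergy": (2.0, 4.0),  "gdpuls": (1.0, 2.0)},   # contained in the first rule
]
print(prune_rules(rules))                               # only the wider rule survives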

There is a convergence criterion, applied in the scenario where a cluster has fewer data points than the number of vertices. In this case, if there are anomalies within the hypercube, the algorithm compares the number of data points within the cluster against a threshold defined by the number of vertices multiplied by a reference value. If there are fewer data points than this threshold, that cluster is discarded and not considered for rule generation. Since the proposal aims to explain what makes data points non-anomalous, and what should have happened for a data point to be labelled as non-anomalous, this approximation is feasible.
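An illustrative version of that check (the reference value and cluster sizes are hypothetical):

n_numerical_features = 3
n_vertices = 2 ** n_numerical_features     # 8 vertices for a 3-dimensional hypercube
f = 0.5                                    # reference value (illustrative)

cluster_sizes = [2, 5, 40]
kept = [size for size in cluster_sizes if size >= n_vertices * f]
print(kept)                                # only clusters with at least 4 points are kept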

Algorithm 2: Rule Extractor algorithm for OneClassSVM - Additional functions (procedures getRules and getVertices).
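The following is a runnable sketch of the clustering loop behind getRules and getVertices: the number of K-means++ clusters is increased until no hypercube built around a cluster contains an anomalous point. For simplicity, the hypercube of each cluster is bounded by the per-axis extremes of its points (a simplification of taking the points furthest from the centroid as vertices), and the small-cluster handling with artificial vertices is omitted; all names here are ours:

import numpy as np
from sklearn.cluster import KMeans

def hypercube_bounds(points):
    # Axis-aligned bounds spanned by the points of one cluster.
    return points.min(axis=0), points.max(axis=0)

def contains_anomaly(bounds, anomalies):
    lo, hi = bounds
    inside = np.all((anomalies >= lo) & (anomalies <= hi), axis=1)
    return inside.any()

def get_rules(anomalies, normals, max_clusters=20, random_state=0):
    boxes = []
    for k in range(1, max_clusters + 1):
        km = KMeans(n_clusters=k, init="k-means++", n_init=10,
                    random_state=random_state).fit(normals)
        boxes = [hypercube_bounds(normals[km.labels_ == c]) for c in range(k)]
        if not any(contains_anomaly(b, anomalies) for b in boxes):
            break                          # no hypercube contains an anomaly: stop
    return boxes                           # one rule (set of bounds) per hypercube

# Toy usage: two dense groups of normal points and two anomalies between them.
rng = np.random.default_rng(1)
normals = np.vstack([rng.normal(0, 0.3, (100, 2)), rng.normal(3, 0.3, (100, 2))])
anomalies = np.array([[1.5, 1.5], [1.4, 1.6]])
for lo, hi in get_rules(anomalies, normals):
    print(f"NOT OUTLIER IF {lo.round(2)} <= (x, y) <= {hi.round(2)}")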

Evaluation

We use our algorithm over different datasets (both public ones and real data from Telefónica) to evaluate the following hypotheses:

  • The OCSVM algorithm with an RBF decision frontier groups the non-anomalous points within a hypersphere. Hence, the number of rules extracted using non-anomalous data should be lower than the number extracted using anomalous data.

  • Our algorithm yields a number of rules similar to other proposals for global model-agnostic explanations, such as Surrogate Trees.

In our experiments we count the number of rules extracted in different scenarios, and check whether there are significant differences. Several experiments are conducted over the different datasets:

  • Rule extraction over non-anomalous data points with our algorithm.

  • Rule extraction over anomalous data points with our algorithm. In this scenario, there is no convergence criterion, since the number of anomalous data points will be very low (often lower than the number of vertices).

  • Rule extraction over all the data points using a DT overfitted on the dataset, using the features as input and the anomalous/non-anomalous labels as target variable (see the sketch after this list). This yields two types of rules: the ones that explain the anomalous data points, and the ones that explain the non-anomalous ones.
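A hedged sketch of that surrogate-tree baseline (illustrative data and hyperparameter values, not the per-dataset settings reported below):

import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (300, 2)), rng.uniform(-6, 6, (15, 2))])

labels = OneClassSVM(kernel="rbf", nu=0.05, gamma=0.1).fit_predict(X)   # +1 / -1

# Overfit the tree on the same features so that it reproduces the OCSVM labels;
# each leaf corresponds to one rule explaining either class.
tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, labels)
print("leaves (rules):", tree.get_n_leaves())
print(export_text(tree, feature_names=["x1", "x2"], max_depth=3))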

The datasets used belong to different domains and have different sizes and numbers of features (both categorical and numerical). They are indicated in Table 1:

  • Datasets 1 and 2 about seismic activity [sathe2016lodes]. Dataset 1 is bi-dimensional, with only numerical features (’gdenergy’, ’gdpuls’). Dataset 2 has 2 categorical features (’hazard’, ’shift’) and 7 numerical ones (’seismoacoustic’, ’genergy’, ’gplus’, ’gdenergy’, ’gdpuls’, ’bumps’, ’bumps2’).

  • Dataset 3 from a call center at Telefónica. It is real data that includes the total number of calls received by one of its services during every hour. From these data, some time features are extracted (such as the weekday), and they are cyclically transformed, so that each time feature turns into two features holding its sine and cosine components (see the sketch after Table 1). The rules in this case are also transformed back into the original features in order to enhance rule comprehension.

  • Dataset 4 about cardiovascular diseases [padmanabhan2019physician]. There are 4 categorical features (’smoke’, ’alco’, ’active’, ’is_man’) and 7 numerical (’age’, ’height’, ’weight’, ’ap_hi’, ’ap_lo’,’cholesterol’,’gluc’).

  • Datasets 5 and 6 on the US census for year 1990 [blake1998uci]. Dataset 5 has 2 categorical features (’dAncstry1_3’, ’dAncstry1_4’) and 4 numerical ones (’dAge’, ’iYearsch’, ’iYearkwrk’, ’dYrsserv’). Dataset 6 has the same numerical features, but 18 categorical ones (dAncstry_i_j, with i ranging from 1 to 2, and j ranging from 3 to 11 if i equals 1, and from 2 to 10 if i equals 2).

Dataset Ref. Nº Cat. Nº Num. Nº Rows
1 [sathe2016lodes] 0 2 669
2 [sathe2016lodes] 2 7 1705
3 Telefónica 0 5 2712
4 [padmanabhan2019physician] 4 7 42000
5 [blake1998uci] 2 4 100000
6 [blake1998uci] 18 4 200000
Table 1: Description of each dataset, with its reference (Ref.), number of categorical features (Nº Cat.), number of numerical features (Nº Num.) and number of rows (Nº Rows).
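For dataset 3, the cyclical transformation mentioned above can be sketched as follows (column names, periods and values are illustrative): each time feature is mapped to a sine and a cosine component so that adjacent values remain close (for example, 23h stays next to 0h).

import numpy as np
import pandas as pd

df = pd.DataFrame({"hour": [0, 6, 12, 18, 23], "weekday": [0, 1, 3, 5, 6]})

for col, period in [("hour", 24), ("weekday", 7)]:
    df[f"{col}_sin"] = np.sin(2 * np.pi * df[col] / period)
    df[f"{col}_cos"] = np.cos(2 * np.pi * df[col] / period)

print(df.round(3))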

We ran the experiments with the following setup: the implementations of the OCSVM algorithm, the K-means++ clustering and the DT algorithm are based on Scikit-Learn [scikit-learn]. The rest of the code described in Algorithms 1 and 2 was developed from scratch and is available on GitHub [Barbado2019].

OCSVM and K-means++ models use fixed hyperparameter values across datasets. The DT hyperparameters are dynamically set for each dataset, using the Gini criterion to find the best splits and a fixed random seed. For each dataset, the resulting DT is as follows:

  • 1: Depth = 11, Nodes = 53, Leaf nodes = 30

  • 2: Depth = 17, Nodes = 129, Leaf nodes = 69

  • 3: Depth = 13, Nodes = 159, Leaf nodes = 91

  • 4: Depth = 27, Nodes = 2265, Leaf nodes = 1184

  • 5: Depth = 17, Nodes = 353, Leaf nodes = 162

  • 6: Depth = 30, Nodes = 1843, Leaf nodes = 1051

Results are detailed in Table 2. It shows the number of rules extracted for all datasets with the different algorithms.

Dataset Prop. [NA] Prop. [A] DT [NA] DT [A]
1 89 25 11 14
2 33 20 30 93
3 151 37 36 72
4 4754 45 171 63
5 249 185 62 26
6 762 762 86 11
Table 2: Number of rules extracted using our algorithm (Prop.) or the surrogate DT over anomalous (A) and non-anomalous (NA) data points.

Hypothesis 1 is not validated. Table 2 shows that the number of rules for anomalous data points is never higher than the number for non-anomalous ones. This may be because the pruning algorithm is not yet good enough at rule reduction: a rule is removed only if another single, wider rule includes it, but often a rule is not included within one wider rule and is instead covered by the union of several wider rules. This can be seen visually with dataset 1. Figures 6 and 7 show the rules obtained before and after applying the pruning algorithm; although the number of rules is reduced, many could still be removed. This affects the rules obtained from non-anomalous data points more strongly, since those points are closer to each other than the anomalous ones, as shown in Figure 8 (points without a visible hypercube appear when all the vertices coincide, so the hypercube collapses onto the anomalous point itself).

Figure 6: Hypercubes in a 2D space for non-anomalous data (before pruning).
Figure 7: Hypercubes in a 2D space for non-anomalous data (after pruning).
Figure 8: Hypercubes in a 2D space for anomalous data (after pruning).

Regarding hypothesis 2, in most of the cases the DT generates fewer rules than our algorithm, although for the anomalous data points it generates more in some datasets (for example, dataset 4). However, the results may be considered similar.

Figures 9 and 10 show a sample rule over the same dataset for our algorithm and for the DT surrogate method, respectively. The DT method uses symbols for equality and inequalities, while in our method all the extracted rules use only ≤ and ≥, and not < or >.

Figure 9: A sample rule for non-anomalous data points for dataset 4 using our proposal
Figure 10: A sample rule for non-anomalous data points for dataset 4 using the DT surrogate method

Limitations of our Approach

Our method can only be used if there is a minimum number of data points (at least one point per vertex of the hypercube). This limitation may be avoided by generating artificial vertices, as is done for clusters with few data points.

Conclusion

We described a novel implementation of a rule extraction technique that provides model-agnostic, global and local, and counterfactual explanations for unsupervised learning in the scenario of anomaly detection, using the OCSVM algorithm.

We applied our algorithm to open and private datasets, over both anomalous and non-anomalous data, with the initially unexpected result that using anomalous data points yields fewer rules than using the non-anomalous ones. We also compared the results of our algorithm with other surrogate methods, such as a DT model trained over the same features with the anomalous/non-anomalous labels as target variable. In most scenarios both methods yielded similar results, showing that our approach is a feasible XAI technique for anomaly detection.

Future Work

Our rule extraction method can be optimised by using different clustering algorithms, so as to analyse whether there is a reduction in the number of rules that are extracted. It would be interesting to compare clustering methods that consider categorical non-ordinal and numerical features together, instead of dividing the hyperspace according to the combinations of categorical values, as we did in our work. One example of such an algorithm is K-modes [chaturvedi2001k]. Besides, when there are clusters with anomalies, instead of increasing the number of clusters and applying the clustering over the whole dataset again, the algorithm could set aside the data points that are already inside a cluster without anomalies and apply a different clustering algorithm to partition the subspace that still contains anomalies. It would be interesting to compare the number of rules extracted with this approach versus the one in this paper.

Rule extraction should also be designed to consider all types of comparisons (<, >, ≤ and ≥), and more sophisticated pruning techniques should be applied to improve rule reduction, since a rule may be contained in the union of two or more other rules.