Responsible Artificial Intelligence [alej2019explainable] is an emerging discipline that is gaining relevance in both academic and industrial research. An increasing number of organisations are defining policies and criteria for the usage of data and for the development of decision-support systems based on AI techniques.
For example, Telefónica has defined its own AI principles [benjamins2019responsible], which can be organised into the following categories:
Detecting sensitive data in the datasets used to train Machine Learning (ML) models (for example, the use of gender information to support the decision of giving a credit score), or detecting whether there is a bias in the data, which may have an unwanted effect on decision making. Besides analysing the dataset directly, there are also methods to evaluate a trained model and check whether there is any bias in its decisions. This is done using a group of metrics known as fairness metrics [Speicher_2018, Hardt:2016:EOS:3157382.3157469, zhang2018mitigating].
Explaining how an algorithm reaches a conclusion in a way that is clear and intuitive for a human being. This is crucial not only to avoid the fear of AI systems as black boxes that may suddenly take harmful decisions, but also to contribute to the democratisation of AI and to increase trust in these systems.
The first group of challenges is being widely addressed by researchers, with tools to audit datasets and trained models to detect risks, and at the same time provide solutions to mitigate those biases [bellamy2018ai, saleiro2018aequitas]. The second group is addressed through the use of Explainable AI (XAI) techniques, which generate post-hoc explanations based on the information provided by the black box model. In the literature, there are many XAI proposals for supervised ML models. However, some of the most recent and thorough reviews on XAI [gilpin2018explaining, mueller2019explanation, alej2019explainable, molnar2019interpretable] do not mention the application of such techniques to unsupervised learning.
In this paper we focus on unsupervised ML models used for anomaly detection. Our main contribution is a novel algorithmic solution to generate explanations for the particular case of anomaly detection using OneClass SVM (OCSVM) algorithms. Such explanations will be obtained using rule extraction techniques.
We perform an empirical validation of our proposal using open datasets as well as real data from Telefónica. Our evaluation consists in analysing whether the number of rules extracted for the non-anomalous data points is lower than the number extracted for the anomalous ones. We also compare the rules extracted with our algorithm against those extracted with a surrogate Decision Tree (DT) trained over the features and outputs of the unsupervised anomaly detection model.
The rest of the paper is organized as follows. First, we describe some related work in the area of XAI and rule extraction applied to SVM. After identifying research opportunities derived from those works, the paper introduces the algorithm implementation for OCSVM. Following this, we present an empirical evaluation of our algorithm with several datasets. We then conclude, showing also potential future research lines of work.
Related Work
This section reviews unsupervised ML models used for anomaly detection, as well as previous work on rule extraction in SVM that is relevant for our proposal.
Unsupervised ML for Anomaly Detection
Many algorithms for unsupervised anomaly detection exist; examples are Isolation Forest [liu2008isolation], Local Outlier Factor (LOF) [breunig2000lof] and OCSVM [scholkopf2000support]. The latter has relevant advantages over the former ones, mainly in terms of computational performance. This is because it creates a decision frontier using only the support vectors (like general supervised SVM) and because model training always leads to the same solution, since the optimization problem is convex. However, SVM (and hence OCSVM) algorithms are among the most difficult to explain, due to the mathematically complex method that obtains the decision frontier.
SVM for classification theoretically maps the data points of the dataset to a space of higher dimension than the one determined by their features, so that the classes can be separated linearly by a hyperplane. The data points closest to the frontier between classes, known as support vectors, are the only ones needed to determine it. However, it is not actually necessary to compute the mapping to the higher dimension, because the optimization problem only involves dot products of the mapped points. These dot products can be obtained with the well-known kernel trick: instead of calculating the mapping explicitly, the equations are solved using a kernel function.
In OCSVM there are no labels, so all data points are initially considered to belong to the same class. The decision frontier is computed by trying to separate the region of the hyperspace with a high density of data points from the low-density region, considering the points in the latter as anomalies. To do so, the algorithm defines a decision frontier that maximizes the distance to the origin of the hyperspace while, at the same time, separating the maximum number of data points from it. The compromise between these two factors constitutes the optimization problem and yields the optimal decision frontier. Data points that are separated from the origin are labeled as non-anomalous (+1) and the others as anomalous (-1).
The optimization problem is reflected in the following equations:

$$\min_{w,\,\xi,\,\rho}\ \frac{1}{2}\lVert w\rVert^{2} + \frac{1}{\nu n}\sum_{i=1}^{n}\xi_i - \rho$$
$$\text{subject to}\quad (w \cdot \Phi(x_i)) \geq \rho - \xi_i, \qquad \xi_i \geq 0$$
In these equations, $\nu$ is a hyper-parameter known as the rejection rate, which needs to be selected by the user. It sets an upper bound on the fraction of data points that can be considered anomalies, and also defines a lower bound on the fraction of data points used as support vectors. Using Lagrange techniques, the decision frontier obtained is the following one:

$$f(x) = \operatorname{sgn}\left(\sum_{i=1}^{n}\alpha_i\, k(x_i, x) - \rho\right)$$
Hence the hyper-parameters that must be defined in this method are the rejection rate, $\nu$, and the type of kernel used.
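As a minimal, hypothetical illustration (the data and hyperparameter values are ours, not taken from any of the experiments below), an OCSVM with an RBF kernel can be fitted with scikit-learn, where the `nu` parameter plays the role of the rejection rate ν:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))          # mostly "normal" points
X = np.vstack([X, [[6.0, 6.0]]])       # one obvious outlier

# nu is the rejection rate: an upper bound on the fraction of anomalies
# and a lower bound on the fraction of support vectors.
model = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X)

labels = model.predict(X)              # +1 = non-anomalous, -1 = anomalous
print(labels[-1])                      # label of the injected outlier
```

Since the optimization problem is convex, refitting on the same data always yields the same frontier.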
Rule Extraction in SVM
Several papers deal with the importance of XAI applied to models such as SVM. In particular [martens2008rule]
aims to resolve the black box problem in SVM for supervised classification tasks. It obtains a set of rules that explain, in a simple manner, the boundaries that contain the values of the different classes. This makes it easier to understand the conditions under which a data point is identified as belonging to one class or another (in OCSVM, belonging to class +1, normal data, or class -1, anomaly). The challenge consists in discovering a way to map the algorithm results to a particular set of rules. There are two general ways to do it. One consists in inferring the rules directly from the decision frontier, using a decompositional rule extraction technique. The other (simpler) one, known as a pedagogical rule extraction technique, ignores the decision frontier itself and treats the algorithm as a black box, extracting the rules from the classes it assigns to relevant input data points.
The first method is clearly more transparent, as it deals directly with the inner structure of the model. However, it is generally more difficult to implement. One way to implement it is with a technique known as SVM+ Prototypes [nunez2002rule], which consists in finding hypercubes using the centroids (or prototypes) of data points of each class and using as vertices the data points from that hyperspace area farther away from that centroid (or use directly the support vectors themselves if they are available). It will then infer a rule from the values of the vertices of the hypercube that contain the limits of all the points inside it, creating one rule for each hypercube.
For example, a dataset that contains two numerical features X and Y will be defined in a 2-dimensional space. The algorithm will create a square that contains the data points on each of the classes, as shown in Figure 1. The rule that justifies that a data point belongs to class 2 is:
Rule 1: CLASS 2 IF X ≥ X1 AND Y ≥ Y1 AND X ≤ X2 AND Y ≤ Y2
The generated hypercubes may wrongly include points from the other class when the decision frontier is not linear or spherical, as shown in Figure 2. In this case the algorithm considers an additional number of clusters trying to include the points into a smaller hypercube, as shown in Figure 3.
A rule will be generated for each hypercube, considering all those scenarios as independent, leading to this output:
Group 1: CLASS 1 IF X…
Group 2: CLASS 1 IF X…
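Each such rule reduces to the per-feature lower and upper limits of the points inside one hypercube. A minimal sketch of that idea (the helper name and data are ours, not from the SVM+ Prototypes paper):

```python
import numpy as np

def hypercube_rule(points, feature_names, target="CLASS 1"):
    """Bounding-box rule: one lower and one upper limit per feature."""
    lows = points.min(axis=0)
    highs = points.max(axis=0)
    conds = [f"{lo:g} <= {name} <= {hi:g}"
             for name, lo, hi in zip(feature_names, lows, highs)]
    return f"{target} IF " + " AND ".join(conds)

cluster = np.array([[1.0, 2.0], [3.0, 5.0], [2.0, 4.0]])
print(hypercube_rule(cluster, ["X", "Y"]))
# → CLASS 1 IF 1 <= X <= 3 AND 2 <= Y <= 5
```

With several clusters, one such rule is generated per hypercube, as in the Group 1 / Group 2 output above.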
There are some downsides of that method in supervised classification tasks, especially when the problem is not simply a binary classification or when the algorithm is performing a regression. For instance, the number of rules may grow immensely due to the fact that a set of rules will be generated for each category and each set may contain a huge number of rule groups, leading to an incomprehensible output.
However, in OCSVM these difficulties may potentially be mitigated for two reasons. On the one hand, the explanations are reduced to rules that explain when a data point is not an anomaly (so there is no need to define rules for the anomalies). On the other hand, the algorithm tries to group all non-anomalous points together, setting them apart from the outliers. Because of this, the chance of defining a hypercube that does not contain a point from the other class may be higher than in a standard classification task. Both the inherently unbalanced nature of data points in anomaly detection (few anomalies vs. many more non-anomalous data points) and the fact that non-anomalous points tend to be close to each other may help achieve good results with this method.
We first describe the intuition behind our rule extraction approach from an OCSVM model for anomaly detection. Then, we describe in detail the algorithm implementation.
We propose using rule extraction techniques within OCSVM models for anomaly detection, by generating hypercubes that encapsulate the non-anomalous data points, and using their vertices as rules that explain when a data point is considered non-anomalous. Our method has the following characteristics, according to the taxonomy for XAI in [molnar2019interpretable]:
Post-hoc: Explainability is achieved using external techniques.
Global and individual: Explanations serve to explain how the whole model works, as well as why a specific data point is considered anomalous or non-anomalous.
Model-agnostic: As with other techniques for global explanations [molnar2019interpretable], the only information needed to build the explanations are the input features and the outcomes of the system after fitting the model.
Counterfactual: The explanations for why a data point is anomalous also include information on the changes that should take place in the feature values in order to consider that data point as non-anomalous.
Since explanations are based on hypercubes, the OCSVM kernel used is the Radial Basis Function (RBF), illustrated in Figure 4.
The dimensionality of the hyperspace corresponds to the numerical features used for fitting the model. However, a caveat must be considered when some features are non-ordinal categorical variables. In that case, the approach is to extract a rule for each possible combination of categorical values among the data points that are not considered anomalous. Considering again the aforementioned 2-dimensional example, with variable X being binary categorical, a dataset may look like Figure 5:
Rule extraction with a categorical variable.
In that case, two rules would be extracted, one for each of the possible states of X:
Rule 1: NOT OUTLIER IF X = 0 AND Y ≤ Y2 AND Y ≥ Y1
Rule 2: NOT OUTLIER IF X = 1 AND Y ≤ Y4 AND Y ≥ Y3
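Rules of this shape can be obtained by grouping the non-anomalous points by categorical state and taking per-state numerical bounds. A simplified sketch with hypothetical data (the helper function is ours, not the paper's implementation):

```python
import pandas as pd

def rules_per_state(df, cat_cols, num_cols):
    """One bounding-box rule per combination of categorical values."""
    rules = []
    for state, group in df.groupby(cat_cols):
        # pandas may yield a scalar or a tuple key depending on version
        state = state if isinstance(state, tuple) else (state,)
        cat_part = " AND ".join(f"{c} = {v}" for c, v in zip(cat_cols, state))
        num_part = " AND ".join(
            f"{group[c].min():g} <= {c} <= {group[c].max():g}" for c in num_cols)
        rules.append(f"NOT OUTLIER IF {cat_part} AND {num_part}")
    return rules

df = pd.DataFrame({"X": [0, 0, 1, 1], "Y": [1.0, 2.0, 5.0, 7.0]})
for r in rules_per_state(df, ["X"], ["Y"]):
    print(r)
```

For the hypothetical data above this prints one rule for X = 0 and one for X = 1, mirroring the two rules in the example.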
Generally speaking, the algorithm logic can be summarised as:
Apply OCSVM to the dataset to create the model.
Depending on the characteristics of variables, do:
Case 1. Numerical only: Iteratively create clusters in the non-anomalous data (starting with one cluster) and create a hypercube per cluster using the centroid and the points furthest away from it. Check whether any hypercube contains a data point from the anomalous group; if it does, repeat using one more cluster than before. End when no anomalies are contained in the generated hypercubes. If there are anomalies and the number of data points in a cluster is smaller than the number of vertices needed for the hypercube, complete the missing vertices with artificial data points, and end when there are no anomalies or when the convergence criterion is reached.
Case 2. Categorical only: The rules will correspond directly to the different value states contained in the dataset of non-anomalous points.
Case 3. Both numerical and categorical. This case would be analogous to Case 1, but data points will be filtered for each of the combinations of the categorical variables states. For each combination, there will be a set of rules for the numerical features.
Use these vertices to obtain the boundaries of that hypercube and directly extract rules from them.
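The iterative loop of Case 1 can be sketched as follows, using axis-aligned bounding boxes as a stand-in for the vertex-based hypercubes and scikit-learn's K-means++ for clustering (the helper name and data are ours, and this is a simplification of the actual algorithm):

```python
import numpy as np
from sklearn.cluster import KMeans

def boxes_without_anomalies(X_ok, X_anom, max_k=20):
    """Increase the number of clusters until no bounding box around a
    cluster of non-anomalous points contains an anomalous point."""
    for k in range(1, max_k + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_ok)
        boxes = [(X_ok[labels == c].min(axis=0), X_ok[labels == c].max(axis=0))
                 for c in range(k)]
        # a box is "dirty" if some anomaly falls inside all its bounds
        clean = all(not np.any(np.all((X_anom >= lo) & (X_anom <= hi), axis=1))
                    for lo, hi in boxes)
        if clean:
            return boxes
    return boxes  # fall back to the last attempt

X_ok = np.array([[0, 0], [1, 1], [0, 1], [10, 10], [11, 11]])
X_anom = np.array([[5, 5]])
print(len(boxes_without_anomalies(X_ok, X_anom)))  # number of rules
```

Here a single box would contain the anomaly at (5, 5), so the loop splits the non-anomalous points into two clusters, yielding two anomaly-free boxes.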
Algorithm 1 contains the proposal for rule extraction for an OCSVM model that may be applied over a dataset with either categorical or numerical variables (or both). ocsvm_rule_extractor is the main function of the algorithm. Regarding input parameters, X is the input data frame with the features, ln is a list with the numerical columns, lc is a list with the categorical columns, and d is a dictionary with the hyperparameters for OCSVM (since an RBF kernel is used, the hyperparameters are the rejection rate, $\nu$, which is an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors, and the kernel coefficient, $\gamma$). This function starts with the feature scaling of the numerical features (function featureScaling). After that, it fits an OCSVM model with all the data available and detects the anomalies within it, generating two datasets, one with the anomalous data points and one with the rest (function filterAnomalies). The next step is checking that the number of data points within the decision frontier is enough to build the hypercubes. For this, the dimension of the hypercube should be lower than the number of non-anomalous data points for that categorical state. The dimension of that hypercube is computed using the number of numerical features; for each combination of categorical states it will be the length of ln, the list containing all the numerical column names.
Next, the algorithm checks whether the features are numerical, categorical or both. In the case of only numerical columns, it calls function getRules, described in Algorithm 2. getRules receives as input the anomalous and non-anomalous data points, the number of vertices for each hypercube (based on the number of numerical features), and the list of names of those numerical features. The main purpose of this function is to cluster the non-anomalous data points into a set of hypercubes that do not contain any anomalous data point. To do that, it iteratively increases the number of clusters (hypercubes) until no anomalous point falls within any hypercube. The function outPosition checks that the rules defined from the vertices of the hypercube do not include any data point from the anomalous subset.
getRules calls function getVertices with a specific number of clusters. This function performs the clustering over the non-anomalous data points using the function getClusters, which returns the label of the cluster for each data point, as well as the centroid position of each cluster, using the K-means++ algorithm [arthur2007k]. Then, it iterates through each cluster and first obtains the subset of data points belonging to that cluster. After that, if there are enough data points in the cluster (more data points than vertices of the hypercube), it computes the distance of each of them to the centroid with computeDistance and uses the furthest ones as vertices.
If there are not enough data points in that cluster (fewer than the number of vertices of the hypercube), all of them are used as limits and the missing vertices are artificially generated to close the hypercube (leaving all anomalous data points outside).
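One simple way to generate the missing artificial vertices, assuming the per-feature limits of the available points define the hypercube (our own simplification of the construction described above):

```python
from itertools import product
import numpy as np

def complete_vertices(points):
    """Generate all 2^m hypercube vertices from the per-feature
    min/max of the available points (artificial where missing)."""
    lows, highs = points.min(axis=0), points.max(axis=0)
    return np.array(list(product(*zip(lows, highs))))

pts = np.array([[0.0, 0.0], [2.0, 3.0]])  # only 2 points in 2-D, 4 vertices needed
print(complete_vertices(pts))
```

An m-dimensional hypercube has $2^m$ vertices, so the Cartesian product of the per-feature (low, high) pairs yields exactly the full vertex set.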
If all the features are categorical, then the rules for non-anomalous data points will simply be the unique combination of values for them. If there are both categorical and numerical features, the algorithm obtains the hypercubes (as mentioned for numerical features only) for the subset of data points associated to each combination of categorical values.
After obtaining the rules, they are converted back to the original feature values (not the scaled ones used for the ML model). A final step then checks whether some rules may be included inside others; that is, for each rule it checks whether there is another one with a bigger scope that includes it as a subset case.
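The subsumption check can be sketched as follows, representing each rule as a list of (low, high) bounds per feature (helper names and data are ours):

```python
def contained(inner, outer):
    """True if rule `inner` lies completely inside rule `outer`.
    Each rule is a list of (low, high) bounds, one per feature."""
    return all(ol <= il and ih <= oh
               for (il, ih), (ol, oh) in zip(inner, outer))

def prune(rules):
    """Drop every rule that is a subset of a single wider rule."""
    return [r for i, r in enumerate(rules)
            if not any(j != i and contained(r, other)
                       for j, other in enumerate(rules))]

rules = [[(0, 10), (0, 10)],   # wide rule
         [(2, 3), (4, 5)],     # inside the wide rule -> pruned
         [(20, 30), (0, 1)]]   # disjoint -> kept
print(len(prune(rules)))       # → 2
```

Note that this only removes a rule when a single wider rule contains it; a rule covered jointly by the union of several rules is kept, which matters for the evaluation results discussed later.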
There is a convergence criterion, applied in the scenario where a cluster has fewer data points than the number of vertices. In this case, if there are anomalies within the hypercube, the algorithm compares the number of data points within the cluster against a threshold defined by the number of vertices multiplied by a reference value. If there are fewer data points than this threshold, the cluster is discarded and not considered for rule generation. Since the proposal aims to explain what makes data points non-anomalous and what should have happened for a data point to be labeled as non-anomalous, this approximation is acceptable.
We use our algorithm over different datasets (both public and from Telefónica's real data) to evaluate the following hypotheses:
The OCSVM algorithm with an RBF decision frontier groups non-anomalous points within a hypersphere. Hence the number of rules extracted using non-anomalous data will be fewer than using anomalous ones.
Our algorithm yields a number of rules similar to other proposals for global model-agnostic explanations, such as Surrogate Trees.
In our experiments we count the number of rules extracted in different scenarios, and check whether there are significant differences. Several experiments are conducted over different datasets:
Rule extraction over non-anomalous data points with our algorithm.
Rule extraction over anomalous data points with our algorithm. In this scenario, no convergence criterion is applied, since the number of anomalous data points will be very low (often lower than the number of vertices).
Rule extraction over all the data points using a DT overfitted on the dataset, using the features as input and the anomalous/non-anomalous labels as target variable. This yields two types of rules: the ones that explain the anomalous data points, and the ones that explain the non-anomalous ones.
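The surrogate DT experiment can be sketched as follows (the data and hyperparameter values are ours; as described below, the actual experiments set the DT hyperparameters dynamically per dataset):

```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(200, 2)),          # dense "normal" region
               rng.uniform(-6, 6, size=(10, 2))])  # scattered points

ocsvm = OneClassSVM(kernel="rbf", nu=0.1, gamma="scale").fit(X)
y = ocsvm.predict(X)   # +1 / -1 labels from the black-box model

# Surrogate: a DT deliberately overfitted on the OCSVM decisions.
tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
print(export_text(tree, feature_names=["X", "Y"], max_depth=2))
print("leaves:", tree.get_n_leaves())
```

Each root-to-leaf path of the surrogate tree is one rule, so the leaf count is comparable to the number of rules extracted by our algorithm.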
The datasets used belong to different domains and have different sizes and numbers of features (both categorical and numerical). They are indicated in Table 1:
Datasets 1 and 2 about seismic activity [sathe2016lodes]. Dataset 1 is bi-dimensional with only numerical features ('gdenergy', 'gdpuls'). Dataset 2 has 2 categorical features ('hazard', 'shift') and 7 numerical ones ('seismoacoustic', 'genergy', 'gplus', 'gdenergy', 'gdpuls', 'bumps', 'bumps2').
Dataset 3 from a call center at Telefónica. It is real data that includes the total number of calls received in one of its services during every hour. From these data, some time features (such as weekday) are extracted and cyclically transformed, so that each time feature turns into two features, its sine and cosine components. The rules in this case are also transformed back into the original features to enhance rule comprehension.
Dataset 4 about cardiovascular diseases [padmanabhan2019physician]. There are 4 categorical features (’smoke’, ’alco’, ’active’, ’is_man’) and 7 numerical (’age’, ’height’, ’weight’, ’ap_hi’, ’ap_lo’,’cholesterol’,’gluc’).
Datasets 5 and 6 on the US census for year 1990 [blake1998uci]. Dataset 5 has 2 categorical features ('dAncstry1_3', 'dAncstry1_4') and 4 numerical ones ('dAge', 'iYearsch', 'iYearkwrk', 'dYrsserv'). Dataset 6 has the same numerical features, but 18 categorical ones (dAncstry_i_j, with i ranging from 1 to 2, and j ranging from 3 to 11 if i equals 1, and from 2 to 10 if i equals 2).
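The cyclical transformation mentioned for dataset 3 can be sketched as follows (the column name and values are hypothetical):

```python
import numpy as np
import pandas as pd

# Each cyclic time feature becomes a sine and a cosine component, so that
# e.g. Sunday (6) and Monday (0) end up close together in feature space.
df = pd.DataFrame({"weekday": [0, 1, 3, 6]})   # Monday=0 .. Sunday=6
df["weekday_sin"] = np.sin(2 * np.pi * df["weekday"] / 7)
df["weekday_cos"] = np.cos(2 * np.pi * df["weekday"] / 7)
print(df.round(2))
```

Inverting this transform on the rule boundaries recovers the original feature values, which is what makes the extracted rules readable.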
| Dataset | Ref. | Nº Cat. | Nº Num. | Nº Rows |
We ran the experiments with the following infrastructure: the implementations of the OCSVM algorithm, the K-means++ clustering and the DT algorithm are based on Scikit-Learn [scikit-learn]. The rest of the code described in Algorithms 1 and 2 was developed from scratch and is available on GitHub [Barbado2019].
OCSVM models use as hyperparameters: . K-means++ models use . The DT hyperparameters are dynamically set for each dataset, using the Gini criterion to find the best splits, and using as seed. For each dataset, the DT is as follows:
1: Depth = 11, Nodes = 53, Leaf nodes = 30
2: Depth = 17, Nodes = 129, Leaf nodes = 69
3: Depth = 13, Nodes = 159, Leaf nodes = 91
4: Depth = 27, Nodes = 2265, Leaf nodes = 1184
5: Depth = 17, Nodes = 353, Leaf nodes = 162
6: Depth = 30, Nodes = 1843, Leaf nodes = 1051
Results are detailed in Table 2. It shows the number of rules extracted for all datasets with the different algorithms.
| Dataset | Prop. [NA] | Prop. [A] | DT [NA] | DT [A] |
Hypothesis 1 is not validated. Table 2 shows that the number of rules for anomalous data points is always lower than the number for non-anomalous ones. This may be because the pruning algorithm is not yet good enough at rule reduction. For instance, a rule is removed only if another, wider rule includes it; however, many times a rule is not included within one single wider rule, but within the union of several wider rules. This can be seen visually using dataset 1. Figures 6 and 7 show the rules obtained before and after applying the pruning algorithm. Although the number of rules is reduced, many rules could still be removed. This affects the rules obtained with non-anomalous data points more, since those points are closer to each other than the anomalous ones, as shown in Figure 8 (here, points without a visible hypercube appear when all the vertices coincide, so the hypercube collapses into the anomalous point itself).
Regarding hypothesis 2, in most cases the DT generates fewer rules than our algorithm (with the exception of anomalous data points in dataset 4). However, the results may be considered similar.
Figures 9 and 10 show a sample rule over the same dataset for our algorithm and for the DT surrogate method, respectively. The DT method uses symbols for equality and inequalities, while in our method all the extracted rules use only ≤ and ≥, and not = or ≠.
Limitations of our Approach
Our method can only be used if there is a minimum number of data points (at least one point per vertex for the hypercube). This limitation may be avoided by generating artificial vertices, as done for clusters with few data points.
We described a novel rule extraction technique that provides model-agnostic, both global and local, and counterfactual explanations for unsupervised learning in the scenario of anomaly detection, using the OCSVM algorithm.
We applied our algorithm to open and private datasets, with anomalous and non-anomalous data, with the initially unexpected result that using anomalous data points yields fewer rules than using non-anomalous ones. We also compared the results of our algorithm with surrogate methods, such as a DT model trained over the same features and using the anomalous/non-anomalous labels as target variable. In most scenarios both methods yielded similar results, showing that our approach is a feasible XAI technique for anomaly detection.
Our rule extraction method could be optimised by using different clustering algorithms, in order to analyse whether they reduce the number of rules extracted. It would be interesting to compare clustering methods that handle non-ordinal categorical and numerical features together, such as K-modes [chaturvedi2001k], instead of dividing the hyperspace according to the categorical value combinations, as we did in our work. Besides, when there are clusters with anomalies, instead of increasing the number of clusters and applying the clustering over the whole dataset again, the algorithm could keep the data points already inside anomaly-free clusters and apply a different clustering algorithm to partition only the subspace that contained anomalies. It would be interesting to compare the number of rules extracted with this approach versus the one presented in this paper.
Rule extraction should also be designed to consider all types of comparisons (=, <, and >), and more sophisticated pruning techniques should be applied to improve rule reduction, since a rule may be contained in the union of two or more rules.