An Adversarial Approach for Explainable AI in Intrusion Detection Systems

11/28/2018 ∙ by Daniel L. Marino, et al. ∙ Virginia Commonwealth University

Despite the growing popularity of modern machine learning techniques (e.g. Deep Neural Networks) in cyber-security applications, most of these models are perceived as black boxes by the user. Adversarial machine learning offers an approach to increase our understanding of these models. In this paper we present an approach to generate explanations for incorrect classifications made by data-driven Intrusion Detection Systems (IDSs). An adversarial approach is used to find the minimum modifications (of the input features) required to correctly classify a given set of misclassified samples. The magnitude of such modifications is used to visualize the most relevant features that explain the reason for the misclassification. The presented methodology generated satisfactory explanations that describe the reasoning behind the misclassifications, with descriptions that match expert knowledge. The advantages of the presented methodology are: 1) it is applicable to any classifier with defined gradients; 2) it does not require any modification of the classifier model; 3) it can be extended to perform further diagnosis (e.g. vulnerability assessment) and gain further understanding of the system. Experimental evaluation was conducted on the NSL-KDD99 benchmark dataset using Linear and Multilayer perceptron classifiers. The results are shown using intuitive visualizations in order to improve their interpretability.




I Introduction

The increasing incorporation of cyber-based methodologies for monitoring and control of physical systems has made critical infrastructure vulnerable to various cyber-attacks, such as interception, removal or replacement of information, and penetration by unauthorized users and viruses [1] [2] [3] [4]. Intrusion detection systems (IDSs) are an essential tool to detect such malicious attacks [5].

Extendibility and adaptability are essential requirements for an IDS [6]. Every day, new strains of cyber-attacks are created with the objective of deceiving these systems. As a result, machine learning and data-driven intrusion detection systems are increasingly being used in cyber-security applications [6]. Machine learning techniques such as deep neural networks have been successfully used in IDSs [6] [7]. However, these techniques often work as a black-box model for the user [8] [7].

Fig. 1: Explainable AI [9] is concerned with the design of interfaces that help users understand the decisions made by a trained machine learning model.
Fig. 2: Overview of the presented explainable interface. The approach provides explanations for misclassified samples. An adversarial approach is followed to modify the misclassified samples until the model assigns the correct class. This approach provides a mechanism to understand the decision boundaries of the classifier.

With machine learning being increasingly applied in the operation of critical systems, understanding the reason behind the decisions made by a model has become a common requirement, to the point that governments are starting to include it in legislation [10] [11]. Recent developments in the machine learning community have focused on the development of methods that are more interpretable for users. Explainable AI (Figure 1) [8] makes use of visualizations and natural language descriptions to explain the reasoning behind the decisions made by the machine learning model.

It is crucial that the inner workings of data-driven models are transparent to the engineers designing IDSs. Decisions presented by explainable models can be easily interpreted by a human, simplifying the process of knowledge discovery. Explainable approaches help in diagnosing, debugging, and understanding the decisions made by the model, ultimately increasing trust in the data-driven IDS.

In the case of data-driven IDSs, when the model is presented with new attacks for which no training data is available, the model might misclassify an attack as normal, leading to a breach in the system. Understanding the reasons behind the misclassification of particular samples is the first step for debugging and diagnosing the system. Providing clear explanations for the cause of misclassification is essential to decide which steps to follow in order to prevent future attacks.

In this paper, we present an explainable AI interface for diagnosing data-driven IDSs. We present a methodology to explain incorrect classifications made by the model following an adversarial approach.

Although adversarial machine learning is usually used to deceive the classifier, in this paper we use it to generate explanations by finding the minimum modifications required in order to correctly classify the misclassified samples. The difference between the original and modified samples provides information about the relevant features responsible for the misclassification. We show that the explanations provide satisfactory insights into the reasoning made by the data-driven models, and that they are congruent with expert knowledge of the task.

The rest of the paper is organized as follows: Section II presents an overview of adversarial machine learning; Section III describes the presented methodology for explainable IDSs; Section IV describes the experimental results obtained using the NSL-KDD99 intrusion detection benchmark dataset; Section V concludes the paper.

II Adversarial Machine Learning

Adversarial machine learning has been extensively used in cyber-security research to find vulnerabilities in data-driven models [12] [13] [14] [15] [16] [17]. Recently, there has been an increasing interest in adversarial samples given the susceptibility of Deep Learning models to these types of attacks [18] [19].

Adversarial samples are samples crafted in order to change the output of a model by making small modifications to a reference (usually real) sample [18]. These samples are used to detect blind spots in ML algorithms.

Adversarial samples are crafted from an attacker's perspective to evade detection, confuse the classifier [18], degrade performance [20] and/or gain information about the model or the dataset used to train the model [17]. Adversarial samples are also useful from a defender's point of view, given that they can be used to perform vulnerability assessment [16], study the robustness against noise, improve generalization and debug the machine learning model [20].

In general, the problem of crafting an adversarial sample is stated as follows [17]:

$$\max_{\hat{x}} \; L(\hat{x}) - \Omega(\hat{x}) \quad \text{s.t.} \quad \hat{x} \in \phi(x) \tag{1}$$

where:

  • $L(\hat{x})$ measures the impact of the adversarial sample in the model,

  • $\Omega(\hat{x})$ measures the capability of the defender to detect the adversarial sample,

  • $\hat{x} \in \phi(x)$ ensures the crafted sample is inside the domain of valid inputs. It also represents the capabilities of the adversary.

The objective of Eq. (1) can be interpreted as crafting an attack point that maximizes the impact of the attack, while minimizing the chances of the attack being detected. The definitions of $L$ and $\Omega$ will depend on the intent of the attack and the available information about the model under attack. For example: 1) $L$ can be a function that measures the difference between a target class and the output from the model; 2) $\Omega$ can measure the discrepancy between the reference and the modified sample.

In this paper, instead of deceiving the classifier, we use Equation 1 to find the minimum number of modifications needed to correctly classify a misclassified sample. The methodology is described in detail in section III.

III Explaining misclassifications using adversarial machine learning

In this paper we are interested in generating explanations for incorrect estimations made by a trained classifier. Figure 2 presents an overview of the presented explainable interface. The presented methodology modifies a set of misclassified samples until they are correctly classified. The modifications are made following an adversarial approach: finding the minimum modifications required to change the output of the model. The difference between the modified samples and the original real samples is used to explain the output from the classifier, illustrating the most relevant features that lead to the misclassification.

In the following sections we explain in detail each component of the presented explainable interface.

III-A Data-driven classifier

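The classifier $p(y = k \mid x, \theta)$ estimates the probability of a given sample $x$ to belong to class $k$. The classifier is parameterized by a set of parameters $\theta$ that are learned from data. Learning is performed using standard supervised learning. Given a training dataset $D = \{(x^{(i)}, y^{(i)})\}_{i=1}^{M}$ of $M$ samples, the parameters of the model are obtained by minimizing the cross-entropy:

$$\theta^{*} = \arg\min_{\theta} \; -\sum_{i=1}^{M} \sum_{k} y^{(i)}_{k} \log p(y = k \mid x^{(i)}, \theta)$$

where $y^{(i)}$ is a one-hot encoding representation of the class:

$$y^{(i)}_{k} = \begin{cases} 1 & \text{if } x^{(i)} \text{ belongs to class } k \\ 0 & \text{otherwise} \end{cases}$$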

Depending on the complexity of the model and the dataset $D$, the model may misclassify some of the samples. We are interested in providing explanations for these incorrect estimations.
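The training objective above can be illustrated with a small sketch. It is only an illustration under stated assumptions: the helper names (`one_hot`, `softmax`, `cross_entropy`, `dataset_loss`) and the `logits_fn` callable are hypothetical and not taken from the paper, and the loss is averaged over the dataset, which has the same minimizer as the summed form.

```python
import math

def one_hot(label, num_classes):
    """One-hot vector y with y[k] = 1 only for k == label."""
    return [1.0 if k == label else 0.0 for k in range(num_classes)]

def softmax(logits):
    """Numerically stable softmax: turns logits into class probabilities."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y, p, eps=1e-12):
    """H(y, p) = -sum_k y_k * log(p_k) for a single sample."""
    return -sum(yk * math.log(pk + eps) for yk, pk in zip(y, p))

def dataset_loss(logits_fn, samples, labels, num_classes):
    """Average cross-entropy of a classifier over a labelled dataset.

    logits_fn is a hypothetical callable mapping a sample to its logits.
    """
    total = 0.0
    for x, label in zip(samples, labels):
        p = softmax(logits_fn(x))
        total += cross_entropy(one_hot(label, num_classes), p)
    return total / len(samples)
```

A training procedure would adjust the parameters behind `logits_fn` to reduce `dataset_loss`; samples that remain on the wrong side of the learned decision boundary are the misclassified samples studied in the rest of the paper.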

III-B Modifying misclassified samples

In this paper, instead of deceiving the classifier, we make use of adversarial machine learning to understand why some of the samples are being misclassified. The idea behind this approach is that adversarial machine learning can help us understand the decision boundaries of the learned model.

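The objective is to find the minimum modifications needed in order to change the output of the classifier for a real sample $x_0$. This is achieved by finding an adversarial sample $\hat{x}$ that is classified as $\hat{y}$ while minimizing the distance between the real sample ($x_0$) and the modified sample $\hat{x}$:

$$\min_{\hat{x}} \; (\hat{x} - x_0)^{T} Q \, (\hat{x} - x_0) \tag{2}$$

$$\text{s.t.} \quad \arg\max_{k} \; p(y = k \mid \hat{x}, \theta) = \hat{y}, \qquad x_{min} \preceq \hat{x} \preceq x_{max} \tag{3}$$

where $Q$ is a symmetric positive definite matrix that allows the user to specify a weight in the quadratic difference metric.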

The program in Equation 2 can serve multiple purposes depending on how $x_0$ and $\hat{y}$ are specified. For the purpose of explaining incorrect classifications, in this paper, $x_0$ represents the real misclassified samples that serve as a reference for the modified samples $\hat{x}$. Furthermore, the value of $\hat{y}$ is set to the correct class of $x_0$. In this way, the program (2) finds a modified sample $\hat{x}$ that is as close as possible to the real misclassified sample $x_0$, while being correctly classified by the model as $\hat{y}$.

We constrain the sample $\hat{x}$ to be inside the bounds $[x_{min}, x_{max}]$. These bounds are extracted from the maximum and minimum values found in the training dataset. This constraint ensures that the adversarial example is inside the domain of the data distribution.

The class $\hat{y}$ of the adversarial sample is specified by the user. Note that $\hat{y} \neq y$, i.e. the class of the real sample is different from the class of the adversarial sample.

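The optimization problem in Eq. 2 provides a clear objective to solve. However, the problem as stated in Eq. 2 is not straightforward to solve using available deep-learning optimization frameworks. In order to simplify the implementation, we modify the way the constraints are satisfied by moving the constraint in Eq. 3 into the objective function:

$$\min_{\hat{x}} \; \alpha \, I(\hat{x}, \hat{y}) \, H(\hat{y}, p_{\hat{x}}) + I(\hat{x}, \hat{y}) \, (\hat{x} - x_0)^{T} Q \, (\hat{x} - x_0) \quad \text{s.t.} \quad x_{min} \preceq \hat{x} \preceq x_{max} \tag{4}$$

where:

  • $x_0$ is a reference sample from the dataset

  • $\hat{x}$ is the modified version of $x_0$ that makes the estimation of the class change from $y$ to the target $\hat{y}$

  • $H(\hat{y}, p_{\hat{x}})$ is the cross-entropy between the estimated adversarial sample class $p_{\hat{x}} = p(y \mid \hat{x}, \theta)$ and the target class $\hat{y}$

  • $I(\hat{x}, \hat{y})$ is an indicator function that specifies whether the adversarial sample is being classified as $\hat{y}$:

    $$I(\hat{x}, \hat{y}) = \begin{cases} 0 & \text{if } \arg\max_{k} \; p(y = k \mid \hat{x}, \theta) = \hat{y} \\ 1 & \text{otherwise} \end{cases}$$

    This function provides a mechanism to stop the modifications once the sample is classified as $\hat{y}$. We assume this function is not continuous; hence, its gradients with respect to the inputs are not required.

  • $\alpha$ is a scale factor that can be used to weight the contribution of the cross-entropy to the objective loss.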

The problem stated in Eq. 4 can be seen as an instance of the adversarial problem stated in Eq. 1, where the cross-entropy represents the effectiveness ($L$) of the modifications while the quadratic difference represents the discrepancy ($\Omega$) between $x_0$ and $\hat{x}$.
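The optimization described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: it assumes a linear softmax classifier (so the gradient of the cross-entropy with respect to the input has a closed form), takes the weight matrix of the quadratic term as the identity, uses plain projected gradient descent, and stops as soon as the sample is assigned the target class. The names `predict_proba`, `correct_sample`, `lr` and `max_steps` are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_proba(W, b, x):
    """Linear softmax classifier with weights W (K x N) and biases b (K)."""
    logits = [sum(w * v for w, v in zip(w_k, x)) + b_k for w_k, b_k in zip(W, b)]
    return softmax(logits)

def correct_sample(W, b, x0, target, x_min, x_max,
                   alpha=5.0, lr=0.05, max_steps=1000):
    """Modify x0 until the classifier assigns it to `target`.

    Each step descends alpha * H(target, p(x_hat)) + ||x_hat - x0||^2
    (identity weight matrix assumed) and clips x_hat to [x_min, x_max].
    """
    x_hat = list(x0)
    for _ in range(max_steps):
        p = predict_proba(W, b, x_hat)
        predicted = max(range(len(p)), key=p.__getitem__)
        if predicted == target:
            break  # indicator is 0: stop modifying the sample
        # For a linear softmax model, dH/dx = W^T (p - one_hot(target)).
        err = [p_k - (1.0 if k == target else 0.0) for k, p_k in enumerate(p)]
        for j in range(len(x_hat)):
            grad_h = sum(err[k] * W[k][j] for k in range(len(W)))
            grad_q = 2.0 * (x_hat[j] - x0[j])
            step = x_hat[j] - lr * (alpha * grad_h + grad_q)
            # Projection onto the box keeps the sample in the data domain.
            x_hat[j] = min(max(step, x_min[j]), x_max[j])
    return x_hat
```

In this sketch, features that do not influence the target class are left untouched, because both gradient terms are zero for them. If `alpha` is too small, the quadratic term can hold the iterate short of the decision boundary, so the loop may end at `max_steps` without reaching the target class. For a non-linear model such as an MLP, the closed-form gradient would be replaced by the gradient supplied by an automatic differentiation framework.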

(a) misclassified samples
(b) misclassified and modified samples
Fig. 5: t-SNE visualization of misclassified samples and corresponding modified samples for the MLP classifier. The legend shows the class to which the misclassified samples belong. The axes correspond to an embedded 2D representation of the samples. The figure shows there is no clear distinction between the misclassified real samples $x_0$ and the modified samples $\hat{x}$, demonstrating that small modifications can be used to change the output of the classifier.

III-C Explaining incorrect estimations

We used the adversarial approach stated in Eq. 4 to generate an explanation for incorrect classifications. Using a set of misclassified samples as reference $x_0$, we use Eq. 4 to find the minimum modifications needed to correctly classify the samples. We use the real class of the samples as the target $\hat{y}$.

The explanations are generated by visualizing the difference ($x_0 - \hat{x}$) between the misclassified samples $x_0$ and the modified samples $\hat{x}$. This difference shows the deviation of the real features $x_0$ from what the model considers as the target class $\hat{y}$.

The explanations can be generated for individual samples or for a set of misclassified samples. We used the average deviation to present the explanations for a set of misclassified samples.
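The average deviation used for a set of samples can be computed as in the sketch below. The function names and the ranking step are illustrative assumptions; the deviation is taken as the original sample minus the modified sample, so a positive value means the real feature was higher than what the model expects for the target class.

```python
def average_deviation(originals, modified, feature_names):
    """Per-feature mean of (original - modified) over a set of samples."""
    n = len(originals)
    means = [
        sum(x0[j] - xh[j] for x0, xh in zip(originals, modified)) / n
        for j in range(len(feature_names))
    ]
    return dict(zip(feature_names, means))

def rank_features(deviation):
    """Order features by absolute average deviation, largest first."""
    return sorted(deviation.items(), key=lambda kv: abs(kv[1]), reverse=True)
```

Plotting the ranked deviations as a bar chart gives the kind of per-feature explanation discussed in the results; a single sample can be explained by passing lists of length one.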

IV Experiments and Results

IV-A Dataset

For experimental evaluation, we used the NSL-KDD intrusion detection dataset [21]. The NSL-KDD dataset is a revised version of the KDD99 dataset [22], a widely used benchmark dataset for intrusion detection algorithms. The NSL-KDD dataset removes redundant and duplicate records found in the KDD99 dataset, alleviating some of the problems of the KDD99 dataset mentioned in [21].

The NSL-KDD dataset consists of a series of aggregated records extracted from a packet analyzer log. Besides normal communications, the dataset contains records of attacks that fall into four main categories: DOS, R2L, U2R and probing. For our experiments, we only considered the normal, DOS and probe classes.

The dataset consists of 124926 training samples and 16557 testing samples. We used the same split provided by the authors of the dataset. To alleviate the effects of the unbalanced distribution of the samples over the classes, we trained the models extracting mini-batches with an equal ratio of samples from each class. Samples were extracted with replacement. A detailed description of the dataset features can be found in [23].
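The class-balanced sampling described above can be sketched as follows. The function name, the dictionary layout of the data and the final shuffle are assumptions made for illustration; only the equal per-class ratio and the sampling with replacement come from the text.

```python
import random

def balanced_batch(samples_by_class, per_class, rng=random):
    """Draw `per_class` samples with replacement from every class.

    samples_by_class maps a class label to the list of its samples.
    rng can be the random module or a seeded random.Random instance.
    """
    batch = []
    for label, samples in samples_by_class.items():
        for x in rng.choices(samples, k=per_class):
            batch.append((x, label))
    rng.shuffle(batch)  # mix the classes inside the mini-batch
    return batch
```

Because sampling is done with replacement, a class with very few records can still contribute the same number of samples to each mini-batch as a much larger class.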

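The dataset samples were normalized in order to make the quadratic distance metric in Equation 4 invariant to the feature scales. We used the mean and standard deviation of the training dataset for normalization:

$$x \leftarrow \frac{x - \mu}{\sigma}$$

where $\mu$ and $\sigma$ are the per-feature mean and standard deviation computed on the training dataset.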
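A minimal sketch of this normalization is given below. The function names are hypothetical, the population standard deviation is assumed, and the small `eps` term is an added safeguard for constant features that is not mentioned in the paper.

```python
import math

def fit_standardizer(train):
    """Per-feature mean and (population) standard deviation of the training set."""
    n, dims = len(train), len(train[0])
    mean = [sum(row[j] for row in train) / n for j in range(dims)]
    std = [
        math.sqrt(sum((row[j] - mean[j]) ** 2 for row in train) / n)
        for j in range(dims)
    ]
    return mean, std

def standardize(x, mean, std, eps=1e-8):
    """Apply (x - mean) / std feature by feature; eps avoids division by zero."""
    return [(xj - m) / (s + eps) for xj, m, s in zip(x, mean, std)]
```

The statistics are fitted on the training split only and then reused for the test samples and for any modified samples, so that distances between original and modified samples are measured on the same scale.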

IV-B Classifiers

One of the advantages of the methodology presented in this paper is that it works for any classifier which has a defined gradient of the cross-entropy loss with respect to the inputs.

For the experimental section, we used the following data-driven models: 1) a linear classifier, and 2) a Multi Layer Perceptron (MLP) classifier with ReLU activation function. We used weight decay (L2 norm) and early-stopping regularization for training both models.

Table I shows the accuracy achieved with the linear and the MLP classifiers. We can see that the MLP classifier provides higher accuracy in the training and testing datasets.

Classifier   Train   Test
Linear       0.957   0.936
MLP          0.995   0.955
TABLE I: Accuracy of Classifiers

IV-C Modified misclassified samples

We used t-SNE [24] to visualize the misclassified samples ($x_0$) and the modified/corrected samples ($\hat{x}$) found using Equation 4. t-SNE is a dimensionality reduction technique commonly used to visualize high-dimensional datasets.

Figure 5 shows the visualization of the misclassified samples and the corresponding modified samples using t-SNE. This figure shows the effectiveness of the presented methodology in finding the minimum modifications needed to correct the output of the classifier. No visual difference can be observed between the real and the modified samples. The modified samples are close enough to the real samples that they occlude the real samples in Figure 5(b).

IV-D Explaining incorrect estimations

(a) Normal samples misclassified as DOS using Linear model
(b) Normal samples misclassified as DOS using MLP model
Fig. 8: Explanation for Normal samples being misclassified as DOS using the difference between real samples $x_0$ and modified samples $\hat{x}$.
Fig. 9: Comparison of the duration feature between $x_0$ (Normal samples misclassified as DOS) and $\hat{x}$ (modified samples). The model considers connections with zero duration suspicious. The modified samples have an increased duration in order to correctly classify the samples as Normal.
(a) DOS samples misclassified as Normal using Linear model
(b) DOS samples misclassified as Normal using MLP model
Fig. 12: Explanation for DOS samples being misclassified as Normal using the difference between real samples $x_0$ and modified samples $\hat{x}$.
(a) Deviation for continuous features
(b) Comparison of categorical features distributions
Fig. 15: Explanation for Normal samples misclassified as DOS while taking into account categorical features (MLP model).

Figure 8 shows the generated explanation for Normal samples being incorrectly classified as DOS. Figure 8(a) shows the explanations for the Linear model while Figure 8(b) shows the explanations for the MLP model.

We observed that both models provide qualitatively similar explanations. Figures 8(a) and 8(b) can be naturally interpreted as follows:

Normal samples were misclassified as DOS because of:

  • a high number of connections to the same host (count) and to the same destination address (dst_host_count)

  • a low connection duration (duration)

  • a low number of operations performed as root in the connection (num_root)

  • a low percentage of samples that have successfully logged in (logged_in, is_guest_login)

  • a high percentage of connections originated from the same source port (dst_host_same_src_port_rate)

  • a low percentage of connections directed to different services (diff_srv_rate)

  • a low number of connections directed to the same destination port (dst_host_srv_count)

Figures 8(a) and 8(b) show that a high number of connections with low duration and low login success rate is responsible for misclassifying Normal samples as DOS. These attributes are clearly suspicious and match what a human expert would consider typical behavior of a DOS attack, providing a satisfactory explanation for the incorrect classification.

A more detailed view of the difference between the duration of real ($x_0$) and modified ($\hat{x}$) samples is presented in Figure 9. This figure shows that all misclassified Normal samples had connections with a duration of zero seconds, which the classifier considers suspicious for Normal behavior.

The graphs also provide a natural way to extract knowledge and understand the concepts learned by the model. For example, Figures 8(a) and 8(b) show that the classifier considers a low percentage of connections directed to different services (low diff_srv_rate) to be suspicious.

A parallel analysis can be performed on other misclassified samples. Figure 12 provides explanations for DOS attacks misclassified as Normal connections for the Linear and MLP models. Overall, Figures 12(a) and 12(b) show that the samples are misclassified as Normal because they have: (a) a lower error rate during the connection and (b) a higher login success. These features are usually expected to belong to normal connections, which successfully explains the reason for the misclassification.

The explanations shown in Figures 8 and 12 do not take categorical features into account. In order to include categorical features in the analysis, we perform a round operation on the inputs of the indicator function $I$. This ensures the objective function in Equation 4 takes into consideration the effects of the rounding operation. Given that we do not use the gradient of $I$ during the optimization process, we are allowed to include discontinuous operations like the rounding function.
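One way to picture the rounding step is the sketch below. It assumes that categorical features are stored as numeric entries for which rounding to the nearest integer yields a valid value (for example 0/1 one-hot entries), and that `predict_class` is a hypothetical callable returning the predicted class of a sample; neither detail is specified in the text.

```python
def round_categorical(x, categorical_idx):
    """Round only the entries that encode categorical features."""
    return [float(round(v)) if j in categorical_idx else v
            for j, v in enumerate(x)]

def indicator(predict_class, x_hat, target, categorical_idx):
    """Return 0 once the rounded sample is classified as `target`, else 1.

    The classifier is evaluated on the rounded sample, so the stopping
    decision reflects the categorical values the sample would actually take.
    """
    rounded = round_categorical(x_hat, categorical_idx)
    return 0 if predict_class(rounded) == target else 1
```

Since the indicator is only used to decide whether to keep modifying the sample, its discontinuity does not interfere with the gradient-based update of the continuous values.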

Figure 15 shows the explanations generated when considering categorical features. Figure 15(a) shows the deviation of continuous features, providing the same information as the explanations from Figures 8 and 12. Figure 15(b) shows the comparison between the histograms of misclassified samples and modified samples. Figure 15(b) shows that protocol_type was not modified, suggesting that this feature is not relevant for explaining the misclassification. On the other hand, the service feature was modified in almost all samples. The figure shows that most of the Normal samples misclassified as DOS used a private service. Changing the service value helps the classifier to correctly estimate the class of the samples. The explanation shows that the model considers communication with a private service suspicious.

V Conclusion

In this paper, we presented an approach for generating explanations for the incorrect classification of a set of samples. The methodology was tested using an Intrusion Detection benchmark dataset.

The methodology uses an adversarial approach to find the minimum modifications needed in order to correctly classify the misclassified samples. The modifications are used to find and visualize the relevant features responsible for the misclassification. Experiments were performed using Linear and Multilayer perceptron classifiers. The explanations were presented using intuitive plots that can be easily interpreted by the user.

The proposed methodology provided insightful and satisfactory explanations for the misclassification of samples, with results that match expert knowledge. The relevant features found by the presented approach showed that misclassification often occurred on samples with conflicting characteristics between classes. For example, normal connections with low duration and low login success are misclassified as attacks, while attack connections with low error rate and higher login success are misclassified as normal.

VI Discussion

An advantage of the presented approach is that it can be used for any differentiable model and any classification task. No modifications of the model are required. The presented approach only requires the gradients of the model cross-entropy w.r.t. the inputs. Non-continuous functions can also be incorporated into the approach, for example rounding operations for integer and categorical features.

The presented adversarial approach can be extended to perform other analyses of the model. For example, instead of finding modifications to correct the predicted class, the modifications can be used to deceive the classifier, which can be used for vulnerability assessment. Future research will be conducted to incorporate the explanations to improve the accuracy of the model without having to include new data in the training procedure.