Exploring techniques for estimating safety of machine learning classifiers
Ensuring the safety and explainability of machine learning (ML) is a topic of increasing relevance as data-driven applications venture into safety-critical application domains, which are traditionally committed to high safety standards that cannot be satisfied by testing otherwise inaccessible black-box systems alone. In particular, the interaction between safety and security is a central challenge, as security violations can lead to compromised safety. The contribution of this paper to addressing both safety and security within a single concept of protection applicable during the operation of ML systems is active monitoring of the behaviour and the operational context of the data-driven system, based on distance measures of the Empirical Cumulative Distribution Function (ECDF). We investigate abstract datasets (XOR, Spiral, Circle) and a current security-specific dataset for intrusion detection (CICIDS2017) of simulated network traffic, using distributional shift detection measures including the Kolmogorov-Smirnov, Kuiper, Anderson-Darling, Wasserstein and mixed Wasserstein-Anderson-Darling measures. Our preliminary findings indicate that the approach can provide a basis for detecting whether the application context of an ML component is valid with respect to safety and security. Our preliminary code and results are available at https://github.com/ISorokos/SafeML.
Machine Learning (ML) is expanding rapidly in numerous applications. In parallel with this rapid growth, the expansion of ML towards dependability-critical applications raises societal concern regarding the reliability and safety assurance of ML. Examples include ML in medicine [1, 2, 3], in autonomous systems such as self-driving cars [4, 5], in military applications, and in economic applications. In addition, different organizations and governmental institutes are trying to establish new rules, regulations and standards for ML, such as in [8, 9, 10].
While ML is a powerful tool for enabling data-driven applications, its unfettered use can pose risks to financial stability, privacy, the environment and, in some domains, even life. Poor application of ML is typically characterized by poor design, misspecification of the objective functions, implementation errors, choosing the wrong learning process, or using poor or non-comprehensive datasets for training. Thus, safety for ML can be defined as a set of actions to prevent any harm to humanity by ML failures or misuse. However, there are many perspectives and directions to be defined for ML safety. Prior work on certifying ML systems operating in the field has categorized safety issues into five categories: a) safe exploration, b) robustness to distributional shift, c) avoiding negative side effects, d) avoiding “reward hacking” and “wireheading”, and e) scalable oversight. This categorization is helpful for adequately assessing the applicability of a concept to a given (safety) problem. In the work presented here, we focus on distributional shift, albeit with a non-standard interpretation. Distributional shift is usually interpreted as the gradual deviation between the initial state of learning of an ML component and its ongoing state as it performs online learning. As will be shown later, our approach instead uses distributional shift to evaluate the distance between the training data and the observed data of an ML component.
Statistical distance measures are a common method to measure distributional shift. Furthermore, in modern ML algorithms such as Generative Adversarial Nets (GANs), statistical distance or divergence measures such as the Jensen-Shannon divergence, the Wasserstein distance and the Cramer distance are applied as loss functions. For dimension reduction, the t-SNE (t-distributed stochastic neighbour embedding) algorithm uses the Kullback-Leibler divergence as a loss function.
This paper studies the applicability of safety-security monitoring based on statistical distance measures to the robustness of ML systems in the field. The basis of this work is a modified version of the statistical distance concept that allows the dataset used during ML training to be compared with the dataset observed while the ML classifier is used in the field. The distance is calculated in a novel controller-in-the-loop procedure to estimate the accuracy of the classifier in different scenarios. By exploiting this accuracy estimate, applications can actively identify situations where the ML component may be operating far beyond its trained cases, thereby risking low accuracy, and adjust accordingly. The main advantage of this approach is its flexibility in potentially handling a large range of ML techniques, as it does not depend on the ML approach. Instead, the approach focuses on the quality of the training data and its deviation from the field data. In a comprehensive case study, we have analyzed the possibilities and limitations of the proposed approach.
The rest of the paper is organised as follows: In Section 1.3, previous work related to this publication is discussed. In Section 2, the problem definition is provided. The proposed method is addressed in Section 3. Numerical results are demonstrated in Section 4 with a brief discussion. Explainable AI is introduced and discussed as a highly relevant topic in Section 5. The capabilities and limitations of the proposed method are summarised in Section 6 and the paper terminates with a conclusion in Section 7.
Our analysis of the research literature did not reveal any existing publications dealing with the safety, security and accuracy of ML-based classification using statistical measures of difference. Nevertheless, there are publications that provide a basis for comparison with the current study. A Resampling Uncertainty Estimation (RUE)-based algorithm has been proposed to ensure the point-wise reliability of regression when the test or field dataset differs from the training dataset. The algorithm creates prediction ensembles through modified gradient and Hessian functions for ML-based regression problems. An uncertainty wrapper for black-box models based on statistical measures has also been proposed. Hobbhahn et al. have proposed a method to evaluate the uncertainty of Bayesian Deep Network classifiers using Dirichlet distributions. The results were promising but limited to one class of classifiers (Bayesian Network-based classifiers). A new Key Performance Index (KPI), the Expected Odds Ratio (EOR), has also been introduced. This KPI was designed to estimate empirical uncertainty in deep neural networks based on entropy theories. However, it has not yet been applied to other types of machine learning algorithms. A comprehensive study on dataset shift is also available, which discusses dataset issues such as projection and projectability, as well as simple and prior probability shift. However, that study does not address the use of statistical distance and error bounds to evaluate dataset shift, in contrast to the work presented here.
Classification ML algorithms are typically employed to categorize input samples into predetermined categories. For instance, abnormality detection can be performed by detecting whether a given sample falls within known ranges, i.e. categories. A simple example of a classifier for 1-dimensional input is a line or threshold. Consider a hypothetical measurement t (e.g. time, temperature etc.) and a classifier D based on it, as shown in Figure 1-(a) and defined in (1). Note that Figure 1 shows the true classes of the input.

$$ D(t) = \begin{cases} \text{Class 1}, & 0 \le t \le 100 \\ \text{Class 2}, & 100 < t \le 200 \end{cases} \tag{1} $$

The classifier D can predict two classes which represent, in this example, the normal and abnormal state of a system. For measurement inputs from 0 to 100, the sample is considered to fall under class 1, and from above 100 to 200 under class 2. The Probability Density Functions (PDFs) of the (true) classes can be estimated as shown in Figure 1-(b). In this figure, the threshold of the classifier is represented by a red vertical dashed line at a value of four. The area where the PDFs overlap can cause false detections, as the classifier misclassifies input belonging to the opposite class. These types of misclassification are also known as false-positive/type I errors (e.g. when misclassifying input as being class 1) and false-negative/type II errors (e.g. when misclassifying input as not being class 1).
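The threshold classifier described above can be sketched in a few lines. This is only an illustration: the function names, the range check and the labelled samples are hypothetical and not taken from the paper; only the class ranges (0 to 100 for class 1, above 100 to 200 for class 2) come from the text.

```python
def classify(t, threshold=100.0, lower=0.0, upper=200.0):
    """Threshold classifier D: class 1 on [lower, threshold], class 2 on (threshold, upper]."""
    if not lower <= t <= upper:
        raise ValueError("measurement outside the range covered by the classifier")
    return 1 if t <= threshold else 2


def error_rate(labelled_samples):
    """Fraction of (measurement, true_class) pairs that D misclassifies."""
    wrong = sum(1 for t, true_class in labelled_samples if classify(t) != true_class)
    return wrong / len(labelled_samples)


# Hypothetical labelled samples; two of them lie in the overlap region
# around the threshold and are therefore misclassified.
samples = [(20.0, 1), (95.0, 1), (105.0, 1), (98.0, 2), (150.0, 2)]
print(error_rate(samples))
```

Samples whose true class lies on the "wrong" side of the threshold are exactly the ones that fall into the overlapping region of the two PDFs.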
Considering the probability density functions in Figure 1-(b), we notice that misclassifications, and thus errors, can occur in the region where the two density functions overlap. The probability of error (misclassification) can be calculated with (2). Note that the error probability also depends on the threshold value (with x considered as the threshold value).
In (2), the conditional error probability $P(error \mid x)$ can be calculated as the minimum of both PDFs, as in (3). The minimization is subject to variation of the threshold value from $-\infty$ to $+\infty$.
By dividing the space into two regions, $R_1$ and $R_2$, the probability of error can be written in two parts.
To ease the minimization problem, consider the following inequality rule: $\min[a, b] \le a^{\lambda} b^{1-\lambda}$ for $a, b \ge 0$ and $0 \le \lambda \le 1$.
The resulting upper bound on the error can be calculated using (11).
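For orientation, the standard textbook form of this bound can be sketched as follows. The notation is an assumption of this sketch (class priors $P(C_i)$, class-conditional densities $p(x \mid C_i)$ and a free parameter $\lambda$) and may not match the symbols or numbering of equations (2) to (11) exactly.

$$ P(error) = \int_{-\infty}^{+\infty} \min\big[\, P(C_1)\, p(x \mid C_1),\; P(C_2)\, p(x \mid C_2) \,\big]\, dx $$

$$ \min[a, b] \le a^{\lambda} b^{1-\lambda}, \qquad a, b \ge 0,\; 0 \le \lambda \le 1 $$

$$ P(error) \le P(C_1)^{\lambda}\, P(C_2)^{1-\lambda} \int_{-\infty}^{+\infty} p(x \mid C_1)^{\lambda}\, p(x \mid C_2)^{1-\lambda}\, dx $$

Setting $\lambda = 1/2$ yields the Bhattacharyya bound $P(error) \le \sqrt{P(C_1) P(C_2)}\; e^{-D_B}$. For two univariate Gaussian classes with means $\mu_1, \mu_2$ and standard deviations $\sigma_1, \sigma_2$, the Bhattacharyya distance has the closed form

$$ D_B = \frac{(\mu_1 - \mu_2)^2}{4(\sigma_1^2 + \sigma_2^2)} + \frac{1}{2} \ln\!\left( \frac{\sigma_1^2 + \sigma_2^2}{2\, \sigma_1 \sigma_2} \right) $$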
Considering $\lambda = 0.5$, equation (11) effectively becomes the Bhattacharyya distance. It can be proven that this value is optimal under specific conditions on the class distributions [22, 25]. In this study, for simplicity, the Bhattacharyya distance will be used to demonstrate the approach. It should be noted that there may be cases where the calculated error bound is higher than the real value. However, this is acceptable, as an overestimation of the classifier error would not introduce safety concerns (although it may impact performance). As the probability of error and the probability of correct classification are complementary, the probability of a correct classification can be calculated using (12).
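The claim that the bound may overestimate the real error can be checked numerically. The sketch below is not the paper's implementation: it assumes two univariate Gaussian classes with hypothetical parameters, computes the Bhattacharyya bound in closed form, and compares it with the error obtained by numerically integrating the minimum of the two prior-weighted densities.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a univariate normal distribution."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def bhattacharyya_distance(mean1, std1, mean2, std2):
    """Closed-form Bhattacharyya distance between two univariate Gaussians."""
    var_sum = std1 ** 2 + std2 ** 2
    return ((mean1 - mean2) ** 2) / (4.0 * var_sum) \
        + 0.5 * math.log(var_sum / (2.0 * std1 * std2))


def bhattacharyya_error_bound(mean1, std1, mean2, std2, prior1=0.5):
    """Upper bound on the error probability (Chernoff bound with lambda = 0.5)."""
    prior2 = 1.0 - prior1
    distance = bhattacharyya_distance(mean1, std1, mean2, std2)
    return math.sqrt(prior1 * prior2) * math.exp(-distance)


def overlap_error(mean1, std1, mean2, std2, prior1=0.5, steps=20000):
    """Midpoint-rule integral of min(prior1 * p1(x), prior2 * p2(x))."""
    prior2 = 1.0 - prior1
    lo = min(mean1 - 8.0 * std1, mean2 - 8.0 * std2)
    hi = max(mean1 + 8.0 * std1, mean2 + 8.0 * std2)
    width = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * width
        total += min(prior1 * gaussian_pdf(x, mean1, std1),
                     prior2 * gaussian_pdf(x, mean2, std2))
    return total * width


# Hypothetical class parameters, loosely following the 0..200 measurement range.
bound = bhattacharyya_error_bound(80.0, 15.0, 120.0, 15.0)
exact = overlap_error(80.0, 15.0, 120.0, 15.0)
print(f"numerical error: {exact:.4f}, Bhattacharyya bound: {bound:.4f}")
```

In this setting the bound is conservative: it never falls below the numerically integrated error, which matches the argument that an overestimate costs performance rather than safety.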
The Chernoff upper bound of error is usually used as a measure of the separability of two classes of data, but in the above context, equation (12) measures the similarity between two classes. In other words, in an ideal situation, when a class is compared with itself, the response should be equal to one while the error should be zero. The intuition is to determine whether or not the distribution of the data during training is the same as the distribution observed in the field.
Under an additional assumption, the integral part of the error probability can be converted to the cumulative distribution function, as in (13).
Equation (13) shows that there is a relation between the probability of error (and thus accuracy) and the statistical difference between the Cumulative Distribution Functions (CDFs) of two classes. Using this fact, and given that the Empirical CDFs of each class are available, ECDF-based statistical measures such as the Kolmogorov-Smirnov distance (equation 14) and similar distance measures can be used [26, 27].
It should be mentioned that such ECDF-based distances are not bounded between zero and one and, in some cases, need to be scaled by a coefficient before they can serve as a measure for accuracy estimation. In Section 4.3, the correlation between ECDF-based distance and accuracy will be discussed.
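To make the ECDF-based measures concrete, the following sketch computes three of them for one-dimensional samples using their standard two-sample definitions: Kolmogorov-Smirnov (largest absolute ECDF difference), Kuiper (largest positive plus largest negative difference) and Wasserstein-1 (area between the ECDFs). The function names are hypothetical, and the weighted Anderson-Darling and Wasserstein-Anderson-Darling variants, as well as any scaling coefficient, are left out.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return F(x) = fraction of sample values that are <= x."""
    ordered = sorted(sample)
    n = len(ordered)
    return lambda x: bisect_right(ordered, x) / n


def ecdf_distances(reference, observed):
    """Kolmogorov-Smirnov, Kuiper and Wasserstein-1 distances between two 1-D samples."""
    f_ref, f_obs = ecdf(reference), ecdf(observed)
    # Both ECDFs are step functions that only change at sample values,
    # so it is enough to evaluate them on the pooled, sorted sample.
    grid = sorted(set(reference) | set(observed))
    diffs = [f_ref(x) - f_obs(x) for x in grid]
    d_plus = max(0.0, max(diffs))    # largest amount by which F_ref exceeds F_obs
    d_minus = max(0.0, -min(diffs))  # largest amount by which F_obs exceeds F_ref
    # Area between the ECDFs: the difference is constant between grid points.
    area = sum(abs(diffs[i]) * (grid[i + 1] - grid[i]) for i in range(len(grid) - 1))
    return {"ks": max(d_plus, d_minus), "kuiper": d_plus + d_minus, "wasserstein": area}


# Hypothetical data: the observed sample is the reference shifted by one unit.
print(ecdf_distances([1, 2, 3, 4], [2, 3, 4, 5]))
```

The Kolmogorov-Smirnov and Kuiper values stay within fixed limits (at most one and two, respectively), whereas the Wasserstein value grows with the size of the shift, which illustrates why some of these distances need rescaling before being read as an accuracy estimate.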
To begin with, we should note that while this study focuses on ML classifiers, the proposed approach can also be applied to ML components for regression tasks. Figure 2 illustrates how we envision the approach being applied in practice. The flowchart has two main sections: the training phase and the application phase. A) The ’training’ phase is an offline procedure in which a trusted dataset is used to train the ML algorithm. Once training is complete, the classifier’s performance is measured with user-defined KPIs. Meanwhile, the PDF and statistical parameters of each class are also computed and stored for future comparison in the second phase. B) The second or ’application’ phase is an online procedure in which real-time, unlabelled data is provided to the system. For example, consider an autonomous vehicle’s machine vision system. Such a system has been trained to detect obstacles (among other tasks), so that the vehicle can avoid collisions with them. A critical issue in the application phase is that the incoming data is unlabelled, so it cannot be assured that the classifier will perform as accurately as it did during the training phase. As input samples are collected, the PDF and statistical parameters of each class can be estimated. The system requires enough samples to reliably determine the statistical difference, so a buffer of samples may have to be accumulated before proceeding. Using the modified Chernoff error bound in (12), the statistical difference between the training phase and the application phase is computed for each class. If the statistical difference is very low, the classifier results and accuracy can be trusted. In the example above, the autonomous vehicle would continue its operation. If, instead, the statistical difference is large, the classifier results and accuracy are no longer considered valid (as the difference between the training and observed data is too large). In this case, the system should use an alternative approach or notify a human operator. In the above example, the system could ask the driver to take over control of the vehicle.
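The application phase can be sketched as a small monitor. This is a minimal illustration under several assumptions that are not from the paper: inputs are one-dimensional, incoming samples are grouped by the label the classifier predicts for them, the Kolmogorov-Smirnov distance stands in for the statistical difference, and the class name, buffer size and threshold are hypothetical.

```python
from bisect import bisect_right
from collections import deque


def ks_distance(reference, observed):
    """Largest absolute difference between the ECDFs of two 1-D samples."""
    ref, obs = sorted(reference), sorted(observed)
    return max(abs(bisect_right(ref, x) / len(ref) - bisect_right(obs, x) / len(obs))
               for x in ref + obs)


class DistanceMonitor:
    """Compares buffered field data per predicted class against the training data."""

    def __init__(self, training_by_class, buffer_size=50, threshold=0.3):
        self.training_by_class = training_by_class
        self.buffer_size = buffer_size
        self.threshold = threshold  # assumed value; would need calibration in practice
        self.buffers = {label: deque(maxlen=buffer_size) for label in training_by_class}

    def observe(self, value, predicted_label):
        """Store one sample and report 'buffering', 'trusted' or 'intervene'."""
        buffer = self.buffers[predicted_label]
        buffer.append(value)
        if len(buffer) < self.buffer_size:
            return "buffering"  # not enough samples yet for a reliable estimate
        distance = ks_distance(self.training_by_class[predicted_label], buffer)
        return "trusted" if distance <= self.threshold else "intervene"
```

A caller would feed each classified sample to `observe` and hand control to a fallback or a human operator whenever the result is "intervene". The fixed-length buffer keeps only the most recent samples, so the verdict follows the current operating context rather than the whole history.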
In this section, the proposed method described in Section 3 is applied to typical synthetic benchmarks for ML classification. The proposed method has been implemented in three different programming languages: R, Python and MATLAB. For the R implementation, three well-known benchmarks have been selected: a) the XOR dataset, b) the Spiral dataset and c) the Circle dataset. Each dataset has two features (i.e. input variables) and two classes. Figure 3 illustrates the scatter plots of the selected benchmarks. More examples and benchmarks are available in the SafeML GitHub repository.
To start the ML-based classification, 80 percent of each dataset was used for training and testing and 20 percent was used for validation, with 10-fold cross-validation. Both linear and nonlinear classifiers have been selected for classification. Linear Discriminant Analysis (LDA) and the Classification And Regression Tree (CART) are used as linear methods. Moreover, Random Forest (RF), K-Nearest Neighbours (KNN) and Support Vector Machine (SVM) are applied as nonlinear methods. As KPIs, the accuracy and the Kappa measure are used to measure the performance of each classifier. Finally, as Empirical Cumulative Distribution Function (ECDF)-based statistical distance measures, the Kolmogorov-Smirnov Distance (KSD), the Kuiper Distance, the Anderson-Darling Distance (ADD), the Wasserstein Distance (WD), and a combination of WD and ADD, the Wasserstein-Anderson-Darling Distance (WAD), have been selected for evaluation.
XOR Dataset: The XOR dataset has two features and two classes, in which the features have the same mean and variance characteristics. Table 1 compares the estimated accuracy based on the ECDF measures with the Minimum True Accuracy (MTA) and the Average True Accuracy (ATA) over 10 folds. For instance, the second column of this table provides the estimated accuracy based on the KSD measure. As a matter of safety, the MTA is more important because it represents the worst-case scenario, where the lowest accuracy may be experienced and impact safety. We observe that the KSD measure reports low accuracy for the LDA classifier (0.77). In contrast, the ADD and WAD measures significantly overestimate the accuracy of the LDA.
Based on Table 1, Table 2 presents the (absolute) difference between the accuracy estimation of each measure and the MTA of each classifier. The ADD, WD and WAD measures have the best accuracy estimations overall. In particular, when an LDA classifier is used, the WD measure provides an estimated accuracy with comparatively less error.
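The quantities compared in these tables are simple to compute. The sketch below uses hypothetical per-fold accuracies and a hypothetical distance-based estimate; none of the numbers come from Table 1 or Table 2.

```python
def accuracy_summary(fold_accuracies, estimated_accuracy):
    """Minimum and average true accuracy over folds, plus the estimate's gap to the minimum."""
    mta = min(fold_accuracies)                          # worst case over the folds
    ata = sum(fold_accuracies) / len(fold_accuracies)   # average over the folds
    return {"MTA": mta, "ATA": ata, "abs_diff_to_MTA": abs(estimated_accuracy - mta)}


# Hypothetical values for one classifier and one distance measure.
print(accuracy_summary([0.80, 0.90, 0.70], estimated_accuracy=0.75))
```

A smaller `abs_diff_to_MTA` means the distance-based estimate tracks the worst-case accuracy more closely.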
Spiral Dataset: As with the XOR dataset, the proposed method can be applied to the spiral dataset. Table 3 presents the difference between the ECDF-based distance measures and the minimum true accuracy for this dataset. For brevity, only the difference table is provided for this dataset and the next one. Based on this table, the KSD and Kuiper distance provide better accuracy estimates for the classifiers on the spiral dataset.
Circle dataset: The circle dataset has statistical characteristics similar to those of the spiral dataset. Table 4 provides the difference between the ECDF-based distance measures and the MTA for this dataset. As can be seen, the worst accuracy estimation is that of the LDA classifier. For the LDA, the Kuiper distance estimates the accuracy with the least error, with the KSD and WD in second and third place respectively.
This case study applies the proposed method to the CICIDS2017 dataset, which was originally produced at the Canadian Institute for Cyber Security (CICS) as an aid to the development and research of anomaly-based intrusion detection techniques for use in Intrusion Detection Systems (IDSs) and Intrusion Prevention Systems (IPSs).
The labelled dataset includes both benign (Monday) and malicious (Tuesday, Wednesday, Thursday, Friday) activity. The benign network traffic is simulated by abstraction of typical user activity using a number of common protocols such as HTTP, HTTPS, FTP and SSH. Benign and malicious network activity is included as packet payloads in packet capture format (PCAPS).
Wednesday Attack: This attack occurred on Wednesday, July 5, 2017, and different types of attacks on the availability of the victim’s system have been recorded, such as DoS / DDoS, DoS slowloris (9:47 – 10:10 a.m.), DoS Slowhttptest (10:14 – 10:35 a.m.), DoS Hulk (10:43 – 11 a.m.), and DoS GoldenEye (11:10 – 11:23 a.m.). For validation, a hold-out approach has been used, in which 70 percent of the data has been randomly extracted for testing and training and the rest has been used for accuracy estimation. Additionally, traditional classifiers including ’Naive Bayes’, ’Discriminant Analysis’, ’Classification Tree’, and ’Nearest Neighbor’ have been used.
Figure 5 has been generated over 100 iterations. For each iteration, 70 percent of the data has been randomly extracted for testing and training and the rest has been used for accuracy estimation. Figure 5 shows the box plot of the statistical distance measurements vs. the evaluated accuracy over these 100 iterations. By observing the average values (red lines) of each box plot, the relationship between each measure and the average change in accuracy can be understood. In addition, this plot shows which method has less variation. For instance, the Kuiper distance and WD perform best, while the Chernoff measure performs worst.
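The repeated hold-out procedure behind such a box plot can be sketched as follows. The function names and the `metric` callback are hypothetical; in the study the metric would be a statistical distance or an accuracy, whereas the example uses a simple difference of means on made-up data.

```python
import random
import statistics


def repeated_holdout(samples, metric, iterations=100, train_fraction=0.7, seed=0):
    """Randomly split the data `iterations` times and score each split with `metric`."""
    rng = random.Random(seed)
    scores = []
    for _ in range(iterations):
        shuffled = list(samples)
        rng.shuffle(shuffled)
        cut = int(train_fraction * len(shuffled))
        scores.append(metric(shuffled[:cut], shuffled[cut:]))
    return scores


def box_plot_summary(scores):
    """Five-number summary of the kind usually drawn in a box plot."""
    q1, median, q3 = statistics.quantiles(scores, n=4)
    return {"min": min(scores), "q1": q1, "median": median, "q3": q3, "max": max(scores)}


def mean_gap(train, held_out):
    """Hypothetical metric: absolute difference between the two sample means."""
    return abs(statistics.mean(train) - statistics.mean(held_out))


scores = repeated_holdout(range(200), mean_gap)
print(box_plot_summary(scores))
```

The spread between `q1` and `q3` is what the text refers to as the variation of a measure across iterations.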
Thursday Attack: This attack occurred on Thursday, July 6, 2017, and various attacks, such as the Web Attack – Brute Force (9:20 – 10 a.m.), Web Attack – XSS (10:15 – 10:35 a.m.), and Web Attack – SQL Injection (10:40 – 10:42 a.m.) have been recorded. Figure 6 shows the confusion matrix for Thursday morning’s security intrusion in the CICIDS2017 dataset when the Naive Bayes classifier is applied. As for Wednesday, a 70 percent hold-out validation is used for this dataset. As can be seen, this dataset has four classes and the classifier has difficulty detecting the last class, i.e. the last type of intrusion.
Figure 7 shows a sample result of six statistical measures (Chernoff and five ECDF-based measures) vs. the accuracy of the classifier. In this sample, the Kolmogorov-Smirnov and Kuiper measures perform better.
Similar to the previous example, the experiment of Figure 7 has been repeated 100 times, and the resulting box plot is shown in Figure 8. In this figure, the Kolmogorov-Smirnov, Kuiper and Wasserstein distance measures perform better; however, their decision variance is somewhat high.
The remaining results for security intrusion detection on the CICIDS2017 dataset are available in the SafeML GitHub repository.
Figure 9 shows Pearson’s correlation between the classes of Wednesday’s data and the statistical ECDF-based distances. As can be seen, the WD and WAD distances are more strongly correlated with the classes. This figure also shows the correlation between the measures themselves. The KSD and KD appear to be correlated. The WD and WAD also seem to be correlated. These correlations can be explained by the similarity in their formulation. P-values for the above correlations were evaluated to be zero, thereby validating the correlation hypotheses above.
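For reference, Pearson's correlation coefficient between two equally long series can be computed as in the sketch below. The function name is hypothetical, the example series are made up, and the computation of p-values is not covered.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson's r: covariance of the two series divided by the product of their spreads."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least two values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Hypothetical series: a class label per window and a distance value per window.
labels = [1, 1, 2, 2, 3, 3]
distances = [0.10, 0.12, 0.35, 0.33, 0.61, 0.58]
print(pearson_correlation(labels, distances))
```

Values close to 1 or -1 indicate a strong linear relationship, while values near 0 indicate none.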
In this section, we discuss a topic relevant to our proposed approach and explain how the approach could be applied for this purpose as well. Explainable AI (XAI) can be defined as a tool or framework that increases the interpretability of ML algorithms and their outputs. Our proposed approach can also be used to improve the interpretability of ML classifiers using the statistical ECDF-based distance measures seen previously. We discuss a small example here and intend to delve further into this topic in future work. For the example, the Wednesday data from the security dataset mentioned previously is chosen, and its class labels vs. the sample time are plotted in Figure 10. This dataset has six different classes with a variable number of occurrences. In this figure, a fixed-size sliding window is used. In the beginning, samples of class one are considered as the reference and then compared with the rest of the samples for each window using the statistical ECDF-based distance measures. It should be mentioned that the smoothness of the output is related to the sliding window’s size. As can be seen in the figure, the change in the average distance vs. the class shows that a high correlation exists. In addition, it seems that class number five is slightly robust to statistical change, and that class number six has a low number of samples, which cannot produce a meaningful statistical difference. The problem of detecting class six can be solved by decreasing the size of the sliding window. This figure can be generated for different classifiers to show how their decisions are correlated with the ECDF-based distance measures. As future work, we aim to investigate ECDF-based distances inside different algorithms to better understand their actions. This section is only a pointer to future work.
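The sliding-window comparison can be sketched as below. The sketch assumes one-dimensional samples and uses the Kolmogorov-Smirnov distance as the ECDF-based measure; the function names, window size and data are hypothetical and do not reproduce Figure 10.

```python
from bisect import bisect_right


def ks_distance(reference, observed):
    """Largest absolute difference between the ECDFs of two 1-D samples."""
    ref, obs = sorted(reference), sorted(observed)
    return max(abs(bisect_right(ref, x) / len(ref) - bisect_right(obs, x) / len(obs))
               for x in ref + obs)


def sliding_window_distances(reference, stream, window_size, step=1):
    """Distance between the reference sample and each window of the stream."""
    return [ks_distance(reference, stream[start:start + window_size])
            for start in range(0, len(stream) - window_size + 1, step)]


# Hypothetical stream: the first half resembles the reference class,
# the second half comes from a clearly different range.
reference = list(range(50))
stream = list(range(50)) + [value + 1000 for value in range(50)]
print(sliding_window_distances(reference, stream, window_size=10, step=10))
```

A smaller `window_size` reacts faster to short-lived classes but produces a noisier curve, which is the trade-off mentioned above for the class with few samples.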
Overall, our preliminary investigation indicates that statistical distance measures offer the potential for providing a suitable indicator for ML performance, specifically for accuracy estimation. In particular, we further denote the following capabilities and limitations for the proposed approach.
By modifying the existing statistical distance and error bound measures, the proposed method enables estimation of the accuracy bound of the trained ML algorithm in the field with no label on the incoming data.
A novel human-in-the-loop monitoring procedure is proposed to certify the ML algorithm during operation. The procedure has three levels of operation: I) nominal operation, allowed with assured ML accuracy based on the distance estimation, II) buffering of data samples to generate an estimation, and III) low estimated accuracy, requiring external intervention by an automated or human controller.
The proposed approach is easy to implement, and can support a variety of distributions (i.e. exponential and normal distribution families).
The proposed algorithm currently only tackles the safety evaluation problem of machine-learning-based classification. However, we believe it can be easily extended to clustering, dimension reduction or any problem that can be evaluated through statistical difference.
Some of the machine learning algorithms can be robust to a certain distributional shift or variation in the dataset distribution. This may limit the effectiveness of the discussed distance measures. That being said, the proposed measures can then be used as additional confirmation of the robustness, contributing to certification arguments.
The expansion of ML applications to safety-critical domains is a major research question. We investigate the problem of context applicability of an ML classifier, specifically the distributional shift between its training and observed data. We have identified and evaluated sets of statistical distance measures that can provide estimated upper error bounds in classification tasks based on the training and observed data distance. Further, we have proposed how this approach can be used as part of safety and security-critical systems to provide active monitoring and thus improve their robustness. The overall most effective distance measure was identified to be the Kolmogorov-Smirnov. The proposed human-in-the-loop procedure uses this statistical distance measure to monitor the estimated accuracy of the ML component and notify its AI or human controller when the deviation exceeds specific boundaries. The study is still in its early stages, but we believe the results to offer a promising starting point. The strengths and weaknesses of the proposed approach are discussed in the previous section.
This work was supported by the DEIS H2020 Project under Grant 732242. We would like to thank EDF Energy RD UK Centre, AURA Innovation Centre and the University of Hull for their support.
M. M. Deza and E. Deza, “Distances in probability theory,” in Encyclopedia of Distances. Springer, 2014, pp. 257–272.
M. Raschke, “Empirical behaviour of tests for the beta distribution and their application in environmental research,” Stochastic Environmental Research and Risk Assessment, vol. 25, no. 1, pp. 79–89, 2011.