Nowadays processing of Big Security Data, such as log messages, is commonly used for intrusion detection purposed. Its heterogeneous nature, as well as combination of numerical and categorical attributes does not allow to apply the existing data mining methods directly on the data without feature preprocessing. Therefore, a rather computationally expensive conversion of categorical attributes into vector space should be utilised for analysis of such data. However, a well-known k-modes algorithm allows to cluster the categorical data directly and avoid conversion into the vector space. The existing implementations of k-modes for Big Data processing are ensemble-based and utilise two-step clustering, where data subsets are first clustered independently, whereas the resulting cluster modes are clustered again in order to calculate metamodes valid for all data subsets. In this paper, the novel frequency-based distance function is proposed for the second step of ensemble-based k-modes clustering. Besides this, the existing feature discretisation method from the previous work is utilised in order to adapt k-modes for processing of mixed data sets. The resulting k-metamodes algorithm was tested on two public security data sets and reached higher effectiveness in comparison with the previous work.
Nowadays outlier detection algorithms are widely used in different areas, such as banking, insurances, medicine and of course security. In the latter area security experts apply outlier detection methods in order to identify intrusions or other types of malicious behaviour, which should deviate from “normal” patterns in the data. This intrusion detection scenario has several characteristics that makes application of outlier detection methods especially complicated. First of all, intrusion detection should often be performed in the absence of the ground truth. Indeed, even for security experts it would be hardly possible to guarantee that the collected monitoring data does not contain any traces of the malicious behaviour in it. This implies that only more sophisticated unsupervised outlier detection methods could normally be used for processing of such data. Next, the heterogeneous nature of Big Security Data (that are collected from a variety of devices, operating systems and services) makes it hard to stick to specific set of numerical features or metrics. Rather, each data feed has its own fields/format, which requires either conversion to a common log format[OLF] or application of generic outlier detection methods applicable on data without feature preparation [Sapegin2017, thesis]. Finally, security data, such as log messages or network traffic (e.g., TCP dump or derived features) normally contain textual data (username, action, domain name, protocol, etc.), or also numerical data, but of different measures and distribution types, i.e. bytes transferred and number of connections per hour. Therefore, most existing data mining algorithms cannot be directly applied on such mixed data.
Rather, most existing methods including classic clustering-based approaches require discretisation of continuous numerical data into categories [Garcia2013, Sapegin2017] and conversion of textual data into vector space model in order to apply distance function on data, which is needed to build clusters and find outliers.
Luckily, textual data of security log messages actually represent a limited number of categories and not a free natural text. So instead of applying techniques such as TF-IDF in order to convert log messages into vector space model, a direct conversion using one-hot encoding — where each unique field value becomes a separate column/dimension — is normally performed on the data111Since each term appears only once in the log line, calculation of term frequencies does not bring any benefits.
. Of course, in the large enterprise networks covering data of, for example, 100,000 employees a username column alone will require 100,000 dimensions in the vector space representation, whereas full vector space can contain 500-600 thousand dimensions/columns. Although special programming language structures and algorithms (such as sparse matrices and spherical k-means clustering) allow processing of high-dimensional vector space models[Sapegin2017, thesis], such conversion still affects performance and may have higher requirements on RAM usage.
Therefore, intrusion detection for Big Security Data may benefit from utilisation of generic clustering-based outlier detection that does not require data conversion into vector space model. The most well-known algorithm for this task is k-modes, which by default uses hamming distance (“simple matching similarity”) and defines modes (“cluster centers”) as a set of most frequent category for each feature/attribute. Since k-modes was originally proposed by Huang et al in 1998 [Huang1998], many researchers worked on various improvements of this algorithm.
In 2004, San et al. elaborated on the problem of non-unique cluster modes in k-modes, which “makes the algorithm unstable depending on mode selection during the clustering process” [San2004]. Authors propose to replace modes with “cluster representatives”, which represents a fuzzy set with all possible categorical values for each attribute from the cluster, whereas each value is characterised by its relative (to the mode/representative) frequency. San et al. also propose a new distance function, where the distance between object and mode/representative is calculated based on the relative frequencies for each attribute in the object.
The similar idea was proposed in 2005 by He et al., where dissimilarity function also takes into account the relative frequency of the attribute, but only if the most frequent attribute category in the mode is equal to the attribute value in the object [He2005].
Both San et al. and He et al. demonstrated that the frequency-based distance functions allow to achieve higher clustering accuracy. Following it, Ng et al. in 2007 provided a formal proof for the effectiveness of the k-modes with frequency-based dissimilarity measure, as well as confirmed its guaranteed convergence. Next, Cao et al. in 2011 proposed improved dissimilarity measure based on rough set theory [Cao2012]. This novel dissimilarity measure (1) takes into account “the distribution of attribute values over the whole universe”222The distance function is based on the similarity measure that takes into account the number of equivalence classes in the whole data set with respect to the attribute value being compared., (2) has higher effectiveness (on selected biological and genetical data sets), (3) eliminates some border cases when object assignment to the mode is undetermined and (4) also guarantees convergence for k-modes. Besides that, authors showed both the effectiveness and efficiency of k-modes on large data sets.
The more detailed review of various modification of k-modes algorithm was provided by Goyal and Aggrawal [KModesReview]. The review paper covers not only the improvement of distance functions, but also related work on initialisation of modes and automatic selection of parameter k.
Finally, when distributed computing become widely used, Visalakshi et al. proposed one more important modification of k-modes algorithm, namely the ensemble-based distributed version of k-modes in 2015 [KarthikeyaniVisalakshi2015a]. Under this approach, data are divided into subsets and these subsets are clustered on the different nodes using k-modes. Next, all modes from all subsets are collected by the master node and undergo one more iteration of k-modes clustering in order to calculate global clusters. Authors claim equal or better performance and cluster quality in comparison with non-distributed k-modes as well as classic distributed k-means algorithm [DKM] (authors utilised label/integer encoding in order to apply k-means on categorical data).
However, in the distributed version of k-modes the second step of the algorithm still requires optimisation. Indeed, after all modes from all subsets are collected on the master node, one needs to calculate distances between (1) pairs of modes (from the first step of the algorithm), as well as (2) modes and “modes of modes” (which will be called metamodes in this paper). Even if the modes themselves were calculated taking relative frequencies of the attribute values into account, there is no existing distance function that can calculate dissimilarity between two modes/representatives containing relative frequencies for all possible attribute values in the cluster.
In this paper we review the distributed k-modes algorithm and propose a novel distance function for clustering of modes (cluster representatives containing frequencies for all possible categorical values for each attribute from the cluster objects). We utilise new distance function for clustering of modes in the second step of the distributed k-modes algorithm. We also prove that the resulting metamodes represent global cluster centers more effectively than in the cases when attribute frequencies are discarded after first step of the algorithm. Besides that we also combine distributed k-metamodes with discretisation of numeric data, which allows to apply this clustering method on numeric or mixed (containing both numerical and categorical data) data sets. The resulting algorithm is compared with Hybrid Outlier Detection from related work [Sapegin2017, thesis] and shows similar effectiveness while avoiding computationally expensive conversion into vector space.
The rest of the paper is organised as follows. In Section 1 we describe existing incremental distributed k-modes algorithm. Next, Section 2 proposes the novel frequency-based distance function for clustering of modes/representatives. The effectiveness of the k-metamodes with the new distance functions is evaluated in Section 3, which also contains comparison with existing Hybrid Outlier Detection algorithm. Finally, Section 4 concludes the paper.
1 Ensemble-based incremental distributed k-modes
Since k-modes is based on the k-means algorithm, it tries to solve the same optimisation problem. Namely, how to partition a set of objects333For k-means, the set should contain numeric objects only. into clusters. Formally, this problem P is described as follows [selim1984]:
where Q is a set of modes444For k-means, Q is a set of cluster centers., is a partition matrix with size of , and is distance function between object and mode.
The original k-modes differs from k-means only in the definition of cluster center (which is replaced with mode) and distance function. Therefore, for both algorithms the problem P can be solved by repeating the following steps until P will not converge to the local minimum [Huang1998]:
step 1: Fix Q and solve P through finding optimal W. Here the set of modes is fixed and for each object the best mode is identified through calculating the distance between object and mode.
step 2: Fix W and solve P through finding optimal Q. Here the modes are recalculated based on the object reassignments from step 1.
Thus, in order to identify clusters with k-modes, a distance function between object and mode should be defined555The same distance function can be used to find distance between two objects, since it will be a border case where mode will be based on one object only.. The most basic hamming distance can be formally defined as:
Of course, in case of frequency-based modes/representatives, both mode and distance function should be redefined. According to San et al., the frequency-based mode666Hereafter we will always use the notion “mode” when talking about both cluster modes and cluster representatives. is defined as follows [San2004]:
where is a set of all possible values of the attribute among all objects in the cluster with mode . Let us also define the , where is the most frequent attribute value in , so that [He2005].
When modes are calculated based on the attribute frequencies, a frequency-based distance function can be applied to calculate distance between object and mode, formally the formula 5 should be replaced with:
In the ensemble-based k-modes (proposed by Visalakshi et al. in [KarthikeyaniVisalakshi2015a]) this distance function can be applied at the first step of the ensemble-based clustering, where each data subset is clustered independently at the ensemble member (k-modes algorithm instance). Nevertheless, the next steps of the proposed ensemble-based clustering expects clustering of modes themselves, i.e. applies k-modes on collection of modes from all ensemble members. In this case the existing frequency-based distance function that uses formula 8 cannot be applied in order to calculate the distance between two modes while taking into account attribute frequencies calculated on the first step of the algorithm. The only existing solution supposes discarding previous attribute frequencies from the calculation of the distance and treating modes as usual objects777In order to convert mode to object, the classical definition of mode by Huang can be used, i.e. the mode can be converted back to or the set of the most frequent values for each attribute.. However, this approach might be less effective in comparison to frequency-based distance function.
Therefore, in this paper we propose a novel frequency-based distance function for clustering of modes at the third step of ensemble-based k-modes algorithm. The proposed distance function is described in details in the section below.
2 Frequency-based distance function for calculation of metamodes
Similarly with distance function from formula 4, the distance function for clustering of modes should be able to calculate distance between clustering object (mode) and cluster center (metamode). While the mode is already defined with formula 7, the metamode can be defined as a set of all attribute frequencies from all objects with all modes in the meta-cluster. Formally,
where is a set of all possible values of the attribute among all objects in all clusters with all modes with metamode . In order to be able to calculate the frequencies of the metamode attributes, it is needed to keep original counts (and not frequencies) of attribute values in the mode, since the frequencies are not scaled to the cluster size. Formally, we redefine mode so that formula 7 becomes:
where is the number of occurrences of as a value of attribute in the cluster with mode . Here we note that the distance between object and mode can still be calculated using formula 8, since it is easy to calculate from and, correspondingly from :
where is the number of objects in the cluster .
Thus, both modes and metamodes become a fuzzy sets containing counts for each possible attribute value for each object in the cluster and meta-cluster, while . This allows us to define a frequency-based distance function to find distance between two modes (or mode and metamode) as sum of Euclidean distances for each attribute:
In order to differentiate it from 8, we will call it as meta-frequency-based distance function in this paper.
Please note that in order to calculate both metamode and distance between mode and metamode, we use both (fuzzy set of attribute frequencies) and (fuzzy set of attribute counts), although can be calculated from on the fly.
The new distance function allows to take into account attribute frequencies in the modes for calculation of distance to the metamode. This approach should be more effective than the case when the mode is converted back to in order to discard attribute frequencies and threat mode as an object, which allows to apply distance functions from the previous work [Huang1998, San2004, He2005, Ng2007, Cao2012] mentioned in the introduction and also as formulas 5 and 8. The effectiveness of the k-metamodes with proposed distance function is evaluated on two data sets as described in the next section.
3 Evaluation of k-metamodes on public KDD Cup 1999 and UNSW-NB15 network data sets
KDD Cup 1999 data set is the most popular security data set for evaluation of machine learning and data mining algorithms[KDD99CUP]. This dataset contains both numerical and categorical features, which makes it perfect example of data that SIEM and IDS systems need to process. However, this data set is already 20 years old and contains a very high (80%) attack ratio [thesis, Section 4.3.2], which makes only first 400,000 records with attack ratio of 9.8% suitable for evaluation of unsupervised outlier detection algorithms888With the high attack ratio, such as 80%, unsupervised outlier detection tends to learn attacks as “normal” and therefore is not suitable for processing of such data..
In 2015, Moustafa et al. proposed the newer data set that also has lower attack ratio [nb15a, nb15b]. The UNSW-NB15 data set covers two simulation periods: (1) 16 hours on Jan 22, 2015 and (2) 15 hours on Feb 17, 2015. During these periods of time, the raw traffic was collected and later converted with the help of Argus[Argus] and Bro [Bro] into higher-level set of features (similar to KDD Cup 1999 data set) available as CSV files. Although all types of data (raw traffic in pcap format, BRO files, Argus files and CSV) are available for download, we use CSV files for evaluation of k-metamodes. In total, there are 4 CSV files covering both time periods, as shown in Figure 1.
Figure 1 shows rates of normal and attack records per second from both time periods, as well as all 4 CSV files. The simulation periods are recognisable through a gap in the middle of the figure, whereas red lines in the figure show the borders between CSV files. The second time period has much higher attack ratio (10 attacks per second or 25.92% of data), while first time period is more suitable for evaluation of unsupervised outlier detection algorithms (1 attack per second or 2.09% of data). Therefore, we only apply k-metamodes on the data from 16 hours of simulation on Jan 22, 2015. In total, this part contains 1,087,203 records, 22,215 of them are related to the malicious activity and listed in Table LABEL:tab:attacks.
|Attack category||Number of records|
The k-metamodes outlier detection algorithm was applied on both of described data sets.
In order to apply it on KDD Cup 1999 Data, the data were discretised according to previous work [thesis, Section 4.3.2]. Due to the fact that k-modes is able to process discretised categorical values only, advanced discretisation — which produces continuous numerical values in the range between 0 and 1 — cannot be applied on the data. Therefore, we utilised simple discretisation, sample size of 10,000 records999This sample size was selected to be the same with [thesis, Section 4.3.2] in order to be able to compare results with the previous work. and (number of modes per sample), which is the optimal k value for these data and discretisation type (please see [thesis, Section 4.3.6] for details). Each sample was selected using random sampling without replacement due to the fact that without random sampling some of KDD Cup data subsets (10,000 records each) will not have enough unique records to initialise 22 modes. Next, frequency-based distance function from Formula 8 was used on both steps of ensemble-based k-modes clustering. As mentioned in the Section 1, at the second step of the clustering, attribute frequencies were discarded in order to apply distance function from 8 for calculation of distance between mode and metamodes. Besides that, for the second step of the clustering, we continue using (number of metamodes). The resulting AUC (Area Under Curve) values for different number of samples are shown in Figure below.
In Figure 2, the AUC value does not show any dependency on the number of training samples, which supports the claim that training on subsamples of data does not decrease the quality of outlier detection [subsampling, Section 3.2]. Rather, the k-metamodes trained on just 4 samples 10,000 records each (10% of the data set altogether) allows to achieve the same AUC value as k-metamodes trained on the full data (40 samples).
However, another factor — namely the measure selected as outlier score — may affect the effectiveness of the outlier detection. In the previous work, the outliers were clustered together and each outlier from the cluster was assigned the score of the cluster center [thesis, Section 22.214.171.124]. K-metamodes outlier detection allows to reproduce this approach. Instead of taking distance from each record to all metamodes as outlier score, it is possible to assign outlier score for each record from the corresponding record’s mode. Thus, the outlier score turns into the distance from record’s mode to all metamodes. With such outlier score, k-metamodes is able to reach 98,09% AUC on the same data101010Calculation of AUC with the proposed outlier score for different number of samples would be unreasonable, since all records from all samples should be clustered in order to assign outlier score to each record. We provide ROC and Precision-Recall curves in the Figures below.
demonstrates that high values for both precision and recall can also be achieved applying k-metamodes on the KDD Cup data.
Next, we apply the same algorithm, but with new distance function proposed in the Section 2 on the same data. Since the distance from record’s mode to all metamodes as outlier score allows to achieve higher effectiveness, we stick to 40 training samples during the data analysis. Unfortunately, the utilisation of proposed distance function allows to reach AUC 97,93%, which is slightly worse than original distance function.
To check if the proposed distance function does not help to achieve higher effectiveness on other data sets as well, we applied k-metamodes on the UNSW-NB15 data. Similarly to KDD Cup data, this data set was discretised in order to convert continuous numerical values into categories. Heuristically, the sample size of 50,000 records was selected. The optimal k was determined according to[thesis, Section 4.3.6] an equals 36. The Figure 5 provides the charts for different k and corresponding cluster similarity111111
Mean cosine similarity between concept vector and each cluster record according to[thesis, Section 4.2.4]..
In Figure 5 the similarity measure is calculated for different k on two types of data subsets. First (called “data subsets”) are subsets of 50,000 records each, from the UNSW NB15 data set without shuffling. Second (called “training subsets”) are also subsets of 50,000 records each, but created by applying random sampling without replacement on the UNSW NB15 data set. For both types of subsets, optimal (equal to 36) allows to reach approximately 90% cluster similarity.
As we mentioned above, we stick to more effective outlier score, which requires to cluster all records from the dataset with k-metamodes, i.e. the number of “training” samples should always cover the data set completely, which implies usage of 22 samples taking into account the data set size of 1,087,203 records and sample size of 50,000 records per sample.
The k-metamodes with frequency-based distance function and parameters discussed above was able to achieve AUC of 94,51% on this dataset, while k-metamodes with the proposed distance function (defined in the Section 2 reaches AUC of 96,24% on the same dataset. The corresponding Receiver Operating Characteristic, as well as Precision-Recall curves are provided in the Figure below.
In Figure 6, both ROC curves for k-metamodes with frequency-based and the proposed meta-frequency-based distance functions show that the algorithm is able to achieve high true positive rate while keeping relatively low false positive rate. In turn, this proves that the k-metamodes is able to place outliers at the top of the algorithm’s output by giving higher outlier score to real outliers (true positives). The ROC curve in Figure (b)b also shows that with meta-frequency-based distance function, k-metamodes achieves True Positive Rate of more than 90% while still having False Positive Rate below 10%. Different to this, without the proposed distance function, True Positive Rate need to be around 60% in order to keep False Positive Rate below 10%, as shown in Figure (a)a.
On the other hand, both Precision-Recall charts in Figures (c)c and (d)d show that k-metamodes applied on the UNSW NB15 data is unable to achieve as high precision, as for KDD Cup data (shown in Figure 4). For both types of distance functions, the Recall / True Positive Rate of 60% or more implies relatively low precision which does not even exceed 30%. However, the precision-recall ratio is not so important for an unsupervised outlier detection algorithm as ROC and AUC, especially in security area. Even if the outlier detection has a high number of false positives, but is able to place true positives at the top of its output, it can still be effectively used for capturing outliers that will be uncaught otherwise (please see [Goldstein2016, Comparative Evaluation] and [thesis, Section 6.1] for further details).
We have also checked that increasing the number of clusters per sample, e.g. to does not allow to achieve a higher AUC value. Rather, with the proposed meta-frequency-based distance function and the measured AUC value was 95,13% (which is less than 96,24% reached with ). This fact allows us to conclude that the selected number of clusters per sample has only the minor effect on the outlier detection.
Thus, the results are twofold. On the one hand, the proposed distance function helps to increase AUC from 94,51% to 96,24% on the UNSW NB15 data set. On the other hand, on the KDD Cup 1999 data, AUC slightly decreases from 98,09% to 97,93%.
In the next section, we compare the k-metamodes with the previous work, namely Hybrid Outlier Detection [Sapegin2017, thesis] to check if k-metamodes is able to achieve higher effectiveness on both data sets.
3.1 Comparison of k-metamodes with Hybrid Outlier Detection
The Hybrid Outlier Detection represents an example of the algorithm that utilises conversion of the features into vector space in order to perform clustering and outlier detection on the categorical data, such as security log messages. Under this algorithm, the data are divided into subsets, whereas each subset undergoes one-hot encoding followed by the clustering using spherical k-means. After the clustering, the concept vectors of clusters are used as training data for the one-class SVM. Each data subset is used to train the corresponding model from the ensemble of one-class SVMs independently. During the application/testing phase, the data is divided into subsets and clustered again in order to check the concept vectors of clusters agains all one-class SVMs from the ensemble. If all models will classify the concept vector as outlier, all the records from the corresponding clusters will be considered outliers and assigned the same outlier score[Sapegin2017, thesis].
In the previous work, Hybrid Outlier Detection was applied on the KDD Cup data set and also achieved a high AUC value, as shown in Figure 7.
Figure 7 provides the measurements of AUC value for different parameters that were used for Hybrid Outlier Detection. The best AUC reached was 98.4% (with 4 training samples, k = 22 and advanced discretisation with C = 0.5), which is slightly better than the AUC reached by k-metamodes on the same dataset (98,09%, as mentioned in the previous section). The higher effectiveness of the Hybrid Outlier Detection can be explained by the fact that if k-metamodes is applied on the data sets with mixed data (containing both numerical and categorical features), only simple discretisation might be used to convert numerical features into categorical ones. For Hybrid Outlier Detection (which uses sperical k-means for clustering) it is possible to apply advanced discretisation and retain the difference between original numerical values even though they are discretised into the same category.
However, on the UNSW NB15 data set, Hybried Outlier Detection does not outperform k-metamodes, as shown in the Figure 8.
Figure 8 presents the AUC values achieved by Hybrid Outlier Detection with different number of training samples121212Similarly to application of Hybrid Outlier Detection on KDD Cup data, for UNSW NB15 data we did not calculate AUC for number of training samples 10 (corresponding to approx. half of the dataset), since the algorithm is expected to work best when trained on the subset of data [subsampling, Section 3.2]. and different discretisation types ( was heuristically selected, since this value on of the discretisation coefficient allows the algorithm to achieve the best AUC on the KDD Cup data). Although that the average AUC achieved by HOD was slightly higher for advanced discretisation, the best AUC value was achieved with simple discretisation and equals 88.17%, which is less than 96,24% reached by k-metamodes.
Thus, k-metamodes showed almost the same effectiveness on the KDD Cup data as Hybrid Outlier Detection and reached higher AUC value on the UNSW NB15 data set. Next section provides a more details overview on the effectiveness of both algorithms (including both frequency- and meta-frequency-based k-metamodes) and concludes the paper.
All three algorithms — k-metamodes with frequency-based distance function, k-metamodes with the proposed meta-frequency-based distance function and Hybrid Outlier Detection — were applied on both KDD Cup 1999 data and UNSW NB15 data and reached high AUC values, which are shown in Figure 9 below.
Figure 9 provides an overview of AUC values for all datasets and algorithms tested in this paper. On the KDD Cup 1999 data, all algorithms reach nearly the same AUC value of approximately 98%. Although the k-metamodes with the proposed meta-frequency-based distance function demonstrates the worst result on these data, the difference with other algorithms might be considered statistically insignificant. Different to this, on the UNSW NB15 data, k-metamodes outperforms Hybrid Outlier Detection, even if advanced feature discretisation was used for the last one. On this dataset, the usage of the proposed meta-frequency-based distance function allows to reach 2% higher AUC value. Even though 2% cannot be considered a significant improvement, from ROC curves provided in Figure 6 we may conclude that the usage of the proposed distance function allows to reach 90% Recall / True Positive Rate while keeping False Positive Rate as low as 10%. Without the proposed distance function the corresponding recall will be much lower, i.e. around 60%.
Thus, in this paper we proposed a novel frequency-based distance function for clustering modes into metamodes within the second step of k-metamodes algorithm. Besides this, we combined k-metamodes with feature discretisation approach from the previous work. The resulting algorithm131313The source code for both k-metamodes with the proposed distance function and Hybrid Outlier Detection is made available on the Github under MIT license [HODgithub, kmetamodesgithub]. is able to run on mixed data sets containing both numerical and categorical features and allows to reach higher recall for the same False Positive Rate. These improvements are not only relevant for outlier detection in the area of security analytics, where such mixed data sets are rather common, but also for generic Big Data processing cases.