In the information age, most human activity depends on computer networks, and our reliance on digital processes for productivity means that a system failure can cause a tremendous loss of resources and efficiency. In the case of a cyber attack, system failure could also release sensitive information to malicious actors who may do further harm. Historically, such attacks have been identified and handled by cybersecurity professionals who used manual tools to scan network traffic for suspicious activity. Emerging machine learning (ML) techniques have the potential to make these cyber watchdogs even more effective. If an ML model could autonomously and accurately identify anomalous packets traveling through a network, human security professionals would spend less time sifting through network traffic alerts and log files.
For such an ML model to be deployed, it must exhibit low rates of false positives and false negatives in validation and testing. It must also be efficient enough that organizations can run it on standard computing equipment. We set out to build an anomaly detection model for cybersecurity that meets these criteria.
In developing a suitable model for anomaly detection, we must account for an adversarial environment in which attackers may use adversarial machine learning (AML) techniques. These techniques are categorized by the adversary’s knowledge of the target model. In a white-box AML method, the attacker knows the architecture or parameters of the target, whereas black-box techniques are used when the attacker has no such information. Some defensive schemes have been proposed to harden learning models, but new AML methods are constantly being developed.
Cybersecurity professionals need novel techniques to aid them in identifying malicious attacks while simultaneously maintaining a low rate of false positives and false negatives in adversarial environments. This work explores the feasibility of some unsupervised methods, as well as a graph-based approach to anomaly detection. We also develop a supervised stacking ensemble model trained on realistic adversarial samples that maintains a high level of precision and recall.
2 Related Work
2.1 Common Approaches
ML techniques for tackling the anomaly detection problem in cybersecurity have included semi-supervised methods, deep learning models, and graph-based approaches. A common baseline approach for semi-supervised anomaly detection is the One Class Support Vector Machine (OC-SVM), which finds the hyperplane that separates the data from the origin with the greatest possible distance, using the same kernel mechanics as a traditional SVM. OC-SVM has been highly effective on noisy data sets in cybersecurity, but it struggles with run-time on data sets with high dimensionality.
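As a minimal sketch of the OC-SVM baseline, the example below fits scikit-learn's `OneClassSVM` on synthetic "normal" points only and scores a mixed test set. The data, the `nu` value, and the variable names are illustrative assumptions and are not drawn from the experiments in this paper.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): normal traffic is a Gaussian blob,
# anomalies are scattered uniformly over a much wider box.
normal_train = rng.normal(0.0, 1.0, size=(500, 4))
normal_test = rng.normal(0.0, 1.0, size=(200, 4))
anomalies = rng.uniform(-8.0, 8.0, size=(20, 4))

# Semi-supervised setting: the model only ever sees normal samples.
model = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(normal_train)

X_test = np.vstack([normal_test, anomalies])
y_test = np.r_[np.zeros(200), np.ones(20)]

# decision_function is positive inside the learned region, so negate it
# to obtain a score where larger means more anomalous.
anomaly_score = -model.decision_function(X_test)
auc = roc_auc_score(y_test, anomaly_score)
print(f"OC-SVM AUC on synthetic data: {auc:.3f}")
```

Because the kernel matrix grows with the number of training samples, this approach becomes expensive on large, high-dimensional traffic captures, which is the run-time limitation noted above.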
As network traffic data is generally high-dimensional, deep learning approaches have supplanted OC-SVM as the industry standard. A primary deep learning technique for anomaly detection in cybersecurity is the autoencoder (AE), where a neural network is tasked with reconstructing network traffic patterns, after having passed them through a series of ever-smaller hidden layers. Its central layer is a reduced-dimensions representation of normal traffic, since it is trained exclusively on normal traffic. When the network is fully trained, the instances that are still inaccurately reconstructed by the AE are considered to be anomalous.
The most elaborate implementation of the AE in cybersecurity is the variational autoencoder (VAE), a directed graph-based probabilistic model that leans on an AE’s central layer for learning model parameters. Some researchers are skeptical that AEs are any more effective than the classic OC-SVM, but the academic consensus is that AEs can be a useful tool, especially in conjunction with more classic methods.
Two additional unsupervised ML methods for anomaly detection are relevant. Isolation forests exploit a decision tree-like architecture to uncover anomalies, assuming that those anomalies would be most easily split from the rest of the data (and thus closer to the root of the decision tree). Local outlier factor is a common density-based unsupervised learning method, which identifies anomalies as points that have significantly lower density than neighboring observations.
Anomaly detection can also be performed by modeling computer networks as graphs. Although anomaly detection is a well-researched problem, the vast majority of prior approaches have treated networks as static graphs. Static graph-based methods have severe limitations, as they fail to capture the temporal characteristics of emerging anomalies. Microcluster-based Detector of Anomalies in Edge Streams (MIDAS) is a novel anomaly detection technique that uses dynamic graphs to detect micro-cluster anomalies in the network. MIDAS scans network traffic to find sudden groups of suspiciously similar edges in dynamic graphs. MIDAS has relatively robust predictive power, and its chief advantage is detecting anomalies quickly in real time.
2.2 Adversarial Attacks
Malicious actors will attempt to circumvent the network intrusion detection system via adversarial attacks, often employing AML techniques. One such AML technique for evasion attacks is the Fast Gradient Sign Method (FGSM). With FGSM, adversarial examples are generated according to Equation 1:

$$x^{adv} = x + \epsilon \cdot \mathrm{sign}\left(\nabla_x J(\theta, x, y)\right) \tag{1}$$

where $\epsilon$ is a scale parameter, $J$ is the loss function, $\theta$ are the model parameters, $x$ are the inputs (features), and $y$ are the targets (labels). Another effective technique is the Carlini-Wagner Attack, which is formulated as a minimization problem primarily to construct adversarial image samples. Adversaries have many ways to attack ML models, but we focus especially on evasion attacks. Evasion attacks entail supplying a perturbed sample to a trained model with the goal of having that sample misclassified. In cybersecurity, this means an adversary would perturb malicious traffic to mask it as normal.
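To make the FGSM update concrete, the following sketch applies it to a small logistic-regression model, for which the input gradient of the cross-entropy loss has a closed form. The weights, samples, and step size are hypothetical values chosen for illustration; this is not the perturbation procedure used later in this paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(x, y, w, b, eps):
    """Return x + eps * sign(grad_x J) for a logistic-regression model.

    J is the binary cross-entropy loss; its gradient with respect to the
    input x is (sigmoid(w.x + b) - y) * w.
    """
    grad_x = (sigmoid(x @ w + b) - y)[:, None] * w[None, :]
    return x + eps * np.sign(grad_x)

# Hypothetical model parameters and two malicious samples (label 1).
w = np.array([2.0, -1.0])
b = 0.0
x = np.array([[1.0, 0.5], [0.2, -0.3]])
y = np.array([1.0, 1.0])

x_adv = fgsm(x, y, w, b, eps=0.1)

# Increasing the loss for label 1 lowers the predicted malicious probability.
print(sigmoid(x @ w + b), sigmoid(x_adv @ w + b))
```

Each feature moves by exactly `eps` in the direction that increases the loss, so a malicious sample is pushed toward the "normal" side of the decision boundary.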
Defending against these adversarial evasion attacks has received widespread attention in recent years. One approach uses adversarial training with an ensemble-based stacking method. This method has two levels: the first stack consists of multiple classifiers whose output is a new feature matrix of the predicted labels from each classifier, and the second stack is a single classifier that outputs a final prediction. The first stack classifiers are trained on an original training set, as well as some adversarial training sets. Classifiers in the first stack are chosen based on their performance against an unchanged test set. This approach served as a primary motivator for our own ensemble method.
We briefly describe the cybersecurity data set used and pre-processing performed, and give an overview of the unsupervised and graph-based methods explored for anomaly detection. We also describe the process for developing the adversarial training sets, and the subsequent ensemble approach for adversarial training based supervised learning.
3.1 Data Set
We used the UNSW-NB15 data set synthesized by the Australian Center for Cybersecurity. It contains 2.5 million observations of packets traveling through a computer network, where each observation is labelled as normal traffic or attack traffic. The set contains 321,283 malicious observations and 2.2 million normal observations. Many of the attacks were artificially injected into the real traffic to make the entire data set more representative of all the types of attacks an anomaly detection system could encounter.
UNSW-NB15 contains 47 features of which 42 are numerical and 5 are nominal. In addition to these 47 features, the data included two labels: the attack category for malicious traffic, and a binary label for normal (0) and malicious (1) traffic. These features described the characteristics of the packet flow, the quantity of information contained in each packet, and the broad features of the packet content, among other characteristics. There are 12 pre-engineered features included as part of the 47 features provided in the data set.
We first examined the data set to check for unrealistic entries. For example, we found a few rows whose “source port” value did not fall in the range of valid port values and removed them from the data set. We also removed the pre-engineered features, as we worked under the assumption that our ML model only had intrinsic network data with no pre-determined logic applied. We then encoded the nominal features as pre-processing for modeling. After exploring feature importance with a chi-squared ($\chi^2$) test and a Mutual Information test, we dropped several of the encoded features that were not in the top five results during importance testing. We then examined a correlation matrix between features, which allowed us to drop any highly correlated features (correlation greater than 0.85). Through this process, we reduced the dimensionality of the data set from 49 to 21 features. Using these 21 features, we divided the data into training and test sets to evaluate our trained ML classifiers. We refer to the original, unchanged training set as the original training set. It is important to note that the test set remains unchanged (i.e., it is not perturbed as part of the adversarial training) in order to provide a standard basis for ML model evaluation.
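The feature-scoring and correlation-filtering steps can be sketched as follows with scikit-learn and NumPy. The synthetic features, the greedy `drop_correlated` helper, and its keep-first tie-breaking rule are assumptions made for illustration; only the 0.85 threshold is taken from the text.

```python
import numpy as np
from sklearn.feature_selection import chi2, mutual_info_classif

rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, size=n)

# Hypothetical non-negative features (chi2 requires non-negative inputs):
# f0 is strongly informative, f1 is noise, f2 is weakly informative,
# and f3 is a near-duplicate of f0.
f0 = y * 3.0 + rng.random(n)
f1 = rng.random(n)
f2 = y * 0.5 + rng.random(n)
f3 = f0 + 0.01 * rng.random(n)
X = np.column_stack([f0, f1, f2, f3])

# Score each feature against the binary label.
chi2_scores, _ = chi2(X, y)
mi_scores = mutual_info_classif(X, y, random_state=0)

def drop_correlated(X, threshold=0.85):
    """Greedily keep a column only if its absolute correlation with every
    previously kept column is at most `threshold`."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(X.shape[1]):
        if all(corr[j, k] <= threshold for k in keep):
            keep.append(j)
    return keep

kept = drop_correlated(X)
print("chi2:", chi2_scores.round(1), "MI:", mi_scores.round(2), "kept:", kept)
```

In this toy setup the near-duplicate column is removed while the noise column survives the correlation filter; it would instead be removed by the importance ranking.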
3.3 Unsupervised Learning and Graph-based Methods
We first experimented with unsupervised and graph-based methods for anomaly detection using the UNSW-NB15 data set. For the unsupervised learning, we implemented both the isolation forest and local outlier factor methods. An isolation forest selects a random feature and a random split value between that feature’s maximum and minimum values. The algorithm repeats this process to build an isolation tree, and anomalous samples are those with a shorter path length in the tree. The local outlier factor method, on the other hand, measures the deviation of the samples with respect to their neighbors. Samples with lower density than their neighbors are considered outliers. These techniques helped us determine whether the anomalous cluster could be separated from the normal instances in an unassisted yet effective manner.
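A minimal sketch of both unsupervised detectors using scikit-learn is shown below. The synthetic data is an assumption and is far easier to separate than real network traffic, so the scores it produces say nothing about performance on UNSW-NB15.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): a dense normal cluster plus
# a few widely scattered anomalies.
normal = rng.normal(0.0, 1.0, size=(500, 4))
anomalies = rng.uniform(-8.0, 8.0, size=(25, 4))
X = np.vstack([normal, anomalies])
y = np.r_[np.zeros(500), np.ones(25)]

# Isolation forest: score_samples is lower for anomalies, so negate it.
iso = IsolationForest(n_estimators=100, random_state=0).fit(X)
iso_scores = -iso.score_samples(X)

# Local outlier factor: negative_outlier_factor_ is more negative for
# points whose local density is lower than that of their neighbors.
lof = LocalOutlierFactor(n_neighbors=20)
lof.fit(X)
lof_scores = -lof.negative_outlier_factor_

iso_auc = roc_auc_score(y, iso_scores)
lof_auc = roc_auc_score(y, lof_scores)
print(f"isolation forest AUC={iso_auc:.3f}, LOF AUC={lof_auc:.3f}")
```

Both detectors rely on anomalies being sparse or isolated; when attack traffic forms its own dense region, neither assumption holds and the scores degrade.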
We then analyzed the data with MIDAS, a graph-based approach to anomaly detection. To do this, we extracted the following features from the original UNSW-NB15 data set: timestamp, source IP address, and destination IP address. We sorted the data set in ascending chronological order by timestamp and ran the MIDAS algorithm. The MIDAS algorithm takes as input a stream of graph edges over time using the features described above (e.g., the source and destination IPs are the nodes, and a sample at time $t$ provides the edge). For efficiency, the state of the graph is stored in Count-Min-Sketch (CMS) data structures that count the number of edges between nodes. There are two such CMS structures. The first maintains a count $s_{uv}$, the total number of edges over time between nodes $u$ and $v$. The second maintains $a_{uv}$, the number of edges between $u$ and $v$ at the current time tick. The primary difference is that $s_{uv}$ is maintained throughout while $a_{uv}$ is reset when we move forward to the next time tick. These structures can be queried for the approximate edge counts $\hat{s}_{uv}$ and $\hat{a}_{uv}$. As new edges between nodes $u$ and $v$ are provided to MIDAS, an anomaly score at time $t$ is output according to Equation 2:

$$\mathrm{score}(u, v, t) = \left(\hat{a}_{uv} - \frac{\hat{s}_{uv}}{t}\right)^2 \frac{t^2}{\hat{s}_{uv}\,(t - 1)} \tag{2}$$

Further details of this algorithm can be found in the original MIDAS paper.
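The streaming procedure can be sketched in plain Python as below. This is a simplified illustration rather than the implementation used in our experiments: the `CountMinSketch` class, the CRC32-based hashing, the bucket sizes, and the example IP addresses are all assumptions. The score computed is the chi-squared-style MIDAS statistic, (a - s/t)^2 * t^2 / (s * (t - 1)), where a is the current-tick count and s the running total for the edge.

```python
import zlib

class CountMinSketch:
    """Tiny Count-Min Sketch keyed by strings (illustrative sizes)."""

    def __init__(self, rows=2, buckets=1024):
        self.rows, self.buckets = rows, buckets
        self.table = [[0.0] * buckets for _ in range(rows)]

    def _index(self, key, row):
        # Salt the key with the row number to get one hash per row.
        return zlib.crc32(f"{row}:{key}".encode()) % self.buckets

    def add(self, key, count=1.0):
        for r in range(self.rows):
            self.table[r][self._index(key, r)] += count

    def query(self, key):
        # The minimum over rows is the standard CMS estimate.
        return min(self.table[r][self._index(key, r)] for r in range(self.rows))

    def clear(self):
        for row in self.table:
            for i in range(len(row)):
                row[i] = 0.0

def midas_scores(edges):
    """Score a stream of (u, v, t) edges; t is a positive integer tick
    and the stream is assumed to be sorted by t."""
    total, current = CountMinSketch(), CountMinSketch()
    current_tick, scores = 1, []
    for u, v, t in edges:
        if t > current_tick:
            current.clear()  # current-tick counts reset on a new tick
            current_tick = t
        key = f"{u}->{v}"
        total.add(key)
        current.add(key)
        a, s = current.query(key), total.query(key)
        if t == 1 or s == 0:
            scores.append(0.0)  # score is undefined at the first tick
        else:
            scores.append((a - s / t) ** 2 * t ** 2 / (s * (t - 1)))
    return scores

# One edge per tick for five ticks, then a burst of 20 edges at tick 6.
stream = [("10.0.0.1", "10.0.0.2", t) for t in range(1, 6)]
stream += [("10.0.0.1", "10.0.0.2", 6)] * 20
scores = midas_scores(stream)
print(round(scores[-1], 1))
```

The steady edge rate yields scores of zero, while the sudden burst drives the score up as the current-tick count departs from the historical mean rate.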
Due to the relative lack of literature related to adversarial examples for unsupervised learning, our goal in experimenting with these methods was strictly focused on the anomaly detection task in a non-adversarial environment. Likewise, we focused on non-adversarial anomaly detection for MIDAS since it only used specific features to assign anomaly scores.
3.4 Development of Adversarial Training Sets
In preparing to perform supervised learning for anomaly detection using the UNSW-NB15 data set, we sought to train models that are robust against adversarial evasion attacks at inference time as part of a network intrusion detection system. We therefore used adversarial training as part of an ensemble approach, which required the generation of adversarial training sets. We generated these sets according to realistic methods, assuming the adversary could influence (i.e., perturb) approximately 20% of the network traffic. We employed two different approaches, both with the overarching goal of increasing the number of false negatives; that is, we wanted to confuse the classifiers so that malicious samples would be classified as normal. The first approach was inspired by FGSM. This method, which in our case considers a 0-1 loss function, uses a linear discriminant analysis (LDA) decision function to determine the direction in which each sample’s features are perturbed. The goal is essentially to shift samples across the decision boundary to confuse the classifier. The algorithm used can be seen in Algorithm 1. We refer to the resulting training set as the LDA-FGSM training set.
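One way such an LDA-guided, FGSM-style perturbation could look is sketched below. It is an assumption-laden illustration and not a reproduction of Algorithm 1: the step size `eps`, the random choice of which malicious samples to perturb, and the use of the sign of the LDA coefficients as the perturbation direction are all hypothetical choices.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): class 0 = normal, class 1 = malicious.
X = np.vstack([rng.normal(0, 1, (300, 3)), rng.normal(2, 1, (100, 3))])
y = np.r_[np.zeros(300, dtype=int), np.ones(100, dtype=int)]

lda = LinearDiscriminantAnalysis().fit(X, y)

# The LDA decision function is linear, w.x + b, so its gradient with
# respect to x is w; we only use the sign, as in FGSM.
direction = np.sign(lda.coef_[0])

def perturb_malicious(X, y, direction, eps, fraction, rng):
    """Shift a subset of malicious samples toward the normal side.

    `fraction` is the share of ALL traffic the adversary may perturb.
    """
    X_adv = X.copy()
    malicious = np.flatnonzero(y == 1)
    n_perturb = min(len(malicious), int(fraction * len(X)))
    chosen = rng.choice(malicious, size=n_perturb, replace=False)
    X_adv[chosen] -= eps * direction
    return X_adv, chosen

X_adv, chosen = perturb_malicious(X, y, direction, eps=1.0, fraction=0.2, rng=rng)

before = lda.decision_function(X[chosen]).mean()
after = lda.decision_function(X_adv[chosen]).mean()
print(f"mean decision value: {before:.2f} -> {after:.2f}")
```

Labels are left unchanged, so a classifier trained on the perturbed set sees malicious samples that sit closer to, or across, the original decision boundary.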
The next adversarial training set we developed rested on the concept that some features held more importance to our model. The importance of certain features was determined previously using the chi-squared ($\chi^2$) test and the Mutual Information test, and we then examined those results for features that an adversary could realistically control. For example, an adversary could reasonably set the source time-to-live for a given connection or fix the number of bytes sent from a source during that connection. The idea is then to perturb the malicious samples by a fixed amount such that the feature mean for the malicious samples closely approximates that of the normal samples, as seen in Algorithm 2. We refer to this final training set as the feature-importance training set.
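The mean-matching idea can be illustrated with the short NumPy sketch below. It is a simplified assumption of how such a shift might be applied and is not Algorithm 2 itself; the function name, the toy feature matrix, and the choice of which column counts as adversary-controllable are hypothetical.

```python
import numpy as np

def shift_feature_means(X, y, controllable):
    """Shift each controllable feature of the malicious samples (y == 1)
    by a fixed offset so its mean matches the normal samples' mean."""
    X_adv = X.astype(float).copy()
    malicious, normal = (y == 1), (y == 0)
    for j in controllable:
        delta = X[normal, j].mean() - X[malicious, j].mean()
        X_adv[malicious, j] += delta
    return X_adv

# Toy data: column 0 plays the role of a controllable feature
# (for example, bytes sent); column 1 is left untouched.
X = np.array([[10.0, 1.0], [12.0, 1.0], [30.0, 5.0], [34.0, 7.0]])
y = np.array([0, 0, 1, 1])

X_adv = shift_feature_means(X, y, controllable=[0])
print(X_adv[:, 0])
```

Because every malicious sample receives the same offset, the within-class spread of the feature is preserved while its class mean no longer separates malicious from normal traffic.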
With the training sets defined, we then standardized our training and test feature matrices. Since we wanted to build supervised ML models on the original and adversarial training sets, the fixed test set was standardized according to the training set used for each iteration.
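A short sketch of this per-training-set standardization with scikit-learn's `StandardScaler` follows. The tiny arrays and the dictionary keys are placeholders chosen for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Placeholder data (assumption): one fixed test set, two training sets.
X_test = np.array([[1.0, 10.0], [3.0, 30.0]])
training_sets = {
    "original": np.array([[0.0, 0.0], [2.0, 20.0], [4.0, 40.0]]),
    "perturbed": np.array([[1.0, 5.0], [2.0, 20.0], [3.0, 35.0]]),
}

scaled = {}
for name, X_train in training_sets.items():
    # Fit the scaler on the training set only, then apply the same
    # transformation to the unchanged test set.
    scaler = StandardScaler().fit(X_train)
    scaled[name] = (scaler.transform(X_train), scaler.transform(X_test))

for name, (_, X_test_scaled) in scaled.items():
    print(name, X_test_scaled.round(3).tolist())
```

The raw test set never changes, but its standardized form differs per iteration because each scaler learns the mean and variance of a different training set.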
3.5 The Ensemble Approach for Supervised Learning
The ensemble approach for adversarial training based supervised learning for anomaly detection was inspired by the adversarial-training stacking approach described in Section 2.2. For our level one stack, we first experimented with several common classifiers trained with each data set. We then chose the best classifier for each data set (the original, LDA-FGSM, and feature-importance training sets) and optimized the classifiers’ hyper-parameters. Each classifier was tuned on the data set against which it performed best. Next, we re-trained the tuned classifiers and passed each classifier’s class prediction probabilities as features to several level two classifiers. In this way, we utilized a soft-voting stacking model. Our level two stack then ideally consists of the single best classifier across all three data sets based on the new feature matrix. The primary metrics used to evaluate our classifiers were time-to-train and F1-score. We chose F1-score because of the imbalance between normal and malicious samples. Since our primary aim was successfully identifying malicious samples, we focused on the F1-score related to the malicious samples. We also considered the Area Under Curve (AUC) metric when results were inconclusive.
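The two-level structure can be sketched with scikit-learn as follows. For brevity, this illustration trains all level one classifiers on a single synthetic data set rather than on separate original and adversarial training sets, and it uses untuned hyper-parameters; both simplifications are assumptions of the sketch.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Imbalanced synthetic data (assumption): roughly 15% "malicious" samples.
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.85], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Level one: several base classifiers.
level_one = [
    LogisticRegression(max_iter=1000),
    DecisionTreeClassifier(max_depth=5, random_state=0),
    LinearDiscriminantAnalysis(),
]
for clf in level_one:
    clf.fit(X_train, y_train)

def meta_features(models, X):
    """Stack each model's predicted probability of the malicious class."""
    return np.column_stack([m.predict_proba(X)[:, 1] for m in models])

# Level two: a single classifier trained on the stacked probabilities.
level_two = GaussianNB().fit(meta_features(level_one, X_train), y_train)
y_pred = level_two.predict(meta_features(level_one, X_test))

f1_malicious = f1_score(y_test, y_pred, pos_label=1)
print(f"F1 (malicious class): {f1_malicious:.3f}")
```

Passing probabilities rather than hard labels lets the level two classifier weigh how confident each base model is. scikit-learn also offers `StackingClassifier`, which builds the level two features from cross-validated predictions.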
4 Computational Experimentation
Our experiments relied on scikit-learn (sklearn) implementations in Python for feature selection, pre-processing, and modeling. The feature-selection module contains chi-squared ($\chi^2$) and mutual information tests, each of which takes as input a feature matrix $X$ and label vector $y$. The $\chi^2$ test computes the $\chi^2$ statistic between each feature and class, while the mutual information test estimates the dependency between variables.
For the unsupervised learning methods, the ensemble module of sklearn provides an isolation forest implementation that produces an anomaly score for a feature matrix based on the extra tree regressor. The local outlier factor method is implemented through the neighbors module and also provides an anomaly score. The graph-based MIDAS approach, however, relied on a custom class developed by the authors of MIDAS and available on GitHub.
The diverse supervised learning methods provided by sklearn offer several classifiers to choose from. First, however, we generated the adversarial training sets using custom implementations of Algorithm 1 and Algorithm 2. We applied StandardScaler from the sklearn pre-processing package to standardize training and test set features before building classifiers. We used seven different classifiers drawn from different sklearn modules: LDA, Quadratic Discriminant Analysis (QDA), Gaussian Naive Bayes, Bagging (with a Decision Tree base classifier), Decision Tree, Random Forest, and Logistic Regression. For Random Forest, we set the n_jobs parameter to use all available processing cores to increase performance. Once classifiers were selected for our level one stack, we optimized the classifier hyper-parameters using the grid search cross-validation method.
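Grid search cross-validation for one level one classifier might look like the sketch below. The parameter grid, the three-fold setting, and the synthetic data are illustrative assumptions; the grids actually searched are not listed in this paper.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Imbalanced synthetic data (assumption) standing in for a training set.
X, y = make_classification(n_samples=500, n_features=8,
                           weights=[0.85], random_state=0)

# Hypothetical hyper-parameter grid for a Decision Tree.
param_grid = {"max_depth": [3, 5, None], "min_samples_leaf": [1, 5]}

# scoring="f1" evaluates the F1-score of the positive (malicious) class.
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, scoring="f1", cv=3)
search.fit(X, y)

best = search.best_params_
print(best, round(search.best_score_, 3))
```

`GridSearchCV` also accepts an `n_jobs` argument, so the candidate fits can be spread across processing cores in the same way as the Random Forest training mentioned above.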
5 Results and Discussion
We will briefly describe the experimental results of the implemented unsupervised, graph-based, and supervised methods for anomaly detection using the UNSW-NB15 data set.
5.1 Unsupervised Learning and Graph-based Methods
As part of the computational experimentation, we tested both the isolation forest and local outlier factor techniques as baseline unsupervised methods. Unfortunately, the results fell short of our expectations, as both methods achieved an AUC score of approximately 0.57 and a recall of less than 25%. In other words, less than a quarter of the relevant observations were correctly identified, which is not operationally useful. With these baseline metrics established, we experimented with the MIDAS technique, running the algorithm on the subset of the network traffic using the domain name system protocol. Due to MIDAS’ model assumptions, we were unable to run it on observations using other protocols. MIDAS achieved a final AUC score of 0.74, and the algorithm took only 100 seconds to process approximately 800,000 rows of data. We concluded that graph-based MIDAS was a superior approach compared to the two unsupervised methods, given its extremely fast run-time and better AUC.
5.2 Stacking Ensemble Based Supervised Learning
Subsequently, we implemented the stacking ensemble model using various supervised learning classifiers. In order to select our level one stack, we first trained and tested a variety of classifiers against our original and adversarial training sets. The results for F1-score are displayed in Table 1, where the F1-scores are reported separately for each class in the format of [F1-Normal, F1-Malicious].
Table 1: F1-scores of the candidate level one classifiers, reported as [F1-Normal, F1-Malicious]. Each column corresponds to one of the three training sets.
| LDA | [0.98747567, 0.91473469] | [0.98735051, 0.91400748] | [0.98745405, 0.91463265] |
| QDA | [0.98974236, 0.93382849] | [0.98973623, 0.93377771] | [0.94055049, 0.71915339] |
| Naive Bayes | [0.96716175, 0.81693655] | [0.9809481, 0.88403678] | [0.93100809, 0.68796041] |
| Bagging | [0.98974958, 0.93435551] | [0.99064169, 0.93963535] | [0.98974315, 0.93431738] |
| Decision Tree | [0.98974958, 0.93435551] | [0.99064169, 0.93963535] | [0.98974315, 0.93431738] |
| Random Forest | [0.98855853, 0.927386] | [0.98905139, 0.92869987] | [0.93545834, 0.70446009] |
| Logistic Regression | [0.99080788, 0.93775624] | [0.99037449, 0.93461133] | [0.9905612, 0.93598927] |
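The [F1-Normal, F1-Malicious] pairs reported in the tables can be computed with scikit-learn by requesting per-class scores. The labels below are a made-up toy example, not experimental output.

```python
from sklearn.metrics import f1_score

# Toy labels (assumption): 0 = normal, 1 = malicious.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

# average=None returns one F1-score per class, in the order of `labels`.
f1_normal, f1_malicious = f1_score(y_true, y_pred, average=None, labels=[0, 1])
print([round(f1_normal, 4), round(f1_malicious, 4)])
```

Reporting the malicious-class score separately matters on imbalanced traffic, because a classifier can score well on the majority class while missing many attacks.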
From this initial experimentation, the following classifiers performed best for level one, one for each of the three training sets: Logistic Regression, Decision Tree, and LDA. We chose Decision Tree over Bagging primarily due to its faster run-time. We then used the grid search method for hyper-parameter optimization and determined the optimal parameters based on F1-score for these methods. Next, we re-trained our level one stack consisting of these tuned classifiers and passed the new feature matrix to several classifiers to help us determine an optimal level two classifier. The results for F1-score and training time are depicted in Tables 2 and 3, respectively. Since we clearly did not have a consensus “best” classifier among those tested for level two, we also considered the AUC metric; these results are shown in Table 4.
Table 2: F1-scores of the candidate level two classifiers, reported as [F1-Normal, F1-Malicious]. Each column corresponds to one of the three training sets.
| LDA | [0.99728165, 0.98121153] | [0.99715777, 0.98032807] | [0.99691815, 0.978665] |
| QDA | [0.99728347, 0.98122352] | [0.99715867, 0.98033419] | [0.99691905, 0.97867111] |
| Naive Bayes | [0.99728347, 0.98122352] | [0.99715867, 0.98033419] | [0.99691905, 0.97867111] |
| Bagging | [0.99728347, 0.98122329] | [0.99715688, 0.98032147] | [0.99691726, 0.9786581] |
| Decision Tree | [0.99728165, 0.98121153] | [0.99715777, 0.98032807] | [0.99691815, 0.978665] |
| Random Forest | [0.99728347, 0.98122352] | [0.99715867, 0.98033419] | [0.99691905, 0.97867111] |
| Logistic Regression | [0.99728437, 0.98122964] | [0.99715867, 0.98033419] | [0.99692086, 0.97868332] |
From these classification model performance metrics, there are several viable options for the level two classifier. Since all results are relatively good, we recommend using Naive Bayes if computational efficiency is a priority. Otherwise, we recommend using a Decision Tree, as its AUC score was highest for two of the three training sets and remained high for the third; it is also a highly interpretable model.
Our results suggest that a stacking ensemble approach for supervised learning with LDA FGSM and feature importance perturbation methods used for adversarial training could be highly effective in detecting anomalies in the network intrusion detection setting, even if malicious actors are using AML to conduct evasion attacks against the model. With AUC scores of over 0.98 and total training times of less than one minute, our model could certainly be useful to cybersecurity professionals, as long as they are able to collect data in a similar format to our data set.
There is also room for improving the ensemble approach. We did not attempt to optimize the hyper-parameters of the level two classifiers, so improvements could be made to the F1-score. Also, we did not attempt training the models on smaller amounts of data. It is possible that we will not sacrifice too much accuracy by training on a smaller number of observations, which could improve run-time greatly. Additionally, we could expand the stack to three levels. This would allow us to build out the level two stack with multiple classifiers that then pass prediction probabilities to a third level for a final classification. This approach would, of course, decrease overall performance in terms of run-time. Depending on the effectiveness of other adversarial methods, this trade-off may be worthwhile.
In future experimentation, we will explore other ways to test our methods. Since we did not consider adversarial training samples for our unsupervised approaches, future work should consider novel approaches for generating such samples. This would also allow for a more direct comparison between the unsupervised and supervised methods. The adversarial training methods we did incorporate, however, are certainly not the only approaches that could be used. We could also consider other methods, such as adapting the Carlini-Wagner attack to our data set, or leverage other techniques from evolutionary computation and deep learning to generate adversarial examples as part of the adversarial training mechanism. This might reveal other strengths and weaknesses of classifiers within the stack. We also plan to incorporate adversarial examples into the training of the unsupervised learning methods, as has recently been done for unsupervised machine learning models.
Another recommendation for future work is to incorporate the MIDAS graph-based approach into the ensemble model, perhaps leveraging other graph mining techniques such as graph neural networks. This would require adjusting the MIDAS algorithm to incorporate all types of network traffic, and its output could be used to generate a new feature for classification purposes. We should also consider different types of malicious behavior rather than just a 0-1 classification model. In a real-world setting, professionals may prioritize different types of malicious activity to focus their efforts. This means we should consider a model that classifies these different attacks once they are identified as anomalous. Finally, we believe that testing our model through implementation on a closed network would help validate our approach. We recommend using commonly available hardware, such as a Raspberry Pi, to examine model feasibility on cost-effective platforms.
This work was funded in part by the U.S. Army Combat Capabilities Development Command (DEVCOM) C5ISR Center under Support Agreement No. USMA21056. The views expressed in this paper are those of the authors and do not reflect the official policy or position of the United States Military Academy, the United States Army, the Department of Defense, or the United States Government.
- (2016) A survey of network anomaly detection techniques. Journal of Network and Computer Applications 60, pp. 19–31.
- (2020) Adversarial machine learning in network intrusion detection systems. CoRR abs/2004.11898.
- (2015-12) Variational autoencoder based anomaly detection using reconstruction probability. Technical report 2015-2, Special Lecture on IE, SNU Data Mining Center.
- (2020) MIDAS: microcluster-based detector of anomalies in edge streams. Proceedings of the AAAI Conference on Artificial Intelligence 34 (04), pp. 3242–3249.
- (2021) MIDAS. GitHub. Note: https://github.com/Stream-AD/MIDAS
- (2000-05) LOF: identifying density-based local outliers. SIGMOD Rec. 29 (2), pp. 93–104.
- (2017) Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 39–57.
- (2021) An adversarial training based machine learning approach to malware classification under adversarial conditions. In Proceedings of the 54th Hawaii International Conference on System Sciences, pp. 827–836.
- (2017) Support vector machine for network intrusion and cyber-attack detection. In 2017 Sensor Signal Processing for Defence Conference (SSPD), pp. 1–5.
- (2015) Explaining and harnessing adversarial examples. In International Conference on Learning Representations.
- (2021) Adversarial examples for unsupervised machine learning models. CoRR abs/2103.01895.
- (2019) Techniques for adversarial examples threatening the safety of artificial intelligence based systems. CoRR abs/1910.06907.
- (2008) Isolation forest. In 2008 Eighth IEEE International Conference on Data Mining, pp. 413–422.
- (2015) UNSW-NB15: a comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set). In 2015 Military Communications and Information Systems Conference (MilCIS), pp. 1–6.
- (2016) The evaluation of network anomaly detection systems: statistical analysis of the UNSW-NB15 data set and the comparison with the KDD99 data set. Information Security Journal: A Global Perspective 25 (1-3), pp. 18–31.
- (2018) Scalable and interpretable one-class SVMs with deep learning and random Fourier features.
- (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830.
- (2017) An improved data anomaly detection method based on isolation forest. In 2017 10th International Symposium on Computational Intelligence and Design (ISCID), Vol. 2, pp. 287–291.
- (2013) A hierarchical framework using approximated local outlier factor for efficient anomaly detection. Procedia Computer Science 19, pp. 1174–1181. Note: The 4th International Conference on Ambient Systems, Networks and Technologies (ANT 2013), the 3rd International Conference on Sustainable Energy Information Technology (SEIT-2013).
- (2017) Autoencoder-based feature learning for cyber security applications. In 2017 International Joint Conference on Neural Networks (IJCNN), pp. 3854–3861.