A wide spectrum of threats, ranging from opportunistic malicious activities to sophisticated nation-sponsored campaigns, threatens organizations in industry, academia, and government. These attacks usually result in the loss of important information and affect consumers and businesses alike. Notable examples are the Equifax data breach in 2017 and the Anthem healthcare campaign in 2015, which compromised personal financial and medical records of millions of US citizens.
To date, most enterprises deploy many security controls in their environments and apply best practices (such as patching vulnerable systems, using threat intelligence services, and endpoint scanning) to protect against cyber threats. Monitoring tools are deployed in most organizations either on the network (e.g., network intrusion-detection systems, web proxies, firewalls) or on the end hosts (e.g., anti-virus software, endpoint agents). With the availability of security logs collected by large enterprises, machine learning (ML) has become an important defensive tool in the face of increasingly sophisticated cyber attacks. ML techniques applied to network data include systems for detecting malicious domains (e.g., [1, 5, 2]), methods for detecting malware delivery (e.g., ) or command-and-control communication [4, 11, 8, 12], techniques for detecting malicious web pages (e.g., ), and various industry products for enterprise threat detection (e.g., [13, 6, 10, 7, 16]).
ML has a lot of potential in shortening the malware detection cycle, but these algorithms tend to come with a number of shortcomings. In particular, Sommer and Paxson  highlighted the difficulties of using ML in operational settings for cyber security. The main limitations they identified were: (1) ML excels at supervised tasks by learning from labeled examples, while in cyber security most of the data is unlabeled. (2) ML errors (and in particular false positives) have high cost as alerts need to be investigated by security analysts. (3) Network traffic exhibits high diversity under normal operating conditions. (4) Performing sound evaluations is usually challenging due to unavailability of standard benchmark datasets.
In this paper, we describe some concrete guidelines and recommendations for using supervised ML in cyber security. As a case study, we consider the problem of botnet detection from network traffic data. We leverage a public dataset (CTU-13) which includes network traffic collected from a university campus and attacks launched on the university network. Among our findings, we highlight the following:
Feature representations should take into consideration the specifics of the attacks. Among standard feature representations, we compare connection-level features (extracted directly from Bro logs) with aggregated traffic statistics and temporal features (using fixed time windows).
Class imbalance is a major issue that hinders the performance of simple linear models such as logistic regression.
The granularity of data labeling (ground truth) can impact the classification metrics substantially. If available, ground truth obtained at the level of individual network connections can boost the performance of supervised ML models.
2 Background and Threat Model
2.1 Machine Learning for Network Traffic Classification
Network Intrusion Detection is a highly active area of research. Traditional systems such as Snort are based on manually-generated rules for detecting well-known malware variants.
Recently, ML has proven to be valuable in augmenting rule-based systems. ML has the potential of detecting more advanced malicious activities that evade rule-based systems. Successful applications of ML to various types of network data for malware detection include:
Bro is an open-source network monitoring agent that collects a number of network logs. Here we leverage the Bro connection logs, which record the fields included in Figure 1. These include the TCP connection timestamp, duration, source IP and port, destination IP and port, number of packets sent and received, number of bytes sent and received, and connection state. For UDP, an entry is generated for every UDP packet (as there do not exist connections over UDP).
2.2 Problem statement and threat model
ML algorithms have demonstrated success in network traffic classification tasks for detecting botnets or malicious domains. However, most ML methods are designed in an ad-hoc manner and guidelines for principled approaches in this space are currently missing. We are interested in filling this gap and providing recommendations on several general principles that should guide ML design for botnet and malware detection. We are specifically addressing the problem of detecting botnets from network logs (as generated by Bro logs), but our methods can be used with other network data types (such as NetFlow, pcap, firewalls). Some of the research questions we would like to answer are the following:
Can raw network data be used effectively in an ML algorithm?
Which feature representations are most appropriate for applying ML classification algorithms?
Which classifiers achieve best performance in handling the largely imbalanced cyber-security datasets?
What is the impact of labeling the data for ground truth generation?
We assume that the monitoring agent, which collects the network data, is not under the attacker’s control. We also assume that the attacker cannot tamper with the collected network logs. Therefore, attackers do not have access to the storage device where data is recorded. (Attackers with access to the monitoring environment and the system logs are much more powerful, and are beyond our current scope.)
3 Case Study on ML for Botnet Detection
We leverage a dataset of botnet traffic that was captured in 2011 at the CTU University in the Czech Republic. The dataset includes 13 scenarios, each including legitimate traffic, as well as various attacks such as spam, port scanning, DDoS, and click fraud. The dataset also includes a list of botnet IPs that can be used for labeling the traffic.
Since ML classification needs to use similar attack data for training and testing, we decided to use a subset of 6 scenarios. Among these, 3 scenarios are generated by botnet Neris (performing spam and click fraud activity), and 3 scenarios are generated by botnet Rbot (performing DDoS activity). The statistics are in Table 1. For other botnets, there was only one scenario available and that precluded the use of supervised ML.
In traditional ML, cross-validation is a well-known method to evaluate the generalization of a model. $k$-fold cross-validation splits the data into $k$ partitions at random, trains a model on $k-1$ of them, and evaluates it on the $k$-th partition. Splitting the logs at random produces highly correlated data between training and testing sets. Instead, we train on two scenarios and test on the third (independent) scenario, repeating the experiment 3 times for each of the two botnets. This gives us assurance that the testing data is independent from the training data. This method of splitting the data into training and testing (based on independent attack scenarios) is more appropriate for this setting. In other contexts, the specifics of the environment need to be taken into consideration.
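The scenario-based split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `leave_one_scenario_out` is hypothetical, and only the Neris scenario numbers (1, 2, 9) are taken from the text.

```python
def leave_one_scenario_out(scenario_ids):
    """Yield (train_ids, test_id) pairs, holding out one whole scenario at a time."""
    for test_id in scenario_ids:
        # Train on every scenario except the held-out one.
        train_ids = [s for s in scenario_ids if s != test_id]
        yield train_ids, test_id


# Neris scenarios used in the case study.
neris_scenarios = [1, 2, 9]
splits = list(leave_one_scenario_out(neris_scenarios))

for train_ids, test_id in splits:
    print(f"train on scenarios {train_ids}, test on scenario {test_id}")
```

Unlike a random split of log lines, no connection from the test scenario can leak into the training set, because the split is made at the level of entire capture scenarios.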
|Neris||1||Spam, click fraud||31,089||569||3,067,241||76,614|
|Neris||2||Spam, click fraud||39,730||407||1,872,270||54,675|
|Neris||9||Spam, click fraud||111,895||2,893||1,689,040||62,970|
We show our system architecture in Figure 2. Our system processes network logs collected at the border of an organization (i.e., campus or enterprise network). After data collection, a feature extraction layer is employed to prepare the data for ML training. A number of classification algorithms are used to train a classifier and optimize for standard metrics, such as precision, recall, F1 score, and AUC. The classifiers are applied to new testing scenarios in order to evaluate their generality and predict suspicious network activity. We believe that this framework is general enough to be applicable in other environments.
3.3 Feature extraction
We experiment with different feature representations, as described below.
Connection-level representation. This representation extracts features directly from the raw connection logs. We consider all connections in which an internal host is either the source or the destination, and we use the fields from the Bro connection logs (Figure 1) directly as features. For categorical features, we use standard one-hot encoding. In this representation, we obtained 26 features after one-hot encoding.
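As an illustration of the one-hot encoding step, the following sketch expands one categorical field into binary indicator features. The record layout, the `proto` field name, and the helper `one_hot_encode` are assumptions made for the example; they are not taken from the paper's pipeline.

```python
def one_hot_encode(records, field):
    """Replace a categorical field with one 0/1 indicator feature per observed category."""
    categories = sorted({r[field] for r in records})
    encoded = []
    for r in records:
        # Keep all other (numeric) fields unchanged.
        row = {k: v for k, v in r.items() if k != field}
        for c in categories:
            row[f"{field}={c}"] = 1 if r[field] == c else 0
        encoded.append(row)
    return encoded, categories


# Hypothetical connection records with one numeric and one categorical field.
connections = [
    {"duration": 1.2, "proto": "tcp"},
    {"duration": 0.0, "proto": "udp"},
    {"duration": 0.4, "proto": "icmp"},
]
encoded, categories = one_hot_encode(connections, "proto")
print(encoded[0])
```

In practice the category list would be fixed from the training scenarios, so that training and testing data share the same feature columns.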
|IPs (per port)||Distinct||Number of IPs communicated with per port|
|IPs (per port)||Distinct||Number of subnets communicated with per port|
|Duration (per port)||Sum||Total duration of connections per port|
|Duration (per port)||Min||Min duration of a connection per port|
|Duration (per port)||Max||Max duration of a connection per port|
|Bytes (per port)||Sum||Total bytes sent per port|
|Bytes (per port)||Min||Min bytes sent in a connection per port|
|Bytes (per port)||Max||Max bytes sent in a connection per port|
|Bytes (per port)||Sum||Total bytes received per port|
|Bytes (per port)||Min||Min bytes received in a connection per port|
|Bytes (per port)||Max||Max bytes received in a connection per port|
|Packets (per port)||Sum||Total packets sent per port|
|Packets (per port)||Min||Min packets sent in a connection per port|
|Packets (per port)||Max||Max packets sent in a connection per port|
|Packets (per port)||Sum||Total packets received per port|
|Packets (per port)||Min||Min packets received in a connection per port|
|Packets (per port)||Max||Max packets received in a connection per port|
|Traffic statistics||Sum||Number of connections per transport protocol (TCP, UDP, ICMP)|
|Traffic statistics||Distinct||Number of source ports|
|Traffic statistics||Distinct||Number of external destination IPs|
|Traffic statistics||Distinct||Number of destination ports|
Aggregated traffic statistics. Next, we explore whether features obtained by time aggregation are more powerful than raw features. We consider a fixed-length time interval over which we define aggregated features over all connections in which an internal host is either the source or the destination.
An important consideration when defining our features is to generate a fixed number of features, independent of the traffic at a particular host. In our first attempt, we consider the set of all destination IP addresses that a host communicates with. From these we can define the set of /24 destination subnets that the host communicates with. If we define aggregated features per destination or subnet, we will encounter an issue when a host visits new IPs or new destinations. In that case, we need to add new features to our representation, which is not desirable in practice.
To alleviate this problem, we define our aggregated features by destination port (corresponding to applications or network services). Specifically, we define a set of 17 popular application ports (e.g., HTTP - 80, HTTPS - 443, SSH - 22, DNS - 53). We then take a modular approach. We select a small number of operators (Distinct, Sum, Min, Max) and apply them to fields in conn.log for each destination port. The features are described in Table 2. We generate these features separately for outgoing and incoming connections. Additionally, we add some features that capture communication patterns with external IP destinations (e.g., number of connections per transport protocol, number of source and destination ports, number of destination IPs, etc.). In this representation, we obtain 756 aggregated traffic features.
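The modular construction above (a few operators applied per destination port within a time window) can be sketched as follows. This is a simplified illustration under stated assumptions: the port list is a small subset rather than the 17 ports used in the paper, the "other" bucket for remaining ports is assumed, only outgoing connections and one field (bytes sent) are covered, and the record keys (`ts`, `src`, `dst`, `dst_port`, `bytes_sent`) are hypothetical names.

```python
from collections import defaultdict

# Illustrative subset of port groups; "other" collects all remaining ports (assumption).
PORT_GROUPS = ["22", "25", "53", "80", "443", "other"]


def port_group(port):
    name = str(port)
    return name if name in PORT_GROUPS else "other"


def aggregate_features(connections, window_seconds=30):
    """Build one fixed-length vector per (source host, time window)."""
    buckets = defaultdict(lambda: defaultdict(list))
    for c in connections:
        window = int(c["ts"] // window_seconds)
        buckets[(c["src"], window)][port_group(c["dst_port"])].append(c)

    vectors = {}
    for key, by_port in buckets.items():
        vector = []
        for group in PORT_GROUPS:
            conns = by_port.get(group, [])
            sent = [c["bytes_sent"] for c in conns]
            # Operators: Distinct (destination IPs), then Sum, Min, Max (bytes sent).
            vector += [
                len({c["dst"] for c in conns}),
                sum(sent),
                min(sent, default=0),
                max(sent, default=0),
            ]
        vectors[key] = vector
    return vectors


connections = [
    {"ts": 3.0, "src": "10.0.0.5", "dst": "198.51.100.7", "dst_port": 25, "bytes_sent": 500},
    {"ts": 12.5, "src": "10.0.0.5", "dst": "198.51.100.9", "dst_port": 25, "bytes_sent": 300},
    {"ts": 41.0, "src": "10.0.0.5", "dst": "203.0.113.2", "dst_port": 8080, "bytes_sent": 40},
]
vectors = aggregate_features(connections)
print(vectors[("10.0.0.5", 0)])
```

Because the vector is indexed by a fixed list of port groups, its length stays the same no matter how many new destination IPs or subnets a host contacts, which is the property motivating the per-port design.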
Temporal features. Considering the same time interval as with the aggregated connection-level features, we define inter-arrival features on a node as the mean, standard deviation, median, minimum, and maximum of the time distribution between node communications. Each internal node has two such sets of features: one for events where the node serves as the source of communication (outgoing), and one where it is the target (incoming). These communications are aggregated by common ports. Thus, in each time interval, a node will have the inter-arrival features listed in Table 3. In this representation, we obtain 180 features.
|Outgoing||Mean, std. dev., median, min, max||Statistics of inter-arrival distribution for outgoing traffic|
|Incoming||Mean, std. dev., median, min, max||Statistics of inter-arrival distribution for incoming traffic|
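The five inter-arrival statistics can be computed from the event timestamps of one node, direction, and port within a window. The sketch below is illustrative only: the function name is hypothetical, the population standard deviation is an assumed choice, and returning zeros when fewer than two events exist is an assumption about how empty windows might be handled.

```python
import statistics


def inter_arrival_features(timestamps):
    """Return mean, std. dev., median, min, and max of gaps between consecutive events."""
    ts = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ts, ts[1:])]
    if not gaps:
        # Assumption: windows with fewer than two events get all-zero features.
        return {"mean": 0.0, "std": 0.0, "median": 0.0, "min": 0.0, "max": 0.0}
    return {
        "mean": statistics.mean(gaps),
        "std": statistics.pstdev(gaps),  # population standard deviation (assumed)
        "median": statistics.median(gaps),
        "min": min(gaps),
        "max": max(gaps),
    }


# Hypothetical timestamps (seconds) of outgoing connections on one port.
features = inter_arrival_features([0.0, 2.0, 4.0, 10.0])
print(features)
```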
3.4 ML classification and labeling
Ground truth labeling. The CTU-13 dataset provides a list of botnet IP addresses. One of our main observations is that the attack is not active for the entire duration of the data collection. We found that the granularity at which we label the data plays a large role in the results. We experiment with two levels of granularity:
Coarse-grained labeling: We label all the connection logs generated by the botnet IPs as Malicious during the entire scenario period.
Fine-grained labeling: For the Rbot attack (an instance of DDoS), we obtain the IP address of the victim machine. We use that to identify the attack flows that connect to the victim IP. For all feature representations, we label a time window as Malicious if there is at least one attack log event in that time window.
Fine-grained labeling is difficult to obtain in general because it is a manual process, but when it is available it significantly improves the performance of ML in botnet detection.
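The difference between the two labeling granularities can be made concrete with a small sketch. The function `label_windows`, the record keys, and the IP addresses are hypothetical; the rule that a window is Malicious if it contains at least one attack event follows the description above.

```python
def label_windows(connections, botnet_ips, victim_ip=None, window_seconds=30):
    """Label each (source host, window) as Malicious (1) or Legitimate (0).

    Coarse-grained (victim_ip is None): every connection from a botnet IP is an attack event.
    Fine-grained: only connections from a botnet IP to the victim IP are attack events.
    """
    labels = {}
    for c in connections:
        key = (c["src"], int(c["ts"] // window_seconds))
        is_attack = c["src"] in botnet_ips and (victim_ip is None or c["dst"] == victim_ip)
        # A window is Malicious if it contains at least one attack event.
        labels[key] = max(labels.get(key, 0), int(is_attack))
    return labels


connections = [
    {"ts": 5.0, "src": "10.0.0.5", "dst": "198.51.100.20"},  # bot to victim
    {"ts": 40.0, "src": "10.0.0.5", "dst": "203.0.113.9"},   # bot, unrelated traffic
    {"ts": 8.0, "src": "10.0.0.7", "dst": "203.0.113.9"},    # uninfected host
]
botnet_ips = {"10.0.0.5"}

coarse = label_windows(connections, botnet_ips)
fine = label_windows(connections, botnet_ips, victim_ip="198.51.100.20")
print(coarse)
print(fine)
```

In this toy input, the second window of the infected host is Malicious under coarse-grained labeling but Legitimate under fine-grained labeling, since it contains only traffic unrelated to the attack.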
We consider several well-known ML classification models, including logistic regression, random forest, and gradient boosting. We use several metrics to evaluate the performance of the ML algorithms (precision, recall, F1 score, and AUC). As the imbalance is quite large in this dataset (the ratio of Malicious to Legitimate samples is as low as 1:134 for Neris and 1:401 for Rbot with features aggregated at 30-second intervals), the accuracy is always quite high (above 0.96 in all our experiments). We are interested in results on the minority (Malicious) class, thus precision, recall, F1 score, and AUC are better indicators of how the classifiers perform at detecting botnets.
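To see why accuracy is uninformative at this level of imbalance, consider a degenerate classifier that predicts Legitimate for every sample. The sketch below uses a synthetic label vector at the 1:134 ratio mentioned above; the helper `classification_metrics` is written for this example and is not the evaluation code used in the study.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for the positive (Malicious = 1) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Synthetic data: 1 Malicious sample for every 134 Legitimate ones.
y_true = [1] + [0] * 134
y_pred = [0] * 135  # classifier that never raises an alert

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

The never-alerting classifier scores above 0.99 on accuracy while detecting nothing, which is why the minority-class metrics are the ones reported.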
For the ML classifiers, we perform a grid search on several hyper-parameters to select the models performing best in our setting. For Random Forest, we varied the number of trees and found that 100 trees worked best. For Gradient Boosting, we varied the number of estimators, the maximum depth of each tree, and the learning rate. We selected 100 estimators with a maximum depth of 3 and a learning rate of 0.05. For logistic regression, we used $\ell_1$ (Lasso) regularization to reduce the dimensionality of the feature space.
4 Experimental Evaluation
During our experimental evaluation, we would like to answer several research questions, which we detail below.
Which feature representation performs best? We compare different feature representations (connection-level representation, aggregated traffic statistics, and temporal features). For this experiment, we use a random forest classifier with 100 trees and a 30-second time window for aggregation.
The results for Neris are in Table 4 and they show that aggregated features (both traffic statistics and temporal) perform significantly better than raw features extracted directly from Bro logs at all metrics of interest. For instance, when training on scenarios 2 and 9 and testing on scenario 1, the F1 score for connection features is 0.65, while the F1 score for aggregated features is 0.98. We do not observe a major difference when we consider both traffic and timing features, compared to using only aggregated traffic features.
The results for Rbot with fine-grained labeling are in Table 5. Here, connection-based features perform quite well. The reason is that this is a DDoS attack in which all packets sent to the victim are identical. However, traffic statistics and temporal features also perform well. The exception is when training on scenarios 4 and 11 and testing on scenario 10. In that case, the number of botnet samples available for training with 30-second aggregation is very small (142), while there are many more botnet samples in the raw data (378,252).
What is the impact of varying the time window? Here, we validate the choice of the time window for aggregation. Table 6 and Figure 3 show results for varying the time window from 1 to 600 seconds. The 30-second and 60-second time windows exhibit similar results and perform well most of the time. The 10-second window also performs well, except when testing on scenario 1. As the time window increases beyond 120 seconds, the results start to degrade. We suspect this is because of the small number of attack samples at larger aggregation windows, as well as additional noise in the legitimate traffic. In general, selecting the best time window for aggregation is attack-dependent. We recommend the use of cross-validation for selecting the optimal value of the time window. Based on these results, we select a time window of 30 seconds for the rest of the experiments.
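The effect of the window size on the number of attack samples can be illustrated with synthetic timestamps. The attack trace below (one event every 5 seconds for 10 minutes) is hypothetical and chosen only to show the trend; the window sizes are among those evaluated above.

```python
def count_malicious_windows(attack_timestamps, window_seconds):
    """Count distinct time windows that contain at least one attack event."""
    return len({int(ts // window_seconds) for ts in attack_timestamps})


# Hypothetical attack: one event every 5 seconds over 10 minutes.
attack_timestamps = [float(t) for t in range(0, 600, 5)]

counts = {
    w: count_malicious_windows(attack_timestamps, w)
    for w in (1, 10, 30, 60, 120, 600)
}
for window, n in counts.items():
    print(f"{window:>3}-second window: {n} malicious samples")
```

With a 600-second window the whole attack collapses into a single training sample, which is consistent with the suspicion that large windows leave too few malicious examples to learn from.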
What is the impact of different ML models? One important observation is that the amount of imbalance in cyber security is very large (as also observed by previous work [3, 12]). It is well known that ensemble classifiers such as random forests and boosting handle imbalance much better than simpler models. We test this hypothesis by using three different classifiers for our task: logistic regression, random forests, and gradient boosting. We fix the aggregation time window to 30 seconds and use the traffic statistics and temporal features.
The results for the three classifiers for Neris are in Table 7 and the precision-recall curves are in Figure 4. All three models we experimented with perform relatively well. Both ensemble methods perform better than the logistic regression model, with F1 scores between 0.94 and 0.98 on all scenarios. The difference between random forest and gradient boosting is imperceptible; both are powerful classification models.
Are the models interpretable? To understand what the ML models learned, we computed feature importance for the random forest classifier for both Neris and Rbot (using the aggregated traffic statistics and timing features at 30-second window). The results are in Table 8. Interestingly, we observe that the classifier identifies features that are correlated with the attack. Neris is a spam botnet and most of its activity uses port 25, making features such as distinct source ports and median inter-arrival packet time on port 25 most relevant. In contrast, Rbot is a DDoS botnet that uses different ports for the attack. For instance, the UDP flood is using port 161, and the classifier correctly determines that the standard deviation of inter-arrival packet timing on port 161 is the most important feature.
These results show our framework’s flexibility and ability to generalize to different attack patterns. We defined a set of 936 generic features that can be used for a variety of botnet attacks. For the two different botnets we experimented with, the ML models identified the most relevant features that are correlated with the attacks, without the need for a human expert to explicitly locate those features. Models such as random forest provide standard metrics for feature importance, with a clear advantage for model interpretability compared to deep learning and neural networks that lack interpretability. Interpretability is important in cyber security, as most of the time human experts analyze the alerts of ML systems.
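The total of 936 generic features is the sum of the two aggregated representations described in Section 3.3. The second line below gives one decomposition of the 180 temporal features that is consistent with the reported numbers; the count of 18 port groups (the 17 popular ports plus an "Other" bucket, as appears in Table 8) is an assumption and is not stated explicitly in the text.

```latex
\begin{align*}
\underbrace{756}_{\text{aggregated traffic}} + \underbrace{180}_{\text{temporal}} &= 936 \\
\underbrace{5}_{\text{statistics}} \times \underbrace{2}_{\text{directions}} \times \underbrace{18}_{\text{port groups (assumed)}} &= 180
\end{align*}
```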
|Neris||Distinct source ports||25||0.085|
|Neris||Median inter-arrival time||25||0.070|
|Neris||Distinct destination ports||25||0.067|
|Neris||Min packets sent||25||0.061|
|Neris||Distinct external IPs||25||0.054|
|Neris||Total packets sent||25||0.051|
|Neris||Max duration of connection||25||0.048|
|Rbot||Std. dev. of inter-arrival time||161||0.049|
|Rbot||Distinct source ports||135||0.046|
|Rbot||Distinct source ports||Other||0.043|
|Rbot||Min inter-arrival timing||138||0.042|
|Rbot||Distinct source ports||138||0.040|
|Rbot||Distinct source ports||3||0.038|
|Rbot||Distinct source ports||8||0.030|
|Rbot||Std. dev. of inter-arrival time||138||0.029|
What is the impact of labeling flows accurately? We perform an experiment to test how the granularity of data labeling impacts the classification results. For the Rbot DDoS botnet we have access to the IP address of the victim machine and thus we can determine which connections are botnet-related. We use fine-grained labeling to refer to the process of labeling only the botnet connections to the victim IP as Malicious. We use coarse-grained labeling to refer to the process of labeling all connections initiated by the botnet IP as Malicious.
Table 9 shows the results of fine-grained and coarse-grained labeling for the Random Forest and Gradient Boosting classifiers for features aggregated at 30-second intervals. The results demonstrate that classifier performance obtained with fine-grained labeling is much better than with coarse-grained labeling. For instance, when training on scenarios 10 and 11 and testing on scenario 4, the F1 score for coarse-grained labeling is 0.44, compared to a perfect F1 score for fine-grained labeling. Both classifiers perform similarly with fine-grained labeling.
5 Lessons and General Recommendations
Motivated by our case study of botnet classification from Bro logs, we highlight several guidelines that we believe are applicable in other settings where ML is used in cyber security.
Multiple feature representations need to be evaluated. Features extracted directly from raw data such as Bro connection logs do not always result in the best representation. A representation that worked well in our setting for classifying internal IP addresses is feature aggregation by time windows and port number. We also observed that the choice of feature representation depends on the amount of training data available. With the large imbalance between the malicious and benign classes, smaller time windows work better for aggregation. However, the right feature representation and the choice of time window for feature aggregation are dependent on the attack type. We recommend evaluating multiple feature representations.
Model interpretability. Models that provide interpretability are preferred in cyber security as security analysts need to investigate the alerts raised by ML systems. Understanding why a flow is labeled as malicious can speed up the investigation significantly. We showed how a random forest classifier is interpretable by identifying most relevant features that clearly provide insights about the botnet activity.
Data imbalance raises a challenge for supervised learning. Data imbalance poses a major challenge when applying classification methods to cyber security. Simpler models such as linear models are not equipped to deal well with class imbalance. We showed that ensemble models such as random forest and gradient boosting achieve good results even in highly imbalanced scenarios, compared to logistic regression. For instance, at an imbalance of 1:134 (when testing on scenario 2 for Neris) we obtain 0.97 precision and 0.95 recall with gradient boosting.
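For reference, the precision and recall quoted above combine into an F1 score through the standard harmonic-mean definition. The worked value below is simple arithmetic on the two reported numbers, not an additional experimental result:

```latex
F_1 = \frac{2 \cdot P \cdot R}{P + R}
    = \frac{2 \times 0.97 \times 0.95}{0.97 + 0.95}
    = \frac{1.843}{1.92}
    \approx 0.96
```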
The alternative to classification is to employ anomaly-detection models that learn from the legitimate class and identify attacks as anomalies. Nevertheless, Sommer and Paxson discussed extensively the difficulty of using anomaly detection in cyber security. We plan to investigate the performance of anomaly detectors in future work.
Fine-grained ground truth labeling can be a major factor in the success of supervised learning. As we demonstrated, data labeling for generating the ground truth plays a major role in the measured success of supervised learning algorithms. If detailed information about the attack is available (e.g., the destination IPs contacted by the attacker), then the performance of classifiers can be greatly improved. However, most of the time it is difficult to identify the attack flows exactly, even when running controlled attack simulations. Malware can contact a variety of IP addresses using different protocols, but infected machines also generate a fair number of legitimate connections (e.g., connections to Windows updates).
The research reported in this document/presentation was performed in connection with contract number W911NF-18-C-0019 with the U.S. Army Contracting Command - Aberdeen Proving Ground (ACC-APG) and the Defense Advanced Research Projects Agency (DARPA). The views and conclusions contained in this document/presentation are those of the authors and should not be interpreted as presenting the official policies or position, either expressed or implied, of ACC-APG, DARPA, or the U.S. Government unless so designated by other authorized documents. Citation of manufacturer’s or trade names does not constitute an official endorsement or approval of the use thereof. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.
We thank Malathi Veeraraghavan, Jack Davidson, Alastair Nottingham, and Donald Brown from University of Virginia, Kolia Sadeghi from Commonwealth Computer Research, Inc., and other PCORE-project team members for their support of this work. We would also like to thank Vijay Sarvepalli, Andrew J Kompanek, and Lena Pons from the Software Engineering Institute at Carnegie Mellon University for their helpful feedback regarding the evaluation.
[1] M. Antonakakis, R. Perdisci, D. Dagon, W. Lee, and N. Feamster. Building a dynamic reputation system for DNS. In Proc. 19th USENIX Security Symposium, 2010.
[2] M. Antonakakis, R. Perdisci, Y. Nadji, N. Vasiloglou, S. Abu-Nimeh, W. Lee, and D. Dagon. From throw-away traffic to bots: Detecting the rise of DGA-based malware. In Proc. 21st USENIX Security Symposium, 2012.
[3] K. Bartos, M. Sofka, and V. Franc. Optimized invariant representation of network traffic for detecting unseen malware variants. In 25th USENIX Security Symposium (USENIX Security 16), pages 807–822. USENIX Association, 2016.
[4] L. Bilge, D. Balzarotti, W. Robertson, E. Kirda, and C. Kruegel. DISCLOSURE: Detecting botnet Command-and-Control servers through large-scale NetFlow analysis. In Proc. 28th Annual Computer Security Applications Conference (ACSAC), ACSAC, 2012.
[5] L. Bilge, E. Kirda, K. Christopher, and M. Balduzzi. EXPOSURE: Finding malicious domains using passive DNS analysis. In Proc. 18th Symposium on Network and Distributed System Security, NDSS, 2011.
[6] Endgame. Using Deep Learning To Detect DGAs. https://www.endgame.com/blog/technical-blog/using-deep-learning-detect-dgas, 2016.
[7] FireEye. Reverse Engineering the Analyst: Building Machine Learning Models for the SOC. https://www.fireeye.com/blog/threat-research/2018/06/build-machine-learning-models-for-the-soc.html, 2018.
[8] X. Hu, J. Jang, M. P. Stoecklin, T. Wang, D. L. Schales, D. Kirat, and J. R. Rao. BAYWATCH: robust beaconing detection to identify infected hosts in large-scale enterprise networks. In DSN, pages 479–490. IEEE Computer Society, 2016.
[9] L. Invernizzi, S. Miskovic, R. Torres, S. Saha, S.-J. Lee, C. Kruegel, and G. Vigna. Nazca: Detecting malware distribution in large-scale networks. In Proc. ISOC Network and Distributed System Security Symposium (NDSS ’14), 2014.
[10] Microsoft. Machine Learning in Azure Security Center. https://azure.microsoft.com/en-us/blog/machine-learning-in-azure-security-center/, 2016.
[11] T. Nelms, R. Perdisci, and M. Ahamad. ExecScent: Mining for new C&C domains in live networks with adaptive control protocol templates. In Proc. 22nd USENIX Security Symposium, 2013.
[12] A. Oprea, Z. Li, R. Norris, and K. Bowers. MADE: Security analytics for enterprise threat detection. In Proc. Annual Computer Security Applications Conference (ACSAC), ACSAC, 2018.
[13] RSA. Threat Detection and Response NetWitness Platform. https://www.rsa.com/en-us/products/threat-detection-response, 2018.
[14] R. Sommer and V. Paxson. Outside the closed world: On using machine learning for network intrusion detection. In Proc. IEEE Symposium on Security and Privacy, SP ’10. IEEE Computer Society, 2010.
[15] G. Stringhini, C. Kruegel, and G. Vigna. Shady Paths: Leveraging surfing crowds to detect malicious web pages. In Proc. 20th ACM Conference on Computer and Communications Security, CCS, 2013.
[16] Symantec. How does Symantec Endpoint Protection use advanced machine learning? https://support.symantec.com/en_US/article.HOWTO125816.html, 2018.
[17] F. Tegeler, X. Fu, G. Vigna, and C. Kruegel. BotFinder: Finding bots in network traffic without deep packet inspection. In Proc. 8th International Conference on Emerging Networking Experiments and Technologies, CoNEXT, 2012.