Botnets are networks of compromised computers that coordinate to perform malicious activities such as DDoS attacks, spamming, click fraud, and theft of personal user information. They remain an acute problem in today’s Internet. Botnets receive commands from a botmaster through either centralized command-and-control (C&C) structures, e.g. Muhstik (2019); Mirai (2019), or decentralized peer-to-peer (P2P) C&C structures, e.g. Roboto (2019); Mozi (2019). With centralized C&C channels in a hierarchical structure, the botmaster can communicate with bots more effectively, but the botnet suffers from a single point of failure when the C&C channel is taken down by detection and response efforts. To address this problem, botnets began to communicate in P2P structures. Botnets in these structures allow the botmaster to join and control the botnet at any point, which makes them harder to detect.
Existing work on botnet detection depends heavily on operators’ and researchers’ deep understanding of botnet behaviors and requires a large amount of manual labor. For example, some works Gu et al. (2008b, a); Bartos et al. (2016); Doshi et al. (2018) rely on traffic patterns, such as packet sizes and port numbers, to differentiate botnet traffic from background traffic. However, detailed traffic patterns can be confidential, encrypted, or intentionally manipulated to evade monitoring Gu et al. (2008a). Furthermore, some approaches require additional prior knowledge of botnets, such as domain names Perdisci and Lee (2018) or DNS blacklists Andriesse et al. (2015). Some researchers Mirai (2019); Herwig et al. (2019) use honeypot techniques to study these patterns, but honeypots only trap the traffic directed to them and cannot detect real botnets in the wild.
There have also been many works that leverage specific topology features of botnets, such as mixing rates Nagaraja et al. (2010) and the number and size of connected graph components Collins and Reiter (2007); Iliofotou et al. (2008, 2009). For example, P2P botnets often have fast mixing rates because botnets form a topology that is most efficient in diffusing information and launching attacks. Nagaraja et al. (2010) propose a detection approach based on this fast-mixing feature. However, a major obstacle is that the massive scale of network communications makes it hard to differentiate botnet communication patterns from background Internet traffic. Previous works Nagaraja et al. (2010); Jaikumar and Kak (2015) take significant human effort to define topology features, perform multiple pre-filtering steps, and require data-dependent feature engineering and parameter tuning. Thus, one challenge of designing machine learning models for botnet detection is that they need to capture the topology of communication in large-scale graphs using an automatic detection mechanism.
In this work, we propose to tailor graph neural networks (GNNs) to identify botnets within massive background Internet communication graphs by automatically identifying their topological features (i.e., communication patterns). GNNs are well suited to the botnet detection problem given the large graphs with complex topological structures. In each layer of a GNN, nodes update their states and exchange information by passing messages to their neighboring nodes. Thus, the model can automatically identify node dependencies in the graph after many layers of message passing. There is no need for explicit filters, explicit feature definitions, or manual tuning. Although GNNs have gained increasing popularity in social networks Kipf and Welling (2016), code analysis Allamanis et al. (2017), scientific modeling Zitnik et al. (2018), etc., there have not been many datasets and studies applying GNNs to the area of network security. We specifically design GNNs for the problem of botnet detection that can automatically capture the hierarchical structure of centralized botnets and the fast-mixing structure of decentralized botnets.
In summary, we make the following contributions:
- We consider the challenge of a fully automatic botnet detection approach. Our experiments consider large Internet traffic data in many different botnet scenarios.
- Our GNN approach is tailored to botnet detection. It is solely based on topology and takes in non-attributed graphs. This approach improves detection rates under controlled false positive rates compared to previous work.
- Our datasets for GNN detection are communication network graphs of a large scale compared to many graph benchmark datasets. To be specific, a single graph in our datasets contains about 140k nodes and 700k edges on average, for which we find deeper models are needed to detect some of the topological properties.
Given a communication graph, the detection goal is to reliably isolate the botnet nodes. We formulate this problem as a binary node classification problem on graphs and introduce datasets for botnet detection to facilitate model training and testing later.
Since it has so far been difficult to determine the size of botnets Saad et al. (2011), the ground truth for real datasets in which P2P botnets are mixed into background traffic can be inaccurate. We therefore embed both synthetic and real botnet topologies within real background traffic graphs to build our datasets. We generate the training, validation, and test sets in the same way, using pure graph topology. Because we control the settings of the overlaid botnets, such as topology types and botnet sizes, we are able to examine our approach under different scenarios.
We consider all traces collected from the IP backbone by CAIDA (2018)’s monitors for background traffic. Similar to Nagaraja et al. (2010), we aggregate the traffic graph and conduct experiments over the resulting subnet-level graph, since netflow traces are aggregated into subnets for anonymity. We select a random subset of nodes in the background traffic as botnet nodes for embedding the botnet topology.
To investigate the sensitivity of our techniques, we overlay the background traffic with particular P2P topologies that we synthesize, including de Bruijn Kaashoek and Karger (2003), Kademlia Maymounkov and Mazieres (2002), Chord Stoica et al. (2001), and LEET-Chord Jelasity and Bilicki (2009). We also overlay two real botnets from Garcia et al. (2014) in the same manner as the synthetic ones: a decentralized botnet P2P and a centralized botnet C2 captured in 2011. As the two botnets are from real malware, their traffic contains attack behaviors in addition to internal communication traffic.
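The overlay procedure described above can be sketched as follows. This is a minimal illustration, not the generator used to build the datasets: `chord_edges` builds a simplified Chord-like ring in which node `i` links to nodes `i + 2^k`, and `overlay_botnet` is a hypothetical helper that maps that topology onto a random subset of background nodes and returns the merged edge set together with binary node labels.

```python
import random


def chord_edges(n):
    """Undirected edges of a simplified Chord-like overlay on nodes 0..n-1.

    Assumption: every identifier on the ring is occupied, so the fingers of
    node i are simply (i + 2^k) mod n for all 2^k < n.
    """
    edges = set()
    for i in range(n):
        step = 1
        while step < n:
            j = (i + step) % n
            if i != j:
                edges.add((min(i, j), max(i, j)))
            step *= 2
    return edges


def overlay_botnet(num_nodes, background_edges, botnet_size, seed=0):
    """Embed a Chord-like botnet on a random subset of background nodes.

    Returns the merged undirected edge set and a 0/1 label per node
    (1 marks a botnet node).
    """
    rng = random.Random(seed)
    bots = rng.sample(range(num_nodes), botnet_size)
    edges = {(min(u, v), max(u, v)) for u, v in background_edges}
    for a, b in chord_edges(botnet_size):
        u, v = bots[a], bots[b]
        edges.add((min(u, v), max(u, v)))
    labels = [0] * num_nodes
    for b in bots:
        labels[b] = 1
    return edges, labels


if __name__ == "__main__":
    background = [(0, 1), (1, 2), (2, 3)]
    merged, labels = overlay_botnet(50, background, botnet_size=8, seed=1)
    print(len(merged), sum(labels))
```

Because the botnet nodes are drawn from the background node set, the labels are exact by construction, which is the property the paragraph above relies on.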
Both centralized and decentralized botnets exhibit topological properties different from background networks. Centralized botnets are strongly hierarchical with a star shape. Meanwhile, decentralized P2P botnet topologies are designed to diffuse information quickly inside the botnet, so that bots receive a command to launch an attack efficiently. Mathematically, the rate at which a random walk reaches the stationary distribution inside the botnet, i.e. the mixing rate, is higher than that of background traffic. An example of a P2P botnet in Figure 1 shows the high mixing rate intuitively: nodes in the red botnet can easily reach the rest of the botnet within several hops. Since decentralized botnets are not as explicit as centralized botnets with a hierarchical structure, and all of the synthetic topologies are decentralized, we focus more on detecting decentralized P2P botnets.
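To make the notion of mixing rate concrete, the sketch below runs a simple random walk on two small graphs and measures how far the walk is from its stationary distribution after a few steps. The graphs, the helper names, and the use of total variation distance are illustrative assumptions, not the measurement used in the paper. Self-loops are included so that the walk is aperiodic.

```python
def ring(n):
    """Neighbor lists of a ring with self-loops."""
    return [sorted({u, (u - 1) % n, (u + 1) % n}) for u in range(n)]


def chord_like(n):
    """Neighbor lists of a ring with extra links at distances 2^k, plus self-loops."""
    nbrs = [{u} for u in range(n)]
    for u in range(n):
        step = 1
        while step < n:
            v = (u + step) % n
            nbrs[u].add(v)
            nbrs[v].add(u)
            step *= 2
    return [sorted(s) for s in nbrs]


def random_walk_step(dist, neighbors):
    """Each node splits its probability mass equally among its neighbors."""
    new = [0.0] * len(dist)
    for u, mass in enumerate(dist):
        share = mass / len(neighbors[u])
        for v in neighbors[u]:
            new[v] += share
    return new


def tv_distance_after(neighbors, steps, start=0):
    """Total variation distance to the stationary distribution after `steps` steps."""
    total_degree = sum(len(nb) for nb in neighbors)
    stationary = [len(nb) / total_degree for nb in neighbors]
    dist = [0.0] * len(neighbors)
    dist[start] = 1.0
    for _ in range(steps):
        dist = random_walk_step(dist, neighbors)
    return 0.5 * sum(abs(p - q) for p, q in zip(dist, stationary))


if __name__ == "__main__":
    for name, graph in [("ring", ring(16)), ("chord-like", chord_like(16))]:
        print(name, tv_distance_after(graph, steps=5))
```

On these toy graphs the Chord-like topology is much closer to stationarity after five steps than the plain ring, which is the kind of gap that topological detection exploits.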
Define a communication graph as $G = \{V, A\}$, where $V = \{v_1, \dots, v_n\}$ is the node set consisting of $n$ unique nodes observed in traffic traces, and $A \in \mathbb{R}^{n \times n}$ is a symmetric (typically sparse) adjacency matrix with $a_{ij} = 1$ representing an edge (direct communication) between nodes $v_i$ and $v_j$, and $a_{ij} = 0$ otherwise. We use $D$ with $D_{ii} = \sum_j a_{ij}$ to denote the diagonal node degree matrix. (We consider undirected and non-weighted graphs for simplicity; generalization to directed and weighted graphs is straightforward. We also assume self-loops in $A$ whenever they are needed.) Note that $A$ and $D$ represent graph structures and are fixed throughout the learning process.
We utilize a GNN model Kipf and Welling (2016) to learn the botnet topology for end-to-end detection. A GNN is a type of neural network that constructs a node vector representation, a vector of size $h$, through a stack of multiple graph convolutional layers. This representation captures the important context of the node. At the final layer, these representations are used to predict properties related to the node, in our case whether it is in the botnet. Formally, let $X^{(l)} \in \mathbb{R}^{n \times h}$ denote the node feature matrix after layer $l$, with the feature vector (row) for node $v_i$ represented as $x_i^{(l)}$. The vectors are updated at every layer, so as to construct a hierarchical profile with higher-level vectors representing broader and more abstract properties.

At each GNN layer, the representation of each node is first transformed via a learned matrix $W^{(l)}$:

$$\hat{x}_i^{(l)} = x_i^{(l-1)} W^{(l)}.$$

Then each node’s representation is averaged with the representations of its direct neighbors, which allows the node representation to include its neighbor information as a way to effectively explore the graph structure:

$$\tilde{x}_i^{(l)} = \sum_{j=1}^{n} \frac{a_{ij}}{\sqrt{D_{ii} D_{jj}}}\, \hat{x}_j^{(l)},$$

where the normalization $1/\sqrt{D_{ii} D_{jj}}$ is commonly used to prevent numerical instabilities for deep models, and is shown to have performance gains Kipf and Welling (2016). Furthermore, a non-linear function $\sigma$ is applied at the end to finish updating the hidden node representations for the current layer, $x_i^{(l)} = \sigma(\tilde{x}_i^{(l)})$. More compactly, we can express the above update in matrix form as

$$X^{(l)} = \sigma\left(\bar{A} X^{(l-1)} W^{(l)}\right),$$

with the normalized adjacency matrix $\bar{A} = D^{-1/2} A D^{-1/2}$, and the non-linear activation function $\sigma$, which is typically ReLU (i.e. $\sigma(x) = \max(0, x)$). We also have a separate linear transformation, given by a learnable transformation matrix at layer $l$, to map forward the last layer’s representation. The top node representations after $L$ layers are then fed to a linear layer followed by the softmax function for the final classification. The updating procedures are frequently summarized as a message-passing framework Gilmer et al. (2017), as node features are passed to neighboring nodes in every layer. With a stack of $L$ layers, the final representation of each node is able to capture useful local properties within its $L$-hop neighborhood for the downstream task. For botnets, we will see that we need multiple layers to capture the necessary neighbor information.
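The layer update described above (linear transform, normalized neighbor aggregation, ReLU) can be sketched in a few lines of plain Python. This is a dense, illustrative version with hypothetical helper names; it omits the bias and the separate forward-mapping transformation mentioned in the text, and a practical implementation would use sparse operations as in PyTorch Geometric.

```python
import math


def normalized_adjacency(adj):
    """Symmetric normalization D^{-1/2} A D^{-1/2} of a dense 0/1 adjacency matrix."""
    deg = [sum(row) for row in adj]
    n = len(adj)
    return [
        [adj[i][j] / math.sqrt(deg[i] * deg[j]) if adj[i][j] else 0.0 for j in range(n)]
        for i in range(n)
    ]


def matmul(a, b):
    """Dense matrix product of two lists of lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def gnn_layer(a_bar, x, w):
    """One layer: ReLU(A_bar @ X @ W)."""
    hidden = matmul(matmul(a_bar, x), w)
    return [[max(0.0, value) for value in row] for row in hidden]


if __name__ == "__main__":
    # Path graph v0 - v1 - v2 with self-loops already added.
    adj = [[1, 1, 0],
           [1, 1, 1],
           [0, 1, 1]]
    x0 = [[1.0], [1.0], [1.0]]   # one input feature per node
    w1 = [[1.0, -1.0]]           # toy weights mapping 1 feature to 2
    print(gnn_layer(normalized_adjacency(adj), x0, w1))
```

Applying `gnn_layer` repeatedly with fresh weight matrices gives each node access to information from one additional hop per layer, which is why the number of layers bounds the neighborhood a node can see.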
The form of $\bar{A}$ impacts how the neighboring node features are normalized before aggregation, and different choices lead to different GNN model variants. Examples include symmetric normalization based on both the source and target nodes’ degrees Kipf and Welling (2016), graph attention networks that calculate $\bar{A}$ with a learnable non-linear function based on node features Veličković et al. (2017), and independent normalization for each edge Bresson and Laurent (2018).
We customize the GNN framework for our topological botnet detection with the following two changes. First, to better utilize the fast-mixing property of botnet topologies Nagaraja et al. (2010), we propose to use a random walk style normalization, which only involves the degree of the source nodes, to equate the normalized adjacency matrix to the corresponding probability transition matrix. Second, since we want feature initialization agnostic to any ordering of nodes for purely topological learning, we set the first layer input to all ones, $X^{(0)} = \mathbf{1}$. This differs from the common practice of dealing with featureless graphs by assigning identities to each node, i.e. $X^{(0)} = I$ Kipf and Welling (2016). Note that in this setup some more sophisticated GNN models that normalize by target degree, such as the graph attention model Veličković et al. (2017), will lose their capacity to learn solely from topology, since normalizing among neighboring node features that are all identical will not differentiate any local patterns.
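The effect of the two design choices can be checked on a toy graph. The sketch below assumes the aggregation convention $x_i \leftarrow \sum_j \bar{a}_{ij} x_j$, in which $v_j$ is the source (sender) and $v_i$ the target (receiver); under that assumption, source-degree normalization scales each message by $1/D_{jj}$ and target-degree normalization by $1/D_{ii}$. The function name and the `norm` argument are hypothetical.

```python
def aggregate(adj, x, norm):
    """One weight-free aggregation step for scalar node features.

    norm="source": each message from node j is divided by deg(j) (random walk style).
    norm="target": each message into node i is divided by deg(i) (neighborhood mean).
    """
    deg = [sum(row) for row in adj]
    out = []
    for i, row in enumerate(adj):
        total = 0.0
        for j, a_ij in enumerate(row):
            if a_ij:
                scale = deg[j] if norm == "source" else deg[i]
                total += x[j] / scale
        out.append(total)
    return out


if __name__ == "__main__":
    # Star graph with hub 0 and leaves 1..3, self-loops included.
    star = [[1, 1, 1, 1],
            [1, 1, 0, 0],
            [1, 0, 1, 0],
            [1, 0, 0, 1]]
    ones = [1.0, 1.0, 1.0, 1.0]
    print(aggregate(star, ones, "target"))
    print(aggregate(star, ones, "source"))
```

With an all-ones input, target-degree normalization returns all ones for any graph, so the first layer carries no topological signal, whereas source-degree normalization already separates the hub from the leaves.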
Datasets and Evaluation
We generate a collection of datasets as discussed in Section 2, including 4 synthetic botnet topologies (de Bruijn, Kademlia, Chord, and LEET-Chord) as well as 2 real botnet topologies we captured (C2 and P2P). The background network graph contains about 140k nodes and 700k edges (undirected) on average. For each of the synthetic botnet topologies, we generate graphs containing 100/1k/10k botnet nodes, and the real botnets contain about 3k botnet nodes. Each dataset contains 960 graphs, which are randomly split into training, validation, and test sets with ratio 8:1:1 (the datasets are available at https://github.com/harvardnlp/botnet-detection). The dataset statistics for Chord with 10k botnet nodes are shown in Table 1. Other datasets are similar except for the number of botnet nodes. All the graphs are undirected and preprocessed to have self-loops to speed up training. Since the number of botnet nodes is extremely small for the 100/1k-bots datasets compared with the overall network size, we train models on the 10k-bots datasets and test on datasets with different botnet sizes, which helps detection on smaller botnet communities compared to directly training on them.
To evaluate the trained model, since the datasets are highly imbalanced (0.05% - 10% of the nodes are botnet nodes depending on the particular graph), we report the average false positive rate, false negative rate, and detection rate to get fair evaluations and to be consistent with previous works.
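A minimal sketch of these metrics follows. It assumes that the detection rate is the true positive rate and that "average" means a per-graph average over the test set; both are interpretations rather than statements from the paper, and the function names are hypothetical.

```python
def detection_metrics(y_true, y_pred):
    """False positive rate, false negative rate and detection rate for one graph.

    Labels are 1 for botnet nodes and 0 for background nodes.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
        "false_negative_rate": fn / (fn + tp) if fn + tp else 0.0,
        "detection_rate": tp / (tp + fn) if tp + fn else 0.0,
    }


def average_metrics(graphs):
    """Average the per-graph metrics over a list of (y_true, y_pred) pairs."""
    results = [detection_metrics(t, p) for t, p in graphs]
    return {key: sum(r[key] for r in results) / len(results) for key in results[0]}


if __name__ == "__main__":
    test_set = [
        ([1, 1, 0, 0, 0, 0], [1, 0, 1, 0, 0, 0]),
        ([1, 0], [1, 0]),
    ]
    print(average_metrics(test_set))
```

Rates computed per class are preferable to plain accuracy here: with under 10% botnet nodes, a classifier that labels every node as background would already score above 90% accuracy while detecting nothing.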
We compare the GNN model with a non-learning specialized detection method, BotGrep Nagaraja et al. (2010), and a simple machine learning baseline, logistic regression (LR), which takes in the following constructed features for each node: its own degree, and the mean, max, and min of its neighbors’ degrees. Note that BotGrep is a specialized multi-stage algorithm for topological botnet detection, which utilizes the fast-mixing property of random walks within the botnet community and relies on several hand-tuned heuristics. This is in contrast with our GNN detection method, which is fully automatic and data-driven.
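The four hand-constructed features fed to the logistic regression baseline can be computed directly from neighbor lists. The sketch below is illustrative; the function name and the zero-filled row for isolated nodes are assumptions.

```python
def degree_features(neighbors):
    """Per-node features: own degree, then mean, max and min of the neighbors' degrees."""
    deg = [len(nb) for nb in neighbors]
    feats = []
    for u, nb in enumerate(neighbors):
        nb_deg = [deg[v] for v in nb]
        if nb_deg:
            feats.append([deg[u], sum(nb_deg) / len(nb_deg), max(nb_deg), min(nb_deg)])
        else:
            # Assumption: isolated nodes get an all-zero feature row.
            feats.append([0, 0.0, 0, 0])
    return feats


if __name__ == "__main__":
    # Star graph: hub 0 connected to leaves 1, 2, 3.
    star = [[1, 2, 3], [0], [0], [0]]
    for row in degree_features(star):
        print(row)
```

Since a neighbor's degree depends on that neighbor's own neighbors, these features summarize at most the 2-hop neighborhood of each node.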
Model and Training Configuration
Our base GNN model structure contains 12 layers, as we found deeper models are helpful for better detection on many botnet topologies. We use ReLU for non-linear activation between layers, and a bias vector is added after each layer. The input to the model is just the graph, as the learning is purely based on topology without any help of node features. The embedding size is 32 for all layers, and there is an additional linear layer for the final output on each node.
Models are trained on all the graphs in the training split of the data, for which we use the Adam optimizer Kingma and Ba (2014) with cross entropy loss, learning rate 0.005, and weight decay 5e-4. Learning rate scheduling is applied, where we reduce the learning rate to 1/4 of its value whenever the average loss on the validation set is not reduced, with a patience of one. We also use early stopping whenever the average validation loss is not reduced for 5 consecutive epochs. Our implementations are based on PyTorch Paszke et al. (2019) and PyTorch Geometric Fey and Lenssen (2019).
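The interaction between the learning rate schedule and early stopping can be simulated without any training. The sketch below is one possible reading of the rules above: the learning rate is cut once the number of non-improving epochs since the last cut exceeds the patience, and training stops after 5 consecutive non-improving epochs. The exact trigger semantics and the function name are assumptions.

```python
def simulate_schedule(val_losses, lr=0.005, factor=0.25, patience=1, stop_after=5):
    """Return the learning rate used after each epoch until early stopping triggers."""
    best = float("inf")
    bad_since_cut = 0   # non-improving epochs since the last improvement or LR cut
    bad_in_a_row = 0    # consecutive non-improving epochs, used for early stopping
    history = []
    for loss in val_losses:
        if loss < best:
            best = loss
            bad_since_cut = 0
            bad_in_a_row = 0
        else:
            bad_since_cut += 1
            bad_in_a_row += 1
            if bad_since_cut > patience:
                lr *= factor
                bad_since_cut = 0
        history.append(lr)
        if bad_in_a_row >= stop_after:
            break
    return history


if __name__ == "__main__":
    losses = [1.0, 0.9, 0.95, 0.96, 0.97, 0.98, 0.99, 1.0]
    for epoch, rate in enumerate(simulate_schedule(losses)):
        print(epoch, rate)
```

With these settings the learning rate can be cut at most twice during a non-improving stretch before early stopping ends training.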
The logistic regression (LR) model performs poorly in most of the cases, indicating that it is not enough to only utilize local information up to 2-hop neighbors and that, without hidden representations, the model cannot learn more complex patterns. Note that the reported results of BotGrep are from previous work based on a single graph instance, while our results are averaged over all the graphs in the test set. Still, the end-to-end GNN method achieves results comparable to or better than BotGrep, showing higher detection rates and lower false positive rates in most of the cases, which validates its application to automated botnet detection based on network topologies. Moreover, although the GNN models are trained on graphs with 10k botnet nodes, the detection performance on 1k and 100 botnet nodes does not deteriorate much despite the worsening class imbalance (no tuning is applied to the detection threshold), which also supports the robustness of the automated approach.
For the analysis of different model variations, we adopt the average F1 score as our basic metric for better illustration, as it takes into consideration both the false positive rate and the detection rate and is thus easier to compare.
Detection on Small Botnets
To see how much training on larger botnet communities helps detection on much smaller communities, in Figure 2 we plot detection results of our 12-layer GNN model on the 1k-bots dataset when trained on the 10k-bots and 1k-bots datasets respectively. We can see that the model trained on the larger botnet community clearly outperforms the one trained on the smaller community. We attribute this to the data-driven nature of our detection method, which suffers when correct labels are of poor quality or too few for efficient learning.
We consider varying the number of GNN layers for all the synthetic botnet topologies. The results are plotted in Figure 3. The general trend is that deeper models help on all datasets. In particular, the model needs at least 6 layers to discover useful topological properties for reliable detection on most of the topologies, and more layers still help, though with diminishing gains. However, it is also clear that different topologies behave differently under the same GNN model structure. For example, de Bruijn botnets are the easiest to detect and shallow models of 2-3 layers can already do well, while Chord botnets need much deeper models for better detection. The overall difficulty of detection, in terms of the performance gain when the GNN model goes deeper, is Chord > Kademlia > LEET-Chord > de Bruijn.
We explore in more detail some network properties of the different botnet topologies to understand the different GNN behaviors on them. The mixing time of random walks on a graph is roughly the number of steps needed to reach the stationary distribution, which is typically smaller for botnets than for normal traffic; the bigger the gap, the easier the detection. We thus calculate the 2nd largest eigenvalue of the random walk probability transition matrix on a graph, which is positively related to the mixing time. A smaller eigenvalue results in a shorter mixing time and presumably fewer layers needed in the GNN. The average path length represents how close any two nodes in a graph are on average in hop distance, so a smaller value indicates that messages from most nodes can be diffused throughout the graph faster, which should also correspond to fewer layers needed in the GNN. We present these values in Table 4. As we can see, the topological properties justify the results in Figure 3, where botnets with smaller 2nd largest eigenvalues reach a higher score early (in the order of de Bruijn < LEET-Chord < Kademlia < Chord) and the number of layers needed is around the average path length.
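Both quantities can be computed for small graphs with the standard library alone. The sketch below is an illustration rather than the procedure used to produce Table 4: it estimates the second-largest eigenvalue of $P = D^{-1}A$ by power iteration on a shifted, symmetrized matrix, and computes the average path length by breadth-first search. It assumes a connected, undirected graph with at least two nodes; the function names are hypothetical.

```python
import math
from collections import deque


def second_largest_eigenvalue(neighbors, iters=2000):
    """Second-largest eigenvalue of the random-walk transition matrix P = D^{-1} A.

    S = D^{-1/2} A D^{-1/2} has the same eigenvalues as P. Power iteration on
    (S + I) / 2, after projecting out the known top eigenvector sqrt(degree),
    converges to the second-largest eigenvalue of S mapped to (lambda + 1) / 2.
    """
    n = len(neighbors)
    deg = [len(nb) for nb in neighbors]
    norm = math.sqrt(sum(deg))
    top = [math.sqrt(d) / norm for d in deg]
    vec = [math.sin(i + 1.0) for i in range(n)]  # arbitrary deterministic start
    mu = 0.0
    for _ in range(iters):
        dot = sum(v * t for v, t in zip(vec, top))
        vec = [v - dot * t for v, t in zip(vec, top)]
        length = math.sqrt(sum(v * v for v in vec))
        if length < 1e-15:
            break
        vec = [v / length for v in vec]
        s_vec = [
            sum(vec[j] / math.sqrt(deg[i] * deg[j]) for j in neighbors[i])
            for i in range(n)
        ]
        new = [0.5 * (s + v) for s, v in zip(s_vec, vec)]
        mu = sum(a * b for a, b in zip(new, vec))  # Rayleigh quotient
        vec = new
    return 2.0 * mu - 1.0


def average_path_length(neighbors):
    """Mean hop distance over all ordered pairs of distinct, reachable nodes."""
    total = 0
    pairs = 0
    for src in range(len(neighbors)):
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in neighbors[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


if __name__ == "__main__":
    cycle = [[(i - 1) % 6, (i + 1) % 6] for i in range(6)]
    print(second_largest_eigenvalue(cycle), average_path_length(cycle))
```

Dense power iteration like this is only practical for toy graphs; graphs with around 140k nodes would call for a sparse eigensolver.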
For the real botnets, we found that there is a sharp phase transition after 3 layers, with the model going from detecting almost nothing to performing very well, and more layers bring little gain. As can be seen in Table 3, the poor LR baseline and GNN-2 results are consistent, indicating that 2-hop neighbor information is not adequate. That deeper models are not needed here, unlike for the synthetic botnets, can be explained by the fact that these botnets contain star-shaped attack traffic towards a victim, such as DDoS attacks and spam. Since most nodes in the botnet are able to reach their victim within a few hops through a hub node (because of the star-shaped topology), it makes sense that a relatively shallow model can detect this pattern.
5 Related Work
Previous works Gu et al. (2008b, a); Andriesse et al. (2015); Bartos et al. (2016); Doshi et al. (2018); Perdisci and Lee (2018) on botnet detection are mostly based on traffic analysis. BotMiner Gu et al. (2008a) clusters nodes with similar communication traffic and similar malicious traffic and then performs cross-cluster correlation to differentiate botnet nodes from other nodes. Another work Bartos et al. (2016) uses a statistical feature representation computed from the network traffic and trains a classifier to recognize malicious behavior. However, botnets can intentionally manipulate their communication patterns or encrypt their channels to evade traffic monitoring, according to Gu et al. (2008a). Furthermore, some approaches require additional knowledge of botnets, such as domain names Perdisci and Lee (2018) or DNS blacklists Andriesse et al. (2015). Some researchers Mirai (2019); Herwig et al. (2019) use honeypot techniques to study these patterns, which only trap the botnet traffic directed to them.
There are also topology-based approaches Collins and Reiter (2007); Iliofotou et al. (2008, 2009); Nagaraja et al. (2010); Jaikumar and Kak (2015); Zhou et al. (2018). Nagaraja et al. (2010) utilize the unique overlay topology patterns and localize botnets through prefiltering, clustering, and validation. However, these approaches involve multiple manual steps of filtering and clustering, and elaborate threshold tuning to identify the embedded botnet subgraph. Collins and Reiter (2007) observe the number of connected graph components, since communication inside botnets will suddenly increase that number. Iliofotou et al. (2009) use a graph-level metric for the size of the largest connected component as well as spatial and temporal metrics at the node and edge level.
We propose to detect P2P botnets with an end-to-end, data-driven approach based on graph neural networks. To extensively study the automated detection method, we overlay synthetic or real botnet topologies with different underlying communication patterns on large-scale real background traffic graphs to generate datasets, and apply GNN models to capture the special topology of P2P botnets. Experiments show the effectiveness of our approach compared with the non-learning method, and both our data and studies should be useful to the network security and graph learning communities. Future work includes extending the approach to other network security problems where graph patterns are important, such as DDoS attacks and prefix hijackings.
Thanks to Ajay Chinta, Jay Sankaran, and Gurdeep Singh for conversations about this project. This research was supported by Tata Communications under the Harvard-Tata Alliance.
- Allamanis et al. (2017). Learning to represent programs with graphs. arXiv preprint arXiv:1711.00740.
- Andriesse et al. (2015). Reliable recon in adversarial peer-to-peer botnets. In Proceedings of the 2015 Internet Measurement Conference, pp. 129–140.
- Bartos et al. (2016). Optimized invariant representation of network traffic for detecting unseen malware variants. In 25th USENIX Security Symposium (USENIX Security 16), pp. 807–822.
- Bresson and Laurent (2018). An experimental study of neural networks for variable graphs.
- CAIDA (2018).
- Collins and Reiter (2007). Hit-list worm detection and bot identification in large networks using protocol graphs. In International Workshop on Recent Advances in Intrusion Detection, pp. 276–295.
- Doshi et al. (2018). Machine learning DDoS detection for consumer Internet of Things devices. In 2018 IEEE Security and Privacy Workshops (SPW), pp. 29–35.
- Fey and Lenssen (2019). Fast graph representation learning with PyTorch Geometric. arXiv preprint arXiv:1903.02428.
- Garcia et al. (2014). An empirical comparison of botnet detection methods. Computers & Security 45, pp. 100–123.
- Gilmer et al. (2017). Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1263–1272.
- Gu et al. (2008a). BotMiner: clustering analysis of network traffic for protocol- and structure-independent botnet detection. In Proceedings of the 17th conference on Security symposium, pp. 139–154.
- Gu et al. (2008b). BotSniffer: detecting botnet command and control channels in network traffic.
- Herwig et al. (2019). Measurement and analysis of Hajime, a peer-to-peer IoT botnet. In NDSS.
- Iliofotou et al. (2008). Graption: automated detection of P2P applications using traffic dispersion graphs.
- Iliofotou et al. (2009). Exploiting dynamicity in graph-based traffic analysis: techniques and applications. In Proceedings of the 5th international conference on Emerging networking experiments and technologies, pp. 241–252.
- Jaikumar and Kak (2015). A graph-theoretic framework for isolating botnets in a network. Security and Communication Networks 8 (16), pp. 2605–2623.
- Jelasity and Bilicki (2009). Towards automated detection of peer-to-peer botnets: on the limits of local approaches. LEET 9, pp. 3.
- Kaashoek and Karger (2003). Koorde: a simple degree-optimal distributed hash table. In International Workshop on Peer-to-Peer Systems, pp. 98–107.
- Kingma and Ba (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Kipf and Welling (2016). Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
- Langley (2000). Crafting papers on machine learning. In Proceedings of the 17th International Conference on Machine Learning (ICML 2000), P. Langley (Ed.), Stanford, CA, pp. 1207–1216.
- Maymounkov and Mazieres (2002). Kademlia: a peer-to-peer information system based on the XOR metric. In International Workshop on Peer-to-Peer Systems, pp. 53–65.
- Mirai (2019).
- Mozi (2019).
- Muhstik (2019).
- Nagaraja et al. (2010). BotGrep: finding P2P bots with structured graph analysis. In USENIX Security Symposium, Vol. 10, pp. 95–110.
- Paszke et al. (2019). PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8024–8035.
- Perdisci and Lee (2018). Method and system for detecting malicious and/or botnet-related domain names. Google Patents. Note: US Patent 10,027,688.
- Roboto (2019).
- Saad et al. (2011). Detecting P2P botnets through network behavior analysis and machine learning. In 2011 Ninth Annual International Conference on Privacy, Security and Trust, pp. 174–180.
- Stoica et al. (2001). Chord: a scalable peer-to-peer lookup service for Internet applications. ACM SIGCOMM Computer Communication Review 31 (4), pp. 149–160.
- Veličković et al. (2017). Graph attention networks. arXiv preprint arXiv:1710.10903.
- Zhou et al. (2018). Graph neural networks: a review of methods and applications. arXiv preprint arXiv:1812.08434.
- Zitnik et al. (2018). Modeling polypharmacy side effects with graph convolutional networks. Bioinformatics 34 (13), pp. i457–i466.