MalPaCA: Malware Packet Sequence Clustering and Analysis

04/02/2019
by   Azqa Nadeem, et al.
Delft University of Technology

Malware family characterization is a challenging problem because ground-truth labels are not known. Anti-virus solutions provide labels for malware samples based on their static analysis. However, these labels are known to be inconsistent, causing the evaluation of analysis methods to depend on unreliable ground truth labels. These analysis methods are often black-boxes that make it impossible to verify the assigned family labels. To support malware analysts, we propose a whitebox method named MalPaCA to cluster malware's attacking capabilities reflected in their network traffic. We use sequential features to model temporal behavior. We also propose an intuitive, visualization-based cluster evaluation method to solve interpretability issues. The results show that clustering malware's attacking capabilities provides a more intimate profile of a family's behavior. The identified clusters capture various attacking capabilities, such as port scans and reuse of C&C servers. We discover a number of discrepancies between behavioral clusters and traditional malware family designations. In these cases, behavior within a family group was so varied that many supposedly related malwares had more in common with malware from other families than within their family designation. We also show that sequential features are better suited for modeling temporal behavior than statistical aggregates.


1 Introduction

Malware is one of the leading threats in cybersecurity today, and its growth is still on the rise [6]. According to a PandaLabs report, 18 million new malware samples were detected in Q3 of 2016 alone (https://www.pandasecurity.com/mediacenter/pandalabs/pandalabs-q3/). Available literature suggests that malware developers follow an iterative approach towards malware development, as opposed to prototyping each variant separately [45]. The similarities between these variants are used to categorize them into malware families. The number of malware samples to be analyzed can be drastically reduced by analyzing only a few samples from each malware family.

Malware family characterization is a challenging problem. Most commercial anti-virus solutions that provide labels for malware samples utilize static analysis to extract signatures for malware detection [22]. This has several shortcomings: Firstly, syntactical signatures are incapable of detecting malware variants under code obfuscation [36]. Secondly, each vendor has its own way of determining a malware family, and labels obtained from different vendors are often inconsistent [39, 35]. Thirdly, the precise methods used by each vendor are proprietary and unstandardized [42]. The black-box nature of these methods makes it impossible to verify assigned family labels, causing the evaluation of analysis methods to depend on unreliable ground truth labels [32].

In this paper, we propose MalPaCA, a novel way of characterizing malware based on similarities in the attacking capabilities they exhibit in a network. The key intuition behind our solution is that since malware authors share code and resources [45], malware belonging to the same family will perform actions in a similar order. The similarity in the temporal behavior will be visible in high-level features even when traffic is encrypted [28]. We hypothesize that the visible patterns will be useful in distinguishing behavioral attributes of malware families and will provide a more detailed profile of their behavior. Instead of simply providing cluster labels and accuracy values, we advocate a cluster evaluation methodology based on visualizing the obtained clusters using heatmaps. The key advantage of this methodology is its white-box nature, providing a rationale for the assigned labels. Malware analysts can use this to investigate the similarities found in malware samples, understand malware’s attacking capabilities, and correct an assigned label if necessary. We perform an analysis on a financial malware dataset that was collected in the wild and that reveals several interesting findings: The malware dataset exhibits various attacking capabilities, such as port scans and reusing C&C servers; Some malware families exhibit significantly more capabilities than others; and we discovered a number of samples that do not adhere to their family names, either because their labels are incorrect or because the overlapping families share significant behavior. We also show that using sequence clustering outperforms state-of-the-art statistical features.

In summary, our contributions are: 1) An intuitive, protocol-agnostic, encryption-agnostic, and reproducible methodology called MalPaCA to cluster malware's network behavior that is applicable to low-quality data sets; 2) A white-box, visualization-based malware analysis methodology that works without ground truth labels and brings malware analysts back into the loop; 3) A proof-of-concept of the proposed method, evaluated on real-world malware samples collected in the wild; 4) A demonstration of the effectiveness of the proposed method by comparing the results with an existing state-of-the-art solution.

2 Background, Problem Definition, and Scope

The goal of our system, MalPaCA, is to automatically partition a set of malware into disjoint groups such that each group contains malware exhibiting similar network behavior. An important requirement is the ability to investigate the system's reasoning and to support the analyst in using the clustering.

Malware analysis vs. detection. The goal of our work is to support analysts in understanding malware, in relating new malware to existing malware, and in quickly understanding the capabilities of malware. Typically, the end goal of malware analysis is the extraction of behavioral or syntactical signatures to identify future malware samples. However, malware analysis methods themselves are often black boxes with no way of understanding or reasoning about the results. We address this interpretability problem by proposing an intuitive, white-box malware analysis approach as a stepping stone towards better detection methods. Malware detection itself is left as future work.

Packet capture (Pcap) vs. traffic flow data (Netflow). Network traffic can be analyzed on different levels: Full packet capture collects each packet sent or received during a transmission, which can include the full packet payload. A series of uninterrupted packet transfers captures the temporal behavior of a host. Flow capture, in contrast, aggregates packet-based traffic between two hosts and only retains summary statistics about the packets. During this aggregation, some of the temporal characteristics of the packet exchange are lost. Often only a subset of all flows is collected, further reducing the information available; this process is called sampling. Due to their light-weight nature, flows are extensively used in network traffic analysis [38, 4, 17]. MalPaCA utilizes packet captures to model malware behavior, but only relies on packet headers without looking at the packet contents. Unsampled netflows could also be used.

Clustering vs. classification. The key difference between clustering and classification is the objective. In classification, the goal is to assign a data sample to a category from a set of given categories. In contrast, the goal of clustering is to partition a given set of data into subgroups such that each group shares some characteristics. While classification requires a ground truth, i.e., the membership of the samples in the different categories, clustering does not. However, malware clustering evaluation still assumes the presence of some ground-truth labels to measure cluster quality. In malware research, we cannot rely on any ground-truth labels, since malware family labels are known to be noisy and inconsistent [42, 32]. We propose a cluster evaluation method that uses manual investigation through visualizations. This gives us a goal-driven approach to investigating the accuracy of MalPaCA's clustering.

Sequence vs. statistical clustering. Sequences capture temporal information useful for behavioral modeling. Sequence clustering is a technique where input features are sequences that are clustered based on their mutual distances. In contrast, statistical features only capture aggregate information. While statistical features are more efficient, they can only model behavioral summaries.

3 Methodology

Figure 1 illustrates the architecture of MalPaCA with its five phases (P1 to P5). Pcap files are given as input to the system and split into unidirectional streams (called connections), which are clustered into attacking capabilities. Each cluster is visualized using heatmaps that show similarities in the feature set of streams belonging to different malware samples.

Figure 1:

The MalPaCA framework. The input are Pcap files. After extracting features from connections, the cluster analysis is performed in three steps.

3.1 Connection generation (P1)

A connection is defined as an uninterrupted, unidirectional list of all packets sent from a source IP to a destination IP address. This means 8.8.8.8 → 123.123.123.123 is a different connection than 123.123.123.123 → 8.8.8.8. Distinguishing between connections initiated and received by a host yields more diverse behavioral profiles. An Outgoing connection is defined as the packets transferred from (localhost → guest IP), where guest IP is the IP address of a host the localhost connects to. Conversely, an Incoming connection is defined as the packet transfers from (guest IP → localhost).
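The connection-generation step (P1) can be sketched in a few lines of Python. The packet representation used here (dicts with src/dst fields) and the function name are illustrative assumptions for the sketch; MalPaCA itself reads these fields from Pcap packet headers.

```python
from collections import defaultdict

def generate_connections(packets, max_len=20):
    """Split a packet list into unidirectional connections.

    8.8.8.8 -> 123.123.123.123 and 123.123.123.123 -> 8.8.8.8 are kept
    as two distinct connections. Each connection is capped at the first
    max_len packets (the `len` parameter described below).
    """
    connections = defaultdict(list)
    for pkt in packets:
        key = (pkt["src"], pkt["dst"])  # direction-sensitive key
        if len(connections[key]) < max_len:
            connections[key].append(pkt)
    return dict(connections)

packets = [
    {"src": "10.0.0.1", "dst": "8.8.8.8", "size": 60},
    {"src": "8.8.8.8", "dst": "10.0.0.1", "size": 1200},
    {"src": "10.0.0.1", "dst": "8.8.8.8", "size": 60},
]
# yields two unidirectional connections: one outgoing, one incoming
conns = generate_connections(packets)
```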

Each connection is a sequence of packet exchanges. Rather than using all packets from a complete flow, we only consider a fixed number of packets from the start of the flow, denoted by the parameter len. Ideally, this parameter would be long enough to capture all possible behaviors. Realistically, however, it needs to be capped at a fixed threshold to avoid introducing artifacts caused by the considerable variance in sequence lengths. Following the guidelines of Korczyński et al. [28], we analyze only the first few packets of a connection, often referred to as the handshake. In network traffic analysis, the length of handshakes is often unknown. Hence, len should be large enough to allow the handshake to be modeled. As len grows, however, so do the computational resources required. Finding a trade-off between capturing most handshakes and limiting computation requirements is important.
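As an illustration of this trade-off, one simple heuristic consistent with the setting reported later in Section 4.2 is to cap sequences at the mean connection length. This particular rule, and the helper name, are assumptions for the sketch, not a prescription from the paper.

```python
def choose_len(connections):
    """Return the (rounded) mean connection length to use as the cap `len`.

    Illustrative heuristic: short, common handshakes dominate the mean,
    so the cap stays small while covering most connections.
    """
    lengths = [len(c) for c in connections]
    return round(sum(lengths) / len(lengths))

# a toy dataset skewed towards short sequences, as in real traffic
threshold = choose_len([[0] * 5, [0] * 7, [0] * 48])
```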

3.2 Feature-set extraction (P2)

The choice of feature-set is crucial for determining the kind of behaviors that are clustered together. Two considerations motivate our choice: 1) The method should generalize to more than one type of malware; 2) The feature set should be small and easy to extract. We cannot use features extracted from the application layer (i.e., the packet payload itself), as they limit the applicability of the method. We also do not use IP addresses, as they are considered Personally Identifiable Information (https://www.enterprisetimes.co.uk/2016/10/20/ecj-rules-ip-address-is-pii/) and are easy to spoof.

We use four features: (i) packet size, (ii) time interval, (iii) source port, and (iv) destination port. All four features are independent of the packet payload and protocol type, making them available for every connection. Packet size measures the size of the IP datagram of each packet in bytes. Time interval captures the time difference between the previous and the current packet, measured in milliseconds. It has been observed that malware tends to show periodic behavior, such as periodic heartbeat packets (https://www.ixiacom.com/company/blog/mirai-botnet-things) sent to inform the C&C server about the infected host. Packet sizes and the periodicity of the packets may indicate similar underlying infrastructure. Port numbers can be considered the doors that hosts use to communicate with the outside world. We use both source and destination port numbers because the connections are unidirectional. We can potentially identify the protocol a malware uses based on port information, e.g., port 80 indicates HTTP-based malware while port 53 indicates DNS-based malware. Moreover, usage of certain vulnerable ports can indicate suspicious activity [24, 16].
For each connection, we build one separate sequence for each feature.
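A minimal sketch of this feature extraction, assuming packets are dicts with size, ts (timestamp in ms), sport, and dport fields (the field names are illustrative assumptions):

```python
def extract_features(connection):
    """Build the four per-connection feature sequences: packet sizes,
    inter-arrival times (ms), source ports, and destination ports."""
    sizes = [p["size"] for p in connection]
    ts = [p["ts"] for p in connection]
    # the first packet has no predecessor, so its interval is 0 by convention
    intervals = [0.0] + [b - a for a, b in zip(ts, ts[1:])]
    sports = [p["sport"] for p in connection]
    dports = [p["dport"] for p in connection]
    return sizes, intervals, sports, dports

conn = [
    {"size": 60, "ts": 0.0, "sport": 49152, "dport": 80},
    {"size": 1200, "ts": 12.5, "sport": 49152, "dport": 80},
    {"size": 60, "ts": 20.0, "sport": 49152, "dport": 80},
]
sizes, intervals, sports, dports = extract_features(conn)
```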

3.3 Distance measure (P3)

We describe the behavior of malware in terms of connections, which in turn are represented by one sequence for each feature. To reason about similarity of behaviors, we need a way to measure similarity between sequences. The datatype of a sequence determines which distance measure is applicable. Our method should be resilient to delays and noise, which are common characteristics of network traces. In addition, the distance measure should be intuitive to help understand the results. Therefore, we use a combination of Dynamic Time Warping (DTW) and N-gram analysis to measure distance between two connections.

DTW has applications in shape-matching and time-series classification, such as in fingerprint verification [29] and in characterizing DDoS attack dynamics [50]. Bioinformatics and computational linguistics have used Ngrams long before their application in cybersecurity, e.g., in modeling genomic sequences [49], in OCR retrieval [20], and in file matching [33]. Recently, Ngrams have been used in cybersecurity to classify malicious code [1]. In our system, numeric features (packet size and time interval) use DTW for distance measurement due to its robustness to delays and noise. Port numbers are represented as ngrams, and the distance between them is measured in the vector space using cosine similarity. The distance values of each feature are first normalized to the range [0, 1]. Then, they are consolidated using a simple unweighted average, as shown in Eq. (1):

d(a, b) = 1/4 [ d_size(a, b) + d_interval(a, b) + d_sport(a, b) + d_dport(a, b) ]    (1)

where a and b are two connections; d(a, b) is the final calculated distance between a and b; d_size(a, b) is the distance between the sequences of packet sizes of a and b; d_interval(a, b) is the distance between the sequences of intervals of a and b; and d_sport(a, b) and d_dport(a, b) are the distances between the sequences of source ports and destination ports of a and b, respectively.
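The distance computation can be sketched in plain Python: DTW for the numeric sequences, cosine distance over trigram counts for the port sequences, and an unweighted average as in Eq. (1). For brevity, the sketch skips the per-feature normalization to [0, 1], which in the full pipeline is applied over all pairwise distances; all function names are assumptions.

```python
import math
from collections import Counter

def dtw(s, t):
    """Classic O(|s|*|t|) Dynamic Time Warping distance."""
    INF = float("inf")
    D = [[INF] * (len(t) + 1) for _ in range(len(s) + 1)]
    D[0][0] = 0.0
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            cost = abs(s[i - 1] - t[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[-1][-1]

def ngram_cosine_dist(s, t, n=3):
    """1 - cosine similarity between n-gram count vectors of two
    categorical (port number) sequences."""
    a = Counter(tuple(s[i:i + n]) for i in range(len(s) - n + 1))
    b = Counter(tuple(t[i:i + n]) for i in range(len(t) - n + 1))
    dot = sum(a[g] * b[g] for g in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - (dot / (na * nb) if na and nb else 0.0)

def connection_distance(fa, fb):
    """Unweighted average of the four per-feature distances (Eq. 1)."""
    sizes_a, ivals_a, sport_a, dport_a = fa
    sizes_b, ivals_b, sport_b, dport_b = fb
    return (dtw(sizes_a, sizes_b) + dtw(ivals_a, ivals_b)
            + ngram_cosine_dist(sport_a, sport_b)
            + ngram_cosine_dist(dport_a, dport_b)) / 4.0

fa = ([60, 60, 60, 60], [0.0, 1.0, 1.0, 1.0], [49152] * 4, [80] * 4)
fb = ([60, 1200, 60, 1200], [0.0, 5.0, 5.0, 5.0], [49152] * 4, [80] * 4)
d_same = connection_distance(fa, fa)  # identical connections
d_diff = connection_distance(fa, fb)  # different size/interval patterns
```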

3.4 HDBScan Clustering (P4)

A key decision in our pipeline is the clustering algorithm itself. There exists a familial structure among malware samples [47, 43]. Therefore, it makes sense to use hierarchical clustering to model their relationships. At the time of writing, Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBScan) [7] is the best-performing clustering algorithm available, especially for Python implementations (http://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html). It requires a mutual distance matrix as input. It does not force data points to become part of clusters: all data points whose membership to a cluster cannot be determined are considered noise. In the current context, noise refers to behaviors that are either too different from all others or cannot be clearly assigned to one cluster. An ideal dataset with clear cluster boundaries has no noise. Hence, for a less ideal dataset, noise is discarded to extract high-quality clusters. Keep in mind that discarding too many connections as noise can also be counterproductive. We discuss this limitation in Section 6.

Evasion resilience. Knowing our system, attackers can attempt to evade the system in two ways: First, the attackers can publish an updated version of their malware exhibiting different behavior in the hope to be clustered differently or to not exhibit a behavior clearly associated with one single cluster. Second, attackers can design malware to exhibit a wide range of randomized behavior to avoid falling into a specific cluster.

Changing behavior should lead to a different cluster assignment, as MalPaCA clusters by behavior. If the updated malware still accomplishes the same tasks, it has to show the same core network behaviors. Purely randomizing behavioral characteristics, however, will not help evasion: the use of DTW makes our system resilient to random delays [14], and due to the relative distance measures used in HDBScan, randomized port numbers [16] will still be clustered together. Our results demonstrate that malware samples with randomized port numbers end up in the same cluster. An attacker attempting to evade the system has to evade all four features. Maintaining an infrastructure that keeps changing long-term behavior is difficult, making it uneconomical for most attackers [12].

3.5 Cluster analysis (P5)

Cluster quality is a subjective and context-dependent notion. Although some metrics exist that capture cluster quality (e.g., the Silhouette index [41] and the DB index [11]), they are not applicable to our use case. They require a notion of a point's distance from its centroid, but this does not work well with sequences, since each data point (a.k.a. connection) is represented as a vector of its relative distances from the other points. Since each connection is represented by four sequences (one for each feature), we argue that a single number does not adequately capture the intricacies that we can otherwise see via visualizations. In addition, since we do not have ground truth labels at the capability level, we cannot validate our method using, e.g., a confusion matrix.

We define the following properties to be indicative of good clustering: (1) Cluster homogeneity is high. (2) Each cluster captures a different behavior. (3) Clusters are small and specific. The first two properties ensure that we obtain meaningful clusters, the third prevents HDBScan from discarding many samples falling between large clusters as noise.

Rather than assigning each cluster quality a single number, we make the process white box by using heatmaps for a visualization-based cluster analysis. This gives malware analysts more control over the cluster debugging phase. The analyst can use the heatmaps to determine threats—we leave the automatic calculation of the cluster quality metric as future work.

Cluster Content Visualization Each connection is represented by four sequences, one for each feature. Hence, the dimensionality of the feature set is 4*len, where len is the length of each feature sequence. We utilize heatmaps for finding patterns in multidimensional data, len dimensions at a time. Hence, four heatmaps are associated with each cluster, one corresponding to each feature. Each row in a heatmap belongs to a connection contained in that cluster, represented by a sequence of the associated feature. The biggest advantage of using heatmaps is that they make the cluster analysis much easier and white-box in nature. A set of example heatmaps is shown in Figure 6. The x-axis represents the packet number. Each row shows the feature of the first len packets in a connection. The figure highlights one dissimilar connection among the eight connections in the cluster.

Cluster Accuracy Analysis Because we are in an unsupervised setting without a factual ground truth, evaluation is subjective and based on how informative the clusters are given the problem statement. Visualizing the cluster content helps to identify which connections do not belong in a cluster. A false positive (FP) is defined as a connection that is placed in a cluster despite its features being different from those of the remaining connections in the cluster. Since each feature currently holds equal weight, we only consider a connection an FP if more than two of its four features differ. We consider two features different if more than 50% of their sequences do not look alike. Figure 6 shows a cluster containing one FP, highlighted in red: three out of four feature values of this connection are significantly different from the other connections in the same cluster. The FP rate of a cluster is calculated as the number of FPs divided by the cluster size, i.e., 1/8 = 12.5% in this example. We measure the FP rate of each cluster in this way, and calculate the average percentage of FPs per cluster as a notion of cluster quality. Note that FPs are only used during evaluation, not for the cluster analysis itself.

In practice, we first establish the common majority by finding two or more connections that are most similar to each other, i.e., the ones with the least mutual distance. We consider those connections the rightful owners of the cluster. Figure 6 shows a simple case where the rightful owners of a cluster are easily visible, since 7 out of 8 connections are very similar. The remaining connections are compared with the rightful owners and are considered either FPs or TPs, depending on how many feature sequences differ. We do not calculate false negatives because they can be derived (over-estimated) from false positives: an FP connection in cluster X has feature values very similar to those in cluster Y, so it is an FN for cluster Y.

(a) Packet sizes
(b) Interval
(c) Source Port
(d) Destination Port
Figure 6: A false positive: a connection does not belong in the cluster it is assigned
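The FP-rate rule above can be condensed into a small helper. The count-based interface, where the analyst (or an automated comparison) supplies for each connection the number of its four feature sequences that differ from the rightful owners, is an assumption for illustration.

```python
def fp_rate(diff_counts):
    """Fraction of connections in a cluster that are false positives.

    diff_counts: for each connection in the cluster, how many of its
    four feature sequences differ from the cluster's rightful owners.
    A connection is an FP if more than two of its four features differ.
    """
    fps = sum(1 for c in diff_counts if c > 2)
    return fps / len(diff_counts)

# the cluster of Figure 6: eight connections, one of which differs in
# three out of four features -> FP rate 1/8
rate = fp_rate([0, 0, 0, 0, 0, 0, 0, 3])
```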

4 Experimental Setup

4.1 Dataset

The dataset is composed of real-world financial malware samples detected in 2017. It was provided by a security company that specializes in malware analysis and threat intelligence. The dataset was collected by executing each detected malware binary in a sandboxed environment for a variable amount of time. The resulting network traffic is stored in a Pcap file; each Pcap file refers to one malware sample. The initial dataset contained approximately 47k Pcap files. The data providers labeled the Pcap files using a set of YARA (https://virustotal.github.io/yara/) rules extracted from the static analysis of the malware binaries. We focus on a few prevalent and well-known malware families. Crowe [10] and Wueest [51] report a list of malware families that posed the highest threat in 2016-2017. We select 15 banking malware families for the analysis. The cleaned dataset contains 1,196 Pcap files, from which we obtain 8,997 connections. Table 1 summarizes the dataset.

Family name # Malware samples Family name # Malware samples
Blackmoon (B) 887 (74.10%) Gozi EQ (GE) 7 (0.58%)
Gozi ISFB (GI) 122 (10.19%) Dridex RAT Fake Pin 7 (0.58%)
Citadel (C) 70 (5.85%) Dridex (D) 6 (0.50%)
Zeus VM AES (ZVA) 29 (2.42%) Zeus P2P (ZP) 4 (0.33%)
Ramnit (R) 22 (1.83%) Zeus (Z) 3 (0.25%)
Dridex Loader (DL) 15 (1.25%) Zeus OpenSSL 2 (0.17%)
Zeus v1 (Zv1) 10 (0.83%) Zeus Action 2 (0.16%)
Zeus Panda (ZPa) 10 (0.83%)
Total 1,196 (100%)
Table 1: Composition of Malware Families and Their Contribution in the Experimental Dataset.

4.2 Parameters

There are four tunable parameters in the proposed framework: the length n of the Ngrams used on port numbers, the number len of packets used to create the feature sequences, and the two parameters of the HDBScan clustering algorithm, min_cluster_size and min_samples.

We use trigrams (n = 3) to represent categorical sequences. Trigrams form a good trade-off between performance and data sparsity, based on the results of Kalgutkar et al. [23]. The length of connections in the dataset is highly skewed towards shorter sequences, with a mean of 20 packets. This mean is used as len. Out of 8,997 connections, only 733 (8%) are longer than this threshold. A pairwise distance matrix of dimensions 733x733 is generated using the selected distance measures. The HDBScan parameters were selected on a configuration dataset, which formed roughly 5% of the full dataset. The experiments were performed on a machine with an Intel Xeon E3-12xx v2 processor, 8 cores, and 64GB of RAM.

4.3 Comparison with state-of-the-art

Utilizing network traces for malware classification is a common theme in research [46, 39, 3, 35, 18]. A majority of these studies use statistical features for behavioral modeling. Our approach, MalPaCA, uses sequential features to model fine-grained attacking capabilities. We compare the performance of sequential versus statistical features, using Tegeler et al. [46] as a baseline, since they not only use statistical features but also incorporate periodic behavior using a Fourier transform to detect bot-infected network traffic. Although the goal of their study diverges from ours, their feature selection approach is aligned with ours. For objectivity, we keep the rest of the framework as explained in Section 3.

Taking guidelines from Tegeler et al. [46] and adapting them to our dataset, each connection in the baseline is characterized by 1) the average packet size, 2) the average interval between packets, 3) the duration of the connection, and 4) the maximum Power Spectral Density (PSD) of the FFT obtained from a binary sampling of the C&C communication. The binary signal is generated using the approach of Tegeler et al. [46]: the signal is 1 when a packet is present in the connection and 0 in between.
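A dependency-free sketch of these four baseline features. The naive DFT keeps the sketch self-contained (a real implementation would use an FFT), and the time-binned binary presence signal passed in as a list is an assumption about the input representation.

```python
import math

def baseline_features(sizes, intervals, duration, presence):
    """Statistical baseline in the spirit of Tegeler et al.: average
    packet size, average inter-packet interval, connection duration, and
    the maximum power spectral density (PSD) of a binary presence signal
    (1 when a packet occupies a time bin, 0 in between)."""
    avg_size = sum(sizes) / len(sizes)
    avg_interval = sum(intervals) / len(intervals)
    n = len(presence)
    max_psd = 0.0
    for k in range(1, n):  # skip the DC component at k = 0
        re = sum(x * math.cos(2 * math.pi * k * t / n)
                 for t, x in enumerate(presence))
        im = -sum(x * math.sin(2 * math.pi * k * t / n)
                  for t, x in enumerate(presence))
        max_psd = max(max_psd, (re * re + im * im) / n)
    return avg_size, avg_interval, duration, max_psd

# a perfectly periodic heartbeat: a packet in every other time bin, so the
# PSD peaks sharply at the corresponding frequency
feats = baseline_features([60, 60, 60, 60], [0, 2, 2, 2], 8.0,
                          [1, 0, 1, 0, 1, 0, 1, 0])
```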

5 Results and Discussion

The clustering algorithm produces 18 clusters, with on average 25 connections per cluster. The algorithm discards 284 connections as noise. Cluster 1 is the largest with 90 connections, while cluster 10 is the smallest with merely eight connections. Only 12 families in the filtered dataset have connections long enough to be used. The family distribution shows the dominance of Gozi-ISFB's samples (38%), even though Blackmoon had the highest number of samples in the initial dataset (Table 1). This is because most of the sequences generated from Blackmoon's samples are not long enough to be considered, i.e., out of 5,819 Blackmoon connections, only 80 are longer than the selected threshold of 20 packets.

5.1 Cluster Content Analysis

Using heatmaps to visualize cluster content helps to find relevant patterns. As Korczyński et al. [28] concluded, the clusters do not model the exhaustive capabilities of malware, but rather how it utilizes the Internet to carry out its objectives, referred to as attacking capabilities. Although the IP address is not used as a feature, one cluster contains all connections broadcasting to 239.255.255.250, which is used by the SSDP protocol to find Plug and Play devices. Another cluster captures all connections broadcasting to 224.0.0.252, which is used by the Link-Local Multicast Name Resolution (LLMNR) protocol to find local network computers. These clusters exhibit device-searching behavior. However, it cannot be concluded with certainty that they are used maliciously.

5.2 Detected Attacking Capabilities

The cluster content analysis reveals valuable threat intelligence information related to malware’s attacking capabilities. Some of these insights can be obtained by merely using IP addresses, but this would make it impossible to cluster behaviorally similar hosts with different IP addresses.

Connection Direction Identification The clustering algorithm is successfully able to identify the direction of traffic flow even though no such feature is used. The clusters and their traffic direction are listed in Table 2. Interestingly, we continue to see this pattern even when port-related features are removed from the clustering. Hence, the sequence of packet sizes and their timing are collectively indicative of the flow direction. This important trait identifies whether the suspicious behavior is originating from inside the network or from outside it.

Port Scan Detection Some clusters capture a port scan (https://whatismyipaddress.com/port-scan), a method for determining open ports on a device in a network. Port scans are usually part of the reconnaissance phase in the attack kill chain [52]. Utilizing sequences of port numbers enables us to detect such suspicious temporal behavior before an attack happens. The clusters show two types of port scans: (i) a systematic port scan, where ports are swept incrementally, which appears as a gradient in the corresponding heatmap, as shown in Figure 9(a); and (ii) a randomized port scan, where ports are contacted randomly, which shows up in the heatmap as a checkered pattern, shown in Figure 9(b). Port scan attempts carried out by different connections are clustered together if they contact the same range of port numbers, which increases their mutual similarity. We show that port numbers help narrow down the categorization of malicious behaviors. This result is in direct contrast with Mohaisen et al. [35], who conclude that port numbers are one of the least useful features for distinguishing malware families.

(a) Systematic port scan.
(b) Randomized port scan.
Figure 9: Example clusters showing systematic and randomized port scans

Split-personality C&C Servers Several instances were observed where connections contacted the same IP address, but the responses were so different that they end up in different clusters. For example, two connections of Gozi-ISFB contact 46.38.238.XX, which has been reported as a malicious server located in Germany. The outgoing connections are identical as they both request for the same resource. However, the responses received are very different – the first response contains a small packet followed by a series of 1200-byte packets, while the second one contains a periodic list of small and large packets in the range of 600 to 1800 bytes. This insight portrays a better picture of the behavior of C&C. In contrast, if connections were clustered using IP addresses, these different behaviors would have been grouped since it is the same host under consideration.

C&C Reuse by Multiple Families One cluster contained connections that contact the same C&C server, even though the connections belong to different malware families. The connections to this server are identical. Figure 12 shows the packet size and interval features of the connections in the cluster. Observe that three Zeus-Panda (ZPA) connections and one Blackmoon (B) connection contact a single IP address (encoded as 009), which has been reported as malicious. The corresponding connections are highlighted in green. The source port of 6 and destination port of 80 remain constant for the whole cluster. This suggests that either the YARA rules mislabeled one of the samples or that the authors of these samples shared the C&C server.

(a) Packet sizes
(b) Interval
Figure 12: Zeus Panda and Blackmoon connections reusing the same C&C server.

Malicious Subnet Identification Several instances were observed where connections within the same cluster contacted IP addresses that fell in the same subnet. For example, two Zeus-Panda connections contact 88.221.14.11 while one Blackmoon connection contacts 88.221.14.16. In another cluster, two Zeus-VM-AES connections contact 62.113.203.55 and another connection, detected 15 days later, contacts 62.113.203.99. This gives ISPs enough actionable intelligence to investigate whether other IPs in the 88.221.14.XX and 62.113.203.XX subnets are also hosting C&C servers. While the same conclusion could be reached using IP addresses alone, our approach additionally identifies behaviorally similar hosts within a subnet.

5.3 Application of Behavioral Clustering

A malware family can be broken down into the attacking capabilities it exhibits. Clustering malware’s attacking capabilities provides an inventory of behaviors. This task is typically manual in nature [5]. However, MalPaCA’s placement of connections into clusters automates the inventory design. Moreover, this way of clustering also identifies families that share common behaviors, providing a behavioral profile of a family that characterizes it much better than a mere family label would.

Behavior Inventory By inspecting the heatmaps of each cluster, we identify the various capabilities they represent. Table 2 lists the behavior captured by each cluster, along with the number of families that have connections in it. The most common behaviors are SSDP and Broadcast traffic, both specific to the Windows OS. Since the malware families in our dataset are Windows-based, this explains why 9 out of 12 families have connections in these two clusters. In contrast, Connection Spam and Malicious Subnet are the rarest behaviors. Malicious Subnet is only observed for connections of Zeus-VM-AES. In addition, Gozi-ISFB opens numerous connections, creating Connection Spam; the incoming connections are stored in one cluster, while the outgoing traffic is split into two clusters due to differences in the type of requests. This detailed behavioral analysis enables the identification of interesting clusters for further analysis. Moreover, the common clusters can be discarded if they contain known-benign behaviors, drastically reducing the number of connections to analyze.

Clus  # families  Behavior              Direction
c1    9 (Common)  SSDP traffic          Out
c2    9 (Common)  Broadcast traffic     Out
c3    4           LLMNR traffic         Out
c4    5           Systematic port scan  In
c5    5           Randomized port scan  Out
c6    1 (Rare)    Connection spam       In
c7    1 (Rare)    Connection spam       Out
c8    1 (Rare)    Malicious subnet      Out
c9    1 (Rare)    Connection spam       Out
c10   2           HTTPs traffic         Out
c11   2           C&C reuse             In
c12   4           HTTPs traffic         In
c13   5           Misc.                 In
c14   3           Misc.                 In
c15   3           Misc.                 In
c16   3           Misc.                 Out
c17   3           Misc.                 Out
c18   4           Misc.                 Out
Table 2: For each cluster: (i) the number of malware families contained in it, (ii) the behavior it captures, and (iii) the direction of its traffic.

Malware Family Characterization We can obtain behavioral profiles of malware families from the diversity of their exhibited behaviors. Table 3 lists the malware families in our dataset and the attacking capabilities they exhibit. Note that each cluster represents a different behavior. We observe 18 different behaviors accounting for roughly 11 high-level capabilities. In the dataset, Dridex, Gozi-EQ, Zeus-P2P and Zeus-v1 only generate either SSDP or Broadcast traffic. Since this traffic comes from standard Windows services, it is likely that the malware was not activated when the associated Pcap files were recorded. Hence, the only connections observed from these families seem benign. In contrast, Gozi-ISFB is the most diverse family. Its connections are found in 16 out of 18 clusters, which exhibit attacking capabilities such as Port Scans and Connection Spamming. Notably, the Connection Spamming behavior is never exhibited by any other malware family in the dataset. There are two reasons for Gozi-ISFB's diversity: (i) Gozi-ISFB is the largest family under consideration with longer sequences on average, so many of its behavioral aspects are captured; and (ii) Gozi-ISFB opens more connections per Pcap file compared to other families. For example, one Pcap of Gozi-ISFB opens 111 connections, while the average number of connections per Pcap file is 3.

                    B  C  D  DL GE GI R  Z  ZP ZPa Zv1 ZVA
SSDP traffic        X  X  X  X  X  X  X  X  -  X   -   X
Broadcast traffic   X  X  -  X  -  X  X  -  X  -   X   X
LLMNR traffic       X  X  -  X  -  X  -  -  -  -   -   -
System. port scan   X  X  -  -  -  X  X  -  -  -   -   X
Random. port scan   X  X  -  -  -  X  X  -  -  -   -   X
In conn spam        -  -  -  -  -  X  -  -  -  -   -   -
Out conn spam       -  -  -  -  -  X  -  -  -  -   -   -
Malicious subnet    -  -  -  -  -  -  -  -  -  -   -   X
In HTTPs            -  X  -  X  -  X  -  -  -  X   -   -
Out HTTPs           -  -  -  -  -  X  -  -  -  X   -   -
C&C reuse           X  -  -  -  -  -  -  -  -  X   -   -
Misc.               X  X  -  X  -  X  -  X  -  X   -   X
# Clusters          7  11 1  8  1  16 4  2  1  7   1   7
Table 3: The behavioral profile of malware families. Columns are malware families; rows are the behaviors each cluster captures.
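A family's behavioral profile can be derived mechanically from the cluster assignments. A minimal sketch, with hypothetical family names and cluster ids (illustrative, not the paper's actual assignments):

```python
from collections import defaultdict

# Hypothetical (family, cluster) pairs produced by the clustering step.
assignments = [
    ("Gozi-ISFB", "c4"), ("Gozi-ISFB", "c6"), ("Gozi-ISFB", "c7"),
    ("Zeus-Panda", "c11"), ("Blackmoon", "c11"), ("Blackmoon", "c1"),
]

# A family's behavioral profile is the set of clusters it appears in;
# each cluster stands for one behavior.
profiles = defaultdict(set)
for family, cluster in assignments:
    profiles[family].add(cluster)

# Families sharing a cluster share a behavior (e.g. C&C reuse in c11).
shared = profiles["Zeus-Panda"] & profiles["Blackmoon"]
```

This directly yields both the per-family row of Table 3 and the cross-family overlaps discussed above.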

5.4 Sequential versus Statistical features

We compare our clustering results with an existing state-of-the-art method by Tegeler et al. [46]. The baseline method produces 22 clusters, with an average size of 21.2 connections per cluster, discarding 265 connections as noise. In comparison, sequence clustering produces 18 clusters, with an average of 25 connections per cluster, discarding 284 connections as noise.

There are three results when comparing sequential with statistical features:
1. With sequence clustering, the majority of the clusters are homogeneous and well-separated. On average, 8.3% of connections per cluster are false positives, i.e. their feature sequences differ from those of the other connections in their cluster.
2. With statistical features, connections in the majority of the clusters appear very different from their fellow members. On average, 57.5% of connections per cluster look visually different from their neighboring connections. Figure 17 shows an example of a cluster from the baseline. It has nine connections, of which six are FPs based on their exhibited behavior. The rightful owners of the cluster are the connections with the least mutual distance, i.e. GI|090|178021, GI|073|610131, GI|073|610346. Compared to these three connections, the other six connections differ significantly in all features except the source port. They were primarily clustered together because their statistical features had the least mutual distance in the whole dataset. The heatmaps, in contrast, clearly show behavioral differences missed by the statistical features.
3. Statistical features are also unable to identify the direction of network traffic. In the cluster shown in Figure 17, there is one incoming connection along with eight outgoing ones. A similar trend is observed for the majority of the clusters: 19 out of 22 contain incoming and outgoing connections together. In contrast, merely using sequences of packet sizes and time intervals was enough to identify traffic direction in sequence clustering.

In summary, while statistical features may be simple to use, they lose behavioral information that plays a crucial role in accurately determining similarities in malware behavior. Sequence clustering models behavior significantly better than statistical features. The most significant difference is observed in the flexibility and leniency of the clusters: different behavioral profiles may look identical from a statistical viewpoint. Hence, given the quality of the captured behavior, sequence clustering is clearly preferable for behavioral modeling.
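To illustrate why aggregates lose behavioral information, consider a minimal sketch with two hypothetical packet-size sequences: a periodic beacon and a burst. Their statistical aggregates are identical, yet even a textbook DTW distance (a simplified stand-in for the distance used in MalPaCA) separates them:

```python
import statistics

# Two hypothetical packet-size sequences with the same multiset of values:
# a periodic beacon vs. a burst. Behaviorally different, statistically alike.
beacon = [100, 1500, 100, 1500, 100, 1500]
burst = [100, 100, 100, 1500, 1500, 1500]

# An aggregate feature vector (mean, stdev, min, max) cannot tell them apart.
def agg(seq):
    return (statistics.mean(seq), statistics.pstdev(seq), min(seq), max(seq))

def dtw(a, b):
    # Classic O(n*m) dynamic-time-warping distance on scalar sequences.
    inf = float("inf")
    d = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    d[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[-1][-1]
```

Here `agg(beacon) == agg(burst)`, so a statistical clustering would merge the two connections, while `dtw(beacon, burst)` is strictly positive and keeps them apart.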

Figure 17: Six out of nine behaviorally different connections clustered together in the baseline version. (a) packet sizes; (b) intervals; (c) source ports; (d) destination ports.

6 Limitations and Future work

Performance optimizations are needed to make sequence clustering more efficient and scalable. Sequential features tend to be slower and more expensive to compute than statistical features. In our method, DTW forms the main bottleneck as sequences grow longer. However, there exist streaming versions of DTW that compute results in real-time; one such technique is presented by Oregi et al. [37]. Locality Sensitive Hashing [3] can also reduce the number of distance computations.
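As a sketch of one standard mitigation (distinct from the streaming approach of Oregi et al.), DTW can be restricted to a Sakoe-Chiba band so that only cells near the diagonal are computed, cutting the per-pair cost from O(n*m) to roughly O(n*band) on long sequences:

```python
def dtw_banded(a, b, band=10):
    # DTW restricted to a Sakoe-Chiba band of half-width `band`:
    # cells with |i - j| > band are never filled, so long sequences
    # cost roughly O(n*band) instead of O(n*m) per pair.
    inf = float("inf")
    n, m = len(a), len(b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - band), min(m, i + band) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]
```

The band width trades accuracy for speed: a narrow band forbids large warps, which is usually acceptable when comparing sequences of similar length, as in our per-connection features.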

Secondly, density-based clustering discards rare events as noise. This makes sense if the dataset is noisy. However, in a purely malicious dataset, the connections that lie in lower-density regions may represent rare (zero-day) attacking capabilities, which the current implementation discards. In addition, real-world traffic generally contains both benign and malicious connections. Further investigation is required to check the noise resiliency of the proposed method in the presence of benign traffic.

Thirdly, the proposed solution suffers from data loss: the analysis is performed on merely 5% of the 8997-connection dataset. This happens for two reasons: (i) the sequence lengths in the dataset form a long-tail distribution, so 91.8% of connections are discarded because they are not long enough for behavioral modeling; and (ii) HDBScan clustering itself discards 3.2% of connections as noise. The former issue can be resolved by collecting more network traffic, while the latter can be resolved by relaxing cluster quality requirements.
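The length filter behind reason (i) can be sketched as follows; the threshold of 20 packets is an assumption for illustration, not the paper's actual cutoff:

```python
MIN_LEN = 20  # assumed minimum sequence length; illustrative only

def filter_short_connections(connections, min_len=MIN_LEN):
    # `connections` maps a connection id to its packet-size sequence.
    # Connections from the long tail of short sequences are dropped,
    # since they carry too little behavior to model.
    return {cid: seq for cid, seq in connections.items() if len(seq) >= min_len}
```

Raising the threshold sharpens behavioral modeling but discards more data, which is exactly the trade-off described above.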

Threats to validity. The filtered Pcap files used in the evaluations are merely 2.5% of the original dataset (1196 out of 45k Pcaps). Moreover, the dataset contains only financial malware, which is not representative of all malware types. Hence, although the parameters were chosen using a configuration dataset, they may be biased towards financial malware.

The specificity of the clustered behaviors is highly dependent on the length of sequences. Shorter sequences only capture the handshake, while longer sequences also capture additional behavior. At longer lengths, significantly more clusters are formed, each highly specific to a certain kind of behavior. At shorter lengths, those differences diminish and the clusters start to merge. For example, at longer sequence lengths several clusters capture slightly different variations of port scans, while at shorter lengths all those variations merge into a few clusters.

Future work. Formalizing cluster quality without ground truth is a fundamental challenge in clustering. We demonstrate that visualizing feature values after clustering is a good way of measuring cluster quality. Automating the proposed FP analysis may be the first step towards designing an adequate cluster quality metric. This paper introduces a cluster analysis method but does not yet detect malware. We will develop a detection module that puts the proposed behavioral profiles into action.

7 Related work

Early studies on malware classification required disassembling malware binaries to extract features [27, 47]. This process is not only slow but also lacks accuracy due to the existence of packed and obfuscated malware. Instead, behavior-based approaches have recently gained attention [31, 26]. They execute the malware sample in a controlled environment and collect the generated traces [13].
Research on malware analysis generally falls into two strains: network traffic analysis and system activity analysis. Perdisci et al. [39] show malware samples that perform significantly different system-level activities while having identical network traces. They propose a 3-step clustering method as an alternative to the system-level clustering approach presented by Bayer et al. [3]. We also limit ourselves to network traffic analysis because it shows the core behavior of malware [8] through direct interactions with the attacker or C&C server.
Perdisci et al. [39] use a feature set extracted from the request URL of the malware, which limits their method's applicability to HTTP-based malware. Similarly, much of the existing work focuses on other protocol-specific malware, such as DNS-based malware [40, 30] and HTTPs-based malware [2, 34]. Existing work also places emphasis on Deep Packet Inspection [21, 53], which does not work out-of-the-box when traffic is encrypted. Most existing approaches use statistical features, often relying on sampled netflows for traffic analysis [4, 38, 17, 46]. In such cases, only high-level features are available for behavior modeling. Garcia [17] builds a behavioral Intrusion Prevention System using the size, duration and periodicity of flows. Tegeler et al. [46] build a classification system to detect bot-infected network traffic using high-level features such as the average interval, duration and size of flows; they model temporal periodicity using Fourier transforms.
Nevertheless, there also exist papers that use sequential features extracted from netflows for behavioral modeling. Pellegrino et al. [38] learn state machines from sequential netflow data in order to detect bot-infected traffic. Hammerschmidt et al. [19] use sequences of netflows to cluster host behavior over time. Korczyński et al. [28] use Markov chains to fingerprint the encrypted traffic generated by 12 everyday-use applications such as Skype, Paypal, and Twitter. However, these methods require long uninterrupted sequences to provide a reasonable statistical distribution of data. In practice, malware-related data is often noisy.


Calculating the distance between sequential features is also a challenging problem. There exist bioinformatics-inspired solutions using sequence alignment [48], but they require pre-computed substitution matrices, which currently do not exist for malware. Chan [9] proposes the use of the Longest Common Subsequence (LCS) to measure the distance between sequences of accessed resources in Android apps, in order to group similar apps. In network traffic, however, the LCS would be formed by delays and noisy packets, overshadowing the actual behavior. We primarily use Dynamic Time Warping because of its proven resilience to noise and delays [14], which are common attributes of network traffic.
Malware analysis research is seeing a spike in new methods [35, 4]. One common problem with these methods is the complexity of both reproducing and understanding them: they generally involve multiple filtering phases, turning them into black boxes and giving malware analysts little control.
These proposed methods also assume the availability of some ground truth labels. Perdisci et al. [39] show two malware samples with identical network traffic that were assigned different family labels only because their system-level activities differed. This shows the non-generalizable nature of labels assigned by Anti-Virus (AV) vendors. Most work that clusters malware samples using their network behavior relies on AV-provided labels to calculate the accuracy of its approach. However, research consistently shows that AV vendors do not use a standardized naming convention for malware samples [42, 25]. These labels are heavily based on static analysis [15] and system-level behavioral analysis [3, 44], rather than network-level behavioral analysis.

8 Conclusion

In this paper, we propose MalPaCA, a network traffic-based, intuitive, and protocol-agnostic methodology to cluster malware according to its attacking capabilities using sequence clustering. Instead of simply providing cluster labels with an abstract cluster score, we propose a visualization-based cluster evaluation methodology. The key advantage of this methodology is its white-box nature, allowing malware analysts to investigate, understand, and even correct labels, if necessary. We implement MalPaCA and evaluate it on real-world financial malware samples collected in the wild. The clusters identified in this study capture various attacking capabilities, such as port scans and reuse of C&C servers. We discover a number of samples that do not adhere to their family names, either because of incorrect labeling by black-box solutions or extensive overlap in the families’ behavior. We also show that sequence clustering outperforms state-of-the-art statistical features because sequences capture temporal behavior better.

MalPaCA with its visualization can actively support the investigation of new, unknown malware samples. The resulting behavioral clusters give malware researchers a more informative and actionable characterization of malware than current family designations.

References

  • [1] Abou-Assaleh, T., Cercone, N., Keselj, V., Sweidan, R.: Detection of new malicious code using n-grams signatures. In: PST. pp. 193–196 (2004)
  • [2] Anderson, B., Paul, S., McGrew, D.: Deciphering malware’s use of tls (without decryption). Journal of Computer Virology and Hacking Techniques 14(3) (2017)
  • [3] Bayer, U., Comparetti, P.M., Hlauschek, C., Kruegel, C., Kirda, E.: Scalable, behavior-based malware clustering. In: NDSS. vol. 9, pp. 8–11. Citeseer (2009)
  • [4] Bilge, L., Balzarotti, D., Robertson, W., Kirda, E., Kruegel, C.: Disclosure: detecting botnet command and control servers through large-scale netflow analysis. In: ACSAC. pp. 129–138. ACM (2012)
  • [5] Black, P., Gondal, I., Layton, R.: A survey of similarities in banking malware behaviours. Computers & Security (2017)
  • [6] Brenner, B.: 2018 malware forecast: ransomware hits hard, continues to evolve. https://news.sophos.com/en-us/2017/11/02/2018-malware-forecast-ransomware-hits-hard-crosses-platforms/ (2018)
  • [7] Campello, R.J., Moulavi, D., Sander, J.: Density-based clustering based on hierarchical density estimates. In: PAKDD. pp. 160–172. Springer (2013)
  • [8] Cavallaro, L., Kruegel, C., Vigna, G., Yu, F., Alkhalaf, M., Bultan, T., Cao, L., Yang, L., Zheng, H., Cipriano, C.C., et al.: Mining the network behavior of bots. Technical Report 2009-12 (2009)
  • [9] Chan, N.W.H.: SCANNER: Sequence Clustering of resource Access to find Nearest Neighbors. Master’s thesis, Rochester Institute of Technology (2015)
  • [10] Crowe, J.: 10 must-know cybersecurity statistics for 2018. https://blog.barkly.com/2018-cybersecurity-statistics (2018)
  • [11] Davies, D.L., Bouldin, D.W.: A cluster separation measure. TPAMI (1979)
  • [12] van Eeten, M.J., Bauer, J.M.: Economics of malware: Security decisions, incentives and externalities. OECD Science, Technology and Industry Working Papers (2008)
  • [13] Egele, M., Scholte, T., Kirda, E., Kruegel, C.: A survey on automated dynamic malware-analysis techniques and tools. CSUR 44(2),  6 (2012)
  • [14] Elfeky, M.G., Aref, W.G., Elmagarmid, A.K.: Warp: time warping for periodicity detection. In: Data Mining (ICDM). IEEE (2005)
  • [15] Feng, Y., Anand, S., Dillig, I., Aiken, A.: Apposcopy: Semantics-based detection of android malware through static analysis. In: SIGSOFT. pp. 576–587. ACM (2014)
  • [16] Gadge, J., Patil, A.A.: Port scan detection. In: ICON. pp. 1–6. IEEE (2008)
  • [17] Garcia, S.: Modelling the network behaviour of malware to block malicious patterns. the stratosphere project: a behavioural ips. Virus Bulletin (2015)
  • [18] Gratian, M., Bhansali, D., Cukier, M., Dykstra, J.: Identifying infected users via network traffic. Computers & Security (2018)
  • [19] Hammerschmidt, C., Marchal, S., State, R., Verwer, S.: Behavioral clustering of non-stationary ip flow record data. In: 2016 12th International Conference on Network and Service Management (CNSM). pp. 297–301. IEEE (2016)
  • [20] Harding, S.M., Croft, W.B., Weir, C.: Probabilistic retrieval of ocr degraded text using n-grams. Research and Advanced Technology for Digital Libraries (1997)
  • [21] Ho, T., Cho, S.J., Oh, S.R.: Parallel multiple pattern matching schemes based on cuckoo filter for deep packet inspection on graphics processing units. IET Information Security (2018)
  • [22] Jiang, X., Zhou, Y.: Dissecting android malware: Characterization and evolution. In: S&P. pp. 95–109. IEEE (2012)
  • [23] Kalgutkar, V., Stakhanova, N., Cook, P., Matyukhina, A.: Android authorship attribution through string analysis. In: ARES. p. 4. ACM (2018)
  • [24] Kanlayasiri, U., Sanguanpong, S., Jaratmanachot, W.: A rule-based approach for port scanning detection. In: ICEEE. pp. 485–488 (2000)
  • [25] Kantchelian, A., Tschantz, M.C., Afroz, S., Miller, B., Shankar, V., Bachwani, R., Joseph, A.D., Tygar, J.D.: Better malware ground truth: Techniques for weighting anti-virus vendor labels. In: AISec. pp. 45–56. ACM (2015)
  • [26] Kirda, E.: Malware behavior clustering. In: Encyclopedia of Cryptography and Security, pp. 751–752. Springer (2011)
  • [27] Kolter, J.Z., Maloof, M.A.: Learning to detect and classify malicious executables in the wild. Journal of Machine Learning Research 7(Dec), 2721–2744 (2006)
  • [28] Korczyński, M., Duda, A.: Markov chain fingerprinting to classify encrypted traffic. In: Infocom. pp. 781–789. IEEE (2014)
  • [29] Kovacs-Vajna, Z.M.: A fingerprint verification system based on triangular matching and dynamic time warping. TPAMI 22(11), 1266–1276 (2000)
  • [30] Lee, J., Lee, H.: Gmad: Graph-based malware activity detection by dns traffic analysis. Computer Communications 49, 33–47 (2014)
  • [31] Lee, T., Mody, J., Lin, Y., Marinescu, A., Polyakov, A.: Application behavioral classification. US Patent App. 11/608,625 (Jun 14 2007)
  • [32] Li, P., Liu, L., Gao, D., Reiter, M.K.: On challenges in evaluating malware clustering. In: RAID. pp. 238–255. Springer (2010)
  • [33] Li, W.J., Wang, K., Stolfo, S.J., Herzog, B.: Fileprints: Identifying file types by n-gram analysis. In: IAW, SMC. pp. 64–71. IEEE (2005)
  • [34] Li, Y., Hao, W.: Research of encrypted network traffic type identification. Journal of Computer Applications 29(6), 1662–1664 (2009)
  • [35] Mohaisen, A., Alrawi, O., Mohaisen, M.: Amal: High-fidelity, behavior-based automated malware analysis and classification. computers & security 52 (2015)
  • [36] Moser, A., Kruegel, C., Kirda, E.: Limits of static analysis for malware detection. In: ACSAC. pp. 421–430. IEEE (2007)
  • [37] Oregi, I., Pérez, A., Del Ser, J., Lozano, J.A.: On-line dynamic time warping for streaming time series. In: ECML-PKDD. pp. 591–605. Springer (2017)
  • [38] Pellegrino, G., Lin, Q., Hammerschmidt, C., Verwer, S.: Learning behavioral fingerprints from netflows using timed automata. In: IFIP. pp. 308–316. IEEE (2017)
  • [39] Perdisci, R., Lee, W., Feamster, N.: Behavioral clustering of http-based malware and signature generation using malicious network traces. In: NSDI. vol. 10 (2010)
  • [40] Pomorova, O., Savenko, O., Lysenko, S., Kryshchuk, A., Bobrovnikova, K.: Cn. In: ICCN. pp. 127–138. Springer (2015)
  • [41] Rousseeuw, P.J.: Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of computational and applied mathematics 20 (1987)
  • [42] Sebastián, M., Rivera, R., Kotzias, P., Caballero, J.: Avclass: A tool for massive malware labeling. In: RAID. pp. 230–253. Springer (2016)
  • [43] Suarez-Tangil, G., Tapiador, J.E., Peris-Lopez, P., Blasco, J.: Dendroid: A text mining approach to analyzing and classifying code structures in android malware families. Expert Systems with Applications 41(4), 1104–1117 (2014)
  • [44] Sun, M., Li, X., Lui, J.C., Ma, R.T., Liang, Z.: Monet: a user-oriented behavior-based malware variants detection system for android. TIFS 12(5) (2017)
  • [45] Tajalizadehkhoob, S., Asghari, H., Gañán, C., Van Eeten, M.: Why them? extracting intelligence about target selection from zeus financial malware. In: WEIS
  • [46] Tegeler, F., Fu, X., Vigna, G., Kruegel, C.: Botfinder: Finding bots in network traffic without deep packet inspection. In: CoNEXT. pp. 349–360. ACM (2012)
  • [47] Tian, R., Batten, L., Islam, R., Versteeg, S.: An automated classification system based on the strings of trojan and virus families. In: MALWARE. IEEE (2009)
  • [48] Vinod, P., Laxmi, V., Gaur, M., Chauhan, G.: Momentum: metamorphic malware exploration techniques using msa signatures. In: IIT. pp. 232–237. IEEE (2012)
  • [49] Volis, G., Makris, C., Kanavos, A.: Two novel techniques for space compaction on biological sequences. WEBIST (2016)
  • [50] Wang, A., Mohaisen, A., Chang, W., Chen, S.: Capturing ddos attack dynamics behind the scenes. In: DIMVA. pp. 205–215. Springer (2015)
  • [51] Wueest, C.: New: Financial threats review 2017: An istr special report — symantec connect. https://www.symantec.com/connect/forums/new-financial-threats-review-2017-istr-special-report (2018)
  • [52] Yadav, T., Rao, A.M.: Technical aspects of cyber kill chain. In: SSCC (2015)
  • [53] Yu, C., Lan, J., Xie, J., Hu, Y.: Qos-aware traffic classification architecture using machine learning and deep packet inspection in sdns. Procedia computer science 131, 1209–1216 (2018)