1 Introduction
The term malware is used to describe malicious software designed with the specific purpose of exploiting vulnerabilities in computer systems. In general, any given malware aims at compromising such systems. Malware can be further divided into nonexclusive categories such as trojans, viruses, adware, worms, etc. Malware development may have had an innocent start, but there are now multiple examples suggesting this is a multimillion business sometimes associated with organised crime NHS.BBC; Ransomware.BBC; Ransomware2.BBC; Ransomware3.BBC. Malware has also been used in acts of sabotage, and may even have political motivations PoliticalMalware.BBC.
Here, we are interested in clustering malware data. That is, identifying homogeneous groups (ie. clusters) of malware in a given data set, without the need for labelled samples for an algorithm to learn from. Once one is able to say that a new malware sample should be assigned to a cluster containing homogeneous malware samples (for a longer discussion of what a cluster is, see hennig2015true) it becomes easier to create defence mechanisms. Clustering algorithms (for complete reviews, see mirkin2012clustering; aggarwal2014data and references therein) can usually be divided into two groups: partitional and hierarchical. The latter includes algorithms able to produce a clustering as well as information regarding the relationships that exist between clusters (this information can be represented with the help of a dendrogram). Partitional clustering algorithms identify disjoint clusters in a given data set, so that each object (a malware sample, in our case) in the data set is assigned to a single cluster. Although hierarchical algorithms produce information one would usually associate with families (ie. a tree), this comes with a computational cost. In 2018 alone, 246,002,762 new malware variants were found istr2019. Hence, we see partitional clustering as a more realistic approach for the real world.
There are various examples of clustering algorithms applied to malware data in the literature (for instance, faridi2018performance; asquith2016extremely; amorim_komisarczuk and references therein). However, these apply classical normalisation to the data sets (for details, see Section 3). This type of normalisation (eg. $z$-score, range normalisation, unit length, etc.) aims at putting all features used to describe an object at the same level. It does not favour more meaningful features over those that are less meaningful. There is considerable similarity between malware samples (at times some are even said to belong to the same family). Hence, it is reasonable to expect that there will be a considerable amount of overlap between clusters. Also, malware samples have a tendency to be released in bursts with a skewed distribution song2016learning. This scenario makes clustering particularly difficult.
In this paper we introduce a novel method to deal with the problem described above. Our method is capable of increasing the separation between clusters during the data preprocessing stage. It does so by calculating the within-cluster degree of relevance of each feature in a given data set, and using this as a rescaling factor. By iterating this process our method increases the quality of malware clusterings, as measured by the average silhouette index rousseeuw1987silhouettes. We apply our method to drive-by download malware data. This refers to malware delivered to client systems that browse resources on the web, usually via HTTP. This is a very timely issue given that in 2019 Symantec reported one in ten uniform resource locators (URLs) to be malicious istr2019.
The remainder of this paper is organised as follows. Section 2 discusses the clustering algorithms that are directly relevant to our research, as well as a method to measure how good a clustering is. Section 3 briefly explains the classical normalisation algorithms used in the preprocessing stage. Section 4 explains our method, together with its mathematical motivation. Sections 5 and 6 explain our process of data gathering, our methodology, and the results we have obtained. Finally, Section 7 presents our conclusions, and indications of future work.
2 Related work
Given our objective, the work we discuss in this section relates to clustering algorithms one could use to cluster malware samples. Partitional clustering algorithms aim at identifying a set $S = \{S_1, S_2, \ldots, S_K\}$ so that each $S_k$ contains homogeneous objects, and $S_k \cap S_l = \emptyset$ for $k \neq l$. $k$-means macqueen1967some is arguably the most popular such algorithm jain2010data; steinley2006k. Given a data set $X$ containing $n$ objects $x_i$, each described over $m$ features, $k$-means minimises the within-cluster distance
$$W(S, Z) = \sum_{k=1}^{K} \sum_{x_i \in S_k} \sum_{v=1}^{m} (x_{iv} - z_{kv})^2, \qquad (1)$$
where $z_k \in Z$ is the centroid of cluster $S_k$, that is, the $m$-dimensional point with the lowest sum of distances to all objects in $S_k$. We defined $z_k$ as a point to make it clear that it may or may not belong to $X$. The $k$-means algorithm iteratively minimises (1) with three simple steps:

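1. Select $K$ objects from $X$ uniformly at random, and use their values to initialise $z_1, z_2, \ldots, z_K$.

2. Assign each $x_i \in X$ to the cluster $S_k$ represented by the centroid $z_k$ that is the nearest to $x_i$.

3. Update each $z_k$ to the centre of $S_k$. If any $z_k$ has changed, go back to Step 2.
The $k$-means criterion (1) applies the squared Euclidean distance. Hence, the centre of a cluster is the componentwise mean of its objects, that is $z_{kv} = \frac{1}{|S_k|} \sum_{x_i \in S_k} x_{iv}$ for $v = 1, 2, \ldots, m$.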
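The three steps above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the implementation used in our experiments: the function names, the fixed `seed`, the `max_iter` cap and the choice to leave an empty cluster's centroid where it was are all assumptions made for the example.

```python
import random


def squared_euclidean(x, z):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(x, z))


def component_mean(points):
    """Componentwise mean of a non-empty list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def k_means(X, K, seed=0, max_iter=100):
    """Plain k-means: random initial centroids, then assign/update until stable."""
    rng = random.Random(seed)
    Z = [list(x) for x in rng.sample(X, K)]  # Step 1: K objects chosen at random
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each object to its nearest centroid.
        new_labels = [
            min(range(K), key=lambda k: squared_euclidean(x, Z[k])) for x in X
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids would not move
        labels = new_labels
        # Step 3: move each centroid to the componentwise mean of its cluster.
        for k in range(K):
            members = [x for x, lab in zip(X, labels) if lab == k]
            if members:  # assumption: an empty cluster keeps its old centroid
                Z[k] = component_mean(members)
    return labels, Z


def criterion(X, labels, Z):
    """Within-cluster distance, criterion (1)."""
    return sum(squared_euclidean(x, Z[lab]) for x, lab in zip(X, labels))
```

Calling `k_means(X, K)` returns one label per object together with the final centroids, and `criterion` evaluates (1) for that clustering, which is what one would compare across several random restarts.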
As popular as $k$-means may be, it does have known weaknesses. The most relevant in this paper are: (i) the final clustering depends heavily on the initial set of centroids, which are usually found at random. Suboptimal initial centroids are likely to lead the algorithm to a local minimum; (ii) $k$-means requires the number of clusters, $K$, to be known beforehand; (iii) all features are treated as if they were equally relevant, which is rather unlikely in real-world data sets.
There has been a considerable research effort to address the weaknesses above. For instance, $k$-means++ arthur2007k selects one object uniformly at random, and then copies its values to the first centroid $z_1$. All other initial centroids are selected following a weighted probability that is proportional to the squared distance between each object and its nearest centroid.

1. Set $k = 1$. Select an object from $X$ uniformly at random and copy its values to $z_1$.

2. Increment $k$ by one. Select an object $x_i \in X$ at random, with probability $D(x_i)^2 / \sum_{x_j \in X} D(x_j)^2$, and copy its values to $z_k$, setting $Z = \{z_1, \ldots, z_k\}$.

3. Repeat Step 2 until $k = K$.

4. Run $k$-means using each $z_k \in Z$ as an initial centroid.
In the algorithm above $D(x_i)$ is the distance between $x_i$ and its nearest centroid in $Z$. Experiments show that $k$-means++ converges faster, and to a lower value of criterion (1), than the traditional $k$-means algorithm arthur2007k. This algorithm has enjoyed considerable popularity, and it is now the default $k$-means option in MATLAB MATLAB:2019 and scikit-learn scikitlearn.
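The seeding stage of $k$-means++ can be sketched as below. This is a minimal illustration assuming squared Euclidean distances; `kmeanspp_seeds` and its `seed` parameter are hypothetical names, and the fallback to a uniform choice when every remaining distance is zero is an assumption added to keep the example well defined.

```python
import random


def squared_euclidean(x, z):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(x, z))


def kmeanspp_seeds(X, K, seed=0):
    """Choose K initial centroids, weighting each draw by squared distance."""
    rng = random.Random(seed)
    Z = [list(rng.choice(X))]  # first centroid: uniform at random
    while len(Z) < K:
        # Squared distance from each object to its nearest chosen centroid.
        d2 = [min(squared_euclidean(x, z) for z in Z) for x in X]
        if sum(d2) == 0:
            # Assumption: if all objects coincide with a centroid, pick uniformly.
            Z.append(list(rng.choice(X)))
            continue
        # Draw one object with probability proportional to its squared distance.
        Z.append(list(rng.choices(X, weights=d2, k=1)[0]))
    return Z
```

The returned list would then be handed to $k$-means as its initial centroids in place of a uniform random selection.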
The intelligent $k$-means algorithm (ik-means) mirkin2012clustering addresses both weaknesses (i) and (ii) by identifying good initial centroids for $k$-means as well as the number of clusters in $X$. This algorithm iteratively identifies anomalous clusters; the centroids of these anomalous clusters are then used as initial centroids for $k$-means. Here, $K$ is set to the number of anomalous clusters in $X$.

1. Set $c$ to the centre of $X$ and, if this is the first pass, $Z = \emptyset$. Identify $z_t$, the object that is the furthest from $c$.

2. Apply $k$-means to the data set using $z_t$ and $c$ as initial centroids, but do not allow $c$ to move in the cluster update step. This will lead to clusters $S_t$ and $S_c$ with centroids $z_t$ and $c$, respectively.

3. If $|S_t| \geq \theta$, set $Z = Z \cup \{z_t\}$. In any case, remove each $x_i \in S_t$ from $X$.

4. If $|X| > 0$ go to Step 1. Otherwise run $k$-means by setting $K = |Z|$, and using each $z_t \in Z$ as an initial centroid.
In the above $\theta$ is a user-defined parameter that helps to avoid small clusters in $S$, should this be of interest to the user. If the value of $K$ is known one can sort the elements of $Z$ by the cardinality of their initial clusters (ie. their value of $|S_t|$ in Step four), and keep only the $K$ elements of $Z$ with the highest cardinality (this would happen between Steps three and four). Another approach would be to select the first $K$ elements of $Z$ in the order they were found. This way they would be the most anomalous initial centroids.
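We can see that ik-means identifies each centroid $z_t$, and related cluster $S_t$, by iteratively minimising
$$W(S_t, z_t) = \sum_{x_i \in S_t} \sum_{v=1}^{m} (x_{iv} - z_{tv})^2 + \sum_{x_i \notin S_t} \sum_{v=1}^{m} (x_{iv} - c_v)^2, \qquad (2)$$
where $c$ is the componentwise mean of $X$.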
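In order to address all three weaknesses (i), (ii) and (iii), we introduced the intelligent Minkowski weighted $k$-means (imwk-means) de2012minkowski. This extends ik-means by following the intuitive idea that a given feature $v$ may have different degrees of relevance at each cluster $S_k$. We model this behaviour by introducing $w_{kv}$, the weight of feature $v$ at cluster $S_k$. The higher $w_{kv}$ is, the higher the contribution of $v$ at cluster $S_k$ is to the clustering. First, we define the weighted Minkowski distance between $x_i$ and $z_k$ as
$$d_p(x_i, z_k) = \sum_{v=1}^{m} w_{kv}^p |x_{iv} - z_{kv}|^p. \qquad (3)$$
The above is in fact the $p$th power of the Minkowski distance, which is analogous to the use of the squared Euclidean distance in $k$-means. This approach saves the computational effort of calculating $p$th roots, and does not change the clusterings produced by the algorithm. The imwk-means algorithm minimises
$$W_p(S, Z, w) = \sum_{k=1}^{K} \sum_{x_i \in S_k} \sum_{v=1}^{m} w_{kv}^p |x_{iv} - z_{kv}|^p, \qquad (4)$$
subject to
$$\sum_{v=1}^{m} w_{kv} = 1 \quad \text{and} \quad w_{kv} \geq 0, \quad \text{for } k = 1, 2, \ldots, K \text{ and } v = 1, 2, \ldots, m. \qquad (5)$$
Leading to
$$w_{kv} = \frac{1}{\sum_{u=1}^{m} \left[ D_{kv} / D_{ku} \right]^{\frac{1}{p-1}}}, \qquad (6)$$
where $D_{kv}$ is the dispersion of feature $v$ at cluster $S_k$, given by $D_{kv} = \sum_{x_i \in S_k} |x_{iv} - z_{kv}|^p$. We usually add a small constant to each dispersion to avoid a division by zero in (6) when $v$ perfectly discriminates $S_k$ (ie. for all $x_i \in S_k$, $x_{iv}$ has the same value). The imwk-means algorithm can be described as follows.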

1. Set $c$ to be the Minkowski centre of $X$, $X'$ to be a copy of $X$, and each weight to $1/m$.

2. Find the object $x_i \in X$ that is the farthest from $c$ using (3), and copy its values to $z_t$.

3. Assign each $x_i \in X$ to either $S_t$ or $S_c$, depending on which centroid is the nearest to $x_i$ ($z_t$ or $c$) as per (3). If this step does not change either $S_t$ or $S_c$, go to Step 5.

4. Update $z_t$ to the Minkowski centre of its cluster $S_t$. Update each weight as per (6). Go back to Step 3.

5. If $|S_t| \geq \theta$, add $z_t$ to $Z$ and its weights $w_t$ to $W$. In any case, remove all objects $x_i \in S_t$ from $X$. If $|X| > 0$ go to Step 2.

6. Assign each $x_i \in X'$ to the cluster $S_k$ whose centroid $z_k \in Z$ is the nearest to $x_i$ as per (3). If this step produces no change to any $S_k$, stop.

7. Update each $z_k \in Z$ to the Minkowski centre of its cluster $S_k$. Update each $w_{kv}$ as per (6). Go back to Step 6.
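The weight update (6) used in the steps above can be illustrated for a single cluster as follows. This is a sketch under stated assumptions: `feature_weights` is a hypothetical helper, and the default value of `eps` (the small constant added to each dispersion) is chosen arbitrarily for the example.

```python
def feature_weights(dispersions, p, eps=1e-4):
    """Weights of one cluster from its per-feature dispersions, as in (6).

    eps is an assumed small constant added to every dispersion so that a
    feature with zero dispersion does not cause a division by zero.
    """
    if p <= 1:
        raise ValueError("the exponent p must be greater than 1")
    D = [d + eps for d in dispersions]
    exponent = 1.0 / (p - 1.0)
    # w_v = 1 / sum_u (D_v / D_u)^(1 / (p - 1))
    return [1.0 / sum((D_v / D_u) ** exponent for D_u in D) for D_v in D]
```

For instance, with dispersions 1, 1 and 4, $p = 2$ and no added constant, (6) gives the weights 4/9, 4/9 and 1/9: the feature with the highest dispersion in the cluster receives the lowest weight, and the weights add up to one as required by (5).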
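The Minkowski centre of a feature $v$ at cluster $S_k$ with an exponent $p$ is the value $\mu$ leading to the lowest $d(\mu) = \sum_{x_i \in S_k} |x_{iv} - \mu|^p$. Notice $d(\mu)$ is a convex function for $p \geq 1$. Hence, one can approximate its minimum by setting $\mu$ to the mean of $v$ over $S_k$, and then keep moving $\mu$ by a small number (0.0001, say) to the side that minimises $d(\mu)$.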
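A minimal sketch of this one-dimensional search is shown below. Starting from the mean and the `max_steps` safety cap are assumptions made for the example, and `minkowski_centre` is a hypothetical name rather than a function from any library.

```python
def minkowski_cost(values, mu, p):
    """Sum of |x - mu|^p over the values of one feature in one cluster."""
    return sum(abs(x - mu) ** p for x in values)


def minkowski_centre(values, p, step=1e-4, max_steps=1_000_000):
    """Approximate the minimiser of the cost by fixed-size moves from the mean."""
    mu = sum(values) / len(values)
    best = minkowski_cost(values, mu, p)
    for _ in range(max_steps):
        left = minkowski_cost(values, mu - step, p)
        right = minkowski_cost(values, mu + step, p)
        if left < best and left <= right:
            mu, best = mu - step, left
        elif right < best:
            mu, best = mu + step, right
        else:
            break  # neither neighbour improves: within one step of the minimum
    return mu
```

With $p = 2$ the search stops immediately, since the mean already minimises the sum of squared differences. For other exponents the result is accurate to within one `step`, which follows from the convexity noted above.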
If one knows how many clusters a data set has, we can restate Step 5 as “Add $z_t$ to $Z$ and $w_t$ to $W$. Remove all objects $x_i \in S_t$ from $X$. If $|X| > 0$ go to Step 2”. This would require a new step between 5 and 6: “Keep in $Z$ and $W$ only the elements related to the $K$ clusters with the highest cardinality.”, which is the approach used in the original paper (see de2012minkowski). Of course, very much like in ik-means it is also possible to remove all but the first $K$ tentative centroids from $Z$.
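A suitable Minkowski exponent $p$ can be found using a consensus clustering approach de2017minkowski. This requires one to run imwk-means with values of $p$ from 1.1 to 5.0 in steps of 0.1, leading to 40 clusterings. The chosen $p$ is that of the clustering with the highest average similarity to all other 39 clusterings, usually measured using the Adjusted Rand Index (ARI) (rand1971objective). Given two clusterings $S$ and $S'$, the ARI is defined as
$$ARI(S, S') = \frac{\sum_{k,l} \binom{n_{kl}}{2} - \left[ \sum_{k} \binom{a_k}{2} \sum_{l} \binom{b_l}{2} \right] / \binom{n}{2}}{\frac{1}{2} \left[ \sum_{k} \binom{a_k}{2} + \sum_{l} \binom{b_l}{2} \right] - \left[ \sum_{k} \binom{a_k}{2} \sum_{l} \binom{b_l}{2} \right] / \binom{n}{2}}, \qquad (7)$$
where $n_{kl} = |S_k \cap S'_l|$, $a_k = |S_k|$, $b_l = |S'_l|$. The ARI is corrected for chance.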
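The ARI in (7) can be computed directly from two lists of cluster labels, as in the sketch below. The function name is an assumption, the inputs are assumed to describe the same $n \geq 2$ objects in the same order, and returning 1.0 when the denominator is zero is a convention chosen for the example.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two label lists over the same objects."""
    n = len(labels_a)
    contingency = Counter(zip(labels_a, labels_b))  # the counts n_kl
    sum_kl = sum(comb(c, 2) for c in contingency.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        # Assumed convention, e.g. when both clusterings have a single cluster.
        return 1.0
    return (sum_kl - expected) / (max_index - expected)
```

Because only co-membership counts enter the formula, the index does not depend on how the clusters are named: two clusterings that differ only by a relabelling of the clusters obtain the maximum value of one.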
As well as being able to cluster a data set, one must be able to decide whether a given clustering represents the actual structure of the data set without the use of labels. Clustering validity indices (CVIs) are usually used for this purpose. There is no clear evidence in the literature showing a particular CVI to be the best in all cases; however, the average silhouette width rousseeuw1987silhouettes usually performs well arbelaitz2013extensive.
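For a given $x_i \in S_k$, let $a(x_i) = \frac{1}{|S_k| - 1} \sum_{x_j \in S_k, j \neq i} d(x_i, x_j)$. That is, $a(x_i)$ is the average distance between $x_i$ and all other objects in its cluster. A low $a(x_i)$ indicates the suitability of the assignment of $x_i$ to $S_k$. Let $b(x_i) = \min_{l \neq k} \frac{1}{|S_l|} \sum_{x_j \in S_l} d(x_i, x_j)$. That is, $b(x_i)$ is the average distance between $x_i$ and the objects of its closest neighbouring cluster. A high $b(x_i)$ indicates the unsuitability of assigning $x_i$ to the closest cluster to $S_k$. The silhouette index of $x_i$ is given by
$$s(x_i) = \frac{b(x_i) - a(x_i)}{\max\{a(x_i), b(x_i)\}}. \qquad (8)$$
Clearly, $-1 \leq s(x_i) \leq 1$. An $s(x_i)$ close to one indicates $x_i$ is closer to the other objects in its cluster than to objects in other clusters. We can expand this measure to deal with all $x_i \in X$, that is $\frac{1}{n} \sum_{i=1}^{n} s(x_i)$.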
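The average silhouette described above can be sketched as follows. The use of the Euclidean distance for $d$, the function names, and the convention of assigning $s(x_i) = 0$ to objects in singleton clusters are assumptions made for this illustration.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def average_silhouette(X, labels):
    """Average of s(x_i) over all objects, following (8)."""
    clusters = {}
    for x, lab in zip(X, labels):
        clusters.setdefault(lab, []).append(x)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")
    total = 0.0
    for x, lab in zip(X, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # assumed convention: s(x_i) = 0 for a singleton cluster
        # a: average distance to the other members (distance to itself is 0).
        a = sum(euclidean(x, y) for y in own) / (len(own) - 1)
        # b: lowest average distance to the members of any other cluster.
        b = min(
            sum(euclidean(x, y) for y in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        total += (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return total / len(X)
```

A clustering that matches two well-separated groups yields a value close to one, while a labelling that cuts across those groups yields a negative value.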
3 Classical data normalisation
Often, data sets contain features with different variances. Features with a higher variance will have a higher average distance than features with a lower variance. Hence, the former will have a higher contribution to the clustering than the latter. This common issue highlights the importance of data preprocessing. In this paper, we normalise our data set (for details on the data set itself see Section 5) using
$$x'_{iv} = \frac{x_{iv} - \bar{x}_v}{\max_{j} x_{jv} - \min_{j} x_{jv}}, \qquad (9)$$
where $\bar{x}_v = \frac{1}{n} \sum_{i=1}^{n} x_{iv}$, the average of feature $v$ over all objects in $X$. The $z$-score is also a popular choice in this scenario; it is given by
$$x'_{iv} = \frac{x_{iv} - \bar{x}_v}{\sigma_v},$$
where $\sigma_v$ is the standard deviation of $v$ over all objects in $X$. We favoured range normalisation (9) over the $z$-score because the latter is biased towards features following a unimodal distribution. This is perhaps easier to explain with an example. Let the features $v_1$ and $v_2$ be unimodal and multimodal, respectively. The standard deviation of $v_2$ will be higher than that of $v_1$, thus, the $z$-score of $v_1$ will be higher than that of $v_2$. Thus, $v_1$ will have a higher contribution to the clustering than $v_2$. However, in clustering we would be more interested in the clusters’ information in $v_2$.
Another interesting characteristic of (9) is that if $v$ is a binary feature, then $\max_{j} x_{jv} - \min_{j} x_{jv} = 1$. Hence, the standardised value is just $x_{iv} - \bar{x}_v$. Note that $\bar{x}_v$ is in fact the frequency of $v$ in the data set $X$. The higher the frequency of $v$ the lower the standardised value $x_{iv} - \bar{x}_v$, and the lower is its contribution to the clustering. This is well aligned with intuition: a feature that is commonly present (ie. frequent) is less likely to be discriminative.
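Range normalisation (9) is straightforward to sketch. In the example below `range_normalise` is a hypothetical helper, and mapping a constant feature (one whose range is zero) to 0.0 is an assumption, since (9) is undefined in that case.

```python
def range_normalise(X):
    """Apply (9): subtract each feature's mean, then divide by its range."""
    n = len(X)
    columns = list(zip(*X))
    means = [sum(col) / n for col in columns]
    ranges = [max(col) - min(col) for col in columns]
    return [
        # Assumption: a constant feature carries no information and becomes 0.0.
        [(x - mu) / r if r > 0 else 0.0 for x, mu, r in zip(row, means, ranges)]
        for row in X
    ]
```

As a small worked case, a binary feature present in three out of four objects has a frequency of 0.75. After normalisation the three objects exhibiting the feature get the value 0.25 and the remaining object gets -0.75, illustrating how a frequent binary feature ends up with a small standardised value where it is present.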
4 Iterative cluster-dependent feature rescaling
The normalisation discussed in Section 3 sets all features of a given data set to have about the same contribution to the clustering. This can also be seen as a disadvantage because it means that features with a higher relevance are set to have the same contribution to the clustering as features with a lower relevance. Intuition indicates that features with a higher relevance to the clustering should have a higher contribution. In fact, we can go even further. A given feature $v$ may have different degrees of relevance at each cluster $S_k$, and this should be taken into account during the clustering task.
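We can interpret $w_{kv}$ in the distance measure used in imwk-means (3) as the degree of relevance of feature $v$ at cluster $S_k$. Such an assertion requires further analysis of imwk-means. This algorithm aims to produce a weight $w_{kv}$ for $k = 1, 2, \ldots, K$ and $v = 1, 2, \ldots, m$, minimising (4) subject to the conditions in (5). Notice that the dispersion of $v$ at cluster $S_k$ is given by
$$D_{kv} = \sum_{x_i \in S_k} |x_{iv} - z_{kv}|^p,$$
allowing us to rewrite (4) as
$$W_p(S, Z, w) = \sum_{k=1}^{K} \sum_{v=1}^{m} w_{kv}^p D_{kv}.$$
The Lagrangian function of the above is
$$L(w, \lambda) = \sum_{k=1}^{K} \sum_{v=1}^{m} w_{kv}^p D_{kv} + \sum_{k=1}^{K} \lambda_k \left( 1 - \sum_{v=1}^{m} w_{kv} \right),$$
allowing us to equate its two partial derivatives to zero.
$$\frac{\partial L}{\partial w_{kv}} = p w_{kv}^{p-1} D_{kv} - \lambda_k = 0, \qquad (10)$$
$$\frac{\partial L}{\partial \lambda_k} = 1 - \sum_{v=1}^{m} w_{kv} = 0. \qquad (11)$$
We can rearrange (10) to
$$w_{kv} = \left( \frac{\lambda_k}{p D_{kv}} \right)^{\frac{1}{p-1}}. \qquad (12)$$
Substituting the above into (11) leads to
$$\sum_{u=1}^{m} \left( \frac{\lambda_k}{p D_{ku}} \right)^{\frac{1}{p-1}} = 1,$$
and
$$w_{kv} = \frac{1}{\sum_{u=1}^{m} \left[ D_{kv} / D_{ku} \right]^{\frac{1}{p-1}}}.$$
The weights calculated as per the equation above, which is (6), minimise (4) by modelling the within-cluster degree of relevance of each feature. This is quite interesting because it allows us to go a step beyond the normalisation described in Section 3 by using these weights as feature rescaling factors. This is quite unusual because each feature will have $K$ factors, but it is fine because these are the weights minimising the clustering criterion (4).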
Given a set of weights calculated as per (6), we can rescale a data set that has been normalised (see Section 3) using
$$x'_{iv} = w_{kv} x_{iv}, \qquad (13)$$
where $x_i \in S_k$. In other words, the rescaling factor applied to $x_{iv}$ depends on both: (i) feature $v$; (ii) the cluster $S_k$ that $x_i$ belongs to.
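The rescaling in (13) amounts to one multiplication per entry, as the sketch below shows. The layout of `weights` (one list of $m$ weights per cluster, indexed by cluster label) and the name `rescale` are assumptions made for the example.

```python
def rescale(X, labels, weights):
    """Apply (13): multiply x_iv by w_kv, where k is the cluster of x_i.

    weights[k][v] is assumed to hold the weight of feature v at cluster k,
    and labels[i] the index k of the cluster that object i belongs to.
    """
    return [
        [w * x for w, x in zip(weights[k], row)]
        for row, k in zip(X, labels)
    ]
```

One way to read this step: for two points scaled by the same cluster's weights, $|w_{kv} x_{iv} - w_{kv} z_{kv}|^p = w_{kv}^p |x_{iv} - z_{kv}|^p$, so an unweighted Minkowski distance on the rescaled values reproduces the weighted distance (3).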
The method we introduce in this paper also has another novelty. In the first step of imwk-means there is no data transformation that separates clusters, and each $w_{kv}$ is set to $1/m$. While this seems sensible as a starting point to minimise (4), it also means this starting point is suboptimal. To address this, the main part of our method iterates between generating clusterings with imwk-means and rescaling the data set using (13). This way, at each iteration imwk-means starts from a better position. Given that $w_{kv} \leq 1$ for $k = 1, 2, \ldots, K$ and $v = 1, 2, \ldots, m$, each time the data set is rescaled the values of its entries are lowered. To avoid computational issues related to dealing with very small numbers, we also normalise the data set using (9) between each of these main iterations.
Iterative cluster-dependent feature rescaling (icdfr):

1. For each value of $p$ from 1.1 to 5.0 in steps of 0.1, generate a clustering and a set of weights using imwk-means.

2. Calculate the similarity between each pair of clusterings. This similarity can be calculated using the Adjusted Rand Index. Select as optimal the $p$ associated with the clustering with the highest average similarity to all other clusterings.

3. Rescale the data set using (13) and the weights generated with the optimal $p$.

4. Normalise the data set using (9).

5. Apply imwk-means to the new data set with the optimal $p$. This will update the weights as well as the clustering. Unless a predetermined number of iterations has been reached (or the algorithm has converged), go to Step 3.
In the above, Steps 1 and 2 relate to a consensus approach that has been shown to find suitable values for the Minkowski exponent de2017minkowski. Regarding the number of iterations in Step 5, we experimented with 100, although the algorithm would converge well before that.
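The control flow of icdfr can be outlined as below. This is a structural sketch only: the clustering, rescaling and normalisation routines are passed in as functions (hypothetical stand-ins for imwk-means, (13) and (9)), `similarity` could be an Adjusted Rand Index, and treating "the labels did not change" as convergence is an assumption, since other stopping rules are possible.

```python
def select_exponent(clusterings, similarity):
    """Steps 1 and 2: pick the p whose clustering agrees most with all others.

    clusterings maps each candidate p to a list of cluster labels (at least
    two candidates are assumed); similarity compares two label lists.
    """
    def average_similarity(p):
        others = [q for q in clusterings if q != p]
        total = sum(similarity(clusterings[p], clusterings[q]) for q in others)
        return total / len(others)

    return max(clusterings, key=average_similarity)


def icdfr_iterations(X, cluster, rescale, normalise, max_iter=100):
    """Steps 3 to 5: rescale, normalise and re-cluster until the labels repeat.

    cluster(X) -> (labels, weights) stands in for imwk-means run with the
    optimal p; rescale(X, labels, weights) stands in for (13) and
    normalise(X) for (9). All three are supplied by the caller.
    """
    labels, weights = cluster(X)
    for iteration in range(1, max_iter + 1):
        X = normalise(rescale(X, labels, weights))  # Steps 3 and 4
        new_labels, weights = cluster(X)            # Step 5
        if new_labels == labels:                    # assumed convergence test
            return X, labels, iteration
        labels = new_labels
    return X, labels, max_iter
```

Passing the routines in as arguments keeps the outline independent of any particular implementation of imwk-means. The returned iteration count makes it easy to check whether the loop stopped before reaching `max_iter`.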
5 Malware data
In this paper we analyse drive-by download malware. In other words, malicious code downloaded unintentionally to the user’s computer. In order to gather useful data we need to release a malware sample in a safe environment, analyse the malware itself and keep track of any changes it makes to such environment. Luckily, there are a number of options in terms of software we could use to accomplish this. This type of software is commonly referred to as a malware sandbox, and it is used to execute untrusted programs without risking the host machine (for details see gandotra2014malware; chakkaravarthy2019survey, and references therein). Here we have chosen to use Cuckoo Sandbox 2.0.6 Cuckoo mainly because it is a free open-source solution, which has been consistently used in research (see for instance barakat2014malware; vasilescu2014practical; shijo2015integrated). However, one should note that the significance of this choice is rather low as our method does not depend on the sandbox in use (see Section 4). All we need is data describing the malware to be analysed. Hence, one can use any malware sandbox capable of fulfilling this requirement.
Cuckoo runs at host level and manages one or more Windows 8 VM guests (see Figure 1 for a visual representation). The latter is an isolated environment allowing Cuckoo to gather behavioural data (eg. API calls made by the malware, dropped files, processes spawned, etc.). Cuckoo resets this VM to its original (ie. clean) state before each experiment with a potential malware. This particular sandbox is also able to extract information from files as part of its static analysis. For each malware Cuckoo lists a number of features related to the behaviour of the malware, as well as its static analysis.
In terms of raw data (ie. the malware samples themselves), we acquired a total of 2,000 samples from VirusSign VirusSign. These malware samples were gathered by VirusSign using honeypots, submissions, as well as trading and exchange. Each and every one of them was confirmed by VirusSign to be malicious, using several mainstream antivirus software packages. Each malware also presented features from Cuckoo’s behavioural and static analyses.
Given a list of features obtained with Cuckoo (see Table 1), we can transform our data into an actual data matrix. The process is quite straightforward. First we must note we have two types of features: (i) binary features, which represent the presence or absence of a particular feature at a particular malware (eg. whether or not a malware checks if the cursor is in use); (ii) numerical features, which represent the number of times a particular feature was present at a particular malware (eg. the number of times a malware sent ICMP messages). We have 2,000 samples with 67 features, meaning that any given malware sample in our data set is described over 67 features (ie. $n = 2{,}000$ and $m = 67$).
Name  Description  Name  Description 

ICMP  # ICMP messages  AllocateVMem  # calls to NtAllocateVirtualMemory 
AntiDebug  Use of debugging techniques  Bind  # calls to bind 
CheckCursor  Whether a cursor is in use  CloseSocket  # calls to closesocket 
CreateFile  # calls to NtCreateFile  CreateMutant  # calls to NtCreateMutant 
CryptographyReg  Access to Cryptography registry  CustomLocaleReg  Access to CustomLocale registry 
DelayExe  Call to NtDelayExecution  DeviceIO  Use of DeviceIOControl 
DroppedFiles  # files dropped  FindFile  # calls used to locate files 
FreeVMem  # calls to NtFreeVirtualMemory  GetSysTime  Use of GetSystemTimeAsFileTime 
HttpOpenReq  # calls to HttpOpenRequest  HttpSendReq  # calls to HttpSendRequest 
IE  Access to IE registry  MapView  # calls to NtMapViewOfSection 
OpenFile  # calls to NtOpenFile  OpenMutant  # calls to NtOpenMutant 
ProcessNum  # processes spawned  ProtectVMem  # calls to NtProtectVirtualMemory 
QueryFile  # queries for information about files  RegCreate  # calls to create registry keys 
RegQuery  # queries to the registry  SafeBootReg  Access to SafeBoot registry 
Socket  # calls to socket  SortingReg  Access to Nls/Sorting registry 
TcpipReg  Access to TCP/IP registry  WriteFile  # calls to NtWriteFile 
IReadFile  # calls to InternetReadFile  ToolSS  Use of CreateToolHelp32Snapshot 
Dllsloaded  # dlls loaded  SysInfoRef  Access to SystemInformation registry 
CryptDecodeObjectX  # calls to CryptDecodeObjectX  Fips  Access to FIPS algorithm policy 
CryptCreateHash  # calls to CryptCreateHash  CryptHashData  # calls to CryptHashData 
DnscacheReg  Access to DNSCache registry  RegModify  # calls to modify registry keys 
DockingInfo  Access to DockingState registry  Persistence  Access to persistencerelated registry keys 
SCManager  Access to service control manager  CryptExportKey  # calls to CryptExportKey 
CryptGenKey  # calls to CryptGenKey  AppInit  Access to DLLloading registry keys 
CryptAcquireContextA  # calls to CryptAcquireContextA  RemoteThread  Creation of remote threads. 
FilesRecreated  # recreated files  DuplicateProcess  # calls to duplicate process 
NumSections  # sections in PE file  NumResources  # resources in PE file 
NumExports  # exports in PE file  TCP  # tcp packets detected 
UDP  # udp packets detected  HTTP  # http packets detected 
DroppedBuffers  # dropped buffers  Yaraembedded_pe  Detects embedded PE file 
YaraLnkHeader  Detects lnk header  DnscacheReg  DnsCache registry modified 
Yaraembedded_win_api  Detects embedded win api  Yarashellcode  Detects shellcode in file 
Yaravm_detect  Use VM detection techniques  CheckDiskSize  Call to CheckDiskSize function 
Yaraembedded_macho  Detects Mach-O file 
6 Clustering results
We began by applying classical normalisation (see Section 3) to the data set we constructed (for details on the data set, see Section 5). Figure 2(a) shows the plot of our data over its first and second principal components. Unfortunately, there is no clear evidence of a cluster structure from a Gaussian perspective (ie. clearly separable round clusters).
As popular as it may be, a clustering algorithm such as $k$-means++ will identify clusters even if there is no cluster structure in a data set. Hence, one should not just jump into applying this algorithm to the data. To illustrate this, we applied $k$-means++ to our data set 100 times. Figure 2(b) shows the clustering we obtained with the lowest criterion output by setting $K = 7$ (given this is just illustrative, the actual value of $K$ matters very little). The fact this clustering is meaningless is further reinforced by its low average silhouette index.
The results we obtained using our icdfr method are certainly more promising. In these experiments we set the imwk-means threshold $\theta$ so as to avoid very small clusters in our results, which would not be of particular interest (this choice ended up leading to seven clusters, hence the number of clusters in our illustrative $k$-means++ example). The first and second steps of our method (described in Section 4) use consensus clustering in order to identify a suitable Minkowski exponent $p$ between 1.1 and 5.0; higher values of $p$ tend to remove the advantages of feature weighting as (6) will produce more uniform weights. Given the optimal value of $p$ found in this way, we set the number of iterations to 100 and allowed our method to follow Steps 3 to 5. We were happy to see the method converged in iteration 53, and even happier to see that the average silhouette of imwk-means on the data set produced by our method was 0.92. The new data set generated by our method also increased the average silhouette of $k$-means++ to 0.52.
Figure 3 shows our method in action. Each of its subfigures (a to f) shows the plot of an imwk-means clustering (over its first and second principal components) on a data set generated by our method at a different (but increasing) iteration. We can clearly see that our method starts from a chaotic scenario and quickly starts separating clusters.
All of the above is very positive, but we need to ensure that our final clustering (ie. Figure 3(f)) is actually meaningful in the real world. In order to do this, we analysed each cluster with the help of VirusTotal VirusTotal. The latter is the most prominent online public service with multiple antivirus scanners sakib2020maximizing. Even with its help, analysing our malware data is far from being a trivial task. VirusTotal is capable of describing each malware in our data set by using a number of antivirus (AV) software packages. However, this does not mean that each and every AV will agree on what a malware sample actually is (or even if the sample is really a malware). Also, different AVs may have different taxonomies. Thus, even if two AVs agree on what a malware sample actually is, they may use different names.
Thanks to our analysis we can describe each cluster as follows:
Cluster one: this cluster contains four malware samples, shown in red in Figure 3(f). The AV labels for these in VirusTotal seem to be quite different (using names like adware and trojan, which are not mutually exclusive); however, all four samples have exactly the same compilation timestamp. This certainly suggests these malware samples are very much related.
Cluster two: this cluster contains four malware samples, shown in green in Figure 3(f). The vast majority of AVs in VirusTotal label these as trickbot, a trojan designed to steal banking information in particular. All the malware samples in this cluster share the same import hash (imphash), which means they have very similar import tables and are by consequence similar.
Cluster three: this cluster contains 27 malware samples, shown in blue in Figure 3(f). These malware samples can be easily characterised by the huge amount of behaviour they exhibit while running in a Virtual Machine, sometimes spawning over 100 processes. According to VirusTotal, over 30 AVs (out of 60) label most of the samples in this cluster as generic malware. Around five AVs consistently label these malware samples with ransomware characteristics, while another five AVs use the term flystudio (adware).
Cluster four: this cluster contains 14 malware samples, shown in black in Figure 3(f). Eight of the malware samples appear to be a specific type of trojan (a downloader/installer). All but one of the malware samples have the same compilation timestamp, and share the same imphash.
Cluster five: this cluster contains 14 malware samples, shown in yellow in Figure 3(f). These malware samples have different imphashes but the AVs in VirusTotal label all of these samples as belonging to the ransomware family GandCrab.
Cluster six: this cluster contains 79 malware samples, shown in magenta in Figure 3(f). Almost all malware samples in this cluster exhibit extremely similar behaviour given they all share the same imphash value. VirusTotal suggests these are adware in general, and AVs classify them as being in the Adposhel or DNSUnlocker families.
Cluster seven: this cluster contains 1,858 malware samples, shown in cyan in Figure 3(f). This cluster is too large for us to analyse each and every malware sample in VirusTotal. The malware samples in this cluster seem to be associated with different families, but also seem different from the malware samples in other clusters.
The above shows there is a considerable difference in cardinality between clusters, which is certainly expected. Malware samples have a tendency to appear in bursts, and their distribution is highly skewed song2016learning. Our method will identify the most anomalous clusters first. Further analysis could be done by applying our method solely to the data in the largest cluster. We do not pursue this here because we have already achieved our aim. Taking the cluster descriptions above together with: (i) the average silhouette of 0.92 given by imwk-means; (ii) the increased average silhouette of $k$-means++ (0.52 on the data set generated by icdfr); (iii) the mathematical model shown in Section 4, we can state our method produces a data set which increases the chances of a meaningful clustering.
7 Conclusion
In this paper we faced the problem of finding meaningful clusters in drive-by download malware data. The patterns in this type of data can be difficult to identify, particularly if using a distance-based clustering algorithm (see Figures 2(a) and 2(b)). We identified as the main reason for this the fact that classical data normalisation treats all features equally, instead of favouring those that are more relevant.
In order to address the above, we introduced a data preprocessing method called iterative cluster-dependent feature rescaling (for details see Section 4). This method makes use of cluster-dependent feature weights to iteratively separate the clusters in a data set (see Figure 3). This mathematically sound method leads to higher average silhouettes. For instance, $k$-means++ saw its average silhouette increase to 0.52 when using our feature rescaling method, while imwk-means went as high as 0.92. Hence, more meaningful clusters.
We foresee our method being used in the data preprocessing stage of a malware clustering task, or perhaps even in other clustering tasks. In the future we intend to investigate the use of this method in supervised classification problems.