Identifying meaningful clusters in malware data

Finding meaningful clusters in drive-by-download malware data is a particularly difficult task. Malware data tends to contain overlapping clusters of widely varying cardinality. This happens because malware samples can be considerably similar to one another (some are even said to belong to the same family), and because they tend to appear in bursts. Clustering algorithms are usually applied to normalised data sets. However, normalisation only aims to give features with different ranges a similar contribution to the clustering. It does not favour more meaningful features over less meaningful ones, an effect one should perhaps expect of the data pre-processing stage. In this paper we introduce a method that deals precisely with this problem. It is an iterative data pre-processing method that helps increase the separation between clusters. It does so by calculating the within-cluster degree of relevance of each feature and then using these values as data rescaling factors. By repeating this process until convergence, our malware data was separated into clear clusters, leading to a higher average silhouette width.
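The iterative loop described above (cluster, compute a per-cluster relevance for each feature, rescale, repeat) can be sketched as follows. This is only an illustration under stated assumptions, not the paper's implementation: the abstract does not give the weighting formula, so the sketch assumes that a feature's relevance within a cluster is inversely proportional to its within-cluster dispersion, and it uses plain k-means as the clustering step. The names `kmeans`, `feature_weights`, `rescale` and `iterative_rescaling` are hypothetical.

```python
import math
import random


def kmeans(data, k, seed=0, max_iter=100):
    """Plain Lloyd's k-means on a list of equal-length tuples; returns labels."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(data, k)]
    labels = []
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in data
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(data, labels) if lab == c]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def feature_weights(data, labels, k, eps=1e-9):
    """Per-cluster feature weights in (0, 1].

    ASSUMPTION: relevance is taken as inverse within-cluster dispersion,
    scaled so the least dispersed feature of each cluster has weight 1.
    """
    n_features = len(data[0])
    weights = []
    for c in range(k):
        members = [p for p, lab in zip(data, labels) if lab == c]
        if not members:
            weights.append([1.0] * n_features)
            continue
        centre = [sum(col) / len(members) for col in zip(*members)]
        disp = [
            sum((p[v] - centre[v]) ** 2 for p in members) + eps
            for v in range(n_features)
        ]
        smallest = min(disp)
        weights.append([smallest / d for d in disp])
    return weights


def rescale(data, labels, weights):
    """Multiply each point by the weights of the cluster it belongs to."""
    return [
        tuple(x * w for x, w in zip(p, weights[lab])) for p, lab in zip(data, labels)
    ]


def iterative_rescaling(data, k, max_rounds=20, seed=0):
    """Cluster, weight, rescale; stop when the partition stops changing."""
    labels = kmeans(data, k, seed)
    for _ in range(max_rounds):
        data = rescale(data, labels, feature_weights(data, labels, k))
        new_labels = kmeans(data, k, seed)
        if new_labels == labels:
            break
        labels = new_labels
    return data, labels
```

Calling `iterative_rescaling(points, k)` on a list of numeric tuples returns the rescaled points and their final cluster labels. To check whether separation improved, one could compare the average silhouette width before and after rescaling, for instance with scikit-learn's `silhouette_score`.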


