On the impact of dataset size and class imbalance in evaluating machine-learning-based windows malware detection techniques

06/13/2022
by   David Illes, et al.

The purpose of this project was to collect and analyse data on the comparability and real-life applicability of published results on Microsoft Windows malware detection, specifically the impact of dataset size and testing dataset imbalance on measured detector performance. Some researchers use smaller datasets, and if dataset size has a significant impact on performance, comparing published results becomes difficult. Researchers also tend to use balanced datasets and accuracy as a metric for testing. The former is not a true representation of reality, where benign samples significantly outnumber malware, and the latter approach is known to be problematic for imbalanced problems. The project identified two key objectives: to understand whether dataset size correlates with measured detector performance to an extent that prevents meaningful comparison of published results, and to understand whether detectors that perform well in published research can be expected to perform well in a real-world deployment scenario. The results suggested that dataset size does correlate with measured detector performance to an extent that prevents meaningful comparison of published results. Without understanding the shape of the training set size versus accuracy curve behind each published result, conclusions about which approach is "better" should not be drawn from accuracy scores alone. Results also suggested that high accuracy scores do not necessarily translate to high real-world performance.
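The point about balanced test sets and accuracy can be illustrated with a little arithmetic. The sketch below is not taken from the paper: the function name `metrics_at_imbalance` and the 99% true-positive rate and 1% false-positive rate are hypothetical values chosen only for illustration. It holds the detector fixed and varies only the ratio of benign to malicious samples in the test set.

```python
def metrics_at_imbalance(tpr, fpr, benign_per_malware):
    """Return (accuracy, precision) for a detector with fixed rates.

    tpr: fraction of malware correctly flagged (true-positive rate).
    fpr: fraction of benign samples wrongly flagged (false-positive rate).
    benign_per_malware: number of benign samples per malware sample.
    All values passed in below are illustrative assumptions, not measurements.
    """
    malware = 1.0
    benign = float(benign_per_malware)

    tp = tpr * malware   # malware correctly flagged
    fp = fpr * benign    # benign samples wrongly flagged
    tn = benign - fp     # benign samples correctly passed

    accuracy = (tp + tn) / (malware + benign)
    flagged = tp + fp
    precision = tp / flagged if flagged > 0 else float("nan")
    return accuracy, precision


if __name__ == "__main__":
    # Same hypothetical detector, increasingly imbalanced test sets.
    for ratio in (1, 10, 100, 1000):
        acc, prec = metrics_at_imbalance(tpr=0.99, fpr=0.01, benign_per_malware=ratio)
        print(f"1:{ratio:<5} accuracy={acc:.4f} precision={prec:.4f}")
```

Under these assumed rates, accuracy stays at 0.99 at every ratio, while precision (the share of flagged files that really are malware) falls from 0.99 on the balanced set to roughly one half at 1:100. This is one way a high accuracy score on a balanced test set can coexist with weak real-world performance.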
