Semi-supervised Classification of Malware Families Under Extreme Class Imbalance via Hierarchical Non-Negative Matrix Factorization with Automatic Model Selection

09/12/2023
by Maksim E. Eren, et al.

Identifying the family to which a malware specimen belongs is essential for understanding its behavior and developing mitigation strategies. Solutions proposed in prior work, however, are often impractical because their evaluations omit realistic factors: learning under class imbalance, the ability to identify new malware, and the cost of production-quality labeled data. In practice, deployed models face prominent, rare, and new malware families, while obtaining a large quantity of up-to-date labeled malware for training can be expensive. In this paper, we address these problems and propose a novel hierarchical semi-supervised algorithm, which we call the HNMFk Classifier, that can be used in the early stages of the malware family labeling process. Our method is based on non-negative matrix factorization with automatic model selection, that is, with estimation of the number of clusters. With the HNMFk Classifier, we exploit the hierarchical structure of the malware data together with a semi-supervised setup, which enables us to classify malware families under conditions of extreme class imbalance. Our solution can make abstaining predictions (a reject option), which yields promising results in identifying novel malware families and helps maintain model performance when only a small quantity of labeled data is available. We perform bulk classification of nearly 2,900 malware families, both rare and prominent, through static analysis, using nearly 388,000 samples from the EMBER-2018 corpus. In our experiments, we surpass both supervised and semi-supervised baseline models with an F1 score of 0.80.
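The abstract does not spell out how labels are assigned inside clusters, so the following is only a minimal sketch of the general idea of semi-supervised labeling with a reject option, not the HNMFk Classifier itself. It assumes cluster assignments are already available (for example, from a matrix factorization step), and it uses a simple majority vote over the few labeled samples in each cluster. The names `label_clusters`, `predict`, `min_purity`, `ABSTAIN`, and the toy family names are all hypothetical.

```python
from collections import Counter, defaultdict

ABSTAIN = None  # hypothetical marker for a rejected (abstaining) prediction


def label_clusters(cluster_ids, known_labels, min_purity=0.8):
    """Map each cluster to its majority family, or ABSTAIN if evidence is weak.

    cluster_ids:  cluster index per sample (assumed to come from a prior
                  clustering step, which is not shown here).
    known_labels: family name per sample, or None when the sample is unlabeled.
    min_purity:   assumed threshold on the majority share among labeled members.
    """
    # Collect label votes from the labeled samples of each cluster.
    votes = defaultdict(Counter)
    for cid, label in zip(cluster_ids, known_labels):
        if label is not None:
            votes[cid][label] += 1

    mapping = {}
    for cid in set(cluster_ids):
        counts = votes.get(cid)
        if not counts:
            # No labeled sample at all: possibly a novel family, so abstain.
            mapping[cid] = ABSTAIN
            continue
        family, top = counts.most_common(1)[0]
        purity = top / sum(counts.values())
        # Abstain when the labeled members disagree too much.
        mapping[cid] = family if purity >= min_purity else ABSTAIN
    return mapping


def predict(cluster_ids, known_labels, min_purity=0.8):
    """Give every sample the label of its cluster (or ABSTAIN)."""
    mapping = label_clusters(cluster_ids, known_labels, min_purity)
    return [mapping[cid] for cid in cluster_ids]


# Toy data: cluster 0 is pure, cluster 1 is mixed, cluster 2 has no labels.
cluster_ids = [0, 0, 0, 1, 1, 1, 2, 2]
known_labels = ["family_a", None, "family_a", "family_b", "family_a", None, None, None]
predictions = predict(cluster_ids, known_labels)
```

In this toy run, only the samples in the pure cluster receive a family label; the mixed cluster and the fully unlabeled cluster are rejected. In a hierarchical setup, such rejected clusters would be natural candidates for further splitting, though that step is not shown here.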

research
03/07/2021

Cluster Analysis of Malware Family Relationships

In this paper, we use K-means clustering to analyze various relationship...
research
05/01/2023

Classification and Online Clustering of Zero-Day Malware

A large amount of new malware is constantly being generated, which must ...
research
05/02/2022

Reducing the Cost of Training Security Classifier (via Optimized Semi-Supervised Learning)

Background: Most of the existing machine learning models for security ta...
research
09/10/2015

A deep matrix factorization method for learning attribute representations

Semi-Non-negative Matrix Factorization is a technique that learns a low-...
research
04/19/2017

Semi-supervised classification for dynamic Android malware detection

A growing number of threats to Android phones creates challenges for mal...
research
01/04/2019

Network-based Analysis and Classification of Malware using Behavioral Artifacts Ordering

Using runtime execution artifacts to identify malware and its associated...
