Virus-MNIST: A Benchmark Malware Dataset

02/28/2021
by David Noever, et al.

This short note presents an image classification dataset consisting of 10 executable code varieties and approximately 50,000 virus examples. The classes comprise 9 families of computer viruses and one benign set. The image formatting for the first 1024 bytes of the Portable Executable (PE) mirrors the familiar MNIST handwriting dataset, so that most previously explored algorithmic methods can transfer with minor modifications. The designation of 9 virus families derives from unsupervised learning of class labels; we discover the families with KMeans clustering that excludes the non-malicious examples. As a benchmark using deep learning methods (MobileNetV2), we find an overall 80% accuracy for virus identification by family when beneware is included. We also find that once a positive malware detection occurs (by signature or heuristics), the projection of the first 1024 bytes into a thumbnail image can classify the type of virus with 87% accuracy. The work generalizes what other malware investigators have demonstrated: convolutional neural networks originally developed to solve image problems show promise when applied to a new abstract domain, the pixel bytes of executable files. The dataset is available on Kaggle and GitHub.
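
The byte-to-thumbnail projection described in the abstract can be illustrated with a minimal sketch. It is not the authors' preprocessing code: it assumes a 32x32 grayscale layout (since 32 * 32 = 1024, one pixel per byte), row-major ordering, and zero-padding for files shorter than 1024 bytes. The function names `bytes_to_thumbnail` and `file_to_thumbnail` are hypothetical.

```python
from pathlib import Path

SIDE = 32                # assumed layout: 32 * 32 = 1024 pixels, one per byte
N_BYTES = SIDE * SIDE


def bytes_to_thumbnail(data: bytes) -> list[list[int]]:
    """Map the first 1024 bytes to a 32x32 grid of grayscale values (0-255)."""
    # Assumption: files shorter than 1024 bytes are zero-padded on the right.
    head = data[:N_BYTES].ljust(N_BYTES, b"\x00")
    # Row-major fill: byte i lands at row i // 32, column i % 32.
    return [list(head[r * SIDE:(r + 1) * SIDE]) for r in range(SIDE)]


def file_to_thumbnail(path: str) -> list[list[int]]:
    """Read only the leading bytes of a file and convert them to a grid."""
    with Path(path).open("rb") as fh:
        return bytes_to_thumbnail(fh.read(N_BYTES))


# A PE file begins with the DOS magic bytes "MZ" (0x4D, 0x5A); the rest of
# this demo payload is synthetic filler, not real executable content.
demo = bytes_to_thumbnail(b"MZ" + bytes(range(256)) * 8)
print(len(demo), len(demo[0]), demo[0][:2])  # 32 32 [77, 90]
```

The resulting grid has the same single-channel, fixed-size shape that MNIST-style classifiers expect, which is why existing image pipelines can be reused with only minor changes such as the input dimensions.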

research
09/22/2018

DeepOrigin: End-to-End Deep Learning for Detection of New Malware Families

In this paper, we present a novel method of differentiating known from p...
research
05/01/2023

Classification and Online Clustering of Zero-Day Malware

A large amount of new malware is constantly being generated, which must ...
research
11/06/2017

Computer activity learning from system call time series

Using a previously introduced similarity function for the stream of syst...
research
10/24/2019

Malware Classification using Deep Learning based Feature Extraction and Wrapper based Feature Selection Technique

In case of behavior analysis of a malware, categorization of malicious f...
research
02/17/2020

Tools and Techniques for Malware Detection and Analysis

One of the major and serious threats that the Internet faces today is th...
research
11/03/2021

Virus-MNIST: Machine Learning Baseline Calculations for Image Classification

The Virus-MNIST data set is a collection of thumbnail images that is sim...
research
11/16/2018

The MalSource Dataset: Quantifying Complexity and Code Reuse in Malware Development

During the last decades, the problem of malicious and unwanted software ...
