EMBER: An Open Dataset for Training Static PE Malware Machine Learning Models

04/12/2018
by Hyrum S. Anderson, et al.

This paper describes EMBER: a labeled benchmark dataset for training machine learning models to statically detect malicious Windows portable executable files. The dataset includes features extracted from 1.1M binary files: 900K training samples (300K malicious, 300K benign, 300K unlabeled) and 200K test samples (100K malicious, 100K benign). To accompany the dataset, we also release open source code for extracting features from additional binaries so that their features can be appended to the dataset. This dataset fills a void in the information security machine learning community: a benign/malicious dataset that is large, open, and general enough to cover several interesting use cases. We enumerate several use cases that we considered when structuring the dataset. Additionally, we demonstrate one use case in which we compare a baseline gradient boosted decision tree model, trained using LightGBM with default settings, to MalConv, a recently published end-to-end (featureless) deep learning model for malware detection. Results show that even without hyperparameter optimization, the baseline EMBER model outperforms MalConv. We hope that the dataset, code, and baseline model provided by EMBER will help invigorate machine learning research for malware detection, in much the same way that benchmark datasets have advanced computer vision research.
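The split sizes quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is only an illustration: the dictionary layout and the helper names `split_total` and `labeled_total` are hypothetical and are not part of the released EMBER code; only the sample counts come from the abstract.

```python
# Sample counts per split, exactly as stated in the abstract.
# The dictionary layout and helper names are illustrative assumptions,
# not part of the EMBER release.
SPLITS = {
    "train": {"malicious": 300_000, "benign": 300_000, "unlabeled": 300_000},
    "test": {"malicious": 100_000, "benign": 100_000},
}


def split_total(name):
    """Total number of samples in a split, labeled or not."""
    return sum(SPLITS[name].values())


def labeled_total(name):
    """Number of samples in a split that carry a benign/malicious label."""
    return sum(n for label, n in SPLITS[name].items() if label != "unlabeled")


train_total = split_total("train")            # 900K training samples
test_total = split_total("test")              # 200K test samples
dataset_total = train_total + test_total      # 1.1M files overall
train_labeled = labeled_total("train")        # samples usable for supervised training

print(train_total, test_total, dataset_total, train_labeled)
```

Under these counts, a purely supervised model can draw on 600K of the 900K training samples; the remaining 300K unlabeled samples would only help approaches that can use unlabeled data.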


research
10/28/2022

Multi-feature Dataset for Windows PE Malware Classification

This paper describes a multi-feature dataset for training machine learni...
research
12/14/2020

SOREL-20M: A Large Scale Benchmark Dataset for Malicious PE Detection

In this paper we describe the SOREL-20M (Sophos/ReversingLabs-20 Million...
research
06/10/2021

Towards an Automated Pipeline for Detecting and Classifying Malware through Machine Learning

The constant growth in the number of malware - software or code fragment...
research
08/20/2022

Quo Vadis: Hybrid Machine Learning Meta-Model based on Contextual and Behavioral Malware Representations

We propose a hybrid machine learning architecture that simultaneously em...
research
07/15/2020

Static analysis of executable files by machine learning methods

The paper describes how to detect malicious executable files based on st...
research
12/10/2017

Improving Malware Detection Accuracy by Extracting Icon Information

Detecting PE malware files is now commonly approached using statistical ...
research
08/21/2023

Neural Networks Optimizations Against Concept and Data Drift in Malware Detection

Despite the promising results of machine learning models in malware dete...
