MOTIF: A Large Malware Reference Dataset with Ground Truth Family Labels

by Robert J. Joyce, et al.
Booz Allen Hamilton Inc.
University of Maryland, Baltimore County

Malware family classification is a significant issue with public safety and research implications that has been hindered by the high cost of expert labels. The vast majority of corpora use noisy labeling approaches that obstruct definitive quantification of results and study of deeper interactions. To provide the data needed to advance further, we have created the Malware Open-source Threat Intelligence Family (MOTIF) dataset. MOTIF contains 3,095 malware samples from 454 families, making it the largest and most diverse public malware dataset with ground truth family labels to date, nearly 3x larger than any prior expert-labeled corpus and 36x larger than the prior Windows malware corpus. MOTIF also comes with a mapping from malware samples to threat reports published by reputable industry sources, which both validates the labels and opens new research opportunities in connecting opaque malware samples to human-readable descriptions. This enables important evaluations that are normally infeasible due to non-standardized reporting in industry. For example, we provide aliases of the different names used to describe the same malware family, allowing us to benchmark, for the first time, the accuracy of existing tools when names are obtained from differing sources. Evaluation results obtained using the MOTIF dataset indicate that existing tasks have significant room for improvement, with the accuracy of antivirus majority voting measured at only 62.10%. Our findings indicate that malware family classification suffers a type of labeling noise unlike that studied in most ML literature, due to the large open set of classes that may not be known from the sample under consideration.
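To make the evaluation idea concrete, the following is a minimal sketch of antivirus majority voting scored against ground truth labels, with family names normalized through an alias table first. It is an illustration under stated assumptions, not the authors' evaluation code: the alias table, the family names (`famx`, `famx_alt`, `famy`), the function names, and the toy samples are all hypothetical and do not come from MOTIF.

```python
from collections import Counter

# Hypothetical alias table: every known name maps to one canonical family name.
# These names are placeholders, not real MOTIF families.
ALIASES = {
    "famx": "famx",
    "famx_alt": "famx",
    "famy": "famy",
}


def canonical(name):
    """Lowercase a family name and resolve it through the alias table."""
    key = name.strip().lower()
    return ALIASES.get(key, key)


def majority_vote(av_labels):
    """Return the most common canonical family among AV labels, or None if empty."""
    votes = Counter(canonical(label) for label in av_labels if label)
    if not votes:
        return None
    return votes.most_common(1)[0][0]


def accuracy(samples):
    """Fraction of samples whose majority-vote family equals the ground truth."""
    if not samples:
        return 0.0
    correct = sum(
        1
        for av_labels, truth in samples
        if majority_vote(av_labels) == canonical(truth)
    )
    return correct / len(samples)


# Toy data (assumed): each entry is (family names reported by AV engines, true family).
toy_samples = [
    (["FamX", "famx_alt", "FamY"], "famx"),  # aliases merge, so famx wins 2 to 1
    (["FamY", "FamY", "FamX"], "famx"),      # the majority disagrees with the truth
]

if __name__ == "__main__":
    print(accuracy(toy_samples))
```

Without the alias step, the first toy sample would split its votes three ways, which is the kind of naming inconsistency the alias mapping is meant to remove before accuracy is measured.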


