MOTIF: A Large Malware Reference Dataset with Ground Truth Family Labels

Malware family classification is a significant issue with public safety and research implications that has been hindered by the high cost of expert labels. The vast majority of corpora use noisy labeling approaches that obstruct definitive quantification of results and study of deeper interactions. In order to provide the data needed to advance further, we have created the Malware Open-source Threat Intelligence Family (MOTIF) dataset. MOTIF contains 3,095 malware samples from 454 families, making it the largest and most diverse public malware dataset with ground truth family labels to date, nearly 3x larger than any prior expert-labeled corpus and 36x larger than the prior Windows malware corpus. MOTIF also comes with a mapping from malware samples to threat reports published by reputable industry sources, which both validates the labels and opens new research opportunities in connecting opaque malware samples to human-readable descriptions. This enables important evaluations that are normally infeasible due to non-standardized reporting in industry. For example, we provide aliases of the different names used to describe the same malware family, allowing us to benchmark, for the first time, the accuracy of existing tools when names are obtained from differing sources. Evaluation results obtained using the MOTIF dataset indicate that existing tasks have significant room for improvement, with the accuracy of antivirus majority voting measured at only 62.10%. Our findings indicate that malware family classification suffers from a type of labeling noise unlike that studied in most ML literature, due to the large open set of classes that may not be known from the sample under consideration.




1 Introduction

A malware family is a collection of malicious files that are derived from a common source code. Classifying a malware sample into a known family provides valuable insights about the behaviors it likely performs and can greatly aid triage and remediation efforts [10]. Using manual analysis to label a large quantity of malware is generally considered to be intractable, and developing methods to automatically classify malware with high fidelity has long been an area of interest. Approximate labeling methods are typically employed due to the difficulty of obtaining family labels with ground truth confidence, often using a single antivirus or the majority vote of a collection of them. However, it is also common for malware family classifiers to make their classification decisions based upon antivirus signatures [1] or to train them using antivirus-based labels [11]. This has the potential to bias evaluation results. The difficulty of constructing datasets and evaluating them in a manner that avoids over-estimating accuracy has been a chronic problem for over two decades [12, 13, 14, 15, 16].

In order to improve current evaluation practices, especially evaluation of antivirus-based malware classifiers, we introduce the Malware Open-source Threat Intelligence Family dataset (MOTIF). Containing 3,095 samples from 454 families, the MOTIF dataset is the largest and most diverse public malware dataset with expert-derived, ground truth confidence family labels to date. MOTIF is the first dataset to provide a comprehensive alias mapping that includes aliases derived from both open-source reporting and antivirus signatures. Each malware sample is associated with a report written by a domain expert, which acts as the high confidence source for the label. This provides a new direction to explore the intersection of document processing and natural language understanding with malware family detection, as well as the labeling and documentation needed to design experiments with higher confidence than previously possible.

For the remainder of this section we provide an overview of the methods historically used to label malware and perform a survey of notable reference datasets. In Section 2 we describe the contents of the MOTIF dataset and the methodology used to construct it. In Section 3 we use MOTIF to benchmark the performance of a well-known malware labeling tool, four hash-based clustering algorithms, and two ML classifiers. Further discussion and conclusions are offered in Section 4.

1.1 Malware Dataset Labeling Strategies

Manual Labeling. Malware family labels obtained via manual analysis are said to have ground truth confidence. Although manual analysis is not perfectly accurate, the error rate is considered to be negligible [17]. Unfortunately, manual analysis is extremely time consuming. It can take an average of ten hours for a human analyst to fully analyze a previously unseen malware sample [18]. Although a full analysis may not be necessary to determine the malware family, the degree of difficulty and cost that manual labeling imposes is evident. Manually labeling a large quantity of malware quickly becomes intractable, and we are not aware of any reference datasets in which the creators manually analyzed all malware samples themselves in order to obtain ground truth family labels.

Open-Source Threat Intelligence Reports. Hundreds of open-source threat intelligence reports containing detailed analysis of malware samples are published by reputable cybersecurity organizations each year, mainly in the form of blog posts and whitepapers. These reports often focus on a specific family, campaign, or threat actor, and include the evidence the analyst uses to reach their conclusion – giving high confidence of correct labeling. In addition to analysis of the related malware samples, they frequently include file hashes and other indicators of compromise (IOCs) related to the cyberattack. If the family labels published in a report were obtained using manual analysis, then we say they have ground truth confidence. The MOTIF dataset was constructed by processing thousands of these reports and aggregating labeled file hashes from them. Although this is a more scalable method of obtaining ground truth family labels, it cannot be automated and it is restricted to the content published by the aforementioned organizations. This is due to the high cost of manual analysis, with expert practitioners reporting days-to-weeks of effort to reverse a single file [19].

Approximate Labeling Methods. The remaining methods for obtaining malware reference labels are by far the most commonly employed, but they do not provide ground truth confidence. Cluster labeling is an approach in which a dataset is first clustered and then one malware sample per cluster is manually analyzed. The label for that sample is applied to the entire cluster [20, 21]. However, this approach still requires some manual analysis and it relies on the precision of the selected clustering algorithm, which is often custom-made and not rigorously evaluated. A common, automated method for labeling malware is scanning with a single antivirus. However, antivirus signatures are frequently incorrect and can lack family information entirely [22, 23]. Furthermore, antivirus vendors often use different names for the same malware family, which we refer to as aliases. Many malware families have one or more aliases, causing widespread labeling inconsistencies [1]. Another typical approach to malware labeling is antivirus majority voting, in which a collection of antivirus products vote on the label. Although regarded as highly accurate, prior work has not quantified this assumption [1]. Antivirus majority voting may also cause a selection bias, because the samples for which a majority agree on a label are likely the “easiest” to classify [24]. AVClass, a leading public tool for performing automatic malware family classification, was recently used to label the EMBER2018 dataset. When provided an antivirus scan report for a malware sample, AVClass attempts to aggregate the many antivirus signatures in the report into a single family label. It does this by normalizing each antivirus signature, resolving aliases, and using plurality voting to predict the family. AVClass is open source, simple to use, and does not require the malware sample to obtain a label, making it a popular choice as a malware classifier since its release in 2016 [1]. In Section 3 we compute the accuracy of both antivirus majority voting and AVClass for the first time: they correctly predict malware families in the MOTIF dataset just 62.10% and 46.78% of the time, respectively.
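The aggregation step described above (normalize each signature, resolve aliases, take a plurality vote) can be illustrated with a much-simplified sketch. This is not AVClass itself: the tokenization rule, the `GENERIC_TOKENS` stop list, the `ALIASES` table, and the engine names and signatures in the usage note are all hypothetical stand-ins chosen for illustration.

```python
import re
from collections import Counter

# Hypothetical stop list of tokens that carry no family information.
GENERIC_TOKENS = {"trojan", "win32", "generic", "malware", "agent", "ransom", "variant"}

# Hypothetical alias table mapping alternative names to one canonical name.
ALIASES = {"wcry": "wannacry", "wannacrypt": "wannacry"}


def candidate_families(signature):
    """Split one antivirus signature into normalized candidate family tokens."""
    candidates = set()
    for token in re.split(r"[^a-z0-9]+", signature.lower()):
        # Drop empty, very short, purely numeric, and generic tokens.
        if len(token) < 4 or token.isdigit() or token in GENERIC_TOKENS:
            continue
        candidates.add(ALIASES.get(token, token))
    return candidates


def plurality_vote(scan_report):
    """Return the candidate named by the most engines, or None if there is none.

    scan_report maps an engine name to the signature string it produced.
    Each engine contributes at most one vote per candidate family.
    """
    votes = Counter()
    for signature in scan_report.values():
        votes.update(candidate_families(signature))
    if not votes:
        return None
    return votes.most_common(1)[0][0]
```

For example, a made-up report such as `{"EngineA": "Trojan.Win32.WannaCry.a", "EngineB": "W32/Wcry.B", "EngineC": "Trojan.Generic.12345"}` would yield `wannacry`, because two engines name it after alias resolution and the third signature contains no family token. The sketch also shows why the approach is fragile: any non-family token missing from the stop list becomes a candidate label.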

Non-Family Based Labeling. Malware family labels are not perfect, and no oracle-based categorization of family, or even of whether a program is a virus, is possible due to the halting problem [25]. Our definition of a malware family as being derived from common source code is within normal industry use (footnote 1: see … for an informative introduction to some of the challenges with malware family names), but the lack of an objective means to group families leads some to desire different means of grouping. Alternative approaches to grouping malware exist, such as by functionality [26]. However, all approaches to malware grouping run into different versions of the same set of problems. Malware is written by an active adversary who attempts to evade and mislead analysts, including complex code obfuscations and misdirection via code theft to slow analysts and thwart automation [27, 28, 29, 30]. Even standard countermeasures such as packing, which hides the original source code of a program from static analysis, are poorly understood and difficult to circumvent [31]. Dynamic analysis of running malware is not foolproof for similar reasons. The process is extraordinarily expensive and time-consuming, and malware may have numerous means of detecting instrumentation, altering behavior, and only executing on target victims, if sufficiently motivated [32, 33]. The various pros and cons of alternative labeling schemes are beyond the scope of our study, which focuses on family labels given their ubiquity in academic research and industry practice.

| Name | Samples | Families | Platform | Collection Period | Labeling Method |
| --- | --- | --- | --- | --- | --- |
| MOTIF (our work) | 3,095 | 454 | Windows | Jan. 2016 - Jan. 2021 | Threat Reports |
| MalGenome [34] | 1,260 | 49 | Android | Aug. 2010 - Oct. 2011 | Threat Reports |
| Variant [35] | 85 | 8 | Windows | Jan. 2014 | Threat Reports |

Table 1: Public Datasets with Ground Truth Family Labels

1.2 Notable Malware Reference Datasets

Statistics about malware datasets with ground truth family labels are shown in Table 1. Labels for all three datasets were gathered from open-source threat reports. MOTIF is the largest, and by far the most diverse of these datasets, and it contains the most modern malware samples. Variant contains just 85 malware samples from eight families, all apparently from a single report published by the DHS NCCIC in 2014 [36], with additional analysis provided in a second report by Dell Secureworks [37]. MalGenome is nearly a decade old, as of the time of writing [34].

| Name | Samples | Families | Platform | Collection Period | Labeling Method |
| --- | --- | --- | --- | --- | --- |
| Malheur [38] | 3,133 | 24 | Windows | 2006 - 2009 | AV Majority Vote |
| AMD [39] | 24,553 | 71 | Android | 2010 - 2016 | Cluster Labeling |
| Drebin [40] | 5,560 | 179 | Android | Aug. 2010 - Oct. 2012 | AV Majority Vote |
| VX Heaven [41] | 271,092 | 137 | Windows | 2012 or earlier | Single AV |
| Malicia [42] | 11,363 | 55 | Windows | Mar. 2012 - Mar. 2013 | Cluster Labeling |
| Kaggle [43] | 10,868 | 9 | Windows | Feb. 2015 or earlier | Susp. Single AV |
| Malpedia [44] | 5,862 | 2,165 | Both | 2017 - ongoing | Hybrid |
| MalDozer [45] | 20,089 | 32 | Android | Mar. 2018 or earlier | Susp. Single AV |
| EMBER2018 [46] | 485,000 | 3,226 | Windows | 2018 or earlier | AVClass |

Table 2: Notable Public Datasets With Imperfect Labeling

Table 2 displays public malware datasets with approximate family labels. The collection periods of the VX Heaven, Kaggle, and MalDozer datasets are undocumented; we use publication dates as an upper bound for the end of the collection period. The VX Heaven site operated between 1999 and 2012 and the dataset is considered to contain very outdated malware. The VX Heaven dataset was labeled using the Kaspersky antivirus [41] and we suspect that the Kaggle dataset was labeled using Windows Defender [47] due to the use of the label Obfuscator.ACY (a signature that Defender uses for obfuscated malware). The labeling method for MalDozer was not disclosed; however, the formatting of the provided family names suggests a single antivirus was used. The original EMBER dataset does not have family labels, but an additional 1,000,000 files (both malicious and benign) were released in 2018 [48]. 485,000 of these files are malware samples with AVClass labels. The Malpedia dataset is the most similar to MOTIF. Some of the labels in the Malpedia dataset were obtained using open-source reporting, and manual analysis was used for dumping and unpacking some malware samples. However, other family labels were derived using automated methods such as YARA rules and similarity analyses of unpacked files to known malware samples [44]. Therefore, we do not consider the Malpedia dataset to have full ground-truth confidence for all labels.

| Name | Samples | Families | Platform | Collection Period | Labeling Method |
| --- | --- | --- | --- | --- | --- |
| Malsign [20] | 142,513 | Unknown | Windows | 2012 - 2014 | Cluster Labeling |
| MaLabel [17] | 115,157 | 80 | Windows | Apr. 2015 or earlier | AV Majority Vote |
| MtNet [11] | 1,300,000 | 98 | Windows | Jun. 2016 or earlier | Hybrid |

Table 3: Notable Private Datasets With Imperfect Labeling

The datasets in Table 3 are not publicly available and some information about their contents is unknown. The majority of the files in the Malsign dataset are potentially unwanted applications (PUAs), rather than malware. Reference labels for Malsign were produced via clustering on statically-extracted features [20]. Of the 115,157 samples in MaLabel, 46,157 belong to 11 large families and the remaining 69,000 samples belong to families with fewer than 1,000 samples each. The total number of families in the dataset is unknown [17]. Microsoft provided the authors of the MtNet malware classifier with a dataset of 1.3 million malware samples, labeled using a combination of antivirus labeling and manual labeling. Due to the enormous scale of the dataset and the source that provided it, we suspect that the vast majority of the files were labeled using Windows Defender.

A lack of family diversity is apparent in many of the datasets we surveyed, most notably in the Variant and Kaggle datasets. The Windows datasets in particular tend to use approximate labeling methods, lack family diversity, and contain old malware. There is also a dearth of notable datasets containing malware that targets other platforms (e.g. Linux, macOS, and iOS), but this is work beyond our paper’s scope.

2 The MOTIF Dataset

The MOTIF dataset was constructed by surveying all open-source threat intelligence reports published by 14 major cybersecurity organizations during a five-year period. During this process, we meticulously reviewed thousands of these reports to determine the ground truth families of 4,369 malware samples. The organizations that published these reports included only the file hashes, to avoid distributing live malware samples. Hence, much of the malware discussed in these reports was inaccessible to us, and MOTIF was limited to only the malware samples which we could obtain. Furthermore, although we originally intended the dataset to include Windows, macOS, and Linux malware, we found that the vast majority of malware samples in our survey targeted Windows. In order to standardize the dataset, we elected to discard all files not in the Windows Portable Executable (PE) file format. As a result, the MOTIF dataset includes disarmed versions of 3,095 PE files from the survey that were in our possession, labeled by malware family with ground truth confidence. All files were disarmed by replacing the values of their OPTIONAL_HEADER.Subsystem and FILE_HEADER.Machine fields with zero, a method previously employed by the SOREL dataset to prevent files from running [49]. Furthermore, we also release EMBER raw features (feature version 2) for each malware sample [46]. In addition to being the largest public malware dataset with ground truth family labels as of the time of writing, MOTIF is the largest public dataset with ground truth family labels and a full alias mapping derived from both open-source reporting and antivirus signatures. The remainder of this section describes our methodology for processing the open-source threat intelligence reports, discusses how we resolved malware family aliases, and reviews the contents of the MOTIF dataset.
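The disarming step can be illustrated with a minimal sketch that zeroes the two named header fields directly in the file bytes. It is an illustration under stated assumptions, not the tooling used to build MOTIF: the function name `disarm_pe` is hypothetical, and the offsets follow the published PE/COFF layout (the `e_lfanew` pointer at offset 0x3C, `Machine` as the first field of the COFF file header, and `Subsystem` at offset 68 of the optional header for both PE32 and PE32+).

```python
import struct

E_LFANEW_OFFSET = 0x3C          # DOS header field pointing at the PE signature
FILE_HEADER_SIZE = 20           # size of the COFF file header
SUBSYSTEM_OFFSET_IN_OPTIONAL = 68  # same for PE32 and PE32+


def disarm_pe(data):
    """Return a copy of a PE file with Machine and Subsystem set to zero."""
    buf = bytearray(data)
    if len(buf) < E_LFANEW_OFFSET + 4 or buf[:2] != b"MZ":
        raise ValueError("not an MZ executable")

    (pe_offset,) = struct.unpack_from("<I", buf, E_LFANEW_OFFSET)
    if buf[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")

    machine_offset = pe_offset + 4  # FILE_HEADER.Machine follows the signature
    optional_header = machine_offset + FILE_HEADER_SIZE
    subsystem_offset = optional_header + SUBSYSTEM_OFFSET_IN_OPTIONAL
    if subsystem_offset + 2 > len(buf):
        raise ValueError("file too short to contain an optional header")

    # Both fields are 16-bit little-endian values; zero is not a valid
    # target machine or subsystem, so the loader refuses to run the file.
    struct.pack_into("<H", buf, machine_offset, 0)
    struct.pack_into("<H", buf, subsystem_offset, 0)
    return bytes(buf)
```

Because only four bytes change, the rest of the file (sections, imports, resources) stays intact for static feature extraction, which is presumably why this form of disarming suits a research dataset.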

2.1 Source and Report Inclusion

Table 4 lists the 14 sources that contributed to the MOTIF dataset and how many labeled malware hashes we identified from each. All sources are large, reputable cybersecurity organizations that regularly release threat reports containing IOCs to the community. We considered several other organizations that satisfy these requirements but due to time and resource constraints we were unable to review them. Thus, the selection of sources to include in the MOTIF dataset was somewhat subjective; the sources that were included tended to have the greatest name recognition and publish the most IOCs in their reports.

| Source | Samples | Source | Samples |
| --- | --- | --- | --- |
| Bitdefender [50] | 199 | G DATA [51] | 106 |
| CheckPoint [52] | 221 | Kaspersky [53] | 502 |
| CISA [54] | 101 | Malwarebytes [55] | 201 |
| Cybereason [56] | 204 | Palo Alto Networks [57] | 434 |
| ESET [58] | 607 | Proofpoint [59] | 567 |
| FireEye [60] | 314 | Symantec [61] | 164 |
| Fortinet [62] | 269 | Talos [63] | 480 |

Table 4: MOTIF Sources

We reviewed all open-source threat intelligence reports published by the 14 sources listed in Table 4 between January 1, 2016 and January 1, 2021. This time window permits the inclusion of recent malware (as of the time of writing) and should allow for most antivirus signatures to have stabilized [64]. In order to ensure that the malware samples included in the MOTIF dataset have family labels with ground truth confidence, we processed each report in a standardized manner. Reports were omitted if they did not meet all of the following conditions: First, the report must provide a detailed technical analysis of the malware, implying that some expert manual analysis was performed. Second, the report must include the MD5, SHA-1, or SHA-256 hashes of the analyzed malware samples and clearly indicate their corresponding families. Finally, the report must be fully open-source: reports requiring email registration or a subscription to a paid service were not included. Of the thousands of reports we surveyed, only 644 met these conditions and contained at least one labeled hash.
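The second inclusion condition depends on locating MD5, SHA-1, or SHA-256 hashes in a report. The reports were reviewed manually, so the sketch below is only an illustration of how candidate hashes could be located mechanically in report text; the function name `extract_hashes` and the regular expression are assumptions, and deciding which family a hash belongs to still requires reading the report.

```python
import re

# Hex strings of exactly 64, 40, or 32 characters, bounded on both sides
# so that longer hex runs are not partially matched.
HASH_PATTERN = re.compile(
    r"\b(?:[0-9a-fA-F]{64}|[0-9a-fA-F]{40}|[0-9a-fA-F]{32})\b"
)
HASH_KINDS = {32: "md5", 40: "sha1", 64: "sha256"}


def extract_hashes(report_text):
    """Map each distinct hash found in the text (lowercased) to its type."""
    found = {}
    for match in HASH_PATTERN.finditer(report_text):
        digest = match.group(0).lower()
        found.setdefault(digest, HASH_KINDS[len(digest)])
    return found
```

Digest length is used as the only type signal here, which is enough to tell the three hash types apart but cannot distinguish a real file hash from any other hex string of the same length.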

2.2 Malware Sample Inclusion and Naming

We also stipulated which malware samples in a report could be included in the dataset. Malware samples without a clear, unambiguous family label were omitted. Samples labeled only using an antivirus signature (and not a family name) were also skipped. Malware samples not targeting the Windows, macOS, or Linux operating systems were excluded. Large dumps of 50 or more malware samples were not included, unless very detailed analysis was provided which suggested all files received manual attention. We emphasize that while constructing the MOTIF dataset, we made every reasonable effort to be methodical and judicious about which malware samples were included.

The following conventions were used when cataloguing malware family names from reports. Any files used to download, launch, or load a malware sample were not treated as belonging to that sample’s family. However, plugins and modules used by modular malware were considered to have family membership. Legitimate tools abused by malicious actors (e.g. Metasploit, Cobaltstrike, etc.) were not treated as malware families. We recorded 4,369 malware hashes with 595 distinct family names (normalized, without alias resolution) during this procedure. Family names were normalized by converting them to lowercase and removing all non-alphanumeric characters. We later use the same process for normalizing the names of threat actors and, for the remainder of this paper, we use normalized names when referring to a malware family or threat actor. In order to ensure reproducibility, we also recorded the name of the source, the date the report was published, and the URL of the report. In some reports the IOCs were provided in a separate appendix; a second URL to the appendix is included in these cases. As described at the beginning of Section 2, non-PE files and files to which we could not gain access were not included in MOTIF. As a result, MOTIF contains 3,095 files from 454 malware families.
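The normalization rule stated above (lowercase, then remove all non-alphanumeric characters) is small enough to write down directly. The sketch assumes that "alphanumeric" means ASCII letters and digits; the function name `normalize_name` and the raw spellings in the usage note are illustrative.

```python
import re


def normalize_name(name):
    """Lowercase a family or threat actor name and strip non-alphanumerics."""
    return re.sub(r"[^a-z0-9]", "", name.lower())
```

Under this rule, spellings such as `Agent Tesla`, `Agent-Tesla`, and `AgentTesla` all collapse to `agenttesla`, so differences in spacing, punctuation, and capitalization between reports do not inflate the count of distinct family names.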

In our manual review of the reports, we found no occurrences of two reports disagreeing on a malware family label. When a report included a non-PE file as part of the family (e.g., a PDF used as the payload for the malware), we excluded it and retained only the valid Windows PE executables. Only reports that provided well-documented manual analysis supporting their conclusions were included, giving us high confidence in all samples of the MOTIF dataset.

2.3 Family Alias Resolution

The MOTIF dataset provides a comprehensive alias mapping for each family, in addition to a brief description of that family’s capabilities and which specific threat actor or campaign it is attributed to, if any. Descriptions are succinct and based on our subjective expertise. Most descriptions contain the category of malware to which the family belongs (e.g. ransomware) and a noteworthy fact about the malware (e.g. an interesting capability, the language it was written in, or the name of a related family). In cases when a threat actor has multiple names, the 2-3 most common ones were provided.

Our primary method for identifying family aliases was open-source reporting. Threat reports about a particular family often supply the aliases used by other organizations. We were thorough in our search for aliases and only used reports published by organizations we considered reputable. We observed that the boundary between two malware families being aliases and variants is often nebulous. To remain consistent, we considered two families to be variants when significant functionality was added to a malware family and subsequent reports referred to the new samples by a different name. In cases when malware was "rebranded" but no new functionality was added (e.g. babax / osno) we treated the two family names as aliases. As a secondary source of alias naming, we investigated antivirus signatures that contained slight variations (e.g. agenttesla / agensla) or rearrangements (e.g. remcos / socmer) of family names. In these cases, it was often obvious when two families were aliases. However, we disregarded antivirus signatures that indicated a generic detection (e.g. centerposgen), that could refer to multiple families or threat actors (e.g. the ekans / snake ransomware and the turla / snake APT), or that were too generic (e.g. padcrypt / crypt).

Our investigation of malware family aliases was supplemented by Malpedia [44], which provides information about family aliases, behavior, and attribution. We independently confirmed these details using open-source reporting from reputable organizations. In a few cases, we identified family names which we suspected were aliases, but we were unable to confirm a relationship using publicly available information. We assess that our alias mapping is as comprehensive as possible given available knowledge and that the impact upon evaluation from the small number of suspected missing aliases is negligible. The alias mapping contains 968 total aliases for the 454 families in MOTIF, an average of 2.132 per family. The wannacry family has an astonishing 15 aliases and 25 families in the dataset have at least five aliases. As illustrated by these results, we strongly feel that inconsistent family naming is a significant issue in the malware analysis community and continued effort to study and correct this problem is warranted.

2.4 Malware Demographics in MOTIF

Figure 1: MOTIF Family Size Distribution

Figure 1 displays the distribution of malware family sizes in MOTIF. The vast majority of families in the dataset are represented by five or fewer samples, many by just one sample. The icedid banking trojan is by far the largest family in MOTIF, with 142 samples. Many of the sources we studied had clear tendencies as to which types of malware were included in their reports. For example, Fortinet [62] focused primarily on malware associated with cybercrime, with a heavy emphasis on ransomware. CISA [54] reported almost exclusively on state-sponsored malware, especially malware tied to North Korea. Many other sources also reported heavily on ransomware and malware used in targeted attacks, which is reflected in the composition of the MOTIF dataset. MOTIF contains 151 malware families attributed to 42 distinct threat actors or campaigns associated with targeted attacks. Criminal groups that do not engage in cyber-espionage (e.g. ta505 and the carbanak group) are not included in this tally. Nearly a third of the malware samples in MOTIF (974 files) have been attributed to one of these 42 threat actors. Ransomware also makes up a significant portion of MOTIF, with 576 malware samples from 102 families. Backdoors, downloaders, RATs, infostealers, and ATM malware are also common malware categories. Adware and other types of PUAs were not frequently reported on and thus have little representation in MOTIF. Although MOTIF includes first-stage malware, such as droppers and downloaders, this category of malware was reported on far more often than is apparent in the dataset. This is likely because first-stage malware is rarely granted a family designation unless it is associated with a targeted attack (e.g. artfulpie) or is known for delivering notable commodity malware families (e.g. hancitor).

2.5 Sources of Bias in MOTIF

We acknowledge that the methods used to construct MOTIF bias the data. Due to the factors described in Section 2.4, MOTIF does not reflect the average distribution of malware that might be encountered on a daily basis; rather, it portrays a collection of malware that a cybersecurity organization would deem most important. Furthermore, the malware samples in MOTIF were published in reports dating between January 1, 2016 and January 1, 2021. MOTIF is no exception to the rule that manually labeling malware is time-consuming, and the malware samples in MOTIF will become outdated over time. We consider MOTIF to be a “hard” dataset on account of its high number of families and the considerable proportion of malware attributed to advanced threat actors. In addition, MOTIF contains multiple malware families that are variants of each other, are attributed to the same threat actor, are packed using the same packer, or share other similarities. We have high confidence that MOTIF is more challenging than existing datasets and we provide further evidence of this claim in Sections 3.1 and 3.4. At the same time, MOTIF represents a scant 3,095 malware samples in a much larger ecosystem where hundreds of thousands of new malware samples are being observed daily [65]. Creating larger, even more representative datasets is a goal of future work.

3 Experiments

To date, MOTIF is the largest public malware dataset with expert ground truth family labels. It is also one of the most diverse, with 454 distinct families. Unlike prior datasets, many of which have approximate labels, lack diversity, or contain outdated malware samples, we claim that evaluations made using the MOTIF dataset can be made with confidence. Furthermore, MOTIF’s combination of ground truth confidence family labels and a comprehensive alias mapping enables experiments and evaluations that could not be performed before. Machine learning has been investigated for malware classification since 1995, and its study has used techniques from across the spectrum: classical ML methods like SVMs and boosting, deep learning, graph-based learning, and supervised and unsupervised clustering, with frequent borrowing from Computer Vision and Natural Language Processing. It is not possible to enumerate all current relevant ML approaches. In this section, we evaluate a notable malware classifier and multiple clustering algorithms using the MOTIF dataset. In addition, we provide benchmark results for two ML models trained on MOTIF, and we assess the capabilities of three outlier detection models to identify novel malware families. The MOTIF dataset enables us to obtain new insights about these tasks and to identify new directions for future research. For this section we use Precision and Recall in the clustering-specific sense: the ability to group different families into different clusters and the same families into the same clusters, respectively [68]. This style of defining Precision and Recall is prevalent in the malware literature, in part due to the difficulty in knowing what the true class labels are, which MOTIF helps resolve. As our results will show, these alternate definitions, adopted out of necessity, can be misleading, “hiding” instances of data points that are being labeled incorrectly in a consistent fashion.
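How grouping-based metrics can hide consistent mislabeling is easiest to see on a toy example. The sketch below assumes the formulation commonly used in the malware clustering literature, in which precision credits each predicted cluster with its most frequent true family and recall credits each true family with its most frequent predicted cluster; whether this matches the cited definition exactly is an assumption, and the function names and labels are illustrative.

```python
from collections import Counter, defaultdict


def clustering_precision_recall(true_labels, predicted_labels):
    """Grouping-based precision and recall; label names are never compared."""
    if not true_labels or len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(true_labels)
    by_cluster = defaultdict(Counter)  # predicted label -> true family counts
    by_family = defaultdict(Counter)   # true family -> predicted label counts
    for truth, predicted in zip(true_labels, predicted_labels):
        by_cluster[predicted][truth] += 1
        by_family[truth][predicted] += 1
    precision = sum(max(c.values()) for c in by_cluster.values()) / n
    recall = sum(max(c.values()) for c in by_family.values()) / n
    return precision, recall


def accuracy(true_labels, predicted_labels):
    """Fraction of samples whose predicted name equals the true name."""
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)


# Every sample is grouped correctly but given the wrong name.
truth = ["icedid", "icedid", "wannacry", "wannacry"]
predicted = ["familyx", "familyx", "familyy", "familyy"]
print(clustering_precision_recall(truth, predicted))  # (1.0, 1.0)
print(accuracy(truth, predicted))                     # 0.0
```

The grouping is perfect, so precision and recall are both 1.0, yet no sample receives its correct family name. Computing accuracy requires knowing which predicted names are aliases of which true names, which is the role the alias mapping plays.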

3.1 Evaluating AVClass

In order for MOTIF to be a benchmark for evaluating antivirus-based malware classifiers, all samples have been uploaded to VirusTotal [69], a platform that scans malware samples with a large number of antivirus products. In our first experiment we evaluate AVClass using the MOTIF dataset. AVClass processes and normalizes antivirus scan reports and then uses plurality voting to predict a malware family [1]. AVClass is one of the most widely used approaches to automated family labeling. It operates on the family labels produced by several antivirus products, using domain-specific steps akin to tokenization, stop-word removal, and lemmatization to resolve inconsistencies between products. We obtained antivirus scan reports for each of the 3,095 malware samples in MOTIF by querying the VirusTotal API. All queries used in the following experiments were made in Aug. 2021. Although the VirusTotal terms of service prohibit us from distributing these scan reports, they can easily be obtained by querying the VirusTotal API.

Table 5 displays AVClass’s precision, recall, and F1 measure on five public malware datasets when run under default settings, previously reported by Sebastián et al. [1]. Note that MalGenome* represents a modification to the MalGenome dataset that groups six variants of DroidKungFu into a single family. Table 5 also shows AVClass’s evaluation results on MOTIF when using its default alias mapping (MOTIF-Default) and when provided with MOTIF’s alias mapping (MOTIF-Alias). In both cases, all of AVClass’s evaluation metrics are significantly lower for MOTIF than any other dataset, with the exception of recall for the Malicia dataset.

| Dataset | Precision | Recall | F1 Measure | Accuracy |
| --- | --- | --- | --- | --- |
| Drebin | 0.954 | 0.884 | 0.918 | Unknown |
| Malicia | 0.949 | 0.680 | 0.792 | Unknown |
| Malsign | 0.904 | 0.907 | 0.905 | Unknown |
| MalGenome* | 0.879 | 0.933 | 0.926 | Unknown |
| Malheur | 0.904 | 0.983 | 0.942 | Unknown |
| MOTIF-Default | 0.763 | 0.674 | 0.716 | 0.468 |
| MOTIF-Alias | 0.773 | 0.700 | 0.735 | 0.506 |

Table 5: AVClass Evaluation Results

This raises the question of why the evaluation results of the other datasets are so much higher. Drebin (the dataset with the highest precision) and Malheur (the dataset with the highest recall and F1 measure) were both labeled using antivirus majority voting. Because this is very similar to AVClass’s plurality voting strategy, evaluation results on these datasets were likely artificially high. Other attributes of the other datasets used to evaluate AVClass, such as containing Android malware rather than Windows, having outdated malware samples, using approximate labeling methods, or lacking size or family diversity, may have also contributed to the observed discrepancies.

Because computing the accuracy of a malware classifier requires a mapping between the aliases used by the classifier and the aliases used by the reference dataset, the majority of works only use precision and recall, which are based on accurate grouping (and not necessarily accurate labeling) [1, 20, 21, 70]. The MOTIF dataset provides a full family alias mapping, allowing accuracy to be computed as well. In Table 5 the stark contrast between AVClass’s accuracy on the MOTIF dataset (46.78%) and its precision and recall (76.35% and 67.36% respectively) is evident. Although our results show that AVClass labels related malware samples with some consistency, the tool predicts an incorrect family name for a malware sample more often than the correct one.
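A toy example makes the gap between grouping metrics and accuracy concrete. The sketch below assumes the clustering-style definitions of precision and recall that are common in the malware literature (each predicted group is credited with its largest overlap with a reference family, and vice versa); the function names and the made-up labels are illustrative only.

```python
from collections import Counter, defaultdict

def cluster_precision_recall(predicted, reference):
    """Clustering-style precision and recall over two parallel label lists."""
    n = len(reference)
    by_pred = defaultdict(Counter)  # predicted label -> counts of reference labels
    by_ref = defaultdict(Counter)   # reference label -> counts of predicted labels
    for p, r in zip(predicted, reference):
        by_pred[p][r] += 1
        by_ref[r][p] += 1
    precision = sum(max(c.values()) for c in by_pred.values()) / n
    recall = sum(max(c.values()) for c in by_ref.values()) / n
    return precision, recall

def accuracy(predicted, reference, aliases):
    """Fraction of samples whose predicted name resolves to the reference family name."""
    hits = sum(aliases.get(p, p) == r for p, r in zip(predicted, reference))
    return hits / len(reference)

# Consistent but wrong names: every sample is grouped correctly and labeled incorrectly.
reference = ["zbot", "zbot", "emotet", "emotet"]
predicted = ["packerx", "packerx", "loader", "loader"]

print(cluster_precision_recall(predicted, reference))  # grouping looks perfect
print(accuracy(predicted, reference, aliases={}))       # no label is correct
```

Because the grouping metrics never compare names, they cannot reveal this kind of consistent mislabeling, whereas accuracy requires an alias mapping to compare names at all.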

3.2 Further Investigation of Antivirus Results

An investigation of the labels predicted by AVClass in our prior experiment revealed that errors were caused in part by AVClass, but primarily by the antivirus scan reports used as input. Antivirus signatures frequently contained the name of a variant of the correct family, the name of a family with similar capabilities to the correct one, or the name of a family known to be otherwise associated with the correct one. Furthermore, antivirus signatures commonly contained non-family information that AVClass did not properly discard, including the name of the group that the sample is attributed to, the broad category of malware it belongs to, the sample’s behavioral attributes, the name of the packer that the sample was packed with, or the programming language that the sample was written in. The discrepancies between the results for MOTIF-Default and MOTIF-Alias in Table 5 also indicate that AVClass often fails to resolve family aliases properly. In our study, these factors frequently caused AVClass to make incorrect predictions; in many cases, the predicted labels were not the name of any valid malware family. Precision and recall are intended to be used as cluster evaluation metrics, and it is clear that model performance can be severely misjudged in cases such as these. Future malware datasets should continue to offer comprehensive alias mappings for each of their constituent families so that accuracy can be computed.

Further investigation revealed that of the 3,095 antivirus scan reports for the MOTIF dataset, 577 reports (18.64%) do not include the correct family name at all and 934 reports (30.18%) include it once at most. The lack of family information in these scan reports would prevent an antivirus-based malware family classifier from achieving higher than 81.36% accuracy on the MOTIF dataset.

By running AVClass with MOTIF’s family alias mapping, we extracted family information from each scan report for the MOTIF dataset. Rather than applying plurality voting (AVClass’s voting method) to these results, we instead used majority voting. Antivirus majority voting resulted in 1,178 reports with the correct family, 719 reports with an incorrect family, and 1,198 reports where no family was the clear majority. After discarding the scan reports with no majority (as is common practice), antivirus majority voting resulted in only 62.10% accuracy. In Section 2.4 we noted that MOTIF contains a disproportionate amount of malware attributed to threat actors associated with targeted attacks. To test whether this was impacting our results, we repeated the experiment using only the 2,121 malware samples from MOTIF that are not attributed to one of these threat actors. Although slightly improved, the results were similar: 854 scan reports had no clear majority family, and only 863 of the remaining 1,267 reports (68.11%) had a correct majority vote. Conventional wisdom has always held that antivirus majority voting is highly accurate, and no prior work has challenged this assumption [1]. However, our finding that antivirus majority voting has just 62.10% accuracy on MOTIF indicates that this belief may require re-examination.
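The percentages reported in this section follow directly from the counts given in the text, as the short check below shows. The `majority_vote` helper illustrates one plausible reading of "majority" (a family must receive more than half of the family votes extracted from a report); that threshold is an assumption, since the exact rule is not spelled out here.

```python
from collections import Counter

def majority_vote(families):
    """Return the family holding more than half of the votes, else None."""
    if not families:
        return None
    family, count = Counter(families).most_common(1)[0]
    return family if count > len(families) / 2 else None

# Counts reported in the text for the full MOTIF dataset.
total_reports = 3095
correct, incorrect, no_majority = 1178, 719, 1198
missing_family = 577  # reports that never mention the correct family

# Accuracy after discarding reports with no clear majority.
accuracy_after_discard = correct / (correct + incorrect)

# Upper bound for any classifier that relies only on these scan reports.
upper_bound = 1 - missing_family / total_reports

# Subset without samples attributed to targeted-attack threat actors.
subset_accuracy = 863 / (2121 - 854)

print(f"{accuracy_after_discard:.2%} {upper_bound:.2%} {subset_accuracy:.2%}")
```

Note that discarding the no-majority reports removes more than a third of the dataset from the denominator, so the accuracy over all 3,095 reports would be lower still.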

3.3 Evaluating Metadata Hashes

Next, we evaluate the precision, recall, and F1 measure of four hashing algorithms used for identifying similar malware samples. These hashes use the metadata of PE files and are widely used as tools to identify similar files. To the best of our knowledge such metadata hashes have not been evaluated on a single corpus for comparison, and the design of new hashes using Learning to Hash [71, 72] research is an open problem that MOTIF can help address. Two PE files have identical Imphash digests if their Import Address Tables (IATs) contain the same functions in the same order [73]. If two files have equivalent values for specific metadata fields in their FILE_HEADER, OPTIONAL_HEADER, and each SECTION_HEADER, they have the same peHash digest [74]. The RichPE hash can be computed for PE files containing a Rich header, which is added to files compiled and linked using the Microsoft toolchain. Files that share select values in their Rich header, FILE_HEADER, and OPTIONAL_HEADER have the same RichPE digest [75]. The fourth hash, vHash, is VirusTotal’s “simple structural feature hash” [76]. VirusTotal has not disclosed how the vHash of a file is computed, and we do not know whether it is based upon metadata or file contents. We are aware of no formal evaluation of vHash. Although prior evaluation has shown that the other three hashes are effective at grouping similar malware, we are unaware of any studies that quantify their precision or recall using a reference dataset with family labels.

Hash Name     Precision   Recall   F1 Measure
Imphash       0.866       0.382    0.530
Imphash-10*   0.971       0.301    0.460
peHash        0.998       0.264    0.417
RichPE        0.964       0.331    0.494
vHash         0.983       0.317    0.480

Table 6: Metadata Hash Evaluation Results

In order to better understand how often collisions between unrelated malware samples occur and how effective these hashes are at grouping malware from the same family, we clustered the MOTIF dataset on identical hash digests. Files for which a hash digest could not be determined (e.g. PE files without an IAT or Rich header) were assigned to singleton clusters. Table 6 displays evaluation results for each hash. Imphash has the highest recall and F1 measure, while at the same time having the lowest precision. Files with few imports may not have a unique Imphash [73], so we repeated the evaluation, assigning all files with fewer than 10 imports to singleton clusters. The results (denoted Imphash-10 in Table 6) indicate that this modification drastically increases the precision of Imphash, but causes a correspondingly large drop in recall and F1 measure. We were not surprised that peHash had a near-perfect precision, as it is widely regarded by the community as a very strict metadata hash. However, this property also yielded the lowest recall and F1 measure. Finally, although RichPE and vHash had no outstanding metric results compared to the other hashes, both possess high precision values. All recall results seem to be poor, but this is typical of metadata hashes as even small changes in a file’s metadata can result in a different digest.
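The clustering step described above, including the stricter Imphash variant, can be sketched as follows. The function name, the sample identifiers, the digests, and the import counts are hypothetical; only the rules themselves (identical digests share a cluster, files without a digest or with too few imports become singletons) come from the text.

```python
def cluster_by_digest(digests, import_counts=None, min_imports=0):
    """Map each sample id to a cluster key.

    Samples with identical digests share a cluster. Samples without a digest,
    or with fewer than `min_imports` imports, are placed in singleton clusters.
    """
    clusters = {}
    for sample_id, digest in digests.items():
        too_few = (
            import_counts is not None
            and import_counts.get(sample_id, 0) < min_imports
        )
        if digest is None or too_few:
            clusters[sample_id] = ("singleton", sample_id)
        else:
            clusters[sample_id] = ("digest", digest)
    return clusters

# Made-up digests; None marks a file for which no digest could be computed.
digests = {"s1": "aa", "s2": "aa", "s3": None, "s4": "bb", "s5": "bb"}
imports = {"s1": 40, "s2": 35, "s3": 0, "s4": 3, "s5": 4}

default = cluster_by_digest(digests)
strict = cluster_by_digest(digests, imports, min_imports=10)

print(len(set(default.values())))  # s1+s2, s4+s5, and s3 alone
print(len(set(strict.values())))   # s4 and s5 are split into singletons
```

Splitting low-import files apart can only remove collisions, which is why this variant trades recall for precision.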

3.4 Machine Learning Experiments

MOTIF provides the first rigorous benchmark for ML family classification methods. We demonstrate this using malware family classification and novel family detection, two standard ML tasks performed extensively on prior datasets [43, 40, 46, 49]. We trained LightGBM [77] and MalConv2 [78] models on the MOTIF dataset using default settings. The LightGBM model uses EMBER feature vectors, while MalConv2 was trained on disarmed binaries using a single Titan RTX GPU.

Model      Accuracy   Std. Dev.
LightGBM   0.724      0.021
MalConv2   0.487      0.017

Table 7: Few-Shot Learning Results

Since malware families with only one representative sample (which we call singleton families) cannot be represented in both the training and test sets, they were left out of the experiment. Then, we performed five-fold stratified cross-validation on the remaining 2,964 files (323 families). The mean and standard deviation of the accuracy scores obtained during cross-validation are listed in Table 7. Neither model demonstrated particularly high accuracy, but they are both significantly better than random guessing (4.79%). Furthermore, given the high number of families in the dataset and limited number of samples per family, their performances are very reasonable.
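The data preparation for this experiment can be illustrated with a small, dependency-free sketch. It is not the procedure used to produce Table 7: `drop_singletons` and `stratified_folds` are hypothetical helpers that remove singleton families and then deal each family's samples round-robin into five folds, which approximates stratification.

```python
from collections import Counter, defaultdict
import random

def drop_singletons(labels):
    """Return indices of samples whose family has at least two members."""
    counts = Counter(labels)
    return [i for i, family in enumerate(labels) if counts[family] > 1]

def stratified_folds(labels, indices, n_folds=5, seed=0):
    """Deal each family's samples round-robin into folds to keep family mixes similar."""
    rng = random.Random(seed)
    by_family = defaultdict(list)
    for i in indices:
        by_family[labels[i]].append(i)
    folds = [[] for _ in range(n_folds)]
    offset = 0  # continue dealing where the previous family stopped
    for family in sorted(by_family):
        members = by_family[family]
        rng.shuffle(members)
        for j, i in enumerate(members):
            folds[(offset + j) % n_folds].append(i)
        offset += len(members)
    return folds

# Toy labels: three multi-sample families and one singleton family ("d").
labels = ["a"] * 6 + ["b"] * 4 + ["c"] * 2 + ["d"]
kept = drop_singletons(labels)
folds = stratified_folds(labels, kept)
print([len(fold) for fold in folds])
```

A family with fewer members than folds cannot appear in every fold, so some test folds will lack examples of the smallest families, a limitation that applies to any stratified split of data this sparse.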

Model                  Precision   Recall   F1 Measure
Isolation Forest       0.0         0.0      0.0
Local Outlier Factor   0.265       0.206    0.232
One-Class SVM          0.233       0.382    0.289

Table 8: Novel Family Detection Results

Malware family classifiers are trained on a finite number of malware families and new, unknown families will be present when a model is deployed into a production environment. To test how well ML models can distinguish novel malware families from known families, we withheld the 131 malware samples from singleton families in MOTIF, in addition to 10% of the remaining dataset (297 files). Three outlier detection models were trained on EMBER feature vectors for the remainder of the files. The objective of the experiment was to determine whether these models would detect the singleton families as outliers because they were not included in the training set. As shown in Table 8, none of the models were able to perform this task reliably and the Isolation Forest did not detect a single file as an outlier. Although existing ML models can distinguish between malware families with moderate accuracy, the overall differences between existing and novel families seem difficult to identify. Further research is needed to ensure that models can appropriately address novel families.
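Scoring such an experiment reduces to binary precision, recall, and F1 over the held-out files. The sketch below assumes that files from withheld singleton families are the positive class and that a detector's outlier flag is its positive prediction; the `outlier_scores` helper and the toy inputs are illustrative and are not the paper's results.

```python
def outlier_scores(is_novel, flagged):
    """Precision, recall, and F1 where 'novel family' is the positive class."""
    pairs = list(zip(is_novel, flagged))
    tp = sum(1 for novel, flag in pairs if novel and flag)
    fp = sum(1 for novel, flag in pairs if not novel and flag)
    fn = sum(1 for novel, flag in pairs if novel and not flag)
    # Define each metric as 0.0 when its denominator is empty.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Toy held-out set: 4 files from novel families, 6 from known families.
is_novel = [True] * 4 + [False] * 6
flagged = [True, False, False, False, True, True, False, False, False, False]

print(outlier_scores(is_novel, flagged))
# A detector that never flags anything scores zero on all three metrics.
print(outlier_scores(is_novel, [False] * 10))
```

With 131 singleton-family files and 297 known-family files held out, a detector that flags nothing produces exactly the all-zero row seen for the Isolation Forest.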

4 Discussion and Conclusion

MOTIF is the first large, publicly accessible corpus of Windows malware with ground truth reference labels. The disarmed malware samples, EMBER features, and linked reports are valuable resources for future research. MOTIF is also the first dataset to provide a comprehensive mapping of malware family aliases, enabling numerous experiments and evaluations that could not be previously performed. Results obtained using the MOTIF dataset have already challenged conventional wisdom firmly held by the community, such as the accuracy of techniques which use collective decisions of a group of antivirus products as a source of family labeling. In the first evaluations of their kind, we found that AVClass’s accuracy on the MOTIF dataset (46.78%) is considerably lower than previously thought, and that antivirus majority voting correctly classifies only 62.10% of the malware samples for which a clear majority could be obtained. These findings impact nearly all malware family classification research, especially research related to antivirus-based labeling.

We recognize that the malware demographics in MOTIF do not reflect the distribution of malware families that might be encountered in the wild, and instead express the families which malware analysis organizations consider to be most relevant, a balance which has pros and cons. While collecting reports over a longer period of time or from more sources could further expand the corpus, it is unlikely to significantly change the current limitations of MOTIF. The use of high-confidence methods for identifying related malware (e.g., peHash) could significantly increase the size of MOTIF, at the cost of losing full ground truth confidence. Our hope for future datasets is that they will be constructed with care to identify the trade-offs in scale, label quality, and diversity so that they can be used together to provide more accurate judgments.

Beyond the standard tasks shown in this paper, MOTIF opens the door to many new avenues of ML research on malware problems. The reports can be used to explore few- and zero-shot learning to detect families before samples are available. The label quality allows exploring transfer learning from larger, less accurately labeled corpora. The large number of families, more representative of real-world diversity, also allows more consideration of metrics and training approaches in the face of class-imbalanced learning [79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89]. We envision MOTIF becoming a valuable asset for evaluating malware family classifiers and for enabling future malware research.


  • Sebastián et al. [2016] M. Sebastián, R. Rivera, P. Kotzias, and J. Caballero, “Avclass: A tool for massive malware labeling,” in Research in Attacks, Intrusions, and Defenses, F. Monrose, M. Dacier, G. Blanc, and J. Garcia-Alfaro, Eds.   Cham: Springer International Publishing, 2016, pp. 230–253.
  • Northcutt et al. [2021] C. G. Northcutt, A. Athalye, and J. Mueller, “Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks,” pp. 1–16, 2021. [Online]. Available:
  • Geiger et al. [2020] R. S. Geiger, K. Yu, Y. Yang, M. Dai, J. Qiu, R. Tang, and J. Huang, “Garbage in, Garbage out? Do Machine Learning Application Papers in Social Computing Report Where Human-Labeled Training Data Comes From?” in Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, ser. FAT* ’20.   New York, NY, USA: Association for Computing Machinery, 2020, pp. 325–336. [Online]. Available:
  • Pleiss et al. [2020] G. Pleiss, T. Zhang, E. Elenberg, and K. Q. Weinberger, “Identifying Mislabeled Data using the Area Under the Margin Ranking,” in Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, Eds., vol. 33.   Curran Associates, Inc., 2020, pp. 17 044–17 056. [Online]. Available:
  • Patrini et al. [2017] G. Patrini, A. Rozza, A. K. Menon, R. Nock, and L. Qu, “Making Deep Neural Networks Robust to Label Noise: A Loss Correction Approach,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).   IEEE, jul 2017, pp. 2233–2241. [Online]. Available:
  • Liu and Tao [2016] T. Liu and D. Tao, “Classification with Noisy Labels by Importance Reweighting,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 3, pp. 447–461, mar 2016. [Online]. Available:
  • Nicholson et al. [2015] B. Nicholson, J. Zhang, V. S. Sheng, and Z. Wang, “Label noise correction methods,” Proceedings of the 2015 IEEE International Conference on Data Science and Advanced Analytics, DSAA 2015, 2015.
  • Frenay and Verleysen [2014] B. Frenay and M. Verleysen, “Classification in the Presence of Label Noise: A Survey,” IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 5, pp. 845–869, may 2014. [Online]. Available:
  • Natarajan et al. [2013] N. Natarajan, I. S. Dhillon, P. Ravikumar, and A. Tewari, “Learning with Noisy Labels,” in Advances in Neural Information Processing Systems 26, 2013, pp. 1196–1204. [Online]. Available:
  • Joyce et al. [2021] R. J. Joyce, E. Raff, and C. Nicholas, “A Framework for Cluster and Classifier Evaluation in the Absence of Reference Labels,” in Proceedings of the 14th ACM Workshop on Artificial Intelligence and Security (AISec ’21).   Association for Computing Machinery, 2021.
  • Huang and Stokes [2016] W. Huang and J. Stokes, “Mtnet: A multi-task neural network for dynamic malware classification,” 07 2016, pp. 399–418.
  • Rossow et al. [2012] C. Rossow, C. J. Dietrich, C. Grier, C. Kreibich, V. Paxson, N. Pohlmann, H. Bos, and M. van Steen, “Prudent Practices for Designing Malware Experiments: Status Quo and Outlook,” in 2012 IEEE Symposium on Security and Privacy.   IEEE, may 2012, pp. 65–79. [Online]. Available:
  • Pendlebury et al. [2019] F. Pendlebury, F. Pierazzi, R. Jordaney, J. Kinder, and L. Cavallaro, “TESSERACT: Eliminating Experimental Bias in Malware Classification across Space and Time,” in 28th USENIX Security Symposium (USENIX Security 19).   Santa Clara, CA: USENIX Association, aug 2019, pp. 729–746. [Online]. Available:
  • Jordaney et al. [2016] R. Jordaney, Z. Wang, D. Papini, I. Nouretdinov, and L. Cavallaro, “Misleading Metrics : On Evaluating Machine Learning for Malware with Confidence,” University of London, Tech. Rep., 2016. [Online]. Available:
  • Christodorescu and Jha [2004] M. Christodorescu and S. Jha, “Testing Malware Detectors,” in Proceedings of the 2004 ACM SIGSOFT International Symposium on Software Testing and Analysis, ser. ISSTA ’04.   New York, NY, USA: ACM, 2004, pp. 34–44. [Online]. Available:
  • Marx [2000] A. Marx, “A guideline to anti-malware-software testing,” in European Institute for Computer Anti-Virus Research (EICAR), 2000, pp. 218–253.
  • Mohaisen et al. [2015] A. Mohaisen, O. Alrawi, and M. Mohaisen, “Amal: High-fidelity, behavior-based automated malware analysis and classification,” Computers & Security, vol. 52, pp. 251 – 266, 2015. [Online]. Available:
  • Mohaisen and Alrawi [2013] A. Mohaisen and O. Alrawi, “Unveiling zeus: Automated classification of malware samples,” in Proceedings of the 22nd International Conference on World Wide Web, ser. WWW ’13 Companion.   New York, NY, USA: Association for Computing Machinery, 2013, p. 829–832. [Online]. Available:
  • Votipka et al. [2019] D. Votipka, S. M. Rabin, K. Micinski, J. S. Foster, and M. M. Mazurek, “An Observational Investigation of Reverse Engineers ’ Processes,” in USENIX Security Symposium, 2019.
  • Kotzias et al. [2015] P. Kotzias, S. Matic, R. Rivera, and J. Caballero, “Certified pup: Abuse in authenticode code signing,” in CCS ’15, 2015.
  • Nappa et al. [2015] A. Nappa, M. Z. Rafique, and J. Caballero, “The malicia dataset: identification and analysis of drive-by download operations,” International Journal of Information Security, vol. 14, no. 1, pp. 15–33, 2015. [Online]. Available:
  • Botacin et al. [2020a] M. Botacin, F. Ceschin, P. de Geus, and A. Grégio, “We need to talk about antiviruses: challenges & pitfalls of av evaluations,” Computers & Security, vol. 95, p. 101859, 2020. [Online]. Available:
  • Mohaisen et al. [2014] A. Mohaisen, O. Alrawi, M. Larson, and D. McPherson, “Towards a methodical evaluation of antivirus scans and labels,” in Information Security Applications, Y. Kim, H. Lee, and A. Perrig, Eds.   Cham: Springer International Publishing, 2014, pp. 231–241.
  • Li et al. [2010] P. Li, L. Liu, D. Gao, and M. K. Reiter, “On challenges in evaluating malware clustering,” in Recent Advances in Intrusion Detection, S. Jha, R. Sommer, and C. Kreibich, Eds.   Berlin, Heidelberg: Springer Berlin Heidelberg, 2010, pp. 238–255.
  • Cohen [1987] F. Cohen, “Computer viruses: Theory and experiments,” Computers & Security, vol. 6, no. 1, pp. 22–35, feb 1987. [Online]. Available:
  • [26] W. Ballenthin and M. Raabe, “capa: Automatically identify malware capabilities,”, Last accessed on 2020-10-1.
  • Botacin et al. [2020b] M. Botacin, F. Ceschin, P. de Geus, and A. Grégio, “We need to talk about antiviruses: challenges & pitfalls of av evaluations,” Computers & Security, vol. 95, p. 101859, 2020. [Online]. Available:
  • Arp et al. [2020] D. Arp, E. Quiring, F. Pendlebury, A. Warnecke, F. Pierazzi, C. Wressnegger, L. Cavallaro, and K. Rieck, “Dos and Don’ts of Machine Learning in Computer Security,” arXiv, 2020. [Online]. Available:
  • Li et al. [2017] B. Li, K. Roundy, C. Gates, and Y. Vorobeychik, “Large-Scale Identification of Malicious Singleton Files,” in 7TH ACM Conference on Data and Application Security and Privacy, 2017.
  • Giacinto and Dasarathy [2011] G. Giacinto and B. V. Dasarathy, “Machine learning for computer security: A guide to prospective authors,” Information Fusion, vol. 12, no. 3, pp. 238–239, 2011. [Online]. Available:
  • Aghakhani et al. [2020] H. Aghakhani, F. Gritti, F. Mecca, M. Lindorfer, S. Ortolani, D. Balzarotti, G. Vigna, and C. Kruegel, “When Malware is Packin’ Heat; Limits of Machine Learning Classifiers Based on Static Analysis Features,” in Proceedings 2020 Network and Distributed System Security Symposium.   Reston, VA: Internet Society, 2020. [Online]. Available:
  • Wampler et al. [2019] J. Wampler, I. Martiny, and E. Wustrow, “ExSpectre: Hiding Malware in Speculative Execution,” in Proceedings 2019 Network and Distributed System Security Symposium.   Reston, VA: Internet Society, 2019. [Online]. Available:_02B-5_Wampler_paper.pdf
  • Egele et al. [2017] M. Egele, T. Scholte, E. Kirda, and S. Barbara, “A Survey On Automated Dynamic Malware Analysis Evasion and Counter-Evasion,” in Proceedings of Reversing and Offensive-oriented Trends Symposium, 2017. [Online]. Available:&btnG=Search&q=intitle:A+Survey+on+Automated+Dynamic+Malware+Analysis+Techniques+and+Tools#0
  • [34] Y. Zhou, “Malgenome project,”, Last accessed on 2020-3-9.
  • Upchurch and Zhou [2015] J. Upchurch and X. Zhou, “Variant: a malware similarity testing framework,” 2015 10th International Conference on Malicious and Unwanted Software (MALWARE), pp. 31–39, 2015.
  • [36] U.S. Department of Homeland Security, “National cybersecurity and communications integration center (nccic),”, Last accessed on 2021-8-12.
  • [37] Dell Secureworks, “Analysis of dhs nccic indicators,”, Last accessed on 2021-8-12.
  • [38] K. Rieck, “Malheur dataset,”, Last accessed on 2020-3-9.
  • Wei et al. [2017] F. Wei, Y. Li, S. Roy, X. Ou, and W. Zhou, “Deep ground truth analysis of current android malware,” in Detection of Intrusions and Malware, and Vulnerability Assessment, M. Polychronakis and M. Meier, Eds.   Cham: Springer International Publishing, 2017, pp. 252–276.
  • [40] D. Arp, “The drebin dataset,”, Last accessed on 2020-3-9.
  • Qiao et al. [2016] Y. Qiao, X. Yun, and Y. Zhang, “How to automatically identify the homology of different malware,” in 2016 IEEE Trustcom/BigDataSE/ISPA, Aug 2016, pp. 929–936.
  • [42] “Dataset - malicia project,”, Last accessed on 2020-3-9.
  • Ronen et al. [2018] R. Ronen, M. Radu, C. Feuerstein, E. Yom-Tov, and M. Ahmadi, “Microsoft malware classification challenge,” CoRR, vol. abs/1802.10135, 2018. [Online]. Available:
  • Plohmann et al. [2017] D. Plohmann, M. Clauss, S. Enders, and E. Padilla, “Malpedia: A Collaborative Effort to Inventorize the Malware Landscape,” in The Journal on Cybercrime & Digital Investigations, vol. 3, no. 1, 2017.
  • Karbab et al. [2018] E. B. Karbab, M. Debbabi, A. Derhab, and D. Mouheb, “Maldozer: Automatic framework for android malware detection using deep learning,” Digit. Investig., vol. 24, pp. S48–S59, 2018. [Online]. Available:
  • Anderson and Roth [2018] H. S. Anderson and P. Roth, “Ember: An open dataset for training static pe malware machine learning models,” 2018.
  • [47] Microsoft, “Protect your data and devices with windows security,”, Last accessed on 2021-8-12.
  • Loi et al. [2021] N. Loi, C. Borile, and D. Ucci, “Towards an automated pipeline for detecting and classifying malware through machine learning,” 2021.
  • Harang and Rudd [2020] R. Harang and E. M. Rudd, “Sorel-20m: A large scale benchmark dataset for malicious pe detection,” 2020.
  • [50] Bitdefender, “Bitdefender labs,”, Last accessed on 2021-8-14.
  • [51] G DATA, “G data security blog | g data,”, Last accessed on 2021-8-14.
  • [52] CheckPoint, “Home - check point research,”, Last accessed on 2021-8-14.
  • [53] Kaspersky, “Securelist | kaspersky’s threat research and reports,”, Last accessed on 2021-8-14.
  • [54] Cybersecurity & Infrastructure Security Agency, “Analysis reports | cisa,”, Last accessed on 2021-8-14.
  • [55] Malwarebytes, “Threat analysis archives - malwarebytes labs | malwarebytes labs,”, Last accessed on 2021-8-14.
  • [56] Cybereason, “Cybereason blog | cybersecurity news and analysis,”, Last accessed on 2021-8-14.
  • [57] Palo Alto Networks, “Unit 42 - latest cyber security research | palo alto networks,”, Last accessed on 2021-8-14.
  • [58] ESET, “Welivesecurity,”, Last accessed on 2021-8-14.
  • [59] Proofpoint, “Threat insight information & resources | proofpoint blog,”, Last accessed on 2021-8-14.
  • [60] FireEye, “Threat research blog | fireeye inc,”, Last accessed on 2021-8-14.
  • [61] Symantec, “Threat intelligence | symantec blogs,”, Last accessed on 2021-8-14.
  • [62] Fortinet, “Threat research,”, Last accessed on 2021-8-14.
  • [63] Talos, “Cisco talos intelligence group - comprehensive threat intelligence,”, Last accessed on 2021-8-14.
  • Zhu et al. [2020] S. Zhu, J. Shi, L. Yang, B. Qin, Z. Zhang, L. Song, and G. Wang, “Measuring and modeling the label dynamics of online anti-malware engines,” in 29th USENIX Security Symposium (USENIX Security 20).   Boston, MA: USENIX Association, Aug. 2020. [Online]. Available:
  • VirusTotal [a] VirusTotal, “File statistics during last 7 days,”, Last accessed on 2021-8-11.
  • Kephart et al. [1995] J. O. Kephart, G. B. Sorkin, W. C. Arnold, D. M. Chess, G. J. Tesauro, and S. R. White, “Biologically Inspired Defenses Against Computer Viruses,” in Proceedings of the 14th International Joint Conference on Artificial Intelligence - Volume 1, ser. IJCAI’95.   San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1995, pp. 985–996. [Online]. Available:
  • Raff and Nicholas [2020] E. Raff and C. Nicholas, “A Survey of Machine Learning Methods and Challenges for Windows Malware Classification,” in NeurIPS 2020 Workshop: ML Retrospectives, Surveys & Meta-Analyses (ML-RSA), 2020. [Online]. Available:
  • Jang et al. [2011] J. Jang, D. Brumley, and S. Venkataraman, “BitShred: Feature Hashing Malware for Scalable Triage and Semantic Analysis,” in Proceedings of the 18th ACM conference on Computer and communications security - CCS.   New York, New York, USA: ACM Press, 2011, pp. 309–320. [Online]. Available:
  • VirusTotal [b] VirusTotal, “Analyze suspicious files and urls to detect types of malware, automatically share them with the security community,”, Last accessed on 2021-8-11.
  • Rieck et al. [2011] K. Rieck, P. Trinius, C. Willems, and T. Holz, “Automatic analysis of malware behavior using machine learning,” Journal of Computer Security, vol. 19, pp. 639–668, 2011.
  • Wang et al. [2018] J. Wang, T. Zhang, J. Song, N. Sebe, and H. T. Shen, “A Survey on Learning to Hash,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 769–790, apr 2018. [Online]. Available:
  • Kulis and Darrell [2009] B. Kulis and T. Darrell, “Learning to Hash with Binary Reconstructive Embeddings,” in Advances in Neural Information Processing Systems 22, Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta, Eds.   Curran Associates, Inc., 2009, pp. 1042–1050. [Online]. Available:
  • Mandiant [2014] Mandiant, “Tracking malware with import hashing,” Jan 2014,, Last accessed on 2021-8-14.
  • Wicherski [2009] G. Wicherski, “pehash: A novel approach to fast malware clustering,” in LEET, 2009.
  • Joyce et al. [2019] R. J. Joyce, S. Burke, and K. Bilzer, “Malware attribution using the rich header,” Jan 2019,, Last accessed on 2021-8-14.
  • VirusTotal [c] VirusTotal, “Virustotal api v3 overview,”, Last accessed on 2021-8-14.
  • Ke et al. [2017] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T.-Y. Liu, “LightGBM: A Highly Efficient Gradient Boosting Decision Tree,” in Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds.   Curran Associates, Inc., 2017, pp. 3146–3154. [Online]. Available:
  • Raff et al. [2021] E. Raff, W. Fleshman, R. Zak, H. S. Anderson, B. Filar, and M. McLean, “Classifying Sequences of Extreme Length with Constant Memory Applied to Malware Detection,” in The Thirty-Fifth AAAI Conference on Artificial Intelligence, 2021. [Online]. Available:
  • Lemaître et al. [2017] G. Lemaître, F. Nogueira, and C. K. Aridas, “Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning,” Journal of Machine Learning Research, vol. 18, no. 17, pp. 1–5, 2017. [Online]. Available:
  • Oak et al. [2019] R. Oak, M. Du, D. Yan, H. Takawale, and I. Amit, “Malware Detection on Highly Imbalanced Data through Sequence Modeling,” in Proceedings of the 12th ACM Workshop on Artificial Intelligence and Security, ser. AISec’19.   New York, NY, USA: Association for Computing Machinery, 2019, pp. 37–48. [Online]. Available:
  • DEL GAUDIO et al. [2014] R. DEL GAUDIO, G. BATISTA, and A. BRANCO, “Coping with highly imbalanced datasets: A case study with definition extraction in a multilingual setting,” Natural Language Engineering, vol. 20, no. 03, pp. 327–359, jul 2014. [Online]. Available:
  • Akbani et al. [2004] R. Akbani, S. Kwek, and N. Japkowicz, “Applying Support Vector Machines to Imbalanced Datasets,” in Proceedings of the 15th European Conference on Machine Learning, ser. ECML’04.   Berlin, Heidelberg: Springer-Verlag, 2004, pp. 39–50. [Online]. Available:
  • He and Garcia [2009] H. He and E. A. Garcia, “Learning from imbalanced data,” IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 9, pp. 1263–1284, 2009.
  • Kubat and Matwin [1997] M. Kubat and S. Matwin, “Addressing the Curse of Imbalanced Training Sets: One Sided Selection,” in Proceedings of the Fourteenth International Conference on Machine Learning, vol. 97, 1997, pp. 179–186.
  • Prati et al. [2009] R. C. Prati, G. E. Batista, and M. C. Monard, “Data Mining with Imbalanced Class Distributions: Concepts and Methods,” in Indian International Conference on Artificial Intelligence (IICAI), Tumkur, India, 2009, pp. 359–376.
  • Moskovitch et al. [2009] R. Moskovitch, D. Stopel, C. Feher, N. Nissim, N. Japkowicz, and Y. Elovici, “Unknown malcode detection and the imbalance problem,” Journal in Computer Virology, vol. 5, no. 4, pp. 295–308, nov 2009. [Online]. Available:
  • Blagus and Lusa [2013] R. Blagus and L. Lusa, “SMOTE for high-dimensional class-imbalanced data,” BMC Bioinformatics, vol. 14, no. 1, p. 106, 2013. [Online]. Available:
  • Prati et al. [2004] R. Prati, G. Batista, and M. Monard, “Class imbalances versus class overlapping: an analysis of a learning system behavior,” in MICAI 2004: Advances in Artificial Intelligence, 2004, pp. 312–321. [Online]. Available:
  • Japkowicz and Stephen [2002] N. Japkowicz and S. Stephen, “The Class Imbalance Problem: A Systematic Study,” Intelligent Data Analysis, vol. 6, no. 5, pp. 429–449, oct 2002. [Online]. Available:
  • Strubell et al. [2019] E. Strubell, A. Ganesh, and A. McCallum, “Energy and policy considerations for deep learning in NLP,” CoRR, vol. abs/1906.02243, 2019. [Online]. Available:

Appendix A Datasheets for Datasets



For what purpose was the dataset created? Was there a specific task in mind? Was there a specific gap that needed to be filled? Please provide a description.
The MOTIF dataset was created because there is a dearth of malware family datasets with ground truth reference labels. It is meant to improve evaluation of malware family classifiers, especially antivirus-based ones.

Who created this dataset (e.g., which team, research group) and on behalf of which entity (e.g., company, institution, organization)?
This dataset was created as a collaboration between Booz Allen Hamilton and the University of Maryland, Baltimore County.

What support was needed to make this dataset? (e.g., who funded the creation of the dataset? If there is an associated grant, provide the name of the grantor and the grant name and number, or if it was supported by a company or government agency, give those details.)
Creation of this dataset was supported by Booz Allen Hamilton.

Any other comments?

Composition



What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)? Are there multiple types of instances (e.g., movies, users, and ratings; people and interactions between them; nodes and edges)? Please provide a description.
The main instances of the dataset are disarmed malware samples. Raw EMBER features (representing extracted PE metadata) are also provided for each instance.

How many instances are there in total (of each type, if appropriate)?
There are 3,095 malware samples in total from 454 families. The distribution of family sizes is given in Section 2.4.

Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set? If the dataset is a sample, then what is the larger set? Is the sample representative of the larger set (e.g., geographic coverage)? If so, please describe how this representativeness was validated/verified. If it is not representative of the larger set, please describe why not (e.g., to cover a more diverse range of instances, because instances were withheld or unavailable).
The dataset does not contain all known malware samples, or all known malware samples from open-source threat reports. It represents a sampling of malware that cybersecurity organizations would find notable. We provide a detailed discussion of representativeness in Section 2.4.

What data does each instance consist of? “Raw” data (e.g., unprocessed text or images) or features? In either case, please provide a description.
The disarmed malware samples are raw data (sequences of bytes) and the EMBER metadata are raw features. We also provide EMBERv2 feature vectors generated from the EMBER raw features.

Is there a label or target associated with each instance? If so, please provide a description.
Yes, each instance contains a malware family label. The name of the family, known aliases of the family, and a unique ID of the family that can be used as an ML label are provided.
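
As an illustration of how the name, aliases, and unique ID can work together, the following is a minimal sketch of an alias lookup. The record layout (`name`, `aliases`, `label` keys) and the family names are hypothetical placeholders chosen for this example; they are not taken from the released files.

```python
def build_alias_index(families):
    """Map every known name of a family (primary name or alias) to its label ID.

    `families` is an iterable of dicts with the hypothetical keys
    "name" (str), "aliases" (list of str), and "label" (int).
    """
    index = {}
    for family in families:
        for name in [family["name"], *family["aliases"]]:
            index[name] = family["label"]
    return index


# Hypothetical records, for illustration only.
example_families = [
    {"name": "examplefam", "aliases": ["examplefamalt", "exfam"], "label": 0},
    {"name": "otherfam", "aliases": [], "label": 1},
]
alias_index = build_alias_index(example_families)
```

With such an index, a name produced by an external tool can be compared against the reference label regardless of which alias the tool happens to use, for example `alias_index.get("exfam")`.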

Is any information missing from individual instances? If so, please provide a description, explaining why this information is missing (e.g., because it was unavailable). This does not include intentionally removed information, but might include, e.g., redacted text.
We assess that our alias mapping is comprehensive, but we suspect that a low number of aliases may be missing (not enough to impact the results discussed in our paper), either due to lack of reporting or lack of antivirus data.

Are relationships between individual instances made explicit (e.g., users’ movie ratings, social network links)? If so, please describe how these relationships are made explicit.
Malware samples with the same family or attributed to the same threat group or campaign are noted. We often note when two families are variants in the brief descriptions, but this information is likely not complete.

Are there recommended data splits (e.g., training, development/validation, testing)? If so, please provide a description of these splits, explaining the rationale behind them.

Are there any errors, sources of noise, or redundancies in the dataset? If so, please provide a description.
Labels in this dataset are considered to have ground truth confidence. Any errors would be caused by an analyst writing a report or by the author who aggregated the dataset.

Is the dataset self-contained, or does it link to or otherwise rely on external resources (e.g., websites, tweets, other datasets)? If it links to or relies on external resources, a) are there guarantees that they will exist, and remain constant, over time; b) are there official archival versions of the complete dataset (i.e., including the external resources as they existed at the time the dataset was created); c) are there any restrictions (e.g., licenses, fees) associated with any of the external resources that might apply to a future user? Please provide descriptions of all external resources and any restrictions associated with them, as well as links or other access points, as appropriate.
This dataset links to fourteen external sources of open-source malware reporting (described in Table 4, with links provided in the References). There are no guarantees that this data will continue to exist and remain constant, and many sources do not have official archives.

Does the dataset contain data that might be considered confidential (e.g., data that is protected by legal privilege or by doctor-patient confidentiality, data that includes the content of individuals’ non-public communications)? If so, please provide a description.

Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety? If so, please describe why.

Does the dataset relate to people? If not, you may skip the remaining questions in this section.

Does the dataset identify any subpopulations (e.g., by age, gender)? If so, please describe how these subpopulations are identified and provide a description of their respective distributions within the dataset.

Is it possible to identify individuals (i.e., one or more natural persons), either directly or indirectly (i.e., in combination with other data) from the dataset? If so, please describe how.

Does the dataset contain data that might be considered sensitive in any way (e.g., data that reveals racial or ethnic origins, sexual orientations, religious beliefs, political opinions or union memberships, or locations; financial or health data; biometric or genetic data; forms of government identification, such as social security numbers; criminal history)? If so, please provide a description.

Any other comments?

Collection Process



How was the data associated with each instance acquired? Was the data directly observable (e.g., raw text, movie ratings), reported by subjects (e.g., survey responses), or indirectly inferred/derived from other data (e.g., part-of-speech tags, model-based guesses for age or language)? If data was reported by subjects or indirectly inferred/derived from other data, was the data validated/verified? If so, please describe how.
We provide a full and detailed description of the process used to gather the data in Section 2.

Over what timeframe was the data collected? Does this timeframe match the creation timeframe of the data associated with the instances (e.g., recent crawl of old news articles)? If not, please describe the timeframe in which the data associated with the instances was created. Finally, list when the dataset was first published.
Data was collected between Feb. and Aug. 2021. The surveyed articles were published between Jan. 2016 and Jan. 2021.

What mechanisms or procedures were used to collect the data (e.g., hardware apparatus or sensor, manual human curation, software program, software API)? How were these mechanisms or procedures validated?
All data was collected manually from open-source threat intelligence reports. We validated the MOTIF dataset in the following ways. First, we confirmed that there were no instances of reports disagreeing about the family of a malware sample. Second, we confirmed that all files matched the expected file type. This check did uncover a small number of author errors, in which the hash of a first-stage malware sample described in the report (such as a malicious document) was reported with the IOCs of the payloads for a family (usually executable files). These instances were removed from the MOTIF dataset. We will update the manuscript to reflect how we validated the results.
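
The first validation step, checking that no two reports assign different families to the same sample, can be sketched as follows. This is a minimal illustration rather than the procedure actually used (the data was collected and checked manually); the function name and the `(hash, family)` pair layout are assumptions, and family names are assumed to have already been resolved to one canonical name per family so that aliases do not register as disagreements.

```python
from collections import defaultdict


def find_conflicting_labels(records):
    """Return {sample_hash: set_of_families} for hashes labeled with more than one family.

    `records` is an iterable of (sample_hash, family) pairs, one per
    report mention. An empty result means no reports disagree.
    """
    families_by_hash = defaultdict(set)
    for sample_hash, family in records:
        families_by_hash[sample_hash].add(family)
    return {h: fams for h, fams in families_by_hash.items() if len(fams) > 1}


# Hypothetical report mentions, for illustration only.
example_records = [
    ("hash_a", "examplefam"),
    ("hash_a", "examplefam"),  # same sample, second report, same family
    ("hash_b", "examplefam"),
    ("hash_b", "otherfam"),    # disagreement that would need manual review
]
conflicts = find_conflicting_labels(example_records)
```

Any hash that appears in the result would need to be reviewed by hand against the source reports.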

What was the resource cost of collecting the data? (e.g., what were the required computational resources, and the associated financial costs, and energy consumption - estimate the carbon footprint. See Strubell et al. [90] for approaches in this area.)
The financial costs amounted to compensating one data scientist during the time of data collection. No significant compute was needed for its construction beyond a standard laptop.

If the dataset is a sample from a larger set, what was the sampling strategy (e.g., deterministic, probabilistic with specific sampling probabilities)?


Who was involved in the data collection process (e.g., students, crowdworkers, contractors) and how were they compensated (e.g., how much were crowdworkers paid)?

One Booz Allen Hamilton employee was involved in the data collection process and they were compensated per their normal salary.

Were any ethical review processes conducted (e.g., by an institutional review board)? If so, please provide a description of these review processes, including the outcomes, as well as a link or other access point to any supporting documentation.
No ethical review process was implemented.

Does the dataset relate to people? If not, you may skip the remainder of the questions in this section.

Did you collect the data from the individuals in question directly, or obtain it via third parties or other sources (e.g., websites)?

Were the individuals in question notified about the data collection? If so, please describe (or show with screenshots or other information) how notice was provided, and provide a link or other access point to, or otherwise reproduce, the exact language of the notification itself.

Did the individuals in question consent to the collection and use of their data? If so, please describe (or show with screenshots or other information) how consent was requested and provided, and provide a link or other access point to, or otherwise reproduce, the exact language to which the individuals consented.

If consent was obtained, were the consenting individuals provided with a mechanism to revoke their consent in the future or for certain uses? If so, please provide a description, as well as a link or other access point to the mechanism (if appropriate).

Has an analysis of the potential impact of the dataset and its use on data subjects (e.g., a data protection impact analysis) been conducted? If so, please provide a description of this analysis, including the outcomes, as well as a link or other access point to any supporting documentation.

Any other comments?


Preprocessing / Cleaning / Labeling

Was any preprocessing/cleaning/labeling of the data done (e.g., discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values)?

If so, please provide a description. If not, you may skip the remainder of the questions in this section.
Malware hashes for which we did not have access to the corresponding files were not included in the MOTIF dataset. Malware family names were normalized by converting them to lowercase and removing non-alphanumeric characters.
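
The name normalization described above is simple enough to express in a few lines. The sketch below is only an illustration of the stated rule (lowercase, then remove non-alphanumeric characters); the function name is hypothetical, and treating "alphanumeric" as ASCII letters and digits is an assumption.

```python
import re


def normalize_family_name(name: str) -> str:
    """Lowercase a family name and drop every character that is not a-z or 0-9.

    Assumption: "alphanumeric" is interpreted as ASCII letters and digits.
    """
    return re.sub(r"[^a-z0-9]", "", name.lower())


# Spaces, hyphens, and punctuation are removed; digits are kept.
examples = [normalize_family_name(n) for n in ("Agent Tesla", "Example-Fam_2.0", "PlugX")]
```

Under this rule, differently styled spellings of the same name, such as "Agent Tesla" and "agent-tesla", collapse to the same string.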

Was the “raw” data saved in addition to the preprocessed/cleaned/labeled data (e.g., to support unanticipated future uses)? If so, please provide a link or other access point to the “raw” data.
Survey information from hashes with no corresponding files is located in motif_reports.csv. The original names of normalized families can be determined easily by navigating to the provided report URLs.

Is the software used to preprocess/clean/label the instances available? If so, please provide a link or other access point.
No software was used for preprocessing.

Any other comments?

Uses



Has the dataset been used for any tasks already?

So far, the only tasks this dataset has been used for are the experiments described in Section 3.

Is there a repository that links to any or all papers or systems that use the dataset? If so, please provide a link or other access point.
Yes. A GitHub repository containing the dataset and code for training the discussed machine learning models is located at

What (other) tasks could the dataset be used for?
This dataset can be used for evaluation of malware family classifiers, ML and NLP tasks involving malware samples and reports written about them, and a variety of other malware clustering and classification tasks.

Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses? For example, is there anything that a future user might need to know to avoid uses that could result in unfair treatment of individuals or groups (e.g., stereotyping, quality of service issues) or other undesirable harms (e.g., financial harms, legal risks)? If so, please provide a description. Is there anything a future user could do to mitigate these undesirable harms?

Are there tasks for which the dataset should not be used? If so, please provide a description.
Use of this dataset must follow the terms of licensing at

Any other comments? N/A.

Distribution



Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created? If so, please provide a description.
The dataset has been made fully open-source.

How will the dataset be distributed (e.g., tarball on website, API, GitHub)? Does the dataset have a digital object identifier (DOI)?
The dataset is available on GitHub at

When will the dataset be distributed?
The dataset will be made public on 12/1/2021.

Will the dataset be distributed under a copyright or other intellectual property (IP) license, and/or under applicable terms of use (ToU)? If so, please describe this license and/or ToU, and provide a link or other access point to, or otherwise reproduce, any relevant licensing terms or ToU, as well as any fees associated with these restrictions.

The dataset and code will be distributed under the Booz Allen Public License, which allows for use, modification, and public distribution by non-profits, academics, and commercial entities, but does not allow selling of the dataset or derivatives. The license also limits liability, a requirement given the malware contained in the corpus.

Have any third parties imposed IP-based or other restrictions on the data associated with the instances? If so, please describe these restrictions, and provide a link or other access point to, or otherwise reproduce, any relevant licensing terms, as well as any fees associated with these restrictions.
The reports used are subject to copyright by their original owners, and links to them are provided to avoid any copyright issues. The VirusTotal reports can be obtained by others using a free account, but we are prohibited from redistributing them ourselves by the VirusTotal license. No other restrictions on this data have been imposed by third parties.

Do any export controls or other regulatory restrictions apply to the dataset or to individual instances? If so, please describe these restrictions, and provide a link or other access point to, or otherwise reproduce, any supporting documentation.

U.S. export control laws may apply to any work produced in the U.S., and it is the responsibility of external parties to confirm whether a license is needed. We have not imposed any export controls ourselves and are not aware of any special regulation that this dataset may fall under. We do not guarantee or warrant that no export control requirement, now or in the future, applies to this dataset.

Any other comments?

Maintenance



Who is supporting/hosting/maintaining the dataset?
Booz Allen Hamilton

How can the owner/curator/manager of the dataset be contacted (e.g., email address)?
The lead curator can be contacted at or

Is there an erratum? If so, please provide a link or other access point.

Will the dataset be updated (e.g., to correct labeling errors, add new instances, delete instances)? If so, please describe how often, by whom, and how updates will be communicated to users (e.g., mailing list, GitHub)?
We may perform very infrequent updates, if at all. Updates would be solely to correct any errors made during curation or in code. Updates will be communicated via GitHub.

If the dataset relates to people, are there applicable limits on the retention of the data associated with the instances (e.g., were individuals in question told that their data would be retained for a fixed period of time and then deleted)? If so, please describe these limits and explain how they will be enforced.

Will older versions of the dataset continue to be supported/hosted/maintained? If so, please describe how. If not, please describe how its obsolescence will be communicated to users.
Yes, prior versions of the dataset will be made available via GitHub, and obsolescence will be communicated via GitHub.

If others want to extend/augment/build on/contribute to the dataset, is there a mechanism for them to do so? If so, please provide a description. Will these contributions be validated/verified? If so, please describe how. If not, why not? Is there a process for communicating/distributing these contributions to other users? If so, please provide a description.
The dataset is fully open-source and other users are free to augment it. These contributions will not be validated or verified.

Any other comments?