SMART: Semantic Malware Attribute Relevance Tagging

05/15/2019, by Felipe N. Ducau, et al.

With the rapid proliferation and increased sophistication of malicious software (malware), detection methods no longer rely only on manually generated signatures but have also incorporated more general approaches like Machine Learning (ML) detection. Although powerful for conviction of malicious artifacts, these methods do not produce any further information about the type of malware that has been detected. In this work, we address the information gap between ML and signature-based detection methods by introducing an ML-based tagging model that generates human interpretable semantic descriptions of malicious software (e.g. file-infector, coin-miner), and argue that for less prevalent malware campaigns these provide potentially more useful and flexible information than malware family names. For this, we first introduce a method for deriving high-level descriptions of malware files from an ensemble of vendor family names. Then we formalize the problem of malware description as a tagging problem and propose a joint embedding deep neural network architecture that can learn to characterize portable executable (PE) files based on static analysis, thus not requiring a dynamic trace to identify behaviors at deployment time. We empirically demonstrate that when evaluated against tags extracted from an ensemble of anti-virus detection names, the proposed tagging model correctly identifies more than 93.7% of the tags for each sample, at a deployable false positive rate (FPR) of 1% per tag. Furthermore, we show that when evaluating this model against ground truth tags derived from the results of dynamic analysis, it correctly predicts 93.5% of the tags for a given sample. These results suggest that an ML tagging model can be effectively deployed alongside a detection model for malware description.

1. Introduction

Whenever one or more malicious files are found in a computer network, the first step towards remediation is to understand the nature of the attack in progress. Knowing the malicious capabilities associated with each suspicious file gives important context to network defenders which helps them define and prioritize counter-measures.

Generally, anti-virus (AV) or anti-malware solutions provide a detection name when they alert about potentially harmful files detected in a machine as a way to provide this context. These detection names usually come from specific signatures written by reverse engineers to identify particular threats, therefore encoding expert knowledge about a given malware sample. While this is theoretically useful for categorizing known malware variants, differing malware naming conventions among vendors have led to detection names that are inconsistent and highly vendor-specific (Kelchner, 2010; Maggi et al., 2011). For example, Worm.Ludbaruma.B and Win32.Worm.VB.k are detection names produced by two different vendors for the same sample. The problem of inconsistent naming conventions has been compounded by more feature-rich malware and increased quantities of malware over time. Moreover, some detection names serve only as unique identifiers and do not provide actionable information about what type of harm the malicious sample could do if it infects a system (e.g. Gen:Variant.Razy.260309 or Trojan (005153df1)).

When a novel malware variant appears, applying existing detection names is even more problematic, since current rule-based signatures will likely not trigger on these variants at all. Machine learning (ML) malware detectors have the potential to identify these new malware samples as malicious, but generally do not provide further information about the type of malware encountered.

In this paper, with existing detection naming issues in mind, we introduce our novel SMART (Semantic Malware Attribute Relevance Tagging) approach. In contradistinction to prior malware (family) detection names, the semantic malware attribute tags that SMART detection uses yield human interpretable, high level descriptions of the capabilities of a given malware sample. They can convey different types of information such as purpose (‘crypto-miner’, ‘dropper’), malware family (‘ransomware’), and file characteristics (‘packed’). SMART tags are related to malware family names in the sense that they attempt to describe how a piece of malicious software executes and the intent behind it. However, unlike malware family names, SMART tags are non-exclusive, meaning that one malware campaign (or family) can be associated with multiple tags and a given tag can be associated with multiple malware families. This formulation allows indexing malware by semantic similarity in terms of type of malicious content, and opens the door to novel applications, e.g., natural language search queries in an endpoint detection and response (EDR) product, prioritization of events based on malicious content, and similarity indexing of new unseen samples, among others.

The number of tags is also inherently bounded by types of malicious behavior and chosen granularity in description. Thus, a fixed number of tags can roughly describe all malicious samples, even when the number of malware families increases dramatically. Because of this fact, the SMART approach to malware description makes the task suitable to be addressed with machine learning methods.

SMART tags also serve as a common ground to integrate knowledge from multiple sources or detection technologies since they do not presume standard naming conventions. For the experiments in this paper, we derive these tags by leveraging the underlying knowledge encoded in detection names from multiple anti-malware vendors in the industry, and from behavioral traces of files’ execution, but the general framework applies whenever we have different analyses of the same file.

Using our derived SMART tags, we then train a multi-label deep neural network to automatically predict tags for new (unseen) files in real time, and only assuming access to their raw binary representations. We find that our network yields impressive performance on the tag prediction task.

The primary contributions of this paper are as follows:

  1. We propose a simple annotation method to extract malware tags based on AV detection names.

  2. We further introduce the task of malware tag prediction and formalize it under the framework of multi-label learning.

  3. We empirically demonstrate that it is feasible to learn behavioral characteristics of malicious software samples from a static representation of the file by fitting a deep neural network to predict the proposed set of tags and evaluating the results on ground truth tags extracted from behavioral traces of files’ execution.

The remainder of this paper is structured as follows: In Section 2 we discuss existing approaches to malware description, particularly family names and hierarchies and review attempts to establish industry-wide standard naming conventions. We also discuss related research in the machine learning for information security (ML-Sec) space as well as similar applications in other domains, such as image tagging, music information retrieval, and semantic facial attribute recognition. In Section 3, we define the concept of describing malware with semantic tags, and present two methods for deriving SMART tags from binary executable files, one by aggregating detection names from multiple vendors in the industry and a second one by exploiting behavioral information from running the files in a virtual environment. We then formalize the problem of malware characterization as a tagging problem in Section 4, and propose two neural network architectures for tag prediction. In Section 5, we train several neural networks and evaluate their performance on the tagging task both on noisy tags extracted from detection names and on ground-truth tags extracted from dynamic analysis of the files. We analyze our results and their ramifications in Section 7, and present conclusions and propose directions for future research in Section 8.

2. Background and Related Work

In this section we revisit the idea of malware description by using family classification, review the concept of attribute tagging in other domains, and survey related machine learning approaches in the field of machine learning for information security (ML-Sec).

2.1. Malware Family Categorization

Identifying the family and variant of a particular malicious sample can provide important intelligence to the end user, administrator or security operator of a system about what type of attack might be underway. This extra contextual information can help define a remediation procedure and evaluate the severity and potential consequences of the attack. In fact, numerous vendors provide on their websites detailed information about popular families and variants, with associated descriptions of what each variant does and suggestions for how a particular piece of malware can be removed. Without such identifying information, we are left only with the offending file itself as its own description. Unless some reverse engineering effort is taken, which can be costly, it is difficult to discern much about the internals of the file.

The idea of identifying all malware under a consistent family naming scheme across multiple vendors has been around for decades. It first came to the attention of the security community in the early 1990s and prompted the Computer Antivirus Research Organization (CARO) to propose a first naming convention in 1991 (Bontchev et al., 1991), which was later extended to add coverage for new kinds of malicious software such as backdoors, trojans, joke programs and droppers, among others (Bontchev, 2005).

The threat landscape has changed dramatically since the introduction of the CARO standard nearly 30 years ago. The quantity of new malware samples that security vendors' labs receive has increased dramatically, to millions per month. Some of these samples are variations of previously known malware, while others take code from older campaigns and re-purpose it for new tasks. Yet still others are entirely novel types of malware. In this scenario it becomes practically infeasible to manually and consistently group each malicious file into a well defined hierarchy of families. Even the arguably simpler task of assigning a malware file to an existing family has become much harder, as malware has grown more resistant to signature-focused detection thanks to advanced obfuscation measures such as polymorphism/metamorphism, (repeated) packing and obfuscation, recompilation, and self-updating (Rudd et al., 2017).

The increasing quantities of malware samples and their resilience to signature methods caused the security community to start using more flexible analytic tools than signatures designed only to identify a single malware variant (Harley, 2009), and to rely increasingly on dynamic analysis and more generic signatures when possible (Harley and Bureau, 2008). While generic signatures offer an advantage for malware conviction, the nature of this approach makes it more difficult to organize malware names into families. Moreover, the quantities of malicious files to be analyzed lead to less structured categorizations and greater inconsistencies between vendors. The number of anti-malware vendors with varying detection strategies has also grown considerably, further compounding the problem. In particular, modern security vendor detection names typically fall into one of four categories, containing varying amounts of information about the threat family to which a malware sample belongs.

  • Traditional family based. Names are associated with unique and distinctive attributes of the malware and its variants. Malware classified under these names usually has a larger amount of original source code or a novel exploit mechanism and often comes from the same origins. This not only gives it a distinctive attribute that can aid in classification, but also requires researchers to put forth more effort to analyze its inner workings. These types of detection names are often of high quality and have more consistency across vendors.

    Today, these types of detection names are most often seen in parasitic file-infectors and specific botnet campaigns with distinctive attributes, e.g. Virut, Sality, Conficker, etc. These types of names usually use a suffix to identify specific variants, which often denotes a revision to the malware or a change in the configuration data for use in a different campaign. E.g. Mal/Sality-D.

  • Technique based. These types of detection names group together malware that may come from different origins and/or have multiple authors but share a common method or technique. For example, many executable autorun worms have been written in the past using languages like Visual Basic 6 or AutoIt that change Explorer's file and folder view settings to hide filename extensions and employ an icon resource similar to that of a popular document format. Due to the relatively low complexity of the infection method, many amateurs copied this technique, resulting in a large amount of similar malware that was not necessarily of the same origin.

    Some anti-malware solutions would generically detect and classify many of these malware samples under the same generic family name, whereas other vendors may have defined different, more specific criteria for each family classification based on other attributes of the payload. When a generic family name is provided by the AV vendor, the detection name suffix is oftentimes replaced with a partial hash of the file data in order to identify a specific sample. The difference in detection methods employed often results in less consistency in technique based detection names across vendors. E.g. Troj/AutoIt-CHN.

  • Method based. This type of detection name simply denotes the detection technology used to detect the malware sample. Some detection names can simply be that of a patented technology, project, or internal code name specific to the AV vendor, indicating the use of heuristics, ML, or real-time detection technologies like cloud look-ups. In these cases the detection name is not that of a malware family, but that of the method that was utilized to detect the sample. E.g. Unsafe.AI_Score_64%.

  • Kit based. AV vendors will often use more generic family names for detecting malware that has been generated by a known kit. These kits are often referred to as grey hat tools, as they can be used both offensively by penetration testing teams and by malware authors. Many of these kits obfuscate their payloads in an attempt to circumvent detection by AV software. These kits often do not require much skill to use, and as a result the number of malware authors able to employ these methods is large, making kit-generated malware more prevalent in the wild. Detection names in this category tend not to describe the origins or functionality of the specific malicious payload, but instead identify methods used by the kit or tool to obfuscate or hide its payload. E.g. Trojan:Win32/Meterpreter.gen!C.

In (Sebastián et al., 2016), Sebastián et al. identified a number of naming inconsistencies across cybersecurity vendors and proposed an approach that uses data-mining techniques to distill family names for malware samples by combining detection names from multiple anti-malware solutions. Complementary to this work, (Perdisci and U, 2012) proposed an automated technique, which relies on individual detection names from multiple vendors, for evaluating the quality of a given malware clustering. In this paper we also combine detection names from multiple vendors, but instead of trying to fit each new sample into a naming scheme with mutually exclusive hierarchical categories such as families, we propose an alternative approach to describing the functionality of, and the relationships between, malicious samples by using attribute tags. A set of attribute tags describes a piece of malware through easy-to-interpret properties and can be thought of as a soft family classification, since it describes the sample and relates it to other samples described with the same (or an overlapping) set of tags. The advantage of the SMART tags approach is that it does not presume a partition of the malware space by genealogy, while providing potentially more useful information about a malware sample.

The idea of describing malware through a set of descriptive attributes instead of malware families is not new: the MITRE Corporation developed the Malware Attribute Enumeration and Characterization (MAEC) effort (Kirillov et al., 2011), a standardized language for attribute-based malware characterization, in version 5.0 at the time of writing. This structured language aims to encode information about malware based upon an extensive list of attributes such as behaviors, capabilities, artifacts, and relationships between malware samples, among others. For the purposes of this work, and because of budgetary considerations related to dataset labeling, we chose to work with a reduced, independently defined set of tags. Nevertheless, the techniques explored here would apply to a broader attribute definition.

2.2. Semantic Attribute Tagging

Semantic attribute tagging refers to the association of samples with key-words that convey various types of high-level information about their content. These tags can later be used to interpret or summarize the content of the sample, for information retrieval in a large database of samples or for clustering, among others. In the last decade the use of tags has become a popular method for organizing and describing digital information. Content platforms use tags for images, video, audio, news articles, blog posts, and even questions in question-answering forums.

Automatic content tagging algorithms attempt to annotate data by learning the relationships between the tags and the content. Because any given sample can be related to multiple tags, this task can be, and usually is, framed as a multi-label prediction problem within the field of machine learning. Automatic image (Denton et al., 2015; Zhang et al., 2016; Frome et al., 2013; Chen et al., 2013; Weston et al., 2010; Gong et al., 2013) and audio (Choi et al., 2016, 2017; Nayyar et al., 2017) tagging are among the most popular areas of research in automatic attribute tag prediction today. State-of-the-art text, image and audio tagging algorithms use deep learning techniques which require massive datasets of tagged samples to train on. These datasets are often generated collaboratively (either directly or indirectly), meaning that multiple sources annotate some part of the dataset independently of each other. As noted by Choi et al. in (Choi et al., 2017), this way of obtaining labeled information is a noisy process which has to be accounted for in the design and evaluation of the learning algorithm. In particular, Choi et al. study the effect of noisy labels when training deep neural networks in the multi-label classification setting, especially when the noise is skewed towards the negative labels.

Semantic attribute tagging has two important characteristics worth considering: i) first, it can convey a lot of identifying information about a sample, even if the sample is novel. Facial attribute tagging (Kumar et al., 2009; Rudd et al., 2016; Wilber et al., 2014; Rozsa et al., 2016, 2017; Scheirer et al., 2012), for example, has repeatedly demonstrated that vectors of attribute predictions (e.g., gender, hair color, ethnicity, etc.) from one or more classifiers can themselves be powerful feature vector representations for face recognition and verification algorithms; ii) secondly, semantic tags can be stored, structured, and retrieved in a human interpretable manner (Scheirer et al., 2012). Both of these characteristics are appealing in a commercial computer security use case, where the type of threat can be roughly identified by a description that makes sense to security researchers and end users.

2.3. Multi-Label Classification

We briefly mentioned in Section 2.2 that semantic attribute tagging relies on multi-label classification, wherein we aim to predict multiple labels simultaneously. There are several ways to do this, the most trivial of which is to learn one classifier per label. This naive approach is not efficient, in the sense that one classifier does not benefit from what the other classifiers have learned about a given sample. Furthermore, it can be infeasible from a deployment perspective, particularly as the number of labels grows. For correlated labels, a popular approach is to use a single classifier with multiple outputs, one per output label. The total loss for the classifier is obtained by adding together the loss terms across the model's outputs during training and optimizing over a multi-objective loss. Not only does this yield a more compact representation, but it also improves classification performance over using independent classifiers (Rudd et al., 2016). As our baseline we use a multi-label deep neural network architecture (Section 4) which exploits a shared representation of the input samples and has multiple binary cross-entropy loss functions atop stacks of hidden layers, or heads, with final sigmoid outputs – one per tag.

An alternative approach to multi-label classification, first introduced in the image tagging and retrieval literature, is to learn a compact shared vector space representation onto which both input samples and labels are mapped – a joint embedding (Weston et al., 2010; Frome et al., 2013; Chen et al., 2013; Gong et al., 2014; Denton et al., 2015; Zhang et al., 2016) – where similar content across modalities (images and tags, in the case of image tagging) is projected onto similar vectors in the same low dimensional space. At query time, a similarity comparison between vectors in this learned latent space is performed, e.g., via inner product, to determine likely labels. A variety of models could be employed to form a joint embedding, but crucially, the embedding is optimized across input modalities/labels. In Section 4 we present a joint embedding model that maps SMART tags and executable files into the same low dimensional Euclidean space for the malware description problem.

2.4. Malware Analysis with Neural Networks

In recent years multiple advances in machine learning for information security (ML-Sec) have taken place. This can be attributed to several factors including an explosion in labeled data available from vendor aggregation services and threat intelligence feeds, and more powerful hardware and software frameworks for fitting highly expressive classifiers, along with a need of the cyber-security industry to incorporate more flexible methods to improve their detection pipelines. In this work we focus particularly on analysis over Windows Portable Executable (PE) files based on static features, i.e. information that can be extracted from the binary files without having to run them.

In contrast to our work, which focuses on malware description, most modern applications of deep learning in this space have focused on malware detection. Saxe et al. (Saxe and Berlin, 2015) applied deep neural network detection to feature vectors derived from 2-dimensional histogram statistics of PE files along with hashed delimited strings and hashed elements from the file header, including metadata and import tables. Further applications of deep learning exploiting similar feature sets have been used to categorize web content (Saxe et al., 2018), office documents (Rudd et al., 2018), and archive formats (Rudd et al., 2018). Other types of features and classifiers have also been used for the task of PE malware detection. For instance, Raff et al. demonstrated in (Raff et al., 2017b) a way to effectively identify malware using solely an embedding of the first 300 bytes from the PE header. In later work, Raff et al. proposed an embedding strategy which takes in the entire PE file (Raff et al., 2017a) for the same problem. In (Cakir and Dogdu, 2018), Cakir and Dogdu use a disassembler to retrieve the opcodes of the executable files and then a shallow network based on word2vec (Mikolov et al., 2013) to embed them into a continuous vector space. Afterwards, they train a classifier based on Gradient Boosting Machines for the malware classification task.

The two approaches most closely related to ours are the work by Huang and Stokes (Huang and Stokes, 2016), which uses the auxiliary task of predicting family detection names with the goal of improving the performance of their detection model, and the contemporaneous work by Rudd et al. (Rudd et al., 2019), in which the authors study the impact of using multiple auxiliary loss terms for a multitude of tasks, in parallel to the main binary detection task, and conclude that using this auxiliary information during training is beneficial for the performance on the main task. Note, however, that the purpose of the auxiliary losses in these works was to improve performance on the main malicious/benign detection task.

3. Semantic Malware Attribute Tags

We define a semantic malware attribute tag (which we will also refer to as a malicious or malware tag for short) as a potentially informative, high-level attribute of malicious or potentially unwanted software. These tags are loosely related to malware families, in the sense that they attempt to describe how a piece of malicious software executes and the intent behind it, but they do so in a more general and flexible way. One malware campaign (or family) can be associated with more than one tag, and a given tag is associated with multiple families. For the purpose of this study, and without loss of generality, we define a set T of 11 different tags of interest that we can use to describe malicious PE files: adware, crypto-miner, downloader, dropper, file-infector, flooder, installer, packed, ransomware, spyware, and worm. We chose this particular set of tags so that we can generate concise descriptions for most common malware currently found in the wild. The definitions for each of the tags can be found in Appendix A.

Since malware tags are defined at a higher level of abstraction than malware families, we can bypass the problem of not having a common naming strategy for malicious software, and thus exploit the knowledge contained in multiple genealogies generated from different sources in a quasi-independent manner: detection technologies, methodologies, etc. It becomes irrelevant if one source identifies a sample as being part of the Qakbot family while another calls it Banking Trojan, so long as we have a way to associate those two correctly with the spyware tag (Qakbot in particular also exhibits the behavior of a worm and could therefore also be tagged as such). Furthermore, some sources might have stronger detection rules for certain kinds of malware.

In the remainder of this section we propose three labeling strategies used to generate tags for a given set of files: i) one that combines the information encoded in the detection names of several anti-malware solutions and then translates them into semantic tags; ii) an extension to the previous labeling strategy that exploits co-occurrence information on these detection names to improve the stability and coverage of the tags; and iii) a dynamic approach based on a behavioral analysis of the files' execution to detect popular malware families with high confidence. In later sections we will use these labeled sets for both training and evaluation of deep neural networks (DNNs) that annotate previously unseen samples in real time by looking only at their binary representation.

Detection names | Parsed tokens | Tags
Ares!4A26E203524C; Downloader; a variant of Win32/Adware.Adposhel.AM.gen; None; None; None; Gen:Variant.Razy.260309; None; Trojan ( 005153df1 ); Riskware/Adposhel | ares, downloader, variant, win32, adware, adposhel, gen, gen, variant, razy, trojan, riskware, adposhel | adware, downloader
W32.Virlock!inf7; TR/Crypt.ZPACK.Gen; Trojan ( 004d48ee1 ); Virus:Win32/Nabucur.D; W32/VirRnsm-F; Virus.Win32.PolyRansom.k; Win32.Virlock.Gen.8; W32/Virlock.J; Trojan-FNET!CCD9055108A1; a variant of Win32/Virlock.J | w32, virlock, inf7, tr, crypt, zpack, gen, trojan, win32, nabucur, w32, vir, rnsm, virrnsm, win32, poly, ransom, polyransom, win32, virlock, gen, trojan, variant, win32, virlock | ransomware, packed, file-infector

Table 1. An example of how our tags are derived from detection names from multiple sources. The first column shows detection names from nine different vendors (separated here by semicolons), where the value None indicates that the vendor has not identified the sample as malicious. The second column lists the tokens parsed and normalized from those detection names. The last column shows the tags associated with the tokens in the middle column.

3.1. Tag Distillation from Detection Names

High quality tags for malware samples at the scale required to train deep learning models can be prohibitively expensive to create manually. Instead, we rely on semi-automatic strategies that are noisier than manual labeling but allow us to label millions of files that can then be used to train our classifiers. For training purposes, we propose a labeling function that annotates PE files with the previously defined set of tags by combining information contained in detection names from multiple vendors (vendor names are anonymized throughout this work to avoid inappropriate comparisons). In this work we use names from nine anti-malware solutions that are known to produce high quality detection names. The labeling process consists of two main stages: token extraction and token-to-tag mapping. Example outputs of each intermediate stage of the tag distillation are shown in Table 1. We later extend this procedure to improve tagging stability and coverage by exploiting statistical properties of detection names.

3.1.1. Token Extraction

The first step for deriving tags from detection names is parsing the individual detection names to extract relevant tokens within these names. A token is defined as a sequence of characters in the detection name, delimited by punctuation, special characters, or case transitions from lowercase to uppercase (we create tokens both splitting and not splitting on case transitions). These are then normalized to lowercase. For example, from the detection name Win32.PolyRansom.k we extract the set of tokens {win32, polyransom, poly, ransom, k}. Once all the tokens from all the vendor detection names for a given training dataset are created, we keep those tokens that appear in a fraction of samples larger than a threshold; in practice we set the threshold to 0.06%. A manual inspection of the discarded (infrequent) tokens found that they were mostly non-informative pseudo-random strings of characters commonly present in detection names (e.g. ‘31e49711’, ‘3004dbe01’).
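As an illustration, the following Python sketch shows one possible implementation of this token extraction and frequency filtering step; the helper names and the exact splitting rules are our own assumptions rather than the authors' implementation.

```python
import re
from collections import Counter

def extract_tokens(detection_name):
    """Split a detection name into lowercase tokens on punctuation, special
    characters, and lowercase-to-uppercase case transitions."""
    if not detection_name:
        return set()
    tokens = set()
    for piece in re.split(r"[^0-9A-Za-z]+", detection_name):
        if not piece:
            continue
        tokens.add(piece.lower())                              # without case splitting
        for sub in re.split(r"(?<=[a-z])(?=[A-Z])", piece):    # with case splitting
            tokens.add(sub.lower())
    return tokens

def frequent_tokens(per_sample_detection_names, min_fraction=0.0006):
    """Keep only tokens appearing in more than `min_fraction` (0.06%) of samples."""
    counts = Counter()
    for vendor_names in per_sample_detection_names:            # one list of names per sample
        sample_tokens = set()
        for name in vendor_names:
            sample_tokens |= extract_tokens(name)
        counts.update(sample_tokens)
    n_samples = len(per_sample_detection_names)
    return {tok for tok, c in counts.items() if c / n_samples > min_fraction}

# Example: extract_tokens("Win32.PolyRansom.k") -> {'win32', 'polyransom', 'poly', 'ransom', 'k'}
```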

3.1.2. Token to Tag Mapping

Once the most common tokens were defined, we manually built an association rule from tokens to tags for those tokens related to well-known malware family names or those that could be easily associated with one or more of our tags. For example, nabucur is the family name of a type of ransomware and can therefore be associated with that tag. Similarly, the token xmrig, even though it is not the name of a malware family, can be recognized as referring to crypto-currency mining software and can therefore be associated with the crypto-miner tag. In this way, we created a mapping from tokens to tags based on prior knowledge. With this mapping, we can associate a sample with a tag if any of the tokens that map to that tag is present in any of the detection names given by the set of trusted vendors.
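A minimal sketch of this token-to-tag mapping follows, reusing the extract_tokens helper from the previous sketch; the dictionary entries shown are illustrative examples only, not the full manually curated mapping.

```python
# Abbreviated, illustrative token -> tags mapping built from prior knowledge.
TOKEN_TO_TAGS = {
    "adposhel":   {"adware"},
    "xmrig":      {"crypto-miner"},
    "nabucur":    {"ransomware"},
    "downloader": {"downloader"},
    "sality":     {"file-infector"},
}

def tags_for_sample(vendor_detection_names, token_to_tags=TOKEN_TO_TAGS):
    """Tag a sample if any token mapped to that tag appears in any of the
    trusted vendors' detection names (missing detections are skipped)."""
    tags = set()
    for name in vendor_detection_names:
        for token in extract_tokens(name):
            tags |= token_to_tags.get(token, set())
    return tags
```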

3.1.3. Token Relationship Mining

In order to understand how tokens relate to each other, we compute the empirical token conditional probability matrix C, whose entries are

C_{ij} = P(t_i | t_j) = N_{ij} / N_j,    (1)

where N_j is the number of times the token t_j appears in a given dataset, and N_{ij} is the number of times t_i and t_j occur together. C_{ij} is then, by definition, the empirical conditional probability of token t_i given token t_j for a given dataset of samples. We then define the following pairwise relationships between tokens based on their empirical conditional probabilities:

  • Tokens t_i and t_j are synonyms under threshold α if and only if C_{ij} ≥ α and C_{ji} ≥ α.

  • Token t_i is a parent of token t_j under threshold α if and only if C_{ij} ≥ α and C_{ji} < α.

  • Token t_i is a child of token t_j under threshold α if and only if C_{ji} ≥ α and C_{ij} < α.
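The sketch below shows how the matrix of Equation 1 and these token relationships could be computed; the function names, data layout, and exact handling of ties are assumptions consistent with the definitions above.

```python
import numpy as np

def token_conditional_probabilities(samples_tokens, vocab):
    """C[i, j] = N_ij / N_j: fraction of samples containing token vocab[j]
    that also contain token vocab[i] (Equation 1)."""
    idx = {t: i for i, t in enumerate(vocab)}
    co = np.zeros((len(vocab), len(vocab)))
    for tokens in samples_tokens:                      # one set of tokens per sample
        present = [idx[t] for t in tokens if t in idx]
        for i in present:
            for j in present:
                co[i, j] += 1                          # co[j, j] ends up equal to N_j
    counts = np.diag(co).copy()
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(counts > 0, co / counts, 0.0)  # divide column j by N_j

def token_relationships(C, vocab, alpha=0.97):
    """Derive synonym and (parent, child) pairs from the conditional probabilities."""
    synonyms, parent_of = [], []
    for i in range(len(vocab)):
        for j in range(i + 1, len(vocab)):
            if C[i, j] >= alpha and C[j, i] >= alpha:
                synonyms.append((vocab[i], vocab[j]))
            elif C[i, j] >= alpha:                     # vocab[i] appears whenever vocab[j] does
                parent_of.append((vocab[i], vocab[j]))
            elif C[j, i] >= alpha:
                parent_of.append((vocab[j], vocab[i]))
    return synonyms, parent_of
```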

With this in mind, we extend our previously proposed labeling function as follows. We use a given tag, associated with a set of tokens, to describe a given malware sample if, after parsing the detection names for the sample, we find that:

  • any of the tokens associated with the tag is present for the sample,

  • OR any of the synonyms of those tokens is present for the sample,

  • OR any of the children of those tokens is present.

The first bullet refers to the use of the manually created mapping between tags and tokens, i.e. our original labeling function. The following two bullets define automatic steps for extending the tag definitions and improving the stability of the tagging method. Empirically, we observed that when computing the token co-occurrence statistics in our training set as in Equation 1, the automatic steps improved the tag coverage in the validation set on average by 13%, while increasing the mean token redundancy – the mean number of tokens observed per tag – from 2.9 to 5.8, as shown in Table 2. This increase in mean token redundancy makes the labeling function more stable against mis-classifications or missing scans from the set of trusted vendors. A more complete analysis of the value of the automatic extension step is deferred until Section 5.1. The threshold α was set to 0.97, the value at which the coverage of malicious tags improved for malware samples in our validation set while remaining constant for benign samples. In Appendix C we present a histogram of empirical pairwise conditional probabilities for pairs of tokens; values of conditional probability larger than 0.97 are noticeably more common than those in the range [0.5, 0.97].

The tags obtained with this labeling procedure tend to be noisy because of the “crowd-sourcing” method used in extracting tokens from multiple sources. Because of this, we refer to samples annotated with this method as weakly labeled. On the other hand, this labeling methodology has the advantage of being cheap to compute and having high coverage over samples. As long as at least one of the vendors names the sample with a detection name, and that detection name contains a token associated with one of the previously defined tags – either directly or statistically via a parent-child or synonym token relationship – there will be a label for that sample. It is also important to note that this labeling technique generates primarily positive relations: a tag being present identifies a relationship between the sample and the tag, but its absence does not necessarily imply a strong negative relation.

3.2. Tag Creation from Behavioral Information

In order to obtain high quality family name classifications for a set of samples, we use a proprietary behavioral sandbox environment. Such a system allows us to execute (detonate) the samples causing them to expose relevant behaviors such as unpacking and/or downloading additional components. Memory dumps, network traffic packet captures, file read and write operations, as well as many other activities can be captured that would not necessarily be observable with a static scan alone, since in its binary state this data could be encrypted, or possibly not even present (as it may be downloaded at runtime). On top of this sandbox, malware researchers can then create family specific signatures that are able to access these dropped, downloaded, and modified files, as well as memory dumps, network traffic, and other artifacts. These dynamic signatures provide more accurate classifications than those traditional AV signatures utilized in static file scanning and have considerably more stringent criteria to define family membership.

The Sality malware family, for example, is a parasitic file-infector, meaning it embeds its malicious code into other clean PE files on the system. It spreads when any of the infected files are copied to a different system and executed. Sality uses a system-wide mutex to determine if another instance of the infection process is already running. It also uses specific markers to identify already infected files so as not to re-infect them. A dynamic signature to identify this family consists of identifying the presence of the unique mutex or the markers in files that have been opened and modified by the virus.

Because these sandbox signatures are so specific and do not rely on circumstantial evidence for family classification, whenever any of these dynamic family signatures are triggered when executing a given file, we then know with high confidence that the sample belongs to the associated family. With this information, we then keep those samples for which we have positively identified a family, and annotate them with tags that describe their associated malware family well. Since we are only looking at a set of well defined malicious families, and basing our detection on very specific behaviors, the support of this labeling technique – i.e. the number of samples for which we generate tag labels – is low and biased towards a specific set of malware types. On the other hand, the tags generated with this method are considered to be high quality and can be safely used as ground truth in our analyses.

The family signatures used in this labeling mechanism are mostly concerned with the actual malware behavior and not necessarily with the delivery mechanism. For instance, if we are dealing with a piece of malware that exhibits the behavior of a dropper, the behavioral analysis will focus mostly on characterizing the dropped payload. Because of this, the tags that describe delivery mechanisms such as installer, packed and dropper are not generated with this method.

4. Tag Prediction

Figure 1. Using samples and corresponding tags we train two neural network architectures to predict malware tags. Top: Multi-Head architecture, consisting of a base feed-forward network with one “head” for each tag that it is trained to predict; each of the heads is composed of dense layers followed by ELU nonlinearities and a final sigmoid activation function. Bottom: Joint Embedding model, which represents (embeds) both the binary samples and the malicious tags in the same low dimensional space; the prediction layer issues predictions based on the distances between sample embeddings and tag embeddings in this space.

Once we have defined our labeling scheme, we can define the tag prediction task as multi-label classification, since zero or more tags from the set of possible tags can be present at the same time for a given sample. In order to predict these tags, we introduce two different neural network architectures, both represented in Figure 1, which we will refer to as Multi-Head (top) and Joint Embedding (bottom).

The Multi-Head architecture can be thought of as an extension of the network used in (Saxe and Berlin, 2015) to multiple outputs. It consists of a base topology that is common to the prediction of all tags, and one output (or “head”) per tag. Both parts of the architecture consist of multiple blocks composed of dropout (Srivastava et al., 2014), a dense layer, batch normalization (Ioffe and Szegedy, 2015), and an exponential linear unit (ELU) activation function (Clevert et al., 2015). The only exceptions are the input layer, which does not use dropout, and the very last layer of each head, which uses a sigmoid activation unit to compute the predicted probability of each label.

The Joint Embedding model, as shown at the bottom of Figure 1, is introduced with the goal of exploiting semantic similarities between tags. This model maps both the labels (tags) and the binary file features to vectors in a joint Euclidean latent space. This embedding of files and tags is performed in a way such that, for a given similarity function, the transformations of semantically similar labels are close to each other, and the embedding of a binary sample is close to that of its associated labels in the same space. This architecture consists of a PE embedding network, a tag embedding matrix E, and a prediction layer.

The PE embedding network learns a nonlinear function φ(·; θ), with parameters θ, that maps the input binary representation x of a PE executable file into a vector h = φ(x; θ) in a low dimensional Euclidean space.

The tag embedding matrix E learns a mapping from the j-th tag, j ∈ {1, …, |T|}, to a distributed representation e_j in the same joint embedding space. In practice, the embedding vector e_j for the j-th tag is simply the j-th row of the embedding matrix E.

Finally, the prediction layer compares the tag and sample embeddings and produces a similarity score that is run through a sigmoid non-linearity to estimate the probability that sample x is associated with the j-th tag, for each j. In our final model implementation, the similarity score is the dot product between the embedding vectors. The output of the network then becomes

f_j(x) = σ( ⟨ φ(x; θ), e_j ⟩ ),    (2)

where σ is the sigmoid activation function and f_j(x) is the probability estimated by the model of the j-th tag being a descriptor for x.

We further constrain the embedding vectors for the tags, as suggested in (Weston et al., 2010), such that

‖e_j‖₂ ≤ C,  j = 1, …, |T|,    (3)

which acts as a regularizer for the model. We observed in practice that this normalization indeed leads to better results on the validation set. Unless stated otherwise, we fixed the value of C to 1.

We also experimented with constraining the norm of the PE embeddings to 1, and analogously with using cosine similarity instead of the dot product as the similarity score between tag and file embeddings. In both cases we observed deteriorated performance on the test set. This drop in performance was more noticeable for samples with many tags (more than 4), suggesting that the network uses the magnitude of the PE embedding vector to achieve high similarity scores for multiple tags concurrently. As part of our experimentation we also tried to learn the similarity score by concatenating the PE and tag embeddings and running the resulting vector through feed-forward layers with nonlinearities. However, we found that the simpler approach of using the dot product was both more effective on the tag prediction task and more interpretable.
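A minimal PyTorch sketch of the Joint Embedding model described above is shown below; the hidden-layer sizes, dropout rate, and input dimensionality are illustrative placeholders rather than the exact topology reported in Section 5.2.

```python
import torch
import torch.nn as nn

class JointEmbeddingTagger(nn.Module):
    """Sketch of the joint embedding model: a feed-forward PE embedding network,
    a learnable tag embedding matrix, and a dot-product + sigmoid prediction
    layer (Equations 2 and 3)."""

    def __init__(self, in_features=1024, embed_dim=32, n_tags=11, max_norm=1.0):
        super().__init__()
        # PE embedding network phi(x; theta); layer sizes here are illustrative.
        self.pe_embedding = nn.Sequential(
            nn.Linear(in_features, 512), nn.BatchNorm1d(512), nn.ELU(),
            nn.Dropout(0.05),
            nn.Linear(512, 128), nn.BatchNorm1d(128), nn.ELU(),
            nn.Dropout(0.05),
            nn.Linear(128, embed_dim),
        )
        # Tag embedding matrix E; rows exceeding `max_norm` are renormalized at
        # lookup time, approximating the constraint of Equation 3.
        self.tag_embedding = nn.Embedding(n_tags, embed_dim, max_norm=max_norm)
        self.n_tags = n_tags

    def forward(self, x):
        h = self.pe_embedding(x)                               # (batch, embed_dim)
        tag_ids = torch.arange(self.n_tags, device=x.device)
        E = self.tag_embedding(tag_ids)                        # (n_tags, embed_dim)
        return torch.sigmoid(h @ E.t())                        # per-tag probabilities
```

At training time, the per-tag binary cross-entropy of Equation 4 can then be computed directly on these output probabilities, e.g. with torch.nn.BCELoss.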

Our goal, for a given PE file, is to learn a distributed, low dimensional representation of it that is “close” to the embedding of the tags that describe it. The parameters of both embedding functions (θ and E) are jointly optimized to minimize the binary cross-entropy loss for the prediction of each tag via backpropagation and stochastic gradient descent. The loss function to minimize for a mini-batch of M samples becomes

L = −(1/M) Σ_{i=1}^{M} Σ_{j=1}^{|T|} [ y_{i,j} log f_j(x_i) + (1 − y_{i,j}) log(1 − f_j(x_i)) ],    (4)

where y_{i,j} = 1 if sample x_i is labeled with the j-th tag (and 0 otherwise), and f_j(x_i) is the probability predicted by the network of that tag being associated with the i-th sample.

In practice, to get a vector of tag similarities for a given sample with PE embedding vector h, we multiply the matrix of tag embeddings E by h and scale the output to obtain a prediction vector ŷ = σ(E h), where σ is the element-wise sigmoid function transforming the similarity values into valid probability values. Each element of ŷ is then the predicted probability for the corresponding tag.

4.1. Evaluation of Tagging Algorithms

There are different ways to evaluate the performance of tagging algorithms. In particular, the evaluation can be done along a per-tag or a per-sample dimension. The former seeks to quantify how well our tagging algorithm performs at identifying each tag, while the latter focuses on the quality of the predictions for each sample.

In the per-tag case, one suitable way to evaluate the performance of the model is to measure the area under the receiver operating characteristic curve (AUC-ROC, or simply AUC) for each of the tags being predicted. A ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR). Also, since the target value for the j-th tag of a given sample is a binary True/False value (y_j ∈ {0, 1}), binary classification evaluation metrics such as ‘Accuracy’, ‘Precision’, ‘Recall’, and ‘F-score’ also apply. To compute these metrics, the output probability prediction needs to be binarized. For the binarization of our predictions, we choose a threshold independently for each tag such that the FPR in the validation set is 0.01, and use the resulting 0/1 predictions. The fact that our labeling methodology introduces label noise – mostly associated with negative labels, as pointed out in Section 3 – makes recall the most adequate of these last four metrics for evaluating our tagging algorithms, since it ignores incorrect negative labels.
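As an illustration, a per-tag decision threshold at a target FPR can be selected on a validation set as in the following sketch, which uses scikit-learn's roc_curve; the helper name and data layout are our own.

```python
import numpy as np
from sklearn.metrics import roc_curve

def threshold_at_fpr(y_true, y_score, target_fpr=0.01):
    """Return the decision threshold for one tag such that the false positive
    rate on the given validation labels/scores does not exceed `target_fpr`."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    ok = np.where(fpr <= target_fpr)[0]
    # `thresholds` is sorted in decreasing order, so the last admissible index
    # gives the lowest threshold (highest recall) that still meets the FPR budget.
    return thresholds[ok[-1]]

# Hypothetical usage, one threshold per tag:
# thr = {tag: threshold_at_fpr(val_y[tag], val_scores[tag]) for tag in tags}
# binary_pred = {tag: (test_scores[tag] >= thr[tag]).astype(int) for tag in tags}
```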

The per-sample evaluation dimension seeks to evaluate the performance of a tagging algorithm for a given sample, across all tags. Let Y(x) be the set of tags associated with sample x and Ŷ(x) the set of tags predicted for the same sample after binarizing the predictions. We can use the Jaccard similarity (or index) as a measure of how similar both sets are. Furthermore, let y ∈ {0, 1}^{|T|} be the binary target vector for a PE file, where y_j indicates whether the j-th tag applies to the file, and let ŷ be the binarized prediction vector from a given tagging model. We define the per-sample accuracy as the percentage of samples for which the target vector is equal to the prediction vector, i.e., all tags are correctly predicted, or, in other words, the Hamming distance between the two vectors is zero. For an evaluation dataset with N samples we use

Mean Jaccard similarity = (1/N) Σ_{i=1}^{N} |Y(x_i) ∩ Ŷ(x_i)| / |Y(x_i) ∪ Ŷ(x_i)|,    (5)

Mean per-sample accuracy = (1/N) Σ_{i=1}^{N} 1[ y_i = ŷ_i ],    (6)

as our per-sample performance metrics for the tagging problem, where 1[·] is the indicator function, which is 1 if the condition in the argument is true and zero otherwise.
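A short sketch of these two per-sample metrics, assuming binary target and prediction matrices of shape (n_samples, n_tags); the convention used for samples with empty target and predicted tag sets is our own assumption.

```python
import numpy as np

def per_sample_metrics(y_true, y_pred):
    """Mean Jaccard similarity (Eq. 5) and mean per-sample accuracy (Eq. 6)."""
    y_true = y_true.astype(bool)
    y_pred = y_pred.astype(bool)
    intersection = (y_true & y_pred).sum(axis=1)
    union = (y_true | y_pred).sum(axis=1)
    # If both tag sets are empty, count the sample as a perfect match.
    jaccard = np.where(union > 0, intersection / np.maximum(union, 1), 1.0)
    exact_match = (y_true == y_pred).all(axis=1)
    return jaccard.mean(), exact_match.mean()
```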

5. Experiments

We train and evaluate the two proposed model architectures on the task of malware tagging from static analysis of binary files. In this section we provide the experimental details of this process: particularly a description and analysis of the data used for training and validation along with a definition of the model topology and training methodology.

5.1. Data Description

For our experiments we collected three datasets of Windows Portable Executable (PE) files from a threat intelligence feed, along with the detection names produced by trusted vendors, first seen time-stamps, and the number of AV vendors that identify the PEs as malware.

The first collected dataset is our training set, D_train, which contains 7,330,971 unique binary files. All the data in D_train was obtained by randomly sampling files first observed in our intelligence feed in the period between 06/20/2017 and 03/02/2018.

Our second dataset – our test set, D_test – is composed of 1,608,265 unique entries. The samples in the test set were randomly sampled from all the files whose first seen time was between 03/03/2018 and 06/02/2018. This temporal split between D_train and D_test ensures that there is no information leakage between our train and test sets.

For both D_train and D_test we derived the semantic tags following the procedure described in Section 3.1, using detection names from the anti-malware solutions that we consider to provide high-quality names. The set of tokens and mappings used was based only on detection names from samples in D_train, in order to avoid polluting our time-split evaluation. We used the 0.06% threshold from Section 3.1.1 for deciding which tokens to keep. We further derived a malicious/benign label for the samples in those sets using a voting scheme similar to (Saxe and Berlin, 2015), but extended to assign more importance to trusted vendors and complemented with internal proprietary reputation scores and white and black lists. We refer to the subset of 1,377,698 malware samples in the test set as D_test^M and to the subset of 230,568 benign samples in the test set as D_test^B.

In addition to the two datasets above, we collected a third dataset for ground truth evaluation, D_GT, containing 7,033 samples from the same time period as D_test. For D_GT we obtained a random sampling of files from the time period of interest and used behavioral traces of the files' execution to determine their ground truth tags, following the methodology in Section 3.2. We only kept in the ground truth dataset those samples that were positively identified by our behavioral tagging approach, thus minimizing the amount of label noise.

For all the binary files in the three datasets we then extracted feature vectors using the same feature representation as proposed in (Saxe and Berlin, 2015).

Table 2 summarizes the coverage for each of the tags across our test dataset D_test. Most of our tags are almost exclusively associated with malicious samples, except for installer and packed, which are associated with both benign and malicious files. Moreover, we see that 96% of the malicious samples have tags describing them, indicating that the labeling approach effectively has high coverage over the set of samples of interest. We also note that the mean number of tokens observed each time a tag appears is 5.57, which reflects the degree of robustness of our labeling strategy against vendor mis-classifications or missing scans. Synonym and parent-child relationships used to produce the tags were computed from the samples in the training dataset. The values in parentheses in the table show the labeling statistics had we ignored these relationships and used only the manual mapping between tokens and tags. Using both synonym and parent-child relationships derived from the empirical conditional probabilities of tokens improves not only the mean token redundancy but also the tag coverage for malicious samples for almost all our tags, while leaving the tags for benign samples unaffected. The coverage statistics for the training set are similar to the ones presented in the table and are not shown here for space considerations.

Tag | Benign samples | Malware samples | Mean token redundancy
adware | 0.01% | 26.5 (22.5)% | 5.7 (2.9)
crypto-miner | 0% | 11.9 (11.2)% | 8.8 (4.8)
downloader | 0.01% | 32.4 (26.0)% | 6.7 (2.8)
dropper | 0% | 31.9 (27.6)% | 4.5 (1.8)
file-infector | 0.01% | 32.6 (31.0)% | 5.7 (3.1)
flooder | 0% | 1.3 (1.3)% | 5.1 (1.9)
installer | 2.2% | 11.5 (7.3)% | 3.7 (3.3)
packed | 3.3% | 33.4 (32.5)% | 4.2 (2.0)
ransomware | 0% | 6.6 (6.5)% | 5.6 (2.7)
spyware | 0.01% | 48.1 (47.6)% | 6.6 (3.3)
worm | 0% | 31.2 (26.7)% | 7.5 (3.4)
ANY | 5.2 (5.2)% | 96.0 (95.4)% | 5.8 (2.9)

Table 2. Tag coverage, i.e. percentage of samples annotated with a given tag, in the weakly labeled test dataset (as described in Section 5.1) for benign (D_test^B) and malicious (D_test^M) files. The rightmost column indicates the mean number of different tokens associated with the tag each time the tag appears across all samples. The last row considers a sample as labeled if any one of the tags is present; the mean token redundancy for this row corresponds to the mean of the token redundancies across all tags. Values in parentheses show the result of the tagging procedure before exploiting statistical relations between tokens mined from the training set (cf. Section 3.1.3).

We further analyze the distribution and pairwise relationships of the tags in our training dataset. In Figure 2 we plot the empirical conditional probabilities of the tags in the training set, computed in a similar fashion as in Equation 1 by replacing token counts with tag counts. The value in row i and column j represents the empirical conditional probability of tag i given tag j. Rows for which the sum of the elements is higher (packed: 6.1, spyware: 5.4, dropper: 3.9) correspond to tags with a broader meaning, or higher in a tag hierarchy, and are therefore indicative of more generic tags. Tags for which the sum of the row is lower (flooder: 1.0, installer: 1.5, ransomware: 1.5) have a more specific meaning. This representation also helps us identify possible issues with the tagging mechanism as well as understand the distribution of our tags. Furthermore, we can compare this matrix with the one derived from the predictions of the model instead of the labels in order to understand the errors that the model is making, since the elements of both matrices should be similar.

Figure 2. Estimated tag conditional probabilities for our training set D_train. The value of the element in the i-th row and j-th column represents the empirical conditional probability of tag i given tag j under our labeling strategy.

5.2. Training Details

We trained the two models introduced in Section 4 on the training dataset D_train for 200 epochs, using the Adam optimization procedure (Kingma and Ba, 2014) on mini-batches of 4,096 samples, with PyTorch (Paszke et al., 2017) as our deep learning framework.
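A schematic training loop under these settings might look as follows; the learning rate (left at the Adam default, since the exact value is not reproduced above), the synthetic stand-in data, and the use of mean rather than summed binary cross-entropy are assumptions.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-ins for the extracted feature vectors and derived tag labels.
features = torch.randn(10_000, 1024)                    # illustrative feature dimensionality
targets = torch.randint(0, 2, (10_000, 11)).float()     # one column per tag
loader = DataLoader(TensorDataset(features, targets), batch_size=4096, shuffle=True)

model = JointEmbeddingTagger(in_features=1024, n_tags=11)    # sketch from Section 4
optimizer = torch.optim.Adam(model.parameters())             # default learning rate assumed
criterion = torch.nn.BCELoss()                               # per-tag binary cross-entropy

for epoch in range(200):
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)   # mean over tags/samples; a sum differs by a constant factor
        loss.backward()
        optimizer.step()
```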

The shared base topology of the Multi-Head architecture consists of an input feed-forward layer of output size 2048, followed by a batch normalization layer, an ELU nonlinearity, and four blocks, each composed of dropout, a linear layer, batch normalization and ELU, with output sizes 1024, 512, 256 and 128 respectively. Each output head is composed of the same types of basic blocks as the main base architecture, but with output sizes 64, 32 and 1, and the last layer uses a sigmoid non-linearity instead of the ELU. The binary cross-entropy losses computed at the output of each head are added together to form the final loss.
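The following PyTorch sketch approximates this Multi-Head topology; the dropout rate, the input dimensionality, and the exact composition of the final output unit of each head are assumptions.

```python
import torch
import torch.nn as nn

def block(in_size, out_size, p=0.05):
    """Dropout -> Linear -> BatchNorm -> ELU building block (dropout rate assumed)."""
    return nn.Sequential(nn.Dropout(p), nn.Linear(in_size, out_size),
                         nn.BatchNorm1d(out_size), nn.ELU())

class MultiHeadTagger(nn.Module):
    """Sketch of the Multi-Head architecture: a shared base network plus one
    small head per tag ending in a sigmoid output."""

    def __init__(self, in_features=1024, n_tags=11):
        super().__init__()
        # Input layer without dropout, then blocks of output sizes 1024, 512, 256, 128.
        self.base = nn.Sequential(
            nn.Linear(in_features, 2048), nn.BatchNorm1d(2048), nn.ELU(),
            block(2048, 1024), block(1024, 512), block(512, 256), block(256, 128))
        # One head per tag: blocks of sizes 64 and 32, then a sigmoid output unit.
        self.heads = nn.ModuleList([
            nn.Sequential(block(128, 64), block(64, 32),
                          nn.Dropout(0.05), nn.Linear(32, 1), nn.Sigmoid())
            for _ in range(n_tags)])

    def forward(self, x):
        shared = self.base(x)
        # Concatenate the per-head probabilities into a (batch, n_tags) tensor.
        return torch.cat([head(shared) for head in self.heads], dim=1)
```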

The Joint Embedding architecture uses the same base topology as the Multi-Head model but with two extra blocks of output sizes 64 and 32 for embedding the PE files into the 32-dimensional joint latent space. An embedding matrix E of learnable parameters, with one 32-dimensional row per tag, is used for the embedding of the tags. We used the dot product to compute the similarity between the PE file embedding and the tag embedding, followed by a sigmoid non-linearity, to produce an output probability score. As before, the sum of the per-tag binary cross-entropy losses is used as the mini-batch loss during model training.

6. Results

As mentioned in Section 4.1, there are two main dimensions of interest when analyzing the performance of a malware tagging algorithm: a per-tag dimension, which evaluates how well each tag is predicted, and a per-sample dimension, which focuses on how many samples are correctly predicted and how accurate those predictions are. In the following we analyze the performance of our models across these dimensions, both on our weakly annotated test set D_test and on our ground truth evaluation set D_GT.

The evaluation results presented in this section take into consideration only those samples identified as malware by our labeling scheme. This is because the goal of our current tagging algorithm is to describe only malicious or potentially unwanted behaviors. At deployment time, we assume that the tagging models analyze samples already convicted by a complementary mechanism, and so we only evaluate on actual malware to resemble this deployment scenario. Only evaluating on malicious samples also allows us to compare the results on the test set with those on the ground truth dataset, which is composed only of malware files. In Appendix B we complement these results by evaluating performance on the entire test set and show that the models' performance does not degrade in the presence of benign samples, meaning that they do not assign malicious tags to benign samples.

6.1. Per-Tag Results

After training the two proposed architectures we proceed to evaluate their performance on the test set D_test^M. In Figure 3 we compare the per-tag true positive rate (TPR or recall) of both the Multi-Head and Joint Embedding architectures at a per-tag false positive rate (FPR) of 1%. For every tag, the Joint Embedding architecture outperforms the baseline Multi-Head model, in some cases, e.g., for spyware, adware and packed, by an important margin (0.14, 0.15 and 0.18 respectively). We have observed this trend consistently in other experiments – with different datasets, layer sizes, and activation functions – that we carried out during the development of this study.

Table 3 provides a more thorough comparison of these two methods. Not only does the Joint Embedding model outperform the baseline in terms of recall for all tags, but it also does so in terms of AUC, except for the installer tag, for which the Multi-Head model performs slightly better. For computing both recall and F-score we binarized the output using a threshold such that the FPR on the test set is 1% for each tag. For these two binary classification metrics, the Joint Embedding model achieves better performance than the Multi-Head model for every tag (with equal F-score for the flooder tag). On average the Multi-Head architecture achieves a recall of 0.72 and an F-score of 0.79, while the proposed Joint Embedding model achieves a recall of 0.80 and an F-score of 0.84.

Lastly, in the rightmost column of the table we show the results of using the Joint Embedding model, trained on our noisily labeled dataset, to predict the tags of the ground truth dataset. Because of how our labeling strategy was defined, and the behavioral signatures available at the time of compiling this work, the ground truth dataset does not contain samples that could be strongly associated with some of the tags, hence the missing entries in the table. Even though the model was trained on a dataset whose tags were extracted in a different manner, the recall on the ground truth dataset is, for most tags, comparable to, if not better than, that on the test set. These results imply that the model is effectively learning high-quality relationships between binary files and semantic tags even when trained on noisy labels.

Figure 3. Per-tag true positive rate (TPR) at a false positive rate (FPR) of 1% for the two proposed models. The Joint Embedding architecture consistently outperforms the Multi-Head architecture.
Tag name        Multi-Head (test set)         Joint Embedding (test set)    Joint Embedding (ground truth)
                AUC     Recall   F-score      AUC     Recall   F-score      AUC     Recall   F-score
                        @1% FPR  @1% FPR              @1% FPR  @1% FPR
adware          0.9661  0.47     0.63         0.9811  0.63     0.76         -       -        -
crypto-miner    0.9909  0.85     0.88         0.9970  0.97     0.95         0.9684  0.63     0.63
downloader      0.9552  0.72     0.83         0.9774  0.78     0.87         0.9170  0.99     0.68
dropper         0.9681  0.71     0.82         0.9819  0.80     0.88         -       -        -
file-infector   0.9824  0.69     0.81         0.9907  0.71     0.82         0.9639  0.83     0.68
flooder         0.9965  0.99     0.72         0.9997  1.00     0.72         -       -        -
installer       0.9523  0.55     0.68         0.9477  0.57     0.69         -       -        -
packed          0.9709  0.69     0.80         0.9887  0.87     0.92         -       -        -
ransomware      0.9907  0.96     0.91         0.9946  0.97     0.92         0.9781  0.86     0.82
spyware         0.9581  0.69     0.81         0.9775  0.83     0.90         0.6311  0.86     0.80
worm            0.9722  0.65     0.78         0.9853  0.67     0.79         0.7438  0.47     0.48
mean            0.9730  0.72     0.79         0.9838  0.80     0.84         0.8670  0.77     0.70
weighted mean   0.9681  0.67     0.79         0.9824  0.77     0.85         0.7677  0.81     0.74
Table 3. Per-tag evaluation results for the two proposed architectures on the malware samples of the test and ground truth evaluation datasets. Both recall and F-score are computed by binarizing each classifier's outputs at a false positive rate of 1% on the test set for every tag. The last column shows the evaluation of the Joint Embedding model, trained with noisy labels, on the ground truth evaluation set. The last two rows show the mean and weighted mean of each column; the weighted mean weights the contribution of each tag by its support.

6.2. Per-Sample Results

Another way of analyzing our results is to measure the percentage of samples for which our models correctly predict all tags. We are also interested in how many of the 11 possible tags each model correctly predicts per sample on average. For this we measure both the Jaccard similarity and the per-sample accuracy of our predictions according to Equations 5 and 6 respectively. Under both metrics the Joint Embedding approach significantly outperforms the Multi-Head approach. For the Joint Embedding architecture, the percentage of samples for which we predict all the tags correctly is 58.56%, and for a sample chosen at random the model correctly predicts the presence (and absence) of almost 94% of the tags on average. It is important to note that, because of the relatively low number of tags per sample in the test set (2.66 on average), the mean Jaccard similarity of a tagging algorithm that never predicts any tag is already 75.8%. Both our tagging models outperform this baseline by a large margin, which signals that they are effectively learning to identify relationships between tags and binary feature vectors.
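Equations 5 and 6 are not reproduced in this excerpt, so the sketch below implements the two per-sample metrics as they are described in the text: the fraction of the 11 tags whose presence or absence is predicted correctly for each sample (which reproduces the quoted 75.8% no-prediction baseline given 2.66 positive tags per sample) and the exact-match accuracy. The paper's exact definitions may differ in detail.

```python
import numpy as np


def per_sample_tag_match(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean fraction of tags whose presence/absence is predicted correctly per sample."""
    return float((y_true == y_pred).mean(axis=1).mean())


def per_sample_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Fraction of samples for which every tag is predicted correctly."""
    return float((y_true == y_pred).all(axis=1).mean())


# With an average of 2.66 positive tags per sample, a model that never predicts
# any tag scores (11 - 2.66) / 11 ~= 0.758 on the per-tag match metric, which is
# the 75.8% baseline quoted above.
y_true = np.random.randint(0, 2, size=(1000, 11))   # hypothetical labels
never_predict = np.zeros_like(y_true)
print(per_sample_tag_match(y_true, never_predict),
      per_sample_accuracy(y_true, never_predict))
```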

On the ground truth dataset the Joint Embedding model shows a slight drop in mean Jaccard similarity, with all tags predicted correctly for 59% of samples and 93.5% of the tags correctly identified per sample on average. Nevertheless, under both metrics it still outperforms the Multi-Head model evaluated on the original test set. This second dimension of model evaluation indicates that the relationships learned from the noisy training dataset transfer to a more accurately labeled set of samples.

                    Multi-Head (test set)   Joint Embedding (test set)   Joint Embedding (ground truth)
Accuracy            0.5228                  0.5856                       0.5902
Jaccard similarity  0.9145                  0.9375                       0.9354
Table 4. Per-sample evaluation results for the two proposed architectures on the malicious samples of the test set, and for the Joint Embedding model on the ground truth set. Jaccard similarity and accuracy are computed according to Equations 5 and 6 respectively. In both cases the Joint Embedding model outperforms the baseline by a noticeable margin.

7. Discussion

Our results from Section 6 suggest that the Joint Embedding architecture is more suitable for malware tagging than the Multi-Head architecture. Because the PE embedding portion of the Joint Embedding network is composed of a similar number and size of layers as the shared base of the Multi-Head model, the number of parameters of the two networks is comparable. We therefore hypothesize that the performance improvement is due to a more informative internal representation learned by the Joint Embedding network, which gives it the ability to model, and therefore exploit, tag relationships (label structure) in the latent space. The number of parameters of both networks can be expressed as B + T·E, where B represents the number of parameters in the shared base and PE embedding topologies, T the number of tags, and E is the number of parameters of each head in the Multi-Head architecture or the latent space dimensionality in the case of the Joint Embedding architecture.

In Section 7.1, we verify that the Joint Embedding model has learned a proper representation by examining its latent space and validating that PE file embeddings tend to cluster around their corresponding tag embeddings.

7.1. Malware-Tag Joint Embedding Space

In an attempt to validate and understand the latent space learned by our Joint Embedding model, we used t-SNE (van der Maaten and Hinton, 2008) to reduce the 32-dimensional latent space to a 2-dimensional representation, shown in Figure 4. In this visualization the small markers represent the embeddings of PE files in our test set; for each tag, we randomly sampled 250 corresponding PE file embeddings, for a total of 2,750 samples. Large markers correspond to the embeddings of the tags themselves.
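A hedged sketch of this visualization step is given below; `pe_emb`, `tag_emb`, and `tag_ids` are placeholders for the learned embeddings rather than released artifacts, and the t-SNE hyperparameters are arbitrary.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Placeholder embeddings: 250 sampled PE file embeddings per tag plus the 11
# learned tag embeddings, all 32-dimensional.
pe_emb = np.random.randn(2750, 32)
tag_emb = np.random.randn(11, 32)
tag_ids = np.repeat(np.arange(11), 250)   # tag carried by each sampled PE file

# Project files and tags jointly so both end up in the same 2-D map.
proj = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(
    np.vstack([pe_emb, tag_emb]))
file_xy, tag_xy = proj[:len(pe_emb)], proj[len(pe_emb):]

plt.scatter(file_xy[:, 0], file_xy[:, 1], c=tag_ids, s=4, alpha=0.5)   # small markers
plt.scatter(tag_xy[:, 0], tag_xy[:, 1], c=np.arange(11), s=200, marker="*",
            edgecolors="black")                                        # large markers
plt.show()
```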

As one can see, the embeddings of the PE files labeled with the same tag tend to cluster together. Furthermore, the embeddings of the tag labels lie close to the clusters of samples they describe. This suggests that the Joint Embedding model has effectively learned to group file representations close to their corresponding tag representations, as intended.

Figure 4. t-SNE visualization of sample and tag (label) embeddings for 2,750 test samples labeled with a single tag (250 randomly selected samples per tag). Large markers represent tag (label) embeddings while small markers represent PE sample embeddings. The flooder tag is partially covered by the crypto-miner marker.

This structure in the embedding space lends itself to several other applications. Tagging can be thought of as a special case of information retrieval, in which we retrieve tags for a query sample based on a distance function in the latent space. It would likewise be possible to perform similarity search, using one malware sample to retrieve other samples with similar characteristics by finding files whose embedding representations lie close to that of the query. Finally, the query can go in the other direction: given a combination of descriptive tags, we could obtain a set of samples closely associated with that combination by mapping it into the embedding space and retrieving adjacent samples.
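The following sketch illustrates these retrieval use cases under the assumption that PE and tag embeddings are available as arrays; cosine similarity stands in for "closeness" here, although the paper does not prescribe a particular distance function for retrieval.

```python
import numpy as np


def top_k_similar(query: np.ndarray, corpus: np.ndarray, k: int = 5) -> np.ndarray:
    """Indices of the k corpus embeddings most cosine-similar to the query."""
    corpus_n = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    query_n = query / np.linalg.norm(query)
    return np.argsort(corpus_n @ query_n)[::-1][:k]


pe_embeddings = np.random.randn(10_000, 32)   # placeholder embedded file corpus
tag_embeddings = np.random.randn(11, 32)      # placeholder learned tag embeddings

# 1) Malware-to-malware similarity search: neighbours of a query sample
#    (the first hit is the query itself, so it is skipped).
neighbours = top_k_similar(pe_embeddings[0], pe_embeddings, k=6)[1:]

# 2) Tag-combination query: average two tag embeddings (indices are illustrative)
#    and retrieve the samples closest to that point in the latent space.
query = tag_embeddings[[3, 9]].mean(axis=0)
matches = top_k_similar(query, pe_embeddings, k=5)
print(neighbours, matches)
```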

8. Conclusion

In this paper we have formalized the concept of describing attributes of malicious software as a multi-label prediction task, or tagging problem. Even though the concept of describing malicious artifacts with descriptive tags is not new, and has been proposed as a way of eliminating the ambiguity and inaccuracy of relying on signatures for malware description (Kirillov et al., 2011), to the best of our knowledge this is the first attempt to learn a nonlinear mapping between raw binary files and descriptive tags for automatic malware characterization.

We have also proposed a simple, data-driven, semi-automatic approach for extracting and combining descriptive information about malware samples from multiple vendors' detection names. Furthermore, we evaluated two different deep neural network approaches to malware description via tagging, and showed that our Joint Embedding model can predict human-interpretable attribute and behavioral descriptions of malicious files from static features with reasonable accuracy, correctly predicting an average of more than 10.31 of the 11 tag descriptors per sample. Finally, we have shown that the noisy tags extracted from detection names are a suitable surrogate for labels created through more expensive behavioral analyses: when evaluating our Joint Embedding model against ground truth tags for samples belonging to well known malware families, 10.29 of the 11 descriptors were correctly predicted per sample on average.

We foresee multiple research paths as natural follow-ups to the ideas proposed in this paper, as well as potential applications such as malware similarity clustering and alert prioritization. We are also particularly interested in expanding the set of tags used to describe malware samples into a more complete taxonomy, and in using our ground truth labeled set to fine-tune a model trained with weak labels.

The question of how to reliably and economically create a large training set suitable for this task remains the main challenge. It would also be valuable to explore methods for extending the set of targets that the model predicts without retraining the entire network, and to evaluate this approach in few-shot and zero-shot learning scenarios.

Acknowledgements.
We thank Adarsh Kyadige, Andrew Davis, Hillary Sanders, Joshua Saxe, Richard Harang, and William Lee for their suggestions and feedback during the development of this research. We also thank Richard Cohen for sharing his expertise in malware detection. This research was funded by Sophos Ltd.


Appendix A: Tag definitions

  • Downloader: Malicious program whose primary purpose is to download additional content. Often similar in usage to a Dropper.

  • Dropper: Malicious program that carries another program concealed inside itself, and drops that program onto an infected machine.

  • Ransomware: Malware whose goal is to encrypt or otherwise make a user's files inaccessible, and then demand payment to restore access to them.

  • Crypto-miner: A program that uses a machine’s computational resources to mine cryptocurrency, without the user’s knowledge or consent, sending the results back to a central location.

  • Worm: Software that automatically spreads itself.

  • Adware: Potentially unwanted software that shows the user an excessive number of ads, often in the browser, or changes the user's home page to an ad in order to get more clicks.

  • Spyware: Programs that collect confidential information and send it to an attacker. This information can include web browsing habits, keystrokes, stolen passwords, or banking information, among others.

  • Flooder: Designed to overload a machine’s network connections. Servers are common targets of these attacks.

  • Packed: Indicates that the malware was packed for the sake of avoiding detection.

  • File-Infector: Infects executable files with the intent to cause permanent damage or make them unusable. A file-infecting virus overwrites code or inserts infected code into an executable file.

  • Installer: Installs other unwanted software.

Appendix B: Full Evaluation Results

In Section 6 we evaluated both of our models on the test set using only samples assumed to be malware by our malware/benign labeling scheme. In Table B.1 we present the results of evaluating both the Joint Embedding and Multi-Head models on the entire test set. The fact that the results in this table are even slightly better than those presented in Table 3 indicates that the models behave as expected on benign samples. The mean Jaccard similarity and accuracy for the Multi-Head model on the full test set are 0.9374 and 0.6378 respectively; for the Joint Embedding model the mean Jaccard similarity is 0.9598 and the accuracy is 0.7121.

Tag name        Multi-Head (full test set)    Joint Embedding (full test set)
                AUC     Recall   F-score      AUC     Recall   F-score
                        @1% FPR  @1% FPR              @1% FPR  @1% FPR
adware          0.9710  0.52     0.67         0.9837  0.66     0.78
crypto-miner    0.9815  0.86     0.88         0.9973  0.98     0.95
downloader      0.9620  0.73     0.83         0.9807  0.79     0.87
dropper         0.9737  0.73     0.83         0.9846  0.83     0.90
file-infector   0.9851  0.73     0.83         0.9923  0.74     0.83
flooder         0.9966  0.99     0.69         0.9997  0.99     0.70
installer       0.9580  0.58     0.69         0.9510  0.59     0.70
packed          0.9759  0.72     0.82         0.9902  0.87     0.92
ransomware      0.9918  0.96     0.90         0.9952  0.97     0.90
spyware         0.9674  0.71     0.82         0.9820  0.84     0.91
worm            0.9770  0.72     0.82         0.9879  0.74     0.84
mean            0.9773  0.75     0.80         0.9859  0.82     0.84
weighted mean   0.9735  0.71     0.81         0.9850  0.80     0.87
Table B.1. Per-tag evaluation results for the two proposed architectures on the entire test set (benign and malicious samples). Both recall and F-score are computed by binarizing each classifier's outputs at a false positive rate of 1% on the test set for every tag. The last two rows show the mean and weighted mean of each column; the weighted mean weights the contribution of each tag by its support.

Appendix C: Token Conditional Probability

In Figure C.1 we show a histogram of the number of token pairs falling in each empirical conditional probability interval, computed on our training set according to Equation 1. The elbow of the curve in the histogram corresponds to 0.97, the value used as the threshold that defines token relationships in Section 3.1.3. Pairwise conditional probabilities of 1 are most likely a consequence of our parsing strategy.
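Since Equation 1 is not reproduced in this excerpt, the following sketch shows one standard way to estimate such pairwise empirical conditional probabilities, taking P(a | b) as the fraction of samples containing token b that also contain token a; the token values are purely illustrative.

```python
from collections import Counter
from itertools import permutations

# Hypothetical per-sample token sets parsed from vendor detection names.
token_sets = [
    {"ransomware", "wannacry"},
    {"ransomware"},
    {"adware", "installer"},
]

occurrences = Counter()      # how many samples contain each token
co_occurrences = Counter()   # how many samples contain each ordered token pair

for tokens in token_sets:
    occurrences.update(tokens)
    co_occurrences.update(permutations(tokens, 2))


def cond_prob(a: str, b: str) -> float:
    """Empirical P(token a present | token b present)."""
    return co_occurrences[(a, b)] / occurrences[b] if occurrences[b] else 0.0


# Token pairs whose conditional probability exceeds the chosen threshold
# (0.97 in the text) are treated as related when building tags.
print(cond_prob("wannacry", "ransomware"))   # 0.5
print(cond_prob("ransomware", "wannacry"))   # 1.0
```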

Figure C.1. Histogram of the number of token pairs per empirical conditional probability interval. The elbow of the curve corresponds to the interval containing 0.97.