Learning from Context: Exploiting and Interpreting File Path Information for Better Malware Detection

05/16/2019 · by Adarsh Kyadige, et al.

Machine learning (ML) used for static portable executable (PE) malware detection typically employs per-file numerical feature vector representations as input, with one or more target labels during training. However, there is much orthogonal information that can be gleaned from the context in which the file was seen. In this paper, we propose utilizing a static source of contextual information -- the path of the PE file -- as an auxiliary input to the classifier. While file paths are not malicious or benign in and of themselves, they do provide valuable context for a malicious/benign determination. Unlike dynamic contextual information, file paths are available with little overhead and can seamlessly be integrated into a multi-view static ML detector, yielding higher detection rates at very high throughput with minimal infrastructural changes. Here we propose a multi-view neural network, which takes feature vectors from PE file content as well as corresponding file paths as inputs and outputs a detection score. To ensure realistic evaluation, we use a dataset of approximately 10 million samples -- files and file paths from user endpoints of an actual security vendor network. We then conduct an interpretability analysis via LIME modeling to ensure that our classifier has learned a sensible representation, and to see which parts of the file path most contributed to changes in the classifier's score. We find that our model learns useful aspects of the file path for classification, while also learning artifacts from customers testing the vendor's product, e.g., by downloading a directory of malware samples each named as their hash. We prune these artifacts from our test dataset and demonstrate reductions in false negative rate of 32.3% at a 10^-3 false positive rate, compared to a similar topology single-input PE file content only model.


1. Introduction

Commercial Portable Executable (PE) malware detectors consist of a hybrid of static and dynamic analysis engines. Static detection – which is fast and effective at detecting a large fraction of malware – is usually first employed to flag suspicious samples. Static detection involves analyzing the raw PE image on disk and can be performed very quickly, but it is vulnerable to code obfuscation techniques, e.g., compression and polymorphic/metamorphic transformation (Moser et al., 2007).

Dynamic detection, by contrast, requires running the PE in an emulator and analyzing behavior at run time (Egele et al., 2012). When dynamic analysis works, it is less susceptible to code obfuscation, but takes substantially greater computational capacity and time to execute than static methods. Moreover, some files are difficult to execute in an emulated environment, but can still be statically analyzed. Consequently, static detection methods are typically the most critical part of an endpoint’s malware prevention (blocking malware before it executes) pipeline.

Static detection methods have seen performance advancements recently, thanks to the adoption of machine learning (Damodaran et al., 2017), where highly expressive classifiers, e.g., deep neural networks, are fit on labeled data sets of millions of files. When these classifiers are trained, they use feature vectors – numerical descriptions of the static file content – as input but no auxiliary data. We note, however, that dynamic analysis works well precisely because of auxiliary data – e.g., network traffic, system calls, etc. – information that cannot be gleaned directly from the static content of the file.

In this work, we seek to use file paths as orthogonal input information to augment static ML detectors. File paths are available statically, without any additional instrumentation of the OS, and are already used internally by malware analysts to investigate and correct mischaracterized detections. Using file paths to augment detections seems, on the surface, potentially problematic, as file paths are not inherently malicious or benign. However, malware droppers often use file paths with certain characteristics for a variety of reasons. For example, a file path may be chosen to increase the likelihood that a user will execute a malicious PE masquerading as another application, to avoid disk scans, or to hide the files from a user's view. This results in a prevalence of certain types of directory hierarchies and detectable naming characteristics (e.g., name randomization), which can provide useful hints about the malicious/benign nature of a file, even when this is not immediately obvious from its content. Likewise, file paths corresponding to prevalent types of benignware exhibit certain patterns. By including the file path as an auxiliary input, we are able to combine information about the file, via feature vectors, with information about how likely it is to see such a file in that specific location.

We focus our analysis on three models:

  • The baseline file content only PE model, which takes only the PE features as input and outputs a malware confidence score.

  • Another baseline file path content only FP model, which takes only the file's path as input and outputs a malware confidence score.

  • Our proposed multi-view PE file content + contextual file path PE+FP model, which takes both the PE file content features and the file path as inputs, and also outputs a malware confidence score.

A schematic diagram of the three models is shown in Figure 4.

Rather than using vendor aggregation services for our data distribution, which potentially have an artificial file distribution -- i.e., not reflecting a real world deployment case -- and incomplete file path information, we collect a commercial dataset of actual file and file path scans on customer endpoints from a large anti-malware vendor, and use it to perform a time-split validation of our models. In addition, we conduct a LIME interpretability analysis (Ribeiro et al., 2016) to see what aspects of the file path amplify or attenuate detection in the multi-view model. We find that, while the model learns to detect suspicious aspects of the file path, it also learns to detect artifacts which seem to correspond to the vendor's customers performing internal testing of the product. These artifacts include, e.g., files named by their SHA256 digests, in folders marked "malware", which were likely bulk-downloaded intentionally. While this is indeed the actual customer distribution, and not data pollution, we do not think detecting vendor tests is an accurate representation of the real world threat landscape. We therefore prune these samples from our test set during evaluation, to avoid presenting a spuriously optimistic view of performance. We find that even after we filter our data, our multi-view classifier trained on both file content and the contextual file path yields statistically significantly better results across the ROC curve, particularly in low false positive rate (FPR) regions.

The contributions of this paper are as follows:

  1. We obtain a realistic, carefully curated data set of files and file paths from a security vendor's customer endpoints (rather than a malware / vendor label aggregation service), and carefully prune our test set of "easy" samples from customer test endpoints that do not constitute realistic threats in the wild.

  2. We demonstrate that our multi-view PE+FP malware classifier performs substantially better on our dataset than a model that uses the file contents alone.

  3. We extend Local Interpretable Model Agnostic Explanations (LIME) (Ribeiro et al., 2016) to our PE+FP model, and use it to interpret which portions of the file path contribute and detract the most from a detection.

  4. We demonstrate the suitability of PE+FP model as a ranking engine in the context of Endpoint Detection and Response (EDR) applications.

The remainder of this manuscript is structured as follows: Section 2 covers important background concepts and related work. Section 3 discusses data set collection and model formulation. Section 4 presents an evaluation comparing our novel multi-view approach to a baseline content-only model of similar topology. Section 5 contains a discussion of our results and an interpretability analysis of our model. Section 6 concludes.

(a) File content only (PE) model.
(b) File path content only (FP) model.
(c) File content + contextual file path (PE + FP) model.
Figure 4. Schematic outline of the three approaches that we compare in this paper. (a) A PE content (PE) malware detector, where only static features extracted directly from the PE file are fed to a feed-forward neural network (FFNN). (b) A file path only (FP) malware detector, where the path of each corresponding file is used to determine if that file is malicious or benign. The raw characters are embedded and processed by a series of convolutional layers before being passed to a series of fully connected (FC) layers. (c) Our novel multi-view approach, where we combine file content with contextual information from the file path (PE+FP). File content features are passed through the same feed-forward neural network base (FFNN Base) layers as in (a), while the file path is passed through the same convolutional neural network base layers (CNN Base) as in (b). The outputs of these base layers are concatenated together and passed through a series of fully connected layers. Parameters of the separate input paths are jointly optimized. All models output scores between 0 and 1, which represent the confidence that a file is malicious.

2. Background and Related Work

In this section, we describe how machine learning is commonly applied to static PE detection and how our approach differs, in a high level sense, by providing contextual information as an auxiliary input. We then present related work in other machine learning domains.

2.1. Static ML Malware Detection

Machine learning has been applied in the computer security domain for many years now (Rudd et al., 2017), but disruptive performance breakthroughs in static PE models using ML at the commercial scale are a more recent phenomenon. Commercial models typically rely on deep neural networks (Saxe and Berlin, 2015) or boosted decision tree ensembles (Anderson and Roth, 2018) and have been extended to other static file types as well, including web content (Saxe and Berlin, 2017; Saxe et al., 2018), office documents (Rudd et al., 2018), and archives (Rudd et al., 2018). While methods for dealing with these different input types have their own intricacies, they typically use single inputs derived from file content as a feature vector or text embedding.

Static ML detectors use highly parametric classifiers trained on many malicious and benign samples. The goal is to tune the parameters of these classifiers to best match the outputs from the classifiers for all input samples to their actual ground truth labels. Provided that the malware/benignware samples in the training set are similar enough in content to those seen at deployment and that the samples are well labeled, the learned detection function should work well.

In practice, labels are often collected from vendor aggregation feeds, which submit samples to malware detectors from a variety of vendors. The results can be aggregated into labels that are usually correct, e.g., by using a 1-/5+ criterion (Saxe and Berlin, 2015) or by treating the label as a hidden variable and using statistical estimation methods (Du et al., 2018; Kantchelian et al., 2015). Often a time lag is introduced to let vendors update their models, blacklists, and whitelists accordingly. Generally, the longer the time lag, the more accurate the labels, but the less the data resembles that of the deployment distribution. In actual deployment contexts, classifiers are retrained on new data/labels periodically and the updated parameters are sent to the endpoints on which the detectors are running.

Most static ML for information security (ML-Sec) classifiers operate on learned embeddings over portions of files (e.g., headers) (Raff et al., 2017), learned embeddings over the full file (Raff et al., 2018), or most commonly, on pre-engineered numerical feature vectors designed to summarize the content from each file (Mays et al., 2017; Hassen et al., 2017; Yousefi-Azar et al., 2017; Hassen and Chan, 2017; Narayanan et al., 2016; Ahmadi et al., 2016; Drew et al., 2016; Saxe and Berlin, 2015). Learned embeddings, which generally presume some sort of convolutional architecture, have the advantage that they do not presume a fixed structure and are derived directly during training. However, this process is significantly more expensive, and does not scale as gracefully, e.g., to tens to hundreds of millions of large PE files. Moreover, generic bytes are inherently less constrained than inputs like images, video, audio, and text, where convolutions can take advantage of structural localities/hierarchies. Thus, for generic malicious/benign files there is less performance benefit from learning to embed features directly from inputs. Pre-engineered feature vector representations, by contrast, quickly distill content from each file that is informative in a classification sense. There are a number of ways to craft feature vectors, including tracking per-byte statistics over sliding windows (Ahmadi et al., 2016; Saxe and Berlin, 2015), byte histograms (Ahmadi et al., 2016; Anderson and Roth, 2018), n-gram histograms (Mays et al., 2017), treating bytes as pixel values in an image (a visualization of the file content) (Ahmadi et al., 2016; Mays et al., 2017), opcode and function call graph statistics (Ahmadi et al., 2016), symbol statistics (Ahmadi et al., 2016), hashed/numerical metadata values (Saxe and Berlin, 2015; Anderson and Roth, 2018; Ahmadi et al., 2016) -- e.g., entry point as a fraction of the file, or hashed imports and exports -- and hashes of delimited tokens (Rudd et al., 2018; Drew et al., 2016). In practical applications, several different types of feature vectors extracted from file content are often concatenated together to achieve superior performance.

Along a similar vein, our work uses a concatenation of features derived from the content of a PE file as an input to a neural network, but in contradistinction to previous work we add a secondary input which includes contextual information -- namely the PE file path. The PE content input is passed through a series of hidden layers while the file path is passed through a convolutional embedding. Both inputs are ultimately concatenated together into a common "stem" of hidden layers. The final malicious/benign output score is obtained by passing the final dense layer output (a 1-D scalar) through a sigmoid activation function. This is depicted in schematic form in Figure 4.

2.2. Learning from Multiple Sources

Related research in static ML malware detection using deep neural networks has examined learning from multiple sources of information, but the approaches are fundamentally different from ours: Huang et al. (Huang and Stokes, 2016) and Rudd et al. (Rudd et al., 2019) use multi-objective learning (Caruana, 1993; Rudd et al., 2016) over multiple auxiliary loss functions, which they found increased performance on the main malware detection task. Specifically, Huang et al. introduced an auxiliary categorical cross entropy loss function on mutually exclusive malware family labels, while Rudd et al. introduced several loss functions, including a multi-target binary cross entropy loss over multiple malicious/benign detection sources, a Poisson loss over total detection counts from all malicious/benign sources, and a multi-target binary cross entropy loss over semantic malware attribute tags (e.g., 'ransomware', 'trojan', 'dropper', etc.). While both of these works use multiple target labels derived from metadata about the malicious sample in question, only a single input summarizing the content of the sample is provided. Even if the auxiliary labels provide some contextual information to guide the training process, the classification decision itself is still made purely from the content at deployment.

Our approach utilizes multiple input types/modalities -- one which describes the content of the malicious sample, in the form of a PE feature vector similar to (Saxe and Berlin, 2015), and another which feeds the path of the file to an embedding (similar to (Saxe and Berlin, 2017)), which provides information on where that sample was seen. This technique is a type of multi-view learning (Xu et al., 2013). As the name might suggest, the majority of applications of multi-view learning are in computer vision, where the multiple views literally consist of views from different input cameras/sensors or different views from the same camera/sensor at different times. Early applications were targeted towards detection, localization, and recognition problems (Jones and Viola, 2003; Li et al., 2002; Wu et al., 2004), 2D and 3D modeling and alignment (Gross et al., 2010; Blanz and Vetter, 2003; Cootes et al., 2001; Tola et al., 2012; Jensen et al., 2014), and surveillance and tracking (Black et al., 2002). Later, multi-view solutions to these problems became popular using deep neural networks (Su et al., 2015; Farfade et al., 2015). Other common applications of multi-view learning, both in and outside of the computer vision space, include cross-spectral fusion (Perera et al., 2018), joint textual/visual content representation for image tagging and retrieval (Gong et al., 2014), joint modeling of web page text and inbound hyperlinks (Bickel and Scheffer, 2004), and multi-lingual modeling (Faruqui and Dyer, 2014), to name a few.

As discussed in Section 2.1, combining different feature types via concatenation is a common practice in ML-Sec (Ronen et al., 2018), but these approaches -- by and large -- provide different filters on the same content from each file; they do not add contextual information from different input sources. We could only find two approaches in the ML-Sec space which specifically reference themselves as multi-view: namely (Narayanan et al., 2018), in which Narayanan et al. applied multiple kernel learning over dependency graphs for Android malware classification, and (Bai and Wang, 2016), in which Bai et al. used multi-view ensembles for PE malware detection. While these approaches are in some ways similar to ours, they do not use deep learning or contextual information that is exogenous to the malicious/benign files themselves. We are the first, to our knowledge, to perform multi-view modeling for malware detection at commercial scale using exogenous file path information fed in conjunction with file content to a deep neural network.

3. Implementation Details

In this section we present implementation details of our approach, including the data collection process for obtaining PE files and file paths from customer endpoints, our featurization strategy, and the architectures of our multi-view deep neural network and comparison baselines.

3.1. Dataset

For our experiments, we collected training, testing, and validation datasets from a prominent anti-malware vendor's telemetry. This telemetry contains the file paths and SHA256 digests of portable executable (PE) files seen on their customer endpoints, along with time stamps and other metadata. The telemetry did not contain the raw files due to bandwidth and customer privacy considerations, so instead we used the SHA256 digests to look up and download available files from vendor aggregation services. Malicious/benign labels for these files were computed using a criterion similar to (Saxe and Berlin, 2015, 2017), but combined with additional proprietary information to generate more accurate labeling. Files that we could not label were removed from the dataset.

We lower-cased all the file paths for consistency. The file paths that we received from telemetry had drive letters/paths and user names replaced with "[drive]" and "[user]" tokens respectively. This step was necessary in order to protect personally identifiable information (PII). This obfuscation also has the side benefit of removing near duplicate file paths. We limited the number of file paths associated with each unique PE file sample to a maximum of the first five paths seen, in order to avoid "heavy hitter" file paths dominating our dataset.

In total, we collected approximately 6 months of sampled telemetry data after performing the above cleaning operations. We split this data into training and test datasets based on the time samples were first seen in our telemetry. Samples that first appeared between June 1 and November 15, 2018 were used for training, and samples first seen between January 1 and January 30, 2019 were used as a test set. Samples first seen between November 16 and December 1, 2018 were used as a validation set to monitor model performance during training, and for model selection and calibration. Care was taken to ensure that there were no overlaps between training, validation, and test sets. The training dataset consisted of 9,148,143 distinct samples, with 693,272 of them labeled as malicious. The test dataset had 249,783 total samples, with 38,767 of them labeled as malicious. The validation set consisted of 2,225,094 samples, with 85,041 of them labeled as malicious.

We note that our original test set contained 275,374 samples. This was reduced by 25,591 samples, to 249,783, for the following reason. During an early interpretability analysis using LIME explanations (see Sections 4.3 and 4.4), we found that a number of files in our test set exhibited particularly high responses in malicious/benign score based on SHA256 digests in the file path, as well as tokens such as "malware", "prevalent", etc. Upon investigation, we found that these come from our source vendor's customers (which may include other IT security organizations) testing its endpoint products -- e.g., by downloading folders of malware and seeing if there are resultant detections. While this is, in a sense, indicative of a realistic customer endpoint distribution, in our view, it does not reflect an accurate view of the threat landscape, and including these samples in the test set could yield spuriously optimistic performance evaluations. We therefore pruned our test set of these "test endpoint" samples prior to conducting the experiments and analysis presented in Section 4. For readers interested in performance comparisons and LIME analysis on the unpruned test set, these results are presented in Appendices A.2 and A.3.

3.2. Feature Engineering

In order to use file paths in a neural network, we first needed to convert the variable length strings into numeric vectors of fixed length. We accomplished this using a vectorization scheme similar to (Saxe and Berlin, 2017), by creating a lookup table keyed on each character, with a numeric value (between 0 and the character set size) representing each character. In practice, we implemented this table as a Python dictionary. This transformation required our file paths to be trimmed to a fixed size in order to make it cost effective to perform our experiments. Guided by statistics from our telemetry and early experimentation, we trimmed file paths to the last 100 characters. See Appendix A.1 for further discussion.

In (Saxe and Berlin, 2017), a character set of 100 printable characters is used as the vocabulary in the lookup table to convert characters to integer vocabulary indices as part of feature construction. In our work, we consider the entire Unicode (UTF-8) character set, but limit our vocabulary to the 150 most frequently occurring Unicode characters, determined by their frequency counts in our data (see Figure 5). We also add a single 'other' character that represents all other Unicode characters not in the top 150, and a special null character to represent shorter strings, bringing our vocabulary to a final size of 152.
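To make this featurization concrete, the following is a minimal Python sketch of the vocabulary construction and path vectorization described above. The specific index assignments, the left-padding of short paths, and tie-breaking among equally frequent characters are our assumptions, not the authors' released code.

```python
from collections import Counter

VOCAB_SIZE = 150   # most frequent characters kept in the vocabulary
PATH_LEN = 100     # paths trimmed to their last 100 characters
OTHER_IDX = 150    # catch-all index for out-of-vocabulary characters
NULL_IDX = 151     # padding index used for shorter paths

def build_vocab(paths):
    """Map the 150 most frequent characters to indices 0..149."""
    counts = Counter(ch for p in paths for ch in p.lower())
    most_common = [ch for ch, _ in counts.most_common(VOCAB_SIZE)]
    return {ch: i for i, ch in enumerate(most_common)}

def vectorize_path(path, vocab):
    """Convert a file path into a fixed-length integer vector."""
    path = path.lower()[-PATH_LEN:]  # keep only the last 100 characters
    vec = [vocab.get(ch, OTHER_IDX) for ch in path]
    # left-pad short paths with the null index (padding side assumed)
    return [NULL_IDX] * (PATH_LEN - len(vec)) + vec

# Example usage:
# vocab = build_vocab(training_paths)
# x_fp = vectorize_path(r"[drive]\users\[user]\appdata\local\temp\app.exe", vocab)
```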

Figure 5. Distribution of Unicode character frequencies by prevalence in our telemetry. The x-axis indexes the Unicode character in terms of prevalence rank (most prevalent to least prevalent). The y-axis corresponds to frequency. Note the logarithmic scale on the y-axis. The red vertical line shows the character rank at which we truncate our vocabulary.

As features for the content of the PE files, we used floating point 1024-dimensional feature vectors consisting of four distinct feature types, similar to (Saxe and Berlin, 2015):

  1. A 256-dimensional (16x16) 2D histogram of windowed entropy values per byte. A window size of 1024 was selected.

  2. A 256-dimensional (16x16), 2D logarithmically scaled string length/hash histogram.

  3. A 256-dimensional vector of hashed PE header metadata, including imports, exports, etc.

  4. A 256-dimensional (16x16) byte standard deviation/entropy histogram.

In total, we represent each sample as two feature vectors: a PE content feature vector of 1024 dimensions and a contextual file path feature vector of 100 dimensions.
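As an illustration of the first feature type, below is a rough sketch of a windowed byte/entropy histogram in the spirit of (Saxe and Berlin, 2015). The stride, bin edges, and normalization are our assumptions rather than the exact production featurization.

```python
import numpy as np

def byte_entropy_histogram(data: bytes, window=1024, stride=256):
    """16x16 joint histogram of (byte value bin, window entropy bin),
    flattened to a 256-dim vector (sketch; bin edges and stride assumed)."""
    hist = np.zeros((16, 16), dtype=np.float64)
    arr = np.frombuffer(data, dtype=np.uint8)
    if len(arr) == 0:
        return hist.ravel()
    for start in range(0, max(len(arr) - window + 1, 1), stride):
        chunk = arr[start:start + window]
        counts = np.bincount(chunk, minlength=256) / len(chunk)
        probs = counts[counts > 0]
        entropy = -np.sum(probs * np.log2(probs))   # in [0, 8] bits
        ent_bin = min(int(entropy * 2), 15)         # 16 entropy bins
        for byte_bin in range(16):                  # 16 byte-value bins
            lo, hi = byte_bin * 16, (byte_bin + 1) * 16
            hist[byte_bin, ent_bin] += counts[lo:hi].sum()
    total = hist.sum()
    return (hist / total).ravel() if total else hist.ravel()
```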

3.3. Network Architectures

Figure 6. The neural network model we use in our experiments. Each of the unlabeled blocks contains a fully connected layer, followed by Layer Normalization and a Dropout Layer. In experiments where we train the file paths and PE features individually, the respective input and associated input branch is used and the other branch is removed from the model definition.

Our multi-view architecture is shown in Figure 6. The model has two inputs: the 1024-element PE content feature vector, x_PE, and the 100-element file path integer vector, x_FP, as described in Section 3.2. Each distinct input is passed through a series of layers with its own parameters -- θ_PE for PE features and θ_FP for file path features -- which are jointly optimized during training. The outputs of these layers are then joined (concatenated) and passed through a series of final hidden layers -- a joint output path with parameters θ_J. The final output of the network consists of a dense layer followed by a sigmoid activation. Our labeling convention uses 0 as a benign label and 1 as a malicious label, so sigmoid outputs close to 1 are more likely to be malicious than outputs close to 0, which are more likely to be benign. However, the threshold for malicious/benign determination can be set anywhere along the [0, 1] range according to false positive rate (FPR) and detection rate (TPR) tradeoffs for the application at hand -- a reasonable threshold for our use cases typically corresponds to an FPR at or below 10^-3.

The PE input arm passes through a series of blocks consisting of four layers each: a fully connected layer, a Layer Normalization layer implemented using the technique described in (Lei Ba et al., 2016), a Dropout layer with a dropout probability of 0.05, and a Rectified Linear Unit (ReLU) activation. Five of these blocks are connected in sequence, with dense layer sizes of 1024, 768, 512, 512, and 512 nodes respectively.

The file path input arm passes x_FP -- a vector of length 100 -- into an Embedding layer that converts the integer vector into a (100, 32) embedding. This embedding is then fed into 4 separate convolution blocks, each of which contains a 1D convolution layer with 128 filters, a layer normalization layer, and a 1D sum layer to flatten the output to a vector. The 4 convolution blocks contain convolution layers with filters of size 2, 3, 4, and 5 respectively, which process 2-, 3-, 4-, and 5-grams of the input file path. The flattened outputs of these convolution blocks are then concatenated and serve as input to two dense blocks (same form as in the PE input arm).

The outputs from the fully connected blocks of the PE arm and the file path arm are then concatenated and passed into the joint output path, parameterized by θ_J. This path consists of dense blocks (same form as in the PE input arm) of layer sizes 512, 256, and 128. The 128-dimensional output of these blocks is then fed to a dense layer which projects the output to 1D, followed by a sigmoid activation that provides the final output of the model.
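To make this architecture concrete, the following is a condensed Keras sketch of the PE+FP network described above. Layer sizes and wiring follow the text; the hidden-layer sizes of the two file path dense blocks, initializers, and padding choices are our assumptions, not specified values.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def dense_block(x, units, dropout=0.05):
    # FC -> LayerNorm -> Dropout -> ReLU, as described in Section 3.3.
    x = layers.Dense(units)(x)
    x = layers.LayerNormalization()(x)
    x = layers.Dropout(dropout)(x)
    return layers.Activation("relu")(x)

# PE content arm: 1024-dim feature vector through five dense blocks.
pe_in = layers.Input(shape=(1024,), name="x_pe")
pe = pe_in
for units in (1024, 768, 512, 512, 512):
    pe = dense_block(pe, units)

# File path arm: 100 integer indices -> (100, 32) embedding -> n-gram convs.
fp_in = layers.Input(shape=(100,), name="x_fp")
emb = layers.Embedding(input_dim=152, output_dim=32)(fp_in)
conv_outs = []
for ksize in (2, 3, 4, 5):  # 2-, 3-, 4-, and 5-gram convolutions
    c = layers.Conv1D(128, ksize, padding="same")(emb)
    c = layers.LayerNormalization()(c)
    c = layers.Lambda(lambda t: tf.reduce_sum(t, axis=1))(c)  # 1D sum layer
    conv_outs.append(c)
fp = layers.Concatenate()(conv_outs)  # 4 * 128 = 512 dims
for units in (512, 256):  # two dense blocks; sizes assumed
    fp = dense_block(fp, units)

# Joint output path: concatenate both arms, then blocks of 512, 256, 128.
joint = layers.Concatenate()([pe, fp])
for units in (512, 256, 128):
    joint = dense_block(joint, units)
out = layers.Dense(1, activation="sigmoid")(joint)

pe_fp_model = Model(inputs=[pe_in, fp_in], outputs=out)
```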

The PE only model is just the PE+FP model without the FP arm, taking input x_PE and fitting the θ_PE and θ_J parameters. Similarly, the FP model is the PE+FP model without the PE arm, taking input x_FP and fitting the θ_FP and θ_J parameters. The first layer of the output subnetwork is adjusted appropriately to match the output of the previous layer.

We fit all models using a binary cross entropy loss function. Given the output ŷ = f(x; θ) of our deep learning model for input x with label y and model parameters θ, the loss is:

L(ŷ, y) = −[y log(ŷ) + (1 − y) log(1 − ŷ)].   (1)
FPR               | 10^-5         | 10^-4         | 10^-3         | 10^-2         | 10^-1
PE+FP (0.992 AUC) | 0.398 ± 0.083 | 0.558 ± 0.009 | 0.693 ± 0.005 | 0.922 ± 0.006 | 0.978 ± 0.005
PE (0.990 AUC)    | 0.208 ± 0.086 | 0.339 ± 0.059 | 0.547 ± 0.007 | 0.889 ± 0.008 | 0.972 ± 0.007
FP (0.968 AUC)    | 0.020 ± 0.022 | 0.233 ± 0.040 | 0.522 ± 0.003 | 0.711 ± 0.003 | 0.927 ± 0.003
% Error Reduction | 24.0          | 33.1          | 32.3          | 30.1          | 22.6
Table 1. Mean and standard deviation true positive rates (TPRs) on the test set for false positive rates (FPRs) of interest. Results were aggregated over five training runs with different weight initializations and minibatch orderings. Best results (PE+FP row) consistently occurred when using both feature vectors from the file and the contextual file path as inputs. Percentage reduction in mean detection error in comparison to the PE baseline is shown in the bottom row.

Via an optimizer, we solve for the optimal set of parameters θ* that minimizes the combined loss over the dataset:

θ* = argmin_θ (1/N) Σ_{i=1}^{N} L(f(x_i; θ), y_i),   (2)

where N is the number of samples in our dataset, and y_i and x_i are the label and the feature vector of the i-th training sample respectively.

We built and trained our models using the Keras framework (Chollet et al., 2015), using the Adam optimizer with Keras's default parameters and fixed-size minibatches. Each model was trained for 15 epochs, which we determined was enough for the results to converge.
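Continuing the sketch above, training then reduces to a standard Keras compile/fit call; the minibatch size shown is a placeholder, since the exact value is not given here.

```python
pe_fp_model.compile(optimizer="adam", loss="binary_crossentropy")

# x_pe_train: (N, 1024) float array; x_fp_train: (N, 100) int array;
# y_train: (N,) array of 0/1 labels; validation arrays defined analogously.
pe_fp_model.fit(
    {"x_pe": x_pe_train, "x_fp": x_fp_train},
    y_train,
    batch_size=1024,  # placeholder; the paper's minibatch size is unspecified
    epochs=15,
    validation_data=({"x_pe": x_pe_val, "x_fp": x_fp_val}, y_val),
)
```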

4. Experiments and Analysis

We trained three different types of models: two baseline models (PE and FP) and one multi-view model (PE+FP). The baselines can be viewed as different ablations of the multi-view model. One baseline model (PE) takes only PE feature vectors as inputs while the other (FP) takes only file paths as inputs. The multi-view model takes both PE features and file paths as inputs. These model topologies are described in Section 3. We trained each of these models on the same samples; only their inputs differed. The PE baseline is characteristic of a real-world production use case, while the FP baseline should be viewed as a sanity check to ensure that trivial gains do not occur over the PE model by using file path information alone.

To get a statistical view of model performance, we trained five models of each type, with different weight initialization per model, different minibatch ordering, and different seeds for dropout. This allows us to assess not only relative performance comparisons across individual models (as is standard practice), but also mean performance and uncertainty across model types. Training multiple models also tells us important information about the stability of each model type under different initializations.

4.1. Performance Evaluation

Results for the three model types evaluated on the test set -- PE+FP, PE, and FP -- are shown in Figure 7 as ROC curves, and are also summarized in tabular form in Table 1. Recall that these results (mean and standard deviation) were assessed over five runs.

Figure 7. Mean ROC curves and standard deviations for our PE+FP model (red solid line), a PE model (blue dashed line), and an FP model (green dotted line). Mean and uncertainty are computed over five runs.

We see that the multi-view (PE+FP) model substantially outperforms the content-only model in terms of net AUC and across the vast majority of the ROC curve, briefly dipping below the PE baseline in one narrow FPR region, an effect which could potentially be alleviated with a larger training set. At lower FPRs, the performance improvement of the PE+FP model over both baselines is substantial. Specifically, we see a 27% increase in true positive rate for the PE+FP model over the PE model at 10^-3 FPR, and a 64% increase at 10^-4 FPR. This increase is also accompanied by a reduction in variance of performance, making the PE+FP model a better choice in terms of both stability and overall detection performance. At higher FPR regions, our content-only (PE) model already exhibits very good performance, with a mean TPR of 0.889 at 10^-2 FPR, and the multi-view (PE+FP) model manages to outperform it, albeit slightly, with a mean TPR of 0.922. As expected, the file path only (FP) model that looks only at context consistently performs the worst, with an overall mean AUC of 0.968, compared to a mean AUC of 0.992 for the multi-view (PE+FP) model and a mean AUC of 0.990 for the content-only (PE) model.

Note that the TPR/FPR metrics that we use to evaluate detection are invariant to the ratio of malicious to benign samples in our test set. This invariant representation of results is important, since if we are to deploy this model in practice, we can use the TPR/FPR ROC curve to re-calibrate the detector for a significantly higher ratio of benign files to malware by selecting a threshold associated with a low FPR (e.g., 10^-3), rather than the presumed default threshold. It is also for that reason that in our analysis we focus exclusively on the low FPR regions of the curve.

At very low FPRs (≤ 10^-5) the variance in the TPR increases. This is due to inherent measurement noise at low FPRs: an FPR of 10^-5 means that roughly 1 in 10^5 benign samples was falsely labeled as malicious, and 10^5 is the same order of magnitude as the number of benign samples in our test set, providing little support for the numerical interpolation used to generate these ROC curves. Moreover, a small fraction of our test set could potentially be mislabeled. Consequently, results significantly below 10^-5 FPR should be treated with some skepticism. The improvement of the combined model is still substantially larger than the statistical uncertainty in the relevant 10^-5 to 10^-3 FPR regions.

There are two reasons to believe that our test set is more challenging than real deployment distributions. The first is that ML detectors are never deployed by themselves, and are instead guard-railed by signers, prominent file hashes, and AV signature whitelisting. Most prominent false positive issues can be suppressed using these whitelist approaches. The second is that we removed any previously seen PE file from the test set, even if it has a new file path. In the raw telemetry, we observed that most executed files are actually not new. However, in our view, the primary job of the ML system is to properly identify previously unseen files, whereas old files can typically be whitelisted or blacklisted. Thus, our evaluation reflects the realistic capability of our respective classifiers to detect novel malware.

4.2. File Path Influence Analysis

Figure 8. Word clouds generated from file paths where the prediction of the combined PE+FP model differs from the prediction of the PE model. These file paths are divided into four categories based on the initial and changed predictions. In left-right, top-bottom order, the word clouds respectively show file paths that are additional false negatives, additional detections, additional true negatives, and additional false positives.

The overall performance gain from using file paths as additional context for the neural network model is evident from the performance metrics in Section 4.1. In this section, we analyze the influence that file paths have on an individual model's performance, by examining additional detections and false negatives introduced by the file paths (PE+FP model) as compared to a model that considers only the PE binary features (PE model). To investigate this, we thresholded detection scores corresponding to particular FPRs, converting continuous model outputs into binary malware/benign decisions. For this analysis, we employed a threshold corresponding to a 10^-3 FPR.
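One simple way to derive such a threshold -- a sketch assuming that the operating point is calibrated on a held-out set of benign scores -- is to take the appropriate quantile of the benign score distribution:

```python
import numpy as np

def threshold_at_fpr(benign_scores, target_fpr=1e-3):
    """Pick the score cutoff so that roughly target_fpr of benign
    samples score above it (i.e., become false positives)."""
    return np.quantile(benign_scores, 1.0 - target_fpr)

# Example: binarize model outputs at the 10^-3 FPR operating point.
# thresh = threshold_at_fpr(val_benign_scores)
# y_pred = (test_scores >= thresh).astype(int)
```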

A word cloud representation of tokens from the file paths which cause a change in the prediction of the PE+FP model when compared to the PE model is illustrated in Figure 8. These file paths are represented by four word clouds -- one each for additional false negatives, additional detections, additional true negatives, and additional false positives -- to capture the most frequent kinds of tokens in file paths that were detected or misclassified. Common file roots such as "program files", "appdata", etc. were filtered out as stop words in order to avoid these words crowding the analysis and suppressing more interesting patterns.

Looking at the word cloud for additional detections by the combined model, we observed that "app" occurred very frequently. Upon further inspection, we discovered a family of Trojans that always had a file path of the pattern "[drive]\Users\[user]\appdata\local\temp\[random 9 digit sequence]\app.exe", with about 10,000 such occurrences in the training set. Out of the 969 such occurrences in the test set, we observed that the PE features only model detected just 152 variants, whereas the PE + file path model detected an additional 575 samples as malicious (at an FPR of 0.001). This improvement in detections at a low FPR is very encouraging, and in line with the intuition for using file paths. We observe several such cases where patterns that malicious files exhibit in their file paths lead to successful detections, with executables residing in common Windows folders such as %SYSTEM%, %USERPROFILE%, %APPDATA%, %SYSTEMPROFILE%, etc.

However, this does not mean that our model relies only on the file path and makes trivial predictions. Since file paths are not used individually, but as context along with PE content, the model seems to learn to dynamically attribute value to different parts of the file path based on specific patterns in file content that are unique to those files. In other words, there is no single file path that can always cause a detection or a suppression regardless of what file content is associated with it. This is clear when we look at a set of malicious samples in our data that seem to impersonate unfinished Chrome downloads on disk. These files follow the pattern "[drive]\Users\[user]\Downloads\Unconfirmed\[random 6 digit sequence].crdownload". Since both malicious and benign files have similar file path patterns in this case, the detection rates for both the combined (PE + FP) model and the content only (PE) model remain virtually the same, signaling that the file path has almost no influence on the prediction in this case.

While introducing file paths as an auxiliary input yields a compelling performance improvement, in some scenarios it also causes misclassifications. Based on an analysis of missed detections (additional false negatives) from the PE+FP model, we observed the following modes of failure.

  • Malicious files contained in system restore checkpoints and deleted files in the recycle bin usually have completely randomly generated filenames, with a very large percentage of them being benign. We have observed that malicious files convicted with a low confidence by the PE model are sometimes marked as benign by the PE+FP model.

  • Novel malicious files with names associated mostly with benign files in the training set are also falsely marked as benign. This does not happen very often. The problem is most chronic when a large set of benign files with similar file paths is seen in the training set, and the detection confidence from the PE features is low. For example, in the false negatives word cloud in Figure 8, we see that file paths containing the words "time sheet", "steam", "fleet info", "payroll", "departments", etc. are wrongly exonerated, because these names are largely associated with benign files.

FPR            | 10^-5 | 10^-4 | 10^-3 | 10^-2 | 10^-1
Additional TPs | 7796  | 9099  | 3601  | 2229  | 734
Additional FNs | 90    | 475   | 1799  | 565   | 310
Net Gain       | 7706  | 8624  | 1802  | 1664  | 424
Table 2. Number of samples where our model has additional detections and additional false negatives as opposed to a model using just the PE binary features, at different FPR levels.

Fortunately, the occurrence of such failures seems to be quite low compared to the number of files we are able to convict using file path information. These failure modes also happen only when the PE model is almost completely ambivalent about its prediction. It is in this set of gray files that the PE+FP model produces additional detections and suppresses false positives, while occasionally missing a few detections. This is especially impressive when considering that every sample in the test set is completely unseen in the training distribution. Table 2 shows the net detection gain at different FPR levels for the PE+FP model over the PE model.

We also see that the word clouds for additional true negatives and false positives are almost identical. Since we are controlling for a fixed FPR while generating the word clouds, the number of additional true negatives is necessarily close to the number of additional false positives; however, the distribution changes, albeit slightly. From manual inspection, we found that most files whose predictions changed to true negatives/false positives were generally close to the decision threshold, with randomly generated components in their file paths, which likely cause minor changes in predicted probabilities and a tendency to hop between predictions.

4.3. LIME Analysis

To ensure that our multi-view model has learned meaningful content from PE file paths, we pick one of our trained models and apply Local Interpretable Model-Agnostic Explanations (LIME), introduced by Ribeiro et al. in (Ribeiro et al., 2016), to samples from the test set. LIME explanations assume that a trained ML model, in our case f, can be explained by a simple interpretable linear model locally, around an input x. Based on this assumption, which the authors justify in (Ribeiro et al., 2016), the linear model is trained to approximate f within a neighborhood of x. The learned model weights are then used to judge feature importance of the original deep model.

The definition of a realistic neighborhood around a specific input x is problem specific, and represents the main challenge in adapting LIME to our file path analysis. We generated the neighborhood samples by first tokenizing the file path by five delimiters: "\", "/", ".", "_", and "-". We then selected one random token to perturb. We crafted our perturbations by first sampling a random number from the uniform distribution on [0, 1]. If the number was greater than 0.5, we replaced the token with a random string; if it was less than or equal to 0.5, we removed the token and the preceding delimiter.

After applying our perturbations, we one-hot encoded all the possible tokens into a feature vector, with each entry indicating whether the corresponding token is present in the (perturbed) path. We fit each LIME classifier on a set of such perturbations together with the original sample. Similarly, we reconstituted the modified strings by recombining separators and tokens back into full strings.
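A minimal sketch of this perturbation routine is shown below; the random-string generator and the handling of leading tokens are our assumptions.

```python
import random
import re
import string

DELIMS = r"([\\/._-])"   # the five delimiters, kept via a capture group

def tokenize(path):
    """Split a path into an alternating list of tokens and delimiters."""
    return [p for p in re.split(DELIMS, path) if p]

def perturb(path):
    """Generate one LIME neighborhood sample: pick a random token, then
    either replace it with a random string or drop it (and its preceding
    delimiter) with equal probability."""
    parts = tokenize(path)
    token_idxs = [i for i, p in enumerate(parts) if not re.fullmatch(DELIMS, p)]
    i = random.choice(token_idxs)
    if random.random() > 0.5:
        # substitute a random string of the same length (generator assumed)
        parts[i] = "".join(random.choices(string.ascii_lowercase, k=len(parts[i])))
    else:
        # remove the token together with its preceding delimiter, if any
        start = i - 1 if i > 0 else i
        del parts[start:i + 1]
    return "".join(parts)
```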

As an example, consider the following file path:

C:\users\Bob\appdata\local\temp\rar\payment.scr.

This pre-processed file path, after substituting the drive and user name, will be as follows:

[drive]\users\[user]\appdata\local\temp\rar\payment.scr.

Splitting on the delimiters "\" and "." yields nine distinct tokens: [drive], users, [user], appdata, local, temp, rar, payment, and scr. We randomly generate three more samples, which creates two new tokens, resulting in eleven distinct tokens in our example dataset.

The corresponding perturbations generated by our perturbation routine are as follows. In the first perturbation, the feature corresponding to the token "local" has a value of 0, since the token no longer exists in the string. In the second, we perturb the token "appdata", replacing it by a random string and thus generating a token that did not exist in the original sample. We do this again in the third perturbation with another random string.

Consistent with Ribeiro et al., we fit our explanation model using Lasso regression -- least squares regression with an ℓ1 penalty, which has the effect of encouraging sparsity in the explanation. The overall optimization objective of the LIME explanation model is:

w* = argmin_w Σ_i π_i (f(x_i) − w·x'_i)² + α ||w||_1,   (3)

where w are the parameter weights of the LIME model, α is the weight regularization penalty, and π_i is the weight associated with the i-th sample. For our implementation, we used Scikit-learn's (Pedregosa et al., 2011) default Lasso regression implementation, with an α value which we observed to induce a sparse solution. For numerical stability, here we take f to be the prediction of the network prior to the sigmoid output.

Ribeiro et al. recommend computing π_i using a distance kernel from the original sample to the perturbed sample. This enforces locality, so that perturbations closer to the original sample contribute more to the regression objective. In our formulation, all perturbations have approximately the same semantic distance, so we set π_i to a common constant for all perturbations and assigned a much larger weight to the one original sample, so that the fitted model's prediction approximately matches the original value.
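A minimal sketch of this weighted Lasso fit with scikit-learn follows; the regularization strength and the up-weighting factor for the original sample are placeholders, since the exact values are not given here.

```python
import numpy as np
from sklearn.linear_model import Lasso

def fit_lime(one_hot, logits, original_idx, alpha=0.01, w0=100.0):
    """Fit a sparse linear surrogate to the network's pre-sigmoid outputs.
    one_hot: (n, n_tokens) binary matrix for the original + perturbed paths;
    logits: (n,) pre-sigmoid model outputs; original_idx: row index of the
    unperturbed sample. alpha and w0 are placeholder hyperparameters."""
    weights = np.ones(len(logits))
    weights[original_idx] = w0  # heavily weight the one original sample
    lime = Lasso(alpha=alpha)
    lime.fit(one_hot, logits, sample_weight=weights)
    return lime.coef_  # per-token weights: sign/magnitude = influence
```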

We visualized the computed Lasso model weights for several interesting examples in Figure 11, by overlaying the computed weights on top of the file path string.

(a) Positive (Increase)
(b) Negative (Decrease)
Figure 11. Example file paths from our LIME analysis with (a) positive and (b) negative ground truth labels. The path tokens are highlighted based on the Lasso weights, as computed by the LIME model. As the model is linear, the token weights can be directly interpreted as either making the overall malware score higher (red weights) or lower (blue weights). The color shading is proportional to weight amplitude, i.e., darker red and blue shades correspond to greater magnitude weights, while lighter shades correspond to smaller magnitudes. White corresponds to no impact.

In the first positive example we can see that the token "kmsauto" is identified as a maliciousness indicator by our PE+FP model. KMS Auto is a legally dubious Microsoft product activator, and this file is identified as "PUA:Win32/AutoKMS" by Microsoft. Similarly, in the second positive example our PE+FP model gave a high score to "pcrepairkit". Repair kits are typically questionable software products that often contain spyware or malware.

On the other hand, in the several negative examples we can see that management tools are down-weighted by the PE+FP model as compared to the PE model. Management tools are notoriously difficult to distinguish from spyware, as their functionality is essentially the same; the only difference is the intent of the user. In this case, using file path information provided more context for the detection, thus allowing more accurate identification by the PE+FP model. We note that these are a few interesting examples, and that the relative contributions of tokens also have a non-linear dependency on the file content itself. For example, when we kept the same path for the first negative example, but replaced the file with a randomly chosen malicious file, the importance of the token "management" was significantly reduced.

Finally, we performed an aggregate LIME analysis to identify prominent tokens throughout our dataset, choosing 200 samples from our test data set to analyze -- 100 with a malicious ground truth label and 100 with a benign ground truth label. The first 100 of these samples consisted of positive ground truth label test samples where the score from the PE+FP model most significantly increased beyond that of the PE baseline. The second 100 consisted of negative ground truth label samples where the score of the PE+FP model most significantly decreased beneath the score of the PE baseline. Note that measuring the most significant increase and decrease on raw output scores is potentially problematic, because different models -- even ones with bounded sigmoid outputs -- may have fundamentally different score scales between 0 and 1. Therefore, we performed calibration via isotonic regression on scores over the validation set for each model before assessing score differences. We then aggregated LIME parameter weights across tokens and normalized by token frequency, looking at the tokens with the highest and lowest weights for the selected 200 samples. The top 10 tokens which most increased and decreased response are shown in Table 3.
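A brief sketch of the calibration step, assuming scikit-learn's isotonic regression fit on validation-set scores and binary labels:

```python
from sklearn.isotonic import IsotonicRegression

# Fit one calibrator per model on validation scores vs. 0/1 labels, then
# map raw test scores onto a comparable calibrated scale before comparison.
calibrator = IsotonicRegression(out_of_bounds="clip")
calibrator.fit(val_scores, val_labels)      # val_scores in [0, 1]
calibrated_test_scores = calibrator.predict(test_scores)
```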

Token Weight
2786 7.436
4327 5.854
8o0sdtwhrxkz 4.213
28pygyuokzwwn 3.826
wfzctyetugjwxxuy 3.736
3015798005 3.592
setup 3.313
jzljumnkfaapzpqq 3.183
whyovxk3mplt6 3.167
1467 2.219
(a) Increase (Malicious)

Token Weight
onv2k -6.677
computerz -6.433
westlake -5.565
editor -5.13
printingtools -4.738
videodecodesdk -3.687
placar80 -3.663
movavistatistics -3.556
enterprise -3.488
jarvee -3.401
(b) Decrease (Malicious)
Token Weight
miner 9.369
z 8.163
2639 6.876
mineropt 6.507
2198205786 6.28
systemprofile 4.26
xxxxx 4.193
t 3.916
d 3.812
namespace 3.441
(c) Increase (Benign)
Token Weight
msi61f0 -8.04
part -7.022
ciscosparklauncher -6.642
sesinaci -4.738
clientinst -4.445
safesenderslist -4.443
setup -4.389
sd -4.147
wim -4.06
ie8shims -3.996
(d) Decrease (Benign)
Table 3. Tokens and corresponding weights from our LIME analysis that most amplified and attenuated responses for malicious and benign samples. For the malicious samples analyzed, tokens that resulted in the greatest increase and greatest decrease in classification score are shown in (a) and (b). Corresponding tokens for the benign samples are shown in (c) and (d). Malicious samples were selected as the 100 samples from the test data set where the ground truth label was malicious and the calibrated score from the PE+File Path model increased the most above the calibrated score from the PE model. Benign samples were similarly selected from the pruned data set according to the most significant drop in calibrated score.

The results of running our LIME analysis are shown in Table 3. For malicious samples, we see that the tokens of highest weight consisted of strings with randomized content that were not cryptographic digests, perhaps an attempt at obfuscation. The remaining high-weight token, "setup", is perhaps indicative of an infected installer. Tokens with large negative weights consist of common looking benign software names, as one might expect. Of the benign samples that we assessed, tokens that increased response tended to have very short length, e.g., "t", "d", and "z", very high or very low entropy, e.g., "2198205786" and "xxxxx", or to contain "miner" in their names, e.g., "miner", "mineropt" -- indicating the likely presence of a (benign) cryptocurrency miner, potentially downloaded by the user voluntarily. It is not surprising that the string "miner" increased response, as many types of malware and potentially unwanted benignware steal CPU cycles to mine cryptocurrency. The tokens that most attenuated the response appear to be components of standard software. Interestingly, "setup" tends to attenuate response for the benignware that we analyzed, indicating that the behavior of tokens depends on their contextual location within the file path. Note that, as LIME involves fitting a classifier per sample, this analysis is limited only to the samples that we analyzed. However, it suggests that our neural network is learning to extract useful contextual information from file paths, not just mere data artifacts.

4.4. Model Debugging with LIME

As seen in Section 4.3 above, LIME can serve to interpret which parts of test samples triggered a high or low classification response. In this section, we highlight the ability of LIME to help debug overfitting in models and to ensure that seemingly optimistic test results are not driven by spurious correlations.

During our initial experiments, when we performed LIME analysis on a trained PE + FP model, we observed some interesting patterns among tokens with a disproportionately high response. The tokens that triggered the greatest increase were all hexadecimal digests that seemed to be associated with malicious PE files named with the SHA256 digest of their contents. These LIME results are presented in Table A.1 in the Appendix.

We concluded that these high-response tokens come from customers who were intentionally testing the detection capability of the source vendor from which we obtained our data set. This is likely a standard and fairly common occurrence, as the easiest way to test an anti-virus engine is to simply download known malware datasets and see if they are detected. (We do not think this is actually a good way to test vendor efficacy, as the test can clearly be gamed by a file path model such as ours, as well as by cloud blacklists.) While this type of data is a microcosm of malware and benignware "in-the-wild" distributions, it is not representative of realistic threats, and using this data in our analysis could lead to an overly optimistic measure of FP and PE+FP performance.

Therefore, as we mentioned in Section 3.1, we removed all samples from the test set with file paths containing the names "malware", "prevalent", and our source organization's name, as well as strings of length 10 or greater consisting only of hexadecimal characters (corresponding to hash digests), and re-ran our evaluation. This reduced the size of our test set from 275,374 to 249,783, a reduction of 25,591 samples.
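The following is a minimal sketch of such a pruning heuristic; the source organization's name is omitted, and the exact matching rules used here may differ.

```python
import re

HEX_RUN = re.compile(r"[0-9a-f]{10,}")    # hash-digest-like hex strings
TEST_TOKENS = ("malware", "prevalent")    # plus the vendor's name (omitted)

def is_test_endpoint_path(path: str) -> bool:
    """True if a file path looks like a customer-test artifact (sketch)."""
    p = path.lower()
    return any(tok in p for tok in TEST_TOKENS) or bool(HEX_RUN.search(p))

# Example: pruned = [s for s in test_set if not is_test_endpoint_path(s.path)]
```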

Comparative model performances on the unpruned test set are shown as ROC curves in Figure A.3 of the appendix and in tabular form in Table A.2. Figure A.3 demonstrates the difference in model performance before and after dataset filtering. As expected, we see slightly better performance from the PE+FP model on the unpruned data set, since all the easy-to-classify file paths are included in this dataset. The performance of the content-only (PE) model is largely unchanged by pruning, while the performance of the file path (FP) model is diminished by pruning. We present this result to demonstrate the importance of selecting a meaningful/representative test set, particularly when dealing with multiple input types, and also to highlight the utility of LIME in model debugging.

5. Discussion

In this section, we discuss practical applications and potential issues associated with deploying our multi-view model. First, in Section 5.1, we explore the vulnerability of our model in an adversarial setting. Then, in Section 5.2, we explore the utility of our model in an endpoint detection and response (EDR) context.

5.1. Sensitivity to Adversarial Attacks

A natural concern when using file paths for static detection is that an attacker has a fair bit of control over where on the system malware can reside. This adds another input which an adversary can manipulate to evade detection -- much as a PE content-only model can be evaded, e.g., by adding overlays from benign software, a PE+FP model can potentially be evaded by using common paths from benign files and/or by modifying PE content. Defense/hardening against adversarial attacks is an active area of research in the anti-malware and ML-Sec communities, which we consider beyond the scope of this paper. However, it is important to be aware of the potential issue that adversarial attacks pose for static ML detectors, and particularly for our model.

In practice, all deployed static ML detectors that we are aware of have some susceptibility to adversarial attacks. However, most vendor products contain a variety of detection methods -- both ML-based and non-ML-based -- somewhat ameliorating the threat. In these contexts, the role of static ML detectors is to serve as an accurate filter that catches malware at scale in a manner independent of other detection components in the anti-malware stack -- not necessarily as a catch-all solution. At the time of this writing we are also unaware of widespread adversarial attacks in the wild. Thus, while the potential sensitivity to adversarial attacks is important to acknowledge, it does not preclude using our model for many production applications. Finally, the practical nature of altering the file location of a piece of malware could inherently diminish its effectiveness: as discussed in Section 1, file paths which provide the most useful information are often specifically chosen for a reason -- e.g., to evade disk scans or to increase the likelihood that a user will open an infected application. Thus, an adversary might become less effective in a malware campaign by altering these chosen file paths.

5.2. Applications in Endpoint Detection and Response (EDR)

Some organizations might consider deploying the PE+FP model on an endpoint in blocking mode to be high-risk, due to potentially unexpected behavior and concerns about adversarial manipulation.

In these cases, the PE+FP model can be moved to the backend and used in an Endpoint Detection and Response (EDR) context, augmenting a Security Operations Center (SOC) team's threat-hunting operations. Here, our PE+FP model serves as a secondary model that flags samples scoring below the FPR threshold of the deployed model as suspicious for inspection by human analysts. In these settings there is a budget on the number of samples that human analysts can inspect, since manual inspection is time consuming.

Figure 12. Precision curves for the PE-only model and the multi-view model as a function of budget, ranging from 10 to 50,000 samples. PE-only model precision is shown with the blue dotted line and multi-view model precision with the red dashed line. Note that the x-axis uses a logarithmic scale.

To evaluate the suitability of our model in an EDR setting, we use the following methodology: we threshold a trained PE model at a low FPR that reflects a typical deployment use case, convicting samples about which it is highly certain, i.e., any event for which the model output is higher than the threshold is considered a certain malicious event. Excluding these convicted samples from our analysis, we then demonstrate how the PE+FP model can improve precision in retrieving malicious events for budgets ranging from 10 up to 50,000 events, where the budget is defined as the number of samples that human analysts can manually inspect. Figure 12 compares the precision of the PE-only model and the multi-view model at different budgets in this EDR use case.
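A minimal sketch of this budgeted precision computation is shown below; the function and argument names are our own, and the deployment threshold is assumed to be the score cutoff corresponding to the chosen FPR:

```python
import numpy as np

def precision_at_budgets(pe_scores, pefp_scores, labels, deploy_threshold, budgets):
    """Rank samples not already convicted by the deployed PE model using the
    PE+FP score, then compute precision within each analyst budget."""
    pe_scores, pefp_scores, labels = map(np.asarray, (pe_scores, pefp_scores, labels))

    # Exclude samples the deployed model already convicts (score >= threshold).
    remaining = pe_scores < deploy_threshold
    scores, truth = pefp_scores[remaining], labels[remaining]

    # Sort the remaining samples by descending secondary-model suspicion.
    ranked = truth[np.argsort(-scores)]

    # Precision = fraction of true malware among the top-k flagged samples.
    return {k: ranked[:k].mean() for k in budgets}

# Example usage over the budget range reported in Figure 12:
# precisions = precision_at_budgets(pe, pefp, y, threshold_at_low_fpr,
#                                   budgets=[10, 100, 1000, 10000, 50000])
```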

The plot makes clear that the multi-view model retrieves significantly more malicious events for a given budget, especially at lower budgets. This suggests that our multi-view model could be used effectively in an EDR application: it requires no large change to existing static ML deployments, yet yields significant detection gains during threat hunting.

6. Conclusion

In this paper, we have demonstrated that deep neural network malware detectors can benefit from contextual information in the form of file paths, even when this information is not inherently malicious or benign. Adding file paths to our detection model required no additional endpoint instrumentation and provided a statistically significant improvement in the overall ROC curve throughout relevant FPR regions. The fact that we measured the performance of our models directly on a customer endpoint distribution suggests that our multi-view model can practically be deployed to endpoints, though some logistical and user interface issues might need to be addressed, since moving a file between directories could change its detection score.

One potentially powerful spin-off of our multi-view detection approach would be to use it in a behavioral detection engine, i.e., use static features along with behavioral data as auxiliary inputs. There, rather than relying on highly voluminous system calls for detection, we could potentially boost the effectiveness of a simple system that tracks only file writes, execution, and process spawning, by combining the process path, action, and target with the associated static features. Our approach could also be easily extended to a variety of other malware types and contextual sources, as well as to other security applications. For example, to detect cross-site scripting (XSS) attacks, we could use the textual HTML and JavaScript, the URL itself, and a rendered image of the website as separate inputs to a multi-view model.

The LIME analysis that we conducted in Section 4 demonstrates that the multi-view model learns to distill contextual information suggestive of actual malicious/benign concepts, not merely statistical artifacts of the dataset, though as we observed, it can learn such artifacts as well. This underscores the need for training and evaluation data that reflect deployment use cases. Interestingly, techniques like LIME have applications beyond validating whether our model has learned the proper concepts. For example, in an endpoint detection and response (EDR) context, where analytic tools allow users who are not malware/forensics experts to perform some degree of threat hunting, we would like to let users see which file paths on disk are suspicious. Doing this with a similarity comparison to known file paths could potentially reveal PII of other customers. Importance highlighting, as we illustrated in Figure 11, is a potentially powerful PII-free alternative to the nearest-neighbor approach.
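As a sketch of how such importance highlighting can be produced with the LIME text explainer (Ribeiro et al., 2016), consider the following; the `predict_proba` wrapper and the example path are hypothetical:

```python
from lime.lime_text import LimeTextExplainer

# `predict_proba` is an assumed wrapper that featurizes a batch of raw
# file path strings (with the PE features held fixed) and returns one
# [P(benign), P(malicious)] row per path.
explainer = LimeTextExplainer(class_names=["benign", "malicious"])

explanation = explainer.explain_instance(
    r"c:\users\alice\downloads\cravingexplorer\setup.exe",  # hypothetical path
    predict_proba,
    num_features=10,  # number of top tokens to attribute
)
# (token, weight) pairs; positive weights push the score toward "malicious".
print(explanation.as_list())
```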

One area that we plan to explore in future work is how to additionally utilize the large volume of already-labeled malware and benignware from intelligence feeds to boost training. Such feeds provide vastly greater malware diversity than typically exists on customer endpoints, since endpoints rarely become infected. Unfortunately, these feeds do not provide the file paths associated with actual infected files in the wild, and our current training regime assumes that both content and contextual data are always present during training. While several methods have been proposed for dealing with missing data (García-Laencina et al., 2010), it is not clear how best to apply them to file paths in our multi-view model.

Finally, our fixed-length convolutional embedding for file paths is not the only featurization scheme that we could employ. While the size of our dataset discourages training with recurrent neural networks due to lengthy training times, frameworks like PyTorch (Paszke et al., 2017) trivially support different input lengths during training – even for convolutional feed-forward models – so long as the intermediate convolution outputs are reduced to a fixed dimension before reaching the fully connected layers.
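A minimal sketch of such a variable-length convolutional path encoder, with illustrative layer sizes that are not our exact architecture, is:

```python
import torch
import torch.nn as nn

class VarLenPathEncoder(nn.Module):
    """Convolutional file path arm that accepts inputs of any length: a
    global max pool collapses the sequence dimension to a fixed size
    before the fully connected layers."""

    def __init__(self, vocab_size=128, embed_dim=32, channels=128, out_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv1d(embed_dim, channels, kernel_size=3, padding=1)
        self.fc = nn.Linear(channels, out_dim)

    def forward(self, char_ids):                  # (batch, seq_len), any seq_len
        x = self.embed(char_ids).transpose(1, 2)  # (batch, embed_dim, seq_len)
        x = torch.relu(self.conv(x))              # (batch, channels, seq_len)
        x = x.max(dim=2).values                   # global max pool -> (batch, channels)
        return torch.relu(self.fc(x))             # fixed-size path embedding

# Batches of different sequence lengths both yield fixed-size outputs:
enc = VarLenPathEncoder()
print(enc(torch.randint(0, 128, (4, 100))).shape)  # torch.Size([4, 64])
print(enc(torch.randint(0, 128, (4, 300))).shape)  # torch.Size([4, 64])
```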

Acknowledgements.
This research was funded by Sophos PLC.

References

  • Ahmadi et al. (2016) Mansour Ahmadi, Dmitry Ulyanov, Stanislav Semenov, Mikhail Trofimov, and Giorgio Giacinto. 2016. Novel feature extraction, selection and fusion for effective malware family classification. In Proceedings of the sixth ACM conference on data and application security and privacy. ACM, 183–194.
  • Anderson and Roth (2018) Hyrum S Anderson and Phil Roth. 2018. EMBER: An Open Dataset for Training Static PE Malware Machine Learning Models. arXiv preprint arXiv:1804.04637 (2018).
  • Bai and Wang (2016) Jinrong Bai and Junfeng Wang. 2016. Improving malware detection using multi-view ensemble learning. Security and Communication Networks 9, 17 (2016), 4227–4241.
  • Bickel and Scheffer (2004) Steffen Bickel and Tobias Scheffer. 2004. Multi-view clustering.. In ICDM, Vol. 4. 19–26.
  • Black et al. (2002) James Black, Tim Ellis, and Paul Rosin. 2002. Multi view image surveillance and tracking. In Workshop on Motion and Video Computing, 2002. Proceedings. IEEE, 169–174.
  • Blanz and Vetter (2003) Volker Blanz and Thomas Vetter. 2003. Face recognition based on fitting a 3d morphable model. IEEE Transactions on pattern analysis and machine intelligence 25, 9 (2003), 1063–1074.
  • Caruna (1993) R Caruna. 1993. Multitask learning: A knowledge-based source of inductive bias. In Machine Learning: Proceedings of the Tenth International Conference. 41–48.
  • Chollet et al. (2015) François Chollet et al. 2015. Keras.
  • Cootes et al. (2001) Timothy F Cootes, Gareth J Edwards, and Christopher J Taylor. 2001. Active appearance models. IEEE Transactions on Pattern Analysis & Machine Intelligence 6 (2001), 681–685.
  • Damodaran et al. (2017) Anusha Damodaran, Fabio Di Troia, Corrado Aaron Visaggio, Thomas H Austin, and Mark Stamp. 2017. A comparison of static, dynamic, and hybrid analysis for malware detection. Journal of Computer Virology and Hacking Techniques 13, 1 (2017), 1–12.
  • Drew et al. (2016) Jake Drew, Tyler Moore, and Michael Hahsler. 2016. Polymorphic malware detection using sequence classification methods. In 2016 IEEE Security and Privacy Workshops (SPW). IEEE, 81–87.
  • Du et al. (2018) Pang Du, Zheyuan Sun, Huashan Chen, Jin-Hee Cho, and Shouhuai Xu. 2018. Statistical estimation of malware detection metrics in the absence of ground truth. IEEE Transactions on Information Forensics and Security 13, 12 (2018), 2965–2980.
  • Egele et al. (2012) Manuel Egele, Theodoor Scholte, Engin Kirda, and Christopher Kruegel. 2012. A survey on automated dynamic malware-analysis techniques and tools. ACM computing surveys (CSUR) 44, 2 (2012), 6.
  • Farfade et al. (2015) Sachin Sudhakar Farfade, Mohammad J Saberian, and Li-Jia Li. 2015. Multi-view face detection using deep convolutional neural networks. In Proceedings of the 5th ACM on International Conference on Multimedia Retrieval. ACM, 643–650.
  • Faruqui and Dyer (2014) Manaal Faruqui and Chris Dyer. 2014. Improving vector space word representations using multilingual correlation. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics. 462–471.
  • García-Laencina et al. (2010) Pedro J García-Laencina, José-Luis Sancho-Gómez, and Aníbal R Figueiras-Vidal. 2010. Pattern classification with missing data: a review. Neural Computing and Applications 19, 2 (2010), 263–282.
  • Gong et al. (2014) Yunchao Gong, Qifa Ke, Michael Isard, and Svetlana Lazebnik. 2014. A multi-view embedding space for modeling internet images, tags, and their semantics. International journal of computer vision 106, 2 (2014), 210–233.
  • Gross et al. (2010) Ralph Gross, Iain Matthews, Jeffrey Cohn, Takeo Kanade, and Simon Baker. 2010. Multi-pie. Image and Vision Computing 28, 5 (2010), 807–813.
  • Hassen et al. (2017) Mehadi Hassen, Marco M Carvalho, and Philip K Chan. 2017. Malware classification using static analysis based features. In 2017 IEEE Symposium Series on Computational Intelligence (SSCI). IEEE, 1–7.
  • Hassen and Chan (2017) Mehadi Hassen and Philip K Chan. 2017. Scalable function call graph-based malware classification. In Proceedings of the Seventh ACM on Conference on Data and Application Security and Privacy. ACM, 239–248.
  • Huang and Stokes (2016) Wenyi Huang and Jack W Stokes. 2016. Mtnet: a multi-task neural network for dynamic malware classification. In International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment. Springer, 399–418.
  • Jensen et al. (2014) Rasmus Jensen, Anders Dahl, George Vogiatzis, Engin Tola, and Henrik Aanæs. 2014. Large scale multi-view stereopsis evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 406–413.
  • Jones and Viola (2003) Michael Jones and Paul Viola. 2003. Fast multi-view face detection. Mitsubishi Electric Research Lab TR-20003-96 3, 14 (2003), 2.
  • Kantchelian et al. (2015) Alex Kantchelian, Michael Carl Tschantz, Sadia Afroz, Brad Miller, Vaishaal Shankar, Rekha Bachwani, Anthony D Joseph, and J Doug Tygar. 2015. Better malware ground truth: Techniques for weighting anti-virus vendor labels. In Proceedings of the 8th ACM Workshop on Artificial Intelligence and Security. ACM, 45–56.
  • Lei Ba et al. (2016) Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer Normalization. arXiv preprint arXiv:1607.06450 (2016).
  • Li et al. (2002) Stan Z Li, Long Zhu, ZhenQiu Zhang, Andrew Blake, HongJiang Zhang, and Harry Shum. 2002. Statistical learning of multi-view face detection. In European Conference on Computer Vision. Springer, 67–81.
  • Mays et al. (2017) Mitchell Mays, Noah Drabinsky, and Stephan Brandle. 2017. Feature Selection for Malware Classification.. In MAICS. 165–170.
  • Moser et al. (2007) Andreas Moser, Christopher Kruegel, and Engin Kirda. 2007. Limits of static analysis for malware detection. In Twenty-Third Annual Computer Security Applications Conference (ACSAC 2007). IEEE, 421–430.
  • Narayanan et al. (2018) Annamalai Narayanan, Mahinthan Chandramohan, Lihui Chen, and Yang Liu. 2018. A multi-view context-aware approach to Android malware detection and malicious code localization. Empirical Software Engineering (2018), 1–53.
  • Narayanan et al. (2016) Barath Narayanan Narayanan, Ouboti Djaneye-Boundjou, and Temesguen M Kebede. 2016. Performance analysis of machine learning and pattern recognition algorithms for malware classification. In 2016 IEEE National Aerospace and Electronics Conference (NAECON) and Ohio Innovation Summit (OIS). IEEE, 338–342.
  • Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPS-W.
  • Pedregosa et al. (2011) Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. 2011. Scikit-learn: Machine learning in Python. Journal of machine learning research 12, Oct (2011), 2825–2830.
  • Perera et al. (2018) Pramuditha Perera, Mahdi Abavisani, and Vishal M Patel. 2018. In2I: Unsupervised multi-image-to-image translation using generative adversarial networks. In 2018 24th International Conference on Pattern Recognition (ICPR). IEEE, 140–146.
  • Raff et al. (2018) Edward Raff, Jon Barker, Jared Sylvester, Robert Brandon, Bryan Catanzaro, and Charles K Nicholas. 2018. Malware detection by eating a whole exe. In Workshops at the Thirty-Second AAAI Conference on Artificial Intelligence.
  • Raff et al. (2017) Edward Raff, Jared Sylvester, and Charles Nicholas. 2017. Learning the pe header, malware detection with minimal domain knowledge. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security. ACM, 121–132.
  • Ribeiro et al. (2016) Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. Why should i trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. ACM, 1135–1144.
  • Ronen et al. (2018) Royi Ronen, Marian Radu, Corina Feuerstein, Elad Yom-Tov, and Mansour Ahmadi. 2018. Microsoft malware classification challenge. arXiv preprint arXiv:1802.10135 (2018).
  • Rudd et al. (2017) Ethan Rudd, Andras Rozsa, Manuel Gunther, and Terrance Boult. 2017. A survey of stealth malware: Attacks, mitigation measures, and steps toward autonomous open world solutions. IEEE Communications Surveys & Tutorials 19, 2 (2017), 1145–1172.
  • Rudd et al. (2019) Ethan M Rudd, Felipe N Ducau, Cody Wild, Konstantin Berlin, and Richard Harang. 2019. ALOHA: Auxiliary Loss Optimization for Hypothesis Augmentation. arXiv preprint arXiv:1903.05700 (2019).
  • Rudd et al. (2016) Ethan M Rudd, Manuel Günther, and Terrance E Boult. 2016. Moon: A mixed objective optimization network for the recognition of facial attributes. In European Conference on Computer Vision. Springer, 19–35.
  • Rudd et al. (2018) Ethan M Rudd, Richard Harang, and Joshua Saxe. 2018. MEADE: Towards a Malicious Email Attachment Detection Engine. arXiv preprint arXiv:1804.08162 (2018).
  • Saxe and Berlin (2015) Joshua Saxe and Konstantin Berlin. 2015. Deep neural network based malware detection using two dimensional binary program features. In Malicious and Unwanted Software (MALWARE), 2015 10th International Conference on. IEEE, 11–20.
  • Saxe and Berlin (2017) Joshua Saxe and Konstantin Berlin. 2017. eXpose: A character-level convolutional neural network with embeddings for detecting malicious URLs, file paths and registry keys. arXiv preprint arXiv:1702.08568 (2017).
  • Saxe et al. (2018) Joshua Saxe, Richard Harang, Cody Wild, and Hillary Sanders. 2018. A Deep Learning Approach to Fast, Format-Agnostic Detection of Malicious Web Content. arXiv preprint arXiv:1804.05020 (2018).
  • Su et al. (2015) Hang Su, Subhransu Maji, Evangelos Kalogerakis, and Erik Learned-Miller. 2015. Multi-view convolutional neural networks for 3d shape recognition. In Proceedings of the IEEE international conference on computer vision. 945–953.
  • Tola et al. (2012) Engin Tola, Christoph Strecha, and Pascal Fua. 2012. Efficient large-scale multi-view stereo for ultra high-resolution image sets. Machine Vision and Applications 23, 5 (2012), 903–920.
  • Wu et al. (2004) Bo Wu, Haizhou Ai, Chang Huang, and Shihong Lao. 2004. Fast rotation invariant multi-view face detection based on real adaboost. In Sixth IEEE International Conference on Automatic Face and Gesture Recognition, 2004. Proceedings. IEEE, 79–84.
  • Xu et al. (2013) Chang Xu, Dacheng Tao, and Chao Xu. 2013. A survey on multi-view learning. arXiv preprint arXiv:1304.5634 (2013).
  • Yousefi-Azar et al. (2017) Mahmood Yousefi-Azar, Vijay Varadharajan, Len Hamey, and Uday Tupakula. 2017. Autoencoder-based feature learning for cyber security applications. In 2017 International joint conference on neural networks (IJCNN). IEEE, 3854–3861.

Appendix A

A.1. File Path Lengths

Figure A.1. File path length distribution. Note the logarithmic scale on the y-axis. The vast majority of file paths are of length 300 or less.

When selecting the input window size for our file path arm, we first examined the distribution of file path lengths in our training set. A histogram of frequencies is shown in Figure A.1. From this distribution, we initially decided to trim each file path input to its last 300 characters, which captures the vast majority of file paths. During early experimentation, however, we found that this led to lengthy training times. In the interest of reporting uncertainty margins for each model (see Section 4), which requires fitting multiple models, and due to our limited Amazon Web Services (AWS) budget, we trained two models – one taking the last 300 characters as input and another taking the last 100 – and compared them on our unfiltered test set. As shown in Figure A.2, the performance of the length-100 model is only slightly worse than that of the length-300 model. Thus, we trimmed file paths to the last 100 characters for our experiments.
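As an illustration, a preprocessing step of this kind might look as follows; the character id scheme and padding convention are assumptions for the sketch, not our exact pipeline:

```python
def encode_path(path, max_len=100, vocab_size=128):
    """Keep the last `max_len` characters of a file path and map each
    character to an integer id (its codepoint, capped at vocab_size - 1),
    left-padding short paths with 0 so every input has the same length."""
    tail = path.lower()[-max_len:]
    ids = [min(ord(c), vocab_size - 1) for c in tail]
    return [0] * (max_len - len(ids)) + ids

print(len(encode_path(r"c:\users\alice\appdata\local\temp\setup.exe")))  # 100
```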

Figure A.2. Performance comparison of a FP model trained using an input window size of 300 vs. an input size of 100. The length 300 model takes substantially longer to train and yields only slightly better performance.

A.2. LIME for the Unfiltered Test Set

(a) Increase (Malicious)
Token                           Weight
48bc9c40206c...bb5ebab2b3ea     26.115
1b62d0d9813d...2748ab0131a7     25.659
1764c2d644e6...60faebdb50d1     17.801
3bd39229b7ad...fc8c544b9981     15.978
25689805bb60...b1939907e6f9     15.967
d108e027c5d1...45e3820e79f2     15.505
6bae1743e31f...48f09f0bc49c     14.718
63bbd56b2099...9d9032cfad61     13.647
f72f6b477b35...c6dd77700546     13.639
ac255cc64451...7e9fa7d8a74f     13.292

(b) Decrease (Malicious)
Token                           Weight
glwnv                           -7.503
ghy6n                           -7.214
crashreports                    -6.962
[xxxxxxxxx]...a20d40fe645d      -6.789
dll                             -5.928
[xxxx]engineeringtools          -5.734
welivzoqf3uils                  -5.711
glx9azsenh1mgt                  -5.576
part                            -4.871
5iofk3xeixypt2                  -4.833

(c) Increase (Benign)
Token                           Weight
cravingexplorer                 13.464
bundled                         9.034
auto~system~care                6.431
auscsetup                       5.048
ubzobezzie4db                   5.018
uxu5gqh                         4.895
fla4476                         4.89
jvuqb6pzvtju2g1                 4.883
ekvzxxm                         4.608
85qpk7evakbqzeo                 4.372

(d) Decrease (Benign)
Token                           Weight
msi8a92                         -15.498
robloxplayerlauncher            -12.347
ultimate                        -8.59
debugview++                     -7.885
registerdll                     -7.424
digiarty                        -7.375
steganos                        -7.221
updates                         -7.053
zpsiohortfa                     -6.421
fslib                           -5.944
Table A.1. Tokens and responses from our LIME analysis of the unpruned dataset which (a) most increased the model's score for malicious samples, (b) most decreased the model's score for malicious samples, (c) most increased the model's score for benign samples, and (d) most decreased the model's score for benign samples. Characters in the middle of lengthy tokens were replaced with "..." for readability; all of these lengthy strings contain hex digests. Malicious samples were selected as those from the unpruned data set for which the ground truth label was malicious and the calibrated score from the PE+FP model increased the most above the calibrated score from the PE model; benign samples were similarly selected according to the most significant drop in calibrated score. Tokens shown as [xxxx] have been redacted for review and PII reasons.

A.3. Performance Evaluation on the Unfiltered Test Set

Figure A.3. Mean ROC curves and standard deviations from evaluation on the pruned (solid lines) and unpruned (dotted lines) test sets for comparison. PE+FP model results are shown in red, PE model results in blue, and FP model results in green. Means and standard deviations were evaluated over five random network initializations.
FPR
PE+FP TPR                     0.411 ± 0.072   0.626 ± 0.006   0.749 ± 0.003   0.938 ± 0.003   0.984 ± 0.005
PE TPR                        0.135 ± 0.063   0.304 ± 0.070   0.570 ± 0.008   0.885 ± 0.009   0.975 ± 0.007
FP TPR                        0.091 ± 0.063   0.365 ± 0.018   0.609 ± 0.003   0.768 ± 0.001   0.945 ± 0.002
% Detection Error Reduction   31.9            46.3            41.7            46.1            35.5
Table A.2. Top: mean and standard deviation of true positive rates (TPRs) from evaluating our trained models on the unpruned test set at false positive rates (FPRs) of interest. Results were aggregated over five training runs with different weight initializations and minibatch orderings. The best results at every FPR consistently occurred when using both feature vectors from the file content and the contextual file path as inputs (PE+FP). Bottom: percentage reduction in mean detection error achieved by the PE+FP model in comparison to the baseline content-only (PE) model.

A.4. Baseline Model Diagrams

Figure A.4. The PE model. Each of the unlabeled blocks contains a fully connected layer, followed by Layer Normalization and a Dropout layer.
Figure A.5. The FP model. Each of the unlabeled blocks contains a fully connected layer, followed by Layer Normalization and a Dropout layer.