The Shape of Alerts: Detecting Malware Using Distributed Detectors by Robustly Amplifying Transient Correlations

03/01/2018 ∙ by Mikhail Kazdagli, et al. ∙ The University of Texas at Austin

We introduce a new malware detector - Shape-GD - that aggregates per-machine detectors into a robust global detector. Shape-GD is based on two insights: 1. Structural: actions such as visiting a website (waterhole attack) by nodes correlate well with malware spread, and create dynamic neighborhoods of nodes that were exposed to the same attack vector. However, neighborhood sizes vary unpredictably and require aggregating an unpredictable number of local detectors' outputs into a global alert. 2. Statistical: feature vectors corresponding to true and false positives of local detectors have markedly different conditional distributions - i.e. their shapes differ. The shape of neighborhoods can identify infected neighborhoods without having to estimate neighborhood sizes - on 5 years of Symantec detectors' logs, Shape-GD reduces false positives from 1M down to 110K and raises alerts 345 days (on average) before commercial anti-virus products; in a waterhole attack simulated using Yahoo web-service logs, Shape-GD detects infected machines when only 100 of 550K are compromised.




1 Introduction

Figure 1: (L to R) Each circle is a node that runs a local malware detector (LD). Our goal is to create a robust global detector (GD) from weak LDs. We observe that nodes naturally form neighborhoods based on attributes relevant to attack vectors – e.g., all client devices that visit a website W within the last hour belong to the same neighborhood. We propose a new GD that groups together suspicious local feature vectors based on neighborhoods – traditional GDs only analyze local alerts, while we re-analyze the feature vectors that led to the alerts. Our GD then exploits a new insight – the conditional distribution of true positive feature vectors differs from that of false positive feature vectors – to robustly classify neighborhoods as malicious.

Behavioral detectors are a crucial line of defense against malware. By extracting features out of network packets [37, 58, 64, 76], system calls [23, 52, 41, 54, 63, 71, 36], instruction set [39, 27], and hardware [30, 66, 45] level actions, behavioral detectors train machine learning algorithms to classify program binaries and executions as either malicious or benign. In practice, enterprises extensively deploy behavioral detectors as per-machine local detectors whose alerts are analyzed by an enterprise-wide global detector [2, 3, 11, 8, 33, 9]. Our goal is to design a robust global detector that composes weak local detectors in a noisy community.

Behavioral detectors are weak – i.e., have high false positives and negatives – because a large class of malware includes benign-looking behaviors, such as encrypting users’ data, use of obfuscated code, or making HTTP requests. Further, machine learning-based detectors have been shown to be susceptible to evasion attacks [74, 65, 57] that either increase false negatives or force detectors to output more false positives. In practice, global detectors in enterprises with 100K local detectors have to process millions of alerts per day [4] which stresses heavy-weight program analyses and human analysts who investigate the final alerts [7].

Furthermore, local detector communities are noisy, where local machines often fail to report alerts or report them (often, months) late [18, 1]. This noise is because machines often go out of network access, users decline to send reports, etc. Enterprise settings are also noisy because attacks might target local machines in unpredictable ways – in a ‘waterhole’ attack [17] (where a compromised webpage spreads malicious code to machines in the enterprise), a malicious javascript-advertisement might be targeted by an ad-broker to only a fraction of visitors to a set of webpages; the specific exploit might only succeed on a small fraction of recipient machines because of browser versions or patching status or human-user actions, etc.

Challenges for prior global detectors. Boosting weak detectors using purely machine learning techniques is challenging. The dominant approaches are (a) clustering: combine feature vectors using some distance metric to identify suspicious clusters of feature vectors [72, 75, 77, 55], and (b) counting: train local detectors (LDs) such as Random Forest or gradient-boosted trees to generate local alerts, and generate a global alert if there is a significant fraction of local alerts in the enterprise [29, 38, 37, 60]. Both approaches have limitations that force enterprises to deploy brittle rule-sets that explicitly correlate local detector alerts.

Clustering algorithms are well known to be highly sensitive to noise, especially in the high-dimensional regime [31, 42, 73]. Indeed, classical approaches that attempt to detect or to score the "outlyingness" of points (e.g., Stahel-Donoho outlyingness, Mahalanobis distance, minimum volume ellipsoid, minimum covariance determinant, etc.) are fundamentally flawed in the high-dimensional regime (i.e., they theoretically cannot guarantee correct detection with high probability). In Appendix A, we demonstrate how a clustering global detector is ineffective in detecting a waterhole infection – i.e., clustering yields an Area Under Curve (AUC) metric of only 48% in the waterhole attack.

Count-based global detectors (Count-GD), on the other hand, suffer because they need to know the size of local detector communities extremely accurately to determine whether a significant fraction is raising alerts. Fundamentally, even small errors in estimating the number of feature vectors in the community linearly affects the global detector’s decision thresholds.
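The sensitivity of a count-based rule to the assumed community size can be illustrated with a small numeric sketch. The rule below (raise a global alert when the number of local alerts exceeds a fixed multiple of the expected false positives), the `margin` value, and all counts are hypothetical and chosen only for illustration; they are not taken from our experiments.

```python
def count_gd_alert(num_alerts, assumed_size, fp_rate, margin=1.5):
    """Hypothetical count-based rule: alert when the number of local alerts
    exceeds `margin` times the false positives expected for `assumed_size`."""
    threshold = margin * fp_rate * assumed_size  # grows linearly with the size estimate
    return num_alerts > threshold


FP_RATE = 0.05          # assumed local false positive rate
BENIGN_ALERTS = 500     # alerts from a benign community of 10,000 feature vectors
INFECTED_ALERTS = 800   # the same community with 300 additional true positives

# Correct size estimate (threshold of about 750): both cases are classified correctly.
print(count_gd_alert(BENIGN_ALERTS, 10_000, FP_RATE))
print(count_gd_alert(INFECTED_ALERTS, 10_000, FP_RATE))

# Size underestimated by 40% (threshold of about 450): the benign community raises a false alarm.
print(count_gd_alert(BENIGN_ALERTS, 6_000, FP_RATE))

# Size overestimated by 10% (threshold of about 825): the infection is missed.
print(count_gd_alert(INFECTED_ALERTS, 11_000, FP_RATE))
```

In this sketch the decision threshold scales directly with the size estimate, so an estimation error of a few tens of percent is enough to flip the global decision in either direction.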

Proposed Ideas – Neighborhood filtering and Shape. Our intuition is that weak local detectors can be aggregated robustly by using information about how the malware spreads. Our proposed system (Figure 1), Shape-GD, relies on two key insights to correctly identify malicious feature vectors.

First, while attacks can take many forms, attack vectors are easier to identify. For example, many attacks on Symantec’s client machines rely on ‘downloader trojans’ to bring successive stages of payloads – hence, downloader graphs [49] on a machine are correlated with malware propagation. Similarly, in a firewalled enterprise, machines that visit a specific server (in watering hole attacks) are more likely to be compromised than a random machine in the enterprise. Our key assumption is that machines that have been exposed to a common attack vector have correlated alerts – we call such a set of machines a neighborhood. Grouping local detectors into neighborhoods (as they form dynamically) concentrates the signal of malware activity that is otherwise not visible at the overall community level. However, neighborhoods are extremely noisy due to exploit-types, machine status, and human usage and render cluster and count-based GDs ineffective – hence we propose Shape-GD to aggregate local detectors’ outputs.

The second insight behind Shape-GD is that the distributional shape of a set of suspicious feature vectors can robustly separate true positive neighborhoods from false positive neighborhoods. Shape-GD analyzes only those feature vectors that cause alerts by the local detectors (alert-FVs) instead of analyzing all feature vectors. Alert-FVs thus represent draws from one of two conditional distributions – i.e., distribution of malicious or benign feature vectors conditioned on being labeled as malicious – which are similar but not the same. Next, while a single suspicious feature vector is uninformative, a set of such feature vectors (i.e., alert-FVs from a neighborhood) can indeed be tested to come from one of two similar-but-distinct distributions.

Case Studies. We consider two distinct case studies where Shape-GD is applied in noisy communities of weak behavioral detectors – one with long-term log entries from a commercial detector, and the other a real-time attack simulated using enterprise traces.

Our first setting comprises 5 million client machines monitored by malware detectors (here, Symantec [18]). A local detector algorithm [53] that analyzes file attributes using VirusTotal, when applied to this Symantec Wine dataset [18], achieves a false positive rate of 5% – with 5 million local detectors in place, this requires deeper human or program analysis of up to 1.1M files to detect close to 137K malware files. A recent local detector improves false positive rates down to 1% by training on metadata, such as features extracted from ‘downloader graphs’ [49, 50], but this increases false negatives since it only detects malicious downloaders (that install malware on devices), which comprise only 32.7% of the overall malware in the community.

Our second setting is an enterprise whose devices are infected through a compromised server (waterhole attack), where each device also runs a local system-call based malware detector [23] and sends reports to a global detector. We reimplemented system call based local detectors to achieve representative detection rates [23] – where a true positive rate of 92.4% yields a false positive rate of 6%.

We show that in the Symantec Wine case study Shape-GD detects malicious neighborhoods early – those with more than 5% of malicious files – at a false positive rate of 5.8% and a true positive rate of 84%. At the file level, it achieves a 0.54% false positive rate and a 78% true positive rate. In the waterhole case study, it detects malicious neighborhoods with less than 1.1% compromised nodes per neighborhood at a false positive rate of 1% and a true positive rate of 100%.

Neighborhood filtering and the shape property complement each other – neighborhoods concentrate the weak signal into a small but unpredictable set of feature vectors, while shape extracts this signal without knowing the precise number of feature vectors. In contrast, our experiments show that when applied to noisy neighborhoods, Count-GD’s detection performance only matches Shape-GD’s detection performance if it can estimate neighborhood size to within -30% to +1% for the Symantec case study and -0.1% to +13.8% in the waterhole attack – this makes Count-GD extremely fragile in real-world distributed systems. To summarize, Shape-GD enables practitioners’ insights about attack vectors to be captured algorithmically and at scale.

2 Overview of Shape-GD

Threat model and Deployment. We assume a standard threat model where trusted local detectors (LDs) at each machine communicate with a trusted global detector (GD) that receives alerts and other metadata from the local detectors. The LDs are isolated from untrusted applications on local machines using OS- (e.g., SELinux) and hardware mechanisms (e.g., ARM TrustZone), and communicate with the enterprise’s GD through an authenticated channel.

Shape-GD fits deployment models that are common today. Currently, enterprises use SIEM tools (like HP Arcsight and Splunk) to monitor network traffic and system/application logs, and malware analysis sandboxes that scan emails for malicious links and attachments, in addition to host-based malware detectors (LDs) from Symantec, McAfee, Lookout, etc. We use exactly this side information – from network logs (client IP, server IP, timestamp) and email monitoring tools – to instantiate neighborhoods and filter LDs’ alert-FVs based on neighborhoods (Algorithms 1 and 2). Upon receiving alert-FVs, Shape-GD runs its malware detection algorithm (Algorithm 3) for all neighborhoods the alert-FVs belong to. If a particular neighborhood is suspicious, then Shape-GD notifies a downstream analysis (deeper static/dynamic analyses or human analysts) and forwards relevant information in the incident report.

Inferring neighborhoods from common attack vectors. Shape-GD operates over dynamic neighborhoods, which are updated once per neighborhood time window (NTW). A neighborhood within a large community is a set of nodes that share a statically defined action attribute within the current time window – this allows an analyst to create neighborhoods of nodes based on common attack vectors. Below are some illustrative examples of communities and neighborhoods.

1. Malware propagation across Symantec clients. The community here consists of all Symantec clients. Attackers who distribute malware through compromised websites may not intend to target Symantec clients specifically, but these machines still get infected because of the large number of subscribers to Symantec’s malware detection service. In the Symantec dataset, both benign and malicious files launch a chain of downloads. Thus, a neighborhood can comprise a set of files transitively downloaded from a suspicious domain (Section 4.2). As domains get periodically cleared out, and their classification is not necessarily very robust, neighborhoods only indicate a probability that files within them may be malicious.

2. Waterhole attack. The community here consists of the employees of an enterprise such as Anthem Health [13, 14]. In a waterhole attack, adversaries compromise a website commonly visited by such employees as a way to infiltrate the enterprise network and then spread within the network to a privileged machine or user. Within this community, a neighborhood can be the set of nodes that visited the same type of websites within the current neighborhood time window (for example, some percentile of suspicious links rated by VirusTotal [15] or SecureRank [12]). Since these rankings themselves are fuzzy, and the websites and their contents are dynamic, neighborhoods only indicate a probability that the node was actually exposed to an exploit.

Intuition behind shape property.

Figure 2: (Shape of conditional distributions) (L to R) Probability density functions (PDFs) of benign/malicious feature vectors (FVs) in a stylized example, which are drawn from Gaussians with mean ‘-1/+1’. PDFs of the same Gaussians, but now conditioned on a local detector raising an alert – the PDFs of true/false positive FVs have different shapes. Real-world PDFs of true/false positive multi-dimensional FVs projected on the first two principal components in the Symantec Wine and waterhole case studies respectively.

The statistical shape of local detectors’ false positives (FP conditional distribution) differs from the corresponding shape for true positives (TP conditional distribution) – we use this property to aggregate LDs’ alert-FVs to find the shape of each neighborhood and then classify neighborhoods based on their shapes.

The central question then is – why do true- and false-positive FVs’ shapes differ? To explain this and set the stage for Shape-GD, we consider a stylized statistical inference example. Suppose that we have an unknown number of nodes within a neighborhood. We want to distinguish between two extremes – all nodes only run benign applications (benign hypothesis), or all nodes run malware (malware hypothesis). We look at a single snapshot of time where each node generates exactly one feature vector. Under the benign hypothesis, assume that the feature vector from each node is a (scalar valued) sample from a unit-variance Gaussian with mean ‘-1’; alternatively, it is a sample from a unit-variance Gaussian with mean ‘+1’ under the malware hypothesis (Figure 2, leftmost plot). The optimal local detector at any machine would declare ‘malware’ if a sample’s value is positive, and ‘benignware’ otherwise.

Even though individual false and true positives are indistinguishable at the local detector level, we can differentiate between them by approximating the distributions they come from. To do this, we need to aggregate alerts at the neighborhood level. The values of the alerting samples represent independent draws from a conditional distribution – either the distribution of a normal random variable of mean ‘-1’ conditioned on taking a nonnegative value (false positives), or the distribution of a normal random variable of mean ‘+1’ conditioned on taking a nonnegative value (true positives). This conditioning occurs because the local detector tags a sample as an alert if and only if the sample drawn is nonnegative. Thus, irrespective of the size of the neighborhood, the global detector would “look at the shape” of the empirical distribution of the received FVs. If it is “closer” to the distribution of false positives than to the distribution of true positives (Figure 2, second from the left), it would declare the neighborhood to be “benign”, otherwise “malicious”.
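The stylized example can be reproduced with a short simulation. The sketch below is illustrative only: the sample sizes, the quantile-based approximation of the one-dimensional Wasserstein distance, and the helper names (`alert_samples`, `w1`, `classify`) are our assumptions and not the Shape-GD implementation.

```python
import random


def alert_samples(mean, n, rng):
    """Draw n unit-variance Gaussian samples and keep only those the
    local detector flags as alerts (value >= 0)."""
    return [x for x in (rng.gauss(mean, 1.0) for _ in range(n)) if x >= 0]


def quantiles(xs, k=100):
    """k evenly spaced empirical quantiles of xs (midpoint rule)."""
    s = sorted(xs)
    return [s[int((i + 0.5) * len(s) / k)] for i in range(k)]


def w1(xs, ys, k=100):
    """Approximate 1-D Wasserstein-1 distance as the mean absolute gap
    between matching empirical quantiles."""
    return sum(abs(a - b) for a, b in zip(quantiles(xs, k), quantiles(ys, k))) / k


def classify(alerts, fp_ref, tp_ref):
    """Label a neighborhood by whichever reference shape its alerts are closer to."""
    return "malicious" if w1(alerts, tp_ref) < w1(alerts, fp_ref) else "benign"


rng = random.Random(0)
fp_ref = alert_samples(-1.0, 100_000, rng)  # reference shape of false positives
tp_ref = alert_samples(+1.0, 100_000, rng)  # reference shape of true positives

# Neighborhoods of very different (and unknown to the classifier) sizes.
small_benign = alert_samples(-1.0, 5_000, rng)
large_benign = alert_samples(-1.0, 50_000, rng)
small_infected = alert_samples(+1.0, 1_000, rng)

for name, nbd in [("small benign", small_benign),
                  ("large benign", large_benign),
                  ("small infected", small_infected)]:
    print(name, len(nbd), classify(nbd, fp_ref, tp_ref))
```

Note that `classify` never uses the neighborhood size: the small infected neighborhood produces roughly as many alerts as the small benign one, yet the two are separated by shape alone.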

Shape in real datasets. Though in the stylized one-dimensional example it is straightforward to distinguish between benign and malicious neighborhoods, real-world multidimensional distributions in the Symantec Wine and waterhole case studies do not allow such a simple interpretation. Figure 2 (two plots on the right) shows conditional distributions of false and true positives projected on the first two principal components in the Symantec Wine and waterhole case studies respectively. We show that the intuition behind this simple example scales to real malware detectors that use high-dimensional feature vectors. However, to use this insight in practice, we need to address two issues: (i) while the corresponding conditional distributions are visually distinct, an algorithmic approach requires a quantitative score function to separate the (vector-valued) conditional distributions generated from feature vector samples; and (ii) the global detector receives only finitely many samples; thus, we can construct (at best) only a noisy estimate of the conditional distribution. We describe Shape-GD’s details in Section 4.

3 Related Work

3.1 Behavioral analysis

Behavioral analysis refers to statistical methods that monitor signals from program execution, extract features and build models from these signals, and then use these models to classify processes as malicious. Importantly, as we discuss in this section, all known behavioral detectors have a high false positive and negative rate (especially when zero-day and mimicry attacks are factored in).

System calls and middleware API calls have been studied extensively as a signal for behavioral detectors [34, 70, 19, 26, 36, 23, 59]. Network intrusion detection systems [58] analyze network traffic to detect known malicious or anomalous behaviors. More recently, behavioral detectors use signals such as power consumption [28], CPU utilization, memory footprint, and hardware performance counters [30, 66].

Detectors then extract features from these raw signals. For example, an n-gram is a contiguous sequence of n items that captures total order relations [38, 23], n-tuples are ordered events that do not require contiguity, and bags are simply histograms. These can be combined to create bags of tuples, tuples of bags, and tuples of n-grams [23, 34], often using principal component analysis to reduce dimensions. Further, system calls with their arguments form a dependency graph structure that can be compared to sub-graphs that represent malicious behaviors [26, 47, 19].

Finally, detectors train models to classify executions into malware/benignware using supervised (signature-based) or unsupervised (anomaly-based) learning. These models range from distance metrics, histogram comparison, hidden Markov models (HMMs), and neural networks (artificial neural networks, fuzzy neural networks, etc.), to more common classifiers such as kNN, one-class SVMs, decision trees, and ensembles thereof.

Such machine learning models, however, result in high false positives and negatives. Anomaly detectors can be circumvented by mimicry attacks where malware mimics the system calls of benign applications [70] or hides within the diversity of benign network traffic [64]. Sommer et al. [64] additionally highlight several problems that can arise due to overfitting a model to a non-representative training set, suggesting signature-based detectors as the primary choice for real deployments. Unfortunately, signature-based detectors cannot detect new (zero-day) attacks. On Android, both system-call [22] and hardware-counter based detectors [30] yield 20% false positives and 80% true positives.

Finally, with their ability to extract highly effective features, deep nets may provide a new way forward for creating novel behavioral detectors. At the global level, however, what is needed is a data-light approach for global detection by composing local detectors, tailored to be agile enough to do global detection in a fast-changing (non-stationary) environment.

3.2 Collaborative Intrusion Detection Systems (CIDS)

Collaborative intrusion detection systems (CIDS) provide an architecture where LDs’ alerts are aggregated by a global detector (GD). GDs can be signature-based, anomaly-based [76, 68], or a combination of the two [48] when generating global alerts. Additionally, the CIDS architecture can be centralized, hierarchical, or distributed (using a peer-to-peer overlay network) [76].

In all cases, existing GDs use some variant of either clustering or count-based algorithms to aggregate LDs’ alerts. A count-based GD raises an alert once the number of alerts exceeds a threshold within a space-time window, while a clustering-based GD may apply some heuristics to control the number of alerts [29, 72, 38, 37, 60]. In HIDE [76], the global detector at each hierarchical tier is a neural network trained on network traffic information. Worminator [51] additionally uses Bloom filters to compact LDs’ outputs and schedules LDs to form groups in order to spread alert information quickly through a distributed system. All count- and clustering-based algorithms are fragile when the noise is high (in the early stages of an infection) and when the network size is uncertain. In contrast, our neighborhood filtering and shape-based GD is robust against such uncertainty.

Note that distributed CIDSs are vulnerable to probe-response attacks, where the attacker probes the network to find the location and defensive capabilities of an LD [62, 21, 61]. These attacks are orthogonal to our setting since we do not have fixed LDs (i.e. all nodes are LDs).

4 Shape-GD Algorithm

Figure 3: Application of Shape-GD to malware detection in Symantec Wine and waterhole case studies.

The algorithm consists of feature extraction, local detectors (LDs), and the global detector (GD). Figure 3 shows how to apply Shape-GD to malware detection in the Symantec Wine and waterhole case studies. Our key innovations are in the GD. The LDs’ design is inspired by prior work; therefore, we discuss it in detail in Appendix B and briefly summarize the LDs’ detection performance in Sections 6 and 7.

4.1 Shape-GD Classifiers

Shape-GD utilizes two types of local detectors that analyze executable files and a domain name classifier that analyzes domain metadata. To perform file analysis, we adapt local detectors from prior work. The Symantec Wine dataset lacks executable files (it includes only their hashes); therefore, we use VirusTotal [15] file analysis reports. Our detector combines the feature extraction described in prior work [53] and a standard machine learning classifier – XGBoost [24]. In the waterhole case study, we develop a detector that extracts feature vectors from dynamic traces of executed system calls and uses the Random Forest algorithm for classification. Overall, it achieves performance comparable to the best classifier from a prior survey [23].

To use Shape-GD, a human analyst needs to supply a description of neighborhood attributes. These can be as simple as a list of high-value servers (waterhole case study), or they can be derived using a machine learning algorithm. In the Symantec Wine case study, we use a domain name classifier, which consumes VirusTotal domain reports as input, to detect suspicious domains that are used to form neighborhoods.

4.2 Neighborhood Instances from Attack-Templates

Within each neighborhood time window (NTW), Shape-GD generates neighborhood instances based on statically defined attack vectors – each attack vector is a “template” to generate neighborhoods with. The goal of partitioning data into neighborhoods is to create predominantly benign or predominantly malicious neighborhoods. The partitioning algorithm runs once per NTW. Because the attack vectors differ, the partitioning algorithm is radically different across the two case studies.

Symantec Wine. Algorithm 1 partitions downloaded files into multiple neighborhoods. It uses the following intuition: if a domain is malicious, then the files transitively downloaded from such a domain are likely to be malicious.

For ease of explanation, we treat the previously introduced domain name classifier as a predicate (line 1). Each iteration starts by identifying the set of suspicious domains within the current NTW (lines 4–5) using the domain name classifier. The algorithm then uses each suspicious domain as a seed to initiate the neighborhood formation process (lines 6–12). For each suspicious domain, it searches for the files within the current NTW that access that particular domain, either by downloading other files from it or by being downloaded from it (line 7). By following downloader graph edges, the algorithm selects the files transitively downloaded by the files in this set (line 10) and filters out those that do not access any of the suspicious domains (line 11). The files that have not been excluded are added to the current neighborhood (line 12).

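Input : Downloader graphs
Output : Neighborhoods
1  Let DNC(d): domain d is malicious    // domain name classifier
   // execute once per NTW
2  while True do
       // create an empty list of neighborhoods
3      neighborhoods ← empty list
       // identify active domains within the current NTW
4      active domains ← domains accessed within the current NTW
       // identify suspicious domains
5      suspicious domains ← {d in active domains : DNC(d)}
6      foreach suspicious domain d do
           // identify files accessing the domain
7          seed files ← files accessing the domain d
           // initialize an empty neighborhood
8          neighborhood ← empty set
9          foreach file f in seed files do
               // search for transitively downloaded files
10             candidates ← files transitively downloaded by f
               // retain suspicious files
11             retained ← {c in candidates : c accesses a suspicious domain}
12             add retained to neighborhood
Algorithm 1 Symantec Wine: Neighborhoods from Attack-Vectors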
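The steps of Algorithm 1 for a single NTW can be sketched in Python as follows. This is a minimal illustration under stated assumptions: download events are given as hypothetical `(parent, child, domain)` tuples, the domain name classifier is passed in as a plain predicate, and the example file and domain names are made up.

```python
from collections import defaultdict


def form_neighborhoods(downloads, is_suspicious):
    """downloads: (parent_file, child_file, domain) edges seen in the current NTW.
    is_suspicious: predicate standing in for the domain name classifier.
    Returns a dict mapping each suspicious domain to its neighborhood of files."""
    downloads = list(downloads)
    children = defaultdict(set)    # file -> files it downloaded
    domains_of = defaultdict(set)  # file -> domains it accessed (as downloader or download)
    for parent, child, domain in downloads:
        children[parent].add(child)
        domains_of[parent].add(domain)
        domains_of[child].add(domain)

    active = {domain for _, _, domain in downloads}
    suspicious = {d for d in active if is_suspicious(d)}

    neighborhoods = {}
    for d in suspicious:
        seeds = {f for f, ds in domains_of.items() if d in ds}
        nbd = set()
        for f in seeds:
            # Depth-first walk over downloader-graph edges starting at f.
            stack, reached = list(children.get(f, ())), set()
            while stack:
                g = stack.pop()
                if g not in reached:
                    reached.add(g)
                    stack.extend(children.get(g, ()))
            # Retain only files that access at least one suspicious domain.
            nbd |= {g for g in reached if domains_of.get(g, set()) & suspicious}
        neighborhoods[d] = nbd
    return neighborhoods


# Made-up example: one suspicious domain and one unrelated benign download.
events = [
    ("a.exe", "b.exe", "evil.example"),
    ("b.exe", "c.exe", "cdn.example"),
    ("b.exe", "d.exe", "evil.example"),
    ("x.exe", "y.exe", "good.example"),
]
result = form_neighborhoods(events, lambda d: d == "evil.example")
print(result)
```

In this example `c.exe` is reachable from the seed files but is dropped by the filtering step because it never touches a suspicious domain.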

Note that the neighborhood formation process may generate many small neighborhoods. An estimate of the conditional distribution using the feature vectors of such a neighborhood (Section 4.3) is usually susceptible to high variance; thus, neighborhoods containing an insufficient number of files may have a negative impact on the accuracy of the neighborhood classifier (Section 4.3). To reduce variance and achieve robust classification of neighborhoods, the algorithm merges them such that the final neighborhoods are larger than some predefined minimum size. Empirical analysis of the accuracy of the neighborhood classifier shows that it achieves robust classification of neighborhoods containing more than 1000 files.

In order to maintain the neighborhood effect after merging, i.e., to keep neighborhoods mostly homogeneous (either benign or malicious), the merging algorithm ranks neighborhoods in terms of maliciousness, where the malicious score is defined as the relative number of LDs’ alerts within a neighborhood. The algorithm sorts neighborhoods by their malicious score and merges them in decreasing order of score. Note that the malicious score may be inaccurate if we incorrectly estimate the neighborhood size, but Shape-GD tolerates such errors.
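One way to realize this merging step is a greedy pass over the score-sorted neighborhoods, sketched below. The representation of a neighborhood as a `(num_files, num_alerts)` pair, the greedy grouping, and the handling of an undersized tail are our assumptions for illustration; the text above only fixes the score definition and the merge order.

```python
def merge_neighborhoods(neighborhoods, min_size):
    """neighborhoods: list of (num_files, num_alerts) pairs.
    Returns groups of neighborhood indices; each group is merged into one
    neighborhood of at least min_size files where possible."""
    def score(i):
        files, alerts = neighborhoods[i]
        return alerts / files  # relative number of LD alerts

    order = sorted(range(len(neighborhoods)), key=score, reverse=True)

    groups, current, current_size = [], [], 0
    for i in order:
        current.append(i)
        current_size += neighborhoods[i][0]
        if current_size >= min_size:
            groups.append(current)
            current, current_size = [], 0

    if current:  # undersized tail: fold it into the last group (assumption)
        if groups:
            groups[-1].extend(current)
        else:
            groups.append(current)
    return groups


# Made-up sizes and alert counts; scores are 0.5, 0.05, 0.4, 0.06, 0.04.
nbds = [(400, 200), (300, 15), (700, 280), (600, 36), (500, 20)]
groups = merge_neighborhoods(nbds, min_size=1000)
print(groups)
```

Because neighbors in the sorted order have similar scores, the two high-score neighborhoods end up together and the three low-score ones form a second, mostly benign group.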

Waterhole. The algorithm (Algorithm 2) to form a neighborhood to detect a waterhole attack significantly differs from the one used in the Symantec Wine experiment. It creates a neighborhood from client machines that access a server or a group of servers within a neighborhood time window.

Input : Network flow data
Output : Neighborhoods
1  Let predicate(A: Client, B: Servers) := A accesses B
   // execute once per NTW
2  while True do
       // create an empty list of neighborhoods
3      neighborhoods ← empty list
4      A := client machines*
5      B := accessed servers*
       // partitioning a set into non-disjoint sets to incorporate structural filtering
6      partitions ← partition-set(B), where B = union of all partitions
7      foreach partition P do
           // form neighborhoods using partitions
8          neighborhood ← {a in A : predicate(a, P)}
9          add neighborhood to neighborhoods
*active within the time window NTW
Algorithm 2 Neighborhoods from Attack-Vectors

To abstract away from technical details, we define the predicate (line 1), which is true if a client accesses a server. Each iteration starts by defining the set of client machines A that are active within the current NTW and the set of servers B that those clients access within the NTW (lines 4–5). Then the algorithm partitions the set B into one or more disjoint subsets (line 6). This incorporates ‘structural filtering’ into the algorithm, allowing an analyst to create neighborhoods based on subsets of servers (instead of all servers, as in the waterhole case). Structural filtering boosts detection under certain conditions (Appendix G). The neighborhood instantiation algorithm builds a neighborhood for each partition (line 8) and, finally, adds the newly formed neighborhood to the list of neighborhoods (line 9).
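A compact Python sketch of this neighborhood instantiation for a single NTW is shown below. The `(client, server)` flow tuples, the analyst-supplied list of server partitions, and the example host names are assumptions made for illustration.

```python
def waterhole_neighborhoods(flows, server_partitions):
    """flows: (client, server) pairs observed in the current NTW.
    server_partitions: list of sets of servers supplied by the analyst.
    Returns one neighborhood (a set of clients) per server partition."""
    flows = list(flows)
    neighborhoods = []
    for partition in server_partitions:
        # A client joins the neighborhood if it accessed any server in the partition.
        nbd = {client for client, server in flows if server in partition}
        neighborhoods.append(nbd)
    return neighborhoods


# Made-up flows: two servers grouped together, one server on its own.
flows = [("c1", "s1"), ("c2", "s1"), ("c2", "s2"), ("c3", "s3")]
nbds = waterhole_neighborhoods(flows, [{"s1", "s2"}, {"s3"}])
print(nbds)
```

Passing a single partition that contains every server yields one neighborhood with all active clients, which corresponds to running without structural filtering.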

4.3 Shape Property for Malware Detection

After identifying neighborhoods, the next step is to detect neighborhoods with a high malware concentration. To accomplish this, we introduce a novel approach to extracting neighborhood features that formalizes the shape property.

The key algorithmic idea is to map all alert-FVs within a neighborhood to a single vector-histogram which robustly captures the neighborhood’s statistical properties. Such transformation allows us to analyze the joint properties of all alert-FVs generated within a neighborhood without requiring FVs to be clustered or alerts to be counted. After that, Shape-GD feeds neighborhood-level feature vectors into a binary classifier to identify malicious neighborhoods. We use two types of binary classifiers: boosted decision trees in the Symantec Wine case study and a Wasserstein distance-based threshold test in the waterhole experiment.

Generating a vector-histogram from alert-FVs. The algorithm aggregates L-dimensional projections of alert-FVs on a per-neighborhood basis into a set (Algorithm 3, line 3). After that, Shape-GD converts this low-dimensional representation of alert-FVs into a single vector-histogram (line 4). The conversion is performed by binning and normalizing the L-dimensional vectors within the set along each dimension. Effectively, a vector-histogram is an L × b matrix, where L is the dimensionality of alert-FVs and b is the number of bins per dimension. Further implementation details can be found in Appendix C.
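The binning and normalization step can be sketched as follows. The fixed value range `[lo, hi)` shared by all dimensions, the clamping of out-of-range values, and the function name are simplifying assumptions for illustration; Appendix C describes the actual implementation.

```python
def vector_histogram(alert_fvs, num_bins, lo, hi):
    """alert_fvs: non-empty list of L-dimensional alert-FVs (lists of floats).
    Returns an L x num_bins matrix whose rows are per-dimension histograms,
    each normalized by the number of alert-FVs so that every row sums to 1."""
    num_dims = len(alert_fvs[0])
    width = (hi - lo) / num_bins
    counts = [[0] * num_bins for _ in range(num_dims)]
    for fv in alert_fvs:
        for dim, value in enumerate(fv):
            b = int((value - lo) / width)
            b = min(max(b, 0), num_bins - 1)  # clamp values outside [lo, hi)
            counts[dim][b] += 1
    n = len(alert_fvs)
    return [[c / n for c in row] for row in counts]


# Made-up 2-dimensional alert-FVs, binned into 2 bins per dimension.
fvs = [[0.1, 0.9], [0.2, 0.8], [0.7, 0.6], [0.4, 0.1]]
vh = vector_histogram(fvs, num_bins=2, lo=0.0, hi=1.0)
print(vh)  # [[0.75, 0.25], [0.25, 0.75]]

# The same shape is obtained from a neighborhood that is 1000 times larger.
print(vector_histogram(fvs * 1000, num_bins=2, lo=0.0, hi=1.0) == vh)
```

Normalizing each row is what removes the dependence on neighborhood size: only the relative mass per bin reaches the neighborhood classifier.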

We use standard methods to determine the size and number of bins. In particular, we tried square-root choice, Rice rule, and Doane’s formula [5] to estimate the number of bins, and we found that 20–100 bins yielded best results.
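For reference, the three rules named above can be computed as in the sketch below, using their textbook definitions. Rounding up with `ceil` and the synthetic symmetric sample are our choices for illustration.

```python
import math


def sqrt_rule(n):
    """Square-root choice: about sqrt(n) bins for n samples."""
    return math.ceil(math.sqrt(n))


def rice_rule(n):
    """Rice rule: about 2 * n^(1/3) bins for n samples."""
    return math.ceil(2 * n ** (1 / 3))


def doane_rule(xs):
    """Doane's formula: adds a skewness correction to 1 + log2(n)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    g1 = m3 / m2 ** 1.5  # sample skewness
    sigma_g1 = math.sqrt(6 * (n - 2) / ((n + 1) * (n + 3)))
    return math.ceil(1 + math.log2(n) + math.log2(1 + abs(g1) / sigma_g1))


n = 2000
symmetric_sample = list(range(-1000, 1000))  # zero skewness, n = 2000
print(sqrt_rule(n), rice_rule(n), doane_rule(symmetric_sample))
```

For a few thousand alert-FVs these rules suggest bin counts on the order of ten to a few tens, with more bins suggested as neighborhoods grow or become more skewed.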

Neighborhood classifier. Shape-GD may use any binary classifier (Algorithm 3, line 4) as a neighborhood classifier. We use the following two classifiers – boosted decision trees (XGBoost [24]) and a specially designed Wasserstein distance-based score (‘ShapeScore’) (Appendix D) in the Symantec Wine and waterhole case studies respectively. The main advantages of XGBoost are its ability to learn a complex decision boundary and the fact that it can be trained in a non-parametric mode (we completely automated the parameter search process). However, XGBoost requires both benign and malicious data for training. In contrast, the ShapeScore threshold test can be trained using only benign data, and it acts as an anomaly detector. In our experiments, we found that XGBoost outperforms the ShapeScore function in the Symantec Wine case study, while ShapeScore yields good detection accuracy in the waterhole case study.

Note that, like any other machine learning classifier, the binary classifier employed by Shape-GD needs to be retrained periodically to account for the constantly evolving statistical properties of software.

Input: Suspicious neighborhoods
Output: Malicious neighborhoods
for each nbd in nbds do
       aggregate L-dim. projections of alert-FVs on a per-neighborhood basis
       build a vector-histogram: bin & normalize along each dimension
       classify the neighborhood:
       if NeighborhoodClassifier(vector-histogram) then
              label nbd as malicious
Algorithm 3 Neighborhood Classification

5 Experimental Setup

We evaluate Shape-GD using two publicly available datasets. First, in the Symantec Wine dataset [18], Shape-GD uses malware reports from Symantec client devices and reduces the LDs’ false positives from 1M down to 110K, while retaining 107K out of 137K malware files. Second, we simulate a waterhole attack using Yahoo’s web-service network logs [6] overlaid with host-level malware and benignware traces [46]. In this testbed, Shape-GD detects an attack within a few seconds, with only about 100 compromised machines out of over 550,000 potential compromises. In both settings, Shape-GD successfully amplifies the weak signal inherent to malware propagation.

5.1 Wine dataset

Figure 4: Example of a downloader graph.

The Wine dataset [32, 49, 50] contains telemetry information collected by Symantec’s intrusion prevention system and Symantec’s antivirus product over a 5-year period, from 2008 until 2013. The dataset summarizes file downloader activities across 5M Windows hosts around the world. File downloads are represented in the form of downloader graphs (the abstraction introduced by Kwon et al. [49]), one per end host. A graph node represents a downloaded file (SHA256 file hash), and a directed edge from one node to another indicates that the first file downloaded the second on the corresponding host machine; the edge’s label is the domain from which the file was downloaded.

Figure 4 depicts an example of a downloader graph. Each node is labeled with a corresponding file name, and each edge bears a domain name from where a file has been downloaded. We also overlay ground truth on the nodes and edges: red color means that a file or a domain is malicious, while the blue color means that a file/domain is benign.

We used the VirusTotal (VT) service to obtain ground-truth information about the 20.3M file-hashes downloaded 67M times and all 353K domain names in Wine (Table 7). Though file-level VirusTotal reports contain results of signature-based malware detection, we do not use them within Shape-GD (except for computing the ground truth). Note that information within VirusTotal domain reports might be affected by post-analysis performed by commercial antivirus vendors. However, there are alternative approaches to establish domain reputation [40] that outperform our domain name classifier by using a different set of domain features, which are unavailable in the Symantec Wine dataset.

For files (corresponding to a file-hash) or domain names that VT has information for, VT used 62 different antivirus engines and other heuristics to generate a report – this report is used to train the file-behavior and domain-name classifiers. We consider a file to be malicious if more than 30% of antivirus products label it as malware [49]. This yields 2.6M reports for file-hashes, with 137K confirmed to be malicious, and 301K reports for domain names. We label all remaining files and domain names (i.e., those that are not confirmed to be either malware or benign by VT) as benign – this is a conservative step that weakens the malware propagation signal in the dataset and is also representative of real deployments where information about suspicious files/domain-names is often delayed or unavailable.
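The labeling rule above can be written down directly. The sketch below is only an illustration of the stated rule; the function name `label_file` and its arguments are hypothetical, and treating a missing report as zero engines is an assumption that mirrors the conservative default described in the text.

```python
def label_file(detections, total_engines, threshold=0.30):
    """Label a file from the number of antivirus engines that flag it.

    A file is malicious only if MORE than `threshold` of the engines
    flag it. Files without a report (total_engines == 0) default to
    'benign', matching the conservative labeling of unconfirmed files.
    """
    if total_engines == 0:
        return "benign"
    return "malicious" if detections / total_engines > threshold else "benign"

labels = [label_file(19, 62), label_file(18, 62), label_file(0, 0)]
```

With 62 engines, 19 detections (about 30.6%) cross the threshold while 18 detections (about 29.0%) do not.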

5.2 Modeling Waterhole Attacks

Waterhole attack. To model a waterhole attack, we use Yahoo’s “G4: Network Flows Data” [6] dataset, which contains communication data between end-users and Yahoo servers. The 41.4 GB (in compressed form) of data were collected on April 29-30, 2008. Each netflow record includes a timestamp, source/destination IP address, source/destination port, protocol, number of packets and the number of bytes transferred from the source to the destination. (All IP addresses in the dataset are anonymized using a random permutation algorithm, so it is impossible to trace them back to the real servers.)

Specifically, we use 5 hours of network traffic (208 million records) captured on April 29, 2008 between 8 am and 1 pm at the border routers connecting the Dallas Yahoo data center (DAX) to the wider Internet. The selected 50 DAX servers communicate with 3,181,127 client machines over 14,249,931 requests.

We assume that an attacker compromises one of the most frequently accessed DAX servers. In our simulation, the server is compromised at a random instant between 8 am and 10:30 am. Hence, Shape-GD can use the remaining 2.5 hours to detect the attack (our results show that less than a hundred seconds suffice). Following infection, we simulate this ‘waterhole’ server compromising client machines over time with an infection probability parameter – this helps us determine the time to detection at different rates of infection. The benign and compromised machines then select the corresponding type of execution trace (i.e., a sequence of FVs generated as described below) and input these to their LDs.
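A minimal sketch of this infection model is shown below. It is an illustration under stated assumptions, not the simulator used in the experiments: the function name `simulate_waterhole`, the `(timestamp, client_id)` request format, and the rule that each post-compromise request infects a still-clean client independently with probability `p_infect` are all assumptions.

```python
import random

def simulate_waterhole(requests, compromise_time, p_infect, seed=0):
    """Return {client_id: infection_time} for a compromised server.

    requests: iterable of (timestamp_sec, client_id) pairs addressed to
    the waterhole server. Only requests at or after `compromise_time`
    can infect a client, and a client is infected at most once.
    """
    rng = random.Random(seed)
    infected = {}
    for t, client in sorted(requests):
        if t < compromise_time or client in infected:
            continue
        if rng.random() < p_infect:
            infected[client] = t
    return infected

# Toy trace: one request per second from 50 clients in round-robin order.
requests = [(t, t % 50) for t in range(100)]
all_infected = simulate_waterhole(requests, compromise_time=40, p_infect=1.0)
none_infected = simulate_waterhole(requests, compromise_time=40, p_infect=0.0)
```

Sweeping `p_infect` between these two extremes is what produces different infection rates for a fixed request trace.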

Benign and malware applications. We collect data from thousands of benign applications and malware samples. To avoid tracing program executions where malware may not have executed any stage of its exploit or payload correctly, we set a threshold of 100 system calls per execution to be considered a success. Our experiments successfully run 1,311 malware samples from 193 malware families collected in July 2013 [46], and 2,364 more recent samples from 13 popular malware families collected in 2015 [10], to compare against traces from 1,889 benign applications.

We record time-stamped sequences of executed system calls using Intel’s Pin dynamic binary instrumentation tool. Each Amazon AWS virtual machine instance runs Windows Server 2008 R2 Base on the default T2 micro instances with 1GB RAM, 1 vCPU, and 50GB local storage. The VMs are populated with user data commonly found on a real host including PDFs, Word documents, photos, Firefox browser history, Thunderbird calendar entries and contacts, and social network credentials. To avoid interference between malware samples, we execute each sample in a fresh install of the reference VM. As malware may try to propagate over the local network, we set up a sub-net of VMs accessible from the VM that runs the malware sample. In this sub-net, we left open common ports (HTTP, HTTPS, SMTP, DNS, Telnet, and IRC) used by malware to execute its payload. We run each benign and malware program 10 times for 5 minutes per run, for a total of almost 53,000 hours of compute time on Amazon AWS.

Overall, benignware and malware were active for 141,670 seconds and 283,270 seconds respectively, executing an average of 11,900 and 13,500 system calls per second respectively. Using a 1-second time window (Section 4) and sliding the time window by 1 ms, we extract histograms of system calls within each time window as the ML feature, and finally pick 1.5M benign and 1M malicious FVs from this dataset for the experiments that follow. Importantly, we do not constrain the samples on neighboring machines to belong to the same families – as described above, malware today predominantly spreads through malware distribution networks where a downloader trojan (‘dropper’) can distribute arbitrary and unrelated payloads on hosts. We want to test Shape-GD in the extreme case where malicious FVs can be assigned from any malware execution to any machine.
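The sliding-window feature extraction can be sketched as follows. This is an assumed, simplified version: timestamps are integer milliseconds, system calls are already mapped to integer ids, and the function name `syscall_histograms` is hypothetical. The defaults correspond to the 1-second window and 1 ms stride mentioned above; the toy call uses a coarser stride to keep the output short.

```python
def syscall_histograms(trace, n_syscalls, window_ms=1000, stride_ms=1):
    """Slide a window over a time-sorted trace of (timestamp_ms, syscall_id)
    pairs and return one system-call count vector (FV) per window position.
    Each window covers the half-open interval [start, start + window_ms)."""
    if not trace:
        return []
    fvs = []
    counts = [0] * n_syscalls
    lo = hi = 0  # trace[lo:hi] holds the events inside the current window
    start, last = trace[0][0], trace[-1][0]
    while start + window_ms <= last + 1:
        # Add events that entered the window on its right edge.
        while hi < len(trace) and trace[hi][0] < start + window_ms:
            counts[trace[hi][1]] += 1
            hi += 1
        # Remove events that fell off the left edge.
        while lo < hi and trace[lo][0] < start:
            counts[trace[lo][1]] -= 1
            lo += 1
        fvs.append(list(counts))
        start += stride_ms
    return fvs

trace = [(0, 0), (400, 1), (900, 0), (1500, 2), (1999, 1)]
fvs = syscall_histograms(trace, n_syscalls=3, window_ms=1000, stride_ms=500)
```

Updating the counts incrementally avoids recounting the whole window at every 1 ms step, which matters at more than ten thousand system calls per second.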

6 Case Study 1: Symantec Wine Dataset

We now quantify how Shape-GD concentrates malware in Symantec’s Wine dataset into neighborhoods. By using downloader graphs as a weakly correlated attribute, Shape-GD identifies malicious files and infected machines with significantly lower false positives than using LDs [53] alone and far higher true-positives than a downloader-graph based detector [49, 50] alone.

In addition, neighborhoods and shape together are good predictors of malware behavior – hence Shape-GD does not have to wait until the entire sequence of malware payloads has been downloaded to declare a downloader or a machine as malicious. We find that on average, Shape-GD can identify a file as malicious only 20 days after it enters the Wine dataset and 345 days before VirusTotal confirms it as malware. Table 7 summarizes these results.

6.1 Shape-GD Classifiers

Local detectors. We start with the evaluation of local detectors (Section 4.1). Each local detector algorithm comprises two parts – feature extraction and a binary classifier (XGBoost in our prototype). We train a local detector on the set of 2.6 million VirusTotal reports using 10-fold cross validation. The detector achieves 97.61% area-under-the-curve metric (Figure 5), and we chose the operating point of 5.0% false positive rate and 90.47% true positive rate. Note that due to the high number of benign files in the dataset, a 5.0% false positive rate corresponds to more than 1M misclassified files, which is likely to prevent practical deployment of such a local detector. In subsequent experiments, we use out-of-fold predictions made by the detector.

Figure 5: (Left) Receiver operating curve (ROC) of the local detector and the domain name classifier. (Right) ROC of the neighborhood classifier.

Domain name classifier. We train and evaluate the classifier (Section 4.1) on 251K VirusTotal domain reports using 10-fold cross validation to achieve a 91.58% AUC (Figure 5). We specifically choose an operating point of 19.03% false positives and 95.41% true positives.

The domain name classifier is ‘weak’ because it is conservative while labeling domains – an entire domain is considered malicious if it serves at least one malware sample. However, even malicious domains serve several benign files, and the local detector (above) that analyzes file-level features using VirusTotal contradicts the domain name classifier. Adding more information about the URL can improve the classifier – however, even the weak signal in domain names is sufficient for Shape-GD to significantly improve the local detectors. Interestingly, since the domain name classifier is only used to create neighborhoods (and not alerts), it can operate at a conservative setting and rely on the shape-based neighborhood classifier to weed out false positives.

The domain name classifier lets Shape-GD efficiently filter out domains that are unlikely to distribute malicious files. Specifically, it removes from further consideration 68.62% (214,884 out of 313,133) completely benign domains that are responsible for delivering 80.70% (16,222,941 out of 20,103,211) benign files. At the same time the classifier retains 75.86% (30,448 out of 40,134) malicious domains responsible for delivering 88.31% (94,457 out of 106,959) malicious files.

Neighborhood classifier. The neighborhood classifier (Algorithm 3) performs neighborhood-level feature extraction and feeds resulting feature vectors into an XGBoost classifier. We estimate its detection capabilities using 10-fold cross validation. The ROC plot (Figure 5) shows that the classifier achieves 96.13% AUC score, and we choose the following operating point: 5% false positives and 91.83% true positive rate.

A neighborhood-level alert is different from the above file- and domain-name-based local detectors’ alerts – it signifies that a set of files with suspicious behavior has been downloaded from suspicious links, and hence identifies the large majority of files that were false positives at the local level. First, we measure the degree to which our neighborhood classifier removes benign files, and then show that by re-examining files in suspicious neighborhoods (using the file-based LD), we can capture 78.03% of true positives.

6.2 Neighborhoods Concentrate Malware

Figure 6: Neighborhood classifier acts as a malware concentrator. (Top) Distribution of infection rates of randomly grouped files. (Middle) Distribution of neighborhoods’ infection rates. (Bottom) Distribution of neighborhoods’ infection rates after filtering out low-infected neighborhoods. The neighborhood classifier retains only highly infected neighborhoods. (Distributions are capped at 1,000 level.)

First, we measure the effect of using domain names from downloader graphs as an attribute to create neighborhoods.

The original malware concentration in the Wine dataset is only 0.663%, as shown in the top-most plot of Figure 6. If a random subset of files is grouped into a neighborhood, each neighborhood will have considerably less malware than the false positive rate of the malware detectors (5%) – i.e., creating neighborhoods randomly does not concentrate malicious activity. This is the baseline against which downloader-graph-based neighborhood creation and the shape-based neighborhood classifier have to be compared – the neighborhoods labeled as malicious have to contain more than 5% malicious files while achieving high malware coverage overall.

Shape-GD first uses the domain-name classifier to prune out files downloaded from benign domains – this increases the 0.663% infection rate to 9.49% (middle plot, Figure 6). However, high bars on the left-hand side (they are cut off at 1,000 file level) indicate a large majority of neighborhoods have relatively low concentration of malicious files in them.

Shape-GD then uses the shape-based neighborhood classifier to identify infected neighborhoods. This dramatically changes the distribution of neighborhood infection rates, i.e. the peak shifts to the right, from 1% to 5% (lowest plot in Figure 6). The neighborhood classifier brings the average malware concentration in a neighborhood from 9.49% to 24.6%, a 37.1-fold increase compared to randomly grouping files into neighborhoods.

Specifically, the number of neighborhoods with an infection rate of less than 1% drops by 437.6 times (from 8,752 on the upper plot to 20 on the lower plot). Overall, the neighborhood classifier together with the domain-name classifier reduces the number of low-infected neighborhoods (neighborhoods with less than 5% of malicious files) by 36.4 times (from 21,792 to 599).

6.3 Aggregate Detection Results

We now quantify the detection performance of the complete pipeline – i.e., by applying the malware classifier to files inside infected neighborhoods. By identifying malicious neighborhoods, Shape-GD effectively weeds out many files that trigger false alerts – hence, the alerts within infected neighborhoods are 37 times more likely to be malware (true positive).

To perform real-time analysis, we replay the 5-year long history of download events in the Wine dataset (each event has a timestamp associated with it) and execute Shape-GD every 30 days. We set the neighborhood time window (NTW) parameter to 150 days because we found that the average lifespan of malicious domains is 157 days. In our experiments we observed that a shorter period between consecutive runs of Shape-GD does not significantly affect detection results; it only improves the time to detection and early detection parameters (Table 7). We intentionally stick to a 30-day period between consecutive runs of Shape-GD to keep execution time (12 hours) and resource consumption manageable.
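The replay schedule can be pictured with the following sketch, which yields, for each run, the download events that fall inside the trailing neighborhood time window. It is a simplified illustration: the generator name `replay_windows`, the `(day, file_id)` event format, and the half-open window convention are assumptions.

```python
def replay_windows(events, ntw_days=150, period_days=30):
    """Yield (run_day, events_in_window) for periodic detector runs.

    events: list of (day, file_id) pairs. A run on `run_day` sees the
    events with run_day - ntw_days <= day < run_day.
    """
    events = sorted(events)
    if not events:
        return
    first_day, last_day = events[0][0], events[-1][0]
    run_day = first_day + period_days
    while run_day <= last_day + period_days:
        window = [e for e in events if run_day - ntw_days <= e[0] < run_day]
        yield run_day, window
        run_day += period_days

events = [(0, "a"), (10, "b"), (100, "c"), (200, "d")]
runs = list(replay_windows(events))

# Aggregate across runs, counting each file exactly once.
seen = set()
for _, window in runs:
    seen.update(file_id for _, file_id in window)
```

Because consecutive windows overlap (150 days of history, 30 days between runs), the same file appears in several runs, which is why aggregate results need the de-duplication step at the end.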

We compare Shape-GD, which comprises the neighborhood classifier and local detectors, with prior work – local detectors [53] and the state-of-the-art malware detector in the Wine dataset [49] – as well as a neighborhood detector. For comparison we use standard machine learning metrics: false positive rate, true positive rate, and F-1 score.

Figure 7: File-level aggregate results.

Though Shape-GD is designed to act as a real-time malware detector, i.e. to output detection results every time it is executed, in this section we focus only on the file-level aggregate results (Table 7) in order to compare with prior work. For completeness we describe machine-level aggregate results in Appendix F and the real-time detection results in Appendix E. The aggregate results are computed by merging malware detection results across independent executions of a malware detector. Note that we count each file exactly once: for example, if a malware detector detects the same malicious file over multiple NTWs, we count it only once.

False positive rate. The downloader detector [49] achieves the lowest FP rate. It raises 1.0% false positives on the set of downloaders; however, downloaders constitute only a small portion of the entire dataset (439K out of 20.55M files). Thus, its effective FP rate comes down to 0.021%, which is reached at the cost of excluding more than 20 million (or more than 97.3%) files from the analysis. The other prior work – a local detector [53] – has a fixed false positive rate of 5%, which we set in our experiments so that it achieves a TP rate above 90%.

Surprisingly, the neighborhood detector’s FP rate is only marginally worse than the local detector’s FP rate – 5.8% in comparison to 5%. However, it filters out a significant portion of benign files, which helps Shape-GD reduce the FP rate by 10.7 times (5.8% vs. 0.54%) by using the neighborhood detector as a file filter. In comparison to the local detector, Shape-GD has a 9.3 times lower FP rate (5% vs. 0.54%), bringing the absolute number of false positives from 1.2 million down to 109.9 thousand. Deeper (even human-level) analysis therefore becomes feasible: 109.9 thousand false alerts over a 5-year period correspond to 60 false alerts per day on average.

True positive rate. The downloader detector [49] has the lowest TP rate due to its inherent inability to analyze non-downloaders. It discovers 96% of malicious downloaders, but only 31.39% of all malware samples – it misses 94K out of 137K malware samples. Note that the Wine dataset may be skewed in favor of malicious downloaders, i.e. approximately one third of malware samples in the dataset are malicious downloaders. Thus, the downloader detector may have an even lower TP rate in a real deployment setting.

The neighborhood detector achieves a slightly lower TP rate (84%) because it erroneously filters out some malicious files while the local detectors analyze all of them. Specifically, if the neighborhood detector fails to correlate malicious downloads appropriately, it may distribute malware samples across multiple predominantly benign neighborhoods. Due to low malware concentration they may be excluded from the further analysis by the neighborhood classifier. The other reason why the neighborhood classifier misses some malware may come from labeling some malicious domains as benign. Thus, malware samples downloaded from such domains are excluded from the further analysis.

In terms of true positives, Shape-GD inherits the limitations of the neighborhood detector. It loses a few more percent because it runs imperfect local detectors within neighborhoods that capture only 84% of malware, which results in a 78% TP rate. In contrast, local detectors demonstrate the highest TP rate (90.5%) because they are tuned to achieve a TP rate higher than 90%.

F-1 score. All four detectors explore different operating points in the FP/TP design space. To compare them, we use a standard machine learning metric – F-1 score. The F-1 score is bounded by 100%, which is achieved only if a detector has 100% TP rate and 0% FP rate.

Shape-GD achieves the highest F-1 score (60.46%) because it detects a large portion of the malware samples in the dataset (78%) while maintaining a low FP rate (0.54%). The next closest competitor – the downloader detector [49] – achieves only a 46.64% F-1 score due to its low TP rate. Interestingly, the local detector [53] demonstrates 2.3 times worse results than the downloader detector because of its much higher FP rate.
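As a sanity check, the F-1 score can be recomputed from the rounded counts quoted in the text. The sketch assumes the standard definition of F-1 as the harmonic mean of precision and recall, and uses 107K retained malware files out of 137K together with 109.9K false positives; because these inputs are rounded, the result matches the reported 60.46% only approximately.

```python
def f1_score(tp, fp, fn):
    """F-1 score: harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Shape-GD, file-level aggregate counts (rounded values from the text).
shape_gd_f1 = f1_score(tp=107_000, fp=109_900, fn=137_000 - 107_000)
print(f"Shape-GD F-1 is approximately {shape_gd_f1:.3f}")
```

The calculation also shows why the local detector scores poorly despite its high TP rate: with roughly a million false positives, precision collapses and drags the harmonic mean down.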

7 Case Study 2: Waterhole Attack

Shape-GD identifies malicious neighborhoods with less than 1% false positive rate and 100% true positive rate when the neighborhoods produce more than 15,000 FVs within a neighborhood time window (see Algorithm 3). Recall that at 60 FVs/node/minute, it takes 1,000 nodes only 15 seconds to create 15,000 FVs. For LDs like ours with a 6% false positive rate, this corresponds to 900 alert-FVs. We then simulate realistic attack scenarios and find that Shape-GD can detect malware when only 108 of 550K possible nodes are infected through a waterhole attack using a popular web-service.

7.1 Time to detection using temporal neighborhoods

Figure 8: (Waterhole attack: Time-based neighborhood filtering) Dynamics of an attack: While the portion of infected nodes in a neighborhood increases over time reaching 1248 nodes on average, ShapeScore goes up showing that Shape-GD becomes more confident in labeling neighborhoods as ‘malicious’. It starts detecting malware with at most 1% false positive rate when roughly 200 nodes get compromised. The neighborhood includes 17,178 nodes on average and spans over 30 sec time interval.

Temporal filtering creates a neighborhood using only the nodes that are active within a neighborhood time window (NTW). For example, a temporal neighborhood for the waterhole attack scenario would include all client devices that accessed any server within the last NTW into one neighborhood (17,178 nodes on average in 30 seconds). This neighborhood filtering models a CIDS designed to detect malware whose infection exhibits temporal locality (and obviously does not detect attacks that target a few high-value nodes through temporally uncorrelated vectors).

Interestingly, waterhole attacks exhibit a ‘bursty’ nature: in our experiments, a popular waterhole server quickly infects a large number of clients within a short period of time – thus, we vary the waterhole NTW from 4 seconds up to 100 seconds.

Shape-GD’s time to detection for one NTW. We fix the NTW (30 seconds) and vary a parameter that represents a node’s likelihood of infection from 0% up to 100% – modeling whether a drive-by exploit succeeds in a waterhole attack.

Figure 8 plots the neighborhood score vs. the average number of infected nodes within benign (blue curve) and malicious (red curve) neighborhoods – the two extreme points on the X-axis correspond to either none of the machines being infected (the left side of the figure) or the maximum possible number of machines being infected (the right side of the figure). In this experiment, the waterhole server can infect at most 1250 nodes in the 30-second NTW. Every point on a line is the median neighborhood score from 100 experiments, with whiskers set at the 1st and 99th percentile scores. In each experiment we use a random subset of the training data for training and a random subset of the testing data for testing.

As expected, when the number of infected nodes in a neighborhood increases, the red curve deviates further from the blue one. Therefore, Shape-GD becomes more confident in labeling incoming partially infected neighborhoods as malicious. Shape-GD starts reliably detecting malware very quickly – when only 200 nodes have been infected. We also experimented with other sizes of neighborhood window – the plots we obtained showed similar trends.

Figure 9: (Waterhole attack: Time-based neighborhood filtering) Shape-GD’s performance deteriorates linearly when increasing the size of a neighborhood window from 6 sec to 100 sec.

Shape-GD’s sensitivity to NTW. We show in Figure 9 that the size of a neighborhood is important for early detection, measured as the minimum number of nodes that are infected before Shape-GD raises an alert. Varying the NTW essentially trades off the rates at which malicious and benign FVs accumulate.

We vary the NTW from 4 sec to 100 sec and record the number of infected nodes when Shape-GD can make robust predictions (i.e. less than 1% FP for almost 100% TP). The results are averaged across 100 experiments.

In a waterhole scenario, the number of client devices active within a time window (and hence the false positive alert-FVs from the neighborhood) grows much faster than the malware can spread (even if we assume that every client that visits the waterhole server gets infected). Here, a large NTW aggregates many more benign (false positive) FVs from clients accessing non-compromised servers. Hence, increasing the NTW degrades time to detection. Shape-GD works best with an NTW of 6 seconds – only 107.5 nodes on average become infected out of a possible 550,000 nodes. Note that a very small NTW (below 6 seconds) either does not accumulate enough FVs for analysis – if so, Shape-GD outputs no results – or creates large variance in the shape of benign neighborhoods and abruptly degrades detection performance.

Note that Shape-GD requires a minimum number of FVs per neighborhood to make robust predictions – at least 15,000 FVs – hence, Shape-GD has to set NTWs based on the rate of incoming requests and the access frequency of a particular server. For example, if a server is not very popular and is likely to be compromised, Shape-GD could increase this server’s NTW to collect more FVs for its neighborhood.
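The sizing rule above reduces to simple arithmetic, sketched below using the rates quoted at the start of this section (60 FVs per node per minute, a 6% LD false positive rate, and a 15,000-FV minimum). The helper names are hypothetical, and the model assumes a fixed set of nodes that each emit FVs at a constant rate.

```python
def seconds_to_collect(min_fvs, nodes, fvs_per_node_per_min=60):
    """Time needed for `nodes` active nodes to produce `min_fvs` FVs."""
    fvs_per_sec = nodes * fvs_per_node_per_min / 60.0
    return min_fvs / fvs_per_sec

def expected_false_alert_fvs(n_fvs, ld_fp_rate=0.06):
    """Expected number of alert-FVs raised by LDs on purely benign FVs."""
    return n_fvs * ld_fp_rate

busy_server_ntw = seconds_to_collect(15_000, nodes=1_000)   # popular server
quiet_server_ntw = seconds_to_collect(15_000, nodes=100)    # unpopular server
benign_alerts = expected_false_alert_fvs(15_000)
```

Under these assumptions a neighborhood of 1,000 nodes needs 15 seconds to reach the minimum, while one with 100 nodes needs 150 seconds, which is the reason a less popular server would be given a longer NTW.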

7.2 Fragility of Count GD

Figure 10: (Symantec Wine dataset) An error in estimating neighborhood size dramatically affects Count-GD’s performance. It can tolerate at most 30% underestimation errors and 1% overestimation errors to achieve performance comparable to Shape-GD.
Figure 11: (Waterhole attack) An error in estimating neighborhood size dramatically affects Count-GD’s performance. It can tolerate at most 0.1% underestimation errors and 13.8% overestimation errors to achieve performance comparable to Shape-GD.

A Count-GD algorithm counts the number of alerts over a neighborhood and compares it to a threshold to detect malware. This threshold scales linearly in the size of the neighborhood – we now experimentally quantify the error Count-GD can tolerate in Symantec Wine (Figure 10) and waterhole (Figure 11) settings. Note that the error in estimating neighborhood size can be double sided – underestimates (negative error) can make neighborhoods look like alert hotspots and lead to false positives, while overestimates (positive error) can lead to missed detections (i.e., lower true positives).

We run Count-GD in the same setting as Shape-GD. In the Symantec Wine case study we adjust Count-GD’s threshold to match the performance of Shape-GD’s Neighborhood classifier (true positive rate of 95.41%, Section 6.1) with zero neighborhood estimation errors (Figure 10). In the waterhole case study we evaluate Count-GD under the same conditions as Shape-GD when presenting the results of time-based neighborhood filtering (Section 7.1) – 30-sec long neighborhood including 17,178 nodes (Figure 11). In comparison to the Symantec Wine experiment, whose parameters are fixed, in the waterhole experiment we vary infection probability such that the number of infected nodes in a neighborhood changes from 0 to 500 (waterhole) in four increments – note that only a small fraction (2.9%) of nodes per neighborhood get infected in the worst case.

In this setting, recall that the neighborhood detector has a maximum global false positive rate of 19.03% and 1% and a true positive rate of 95.41% and 100% respectively in the Symantec Wine and waterhole case studies. To maintain a similar detection performance, our experiments show that the Count-GD can only tolerate neighborhood size estimation errors within a very narrow range – [-30%, 1%] (Symantec Wine) and [-0.1%, 13.8%] (waterhole). A key takeaway here is that underestimating a neighborhood’s size makes Count-GD extremely fragile (-30% in Symantec Wine and -0.1% for waterhole). On the other hand, overestimating neighborhood sizes decreases true positives, and this effect is catastrophic.
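The mechanism behind this fragility can be seen in a toy version of a count-based test. The sketch below is not the Count-GD configuration evaluated above: the per-node threshold of 0.07, the assumption of one FV per node, and a 6% benign alert rate are illustrative choices, so the error levels at which it breaks differ from the measured tolerances.

```python
def count_gd_alert(n_alerts, estimated_size, per_node_threshold=0.07):
    """Count-based test: alert if the alert count exceeds a threshold
    that scales linearly with the ESTIMATED neighborhood size."""
    return n_alerts > per_node_threshold * estimated_size

TRUE_SIZE = 17_178                                     # nodes in the neighborhood
benign_alerts = round(0.06 * TRUE_SIZE)                # false alerts only
infected_alerts = 500 + round(0.06 * (TRUE_SIZE - 500))  # 500 infected nodes

def with_error(rel_error):
    """Neighborhood size as seen by a detector with a relative estimation error."""
    return TRUE_SIZE * (1.0 + rel_error)

exact_ok = (not count_gd_alert(benign_alerts, with_error(0.0))
            and count_gd_alert(infected_alerts, with_error(0.0)))
false_alarm = count_gd_alert(benign_alerts, with_error(-0.20))   # underestimate
missed = not count_gd_alert(infected_alerts, with_error(+0.30))  # overestimate
```

With an exact size the toy detector separates the two neighborhoods, but a 20% underestimate turns the benign neighborhood into an alert hotspot and a 30% overestimate hides the infected one.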

We comment that this effect is important in practice. In the example of a Fortune-500 company, we observed that commercial SIEM tools often do not report alerts in a timely manner and may delay delivering alerts by up to 2 months due to unpredictable infrastructure failures and due to a local IT service intervening into the analysis of alerts. Also, given practical deployments where nodes get infected out of band (e.g., outside the corporate network) or go out of range (as with mobile devices), the tight margins on errors can render Count-GD extremely unreliable. Even with sophisticated size estimation algorithms, recall that the underlying distributions that create these neighborhoods (e.g. number of clients per server) have sub-exponential heavy tails – such distributions typically result in poor parameter estimates due to lack of higher moments, and thus, poorer statistical concentrations of estimates about the true value [35]. Circling back, we see that by eliminating this size dependence compared to Count-GD, our Shape-GD provides a robust inference algorithm.

8 Discussion

Evasion attacks. Shape-GD requires a human analyst to correctly specify attack vectors. If a new attack vector emerges (e.g., BadUSB), then the corresponding attack may go undetected. However, attack vectors such as URLs, emails, or physical devices along which malware propagates are far fewer than vulnerabilities, exploits, or malware samples. Further, individual local detectors may be susceptible to evasion attacks, which may negatively affect Shape-GD’s detection. However, designing evasion-resistant local detectors [43] is outside the scope of this paper.

9 Conclusions

Building robust behavioral detectors is a long-standing problem, especially in large distributed systems where false positives can be overwhelming. We observe that attacks on enterprise networks induce low-dimensional neighborhoods on otherwise high-dimensional feature vectors, but such neighborhoods are unpredictable and thus hard to exploit. Shape-GD amplifies malware signal through neighborhoods and exploits their shape to identify infected ones early. Automating the search for new neighborhoods, i.e. new attack vectors, that correlate with confirmed infections, would be a natural next step towards deployable behavioral detectors.


  • [1] Accelerating cyber hunting project asgard.
  • [2] Advanced malware protection (amp).
  • [3] Advanced malware protection and detection (ampd).
  • [4] Attack graphs: visualizing 200m alerts a day.
  • [5] Data binning.
  • [6] G4 - yahoo! network flows data.
  • [7] Graphistry.
  • [8] Grr rapid response: remote live forensics for incident response.
  • [9] Hone tool.
  • [10] Kaspersky security bulletin 2015.
  • [11] osquery – performant endpoint visibility.
  • [12] Secure rank algorithm.
  • [13] Statement regarding cyber attack against anthem.
  • [14] Symantec report on black vine espionage group.
  • [15] Virustotal – free online virus, malware and url scanner.
  • [16] Wasserstein metric.
  • [17] Why watering hole attacks work.
  • [18] Worldwide intelligence network environment. profile/universityresearch/ sharing.jsp.
  • [19] Bailey, M., Oberheide, J., Andersen, J., Mao, Z., Jahanian, F., and Nazario, J. Automated classification and analysis of internet malware. In Recent Advances in Intrusion Detection. 2007.
  • [20] Benamou, J., and Brenier, Y. Mixed l2-wasserstein optimal mapping between prescribed density functions. Journal of Optimization Theory and Applications (2001).
  • [21] Bethencourt, J., Franklin, J., and Vernon, M. Mapping internet sensors with probe response attacks. In Proceedings of the 14th USENIX Security Symposium (2005).
  • [22] Burguera, I., Zurutuza, U., and Nadjm-Tehrani, S. Crowdroid: Behavior-based malware detection system for android. In Proceedings of the 1st ACM Workshop on Security and Privacy in Smartphones and Mobile Devices (2011), SPSM.
  • [23] Canali, D., Lanzi, A., Balzarotti, D., Kruegel, C., Christodorescu, M., and Kirda, E. A quantitative study of accuracy in system call-based malware detection. In Proceedings of the 2012 International Symposium on Software Testing and Analysis (2012), ISSTA 2012.
  • [24] Chen, T., and Guestrin, C. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2016), KDD ’16.
  • [25] Chi, Y., Song, X., Zhou, D., Hino, K., and Tseng, B. L. Evolutionary spectral clustering by incorporating temporal smoothness. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2007), KDD ’07.
  • [26] Christodorescu, M., Jha, S., and Kruegel, C. Mining specifications of malicious behavior. In Proceedings of the 1st India Software Engineering Conference (2008), ISEC.
  • [27] Christodorescu, M., Jha, S., Seshia, S. A., Song, D., and Bryant, R. E. Semantics-aware malware detection. In Proceedings of the 2005 IEEE Symposium on Security and Privacy (2005).
  • [28] Clark, S. S., Ransford, B., Rahmati, A., Guineau, S., Sorber, J., Xu, W., and Fu, K. WattsUpDoc: Power side channels to nonintrusively discover untargeted malware on embedded medical devices. In USENIX Workshop on Health Information Technologies (2013).
  • [29] Dash, D., Kveton, B., Agosta, J. M., Schooler, E., Chandrashekar, J., Bachrach, A., and Newman, A. When gossip is good: Distributed probabilistic inference for detection of slow network intrusions. In Proceedings of the 21st National Conference on Artificial Intelligence.
  • [30] Demme, J., Maycock, M., Schmitz, J., Tang, A., Waksman, A., Sethumadhavan, S., and Stolfo, S. On the feasibility of online malware detection with performance counters. In Proceedings of the 40th Annual International Symposium on Computer Architecture (2013).
  • [31] Donoho, D. L., and Huber, P. J. The notion of breakdown point. A festschrift for Erich L. Lehmann 157184 (1983).
  • [32] Dumitras, T., and Shou, D. Toward a standard benchmark for computer security research: The worldwide intelligence network environment (wine). In Proceedings of the First Workshop on Building Analysis Datasets and Gathering Experience Returns for Security (2011), BADGERS ’11.
  • [33] Fink, G. A., Duggirala, V., Correa, R., and North, C. Bridging the host-network divide: Survey, taxonomy, and solution. In Proceedings of the 20th Conference on Large Installation System Administration (Berkeley, CA, USA, 2006), LISA ’06, USENIX Association, pp. 20–20.
  • [34] Forrest, S., Hofmeyr, S., Somayaji, A., and Longstaff, T. A sense of self for unix processes. In Security and Privacy, 1996. Proceedings., IEEE Symposium on (1996).
  • [35] Foss, S., Korshunov, D., and Zachary, S. An introduction to heavy-tailed and subexponential distributions, 2009. Springer Series in Operations Research and Financial Engineering.
  • [36] Fredrikson, M., Jha, S., Christodorescu, M., Sailer, R., and Yan, X. Synthesizing near-optimal malware specifications from suspicious behaviors. In IEEE Symposium on Security and Privacy (2010).
  • [37] Gu, G., Porras, P., Yegneswaran, V., Fong, M., and Lee, W. Bothunter: Detecting malware infection through ids-driven dialog correlation. In Proceedings of 16th USENIX Security Symposium (2007).
  • [38] Gu, G., Zhang, J., and Lee, W. Botsniffer: Detecting botnet command and control channels in network traffic. In Presented at the 16th Annual Network & Distributed System Security Symposium (2008), NDSS.
  • [39] Hanna, S., Huang, L., Wu, E., Li, S., Chen, C., and Song, D. Juxtapp: A scalable system for detecting code reuse among android applications. In Detection of Intrusions and Malware, and Vulnerability Assessment. 2013.
  • [40] Hao, S., Kantchelian, A., Miller, B., Paxson, V., and Feamster, N. Predator: Proactive recognition and elimination of domain abuse at time-of-registration. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (New York, NY, USA, 2016), CCS ’16, ACM, pp. 1568–1579.
  • [41] Hofmeyr, S. A., Forrest, S., and Somayaji, A. Intrusion detection using sequences of system calls. J. Comput. Secur. 6, 3 (aug 1998).
  • [42] Huber, P. J. Robust statistics. Springer, 2011.
  • [43] Kantchelian, A. Taming Evasions in Machine Learning Based Detection Pipelines. PhD thesis, 2016.
  • [44] Kaufman, L., and Rousseeuw, P. J. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley, 1990.
  • [45] Khasawneh, K. N., Ozsoy, M., Donovick, C., Abu-Ghazaleh, N., and Ponomarev, D. Ensemble learning for low-level hardware-supported malware detection. In Research in Attacks, Intrusions, and Defenses. Springer International Publishing, 2015, pp. 3–25.
  • [46] Kirat, D., Vigna, G., and Kruegel, C. Barecloud: Bare-metal analysis-based evasive malware detection. In Proceedings of the 23rd USENIX Conference on Security Symposium (2014).
  • [47] Kolbitsch, C., Comparetti, P. M., Kruegel, C., Kirda, E., Zhou, X., and Wang, X. Effective and efficient malware detection at the end host. In Proceedings of the 18th Conference on USENIX Security Symposium (2009).
  • [48] Krüegel, C., Toth, T., and Kerer, C. Decentralized event correlation for intrusion detection. In Proceedings of the 4th International Conference Seoul on Information Security and Cryptology (2002), ICISC ’01, Springer-Verlag.
  • [49] Kwon, B. J., Mondal, J., Jang, J., Bilge, L., and Dumitras, T. The dropper effect: Insights into malware distribution with downloader graph analytics. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security (New York, NY, USA, 2015), CCS ’15, ACM, pp. 1118–1129.
  • [50] Kwon, B. J., Srinivas, V., Deshpande, A., and Dumitras, T. Catching worms, trojan horses and pups: Unsupervised detection of silent delivery campaigns. In NDSS (2017).
  • [51] Locasto, M., Parekh, J., Keromytis, A., and Stolfo, S. Towards collaborative security and p2p intrusion detection. In Information Assurance Workshop, IAW (2005).
  • [52] Mihai Christodorescu, S. J. Static analysis of executables to detect malicious patterns. Tech. rep., The University of Wisconsin, Madison, 2006.
  • [53] Miller, B., Kantchelian, A., Tschantz, M. C., Afroz, S., Bachwani, R., Faizullabhoy, R., Huang, L., Shankar, V., Wu, T., Yiu, G., Joseph, A. D., and Tygar, J. D. Reviewer integration and performance measurement for malware detection. In Proceedings of the 13th International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment - Volume 9721 (New York, NY, USA, 2016), DIMVA 2016, Springer-Verlag New York, Inc., pp. 122–141.
  • [54] Mutz, D., Valeur, F., Vigna, G., and Kruegel, C. Anomalous system call detection. ACM Trans. Inf. Syst. Secur. 9, 1 (feb 2006), 61–93.
  • [55] Nagaraja, S., Mittal, P., Hong, C.-Y., Caesar, M., and Borisov, N. Botgrep: Finding p2p bots with structured graph analysis. In Proceedings of the 19th USENIX Conference on Security (Berkeley, CA, USA, 2010), USENIX Security’10, USENIX Association, pp. 7–7.
  • [56] Ng, A. Y., Jordan, M. I., and Weiss, Y. On spectral clustering: Analysis and an algorithm. In ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS (2001).
  • [57] Papernot, N., McDaniel, P., Goodfellow, I., Jha, S., Celik, Z. B., and Swami, A. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security (New York, NY, USA, 2017), ASIA CCS ’17, ACM, pp. 506–519.
  • [58] Paxson, V. Bro: A system for detecting network intruders in real-time. Comput. Netw. 31, 23-24 (1999), 2435–2463.
  • [59] Robertson, W., Maggi, F., Kruegel, C., and Vigna, G. Effective anomaly detection with scarce training data. In Proceedings of the Network and Distributed System Security Symposium (NDSS) (2010).
  • [60] Shin, S., Xu, Z., and Gu, G. EFFORT: Efficient and Effective Bot Malware Detection. In Proceedings of the 31st Annual IEEE Conference on Computer Communications (INFOCOM’12) Mini-Conference (March 2012).
  • [61] Shinoda, Y., Ikai, K., and Itoh, M. Vulnerabilities of passive internet threat monitors. In Proceedings of the 14th Conference on USENIX Security Symposium (2005).
  • [62] Shmatikov, V., and Wang, M.-H. Security against probe-response attacks in collaborative intrusion detection. In the Workshop on Large Scale Attack Defense (2007).
  • [63] Somayaji, A., and Forrest, S. Automated response using system-call delays. In Proceedings of the 9th Conference on USENIX Security Symposium - Volume 9 (Berkeley, CA, USA, 2000), SSYM’00, USENIX Association, pp. 14–14.
  • [64] Sommer, R., and Paxson, V. Outside the closed world: On using machine learning for network intrusion detection. In the IEEE Symposium on Security and Privacy (2010).
  • [65] Šrndić, N., and Laskov, P. Practical evasion of a learning-based classifier: A case study. In the IEEE Symposium on Security and Privacy (2014).
  • [66] Tang, A., Sethumadhavan, S., and Stolfo, S. Unsupervised anomaly-based malware detection using hardware features. In Research in Attacks, Intrusions and Defenses. 2014.
  • [67] Vallender, S. Calculation of the wasserstein distance between probability distributions on the line. Theory of Probability & Its Applications 18, 4 (1974), 784–786.
  • [68] Vlachos, V., Androutsellis-Theotokis, S., and Spinellis, D. Security applications of peer-to-peer networks. Comput. Netw. 45, 2 (2004).
  • [69] von Luxburg, U. A tutorial on spectral clustering. Statistics and Computing (2007).
  • [70] Wagner, D., and Soto, P. Mimicry attacks on host-based intrusion detection systems. In the ACM Conference on Computer and Communications Security (2002).
  • [71] Warrender, C., Forrest, S., and Pearlmutter, B. Detecting intrusion using system calls: alternative data models. In Proceedings of the IEEE Symposium on Security and Privacy (1999).
  • [72] Xie, Y., Kim, H., O’Hallaron, D. R., Reiter, M. K., and Zhang, H. Seurat: A pointillist approach to anomaly detection. In Recent Advances in Intrusion Detection (2004).
  • [73] Xu, H., Caramanis, C., and Mannor, S. Outlier-robust pca: the high-dimensional case. IEEE transactions on information theory 59, 1 (2013), 546–572.
  • [74] Xu, W., Qi, Y., and Evans, D. Automatically evading classifiers: A case study on pdf malware classifiers. In Network and Distributed Systems Symposium (2016).
  • [75] Yen, T.-F., Oprea, A., Onarlioglu, K., Leetham, T., Robertson, W., Juels, A., and Kirda, E. Beehive: Large-scale log analysis for detecting suspicious activity in enterprise networks. In Proceedings of the 29th Annual Computer Security Applications Conference (2013), ACSAC ’13, ACM, pp. 199–208.
  • [76] Zhang, Z., Li, J., Manikopoulos, C. N., Jorgenson, J., and Ucles, J. Hide: a hierarchical network intrusion detection system using statistical preprocessing and neural network classification. In the IEEE Workshop on Information Assurance and Security (2001).
  • [77] Zhao, Y., Xie, Y., Yu, F., Ke, Q., Yu, Y., Chen, Y., and Gillum, E. Botgraph: Large scale spamming botnet detection. In Proceedings of the 6th USENIX Symposium on Networked Systems Design and Implementation (Berkeley, CA, USA, 2009), NSDI’09, USENIX Association, pp. 321–334.


Appendix A Clustering Results

While Count-GD is fragile, clustering GDs are inaccurate in the early stages of infection. This is why prior work [75] uses clustering to identify high-priority incidents from security logs offline for human analysis (instead of as an always-on GD) – this use case is complementary to an always-on global detector. We also quantify the detection rate of a recent clustering GD [75] in the waterhole case study. We observed similar detection results (very low AUC) in the Symantec Wine case study.

First, we reduce the dimensionality of the 390-dimensional FVs by projecting them onto the top 10 PCA components, which retain 95.72% of the data variance. Second, we use an adaptation of the K-means clustering algorithm that does not require specifying the number of clusters in advance [75, 72, 44]. Specifically, the algorithm consists of the following three steps: (1) select a vector at random as the first centroid and assign all vectors to this cluster; (2) find the vector furthest away from its centroid (following Beehive [75], we use L1 distance), make it the center of a new cluster, and reassign every vector to the cluster with the closest centroid; and (3) repeat step 2 until no vector is further away from its centroid than half of the average inter-cluster distance.
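The three steps above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments or in Beehive [75]: the function name `adaptive_kmeans`, the choice to keep the selected vectors as fixed centroids (no mean recomputation), and the rule of always splitting at least once (the inter-cluster distance is undefined for a single cluster) are assumptions.

```python
import numpy as np

def adaptive_kmeans(X, seed=None):
    """K-means-style clustering that grows the number of clusters itself.

    Assumption: centroids are the selected data vectors and are not
    recomputed as cluster means. Distances are L1, as in the text.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Step 1: a random vector is the first centroid; all vectors join it.
    centroids = [X[rng.integers(n)]]
    labels = np.zeros(n, dtype=int)
    while len(centroids) < n:
        C = np.stack(centroids)
        # L1 distance from every vector to its own centroid.
        dist_own = np.abs(X - C[labels]).sum(axis=1)
        far = int(np.argmax(dist_own))
        if dist_own[far] == 0:
            break  # every vector coincides with its centroid
        if len(centroids) > 1:
            # Step 3: stop once no vector is further from its centroid
            # than half of the average inter-cluster (centroid) distance.
            pair = np.abs(C[:, None, :] - C[None, :, :]).sum(axis=2)
            upper = np.triu_indices(len(centroids), k=1)
            if dist_own[far] <= 0.5 * pair[upper].mean():
                break
        # Step 2: the furthest vector seeds a new cluster; reassign all.
        centroids.append(X[far])
        C = np.stack(centroids)
        labels = np.abs(X[:, None, :] - C[None, :, :]).sum(axis=2).argmin(axis=1)
    return labels, np.stack(centroids)

# Toy data: two tight, well-separated groups (hypothetical values).
X = np.array([[0, 0], [0, 1], [1, 0],
              [100, 100], [100, 101], [101, 100]], dtype=float)
labels, centroids = adaptive_kmeans(X, seed=0)
```

On this toy input the loop stops after one split, because the remaining within-cluster distances are far below half the distance between the two centroids.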

The evaluation settings of the clustering algorithm match exactly the settings where Shape-GD detects infected neighborhoods with 99% confidence. Specifically, the algorithm clusters the data that we collected in a 17,178-node neighborhood under a waterhole attack within 30 seconds. As we have already demonstrated (Section 7.1), Shape-GD starts detecting malware when 107 nodes are compromised in the waterhole attack (Figure 8).

Clustering does not fare well. It partitions the waterhole dataset into 30 clusters. We observe three large clusters that aggregate most of the benign FVs. However, the algorithm fails to find small ’outlying’ clusters consisting of predominantly malicious data. Each cluster heavily mixes benign and malicious data, hence the clustering approach suffers from poor discriminative ability, i.e. it is unable to separate malicious and benign samples.

Note that the clustering algorithm enforces an explicit ordering across the clusters. That is, the algorithm forms a new cluster around the FV that is furthest away from its cluster centroid. Thus, the earlier a cluster is created, the more suspicious it is. By design of the clustering algorithm, the clusters are subject to deeper analysis in order of their suspiciousness. This inherent ordering allows us to build a receiver operating characteristic (ROC) curve (Figure 12) and compute a typical metric for a binary classifier, the Area Under the Curve (AUC), by averaging across 10 runs. The AUC reaches only 48.3% in the waterhole case study.
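As a rough illustration of how a cluster ordering turns into an AUC value, the sketch below scores each FV by the creation order of its cluster and computes the AUC as the probability that a malicious FV outranks a benign one (ties count one half). The helper name `auc_from_scores`, the scoring rule (negated cluster index), and the toy labels are assumptions made for this example only.

```python
import numpy as np

def auc_from_scores(scores, is_malicious):
    """AUC as P(score of a malicious FV > score of a benign FV), ties = 0.5."""
    scores = np.asarray(scores, dtype=float)
    is_malicious = np.asarray(is_malicious, dtype=bool)
    pos, neg = scores[is_malicious], scores[~is_malicious]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("need at least one malicious and one benign sample")
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (len(pos) * len(neg)))

# Hypothetical example: cluster creation index per FV (0 = created first).
cluster_index = np.array([0, 0, 1, 2])
is_malicious = np.array([True, False, False, False])
# Earlier clusters are treated as more suspicious, so negate the index.
auc = auc_from_scores(-cluster_index, is_malicious)
```

An AUC near 0.5, such as the 48.3% reported above, means the ordering is no better than ranking FVs at random.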

This experiment illustrates the failure of the traditional recipe of dimensionality reduction plus clustering. There is a fundamental reason for this – the neighborhoods we seek to detect are small compared to the total number of nodes in the system. Optimization-based algorithms that exploit density, including K-Means and related algorithms, fail to detect small clusters in high dimensions, even under dimensionality reduction. The reason is that the dimensionality reduction is either explicitly random (e.g., as in Johnson-Lindenstrauss type approaches), or, if data-dependent (like PCA), effectively independent of small clusters, as these represent very little of the energy (the variance) of the overall data. Spectral clustering style algorithms [69, 56, 25] are also notoriously unable to deal with clusters of highly unbalanced sizes, and in particular, are unable to find small clusters.

Shape-GD also reduces dimensionality but does so after neighborhood filtering. This amplifies the impact of small neighborhoods. The combination of dimensionality reduction, small-neighborhood-amplification, and then aggregation represents a novel approach to this detection problem, and our experiments validate this intuition.

Figure 12: (Waterhole case study) Receiver operating curve shows detection accuracy of the clustering-based malware detector [75]. Its Area Under the Curve (AUC) parameter averaged for 10 runs reaches only 48.3%; such low AUC value makes it unusable as a global detector.

Appendix B Shape-GD Classifiers

In both case studies, local detectors (LDs) analyze executable files; however, they use different file abstractions: static file analysis (Symantec Wine), because the original files are unavailable, and dynamic traces of executed system calls (waterhole case study). Both LD algorithms leverage state-of-the-art techniques in automated malware detection. Specifically, each LD algorithm uses both its internal state and the current feature vector (FV) to generate an alert if it judges that the FV corresponds to malware.

Although feature extraction is specific to each case study, the LDs employ similar binary classification algorithms for malware detection: Boosted Decision Trees [24] (Symantec Wine case study) and Random Forest (waterhole case study). These algorithms achieve the best performance on the training data set among the classifiers from a prior survey [23] and scale up well to process millions of FVs.

Symantec Wine. We adapt an LD from the prior work [53]. It primarily relies on a lightweight file analysis, which scales well when processing millions of downloads per day. Specifically, the LD extracts syntactic features from a file and applies a binary classification algorithm that labels a file as either malicious or benign.

The LD is designed to run existing commercial tools such as TRID, ClamAV, Symantec on a binary file, analyze statically imported libraries and functions, detect common packers, check whether a file is digitally signed or not, and collect its binary metadata. VirusTotal provides outputs of these tools as a single file-level report, which we directly use as LD’s input.

Waterhole. In the waterhole case study, the LD analyzes system calls executed by a program. It transforms a continuously evolving 390-dimensional time series of Windows system calls into a discrete-time sequence of feature vectors (FVs). This is accomplished by chunking the continuous time series into $\Delta$-second intervals and representing the system call trace over each interval as a single $d$-dimensional vector. $d$ is typically a low dimension, reduced down from 390 using PCA (Appendix A, for example, projects FVs onto the top 10 PCA components).
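The chunk-then-project pipeline can be sketched as below. This is an illustration under stated assumptions, not the LD's actual feature extraction: representing each interval by per-system-call counts, the event format `(timestamp, syscall_id)`, and the helper names are all hypothetical, and the interval length and target dimension are left as parameters.

```python
import numpy as np

def syscall_feature_vectors(events, n_syscalls, interval):
    """Chunk (timestamp, syscall_id) events into fixed-length intervals.

    Assumption: each interval is summarized by a count per system call.
    """
    events = list(events)
    if not events:
        return np.zeros((0, n_syscalls))
    n_intervals = int(max(t for t, _ in events) // interval) + 1
    fvs = np.zeros((n_intervals, n_syscalls))
    for t, syscall_id in events:
        fvs[int(t // interval), syscall_id] += 1
    return fvs

def pca_project(fvs, d):
    """Project centered FVs onto their top-d principal components."""
    centered = fvs - fvs.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:d].T

# Toy trace with 3 distinct system calls and 1-second intervals.
events = [(0.2, 0), (0.7, 1), (1.5, 1)]
fvs = syscall_feature_vectors(events, n_syscalls=3, interval=1.0)
low_dim = pca_project(fvs, d=1)
```

In a deployment the PCA basis would be fit once on training data and reused, rather than refit on each batch as this toy example does.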

Domain name classifier. In addition to file-level LDs, we employ a domain name classifier in the Symantec Wine case study to extract the attributes used to form neighborhoods. The domain name classifier analyzes VirusTotal domain reports and identifies domains that are likely to distribute malicious files. Its input is VirusTotal domain-level reports, which aggregate domain classifications produced by other commercial tools such as Dr. Web, Websense ThreatSeeker, and VirusTotal. Each of those tools categorizes a domain based on its content. The number of categories ranges from 55 (Dr. Web) up to 451 (VirusTotal), and they include classes such as social networks, banking, ads, and government.

The domain-name classifier applies a one-hot encoding scheme to represent categorical data as fixed-length feature vectors. Specifically, it creates an all-zero feature vector with the number of elements equal to the total number of categories (767-dimensional feature vectors in our case) and sets a one in each position corresponding to an assigned category. Because the assigned categories are not necessarily mutually exclusive, a vector may contain several ones.
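A minimal sketch of this encoding is shown below. The function name `encode_categories` and the four-entry vocabulary are hypothetical; the real vocabulary would hold all 767 categories, and unknown categories are simply skipped here as an assumption.

```python
import numpy as np

def encode_categories(assigned, vocabulary):
    """Return a 0/1 vector with ones at the positions of assigned categories."""
    index = {category: i for i, category in enumerate(vocabulary)}
    fv = np.zeros(len(vocabulary), dtype=np.uint8)
    for category in assigned:
        if category in index:  # assumption: ignore categories not in the vocabulary
            fv[index[category]] = 1
    return fv

# Hypothetical vocabulary; categories need not be mutually exclusive.
vocabulary = ["ads", "banking", "government", "social networks"]
fv = encode_categories(["ads", "banking"], vocabulary)
```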

Appendix C Vector-Histogram Implementation

In the Symantec Wine case study, Shape-GD deals with two types of alert-FVs: file and domain alert-FVs. Thus, it builds two separate vector-histograms per neighborhood and then concatenates them into a single vector-histogram. The file-level vector-histogram has dimensionality 10x50, i.e. each file alert-FV is projected onto a 10-dimensional basis and binned into 50 bins along each dimension. Similarly, the domain vector-histogram has dimensionality 100x5, i.e. each domain alert-FV is projected onto a 100-dimensional basis and binned into 5 bins along each dimension. The algorithm then concatenates the two matrix-shaped vector-histograms: it flattens each into a 500-dimensional vector in row-major order and appends the second to the first, so the resulting vector has 1,000 dimensions.

In the waterhole experiment, Shape-GD projects alert-FVs onto a 10-dimensional basis and bins the projections into 50 bins along each dimension. Thus, the vector-histogram has dimensionality 10x50.
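The construction described in this appendix can be sketched as follows. The sketch makes several assumptions that are not stated in the text: the bin range `[lo, hi]` is fixed and out-of-range projections are clipped into the edge bins, the bases are random orthonormal matrices standing in for the learned ones, and the input data is synthetic. Row normalization follows the description of histograms normalized to unit area.

```python
import numpy as np

def vector_histogram(alert_fvs, basis, n_bins, lo=-3.0, hi=3.0):
    """Project alert-FVs onto `basis` and bin every projected coordinate.

    Returns an (n_dims, n_bins) matrix; each row is a normalized histogram.
    """
    proj = np.clip(alert_fvs @ basis, lo, hi)  # shape (n_alerts, n_dims)
    n_dims = basis.shape[1]
    hist = np.zeros((n_dims, n_bins))
    for dim in range(n_dims):
        counts, _ = np.histogram(proj[:, dim], bins=n_bins, range=(lo, hi))
        hist[dim] = counts
    totals = hist.sum(axis=1, keepdims=True)
    return hist / np.maximum(totals, 1.0)

def concatenate_histograms(file_hist, domain_hist):
    """Flatten both histograms in row-major order and append them."""
    return np.concatenate([file_hist.ravel(order="C"),
                           domain_hist.ravel(order="C")])

rng = np.random.default_rng(0)
# Synthetic stand-ins for alert-FVs and projection bases.
file_fvs = rng.normal(size=(200, 30))
file_basis = np.linalg.qr(rng.normal(size=(30, 10)))[0]
domain_fvs = rng.integers(0, 2, size=(200, 767)).astype(float)
domain_basis = np.linalg.qr(rng.normal(size=(767, 100)))[0]

file_hist = vector_histogram(file_fvs, file_basis, n_bins=50)        # 10x50
domain_hist = vector_histogram(domain_fvs, domain_basis, n_bins=5)   # 100x5
combined = concatenate_histograms(file_hist, domain_hist)            # 1,000 dims
```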

Appendix D ShapeScore

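We developed a Wasserstein-based distance, the ShapeScore function, to detect neighborhoods with a high malware concentration. ShapeScore quantifies how much a current vector-histogram, $H_B$, differs from a reference histogram, $H_{\mathrm{ref}}$, which is generated during the training phase using only the false-positive FVs of the LDs, following the procedure for generating a vector-histogram outlined in Section 4.3. ShapeScore is thus the distance of a neighborhood from a benign reference histogram – a high score indicates potential malware.

The ShapeScore of the accumulated set of FVs, $B$, is given by the sum of the coordinate-wise Wasserstein distances [67] between $H_B = (H_B(1), \dots, H_B(L))$ and $H_{\mathrm{ref}} = (H_{\mathrm{ref}}(1), \dots, H_{\mathrm{ref}}(L))$, where $L$ is the number of dimensions of the vector-histogram.

In other words,

$$\mathrm{ShapeScore}(B) = \sum_{l=1}^{L} d_W\big(H_B(l),\, H_{\mathrm{ref}}(l)\big),$$

where for two scalar distributions $p$ and $q$ over the same $b$ bins the Wasserstein distance [67, 16] is given by

$$d_W(p, q) = \sum_{i=1}^{b} \left| \sum_{j=1}^{i} \big(p(j) - q(j)\big) \right|.$$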

This Wasserstein distance serves as an efficiently computable one dimensional projection, that gives us a discriminatively powerful metric of distance [67, 20]. Because the Wasserstein distance computes a metric between distributions – for us, histograms normalized to have total area equal to 1 – it is invariant to the number of samples that make up each histogram. Thus, unlike count-based algorithms, it is robust to estimation errors in community size.
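A small sketch of the score follows. It uses the standard closed form of the one-dimensional Wasserstein distance between two histograms on the same bins, namely the summed absolute difference of their cumulative sums (with unit bin width). The function names and toy histograms are assumptions for illustration.

```python
import numpy as np

def wasserstein_1d(p, q):
    """1-D Wasserstein distance between two normalized histograms
    defined over the same bins (unit bin width)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.abs(np.cumsum(p - q)).sum())

def shape_score(hist, ref_hist):
    """Sum of coordinate-wise Wasserstein distances between two
    vector-histograms of shape (n_dims, n_bins)."""
    return float(sum(wasserstein_1d(h, r) for h, r in zip(hist, ref_hist)))

# Moving all mass across two bins costs 2.
d = wasserstein_1d([1.0, 0.0, 0.0], [0.0, 0.0, 1.0])

# Normalized histograms depend on shape, not on sample count:
small = np.array([[2.0, 2.0]]) / 4.0      # 4 samples
large = np.array([[50.0, 50.0]]) / 100.0  # 100 samples
same_shape_score = shape_score(small, large)
```

The last two lines illustrate the invariance to neighborhood size: two histograms with the same shape but different sample counts are at distance zero.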

Finally, to determine whether a neighborhood has malware present, Shape-GD performs hypothesis testing. If ShapeScore is greater than a threshold $\gamma$, Shape-GD declares a global alert, i.e., the algorithm predicts that there is malware in the neighborhood. The robustness threshold $\gamma$ is computed via standard confidence interval or cross-validation methods with multiple sets of false-positive FVs.
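One plausible way to set such a threshold is to take an upper quantile of the scores obtained on held-out sets of false-positive FVs. The sketch below assumes this empirical-quantile rule, and its function names and sample scores are hypothetical; the text only says that standard confidence interval or cross-validation methods are used.

```python
import numpy as np

def robustness_threshold(benign_scores, confidence=0.99):
    """Empirical quantile of ShapeScores computed on benign-only
    (false-positive) FV sets. Assumption: a quantile rule is used."""
    return float(np.quantile(np.asarray(benign_scores, dtype=float), confidence))

def global_alert(score, threshold):
    """Declare a global alert when the score exceeds the threshold."""
    return bool(score > threshold)

# Hypothetical ShapeScores from five benign reference neighborhoods.
benign_scores = [0.1, 0.2, 0.3, 0.4, 0.5]
threshold = robustness_threshold(benign_scores, confidence=0.5)
alert = global_alert(0.9, threshold)
```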

Appendix E Symantec Wine: Real-Time Detection

Though we do not discuss real-time detection results in the main text, we designed Shape-GD to act as a real-time detector. In this section we do a deep dive into the real-time detection in terms of individual files and compromised machines. We compare Shape-GD against the local detectors [53] and we also present neighborhood detector’s results for completeness to better understand Shape-GD’s real-time detection. Unfortunately, we have to exclude from the comparison the downloader detector [49] because it is not designed to be a real-time detector and the authors did not share the source code with us. We use the standard metrics for comparison: precision, recall, and F-1 score.

E.1 File-level real-time detection

Figure 13: File-level dynamic behavior. Left to right: raw statistics, LD-level statistics, NBD-level statistics, Shape-GD-level statistics.

We start with the analysis of the temporal distribution of the download events (Figure 13) to visualize file downloads over time. Every time Shape-GD runs, it analyzes download events within a neighborhood time window (NTW), which is set to 150 days in our experiments. Therefore, we represent the intensity of downloads over time as the number of downloads within each NTW. Specifically, for each timestamp we compute the total number of file downloads and the number of malicious file downloads within the previous NTW (Figure 13, upper panel). For example, the value in Figure 13 labeled 01/2011 includes file downloads from 06/2010 until 01/2011. We also visualize the total number of distinct downloads and the number of distinct malicious downloads (Figure 13, lower panel).

Every point on these curves characterizes the number of files Shape-GD has to deal with when operating in a real-time detection mode. The large gap between the black and the red curves shows that only a small percentage of files in the Symantec Wine dataset is malicious. Shape-GD manages to filter out most benign files from further analysis to reduce the overall false positive rate.

Looking more closely at the plots, we notice that file downloads in the Wine dataset exhibit a nonuniform pattern over time. The total number of downloads increases from January 2008, reaches its peak (51 million downloads per NTW) within the NTW ending in October 2010, and decreases after that. The temporal pattern of distinct downloads differs slightly: the intensity of distinct downloads reaches a plateau (4.74 million per NTW) in September 2010 and remains at approximately the same level until April 2011. However, malicious files are responsible for only a small percentage of all downloads: at most 1.43 million total malicious downloads and at most 27 thousand unique malicious downloads.

Note that the low-intensity ends of the distribution pose an obstacle for Shape-GD because they contain too few correlated file downloads. For this reason we discard file downloads before June 2008. We therefore run Shape-GD for the first time on the neighborhood window spanning the interval from 06/2008 until 01/2009 and label the results with the ‘01/2009’ timestamp.

Local detectors. When we analyze the temporal behavior of local detectors (Figure 13), we notice an anti-correlation between the total number of unique downloads and the LDs’ precision. The peak of unique downloads corresponds to a large number of benign downloads. Therefore, when LDs process them, they output a large number of false positive alerts, which results in a precision drop (down to less than 5%). However, the recall stays in the range of 84% – 95% because it depends only on the LDs’ ability to detect malicious files. The F-1 score, being the harmonic mean of precision and recall, is dominated by the lower of the two, which here is precision; this is why LDs have a mostly low F-1 score over a long period of time (between 9% and 66%).
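A short worked example shows why the F-1 score follows the lower of the two metrics. The input values below are illustrative round numbers taken from the ranges quoted above (precision near 5%, recall near 90%), not measured data points.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Illustrative values: very low precision, high recall.
low_precision_f1 = f1_score(0.05, 0.90)   # about 0.095
balanced_f1 = f1_score(0.80, 0.80)        # about 0.80
```

With 5% precision, even 90% recall yields an F-1 score of only about 9.5%, which is consistent with the lower end of the range reported for the LDs.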

Neighborhood detector. Before analyzing Shape-GD’s real-time detection, we briefly discuss the neighborhood detector’s detection performance. We assume that the neighborhood detector (Figure 13) labels all the files within malicious neighborhoods as malicious. Like local detectors, the neighborhood detector suffers from low precision, but the underlying cause is different. The neighborhood classifier is supposed to label a neighborhood malicious if it contains more than 5% malicious files, so most files in a flagged neighborhood are usually benign. Thus, when the neighborhood detector conservatively labels all of those files malicious, it suffers from a high false positive rate and, consequently, low precision. In this sense, the neighborhood detector is conservative by design. The neighborhood detector also inadvertently filters out some malicious files, which makes its recall lower than that of the LDs.

Shape-GD. Compared to local detectors, Shape-GD boosts precision and inherits slightly lower recall from the neighborhood classifier because it aggregates LDs’ predictions collected only across suspicious neighborhoods (Figure 13). Shape-GD achieves high precision because it has a much lower false positive rate: many benign files are already filtered out by the neighborhood detector. Thus LDs running within suspicious neighborhoods analyze fewer benign files than LDs in the traditional deployment scenario. At the same time, Shape-GD has slightly lower recall than both LDs and the neighborhood detector because Shape-GD labels a file as malicious only if it is contained within a suspicious neighborhood and a local detector raises a file-level alert. However, the neighborhood and domain name classifiers are imperfect – they may fail to correctly label malicious neighborhoods and domains, respectively. Therefore, Shape-GD does not aggregate LDs’ output across all malicious files, which results in a slightly lower recall. Shape-GD’s F-1 score lies between its precision and recall, which are close to each other, and it is much higher than the F-1 scores of local detectors and the neighborhood detector.

To quantitatively compare Shape-GD with local detectors, we compute the area under the F-1 curve. In the case of file detection, Shape-GD achieves a 96.6% higher area under the F-1 curve than the local detector.

E.2 Machine-level real-time detection

Figure 14: Machine-level dynamic behavior. Left to right: raw statistics, LD-level statistics, NBD-level statistics, Shape-GD-level statistics.

Machine-level statistics (Figure 14) are similar to the file-level statistics – only a small percentage of machines is compromised within each NTW. The number of machines and the number of compromised machines reach their peak values of 1.43 million and 126.9 thousand, respectively, in October 2010, i.e. fewer than 8.9% of machines are compromised at the peak.

Overall, we observe higher values of precision and recall for all detectors (Figure 14) because, when interpreting detection results at the machine level, the detectors do not have to be very precise – they need to detect at least one malicious file on a machine, and file-level false positives on a particular machine do not count if that machine is infected.
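The effect described above can be made concrete with a small sketch that lifts file-level labels to machines: a machine counts as infected if it downloaded at least one malicious file, and as flagged if at least one of its files was flagged. The function name, the `(machine, file)` download format, and the toy data are assumptions for illustration.

```python
def machine_level_metrics(downloads, malicious_files, flagged_files):
    """Compute machine-level precision and recall from file-level labels.

    downloads: iterable of (machine_id, file_id) pairs.
    """
    infected, flagged = set(), set()
    for machine_id, file_id in downloads:
        if file_id in malicious_files:
            infected.add(machine_id)
        if file_id in flagged_files:
            flagged.add(machine_id)
    tp = len(infected & flagged)
    fp = len(flagged - infected)
    fn = len(infected - flagged)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Machine m1 holds one malicious file and one benign file that was
# wrongly flagged; machine m2 holds only an unflagged benign file.
downloads = [("m1", "bad"), ("m1", "good1"), ("m2", "good2")]
precision, recall = machine_level_metrics(
    downloads, malicious_files={"bad"}, flagged_files={"bad", "good1"})
```

At the file level this detector has 50% precision (one of two flagged files is benign), yet at the machine level precision is 100% because the false positive sits on an infected machine.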

Similar to the file-level detection results, local detectors suffer from low precision because of the high number of false positives. However, precision is significantly higher – its mean value reaches 41%, as opposed to a mean value of 19% for file-level detection. Such a dramatic difference arises because file-level false positives on compromised machines do not affect detectors’ precision at the machine level. In both cases, the recall curves exhibit similar behavior.

We observe a similar trend for the neighborhood detector – the mean precision value is 48% versus 12.5% in the case of file-level detection. The recall value remains in the range of 36% – 92%. Finally, Shape-GD brings the precision curve up at the cost of slightly lowering the recall – this is exactly the same effect that we see in the case of file-level malware detection.

Overall, Shape-GD achieves better results at the machine level than at the file level, which means that it can identify infected machines earlier and more robustly than individual malware samples. In the case of real-time detection, Shape-GD’s main competitor is a local detector. However, Shape-GD’s area under the F-1 curve is 28.6% higher than that of the local detector.

Appendix F Aggregate Machine-level Detection

Figure 15: Machine-level aggregate analysis.
Figure 16: (Waterhole attack) Compared to pure time-based NF, the structural filtering algorithm improves Shape-GD’s performance by aggregating alerts on a per-server basis.

In the main text we present file-level aggregate results. For completeness, here we describe machine-level detection results as well; however, we mainly focus on the new trends that are not observed at the file level. As before, we consider a machine to be compromised (or infected) if it has downloaded at least one malicious file. Note that Shape-GD is meant to be a file detector, not a machine-level detector.

False positive rate. We notice two trends. First, the machine-level false positive rate is higher than the file-level FP rate for all detectors: if a detector mislabels a single benign file that has been downloaded on multiple clean machines, all of those machines become false positives, which can dramatically raise the false positive rate. Second, a pairwise comparison of the detectors shows that their relative FP rates change. For example, the downloader detector has only a 1.53 times lower FP rate than Shape-GD at the machine level, compared to a 7.1 times difference at the file level. The downloader detector’s results worsen mainly because it often mislabels benign files that are frequently downloaded on multiple clean machines, so those machines are counted as false positives. Surprisingly, the neighborhood detector’s FP rate reaches almost 80%, which makes it completely unusable; for this reason we exclude it from further discussion.

True positive rate. In contrast to the FP rate, the TP rate does not exhibit a single trend – the direction in which it moves depends on the particular detector. The downloader detector’s TP rate drops by almost 3 times because the majority of machines in the Wine dataset are infected by non-downloaders (malware that does not download other files). As a result, the downloader detector misses almost 87% of infected machines.

Shape-GD’s TP rate shows the opposite trend: it increases by 6.7 percentage points over file-level detection, because Shape-GD searches for correlated malicious downloads and is thus likely to detect similar malware that infects multiple machines. As a result, spatial correlations between malware downloads boost the TP rate from 78.03% to 84.71%.

F-1 score. Overall, Shape-GD achieves the best FP/TP trade-off, with the highest F-1 score (60.06%). The downloader detector demonstrates the poorest detection results, with the lowest F-1 score (17.08%), mainly due to its low TP rate.
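For reference, the sketch below assumes the F-1 score here is the standard harmonic mean of precision and recall, with the TP rate playing the role of recall; the text does not state the definition explicitly. The helper `implied_precision` is hypothetical and only inverts that formula. Any precision obtained from it is a consequence of the assumption, not a figure reported in this appendix.

```python
def f1_score(precision, recall):
    """Standard F-1: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def implied_precision(f1, recall):
    """Solve F1 = 2PR / (P + R) for P (hypothetical helper).

    Only meaningful when 2 * recall > f1.
    """
    return f1 * recall / (2 * recall - f1)


# Round trip using the machine-level Shape-GD numbers quoted above,
# under the assumption that the TP rate is the recall.
p = implied_precision(0.6006, 0.8471)
reconstructed_f1 = f1_score(p, 0.8471)
```

Because F-1 is a harmonic mean, it sits closer to the smaller of its two inputs, which is consistent with the downloader detector’s low TP rate pulling its score down.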

Time to detection. The average time to detection increases slightly for Shape-GD (from 20.33 days to 28.67 days), but it is still almost 3 times lower than that of the downloader detector, because Shape-GD makes a decision about a file without waiting for it to download other files.

Appendix G Time to Detection Using Structural Information

A waterhole attack imposes a logical structure on nodes (beyond their time of infection): it infects only the clients that access a compromised server. This structure suggests that temporal neighborhoods can be further refined based on the specific server accessed by a client (i.e., grouping clients that visit a server into one neighborhood).

To analyze the effect of such structural filtering on GD’s performance, we vary filtering from coarse-grained (no structural filtering, only time-based filtering) to fine-grained (aggregating alerts across clients accessing each server separately) (Figure 16). Specifically, the aggregation parameter changes from 50 servers down to 1. As before, we measure detection in terms of the minimum number of infected nodes that lead to raising a global alert. We also consider three NTW values: 25, 50, and 100 seconds.
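The grouping described above can be sketched as follows. This is a minimal illustration, assuming integer server ids, alert records of the form (timestamp, server id, feature vector), and contiguous server groups; the function and parameter names are hypothetical and are not Shape-GD’s actual implementation.

```python
from collections import defaultdict


def build_neighborhoods(alerts, ntw_seconds, servers_per_group):
    """Group alert feature vectors by time window and server group.

    alerts: iterable of (timestamp_sec, server_id, feature_vector).
    ntw_seconds: neighborhood time window (NTW) length in seconds.
    servers_per_group: aggregation parameter. With 50 servers, a value
        of 50 puts all servers in one group (time-based filtering only),
        while 1 gives each server its own neighborhood.
    """
    neighborhoods = defaultdict(list)
    for timestamp, server_id, feature_vector in alerts:
        window = int(timestamp // ntw_seconds)
        # Assumption: servers are grouped by contiguous integer id.
        group = server_id // servers_per_group
        neighborhoods[(window, group)].append(feature_vector)
    return dict(neighborhoods)


# Toy alerts (hypothetical): four alert-FVs across three servers.
toy_alerts = [(3.0, 0, "fv0"), (10.0, 1, "fv1"), (30.0, 0, "fv2"), (12.0, 49, "fv3")]
coarse = build_neighborhoods(toy_alerts, ntw_seconds=25, servers_per_group=50)
fine = build_neighborhoods(toy_alerts, ntw_seconds=25, servers_per_group=1)
```

With the same NTW, the fine-grained setting spreads the same alert-FVs over more neighborhoods, so each neighborhood holds fewer of them.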

Structural filtering improves time to detection by 5.82x, 4.07x, and 3.75x for 25-, 50-, and 100-second windows, respectively. Interestingly, structural filtering requires Shape-GD to use longer NTWs than before: small NTWs (such as 6 seconds from the last sub-section) no longer supply a sufficient number of alert-FVs for Shape-GD to operate robustly. Even though structural filtering with a 25-second NTW improves detection by 5.82x over temporal filtering with a 25-second NTW, the number of infected nodes at detection time is 139.9, which is higher than the 107 infected nodes for temporal filtering with a 6-second NTW (Figure 9). Temporal and structural filtering thus present different trade-offs between detection time and work performed by Shape-GD; their relative performance is affected by the rate at which true and false positive FVs are generated.