A baseline for unsupervised advanced persistent threat detection in system-level provenance

06/17/2019 ∙ by Ghita Berrada, et al.

Advanced persistent threats (APT) are stealthy, sophisticated, and unpredictable cyberattacks that can steal intellectual property, damage critical infrastructure, or cause millions of dollars in damage. Detecting APTs by monitoring system-level activity is difficult because manually inspecting the high volume of normal system activity is overwhelming for security analysts. We evaluate the effectiveness of unsupervised batch and streaming anomaly detection algorithms over multiple gigabytes of provenance traces recorded on four different operating systems to determine whether they can detect realistic APT-like attacks reliably and efficiently. This report is the first detailed study of the effectiveness of generic unsupervised anomaly detection techniques in this setting.




1 Introduction

For the past few years, damaging security/data breaches have frequently made the headlines (gootman2016opm; silver2014jpmorgan; lee2014german; Karchefsky2017). These breaches are all examples of “advanced persistent threats” (APTs). APTs are long-running, stealthy attacks designed to penetrate specific target systems, carry out either pre-determined or dynamically updated instructions from an adversary, and persist (while avoiding detection) for as long as required to accomplish the adversary’s goals, such as data theft (silver2014jpmorgan; gootman2016opm) or corruption of the target organization’s data and damaging of critical systems.

Security experts warn that APTs are now “part and parcel of doing business” (auty2015anatomy) and concede that it would be unrealistic for all such attacks to be prevented and blocked (smith2013life; maisey2014moving; auty2015anatomy), partly because even the best designed security systems are bound to have flaws and partly because the targeted nature of the attacks means that the adversaries will persistently try to gain access to the target’s system, adapting and changing their approaches if need be, until they reach their goal or the cost of succeeding far outweighs the benefits to be gained. As a result, the experts consider that, while adopting state-of-the-art prevention techniques is a must, the focus should shift to continuously monitoring the systems, detecting APTs in a timely fashion and minimizing their damage.

Traditional security software and measures (e.g. anti-virus software, system security policies) generally fail to detect APTs since APTs tend to mimic normal business logic and rely on actions that respect social norms (e.g. work schedule of targeted users) or system security policies. Moreover, the fact that APTs are long-running campaigns that consist of multiple steps further complicates their detection, in particular when relying on event logs and audit trails that only provide partial information on temporally and spatially localized events.

Provenance-tracking has been proposed as a basis for security (e.g. provenance-based access control (park2012provenance)). It has been suggested that mining provenance data to analyze and identify causal relationships among system activities could help identify security threats and malicious actions, such as data exfiltration, that might go undetected with policy-driven approaches and other classical perimeter defence-based methods (jewell2011host; zhang2012track; awad2016data; jenkinson2017applying).

As appealing as the idea of monitoring provenance-like records to aid security sounds, there are numerous challenges to making it a reality. Beyond the issues linked with recording the provenance itself (e.g. level of provenance granularity, fault tolerance, trustworthiness of the recorded trace (jenkinson2017applying)), the recorded provenance traces are expected to be large in volume, with anomalous system activity (if any) likely to constitute only a very small fraction of the recorded traces. Analyzing provenance traces to identify anomalous activity that would suggest an ongoing APT attack is a typical “needle in a haystack” problem, further compounded by the variety of possible APT patterns and the lack of available fully annotated data. Typical supervised learning techniques therefore cannot be used to detect (rare) APT patterns. Furthermore, unsupervised anomaly detection over streaming graphs is challenging (graph-anomaly). We know of only one paper on anomaly detection over streaming provenance graph data (streamspot), but this approach relies on an initial training stage over “normal” example graphs, i.e. it is semi-supervised.

In an operational security scenario, it is critical to be able to provide actionable information quickly. Security analysts can usually identify and forensically investigate suspicious behavior (such as processes that have been subverted or created by an attacker) once it is brought to their attention. However, in typical system traces, each day of activity may lead to a gigabyte or more of provenance trace information, corresponding to hundreds or thousands of processes, almost all of which are benign. In this paper, we consider the key subproblem of quickly identifying unusual process activity that warrants manual inspection. Our approach summarizes process activity using categorical or binary features such as the kinds of events performed by a process, the process executable name and parent executable name, and IP addresses and ports accessed. We focus on categorical data because attacks typically involve rare combinations of such attributes.

This report evaluates the effectiveness of several algorithms for unsupervised, categorical anomaly detection:

  • FPOutlier (or FPOF) (fpoutlier)

  • Outlier Degree (or OD) (outlierdegree)

  • One-Class Classification by Compression (or OC3) (krimp-ad)

  • CompreX (comprex)

  • Attribute Value Frequency (or AVF) (avf; onepassavf)

All of these algorithms except for AVF are based on mining frequent itemsets or association rules and using these results to assign anomaly scores. Moreover, these mining-based techniques are all batch algorithms: in a first pass the data is mined and analyzed (sometimes taking a lengthy period) and in a second pass the scores are assigned. AVF is instead based on a simple analysis of the frequencies of the attributes. The original paper proposing AVF also only considered a batch setting, but later work (onepassavf) showed how to modify AVF into a one-pass, streaming algorithm. We therefore refer to batch and streaming AVF in this paper.

We apply our work to provenance traces containing example APT attacks (on several different host operating systems) produced as part of the DARPA Transparent Computing program, in which attacks constitute as little as 0.01% of the data. We evaluated all of the above algorithms in batch mode. Our experiments show that on our dataset, AVF has anomaly detection performance comparable to or better than the itemset mining-based techniques, typically finding at least some parts of the attack within the top 1% or even 0.1% of the ranking.

We also conducted experiments comparing batch and streaming AVF, using a modified form of the one-pass algorithm of (onepassavf) that allows blocks of different sizes, in order to study how detection performance is affected by streaming. Our experiments comparing batch and streaming AVF with different block sizes show that there is little degradation in anomaly detection performance. Although our work (like any anomaly-detection technique) does not guarantee to find all attacks, our contribution demonstrates that unsupervised anomaly detection can help find APT-style attacks that currently go unnoticed, enabling analysts to focus their efforts where they are most needed.

This report does not propose new anomaly detection algorithms, and does not evaluate all of the possible algorithms for unsupervised anomaly detection on categorical data. All of the algorithms evaluated either have publicly-available implementations, or were easy to re-implement. It is possible that better results could be obtained using other algorithms that we have not yet tried; nevertheless, our results do establish a baseline against which new approaches (or evaluation of other existing algorithms) can be measured.

The structure of the rest of this paper is as follows. Section 2 presents the overall system architecture and outlines our approach. Section 3 reviews AVF and our variant of streaming AVF. Section 4 presents an experimental evaluation of the effectiveness of the different approaches, establishing a baseline for unsupervised anomaly detection on this data. Section 5 summarizes related work on APTs and anomaly detection. Section 6 concludes and suggests directions for future work.

2 Overview

2.1 Provenance trace analysis

In this section, we situate our work as part of a realistic provenance-based security scenario. Figure 1 outlines the architecture of our system, which is designed to interoperate with several different (provenance) recorders (gehani12middleware; jenkinson2017applying), each running on a different operating system and generating different styles of provenance graphs recording system activity (albeit in a common format). In this paper, we consider four sources, running on Android, Linux, BSD and Windows operating systems.

Our system receives the provenance graph data from each recording system, as a stream of JSON records in a binary format, and ingests the data into a graph database, Neo4J. In addition, ingestion performs some additional data integration and deduplication steps to deal with some idiosyncrasies among the sources. The different systems use the shared data model in different ways, for example storing information in different places, at different levels of granularity, or just not populating some fields. We remove some information that is not consistently recorded and reorganize other information so that typical queries can be written portably across data sources. Deduplication is important because the recorders add their own unique identifiers for operating system processes and other objects. This is necessary to avoid ambiguity given that operating system-issued process identifiers or filenames are not unique over long periods of time (i.e. days). However, some recording systems create multiple records referring to the same process (or other object) with different unique identifiers. The ingester attempts to detect and merge these duplicates, using heuristics such as “two processes with the same process ID and started at the same time are identical”.
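The quoted heuristic can be illustrated with a minimal sketch. The record fields (`uuid`, `pid`, `start_time`) and the function name are hypothetical and chosen only for illustration; the actual ingester and its data model are not shown in this report.

```python
def merge_duplicates(records):
    """Map every recorder-issued identifier to one canonical identifier.

    Assumption: two records describe the same process when they share
    both the OS process ID and the start time.
    """
    canonical = {}  # (pid, start_time) -> first uuid seen for that key
    alias = {}      # every uuid -> its canonical uuid
    for rec in records:
        key = (rec["pid"], rec["start_time"])
        canonical.setdefault(key, rec["uuid"])
        alias[rec["uuid"]] = canonical[key]
    return alias


records = [
    {"uuid": "a1", "pid": 4242, "start_time": 1000},
    {"uuid": "b7", "pid": 4242, "start_time": 1000},  # duplicate of a1
    {"uuid": "c3", "pid": 4242, "start_time": 9000},  # same pid, reused later
]
alias = merge_duplicates(records)
```

Keying on the pair rather than on the process ID alone keeps a later process that reuses the same ID distinct, which is the ambiguity the recorder-issued identifiers were introduced to avoid.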

Once the graph data has been ingested, we extract Boolean-valued datasets called contexts from the graph. Each context represents an aspect of process behavior as a Boolean-valued vector. As a simple example, we could use attributes corresponding to event types (read, write, etc.) with value ‘1’ meaning that the process performed at least one event of that type and ‘0’ otherwise; the exact number of such events is ignored. We discuss additional contexts later in this section. Contexts can be extracted using queries over the fully-ingested data, for forensic analysis, or by incrementally maintaining appropriate data structures and periodically emitting new records. Each context can then be run through the anomaly detection algorithms described in Section 3, yielding a score for each process.

Figure 1: Architecture of our approach
Figure 2: Example of attack provenance graph

These scores are provided to the user interface (UI) frontend, which allows analysts to explore the graph using queries, or search for anomalies based on the scores. Figure 2 shows a typical provenance graph created using the UI graph visualization system, as a result of a successful attack detection. This illustration highlights that even fairly simple activities can yield complex graphs involving multiple read/write or network access events.

Our system has participated in several DARPA exercises in concert with the recording systems, in which realistic background activity was simulated on each system, and realistic APT-style attacks were performed, yielding several gigabytes of raw trace data, corresponding to tens of millions of nodes and edges. We have manually annotated the data to indicate the processes constituting the attacks for each of these scenarios. Typically, the number of processes involved in an attack is very small: for example, in the largest dataset, there are over 247,000 processes (representing seven days of activity), and only 25 of them (i.e. around 0.01%) are involved in the attack. Even if we optimistically assume an analyst can recognize an attack process in just 10 seconds, screening 200,000 processes would take over 23 days. Thus, although attacks are often easy to recognize once brought to the attention of an analyst, the sheer volume of background activity makes it imperative to find ways to automatically direct attention to suspicious activity.

2.2 Contexts

We now give the details of the contexts that form the starting point for our proposed algorithms. In our approach, the context definitions are the only places where domain knowledge about the data is used. We consider the following contexts:

  • ProcessEvent: The integrated traces use event types such as open, close, exit, etc. to describe process activity in an OS-independent way. A process $p$ has attribute $e$ if $p$ ever performs an event of type $e$ (disregarding the exact number of events).

  • ProcessExec: The attributes are executable names $x$, for example ls or sudo. A process $p$ has attribute $x$ if $p$ is an instance of executable $x$.

  • ProcessParent: The attributes are again executable names $x$. A process $p$ has attribute $x$ if $p$ is a child process of an executable named $x$.

  • ProcessNetflow: The attributes are IP addresses $a$ and port numbers $q$. A process $p$ has attributes $a$ and $q$ if it ever communicates with IP address $a$ at port $q$.

  • ProcessAll: the combination of all of the above contexts, with attributes renamed to avoid any ambiguity (for example between an executable name used as a ProcessExec attribute and the same name used as a ProcessParent attribute).

These contexts may seem rather simplistic. For example, it seems intuitive to also consider files accessed by processes as attributes. Also, it would make sense to consider more complex attributes that look for patterns that are known to be suspicious, such as downloading a file, executing it, and then deleting it. However, our goal is to minimize the amount of fine-tuning needed to obtain useful results. There is also a trade-off between granularity of attributes and performance: the more attributes we track, the more work needs to be done at each step. Nevertheless, it would be worthwhile, in subsequent work, to consider richer contexts or well-chosen attributes that encode domain knowledge about what activities are suspicious.

Each of these contexts can also be extracted from the data incrementally, as the data is ingested. For each process encountered, we construct an attribute vector with value 1 for each attribute the process has (in a given context) and 0 otherwise. The resulting sequence of vectors constitutes a dataset which we use as the starting point for the algorithms in the next section.
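As a rough illustration of how such a dataset could be built, the sketch below turns a list of (process, attribute) observations into 0/1 vectors for a ProcessEvent-style context. The function name and the input format are assumptions made for this example; for simplicity it collects the attribute set in a first pass rather than incrementally.

```python
def extract_context(observations):
    """Build one Boolean vector per process from (process_id, attribute) pairs.

    observations: list of (process_id, attribute) tuples, e.g. one tuple
    per recorded event. Repeated observations do not change the result,
    since only presence or absence of an attribute is recorded.
    Returns (attributes, rows), where rows maps process_id -> list of 0/1.
    """
    attributes = sorted({attr for _, attr in observations})
    index = {attr: j for j, attr in enumerate(attributes)}
    rows = {}
    for pid, attr in observations:
        vec = rows.setdefault(pid, [0] * len(attributes))
        vec[index[attr]] = 1
    return attributes, rows


observations = [
    ("p1", "read"), ("p1", "write"), ("p1", "read"),
    ("p2", "read"),
    ("p3", "exit"),
]
attributes, rows = extract_context(observations)
```

Here `p1` performed two reads, yet its vector only records that a read occurred, matching the decision to disregard event counts.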

3 Algorithms

We consider datasets to be sequences of $m$-dimensional Boolean vectors, where there are $n$ vectors and $m$ attributes. Likewise, we consider data sources to be streams of $m$-dimensional vectors. In either case, we consider a typical record $x_i$ at position $i$ and write $x_{ij}$ for the value of attribute $j$ in $x_i$. We assume for simplicity that all attributes are Boolean-valued. It is not difficult to generalize to finite sets of attribute values. We also assume that the number of possible attributes is fixed.

Example 1 (Running example).

To illustrate our approach, we introduce a small running example with four processes $p_1, \ldots, p_4$ and three attributes abc.com, xyz.com and evil.com, corresponding to network addresses accessed by the processes. In this (extremely simplistic) example, $p_1$ and $p_2$ are innocuous activity and access both abc.com and xyz.com, while $p_3$ is a naive attacker that only accesses evil.com and $p_4$ is a more sophisticated attacker that accesses all three in order to attempt to camouflage its behavior. This behavior corresponds to the following dataset:

| Process | abc.com | xyz.com | evil.com |
|---------|---------|---------|----------|
| $p_1$   | 1       | 1       | 0        |
| $p_2$   | 1       | 1       | 0        |
| $p_3$   | 0       | 0       | 1        |
| $p_4$   | 1       | 1       | 1        |

We first review the various batch-only approaches and the original Attribute Value Frequency (AVF) algorithm (avf), in which processes are assigned lower scores if they contain infrequently-occurring attributes. We present the original algorithm in a batch processing form, i.e. where we assume we have all of the data before computing scores. We show how to modify it to obtain an online algorithm that gives a good approximation of the results of the batch algorithm, and allows for a choice of different window sizes. This algorithm is a mild variation of the one-pass AVF algorithm (onepassavf).

3.1 Batch anomaly detection techniques

In this section we briefly review the batch anomaly detection algorithms from the literature that are used in our evaluation. These descriptions are not exhaustive; the respective research papers should be consulted for full details.

3.1.1 FPOutlier (FPOF)

The FPOutlier algorithm (fpoutlier) starts by mining frequent itemsets according to a support parameter minsupp. Then each object is assigned a score corresponding roughly to the number of frequent itemsets it contains. Thus, larger scores correspond to more occurrences of frequent itemsets, meaning that anomalous objects should have low scores. This approach seems well-suited to detect anomalies corresponding to expected, but missing, activity. However, objects that have unusual activity but also display a large number of common patterns may have high scores and not be considered anomalous. In addition, the fact that this approach has a tunable parameter is problematic in an unsupervised setting, since it means that we need to guess an appropriate value for this parameter in advance. We reimplemented FPOutlier using standard itemset mining libraries.
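A minimal sketch of this scoring scheme follows, assuming the commonly cited definition in which an object's score is the sum of the supports of the frequent itemsets it contains, divided by the total number of frequent itemsets. The brute-force miner and all function names are illustrative stand-ins, not the implementation used in the experiments. The data is the running example described earlier: two benign processes accessing abc.com and xyz.com, a naive attacker accessing only evil.com, and a camouflaged attacker accessing all three.

```python
from itertools import combinations


def frequent_itemsets(transactions, minsupp):
    """Brute-force frequent itemset mining; only suitable for tiny examples."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    result = {}
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            support = sum(1 for t in transactions if set(cand) <= t) / n
            if support >= minsupp:
                result[frozenset(cand)] = support
    return result


def fpof_scores(transactions, minsupp):
    """Score each transaction; lower scores are more anomalous."""
    fis = frequent_itemsets(transactions, minsupp)
    if not fis:
        return [0.0] * len(transactions)
    return [
        sum(s for itemset, s in fis.items() if itemset <= t) / len(fis)
        for t in transactions
    ]


transactions = [
    {"abc.com", "xyz.com"},              # benign
    {"abc.com", "xyz.com"},              # benign
    {"evil.com"},                        # naive attacker
    {"abc.com", "xyz.com", "evil.com"},  # camouflaged attacker
]
scores = fpof_scores(transactions, minsupp=0.5)
```

With `minsupp=0.5` the naive attacker receives the lowest score, while the camouflaged attacker scores above the benign processes, which illustrates the weakness noted above: unusual activity accompanied by many common patterns is not flagged.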

3.1.2 Outlier Degree (OD)

The Outlier Degree algorithm (outlierdegree) also starts by mining frequent itemsets as well as high-confidence rules, so there are two parameters, minsupp governing the minimum support of the itemsets and minconf governing the minimum confidence of the rules. Then each object is scored by applying the high-confidence rules to it, and assigning a score corresponding roughly to the difference between the object’s actual behavior and expected behavior (according to the rules). For example, if $X \Rightarrow Y$ is a high-confidence rule and an object displays behavior $X$ but not $Y$, this will contribute to the score. High scores correspond to larger differences between actual and expected behavior, so are more anomalous. Like FPOutlier, this approach seems more likely to consider missing, but expected, behaviors to be anomalous, and could miss anomalies that consist of rare behaviors that do not occur frequently enough to participate in rules. Also, the presence of two tunable parameters is even more problematic from the point of view of unsupervised anomaly detection. We reimplemented OD using standard itemset and rule mining libraries.
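The idea can be sketched as follows. This is one plausible reading rather than the evaluated implementation: rules are restricted to single-item consequents, the expected behavior of an object is taken to be its closure under the mined rules, and the score is the fraction of that closure the object is missing. The normalization and all names are assumptions made for illustration.

```python
from itertools import combinations


def mine_rules(transactions, minsupp, minconf):
    """Brute-force rules (lhs -> y) with one consequent item; tiny data only."""
    n = len(transactions)
    items = sorted(set().union(*transactions))

    def supp(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    rules = []
    for k in range(1, len(items)):
        for lhs in combinations(items, k):
            lhs = frozenset(lhs)
            lhs_supp = supp(lhs)
            if lhs_supp < minsupp:
                continue
            for y in items:
                if y in lhs:
                    continue
                both = supp(lhs | {y})
                if both >= minsupp and both / lhs_supp >= minconf:
                    rules.append((lhs, y))
    return rules


def outlier_degree(t, rules):
    """Fraction of the rule-implied (expected) behavior that t lacks."""
    closure = set(t)
    changed = True
    while changed:
        changed = False
        for lhs, y in rules:
            if lhs <= closure and y not in closure:
                closure.add(y)
                changed = True
    return len(closure - t) / len(closure) if closure else 0.0


# Four processes open, read and close a file; one never closes it.
transactions = [{"open", "read", "close"}] * 4 + [{"open", "read"}]
rules = mine_rules(transactions, minsupp=0.5, minconf=0.75)
degrees = [outlier_degree(t, rules) for t in transactions]
```

The process that opens and reads without closing is the only one with a non-zero degree, since the rules predict a close event that it never performs.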

3.1.3 One-Class Classification by Compression (OC3)

OC3 (krimp-ad) is based on a compression technique for identifying “interesting” itemsets, implemented using the Krimp algorithm (krimp). Essentially, the idea is to first mine frequent itemsets from the data, and then identify a subset of the itemsets that help to compress the data well. Then, each object is assigned an anomaly score corresponding to its estimated compressed size. If the compression algorithm has done a good job, then objects exhibiting commonly occurring patterns will compress well, and anomalies will not. OC3 can take a minsupp support parameter, but parameter tuning is typically not necessary because the compression algorithm will filter out any non-useful itemsets; therefore we used the smallest possible minsupp setting in our experiments. The implementation of Krimp is available and we modified it slightly to perform OC3-style anomaly scoring.

3.1.4 CompreX

CompreX (comprex) is perhaps the most sophisticated approach studied to date. It is based on compression, like OC3, but uses a different compression strategy. CompreX searches for a partition of the attributes such that each set of attributes in the partition has high mutual information, so that compressing the attributes using a joint probability model is more effective than compressing the attributes independently. Since there are exponentially many partitions to consider, CompreX starts with the finest partition (all attributes are in their own class) and greedily searches for pairs of classes to merge. CompreX has no tuning parameters and was shown experimentally to be competitive or superior in anomaly detection performance to Krimp/OC3 on several datasets. However, CompreX’s default search strategy is quadratic in the number of attributes; therefore, it was not usable on contexts with over 20-30 attributes.

3.2 Attribute Value Frequency (AVF)

Attribute Value Frequency (AVF) (avf) is a non-parametric outlier detection technique appropriate for categorical data and was shown to be fast, scalable and accurate on a variety of standard data sets. The algorithm relies on the intuition that outliers in a dataset have attribute values that occur infrequently. Whether the attribute values in a data point are infrequent can be determined simply by computing the frequencies of the respective attribute values across the data.

Given a dataset of size $n$ with $m$ attributes, where $x_{ij}$ is the value of attribute $j$ in record $x_i$, we write $c_j(v)$ for the number of occurrences of attribute value $v$ for attribute $j$, i.e. $c_j(v) = |\{ i : x_{ij} = v \}|$. Then, the AVF score of a data point $x_i$ is:

$$\mathrm{AVF}(x_i) = \frac{1}{m} \sum_{j=1}^{m} c_j(x_{ij})$$

That is, when $x_{ij} = 1$, the contribution to the score for attribute $j$ is $c_j(1)$, the number of occurrences of a $j$-value of 1, and when $x_{ij} = 0$, the contribution is $c_j(0)$, the number of occurrences of a $j$-value of 0. The initial multiplication by $1/m$ effectively averages the counts, so $1 \le \mathrm{AVF}(x_i) \le n$, but such scaling has no effect on the relative ordering among scores in the batch setting. Lower AVF scores indicate more unusual behavior.

Example 2.

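Continuing our running example, we calculate the frequencies of the three attributes as $c_{\mathrm{abc}}(1) = c_{\mathrm{xyz}}(1) = 3$ and $c_{\mathrm{evil}}(1) = 2$ (and hence $c_{\mathrm{abc}}(0) = c_{\mathrm{xyz}}(0) = 1$ and $c_{\mathrm{evil}}(0) = 2$). Thus, the AVF scores are:

$$\mathrm{AVF}(p_1) = \mathrm{AVF}(p_2) = \tfrac{1}{3}(3 + 3 + 2) = \tfrac{8}{3}, \qquad \mathrm{AVF}(p_3) = \tfrac{1}{3}(1 + 1 + 2) = \tfrac{4}{3}, \qquad \mathrm{AVF}(p_4) = \tfrac{1}{3}(3 + 3 + 2) = \tfrac{8}{3}$$

The naive attacker’s isolated access of evil.com, together with failure to mask its activity with common behavior, results in a lower score, while the more sophisticated attacker’s score is the same as that of the first two processes.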
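The batch computation can be sketched in a few lines: one pass counts how often each value occurs for each attribute, and a second pass averages, for every record, the counts of the values it holds. The function name and the encoding of the running example as 0/1 tuples (in the order abc.com, xyz.com, evil.com) are choices made for this illustration.

```python
def avf_scores(data):
    """Batch AVF for Boolean data; lower scores are more anomalous."""
    m = len(data[0])
    # counts[j][v] = number of records whose attribute j has value v
    counts = [[0, 0] for _ in range(m)]
    for row in data:                      # first scan: count values
        for j, v in enumerate(row):
            counts[j][v] += 1
    return [                              # second scan: score records
        sum(counts[j][v] for j, v in enumerate(row)) / m
        for row in data
    ]


data = [
    (1, 1, 0),  # benign: abc.com and xyz.com
    (1, 1, 0),  # benign: abc.com and xyz.com
    (0, 0, 1),  # naive attacker: evil.com only
    (1, 1, 1),  # camouflaged attacker: all three
]
scores = avf_scores(data)
```

The naive attacker's record is the only one whose score falls below the others; the camouflaged attacker is indistinguishable from the benign processes on this data.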

Streaming AVF: Naive approach

A simple, but unfortunately too naive, approach to streaming the AVF algorithm is to maintain the attribute value counts incrementally as data is processed, and use the current counts to score each new transaction. That is, if $c_j^{(t)}$ are the counts calculated for $x_1, \ldots, x_t$, then a new record $x_{t+1}$ is scored against these running counts rather than against counts for the whole dataset.

However, because the counts are monotonically increasing, this means that the scoring will be heavily biased towards considering records appearing early in the dataset to be anomalous. For example:

Example 3.

Continuing our running example, we need to update the counts after each step. Thus, the AVF scores are:

In this (admittedly extreme) example, the first process is judged most anomalous, followed by , then and finally .
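The bias of the naive approach is easy to reproduce. The sketch below assumes that the counts are updated with the new record before it is scored (the alternative order shows the same effect) and feeds the same benign record five times; the function name is illustrative.

```python
def naive_streaming_avf(stream, m):
    """Score each record against the counts accumulated so far."""
    counts = [[0, 0] for _ in range(m)]  # counts[j][v], as in batch AVF
    scores = []
    for row in stream:
        for j, v in enumerate(row):      # assumption: update, then score
            counts[j][v] += 1
        scores.append(sum(counts[j][v] for j, v in enumerate(row)) / m)
    return scores


# Five identical, perfectly ordinary records.
scores = naive_streaming_avf([(1, 1, 0)] * 5, m=3)
```

Although all five records are identical, their scores grow with their position in the stream, so the earliest ones look the most anomalous purely because few records had been seen when they were scored.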

Streaming AVF

As observed by (onepassavf), the problem is that the “scale” of the AVF scores is not fixed in the streaming setting, since seeing an attribute whose value has occurred only once means something very different for the 5th record in the dataset than for the 5000th record.

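Instead, to compute AVF-like scores incrementally, we propose to use the frequency counts to estimate probabilities for each attribute. Writing $P_t(j)$ for the estimated probability that attribute $j$ has value 1 after $t$ records, we initially take $P_0(j) = 0$ since the data is typically sparse (having relatively few attribute values equal to 1); however, any other initial probability distribution could be used based on domain knowledge. Next, for each new record $x_{t+1}$ we adjust the probability of each attribute value being 1 after seeing $x_{t+1}$ as follows:

$$P_{t+1}(j) = \frac{t \cdot P_t(j) + x_{t+1,j}}{t + 1}$$

We then calculate the AVF score for the $(t+1)$st record as follows:

$$\mathrm{AVF}(x_{t+1}) = \frac{1}{m} \sum_{j=1}^{m} \begin{cases} P_{t+1}(j) & \text{if } x_{t+1,j} = 1 \\ 1 - P_{t+1}(j) & \text{if } x_{t+1,j} = 0 \end{cases}$$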

Note that in the batch setting, dividing the counts by $n$ and summing probabilities instead of counts would not affect the final results, because all the counts are divided by the same $n$. However, for the streaming setting, we update the attribute value probabilities after each step, so the results of AVF scoring will be different in the streaming setting.

Example 4.

Continuing our running example, we now update the probabilities after each step. Thus, the AVF scores are:

The naive attacker’s behavior results in a lower (more anomalous) score than the first process.
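One way to realize this probability-based scoring is sketched below. Several details are assumptions rather than statements of the evaluated implementation: the estimates start at zero, each estimate is the running frequency of value 1, a record is scored after the estimates have absorbed it, and records are processed one at a time instead of in blocks.

```python
def streaming_avf(stream, m):
    """One-pass AVF-like scores in [0, 1]; lower is more anomalous."""
    prob_one = [0.0] * m  # estimated probability that attribute j is 1
    scores = []
    for t, row in enumerate(stream):
        # Running-frequency update after seeing record t + 1.
        prob_one = [(t * p + v) / (t + 1) for p, v in zip(prob_one, row)]
        # Sum the estimated probability of each value the record holds.
        total = sum(p if v == 1 else 1.0 - p for p, v in zip(prob_one, row))
        scores.append(total / m)
    return scores


stream = [
    (1, 1, 0),  # benign
    (1, 1, 0),  # benign
    (0, 0, 1),  # naive attacker
    (1, 1, 1),  # camouflaged attacker
]
scores = streaming_avf(stream, m=3)
repeated = streaming_avf([(1, 1, 0)] * 5, m=3)
```

Because every contribution is a probability, the scale of the scores no longer depends on how many records have been seen: the repeated benign record keeps a constant score, and on the running example the naive attacker still scores below the first process.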

3.3 Analysis

As outlined already, the batch approach is implementable as two scans over the data, and the online approach can be implemented in a single, linear scan, where scoring each new record and updating the frequencies takes $O(m)$ time and space for $m$ attributes. Both algorithms just need to maintain the number of records and the counts or probabilities. Thus, for $n$ records, the overall time complexity of each algorithm is $O(nm)$ and the space required is $O(m)$. In our experiments, the number of attributes ranges from around 20 to over 14,000. Our approach may not scale well if the attributes are fine-grained and $m$ is much larger than $n$.

Another concern the reader might have is regarding arithmetic precision and overflow. If fixed-size (say, 32-bit) integers are used, then whenever we are in danger of overflowing we can rescale by dividing all of the counts by 2; this is exactly what is done in arithmetic coding (witten87cacm). Our implementation uses arbitrary-precision arithmetic.
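The rescaling trick can be sketched as follows. Python integers do not overflow, so the `limit` parameter below merely simulates a fixed-width counter; the function name and the choice to halve every counter at once are assumptions made for illustration.

```python
LIMIT = 2**31 - 1  # largest value of a signed 32-bit counter


def increment(counts, j, v, limit=LIMIT):
    """Increment counts[j][v], halving all counters first if it would overflow."""
    if counts[j][v] >= limit:
        for pair in counts:
            pair[0] //= 2
            pair[1] //= 2
    counts[j][v] += 1


# Demonstration with an artificially small limit.
counts = [[3, 1], [2, 2]]
increment(counts, 0, 0, limit=3)
```

Halving all counters together approximately preserves their ratios, which is what the relative ordering of scores depends on; integer division does lose a little precision for small counts.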

4 Experimental evaluation

4.1 Experimental setup

The experiments were run on a desktop with an Intel Core i7-6700 CPU (3.4 GHz), 16 GB RAM, running Ubuntu 16.04. The raw provenance trace data was ingested on a variety of different machines and the contexts used in the experiments were extracted and stored as CSV files (available at http://www.gitlab.com/adaptdata/e2). We do not report the experimental setup for the ingestion stage here in detail; however, it is easily able to keep up with the data in real-time (that is, ingestion of data representing 7 days of system activity takes much less than 7 days). Our experiments focus on evaluating the detection effectiveness and runtime cost of the anomaly detection algorithms on the given context data.

4.2 Datasets

Table 1 describes the different datasets used in our experiments. For each context (columns) and a given source (rows), we have the number of transactions (above) and attributes (below). The four datasets each consist of roughly seven days’ worth of activity in a DARPA evaluation of provenance-tracking systems, running on Windows, BSD, Linux and Android respectively.

The number of processes encountered in each system varies significantly: in particular, the Linux dataset records from 3–10 times as many distinct processes compared to the Windows or BSD datasets and up to 2400 times as many processes compared to Android. Some contexts are empty, e.g. ProcessParent for Android, where information about parent process relationships was unavailable. In general, among the base contexts, the ProcessEvent context usually has the largest number of processes, followed by ProcessExec and ProcessParent, while ProcessExec or ProcessNetflow have the largest number of attributes, followed by ProcessParent. There are 9 attack processes in the Android data (8.8%), 8 in the Windows data (0.04%), 13 in the BSD data (0.02%) and 25 in the Linux data (0.01%). Note that the size of the original dataset does not directly correlate with the number of processes or attributes. For example, the Android dataset is the largest but has the fewest processes and attributes, because the provenance recorder for Android records a great deal of low-level app activity and dynamic information flow tracking, which we do not analyze.

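| Dataset | Size (MB) | ProcessEvent | ProcessExec | ProcessParent | ProcessNetflow | ProcessAll | #attacks |
|---------|-----------|--------------|-------------|---------------|----------------|------------|----------|
| Windows | 743 | 17569 / 22 | 17552 / 215 | 14007 / 77 | 92 / 13963 | 17569 / 14431 | 8 |
| BSD | 288 | 76903 / 29 | 76698 / 107 | 76455 / 24 | 31 / 136 | 76903 / 296 | 13 |
| Linux | 2858 | 247160 / 24 | 186726 / 154 | 173211 / 40 | 3125 / 81 | 247160 / 299 | 25 |
| Android | 2688 | 102 / 21 | 102 / 42 | 0 / 0 | 8 / 17 | 102 / 80 | 8 |

Table 1: Description of the datasets used during the experiments. Each context cell gives the number of transactions / the number of attributes.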

4.3 Evaluation metrics

The methods that we evaluate output a ranking of processes according to their degree of suspiciousness (anomaly score). These methods do not explicitly classify or label entities as anomalous or normal, so it would not be appropriate to use metrics usually employed to evaluate classification methods. In addition, the data is unbalanced (depending on the dataset, at most 8.8% and as little as 0.01% of the data points belong to the attack class, i.e. are true positives), which constrains our choice of metrics. Accuracy, in particular, would not be an appropriate metric: given the extremely unbalanced nature of our data, a very high accuracy could be achieved, for an arbitrarily fixed threshold of processes, simply by classifying all samples as not being part of an attack. This would clearly not be acceptable. A high accuracy would not necessarily be an indicator of model quality: this is the accuracy paradox.


4.3.1 Normalized discounted cumulative gain

The normalized discounted cumulative gain metric (or nDCG for short) is a metric often used in information retrieval to assess the quality of a ranking.

Given a typical document search application, jarvelin2002cumulated argued that, from a user’s perspective, highly relevant documents are more valuable than marginally relevant documents, and a relevant document ranked high in the returned list of results is more valuable than an equally relevant document ranked lower in the list. A user may reasonably be assumed to scan the list of returned results from the beginning and to stop at some point that depends on the time available, the effort required and the information accumulated from documents already seen. It is therefore safe to assume that relevant documents located further down the list of returned results are unlikely to be seen by the user, as they require more time and effort to reach, and so are less valuable. Taking these facts into account, jarvelin2002cumulated introduced the nDCG measure.

We similarly argue that, in our application, processes that are part of an attack but are ranked very low by an anomaly detection technique are virtually useless to an analyst, since the analyst’s monitoring burden increases substantially with the number of processes to be checked (not to mention issues such as loss of trust in the automated monitoring system, discarding of its alerts, and the increased potential for misses and errors as the volume of data to monitor grows). Because of this, we believe nDCG to be an appropriate metric for our application.

To compute the nDCG, we start by computing a score called discounted cumulative gain or DCG. The basis of DCG is that each document/entity in the ranking is assigned a relevance score, which is discounted by a factor that grows logarithmically with its position/rank in the list of results. The DCG is therefore computed as follows:

$$\mathrm{DCG} = \sum_{i=1}^{n} \frac{rel_i}{\log_2(i + 1)}$$

where $n$ is the number of entities/documents in the list and $rel_i$ is the relevance score of the $i$-th entity/document in the list.

Since the length of result lists can vary and the DCG score does not take that into account, it is common to normalize the DCG score by the ideal DCG score (iDCG), which is simply the best achievable DCG score, i.e. the score that would be achieved if all relevant entities were at the top of the list (and in the case of different degrees of relevance, with the highest values of relevance at the very top). Assuming we have $R$ relevant entities in the list, each with relevance 1, we have:

$$\mathrm{iDCG} = \sum_{i=1}^{R} \frac{1}{\log_2(i + 1)}, \qquad \mathrm{nDCG} = \frac{\mathrm{DCG}}{\mathrm{iDCG}}$$

In our case, we only consider entities to be either relevant (processes that are part of an attack) or irrelevant (processes with normal behavior) and assign a relevance score of 1 to attack processes and of 0 to benign processes, and the idealized score results from ranking all attack processes at positions $1, \ldots, R$. The closer the nDCG score is to 1, the better the ranking.
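For binary relevance, the computation just described can be sketched as below, assuming the usual base-2 logarithmic discount $1/\log_2(i+1)$ at rank $i$. The function name and the convention that the input lists 0/1 labels in ranked order (most anomalous first) are choices made for this illustration.

```python
import math


def ndcg(relevance):
    """nDCG for a ranked list of 0/1 labels (1 = attack process)."""
    dcg = sum(rel / math.log2(i + 1)
              for i, rel in enumerate(relevance, start=1))
    n_relevant = sum(relevance)
    # Ideal ranking: all attack processes at positions 1..n_relevant.
    idcg = sum(1 / math.log2(i + 1) for i in range(1, n_relevant + 1))
    return dcg / idcg if idcg else 0.0


perfect = ndcg([1, 1, 0, 0])  # both attacks ranked first
mixed = ndcg([1, 0, 1, 0])    # one attack pushed down
worst = ndcg([0, 0, 1, 1])    # both attacks ranked last
```

The three rankings contain the same two attack processes, and the score drops as they move away from the top, which is the property that motivates using nDCG here.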

4.3.2 Area under curve

The area under ROC curve (AUC) is often used as a measure of anomaly detection performance; however, in the presence of sparse anomalies in large datasets, it does not appear to be a useful metric. The AUC can either overestimate the effectiveness of an algorithm (e.g. if all attacks are found at rank 900–1000 out of 200,000 then the AUC will be over 0.995 but the results are still nearly useless), or underestimate it (e.g. if half of the attacks are found in the top 10 and the other half at rank 1000, then the maximum AUC is around 0.5 even though the results might be very useful).

In our case, it would correspond to the proportion of processes with normal behavior ranked lower than processes that are part of an attack, computed as follows:

$$\mathrm{AUC} = \frac{1}{|R|\,|I|} \sum_{r \in R} \sum_{s \in I} \mathbb{1}\left[\mathrm{rank}(r) < \mathrm{rank}(s)\right]$$

where $R$ is the set of elements with a relevant label (i.e. elements that are part of an attack), $I$ is the set of elements with an irrelevant label (i.e. elements that have a normal behavior), and $\mathrm{rank}(r)$ (resp. $\mathrm{rank}(s)$) is the rank assigned to $r$ (resp. $s$) by the method to be evaluated. The best performance for a method under this metric (resp. the worst performance) is achieved with AUC of one (resp. of zero).
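As a rough check of this pairwise reading of the AUC, the sketch below counts, for each attack, the normal elements ranked below it, and reproduces the first example given earlier in this section (all attacks at ranks 900 to 1000 out of 200,000). The function name `pairwise_auc` is hypothetical, ranks are assumed to be distinct 1-based positions, and ties are not handled.

```python
from bisect import bisect_right


def pairwise_auc(attack_ranks, normal_ranks):
    """Fraction of (attack, normal) pairs in which the attack is ranked first.

    Ranks are 1-based positions in the anomaly ranking (1 = most anomalous).
    """
    normal_sorted = sorted(normal_ranks)
    total_pairs = len(attack_ranks) * len(normal_sorted)
    if total_pairs == 0:
        raise ValueError("need at least one attack and one normal element")
    # For each attack, count the normal elements with a larger rank.
    wins = sum(len(normal_sorted) - bisect_right(normal_sorted, r)
               for r in attack_ranks)
    return wins / total_pairs


# Example from the text: all attacks at ranks 900-1000 out of 200,000.
attacks = list(range(900, 1001))
attack_set = set(attacks)
normals = [r for r in range(1, 200_001) if r not in attack_set]
auc = pairwise_auc(attacks, normals)  # 199000 / 199899, just above 0.995
```

Each attack here has only 899 normal processes ranked above it, so the AUC is very high even though an analyst would need to inspect about a thousand processes to reach the attacks.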

4.4 Forensic anomaly detection

In this section we consider the following empirical question:

  • Q1: Can the five batch methods (FPOF, OD, OC3, CompreX, AVF) detect APT-style attacks effectively?

We first evaluate the effectiveness and performance of the batch version of AVF compared with several other offline techniques, such as FPOutlier (FPOF) (fpoutlier), Outlier-degree (OD) (outlierdegree), OC3 (krimp-ad), and CompreX (comprex).

FPOF and OD were reimplemented in Python according to the descriptions of the algorithms. We reused publicly-available implementations of OC3 and CompreX (http://eda.mmci.uni-saarland.de/prj/), implemented in C++ and Matlab respectively. The FPOF, OD and OC3 methods require setting some parameters, which is not the case for AVF or CompreX. For OC3, we used the lowest possible support parameter and used closed itemset mining to reduce the total number of itemsets considered in the mining stage. For FPOF and OD, we considered a range of parameter settings and report the best results obtained using any parameter setting.

We report the results of all algorithms running on the contexts described in Section 2.2 in Tables 2, 3, 4, 5 and 6. Some algorithms did not finish within a reasonable time; when this is the case we write DNF.

Source    FPOF   OD     OC3    CompreX   AVF
Windows   0.20   0.20   0.30   0.60      0.60
BSD       0.20   0.19   0.44   0.54      0.51
Linux     0.18   0.18   0.39   0.30      0.27
Android   0.29   0.33   0.74   0.82      0.85
Table 3: Evaluation of batch anomaly scoring: ProcessExec

Source    FPOF   OD     OC3    CompreX   AVF
Windows   0.15   0.15   0.28   DNF       0.28
BSD       0.15   0.15   0.46   DNF       0.35
Linux     0.18   0.18   0.31   DNF       0.44
Android   0.22   0.22   0.39   0.25      0.39
Table 4: Evaluation of batch anomaly scoring: ProcessParent

Source    FPOF   OD     OC3    CompreX   AVF
Windows   0.10   0.10   0.21   DNF       0.21
BSD       0.13   0.13   0.44   DNF       0.31
Linux     0.17   0.17   0.24   DNF       0.21
Table 5: Evaluation of batch anomaly scoring: ProcessNetflow

Source    FPOF   OD     OC3    CompreX   AVF
Windows   0.36   0.36   0.71   DNF       0.58
BSD       0.13   0.14   0.34   DNF       0.26
Linux     0.23   0.23   0.50   DNF       0.32
Android   0.42   0.36   0.67   DNF       0.48
Table 6: Evaluation of batch anomaly scoring: ProcessAll

Source    FPOF   OD     OC3    CompreX   AVF
Windows   DNF    DNF    0.64   DNF       0.53
BSD       0.21   0.19   0.70   DNF       0.52
Linux     0.18   0.18   0.46   DNF       0.30
Android   0.31   0.34   0.68   DNF       0.83
Table 2: Evaluation of batch anomaly scoring: ProcessEvent

FPOF and OD were not competitive on any dataset, even after trying several possible support and confidence parameter values and taking the maximum nDCG score. OC3 often produced the best (or tied) results, in 12 out of 19 scenarios. CompreX also produced good results when it was able to complete within a reasonable time; for wider contexts, it usually did not terminate within a few minutes. (comprex mention that CompreX could be run as an anytime algorithm, but the available implementation does not support this.) In terms of running time, FPOF, CompreX and OD were significantly more expensive than OC3 or AVF, typically running in minutes rather than seconds.

In general, nDCG scores were highest for the Android dataset and lowest for the Linux dataset, suggesting a rough (but unsurprising) correlation between the amount of data and difficulty of ranking attacks effectively. OC3 performed considerably better than any other technique on the Linux dataset. Likewise, no single context was consistently best, and considering all contexts joined together in ProcessAll was not always better than considering one of the base contexts. However, the OC3 and AVF scores for ProcessAll were usually close to the best obtained by any of the base contexts.

Figure 3: Forensic analysis results: Linux

To help build intuition regarding how the nDCG scores correspond to actual rankings, we visualize the results of AVF for Linux in Figure 3. This “band diagram” shows the positions of the attacks in the rankings obtained by AVF for the five contexts. The x-axis of the figure uses a logarithmic scale, so red lines far to the left represent attacks ranked within the top 10, then top 100, etc. As this figure illustrates, an nDCG score of 0.44 (the highest obtained by AVF on the Linux dataset) corresponds to two attacks found in the top 10, while scores of under 0.3 tend to correspond to the highest-ranked attacks occurring at rank 100–1000.

Overall, we can conclude that, while AVF does not always perform the best among the considered algorithms, it is competitive: its nDCG score was highest for 8 out of 19 scenarios, and second-highest in another 9. Moreover, AVF running on was the best (or a close second) in three out of four datasets.

4.5 Streaming anomaly detection

In this section we consider the following empirical questions:

  • Q2: Is the detection performance of streaming AVF competitive with batch AVF in terms of nDCG and AUC?

  • Q3: Is the runtime performance of streaming AVF competitive with batch AVF?

4.5.1 Detection performance

              Windows          BSD              Linux            Android
              nDCG    AUC      nDCG    AUC      nDCG    AUC      nDCG    AUC
Stream 1%     0.518   0.993    0.524   0.984    0.298   0.927    0.832   0.872
Stream 5%     0.490   0.984    0.524   0.984    0.298   0.928    0.828   0.857
Stream 10%    0.522   0.994    0.524   0.984    0.298   0.927    0.826   0.849
Stream 25%    0.496   0.985    0.525   0.984    0.298   0.928    0.828   0.858
Batch         0.527   0.996    0.524   0.984    0.298   0.927    0.834   0.878
Table 7: Summary of the nDCG and AUC performance of batch and streaming AVF on a single context for each dataset, and for block sizes of 1%, 5%, 10%, and 25%.

To evaluate the streaming version of AVF, we generated 10 randomly shuffled versions of each dataset and ran the streaming algorithm on each version. We used randomly shuffled data in order to avoid any dependence on a particular order of processing the data; analyzing the data in temporal order might produce better (or worse) results. We divided the datasets into blocks of various sizes (1%, 5%, 10% and 25% of the data) to investigate the effect of granularity on effectiveness and performance. For each dataset and block size, we computed the median ranking of each attack over the 10 shuffled runs. These median rankings are taken to be representative.
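The evaluation protocol in this paragraph can be sketched as below. This is only an illustration of the shuffle, block and median steps: `median_attack_ranks` and its parameters are hypothetical names, and `frequency_scorer` is a simple frequency-based stand-in introduced for the example, not the streaming AVF algorithm evaluated in the paper.

```python
import random
import statistics
from collections import Counter


def median_attack_ranks(records, attack_ids, score_stream,
                        n_runs=10, block_frac=0.05, seed=0):
    """Median rank of each attack over several shuffled streaming runs.

    records      -- dict mapping a process id to its set of attributes
    attack_ids   -- ids of the processes that are part of an attack
    score_stream -- hypothetical scorer: takes a list of blocks, each a list
                    of (id, attributes) pairs, and returns {id: score},
                    where a lower score means more anomalous
    """
    rng = random.Random(seed)
    items = list(records.items())
    block_size = max(1, round(block_frac * len(items)))
    ranks = {a: [] for a in attack_ids}
    for _ in range(n_runs):
        rng.shuffle(items)  # avoid depending on one processing order
        blocks = [items[i:i + block_size]
                  for i in range(0, len(items), block_size)]
        scores = score_stream(blocks)
        ordered = sorted(scores, key=scores.get)  # most anomalous first
        position = {pid: r for r, pid in enumerate(ordered, start=1)}
        for a in attack_ids:
            ranks[a].append(position[a])
    return {a: statistics.median(r) for a, r in ranks.items()}


def frequency_scorer(blocks):
    """Stand-in scorer (an assumption, not the paper's streaming AVF):
    mean relative frequency, so far, of each record's attributes."""
    counts, scores, seen = Counter(), {}, 0
    for block in blocks:
        # Update the attribute counts with the whole block, then score it.
        for _, attrs in block:
            counts.update(attrs)
        seen += len(block)
        for pid, attrs in block:
            scores[pid] = sum(counts[a] / seen for a in attrs) / max(1, len(attrs))
    return scores


# Toy data: 20 similar benign processes and one process with a rare attribute.
records = {f"p{i}": frozenset({"read", "write"}) for i in range(20)}
records["p_attack"] = frozenset({"read", "exec_shell"})
med = median_attack_ranks(records, ["p_attack"], frequency_scorer,
                          n_runs=10, block_frac=0.25)
```

Taking the median over runs, rather than the ranking from a single shuffle, limits the influence of any one unlucky ordering on the reported rank.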

We present nDCG and AUC results for one context only; these results are representative of the base contexts. Table 7 summarizes the nDCG and AUC metrics for the streaming algorithm (with four different block sizes) and for the batch algorithm (at the bottom). These results show that the nDCG scores for all four datasets are fairly stable, with only the Windows dataset displaying a degradation in nDCG score of more than 0.01. Likewise, the AUC scores of most streaming variants were close to those of the batch algorithm, with only the Windows and Android AUC scores decreasing by more than 0.01. Overall these results suggest that small block sizes do not significantly degrade the usefulness of the results of AVF scoring.
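The stability observations can be checked directly against the figures in Table 7. The snippet below is a small worked check that assumes each dataset's pair of columns is (nDCG, AUC) in that order; the names `DATASETS`, `STREAMING`, `BATCH` and `degraded` are introduced here for illustration only.

```python
DATASETS = ["Windows", "BSD", "Linux", "Android"]

# Rows of Table 7: one (nDCG, AUC) pair per dataset, in DATASETS order.
STREAMING = {
    "1%":  [(0.518, 0.993), (0.524, 0.984), (0.298, 0.927), (0.832, 0.872)],
    "5%":  [(0.490, 0.984), (0.524, 0.984), (0.298, 0.928), (0.828, 0.857)],
    "10%": [(0.522, 0.994), (0.524, 0.984), (0.298, 0.927), (0.826, 0.849)],
    "25%": [(0.496, 0.985), (0.525, 0.984), (0.298, 0.928), (0.828, 0.858)],
}
BATCH = [(0.527, 0.996), (0.524, 0.984), (0.298, 0.927), (0.834, 0.878)]


def degraded(metric, threshold=0.01):
    """Datasets whose worst streaming score is more than `threshold` below batch.

    metric -- 0 for nDCG, 1 for AUC
    """
    result = []
    for col, dataset in enumerate(DATASETS):
        worst = min(row[col][metric] for row in STREAMING.values())
        # Round to the table's three decimals to avoid floating-point noise.
        if round(BATCH[col][metric] - worst, 3) > threshold:
            result.append(dataset)
    return result


ndcg_degraded = degraded(0)  # ['Windows']
auc_degraded = degraded(1)   # ['Windows', 'Android']
```

The largest nDCG drop is 0.037 (Windows, 5% blocks) and the largest AUC drop is 0.029 (Android, 10% blocks); BSD and Linux are essentially unchanged at every block size.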

Figure 4: Percentage of processes seen versus percentage of attacks detected. Panels: (a) Windows, (b) BSD, (c) Linux, (d) Android.

Figure 4 plots the ratio of true positives found vs. ranking position, for the four different datasets. The red lines show the performance of the batch AVF algorithm while the blue lines show the streaming versions. (For the BSD dataset, the differences are not visible.) We can also gain a stronger intuition regarding the usefulness of the results from these figures: for example, for the Linux dataset we can see that the nDCG score of 0.298 corresponds to finding about half of the attacks in the first 1% of the rankings, while others are not found until 40%.
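A curve of this kind can be derived from the attack ranks alone. The sketch below is a minimal illustration with made-up ranks (not the paper's data) and a hypothetical function name `detection_curve`; it returns one point per attack, giving the percentage of processes inspected by the time that attack is reached.

```python
def detection_curve(attack_ranks, n_processes):
    """Points (percent of processes seen, percent of attacks detected)."""
    ordered = sorted(attack_ranks)
    return [(100.0 * rank / n_processes, 100.0 * found / len(ordered))
            for found, rank in enumerate(ordered, start=1)]


# Toy example: 4 attacks among 1,000 ranked processes.
curve = detection_curve([5, 10, 300, 400], n_processes=1000)
# curve == [(0.5, 25.0), (1.0, 50.0), (30.0, 75.0), (40.0, 100.0)]
```

In this toy ranking half of the attacks are found within the first 1% of processes and the last one only at 40%, the same shape as the Linux behaviour described above.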

4.5.2 Analysis time

Figure 5: Analysis time (batch AVF vs. streaming AVF). Panels: (a) Windows, (b) BSD, (c) Linux, (d) Android.

Figure 5 summarizes the time taken per run for both batch and streaming versions of AVF (the streaming times were obtained by taking the median of the times over the ten runs on shuffled inputs). Note that the y-axis uses a logarithmic scale. The running time is in general proportional to the amount of data in each context (number of rows × number of columns). In particular, the time needed for ProcessAll is often considerably longer than the times needed for the other contexts. The reason is that some contexts have many rows and few columns, while others have many columns and few rows. Combining them into ProcessAll yields a very sparse context with many zeros. We plan to investigate whether using a more succinct storage format for the contexts, or combining the scores of the subcontexts, might lead to better performance. The streaming execution times also increase, as expected, with the streaming block size.

5 Related work

Prior work on APTs is mostly concerned with describing/modeling the characteristics of an APT and its attack model (sood2013targeted; virvilis2013trusted; chen2014study), sometimes using case studies (Karchefsky2017). A few recent studies address the APT detection problem by constructing models of normal behavior against which incoming data is compared and flagged as anomalous if it deviates from the learned models. friedberg2015combating explain the shortcomings of current security solutions with regards to APT detection, in particular contending that preventive security mechanisms and signature-based methods are not enough to tackle the challenge of APTs, and propose an anomaly detection-based framework to detect APTs by learning a model of normal system behavior from host-based security logs and detecting deviations. siddiqui2016detecting use the fractal dimension as a feature to classify TCP/IP session data patterns into anomalous (and part of an APT) or normal patterns. moya15expert construct decision tree-based models of normal network activity based on features extracted from firewall logs, then use the learned models to classify incoming network traffic. Some work has also been done on the detection of specific patterns that might be part of an APT attack e.g. detection of data leakage/data exfiltration (jewell2011host; awad2016data) or detection of command and control (C&C) domains (niu2017identifying).

There is a considerable literature on intrusion and malware detection, which is mainly split in two approaches: misuse detection (e.g. (kumar1994pattern)) and anomaly detection (e.g. (ji2016multi)). The principle of misuse detection is to search for events (i.e. known attacks) that match predefined signatures and patterns. Methods relying on misuse detection can only detect attacks whose signature and patterns are known, which would be unsuitable for APT detection. By contrast, anomaly detection assumes abnormal behaviours can come in varied, potentially unknown, shapes and focuses on detecting activity that deviates from normal activity i.e. activity usually recorded on a particular host or network.

There are several comprehensive surveys of anomaly detection and outlier detection that consider categorical data, continuous data, and structured data (e.g. graphs) (anomaly; graph-anomaly). Of these approaches, graph anomaly detection appears the most relevant for our problem, but most of this work has considered special cases of graphs (e.g. undirected or unlabeled), whereas provenance graph data has rich structure (labeled nodes, labeled edges, multiple properties on nodes and edges). Anomaly detection approaches for provenance graphs reported so far rely on training on benign traces (streamspot), require user-provided annotations (sleuth), or assume that the background activity is highly regular (winnower). Another recent contribution by siddiqui18kdd shows that human-in-the-loop feedback can be used in a semi-supervised way to improve detection results over baseline unsupervised detectors over numerical data.

On the other hand, there are a number of generic approaches to anomaly detection for discrete (categorical) data (fpoutlier; outlierdegree; avf; ndi-od; krimp-ad; upc; comprex). Most of these approaches first mine the data for frequent itemsets or association rules, and all then perform anomaly scoring in a second pass over the data. A one-pass, streaming variant of AVF was presented by onepassavf. Some approaches, notably OC3 (krimp-ad) and CompreX (comprex), are based on the Minimum Description Length (MDL) principle (mdl). Both perform a preprocessing stage to find a compressed representation of the dataset, then consider the resulting compressed size of each record as its score. Since OC3 was often the most effective batch algorithm, we think it would be interesting to develop a streaming approach based on MDL, either by adapting the underlying Krimp compression algorithm (krimp) to support streaming anomaly detection, or by building on streaming compression techniques such as adaptive arithmetic coding (witten87cacm). The UPC algorithm of upc is also based on pattern mining and MDL, and is inherently a two-pass approach, but seeks a different kind of anomalies than AVF, OC3, and CompreX, consisting of unexpectedly rare combinations of frequent itemsets.

There are also some anomaly detection techniques for mixed categorical and numerical data (smartsifter; odmad) that could be applied to pure categorical data. The ODMAD algorithm (odmad), like most categorical techniques, performs an initial off-line pattern mining stage. To the best of our knowledge SmartSifter (smartsifter) is the only previous unsupervised online algorithm applicable to categorical data. SmartSifter incrementally maintains a histogram density model of the categorical data and, for each combination of attributes, a continuous distribution (such as a multivariate Gaussian mixture model) for the numerical attributes. SmartSifter’s running time is

where is the number of categorical attributes, the number of numerical attributes (i.e. dimension) and the number of components of the mixture model. Their experiments considered datasets with and , and it is unclear whether this approach can scale to large numbers of categorical attributes. In contrast, our proposals require only time to process each input record.

6 Conclusion

Detecting APT-style attacks in real-world settings is extremely difficult in general. In this paper, we investigate the feasibility of finding processes that may be part of such attacks by analyzing their behavior. We considered five different batch algorithms, one of which can also be adapted easily to a streaming setting. Our experiments showed that both batch and online approaches are effective in finding attacks and can analyze several days’ worth of activity (tens or hundreds of thousands of process summaries, sometimes with over ten thousand attributes) in a few minutes, a negligible cost compared to the time and effort needed to record and store this data. Moreover, our results are validated on provenance traces gathered from four different operating systems, subject to several different kinds of attacks; many of the attacks were typically ranked among the top 0.1-1%.

We believe that this work represents a significant contribution, in that it can provide a low-cost, yet effective line of defense in a larger provenance-based monitoring system, and establishes a baseline for comparison of more sophisticated (and time-consuming) techniques. Nevertheless, there are a number of areas for improvement. First, interpreting and analyzing the processes flagged for investigation is still mostly a manual process, motivating further support for identifying connections between the most anomalous processes. Second, it is also important to consider the (common) case when there is no attack. Since attacks are rare and, in a given trace, there are typically hundreds or thousands of anomalous processes that are not part of the attack, more work is needed to identify suitable thresholds to limit effort in this case. Finally, our approach assumes that the attacker is not aware of or able to manipulate the detection system; sophisticated attackers will naturally seek to either evade observation entirely or modify their behavior so as to minimize anomaly scores. Further research is needed on how to make anomaly detection robust even if attackers know how their activity is being monitored.


This material is based upon work partially supported by the Defense Advanced Research Projects Agency (DARPA) under contract FA8650-15-C-7557. Mookherjee was partially supported by a grant from LogicBlox, Inc.