With information systems playing ubiquitous and indispensable roles in many modern industries, including financial services and retail, cyber security is of the utmost importance in daily system management. Cyber-attacks happen every day, and incidents causing significant financial losses and leakage of sensitive customer data regularly appear in media headlines. Among these attacks, malware (malicious program) attacks are the most widespread and costly type, and they are rapidly growing to target more companies and organizations. According to a study published by Accenture (www.accenture.com/us-en/insight-cost-of-cybercrime-2017), 98% of the participating companies experienced malware attacks in 2016 and 2017, and these attacks cost companies an average of $2.4 million. Symantec also finds that the number of groups using destructive malware was up by 25% in 2018 (www.symantec.com/security-center/threat-report). With malware attacks imminent, protecting key information systems has become the top priority for many companies and organizations.
Malware/malicious program detection, as the first line of defense against attacks, mainly uses two types of approaches: signature-based and behavior-based. Signature-based approaches [Damodaran2017, David2017] check program signatures against a known malware database. Such signatures can be instruction sequences, system calls, loaded dynamic libraries, and so on. While signature-based approaches are widely adopted by antivirus software due to their time efficiency, they are mostly limited to detecting known malware and are vulnerable to evasion techniques such as binary obfuscation. Nowadays, such signature-based approaches are facing difficulties from zero-day malware attacks, which are highly polymorphic and rarely have known signatures [Damodaran2017].
Behavior-based approaches [Rieck2008, Domagoj2011, Jha2013, Bernardi2018, Saracino2018], as an improvement over the signature-based ones, learn program behaviors in terms of dependencies between system elements (i.e., system calls or API calls), and classify a program into either the malicious or the benign category using the learned model. The program behaviors captured by these approaches generally reflect real program intentions, thereby complementing static signatures that can easily be manipulated. On the other hand, behavior-based approaches incur prohibitively high training cost. In order to learn malicious program behavior, one must collect enough malware samples, and then execute them repeatedly in a controlled environment to observe their runtime behaviors. Insufficient sampling and observation of malicious behavior unavoidably limits the detection capability, and malware attacks can also evolve via adversarial learning to evade detection.
In contrast to those malware detection algorithms focusing on detecting known malware, the goal of this paper is to design an effective data-driven approach for detecting unknown malicious programs. We define a malicious program as a process that is unknown to the execution environment and behaves differently to all existing benign programs. Slightly more formally, given a target program with corresponding process event data (e.g., a program opens a file or connects to a server) during a time window, we perform similarity learning and check whether the behavior of the program is similar to that of any existing benign programs. If some highly similar programs exist, the model outputs top-k most similar programs with their IDs/names and similarity scores; otherwise, it triggers an alert.
It is difficult to detect unknown malicious programs due to four major challenges: (1) The nonlinear and hierarchical heterogeneous relations/dependencies among system entities. The operations made by programs have nonlinear and hierarchical heterogeneous dependencies, and they work together in a highly complex and coordinated manner. Simple methods that neglect these dependencies may still yield high false positive rates; (2) Lack of intrinsic distance/similarity metrics [zhang2015categorical] between two programs. Consider that in real computer systems, given two programs with thousands of system events related to them, it is a non-trivial task to measure their distance/similarity based on the categorical event data; (3) The exponentially large event space. An information system typically deals with a large volume of system event data (normally more than events per host per second) [Dong:2017]; (4) Aliases of programs. Different versions or updates of the same program have different signatures and may also have different executable names. Thus, a simple method that keeps a white-list of program IDs/signatures could be problematic.
To address these challenges, we propose MatchGNet, a data-driven graph matching framework that learns the program representation and similarity metric via a Graph Neural Network. In particular, we first design an invariant graph model to capture the heterogeneous interactions/dependencies between different pairs of system entities. To learn the program representation from the constructed heterogeneous invariant graph, we propose a hierarchical attentional graph neural encoder. Finally, we propose a similarity learning model based on a Siamese Network to train the parameters and perform similarity scoring between an unknown program and the existing benign programs. MatchGNet trains the model on existing benign programs, not on any malware/malicious program samples. As a result, by similarity matching it can identify an unknown malicious program whose behavioral representation is significantly different from that of every existing benign program. We conduct an extensive set of experiments on real-world system provenance data to evaluate the performance of our model. The results demonstrate that MatchGNet can accurately detect malicious programs. In particular, it can detect fake and unknown programs with an average accuracy of 97% and 96%, respectively. We also apply MatchGNet to detect realistic malware attacks. The results show that it can reduce the false positives of the state-of-the-art by 50%, while keeping zero false negatives.
2 Motivating Example and Related Work
In a phishing email attack as shown in Figure 1, to steal sensitive data from the database of a computer/server, the adversary exploits a known vulnerability of Microsoft Office (https://nvd.nist.gov/vuln/detail/CVE-2008-0081) by sending a phishing email with a malicious .doc file attached to one of the IT staff of the enterprise. When the IT staff member opens the attached .doc file through the browser, a malicious macro is triggered. This malicious macro creates and executes a malware executable, which pretends to be an open source Java runtime (Java.exe). This malware then opens a backdoor to the adversary, subsequently allowing the adversary to read and dump data from the target database via the affected computer.
Signature-based or behavior-based malware detection approaches generally do not work well in detecting the malicious program in our example. As the adversary can make the malicious program from scratch with binary obfuscation, signature-based approaches [David2017, Bernardi2018] would fail due to the lack of known malicious signatures. Behavior-based approaches [Domagoj2011, Jha2013, Bernardi2018] may not be effective either, unless the malware sample has previously been used to train the detection model.
It might be possible to detect the malicious program using existing host-level anomaly detection techniques [lin2012intelligent, jyothsna2011review, Chen:2016, Cao:2018:BCD:3269206.3272022]. These host-based anomaly detection methods can locally extract patterns from process events as the discriminators of abnormal behavior. However, such detection is based on observations of single operations, and it sacrifices the false positive rate to detect the malicious program [Lin:2018:CAR:3269206.3272013]. For example, host-level anomaly detection can detect the fake “Java.exe” by capturing the database read. However, a Java-based SQL client may also exhibit the same operation. If we simply detect the database read, we may also classify normal Java-based SQL clients as abnormal program instances and generate false positives. In an enterprise environment, too many false positives can lead to the alert fatigue problem [7931672, Lin:2018:CAR:3269206.3272013], causing cyber-analysts to fail to keep up with attacks.
To accurately separate the database read of the malicious Java from the real Java instances, we need to consider the higher semantic-level context of the two Java instances. As shown in Figure 2, the malicious Java is a very simple program and directly accesses the database. In contrast, a real Java instance has to load a set of .DLL files in addition to the database read. By comparing the context of the fake Java instance with that of real Java instances, we can find that it is not normal and precisely report it as a malicious program instance. Thus, in this paper, we propose a Graph Neural Network based approach to learn the semantic-level context of program instances.
In recent years, Graph Neural Network (GNN) approaches [defferrard2016convolutional, hamilton2017inductive, velickovic2017graph, kipf2016semi, gilmer2017neural, wu2019comprehensive, wang2019adversarial, ying2018hierarchical, chen2018fastgcn] have been proposed for graph-structured data. They try to accelerate convolution operations, reduce the computational cost, and extend the current graph convolution. The goal of GNN is to learn the representation of the graph, in the node level or graph level. Because of its remarkable graph representation learning ability, GNN has been explored in various real-world applications, such as healthcare [mao2019medgcn, WangSBDM], chemistry [fout2017protein, zitnik2018modeling], and security system [WangAHGNN].
In parallel to Graph Neural Networks, graph similarity matching has been studied extensively in the machine learning community [yan2004graph, shasha2002algorithmics, willett1998chemical, raymond2002rascal]. The similarity metrics can be categorized into two types: exact matching (isomorphism test) [yan2004graph, shasha2002algorithmics] and structure similarity measures (graph edit distance) [willett1998chemical, raymond2002rascal]. Besides, in the computer vision community, learning-based metrics [weinberger2009distance, sun2014deep] are proposed to match image data. They compute the similarity score with either hand-engineered features or hand-designed metrics. Our work differs from these approaches in two ways: 1) we focus on learning from a heterogeneous graph rather than a homogeneous graph; and 2) we learn the graph representation and the similarity metric simultaneously.
3 Heterogeneous Graph Matching Networks
To address the challenges introduced in Section 1, we propose a heterogeneous Graph Matching Network framework, MatchGNet, with three key modules: Invariant Graph Modeling (IGM or Step (A)), Hierarchical Attentional Graph Neural Encoder (HAGNE or Step (B)), and Similarity Learning (SL or Step (C)) as illustrated in Figure 3. The IGM module models the system event data as a heterogeneous invariant graph, which can capture the program’s behavior profile. Then, we formulate the malicious program detection as a heterogeneous graph matching problem and solve it with HAGNE and SL.
Given two graphs $G_1$ and $G_2$, HAGNE first generates the corresponding graph embeddings $h_{G_1}$ and $h_{G_2}$ in a hierarchical attentional style, then fuses the two graph embeddings via Similarity Learning (SL) and outputs a similarity score $\mathrm{Sim}(G_1, G_2)$. In this way, the graph representation and the similarity metric are learned jointly, so that effective graph similarity matching can be performed. During the malware detection stage, the distance between an unknown program and an existing benign program will be maximized in the mapped embedding space under the learned similarity metric.
3.2 Invariant Graph Modeling
Information systems often generate a large volume of system-event data (i.e., the interaction between a pair of system entities). In a typical enterprise environment, the amount of data collected from a single computer system can easily reach one gigabyte after monitoring process interactions for one hour.
Learning a representation over such massive data is prohibitively expensive in terms of both time and space. Recently, a very promising means of studying complex systems has emerged through the concept of the invariant graph [cheng2016ranking, LuoCTSLCY18]. An invariant graph focuses on discovering stable and significant dependencies between pairs of entities that are monitored through surveillance data recordings, so as to profile the system status and perform subsequent reasoning.
Following the idea of the invariant graph, we model the system event data as a heterogeneous graph between different system entities (e.g., processes, files, and Internet sockets). The edges indicate the causal dependencies, including a process accessing a file, a process forking another process, and a process connecting to an Internet socket. Formally, given the event data across several machines within a time window (e.g., one day), each target program can be modeled as a heterogeneous graph $G = (V, E)$, in which $V$ denotes a set of nodes. Each node represents an entity of one of three possible types: process (P), file (F), and INETSocket (I), namely $V = P \cup F \cup I$. $E$ denotes a set of edges (dependencies) $(v_s, v_d, r)$ between the source entity $v_s$ and destination entity $v_d$ with relation $r$. We consider three types of relations: (1) a process forking another process ($P \rightarrow P$), (2) a process accessing a file ($P \rightarrow F$), and (3) a process connecting to an Internet socket ($P \rightarrow I$). Each graph is associated with an adjacency matrix $A$. With the help of the invariant graph modeling, we can obtain a global program-dependency profile.
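As a rough illustration of the graph construction described above, the following sketch builds a typed node set, edge set, and adjacency map from a handful of event records. The event tuples, the `P:`/`F:`/`I:` name prefixes, and the `build_graph` helper are hypothetical and only serve the example; the sketch collapses repeated events into a single dependency and does not attempt the stability analysis that an actual invariant graph would involve.

```python
from collections import defaultdict

# Hypothetical event records: (source entity, relation, destination entity).
# The prefix encodes the entity type: P = process, F = file, I = INETSocket.
events = [
    ("P:java.exe", "fork", "P:cmd.exe"),
    ("P:java.exe", "access", "F:jvm.dll"),
    ("P:java.exe", "access", "F:jvm.dll"),  # repeated event, same dependency
    ("P:java.exe", "connect", "I:10.0.0.5:3306"),
]

# Allowed (source type, destination type) for each of the three relations.
RELATION_TYPES = {"fork": ("P", "P"), "access": ("P", "F"), "connect": ("P", "I")}


def build_graph(events):
    """Return (nodes, edges, adjacency) for one heterogeneous program graph."""
    nodes, edges = set(), set()
    adjacency = defaultdict(set)
    for src, rel, dst in events:
        src_type, dst_type = RELATION_TYPES[rel]
        if not (src.startswith(src_type + ":") and dst.startswith(dst_type + ":")):
            raise ValueError(f"relation {rel!r} cannot link {src} -> {dst}")
        nodes.update((src, dst))
        edges.add((src, rel, dst))  # duplicates collapse into one dependency
        adjacency[src].add((rel, dst))
    return nodes, edges, dict(adjacency)


nodes, edges, adjacency = build_graph(events)
```

Storing the relation on each edge keeps the three dependency types distinguishable, which the later meta-path search relies on.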
3.3 Hierarchical Attentional Graph Neural Encoder
The constructed invariant graph is heterogeneous with multiple types of entities and relations. Thus, it is difficult to directly apply the traditional homogeneous Graph Neural Networks (such as GCN and GraphSage) to learn the graph representation. To address this problem, we propose a Hierarchical Attentional Graph Neural Encoder (HAGNE) to learn the program representation as a graph embedding through an attentional architecture by considering the node-wise, layer-wise, and path-wise context importance. More specifically, we first propose a Heterogeneity-aware Contextual Search (Step (B1) in Figure 3) to find the path-relevant sets of neighbors under the guide of the meta-paths [sun2011pathsim]. Then, we introduce a Node-wise Attentional Neural Aggregator (Step (B2) in Figure 3) to generate node embeddings by selectively aggregating the entities in each path-relevant neighbor set based on random walk scores. Next, we design a Layer-wise Dense-connected Neural Aggregator (Step (B3) in Figure 3) to aggregate the node embeddings generated from different layers towards a dense-connected node embedding. Finally, we develop a Path-wise Attentional Neural Aggregator (Step (B4) in Figure 3) to learn the attentional weights for different meta-paths and compute the graph embedding from the Layer-wise Dense-connected Aggregator.
3.3.1 Heterogeneity-aware Contextual Search
As the first step of the aggregation layer, traditional GNNs would search for all one-hop neighbors of a target node. Since our invariant graph is heterogeneous, simply aggregating these neighbors cannot capture the semantic and structural correlations among different types of entities. To address this issue, we propose a meta-path [sun2011pathsim] based contextual search. A meta-path is a path that connects different entity types via a sequence of relations in a heterogeneous graph. In a computer system, a meta-path could be: a process forking another process ($P \rightarrow P$), two processes accessing the same file ($P \rightarrow F \leftarrow P$), or two processes opening the same Internet socket ($P \rightarrow I \leftarrow P$), with each one defining a unique relationship between two programs. From the invariant graph $G$, a set of meta-paths $M = \{M_1, M_2, \ldots, M_{|M|}\}$ can be generated, with each $M_i$ representing a unique multi-hop relationship between two programs. For each meta-path $M_i$ where $i \in \{1, 2, \ldots, |M|\}$, we define the path-relevant neighbor set $N_v^i$ of node $v$:

$$N_v^i = \{\, u \mid (v, u) \in M_i \ \text{or}\ (u, v) \in M_i \,\}$$

where $u$ is a reachable neighbor of $v$ via the meta-path $M_i$.
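The contextual search can be sketched as follows for the three example meta-paths (fork, shared file, shared socket). The edge list, the meta-path labels, and the helper names are hypothetical, and treating a fork edge as reachable in both directions is an assumption of this sketch rather than something the text specifies.

```python
# Hypothetical typed edges: (source, relation, destination).
edges = [
    ("P:a", "fork", "P:b"),
    ("P:a", "access", "F:x"),
    ("P:c", "access", "F:x"),
    ("P:a", "connect", "I:s"),
    ("P:d", "connect", "I:s"),
    ("P:d", "access", "F:y"),
]


def fork_neighbors(v, edges):
    """Processes linked to v by a fork edge (either direction, by assumption)."""
    forked = {d for s, r, d in edges if r == "fork" and s == v}
    parents = {s for s, r, d in edges if r == "fork" and d == v}
    return forked | parents


def shared_entity_neighbors(v, edges, relation):
    """Processes that touch at least one entity that v also touches via `relation`."""
    touched = {d for s, r, d in edges if r == relation and s == v}
    return {s for s, r, d in edges if r == relation and d in touched and s != v}


def path_relevant_neighbors(v, edges):
    """One neighbor set per meta-path for the target process v."""
    return {
        "P-P": fork_neighbors(v, edges),
        "P-F-P": shared_entity_neighbors(v, edges, "access"),
        "P-I-P": shared_entity_neighbors(v, edges, "connect"),
    }


neighbor_sets = path_relevant_neighbors("P:a", edges)
```

Each meta-path yields its own neighbor set, so a process that only shares a file with the target never appears in the socket-based context.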
3.3.2 Node-wise Attentional Neural Aggregator
After constructing the path-relevant neighbor set $N_v^i$, we are able to leverage these contexts via neighborhood aggregation. However, due to noisy neighbors, different neighboring nodes may have different impacts on the target node, so it is unreasonable to treat all neighbors equally. To address this issue, we propose a node-wise attentional neural aggregator to compute an attentional weight for each node in the path-relevant neighbor set $N_v^i$. This module is based on random walk with restarts (RWR) [Tong2006], a particularly efficient algorithm for computing the relevance scores between pairs of nodes in a homogeneous graph. We extend RWR to a heterogeneous graph, such that the walker starts at the target program node $v$, and at each step it only moves to one of its neighboring nodes in $N_v^i$, instead of to all linked nodes without considering semantics. After the random walk finishes, each visited neighbor receives a visiting count. We normalize the visiting counts and use them as the node-wise attentional weights. Formally, for $N_v^i$, the attentional weights are $\alpha^{(i)} = [\alpha_u^{(i)}]_{u \in N_v^i}$, where $\alpha_u^{(i)}$ is the weight of $u$. Then, the program representation can be computed via a neural aggregation function by

$$h_v^{(i)(k)} = \mathrm{MLP}^{(k)}\Big(\big(1 + \epsilon^{(k)}\big)\, h_v^{(i)(k-1)} + \sum_{u \in N_v^i} \alpha_u^{(i)}\, h_u^{(i)(k-1)}\Big)$$

where $k$ denotes the index of the layer, $h_v^{(i)(k)}$ is the feature vector of program $v$ for meta-path $M_i$ at the $k$-th layer, and $\epsilon^{(k)}$ is a trainable parameter that quantifies the trade-off between the previous layer representation and the aggregated contextual representation. $h_v^{(i)(0)}$ is initialized by the input feature vector $X_v$ of the program. After getting the aggregated representation, a multi-layer perceptron (MLP) is applied to transform the aggregated representation to a hidden nonlinear space. With the Node-wise Attentional Neural Aggregator, we are able to leverage the contextual information while accounting for the different importance of the neighbors.
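A minimal sketch of the node-wise step is given below, assuming a simulated random walk with restarts and a GIN-style $(1+\epsilon)$ weighting of the previous-layer vector; both are assumptions of this example, as are the names `rwr_attention` and `aggregate`, the restart probability, and the toy neighbor map. The MLP applied after aggregation is omitted.

```python
import random


def rwr_attention(target, neighbors, steps=5000, restart=0.3, seed=0):
    """Estimate attention weights over neighbors[target] from RWR visit counts.

    `neighbors` maps each node to its path-relevant neighbors for ONE meta-path.
    """
    rng = random.Random(seed)
    counts = {u: 0 for u in neighbors[target]}
    current = target
    for _ in range(steps):
        options = neighbors.get(current, [])
        # Jump back to the target with probability `restart`, or when stuck.
        if not options or (current != target and rng.random() < restart):
            current = target
            continue
        current = rng.choice(options)
        if current in counts:
            counts[current] += 1
    total = sum(counts.values())
    # Normalized visit counts act as the node-wise attention weights.
    return {u: c / total for u, c in counts.items()} if total else {}


def aggregate(h, target, weights, eps=0.1):
    """(1 + eps) * h[target] + sum_u weights[u] * h[u]; the MLP is left out."""
    out = [(1.0 + eps) * x for x in h[target]]
    for u, a in weights.items():
        for d, x in enumerate(h[u]):
            out[d] += a * x
    return out


neighbors = {"v": ["a", "b"], "a": ["v", "b"], "b": ["v"]}
weights = rwr_attention("v", neighbors)
h = {"v": [1.0, 0.0], "a": [0.0, 1.0], "b": [0.0, 1.0]}
h_v_next = aggregate(h, "v", weights)
```

Because the weights are normalized visit counts, they sum to one, so neighbors that the walker reaches more often contribute proportionally more to the aggregated vector.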
3.3.3 Layer-wise Dense-connected Neural Aggregator
To aggregate the information from a wider range of neighbors, one simple way is to stack multiple node-wise neural aggregators. However, the performance of a GNN model often does not improve this way, because adding more layers makes it easy to propagate noisy information from an exponentially increasing number of neighbors in a deep layer. Recently, inspired by the residual network, a skip-connection method has been proposed. However, even with skip-connections, GCNs with more layers do not perform as well as the -layer GCN on many graph data [kipf2016semi]. To address the limitations of existing work, we propose a Layer-wise Dense-connected Aggregator inspired by DenseNet [huang2017densely]. More specifically, our layer-wise aggregator leverages all the intermediate representations, with each capturing a subgraph structure. All the intermediate representations are concatenated and then passed through an MLP, such that the resulting embedding can adaptively select different subgraph structures. Formally, the Layer-wise Dense-connected Neural Aggregator can be constructed as follows:

$$h_v^{(i)(K+1)} = \mathrm{MLP}\Big(\big[\, h_v^{(i)(0)};\ h_v^{(i)(1)};\ \ldots;\ h_v^{(i)(K)} \,\big]\Big)$$

where $K$ is the number of node-wise aggregation layers and $[\,\cdot\,;\,\cdot\,]$ represents the feature concatenation operation.
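The dense connection can be illustrated with a small sketch that concatenates the per-layer representations of one node and applies a single linear layer with a ReLU as a stand-in for the MLP. The function name, the weights, and the single-layer simplification are assumptions made for the example.

```python
def dense_layer_aggregate(layer_outputs, weight, bias):
    """Concatenate all intermediate representations, then apply linear + ReLU.

    layer_outputs: list of per-layer vectors for one node (layer 0 .. K).
    weight: list of rows, each as long as the concatenated vector.
    """
    concat = [x for h in layer_outputs for x in h]
    return [
        max(0.0, sum(w * x for w, x in zip(row, concat)) + b)
        for row, b in zip(weight, bias)
    ]


# Three layers of 2-dimensional representations -> concatenated length 6.
layer_outputs = [[1.0, 2.0], [0.5, -1.0], [0.0, 3.0]]
weight = [
    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],   # picks a feature from layer 0
    [0.0, 0.0, 0.0, 1.0, 0.0, -1.0],  # mixes features from layers 1 and 2
]
bias = [0.5, 0.0]
dense_embedding = dense_layer_aggregate(layer_outputs, weight, bias)
```

Since every layer's output is visible to the final transformation, the learned weights can favor shallow or deep context per feature instead of relying only on the last layer.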
3.3.4 Path-wise Attentional Neural Aggregator
After the node-wise and layer-wise aggregators, different embeddings corresponding to different meta-paths are generated. However, different meta-paths should not be treated equally. For example, ransomware is usually very active in accessing files but barely forks another process or opens an Internet socket, while VPNFilter is generally very active in opening Internet sockets but barely accesses a file or forks another process. To address this issue, we propose a Path-wise Attentional Neural Aggregator to aggregate the embeddings generated from different path-relevant neighbor sets. Our path-wise aggregator can learn the attentional weights for different meta-paths automatically. Formally, given the program embedding $h_G^{(i)}$ corresponding to each meta-path $M_i$, we define the path-wise attentional weight as follows:

$$\beta_i = \frac{\exp\Big(\sigma\big(b\,[\,W_b h_G^{(i)} \,\Vert\, W_b h_G^{(j)}\,]\big)\Big)}{\sum_{j'=1}^{|M|} \exp\Big(\sigma\big(b\,[\,W_b h_G^{(i)} \,\Vert\, W_b h_G^{(j')}\,]\big)\Big)} \tag{5}$$

where $h_G^{(i)}$ is the embedding corresponding to the target meta-path $M_i$, $h_G^{(j)}$ denotes the embedding corresponding to the other meta-path $M_j$, $b$ denotes a trainable attention vector, $W_b$ denotes a trainable weight matrix, which maps the input features to the hidden space, $\Vert$ denotes the concatenation operation, and $\sigma$ denotes the nonlinear gating function. We formulate a feed-forward neural network, which computes the correlation between one path-relevant neighbor set and the other path-relevant neighbor sets. This correlation is normalized by a Softmax function. Let $\mathrm{ATT}(h_G^{(i)})$ represent Eq. (5). The joint representation for all the meta-paths can be represented as follows:

$$h_G = \sum_{i=1}^{|M|} \mathrm{ATT}\big(h_G^{(i)}\big)\, h_G^{(i)}$$
The Path-wise Attentional Neural Aggregator allows us to better infer the importance of different meta-paths by leveraging their correlations and learn a path-aware representation.
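One way to picture the path-wise step is the sketch below, which scores each meta-path embedding against the others, normalizes the scores with a softmax, and fuses the embeddings by a weighted sum. Summing the pairwise scores over the other meta-paths and using `tanh` as the gating function are assumptions of this sketch, and the names `path_attention`, `fuse`, `W`, and `b` are illustrative only.

```python
import math


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def path_attention(path_embeddings, W, b):
    """Softmax-normalized weight per meta-path embedding."""
    projected = [matvec(W, h) for h in path_embeddings]
    scores = []
    for i, hi in enumerate(projected):
        s = 0.0
        for j, hj in enumerate(projected):
            if i != j:
                # Attention vector b applied to the concatenation [hi || hj].
                s += math.tanh(sum(bk * xk for bk, xk in zip(b, hi + hj)))
        scores.append(s)
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def fuse(path_embeddings, weights):
    """Weighted sum of the per-meta-path embeddings."""
    dim = len(path_embeddings[0])
    return [sum(w * h[d] for w, h in zip(weights, path_embeddings)) for d in range(dim)]


W = [[1.0, 0.0], [0.0, 1.0]]   # hypothetical projection (identity here)
b = [0.5, -0.5, 0.25, 0.25]    # hypothetical attention vector, length 2 * hidden size
path_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
beta = path_attention(path_embeddings, W, b)
graph_embedding = fuse(path_embeddings, beta)
```

In a trained model `W` and `b` would be learned, so meta-paths that carry more discriminative behavior for a program receive larger weights.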
3.4 Similarity Learning
In order to perform effective graph matching, we propose Similarity Learning (SL), a Siamese Network based learning model to train the parameters of the Hierarchical Attentional Graph Neural Encoder (HAGNE). Siamese Networks [zagoruyko2015learning] are neural networks containing two or more identical subnetwork components, and they have been shown to be effective at distinguishing similar and dissimilar objects. Here, we employ SL to learn the similarity metric and the program graph representation jointly for better graph matching between an unknown program and a known benign program, as shown in Figure 3.
More specifically, our Siamese Network consists of two identical HAGNEs that compute the program graph representations independently. Each HAGNE takes a program graph snapshot $G$ as the input and outputs the corresponding embedding $h_G$. A neural network is then used to fuse the two embeddings. The final output is the similarity score of the two program embeddings. During training, $P$ pairs of program graph snapshots $(G_1^{(i)}, G_2^{(i)})$, $i \in \{1, \ldots, P\}$, are collected with corresponding ground truth pairing information $y_i$. If the pair of graph snapshots belongs to the same program, the ground truth label is $y_i = +1$; otherwise its ground truth label is $y_i = -1$. For each pair of program snapshots, a cosine score function is used to measure the similarity of the two program embeddings, and the output of the Similarity Learning is defined as follows:

$$\mathrm{Sim}(G_1, G_2) = \cos\big(h_{G_1}, h_{G_2}\big) = \frac{h_{G_1} \cdot h_{G_2}}{\lVert h_{G_1} \rVert\, \lVert h_{G_2} \rVert}$$

Correspondingly, our objective function can be formulated as:

$$\ell = \sum_{i=1}^{P} \Big(\mathrm{Sim}\big(G_1^{(i)}, G_2^{(i)}\big) - y_i\Big)^2$$
We optimize this objective with Adam optimizer [kingma2014adam]. With the help of the Similarity Learning, we can learn the parameters that keep similar embeddings closer while pushing dissimilar embeddings apart by directly optimizing the embedding distance.
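To make the scoring and objective concrete, the sketch below computes a cosine score for a pair of embeddings and a loss over labeled pairs. It assumes labels of +1 for same-program pairs and -1 otherwise and a squared-error objective between score and label; the embeddings are hypothetical, and the gradient-based update (Adam in the text) is not shown.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def pair_loss(pairs):
    """Sum of squared differences between cosine score and label (+1 or -1)."""
    return sum((cosine(h1, h2) - y) ** 2 for h1, h2, y in pairs)


# Hypothetical embedding pairs produced by the two encoder branches.
well_separated = [
    ([1.0, 0.0], [2.0, 0.0], +1),   # same program, same direction
    ([1.0, 0.0], [-1.0, 0.0], -1),  # different programs, opposite direction
]
poorly_separated = [
    ([1.0, 0.0], [0.0, 1.0], +1),   # same program but orthogonal embeddings
]
loss_good = pair_loss(well_separated)
loss_bad = pair_loss(poorly_separated)
```

The loss is zero only when same-program pairs align and different-program pairs point in opposite directions, which is what pulls similar embeddings together and pushes dissimilar ones apart.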
Since we directly optimize the distance between the two programs, this model can be used to perform unknown malware detection. Given the snapshot of an unknown program, we first construct its corresponding program invariant graph and then feed it to the HAGNE to generate the program embedding. After that, we compute the cosine similarity scores between the embedding of the unknown program and those of the existing programs in the database. If an existing program has more than one embedding generated from multiple graph snapshots, we only report its highest similarity score with regard to the unknown program. Finally, we rank all the similarity scores. If the highest similarity score among all the existing programs is below our threshold, an alert is triggered. Otherwise, the top-$k$ most similar programs are reported.
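The detection procedure described above can be sketched as follows. The embedding database, the threshold of 0.9, and `k = 2` are hypothetical values chosen for the example; in practice the embeddings would come from the trained encoder.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def detect(unknown, database, threshold=0.9, k=2):
    """Alert if no known program is similar enough; otherwise return the top-k matches."""
    # Keep only the best score per program when it has several snapshots.
    best = {
        name: max(cosine(unknown, emb) for emb in embeddings)
        for name, embeddings in database.items()
    }
    ranked = sorted(best.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked or ranked[0][1] < threshold:
        return {"alert": True, "matches": []}
    return {"alert": False, "matches": ranked[:k]}


# Hypothetical embeddings of known benign programs (one list per snapshot).
database = {
    "java.exe": [[1.0, 0.0, 0.2], [0.9, 0.1, 0.3]],
    "sqlclient.exe": [[0.8, 0.5, 0.1]],
    "notepad.exe": [[0.0, 1.0, 0.0]],
}
benign_like = detect([1.0, 0.05, 0.25], database)
suspicious = detect([0.0, 0.0, 1.0], database)
```

Here the first query closely matches a known `java.exe` snapshot and returns ranked matches, while the second resembles nothing in the database and raises an alert.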
4.1 Experiment Setup
We collect a -week period of data from a real enterprise network composed of hosts ( Windows hosts and Linux hosts). The sheer size of the data set is around three terabytes. We consider three different types of system events as defined in Section 3.2. Each entity is associated with a set of attributes, and each process has an executable name as its identifier (ID). In total, there are about million event records, with about processes, files, and Internet sockets. Based on the system event data, we construct a program invariant graph per program per day.
We compare MatchGNet with the following typical and state-of-the-art algorithms:
MLP: Multi-layer Perceptron (MLP) is a deep neural network based classification model with multiple non-linear layers between the input and the output layers. It is a special case of GNN without the aggregation operation, obtained by defining the propagation layer as an identity matrix.
Since the proposed MatchGNet is based on GNN, we also compare it with two popular GNN variants: GCN [kipf2016semi] and GraphSage [hamilton2017inductive].
4.1.3 Evaluation Metrics
Similar to [Das:2007], we evaluate the performance of different methods using accuracy (ACC), F-1 score, and AUC score.
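For reference, the three metrics can be computed as in the sketch below, using +1 as the positive label and a pairwise-ranking definition of AUC. The labels, scores, and the 0.45 decision threshold are made up for illustration and are unrelated to the reported results.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground truth."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc(y_true, scores, positive=1):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [1, 1, -1, -1]
scores = [0.9, 0.4, 0.5, 0.1]
y_pred = [1 if s >= 0.45 else -1 for s in scores]
acc, f1, area = accuracy(y_true, y_pred), f1_score(y_true, y_pred), auc(y_true, scores)
```

Accuracy and F-1 depend on the chosen threshold, whereas AUC summarizes the ranking quality across all thresholds.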
4.2 Fake Program Detection
Our first research question focuses on the accuracy of MatchGNet in detecting fake programs. Here, we define a fake program as one that uses the ID of another program. This is a common method for adversaries to hide their attacks.
To simulate the execution of fake programs on a large scale, we manually seed fake programs in our testing dataset (i.e., data from the seventh week). To do so, before feeding the monitoring data to MatchGNet, we randomly replace the ID of a known program with the ID of another known program. This process simulates an adversary who wants to hide the use of a benign system tool in his/her attacks.
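The seeding step could look roughly like the sketch below, which relabels a random subset of program instances with the ID of a different known program. The record layout, the `fraction` parameter, and the `seed_fake_programs` helper are hypothetical and are not taken from the actual experimental pipeline.

```python
import random


def seed_fake_programs(instances, fraction=0.5, seed=0):
    """Return copies of the instances where some claim another known program's ID."""
    rng = random.Random(seed)
    known_ids = sorted({inst["id"] for inst in instances})
    seeded = []
    for inst in instances:
        inst = dict(inst)  # do not modify the caller's records
        others = [i for i in known_ids if i != inst["id"]]
        if others and rng.random() < fraction:
            inst["claimed_id"] = rng.choice(others)  # fake: borrowed ID
            inst["is_fake"] = True
        else:
            inst["claimed_id"] = inst["id"]          # genuine: unchanged ID
            inst["is_fake"] = False
        seeded.append(inst)
    return seeded


instances = [{"id": name} for name in ["java.exe", "cmd.exe", "python.exe", "svchost.exe"]]
seeded = seed_fake_programs(instances)
```

Keeping the true `id` next to the `claimed_id` gives the ground truth needed to score whether a claimed identity was correctly accepted or rejected.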
In this task, each fake program has a claimed ID of a known program. Thus, we only need to check whether it is indeed the claimed program: if it matches the behavior pattern of the claimed program, the predicted label should be $+1$; otherwise, it should be $-1$. We use logistic regression as the classification model for all the methods.
The ROC curve is shown in Figure 4. We also summarize the AUC, F-1 score, and ACC of the six models in Table 1. In our experiments, the AUC of MatchGNet is 99%, which is at least 4% higher than that of every baseline model. In terms of ACC, MatchGNet is 47%, 16%, 10%, 2%, and 5% higher than the SVM, LR, MLP, GCN, and GraphSage models, respectively. This means that, while capturing all the fake program instances, MatchGNet produces fewer false positives than all other models. This result supports our design decision of applying the invariant graph structure and the Graph Neural Network model to capture the semantics of program instances.
4.2.1 Hyper Parameter Selection
We evaluate the selection of hyper parameters of MatchGNet with our validation data set (i.e., data from the sixth week). There are two hyper parameters in MatchGNet: the number of layers and the number of hidden neurons. We plot the results for the number of layers and the number of hidden neurons in Figure 5. In these figures, the y-axis is the AUC value and the x-axis is the value of the hyper parameter.
We find that MatchGNet reaches its maximal AUC with 3 layers and 500 neurons. Larger hyper-parameter values consume more resources but yield little improvement in the AUC. Thus, we use these optimal hyper parameters as part of the default model and apply them in the other parts of our experiments.
4.3 Unknown Program Detection
This experiment focuses on evaluating MatchGNet on unknown program detection. To simulate unknown program instances, we split the programs in the training data equally into two sets, the known set and the unknown set. In our five weeks’ training data, we exclude the programs in the unknown set and only train the model from the programs in the known set. Then, in our testing period, we use program instances from the unknown set as malicious programs.
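The known/unknown protocol can be sketched as below: program IDs are shuffled and split in half, the model is trained only on graphs of known programs, and instances of the held-out half are treated as malicious at test time. The record layout, the seeded shuffle, and the helper name are assumptions made for this example.

```python
import random


def split_known_unknown(program_ids, seed=0):
    """Split the distinct program IDs into two equally sized (up to one) sets."""
    ids = sorted(set(program_ids))
    random.Random(seed).shuffle(ids)
    half = len(ids) // 2
    return set(ids[:half]), set(ids[half:])


# Hypothetical training records: one graph snapshot per (program, day).
training_graphs = [
    {"id": "java.exe", "day": 1},
    {"id": "cmd.exe", "day": 1},
    {"id": "python.exe", "day": 2},
    {"id": "svchost.exe", "day": 2},
]
known, unknown = split_known_unknown(g["id"] for g in training_graphs)

# Only graphs of known programs are used for training.
train_set = [g for g in training_graphs if g["id"] in known]
```

Because the two sets are disjoint, any test instance from the unknown set is guaranteed to be a program the model never saw during training.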
We plot the ROC curves in Figure 6 and report the AUC, F-1 score, and ACC of all six models in Table 2. The AUC of MatchGNet in detecting unknown program instances is at least 4% higher than that of all other models. In terms of ACC, MatchGNet is 46%, 14%, 11%, 4%, and 5% higher than the SVM, LR, MLP, GCN, and GraphSage models, respectively. This also indicates that the Attentional Graph Neural Network model better captures the semantic-level features of programs and can generate fewer false positives.
4.4 Malware Attack Detection
To evaluate the usefulness of MatchGNet, we apply it to detect realistic malware attacks in an enterprise environment. We randomly download malware from VirusTotal covering all the popular malware categories and create realistic attack cases from the literature and reports, including WannaCry, Genasom, Sorikrypt, Shinolocker, Phishing Email, ShellShock, Netcat Backdoor, Passing the Hash, and Trojan Attacks.
We execute these attacks on our two testing machines and merge the provenance data from these two testing machines with the normal data of the hosts to the data stream of MatchGNet. We set up MatchGNet with the optimal hyper parameters in Section 4.2.
To evaluate the usefulness of MatchGNet, we use the state-of-the-art entity-embedding based technique [Chen:2016] as the baseline.
Based on the experimental results, both MatchGNet and the baseline method capture all the true alerts. However, during the same time period and with the same provenance data, MatchGNet only generates false positives while the baseline generates false positives. These numbers indicate that MatchGNet can reduce the false positives by nearly 50%. This reduction could mean a substantial saving in cyber-analysts’ time in a large enterprise.
In this paper, we proposed MatchGNet, a heterogeneous Graph Matching Network approach to detect unknown malicious programs in information systems. MatchGNet first models a program’s execution behavior as a heterogeneous invariant graph. Based on the constructed program graphs, it learns the graph representation and the similarity metric simultaneously to distinguish benign programs from malware. The evaluation results showed that our approach can accurately detect unexpected program instances. In particular, it can detect fake and unknown programs with an average accuracy of 97% and 96%, respectively. We further demonstrated that our approach is promising, as it produces 50% fewer false positives than the state-of-the-art method in detecting malware attacks.
Shen is supported in part by NSF through grants IIS-1526499, IIS-1763325, and CNS-1626432.