## 1 Introduction

Dynamic graphs (networks) have attracted growing research interest. They model a variety of systems, including societies, ecosystems, and the Internet. For example, in an enterprise dynamic network [14], a node represents a system entity (such as a process, file, or Internet socket) and an edge indicates an interaction between two system entities. Unlike static networks, dynamic networks change constantly; possible changes include changes to the graph structure and modifications of node attributes.

A fundamental task in dynamic graph analysis is anomaly detection: identifying objects, relationships, or subgraphs whose “behaviors” deviate significantly from the underlying majority of the network [1, 18]. In this work, we focus on anomalous edge detection in dynamic graphs. Detecting anomalous edges helps in understanding the system status and diagnosing system faults [18, 2]. For example, in an enterprise dynamic network, some pairs of system entities, such as user software and system-specific Internet socket ports (e.g., port number ), never form an edge (interaction/connection) in a normal system environment. When such suspicious interactions do occur, they may indicate that a serious cyber-attack has taken place, one that could significantly damage the enterprise system [3].

Recently, graph embedding has been shown to be a powerful tool for learning low-dimensional representations that capture and preserve graph structure. However, most existing graph embedding approaches are designed for static graphs and thus may not suit a dynamic environment in which the network representation has to be updated constantly. Only a few embedding-based methods (such as NetWalk [22]) can update the representation dynamically as the network evolves. These methods, however, require knowledge of the nodes over the whole time span and thus can hardly guarantee performance on nodes that first appear in the future. More importantly, they neglect a notable characteristic of dynamic networks: the subgraph structural changes related to the target nodes. These structural temporal dynamics are key to understanding system behavior. For example, in Figure 1, the target edge at timestamp is marked as a double red line, and the -hop subgraph centered on the target edge is marked in gray. Figure 1 (A) shows that the interactions between nodes of the subgraph (i.e., gray nodes) become more frequent, so it is reasonable to regard the target edge in Figure 1 (A) as normal. In contrast, in Figure 1 (B), there are no interactions between the neighbors in the subgraph from timestamp to . The target edge at timestamp is therefore more likely to be anomalous. It is thus critical to model and detect structural changes over time for the anomaly detection task.

To address the aforementioned issues, we propose StrGNN, a structural graph neural network to identify anomalous edges in dynamic graphs. StrGNN is designed to detect unusual subgraph structures centered on the target edge in a given time window while considering the temporal dependency. StrGNN consists of three sub-models: ESG (Enclosing Subgraph Generation), GSFE (Graph Structural Feature Extraction), and TDN (Temporal Detection Network). First, ESG extracts a -hop enclosing subgraph centered on the target edge from each graph snapshot. Subgraphs extracted around different edges can share the same topology, so a node labeling function is proposed to indicate the role of each node in the subgraph. Then, the GSFE module leverages a Graph Convolution Neural Network and pooling to extract a fixed-size feature from each subgraph. Based on the extracted features, TDN employs gated recurrent units (GRUs) to capture the temporal dependency for anomaly detection. Unlike previous embedding-based methods, the whole StrGNN pipeline can be trained end-to-end: StrGNN takes the test edges along with the original dynamic graphs as input and directly outputs the category (i.e., anomaly or normal) of each test edge. Moreover, StrGNN focuses on mining the structural temporal patterns in a given time window. Node embeddings therefore do not need to be learned, and StrGNN is not sensitive to edge and vertex changes (such as new nodes) in the dynamic graphs. We conduct extensive experiments on six benchmark datasets to evaluate the performance of StrGNN, and the results demonstrate the effectiveness of the proposed algorithm. We also apply StrGNN to a real enterprise security system for intrusion detection, where it reduces the false positives of the state-of-the-art methods by at least 50% while keeping zero false negatives.

## 2 Related Work

In this section, we briefly introduce previous work on embedding based anomaly detection in graphs.

### 2.1 Anomaly Detection on Static Graphs

Inspired by word embedding methods [15] in natural language processing tasks, recent advances such as DeepWalk [17], LINE [19], and Node2Vec [9] learn node embeddings via the skip-gram technique. DeepWalk generates random walks of a given length for each vertex and picks each next step uniformly from the neighbors. Unlike DeepWalk, LINE [19] preserves not only the first-order relations (observed tie strength) but also the second-order proximities (shared neighborhood structures of the vertices). Node2Vec [9] uses two different sampling strategies for vertices (breadth-first sampling and depth-first sampling), which result in different feature representations. With these network embedding techniques, both anomalous node detection and anomalous edge detection can be performed with traditional anomaly detection methods.

### 2.2 Anomaly Detection on Dynamic Graphs

Dynamic graphs are more complex because the graph structure varies: the vertices and edges change along the time dimension. To capture the dependency between different graphs along the time dimension, a few network-embedding-based methods have recently been proposed [25]. Dyngem [8] employs an auto-encoder to learn the embedding for each graph, together with a constraint loss function that minimizes the difference between all graphs. Dyngraph2vec [7] uses a Recurrent Neural Network to capture the temporal information and learns the embedding with an auto-encoder. NetWalk [22], one of the state-of-the-art methods for anomaly detection in dynamic networks, learns the embedding while considering the temporal dependency and detects anomalies with a density-based method. NetWalk generates several random walks for each vertex and learns a unified embedding for each node with an auto-encoder. The embedding representation is updated along the time dimension.

## 3 Method

In this section, we introduce our method in detail. We start with the overall framework of our proposed Structural Temporal Graph Neural Networks for anomaly detection in dynamic graphs. The details of each component in our proposed method are introduced afterwards.

### 3.1 Overall Framework

Compared with anomaly detection in a static graph, anomaly detection in dynamic graphs is more complex and challenging in two respects: (1) anomalous edges cannot be determined from the graph at a single timestamp, so the detection procedure must take the previous graphs into consideration; (2) both the vertex and edge sets change over time. To tackle these challenges, we propose StrGNN, a structural temporal Graph Neural Network framework. The key idea is to capture structural changes centered on the target edge in a given time window and to determine the category (i.e., anomaly or normal) of the target edge based on these changes. StrGNN consists of three key components: ESG (Enclosing Subgraph Generation), GSFE (Graph Structural Feature Extraction), and TDN (Temporal Detection Network), as illustrated in Figure 2.

### 3.2 ESG: Enclosing Subgraph Generation

For the first module, Enclosing Subgraph Generation, our goal is to generate the enclosing subgraph structure related to the target edge so that anomalies can be detected more efficiently. Directly analyzing the whole graph can be computationally expensive, especially for real-world networks with thousands or even millions of nodes and edges. Recent work [21] also showed that in Graph Neural Networks, each node is most influenced by its neighbors.

Definition 1. (Enclosing subgraph in static graphs) For a static network , given a target edge with source node and destination node , the hop enclosing subgraph centered on edge can be obtained by , where is the shortest path distance between node and node .

Definition 2. (Enclosing subgraph in dynamic graphs) For a temporal network with window size , given a target edge with source node and destination node , the hop enclosing subgraph centered on edge is the collection of all subgraphs centered on in the temporal network .
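The two definitions can be illustrated with a minimal Python sketch. It assumes that each snapshot is an undirected edge list and that the enclosing subgraph is induced by all nodes within `h` hops of either endpoint of the target edge. The names `x`, `y`, `h`, and all function names are illustrative choices, not notation taken from the paper.

```python
from collections import deque


def hop_distances(adj, source, max_hops):
    """Breadth-first search: map each node within max_hops of source to its distance."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == max_hops:
            continue  # do not expand beyond the hop limit
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def enclosing_subgraph(edges, x, y, h):
    """Nodes and edges of the h-hop enclosing subgraph around (x, y) in one snapshot."""
    adj = {}
    for u, v in edges:  # assumption: the snapshot is treated as undirected
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    nodes = set(hop_distances(adj, x, h)) | set(hop_distances(adj, y, h))
    sub_edges = {(u, v) for u, v in edges if u in nodes and v in nodes}
    return nodes, sub_edges


def temporal_enclosing_subgraphs(snapshots, x, y, h):
    """One enclosing subgraph per snapshot in the time window."""
    return [enclosing_subgraph(edges, x, y, h) for edges in snapshots]


window = [[(1, 2), (2, 3), (3, 4), (4, 5)], [(1, 2), (5, 6)]]
subgraphs = temporal_enclosing_subgraphs(window, x=2, y=3, h=1)
```

In this toy window the first snapshot yields the nodes {1, 2, 3, 4}, while the second yields only {1, 2, 3} with a single edge, which is the kind of structural change over time that the later modules are meant to pick up.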

For a target edge , we extract the enclosing subgraph in dynamic graphs based on Definition 2. However, the extracted subgraph contains only topological information, and subgraphs extracted around different edges can have the same topological structure. To distinguish the role of each node in the subgraph, we annotate the nodes with labels. A good node labeling function should convey the following information: 1) which edge is the target edge in the current subgraph, and 2) the contribution of each node to identifying the category of each edge. More specifically, given the edge and the corresponding source and destination nodes and , our node labeling function for the enclosing subgraph is defined as follows:

(1) | |||

where is the shortest path distance between node and node , and . In addition, the two center nodes are labeled with 1. If a node satisfies or , it will be labeled as 0. The label is converted into a one-hot vector that serves as the attribute of each node. With this labeling function, every node receives a label that encodes structural information about the given subgraph. The category of the target edge at timestamp can then be predicted by analyzing the labeled subgraphs in the given time window.

### 3.3 GSFE: Graph Structural Feature Extraction

To analyze the structure of each enclosing subgraph from the given time period, the Graph Convolution Neural Network (GCN) [11] can be employed to project the subgraph into an embedding space. In GCN, the graph convolution layer was proposed to learn the embedding of each node in the graph and aggregate the embedding from its neighbors. The layer-wise forward operation of graph convolution layer can be described as follows:

(2) |

where is the summation of the adjacency matrix and identity matrix, denotes an activation function, such as the , and is the trainable weight matrix. With a graph convolution layer, each node aggregates the embeddings of its neighbors. Stacking graph convolution layers lets each node obtain information from more distant nodes; for example, with two stacked layers each node obtains information from its 2-hop neighbors.

GCN can generate node embeddings for detecting anomalous edges in a single graph. However, in our dynamic graph setting, the anomalies should be determined in the context of . The number of nodes usually differs across enclosing subgraphs, which results in feature vectors of different sizes. It is therefore challenging to analyze the dynamic graphs using Graph Neural Networks because of the varying input size.

To tackle this problem, we use graph pooling to extract a fixed-size feature for each enclosing subgraph. Any graph pooling method can be employed in the StrGNN framework for this purpose. In this work, we employ the Sortpooling layer proposed in [24], which sorts the nodes in the enclosing subgraph by importance and selects the features of the top nodes.

Given the node embedding corresponding to graph , the importance score for each node in the Sortpooling layer is defined as follows:

(3) |

where is the adjacency matrix of graph , and is the projection matrix with output channel 1. Equation 3 assigns each node an importance score. All nodes in the enclosing subgraph are sorted by importance score, and only the top nodes are selected for further analysis. For subgraphs that contain fewer than nodes, zero-padding is employed to guarantee that every subgraph yields a feature of the same fixed size.
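The feature extraction step can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the authors' implementation: the convolution uses the symmetric normalization of the standard GCN formulation in [11] with a ReLU activation, and the importance score is assumed to come from one extra row-normalized graph convolution with a single output channel. The names `gcn_layer`, `sort_pool`, `proj`, and `k` are hypothetical.

```python
import numpy as np


def gcn_layer(adj, feats, weight):
    """One graph convolution: normalized (A + I) @ H @ W followed by ReLU."""
    a_tilde = adj + np.eye(adj.shape[0])       # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_tilde.sum(axis=1))
    a_norm = a_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ feats @ weight, 0.0)


def sort_pool(adj, embeddings, proj, k):
    """Keep the k highest-scoring node embeddings; zero-pad if fewer than k nodes."""
    a_tilde = adj + np.eye(adj.shape[0])
    a_norm = a_tilde / a_tilde.sum(axis=1, keepdims=True)
    # Assumed scoring rule: a graph convolution with one output channel.
    scores = (a_norm @ embeddings @ proj).ravel()
    order = np.argsort(-scores, kind="stable")[:k]
    pooled = np.zeros((k, embeddings.shape[1]))
    pooled[: len(order)] = embeddings[order]
    return pooled


rng = np.random.default_rng(0)
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
node_emb = gcn_layer(adj, rng.normal(size=(3, 4)), rng.normal(size=(4, 2)))
fixed = sort_pool(adj, node_emb, rng.normal(size=(2, 1)), k=5)  # shape (5, 2)
```

Because the three-node subgraph has fewer than `k = 5` nodes, the last two rows of `fixed` are zero padding, so subgraphs of any size map to a feature of the same shape.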

### 3.4 TDN: Temporal Detection Network

The Graph Structural Feature Extraction module can generate low-dimensional features for anomaly detection. However, it does not consider the temporal information, which is of great importance for determining the category (i.e., anomaly or normal) of an edge in the dynamic setting.

Given the extracted structural feature , , where is the number of selected nodes in each graph, and is the dimension of the feature for each node, we employ gated recurrent units (GRUs) [4], which can alleviate the vanishing and exploding gradient problems [6], to capture the temporal information as:

(4) | |||||

(5) | |||||

(6) | |||||

(7) |

where represents the element-wise product operation, , , and are parameters. The GRU network takes the feature at each timestamp as input and feeds the output of the current timestamp into the next timestamp, which lets it model the temporal information. The output at the last timestamp is used to determine the category of the target edge . The anomalous edge detection problem can be formulated as follows:

(8) |

where is a fully connected network, and is the category of edge .
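The temporal step can be sketched with a hand-written GRU cell in NumPy. The sketch assumes the common GRU formulation with an update gate, a reset gate, and a candidate state (the sign convention for the update gate varies between references), flattens each per-snapshot feature matrix into a vector, and stands in a single logistic unit for the fully connected network. All parameter names (`Wz`, `Uz`, `bz`, and so on) and function names are illustrative.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_gru_params(in_dim, hidden, rng):
    """Small random weights and zero biases for the z, r, and candidate gates."""
    params = {}
    for gate in "zrh":
        params["W" + gate] = rng.normal(scale=0.1, size=(hidden, in_dim))
        params["U" + gate] = rng.normal(scale=0.1, size=(hidden, hidden))
        params["b" + gate] = np.zeros(hidden)
    return params


def gru_step(x_t, h_prev, p):
    """One GRU update; '*' is the element-wise product."""
    z = sigmoid(p["Wz"] @ x_t + p["Uz"] @ h_prev + p["bz"])   # update gate
    r = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h_prev + p["br"])   # reset gate
    h_cand = np.tanh(p["Wh"] @ x_t + p["Uh"] @ (r * h_prev) + p["bh"])
    return (1.0 - z) * h_prev + z * h_cand


def edge_score(features, p, w_out, b_out):
    """Run the GRU over the window and map the last state to an anomaly score."""
    h = np.zeros(p["bz"].shape[0])
    for f in features:                 # one (K, d) feature matrix per snapshot
        h = gru_step(f.ravel(), h, p)
    return float(sigmoid(w_out @ h + b_out))


rng = np.random.default_rng(0)
window = [rng.normal(size=(5, 2)) for _ in range(3)]   # 3 snapshots, K=5, d=2
params = init_gru_params(in_dim=10, hidden=4, rng=rng)
score = edge_score(window, params, rng.normal(size=4), 0.0)
```

With untrained weights the score is meaningless; the point is only the data flow from per-snapshot features through the recurrent state to a single value per target edge.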

For the anomaly detection task, in many real-world cases the dataset contains no anomalous samples or only a small number of them.

One straightforward way of generating negative samples is to draw them from a “context-independent” noise distribution (such as random sampling or injected sampling [2]), where each negative sample is drawn independently of the observed samples. However, because the space of anomalous edges is large, this noise distribution can be very different from the data distribution, which leads to poor model learning. Thus, in this work, we propose a “context-dependent” negative sampling strategy.

The intuition behind our strategy is to generate negative samples from a “context-dependent” noise distribution. Here, the “context-dependent” noise distribution for the sampled data is defined as: , where denotes the observed data distribution, is the number of edges in the graph, and is the number of nodes in the graph. Specifically, we first randomly sample a normal vertex pair in the graph. Then, we replace one of the nodes, say , with a randomly sampled node in the graph to form a new negative sample . If does not belong to the normal graph, we retain the sample; otherwise, we discard it.
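The sampling procedure described above can be sketched in plain Python. The sketch assumes undirected edges, rejects self-loops and duplicate samples (details the text does not specify), and uses hypothetical names throughout.

```python
import random


def sample_negative_edges(edges, nodes, num_samples, seed=0, max_tries=100000):
    """Corrupt one endpoint of an observed edge and keep pairs absent from the graph."""
    rng = random.Random(seed)
    edge_list = list(edges)
    node_list = list(nodes)
    normal = {frozenset(e) for e in edge_list}   # observed (normal) vertex pairs
    negatives, seen = [], set()
    tries = 0
    while len(negatives) < num_samples and tries < max_tries:
        tries += 1
        u, v = rng.choice(edge_list)             # sample a normal vertex pair
        w = rng.choice(node_list)                # sample a replacement node
        candidate = (w, v) if rng.random() < 0.5 else (u, w)
        key = frozenset(candidate)
        if len(key) < 2 or key in normal or key in seen:
            continue                             # discard: self-loop, normal, or repeat
        seen.add(key)
        negatives.append(candidate)
    return negatives


negatives = sample_negative_edges([(0, 1), (1, 2), (2, 3)], range(5), num_samples=4)
```

Every negative sample shares one endpoint with an observed edge, which keeps the noise distribution close to the observed data instead of spreading it uniformly over all non-edges.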

The proposed StrGNN framework is flexible and easy to customize. Any network that can capture temporal information, such as a Convolution Neural Network (CNN) or a vanilla Recurrent Neural Network (RNN), can be used in our proposed framework.

## 4 Experiments

In this section, we evaluate StrGNN on six benchmark datasets and a real enterprise network.

### 4.1 Datasets

We conduct experiments on six public datasets from different domains. The UCI Messages dataset [16] is collected from an online community platform of students at the University of California, Irvine. Each node in the constructed graph represents a user of the platform, and each edge indicates a message interaction between two users. The Digg dataset [5] is collected from the news website digg.com. Each node represents a user of the website, and each edge represents a reply between two users. The Email dataset is a dump of emails of the Democratic National Committee. Each node corresponds to a person, and each edge indicates an email communication between two persons. The Topology dataset [23] contains the network connections between autonomous systems of the Internet. Nodes are autonomous systems, and edges are connections between them. The Bitcoin-alpha and Bitcoin-otc datasets [13, 12] are collected from two Bitcoin platforms named Alpha and OTC, respectively. Nodes represent users of the platform, and an edge exists between two users if one rates the other on the platform.

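Table 1: AUC results of StrGNN on UCI Messages with different enclosing subgraph sizes, for 1%, 5%, and 10% injected anomalies.

| Enclosing subgraph | 1% | 5% | 10% |
|---|---|---|---|
| -hop enclosing subgraph | 0.8179 | 0.8252 | 0.7959 |
| -hop enclosing subgraph | 0.8216 | 0.8274 | 0.7987 |
| -hop enclosing subgraph | 0.8227 | 0.8294 | 0.8005 |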

Table 2: AUC results on the six benchmark datasets for 1%, 5%, and 10% injected anomalies.

| Methods | UCI 1% | UCI 5% | UCI 10% | Digg 1% | Digg 5% | Digg 10% | Email 1% | Email 5% | Email 10% |
|---|---|---|---|---|---|---|---|---|---|
| Node2Vec | 0.7371 | 0.7433 | 0.6960 | 0.7364 | 0.7081 | 0.6508 | 0.7391 | 0.7284 | 0.7103 |
| Spectral Clustering | 0.6324 | 0.6104 | 0.5794 | 0.5949 | 0.5823 | 0.5591 | 0.8096 | 0.7857 | 0.7759 |
| DeepWalk | 0.7514 | 0.7391 | 0.6979 | 0.7080 | 0.6881 | 0.6396 | 0.7481 | 0.7303 | 0.7197 |
| NetWalk | 0.7758 | 0.7647 | 0.7226 | 0.7563 | 0.7176 | 0.6837 | 0.8105 | 0.8371 | 0.8305 |
| StrGNN | 0.8179 | 0.8252 | 0.7959 | 0.8162 | 0.8254 | 0.8272 | 0.8775 | 0.9103 | 0.9080 |

| Methods | Bitcoin-Alpha 1% | Bitcoin-Alpha 5% | Bitcoin-Alpha 10% | Bitcoin-otc 1% | Bitcoin-otc 5% | Bitcoin-otc 10% | Topology 1% | Topology 5% | Topology 10% |
|---|---|---|---|---|---|---|---|---|---|
| Node2Vec | 0.6910 | 0.6802 | 0.6785 | 0.6951 | 0.6883 | 0.6745 | 0.6821 | 0.6752 | 0.6668 |
| Spectral Clustering | 0.7401 | 0.7275 | 0.7167 | 0.7624 | 0.7376 | 0.7047 | 0.6685 | 0.6563 | 0.6498 |
| DeepWalk | 0.6985 | 0.6874 | 0.6793 | 0.7423 | 0.7356 | 0.7287 | 0.6844 | 0.6793 | 0.6682 |
| NetWalk | 0.8385 | 0.8357 | 0.8350 | 0.7785 | 0.7694 | 0.7534 | 0.8018 | 0.8066 | 0.8058 |
| StrGNN | 0.8574 | 0.8667 | 0.8627 | 0.9012 | 0.8775 | 0.8836 | 0.8553 | 0.8352 | 0.8271 |

### 4.2 Baselines

We compare StrGNN with four network embedding based baselines.

DeepWalk [17]: DeepWalk generates the random walks with given length starting from a node and learns the embedding using Skip-gram.

Node2Vec [9]: Node2Vec combines breadth-first traversal and depth-first traversal in the random walks generation procedure. The embedding is learned using Skip-gram technology.

Spectral Clustering [20]: To preserve the local connection relationship, the spectral embedding generates the node embedding by maximizing the similarity between nodes in the neighborhood.

NetWalk [22]: NetWalk generates several random walks for each vertex and learns a unified embedding for each node using auto-encoder technology. The embedding representation will be updated along the time dimension.

### 4.3 Experiment Setup

The parameters of StrGNN can be tuned by 5-fold cross-validation on a rolling basis. By default, we set the window size to and the number of hops in the enclosing subgraph to . We evaluate the influence of each parameter. The AUC results of StrGNN with different on UCI Messages are shown in Table 1. StrGNN with a -hop or -hop subgraph achieves performance similar to the -hop setting but requires much more computation. The parameter (with ) has a similar influence on the performance of StrGNN as . We employ a Graph Neural Network with three graph convolution layers to extract graph features. The size of the output feature map is set to for all three layers, and the outputs of all three layers are concatenated as the embedding feature. The selected rate in the Sortpooling layer is set to . For the temporal neural network, the hidden size of the GRU is set to . We train the network with Adam [10] at a learning rate of . We use batch training with a batch size of , and StrGNN is trained end-to-end for epochs.

We use the first edges as the training dataset and the rest as the test dataset. Since the six benchmark datasets contain no anomalous edges, we follow the approach used in [22] and inject 1%, 5%, and 10% anomalous edges into the test dataset to evaluate each model. The metric used to compare the methods is AUC (the area under the ROC curve); a higher AUC indicates better detection performance.
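As a reminder of what the metric measures, AUC equals the probability that a randomly chosen anomalous edge receives a higher score than a randomly chosen normal edge, with ties counted as one half. The pairwise sketch below is only an illustration with a hypothetical function name; it is quadratic in the number of samples, and a library routine would normally be used on real test sets.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (label 1 = anomalous edge)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Count anomalous/normal pairs ranked correctly; ties contribute one half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


print(auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))  # 0.75
```

In the example, three of the four anomalous/normal pairs are ordered correctly, which gives 0.75.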

### 4.4 Results on Benchmark Datasets

We first compare StrGNN with the baseline methods on the six benchmark datasets. The experimental results in Table 2 show that StrGNN outperforms all four baseline methods on every benchmark dataset. Even when 10% anomalies are injected, the performance of StrGNN remains acceptable. These results indicate that StrGNN exploits the structural and temporal features effectively and that the learned representation of the dynamic graph structure is well suited to the anomaly detection task.

To further demonstrate the effectiveness of our StrGNN method, we visualize the output embeddings from the GRU network of StrGNN. The embeddings are projected into two-dimensional space using the PCA method. The visualization results in Figure 3 show that the anomalies can be easily detected using the embeddings generated by our proposed method.

In the experiments, we also evaluate our proposed model using training data with different ratios. The AUC results on UCI Messages are shown in Figure 4. It can be seen from the results that the AUC increases with the percentage of training data ranging from to , and then the performance stays relatively stable.

### 4.5 Intrusion Detection Application

To evaluate the effectiveness of StrGNN in a practical application with real anomalies, we apply it to detect malware attacks in an enterprise environment. We collect a -week period of data from a real enterprise network composed of hosts ( Windows hosts and Linux hosts). In total, there are about ten thousand normal network event records and attack records, the latter produced by executing different types of attacks, including APT attacks, Trojan attacks, and phishing email attacks, at different periods. Based on the network event data, we construct an accumulated graph per day, with nodes representing hosts and edges representing network connection relationships. Based on the constructed graphs, we apply StrGNN and the baseline methods to detect the attacks.
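The per-day graph construction can be sketched as follows. The record format, a `(unix_timestamp, source_host, destination_host)` triple, is a hypothetical stand-in since the actual event schema is not described, and the sketch reads "accumulated graph per day" as the set of connections observed within each calendar day (UTC).

```python
from collections import defaultdict
from datetime import datetime, timezone


def daily_graphs(events):
    """Group connection events by UTC day and return one edge set per day, in order."""
    graphs = defaultdict(set)
    for ts, src, dst in events:
        day = datetime.fromtimestamp(ts, tz=timezone.utc).date()
        graphs[day].add((src, dst))          # repeated connections collapse to one edge
    return [graphs[day] for day in sorted(graphs)]


events = [(0, "a", "b"), (3600, "a", "c"), (86400, "b", "c"), (90000, "b", "c")]
snapshots = daily_graphs(events)
```

Each resulting edge set is one graph snapshot of the kind the detection pipeline consumes.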

The AUC results are shown in Table 3. We can see that StrGNN achieves an increase of in AUC over the four baseline methods. Based on the experimental results, we also find that with the optimal hyperparameter setting, StrGNN can capture all true alerts, while the baseline methods can only capture true alerts at most. Meanwhile, StrGNN only generates false positives while the baseline methods generate at least false positives. The results demonstrate the effectiveness of StrGNN in solving real-world anomaly detection tasks.

Table 3: AUC results on the intrusion detection application.

| Method | AUC |
|---|---|
| Node2Vec | 0.71 |
| DeepWalk | 0.76 |
| Spectral Clustering | 0.75 |
| NetWalk | 0.90 |
| StrGNN | 0.99 |

## 5 Conclusion

In this paper, we investigated the important and challenging problem of anomaly detection in dynamic graphs. Unlike network-embedding-based methods, which focus on learning good node representations, we proposed StrGNN, a structural temporal Graph Neural Network that detects anomalous edges by mining unusual temporal subgraph structures. StrGNN can be trained end-to-end and is not sensitive to the percentage of anomalies. We evaluated the proposed framework with extensive experiments on six benchmark datasets, and the experimental results demonstrate the effectiveness of our approach. We also applied StrGNN to a real enterprise security system for intrusion detection, where it achieved superior detection performance with zero false negatives.

## References

- [1] (2011) Outlier detection in graph streams. In Proceedings of the 2011 IEEE 27th International Conference on Data Engineering, USA, pp. 399–409.
- [2] (2015) Graph based anomaly detection and description: a survey. Data Mining and Knowledge Discovery 29 (3), pp. 626–688.
- [3] (2016) Ranking causal anomalies via temporal and dynamical analysis on vanishing correlations. In SIGKDD, pp. 805–814.
- [4] (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.
- [5] (2009) Social synchrony: predicting mimicry of user actions in online social media. In CSE, Vol. 4, pp. 151–158.
- [6] (2016) Deep learning. MIT Press.
- [7] (2019) Dyngraph2vec: capturing network dynamics using dynamic graph representation learning. Knowledge-Based Systems.
- [8] (2018) Dyngem: deep embedding method for dynamic graphs. arXiv preprint arXiv:1805.11273.
- [9] (2016) Node2vec: scalable feature learning for networks. In SIGKDD, pp. 855–864.
- [10] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- [11] (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
- [12] (2018) Rev2: fraudulent user prediction in rating platforms. In WSDM, pp. 333–341.
- [13] (2016) Edge weight prediction in weighted signed networks. In ICDM, pp. 221–230.
- [14] (2018) TINET: learning invariant networks via knowledge transfer. In SIGKDD, pp. 1890–1899.
- [15] (2013) Distributed representations of words and phrases and their compositionality. In NIPS, pp. 3111–3119.
- [16] (2009) Clustering in weighted networks. Social Networks 31 (2), pp. 155–163.
- [17] (2014) Deepwalk: online learning of social representations. In SIGKDD, pp. 701–710.
- [18] (2015) Anomaly detection in dynamic networks: a survey. WIREs Comput. Stat. 7 (3), pp. 223–247. ISSN 1939-5108.
- [19] (2015) Line: large-scale information network embedding. In WWW, pp. 1067–1077.
- [20] (2007) A tutorial on spectral clustering. CoRR abs/0711.0189.
- [21] (2018) Representation learning on graphs with jumping knowledge networks. In ICML, pp. 5449–5458.
- [22] (2018) Netwalk: a flexible deep embedding approach for anomaly detection in dynamic networks. In SIGKDD, pp. 2672–2681.
- [23] (2005) Collecting the internet as-level topology. ACM SIGCOMM Computer Communication Review 35 (1), pp. 53–61.
- [24] (2018) An end-to-end deep learning architecture for graph classification. In AAAI.
- [25] (2018) Dynamic network embedding by modeling triadic closure process. In AAAI.