1 Introduction
Modern software systems have become increasingly large and complicated [10, 51, 26, 49]. While these systems provide users with rich services, they also bring new security and reliability challenges. One of these challenges is locating system faults and discovering potential issues.
Log analysis is one of the main techniques engineers use to troubleshoot faults and capture potential risks. When a fault occurs, checking system logs helps detect and locate the fault efficiently. However, with the increase in scale and complexity, manual identification of abnormal logs from massive log data has become infeasible [29, 10, 51, 49]. For example, Google systems generate millions of new log entries every month, amounting to tens of terabytes of log data daily [32, 15]. For such a large amount of data, the cost of manually inspecting logs is unacceptable in practice. Another reason is that a large-scale modern software system, such as an online service system, may comprise hundreds or thousands of machines and software components. Its implementation and maintenance usually rely on the collaboration of dozens or even hundreds of engineers. It is impractical for a single engineer to have complete knowledge of the entire system and distinguish the various abnormal logs generated by its many software components. Therefore, automated log anomaly detection methods are vital.
In the past decades, many log-based anomaly detection methods have been proposed. Some methods take quantitative log event counts as inputs and utilize traditional Machine Learning (ML) techniques to project the event count vectors into a vector space. Vectors deviating from the majority (or violating certain invariant relations among the event counts) are classified as anomalies. We call this the quantitative-based approach. Representative methods of this approach include LR [2], SVM [27], LogCluster [28], Invariants Mining [29], ADR [51], and LogDP [45]. However, these methods tend to suffer from unstable performance across datasets since their input only contains quantitative statistics. They fail to capture the rich semantic information embedded in log messages and the sequential relationships between events in a log sequence.
Recently, deep learning-based methods, such as LogRobust [52], CNN [30], and NeuralLog [26], have demonstrated good performance in detecting log anomalies. This class of methods takes sequential log events as input and uses various deep learning models, such as LSTM [14], CNN [20], and Transformer [42], to identify anomalies by detecting violations of sequential patterns. We call this the sequence-based approach. Although effective, existing sequence-based methods fail to leverage the more informative structural relationships among log events, resulting in potential false alarms and unstable performance.
To address the above issue, this study proposes a novel Graph-based Log anomaly Detection method, namely LogGD. The proposed method first transforms the input log sequences into graphs. It then utilizes the node features (representing log events) and the spatial structure of the graph (representing the relations among log events) to detect anomalies through a customized Graph Transformer Neural Network. The informative spatial structure and the interactions between node features and structural features of the graph enable the proposed method to better distinguish anomalies in logs. Our experimental results on four widely-used public log datasets show that LogGD outperforms state-of-the-art quantitative-based and sequence-based methods and achieves more stable performance under various window size settings.
Our main contributions are summarized as follows:

A graph-based log anomaly detection method: We propose a graph-based anomaly detection method, LogGD. The proposed method exploits the spatial structure of log graphs and the interactions between node features and structural features for log-based anomaly detection, achieving high accuracy and stable detection performance.

A set of comprehensive experiments: We compare the proposed method with five state-of-the-art quantitative-based and sequence-based methods on four widely used real-world datasets. The results confirm the effectiveness of the proposed method.
The rest of the paper is organized as follows. In Section 2, we present the background information and techniques we use in the proposed method. Section 3 details our proposed methodology. The experimental design and results are described in Section 4. In Section 5, we discuss why LogGD works and its limitations, as well as threats to validity. We review the related work in Section 6 and conclude this work in Section 7.
2 Background
2.1 Log Data and Sequences Generation
Logs are usually semi-structured texts used to record the status of systems. Each log message comprises a constant part (i.e., log event, also called log template) and a variable part (log parameter). A log parser is a tool that parses the given log messages into log events. Many log parsers are available [15, 11, 8]. In this study, we choose the state-of-the-art tool Drain [15] for this task because its effectiveness, robustness, and efficiency have been validated in [54]. Figure 1 shows a snippet of raw logs and the results after parsing.
Once the structured log events are ready for use, they need to be further grouped into log sequences (i.e., series of log events that record specific execution flows) according to sessions or predefined windows. Session-based log partitioning utilizes certain log identifiers to generate log sequences. Previous studies [16, 25] have shown that methods working on session-based data can achieve better performance than on data grouped by predefined windows. However, for many datasets, no such identifiers are available to group the log data. In such cases, predefined windows are a common choice for log partitioning, with two available strategies: fixed and sliding windows. The fixed window strategy uses a predefined window size, e.g., 100 logs or 20 logs, to produce log sequences with a fixed number of events. In contrast, the sliding window strategy generates log sequences with overlapping events between two consecutive windows, controlled by two attributes: window size and step size. An example is a 20-log window (window size) sliding every 1 log (step size). The resulting log sequences can then be used for graph data generation.
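To make the three partitioning strategies concrete, the sketch below groups a parsed event stream into sessions, fixed windows, and sliding windows. The function names and data layout are illustrative assumptions, not the paper's implementation.

```python
def fixed_windows(events, window_size):
    """Split an event stream into non-overlapping fixed-size windows."""
    return [events[i:i + window_size] for i in range(0, len(events), window_size)]

def sliding_windows(events, window_size, step_size):
    """Generate overlapping windows that advance by `step_size` events."""
    return [events[i:i + window_size]
            for i in range(0, max(len(events) - window_size, 0) + 1, step_size)]

def session_windows(records):
    """Group (identifier, event) pairs into per-session event sequences."""
    sessions = {}
    for ident, event in records:
        sessions.setdefault(ident, []).append(event)
    return sessions
```

For instance, a 3-log window with step size 1 over five events yields three overlapping sequences, while session grouping keys sequences by an identifier such as an HDFS block ID.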
2.2 Graph Neural Networks
Recently, Graph Neural Networks (GNNs) have attracted great interest from researchers and practitioners, as GNN-based methods have achieved excellent performance on many tasks such as drug discovery [21], social network analysis [34], and network intrusion detection [53]. We introduce our notation and then briefly recap the preliminaries of graph neural networks.
Notation: Let G = (V, E, X, F) denote a directed graph, where V represents a set of nodes and E refers to a set of directed edges (u, v) that flow from node u to node v. X denotes the node features with dimension d, and F represents the edge features with dimension c. We use N(v) to denote the set of neighbors of node v, i.e., N(v) = {u ∈ V | (u, v) ∈ E}.
Graph Neural Networks: Unlike sequence models on text data (such as LSTM, GRU, and Transformer) and convolutional models on image data (such as CNN), graph neural networks work on graph data with irregular structures rather than sequential or grid structures. They combine the graph structure and node features to learn high-level representation vectors of the nodes in the graph, or a representation vector of the whole graph, for node classification, link/edge prediction, or graph classification. In this work, we focus on the task of graph classification, i.e., predicting the label of a given graph.
GNNs typically employ a neighborhood aggregation strategy [13, 4], where the representations of nodes in the graph are iteratively updated by aggregating the representations of their neighbors. Ultimately, a node's high-level representation captures the structural attributes within its L-hop network neighborhood. Formally, the l-th layer representation of node v can be formulated as:

h_v^{(l)} = COMBINE^{(l)} ( h_v^{(l-1)}, AGGREGATE^{(l)} ( { (h_u^{(l-1)}, e_{uv}) : u ∈ N(v) } ) )   (1)

where e_{uv} denotes the feature vector of edge (u, v), h_v^{(0)} = x_v is the initial node feature vector of the graph, N(v) denotes the neighbourhood of node v, and AGGREGATE^{(l)} and COMBINE^{(l)} represent the abstract functions of the graph encoder layer for gathering information from neighbors and aggregating that information into the node representation, respectively.
For graph classification tasks, the node high-level representations derived from Equation 1 need to be further aggregated into a graph-level representation through a function named READOUT, which is usually performed at the final layer of the graph encoder as follows:

h_G = READOUT_θ ( { h_v^{(L)} : v ∈ V } )   (2)

where h_G denotes the representation of the given graph G and { h_v^{(L)} } represents the node representation matrix of the final layer L. READOUT_θ is a parameterized abstract function with parameters θ, which can be implemented as any aggregation function, such as sum, max, or mean pooling, or a more complex approach in real applications.
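As a concrete illustration of Equation 2, the following sketch implements three simple READOUT choices over a node-representation matrix (a list of equal-length feature vectors). This is a minimal pure-Python sketch, not the paper's code.

```python
def sum_pool(H):
    """Sum-pool node representations column-wise into one graph vector."""
    return [sum(col) for col in zip(*H)]

def max_pool(H):
    """Max-pool node representations column-wise."""
    return [max(col) for col in zip(*H)]

def mean_pool(H):
    """Mean-pool node representations column-wise."""
    n = len(H)
    return [sum(col) / n for col in zip(*H)]
```

Any of these collapses a variable number of node vectors into a fixed-size graph representation, which is what makes graph-level classification possible.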
2.3 Graph Transformer Networks
The Transformer architecture originates from the field of Natural Language Processing (NLP) [42]. Due to its excellent performance on various language tasks, it has been generalized to graph-based models [12, 50, 37], i.e., the Graph Transformer (GT) model. A graph transformer block comprises two key components: a self-attention module and a position-wise feed-forward network (FFN). The self-attention module first projects the initial node features X into query (Q), key (K), and value (V) matrices through independent linear transformations:

Q = XW_Q,  K = XW_K,  V = XW_V   (3)

where W_Q, W_K, and W_V denote learnable parameters and d_k denotes the output dimension of the linear transformations.
Then the attention coefficient matrix can be obtained through a scaled dot product between queries and keys:

A = softmax ( QK^T / √d_k )   (4)
Next, the self-attention module outputs the next hidden feature by applying a weighted summation on the values:

H = AV   (5)
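Equations 3 to 5 can be sketched end-to-end as follows. This is a minimal single-head illustration using plain lists instead of tensor libraries; the weight matrices in any real model would be learned, not fixed.

```python
import math

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax_rows(M):
    """Apply a numerically stable softmax to each row of a matrix."""
    out = []
    for row in M:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        s = sum(exps)
        out.append([e / s for e in exps])
    return out

def self_attention(X, WQ, WK, WV):
    """Eqs. (3)-(5): project X to Q, K, V, then weight V by softmax(QK^T/sqrt(d_k))."""
    Q, K, V = matmul(X, WQ), matmul(X, WK), matmul(X, WV)
    d_k = len(WQ[0])
    KT = [list(col) for col in zip(*K)]           # K transposed
    scores = [[v / math.sqrt(d_k) for v in row] for row in matmul(Q, KT)]
    return matmul(softmax_rows(scores), V)
```

With identity projections, two identical input rows attend to each other equally and produce identical outputs, which matches the intuition that attention mixes value vectors according to query-key similarity.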
In order to improve the stability of the model, the multi-head mechanism is often adopted in the self-attention module. After that, the output of the self-attention is followed by a residual connection and a feed-forward network, ultimately providing node-level representations of the graph. Finally, a READOUT function, as in Equation 2, is applied to the final-layer output of the graph transformer model to obtain the graph representation.

3 Proposed Method
3.1 Overview
The proposed method, LogGD, is a graph-based log anomaly detection method that consists of three components: graph construction, graph representation learning, and graph classification. The input is the log sequences generated in Section 2.1, and the output is whether a given log sequence is anomalous or not. LogGD starts by transforming each given log sequence into a graph. The node features contain the semantic information of log events, and the edges capture the connectivity and weights between pairs of nodes. Then, the resulting graph data is fed into a GNN model to learn the patterns of normal and abnormal graphs in the training phase. During the testing (inference) phase, the representation of a given graph, obtained through the same process, is classified as anomalous or non-anomalous.
3.2 Graph Construction
First, each log sequence derived from Section 2.1 is transformed into a directed graph, denoted as G = (V, E, X, W), where V represents the set of nodes, corresponding to the log events of the log dataset, and E refers to the set of edges, i.e., node pairs (u, v) where event u is immediately followed by event v in the sequence. X denotes the node features, corresponding to the semantic vectors of log events generated by an NLP technique. W, i.e., the set of edge weights, indicates the occurrence frequency of each edge in a sequence. It is worth noting that a self-loop edge is always added for the initial event since there is no preceding event before it. In addition, the node set and the corresponding initial node features are shared across all graphs transformed from the same log dataset. Through these steps, we construct the graphs from a given set of log sequences.
Fig. 3 is an example that transforms a log sequence [E1, E2, E3, E2, E3, E4], together with the event semantic vectors, into a graph consisting of node features and graph structure attributes. From the figure, we can see that a graph provides richer spatial structure attributes than a sequence of log events. The spatial structure of a graph includes the node-centered local structure represented by the degree matrix, the global structure of node locations encoded by a distance matrix, and the quantitative connections between nodes represented by the weight matrix. Later, we will see that the combination of these spatial structure attributes helps graph-based methods generate more expressive representations of sequences, thereby improving detection accuracy and stability. An important point to note is that a graph contains no duplicate nodes, unlike sequences, which allow duplicate log events. Taking the sequence above as an example, although events E2 and E3 each appear twice in the sequence, node E2 and node E3 each exist as a single node in the transformed graph. The occurrence frequency information is reflected in the corresponding edge weights, ensuring no information is lost in the resulting graph.
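The construction illustrated in Fig. 3 can be sketched as follows. This is an assumed minimal implementation; node feature vectors are omitted for brevity.

```python
def sequence_to_graph(sequence):
    """Build a directed graph from a log sequence: unique events become nodes,
    each immediate-succession pair becomes a weighted edge, and the initial
    event receives a self-loop."""
    nodes = sorted(set(sequence))        # duplicates collapse into single nodes
    edges = {(sequence[0], sequence[0]): 1}  # self-loop for the initial event
    for u, v in zip(sequence, sequence[1:]):
        edges[(u, v)] = edges.get((u, v), 0) + 1
    return nodes, edges
```

Running this on [E1, E2, E3, E2, E3, E4] yields four nodes and an edge (E2, E3) of weight 2, showing how repeated transitions survive as edge weights rather than duplicate nodes.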
3.3 Graph Representation Learning
Graph representation learning is the process of learning an expressive low-dimensional representation incorporating the node features and spatial structure attributes of a given graph, which is crucial in graph classification tasks. To make the trained model more discriminative towards normal and abnormal graphs, we need to carefully design the features that participate in generating the graph representation.
Semantic-Aware Node Embedding: Each node in a graph represents a unique log event derived from the log parsing process. As many prior studies [52, 6, 26, 25] show, the semantic information embedded in log messages can have a significant impact on the performance of subsequent log anomaly detection. Many NLP models are available to extract semantics from text data, such as Word2Vec [36], GloVe [38], FastText [22], and the BERT model [9]. In this study, we utilize the BERT model to extract the semantics embedded in log messages because it has been proven to better capture and learn the similarity and dissimilarity across log messages based on the position and context of words [19, 26]. We follow [26] to tokenize each template into a set of words and subwords and employ the feature extraction function of pre-trained BERT^1 to obtain the semantic information of each log event. Finally, each log event is encoded into a vector representation with a fixed dimension; in our experiments, this dimension is 768.

^1 https://github.com/google-research/bert

Structure-Aware Encoding for Graph: For log sequences, the sequential relationship between log events is often an important indicator of normality or abnormality. The sequential relation between log events reflects the positional structure of log events in the sequence. Sequence-based methods either implicitly exploit the positional structure of log events by sequentially processing a given sequence (e.g., LogRobust [52]) or employ explicit positional encoding to enhance the sequence representations (e.g., NeuralLog [26]). Unlike sequence data, graphs typically do not have such a sequential node structure due to the invariance of graphs to node permutation. Inspired by [37, 50], in this study we utilize three structural attributes of a graph, the degree matrix, the distance matrix, and the edge weight matrix, to generate a structure-aware encoding that enhances the discriminatory power of the graph representation. Intuitively, the in-degree and out-degree of a node reflect its local topology. They not only represent the importance of a node in the graph but also reflect the similarity between nodes, which can complement the semantic similarity between nodes. In addition, the shortest-path distance matrix of a graph reflects its global spatial structure, while the edge weight matrix incorporates the quantitative relation of connections between nodes. Thus, the degree matrix, distance matrix, and edge weight matrix reflect different aspects of a graph and theoretically complement each other. When combined with node features to generate a graph representation, they help enhance the representation of a given graph for graph classification.
As an example, let us assume [E1, E2, E3, E4, E5, E2, E6] is a normal sequence, while [E1, E2, E3, E4, E5, E2, E2, E6] is abnormal because an additional event, E2, occurs in the penultimate position. Although the degree and distance matrices of the graph cannot capture this anomaly, the change is reflected in the edge weight matrix. As another example, suppose that [E1, E2, E3, E4, E7, E2, E6] is an anomaly because E7 replaces event E5. In this case, the changes are reflected in the entries of all the matrices of the graph: the distance matrix, the degree matrix, and the edge weight matrix. Similarly, whether the sequential relationship between a pair of events changes, or one event is inserted or substituted for another, the anomaly is always reflected in some aspect of the structure-aware encoding. Therefore, structure-aware encoding can be expected to help enhance the representation of a given graph for graph classification.
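The first example above can be checked with a small sketch: comparing the two sequences shows that the node sets match while the edge-weight matrices differ. This is illustrative code under our own assumptions, not the paper's implementation.

```python
def edge_weights(sequence):
    """Directed edge weights of a sequence graph, with the initial self-loop."""
    w = {(sequence[0], sequence[0]): 1}
    for u, v in zip(sequence, sequence[1:]):
        w[(u, v)] = w.get((u, v), 0) + 1
    return w

normal = ["E1", "E2", "E3", "E4", "E5", "E2", "E6"]
abnormal = ["E1", "E2", "E3", "E4", "E5", "E2", "E2", "E6"]
```

The extra E2 introduces a new (E2, E2) entry in the abnormal sequence's weight matrix, which is exactly the signal the text says the edge weight matrix contributes.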
Graph Representation Learning: The informative spatial structure attributes and node features of a graph require GNN models to digest them and produce a high-level representation for the given graph. Many GNN models are available, such as GCN [23], GCNII [5], GAT [43], GATv2 [3], GIN [46], GINE [18], and the Graph Transformer Network (TransformerConv) [41]. In this study, we choose the Graph Transformer Network architecture for graph representation learning because it not only overcomes the limitations of under-reaching and over-squashing in the stacked layers of message-passing GNNs [4], but also exhibits better performance than GAT models in graph classification by exploiting positional/structural encoding [37, 50]. The input to our graph representation learning model is the graphs derived in Section 3.2, together with their structure-aware encoding, i.e., the degree, distance, and edge weight matrices.

First, for the degree matrix embeddings, inspired by [50], we add them to the node features. By doing so, the model can capture both the node similarity represented by the node-centered local topology and the semantic correlations embedded in the node features through an attention mechanism.
h_v^{(0)} = x_v + z^-_{deg^-(v)} + z^+_{deg^+(v)}   (6)

where x_v denotes the feature of node v, and z^-_{deg^-(v)} and z^+_{deg^+(v)} are learnable embedding vectors corresponding to the in-degree and out-degree of node v, respectively.
Second, we encode the quantitative connection relationships between nodes and incorporate the edge weight information into Q and V by element-wise multiplication:

Q̂ = Q ⊙ s,  V̂ = V ⊙ s   (7)

where s denotes the summation of the edge weight matrix along the row dimension, i.e., s_i = Σ_j W_{ij}, and Q and V represent the outputs of the projection of the node features of the graph by the linear transformations in Equation 3.
In the third step, to encode a graph's global structure attribute, we do not directly add the relative path distance scalar to the attention coefficients. Instead, we adopt the practice described in [37] and define distance embeddings d_1, ..., d_Φ, where Φ is the maximum shortest-path distance between nodes in a graph. The distance embeddings are shared across all layers. The global structure is then encoded as a spatial bias term b_{ij}, computed as the dot product between the node feature and the distance embedding. Notably, this dot product reflects the interaction between node features and graph structure. As such, the proposed spatial bias term enables the trained model to distinguish different structures even when two node pairs have the same distance.

b_{ij} = x_i · d_{φ(i,j)}   (8)

where φ(i,j) denotes the shortest-path distance between nodes i and j.
Then, the spatial bias term is added to the scaled dot-product attention coefficient matrix to encode the global structural attribute:

A = softmax ( Q̂K^T / √d_k + B )   (9)

where B = [b_{ij}] is the matrix of spatial bias terms.
Finally, the node features are encoded into hidden features by a weighted summation of the value and spatial bias terms with the attention coefficient matrix:

h_i = Σ_j A_{ij} ( v̂_j + b_{ij} )   (10)
Our method encodes both the node-wise information (attention coefficients) and the interaction-wise information between the nodes and structure of a graph into the hidden features of the value, which differs from methods that encode only node-wise information. Thus, when the attention weight is applied equally to all channels, our graph-encoded value enriches the feature of each channel.
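A minimal single-head sketch of Equations 9 and 10 follows, where the spatial bias is added both to the attention logits and to the aggregated values. The matrix shapes and the scalar treatment of the bias are our reading of the text, so treat this as an assumption-laden illustration rather than the paper's code.

```python
import math

def biased_attention(scores, B, V):
    """Eqs. (9)-(10): add spatial bias B to scaled attention scores, softmax,
    then aggregate each (value + bias) with the attention coefficients."""
    H = []
    for i, row in enumerate(scores):
        logits = [s + B[i][j] for j, s in enumerate(row)]
        m = max(logits)
        exps = [math.exp(v - m) for v in logits]
        Z = sum(exps)
        A = [e / Z for e in exps]
        # weighted sum of (value + bias) per Eq. (10)
        H.append([sum(A[j] * (V[j][k] + B[i][j]) for j in range(len(V)))
                  for k in range(len(V[0]))])
    return H
```

With a zero bias this reduces to plain attention; a large positive bias toward one node both concentrates the attention weights on it and shifts the aggregated value, which is the "interaction" effect the text describes.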
Table 1: Details of the log datasets (training set = 80%, testing set = 20%).

| Dataset | #Nodes | Window | Train #Graphs | Train #Nodes | Train #Anom. | Train %Anom. | Test #Graphs | Test #Nodes | Test #Anom. | Test %Anom. |
|---|---|---|---|---|---|---|---|---|---|---|
| HDFS | 48 | session | 460,048 | 48 | 13,470 | 2.93% | 115,013 | 43 | 3,368 | 2.93% |
| BGL | 1,847 | 100 logs | 37,708 | 980 | 4,009 | 10.63% | 9,427 | 1,063 | 817 | 8.66% |
| BGL | 1,847 | 60 logs | 62,847 | 980 | 6,307 | 10.04% | 15,712 | 1,063 | 1,194 | 7.60% |
| BGL | 1,847 | 20 logs | 188,540 | 980 | 17,252 | 9.15% | 47,135 | 1,063 | 3,006 | 6.38% |
| Spirit | 1,229 | 100 logs | 63,867 | 1,209 | 20,195 | 31.62% | 15,967 | 844 | 399 | 2.50% |
| Spirit | 1,229 | 60 logs | 106,445 | 1,209 | 30,882 | 29.01% | 26,612 | 844 | 410 | 1.54% |
| Spirit | 1,229 | 20 logs | 319,334 | 1,209 | 81,550 | 25.54% | 79,834 | 844 | 438 | 0.55% |
| TDB | 4,992 | 100 logs | 79,674 | 3,779 | 816 | 1.02% | 19,919 | 1,923 | 27 | 0.14% |
| TDB | 4,992 | 60 logs | 132,789 | 3,779 | 985 | 0.74% | 33,198 | 1,923 | 34 | 0.10% |
| TDB | 4,992 | 20 logs | 398,367 | 3,779 | 1,394 | 0.35% | 99,592 | 1,923 | 48 | 0.01% |

#Nodes: number of unique events; #Graphs: number of sequences; #Anom.: number of anomalies; %Anom.: percentage of anomalies.
3.4 Anomaly Detection through Graph Classification
To implement the classification task, the output graph representation of the graph encoder layer is directly connected to a feed-forward network with layer normalization (LN) [1], which contains three fully connected layers with Gaussian Error Linear Units (GELU) [17] as the activation function. The sum and maximum of the output node representations are concatenated as the READOUT function for the graph representation. Then, cross-entropy is used as the loss function, and the class probabilities of normality and abnormality for the given log sequences are calculated using the softmax function:

h_G = CONCAT ( Σ_{v∈V} h_v^{(L)}, max_{v∈V} h_v^{(L)} )   (11)

ŷ = softmax ( FFN(h_G) )   (12)
In this way, we train a GNN-based model for log-based anomaly detection. When a set of new log messages is provided, they are first preprocessed. Then the new log messages are transformed into semantic vectors as node features, and the sequences are converted to graphs. Afterward, the resulting graph data is fed into the trained model. Finally, the GNN-based model predicts whether the given graph is anomalous or not.
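The pieces of the classification head can be sketched as follows. The GELU approximation, the concatenated sum/max readout, and the softmax are standard formulas; the layer sizes and wiring of the full head are simplified away here.

```python
import math

def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1 + math.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x ** 3)))

def readout(H):
    """Concatenate sum- and max-pooled node representations (Eq. 11)."""
    return [sum(c) for c in zip(*H)] + [max(c) for c in zip(*H)]

def softmax(logits):
    """Turn class logits into normal/abnormal probabilities (Eq. 12)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]
```

In the full model, `readout` would feed the three-layer GELU feed-forward network, whose two output logits pass through `softmax` to give the anomaly probability.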
4 Evaluation
4.1 Datasets
In our experiments, four public log datasets, HDFS, BGL, Spirit, and Thunderbird, are used to evaluate the proposed approach and the relevant baseline methods. These datasets are widely used in log analysis research [26, 51, 10, 29, 48, 24] because all of them come from real-world systems and are labeled either manually by system administrators or through alert tags automatically generated by their systems. We obtained all the log datasets from publicly available websites.^2 Further details about the four datasets are described as follows.

^2 https://github.com/logpai/loghub and https://www.usenix.org/cfdr-data
HDFS dataset is generated by running Hadoop-based MapReduce jobs on more than 200 Amazon EC2 nodes. Of the 11,197,954 log entries collected, approximately 2.9% are abnormal.
BGL dataset is an open dataset of logs collected from a BlueGene/L supercomputer system at Lawrence Livermore National Labs (LLNL) in Livermore, California, with 131,072 processors and 32,768GB memory.
Spirit dataset is collected from a high-performance cluster installed at Sandia National Labs (SNL), with 1,028 processors and 1,024GB memory. In this study, we utilize 1GB of continuous log lines from the Spirit dataset for computation-time reasons.
Thunderbird (TDB) dataset is also from the supercomputer at Sandia National Labs (SNL). The dataset is a large dataset of more than 200 million log messages. We only leverage 10 million continuous log lines for computationtime purposes.
The details of the log datasets used in our experiments are summarized in Table 1. From the table, we can see that these datasets exhibit diversity in node size and anomaly rate, which helps validate the generalizability of the evaluated methods.
4.2 Implementation and Experimental Setting
We implemented LogGD and its variants based on Python 3.8.5, PyTorch 1.11.0, and PyG 2.0.4. For the GCNII, GINE, GATv2, and TransformerConv models, we utilized the corresponding modules from PyG with their default parameter settings. In our experiments, we set the graph encoder layer size of LogGD to 1. The size of the feed-forward network that takes the output of the encoder layer is 1024. LogGD is trained using the AdamW optimizer, with the learning rate decaying linearly from its initial to its final value. We set the mini-batch size and the dropout rate to 64 and 0.3, respectively. We use cross-entropy as the loss function. The model trains for up to 100 epochs, with early stopping after 20 consecutive iterations without loss improvement.
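The schedule and early-stopping logic can be sketched as below. The exact start and end learning rates are not reproduced here, so the values in the example are placeholders.

```python
def linear_decay(lr_start, lr_end, epoch, total_epochs):
    """Linearly interpolate the learning rate from lr_start to lr_end."""
    frac = epoch / max(total_epochs - 1, 1)
    return lr_start + (lr_end - lr_start) * frac

class EarlyStopper:
    """Stop training after `patience` consecutive steps without improvement."""
    def __init__(self, patience=20):
        self.patience, self.best, self.bad = patience, float("inf"), 0

    def step(self, loss):
        """Return True when training should stop."""
        if loss < self.best:
            self.best, self.bad = loss, 0
        else:
            self.bad += 1
        return self.bad >= self.patience
```

In practice PyTorch's built-in schedulers and a validation loop would play these roles; the sketch only makes the stopping criterion explicit.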
Regarding the baseline approaches used for comparison, we adopt the implementations from the studies [16, 6, 26] and use the parameters set in those implementations. In the experiments, we ran each method three times under each window setting on the four datasets and report the average as the final result.
We conducted all the experiments on a Linux server with an AMD Ryzen 3.5GHz CPU, 96GB memory, and an RTX 2080 Ti GPU with 11GB memory, running Ubuntu 20.04.
4.3 Compared Methods
To evaluate the effectiveness of the proposed method, we compare LogGD with five state-of-the-art supervised log anomaly detection methods on the aforementioned public log datasets: two quantitative-based methods, LR [2] and SVM [27], and three sequence-based methods, CNN [30], LogRobust [52], and NeuralLog [26].
We did not directly compare the proposed method with another state-of-the-art graph-based method, GLADPAW [44], because GLADPAW is a semi-supervised method. However, we implemented a corresponding supervised method using the GAT model and compare it with our proposed method in the subsequent experiments.
4.4 Evaluation Metrics
To evaluate the effectiveness of the approaches, we utilize precision, recall, and F1 score as the metrics, which are widely used in many studies [16, 6, 26]. Specifically, the metrics are calculated as follows:

Precision: the percentage of correctly detected abnormal log sequences amongst all detected abnormal log sequences by the model.

Recall: the percentage of log sequences that are correctly identified as anomalies over all real anomalies.

F1 score: the harmonic mean of precision and recall.
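For reference, the metrics can be computed directly from binary predictions (1 = anomaly). This is the standard computation, not code from the paper.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 from binary labels (1 = anomaly)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

The guards against empty denominators matter on highly imbalanced splits such as Thunderbird, where a model may predict no anomalies at all.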
4.5 Research Questions
Our experiments are designed to answer the following research questions:

RQ1. How effective is the proposed graphbased approach for log anomaly detection?

RQ2. Can the proposed approach work stably under various window settings?

RQ3. How does LogGD perform with other GNN models?

RQ4. How do the specific structural features and the interaction affect the performance of LogGD?
4.6 Results and Analysis
RQ1. How effective is the proposed graphbased approach for log anomaly detection?
Table 2: Experimental results of LogGD and the baseline methods (RQ1).

| Dataset | Metric | LogGD | LR | SVM | LogRobust | CNN | NeuralLog |
|---|---|---|---|---|---|---|---|
| HDFS | F1 | 0.9877 | 0.9616 | 0.8330 | 0.9819 | 0.9872 | 0.9827 |
| HDFS | Precision | 0.9774 | 0.9603 | 0.9519 | 0.9688 | 0.9852 | 0.9627 |
| HDFS | Recall | 0.9982 | 0.9629 | 0.7405 | 0.9954 | 0.9891 | 0.9956 |
| BGL | F1 | 0.9719 | 0.2799 | 0.4558 | 0.9402 | 0.9140 | 0.9535 |
| BGL | Precision | 0.9708 | 0.1684 | 0.8190 | 0.9229 | 0.8669 | 0.9586 |
| BGL | Recall | 0.9731 | 0.8286 | 0.3158 | 0.9596 | 0.9702 | 0.9484 |
| Spirit | F1 | 0.9789 | 0.9652 | 0.9736 | 0.9757 | 0.9652 | 0.9510 |
| Spirit | Precision | 0.9889 | 0.9580 | 0.9773 | 0.9957 | 0.9740 | 0.9694 |
| Spirit | Recall | 0.9691 | 0.9724 | 0.9699 | 0.9566 | 0.9566 | 0.9349 |
| TDB | F1 | 0.9284 | 0.4651 | 0.7797 | 0.4043 | 0.5533 | 0.7704 |
| TDB | Precision | 0.9772 | 0.3390 | 0.7188 | 0.4329 | 0.5405 | 0.9683 |
| TDB | Recall | 0.8889 | 0.7407 | 0.8519 | 0.4198 | 0.5802 | 0.6437 |
In this experiment, we aim to evaluate the effectiveness of LogGD on the four aforementioned public log datasets. To generate the log sequences, we use session windows on the HDFS dataset to group log messages by block ID, as the data is labeled by blocks. Then, 80% of the log sequences are randomly selected for training, and the rest are used for testing. For the BGL, Spirit, and Thunderbird datasets, we keep the chronological order of the dataset and use the first 80% of log messages as the training set and the remaining 20% as the testing set, which follows real-world scenarios and ensures that log events unseen in the training set appear in the testing set. We group log sequences on the BGL, Spirit, and Thunderbird datasets by fixed windows rather than by session or sliding windows because no universal identifier is available for session grouping, and the fixed-window grouping strategy is more storage-efficient than the sliding-window strategy. In this experiment, the fixed window size of the input data is set to 100 logs. We present the results for other window-size settings under the subsequent research question. In addition, we utilize an oversampling technique to address the imbalance of the training data: if the anomaly rate of the training data is less than 30%, we oversample it to 30%; otherwise, we do not oversample. To make a fair comparison, all the deep learning-based methods, LogRobust, CNN, NeuralLog, and LogGD, take the semantic vectors of log messages generated by the BERT model as input. In addition, 10% of the training set on each dataset is used as the validation set to decide when to stop training early.
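The oversampling step can be sketched as follows. The random-duplication policy and the rounding are assumptions, since the text only states the 30% target rate.

```python
import math
import random

def oversample_to_rate(samples, labels, target_rate=0.3, seed=0):
    """Duplicate anomalous samples (label 1) until they reach `target_rate`
    of the data. A hedged sketch; the paper does not detail this step."""
    pairs = list(zip(samples, labels))
    anomalies = [p for p in pairs if p[1] == 1]
    normal = [p for p in pairs if p[1] == 0]
    if not anomalies or len(anomalies) / len(pairs) >= target_rate:
        return pairs  # already at or above the target anomaly rate
    # solve n_anom / (n_anom + n_norm) >= target_rate for n_anom
    needed = math.ceil(target_rate * len(normal) / (1 - target_rate)) - len(anomalies)
    rng = random.Random(seed)
    extra = [rng.choice(anomalies) for _ in range(needed)]
    return anomalies + extra + normal
```

On a split with 1 anomaly among 10 samples, this duplicates the anomaly until the anomalous share first reaches 30%, leaving datasets that already exceed the target untouched.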
The experimental results are shown in Table 2. It can be seen that LogGD outperforms all comparison methods on the four datasets under a fixed window setting of 100 logs, improving the F1 score by 0.5% on HDFS, 1.9% on BGL, 0.5% on Spirit, and 19.1% on Thunderbird compared to the second-best method. Meanwhile, from the table, we can also see that all the baseline methods perform poorly on the Thunderbird dataset because it is highly imbalanced, with a small percentage of anomalies (1.02% and 0.14% in the training and test sets, respectively). Although oversampling was applied during data preprocessing, the scarcity of anomalies still leads to poor performance for all baseline methods. In contrast, LogGD performs much better, albeit with a slight drop in performance on the Thunderbird dataset. This may be attributed to our method's ability to capture additional graph structure information that helps distinguish normal from abnormal graphs. Furthermore, although both NeuralLog [26] and LogGD are based on the Transformer architecture, utilizing graph structure information rather than sequential positional encoding still enables our graph-based method to achieve better performance.
RQ2. Can the proposed approach work stably under various window settings?
This experiment aims to investigate whether LogGD can work stably under various window settings. We conduct the experiments only on the BGL, Spirit, and Thunderbird datasets because HDFS only has a session window setting. The experiment is conducted under three window size settings, i.e., 100 logs, 60 logs, and 20 logs. Larger window sizes were not chosen because they are generally unlikely to be adopted in real scenarios due to potential delays in fault discovery.
The comparison results between the baseline methods and LogGD are shown in Fig. 4. From the figures, we can see that LogGD works more stably across the three window size settings of each dataset than the quantitative-based and sequence-based baseline methods, and also achieves better performance in most cases. It is worth noting that the two quantitative-based methods both perform poorly on BGL. This may be explained by the fact that the test set of BGL contains many unseen events that never appeared in the training set. Quantitative-based methods rely only on the quantitative patterns among log events in the sequences and fail to capture the semantics in log messages, preventing them from learning discriminative characteristics to classify the anomalies. This conjecture is also supported by the fact that all sequence-based and graph-based methods exploiting the semantics of log events work well on the BGL dataset. Another notable observation is that the traditional ML-based methods, LR and SVM, perform even better than some sequence-based deep learning methods on the Spirit and Thunderbird datasets, especially under smaller window settings, although they still underperform LogGD in most cases. This implies that the quantitative pattern among log events is an important indicator that should not be neglected in log anomaly detection. In addition, the overall F1 scores of both NeuralLog and LogGD are better than those of the other sequence-based methods, such as CNN and LogRobust. This can be explained by the fact that Transformer-based methods, including NeuralLog and LogGD, benefit from exploiting extra information (positional encodings and spatial structure encodings) in the sequence, leading to better and more stable performance. Finally, it should not be overlooked that all the methods show a trend of significant growth in detection performance on the Thunderbird dataset as the window size decreases. The reason may be the increase in the number of anomalies due to the reduced window size, which enables all the methods to improve their detection performance with the further help of oversampling.
RQ3. How does LogGD perform with other GNN models?
In this experiment, we investigate the impact of using different types of GNN layers for graph representation learning on the overall detection performance, i.e., an ablation experiment over GNN models. We replaced our customized graph transformer layer with the GCNII [5], GINE [18], GATv2 [3], and TransformerConv [41] models. We did not compare with GCN [23], GIN [46], and GAT [43] because GCNII, GINE and GATv2 have shown better performance than their corresponding counterparts in their original studies. We present the results in Precision, Recall, and F1-score on the four datasets under the session window setting (HDFS only) and the 100-logs, 60-logs and 20-logs window settings in Fig 5. As seen from the figure, all variants of LogGD achieve good performance on the four datasets, while LogGD with our customized graph transformer layer slightly outperforms the variants with other GNN layers in most cases. Furthermore, both LogGD with the customized graph transformer layer and the variant with the TransformerConv layer outperform the other variants. The results show that the Transformer architecture can overcome the limitations of other Message Passing Neural Network (MPNN) models and better utilize graph structure attributes to learn the normal and anomalous patterns for graph classification.
RQ4. How do the specific structural features and the interaction between node features and structure affect the performance of LogGD?
In this experiment, we investigate the impact on overall detection performance of graph representations with/without specific structural attributes and with/without the interaction between node features and graph structure, i.e., an ablation experiment over the input features. We present the results in Precision, Recall, and F1-score on the four datasets in Fig 6. From the figure, we can see that all variants of LogGD achieve good performance, with an over 90% F1 score, on the four datasets under different window settings. Second, the effect of the structural attributes and of the interaction between node features and structure appears to be data-dependent. For example, the performance of almost all variants with/without a specific attribute differs only slightly on the HDFS and Spirit datasets. In contrast, the variation in performance is higher on the BGL and Thunderbird datasets for variants that include/exclude the corresponding attribute. Furthermore, the performance of the variant without the interaction between node features and structure (i.e., with the distance embedding directly added to the attention coefficients) degrades significantly on the BGL and Thunderbird datasets. This may indicate that the interaction between node features and structure does help to improve the model's power to discriminate between abnormal and normal sequences. In contrast, the effects of the degree alone and the edge weight alone appear to be less significant. Interestingly, the variant that excludes the degree attribute performs better on the Thunderbird dataset. In future work, we will further investigate how to better utilize the degree attribute and how to better represent the node-centered topology to improve detection performance.
Finally, we can see that LogGD, which combines the three graph structure attributes (degree, distance and edge weight) with the interaction between node features and structure, performs better in most data settings. This confirms the advantage of combining graph structure and node features.
5 Discussion
5.1 The advantages and limitations of LogGD
Two main reasons make LogGD perform better than the related approaches. First, LogGD can capture more expressive structural information from graphs than purely sequential relations between log events. These enhanced features help LogGD better identify anomalous log sequences. Second, the customized graph transformer model captures the interaction between node features and the graph structure represented by the shortest relative path distance, which further improves anomaly detection performance.
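The distance-based interaction can be illustrated with a small numpy sketch: hop distances are computed with Floyd-Warshall, and a per-distance bias is added to the dot-product attention logits. This is a simplified illustration of the general technique rather than LogGD's actual implementation; the `dist_bias` table stands in for a learned distance embedding.

```python
import numpy as np

def shortest_path_distances(adj):
    """All-pairs shortest-path hop distances via Floyd-Warshall.

    adj: (n, n) 0/1 adjacency matrix of a directed event graph.
    Unreachable pairs keep a large sentinel value.
    """
    n = adj.shape[0]
    INF = n + 1  # sentinel: no simple path exceeds n - 1 hops
    d = np.where(adj > 0, 1, INF).astype(float)
    np.fill_diagonal(d, 0)
    for k in range(n):
        # relax every pair (i, j) through intermediate node k
        d = np.minimum(d, d[:, [k]] + d[[k], :])
    return d

def distance_biased_attention(x, dist, dist_bias):
    """Self-attention weights with a per-distance bias on the logits.

    x: (n, h) node features; dist: (n, n) hop distances;
    dist_bias: 1-D array of biases indexed by the (clipped) distance.
    """
    h = x.shape[1]
    logits = (x @ x.T) / np.sqrt(h)                    # plain dot-product scores
    idx = np.minimum(dist, len(dist_bias) - 1).astype(int)
    logits = logits + dist_bias[idx]                   # structural bias term
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)            # row-wise softmax

# Tiny chain graph 0 -> 1 -> 2: node 2 is two hops from node 0.
adj = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]])
d = shortest_path_distances(adj)
print(d[0, 2])  # 2.0
```

Because the bias is added before the softmax, nearby nodes can be made to attend to each other more strongly, letting structure modulate the feature-based attention rather than being concatenated afterwards.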
Our study demonstrates the effectiveness of LogGD for anomaly detection. However, LogGD still has limitations. Our method is supervised, so intensive data labeling is inevitable, which may limit its adoption in industry. In future work, we will consider self-supervised techniques to enable LogGD to work in semi-supervised or unsupervised mode and improve its adaptability to real-world scenarios. Also, LogGD identifies anomalies at the graph level. Developers and operators may still have to inspect each event in the data window to locate the underlying fault [24]. It would be interesting to explore the feasibility of more fine-grained anomaly detection to reduce the effort and time needed to locate a fault.
5.2 Threats To Validity
In this study, we identify the following threats to validity:
1) External validity threats: these are factors that affect the generalization of the results. In this study, the external threat to validity lies in the selected datasets, i.e., subject selection bias. In our experiments, we use only four log datasets as experimental subjects. However, these datasets come from real-world industrial systems and are widely used in existing work [16, 31, 51, 25]. We therefore believe these four log datasets are representative. In the future, we will evaluate LogGD on more datasets and systems.
2) Internal validity threats: these are threats that may have affected the results and have not been properly considered. In this study, the internal threat to validity mainly lies in the implementations of LogGD and the compared approaches, and in the design of the experiments. To reduce the implementation threat, we inspected and tested the programs carefully and fixed the bugs revealed by testing. We implemented LogGD based on popular libraries, and for the compared methods we used their open-source implementations. Regarding the threat posed by the experimental design, on the one hand, we ran all experiments three times and report the average of the results. On the other hand, we compare all methods under the same settings: for example, all the deep-learning-based methods share the same semantic extraction scheme, oversampling scheme, and early stopping scheme. We believe our experimental design yields fair comparisons among all evaluated methods.
6 Related Work
6.1 Graph Positional Encoding
Positional Encoding (PE) is commonly used in image- and text-based tasks with deep learning models, such as image classification [7] and language translation [40], and plays a crucial role in improving model effectiveness. Studies [13, 44] have shown that PE representing the structural attributes of graphs is also essential for prediction tasks on graphs. However, finding such positional encodings for nodes is challenging due to the invariance of graphs to node permutation. Existing PE schemes can be classified into index PE, spectral PE, diffusion-based PE and distance-based PE. For index PE, one option is to assign indices based on preset rules as positional encodings to the nodes in the graph, as in [44]. However, this scheme still follows a sequence pattern and does not fully exploit the spatial structural attributes of graphs. Another way to build an index PE is to use all possible index permutations, or to sample them, to train the model [35]. However, this results in either expensive computation or the loss of precise positions. Spectral PE uses Laplacian eigenvectors as a meaningful local coordinate system that conserves the global graph structure. However, spectral PE suffers from sign ambiguity, which requires random sign flipping during training for the network to learn the invariance [13]. Diffusion-based PE, such as [13, 33], builds on diffusion processes such as Random Walk and PageRank; however, it tends to be dataset-dependent [39]. Recently, the shortest path distance has been used as positional encoding in [50, 37] and shows promising results. In this work, we utilize the shortest path distance as the basis of our graph global structure encoding.

6.2 Log-based Anomaly Detection
Log-based anomaly detection has been intensively studied in recent decades. Existing log anomaly detection approaches can be roughly categorized into quantitative-based, sequence-based and graph-based methods in terms of the input data.
Quantitative-based methods work on the log event count matrix. Generally, they can be further divided into traditional ML-based and invariant-relation-mining-based methods. Traditional ML-based methods, such as LR [2], SVM [27], PCA (Principal Component Analysis) [47] and LogCluster [28], are often more efficient than deep-learning-based methods in terms of time cost. Invariant-relation-mining-based methods, such as Invariants Mining [29], ADR [51] and LogDP [45], have the advantages of low labeling cost and interpretability because they usually work in semi-supervised or unsupervised mode and can capture meaningful relations. Despite these advantages, quantitative-based methods tend to suffer from unstable performance in some cases because they cannot capture the sequential patterns and semantic information between log events.

In contrast, sequence-based methods take a sequence of log events as input. They typically use various deep learning models to learn sequential patterns in log sequences for anomaly detection, showing impressive performance. Such methods include DeepLog [10], LogAnomaly [31], LogRobust [52], CNN [46] and NeuralLog [26]. A potential weakness is that most sequence-based methods rely only on the sequential relationships between log events for anomaly detection. The inability to capture more informative structural information from log sequences hinders sequence-based methods from further improving detection performance and leads to significant performance fluctuations in some cases.
In the past decade, Graph Neural Networks (GNNs) have attracted much attention for their successful applications in many areas, such as drug discovery [21], social network analysis [34], and network intrusion detection [53]. Recently, the authors of GLAD-PAW [44] applied GAT to log anomaly detection and confirmed the feasibility of graph-based log anomaly detection methods. However, although their approach is graph-based, their structure-aware design still follows a sequence pattern: it uses only positional encoding and fails to fully exploit the spatial structure of graphs, which combines local structure, global structure, and the quantitative relationships between nodes. In addition, GLAD-PAW does not consider the interaction between node features and graph structure, which may lead to suboptimal results. As a graph-based method, LogGD utilizes a combination of the node-centered local structure, the global structure encoding node positions, and the quantitative relationships of connections between nodes to generate an expressive representation for a given graph. In the representation learning stage, the interaction between node features and graph structure is captured by a customized Graph Transformer Network, which contributes to the improvement and stability of log anomaly detection performance and demonstrates the effectiveness of graph-based methods.
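The kind of event graph such methods operate on can be sketched as follows: nodes are the unique events in a window, and a directed edge weight counts how often one event immediately follows another. This is a simplification assumed for illustration, not the paper's exact construction.

```python
from collections import defaultdict

def build_event_graph(window):
    """Directed, weighted event graph from one log window.

    Nodes are the unique events; edge (u, v) is weighted by the number of
    times v immediately follows u, capturing the quantitative relationship
    between connected nodes alongside the local sequential structure.
    """
    weights = defaultdict(int)
    for u, v in zip(window, window[1:]):
        weights[(u, v)] += 1
    nodes = sorted(set(window))
    return nodes, dict(weights)

nodes, edges = build_event_graph(["E1", "E2", "E1", "E2", "E3"])
print(nodes)   # ['E1', 'E2', 'E3']
print(edges)   # {('E1', 'E2'): 2, ('E2', 'E1'): 1, ('E2', 'E3'): 1}
```

Unlike a flat sequence, repeated transitions collapse into weighted edges, so the representation exposes both which events co-occur and how strongly they are connected.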
7 Conclusion
In this paper, we have proposed a graph-based log anomaly detection method, LogGD, which detects system anomalies from logs with high accuracy and stable performance by combining graph structure and node features. Our experimental results on four widely used public datasets illustrate the effectiveness of LogGD. We hope that LogGD can inspire researchers and engineers to further explore the application of graph neural networks in log analysis.
Acknowledgment
This research was supported by an Australian Government Research Training Program (RTP) Scholarship, and by the Australian Research Council’s Discovery Projects funding scheme (project DP200102940). The work was also supported with supercomputing resources provided by the Phoenix High Powered Computing (HPC) service at the University of Adelaide.
References
 [1] (2016) Layer normalization. arXiv preprint arXiv:1607.06450. Cited by: §3.4.
 [2] (2010) Fingerprinting the datacenter: automated classification of performance crises. In Proceedings of the 5th European conference on Computer systems, pp. 111–124. Cited by: §1, §4.3, §6.2.
 [3] (2021) How attentive are graph attention networks?. arXiv preprint arXiv:2105.14491. Cited by: §3.3, §4.6.
 [4] (2022) Rewiring with positional encodings for graph neural networks. arXiv preprint arXiv:2201.12674. Cited by: §2.2, §3.3.
 [5] (2020) Simple and deep graph convolutional networks. In International Conference on Machine Learning, pp. 1725–1735. Cited by: §3.3, §4.6.
 [6] (2021) Experience report: deep-learning-based system log analysis for anomaly detection. arXiv preprint arXiv:2107.05908. Cited by: §3.3, §4.2, §4.4.
 [7] (2021) Conditional positional encodings for vision transformers. arXiv preprint arXiv:2102.10882. Cited by: §6.1.

 [8] (2020) Logram: efficient log parsing using n-gram dictionaries. IEEE Transactions on Software Engineering. Cited by: §2.1.
 [9] (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §3.3.
 [10] (2017) Deeplog: anomaly detection and diagnosis from system logs through deep learning. In Proceedings of the 2017 ACM SIGSAC conference on computer and communications security, pp. 1285–1298. Cited by: §1, §1, §4.1, §6.2.
 [11] (2016) Spell: streaming parsing of system event logs. In 2016 IEEE 16th International Conference on Data Mining (ICDM), pp. 859–864. Cited by: §2.1.
 [12] (2020) A generalization of transformer networks to graphs. arXiv preprint arXiv:2012.09699. Cited by: §2.3.
 [13] (2021) Graph neural networks with learnable structural and positional representations. arXiv preprint arXiv:2110.07875. Cited by: §2.2, §6.1.

 [14] (2012) Long short-term memory. Supervised sequence labelling with recurrent neural networks, pp. 37–45. Cited by: §1.
 [15] (2017) Drain: an online log parsing approach with fixed depth tree. In 2017 IEEE international conference on web services (ICWS), pp. 33–40. Cited by: §1, §2.1.
 [16] (2016) Experience report: system log analysis for anomaly detection. In 2016 IEEE 27th international symposium on software reliability engineering (ISSRE), pp. 207–218. Cited by: §2.1, §4.2, §4.4, §5.2.
 [17] (2016) Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415. Cited by: §3.4.
 [18] (2019) Strategies for pretraining graph neural networks. arXiv preprint arXiv:1905.12265. Cited by: §3.3, §4.6.
 [19] (2020) Hitanomaly: hierarchical transformers for anomaly detection in system log. IEEE transactions on network and service management 17 (4), pp. 2064–2076. Cited by: §3.3.
 [20] (1962) Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. The Journal of physiology 160 (1), pp. 106. Cited by: §1.
 [21] (2021) Could graph neural networks learn better molecular representation for drug discovery? a comparison study of descriptorbased and graphbased models. Journal of cheminformatics 13 (1), pp. 1–23. Cited by: §2.2, §6.2.
 [22] (2016) FastText.zip: compressing text classification models. arXiv preprint arXiv:1612.03651. Cited by: §3.3.
 [23] (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §3.3, §4.6.
 [24] (2022) Deep learning for anomaly detection in log data: a survey. arXiv preprint arXiv:2207.03820. Cited by: §4.1, §5.1.
 [25] (2022) Log-based anomaly detection with deep learning: how far are we?. arXiv preprint arXiv:2202.04301. Cited by: §2.1, §3.3, §5.2.
 [26] (2021) Log-based anomaly detection without log parsing. In 2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 492–504. Cited by: §1, §1, §3.3, §3.3, §4.1, §4.2, §4.3, §4.4, §4.6, §6.2.
 [27] (2007) Failure prediction in ibm bluegene/l event logs. In Seventh IEEE International Conference on Data Mining (ICDM 2007), pp. 583–588. Cited by: §1, §4.3, §6.2.
 [28] (2016) Log clustering based problem identification for online service systems. In 2016 IEEE/ACM 38th International Conference on Software Engineering Companion (ICSE-C), pp. 102–111. Cited by: §1, §6.2.
 [29] (2010) Mining invariants from console logs for system problem detection.. In USENIX Annual Technical Conference, pp. 1–14. Cited by: §1, §1, §4.1, §6.2.

 [30] (2018) Detecting anomaly in big data system logs using convolutional neural network. In 2018 IEEE 16th Intl Conf on Dependable, Autonomic and Secure Computing, 16th Intl Conf on Pervasive Intelligence and Computing, 4th Intl Conf on Big Data Intelligence and Computing and Cyber Science and Technology Congress (DASC/PiCom/DataCom/CyberSciTech), pp. 151–158. Cited by: §1, §4.3.
 [31] (2019) LogAnomaly: unsupervised detection of sequential and quantitative anomalies in unstructured logs. In IJCAI, Vol. 19, pp. 4739–4745. Cited by: §5.2, §6.2.
 [32] (2018) A search-based approach for accurate identification of log message formats. In Proceedings of the 26th Conference on Program Comprehension, pp. 167–177. Cited by: §1.
 [33] (2021) Graphit: encoding graph structure in transformers. arXiv preprint arXiv:2106.05667. Cited by: §6.1.
 [34] (2021) STGSN—a spatial–temporal graph neural network framework for time-evolving social networks. Knowledge-Based Systems 214, pp. 106746. Cited by: §2.2, §6.2.
 [35] (2019) Relational pooling for graph representations. In International Conference on Machine Learning, pp. 4663–4673. Cited by: §6.1.
 [36] (2016) Integrating distributional lexical contrast into word embeddings for antonym-synonym distinction. arXiv preprint arXiv:1605.07766. Cited by: §3.3.
 [37] (2022) GRPE: relative positional encoding for graph transformer. In ICLR2022 Machine Learning for Drug Discovery, Cited by: §2.3, §3.3, §3.3, §3.3, §6.1.
 [38] (2014) Glove: global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543. Cited by: §3.3.
 [39] (2022) Recipe for a general, powerful, scalable graph transformer. arXiv preprint arXiv:2205.12454. Cited by: §6.1.
 [40] (2018) Selfattention with relative position representations. arXiv preprint arXiv:1803.02155. Cited by: §6.1.
 [41] (2020) Masked label prediction: unified message passing model for semisupervised classification. arXiv preprint arXiv:2009.03509. Cited by: §3.3, §4.6.
 [42] (2017) Attention is all you need. Advances in neural information processing systems 30. Cited by: §1, §2.3.
 [43] (2017) Graph attention networks. arXiv preprint arXiv:1710.10903. Cited by: §3.3, §4.6.
 [44] (2021) GLAD-PAW: graph-based log anomaly detection by position-aware weighted graph attention network. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 66–77. Cited by: §4.3, §6.1, §6.2.
 [45] (2021) LogDP: combining dependency and proximity for log-based anomaly detection. In International Conference on Service-Oriented Computing, pp. 708–716. Cited by: §1, §6.2.
 [46] (2018) How powerful are graph neural networks?. arXiv preprint arXiv:1810.00826. Cited by: §3.3, §4.6, §6.2.
 [47] (2009) Detecting large-scale system problems by mining console logs. In Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles, pp. 117–132. Cited by: §6.2.
 [48] (2009) Large-scale system problem detection by mining console logs. Proceedings of SOSP’09. Cited by: §4.1.

 [49] (2021) Semi-supervised log-based anomaly detection via probabilistic label estimation. In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE), pp. 1448–1460. Cited by: §1, §1.
 [50] (2021) Do transformers really perform badly for graph representation?. Advances in Neural Information Processing Systems 34, pp. 28877–28888. Cited by: §2.3, §3.3, §3.3, §3.3, §6.1.
 [51] (2020) Anomaly detection via mining numerical workflow relations from logs. In 2020 International Symposium on Reliable Distributed Systems (SRDS), pp. 195–204. Cited by: §1, §1, §1, §4.1, §5.2, §6.2.
 [52] (2019) Robust log-based anomaly detection on unstable log data. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 807–817. Cited by: §1, §3.3, §3.3, §4.3, §6.2.
 [53] (2021) Hierarchical adversarial attacks against graph neural network based iot network intrusion detection system. IEEE Internet of Things Journal. Cited by: §2.2, §6.2.
 [54] (2019) Tools and benchmarks for automated log parsing. In 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), pp. 121–130. Cited by: §2.1.