LogDP: Combining Dependency and Proximity for Log-based Anomaly Detection

10/05/2021 · by Yongzheng Xie et al. · The University of Adelaide, The University of Newcastle, and University of South Australia

Log analysis is an important technique that engineers use for troubleshooting faults of large-scale service-oriented systems. In this study, we propose a novel semi-supervised log-based anomaly detection approach, LogDP, which utilizes the dependency relationships among log events and proximity among log sequences to detect the anomalies in massive unlabeled log data. LogDP divides log events into dependent and independent events, then learns normal patterns of dependent events using dependency and independent events using proximity. Events violating any normal pattern are identified as anomalies. By combining dependency and proximity, LogDP is able to achieve high detection accuracy. Extensive experiments have been conducted on real-world datasets, and the results show that LogDP outperforms six state-of-the-art methods.


1 Introduction

Modern software-intensive systems, including service-oriented systems, have become increasingly large and complex. While these systems provide users with rich services, they also bring new challenges to system operation and maintenance. One of the challenges is to identify faults and discover potential risks by analyzing a massive amount of log data. Logs are composed of semi-structured texts, i.e., log messages. Log analysis is one of the main techniques that engineers use for troubleshooting faults and capturing potential risks. When a fault occurs, checking system logs helps to efficiently detect and locate the fault. However, with the increase in scale and complexity, manual identification of abnormal logs from massive log data has become infeasible.

During the past decade, many automated log analysis approaches, including supervised, semi-supervised, and unsupervised approaches, have been proposed to detect system anomalies reflected by logs [1, 2, 3, 4, 5]. Although supervised approaches show promising results, the scarcity of labeled anomalous log data is a daunting issue. In contrast, unsupervised and semi-supervised approaches have a significant advantage in that no labeled anomalous data are needed. However, the existing unsupervised and semi-supervised methods suffer from low detection accuracy.

In this paper, we propose a log anomaly detection method, LogDP, which simultaneously utilizes dependency among log events and proximity among log sequences to detect anomalous log sequences. LogDP first discovers the normal patterns of logs, then identifies the log sequences that violate these patterns as anomalies. There are two types of normal patterns: dependency patterns (DPs) and proximity patterns (PPs). DPs are related to the events that have dependency relationships with other events, and PPs are for the events that are independent of other events. To find the DP of an event, LogDP trains a predictive model to predict this event using some other events as predictors. Here, we name the log event to be predicted the focused event, and the predictor events the related events of the focused event. To find the PP of an event, a mean prediction model is trained, which uses the mean value of the event as its expected value. When detecting anomalies, given a log sequence, its expected values on all log events are predicted using the learned models, and the differences between the observed values and expected values are calculated. These differences, named pattern deviations, indicate the degree to which the log sequence deviates from the corresponding normal patterns. If any pattern deviation is beyond the normal range, i.e., a normal pattern is violated, the log sequence is flagged as an anomaly.

In summary, our main contributions in this work are as follows:

  • We propose LogDP, a novel log-based anomaly detection method, which simultaneously utilizes dependency among log events and proximity among log sequences. To the best of our knowledge, we are the first to introduce dependency-based anomaly detection techniques to the field of log analysis.

  • We experimentally demonstrate the effectiveness of the proposed method on seven settings of three widely-used log datasets. The results show that LogDP outperforms state-of-the-art unsupervised and semi-supervised log-based anomaly detection methods.

2 The LogDP Method

In this section, we first explain log preprocessing, and then present the LogDP method. The LogDP method consists of two phases, the training phase and the test phase. In the training phase, for each log event, LogDP trains an expected value prediction model and produces the corresponding threshold. In the test phase, the trained prediction models and thresholds are used to determine whether a log sequence is an anomaly.

2.1 Log Preprocessing

Logs are usually semi-structured texts, which are used to record the status of systems. Each log message consists of a constant part (log event) and a variable part (log parameter). Log parsers [6, 7, 8] can parse log messages into log events, which are the templates of the log messages. Figure 1(a) shows a snippet of raw logs and the results after they are parsed.

Figure 1: Log preprocessing. (a) A snippet of log parsing; (b) an ECM.

Log messages can be grouped into log sequences (i.e., series of log events that record specific execution flows) according to sessions or time windows. Session-based log partitioning often utilizes certain log identifiers to generate log sequences. When using time windows to partition logs, two strategies are commonly used: fixed windows and sliding windows. The fixed window strategy uses a predefined window size, e.g., 1 hour, to produce log sequences, while the sliding window strategy generates log sequences from overlapping consecutive windows. For each log sequence, the occurrences of the events are counted, resulting in an Event Count Matrix (ECM). For example, an ECM is shown in Figure 1(b), where x_ij indicates the number of occurrences of event e_j in sequence c_i.
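To make the ECM construction concrete, the following is a minimal sketch of building an ECM from a stream of parsed event IDs using the fixed window strategy. The function name, event IDs and window size are our own illustration, not part of LogDP's released implementation.

```python
# A minimal sketch of building an Event Count Matrix (ECM) from parsed logs.
# Assumes each log message has already been reduced to an event ID by a log
# parser; fixed-window grouping and all names here are illustrative.
from collections import Counter

import numpy as np


def build_ecm(event_ids, window_size):
    """Group a stream of parsed event IDs into fixed windows and count events.

    event_ids  : list of event IDs (one per log message, in time order)
    window_size: number of log messages per fixed window
    Returns the ECM as an (n_sequences x n_events) array plus the ordered
    list of event IDs that label the columns.
    """
    events = sorted(set(event_ids))                 # column order of the ECM
    col = {e: j for j, e in enumerate(events)}
    windows = [event_ids[i:i + window_size]
               for i in range(0, len(event_ids), window_size)]
    ecm = np.zeros((len(windows), len(events)), dtype=int)
    for i, w in enumerate(windows):
        for e, n in Counter(w).items():
            ecm[i, col[e]] = n                      # occurrences of event e in window i
    return ecm, events


# Toy usage: four messages per window, three distinct events.
ids = ["E1", "E2", "E1", "E3", "E2", "E2", "E1", "E3"]
X, events = build_ecm(ids, window_size=4)
print(events)  # ['E1', 'E2', 'E3']
print(X)       # [[2 1 1], [1 2 1]]
```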

The notation used in this paper is as follows. We use a boldfaced upper case letter, e.g., X, to denote a matrix; a boldfaced lower case letter, e.g., e, for a vector; and a lower case letter, e.g., x, for a scalar. We reserve X for an ECM with n log sequences and m log events. E represents the set of log events of X, and e_j is a log event, i.e., e_j ∈ E. A log sequence is denoted as c = (x_1, …, x_m), where x_j is a log instance, i.e., the occurrence count of an event in c. The log instance of event e_j in sequence c_i is represented as x_ij.

2.2 The Training Phase of LogDP

The workflow of the training phase of the LogDP method is presented in Figure 2. The inputs of the training phase are a training set X_train and a validation set X_val, both of which contain only normal log sequences. X_train is used to train the expected value prediction models, and X_val is used to obtain the thresholds. The training phase is composed of two steps: related event selection and prediction model training. In the related event selection step, for each event, named the focused event, its related events are selected to be used as predictors of the focused event. In the prediction model training step, two different prediction models are trained depending on whether a Markov blanket (MB) is found for the focused event. If the focused event is dependent, i.e., it has an MB, a Multi-Layer Perceptron (MLP) regressor is trained to embody the dependency relationship between the focused event and its MB. If the focused event is independent, i.e., no MB is found for it, a mean prediction model is trained. That is, DPs are learned for dependent events using the dependency-based technique, and PPs are learned for independent events using the proximity-based technique. After training the expected value prediction models, X_val is input to obtain the corresponding thresholds. The outputs of the training phase are a set of prediction models and their corresponding thresholds.

Figure 2: The workflow of the training phase of the LogDP method

2.2.1 Related Event Selection

In this step, we aim to identify the related events of a focused event, which are later used as predictors in a predictive model to predict the value of the focused event.

We follow [9] in adopting a causal feature selection technique, MBs, in this step to achieve good prediction accuracy and efficiency. MBs are defined in the context of a Bayesian Network (BN) [10]. A BN is a type of probabilistic graphical model used to represent and infer the dependency among variables; in the context of log analysis, the variables correspond to log events. A BN can be denoted as a pair (G, P), where G is a Directed Acyclic Graph (DAG) showing the structure of the BN, and P is the joint probability distribution of the nodes in G. Specifically, G = (E, A), where E is the set of nodes representing the random variables in the domain under consideration, and A is the set of arcs representing the dependency among the nodes. e_i is known as a parent of e_j (and e_j a child of e_i) if there exists an arc e_i → e_j in A. For any variable e in a BN, its MB contains all the children, parents, and spouses (other parents of the children) of e, denoted as MB(e). Given MB(e), e is conditionally independent of all other variables in E, i.e.,

P(e | MB(e), R) = P(e | MB(e)),    (1)

where R = E \ ({e} ∪ MB(e)).

According to Equation 1, MB(e) represents the information needed to estimate the probability of e, rendering the remaining variables irrelevant; this makes MB(e) the minimal set of relevant variables for obtaining the complete dependency of e. The study in [9] has shown that using MBs as related variables achieves better performance than other choices of related events.
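To make the MB definition concrete, the sketch below computes the MB of a node from a known DAG (parents, children, and spouses) using networkx. The toy graph is our own example; LogDP does not assume a known DAG, but discovers MBs from data via causal feature selection, as in [9].

```python
# Illustration of the MB definition on a known toy DAG (LogDP itself learns
# MBs from data; this example only demonstrates the definition).
import networkx as nx


def markov_blanket(g, node):
    """Parents, children, and spouses (other parents of children) of a node."""
    parents = set(g.predecessors(node))
    children = set(g.successors(node))
    spouses = {p for c in children for p in g.predecessors(c)} - {node}
    return parents | children | spouses


# Toy DAG over five events: e1 -> e3 <- e2, e3 -> e4, e5 -> e4
g = nx.DiGraph([("e1", "e3"), ("e2", "e3"), ("e3", "e4"), ("e5", "e4")])
print(markov_blanket(g, "e3"))  # {'e1', 'e2', 'e4', 'e5'}
```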

2.2.2 Dependency Model Training

The goal of this step is to train the expected value prediction models. As shown in Figure 2, after learning MBs in the first step, events are categorized into two groups: independent events, i.e., events with no MB, and dependent events, i.e., events with an MB. For an independent event, the expected value is predicted as the mean of the instances of the event in the training set. For a dependent event e, an MLP regressor is trained to predict the expected value of e using MB(e) as predictors. Theoretically, any regression model could be used in this step, and several regression models, such as regression trees, linear regression and SVM regressors, have been adopted in existing dependency-based anomaly detection techniques. We chose MLP as the dependency model because it can deal with more complex data distributions and shows better performance than other regression models in our experiments.
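The following sketch illustrates how the per-event models might be trained, assuming an MB discovery step has already produced a mapping from each event to its MB (empty for independent events). The use of an MLP regressor for dependent events and a mean model for independent events follows the paper; the hyperparameters and function names are illustrative.

```python
# A sketch of the prediction model training step, assuming MBs have already
# been discovered (LogDP uses a causal feature selection algorithm, as in [9]).
import numpy as np
from sklearn.neural_network import MLPRegressor


def train_models(X_train, mbs):
    """Train one expected-value prediction model per event.

    X_train: (n_sequences x n_events) ECM of normal sequences.
    mbs    : dict mapping event index j to a list of MB event indices;
             an empty list means event j is independent.
    """
    models = {}
    for j in range(X_train.shape[1]):
        mb = mbs.get(j, [])
        if mb:   # dependent event: learn its dependency pattern (DP)
            reg = MLPRegressor(hidden_layer_sizes=(64,), max_iter=1000)
            reg.fit(X_train[:, mb], X_train[:, j])
            models[j] = ("mlp", reg)
        else:    # independent event: proximity pattern (PP) via the mean
            models[j] = ("mean", X_train[:, j].mean())
    return models


def predict(models, mbs, X):
    """Predict the expected value of every event for each sequence in X."""
    X_hat = np.zeros(X.shape, dtype=float)
    for j, (kind, model) in models.items():
        if kind == "mlp":
            X_hat[:, j] = model.predict(X[:, mbs[j]])
        else:
            X_hat[:, j] = model   # constant mean prediction
    return X_hat
```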

In LogDP, we consider both dependent and independent log events in anomaly detection because it is common that some anomalous messages are printed to system logs only when anomalies occur. These anomalous log messages usually have no dependency on other log events, and if they were not covered, many anomalies could be missed. As these anomalous events occur only when anomalies happen, they are unlikely to be present in normal log sequences, which is why LogDP detects them by examining the deviation from the mean value over normal sequences.

To obtain the thresholds, the validation set X_val of normal log sequences is input into the learned expected value prediction models to get the expected values of the validation set, i.e., X̂_val. The deviation matrix of X_val is calculated as D = |X_val − X̂_val|. Then, for each event e_j, its threshold θ_j is calculated as the maximum deviation of the event, i.e., θ_j = max(d_j), where d_j is the j-th column of D.
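Continuing the training sketch above (reusing its `predict`, `models` and `mbs`), a sketch of the threshold computation:

```python
# Threshold computation on a validation set of normal sequences, following
# the formulas above; `predict`, `models`, and `mbs` come from the previous
# training sketch and are assumed here.
import numpy as np


def compute_thresholds(models, mbs, X_val):
    """theta_j = maximum deviation of event j over the normal validation set."""
    X_hat_val = predict(models, mbs, X_val)   # expected values for X_val
    D = np.abs(X_val - X_hat_val)             # deviation matrix D = |X_val - X_hat_val|
    return D.max(axis=0)                      # one threshold per event (column)
```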

2.3 The Test Phase of LogDP

The goal of the test phase is to use the learned models and thresholds to detect anomalies. Given a log sequence c, the expected value x̂_j of each instance x_j is predicted by the corresponding prediction model. Then, the deviation is calculated as d_j = |x_j − x̂_j|. If d_j > θ_j for any event e_j, then c is flagged as an anomaly; c is considered normal only if it follows all the normal patterns.
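A corresponding sketch of the test phase, again reusing the names from the earlier sketches: a sequence is flagged as anomalous if any per-event deviation exceeds the corresponding threshold.

```python
# Test-phase sketch; `predict`, `models`, and `mbs` are the names introduced
# in the training sketches of Section 2.2.
import numpy as np


def detect(models, mbs, thresholds, X_test):
    X_hat = predict(models, mbs, X_test)
    D = np.abs(X_test - X_hat)            # per-sequence deviations d_j = |x_j - x_hat_j|
    return (D > thresholds).any(axis=1)   # True means the sequence is anomalous
```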

3 Evaluation

3.0.1 Datasets

Three public log datasets, HDFS, BGL and Spirit, are used in our experiments; they are available from [11]. From the three datasets, we generate seven datasets using different log grouping strategies. HDFS is grouped by session, while BGL and Spirit are grouped using 1-hour, 100-log, and 20-log windows. The resulting BGL and Spirit datasets are named Dataset-Window, e.g., BGL-100logs, as shown in Table 1.

For LogDP, the first 2/3 of the sequences in the training set are used for training, and the remaining 1/3 are used as a validation set.

Datasets   #Evt    Window     Training Set                    Test Set
                              #Seq      #Anom.    %Anom.      #Seq      #Anom.   %Anom.
HDFS       29      session    287,530   8,419     2.93%       287,531   8,419    2.93%
BGL        980     1 hour     3,673     495       13.48%      1,481     170      11.48%
                   100 logs   37,707    4,009     10.63%      9,426     816      8.66%
                   20 logs    188,539   17,252    9.15%       47,134    3,005    6.38%
Spirit     1,229   1 hour     1,751     1,213     69.27%      585       225      38.46%
                   100 logs   79,999    20,598    25.75%      19,999    429      2.15%
                   20 logs    399,999   82,002    20.50%      99,999    498      0.50%

  • #Evt: number of events; #Seq: number of sequences; #Anom.: number of anomalies; %Anom.: percentage of anomalies.

Table 1: Overview of datasets used in the experiments.

3.0.2 Benchmark Methods

Six state-of-the-art log-based anomaly detection methods are selected as the benchmark methods: three proximity-based methods, PCA [12], OneClassSVM [13] (OCSVM) and LogCluster [14]; a sequential-based method, DeepLog [4]; and two invariant relation-based methods, Invariants Mining [1] (IM) and ADR [3]. Descriptions of the benchmark methods can be found in Section 4.

3.0.3 Experimental Results

The experimental results (precision, recall and F1) of LogDP and the benchmark methods are presented in Table 2, with the best results in boldface. Overall, LogDP produces superior results compared to the benchmark methods: out of the 7 datasets, LogDP achieves the best F1 on all seven, the best precision on five, and the best recall on two.

Dataset         Metric     LogDP  PCA    OCSVM  LogCluster  DeepLog  IM     ADR
HDFS-session    F1         0.987  0.790  0.068  0.800       0.945    0.943  0.974
                Precision  0.979  0.980  0.035  0.870       0.958    0.893  0.951
                Recall     0.995  0.670  0.940  0.740       0.933    1.000  1.000
BGL-1hour       F1         0.789  0.170  0.393  0.147       0.596    0.490  0.547
                Precision  0.935  0.352  0.383  0.009       0.474    0.343  0.377
                Recall     0.682  0.112  0.403  0.394       0.802    0.859  1.000
BGL-100logs     F1         0.539  0.130  0.132  0.243       0.378    0.387  0.250
                Precision  0.858  0.440  0.075  0.147       0.321    0.324  0.143
                Recall     0.393  0.076  0.556  0.705       0.461    0.482  0.987
BGL-20logs      F1         0.460  0.237  0.168  0.226       0.224    0.203  0.204
                Precision  0.985  0.447  0.094  0.129       0.126    0.163  0.114
                Recall     0.300  0.162  0.744  0.884       0.981    0.269  0.988
Spirit-1hour    F1         0.821  0.187  0.601  0.367       0.582    0.387  0.792
                Precision  0.697  0.312  0.742  0.324       0.412    0.678  0.656
                Recall     1.000  0.133  0.505  0.422       0.991    0.271  1.000
Spirit-100logs  F1         0.575  0.111  0.003  0.110       0.153    0.107  0.445
                Precision  0.405  0.094  0.002  0.152       0.087    0.057  0.287
                Recall     0.993  0.135  0.023  0.086       0.643    0.993  0.994
Spirit-20logs   F1         0.905  0.095  0.009  0.173       0.135    0.032  0.558
                Precision  0.835  0.051  0.005  0.150       0.191    0.016  0.387
                Recall     0.988  0.639  0.057  0.205       0.104    0.974  0.999

Table 2: Experimental results of LogDP and benchmark methods.

As for the different log partitioning strategies, i.e., session (for HDFS) and time window (for BGL and Spirit), LogDP performs well with both. In contrast, as IM, ADR and DeepLog are designed to be more suitable for session-based log partitioning, they yield good results on the HDFS dataset but relatively poor results on the other datasets. Compared to the proximity-based benchmark methods, i.e., PCA, OCSVM and LogCluster, LogDP produces significantly better results on all datasets, except for the precision of PCA on the HDFS dataset. In summary, the experiments show the superior performance of LogDP across datasets and log partitioning strategies.

4 Related Work

Log-based anomaly detection has been intensively studied in recent decades. In terms of the techniques used for anomaly detection, existing approaches can be roughly categorized into proximity-based, sequential-based, and relation-based approaches. Proximity-based methods, such as PCA (Principal Component Analysis) [12] and LogCluster [14], cast a log sequence as a point in a feature space and use distance or density metrics to evaluate the proximity of the log sequence to others; the sequences far from the others are flagged as anomalies. Sequential-based methods, such as DeepLog [4] and LogAnomaly [5], use sequences of log events to train models that predict future events; the log sequences that do not comply with the predicted sequential patterns are identified as anomalies. Relation-based methods, such as Invariants Mining [1] and ADR [3], try to find meaningful relations among log events and use the relations to detect anomalies. As a relation-based method, LogDP is more flexible than the existing ones. Existing relation-based methods [1, 3] rely on invariant relations, i.e., linear relationships among log events that reflect program workflows. These methods have two limitations: (1) the mined relations are sensitive to data noise; and (2) the mined relations are restricted to linear relations among the events. In contrast, LogDP utilizes probabilistic relationships among log events, which makes it less sensitive to data noise, and it adopts MLP regressors as dependency models, which can handle both linear and non-linear relationships.

5 Conclusion

We have proposed a log-based anomaly detection method, LogDP, which utilizes deviations from normal patterns to effectively detect anomalous log sequences. LogDP divides log events into two types, dependent events and independent events. For dependent events, the normal patterns are learned from the probabilistic relationship between an event and its MB, i.e., the dependency among events. For independent events, the normal patterns are obtained from the mean prediction models, i.e., the proximity among sequences. The log sequences that violate any normal pattern are identified as anomalies. Our experimental results show that LogDP outperforms the state-of-the-art benchmark methods. Our source code and experimental data are available at: https://github.com/ilwoof/LogDP.

6 Acknowledgments

This research was supported by an Australian Government Research Training Program (RTP) Scholarship, and by the Australian Research Council’s Discovery Projects funding scheme (project DP200102940). The work was also supported with super-computing resources provided by the Phoenix High Powered Computing (HPC) service at the University of Adelaide.

References

  • [1] J. Lou, Q. Fu, S. Yang, Y. Xu, and J. Li. Mining invariants from console logs for system problem detection. In USENIX Annual Technical Conference, 2010.
  • [2] S. He, J. Zhu, P. He, and M. Lyu. Experience report: System log analysis for anomaly detection. In ISSRE, pages 207–218. IEEE, 2016.
  • [3] B. Zhang, H. Zhang, P. Moscato, and A. Zhang. Anomaly detection via mining numerical workflow relations from logs. In SRDS. IEEE, 2020.
  • [4] M. Du, F. Li, G. Zheng, and V. Srikumar. DeepLog: Anomaly detection and diagnosis from system logs through deep learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, 2017.
  • [5] W. Meng, Y. Liu, Y. Zhu, S. Zhang, D. Pei, Y. Liu, Y. Chen, R. Zhang, S. Tao, P. Sun, et al. LogAnomaly: Unsupervised detection of sequential and quantitative anomalies in unstructured logs. In IJCAI, volume 19, pages 4739–4745, 2019.
  • [6] P. He, J. Zhu, Z. Zheng, and M. Lyu. Drain: An online log parsing approach with fixed depth tree. In ICWS. IEEE, 2017.
  • [7] M. Du and F. Li. Spell: Streaming parsing of system event logs. In ICDM, pages 859–864. IEEE, 2016.
  • [8] H. Dai, H. Li, C. Chen, W. Shang, and T. Chen. Logram: Efficient log parsing using n-gram dictionaries. IEEE Transactions on Software Engineering, 2020.
  • [9] S. Lu, L. Liu, J. Li, T. D. Le, and J. Liu. LoPAD: A local prediction approach to anomaly detection. In Advances in Knowledge Discovery and Data Mining (PAKDD), 2020.
  • [10] J. Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, 2000.
  • [11] S. He, J. Zhu, P. He, and M. R. Lyu. Loghub: A large collection of system log datasets towards automated log analytics. arXiv e-prints, 2020.
  • [12] W. Xu, L. Huang, A. Fox, D. Patterson, and M. I. Jordan. Detecting large-scale system problems by mining console logs. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, pages 117–132, 2009.
  • [13] B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Estimating the support of a high-dimensional distribution. Neural Computation, 2001.
  • [14] Q. Lin, H. Zhang, J. Lou, Y. Zhang, and X. Chen. Log clustering based problem identification for online service systems. In ICSE-C. IEEE, 2016.