Delog: A Privacy Preserving Log Filtering Framework for Online Compute Platforms

02/13/2019 ∙ by Amey Agrawal, et al. ∙ Qubole, Inc.

In many software applications, logs serve as the only interface between the application and the developer. However, navigating through the logs of long-running applications is often challenging. Logs from previously successful application runs can be leveraged to automatically identify errors and provide users with only the logs that are relevant to the debugging process. We describe a privacy preserving framework which can be employed by Platform as a Service (PaaS) providers to utilize the user logs generated on the platform while protecting the potentially sensitive logged data. Further, in order to accurately and scalably parse log lines, we present a distributed log parsing algorithm which leverages Locality Sensitive Hashing (LSH). We show a 3x performance improvement over the previous state-of-the-art.


1. Introduction

Online cloud computing platforms have made large scale distributed computing accessible at affordable prices, leading to a surge in the usage of distributed computing frameworks like Apache Spark (Zaharia et al., 2010), Hive (Thusoo et al., 2010) and Presto (pre, [n. d.]). However, in the case of application failures, users have to navigate through massive amounts of recorded logs to diagnose issues, causing a dip in productivity and an unsatisfactory user experience.

In this paper, we describe a log filtering framework, Delog, which identifies anomalous logs at runtime in order to minimize the manual effort required from the user in case of failures. We use a simple assumption that any log pattern that frequently occurs in successful application runs is irrelevant to the identification of errors. The process of identifying log patterns is formally known as log parsing. Every log line can be considered a string generated from a template where some tokens are constants while others represent values of certain variables. Log parsing techniques try to extract unique patterns which correspond to specific system events by recognizing variables.

Some of the existing works in log parsing (Agrawal et al., 2019), (He et al., 2017) use distributed computing to solve the problems posed by the large volumes of logs generated in production use cases. However, their performance degrades as the number of unique log patterns increases. To address this problem, we propose a novel algorithm based on Locality Sensitive Hashing (LSH) which can efficiently handle datasets with a large number of patterns. To demonstrate the effectiveness of Delog, we also present a new dataset of Apache Hive logs.

Most previous studies (Fu et al., 2009a), (He et al., 2017), (Makanju et al., 2012a) on log parsing assume that the token count for instances of a given log pattern remains constant. However, we find this assumption limiting. Consider the token sequences “ContextHandler Started ServeletContextHandler rdd null AVAILABLE Spark” and “ContextHandler Started ServeletContextHandler static Spark”. Here a single variable (corresponding to ‘%s’) prints multiple tokens, so the token counts of the two sequences differ even though both are instances of a single log pattern, “ContextHandler Started ServeletContextHandler * Spark”. We handle such cases using sequence alignment, which allows us to obtain the optimal reduced form of a log pattern and leads to better generalization ability.

Modern applications use multiple third party libraries and frameworks for different problem domains. Therefore, it is impossible to train an algorithm on all possible kinds of log patterns beforehand. A Platform as a Service provider can mitigate this problem by learning patterns from user generated application logs. However, application logs can contain sensitive data including secret keys, usernames and business data. We propose a privacy preserving framework which uses a novel Bloom filter based data encoding technique (Karapiperis and Verykios, 2015) to utilize user logs without compromising user privacy.

To summarize, our key contributions in this work are:

  • We propose a novel log parsing algorithm that can process datasets with a large number of patterns while keeping pattern quality high.

  • We describe a log filtering method which can save time for a user trying to diagnose application issues by navigating through logs.

  • We propose a privacy preserving framework that securely utilizes user generated log data to improve filtering performance.

  • We open source a synthetic dataset (https://github.com/qubole/qubole-log-datasets) which can be used to evaluate the performance of privacy preserving log filtering frameworks in the future.

The rest of this paper is organized as follows: In Section 2, we introduce the log parsing algorithm. Section 3 presents the log filtering methodology in detail. The privacy preserving framework to learn from user logs is described in Section 4. Evaluations of our approach are presented in Sections 5 and 6. Related works and the conclusion are presented in Section 7 and Section 8, respectively.

2. Log Parsing

Log parsing is typically the first step of any log processing system. In order to obtain high quality log patterns, we develop a novel log parsing algorithm which scales to handle datasets with a large number of patterns. Our log parsing technique consists of three stages. The first stage involves preprocessing input log lines to identify trivial variables. These preprocessed lines are then divided into blocks containing similar lines. Finally, each block is reduced to a single log pattern by identifying constant tokens. In the rest of this section, we discuss each step of the parsing algorithm in detail.

Figure 1. Overview of Log Parsing Pipeline

2.1. Preprocessing of Log Lines

In the preprocessing stage, we split each input line at spaces to create a sequence of tokens. In the first step, we identify and filter out the URLs and file paths from the list of tokens. Then we split the remaining tokens on every non-alphanumeric character and identify numbers, hexadecimal numbers and encoded strings. This helps us handle tokens like IP addresses (“127.0.0.1”), dates (“10/01/2018”), and other similar tokens without matching manually defined regular expressions. We mark a token as invalid if its length is unusually short or long. All the remaining unidentified tokens are marked as strings. We further eliminate any patterns exceeding a length threshold. In the next step, consecutive non-string tokens occurring in the pattern are combined. Finally, we remove the duplicate patterns to return a set of unique lines in the processed file. We speed up the preprocessing stage by introducing Least Recently Used (LRU) caches on costly tokenization operations.
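The Python sketch below illustrates this preprocessing step. It is a minimal sketch under simplifying assumptions: the helper names (classify_token, preprocess_line), the token classes and the length bounds are illustrative, not the exact rules used by Delog.

```python
import re
from functools import lru_cache

NUMBER = re.compile(r"^\d+$")
HEXNUM = re.compile(r"^(0x)?[0-9a-fA-F]+$")

@lru_cache(maxsize=65536)       # LRU cache over the costly per-token work
def classify_token(token: str) -> str:
    if len(token) < 2 or len(token) > 64:   # unusually short or long: invalid
        return "*"
    # split on every non-alphanumeric character, as described above
    parts = [p for p in re.split(r"[^0-9a-zA-Z]+", token) if p]
    if parts and all(NUMBER.match(p) for p in parts):
        return "<NUM>"                      # covers IPs, dates, counters, ...
    if parts and all(HEXNUM.match(p) for p in parts):
        return "<HEX>"
    return token                            # everything else is a string token

def preprocess_line(line: str) -> tuple:
    tokens = [classify_token(t) for t in line.split()]
    # combine consecutive non-string tokens into a single placeholder
    combined, prev_var = [], False
    for t in tokens:
        is_var = t in ("<NUM>", "<HEX>", "*")
        if not (is_var and prev_var):
            combined.append("<VAR>" if is_var else t)
        prev_var = is_var
    return tuple(combined)

# de-duplication leaves a small set of unique preprocessed patterns
with open("app.log") as f:
    unique_patterns = {preprocess_line(line) for line in f}
```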

2.2. Blocking Preprocessed Lines

The robust preprocessing stage ensures that all the non-string variables are identified. We typically observe that the number of unique lines obtained after preprocessing is less than 0.1% of the original number of lines in the file. However, this number can still be large and finding similar lines by naive comparisons can be costly. To reduce the costly matching operations we block the preprocessed lines using minhash-based Locality Sensitive Hashing (LSH) (Gionis et al., 1999). Considering the nature of log datasets, we use token-level shingles as opposed to the more traditional character-level shingles. This further helps in reducing the hash compute and comparison time. We create candidate log pattern blocks using the LSH with an empirically chosen Jaccard similarity threshold.
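As an illustration, the following sketch blocks preprocessed patterns with the open-source datasketch library. The library choice, num_perm value and key naming are our assumptions; the 0.7 Jaccard similarity threshold is taken from the tuning experiment in subsection 5.4.3.

```python
from datasketch import MinHash, MinHashLSH

def minhash_of(pattern, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for token in pattern:            # token-level shingles, not character-level
        m.update(token.encode("utf8"))
    return m

patterns = [
    ("ContextHandler", "Started", "ServeletContextHandler", "rdd", "Spark"),
    ("ContextHandler", "Started", "ServeletContextHandler", "static", "Spark"),
    ("BlockManager", "Removed", "<VAR>"),
]

lsh = MinHashLSH(threshold=0.7, num_perm=128)   # empirically chosen threshold
hashes = [minhash_of(p) for p in patterns]
for i, h in enumerate(hashes):
    lsh.insert(f"p{i}", h)

# all patterns colliding with patterns[0] form one candidate block
candidate_block = lsh.query(hashes[0])
```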

2.3. Block Verification

Since both minhash and LSH are probabilistic data structures, there could be errors in the blocks formed by the method described in subsection 2.2. We use the Longest Common Subsequence (LCS) algorithm to identify any outliers within a block. For an LSH block comprising n patterns P_1, P_2, …, P_n, we regroup them into k blocks B_1, B_2, …, B_k such that any two patterns in a block satisfy the similarity constraint given by equation 1. To reduce the number of LCS computations while clustering, we assume a transitive relationship between the patterns within the same cluster, implying that if a pattern P satisfies the similarity constraint with any one pattern in a block B_i, it will satisfy the constraint for all the patterns in B_i. Using this transitive property, we iteratively assign each pattern in the original LSH block to a regrouped block which satisfies the constraint. We introduce a new block if we find no suitable block for assignment.

(1)  |LCS(P_i, P_j)| ≥ δ · max(|P_i|, |P_j|)

where P_i, P_j are two patterns and δ is an empirically determined constant.
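A sketch of this regrouping step follows; the function names are ours, and delta stands for the empirically determined constant of equation 1.

```python
def lcs_len(a: tuple, b: tuple) -> int:
    """Length of the longest common subsequence of two token sequences."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def satisfies(p: tuple, q: tuple, delta: float) -> bool:
    """Similarity constraint of equation 1."""
    return lcs_len(p, q) >= delta * max(len(p), len(q))

def regroup(lsh_block: list, delta: float = 0.7) -> list:
    """Split one LSH block into sub-blocks. Assuming transitivity, each new
    pattern is compared against a single representative per sub-block."""
    groups = []
    for p in lsh_block:
        for g in groups:
            if satisfies(p, g[0], delta):
                g.append(p)
                break
        else:                       # no suitable block found: open a new one
            groups.append([p])
    return groups
```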

2.4. Sequence Alignment

By observing the log patterns generated by existing parsing algorithms, we make a key observation: some string variables can consist of more than one token. Methods like LKE (Fu et al., 2009a) and POP (He et al., 2017) check the distribution of tokens at a given position in a block of similar log patterns for reduction. However, such approaches produce sub-optimal results when the sequences do not align. To overcome this limitation, we create a variation of the iterative Multiple Sequence Alignment (MSA) algorithm to progressively align sequences. We exploit the shorter lengths and higher similarity of log pattern sequences to simplify and speed up the alignment process. We use the Needleman-Wunsch algorithm (Needleman and Wunsch, 1970) for pairwise alignment of sequences. We sort the patterns in a block by sequence length and then progressively align sequences longest to shortest. At each step, subsets of previously aligned sequences are updated such that the lengths of all aligned sequences are equal. Details of this step are shown in Algorithm 1.

1:procedure alignSequence(P) Input: Sequence of patterns P; Output: Sequence of aligned patterns A
2:     S ← P sorted by sequence length, longest first
3:     A ← [S[0]]
4:     for i ← 1 to S.length − 1 do
5:         alignedPair ← NeedlemanWunsch(A.last, S[i])
6:         if alignedPair[0].length ≠ A.last.length then
7:             A.removeLast()
8:             A.append(alignedPair[0])
9:         end if
10:        A.append(alignedPair[1])
11:    return A
Algorithm 1 LongestFirstMSA
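To make Algorithm 1 concrete, here is a runnable Python rendering under our assumptions: a standard token-level Needleman-Wunsch with unit match, mismatch and gap scores, and an update of only the most recently aligned row when the reference gains gap tokens; updating all previously aligned rows, as the prose above describes, is a straightforward extension.

```python
GAP = "-"   # gap token introduced by the aligner

def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Pairwise global alignment of two token sequences; returns both
    sequences padded with GAP tokens to a common length."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    out_a, out_b, i, j = [], [], n, m      # traceback from the bottom-right
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
                match if a[i - 1] == b[j - 1] else mismatch):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append(GAP); i -= 1
        else:
            out_a.append(GAP); out_b.append(b[j - 1]); j -= 1
    return out_a[::-1], out_b[::-1]

def longest_first_msa(patterns):
    """Progressive alignment, longest pattern first (Algorithm 1)."""
    s = sorted(patterns, key=len, reverse=True)
    aligned = [list(s[0])]
    for p in s[1:]:
        ref, new = needleman_wunsch(aligned[-1], list(p))
        if len(ref) != len(aligned[-1]):   # the reference row gained gaps
            aligned[-1] = ref
        aligned.append(new)
    return aligned
```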

2.5. Alignment Matrix for Pattern Reduction

Consider a block of aligned pattern sequences denoted as P_1, P_2, …, P_n. Each pattern P_i contains tokens t_{i,1}, t_{i,2}, …, t_{i,m}. We visualize these sequences of patterns as an alignment matrix, where each row represents a pattern in the block, as depicted in Figure 2. Intuitively, for any given column, if the frequency of a certain token is high, the token position contains a constant. On the other hand, if a column corresponds to a variable, there should be much more variation in the column values. For each column j, where the frequency of the mode is denoted as m_j, we label the column as a constant if it satisfies the heuristic constraint defined in equation 2.

(2)  m_j ≥ γ · n

where n is the number of patterns in the matrix and γ is an empirically determined constant. The remaining columns which do not satisfy equation 2 are labeled as variable columns.

A row is considered a misfit in the matrix if, for any column labeled as a constant, the value of its token is not the same as the mode of that column. After eliminating misfit rows, we simply return the reduced pattern by picking modal value tokens for constant columns and wildcards for variable columns.
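The reduction over the alignment matrix can be sketched as follows; gamma denotes the empirical constant of equation 2, and the function name is illustrative.

```python
from collections import Counter

def reduce_block(aligned_rows, gamma=0.8):
    """Reduce a block of equal-length aligned rows to one pattern.
    A column whose modal token appears in at least gamma * n rows is a
    constant; all other columns are variables (wildcards)."""
    n = len(aligned_rows)
    modes = []
    for column in zip(*aligned_rows):        # columns of the alignment matrix
        token, freq = Counter(column).most_common(1)[0]
        modes.append(token if freq >= gamma * n else None)   # equation 2
    # misfit rows disagree with the mode of some constant column
    fits = [row for row in aligned_rows
            if all(m is None or t == m for t, m in zip(row, modes))]
    reduced = tuple(m if m is not None else "*" for m in modes)
    return reduced, fits
```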

Figure 2. Alignment Matrix and Pattern Reduction

2.6. Iterative reduction

During the blocking stage, some similar patterns might end up in different blocks due to the approximate nature of LSH. To obtain the optimal set of patterns, we iteratively repeat the steps described in subsections 2.2 through 2.5 until the number of patterns becomes constant.

3. Filtering Error Logs

One of the key applications of log parsing techniques has been filtering anomalies out of the logs of failed application runs. This enables easier and swifter error detection, which in turn can save a lot of time for developers. This is especially true for applications like Apache Spark and Apache Hive, where logs can grow to tens of gigabytes and it becomes impossible to manually identify the errors in them. We identify commonly occurring patterns by training on logs of successful application runs. We can then filter the lines matching these patterns from the log files, leaving behind the anomalies, error messages and stack traces. The subsequent subsections describe the anomaly detection algorithm in further detail.

3.1. Learning Patterns

To initialize the filtering model for a given application, we parse logs of previously successful runs of the application. However, not all of the learned patterns are used for filtering. Logs of successful job runs often contain intermittent failures which succeed on retries. Hence, if we used all the learned patterns, we might miss some key anomalies and errors in the log file. We use a greedy approach where we first sort the templates in our training data set by frequency and then iteratively pick templates until we achieve 98% coverage of the training dataset. The number of patterns selected through this approach varies for each application type. We observe that for the Spark dataset, we can achieve 98% coverage through only the top 5% of patterns. Moreover, we also observe that certain log lines, like application start and application termination, appear only once in each file but are present in all the files. Therefore, we make sure to include those patterns which are present in 70% or more of the log files.
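A sketch of this greedy selection follows, with the 98% coverage and 70% file-presence thresholds from the text; the input data structures (a pattern frequency counter and a per-pattern file-presence map) are our assumptions.

```python
from collections import Counter

def select_patterns(pattern_counts: Counter, file_presence: dict,
                    num_files: int, coverage: float = 0.98,
                    presence: float = 0.70) -> set:
    """Pick the most frequent patterns until `coverage` of all training
    lines is explained, then add patterns present in at least `presence`
    of the training files (e.g. application start/termination lines)."""
    total = sum(pattern_counts.values())
    selected, covered = set(), 0
    for pattern, count in pattern_counts.most_common():
        if covered >= coverage * total:
            break
        selected.add(pattern)
        covered += count
    for pattern, files_seen_in in file_presence.items():
        if files_seen_in >= presence * num_files:
            selected.add(pattern)
    return selected
```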

3.2. Preprocessing of Input Lines

The first step of the filtering algorithm is the same as the first step of the training algorithm, where we identify meaningful tokens from the log lines and convert the log lines into patterns. We cache these preprocessed patterns in order to avoid reprocessing identical log lines in the subsequent stages.

3.3. Candidate Pair Identification

In this step, we use minhash LSH (Gionis et al., 1999) to match the lines obtained from the preprocessing stage with the set of patterns obtained from the training stage of the algorithm. For each preprocessed line, we identify candidate pairs by querying LSH for an empirically chosen Jaccard similarity threshold.

3.4. LCS Matching

To concretely identify a valid match, we scan the candidate pairs to find the first pattern which satisfies the constraint given by equation 1. If the constraint is satisfied by at least one candidate pattern, we conclude that the corresponding log line is not of much relevance for error detection. Otherwise, we infer that the log line is an anomaly and should be included in the filtered logs.

3.5. Frequency Based Filtering

The list of patterns generated from the training stage of the algorithm may not include all the possible log templates which can be generated from a given application type. This is because the generated logs can depend on a variety of factors like application configurations, environment variables and user behavior. Therefore, it is possible to encounter log lines which do not match our trained data set but are of little value for anomaly and error detection. In order to avoid this, we keep a frequency filter for every unmatched log line such that,

(3)  f_l < κ

where f_l is the frequency of the log line l and κ is a constant. We filter out the log lines that do not satisfy the constraint defined in equation 3 from our output and treat them like a pattern from the training set. The value of the constant needs to be determined empirically and depends on the application type. We observe that 250 is a suitable value for Spark applications.
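A minimal sketch of this filter follows, with κ set to the value of 250 observed to work for Spark applications; the stateful counter and function name are illustrative.

```python
from collections import Counter

KAPPA = 250                     # empirical threshold of equation 3 (Spark)
unmatched_counts = Counter()    # frequency of each unmatched preprocessed line

def keep_unmatched(pattern) -> bool:
    """True while an unmatched line is still rare enough to be interesting;
    once its frequency reaches kappa it is treated like a trained pattern."""
    unmatched_counts[pattern] += 1
    return unmatched_counts[pattern] < KAPPA
```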

4. Privacy Preserving Framework

In subsection 3.5 we present a way to improve performance on user logs in the wild by tracking the frequency of preprocessed patterns. However, such an approach cannot reutilize knowledge from previous application runs to improve results in the future. In this section, we propose a framework which learns from user generated logs while preserving user privacy.

Figure 3. Client-Server Architecture for Privacy Preserving Learning

4.1. Client-Server Setup

On an online cloud computing platform, hundreds of users run their applications at any given time. In order to utilize the logs recorded by these user applications, the platform provider can run a service on the user systems which aggregates and encodes logs in a homomorphic fashion. These encoded logs are then sent to a central repository. During future runs of the application, encoded log patterns can be fetched from the repository and utilized for matching.

4.2. Encoding of Log Patterns

Schnell et al. (Schnell et al., 2009), in their work on record linking, show a novel use of Bloom filters for homomorphic encoding of strings. For each pattern, we create a bitmap by inserting hashes of token-level shingles into a Bloom filter, as shown in Figure 4. It is extremely difficult to reconstruct the input pattern from such an encoding. To compare any two patterns, we can directly compute the Jaccard similarity of their corresponding bitmaps.
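The encoding can be sketched as below; the bitmap size, the number of salted SHA-1 hash functions and the bigram shingle width are illustrative choices, not parameters reported in the paper.

```python
import hashlib

BITMAP_SIZE = 256   # bits in the Bloom filter bitmap (illustrative)
NUM_HASHES = 4      # salted hash functions per shingle (illustrative)

def encode_pattern(pattern) -> int:
    """Encode a pattern as a Bloom filter bitmap of its token-level
    shingles. The bitmap hides the tokens but remains comparable."""
    bitmap = 0
    for shingle in zip(pattern, pattern[1:]):        # token-level 2-shingles
        for k in range(NUM_HASHES):
            digest = hashlib.sha1(f"{k}|{'|'.join(shingle)}".encode()).digest()
            bit = int.from_bytes(digest[:4], "big") % BITMAP_SIZE
            bitmap |= 1 << bit
    return bitmap

def bitmap_jaccard(a: int, b: int) -> float:
    """Jaccard similarity of two bitmaps over their set bits."""
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 1.0

# two instances of the same template produce similar bitmaps
e1 = encode_pattern(("user", "<VAR>", "logged", "in"))
e2 = encode_pattern(("user", "<VAR>", "logged", "out"))
print(bitmap_jaccard(e1, e2))
```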

Figure 4. Bloom filter bitmaps as privacy preserving encoding of log patterns

4.3. Processing at Server

After collecting the pattern encodings from all clients, the server creates blocks of similar patterns using LSH. The total frequency of each block is computed in order to select patterns using a method similar to the one described in subsection 3.1. These selected patterns are then communicated back to the clients. The LSH is configured with a high Jaccard similarity threshold, with parameters optimized to minimize false positives.

4.4. Inference Procedure

In addition to the method described in Section 3, the clients use the encoded log patterns received from the server. A log line which does not match any of the pretrained patterns is encoded and looked up in the set of encoded patterns. An LSH is used to speed up the search for candidate pairs, and we validate the candidate patterns by computing the Jaccard similarity between the bitmaps. Any pattern with a valid match is then excluded from the output.

Figure 5. Log Filtering Pipeline

5. Evaluating Log Parser

5.1. Experimental Setup

In this section, we compare Delog with notable log parsing algorithms like SHISO (Mizutani, 2013), Spell (Du and Li, 2016), Drain (He et al., 2018), IPLoM (Makanju et al., 2012a) and Logan (Agrawal et al., 2019). We run our experiments on Amazon EC2 r3.8xlarge instances with 32 virtual cores. We use four public datasets (Xu et al., 2009), (LANS, [n. d.]), (Oliner and Stearley, 2007), (He et al., 2018) and three new datasets of Apache Hive, Apache Spark and Presto logs for our experiments. However, since the Spark, Hive and Presto datasets are business sensitive and closed, we also run our experiments on three large public datasets of Thunderbird, Windows and HDFS logs made available by (Zhu et al., 2018) for easier comparison with other distributed techniques in the future. Our experiments with the Thunderbird dataset are the first reported results of log parsing on a dataset of such scale. Table 1 contains the size and number of log lines of these datasets. We also introduce smaller counterparts of the Spark and Presto datasets for easier comparison with non-distributed algorithms. We run Delog as a Spark application deployed in YARN client mode on a five node cluster with 160 cores to take advantage of distributed processing.

We omit results for some algorithms on specific datasets if they fail to finish processing in a reasonable time.

Dataset Number of Lines Total Size
Zookeeper 74,380 10 MB
HPC 433,490 32 MB
BGL 4,747,963 109 MB
Presto-5.2M 5,200,000 1.48 GB
HDFS 11,175,629 1.57 GB
Spark-13M 13,000,000 2.10 GB
Presto 16,222,793 4.73 GB
Hive 39,010,740 7.9 GB
HDFS2 71,118,073 16.06 GB
Spark 96,503,051 14.75 GB
Windows 114,608,388 26.09 GB
Thunderbird 211,212,192 29.60 GB

Table 1. Summary of Datasets

5.2. Impact of Preprocessing

Table 2 shows the number of initial patterns obtained after the preprocessing stage. In most datasets, we see that the number of patterns after preprocessing is less than 0.5% of the original number of lines in the dataset. This can be attributed to the large number of numerical and URL variables in the log patterns. The rest of our training pipeline is dedicated to identifying string variables. Depending on the nature of the dataset, we notice a two- to three-fold reduction in the number of patterns after the iterative reduction step.

Dataset Original Lines After Preprocessing Output Patterns
BGL 4,747,963 3,748 (0.07%) 1,896 (50.58%)
HPC 433,490 947 (0.21%) 306 (32.31%)
Zookeeper 74,380 122 (0.16%) 117 (95.90%)
HDFS 11,175,629 47 (0.0004%) 45 (95.74%)
Spark-13M 13,000,000 1,187 (0.009%) 808 (68.07%)
Presto-5.2M 5,200,000 6,566 (0.12%) 2,829 (43.08%)
Table 2. Number of Patterns After Each Stage

5.3. Evaluation Metric

Earlier works quantify the accuracy of a log parser using metrics like F-measure. However, this involves manually obtaining patterns for the dataset. Since this is not feasible for large datasets like Hive (Thusoo et al., 2010) and Spark (Zaharia et al., 2010), we use the quality loss metric from Logan (Agrawal et al., 2019). Quality loss is computed using equation 5 and penalizes those patterns in which meaningful tokens are lost and converted into wildcards. Logan (Agrawal et al., 2019) uses another function called the length factor, which penalizes algorithms for generating too many templates. A naive parser which returns every single line in the dataset without any processing would have zero quality loss, but would receive a high penalty through the length factor. An ideal log parser would minimize the number of log patterns while keeping the quality loss low. Although Logan combines quality loss and length factor into a single loss function, we observe that it becomes difficult to capture the impact of the quality factor this way. Therefore, we separately compare algorithms on their quality scores and the number of patterns they generate.

(4)
(5)

where l̄_T is the average length of all sequences matched to template T during training.

5.4. Observations

5.4.1. Parse Quality

Looking at Tables 3 and 4, we observe that Delog consistently produces high quality patterns across datasets with a reasonable number of output patterns. The high loss value along with fewer templates for SHISO (Mizutani, 2013) suggests that many meaningful tokens are lost. IPLoM (Makanju et al., 2012a) performs well on the HDFS (Shvachko et al., 2010) dataset but falls short in terms of pattern quality on the other datasets. Drain's over-parsing leads to a steep increase in quality loss on the Presto (pre, [n. d.]) dataset.

Zookeeper HPC BGL Presto-5.2M HDFS Spark-13M Presto Spark Hive Thunderbird Windows HDFS2
SHISO 2.7385 4.6333 - - - - - - - - - -
Spell 0.0002 0.0091 - - - - - - - - - -
Drain 0.0042 0.1 0.0849 120.9164 0.0013 0.6445 - - - - - -
IPLoM 0.0053 1.8697 0.0091 7.3184 0.0004 7.13 - - - - - -
Logan 0.0377 0.0598 0.0211 0.0123 0.0107 0.0056 0.0069 0.0048 - - - -
Delog 0.0002 0.0005 0.0009 0.0144 0.0009 0.0034 0.0135 0.0038 0.0159 0.0006 0.0235 0.0073

Table 3. Loss measurement for Delog and other methods
Zookeeper HPC BGL Presto-5.2M HDFS Spark-13M Presto Spark Hive Thunderbird Windows HDFS2
SHISO 42 90 - - - - - - - - - -
Spell 168 230 - - - - - - - - - -
Drain 89 147 1202 1466 48 1001 - - - - - -
IPLoM 91 119 2765 3748 41 738 - - - - - -
Logan 100 138 261 2391 48 895 5514 2979 - - - -
Delog 117 306 1896 2829 45 808 5629 1634 65131 5572 14449 419
Table 4. Number of templates identified by Delog and other methods

5.4.2. Run Time

Table 5 shows that Delog outperforms the other algorithms in efficiency on most of the datasets. Since Delog uses minhash LSH to identify pattern blocks, it performs better than the previous state of the art, Logan (Agrawal et al., 2019), especially on large datasets. Moreover, other algorithms fail to parse datasets like Hive which have a very high number of patterns. Delog is the only log parsing algorithm which manages to process the Hive dataset in a respectable amount of time, owing to its sub-linear complexity in the number of patterns.

Figure 6. Variation of number of patterns and quality loss with change in Jaccard similarity threshold during training

5.4.3. Tuning MinHash LSH

We use the Presto-5.2M dataset to demonstrate parameter tuning for pattern selection using quality loss and the number of patterns. Figure 6 shows the impact of changing the Jaccard similarity threshold on the number of output patterns and the quality loss. We see that using a Jaccard similarity threshold of 0.7 for LSH blocking gives an optimal number of patterns along with a desirable quality loss value. Increasing the Jaccard similarity threshold beyond this value leads to a steep increase in the number of patterns for only a marginal improvement in the quality loss. Conversely, if we decrease the Jaccard similarity threshold in order to reduce the number of patterns, the quality of our patterns significantly degrades.

5.4.4. Tuning LCS Matching Fraction

The LCS based regrouping of patterns described in subsection 2.3 helps in the identification of incorrectly blocked patterns. We observe from Figure 7 that increasing the LCS matching threshold (δ) from 0.4 to 0.7 leads to a significant reduction in quality loss. However, a further increase in the LCS matching fraction from 0.8 to 0.9 leads to an enormous increase in the number of patterns without any substantial dip in the quality loss. Though the optimal value of the LCS matching fraction will vary from one dataset to another and needs to be determined empirically, it should generally lie in the range of 0.6 to 0.8.

Figure 7. Variation in number of patterns and quality loss with change in LCS matching threshold (δ) during training

6. Evaluating Privacy Preserving Filtering Framework

6.1. Experimental Setup

In order to evaluate our privacy preserving framework, we generate a synthetic data set. The set of patterns used to generate the synthetic dataset is divided into two groups, one representing ‘success’ patterns and the other representing ‘error’ patterns. The training set is generated using only success patterns, while the test set consists of both success and error patterns. We initialize our model with a fraction of the success patterns and then learn the rest of the success patterns in a privacy preserving manner. We calculate the false discovery rate and false negative rate for our model and compare them with those of a model trained on the complete training data to evaluate the correctness of our privacy preserving algorithm.

6.2. Dataset Generation

First, we extract all the patterns from a set of log files obtained from different applications and designate 75% of them as success patterns and the remaining as error patterns. While generating the training set, we subdivide the success patterns into two sets. Patterns from the first set appear in every file of the training data, while patterns from the second set are scattered among different files so that no single file contains all the success patterns. The test set has a similar distribution of success patterns along with a randomly chosen mix of error patterns. The training and test sets each contain eight files with fifteen thousand lines each. The test and training data combined contain a total of 12,968 patterns.

Zookeeper HPC BGL Presto-5.2M HDFS Spark-13M Presto Spark Hive Thunderbird Windows HDFS2
SHISO 129.6989 656.9053 - - - - - - - - - -
Spell 13.6027 76.6969 - - - - - - - - - -
Drain 3.4816 16.2693 156.1772 436.0753 559.6797 7044.1814 - - - - - -
IPLoM 2.1881 11.2585 93.2318 583.5503 379.0384 896.2256 - - - - - -
Logan 1.1323 3.7827 24.1696 68.3472 53.3907 119.0325 309.0909 409.6328 - - - -
Delog 2.034 4.231 16.081 27.278 27.422 48.592 99.848 191.667 1792.965 60.341 376.896 182.168
Table 5. Running time performance of Delog and other methods

6.3. Observations

6.3.1. Training with Privacy Preserving Model

We initialize our clients by training them on 33% of the training data. We sequentially process the remaining files in the training set to learn new patterns and update the set of pattern encodings. For the sake of this experiment, we skip the pattern selection step described in subsection 4.3. We observe that once the model learns the encodings for a given file, it can identify the learned patterns with zero false positives and an average false negative rate of 0.14%.

6.3.2. Evaluation on Test Data

Once trained with pattern encodings, we run our model (M2) on the test set. We compare the performance of our privacy preserving model (M2) against a model (M3) trained on the entirety of the training set and a model (M1) trained only on the 33% training data set without our privacy preserving updates. We consider model (M3) the source of truth for calculating the false negative and false discovery rates. As Table 6 shows, privacy preserving learning helps reduce the number of output patterns to half. We gain a twofold reduction in the false discovery rate while the false negative rate stays at about two percent. Since, while debugging an application, we do not want to miss any important lines, we tune the LSH parameters to minimize the false negative rate.

Evaluation Metric M1 M2 M3
Output Patterns 1588.125 774.625 563.125
False Negatives 0 11.75 -
False Negative Rate 0% 2.085% -
False Positives 1023 222.25 -
False Discovery Rate 64.535% 28.761% -
Table 6. Evaluation of Privacy Preserving Learning Framework on Test Set of Our Synthetic Data

7. Related Works

7.1. Log Parsing

Log parsing is a widely studied subject, and various groups have devised log parsing techniques in the past. Fu et al. (Fu et al., 2009b) cluster log lines on the basis of weighted edit distance. LKE (Fu et al., 2009a), POP (He et al., 2017) and IPLoM (Makanju et al., 2012a) attempt to segregate log lines into clusters such that all the lines in a cluster correspond to the same template, though each uses a different clustering technique. LKE (Fu et al., 2009a) uses the edit distance between each pair of log lines, whereas POP (He et al., 2017) and IPLoM (Makanju et al., 2012a) use the number of tokens in a line to create initial clusters. POP (He et al., 2017) and IPLoM (Makanju et al., 2012a) further split these initial clusters by identifying the tokens which occur most frequently at a given position and splitting the clusters at these positions. IPLoM (Makanju et al., 2012a) then finds bijective relations between unique tokens, while POP (He et al., 2017) uses relative frequency analysis for further subdivision of these clusters.

Vaarandi et al. (Vaarandi, 2003) use a different technique where they identify most commonly occurring tokens in the logs and use them to represent each template. Makanju et al. (Makanju et al., 2012b) use a simple technique of iteratively partitioning logs into sub-partitions to extract log events.

Logan (Agrawal et al., 2019), Drain (He et al., 2018) and Spell (Du and Li, 2016) parse log lines online in order to reduce compute and memory overheads. Logan uses distributed compute with Apache Spark to reduce training time. However, Zhu et al. (Zhu et al., 2018) show that these existing techniques deteriorate in both accuracy and performance when datasets contain a large number of unique patterns. As we show in Section 5, Delog scales notably well on datasets with a large number of patterns.

7.2. Logs Processing for Problem Identification

Use of machine learning and data mining techniques for anomaly detection in application logs is becoming increasingly common. Liang et al. (Liang et al., 2007) use an SVM classifier, whereas Lou et al. (Lou et al., 2010) propose mining the invariants among log events to detect errors and anomalies.

Log3C uses a cascading clustering algorithm for clustering log sequences. These clustered sequences are then correlated with system Key Performance Indicators (KPIs) using regression analysis, and the clusters having high correlation with actual problems are identified. Similarly, Yuan et al., Beschastnikh et al. (Beschastnikh et al., 2014), Shang et al. (Shang et al., 2013) and Ding et al. (Ding et al., 2012, 2014) have proposed other problem identification techniques, but most relevant to our work is the log classification method proposed by Lin et al. (Lin et al., 2016). They generate vector encodings of log sequences using IDF scores to calculate the similarity between client logs and stack traces of already known errors. However, this method requires log sequence level indicators of the application state and cannot be used in a privacy sensitive environment.

8. Conclusion

For most of the datasets, Delog trains nearly two times faster than the previous state of the art, Logan (Agrawal et al., 2019). Moreover, the quality of patterns generated by Delog is consistently better than that of existing parsing algorithms. We use a minhash based LSH algorithm to obtain sub-linear complexity in the number of patterns. The Thunderbird, Windows and Spark datasets used in this work are the largest datasets used for log parsing so far. Delog is the only log parsing algorithm able to successfully parse the Hive dataset in a respectable amount of time.

Delog also continually learns from user generated logs, using a privacy preserving technique to learn new patterns from client logs. User logs are homomorphically encoded using a novel Bloom filter based approach. We perform experiments on a synthetic dataset to demonstrate the efficacy of this approach in log filtering.

References

  • pre ([n. d.]) [n. d.]. Presto: Distributed SQL query engine for big data. ([n. d.]). https://prestodb.io/
  • Agrawal et al. (2019) Amey Agrawal, Rohit Karlupia, and Rajat Gupta. 2019. Logan: A Distributed Online Log Parser. In 35th IEEE International Conference on Data Engineering. IEEE.
  • Beschastnikh et al. (2014) Ivan Beschastnikh, Yuriy Brun, Michael D Ernst, and Arvind Krishnamurthy. 2014. Inferring models of concurrent systems from logs of their behavior with CSight. In Proceedings of the 36th International Conference on Software Engineering. ACM, 468–479.
  • Ding et al. (2012) Rui Ding, Qiang Fu, Jian-Guang Lou, Qingwei Lin, Dongmei Zhang, Jiajun Shen, and Tao Xie. 2012. Healing online service systems via mining historical issue repositories. In Automated Software Engineering (ASE), 2012 Proceedings of the 27th IEEE/ACM International Conference on. IEEE, 318–321.
  • Ding et al. (2014) Rui Ding, Qiang Fu, Jian Guang Lou, Qingwei Lin, Dongmei Zhang, and Tao Xie. 2014. Mining historical issue repositories to heal large-scale online service systems. In Dependable Systems and Networks (DSN), 2014 44th Annual IEEE/IFIP International Conference on. IEEE, 311–322.
  • Du and Li (2016) Min Du and Feifei Li. 2016. Spell: Streaming parsing of system event logs. In Data Mining (ICDM), 2016 IEEE 16th International Conference on. IEEE, 859–864.
  • Fu et al. (2009a) Qiang Fu, Jian-Guang Lou, Yi Wang, and Jiang Li. 2009a. Execution anomaly detection in distributed systems through unstructured log analysis. In Data Mining, 2009. ICDM’09. Ninth IEEE International Conference on. IEEE, 149–158.
  • Fu et al. (2009b) Qiang Fu, Jian-Guang Lou, Yi Wang, and Jiang Li. 2009b. Execution anomaly detection in distributed systems through unstructured log analysis. In Data Mining, 2009. ICDM’09. Ninth IEEE International Conference on. IEEE, 149–158.
  • Gionis et al. (1999) Aristides Gionis, Piotr Indyk, Rajeev Motwani, et al. 1999. Similarity search in high dimensions via hashing. In Vldb, Vol. 99. 518–529.
  • He et al. (2017) Pinjia He, Jieming Zhu, Shilin He, Jian Li, and Michael R Lyu. 2017. Towards Automated Log Parsing for Large-Scale Log Data Analysis. IEEE Transactions on Dependable and Secure Computing (2017).
  • He et al. (2018) Pinjia He, Jieming Zhu, Pengcheng Xu, Zibin Zheng, and Michael R Lyu. 2018. A Directed Acyclic Graph Approach to Online Log Parsing. arXiv preprint arXiv:1806.04356 (2018).
  • Karapiperis and Verykios (2015) D. Karapiperis and V. S. Verykios. 2015. An LSH-Based Blocking Approach with a Homomorphic Matching Technique for Privacy-Preserving Record Linkage. IEEE Transactions on Knowledge and Data Engineering 27, 4 (April 2015), 909–921. https://doi.org/10.1109/TKDE.2014.2349916
  • LANS ([n. d.]) LLC LANS. [n. d.]. Operational data to support and enable computer science research. ([n. d.]).
  • Liang et al. (2007) Yinglung Liang, Yanyong Zhang, Hui Xiong, and Ramendra Sahoo. 2007. Failure prediction in ibm bluegene/l event logs. In Data Mining, 2007. ICDM 2007. Seventh IEEE International Conference on. IEEE, 583–588.
  • Lin et al. (2016) Qingwei Lin, Hongyu Zhang, Jian-Guang Lou, Yu Zhang, and Xuewei Chen. 2016. Log clustering based problem identification for online service systems. In Proceedings of the 38th International Conference on Software Engineering Companion. ACM, 102–111.
  • Lou et al. (2010) Jian-Guang Lou, Qiang Fu, Shengqi Yang, Ye Xu, and Jiang Li. 2010. Mining Invariants from Console Logs for System Problem Detection.. In USENIX Annual Technical Conference. 23–25.
  • Makanju et al. (2012a) Adetokunbo Makanju, A Nur Zincir-Heywood, and Evangelos E Milios. 2012a. A lightweight algorithm for message type extraction in system application logs. IEEE Transactions on Knowledge and Data Engineering 24, 11 (2012), 1921–1936.
  • Makanju et al. (2012b) Adetokunbo Makanju, A Nur Zincir-Heywood, and Evangelos E Milios. 2012b. A lightweight algorithm for message type extraction in system application logs. IEEE Transactions on Knowledge and Data Engineering 24, 11 (2012), 1921–1936.
  • Mizutani (2013) Masayoshi Mizutani. 2013. Incremental mining of system log format. In Services Computing (SCC), 2013 IEEE International Conference on. IEEE, 595–602.
  • Needleman and Wunsch (1970) Saul B Needleman and Christian D Wunsch. 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of molecular biology 48, 3 (1970), 443–453.
  • Oliner and Stearley (2007) Adam Oliner and Jon Stearley. 2007. What supercomputers say: A study of five system logs. In Dependable Systems and Networks, 2007. DSN’07. 37th Annual IEEE/IFIP International Conference on. IEEE, 575–584.
  • Schnell et al. (2009) Rainer Schnell, Tobias Bachteler, and Jörg Reiher. 2009. Privacy-preserving record linkage using Bloom filters. BMC Medical Informatics and Decision Making 9, 1 (25 Aug 2009), 41. https://doi.org/10.1186/1472-6947-9-41
  • Shang et al. (2013) Weiyi Shang, Zhen Ming Jiang, Hadi Hemmati, Bram Adams, Ahmed E. Hassan, and Patrick Martin. 2013. Assisting Developers of Big Data Analytics Applications when Deploying on Hadoop Clouds. In Proceedings of the 2013 International Conference on Software Engineering (ICSE ’13). IEEE Press, Piscataway, NJ, USA, 402–411. http://dl.acm.org/citation.cfm?id=2486788.2486842
  • Shvachko et al. (2010) Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. 2010. The hadoop distributed file system. In Mass storage systems and technologies (MSST), 2010 IEEE 26th symposium on. IEEE, 1–10.
  • Thusoo et al. (2010) Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Ning Zhang, Suresh Antony, Hao Liu, and Raghotham Murthy. 2010. Hive-a petabyte scale data warehouse using hadoop. In Data Engineering (ICDE), 2010 IEEE 26th International Conference on. IEEE, 996–1005.
  • Vaarandi (2003) Risto Vaarandi. 2003. A data clustering algorithm for mining patterns from event logs. In IP Operations & Management, 2003.(IPOM 2003). 3rd IEEE Workshop on. IEEE, 119–126.
  • Xu et al. (2009) Wei Xu, Ling Huang, Armando Fox, David Patterson, and Michael I Jordan. 2009. Detecting large-scale system problems by mining console logs. In Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles. ACM, 117–132.
  • Zaharia et al. (2010) Matei Zaharia, Mosharaf Chowdhury, Michael J Franklin, Scott Shenker, and Ion Stoica. 2010. Spark: Cluster computing with working sets. HotCloud 10, 10-10 (2010), 95.
  • Zhu et al. (2018) Jieming Zhu, Shilin He, Jinyang Liu, Pinjia He, Qi Xie, Zibin Zheng, and Michael R Lyu. 2018. Tools and Benchmarks for Automated Log Parsing. arXiv preprint arXiv:1811.03509 (2018).