2.1 Linear Regression
Linear regression is considered one of the most widely used techniques for analyzing multi-factor data [montgomery2012introduction]. It is used in various fields such as economics [dielman2001applied], finance [cook2008regression], accounting [cooke1998regression], marketing [todua2013multiple], politics [kousser1973ecological], agriculture [majumdar2017analysis] and more.
Linear regression predicts a numeric target variable by fitting the best-matching linear relationship between the target and the independent variables. The best match depends on the number of independent variables. In our research we use linear regression to fit a relationship between a target and one independent variable. The outcome of a simple linear regression model is a relationship that may be described by the following formula, as presented in [weisberg2005applied]:
$\hat{y}_i = \beta_0 + \beta_1 x_i$
where $\hat{y}_i$ is the estimated value of the target based on the value $x_i$ of the independent parameter in sample or observation $i$. The value $\beta_0$ is the intercept, and the value $\beta_1$ is the slope or gradient.
The simple linear regression model is fitted using the "least squares" method, which makes the sum of the squared distances between the line and the actual observations as small as possible. The least squares method dates back to the early 1800s, when it was first published by Legendre (1805) and Gauss (1809) for the prediction of planetary movement [stigler1986history].
To understand the strength of the relationship, or in other words the correlation between the target variable and the independent variable, we can compute the coefficient of determination, also known as $R^2$ [draper1998applied], based on the values predicted by the formula above. The coefficient is defined as $R^2 = 1 - u/v$, where $u = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ is the residual sum of squares, $v = \sum_{i=1}^{n} (y_i - \bar{y})^2$ is the total sum of squares, and $n$ is the number of samples. Here $y_i$ are the observed values, $\hat{y}_i$ the predicted values and $\bar{y}$ the mean of the observed values. The best possible score is 1.0, and the score can be negative (because the model can be arbitrarily worse). In our research we compute this coefficient to deduce whether two events are correlated with each other.
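The fit and the score described above can be sketched in a few lines. This is a minimal illustration on toy data, assuming NumPy is available; the function names `fit_simple_linear_regression` and `r2_score` are chosen here for illustration and are not taken from the thesis code.

```python
import numpy as np


def fit_simple_linear_regression(x, y):
    """Return (intercept, slope) of the least squares line through (x, y)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Closed-form least squares solution for one independent variable.
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    intercept = y_mean - slope * x_mean
    return intercept, slope


def r2_score(y_true, y_pred):
    """Coefficient of determination, 1 - u/v (assumes y_true is not constant)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    u = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    v = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - u / v


# Toy data (hypothetical): y depends on x exactly as y = 1 + 2x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
intercept, slope = fit_simple_linear_regression(x, y)
score = r2_score(y, intercept + slope * x)
bad_score = r2_score(y, -y)  # a deliberately poor prediction
```

On this toy data the fitted line reproduces the observations, so the score is 1.0 up to rounding, while the deliberately poor prediction yields a negative score.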
2.2 Anomaly Detection
Anomaly detection refers to the problem of finding patterns in data that do not conform to expected behavior [adsurvey2009]; those patterns are also referred to as outliers or anomalies. Over time, a variety of anomaly detection techniques have been developed in several research communities. Many of these techniques have been specifically developed for certain application domains, while others are more generic. There has been an abundance of research on anomaly detection algorithms [adsurvey2009]. Some anomaly detection algorithms are based on supervised learning and rely on having access to both normal and abnormal examples. Other algorithms are based on unsupervised learning, where neither abnormal nor normal data are labeled and other methods need to be used to detect outliers. In our research the data is not labeled, and therefore the methods we used relate to unsupervised learning or to an adaptation of supervised learning, such as linear regression, for the same task.
2.2.1 Unsupervised Anomaly Detection
Some anomaly detection algorithms are unsupervised: they observe and characterize the normal behavior and identify strong deviations from it as abnormal. Various unsupervised anomaly detection methods have been proposed, such as [unsupervised_ad2017, leung2005unsupervised, eskin2002geometric, systemlog_ad2016, deeplog2017] and more. Our approach belongs to this unsupervised type of anomaly detection algorithm. However, we do not characterize a single SQL command as normal or abnormal, as we do not have access to individual commands but rather to their count, i.e., the number of times each command was issued in each session. Note that one can still apply unsupervised anomaly detection to the count of each command, e.g., identifying when a single event ID has been issued too many times. In this research we go beyond this, and explore whether anomalies can be identified by considering the relation between the counts of different events.
2.2.2 Anomaly Detection for System Logs
Our data set consists of SQL logs of every SQL query made by users in different applications. While system logs may differ, these types of logs can be parsed similarly, and the literature on anomaly detection or pattern recognition in system logs may give us a direction on how to proceed and which methods have been proposed. Log-based anomaly detection has become a research topic of practical importance both in academia and in industry [systemlog_ad2016]. Several methods have been proposed for detecting anomalies in system logs, such as [systemlog_ad2016, hamooni2016logminefp, deeplog2017, landauer2018dynamic, brown2018_ad].
The authors of LogMine [hamooni2016logminefp] propose a method of map-reducing instances in linear time, and then using those instances to hierarchically create clusters of similar types of logs. LogMine focuses on finding patterns in logs. While LogMine first map-reduces instances of strings and later uses those map-reduced patterns for an individual instance, we use the frequency feature to find correlations between instances. Landauer [landauer2018dynamic] offered a similar clustering approach to create an anomaly detection model for system logs. The procedure they propose, similarly to the model Brown [brown2018_ad] proposed, relies on the timestamps of each instance to detect patterns and correlations between log lines. The authors of DeepLog [deeplog2017] proposed a deep neural network model utilizing Long Short-Term Memory (LSTM) to model a system log as a natural language sequence.
While there are different methods and anomaly detection models, the challenge in our research is different for two reasons:
The proposed methods treat each log individually, or consider the timestamps of each log as part of the relationship. In our research there are no timestamps for the data, and each log event is aggregated for each session; therefore it cannot be treated individually.
We do not have the full query, since the data is compressed (which can be treated similarly to the log's structure), and therefore we cannot use NLP methods to model the types of instances.
We can see that while there are many state-of-the-art solutions for anomaly detection in system logs, they do not provide us with a solution to our research questions.
2.2.3 Related Anomaly Detection Models
Sequential anomaly detection, which refers to detecting anomalies by considering the relationship between observed objects, has been studied in the context of anomaly detection in time series or sequences of events [iccws2017, deeplog2017, lu2003adaptive]. In all cases, the available data has a clear notion of sequence. Research on anomaly detection for such cases often focuses on developing intelligent aggregating functions that allow effective anomaly detection.
In our data set we use a feature named session, which is a continuous connection between a user and an application (or server), in which different types of instances (referred to as events) are aggregated by their count.
The literature also presents semantic anomaly detection [semanticad2002, fu2009execution] (i.e., based on associations that exist between the meanings of words), which uses correlations between instances that have the same attribute origin (similar to sessions in our research). While the data may be processed and aggregated in a manner similar to our data set, the authors use semantic evaluation during training.
Generally, sessions may give us an idea of which types of instances (events) may be related to each other, either by sequence or by semantics. However, in our dataset we lose both the semantic and the sequential evaluation options between the types of instances. While the methods above may give us an idea of how to correlate instances, we still need to find a way to overcome the aggregation.
Zhang (2013) [count_ad2013] developed an anomaly detection method for software-defined networks (SDN) that considers aggregated data about network flows. Similar to our research, they used a linear prediction method to find correlations between instances. While both their method and ours aggregate by the count (frequency) of instances, in their setting they were given access to the entire flow of events, and could therefore modify the granularity of the aggregated data dynamically.
The literature reviewed in this section shows that the anomaly detection field in general, and anomaly detection in logs more specifically, applies a variety of methods to this goal. On the other hand, exploiting relationships in aggregated data and representing the data as graphs for graph autoencoders, which we present in the next sections, is a novel approach in this field.
2.3 Graph Learning
This section of the literature review describes the algorithm known as Graph Convolutional Networks. First we describe the algorithm and its properties; then we explain the idea of modeling temporal data using Graph Convolutional Networks; and lastly, we explain the method that we intend to use in order to analyze faults in our dataset. This section also presents related works that are relevant to the research topic and elaborates on them.
2.3.1 Graph Convolutional Networks
Graph Convolutional Networks (GCNs) are a neural network architecture for machine learning on graphs. GCNs can be used for classification [DBLP_journals_corr_KipfW16, yao2019graph], link prediction, pattern recognition [kipf2016variational, yan2018spatial], etc. Formally, given a graph $G = (V, E)$ with $N$ nodes, a GCN takes the following as input:
Feature Matrix $X$ - ($N \times F$), where $F$ is the number of features for each node (in our case a node will be a construct id / event type).
Adjacency Matrix $A$ - ($N \times N$), which represents directed edges and their values (in our case, an edge value will be defined as the division between the constructs' counts).
A hidden layer in a GCN can be written as $H^{(l+1)} = f(H^{(l)}, A)$, where $f$ is the propagation rule, i.e., the activation applied after multiplying the inputs by the weights, and $H^{(0)}$ is the feature matrix $X$.
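The propagation rule is left abstract above. As an illustration, the sketch below implements one common choice, the layer-wise rule of Kipf and Welling, $H^{(l+1)} = \mathrm{ReLU}(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(l)} W^{(l)})$ with $\tilde{A} = A + I$. It assumes NumPy and a non-negative adjacency matrix; the symmetric normalization is designed for undirected graphs, so applying it to the directed ratio graphs of this work would be an adaptation rather than the original rule. The function name `gcn_layer` and the toy matrices are hypothetical.

```python
import numpy as np


def gcn_layer(H, A, W):
    """One GCN layer: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    A_tilde = A + np.eye(A.shape[0])           # add self-loops
    degrees = A_tilde.sum(axis=1)              # >= 1 for non-negative A
    D_inv_sqrt = np.diag(1.0 / np.sqrt(degrees))
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt  # normalized adjacency
    return np.maximum(A_hat @ H @ W, 0.0)      # ReLU activation


# Toy example: a 3-node path graph, one-hot node features, 2 hidden units.
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
X = np.eye(3)                                  # H^(0) = X
W0 = np.random.default_rng(0).normal(size=(3, 2))
H1 = gcn_layer(X, A, W0)                       # shape (3, 2)
```

Stacking several such calls, each with its own weight matrix, gives a multi-layer GCN whose first input is the feature matrix and whose every layer reuses the same adjacency matrix.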
2.3.2 Modeling Relational Data with GCN
In the article by Schlichtkrull [schlichtkrull2017modeling], the concept of Relational Graph Convolutional Networks (R-GCNs) is introduced and used for link prediction and entity classification. The link prediction model can be regarded as an autoencoder consisting of:
Encoder: an R-GCN producing latent feature representations of entities
Decoder: a tensor factorization model exploiting these representations to predict labeled edges.
The researchers represent a relation type (or an edge with a relation value) between two entities as a triple $(v_i, r, v_j)$, where $v_i$ and $v_j$ are nodes and $r$ is the relation. For entity classification the model uses an encoder with only the existing nodes (while ignoring unlabeled nodes), handling each node separately. For the task of link prediction the authors used an encoder-decoder architecture.
We can create a similar representation by changing the relations to numeric values that fit our dataset's representation. However, their model is applied to a non-temporal, static graph, whereas each session in our dataset is represented by a dynamic graph that requires a different set of edges, nodes and relations for every graph representation.
2.3.3 Classification with GCNs
Generally, GCNs show great promise when it comes to scalability and efficiency. In recent years, they have been widely used for classification purposes [yao2019graph, zhuang2018dual]. When reviewing the work published by Kipf et al. [DBLP_journals_corr_KipfW16], we noticed two relevant contributions:
A simple and well-behaved layer-wise propagation rule for neural network models that operate directly on graphs, together with a demonstration of how it can be motivated from a first-order approximation of spectral graph convolutions.
Results that show high accuracy and efficiency in semi-supervised learning.
As part of our research, we attempted to use the presented propagation rule, since both graph representations of the data set are sparse and large, and efficient learning is therefore needed. However, the semi-supervised learning approach was not applicable to our research, since the learning phase in our model is conducted on an unlabeled dataset. In addition, their method is used to learn and model a static graph representation, while our dataset is dynamic; consequently, our output layer had to be different, and so did the loss function.
We have also reviewed [zhuang2018dual], which proposes dual graph convolutional networks for semi-supervised classification tasks. The authors proposed dual graph representations that are transformations of the same diffusion matrix (similar to the adjacency matrix). While their approach uses two graphs in the convolutional networks for learning, our approach is different, since we present two graphs that represent different relationships between our instances.
2.3.4 Graph Convolutional Network link prediction
Another use of GCNs is link prediction. Link prediction has various uses, such as traffic prediction, as presented in [Zhao_2019_traffic], [guo2019attention], [ijcai2018_505_traffic] and [cui2019traffic]. The problem presented is mapping a road network to an unweighted graph $G = (V, E)$, where $V$ is a set of road nodes and $E$ represents connections (whether they exist or not) between roads. Each node also has attribute features such as traffic speed, traffic flow and traffic density. The goal of traffic forecasting is to predict information on each road based on the historical information about it.
We have reviewed works that use a temporal-dynamic GCN approach. In this research, we tried models similar to their methods, with slight changes to the features. However, the difference was that we needed to predict the relations between nodes in the current timestamp, i.e., the adjacency matrix, while in both articles the authors proposed methods to predict the nodes' features in the next timestamp, i.e., the feature matrix. Therefore, we adjusted the model so that the output includes the adjacency matrix, in which we then look for outliers.
2.3.5 Unsupervised Learning with Graph Autoencoders
GNNs are typically used for supervised or semi-supervised learning problems [zhou2018graph]. Recently, there have been attempts to extend auto-encoders (AE) to graph domains. Graph auto-encoders aim at representing nodes as low-dimensional vectors through unsupervised training. The Graph Auto-Encoder (GAE) [kipf2016variational] first uses GCNs to encode the nodes in the graph. It then uses a simple decoder to reconstruct the adjacency matrix and computes the loss from the similarity between the original adjacency matrix and the reconstructed matrix. The same work also trains the GAE model in a variational manner; the resulting model is named the variational graph autoencoder (VGAE).
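The decoding step of a GAE can be illustrated with the inner-product decoder used by Kipf and Welling, $\hat{A} = \sigma(Z Z^{\top})$, together with a cross-entropy reconstruction loss. The sketch below assumes NumPy, a binary adjacency matrix and a latent matrix $Z$ that some encoder (for example a stack of GCN layers) has already produced; the function names and toy values are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def decode_adjacency(Z):
    """Inner-product decoder: edge probabilities from node embeddings."""
    return sigmoid(Z @ Z.T)


def reconstruction_loss(A, A_rec, eps=1e-12):
    """Mean binary cross-entropy between original and reconstructed edges."""
    A_rec = np.clip(A_rec, eps, 1.0 - eps)  # avoid log(0)
    return float(-np.mean(A * np.log(A_rec) + (1 - A) * np.log(1 - A_rec)))


# Toy embeddings: nodes 0 and 1 are close in latent space, node 2 is not.
Z = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0]])
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])
A_rec = decode_adjacency(Z)
loss = reconstruction_loss(A, A_rec)
```

Nodes with similar embeddings receive a higher reconstructed edge probability, which is what allows a large reconstruction error on a particular edge to be read as a sign of unusual structure.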
In addition, Berg (2017) [berg2017graph] uses graph autoencoders in recommender systems and proposes the graph convolutional matrix completion model (GC-MC). According to Berg's research, the GC-MC model outperforms other baseline models on the MovieLens dataset. The Adversarially Regularized Graph Auto-encoder (ARGA) [pan2018adversarially] employs generative adversarial networks (GANs) to regularize a graph-convolution-based graph auto-encoder in an attempt to follow a prior distribution. There are other graph auto-encoders such as NetRA [Yu2018_rep_adverse_ae], DNGR [dngr2018], SDNE [wang2016] and DRNE [drne2018]. These last autoencoders do not use GCNs in their architecture; instead they use different graph embedding techniques.
Our MGAE (multi-graph autoencoder) model uses a method similar to Kipf's GAE [kipf2016variational]; however, the graphs we reconstruct are dynamic graphs. In addition, our method uses multiple graphs to represent the data, while the other graph autoencoder methods we have presented all use a single graph representation.
5.1 Aggregated Dataset Formulation
The dataset made available to this research includes information about SQL commands issued by users to a database. A major challenge in this research is that the available dataset does not contain information about specific invocations of SQL commands. Rather, it contains information about how many times each SQL command has been issued in a specific time range, by a specific database user, in a specific session.
Throughout this thesis, we use the following definitions to describe the entities in our dataset:
Session (context) - A continuous connection between user and an application (or server). A specific session will be referred to with session id.
Event (may be referred to as type of event or aggregated event) - We define an event as a SQL query (or transaction) made to a specific database in a session (context). An event is referred to by its event id. In each session (context), each type of event appears only once due to the aggregation in our dataset.
Count (of event) - The number of times a unique event was made in the session.
Report (file) - Hourly log file containing all of the aggregated events.
Based on the definitions we have presented, to describe relationships between events we will define the following terminology:
Couple - two events that were recorded in the same session.
Strong correlation - Based on the linear regression score ($R^2$ score close to 1).
Stable couple - A couple that was found in the same session (context) in both the training set and the testing set.
Strongly correlated couple - Stable couple with strong correlation both in the training set and testing set.
5.1.1 Dependencies Between Entities
After discussions with a domain expert, we established the following functional dependencies between some of the fields in our dataset:
event ID → verb object
Session ID → instance ID, made by the same user using the same connection, IP address, database and application.
Session ID, event ID → count
The aggregation of a single event is illustrated in Figure 5.1
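Functional dependencies like those listed above can be checked mechanically on the aggregated rows. The following sketch is an assumption-laden illustration: the row layout and the field names `session_id`, `event_id` and `count` are hypothetical and do not reflect the real schema of the dataset.

```python
def find_violations(rows, determinant, dependent):
    """Return the determinant keys that map to more than one dependent value."""
    seen = {}
    violations = []
    for row in rows:
        key = tuple(row[field] for field in determinant)
        value = tuple(row[field] for field in dependent)
        if key in seen and seen[key] != value:
            violations.append(key)
        seen.setdefault(key, value)
    return violations


# Hypothetical aggregated rows: one count per (session, event) pair.
rows = [
    {"session_id": "s1", "event_id": "e1", "count": 4},
    {"session_id": "s1", "event_id": "e2", "count": 8},
    {"session_id": "s2", "event_id": "e1", "count": 1},
]
ok = find_violations(rows, ("session_id", "event_id"), ("count",))

# A conflicting row breaks the dependency (session ID, event ID) -> count.
bad_rows = rows + [{"session_id": "s1", "event_id": "e1", "count": 5}]
broken = find_violations(bad_rows, ("session_id", "event_id"), ("count",))
```

An empty result means the dependency holds on the given rows; any returned key identifies a (session, event) pair that was aggregated more than once with different counts.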
5.1.2 Detecting Correlations Between Events
Generally, to find couples (of events), either strongly correlated or just stable, we used Linear Regression as the model to find the correlation. To try and find simple patterns based on our hypothesis, we created a simple modeling mechanism as presented in Algorithm 1.
Using this algorithm we can find simple patterns that may point to existing correlations between two events in both the training set and the testing set.
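Since Algorithm 1 itself is not reproduced in this text, the sketch below is only one possible reading of the procedure described here: collect couples per session, fit a linear regression on their counts, keep the couples whose $R^2$ reaches a threshold in training, and then check them against the testing set. It assumes NumPy; the session layout (a mapping from session id to per-event counts), the function names and the `min_points` parameter are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

import numpy as np


def collect_couples(sessions):
    """Map each couple (a, b) to the (count_a, count_b) points seen together."""
    points = defaultdict(list)
    for counts in sessions.values():
        for a, b in combinations(sorted(counts), 2):
            points[(a, b)].append((counts[a], counts[b]))
    return points


def r2(y, y_pred):
    u = np.sum((y - y_pred) ** 2)    # residual sum of squares
    v = np.sum((y - y.mean()) ** 2)  # total sum of squares
    return 1.0 - u / v


def fit_couples(train_sessions, min_points=3, threshold=0.8):
    """Return {couple: (intercept, slope)} for strongly correlated couples."""
    models = {}
    for couple, pts in collect_couples(train_sessions).items():
        x, y = np.array(pts, dtype=float).T
        # Skip couples with too few points or constant counts.
        if len(x) < min_points or np.ptp(x) == 0 or np.ptp(y) == 0:
            continue
        slope, intercept = np.polyfit(x, y, 1)
        if r2(y, intercept + slope * x) >= threshold:
            models[couple] = (intercept, slope)
    return models


def strongly_correlated_couples(models, test_sessions, threshold=0.8):
    """Keep couples that reappear in testing and are still predicted well."""
    test_points = collect_couples(test_sessions)
    kept = []
    for couple, (intercept, slope) in models.items():
        pts = test_points.get(couple)
        if not pts:
            continue                 # couple is not stable
        x, y = np.array(pts, dtype=float).T
        if np.ptp(y) > 0 and r2(y, intercept + slope * x) >= threshold:
            kept.append(couple)
    return kept


# Toy sessions (hypothetical): B is always twice A, C is unrelated.
train = {"s1": {"A": 1, "B": 2, "C": 5}, "s2": {"A": 2, "B": 4, "C": 1},
         "s3": {"A": 3, "B": 6, "C": 4}, "s4": {"A": 4, "B": 8, "C": 2}}
test = {"t1": {"A": 5, "B": 10}, "t2": {"A": 6, "B": 12}, "t3": {"A": 7, "B": 14}}
models = fit_couples(train)
stable_strong = strongly_correlated_couples(models, test)
```

On the toy sessions only the couple (A, B) survives both stages, because the counts of C show no linear relationship with either of the other events.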
5.2 Graph Autoencoder
In this section we present the definitions related to the graph (convolutional) autoencoder (GAE). We use convolutional layers because filter parameters are typically shared over all locations in the graph (Kipf [DBLP_journals_corr_KipfW16]). For the autoencoder models, the goal is to reconstruct a graph $G = (V, E)$, or multiple graphs for MGAE, where $V$ are the nodes (or vertices) and $E$ are the edges of the graphs. The input for the model is:
Matrix $A$, which represents the graph structure in a two-dimensional form. $A$ may also be referred to as the adjacency matrix. The dimensions of $A$ will be $N \times N$, where $N$ is the number of nodes.
The output of the model will be a matrix $\hat{A}$ with the same dimensions as $A$. Matrix $\hat{A}$ represents the structure of the reconstructed graph. Graph-level outputs can be modeled by introducing some form of pooling operation (Duvenaud [duvenaud2015convolutional]).
Every neural network layer can then be written as a non-linear function $H^{(l+1)} = f(H^{(l)}, A)$, with $H^{(0)} = X$ (the input features) and $H^{(L)} = Z$ (or $z$ for graph-level outputs), $L$ being the number of layers. The specific models then differ only in how $f(\cdot, \cdot)$ is chosen and parameterized.
6.1 Aggregated Dataset Analysis
In our research we created different models to find correlations between two or more instances, based on analysis and understanding of the data. In the following sections we describe the method of analyzing the different events and their behavior, and first results based on the couples we found in this section (both strongly correlated and stable events).
6.1.1 Analyzing Events
First, we count the cumulative number of unique events as a function of the time range where the events were collected. This is done in order to understand better the behavior of existing (training set) and new (testing set) events in our system.
From the events recorded each day, we single out those we found to be stable (i.e., part of a stable couple) in order to assess their stability. That way we can tell whether there is evidence of repetitive patterns that Algorithm 1 can find.
To understand the scale of the stable couples found (or of the events in those couples), we review the percentage of stable events out of the whole testing set. In other words, we try to understand how many of the event instances in the report can be found as part of couples.
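The two measurements described in this subsection, the cumulative number of unique events over time and the share of stable events in a tested report, reduce to simple set operations. The sketch below uses only the standard library; the report layout (one set of event IDs per time slot) and the function names are assumptions made for illustration.

```python
def cumulative_unique_counts(reports):
    """Number of distinct event IDs seen up to and including each report."""
    seen = set()
    totals = []
    for events in reports:
        seen.update(events)
        totals.append(len(seen))
    return totals


def stable_share(test_events, stable_events):
    """Percentage of the tested events that belong to stable couples."""
    test_events = set(test_events)
    if not test_events:
        return 0.0
    return 100.0 * len(test_events & set(stable_events)) / len(test_events)


# Hypothetical hourly reports, each given as the set of event IDs it contains.
reports = [{"e1", "e2"}, {"e2", "e3"}, {"e1", "e4", "e5"}]
growth = cumulative_unique_counts(reports)
share = stable_share({"e1", "e2", "e3", "e4"}, {"e1", "e2"})
```

Plotting `growth` against time corresponds to the cumulative-events curve, and `share` is the percentage reported per tested day.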
In our research, we intend to further analyze events and strongly correlated events, to understand how we can find the ones that exist both in the training set and in future testing sets, and to use them as the basis of our anomaly detection model.
6.1.2 Analyzing Correlations Between Events
Next, we searched for pairs of stable event IDs with a strong correlation between their counts (strongly correlated). To this end, we collected all event IDs that were recorded together in the same session. Then, we fitted a linear regression model on their counts. An output of such a fitting is the $R^2$ score, which is 1.0 for a perfect correlation and 0 or less for no correlation. We considered every pair of event IDs for which $R^2$ is at least 0.8. Then, we checked whether these correlative pairs still maintain their correlation in other testing data (see Algorithm 1). That is, we used the learned linear regression model for every pair to predict the counts in the week after the week used for training.
Later, we checked how many of the correlations identified during training remain stable, i.e., are still good predictors of the counts in the testing set. That is, we counted all strong correlations we found for 3, 4 and 5 days of training, and tested how many of those we could predict correctly.
Finally, we checked how common it is to observe a stable event with a correlation of 0.8 or higher with another event.
In our research, we intend to further analyze the time decay of the correlations we find in our training set. That way, we can review the behavior of such correlations and create a more sophisticated and efficient algorithm that detects them at an early training stage, while discarding the correlations that do not hold over time.
6.1.3 Multiple-Linear Regression Anomaly Detection Model
To test the couple correlations found by Algorithm 1, and whether we can use them for anomaly detection, we created an anomaly detection model based on that algorithm. To test a single point (counts of events), we created the following prediction error check:
where a perfect prediction has the result 0 and the worst possible prediction has the result 1. The error threshold is the maximum allowed error between the prediction and the true count value.
In addition, we added an "exoneration" technique, which exonerates an event that was classified as an "anomaly" when tested with one event but was later found to be strongly correlated with a different event. We also defined a threshold parameter: the number of broken correlations of a specific instance. When the threshold is larger than 1, an event is output as an anomaly only if it has been tested with at least two other events and its count was predicted incorrectly with both. While this anomaly detection model is fairly simple, it was the first step in testing our hypothesis.
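The decision logic of this section can be sketched as follows. The exact error formula is not available in this text, so the normalized error used here, $|\hat{c} - c| / \max(\hat{c}, c)$, is an assumption chosen only because it is 0 for a perfect prediction and 1 in the worst case for non-negative counts. The model layout (a mapping from a directed pair `(source, target)` to a fitted intercept and slope), the parameter names and the way exoneration is applied are likewise hypothetical interpretations.

```python
def normalized_error(predicted, actual):
    """Assumed error measure in [0, 1] for non-negative counts."""
    predicted = max(predicted, 0.0)  # a count cannot be negative
    largest = max(predicted, actual)
    if largest == 0:
        return 0.0
    return abs(predicted - actual) / largest


def is_anomalous(event, session_counts, models,
                 max_error=0.2, threshold=2, exonerate=True):
    """Flag `event` when enough of its learned correlations are broken."""
    if event not in session_counts:
        return False
    broken = held = 0
    for (source, target), (intercept, slope) in models.items():
        if target != event or source not in session_counts:
            continue
        predicted = intercept + slope * session_counts[source]
        if normalized_error(predicted, session_counts[event]) > max_error:
            broken += 1
        else:
            held += 1
    if exonerate and held > 0:
        return False                 # some partner still explains the count
    return broken >= threshold


# Hypothetical models: the count of C is predicted from A (x2) and from B (x1).
models = {("A", "C"): (0.0, 2.0), ("B", "C"): (0.0, 1.0)}
normal = is_anomalous("C", {"A": 5, "B": 10, "C": 10}, models)
flagged = is_anomalous("C", {"A": 5, "B": 10, "C": 40}, models)
exonerated = is_anomalous("C", {"A": 5, "B": 40, "C": 40}, models)
```

In the third toy session the prediction from A fails but the prediction from B holds, so the event is exonerated; with `exonerate=False` and `threshold=1` the same session would be flagged.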
6.1.4 Advanced Relationships
The discussion so far has been limited to one type of relationship between construct counts: a linear, pair-wise correlation. To assess the potential for more complex relationships, we constructed a graph whose vertices are the stable constructs and in which there is an edge between two constructs if there is a strong (larger than 0.8) correlation between their counts. That way, if Event A and Event B were found to be strongly correlated, they are represented in the graph as nodes A and B (respectively), and we add an edge between nodes A and B due to the strong correlation between them.
While we did not create a graph model that represents the method in this section (i.e., one in which only a strong correlation adds an edge), it later led us to construct the graph representation of the count (ratio) graphs, which was presented in Chapter 5. In the next sections we present the graph representations of our data, and later the different autoencoder models that use those representations.
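Building such a graph from pairwise scores is straightforward. The sketch below uses only the standard library and assumes the scores are given as a mapping from an event pair to its $R^2$ value; the grouping into connected components is an addition made here for illustration, as one simple way to look for relationships that involve more than two events.

```python
from collections import defaultdict


def build_correlation_graph(scores, threshold=0.8):
    """Undirected adjacency sets with an edge for every strong correlation."""
    graph = defaultdict(set)
    for (a, b), score in scores.items():
        if score > threshold:
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)


def connected_components(graph):
    """Groups of events linked directly or indirectly by strong correlations."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(component)
    return components


# Hypothetical R^2 scores for pairs of stable events.
scores = {("A", "B"): 0.95, ("B", "C"): 0.85, ("C", "D"): 0.30, ("E", "F"): 0.90}
graph = build_correlation_graph(scores)
groups = connected_components(graph)
```

In the toy scores, A, B and C end up in one group even though A and C were never scored as a pair, which is the kind of indirect structure a pair-wise analysis alone would not surface.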
6.2 Dataset Graph Representations
A significant part of our contribution relies on the graph representations of our aggregated dataset. The first graph representation is derived from the correlation-based anomaly detection presented in the previous section. For our graph autoencoder models, we had to transform the raw data into graphs that represent each entity. The entity we chose to represent is the session, since each event may appear only once per session. Based on the results of the event analysis presented in the previous section, we deduced that thousands of new unique events are created daily. However, due to memory and storage limits, we wanted to create a graph that represents only the most common events. Therefore, we selected the most common events using a minimum-appearances parameter. The remaining event types still accounted for the majority of our data set (over 50% of the total set).
Next, we represent each session as a directed graph, which we refer to as the Relations graph. Each graph is represented by its adjacency matrix, i.e., the matrix that holds the value of each edge (the value of the edge from construct i to construct j appears in cell (i,j) of the adjacency matrix).
This results in one graph for each session in our dataset. The formulation of the Relations graph is presented in Algorithm 3.
The second representation is denoted the Appearances graph and is presented in Algorithm 4. In the Appearances graph, the edges represent the percentage of times the nodes (events) appeared together throughout the training set report. In other words, the Appearances graph captures how likely it is for an edge, which indicates that both events appear together and how often (as a percentage), to exist in a specific session. The intuition behind this graph representation is the task of link prediction, which may be used for anomaly detection as well.
The last representation, denoted the Count graph, represents the count property of each event in the session. We created this graph in two ways. The first results in an adjacency matrix with the count values of the aggregated events on the diagonal of the matrix. The second results in an adjacency matrix in which all the cells in row $i$ contain the count value of event $i$. The count value in both methods was normalized by the diagonal sum. The creation of the Count graph is presented in Algorithm 5.
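Algorithms 3 to 5 are not reproduced in this text, so the following is only a sketch of how the three adjacency matrices might be built, assuming NumPy and a fixed, ordered list of common events. Several details are assumptions: the edge from event i to event j in the Relations graph is taken as count_i / count_j, the diagonal is left at zero, the Appearances graph is computed as the percentage of training sessions in which both events occur, and all function names are hypothetical.

```python
import numpy as np


def relations_graph(counts, events):
    """Edge (i, j) holds count_i / count_j for events present in the session."""
    n = len(events)
    A = np.zeros((n, n))
    for i, a in enumerate(events):
        for j, b in enumerate(events):
            if i != j and a in counts and b in counts:
                A[i, j] = counts[a] / counts[b]
    return A


def count_graph(counts, events, full_row=False):
    """Counts normalized by their sum, on the diagonal or across full rows."""
    values = np.array([counts.get(e, 0) for e in events], dtype=float)
    total = values.sum()             # equals the sum of the diagonal
    if total > 0:
        values = values / total
    if full_row:
        return np.repeat(values[:, None], len(events), axis=1)
    return np.diag(values)


def appearances_graph(train_sessions, events):
    """Percentage of training sessions in which both events appear."""
    index = {e: i for i, e in enumerate(events)}
    P = np.zeros((len(events), len(events)))
    for counts in train_sessions:
        present = [index[e] for e in counts if e in index]
        for i in present:
            for j in present:
                if i != j:
                    P[i, j] += 1
    return 100.0 * P / max(len(train_sessions), 1)


# Toy data (hypothetical): three common events and one session to encode.
events = ["A", "B", "C"]
session = {"A": 2, "B": 4}
train_sessions = [{"A": 1, "B": 1}, {"A": 3, "C": 1}, {"A": 2, "B": 5}]

relations = relations_graph(session, events)
count_diagonal = count_graph(session, events)
count_full_row = count_graph(session, events, full_row=True)
appearances = appearances_graph(train_sessions, events)
```

All three matrices share the same event ordering and shape, which is what makes it possible to feed them to the same convolutional architecture.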
6.3 Multiple-Graphs Autoencoder
In the following section we present the different graph autoencoder models, which are based on the graph representations presented in the previous section. For consistent results, the architecture is the same for all models and data representations. It consists of 4 layers, as follows:
Convolutional layer 2D – with 64 filters and 3x3 kernel size
Convolutional layer 2D – with 32 filters and 3x3 kernel size
Convolutional transpose layer 2D – with 64 filters and 3x3 kernel size
Output – convolutional transpose layer 2D – with 1 or 3 filters (depending on the model) and 3x3 kernel size
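To make the size of this architecture concrete, the sketch below writes the four layers as a plain specification and counts their trainable parameters. It uses only the standard library and assumes one bias per filter, that the number of input channels equals the number of output filters (1 for a single graph, 3 for the three-layer input), and nothing about strides, padding or activations, which are not stated here. The names `LAYERS`, `autoencoder_layers` and `parameter_counts` are hypothetical.

```python
# The first three layers are shared by all models; the output layer depends
# on how many graph layers (channels) are being reconstructed.
LAYERS = [
    ("conv2d", 64),
    ("conv2d", 32),
    ("conv2d_transpose", 64),
]


def autoencoder_layers(channels):
    """Full layer list for an input with `channels` stacked adjacency matrices."""
    return LAYERS + [("conv2d_transpose", channels)]


def parameter_counts(channels, kernel=3):
    """Trainable parameters per layer: kernel weights plus one bias per filter."""
    counts = []
    in_channels = channels
    for name, filters in autoencoder_layers(channels):
        weights = kernel * kernel * in_channels * filters
        counts.append((name, filters, weights + filters))
        in_channels = filters
    return counts


single_graph_total = sum(n for _, _, n in parameter_counts(1))
three_layer_total = sum(n for _, _, n in parameter_counts(3))
```

Under these assumptions the parameter count is independent of the adjacency matrix size, since convolutional filters are shared across all positions of the matrix.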
In Algorithm 6 we present the creation of both the training and testing set for the single-graph autoencoder.
The training process for the single-graph autoencoder consists of reconstructing the adjacency matrix extracted from each session, using the architecture, hyper-parameters and settings mentioned above. The full process, from the original session to the newly predicted session, is illustrated in Figure 6.2.
The dataset creation for the multi-graph autoencoder model is similar to that of the single-graph autoencoder, except that we create a three-layer matrix that represents all three graph representations of our data set (Relations graph, Appearances graph and Count graph).
The training process for the multiple-graph autoencoder consists of reconstructing the adjacency matrices of all three graphs that we have presented: the Relations graph, the Appearances graph and the Count graph. The three adjacency matrices are then saved as one three-layered matrix, which represents one session. Note that we created two types of multi-graph representations: one in which the count graph has count diagonal edges, and the other with a full count row.
To separate the two representations, we will define the following:
The representation that includes the count-diagonal count graph will be denoted as "Three layers - Count diagonal".
The representation that includes the full-count-row count graph will be denoted as "Three layers - Full count row".
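Combining the three adjacency matrices of a session into one three-layered matrix can be sketched with NumPy as below. The channel-last layout, the function name `stack_session` and the placeholder matrices are assumptions made for illustration; the thesis does not state which axis holds the layers.

```python
import numpy as np


def stack_session(relations, appearances, count):
    """Stack three N x N adjacency matrices into one (N, N, 3) array."""
    layers = [np.asarray(m, dtype=float) for m in (relations, appearances, count)]
    if len({m.shape for m in layers}) != 1:
        raise ValueError("all three adjacency matrices must have the same shape")
    return np.stack(layers, axis=-1)  # layers become the last (channel) axis


# Placeholder matrices for a session over 4 common events.
n = 4
three_layers = stack_session(np.ones((n, n)), np.zeros((n, n)), np.eye(n))
```

Passing the diagonal variant or the full-row variant of the count graph as the third argument yields the two representations named above.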
Using the same architecture, hyper-parameters and settings described for the single-graph autoencoder, we train the model on both representations. The full process for the multi-graph autoencoders, from the original session to the newly predicted session, is illustrated in 6.2.
7.1 Events Analysis
The main goal of this section is to analyze events, or as we may refer to them in this section, Constructs or Construct_ids. The term construct is used in our data set and describes a type of SQL query made by a specific user to a specific database. In Section 5 we also defined: session, stable and strongly correlated constructs (events), strong correlations and stable correlated constructs (events).
We also presented the functional dependencies between the fields in our dataset:
Construct ID (event) → verb object
Session ID → instance ID, made by the same user using the same connection, IP address, database and application.
Session ID, Construct ID (event) → count
7.1.1 Results of Constructs (events) Analysis
We started by reviewing the number of unique constructs (events) that we can observe across different hourly reports. In Figure 7.1 we can see that the number of unique construct IDs grows linearly with time. At first glance, this means new SQL commands emerge all the time, leaving scarce hope for finding patterns. However, our subsequent analysis, described below, shows that the number of stable construct IDs remains relatively stable.
In Figure 7.2, we plot the number of unique stable construct IDs (events) that were recorded on each day of the week (X-axis).
The upper lines in the graphs are for constructs recorded throughout the whole day, and the lower lines are for one hour (14:00) of every day. The Y-axis shows how many unique constructs there are (stable/total during the day). This shows that the number of stable construct IDs indeed remains more consistent. An exception is Saturday and Sunday, which show a decrease in the number of stable construct IDs. We conjecture that this is because these are vacation days, and thus exhibit fewer events.
Figure 7.3 shows the percentage of stable constructs out of all the unique constructs on the tested day. The X-axis shows which day we tested (Tuesday to Sunday). The Y-axis shows the percentage of stable constructs out of the total constructs each day. We see that approximately 50% to 60% of the unique constructs every day are stable. Thus, most construct IDs are actually stable construct IDs.
7.1.2 Results of Correlations Analysis
The X-axis shows the number of testing days in the following week. The Y-axis shows how many strong correlations we found in total. The results clearly show that many pairs of construct IDs remain correlated in their counts during testing. This suggests that these correlations are stable patterns which may later be used for anomaly detection (Research question 1).
The X-axis shows the number of testing days in the following week. The Y-axis shows the percentage of correlations that remained correct after testing.
The results indeed show that over 70% of the correlations identified during training were stable. This is a very encouraging result, as it shows that learning correlations from past data can be useful for identifying normal or abnormal behavior in the future (Research questions 1 and 2).
Frequency of Observing a Stable Correlation
We first report that in the dataset used for the figures so far, with 5 days of training and 5 days of testing, there were a total of 1275 stable constructs, and 964 of these constructs were involved in at least one correlation with $R^2$ equal to or larger than 0.8.
Figure 7.6 shows the percentage of sessions that include at least one stable construct with a strong correlation. The X-axis shows the number of testing days in the following week, and the Y-axis shows the percentage of sessions with stable constructs (the upper gray line) and the percentage of sessions that include stable constructs that have at least one strong correlation. As can be seen, this is indeed the majority of the sessions.
We can thus conclude that, even though the number of unique constructs is relatively low (every day we record 6,000-10,000 unique constructs), correlated constructs are very common. This suggests that anomaly detection methods based on such correlations may be very effective (Research questions 1 and 2).
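To illustrate the $R^2 \geq 0.8$ criterion, the following minimal sketch fits a least-squares line to the counts of two constructs and computes the coefficient of determination as one minus the residual sum of squares over the total sum of squares, as in Section 2.1. The function name `r_squared`, the constant `STRONG_R2` and the sample counts are illustrative assumptions and are not taken from our dataset.

```python
from statistics import mean

STRONG_R2 = 0.8  # threshold for a "strong" correlation, as used in the text


def r_squared(x, y):
    """R^2 of the least-squares line y ~ b0 + b1 * x.

    Assumes x is not constant (otherwise the slope is undefined).
    """
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx             # slope
    b0 = y_bar - b1 * x_bar    # intercept
    u = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))  # residual SS
    v = sum((yi - y_bar) ** 2 for yi in y)                       # total SS
    return 1.0 - u / v


# Hypothetical counts of two constructs over the same five sessions.
counts_a = [1, 2, 3, 4, 5]
counts_b = [2, 4, 6, 8, 11]
is_strong = r_squared(counts_a, counts_b) >= STRONG_R2
```

With these made-up counts the pair would be classified as strongly correlated, since the fitted line explains almost all of the variance in `counts_b`.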
7.1.3 Results of Advanced Relationships Analysis
Figure 7.7 below shows the structure of the graph mentioned in the method description for advanced relationship analysis. The graph is created by defining nodes as strongly correlated constructs (events) and adding an edge between every two constructs that have a strong correlation.
As can be seen, there are multiple non-trivial connected components in the graph. The figure below shows the number of connected components (Y-axis) for every size of connected component. Note that the majority of the constructs are, in fact, part of a connected component. Moreover, some connected components are quite large, suggesting that more sophisticated relationships exist beyond pairwise correlations.
For example, one can consider checking correlation between triplets of construct counts, or quadruplets or more. Observe that for a correlative triplet to exist, it must be a 3-clique in the correlation graph. Below, we plotted the number of maximal cliques in this graph as a function of their size (X-axis).
Clearly, these results show that complex patterns exist, well beyond simple correlative pairs. For example, this data shows that there is even a 23-clique in the correlation graph. This means that there are 23 constructs that are strongly correlated with each other. Future research will investigate such higher-degree relationships between constructs’ counts.
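The graph analysis above can be sketched as follows. This is a minimal standard-library illustration, not the code used in our experiments: it builds an undirected graph from pairs of strongly correlated constructs, finds the connected components, and enumerates maximal cliques with the basic Bron-Kerbosch recursion. The function names and the example pairs are hypothetical.

```python
from collections import defaultdict


def build_graph(strong_pairs):
    """Adjacency sets of an undirected graph from (construct, construct) pairs."""
    adj = defaultdict(set)
    for a, b in strong_pairs:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def connected_components(adj):
    """Return the connected components as a list of node sets (iterative DFS)."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(adj[node] - comp)
        seen |= comp
        components.append(comp)
    return components


def maximal_cliques(adj):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates, x: already processed nodes
        if not p and not x:
            cliques.append(r)
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


# Hypothetical strong correlations between construct IDs.
pairs = [(1, 2), (2, 3), (1, 3), (3, 4), (5, 6)]
graph = build_graph(pairs)
components = connected_components(graph)
cliques = maximal_cliques(graph)
```

In this toy graph, constructs 1, 2 and 3 form a 3-clique, which is the necessary condition for a correlative triplet. For graphs of the size discussed here, a pivoting variant of the recursion or a graph library would likely be preferable.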
Based on the last section and its evaluation, we deduced that a higher degree of relationships is required. Based on previous sections we also deduced that the stable constructs (events), i.e. the most common constructs, make up the majority of the data set. Therefore, we came up with a number of ways to represent our dataset in graph form. The next section presents our evaluation of the different graph autoencoders.
7.2 Experimental Setting for Graph-Autoencoders
In this section we present the methods we used to evaluate and compare the different models we have presented. First, we present the model configuration used across the different autoencoders. Then, we present how we compared three different models that reconstruct three different graph representations. Finally, we discuss the evaluations we performed to test which model outperforms the others.
7.2.1 Models Configuration
For all the autoencoders we used the Adam optimizer (a standard optimizer) with a learning rate of 0.001 (which showed better results than larger learning rates in previous research). The activation function for each layer is ReLU. The reconstruction error for our models is the mean squared error (MSE). MSE was chosen over other errors because it is the default error for both vector and image reconstruction in simple, non-binary autoencoders. MSE measures the distance between the predicted matrix and the actual matrix we intend to reconstruct. The training stage for all models consists of 100 epochs, with validation on a data set consisting of reports from the next day or the first day of the next week. The policy for model saving was the best training MSE. The training phase was similar for all types of models.
7.2.2 Graph Autoencoders Reconstruction Comparison
In Section 6.2 we presented three different types of graph (multiple-graph) representations of our data. While each representation is different, one graph was created for every type: the relations graph. The relations graph is the best representative for understanding whether we have an anomaly or for predicting links, since it considers both the count and the relationship between constructs in each session and, unlike the appearances matrix, it is dynamic across the different sessions. It also has relational values rather than distinct values, which allows us to exploit the correlations between events that we presented in the previous section.
7.2.3 Evaluation and Loss Functions for Graph Autoencoders
The mean squared error (MSE) was chosen as the main loss function for reconstruction of the single and multiple graph representations. It was chosen because multiple works in the literature present convolutional autoencoders with the same error, such as [masci_ae_2011] [masci_conv_ae_2012] [makhzani2014winner] [tan2014stacked] [chen2017deep] and more. Mean squared error is defined as follows:
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2$$
where $n$ is the number of samples in the set, $\hat{Y}_i$ is reconstructed sample $i$ and $Y_i$ is true sample $i$.
In addition to the mean squared error, we present some of the results with the root mean squared error (RMSE). We did not use it for training or testing; however, some results show large deviations, and the analysis was clearer on the scale of RMSE than on that of MSE. RMSE is defined as follows:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2}$$
where $n$ is the number of samples in the set, $\hat{Y}_i$ is reconstructed sample $i$ and $Y_i$ is true sample $i$.
The last evaluation method we used was measuring the precision and recall of reconstructing the relations graph. This evaluation method was used to understand which model comes closer to predicting the actual values of the edges in our graphs.
Precision and recall are defined as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$
where TP = true positives, FP = false positives, and FN = false negatives.
These evaluation methods are meant for binary classification. Therefore, we did not base the conclusions of this research on their results, but the results allowed us to compare the models to each other. To be able to compute the precision and recall, we defined the following thresholds:
Zero threshold – If the value of cell (i,j) of the predicted matrix is under the zero threshold, we will consider the result of this cell as 0 (or negative).
Error threshold – If the value of cell (i,j) of the predicted matrix minus the true value of cell (i,j) in the true matrix A, is under the error threshold, we will consider the result of the cell as 1 (or positive).
Using these thresholds we were able to give each predicted matrix a precision and a recall score. Then, we calculated the average precision and recall scores based on all reconstructions for all the models.
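One possible reading of the two thresholds is sketched below. The mapping from thresholds to TP, FP and FN is our assumption for illustration: a cell counts as a predicted edge when its value is at or above the zero threshold, a true edge is a non-zero cell of the true matrix, and a true edge is a true positive only if it is predicted and its value lies within the error threshold of the true value. The function names and matrices are hypothetical.

```python
def confusion_counts(pred, true, zero_thr=0.1, err_thr=0.1):
    """Count TP, FP and FN cells of a predicted matrix against the true matrix."""
    tp = fp = fn = 0
    for pred_row, true_row in zip(pred, true):
        for p, t in zip(pred_row, true_row):
            predicted_edge = p >= zero_thr   # below the zero threshold -> 0
            close = abs(p - t) < err_thr     # within the error threshold
            if t != 0:
                if predicted_edge and close:
                    tp += 1
                else:
                    fn += 1                  # assumed: missed or too far off
            elif predicted_edge:
                fp += 1
    return tp, fp, fn


def precision_recall(tp, fp, fn):
    """Precision and recall, returning 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up 2x2 matrices, for illustration only.
true_matrix = [[0.0, 2.0], [0.5, 0.0]]
pred_matrix = [[0.05, 1.95], [0.0, 0.3]]
tp, fp, fn = confusion_counts(pred_matrix, true_matrix)
precision, recall = precision_recall(tp, fp, fn)
```

In this toy example one edge is reconstructed correctly, one is missed and one is spurious, so both scores are 0.5. The example also shows how directly the scores depend on the two threshold values.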
7.3 Graph Autoencoders Analysis and Comparison
In this section we present the results and analysis of the single-graph and multi-graph autoencoders. The first part presents a summary of precision and recall scores for the different sets of models under different settings and thresholds. Then we present the effect of training on sessions with a larger number of events versus training on all unfiltered sessions, along with a comparison between the three types of models using this method. The next part shows the analysis of the sessions we predicted and the challenges in using the model. In the last part we present a comparison between normalized and non-normalized values of the relations graph, and an analysis of the results.
All the results we present used the same test set, which was a report of the aggregated analysis recorded in the week after the training. The test set consists of 10,000 normal samples, of which 5,000 are sessions with 10 or more events; these were considered more complex due to the higher number of non-zero values in the adjacency matrix. The test set contains sessions with no known anomalies, in order to analyze how well the different models perform in reconstructing normal sessions.
7.3.1 Models Comparison
In this section we present a comparison between the performance of the different models we have presented.
We begin by presenting the precision and recall scores for the different models. We used different settings of the error threshold to try to understand which model provides the best results. The results were tested by comparing the reconstructions of the relations graph. In Figure 7.10 we can see the precision and recall scores. While it seems that the method with the single-graph (single layer) representation outperforms both multi-graph representations, in the next analysis we explain why it is problematic and why we did not continue with it in further experiments.
Next, we wanted to test the models' capability to fully reconstruct the relations graph. We split the problem based on the number of events (constructs) in each session, to get a better understanding of how well the models do when the graphs are more complex (i.e. the adjacency matrix contains more non-zero values). For this part of the analysis we defined the following terms:
Good FP - a graph that was reconstructed with no false-positive cells, i.e. the model successfully did not reconstruct edges that did not exist in the true relations graph.
Good FN - a graph that was reconstructed with no false-negative cells, i.e. the model successfully reconstructed all the edges that existed in the true relations graph.
|Model||Good FP||Good FN|
|Multi-graphs - diagonal count||57%||42%|
|Multi-graphs - full count||63%||16%|
We checked the percentage of good FPs and FNs among the reconstructed samples; the results are presented in Table 7.1. All reconstructions were reviewed with the same setting, with both the error threshold and the zero threshold set to 0.1. When reviewing the results we can see that even though the recall score is relatively high (over 0.9) for the single-graph model, we can fully reconstruct only 52% of the samples with no missing edges. We changed the threshold settings and got similar or worse results.
While continuing to analyze the good FN and FP reconstructions we discovered, as can be seen in Figure 7.11, that the number of good FN sessions drops when the number of events in a session increases above 8. In addition, there are no good FN sessions with more than 13 events, and such sessions make up over 40% of the samples in the test set.
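The notions of good FP and good FN can be sketched per graph as follows. This is an illustrative reading under the same assumptions about the two thresholds as before (a predicted edge is a cell at or above the zero threshold, and a true edge counts as reconstructed only when its value is within the error threshold). The function names and sample matrices are hypothetical.

```python
def is_good_fp(pred, true, zero_thr=0.1):
    """True if no edge is predicted where the true relations graph has none."""
    return all(
        p < zero_thr
        for pred_row, true_row in zip(pred, true)
        for p, t in zip(pred_row, true_row)
        if t == 0
    )


def is_good_fn(pred, true, zero_thr=0.1, err_thr=0.1):
    """True if every true edge is predicted within the error threshold."""
    return all(
        p >= zero_thr and abs(p - t) < err_thr
        for pred_row, true_row in zip(pred, true)
        for p, t in zip(pred_row, true_row)
        if t != 0
    )


def good_rates(samples, zero_thr=0.1, err_thr=0.1):
    """Percentage of good-FP and good-FN graphs among (pred, true) pairs."""
    n = len(samples)
    good_fp = sum(is_good_fp(p, t, zero_thr) for p, t in samples)
    good_fn = sum(is_good_fn(p, t, zero_thr, err_thr) for p, t in samples)
    return 100.0 * good_fp / n, 100.0 * good_fn / n


# Two made-up reconstructions of the same true matrix.
true_matrix = [[0.0, 2.0], [0.5, 0.0]]
samples = [
    ([[0.0, 2.05], [0.45, 0.0]], true_matrix),  # fully reconstructed
    ([[0.3, 2.0], [0.0, 0.0]], true_matrix),    # one spurious, one missed edge
]
fp_rate, fn_rate = good_rates(samples)
```

A single wrong cell is enough to disqualify a whole graph, which is why these percentages can be low even when the cell-level recall is high.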
Based on the analysis and results presented in this section, we have reached the following conclusions:
The scores are very easily biased by the different threshold settings.
It is nearly impossible to fully reconstruct the relations graph using these methods when the number of constructs (events) in a session is large, even when the MSE is relatively low.
It would be very difficult to create an anomaly detection algorithm that allows that many false positives or false negatives.
Based on these conclusions, we decided not to continue comparing and analyzing the different methods with the precision and recall scores.
7.3.2 Analysis of Training Samples with Larger Number of Events
For further analysis and comparison between our models, we chose to evaluate the MSE loss in reconstructing the relations graph. In addition, we wanted to analyze the standard deviation, due to the low number of graphs we were able to fully reconstruct in the previous analysis. In Table 7.2 we can see the results of a simple comparison between the three different models trying to reconstruct the same graph. In contrast to the previous results, we can observe that overall the novel multiple-graph autoencoder (MGAE) approach outperforms the single-graph autoencoder.
|Model Type||average MSE||standard deviation|
|Multi-graphs - Count diagonal||0.479||7.339|
|Multi-graphs - Full count||3.118||61.872|
Since the results were different from the ones we presented in Table 7.1 and Figure 7.10, we wanted to further review the effect of the number of events per session on the performance of each model. To do so, we removed all duplicated sessions from our test set, and reviewed the average MSE for each session size (session size refers to the number of events in a session).
We plot the results in Figure 7.12 using the average RMSE loss, to make the analysis more visual and the differences easier to distinguish. We notice that the performance of the no-layers (single-graph) model decreases as the session size increases, while the other models' performance remains more stable. On the other hand, it is less sensitive in some of the more problematic sessions, i.e. sessions that are harder to reconstruct and produce a relatively high MSE loss.
In an attempt to create a model which performs better on higher degrees of relationships, we trained additional models with similar data representations but with a minimum session size of 10. In addition, we added more samples to the training set for those models, so that they have a similar number of samples to the models without a minimum session size. This results in 6 models that we compared in this phase:
Two single-graph (or layer) models - with and without minimum session size
Two multi-graphs - Count diagonal models - with and without minimum session size
Two multi-graphs - Full count models - with and without minimum session size
Since three of our models were trained only on sessions with 10 events or more, we also filtered the test set to contain only sessions with 10 events or more in order to evaluate all models. In addition, we filtered out duplicate sessions (about 70% of the sessions had two or more similar sessions), allowing each unique sample to have a similar weight in the results.
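The per-session-size aggregation can be sketched as follows. This is a minimal illustration that assumes each test sample is available as a tuple of a hashable session key (used to detect duplicates), the session size and the MSE of its reconstruction. The tuple layout, the function name and the values are hypothetical.

```python
from collections import defaultdict


def mse_by_session_size(samples):
    """Average MSE per session size, counting each unique session once.

    samples: iterable of (session_key, session_size, mse) tuples.
    """
    seen = set()
    grouped = defaultdict(list)
    for key, size, mse in samples:
        if key in seen:          # skip duplicated sessions
            continue
        seen.add(key)
        grouped[size].append(mse)
    return {size: sum(v) / len(v) for size, v in sorted(grouped.items())}


# Made-up samples: the first session appears twice and is counted once.
samples = [
    (("a",), 1, 0.2),
    (("a",), 1, 0.2),
    (("b", "c"), 2, 0.4),
    (("d", "e"), 2, 0.8),
]
averages = mse_by_session_size(samples)
```

Removing duplicates first keeps frequently repeated sessions from dominating the average for their session size.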
|Model Type||MSE||std||min10 MSE||min10 std|
|M-graphs - Count diagonal||0.228||5.058||0.917||32.133|
|M-graphs - Full count||6.812||37.928||14.812||169.834|
In Table 7.3 we can see the average MSE and standard deviation for the 6 different models. In the left two columns we can see the MSE and std for the models without a minimum session size, and in the right two columns we see the results for the models trained with a minimum session size. We can see that even though the latter models were trained with more samples (sessions) that included more events, i.e. denser matrices, the models that were trained with no minimum session size performed better.
In Figure 7.13 we can see that on problematic samples (i.e. sessions with relatively high reconstruction error) the model that was trained without a minimum session size performs better; however, on non-problematic samples the model that was trained with a minimum session size of 10 performs better. From this section on, we refer to problematic samples as sessions that were reconstructed by the multi-graph - count diagonal model with a reconstruction error (MSE) above 1. Specifically, there were 98 such problematic samples for the model trained with a minimum session size of 10, and 38 for the model with no minimum session size, which is another indication that the model trained with no minimum session size performs better.
7.3.3 Analyzing Problematic Sessions
To further understand the difficulties in reconstructing the problematic sessions, we filtered the relevant problematic sessions out of our data set. We reviewed the sessions with emphasis on the following details:
The number of events in the created relations graph
The number of events in the original session
The count values for the different events in the session (with emphasis on the largest and smallest values)
From our review, we noticed that throughout all sessions the same two types of events appear, each time with very different ratios (from 1 to 0.016), and 24 of 25 times the maximum count was of the same type of event. In addition, in all problematic sessions the minimum count was 1 while the highest was between 390 and 9984. Furthermore, in all those sessions there were between 68 and over 500 different types of events which were not part of the relations graph. We reviewed those events and they did not appear in the training set, i.e. they were unique or uncommon in comparison to the events which did appear.
In Figure 7.8 we can see that the preliminary analysis found connected components larger than 300, which included strong correlations between the events, while in our graph representation we kept only 228 types of events.
7.3.4 Data Normalization Comparison
In the previous section we presented problematic sessions. From the analysis of those sessions we concluded that we had two main issues:
Less common or new events that were not represented in the graphs.
Large difference between the minimum and maximum count value.
At this stage of the research, we were not able to handle new events. In addition, the architecture of our model did not allow us to add less common events due to storage and memory limits. However, we wanted to deal with the large difference between count values. The value of edge (i,j) in the relations graph is the inverse of the value of edge (j,i), which may cause some edges to have very small values while their inverses are very large. Therefore, we suggested two normalization techniques:
The logarithmic normalization (or transformation) is formulated as follows:
$$s = c \cdot \log(1 + r)$$
where $s$ is the output image, $r$ is the input image, and $c$ is the scaling constant, which is given by:
$$c = \frac{1}{\log\left(1 + \max(r)\right)}$$
This results in a scaled matrix, with all cell values between 0 and 1. When the logarithmic transformation is applied to a digital image, the darker intensity values are given brighter values, making the details present in darker or gray areas of the image more visible to human eyes. Our intuition in using this transformation is that it makes the smallest values more visible and easier to learn.
The limit normalization consists of limiting the maximum count value for each session to 100, and then applying matrix normalization with the Euclidean norm. It can be formulated as follows:
$$\hat{a}_{ij} = \frac{a_{ij}}{\sqrt{\sum_{k=1}^{n}\sum_{l=1}^{n} a_{kl}^{2}}}$$
where $A = (a_{ij})$ is the adjacency matrix, $n$ is the matrix's dimension and $a_{ij}$ is cell $(i,j)$ in the adjacency matrix, which is limited to the value 100. The intuition behind this normalization technique is to limit the extreme count values (over 90% of the count values are 100 or less), so that the differences between non-zero values of the adjacency matrix are smaller than if we simply used matrix normalization.
The data set and testing phase for these two normalization methods are similar to those in previous sections, where we review the loss of the reconstructed relations graph.
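Both normalization techniques can be sketched as follows. This is a hedged illustration rather than our exact implementation. The logarithmic variant assumes the usual image-processing form, output = c * log(1 + input), with c chosen as 1 / log(1 + maximum) so that the largest cell maps to 1. The limit variant assumes the cap of 100 is applied cell by cell before dividing by the Euclidean (Frobenius) norm. The function names and the sample matrices are hypothetical.

```python
from math import log, sqrt


def log_normalize(matrix):
    """Scale non-negative cells to [0, 1] with c * log(1 + value)."""
    peak = max(max(row) for row in matrix)
    if peak == 0:
        return [list(row) for row in matrix]   # all-zero matrix stays as is
    c = 1.0 / log(1.0 + peak)                  # assumed scaling constant
    return [[c * log(1.0 + v) for v in row] for row in matrix]


def limit_normalize(matrix, limit=100.0):
    """Cap every cell at `limit`, then divide by the Euclidean norm."""
    clipped = [[min(v, limit) for v in row] for row in matrix]
    norm = sqrt(sum(v * v for row in clipped for v in row))
    if norm == 0:
        return clipped
    return [[v / norm for v in row] for row in clipped]


# Made-up adjacency matrices, for illustration only.
log_scaled = log_normalize([[0.0, 1.0], [3.0, 0.0]])
limit_scaled = limit_normalize([[3.0, 4.0], [0.0, 0.0]])
capped = limit_normalize([[250.0, 0.0], [0.0, 0.0]])
```

In the first example the cells 1 and 3 map to 0.5 and 1.0, which shows how the logarithm lifts small values relative to large ones. In the last example the value 250 is capped to 100 before normalization.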
|model_type||Normalization||norm - MSE||norm - std|
|M-Graph - diagonal count||Limit||0.0000014||0.0000017|
|M-Graph - full count||Limit||0.0000067||0.0000037|
|M-Graph - diagonal count||Log||0.00036||0.00047|
|M-Graph - full count||Log||0.00038||0.00053|
First, we review the comparison for each normalization method in Table 7.4. From this table we can see that our results from the previous sections still stand, and the multi-graph - count diagonal model still provides the best results for each technique. In addition, we wanted to see which of the methods handles a high degree of relationships better and has a lower relative standard deviation, which would indicate better reconstruction of the problematic sessions presented in the previous section.
In Figure 7.14 and Figure 7.15, we can see that for a lower degree of relationships the single-graph model may provide better results, although with a very small difference. When it comes to standard deviation and a higher degree of relationships, the model that performs best with both methods is the multi-graph - count diagonal model.
While we have observed the benefits of using the MGAE, we wanted to deduce which method of normalizing the data would benefit us the most. To do so, we scaled the reconstruction errors of the MGAE model over all test samples to have a mean of 1, and calculated the standard deviation.
|Model_type||limit norm std||log norm std||no norm - std|
|M-Graph - diag count||1.16||1.31||22.11|
|M-Graph - full count||0.55||1.39||12.16|
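The scaled standard deviation reported above can be sketched as follows: dividing every reconstruction error by the mean error gives a series with mean 1, and the standard deviation of that series is the relative standard deviation. The function name, the choice of the population standard deviation and the sample errors are assumptions for illustration.

```python
from statistics import mean, pstdev


def relative_std(errors):
    """Standard deviation of the errors after scaling them to mean 1."""
    mu = mean(errors)
    scaled = [e / mu for e in errors]   # the scaled errors have mean 1
    return pstdev(scaled)               # assumed: population std


# Made-up reconstruction errors, for illustration only.
example = relative_std([1.0, 3.0])

# Equivalent shortcut: std divided by mean, here using the log-normalized
# diagonal count values (MSE 0.00036, std 0.00047).
log_diag_ratio = round(0.00047 / 0.00036, 2)
```

This is the same as dividing the standard deviation by the mean. For instance, the log-normalized diagonal count values in Table 7.4 give 0.00047 / 0.00036 ≈ 1.31, in line with the value reported for that model.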
In recent years, there have been attempts to extend autoencoders (AE) to graph domains. Graph autoencoders aim at representing nodes as low-dimensional vectors in an unsupervised training manner. Graph autoencoders can be used for node clustering and link prediction and have shown promising performance on such tasks [salha2019simple].
Our main objective in the first part of the research was to investigate the existence of correlations in our dataset, and to assess the potential to learn these correlations and use them to detect anomalies. In addition, we presented simple anomaly detection models that exploit such correlations.
Later, we demonstrated how we can create different graph representations based on the simple patterns we detected in this part. We presented three different graphs that represent our data in graph form, where each graph presents a different aspect of the relationships between events.
In the next part of our research, we presented different ways to represent the data to benefit most from the graphs and increase the ability to reconstruct the adjacency matrix. Based on the different tests and analyses we performed during this research, we may conclude that our MGAE model can reconstruct graphs better than single-graph autoencoder models for our data set.
For future work, we suggest the following improvements and extensions:
Create an anomaly detection model based on the graph autoencoder - In order to create such a model, we need to create an experimental framework that uses the reconstruction error as a measurement. We also need to add baseline anomaly detection algorithms in addition to the one we previously created based on simple correlations.
Add a feature matrix to the graph representation - In some GCN methods a feature matrix is added as part of the network's architecture; this would allow us to add other features of the different events or sessions which we have not yet included.
Use existing graph embedding models for the different representations - By using existing embedding models we will allow our graphs to contain more nodes and edges, and thereby increase the scope of our model.
Extend the model to other datasets - We researched an aggregated dataset which was unique and challenging on its own. However, we may find a method to create multiple graphs for different (and preferably public) datasets, and thereby provide further evidence of the benefits of our model.
Create a more complex autoencoder architecture - The autoencoder architecture we used was meant to demonstrate the abilities and benefits of our model. A better architecture that fits the link prediction or anomaly detection task may allow the model to learn higher degrees of relationships better and decrease the reconstruction error.
Appendix A Dataset Information
The dataset made available to this research includes information about SQL commands issued by users to a database protected by a Guardium instance. A major challenge in this research is that the available dataset does not contain information about specific invocations of SQL commands. Rather, it contains information about how many times each SQL command has been issued in a specific time range, by a specific database user, in a specific session. In detail, each record has the following fields:
Instance_id. The ID of the Guardium instance.
Period_start. The start time of the specified time range.
Session_id. A unique identifier created for a specific user and a specific connection.
Construct_id. A unique identifier that describes the issued SQL command in a session.
Verb_object. The verb object provides additional information about the issued SQL command, in the form of a list of pairs (object, action), where the object represents a database entity such as a table, and an action represents the action done to the object, e.g., select or insert. The verb object is a list of (object,action) pairs since an SQL command may consist of performing multiple actions, e.g., if the SQL command calls a stored procedure.
Count. The number of times an event with this verb object has been executed in the same session in the specified time range.
Failed. The number of times the query has failed.
Db_user. The database user used to issue the SQL command.
Os_user. The relevant OS user.
Source_program. The application the user used to access the DB.
Server_IP. The server IP.
Client_IP. The client IP.
Service_name. The name of the DB.
Host_name. The client's computer name.
After discussions with a domain expert, we established the following functional dependencies between some of the fields in our dataset:
Construct_ID → verb_object
Session_ID → instance_ID, period_start, db_user, os_user, source_program, server_IP, client_IP, service_name, host_name
Session_ID, Construct_ID → count, failed
The main fields we considered in the thesis are session_ID, construct_ID, period_start, and count.
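A functional dependency such as Session_ID, Construct_ID → count can be checked mechanically on a set of records. The sketch below is a minimal illustration: the helper name `holds`, the lower-case field keys and the sample records are hypothetical and do not come from the actual dataset.

```python
def holds(records, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds in records.

    records: list of dicts; lhs and rhs: lists of field names.
    """
    seen = {}
    for rec in records:
        key = tuple(rec[f] for f in lhs)
        val = tuple(rec[f] for f in rhs)
        # The first value seen for a key must match every later value.
        if seen.setdefault(key, val) != val:
            return False
    return True


# Made-up records, for illustration only.
records = [
    {"session_id": 1, "construct_id": 10, "verb_object": "(t1, select)", "count": 4},
    {"session_id": 1, "construct_id": 11, "verb_object": "(t2, insert)", "count": 1},
    {"session_id": 2, "construct_id": 10, "verb_object": "(t1, select)", "count": 7},
]
construct_determines_verb = holds(records, ["construct_id"], ["verb_object"])
pair_determines_count = holds(records, ["session_id", "construct_id"], ["count"])
construct_determines_count = holds(records, ["construct_id"], ["count"])
```

In these records the construct alone determines the verb object, but the count is determined only by the session and the construct together.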