1. Introduction
Heterogeneous categorical event data are ubiquitous. Consider system surveillance data in enterprise networks, where each data point is a system event that involves heterogeneous types of entities: time, user, source process, destination process, and so on. Mining such event data is challenging because of two characteristics of the data: (1) The exponentially large event space. For example, in a typical enterprise network, hundreds (or thousands) of hosts incessantly generate operational data. A single host normally generates more than events per second. (2) The variety and dynamics of the data. The variety of system entity types may necessitate high-dimensional features in subsequent processing, and the event data may change dramatically over time, especially in heterogeneous categorical event streams (Aggarwal et al., 2003; Aggarwal, 2007; Manzoor et al., 2016).
To address the above challenges, dependency graphs have attracted growing interest in recent studies (He and Zhang, 2011; King et al., 2005; Xu et al., 2016). Such dependency graphs can model a variety of systems, including enterprise networks (Xu et al., 2016), societies (Myers et al., 2014), ecosystems (Kawale et al., 2013), etc. For instance, we can represent an enterprise network as a dependency graph, with nodes representing system entities such as processes, files, or network sockets, and edges representing the system events between entities (e.g., a process reads a file). This enterprise system dependency graph can be applied to many forensic analysis tasks such as intrusion detection, risk analysis, and root cause diagnosis (Xu et al., 2016). A social network can also be modeled as a dependency graph representing the social interactions between different users. This social dependency graph can then be used for user behavior analysis or abnormal user detection (Kim and Leskovec, 2011).
However, due to the aforementioned data characteristics, learning a mature dependency graph from heterogeneous categorical event streams often requires a long period of time. For instance, the dependency graph of an enterprise network needs to be trained for several weeks before it can be applied to intrusion detection or risk analysis, as illustrated in Fig. 1. Furthermore, every time the system is deployed in a new environment, the entire dependency graph must be rebuilt. This process consumes both time and resources.
Inspired by cloud services (Krutz and Vines, 2010), one way to avoid the time-consuming rebuilding process is to reuse a unified dependency graph model across different domains/environments. However, due to domain variety, directly applying the dependency graph learned from an old domain to a new domain often cannot achieve good performance. For example, the enterprise network of an IT company (active environment) is very different from the enterprise network of an electric company (stable environment). Thus, the enterprise dependency graph of the IT company contains many unique system entities that cannot be found in the dependency graph of the electric company. Nevertheless, there is still a lot of room for transfer learning.
Transfer learning has shed light on how to tackle domain differences (Pan and Yang, 2010). It has been successfully applied in various data mining and machine learning tasks, such as clustering and classification (Chen et al., 2016). However, most transfer learning algorithms focus on numerical data (Dai et al., 2007; Sun et al., 2015; Chattopadhyay et al., 2013). When it comes to graph-structured data, there is less existing work (Fang et al., 2015; He et al., 2009), not to mention dependency graphs. This motivates us to propose a novel knowledge-transfer-based method for dependency graph learning.
In this paper, we propose ACRET, a knowledge-transfer-based method for accelerating dependency graph learning from heterogeneous categorical event streams. ACRET consists of two submodels: EEM (Entity Estimation Model) and DCM (Dependency Construction Model). First, EEM filters out irrelevant entities from the source domain based on entity embedding and manifold learning. Only the entities with statistically high correlations are transferred to the target domain. Then, based on the reduced entity set, DCM constructs unbiased dependency relationships between different entities for the target dependency graph by solving a two-constraint optimization problem. We conduct an extensive set of experiments on both synthetic and real-world data to evaluate the performance of ACRET. The results demonstrate the effectiveness and efficiency of the proposed algorithm. We also apply ACRET to a real enterprise security system for intrusion detection. Our method achieves superior detection performance with a lead time of at least 20 days and more than 70% accuracy.
2. Preliminaries and Problem Statement
In this section, we introduce some notations and define the problem.
Heterogeneous Categorical Event. A heterogeneous categorical event is a record that contains different categorical attributes, and the th attribute value denotes an entity from the type .
For example, in the enterprise system (as illustrated in Fig. 2), a process event (e.g., a program opens a file or connects to a server) can be regarded as a heterogeneous categorical event. It contains information such as timing, type of operation, information flow directions, user, and source/destination process.
By continuously monitoring/auditing the heterogeneous categorical event data (streams) generated by the physical system, one can generate the corresponding dependency graph of the system, as in (He and Zhang, 2011; King et al., 2005; Xu et al., 2016). This dependency graph is a heterogeneous graph representing the dependencies/interactions between different pairs of entities. Formally, we define the dependency graph as follows:
Dependency Graph. A dependency graph is a heterogeneous undirected weighted graph , where is the set of heterogeneous system entities, and is the total number of entities in the dependency graph; is the set of dependency relationships/edges between different entities. For ease of discussion, we use the terms edge and dependency interchangeably in this paper. An undirected edge between a pair of entities and exists if and only if they have a dependency relation. The weight of the edge denotes the intensity of the dependency relation.
In an enterprise system, a dependency graph can be a weighted graph between different system entities, such as processes, files, users, and Internet sockets. The edges in the dependency graph are the causality relations between different entities.
As shown in Fig. 2, the enterprise security system utilizes the accumulated historical heterogeneous system data from event streams to construct the system dependency graph and updates the graph periodically. The learned dependency graph is applied to forensic analysis applications such as intrusion detection, risk analysis, and incident backtracking.
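To make the definition above concrete, the following is a minimal sketch of how an undirected weighted graph could be accumulated from a stream of categorical events. It is illustrative only: the function name `build_dependency_graph`, the entity identifiers, and the use of co-occurrence counts as edge weights are assumptions, since the text does not specify how the dependency intensity is measured.

```python
from collections import Counter
from itertools import combinations


def build_dependency_graph(events):
    """Accumulate an undirected weighted graph from categorical events.

    Each event is a tuple of entity identifiers. The edge weight is the
    co-occurrence count of two entities, which is an illustrative
    assumption and not the intensity measure used by the paper.
    """
    weights = Counter()
    for event in events:
        # Sort so that (a, b) and (b, a) map to the same undirected edge.
        for u, v in combinations(sorted(set(event)), 2):
            weights[(u, v)] += 1
    entities = {entity for edge in weights for entity in edge}
    return entities, dict(weights)


# Hypothetical events: a process reading a file, a process using a socket.
events = [
    ("proc:bash", "file:/etc/passwd"),
    ("proc:bash", "file:/etc/passwd"),
    ("proc:sshd", "sock:10.0.0.5:22"),
]
entities, edges = build_dependency_graph(events)
```

Storing each edge under a sorted key keeps the graph undirected without a second lookup table, and the counter can be updated incrementally as new events arrive.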
The problems of cold start and time-consuming training reflect a great demand for an automated tool for effectively transferring dependency graphs between different domains. Motivated by this, this paper focuses on accelerating dependency graph learning via knowledge transfer. Based on the definitions above, we formally define our problem as follows:
Knowledge Transfer for Dependency Graph Learning. Given two domains: a source domain and a target domain . In the source domain , we have a well-trained dependency graph generated from the heterogeneous categorical event streams. In the target domain , we have a small incomplete dependency graph trained over a short period of time. The task of knowledge transfer for dependency graph learning is to use to help construct a mature dependency graph in the domain .
There are two major assumptions for this problem: (1) The event streams in the source domain and target domain are generated by the same physical system; (2) The entity size of the source dependency graph should be larger than the size of the intersection graph , because transferring knowledge from a less informative dependency graph to a more informative one is unreasonable.
3. The ACRET Model
To learn a mature dependency graph , intuitively, we would like to leverage the entity and dependency information from the well-trained source dependency graph to help complete the original small dependency graph . One naive way is to directly transfer all the entities and dependencies from the source domain to the target domain. However, due to the domain difference, it is likely that many entities and their corresponding dependencies appear in the source domain but not in the target domain. Thus, one key challenge in our problem is how to identify the domain-specific/irrelevant entities in the source dependency graph. After removing the irrelevant entities, another challenge is how to construct the dependencies between the transferred entities by adapting to the domain difference and following the same dependency structure as in . To address these two key challenges in dependency graph learning, we propose a knowledge transfer algorithm with two submodels: EEM (Entity Estimation Model) and DCM (Dependency Construction Model), as illustrated in Fig. 3. We first introduce these two submodels separately in detail, and then combine them into a unified algorithm.
3.1. EEM: Entity Estimation Model
For the first submodel, Entity Estimation Model, our goal is to filter out the entities in the source dependency graph that are irrelevant to the target domain. To achieve this, we need to deal with two main challenges: (1) the lack of intrinsic correlation measures among categorical entities, and (2) heterogeneous relations among different entities in the dependency graph.
To overcome the lack of intrinsic correlation measures among categorical entities, we embed entities into a common latent space where their semantics can be preserved. More specifically, each entity, such as a user or a process in computer systems, is represented as a dimensional vector that is automatically learned from the data. In the embedding space, the correlation of entities can be naturally computed by distance/similarity measures such as Euclidean distance, vector dot product, and so on. Compared with other distance/similarity metrics defined on sets, such as Jaccard similarity, the embedding method is more flexible and has nice properties such as transitivity (Zhang et al., 2015).
To address the challenge of heterogeneous relations among different entities, we use the meta-path proposed in (Sun and Han, 2012) to model the heterogeneous relations. For example, in a computer system, a meta-path can be “Process-File-Process” or “File-Process-Internet Socket”. “Process-File-Process” denotes the relationship of two processes loading the same file, and “File-Process-Internet Socket” denotes the relationship of a file loaded by a process that opened an Internet socket.
The potential meta-paths induced from the heterogeneous network can be infinite, but not every one is relevant and useful for the specific task of interest. Existing work (Chen and Sun, 2017) addresses the automatic selection of meta-paths for specific tasks.
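As an illustration of extracting one meta-path from the dependency graph, the sketch below projects process-file "load" relations onto the “Process-File-Process” meta-path. The function name, the input format, and the choice of counting shared files as the edge weight are assumptions made for this example and are not taken from the paper.

```python
from collections import defaultdict
from itertools import combinations


def process_file_process(loads):
    """Project (process, file) load relations onto Process-File-Process.

    Returns a dict mapping an unordered process pair (stored as a sorted
    tuple) to the number of files that both processes load.
    """
    procs_by_file = defaultdict(set)
    for proc, file in loads:
        procs_by_file[file].add(proc)

    shared = defaultdict(int)
    for procs in procs_by_file.values():
        # Every pair of processes loading this file is linked by the meta-path.
        for p, q in combinations(sorted(procs), 2):
            shared[(p, q)] += 1
    return dict(shared)


# Hypothetical load relations.
loads = [
    ("bash", "libc.so"),
    ("python", "libc.so"),
    ("curl", "libc.so"),
    ("python", "libssl.so"),
    ("curl", "libssl.so"),
]
pfp = process_file_process(loads)
```

Each meta-path yields one such homogeneous graph over a single entity type, which is what allows a separate similarity matrix to be computed per meta-path.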
Given a set of meta-paths , where denotes the th meta-path and denotes the number of meta-paths, we can construct graphs by each time extracting only the corresponding meta-path from the dependency graph (Sun and Han, 2012). Let denote the vector representation of the entities in . Then, we model the relationship between two entities and as:
(1) 
In the above, is a weighted average of all the similarity matrices :
(2) 
where ’s are non-negative coefficients, and is the similarity matrix constructed by calculating the pairwise shortest path between each pair of entities in . Here, is the adjacency matrix of the dependency graph . By using the shortest path in the graph, one can capture the long-term relationship between different entities (Bondy and Murty, 1976). Putting Eq. 2 into Eq. 1, we have:
(3) 
where is the Frobenius norm (Han et al., 2011).
Then, the objective function of the EEM model is:
(4) 
where , and is the generalization term (Han et al., 2011), which prevents the model from overfitting. is the trade-off factor of the generalization term. In practice, we can choose as 1 or 2, which bear a resemblance to Hamming distance and Euclidean distance, respectively.
Putting everything together, we get:
(5) 
Then, the optimized value can be obtained by:
3.1.1. Inference Method
The objective function in Eq. 5 contains two sets of parameters: (1) , and (2) . We propose a two-step iterative method for optimizing , in which the entity vector matrices and the weight for each meta-path mutually enhance each other. In the first step, we fix the weight vectors and learn the best entity vector matrix . In the second step, we fix the entity vector matrix and learn the best weight vectors .
Fix and learn : When we fix , the problem is reduced to , where is a constant similarity matrix. The optimization process then becomes a traditional manifold learning problem. Fortunately, this problem has a closed-form solution via so-called multidimensional scaling (Han et al., 2011). To obtain such an embedding, we compute the eigenvalue decomposition of the following matrix:
where is the double centering matrix, has columns as the eigenvectors, and is a diagonal matrix with eigenvalues. Then, the embedding can be chosen as:
(6) 
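The closed-form step described above follows the pattern of classical multidimensional scaling. The sketch below shows the textbook version of that procedure with NumPy, assuming the input is a matrix of pairwise distances (for example, shortest-path lengths). The function name `classical_mds` and the clipping of negative eigenvalues are choices made for this illustration; the exact matrix decomposed in the paper may differ.

```python
import numpy as np


def classical_mds(dist, dim):
    """Embed n points in `dim` dimensions from an n x n distance matrix."""
    n = dist.shape[0]
    # Double centering matrix J = I - (1/n) * 1 1^T.
    centering = np.eye(n) - np.ones((n, n)) / n
    # Gram matrix B = -1/2 * J * D^2 * J, with D^2 taken elementwise.
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    # eigh returns eigenvalues in ascending order; keep the `dim` largest.
    top = np.argsort(eigvals)[::-1][:dim]
    # Clip negative eigenvalues that arise from non-Euclidean input or round-off.
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale


# Three points forming a 3-4-5 right triangle.
points = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
embedding = classical_mds(dist, dim=2)
recovered = np.linalg.norm(embedding[:, None, :] - embedding[None, :, :], axis=-1)
```

For exactly Euclidean input, the recovered pairwise distances match the originals up to rotation and translation of the embedding; for shortest-path input the embedding is only an approximation.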
Fix and learn : When fixing , the problem is reduced to:
(7) 
where is a constant matrix, and is also a constant. Then, this function becomes a linear regression, so we also have a closed-form solution for :
After we get the embedding vectors , the relevance matrix between different entities can be obtained as:
(8) 
One can use a user-defined threshold to select the entities with high correlation with the target domain for transfer. However, a user-defined threshold often suffers from a lack of domain knowledge. We therefore introduce a hypothesis-test-based method for automatically thresholding the selection of the entities.
For each entity in , we first normalize all the scores by: , where is the average value of , and is the standard deviation of . These standardized scores can be approximated with a Gaussian distribution. Then, the threshold will be with (or for ) (Han et al., 2011). By using this threshold, one can filter out all the statistically irrelevant entities from the source domain, and transfer highly correlated entities to the target domain.
By combining the transferred entities and the original target domain dependency graph , we get , as shown in Fig. 3. The next step is to construct the missing dependencies in .
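The hypothesis-test thresholding described above can be sketched as a z-score filter. In this illustration the function name `select_relevant`, the one-sided comparison, and the default critical value `z_crit=1.96` (the familiar two-sided 5% critical value of the standard normal) are assumptions, not values taken from the paper.

```python
import numpy as np


def select_relevant(scores, z_crit=1.96):
    """Return indices whose standardized relevance score exceeds z_crit."""
    scores = np.asarray(scores, dtype=float)
    std = scores.std()
    if std == 0.0:
        # All scores are equal, so no entity stands out statistically.
        return np.array([], dtype=int)
    z = (scores - scores.mean()) / std
    return np.flatnonzero(z > z_crit)


# Hypothetical relevance scores: twenty weak entities and one strong one.
scores = [0.1] * 20 + [0.9]
selected = select_relevant(scores)
```

Because the threshold is expressed in standard deviations rather than raw scores, the same setting can be reused across datasets whose relevance scores live on different scales.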
3.2. DCM: Dependency Construction Model
To construct the missing dependencies/edges in , two constraints need to be considered:

Smoothness Constraint: The predicted dependency structure in needs to be close to the dependency structure of the original graph . The intuition behind this constraint is that the learned dependencies should remain intact in as much as possible.

Consistency Constraint: Inconsistency between and should be similar to the inconsistency between and . Here, and are the subgraphs of which have the same entity set as and , respectively. This constraint guarantees that the target graph learned by our model keeps the original domain difference with the source graph.
Before we model the above two constraints, we first need a measure to evaluate the inconsistency between different domains. In this work, we propose a novel metric named dynamic factor between two dependency graphs and from two different domains as:
(9) 
where is the number of entities in , and denote the adjacency matrices of and , respectively, and denotes the number of edges of a fully connected graph with entities (Bondy and Murty, 1976).
Next, we introduce the Dependency Construction Model in detail.
3.2.1. Modeling Smoothness Constraint
We first model the smoothness constraint as follows:
(10) 
where is the vector representation of the entities in , and is the regularization term.
3.2.2. Modeling Consistency Constraint
We then model the consistency constraint as follows:
(11) 
where is the dynamic factor as we defined before. Then, putting Eq. 9 and into Eq. 11, we get:
(12) 
where .
3.2.3. Unified Model
Having proposed the modeling approaches in Sections 3.2.1 and 3.2.2, we now put the two constraints together. The unified model for dependency construction is proposed as follows:
(13) 
The first term of the model incorporates the Smoothness Constraint component, which keeps close to the target domain knowledge existing in . The second term considers the Consistency Constraint, that is, the inconsistency between and should be similar to the inconsistency between and .
 and are important parameters which capture the importance of each term, and we will discuss these parameters in Section 3.3. To optimize the model in Eq. 13, we use the stochastic gradient descent (Han et al., 2011) method. The derivative on is given as:
(14) 
3.3. Overall Algorithm
The overall algorithm is summarized as Algorithm 1. In the algorithm, lines 5 to 11 implement the Entity Estimation Model, and lines 13 to 16 implement the Dependency Construction Model.
3.3.1. Setting Parameters
There are two parameters, and , in our model. For , as in (Sun and Han, 2012; Han et al., 2011), it is always assigned manually based on the experiments and experience. For , when a large number of entities are transferred to the target domain, a large can improve the transferring result, because we need more information to be added from the source domain. On the other hand, when only a small number of entities are transferred to target domain, then a larger will bias the result. Therefore, the value of depends on how many entities are transferred from the source domain to the target domain. In this sense, we can use the proportion of the transferred entities in to calculate . Given the entity size of as , the entity size of as , then can be calculated as:
(15) 
The experimental results in Section 4.6 demonstrate the effectiveness of the proposed parameter selection method.
3.3.2. Complexity Analysis
As shown in Algorithm 1, the time for learning our model is dominated by computing the objective functions and their corresponding gradients against feature vectors.
For the Entity Estimation Model, the time complexity of computing the in Eq. 6 is bounded by , where is the number of entities in , and is the dimension of the vector space of . The time complexity for computing is also bounded by . So, if the number of training iterations for EEM is , the overall complexity of the EEM model is . For the Dependency Construction Model, the time complexity of computing the gradients of against is , where is the number of iterations and is the dimensionality of the feature vector. As shown in our experiments (see Section 4.5), , , , and are all small numbers, so we can regard them as a constant, say . The overall complexity of our method is then , which is linear in the size of the entity set. This makes our algorithm practical for large-scale datasets.
4. Experiments
In this section, we evaluate ACRET using synthetic data and real system surveillance data collected in enterprise networks.
4.1. Comparing Methods
We compare ACRET with the following methods:
NT: This method directly uses the original small target dependency graph without knowledge transfer. In other words, the estimated target dependency graph .
DT: This method directly combines the source dependency graph and the original target dependency graph. In other words, the estimated target dependency graph .
RWDCM: This is a modified version of the ACRET method. Instead of using the proposed EEM model to perform entity estimation, this method uses a random walk to evaluate the correlations between entities and perform entity estimation. Random walk is a widely used method for relevance search in a graph (Kang et al., 2012).
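For readers unfamiliar with random-walk relevance search, the sketch below shows one common variant, random walk with restart, computed by power iteration. It is not claimed to be the exact procedure used by the RWDCM baseline; the function name, the restart probability of 0.15, and the convergence tolerance are assumptions made for this example.

```python
import numpy as np


def random_walk_with_restart(adj, seed, restart=0.15, tol=1e-10, max_iter=1000):
    """Relevance of every node to `seed` under a walk that restarts at `seed`."""
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]
    col_sums = adj.sum(axis=0)
    col_sums[col_sums == 0.0] = 1.0  # isolated nodes keep an all-zero column
    transition = adj / col_sums      # column-normalized transition matrix
    restart_vec = np.zeros(n)
    restart_vec[seed] = 1.0

    scores = restart_vec.copy()
    for _ in range(max_iter):
        updated = (1.0 - restart) * (transition @ scores) + restart * restart_vec
        if np.abs(updated - scores).sum() < tol:
            return updated
        scores = updated
    return scores


# Path graph 0 - 1 - 2 - 3, queried from node 0.
adj = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])
scores = random_walk_with_restart(adj, seed=0)
```

The resulting scores can be thresholded in the same way as embedding-based relevance scores, which is what makes random walk a drop-in replacement for the entity estimation step in a baseline.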
4.2. Evaluation Metrics
Since the ACRET algorithm uses a hypothesis test for thresholding the selection of entities and dependencies, similar to (Luo et al., 2014b; Han et al., 2011), we use the F1-score to evaluate the hypothesis-test accuracy of all the methods.
F1-score is the harmonic mean of precision and recall. In our experiment, the final F1-score is calculated by averaging the entity F1-score and the dependency/edge F1-score.
To calculate the precision (recall) of both entities and edges, we compare the estimated entity (edge) set with the ground-truth entity (edge) set. Then, precision and recall can be calculated as follows:
$$\text{Precision} = \frac{N_C}{N_E}, \qquad \text{Recall} = \frac{N_C}{N_T}$$
where $N_C$ is the number of correctly estimated entities (edges), $N_E$ is the number of total estimated entities (edges), and $N_T$ is the number of ground-truth entities (edges).
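The evaluation metric described above can be computed directly on sets. The following is a minimal sketch; the function names and the representation of edges as unordered pairs are assumptions made for this example.

```python
def f1_score(estimated, truth):
    """F1-score of an estimated set against a ground-truth set."""
    estimated, truth = set(estimated), set(truth)
    correct = len(estimated & truth)
    if correct == 0:
        return 0.0
    precision = correct / len(estimated)
    recall = correct / len(truth)
    return 2 * precision * recall / (precision + recall)


def graph_f1(est_entities, true_entities, est_edges, true_edges):
    """Average of the entity F1-score and the edge F1-score."""
    # Undirected edges are compared as unordered pairs.
    est = {frozenset(edge) for edge in est_edges}
    true = {frozenset(edge) for edge in true_edges}
    return 0.5 * (f1_score(est_entities, true_entities) + f1_score(est, true))


# Hypothetical estimated and ground-truth graphs.
score = graph_f1(
    est_entities={"a", "b", "c"},
    true_entities={"a", "b", "d"},
    est_edges=[("a", "b"), ("b", "c")],
    true_edges=[("b", "a")],
)
```

In this example both the entity F1-score and the edge F1-score equal 2/3, so the averaged score is also 2/3.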
4.3. Synthetic Experiments
We first evaluate ACRET on synthetic graph datasets to have a more controlled setting for assessing algorithmic performance. We control three aspects of the synthetic data to stress-test the performance of ACRET:

Graph size is defined as the number of entities for a dependency graph. Here, we use to denote the source domain graph size and to denote the target one.

Dynamic factor, denoted as , has the same definition as in Section 3.2.

Graph maturity score, denoted as M, is defined as the percentage of entities/edges of the ground-truth graph that are used for constructing the original small graph . Here, the graph maturity score is used for simulating the learning time needed for to reach maturity in the real system.
Then, given , , , and M, we generate the synthetic data as follows. We first randomly generate an undirected graph as the source dependency graph based on the value of (West, 2001). Then, we randomly assign three different labels to each entity. Due to space limitations, we only show the results with three labels, but similar results have been achieved in graphs with more than three labels. We further construct the target dependency graph by randomly adding/deleting of the edges and deleting entities from . Finally, we randomly select M = of entities/edges from to form .
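The generation procedure above can be sketched with the Python standard library. This is a rough illustration under several assumptions: the source graph is drawn with a uniform edge probability, the dynamic factor is applied by toggling that fraction of all target node pairs, and the small graph is the subgraph induced by a random subset of target entities. The function name and all parameter values are hypothetical.

```python
import random
from itertools import combinations


def make_synthetic(n_source, n_target, edge_prob, dynamic, maturity, seed=0):
    """Return labels, source edges, target nodes/edges, and small-graph nodes/edges."""
    rng = random.Random(seed)
    nodes = list(range(n_source))
    # Assign one of three labels (entity types) to each entity.
    labels = {v: rng.choice("ABC") for v in nodes}
    source_edges = {e for e in combinations(nodes, 2) if rng.random() < edge_prob}

    # Target graph: delete entities, then toggle a fraction of the node pairs.
    target_nodes = sorted(rng.sample(nodes, n_target))
    pairs = list(combinations(target_nodes, 2))
    target_edges = {e for e in pairs if e in source_edges}
    for e in rng.sample(pairs, int(dynamic * len(pairs))):
        target_edges ^= {e}  # add the edge if absent, delete it if present

    # Small graph: subgraph induced by a `maturity` fraction of target entities.
    small_nodes = set(rng.sample(target_nodes, int(maturity * n_target)))
    small_edges = {(u, v) for u, v in target_edges
                   if u in small_nodes and v in small_nodes}
    return labels, source_edges, set(target_nodes), target_edges, small_nodes, small_edges


labels, src, tgt_nodes, tgt_edges, small_nodes, small_edges = make_synthetic(
    n_source=100, n_target=60, edge_prob=0.05, dynamic=0.1, maturity=0.5)
```

Fixing the seed makes each configuration reproducible, which matters when one parameter is varied while the others are held constant.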
4.3.1. How Does ACRET’s Performance Scale with Graph Size?
We first explore how ACRET’s performance changes with the graph sizes and . Here, we fix the maturity score to M, the dynamic factor to , and the target domain dependency graph size to . Then, we increase the source graph size from to . From Fig. 3(a), we observe that as the size difference increases, the performances of DT and RWDCM get worse. This is due to the poor ability of DT and RWDCM to extract useful knowledge from the source domain. In contrast, the performance of ACRET and EEMCMF improves with the size difference. This demonstrates the capability of the EEM model for entity knowledge extraction.
4.3.2. How Does ACRET’s Performance Scale with Domain Dynamic Factor?
We now vary the dynamic factor to understand its impact on ACRET’s performance. Here, the graph maturity score is set to M=, and the two domain sizes are set to and , respectively. Fig. 3(b) shows that the performances of all the methods go down as the dynamic factor increases. This is expected, because transferring the dependency graph from a very different domain will not work well. On the other hand, the performances of ACRET, RWDCM, and EEMCMF decrease only slightly as the dynamic factor increases. Since RWDCM and EEMCMF are variants of the ACRET method, this demonstrates that the two submodels of ACRET are both robust to large dynamic factors.
4.3.3. How Does ACRET’s Performance Scale with Graph Maturity?
Third, we explore how the graph maturity score M impacts the performance of ACRET. Here, the dynamic factor is fixed to . The graph sizes are set to and . Fig. 3(c) shows that as M increases, the performances of all the methods improve. The reason is straightforward: as the maturity score increases, the challenge of domain difference becomes smaller for all the methods. In addition, ACRET and its variants, RWDCM and EEMCMF, perform much better than DT and NT. This demonstrates the ability of the submodels of ACRET for knowledge transfer. Furthermore, ACRET still achieves the best performance.
4.4. Real-World Experiments
Two real-world system monitoring datasets are used in this experiment. The data is collected from an enterprise network system composed of Linux machines and Windows machines from two departments, in a time span of consecutive days. In both datasets, we collect two types of system events: (1) communications between processes, and (2) system activity of processes sending or receiving Internet connections to/from other machines at destination ports. Three different types of system entities are considered: (1) processes, (2) Unix domain sockets, and (3) Internet sockets. The size of the Windows dataset is around Gigabytes, and the Linux dataset is around Gigabytes. Both Windows and Linux datasets are split into a source domain and a target domain according to the department name. The detailed statistics of the two datasets are shown in Table 1.
Data  Win  Linux
# System Events  120 Million  10 Million
# Source Domain Machines  62  24
# Target Domain Machines  61  23
Time Span  14 days  14 days
In this experiment, we construct one target domain dependency graph per day by increasing the learning time daily. The final graph is the one learned for days. From Fig. 5, we observe that for both Windows and Linux datasets, the performances of all the algorithms improve as the training time increases. Compared with all the other methods, ACRET achieves the best performance on both Windows and Linux datasets. In addition, ACRET can make the dependency graph deployable in less than four days, instead of the two weeks or longer required by learning directly on the target domain.
4.5. Convergence Analysis
As described in Section 3.3.2, the performance bottleneck of the ACRET model is the learning process of the two submodels: EEM (Entity Estimation Model) and DCM (Dependency Construction Model). In this section, we report the convergence speed of our approach.
We use both synthetic and real-world data to validate the model convergence speed. For the synthetic data, we choose the one with dynamic factor , dependency graph sizes and , and graph maturity . For the two real-world datasets, we fix the target dependency graph learning time as days.
From Fig. 6, we can see that on all three datasets, ACRET converges very fast (i.e., in fewer than 10 iterations). This makes our model applicable to real-world large-scale systems.
4.6. Parameter Study
In this section, we study the impact of parameter in Eq. 13. We use the same datasets as in Section 4.5. As shown in Fig. 7, when the value of is too small or too large, the results are not good, because controls the balance between the source domain information and the target domain information. An extreme value of (too large or too small) will bias the result. On the other hand, the value calculated by Eq. 15 is for the synthetic dataset, for the Windows dataset, and for the Linux dataset, and Fig. 7 shows that the best results appear around these three values. This demonstrates that our proposed method for setting the value is effective and addresses the parameter pre-assignment issue.
4.7. Case Study on Intrusion Detection
As mentioned above, the dependency graph is essential to many forensic analysis applications such as root cause diagnosis and risk analysis. In this section, we evaluate ACRET’s performance in a real commercial enterprise security system (see Fig. 2) for intrusion detection.
In this case, the dependency graph, which represents the normal profile of the enterprise system, is the core analysis model for the offshore intrusion detection engine. It is built from the normal system process event streams (see Section 4.4 for data description) during the training period. The same security system has been deployed in two companies: one Japanese electric company and one US IT company. We obtain one dependency graph from the IT company after 30 days’ training, and two dependency graphs from the electric company after 3 and 30 days’ training, respectively. ACRET is applied to leverage the well-trained dependency graph from the IT company to complete the 3 days’ immature graph from the electric company.
In the one-day testing period, we try different types of attacks (Jones and Sielken, 2000), including Snowden attack, APT attack, botnet attack, sniffer attack, etc., which resulted in ground-truth alerts. All other alerts reported during the testing period are considered false positives.
Table 2 shows the intrusion detection results in the electric company using the dependency graphs generated by different transfer learning methods and by the 30 days’ training from the electric company. From the results, we can clearly see that ACRET outperforms all the other transfer learning methods by at least in precision and in recall. On the other hand, the performance of the dependency graph ( days’ model) accelerated by ACRET is very close to that of the ground-truth model ( days’ model). This means that, by using ACRET, we can achieve similar performance in one-tenth of the training time, which is of great significance to mission-critical environments.
Method  Precision  Recall 
NT  0.01  0.10 
DT  0.15  0.30 
RWDCM  0.38  0.57 
EEMCMF  0.42  0.60 
ACRET  0.60  0.73 
Real 30 days’ model  0.58  0.76 
5. Related Work
5.1. Transfer Learning
Transfer learning has been widely studied in recent years (Cao et al., 2010; Pan and Yang, 2010). Most traditional transfer learning methods focus on numerical data (Dai et al., 2007; Sun et al., 2015; Chattopadhyay et al., 2013). When it comes to graph (network) structured data, there is less existing work. In (Fang et al., 2015), the authors presented TrGraph, a novel transfer learning framework for network node classification. TrGraph leverages information from the auxiliary source domain to help the classification process of the target domain. In one of their earlier works, a similar approach was proposed (Fang et al., 2013) to discover common latent structure features as useful knowledge to facilitate collective classification in the target network. In (He et al., 2009), the authors proposed a framework that propagates the label information from the source domain to the target domain via the example-feature-example tripartite graph. Transfer learning has also been applied to deep neural network structures. In (Chen et al., 2016), the authors introduced Net2Net, a technique for rapidly transferring the information stored in one neural net into another. Net2Net utilizes function-preserving transformations to transfer knowledge from neural networks. Different from existing methods, we aim to expedite the dependency graph learning process through knowledge transfer.
5.2. Link Prediction and Relevance Search
Graph link prediction is a well-studied research topic (Liben-Nowell and Kleinberg, 2007; Hofman et al., 2017). In (Ye et al., 2013), Ye et al. presented a transfer learning algorithm to address the edge sign prediction problem in signed social networks. Because edge instances are not associated with a predefined feature vector, this work was proposed to learn the common latent topological features shared by the target and source networks, and then adopt an AdaBoost-like transfer learning algorithm with instance weighting to train a classifier. Collective matrix factorization (Singh and Gordon, 2008) is another popular technique that can be applied to detect missing links by combining the source domain and target domain graphs. However, none of the existing link prediction methods can deal with the dynamics between the source domain and target domain as introduced in our problem.
Finding relevant nodes, or similarity search in graphs, is also related to our work. Many different similarity metrics have been proposed, such as the Jaccard coefficient, cosine similarity, and the Pearson correlation coefficient (Bondy and Murty, 1976), as well as random walks (Kang et al., 2012; Sun et al., 2005). However, none of these similarity measures consider the multiple relations that exist in the data. Recent advances in heterogeneous information networks (Sun and Han, 2012) have offered several similarity measures for heterogeneous relations, such as meta-path and relation path (Luo et al., 2014c, a). However, these methods cannot deal with knowledge from multiple domains.
6. Conclusion
In this paper, we investigate the problem of transfer learning on dependency graphs. Different from traditional methods that mainly focus on numerical data, we propose ACRET, a two-step approach for accelerating dependency graph learning from heterogeneous categorical event streams. By leveraging entity embedding and constrained optimization techniques, ACRET can effectively extract useful knowledge (e.g., entity and dependency relations) from the source domain and transfer it to the target dependency graph. ACRET can also adaptively learn the differences between the two domains and construct the target dependency graph accordingly. We evaluate the proposed algorithm through extensive experiments, and the results demonstrate the effectiveness and efficiency of our approach. We also apply ACRET to a real enterprise security system for intrusion detection. Our method achieves superior detection performance with a lead time of at least 20 days and more than 70% accuracy.
References
 Aggarwal (2007) Charu C Aggarwal. 2007. Data streams: models and algorithms. Vol. 31. Springer Science & Business Media.
 Aggarwal et al. (2003) Charu C Aggarwal, Jiawei Han, Jianyong Wang, and Philip S Yu. 2003. A framework for clustering evolving data streams. In Proceedings of the 29th International Conference on Very Large Data Bases - Volume 29. VLDB Endowment, 81–92.
 Bondy and Murty (1976) John Adrian Bondy and Uppaluri Siva Ramachandra Murty. 1976. Graph theory with applications. Vol. 290. New York: Elsevier.
 Cao et al. (2010) Bin Cao, Nathan N Liu, and Qiang Yang. 2010. Transfer learning for collective link prediction in multiple heterogenous domains. In Proceedings of the 27th International Conference on Machine Learning. 159–166.

 Chattopadhyay et al. (2013) Rita Chattopadhyay, Wei Fan, Ian Davidson, Sethuraman Panchanathan, and Jieping Ye. 2013. Joint transfer and batch-mode active learning. In International Conference on Machine Learning. 253–261.
 Chen et al. (2016) Tianqi Chen, Ian Goodfellow, and Jonathon Shlens. 2016. Net2net: Accelerating learning via knowledge transfer. In Proceedings of the 4th International Conference on Learning Representations.
 Chen and Sun (2017) Ting Chen and Yizhou Sun. 2017. Task-Guided and Path-Augmented Heterogeneous Network Embedding for Author Identification. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. ACM, 295–304.
 Dai et al. (2007) Wenyuan Dai, Qiang Yang, Gui-Rong Xue, and Yong Yu. 2007. Boosting for transfer learning. In Proceedings of the 24th International Conference on Machine Learning. ACM, 193–200.
 Fang et al. (2013) Meng Fang, Jie Yin, and Xingquan Zhu. 2013. Transfer learning across networks for collective classification. In 2013 IEEE 13th International Conference on Data Mining. IEEE, 161–170.
 Fang et al. (2015) Meng Fang, Jie Yin, Xingquan Zhu, and Chengqi Zhang. 2015. TrGraph: Cross-network transfer learning via common signature subgraphs. IEEE Transactions on Knowledge and Data Engineering 27, 9 (2015), 2536–2549.
 Han et al. (2011) Jiawei Han, Jian Pei, and Micheline Kamber. 2011. Data mining: concepts and techniques. Elsevier.
 He et al. (2009) Jingrui He, Yan Liu, and Richard Lawrence. 2009. Graph-based transfer learning. In Proceedings of the 18th ACM Conference on Information and Knowledge Management. ACM, 937–946.
 He and Zhang (2011) Miao He and Junshan Zhang. 2011. A dependency graph approach for fault detection and localization towards secure smart grid. IEEE Transactions on Smart Grid 2, 2 (2011), 342–351.
 Hofman et al. (2017) Jake M Hofman, Amit Sharma, and Duncan J Watts. 2017. Prediction and explanation in social systems. Science 355, 6324 (2017), 486–488.
 Jones and Sielken (2000) Anita K Jones and Robert S Sielken. 2000. Computer system intrusion detection: A survey. Computer Science Technical Report (2000), 1–25.
 Kang et al. (2012) U Kang, Hanghang Tong, and Jimeng Sun. 2012. Fast random walk graph kernel. In Proceedings of the 2012 SIAM International Conference on Data Mining. SIAM, 828–838.

 Kawale et al. (2013) Jaya Kawale, Stefan Liess, Arjun Kumar, Michael Steinbach, Peter Snyder, Vipin Kumar, Auroop R Ganguly, Nagiza F Samatova, and Fredrick Semazzi. 2013. A graph-based approach to find teleconnections in climate data. Statistical Analysis and Data Mining: The ASA Data Science Journal 6, 3 (2013), 158–179.
 Kim and Leskovec (2011) Myunghwan Kim and Jure Leskovec. 2011. Modeling social networks with node attributes using the multiplicative attribute graph model. In Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence. AUAI Press, 400–409.
 King et al. (2005) Samuel T King, Z Morley Mao, Dominic G Lucchetti, and Peter M Chen. 2005. Enriching intrusion alerts through multi-host causality. In Proceedings of the 2005 Network and Distributed System Security Symposium (NDSS).
 Krutz and Vines (2010) Ronald L Krutz and Russell Dean Vines. 2010. Cloud security: A comprehensive guide to secure cloud computing. Wiley Publishing.
 Liben-Nowell and Kleinberg (2007) David Liben-Nowell and Jon Kleinberg. 2007. The link-prediction problem for social networks. Journal of the Association for Information Science and Technology 58, 7 (2007), 1019–1031.
 Luo et al. (2014a) Chen Luo, Renchu Guan, Zhe Wang, and Chenghua Lin. 2014a. HetPathMine: A Novel Transductive Classification Algorithm on Heterogeneous Information Networks. In Proceedings of the 36th European Conference on Information Retrieval. Springer, 210–221.
 Luo et al. (2014b) Chen Luo, Jian-Guang Lou, Qingwei Lin, Qiang Fu, Rui Ding, Dongmei Zhang, and Zhe Wang. 2014b. Correlating events with time series for incident diagnosis. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1583–1592.
 Luo et al. (2014c) Chen Luo, Wei Pang, Zhe Wang, and Chenghua Lin. 2014c. Hetecf: Social-based collaborative filtering recommendation using heterogeneous relations. In 2014 IEEE International Conference on Data Mining. IEEE, 917–922.

 Manzoor et al. (2016) Emaad Manzoor, Sadegh M Milajerdi, and Leman Akoglu. 2016. Fast Memory-efficient Anomaly Detection in Streaming Heterogeneous Graphs. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1035–1044.
 Myers et al. (2014) Seth A Myers, Aneesh Sharma, Pankaj Gupta, and Jimmy Lin. 2014. Information network or social network? the structure of the twitter follow graph. In Proceedings of the 23rd International Conference on World Wide Web. ACM, 493–498.
 Pan and Yang (2010) Sinno Jialin Pan and Qiang Yang. 2010. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22, 10 (2010), 1345–1359.
 Singh and Gordon (2008) Ajit P Singh and Geoffrey J Gordon. 2008. Relational learning via collective matrix factorization. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 650–658.
 Sun et al. (2005) Jimeng Sun, Huiming Qu, Deepayan Chakrabarti, and Christos Faloutsos. 2005. Relevance search and anomaly detection in bipartite graphs. ACM SIGKDD Explorations Newsletter 7, 2 (2005), 48–55.
 Sun et al. (2015) Qian Sun, Mohammad Amin, Baoshi Yan, Craig Martell, Vita Markman, Anmol Bhasin, and Jieping Ye. 2015. Transfer learning for bilingual content classification. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2147–2156.
 Sun and Han (2012) Yizhou Sun and Jiawei Han. 2012. Mining heterogeneous information networks: principles and methodologies. Synthesis Lectures on Data Mining and Knowledge Discovery 3, 2 (2012), 1–159.
 West (2001) Douglas Brent West. 2001. Introduction to graph theory. Vol. 2. Prentice Hall, Upper Saddle River.
 Xu et al. (2016) Zhang Xu, Zhenyu Wu, Zhichun Li, Kangkook Jee, Junghwan Rhee, Xusheng Xiao, Fengyuan Xu, Haining Wang, and Guofei Jiang. 2016. High fidelity data reduction for big data security dependency analyses. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. ACM, 504–516.
 Ye et al. (2013) Jihang Ye, Hong Cheng, Zhe Zhu, and Minghua Chen. 2013. Predicting positive and negative links in signed social networks by transfer learning. In Proceedings of the 22nd International Conference on World Wide Web. ACM, 1477–1488.
 Zhang et al. (2015) Kai Zhang, Qiaojun Wang, Zhengzhang Chen, Ivan Marsic, Vipin Kumar, Guofei Jiang, and Jie Zhang. 2015. From categorical to numerical: Multiple transitive distance learning and embedding. In Proceedings of the 2015 SIAM International Conference on Data Mining. SIAM, 46–54.