1 Introduction
Heterogeneous information networks (HINs), which consist of multi-typed entities and relations, have been extensively studied recently and have been shown to outperform many other state-of-the-art data models in real-world applications such as link prediction [22, 13] and classification [8, 31]. The main reason behind the success of HIN studies is the concept of metapath [20]. A metapath is a sequence of consecutive entity types and relation types that captures the semantic proximity between entities in an HIN, and thus provides the opportunity to understand the data more deeply.
Most HIN tasks can be formalized as problems of inference with metapaths. The goal of HIN inference is to find the best assignment to the output variables according to the given model and input instances. HIN inference particularly leverages the proximities between the input and output variables provided by the metapaths, i.e., $f(\theta, \Phi(x, y, \mathcal{M}))$, where $f$ is an HIN model, $\theta$ is the model parameters, and $\Phi$ is a set of representation (i.e., feature) functions based on the input instance $x$, the output variables $y$, and the metapaths $\mathcal{M}$ relevant to $x$ and $y$. For example, in an HIN based link prediction task, given a linear regression based prediction model and a feature function set built upon a source entity $s$ and a target entity $t$, metapath based link prediction can be seen as finding the best assignment to the output variable (a one-dimensional vector, e.g., [1] means the link exists and [0] means it does not) via the metapaths between $s$ and $t$. As we can see, the power of inference in HINs comes from the metapaths, and how well we handle them is the key to better inference. However, current inference methods have two limitations in large scale HINs.
Users need to provide the metapath(s). Most previous HIN studies ask users or experts to provide metapaths as explicit inference rules for the relevant tasks, such as similarity search [20]. In traditional HINs such as DBLP, there are only four types of entities (Paper, Venue, Author and Term) and the relation types between them, so users can provide high-quality metapaths. However, a large scale HIN such as YAGO consists of a large and sophisticated set of entity types (e.g., millions) and relation types (e.g., hundreds), which makes it much harder for users to provide metapaths. Moreover, multi-order metapaths (length larger than one) can carry more important information for inference than first-order metapaths (length equal to one). It therefore becomes harder to rely on users to provide the relevant metapaths for inference in large scale HINs.
Biased examples lead to incorrect inference. To relieve the effort required from users, random walk based inference [11, 12] has been proposed for large scale HINs. The existing random walk procedure enumerates the metapaths within a fixed length $L$ and then performs joint inference. Since the time complexity of enumerating the metapaths grows exponentially with their length, the traditional random walk is impractical in large scale networks. Another potential problem is that the inference performance is very sensitive to $L$. For example, if $L$ is small (e.g., equal to 1), meaningful multi-order metapaths are ignored; if $L$ is large, meaningless and duplicated metapaths are generated. Meng et al. [14] recently proposed a more general inference framework that requires users to provide example entity pairs. Metapaths are then generated if they yield high proximity between the entity pairs. The algorithm generates metapaths by considering local (nearby) randomly generated negative examples. Since randomly produced negative samples may in fact include some positive examples, the negative examples can be biased and noisy for inference. We thus consider an efficient HIN inference framework that further releases the power of metapath based inference in large scale HINs by relieving the aforementioned issues.
In this paper, we propose a metapath constrained inference framework for large scale HINs. For a particular HIN inference task, we may only care about one relevant inference rule (or a small set of them), so the parts of the network that are only distantly connected to the inference goal are likely to have little influence. Intuitively, we treat HIN inference as graph random walk inference with constraints. We carefully design an efficiency-optimized metapath tree data structure that constrains the random walk to follow the tree structure. Compared to the running time of the original random walk, the proposed metapath constrained random walk improves efficiency by up to two orders of magnitude. The proposed inference scheme is initialized with several sample pairs that capture the inference target. Then, based on the metapath based proximities in the samples, the scheme iterates random walk inference in the metapath tree until convergence. Notice that the inference method is weakly supervised. The supervision information complies with the intuition that the inference is only relevant to parts of the HIN.
The main contributions of this paper can be highlighted as below:

We study the problem of large scale HIN inference, which is important and has broad applications.

We propose a metapath constrained inference framework, where many of the HIN tasks can be unified under this framework. In particular, we propose a metapath tree constrained random walk inference method, which is weakly supervised and efficient to model the inference targets.

We conduct experiments on two large scale HINs. Our proposed inference method demonstrates its effectiveness and efficiency compared to state-of-the-art inference methods on typical HIN tasks (link prediction and similarity search).
The rest of this paper is organized as follows. We first introduce the HIN inference framework in Section 2. Next, in Section 3, we present HIN inference as metapath tree constrained random walk inference. The experimental results are shown in Section 4. Finally, we discuss related work in Section 5 and conclude in Section 6.
2 Problem Definition
In this section, we introduce a general HIN inference framework that specifically takes metapaths in an HIN into consideration and performs large scale inference according to a particular task. Before that, we introduce some basic concepts of HINs.
2.1 Heterogeneous Information Network
Definition 1
A heterogeneous information network (HIN) is a graph $G = (\mathcal{V}, \mathcal{E})$ with an entity type mapping $\phi$: $\mathcal{V} \to \mathcal{A}$ and a relation type mapping $\psi$: $\mathcal{E} \to \mathcal{R}$, where $\mathcal{V}$ denotes the entity set, $\mathcal{E}$ denotes the link set, $\mathcal{A}$ denotes the entity type set, and $\mathcal{R}$ denotes the relation type set, and the number of entity types $|\mathcal{A}| > 1$ or the number of relation types $|\mathcal{R}| > 1$.
Notice that, in large scale HINs such as YAGO, the relation type mapping is a one-to-one mapping, while the entity type mapping can be a one-to-N mapping. For example, for the triplet (Larry Page, alumniOf, Stanford) in YAGO, the relation type mapping assigns the link the single relation type alumniOf, while the entity type mapping can assign an entity such as Larry Page several entity types. The reason why an entity can be mapped to multiple entity types in YAGO or Freebase is that the entity types are often organized in a hierarchical manner. For example, as shown in Figure 1, University is a subtype of Organization, and Politician is a subtype of Person. All the types or attributes share a common root, called Object. This hierarchical organization of entity types raises another challenge for inferring useful information in HINs.
Definition 2
Given an HIN $G = (\mathcal{V}, \mathcal{E})$ with the entity type mapping $\phi$: $\mathcal{V} \to \mathcal{A}$ and the relation type mapping $\psi$: $\mathcal{E} \to \mathcal{R}$, the network schema for network $G$, denoted as $T_G = (\mathcal{A}, \mathcal{R})$, is a graph with nodes as entity types from $\mathcal{A}$ and edges as relation types from $\mathcal{R}$.
The network schema provides a high-level description of a given heterogeneous information network. Another important concept, metapath [21], is proposed to systematically define relations between entities at the schema level.
Definition 3
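A metapath $\mathcal{P}$ is a path defined on the graph of network schema $T_G = (\mathcal{A}, \mathcal{R})$, and is denoted in the form of $A_1 \xrightarrow{R_1} A_2 \xrightarrow{R_2} \cdots \xrightarrow{R_l} A_{l+1}$, which defines a composite relation $R = R_1 \circ R_2 \circ \cdots \circ R_l$ between types $A_1$ and $A_{l+1}$, where $\circ$ denotes the relation composition operator, and $l$ is the length of $\mathcal{P}$. When $l = 1$, we call it a first-order metapath; when $l > 1$, we call it a multi-order metapath.
We say a path $p = (v_1 v_2 \ldots v_{l+1})$ between $v_1$ and $v_{l+1}$ in network $G$ follows the metapath $\mathcal{P}$ if $\forall i$, $\phi(v_i) = A_i$ and each edge $e_i = \langle v_i, v_{i+1} \rangle$ belongs to the relation type $R_i$ in $\mathcal{P}$. We call these paths path instances of $\mathcal{P}$, denoted as $p \in \mathcal{P}$. $R^{-1}$ represents the reverse order of relation $R$. For example, in the YAGO network, the composite relation that two Persons co-founded an Organization can be described as the metapath $\mathcal{P}$ = Person → Organization → Person. A path instance of $\mathcal{P}$ is $p$ = Larry Page → Google → Sergey Brin.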
2.2 Heterogeneous Information Network Inference Framework
Metapaths carry rich information about the semantic relationships between entities, and thus capture subtle proximities in HINs via several metapath based similarity measures, such as Path Count [20], Random Walk [11], and PathSim [20]. These proximities are very useful for inference problems such as link prediction. Assume that direct links (first-order metapaths) are missing between two entities with types Person and Profession. If there is a multi-order metapath Person → Organization → Profession between the entities, and the metapath based similarity (e.g., Path Count) of the two entities is large (i.e., the number of path instances between the two entities that follow the metapath is large), we have higher confidence in inferring the missing link between the entities based on the multi-order metapath.
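The Path Count idea above can be illustrated with a minimal Python sketch. The toy triples, the relation names `worksAt` and `employs`, and the helper names are all hypothetical and chosen only for illustration; they are not taken from YAGO or from the measures' original implementations.

```python
from collections import defaultdict

# Toy HIN (illustrative only): typed entities and (source, relation, target) triples.
ENTITY_TYPES = {
    "Larry Page": "Person",
    "Sergey Brin": "Person",
    "Google": "Organization",
    "Computer Scientist": "Profession",
}
TRIPLES = [
    ("Larry Page", "worksAt", "Google"),
    ("Sergey Brin", "worksAt", "Google"),
    ("Google", "employs", "Computer Scientist"),
]


def build_index(triples):
    """Index outgoing neighbours by (entity, relation type)."""
    index = defaultdict(list)
    for src, rel, dst in triples:
        index[(src, rel)].append(dst)
    return index


def path_instances(index, types, start, metapath):
    """Enumerate path instances from `start` that follow `metapath`.

    `metapath` is a list of (relation type, target entity type) steps.
    """
    paths = [[start]]
    for rel, target_type in metapath:
        extended = []
        for path in paths:
            for nxt in index.get((path[-1], rel), []):
                if types.get(nxt) == target_type:
                    extended.append(path + [nxt])
        paths = extended
    return paths


def path_count(index, types, start, end, metapath):
    """Number of path instances between `start` and `end` under `metapath`."""
    return sum(1 for p in path_instances(index, types, start, metapath) if p[-1] == end)


INDEX = build_index(TRIPLES)
PERSON_ORG_PROFESSION = [("worksAt", "Organization"), ("employs", "Profession")]
```

A larger count for a (Person, Profession) pair under this metapath would raise the confidence in a missing direct link between them, as argued in the paragraph above.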
The traditional inference framework in machine learning [4] aims to model relevant inference problems as stochastic processes involving both output variables and input or observed variables. The framework mainly includes a model parameter vector $\theta$, corresponding to a set of representation or feature functions $\Phi$. For an input instance $x$ and an output assignment $y$, the “score” of the instance can be expressed as a model function $f$ of the parameter vector and the representation functions: score = $f(\theta, \Phi(x, y))$. When the model is evaluated on a test instance $x$, the inference framework aims to find the best assignment to the output variables,
$$y^{*} = \arg\max_{y} f(\theta, \Phi(x, y)) \qquad (1)$$
Notice that a representation function usually focuses on producing homogeneous flat features and ignores the link or structure information in both the input variables and the output variables. As we know, multi-typed links (metapaths) are very important for HIN inference. However, the framework does not model metapath information, and it is not trivial to efficiently incorporate the extra information provided by metapaths into it. We therefore formally define the HIN inference framework as follows.
Definition 4
Heterogeneous Information Network Inference (HINI) aims to enable inference with metapaths in large scale HINs. To be more specific, HINI aims to infer the best assignment to the output variables given an HIN model with metapaths. HINI is formalized as:
$$y^{*} = \arg\max_{y} f(\theta, \Phi(x, y, \mathcal{M})) \qquad (2)$$
where $f$ is the model or score function, $\theta$ is the model parameters, and $\mathcal{M}$ is the metapath set. Different from Eq. 1, $\Phi$ aims to leverage the proximities carried by the metapaths with regard to the input instance $x$ or the output variable $y$.
HINI (Eq. 2) can support many mining tasks in HINs, such as link prediction and similarity search, as we will see later. If the metapath set is empty, HINI degenerates to the traditional inference shown in Eq. 1. We find that there are two main challenges in HINI: 1) how to efficiently generate useful metapaths from large scale HINs consisting of millions of entities and billions of relations, and 2) how to model the metapath based proximities to improve the representation power of the feature functions. In the next section, we describe our proposed inference method, which efficiently generates metapaths and computes metapath based similarity simultaneously.
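To make the shape of the HINI objective concrete, the following Python sketch picks the highest-scoring output among a finite candidate set. It assumes a linear score function and a lookup table of precomputed proximities; the names `hini_infer`, `feature_fn`, `PROXIMITY`, and the metapath label `"P1"` are hypothetical and serve only as an illustration of the argmax in Eq. 2.

```python
# Hypothetical precomputed metapath proximities: (input, output, metapath id) -> score.
PROXIMITY = {
    ("Larry Page", "Computer Scientist", "P1"): 0.9,
    ("Larry Page", "Politician", "P1"): 0.1,
}


def feature_fn(x, y, metapaths):
    """One metapath-independent ("flat") feature plus one proximity feature per metapath."""
    flat = [1.0]
    return flat + [PROXIMITY.get((x, y, m), 0.0) for m in metapaths]


def hini_infer(x, candidates, metapaths, theta):
    """Return the candidate output with the highest linear score (assumed model form)."""
    def score(y):
        feats = feature_fn(x, y, metapaths)
        return sum(w * f for w, f in zip(theta, feats))
    return max(candidates, key=score)


best = hini_infer("Larry Page", ["Politician", "Computer Scientist"], ["P1"], [0.0, 1.0])
```

With an empty metapath list, only the flat feature remains, which mirrors the remark that HINI degenerates to traditional inference when no metapaths are available.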
3 HIN Random Walk Inference with MetaPath Dependency Tree Search
In this section, we first introduce metapath constrained HIN random walk inference with weak supervision, then talk about how to leverage the supervision to conduct efficient HIN random walk inference via a carefully designed data structure.
3.1 Weakly Supervised HIN Random Walk Inference
As mentioned above, metapaths are very important since they imply semantic relationships between entities in HINs and thus capture the semantic proximities of entities, which is very useful for HIN inference as shown in Eq. 2. Most existing inference methods focus on enumerating the metapaths within a fixed length in the full underlying network [11]. However, this solution is impractical in large scale HINs, since the number of possible metapaths grows exponentially with the length of a metapath. For a particular inference task, it is not necessary to do inference in the full HIN, since only a part of the network or of the metapaths is relevant to the inference. To cope with these challenges, we propose a metapath constrained random walk method that performs inference in HINs with weak human supervision. Weak supervision here guides both the random walk process and the metapath generation process by pruning the search space. Our method aims to cope with the two HINI challenges described in the previous section.
Given an HIN $G$, similar to [14], we ask users to provide a set of example entity pairs as supervision to imply metapaths. Formally, we aim to find a metapath set $\mathcal{M}$ that infers high proximities between the entities in the example pairs. An efficient way to generate $\mathcal{M}$ from the example pairs is described in the next section.
Now assume that the metapath set $\mathcal{M}$ has been obtained from the example pairs. For a metapath $\mathcal{P} \in \mathcal{M}$, we define a metapath constrained random walk that starts from a source entity and reaches a target entity by following only path instances of $\mathcal{P}$. It defines a distribution recursively as follows.
(3) 
where the summation ranges over the entity set in which each entity can be linked via the given relation type to at least one entity of the given type, and the remaining factor is the probability of a one-step random walk starting from an entity via the given relation type.
For example, consider a path instance from $e_0$ to $e_2$ following the metapath Person → Organization → Profession. Suppose a random walk starts at an entity $e_0$ (e.g., $e_0$ = Larry Page). If $S_1$ is the set of Organizations in the HIN that Larry Page has worked at, then after one step the walker has probability $1/|S_1|$ of being at any entity $e_1 \in S_1$ (e.g., $e_1$ = Google). Similarly, if $S_2$ is the set of Professions in the HIN that Google has employed, a walker at Google has probability $1/|S_2|$ of moving to any entity $e_2 \in S_2$ (e.g., $e_2$ = Computer Science). The proximity provided by the metapath constrained random walk is useful because it gives the prior probability of $e_2$ being the Profession of Person $e_0$.
To be more general, we then propose a linear model to combine the single metapath constrained random walk scores, i.e., over each metapath in the metapath set, the HIN inference with joint random walk model (HINIJRW) is formalized as follows.
(4) 
where $\theta$ is the model parameter vector, each element of which is the weight or importance of a certain metapath based inference score. The parameter vector can either be explicitly set by users or implicitly learned according to different HIN inference tasks. By tuning the parameters, we can avoid the bias induced by certain metapath(s) and ensure the model’s robustness and stability.
Now let’s revisit the relationship between the joint HIN inference model (Eq. 4) and the HINI framework (Eq. 2). In short, the input instance of HINI corresponds to the entity pairs, the output variable of HINI is the random walk based probability or proximity, and the metapath set of HINI is the generated metapath set. Similarly, other inference models can also be unified into the HINI framework.
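The metapath constrained random walk and its linear combination can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each step is taken uniformly at random among the neighbours reachable via the required relation type (as in the path constrained random walk of [11]), entity type checks are omitted, and the toy triples, relation names, and weights are hypothetical.

```python
from collections import defaultdict

# Toy triples (illustrative only, not taken from YAGO).
TRIPLES = [
    ("Larry Page", "worksAt", "Google"),
    ("Larry Page", "worksAt", "Alphabet"),
    ("Larry Page", "founded", "Google"),
    ("Google", "employs", "Computer Scientist"),
    ("Google", "employs", "Manager"),
    ("Alphabet", "employs", "Computer Scientist"),
]


def build_index(triples):
    """Index outgoing neighbours by (entity, relation type)."""
    index = defaultdict(list)
    for src, rel, dst in triples:
        index[(src, rel)].append(dst)
    return index


def constrained_walk(index, start, relations):
    """Distribution over entities reached from `start` by following `relations` in order.

    Assumption: each step is uniform over the neighbours for the required relation type.
    Probability mass at entities with no such neighbour is dropped.
    """
    dist = {start: 1.0}
    for rel in relations:
        nxt = defaultdict(float)
        for entity, prob in dist.items():
            neighbours = index.get((entity, rel), [])
            for n in neighbours:
                nxt[n] += prob / len(neighbours)
        dist = dict(nxt)
    return dist


def joint_score(index, start, target, metapaths, theta):
    """Weighted sum of single-metapath random walk scores (linear combination)."""
    return sum(
        w * constrained_walk(index, start, rels).get(target, 0.0)
        for w, rels in zip(theta, metapaths)
    )


INDEX = build_index(TRIPLES)
METAPATHS = [["worksAt", "employs"], ["founded", "employs"]]
score = joint_score(INDEX, "Larry Page", "Computer Scientist", METAPATHS, [0.5, 0.5])
```

In this toy graph the walk along `worksAt`, `employs` splits evenly over two organizations in the first step, which is the behaviour described in the Larry Page example above.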
3.2 Efficient Inference via MetaPath Dependency Tree Search
To cope with the main challenge of large scale HIN inference, i.e., efficiency, we carefully design a tree structure that significantly accelerates the above metapath constrained random walk inference in HINs. More importantly, the new tree structure makes it possible to perform inference through the model in Eq. 4 and to automatically generate metapath based inference rules from the user provided samples at the same time.
Even though we do not need to enumerate the metapaths in an HIN, it is still intractable to find the optimal metapath set given the example pairs [23]. Selecting relevant metapaths has been shown to be NP-hard even when some of the path instances are given [2, 23]. Inspired by forward selection algorithms [10], we introduce the metapath dependency tree search algorithm, which enables efficient inference of HINIJRW by searching the metapath dependency tree. Since we do not know the relevant paths beforehand, the metapath set is initially empty. The metapath dependency tree search algorithm is used to insert metapaths into this set and to compute the inference scores while searching. There are three differences compared to the method proposed in [14]: 1) our tree search algorithm does not use randomly generated negative examples, which could induce biased inference; 2) our algorithm does not require an iterative process to guide the tree search, which makes it more efficient; and 3) we do not set a hard threshold to terminate the algorithm, since such a threshold is very sensitive and could prevent the algorithm from generating more meaningful metapaths.
Let’s first introduce the metapath dependency tree structure. Each tree edge is annotated with a relation type of the HIN, and each tree node holds a list of entity pairs with their HINIJRW scores, together with a priority score. The node structure is shown in Figure 2. Each tuple stored in a node records a path instance in the graph by its starting entity and its current entity, together with the random walk (RW) score, according to Eq. 4, of the metapath leading from the starting entity to the current entity. The priority score determines which tree node is searched next. We compute the priority score as follows:
(5) 
where the combination ranges over the entity pairs in the tree node, using the maximum weight among the entity pairs starting with a given entity and the number of example pairs starting with that entity. From Eq. 5, we can see that the priority score is defined as a weighted combination of the random walk scores (Eq. 4) of the entity pairs in the tree node. This means that if the entity pairs in a tree node exhibit higher proximity along the corresponding metapath, they are more likely to be similar along the new metapaths generated through the search. This is why the search algorithm, introduced shortly, guides its search like a random walker toward the tree node with the larger priority score. In contrast, the priority score proposed in [14] may generate metapaths that do not contain the example entity pairs. It guides the search to the leaf node with the largest priority among all leaf nodes, which can lead to no metapath being generated during the search and thus to relatively low efficiency. To avoid this, and considering that the scores are normally smaller than one, our slight modification ensures that the target node can be reached through the search and generates the metapaths more efficiently. In addition, to avoid generating metapaths of infinite length, we follow [14] and add a decay factor ranging from 0 to 1.
By searching the metapath dependency tree, inference and metapath generation in HINs can be done at the same time. We present the details of our proposed metapath dependency tree search (MPDTS) in Algorithm 1. In short, we search the tree by moving to out-neighbor nodes on the graph until a metapath is found or the graph is completely traversed. We first target the tree node with the largest priority and examine whether its tuples are example entity pairs. If so, we store the random walk scores computed by Eq. 4 in the corresponding tree node and add the metapath to the metapath set. If no example pairs are encountered, we extend each entity pair by moving to an out-neighbor. We insert the new pair with its random walk score into a child node and compute the child's priority score. Notice that the generated metapath only contains relation types. Similar to [14], we also leverage the Lowest Common Ancestor (LCA) in the type hierarchy of the entities in the HIN to fill in the entity types of the generated metapath and form the complete metapath.
Finally, we obtain the joint random walk scores/probabilities along the metapaths according to HINIJRW (Eq. 4). The joint version of MPDTS (Algorithm 1) is shown as JMPDTS in Algorithm 2. We can obtain the joint random walk score for a certain entity pair by summing over the corresponding row of the score matrix, since each entry in the row is the random walk score between the two entities following one metapath. By doing so, we perform inference and automatic metapath generation in large scale HINs, which copes with the two challenges of the HINI framework.
Our tree structure for random walk inference has several advantages. First, during the tree search, the node to expand is selected using the supervision provided by the user examples, so the search space is reduced. Second, path instances that represent the same metapath are gathered in a single tree node, which avoids duplicate calculation. Third, the whole metapath dependency tree is kept in memory and can be reused throughout the iterative process of traversing the tree.
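The step of filling entity types via the lowest common ancestor can be sketched as follows. The parent map mirrors the hierarchy fragment mentioned for Figure 1 (University under Organization, Politician under Person, root Object); the type `Company` and the function names are hypothetical additions for illustration.

```python
# Child type -> parent type. "Company" is a hypothetical type added for the example.
PARENT = {
    "University": "Organization",
    "Company": "Organization",
    "Organization": "Object",
    "Politician": "Person",
    "Person": "Object",
}


def ancestors(entity_type, parent):
    """Chain from `entity_type` up to the root, most specific type first."""
    chain = [entity_type]
    while entity_type in parent:
        entity_type = parent[entity_type]
        chain.append(entity_type)
    return chain


def lowest_common_ancestor(types, parent):
    """Most specific type that is an ancestor of (or equal to) every type in `types`."""
    chains = [ancestors(t, parent) for t in types]
    common = set(chains[0]).intersection(*chains[1:])
    for t in chains[0]:  # walk upward, so the first hit is the lowest one
        if t in common:
            return t
    return None
```

For the entities gathered at one position of a generated metapath, the returned type could serve as the entity type at that position.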
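The search procedure described above can be approximated by a best-first search over relation-type sequences. The sketch below is not Algorithm 1: the priority used here (decay raised to the depth, times the largest random walk score in the node) is an assumed stand-in for Eq. 5, the `max_depth` guard is an added safety limit, and the toy triples and function names are hypothetical.

```python
import heapq
from collections import defaultdict

# Toy triples (illustrative only).
TRIPLES = [
    ("Larry Page", "worksAt", "Google"),
    ("Larry Page", "worksAt", "Alphabet"),
    ("Larry Page", "founded", "Google"),
    ("Google", "employs", "Computer Scientist"),
    ("Google", "employs", "Manager"),
    ("Alphabet", "employs", "Computer Scientist"),
]


def mpdts_sketch(triples, example_pairs, decay=0.6, max_paths=1, max_depth=4):
    """Best-first search for relation sequences that connect the example pairs.

    Returns {relation sequence: {(start, end): random walk score}}.
    """
    out = defaultdict(lambda: defaultdict(list))  # entity -> relation -> neighbours
    for src, rel, dst in triples:
        out[src][rel].append(dst)
    examples = set(example_pairs)

    # Root node: each example source paired with itself, with score 1.
    root = [(s, s, 1.0) for s in sorted({s for s, _ in examples})]
    counter = 0  # unique tie-breaker so the heap never compares node contents
    frontier = [(-1.0, counter, (), root)]
    found = {}

    while frontier and len(found) < max_paths:
        _, _, relations, tuples = heapq.heappop(frontier)
        hits = {(s, c): sc for s, c, sc in tuples if (s, c) in examples}
        if hits:  # node reaches example pairs: record the metapath and its scores
            found[relations] = hits
            continue
        if len(relations) >= max_depth:
            continue
        # Otherwise extend every pair to its out-neighbours, one child per relation type.
        children = defaultdict(lambda: defaultdict(float))
        for s, c, sc in tuples:
            for rel, dsts in (out[c].items() if c in out else []):
                for dst in dsts:
                    children[rel][(s, dst)] += sc / len(dsts)  # uniform one-step walk
        for rel, pairs in children.items():
            child = [(s, c, sc) for (s, c), sc in pairs.items()]
            # Assumed priority: decayed best score in the node (stand-in for Eq. 5).
            priority = decay ** (len(relations) + 1) * max(sc for _, _, sc in child)
            counter += 1
            heapq.heappush(frontier, (-priority, counter, relations + (rel,), child))
    return found


FOUND = mpdts_sketch(TRIPLES, [("Larry Page", "Computer Scientist")], max_paths=2)
```

Because all path instances that share a relation sequence are merged into one node, the random walk score of a pair is accumulated once per node rather than once per path instance, which reflects the second advantage listed above.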
4 Experiments
In this section, we evaluate our approach using two typical HIN tasks, link prediction and similarity search.
4.1 HINI for Link Prediction
In this subsection, we validate our algorithm’s efficiency and effectiveness by performing link prediction tasks. We chose link prediction for the evaluation because it provides a quantitative way to measure the performance of different methods.
Task Description
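For link prediction, we propose to use a logistic regression model to leverage the random walk score of each metapath [11]. Formally, given a relation $R$ and a set of entity pairs $\{(s_i, t_i)\}$, we can construct a training dataset $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}$, where $\mathbf{x}_i$ is a vector of all the metapath based features for the pair $(s_i, t_i)$, i.e., the $j$-th component of $\mathbf{x}_i$ is the random walk score of $(s_i, t_i)$ along the $j$-th metapath, and $y_i$ indicates whether $R(s_i, t_i)$ is true. The parameter vector $\theta$ is estimated by maximizing the regularized objective function proposed in [11]. The link prediction model is then defined as:
$$p(y_i = 1 \mid \mathbf{x}_i; \theta) = \frac{\exp(\theta^{T}\mathbf{x}_i)}{1 + \exp(\theta^{T}\mathbf{x}_i)} \qquad (6)$$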
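A minimal Python sketch of such a logistic regression over metapath features is given below. It is not the training procedure of [11]: a plain L2 penalty and batch gradient ascent stand in for the regularized objective, and the feature values, hyperparameters, and function names are invented for illustration. The first feature is a constant bias term.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(features, labels, l2=0.01, lr=0.5, epochs=500):
    """Batch gradient ascent on the L2-regularized log-likelihood (assumed objective)."""
    dim = len(features[0])
    theta = [0.0] * dim
    for _ in range(epochs):
        grad = [0.0] * dim
        for x, y in zip(features, labels):
            err = y - sigmoid(sum(t * v for t, v in zip(theta, x)))
            for j, v in enumerate(x):
                grad[j] += err * v
        theta = [t + lr * (g / len(features) - l2 * t) for t, g in zip(theta, grad)]
    return theta


def predict_link(theta, x):
    """Probability that the link exists, given the metapath feature vector `x`."""
    return sigmoid(sum(t * v for t, v in zip(theta, x)))


# Each row: [bias, RW score along metapath 1, RW score along metapath 2] (made-up values).
FEATURES = [[1.0, 0.9, 0.8], [1.0, 0.7, 0.6], [1.0, 0.1, 0.0], [1.0, 0.0, 0.2]]
LABELS = [1, 1, 0, 0]
THETA = train_logistic(FEATURES, LABELS)
```

The learned weights indicate how much each metapath contributes to the prediction, which is what allows the model to downweight uninformative metapaths.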
Dataset
We perform link prediction experiments on a representative dataset for large scale HIN: YAGO.
YAGO: YAGO (http://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago/) is a semantic knowledge base derived from Wikipedia, WordNet and GeoNames. Currently, YAGO2 has knowledge of more than 10 million entities (such as persons, organizations, and cities) and contains more than 120 million facts about these entities. It contains 350,000 entity types organized in a type hierarchy, and 100 relation types.
Effectiveness Study
For a certain type of link in YAGO, for instance citizenOf, we remove all such links and try to predict them with the above logistic regression model, which uses the random walk scores as features. We randomly select a number of entity pairs with the corresponding relation labels as training data, and validate the model using a test set with an equal number of pairs. We compare our link prediction model with PCRW [11] based models, which generate paths with a maximum length of 1, 2, 3, or 4. The PCRW models also use the logistic regression model to combine these metapaths. In addition, we compare our model with the state-of-the-art FSPG based prediction model of [14]. Following [14], we set the decay factor to 0.6 in the metapath dependency tree to avoid producing metapaths of infinite length. We use the Area Under the ROC Curve (AUC) as the evaluation measure. AUC is the area under the Receiver Operating Characteristic (ROC) curve, whose x-axis is the false positive rate and whose y-axis is the true positive rate. Thus, a larger AUC value indicates more accurate prediction.
Table 1 presents the link prediction results for two types of links in YAGO: citizenOf and advisorOf. For each of these links, we generated 100 training and 100 test pairs, as described above. The results show that fixed-length PCRW suffers from several issues. When the maximum length is too small (1 or 2), metapaths cannot connect the example pairs, so the model is no better than a random guess and has low recall. When the maximum length is too large, the model introduces too many metapaths: there are 135 metapaths for length 3, and over 2,000 for length 4. Notice that our method outperforms FSPG. The reason is that FSPG incorporates randomly generated negative examples, which may lead to biased inference, while our method does not use them.
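For reference, AUC can be computed without tracing the ROC curve, using the standard equivalence between the area under the ROC curve and the probability that a randomly chosen positive is scored above a randomly chosen negative. The sketch below is a simple quadratic-time version; the function name and the sample scores are illustrative.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Three of the four (positive, negative) pairs are ordered correctly here.
example_auc = auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

A value of 0.5 corresponds to random ranking and 1.0 to a ranking that places every positive pair above every negative pair.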
Advisor  citizenOf  advisorOf 

Our method  0.854  0.654 
FSPG  0.822  0.647 
PCRW1  0.594  0.498 
PCRW2  0.752  0.613 
PCRW3  0.567  0.545 
PCRW4  0.525  0.569 
Table 2: Metapaths generated for similarity search on DBLP.
Metapaths
Venue → Paper → Author → Paper → Venue
Venue → Paper → Paper → Venue
Venue → Paper → Paper → Venue
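Table 3: Top-10 similarity search results for three query venues.
Ranking (Query) | KDD | ACL | VLDB
1 | ICML | COLING | SIGMOD
2 | SIGMOD | Computational Linguistics | ICDE
3 | ICDM | NAACL | TODS
4 | CIKM | EACL | TKDE
5 | VLDB | LREC | PODS
6 | SIGIR | EMNLP | VLDB Journal
7 | WWW | INTERSPEECH | EDBT
8 | Machine Learning | SIGIR | PVLDB
9 | TKDE | IJCAI | CIKM
10 | ICDE | AAAI | IS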
Our model is clearly better at predicting the links, and it generates only a limited number of metapaths. For instance, the model for the advisorOf link in YAGO contains only 13 metapaths. Moreover, these metapaths are highly relevant and serve as good explanations of the proximity links. For example, we find that the metapath expressing that a person is a strong influencer of another person is the best predictor of advisorOf, but other, longer paths are also highly relevant. Compared with PCRW with maximum length 2, our model has higher recall because it also detects important longer metapaths, for instance, Person → Award → Person. Knowledge bases such as YAGO often suffer from incompleteness. We can thus use these multi-order metapaths as inference rules to predict the direct links between entities.
Whereas considering direct links is common when querying large scale networks, multi-order metapaths can improve the accuracy of query results. For instance, in advisorOf prediction, only one type of direct links between people and country exists, namely, influencer. Leveraging multi-order metapaths gives us a better chance of performing good link prediction when the direct links are missing. The results show the power of our HINI framework for large scale inference in HINs.
Efficiency Study
Figure 3 presents the running time of our method compared to PCRW models with fixed length, when varying the number of example pairs given as input. It can be observed that, in general, the running time of the algorithm increases sublinearly with the number of example pairs. The increase is due to the random walks that need to be performed concurrently for each example pair, while the number of metapaths in the model does not increase at the same rate. In particular, the algorithm is faster than models with paths longer than 2 by up to two orders of magnitude. The models with short path lengths have limited predictive power, despite their better running time. In comparison, our algorithm is capable of generating long and meaningful metapaths and of performing efficient inference.
4.2 HINI for Similarity Search
In this subsection, we mainly focus on user study of our HINI framework via a typical HIN task, i.e., similarity search, to get a better understanding of why our method is effective.
Task Description
In general, similarity search aims to find entities similar to a given query entity in an HIN. Formally, similarity search can be regarded as a random walk that is implemented by commuting matrix manipulation for a metapath, which generates the similarity score of each target entity. The commuting matrix is defined as follows.
Definition 5
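Commuting matrix. Given a network $G = (\mathcal{V}, \mathcal{E})$ and its network schema $T_G$, a commuting matrix $M_{\mathcal{P}}$ for a metapath $\mathcal{P} = (A_1 - A_2 - \cdots - A_{l+1})$ is defined as $M_{\mathcal{P}} = W_{A_1 A_2} W_{A_2 A_3} \cdots W_{A_l A_{l+1}}$, where $W_{A_i A_j}$ is the adjacency matrix between types $A_i$ and $A_j$. $M_{\mathcal{P}}(i, j)$ represents the number of path instances between objects $x_i$ and $y_j$, where $\phi(x_i) = A_1$ and $\phi(y_j) = A_{l+1}$, under metapath $\mathcal{P}$.
Given the commuting matrix $M_{\mathcal{P}}$, we can infer: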
(7) 
Thus, in general, if an entry of the commuting matrix is large, the similarity between the two corresponding entities based on the metapath is large.
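The commuting matrix computation and the resulting ranking can be illustrated with a small pure-Python sketch. The venue names are real, but the adjacency matrices (which papers appear at which venue, which authors wrote which paper) are made up, and the helper names are hypothetical. The metapath used is Venue → Paper → Author → Paper → Venue.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(m):
    return [list(row) for row in zip(*m)]


def commuting_matrix(adjacency_chain):
    """Product of the adjacency matrices along a metapath."""
    result = adjacency_chain[0]
    for w in adjacency_chain[1:]:
        result = matmul(result, w)
    return result


def rank_similar(matrix, names, query):
    """Other entities sorted by their score in the query's row, highest first."""
    row = matrix[names.index(query)]
    others = [(score, name) for score, name in zip(row, names) if name != query]
    return [name for score, name in sorted(others, key=lambda t: -t[0])]


VENUES = ["KDD", "ICDM", "SIGMOD"]
# Made-up toy data: 3 venues x 4 papers, and 4 papers x 3 authors.
W_VP = [[1, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
W_PA = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]

M = commuting_matrix([W_VP, W_PA, transpose(W_PA), transpose(W_VP)])
ranking = rank_similar(M, VENUES, "KDD")
```

Here KDD and ICDM share one author in the toy data, so ICDM is ranked above SIGMOD for the query KDD; the raw counts could be normalized if a bounded similarity is preferred.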
Dataset
Most similarity search research in HINs is evaluated on the DBLP dataset, so we also use it for our experiments. DBLP is a bibliographic information network that is frequently used in the study of heterogeneous networks. We use the subset of DBLP used in [20], which contains scientific papers in four areas: databases, data mining, artificial intelligence, and information retrieval. The dataset has four classes of nodes: Paper, Author, Topic, and Venue. It also has four edge types between the different entity types. In total, the subset contains 14,376 papers, 14,475 authors, 8,920 topics, and 20 venues, connected by 170,794 links.
Case Study of Similarity Search Results
We select ten groups of similar venues according to different areas in DBLP, and use them as input example pairs to generate the metapaths. We then compute the commuting matrices for the metapaths. Given a query venue, we rank the venues based on the scores in the corresponding row of the commuting matrix of a metapath. In this way, we find the venues most similar to the query.
In Table 2 and Table 3, we show the generated metapaths and the search results based on them. From the results, we find that 1) the metapaths carry rich proximity information between the given pairs, and 2) all the search results are very relevant to the query. This gives insight into why our HIN inference framework is able to handle automatic metapath generation and effective inference. The two HIN tasks suggest that the proposed inference framework is not limited to these tasks and can be effective in other HIN tasks, such as recommendation and classification.
5 Related Work
5.1 Knowledge Base/Network Inference
Although there is a great deal of recent research on extracting knowledge from text [1, 5, 19, 17, 3, 36], much less progress has been made on the problem of drawing reliable inferences from this imperfectly extracted knowledge. In particular, traditional logical inference methods are too brittle to make complex inferences from automatically extracted knowledge, and probabilistic inference methods [18] suffer from scalability problems, since they cannot generate inference rules directly and use the full rule set to perform inference. Recently, Lao et al. [11] provided a probabilistic way to perform random walk inference; however, it is still costly since it needs to enumerate the network to generate the metapaths. In contrast, we carefully design a tree structure that generates the metapaths efficiently.
5.2 MetaPath Generation
The first approach regarding to automatic metapath generation is proposed by [11]
. Their solution enumerate all the metapaths within a fixed length L. However, it is not clear how L should be set. More importantly, L can significantly affect the metapaths generated: (a) if L is large, then many redundant metapaths may be returned, leading to curseof dimensionality effects; and (b) if L is small, important metapaths with length larger than L might be missed. Our experiments have shown that the running time of the metapath generation process grows exponentially with length L. Moreover, the accuracy can also drop with increase in L. Recent studies also propose efficient metapath generation algorithms based on localized random walk
[28, 30, 31, 34]. Recently, Meng et al. [14] propose to leverage positive and negative examples to generate metapaths. However, our solution does not leverage the negative examples, which is randomly generated and would lead to biased inference.5.3 Link Prediction
Since the main focus of our work is metapath generation, we use link prediction only to quantify our advantage over existing metapath generation methods. Compared with [16], our method predicts the relationships between entities rather than the types of entities. The method in [15] uses the factorization of a three-way tensor to perform relational learning; it considers only simple node types and is not applicable to our experiments, which involve both complex node classes and edge types. In addition, our method aims to support efficient link prediction via HINI with random walk.
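To make the role of the length bound L from Section 5.2 concrete in a link prediction setting, the following minimal sketch enumerates every metapath of length at most L that connects a source entity to a target entity. The toy network, the names `NODE_TYPE`, `EDGES`, `build_adjacency` and `enumerate_metapaths`, and the `^-1` notation for inverse relations are all illustrative assumptions; this is a brute-force enumeration in the spirit of the fixed-length approach discussed above, not the tree-based method proposed in this paper.

```python
from collections import defaultdict

# Hypothetical toy HIN: node -> entity type.
NODE_TYPE = {
    "a1": "Author", "a2": "Author",
    "p1": "Paper", "p2": "Paper",
    "v1": "Venue",
}

# Hypothetical typed edges: (source node, relation type, target node).
EDGES = [
    ("a1", "writes", "p1"),
    ("a2", "writes", "p1"),
    ("a2", "writes", "p2"),
    ("p1", "published_in", "v1"),
    ("p2", "published_in", "v1"),
]


def build_adjacency(edges):
    """Index edges in both directions; inverse relations get a '^-1' suffix."""
    adj = defaultdict(list)
    for u, rel, v in edges:
        adj[u].append((rel, v))
        adj[v].append((rel + "^-1", u))
    return adj


def enumerate_metapaths(adj, node_type, source, target, max_len):
    """Return the set of metapaths realised by at least one walk of
    length 1..max_len from source to target.

    A metapath is a tuple alternating entity types and relation types,
    e.g. ("Author", "writes", "Paper", "writes^-1", "Author").
    """
    found = set()

    def walk(node, metapath, depth):
        if node == target and depth > 0:
            found.add(metapath)
        if depth == max_len:
            return
        for rel, nxt in adj[node]:
            walk(nxt, metapath + (rel, node_type[nxt]), depth + 1)

    walk(source, (node_type[source],), 0)
    return found


adj = build_adjacency(EDGES)
for max_len in (2, 4, 6):
    paths = enumerate_metapaths(adj, NODE_TYPE, "a1", "a2", max_len)
    print(max_len, len(paths))
```

On this toy network the number of distinct metapaths between the two authors is 1, 3 and 7 for L = 2, 4 and 6. Even with three entity types the candidate set roughly doubles every two steps, which illustrates why an unconstrained enumeration becomes expensive and why many of the returned metapaths are redundant.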
5.4 Similarity Search
Similarity measures have been an active research topic for years. They can be broadly categorized into two types: entity similarity measures and relation similarity measures. Measures such as SimRank [7], PRank [37], PathSim [20], PCRW [11] and RoleSim [9] capture entity similarity. Recent studies also focus on similarity search in schema-rich networks [33, 35, 29, 32]. Recent studies on entity similarity also find rules and metapaths very useful. The path ranking algorithm [11], rule mining [6] and metapath generation [14] have demonstrated the effectiveness of mined rules or metapaths for link-prediction-like tasks based on entity similarity, whereas our work leverages the efficient HINI random walk inference method to perform similarity search.
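As a small illustration of how two of the metapath-based entity similarity measures mentioned above behave, the sketch below computes PathSim and a path-constrained random walk (PCRW) score along the Author-Paper-Author metapath. It follows the commonly stated definitions (PathSim normalizes the path count between two entities by their self path counts; PCRW multiplies row-normalized transition matrices along the metapath). The toy authorship matrix `WRITES` and all function names are assumptions made for this example, and the code is not the HINI method of this paper.

```python
def transpose(m):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def row_normalize(m):
    """Turn each row into a probability distribution (all-zero rows stay zero)."""
    out = []
    for row in m:
        total = sum(row)
        out.append([x / total if total else 0.0 for x in row])
    return out


# Hypothetical authorship matrix: rows are authors (a1, a2),
# columns are papers (p1, p2). a1 wrote p1; a2 wrote p1 and p2.
WRITES = [[1, 0],
          [1, 1]]


def pathsim_apa(writes, i, j):
    """PathSim along Author-Paper-Author: 2*M[i][j] / (M[i][i] + M[j][j]),
    where M counts the path instances between authors.
    Assumes both authors have at least one paper."""
    m = matmul(writes, transpose(writes))
    return 2 * m[i][j] / (m[i][i] + m[j][j])


def pcrw_apa(writes, i, j):
    """Probability that a random walk constrained to Author-Paper-Author
    and starting at author i ends at author j."""
    author_to_paper = row_normalize(writes)
    paper_to_author = row_normalize(transpose(writes))
    return matmul(author_to_paper, paper_to_author)[i][j]


print(pathsim_apa(WRITES, 0, 1))
print(pcrw_apa(WRITES, 0, 1), pcrw_apa(WRITES, 1, 0))
```

Here PathSim is symmetric (2/3 in both directions), while the PCRW score is 0.5 from a1 to a2 but 0.25 from a2 to a1, because a2 spreads its walk over two papers. This asymmetry is one reason the choice of measure matters for similarity search.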
6 Conclusion
In this paper, we study the problem of large scale HIN inference, which is important and has broad applications (e.g., link prediction and similarity search). We propose a metapath constrained inference framework under which many existing inference methods can be unified. In particular, we propose an efficiency-optimized, metapath tree constrained random walk inference method for HIN inference, which is weakly supervised to model the inference target and is approximated in time independent of the network size. We conduct experiments on two large scale HINs, and the proposed inference method demonstrates its effectiveness and efficiency compared with state-of-the-art inference methods on typical HIN tasks (link prediction and similarity search). The effectiveness of the inference method is not limited to these two tasks; in the future, we plan to apply our method to more real world applications (e.g., NLP tasks [26, 25, 24, 27]).
References
 [1] (2000) Snowball: extracting relations from large plain-text collections. In Proceedings of the fifth ACM conference on Digital libraries, pp. 85–94. Cited by: §5.1.
 [2] (1998) On the approximability of minimizing nonzero variables or unsatisfied relations in linear systems. Theoretical Computer Science 209 (1), pp. 237–260. Cited by: §3.2.
 [3] (2007) Open information extraction from the web.. In IJCAI, Vol. 7, pp. 2670–2676. Cited by: §5.1.
 [4] (2012) Structured learning with constrained conditional models. Machine Learning 88 (3), pp. 399–431. Cited by: §2.2.
 [5] (2005) Unsupervised named-entity extraction from the web: an experimental study. Artificial intelligence 165 (1), pp. 91–134. Cited by: §5.1.
 [6] (2013) Amie: association rule mining under incomplete evidence in ontological knowledge bases. In WWW, pp. 413–422. Cited by: §5.4.
 [7] (2002) SimRank: a measure of structural-context similarity. In KDD, pp. 538–543. Cited by: §5.4.
 [8] (2010) Graph regularized transductive classification on heterogeneous information networks. In ECML, pp. 570–586. Cited by: §1.
 [9] (2011) Axiomatic ranking of network role similarity. In KDD, pp. 922–930. Cited by: §5.4.

 [10] (1996) Toward optimal feature selection. Cited by: §3.2.
 [11] (2011) Random walk inference and learning in a large scale knowledge base. In EMNLP, pp. 529–539. Cited by: §1, §2.2, §3.1, §4.1, §4.1, §5.1, §5.2, §5.4.
 [12] (2011) Random walk inference and learning in a large scale knowledge base. In EMNLP, pp. 529–539. Cited by: §1.
 [13] (2011) Link prediction in complex networks: a survey. Physica A: Statistical Mechanics and its Applications 390 (6), pp. 1150–1170. Cited by: §1.
 [14] (2015) Discovering metapaths in large heterogeneous information networks. In Proceedings of the 24th International Conference on World Wide Web, pp. 754–764. Cited by: §1, §3.1, §3.2, §3.2, §3.2, §4.1, §5.2, §5.4.
 [15] (2011) A three-way model for collective learning on multi-relational data. In Proceedings of the 28th international conference on machine learning (ICML-11), pp. 809–816. Cited by: §5.3.
 [16] (2012) Factorizing yago: scalable machine learning for linked data. In Proceedings of the 21st international conference on World Wide Web, pp. 271–280. Cited by: §5.3.
 [17] (2006) Espresso: leveraging generic patterns for automatically harvesting semantic relations. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics, pp. 113–120. Cited by: §5.1.
 [18] (2006) Markov logic networks. Machine learning 62 (1–2), pp. 107–136. Cited by: §5.1.
 [19] (2006) Semantic taxonomy induction from heterogenous evidence. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics, pp. 801–808. Cited by: §5.1.
 [20] (2011) PathSim: meta path-based top-k similarity search in heterogeneous information networks. VLDB. Cited by: §1, §1, §2.2, §4.2, §5.4.
 [21] (2011) PathSim: meta path-based top-k similarity search in heterogeneous information networks. PVLDB, pp. 992–1003. Cited by: §2.1.
 [22] (2003) Link prediction in relational data. In NIPS, Cited by: §1.
 [23] (1994) The minimum feature set problem. Neural Networks 7 (3), pp. 491–494. Cited by: §3.2.
 [24] (2017) CROWD-IN-THE-LOOP: A hybrid approach for annotating semantic roles. In EMNLP, pp. 1913–1922. Cited by: §6.
 [25] (2017) Active learning for black-box semantic role labeling with neural factors. In IJCAI, pp. 2908–2914. Cited by: §6.
 [26] (2013) Paraphrasing adaptation for web search ranking. In ACL, pp. 41–46. Cited by: §6.
 [27] (2019) Language models with transformers. CoRR abs/1904.09408. Cited by: §6.
 [28] (2015) Incorporating world knowledge to document clustering via heterogeneous information networks. In SIGKDD, pp. 1215–1224. Cited by: §5.2.
 [29] (2017) Distant metapath similarities for text-based heterogeneous information networks. In CIKM, pp. 1629–1638. Cited by: §5.4.
 [30] (2015) KnowSim: A document similarity measure on structured heterogeneous information networks. In ICDM, pp. 1015–1020. Cited by: §5.2.
 [31] (2016) Text classification with heterogeneous information network kernels. In AAAI, Cited by: §1, §5.2.
 [32] (2018) Unsupervised metapath selection for text similarity measure based on heterogeneous information networks. Data Min. Knowl. Discov. 32 (6), pp. 1735–1767. Cited by: §5.4.
 [33] (2015) Constrained information-theoretic tripartite graph clustering to identify semantically similar relations. In IJCAI, pp. 3882–3889. Cited by: §5.4.
 [34] (2016) World knowledge as indirect supervision for document clustering. TKDD 11 (2), pp. 13:1–13:36. Cited by: §5.2.
 [35] (2016) RelSim: relation similarity search in schema-rich heterogeneous information networks. In SDM, pp. 621–629. Cited by: §5.4.
 [36] (2007) Textrunner: open information extraction on the web. In Proceedings of Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, pp. 25–26. Cited by: §5.1.
 [37] (2009) Prank: a comprehensive structural similarity measure over information networks. In CIKM, pp. 553–562. Cited by: §5.4.