1. Introduction
As is well known, networks can be used to model many real systems, such as biological systems and social media. As a result, network analysis has become a hot research topic in the field of data mining. Many researchers are concerned with information networks whose components are single-typed; such a network is called a homogeneous information network. However, real information networks usually consist of interconnected, multi-typed components. This kind of information network is generally called a Heterogeneous Information Network (HIN). Mining heterogeneous information networks has attracted considerable attention from researchers.
Measuring the similarity between objects plays a fundamental role in heterogeneous information network mining tasks. Most of the existing metrics depend on user-specified metapaths or metastructures. For example, PathSim (Sun et al., 2011) and Biased Path Constrained Random Walk (BPCRW) (Lao and Cohen, 2010b, a) take a user-specified metapath as input, and Biased Structure Constrained Subgraph Expansion (BSCSE) (Huang et al., 2016) takes a user-specified metastructure as input. We investigate these metrics in depth and find that they are sensitive, to some degree, to the pre-specified metapaths or metastructures. Because of this sensitivity, users must know how to select an appropriate metapath or metastructure, which is quite difficult for a non-proficient user. For example, a biological information network may contain many different types of objects (Chen et al., 2010; Fu et al., 2016), and it is hard for a new user to know which metapaths or metastructures are appropriate. In addition, according to (Huang et al., 2016), metapaths can only capture biased and relatively simple semantics. Therefore, those authors proposed the metastructure in order to capture more complex semantics. In fact, the metastructure can only capture biased semantics as well. Metapaths and metastructures are essentially two kinds of schematic structures.
In this paper, we are concerned with a robust, semantic-rich similarity between objects in heterogeneous information networks. We are inspired by the construction of the subtree pattern proposed in (Shervashidze et al., 2011). In essence, the subtree pattern is a quasi-spanning tree of a graph. The difference between a traditional spanning tree and the subtree pattern is that, in the latter, nodes can be revisited while traversing the graph. This means that we can construct a schematic structure by repetitively visiting the object types while traversing the network schema of the HIN. This schematic structure, called the Recurrent Meta Structure (RecurMS), can be constructed automatically. In addition, it can capture rich semantics because it is composed of many recurrent metapaths and recurrent metatrees.
The metapath and the metastructure are essentially two kinds of composite relations because they are composed of object types with different layer labels. Their commuting matrices are employed to extract the semantics encapsulated in them. In essence, the proposed RecurMS has the same property as the metapath and metastructure because all of them have hierarchical structures. Therefore, the commuting matrix can be employed here to extract the semantics encapsulated in the RecurMS. However, the structure of the RecurMS imposes such strong restrictions on the object types that the similarity is nonzero only between an object and itself, and is zero between different objects. That is, the object types are tightly coupled. To decouple the object types, we decompose the proposed schematic structure into different recurrent metapaths and recurrent metatrees, and then define the commuting matrices of the recurrent metapaths and metatrees analogously to those of metapaths and metastructures. The Recurrent Meta-Structure-based Similarity (RMSS) is then defined as the weighted sum of all these commuting matrices. The proposed RMSS is robust to different schematic structures, i.e., metapaths or metastructures, because its structure integrates all the possible metapaths and metastructures. To evaluate the importance of different recurrent metapaths and metatrees, two weighting strategies, a local one and a global one, are proposed. The weighting strategies consider the sparsity and strength of different recurrent metapaths and metatrees in the HIN. The experimental evaluations on three real datasets reveal that the existing metrics are sensitive to different metapaths or metastructures, and that the proposed RMSS outperforms the existing metrics in ranking and clustering tasks.
The main contributions are summarized as follows. 1) We propose the recurrent metastructure, which combines all the metapaths and metastructures. The RecurMS can be constructed automatically. In order to decouple the object types, the RecurMS is decomposed into several recurrent metapaths and metatrees. 2) We define the commuting matrices of the recurrent metapaths and metatrees, and propose two weighting strategies to determine the weights of the different recurrent metapaths and metatrees. The proposed robust RMSS is defined as the weighted sum of all these commuting matrices. 3) The experimental evaluations reveal that the proposed RMSS outperforms the baselines in ranking and clustering tasks, and that the existing metrics are sensitive to different metapaths and metastructures.
The rest of the paper is organized as follows. Section 2 introduces related work. Section 3 provides some preliminaries on HINs. Section 4 provides an approach to decomposing the recurrent metastructure into several recurrent metapaths and recurrent metatrees. Section 5 introduces the definition of RMSS. The experimental evaluations are presented in Section 6, and the conclusion is given in Section 7.
2. Related Work
Similarity measures play a fundamental role in network analysis and can be applied to many areas, e.g., clustering, recommendation and Web search. At first, only feature-based similarity measures were proposed, e.g., cosine similarity, the Jaccard coefficient, Euclidean distance and Minkowski distance (J. Han and Pei, 2012). However, feature-based similarity measures ignore the link information in networks. Afterwards, researchers realized the importance of links in measuring the similarities between vertices, and proposed link-based similarity measures (Jeh and Widom, 2002, 2003; Shi et al., 2017). Jeh and Widom (Jeh and Widom, 2002) proposed a general similarity measure that incorporates the link information. They argued that two similar objects must relate to similar objects. Chen and Giles (Chen and Giles, 2015) discovered that SimRank and its variants in homogeneous networks fail to capture similar node pairs under certain conditions, and therefore proposed the new similarity measures ASCOS and ASCOS++ to address this problem. Jeh and Widom (Jeh and Widom, 2003) evaluated the similarities of objects with a random walk model with restart. Martinez et al. (Martinez et al., 2016) summarized the existing work on link prediction, including many state-of-the-art similarity measures in homogeneous information networks. Wang et al. (Wang et al., 2016) proposed a socialized word embedding algorithm that integrates users' personal characteristics and social relationships on social media. Lao and Cohen (Lao and Cohen, 2010c) proposed a novel learnable proximity measure defined as a weighted combination of simple “path experts”, each following a particular sequence of labeled edges.
This paper is concerned with robust and semantic-rich similarity measures in heterogeneous information networks. To the best of our knowledge, Sun et al. (Sun et al., 2009a) proposed the bi-type information network, and integrated clustering and ranking to analyze it. In (Sun et al., 2009b), the authors extended the bi-type information network to the heterogeneous information network with a star network schema and studied ranking-based clustering on it.
The surveys (Sun and Han, 2012; Shi et al., 2017) give a comprehensive summary of research topics on HINs, including similarity measure, clustering, classification, link prediction, ranking, recommendation, information fusion and other applications. Measuring the similarities between objects is a fundamental problem in HINs. Similarity measures in HINs must integrate the rich semantics as well as the structural information. This is the prominent difference between similarity measures in HINs and those in homogeneous information networks. Below, we introduce the similarity and relevance measures in HINs in turn.
(Similarity Measure in HINs) Sun et al. (Sun et al., 2011) employed the commuting matrix of a metapath to define the metapath-based similarity PathSim in HINs. The work in (U. et al., 2014) revisited the definition of PathSim and overcame its drawback, i.e., omitting some supportive information. Lao and Cohen (Lao and Cohen, 2010b, a) proposed a Path Constrained Random Walk (PCRW) model to evaluate entity similarity in labeled directed graphs. This model can be applied to measuring the similarity between objects in HINs. Meng et al. (Meng et al., 2014) proposed a novel similarity measure, AvgSim, which provides a unified framework for measuring the similarity of same-typed or different-typed object pairs. Usman et al. (Usman and Oseledets, 2015) employed tensor techniques to measure the similarity between objects in HINs. Wang et al. (Wang et al., 2012) merged two different topics, influence maximization and similarity measure, so that they reinforce each other and yield better and more meaningful results. Yu et al. (Yu et al., 2012) employed a metapath-based ranking model ensemble to represent semantic meanings for similarity queries, and exploited user guidance to understand users' queries. Xiong et al. (Xiong et al., 2015) studied the problem of obtaining similar object pairs based on user-specified join paths. Zhang et al. (Zhang et al., 2015) proposed a structure-based similarity measure, NetSim, to efficiently compute the similarity between centers in HINs with an x-star network schema. Wang et al. (Wang et al., 2017) proposed a distant metapath similarity, which can capture semantics between two distant objects, to provide more meaningful entity proximity. Zhou et al. proposed a semantic-rich stratified-metastructure-based similarity measure, SMSS, by integrating all of the commuting matrices of the metapaths and metastructures in HINs. The stratified metastructure can be constructed automatically, and therefore SMSS does not depend on any user-specified metapaths or metastructures. Zhang et al. (Zhang et al., 2018) proposed a general similarity measure, HeteRank, which integrates the multi-relationships between objects to find underlying similar objects.
(Relevance Measure in HINs) Shi et al. (Shi et al., 2014a) extended the similarity measure in HINs to a relevance measure that can be used to evaluate the relatedness of two objects of different types. For a user-specified metapath, their method is based on pairwise random walks from its two endpoints to its center. Gupta et al. (Gupta et al., 2015) proposed a new metapath-based relevance measure in HINs, which is a semi-metric and incorporates the path semantics by following the user-specified metapath. Bu et al. (Bu et al., 2014) proposed a two-phase process for top-k relevance search in HINs. The first phase obtains the initial relevance scores based on the pairwise path-constrained random walk, and the second phase takes user preference into consideration to combine all the relevance matrices. Xiong et al. (Li et al., 2014) proposed an optimization algorithm, LSHHeteSim, to capture drug-target interactions in heterogeneous biological networks. The work in (Wang et al., 2011) proposed a novel approach to modeling user interest from heterogeneous data sources with distinct but unknown importance, which seeks a scalable relevance model of user interest. Zhu et al. (Zhu et al., 2015) proposed a relevance search measure, SignSim, based on signed metapath factorization in signed HINs.
3. Preliminaries
In this section, we introduce the definition of HINs and some important concepts, e.g., network schema, metapaths and metastructures. The network schema of a HIN is essentially a template guiding the generation of the HIN. Metapaths and metastructures are two kinds of schematic structures that capture the semantics encapsulated in HINs.
3.1. The HIN Model
Definition 3.1 ().
(Heterogeneous Information Network) An information network (Shi et al., 2014b) is a directed graph $G=(V,E)$, where $V$ is a set of objects and $E$ is a set of links. $\mathcal{A}$ and $\mathcal{R}$ respectively denote the set of object types and the set of link types. $G$ is called a heterogeneous information network (HIN) if $|\mathcal{A}|>1$ or $|\mathcal{R}|>1$. Otherwise, it is called a homogeneous information network.
A heterogeneous information network, as defined in Definition 3.1, consists of multi-typed objects and their interconnected relations. Every object $v \in V$ belongs to an object type in $\mathcal{A}$, and every link $e \in E$ belongs to a link type in $\mathcal{R}$. In essence, a link type represents a relation from its source object type to its target object type. If two links belong to the same link type, they share the same starting object type as well as the same ending object type.
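As a minimal sketch of Definition 3.1, the following Python snippet stores typed objects and typed links and checks whether the network is heterogeneous. The class `TypedNetwork`, its method names and the toy data are hypothetical and introduced only for illustration; the check assumes the usual reading of the definition, namely that a network is heterogeneous when it has more than one object type or more than one link type.

```python
from dataclasses import dataclass, field


@dataclass
class TypedNetwork:
    """Toy directed network with typed objects and typed links (illustrative only)."""
    object_type: dict = field(default_factory=dict)  # object -> object type
    links: list = field(default_factory=list)        # (source, target, link type)

    def add_object(self, obj, obj_type):
        self.object_type[obj] = obj_type

    def add_link(self, source, target, link_type):
        # Both endpoints must already be registered with a type.
        if source not in self.object_type or target not in self.object_type:
            raise KeyError("both endpoints must be added before linking them")
        self.links.append((source, target, link_type))

    def is_heterogeneous(self):
        # Assumed condition: more than one object type or more than one link type.
        n_object_types = len(set(self.object_type.values()))
        n_link_types = len({link_type for _, _, link_type in self.links})
        return n_object_types > 1 or n_link_types > 1


# Hypothetical bibliographic fragment: one author, one paper, one venue.
bib = TypedNetwork()
bib.add_object("Jiawei Han", "Author")
bib.add_object("PathSim", "Paper")
bib.add_object("VLDB", "Venue")
bib.add_link("Jiawei Han", "PathSim", "writes")
bib.add_link("PathSim", "VLDB", "published_in")

# Hypothetical homogeneous network: authors linked by a single relation.
coauthor = TypedNetwork()
coauthor.add_object("Jiawei Han", "Author")
coauthor.add_object("Yizhou Sun", "Author")
coauthor.add_link("Jiawei Han", "Yizhou Sun", "coauthor")
```

Here `bib.is_heterogeneous()` is true because three object types are present, whereas `coauthor.is_heterogeneous()` is false.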
Fig. 1 shows an illustrative bibliographic information network with four object types, i.e., Author (A), Paper (P), Venue (V) and Term (T). The type Author contains four instances: Yizhou Sun, Jiawei Han, Philip S. Yu, and Jie Tang. The type Venue contains four instances: VLDB, AAAI, KDD and TKDE. The type Paper contains six instances: PathSim (Sun et al., 2011), GenClus (Sun et al., 2012), RAIN (Yang et al., 2015), TPFG (Wang et al., 2010), SpiderMine (Zhu et al., 2011) and HeteSim (Shi et al., 2014a). The type Term contains seven instances: Pattern, Information, Mining, Social, Clustering, Similarity and Network. Each paper published at a venue must have its authors and its related terms. Hence, the network contains three types of links: Paper-Author, Paper-Venue and Paper-Term.
Definition 3.2 ().
(Network Schema) The network schema (Shi et al., 2014b) of $G$ is a directed graph consisting of the object types in $\mathcal{A}$ and the link types in $\mathcal{R}$.
The network schema, as defined in Definition 3.2, provides a meta-level description of the HIN. The link types in $\mathcal{R}$ are essentially relations from source object types to target object types. Fig. 2(a) shows the network schema for the HIN in Fig. 1. Many biological networks can be modeled as HINs as well. In this paper, we use a biological information network with six object types, Gene, Tissue, GeneOntology, ChemicalCompound, Substructure and SideEffect, as an example. It contains five link types: Gene-Tissue, Gene-GeneOntology, Gene-ChemicalCompound, ChemicalCompound-Substructure and ChemicalCompound-SideEffect. Its network schema is shown in Fig. 2(b).
3.2. Meta Paths and Meta Structures
There are rich semantics in HIN . These semantics can be captured by metapaths, metastructures or even more complicated schematic structures in .
Definition 3.3 ().
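(MetaPath) The metapath (Sun et al., 2011) is an alternating sequence of object types and link types. It can be denoted by $\mathcal{P} = A_1 \xrightarrow{R_1} A_2 \xrightarrow{R_2} \cdots \xrightarrow{R_l} A_{l+1}$, where $A_i \in \mathcal{A}$ and $R_i \in \mathcal{R}$.
In Definition 3.3, $R_i$ is a link type starting from $A_i$ to $A_{i+1}$. In essence, $R_i$ is a relation from $A_i$ to $A_{i+1}$. The metapath $\mathcal{P}$ is essentially a composite relation $R_1 \circ R_2 \circ \cdots \circ R_l$. That is, the metapath can capture rich semantics contained in the HINs. Throughout the paper, the metapath $\mathcal{P}$ is compactly denoted as $(A_1, A_2, \ldots, A_{l+1})$ unless stated otherwise.
According to (Sun et al., 2011), there are some useful concepts related to $\mathcal{P}$. The length of $\mathcal{P}$ is equal to the number of link types, i.e., $l$. A path $p = (a_1, a_2, \ldots, a_{l+1})$ in the HIN is an instance of $\mathcal{P}$ if $a_i \in A_i$ and the link $(a_j, a_{j+1})$ belongs to $R_j$, where $1 \le i \le l+1$ and $1 \le j \le l$. In general, $p$ is called a path instance following $\mathcal{P}$. The metapath $A_{l+1} \xrightarrow{R_l^{-1}} A_l \xrightarrow{R_{l-1}^{-1}} \cdots \xrightarrow{R_1^{-1}} A_1$ is called the reverse metapath of $\mathcal{P}$, where $R_i^{-1}$ denotes the reverse relation of $R_i$ from $A_{i+1}$ to $A_i$. The reverse metapath of $\mathcal{P}$ is denoted by $\mathcal{P}^{-1}$. A metapath $\mathcal{P}$ is symmetric if $\mathcal{P} = \mathcal{P}^{-1}$. For the metapath $\mathcal{P}$, let $W_{A_iA_{i+1}}$ denote the relation matrix of the relation $R_i$, where $1 \le i \le l$. Its entry $W_{A_iA_{i+1}}(x,y) = 1$ if there is an edge from the object $x$ in $A_i$ to the object $y$ in $A_{i+1}$; otherwise it is equal to 0. The commuting matrix of the metapath $\mathcal{P}$ is defined in Definition 3.4. The commuting matrix of $\mathcal{P}^{-1}$ is equal to $M_{\mathcal{P}}^{T}$.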
Definition 3.4 ().
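(Commuting Matrix of the MetaPath) The commuting matrix of the metapath $\mathcal{P} = (A_1, A_2, \ldots, A_{l+1})$ is defined as
$$M_{\mathcal{P}} = W_{A_1A_2} \times W_{A_2A_3} \times \cdots \times W_{A_lA_{l+1}},$$
where $W_{A_iA_{i+1}}$ denotes the relation matrix from $A_i$ to $A_{i+1}$.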
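To make the definition concrete, here is a small NumPy sketch under the assumption that the commuting matrix of a metapath is the product of the relation matrices along it, as in PathSim (Sun et al., 2011). The function name `commuting_matrix` and the toy author-paper matrix are illustrative and not taken from the paper.

```python
import numpy as np


def commuting_matrix(relation_matrices):
    """Multiply the relation matrices along a metapath, from left to right."""
    result = relation_matrices[0]
    for w in relation_matrices[1:]:
        result = result @ w
    return result


# Hypothetical toy data: 3 authors x 2 papers; entry 1 if the author wrote the paper.
W_AP = np.array([[1, 0],
                 [1, 1],
                 [0, 1]])

# Metapath Author-Paper-Author: the Paper-to-Author relation matrix is the transpose.
M_APA = commuting_matrix([W_AP, W_AP.T])
```

Entry `M_APA[i, j]` counts the path instances between author `i` and author `j`, i.e., the number of papers they co-wrote. Because this metapath is symmetric, `M_APA` equals its own transpose.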
Different metapaths carry different semantics. The metapath Author-Paper-Author shown in Fig. 3(a) expresses the information “Two authors cooperate on a paper”. However, Huang et al. (Huang et al., 2016) pointed out that metapaths can only capture relatively simple and biased semantics. For example, the metapath Author-Paper-Venue-Paper-Author expresses the information “Two authors write papers published in the same venue”, but neglects the information “Two authors write papers containing the same term”. To overcome this issue, Huang et al. (Huang et al., 2016) proposed the metastructure.
Definition 3.5 ().
(MetaStructure) The metastructure (Huang et al., 2016) is a directed acyclic graph with a single source object type and a single target object type . is a set of object types, and is a set of link types.
The metastructure, as defined in Definition 3.5, can capture complex semantics. Fig. 3(b,c,d) shows three kinds of metastructures. For ease of presentation, these metastructures are denoted as , and respectively. The metastructure shown in Fig. 3(b) expresses the information “Two authors write papers that contain the same terms and appear in the same venue”, but ignores the information “Two authors cooperate on a paper.” That is, the metastructure can only capture biased semantics as well.
Given a metastructure with height , its object types are sorted in the topological order. For , let denote the set of object types on the layer , and denote the Cartesian product of the set of objects belonging to different types in . The relation matrix from to is defined as: the entry of is equal to 1 if the th element of is adjacent to the th one of in , otherwise it is equal to 0. and are adjacent if and only if for any and , and are adjacent in , and and are adjacent in . The commuting matrix of the metastructure is defined in the definition 3.6. Each entry in represents the number of instances following . The commuting matrix of its reverse is equal to .
Definition 3.6 ().
(Commuting Matrix of the MetaStructure) The commuting matrix of the metastructure is defined as
where denotes the relation matrix from to .
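The layer-wise construction can be sketched in NumPy for one specific metastructure, Author → Paper → {Venue, Term} → Paper → Author. This is only an illustration under stated assumptions: the layer {Venue, Term} is represented by all (venue, term) pairs, a paper is treated as adjacent to a pair exactly when it is adjacent to both members, and the commuting matrix is taken to be the product of the layer-to-layer relation matrices. The helper `relation_to_pairs` and all matrices are hypothetical toy data.

```python
import numpy as np


def relation_to_pairs(W_PV, W_PT):
    """Relation matrix from papers to (venue, term) pairs.

    Entry [p, (v, t)] is 1 iff paper p is linked to venue v AND to term t.
    Columns enumerate the Cartesian product Venue x Term, venue-major.
    """
    n_papers = W_PV.shape[0]
    return np.einsum("pv,pt->pvt", W_PV, W_PT).reshape(n_papers, -1)


# Hypothetical toy data: 2 authors, 2 papers, 2 venues, 2 terms.
W_AP = np.array([[1, 0],   # author 0 wrote paper 0
                 [0, 1]])  # author 1 wrote paper 1
W_PV = np.array([[1, 0],   # both papers appear in venue 0
                 [1, 0]])
W_PT = np.array([[1, 1],   # paper 0 contains terms 0 and 1
                 [0, 1]])  # paper 1 contains term 1

W_P_VT = relation_to_pairs(W_PV, W_PT)

# Product of the relation matrices between consecutive layers.
M_S = W_AP @ W_P_VT @ W_P_VT.T @ W_AP.T
```

In this toy network the two authors share exactly one (venue, term) pair, namely venue 0 with term 1, so the off-diagonal entries of `M_S` are 1. The number of columns of `W_P_VT` is the product of the sizes of the types in the layer, which is why such matrices can become very large.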
Both metapaths and metastructures need to be specified by users. In bibliographic information networks, it is comparatively easy for users to specify metapaths or metastructures. However, specifying them becomes very difficult in biological information networks, because in reality such a network contains many object types (Gene, Gene Ontology, Tissue, Chemical Compound, Chemical Ontology, Side Effect, Substructure, Pathway, Disease and Gene Family) and many relations. In Fig. 2(b), we give a biological network schema containing only six object types and five link types.
In this paper, we aim to define a robust, semantic-rich similarity measure in HINs. Formally, the problem takes a HIN and a source object as input, and outputs a vector whose entries denote the similarity between the source object and each target object.
4. Recurrent Meta Structure Construction and Decomposition
In this section, we introduce the architecture of the recurrent metastructure and an approach to decomposing the recurrent metastructure into several recurrent metapaths and recurrent metatrees.
4.1. Recurrent Meta Structure Construction
Before proceeding, we introduce an important concept, the augmented spanning tree of the network schema (see Definition 4.1). It is used in the process of constructing and decomposing the recurrent metastructure.
Definition 4.1 ().
(Augmented Spanning Tree) An augmented spanning tree of is a tree rooted at the source object type and containing all the link types in . denotes the set of object types in , and denotes the set of link types in . Note that contains the object types in and some of their duplicates, and contains the links types consisting of two object types in .
Now, we introduce the construction rule of the augmented spanning tree of the network schema. If the network schema is a tree, its augmented spanning tree is the network schema itself. If the network schema is not a tree, its augmented spanning tree is constructed from its spanning tree as follows. The spanning tree of the network schema is constructed using Breadth-First Search (BFS) starting from the source object type. We then traverse the spanning tree from top to bottom and from left to right. For the current object type in the traversal, if an edge incident to it in the network schema is not contained in the current spanning tree, we duplicate the object type at the other end of that edge and add an edge from the current object type to the copy in the current spanning tree.
We exemplify the construction of the augmented spanning tree when the network schema is not a tree. Suppose an edge is added to the network schema shown in Fig. 2(b). As a result, we get a new network schema shown in Fig. 4(a). Next, we show how to construct the augmented spanning tree for this network schema, see Fig. 4(b). Its spanning tree is enclosed by the dashed line frame. When we reach the node in the process of traversing, the edge incidental to is not contained in the spanning tree. So, we make a copy of the node and add an edge from to the copied .
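The construction rule can be sketched in Python as follows. This is an illustrative reading of the rule, not the authors' implementation: the schema is assumed to be given as an undirected adjacency list, a duplicated object type is labelled with a `#k` suffix, and the function name `augmented_spanning_tree` as well as the triangle-shaped example schema are hypothetical.

```python
from collections import deque


def augmented_spanning_tree(schema, source):
    """Return a dict mapping each tree node to the list of its children.

    schema: dict mapping an object type to the list of its neighbouring types.
    Duplicated object types appear as leaves named "<type>#<k>".
    """
    # Step 1: BFS spanning tree rooted at the source object type.
    children = {source: []}
    order = [source]          # BFS order = top to bottom, left to right
    covered = set()           # schema edges already present in the tree
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in schema[u]:
            if v not in children:
                children[v] = []
                children[u].append(v)
                covered.add(frozenset((u, v)))
                order.append(v)
                queue.append(v)

    # Step 2: for every schema edge not yet in the tree, attach a copy of the
    # neighbouring object type to the node that is currently being visited.
    copies = {}
    for u in order:
        for v in schema[u]:
            edge = frozenset((u, v))
            if edge not in covered:
                copies[v] = copies.get(v, 0) + 1
                copy_name = f"{v}#{copies[v]}"
                children[u].append(copy_name)
                children[copy_name] = []
                covered.add(edge)
    return children


# The bibliographic schema is a tree, so it is returned unchanged.
bib_schema = {"A": ["P"], "P": ["A", "V", "T"], "V": ["P"], "T": ["P"]}
bib_tree = augmented_spanning_tree(bib_schema, "A")

# Hypothetical schema containing a cycle X-Y-Z-X.
tri_schema = {"X": ["Y", "Z"], "Y": ["X", "Z"], "Z": ["X", "Y"]}
tri_tree = augmented_spanning_tree(tri_schema, "X")
```

For the triangle, BFS keeps the edges X-Y and X-Z; the remaining edge Y-Z is covered by attaching a copy `Z#1` below `Y`, so every link type of the schema appears exactly once in the resulting tree.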
Lemma 4.2 ().
Given a HIN , its network schema is denoted by . The augmented spanning tree of is denoted by . If one object type and its duplicate are not distinguished explicitly in , we have and .
Proof.
According to the construction rule of , obviously and because one object type and its duplicate are not distinguished explicitly in . ∎
According to lemma 4.2, the augmented spanning tree reformulates the network schema if the object types and their duplicates are thought of as the same elements. That is, a link type in is equal to one in if and only if they share the same endpoints or one endpoint of is a copy of one endpoint of . Below, we introduce the definition of the recurrent metastructure (RecurMS, see the definition 4.3), and describe the construction rule of the recurrent metastructure based on the augmented spanning tree of the network schema.
Definition 4.3 ().
(Recurrent Meta Structure) A recurrent metastructure is essentially a hierarchical graph consisting of object types with different layer labels. Formally, it is denoted as , where denotes the set of object types on the layer and denotes the set of link types in RecurMS.
RecurMS has two prominent advantages: (1) it is constructed automatically by repetitively visiting object types while traversing the network schema; (2) it combines all the metapaths and metastructures. Given a HIN, we first extract its network schema, and then select a source object type and a target object type. In this paper, we only consider the scenario in which the source object type is the same as the target one. The construction rule of the RecurMS is as follows. The source object type is placed on the 0th layer. The object types on layer $l+1$ are the neighbors, in the network schema, of the object types on layer $l$. Adjacent object types are linked by an arrow pointing from layer $l$ down to layer $l+1$. Repeating the above process, we obtain the RecurMS. It is noteworthy that an object type may appear in adjacent layers of the RecurMS if there exist cycles (or self-loops) in the network schema. In that case, one occurrence can be viewed as a copy of the other.
Fig. 6(a) shows the RecurMS of the network schema shown in Fig. 2(a). As shown in Fig. 5, it can be constructed as follows. Author is both the source and the target object type. Firstly, Author is placed on the 0th layer, see Fig. 5(a). Paper is placed on the 1st layer, because Paper is the only neighbor of Author in the network schema shown in Fig. 2(a), see Fig. 5(b). Author, Venue and Term are placed on the 2nd layer, because they are the neighbors of Paper, see Fig. 5(c). Similarly, Paper is again placed on the 3rd layer, because it is the neighbor of Author, Venue and Term, see Fig. 5(d). At this point, Paper is visited again. Repeating the above procedure, we obtain the RecurMS shown in Fig. 6(a). Fig. 7(a) shows the RecurMS of the network schema shown in Fig. 2(b), where Gene is both the source and the target object type. It is constructed in the same way as the one for the bibliographic network schema.
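The layer-by-layer rule can be sketched in Python as below. Since the RecurMS repeats indefinitely, the sketch truncates it after a chosen number of layers; the function name `recurms_layers`, the `num_layers` parameter and the single-letter type labels are assumptions made for illustration.

```python
def recurms_layers(schema, source, num_layers):
    """Build the first `num_layers` layers of a RecurMS.

    schema: dict mapping an object type to the list of its neighbouring types.
    Returns (layers, links): a list of sets of object types, and a list of
    (upper_layer_index, upper_type, lower_type) arrows between adjacent layers.
    """
    layers = [{source}]
    links = []
    for depth in range(num_layers - 1):
        next_layer = set()
        for obj_type in sorted(layers[-1]):
            for neighbour in schema[obj_type]:
                next_layer.add(neighbour)
                links.append((depth, obj_type, neighbour))
        layers.append(next_layer)
    return layers, links


# Bibliographic schema: Paper (P) is linked to Author (A), Venue (V) and Term (T).
bib_schema = {"A": ["P"], "P": ["A", "V", "T"], "V": ["P"], "T": ["P"]}
layers, links = recurms_layers(bib_schema, "A", num_layers=4)
```

With Author as the source, the layers alternate between a layer holding only Paper and a layer holding Author, Venue and Term, which is the repetition that gives the structure its name.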
According to definition 4.3, the recurrent metastructure consists of the object types with different layer labels and their relations in the network schema. Each layer is a set of object types. Below, we give some properties of in lemma 4.4. According to these properties, contains rich semantics.
Lemma 4.4 ().
Assume denotes the height of the augmented spanning tree . Without loss of generality, . denotes the set of object types on the layer of . has the following properties.

, where is the source object type;

and ;

contains all the metapaths and the metastructures.
Proof.
(1) According to the construction rule of , obviously .
(2) When , the object types in must be added to according to the construction rule of . In addition, there are some new object types in , e.g. some children of the object types in in . Therefore, . When , obviously . At this time, it is impossible for to contain some new object types because its layer label is larger than . Thus, .
(3) Any metapath can be compactly denoted by without loss of generality. According to the construction rule of , we have , and . Therefore, must be in . For metastructure, we can take same measures to prove. ∎
4.2. Recurrent Meta Structure Decomposition
This section provides some important concepts including star, pathstar tree, recurrent pathstar metastructure, recurrent metapath and recurrent metatree. The star, which is defined in the definition 4.5, is a special tree consisting of a center and its neighbors. The star is illustrated with Fig. 8(a). Its center is and the neighbors of is . The pathstar tree, which is defined in the definition 4.6, consists of a path and a star. The pathstar tree is illustrated with Fig. 8(b). Its path part is from to , its star part consists of the center and its neighbors . Throughout the paper, an infinite sequence is compactly denoted as
The recurrent pathstar metastructure is defined in the definition 4.7, and the recurrent metapath and metatree are defined in the definitions 4.8 and 4.9 respectively.
Definition 4.5 ().
(Star) A star, compactly denoted as , is a tree consisting of a center and its neighbors .
Definition 4.6 ().
(PathStar Tree) A pathstar tree, compactly denoted as is a rooted one consisting of a path and a star. In specific, the path, compactly denoted as , is from the pivotal vertex to the root , and the star is composed of the pivotal vertex and its children .
Definition 4.7 ().
(Recurrent PathStar MetaStructure) A recurrent pathstar metastructure, compactly denoted as
is a hierarchical structure consisting of a pathstar tree and its duplicates. It can be constructed by repetitively duplicating the star part of the pathstar tree. Note that each pivotal vertex except the first one is also connected to the root along the path .
Definition 4.8 ().
(Recurrent MetaPath) The recurrent pathstar metastructure is called a recurrent metapath if the pathstar tree is a single edge. It can be compactly denoted as
Definition 4.9 ().
(Recurrent MetaTree) A recurrent metatree is a hierarchical structure consisting of a path from the pivotal vertex to the root and one of children of the pivotal vertex. It can be compactly denoted as
Note that each pivotal vertex except the first one is also connected to the root along the path
The object types with different layer labels are tightly coupled in the RecurMS. To decouple them, we decompose the RecurMS. After obtaining the augmented spanning tree, we traverse its internal nodes from top to bottom and from left to right. Each current object type is treated as a pivot, a bridge connecting two components: (1) the path from the root (i.e., the source object type) to the pivot; (2) the star consisting of the pivot as the center and its children. Together they form a pathstar tree according to Definition 4.6. Then, we augment each pathstar tree by repetitively duplicating its star part, which consists of the pivotal object type and its children. Each duplicated pivotal object type is connected to the target object type by the path part of the pathstar tree. Finally, we obtain several recurrent pathstar metastructures of the RecurMS. In essence, the RecurMS can be viewed as the combination of these substructures. If the pathstar tree is a single edge, the recurrent pathstar metastructure generated by it is called a recurrent metapath.
Now, we formally describe the procedure of decomposing the RecurMS into several recurrent pathstar metastructures. As stated previously, the RecurMS can be denoted as . Without loss of generality, let . According to lemma 4.4, , i.e. , . Assume denotes the set of internal nodes of the augmented spanning tree , whose elements are listed in the order from the top to the bottom and from the left to the right. Obviously, the source object type is firstly selected as the pivot. As a result, we obtain a star consisting of the source object type and its children . At this time, we augment this star by repetitively duplicating and its children . As a result, the recurrent pathstar metastructure with as the pivot can be compactly denoted as
(1) 
For the pivot and , where , we should firstly calculate the path from to the root , denoted by . As a result, we obtain the recurrent pathstar metastructure compactly denoted by
(2) 
Note that all the pivots except the first one in Formula 2 are also linked to the path .
Here, we take the bibliographic network schema and the biological network schema, shown in Fig. 2(a,b), as examples to present how to generate the pathstar trees. For the bibliographic network schema, Author is selected as the source object type. Its augmented spanning tree rooted at Author is the network schema itself because the bibliographic network schema is a tree. For the biological network schema, Gene is selected as the source object type. Its augmented spanning tree rooted at Gene is the network schema itself because the biological network schema is a tree. After obtaining the augmented spanning trees, we traverse their internal nodes from top to bottom and from left to right. For the bibliographic network schema, the internal nodes are Author and Paper. When Author is treated as the pivot, the path from the root (Author itself) to Author is empty, and the star consists of Author as its center and its child Paper. When Paper is treated as the pivot, the path from the root to Paper is the edge between Author and Paper, and the star consists of Paper and its children Venue and Term. Their pathstar trees are shown in Fig. 9(a,b). For the biological network schema, the internal nodes are Gene and ChemicalCompound. Their pathstar trees are shown in Fig. 9(c,d). They are constructed in the same way as for the bibliographic network schema.
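The generation of pathstar trees can be sketched as a level-order traversal of the augmented spanning tree. The sketch assumes the tree is given as a mapping from each node to its children; the function name `path_star_trees` and the dictionary keys are hypothetical.

```python
from collections import deque


def path_star_trees(children, root):
    """List one pathstar tree per internal node, in top-down, left-to-right order.

    children: dict mapping a tree node to the list of its children.
    Each result holds the pivot, the path from the pivot up to the root,
    and the star part (the children of the pivot).
    """
    parent = {root: None}
    result = []
    queue = deque([root])
    while queue:
        pivot = queue.popleft()
        kids = children.get(pivot, [])
        for kid in kids:
            parent[kid] = pivot
            queue.append(kid)
        if kids:  # only internal nodes act as pivots
            path = []
            node = pivot
            while node is not None:
                path.append(node)
                node = parent[node]
            result.append({"pivot": pivot, "path_to_root": path, "star": list(kids)})
    return result


# Augmented spanning tree of the bibliographic schema rooted at Author.
bib_tree = {"Author": ["Paper"], "Paper": ["Venue", "Term"]}
trees = path_star_trees(bib_tree, "Author")
```

For the bibliographic schema this yields two pathstar trees: one with pivot Author, whose path part contains only the root, and one with pivot Paper, whose path part leads from Paper to Author.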
For the bibliographic network schema, the recurrent pathstar metastructures can be constructed as follows. The object types Author and Paper are respectively treated as the pivots. If Author is the pivot, it has only one child, Paper. Its pathstar tree is a single edge, see Fig. 9(a). We repetitively duplicate its star part, and finally obtain the recurrent metapath shown in Fig. 6(b). It is noteworthy that the path part of the pathstar tree is null in this case because the pivot is the source object type. If Paper is the pivot, it has two children, Venue and Term. Its pathstar tree is a tree, see Fig. 9(b). We repetitively duplicate its star part, and the pivot is linked to the target object type by the path part of the pathstar tree. Finally, we obtain a recurrent pathstar metastructure, see Fig. 6(c). The RecurMS shown in Fig. 6(a) can thus be decomposed into the recurrent metapath (see Fig. 6(b)) and the recurrent pathstar metastructure (see Fig. 6(c)).
For the biological network schema, its recurrent pathstar metastructure can be constructed as follows. The object types and are respectively treated as the pivot. If is the pivot, it has three children , and . Its pathstar tree is shown in Fig. 9(c). We repetitively duplicate its star part, and finally obtain a recurrent pathstar metastructure shown in Fig. 7(b). At this time, the path part of the pathstar tree is null because the pivot is the source object type. If is the pivot, it has two children and . Its pathstar tree is shown in Fig. 9(d). We repetitively duplicate it star part, and the pivot is linked to each target object type by the path part of the pathstar tree. Finally, we obtain a recurrent pathstar metastructure shown in Fig. 7(c). Obviously, the RecurMS shown in Fig. 7(a) can be decomposed into two recurrent pathstar metastructures, respectively shown in Fig. 7(b,c).
After obtaining the recurrent pathstar metastructures, we employ the commuting matrices of metapaths or metastructures to extract the semantics in them. For recurrent metapaths, this is comparatively easy. For recurrent pathstar metastructures that are not paths, the commuting matrices may be very large because the Cartesian product may yield a very large set. In this case, we further decompose the recurrent pathstar metastructures into several simpler substructures, called recurrent metatrees or recurrent metapaths. The decomposition rule is to consider each child of the pivotal object type separately.
The recurrent pathstar metastructure shown in Formula 1 can be decomposed into several recurrent metapaths as follows.
(3) 
The recurrent pathstar metastructure shown in Formula 2 can be decomposed into several recurrent metatrees as follows.
(4) 
Note that all the pivots except the first one in Formula 4 are also linked to the path .
For example, the recurrent pathstar metastructure shown in Fig. 6(c) can be decomposed into two recurrent metatrees, see Fig. 6(d,e). Fig. 6(d) considers only the object type , and Fig. 6(e) considers only the object type . Similarly, the recurrent pathstar metastructure shown in Fig. 7(b) is decomposed into three recurrent metapaths, see Fig. 7(d,e,f). The recurrent pathstar metastructure shown in Fig. 7(c) is decomposed into two recurrent metatrees, see Fig. 7(g,h).
The deep metapaths and deep metatrees shown in Formulas 3 and 4 are infinite sequences of object types. In essence, however, both consist of a finite number of ingredients. Specifically, a deep metapath consists of the source object type and one of its children , and a deep metatree consists of the path from the pivot up to and one of the children of . Algorithm 1 presents the pseudocode for decomposing the recurrent metastructure into deep metapaths or deep metatrees. In Algorithm 1, a recurrent metapath such as is succinctly denoted as , and a recurrent metatree such as is succinctly denoted as . Line 2 employs BFS to construct a spanning tree of rooted at . Lines 3-7 yield the augmented spanning tree of . Lines 8-15 traverse the nodes of from top to bottom and from left to right, and yield the deep metapaths and deep metatrees. The time complexity of Algorithm 1 is .
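The decomposition rule (one component per child of the pivot) can be sketched as follows. This is an illustrative sketch rather than a transcription of Algorithm 1: the input format, the dictionary keys, and the type names `S`, `X`, `Y`, `Z`, `U`, `W` are all hypothetical. Each component stores only the finite ingredients named above: the path from the source to the pivot, the pivot, and one child.

```python
def decompose(pathstar):
    """Split each pathstar tree into one component per child of its pivot.

    `pathstar` is a list of (path, pivot, star_children) tuples, where `path`
    is the list of edges from the source type down to the pivot.
    """
    components = []
    for path, pivot, star_children in pathstar:
        # An empty path means the pivot is the source type, so the component
        # is a recurrent metapath; otherwise it is a recurrent metatree.
        kind = "metapath" if not path else "metatree"
        for child in star_children:
            components.append(
                {"kind": kind, "path": path, "pivot": pivot, "child": child}
            )
    return components

# Hypothetical pathstar trees: a source S with three children, and a second
# pivot X (reached through the edge S - X) with two children.
pathstar = [
    ([], "S", ["X", "Y", "Z"]),
    ([("S", "X")], "X", ["U", "W"]),
]
components = decompose(pathstar)
for c in components:
    print(c["kind"], c["path"], c["pivot"], c["child"])
```

For this toy input the sketch produces three metapath components and two metatree components, the same counts as in the Fig. 7 example.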
5. Recurrent Meta Structure Based Similarity
This section defines the proposed semantic-rich similarity measure RMSS and presents the pseudocode of the algorithm for computing the similarity matrix. RMSS does not depend on any prespecified schematic structures, and is therefore robust to the choice of schematic structures. Throughout the paper, , which is defined in Definition 5.1, represents the normalized version of a matrix .
Definition 5.1 (Normalized Matrix). The normalization of a matrix is defined as
where is a diagonal matrix whose nonzero entries are equal to the row sum of .
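As a minimal sketch of this normalization, the snippet below assumes it is the usual row normalization, i.e. the inverse of the diagonal row-sum matrix multiplied by the matrix itself, so that every non-zero row sums to one. The function name `normalize_rows` and the choice to leave all-zero rows as zeros are assumptions made for illustration.

```python
def normalize_rows(X):
    """Divide each row of X by its row sum (assumed form of the normalization)."""
    normalized = []
    for row in X:
        total = sum(row)
        if total:
            normalized.append([v / total for v in row])
        else:
            # Assumption: a row with no links stays all-zero instead of dividing by 0.
            normalized.append([0.0] * len(row))
    return normalized

# Toy relation matrix: 3 source objects, 3 target objects.
W = [[1, 1, 0],
     [0, 2, 2],
     [0, 0, 0]]
U = normalize_rows(W)
print(U)
```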
5.1. Similarity Measure
In this section, we first define commuting matrices of recurrent metapaths and recurrent metatrees, and then propose two kinds of strategies to determine the weights of these schematic structures.
The recurrent metapaths given by Formula 3, e.g. those in Fig. 6(b) and Fig. 7(d,e,f), can be collectively denoted as
(5) 
The substructure recurs times in . In essence, can be decomposed into an infinite number of metapaths such as
The substructure recurs times in the metapath , . Assume denotes the relation matrix from to and is its normalized version. The commuting matrix of is defined as the summation of the commuting matrices of , see Formula 6.
(6) 
In order to ensure that the matrix series converges, all of the commuting matrices in Formula 6 are normalized according to Definition 5.1 and a decaying parameter is used. The Perron-Frobenius theorem (Horn and Johnson, 2013) is used to establish convergence. The normalized version of is defined in Formula 7.
(7) 
where is called the decaying parameter and is the identity matrix with the same size as . Note that may be diagonal. In that case, should be removed from all the recurrent metapaths because it cannot provide any useful information about the similarities between source objects.
The recurrent metatrees given by Formula 4, e.g. those in Fig. 6(d,e) and Fig. 7(g,h), can be collectively denoted as
(8) 
Note that for each in , its right side is also linked to the path , see Fig. 6(d,e) and Fig. 7(g,h). In essence, can be decomposed into an infinite number of metapaths such as
The substructure recurs times in the metapath . Therefore, the commuting matrix of is defined as the summation of the commuting matrices of , see Formula 9.
(9) 
where
and
In order to ensure that the matrix series converges, all of the commuting matrices in Formula 9 are normalized according to Definition 5.1 and the decaying parameter is used as well. The Perron-Frobenius theorem (Horn and Johnson, 2013) is again used to establish convergence. The normalized version of is defined in Formula 10.
(10) 
where
and
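The role of the decaying parameter in Formulas 7 and 10 can be illustrated with a small numerical sketch. It assumes, without reproducing the paper's exact formulas, that the k-th term of the series is the k-th power of a row-normalized matrix `M` scaled by the k-th power of a decay `lam`, with the sum starting at k = 0. Under that assumption, and for a decay strictly between 0 and 1, the series converges to the inverse of (I minus `lam` times `M`), because a row-normalized non-negative matrix has spectral radius at most 1. The names `matmul`, `identity` and `decayed_series` are hypothetical helpers.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def identity(n):
    """Return the n x n identity matrix."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def decayed_series(M, lam, terms=200):
    """Approximate sum_{k>=0} lam^k M^k by a truncated sum."""
    n = len(M)
    total = identity(n)      # k = 0 term
    term = identity(n)
    for _ in range(terms):
        # Turn lam^(k-1) M^(k-1) into lam^k M^k and add it to the running sum.
        term = [[lam * v for v in row] for row in matmul(term, M)]
        total = [[t + s for t, s in zip(tr, sr)] for tr, sr in zip(total, term)]
    return total

# Toy row-normalized matrix (each row sums to 1) and an assumed decay value.
M = [[0.5, 0.5],
     [0.25, 0.75]]
lam = 0.5
S = decayed_series(M, lam)
print(S)
```

Since every power of a row-stochastic matrix is row-stochastic, each row of `S` sums to 1 / (1 - `lam`), and `S` satisfies the fixed-point relation S = I + `lam` M S up to truncation error. Both properties give a quick sanity check on an implementation.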
Both the recurrent metapaths and the recurrent metatrees consider only the structure of the network schema , and ignore the structure of the HIN . In fact, they play different roles in the HIN owing to the sparsity and strength of their instances, i.e. the sparsity and strength of the entries of their commuting matrices. Therefore, we combine the commuting matrices of the different recurrent metapaths and recurrent metatrees using different weights. Below, we introduce two strategies, a global weighting strategy and a local weighting strategy, to determine these weights.
The global weighting strategy determines the weight of a recurrent metapath or recurrent metatree by the strength of its commuting matrix, i.e. the sum of all the entries of the commuting matrix. The local weighting strategy determines the weight of a recurrent metapath or recurrent metatree by the sparsity of its instances. Take the recurrent metatree shown in Formula 8 as an example. We traverse the objects belonging to for times, and then randomly sample an object from their neighbors. The drawn object must belong to or . Let denote the number of the drawn objects belonging to . The frequency from to is equal to . As a result, the weight of the recurrent metatree is equal to
(11) 
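The two weighting strategies can be sketched as below. The sketch makes several assumptions that are not confirmed by the text: the global weights are rescaled so that they sum to one, the sampling frequency is taken as the fraction of draws that land on the child type, and the final weight of Formula 11 is not reproduced. The function names, the toy object identifiers, and the type labels `V` and `T` are hypothetical.

```python
import random

def global_weights(commuting_matrices):
    """Weight each structure by the sum of its commuting-matrix entries.

    Assumption: strengths are rescaled so that the weights sum to 1.
    """
    strengths = [sum(sum(row) for row in M) for M in commuting_matrices]
    total = sum(strengths)
    return [s / total for s in strengths]

def sampled_frequency(neighbors, pivot_objects, child_type, type_of, n_rounds, rng):
    """Estimate how often a random neighbor of a pivot object has `child_type`.

    Assumption: the frequency is (draws of child_type) / (all draws).
    """
    drawn = hits = 0
    for _ in range(n_rounds):
        for obj in pivot_objects:
            if not neighbors[obj]:
                continue                      # isolated object: nothing to sample
            pick = rng.choice(neighbors[obj])
            drawn += 1
            hits += type_of[pick] == child_type
    return hits / drawn if drawn else 0.0

# Global strategy on two toy commuting matrices (entry sums 2 and 6).
weights = global_weights([[[1, 1], [0, 0]], [[3, 0], [0, 3]]])
print(weights)

# Local strategy on a toy neighborhood: pivot objects p1, p2 with typed neighbors.
type_of = {"v1": "V", "t1": "T"}
neighbors = {"p1": ["v1", "t1"], "p2": ["v1"]}
freq_v = sampled_frequency(neighbors, ["p1", "p2"], "V", type_of, 100, random.Random(0))
print(freq_v)
```

A seeded `random.Random` instance is passed in explicitly so that the sampling is reproducible. In this toy neighborhood one pivot object always draws type `V` and the other draws it about half the time, so the estimate should fall between 0.5 and 1.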
The proposed similarity measure RMSS is defined as
(12) 
where
Note that when and when