1 Introduction
Information network analysis has attracted considerable attention in data mining because many real systems, e.g. bibliographic databases and biological systems, can be modeled as information networks. These networks share a common characteristic: they are composed of multi-typed, interconnected objects. Such networks are usually called Heterogeneous Information Networks (HINs). Fig. 1 shows a toy bibliographic information network with four object types: Author (A), drawn as triangles; Paper (P), drawn as circles; Venue (V), drawn as pentagons; and Term (T), drawn as squares. The type P has six instances: P:HeteSim SKHYW:2014, P:HeProjI SWLYW:2014, P:GenClus SAH:2012, P:PathSelClus SNHYYY:2012, P:PathSim SHYYW:2011 and P:NetClus SYH:2009. Each paper has its author(s), a venue and its related terms. Hence, the network contains three types of links: Paper-Author, Paper-Venue and Paper-Term.
In a HIN, a fundamental problem is to measure the similarities between objects using structural and semantic information. All the off-the-shelf similarities in HINs are based on user-specified meta paths, for example PathSim SHYYW:2011 and Biased Path Constrained Random Walk (BPCRW) LC:2010b ; LC:2010a . According to HZCSML:2016 , meta paths can only capture biased and relatively simple semantics. Therefore, the authors of that work proposed a more complicated structure called the meta structure, and used the compressed-ETree to define a meta structure based similarity called the Biased Structure Constrained Subgraph Expansion (BSCSE). However, the meta structure needs to be specified in advance as well.
It is difficult for users to specify meta paths or meta structures. For example, a complete biological information network CDJWZ:2010 ; FDSCSB:2016 contains ten object types (Gene, Gene Ontology, Tissue, Chemical Compound, Side Effect, Substructure, Chemical Ontology, Pathway, Disease, Gene Family) and eleven link types. Users can hardly know how to choose appropriate meta paths or meta structures in such a network. In addition, different meta paths and meta structures may affect the similarities between objects differently, which makes the selection even harder.
To alleviate users’ burden, we propose an automatically constructed schematic structure called the Stratified Meta Structure (SMS). It need not be specified in advance, and it combines many meta paths and meta structures. This ensures that (1) users need not pay attention to the structure of the network schema of the input HIN; (2) rich semantics can still be captured. We are inspired by the tree-walk proposed in NPEKK:2011 . The structure of a tree-walk is constructed by repetitively visiting nodes in the input graph, and the same idea can be employed here. As a result, we devise the stratified meta structure, which is essentially a directed acyclic graph consisting of object types with different layer labels. It can be constructed automatically by repetitively visiting the object types of the network schema. During the construction, we discover that the SMS consists of several basic substructures and recurrent substructures, see section 4.2. These basic and recurrent substructures represent specific relations, so the SMS, as a composite structure, represents a composite relation. This is why the SMS can capture rich semantics.
After obtaining the SMS, the next step is to formalize its rich semantics. For meta structures, the compressed-ETree is used to formalize their semantics. However, it cannot be used here, because the SMS contains an infinite number of meta structures. The semantics contained in meta paths are usually formalized by their commuting matrices. In essence, meta structures have the same nature as meta paths, because both have hierarchical structures. So, we define the commuting matrices of meta structures by virtue of the cartesian product in section 3.2, and further define the commuting matrix of the SMS by combining the infinite number of commuting matrices of meta structures. The proposed metric, SMSS, is defined by the commuting matrix of the SMS. Experimental evaluations suggest that SMSS on the whole outperforms the baselines PathSim, BPCRW and BSCSE in terms of ranking quality and clustering quality.
The main contributions are summarized as follows.

We propose the stratified meta structure with rich semantics, which can be constructed automatically, and define a stratified meta structure similarity by virtue of the commuting matrix of the SMS;

We define the commuting matrices of meta structures by virtue of the cartesian product, and use them to compactly reformulate BSCSE;

We conduct experiments to evaluate the performance of the proposed metric. On the whole, it outperforms the baselines in terms of ranking quality and clustering quality.
2 Related Work
To the best of our knowledge, Sun et al. SYH:2009 ; SHZYCW:2009 ; SH:2012 first proposed the definition of the HIN and studied ranking-based clustering in HINs. Shi et al. SLZSY:2017 gave a comprehensive survey of research topics on HINs, including similarity measures, clustering, link prediction, ranking, recommendation, information fusion and classification. HeteClass:2017 proposed a meta-path based framework called HeteClass for transductive classification of target type objects. This framework can explore the network schema of the input HIN and incorporate expert knowledge to generate a collection of meta paths. Below, we summarize related work on similarity measures in information networks.
For similarity measures in homogeneous information networks, JW:2002 proposed a general similarity measure based on link information, following the intuition that two similar objects must relate to similar objects. JW:2003 evaluated the similarities of objects by a random walk model with restart. LinkPred:2017 lists many state-of-the-art similarities in homogeneous information networks: (1) Local Approaches: Common Neighbors, Adamic-Adar Index, Resource Allocation Index, Resource Allocation based on Common Neighbor Interactions, Preferential Attachment Index, Jaccard Index, Salton Index, Sorensen Index, Hub Promoted Index, Hub Depressed Index, Local Leicht-Holme-Newman Index, Individual Attraction Index, Mutual Information, Local Naive Bayes, CAR-Based Indices, Functional Similarity Weight, Local Interacting Score; (2) Global Approaches: Negated Shortest Path, Katz Index, Global Leicht-Holme-Newman Index, Random Walks (RW), Random Walks with Restart, Flow Propagation, Maximal Entropy Random Walk, Pseudoinverse of the Laplacian Matrix, Average Commute Time, Random Forest Kernel Index, the Blondel Index; (3) Quasi-Local Approaches: Local Path Index (LPI), Local Random Walks (LRW), Superposed Random Walks (SRW), Third-Order Resource Allocation Based on Common Neighbor Interactions, FriendLink, PropFlow Predictor.
For similarity measures in heterogeneous information networks, Sun et al. SHYYW:2011 proposed a meta path based similarity measure in HINs, called PathSim. Lao and Cohen LC:2010b ; LC:2010a studied the problem of measuring entity similarity in labeled directed graphs, and defined a Biased Path Constrained Random Walk (BPCRW) model, which can be applied to HINs. Huang et al. HZCSML:2016 proposed a similarity, BSCSE, which can capture more complex semantics. Shi et al. SKHYW:2014 proposed a relevance measure, HeteSim, which can be used to evaluate the relatedness of two objects of different types. For a user-specified meta path, HeteSim is based on the pairwise random walk from its two endpoints to its center. Xiong et al. XZY:2015 studied the problem of finding similar object pairs by virtue of locality sensitive hashing. Zhu et al. SEMATCH:2017 proposed an integrated framework for the development, evaluation and application of semantic similarity for knowledge graphs, which can be viewed as complicated heterogeneous information networks. This framework includes many similarity tools and allows users to compute semantic similarities. In ForwardBYW:2017 , the authors studied the similarity search problem in social and knowledge networks, and proposed a dual perspective similarity metric called Forward Backward Similarity.
3 Preliminaries
In this section, we introduce some important concepts related to HINs including network schema in subsection 3.1, meta paths and meta structures in 3.2.
3.1 HIN Definition
As defined in SWLYW:2014 , an information network is essentially a directed graph $G = (V, E, \mathcal{A}, \mathcal{R})$. $V$ and $E$ respectively denote its sets of objects and links, and $\mathcal{A}$ and $\mathcal{R}$ respectively denote its sets of object types and link types. The map $\phi: V \to \mathcal{A}$ assigns the object type $\phi(v)$ to each object $v \in V$; that is, each object in $V$ belongs to a specific object type. Similarly, the map $\psi: E \to \mathcal{R}$ assigns the link type $\psi(e)$ to each link $e \in E$, i.e. each link in $E$ belongs to a specific link type. In essence, a link type $R \in \mathcal{R}$ carries some semantics because it is a relation from a source object type to a target object type. If two links belong to the same link type, they share the same starting object type as well as the same ending object type. $G$ is called a Heterogeneous Information Network if $|\mathcal{A}| > 1$ or $|\mathcal{R}| > 1$. Otherwise, it is called a homogeneous information network.
For each HIN, there is a meta-level description, called its network schema. Specifically, the network schema SWLYW:2014 of $G$ is a directed graph consisting of the object types in $\mathcal{A}$ and the link types in $\mathcal{R}$. Fig. 2(a) shows the network schema for the HIN in Fig. 1. In this paper, we also use a biological information network consisting of six object types, i.e. Gene (G), Tissue (T), GeneOntology (GO), ChemicalCompound (CC), Substructure (Sub) and SideEffect (Si), and five link types, i.e. GO-G, T-G, G-CC, CC-Si and CC-Sub. Its network schema is shown in Fig. 2(b).
3.2 Schematic Structures
Up to now, there are two kinds of schematic structures for the network schema of a HIN: meta paths and meta structures. Both carry some semantics. It is noteworthy that both must be specified by users when they are used to measure the similarities between objects.
A meta path SHYYW:2011 is essentially an alternating sequence of object types and link types, i.e. $\mathcal{P} = T_1 \xrightarrow{R_1} T_2 \xrightarrow{R_2} \cdots \xrightarrow{R_{l-1}} T_l$, where each $T_i$ is an object type and each $R_i$ is a link type. Note that $R_i$ is a link type starting from $T_i$ to $T_{i+1}$. In essence, the meta path carries composite semantics because it represents the composite relation $R_1 \circ R_2 \circ \cdots \circ R_{l-1}$. Unless stated otherwise, the meta path is compactly denoted as $(T_1, T_2, \ldots, T_l)$. There are some useful concepts related to a meta path $\mathcal{P}$ in SHYYW:2011 , i.e. the length of $\mathcal{P}$, a path instance following $\mathcal{P}$, the reverse meta path of $\mathcal{P}$, the symmetric meta path and the commuting matrix of $\mathcal{P}$. For example, Fig. 3(a,b,c) show three meta paths in the network schema shown in Fig. 2(a). They can be compactly denoted as $(A, P, A)$, $(A, P, V, P, A)$ and $(A, P, T, P, A)$, and they express different semantics. $(A, P, A)$ expresses “Two authors cooperate on a paper.” $(A, P, V, P, A)$ expresses “Two authors publish their papers in the same venue.” $(A, P, T, P, A)$ expresses “Two authors publish their papers containing the same terms.”
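As a minimal sketch of the commuting matrix of a meta path, the block below multiplies the author-paper adjacency matrix by its transpose to obtain the commuting matrix of the meta path Author-Paper-Author, and then evaluates the PathSim score on it. The three authors, two papers and the helper names (`matmul`, `transpose`, `pathsim`) are hypothetical and chosen for illustration only; they are not taken from Fig. 1.

```python
def matmul(X, Y):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def transpose(X):
    """Transpose a dense matrix given as a list of rows."""
    return [list(col) for col in zip(*X)]


# Hypothetical toy data: 3 authors (rows) x 2 papers (columns).
W_AP = [
    [1, 0],  # author a0 wrote paper p0
    [1, 1],  # author a1 wrote papers p0 and p1
    [0, 1],  # author a2 wrote paper p1
]

# Commuting matrix of the meta path Author-Paper-Author:
# entry (i, j) counts the path instances between authors i and j.
M_APA = matmul(W_AP, transpose(W_AP))


def pathsim(M, i, j):
    """PathSim score of objects i and j for a symmetric meta path
    with commuting matrix M: 2 * M[i][j] / (M[i][i] + M[j][j])."""
    denom = M[i][i] + M[j][j]
    return 2 * M[i][j] / denom if denom else 0.0


print(M_APA)
print(pathsim(M_APA, 0, 1))
```

Longer meta paths are handled the same way by chaining one adjacency matrix per link type along the path.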
A meta structure HZCSML:2016 is essentially a directed acyclic graph with a single source object type and a single target object type; its nodes form a set of object types and its edges form a set of link types. Fig. 3(d,e) show two meta structures for the network schema shown in Fig. 2(a). They can be compactly denoted as $(V, P, (A, T), P, V)$ and $(A, P, (V, T), P, A)$. Fig. 3(f) shows a meta structure for the network schema shown in Fig. 2(b), which can be compactly denoted in the same manner. The meta structure $(V, P, (A, T), P, V)$ expresses the more complicated semantics “Two venues publish papers both containing the same terms and written by the same authors.” The meta structure $(A, P, (V, T), P, A)$ expresses the more complicated semantics “Two authors write their papers both containing the same terms and in the same venue.”
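Given a meta structure $\mathcal{S}$, we sort its object types in topological order. Suppose its height is equal to $h$. Let $L_i$ denote the set of object types on layer $i$, and let $C_{L_i}$ denote the cartesian product of the sets of objects belonging to the different types in $L_i$, $i = 0, 1, \ldots, h$. The relation matrix $W_{L_i L_{i+1}}$ from $C_{L_i}$ to $C_{L_{i+1}}$ is the matrix whose $(s, t)$ entry is equal to 1 if the $s$th element of $C_{L_i}$ is adjacent to the $t$th element of $C_{L_{i+1}}$ in the HIN, and 0 otherwise. Two elements $u \in C_{L_i}$ and $v \in C_{L_{i+1}}$ are adjacent if and only if, for any component $u_j$ of $u$ and any component $v_k$ of $v$, $u_j$ and $v_k$ are adjacent in the HIN whenever their object types are adjacent in $\mathcal{S}$. The commuting matrix of $\mathcal{S}$ is defined as
$$M_{\mathcal{S}} = \prod_{i=0}^{h-1} W_{L_i L_{i+1}}.$$
Each entry in $M_{\mathcal{S}}$ represents the number of instances following $\mathcal{S}$. The commuting matrix of the reverse of $\mathcal{S}$ is equal to $M_{\mathcal{S}}^{\top}$.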
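The definition above can be illustrated with a small sketch. It builds the relation matrices of a five-layer meta structure Author, Paper, (Venue, Term), Paper, Author on a hypothetical toy HIN, using the cartesian product of venues and terms for the middle layer, and multiplies them to get the commuting matrix. All object names and link sets are invented for illustration and do not reproduce Fig. 1 or Fig. 4.

```python
from itertools import product


def matmul(X, Y):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def transpose(X):
    """Transpose a dense matrix given as a list of rows."""
    return [list(col) for col in zip(*X)]


# Hypothetical toy HIN (not the network of Fig. 1).
authors = ["a0", "a1"]
papers = ["p0", "p1"]
venues = ["v0", "v1"]
terms = ["t0", "t1"]
writes = {("a0", "p0"), ("a1", "p1")}
published_in = {("p0", "v0"), ("p1", "v0")}
contains = {("p0", "t0"), ("p1", "t0"), ("p1", "t1")}

# Cartesian product of the objects on the layer {Venue, Term}.
vt_tuples = list(product(venues, terms))

# Relation matrix from the Author layer to the Paper layer.
W_A_P = [[int((a, p) in writes) for p in papers] for a in authors]

# Relation matrix from the Paper layer to the (Venue, Term) layer:
# a paper is adjacent to a tuple (v, t) only if it is linked to both v and t.
W_P_VT = [
    [int((p, v) in published_in and (p, t) in contains) for (v, t) in vt_tuples]
    for p in papers
]

# The structure is symmetric, so the second half uses the transposes.
left = matmul(W_A_P, W_P_VT)
right = matmul(transpose(W_P_VT), transpose(W_A_P))
M = matmul(left, right)
print(M)
```

In this toy network both papers share the tuple (v0, t0), so the two authors are connected by exactly one instance of the structure.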
Take the HIN shown in Fig. 1 as an example. We compute the commuting matrix of the meta structure shown in Fig. 3(b). It has five layers , , , and . The th box on the left-hand side of Fig. 4 shows the cartesian product . Then, we can easily obtain the relation matrices , , , on the right-hand side of Fig. 4. According to the fact that , we know P:HeteSim is adjacent to (V:TKDE,T:Similarity). In fact, P:HeteSim is published in V:TKDE and contains the term T:Similarity. Similarly, implies that P:HeteSim is not adjacent to (V:TKDE,T:Ranking). Indeed, according to the HIN shown in Fig. 1, P:HeteSim does not contain the term T:Ranking. As a result,
For a given meta structure, its BSCSE (under the condition stated in lemma 3.1) can be expressed by its commuting matrix as well. The following lemma 3.1 describes this conclusion. Throughout this paper, for a matrix $X$ we use $\overline{X} = D_X^{-1} X$ to denote its normalized version, where $D_X$ is a diagonal matrix whose nonzero entries are equal to the row sums of $X$.
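The normalization just described, dividing each row of a matrix by its row sum, can be sketched as follows. The function name `row_normalize` is hypothetical, and leaving an all-zero row unchanged is an assumption of this sketch, since the text does not say how such rows are treated.

```python
def row_normalize(W):
    """Divide every row of W by its row sum.

    Assumption of this sketch: an all-zero row stays all-zero,
    because its row sum cannot be inverted.
    """
    normalized = []
    for row in W:
        total = sum(row)
        if total:
            normalized.append([x / total for x in row])
        else:
            normalized.append([0.0] * len(row))
    return normalized


print(row_normalize([[1, 1], [0, 0], [1, 3]]))
```

Every nonzero row of the result sums to 1, so the normalized matrix can be read as the transition matrix of a random walk between two adjacent layers.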
Lemma 3.1.
Given a meta structure , suppose that denotes the set of object types on its th layer. When ,
where .
Proof.
We prove the lemma by induction on .
Initial Step. Obviously, . When , . Assume there are different object tuples in adjacent to , denoted as . According to the definition of HZCSML:2016 , . Obviously, . Therefore, we have .
Inductive Step. Assume the conclusion holds for . Below, we prove it also holds for . Obviously,
(1) 
where , and . Note in particular that , where , and , where , is equal to either 0 or . According to the definition of in literature HZCSML:2016 ,
(2) 
Combining formulas 1 and 2, we have
where . The conclusion holds for . ∎
In this paper, we aim to define a similarity measure in HINs that does not depend on any pre-specified schematic structure. This is a reasonable requirement because specifying meta paths or meta structures is a cumbersome job.
4 Stratified Meta Structure Based Similarity
In this section, we define the stratified meta structure based similarity measure in HINs. Firstly, we give the architecture of the stratified meta structure in section 4.1. Secondly, we formally define the similarity based on the stratified meta structure in section 4.2. Finally, we describe the pseudocode for computing the similarity in section 4.3.
4.1 Stratified Meta Structure
A Stratified Meta Structure (SMS) is essentially a directed acyclic graph consisting of object types with different layer labels. Its salient advantage is that it can be constructed automatically by repetitively visiting object types while traversing the network schema. Given a HIN, we first extract its network schema, and then select a source object type and a target object type. Unless stated otherwise, the source object type is the same as the target one. The construction rule of the SMS is as follows. The source object type is placed on the 0th layer. The object types on layer $l$ ($l \geq 1$) are the neighbors, in the network schema, of the object types on layer $l-1$. Adjacent object types are linked by an arrow pointing from layer $l-1$ down to layer $l$. Note in particular that once the target object type appears on a layer $l \geq 1$, its outgoing links, i.e. the ones from it down to the object types on layer $l+1$, are deleted. Repeating the above process, we obtain the SMS.
Fig. 6(a) shows the SMS of the network schema shown in Fig. 2(a). It can be constructed as shown in Fig. 5. A is both the source and the target object type. Firstly, A, with layer label 0, is placed on the 0th layer, see Fig. 5(a). P, with layer label 1, is placed on the 1st layer, because P is the only neighbor of A in the network schema shown in Fig. 2(a), see Fig. 5(b). A, V and T, each with layer label 2, are placed on the 2nd layer, because they are the neighbors of P, see Fig. 5(c). Similarly, P, with layer label 3, is placed again on the 3rd layer, because it is the neighbor of both V and T, see Fig. 5(d). At this point, P is visited again. Note in particular that the link from A on the 2nd layer down to P is deleted, because A is the target object type. Repeating the above procedure, we obtain the SMS shown in Fig. 6(a). Fig. 6(b) shows the SMS of the network schema shown in Fig. 2(b). Gene is both the source and the target object type. It can be constructed in the same way as the SMS in Fig. 6(a). It is worth noting that GO and T are only placed on the 1st layer, because their degrees in the network schema shown in Fig. 2(b) are equal to 1.
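The construction rule can be sketched as a layer-by-layer expansion of the network schema. The function below is an illustrative sketch only: the name `sms_layers`, the dictionary encoding of the schema and the `num_layers` cut-off are assumptions introduced here, and the function returns just the set of object types on each layer, not the arrows between layers. It also assumes the source object type has no self-loop.

```python
def sms_layers(schema, source, target, num_layers):
    """Return the object types on the first `num_layers` layers of the SMS.

    `schema` maps each object type to the set of its neighbors in the
    network schema. The SMS itself is unbounded, so the expansion is
    truncated after `num_layers` layers.
    """
    layers = [{source}]
    for depth in range(1, num_layers):
        current = set()
        for obj_type in layers[-1]:
            # The target's outgoing links are deleted once it appears
            # on a layer below the 0th one, so it is not expanded there.
            if obj_type == target and depth > 1:
                continue
            current.update(schema[obj_type])
        if not current:
            break
        layers.append(current)
    return layers


# Network schema of the bibliographic HIN (Fig. 2(a)).
bib_schema = {"A": {"P"}, "P": {"A", "V", "T"}, "V": {"P"}, "T": {"P"}}
bib_layers = sms_layers(bib_schema, "A", "A", 5)

# Network schema of the biological HIN (Fig. 2(b)).
bio_schema = {
    "G": {"GO", "T", "CC"},
    "GO": {"G"},
    "T": {"G"},
    "CC": {"G", "Si", "Sub"},
    "Si": {"CC"},
    "Sub": {"CC"},
}
bio_layers = sms_layers(bio_schema, "G", "G", 5)
print(bib_layers)
print(bio_layers)
```

For the biological schema, GO and T show up on the 1st layer only, which matches the observation that object types of degree 1 adjacent to the source are not revisited.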
Below, we give some properties of the SMS via lemma 4.3. Given a SMS , we sort its object types in the topological order. Let denote the set of object types with the layer label except the target object type. Let denote the cartesian product of the set of objects belonging to different types in , . The relation matrix from to is defined in the same way as for meta structures in section 3.2. A substructure consisting of three layers in is recurrent if and only if for . Let denote the height of the spanning tree of yielded by Breadth-First Search (BFS). Without loss of generality, we assume the source object type in the network schema does not contain self-loops. If the source object type has a self-loop, we assign two roles to it: target object type and intermediate object type. The first role treats it as the target object type, and the second treats it as a non-source, non-target object type.
Note 4.1.
In the process of computing , we need to visit all the elements in . In practice, there are many all-zero columns in . We should remove all the all-zero columns in and the corresponding rows in . Removing the th column of implies there are no links between the th object in and any object in . Therefore, it is unnecessary to consider the links between it and the objects in .
Note 4.2.
In the process of computing , the objects except the source can be removed from . At this time, we only need to consider the elements adjacent to in , and the others are removed from . In general, suppose is the set of considered elements in . That implies the elements in are removed from . In , we only need to consider the elements in where denotes the set of elements in adjacent to . The others are removed from .
Lemma 4.3.
Assume the source object type is the same as the target one, and the source object type in does not contain self-loops. The SMS has the following properties:

The target object type lies on the th layer, .

If we walk up from the target object type on the layer to the source object type along the parents of object types, then we obtain a symmetric meta structure, denoted as .

For any , .

The substructure consisting of the object types except the target object type on the layers always appears recurrently in the SMS.

For , the meta structure contains recurrent structures, where
(3)
Proof.
1) The conclusion holds obviously according to the construction rule of and our assumptions.
2) According to property 1, the target object type with different layer labels always lies on an even-numbered layer. Thus, the height of the meta structure obtained by walking up from the target object type on layer is even. According to the construction rule of the SMS, the meta structure is symmetric with respect to the layer .
3) For any , it must be adjacent to an object type on the layer . Obviously, according to the construction rule of . Thus, we have . Similarly, . Therefore, .
4) This property obviously holds according to property 3.
5) We prove the lemma by induction on .
Initial Step. When , we obviously have according to the construction rule of .
Inductive Step. Assume for the layer . Below, we prove that the conclusion holds for . When , we obtain a new recurrent structure consisting of . When , we obtain a new recurrent structure consisting of . Therefore, we have
The conclusion holds. ∎
According to property 2 in lemma 4.3, the SMS is essentially composed of an infinite number of meta paths and meta structures. For example, the SMS shown in Fig. 6(a) can be obtained by combining the meta path shown in Fig. 3(a), the meta structure shown in Fig. 3(b) and the others with one or more recurrent substructures shown in Fig. 7(a). The SMS shown in Fig. 6(b) can be obtained by combining the meta structures shown in Fig. 7(b,d) and the others with one or more recurrent substructures shown in Fig. 7(c). It is noteworthy that the meta structure shown in Fig. 7 can be compactly denoted as .
4.2 Similarity
Now, we define the stratified meta structure based similarity by virtue of the commuting matrices of meta paths and meta structures. The SMS is essentially composed of a recurrent substructure, several basic substructures and their reverses. The basic substructures are bipartite graphs consisting of the object types in and , . The recurrent substructure consists of . The symmetric meta structures obtained by walking up from the target object type on the layer to the source object type consist of the basic substructures and their reverses. The symmetric ones obtained by walking up from the target object type on the layer to the source object type consist of one or more recurrent substructures, the basic substructures and their reverses. For example, the SMS shown in Fig. 6(a) can be obtained by combining the recurrent substructure shown in Fig. 7(a) and two basic substructures shown in Fig. 8(a,b). The SMS shown in Fig. 6(b) can be obtained by combining the recurrent substructure shown in Fig. 7(c) and three basic substructures shown in Fig. 8(c,d,e).
The commuting matrix of a stratified meta structure is formally defined as the summation of the commuting matrices of meta paths and meta structures. Let denote the commuting matrix of SMS , and denote the meta structure (possibly meta path), which is obtained by walking up from the target object type on the layer to the source object type, . Therefore, .
As stated previously, denotes the relation matrix from to . For the basic substructure consisting of the object types on the layers and , its relation matrix is just equal to , . For the recurrent substructure, its relation matrix is equal to according to property 3 in lemma 4.3. Below, we show how to compute .
Lemma 4.4.
For any , can be computed as follows. Let .

If the degree of the source object type in is equal to 1, then we have
(4) where

If the degree of the source object type in is larger than 1, then let denote the set of the object types with degree larger than 1 in the neighbors of the source object type, and let . When , let . We have
(5)
Proof.
For case 1, we prove it by induction on . Case 2 is similar.
Initial Step. When , The obtained meta structure (possibly meta path) consists of . Obviously,
Inductive Step. Assume the conclusion holds for , and the meta structure for is . Now, we prove the conclusion also holds for . According to property 3 in lemma 4.3, the meta structure for is . When , . Thus, we have
When , there are recurrent substructures according to property 5 in lemma 4.3. Thereby, the obtained meta structure for can be denoted as
The obtained meta structure for is
Therefore,
So, the conclusion holds when . ∎
Below, we only discuss case 1 in lemma 4.4. Case 2 is similar. Obviously,
The matrix power series may be divergent. In addition, different meta structures in the SMS should also have different weights. As a result, the normalized version of is equal to
where and are respectively the normalized versions of and , and . is called the decaying factor. , satisfying , denote the weights of different meta structures. Obviously, the spectral radius of is less than 1 because is a row-stochastic matrix and . Let denote the identity matrix with the same size as . As a result, we have
(6)
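The convergence argument rests on a standard fact: when the spectral radius of a matrix $X$ is less than 1, the power series $I + X + X^2 + \cdots$ converges to $(I - X)^{-1}$. The sketch below checks this numerically for $X = \lambda W$, with a hypothetical row-stochastic matrix `W` and a hypothetical decaying factor `decay`; it illustrates the general identity only and does not reproduce the exact matrices of formula 6.

```python
def matmul(X, Y):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def identity(n):
    """Return the n x n identity matrix."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def neumann_sum(X, terms):
    """Truncated power series I + X + X^2 + ... + X^terms."""
    n = len(X)
    total = identity(n)
    power = identity(n)
    for _ in range(terms):
        power = matmul(power, X)
        total = [[t + p for t, p in zip(t_row, p_row)]
                 for t_row, p_row in zip(total, power)]
    return total


# Hypothetical row-stochastic matrix and decaying factor in (0, 1).
W = [[0.5, 0.5], [0.25, 0.75]]
decay = 0.8
X = [[decay * w for w in row] for row in W]

series = neumann_sum(X, 200)
I_minus_X = [[i - x for i, x in zip(i_row, x_row)]
             for i_row, x_row in zip(identity(2), X)]

# If the series equals (I - X)^{-1}, this product is the identity.
check = matmul(series, I_minus_X)
print(check)
```

Because the rows of `W` sum to 1 and `decay` is below 1, the spectral radius of `X` is at most `decay`, so the truncated series approaches its limit geometrically.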
The Stratified Meta Structure based Similarity, , of the source object and the target object is defined as
(7) 
Note 4.5.
Applying note 4.2 results in for . As a result, the matrix defined previously is not square. To address this issue, the elements in are removed from . This loses some semantics.
Now, we take the HIN shown in Fig. 1 as an example to illustrate note 4.5. The object A:Yizhou Sun is selected as the source. As shown in the first box () on the left-hand side of Fig. 4, A:Yizhou Sun, marked in red, is kept, and the others are removed. In the second and third boxes ( and ), the elements marked in red are kept and the others are removed. In the fourth box (), P:GenClus is also kept in addition to the red elements in , because it is adjacent to (V:VLDB,T:HIN). Obviously, . In the fifth box (), the elements marked in green should also be kept in addition to the red ones in , because they are adjacent to P:GenClus. Obviously, . According to the approximation strategy, they are removed from . That means some semantics are lost.
According to notes 4.2 and 4.5, some elements in are removed. and in formula 6 should be adjusted accordingly. is still used to denote the relation matrix from the renewed to the renewed . In , and essentially represent the relation matrices respectively on the left and right side of the symmetry axis of . When using notes 4.2 and 4.5, we must explicitly distinguish them. Before proceeding, let denote the set of objects belonging to the target object type, and denote the relation matrix from the updated to . The left-hand relation matrix can be denoted as , and the right-hand relation matrix can be denoted as . As a result,
(8) 
where still denotes the relation matrix of the recurrent substructure. is still defined via formula 7 using .
Note 4.6.
In practice, it is very time-consuming to compute in formula 8. Note that is a row vector according to the locality strategy. Therefore, computing is equivalent to solving the linear equations . We apply Lower-Upper (LU) decomposition to and then use the LU factors to obtain numerical_analysis .
4.3 Algorithm Description
Now, we describe the algorithm for computing , see algorithm 1. As shown in lines 2-5, it takes at most to construct and for and compute . According to note 4.2, we obtain a vector whose entries represent the similarities between the source object and the others. As shown in lines 6-9, it takes at most to compute . As a result, the worst-case time complexity of algorithm 1 is equal to .
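The idea in note 4.6, replacing an explicit matrix inverse by a linear solve on LU factors, can be sketched in plain Python as follows. This is a generic illustration under stated assumptions: `A` stands for whatever square matrix is to be inverted and `b` for the row vector multiplying its inverse, both with made-up values; the function names are hypothetical, and the factorization shown (Doolittle with partial pivoting) is one common variant, not necessarily the one used by the authors.

```python
def lu_factor(A):
    """LU factorization with partial pivoting, P A = L U.

    Returns the packed factors (unit-diagonal L below the diagonal,
    U on and above it) and the row permutation.
    """
    n = len(A)
    LU = [row[:] for row in A]
    perm = list(range(n))
    for k in range(n):
        pivot = max(range(k, n), key=lambda r: abs(LU[r][k]))
        if LU[pivot][k] == 0:
            raise ValueError("matrix is singular")
        if pivot != k:
            LU[k], LU[pivot] = LU[pivot], LU[k]
            perm[k], perm[pivot] = perm[pivot], perm[k]
        for i in range(k + 1, n):
            LU[i][k] /= LU[k][k]
            for j in range(k + 1, n):
                LU[i][j] -= LU[i][k] * LU[k][j]
    return LU, perm


def lu_solve(LU, perm, b):
    """Solve A y = b given the factors returned by lu_factor(A)."""
    n = len(LU)
    y = [b[perm[i]] for i in range(n)]
    for i in range(n):                # forward substitution with L
        for j in range(i):
            y[i] -= LU[i][j] * y[j]
    for i in reversed(range(n)):      # back substitution with U
        for j in range(i + 1, n):
            y[i] -= LU[i][j] * y[j]
        y[i] /= LU[i][i]
    return y


def solve_row_system(A, b):
    """Solve x A = b for the row vector x, i.e. x = b A^{-1},
    by factorizing the transpose of A instead of inverting A."""
    A_T = [list(col) for col in zip(*A)]
    LU, perm = lu_factor(A_T)
    return lu_solve(LU, perm, b)


# Hypothetical 2 x 2 system.
A = [[0.6, -0.4], [-0.2, 0.4]]
b = [1.0, 0.0]
x = solve_row_system(A, b)
print(x)
```

Solving for a single row vector costs one factorization plus two triangular substitutions, whereas forming the full inverse computes every row even though only one is needed.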