A Semantic-Rich Similarity Measure in Heterogeneous Information Networks

01/02/2018 ∙ by Yu Zhou, et al. ∙ Xi'an Jiaotong University, Xidian University

Measuring the similarities between objects in information networks has fundamental importance in recommendation systems, clustering and web search. The existing metrics depend on the meta path or meta structure specified by users. In this paper, we propose a stratified meta structure based similarity SMSS in heterogeneous information networks. The stratified meta structure can be constructed automatically and capture rich semantics. Then, we define the commuting matrix of the stratified meta structure by virtue of the commuting matrices of meta paths and meta structures. As a result, SMSS is defined by virtue of these commuting matrices. Experimental evaluations show that the proposed SMSS on the whole outperforms the state-of-the-art metrics in terms of ranking and clustering.


1 Introduction

Figure 1: A Toy Bibliographic Information Network. T1, T2, T3, T4, T5, T6, T7, T8 and T9 respectively stand for terms ‘NetworkSchema’, ‘RelationStrength’, ‘Similarity’, ‘Clustering’, ‘HIN’, ‘Attribute’, ‘MetaPath’, ‘Ranking’, ‘NetworkSchema’. P1, P2, P3, P4, P5 and P6 respectively stand for papers ‘NetClust’, ‘HeteSim’, ‘GenClus’, ‘PathSelClus’, ‘HeProjI’, ‘PathSim’. A1, A2, A3, A4 and A5 respectively stand for authors ‘Yizhou Sun’, ‘Jiawei Han’, ‘Chuan Shi’, ‘Philip S. Yu’ and ‘Xifeng Yan’. V1, V2, V3 and V4 respectively stand for venues ‘CIKM’, ‘TKDE’, ‘SIGKDD’, ‘VLDB’.

Information network analysis attracts many researchers’ attention in the field of data mining because many real systems, e.g. bibliographic databases and biological systems, can be modeled as information networks. These networks share a common characteristic: they are composed of multi-typed, interconnected objects. Such networks are usually called Heterogeneous Information Networks (HINs). Fig. 1 shows a toy bibliographic information network with four object types: Author (A) in the shape of triangles, Paper (P) in the shape of circles, Venue (V) in the shape of pentagons and Term (T) in the shape of squares. The type Paper has six instances: P:HeteSim SKHYW:2014 , P:HeProjI SWLYW:2014 , P:GenClus SAH:2012 , P:PathSelClus SNHYYY:2012 , P:PathSim SHYYW:2011 , P:NetClus SYH:2009 . Each paper has its author(s), a venue and its related terms. Hence, the network contains three link types: Author–Paper, Paper–Venue and Paper–Term.

In a HIN, a fundamental problem is to measure the similarity between objects using structural and semantic information. All the off-the-shelf similarity measures in HINs are based on user-specified meta paths, for example PathSim SHYYW:2011 and the Biased Path Constrained Random Walk (BPCRW) LC:2010b ; LC:2010a . According to the literature HZCSML:2016 , meta paths can only capture biased and relatively simple semantics. Therefore, the authors of HZCSML:2016 proposed a more complicated structure called the meta structure, and defined a meta structure based similarity using the compressed-ETree, called the Biased Structure Constrained Subgraph Expansion (BSCSE). However, the meta structure needs to be specified in advance as well.

It is very difficult for users to specify meta paths or meta structures. For example, a complete biological information network CDJWZ:2010 ; FDSCSB:2016 contains ten object types (Gene, Gene Ontology, Tissue, Chemical Compound, Side Effect, Substructure, Chemical Ontology, Pathway, Disease, Gene Family) and eleven link types. Obviously, users hardly know how to choose appropriate meta paths or meta structures. In addition, different meta paths and meta structures may affect the similarities between objects differently, which makes it even more difficult for users to select appropriate ones.

To alleviate users’ burden, we propose an automatically-constructed schematic structure called the Stratified Meta Structure (SMS). It need not be specified in advance, and it combines many meta paths and meta structures. This ensures that (1) users need not concern themselves with the structure of the network schema of the input HIN; (2) rich semantics can still be captured. We are inspired by the tree-walk proposed in NPEKK:2011 . The structure of a tree-walk is constructed by repetitively visiting nodes in the input graph, and this idea can be employed here. As a result, we devise the stratified meta structure, which is essentially a directed acyclic graph consisting of object types with different layer labels. It can be constructed automatically by repetitively visiting the object types of the network schema. In the process of the construction, we discover that the SMS consists of many basic substructures and recurrent substructures, see section 4.2. These substructures essentially represent specific relations; the SMS, as a composite structure, therefore represents a composite relation. This is why the SMS can capture rich semantics.

After obtaining the SMS, the next step is to formalize its rich semantics. For meta structures, the compressed-ETree is used to formalize the semantics. However, it cannot be used here, because the SMS contains an infinite number of meta structures. The semantics contained in meta paths are usually formalized by their commuting matrices. In essence, meta structures have the same nature as meta paths, because both have hierarchical structures. So, we define commuting matrices of meta structures by virtue of the cartesian product in section 3.2, and further define the commuting matrix of the SMS by reasonably combining the infinite number of commuting matrices of the meta structures it contains. The proposed metric, SMSS, is defined by the commuting matrix of the SMS. Experimental evaluations suggest that SMSS on the whole outperforms the baselines PathSim, BPCRW and BSCSE in terms of ranking quality and clustering quality.

The main contributions are summarized as follows.

  1. We propose the stratified meta structure, which carries rich semantics and can be constructed automatically, and define a stratified meta structure based similarity (SMSS) by virtue of the commuting matrix of the SMS;

  2. We define the commuting matrices of meta structures by virtue of the cartesian product, and use them to compactly re-formulate BSCSE;

  3. We conduct experiments to evaluate the performance of the proposed metric SMSS. On the whole, it outperforms the baselines in terms of ranking quality and clustering quality.

The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 provides preliminaries on HINs. Section 4 introduces the definition of SMSS. Section 5 presents the experimental evaluations, and section 6 concludes the paper.

2 Related Work

To the best of our knowledge, Sun et al. SYH:2009 ; SHZYCW:2009 ; SH:2012 first proposed the definition of the HIN and studied ranking-based clustering in HINs. Shi et al. SLZSY:2017 gave a comprehensive survey of research topics on HINs, including similarity measures, clustering, link prediction, ranking, recommendation, information fusion and classification. Article HeteClass:2017 proposed a meta-path based framework called HeteClass for transductive classification of target-type objects; it explores the network schema of the input HIN and incorporates expert knowledge to generate a collection of meta paths. Below, we summarize related work on similarity measures in information networks.

For similarity measures in homogeneous information networks, literature JW:2002 proposed a general similarity measure combining link information, based on the intuition that two similar objects relate to similar objects. Literature JW:2003 evaluated the similarities of objects by a random walk model with restart. Article LinkPred:2017 lists many state-of-the-art similarity measures for homogeneous information networks: (1) local approaches, e.g. Common Neighbors (CN), the Adamic-Adar Index (AA), the Resource Allocation Index (RA), Resource Allocation based on Common Neighbor Interactions, the Preferential Attachment Index (PA), the Jaccard Index, the Salton Index, the Sorensen Index, the Hub Promoted Index (HPI), the Hub Depressed Index (HDI), the Local Leicht-Holme-Newman Index, the Individual Attraction Index, Mutual Information, Local Naive Bayes, CAR-based indices, Functional Similarity Weight and the Local Interacting Score; (2) global approaches, e.g. the Negated Shortest Path, the Katz Index, the Global Leicht-Holme-Newman Index, Random Walks (RW), Random Walks with Restart (RWR), Flow Propagation, the Maximal Entropy Random Walk, the Pseudo-inverse of the Laplacian Matrix, Average Commute Time (ACT), the Random Forest Kernel Index and the Blondel Index; (3) quasi-local approaches, e.g. the Local Path Index (LP), Local Random Walks (LRW), Superposed Random Walks (SRW), Third-Order Resource Allocation based on Common Neighbor Interactions, FriendLink and the PropFlow Predictor.

For similarity measures in heterogeneous information networks, Sun et al. SHYYW:2011 proposed a meta path based similarity measure in HINs, called PathSim. Lao and Cohen LC:2010b ; LC:2010a studied the problem of measuring entity similarity in labeled directed graphs, and defined a Biased Path Constrained Random Walk (BPCRW) model, which can be applied to HINs. Huang et al. HZCSML:2016 proposed a similarity measure, BSCSE, which can capture more complex semantics. Shi et al. SKHYW:2014 proposed a relevance measure, HeteSim, which can be used to evaluate the relatedness of two objects with different types; for a user-specified meta path, HeteSim is based on the pairwise random walk from the path's two endpoints to its center. Xiong et al. XZY:2015 studied the problem of finding similar object pairs by virtue of locality sensitive hashing. Zhu et al. SEMATCH:2017 proposed an integrated framework for the development, evaluation and application of semantic similarity for knowledge graphs, which can be viewed as complicated heterogeneous information networks; this framework includes many similarity tools and allows users to compute semantic similarities. In ForwardBYW:2017 , the authors studied the similarity search problem in social and knowledge networks, and proposed a dual-perspective similarity metric called Forward Backward Similarity.

3 Preliminaries

In this section, we introduce some important concepts related to HINs, including the network schema in subsection 3.1, and meta paths and meta structures in subsection 3.2.

3.1 HIN Definition

As defined in article SWLYW:2014 , an information network is essentially a directed graph G = (V, E) with an object type mapping φ: V → A and a link type mapping ψ: E → R, where V and E respectively denote its sets of objects and links, and A and R respectively denote its sets of object types and link types. The map φ assigns to each object v ∈ V its object type φ(v) ∈ A; that is to say, each object in V belongs to a specific object type. Similarly, the map ψ assigns to each link e ∈ E its link type ψ(e) ∈ R. In essence, a link type contains some semantics because it is a relation from a source object type to a target object type; if two links belong to the same link type, they share the same starting object type as well as the same ending object type. G is called a Heterogeneous Information Network (HIN) if |A| > 1 or |R| > 1. Otherwise, it is called a homogeneous information network.
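The typed-graph definition above can be sketched in a few lines of Python. This is a minimal illustration only; the class name, method names and object identifiers are ours, not from the paper.

```python
class HIN:
    """A toy HIN: objects with types (the map phi) and typed links (the map psi)."""
    def __init__(self):
        self.objects = {}   # object id -> object type, i.e. the map phi
        self.links = []     # list of (source id, target id, link type), i.e. psi

    def add_object(self, oid, otype):
        self.objects[oid] = otype

    def add_link(self, src, dst, ltype):
        self.links.append((src, dst, ltype))

    def is_heterogeneous(self):
        # A HIN has more than one object type or more than one link type.
        n_obj_types = len(set(self.objects.values()))
        n_link_types = len(set(l for _, _, l in self.links))
        return n_obj_types > 1 or n_link_types > 1

# A fragment of the toy bibliographic network (illustrative ids and link names).
g = HIN()
g.add_object("A1", "Author"); g.add_object("P1", "Paper"); g.add_object("V1", "Venue")
g.add_link("A1", "P1", "writes")
g.add_link("P1", "V1", "publishedIn")
```

With more than one object type present, `is_heterogeneous()` returns `True`, matching the definition above.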

Figure 2: (a) Bibliographic network schema. (b) Biological network schema.

For each HIN, there is a meta-level description of it, called its network schema. Specifically, the network schema SWLYW:2014 of G is a directed graph consisting of the object types in A and the link types in R. Fig. 2(a) shows the network schema for the HIN in Fig. 1. In this paper, we also use a biological information network consisting of six object types, i.e. Gene (G), Tissue (T), GeneOntology (GO), ChemicalCompound (CC), Substructure (Sub) and SideEffect (Si), and five link types, i.e. GOG, TG, GCC, CCSi and CCSub. Its network schema is shown in Fig. 2(b).

3.2 Schematic Structures

Up to now, there are two kinds of schematic structures (meta paths and meta structures) for the network schema of a HIN. Both of them carry some semantics. It is noteworthy that both kinds of schematic structures must be specified by users when using them to measure the similarities between objects.

A meta path SHYYW:2011 is essentially an alternating sequence of object types and link types, i.e. P = A1 → A2 → ⋯ → A(l+1), where the i-th arrow carries a link type Ri starting from Ai and ending at A(i+1). In essence, the meta path contains some composite semantics because it represents the composite relation R1 ∘ R2 ∘ ⋯ ∘ Rl. Unless stated otherwise, the meta path is compactly denoted as A1A2⋯A(l+1). There are some useful concepts related to meta paths in literature SHYYW:2011 , i.e. the length of P, path instances following P, the reverse meta path of P, symmetric meta paths and the commuting matrix of P. For example, Fig. 3(a,b,c) show three meta paths in the network schema shown in Fig. 2(a). They can be compactly denoted as APA, APVPA and APTPA, and they express different semantics: APA expresses “two authors cooperate on a paper”, APVPA expresses “two authors publish their papers in the same venue”, and APTPA expresses “two authors publish papers containing the same terms”.
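The commuting matrix of a meta path is simply the product of the relation (adjacency) matrices along the path. A minimal numpy sketch for a symmetric path of the APA kind, with made-up data:

```python
import numpy as np

# Toy relation matrix W_AP: rows are authors, columns are papers (invented data).
W_AP = np.array([[1, 1, 0],
                 [0, 1, 1]])

# Commuting matrix of the symmetric meta path A-P-A: the product W_AP (W_AP)^T.
# Entry M_APA[i, j] counts the path instances (here: co-authored papers)
# between author i and author j.
M_APA = W_AP @ W_AP.T
print(M_APA)
```

For this toy data the diagonal counts each author's papers (2 each) and the off-diagonal counts the single shared paper, so the matrix is symmetric, as expected for a symmetric meta path.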

Figure 3: Some Meta Paths and meta structures.

A meta structure HZCSML:2016 is essentially a directed acyclic graph with a single source object type and a single target object type, given by a set of object types and a set of link types. Fig. 3(d,e) show two meta structures for the network schema shown in Fig. 2(a), and Fig. 3(f) shows a meta structure for the network schema shown in Fig. 2(b). The two meta structures in Fig. 3(d,e) express the more complicated semantics “two venues publish papers that both contain the same terms and are written by the same authors” and “two authors write papers that both contain the same terms and are published in the same venue”, respectively.

Given a meta structure S, we sort its object types in topological order, and suppose its height is equal to h. Let T_i denote the set of object types on layer i, and let C_i denote the cartesian product of the sets of objects belonging to the different types in T_i, 0 ≤ i ≤ h. The relation matrix M_i from C_{i-1} to C_i is defined as the matrix whose (j,k)-entry is equal to 1 if the j-th element of C_{i-1} is adjacent to the k-th element of C_i, and 0 otherwise. A tuple u ∈ C_{i-1} and a tuple v ∈ C_i are adjacent if and only if, for every pair of adjacent object types across the two layers, the corresponding component objects of u and v are adjacent in the HIN. The commuting matrix of S is defined as the product M_S = M_1 M_2 ⋯ M_h. Each entry in M_S represents the number of instances following S between the corresponding source and target objects. The commuting matrix of the reverse of S is equal to the transpose of M_S.

Figure 4: Illustration of the cartesian products C_i (left-hand side) and the relation matrices M_i (right-hand side). Because of the space limitation, one relation matrix is partitioned into two blocks.

Take the HIN shown in Fig. 1 as an example, and compute the commuting matrix of the meta structure shown in Fig. 3(b). It has five layers T_0, T_1, T_2, T_3 and T_4. The i-th box on the left-hand side of Fig. 4 shows the cartesian product C_i. Then, we can easily obtain the relation matrices M_1, M_2, M_3 and M_4 on the right-hand side of Fig. 4. For instance, the relation-matrix entry indexed by P:HeteSim and (V:TKDE, T:Similarity) equals 1, so P:HeteSim is adjacent to (V:TKDE, T:Similarity); in fact, P:HeteSim is published in V:TKDE and contains the term T:Similarity. Similarly, the zero entry indexed by P:HeteSim and (V:TKDE, T:Ranking) implies that P:HeteSim is not adjacent to (V:TKDE, T:Ranking); according to the HIN shown in Fig. 1, P:HeteSim does not contain the term T:Ranking. As a result, the commuting matrix is obtained as the product of these relation matrices.
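The tuple-adjacency rule used above can be sketched directly: a tuple in one cartesian-product layer is adjacent to a tuple in the next layer iff every relevant pair of component objects is adjacent in the HIN. A toy illustration; the object names and the adjacency set are invented, not the paper's data.

```python
import itertools

# Toy object-level adjacency, as a set of (object, object) pairs (illustrative).
adj = {("P1", "V1"), ("P1", "T1"), ("P2", "V1"), ("P2", "T2")}

layer_prev = [("P1",), ("P2",)]                              # C_{i-1}: 1-tuples of papers
layer_next = list(itertools.product(["V1"], ["T1", "T2"]))   # C_i: (venue, term) tuples

def tuples_adjacent(u, v):
    # u is adjacent to v iff every component of u is adjacent to every component of v
    # (here the paper must link to both the venue and the term).
    return all((p, x) in adj for p in u for x in v)

# Relation matrix M_i between the two cartesian-product layers.
M = [[1 if tuples_adjacent(u, v) else 0 for v in layer_next] for u in layer_prev]
```

Here P1 links to (V1, T1) but not (V1, T2), and symmetrically for P2, so M is a 2x2 identity-like matrix.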

For a given meta structure, its BSCSE value can be expressed by its commuting matrix as well; the following lemma 3.1 describes this conclusion. Throughout this paper, given a matrix M, we use its normalized version D^{-1}M, where D is the diagonal matrix whose nonzero entries are equal to the row sums of M.

Lemma 3.1.

Given a meta structure S, suppose that T_i denotes the set of object types on its i-th layer, and let M̃_i denote the normalized relation matrix from C_{i-1} to C_i. Then the BSCSE value between a source object and a target object equals the corresponding entry of the product M̃_1 M̃_2 ⋯ M̃_h.

Proof.

We prove the lemma by induction on .

Initial Step. Obviously, . When , . Assume there are different object tuples in adjacent to , denoted as . According to the definition of HZCSML:2016 , . Obviously, . Therefore, we have .

Inductive Step. Assume the conclusion holds for . Below, we prove it also holds for . Obviously,

(1)

where , and . Note in particular that , where , and , where , is equal to either 0 or . According to the definition of in literature HZCSML:2016 ,

(2)

Combining formulas 1 and 2, we have

where . The conclusion holds for . ∎

In this paper, we aim to define a similarity measure for HINs that does not depend on any pre-specified schematic structure. This is a reasonable goal because specifying meta paths or meta structures is a cumbersome job.

4 Stratified Meta Structure Based Similarity

In this section, we define the stratified meta structure based similarity measure for HINs. First, we describe the architecture of the stratified meta structure in section 4.1. Second, we formally define the similarity based on the stratified meta structure in section 4.2. Finally, we describe the pseudo-code for computing the similarity in section 4.3.

4.1 Stratified Meta Structure

A Stratified Meta Structure (SMS) is essentially a directed acyclic graph consisting of object types with different layer labels. Its salient advantage is that it can be constructed automatically by repetitively visiting object types while traversing the network schema. Given a HIN G, we first extract its network schema, and then select a source object type and a target object type. Unless stated otherwise, the source object type is the same as the target one. The construction rule of the SMS is described as follows. The source object type is placed on the 0-th layer. The object types on layer l+1 are the neighbors, in the network schema, of the object types on layer l; adjacent object types are linked by an arrow pointing from the l-th layer down to the (l+1)-th layer. Note in particular that once we get the target object type on a layer l ≥ 1, we delete its outgoing links, i.e. the ones starting from it down to the object types on the (l+1)-th layer. Repeating the above process, we obtain the SMS.
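The construction rule can be sketched as a BFS-style expansion over the schema. This is a simplified illustration under two assumptions: it tracks only the layer sets (not the arrows between them), and the source object type equals the target one, so the target's outgoing links are cut whenever it reappears.

```python
# Toy network schema as an undirected adjacency list (bibliographic example:
# Author-Paper, Paper-Venue, Paper-Term).
schema = {"A": ["P"], "P": ["A", "V", "T"], "V": ["P"], "T": ["P"]}

def build_sms_layers(schema, source, n_layers):
    """Layer l+1 holds the schema neighbors of the types on layer l.
    The target type (here: the source itself) never expands further down,
    which models deleting its outgoing links."""
    layers = [{source}]
    for _ in range(n_layers):
        nxt = set()
        for t in layers[-1]:
            if t == source and len(layers) > 1:
                continue  # target object type reappeared: cut its outgoing links
            nxt.update(schema[t])
        layers.append(nxt)
    return layers

layers = build_sms_layers(schema, "A", 4)
print(layers)
```

For the bibliographic schema this produces the alternating pattern {A}, {P}, {A, V, T}, {P}, {A, V, T}, matching the SMS of Fig. 6(a).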

Figure 5: The construction of the SMS of the toy bibliographic information network. The numbers near nodes stand for their layer labels
Figure 6: Two kinds of SMS. The numbers near nodes stand for the layer labels. In this figure, the deeper layers are omitted because of the space limitation.

Fig. 6(a) shows the SMS of the network schema shown in Fig. 2(a); its construction is illustrated in Fig. 5. Author (A) is both the source and the target object type. First, A is placed on the 0-th layer, see Fig. 5(a). P is placed on the 1-st layer, because P is the only neighbor of A in the network schema shown in Fig. 2(a), see Fig. 5(b). A, V and T are placed on the next layer, because they are the neighbors of P, see Fig. 5(c). Similarly, P is placed again on the layer after that, because it is the neighbor of both V and T, see Fig. 5(d). At this point, A has been visited again; note in particular that the link from A down to P is deleted, because A is the target object type. Repeating the above procedure, we obtain the SMS shown in Fig. 6(a). Fig. 6(b) shows the SMS of the network schema shown in Fig. 2(b), with Gene as both the source and target object type; it can be constructed in the same way. It is worth noting that Tissue and GeneOntology are only placed on the 1-st layer, because their degrees in the network schema shown in Fig. 2(b) are equal to 1.

Below, we give some properties of the SMS via lemma 4.3. Given an SMS, we sort its object types in topological order. Let T_l denote the set of object types with layer label l, excluding the target object type, and let C_l denote the cartesian product of the sets of objects belonging to the different types in T_l. The relation matrix from C_{l-1} to C_l is defined in the same way as for the commuting matrices of meta structures in section 3.2. A substructure consisting of three consecutive layers of the SMS is recurrent if and only if it repeats on all sufficiently deep layers. Let h denote the height of the spanning tree of the network schema yielded by Breadth-First Search (BFS). Without loss of generality, we assume the source object type in the network schema does not contain self-loops. If the source object type has a self-loop, we assign two roles to it: target object type and intermediate object type. The first role is to treat it as the target object type, and the second is to treat it as a non-source and non-target object type.

Note 4.1.

In the process of computing the commuting matrices, we need to visit all the elements in each C_l. In practice, there are many all-zero columns in a relation matrix M_l. We should remove all the all-zero columns of M_l and the corresponding rows of M_{l+1}. Removing the j-th column of M_l implies that there are no links between the j-th element of C_l and any object in C_{l-1}; therefore, it is unnecessary to consider the links between it and the objects in C_{l+1}.
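This pruning preserves the matrix product, since an all-zero column contributes nothing to it. A small numpy check with toy matrices:

```python
import numpy as np

# Toy relation matrices; column 1 of M_l is all-zero (purely illustrative data).
M_l  = np.array([[1, 0, 1],
                 [0, 0, 1]])      # relation matrix from C_{l-1} to C_l
M_l1 = np.array([[1, 0],
                 [1, 1],
                 [0, 1]])         # relation matrix from C_l to C_{l+1}

keep = np.flatnonzero(M_l.sum(axis=0))   # columns of M_l that are not all zero
M_l_pruned  = M_l[:, keep]               # drop the all-zero columns of M_l
M_l1_pruned = M_l1[keep, :]              # drop the corresponding rows of M_{l+1}

# The product over the pruned pair equals the original product.
assert np.array_equal(M_l_pruned @ M_l1_pruned, M_l @ M_l1)
```

Only one column/row pair is dropped here, but on real relation matrices between large cartesian-product layers this can shrink the intermediate matrices substantially.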

Note 4.2.

In the process of computing the similarities for a given source object, the objects other than the source can be removed from C_0. At this time, we only need to consider the elements of C_1 adjacent to the source, and the others are removed from C_1. In general, the set of elements considered in a layer consists of exactly those adjacent to the elements considered in the previous layer; all the others are removed from that layer.

Lemma 4.3.

Assume the source object type is the same as the target one, and the source object type in does not contain self-loops. The SMS has the properties:

  1. The target object type lies on the even-numbered layers 2k, k ≥ 1.

  2. If we walk up from the target object type on layer 2k to the source object type along the parents of the object types, then we obtain a symmetric meta structure, denoted S_k.

  3. For all sufficiently deep layers l, the layer sets repeat, i.e. T_{l+2} = T_l.

  4. The substructure consisting of the object types (except the target object type) on sufficiently deep consecutive layers recurrently appears in the SMS.

  5. For sufficiently large k, the meta structure S_k contains recurrent substructures, whose number is given by

    (3)
Proof.

1) The conclusion holds obviously according to the construction rule of and our assumptions.

2) According to property 1, the target object type (with different layer labels) lies on the even-numbered layers. Thus, the meta structure obtained by walking up from the target object type on layer 2k has height 2k. According to the construction criteria of the SMS, this meta structure is symmetric with respect to layer k.

3) For any , it must be adjacent to an object type on the layer . Obviously, according to the construction rule of . Thus, we have . Similarly, . Therefore, .

4) This property obviously holds according to property 3.

5) We prove the property by induction on k.

Initial Step. When , we obviously have according to the construction rule of .

Inductive Step. Assume for the layer . Below, we prove that the conclusion holds for . When , we obtain a new recurrent structure consisting of . When , we obtain a new recurrent structure consisting of . Therefore, we have

The conclusion holds. ∎

Figure 7: Recurrent structures and meta structures.

According to property 2 in lemma 4.3, the SMS is essentially composed of an infinite number of meta paths and meta structures. For example, the SMS shown in Fig. 6(a) can be obtained by combining the meta path shown in Fig. 3(a), the meta structure shown in Fig. 3(b), and the others containing one or more of the recurrent substructures shown in Fig. 7(a). The SMS shown in Fig. 6(b) can be obtained by combining the meta structures shown in Fig. 7(b,d) and the others containing one or more of the recurrent substructures shown in Fig. 7(c). It is noteworthy that the meta structures shown in Fig. 7 can also be written in the compact notation of section 3.2.

4.2 Similarity

Figure 8: Illustration of basic substructures.

Now, we define the stratified meta structure based similarity by virtue of the commuting matrices of meta paths and meta structures. The SMS is essentially composed of a recurrent substructure, several basic substructures and their reverses. The basic substructures are bipartite graphs consisting of the object types on consecutive layers of the SMS, and the recurrent substructure consists of the periodically repeating layers. The symmetric meta structures obtained by walking up from the target object type on a shallow layer to the source object type consist of basic substructures and their reverses; those obtained by walking up from the target object type on a deeper layer additionally contain one or more recurrent substructures. For example, the SMS shown in Fig. 6(a) can be obtained by combining the recurrent substructure shown in Fig. 7(a) and the two basic substructures shown in Fig. 8(a,b). The SMS shown in Fig. 6(b) can be obtained by combining the recurrent substructure shown in Fig. 7(c) and the three basic substructures shown in Fig. 8(c,d,e).

The commuting matrix of a stratified meta structure is formally defined as the summation of the commuting matrices of the meta paths and meta structures it contains. Let M denote the commuting matrix of the SMS, and let S_k denote the meta structure (possibly a meta path) obtained by walking up from the target object type on layer 2k to the source object type, k ≥ 1. Therefore, M = Σ_{k ≥ 1} M_{S_k}.
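In practice, the infinite summation must be truncated and the terms weighted. The sketch below uses invented commuting matrices and an assumed geometric weighting scheme (weight β^k for the k-th meta structure); the paper's exact weights may differ.

```python
import numpy as np

def row_normalize(M):
    """Normalized version of M: divide each row by its row sum (zero rows kept)."""
    d = M.sum(axis=1, keepdims=True)
    d[d == 0] = 1.0
    return M / d

# Toy commuting matrices of the first two meta structures S_1, S_2 (made up).
M_S = [np.array([[2.0, 1.0], [1.0, 2.0]]),
       np.array([[4.0, 2.0], [2.0, 4.0]])]

beta = 0.5  # decaying factor; omega_k = beta**k is our assumed weighting
M_sms = sum(beta**k * row_normalize(M) for k, M in enumerate(M_S, start=1))
print(M_sms)
```

Deeper meta structures (larger k) thus contribute less, which matches the intuition that longer, more repetitive semantics should be discounted.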

As stated previously, M_l denotes the relation matrix from C_{l-1} to C_l. For the basic substructure consisting of the object types on layers l and l+1, its relation matrix is just M_{l+1}. For the recurrent substructure, the relation matrix is the same for every repetition, according to property 3 in lemma 4.3. Below, we show how to compute M_{S_k}.

Lemma 4.4.

For any k ≥ 1, the commuting matrix M_{S_k} can be computed recursively as follows.

  1. If the degree of the source object type in the network schema is equal to 1, then we have

    (4)

    where

  2. If the degree of the source object type in the network schema is larger than 1, then consider the set of object types with degree larger than 1 among the neighbors of the source object type. We have

    (5)
Proof.

For case 1, we prove it by induction on . Case 2 is similar.

Initial Step. When k = 1, the obtained meta structure (possibly a meta path) consists of the basic substructures and their reverses. Obviously, the conclusion holds.

Inductive Step. Assume the conclusion holds for , and the meta structure for is . Now, we prove the conclusion also holds for . According to property 3 in lemma 4.3, the meta structure for is . When , . Thus, we have

When k is large enough, there are recurrent substructures according to property 5 in lemma 4.3. Thereby, the obtained meta structure for k can be denoted as

The obtained meta structure for is

Therefore,

So, the conclusion holds when . ∎

Below, we only discuss case 1 in lemma 4.4; case 2 is similar. The resulting matrix power series may be divergent and, in addition, different meta structures in the SMS should have different weights. As a result, the normalized version of M is defined as the weighted sum of the normalized commuting matrices M̃_{S_k}, where M̃_{S_k} and R̃ are respectively the normalized versions of M_{S_k} and of the relation matrix of the recurrent substructure. The parameter β ∈ (0, 1) is called the decaying factor, and the weights of the different meta structures sum to one. Obviously, the spectral radius of βR̃ is less than 1, because R̃ is a row-stochastic matrix and 0 < β < 1. Let I denote the identity matrix with the same size as R̃. As a result, the geometric part of the summation collapses into a closed form involving (I − βR̃)^{-1}, and we have

(6)
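The convergence argument behind this closed form can be checked numerically: for a row-stochastic R̃ and β ∈ (0, 1), the geometric series Σ_{k ≥ 0} (βR̃)^k equals (I − βR̃)^{-1}. A toy verification (the matrix is invented):

```python
import numpy as np

# R: a row-stochastic matrix (toy), beta in (0, 1), so the spectral radius
# of beta * R is at most beta < 1 and the geometric series converges.
R = np.array([[0.5, 0.5],
              [0.2, 0.8]])
beta = 0.9
I = np.eye(2)

# Closed form of the series sum_{k>=0} (beta * R)^k.
closed = np.linalg.inv(I - beta * R)

# Truncated series for comparison; the residual decays like beta**n.
truncated = sum(np.linalg.matrix_power(beta * R, k) for k in range(200))
assert np.allclose(closed, truncated)
```

A side effect of row-stochasticity: every row of the closed form sums to 1/(1 − β), here 10, which gives a quick sanity check on the computation.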

The Stratified Meta Structure based Similarity, SMSS, of the source object and a target object is defined via the normalized commuting matrix as

(7)
Note 4.5.

Using note 4.2 results in retained sets whose sizes differ between corresponding layers on the two sides of the symmetry axis. As a result, the matrix R̃ defined previously is not square. To address this issue, the extra elements are removed from the larger of the two layers. This leads to losing some semantics.

Now, we take the HIN shown in Fig. 1 as an example to illustrate note 4.5. The object A:Yizhou Sun is selected as the source. As shown in the first box (C_0) on the left-hand side of Fig. 4, A:Yizhou Sun, marked in red, is kept, and the others are removed. In the second and third boxes (C_1 and C_2), the elements marked in red are kept and the others are removed. In the fourth box (C_3), P:GenClus is kept in addition to the red elements, because it is adjacent to (V:VLDB, T:HIN). In the fifth box (C_4), the elements marked in green should also be kept in addition to the red ones, because they are adjacent to P:GenClus. According to the approximation strategy, however, they are removed. That means some semantics are lost.

According to notes 4.2 and 4.5, some elements in each layer are removed, and the matrices in formula 6 should be adjusted accordingly. M_l is still used to denote the relation matrix from the renewed C_{l-1} to the renewed C_l. Once this pruning is applied, the relation matrices on the left side and on the right side of the symmetry axis of S_k are essentially different, and we must explicitly distinguish them. Before proceeding, consider the set of objects belonging to the target object type, and denote by the left-hand relation matrices the pruned matrices above the symmetry axis and by the right-hand relation matrices those below it. As a result,

(8)

where R̃ still denotes the normalized relation matrix of the recurrent substructure. SMSS is still defined via formula 7, using the adjusted commuting matrix.

Note 4.6.

In practice, it is very time-consuming to compute the matrix inverse (I − βR̃)^{-1} in formula 8. Note that the quantity to be computed is a row vector according to the locality strategy. Therefore, computing it is equivalent to solving a system of linear equations. We apply Lower-Upper (LU) decomposition to the coefficient matrix and then use the LU factors to obtain the solution numerical_analysis .
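A sketch of this strategy in numpy; `np.linalg.solve` performs an LU factorization internally (LAPACK `gesv`), so the inverse is never formed explicitly. The matrices here are toy stand-ins, not the paper's data.

```python
import numpy as np

# Computing the row vector x = b (I - beta*R)^{-1} by solving linear equations:
# x A = b  is equivalent to  A^T x^T = b^T.
R = np.array([[0.5, 0.5],
              [0.2, 0.8]])    # toy row-stochastic recurrent-substructure matrix
beta = 0.9
A = np.eye(2) - beta * R
b = np.array([1.0, 0.0])      # indicator row vector of the source object

x = np.linalg.solve(A.T, b)   # LU-based solve; avoids np.linalg.inv(A)
assert np.allclose(x @ A, b)
```

When many right-hand sides share the same coefficient matrix, factoring once and reusing the factors (e.g. via `scipy.linalg.lu_factor`/`lu_solve`) amortizes the cost further.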

4.3 Algorithm Description

Now, we describe the algorithm for computing SMSS, see algorithm 1. As shown in lines 2-5, the dominant cost lies in constructing the pruned layers and relation matrices and computing the commuting matrix via formula 8. According to note 4.2, we obtain a vector whose entries represent the similarities between the source object and the others. As shown in lines 6-9, the remaining cost of computing the final similarity scores is linear in the length of this vector. The worst-case time complexity of algorithm 1 is therefore dominated by the matrix computations in lines 2-5.

0:  HIN G, source object s, decaying parameter β, weights of the meta structures.
0:  Similarity vector sim
1:  Compute the pruned layers according to notes 4.2 and 4.5
2:  for each retained layer do
3:       Compute the relation matrices on the left and right of the symmetry axis
4:  end for
5:  Compute the commuting matrix using formula 8
6:  for each candidate target object t do
7:      Compute SMSS(s, t) using formula 7
8:      Append SMSS(s, t) to vector sim
9:  end for
10:  return sim
Algorithm 1 Computing SMSS

Take the toy example shown in Fig. 1: its SMS is shown in Fig. 6(a), and Author is the source object type. The layer sets and the pruned cartesian products can be read off the left-hand side of Fig. 4, and the corresponding relation matrices off its right-hand side. The normalized commuting matrix can then be computed according to formula 6, and the SMSS scores between the source author and the other authors follow from formula 7.