 # Tensor SimRank for Heterogeneous Information Networks

We propose a generalization of the SimRank similarity measure to heterogeneous information networks. Given an information network, the intraclass similarity score s(a, b) is high if the set of objects related to a and the set of objects related to b are pairwise similar according to all imposed relations.


## 1 Introduction

Most data in the modern world can be treated as an information network, so measuring node similarity in a network has a wide range of applications: search, recommender systems, analysis of research publication networks, biology, transportation and logistics, and others.

Consider a semantic network: a set of types $T$, where each type is a set of entities, and a set of relations $R$, where each relation is a 2-order predicate defined on two types from $T$:

$$R \ni r_{tp} \colon t \times p \mapsto \{0, 1\}, \qquad t, p \in T.$$

Both types in a relation can be equal ($t = p$), and several relations can share the same pair of types. This structure may be considered as a graph with colored vertices and colored edges: a vertex color is its entity type, and an edge color corresponds to a relation.

The question that we address is how to define similarity functions

$$s_t \colon t \times t \to \mathbb{R}, \qquad \forall t \in T,$$

that would reflect the closeness of objects based on the "similarity of relations" they enter, while at the same time not mixing different relations, since "objects of different types and links carry different semantic meanings, and it does not make sense to mix them to measure the similarity without distinguishing their semantics".

### 1.1 Related work

The basic graph-structure similarity measure is the classical SimRank over a homogeneous graph, which is defined as follows:

$$N_G(a) = \{v \in V : (v, a) \in E(G)\},$$
$$s(a, b) = \frac{C}{|N(a)|\,|N(b)|} \sum_{v \in N(a)} \sum_{w \in N(b)} s(v, w).$$
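As a concrete illustration, this definition can be iterated naively until it stabilizes; the path graph, function name, and decay constant `C` below are our own illustrative choices, not part of the original paper:

```python
import numpy as np

def simrank(adj, C=0.8, iters=20):
    """Naive fixed-point iteration for classical SimRank.

    adj is a symmetric 0/1 adjacency matrix; N(a) are the neighbours of a.
    Each sweep recomputes s(a, b) from the previous similarity matrix.
    """
    n = adj.shape[0]
    S = np.eye(n)
    for _ in range(iters):
        S_new = np.eye(n)  # s(a, a) = 1 by definition
        for a in range(n):
            for b in range(n):
                if a == b:
                    continue
                Na = np.flatnonzero(adj[a])
                Nb = np.flatnonzero(adj[b])
                if Na.size and Nb.size:
                    # s(a,b) = C / (|N(a)||N(b)|) * sum over neighbour pairs
                    S_new[a, b] = C * S[np.ix_(Na, Nb)].sum() / (Na.size * Nb.size)
        S = S_new
    return S

# Tiny example: an undirected path graph 0 - 1 - 2.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
S = simrank(A)
# Nodes 0 and 2 share their only neighbour, so s(0, 2) = C.
```

Note the well-known parity quirk of SimRank on bipartite structures: adjacent nodes of a path graph get similarity zero, while nodes at even distance do not.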

The main drawback of this approach is that it cannot incorporate multiple relations or object types: the only option is to mix them up into the blobs "a relation exists" and "all objects", which is not applicable when we have multiple relations with different semantics. For example, the OpenCyc ontology node for the concept "Game" (see Figure 1) cannot be easily expressed via a single type of relations and objects.

Personalized PageRank is also often used to measure similarity in homogeneous graphs:

$$\pi_a(b) = \varepsilon \delta_a(b) + (1 - \varepsilon) \sum_{(w, b) \in E} \pi_a(w) \, \alpha_{w, b},$$

which is the same as PageRank, except that random jumps are made into a pre-chosen node $a$ rather than into a random node.
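A power-iteration sketch of personalized PageRank follows directly from the equation; the transition matrix, seed node, and jump probability below are illustrative choices of ours:

```python
import numpy as np

def personalized_pagerank(P, seed, eps=0.15, iters=200):
    """Power iteration: with probability eps the walker jumps back to the
    pre-chosen seed node, otherwise it follows the column-stochastic
    transition matrix P."""
    delta = np.zeros(P.shape[0])
    delta[seed] = 1.0
    pi = delta.copy()
    for _ in range(iters):
        pi = eps * delta + (1 - eps) * P @ pi
    return pi

# Undirected path graph 0 - 1 - 2 with uniform transition probabilities.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
P = A / A.sum(axis=0, keepdims=True)  # column-stochastic
pi = personalized_pagerank(P, seed=0)
```

Since `P` is column-stochastic and the seed vector sums to one, the iterate stays a probability distribution; nodes closer to the seed accumulate more mass.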

Another option is PathRank, which measures path similarity between objects picked from the same class of the heterogeneous information network given a symmetric meta-path $P$ (a set of paths that satisfy a given composition of relations). The similarity is the number of paths from object $a$ to object $b$ (each step must satisfy the corresponding relation in $P$) normalized by the number of paths from $a$ to $a$ plus the number of paths from $b$ to $b$:

$$s_P(a, b) = \frac{\|\{p \in P : a \xRightarrow{p} b\}\|}{\|\{p \in P : a \xRightarrow{p} a\}\| + \|\{p \in P : b \xRightarrow{p} b\}\|}$$

That approach can handle several relations and object types and is very useful when we know the structure of relations on which we want our similarity measure to be based. If we instead want to "put our relations into a black box" that finds a similarity capturing all network relations as a whole, we need something different. Recently, an approach for building an optimal linear combination of meta-paths has been proposed.
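The path-counting similarity above can be sketched with adjacency-matrix products, since the $(a, b)$ entry of the product of the step matrices counts meta-paths from $a$ to $b$; the author-paper incidence matrix and names below are hypothetical:

```python
import numpy as np

def meta_path_sim(mats, a, b):
    """Number of meta-paths a => b divided by the number of a => a paths
    plus the number of b => b paths; `mats` lists one adjacency matrix per
    relation step of the (symmetric) meta-path."""
    M = mats[0]
    for W in mats[1:]:
        M = M @ W  # (M)_{ij} now counts paths i -> j along the prefix
    denom = M[a, a] + M[b, b]
    return M[a, b] / denom if denom else 0.0

# Meta-path Author-Paper-Author from a 2x3 author-paper incidence matrix:
# authors 0 and 1 co-wrote paper 0; papers 1 and 2 are single-authored.
AP = np.array([[1, 1, 0],
               [1, 0, 1]], dtype=float)
sim = meta_path_sim([AP, AP.T], 0, 1)
```

Here `AP @ AP.T` is `[[2, 1], [1, 2]]`, so the two authors share one co-authorship path out of four self-paths.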

There are also several works on measuring similarity between objects from different classes.

## 2 Tensor SimRank

### 2.1 Problem statement

Let us consider a function that assigns a similarity score to two objects from the same class as follows: objects are similar (the value is high) if they are related to objects which are similar too. This interdependence can be expressed via the following definition:

$$N_{r_{tp}}(a) = \{b \in p \mid r_{tp}(a, b) = 1\}, \qquad s_t(a, b) = \frac{1}{Z} \sum_{r_{tp} \in R} w(r_{tp}) \sum_{c \in N_{r_{tp}}(a)} \sum_{d \in N_{r_{tp}}(b)} s_p(c, d),$$

where $r_{tp}$ is a relation between classes $t$ and $p$, $N_{r_{tp}}$ is the neighbourhood function that returns the set of objects from class $p$ that are related to object $a$ via relation $r_{tp}$, $w(r_{tp})$ are the weights corresponding to relation $r_{tp}$, and $Z$ is the normalization constant.

This can be rewritten as a Tensor SimRank equation:

$$s_{\alpha\beta} = \sum_{\gamma} w_{\gamma}\, r_{\alpha\alpha'\gamma}\, s_{\alpha'\beta'}\, r_{\beta\beta'\gamma}, \qquad s = \operatorname{diag}\left(\{s_t\}_{t \in T}\right), \qquad s_{\alpha\alpha} = 1, \qquad (1)$$

where $s$ is a block-diagonal matrix (one block per entity type), $w_\gamma$ are the relation weights, and $r$ is the stochastic relation tensor, which has non-zero blocks where relations exist (we have to use a tensor instead of a matrix to allow multiple relations on the same pair of classes).

Similarity scores between elements of different classes are zero by definition, and so is the relation between objects of unrelated classes. Equation (1) is basically the classical SimRank equation with the adjacency tensor in place of the adjacency matrix: each non-zero layer of the tensor encodes some relation on the same pair of types. If one has more than a single relation between two types, then $r$ has multiple non-zero layers on the intersection of indices associated with the classes, one adjacency matrix per layer. In (1) the index $\gamma$ stands for (weighted) summation over all layers of the tensor. This can be equivalently rewritten explicitly:

$$S = \sum_{\gamma} w_{\gamma} W_{\gamma} S W_{\gamma}^{T} + D, \qquad (2)$$

where the diagonal matrix $D$ has to be chosen in such a way that $S_{\alpha\alpha} = 1$.

### 2.2 Computational algorithm

Simple iterations for (1) are computationally demanding due to large-scale matrix-by-matrix products, so we propose a method that exploits the fact that $S$ is block-diagonal and $r$ is a three-dimensional block tensor whose last dimension (the number of layers) is much smaller than the overall number of objects. On each iteration, for each class we recompute its update independently (assuming all other classes fixed), see Algorithm 1.

So we update the similarity score for each class, assuming the similarities of all other classes are fixed, in such a way that objects from the target class that are related to close objects from some other class become closer too.
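A minimal dense-matrix sketch of this block-wise update may look as follows; the data structures and names are our own, Algorithm 1 itself is not reproduced in this excerpt, and the toy book/author data is hypothetical:

```python
import numpy as np

def tensor_simrank(relations, sizes, weights, iters=30):
    """Block-wise fixed-point iteration for Tensor SimRank (sketch).

    relations: dict (t, p) -> list of column-stochastic matrices of shape
               (sizes[t], sizes[p]), one per relation layer between t and p.
    sizes:     dict class name -> number of objects in that class.
    weights:   dict (t, p) -> relation weight.
    Returns one similarity matrix per class (the diagonal blocks of S).
    """
    S = {t: np.eye(n) for t, n in sizes.items()}
    for _ in range(iters):
        S_new = {}
        for t, n in sizes.items():
            acc = np.zeros((n, n))
            for (u, p), layers in relations.items():
                if u != t:
                    continue
                for W in layers:
                    # update of class t through the relation with class p
                    acc += weights[(u, p)] * W @ S[p] @ W.T
            np.fill_diagonal(acc, 1.0)  # enforce s_aa = 1
            S_new[t] = acc
        S = S_new
    return S

# Hypothetical toy data: 3 books, 2 authors; books 0 and 1 share author 0.
B2A = np.array([[1., 0.], [1., 0.], [0., 1.]])  # book x author incidence
A2B = B2A.T.copy()
B2A /= B2A.sum(axis=0, keepdims=True)           # column-stochastic
A2B /= A2B.sum(axis=0, keepdims=True)
rel = {("book", "author"): [B2A], ("author", "book"): [A2B]}
S = tensor_simrank(rel, {"book": 3, "author": 2},
                   {("book", "author"): 1.0, ("author", "book"): 1.0})
```

In this toy run the two books sharing an author end up with a positive similarity, while the unrelated pair stays at zero.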

To present the vectorized similarity-computation algorithm, let us introduce some additional notation: a set of entity types $T$, where each entity type is a set of entities; a set of symmetric relation functions $R$; a column-stochastic matrix of pairwise type impacts (weights) $W$; and an operator that maps a relation into the corresponding column-stochastic adjacency matrix. If the relation is not defined for some pair of objects, the corresponding matrix entry is zero.

To achieve better results on sparse relations we adopted the low-rank SimRank approximation, which uses probabilistic singular value decomposition to perform fast approximate projections onto the low-rank matrix manifold at each step of the iterative process (Algorithm 3). The only difference from Algorithm 2 is that on each step we perform a probabilistic SVD decomposition of the similarity matrix and project it onto the manifold of matrices of fixed rank.
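The per-step projection can be sketched with a randomized range finder in the spirit of Halko et al.; the function names and the oversampling parameter below are our own choices:

```python
import numpy as np

def randomized_svd(A, rank, oversample=10, seed=0):
    """Probabilistic SVD: sample the range of A with a random Gaussian
    matrix, orthogonalize, and take the exact SVD of the small projection."""
    rng = np.random.default_rng(seed)
    Omega = rng.standard_normal((A.shape[1], rank + oversample))
    Q, _ = np.linalg.qr(A @ Omega)  # orthonormal basis of the sampled range
    U_small, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    return Q @ U_small[:, :rank], s[:rank], Vt[:rank]

def project_low_rank(S, rank):
    """Approximate projection of S onto the manifold of rank-`rank`
    matrices, applied after every step of the iterative process."""
    U, s, Vt = randomized_svd(S, rank)
    return (U * s) @ Vt

# Sanity check: an exactly rank-2 matrix should be reproduced exactly.
rng = np.random.default_rng(1)
M = rng.standard_normal((20, 2)) @ rng.standard_normal((2, 20))
M2 = project_low_rank(M, rank=2)
```

With oversampling, the sampled subspace almost surely contains the range of a matrix whose true rank does not exceed the target rank, so the projection is exact in that case.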

### 2.3 Convergence conditions

Recall that the classical SimRank can be computed as a solution of the equation:

$$S := WSW^{T} - \operatorname{diag}(WSW^{T}) + I.$$

The fixed-point iteration converges if $W$ is a column-stochastic matrix. In the vector form (the $\operatorname{vec}$ operator maps a matrix into a vector by stacking it column by column) this can be written as:

$$\left[W \otimes W - I\right]\operatorname{vec}(S) - \operatorname{vec}\left(\operatorname{diag}(WSW^{T})\right) + \operatorname{vec}(I) = 0;$$

if the matrix $W$ is stochastic, then $W \otimes W$ is stochastic too.
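Both the vec identity and the stochasticity claim are easy to sanity-check numerically (column-by-column vectorization is `order="F"` in NumPy; the matrices below are random and illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.random((4, 4))
W /= W.sum(axis=0, keepdims=True)     # make W column-stochastic
S = rng.random((4, 4))

vec = lambda A: A.flatten(order="F")  # stack columns into a vector

# vec(W S W^T) = (W kron W) vec(S)
assert np.allclose(vec(W @ S @ W.T), np.kron(W, W) @ vec(S))

# W column-stochastic  =>  W kron W column-stochastic
assert np.allclose(np.kron(W, W).sum(axis=0), 1.0)
```

Each column of $W \otimes W$ is the Kronecker product of two columns of $W$, so its entries sum to $1 \cdot 1 = 1$.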

Tensor SimRank (2) computation can be equivalently written in the form:

$$S := \sum_{\gamma} w_{\gamma} W_{\gamma} S W_{\gamma}^{T} - \operatorname{diag}\Big(\sum_{\gamma} w_{\gamma} W_{\gamma} S W_{\gamma}^{T}\Big) + I, \qquad (3)$$

or in the vectorized form

$$\Big[\sum_{\gamma} w_{\gamma} W_{\gamma} \otimes W_{\gamma} - I\Big]\operatorname{vec}(S) - \operatorname{vec}(\operatorname{diag}(\dots)) + \operatorname{vec}(I) = 0.$$

Moreover, SimRank is also commonly approximated by the solution of the discrete Lyapunov equation:

$$S = cWSW^{T} + (1 - c)I,$$

which can be generalized to the tensor case as

$$S = c\sum_{\gamma} w_{\gamma} W_{\gamma} S W_{\gamma}^{T} + (1 - c)I,$$

and a fixed-point iteration converges if:

$$\sum_{\gamma} w_{\gamma} \|W_{\gamma}\|_{1}^{2} \le 1 \;\xLeftrightarrow[\text{stochastic: } \|W_{\gamma}\|_{1} = 1]{}\; \sum_{\gamma} w_{\gamma} \le 1.$$
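A quick numerical illustration of this regime: with column-stochastic $W_\gamma$, weights summing to one, and $c < 1$, the fixed-point iteration for the Lyapunov form contracts. The matrices below are random and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
n, c = 6, 0.8
ws = [0.5, 0.5]  # weights sum to 1
Ws = []
for _ in range(2):
    W = rng.random((n, n))
    Ws.append(W / W.sum(axis=0, keepdims=True))  # column-stochastic

# Fixed-point iteration: S <- c * sum_g w_g W_g S W_g^T + (1 - c) I
S = np.eye(n)
res = []
for _ in range(120):
    S_next = c * sum(w * W @ S @ W.T for w, W in zip(ws, Ws)) + (1 - c) * np.eye(n)
    res.append(np.linalg.norm(S_next - S))
    S = S_next
```

The residuals shrink geometrically (roughly by the factor $c$ per step), since the iteration map is an affine contraction under these conditions.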

We conjecture that fixed-point iterations for (3) converge if:

1. Each $W_{\gamma}$ is stochastic;

2. $\sum_{\gamma} w_{\gamma} = 1$.

In the simplest form (when we have no preferences among relations and classes) this reduces to the relation weight:

$$w_{tp} = \frac{1}{\sum_{m} \big\|\{r^{(j)}_{tm} \in R\}\big\|}.$$
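Under this uniform choice, the weight of a relation is the reciprocal of the number of relations incident to its source class, so the weights seen by each class sum to one. A sketch with hypothetical class names:

```python
from collections import Counter

def default_weights(relation_pairs):
    """Uniform relation weights: each relation (t, p) gets weight
    1 / (number of relations incident to class t)."""
    incident = Counter(t for t, _ in relation_pairs)
    return {(t, p): 1.0 / incident[t] for t, p in relation_pairs}

w = default_weights([("book", "author"), ("book", "year"), ("author", "book")])
```

Here the "book" class participates in two relations, so each gets weight 0.5, while the single relation of "author" gets weight 1.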

## 3 Computational experiment

### 3.1 Synthetic data: convergence test

To test the convergence conditions we conducted a series of tests on randomly generated sparse networks with different numbers of classes, a randomly chosen number of objects in each class, a full network of relation types (all possible type relations exist) with randomly chosen edges in each, and the default weight matrix (no priority). All generated networks successfully converged, which illustrates that the sufficient convergence conditions listed in the previous section are adequate; see Figures 2 and 3.

Figure 2: Average time spent on 10 iterations of the algorithm on a random network with K components and N objects in each.

Figure 3: Mean Frobenius residual after 10 iterations of the algorithm as a function of the number of objects (N), with K components.

### 3.2 Synthetic data: similarity reconstruction

To determine whether the model is capable of similarity reconstruction, we generated a tree graph from randomly distributed points on a plane and tested whether the model can reconstruct the spatial similarity of the points based only on their relations.

Figure 4: Random points for graph generation: blue points are the zero level, red points the first level, green points the second level.

In Figure 4 the blue points represent 0-level points connected to 1-level points (red), which in turn are connected to 2-level points (green).

We have measured the following quality of similarity reconstruction of $\hat{S}$ compared to the real $S$ obtained from the generated point coordinates:

$$Q(S, \hat{S}) = \sum_{a}\sum_{b}\sum_{c} \left[S_{ab} < S_{ac}\right]\left[\hat{S}_{ab} < \hat{S}_{ac}\right],$$

which shows how many "$b$ is closer to $a$ than $c$" relations were preserved.

Figure 5: The value of $Q(S, \hat{S})$ as a function of r.

From Figure 5 one can see that at a certain level the model gets saturated, but below that level the models that use the low-rank version of Tensor SimRank perform considerably better than the "pure" algorithm. The numbers in brackets denote the dimensionality of the matrix space onto which the similarity matrices were projected at each step (the rank of the approximation).
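The order-preservation score described above can be computed directly; a sketch (here normalized to a fraction for readability, and with a cubic loop that is fine for small synthetic tests):

```python
import numpy as np

def order_preservation(S, S_hat):
    """Fraction of "b is closer to a than c" relations of the reference
    similarity S that are preserved in the approximation S_hat."""
    n = S.shape[0]
    preserved = total = 0
    for a in range(n):
        for b in range(n):
            for c in range(n):
                if S[a, b] < S[a, c]:
                    total += 1
                    preserved += int(S_hat[a, b] < S_hat[a, c])
    return preserved / total if total else 1.0

# A matrix trivially preserves all of its own order relations.
rng = np.random.default_rng(0)
S_ref = rng.random((5, 5))
q_same = order_preservation(S_ref, S_ref)
```

Comparing a matrix with itself yields a score of 1, and with its negation a score of 0, which bounds the metric.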

### 3.3 Book-Crossing Dataset test

The model was run on a subsample of the Book-Crossing Dataset. We extracted only the authors with the highest (top 100) number of books in the collection. The final network had the following structure:

$$T = \{\text{Book}, \text{Author}, \text{Year}, \text{Publisher}\}$$
$$R = \{\text{isAuthorOf}(\cdot,\cdot),\ \text{publishedBy}(\cdot,\cdot),\ \text{publishedIn}(\cdot,\cdot)\}$$
$$\#\text{Book} = 3625, \quad \#\text{Author} = 99, \quad \#\text{Year} = 65, \quad \#\text{Publisher} = 554$$

Model convergence is shown in Figure 6, where successful convergence to the best possible low-rank approximation can be seen. The similarity structure is clearly visible on the Year similarity matrix heatmap (Figure 6). We expect diagonal dominance, since temporally close years should be more or less similar in terms of the authors and publishers characteristic of that period. Tables 1 and 2 show examples of "closest book" requests; note that no NLP preprocessing was conducted, yet the model treated books from the same storybook as similar based on author/publisher/year similarities.

## 4 Discussion and further work

The proposed model can be used in various problem areas where most of the information is available in the form of relations between entities rather than features of individual entities, and where no trivial vector representation of those entities can be induced. One can use the vector representation

$$[s_t]_{ij} = \delta_{ij} + [u_t]_{ik}\,[d_t]_{kl}\,[u_t]_{lj},$$

to embed the notion of relations into classical machine learning algorithms. Also, the proposed model can be used for relation generalization, which might give interesting results since we work on heterogeneous graphs.

Further model improvements might include treating relations as objects too (probably via heterogeneous hypergraphs) and defining a similarity matrix on relations.

## 5 Conclusion

This paper proposes a generalization of SimRank for heterogeneous networks and a method for its computation that exploits the fact that the resulting similarity matrix is block-diagonal, so its components can be computed in an iterative fashion. Convergence conditions are proposed and successfully tested, and several prospective application areas are suggested.

## References

•  L. Page, S. Brin, R. Motwani, and T. Winograd, “The PageRank citation ranking: Bringing order to the web.,” 1999.
•  J. A. Konstan, B. N. Miller, D. Maltz, J. L. Herlocker, L. R. Gordon, and J. Riedl, “Grouplens: applying collaborative filtering to usenet news,” Communications of the ACM, vol. 40, no. 3, pp. 77–87, 1997.
•  C. L. Giles, “The future of Citeseer: Citeseer X,” in Proceedings of the 10th European conference on Principle and Practice of Knowledge Discovery in Databases, pp. 2–2, Springer-Verlag, 2006.
•  S. Roy, T. Lane, and M. Werner-Washburne, "Integrative construction and analysis of condition-specific biological networks," in Proceedings of the National Conference on Artificial Intelligence, vol. 22, p. 1898, AAAI Press / MIT Press, 2007.
•  W. Jiang, J. Vaidya, Z. Balaporia, C. Clifton, and B. Banich, “Knowledge discovery from transportation network data,” in Data Engineering, 2005. ICDE 2005. Proceedings. 21st International Conference on, pp. 1061–1072, IEEE, 2005.
•  S. Lee, S. Park, M. Kahng, and S.-g. Lee, “Pathrank: Ranking nodes on a heterogeneous graph for flexible hybrid recommender systems,” Expert Systems with Applications, vol. 40, no. 2, pp. 684–697, 2013.
•  G. Jeh and J. Widom, “Simrank: a measure of structural-context similarity,” in Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 538–543, ACM, 2002.
•  G. Jeh and J. Widom, “Scaling personalized web search,” in Proceedings of the 12th international conference on World Wide Web, pp. 271–279, ACM, 2003.
•  Y. Sun and J. Han, “Mining heterogeneous information networks: a structural analysis approach,” ACM SIGKDD Explorations Newsletter, vol. 14, no. 2, pp. 20–28, 2013.
•  C. Shi, X. Kong, Y. Huang, S. Y. Philip, and B. Wu, “Hetesim: A general framework for relevance measure in heterogeneous networks,” IEEE Transactions on Knowledge & Data Engineering, no. 10, pp. 2479–2492, 2014.
•  I. V. Oseledets and G. V. Ovchinnikov, “Fast, memory efficient low-rank approximation of simrank,” CoRR, vol. abs/1410.0717, 2014.
•  N. Halko, P.-G. Martinsson, and J. A. Tropp, “Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions,” SIAM review, vol. 53, no. 2, pp. 217–288, 2011.
•  J. Bierkens, O. van Gaans, and S. V. Lunel, "Estimate on the pathwise Lyapunov exponent of linear stochastic differential equations with constant coefficients," Stochastic Analysis and Applications, vol. 28, no. 5, pp. 747–762, 2010.
•  C.-N. Ziegler, S. M. McNee, J. A. Konstan, and G. Lausen, “Improving recommendation lists through topic diversification,” in Proceedings of the 14th international conference on World Wide Web, pp. 22–32, ACM, 2005.