1 Introduction
Heterogeneous information networks (HINs) [1][2][3], which involve a diversity of node types and relationships between nodes, can better model and solve many real-world problems than homogeneous networks. For HIN analysis, an important concept is the metapath [4][5], which is composed of a sequence of relationships between two nodes. For example, the movie network of IMDB contains three types of nodes: movies, directors and actors. The relationship between two movies can be described by metapaths such as Movie-Actor-Movie (MAM) and Movie-Director-Movie (MDM), where MAM connects movies starring the same actor, and MDM connects movies directed by the same director.
Network embedding [6][7], which aims to learn distributed representations of nodes in networks, is considered an effective method for network mining and has been widely studied in homogeneous networks. Recently, researchers have also proposed methods for HIN embedding, such as random walk-based methods [8][9] and relation-learning-based methods [10][11], many of which rely on the concept of metapath. In particular, with the great success of deep learning, graph neural network-based HIN embedding methods (such as HAN [12] and MAGNN [13]) have been proposed very recently. These methods often adopt a hierarchical attention structure, which uses node-level attention to aggregate information inside each metapath and metapath-level attention to fuse information of different metapaths. While these graph neural network-based methods have achieved great success in HIN embedding, they still suffer from some essential issues. First, while attention has been widely used in fields such as NLP, the complicated hierarchical attention structure may not be so effective in HIN embedding, since there is often little training data available in HINs and information from one network can hardly be transferred to another. It is therefore difficult for graph neural networks to train these hierarchical attentions well (particularly the metapath-level attention, which is meant to evaluate the essential importance of different metapaths), making it hard to really achieve the goal of selecting metapaths, especially since there is often severe overfitting in practice. At the same time, these existing methods often treat metapaths of different lengths, such as direct linked metapaths (e.g., Movie-Director) and indirect linked metapaths (e.g., Movie-Director-Movie), indistinguishably for information propagation. However, from the perspective of network science, while direct links can propagate information directly, indirect links should propagate information indirectly, and the information propagation on direct links is more essential. Therefore, for metapaths with lengths longer than one (which makes the paths indirect), it is intuitive that the information should be propagated indirectly rather than directly. Fortunately, we find that the graph convolutional network (GCN) [14] itself can partly overcome this limitation.
It propagates information along direct linked metapaths directly at each layer, and along indirect linked metapaths indirectly via the stacked layers of the deep neural network. More importantly, it has already encoded the information of all metapaths via the multi-layer propagation in an implicit way. However, GCN does not distinguish the importance of information from different metapaths in either its propagation or aggregation process, which makes it not directly suitable for HIN embedding.
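To make this concrete, here is a small numpy sketch (with a toy, hypothetical movie-actor adjacency, not from the paper's datasets) showing how two stacked one-hop propagation steps implicitly compose the two-hop metapath MAM out of the one-hop Movie-Actor relation:

```python
import numpy as np

# Toy IMDB-style HIN with 3 movies and 2 actors (hypothetical data).
# A_ma[i, j] = 1 iff movie i stars actor j (a one-hop Movie-Actor link).
A_ma = np.array([
    [1, 0],   # movie 0 stars actor 0
    [1, 0],   # movie 1 stars actor 0
    [0, 1],   # movie 2 stars actor 1
])

# One GCN layer propagates along one-hop links only; stacking a second
# layer composes Movie-Actor with Actor-Movie, so two-hop MAM
# reachability emerges implicitly from repeated one-hop propagation.
mam = A_ma @ A_ma.T   # entry (i, j) counts actors shared by movies i and j
print((mam > 0).astype(int))  # movies 0 and 1 are MAM-neighbors; movie 2 is not
```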
To utilize the advantage of GCN of implicitly encoding all metapaths, as well as to overcome the difficulty of distinguishing their importance in an effective way, we propose a novel GCN-based approach for heterogeneous information network embedding via Implicit utilization of Attention and Metapaths, referred to as GIAM. We first introduce a naive model. It uses the direct linked metapaths alone for information propagation, and utilizes a new aggregation mechanism within each layer, along with the stacked-layer propagation, to implicitly achieve the role of attention for selecting metapaths. In this way, we realize the selection of different metapaths within GCN itself (rather than using attention directly, which may lead to overfitting). Meanwhile, we make an effective refinement: we replace the spectral filter of GCN from the symmetric normalized graph Laplacian to an equivalent asymmetric one, and remove activations, modeling the propagation as a continuous Markov dynamics. We then introduce an effective Random-graph-based Propagation Constraint principle, namely RPC: if a propagation path on the given network is no better than that on the corresponding random graph, there is no reason to continue this path propagation. This makes the whole propagation process more effective by filtering out more impurity information.
To summarize, the main contributions of this paper are as follows:


We find that the hierarchical attention structure adopted by many HIN-specific graph neural networks can hardly achieve the essential selection of metapaths (due to severe overfitting); meanwhile, these methods do not distinguish one-hop and multi-hop metapaths in the propagation process.

We propose a new approach to solve these problems. It uses only direct linked metapaths for direct propagation and realizes indirect propagation by stacking layers of direct propagations. We distinguish the importance of information from different metapaths (in this process) via effective algorithmic mechanisms rather than using attention directly.

Extensive experiments on different network analysis tasks demonstrate the superiority of the proposed approach over some state-of-the-art methods.
The rest of the paper is organized as follows. Section 2 introduces a motivating example. Section 3 gives the problem definitions and introduces GCN. Section 4 proposes the new approach for HIN embedding. In Section 5, we conduct extensive experiments. Finally, we discuss related work in Section 6 and conclude in Section 7.
2 A Motivating Example
To verify whether using metapath-level attention can effectively evaluate the importance of different
Datasets | Metapaths | Models | Attention | Attention distribution | Macro-F1 | Micro-F1
IMDB | MDM, MAM | HAN | Y | [0.78, 0.22] | 57.67 | 57.79
IMDB | MDM, MAM | HAN | N | [0.50, 0.50] | 58.93 | 59.02
IMDB | MDM, MAM | MAGNN | Y | [0.57, 0.43] | 57.60 | 57.72
IMDB | MDM, MAM | MAGNN | N | [0.50, 0.50] | 58.30 | 58.50
IMDB | MDM, MAM | GIAM | - | - | 59.58 | 59.86
DBLP | APA, APVPA, APTPA | HAN | Y | [0.258, 0.736, 0.006] | 92.69 | 93.20
DBLP | APA, APVPA, APTPA | HAN | N | [0.333, 0.333, 0.333] | 92.47 | 93.04
DBLP | APA, APVPA, APTPA | MAGNN | Y | [0.022, 0.969, 0.009] | 93.19 | 93.67
DBLP | APA, APVPA, APTPA | MAGNN | N | [0.333, 0.333, 0.333] | 90.42 | 91.08
DBLP | APA, APVPA, APTPA | GIAM | - | - | 93.63 | 94.10
TABLE 1: Average Macro-F1 and Micro-F1 (%) over different ratios of supervised information.
metapaths, we conduct experiments on two widely-used heterogeneous information networks, i.e., IMDB and DBLP. We select three graph neural network-based HIN embedding methods, i.e., HAN, MAGNN and our new approach GIAM (which will be introduced in Section 4 below). Since HAN and MAGNN require a candidate metapath set, and our GIAM can also support this option, we use the same choices as the existing work [12][13], i.e., {MDM, MAM} for IMDB ('M'/'D' stands for Movie/Director and 'A' stands for Actor) and {APA, APVPA, APTPA} for DBLP ('A'/'P' stands for Author/Paper and 'V'/'T' stands for Venue/Term), which are often believed to be the essential metapaths for node classification in these networks. We compare HAN (and MAGNN) with and without metapath-level attention, as well as our new idea (GIAM) of using algorithmic mechanisms (rather than attention) to learn relationships of metapaths. We first obtain each method's embeddings on each dataset (according to the experimental settings in Section 5), and then feed them to an SVM classifier with different ratios (i.e., 5% to 80%) of supervised information. We report the average accuracy over these ratios, in terms of Macro-F1 and Micro-F1, as shown in Table 1; the detailed accuracy for each ratio of supervised information is given in the Appendix.
As shown, on IMDB, it is surprising that the methods (HAN and MAGNN) using metapath-level attention are consistently no better than those not using it. Concretely, HAN with metapath-level attention easily obtains a steep attention distribution, in which one metapath receives a dominant attention value (i.e., the distribution [0.78, 0.22] on {MDM, MAM}). Although this seems to evaluate the importance of different metapaths well, the accuracy is surprisingly reduced. This may be mainly due to overfitting, preventing the method from really selecting the correct metapaths. Differently, MAGNN with metapath-level attention tends to obtain a smooth attention distribution, i.e., [0.57, 0.43] on {MDM, MAM}. While the learned attention values differ only slightly, the accuracy is still not improved compared with not using attention. On the other hand, on DBLP, the methods (HAN and MAGNN) using metapath-level attention perform slightly better than those not using it. Since these models can be trained much better on DBLP, with a high accuracy (compared with those on IMDB), they may relieve overfitting and make attention effective to some extent. Nevertheless, in both settings, our new approach GIAM, which uses the specially designed algorithmic mechanisms (rather than attention) to learn relationships of metapaths, stably performs the best.
To further verify whether overfitting is the main reason that metapath-level attention does not help evaluate the importance of different metapaths effectively, we conduct extra experiments on IMDB, using HAN as an example. We show the training loss (and validation loss) as a function of the number of training iterations. Fig. 1(a) shows the result of HAN with metapath-level attention, and Fig. 1(b) shows that without metapath-level attention. As shown, when using metapath-level attention, as the training loss decreases, the validation loss first decreases but then increases significantly, which is a typical overfitting phenomenon. Differently, the overfitting issue is relatively slight when not using the metapath-level attention. This partly validates that the metapath-level attention may not achieve the essential selection and importance evaluation of different metapaths well, especially when the model is hard to train well (which is often the case in many network analysis tasks).
3 Preliminaries
We first introduce the problem definition, and then discuss GCN which serves as the base of our new approach.
3.1 Problem Definition
Definition 1. Heterogeneous Information Network. A heterogeneous information network is defined as a network G = (V, E), where V represents the set of multiple types of nodes, E the set of multiple types of edges, and A and R the sets of node and edge types. Each node v is associated with a node type mapping function φ: V → A, and each edge e is associated with an edge type mapping function ψ: E → R. G is defined as a heterogeneous information network when |A| + |R| > 2.
Definition 2. Adjacency Matrix of Heterogeneous Information Network. Inspired by homogeneous networks, we define the adjacency matrix of a heterogeneous information network as A ∈ R^(N×N), where a_ij = 1 if there is an edge between nodes i and j, or 0 otherwise, and N is the number of nodes. Thus, the degree matrix of A can be defined as D = diag(d_1, …, d_N), where d_i = Σ_j a_ij, i.e., we sum up the number of edges associated with node i.
Definition 3. Metapath. A metapath is defined as a path in the form of A_1 →^(R_1) A_2 →^(R_2) ⋯ →^(R_l) A_(l+1) (abbreviated as A_1A_2⋯A_(l+1)), where A_i and R_i are node and edge types, respectively. It represents a compositional relation R_1 ∘ R_2 ∘ ⋯ ∘ R_l between the two given node types A_1 and A_(l+1).
Definition 4. Metapath-based Neighbors. Given a metapath P of a heterogeneous information network, the metapath-based neighbors N_v^P of node v are defined as the set of nodes which connect with node v via metapath P. Note that N_v^P includes v itself if P is symmetric.
Definition 5. Heterogeneous Information Network Embedding. Given a heterogeneous information network G, this task is to learn d-dimensional distributed representations of nodes (d ≪ |V|) that are able to capture the rich structural and semantic information involved in G.
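As a toy illustration of Definitions 1 and 4 (all node names below are hypothetical), the following sketch checks the |A| + |R| > 2 condition and enumerates the metapath-based neighbors of a movie under the symmetric metapath MDM:

```python
from collections import defaultdict

# A tiny HIN: two movies, one director, one actor (hypothetical data).
nodes = {"m1": "movie", "m2": "movie", "d1": "director", "a1": "actor"}
edges = [("m1", "d1"), ("m2", "d1"), ("m1", "a1")]

node_types = set(nodes.values())                                  # the set A
edge_types = {frozenset((nodes[u], nodes[v])) for u, v in edges}  # the set R
is_hin = len(node_types) + len(edge_types) > 2                    # Definition 1

# Definition 4: MDM-based neighbors of m1 = movies sharing a director
# with m1; m1 itself is included because MDM is symmetric.
adj = defaultdict(set)
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)
mdm_neighbors = {m for d in adj["m1"] if nodes[d] == "director"
                 for m in adj[d] if nodes[m] == "movie"}
print(is_hin, sorted(mdm_neighbors))  # True ['m1', 'm2']
```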
3.2 Graph Convolutional Network
Spectral graph convolutional neural networks (GCN) were proposed by Bruna et al. [15] to analyze graph data. Spectral graph convolution is defined as the product of a signal x ∈ R^N and a filter g_θ = diag(θ), where θ ∈ R^N is a vector in the Fourier domain. Following this, the spectral graph convolution can be performed as g_θ ⋆ x = U g_θ U^T x, where U^T x is the graph Fourier transform of x, and U is the matrix of eigenvectors of the normalized graph Laplacian L, defined as L = I_N − D^(−1/2) A D^(−1/2) = U Λ U^T (where I_N is the identity matrix and Λ the diagonal matrix of eigenvalues). Since the eigenvalue decomposition of L in a large graph is very expensive, Defferrard et al. [16] suggest using a Kth-order Chebyshev polynomial expansion to approximate g_θ, represented as g_θ ⋆ x ≈ Σ_{k=0}^{K} θ'_k T_k(L̃) x, where θ'_k is the kth Chebyshev coefficient and L̃ = (2/λ_max) L − I_N (λ_max is the largest eigenvalue of L). By substituting K = 1 into it, and adopting λ_max ≈ 2, we have g_θ ⋆ x ≈ θ'_0 x − θ'_1 D^(−1/2) A D^(−1/2) x. Furthermore, Kipf et al. [14] propose to use θ = θ'_0 = −θ'_1 to get a simplified graph convolution operation of GCN, represented as g_θ ⋆ x ≈ θ (I_N + D^(−1/2) A D^(−1/2)) x. In addition, by introducing an effective renormalization Â = D̃^(−1/2) Ã D̃^(−1/2) (where Ã = A + I_N and D̃ = diag(d̃_1, …, d̃_N) with d̃_i = Σ_j ã_ij), the classic two-layer GCN can then be defined as:

Z = softmax(Â ReLU(Â X W^(0)) W^(1)),   (1)

where X is the node feature matrix, W^(0) (and W^(1)) the weight parameters of the neural networks, and Z the final output for the assignment of node labels. While GCN works very well on homogeneous networks, it is not directly suitable for heterogeneous information networks with different types of nodes and edges [17].
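A forward pass of the two-layer GCN in (1), with the renormalization trick, can be sketched in a few lines of numpy (random weights and features; purely illustrative):

```python
import numpy as np

def gcn_two_layer(A, X, W0, W1):
    """Eq. (1): Z = softmax(A_hat ReLU(A_hat X W0) W1)."""
    A_tilde = A + np.eye(A.shape[0])              # renormalization: A + I
    d = A_tilde.sum(axis=1)
    A_hat = A_tilde / np.sqrt(np.outer(d, d))     # D^(-1/2) (A+I) D^(-1/2)
    H = np.maximum(A_hat @ X @ W0, 0.0)           # first layer + ReLU
    logits = A_hat @ H @ W1                       # second layer
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)       # row-wise softmax

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)  # a 3-node path
X = rng.standard_normal((3, 4))                               # node features
Z = gcn_two_layer(A, X, rng.standard_normal((4, 8)), rng.standard_normal((8, 2)))
# Each row of Z is a probability distribution over the 2 output classes.
```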
We now analyze the advantages and disadvantages of using GCN on heterogeneous information networks (taking DBLP with four types of nodes, i.e., author, paper, venue and term, as an example). As shown in Fig. 2, in the first layer of GCN (the inner circle in the figure), we can realize direct information propagation via direct linked metapaths (e.g., Paper-Author). By stacking a second layer (the outer circle), we can achieve the indirect information propagation of metapaths with length 2, such as Term-Paper-Author and Venue-Paper-Author, with the help of stacked direct linked metapath propagation. By adopting a multi-layer GCN, we can then realize that direct linked metapaths propagate information directly while indirect linked metapaths propagate information indirectly, along with covering metapaths of different lengths. However, for heterogeneous information networks, GCN treats the information from different metapaths equally in both propagation and aggregation, without distinguishing the difference in their importance, which is exactly the main limitation we will overcome in this work.
4 Methodology
We first propose a naive model to solve the issue of GCN on heterogeneous information networks (HINs), then refine the model by introducing a continuous Markov propagation process, and finally give some optional tricks in implementation.
4.1 The Naive Model
In the first model, we use the classic multi-layer GCN as the basic framework, and then introduce a discriminative mechanism to aggregate information from the neighbors under direct linked metapaths. The structure of this model is illustrated in Fig. 3.
The novel aggregation mechanism consists of two parts, including the aggregation of instances under the same metapath (which we call the intra-aggregation) and the aggregation of different metapaths (which we call the inter-aggregation). Specifically, in the intra-aggregation, we adopt the same summation as GCN to aggregate the information from the neighbors under the same direct linked metapath. Mathematically, let φ(v, u) be the metapath mapping function, whose outputs range over P₁, the set of direct linked metapaths: it takes a node pair (v, u) as input and outputs a variable which indicates the direct linked metapath between nodes v and u. Simultaneously, let h_v^(l−1) be the embedding of node v at the (l−1)th layer, and h_v^(0) = x_v the node's feature vector. Then, for each p ∈ P₁, the embedding of node v under the direct linked metapath p at the lth layer can be updated as:

z_{v,p}^(l) = Σ_{u ∈ Ñ_v} (1 / √(d̃_v d̃_u)) · δ(φ(v, u), p) · h_u^(l−1),   (2)

where d̃_v is the degree of node v with self-edges (as defined in (1)), Ñ_v is the set of direct linked metapath-based neighbors of node v, and δ(·, ·) a Kronecker delta function that only allows nodes connected to node v via the direct linked metapath p to be included. Since there are |P₁| different direct linked metapaths, for each node v we obtain |P₁| metapath-type embeddings. In this case, we adopt another aggregation function, i.e., concatenation ∥, to aggregate the embeddings of different direct linked metapaths, that is:

z_v^(l) = ∥_{p ∈ P₁} z_{v,p}^(l).   (3)

With the obtained z_v^(l), the lth layer embedding of node v can then be given by using a mapping function along with a nonlinear transform as:

h_v^(l) = σ(W^(l) z_v^(l)),   (4)

where W^(l) is the mapping matrix and σ(·) the nonlinear activation function. To simplify the expression, we use a new operator '⊙' to denote the incorporation of the above two types of aggregations on matrices. Then, the matrix form of the lth layer embeddings can be defined as:

H^(l) = σ((Â ⊙ H^(l−1)) W^(l)).   (5)
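A single layer of this model can be sketched as below, a minimal numpy illustration under simplifying assumptions: each direct linked metapath is given as its own 0/1 adjacency matrix over all nodes, and the names `adjs` and `naive_layer` are ours, not the paper's:

```python
import numpy as np

def naive_layer(adjs, H, W, sigma=np.tanh):
    """One naive-model layer: intra-aggregation per metapath (cf. Eq. (2)),
    inter-aggregation by concatenation (cf. Eq. (3)), then mapping (cf. Eq. (4))."""
    N = H.shape[0]
    parts = []
    for A in adjs.values():                    # one block per direct linked metapath
        A_tilde = A + np.eye(N)                # add self-edges
        d = A_tilde.sum(axis=1)
        A_hat = A_tilde / np.sqrt(np.outer(d, d))
        parts.append(A_hat @ H)                # intra: normalized summation
    Z = np.concatenate(parts, axis=1)          # inter: concatenation
    return sigma(Z @ W)                        # mapping + nonlinearity

rng = np.random.default_rng(1)
adjs = {"A-P": (rng.random((5, 5)) > 0.6).astype(float),   # toy metapath graphs
        "A-T": (rng.random((5, 5)) > 0.6).astype(float)}
H0 = rng.standard_normal((5, 3))               # initial node features
H1 = naive_layer(adjs, H0, rng.standard_normal((6, 4)))    # next-layer embeddings
```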
To better understand how this naive model distinguishes the importance of information from different metapaths during both propagation and aggregation, we give a brief explanation on a heterogeneous information network (DBLP) as an example. As shown in Fig. 4, in each layer, we use the direct linked metapaths within the black circle to propagate information. We adopt summation to aggregate information from each type of neighbors linked by the same one-hop metapath (e.g., Author-Paper), use concatenation to aggregate information from different one-hop metapaths (e.g., Author-Paper and Term-Paper), and then feed the result to the neural network. This distinguishes the importance of information from different metapaths in an implicit and indirect way, i.e., it utilizes the new discriminative aggregation as well as the mapping function of the neural networks, rather than using attention directly. Furthermore, we extend the propagation range by stacking layer upon layer, and thereby realize the distinction of metapaths with different lengths (e.g., Author-Paper-Term and Author-Paper-Term-Paper), with the help of the interaction of the multi-layer propagation of the one-hop metapaths as well as the bi-level aggregation mechanism within each layer.
In fact, while this naive model seems able to cover different metapaths as well as distinguish their importance in both propagation and aggregation in an ideal way, it possesses an inherent limitation: many nodes do not have the same (or complete) types of one-hop metapaths due to the sparsity of HINs, making an effective concatenation in this new aggregation process difficult. Take DBLP as an example: some paper nodes may not have links under the metapath Paper-Author, while some other nodes may not have links under Paper-Term. In this case, we cannot achieve the alignment of these nodes' embeddings after concatenation. So, one can only use non-informative vectors (e.g., all-one or all-zero vectors) to fill in these missing types to make them complete. This, however, significantly lowers the performance of the model, especially when stacking multiple layers.
4.2 The Improved Model
To overcome the limitation of the naive model, we introduce an effective relaxation and improvement. That is, we first perform a t-step propagation, and then the discriminative aggregation. In the new propagation process, we replace the spectral filter of GCN from the symmetric graph Laplacian to an equivalent asymmetric one, and then remove activations, in order to make it a continuous Markov dynamics. We then introduce a random graph-based cut mechanism to constrain its free expansion, enabling the propagation to avoid incorporating too much harmful information as the number of layers increases. The structure of this model is illustrated in Fig. 5. In the following, we introduce it from two perspectives, i.e., probabilistic propagation and discriminative aggregation.
4.2.1 Probabilistic Propagation
First, we refine the propagation process of GCN. We adopt an asymmetric normalized graph Laplacian P = D̃⁻¹Ã, which is also called the Markov transition probability matrix, as the filter to perform propagation, where Ã = A + I_N (A is the adjacency matrix of G and I_N the identity matrix), and D̃ = diag(d̃_1, …, d̃_N) with d̃_i = Σ_j ã_ij. According to spectral graph theories [18], P has the same spectrum range as the original spectral filter of GCN (defined in (1)), and thus possesses the same ability of serving as a low-pass-type filter for propagation. Meanwhile, we remove the activation functions on all layers except for the output layer (which uses softmax), which will not decrease the model's performance, as guaranteed by [18]. These two steps make the propagation a continuous Markov dynamics process. The new propagation rule can be defined as:

H^(l) = P H^(l−1),   (6)

where H^(0) = X.
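The asymmetric normalization and activation-free propagation of (6) amount to iterating a row-stochastic matrix, as this short numpy sketch shows:

```python
import numpy as np

# Random-walk normalization P = D~^(-1) A~ of Eq. (6): every row of P is a
# probability distribution, so layer-wise propagation H^(l) = P H^(l-1)
# (with no activation) is a discrete-time Markov dynamics.
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
A_tilde = A + np.eye(4)                       # add self-edges
P = A_tilde / A_tilde.sum(axis=1, keepdims=True)

X = np.eye(4)                                 # one-hot features, for illustration
H = X
for _ in range(3):                            # three propagation layers
    H = P @ H                                 # equals P^3 X, cf. Eq. (9)
```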
On the other hand, the above propagation process in graph convolution can also be taken as a t-step Markov random walk from the perspective of probabilistic diffusion. Formally, given a heterogeneous information network G, the transition probability from node i to node j within a one-step random walk can be formulated as:

p_ij^(1) = ã_ij / d̃_i.   (7)

Then, after walking t steps, the transition probability from node i to node j can be calculated iteratively by:

p_ij^(t) = Σ_k p_ik^(t−1) p_kj^(1),   (8)

where p_ij^(0) = 1 if i = j, and p_ij^(0) = 0 otherwise. The above process can also be written in matrix form as:

P^(t) = P^(t−1) P = P^t,   (9)
where the t-step transition probability matrix P^(t) equals the propagation matrix in (6) of the graph convolution. More interestingly, according to spectral graph theories [19], the number of random walk steps within the range of the entering and exiting times of the Kth local mixing state (of this Markov dynamics) shows the clearest K-category structure. So, this new probabilistic perspective brings a byproduct: we can evaluate the optimal number of propagation layers of the graph convolution. To be specific, given a network G with the Markov matrix P, the local mixing times of random walks on it can be estimated by using the spectrum of its corresponding Markov generator L_P = I_N − P, where L_P is positive semidefinite and has nonnegative real-valued eigenvalues (0 = λ_1 ≤ λ_2 ≤ ⋯ ≤ λ_N). Let T_K^in and T_K^out be the entering and exiting times of the Kth local mixing state, which can be estimated from these eigenvalues. Reasonably, we can use the exiting time of the (K+1)th local mixing state to estimate the entering time of the Kth local mixing state, i.e., T_K^in ≈ T_{K+1}^out. Then, the calculated T_K^in and T_K^out can be taken as the floor and ceiling of the optimal number of propagation layers for a K-classification problem. However, first, it is too time-consuming to calculate the eigenvalues for determining the number of propagation layers, which often needs O(N³) time. Second, even within the expected range of the optimal number of layers, the propagation will still inevitably introduce impurity information, which will also decrease the convolution's performance. To further overcome these drawbacks, we introduce the new RPC principle: if a propagation path on a given network (with clusters) is no better than that on its corresponding random graph, we have no reason to continue this propagation path. This will not only enable the propagation to filter more noise information, but also make it less sensitive to the number of layers (which may be set to a relatively large value, e.g., 10). To be specific, given a heterogeneous information network G, we first calculate its corresponding random graph G′, which has the same node degree distribution as G while containing no structural information for classification. We adopt the popular null model of modularity [20], which describes random graphs by rewiring edges randomly among nodes with given node degrees, and is exactly suitable for this work. Let Ã be the adjacency matrix of G with self-edges, and D̃ = diag(d̃_1, …, d̃_N) the degree matrix with d̃_i = Σ_j ã_ij. Then, based on this null model, the expected number of links (or expected link weight) between nodes i and j can be written as:
ã′_ij = d̃_i d̃_j / Σ_k d̃_k,   (10)
which forms the adjacency matrix Ã′ of G′. On this random graph, the one-step transition probability from node i to node j can be written as:

q_ij^(1) = ã′_ij / d̃_i = d̃_j / Σ_k d̃_k.   (11)
Using it as a constraint on each step of the random walk on G, we then obtain a constrained Markov dynamics. That is, the transition probability from node i to node j after t steps of the constrained walk, i.e., p̂_ij^(t), can be calculated iteratively by:

p̂_ij^(t) = Σ_k p̂_ik^(t−1) (p_kj^(1) − q_kj^(1)),   (12)

where p̂_ik^(t−1) denotes the transition probability from node i to node k on G after t−1 steps of the constrained walk, while q_kj^(1) is the one-step probability on the corresponding random graph G′. We remove the negative values of p̂^(t) and normalize it after each step (since a probability distribution should be nonnegative and sum to 1). Then, letting P̂^(t) = [p̂_ij^(t)], Q = [q_ij^(1)], and D̂ = diag(d̂_1, …, d̂_N) with d̂_i = Σ_j [P̂^(t−1)(P − Q)]⁺_ij, the above process can be rewritten in matrix form as:

P̂^(t) = D̂⁻¹ ReLU(P̂^(t−1) (P − Q)),   (13)

where ReLU(·) removes the negative values and D̂⁻¹ performs the row normalization.
Finally, we derive the t-step transition probability matrix P̂^(t) based on the constrained Markov dynamics, which serves as a better propagation matrix for graph convolution.
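The sketch below implements one plausible reading of the RPC-constrained walk of (10)-(13): at each step the null-model transition probability is subtracted, negative entries are clipped, and rows are renormalized. The function name and the toy two-clique graph are our own illustrations, not the paper's experimental setup:

```python
import numpy as np

def rpc_walk(A, steps):
    """t-step constrained walk: clip and renormalize P_hat (P - Q) per step."""
    N = A.shape[0]
    A_tilde = A + np.eye(N)
    d = A_tilde.sum(axis=1)
    P = A_tilde / d[:, None]                        # walk on the real graph
    A_rand = np.outer(d, d) / d.sum()               # null-model weights, cf. Eq. (10)
    Q = A_rand / A_rand.sum(axis=1, keepdims=True)  # walk on the random graph, cf. Eq. (11)
    P_hat = np.eye(N)
    for _ in range(steps):
        P_hat = np.maximum(P_hat @ (P - Q), 0.0)    # cut paths no better than random
        P_hat /= P_hat.sum(axis=1, keepdims=True)   # renormalize each row
    return P_hat

# Two 4-cliques joined by a single edge: even after 10 steps the
# constrained walk keeps most probability mass inside each clique.
B = np.ones((4, 4)) - np.eye(4)
A = np.block([[B, np.zeros((4, 4))], [np.zeros((4, 4)), B]])
A[3, 4] = A[4, 3] = 1.0
P10 = rpc_walk(A, 10)
```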
To illustrate how the propagation matrix based on the unconstrained (and constrained) Markov dynamics changes with the number of layers, we take a simple Newman artificial network [21] as an example. The network consists of 128 nodes divided into four categories of 32 nodes each. Each node has on average 14 edges connecting to nodes of the same category and 2 edges connecting to nodes of other categories, as shown in Fig. 6(a). For this four-classification problem, we first calculate the spectrum of its Markov generator (Fig. 6(b)), and then derive the entering time and exiting time of the 4th local mixing state, i.e., ~2 and ~6, corresponding to the floor and ceiling of the optimal number of layers (Fig. 6(c)). Figs. 6(d), (e) and (f) show the propagation matrices after 2, 6 and 10 steps (or layers) of random walk. As shown, while the propagation matrices between the 2nd and 6th layers are relatively clear, some impurity information is still introduced. With a further increase of propagation layers, e.g., reaching 10 layers, it becomes hard to filter out the impurity information any more. However, after introducing the constraint mechanism, the propagation matrices of the 2nd and 6th layers are much clearer (Figs. 6(g) and (h)). More importantly, almost no impurity information is introduced as the number of layers increases, e.g., reaching 10 layers as shown in Fig. 6(i). This further verifies that the new constrained Markov dynamics can suppress the integration of impurity information during propagation, making it more robust and effective.
4.2.2 Discriminative Aggregation
After the t-step propagation above, we then perform a discriminative aggregation, which forms the relaxation and improvement of the naive model. To be specific, we use the same aggregation as the naive model, while aggregating the embeddings of the t-step propagated neighbors. The final embeddings can then be defined in one shot as:

Z = softmax((P̂^(t) ⊙ X) W).   (14)
While the model may not distinguish information from different metapaths during propagation, it does distinguish them in aggregation, achieving the essential selection of different metapaths. In this way, we further solve the inherent limitation of the naive model (the difficulty of concatenation in the new aggregation, as most nodes do not have the same and complete types of one-hop metapaths), since we can often obtain the complete types of neighbors after a few steps of constrained propagation.
Here, one may also be concerned that the propagation matrix may become very dense in this case, making the propagation introduce too much noise. In fact, this is not the case: thanks to the new constraint mechanism, our propagation matrix can still remain sparse. Take the first node in the first category of a complex Lancichinetti artificial network as an example (Fig. 7(a)). After many steps (e.g., t = 10) of propagation with the unconstrained random walk, the propagation probabilities from this node to all the other 999 nodes are positive, giving a dense result (Fig. 7(b)). However, the propagation probabilities produced by our constrained walk remain sparse (Fig. 7(c)): the probability from this node to 766 out of the 999 other nodes is 0, while that to the remaining nodes is positive. Moreover, the red values (the probabilities to nodes in the same category) are often much larger than the blue values (the probabilities to nodes outside this category). This demonstrates that our new propagation mechanism can not only obtain a sparse propagation matrix, but also filter impurity information well, making the propagation more effective.
We define the loss function using cross-entropy as:

L = − Σ_{l ∈ Y_L} Y_l ln(C · Z_l),   (15)

where C denotes the set of parameters of the classifier, Y_L the set of node indices that have labels, and Y_l and Z_l the labels and embeddings of the labeled nodes. We use back propagation and the Adam optimizer to optimize the model.
4.3 Implementation
It is also quite easy to introduce some tricks when implementing our method. These include, for example, supporting the use of candidate metapath sets and the (multi-head) node-level attention, which are often used in existing HIN embedding approaches.
First, existing HIN embedding methods often need a candidate metapath set. To make our method support this option, we can adopt only the metapaths in this candidate set to construct the t-step propagation matrix, and then use an aggregation to fuse information from these t-step propagated neighbors to derive the final embeddings.
Second, existing graph neural network-based HIN embedding methods usually adopt node-level attention for fine-tuning. Our method can also introduce node-level attention, working together with its inherent algorithmic mechanism of implicitly selecting metapaths, to further improve performance. To be specific, given a node pair (v, u) and a specified metapath P, the importance coefficient e_vu^P between nodes v and u can be formulated as:

e_vu^P = σ(a_P^T · [W h_v ∥ W h_u]),   (16)

where a_P is the parameterized attention vector for metapath P, and W the mapping matrix applied to each node. After obtaining the importance between nodes v and u, we can then use softmax to normalize them to get the weight coefficient α_vu^P as:

α_vu^P = softmax_u(e_vu^P) = exp(e_vu^P) / Σ_{k ∈ N_v^P} exp(e_vk^P).   (17)

Then, the embedding of node v for metapath P can be aggregated from the neighbors' embeddings with the corresponding weight coefficients as:

z_v^P = σ(Σ_{u ∈ N_v^P} α_vu^P · W h_u).   (18)
Finally, we can also extend the node-level attention to multi-head attention, as done in many existing methods [12][13], in order to stabilize the learning process and reduce the high variance (brought by the heterogeneity of networks). That is, we repeat the node-level attention K times, and then concatenate the outputs as the final embedding:

z_v^P = ∥_{k=1}^K σ(Σ_{u ∈ N_v^P} α_vu^{P,k} · W^k h_u).   (19)
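The node-level attention of (16)-(18), with K heads concatenated as in (19), can be sketched as follows (GAT-style scoring; the function name, shapes and the LeakyReLU slope are illustrative assumptions):

```python
import numpy as np

def attention_layer(A, H, Ws, a_vecs, slope=0.2):
    """A: (N, N) metapath-based adjacency; H: (N, f) embeddings;
    Ws / a_vecs: one mapping matrix and attention vector per head."""
    outs = []
    for W, a in zip(Ws, a_vecs):
        Wh = H @ W                                        # (N, f')
        fp = Wh.shape[1]
        # e_vu = LeakyReLU(a^T [W h_v || W h_u]), cf. Eq. (16)
        e = (Wh @ a[:fp])[:, None] + (Wh @ a[fp:])[None, :]
        e = np.where(e > 0, e, slope * e)
        e = np.where(A > 0, e, -np.inf)                   # attend to neighbors only
        alpha = np.exp(e - e.max(axis=1, keepdims=True))
        alpha /= alpha.sum(axis=1, keepdims=True)         # softmax, cf. Eq. (17)
        outs.append(np.tanh(alpha @ Wh))                  # aggregate, cf. Eq. (18)
    return np.concatenate(outs, axis=1)                   # concat heads, cf. Eq. (19)

rng = np.random.default_rng(2)
A = np.array([[1, 1, 0], [1, 1, 1], [0, 1, 1]], dtype=float)  # incl. self-loops
H = rng.standard_normal((3, 4))
Z = attention_layer(A, H,
                    Ws=[rng.standard_normal((4, 5)) for _ in range(2)],
                    a_vecs=[rng.standard_normal(10) for _ in range(2)])
```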
5 Experiments
We first give the experimental setup, and then compare our GIAM with some state-of-the-art methods on three network analysis tasks, i.e., node classification, node clustering and network visualization. We finally give an in-depth analysis of the different components of our new approach.
5.1 Experimental Setup
5.1.1 Datasets
We adopt two widely-used heterogeneous information networks from different domains, as shown in Table 2, to evaluate the performance of different methods.


IMDB is an online database about TV shows and movie productions. We extract a subset of IMDB with 4278 movies (M), 2081 directors (D) and 5257 actors (A). The movies are divided into three classes (Action, Comedy, Drama) based on their genre. Each movie is described by a bag-of-words representation of its plot keywords. Following [13], we use the candidate metapath set {MAM, MDM} for algorithms that require such information, and select 400, 400 and 3478 movies as the training, validation and testing sets for semi-supervised learning.
Datasets | No. of Nodes | No. of Edges | Metapaths
IMDB | #movie (M): 4278; #director (D): 2081; #actor (A): 5257 | #M-D: 4278; #M-A: 12828 | MDM, MAM
DBLP | #author (A): 4057; #paper (P): 14328; #term (T): 7723; #venue (V): 20 | #A-P: 19645; #P-T: 85810; #P-V: 14328 | APA, APTPA, APVPA
TABLE 2: Datasets description.
DBLP is a computer science literature database with authors as its core. We extract a subset of DBLP with 4057 authors (A), 14328 papers (P), 7723 terms (T) and 20 venues (V). The authors are divided into four classes (Database, Data Mining, Artificial Intelligence and Information Retrieval) based on their research areas. Each author is described by a bag-of-words representation of his/her paper keywords. Also following [13], we adopt the candidate metapath set {APA, APVPA, APTPA}, and select 400, 400 and 3257 authors as the training, validation and testing sets.
5.1.2 Baselines
We compare our new approach GIAM with eight existing methods, including: 1) the homogeneous network embedding methods DeepWalk[23], Node2vec[24], GCN[14] and GAT[25], and 2) the HIN embedding methods Metapath2vec[9], HetGNN[26], HAN[12] and MAGNN[13]. In particular, GCN is the basis of our approach GIAM, while HAN and MAGNN are the state-of-the-art graph neural network-based HIN embedding methods which adopt the hierarchical attention structure. Also of note, we apply the homogeneous network embedding methods to the HIN directly, ignoring the differences between node and edge types.
Datasets  Metrics  Training ratio  DeepWalk  Node2vec  GCN  GAT  Metapath2vec  HetGNN  HAN  MAGNN  GIAM 
IMDB  MacroF1 (%)  5%  41.52  43.56  54.56  54.79  42.95  42.93  55.94  54.41  58.49 
10%  44.40  46.40  55.75  55.69  43.90  45.94  56.41  56.43  59.15  
20%  46.60  49.61  56.29  56.38  45.53  48.87  57.64  57.41  59.79  
40%  47.92  50.87  56.00  56.26  46.39  51.39  58.46  58.70  59.85  
60%  48.66  51.79  55.83  56.05  47.80  52.70  58.73  58.97  60.25  
80%  48.73  52.08  56.30  56.03  48.63  53.31  58.82  59.65  59.97  
MicroF1 (%)  5%  42.31  44.13  55.22  55.48  44.31  43.80  56.28  54.61  59.03  
10%  45.45  47.32  56.23  56.20  45.75  46.89  56.62  56.59  59.50  
20%  47.88  50.59  56.58  56.60  47.06  49.62  57.66  57.43  59.96  
40%  49.47  52.01  56.39  56.52  48.12  52.24  58.46  58.85  60.05  
60%  50.20  52.92  56.19  56.31  49.50  53.58  58.75  59.09  60.44  
80%  50.33  53.45  56.52  56.14  50.65  54.40  58.95  59.76  60.18  
DBLP  MacroF1 (%)  5%  73.09  78.02  85.59  79.67  90.17  90.83  91.80  92.96  93.24 
10%  80.95  84.53  86.11  84.99  90.76  91.18  92.27  93.07  93.48  
20%  84.08  85.51  86.88  86.72  91.28  91.68  92.88  92.92  93.64  
40%  86.98  86.82  88.12  87.57  91.88  92.20  93.03  93.17  93.76  
60%  88.59  88.14  87.84  88.32  92.31  92.36  92.97  93.50  93.70  
80%  89.99  88.78  87.75  89.16  92.70  92.22  93.18  93.52  93.96  
MicroF1 (%)  5%  75.49  80.41  86.08  82.88  90.90  91.39  92.36  93.49  93.72  
10%  81.96  85.46  86.62  86.02  91.43  91.74  92.81  93.58  93.96  
20%  85.02  86.48  87.28  87.38  91.97  92.20  93.36  93.43  94.12  
40%  87.81  87.68  88.50  88.18  92.50  92.68  93.50  93.63  94.23  
60%  89.38  89.02  88.28  88.98  92.90  92.88  93.47  93.95  94.18  
80%  90.43  89.51  88.16  89.69  93.25  92.78  93.67  93.96  94.39 
5.1.3 Parameter Settings
For the methods based on semi-supervised graph neural networks (including GCN, GAT, HAN, MAGNN and our GIAM), we set the dropout rate to 0.5 and use the same splits for the training, validation and testing sets. We employ the Adam optimizer with the learning rate set to 0.005 and apply early stopping with a patience of 50. For GAT, HAN and MAGNN, we set the number of attention heads to 8. For HAN and MAGNN, we set the dimension of the metapath-level attention vector to 128. For the methods based on random walk (including DeepWalk, Node2vec, HetGNN and Metapath2vec), we set the window size to 5, the walk length to 100, the number of walks per node to 40, and the number of negative samples to 5. For a fair comparison, the embedding dimension of all the methods mentioned above is set to 64.
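The early-stopping protocol above (patience of 50 on the validation loss) can be sketched framework-agnostically; `step_fn` is a hypothetical callable that trains one epoch and returns the validation loss:

```python
def train_with_early_stopping(step_fn, patience=50, max_epochs=1000):
    """Generic early-stopping loop: stop once the validation loss has not
    improved for `patience` consecutive epochs; return the best loss and
    the epoch at which it occurred."""
    best, best_epoch, wait = float("inf"), 0, 0
    for epoch in range(max_epochs):
        val_loss = step_fn(epoch)        # train one epoch, return val loss
        if val_loss < best:
            best, best_epoch, wait = val_loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:         # no improvement for `patience` epochs
                break
    return best, best_epoch

# Toy validation curve: the loss reaches its minimum (1.0) at epoch 10 and
# then rises, so training stops 50 epochs later.
best, best_epoch = train_with_early_stopping(lambda e: abs(e - 10) + 1.0)
```

In practice one would also checkpoint the model parameters at `best_epoch` and restore them after stopping.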
5.2 Comparisons to Existing Methods
We first make a quantitative comparison on node classification and clustering, and then a qualitative comparison on visualization.
5.2.1 Node Classification
On the node classification task, for each method, we first generate the embeddings of the labeled nodes (i.e., movies in IMDB and authors in DBLP), and then feed them to an SVM classifier with training ratios varying from 5% to 80% (as done in most existing works). Since the variance of graph-structured data can be quite large, we repeat this process 10 times and report the average Macro-F1 and Micro-F1.
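For reference, the two reported metrics can be computed from per-class counts as follows (a from-scratch sketch; in practice a library such as scikit-learn would be used):

```python
import numpy as np

def micro_macro_f1(y_true, y_pred, n_classes):
    """Micro-F1 and Macro-F1 from per-class counts. For single-label
    classification, Micro-F1 equals accuracy; Macro-F1 is the unweighted
    mean of the per-class F1 scores."""
    tp = np.array([np.sum((y_pred == c) & (y_true == c)) for c in range(n_classes)])
    fp = np.array([np.sum((y_pred == c) & (y_true != c)) for c in range(n_classes)])
    fn = np.array([np.sum((y_pred != c) & (y_true == c)) for c in range(n_classes)])
    # Micro: pool the counts over all classes before computing F1.
    micro = 2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum())
    # Macro: per-class F1, then an unweighted mean (nan if a class is absent).
    with np.errstate(invalid="ignore", divide="ignore"):
        per_class = 2 * tp / (2 * tp + fp + fn)
    macro = np.nan_to_num(per_class).mean()
    return micro, macro
```

Macro-F1 weights rare and common classes equally, which is why the two numbers in Table 3 can diverge on class-imbalanced splits.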
The results are shown in Table 3. As shown, the proposed method GIAM always performs the best across different training ratios and datasets. On the IMDB dataset, GIAM is 1.15-2.88% and 0.32-4.42% more accurate than the best baselines HAN and MAGNN, which are also heterogeneous graph neural network methods (while they use metapath-level attentions directly). On the DBLP dataset, GIAM is 0.71-1.44% and 0.20-0.72% more accurate than the best baselines HAN and MAGNN, in the case of an already very high base accuracy (at least 91.80%), making our improvement still non-trivial. These results not only demonstrate the superiority of the new propagation and aggregation mechanism, but also validate the effectiveness of our main idea of using algorithmic mechanisms (rather than the metapath-level attention directly) to implicitly achieve the attention's role of selecting metapaths. In addition, the performance of GIAM is much better than that of GCN (i.e., 3.27-4.42% and 5.64-7.65% more accurate on IMDB and DBLP, respectively), which further demonstrates the effectiveness of our new mechanism for distinguishing the importance of information with respect to different metapaths in both propagation and aggregation.
5.2.2 Node Clustering
We also conduct comparisons of these methods on node clustering.
Datasets  NMI (%)  
DeepWalk  Node2vec  GCN  GAT  Metapath2vec  HetGNN  HAN  MAGNN  GIAM  
IMDB  0.55  5.34  10.42  10.02  0.43  0.46  13.02  13.77  15.41 
DBLP  71.78  74.80  53.93  68.15  75.02  74.26  73.13  78.97  78.27 (2) 
AVG  36.17  40.07  32.18  39.09  37.73  37.36  43.08  46.37  46.84 
Datasets  ARI (in [-1, 1])  
DeepWalk  Node2vec  GCN  GAT  Metapath2vec  HetGNN  HAN  MAGNN  GIAM  
IMDB  0.0014  0.0642  0.0661  0.0744  0.0005  0.0048  0.1282  0.1206  0.1552 
DBLP  0.7415  0.7796  0.4670  0.6859  0.7945  0.8028  0.7938  0.8392  0.8273 (2) 
AVG  0.3701  0.4219  0.2666  0.3802  0.3975  0.4038  0.4610  0.4799  0.4913 
In this task, for each method, we first generate the embeddings of the labeled nodes, and then feed them to the K-Means algorithm. The number of clusters is set to the same as the ground-truth, i.e., 3 for IMDB and 4 for DBLP. Since the performance of K-Means is easily affected by the initial centers, we repeat the process 10 times and report the average normalized mutual information (NMI) and adjusted rand index (ARI).

The results are shown in Tables 4 and 5. As shown, the proposed method GIAM performs the best on IMDB. While GIAM performs the second best on DBLP, its performance is still very competitive with that of the best baseline MAGNN. On average over these two datasets, GIAM is 10.67%, 6.77%, 14.66%, 7.75%, 9.11%, 9.48%, 3.76% and 0.47% more accurate than DeepWalk, Node2vec, GCN, GAT, Metapath2vec, HetGNN, HAN and MAGNN in terms of NMI, and 0.1212, 0.0694, 0.2247, 0.1111, 0.0938, 0.0875, 0.0303 and 0.0114 better than these methods in terms of ARI (which ranges from -1 to 1). Moreover, (on average) GIAM is still better than the methods using metapath-level attentions directly (i.e., HAN and MAGNN). This further validates the soundness of using algorithmic mechanisms to evaluate the importance of different metapaths. Neither GCN nor GAT is very competitive here. This is mainly because they fail to distinguish the importance of information with respect to different metapaths, which significantly compromises their performance in the unsupervised clustering setting.
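The NMI metric reported above can be sketched from its contingency-table definition (a simplified implementation for illustration, not the one used in the experiments):

```python
import numpy as np

def nmi(labels_true, labels_pred):
    """Normalized mutual information from the contingency table:
    NMI = 2 * I(T; P) / (H(T) + H(P))."""
    n = len(labels_true)
    classes = sorted(set(labels_true))
    clusters = sorted(set(labels_pred))
    cont = np.zeros((len(classes), len(clusters)))
    for t, p in zip(labels_true, labels_pred):
        cont[classes.index(t), clusters.index(p)] += 1
    pij = cont / n                       # joint distribution
    pi = pij.sum(axis=1)                 # class marginals
    pj = pij.sum(axis=0)                 # cluster marginals
    eps = 1e-12                          # guards log(0); 0 * log(...) terms vanish
    mi = np.sum(pij * np.log((pij + eps) / (np.outer(pi, pj) + eps)))
    h_t = -np.sum(pi * np.log(pi + eps))
    h_p = -np.sum(pj * np.log(pj + eps))
    return 2 * mi / (h_t + h_p)
```

Because NMI is invariant to label permutations, a clustering that merely renames the ground-truth classes still scores 1, which is exactly what we want when comparing unsupervised embeddings.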
5.2.3 Visualization
For a more intuitive comparison, we also visualize the embeddings of the author nodes produced by some representative network embedding methods (i.e., GCN, HetGNN, HAN and our GIAM) on the DBLP dataset as an example. We utilize the well-known t-SNE tool [27] to project node embeddings into two dimensions. Different colors correspond to the different research areas of these nodes.
As shown in Fig. 8, GCN (which ignores the heterogeneity of nodes) does not perform well, i.e., the author nodes belonging to different research areas are sometimes mixed with each other. HetGNN performs much better than GCN, but its cluster boundaries are still blurry. While both HAN and our GIAM separate the author nodes of different research areas reasonably well, our GIAM shows more distinct boundaries and denser cluster structures in the visualization.
5.3 A Deep Analysis of GIAM
Similar to most deep learning models, GIAM contains some important components that may have a significant impact on performance. To test the effectiveness of each component, we compare GIAM with four variants: 1) GCN, which serves as the base framework of GIAM and does not distinguish the importance of information with respect to different metapaths; 2) the naive model of GIAM, denoted GIAM-1; 3) GIAM with the node-level attention removed (by assigning the same importance to each neighbor node), denoted GIAM-2; and 4) GIAM with the metapath-level attention added, denoted GIAM-3. We take their comparison on node classification as an example.
As shown in Table 6, compared to GCN, the naive model GIAM-1 (which distinguishes metapaths) brings an obvious improvement, i.e., it is 0.86-1.15% and 4.18-5.25% more accurate on IMDB and DBLP, respectively. However, due to the sparsity of HINs, GIAM-1 inevitably needs to add a large number of non-informative features, so as to fill in the embeddings of the missing types of one-hop metapaths during aggregation. While its result is basically satisfactory, this limitation inevitably compromises performance. We overcome this limitation by introducing a new mechanism of relaxation and improvement, deriving GIAM-2, which further improves the performance of the naive model, i.e., by 2.35-3.63% and 0.02-2.76% on IMDB and DBLP, respectively. Furthermore, by introducing the fine-tuning node-level attention, the derived GIAM improves GIAM-2 on DBLP (i.e., by 0.63-0.87%), while the improvement on IMDB is not so obvious (because IMDB is harder to train well, with a relatively low accuracy, which makes overfitting more likely). This further demonstrates that the node-level attention indeed plays a fine-tuning role when the model can be trained well (such as on DBLP, with a relatively high accuracy). Finally, GIAM-3, which adds the metapath-level attention, hardly changes the performance of GIAM. This further validates that our algorithmic mechanism has already played a significant role in selecting metapaths, compared to the explicit metapath-level attention approach.
Datasets  Metrics  Training ratio  GCN  GIAM-1  GIAM-2  GIAM  GIAM-3 
IMDB  MacroF1 (%)  5%  54.56  55.52  58.29  58.49  58.56 
10%  55.75  56.73  59.31  59.15  59.26  
20%  56.29  57.15  59.90  59.79  59.94  
40%  56.00  56.99  60.01  59.93  59.85  
60%  55.83  56.88  60.51  60.25  60.31  
80%  56.30  57.34  60.43  59.97  60.10  
MicroF1 (%)  5%  55.22  56.14  58.80  59.03  59.11  
10%  56.23  57.20  59.55  59.50  59.61  
20%  56.58  57.51  59.96  59.96  60.10  
40%  56.39  57.48  60.10  60.05  60.13  
60%  56.19  57.34  60.55  60.44  60.50  
80%  56.52  57.66  60.47  60.18  60.30  
DBLP  MacroF1 (%)  5%  85.59  89.77  92.53  93.24  93.25 
10%  86.11  90.85  92.62  93.48  93.48  
20%  86.88  91.89  92.79  93.64  93.61  
40%  88.12  92.41  92.89  93.76  93.78  
60%  87.84  92.81  92.87  93.70  93.69  
80%  87.75  91.98  93.12  93.96  93.98  
MicroF1 (%)  5%  86.08  90.58  93.09  93.72  93.75  
10%  86.62  91.57  93.17  93.96  93.96  
20%  87.28  92.53  93.35  94.12  94.10  
40%  88.50  93.01  93.45  94.23  94.26  
60%  88.28  93.42  93.44  94.18  94.19  
80%  88.16  92.52  93.66  94.39  94.43 
6 Related Work
Heterogeneous information network (HIN) embedding aims to learn a low-dimensional distributed representation for each node of a HIN while preserving its structural and semantic information. Existing HIN embedding methods can be mainly divided into three categories: the random walk-based methods, the relation learning-based methods and the graph neural network-based methods.
The random walk-based methods first perform random walks on a HIN to generate node walk sequences, and then feed them to a subsequent model to obtain node embeddings. For example, JUST[8] adopts jump and stay strategies on a HIN, which select the next node based on the probability of a jump or stay operation, to perform random walks. It then inputs the generated walk sequences to the skip-gram model to obtain the final node embeddings. Metapath2vec[9] first generates the node walk sequences based on metapaths, and then obtains the node embeddings by adopting heterogeneous skip-gram with negative sampling. HetGNN[26] improves Metapath2vec by incorporating additional node information. It first introduces a sampling strategy based on random walk with restart to sample neighbors for each node, and then uses a heterogeneous neural network architecture to aggregate the feature information of those sampled neighbor nodes.
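The metapath-guided walk of Metapath2vec can be sketched as follows. The `neighbors` structure is a hypothetical type-indexed adjacency list introduced for illustration; the key point is that each step is constrained to the node type dictated by the metapath:

```python
import random

def metapath_walk(start, metapath, neighbors, length):
    """Metapath-guided random walk in the style of Metapath2vec.
    `neighbors[(node, t)]` lists the neighbors of `node` whose type is `t`;
    `metapath` is a palindromic type string such as "MAM" or "APVPA",
    which the walk follows cyclically."""
    walk = [start]
    step = 0
    while len(walk) < length:
        # Next node type, cycling through the metapath (its last symbol
        # repeats the first, so we cycle over len - 1 positions).
        next_type = metapath[(step + 1) % (len(metapath) - 1)]
        candidates = neighbors.get((walk[-1], next_type), [])
        if not candidates:
            break                         # dead end: no neighbor of that type
        walk.append(random.choice(candidates))
        step += 1
    return walk

# Hypothetical toy IMDB-like graph: movies m0 and m1 share actor a0.
neighbors = {("m0", "A"): ["a0"], ("a0", "M"): ["m1"], ("m1", "A"): ["a0"]}
walk = metapath_walk("m0", "MAM", neighbors, 5)
```

The generated sequences would then be fed to a (heterogeneous) skip-gram model, as described above.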
The relation learning-based methods aim to learn a scoring function which evaluates an arbitrary triplet composed of two nodes and an edge type, and outputs a scalar to measure the plausibility of this triplet. For example, DistMult[10] adopts a similarity-based scoring function to learn the edge possibility between any two nodes of the HIN. ConvE[11] proposes a deep neural model, instead of a simple similarity function, to score the edge possibility between two nodes. TransE[28] learns the edge possibility between two nodes by using a translational distance.
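For concreteness, the two classical scoring functions mentioned above can be stated in a few lines (the embedding vectors are assumed to be already learned; only the scores are shown):

```python
import numpy as np

def distmult_score(h, r, t):
    """DistMult: a bilinear score with a diagonal relation matrix,
    score(h, r, t) = sum_i h_i * r_i * t_i (higher means more plausible)."""
    return float(np.sum(h * r * t))

def transe_score(h, r, t):
    """TransE: a translational-distance score, -||h + r - t||; its maximum
    (zero) is attained when the relation translates head exactly to tail."""
    return float(-np.linalg.norm(h + r - t))
```

Note that DistMult is symmetric in head and tail, which is one motivation for deeper scorers such as ConvE.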
The graph neural network-based methods aim to learn node embeddings by aggregating the information from the neighbor nodes of a HIN. For example, HAN[12] proposes a hierarchical attention mechanism, including node-level and semantic-level attentions, to aggregate the information from metapath-based neighbors. To be specific, the node-level attention learns the importance of neighbors in the same metapath, while the semantic-level attention learns the importance of different metapaths. MAGNN[13] employs three major components, i.e., the node-type specific transformation, the node-level metapath instance aggregation and the metapath-level embedding fusion, to obtain the node embeddings of heterogeneous graphs. While these graph neural network-based methods can often derive satisfactory node embeddings, they still have some essential limitations. That is, the complicated hierarchical attention structure often makes it difficult for these methods to really achieve the goal of selecting metapaths, partly due to severe overfitting (as shown in Fig. 1(a) as an illustrative example). Meanwhile, these methods treat one-hop and multi-hop metapaths indistinguishably when propagating information, which is not so intuitive from the perspective of network propagation dynamics in network science.
7 Conclusion
We propose a novel GCN-based method, namely GIAM, which utilizes attention and metapaths implicitly (rather than explicitly), in order to effectively achieve HIN embedding. We use the directly linked metapaths, a discriminative aggregation, along with the stacked layers of propagation, to distinguish the importance of different metapaths. We further give an effective relaxation and improvement by introducing a new multi-layer propagation which is separated from the aggregation. That is, we first replace the spectral filter of GCN from the symmetric normalized graph Laplacian to an equivalent asymmetric one and remove the activation functions, making it a well-defined probabilistic propagation process. We then introduce a random graph-based constraint mechanism, RPC, on this probabilistic propagation, to avoid importing too much noise as the number of propagation layers increases. Empirical results on various graph mining tasks, including node classification, node clustering and graph visualization, demonstrate the superiority of our new approach over some state-of-the-art methods.
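To make the normalization swap concrete, consider a toy sketch (the adjacency matrix here is illustrative, not taken from the paper):

```python
import numpy as np

# Toy undirected graph with 3 nodes.
A = np.array([[0., 1., 1.],
              [1., 0., 0.],
              [1., 0., 0.]])
deg = A.sum(axis=1)

# GCN's symmetric filter D^{-1/2} A D^{-1/2} ...
sym = A / np.sqrt(np.outer(deg, deg))

# ... is replaced by the asymmetric, row-stochastic D^{-1} A, whose rows
# sum to 1, so stacking propagation layers (matrix powers) remains a
# well-defined probabilistic (random-walk) propagation.
asym = A / deg[:, None]
```

The two filters share the same spectrum (they are similar matrices), but only the row-stochastic form admits the probabilistic reading that the propagation argument above relies on.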
References
 [1] C. Yang, Y. Xiao, Y. Zhang, Y. Sun, and J. Han, “Heterogeneous network representation learning: Survey, benchmark, evaluation, and beyond,” CoRR, vol. abs/2004.00216, 2020.
 [2] C. Shi, Y. Li, J. Zhang, Y. Sun, and P. S. Yu, “A survey of heterogeneous information network analysis,” IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 1, pp. 17–37, 2017.
 [3] W. Shen, J. Han, J. Wang, X. Yuan, and Z. Yang, “SHINE+: A general framework for domain-specific entity linking with heterogeneous information networks,” IEEE Transactions on Knowledge and Data Engineering, vol. 30, no. 2, pp. 353–366, 2018.
 [4] B. Hu, C. Shi, W. X. Zhao, and P. S. Yu, “Leveraging meta-path based context for top-N recommendation with a neural co-attention model,” in Proceedings of SIGKDD, ACM, 2018, pp. 1531–1540.
 [5] C. Wang, Y. Song, H. Li, M. Zhang, and J. Han, “Unsupervised meta-path selection for text similarity measure based on heterogeneous information networks,” Data Mining and Knowledge Discovery, vol. 32, no. 6, pp. 1735–1767, 2018.
 [6] C. Park, D. Kim, J. Han, and H. Yu, “Unsupervised attributed multiplex network embedding,” in Proceedings of AAAI, 2020, pp. 5371–5378.
 [7] W. L. Hamilton, Z. Ying, and J. Leskovec, “Inductive representation learning on large graphs,” in Proceedings of NIPS, 2017, pp. 1024–1034.
 [8] R. Hussein, D. Yang, and P. Cudré-Mauroux, “Are meta-paths necessary?: Revisiting heterogeneous graph embeddings,” in Proceedings of CIKM, 2018, pp. 437–446.
 [9] Y. Dong, N. V. Chawla, and A. Swami, “metapath2vec: Scalable representation learning for heterogeneous networks,” in Proceedings of SIGKDD, ACM, 2017, pp. 135–144.
 [10] B. Yang, W. Yih, X. He, J. Gao, and L. Deng, “Embedding entities and relations for learning and inference in knowledge bases,” in Proceedings of ICLR, 2015.
 [11] T. Dettmers, P. Minervini, P. Stenetorp, and S. Riedel, “Convolutional 2D knowledge graph embeddings,” in Proceedings of AAAI, 2018, pp. 1811–1818.
 [12] X. Wang, H. Ji, C. Shi, B. Wang, Y. Ye, P. Cui, and P. S. Yu, “Heterogeneous graph attention network,” in Proceedings of WWW, 2019, pp. 2022–2032.
 [13] X. Fu, J. Zhang, Z. Meng, and I. King, “MAGNN: metapath aggregated graph neural network for heterogeneous graph embedding,” in Proceedings of WWW, 2020, pp. 2331–2341.
 [14] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” in Proceedings of ICLR, 2017.
 [15] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, “Spectral networks and locally connected networks on graphs,” in Proceedings of ICLR, 2014.
 [16] M. Defferrard, X. Bresson, and P. Vandergheynst, “Convolutional neural networks on graphs with fast localized spectral filtering,” in Proceedings of NIPS, 2016, pp. 3837–3845.
 [17] Y. Wang, Z. Duan, B. Liao, F. Wu, and Y. Zhuang, “Heterogeneous attributed network embedding with graph convolutional networks,” in Proceedings of AAAI, 2019, pp. 10061–10062.
 [18] F. Wu, A. H. S. Jr., T. Zhang, C. Fifty, T. Yu, and K. Q. Weinberger, “Simplifying graph convolutional networks,” in Proceedings of ICML, 2019, pp. 6861–6871.
 [19] B. Yang, J. Liu, and J. Feng, “On the spectral characterization and scalable mining of network communities,” IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 2, pp. 326–337, 2012.
 [20] M. E. J. Newman and M. Girvan, “Finding and evaluating community structure in networks,” Physical Review E, vol. 69, no. 2, p. 026113, 2004.
 [21] M. Girvan and M. E. J. Newman, “Community structure in social and biological networks,” Proceedings of the National Academy of Sciences, vol. 99, no. 12, pp. 7821–7826, 2002.
 [22] A. Lancichinetti and S. Fortunato, “Benchmarks for testing community detection algorithms on directed and weighted graphs with overlapping communities,” Physical Review E, vol. 80, no. 1, p. 016118, 2009.
 [23] B. Perozzi, R. AlRfou, and S. Skiena, “Deepwalk: online learning of social representations,” in Proceedings of SIGKDD, ACM, 2014, pp. 701–710.
 [24] A. Grover and J. Leskovec, “node2vec: Scalable feature learning for networks,” in Proceedings of SIGKDD, ACM, 2016, pp. 855–864.
 [25] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, “Graph attention networks,” in Proceedings of ICLR, 2018.
 [26] C. Zhang, D. Song, C. Huang, A. Swami, and N. V. Chawla, “Heterogeneous graph neural network,” in Proceedings of SIGKDD, ACM, 2019, pp. 793–803.
 [27] L. van der Maaten and G. Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning Research, vol. 9, pp. 2579–2605, 2008.
 [28] A. Bordes, N. Usunier, A. García-Durán, J. Weston, and O. Yakhnenko, “Translating embeddings for modeling multi-relational data,” in Proceedings of NIPS, 2013, pp. 2787–2795.
Appendix A
In Section 2, we have used three graph neural network-based HIN embedding methods, i.e., HAN, MAGNN and our new approach GIAM, to conduct the motivating experiment on two widely-used heterogeneous information networks, i.e., IMDB and DBLP. Here, on each network, we give the detailed results of the different methods under different ratios (i.e., 5%-80%) of supervised information, as shown in Table 7 and Table 8, respectively.
Dataset  Metrics  Training ratio  HAN-1  HAN-2  MAGNN-1  MAGNN-2  GIAM 
IMDB  MacroF1 (%)  5%  55.94  57.57  54.41  55.27  58.49 
10%  56.41  58.35  56.43  56.44  59.15  
20%  57.64  59.16  57.41  58.72  59.79  
40%  58.46  59.49  58.70  59.71  59.85  
60%  58.73  59.55  58.97  59.71  60.25  
80%  58.82  59.43  59.65  59.95  59.97  
AVG  57.67  58.93  57.60  58.30  59.58  
MicroF1 (%)  5%  56.28  57.94  54.61  55.39  59.03  
10%  56.62  58.53  56.59  56.71  59.50  
20%  57.66  59.21  57.43  58.83  59.96  
40%  58.46  59.53  58.85  59.89  60.05  
60%  58.75  59.53  59.09  59.91  60.44  
80%  58.95  59.40  59.76  60.24  60.18  
AVG  57.79  59.02  57.72  58.50  59.86 
Dataset  Metrics  Training ratio  HAN-1  HAN-2  MAGNN-1  MAGNN-2  GIAM 
DBLP  MacroF1 (%)  5%  91.80  92.10  92.96  88.05  93.24 
10%  92.27  92.31  93.07  88.65  93.48  
20%  92.88  92.54  92.92  89.87  93.64  
40%  93.03  92.49  93.17  91.38  93.76  
60%  92.97  92.66  93.50  92.29  93.70  
80%  93.18  92.70  93.52  92.30  93.96  
AVG  92.69  92.47  93.19  90.42  93.63  
MicroF1 (%)  5%  92.36  92.68  93.49  88.90  93.72  
10%  92.81  92.89  93.58  89.44  93.96  
20%  93.36  93.10  93.43  90.56  94.12  
40%  93.50  93.06  93.63  91.95  94.23  
60%  93.47  93.24  93.95  92.81  94.18  
80%  93.67  93.27  93.96  92.81  94.39  
AVG  93.20  93.04  93.67  91.08  94.10 