ie-HGCN
Graph Convolutional Network (GCN) has achieved extraordinary success in learning effective high-level representations of nodes in graphs. However, the study regarding Heterogeneous Information Networks (HINs) is still limited, because the existing HIN-oriented GCN methods suffer from two deficiencies: (1) they cannot flexibly exploit all possible meta-paths, and some even require the user to specify useful ones; (2) they often need to first transform an HIN into meta-path-based graphs by computing commuting matrices, which has a high time complexity, resulting in poor scalability. To address the above issues, we propose the interpretable and efficient Heterogeneous Graph Convolutional Network (ie-HGCN) to learn representations of nodes in HINs. It automatically extracts useful meta-paths for each node from all possible meta-paths (within a length limit determined by the model depth), which brings good model interpretability. It directly takes the entire HIN as input and avoids intermediate HIN transformation. The carefully designed hierarchical aggregation architecture avoids computationally inefficient neighborhood attention. Thus, it is much more efficient than previous methods. We formally prove that ie-HGCN evaluates the usefulness of all possible meta-paths within a length limit (model depth), show that it intrinsically performs spectral graph convolution on HINs, and analyze the time complexity to verify its quasi-linear scalability. Extensive experimental results on three real-world networks demonstrate the superiority of ie-HGCN over state-of-the-art methods.
In the real world, a graph usually contains multiple types of nodes and edges; such a graph is called a heterogeneous graph, or Heterogeneous Information Network (HIN) [1]. Figure 1 (left) shows a toy HIN of the DBLP bibliographic network. It contains papers (P), authors (A), conferences (C) and terms (T). The edges from authors to papers are of the "Writing" type, while the edges from papers to conferences are of the "Published" type. By convention, in an HIN, the nodes are called objects, the edges are called links, and the types of links are called relations. The meta-path [1] is an important concept in HINs. It is defined as a composite relation between two object types. A meta-path usually conveys specific semantics, and different meta-paths have different importance for a specific task. For example, in DBLP, the meta-path Author-Paper-Author (abbreviated as APA) means the co-author relationship between authors, while APC (Author-Paper-Conference) means authors publish papers in conferences. When predicting an author's affiliation, APA is more helpful than APC, since authors usually collaborate with colleagues in the same institution.
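To make the meta-path notion concrete, here is a minimal sketch (our own toy data, not from the paper) that follows the APA meta-path in a dict-based DBLP-style graph to find co-authors:

```python
# Minimal sketch: a toy DBLP-style HIN stored as typed edge lists, and a
# walk along the meta-path Author-Paper-Author (APA). All ids are made up.
from collections import defaultdict

writes = {"a1": ["p1", "p2"], "a2": ["p1"], "a3": ["p3"]}  # author -> papers
written_by = defaultdict(list)                              # paper -> authors
for a, papers in writes.items():
    for p in papers:
        written_by[p].append(a)

def coauthors(author):
    """Follow APA: hop author -> paper -> author, excluding the start author."""
    out = set()
    for p in writes.get(author, []):
        out.update(written_by[p])
    out.discard(author)
    return out

print(sorted(coauthors("a1")))  # ['a2']: a2 shares p1 with a1
```

Each path instance of APA (here a1-p1-a2) is one concrete co-authorship.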
Properly learning representations of objects in an HIN can boost a variety of tasks such as object classification and link prediction [2]. Existing HIN embedding methods learn object representations in a non-parametric way by preserving some specific structural properties. Among them, some methods [3, 4, 5] only preserve the first-order proximity conveyed by relations. Although the other methods [6, 7, 8, 9, 10, 11, 12] preserve high-order structural proximities conveyed by meta-paths, they either require users to specify meta-paths [6, 7, 8, 9, 10] or cannot learn the importance of meta-paths for a task [11, 12]. To summarize: (1) with unsupervised structure-preserving training, the learned embeddings may not lead to optimal performance for a specific task; (2) none of these methods can automatically explore useful meta-paths from all possible meta-paths for specific tasks.
Recently, Graph Convolutional Network (GCN) has been successfully applied to many graph analytical tasks such as node classification. Different from graph embedding, GCN encodes structural properties by convolution and uses task-specific objectives for training. Several recent works try to extend GCN to HINs. However, they still fail to fully and efficiently exploit the structural properties of HINs. Table I summarizes the key deficiencies of existing HIN GCN methods: (1) Some of them [13, 14, 15, 16] require the user to specify several useful meta-paths for a specific task, which is difficult for users without professional knowledge. (2) Meta-paths convey diverse structural proximities and rich semantics in an HIN. However, many of these methods [13, 14, 15, 17, 18, 19] cannot exploit all possible meta-paths, risking potential loss of important structural information. They only exploit a subset of all possible meta-paths, such as user-specified symmetric meta-paths [13, 14, 15, 16], fixed-length meta-paths [18], or meta-paths that start from and end with the same object type [17]. HetGNN [19] samples neighbors for a target node by random walk and aggregates them by Bi-LSTM; most structural information is lost. (3) Some methods [17, 18, 20, 21] do not distinguish the importance of meta-paths, failing to consider that not all meta-paths are useful for a specific task. (4) Many of them [13, 14, 15, 16, 17, 22] need to compute commuting matrices [1] by iterative multiplication of adjacency matrices, which has time complexity at least quadratic in the number of related objects. The resulting commuting matrices are very dense, and the longer the meta-paths, the denser the commuting matrices, which also increases the time complexity of the final graph convolution on these commuting matrices. Thus, they have limited scalability and cannot scale well to large-scale HINs.
Very recently, several HIN GCN methods [22, 23, 24] have been proposed that also consider all possible meta-paths. Among them, GTN [22] first computes meta-path-based graphs for all possible meta-paths, and then performs graph convolution. However, it has two deficiencies: (1) It only keeps a learnable importance weight for each relation. The weight is shared among all the related objects, which is not flexible enough to capture the "personality" of different objects. For example, suppose we have a task to classify the research areas of papers in DBLP (the complete list of areas is in Section 5.1). Paper p1 is published in an interdisciplinary conference such as WWW, and is connected to the term "Web Search". Paper p2 is published in AAAI, and is connected to the term "Graph Algorithm". Obviously, the connected term of p1 is more helpful for classifying p1 as "information retrieval", while the conference where p2 is published is more helpful for classifying p2 as "artificial intelligence". GTN cannot handle this complexity among different objects. (2) It also needs to compute the commuting matrices (incorporating relation weights) for all possible meta-paths. Even by applying sparse-sparse matrix multiplication, it has at least quadratic time complexity. Therefore it cannot scale well to large HINs (Section 5.7). HetSANN [23] and HGT [24] directly aggregate the representations of heterogeneous neighbor objects by the multi-head attention mechanism [25], and add a residual connection after each layer. However, (1) the interpretability of the model is hindered by the multi-head concatenation and residual connections, since they break the normalization property of probabilities and consequently it is difficult to assess the contribution of different parts; (2) in real-life power-law networks, objects could have very high degrees, which leads to the calculation inefficiency of softmax
[26] in attention and further affects scalability.

TABLE I: Property comparison of existing HIN GCN methods ([18, 19], [17], [22]) and ie-HGCN with respect to the properties NU, AMP, UMP and LS.
To fully and efficiently exploit structural properties of HINs, we propose the interpretable and efficient Heterogeneous Graph Convolutional Network (ie-HGCN), which directly takes an HIN as input and performs multiple layers of heterogeneous convolution on the HIN to learn task-specific object representations. Each layer of ie-HGCN has three key steps to obtain higher-level object representations: (1) Projection. We define relation-specific projection matrices to project heterogeneous neighbor objects' hidden representations (input object features in the first layer) into a common semantic space corresponding to the target object type. We additionally define self-projection matrices (one for each object type) to project the representations of the target objects into the new common semantic space as well. (2) Object-level Aggregation. Given the adjacency matrix between the target objects of a type and their neighbor objects of another type, we use its row-normalized version to perform within-type aggregation among the neighbor objects of each target object. We show that the first two steps intrinsically define a heterogeneous spectral graph convolution operation on the bipartite graph described by this adjacency matrix, with the projection matrices of the first step as convolution filters. (3) Type-level Aggregation. We develop a type-level attention mechanism to learn the importance of different types of neighbors for a target object and perform type-level aggregation on the object-level aggregation results accordingly.

Compared to existing HIN GCN methods, the proposed ie-HGCN has two salient features:
(1) Interpretability: By stacking multiple layers, the proposed type-level attention and convolutional aggregation facilitate adaptively learning the importance score of each meta-path for each object, which enhances the interpretability of the model. We formally prove that ie-HGCN can evaluate all possible meta-paths within a length limit (i.e., model depth) in Section 4.5.1.
(2) Efficiency: ie-HGCN evaluates various meta-paths as the multi-layer iterative calculation proceeds. Hence, it avoids the computation of meta-path-based graphs, which is quite time-consuming. Moreover, in each layer ie-HGCN first uses normalized adjacency matrices (a reasonable choice, which we will discuss in Section 4.2) to aggregate a target object's neighbors of different types as super "type" objects, and then uses type-level attention to aggregate them. This hierarchical aggregation architecture makes our model efficient because: (1) it avoids large-scale softmax calculation directly in the neighborhood of a target object; (2) an HIN often has a small number of node types, which leads to very efficient attention calculation. In Section 4.5.3, we analyze the time complexity to verify its quasi-linear scalability.
We conduct extensive experiments to show the superior performance of ie-HGCN against state-of-the-art methods on three benchmark datasets.
HIN Embedding Methods: In recent years, a series of methods have been proposed to learn representations of objects in HINs. EOE [5], PTE [4] and HEER [3] split an HIN into several bipartite graphs, and then use the LINE model [27] to learn object representations by preserving the first-order or the second-order proximities. Based on user-specified meta-paths, Esim [8] first samples path instances, and then learns object representations such that objects which co-occur in many path instances have similar representations. HIN2Vec [12] learns representations of objects and meta-paths by predicting whether two objects have a specific relation. HINE [11] learns object representations by minimizing the distance between two distributions which respectively model the meta-path-based proximity on the graph and the first-order proximity [27] in the embedding space. Metapath2vec [6] and SHNE [7] first sample path instances guided by a set of user-specified meta-paths, and then learn object representations by their proposed heterogeneous skip-gram. HERec [9] and MCRec [10] first perform meta-path-based random walks, and then learn object representations accordingly for recommendation tasks. However, these methods cannot learn task-specific embeddings. Although structural properties are exploited, none of them can automatically learn the importance of all meta-paths within a length limit, not to mention task-specific importance.
GCNs for Homogeneous Graphs:
Inspired by the great success of convolutional neural networks in computer vision, researchers try to generalize convolution on graphs
[28]. Bruna et al. [29] first develop a graph convolution operation based on the graph Laplacian in the spectral domain, inspired by the Fourier transform in signal processing. Then, ChebNet [30] is proposed to improve its efficiency by using K-order Chebyshev polynomials. Kipf et al. [31] further introduce a first-order approximation of the K-order Chebyshev polynomials to build efficient deep models. GAT [25] is proposed to learn the different importance of nodes in a node's neighborhood, based on a masked self-attention mechanism. Hamilton et al. propose a general inductive framework, GraphSAGE [32]. It generates node embeddings by sampling neighbor nodes and aggregating their features with aggregator functions. However, all these methods are developed for homogeneous graphs. They cannot be directly applied to HINs because of heterogeneity.

GCNs for Heterogeneous Graphs: Based on user-specified symmetric meta-paths, HAN [13], HAHE [14], DeepHGNN [15], and GraphInception [17] transform an HIN into several homogeneous graphs by computing commuting matrices. Then, they apply GCN to the resulting homogeneous graphs. For each user-specified meta-path, MAGNN [16] first performs intra-meta-path aggregation by encoding all the object features along a path instance of the meta-path, and then performs inter-meta-path aggregation by an attention mechanism. HetGNN [19] samples a fixed number of neighbors in the vicinity of an object via random walk with restart, and aggregates these neighbors by Bi-LSTM. RGCN [20] and Decagon [21] use different weight matrices for different relations, and sum the convolution results of different types of neighbors. ActiveHNE [18] takes the entire HIN as input and concatenates the convolution results in each convolution layer. GTN [22] first computes all possible meta-path-based graphs by iterative matrix multiplication of two softly selected adjacency matrices, and then performs graph convolution on the resulting graphs. HetSANN [23] and HGT [24] extend GAT [25] to HINs; they directly use an attention mechanism to aggregate different types of neighbors.
However, these methods either cannot discover useful meta-paths from all possible meta-paths [13, 14, 15, 17, 16, 18, 19, 20, 21, 23, 24], or have limited scalability [13, 14, 15, 17, 16, 22].
We first introduce some important concepts about HINs [1], and then formally define the problem we study in this paper.
Heterogeneous Information Network (HIN). A heterogeneous information network is defined as G = (V, E), where V is the set of objects and E is the set of links. φ: V → A and ψ: E → R are respectively the object type mapping function and the link type mapping function. A denotes the set of object types, and R denotes the set of relations, where |A| + |R| > 2. Let V_Ω denote the set of objects of type Ω ∈ A, and let N(Ω) denote the set of neighbor object types of Ω that have relations from them to Ω; Γ ∈ N(Ω) is a neighbor object type of Ω. We abuse notation a bit to use Ω also as the index of the object type in A. The relation from Γ to Ω is denoted as Γ-Ω.
Network Schema. Given an HIN G = (V, E) with φ: V → A and ψ: E → R, the network schema is a directed graph defined over the object types A, with edges as relations from R, denoted as T_G = (A, R). It is a meta template for G.
Meta-path. A meta-path P is essentially a path defined on the network schema T_G. It is denoted in the form of A_1 →(R_1) A_2 →(R_2) ⋯ →(R_l) A_{l+1} (abbreviated as A_1A_2⋯A_{l+1}), which describes a composite relation R = R_1 ∘ R_2 ∘ ⋯ ∘ R_l between object types A_1 and A_{l+1}, where ∘ denotes the composition operator on relations. The subscript l is the length of P, i.e. the number of relations in P. We say P is symmetric if its corresponding composite relation R is symmetric. A path instance of P is a concrete path in an HIN that instantiates P.
Figure 1 shows a toy HIN of DBLP (left) and its network schema (right). It contains four object types: "Paper" (P), "Author" (A), "Conference" (C) and "Term" (T), and six relations: "Publishing" and "Published" between C and P, "Writing" and "Written" between A and P, "Containing" and "Contained" between P and T. For object type P, its set of neighbor object types is N(P) = {A, C, T}. The meta-path APA is symmetric, while APC is asymmetric, and they both have a length of 2. As shown, an author a has published a paper p in conference c, and thus we say (a, p, c) is a path instance of APC.
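The type and link mappings, and the test of whether a concrete path instantiates a meta-path, can be sketched as follows (the toy objects and relation names are our own illustration, not the paper's dataset):

```python
# Sketch of an HIN with object-type and link-type mappings, plus a check
# that a concrete path is a path instance of a meta-path (type sequence).
nodes = {"a1": "A", "p1": "P", "c1": "C"}                      # object -> type
edges = {("a1", "p1"): "Writing", ("p1", "c1"): "Published"}   # link -> relation

def instantiates(path, meta_path):
    """True if `path` (object sequence) is a path instance of `meta_path`."""
    if len(path) != len(meta_path):
        return False
    if any(nodes[v] != t for v, t in zip(path, meta_path)):
        return False
    # every consecutive pair must be linked in the HIN (either direction)
    return all((u, v) in edges or (v, u) in edges for u, v in zip(path, path[1:]))

print(instantiates(["a1", "p1", "c1"], ["A", "P", "C"]))  # True: an APC instance
```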
HIN Representation Learning. Given an HIN G = (V, E), the problem is to learn representation matrices {H^Ω | Ω ∈ A} for a specific task such as object classification. For each object type Ω ∈ A, the representation matrix is denoted as H^Ω ∈ ℝ^{|V_Ω| × d}, where d is the representation dimensionality. For an object v_i ∈ V_Ω, its corresponding representation vector is the i-th row of H^Ω, which is a d-dimensional vector.

TABLE II: Notations and descriptions.

H^Ω                    Hidden representations of V_Ω from the previous layer
H'^Ω                   New representations of V_Ω in the current layer
W^{Ω-Ω}                Dummy self-relation projection matrix
W^{Γ-Ω}                Relation-specific projection matrix
Y^{Ω-Ω} / Y^{Γ-Ω}      Projected representations of V_Ω / V_Γ
Â^{Ω-Γ}                Row-normalized adjacency matrix, i.e. (D^{Ω-Γ})^{-1} A^{Ω-Γ}
Z^{Γ-Ω}                Convolved representations from Γ to Ω
W_q^Ω / W_k^Ω          Attention query/key parameters
w_a^Ω                  Attention parameters
Q^Ω                    Mapped queries for V_Ω
K^{Ω-Ω} / K^{Γ-Ω}      Mapped keys for V_Ω / V_Γ
e^{Ω-Ω} / e^{Γ-Ω}      Unnormalized attention coefficients for V_Ω / V_Γ
a^{Ω-Ω} / a^{Γ-Ω}      Normalized attention coefficients for V_Ω / V_Γ
In this section, we present the ie-HGCN method. Figure 2(a) shows the overall architecture of ie-HGCN for the network schema of DBLP. Each layer consists of one block per object type. In each block, three key calculation steps are performed. Figure 2(b) shows the calculation flow of a block in a layer. In the following, we elaborate on the three key calculation steps of the block for a target object type Ω in a layer; the process is similar in the other blocks. The main notations used in this paper are summarized in Table II. We use bold uppercase/lowercase letters to denote matrices/vectors. For clarity, we omit the layer indices of all the layer-specific notations.
For different types of objects, their features are located in different semantic spaces. Therefore, in each block, we first project the representations of neighbor objects of different types into a new common semantic space. The input of the Ω block is a set of hidden representation (input feature in the first layer) matrices {H^Ω} ∪ {H^Γ | Γ ∈ N(Ω)}, obtained from the previous layer. H^Ω and H^Γ are the representation matrices for V_Ω and V_Γ respectively. For each neighbor object type Γ ∈ N(Ω), we define a relation-specific projection matrix W^{Γ-Ω} for relation Γ-Ω. It projects H^Γ from the semantic space of Γ into a new common semantic space of dimensionality d'. Besides, to project H^Ω from its own feature space into the new common space as well, we additionally define a projection matrix W^{Ω-Ω}. Here W^{Ω-Ω} is simply a projection matrix, not a relation-specific projection matrix. For convenience, we call Ω-Ω the dummy self-relation. When the real self-relation exists, i.e. Ω ∈ N(Ω), W^{Ω-Ω} is a relation-specific projection matrix. Note that each relation has a relation-specific projection matrix, and different relations have different ones. The projection is formulated as follows:
    Y^{Ω-Ω} = H^Ω · W^{Ω-Ω},    Y^{Γ-Ω} = H^Γ · W^{Γ-Ω},  ∀Γ ∈ N(Ω)    (1)
where Y^{Ω-Ω} and Y^{Γ-Ω} are the projected hidden representations located in the new common space.
For example, as illustrated in Figure 2(b), W^{C-P} projects H^C from the "Conference" space into a new common "Paper" space. H^P is originally located in the "Paper" space; W^{P-P} projects it from the original space into the new common space.
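As a shape-level sketch of this projection step (all dimensions and data below are made up for illustration), each input type is mapped into the common target space by its own projection matrix:

```python
import numpy as np

# Sketch of the projection step (Eq. (1)) with hypothetical shapes: relation-
# specific matrices map each type's representations into one common space.
rng = np.random.default_rng(0)
d = {"P": 5, "A": 4, "C": 3}      # made-up input dims per object type
d_new = 6                         # common-space dim for the target type "P"

# hidden representations: 7 papers, 8 authors, 2 conferences (toy sizes)
H = {t: rng.normal(size=(n, d[t])) for t, n in [("P", 7), ("A", 8), ("C", 2)]}
# W["P"] plays the role of the dummy self-relation projection
W = {t: rng.normal(size=(d[t], d_new)) for t in d}

Y = {t: H[t] @ W[t] for t in d}   # all projected into the common "Paper" space
assert all(Y[t].shape[1] == d_new for t in d)
```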
After projecting all the hidden representations of neighbor objects into a common semantic space, we then perform object-level aggregation. In the following, let us take the example of aggregating hidden representations from V_Γ to V_Ω. However, we cannot directly apply GCN [31] to the aggregation, since the neighbors of an object are of different types in HINs, i.e., the heterogeneity of HINs. An adjacency matrix between two different types of objects may not even be a square matrix. In this paper, given the adjacency matrix A^{Ω-Γ} between V_Ω and V_Γ, we first compute its row-normalized matrix Â^{Ω-Γ} = (D^{Ω-Γ})^{-1} · A^{Ω-Γ}, where D^{Ω-Γ} is the diagonal degree matrix. Then, we define the heterogeneous graph convolution as follows:
    Z^{Γ-Ω} = Â^{Ω-Γ} · Y^{Γ-Ω}    (2)
Each row of Â^{Ω-Γ} can serve as the normalized coefficients to compute a linear combination of the corresponding projected representations of V_Γ. For symbolic consistency, we let Z^{Ω-Ω} = Y^{Ω-Ω}. Thus, we obtain a set of convolved representations {Z^{Ω-Ω}} ∪ {Z^{Γ-Ω} | Γ ∈ N(Ω)}, and each representation in the set contributes to V_Ω from one aspect. Take the P block in Figure 2(b) as an example. We use Â^{P-C}, Â^{P-A} and Â^{P-T} to respectively aggregate the projected representations of paper objects' neighbor conference objects, author objects and term objects. Thus, we obtain {Z^{P-P}, Z^{C-P}, Z^{A-P}, Z^{T-P}}.
Although Eq. (2) is similar to the aggregation ideas in previous methods [18, 17, 20, 21], our design still has some novel aspects: (1) different from previous methods, we calculate the self-representation Z^{Ω-Ω}, which, together with the attentive type-level aggregation introduced in the next subsection, enables ie-HGCN to evaluate the usefulness of all meta-paths within a length limit (model depth). We will prove this in Section 4.5.1. (2) Since Â^{Ω-Γ} is usually not a square matrix and consequently cannot be eigen-decomposed to obtain a Fourier basis, no previous work provides theoretical analysis to formally show that Eq. (2) is a proper convolution. In Section 4.5.2, we show that Eq. (2) is intrinsically a spectral graph convolution on the bipartite graph. Moreover, we could also implement an attention mechanism for object-level aggregation similar to that in [25]. However, in this work, we simply use Â^{Ω-Γ} to perform object-level aggregation, considering that object-level attention is computationally inefficient, and the (weighted) adjacency matrices of real-world complex networks are often sufficient to reflect the relative importance among objects. Take IMDB as an example: the rating between a user and a movie naturally reflects the preference of the user towards the movie. Empirical results also support this idea.
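A minimal numpy sketch of this object-level aggregation (toy shapes, dense matrices; a real implementation would use sparse matrices): row-normalize a rectangular adjacency matrix and average the projected neighbor representations:

```python
import numpy as np

# Sketch of object-level aggregation (Eq. (2)): row-normalize the (generally
# non-square) adjacency A between target objects and one neighbor type, then
# average the neighbors' projected representations. Shapes are made up.
rng = np.random.default_rng(1)
A = (rng.random((4, 6)) < 0.5).astype(float)   # 4 target objects, 6 neighbors
Y_gamma = rng.normal(size=(6, 3))              # neighbors' projected reps

deg = A.sum(axis=1, keepdims=True)
# D^{-1} A, guarding against isolated target objects (degree 0)
A_hat = np.divide(A, deg, out=np.zeros_like(A), where=deg > 0)
Z = A_hat @ Y_gamma                            # convolved reps, one row per target
```

Each row of `A_hat` sums to 1 (or 0 for an isolated object), so `Z` is a per-object average of neighbor representations.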
To learn more comprehensive representations for V_Ω, we need to fuse the representations from different types of neighbor objects. In a specific task, for a target object, the information from different types of neighbor objects may have different importance. Take paper objects in DBLP as an example: in the task of predicting a paper's quality, the representation of the conference where the paper is published could be more important. To this end, we propose type-level attention to automatically learn the importance weights for different types of neighbor objects. Then we aggregate the corresponding convolved representations by computing their weighted sum. The proposed attention mechanism also facilitates the model to evaluate all possible meta-paths within a length limit (model depth) for a particular task. We will prove this in Section 4.5.1.
The attention mechanism maps a set of queries and a set of key-value pairs to an output. In practice, we pack the queries, keys and values together into three matrices Q, K and V respectively. Then it can be formulated as Attention(Q, K, V) = f(Q, K) · V, where f is the attention function, such as dot-product [33] or a neural network [25]. Here, the obtained convolved representations are the values. We define a weight matrix W_k^Ω to map them into keys, and a weight matrix W_q^Ω to map Z^{Ω-Ω} into the query, where the mapped dimensionality d_a is the hidden layer dimensionality of the type-level attention.
    Q^Ω = Z^{Ω-Ω} · W_q^Ω,    K^{Ω-Ω} = Z^{Ω-Ω} · W_k^Ω,    K^{Γ-Ω} = Z^{Γ-Ω} · W_k^Ω    (3)
Since we want to assess the importance of Z^{Ω-Ω} and the Z^{Γ-Ω}'s with respect to Z^{Ω-Ω} when calculating the next-layer representations, it is intuitive to map all of them into keys, and to map Z^{Ω-Ω} into the query. This is different from previous methods [13, 14, 22], where the query is a parameter vector. Note that mapping Z^{Ω-Ω} to the query is also the key to achieving personalized importance estimation for each object. The attention function is implemented as follows:

    e^{Ω-Ω} = ELU([Q^Ω ∥ K^{Ω-Ω}] · w_a^Ω),    e^{Γ-Ω} = ELU([Q^Ω ∥ K^{Γ-Ω}] · w_a^Ω)    (4)
where ∥ denotes the row-wise concatenation operation, w_a^Ω is the parameter vector, and ELU [34] is the activation function. The i-th elements of e^{Ω-Ω} and e^{Γ-Ω} respectively reflect the unnormalized importance of object i itself and of its Γ-type neighbors when calculating its higher-level representation. Then, the normalized attention coefficients are computed by applying the softmax function:

    [a^{Ω-Ω} ∥ a^{Γ_1-Ω} ∥ ⋯] = softmax([e^{Ω-Ω} ∥ e^{Γ_1-Ω} ∥ ⋯]),  Γ_1, … ∈ N(Ω)    (5)
where softmax is applied to the operand row-wise. The normalized attention coefficients are used to compute the higher-level representations of V_Ω by a weighted combination of the corresponding values as follows:
    [H'^Ω]_{i:} = σ( [a^{Ω-Ω}]_i · [Z^{Ω-Ω}]_{i:} + Σ_{Γ∈N(Ω)} [a^{Γ-Ω}]_i · [Z^{Γ-Ω}]_{i:} )    (6)
where σ is the nonlinearity, the subscript i (i:) means the i-th element (row) of a vector (matrix), and i corresponds to the i-th object in V_Ω. The new representations in H'^Ω are in turn used as the input of the blocks in the next layer. The final representations of objects are output by the blocks in the last layer.
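The attention equations above can be sketched end-to-end as follows (all dimensions and weights are random placeholders for illustration, not a trained model; ELU is also used as the output nonlinearity here by assumption):

```python
import numpy as np

def elu(x):
    """ELU activation: x for x > 0, exp(x) - 1 otherwise."""
    return np.where(x > 0, x, np.exp(x) - 1)

# Sketch of type-level attention: queries/keys from the convolved
# representations, softmax over types, then a weighted combination.
rng = np.random.default_rng(2)
n, dp, da = 5, 4, 3                       # objects, common dim, attention dim
Z_self = rng.normal(size=(n, dp))         # self-representation
Z_list = [rng.normal(size=(n, dp)) for _ in range(2)]  # two neighbor types

Wq, Wk = rng.normal(size=(dp, da)), rng.normal(size=(dp, da))
wa = rng.normal(size=(2 * da,))

Q = Z_self @ Wq                                          # queries
keys = [Z_self @ Wk] + [Z @ Wk for Z in Z_list]          # keys per type
e = np.stack([elu(np.concatenate([Q, K], axis=1) @ wa) for K in keys], axis=1)
a = np.exp(e) / np.exp(e).sum(axis=1, keepdims=True)     # row-wise softmax

# per-object weighted combination over types, then the nonlinearity
H_new = elu(sum(a[:, [j]] * Z for j, Z in enumerate([Z_self] + Z_list)))
```

Note that each object has its own row of coefficients `a[i]`, which is what makes the importance estimation personalized.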
Once the final representations of objects are obtained from the last layer, they can be used for a variety of tasks such as classification, clustering, etc. The loss function can be defined depending on the specific task. For semi-supervised multi-class object classification, it can be defined as the sum (or weighted sum) of the cross-entropy over all the labeled objects of each object type:
    L = − Σ_{Ω∈A} Σ_{i∈Y_Ω} Σ_{c∈C_Ω} Y^Ω_{ic} · ln(Ŷ^Ω_{ic})    (7)
where Y_Ω is the set of indices of labeled objects in V_Ω, C_Ω is the set of class indices for Ω, and Y^Ω_{ic} and Ŷ^Ω_{ic} are respectively the ground-truth label indicator and the predicted score of object i on class c. We can minimize the loss by back-propagation. The overall training procedure of ie-HGCN is shown in Algorithm 1, wherein we index layers by square brackets.
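A small sketch of this loss (with made-up scores and labels), summing the cross-entropy over the labeled objects of each type:

```python
import numpy as np

# Sketch of the semi-supervised classification loss: cross-entropy summed
# over the labeled objects of each object type. Data below is hypothetical.
def hin_cross_entropy(preds, labels):
    """preds: {type: (n, C) row-stochastic scores}; labels: {type: {idx: class}}."""
    loss = 0.0
    for t, label_map in labels.items():
        for i, c in label_map.items():
            loss -= np.log(preds[t][i, c])
    return loss

preds = {"A": np.full((3, 4), 0.25)}     # uniform scores over 4 classes
labels = {"A": {0: 1, 2: 3}}             # two labeled author objects
print(hin_cross_entropy(preds, labels))  # 2 * ln(4) ≈ 2.7726
```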
The most important highlight of ie-HGCN is that it evaluates all possible meta-paths with length less than the number of layers in the model. We formalize this property as the following theorem:
For an object type Ω, let P^(l)_Ω denote all possible meta-paths of length greater than or equal to 0 and less than l that end with object type Ω. In the l-th layer, the output hidden representation H'^Ω evaluates all the meta-paths in P^(l)_Ω.
We prove the theorem by mathematical induction.
The base case: When l = 1, H^Ω is the input features of V_Ω. Obviously, the meta-path evaluated can be expressed as Ω, which has a length of 0 and ends with Ω, i.e., P^(1)_Ω = {Ω}.
The step case: Assume the theorem holds when l = k, i.e. H^Ω evaluates P^(k)_Ω. When l = k + 1, H'^Ω is an attention-weighted combination of Z^{Ω-Ω} and the Z^{Γ-Ω}'s. Z^{Ω-Ω} is a linear projection of H^Ω, which evaluates P^(k)_Ω by assumption. According to Eq. (2), Z^{Γ-Ω} = Â^{Ω-Γ} · Y^{Γ-Ω}, where H^Γ evaluates P^(k)_Γ by assumption. The heterogeneous graph convolution concatenates the relation Γ-Ω at the end of every meta-path in P^(k)_Γ. Since we aggregate over all Γ ∈ N(Ω), this results in all meta-paths of length greater than or equal to 1 and less than k + 1 that end with Ω. By uniting these with P^(k)_Ω from Z^{Ω-Ω}, we can conclude that H'^Ω evaluates P^(k+1)_Ω.
Therefore, the theorem holds. ∎
The proposed ie-HGCN can capture objects' personalized preferences for different meta-paths because each object has its own attention coefficients. GTN cannot capture such personalized meta-path importance, since its importance weights for relations are shared by all the related objects. We can obtain the importance score of a meta-path to a specific target object by summing the scores of all its path instances ending with that object. The score of a path instance is intuitively calculated by multiplying the attention coefficients and the link weights (from the corresponding normalized adjacency matrices for real relations, or 1 for dummy self-relations) between objects along the path. Since path instances often share attention coefficients, we can efficiently aggregate sub-path scores iteratively during the forward propagation of ie-HGCN, recording in each block the aggregated scores for the different meta-paths up to that block.
We can also derive the heterogeneous graph convolution presented in Eq. (2) by connecting it to the spectral domain of bipartite graphs (when the self-relation exists, the following derivation still holds by setting Γ = Ω). For V_Ω and V_Γ, given their representation matrices H^Ω and H^Γ and the adjacency matrices A^{Ω-Γ} and A^{Γ-Ω} between them, we cannot directly eigen-decompose A^{Ω-Γ} and A^{Γ-Ω} as they may not be square matrices. Thus, we define the augmented adjacency matrix A and the augmented representation matrix H as follows:

    A = [ 0  A^{Ω-Γ} ; A^{Γ-Ω}  0 ],    H = [ H^Ω  0 ; 0  H^Γ ]

where the 0's denote square zero matrices, and H is properly padded by zeros since generally the dimensionalities of H^Ω and H^Γ differ. Our convolution is related to the random walk Laplacian, defined as L_rw = I − D^{−1}A, where D_ii = Σ_j A_ij. We also have L_rw = U Λ U^{−1}, where U and Λ are respectively L_rw's eigenvectors and eigenvalues.
U^{−1} and U define the graph Fourier transform and the inverse transform respectively. Then the bipartite graph convolution is defined as the multiplication of a parameterized filter g_θ (diagonal in the Fourier domain) and a signal x (a column of H) in the Fourier domain:

    g_θ ⋆ x = U g_θ U^{−1} x

where g_θ can be regarded as a function of Λ and is efficiently approximated by the truncated Chebyshev polynomials [30]: g_θ(Λ) ≈ Σ_{k=0}^{K} θ_k T_k(Λ̃), where Λ̃ = (2/λ_max)Λ − I, T_k(x) = 2x·T_{k−1}(x) − T_{k−2}(x), T_0(x) = 1 and T_1(x) = x. We note that in general, L_rw has the same eigenvalues as the symmetric normalized Laplacian [35], which lie in [0, 2] [36]. For the purpose of numerical stability, we replace λ_max with 2 without affecting the Fourier basis, so as to rescale the eigenvalues to [−1, 1] [30]. The rescaled Laplacian L̃ = L_rw − I = −D^{−1}A can be expressed as follows:

    L̃ = −[ 0  Â^{Ω-Γ} ; Â^{Γ-Ω}  0 ]

where Â^{Ω-Γ} (Â^{Γ-Ω}) is the row-normalized adjacency matrix between V_Ω and V_Γ (V_Γ and V_Ω). Now, the convolution operation can be expressed as g_θ ⋆ x ≈ Σ_{k=0}^{K} θ_k T_k(L̃) x, which is K-localized, since it can be easily verified that (D^{−1}A)^k denotes the transition probability of objects to their k-order neighborhood. Following GCN [31], we further let K = 1 and θ = θ_0 = −θ_1, and stack multiple layers to recover a rich class of convolutional filter functions. Then we have g_θ ⋆ x ≈ θ(I + D^{−1}A)x. Generalizing the filter to multiple ones, and the signal to multiple channels, the two terms (restricted to the rows of V_Ω) can be expressed as follows:

    H^Ω · Θ,    Â^{Ω-Γ} · H^Γ · Θ

The above two terms recover the calculation of Z^{Ω-Ω} and Z^{Γ-Ω} in Eq. (2). They differ from Eq. (2) only by using the same parameters Θ for the two types Ω and Γ. In ie-HGCN, we use separate parameters W^{Ω-Ω} and W^{Γ-Ω} for Ω and Γ respectively, to improve model flexibility. Another difference is that we aggregate Z^{Ω-Ω} and the Z^{Γ-Ω}'s through the type-level attention rather than simply adding them.
Most previous methods [13, 14, 15, 17, 22] need to compute commuting matrices by iterative multiplication of adjacency matrices, which has time complexity at least quadratic in the number of related objects. Our ie-HGCN performs heterogeneous graph convolution on the HIN directly in each layer, which is more efficient. For ie-HGCN, by applying sparse-dense matrix multiplication, the time complexity of the heterogeneous graph convolution is O(|E^{Ω-Γ}| · d'), where |E^{Ω-Γ}| is the number of links between V_Ω and V_Γ and d' is the common-space dimensionality. The time complexity of the type-level attention is O(|V_Ω| · d' · d_a). Taking all types of objects and all types of links into consideration, the overall time complexity per layer is linear in the total number of objects and links in the HIN (for fixed hidden dimensionalities).
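A back-of-the-envelope cost model (our own illustration, not the paper's exact accounting) shows why the per-layer cost scales linearly with the numbers of objects and links:

```python
# Rough per-layer operation count for one block: dense projections, sparse-
# dense aggregation over links, and type-level attention. All terms are
# linear in the object/link counts for fixed hidden sizes.
def layer_cost(num_objects, num_links, d_in, d_out, d_att):
    proj = num_objects * d_in * d_out   # projection matrices
    conv = num_links * d_out            # sparse-dense aggregation (Eq. (2))
    attn = num_objects * d_out * d_att  # type-level attention
    return proj + conv + attn

# doubling objects and links doubles the estimated cost -> linear scaling
c1 = layer_cost(10_000, 50_000, 128, 64, 32)
c2 = layer_cost(20_000, 100_000, 128, 64, 32)
print(c2 / c1)  # 2.0
```

This contrasts with commuting-matrix approaches, whose iterative adjacency products grow at least quadratically with the number of related objects.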
In this section, we conduct extensive experiments to show the performance of ie-HGCN. The source code will be released on GitHub once the manuscript is accepted. We use three widely used and publicly available real-world networks (IMDB, ACM, DBLP) to construct three HINs. We compare ie-HGCN against three GCN methods for homogeneous graphs: GraphSAGE, GCN and GAT; one HIN embedding method: metapath2vec; five GCN methods for HINs: HAN, HAHE, ActiveHNE, HetSANN and GTN; and one ie-HGCN variant.
The statistics of the HINs used are summarized in Table III. The notation * means the features are real; otherwise, they are generated randomly. Note that most existing methods only require features of the target objects, while ActiveHNE, GTN and our method need input features of all types of objects. However, in the widely used HIN datasets, some types of objects have no available real features. For these objects, some existing methods input their one-hot ids as features, which results in a large number of parameters in the first layer and consequently high space and time complexity. Considering that the general idea is to generate non-informative features for those objects without real features, in this paper we generate a 128-dimensional random vector for each of these objects from the Xavier uniform distribution [37]. In this way, little information can be obtained from their features. For all the methods (except HAHE, which cannot make use of object input features), we input exactly the same object features as shown in Table III.

• IMDB. We extract a subset from the IMDB dataset in HetRec 2011 (https://grouplens.org/datasets/hetrec2011/), and construct an HIN which contains 4 object types: Movie (M), Actor (A), User (U) and Director (D), and 6 relations: M-A and A-M, M-U and U-M, M-D and D-M. We select 14 numerical and categorical features (task-irrelevant features such as id and url are ignored) from the original features for movie objects. Movie (M) objects are labeled by 4 classes: comedy, documentary, drama, and horror.
• ACM. The dataset is provided by the authors of HAN [13]. It was downloaded from the ACM digital library (https://dl.acm.org/) in 2010, and includes data from 14 representative computer science conferences. We construct an HIN with 3 object types: Paper (P), Author (A) and Subject (S), and 4 relations: P-A and A-P, P-S and S-P. Paper (P) objects are labeled by 3 research areas: data mining, database and computer network, and their features are the TF-IDF representations of their titles.
• DBLP. The dataset is provided by the authors of HAN [13] and is extracted from 4 research areas of the DBLP bibliography (https://dblp.org/). The 4 research areas are: data mining (DM), database (DB), artificial intelligence (AI) and information retrieval (IR) (DM: ICDM, KDD, PAKDD, PKDD, SDM; DB: SIGMOD, VLDB, PODS, EDBT, ICDE; AI: AAAI, CVPR, ECML, ICML, IJCAI; IR: ECIR, SIGIR, WWW, WSDM, CIKM). Based on this dataset, we construct an HIN with 4 object types: Paper (), Author (), Conference () and Term (), and 6 relations: , and . Author () objects are labeled with the 4 research areas according to the conferences to which they submitted papers [13]. Although papers in DBLP have titles as their features, the titles provide very similar information as the terms connected to papers. Hence, we do not incorporate them as real features for papers, so that we can answer an important research question: whether ieHGCN can exploit the useful structural features conveyed by metapaths to accomplish the task without informative object features.
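The random-feature scheme described above can be sketched as follows. This is a minimal illustration assuming NumPy; the function name and the choice of fan-in/fan-out (the number of objects and the feature dimensionality) are our assumptions, not taken from the released code.

```python
import numpy as np

def xavier_uniform_features(num_objects, dim=128, seed=0):
    """Generate non-informative random features for objects that
    have no real features, drawn from the Xavier (Glorot) uniform
    distribution U(-a, a) with a = sqrt(6 / (fan_in + fan_out))."""
    rng = np.random.default_rng(seed)
    a = np.sqrt(6.0 / (num_objects + dim))  # assumed fan_in + fan_out
    return rng.uniform(-a, a, size=(num_objects, dim))

# e.g., random 128-dimensional features for the 8898 term objects in DBLP
T_features = xavier_uniform_features(8898, 128)
```

Because the entries are small and independent of the graph, such features carry essentially no object-specific information, matching the intent described above.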
Dataset  Objects  Number  Features  Classes
DBLP     A         4057   128       4
         P        14328   128       -
         C           20   128       -
         T         8898   128       -
ACM      P         4025   128       3
         A         7167   128       -
         S           60   128       -
IMDB     M         3328   14        4
         A        42553   128       -
         U         2103   128       -
         D         2016   128       -
We evaluate ieHGCN against ten baselines as follows.
• GraphSAGE (GSAGE) [32]: It is a homogeneous method that learns a function to aggregate features from a node’s neighborhood. We use the convolutional mean-based aggregator, which corresponds to a rough, linear approximation of localized spectral convolution.
• GCN [31]: It is a state-of-the-art graph convolutional network for homogeneous graphs.
• GAT [25]: It is designed for homogeneous graphs. For each node, it aggregates neighbor representations using the importance scores learned by node-level attention.
• metapath2vec (MP2V) [6]: It is a state-of-the-art HIN embedding method. It first performs random walks guided by user-specified metapaths and then uses the heterogeneous skip-gram to learn object representations. It cannot learn the importance of these input metapaths.
• HAN [13]: It transforms an HIN into several homogeneous graphs via given symmetric metapaths and uses GAT to perform object-level aggregation. Then, via an attention mechanism, it fuses the object representations learned from the different metapath-based graphs.
• HAHE [14]: HAHE is similar to HAN, except that it initializes the features of the target objects as the metapath-based structural features. Thus, it cannot exploit object features.
• ActiveHNE (DHNE) [18]: It is an active learning method for HINs. For a fair comparison, we use its discriminative heterogeneous network embedding (DHNE) component. It only considers fixed-length metapaths and cannot learn the importance of metapaths.
• HetSANN (HetSA) [23]: It is a heterogeneous method which directly uses an attention mechanism to aggregate heterogeneous neighbors. We use the variant HetSANN.M.R.V, which achieves the best reported performance. The attention is implemented by sparse operations.
• GTN [22]: It is a heterogeneous method which considers all possible metapaths by computing all possible metapath-based graphs, and then performs graph convolution on the resulting graphs.
• ieHGCN: It is a variant of ieHGCN in which the type-level attention is replaced by the element-wise mean function. We use this variant to show the effectiveness of the type-level attention.
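To make the ablation concrete, the difference between the type-level attention and the mean variant can be sketched as below. This is a simplified illustration: in the actual model a small MLP scores the types for all objects at once, whereas here a single parameter vector `w` (our assumption, standing in for that MLP) scores the per-type representations of one object.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def type_level_aggregate(reps, w, mode="attention"):
    """Fuse per-type neighbor representations (each of shape (d,))
    for one object.

    reps : list of projected representations, one per neighbor type
    w    : parameter vector of the (simplified) attention scorer
    mode : "attention" (type-level attention) or "mean" (the variant)
    """
    R = np.stack(reps)          # (num_types, d)
    if mode == "mean":
        return R.mean(axis=0)   # element-wise mean variant
    scores = R @ w              # one scalar score per neighbor type
    alpha = softmax(scores)     # normalized type importances
    return alpha @ R            # weighted sum over types
```

Note that when all type scores are equal, the attention weights are uniform and the two modes coincide; the variant therefore tests exactly how much the learned, non-uniform type importances contribute.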
Dataset  Metrics  Training  GSAGE  GCN  GAT  MP2V  HAHE  HAN  DHNE  HetSA  GTN  ieHGCN  ieHGCN 

DBLP  Micro F1  20%  0.8882  0.9155  0.9097  0.9015  0.9357  0.9224  0.8445  0.9336  0.9341  0.9368  0.9426 
40%  0.8881  0.9110  0.9120  0.9081  0.9418  0.9240  0.8461  0.9372  0.9384  0.9355  0.9422  
60%  0.8868  0.9048  0.9080  0.8982  0.9365  0.9280  0.8677  0.9385  0.9401  0.9448  0.9554  
80%  0.8887  0.9172  0.9173  0.9089  0.9438  0.9308  0.8736  0.9403  0.9446  0.9520  0.9648  
Macro F1  20%  0.8787  0.9060  0.9196  0.9043  0.9311  0.9311  0.8399  0.9228  0.9282  0.9321  0.9385  
40%  0.8798  0.9017  0.9216  0.8973  0.9378  0.9330  0.8480  0.9251  0.9334  0.9305  0.9383  
60%  0.8805  0.8973  0.9184  0.9048  0.9345  0.9370  0.8624  0.9317  0.9353  0.9400  0.9525  
80%  0.8829  0.9099  0.9255  0.9097  0.9424  0.9399  0.8682  0.9348  0.9377  0.9472  0.9629  
ACM  Micro F1  20%  0.8147  0.7880  0.7418  0.6674  0.7717  0.7358  0.7621  0.7857  0.7785  0.7873  0.8193 
40%  0.8086  0.7864  0.7201  0.6901  0.7819  0.7744  0.7841  0.7862  0.7884  0.8023  0.8210  
60%  0.8031  0.7778  0.7618  0.7168  0.7809  0.7647  0.7897  0.7925  0.7927  0.8326  0.8373  
80%  0.8112  0.7975  0.7720  0.7327  0.8086  0.7613  0.7902  0.8113  0.7964  0.8396  0.8422  
Macro F1  20%  0.6340  0.6019  0.5818  0.5092  0.5387  0.6469  0.6494  0.5789  0.5134  0.6895  0.6979  
40%  0.6235  0.5912  0.6114  0.5191  0.5482  0.6439  0.6567  0.5880  0.5365  0.6917  0.6931  
60%  0.6038  0.5880  0.5379  0.5187  0.5465  0.6531  0.6599  0.5920  0.5536  0.6943  0.7025  
80%  0.5960  0.6025  0.5936  0.5481  0.5884  0.6626  0.6761  0.6108  0.5628  0.6936  0.6942  
IMDB  Micro F1  20%  0.5820  0.5958  0.5530  0.4987  0.5489  0.5647  0.6379  0.6299    0.6212  0.6494 
40%  0.5711  0.5849  0.5542  0.5014  0.5500  0.5603  0.6439  0.6344    0.6661  0.6670  
60%  0.5989  0.5981  0.5514  0.5083  0.5501  0.5700  0.6426  0.6364    0.6644  0.6822  
80%  0.5827  0.5873  0.5406  0.5090  0.5464  0.5718  0.6551  0.6421    0.6904  0.6971  
Macro F1  20%  0.4093  0.2926  0.3007  0.2095  0.2147  0.4587  0.5032  0.4741    0.5419  0.5660  
40%  0.4268  0.2915  0.3126  0.2136  0.2352  0.4444  0.5954  0.5459    0.5761  0.5981  
60%  0.3964  0.3095  0.2982  0.2089  0.2172  0.4482  0.5581  0.5432    0.5547  0.6084  
80%  0.3922  0.3036  0.2893  0.2211  0.2575  0.4477  0.5338  0.5332    0.5245  0.5835 
On each dataset, we randomly select p% of the objects as the training set, and the remaining objects are divided equally into a validation set and a test set, where p ∈ {20, 40, 60, 80}. For all the methods, we use exactly the same training/validation/test sets for fairness. We only tune hyperparameters on the validation set of DBLP and reuse the same setting on ACM and IMDB, which reflects whether the hyperparameter setting is sensitive to the choice of dataset. The hyperparameter settings of all the methods are detailed as follows.
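The split protocol above can be sketched as follows (a minimal sketch; the function name and the use of NumPy are ours, not from the released code):

```python
import numpy as np

def split_indices(n, train_frac, seed=0):
    """Randomly split n object indices: train_frac of them for
    training, the rest divided equally into validation and test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_train = int(n * train_frac)
    n_val = (n - n_train) // 2
    return (idx[:n_train],
            idx[n_train:n_train + n_val],
            idx[n_train + n_val:])

# e.g., the 20% training setting on the 4057 DBLP author objects
train, val, test = split_indices(4057, 0.20)
```

Fixing the seed gives every compared method exactly the same three index sets, which is what the fairness requirement above demands.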
• Ours: To ease model tuning, we set the same hidden representation dimensionality for all object types within a layer. Specifically, we set the number of layers to 5. The first layer is the input layer, whose dimensionalities for different object types are determined by the object features. The dimensionalities of the other 4 hidden layers are set to 64, 32, 16 and 8, respectively. The nonlinearity is the ELU function [34]. The hidden layer dimensionality of the type-level attention is set to 64. For optimization, we use the Adam optimizer, with parameters initialized from the Xavier uniform distribution [37]. The learning rate is set to 0.01. We apply dropout (rate 0.5) to the output of every layer except the output layer. The regularization weight is set to 5e-4. For a fair comparison, ieHGCN and its variant use the same hyperparameter setting.
• Baselines: Since GraphSAGE, GCN, GAT, metapath2vec, HAN and HAHE need user-specified metapaths, we use the metapaths adopted in the original papers [13, 14]. Concretely, on DBLP, we use , and . On ACM, we use and . On IMDB, we use , and . For GraphSAGE, GCN, GAT and metapath2vec, we test them on the homogeneous graphs constructed from the above metapaths and report their best results. For all the baselines, we use the validation set of DBLP to tune hyperparameters such as the number of epochs, starting from their default settings. Their key hyperparameters are set as follows. For GraphSAGE, the neighborhood sample size is set to 5. For GAT, HAN and HetSANN, the number of attention heads is set to 8. For HAHE, the batch size is set to 512, and the neighbor sample size is set to 100. For ActiveHNE, the number of layers is set to 3, so that it can exploit length-2 metapaths. For GTN, the number of channels is set to 2 and the number of layers is set to 3. For metapath2vec, the window size and the negative sample size are set to 5, and the walk length is set to 100.
We conduct object classification to compare all the methods. Each method is run 10 times with random initialization, and the average Micro F1 and Macro F1 are reported in Table IV. Note that due to its high space complexity, GTN cannot run on GPU on our datasets, as it exceeds the 12GB GPU memory. Therefore, in this experiment, we run GTN on CPUs with 128GB main memory, as suggested by the authors [22]. Even then, it runs out of the 128GB main memory on IMDB. We can see that ieHGCN achieves the best overall performance, and even its mean-aggregation variant outperforms the other baselines in most cases, which indicates the effectiveness of the proposed heterogeneous graph convolution for object-level aggregation. Moreover, ieHGCN performs better than this variant, which shows that the proposed type-level attention can indeed discover and exploit the most useful metapaths for this task.
On DBLP, the heterogeneous methods HAN and HAHE significantly outperform the homogeneous methods GraphSAGE, GCN and GAT, while on ACM and IMDB the former have little superiority over the latter. This may be because on DBLP the heterogeneous structural features conveyed by metapaths are more helpful for this task (see Section 5.5), while on ACM and IMDB the real features of the target objects are more helpful. ieHGCN achieves the best results on all the datasets, which indicates that it can exploit both useful structural features and useful object features. On DBLP, ActiveHNE performs much worse than HAN, HAHE, HetSANN and GTN, because it only exploits fixed-length metapaths and cannot learn their importance. Even the homogeneous methods GraphSAGE, GCN and GAT perform better than ActiveHNE, as they can exploit the useful metapaths that previous researchers have empirically chosen. On DBLP, HetSANN and GTN perform better than ActiveHNE, which may be because they can exploit all possible metapaths. GTN performs better than HetSANN, possibly because GTN can correctly discover and exploit the useful metapaths for this task while HetSANN cannot. However, GTN performs worse than ieHGCN, likely because it is not flexible enough to capture the characteristics of different objects. Metapath2vec performs worst in most cases, which indicates the superiority of graph convolutional methods over traditional network embedding methods.


One salient feature of ieHGCN is its ability to evaluate all possible metapaths and to discover and exploit the most useful ones for a specific task. We provide experimental evidence in this subsection. In Table VI(a), we show the top-4 most useful metapaths returned by ieHGCN on DBLP for classifying author objects. See more computation details in Appendix A.1. Note that we need to merge equivalent metapaths. Here, for object type , we use to denote the dummy self-relation, and to denote the real self-relation. In Table VI(b), we show the metapaths that are merged for . ieHGCN finds that is the most useful metapath for the task of author object classification. This is reasonable. The semantic meaning of is “the conferences where authors have published papers”. This correctly reflects the fact that in the DBLP dataset, the class of an author (i.e., his/her research area) is labeled according to the conferences to which he/she submitted papers [13]. Besides, and are also useful for the task. indicates that in addition to the conferences where an author has published papers, the conferences where his/her co-authors have published papers are also useful. suggests that we should further consider conferences where the published papers share many common terms with those written by the author. The last metapath is also intuitive. Notice that a paper can only be published in one conference, so the metapath essentially does not introduce information about other conferences to a conference. Hence, we can interpret as .
Regarding the best metapath for each object, we also find the results intuitive. For example, ieHGCN correctly classifies Yoshua Bengio as “AI” (see Section 5.1 for the label details), and assigns the highest score to for him. This is intuitive, since all 7 papers connected to him in our dataset are from “AI” conferences such as ICML. On the other hand, ieHGCN correctly classifies the scholar Chen Chen as “DM”, and assigns the highest score to . This is also reasonable: in our dataset, he has published 3 papers in “DB” conferences and 2 papers in “DM” conferences, but all 5 papers are co-authored with Jiawei Han, who has published many papers in “DM” conferences such as KDD. These observations indicate that ieHGCN is able to assess the importance of metapaths according to the information of individual objects.
Regarding the baselines, HAN and HAHE assign the largest attention coefficient to the metapath [13, 14]. GraphSAGE, GCN, GAT and metapath2vec also achieve their best classification results when their input metapath is . means that we should resort to authors who have published papers in the same conferences as the target author. However, this is less effective, since it exploits the conference information only indirectly. These methods cannot directly exploit the most useful metapath , because they can only perform homogeneous graph convolution, which requires constructing homogeneous graphs from symmetric metapaths.
ieHGCN discovers the useful metapaths and on ACM, and , and on IMDB. Most of them are widely used in previous works. This also indicates the effectiveness of the real features of the target objects, which are exploited through their self-relations.
In this subsection, we investigate the sensitivity of ieHGCN’s performance to the hidden layer dimensionality of the type-level attention. With the other hyperparameters fixed, we gradually increase it from 8 to 512 and report Micro F1 and Macro F1 in Figure 3. We can see that on DBLP the performance is not very sensitive to this dimensionality. On IMDB and ACM, Micro F1 is not very sensitive, while Macro F1 is more sensitive. Considering that Macro F1 is sensitive to skewed classes, this can be explained by the fact that the classes in IMDB and ACM are skewed, while those in DBLP are balanced. The general pattern is: at first, the performance grows as the dimensionality increases; then it begins to decline, presumably due to overfitting with more parameters in the attention module. The overall inflection point is at a dimensionality of 64, so we use this value for ieHGCN.
We test and compare the scalability of the heterogeneous GCN methods on eight constructed HINs of different scales. See the construction details in Appendix A.2. All the methods are run 10 times on GPU with random initialization. The average running time (seconds) w.r.t. the total number of links and objects in the HINs is reported in Figure 4(a). We can see that ieHGCN achieves the best scalability: its time cost increases linearly with the HIN scale, which verifies the time complexity analyzed in Section 4.5.3. HAN, HAHE and HetSANN perform worse than ActiveHNE and ieHGCN because the former need to perform object-level attention, which is computationally inefficient in practice. The time cost of HAN and HAHE increases more sharply than that of HetSANN, because the former need to construct metapath-based graphs, which are very dense, while the latter implements the object-level attention by sparse operations. HAHE performs worse than HAN, since it uses the high-dimensional metapath-based structural features as input features. GTN shows the worst scalability due to its quadratic time complexity. HAHE, HAN and GTN cannot run on large-scale HINs, as they exceed the 12GB GPU memory.
For ieHGCN, the more layers, the more complex and rich the semantics that can be captured. We implement 8 instances of ieHGCN with the number of layers increasing from 2 to 9; see the details in Appendix A.3. We test their classification performance on DBLP. Each model instance is run 10 times with random initialization, and the average Micro F1 and Macro F1 scores are reported in Figure 4(b). We can see that when the model has 2 layers, the performance is very poor. This is not surprising, since a 2-layer ieHGCN can only consider metapaths of length less than 2. As discussed in Section 5.5, in order to accurately classify author objects on DBLP, it is critical to capture and fuse the information from related conference objects. However, there is no 1-hop metapath between authors and conferences in the network schema of DBLP (Figure 1). When the depth becomes 3, the performance improves dramatically, since it becomes possible for ieHGCN to evaluate the metapath . Then, the performance grows slightly as the depth increases, reaching its best at depth 6. After that, the performance starts to decrease, possibly due to overfitting.
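The link between model depth and metapath length can be made concrete by enumerating, over the network schema, the metapaths that an L-layer model can evaluate (at most L-1 hops). A small sketch, with the DBLP schema adjacency (author–paper, paper–conference, paper–term, and inverses) written out by hand:

```python
# DBLP network schema: which object types are linked to which
SCHEMA = {'A': ['P'], 'P': ['A', 'C', 'T'], 'C': ['P'], 'T': ['P']}

def enumerate_metapaths(schema, start, max_hops):
    """All metapaths (object-type sequences) starting at `start`
    with at most `max_hops` hops over the network schema."""
    paths, frontier = [], [[start]]
    for _ in range(max_hops):
        nxt = []
        for p in frontier:
            for t in schema.get(p[-1], []):
                q = p + [t]
                paths.append(q)
                nxt.append(q)
        frontier = nxt
    return paths

# With 1 hop (a 2-layer model), an author never reaches a conference;
# with 2 hops (a 3-layer model), the author-paper-conference path appears.
```

This mirrors the observation above: only from depth 3 onward can the model reach conference objects from author objects.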
In this paper, we propose ieHGCN to learn representations of objects in an HIN. To address the heterogeneity, we first project the representations of the different types of neighbor objects into a common semantic space. Then we define the heterogeneous graph convolution operation to perform object-level aggregation. Finally, we use the proposed type-level attention to aggregate the representations of the different types of neighbor objects. ieHGCN automatically evaluates all possible metapaths in an HIN, and discovers and exploits the most useful ones for a specific task, which brings good model interpretability. The theoretical analysis and the scalability experiment show that it is efficient. Extensive experiments show that ieHGCN outperforms several state-of-the-art methods.
Layers  Coefficients  

12  [, , , ]  [0.06, 0.06, 0.82, 0.06]  
[, ]  [0.50, 0.50]  
[, ]  [0.63, 0.37]  
[, ]  [0.50, 0.50]  
23  [, , , ]  [0.64, 0.04, 0.27, 0.05]  
[, ]  [0.20, 0.80]  
[, ]  [0.37, 0.63]  
[, ]  [0.06, 0.94]  
34  [, , , ]  [0.25, 0.25, 0.25, 0.25]  
[, ]  [0.49, 0.51]  
[, ]  [0.42, 0.58]  
[, ]  [0.19, 0.81]  
45  [, ]  [0.43, 0.57] 
0.82 * 0.94 * 0.25 * 0.57 = 0.1098  

0.82 * 0.80 * 0.25 * 0.57 = 0.0935  
0.82 * 0.63 * 0.25 * 0.57 = 0.0736  
0.82 * 0.64 * 0.25 * 0.57 = 0.0748  
0.63 * 0.27 * 0.25 * 0.57 = 0.0242  
0.63 * 0.37 * 0.25 * 0.57 = 0.0332  
0.63 * 0.27 * 0.51 * 0.43 = 0.0373  
0.82 * 0.64 * 0.51 * 0.43 = 0.1151  
0.82 * 0.80 * 0.49 * 0.43 = 0.1382 
We first compute the mean attention distribution in each block, and then compute the importance score of a metapath based on these mean distributions. Table VI shows the final mean attention coefficients returned by ieHGCN on DBLP. Table VII shows the computation of the scores of the most useful metapaths.
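Concretely, a metapath's score is the product of the mean attention coefficients it collects, one per layer. A minimal sketch (the function name is ours), using the coefficients of the top metapath from Table VII:

```python
import math

def metapath_score(layer_coeffs):
    """Importance score of a metapath: the product of the mean
    type-level attention coefficients it picks up, one per layer."""
    return math.prod(layer_coeffs)

# top metapath in Table VII: 0.82 * 0.94 * 0.25 * 0.57 = 0.1098 (4 d.p.)
score = metapath_score([0.82, 0.94, 0.25, 0.57])
```

The scores of equivalent metapaths are then merged (summed per Table VI(b)) before ranking.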
Based on the original DBLP dataset provided by HAN [13], we use different numbers of authors to induce 8 HINs of different scales. We denote the scale of an HIN by the tuple (number of authors, total objects, total links). From small to large, the scales of the resulting HINs are: (800, 6183, 21308), (1500, 9799, 38384), (2500, 13935, 59578), (4000, 18785, 84356), (5500, 22969, 106171), (7000, 26327, 125894), (10000, 31775, 147777), (14475, 37791, 170794).
We implement 8 instances of ieHGCN, with the number of layers increasing from 2 to 9. Specifically, with the other hyperparameters fixed, we set the layer dimensionalities (excluding the input layer) of these model instances as: [64], [64, 32], [64, 32, 16], [64, 32, 16, 8], [64, 64, 32, 16, 8], [64, 64, 64, 32, 16, 8], [64, 64, 64, 64, 32, 16, 8], [64, 64, 64, 64, 64, 32, 16, 8].
In the experiments, we use a server with 16 Intel Xeon E5-2620 CPUs, one Nvidia GeForce GTX 1080Ti with 12GB GPU memory, and 128GB main memory. Unless otherwise specified, all experiments are performed on GPU to accelerate computation.
B. Hu, C. Shi, W. X. Zhao, and P. S. Yu, “Leveraging meta-path based context for top-n recommendation with a neural co-attention model,” in SIGKDD. ACM, 2018, pp. 1531–1540.
S. Yun, M. Jeong, R. Kim, J. Kang, and H. J. Kim, “Graph transformer networks,” in NeurIPS, 2019, pp. 11960–11970.
T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in NeurIPS, 2013, pp. 3111–3119.