1 Introduction
A heterogeneous graph consists of multiple types of nodes and edges, involving abundant heterogeneous information [28]. In practice, heterogeneous graphs are pervasive in real-world scenarios, such as academic networks, e-commerce and social networks [25]. Learning meaningful representations of nodes in heterogeneous graphs is essential for various tasks, including node classification [22, 38], node clustering [27], link prediction [5, 14] and personalized recommendation [35, 24].
In recent years, Graph Neural Networks (GNNs) have been widely used in representation learning on graphs and have achieved superior performance. Generally, GNNs perform convolutions in two domains, namely the spectral domain and the spatial domain. As a spectral-based method, GCN [13] utilizes the localized first-order approximation on neighbors and then performs convolutions in the Fourier domain for an entire graph. Spatial-based methods, including GraphSAGE [9] and GAT [31], directly perform information propagation in the graph domain through specially designed aggregation functions or the attention mechanism. However, all of the above methods were designed for homogeneous graphs with a single node type and a single edge type, and they cannot handle the rich information in heterogeneous graphs. Simply adapting them to heterogeneous graphs would lead to information loss, since they ignore the heterogeneous properties of the graph.

Table I: Comparison of several existing methods with our model (✓ marks a supported property).

Models
MLP      ✓
GCN      ✓  ✓  ✓
GAT      ✓  ✓  ✓  ✓
RGCN     ✓  ✓  ✓  ✓
HAN      ✓  ✓  ✓  ✓  ✓
HetGNN   ✓  ✓  ✓  ✓  ✓
HGT      ✓  ✓  ✓  ✓  ✓
HGConv   ✓  ✓  ✓  ✓  ✓  ✓
Despite the investigation of approaches on homogeneous graphs, there have also been several attempts to design graph convolution methods for heterogeneous graphs. RGCN [23] was proposed to deal with multiple relations in knowledge graphs. HAN [33] was designed to learn on heterogeneous graphs based on meta-paths and the attention mechanism. [36] presented HetGNN to consider the heterogeneity of node attributes and neighbors through dedicated aggregation functions. [11] proposed HGT, a variant of Transformer [30], to focus on the meta relations in heterogeneous graphs. However, the aforementioned methods still face the following limitations. 1) Heterogeneous information loss: several methods utilize the properties of nodes or relations only partially, rather than the comprehensive information of nodes and relations (e.g., RGCN and HAN). In detail, RGCN ignores the distinct attributes of nodes with various types. HAN relies on multiple hand-designed symmetric meta-paths to convert the heterogeneous graph into multiple homogeneous graphs, which leads to the loss of node and edge information. 2) Structural information loss: some methods deal with the graph topology through heuristic strategies, such as the random walk in HetGNN, which may break the intrinsic graph structure and lose valuable structural information. 3) Empirical manual design: the performance of some methods severely relies on prior experience because of the requirement of specific domain knowledge, such as the predefined meta-paths in HAN. 4) Insufficient representation ability: some methods cannot provide multi-level representations due to their flat model architecture. For example, HGT learns the interaction of nodes and relations in a single aggregation process, which makes it hard to distinguish their importance in such a flat architecture.

To cope with the above issues, we propose HGConv, a novel Heterogeneous Graph Convolution approach, to learn node representations on heterogeneous graphs with a hybrid micro/macro level convolutional operation. Specifically, for a focal node: in the micro-level convolution, the transformation matrices and attention vectors are both specific to node types, aiming to learn the importance of nodes within the same relation; in the macro-level convolution, transformation matrices specific to relation types and a weight-sharing attention vector are employed to distinguish the subtle differences across different relations. Due to the hybrid micro/macro level convolution, HGConv could fully utilize the heterogeneous information of nodes and relations with proper interpretability. Moreover, a weighted residual connection component is designed to obtain the optimal fusion of the focal node's inherent attributes and neighbor information. Based on the aforementioned components, our approach could be optimized in an end-to-end manner. A comparison of several existing methods with our model is shown in Table I.

To sum up, the contributions of our work are as follows:

A novel heterogeneous graph convolution approach is proposed to directly perform convolutions on the intrinsic heterogeneous graph structure with a hybrid micro/macro level convolutional operation, where the micro convolution encodes the attributes of different types of nodes and the macro convolution computes on different relations respectively.

A residual connection component with a weighted combination is designed to adaptively aggregate the focal node's inherent attributes and neighbor information, which could provide comprehensive node representation.

A systematic analysis on existing heterogeneous graph learning methods is given, and we point out that each existing method could be treated as a special case of the proposed HGConv under certain circumstances.
The rest of this paper is organized as follows: Section 2 reviews previous work related to the studied problem. Section 3 introduces the studied problem. Section 4 presents the framework and each component of the proposed model. Section 5 evaluates the proposed model by experiments. Section 6 concludes the entire paper.
2 Related work
This section reviews existing literature related to our work and also points out their differences with our work.
Graph Mining. Over the past decades, a great amount of research has been devoted to graph mining. Classical methods based on manifold learning, including Locally Linear Embedding (LLE) [21] and Laplacian Eigenmaps (LE) [1], mainly focus on the reconstruction of graphs. Inspired by the language model Skip-gram [16], more advanced methods were proposed to learn representations of nodes, such as DeepWalk [19] and Node2Vec [8]. These methods adopt a random-walk strategy to generate sequences of nodes and use Skip-gram to maximize node co-occurrence probability in the same sequence.
However, all of the above methods focus only on the graph topology and cannot take node attributes into consideration, resulting in inferior performance. These methods are surpassed by recently proposed GNNs, which consider both node attributes and graph structure simultaneously.
Graph Neural Networks. Recent years have witnessed the success of GNNs in various tasks, such as node classification [13, 9], link prediction [37] and graph classification [6]. GNNs consider both graph structure and node attributes by first propagating information between each node and its neighbors, and then providing node representations based on the received information. Generally, GNNs could be divided into spectral-based methods and spatial-based methods. As a spectral-based method, Spectral CNN [2] performs convolution in the Fourier domain by computing the eigendecomposition of the graph Laplacian matrix. ChebNet [4] leverages the K-order Chebyshev polynomials to eliminate the need to calculate the eigenvectors of the Laplacian matrix. GCN [13] introduces a localized first-order approximation of ChebNet to alleviate the overfitting problem. Representative spatial-based methods include GraphSAGE [9] and GAT [31]. [9] proposed GraphSAGE to propagate information in the graph domain directly and designed different functions to aggregate the received information. [31] presented GAT by introducing the attention mechanism into GNNs, which enables GAT to select more important neighbors adaptively. We refer the interested readers to [39, 34] for more comprehensive reviews on GNNs. However, all the above methods were designed for homogeneous graphs and cannot handle the rich information in heterogeneous graphs. In this work, we aim to propose an approach to learn on heterogeneous graphs.
Heterogeneous Graph Neural Networks. Heterogeneous graphs contain abundant information on various types of nodes and relations. Mining useful information in heterogeneous graphs is essential in practical scenarios. Recently, several graph convolution methods have been proposed for learning on heterogeneous graphs. [23] presented RGCN to learn on knowledge graphs by employing specialized transformation matrices for each type of relation. [33] designed HAN by extending the attention mechanism in GAT [31] to learn the importance of neighbors and of multiple hand-designed meta-paths. [7] considered the intermediate nodes in meta-paths, which are ignored in HAN, and proposed MAGNN to aggregate the intra-metapath and inter-metapath information. HetGNN [36] first samples neighbors based on a random-walk strategy and then uses specialized Bi-LSTMs to integrate the heterogeneous node attributes and neighbors. [11] proposed HGT to introduce type-specific transformation matrices and learn the importance of different nodes and relations based on the Transformer [30] architecture.
Nevertheless, there are still some limitations in the above methods, including the insufficient utilization of heterogeneous properties, structural information loss, and lack of interpretability. In this paper, we aim to cope with the issues in existing approaches and design a method to learn comprehensive node representation on heterogeneous graphs by leveraging both node attributes and relation information.
3 Problem Formalization
This section introduces related concepts and the studied problem in this paper.
Definition 1.
Heterogeneous Graph: A heterogeneous graph is defined as a directed graph G = (V, E), where V and E denote the set of nodes and the set of edges respectively. Each node v and each edge e are associated with the type mapping functions \phi(v): V \rightarrow \mathcal{A} and \psi(e): E \rightarrow \mathcal{R}, with the constraint of |\mathcal{A}| + |\mathcal{R}| > 2.
Definition 2.
Relation: A relation represents the interaction schema of the source node, the target node and the connecting edge. Formally, for an edge e with source node u and target node v, the corresponding relation R is denoted as \langle \phi(u), \psi(e), \phi(v) \rangle. The inverse of R is naturally represented by R^{-1}, and we consider the inverse relations to propagate information of the two nodes from each other. Thus, the set of edges is extended as E \cup E^{-1} and the set of relations is extended as \mathcal{R} \cup \mathcal{R}^{-1}. Note that the meta-paths used in heterogeneous graph learning approaches [33, 7] are defined as sequences of relations.
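To make Definition 2 concrete, the sketch below builds the extended edge and relation sets for a toy academic graph. All names here (node_type, edges, relation_of) are illustrative assumptions, not from the paper.

```python
# A relation is the schema <phi(src), psi(e), phi(dst)>; inverse edges let
# information flow in both directions.

node_type = {"p1": "paper", "a1": "author", "t1": "term"}

# (source, edge_type, target) triples
edges = [("a1", "writes", "p1"), ("t1", "describes", "p1")]

def relation_of(src, etype, dst):
    """Map an edge to its relation schema <phi(src), psi(e), phi(dst)>."""
    return (node_type[src], etype, node_type[dst])

# Extend the edge set with inverse edges (E ∪ E^{-1}).
inverse_edges = [(dst, etype + "^-1", src) for src, etype, dst in edges]
extended_edges = edges + inverse_edges

# The extended relation set (R ∪ R^{-1}) collects every distinct schema.
relations = {relation_of(s, e, d) for s, e, d in extended_edges}
```

A meta-path is then simply a sequence of such relation triples chained end to end.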
Definition 3.
Heterogeneous Graph Representation Learning: Given a heterogeneous graph G = (V, E), where nodes with type A \in \mathcal{A} are associated with the attribute matrix X_A \in \mathbb{R}^{|V_A| \times d_A}, the task of heterogeneous graph representation learning is to obtain the d-dimensional representation h_v \in \mathbb{R}^d for each v \in V, where d \ll |V|. The learned representations should capture both node attributes and relation information, which could be applied to various tasks, such as node classification, node clustering and node visualization.
4 Methodology
This section presents the framework of our proposed method and each component of the proposed method is introduced step by step.
4.1 Framework of the Proposed Model
The framework of the proposed model is shown in Figure 1. It takes the node attribute matrices X_A for A \in \mathcal{A} in a heterogeneous graph as the input and provides the low-dimensional node representation h_v \in \mathbb{R}^d for each v \in V as the output, which could be applied to various tasks.
The proposed model is made up of multiple heterogeneous graph convolutional layers, where each layer consists of the hybrid micro/macro level convolution and the weighted residual connection component. Different from [33], which performs convolution on converted homogeneous graphs through meta-paths, the proposed hybrid convolution could directly operate on the heterogeneous graph structure. In particular, the micro-level convolution aims to learn the importance of nodes within the same relation, and the macro-level convolution is designed to discriminate the differences across different relations. The weighted residual connection component is employed to consider the different contributions of the focal node's inherent attributes and its neighbor information. By stacking multiple heterogeneous graph convolutional layers, the proposed model could consider the impacts of the focal node's directly connected and multi-hop reachable neighbors.
4.2 MicroLevel Convolution
As pointed out in [33], the importance of the nodes connected to the focal node within the same relation could be different. Hence, we first design a micro-level convolution to learn the importance of nodes within the same relation. We suppose that the attributes of nodes with different types might be distributed in different latent spaces. Therefore, we utilize transformation matrices and attention vectors, which are specific to node types, to capture the characteristics of different types of nodes in the micro-level convolution.
Formally, we denote the focal node as the target node u with type \phi(u) and its connected node as the source node v with type \phi(v). For a focal node u, let N_R(u) denote the set of node u's neighbors within the type-R relation, where for each v \in N_R(u), the edge e between u and v satisfies \langle \phi(v), \psi(e), \phi(u) \rangle = R.
We first apply transformation matrices, which are specific to node types, to project nodes into their own latent spaces as follows,
\tilde{h}_u^{(l)} = W_{\phi(u)}^{(l)} h_u^{(l-1)}    (1)

\tilde{h}_v^{(l)} = W_{\phi(v)}^{(l)} h_v^{(l-1)}    (2)

where W_{\phi(v)}^{(l)} denotes the trainable transformation matrix for node v with type \phi(v) at layer l. h_v^{(l-1)} and \tilde{h}_v^{(l)} denote the original and transformed representations of node v at layer l. Then we calculate the normalized importance of neighbor v \in N_R(u) as follows,
e_{u,v}^{(l)} = \mathrm{LeakyReLU}\left( {a_{\phi(v)}^{(l)}}^{\top} \left[ \tilde{h}_u^{(l)} \, \| \, \tilde{h}_v^{(l)} \right] \right)    (3)

\alpha_{u,v}^{(l)} = \frac{\exp\left( e_{u,v}^{(l)} \right)}{\sum_{v' \in N_R(u)} \exp\left( e_{u,v'}^{(l)} \right)}    (4)

where a_{\phi(v)}^{(l)} is the trainable attention vector for the type-\phi(v) source node at layer l and \| denotes the concatenation operation. \top denotes the transpose operation. \alpha_{u,v}^{(l)} is the normalized importance of source node v to focal node u under relation R at layer l. Then the representation of relation R about focal node u is calculated by,
h_{R,u}^{(l)} = \sigma\left( \sum_{v \in N_R(u)} \alpha_{u,v}^{(l)} \tilde{h}_v^{(l)} \right)    (5)

where \sigma(\cdot) denotes the activation function (e.g., sigmoid, ReLU). An intuitive explanation of the micro-level convolution is shown in Figure 2. Embeddings of nodes within the same relation are aggregated through the attention vectors which are specific to node types. Since the attention weight is computed for each relation, it could well capture the relation information. In order to enhance the model capacity and make the training process more stable, we employ K independent heads and then concatenate the representations as follows,
h_{R,u}^{(l)} = \Big\Vert_{k=1}^{K} \sigma\left( \sum_{v \in N_R(u)} \alpha_{u,v}^{(l,k)} \tilde{h}_v^{(l,k)} \right)    (6)

where \alpha_{u,v}^{(l,k)} denotes the importance of source node v to focal node u under relation R of head k at layer l, and \tilde{h}_v^{(l,k)} stands for source node v's transformed representation of head k at layer l.
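The micro-level convolution amounts to a GAT-style attention restricted to a single relation, with projections and an attention vector specific to node types. Below is a minimal NumPy sketch with assumed toy dimensions; all variable names (W_dst, W_src, attn_vec, ...) are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 4, 8

h_focal = rng.normal(size=d_in)          # focal (target) node, layer l-1
H_nbrs = rng.normal(size=(3, d_in))      # 3 neighbors within one relation

W_dst = rng.normal(size=(d_out, d_in))   # type-specific transform (target type)
W_src = rng.normal(size=(d_out, d_in))   # type-specific transform (source type)
attn_vec = rng.normal(size=2 * d_out)    # attention vector specific to source type

z_focal = W_dst @ h_focal                # project into type-specific latent spaces
Z_nbrs = H_nbrs @ W_src.T

# un-normalized importance: attention vector applied to [z_focal || z_v]
scores = np.array([attn_vec @ np.concatenate([z_focal, z_v]) for z_v in Z_nbrs])
alpha = np.exp(scores - scores.max())
alpha = alpha / alpha.sum()              # softmax over the neighbor set

# relation representation: attention-weighted sum of transformed neighbors
h_relation = np.tanh(alpha @ Z_nbrs)     # tanh as a stand-in activation
```

Multi-head attention simply repeats this computation K times with independent parameters and concatenates the resulting vectors.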
4.3 MacroLevel Convolution
Besides considering the importance of nodes within the same relation, a focal node also interacts with multiple relations, which indicates the necessity of learning the subtle differences across different relations. Therefore, we design a macro-level convolution with transformation matrices specific to relation types and a weight-sharing attention vector to distinguish the differences across relations.
Specifically, we first transform the focal node and its connecting relations into their distinct distributed spaces by,
\hat{h}_u^{(l)} = M_{\phi(u)}^{(l)} h_u^{(l-1)}    (7)

\hat{h}_{R,u}^{(l)} = M_{R}^{(l)} h_{R,u}^{(l)}    (8)

where M_{\phi(u)}^{(l)} and M_{R}^{(l)} denote the transformation matrices for the type-\phi(u) focal node and the type-R relation at layer l respectively. Then the normalized importance of relation R to focal node u is calculated by,
e_{u,R}^{(l)} = \mathrm{LeakyReLU}\left( {\mu^{(l)}}^{\top} \left[ \hat{h}_u^{(l)} \, \| \, \hat{h}_{R,u}^{(l)} \right] \right)    (9)

\beta_{u,R}^{(l)} = \frac{\exp\left( e_{u,R}^{(l)} \right)}{\sum_{R' \in \mathcal{R}(u)} \exp\left( e_{u,R'}^{(l)} \right)}    (10)

where \mathcal{R}(u) denotes the set of relations connected to focal node u. \mu^{(l)} is the trainable attention vector which is shared by different relations at layer l. \beta_{u,R}^{(l)} is the normalized importance of relation R to focal node u at layer l. After obtaining the importance of different relations, we aggregate the relations as follows,
z_u^{(l)} = \sum_{R \in \mathcal{R}(u)} \beta_{u,R}^{(l)} \hat{h}_{R,u}^{(l)}    (11)

where z_u^{(l)} is the fused representation of the relations connected to focal node u at layer l. An explanation of the macro-level convolution is shown in Figure 2. Representations of different relations are aggregated into a compact vector through the attention mechanism. Through the macro-level convolution, the different importance of relations could be calculated automatically.
We also extend Equation (11) to multi-head attention by,
z_u^{(l)} = \Big\Vert_{k=1}^{K} \sum_{R \in \mathcal{R}(u)} \beta_{u,R}^{(l,k)} \hat{h}_{R,u}^{(l,k)}    (12)

where \beta_{u,R}^{(l,k)} is the importance of relation R to focal node u of head k at layer l, and \sum_{R \in \mathcal{R}(u)} \beta_{u,R}^{(l,k)} \hat{h}_{R,u}^{(l,k)} denotes the fusion of the relations connected to focal node u of head k at layer l.
It is worth noting that the attention vectors in the micro-level convolution are specific to node types, while in the macro-level convolution, the attention vector is shared by different relations and is unaware of relation types. Such a design is based on the following reasons. 1) When performing the micro-level convolution, nodes are associated with distinct attributes even when they are within the same relation. An attention vector that is unaware of node types can hardly handle nodes' different attributes and types due to insufficient representation ability. Hence, attention vectors specific to node types are designed in the micro-level convolution. 2) In the macro-level convolution, each relation connected to the focal node is associated with a single representation and we only need to consider the differences of relation types. Therefore, the weight-sharing attention vector across different relations is designed. Following the above design, we could not only maintain the distinct characteristics of nodes and relations, but also reduce the model parameters.
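The macro-level convolution can be sketched the same way: the relation representations from the micro-level step are scored against the projected focal node by a single attention vector shared across all relations. A NumPy sketch with assumed shapes and illustrative names (M_focal, M_rel, mu):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8

h_focal = rng.normal(size=d)
# per-relation representations produced by the micro-level step
relation_reprs = {"AP": rng.normal(size=d), "TP": rng.normal(size=d)}

M_focal = rng.normal(size=(d, d))                  # focal-node-type transform
M_rel = {R: rng.normal(size=(d, d)) for R in relation_reprs}  # relation-type transforms
mu = rng.normal(size=2 * d)                        # attention vector shared by all relations

q = M_focal @ h_focal
keys = {R: M_rel[R] @ r for R, r in relation_reprs.items()}

# weight-sharing attention: the same mu scores every relation
scores = np.array([mu @ np.concatenate([q, k]) for k in keys.values()])
beta = np.exp(scores - scores.max())
beta = beta / beta.sum()                           # importance of each relation

# fuse all relations into one neighbor summary for the focal node
z_focal = sum(b * k for b, k in zip(beta, keys.values()))
```

Note the contrast with the micro-level sketch: here only the transformation matrices are type-specific, while the attention vector mu is shared.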
4.4 Weighted Residual Connection
In addition to aggregating neighbor information by the hybrid micro/macro level convolution, the attributes of the focal node are also supposed to be important, since they directly reflect its inherent characteristics. However, simply adding the focal node's inherent attributes and neighbor information together cannot distinguish their different importance.
Thus, we adapt the residual connection [10] with a trainable weight parameter to combine the focal node's inherent attributes and neighbor information by,

h_u^{(l)} = \lambda^{(l)} \cdot W_{\mathrm{res}}^{(l)} h_u^{(l-1)} + \left( 1 - \lambda^{(l)} \right) \cdot z_u^{(l)}    (13)

where \lambda^{(l)} is the weight to control the importance of focal node u's inherent attributes and its neighbor information at layer l. W_{\mathrm{res}}^{(l)} is utilized to align the dimensions of focal node u's attributes and its neighbor information at layer l.
From another perspective, the weighted residual connection could be treated as the gated updating mechanism in the Gated Recurrent Unit (GRU) [3], where the employed update gates are specific to the focal node type and carry different weights in different layers.

4.5 The Learning Process
We stack L heterogeneous graph convolutional layers to build HGConv. For the first layer, we set h_v^{(0)} to node v's corresponding row in the attribute matrix X_{\phi(v)} as the input. The final node representation h_v is set to the output of the last layer, i.e., h_v^{(L)}, for each v \in V.
HGConv could be trained in an end-to-end manner with the following strategies: 1) semi-supervised learning strategy: for tasks where labels are available, we could optimize the model by minimizing the cross-entropy loss,
\mathcal{L} = - \sum_{v \in \mathcal{V}_L} \sum_{c=1}^{C} y_{v,c} \log \hat{y}_{v,c}    (14)

where \mathcal{V}_L is the set of nodes with labels. y_{v,c} and \hat{y}_{v,c} denote the ground truth and the predicted probability of node v at the c-th dimension. In practice, \hat{y}_v could be obtained from a classifier (e.g., SVM, a single-layer neural network) which takes node v's representation as the input and outputs the predicted probabilities. 2) unsupervised learning strategy: for tasks without any labels, we could optimize the model by minimizing the objective function in Skip-gram [17] with negative sampling,

\mathcal{L} = - \sum_{(v,u) \in \mathcal{P}} \log \sigma\left( h_v^{\top} h_u \right) - \sum_{(v',u') \in \mathcal{N}} \log \sigma\left( - h_{v'}^{\top} h_{u'} \right)    (15)

where \sigma(\cdot) is the sigmoid activation function, and \mathcal{P} and \mathcal{N} denote the set of observed positive node pairs and the set of sampled negative node pairs respectively. 3) joint learning strategy: we could also combine the semi-supervised and unsupervised learning strategies to jointly optimize the model.
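The three strategies can be sketched as a cross-entropy loss over labeled nodes, a Skip-gram-style negative-sampling loss over node pairs, and a weighted sum of the two for joint training. Function names and the toy data below are assumptions for illustration, not the paper's code.

```python
import numpy as np

def cross_entropy(y_true, y_prob):
    """Semi-supervised loss: mean negative log-likelihood over labeled nodes."""
    eps = 1e-12
    return -np.mean(np.sum(y_true * np.log(y_prob + eps), axis=1))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def skipgram_loss(emb, pos_pairs, neg_pairs):
    """Unsupervised loss with negative sampling over node-pair sets."""
    pos = sum(np.log(sigmoid(emb[a] @ emb[b])) for a, b in pos_pairs)
    neg = sum(np.log(sigmoid(-emb[a] @ emb[b])) for a, b in neg_pairs)
    return -(pos + neg)

# toy usage
y_true = np.array([[1, 0, 0], [0, 1, 0]], dtype=float)
y_prob = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
ce = cross_entropy(y_true, y_prob)

emb = np.random.default_rng(3).normal(size=(4, 8))
# joint strategy: weighted combination of the two objectives
joint = ce + 0.5 * skipgram_loss(emb, [(0, 1)], [(0, 2), (0, 3)])
```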
4.6 Systematic Analysis of Existing Models
Here we give a systematic analysis of existing heterogeneous graph learning models and point out that each existing method could be treated as a special case of the proposed HGConv under certain circumstances.
Overview of Homogeneous GNNs. Let us start with an introduction to homogeneous GNNs. Generally, the operations at the l-th layer of a homogeneous GNN follow a two-step strategy:
a_v^{(l)} = \mathrm{AGGREGATE}^{(l)}\left( \left\{ h_u^{(l-1)} : u \in N(v) \right\} \right)    (16)

h_v^{(l)} = \mathrm{COMBINE}^{(l)}\left( h_v^{(l-1)}, a_v^{(l)} \right)    (17)

where h_v^{(l)} denotes the representation of node v at the l-th layer. h_v^{(0)} is initialized with node v's original attributes and N(v) denotes the set of node v's neighbors. a_v^{(l)} stands for the aggregation of node v's neighbors. h_v^{(l)} is the combination of node v's inherent attributes and its neighbor information at layer l.
Different architectures for AGGREGATE and COMBINE have been proposed in recent years. For example, GCN [13] utilizes the normalized adjacency matrix for AGGREGATE and uses the residual connection for COMBINE. GraphSAGE [9] designs various pooling operations for AGGREGATE and leverages the concatenation for COMBINE.
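The two-step scheme can be sketched with mean aggregation for AGGREGATE and concatenation for COMBINE (one of the GraphSAGE-style choices mentioned above); the toy adjacency below is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 4
H = rng.normal(size=(5, d))                 # current representations of 5 nodes
neighbors = {0: [1, 2], 1: [0, 3, 4]}       # toy adjacency (node -> neighbor list)

def aggregate(v):
    # step 1: summarize the neighborhood (mean pooling here)
    return H[neighbors[v]].mean(axis=0)

def combine(v, a_v):
    # step 2: merge the node's own state with the neighbor summary
    return np.concatenate([H[v], a_v])

h0_new = combine(0, aggregate(0))           # updated representation of node 0
```

Swapping in a normalized adjacency for `aggregate` and a residual sum for `combine` would recover a GCN-style layer instead.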
Overview of Heterogeneous GNNs. The operations in heterogeneous GNNs are based on the operations in homogeneous GNNs, with additional consideration of node attributes and relation information. Formally, the operations at the l-th layer could be summarized as follows:

z_v^{(l)} = \mathrm{TRANSFORM}_{\phi(v)}^{(l)}\left( h_v^{(l-1)} \right)    (18)

a_{R,v}^{(l)} = \mathrm{AGGREGATE}_{R}^{(l)}\left( \left\{ z_u^{(l)} : u \in N_R(v) \right\} \right)    (19)

a_v^{(l)} = \mathrm{AGGREGATE}^{(l)}\left( \left\{ a_{R,v}^{(l)} : R \in \mathcal{R}(v) \right\} \right)    (20)

h_v^{(l)} = \mathrm{COMBINE}^{(l)}\left( h_v^{(l-1)}, a_v^{(l)} \right)    (21)

where N_R(v) denotes the set of node v's neighbors within the type-R relation and \mathcal{R}(v) is defined as the set of relations connected to node v.
Compared with homogeneous GNNs, heterogeneous GNNs first design specialized transformation matrices for different types of nodes for TRANSFORM. Then the operations in AGGREGATE are divided into aggregation within the same relation and aggregation across different relations. Finally, the operation in COMBINE is defined as the same as Equation (17) in homogeneous GNNs.
Analysis of the Proposed HGConv. The proposed HGConv makes a delicate design for each operation in the aforementioned heterogeneous GNNs. Specifically, Equation (18) - Equation (21) could be rewritten as follows (note that we omit the activation functions and the transformation matrices for graph convolution or dimension alignment for simplicity):

z_v^{(l)} = W_{\phi(v)}^{(l)} h_v^{(l-1)}    (22)

a_{R,v}^{(l)} = \sum_{u \in N_R(v)} \alpha_{u,v}^{(l)} z_u^{(l)}    (23)

a_v^{(l)} = \sum_{R \in \mathcal{R}(v)} \beta_{R,v}^{(l)} a_{R,v}^{(l)}    (24)

h_v^{(l)} = \lambda^{(l)} h_v^{(l-1)} + \left( 1 - \lambda^{(l)} \right) a_v^{(l)}    (25)

where W_{\phi(v)}^{(l)} is the transformation matrix which is specific to node v's type. \alpha_{u,v}^{(l)} and \beta_{R,v}^{(l)} are the importance learned by the attention mechanism in the micro-level and macro-level convolution respectively. \lambda^{(l)} is the trainable parameter to balance the importance of the focal node's inherent attributes and its neighbor information.
Connection with RGCN. RGCN [23] ignores the distinct attributes of nodes with various types and assigns the importance of neighbors within the same relation based on predefined constants. RGCN could be treated as a special case of the proposed HGConv with the following steps: 1) Replace W_{\phi(v)}^{(l)} in Equation (22) with the identity function, which means different distributions of node attributes with various types are not considered; 2) Replace the trainable \alpha_{u,v}^{(l)} in Equation (23) with a predefined constant, which is calculated by the degree of each node; 3) Set \beta_{R,v}^{(l)} in Equation (24) to 1, which stands for simple sum pooling; 4) Set \lambda^{(l)} in Equation (25) to 1/2, which means equal contribution of node inherent attributes and neighbor information. Note that the sum pooling operation in RGCN could not distinguish the importance of nodes and relations effectively.
Connection with HAN. HAN [33] leverages multiple symmetric meta-paths to convert the heterogeneous graph into multiple homogeneous graphs. Therefore, node v's neighbors are defined by the given set of meta-paths \Phi. HAN could be treated as a special case of the proposed HGConv with the following steps: 1) Replace W_{\phi(v)}^{(l)} in Equation (22) with the identity function, as each converted graph only contains nodes with a single type; 2) Define the set of node v's neighbors in Equation (23) by the meta-paths \Phi, that is, for each meta-path \Phi_i \in \Phi, the set of node v's neighbors is denoted as N_{\Phi_i}(v), and then learn the importance of the neighbors generated by the same meta-path through the attention mechanism; 3) Replace the aggregation over different relations in Equation (24) with the aggregation over the multiple meta-paths \Phi, and learn the importance of different meta-paths using the attention mechanism; 4) Set \lambda^{(l)} in Equation (25) to 0, which means using the neighbor information directly. Note that the converted graphs are homogeneous, and the attributes of nodes with different types are ignored in HAN, leading to inferior performance.
Connection with HetGNN. HetGNN [36] leverages the random-walk strategy to sample neighbors and then uses Bi-LSTMs to integrate node attributes and neighbors. Therefore, node v's neighbors are generated by random walks. HetGNN could be treated as a special case of the proposed HGConv with the following steps: 1) Replace W_{\phi(v)}^{(l)} in Equation (22) with Bi-LSTMs to aggregate the attributes of nodes with various types; 2) Define the set of node v's neighbors in Equation (23) by the random walks and group the neighbors by node types, that is, for each node type A, obtain the set of node v's sampled neighbors with type A. Then, learn the importance of neighbors with the same node type through Bi-LSTMs; 3) Replace the aggregation over different relations in Equation (24) with the aggregation over different node types, and learn the importance of different node types using the attention mechanism; 4) Set \lambda^{(l)} in Equation (25) to be trainable, which is incorporated in the attention mechanism of the previous step in [36]. Note that the random walk in HetGNN may break the intrinsic graph structure and result in structural information loss.
Connection with HGT. HGT [11] learns the importance of different nodes and relations based on the Transformer architecture by designing type-specific transformation matrices. HGT focuses on the study of each relation (a.k.a. meta relation in [11]); hence, the importance of a source node to a target node is calculated based on the information of both nodes as well as their connecting relation in a single aggregation process. HGT could be treated as a special case of the proposed HGConv with the following steps: 1) Replace W_{\phi(v)}^{(l)} in Equation (22) with the linear projections that are specific to the source node type and the target node type respectively to obtain the Key and Query vectors; 2) Fuse the aggregation processes in Equation (23) and Equation (24) into a single aggregation process, where the importance of the source node to the target node is learned from the Key and Query vectors, as well as the transformation matrices specific to their connecting relation type; 3) Set \lambda^{(l)} in Equation (25) to 1/2, which means node inherent attributes and neighbor information contribute equally to the final node representation. Note that the single aggregation process in HGT leads to a flat architecture, making it hard to distinguish the importance of nodes and relations separately.
5 Experiments
This section presents the experimental results on realworld datasets and detailed analysis.
5.1 Description of Datasets
We conduct experiments on three real-world datasets.

ACM3: Following [33], we extract a subset of ACM from AMiner (https://www.aminer.cn/citation) [29], which contains papers published in three areas: Data Mining (KDD, ICDM), Database (VLDB, SIGMOD) and Wireless Communication (SIGCOMM, MobiCOMM). Finally, we construct a heterogeneous graph containing papers (P), authors (A) and terms (T).

ACM5: We also extract a larger subset of ACM from AMiner, which includes papers published in five areas: Data Mining (KDD, ICDM, WSDM, CIKM), Database (VLDB, ICDE), Artificial Intelligence (AAAI, IJCAI), Computer Vision (CVPR, ECCV) and Natural Language Processing (ACL, EMNLP, NAACL).

IMDB (https://data.world/data-society/imdb-5000-movie-dataset): We extract a subset of IMDB and construct a heterogeneous graph containing movies (M), directors (D) and actors (A). The movies are divided into three classes: Action, Comedy and Drama.
For ACM3 and ACM5, we use TF-IDF [20] to extract keywords from the abstracts and titles of papers. Paper attributes are the bag-of-words representations of abstracts. Author attributes are the average representations of their published papers. Term attributes are represented as the one-hot encoding of the title keywords. For IMDB, movie attributes are the bag-of-words representations of plot keywords. Director/actor attributes are the average representations of the movies they direct/act in.
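The attribute construction described above can be sketched as follows; the vocabulary and toy records are made up for illustration, and a real pipeline would use TF-IDF weights rather than raw counts.

```python
import numpy as np

vocab = ["graph", "neural", "database", "query"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

papers = {"p1": ["graph", "neural", "graph"], "p2": ["database", "query"]}
author_papers = {"a1": ["p1", "p2"]}

def bag_of_words(words):
    """Count-based bag-of-words vector over the keyword vocabulary."""
    vec = np.zeros(len(vocab))
    for w in words:
        if w in word_to_idx:
            vec[word_to_idx[w]] += 1
    return vec

paper_attr = {p: bag_of_words(ws) for p, ws in papers.items()}

# author attribute = average representation of the author's published papers
author_attr = {a: np.mean([paper_attr[p] for p in ps], axis=0)
               for a, ps in author_papers.items()}
```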
Details of the datasets are summarized in Table II.
Dataset  Node  Relation  Attribute  Data Split

ACM3
ACM5
IMDB
Data  Metrics  Training  MLP  GCN  GAT  RGCN  HAN  HetGNN  HGT  HGConv 

ACM3  MacroF1  20%  0.6973  0.8955  0.8852  0.8981  0.8991  0.6727  0.8965  0.9150 
40%  0.7740  0.9012  0.8993  0.9191  0.9175  0.7736  0.9188  0.9255  
60%  0.8013  0.9032  0.9053  0.9262  0.9237  0.8060  0.9264  0.9286  
80%  0.8249  0.9068  0.9063  0.9267  0.9268  0.8242  0.9329  0.9306  
100%  0.8330  0.9079  0.9058  0.9299  0.9240  0.8342  0.9343  0.9320  
MicroF1  20%  0.6943  0.8869  0.8754  0.8893  0.8906  0.6710  0.8885  0.9089  
40%  0.7710  0.8923  0.8903  0.9124  0.9103  0.7709  0.9117  0.9194  
60%  0.7966  0.8948  0.8968  0.9201  0.9172  0.8016  0.9203  0.9221  
80%  0.8205  0.8989  0.8981  0.9202  0.9205  0.8190  0.9268  0.9241  
100%  0.8277  0.9000  0.8979  0.9238  0.9176  0.8282  0.9284  0.9256  
ACM5  MacroF1  20%  0.6156  0.8221  0.8253  0.8148  0.8191  0.6022  0.8100  0.8270 
40%  0.6585  0.8317  0.8367  0.8368  0.8404  0.6476  0.8428  0.8478  
60%  0.7252  0.8440  0.8441  0.8630  0.8526  0.7133  0.8573  0.8701  
80%  0.7503  0.8448  0.8459  0.8699  0.8610  0.7445  0.8692  0.8766  
100%  0.7594  0.8492  0.8466  0.8721  0.8617  0.7565  0.8715  0.8792  
MicroF1  20%  0.6469  0.8364  0.8388  0.8333  0.8334  0.6420  0.8286  0.8428  
40%  0.6887  0.8433  0.8475  0.8501  0.8525  0.6872  0.8573  0.8616  
60%  0.7354  0.8545  0.8544  0.8722  0.8626  0.7248  0.8668  0.8794  
80%  0.7642  0.8554  0.8562  0.8809  0.8715  0.7592  0.8780  0.8855  
100%  0.7745  0.8597  0.8572  0.8841  0.8720  0.7721  0.8825  0.8889  
IMDB  MacroF1  20%  0.4506  0.5003  0.4998  0.5124  0.5118  0.4281  0.5171  0.5323 
40%  0.4870  0.5338  0.5350  0.5578  0.5645  0.4865  0.5577  0.5760  
60%  0.5188  0.5559  0.5640  0.5823  0.5912  0.5110  0.5781  0.6006  
80%  0.5268  0.5713  0.5698  0.5939  0.6092  0.5239  0.6018  0.6183  
100%  0.5563  0.5845  0.5798  0.6130  0.6212  0.5453  0.6159  0.6342  
MicroF1  20%  0.4598  0.5062  0.5072  0.5212  0.5263  0.4533  0.5210  0.5414  
40%  0.4874  0.5355  0.5378  0.5601  0.5723  0.4942  0.5605  0.5792  
60%  0.5186  0.5611  0.5669  0.5850  0.5968  0.5146  0.5792  0.6017  
80%  0.5269  0.5771  0.5757  0.5952  0.6129  0.5237  0.6020  0.6193  
100%  0.5538  0.5888  0.5837  0.6147  0.6242  0.5478  0.6163  0.6343 
Data  Metrics  MLP  GCN  GAT  RGCN  HAN  HetGNN  HGT  HGConv  %Improv. 

ACM3  ARI  0.6105  0.7179  0.7319  0.7973  0.7732  0.6077  0.7944  0.8166  2.4% 
NMI  0.5535  0.6806  0.6965  0.7536  0.7317  0.5520  0.7560  0.7752  2.5%  
ACM5  ARI  0.5969  0.7010  0.7155  0.7766  0.7347  0.5931  0.7732  0.7903  1.8% 
NMI  0.5501  0.6687  0.6789  0.7345  0.7056  0.5461  0.7319  0.7543  2.7%  
IMDB  ARI  0.2011  0.2435  0.2264  0.3069  0.2777  0.1957  0.2982  0.3164  3.1% 
NMI  0.1811  0.2099  0.2005  0.2647  0.2400  0.1723  0.2566  0.2757  4.2% 
5.2 Compared Methods
We compare our method with the following baselines:

MLP: MLP ignores the graph structure and solely focuses on the focal node attributes by leveraging the multi-layer perceptron.

GCN: GCN performs graph convolutions in the Fourier domain by leveraging the localized firstorder approximation [13].

GAT: GAT introduces the attention mechanism into GNNs and assigns different importance to the neighbors adaptively [31].

RGCN: RGCN designs specialized transformation matrices for each type of relations in the modelling of knowledge graphs [23].

HAN: HAN leverages the attention mechanism to aggregate neighbor information via multiple manually designed meta-paths [33].

HetGNN: HetGNN considers the heterogeneity of node attributes and neighbors, and then utilizes Bi-LSTMs to integrate the heterogeneous information [36].

HGT: HGT introduces type-specific transformation matrices to capture the characteristics of different nodes and relations with the Transformer architecture [11].
5.3 Experimental Setup
As some methods require meta-paths, we use , and as meta-paths for ACM3 and ACM5, and choose and as meta-paths for IMDB. Following [33], we test GCN and GAT on the homogeneous graph generated by each meta-path and report the best performance among the meta-paths (experiments show that the best meta-paths on ACM3, ACM5 and IMDB are , , and respectively). All the meta-paths are directly fed into HAN. Adam [12] is selected as the optimizer, and dropout [26] is utilized to prevent overfitting. Grid search is used to select the best hyper-parameters, including the dropout rate in and the learning rate in . The dimension of node representation is set to 64. We train all the methods for 300 epochs and use an early stopping strategy with a patience of 100, i.e., training terminates when the evaluation metrics on the validation set have not improved for 100 consecutive epochs.
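The early stopping rule described above can be sketched as follows; `validate` is a hypothetical callback that returns the validation metric for an epoch (the paper's actual training code is not shown here):

```python
def train_with_early_stopping(validate, max_epochs=300, patience=100):
    """Run at most `max_epochs` epochs; stop once the validation metric
    has not improved for `patience` consecutive epochs."""
    best_metric, best_epoch = float("-inf"), 0
    for epoch in range(max_epochs):
        metric = validate(epoch)  # hypothetical hook: train one epoch, return val metric
        if metric > best_metric:
            best_metric, best_epoch = metric, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement within the patience window
    return best_metric, best_epoch
```

In this sketch the "best" checkpoint is simply remembered by epoch; a real run would also save the model parameters at `best_epoch`.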
For HGConv, the numbers of attention heads in the micro- and macro-level convolutions are both set to 8, and the dimension of each head's attention vector is set to 8. We build HGConv with two layers, since two layers achieve satisfactory performance and stacking more layers does not improve the performance significantly. The proposed HGConv is implemented with PyTorch (https://pytorch.org/) [18] and Deep Graph Library (DGL, https://www.dgl.ai/) [32]. Experiments are conducted on an Ubuntu machine equipped with two Intel(R) Xeon(R) E5-2667 v4 CPUs @ 3.20GHz (8 physical cores each) and an NVIDIA TITAN Xp GPU with 12 GB of GDDR5X memory.
5.4 Node Classification
We conduct experiments to compare the methods on the node classification task. Following [33], we split each dataset into training, validation and testing sets with the ratio of 2:1:7. The ratios of training data are varied in . To make a comprehensive comparison, we additionally use 5-fold cross-validation and report the average classification results. Macro-F1 and Micro-F1 are adopted as the evaluation metrics. For ACM3 and ACM5, we aim to predict the research area of papers; for IMDB, the goal is to predict the class of movies. Experimental results are shown in Table III (results with variations and hyper-parameter settings of all the methods are provided in the appendix). From the results, several conclusions can be drawn.
Firstly, the performance of all the methods improves with the increase of training data, which indicates that feeding more training data helps deep learning methods learn more complicated patterns and achieve better results.
Secondly, compared with MLP, the performance of other methods is significantly improved by taking graph structure into consideration in most cases, which indicates the power of graph neural networks in considering the information of both nodes and edges.
Thirdly, methods designed for heterogeneous graphs achieve better results than methods designed for homogeneous graphs (i.e., GCN and GAT) in most cases, which demonstrates the necessity of leveraging the properties of different nodes and relations in heterogeneous graphs.
Fourthly, although HetGNN is designed for heterogeneous graph learning, it only achieves competitive or even worse results than MLP. We attribute this phenomenon to two reasons: 1) HetGNN has several hyper-parameters (e.g., the return probability and length of the random walk, the numbers of type-grouped neighbors), making the model difficult to fine-tune; 2) the random walk strategy may break the intrinsic graph structure and lead to structural information loss, especially when the graph structure contains valuable information.
Finally, HGConv outperforms all the baselines across varying training data ratios in most cases. Compared with MLP, GCN and GAT, HGConv takes both the graph topology and graph heterogeneity into consideration. Compared with RGCN and HAN, HGConv utilizes the specific characteristics of different nodes and relations without requiring domain knowledge. Compared with HetGNN, HGConv leverages the intrinsic graph structure directly, which alleviates the structural information loss introduced by random walks. Compared with HGT, HGConv learns multi-level representations through the hybrid micro/macro-level convolution, which provides HGConv with sufficient representation ability.
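For reference, the two classification metrics used above can be computed from scratch: Macro-F1 averages per-class F1 scores equally, while Micro-F1 pools counts over all classes (for single-label classification it equals accuracy). A minimal sketch:

```python
def f1_scores(y_true, y_pred):
    """Return (Macro-F1, Micro-F1) for single-label classification."""
    labels = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        # per-class F1 = 2*TP / (2*TP + FP + FN); define as 0 when undefined
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0)
    macro = sum(f1s) / len(f1s)
    # Micro-F1 pools TP/FP/FN over classes; with one label per sample this
    # reduces to plain accuracy
    micro = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return macro, micro
```

In practice `sklearn.metrics.f1_score` with `average="macro"` / `average="micro"` computes the same quantities.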
5.5 Node Clustering
The node clustering task is conducted to evaluate the learned node representations. We first obtain the node representations via a feed-forward pass of the trained model and then feed the normalized representations into the k-means algorithm. We set the number of clusters to the number of real classes for each dataset (i.e., 3, 5 and 3 for ACM3, ACM5 and IMDB respectively). ARI and NMI are adopted as the evaluation metrics. Since the result of k-means tends to be affected by the initial centroids, we run k-means 10 times and report the average results in Table IV.
Experimental results on the node clustering task show that HGConv outperforms all the baselines, which demonstrates the effectiveness of the learned node representations. Moreover, methods based on GNNs usually obtain better results. We also observe that methods achieving satisfactory results on node classification (e.g., RGCN, HAN and HGT) also perform well on node clustering, which indicates that a good model learns more universal node embeddings applicable to various tasks.
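The clustering protocol above can be sketched with scikit-learn; `embeddings` and `labels` are placeholders for the model output and the ground-truth classes:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def evaluate_clustering(embeddings, labels, n_runs=10, seed=0):
    """L2-normalize embeddings, run k-means (k = number of true classes)
    several times, and return the average (NMI, ARI)."""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    k = len(set(labels))
    nmis, aris = [], []
    for run in range(n_runs):  # average over runs: k-means is init-sensitive
        pred = KMeans(n_clusters=k, n_init=10,
                      random_state=seed + run).fit_predict(z)
        nmis.append(normalized_mutual_info_score(labels, pred))
        aris.append(adjusted_rand_score(labels, pred))
    return float(np.mean(nmis)), float(np.mean(aris))
```

Averaging over multiple seeded runs mirrors the 10-run protocol used for Table IV.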
5.6 Node Visualization
To make a more intuitive comparison, we also visualize the nodes of the heterogeneous graph in a low-dimensional space. In particular, we project the node representations learned by HGConv into a 2-dimensional space using t-SNE [15]. The visualization of node representations on ACM5 is shown in Figure 3 (please refer to the appendix for results on ACM3 and IMDB), where the color of a node denotes its published area.
From Figure 3, we observe that the baselines do not achieve satisfactory performance: they either fail to gather papers of the same area together, or do not provide clear boundaries between papers of different areas. HGConv performs best in the visualization, as papers within the same area are closer and the boundaries between different areas are more obvious.
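The projection step can be reproduced with scikit-learn's t-SNE; `embeddings` stands in for the learned node representations (a sketch of the visualization step, not the paper's exact script):

```python
import numpy as np
from sklearn.manifold import TSNE

def project_2d(embeddings, seed=0):
    """Project (n, d) node embeddings to (n, 2) for scatter plotting."""
    tsne = TSNE(n_components=2, random_state=seed, init="pca",
                # perplexity must be smaller than the number of samples
                perplexity=min(30, len(embeddings) - 1))
    return tsne.fit_transform(np.asarray(embeddings, dtype=np.float32))
```

The resulting 2-D coordinates can then be scattered with one color per class, as in Figure 3.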
5.7 Ablation Study
We conduct an ablation study to validate the effect of each component in HGConv. We remove the micro-level convolution, the macro-level convolution and the weighted residual connection from HGConv respectively, and denote the three variants as HGConv w/o Micro, HGConv w/o Macro and HGConv w/o WRC. Detailed implementations of the three variants are as follows:

HGConv w/o Micro. This variant replaces the micro-level convolution with simple average pooling over nodes within the same relation.

HGConv w/o Macro. This variant replaces the macro-level convolution with simple average pooling across different relations.

HGConv w/o WRC. This variant removes the weighted residual connection in each layer and only uses the aggregated neighbor information as the output of each layer.
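The two pooling-based variants above can be sketched as follows; the tensor shapes and function names are illustrative stand-ins (assuming PyTorch), not the released implementation:

```python
import torch

def micro_mean_pooling(neighbor_feats: torch.Tensor) -> torch.Tensor:
    """w/o Micro: neighbor_feats is (num_neighbors, dim), the features of one
    relation's neighbors for a focal node; uniform averaging replaces the
    attention-weighted micro-level aggregation."""
    return neighbor_feats.mean(dim=0)

def macro_mean_pooling(relation_feats: torch.Tensor) -> torch.Tensor:
    """w/o Macro: relation_feats is (num_relations, dim), the per-relation
    summaries; uniform averaging replaces the macro-level attention."""
    return relation_feats.mean(dim=0)
```

Both variants discard the learned importance weights, which is exactly what the ablation isolates.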
Experimental results of the variants and HGConv on the node classification task are shown in Figure 4.
From Figure 4, we observe that HGConv achieves the best performance when equipped with all the components, and removing any component leads to worse results. The effects of the three components vary across datasets, but all of them contribute to the final performance. In particular, the micro-level convolution enables HGConv to select more important nodes within the same relation, and the macro-level convolution helps HGConv distinguish the subtle differences across relations. The weighted residual connection allows HGConv to balance the contributions of the focal node's inherent attributes and its neighbor information.
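One plausible form of the weighted residual connection is a learnable gate between the focal node's transformed attributes and the aggregated neighbor information; the parameterization below is illustrative and may differ from the paper's exact design:

```python
import torch

def weighted_residual(h_self: torch.Tensor, h_neigh: torch.Tensor,
                      beta: torch.Tensor) -> torch.Tensor:
    """h_self, h_neigh: (num_nodes, dim); beta: a learnable scalar or
    per-node weight. A sigmoid keeps the combination weight in (0, 1)."""
    gate = torch.sigmoid(beta)
    return gate * h_self + (1.0 - gate) * h_neigh
```

The w/o WRC ablation corresponds to fixing the gate at 0, i.e., using only the aggregated neighbor information as each layer's output.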
5.8 Parameter Sensitivity Analysis
We also investigate the sensitivity of HGConv to several hyper-parameters. We report the results of the node classification task under different parameter settings on IMDB; experimental results are shown in Figure 5.
Number of convolution layers. We build HGConv with different numbers of heterogeneous graph convolution layers and report the results in Figure 5. It can be observed that as the number of layers increases, the performance of HGConv rises at first and then starts to drop gradually. This indicates that stacking a suitable number of layers helps the model receive information from further neighbors, but too many layers lead to overfitting.
Number of attention heads. We validate the effect of the multi-head attention mechanism in the hybrid convolution by changing the number of attention heads. The results are shown in Figure 5. From the results, we conclude that increasing the number of attention heads improves the model performance at first. When the number of attention heads is sufficient (e.g., at least 4), the performance reaches its peak and remains stable.
Dimension of node representation. We also vary the dimension of node representation and report the results in Figure 5. We find that the performance of HGConv grows with the node representation dimension and peaks when the dimension is set between 64 and 256 (we select 64 as the final setting). The performance decreases as the dimension grows further, due to overfitting.
5.9 Interpretability of the Hybrid Convolution
The principal components of HGConv are the micro-level and macro-level convolutions. Thus, we provide a detailed interpretation to better understand the importance of nodes within the same relation and the differences across relations learned by the hybrid convolution. We first randomly select a sample from ACM3 and then calculate the normalized attention scores from the last heterogeneous graph convolution layer. The selected paper proposes an effective ranking-based clustering algorithm for heterogeneous information networks and is published in the Data Mining area. The visualization is shown in Figure 6.
Interpretation of the micro-level convolution. It can be observed that in the relation, both Jiawei Han and Yizhou Sun have higher attention scores than Yintao Yu among the authors, since the first two authors contribute more to the academic research. In the relation, keywords that are more relevant to (i.e., clustering and ranking) have higher attention scores. Moreover, the scores of references that study topics more relevant to are also higher in the relation. These observations indicate that the micro-level convolution selects more important nodes within the same relation by assigning them higher attention scores.
Interpretation of the macro-level convolution. The attention score of the relation is much higher than that of the or relation, in line with the fact that GCN and GAT achieved their best performance on the meta-path. This finding demonstrates that the macro-level convolution distinguishes the importance of different relations automatically without empirical manual design, and the learned importance can implicitly construct more important meta-paths for specific downstream tasks.
6 Conclusion
In this paper, we designed a hybrid micro/macro-level convolution operation to address several fundamental problems in heterogeneous graph representation learning. In particular, the micro-level convolution learns the importance of nodes within the same relation, and the macro-level convolution distinguishes the subtle differences across relations. This hybrid strategy enables our model to fully leverage heterogeneous information with proper interpretability by performing convolutions directly on the intrinsic structure of heterogeneous graphs. We also designed a weighted residual connection component to obtain the optimal combination of the focal node's inherent attributes and its neighbor information. Experimental results demonstrated both the superiority of the proposed method and its intuitive interpretability for graph analysis.
Acknowledgments
This work is supported by the National Key R&D Program of China [grant number 2018YFB2101003], the Science and Technology Major Project of Beijing [grant number Z191100002519012], and the National Natural Science Foundation of China [grant numbers 51778033, 51822802, 51991395, 71901011, U1811463].
References
 [1] (2001) Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in Neural Information Processing Systems 14 [Neural Information Processing Systems: Natural and Synthetic, NIPS 2001, December 3-8, 2001, Vancouver, British Columbia, Canada], pp. 585–591. Cited by: §2.
 [2] (2014) Spectral networks and locally connected networks on graphs. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, Cited by: §2.

 [3] (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR abs/1412.3555. Cited by: §4.4.
 [4] (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pp. 3837–3845. Cited by: §2.
 [5] (2012) Link prediction and recommendation across heterogeneous social networks. In 12th IEEE International Conference on Data Mining, ICDM 2012, Brussels, Belgium, December 10-13, 2012, pp. 181–190. Cited by: §1.
 [6] (2020) A fair comparison of graph neural networks for graph classification. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, Cited by: §2.
 [7] (2020) MAGNN: meta-path aggregated graph neural network for heterogeneous graph embedding. In WWW ’20: The Web Conference 2020, Taipei, Taiwan, April 20-24, 2020, pp. 2331–2341. Cited by: §2, Definition 2.
 [8] (2016) Node2vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 855–864. Cited by: §2.
 [9] (2017) Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1024–1034. Cited by: §1, §2, §4.6.

 [10] (2016) Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 770–778. Cited by: §4.4.
 [11] (2020) Heterogeneous graph transformer. In WWW ’20: The Web Conference 2020, Taipei, Taiwan, April 20-24, 2020, pp. 2704–2710. Cited by: §1, §2, §4.6, 7th item.
 [12] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §5.3.
 [13] (2017) Semi-supervised classification with graph convolutional networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, Cited by: §1, §2, §4.6, 2nd item.
 [14] (2020) Type-aware anchor link prediction across heterogeneous networks based on graph attention network. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pp. 147–155. Cited by: §1.

 [15] (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9 (Nov), pp. 2579–2605. Cited by: §5.6.
 [16] (2013) Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119. Cited by: §2.
 [17] (2013) Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, pp. 3111–3119. Cited by: §4.5.
 [18] (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, 8-14 December 2019, Vancouver, BC, Canada, pp. 8024–8035. Cited by: §5.3.
 [19] (2014) Deepwalk: online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 701–710. Cited by: §2.
 [20] (2003) Using tfidf to determine word relevance in document queries. In Proceedings of the first instructional conference on machine learning, Vol. 242, pp. 133–142. Cited by: §5.1.
 [21] (2000) Nonlinear dimensionality reduction by locally linear embedding. science 290 (5500), pp. 2323–2326. Cited by: §2.
 [22] (2018) Representation learning for classification in heterogeneous graphs with application to social networks. ACM Trans. Knowl. Discov. Data 12 (5), pp. 62:1–62:33. Cited by: §1.
 [23] (2018) Modeling relational data with graph convolutional networks. In The Semantic Web - 15th International Conference, ESWC 2018, Heraklion, Crete, Greece, June 3-7, 2018, Proceedings, pp. 593–607. Cited by: §1, §2, §4.6, 4th item.
 [24] (2019) Heterogeneous information network embedding for recommendation. IEEE Trans. Knowl. Data Eng. 31 (2), pp. 357–370. Cited by: §1.
 [25] (2017) A survey of heterogeneous information network analysis. IEEE Trans. Knowl. Data Eng. 29 (1), pp. 17–37. Cited by: §1.
 [26] (2014) Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15 (1), pp. 1929–1958. Cited by: §5.3.
 [27] (2012) Relation strength-aware clustering of heterogeneous information networks with incomplete attributes. Proc. VLDB Endow. 5 (5), pp. 394–405. Cited by: §1.
 [28] (2012) Mining heterogeneous information networks: a structural analysis approach. SIGKDD Explorations 14 (2), pp. 20–28. Cited by: §1.
 [29] (2008) ArnetMiner: extraction and mining of academic social networks. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, Nevada, USA, August 24-27, 2008, pp. 990–998. Cited by: 1st item.
 [30] (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §1, §2.
 [31] (2018) Graph attention networks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, Cited by: §1, §2, §2, 3rd item.
 [32] (2019) Deep Graph Library: a graph-centric, highly-performant package for graph neural networks. arXiv preprint arXiv:1909.01315. Cited by: §5.3.
 [33] (2019) Heterogeneous graph attention network. In The World Wide Web Conference, WWW 2019, San Francisco, CA, USA, May 13-17, 2019, pp. 2022–2032. Cited by: §1, §2, §4.1, §4.2, §4.6, 1st item, 5th item, §5.3, §5.4, Definition 2.
 [34] (2019) A comprehensive survey on graph neural networks. CoRR abs/1901.00596. Cited by: §2.
 [35] (2014) Personalized entity recommendation: a heterogeneous information network approach. In Seventh ACM International Conference on Web Search and Data Mining, WSDM 2014, New York, NY, USA, February 24-28, 2014, pp. 283–292. Cited by: §1.
 [36] (2019) Heterogeneous graph neural network. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2019, Anchorage, AK, USA, August 4-8, 2019, pp. 793–803. Cited by: §1, §2, §4.6, 6th item.
 [37] (2018) Link prediction based on graph neural networks. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pp. 5171–5181. Cited by: §2.
 [38] (2018) Deep collective classification in heterogeneous information networks. In Proceedings of the 2018 World Wide Web Conference on World Wide Web, WWW 2018, Lyon, France, April 23-27, 2018, pp. 399–408. Cited by: §1.
 [39] (2018) Graph neural networks: A review of methods and applications. CoRR abs/1812.08434. Cited by: §2.