A heterogeneous graph consists of multiple types of nodes and edges, involving abundant heterogeneous information. In practice, heterogeneous graphs are pervasive in real-world scenarios, such as academic networks, e-commerce and social networks. Learning meaningful representations of nodes in heterogeneous graphs is essential for various tasks, including node classification [22, 38], node clustering, link prediction [5, 14] and personalized recommendation [35, 24].
In recent years, Graph Neural Networks (GNNs) have been widely used in representation learning on graphs and have achieved superior performance. Generally, GNNs perform convolutions in two domains, namely the spectral domain and the spatial domain. As a spectral-based method, GCN utilizes the localized first-order approximation on neighbors and then performs convolutions in the Fourier domain for an entire graph. Spatial-based methods, including GraphSAGE and GAT, directly perform information propagation in the graph domain through specially designed aggregation functions or the attention mechanism. However, all of the above methods were designed for homogeneous graphs with a single node type and a single edge type, and they cannot handle the rich information in heterogeneous graphs. Simply adapting them to heterogeneous graphs would lead to information loss, since they ignore the heterogeneous properties of the graph.
Beyond the approaches for homogeneous graphs, there have also been several attempts to design graph convolution methods for heterogeneous graphs. RGCN was proposed to deal with multiple relations in knowledge graphs. HAN was designed to learn on heterogeneous graphs based on meta-paths and the attention mechanism. HetGNN was presented to consider the heterogeneity of node attributes and neighbors through dedicated aggregation functions. HGT, a variant of Transformer, was proposed to focus on the meta relations in heterogeneous graphs.
However, the aforementioned methods still face the following limitations. 1) Heterogeneous information loss: several methods utilize the properties of nodes or relations only partially, rather than the comprehensive information of nodes and relations (e.g., RGCN and HAN). In detail, RGCN ignores the distinct attributes of nodes with various types. HAN relies on multiple hand-designed symmetric meta-paths to convert the heterogeneous graph into multiple homogeneous graphs, which leads to the loss of information about different nodes and edges. 2) Structural information loss: some methods deal with the graph topology through heuristic strategies, such as the random walk in HetGNN, which may break the intrinsic graph structure and lose valuable structural information. 3) Empirical manual design: the performance of some methods relies heavily on prior experience because they require specific domain knowledge, such as the pre-defined meta-paths in HAN. 4) Insufficient representation ability: some methods cannot provide multi-level representation due to their flat model architecture. For example, HGT learns the interaction of nodes and relations in a single aggregation process, which makes it hard to distinguish their importance in such a flat architecture.
To cope with the above issues, we propose HGConv, a novel Heterogeneous Graph Convolution approach, to learn node representations on heterogeneous graphs with a hybrid micro/macro level convolutional operation. Specifically, for a focal node: in micro-level convolution, the transformation matrices and attention vectors are both specific to node types, aiming to learn the importance of nodes within the same relation; in macro-level convolution, transformation matrices specific to relation types and the weight-sharing attention vector are employed to distinguish the subtle difference across different relations. Due to the hybrid micro/macro level convolution, HGConv could fully utilize the heterogeneous information of nodes and relations with proper interpretability. Moreover, a weighted residual connection component is designed to obtain the optimal fusion of the focal node’s inherent attributes and neighbor information. Based on the aforementioned components, our approach could be optimized in an end-to-end manner. A comparison of several existing methods with our model is shown in Table I.
To sum up, the contributions of our work are as follows:
A novel heterogeneous graph convolution approach is proposed to directly perform convolutions on the intrinsic heterogeneous graph structure with a hybrid micro/macro level convolutional operation, where the micro convolution encodes the attributes of different types of nodes and the macro convolution computes on different relations respectively.
A residual connection component with weighted combination is designed to aggregate focal node’s inherent attributes and neighbor information adaptively, which could provide comprehensive node representation.
A systematic analysis on existing heterogeneous graph learning methods is given, and we point out that each existing method could be treated as a special case of the proposed HGConv under certain circumstances.
The rest of this paper is organized as follows: Section 2 reviews previous work related to the studied problem. Section 3 introduces the studied problem. Section 4 presents the framework and each component of the proposed model. Section 5 evaluates the proposed model by experiments. Section 6 concludes the entire paper.
2 Related work
This section reviews existing literature related to our work and also points out their differences with our work.
Graph Mining. Over the past decades, a great amount of research has been devoted to graph mining. Classical methods based on manifold learning, including Locally Linear Embedding (LLE) and Laplacian Eigenmaps (LE), mainly focus on the reconstruction of graphs. Inspired by the language model Skip-gram, more advanced methods were proposed to learn representations of nodes, such as DeepWalk and Node2Vec. These methods adopt a random walk strategy to generate sequences of nodes and use Skip-gram to maximize the node co-occurrence probability in the same sequence.
However, all of the above methods only focused on the study of graph topology structure and could not take the node attributes into consideration, resulting in inferior performance. These methods are surpassed by recently proposed GNNs, which could consider both node attributes and graph structure simultaneously.
Graph Neural Networks. Recent years have witnessed the success of GNNs in various tasks, such as node classification [13, 9], link prediction and graph classification. GNNs consider both graph structure and node attributes by first propagating information among each node and its neighbors, and then providing node representation based on the received information. Generally, GNNs could be divided into spectral-based methods and spatial-based methods. As a spectral-based method, Spectral CNN performs convolution in the Fourier domain by computing the eigendecomposition of the graph Laplacian matrix. ChebNet leverages the K-order Chebyshev polynomials to eliminate the need to calculate the Laplacian matrix eigenvectors. GCN introduces a localized first-order approximation of ChebNet to alleviate the overfitting problem. Representative spatial-based methods include GraphSAGE and GAT. GraphSAGE propagates information in the graph domain directly and uses different functions to aggregate the received information. GAT introduces the attention mechanism into GNNs, which enables it to select more important neighbors adaptively. We refer the interested readers to [39, 34] for more comprehensive reviews on GNNs.
However, all the above methods were designed for homogeneous graphs, and could not handle the rich information in heterogeneous graphs. In this work, we aim to propose an approach to learn on heterogeneous graphs.
Heterogeneous Graph Neural Networks. Heterogeneous graphs contain abundant information of various types of nodes and relations. Mining useful information in heterogeneous graphs is essential in practical scenarios. Recently, several graph convolution methods have been proposed for learning on heterogeneous graphs. RGCN learns on knowledge graphs by employing specialized transformation matrices for each type of relation. HAN extends the attention mechanism in GAT to learn the importance of neighbors and of multiple hand-designed meta-paths. MAGNN considers the intermediate nodes in meta-paths, which are ignored in HAN, and aggregates the intra-meta-path and inter-meta-path information. HetGNN first samples neighbors based on a random walk strategy and then uses specialized Bi-LSTMs to integrate the heterogeneous node attributes and neighbors. HGT introduces type-specific transformation matrices and learns the importance of different nodes and relations based on the Transformer architecture.
Nevertheless, there are still some limitations in the above methods, including the insufficient utilization of heterogeneous properties, structural information loss, and lack of interpretability. In this paper, we aim to cope with the issues in existing approaches and design a method to learn comprehensive node representation on heterogeneous graphs by leveraging both node attributes and relation information.
3 Problem Formalization
This section introduces related concepts and the studied problem in this paper.
Heterogeneous Graph: A heterogeneous graph is defined as a directed graph , where and denote the set of nodes and edges respectively. Each node and each edge are associated with their type mapping functions and , with the constraint of .
Relation: A relation represents the interaction schema of the source node, the target node and the connected edge. Formally, for an edge with source node and target node , the corresponding relation is denoted as . The inverse of is naturally represented by , and we consider the inverse relation so that information can propagate between two nodes in both directions. Thus, the set of edges is extended as and the set of relations is extended as . Note that the meta-paths used in heterogeneous graph learning approaches [33, 7] are defined as sequences of relations.
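The extension of the edge set with inverse relations can be illustrated with a small sketch. The function name `add_inverse_edges`, the `"^-1"` suffix used to name an inverse edge type, and the toy author/paper graph are all hypothetical choices made for illustration; they are not taken from the paper.

```python
from collections import defaultdict


def add_inverse_edges(edges, node_type):
    """Group typed edges by relation and add the inverse of every edge.

    edges: list of (source, edge_type, target) triples.
    node_type: dict mapping each node to its type.
    A relation is the triple (source node type, edge type, target node type).
    """
    by_relation = defaultdict(list)
    for src, etype, dst in edges:
        relation = (node_type[src], etype, node_type[dst])
        # Hypothetical naming convention for the inverse edge type.
        inverse = (node_type[dst], etype + "^-1", node_type[src])
        by_relation[relation].append((src, dst))
        by_relation[inverse].append((dst, src))
    return dict(by_relation)


# Toy example: one author who wrote two papers.
node_type = {"a1": "author", "p1": "paper", "p2": "paper"}
edges = [("a1", "writes", "p1"), ("a1", "writes", "p2")]
relations = add_inverse_edges(edges, node_type)
```

With the inverse relation in place, each paper in the toy graph also has the author as a neighbor, so information can flow in both directions.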
Heterogeneous Graph Representation Learning: Given a heterogeneous graph , where nodes with type are associated with the attribute matrix , the task of heterogeneous graph representation learning is to obtain the -dimensional representation for , where . The learned representations are able to capture both node attributes and relation information, which could be applied in various tasks, such as node classification, node clustering and node visualization.
This section presents the framework of our proposed method and then introduces each of its components step by step.
4.1 Framework of the Proposed Model
The framework of the proposed model is shown in Figure 1, which takes the node attribute matrices for in a heterogeneous graph as the input and provides the low-dimensional node representation for as the output, which could be applied in various tasks.
The proposed model is made up of multiple heterogeneous graph convolutional layers, where each layer consists of the hybrid micro/macro level convolution and the weighted residual connection component. Different from meta-path-based methods such as HAN, which perform convolution on homogeneous graphs converted through meta-paths, the proposed hybrid convolution operates directly on the heterogeneous graph structure. In particular, the micro-level convolution aims to learn the importance of nodes within the same relation, and the macro-level convolution is designed to discriminate the difference across different relations. The weighted residual connection component is employed to consider the different contributions of the focal node’s inherent attributes and its neighbor information. By stacking multiple heterogeneous graph convolutional layers, the proposed model could consider the impacts of the focal node’s directly connected and multi-hop reachable neighbors.
4.2 Micro-Level Convolution
As pointed out in prior work, the importance of nodes connected with the focal node within the same relation would be different. Hence, we first design a micro-level convolution to learn the importance of nodes within the same relation. We suppose that the attributes of nodes with different types might be distributed in different latent spaces. Therefore, we utilize the transformation matrices and attention vectors, which are specific to node types, to capture the characteristics of different types of nodes in the micro-level convolution.
Formally, we denote the focal node as the target node with type and its connected node as the source node with type . For a focal node , let denote the set of node ’s neighbors within -type relation, where for each , and .
We first apply transformation matrices, which are specific to node types, to project nodes into their own latent spaces as follows,
where denotes the trainable transformation matrix for node with type at layer . and denote the original and transformed representation of node at layer . Then we calculate the normalized importance of neighbor as follows,
where is the trainable attention vector for -type source node at layer and denotes the concatenation operation. denotes the transpose operation. is the normalized importance of source node to focal node under relation at layer . Then the representation of relation about focal node is calculated by,
where2 (). Embeddings of nodes within the same relation are aggregated through the attention vectors which are specific to node types. Since the attention weight is computed for each relation, it could well capture the relation information.
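The single-head micro-level convolution described above can be sketched in plain Python. This is a minimal illustration under stated assumptions rather than the authors' implementation: the LeakyReLU scoring, the softmax normalization, the omission of the output activation, and all names (`micro_conv`, `W_dst`, `W_src`, `a_src`) are assumptions introduced here.

```python
import math


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def micro_conv(h_v, neighbors, W_dst, W_src, a_src, slope=0.2):
    """Aggregate the neighbors of one focal node within a single relation.

    W_dst / W_src: transformation matrices for the target / source node type.
    a_src: attention vector specific to the source node type (assumed to
    score the concatenation [z_v || z_u]).
    """
    z_v = matvec(W_dst, h_v)
    z_us = [matvec(W_src, h_u) for h_u in neighbors]
    raw = [sum(a * x for a, x in zip(a_src, z_v + z_u)) for z_u in z_us]
    scores = [s if s > 0 else slope * s for s in raw]  # assumed LeakyReLU
    alpha = softmax(scores)  # normalized importance of each neighbor
    rel_repr = [sum(al * z_u[i] for al, z_u in zip(alpha, z_us))
                for i in range(len(z_us[0]))]
    return alpha, rel_repr


identity = [[1.0, 0.0], [0.0, 1.0]]
alpha, rel_repr = micro_conv(
    h_v=[1.0, 0.0],
    neighbors=[[1.0, 0.0], [0.0, 1.0]],
    W_dst=identity,
    W_src=identity,
    a_src=[0.0, 0.0, 1.0, 0.0],
)
```

In this toy call the attention vector only looks at the first coordinate of the transformed source node, so the first neighbor receives the larger weight.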
In order to enhance the model capacity and make the training process more stable, we employ independent heads and then concatenate representations as follows,
where denotes the importance of source node to focal node under relation of head at layer , and stands for source node ’s transformed representation of head at layer .
4.3 Macro-Level Convolution
Besides considering the importance of nodes within the same relation, a focal node would also interact with multiple relations, which indicates the necessity of learning the subtle difference across different relations. Therefore, we design a macro-level convolution with the transformation matrices specific to relation types and weight-sharing attention vector to distinguish the difference of relations.
Specifically, we first transform the focal node and its connecting relations into their distinct distributed spaces by,
where and denote the transformation matrices for -type focal node and -type relation at layer respectively. Then the normalized importance of relation to focal node is calculated by,
where denotes the set of relations connected to focal node . is the trainable attention vector which is shared by different relations at layer . is the normalized importance of relation to focal node at layer . After obtaining the importance of different relations, we aggregate the relations as follows,
where is the fused representation of relations connected to focal node at layer . An illustration of the macro-level convolution is shown in Figure 2. Representations of different relations are aggregated into a compact vector through the attention mechanism. Through the macro-level convolution, the different importance of relations could be calculated automatically.
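A single-head version of the macro-level convolution can be sketched in the same style. As before, this is an illustrative sketch under assumptions: the LeakyReLU and softmax choices, the function name `macro_conv`, and the relation names `"cites"` and `"written_by"` are hypothetical and not taken from the paper.

```python
import math


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def macro_conv(h_v, rel_reprs, W_node, W_rel, a_shared, slope=0.2):
    """Fuse the relation representations of one focal node.

    rel_reprs: dict relation name -> relation representation.
    W_node: transformation matrix for the focal node's type.
    W_rel: dict relation name -> relation-specific transformation matrix.
    a_shared: one attention vector shared by all relations.
    """
    s_v = matvec(W_node, h_v)
    names = sorted(rel_reprs)
    s_r = [matvec(W_rel[r], rel_reprs[r]) for r in names]
    raw = [sum(a * x for a, x in zip(a_shared, s_v + s)) for s in s_r]
    scores = [x if x > 0 else slope * x for x in raw]  # assumed LeakyReLU
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    beta = [e / total for e in exps]  # normalized importance per relation
    fused = [sum(b * s[i] for b, s in zip(beta, s_r))
             for i in range(len(s_r[0]))]
    return dict(zip(names, beta)), fused


identity = [[1.0, 0.0], [0.0, 1.0]]
rel_reprs = {"cites": [2.0, 0.0], "written_by": [0.0, 2.0]}
W_rel = {"cites": identity, "written_by": identity}

# A zero attention vector weights all relations equally.
beta_uniform, fused_uniform = macro_conv(
    [1.0, 1.0], rel_reprs, identity, W_rel, [0.0, 0.0, 0.0, 0.0])
# A non-zero attention vector prefers one relation over the other.
beta_skewed, fused_skewed = macro_conv(
    [1.0, 1.0], rel_reprs, identity, W_rel, [0.0, 0.0, 1.0, 0.0])
```

Only the transformation matrices depend on the relation type here; the attention vector is the same for every relation, which is the parameter-sharing choice the text describes.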
We also extend Equation (11) to multi-head attention by,
where is the importance of relation to focal node of head at layer , and denotes the fusion of relations connected to focal node of head at layer .
It is worth noting that the attention vectors in micro-level convolution are specific to node types, while in macro-level convolution the attention vector is shared by different relations and is therefore unaware of relation types. This design is based on the following reasons. 1) When performing micro-level convolution, nodes are associated with distinct attributes even when they are within the same relation. An attention vector that is unaware of node types can hardly handle the different attributes and types of nodes, because its representation ability is insufficient. Hence, attention vectors specific to node types are designed in micro-level convolution. 2) In macro-level convolution, each relation connected to the focal node is associated with a single representation and we only need to consider the difference of relation types. Therefore, a weight-sharing attention vector across different relations is designed. Following the above design, we could not only maintain the distinct characteristics of nodes and relations, but also reduce the number of model parameters.
4.4 Weighted Residual Connection
In addition to aggregating neighbor information by the hybrid micro/macro level convolution, the attributes of focal node are also supposed to be important, since they reflect the inherent characteristic directly. However, simply adding focal node’s inherent attributes and neighbor information together could not distinguish their different importance.
Thus, we adapt the residual connection  with trainable weight parameter to combine the focal node’s inherent attributes and neighbor information by,
where is the weight to control the importance of focal node ’s inherent attributes and its neighbor information at layer . is utilized to align the dimension of focal node ’s attributes and its neighbor information at layer .
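The weighted combination can be sketched as follows. The convex form (weight and one minus weight), the sigmoid used to keep the weight in (0, 1), and the names `weighted_residual`, `W_align` and `raw_weight` are assumptions made for this illustration; the text only states that a trainable weight balances the two terms and that a matrix aligns their dimensions.

```python
import math


def weighted_residual(z_neigh, h_prev, W_align, raw_weight):
    """Combine neighbor information with the focal node's own attributes.

    z_neigh: fused neighbor representation from the hybrid convolution.
    h_prev: the focal node's representation from the previous layer.
    W_align: matrix projecting h_prev to the dimension of z_neigh.
    raw_weight: unconstrained trainable scalar (assumed parameterization).
    """
    lam = 1.0 / (1.0 + math.exp(-raw_weight))  # squash into (0, 1)
    aligned = [sum(w * x for w, x in zip(row, h_prev)) for row in W_align]
    return [lam * z + (1.0 - lam) * a for z, a in zip(z_neigh, aligned)]


# raw_weight = 0 gives an equal split between the two terms.
out = weighted_residual(
    z_neigh=[2.0, 4.0],
    h_prev=[1.0, 1.0, 1.0],
    W_align=[[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]],
    raw_weight=0.0,
)
```

A large positive `raw_weight` makes the output follow the neighbor information, while a large negative value keeps it close to the node's own projected attributes.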
4.5 The Learning Process
We stack heterogeneous graph convolutional layers to build HGConv. For the first layer, we set to node ’s corresponding row in attribute matrix as the input. The final node representation is set to the output of the last layer for .
HGConv could be trained in an end-to-end manner with the following strategies: 1) semi-supervised learning strategy: for tasks where the labels are available, we could optimize the model by minimizing the cross entropy loss by,
where is the set of nodes with labels. and denote the ground truth and predicted probability of node at the -th dimension. In practice, could be obtained from a classifier (e.g., SVM, single-layer neural network) which takes the node’s representation as the input and outputs . 2) unsupervised learning strategy: for tasks without any labels, we could optimize the model by minimizing the objective function in Skip-gram with negative sampling,
where is the sigmoid activation function, and denote the set of positive observed node pairs and negative sampled node pairs. 3) joint learning strategy: we could also combine the semi-supervised and unsupervised learning strategy together to jointly optimize the model.
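The two objectives above can be sketched in plain Python. The unnormalized sums, the small `eps` guard inside the logarithm, and the names `cross_entropy` and `skipgram_ns_loss` are assumptions for illustration; the node names and embeddings are toy values.

```python
import math


def cross_entropy(y_true, y_prob, eps=1e-12):
    """Sum of -y * log(p) over labeled nodes and classes."""
    return -sum(t * math.log(p + eps)
                for row_t, row_p in zip(y_true, y_prob)
                for t, p in zip(row_t, row_p))


def log_sigmoid(x):
    """Numerically stable log of the sigmoid function."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def skipgram_ns_loss(emb, pos_pairs, neg_pairs):
    """Skip-gram objective with negative sampling over node pairs."""
    def dot(u, v):
        return sum(a * b for a, b in zip(emb[u], emb[v]))

    pos = sum(log_sigmoid(dot(u, v)) for u, v in pos_pairs)
    neg = sum(log_sigmoid(-dot(u, v)) for u, v in neg_pairs)
    return -(pos + neg)


# One labeled node with a uniform prediction over two classes.
ce = cross_entropy([[1.0, 0.0]], [[0.5, 0.5]])

# Toy embeddings: "a" and "b" are similar, "a" and "c" are opposite.
emb = {"a": [1.0, 0.0], "b": [1.0, 0.0], "c": [-1.0, 0.0]}
good = skipgram_ns_loss(emb, pos_pairs=[("a", "b")], neg_pairs=[("a", "c")])
bad = skipgram_ns_loss(emb, pos_pairs=[("a", "c")], neg_pairs=[("a", "b")])
```

A joint strategy would add the two losses, possibly with a weighting coefficient, which is a further assumption not specified in the text.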
4.6 Systematic Analysis of Existing Models
Here we give a systematic analysis of existing heterogeneous graph learning models and point out that each existing method could be treated as a special case of the proposed HGConv under certain circumstances.
Overview of Homogeneous GNNs. Let us start with the introduction of homogeneous GNNs at first. Generally, the operations at the -th layer of a homogeneous GNN follow a two-step strategy:
where denotes the representation of node at the -th layer. is initialized with node ’s original attribute and denotes the set of node ’s neighbors. stands for the aggregation of node ’s neighbors. is the combination of node ’s inherent attribute and its neighbor information at layer .
Different architectures for AGGREGATE and COMBINE have been proposed in recent years. For example, GCN  utilizes the normalized adjacency matrix for AGGREGATE and uses the residual connection for COMBINE. GraphSAGE  designs various pooling operations for AGGREGATE and leverages the concatenation for COMBINE.
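The two-step AGGREGATE/COMBINE scheme can be written as a generic layer that takes the two operations as arguments. The mean aggregation and element-wise sum combination below are just one illustrative pair, not the choice of any specific model, and the sketch assumes every node has at least one neighbor.

```python
def mean_aggregate(vectors):
    """Average the neighbor representations element-wise."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sum_combine(h_self, h_neigh):
    """Add the node's own representation to the aggregated neighbors."""
    return [a + b for a, b in zip(h_self, h_neigh)]


def gnn_layer(h, neighbors, aggregate, combine):
    """One layer: AGGREGATE over neighbors, then COMBINE with the node itself.

    h: dict node -> representation at the previous layer.
    neighbors: dict node -> list of neighbor nodes (assumed non-empty).
    """
    return {v: combine(h[v], aggregate([h[u] for u in neighbors[v]]))
            for v in h}


h0 = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
neighbors = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
h1 = gnn_layer(h0, neighbors, mean_aggregate, sum_combine)
```

Swapping in a different `aggregate` or `combine` function changes the architecture without touching the layer itself, which mirrors how the analysis compares models.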
Overview of Heterogeneous GNNs. The operations in heterogeneous GNNs are based on the operations in homogeneous GNNs, with additional consideration of node attributes and relation information. Formally, the operations at the -th layer could be summarized as follows:
where denotes the set of node ’s neighbors within -type relation and is defined as the set of relations connected to node .
Compared with homogeneous GNNs, heterogeneous GNNs first design specialized transformation matrices for different types of nodes for TRANSFORM. Then the operations in AGGREGATE are divided into aggregation within the same relation and aggregation across different relations. Finally, the operation in COMBINE is defined as the same as Equation (17) in homogeneous GNNs.
Analysis of the Proposed HGConv. The proposed HGConv makes delicate design for each operation in the aforementioned heterogeneous GNNs. Specifically, Equation (18) - Equation (21) could be rewritten as follows (note that we omit the activation functions and transformation matrices for graph convolution or dimension alignment for simplicity):
where is the transformation matrix which is specific to node ’s type. and are learned importance by the attention mechanism in micro-level and macro-level convolution respectively. is the trainable parameter to balance the importance of the focal node inherent attribute and its neighbor information.
Connection with RGCN. RGCN  ignores distinct attributes of nodes with various types and assigns importance of neighbors within the same relation based on pre-defined constants. RGCN could be treated as a special case of the proposed HGConv with the following steps: 1) Replace in Equation (22) with identity function , which means different distributions of node attributes with various types are not considered; 2) Replace trainable in Equation (23) with pre-defined constant, which is calculated by the degree of each node; 3) Set in Equation (24) to , which stands for simple sum pooling; 4) Set in Equation (25) to , which means equal contribution of node inherent attributes and neighbor information. Note that the sum pooling operation in RGCN could not distinguish the importance of nodes and relations effectively.
Connection with HAN. HAN leverages multiple symmetric meta-paths to convert the heterogeneous graph into multiple homogeneous graphs. Therefore, node ’s neighbors are defined by the given set of meta-paths . HAN could be treated as a special case of the proposed HGConv with the following steps: 1) Replace in Equation (22) with identity function , as each converted graph only contains nodes with a single type; 2) Define the set of node ’s neighbors in Equation (23) by meta-paths , that is, for each meta-path , the set of node ’s neighbors is denoted as , and then learn the importance of neighbors generated by the same meta-path through the attention mechanism; 3) Replace the aggregation of different relations in Equation (24) with the aggregation of multiple meta-paths , and learn the importance of different meta-paths using the attention mechanism; 4) Set in Equation (25) to , which means using the neighbor information directly. Note that the converted graphs are homogeneous, and the attributes of nodes with different types are ignored in HAN, leading to inferior performance.
Connection with HetGNN. HetGNN leverages the random walk strategy to sample neighbors and then uses Bi-LSTMs to integrate node attributes and neighbors. Therefore, node ’s neighbors are generated by random walk , which could be denoted as . HetGNN could be treated as a special case of the proposed HGConv with the following steps: 1) Replace in Equation (22) with Bi-LSTMs to aggregate attributes of nodes with various types; 2) Define the set of node ’s neighbors in Equation (23) by random walk and group the neighbors by node types, that is, for each node type , the set of node ’s neighbors is denoted as . Then, learn the importance of neighbors with the same node type through Bi-LSTMs; 3) Replace the aggregation of different relations in Equation (24) with the aggregation of different node types, and learn the importance of different node types using the attention mechanism; 4) Set in Equation (25) to be trainable, which is incorporated in the attention mechanism in the previous step in . Note that the random walk in HetGNN may break the intrinsic graph structure and result in structural information loss.
Connection with HGT. HGT learns the importance of different nodes and relations based on the Transformer architecture by designing type-specific transformation matrices. HGT focuses on the study of each relation (a.k.a. meta relation in ); hence, the importance of the source node to the target node is calculated from the information of both nodes as well as their connected relation in a single aggregation process. HGT could be treated as a special case of the proposed HGConv with the following steps: 1) Replace in Equation (22) with the linear projections that are specific to source node type and target node type respectively to obtain Key and Query vectors; 2) Fuse the aggregation process in Equation (23) and Equation (24) into a single aggregation process. The importance of source node to target node is learned from the Key and Query vectors, as well as the relation transformation matrices specific to their connected relation type; 3) Set in Equation (25) to , which means node inherent attributes and neighbor information contribute equally to the final node representation. Note that the single aggregation process in HGT leads to a flat architecture, making it hard to distinguish the importance of nodes and relations separately.
This section presents the experimental results on real-world datasets and detailed analysis.
5.1 Description of Datasets
We conduct experiments on three real-world datasets.
ACM-3: Following prior work, we extract a subset of ACM from AMiner (https://www.aminer.cn/citation), which contains papers published in three areas: Data Mining (KDD, ICDM), Database (VLDB, SIGMOD) and Wireless Communication (SIGCOMM, MobiCOMM). Finally we construct a heterogeneous graph containing papers (P), authors (A) and terms (T).
IMDB: We extract a subset of IMDB and construct a heterogeneous graph containing movies (M), directors (D) and actors (A). The movies are divided into three classes: Action, Comedy, Drama.
For ACM-3 and ACM-5, we use TF-IDF to extract keywords of the abstract and title in papers. Paper attributes are the bag-of-words representation of abstracts. Author attributes are the average representation of their published papers. Term attributes are represented as the one-hot encoding of the title keywords. For IMDB, movie attributes are the bag-of-words representation of plot keywords. Director/actor attributes are the average representation of the movies they directed or acted in.
Details of the datasets are summarized in Table II.
5.2 Compared Methods
We compare our method with the following baselines:
MLP: MLP ignores the graph structure and solely focuses on the focal node attributes by leveraging the multilayer perceptron.
GCN: GCN performs graph convolutions in the Fourier domain by leveraging the localized first-order approximation.
GAT: GAT introduces the attention mechanism into GNNs and assigns different importance to the neighbors adaptively.
RGCN: RGCN designs specialized transformation matrices for each type of relation in the modelling of knowledge graphs.
HAN: HAN leverages the attention mechanism to aggregate neighbor information via multiple manually designed meta-paths.
HetGNN: HetGNN considers the heterogeneity of node attributes and neighbors, and then utilizes Bi-LSTMs to integrate heterogeneous information.
HGT: HGT introduces type-specific transformation matrices to capture characteristics of different nodes and relations with the Transformer architecture.
5.3 Experimental Setup
As some methods require meta-paths, we use , and as meta-paths for ACM-3 and ACM-5, and choose and as meta-paths for IMDB. Following prior work, we test GCN and GAT on the homogeneous graph generated by each meta-path and report the best performance from meta-paths (Experiments show that the best meta-paths on ACM-3, ACM-5, IMDB are , , and respectively). All the meta-paths are directly fed into HAN. Adam is selected as the optimizer. Dropout is utilized to prevent over-fitting. Grid search is used to select the best hyperparameters, including dropout in and learning rate in . The dimension of node representation is set to 64. We train all the methods for at most 300 epochs and use an early stopping strategy with a patience of 100, which means that training is terminated when the evaluation metrics on the validation set have not improved for 100 consecutive epochs.
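The early stopping rule described above can be sketched as a small loop. The function name `early_stopping_run` and the `val_scores` sequence, which stands in for per-epoch validation metrics produced by real training, are hypothetical; only the epoch budget of 300 and the patience of 100 come from the text.

```python
def early_stopping_run(val_scores, max_epochs=300, patience=100):
    """Return (best_epoch, last_epoch_run) for a stream of validation scores.

    Training stops once the score has not improved for `patience`
    consecutive epochs, or when `max_epochs` epochs have been run.
    """
    best_score = float("-inf")
    best_epoch = -1
    last_epoch = -1
    for epoch, score in enumerate(val_scores):
        if epoch >= max_epochs:
            break
        last_epoch = epoch
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, last_epoch


# Toy run with a small patience so the effect is visible.
scores = [0.1, 0.5, 0.4, 0.4, 0.4, 0.6]
stopped = early_stopping_run(scores, patience=3)
full = early_stopping_run(scores, patience=100)
```

With a patience of 3 the toy run stops at epoch 4 and never reaches the later improvement, which illustrates why a generous patience such as 100 is often preferred when validation scores are noisy.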
For HGConv, the numbers of attention heads in micro/macro level convolution are both set to 8, and the dimension of each head’s attention vector is set to 8. We build HGConv with two layers, since two layers could achieve satisfactory performance and stacking more layers cannot improve the performance significantly. The proposed HGConv is implemented with PyTorch (https://pytorch.org/) and Deep Graph Library (DGL) (https://www.dgl.ai/). Experiments are conducted on an Ubuntu machine equipped with two Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz with 8 physical cores, and the GPU is an NVIDIA TITAN Xp with 12 GB of GDDR5X memory running at over 11 Gbps.
5.4 Node Classification
We conduct experiments to compare the methods on the node classification task. Following prior work, we split the datasets into training, validation and testing sets with the ratio of 2:1:7. The ratios of training data are varied in . To make a comprehensive comparison, we additionally use 5-fold cross-validation and report the average classification results. Macro-F1 and Micro-F1 are adopted as the evaluation metrics. For ACM-3 and ACM-5, we aim to predict the area of papers. For IMDB, the goal is to predict the class of movies. Experimental results are shown in Table III (experimental results with variations and hyper-parameter settings of all the methods are shown in the appendix). By analyzing the results, some conclusions could be summarized.
Firstly, the performance of all the methods improves as the amount of training data increases, which suggests that feeding more training data helps deep learning methods learn more complicated patterns and achieve better results.
Secondly, compared with MLP, the performance of other methods is significantly improved by taking graph structure into consideration in most cases, which indicates the power of graph neural networks in considering the information of both nodes and edges.
Thirdly, methods designed for heterogeneous graphs achieve better results than methods designed for homogeneous graphs (i.e., GCN and GAT) in most cases, which demonstrates the necessity of leveraging the properties of different nodes and relations in heterogeneous graphs.
Fourthly, although HetGNN is designed for heterogeneous graph learning, it only achieves results that are competitive with, or even worse than, those of MLP. We attribute this phenomenon to the following two reasons: 1) there are several hyper-parameters (e.g., the return probability and length of random walk, the numbers of type-grouped neighbors) in HetGNN, making the model difficult to tune; 2) the random walk strategy may break the intrinsic graph structure and lead to structural information loss, especially when the graph structure contains valuable information.
Finally, HGConv outperforms all the baselines consistently with the varying training data ratio in most cases. Compared with MLP, GCN and GAT, HGConv takes both the graph topology and graph heterogeneity into consideration. Compared with RGCN and HAN, HGConv utilizes the specific characteristic of different nodes and relations without the requirement of domain knowledge. Compared with HetGNN, HGConv leverages intrinsic graph structure directly, which alleviates the structural information loss issue introduced by random walk. Compared with HGT, HGConv learns multi-level representation by the hybrid micro/macro level convolution, which provides HGConv with sufficient representation ability.
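For reference, the Macro-F1 and Micro-F1 metrics used in this comparison can be computed as in the following sketch. The function name `f1_scores` and the toy labels are illustrative; the sketch assumes single-label multi-class predictions and a non-empty label list.

```python
def f1_scores(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp = {c: 0 for c in labels}
    fp = {c: 0 for c in labels}
    fn = {c: 0 for c in labels}
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_c, fp_c, fn_c):
        denom = 2 * tp_c + fp_c + fn_c
        return 2 * tp_c / denom if denom else 0.0

    # Macro-F1: unweighted mean of per-class F1 scores.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro-F1: F1 computed from counts pooled over all classes.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
macro_f1, micro_f1 = f1_scores(y_true, y_pred)
```

Macro-F1 treats every class equally regardless of its size, whereas Micro-F1 pools all decisions and, in the single-label setting, coincides with accuracy.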
5.5 Node Clustering
The node clustering task is conducted to evaluate the learned node representations. We first obtain the node representations via a forward pass through the trained model and then feed the normalized node representations into the k-means algorithm. We set the number of clusters to the number of real classes for each dataset (i.e., 3, 5 and 3 for ACM-3, ACM-5 and IMDB, respectively). We adopt two evaluation metrics. Since the result of k-means tends to be affected by the initial centroids, we run k-means 10 times and report the average results in Table IV.
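The clustering protocol described above can be sketched in a few lines. The following is a minimal, dependency-free illustration rather than the authors' code: it normalizes the embeddings (L2 normalization is assumed here, since the text does not say which normalization is used), runs Lloyd's k-means with 10 different random initializations, and averages a score over the runs. The names `kmeans`, `purity` and `average_clustering_score`, the toy embeddings, and the use of purity as the score are all assumptions made for illustration; the metrics used in the paper would be passed in through the `metric` argument.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def kmeans(points, k, rng, n_iter=50):
    """Plain Lloyd's k-means; returns one cluster index per point."""
    centroids = [list(p) for p in rng.sample(points, k)]  # random initial centroids
    assign = [0] * len(points)
    for it in range(n_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_assign = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if it > 0 and new_assign == assign:
            break  # assignments stopped changing
        assign = new_assign
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def purity(labels, clusters):
    """Stand-in metric (an assumption): share of points in their cluster's majority class."""
    total = 0
    for c in set(clusters):
        members = [y for y, a in zip(labels, clusters) if a == c]
        total += max(members.count(label) for label in set(members))
    return total / len(labels)


def average_clustering_score(embeddings, labels, k, n_runs=10, metric=purity):
    """Normalize embeddings, run k-means n_runs times with different seeds, average the metric."""
    points = [l2_normalize(e) for e in embeddings]
    scores = []
    for seed in range(n_runs):
        clusters = kmeans(points, k, random.Random(seed))
        scores.append(metric(labels, clusters))
    return sum(scores) / n_runs


# Toy embeddings (hypothetical): three noisy groups around the coordinate axes.
toy_rng = random.Random(0)
centers = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
toy_labels = [y for y in range(3) for _ in range(20)]
toy_embeddings = [[c + toy_rng.gauss(0, 0.05) for c in centers[y]] for y in toy_labels]

avg = average_clustering_score(toy_embeddings, toy_labels, k=3)
print(f"average score over 10 runs: {avg:.3f}")
```

Averaging over several initializations matters because a single k-means run can converge to a poor local optimum, for example when two initial centroids fall inside the same true cluster.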
Experimental results on the node clustering task show that HGConv outperforms all the baselines, which demonstrates the effectiveness of the learned node representations. Moreover, methods based on GNNs usually obtain better results. We also observe that methods achieving satisfactory results on the node classification task (e.g., RGCN, HAN and HGT) also perform well on the node clustering task, which indicates that a good model learns more universal node embeddings that are applicable to various tasks.
5.6 Node Visualization
To make a more intuitive comparison, we also visualize the nodes of the heterogeneous graph in a low-dimensional space. In particular, we project the node representations learned by HGConv into a 2-dimensional space using t-SNE. The visualization of the node representations on ACM-5 is shown in Figure 3 (please refer to the appendix for results on ACM-3 and IMDB), where the color of each node denotes its publication area.
From Figure 3, we observe that the baselines do not achieve satisfactory performance. They either fail to gather papers within the same area together or fail to provide clear boundaries between papers belonging to different areas. HGConv performs best in the visualization, as papers within the same area are closer together and the boundaries between different areas are more distinct.
5.7 Ablation Study
We conduct an ablation study to validate the effect of each component in HGConv. We remove the micro-level convolution, the macro-level convolution and the weighted residual connection from HGConv in turn and denote the three variants as HGConv w/o Micro, HGConv w/o Macro and HGConv w/o WRC. The three variants are implemented as follows:
HGConv w/o Micro. This variant replaces the micro-level convolution by performing simple average pooling on nodes within the same relation.
HGConv w/o Macro. This variant replaces the macro-level convolution by performing simple average pooling across different relations.
HGConv w/o WRC. This variant removes the weighted residual connection in each layer and only uses the aggregated neighbor information as the output of each layer.
Experimental results of the variants and HGConv on the node classification task are shown in Figure 4.
From Figure 4, we observe that HGConv achieves the best performance when it is equipped with all the components, and removing any component leads to worse results. The effects of the three components vary across datasets, but all of them contribute to the final performance. In particular, the micro-level convolution enables HGConv to select more important nodes within the same relation, and the macro-level convolution helps HGConv distinguish the subtle differences across relations. The weighted residual connection gives HGConv the ability to weigh the different contributions of the focal node’s inherent attributes and its neighbor information.
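To make the difference between the full model and the three ablated variants concrete, the following minimal sketch contrasts attention-weighted aggregation with simple average pooling, and shows one possible form of a weighted residual connection. It is an illustration under stated assumptions, not the authors' implementation: the attention logits are fixed numbers standing in for learned scores, the residual is assumed to be a convex combination controlled by a scalar `beta`, and all function names, relation names and values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_pool(vectors):
    """Simple average pooling (the replacement used in the w/o Micro and w/o Macro variants)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def attention_pool(vectors, scores):
    """Softmax-weighted sum: vectors with higher scores contribute more."""
    weights = softmax(scores)
    return [sum(w * x for w, x in zip(weights, col)) for col in zip(*vectors)]


def weighted_residual(self_vec, neighbor_vec, beta):
    """Assumed form: beta * neighbor information + (1 - beta) * the node's own features."""
    return [beta * n + (1.0 - beta) * s for s, n in zip(self_vec, neighbor_vec)]


def layer_output(self_vec, relations, use_micro=True, use_macro=True, use_wrc=True, beta=0.5):
    """relations maps a name to (neighbor_vectors, neighbor_scores, relation_score)."""
    per_relation, relation_scores = [], []
    for vectors, node_scores, rel_score in relations.values():
        # Micro level: aggregate the neighbors that share one relation.
        pooled = attention_pool(vectors, node_scores) if use_micro else mean_pool(vectors)
        per_relation.append(pooled)
        relation_scores.append(rel_score)
    # Macro level: combine the relation-specific summaries.
    if use_macro:
        aggregated = attention_pool(per_relation, relation_scores)
    else:
        aggregated = mean_pool(per_relation)
    # Without the weighted residual connection, only neighbor information is returned.
    return weighted_residual(self_vec, aggregated, beta) if use_wrc else aggregated


# Hypothetical focal node with two relations.
focal = [1.0, 0.0]
relations = {
    "relation_a": ([[1.0, 2.0], [3.0, 4.0]], [0.2, 1.5], 1.0),
    "relation_b": ([[0.0, 1.0]], [0.0], 0.1),
}
print("full model :", layer_output(focal, relations))
print("w/o Micro  :", layer_output(focal, relations, use_micro=False))
print("w/o Macro  :", layer_output(focal, relations, use_macro=False))
print("w/o WRC    :", layer_output(focal, relations, use_wrc=False))
```

When all scores are equal, attention pooling reduces to average pooling, so each ablated variant can be read as the full model with one set of learned weights forced to be uniform (or, for w/o WRC, with the node's own features dropped).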
5.8 Parameter Sensitivity Analysis
We also investigate the sensitivity of HGConv to several parameters. We report the results of the node classification task under different parameter settings on IMDB; the experimental results are shown in Figure 5.
Number of convolution layers. We build HGConv with different numbers of heterogeneous graph convolution layers and report the results in Figure 5. As the number of layers increases, the performance of HGConv rises at first and then starts to drop gradually. This indicates that stacking a suitable number of layers helps the model receive information from more distant neighbors, but too many layers lead to overfitting.
Number of attention heads. We validate the effect of the multi-head attention mechanism in the hybrid convolution by changing the number of attention heads. The results are shown in Figure 5. We conclude that increasing the number of attention heads improves the model performance at first. Once there are enough attention heads (e.g., 4 or more), the performance peaks and remains stable.
Dimension of node representation. We also vary the dimension of the node representation and report the results in Figure 5. The performance of HGConv grows with the dimension of the node representation and is best when the dimension is set between 64 and 256 (we select 64 as the final setting). The performance decreases when the dimension increases further because of overfitting.
5.9 Interpretability of the Hybrid Convolution
The principal components of HGConv are the micro-level convolution and the macro-level convolution. Thus, we provide a detailed interpretation to better understand the importance of nodes within the same relation and the differences across relations learned by the hybrid convolution. We first randomly select a sample from ACM-3 and then calculate the normalized attention scores from the last heterogeneous graph convolution layer. The selected paper proposes an effective ranking-based clustering algorithm for heterogeneous information networks, and it is published in the Data Mining area. The visualization is shown in Figure 6.
Interpretation of the micro-level convolution. In the relation that connects the selected paper to its authors, both Jiawei Han and Yizhou Sun have higher attention scores than Yintao Yu, since the first two authors contribute more to academic research. In the relation that connects the paper to its keywords, keywords that are more relevant to the paper (i.e., clustering and ranking) have higher attention scores. Moreover, in the relation that connects the paper to its references, the scores of references that study topics more relevant to the paper are also higher. These observations indicate that the micro-level convolution can select more important nodes within the same relation by assigning them higher attention scores.
Interpretation of the macro-level convolution. The attention score of one relation is much higher than that of the other two relations, in line with the fact that GCN and GAT achieved the best performance on the corresponding meta-path. This finding demonstrates that the macro-level convolution can distinguish the importance of different relations automatically without empirical manual design, and the learned importance can implicitly construct more important meta-paths for specific downstream tasks.
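The kind of inspection described above, reading normalized attention scores and comparing neighbors or relations, can be sketched as follows. This is only an illustration: the logits are made-up numbers, the neighbor names are placeholders, and averaging the per-head softmax scores is an assumption, since the text does not state how scores from multiple heads are combined.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw attention logits."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_attention(head_logits):
    """head_logits[h][i] is the raw score of neighbor i in head h.

    Each head is normalized with softmax, then the heads are averaged
    (the averaging step is an assumption made for this sketch).
    """
    per_head = [softmax(logits) for logits in head_logits]
    n_heads = len(per_head)
    return [sum(col) / n_heads for col in zip(*per_head)]


def rank_neighbors(names, head_logits):
    """Return (name, score) pairs sorted from most to least attended."""
    scores = normalized_attention(head_logits)
    return sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)


# Hypothetical logits for three neighbors under two attention heads.
names = ["neighbor_a", "neighbor_b", "neighbor_c"]
head_logits = [
    [2.0, 1.5, 0.1],
    [1.8, 1.9, 0.0],
]
for name, score in rank_neighbors(names, head_logits):
    print(f"{name}: {score:.3f}")
```

The same routine applies at the macro level by treating each relation as one entry and passing its relation-level logits instead of neighbor-level ones.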
6 Conclusion
In this paper, we designed a hybrid micro/macro level convolution operation to address several fundamental problems in heterogeneous graph representation learning. In particular, the micro-level convolution aims to learn the importance of nodes within the same relation, and the macro-level convolution attempts to distinguish the subtle differences across different relations. The hybrid strategy enables our model to fully leverage heterogeneous information with proper interpretability by performing convolutions directly on the intrinsic structure of heterogeneous graphs. We also designed a weighted residual connection component to obtain the optimal combination of the focal node’s inherent attributes and its neighbor information. Experimental results demonstrated not only the superiority of the proposed method, but also the intuitive interpretability of our approach for graph analysis.
Acknowledgments
This work is supported by the National Key R&D Program of China [grant number 2018YFB2101003], the Science and Technology Major Project of Beijing [grant number Z191100002519012], and the National Natural Science Foundation of China [grant numbers 51778033, 51822802, 51991395, 71901011, U1811463].
References
[1] (2001) Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in Neural Information Processing Systems 14 [Neural Information Processing Systems: Natural and Synthetic, NIPS 2001, December 3-8, 2001, Vancouver, British Columbia, Canada], pp. 585–591.
[2] (2014) Spectral networks and locally connected networks on graphs. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings.
[3] Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR abs/1412.3555.
[4] (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pp. 3837–3845.
[5] (2012) Link prediction and recommendation across heterogeneous social networks. In 12th IEEE International Conference on Data Mining, ICDM 2012, Brussels, Belgium, December 10-13, 2012, pp. 181–190.
[6] (2020) A fair comparison of graph neural networks for graph classification. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020.
[7] (2020) MAGNN: metapath aggregated graph neural network for heterogeneous graph embedding. In WWW ’20: The Web Conference 2020, Taipei, Taiwan, April 20-24, 2020, pp. 2331–2341.
[8] (2016) Node2vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 855–864.
[9] (2017) Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1024–1034.
[10] Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 770–778.
[11] (2020) Heterogeneous graph transformer. In WWW ’20: The Web Conference 2020, Taipei, Taiwan, April 20-24, 2020, pp. 2704–2710.
[12] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[13] (2017) Semi-supervised classification with graph convolutional networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.
[14] (2020) Type-aware anchor link prediction across heterogeneous networks based on graph attention network. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pp. 147–155.
[15] Visualizing data using t-SNE. Journal of Machine Learning Research 9 (Nov), pp. 2579–2605.
[16] (2013) Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119.
[17] (2013) Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, pp. 3111–3119.
[18] (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, 8-14 December 2019, Vancouver, BC, Canada, pp. 8024–8035.
[19] (2014) Deepwalk: online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 701–710.
[20] (2003) Using tf-idf to determine word relevance in document queries. In Proceedings of the first instructional conference on machine learning, Vol. 242, pp. 133–142.
[21] (2000) Nonlinear dimensionality reduction by locally linear embedding. Science 290 (5500), pp. 2323–2326.
[22] (2018) Representation learning for classification in heterogeneous graphs with application to social networks. ACM Trans. Knowl. Discov. Data 12 (5), pp. 62:1–62:33.
[23] (2018) Modeling relational data with graph convolutional networks. In The Semantic Web - 15th International Conference, ESWC 2018, Heraklion, Crete, Greece, June 3-7, 2018, Proceedings, pp. 593–607.
[24] (2019) Heterogeneous information network embedding for recommendation. IEEE Trans. Knowl. Data Eng. 31 (2), pp. 357–370.
[25] (2017) A survey of heterogeneous information network analysis. IEEE Trans. Knowl. Data Eng. 29 (1), pp. 17–37.
[26] (2014) Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15 (1), pp. 1929–1958.
[27] (2012) Relation strength-aware clustering of heterogeneous information networks with incomplete attributes. Proc. VLDB Endow. 5 (5), pp. 394–405.
[28] (2012) Mining heterogeneous information networks: a structural analysis approach. SIGKDD Explorations 14 (2), pp. 20–28.
[29] (2008) ArnetMiner: extraction and mining of academic social networks. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, Nevada, USA, August 24-27, 2008, pp. 990–998.
[30] (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008.
[31] (2018) Graph attention networks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings.
[32] (2019) Deep graph library: a graph-centric, highly-performant package for graph neural net. arXiv preprint arXiv:1909.01315.
[33] (2019) Heterogeneous graph attention network. In The World Wide Web Conference, WWW 2019, San Francisco, CA, USA, May 13-17, 2019, pp. 2022–2032.
[34] (2019) A comprehensive survey on graph neural networks. CoRR abs/1901.00596.
[35] (2014) Personalized entity recommendation: a heterogeneous information network approach. In Seventh ACM International Conference on Web Search and Data Mining, WSDM 2014, New York, NY, USA, February 24-28, 2014, pp. 283–292.
[36] (2019) Heterogeneous graph neural network. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2019, Anchorage, AK, USA, August 4-8, 2019, pp. 793–803.
[37] (2018) Link prediction based on graph neural networks. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pp. 5171–5181.
[38] (2018) Deep collective classification in heterogeneous information networks. In Proceedings of the 2018 World Wide Web Conference on World Wide Web, WWW 2018, Lyon, France, April 23-27, 2018, pp. 399–408.
[39] (2018) Graph neural networks: A review of methods and applications. CoRR abs/1812.08434.