1. Introduction
Many real-world datasets are naturally represented in a graph data structure, where objects and the relationships among them are embodied by nodes and edges, respectively. Examples include social networks (Wang et al., 2016; Hamilton et al., 2017), physical systems (Battaglia et al., 2016; Fout et al., 2017), traffic networks (Li et al., 2018; Zhang et al., 2018), citation networks (Atwood and Towsley, 2016; Kipf and Welling, 2017; Hamilton et al., 2017), recommender systems (van den Berg et al., 2017; Zhang et al., 2019), knowledge graphs (Bordes et al., 2013; Sun et al., 2019), and so on. The unique non-Euclidean nature of graphs makes them difficult to model with traditional machine learning methods: the neighborhood set of each node has no order or size limit, whereas most statistical models assume an ordered, fixed-size input lying in Euclidean space. Therefore, it would be beneficial if nodes could be represented by meaningful low-dimensional vectors in Euclidean space, which can then be taken as the input for other machine learning models.
Different graph embedding techniques have been proposed for learning from graph structure. LINE (Tang et al., 2015) generates node embeddings by exploiting the first-order and second-order proximity between nodes. Random-walk-based methods, including DeepWalk (Perozzi et al., 2014), node2vec (Grover and Leskovec, 2016), and TADW (Yang et al., 2015), feed node sequences generated by random walks to a skip-gram model (Mikolov et al., 2013a) to learn node embeddings. With the rapid development of deep learning, graph neural networks (GNNs) have been proposed, which learn graph representations using specially designed neural layers. Spectral-based GNNs, including ChebNet (Defferrard et al., 2016) and GCN (Kipf and Welling, 2017), perform graph convolution operations in the Fourier domain of an entire graph. Recent spatial-based GNNs, including GraphSAGE (Hamilton et al., 2017), GAT (Velickovic et al., 2018), and many other variants (Li et al., 2016; Zhang et al., 2018, 2019), address the scalability and generalization issues of the spectral-based models by performing graph convolution operations directly in the graph domain. An increasing number of researchers have paid attention to this promising area.

Although GNNs have achieved state-of-the-art results in many tasks, most GNN-based models assume that the input is a homogeneous graph with only one node type and one edge type. However, most real-world graphs consist of various types of nodes and edges associated with attributes in different feature spaces. For example, a co-authorship network contains at least two types of nodes, namely authors and papers. Author attributes may include affiliations, citations, and research fields. Paper attributes may consist of keywords, venue, year, and so on. We refer to graphs of this kind as heterogeneous information networks (HINs) or heterogeneous graphs. The heterogeneity in both graph structure and node content makes it challenging for GNNs to encode their rich and diverse information into a low-dimensional vector space.
Most existing heterogeneous graph embedding methods are based on the idea of metapaths. A metapath is an ordered sequence of node types and edge types defined on the network schema, which describes a composite relation between the node types involved. For example, in a scholar network with authors, papers, and venues, Author-Paper-Author (APA) and Author-Paper-Venue-Paper-Author (APVPA) are metapaths describing two different relations among authors. The APA metapath associates two co-authors, while the APVPA metapath associates two authors who published papers at the same venue. Therefore, we can view a metapath as a high-order proximity between two nodes. Because traditional GNNs treat all nodes equally, they are unable to model this complex structural and semantic information in heterogeneous graphs.
Although these metapath-based embedding methods outperform traditional network embedding methods on various tasks, such as node classification and link prediction, they still suffer from at least one of the following limitations. (1) The model does not leverage node content features, so it rarely performs well on heterogeneous graphs with rich node content features (e.g., metapath2vec (Dong et al., 2017), ESim (Shang et al., 2016), HIN2vec (Fu et al., 2017), and HERec (Shi et al., 2019)). (2) The model discards all intermediate nodes along the metapath by considering only the two end nodes, which results in information loss (e.g., HERec (Shi et al., 2019) and HAN (Wang et al., 2019b)). (3) The model relies on a single metapath to embed the heterogeneous graph. Hence, the model requires a manual metapath selection process and loses aspects of information from other metapaths, leading to suboptimal performance (e.g., metapath2vec (Dong et al., 2017)).
To address these limitations, we propose a novel Metapath Aggregated Graph Neural Network (MAGNN) for heterogeneous graph embedding. MAGNN addresses all the issues described above by applying node content transformation, intra-metapath aggregation, and inter-metapath aggregation to generate node embeddings. Specifically, MAGNN first applies type-specific linear transformations to project heterogeneous node attributes, with possibly unequal dimensions for different node types, into the same latent vector space. Next, MAGNN applies intra-metapath aggregation with the attention mechanism (Velickovic et al., 2018) for every metapath. During this intra-metapath aggregation, each target node extracts and combines information from the metapath instances connecting the node with its metapath-based neighbors. In this way, MAGNN captures the structural and semantic information of heterogeneous graphs from both the neighbor nodes and the metapath context in between. Following intra-metapath aggregation, MAGNN further conducts inter-metapath aggregation using the attention mechanism to fuse the latent vectors obtained from multiple metapaths into final node embeddings. By integrating multiple metapaths, our model can learn the comprehensive semantics ingrained in the heterogeneous graph.

In summary, this work makes the following major contributions:

We propose a novel metapath aggregated graph neural network for heterogeneous graph embedding.

We design several candidate encoder functions for distilling information from metapath instances, including one based on the idea of relational rotation in complex space (Sun et al., 2019).

We conduct extensive experiments on the IMDb and DBLP datasets for node classification and node clustering, as well as on the Last.fm dataset for link prediction, to evaluate the performance of our proposed model. Experiments on all these datasets and tasks show that the node embeddings learned by MAGNN are consistently better than those generated by other state-of-the-art baselines.
2. Preliminary
In this section, we give formal definitions of some important terminology related to heterogeneous graphs. Graphical illustrations are provided in Figure 1. In addition, Table 1 summarizes frequently used notations in this paper for quick reference.
Table 1. Notations used in this paper.

Notations  Definitions
$\mathbb{R}^n$  The $n$-dimensional Euclidean space
$x$, $\mathbf{x}$, $\mathbf{X}$  Scalar, vector, matrix
$\mathbf{X}^\top$  Matrix/vector transpose
$\mathcal{V}$  The set of nodes in a graph
$\mathcal{E}$  The set of edges in a graph
$\mathcal{G} = (\mathcal{V}, \mathcal{E})$  A graph
$v$  A node $v \in \mathcal{V}$
$P$  A metapath
$P(u, v)$  A metapath instance connecting node $u$ and node $v$
$\mathcal{N}_v$  The set of neighbors of node $v$
$\mathcal{N}_v^P$  The set of metapath-$P$-based neighbors of node $v$
$\mathbf{x}_v$  Raw (content) feature vector of node $v$
$\mathbf{h}_v$  Hidden state (embedding) of node $v$
$\mathbf{W}$  Weight matrix
$\alpha$  Normalized attention weight
$\sigma(\cdot)$  Activation function
$\odot$  Element-wise multiplication
$|\cdot|$  The cardinality of a set
$\Vert$  Vector concatenation
Definition 2.1 (Heterogeneous Graph). A heterogeneous graph is defined as a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ associated with a node type mapping function $\phi: \mathcal{V} \rightarrow \mathcal{A}$ and an edge type mapping function $\psi: \mathcal{E} \rightarrow \mathcal{R}$. $\mathcal{A}$ and $\mathcal{R}$ denote the predefined sets of node types and edge types, respectively, with $|\mathcal{A}| + |\mathcal{R}| > 2$.
Definition 2.2 (Metapath). A metapath $P$ is defined as a path in the form of $A_1 \xrightarrow{R_1} A_2 \xrightarrow{R_2} \cdots \xrightarrow{R_l} A_{l+1}$ (abbreviated as $A_1 A_2 \cdots A_{l+1}$), which describes a composite relation $R = R_1 \circ R_2 \circ \cdots \circ R_l$ between node types $A_1$ and $A_{l+1}$, where $\circ$ denotes the composition operator on relations.
Definition 2.3 (Metapath Instance). Given a metapath $P$ of a heterogeneous graph, a metapath instance $p$ of $P$ is defined as a node sequence in the graph following the schema defined by $P$.
Definition 2.4 (Metapath-based Neighbor). Given a metapath $P$ of a heterogeneous graph, the metapath-based neighbors $\mathcal{N}_v^P$ of a node $v$ are defined as the set of nodes that connect with node $v$ via metapath instances of $P$. A neighbor connected by two different metapath instances is regarded as two different nodes in $\mathcal{N}_v^P$. Note that $\mathcal{N}_v^P$ includes $v$ itself if $P$ is symmetric.
For example, considering the metapath User-Artist-Tag-Artist (UATA) in Figure 1, artist Queen is a metapath-based neighbor of user Bob. These two nodes are connected via the metapath instance Bob-Beatles-Rock-Queen. Moreover, we may refer to Beatles and Rock as the intermediate nodes along this metapath instance.
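For illustration, metapath-based neighbors can be found by composing type-specific adjacency matrices along the metapath. The sketch below is our own toy example on a hypothetical UATA-style graph, not the actual Last.fm data:

```python
import numpy as np

# Tiny hypothetical graph following the UATA schema (not the real Last.fm data).
A_ua = np.array([[1, 1, 0],    # user-artist adjacency: 2 users x 3 artists
                 [0, 1, 0]])
A_at = np.array([[1, 0],       # artist-tag adjacency: 3 artists x 2 tags
                 [1, 0],
                 [0, 1]])

# Composing adjacencies along U-A-T-A counts the metapath instances between
# each (user, artist) pair; tag-artist adjacency is just A_at transposed.
counts = A_ua @ A_at @ A_at.T
print(counts)  # counts[u, a] > 0  =>  artist a is a UATA-based neighbor of user u
```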
Definition 2.5 (Metapath-based Graph). Given a metapath $P$ of a heterogeneous graph $\mathcal{G}$, the metapath-based graph $\mathcal{G}^P$ is a graph constructed by all the metapath-$P$-based neighbor pairs in graph $\mathcal{G}$. Note that $\mathcal{G}^P$ is homogeneous if $P$ is symmetric.
Definition 2.6 (Heterogeneous Graph Embedding). Given a heterogeneous graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, with node attribute matrices $\mathbf{X}_{A_i} \in \mathbb{R}^{|\mathcal{V}_{A_i}| \times d_{A_i}}$ for node types $A_i \in \mathcal{A}$, heterogeneous graph embedding is the task of learning the $d$-dimensional node representations $\mathbf{h}_v \in \mathbb{R}^d$ for all $v \in \mathcal{V}$ with $d \ll |\mathcal{V}|$ that are able to capture the rich structural and semantic information involved in $\mathcal{G}$.
3. Related Work
In this section, we review studies on graph representation learning that are related to our model. They are organized into two subsections: Section 3.1 summarizes research efforts on GNNs for general graph embedding, while Section 3.2 introduces graph embedding methods designed for heterogeneous graphs.
3.1. Graph Neural Networks
The goal of a GNN is to learn a low-dimensional vector representation $\mathbf{h}_v$ for every node $v \in \mathcal{V}$, which can be used for many downstream tasks, e.g., node classification, node clustering, and link prediction. The rationale behind GNNs is that each node is naturally defined by its own features and its neighborhood. Following this idea and based on graph signal processing, spectral-based GNNs were first developed to perform graph convolution in the Fourier domain of a graph. ChebNet (Defferrard et al., 2016) utilizes Chebyshev polynomials to filter graph signals (node features) in the graph Fourier domain. Another influential model of this kind is GCN (Kipf and Welling, 2017), which constrains and simplifies the parameters of ChebNet to alleviate overfitting and improve performance. However, spectral-based GNNs suffer from poor scalability and generalization ability, because they require the entire graph as input for every layer, and their learned filters depend on the eigenbasis of the graph Laplacian, which is tied to the specific graph structure.
Spatial-based GNNs have been proposed to address these two limitations. GNNs of this kind define convolutions directly in the graph domain by aggregating feature information from neighbors for each node, thus imitating the convolution operations of convolutional neural networks for image data. GraphSAGE (Hamilton et al., 2017), the seminal spatial-based GNN framework, is founded upon the general notion of aggregator functions for the efficient generation of node embeddings. The aggregator function samples, extracts, and transforms a target node's local neighborhood, and thus facilitates parallel training and generalization to unseen nodes or graphs. Many other spatial-based GNN variants have been proposed based on this idea. Inspired by the Transformer (Vaswani et al., 2017), GAT (Velickovic et al., 2018) incorporates the attention mechanism into the aggregator function to take into account the relative importance of each neighbor's information from the target node's perspective. GGNN (Li et al., 2016) adds a gated recurrent unit (GRU) (Cho et al., 2014) to the aggregator function by treating the aggregated neighborhood information as the input to the GRU of the current time step. GaAN (Zhang et al., 2018) combines the GRU with a gated multi-head attention mechanism for dealing with spatiotemporal graphs. STAR-GCN (Zhang et al., 2019) stacks multiple GCN encoder-decoders to boost the rating prediction performance.

All of the GNNs mentioned above are either built for homogeneous graphs or designed for graphs with a special structure, as in user-item recommender systems. Because most existing GNNs operate on node features in a single shared embedding space, they cannot be naturally adapted to heterogeneous graphs, whose node features may lie in different spaces.
3.2. Heterogeneous Graph Embedding
Heterogeneous graph embedding aims to project the nodes of a heterogeneous graph into a low-dimensional vector space. This challenging topic has been addressed by a number of studies. For example, metapath2vec (Dong et al., 2017) generates random walks guided by a single metapath, which are then fed to a skip-gram model (Mikolov et al., 2013a) to generate node embeddings. Given user-defined metapaths, ESim (Shang et al., 2016) generates node embeddings by learning from sampled positive and negative metapath instances. HIN2vec (Fu et al., 2017) carries out multiple prediction training tasks to learn the representations of nodes and metapaths of a heterogeneous graph. Given a metapath, HERec (Shi et al., 2019) converts a heterogeneous graph into a homogeneous graph based on metapath-based neighbors and applies the DeepWalk model to learn the node embeddings of the target type. Like HERec, HAN (Wang et al., 2019b) converts a heterogeneous graph into multiple metapath-based homogeneous graphs in a similar way, but uses a graph attention network architecture to aggregate information from the neighbors and leverages the attention mechanism to combine various metapaths. Another model, PME (Chen et al., 2018), learns node embeddings by projecting them into the corresponding relation spaces and optimizing the proximity between the projected nodes.
However, all of the heterogeneous graph embedding methods introduced above have the limitations of either ignoring node content features, discarding all intermediate nodes along the metapath, or utilizing only a single metapath. Although they might have improved upon the performance of homogeneous graph embedding methods for some heterogeneous graph datasets, there is still room for improvement by exploiting more comprehensively the information embedded in heterogeneous graphs.
4. Methodology
In this section, we describe a new metapath aggregated graph neural network (MAGNN) for heterogeneous graph embedding. MAGNN consists of three major components: node content transformation, intra-metapath aggregation, and inter-metapath aggregation. Figure 2 illustrates the embedding generation for a single node. The overall forward propagation process is shown in Algorithm 1.
4.1. Node Content Transformation
For a heterogeneous graph with node attributes, different node types may have feature vectors of unequal dimensions. Even if two feature vectors happen to have the same dimension, they may lie in different feature spaces. For example, $d_1$-dimensional bag-of-words vectors of texts and $d_2$-dimensional intensity histogram vectors of images cannot directly operate together even if $d_1 = d_2$. Feature vectors of different dimensions are troublesome to process in a unified framework. Therefore, we need to project the different types of node features into the same latent vector space before anything else.
So before feeding the node vectors into MAGNN, we apply a type-specific linear transformation for each node type by projecting the feature vectors into the same latent factor space. For a node $v$ of type $A \in \mathcal{A}$, we have

$\mathbf{h}'_v = \mathbf{W}_A \cdot \mathbf{x}_v^A$  (1)

where $\mathbf{x}_v^A \in \mathbb{R}^{d_A}$ is the original feature vector, $\mathbf{h}'_v \in \mathbb{R}^{d'}$ is the projected latent vector of node $v$, and $\mathbf{W}_A \in \mathbb{R}^{d' \times d_A}$ is the parametric weight matrix for type $A$'s nodes.
The node content transformation addresses the heterogeneity of a graph that originates from the node content features. After applying this operation, all nodes’ projected features share the same dimension, which facilitates the aggregation process of the next model component.
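As a concrete illustration of Eq. (1), a minimal PyTorch sketch of the type-specific projection might look as follows; the type names and feature dimensions are illustrative assumptions, not the datasets' actual sizes:

```python
import torch
import torch.nn as nn

# Hypothetical per-type raw feature dimensions and the shared latent dimension d'.
feature_dims = {"movie": 1000, "director": 128, "actor": 128}
d_latent = 64

# One learnable weight matrix W_A per node type A.
W = nn.ModuleDict({
    ntype: nn.Linear(dim, d_latent, bias=False)
    for ntype, dim in feature_dims.items()
})

def node_content_transform(x_by_type):
    """Project the raw features x_v of every node type into the shared latent space."""
    return {ntype: W[ntype](x) for ntype, x in x_by_type.items()}

x = {ntype: torch.randn(5, dim) for ntype, dim in feature_dims.items()}  # 5 dummy nodes per type
h = node_content_transform(x)  # every h[ntype] now has shape (5, 64)
```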
4.2. Intra-metapath Aggregation
Given a metapath $P$, the intra-metapath aggregation layer learns the structural and semantic information embedded in the target node, the metapath-based neighbors, and the context in between, by encoding the metapath instances of $P$. Let $P(v, u)$ be a metapath instance connecting the target node $v$ and its metapath-based neighbor $u \in \mathcal{N}_v^P$; we further define the intermediate nodes of $P(v, u)$ as $\{m^{P(v,u)}\} = P(v, u) \setminus \{v, u\}$. Intra-metapath aggregation employs a special metapath instance encoder to transform all the node features along a metapath instance into a single vector,

$\mathbf{h}_{P(v,u)} = f_\theta(P(v, u)) = f_\theta\left(\mathbf{h}'_v, \mathbf{h}'_u, \{\mathbf{h}'_t, \forall t \in \{m^{P(v,u)}\}\}\right)$  (2)

where $\mathbf{h}_{P(v,u)} \in \mathbb{R}^{d'}$. For simplicity, we use $P(v, u)$ to represent a single instance, although there might be multiple instances connecting the two nodes. Section 4.4 introduces several choices for a qualified metapath instance encoder.
After encoding the metapath instances into vector representations, we adopt a graph attention layer (Velickovic et al., 2018) to compute a weighted sum of the metapath instances of $P$ related to the target node $v$. The key idea is that different metapath instances contribute to the target node's representation to different degrees. We model this by learning a normalized importance weight for each metapath instance and then taking a weighted sum of all instances:
$e_{vu}^P = \mathrm{LeakyReLU}\left(\mathbf{a}_P^\top \cdot [\mathbf{h}'_v \Vert \mathbf{h}_{P(v,u)}]\right), \quad \alpha_{vu}^P = \frac{\exp(e_{vu}^P)}{\sum_{s \in \mathcal{N}_v^P} \exp(e_{vs}^P)}, \quad \mathbf{h}_v^P = \sigma\left(\sum_{u \in \mathcal{N}_v^P} \alpha_{vu}^P \cdot \mathbf{h}_{P(v,u)}\right)$  (3)
Here $\mathbf{a}_P \in \mathbb{R}^{2d'}$ is the parameterized attention vector for metapath $P$, and $\Vert$ denotes the vector concatenation operator. The score $e_{vu}^P$ indicates the importance of metapath instance $P(v, u)$ to node $v$; it is normalized across all choices of $u \in \mathcal{N}_v^P$ using the softmax function. Once the normalized importance weights $\alpha_{vu}^P$ are obtained, they are used to compute a weighted combination of the representations of the metapath instances about node $v$. Finally, the output goes through an activation function $\sigma(\cdot)$.
This attention mechanism can also be extended to multiple heads, which helps to stabilize the learning process and reduce the high variance introduced by the heterogeneity of graphs. That is, we execute $K$ independent attention mechanisms and then concatenate their outputs, resulting in the following formulation:

$\mathbf{h}_v^P = \big\Vert_{k=1}^{K} \sigma\left(\sum_{u \in \mathcal{N}_v^P} [\alpha_{vu}^P]_k \cdot \mathbf{h}_{P(v,u)}\right)$  (4)

where $[\alpha_{vu}^P]_k$ is the normalized importance of metapath instance $P(v, u)$ to node $v$ at the $k$-th attention head.
To sum up, given the projected feature vectors $\mathbf{h}'_v$ and the set of metapaths $\mathcal{P}_A = \{P_1, P_2, \ldots, P_M\}$ that start or end with node type $A$, the intra-metapath aggregation of MAGNN generates $M$ metapath-specific vector representations of the target node $v$, denoted as $\mathbf{h}_v^{P_1}, \mathbf{h}_v^{P_2}, \ldots, \mathbf{h}_v^{P_M}$. Each $\mathbf{h}_v^{P_i} \in \mathbb{R}^{d'}$ (assuming $K = 1$) can be interpreted as a summarization of the metapath-$P_i$ instances about node $v$, exhibiting one aspect of the semantic information contained in node $v$.
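A minimal sketch of Eqs. (2)-(3) for one target node and a single attention head, assuming PyTorch; the stand-in mean encoder, the ELU activation, and the random initialization are assumptions of this sketch rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d = 64                                       # latent dimension d'
a_P = nn.Parameter(torch.randn(2 * d))       # attention vector a_P for metapath P

def encode_instance(instance_feats):
    """Stand-in metapath instance encoder f_theta (here: the mean encoder, Eq. (8))."""
    return instance_feats.mean(dim=0)

def intra_metapath_aggregate(h_v, instances):
    """h_v: (d,) projected feature of the target node v;
    instances: list of (len_p, d) tensors, one per metapath instance P(v, u)."""
    h_p = torch.stack([encode_instance(p) for p in instances])           # Eq. (2)
    e = F.leaky_relu(torch.cat([h_v.expand(len(h_p), -1), h_p], dim=1) @ a_P)
    alpha = torch.softmax(e, dim=0)                                      # Eq. (3): normalized weights
    return F.elu((alpha.unsqueeze(1) * h_p).sum(dim=0))                  # weighted sum + activation

h_v = torch.randn(d)
instances = [torch.randn(4, d), torch.randn(4, d), torch.randn(4, d)]    # dummy instances
h_v_P = intra_metapath_aggregate(h_v, instances)                         # (d,)
```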
4.3. Inter-metapath Aggregation
After aggregating the node and edge data within each metapath, we need to combine the semantic information revealed by all metapaths using an inter-metapath aggregation layer. For a node type $A$, we now have $|\mathcal{V}_A|$ sets of latent vectors: $\{\mathbf{h}_v^{P_1}, \mathbf{h}_v^{P_2}, \ldots, \mathbf{h}_v^{P_M}\}$ for $v \in \mathcal{V}_A$, where $M$ is the number of metapaths for type $A$. One straightforward inter-metapath aggregation approach is to take the element-wise mean of these node vectors. We extend this approach by exploiting the attention mechanism to assign different weights to different metapaths. This operation is reasonable because metapaths are not equally important in a heterogeneous graph.
First, we summarize each metapath $P_i \in \mathcal{P}_A$ by averaging the transformed metapath-specific node vectors for all nodes $v \in \mathcal{V}_A$,

$\mathbf{s}_{P_i} = \frac{1}{|\mathcal{V}_A|} \sum_{v \in \mathcal{V}_A} \tanh\left(\mathbf{M}_A \cdot \mathbf{h}_v^{P_i} + \mathbf{b}_A\right)$  (5)

where $\mathbf{M}_A \in \mathbb{R}^{d_m \times d'}$ and $\mathbf{b}_A \in \mathbb{R}^{d_m}$ are learnable parameters.
Then we use the attention mechanism to fuse the metapath-specific node vectors of $v$ as follows:

$e_{P_i} = \mathbf{q}_A^\top \cdot \mathbf{s}_{P_i}, \quad \beta_{P_i} = \frac{\exp(e_{P_i})}{\sum_{P \in \mathcal{P}_A} \exp(e_P)}, \quad \mathbf{h}_v^{\mathcal{P}_A} = \sum_{P \in \mathcal{P}_A} \beta_P \cdot \mathbf{h}_v^P$  (6)

where $\mathbf{q}_A \in \mathbb{R}^{d_m}$ is the parameterized attention vector for node type $A$. The weight $\beta_{P_i}$ can be interpreted as the relative importance of metapath $P_i$ to type $A$'s nodes. Once $\beta_{P_i}$ is computed for each $P_i \in \mathcal{P}_A$, we take a weighted sum of all the metapath-specific node vectors of $v$.
At last, MAGNN employs an additional linear transformation with a nonlinear function to project the node embeddings to the vector space with the desired output dimension:

$\mathbf{h}_v = \sigma\left(\mathbf{W}_o \cdot \mathbf{h}_v^{\mathcal{P}_A}\right)$  (7)

where $\sigma(\cdot)$ is an activation function, and $\mathbf{W}_o \in \mathbb{R}^{d_o \times d'}$ is a weight matrix. This projection is task-specific. It can be interpreted as a linear classifier for node classification, or regarded as a projection to the space with node similarity measures for link prediction.
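Similarly, Eqs. (5)-(7) might be sketched as follows, assuming PyTorch; the choice of ELU for the output activation and the toy dimensions (except $d_m = 128$, which follows Section 5.2) are our assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d, d_m, d_o = 64, 128, 64                # d' = 64; d_m = 128 as in the experiments

M_A = nn.Linear(d, d_m)                  # M_A and b_A in Eq. (5)
q_A = nn.Parameter(torch.randn(d_m))     # attention vector q_A in Eq. (6)
W_o = nn.Linear(d, d_o, bias=False)      # output projection W_o in Eq. (7)

def inter_metapath_aggregate(h_per_metapath):
    """h_per_metapath: list of (n_nodes, d) metapath-specific embeddings h_v^{P_i}."""
    # Eq. (5): summarize each metapath by averaging its transformed node vectors.
    s = torch.stack([torch.tanh(M_A(h)).mean(dim=0) for h in h_per_metapath])
    # Eq. (6): attention scores e_{P_i} = q_A^T s_{P_i}, normalized into beta.
    beta = torch.softmax(s @ q_A, dim=0)
    fused = sum(b * h for b, h in zip(beta, h_per_metapath))  # weighted sum over metapaths
    # Eq. (7): final task-specific projection with a nonlinearity (ELU assumed here).
    return F.elu(W_o(fused))

h = [torch.randn(100, d) for _ in range(3)]  # 3 metapaths, 100 dummy nodes of type A
out = inter_metapath_aggregate(h)            # (100, d_o)
```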
4.4. Metapath Instance Encoders
To encode the metapath instances introduced in Section 4.2, we examine three candidate encoder functions:

Mean encoder. This function takes the element-wise mean of the node vectors along the metapath instance $P(v, u)$:

$\mathbf{h}_{P(v,u)} = \mathrm{MEAN}\left(\{\mathbf{h}'_t, \forall t \in P(v, u)\}\right)$  (8)
Linear encoder. This function extends the mean encoder by appending a linear transformation:

$\mathbf{h}_{P(v,u)} = \mathbf{W}_P \cdot \mathrm{MEAN}\left(\{\mathbf{h}'_t, \forall t \in P(v, u)\}\right)$  (9)
Relational rotation encoder. We also examine a metapath instance encoder based on relational rotation in complex space, an operation proposed by RotatE (Sun et al., 2019) for knowledge graph embedding. The mean and linear encoders introduced above essentially treat the metapath instance as a set, and thus ignore the information embedded in the sequential structure of the metapath. Relational rotation provides a way to model this kind of knowledge. Given a metapath instance $P(v, u) = (t_0, t_1, \ldots, t_n)$ with $t_0 = u$ and $t_n = v$, let $R_i$ be the relation between node $t_{i-1}$ and node $t_i$, and let $\mathbf{r}_i$ be the relation vector of $R_i$; the relational rotation encoder is formulated as:

$\mathbf{o}_0 = \mathbf{h}'_{t_0}, \quad \mathbf{o}_i = \mathbf{h}'_{t_i} + \mathbf{o}_{i-1} \odot \mathbf{r}_i, \quad \mathbf{h}_{P(v,u)} = \frac{\mathbf{o}_n}{n + 1}$  (10)

where $\mathbf{h}'_{t_i}$ and $\mathbf{r}_i$ are both complex vectors, and $\odot$ is the element-wise product. We can easily interpret a real vector of dimension $d'$ as a complex vector of dimension $d'/2$ by treating the first half of the vector as the real part and the second half as the imaginary part.
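A sketch of this encoder, assuming PyTorch's complex tensor support; the relation vectors are taken as given here, whereas in the model they would be learnable parameters:

```python
import torch

def as_complex(x):
    """View a real (…, d) tensor as a complex (…, d/2) tensor:
    first half = real part, second half = imaginary part."""
    re, im = x.chunk(2, dim=-1)
    return torch.complex(re, im)

def as_real(z):
    return torch.cat([z.real, z.imag], dim=-1)

def rotation_encoder(node_feats, relation_vecs):
    """node_feats: (n+1, d) features along the instance, ordered u = t_0, ..., t_n = v;
    relation_vecs: (n, d) one relation vector per consecutive node pair."""
    h = as_complex(node_feats)
    r = as_complex(relation_vecs)
    o = h[0]                                # o_0 = h'_{t_0}
    for i in range(1, h.shape[0]):
        o = h[i] + o * r[i - 1]             # o_i = h'_{t_i} + o_{i-1} ⊙ r_i
    return as_real(o / h.shape[0])          # h_{P(v,u)} = o_n / (n + 1)

feats = torch.randn(4, 64)   # a 4-node instance, e.g. Bob-Beatles-Rock-Queen
rels = torch.randn(3, 64)    # one relation vector per hop (learned in practice)
vec = rotation_encoder(feats, rels)   # (64,)
```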
4.5. Training
After applying the components introduced in the previous sections, we obtain the final node representations, which can then be used in different downstream tasks. Depending on the characteristics of the task and the availability of node labels, we can train MAGNN in two major learning paradigms, i.e., semi-supervised learning and unsupervised learning.
For semi-supervised learning, guided by a small fraction of labeled nodes, we can optimize the model weights by minimizing the cross entropy via backpropagation and gradient descent, and thereby learn meaningful node embeddings for heterogeneous graphs. The cross entropy loss for this semi-supervised learning is formulated as:

$\mathcal{L} = -\sum_{v \in \mathcal{V}_L} \sum_{c=1}^{C} \mathbf{y}_v[c] \cdot \log \mathbf{h}_v[c]$  (11)

where $\mathcal{V}_L$ is the set of nodes that have labels, $C$ is the number of classes, $\mathbf{y}_v$ is the one-hot label vector of node $v$, and $\mathbf{h}_v$ is the predicted probability vector of node $v$.
For unsupervised learning, without any node labels, we can optimize the model weights by minimizing the following loss function through negative sampling (Mikolov et al., 2013b):

$\mathcal{L} = -\sum_{(u,v) \in \Omega} \log \sigma\left(\mathbf{h}_u^\top \cdot \mathbf{h}_v\right) - \sum_{(u',v') \in \Omega^-} \log \sigma\left(-\mathbf{h}_{u'}^\top \cdot \mathbf{h}_{v'}\right)$  (12)

where $\sigma(\cdot)$ is the sigmoid function, $\Omega$ is the set of observed (positive) node pairs, and $\Omega^-$ is the set of negative node pairs sampled from all unobserved node pairs (the complement of $\Omega$).
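The objective in Eq. (12) translates almost directly into code. The sketch below assumes PyTorch and averages instead of summing over pairs (a scaling choice of ours, not the paper's):

```python
import torch
import torch.nn.functional as F

def unsupervised_loss(h, pos_pairs, neg_pairs):
    """h: (n_nodes, d) final node embeddings;
    pos_pairs / neg_pairs: (m, 2) long tensors of node index pairs."""
    pos_score = (h[pos_pairs[:, 0]] * h[pos_pairs[:, 1]]).sum(dim=1)   # h_u^T h_v
    neg_score = (h[neg_pairs[:, 0]] * h[neg_pairs[:, 1]]).sum(dim=1)
    # -log sigma(h_u^T h_v) for observed pairs, -log sigma(-h_u'^T h_v') for negatives
    return -F.logsigmoid(pos_score).mean() - F.logsigmoid(-neg_score).mean()

h = torch.randn(50, 64, requires_grad=True)       # dummy embeddings
pos = torch.randint(0, 50, (32, 2))               # dummy observed pairs
neg = torch.randint(0, 50, (32, 2))               # dummy sampled negatives
loss = unsupervised_loss(h, pos, neg)             # scalar, ready for backward()
```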
5. Experiments
In this section, we present experiments to demonstrate the efficacy of MAGNN for heterogeneous graph embedding. The experiments aim to address the following research questions:

RQ1. How does MAGNN perform in classifying nodes?

RQ2. How does MAGNN perform in clustering nodes?

RQ3. How does MAGNN perform in predicting plausible links between node pairs?

RQ4. What is the impact of the three major components of MAGNN described in the previous section?

RQ5. How do we understand the representation capability of different graph embedding methods?
5.1. Datasets
We adopt three widely used heterogeneous graph datasets from different domains to evaluate the performance of MAGNN against state-of-the-art baselines. Specifically, the IMDb and DBLP datasets are used in the experiments for node classification and node clustering, and the Last.fm dataset is used in the experiments for link prediction. Simple statistics of the three datasets are summarized in Table 2, and the network schemas are illustrated in Figure 3. We assign one-hot id vectors to nodes with no attributes as their dummy input features.

IMDb (https://www.imdb.com/) is an online database about movies and television programs, including information such as cast, production crew, and plot summaries. We use a subset of IMDb scraped online, containing 4278 movies, 2081 directors, and 5257 actors after data preprocessing. Movies are labeled as one of three classes (Action, Comedy, and Drama) based on their genre information. Each movie is also described by a bag-of-words representation of its plot keywords. For semi-supervised learning models, the movie nodes are divided into training, validation, and testing sets of 400 (9.35%), 400 (9.35%), and 3478 (81.30%) nodes, respectively.

DBLP (https://dblp.uni-trier.de/) is a computer science bibliography website. We adopt a subset of DBLP extracted by (Gao et al., 2009; Ji et al., 2010), containing 4057 authors, 14328 papers, 7723 terms, and 20 publication venues after data preprocessing. The authors are divided into four research areas (Database, Data Mining, Artificial Intelligence, and Information Retrieval). Each author is described by a bag-of-words representation of their paper keywords. For semi-supervised learning models, the author nodes are divided into training, validation, and testing sets of 400 (9.86%), 400 (9.86%), and 3257 (80.28%) nodes, respectively.

Last.fm (https://www.last.fm/) is a music website that keeps track of users' listening information from various sources. We adopt a dataset released by HetRec 2011 (Cantador et al., 2011), consisting of 1892 users, 17632 artists, and 1088 artist tags after data preprocessing. This dataset is used for the link prediction task; no labels or features are included. For semi-supervised learning models, the user-artist pairs are divided into training, validation, and testing sets of 64984 (70%), 9283 (10%), and 18567 (20%) pairs, respectively.
Table 2. Statistics of the datasets.

Dataset  Nodes  Edges  Metapaths
IMDb  # movie: 4278; # director: 2081; # actor: 5257  # movie-director: 4278; # movie-actor: 12828  MDM; MAM; DMD; DMAMD; AMA; AMDMA
DBLP  # author: 4057; # paper: 14328; # term: 7723; # venue: 20  # author-paper: 19645; # paper-term: 85810; # paper-venue: 14328  APA; APTPA; APVPA
Last.fm  # user: 1892; # artist: 17632; # tag: 1088  # user-artist: 92834; # user-user: 25434; # artist-tag: 23253  UU; UAU; UATAU; AUA; AUUA; ATA
5.2. Baselines
Table 3. Experiment results (%) on the IMDb and DBLP datasets for the node classification task. LINE, node2vec, ESim, metapath2vec, and HERec are unsupervised; GCN, GAT, HAN, and MAGNN are semi-supervised.

Dataset  Metrics  Train %  LINE  node2vec  ESim  metapath2vec  HERec  GCN  GAT  HAN  MAGNN
IMDb  Macro-F1  20%  44.04  49.00  48.37  46.05  45.61  52.73  53.64  56.19  59.35
IMDb  Macro-F1  40%  45.45  50.63  50.09  47.57  46.80  53.67  55.50  56.15  60.27
IMDb  Macro-F1  60%  47.09  51.65  51.45  48.17  46.84  54.24  56.46  57.29  60.66
IMDb  Macro-F1  80%  47.49  51.49  51.37  49.99  47.73  54.77  57.43  58.51  61.44
IMDb  Micro-F1  20%  45.21  49.94  49.32  47.22  46.23  52.80  53.64  56.32  59.60
IMDb  Micro-F1  40%  46.92  51.77  51.21  48.17  47.89  53.76  55.56  57.32  60.50
IMDb  Micro-F1  60%  48.35  52.79  52.53  49.87  48.19  54.23  56.47  58.42  60.88
IMDb  Micro-F1  80%  48.98  52.72  52.54  50.50  49.11  54.63  57.40  59.24  61.53
DBLP  Macro-F1  20%  87.16  86.70  90.68  88.47  90.82  88.00  91.05  91.69  93.13
DBLP  Macro-F1  40%  88.85  88.07  91.61  89.91  91.44  89.00  91.24  91.96  93.23
DBLP  Macro-F1  60%  88.93  88.69  91.84  90.50  92.08  89.43  91.42  92.14  93.57
DBLP  Macro-F1  80%  89.51  88.93  92.27  90.86  92.25  89.98  91.73  92.50  94.10
DBLP  Micro-F1  20%  87.68  87.21  91.21  89.02  91.49  88.51  91.61  92.33  93.61
DBLP  Micro-F1  40%  89.25  88.51  92.05  90.36  92.05  89.22  91.77  92.57  93.68
DBLP  Micro-F1  60%  89.34  89.09  92.28  90.94  92.66  89.57  91.97  92.72  93.99
DBLP  Micro-F1  80%  89.96  89.37  92.68  91.31  92.78  90.33  92.24  93.23  94.47
We compare MAGNN against different kinds of graph embedding models, including traditional (as opposed to GNN-based) homogeneous graph embedding models, traditional heterogeneous graph embedding models, GNNs for homogeneous graphs, and GNNs for heterogeneous graphs. We denote them as traditional homogeneous models, traditional heterogeneous models, homogeneous GNNs, and heterogeneous GNNs, respectively. The baseline models are listed as follows.

LINE (Tang et al., 2015) is a traditional homogeneous model exploiting the first-order and second-order proximity between nodes. We apply it to heterogeneous graphs by ignoring the heterogeneity of the graph structure and dropping all node content features. The LINE variant using second-order proximity is applied in our experiments.

node2vec (Grover and Leskovec, 2016) is a traditional homogeneous model that generalizes DeepWalk (Perozzi et al., 2014) with biased random walks. As with LINE, we apply it to heterogeneous graphs by ignoring the heterogeneity of the graph structure and dropping all node content features.

ESim (Shang et al., 2016) is a traditional heterogeneous model that learns node embeddings from sampled metapath instances. ESim requires a predefined weight for each metapath. Here we assign equal weights to all metapaths because searching for the optimal weights of metapaths is difficult, and does not provide a significant performance gain over equal weights according to the authors’ experiments.

metapath2vec (Dong et al., 2017) is a traditional heterogeneous model that generates node embeddings by feeding metapath-guided random walks to a skip-gram model. This model relies on a single user-specified metapath, so we test all metapaths separately and report the one with the best results. We use the metapath2vec++ variant in our experiments.

HERec (Shi et al., 2019) is a traditional heterogeneous model that learns node embeddings by applying DeepWalk to the metapathbased homogeneous graphs converted from the original heterogeneous graph. This model comes with an embedding fusion algorithm designed for rating prediction, which can be adapted to link prediction. For node classification/clustering, we select and report the metapath with the best performance.

GCN (Kipf and Welling, 2017) is a homogeneous GNN that performs convolution operations in the graph Fourier domain. Here we test GCN on metapath-based homogeneous graphs and report the results from the best metapath.

GAT (Velickovic et al., 2018) is a homogeneous GNN that performs convolution operations in the graph spatial domain with the attention mechanism incorporated. Similarly, we test GAT on metapath-based homogeneous graphs and report the results from the best metapath.

GATNE (Cen et al., 2019) is a heterogeneous GNN. It generates a node's representation from the base embedding and the edge embeddings, with a focus on the link prediction task. Here we report the results from the best-performing GATNE variant.

HAN (Wang et al., 2019b) is a heterogeneous GNN. It learns metapath-specific node embeddings from different metapath-based homogeneous graphs and leverages the attention mechanism to combine them into one vector representation for each node.
For traditional models, including LINE, node2vec, ESim, metapath2vec, and HERec, we set the window size to 5, walk length to 100, walks per node to 40, and number of negative samples to 5, if applicable. For GNNs, including GCN, GAT, HAN, and our proposed MAGNN, we set the dropout rate to 0.5; we use the same splits of training, validation, and testing sets; we employ the Adam optimizer with the learning rate set to 0.005 and the weight decay (L2 penalty) set to 0.001; we train the GNNs for 100 epochs and apply early stopping with a patience of 30. For node classification and node clustering, the GNNs are trained in a semi-supervised fashion with a small fraction of labeled nodes as guidance. For GAT, HAN, and MAGNN, we set the number of attention heads to 8. For HAN and MAGNN, we set the dimension of the attention vector in inter-metapath aggregation to 128. For a fair comparison, we set the embedding dimension of all the models mentioned above to 64.
Table 4. Experiment results (%) on the IMDb and DBLP datasets for the node clustering task. LINE, node2vec, ESim, metapath2vec, and HERec are unsupervised; GCN, GAT, HAN, and MAGNN are semi-supervised.

Dataset  Metrics  LINE  node2vec  ESim  metapath2vec  HERec  GCN  GAT  HAN  MAGNN
IMDb  NMI  1.13  5.22  1.07  0.89  0.39  7.46  7.84  10.79  15.58
IMDb  ARI  1.20  6.02  1.01  0.22  0.11  7.69  8.87  11.11  16.74
DBLP  NMI  71.02  77.01  68.33  74.18  69.03  73.45  70.73  77.49  80.81
DBLP  ARI  76.52  81.37  72.22  78.11  72.45  77.50  76.04  82.95  85.54
Table 5. Experiment results (%) on the Last.fm dataset for the link prediction task.

Dataset  Metrics  LINE  node2vec  ESim  metapath2vec  HERec  GCN  GAT  GATNE  HAN  MAGNN
Last.fm  AUC  85.76  67.14  82.00  92.20  91.52  90.97  92.36  89.21  93.40  98.91
Last.fm  AP  88.07  64.11  82.19  90.11  89.47  91.65  91.55  88.86  92.44  98.93
5.3. Node Classification (RQ1)
We conduct experiments on the IMDb and DBLP datasets to compare the performance of different models on the node classification task. We feed the embeddings of labeled nodes (movies in IMDb and authors in DBLP) generated by each learning model to a linear support vector machine (SVM) classifier with varying training proportions. Note that for a fair comparison, only the nodes in the testing set are fed to the linear SVM, because the semi-supervised models have already "seen" the nodes in the training and validation sets, as shown in Equation 11. Hence, the training and testing proportions of the linear SVM here concern only the testing set (i.e., 3478 nodes for IMDb and 3257 nodes for DBLP). The train/test splits for the linear SVM are also the same across embedding models. Similar strategies are applied in the experiments for node clustering and link prediction. We report the average Macro-F1 and Micro-F1 of 10 runs of each embedding model in Table 3.

As shown in the table, MAGNN performs consistently better than the other baselines across different training proportions and datasets. On IMDb, it is interesting to see that node2vec performs better than the traditional heterogeneous models. That said, GNNs, especially heterogeneous GNNs, obtain even better results, demonstrating that the GNN architecture, which judiciously utilizes the heterogeneous node features, helps improve the embedding performance. The performance gain obtained by MAGNN over the best baseline (HAN) is around 4%-7%, which indicates that metapath instances contain richer information than metapath-based neighbors. On DBLP, the node classification task is trivial, as evident from the high scores of all models. Even so, MAGNN still outperforms the strongest baseline by 1%-2%.
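For concreteness, the SVM evaluation protocol described above might be sketched as follows, assuming scikit-learn; the random inputs merely stand in for learned embeddings and true labels:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

def evaluate_classification(emb, labels, train_proportion, seed=0):
    """Split test-set embeddings, fit a linear SVM, and report Macro-F1/Micro-F1."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        emb, labels, train_size=train_proportion, random_state=seed)
    pred = LinearSVC().fit(X_tr, y_tr).predict(X_te)
    return (f1_score(y_te, pred, average="macro"),
            f1_score(y_te, pred, average="micro"))

emb = np.random.randn(3478, 64)          # stand-in for learned IMDb movie embeddings
labels = np.random.randint(0, 3, 3478)   # stand-in for the three genre classes
print(evaluate_classification(emb, labels, train_proportion=0.2))
```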
5.4. Node Clustering (RQ2)
We conduct experiments on the IMDb and DBLP datasets to compare the performance of different models on the node clustering task. We feed the embeddings of labeled nodes (movies in IMDb and authors in DBLP) generated by each learning model to the K-Means algorithm. The number of clusters in K-Means is set to the number of classes for each dataset, i.e., 3 for IMDb and 4 for DBLP. We employ the normalized mutual information (NMI) and the adjusted Rand index (ARI) as the evaluation metrics. Since the clustering result of the K-Means algorithm is highly dependent on the initialization of the centroids, we repeat K-Means 10 times for each run of the embedding model, and each embedding model is tested for 10 runs. We report the averaged results in Table 4.

From Table 4, we can see that MAGNN is consistently superior to all the other baselines in node clustering. Note that all models perform much worse on IMDb than on DBLP. This is presumably because of the dirty labels of movies in IMDb: every movie node in the original IMDb dataset has multiple genres, and we only choose the very first one as its class label. We can also see that the traditional heterogeneous models do not have many advantages over the traditional homogeneous models in node clustering. Node2vec is expected to perform strongly in the node clustering task because, being a random-walk-based approach, it forces nodes that are close in the graph also to be close in the embedding space (You et al., 2019), and thereby encodes node positional information. This property implicitly facilitates the K-Means algorithm, which clusters nodes based on the Euclidean distances between embeddings. Despite this, the heterogeneity-aware GNNs (i.e., HAN and MAGNN) still rank first in node clustering on both datasets.
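The clustering protocol can be sketched similarly, again assuming scikit-learn and stand-in inputs:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

def evaluate_clustering(emb, labels, n_classes, repeats=10):
    """Run K-Means `repeats` times (it is sensitive to centroid initialization)
    and report the averaged NMI and ARI against the ground-truth labels."""
    nmi, ari = [], []
    for seed in range(repeats):
        pred = KMeans(n_clusters=n_classes, n_init=10, random_state=seed).fit_predict(emb)
        nmi.append(normalized_mutual_info_score(labels, pred))
        ari.append(adjusted_rand_score(labels, pred))
    return np.mean(nmi), np.mean(ari)

emb = np.random.randn(500, 64)           # stand-in embeddings
labels = np.random.randint(0, 3, 500)    # stand-in class labels (3 classes as in IMDb)
print(evaluate_clustering(emb, labels, n_classes=3))
```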
5.5. Link Prediction (RQ3)
We also conduct experiments on the Last.fm dataset to evaluate the performance of MAGNN and the other baselines on the link prediction task. For the GNNs, we treat the connected user-artist pairs as positive node pairs and consider all unconnected user-artist links as negative node pairs. We add the same number of randomly sampled negative node pairs to the validation and testing sets. During GNN training, negative node pairs are also uniformly sampled on the fly. The GNNs are then optimized by minimizing the objective function described in Equation 12.
Given the user embedding $\mathbf{h}_u$ and the artist embedding $\mathbf{h}_v$ generated by the trained model, we calculate the probability that $u$ and $v$ are linked as follows:

$p_{uv} = \sigma\left(\mathbf{h}_u^\top \cdot \mathbf{h}_v\right)$  (13)

where $\sigma(\cdot)$ is the sigmoid function. The embedding models for link prediction are evaluated by the area under the ROC curve (AUC) and average precision (AP) scores. We report the averaged results of 10 runs of each embedding model in Table 5.
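Eq. (13) and the AUC/AP evaluation might be sketched as follows, assuming PyTorch and scikit-learn with stand-in embeddings:

```python
import torch
from sklearn.metrics import roc_auc_score, average_precision_score

def link_probability(h_u, h_v):
    """Eq. (13): p_uv = sigmoid(h_u^T h_v), computed row-wise for batches of pairs."""
    return torch.sigmoid((h_u * h_v).sum(dim=-1))

h_u = torch.randn(100, 64)                   # stand-in user embeddings for 100 test pairs
h_v = torch.randn(100, 64)                   # stand-in artist embeddings
y = torch.randint(0, 2, (100,)).numpy()      # 1 = observed pair, 0 = sampled negative
scores = link_probability(h_u, h_v).numpy()
print(roc_auc_score(y, scores), average_precision_score(y, scores))
```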
From Table 5, MAGNN outperforms other baseline models by a large margin. The strongest traditional model here is metapath2vec, which learns from node sequences generated from random walks guided by a single metapath. MAGNN achieves better scores than metapath2vec, showing that considering a single metapath is suboptimal. Among GNN baselines, HAN obtains the best results because it is heterogeneityaware and combines multiple metapaths. Our MAGNN achieves a relative improvement of around 6% over HAN. This result supports our claim that the metapath contexts of nodes are critical to the node embeddings.
Table 6. Ablation results (%) of MAGNN variants. Node classification scores (Macro-F1 and Micro-F1) are averaged over all training proportions.

Variant  IMDb  DBLP  Last.fm
  Macro-F1  Micro-F1  NMI  ARI  Macro-F1  Micro-F1  NMI  ARI  AUC  AP
MAGNN w/o node features  48.87  50.36  5.82  5.30  92.80  93.32  77.17  82.15  N/A  N/A
MAGNN w/ neighbors only  58.45  58.84  12.87  11.98  92.61  93.15  77.64  82.60  93.68  92.95
MAGNN w/ single metapath  56.77  56.64  11.90  11.84  93.19  93.69  79.48  84.39  92.54  91.52
MAGNN w/ mean encoder  59.66  59.78  13.64  15.27  93.13  93.44  79.31  84.30  98.63  98.57
MAGNN w/ linear encoder  57.80  57.96  9.80  8.49  93.21  93.52  78.95  83.89  98.56  98.48
MAGNN (proposed, rotation encoder)  60.43  60.63  15.58  16.74  93.51  93.94  80.81  85.54  98.91  98.93
5.6. Ablation Study (RQ4)
To validate the effectiveness of each component of our model, we conduct further experiments on different MAGNN variants; Table 6 reports the results obtained on all three datasets and tasks. Note that every reported node classification score (i.e., Macro-F1 and Micro-F1) is an average over the different training proportions explained in Section 5.3. MAGNN (proposed, rotation encoder) is our full model using the relational rotation encoder, i.e., the one compared against the other baselines in Tables 3, 4, and 5. Taking it as the reference, MAGNN w/o node features does not utilize node content features; MAGNN w/ neighbors only considers only the two end nodes of each metapath instance (i.e., the metapath-based neighbors); MAGNN w/ single metapath considers only the single best metapath; MAGNN w/ mean encoder and MAGNN w/ linear encoder switch the metapath instance encoder to the mean encoder and the linear encoder, respectively. Except for the above-mentioned differences, all other settings are the same across these variants. Note that MAGNN w/o node features is equivalent to the full model on Last.fm (hence the N/A entries) because this dataset does not contain node attributes.
As can be seen, by utilizing the node content features, the full model obtains a significant performance improvement over MAGNN w/o node features, which shows the necessity of applying node content transformation to incorporate node features. Comparing MAGNN w/ neighbors only against the variants that encode full metapath instances, we see that aggregating metapath instances rather than metapath-based neighbors brings a boost in performance, which validates the efficacy of intra-metapath aggregation. Next, the difference between the results of the full model and MAGNN w/ single metapath reveals that the model performance improves considerably when multiple metapaths are combined in inter-metapath aggregation. Finally, the results of the mean, linear, and rotation encoder variants suggest that the relational rotation encoder improves MAGNN by a small margin. It is interesting to see that the linear encoder performs worse than the mean encoder. Nonetheless, all three MAGNN variants using different encoders still consistently outperform the best baseline, HAN.
5.7. Visualization (RQ5)
In addition to the quantitative evaluations of embedding models, we also visualize node embeddings to conduct a qualitative assessment of the embedding results. We randomly select 30 user-artist pairs from the positive testing set of the Last.fm dataset and project the embeddings of these nodes into a 2-dimensional space using t-SNE. We illustrate the visualization results of LINE, ESim, GCN, and MAGNN in Figure 4, where red points and green points indicate users and artists, respectively.
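This visualization step might be sketched as follows, assuming scikit-learn's t-SNE and matplotlib, with random embeddings standing in for the learned ones:

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

n_pairs, d = 30, 64
emb = np.random.randn(2 * n_pairs, d)   # stand-in embeddings: 30 users then 30 artists
xy = TSNE(n_components=2, random_state=0).fit_transform(emb)

plt.scatter(xy[:n_pairs, 0], xy[:n_pairs, 1], c="red", label="users")
plt.scatter(xy[n_pairs:, 0], xy[n_pairs:, 1], c="green", label="artists")
plt.legend()
plt.show()
```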
Based on this visualization, one can quickly tell the differences among the graph embedding models in terms of their ability to learn from heterogeneous graphs. As a traditional homogeneous graph embedding model, LINE cannot effectively divide the user nodes and the artist nodes into two groups. In contrast, ESim, a traditional heterogeneous model, can roughly partition the two types of nodes. Thanks to the powerful GNN architecture and the choice of appropriate metapaths, a homogeneous GNN such as GCN can isolate the different types of nodes and encode the correlation information of the user-artist pairs into the node embeddings. From Figure 4, we can see that our proposed MAGNN obtains the best embedding results, with two well-separated user and artist groups and an aligned correlation of user-artist pairs.
6. Conclusion
In this paper, we propose a novel metapath aggregated graph neural network (MAGNN) to address three characteristic limitations of existing heterogeneous graph embedding methods, namely (1) dropping node content features, (2) discarding intermediate nodes along metapaths, and (3) considering only a single metapath. To be specific, MAGNN applies three building-block components: (1) node content transformation, (2) intra-metapath aggregation, and (3) inter-metapath aggregation, to deal with each of these limitations, respectively. Additionally, we define the notion of metapath instance encoders, which are used to extract the structural and semantic information ingrained in metapath instances, and we propose several candidate encoder functions, including one inspired by the RotatE knowledge graph embedding model (Sun et al., 2019). In the experiments, MAGNN achieves state-of-the-art results on three real-world datasets in the node classification, node clustering, and link prediction tasks. Ablation studies also demonstrate the efficacy of the three major components of MAGNN in boosting embedding performance. We plan to adapt this heterogeneous graph embedding framework to the rating prediction (recommendation) task, with the user-item data assisted by a heterogeneous knowledge graph (Wang et al., 2019a).
Acknowledgements.
The work described in this paper was partially supported by the Research Grants Council of the Hong Kong Special Administrative Region, China (CUHK 2300174 (Collaborative Research Fund, No. C5026-18GF) and CUHK 3133238 (Research Sustainability of Major RGC Funding Schemes)).

References
Atwood and Towsley (2016). Diffusion-convolutional neural networks. In NIPS, pp. 1993–2001.
Battaglia et al. (2016). Interaction networks for learning about objects, relations and physics. In NIPS, pp. 4502–4510.
Bordes et al. (2013). Translating embeddings for modeling multi-relational data. In NIPS, pp. 2787–2795.
Cantador et al. (2011). 2nd workshop on information heterogeneity and fusion in recommender systems (HetRec 2011). In RecSys.
Cen et al. (2019). Representation learning for attributed multiplex heterogeneous network. In SIGKDD, pp. 1358–1368.
Chen et al. (2018). PME: projected metric embedding on heterogeneous networks for link prediction. In SIGKDD, pp. 1177–1186.
Cho et al. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR abs/1406.1078.
Defferrard et al. (2016). Convolutional neural networks on graphs with fast localized spectral filtering. In NIPS, pp. 3844–3852.
Dong et al. (2017). Metapath2Vec: scalable representation learning for heterogeneous networks. In SIGKDD, pp. 135–144.
Fout et al. (2017). Protein interface prediction using graph convolutional networks. In NIPS, pp. 6530–6539.
Fu et al. (2017). HIN2Vec: explore metapaths in heterogeneous information networks for representation learning. In CIKM, pp. 1797–1806.
Gao et al. (2009). Graph-based consensus maximization among multiple supervised and unsupervised models. In NIPS, pp. 585–593.
Grover and Leskovec (2016). Node2Vec: scalable feature learning for networks. In SIGKDD, pp. 855–864.
Hamilton et al. (2017). Inductive representation learning on large graphs. In NIPS, pp. 1024–1034.
Ji et al. (2010). Graph regularized transductive classification on heterogeneous information networks. In ECML PKDD, pp. 570–586.
Kipf and Welling (2017). Semi-supervised classification with graph convolutional networks. In ICLR.
Li et al. (2018). Diffusion convolutional recurrent neural network: data-driven traffic forecasting. In ICLR.
Li et al. (2016). Gated graph sequence neural networks. In ICLR.
Mikolov et al. (2013a). Efficient estimation of word representations in vector space. In ICLR.
Mikolov et al. (2013b). Distributed representations of words and phrases and their compositionality. In NIPS, pp. 3111–3119.
Perozzi et al. (2014). DeepWalk: online learning of social representations. In SIGKDD, pp. 701–710.
Shang et al. (2016). Meta-path guided embedding for similarity search in large-scale heterogeneous information networks. CoRR abs/1610.09769.
Shi et al. (2019). Heterogeneous information network embedding for recommendation. IEEE Transactions on Knowledge and Data Engineering 31(2), pp. 357–370.
Sun et al. (2019). RotatE: knowledge graph embedding by relational rotation in complex space. In ICLR.
Tang et al. (2015). LINE: large-scale information network embedding. In WWW, pp. 1067–1077.
van den Berg et al. (2017). Graph convolutional matrix completion. CoRR abs/1706.02263.
Vaswani et al. (2017). Attention is all you need. In NIPS, pp. 5998–6008.
Velickovic et al. (2018). Graph attention networks. In ICLR.
Wang et al. (2016). Structural deep network embedding. In SIGKDD, pp. 1225–1234.
Wang et al. (2019a). Knowledge graph convolutional networks for recommender systems. In WWW, pp. 3307–3313.
Wang et al. (2019b). Heterogeneous graph attention network. In WWW, pp. 2022–2032.
Yang et al. (2015). Network representation learning with rich text information. In IJCAI, pp. 2111–2117.
You et al. (2019). Position-aware graph neural networks. In ICML, pp. 7134–7143.
Zhang et al. (2018). GaAN: gated attention networks for learning on large and spatiotemporal graphs. In UAI, pp. 339–349.
Zhang et al. (2019). STAR-GCN: stacked and reconstructed graph convolutional networks for recommender systems. In IJCAI, pp. 4264–4270.