MAGNN: Metapath Aggregated Graph Neural Network for Heterogeneous Graph Embedding

by Xinyu Fu, et al.
The Chinese University of Hong Kong

A large number of real-world graphs or networks are inherently heterogeneous, involving a diversity of node types and relation types. Heterogeneous graph embedding aims to embed the rich structural and semantic information of a heterogeneous graph into low-dimensional node representations. Existing models usually define multiple metapaths in a heterogeneous graph to capture the composite relations and guide neighbor selection. However, these models either omit node content features, discard intermediate nodes along the metapath, or only consider one metapath. To address these three limitations, we propose a new model named Metapath Aggregated Graph Neural Network (MAGNN) to boost the final performance. Specifically, MAGNN employs three major components, i.e., the node content transformation to encapsulate input node attributes, the intra-metapath aggregation to incorporate intermediate semantic nodes, and the inter-metapath aggregation to combine messages from multiple metapaths. Extensive experiments on three real-world heterogeneous graph datasets for node classification, node clustering, and link prediction show that MAGNN achieves more accurate prediction results than state-of-the-art baselines.




1. Introduction

Many real-world datasets are naturally represented in a graph data structure, where objects and the relationships among them are embodied by nodes and edges, respectively. Examples include social networks (Wang et al., 2016; Hamilton et al., 2017), physical systems (Battaglia et al., 2016; Fout et al., 2017), traffic networks (Li et al., 2018; Zhang et al., 2018), citation networks (Atwood and Towsley, 2016; Kipf and Welling, 2017; Hamilton et al., 2017), recommender systems (van den Berg et al., 2017; Zhang et al., 2019), knowledge graphs (Bordes et al., 2013; Sun et al., 2019), and so on. The unique non-Euclidean nature of graphs renders them difficult to model with traditional machine learning models. The neighborhood set of each node has no order or size limit, whereas most statistical models assume an ordered, fixed-size input lying in Euclidean space. Therefore, it would be beneficial if nodes could be represented by meaningful low-dimensional vectors in Euclidean space and then be taken as the input for other machine learning models.

Different graph embedding techniques have been proposed for the graph structure. LINE (Tang et al., 2015) generates node embeddings by exploiting the first-order and second-order proximity between nodes. Random-walk-based methods including DeepWalk (Perozzi et al., 2014), node2vec (Grover and Leskovec, 2016), and TADW (Yang et al., 2015) feed node sequences generated by random walks to a skip-gram model (Mikolov et al., 2013a) to learn node embeddings. With the rapid development of deep learning, graph neural networks (GNNs) have been proposed, which learn graph representations using specially designed neural layers. Spectral-based GNNs, including ChebNet (Defferrard et al., 2016) and GCN (Kipf and Welling, 2017), perform graph convolution operations in the Fourier domain of an entire graph. Recent spatial-based GNNs, including GraphSAGE (Hamilton et al., 2017), GAT (Velickovic et al., 2018), and many other variants (Li et al., 2016; Zhang et al., 2018, 2019), address the issues around scalability and generalization ability of the spectral-based models by performing graph convolution operations directly in the graph domain. An increasing number of researchers have paid attention to this promising area.

Although GNNs have achieved state-of-the-art results in many tasks, most GNN-based models assume that the input is a homogeneous graph with only one node type and one edge type. Most real-world graphs consist of various types of nodes and edges associated with attributes in different feature spaces. For example, a co-authorship network contains at least two types of nodes, namely authors and papers. Author attributes may include affiliations, citations, and research fields. Paper attributes may consist of keywords, venue, year, and so on. We refer to graphs of this kind as heterogeneous information networks (HINs) or heterogeneous graphs. The heterogeneity in both graph structure and node content makes it challenging for GNNs to encode their rich and diverse information into a low-dimensional vector space.

Most existing heterogeneous graph embedding methods are based on the idea of metapaths. A metapath is an ordered sequence of node types and edge types defined on the network schema, which describes a composite relation between the node types involved. For example, in a scholar network with authors, papers, and venues, Author-Paper-Author (APA) and Author-Paper-Venue-Paper-Author (APVPA) are metapaths describing two different relations among authors. The APA metapath associates two co-authors, while the APVPA metapath associates two authors who published papers in the same venue. Therefore, we can view a metapath as a high-order proximity between two nodes. Because traditional GNNs treat all nodes equally, they are unable to model the complex structural and semantic information in heterogeneous graphs.
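To make the notion concrete, the following sketch enumerates APA metapath instances in a tiny scholar network. The data here is a hypothetical toy example, not one of the paper's datasets:

```python
from itertools import product

# A toy scholar network: the authors who wrote each paper (hypothetical data).
paper_authors = {
    "p1": ["a1", "a2"],
    "p2": ["a2", "a3"],
}

def apa_instances(paper_authors):
    """Enumerate Author-Paper-Author (APA) metapath instances."""
    instances = []
    for paper, authors in paper_authors.items():
        for a, b in product(authors, authors):
            if a != b:  # the two endpoint authors must differ
                instances.append((a, paper, b))
    return instances

print(sorted(apa_instances(paper_authors)))
# Each instance links two co-authors through the paper they share.
```

Every instance realizes the composite Author-Paper-Author relation; the APVPA metapath would be enumerated analogously by also traversing paper-venue edges.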

Although these metapath-based embedding methods outperform traditional network embedding methods on various tasks, such as node classification and link prediction, they still suffer from at least one of the following limitations. (1) The model does not leverage node content features, so it rarely performs well on heterogeneous graphs with rich node content features (e.g., metapath2vec (Dong et al., 2017), ESim (Shang et al., 2016), HIN2vec (Fu et al., 2017), and HERec (Shi et al., 2019)). (2) The model discards all intermediate nodes along the metapath by only considering two end nodes, which results in information loss (e.g., HERec (Shi et al., 2019) and HAN (Wang et al., 2019b)). (3) The model relies on a single metapath to embed the heterogeneous graph. Hence, the model requires a manual metapath selection process and loses aspects of information from other metapaths, leading to suboptimal performance (e.g., metapath2vec (Dong et al., 2017)).

To address these limitations, we propose a novel Metapath Aggregated Graph Neural Network (MAGNN) for heterogeneous graph embedding. MAGNN addresses all the issues described above by applying node content transformation, intra-metapath aggregation, and inter-metapath aggregation to generate node embeddings. Specifically, MAGNN first applies type-specific linear transformations to project heterogeneous node attributes, with possibly unequal dimensions for different node types, to the same latent vector space. Next, MAGNN applies intra-metapath aggregation with the attention mechanism (Velickovic et al., 2018) for every metapath. During this intra-metapath aggregation, each target node extracts and combines information from the metapath instances connecting the node with its metapath-based neighbors. In this way, MAGNN captures the structural and semantic information of heterogeneous graphs from both neighbor nodes and the metapath context in between. Following intra-metapath aggregation, MAGNN further conducts inter-metapath aggregation using the attention mechanism to fuse latent vectors obtained from multiple metapaths into final node embeddings. By integrating multiple metapaths, our model can learn the comprehensive semantics ingrained in the heterogeneous graph.

In summary, this work makes several major contributions:

  1. We propose a novel metapath aggregated graph neural network for heterogeneous graph embedding.

  2. We design several candidate encoder functions for distilling information from metapath instances, including one based on the idea of relational rotation in complex space (Sun et al., 2019).

  3. We conduct extensive experiments on the IMDb and DBLP datasets for node classification and node clustering, as well as on the Last.fm dataset for link prediction, to evaluate the performance of our proposed model. Experiments on all of these datasets and tasks show that the node embeddings learned by MAGNN are consistently better than those generated by other state-of-the-art baselines.

2. Preliminary

In this section, we give formal definitions of some important terminologies related to heterogeneous graphs. Graphical illustrations are provided in Figure 1. Besides, Table 1 summarizes frequently used notations in this paper for quick reference.

Notations       Definitions
ℝ^n             The n-dimensional Euclidean space
x, x, X         Scalar, vector, matrix
X^⊤             Matrix/vector transpose
V               The set of nodes in a graph
E               The set of edges in a graph
G = (V, E)      A graph
v               A node v ∈ V
P               A metapath
P(v, u)         A metapath instance connecting node v and node u
N_v             The set of neighbors of node v
N_v^P           The set of metapath-P-based neighbors of node v
x_v             Raw (content) feature vector of node v
h_v             Hidden state (embedding) of node v
W               A weight matrix
α               A normalized attention weight
σ(·)            An activation function
⊙               Element-wise multiplication
|·|             The cardinality of a set
∥               Vector concatenation
Table 1. Notations used in this paper.
(a) Heterogeneous Graph
(b) Metapaths
(c) Metapath Instances
(d) Metapath-based Graphs
Figure 1. An illustration of the terms defined in Section 2. (a) An example heterogeneous graph with three types of nodes (i.e., users, artists, and tags). (b) The User-Artist-Tag-Artist (UATA) metapath and the User-Artist-Tag-Artist-User (UATAU) metapath. (c) Example metapath instances of the UATA and UATAU metapaths, respectively. (d) The metapath-based graphs for the UATA and UATAU metapaths, respectively.
Definition 2.1 (Heterogeneous Graph).

A heterogeneous graph is defined as a graph G = (V, E) associated with a node type mapping function φ: V → A and an edge type mapping function ψ: E → R. A and R denote the predefined sets of node types and edge types, respectively, with |A| + |R| > 2.

Definition 2.2 (Metapath).

A metapath P is defined as a path in the form of A_1 →(R_1) A_2 →(R_2) ⋯ →(R_l) A_{l+1} (abbreviated as A_1 A_2 ⋯ A_{l+1}), which describes a composite relation R = R_1 ∘ R_2 ∘ ⋯ ∘ R_l between node types A_1 and A_{l+1}, where ∘ denotes the composition operator on relations.

Definition 2.3 (Metapath Instance).

Given a metapath P of a heterogeneous graph, a metapath instance p of P is defined as a node sequence in the graph following the schema defined by P.

Definition 2.4 (Metapath-based Neighbor).

Given a metapath P of a heterogeneous graph, the metapath-based neighbors N_v^P of a node v are defined as the set of nodes that connect with node v via metapath instances of P. A neighbor connected by two different metapath instances is regarded as two different nodes in N_v^P. Note that N_v^P includes v itself if P is symmetric.

For example, considering the metapath UATA in Figure 1, artist Queen is a metapath-based neighbor of user Bob. These two nodes are connected via the metapath instance Bob-Beatles-Rock-Queen. Moreover, we may refer to Beatles and Rock as the intermediate nodes along this metapath instance.
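The definition above can be read off directly from a list of metapath instances: each instance contributes one neighbor entry, so a neighbor reached via two different instances appears twice. The UATA instances below are hypothetical toy data echoing the Figure 1 example:

```python
def metapath_based_neighbors(instances, v):
    # One neighbor entry per connecting instance: a neighbor reached via two
    # different instances appears twice, matching the definition above.
    return [inst[-1] for inst in instances if inst[0] == v]

# Hypothetical UATA (User-Artist-Tag-Artist) instances echoing Figure 1.
uata_instances = [
    ("Bob", "Beatles", "Rock", "Queen"),
    ("Bob", "Beatles", "Rock", "Beatles"),
]
print(metapath_based_neighbors(uata_instances, "Bob"))  # ['Queen', 'Beatles']
```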

Definition 2.5 (Metapath-based Graph).

Given a metapath P of a heterogeneous graph G, the metapath-based graph G^P is a graph constructed from all the metapath-P-based neighbor pairs in graph G. Note that G^P is homogeneous if P is symmetric.

Definition 2.6 (Heterogeneous Graph Embedding).

Given a heterogeneous graph G = (V, E), with node attribute matrices X_{A_i} ∈ ℝ^{|V_{A_i}| × d_{A_i}} for node types A_i ∈ A, heterogeneous graph embedding is the task of learning the d-dimensional node representations h_v ∈ ℝ^d for all v ∈ V, with d ≪ |V|, that are able to capture the rich structural and semantic information involved in G.

3. Related Work

In this section, we review studies on graph representation learning that are related to our model. They are organized into two subsections: Section 3.1 summarizes research efforts on GNNs for general graph embedding, while Section 3.2 introduces graph embedding methods designed for heterogeneous graphs.

3.1. Graph Neural Networks

The goal of a GNN is to learn a low-dimensional vector representation for every node , which can be used for many downstream tasks, e.g., node classification, node clustering, and link prediction. The rationale behind this is that each node is naturally defined by its own features and its neighborhood. Following this idea and based on graph signal processing, spectral-based GNNs were first developed to perform graph convolution in the Fourier domain of a graph. ChebNet (Defferrard et al., 2016) utilizes Chebyshev polynomials to filter graph signals (node features) in the graph Fourier domain. Another influential model of this kind is GCN (Kipf and Welling, 2017), which constrains and simplifies the parameters of ChebNet to alleviate the overfitting problem and improve the performance. However, spectral-based GNNs suffer from poor scalability and generalization ability, because they require the entire graph as input for every layer, and their learned filters depend on the eigenbasis of the graph Laplacian, which is closely related to the specific graph structure.
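As a concrete reference point, the propagation rule of a single GCN layer, H′ = σ(D̂^{-1/2} Â D̂^{-1/2} H W) with Â = A + I, can be sketched in pure Python. This is an illustrative sketch, not the original implementation of any of the cited models:

```python
import math

def gcn_layer(adj, feats, weight):
    """One GCN layer: H' = ReLU(D^-1/2 (A + I) D^-1/2 · H · W).
    adj: n x n adjacency matrix, feats: n x f features, weight: f x f'."""
    n = len(adj)
    # Add self-loops: A_hat = A + I.
    a_hat = [[adj[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    # Symmetric normalization: D^-1/2 A_hat D^-1/2.
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]
    def matmul(x, y):
        return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
                 for j in range(len(y[0]))] for i in range(len(x))]
    h = matmul(matmul(norm, feats), weight)
    return [[max(0.0, v) for v in row] for row in h]  # ReLU activation
```

Each layer thus mixes every node's features with those of its neighbors under a degree-normalized adjacency, which is exactly the coupling to the graph Laplacian that limits generalization to unseen graphs.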

Spatial-based GNNs have been proposed to address these two limitations. GNNs of this kind define convolutions directly in the graph domain by aggregating feature information from neighbors for each node, thus imitating the convolution operations of convolutional neural networks for image data. GraphSAGE (Hamilton et al., 2017), the seminal spatial-based GNN framework, is founded upon the general notion of aggregator functions for efficient generation of node embeddings. The aggregator function samples, extracts, and transforms a target node’s local neighborhood, and thus facilitates parallel training and generalization to unseen nodes or graphs. Many other spatial-based GNN variants have been proposed based on this idea. Inspired by the Transformer (Vaswani et al., 2017), GAT (Velickovic et al., 2018) incorporates the attention mechanism into the aggregator function to take into account the relative importance of each neighbor’s information from the target node’s perspective. GGNN (Li et al., 2016) adds a gated recurrent unit (GRU) (Cho et al., 2014) to the aggregator function by treating the aggregated neighborhood information as the input to the GRU of the current time step. GaAN (Zhang et al., 2018) combines GRU with the gated multi-head attention mechanism for dealing with spatiotemporal graphs. STAR-GCN (Zhang et al., 2019) stacks multiple GCN encoder-decoders to boost the rating prediction performance.

All of the GNNs mentioned above are either built for homogeneous graphs, or designed for graphs with a special structure, as in user-item recommender systems. Because most existing GNNs operate on features of nodes in the same shared embedding space, they cannot be naturally adapted to heterogeneous graphs with node features lying in different spaces.

3.2. Heterogeneous Graph Embedding

Heterogeneous graph embedding aims to project nodes in a heterogeneous graph into a low-dimensional vector space. This challenging topic has been addressed by a number of studies. For example, metapath2vec (Dong et al., 2017) generates random walks guided by a single metapath, which are then fed to a skip-gram model (Mikolov et al., 2013a) to generate node embeddings. Given user-defined metapaths, ESim (Shang et al., 2016) generates node embeddings by learning from sampled positive and negative metapath instances. HIN2vec (Fu et al., 2017) carries out multiple prediction training tasks to learn representations of nodes and metapaths of a heterogeneous graph. Given a metapath, HERec (Shi et al., 2019) converts a heterogeneous graph into a homogeneous graph based on metapath-based neighbors and applies the DeepWalk model to learn the node embeddings of the target type. Like HERec, HAN (Wang et al., 2019b) converts a heterogeneous graph into multiple metapath-based homogeneous graphs in a similar way, but uses a graph attention network architecture to aggregate information from the neighbors and leverages the attention mechanism to combine various metapaths. Another model, PME (Chen et al., 2018), learns node embeddings by projecting them into the corresponding relation spaces and optimizing the proximity between the projected nodes.

However, all of the heterogeneous graph embedding methods introduced above have the limitations of either ignoring node content features, discarding all intermediate nodes along the metapath, or utilizing only a single metapath. Although they might have improved upon the performance of homogeneous graph embedding methods for some heterogeneous graph datasets, there is still room for improvement by exploiting more comprehensively the information embedded in heterogeneous graphs.

(a) Node Content Transformation
(b) Intra-metapath Aggregation
(c) Inter-metapath Aggregation
Figure 2. The overall architecture of MAGNN (path instances that start and end with the target node are omitted for clarity).

4. Methodology

In this section, we describe a new metapath aggregated graph neural network (MAGNN) for heterogeneous graph embedding. MAGNN consists of three major components: node content transformation, intra-metapath aggregation, and inter-metapath aggregation. Figure 2 illustrates the embedding generation of a single node. The overall forward propagation process is shown in Algorithm 1.

4.1. Node Content Transformation

For a heterogeneous graph associated with node attributes, different node types may have feature vectors of unequal dimensions. Even if two types happen to have the same dimension, their features may lie in different feature spaces. For example, p-dimensional bag-of-words vectors of texts and q-dimensional intensity histogram vectors of images cannot be directly combined even if p = q. Feature vectors of different dimensions are troublesome when we process them in a unified framework. Therefore, we need to project the different types of node features into the same latent vector space as a first step.

So before feeding node vectors into MAGNN, we apply a type-specific linear transformation for each type of nodes by projecting feature vectors into the same latent factor space. For a node v of type A, we have

h′_v = W_A · x_v^A,

where x_v^A ∈ ℝ^{d_A} is the original feature vector, and h′_v ∈ ℝ^{d′} is the projected latent vector of node v. W_A ∈ ℝ^{d′ × d_A} is the parametric weight matrix for type A's nodes.

The node content transformation addresses the heterogeneity of a graph that originates from the node content features. After applying this operation, all nodes’ projected features share the same dimension, which facilitates the aggregation process of the next model component.
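A minimal sketch of this type-specific projection, with hypothetical dimensions and randomly initialized weights standing in for the learned W_A:

```python
import random

random.seed(0)

def linear(in_dim, out_dim):
    # Hypothetical randomly initialized weight matrix W_A (out_dim x in_dim).
    return [[random.uniform(-0.1, 0.1) for _ in range(in_dim)]
            for _ in range(out_dim)]

def project(w, x):
    # h'_v = W_A · x_v : project a raw feature vector into the shared space.
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

# Two node types with different raw dimensions, both mapped to d' = 4.
w_movie, w_actor = linear(10, 4), linear(7, 4)
movie_feat, actor_feat = [1.0] * 10, [1.0] * 7
h_movie, h_actor = project(w_movie, movie_feat), project(w_actor, actor_feat)
print(len(h_movie), len(h_actor))  # both 4: ready for joint aggregation
```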

4.2. Intra-metapath Aggregation

Given a metapath P, the intra-metapath aggregation layer learns the structural and semantic information embedded in the target node, the metapath-based neighbors, and the context in between, by encoding the metapath instances of P. Let P(v, u) be a metapath instance connecting the target node v and its metapath-based neighbor u ∈ N_v^P; we further define the intermediate nodes of P(v, u) as {m^{P(v,u)}} = P(v, u) \ {u, v}. Intra-metapath aggregation employs a special metapath instance encoder to transform all the node features along a metapath instance into a single vector,

h_{P(v,u)} = f_θ(P(v, u)) = f_θ(x′_v, x′_u, {x′_t, ∀t ∈ {m^{P(v,u)}}}),

where h_{P(v,u)} ∈ ℝ^{d′}. For simplicity, here we use P(v, u) to represent a single instance, although there might be multiple instances connecting the two nodes. Section 4.4 introduces several choices of a qualified metapath instance encoder.

After encoding the metapath instances into vector representations, we adopt a graph attention layer (Velickovic et al., 2018) to compute a weighted sum of the metapath instances of P related to target node v. The key idea is that different metapath instances contribute to the target node's representation to different degrees. We can model this by learning a normalized importance weight α^P_{vu} for each metapath instance and then summing all instances with these weights:

e^P_{vu} = LeakyReLU(a_P^⊤ · [h′_v ∥ h_{P(v,u)}]),
α^P_{vu} = exp(e^P_{vu}) / Σ_{s ∈ N_v^P} exp(e^P_{vs}),
h^P_v = σ( Σ_{u ∈ N_v^P} α^P_{vu} · h_{P(v,u)} ).

Here a_P ∈ ℝ^{2d′} is the parameterized attention vector for metapath P, and ∥ denotes the vector concatenation operator. e^P_{vu} indicates the importance of metapath instance P(v, u) to node v, which is then normalized across all choices of u ∈ N_v^P using the softmax function. Once the normalized importance weights α^P_{vu} are obtained for all u ∈ N_v^P, they are used to compute a weighted combination of the representations of the metapath instances about node v. Finally, the output goes through an activation function σ(·).

This attention mechanism can also be extended to multiple heads, which helps to stabilize the learning process and reduce the high variance introduced by the heterogeneity of graphs. That is, we execute K independent attention mechanisms and then concatenate their outputs, resulting in the following formulation:

h^P_v = ∥_{k=1}^{K} σ( Σ_{u ∈ N_v^P} [α^P_{vu}]_k · h_{P(v,u)} ),

where [α^P_{vu}]_k is the normalized importance of metapath instance P(v, u) to node v at the k-th attention head.
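The intra-metapath attention above can be sketched for a single head as follows. The vectors are toy values, tanh stands in for the activation σ, and the attention vector is a hypothetical stand-in for the learned a_P:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def leaky_relu(x, slope=0.01):
    return x if x > 0 else slope * x

def intra_metapath(h_v, instance_vecs, attn_vec):
    """Single-head attention over metapath-instance vectors for one target node.
    attn_vec (length 2*d') scores the concatenation [h_v || h_P(v,u)]."""
    scores = [leaky_relu(sum(a * z for a, z in zip(attn_vec, h_v + h_inst)))
              for h_inst in instance_vecs]
    alphas = softmax(scores)  # normalized importance weights
    d = len(h_v)
    agg = [sum(a * h_inst[i] for a, h_inst in zip(alphas, instance_vecs))
           for i in range(d)]
    return [math.tanh(x) for x in agg]  # tanh plays the role of sigma here

h_v = [1.0, 0.0]
instances = [[1.0, 0.0], [0.0, 1.0]]
out = intra_metapath(h_v, instances, attn_vec=[1.0, 0.0, 1.0, 0.0])
print(out)  # the instance most similar to h_v receives the larger weight
```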

To sum up, given the projected feature vectors h′_v and the set of metapaths P_A = {P_1, P_2, ..., P_M} that start or end with node type A, the intra-metapath aggregation of MAGNN generates M metapath-specific vector representations of the target node v ∈ V_A, denoted as {h^{P_1}_v, h^{P_2}_v, ..., h^{P_M}_v}. Each h^{P_i}_v (assuming v is of type A) can be interpreted as a summarization of the P_i-metapath instances about node v, exhibiting one aspect of the semantic information contained in node v.

4.3. Inter-metapath Aggregation

After aggregating the node and edge data within each metapath, we need to combine the semantic information revealed by all metapaths using an inter-metapath aggregation layer. Now for a node type A, we have M sets of latent vectors: {h^{P_i}_v : v ∈ V_A} for i = 1, ..., M, where M = |P_A| is the number of metapaths for type A. One straightforward inter-metapath aggregation approach is to take the element-wise mean of these node vectors. We extend this approach by exploiting the attention mechanism to assign different weights to different metapaths. This operation is reasonable because metapaths are not equally important in a heterogeneous graph.

First, we summarize each metapath P_i ∈ P_A by averaging the transformed metapath-specific node vectors for all nodes v ∈ V_A,

s_{P_i} = (1 / |V_A|) Σ_{v ∈ V_A} tanh(M_A · h^{P_i}_v + b_A),

where M_A ∈ ℝ^{d_m × d′} and b_A ∈ ℝ^{d_m} are learnable parameters.

Then we use the attention mechanism to fuse the metapath-specific node vectors of v as follows:

e_{P_i} = q_A^⊤ · s_{P_i},
β_{P_i} = exp(e_{P_i}) / Σ_{j=1}^{M} exp(e_{P_j}),
h^{P_A}_v = Σ_{i=1}^{M} β_{P_i} · h^{P_i}_v,

where q_A ∈ ℝ^{d_m} is the parameterized attention vector for node type A. β_{P_i} can be interpreted as the relative importance of metapath P_i to type A's nodes. Once β_{P_i} is computed for each P_i ∈ P_A, we compute a weighted sum of all the metapath-specific node vectors of v.

At last, MAGNN employs an additional linear transformation with a nonlinear function to project the node embeddings to the vector space with the desired output dimension:

h_v = σ(W_o · h^{P_A}_v),

where σ(·) is an activation function, and W_o ∈ ℝ^{d_o × d′} is a weight matrix. This projection is task-specific. It can be interpreted as a linear classifier for node classification, or regarded as a projection to a space equipped with node similarity measures for link prediction.
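A compact sketch of the inter-metapath fusion. It is deliberately simplified: the learnable transform inside the metapath summary is dropped (tanh is applied to the plain mean), and q is a hypothetical stand-in for the learned attention vector:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def inter_metapath(per_metapath_vecs, q):
    """Fuse metapath-specific node vectors for one node type.
    per_metapath_vecs[i][v] = vector h_v^{P_i} for node v under metapath P_i."""
    # Summarize each metapath by a tanh of the mean over all nodes (simplified).
    summaries = []
    for vecs in per_metapath_vecs:
        n, d = len(vecs), len(vecs[0])
        summaries.append([math.tanh(sum(v[i] for v in vecs) / n)
                          for i in range(d)])
    # Metapath-level attention weights beta.
    betas = softmax([sum(qi * si for qi, si in zip(q, s)) for s in summaries])
    # Each node's fused vector: beta-weighted sum across metapaths.
    n_nodes, d = len(per_metapath_vecs[0]), len(per_metapath_vecs[0][0])
    fused = [[sum(b * per_metapath_vecs[i][v][j] for i, b in enumerate(betas))
              for j in range(d)] for v in range(n_nodes)]
    return betas, fused

per_mp = [[[1.0, 0.0], [1.0, 0.0]],   # metapath P_1 vectors for two nodes
          [[0.0, 1.0], [0.0, 1.0]]]   # metapath P_2 vectors for two nodes
betas, fused = inter_metapath(per_mp, q=[1.0, 0.0])
print(betas)  # P_1 aligns with q, so it receives the larger weight
```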

4.4. Metapath Instance Encoders

To encode each metapath instance in Section 4.2, we examine three candidate encoder functions:

  • Mean encoder. This function takes the element-wise mean of the node vectors along the metapath instance P(v, u):

    h_{P(v,u)} = MEAN({x′_t, ∀t ∈ P(v, u)}).

  • Linear encoder. This function is an extension of the mean encoder, appending it with a linear transformation:

    h_{P(v,u)} = W_P · MEAN({x′_t, ∀t ∈ P(v, u)}).

  • Relational rotation encoder. We also examine a metapath instance encoder based on relational rotation in complex space, an operation proposed by RotatE (Sun et al., 2019) for knowledge graph embedding. The mean and linear encoders introduced above treat the metapath instance essentially as a set, and thus ignore the information embedded in the sequential structure of the metapath. Relational rotation provides a way to model this kind of knowledge. Given a metapath instance P(v, u) = (t_0, t_1, ..., t_n) with t_0 = u and t_n = v, let R_i be the relation between node t_{i-1} and node t_i, and let r_i be the relation vector of R_i; the relational rotation encoder is formulated as:

    o_0 = x′_{t_0},
    o_i = x′_{t_i} + o_{i-1} ⊙ r_i,  i = 1, ..., n,
    h_{P(v,u)} = o_n / (n + 1),

    where x′_{t_i} and r_i are both complex vectors, and ⊙ is the element-wise product. We can easily interpret a real vector of dimension d′ as a complex vector of dimension d′/2 by treating the first half of the vector as the real part and the second half as the imaginary part.
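A sketch of the relational rotation encoder using Python's built-in complex numbers. The relation vectors are assumed given; the toy example uses identity relations (r = 1+0j), which leave the node vectors unrotated:

```python
def rotate_encoder(node_vecs, rel_vecs):
    """Relational-rotation metapath-instance encoder (RotatE-style sketch).
    node_vecs: complex vectors x'_{t_0}..x'_{t_n} along the instance;
    rel_vecs: complex relation vectors r_1..r_n (assumed given)."""
    o = list(node_vecs[0])                      # o_0 = x'_{t_0}
    for x, r in zip(node_vecs[1:], rel_vecs):   # o_i = x'_{t_i} + o_{i-1} (*) r_i
        o = [xi + oi * ri for xi, oi, ri in zip(x, o, r)]
    n = len(node_vecs) - 1
    return [oi / (n + 1) for oi in o]           # h = o_n / (n + 1)

# Toy instance of length 3 with identity relations (1+0j rotates by nothing).
vecs = [[1 + 0j, 0 + 1j], [0 + 1j, 1 + 0j], [1 + 1j, 1 - 1j]]
rels = [[1 + 0j, 1 + 0j], [1 + 0j, 1 + 0j]]
print(rotate_encoder(vecs, rels))
```

Unlike the mean encoder, reordering the nodes along the instance changes the result, which is exactly the sequential information this encoder is meant to capture.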

4.5. Training

After applying components introduced in the previous sections, we obtain the final node representations, which can then be used in different downstream tasks. Depending on the characteristics of different tasks and the availability of node labels, we can train MAGNN in two major learning paradigms, i.e., semi-supervised learning and unsupervised learning.

For semi-supervised learning, with the guidance of a small fraction of labeled nodes, we can optimize the model weights by minimizing the cross entropy via backpropagation and gradient descent, and thereby learn meaningful node embeddings for heterogeneous graphs. The cross entropy loss for this semi-supervised learning is formulated as:

L = − Σ_{v ∈ V_L} Σ_{c=1}^{C} y_v[c] · log h_v[c],

where V_L is the set of nodes that have labels, C is the number of classes, y_v is the one-hot label vector of node v, and h_v is the predicted probability vector of node v.
For unsupervised learning, without any node labels, we can optimize the model weights by minimizing the following loss function through negative sampling (Mikolov et al., 2013b):

L = − Σ_{(u,v) ∈ Ω} log σ(h_u^⊤ · h_v) − Σ_{(u′,v′) ∈ Ω⁻} log σ(−h_{u′}^⊤ · h_{v′}),

where σ(·) is the sigmoid function, Ω is the set of observed (positive) node pairs, and Ω⁻ is the set of negative node pairs sampled from all unobserved node pairs (the complement of Ω).
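This negative-sampling objective translates almost line by line into code. The embeddings below are hypothetical two-dimensional vectors, chosen so the positive pair is similar and the negative pair dissimilar:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def unsupervised_loss(emb, pos_pairs, neg_pairs):
    """-sum log sigma(h_u . h_v) over positives
       -sum log sigma(-h_u' . h_v') over sampled negatives."""
    def dot(u, v):
        return sum(a * b for a, b in zip(emb[u], emb[v]))
    loss = -sum(math.log(sigmoid(dot(u, v))) for u, v in pos_pairs)
    loss -= sum(math.log(sigmoid(-dot(u, v))) for u, v in neg_pairs)
    return loss

# Hypothetical embeddings: a similar positive pair, a dissimilar negative pair.
emb = {"u1": [1.0, 0.0], "a1": [0.9, 0.1], "a2": [-0.9, 0.2]}
print(unsupervised_loss(emb, [("u1", "a1")], [("u1", "a2")]))
```

Minimizing this loss pushes observed pairs together and sampled unobserved pairs apart in the embedding space, which is the training signal used for the link prediction experiments.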

Input: the heterogeneous graph G = (V, E),
node types A = {A_1, ..., A_{|A|}},
metapaths P = {P_1, ..., P_M},
node features {x_v, ∀v ∈ V},
the number of attention heads K,
the number of layers L
Output: the node embeddings {h_v, ∀v ∈ V}
1  for node type A ∈ A do
2        Node content transformation h′_v ← W_A · x_v, ∀v ∈ V_A;
3  end for
4  for l = 1, ..., L do
5        for node type A ∈ A do
6              for metapath P ∈ P_A do
7                    for v ∈ V_A do
8                          Calculate h_{P(v,u)} for all u ∈ N_v^P using the metapath instance encoder function;
9                          Combine extracted metapath instances h^P_v ← ∥_{k=1}^{K} σ(Σ_{u ∈ N_v^P} [α^P_{vu}]_k · h_{P(v,u)});
10                   end for
11             end for
12             Calculate the weight β_P for each metapath P ∈ P_A;
13             Fuse the embeddings from different metapaths h^{P_A}_v ← Σ_{P ∈ P_A} β_P · h^P_v, ∀v ∈ V_A;
14       end for
15       Layer output projection h_v ← σ(W_o · h^{P_A}_v), ∀v ∈ V;
16 end for
Algorithm 1 MAGNN forward propagation.
(a) IMDb
(b) DBLP
(c) Last.fm
Figure 3. Network schemas of the three heterogeneous graph datasets used in this paper.

5. Experiments

In this section, we present experiments to demonstrate the efficacy of MAGNN for heterogeneous graph embedding. The experiments aim to address the following research questions:

  • RQ1. How does MAGNN perform in classifying nodes?

  • RQ2. How does MAGNN perform in clustering nodes?

  • RQ3. How does MAGNN perform in predicting plausible links between node pairs?

  • RQ4. What is the impact of the three major components of MAGNN described in the previous section?

  • RQ5. How do we understand the representation capability of different graph embedding methods?

5.1. Datasets

We adopt three widely used heterogeneous graph datasets from different domains to evaluate the performance of MAGNN as compared to state-of-the-art baselines. Specifically, the IMDb and DBLP datasets are used in the experiments for node classification and node clustering. The Last.fm dataset is used in the experiments for link prediction. Simple statistics of the three datasets are summarized in Table 2, and network schemas are illustrated in Figure 3. We assign one-hot id vectors to nodes with no attributes as their dummy input features.

  • IMDb is an online database about movies and television programs, including information such as cast, production crew, and plot summaries. We use a subset of IMDb scraped online, containing 4278 movies, 2081 directors, and 5257 actors after data preprocessing. Movies are labeled as one of three classes (Action, Comedy, and Drama) based on their genre information. Each movie is also described by a bag-of-words representation of its plot keywords. For semi-supervised learning models, the movie nodes are divided into training, validation, and testing sets of 400 (9.35%), 400 (9.35%), and 3478 (81.30%) nodes, respectively.

  • DBLP is a computer science bibliography website. We adopt a subset of DBLP extracted by (Gao et al., 2009; Ji et al., 2010), containing 4057 authors, 14328 papers, 7723 terms, and 20 publication venues after data preprocessing. The authors are divided into four research areas (Database, Data Mining, Artificial Intelligence, and Information Retrieval). Each author is described by a bag-of-words representation of their paper keywords. For semi-supervised learning models, the author nodes are divided into training, validation, and testing sets of 400 (9.86%), 400 (9.86%), and 3257 (80.28%) nodes, respectively.

  • Last.fm is a music website keeping track of users’ listening information from various sources. We adopt a dataset released by HetRec 2011 (Cantador et al., 2011), consisting of 1892 users, 17632 artists, and 1088 artist tags after data preprocessing. This dataset is used for the link prediction task, and no label or feature is included in this dataset. For semi-supervised learning models, the user-artist pairs are divided into training, validation, and testing sets of 64984 (70%), 9283 (10%), and 18567 (20%) pairs, respectively.

Dataset   Node                        Edge                Metapath
IMDb      # movie (M): 4,278          # M-D: 4,278        MDM, MAM,
          # director (D): 2,081       # M-A: 12,828       DMD, DMAMD,
          # actor (A): 5,257                              AMA, AMDMA
DBLP      # author (A): 4,057         # A-P: 19,645       APA, APTPA,
          # paper (P): 14,328         # P-T: 85,810       APVPA
          # term (T): 7,723           # P-V: 14,328
          # venue (V): 20
Last.fm   # user (U): 1,892           # U-U: 12,717       UU, UAU, UATAU,
          # artist (A): 17,632        # U-A: 92,834       AUA, AUUA, ATA
          # tag (T): 1,088            # A-T: 23,253
Table 2. Statistics of datasets.

5.2. Baselines

Dataset Metrics Train % Unsupervised Semi-supervised
LINE node2vec ESim metapath2vec HERec GCN GAT HAN MAGNN
IMDb Macro-F1 20% 44.04 49.00 48.37 46.05 45.61 52.73 53.64 56.19 59.35
40% 45.45 50.63 50.09 47.57 46.80 53.67 55.50 56.15 60.27
60% 47.09 51.65 51.45 48.17 46.84 54.24 56.46 57.29 60.66
80% 47.49 51.49 51.37 49.99 47.73 54.77 57.43 58.51 61.44
Micro-F1 20% 45.21 49.94 49.32 47.22 46.23 52.80 53.64 56.32 59.60
40% 46.92 51.77 51.21 48.17 47.89 53.76 55.56 57.32 60.50
60% 48.35 52.79 52.53 49.87 48.19 54.23 56.47 58.42 60.88
80% 48.98 52.72 52.54 50.50 49.11 54.63 57.40 59.24 61.53
DBLP Macro-F1 20% 87.16 86.70 90.68 88.47 90.82 88.00 91.05 91.69 93.13
40% 88.85 88.07 91.61 89.91 91.44 89.00 91.24 91.96 93.23
60% 88.93 88.69 91.84 90.50 92.08 89.43 91.42 92.14 93.57
80% 89.51 88.93 92.27 90.86 92.25 89.98 91.73 92.50 94.10
Micro-F1 20% 87.68 87.21 91.21 89.02 91.49 88.51 91.61 92.33 93.61
40% 89.25 88.51 92.05 90.36 92.05 89.22 91.77 92.57 93.68
60% 89.34 89.09 92.28 90.94 92.66 89.57 91.97 92.72 93.99
80% 89.96 89.37 92.68 91.31 92.78 90.33 92.24 93.23 94.47
Table 3. Experiment results (%) on the IMDb and DBLP datasets for the node classification task.

We compare MAGNN against different kinds of graph embedding models, including traditional (as opposed to GNNs) homogeneous graph embedding models, traditional heterogeneous graph embedding models, GNNs for homogeneous graphs, and GNNs for heterogeneous graphs. We denote them as traditional homogeneous models, traditional heterogeneous models, homogeneous GNNs, and heterogeneous GNNs, respectively. The list of baseline models is shown as follows.

  • LINE (Tang et al., 2015) is a traditional homogeneous model exploiting the first-order and second-order proximity between nodes. We apply it to the heterogeneous graphs by ignoring the heterogeneity of graph structure and dropping all node content features. The LINE variant using second-order proximity is applied in our experiments.

  • node2vec (Grover and Leskovec, 2016) is a traditional homogeneous model serving as a generalized version of DeepWalk (Perozzi et al., 2014). We apply it to the heterogeneous graphs in the same way as LINE.

  • ESim (Shang et al., 2016) is a traditional heterogeneous model that learns node embeddings from sampled metapath instances. ESim requires a predefined weight for each metapath. Here we assign equal weights to all metapaths because searching for the optimal weights of metapaths is difficult, and does not provide a significant performance gain over equal weights according to the authors’ experiments.

  • metapath2vec (Dong et al., 2017) is a traditional heterogeneous model that generates node embeddings by feeding metapath-guided random walks to a skip-gram model. This model relies on a single user-specified metapath, so we test on all metapaths separately and report the one with the best results. We use the metapath2vec++ model variant in our experiments.

  • HERec (Shi et al., 2019) is a traditional heterogeneous model that learns node embeddings by applying DeepWalk to the metapath-based homogeneous graphs converted from the original heterogeneous graph. This model comes with an embedding fusion algorithm designed for rating prediction, which can be adapted to link prediction. For node classification/clustering, we select and report the metapath with the best performance.

  • GCN (Kipf and Welling, 2017) is a homogeneous GNN. This model performs convolutional operations in the graph Fourier domain. Here we test GCN on metapath-based homogeneous graphs and report the results from the best metapath.

  • GAT (Velickovic et al., 2018) is a homogeneous GNN. This model performs convolutional operations in the graph spatial domain with the attention mechanism incorporated. Similarly, here we test GAT on metapath-based homogeneous graphs and report the results from the best metapath.

  • GATNE (Cen et al., 2019) is a heterogeneous GNN. It generates a node’s representation from the base embedding and the edge embeddings, with a focus on the link prediction task. Here we report the results from the best-performing GATNE variant.

  • HAN (Wang et al., 2019b) is a heterogeneous GNN. It learns metapath-specific node embeddings from different metapath-based homogeneous graphs, and leverages the attention mechanism to combine them into one vector representation for each node.

For traditional models, including LINE, node2vec, ESim, metapath2vec, and HERec, we set the window size to 5, walk length to 100, walks per node to 40, and number of negative samples to 5, if applicable. For GNNs, including GCN, GAT, HAN, and our proposed MAGNN, we set the dropout rate to 0.5; we use the same splits of training, validation, and testing sets; we employ the Adam optimizer with the learning rate set to 0.005 and the weight decay (L2 penalty) set to 0.001; we train the GNNs for 100 epochs and apply early stopping with a patience of 30. For node classification and node clustering, the GNNs are trained in a semi-supervised fashion with a small fraction of nodes labeled as guidance. For GAT, HAN, and MAGNN, we set the number of attention heads to 8. For HAN and MAGNN, we set the dimension of the attention vector in inter-metapath aggregation to 128. For a fair comparison, we set the embedding dimension of all the models mentioned above to 64.
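The early-stopping rule described above (stop once the validation loss has not improved for a fixed number of epochs, with a patience of 30 out of 100 epochs) can be sketched framework-agnostically; `val_losses` and `train_with_early_stopping` are hypothetical stand-ins for the actual GNN training loop:

```python
def train_with_early_stopping(val_losses, max_epochs=100, patience=30):
    """Return the number of epochs actually trained, stopping when the
    validation loss has not improved for `patience` consecutive epochs."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        # ...one epoch of Adam updates (lr=0.005, weight decay=0.001) would go here...
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            return epoch  # early stop
    return min(len(val_losses), max_epochs)

# A run whose validation loss plateaus after epoch 3 stops at epoch 33:
losses = [0.9, 0.8, 0.7] + [0.7] * 97
print(train_with_early_stopping(losses))  # -> 33
```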

| Dataset | Metrics | LINE | node2vec | ESim | metapath2vec | HERec | GCN | GAT | HAN | MAGNN |
|---|---|---|---|---|---|---|---|---|---|---|
| IMDb | NMI | 1.13 | 5.22 | 1.07 | 0.89 | 0.39 | 7.46 | 7.84 | 10.79 | 15.58 |
| IMDb | ARI | 1.20 | 6.02 | 1.01 | 0.22 | 0.11 | 7.69 | 8.87 | 11.11 | 16.74 |
| DBLP | NMI | 71.02 | 77.01 | 68.33 | 74.18 | 69.03 | 73.45 | 70.73 | 77.49 | 80.81 |
| DBLP | ARI | 76.52 | 81.37 | 72.22 | 78.11 | 72.45 | 77.50 | 76.04 | 82.95 | 85.54 |
Table 4. Experiment results (%) on the IMDb and DBLP datasets for the node clustering task. (LINE, node2vec, ESim, metapath2vec, and HERec are unsupervised; GCN, GAT, HAN, and MAGNN are semi-supervised.)

| Metrics | LINE | node2vec | ESim | metapath2vec | HERec | GCN | GAT | GATNE | HAN | MAGNN |
|---|---|---|---|---|---|---|---|---|---|---|
| AUC | 85.76 | 67.14 | 82.00 | 92.20 | 91.52 | 90.97 | 92.36 | 89.21 | 93.40 | 98.91 |
| AP | 88.07 | 64.11 | 82.19 | 90.11 | 89.47 | 91.65 | 91.55 | 88.86 | 92.44 | 98.93 |
Table 5. Experiment results (%) on the dataset for the link prediction task.

5.3. Node Classification (RQ1)

We conduct experiments on the IMDb and DBLP datasets to compare the performance of different models on the node classification task. We feed the embeddings of labeled nodes (movies in IMDb and authors in DBLP) generated by each learning model to a linear support vector machine (SVM) classifier with varying training proportions. Note that for a fair comparison, only the nodes in the testing set are fed to the linear SVM, because the semi-supervised models have already "seen" the nodes in the training and validation sets, as shown in Equation 11. Hence, the training and testing proportions of the linear SVM here only concern the testing set (i.e., 3478 nodes for IMDb and 3257 nodes for DBLP). Again, the train/test splits for the linear SVM are the same across embedding models. Similar strategies are also applied to the experiments on node clustering and link prediction. We report the average Macro-F1 and Micro-F1 over 10 runs of each embedding model in Table 3.
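Macro-F1 averages the per-class F1 scores with equal weight, while Micro-F1 pools true/false positives over all classes (for single-label classification it equals accuracy); a minimal sketch of both metrics, with a toy label set standing in for the real SVM predictions:

```python
from collections import Counter

def macro_micro_f1(y_true, y_pred):
    """Compute (Macro-F1, Micro-F1) for single-label multi-class predictions."""
    classes = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted p, but the true class was t
            fn[t] += 1
    def f1(t, f_p, f_n):
        return 2 * t / (2 * t + f_p + f_n) if t else 0.0
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in classes) / len(classes)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro

macro, micro = macro_micro_f1([0, 0, 1, 1, 2], [0, 1, 1, 1, 2])
print(round(macro, 4), round(micro, 4))  # -> 0.8222 0.8
```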

As shown in the table, MAGNN performs consistently better than other baselines across different training proportions and datasets. On IMDb, it is interesting to see that node2vec performs better than traditional heterogeneous models. That said, GNNs, especially heterogeneous GNNs, obtain even better results, demonstrating that the GNN architecture, which judiciously utilizes the heterogeneous node features, helps improve the embedding performance. The performance gain obtained by MAGNN over the best baseline (HAN) is around 4-7%, which indicates that metapath instances contain richer information than metapath-based neighbors. On DBLP, the node classification task is trivial, as evident from the high scores of all models. Even so, MAGNN still outperforms the strongest baseline by 1-2%.

5.4. Node Clustering (RQ2)

We conduct experiments on the IMDb and DBLP datasets to compare the performance of different models on the node clustering task. We feed the embeddings of labeled nodes (movies in IMDb and authors in DBLP) generated by each learning model to the K-Means algorithm. The number of clusters in K-Means is set to the number of classes for each dataset, i.e., 3 for IMDb and 4 for DBLP. We employ the normalized mutual information (NMI) and the adjusted Rand index (ARI) as the evaluation metrics. Since the clustering result of the K-Means algorithm is highly dependent on the initialization of the centroids, we repeat K-Means 10 times for each run of the embedding model, and each embedding model is tested for 10 runs. We report the averaged results in Table 4.
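The clustering protocol above can be sketched with scikit-learn (assumed available here); the toy embeddings stand in for the learned node embeddings, and NMI/ARI are averaged over repeated K-Means initializations:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

rng = np.random.default_rng(0)
# Toy "embeddings": two well-separated groups standing in for learned embeddings.
emb = np.vstack([rng.normal(0, 0.1, (20, 8)), rng.normal(5, 0.1, (20, 8))])
labels = np.array([0] * 20 + [1] * 20)

nmi_runs, ari_runs = [], []
for seed in range(10):  # repeat K-Means to average out centroid initialization
    pred = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(emb)
    nmi_runs.append(normalized_mutual_info_score(labels, pred))
    ari_runs.append(adjusted_rand_score(labels, pred))

print(np.mean(nmi_runs), np.mean(ari_runs))  # both ~1.0 on this separable toy data
```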


From Table 4, we can see that MAGNN is consistently superior to all other baselines in node clustering. Note that all models perform much worse on IMDb than on DBLP, presumably because of the noisy labels of movies in IMDb: every movie node in the original IMDb dataset has multiple genres, and we choose only the first one as its class label. The traditional heterogeneous models do not show much advantage over the traditional homogeneous models in node clustering. node2vec is expected to perform strongly in this task because, being a random-walk-based approach, it forces nodes that are close in the graph to also be close in the embedding space (You et al., 2019), thereby encoding node positional information. This property implicitly helps the K-Means algorithm, which clusters nodes based on the Euclidean distances between embeddings. Despite this, the heterogeneity-aware GNNs (i.e., HAN and MAGNN) still rank first in node clustering on both datasets.

5.5. Link Prediction (RQ3)

We also conduct experiments on the dataset to evaluate the performance of MAGNN and other baselines on the link prediction task. For the GNNs, we treat connected user-artist pairs as positive node pairs and consider all unconnected user-artist links as negative node pairs. We add the same number of randomly sampled negative node pairs to the validation and testing sets. During GNN training, negative node pairs are also uniformly sampled on the fly. The GNNs are then optimized by minimizing the objective function described in Equation 12.
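The uniform on-the-fly negative sampling step can be sketched as follows; the IDs and the `sample_negatives` helper are hypothetical, and the real training loop would resample each iteration:

```python
import random

def sample_negatives(positive_pairs, n_users, n_artists, k, rng):
    """Uniformly sample k distinct user-artist pairs that are not observed links."""
    positives = set(positive_pairs)
    negatives = set()
    while len(negatives) < k:
        pair = (rng.randrange(n_users), rng.randrange(n_artists))
        if pair not in positives:
            negatives.add(pair)
    return list(negatives)

rng = random.Random(42)
positive_pairs = [(0, 0), (0, 1), (1, 2), (2, 3)]
# Draw as many negatives as there are positives, as in the evaluation setup.
negs = sample_negatives(positive_pairs, n_users=100, n_artists=100, k=4, rng=rng)
print(negs)
```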

Given the user embedding h_u and the artist embedding h_a generated by the trained model, we calculate the probability that u and a link together as follows:

    p_{ua} = σ(h_u⊤ · h_a),

where σ(·) is the sigmoid function. The embedding models for link prediction are evaluated by the area under the ROC curve (AUC) and average precision (AP) scores. We report the averaged results of 10 runs of each embedding model in Table 5.
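The scoring and evaluation step can be sketched with NumPy and scikit-learn (both assumed available); the toy embeddings below are chosen so that positive pairs score higher than negative ones:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def link_prob(h_u, h_a):
    """Probability that user u and artist a link: sigmoid of the embedding dot product."""
    return sigmoid(np.dot(h_u, h_a))

# Toy embeddings: positive pairs point the same way, negative pairs oppose.
pairs = [
    (np.array([1.0, 1.0]), np.array([1.0, 1.0]), 1),    # positive
    (np.array([1.0, 0.0]), np.array([2.0, 0.0]), 1),    # positive
    (np.array([1.0, 1.0]), np.array([-1.0, -1.0]), 0),  # negative
    (np.array([1.0, 0.0]), np.array([-2.0, 0.0]), 0),   # negative
]
scores = [link_prob(hu, ha) for hu, ha, _ in pairs]
y = [label for _, _, label in pairs]
auc = roc_auc_score(y, scores)
ap = average_precision_score(y, scores)
print(auc, ap)  # -> 1.0 1.0 (positives rank strictly above negatives)
```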

From Table 5, MAGNN outperforms other baseline models by a large margin. The strongest traditional model here is metapath2vec, which learns from node sequences generated from random walks guided by a single metapath. MAGNN achieves better scores than metapath2vec, showing that considering a single metapath is suboptimal. Among GNN baselines, HAN obtains the best results because it is heterogeneity-aware and combines multiple metapaths. Our MAGNN achieves a relative improvement of around 6% over HAN. This result supports our claim that the metapath contexts of nodes are critical to the node embeddings.

| Variant | IMDb Macro-F1 | IMDb Micro-F1 | IMDb NMI | IMDb ARI | DBLP Macro-F1 | DBLP Micro-F1 | DBLP NMI | DBLP ARI | AUC | AP |
|---|---|---|---|---|---|---|---|---|---|---|
| w/o node content features | 48.87 | 50.36 | 5.82 | 5.30 | 92.80 | 93.32 | 77.17 | 82.15 | N/A | N/A |
| metapath-based neighbors only | 58.45 | 58.84 | 12.87 | 11.98 | 92.61 | 93.15 | 77.64 | 82.60 | 93.68 | 92.95 |
| single best metapath only | 56.77 | 56.64 | 11.90 | 11.84 | 93.19 | 93.69 | 79.48 | 84.39 | 92.54 | 91.52 |
| mean instance encoder | 59.66 | 59.78 | 13.64 | 15.27 | 93.13 | 93.44 | 79.31 | 84.30 | 98.63 | 98.57 |
| linear instance encoder | 57.80 | 57.96 | 9.80 | 8.49 | 93.21 | 93.52 | 78.95 | 83.89 | 98.56 | 98.48 |
| MAGNN (relational rotation encoder) | 60.43 | 60.63 | 15.58 | 16.74 | 93.51 | 93.94 | 80.81 | 85.54 | 98.91 | 98.93 |
Table 6. Quantitative results (%) for the ablation study. Variant labels follow the descriptions in Section 5.6; AUC and AP are measured on the link prediction dataset.

5.6. Ablation Study (RQ4)

To validate the effectiveness of each component of our model, we further conduct experiments on different MAGNN variants. We report the results obtained from the three datasets on all three tasks in Table 6. Note that every presented score of the node classification task (i.e., Macro-F1 and Micro-F1) is an average over the different training proportions (explained in Section 5.3). The reference model is our proposed MAGNN using the relational rotation encoder, i.e., the one used to compete with the other baselines in Tables 3, 4, and 5. The other variants are: an equivalent model that does not utilize node content features; one that considers only the metapath-based neighbors; one that considers only the single best metapath; one that switches to the mean metapath instance encoder; and one that switches to the linear metapath instance encoder. Except for the above-mentioned differences, all other settings are the same across these MAGNN variants. Note that on the link prediction dataset, the variant without node content features is equivalent to the reference model because this dataset does not contain node attributes.

As can be seen, by utilizing the node content features, the reference model obtains a significant performance improvement over the variant without them, which shows the necessity of applying node content transformation to incorporate node features. Comparing the reference model with the variant that considers only metapath-based neighbors, we see that aggregating metapath instances rather than metapath-based neighbors brings a boost in performance, which validates the efficacy of intra-metapath aggregation. Next, the gap between the reference model and the single-best-metapath variant reveals that combining multiple metapaths in inter-metapath aggregation improves the model performance considerably. Finally, the results of the mean, linear, and relational rotation encoders suggest that the relational rotation encoder does help to improve MAGNN by a small margin. It is interesting to see that the linear encoder performs worse than the simpler mean encoder. Nonetheless, all three MAGNN variants using different encoders still consistently outperform the best baseline, HAN.

5.7. Visualization (RQ5)

In addition to the quantitative evaluations of embedding models, we also visualize node embeddings to conduct a qualitative assessment of the embedding results. We randomly select 30 user-artist pairs from the positive testing set of the dataset, and then project the embeddings of these nodes into a 2-dimensional space using t-SNE. Here we illustrate the visualization results of LINE, ESim, GCN, and MAGNN in Figure 4, where red points and green points indicate users and artists, respectively.
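The projection step can be sketched with scikit-learn's t-SNE (assumed available here); random vectors stand in for the learned user and artist embeddings:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# 30 user and 30 artist embeddings (toy stand-ins, 64-dimensional as in our setup).
emb = np.vstack([rng.normal(0, 1, (30, 64)), rng.normal(3, 1, (30, 64))])

# Project to 2-D for plotting; perplexity must be below the number of samples.
xy = TSNE(n_components=2, perplexity=15, random_state=0).fit_transform(emb)
print(xy.shape)  # -> (60, 2)
```

The first 30 rows of `xy` would be plotted as user points and the rest as artist points.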

Based on this visualization, one can quickly tell the differences among graph embedding models in terms of their learning ability towards heterogeneous graphs. As a traditional homogeneous graph embedding model, LINE cannot effectively divide user nodes and artist nodes into two different groups. In contrast, ESim, a traditional heterogeneous model, can roughly partition the two types of nodes. Thanks to the powerful GNN architecture and by choosing appropriate metapaths, a homogeneous GNN such as GCN can isolate different types of nodes and encode the correlation information of the user-artist pairs into the node embeddings. From Figure 4, we can see that our proposed MAGNN obtains the best embedding results, with two well-separated user and artist groups, and an aligned correlation of user-artist pairs.

(a) LINE  (b) ESim  (c) GCN  (d) MAGNN
Figure 4. Embedding visualization of node pairs in the dataset.

6. Conclusion

In this paper, we propose a novel metapath aggregated graph neural network (MAGNN) to address the three characteristic limitations of existing heterogeneous graph embedding methods, namely (1) dropping node content features, (2) discarding intermediate nodes along metapaths, and (3) considering only a single metapath. To be specific, MAGNN applies three building block components: (1) node content transformation, (2) intra-metapath aggregation, and (3) inter-metapath aggregation to deal with each of the limitations, respectively. Additionally, we define the notion of metapath instance encoders, which are used to extract the structural and semantic information ingrained in metapath instances. We propose several candidate encoder functions, including one inspired by the RotatE knowledge graph embedding model (Sun et al., 2019). In experiments, MAGNN achieves state-of-the-art results on three real-world datasets in the node classification, node clustering, and link prediction tasks. Ablation studies also demonstrate the efficacy of the three major components of MAGNN in boosting embedding performance. We plan to adapt this heterogeneous graph embedding framework to the rating prediction (recommendation) task with the user-item data assisted by the heterogeneous knowledge graph (Wang et al., 2019a).

The work described in this paper was partially supported by the Research Grants Council of the Hong Kong Special Administrative Region, China (CUHK 2300174 (Collaborative Research Fund, No. C5026-18GF) and CUHK 3133238 (Research Sustainability of Major RGC Funding Schemes)).


  • J. Atwood and D. Towsley (2016) Diffusion-convolutional neural networks. In NIPS, pp. 1993–2001. Cited by: §1.
  • P. Battaglia, R. Pascanu, M. Lai, D. Jimenez Rezende, and K. Kavukcuoglu (2016) Interaction networks for learning about objects, relations and physics. In NIPS, pp. 4502–4510. Cited by: §1.
  • A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko (2013) Translating embeddings for modeling multi-relational data. In NIPS, pp. 2787–2795. Cited by: §1.
  • I. Cantador, P. Brusilovsky, and T. Kuflik (2011) 2nd workshop on information heterogeneity and fusion in recommender systems (hetrec 2011). In RecSys, Cited by: 3rd item.
  • Y. Cen, X. Zou, J. Zhang, H. Yang, J. Zhou, and J. Tang (2019) Representation learning for attributed multiplex heterogeneous network. In SIGKDD, pp. 1358–1368. Cited by: 8th item.
  • H. Chen, H. Yin, W. Wang, H. Wang, Q. V. H. Nguyen, and X. Li (2018) PME: projected metric embedding on heterogeneous networks for link prediction. In SIGKDD, pp. 1177–1186. Cited by: §3.2.
  • K. Cho, B. van Merrienboer, Ç. Gülçehre, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR abs/1406.1078. External Links: 1406.1078 Cited by: §3.1.
  • M. Defferrard, X. Bresson, and P. Vandergheynst (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In NIPS, pp. 3844–3852. Cited by: §1, §3.1.
  • Y. Dong, N. V. Chawla, and A. Swami (2017) Metapath2Vec: scalable representation learning for heterogeneous networks. In SIGKDD, pp. 135–144. Cited by: §1, §3.2, 4th item.
  • A. Fout, J. Byrd, B. Shariat, and A. Ben-Hur (2017) Protein interface prediction using graph convolutional networks. In NIPS, pp. 6530–6539. Cited by: §1.
  • T. Fu, W. Lee, and Z. Lei (2017) HIN2Vec: explore meta-paths in heterogeneous information networks for representation learning. In CIKM, pp. 1797–1806. Cited by: §1, §3.2.
  • J. Gao, F. Liang, W. Fan, Y. Sun, and J. Han (2009) Graph-based consensus maximization among multiple supervised and unsupervised models. In NIPS, pp. 585–593. Cited by: 2nd item.
  • A. Grover and J. Leskovec (2016) Node2Vec: scalable feature learning for networks. In SIGKDD, pp. 855–864. Cited by: §1, 2nd item.
  • W. L. Hamilton, Z. Ying, and J. Leskovec (2017) Inductive representation learning on large graphs. In NIPS, pp. 1024–1034. Cited by: §1, §1, §3.1.
  • M. Ji, Y. Sun, M. Danilevsky, J. Han, and J. Gao (2010) Graph regularized transductive classification on heterogeneous information networks. In ECML PKDD, pp. 570–586. Cited by: 2nd item.
  • T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. In ICLR, Cited by: §1, §1, §3.1, 6th item.
  • Y. Li, R. Yu, C. Shahabi, and Y. Liu (2018) Diffusion convolutional recurrent neural network: data-driven traffic forecasting. In ICLR, Cited by: §1.
  • Y. Li, D. Tarlow, M. Brockschmidt, and R. S. Zemel (2016) Gated graph sequence neural networks. In ICLR, Cited by: §1, §3.1.
  • T. Mikolov, K. Chen, G. Corrado, and J. Dean (2013a) Efficient estimation of word representations in vector space. In ICLR, Cited by: §1, §3.2.
  • T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013b) Distributed representations of words and phrases and their compositionality. In NIPS, pp. 3111–3119. Cited by: §4.5.
  • B. Perozzi, R. Al-Rfou, and S. Skiena (2014) DeepWalk: online learning of social representations. In SIGKDD, pp. 701–710. Cited by: §1, 2nd item.
  • J. Shang, M. Qu, J. Liu, L. M. Kaplan, J. Han, and J. Peng (2016) Meta-path guided embedding for similarity search in large-scale heterogeneous information networks. CoRR abs/1610.09769. External Links: 1610.09769 Cited by: §1, §3.2, 3rd item.
  • C. Shi, B. Hu, W. X. Zhao, and P. S. Yu (2019) Heterogeneous information network embedding for recommendation. IEEE Transactions on Knowledge and Data Engineering 31 (2), pp. 357–370. Cited by: §1, §3.2, 5th item.
  • Z. Sun, Z. Deng, J. Nie, and J. Tang (2019) RotatE: knowledge graph embedding by relational rotation in complex space. In ICLR, Cited by: item 2, §1, 3rd item, §6.
  • J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei (2015) LINE: large-scale information network embedding. In WWW, pp. 1067–1077. Cited by: §1, 1st item.
  • R. van den Berg, T. N. Kipf, and M. Welling (2017) Graph convolutional matrix completion. CoRR abs/1706.02263. External Links: 1706.02263 Cited by: §1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NIPS, pp. 5998–6008. Cited by: §3.1.
  • P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio (2018) Graph attention networks. In ICLR, Cited by: §1, §1, §3.1, §4.2, 7th item.
  • D. Wang, P. Cui, and W. Zhu (2016) Structural deep network embedding. In SIGKDD, pp. 1225–1234. Cited by: §1.
  • H. Wang, M. Zhao, X. Xie, W. Li, and M. Guo (2019a) Knowledge graph convolutional networks for recommender systems. In WWW, pp. 3307–3313. Cited by: §6.
  • X. Wang, H. Ji, C. Shi, B. Wang, Y. Ye, P. Cui, and P. S. Yu (2019b) Heterogeneous graph attention network. In WWW, pp. 2022–2032. Cited by: §1, §3.2, 9th item.
  • C. Yang, Z. Liu, D. Zhao, M. Sun, and E. Y. Chang (2015) Network representation learning with rich text information. In IJCAI, pp. 2111–2117. Cited by: §1.
  • J. You, R. Ying, and J. Leskovec (2019) Position-aware graph neural networks. In ICML, pp. 7134–7143. Cited by: §5.4.
  • J. Zhang, X. Shi, J. Xie, H. Ma, I. King, and D. Yeung (2018) GaAN: gated attention networks for learning on large and spatiotemporal graphs. In UAI, pp. 339–349. Cited by: §1, §1, §3.1.
  • J. Zhang, X. Shi, S. Zhao, and I. King (2019) STAR-GCN: stacked and reconstructed graph convolutional networks for recommender systems. In IJCAI, pp. 4264–4270. Cited by: §1, §1, §3.1.