1 Introduction
Graph embedding pursues informative numerical representations of graphs, which facilitate various applications on graphs such as classification, link prediction, and entity alignment. Most existing methods perform graph embedding on homogeneous graphs, where all nodes and relationships (a.k.a. linkages or edges) are of the same type. For example, DeepWalk [14] minimizes the distance between a node and its neighboring nodes in the low-dimensional vector space, so as to preserve the structural information of the homogeneous graph. However, real-world data tends to be presented as a heterogeneous graph, which combines different aspects of information.
Heterogeneous Information Network (HIN). A HIN (a.k.a. heterogeneous graph) comprises more than two types of nodes or edges. Fig. 0(a) illustrates a toy example of a HIN, including three types of nodes (author, paper, and conference) and six types of edges (cite/cited, write/written, and publish/published). Note that we regard each relationship between vertices in a HIN as a directed edge, and we add a reverse relationship (e.g., written) for each directed relationship (e.g., write). Compared with the homogeneous graph, embedding a HIN poses two major challenges:

C1: How to model the entity space of multiple types of nodes? In a homogeneous graph, all nodes are embedded into the same low-dimensional entity space. In contrast, the various types of nodes in a HIN are naturally modeled in distinct spaces. However, a vertex may connect to multiple types of nodes, e.g., a paper is written by an author and published by an academic conference. It is therefore imperative to design a way for vertices in different type-specific entity spaces to interact.

C2: How to preserve the semantics of different relationships between nodes? In a HIN, a variety of relations exist both between different node pairs and between the same node pair. In an academic graph, for example, an author can cite another author while the two are also coauthors of some paper. The various relationships convey different semantics of a vertex. Thus, how the neighboring vertices with different relations to a vertex are characterized determines the quality of the learned low-dimensional representation space.
Most contemporary research on HIN embedding focuses on adapting the HIN to homogeneous representation learning algorithms via the metapath [18]. As shown in Fig. 0(b), the linkages between authors can be generated based on the designed metapath scheme APCPA, and a representation learning algorithm for homogeneous graphs, e.g., DeepWalk [14] adopted in metapath2vec [5] or GNN [6] used in HAN [26], is then applied to the generated graph. See Section 4 for more details on metapath-based methods.
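To make the metapath construction concrete, the following minimal sketch derives APCPA-based author neighbors by chaining incidence matrices. The toy graph, the matrix names `A_ap` and `A_pc`, and the helper `metapath_neighbors` are illustrative assumptions and are not taken from metapath2vec or HAN.

```python
import numpy as np

def metapath_neighbors(incidences):
    """Chain incidence matrices along a metapath and return a 0/1 neighbor matrix."""
    reach = incidences[0]
    for inc in incidences[1:]:
        reach = reach @ inc
    neighbors = (reach > 0).astype(int)
    np.fill_diagonal(neighbors, 0)  # a node is not its own metapath neighbor
    return neighbors

# Hypothetical toy graph: 3 authors, 3 papers, 2 conferences.
A_ap = np.array([[1, 0, 0],   # author 0 wrote paper 0
                 [0, 1, 0],   # author 1 wrote paper 1
                 [0, 0, 1]])  # author 2 wrote paper 2
A_pc = np.array([[1, 0],      # paper 0 published at conference 0
                 [1, 0],      # paper 1 published at conference 0
                 [0, 1]])     # paper 2 published at conference 1

# Author -> Paper -> Conference -> Paper -> Author
apcpa = metapath_neighbors([A_ap, A_pc, A_pc.T, A_ap.T])
```

In this toy graph only authors 0 and 1 become neighbors, because they published at the same conference. The intermediate paper and conference nodes no longer appear in the generated author graph.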
Despite the success of metapath-based heterogeneous graph embedding methods, these solutions employ handcrafted metapath schemes to find homogeneous node neighbors, which causes two predominant problems: 1) the metapath schemes rely on experts, and it is hard to exhaustively enumerate and select valuable schemes by hand; 2) the information carried along the metapath, such as the features of heterogeneous nodes or edges, is lost when generating metapath-based node pairs, which may even lead to inferior embedding performance.
In this paper, we cast the metapath aside and propose a novel method to learn a low-dimensional vector space that preserves both structural and semantic information in a HIN. Specifically, we take advantage of a graph neural network (GNN) to propagate the structural information of the HIN, and we train the model with a task-guided objective function (the node classification loss in this paper). To tackle the challenges of HIN mentioned above, we design a dedicated Type-aware Attention Layer in place of the convolutional layer in the conventional GNN. In each type-aware attention layer, a transformation operation that projects vertices from different entity spaces to the same low-dimensional target space is defined for the interaction between heterogeneous nodes (C1), and attention strategies focusing on different types of edges are applied to aggregate neighboring vertices with different semantics (C2). Moreover, we develop two kinds of attention scoring functions for the proposed type-aware attention layer: the concat product and the voices-sharing product (see footnote 1). To better model the interaction between heterogeneous nodes, we further introduce a restriction on the transformation operation. Finally, we perform multitask learning in our proposed model, which generally benefits the robustness of the representations.
Footnote 1: Voice is a concept of English grammar that includes the active voice and the passive voice. Here the active voice refers to the directed edge (cite, write, etc.) and the passive voice refers to the reversed edge (cited, written, etc.).
To sum up, the main contributions of this paper are as follows:

We propose the Heterogeneous Graph Structural Attention Neural Network (HetSANN). Unlike previous metapath-based solutions, HetSANN directly leverages and explores the structures in the heterogeneous graph to achieve more informative representations.

We present three extensions of HetSANN: (E1) enhance the extent of information sharing with multitask learning; (E2) take the pairwise relationship between the directed edge and the reversed edge into account (voices-sharing product); (E3) introduce a constraint on the transformation operation to keep it cycle-consistent.

We evaluate the proposed HetSANN on the node classification task with three heterogeneous graph datasets. The experimental results demonstrate the superiority of HetSANN over various state-of-the-art methods. In addition, an ablation study of the three extensions of HetSANN is conducted, and the results show that all three extensions improve upon the vanilla HetSANN.
2 Heterogeneous Graph Structural Attention Neural Network (HetSANN)
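A heterogeneous graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ consists of a set of vertices $\mathcal{V}$ and a set of edges $\mathcal{E}$. There is a set of node types $\mathcal{A}$, and each vertex $v \in \mathcal{V}$ belongs to one of the node types, denoted by $\phi(v) \in \mathcal{A}$, where $\phi$ is the mapping function from $\mathcal{V}$ to $\mathcal{A}$. We represent an edge $e \in \mathcal{E}$ from vertex $i$ to vertex $j$ with relation type $r$ as a triplet $e = (i, j, r)$, where $i, j \in \mathcal{V}$, $r \in \mathcal{R}$, and $\mathcal{R}$ is the set of relation types. For a directed edge $e = (i, j, r)$ in the canonical direction, we consider its reversed edge to be $\tilde{e} = (j, i, \tilde{r})$, where $\tilde{r}$ is different from $r$. For a vertex $j$, the set of linkages with its neighboring nodes is defined as $\mathcal{E}_j = \{(i, j, r) \in \mathcal{E} \mid i \in \mathcal{V}, r \in \mathcal{R}\}$.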
In this paper, we aim to learn a low-dimensional representation $h_v \in \mathbb{R}^{n_{\phi(v)}}$, where $n_{\phi(v)}$ is the dimension of the embedding space for node type $\phi(v)$, for each vertex $v$ in the heterogeneous graph and apply it to the downstream node classification task. Note that various relationship types can occur simultaneously when vertex $i$ links to vertex $j$, which is one of the challenges of heterogeneous graph embedding. To tackle these challenges, we propose a task-guided heterogeneous graph embedding method, namely HetSANN. As shown in Fig. 2, the key component of the HetSANN framework is the type-aware attention layer, presented as follows.
2.1 Type-aware Attention Layer (TAL)
The TAL is primarily motivated as an adaptation of the GNN layer, which performs a convolution operation on local graph neighborhoods. Before conducting the embedding procedure, we connect each vertex $v$ to itself with a self-loop relation specific to its node type $\phi(v)$, and we assign each node $v$ a cold-start state $h_v^{(0)}$. The cold-start state can be either the attribute features of the node, or dummy features (a zero vector or a one-hot vector) for nodes without attributes.
Each TAL employs the multi-head attention mechanism [23], which has been shown to help stabilize the learning process of the attention mechanism and enrich the model capacity [24]. The dataflow of each head of the TAL is illustrated in Fig. 3. Consider a vertex $j$ represented as $h_j^{(l)}$ in the $l$-th layer. An attention head $m$ in the $(l+1)$-th TAL outputs the corresponding hidden state $h_j^{(l+1,m)}$ through the following two operations: the transformation operation and the aggregation of the neighborhood in the in-degree distribution of vertex $j$.
Transformation Operation (C1)
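We first apply a linear transformation to each neighboring vertex $i$ of vertex $j$:
$$\hat{h}_i^{(l+1,m)} = W_{\phi(j),\phi(i)}^{(l+1,m)} \, h_i^{(l)}, \tag{1}$$
where $\phi(\cdot)$ gives the node type of a vertex and $W_{\phi(j),\phi(i)}^{(l+1,m)}$ is the projection from the previous hidden state in the space of node type $\phi(i)$ to the hidden space of node type $\phi(j)$ in the $m$-th head of layer $l+1$. That is, we transform the neighboring nodes of vertex $j$ to the same low-dimensional vector space of node type $\phi(j)$, in preparation for the neighborhood aggregation.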
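The type-specific projection can be sketched as a lookup table of matrices keyed by (target type, source type). The node types, dimensions, and the names `W` and `transform` below are hypothetical, and a single hidden size shared by all types is assumed for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical input dimensions per node type and one shared hidden size.
in_dim = {"author": 5, "paper": 7}
hid_dim = 4

# One projection per (target type, source type) pair, self pairs included.
W = {(tgt, src): rng.normal(size=(hid_dim, in_dim[src]))
     for tgt in in_dim for src in in_dim}

def transform(h_src, src_type, tgt_type):
    """Project a source-type state into the hidden space of the target type."""
    return W[(tgt_type, src_type)] @ h_src

h_author = rng.normal(size=in_dim["author"])
h_paper = rng.normal(size=in_dim["paper"])

# Both states now live in the hidden space used for a paper target vertex.
author_in_paper_space = transform(h_author, "author", "paper")
paper_in_paper_space = transform(h_paper, "paper", "paper")
```

After the projection, states of different node types have the same dimensionality and can be compared and summed during aggregation.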
Aggregation of Neighborhood (C2)
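To preserve the semantics of different types of relationships between nodes, we utilize a set of attention scoring functions to match different relation patterns, i.e., $\mathcal{F}^{(l+1,m)} = \{ f_r^{(l+1,m)} \mid r \in \mathcal{R} \}$, with one function per relation type $r$. For a vertex $j$, an attention coefficient is computed for each linked edge $e = (i, j, r) \in \mathcal{E}_j$ as:
$$o_e^{(l+1,m)} = \sigma\big( f_r^{(l+1,m)}(e) \big), \tag{2}$$
where $\sigma$ is an activation function implemented by LeakyReLU [13]. The attention coefficient $o_e^{(l+1,m)}$ indicates the importance of edge $e$ to the target vertex $j$. In principle, the attention scoring functions can take different forms to capture various link types. For simplicity, we adopt the same form of attention mechanism for all linkage types, with different parameters for each type. A natural form of the attention scoring function is the concat product, which is adopted in GAT [24], defined as:
$$f_r^{(l+1,m)}(e) = {a_r^{(l+1,m)}}^{\top} \big[ \hat{h}_i^{(l+1,m)} \,\Vert\, \hat{h}_j^{(l+1,m)} \big], \tag{3}$$
where $\hat{h}_i^{(l+1,m)}$ and $\hat{h}_j^{(l+1,m)}$ are the transformed states from Eq. (1), $\Vert$ denotes the concatenation operation, and $a_r^{(l+1,m)}$ is the trainable attention parameter shared by edges of the same type $r$. Unlike HAN [26], which employs a hierarchical attention mechanism based on the metapath schemes, we apply the attention mechanism directly to the raw heterogeneous links. The softmax is then applied over the neighborhood linkages $\mathcal{E}_j$ of vertex $j$ to normalize the attention coefficients:
$$\alpha_e^{(l+1,m)} = \frac{\exp\big(o_e^{(l+1,m)}\big)}{\sum_{k \in \mathcal{E}_j} \exp\big(o_k^{(l+1,m)}\big)}. \tag{4}$$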
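The scoring and normalization steps of Eqs. (2) to (4) can be sketched for a single target vertex as follows. This is a minimal illustration rather than the released implementation: the function names, the array layout, and the LeakyReLU slope of 0.2 are assumptions.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    """LeakyReLU; the slope value is an assumption for this sketch."""
    return np.where(x > 0, x, slope * x)

def attention_weights(h_src, h_tgt, rel_ids, a):
    """Normalized attention over the in-edges of one target vertex.

    h_src:   (E, d)  transformed state of the source vertex of each in-edge
    h_tgt:   (d,)    transformed state of the target vertex
    rel_ids: (E,)    integer relation type of each in-edge
    a:       (R, 2d) one attention vector per relation type
    """
    tgt = np.broadcast_to(h_tgt, h_src.shape)
    pair = np.concatenate([h_src, tgt], axis=1)              # (E, 2d)
    scores = leaky_relu(np.sum(a[rel_ids] * pair, axis=1))   # concat product per edge
    scores = scores - scores.max()                           # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()                           # softmax over in-edges

rng = np.random.default_rng(0)
d = 4
h_src = rng.normal(size=(3, d))
h_tgt = rng.normal(size=d)
rel_ids = np.array([0, 1, 0])          # two edges of relation 0, one of relation 1
a = rng.normal(size=(2, 2 * d))

alpha = attention_weights(h_src, h_tgt, rel_ids, a)
uniform = attention_weights(h_src, h_tgt, rel_ids, np.zeros((2, 2 * d)))
```

Indexing `a` by `rel_ids` is what makes the attention relation-aware: edges of the same type share one parameter vector, while edges of different types are scored with different vectors.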
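Now we have the hidden states of the neighboring nodes in the same low-dimensional space as the target node $j$, and a weight $\alpha_e^{(l+1,m)}$ for each linkage $e$ associated with vertex $j$. The neighborhood aggregation for vertex $j$ can then be performed as:
$$h_j^{(l+1,m)} = \sigma\Big( \sum_{e = (i, j, r) \in \mathcal{E}_j} \alpha_e^{(l+1,m)} \, \hat{h}_i^{(l+1,m)} \Big), \tag{5}$$
where $\sigma$ is an activation function. In our proposed model, the node pair $(i, j)$ of an edge and its relation type $r$ are used together to identify the edge. When vertex $i$ links to vertex $j$ with multiple types of relationships, the hidden state $\hat{h}_i^{(l+1,m)}$ is propagated to vertex $j$ multiple times, each time with the corresponding weight $\alpha_e^{(l+1,m)}$.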
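The effect of parallel edges on the aggregation can be seen in a small sketch. The edge list, the weights, and the helper `aggregate` are hypothetical, and the output nonlinearity is left out so that the arithmetic stays visible.

```python
import numpy as np

def aggregate(edges, alpha, h_src, num_nodes):
    """Accumulate alpha_e * h_src[e] into the target vertex of every edge.

    edges: list of (source, target, relation) triplets
    alpha: (E,)   normalized attention weight per edge
    h_src: (E, d) transformed source state per edge
    """
    out = np.zeros((num_nodes, h_src.shape[1]))
    targets = np.array([j for _, j, _ in edges])
    # np.add.at accumulates correctly when a target index repeats.
    np.add.at(out, targets, alpha[:, None] * h_src)
    return out

# Author 0 reaches author 2 through two relations, author 1 through one.
edges = [(0, 2, "cite"), (0, 2, "coauthor"), (1, 2, "cite")]
alpha = np.array([0.5, 0.3, 0.2])
h_src = np.array([[1.0, 0.0],
                  [1.0, 0.0],
                  [0.0, 1.0]])
h_new = aggregate(edges, alpha, h_src, num_nodes=3)
```

Because each (source, target, relation) triplet is a separate edge, the state of author 0 contributes twice to author 2, once per relation, with total weight 0.5 + 0.3.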
With $M$ attention heads executing the procedure of Eq. (5), we concatenate the low-dimensional vectors of the $M$ attention heads and output the representation of each node in the $(l+1)$-th type-aware attention layer:
$$h_j^{(l+1)} = \big\Vert_{m=1}^{M} h_j^{(l+1,m)}, \tag{6}$$
where $h_j^{(l+1,m)}$ is the output of the $m$-th head. The aggregation of HetSANN is conducted on the raw links instead of links generated from metapaths. That is, a vertex $i$ can be propagated to vertex $j$ within one GNN layer over metapath-based links, while more layers are needed over the raw links. Thus, a deeper model is used in HetSANN to capture high-order proximity information. To facilitate training, we adopt the residual mechanism first introduced by [8], and we revise Eq. (6) as follows:
$$h_j^{(l+1)} = h_j^{(l)} + \big\Vert_{m=1}^{M} h_j^{(l+1,m)}. \tag{7}$$
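Head concatenation with an optional residual connection can be sketched as below. The helper `tal_output` is hypothetical; the 8 heads of dimension 8 follow the experimental setting reported in Section 3, and the sketch assumes the previous state already has the concatenated dimension.

```python
import numpy as np

def tal_output(head_outputs, h_prev=None):
    """Concatenate per-head states and optionally add the previous-layer state."""
    merged = np.concatenate(head_outputs, axis=-1)
    if h_prev is not None:
        if h_prev.shape != merged.shape:
            raise ValueError("residual connection needs matching dimensions")
        merged = merged + h_prev
    return merged

# 8 heads with 8 output dimensions each give a 64-dimensional representation.
heads = [np.full(8, float(m)) for m in range(8)]
h_prev = np.ones(64)

h_plain = tal_output(heads)              # concatenation only
h_residual = tal_output(heads, h_prev)   # concatenation plus residual
```

The residual form requires the input and output of a layer to share a dimension, which is why the sketch raises an error on a mismatch instead of silently broadcasting.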
2.2 Model Training and Three Extensions
The last type-aware attention layer outputs the low-dimensional representation $h_v$ for each vertex $v$ in the heterogeneous graph. To optimize the representations toward the target task, which is node classification in this paper, we feed the node representations into a node classifier (implemented as a fully connected layer with a softmax function) to infer the class label. Guided by the labeled data, we minimize the cross-entropy loss:
$$\mathcal{L}_{p} = -\sum_{v \in \mathcal{V}_{p}^{\mathrm{label}}} y_v^{\top} \log \tilde{y}_v, \tag{8}$$
where $\mathcal{V}_{p}^{\mathrm{label}}$ is the set of labeled vertices belonging to node type $p$, and $y_v$ and $\tilde{y}_v$ are the ground truth (one-hot encoded) and the predicted class distribution for vertex $v$, respectively.
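A minimal sketch of the classifier and its loss: a fully connected layer followed by a softmax, trained with cross-entropy. The names `W_c`, `b_c`, and `classification_loss`, as well as the toy sizes, are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def classification_loss(h, labels, W_c, b_c):
    """Summed cross-entropy of a linear-softmax classifier over labeled nodes."""
    probs = softmax(h @ W_c + b_c)
    picked = probs[np.arange(len(labels)), labels]  # probability of the true class
    return float(-np.log(picked).sum())

rng = np.random.default_rng(0)
h = rng.normal(size=(5, 64))             # 5 labeled nodes, 64-dim representations
labels = np.array([0, 2, 1, 1, 0])       # 3 classes

# With zero weights every class is equally likely, so the loss is 5 * log(3).
loss_at_init = classification_loss(h, labels, np.zeros((64, 3)), np.zeros(3))
```

In a full model the gradient of this loss flows back through the classifier into all attention layers, which is what makes the embedding task-guided.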
E1: Multitask Learning
We can further employ several node classifiers for different types of nodes. The parameters of all type-aware attention layers are shared and trained by the multiple classifiers. Multitask learning that unites all classifiers reduces the risk of overfitting and benefits the robustness of the representations [2].
E2: Voices-sharing Product
The concat product scoring function considers the directed edge (e.g., write) and the reversed edge (e.g., written) as independent types of relationships. Intuitively, vertex $j$ links to vertex $i$ with the “written” relation whenever vertex $i$ links to vertex $j$ with the “write” relation. To formulate the pairwise relationship between the directed edge and the reversed edge, we share the parameters of the attention mechanism between the paired edge types $r$ and $\tilde{r}$, where $\tilde{r}$ is the type of the reversed edge of edges with type $r$. Technically, we enforce and adapt the attention scoring function as follows (called the voices-sharing product):
(9)
E3: Cycle-consistency Loss
In natural language processing, “back translation and reconciliation” has been a popular trick to verify and improve translation performance [3]. Referring back to the transformation operation between heterogeneous nodes in Eq. (1), we have a transformation from node type to and another transformation from to . Particularly, a self-transformation is applied to each type of node, i.e. . The transformation operation between and is illustrated in Fig. 3(a); intuitively, a vertex should return to its starting position after a cycle. Therefore, we introduce a cycle-consistency restriction on the transformation operation:
(10)
where is the inverse of . However, matrix inversion is notoriously time-consuming. To reduce the computational complexity, we adopt a trainable matrix in place of the inverse of the matrix and constrain it as follows:
(11)
where is the identity matrix. These constraints are integrated as the cycle-consistency loss (as shown in Fig. 3(b)):
(12)
where and are the weighting factors. The objective function of our model is therefore derived as:
(13)
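The idea of replacing an explicit matrix inverse with a trainable matrix can be illustrated with a small sketch. The penalty used here, the squared Frobenius norm of `W_tilde @ W - I`, is an assumed form chosen for illustration, and a square transformation matrix is assumed; the exact constraint and weighting of Eqs. (11) and (12) are not reproduced.

```python
import numpy as np

def inverse_penalty(W, W_tilde):
    """Squared Frobenius distance between W_tilde @ W and the identity."""
    resid = W_tilde @ W - np.eye(W.shape[0])
    return float(np.sum(resid ** 2))

def fit_inverse(W, steps=500, lr=0.05):
    """Learn W_tilde by plain gradient descent on the penalty above."""
    W_tilde = np.eye(W.shape[0])
    for _ in range(steps):
        grad = 2.0 * (W_tilde @ W - np.eye(W.shape[0])) @ W.T
        W_tilde -= lr * grad
    return W_tilde

# Hypothetical 2x2 self-transformation matrix.
W = np.array([[2.0, 0.0],
              [1.0, 1.0]])
W_tilde = fit_inverse(W)
penalty_before = inverse_penalty(W, np.eye(2))
penalty_after = inverse_penalty(W, W_tilde)
```

In this toy case the trained matrix converges to the analytical inverse. Inside a full model the penalty would only be one term of the objective, so the learned matrix is an approximation whose quality depends on its weighting factor.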
3 Experiments
Comparative Models
The list of models in comparison includes:
1) Variants of our proposed model (available at https://github.com/didi/hetsann): We denote HetSANN as the proposed vanilla version, i.e., without the three extensions described in Section 2.2. Three suffixes “.”, “.” and “.” indicate multitask learning to optimize the parameters, the voices-sharing product in the relation attention mechanism, and the cycle-consistency loss to retain the transformation between vertices, respectively. HetSANN... refers to the full version of our proposed model.
All variants employ a 3-layer HetSANN, and each TAL consists of 8 attention heads. The output dimension of each attention head is 8. The parameters are optimized via the Adam solver [10] with a learning rate of 0.001 for IMDB and 0.005 for the other datasets. A regularization weight of 0.0005 is applied to all trainable parameters. Dropout [19] with a rate of 0.6 is applied between hidden layers to stabilize the training procedure. For the variant of HetSANN with suffix “.”, the weight coefficients and .
2) Baseline models: We compare with state-of-the-art baselines whose code is publicly available, including DeepWalk [14], metapath2vec [5], HERec [17], HAN [26], GCN [11], RGCN [16] and GAT [24]. All baseline models are introduced in Section 4, and their implementation is detailed as follows.
We turn the homogeneous graph embedding methods (DeepWalk, GCN and GAT) into models for heterogeneous graph embedding by ignoring the types of nodes and linkages. RGCN is implemented by regarding all nodes as the same type while distinguishing the different types of relations in the graph. We follow most of the parameter settings recommended in the published papers, and tune a few parameters to adapt to the datasets in our experiments. Specifically, we set the walk length to 50 and the number of walks per node to 100 for metapath2vec, HERec and DeepWalk, and the embeddings they learn are used to train a 2-layer MLP [15] classifier. For the graph neural network based methods, the number of layers is set to 3 for GCN, RGCN and GAT. We use 8 attention heads in each layer of the attention-based models, i.e., HAN and GAT. To enable comparison, we design several metapath schemes for each dataset, evaluate all of them for metapath2vec and HAN, and report the best performance.
All networks were trained from scratch until convergence. The dimension of the embedding is uniformly set to 64 for all models. The model settings are the same for all datasets unless otherwise noted.
Table 1: Statistics of the datasets.
Dataset | Nodes | Linkages | Labeled node types
IMDB | movie (5043) | movie-actor (11188) | movie
 | actor (2357) | movie-director (3435) |
 | director (894) | |
DBLP | author (14475) | author-paper (41794) | author
 | paper (14376) | paper-venue (14376) | paper
 | venue (20) | |
AMiner | author (8052) | author-author (31224) | author
 | paper (20201) | paper-paper (44551) | paper
 | | author-paper (32029) |
Table 2: Node classification results (%). Mic F1 = Micro F1, Mac F1 = Macro F1. A dash marks a setting that was not run.
Model | IMDB Movie Mic F1 | IMDB Movie Mac F1 | DBLP Author Mic F1 | DBLP Author Mac F1 | DBLP Paper Mic F1 | DBLP Paper Mac F1 | AMiner Author Mic F1 | AMiner Author Mac F1 | AMiner Paper Mic F1 | AMiner Paper Mac F1
DeepWalk | 63.53 | 54.91 | 92.71 | 92.01 | 99.41 | 99.30 | 84.70 | 84.99 | 87.71 | 87.70
metapath2vec | 60.83 | 50.27 | 66.75 | 67.07 | 70.35 | 72.89 | 61.79 | 61.72 | 71.93 | 71.58
HERec | 62.54 | 53.62 | 90.49 | 89.94 | 99.93 | 99.93 | 68.74 | 68.77 | 78.75 | 78.61
HAN | 61.91 | 57.87 | 88.35 | 87.67 | 100.00 | 100.00 | 33.24 | 31.92 | 91.47 | 91.73
GCN | 63.78 | 51.13 | 87.09 | 86.60 | 91.31 | 89.87 | 81.24 | 81.69 | 90.55 | 90.76
RGCN | 67.13 | 62.58 | 87.70 | 86.96 | 81.92 | 79.37 | 85.11 | 85.31 | 90.84 | 91.05
GAT | 65.19 | 60.43 | 89.11 | 88.57 | 89.14 | 87.51 | 86.86 | 87.44 | 90.79 | 90.96
HetSANN | 73.11 | 71.20 | 94.89 | 94.56 | 99.72 | 99.67 | 86.97 | 87.48 | 91.20 | 91.37
HetSANN. | - | - | 95.43 | 95.21 | 99.08 | 98.89 | 91.55 | 91.99 | 92.56 | 92.77
HetSANN.. | 73.20 | 71.38 | 95.51 | 95.28 | 99.11 | 98.95 | 92.43 | 92.87 | 93.75 | 93.93
HetSANN... | 73.86 | 72.00 | 95.63 | 95.38 | 98.69 | 98.43 | 92.47 | 92.91 | 93.73 | 93.92
Datasets
We collected a movie graph from the IMDB site (https://www.imdb.com) and constructed two academic networks from the DBLP [9] and AMiner [22] datasets, respectively. Each of these three datasets, whose statistics are tabulated in Table 1, is a heterogeneous graph consisting of more than two types of nodes or edges.

IMDB The IMDB dataset records the Actors and Directors of the Movies. The movies are divided into three groups according to their genre label: Action, Comedy, or Drama. We use the keywords of the movie plot, encoded as a bag-of-words, as the attribute features of each movie vertex. For metapath-based models, we set two metapath schemes, i.e., MAM and MDM.

DBLP
In an academic network, Authors publish their Papers in Venues. The DBLP dataset we constructed consists of 20 venues from four different research fields: database, data mining, machine learning, and information retrieval. Each paper is labeled according to the research field of the venue where it is published, and each paper is characterized by the bag-of-words of its keywords. Each author is labeled based on the research fields of her/his publications, and we sum up the bag-of-words of the papers published by the author as the author's features. For metapath-based models, we set PAP and PVP as the metapath schemes for the paper classification task, and APA and APVPA for the author classification task.
AMiner
We cast aside the venue nodes from the AMiner academic network, which yields a harder classification task. In addition to the publishing relationship between papers and authors, we also introduce the citation relationship between papers and the collaboration relationship between authors. Similar to DBLP, each paper in AMiner is characterized by the bag-of-words of its keywords, and papers and authors are labeled with four research fields: database, data mining, natural language processing, and computer vision. The attribute features of each author are five indices [20] indicating the academic authority of the author: the count of published papers; the total number of citations; the H-index; the P-index with equal A-index; and the P-index with unequal A-index. Again, PAP and PAAP are set as the metapath schemes for paper classification, and APA and APPA for author classification.
Evaluation Metrics
The whole labeled dataset is randomly split into a training set, a validation set, and a test set with a ratio of 0.8:0.1:0.1. For each comparative model, we select the version that performs best on the validation set and then evaluate it by Micro F1 and Macro F1 on the test set. For each model, we report the average performance over 10 repeated runs.
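For reference, the two metrics can be computed as in the following sketch for single-label multi-class predictions. The function name and the toy labels are illustrative assumptions; in this single-label setting Micro F1 coincides with accuracy, while Macro F1 averages the per-class scores and is therefore more sensitive to rare classes.

```python
import numpy as np

def micro_macro_f1(y_true, y_pred, num_classes):
    """Micro and Macro F1 for single-label multi-class predictions."""
    tp = np.zeros(num_classes)
    fp = np.zeros(num_classes)
    fn = np.zeros(num_classes)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1   # predicted class gains a false positive
            fn[t] += 1   # true class gains a false negative
    micro = 2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum())
    denom = 2 * tp + fp + fn
    # Classes that never occur in labels or predictions get an F1 of 0.
    per_class = np.divide(2 * tp, denom, out=np.zeros(num_classes), where=denom > 0)
    return float(micro), float(per_class.mean())

# Hypothetical predictions: the single class-1 example is misclassified.
micro, macro = micro_macro_f1([0, 0, 0, 0, 1, 2], [0, 0, 0, 0, 2, 2], num_classes=3)
```

Here five of six predictions are correct, so Micro F1 is 5/6, while the missed class pulls Macro F1 down to 5/9.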
3.1 Ablation Study
In this section, we employ the vanilla HetSANN and its variants, including HetSANN., HetSANN.. and our full method HetSANN..., to perform an ablation study. To enable multitask learning, we introduce Paper classification as an auxiliary task to the main Author classification task. Multitask learning (with suffix “.”) cannot be conducted on the IMDB dataset, which contains only a single type of labeled node. The test results are shown in the bottom half of Table 2: (1) HetSANN. performs better than HetSANN in both the Author and Paper classification tasks on AMiner. On the DBLP dataset, HetSANN. still brings an improvement in Author classification, although it is slightly inferior in Paper classification compared to HetSANN, which is trained toward a single task. We believe that multitask learning guides our model toward a representation that serves all tasks, even if this sometimes means losing accuracy on one task in return for a gain in overall performance [2]; (2) Benefiting from the substitution of the voices-sharing product for the concat product, HetSANN.. improves the classification performance over HetSANN. on all datasets; (3) HetSANN... achieves the best performance among the variants of our model in the Author and Movie classification tasks. However, the gain of HetSANN... is relatively limited, as seen from the comparison between HetSANN.. and HetSANN.... One likely reason is that the analytical inverse matrix is replaced by a trainable matrix in the cycle-consistency loss [Eq. (12)]; addressing this is left to future work.
3.2 Comparison Results
Table 2 also shows the comparison results of our models with the baselines. Our models are superior to the other models in the classification tasks on all datasets except Paper classification on DBLP. Note that we label papers according to the research field of the venue where the paper is published. The venue nodes connected to papers in DBLP enable HAN to establish the neighborhood of papers published in the same venue via the metapath scheme PVP, resulting in perfect Paper classification performance on DBLP. Without the venue vertices in AMiner, it is not easy for HAN to capture the category information of papers via the PAP and PAAP schemes, leading to the worst Author classification results on AMiner. Compared with the methods that require well-designed metapath schemes, the metapath-free methods achieve clear performance gains and more robust results. With a further distinction between multiple types of nodes and linkages, our models outperform the other baselines, and our full method HetSANN... improves Micro F1 and Macro F1 by and respectively over the most competitive model GAT on three datasets.
Fig. 5 details the comparison results of the main tasks (Movie and Author classification) and the auxiliary task (Paper classification) on the three datasets, varying the value of training ratio in . Consistently, HetSANN... outperforms HetSANN, which in turn outperforms the baselines, in terms of the Micro F1 score of classification. Besides, both HetSANN and HetSANN... maintain their advantage in the weakly-supervised setting.
3.3 Parameter Sensitivity Study
Finally, we test the parameter sensitivity of HetSANN.. on IMDB, and the results are presented in Fig. 6. The left figure shows the effect of the number of type-aware attention layers when the other parameters are fixed. The performance of HetSANN.. goes down when . This observation is consistent with the analysis in [12]: graph convolution is a form of Laplacian smoothing over the features of the neighborhood, and stacking many convolutional layers in a GNN leads to indistinguishable node features and inferior node classification performance. The remaining two figures focus on the weighting factors and of the cycle-consistency loss. Fixing , the lines in Fig. 5(b) increase when increases to and decrease when we set a larger , which may suppress the learning of the main task, i.e., node classification. Fig. 5(c) in turn varies while fixing ; the lines tend to be stable when , indicating that the model has done its best to reach the solution of the inverse matrix in Eq. (11).
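The smoothing effect noted in [12] can be illustrated with a toy example in which plain neighborhood averaging stands in for a GNN layer. The path graph, the features, and the helper names are hypothetical; the sketch only shows that repeated averaging pulls node features together and makes no claim about the depth at which HetSANN degrades.

```python
import numpy as np

# Hypothetical path graph 0-1-2-3 with self-loops.
A = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 1, 1, 1],
              [0, 0, 1, 1]], dtype=float)
P = A / A.sum(axis=1, keepdims=True)   # row-normalized averaging operator

# Two alternating classes with one-hot features.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 0.0],
              [0.0, 1.0]])

def smooth(X, P, layers):
    """Apply neighborhood averaging `layers` times."""
    for _ in range(layers):
        X = P @ X
    return X

def spread(X):
    """Largest per-feature gap between any two nodes."""
    return float((X.max(axis=0) - X.min(axis=0)).max())

spread_by_depth = {k: spread(smooth(X, P, k)) for k in (0, 1, 3, 10)}
```

The gap between node features shrinks with every additional averaging step, so after enough layers the two classes become nearly indistinguishable from their features alone.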
4 Related Work
4.1 Heterogeneous Graph Representation Learning
Existing work on HIN embedding tends to utilize the metapath to adapt the heterogeneous graph to homogeneous graph embedding methods, such as [14, 21, 25]. metapath2vec [5] designs metapaths to guide random walks in a heterogeneous graph and then follows the skip-gram model to learn latent-space representations of vertices. Inspired by metapath2vec, HERec [17] proposes to fuse the different representations learned from the views of different metapath schemes. Both metapath2vec and HERec are trained with a linkage-guided objective function, which is independent of the downstream tasks. To obtain optimum embeddings for a specific task, [4] combines the task-guided objective of author identification with a linkage-guided objective to learn heterogeneous graph embeddings. HAN [26] introduces a two-level hierarchical attention to the GNN, in which a node-level attention captures the relations between neighboring nodes generated by one metapath scheme and a semantic-level attention aggregates multiple metapath schemes for each node in the graph. All in all, the aforementioned methods depend on schemes designed by experts and incur an inevitable loss of information.
4.2 Graph Neural Networks (GNNs)
More recently, graph neural networks (GNNs) [6, 7, 11, 1] have been increasingly studied. GNNs generate node embeddings with a spatial filter that convolves each node over its neighborhood in the graph. The convolutional operation enables GNNs to propagate the structural information of graphs layer by layer, and frees graph embedding methods from linkage-guided learning. Motivated by the thriving of the attention mechanism, GAT [24] introduces an attention strategy into the GNN framework. For the relational learning of knowledge bases, RGCN [16] builds multiple relation spaces for all nodes in a graph, which cannot capture the relative importance of various types of nodes. RSHN [27] focuses on mining semantic interactions of edge types via a coarsened line graph and incorporating them into the HGNN model.
5 Conclusion
This paper proposes HetSANN to perform metapath-free embedding based on structural information in heterogeneous graphs. We design a type-aware attention layer for HetSANN, which embeds each vertex of the heterogeneous graph by joining different types of neighboring nodes and the associated linkages. Several variants of our model are developed based on three extensions, i.e., the voices-sharing product, the cycle-consistency loss, and multitask learning. Comprehensive experiments on three popular datasets show that the proposed solutions outperform state-of-the-art methods in HIN embedding and node classification. Under the framework of HetSANN, the representation learning of HIN does not need to rely on the metapath to handle heterogeneous structural information. The heterogeneous attributes of vertices will be considered in future work.
References

[1] (2018) Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261.
[2] (1997) A bayesian/information theoretic model of learning to learn via multiple task sampling. Machine Learning 28 (1), pp. 7–39. ISSN 1573-0565.
[3] (1970) Back-translation for cross-cultural research. Journal of Cross-Cultural Psychology 1 (3), pp. 185–216.
[4] (2017) Task-guided and path-augmented heterogeneous network embedding for author identification. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pp. 295–304.
[5] (2017) Metapath2vec: scalable representation learning for heterogeneous networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 135–144.
[6] (2005) A new model for learning in graph domains. In Proceedings of the 2005 IEEE International Joint Conference on Neural Networks, Vol. 2, pp. 729–734.
[7] (2017) Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1024–1034.
[8] (2016) Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 770–778.
[9] (2010) Graph regularized transductive classification on heterogeneous information networks. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 570–586.
[10] (2015) Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
[11] (2017) Semi-supervised classification with graph convolutional networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.
[12] (2018) Deeper insights into graph convolutional networks for semi-supervised learning. In Thirty-Second AAAI Conference on Artificial Intelligence.
[13] (2013) Rectifier nonlinearities improve neural network acoustic models. In Proc. ICML, Vol. 30, pp. 3.
[14] (2014) Deepwalk: online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 701–710.
[15] (1988) Learning internal representations by error propagation. Readings in Cognitive Science 323 (6088), pp. 399–421.
[16] (2018) Modeling relational data with graph convolutional networks. In European Semantic Web Conference, pp. 593–607.
[17] (2018) Heterogeneous information network embedding for recommendation. IEEE Transactions on Knowledge and Data Engineering 31 (2), pp. 357–370.
[18] (2016) A survey of heterogeneous information network analysis. IEEE Transactions on Knowledge and Data Engineering 29 (1), pp. 17–37.
[19] (2014) Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15 (1), pp. 1929–1958.
[20] (2013) Determining scientific impact using a collaboration index. Proceedings of the National Academy of Sciences 110 (24), pp. 9680–9685. ISSN 0027-8424.
[21] (2015) Line: large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, pp. 1067–1077.
[22] (2008) Arnetminer: extraction and mining of academic social networks. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 990–998.
[23] (2017) Attention is all you need. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 5998–6008.
[24] (2018) Graph attention networks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings.
[25] (2016) Structural deep network embedding. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1225–1234.
[26] (2019) Heterogeneous graph attention network. In The World Wide Web Conference, pp. 2022–2032.
[27] (2019) Relation structure-aware heterogeneous graph neural network. In Proceedings of the 19th IEEE International Conference on Data Mining, pp. 1534–1539.