Related Work
Graph Neural Networks
In recent years, many classes of GNN methods have been developed for a variety of heterogeneous network types Schlichtkrull et al. (2018); Zhang et al. (2019); Wang et al. (2019); Zhou et al. (2019); Hu et al. (2020). Although these GNNs are flexible for end-to-end supervised prediction tasks, they only optimize for predictions between direct interactions. Compared to conventional network embedding methods Grover and Leskovec (2016); Tang et al. (2015), standard GNNs generally do not take advantage of second-order relationships between indirect neighboring nodes. Recently, Huang et al. (2020) applied a fusion technique to combine first-order and second-order embeddings at alternating steps. Additionally, the Jumping Knowledge architecture of Xu et al. (2018) and GraphSAGE (sampling and aggregation) of Hamilton et al. (2017) have proposed to extend neighborhood ranges; however, such techniques have yet to be extended to heterogeneous networks.
Notably, GTN Yun et al. (2019) was recently proposed to enable learning on higher-order meta paths in heterogeneous networks. It proposes a mechanism that soft-selects a convex combination of meta path layers using attention weights, then successively multiplies adjacency matrices to reveal arbitrary-length transitive meta paths. This mechanism is unique in that it can infer attention weights not only on the given relations but also on higher-order relations generated by deeper layers, a feature that most existing GNN methods neglect. GTN has a few limitations: it necessarily assumes the feature distribution and representation space of different node and link types to be the same, and it cannot weigh the importance of each meta path separately for each node type. Additionally, GTN can be computationally expensive, since it requires computations involving the adjacency structure of all node types at once.
Multiplex Network Embedding
It is worth mentioning the approaches designed for a subclass of the heterogeneous network, the multiplex network. Many of the current multiplex or multi-view network embedding methods Fu et al. (2017); Matsuno and Murata (2018); Qu et al. (2017); Shi et al. (2018) have proposed strategies for aggregating the learned embeddings of multiple network "layers" into a single unified embedding. This class of methods typically specifies a separate objective for each layer to estimate the node features independently, then applies another objective to aggregate the information from all layers.
Another paradigm is to use random walks over meta paths to model heterogeneous structures, as proposed in Perozzi et al. (2014); Dong et al. (2017); Fu et al. (2017). This class of approaches can learn network representations without supervised training for a specific task. However, they only learn representations for the primary node type, which consequently requires customized design of meta paths. They can also be sensitive to the random walk's hyperparameter settings, which may introduce unwanted biases or be computationally costly, leading to lackluster performance. Another class of algorithms, utilizing embedding translations, can also be applied to embed heterogeneous networks. For instance, Bordes et al. (2013) learned linear transformations for each relation to model semantic relationships between entities. While embedding translations can effectively model heterogeneous networks, they are mainly suited for link prediction tasks.
Method
Preliminary
We consider a heterogeneous network as a complex system involving multiple types of links between nodes of various types. To effectively represent the complex structure of the system, it is important to define separate adjacency matrices to distinguish the nature of relationships. In this section, we define coherent notations to study the class of attributed heterogeneous networks.
Definition 3.1: Attributed Heterogeneous Network
An attributed heterogeneous network is defined as a graph G = (V, E) in which each node v ∈ V and each link e ∈ E are associated with their mapping functions φ(v): V → N and ψ(e): E → R. N and R denote the sets of object and relation types, where |N| + |R| > 2. In the case of an attributed heterogeneous network, the node feature representation is given by X_n, which maps a node v of node type n to its corresponding feature vector x_v of dimension d_n.
We represent the heterogeneous link types as a set of bi-adjacency matrices {A_r}, where r ∈ R. Each meta relation r = (n, m) specifies a link type between source node type n and target node type m, such that A_r ∈ R^{|V_n| × |V_m|}. The bi-adjacency matrix may contain weighted links, where A_r[i, j] > 0 if there exists a link, and A_r[i, j] = 0 otherwise, indicating the absence of evidence for an interaction. For a subnetwork A_r, we define node v_i's neighbor set as N_r(v_i) = {v_j : A_r[i, j] > 0}. Note that A_r is non-square, and thus does not have a diagonal. Furthermore, this definition assumes relations of directed links, but for a relation with inherently undirected links, we may inject a reverse relation A_r^⊤ into the set.
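As a concrete illustration of these definitions, the following minimal numpy sketch stores one bi-adjacency matrix per meta relation, looks up a node's neighbor set, and injects a reverse relation by transposition. The dictionary layout and the names `adjacency`, `neighbors`, and `reverse_relation` are illustrative choices, not part of the paper's formalism.

```python
import numpy as np

# Hypothetical toy network: node types "paper" (3 nodes) and "author" (2 nodes),
# with one meta relation (paper, author) stored as a |V_paper| x |V_author|
# bi-adjacency matrix; entries > 0 are (possibly weighted) links.
adjacency = {
    ("paper", "author"): np.array([[1.0, 0.0],
                                   [0.0, 2.0],   # weighted link
                                   [1.0, 1.0]]),
}

def neighbors(adjacency, relation, i):
    """Indices of target-type nodes linked to source node i under the relation."""
    return np.nonzero(adjacency[relation][i] > 0)[0]

def reverse_relation(adjacency, relation):
    """Inject the reverse relation by transposing the bi-adjacency matrix."""
    src, tgt = relation
    return (tgt, src), adjacency[relation].T
```

Because each matrix is non-square, the source and target node index spaces stay distinct, which is what lets later layers compose matching relations.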
LATTE-1: First-order Heterogeneous Network Embedding
In this section, we start by describing the attention-based layers used in the LATTE heterogeneous network embedding architecture. The attention mechanism utilized in our method closely follows GAT Veličković et al. (2018) but is extended to infer higher-order link proximity scores for nodes and links of heterogeneous types. We also introduce the layer building blocks, where each layer has the role of inferring node embeddings from heterogeneous node content and preserving higher-order link proximities.
The input to our model is the set of heterogeneous adjacency matrices {A_r} and the heterogeneous node features {X_n}, where n ∈ N. At each layer t, we define the node embedding output H^(t) ∈ R^{|V| × d_t}, where d_t is the embedding dimension, as:

H^(t) = LATTE-t({A_r^(t)}, H^(t-1))

where H^(0) = {X_n}, and {A_r^(t)} is the set of heterogeneous link adjacency matrices in the t-th order. In the next section, we describe the operations involved when t = 1.
Heterogeneous First-order Proximities
The first-order proximity refers to direct links between any two nodes in the network among the heterogeneous relations in R. To model the different link distribution of each relation type r, we utilize a node-level attentional kernel a_r depending on the type of the relation. Additionally, to sufficiently encode node features into higher-level features, each node type n requires a separate linear transformation applied to every node in V_n. Given a node v_i of type n and a node v_j of type m, the respective kernel parameter a_r is utilized to compute the scoring mechanism:

e_r(v_i, v_j) = a_r^⊤ [W_src^n x_i ∥ W_tgt^m x_j]   (1)

where ⊤ is the transposition and ∥ is the concatenation operation. We utilize two weight matrices W_src^n and W_tgt^m to obtain the "source" context and the "target" context, respectively, for a pair of nodes depending on the node types and the direction of the link. Note that the attention-based proximity score is asymmetric, hence capable of modeling directed relationships where e_r(v_i, v_j) ≠ e_r(v_j, v_i).
Inferring Node-level Attention Coefficients
Next, our goal is to infer the importance of each neighbor node in the neighborhood N_r(v_i) around node v_i for a given relation. Similar to GAT, we compute masked attention on existing links, such that e_r(v_i, v_j) is only computed for first-order neighbor nodes v_j ∈ N_r(v_i). The attention coefficients are computed by softmax normalization of the scores across all v_j ∈ N_r(v_i), as:

α_r(v_i, v_j) = exp(β_r · e_r(v_i, v_j)) / Σ_{v_k ∈ N_r(v_i)} exp(β_r · e_r(v_i, v_k))   (2)

where β_r is a learnable "temperature" variable initialized at 1 that has the role of "sharpening" the attention scores Chorowski et al. (2015) across the link distribution in the relation. It is expected that β_r > 1 when the particular link distribution is dense or noisy; integrating this technique thus allows the attention mechanism to focus on fewer neighbors. Once obtained, the normalized attention coefficients are used to compute a node's feature distribution by a linear combination of its neighbors for each relation.
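The GAT-style scoring and the temperature-sharpened softmax described above can be sketched as follows. This is a minimal numpy illustration, assuming random weights; `W_src`, `W_tgt`, `a_r`, and the function names are illustrative stand-ins for the learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 4, 8
W_src = rng.normal(size=(d_out, d_in))   # "source" context transform for the source node type
W_tgt = rng.normal(size=(d_out, d_in))   # "target" context transform for the target node type
a_r = rng.normal(size=2 * d_out)         # relation-specific attentional kernel

def score(x_i, x_j):
    """Asymmetric attention score for a directed link (v_i -> v_j) under relation r (Eq. 1)."""
    return a_r @ np.concatenate([W_src @ x_i, W_tgt @ x_j])

def attention_coefficients(x_i, neighbor_feats, beta=1.0):
    """Temperature-sharpened softmax over v_i's first-order neighbors (Eq. 2)."""
    scores = np.array([score(x_i, x_j) for x_j in neighbor_feats]) * beta
    scores -= scores.max()               # subtract max for numerical stability
    exp = np.exp(scores)
    return exp / exp.sum()
```

Raising `beta` above 1 concentrates the coefficients on the highest-scoring neighbors, which is the "sharpening" effect used for dense or noisy relations.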
Inferring Relation Weighing Coefficients
Since a node type can be involved in multiple types of relations, we must aggregate the relation-specific representations for each node. Previous works Wang et al. (2019); Yun et al. (2019) have proposed to measure the importance of each relation type by a set of semantic-level attention coefficients shared by all nodes. Instead, our method assigns the relation attention coefficients differently for each node, which enables the capacity to capture individual node heterogeneity in the network. First, we denote R_n as the subset of meta relations with source type n. Since the number of relations involved in each node type can differ, each node of type n only needs to soft-select from the subset of relevant relations. We utilize a linear transformation directly on node features to predict a normalized coefficient vector of size |R_n| + 1 that soft-selects among the set of associated relations or the node itself. This operation is computed by:

r_n(v) = softmax(W_n x_v + b_n)   (3)

where r_n(·) is parameterized by the weights W_n and bias b_n for each node type n. Since r_n(v) is softmax normalized, Σ_{r ∈ R_n} r_n(v)[r] + r_n(v)[self] = 1, where r_n(v)[self] is the coefficient indexed for the "self" choice.
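A per-node relation-weighing step of this form reduces to a small linear layer followed by a softmax over |R_n| + 1 slots. The sketch below assumes numpy and random weights; the relation list and parameter names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in = 4
relations_n = [("movie", "actor"), ("movie", "director")]  # R_n for source type n = "movie"
# Linear layer predicting |R_n| + 1 coefficients; the extra slot is the "self" choice.
W_n = rng.normal(size=(len(relations_n) + 1, d_in))
b_n = np.zeros(len(relations_n) + 1)

def relation_weights(x_v):
    """Softmax-normalized relation coefficients for a node v of type n (Eq. 3)."""
    logits = W_n @ x_v + b_n
    logits -= logits.max()   # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()
```

Because the coefficients depend on each node's own features `x_v`, two nodes of the same type can weigh the same relation differently, which is the node-heterogeneity property claimed above.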
Aggregating First-order Neighborhoods
It is important not only to capture the local neighborhood of a node in a single relation but also to aggregate the neighborhoods among multiple relations and integrate the node's own feature representation. First, we gather the information obtained from each relation's local neighborhood, then combine the relation-specific embeddings. We apply both the node-level and relation-level attention coefficients in a weighted-average aggregation scheme:

h_v^(1) = σ( r_n(v)[self] · W_src^n x_v + Σ_{r=(n,m) ∈ R_n} r_n(v)[r] · Σ_{u ∈ N_r(v)} α_r(v, u) · W_tgt^m x_u )   (4)

where σ is a non-linear function such as ReLU, and u's node type is m. The first-order node embedding is computed as an aggregation of linear-transformed immediate neighbor nodes. Next, we show that multiple LATTE layers can be stacked successively in a manner that allows the attention mechanism to capture higher-order relationships.

LATTE-T: Higher-order Heterogeneous Network Embedding
In this section, we describe the layer-stacking operations involved in extracting higher-order proximities when t > 1. The t-order proximity applies to indirect length-t meta paths obtained by combining two matching meta relations. For instance, when t = 2, we can connect a relation (n, m) with target type m to another relation (m, p) with matching source type m. Then, computing the Adamic-Adar Adamic and Adar (2003) score as:

A_{(n,m,p)} = A_{(n,m)} D_m^{-1} A_{(m,p)}   (5)

yields A_{(n,m,p)} as the degree-normalized bi-adjacency matrix consisting of length-2 meta paths from V_n nodes to V_p nodes, where D_m is the diagonal degree matrix of the intermediate node type m. We define the set of meta relations containing all length-t meta paths in the network as:

R_t = R_{t-1} × R   (6)

where × behaves as a cartesian product that yields the Adamic-Adar composition only for matching pairs of relations. A length-t sequence of meta relations with source type n and target type m is denoted as (n, ..., m). This is directly applicable to the classical meta path paradigm Sun et al. (2011), where all possible length-t meta paths are decomposed into separate relations in R_t. Note that throughout this paper, the meta relation notation is overloaded for brevity. In fact, this architecture can handle multiple meta relation types with the same source type and target type without loss of generality.
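The matching-pair composition and the degree-normalized matrix product can be sketched with numpy as below. The exact degree normalization in Eq. (5) is an assumption here (the intermediate type's degree is approximated from the two bi-adjacency matrices); function names are illustrative.

```python
import numpy as np

def compose(A_rs, A_rt):
    """Degree-normalized product A_rs @ D^{-1} @ A_rt (an Adamic-Adar-style score),
    yielding the bi-adjacency matrix of length-2 meta paths."""
    # Assumed degree of the intermediate node type: links entering plus leaving it.
    deg = A_rs.sum(axis=0) + A_rt.sum(axis=1)
    inv = np.divide(1.0, deg, out=np.zeros_like(deg, dtype=float), where=deg > 0)
    return A_rs @ np.diag(inv) @ A_rt

def higher_order_relations(relations):
    """Cartesian-product composition (Eq. 6): keep only pairs whose target and
    source types match, i.e. (n, m) x (m, p) -> (n, m, p)."""
    return [(r1, r2) for r1 in relations for r2 in relations if r1[1] == r2[0]]
```

Only matching pairs survive the product, so the number of higher-order relations grows with the network's schema rather than with a dense product over all node types.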
Heterogeneous Higher-order Proximities
Learning the higher-order attention structure for t-order relations involves the composition between the R_{t-1} and R meta relation sets. Since the t-order proximity is a measure between a node's (t-1)-order context and another node in the network, we naturally must take into consideration h_i^(t-1) as the prior-order context embedding. Similar to the first-order attention score, e_r^(t)(v_i, v_j) is the t-order attention score between node v_i and node v_j, defined as:

e_r^(t)(v_i, v_j) = a_r^⊤ [W_src^(t,n) h_i^(t-1) ∥ W_tgt^(t,m) h_j^(t-1)]   (7)

The t-order attention scoring mechanism is parameterized by W_src^(t,n) and W_tgt^(t,m) for all node types, as well as a_r for each relation type in R_t. Then, in the same manner as in Eq. (2), the attention coefficient for each t-order neighbor in the relation is softmax normalized along with the temperature β_r.
Obtaining the relation-weighing coefficients in the t-th order also involves the prior-order context embedding for each node. For a node v of type n, we apply the relation-weighing mechanism to its prior-order embedding with:

r_n^(t)(v) = softmax(W_n^(t) h_v^(t-1) + b_n^(t))   (8)

where r_n^(t)(·) is parameterized by the weights W_n^(t). In this way, LATTE can automatically identify important meta relations of arbitrary length by learning an adaptive relation-weighing mechanism.
Aggregating Layer-wise Embeddings
While the first-order embedding represents the local neighborhood among the multiple relations, the t-order embedding expands the receptive field's vicinity by traversing higher-order meta paths. The t-order embedding of node v is expressed as:
Table 1: Summary statistics of the heterogeneous network datasets.

Dataset | Relations (A-B)        | # nodes (A) | # nodes (B) | # links | # features | Training | Testing
DBLP    | Paper-Author (PA)      | 14328       | 4057        | 19645   | 334        | 20%      | 70%
        | Paper-Conference (PC)  | 14328       | 20          | 14328   |            |          |
        | Paper-Term (PT)        | 14328       | 8789        | 88420   |            |          |
ACM     | Paper-Author (PA)      | 2464        | 5835        | 9744    | 1830       | 20%      | 70%
        | Paper-Subject (PS)     | 3025        | 56          | 3025    |            |          |
IMDB    | Movie-Actor (MA)       | 4780        | 5841        | 9744    | 1232       | 10%      | 80%
        | Movie-Director (MD)    | 4780        | 2269        | 3025    |            |          |
h_v^(t) = σ( r_n^(t)(v)[self] · h_v^(t-1) + Σ_{r ∈ R_t,n} r_n^(t)(v)[r] · Σ_{u ∈ N_r(v)} α_r^(t)(v, u) · W_tgt^(t,m) h_u^(t-1) )   (9)

With this framework, the receptive field of t-order relations is contained within each t-order context embedding. Furthermore, as h_v^(t) encapsulates each relation in R_t separately, it is possible to identify the specific relation types that are involved in the composite representation.
Given the layer-wise representations of node v, we obtain the final embedding output by concatenating all the t-order context embeddings, as:

h_v = ∥_{t=1}^{T} h_v^(t)   (10)

where h_v ∈ R^d, with d = Σ_t d_t as the unified embedding dimension size for all node types.
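The layer-wise concatenation of Eq. (10) is a simple operation; the numpy sketch below stacks all nodes' per-layer embeddings side by side. The function name and array layout are illustrative.

```python
import numpy as np

def final_embeddings(layerwise):
    """Concatenate 1st..T-th order context embeddings (Eq. 10).

    layerwise: list of (|V| x d_t) arrays, one per LATTE layer.
    Returns a (|V| x sum(d_t)) array, the unified embedding.
    """
    return np.concatenate(layerwise, axis=1)
```

Keeping the per-layer blocks separate (rather than summing them) is what lets downstream analysis attribute parts of the final embedding to a specific proximity order.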
Preserving Proximities with Attention Scores
We repurpose the computed attention scores to explicitly estimate the heterogeneous pairwise proximities in the network. Incorporating this objective not only enables our model to perform unsupervised learning but also allows the node-level attention mechanism to reinforce highly connected node pairs by taking advantage of weighted links. To preserve pairwise t-th order proximities for all links in each relation, we apply the Noise Contrastive Estimation with negative sampling Mikolov et al. (2013) objective as:

L_proximity = − Σ_r Σ_{A_r[i,j] > 0} A_r[i,j] · log σ(e_r(v_i, v_j)) − Σ_{k=1}^{K} E_{v_k ∼ P_r(v)} log σ(−e_r(v_i, v_k))   (11)

where σ denotes the sigmoid function applied to the attention score to infer a probability value. The first term models the observed links, the second term models the negative links drawn from the noise distribution P_r(v), and K is the number of sampled negative links. Typically, K is chosen to be between 2 to 5 times the number of positive links.

Model Optimization
To learn from both the heterogeneous network's attributes and topology, we optimize the proximity-preserving objective and the downstream objective of the embedding outputs with the standard backpropagation algorithm. For semi-supervised node classification, a multi-layer perceptron follows the LATTE layers to predict labels given the node embeddings. The cross-entropy minimization objective is defined as:

L = − Σ_{v ∈ V_label} y_v log(ŷ_v) + L_proximity   (12)

where V_label is the set of nodes that have labels, and y_v is the true label. The first term aims to encode the node embedding representations with attention mechanisms, while the second term reinforces the attention scores by iterating through weighted positive and sampled negative links.
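The two loss terms above can be sketched as plain numpy functions, assuming precomputed attention scores for positive and sampled negative links and softmax probabilities from the classifier head; the function names and the weighting scheme for positive links are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def proximity_loss(pos_scores, neg_scores, pos_weights=None):
    """NCE-with-negative-sampling objective: observed links vs. sampled noise links."""
    pos_weights = np.ones_like(pos_scores) if pos_weights is None else pos_weights
    pos = -(pos_weights * np.log(sigmoid(pos_scores) + 1e-12)).sum()   # observed links
    neg = -np.log(sigmoid(-neg_scores) + 1e-12).sum()                  # sampled negatives
    return pos + neg

def cross_entropy(probs, labels):
    """Multi-class cross-entropy over labeled nodes (the supervised term)."""
    return -np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean()
```

In training, the two terms would be summed and backpropagated jointly, so the attention scores serve double duty as both aggregation weights and proximity estimates.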
Our model allows computing embeddings for a subnetwork at each iteration; thus, it does not require computations involving the global network structure of all nodes at once. This approach not only enables mini-batch training on large networks that do not fit in memory but also makes our technique suited for inductive learning. To perform online training at each iteration, an input batch is generated by recursively sampling a fixed number of neighbor nodes Hamilton et al. (2017). Then, LATTE can yield embedding outputs for a sampled subnetwork given the local links and node attributes.
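The recursive fixed-size neighbor sampling cited above (the GraphSAGE-style batch generator) can be sketched as follows; the adjacency-list format and function name are illustrative assumptions, not the paper's implementation.

```python
import random

def sample_neighborhood(adj_list, seeds, sizes, seed=0):
    """Recursively sample a fixed number of neighbors per node, hop by hop.

    sizes, e.g. [25, 20], gives the fan-out per hop; returns one node list
    per hop, starting from the seed batch.
    """
    rng = random.Random(seed)
    layers = [list(seeds)]
    for size in sizes:
        frontier = []
        for v in layers[-1]:
            nbrs = adj_list.get(v, [])
            # Keep all neighbors if fewer than the fan-out, else subsample.
            frontier.extend(nbrs if len(nbrs) <= size else rng.sample(nbrs, size))
        layers.append(frontier)
    return layers
```

The union of the sampled layers defines the subnetwork whose local links and attributes are fed to the LATTE layers, bounding memory per iteration regardless of the full graph's size.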
Experiments
An effective network representation learning method can generalize to an unseen node by accurately encoding its links and attributes and then “aligning” them to the embedding space learned from seen (trained) nodes. In this section, we evaluate our method’s effectiveness on several node classification experiments, where the task is to predict node labels for a portion of the network hidden during training.
Table 2: Macro-F1 scores for transductive (trans) and inductive (induc) node classification, and model sizes.

Dataset | Metric     | metapath2vec | HIN2Vec | HAN    | GTN    | LATTE-1      | LATTE-2      | LATTE-2 (prox)
DBLP    | F1 (trans) | 0.7518       | 0.7431  | 0.9121 | 0.9203 | 0.8911±0.003 | 0.9240±0.003 | 0.9156±0.003
        | F1 (induc) | –            | –       | 0.8666 | 0.8721 | 0.8620±0.004 | 0.8631±0.003 | 0.8822±0.032
        | # params   | 2.3M         | 2.3M    | 240K   | 125K   | 78K          | 111K         | 111K
ACM     | F1 (trans) | 0.8879       | 0.8466  | 0.8725 | 0.9085 | 0.9118±0.005 | 0.9134±0.005 | 0.9153±0.003
        | F1 (induc) | –            | –       | 0.7909 | 0.8860 | 0.8988±0.003 | 0.9007±0.003 | 0.9156±0.003
        | # params   | 387K         | 1.1M    | 1.5M   | 326K   | 250K         | 273K         | 273K
IMDB    | F1 (trans) | 0.4310       | 0.4404  | 0.5394 | 0.5924 | 0.6066±0.018 | 0.6135±0.014 | 0.6363±0.007
        | F1 (induc) | –            | –       | 0.3877 | 0.5810 | 0.6036±0.009 | 0.6117±0.038 | 0.6355±0.004
        | # params   | 611K         | 1.6M    | 1.4M   | 243K   | 170K         | 196K         | 196K

Values denote the mean ± standard deviation over 10 trials.
Datasets
We conduct performance comparison experiments over several benchmark heterogeneous network datasets. In Table 1, a summary of the network statistics is provided for each of the following datasets:

DBLP (https://dblp.uni-trier.de): a heterogeneous network extracted from a bibliography dataset on major computer science journals and proceedings. The dataset has been preprocessed to contain 14328 papers, 4057 authors, 20 conferences, and 8789 terms. Three relation types are considered: paper-author, paper-conference, and paper-term. The author attributes are a bag-of-words representation of publication keywords. The classification task is to predict the label for each author among four domain areas: database, data mining, machine learning, and information retrieval.

ACM (https://dl.acm.org): a small citation network dataset containing paper-author and paper-subject relation types among 3025 papers, 5835 authors, and 56 subjects. Paper nodes are associated with a bag-of-words representation of keywords as features. The task is to label the conference each paper is published in, among the KDD, SIGMOD, SIGCOMM, MobiCom, and VLDB venues.

IMDB Cantador et al. (2011): a movie database network containing movie-actor and movie-director relations among 4780 movies, 5841 actors, and 2269 directors. Each movie contains bag-of-words features of the plot, and the prediction task is to label the movie's genre among Action, Comedy, and Drama.
In each of the datasets, every directed relation has a reverse relation included. All self-loop links have been removed, unless required by a certain algorithm.
Experimental Setup
To provide a consistent and reproducible experimental setup, the preprocessed networks were obtained from the CogDL Toolkit Cen et al. (2020) benchmark datasets. Each dataset is provided with a standard separation of train, validation, and test sets, as well as the full input features and label set. Since our model evaluates these datasets in their standard environment, results from different experiments can be directly compared.
Baselines
We verify the effectiveness of our framework by testing multiple variants of LATTE along with other existing approaches. For comparison with the state of the art, we consider various heterogeneous network embedding and GNN methods, including:

Metapath2Vec Dong et al. (2017): An unsupervised random-walk method that utilizes the skip-gram model with negative sampling on meta paths to embed heterogeneous nodes. It has been shown to achieve prominent performance among random-walk-based approaches.

HIN2Vec Fu et al. (2017): A state-of-the-art deep neural network that learns embeddings by considering the meta paths in an attributed heterogeneous network. It utilizes random-walk preprocessing, and it does not consider weighing of different meta paths.

HAN Wang et al. (2019): A GNN that employs a GAT-based node-level attention mechanism for heterogeneous networks. It proposes a hierarchical attention procedure that weighs the importance of each meta path, however, only among predefined, handcrafted meta paths.

GTN Yun et al. (2019): A GNN with an attention mechanism that weighs and combines heterogeneous meta paths successively into higherorder structures, then performs graph convolution on the resulting adjacency matrix.

LATTE-1: A variant of the proposed LATTE model with one layer that only considers first-order meta relations. The pairwise proximity-preserving objective is excluded.

LATTE-2: A variant of LATTE with two layers that considers both first-order and second-order meta relations. The pairwise proximity-preserving objective is excluded.

LATTE-2 (prox): Same as LATTE-2 but additionally optimizes the higher-order proximity-preserving objectives.
Every method was evaluated on the identical split of training, validation, and testing sets for fairness and reproducibility. The final model is trained on the training set until the early stopping criterion on the validation set is met, then evaluated on the test set. Additionally, each method must exploit all relations and the available node attributes in the dataset, except for metapath2vec due to its limitations. If a particular node type in the heterogeneous network is not attributed, we instantiate a set of learnable embeddings to serve as its node features.
Implementation Details
We set the following hyperparameters identically for all methods: embedding dimension size at 128, learning rate at 0.001, mini-batch size at 2048, and early stopping if the validation loss does not decrease after ten epochs. For HAN and GTN, the number of GNN hidden layers is 2, preceding an MLP that predicts node labels given the embedding outputs in an end-to-end manner. For random-walk-based methods, a logistic classifier is employed to perform node classification given the learned node embeddings. The hyperparameters for metapath2vec and HIN2Vec are walk length at 100, window size at 5, walks per node at 40, and the number of negative samples at 5. Among GNN-based methods, the batch sampling procedure that recursively samples a fixed number of neighbor nodes Hamilton et al. (2017) is utilized, with neighborhood sample sizes 25 and 20. Where possible, the standard implementation of baseline methods was provided by the CogDL Toolkit.

For all LATTE variants, the best-performing hyperparameters selected ReLU as the embedding activation function, dropout at 30% on the embedding outputs, and weight decay regularization (excluding biases) at 0.01. In LATTE-2 (prox), the negative sampling ratio is set to . The models have been implemented with PyTorch Geometric (PyG), and the experiments were conducted on a GeForce RTX 2080 Ti with 11 GB of GPU memory. The hyperparameter tuning was conducted with Weights and Biases Biewald (2020), and the parameter ranges tested are reported in the technical appendix.

Node Classification Experiments and Results
We consider semi-supervised classification tasks in both inductive and transductive settings to perform a thorough evaluation of representation learning in heterogeneous networks. In the transductive setting, models can traverse the subgraph containing test-set nodes during training. In contrast, the inductive setting requires that the models never encounter the test subgraph during the training phase and must predict testing nodes' labels on the novel subgraph at testing time. We train and evaluate all baseline methods to predict test nodes in each of the transductive and inductive settings over ten trials.

To measure the classification performance of the prediction outputs, we record the precision and recall for each class label to compute the F1 score. Due to the apparent class imbalance in the three datasets, we report only the averaged Macro-F1 score, which was the more challenging metric in similar experiments Wang et al. (2019). The performance comparisons are reported in Table 2. For metapath2vec, HIN2Vec, HAN, and GTN, the benchmark Macro-F1 scores in the transductive setting were provided by the CogDL Toolkit, while the Macro-F1 scores in the inductive setting are averaged over 10 experiment runs.

The top performance by LATTE-2 (prox) indicates its effectiveness at learning node representations on the higher-order meta relation structures, especially with 80 to 90% of the network set aside for testing. Compared to HAN, which does not consider higher-order relations, GTN and LATTE-2 have a significant edge in inductive prediction because both can capture global properties. Compared to GTN, which does not maintain the semantic space of individual meta paths, LATTE-2 (prox) outperforms with explicit proximity-preserving objectives for each of the decomposed higher-order meta relations.
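Under class imbalance, the Macro-F1 metric used above averages per-class F1 scores without weighting by class frequency. A minimal self-contained implementation (equivalent in spirit to scikit-learn's `f1_score(average="macro")`; the function name here is our own) is:

```python
import numpy as np

def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of per-class F1 scores, so rare classes count equally."""
    f1s = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall) if precision + recall else 0.0)
    return float(np.mean(f1s))
```

Because every class contributes equally to the mean, a model that ignores a minority class is penalized, which is why Macro-F1 is the more challenging metric on these imbalanced datasets.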
Interpretation of the Attention Mechanism
LATTE's fundamental properties are the construction of higher-order meta relations and the attention mechanism that weighs the importance of those relations. To demonstrate the benefits of these features, we interpret the importance levels chosen for each meta relation and verify whether they reflect the structural topology of the heterogeneous network. Given the learned weights for each node at a layer t, we can assess not only the averaged meta relation weights for a node type, but also the individual meta relation weights for each node. In Fig. 1, we report the average and standard deviation of the meta relation attention weights for IMDB, as well as the correlation between those weights and the node degrees for each relation. The meta relation weights for DBLP and ACM are reported as supplementary material.
For IMDB movies, it can be observed that, on average, the MA, MD, MDM, and MAM meta relations have the highest attention weights. This indicates that information from the movie-actor neighborhoods, movie-director neighborhoods, and the node's own features is relatively more represented in each movie's first-order embedding. This selection also persists in the second-order embeddings, where MDM and MAM have higher weights. Additionally, when looking at the correlation between MA's weights and the degree of MA links over all nodes, there is a positive correlation, which indicates the attention mechanism can adaptively weigh the relation based on the number of connections present at the node. Interestingly, there is a substantial negative correlation between the M "self" relation weights and the node degree. This indicates that nodes with fewer or no links will choose a higher weight for their own features, since little information can be gained from other modalities. As individual nodes may have varying levels of participation among the various relations, this result demonstrates that LATTE can select the most effective meta relation for individual nodes depending on their local and global properties in the heterogeneous topology.
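The weight-degree correlation analysis described above reduces to a Pearson correlation between each node's learned coefficient for a relation and its degree in that relation's bi-adjacency matrix. A small numpy sketch, with illustrative names:

```python
import numpy as np

def weight_degree_correlation(relation_weights, adjacency):
    """Pearson correlation between per-node attention weights for a relation
    and each node's degree under that relation's bi-adjacency matrix."""
    degrees = (adjacency > 0).sum(axis=1)          # out-degree per source node
    return float(np.corrcoef(relation_weights, degrees)[0, 1])
```

A strongly positive value indicates the mechanism allocates more weight to a relation for well-connected nodes, while a negative value on the "self" slot matches the observation that sparsely linked nodes fall back on their own features.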
Discussion and Conclusion
The task of aggregating heterogeneous relations remains a fundamental challenge in designing a representation learning method for heterogeneous networks. Multiple relations can represent different semantics, and their link distributions can be overlapping, interconnected, or non-complementary. Therefore, it is an appropriate first step to consider them as separate components of the network to unravel their structural dependencies. One of the key differences between existing GNN methods and the proposed LATTE is that the latter exploits the semantic information in each meta relation. Instead of conflating heterogeneous relations for all node types as in HAN and GTN, LATTE aggregates only the relevant relations for each node type. Furthermore, by considering the source type and target type of each meta relation, only relevant pairs of relations can be joined when generating higher-order meta paths. A significant benefit of this approach is that it relieves the computational burden of multiplying adjacency matrices for all nodes while allowing distinct representations for the different node types.
This work has proposed an architecture for heterogeneous network embedding that can generate higher-order meta relations. The benefits of the proposed mechanism are not only to improve inductive node classification performance but also to improve the interpretability of deep GNN models. In the future, we will explore whether to incorporate a self-attention mechanism to learn the structural dependencies between relations by propagating information between the different relation-specific embeddings. Other interesting future developments are to enable LATTE to pretrain without supervision and to extend LATTE to link prediction tasks.
References
Adamic and Adar (2003). Friends and neighbors on the web. Social Networks 25(3), pp. 211-230.
Battiston et al. (2014). Structural measures for multiplex networks. Physical Review E 89(3), 032804.
Biewald (2020). Experiment tracking with Weights and Biases. Software available from wandb.com.
Bordes et al. (2013). Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems, pp. 2787-2795.
Cantador et al. (2011). Second workshop on information heterogeneity and fusion in recommender systems (HetRec 2011). In Proceedings of the Fifth ACM Conference on Recommender Systems, pp. 387-388.
Cen et al. (2020). CogDL: an extensive research toolkit for deep learning on graphs.
Chorowski et al. (2015). Attention-based models for speech recognition. In Advances in Neural Information Processing Systems, pp. 577-585.
Dong et al. (2017). metapath2vec: scalable representation learning for heterogeneous networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 135-144.
Heterogeneous network representation learning.
Fu et al. (2017). HIN2Vec: explore meta-paths in heterogeneous information networks for representation learning. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pp. 1797-1806.
Grover and Leskovec (2016). node2vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 855-864.
Hamilton et al. (2017). Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1024-1034.
Hu et al. (2020). Heterogeneous graph transformer. In Proceedings of The Web Conference 2020, pp. 2704-2710.
Huang et al. (2020). SkipGNN: predicting molecular interactions with skip-graph networks. arXiv preprint arXiv:2004.14949.
Kipf and Welling (2017). Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
Matsuno and Murata (2018). MELL: effective embedding method for multiplex networks. In Companion Proceedings of The Web Conference 2018, pp. 1261-1268.
Mikolov et al. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111-3119.
Perozzi et al. (2014). DeepWalk: online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 701-710.
Qu et al. (2017). An attention-based collaboration framework for multi-view network representation learning. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pp. 1767-1776.
Schlichtkrull et al. (2018). Modeling relational data with graph convolutional networks. In European Semantic Web Conference, pp. 593-607.
Shi et al. (2018). mvn2vec: preservation and collaboration in multi-view network embedding. arXiv preprint arXiv:1801.06597.
Sun et al. (2011). PathSim: meta path-based top-k similarity search in heterogeneous information networks. Proceedings of the VLDB Endowment 4(11), pp. 992-1003.
Tang et al. (2015). LINE: large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, pp. 1067-1077.
Veličković et al. (2018). Graph attention networks. International Conference on Learning Representations.
Wang et al. (2019). Heterogeneous graph attention network. In The World Wide Web Conference, pp. 2022-2032.
Xu et al. (2018). Representation learning on graphs with jumping knowledge networks. arXiv preprint arXiv:1806.03536.
Yun et al. (2019). Graph transformer networks. In Advances in Neural Information Processing Systems, pp. 11983-11993.
Zhang et al. (2019). Heterogeneous graph neural network. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 793-803.
Zhou et al. (2019). HAHE: hierarchical attentive heterogeneous information network embedding. arXiv preprint arXiv:1902.01475.
Technical Appendix
Hyperparameter Tuning
The hyperparameter tuning was conducted on the Weights and Biases platform Biewald (2020), where we utilized a random search approach that chooses random sets of parameter values. The parameters tested were the embedding dimension, the order, the attention score activation function, the number of neighbors sampled, the negative sampling ratio, the embedding output activation function, and the dropout probability.
Interpretation of the Attention Mechanism for DBLP and ACM
Following the demonstration interpreting the learned attention weights in the IMDB dataset, we report the same attention weights and the weight-degree correlation results for the DBLP and ACM datasets. In Fig. 4 and 5, it can be observed that the correlation between the meta relation weights and the node degree exhibits the same phenomenon described for IMDB.