Graph Neural Networks
In recent years, many classes of GNN methods have been developed for a variety of heterogeneous network types Schlichtkrull et al. (2018); Zhang et al. (2019); Wang et al. (2019); Zhou et al. (2019); Hu et al. (2020). Although these GNNs are flexible for end-to-end supervised prediction tasks, they only optimize for predictions between direct interactions. Compared to conventional network embedding methods Grover and Leskovec (2016); Tang et al. (2015), standard GNNs generally do not take advantage of second-order relationships between indirect neighboring nodes. Recently, Huang et al. (2020) applied a fusion technique to combine first-order and second-order embeddings at alternating steps. Additionally, the Jumping Knowledge architecture of Xu et al. (2018) and GraphSAGE (sampling and aggregation) of Hamilton et al. (2017) have been proposed to extend neighborhood ranges; however, these techniques have yet to be extended to heterogeneous networks.
Notably, GTN Yun et al. (2019) was recently proposed to enable learning on higher-order meta paths in heterogeneous networks. It proposes a mechanism that soft-selects a convex combination of meta path layers using attention weights, then applies multiplication of adjacency matrices successively to reveal arbitrary-length transitive meta paths. This mechanism is unique in that it can infer attention weights not only on the given relations, but also on higher-order relations generated by deeper layers, a feature that most existing GNN methods neglect. A few limitations of GTN are that it necessarily assumes the feature distributions and representation spaces of different node and link types to be the same, and that it cannot weigh the importance of each meta path separately for each node type. Additionally, GTN can be computationally expensive, since it requires computations involving the adjacency structure of all node types at once.
Multiplex Network Embedding
It is worth mentioning the approaches designed for a subclass of the heterogeneous network, the multiplex network. Many of the current multiplex or multiview network embedding methods Fu et al. (2017); Matsuno and Murata (2018); Qu et al. (2017); Shi et al. (2018) have proposed strategies for aggregating the learned embeddings of multiple network “layers” into a single unified embedding. This class of methods typically specifies separate objectives for each layer to estimate the node features independently, then applies another objective to aggregate the information from all layers together.
Another paradigm is to use random walks over meta paths to model heterogeneous structures, as proposed in Perozzi et al. (2014); Dong et al. (2017); Fu et al. (2017). This class of approaches can learn network representations without supervised training for a specific task. However, they only learn representations for the primary node type, which consequently requires customized design of meta paths. They can also be sensitive to the random walk’s hyper-parameter settings, which may introduce unwanted biases or incur computational cost, and thus can lead to lackluster performance. Another class of algorithms, utilizing embedding translations, can also be applied to embed heterogeneous networks. For instance, Bordes et al. (2013)
learned linear transformations for each relation to model semantic relationships between entities. While embedding translations can effectively model heterogeneous networks, they are mainly suited for link prediction tasks.
We consider a heterogeneous network as a complex system involving multiple types of links between nodes of various types. To effectively represent the complex structure of the system, it is important to define separate adjacency matrices to distinguish the nature of relationships. In this section, we define coherent notations to study the class of attributed heterogeneous networks.
Definition 3.1: Attributed Heterogeneous Network
An attributed heterogeneous network is defined as a graph in which each node and each link are associated with type-mapping functions onto the sets of object types and relation types, respectively, where the total number of node and link types is greater than two. In the case of an attributed heterogeneous network, each node of a given node type is additionally mapped to its corresponding feature vector of a type-specific dimension.
We represent the heterogeneous link types as a set of biadjacency matrices, one per relation. Each meta relation specifies a link type between a source node type and a target node type, so that the corresponding biadjacency matrix has one row per source node and one column per target node. The biadjacency matrix may consist of weighted links, with a positive entry if there exists a link and a zero entry otherwise, indicating the absence of evidence for interaction. For a subnetwork, we define a node’s neighbor set under a relation as the nodes it links to with positive weight. Note that the biadjacency matrix is generally non-square, and thus does not have a diagonal. Furthermore, this definition assumes relations of directed links; for a relation with inherently undirected links, we may inject a reverse relation (the transposed biadjacency matrix) into the set.
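As a concrete illustration of these definitions, the sketch below builds a toy biadjacency matrix and neighbor set in NumPy; the node types, sizes, and weights are hypothetical, chosen only to show the non-square shape and the reverse-relation construction.

```python
import numpy as np

# Hypothetical toy network with two node types ("paper", "author") and one
# relation (paper, author); names and sizes are illustrative only.
n_papers, n_authors = 4, 3

# Biadjacency matrix: rows index papers, columns index authors, so its
# shape is (n_papers, n_authors) -- non-square, hence no diagonal.
A_pa = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],   # a weighted link
    [1.0, 0.0, 3.0],
    [0.0, 0.0, 0.0],   # isolated paper: no evidence of interaction
])

def neighbors(A, i):
    """Neighbor set of source node i: columns j with A[i, j] > 0."""
    return set(int(j) for j in np.nonzero(A[i] > 0)[0])

# An undirected relation is handled by injecting the reverse relation,
# i.e. the transposed biadjacency matrix.
A_ap = A_pa.T

print(neighbors(A_pa, 2))   # -> {0, 2}
```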
LATTE-1: First-order Heterogeneous Network Embedding
In this section, we start by describing the attention-based layers used in the LATTE heterogeneous network embedding architecture. The attention mechanism utilized in our method closely follows GAT Veličković et al. (2018) but is extended to infer higher-order link proximity scores for nodes and links of heterogeneous types. We also introduce the layer building blocks, where each layer has the role of inferring node embeddings from heterogeneous node content and preserving higher-order link proximities.
The input to our model is the set of heterogeneous adjacency matrices and the heterogeneous node features for each attributed node type. At each layer, we define the node embedding output, with a fixed embedding dimension, as:
where the layer index ranges over the stacked layers, and the heterogeneous link adjacency matrices of the corresponding order are used. In the next section, we describe the operations involved in the first-order case.
Heterogeneous First-order Proximities
The first-order proximity refers to direct links between any two nodes in the network among the heterogeneous relations. In order to model the different distribution of links in each relation type, we utilize a node-level attentional kernel specific to the type of the relation. Additionally, to sufficiently encode node features into higher-level features, each node type requires a separate linear transformation applied to every node of that type. Given any node of a source type and any node of a target type, the respective kernel parameters are utilized to compute the scoring mechanism:
where transposition and the concatenation operation are applied. We utilize two weight matrices to obtain the “source” context and the “target” context, respectively, for a pair of nodes depending on the node types and the direction of the link. Note that the attention-based proximity score is asymmetric, hence capable of modeling directed relationships in which the score of a link differs from that of its reverse.
Inferring Node-level Attention Coefficients
Next, our goal is to infer the importance of each neighbor node in the neighborhood around a given node for a given relation. Similar to GAT, we compute masked attention on existing links, such that the score is only computed for first-order neighbor nodes. The attention coefficients are computed by softmax normalization of the scores across all neighbors, as:
where a learnable “temperature” variable has the role of “sharpening” the attention scores Chorowski et al. (2015) across the link distribution of the relation. The temperature is expected to sharpen the distribution when the particular link distribution is dense or noisy; thus, integrating this technique allows the attention mechanism to focus on fewer neighbors. Once obtained, the normalized attention coefficients are used to compute a node’s feature distribution as a linear combination of its neighbors’ features for each relation.
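The node-level mechanism above can be sketched as follows. The GAT-style form of the score (a relation-specific kernel applied to concatenated “source” and “target” contexts with a LeakyReLU) and the multiplicative temperature are assumptions reconstructed from the surrounding text rather than the paper’s exact equations; all dimensions and weights are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

# Illustrative dimensions and parameters for one relation type.
d_in, d_out = 5, 4
W_src = rng.normal(size=(d_out, d_in))   # "source" context transform
W_tgt = rng.normal(size=(d_out, d_in))   # "target" context transform
a_r = rng.normal(size=2 * d_out)         # relation-specific attention kernel

def score(x_i, x_j):
    """Asymmetric proximity score: score(i, j) != score(j, i) in general."""
    z = np.concatenate([W_src @ x_i, W_tgt @ x_j])
    return leaky_relu(a_r @ z)

def attention_coeffs(x_i, neighbor_feats, beta=1.0):
    """Temperature-sharpened softmax over a node's first-order neighbors."""
    s = np.array([score(x_i, x_j) for x_j in neighbor_feats]) * beta
    e = np.exp(s - s.max())              # numerically stable softmax
    return e / e.sum()

x_i = rng.normal(size=d_in)
nbrs = [rng.normal(size=d_in) for _ in range(3)]
alpha = attention_coeffs(x_i, nbrs, beta=1.0)
sharp = attention_coeffs(x_i, nbrs, beta=5.0)  # larger beta -> more peaked
```

A larger temperature concentrates the coefficients on fewer neighbors, which matches the stated motivation for dense or noisy link distributions.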
Inferring Relation Weighing Coefficients
Since a node type is assumed to be involved in multiple types of relations, we must aggregate the relation-specific representations for each node. Previous works Wang et al. (2019); Yun et al. (2019) have proposed to measure the importance of each relation type by a set of semantic-level attention coefficients shared by all nodes. Instead, our method assigns the relation attention coefficients differently for each node, which enables the capacity to capture individual node heterogeneity in the network. First, consider the subset of meta relations that share a given source type. Since the number of relations involved in each node type can be different, each node of a given type only needs to soft-select from the subset of relevant relations. We utilize a linear transformation directly on node features to predict a normalized coefficient vector that soft-selects among the set of associated relations or the node itself. This operation is computed by:
where the transformation is parameterized by a weight matrix and bias for each node type. Since the coefficient vector is softmax-normalized, its entries sum to one, with one coefficient indexed for the “self” choice.
Aggregating First-order Neighborhoods
It is important not only to capture the local neighborhood of a node in a single relation but also to aggregate the neighborhoods among multiple relations and integrate the node’s own feature representation. First, we gather information obtained from each relation’s local neighborhoods, then combine their relation-specific embeddings. We apply both the node-level and relation-level attention coefficients to a weighted-average aggregation scheme:
where the outer function is a nonlinearity such as ReLU, applied per node type. The first-order node embedding is computed as an aggregation of linearly transformed immediate neighbor nodes. Next, we show that multiple LATTE layers can be stacked successively in a manner that allows the attention mechanism to capture higher-order relationships.
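A minimal sketch of the relation-weighing and aggregation steps above, assuming the relation coefficients come from a softmax over a linear transform of the node’s features (with one extra “self” slot) and that per-relation messages were already computed by the node-level attention; every shape and weight here is illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy setup: one node type participating in two relations (e.g. movie-actor
# and movie-director); shapes and weights are illustrative assumptions.
d_in, d_out, n_rel = 6, 4, 2
W = rng.normal(size=(d_out, d_in))          # per-node-type feature transform
W_rel = rng.normal(size=(n_rel + 1, d_in))  # predicts relation coefficients
b_rel = np.zeros(n_rel + 1)

def first_order_embedding(x_i, per_relation_msgs):
    """Aggregate relation-specific neighbor messages plus the node's own features.

    per_relation_msgs[r] is the attention-weighted average of transformed
    neighbor features under relation r (as in the node-level attention step).
    """
    beta = softmax(W_rel @ x_i + b_rel)        # soft-select relations + "self"
    agg = sum(beta[r] * m for r, m in enumerate(per_relation_msgs))
    agg = agg + beta[-1] * (W @ x_i)           # the "self" choice
    return np.maximum(agg, 0.0)                # ReLU nonlinearity

x_i = rng.normal(size=d_in)
msgs = [rng.normal(size=d_out) for _ in range(n_rel)]
h_i = first_order_embedding(x_i, msgs)
```

Because the coefficients are predicted from each node’s own features, two nodes of the same type can weigh the same relations differently, which is the stated advantage over a shared semantic-level attention.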
LATTE-T: Higher-order Heterogeneous Network Embedding
In this section, we describe the layer-stacking operations involved in extracting higher-order proximities. The t-order proximity applies to indirect t-length metapaths obtained by combining two matching meta relations. For instance, in the second-order case, we can connect a relation with a given target type to another relation with a matching source type. Then, computing the Adamic-Adar Adamic and Adar (2003) as:
yields the degree-normalized biadjacency matrix consisting of length-2 metapaths from the source nodes to the target nodes. We define the set of meta relations containing all length-t metapaths in the network as:
where the operator behaves as a Cartesian product that yields the Adamic-Adar composition only for matching pairs of relations. A length-t sequence of meta relations is denoted by its source type and target type. This is directly applicable to the classical metapath paradigm Sun et al. (2011), where all possible t-length metapaths are decomposed into separate relations. Note that throughout this paper, the meta relation notation is overloaded for brevity. In fact, this architecture can handle multiple meta relation types with the same source type and target type without loss of generality.
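A possible reading of this composition step is a degree-normalized product of two matching biadjacency matrices; the exact normalization below is an assumption inspired by the Adamic-Adar index, since the formula itself is not reproduced here, and the toy relations are illustrative.

```python
import numpy as np

def compose(A1, A2, eps=1e-12):
    """Length-2 meta path biadjacency via a degree-normalized matrix product.

    Assumed Adamic-Adar-style form: links are aggregated through intermediate
    nodes, down-weighted by each intermediate node's degree.
    """
    assert A1.shape[1] == A2.shape[0], "target type of r1 must match source of r2"
    deg = A1.sum(axis=0) + A2.sum(axis=1)     # degree of intermediate nodes
    return (A1 / (deg + eps)) @ A2

# Paper-Author and Author-Conference toy relations compose into a
# Paper-Conference length-2 meta relation.
A_pa = np.array([[1., 0.], [1., 1.]])          # 2 papers x 2 authors
A_ac = np.array([[1., 0., 0.], [0., 1., 0.]])  # 2 authors x 3 conferences
A_pc = compose(A_pa, A_ac)
print(A_pc.shape)   # (2, 3)
```

Because only matching (target, source) type pairs compose, the product never mixes unrelated node types, unlike a dense product over the full adjacency of all types at once.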
Heterogeneous Higher-order Proximities
Learning the higher-order attention structure for t-order relations involves the composition between the prior-order and first-order meta relation sets. Since the t-order proximity is a measure between a node’s prior-order context and another node in the network, we naturally must take the prior-order context embeddings into consideration. Similar to the first-order attention score, the t-order attention score between a pair of nodes is defined as:
The t-order attention scoring mechanism is parameterized by source and target weight matrices for all node types, as well as an attentional kernel for each relation type in the t-order set. Then, in the same manner as in Eq. (3), the attention coefficients for the t-order neighbors in each relation are softmax-normalized along with the temperature:
Obtaining the relation-weighing coefficients in the t-order also involves the prior-order context embedding for each node. For a node of a given type, we apply the relation weighing mechanism using its prior-order embedding with:
where the mechanism is parameterized by its own weights. In this way, LATTE can automatically identify important meta relations of arbitrary t-length by learning an adaptive relation weighing mechanism.
Aggregating Layer-wise Embeddings
While the first-order embedding represents the local neighborhood among the multiple relations, the t-order embedding expands the receptive field’s vicinity by traversing higher-order meta paths. The t-order embedding of a node is expressed as:
[Table 1: Network statistics for each dataset — relations (A-B), # nodes (A), # nodes (B), # links, # features, and training/testing splits.]
With this framework, the receptive field of t-order relations is contained within each t-order context embedding. Furthermore, as the embedding encapsulates each relation separately, it is possible to identify the specific relation types that are involved in the composite representation.
Given the layer-wise representations of a node, we obtain the final embedding output by concatenating all the t-order context embeddings, as:
where the concatenation yields a vector of the unified embedding dimension shared by all node types.
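The layer-wise concatenation can be sketched in a few lines; splitting the unified dimension evenly across layers is an illustrative assumption.

```python
import numpy as np

# Final embedding as the concatenation of per-layer context embeddings.
# Splitting the unified dimension d evenly across T layers is assumed here
# for illustration; only the concatenated size matters.
T, d = 2, 128
rng = np.random.default_rng(0)
h_layers = [rng.normal(size=d // T) for _ in range(T)]  # first- and second-order
h_final = np.concatenate(h_layers)
```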
Preserving Proximities with Attention Scores
We repurpose the computed attention scores to explicitly estimate the heterogeneous pairwise proximities in the network. Incorporating this objective not only enables our model to perform unsupervised learning but also allows the node-level attention mechanism to reinforce highly connected node pairs by taking advantage of weighted links. To preserve pairwise t-order proximities for all links in each relation, we apply the Noise Contrastive Estimation with negative sampling Mikolov et al. (2013) objective as:
where the sigmoid function is applied to the attention score to infer a probability value. The first term models the observed links, and the second term models the negative links drawn from the noise distribution. The number of sampled negative links is typically chosen to be between 2 to 5 times the number of positive links.
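A sketch of this negative-sampling objective applied to raw attention scores; the noise distribution is omitted, and the score values below are made up for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def nce_loss(pos_scores, neg_scores):
    """Negative-sampling objective over attention scores (sketch).

    pos_scores: attention scores of observed links in a relation.
    neg_scores: scores of links drawn from a noise distribution,
    typically 2-5x as many as the positives.
    """
    pos = -np.log(sigmoid(np.asarray(pos_scores)) + 1e-12)    # pull positives up
    neg = -np.log(sigmoid(-np.asarray(neg_scores)) + 1e-12)   # push negatives down
    return pos.sum() + neg.sum()

loss = nce_loss([2.0, 1.5], [-1.0, -0.5, 0.3, -2.0])
```

The loss approaches zero when observed links receive large positive scores and sampled noise links receive large negative scores.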
To learn from both the heterogeneous network’s attributes and topology, we optimize the proximity-preserving objectives and the downstream objective of the embedding outputs with the standard back-propagation algorithm. For semi-supervised node classification, a multi-layer perceptron follows the LATTE layers in order to predict labels given the node embeddings. The cross-entropy minimization objectives are defined as:
where the objective is evaluated over the set of nodes that have labels, given their true labels. The first term aims to encode the node embedding representations with attention mechanisms, while the second term reinforces the attention scores by iterating through weighted positive and sampled negative links.
Our model allows for computing embeddings for a subnetwork each iteration; thus, it does not require computations involving the global network structure of all nodes at once. This approach not only enables mini-batch training on large networks that do not fit on memory but also makes our technique fitted for inductive learning. To perform online training at each iteration, an input batch is generated by recursively sampling a fixed number of neighbor nodes Hamilton et al. (2017). Then, LATTE can yield embedding outputs for a sampled subnetwork given the local links and node attributes.
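The recursive neighbor sampling described above can be sketched as follows (GraphSAGE-style fan-outs, e.g. 25 and 20 per hop); the toy adjacency and fan-outs below are illustrative.

```python
import random

def sample_neighborhood(adj, seeds, fanouts, seed=0):
    """Recursively sample a fixed number of neighbors per node (GraphSAGE-style).

    adj: dict node -> list of neighbor nodes; fanouts: per-hop sample sizes.
    Returns the set of nodes forming the mini-batch subnetwork.
    """
    rng = random.Random(seed)
    batch = set(seeds)
    frontier = list(seeds)
    for fanout in fanouts:
        nxt = []
        for v in frontier:
            nbrs = adj.get(v, [])
            # Keep all neighbors if few, otherwise sample `fanout` of them.
            take = nbrs if len(nbrs) <= fanout else rng.sample(nbrs, fanout)
            nxt.extend(take)
        batch.update(nxt)
        frontier = nxt
    return batch

adj = {0: [1, 2, 3], 1: [0], 2: [0, 3], 3: []}
sub = sample_neighborhood(adj, seeds=[0], fanouts=[2, 2])
```

Embeddings are then computed only on the induced subnetwork, which bounds per-iteration memory regardless of the global network size.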
An effective network representation learning method can generalize to an unseen node by accurately encoding its links and attributes and then “aligning” them to the embedding space learned from seen (trained) nodes. In this section, we evaluate our method’s effectiveness on several node classification experiments, where the task is to predict node labels for a portion of the network hidden during training.
We conduct performance comparison experiments over several benchmark heterogeneous network datasets. In Table 1, a summary of the network statistics is provided for each of the following datasets:
DBLP (https://dblp.uni-trier.de): a heterogeneous network extracted from a bibliography dataset of major computer science journals and proceedings. The dataset has been preprocessed to contain 14328 papers, 4057 authors, 20 conferences, and 8789 terms. Three relation types are considered: paper-author, paper-conference, and paper-term. The author attributes are a bag-of-words representation of publication keywords. The classification task is to predict the label for each author among four domain areas: database, data mining, machine learning, and information retrieval.
ACM (https://dl.acm.org): A small citation network dataset containing paper-author and paper-subject relation types among 3025 papers, 5835 authors, and 56 subjects. Paper nodes are associated with a bag-of-words representation of keywords as features. The task is to label the conference each paper is published in, among the KDD, SIGMOD, SIGCOM, MobiCOMM, and VLDB venues.
IMDB Cantador et al. (2011): A movie database network containing movie-actor and movie-director relations among 4780 movies, 5841 actors, and 2269 directors. Each movie contains bag-of-words features of its plot, and the prediction task is to label the movie’s genre among Action, Comedy, and Drama.
In each of the datasets, every directed relation has a reverse relation included. All self-loop links have been removed, unless required by a particular algorithm.
To provide a consistent and reproducible experimental setup, the preprocessed networks were obtained from the CogDL Toolkit Cen et al. (2020) benchmark datasets. Each of the datasets is provided with a standard separation of train, validation, and test sets, as well as the full input features and label sets. Since our model evaluates these datasets in their standard environment, results from different experiments can be directly compared.
We verify the effectiveness of our framework by testing multiple variants of LATTE along with other existing approaches. For comparison with some of the state-of-the-art baselines, we consider various heterogeneous network embedding and GNN methods, including:
Metapath2Vec Dong et al. (2017): An unsupervised random walk method that utilizes the skip-gram along with negative sampling on meta paths to embed heterogeneous nodes. It has been shown to achieve prominent performance among random walk based approaches.
HIN2Vec Fu et al. (2017): a state-of-the-art deep neural network that learns embeddings by considering the meta paths in an attributed heterogeneous network. It utilizes random walk preprocessing, and it does not consider weighing of different meta paths.
HAN Wang et al. (2019): A GNN that employs a GAT-based node-level attention mechanism for heterogeneous networks. It proposes a hierarchical attention procedure that weighs the importance of each meta path, though only among pre-defined, hand-crafted meta paths.
GTN Yun et al. (2019): A GNN with an attention mechanism that weighs and combines heterogeneous meta paths successively into higher-order structures, then performs graph convolution on the resulting adjacency matrix.
LATTE-1: A variant of the proposed LATTE model with one layer that only considers first-order meta relations. The pairwise proximity-preserving objective is excluded.
LATTE-2: A variant of LATTE with two layers that considers both first-order and second-order meta relations. The pairwise proximity-preserving objective is excluded.
LATTE-2: Same as LATTE-2 but additionally optimizes the higher-order proximity preserving objectives.
Every method was evaluated on the identical split of training, validation, and testing sets for fairness and reproducibility. The final model is trained on the training set until the early-stopping criterion on the validation set is met, then evaluated on the test set. Additionally, each method must exploit all relations and the available node attributes in the dataset, except for metapath2vec due to its limitations. If a particular node type in the heterogeneous network is not attributed, we instantiate a set of learnable embeddings to serve as its node features.
We set the following hyper-parameters identically for all methods: embedding dimension size of 128, learning rate of 0.001, mini-batch size of 2048, and early stopping if the validation loss does not decrease after ten epochs. For HAN and GTN, the number of GNN hidden layers is 2, preceding an MLP that predicts node labels given the embedding outputs in an end-to-end manner. For random walk based methods, a logistic classifier is employed to perform node classification given the learned node embeddings. The hyper-parameters for metapath2vec and HIN2Vec are a walk length of 100, window size of 5, 40 walks per node, and 5 negative samples. Among GNN-based methods, the batch sampling procedure that recursively samples a fixed number of neighbor nodes Hamilton et al. (2017) is utilized, with neighborhood sample sizes of 25 and 20. Where possible, the standard implementation of baseline methods is provided by the CogDL Toolkit.
For all LATTE variants, the best-performing hyper-parameters selected ReLU as the embedding activation function, drop-out at 30% on the embedding outputs, and weight decay regularization (excluding biases) at 0.01. In LATTE-2, the negative sampling ratio is set to the value reported in the technical appendix. The models have been implemented with PyTorch Geometric (PyG), and the experiments were conducted on a GeForce RTX 2080 Ti with 11 GB of GPU memory. Hyper-parameter tuning was conducted with Weights and Biases Biewald (2020), and the parameter ranges tested are reported in the technical appendix.
Node Classification Experiments and Results
We consider semi-supervised classification tasks in both inductive and transductive settings to perform a thorough evaluation of representation learning in heterogeneous networks. In the transductive setting, models can traverse the subgraph containing test-set nodes during training. In contrast, the inductive setting requires that the models never encounter the test subgraph during the training phase and must predict testing nodes’ labels on the novel subgraph at the testing phase. We train and evaluate all baseline methods to predict test nodes in each of the transductive and inductive settings over ten trials.
To measure the classification performance of the prediction outputs, we record the precision and recall for each class label to compute the F1 score. Due to the apparent class imbalance in the three datasets, we report only the averaged Macro-F1 score, which was the more challenging metric in similar experiments Wang et al. (2019). The performance comparisons are reported in Table 2. For metapath2vec, HIN2Vec, HAN, and GTN, the benchmark Macro-F1 scores in the transductive setting have been provided by the CogDL Toolkit, while the Macro-F1 scores in the inductive setting are averaged over 10 experiment runs.
The top performance by LATTE-2 indicates its effectiveness at learning node representations on the higher-order meta relation structures, especially with 80-90% of the network set aside for testing. Compared to HAN, which does not consider higher-order relations, GTN and LATTE-2 have a significant edge in inductive prediction because both can capture global properties. Compared to GTN, which does not maintain the semantic space of individual meta paths, LATTE-2 outperforms by employing explicit proximity-preserving objectives for each of the decomposed higher-order meta relations.
Interpretation of the Attention Mechanism
LATTE’s fundamental properties are the construction of higher-order meta relations and the attention mechanism that weighs the importance of those relations. To demonstrate these features’ benefits, we interpret the importance levels chosen for each meta relation and verify whether they reflect the structural topology of the heterogeneous network. Given the learned weights for each node at a given layer, we can assess not only the averaged meta relation weights for a node type, but also the individual meta relation weights for each node. In Fig. 1, we report the average and standard deviation of the meta relation attention weights for IMDB, as well as the correlation between those weights and the node degrees for each relation. The meta relation weights for DBLP and ACM are reported as supplementary material.
For IMDB movies, it can be observed that, on average, the MA, MD, MDM, and MAM meta relations have the highest attention weights. This indicates that information from the movie-actor neighborhoods, movie-director neighborhoods, and the node’s own features is relatively more represented in each movie’s first-order embedding. This selection also persists in the second-order embeddings, where MDM and MAM have higher weights. Additionally, when looking at the correlation between MA’s weights and the degree of MA links over all nodes, there is a correlation, which indicates the attention mechanism can adaptively weigh the relation based on the number of connections present at the node. Interestingly, there is a substantial negative correlation between the M “self” relation weights and the node degree. This indicates that nodes with fewer or no links will choose a higher weight for their own features, since little information can be gained from other modalities. As individual nodes may have varying levels of participation among the various relations, this result demonstrates that LATTE can select the most effective meta relations for individual nodes depending on their local and global properties in the heterogeneous topology.
Discussion and Conclusion
The task of aggregating heterogeneous relations remains a fundamental challenge in designing a representation learning method for heterogeneous networks. Multiple relations can represent different semantics, and their link distributions can be overlapping, interconnected, or non-complementary. Therefore, it is an appropriate first step to consider them as separate components of the network to unravel their structural dependencies. One of the key differences between existing GNN methods and the proposed LATTE is that the latter exploits the semantic information in each meta relation. Instead of conflating heterogeneous relations for all node types as in HAN and GTN, LATTE aggregates only the relevant relations for each node type. Furthermore, by considering the source type and target type of each meta relation, only relevant pairs of relations can be joined when generating higher-order meta paths. A significant benefit of this approach is that it relieves the computational burden of multiplying adjacency matrices for all nodes while allowing distinct representations for the different node types.
This work has proposed an architecture for heterogeneous network embedding, which can generate higher-order meta relations. The benefits of the mechanism proposed are not only to improve inductive node classification performance but also to improve interpretation of deep GNN models. In the future, we will explore whether to incorporate a self-attention mechanism to learn the structural dependencies between relations by propagating information between the different relation-specific embeddings. Other interesting future developments are to enable LATTE to pre-train without supervision and to extend LATTE to link prediction tasks.
- Adamic, L. A. and Adar, E. (2003). Friends and neighbors on the web. Social Networks 25(3), pp. 211–230.
- Battiston, F., Nicosia, V., and Latora, V. (2014). Structural measures for multiplex networks. Physical Review E 89(3), 032804.
- Biewald, L. (2020). Experiment tracking with Weights and Biases. Software available from wandb.com.
- Bordes, A., Usunier, N., García-Durán, A., Weston, J., and Yakhnenko, O. (2013). Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems, pp. 2787–2795.
- Cantador, I., Brusilovsky, P., and Kuflik, T. (2011). Second workshop on information heterogeneity and fusion in recommender systems (HetRec2011). In Proceedings of the Fifth ACM Conference on Recommender Systems, pp. 387–388.
- Cen, Y., et al. (2020). CogDL: an extensive research toolkit for deep learning on graphs.
- Chorowski, J., Bahdanau, D., Serdyuk, D., Cho, K., and Bengio, Y. (2015). Attention-based models for speech recognition. In Advances in Neural Information Processing Systems, pp. 577–585.
- Dong, Y., Chawla, N. V., and Swami, A. (2017). metapath2vec: scalable representation learning for heterogeneous networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 135–144.
- Dong, Y., Hu, Z., Wang, K., Sun, Y., and Tang, J. (2020). Heterogeneous network representation learning. In Proceedings of the 29th International Joint Conference on Artificial Intelligence.
- Fu, T.-Y., Lee, W.-C., and Lei, Z. (2017). HIN2Vec: explore meta-paths in heterogeneous information networks for representation learning. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pp. 1797–1806.
- Grover, A. and Leskovec, J. (2016). node2vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 855–864.
- Hamilton, W. L., Ying, R., and Leskovec, J. (2017). Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1024–1034.
- Hu, Z., Dong, Y., Wang, K., and Sun, Y. (2020). Heterogeneous graph transformer. In Proceedings of The Web Conference 2020, pp. 2704–2710.
- Huang, K., Xiao, C., Glass, L., Zitnik, M., and Sun, J. (2020). SkipGNN: predicting molecular interactions with skip-graph networks. arXiv preprint arXiv:2004.14949.
- Kipf, T. N. and Welling, M. (2016). Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
- Matsuno, R. and Murata, T. (2018). MELL: effective embedding method for multiplex networks. In Companion Proceedings of The Web Conference 2018, pp. 1261–1268.
- Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111–3119.
- Perozzi, B., Al-Rfou, R., and Skiena, S. (2014). DeepWalk: online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 701–710.
- Qu, M., Tang, J., Shang, J., Ren, X., Zhang, M., and Han, J. (2017). An attention-based collaboration framework for multi-view network representation learning. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pp. 1767–1776.
- Schlichtkrull, M., Kipf, T. N., Bloem, P., van den Berg, R., Titov, I., and Welling, M. (2018). Modeling relational data with graph convolutional networks. In European Semantic Web Conference, pp. 593–607.
- Shi, Y., et al. (2018). Mvn2vec: preservation and collaboration in multi-view network embedding. arXiv preprint arXiv:1801.06597.
- Sun, Y., Han, J., Yan, X., Yu, P. S., and Wu, T. (2011). PathSim: meta path-based top-k similarity search in heterogeneous information networks. Proceedings of the VLDB Endowment 4(11), pp. 992–1003.
- Tang, J., Qu, M., Wang, M., Zhang, M., Yan, J., and Mei, Q. (2015). LINE: large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, pp. 1067–1077.
- Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., and Bengio, Y. (2018). Graph attention networks. In International Conference on Learning Representations.
- Wang, X., Ji, H., Shi, C., Wang, B., Ye, Y., Cui, P., and Yu, P. S. (2019). Heterogeneous graph attention network. In The World Wide Web Conference, pp. 2022–2032.
- Xu, K., Li, C., Tian, Y., Sonobe, T., Kawarabayashi, K., and Jegelka, S. (2018). Representation learning on graphs with jumping knowledge networks. arXiv preprint arXiv:1806.03536.
- Yun, S., Jeong, M., Kim, R., Kang, J., and Kim, H. J. (2019). Graph transformer networks. In Advances in Neural Information Processing Systems, pp. 11983–11993.
- Zhang, C., Song, D., Huang, C., Swami, A., and Chawla, N. V. (2019). Heterogeneous graph neural network. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 793–803.
- Zhou, S., et al. (2019). HAHE: hierarchical attentive heterogeneous information network embedding. arXiv preprint arXiv:1902.01475.
Hyper-parameter tuning was conducted on the Weights and Biases platform Biewald (2020), where we utilize a random search approach that chooses random sets of parameter values. The parameters tested are the embedding dimension, the t-order, the attention score activation function, the number of neighbors sampled, the negative sampling ratio, the embedding output activation function, and the dropout probability.
Interpretation of the Attention Mechanism for DBLP and ACM
Following the demonstration of interpreting the learned attention weights in the IMDB dataset, we report the same attention weights and the weight-degree correlation results for the DBLP and ACM datasets. In Figs. 4 and 5, it can be observed that the correlation between the meta relation weights and the node degree exhibits the same phenomenon described for IMDB.