Layer-stacked Attention for Heterogeneous Network Embedding

09/17/2020 ∙ by Nhat Tran, et al. ∙ The University of Texas at Arlington

The heterogeneous network is a robust data abstraction that can model entities of different types interacting in various ways. Such heterogeneity brings rich semantic information but presents nontrivial challenges in aggregating the heterogeneous relationships between objects - especially those of higher-order indirect relations. Recent graph neural network approaches for representation learning on heterogeneous networks typically employ the attention mechanism, which is often only optimized for predictions based on direct links. Furthermore, even though most deep learning methods can aggregate higher-order information by building deeper models, such a scheme can diminish the degree of interpretability. To overcome these challenges, we explore an architecture - Layer-stacked ATTention Embedding (LATTE) - that automatically decomposes higher-order meta relations at each layer to extract the relevant heterogeneous neighborhood structures for each node. Additionally, by successively stacking layer representations, the learned node embedding offers a more interpretable aggregation scheme for nodes of different types at different neighborhood ranges. We conducted experiments on several benchmark heterogeneous network datasets. In both transductive and inductive node classification tasks, LATTE can achieve state-of-the-art performance compared to existing approaches, all while offering a lightweight model. Extensive experimental analyses and visualizations demonstrate the framework's ability to extract informative insights from heterogeneous networks.

Related Work

Graph Neural Networks

In recent years, many classes of GNN methods have been developed for a variety of heterogeneous network types Schlichtkrull et al. (2018); Zhang et al. (2019); Wang et al. (2019); Zhou et al. (2019); Hu et al. (2020). Although these types of GNNs are flexible for end-to-end supervised prediction tasks, they only optimize for predictions between direct interactions. Compared to conventional network embedding methods Grover and Leskovec (2016); Tang et al. (2015), standard GNNs generally do not take advantage of second-order relationships between indirect neighboring nodes. Recently, Huang et al. (2020) applied a fusion technique to combine first-order and second-order embeddings at alternating steps. Additionally, the Jumping Knowledge architecture from Xu et al. (2018) and GraphSAGE (sampling and aggregation) from Hamilton et al. (2017) have been proposed to extend the neighborhood range; however, there has yet to be an extension of such techniques for heterogeneous networks.

Notably, GTN Yun et al. (2019) was recently proposed to enable learning on higher-order meta paths in heterogeneous networks. It proposes a mechanism that soft-selects a convex combination of meta path layers using attention weights, then applies multiplication of adjacency matrices successively to reveal arbitrary-length transitive meta paths. This mechanism is unique in that it can infer attention weights not only on the given relations, but also on higher-order relations generated by deeper layers, a feature that most existing GNN methods neglect. A few limitations of GTN are that it necessarily assumes the feature distributions and representation spaces of different node and link types to be the same, and that it cannot weigh the importance of each meta path separately for each node type. Additionally, GTN can be computationally expensive, since it requires computations involving the adjacency structure of all node types at once.

Multiplex Network Embedding

It is worth mentioning the approaches designed for a subclass of the heterogeneous network, the multiplex network. Many of the current multiplex or multiview network embedding methods Fu et al. (2017); Matsuno and Murata (2018); Qu et al. (2017); Shi et al. (2018) have proposed strategies for aggregating the learned embeddings of multiple network "layers" into a single unified embedding. This class of methods typically specifies separate objectives for each of the layers to estimate the node features independently, then applies another objective to aggregate the information from all layers together.

Another paradigm is to use random walks over meta paths to model heterogeneous structures, as proposed in Perozzi et al. (2014); Dong et al. (2017); Fu et al. (2017). This class of approaches can learn network representations without supervised training for a specific task. However, they only learn representations for the primary node type, which consequently requires the customized design of meta paths. Also, they can be sensitive to the random walk's hyper-parameter settings, which may introduce unwanted biases or be computationally costly, and can thus lead to lacking performance. Another class of algorithms utilizing embedding translations can also be applied for embedding heterogeneous networks. For instance, Bordes et al. (2013) learned linear transformations for each relation to model semantic relationships between entities. While embedding translations can effectively model heterogeneous networks, they are mainly fitted for link prediction tasks.

Method

Preliminary

We consider a heterogeneous network as a complex system involving multiple types of links between nodes of various types. To effectively represent the complex structure of the system, it is important to define separate adjacency matrices to distinguish the nature of relationships. In this section, we define coherent notations to study the class of attributed heterogeneous networks.

Definition 3.1: Attributed Heterogeneous Network

An attributed heterogeneous network is defined as a graph $G = (V, E)$ in which each node $v \in V$ and each link $e \in E$ are associated with their mapping functions $\phi(v): V \rightarrow \mathcal{O}$ and $\phi(e): E \rightarrow \mathcal{R}$, respectively. $\mathcal{O}$ and $\mathcal{R}$ denote the sets of object and relation types. In the case of an attributed heterogeneous network, the node feature representation is given by $x_i \in \mathbb{R}^{F_o}$, which maps node $i$ of node type $o$ to its corresponding feature vector of dimension $F_o$.

We represent the heterogeneous link types as a set of biadjacency matrices $\mathcal{A} = \{A_r\}_{r \in \mathcal{R}}$. Each meta relation $r$ specifies a link type between source node type $s(r)$ and target node type $t(r)$, such that $A_r \in \mathbb{R}^{|V_{s(r)}| \times |V_{t(r)}|}$. The biadjacency matrix may consist of weighted links, where $A_r[i,j] > 0$ if there exists a link, and otherwise $A_r[i,j] = 0$, indicating the absence of evidence for interaction. For a subnetwork, we define node $i$'s neighbor set as $\mathcal{N}_r(i) = \{\, j : A_r[i,j] > 0 \,\}$. Note that $A_r$'s size is non-quadratic in general, and thus it does not have a diagonal. Furthermore, this definition assumes relations of directed links, but for a relation with inherently undirected links, we may inject a reverse relation into the set.
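To make the data layout concrete, the following is a minimal sketch (not the authors' code) of how such an attributed heterogeneous network can be held in memory: one rectangular biadjacency matrix per meta relation, keyed by a (source type, relation name, target type) tuple, and one feature matrix per node type. The type names, sizes, and relation keys here are illustrative assumptions.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Node features: one dense matrix per node type, with type-specific dimensions.
features = {
    "paper":  np.random.rand(100, 334),   # 100 papers with 334-dim bag-of-words
    "author": np.random.rand(40, 334),    # 40 authors
}

# Biadjacency matrices: rows index the source type, columns the target type,
# so the matrix is generally non-quadratic and has no diagonal.
rows, cols = [0, 0, 5], [1, 3, 2]          # a few example paper->author links
adjacency = {
    ("paper", "writes", "author"): csr_matrix(
        (np.ones(len(rows)), (rows, cols)), shape=(100, 40)
    ),
}

# For an inherently undirected relation, a reverse relation can be injected.
adjacency[("author", "writes_rev", "paper")] = \
    adjacency[("paper", "writes", "author")].T.tocsr()

def neighbors(adj: csr_matrix, i: int) -> np.ndarray:
    """Neighbors of node i under one relation: nonzero columns of row i."""
    return adj[i].indices
```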

LATTE-1: First-order Heterogeneous Network Embedding

In this section, we start by describing the attention-based layers used in the LATTE heterogeneous network embedding architecture. The attention mechanism utilized in our method closely follows GAT Veličković et al. (2018) but is extended to infer higher-order link proximity scores for nodes and links of heterogeneous types. We also introduce the layer building blocks where each layer has the roles of inferring node embeddings from heterogeneous node content and preserving higher-order link proximities.

The input to our model is the set of first-order heterogeneous biadjacency matrices $\mathcal{A}^{(1)}$ and the heterogeneous node features $\{X_o\}_{o \in \mathcal{O}}$, where $X_o \in \mathbb{R}^{|V_o| \times F_o}$. At each layer $t$, we define the node embedding output $H^{(t)}_o \in \mathbb{R}^{|V_o| \times D}$, where $D$ is the embedding dimension, as:

$$H^{(t)} = \mathrm{LATTE}_t\big(H^{(t-1)}, \mathcal{A}^{(t)}, X\big)$$

where $H^{(0)} = X$, and $\mathcal{A}^{(t)}$ is the set of heterogeneous link adjacency matrices in the $t$-order. In the next section, we describe the operations involved when $t = 1$.
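The overall control flow of the layer stacking can be sketched as below. This is a schematic reading of the scheme, not the paper's implementation: `latte_layers` and `compose_relations` are hypothetical callables (the layer signature taking the prior-order embeddings, the $t$-order relations, and the raw features is an assumption), and the final concatenation corresponds to Eq. (10) later in the section.

```python
import torch

def latte_forward(latte_layers, compose_relations, x, adjacency):
    """
    latte_layers:      list of per-order callables LATTE_t (hypothetical API)
    compose_relations: callable building the (t+1)-order meta relations (see Eq. 5-6)
    x:                 dict node_type -> feature tensor (H^(0) = X)
    adjacency:         dict of first-order biadjacency matrices keyed by meta relation
    """
    h_prev, adj_t, outputs = x, adjacency, []
    for layer in latte_layers:
        h_prev = layer(h_prev, adj_t, x)              # H^(t) = LATTE_t(H^(t-1), A^(t), X)
        outputs.append(h_prev)
        adj_t = compose_relations(adj_t, adjacency)   # A^(t+1) from A^(t) and A^(1)
    # Final embedding per node type: concatenation of all t-order contexts (Eq. 10).
    return {ntype: torch.cat([h[ntype] for h in outputs], dim=1) for ntype in x}
```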

Heterogeneous First-order Proximities

The first-order proximity refers to direct links between any two nodes in the network among the heterogeneous relations in $\mathcal{R}$. In order to model the different distribution of links in each relation type $r$, we utilize a node-level attentional kernel $a_r$ depending on the type of the relation. Additionally, to sufficiently encode node features into higher-level features, each node type $o$ requires a separate linear transformation applied to every node in $V_o$. Given any node $i$ of type $s(r)$ and node $j$ of type $t(r)$, the respective kernel parameter $a_r$ is utilized to compute the scoring mechanism:

$$e_r(i, j) = a_r^\top \left[\, W^{src}_{s(r)} x_i \,\|\, W^{tgt}_{t(r)} x_j \,\right] \qquad (1)$$

where $\cdot^\top$ is the transposition and $\|$ is the concatenation operation. We utilize two weight matrices $W^{src}$ and $W^{tgt}$ to obtain the "source" context and the "target" context, respectively, for a pair of nodes depending on the node types and the direction of the link. Note that the attention-based proximity score is asymmetric, i.e. $e_r(i, j) \neq e_r(j, i)$, hence capable of modeling directed relationships.
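A minimal sketch of this relation-specific scoring, under assumed shapes: the source and target nodes are projected by type-specific "source"/"target" matrices, concatenated, and scored by the relation kernel. The function and variable names are illustrative only.

```python
import torch

def attention_score(x_i, x_j, W_src, W_tgt, a_r):
    """
    x_i:   (F_src,) features of the source node (type s(r))
    x_j:   (F_tgt,) features of the target node (type t(r))
    W_src: (D, F_src) and W_tgt: (D, F_tgt) type-specific projections
    a_r:   (2 * D,) relation-specific attention kernel
    """
    z = torch.cat([W_src @ x_i, W_tgt @ x_j])   # "source" context || "target" context
    return a_r @ z                               # asymmetric: score(i, j) != score(j, i)
```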

Inferring Node-level Attention Coefficients

Next, our goal is to infer the importance of each neighbor node $j$ in the neighborhood around node $i$ for a given relation. Similar to GAT, we compute masked attention on existing links, such that $e_r(i, j)$ is only computed for first-order neighbor nodes $j \in \mathcal{N}_r(i)$. The attention coefficients are computed by softmax normalization of the scores across all $j \in \mathcal{N}_r(i)$, as:

$$\alpha_r(i, j) = \frac{\exp\big(\tau_r \, e_r(i, j)\big)}{\sum_{k \in \mathcal{N}_r(i)} \exp\big(\tau_r \, e_r(i, k)\big)} \qquad (2)$$

where $\tau_r$ is a learnable "temperature" variable initialized at $1$ that has the role of "sharpening" the attention scores Chorowski et al. (2015) across the link distribution in the relation. It is expected that $\tau_r > 1$ when the particular link distribution is dense or noisy; thus, integrating this technique allows the attention mechanism to focus on fewer neighbors. Once obtained, the normalized attention coefficients are used to compute the feature distribution of a node's neighborhood by a linear combination of its neighbors for each relation.
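The masked, temperature-sharpened softmax can be sketched as follows (a sketch under assumed dense shapes, not the paper's code; how the temperature enters the softmax is an assumption consistent with the "sharpening" description).

```python
import torch

def relation_attention(scores, neighbor_mask, tau_r):
    """
    scores:        (N_src, N_tgt) raw attention scores e_r(i, j)
    neighbor_mask: (N_src, N_tgt) boolean, True where a link exists in A_r
    tau_r:         learnable scalar, e.g. torch.nn.Parameter(torch.ones(1))
    """
    masked = scores.masked_fill(~neighbor_mask, float("-inf"))
    # Larger tau_r sharpens the distribution over the existing neighbors.
    # (Rows with no neighbors would need special handling in practice.)
    return torch.softmax(tau_r * masked, dim=1)
```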

Inferring Relation Weighing Coefficients

Since a node type is assumed to be involved in multiple types of relations, we must aggregate the relation-specific representations for each node. Previous works Wang et al. (2019); Yun et al. (2019) have proposed to measure the importance of each relation type by a set of semantic-level attention coefficients shared by all nodes. Instead, our method assigns the relation attention coefficients differently for each node $i$, which enables the capacity to capture individual node heterogeneity in the network. First, we denote $\mathcal{R}_o$ as the subset of meta relations with source type $o$. Since the number of relations involved in each node type can be different, each node of type $o$ only needs to soft-select from the subset of relevant relations. We utilize a linear transformation directly on node features to predict a normalized coefficient vector $\beta_i$ of size $|\mathcal{R}_o| + 1$ that soft-selects among the set of associated relations or the node itself. This operation is computed by:

$$\beta_i = \mathrm{softmax}\big( W_o \, x_i + b_o \big) \qquad (3)$$

where the transformation is parameterized by the weights $W_o$ and bias $b_o$ for each node type $o$. Since $\beta_i$ is softmax normalized, $\sum_{r} \beta_{i,r} = 1$, where $\beta_{i,o}$ is the coefficient indexed for the "self" choice.
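A short sketch of this per-node relation weighing, assuming a node type that participates in a known number of relations (class and parameter names are illustrative): a single linear layer on the node's own features produces the softmax-normalized coefficients, with one extra slot for the "self" choice.

```python
import torch
import torch.nn as nn

class RelationWeights(nn.Module):
    def __init__(self, in_dim: int, num_relations: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, num_relations + 1)  # +1 for the "self" slot

    def forward(self, x_o: torch.Tensor) -> torch.Tensor:
        # x_o: (N_o, F_o) features of all nodes of type o
        return torch.softmax(self.linear(x_o), dim=1)        # beta: (N_o, |R_o| + 1)
```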

Aggregating First-order Neighborhoods

It is important not only to capture the local neighborhood of a node in a single relation, but also to aggregate the neighborhoods among multiple relations and integrate the node's own feature representation. First, we gather information obtained from each relation's local neighborhood, then combine the relation-specific embeddings. We apply both the node-level and relation-level attention coefficients to a weighted-average aggregation scheme:

$$h^{(1)}_i = \sigma\Big( \beta_{i,o} \, W^{src}_{o} \, x_i \; + \sum_{r \in \mathcal{R}_o} \beta_{i,r} \sum_{j \in \mathcal{N}_r(i)} \alpha_r(i, j) \, W^{tgt}_{t(r)} \, x_j \Big) \qquad (4)$$

where $\sigma$ is a nonlinear function such as ReLU, and node $i$'s node type is $o$. The first-order node embedding is computed as an aggregation of linearly transformed immediate neighbor nodes. Next, we show that multiple LATTE layers can be stacked successively in a manner that allows the attention mechanism to capture higher-order relationships.
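The aggregation step can be sketched as below (a sketch under assumed shapes, not the reference implementation): relation-specific neighborhood averages weighted by the node-level attention are mixed by the per-node relation coefficients, together with the node's own projected features, followed by a nonlinearity.

```python
import torch

def aggregate_first_order(x_o, neigh_feats, alphas, betas, W_self, W_neigh):
    """
    x_o:         (N_o, F_o) features of the nodes being embedded (type o)
    neigh_feats: dict r -> (N_t(r), F_t(r)) features of target-type nodes per relation
    alphas:      dict r -> (N_o, N_t(r)) masked-softmax attention coefficients
    betas:       (N_o, len(neigh_feats) + 1) relation coefficients, last column = "self"
    W_self:      (D, F_o); W_neigh: dict r -> (D, F_t(r))
    """
    h = betas[:, -1:] * (x_o @ W_self.T)                          # "self" term
    for k, r in enumerate(neigh_feats):
        neigh = alphas[r] @ (neigh_feats[r] @ W_neigh[r].T)       # attention-weighted neighbors
        h = h + betas[:, k:k + 1] * neigh                         # relation-level weighting
    return torch.relu(h)                                          # sigma = ReLU
```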

LATTE-T: Higher-order Heterogeneous Network Embedding

In this section, we describe the layer-stacking operations involved in extracting higher-order proximities when $t > 1$. The $t$-order proximity applies to indirect $t$-length metapaths obtained by combining two matching meta relations. For instance, when $t = 2$, we can connect a relation $r'$ with target type $m$ to another relation $r$ with matching source type $m$. Then, computing the Adamic-Adar Adamic and Adar (2003) as:

$$A_{r' \circ r} = A_{r'} \, D_m^{-1} \, A_r \qquad (5)$$

yields $A_{r' \circ r}$ as the degree-normalized biadjacency matrix consisting of length-2 metapaths from $s(r')$ nodes to $t(r)$ nodes, where $D_m$ is the diagonal degree matrix of the intermediate node type $m$. We define the set of meta relations containing all length-$t$ metapaths in the network as:

$$\mathcal{R}_t = \mathcal{R}_{t-1} \otimes \mathcal{R}_1 \qquad (6)$$

where $\otimes$ behaves as a cartesian product that yields the Adamic-Adar composition only for matching pairs of relations. A length-$t$ sequence of meta relations with source type $s(r')$ and target type $t(r)$ is denoted as $r' \circ r$. This is directly applicable to the classical metapath paradigm Sun et al. (2011), where all possible $t$-length metapaths are decomposed into separate relations in $\mathcal{R}_t$. Note that throughout this paper, the meta relation notation is overloaded for brevity. In fact, this architecture can handle multiple meta relation types with the same source type and target type, without loss of generalization.
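A sketch of this composition step follows. It chains two biadjacency matrices through the shared intermediate node type with a degree normalization in the spirit of the Adamic-Adar weighting; the exact normalization (total degree vs. log-degree of the intermediate nodes) and the string-keyed relation naming are assumptions for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix, diags

def adamic_adar_compose(A_first, A_second):
    """A_first: (|V_a| x |V_m|), A_second: (|V_m| x |V_b|) -> (|V_a| x |V_b|)."""
    deg = (np.asarray(A_first.sum(axis=0)).ravel()
           + np.asarray(A_second.sum(axis=1)).ravel())   # degree of intermediate nodes
    inv_deg = diags(1.0 / np.maximum(deg, 1.0))
    return (A_first @ inv_deg @ A_second).tocsr()

def compose_relations(rel_t, rel_1):
    """Build the (t+1)-order meta relations from the t-order and first-order sets."""
    out = {}
    for (a, r1, m1), A1 in rel_t.items():
        for (m2, r2, b), A2 in rel_1.items():
            if m1 == m2:                                  # only matching pairs are joined
                out[(a, f"{r1}.{r2}", b)] = adamic_adar_compose(A1, A2)
    return out
```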

Heterogeneous Higher-order Proximities

Learning the higher-order attention structure for $t$-order relations involves the composition between the $\mathcal{R}_{t-1}$ and $\mathcal{R}_1$ meta relation sets. Since the $t$-order proximity is a measure between a node's $(t-1)$-order context and another node in the network, we naturally must take into consideration $h^{(t-1)}_i$ as the prior-order context embedding. Similar to the first-order attention score, $e^{(t)}_r(i, j)$ is the $t$-order attention score between node $i$ and node $j$, defined as:

$$e^{(t)}_r(i, j) = a_r^\top \left[\, W^{src,(t)}_{s(r)} h^{(t-1)}_i \,\|\, W^{tgt,(t)}_{t(r)} h^{(t-1)}_j \,\right] \qquad (7)$$

The $t$-order attention scoring mechanism is parameterized by $W^{src,(t)}$ and $W^{tgt,(t)}$ for all node types, as well as $a_r$ for each relation type in $\mathcal{R}_t$. Then, in the same manner as in Eq. (2), the attention coefficients for the $t$-order neighbors in each relation are softmax normalized along with the temperature $\tau_r$.

Obtaining the relation-weighing coefficients in the $t$-order also involves the prior-order context embedding for each node. For a node $i$ of type $o$, we apply the relation-weighing mechanism using its prior-order embedding $h^{(t-1)}_i$ with:

$$\beta^{(t)}_i = \mathrm{softmax}\big( W^{(t)}_o \, h^{(t-1)}_i + b^{(t)}_o \big) \qquad (8)$$

where the transformation is parameterized by weights $W^{(t)}_o$ and bias $b^{(t)}_o$ for each node type $o$. In this way, LATTE can automatically identify important meta relations of any arbitrary $t$-length by learning an adaptive relation-weighing mechanism.

Aggregating Layer-wise Embeddings

While the first-order embedding represents the local neighborhood among the multiple relations, the $t$-order embedding expands the receptive field's vicinity by traversing higher-order meta paths. The $t$-order embedding of node $i$ is expressed as:

$$h^{(t)}_i = \sigma\Big( \beta^{(t)}_{i,o} \, W^{(t)}_o \, h^{(t-1)}_i \; + \sum_{r \in \mathcal{R}_t : \, s(r) = o} \beta^{(t)}_{i,r} \sum_{j \in \mathcal{N}_r(i)} \alpha^{(t)}_r(i, j) \, W^{tgt,(t)}_{t(r)} \, h^{(t-1)}_j \Big) \qquad (9)$$

Dataset  Relation (A-B)          # nodes (A)  # nodes (B)  # links  # features  Training  Testing
DBLP     Paper-Author (PA)       14328        4057         19645    334         20%       70%
         Paper-Conference (PC)   14328        20           14328
         Paper-Term (PT)         14328        4057         88420
ACM      Paper-Author (PA)       2464         5835         9744     1830        20%       70%
         Paper-Subject (PS)      3025         56           3025
IMDB     Movie-Actor (MA)        4780         5841         9744     1232        10%       80%
         Movie-Director (MD)     4780         2269         3025
Table 1: Statistics for the heterogeneous network datasets.

With this framework, the receptive field of the $t$-order relations is contained within each $(t-1)$-order context embedding. Furthermore, as $h^{(t)}_i$ encapsulates each relation in $\mathcal{R}_t$ separately, it is possible to identify the specific relation types that are involved in the composite representation.

Given the layer-wise representations of node $i$, we obtain the final embedding output by concatenating all the $t$-order context embeddings, as:

$$h_i = \big\Vert_{t=1}^{T} \, h^{(t)}_i \qquad (10)$$

where $h_i \in \mathbb{R}^{D}$, with $D$ as the unified embedding dimension size for all node types.

Preserving Proximities with Attention Scores

We repurpose the computed attention scores to explicitly estimate the heterogeneous pairwise proximities in the network. Incorporating this objective not only enables our model for unsupervised learning but also allows the node-level attention mechanism to reinforce highly connected node pairs by taking advantage of weighted links. To preserve the pairwise $t$-order proximities for all links in each relation, we apply the Noise Contrastive Estimation with negative sampling Mikolov et al. (2013) objective as:

$$\mathcal{L}_{prox} = -\sum_{r} \sum_{(i,j) \in A_r} \Big[ A_r[i,j] \, \log \sigma\big(e_r(i, j)\big) \; + \; \sum_{k=1}^{K} \mathbb{E}_{j' \sim P_r} \log \sigma\big(-e_r(i, j')\big) \Big] \qquad (11)$$

where $\sigma$ denotes the sigmoid function applied to the attention score to infer a probability value. The first term models the observed links, the second term models the negative links drawn from the noise distribution $P_r$, and $K$ is the number of sampled negative links. Typically, $K$ is chosen to be between 2 and 5 times the number of positive links.

Model Optimization

To learn from both the heterogeneous network's attributes and topology, we optimize the proximity-preserving objective and the downstream objective of the embedding outputs with the standard back-propagation algorithm. For semi-supervised node classification, a multi-layer perceptron follows the LATTE layers in order to predict labels given the node embedding. The cross-entropy minimization objectives are defined as:

$$\mathcal{L} = -\sum_{i \in V_{label}} y_i \log \hat{y}_i \; + \; \mathcal{L}_{prox} \qquad (12)$$

where $V_{label}$ is the set of nodes that have labels, $y_i$ is the true label, and $\hat{y}_i$ is the predicted label distribution. The first term aims to encode the node embedding representations with attention mechanisms, while the second term reinforces the attention scores by iterating through weighted positive and sampled negative links.
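A short sketch of the combined objective follows; the plain sum of the two terms (i.e., no extra weighting coefficient) is an assumption, since the exact balance is not shown in this text.

```python
import torch
import torch.nn.functional as F

def total_loss(logits, labels, labeled_mask, prox_term):
    """Cross-entropy on labeled nodes plus the proximity-preserving term."""
    ce = F.cross_entropy(logits[labeled_mask], labels[labeled_mask])
    return ce + prox_term
```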

Our model allows for computing embeddings for a subnetwork at each iteration; thus, it does not require computations involving the global network structure of all nodes at once. This approach not only enables mini-batch training on large networks that do not fit in memory but also makes our technique suited for inductive learning. To perform online training at each iteration, an input batch is generated by recursively sampling a fixed number of neighbor nodes Hamilton et al. (2017). Then, LATTE can yield embedding outputs for a sampled subnetwork given the local links and node attributes.
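The recursive neighbor sampling can be sketched as below (a simplified sketch in the spirit of GraphSAGE-style sampling; the fanouts, relation keys, and random.sample-based selection are illustrative choices, not the paper's code).

```python
import random

def sample_subnetwork(seed_nodes, adjacency, fanouts):
    """
    seed_nodes: dict node_type -> list of seed node ids for this mini-batch
    adjacency:  dict (src_type, rel, tgt_type) -> scipy.sparse.csr_matrix
    fanouts:    neighbors sampled per hop, e.g. [25, 20]
    """
    layers, frontier = [seed_nodes], seed_nodes
    for fanout in fanouts:
        nxt = {}
        for (src, _, tgt), adj in adjacency.items():
            for i in frontier.get(src, []):
                neigh = list(adj[i].indices)
                picked = random.sample(neigh, min(fanout, len(neigh)))
                nxt.setdefault(tgt, []).extend(picked)
        nxt = {t: sorted(set(ids)) for t, ids in nxt.items()}
        layers.append(nxt)
        frontier = nxt
    return layers   # node ids needed at each hop to embed the seed nodes
```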

Experiments

An effective network representation learning method can generalize to an unseen node by accurately encoding its links and attributes and then “aligning” them to the embedding space learned from seen (trained) nodes. In this section, we evaluate our method’s effectiveness on several node classification experiments, where the task is to predict node labels for a portion of the network hidden during training.

Dataset  Metric              metapath2vec  HIN2Vec  HAN           GTN           LATTE-1        LATTE-2        LATTE-2 (+prox)
DBLP     F1 (transductive)   0.7518        0.7431   0.9121        0.9203        0.8911±0.003   0.9240±0.003   0.9156±0.003
         F1 (inductive)      -             -        0.8666        0.8721        0.8620±0.004   0.8631±0.003   0.8822±0.032
         # params            2.3M          2.3M     240K          125K          78K            111K           111K
ACM      F1 (transductive)   0.8879        0.8466   0.8725        0.9085        0.9118±0.005   0.9134±0.005   0.9153±0.003
         F1 (inductive)      -             -        0.7909        0.8860        0.8988±0.003   0.9007±0.003   0.9156±0.003
         # params            387K          1.1M     1.5M          326K          250K           273K           273K
IMDB     F1 (transductive)   0.4310        0.4404   0.5394        0.5924        0.6066±0.018   0.6135±0.014   0.6363±0.007
         F1 (inductive)      -             -        0.3877        0.5810        0.6036±0.009   0.6117±0.038   0.6355±0.004
         # params            611K          1.6M     1.4M          243K          170K           196K           196K

± values denote the mean and standard deviation over 10 trials.

Table 2: Performance comparison of Macro F1 for various methods over transductive and inductive node classifications.

Datasets

We conduct performance comparison experiments over several benchmark heterogeneous network datasets. In Table 1, a summary of the network statistics is provided for each of the following datasets:

  1. DBLP (https://dblp.uni-trier.de): a heterogeneous network extracted from a bibliography dataset on major computer science journals and proceedings. The dataset has been preprocessed to contain 14328 papers, 4057 authors, 20 conferences, and 8789 terms. Three relation types are considered: paper-author, paper-conference, and paper-term. The author's attributes are a bag-of-words representation of publication keywords. The classification task is to predict the label for each author among four domain areas: database, data mining, machine learning, and information retrieval.

  2. ACM (https://dl.acm.org): A small citation network dataset containing paper-author and paper-subject relation types among 3025 papers, 5835 authors, and 56 subject nodes. Paper nodes are associated with a bag-of-words representation of keywords as features. The task is to label the conference each paper is published in, among the KDD, SIGMOD, SIGCOMM, MobiCOMM, and VLDB venues.

  3. IMDB Cantador et al. (2011): A movie database network containing movie-actor and movie-director relations among 4780 movies, 5841 actors, and 2269 directors. Each movie contains bag-of-words features of the plot, and the prediction task is to label the movie's genre among Action, Comedy, and Drama.

In each of the datasets, every directed relation has a reverse relation included. All self-loop links have been removed, unless required by a certain algorithm.

Experimental Setup

To provide a consistent and reproducible experimental setup, the preprocessed networks were obtained from the CogDL Toolkit Cen et al. (2020) benchmark datasets. Each of the datasets has been provided with a standard separation of train, validation, and test sets, as well as the full input features and label set. Since all models are evaluated on these datasets in their standard environment, results from different experiments can be directly compared.

Baselines

We verify the effectiveness of our framework by testing multiple variants of LATTE along with other existing approaches. For comparison with some of the state-of-the-art baselines, we consider various heterogeneous network embedding and GNN methods, including:

  • Metapath2Vec Dong et al. (2017): An unsupervised random walk method that utilizes the skip-gram along with negative sampling on meta paths to embed heterogeneous nodes. It has been shown to achieve prominent performance among random walk based approaches.

  • HIN2Vec Fu et al. (2017): a state-of-the-art deep neural network that learns embeddings by considering the meta paths in an attributed heterogeneous network. It utilizes random walk preprocessing, and it does not weigh different meta paths.

  • HAN Wang et al. (2019): A GNN that employs a GAT-based node-level attention mechanism for heterogeneous networks. It proposes a hierarchical attention procedure that weighs the importance of each meta path, however only among pre-defined, hand-crafted meta paths.

  • GTN Yun et al. (2019): A GNN with an attention mechanism that weighs and combines heterogeneous meta paths successively into higher-order structures, then performs graph convolution on the resulting adjacency matrix.

  • LATTE-1: A variant of the proposed LATTE model with one layer that only considers first-order meta relations. The pairwise proximity-preserving objective is excluded.

  • LATTE-2: A variant of LATTE with two layers that considers both first-order and second-order meta relations. The pairwise proximity-preserving objective is excluded.

  • LATTE-2 (+prox): Same as LATTE-2, but additionally optimizes the higher-order proximity-preserving objective.

Every method was evaluated on the identical split of training, validation, and testing sets for fairness and reproducibility. The final model is trained on the training set until the early stopping criterion on the validation set is met, then evaluated on the test set. Additionally, each method must exploit all relations and the available node attributes in the dataset, except for metapath2vec due to its limitations. If a particular node type in the heterogeneous network is not attributed, we instantiate a set of learnable embeddings to serve as node features.

Implementation Details

We set the following hyper-parameters identically for all methods: embedding dimension size at 128, learning rate at 0.001, mini-batch size at 2048, and early stopping if the validation loss doesn't decrease after ten epochs. For HAN and GTN, the number of GNN hidden layers is 2, preceding an MLP that predicts node labels given the embedding outputs in an end-to-end manner. For random walk based methods, a logistic classifier is employed to perform node classification given the learned node embeddings. The hyper-parameters for metapath2vec and HIN2Vec are walk length at 100, window size at 5, walks per node at 40, and the number of negative samples at 5. Among GNN-based methods, the batch sampling procedure that recursively samples a fixed number of neighbor nodes Hamilton et al. (2017) is utilized, with neighborhood sample sizes of 25 and 20. Where possible, the standard implementation of baseline methods provided by the CogDL Toolkit has been used.

For all LATTE variants, the best performing hyper-parameters selected ReLU as the embedding activation function, drop-out at 30% on the embedding outputs, and weight decay regularization (excluding biases) at 0.01. In LATTE-2 (+prox), the negative sampling ratio is set to the value selected by the hyper-parameter search described in the technical appendix. The models have been implemented with PyTorch Geometric (PyG), and the experiments have been conducted on a GeForce RTX 2080 Ti with 11 GB of GPU memory. The hyper-parameter tuning was conducted with Weights and Biases Biewald (2020), and the parameter ranges tested are reported in the technical appendix.

Node Classification Experiments and Results

We consider semi-supervised classification tasks in both inductive and transductive settings to perform a thorough evaluation of representation learning in heterogeneous networks. In the transductive setting, models can traverse the subgraph containing nodes in the test set during training. In contrast, the inductive setting requires that the models never encounter the test subgraph during the training phase and predict the testing nodes' labels on the novel subgraph at test time. We train and evaluate all baseline methods to predict test nodes in each transductive and inductive setting over ten trials.

To measure the classification performance of the prediction outputs, we record the precision and recall for each class label to compute the F1 score. Due to the apparent class imbalance in the three datasets, we report only the averaged Macro-F1 score, which was the more challenging metric in similar experiments Wang et al. (2019). The performance comparisons are reported in Table 2. For metapath2vec, HIN2Vec, HAN, and GTN, the benchmark Macro F1 scores in the transductive setting have been provided by the CogDL Toolkit, while the Macro F1 scores in the inductive setting are averaged over 10 experiment runs.

The top performance by the LATTE-2 variants indicates their effectiveness at learning node representations on the higher-order meta relation structures, especially with 80-90% of the network set aside for testing. Compared to HAN, which does not consider higher-order relations, GTN and LATTE-2 have a significant edge in inductive prediction because both can capture global properties. Compared to GTN, which does not maintain the semantic space of individual meta paths, LATTE-2 (+prox) outperforms with explicit proximity-preserving objectives for each of the decomposed higher-order meta relations.

Interpretation of the Attention Mechanism

LATTE’s fundamental properties are the construction of higher-order meta relations and the attention mechanism that weighs the importance of those relations. To demonstrate these features’ benefits, we interpret the importance levels chosen for each meta relations and verify whether they reflect the structural topology in the heterogeneous network. Given the learned weights for each node at a layer , we can assess not only the averaged meta relation weights for a node type, but also the individual meta relation weights for each node. In Fig. 1, we report the average and standard deviation of the meta relation attention weights for IMDB, as well as the correlation between those weights and the node degrees for each relation. The meta relation weights for DLBP and ACM are reported as supplementary material.

For IMDB movies, it can be observed that on average, the MA, MD, MDM, and MAM meta relations have the highest attention weights. This indicates that information from the movie-actor neighborhoods, movie-director neighborhoods, and the node's own features is relatively more represented in each movie's first-order embedding. This selection also persists in the second-order embeddings, where MDM and MAM have higher weights. Additionally, when looking at the correlation between MA's weights and the degree of MA links over all nodes, there is a positive correlation, which indicates that the attention mechanism can adaptively weigh the relation based on the number of connections present at the node. Interestingly, there is a substantial negative correlation between the M "self" relation weights and the node degree. This fact indicates that nodes with fewer or no links will choose a higher weight for their own features, since little information can be gained from other modalities. As individual nodes may have varying levels of participation among the various relations, this result demonstrates that LATTE can select the most effective meta relation for individual nodes depending on their local and global properties in the heterogeneous topology.

Figure 1: (a) Average and standard deviation of the 1st- and 2nd-order meta relation attention weights, where relations starting with M are aggregated to embed IMDB movie nodes. (b) Correlation between node degrees and relation weights for each meta relation in IMDB. A single-letter relation (e.g. M, M1) denotes the "self" choice.

Discussion and Conclusion

The task of aggregating heterogeneous relations remains a fundamental challenge in designing a representation learning method for heterogeneous networks. Multiple relations can represent different semantics, and their link distributions can be overlapping, interconnected, or non-complementary. Therefore, it is an appropriate first step to consider them as separate components of the network to unravel their structural dependencies. One of the key differences between existing GNN methods and the proposed LATTE is that the latter exploits the semantic information in each meta relation. Instead of conflating heterogeneous relations for all node types as in HAN and GTN, LATTE aggregates only the relevant relations for each node type. Furthermore, by considering the source type and target type of each meta relation, only relevant pairs of relations can be joined when generating higher-order meta paths. A significant benefit of this approach is that it relieves the computational burden of multiplying adjacency matrices for all nodes while allowing distinct representations for the different node types.

This work has proposed an architecture for heterogeneous network embedding, which can generate higher-order meta relations. The benefits of the proposed mechanism are not only to improve inductive node classification performance but also to improve the interpretability of deep GNN models. In the future, we will explore whether to incorporate a self-attention mechanism to learn the structural dependencies between relations by propagating information between the different relation-specific embeddings. Other interesting future developments are to enable LATTE to pre-train without supervision and to extend LATTE to link prediction tasks.


References

Technical Appendix

Figure 2: Conceptual illustration of the LATTE architecture demonstrating the layer-stacking operations that aggregate first-order and second-order meta relations. The heterogeneous network contains Paper-Author (PA), Paper-Conference (PC) and Paper-Term (PT) relations, in addition to their reverse relations (i.e. AP, CP, TP). The node feature inputs for each node type are $X_P$, $X_A$, $X_C$, and $X_T$, and the LATTE-2 embedding outputs for each respective node type are $h^{(2)}_P$, $h^{(2)}_A$, $h^{(2)}_C$, and $h^{(2)}_T$. The $t$-order meta relations are generated by combining relations from $\mathcal{R}_{t-1}$ and $\mathcal{R}_1$.

Hyper-parameter Tuning

The hyper-parameter tuning was conducted on the Weights and Biases platform (Biewald, L. 2020. Experiment Tracking with Weights and Biases. URL https://www.wandb.com/), where we utilize a random search approach that chooses random sets of parameter values. The parameters tested are the embedding dimension, the $t$-order, the attention score activation function, the number of neighbors sampled, the negative sampling ratio, the embedding output activation function, and the dropout probability.

Figure 3: Hyper-parameter tuning for Macro F1 performance on the ACM (inductive) dataset. The lighter colors indicate trial runs which have a higher Macro F1 score.

Interpretation of the Attention Mechanism for DBLP and ACM

Following the demonstration of interpreting the learned attention weights in the IMDB dataset, we report the same attention weights and the weight-degree correlation results for the DBLP and ACM datasets. In Fig. 4 and 5, it can be observed that the correlation between the meta relation weights and the node degree exhibits the same phenomenon described for IMDB.

Figure 4: (a) Average and standard deviation of the 1st- and 2nd-order meta relation attention weights for the DBLP dataset. (b) Correlation between node degrees and relation weights for each meta relation in DBLP.
Figure 5: (a) Average and standard deviation of the 1st- and 2nd-order meta relation attention weights for the ACM dataset. (b) Correlation between node degrees and relation weights for each meta relation in ACM.