
Graph Representation Learning Network via Adaptive Sampling

by   Anderson de Andrade, et al.

Graph Attention Networks (GAT) and GraphSAGE are neural network architectures that operate on graph-structured data and have been widely studied for link prediction and node classification. One challenge raised by GraphSAGE is how to smartly combine neighbour features based on graph structure. GAT handles this problem through attention, but it scales poorly over large, dense graphs. In this work, we propose a new architecture that addresses these issues, is more efficient, and can incorporate different edge type information. It generates node representations by attending to neighbours sampled from weighted multi-step transition probabilities. We conduct experiments in both transductive and inductive settings, achieving comparable or better results on several graph benchmarks, including the Cora, Citeseer, Pubmed, PPI, Twitter, and YouTube datasets.



1 Introduction

Graphs are a versatile and succinct way to describe entities through their relationships. The information contained in many knowledge graphs (KG) has been used in several machine learning tasks in natural language understanding Peters et al. (2019), computer vision Li et al. (2017), and recommendation systems Wang et al. (2019). The same information can also be used to expand the graph itself via node classification Grover and Leskovec (2016), clustering Perozzi et al. (2014), or link prediction tasks Kazemi and Poole (2018), in both transductive and inductive settings Kipf and Welling (2016).

Recent graph models have focused on learning dense representations that capture the properties of a node and its neighbours. One class of methods generates spectral representations for nodes Bruna et al. (2013); Defferrard et al. (2016). The rigidity of this approach may reduce the adaptability of a model to graphs with structural differences.

Model architectures that reduce the neighbourhood of a node have used pooling Hamilton et al. (2017), convolutions Duvenaud et al. (2015), recurrent neural networks (RNN) Li et al. (2015), and attention Veličković et al. (2017). These approaches often require many computationally-expensive message-passing iterations to generate representations for the neighbours of a target node. Sparse Graph Attention Networks (SGAT) Ye and Ji (2019) were proposed to address this inefficiency by producing an edge-sparsified graph. However, they may neglect the importance of local structure when graphs are large or have multiple edge types. Other methods have used objective functions to predict whether a node belongs to a neighbourhood Perozzi et al. (2014); Kazemi and Poole (2018), for example by using noise-contrastive estimation Gutmann and Hyvärinen (2012). However, incorporating additional training objectives into a downstream task can be difficult to optimize, leading to a multi-step training process.

In this work, we present a graph network architecture (GATAS) that can be easily integrated into any model and supports general graph types, such as cyclic, directed, and heterogeneous graphs. The method uses a self-attention mechanism over a multi-step neighbourhood sample, where the transition probability of a neighbour at a given step is parameterized.

We evaluate the proposed method on node classification tasks using the Cora, Citeseer, and Pubmed citation networks in a transductive setting, and on a protein-protein interaction (PPI) dataset in an inductive setting. We also evaluate the method on a link prediction task using Twitter and YouTube datasets. Results show that the proposed graph network can achieve better or comparable performance to state-of-the-art architectures.

2 Related Work

The proposed architecture is related to GraphSAGE Hamilton et al. (2017), which also reduces neighbour representations from fixed-size samples. Instead of aggregating uniformly-sampled 1-hop neighbours at each depth, we propose a single reduction of multi-step neighbours sampled from parameterized transition probabilities. Such parameterization is akin to the Graph Attention Model Abu-El-Haija et al. (2018), where trainable depth coefficients scale the transition probabilities from each step. Thus, the model can choose the depth of the neighbourhood samples. We further extend this approach to transition probabilities that account for paths with heterogeneous edge types.

We use an attention mechanism similar to the one in Graph Attention Networks (GAT) Veličković et al. (2017). While GAT iteratively reduces immediate neighbours, exploring the graph structure breadth-first and processing all nodes and edges at each step, our method explores the graph structure through multi-step neighbourhood samples. Our method also allows each neighbour to have a different representation for each target node, rather than a single representation as in GAT.

MoNet Monti et al. (2017) generalizes many graph convolutional networks (GCN) as attention mechanisms. More recently, the edge-enhanced graph neural network framework (EGNN) Gong and Cheng (2019) consolidates GCNs and GAT. In MoNet, the attention coefficient function only uses the structure of the nodes, whereas our model also employs node representations to generate attention weights.

Other approaches use recurrent neural networks Scarselli et al. (2008); Li et al. (2015) to reduce path information between neighbours and generate node representations. The propagation algorithm in Gated Graph Neural Networks (GG-NNs) Li et al. (2015) reduces neighbours one step at a time, in a breadth-first fashion. In contrast, we use a depth-first approach where a reduction operation is applied across edges of a path.

3 Model

3.1 Preliminaries

An initial graph is defined as $\mathcal{G} = (V, R, E)$, where $V$ is a set of nodes or vertices, $R$ is a set of edge types or relations, and $E \subseteq V \times R \times V$ is a set of triplets. Directed graphs are represented by having one edge type for each direction, so that every relation in $R$ has an inverse. To be able to incorporate information about the target node, the graph is augmented with self-loops using a new edge type $r_0$. Thus, a new set of edge types is created such that $R' = R \cup \{r_0\}$, and a new set of triplets $E' = E \cup \{(v_i, r_0, v_i) : v_i \in V\}$ conforms a new graph $\mathcal{G}' = (V, R', E')$.
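As a concrete illustration, the self-loop augmentation described above can be sketched as follows (the function and variable names are ours, and triplets are plain tuples rather than a sparse structure):

```python
def augment_with_self_loops(nodes, triplets, self_loop_type="r0"):
    """Add a new self-loop edge type and one (v, r0, v) triplet per
    node, producing the augmented edge-type set and triplet set."""
    augmented = list(triplets) + [(v, self_loop_type, v) for v in nodes]
    edge_types = {r for _, r, _ in triplets} | {self_loop_type}
    return edge_types, augmented

edge_types, triplets = augment_with_self_loops([0, 1], [(0, "cites", 1)])
```

This keeps the original triplets intact while guaranteeing every node can attend to itself through the dedicated edge type.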

An edge type path between nodes $v_i$ and $v_j$ is a sequence $p = (r_1, \dots, r_k)$ with $k \leq K$, where $K$ is the maximum number of steps considered. The number of all possible edge type paths is given by $C = \sum_{k=1}^{K} |R'|^k$. The set of all possible edge type paths is defined as $P = \{p_1, \dots, p_C\}$. The edge type sequences in $P$ are level-ordered, so that shorter paths precede longer ones and paths of equal length are ordered lexicographically by their relation indices. As an example, if $K = 2$ and $R' = \{r_0, r_1\}$, then $P$ corresponds to: $(r_0), (r_1), (r_0, r_0), (r_0, r_1), (r_1, r_0), (r_1, r_1)$. The subset of edge type paths connecting nodes $v_i$ and $v_j$ is defined as $P_{ij} \subseteq P$, excluding extraneous self-loops.
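The level-ordered enumeration of edge type paths can be sketched in a few lines (a minimal illustration; the function name is ours):

```python
from itertools import product

def edge_type_paths(edge_types, max_steps):
    """Enumerate every edge-type path of length 1..max_steps,
    level-ordered: shorter paths first, then lexicographic order
    within each length, as described above."""
    paths = []
    for length in range(1, max_steps + 1):
        paths.extend(product(edge_types, repeat=length))
    return paths

# With two edge types and up to 2 steps there are 2 + 4 = 6 paths.
paths = edge_type_paths(["r0", "r1"], 2)
```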

3.2 Neighbour Representations

Figure 1: Left: An illustration of the multi-step sampling technique with the relevant notation presented in this work. Blue nodes represent sampled nodes, and an edge type path is detailed. Center: The attention mechanism that generates a neighbour representation. Right: The attention mechanism at the target node, where neighbour representations are aggregated according to the attention weights.

Graph relations in $R'$ are represented by trainable vectors. To reflect the position of an edge in a path, these edge representations are infused with positional information. Following Vaswani et al. (2017), we assign a sinusoid of a different wavelength to each dimension, each position being a different point:

$\mathrm{PE}(pos, 2i) = \sin\!\left(pos / 10000^{2i/d}\right), \qquad \mathrm{PE}(pos, 2i+1) = \cos\!\left(pos / 10000^{2i/d}\right)$

where $pos$ is the position in an edge type path, $i$ is the index of the dimension, and $d$ is the embedding size.
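A possible implementation of these sinusoidal position encodings, following the Vaswani et al. (2017) formulation (the helper name is ours):

```python
import numpy as np

def positional_encoding(max_len, dim):
    """Sinusoidal positional encodings: even dimensions use sine, odd
    dimensions use cosine, with wavelengths forming a geometric
    progression from 2*pi up to 10000*2*pi."""
    pos = np.arange(max_len, dtype=np.float64)[:, None]
    i = np.arange(dim)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / dim)
    enc = np.zeros((max_len, dim))
    enc[:, 0::2] = np.sin(angles[:, 0::2])
    enc[:, 1::2] = np.cos(angles[:, 1::2])
    return enc

pe = positional_encoding(4, 8)  # one row per path position
```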

We represent nodes as a set of vectors $\{\mathbf{n}_i : v_i \in V\}$, where each vector combines a trainable embedding representation with the features of the node. The learnable embedding representations can capture information to support a neighbourhood sample.

Given an edge type path $p \in P_{ij}$, we generate a neighbour representation using attention over transformed neighbour representations for each edge type in $p$:


where two different learnable transformations are applied; one of them allows neighbour representations to differ according to the edge types in the path. For the self-loop edge type path, we set the neighbour representation to the target node's own representation.

3.3 Transition Tensors

We define transition probability distributions for neighbours and their possible edge type paths within $K$ steps. When there are multiple edge types connecting two nodes, their transition probabilities split. Thus, when computing transition probabilities for random walks starting at $v_i$, it is necessary to track the probability of each edge type path for each destination vertex $v_j$. Also, when performing random walks from a starting node, we break cycles by disallowing transitions to nodes already visited in previous steps. This reduces $P_{ij}$ to the set of shortest edge type paths possible.


Let $A$ be a sparse adjacency tensor for $\mathcal{G}'$, where $A_{rij} = 1$ if $(v_i, r, v_j) \in E'$ and $0$ otherwise. An initial transition tensor can be computed by normalizing $A$ so that, for each node, the outgoing probabilities sum to one.

Using the order in $P$ to obtain the probabilities for specific edge type paths, the unnormalized sparse transition tensor for $k$ steps can be computed as follows:


where one index specifies the edge type for the last step in path $p$, and another indexes the edge type path without that last step. As an example, if a path corresponds to the sequence of relation indices $(r_1, r_3, r_2)$, then its last step is $r_2$ and the path without the last step is $(r_1, r_3)$.

The conditional function in Equation 2 sets the transition probability to zero if the node can be reached from the start node in a previous step. This procedure effectively breaks cycles and allows only the most relevant and shortest edge type paths to be sampled. A normalized transition tensor is obtained by normalizing over all destination nodes and edge type paths.
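The cycle-breaking multi-step transition computation can be sketched as follows. This is a simplification under our own assumptions: a single edge type and a dense matrix instead of the paper's sparse per-path tensor, so only the walk probabilities and the "zero out already-reachable nodes" rule are illustrated:

```python
import numpy as np

def multi_step_transitions(adj, max_steps):
    """Per-step random-walk transition matrices with cycle breaking:
    a step-k probability is zeroed when the target node was already
    reachable at an earlier step, and walks never return to the
    start node. `adj` is a dense 0/1 adjacency matrix."""
    row_sums = adj.sum(axis=1, keepdims=True)
    t1 = np.divide(adj, row_sums, out=np.zeros_like(adj, dtype=float),
                   where=row_sums > 0)
    transitions = [t1]
    reachable = adj > 0
    for _ in range(1, max_steps):
        t_next = transitions[-1] @ t1
        t_next[reachable] = 0.0          # keep only shortest paths
        np.fill_diagonal(t_next, 0.0)    # no returning to the start node
        transitions.append(t_next)
        reachable |= t_next > 0
    return transitions
```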

3.4 Neighbourhood Sampling

When considering a neighbourhood, it can be relevant to attend to nodes beyond the first-degree neighbourhood. However, as the number of hops between nodes increases, their relationship weakens. It can also be prohibitive to attend to all nodes within $K$ hops, as the neighbourhood size grows proportionally with $K$.

To overcome these complications, we create a fixed-size neighbourhood sample from an adjustable transition tensor. Similar to the work in Abu-El-Haija et al. (2018), we obtain neighbour probabilities as a linear combination of the random walk transition tensors of each step $k$, with learnable coefficients:

where the coefficients are obtained as a softmax over a vector of unbounded parameters, and the step-zero term corresponds to a transition tensor for the added self-loops:

Depending on the task and graph, the model can control the scope of the neighbourhood by adjusting these coefficients through backpropagation.

To generate a neighbourhood for a node, we sample without replacement from its neighbour distribution, up to a maximum neighbourhood sample size.

3.5 Node Representations

Given a neighbourhood for a node and a transition tensor, we apply an attention mechanism whose coefficients are given by a learnable transformation. The logits it produces are scaled by the transition probabilities, weighting each neighbour by its importance and allowing the step coefficients to be trained. We concatenate multi-head attention layers to create a new node representation:

where another learnable transformation is applied, followed by a non-linear activation function, such as the ELU Clevert et al. (2015) function. This transformation allows relevant information for the node to be selected.
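A single attention head of this kind can be sketched as follows. This is an illustrative simplification with our own names: the learnable transformation is reduced to a dot product with a weight vector, and the transition probabilities scale the logits before the softmax, as described above:

```python
import numpy as np

def attend(neighbour_reprs, trans_probs, attn_weights):
    """One attention head: score each neighbour, scale the logits by
    its transition probability, softmax, and aggregate the neighbour
    representations into a single vector."""
    logits = neighbour_reprs @ attn_weights        # (n_neigh,)
    logits = logits * trans_probs                  # scale by transition prob
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                       # attention coefficients
    return weights @ neighbour_reprs               # weighted sum, (dim,)
```

Because the transition probabilities enter the logits, gradients flow back through them to the step coefficients of the sampling distribution.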

3.6 Algorithmic Complexity

The complexity of generating node representations with the proposed algorithm (GATAS) scales with the batch size and the neighbourhood sample size. GraphSAGE Hamilton et al. (2017) has a similar complexity. GAT Veličković et al. (2017), on the other hand, has a complexity that is independent of the batch size but processes all nodes and edges, scaled by the number of layers that controls depth. For downstream tasks where only a small subset of nodes is actually used, this overhead of GAT can be overwhelming. Generalizing, our model is more efficient whenever the sampled computation is smaller than the full-graph computation.

4 Evaluation

We evaluate the performance of GATAS using node classification tasks in transductive and inductive settings. To evaluate the performance of the proposed attention mechanism over heterogeneous multi-step neighbours, we rely on a multi-class link prediction task.

For the transductive learning experiments we compare against GAT Veličković et al. (2017) and some of the approaches specified in Kipf and Welling (2016), including a CNN approach that uses Chebyshev approximations of the graph eigendecomposition Defferrard et al. (2016), the Graph Convolutional Network (GCN) Kipf and Welling (2016), MoNet Monti et al. (2017), and the Sparse Graph Attention Network (SGAT) Ye and Ji (2019). We also benchmark against a multi-layer perceptron (MLP) that classifies nodes using only their features, without any graph structure.

For the inductive experiments we compare once again against GAT Veličković et al. (2017) and SGAT Ye and Ji (2019). We also compare against GraphSAGE Hamilton et al. (2017), a method that aggregates node representations from fixed-size neighbourhood samples using different reductions, such as LSTMs and max-pooling.

GATAS is capable of utilizing edge information, which we consider to be an important advantage. Hence we also conduct link prediction experiments on multiplex heterogeneous network datasets against some of the state-of-the-art models, namely GATNE Cen et al. (2019), MNE Zhang et al. (2018), and MVE Qu et al. (2017). GATNE creates multiple representations for a node under different edge type graphs, aggregates these individual views using reduction operations similar to GraphSAGE, and combines the resulting node representations using attention.

4.1 Datasets

For the transductive node classification tasks we use three standard citation network datasets: Cora, Citeseer, and Pubmed Sen et al. (2008). In these datasets, each node corresponds to a publication and undirected edges represent citations. Training sets contain 20 nodes per class. The validation and test sets have 500 and 1000 unseen nodes respectively.

For the inductive node classification experiments, we use the protein interaction dataset (PPI) in Hamilton et al. (2017). The dataset has multiple graphs, where each node is a protein, and undirected edges represent an interaction between them. Each graph corresponds to a different type of interaction between proteins. 20 graphs are used for training, 2 for validation and another 2 for testing.

For the link prediction task, we use the heterogeneous Higgs Twitter dataset De Domenico et al. (2013). It is made up of four directional relationships between more than 450,000 Twitter users. We also use a multiplex bidirectional network dataset that consists of five types of interactions between 15,088 YouTube users Tang and Liu (2009); Tang et al. (2009). Using the dataset splits provided by the authors of GATNE, we work with subsets of 10,000 and 2,000 nodes for Twitter and YouTube respectively, reserving 5% and 10% of the edges for validation and testing. Each split is augmented with the same number of non-existing edges, which are used as negative samples.

Detailed statistics for these datasets are summarized in Table 4 in Supplementary Materials.

4.2 Experiment Setup

Node features are normalized using layer normalization Ba et al. (2016). These features are then passed through a single dense layer to obtain the input representations. The Twitter dataset does not provide node features, so only the learnable node embeddings are used; conversely, the inductive node classification task does not use learnable node embeddings. We define one of the learnable transformations as linear, another as a two-layer neural network with a non-linear hidden layer and a linear output layer, and the remaining two as one-layer non-linear neural networks. Non-linear layers use ELU Clevert et al. (2015) activation functions.

For all models we use the layer sizes listed in Table 5 in Supplementary Materials. We experimented with learnable node embedding and edge type embedding sizes of 10 and 50 for the transductive node classification and link prediction tasks respectively. We use an edge type embedding size of 5 for the inductive task. The transductive tasks use 8 attention heads, while the other tasks use 10.

For the transductive node classification tasks, the output layer is directly connected to the concatenated attention heads. For the inductive task, the concatenated attention heads are passed through 2 non-linear layers before going through the output layer. In the link prediction task, the concatenated attention heads are passed through a non-linear layer and a pair of corresponding node representations are concatenated before they pass through 2 non-linear layers and an output layer. All these hidden layers have a size of 256.

The optimization objective is the multi-class or multi-label cross-entropy, depending on the task. It is minimized with the Nadam SGD optimizer Dozat (2016). The validation set is used for early stopping and hyper-parameter tuning. Since the training sets for the transductive node classification tasks are very small, it is crucial to add noise to the model inputs to prevent overfitting. We mask out input features with 0.9 probability, and apply Dropout Srivastava et al. (2014) with 0.5 probability to the attention coefficients and the resulting representations. We also add $L_2$ regularization with a coefficient of 0.05.
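The input-noise scheme above can be sketched as follows (a minimal illustration with names of our choosing; dropout on attention coefficients would be applied analogously inside the attention layers):

```python
import numpy as np

def mask_features(features, keep_prob, rng):
    """Randomly zero out feature entries, keeping each with
    probability `keep_prob`. Masking features with 0.9 probability,
    as in the paper, corresponds to keep_prob = 0.1."""
    mask = rng.random(features.shape) < keep_prob
    return features * mask
```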

In the node classification and link prediction tasks, neighbourhood candidates can be at most 3 and 2 steps away from the target node respectively. The unnormalized transition coefficients are initialized with a non-linear decay over steps. To accommodate the inductive setting, edges across graphs in the protein interaction dataset are treated as the same type and share edge type representations. In the link prediction task we reuse the node representations at test time and rely on the neighbours given by the edges in the training set.

The experiment parameters are summarized in Table 5 in Supplementary Materials. In the transductive node classification experiments, the architecture hyper-parameters were optimized on the Cora dataset and reused for Citeseer and Pubmed. A single experiment can be run on a V100 GPU in under 12 hours. Implementation code is available on GitHub.

Transductive (Accuracy %) Inductive (Micro-F1)
Model Cora Citeseer Pubmed PPI
Chebyshev Defferrard et al. (2016)
GCN Kipf and Welling (2016)
MoNet Monti et al. (2017)
GraphSAGE Hamilton et al. (2017) 0.768
GAT Veličković et al. (2017)
SGAT Ye and Ji (2019) 84.2% 68.2% 77.6% 0.966

We selected the best results reported by SGAT.

Table 1: Node Classification Results
Model MVE Qu et al. (2017) | MNE Zhang et al. (2018) | GATNE-T Cen et al. (2019) | GATAS (each cell: ROC-AUC / F1)
Twitter 72.62 / 67.40 | 91.37 / 84.32 | 92.30 / 84.96 | 95.44 / 87.13
YouTube 70.39 / 65.10 | 82.30 / 75.03 | 84.61 / 76.83 | 96.63 / 83.59

Results reported for GATNE, MNE and MVE are from the original GATNE paper Cen et al. (2019).

Table 2: Link Prediction Results

4.3 Results

Table 1 summarizes our results on the node classification tasks. For the transductive tasks, we report the mean classification accuracy and standard deviation over 100 runs. For the inductive task, we report the mean micro-F1 score and standard deviation over 10 runs. We compare against the metrics already reported in Veličković et al. (2017); Kipf and Welling (2016), and use the same dataset splits. For GraphSAGE, we report the better results obtained in Veličković et al. (2017).

Using the settings described in the previous section, we provide variations of our method with different neighbourhood sample sizes: 10, 100, and 500. The model achieves comparable performance throughout, and establishes a new state-of-the-art on the PPI dataset in the inductive setting, by a 1.2% margin.

Performance increases with the neighbourhood sample size, as it expands the graph structure covered. However, we do not see substantial improvements for a sample size of 500. Given the average number of neighbours per node for each dataset, as shown in Table 4, an increased neighbourhood sample size of 500 might not add additional neighbours in the transductive experiments. For the PPI dataset, however, 500 is still significantly below the estimated average neighbourhood size within the considered number of steps.

We note that in the PPI dataset, a small neighbourhood sample size of 10 impacts performance considerably more than in the transductive setting. This could be because there is no support from the learnable node representations: as the amount of information provided by neighbours decreases, the model might become more dependent on these parameters. The proposed neighbourhood sampling technique trades a small amount of accuracy for efficiency. As a result, the model can easily be used with large datasets and downstream tasks.

Table 2 summarizes our results on the link prediction task. We report the macro area under the ROC curve and the macro F1 score for a single run. When comparing against GATNE, we use the transductive version of the model since we do not precompute raw features for the nodes and rely on the learned representations during test time.

The results suggest the produced node representations are able to capture path attributes as part of the neighbourhood information. GATAS outperforms GATNE-T with lifts of 3.14% and 12.02% in ROC-AUC, and of 2.17% and 6.76% in F1 score, on the Twitter and YouTube datasets respectively. To the best of our knowledge, these results set a new state-of-the-art.

4.4 Ablation Study

The proposed architecture has three independent components that have not been considered in previous work: (1) the neighbour sampling technique using transition probabilities with learnable step coefficients that affect the attention weights; (2) the learnable node representations that augment node features with neighbourhood information; and (3) the attention network that allows neighbour representations to adapt to the target node given itself and the path information.

In this section we measure the impact of each component on the Cora and Pubmed datasets for the transductive setting and the PPI dataset for the inductive setting. We consider five model variations that test the importance of each component. None of the inductive variations use learnable node representations, because of the nature of that setting. We would also like to test the importance of edge type information, but the link prediction datasets do not provide node features that would allow us to run all variations. We define the following variations:

  • Base only samples immediate neighbours with uniform probability and the transition probabilities are not part of the attention weights. The model does not adapt neighbour representations and nodes do not include learnable representations. Neighbour representations are reduced using the attention mechanism described in Section 3.5.

  • GATAS w/o trans is the proposed solution, but all nodes within the maximum number of steps can be sampled with uniform probability and the transition probabilities are not part of the attention weights.

  • GATAS w/o embed is the proposed solution, but only node features are used, without learnable node embeddings. This variation is not available for the inductive task because of its nature.

  • GATAS w/o paths corresponds to the proposed solution but neighbour representations are not transformed according to the target node and path information.

  • GATAS is the proposed solution.

We ran experiments using the same settings described in Section 4.2. In all the transductive variations we use a small neighbourhood sample size, since a larger value might attenuate the impact of using learnable representations, interfering with the results of the GATAS w/o embed variation. For the variations in the inductive task we use a larger sample size.

Table 3 shows our results. The largest jump in performance corresponds to the use of the neighbourhood sampling technique and the incorporation of the transition probabilities. The use of learnable embeddings as part of node representations does not seem to have a large impact on performance, but this could be a consequence of a large neighbourhood sample size, which might reduce the need for these parameters. Finally, the use of adaptable neighbour representations does not seem to affect performance in these tasks. We hypothesize that the nature of the tasks might not require such neighbour transformations, but note that edge direction and different edge types are not present in these datasets.

Transductive (Accuracy %) Inductive (Micro-F1)
Model Cora Pubmed PPI
GATAS w/o trans
GATAS w/o embed
GATAS w/o paths
Table 3: Ablation Study Results

5 Conclusion

In this paper, we proposed a new neural network architecture for graphs. The algorithm represents nodes by reducing their neighbour representations with attention. Multi-step neighbour representations incorporate different path properties. Neighbours are sampled using learnable depth coefficients.

Our model achieves comparable results across different tasks and various baselines, on the benchmark datasets: Cora, Citeseer, Pubmed, PPI, Twitter and YouTube. We successfully retained performance while increasing efficiency on large graphs, achieving state-of-the-art performance on multiple datasets from different tasks. We conducted an ablation study in transductive and inductive settings. The experiments show that sampling neighbourhoods according to weighted transition probabilities achieves the largest performance gain, especially in the inductive setting.


  • S. Abu-El-Haija, B. Perozzi, R. Al-Rfou, and A. A. Alemi (2018) Watch your step: learning node embeddings via graph attention. In Advances in Neural Information Processing Systems, pp. 9180–9190. Cited by: §2, §3.4.
  • J. L. Ba, J. R. Kiros, and G. E. Hinton (2016) Layer normalization. arXiv preprint arXiv:1607.06450. Cited by: §4.2.
  • J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun (2013) Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203. Cited by: §1.
  • Y. Cen, X. Zou, J. Zhang, H. Yang, J. Zhou, and J. Tang (2019) Representation learning for attributed multiplex heterogeneous network. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1358–1368. Cited by: Table 2, §4.
  • D. Clevert, T. Unterthiner, and S. Hochreiter (2015) Fast and accurate deep network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289. Cited by: §3.5, §4.2.
  • M. De Domenico, A. Lima, P. Mougel, and M. Musolesi (2013) The anatomy of a scientific rumor. Scientific reports 3, pp. 2980. Cited by: §4.1.
  • M. Defferrard, X. Bresson, and P. Vandergheynst (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in neural information processing systems, pp. 3844–3852. Cited by: §1, Table 1, §4.
  • T. Dozat (2016) Incorporating nesterov momentum into adam. Cited by: §4.2.
  • D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams (2015) Convolutional networks on graphs for learning molecular fingerprints. In Advances in neural information processing systems, pp. 2224–2232. Cited by: §1.
  • L. Gong and Q. Cheng (2019) Exploiting edge features for graph neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9211–9219. Cited by: §2.
  • A. Grover and J. Leskovec (2016) Node2vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 855–864. Cited by: §1.
  • M. U. Gutmann and A. Hyvärinen (2012) Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. Journal of Machine Learning Research 13 (Feb), pp. 307–361. Cited by: §1.
  • W. Hamilton, Z. Ying, and J. Leskovec (2017) Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1024–1034. Cited by: §1, §2, §3.6, §4.1, Table 1, §4.
  • S. M. Kazemi and D. Poole (2018) Simple embedding for link prediction in knowledge graphs. In Advances in Neural Information Processing Systems, pp. 4284–4295. Cited by: §1, §1.
  • T. N. Kipf and M. Welling (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §1, §4.3, Table 1, §4.
  • R. Li, M. Tapaswi, R. Liao, J. Jia, R. Urtasun, and S. Fidler (2017) Situation recognition with graph neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4173–4182. Cited by: §1.
  • Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel (2015) Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493. Cited by: §1, §2.
  • F. Monti, D. Boscaini, J. Masci, E. Rodola, J. Svoboda, and M. M. Bronstein (2017) Geometric deep learning on graphs and manifolds using mixture model cnns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5115–5124. Cited by: §2, Table 1, §4.
  • B. Perozzi, R. Al-Rfou, and S. Skiena (2014) Deepwalk: online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 701–710. Cited by: §1, §1.
  • M. E. Peters, M. Neumann, R. L. Logan, R. Schwartz, V. Joshi, S. Singh, and N. A. Smith (2019) Knowledge enhanced contextual word representations. In EMNLP, Cited by: §1.
  • M. Qu, J. Tang, J. Shang, X. Ren, M. Zhang, and J. Han (2017) An attention-based collaboration framework for multi-view network representation learning. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pp. 1767–1776. Cited by: Table 2, §4.
  • F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini (2008) The graph neural network model. IEEE Transactions on Neural Networks 20 (1), pp. 61–80. Cited by: §2.
  • P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, and T. Eliassi-Rad (2008) Collective classification in network data. AI magazine 29 (3), pp. 93–93. Cited by: §4.1.
  • N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15 (1), pp. 1929–1958. Cited by: §4.2.
  • L. Tang and H. Liu (2009) Uncovering cross-dimension group structures in multi-dimensional networks. In SDM workshop on Analysis of Dynamic Networks, pp. 568–575. Cited by: §4.1.
  • L. Tang, X. Wang, and H. Liu (2009) Uncoverning groups via heterogeneous interaction analysis. In 2009 Ninth IEEE International Conference on Data Mining, pp. 503–512. Cited by: §4.1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §3.2.
  • P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio (2017) Graph attention networks. arXiv preprint arXiv:1710.10903. Cited by: §1, §2, §3.6, §4.3, Table 1, §4, §4.
  • X. Wang, D. Wang, C. Xu, X. He, Y. Cao, and T. Chua (2019) Explainable reasoning over knowledge graphs for recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 5329–5336. Cited by: §1.
  • Y. Ye and S. Ji (2019) Sparse graph attention networks. arXiv preprint arXiv:1912.00552. Cited by: §1, Table 1, §4, §4.
  • H. Zhang, L. Qiu, L. Yi, and Y. Song (2018) Scalable multiplex network embedding.. In IJCAI, Vol. 18, pp. 3082–3088. Cited by: Table 2, §4.

Appendix A Supplementary Materials

a.1 Dataset Statistics

Cora Citeseer PubMed PPI Twitter YouTube
Node Classes 7 6 3 121 1 1
Edge Types 1 1 1 1 4 5
Node Features 1,433 3,703 500 50 0 0
Nodes 2,708 3,327 19,717 56,944 456,626 15,088
Edges 5,429 4,732 44,338 818,716 15,367,315 13,628,895
Training Nodes 140 120 60 44,906 9,990 2,000
Training Edges 1,246,382 282,115 1,114,025
Validation Nodes 500 500 500 6,514 9,891 2,000
Validation Edges 201,647 16,463 65,512
Testing Nodes 1,000 1,000 1,000 5,524 9,985 2,000
Testing Edges 164,319 32,919 131,007
Neighbours per Node 3.9 2.7 4.4 28.3 28.2 557.0

Shared nodes across dataset types with access to training set edges only.
Access to all edges.

Table 4: Dataset Statistics

a.2 Experiment Settings

Node Classification Link Prediction
Parameter Transductive Inductive Transductive
Maximum number of steps 3 3 2
Neighbourhood sample size 10/100/500 10/100/500 100
Layer size 50 50
Node embedding size 10 50
Edge type embedding size 10 5 50
Number of attention heads 8 10 10
Input noise rate 0.9 0 0
Dropout probability 0.5 0 0
Regularization coefficient 0.05 0 0
Learning rate 0.001 0.001 0.001

Maximum number of epochs 1000 1000 1000
Early stopping patience 100 10 5
Batch size () 5000 100 200
Table 5: Experiment Settings